DEEP NEURAL NETWORKS FOR ESTIMATION AND INFERENCE
MAX H. FARRELL
Booth School of Business, University of Chicago
TENGYUAN LIANG
Booth School of Business, University of Chicago
SANJOG MISRA
Booth School of Business, University of Chicago
We study deep neural networks and their use in semiparametric inference. We es-
tablish novel nonasymptotic high probability bounds for deep feedforward neural nets.
These deliver rates of convergence that are sufficiently fast (in some cases minimax op-
timal) to allow us to establish valid second-step inference after first-step estimation with
deep learning, a result also new to the literature. Our nonasymptotic high probability
bounds, and the subsequent semiparametric inference, treat the current standard ar-
chitecture: fully connected feedforward neural networks (multilayer perceptrons), with
the now-common rectified linear unit activation function, unbounded weights, and a
depth explicitly diverging with the sample size. We discuss other architectures as well,
including fixed-width, very deep networks. We establish the nonasymptotic bounds for
these deep nets for a general class of nonparametric regression-type loss functions,
which includes as special cases least squares, logistic regression, and other generalized
linear models. We then apply our theory to develop semiparametric inference, focus-
ing on causal parameters for concreteness, and demonstrate the effectiveness of deep
learning with an empirical application to direct mail marketing.
1. INTRODUCTION
STATISTICAL MACHINE LEARNING METHODS are being rapidly integrated into the social
and medical sciences. Economics is no exception, and there has been a recent surge of
research that applies and explores machine learning methods in the context of econo-
metric modeling, particularly in “big data” settings. Furthermore, theoretical properties
of these methods are the subject of intense recent study. Our goal in the present work
is to study a particular statistical machine learning technique which is widely popular in
industrial applications, but less frequently used in academic work and largely ignored in
recent theoretical developments on inference: deep neural networks.
Neural networks are estimation methods that model the relationship between inputs
and outputs using layers of connected computational units (neurons), patterned after the
biological neural networks of brains. These computational units sit between the inputs
and output and allow data-driven learning of the appropriate model, in addition to learn-
ing the parameters of that model. Put into terms more familiar in nonparametric econo-
metrics: neural networks can be thought of as a (complex) type of linear sieve estimation
where the basis functions themselves are flexibly learned from the data by optimizing over
many combinations of simple functions. Neural networks are perhaps not as familiar to
economists as other methods, and indeed, were out of favor in the machine learning com-
munity for several years, returning to prominence only very recently in the form of deep
learning. Deep neural nets contain many hidden layers of neurons between the input and
output layers, and have been found to exhibit superior performance across a variety of
contexts. Our work aims to bring wider attention to these methods and to fill some gaps
in the theoretical understanding of inference using deep neural networks.
Before the recent surge in attention, neural networks had taken a back seat to other
methods (such as kernel methods or forests) largely because of their modest empirical
performance and challenging optimization. Before falling out of favor, neural networks
were widely studied and applied, particularly in the 1990s. In that time, shallow neural
networks with smooth activation functions were shown to have many good theoretical
properties (White (1992), Anthony and Bartlett (1999), Chen and White (1999)). How-
ever, the availability of scalable computing and stochastic optimization techniques (Le-
Cun, Bottou, Bengio, and Haffner (1998), Kingma and Ba (2014)) and the changes from
shallow to deep networks and from smooth sigmoid-type activation functions to rectified
linear units (ReLU), x ↦ max(x, 0) (Nair and Hinton (2010)), have seemingly overcome
optimization hurdles and empirical issues, and now this form of deep learning matches or
sets the state of the art in many prediction contexts. Our theoretical results speak directly
to this modern implementation of deep learning: we focus on the ReLU activation func-
tion, explicitly model the depth of the network as diverging with the sample size, and do
not require bounded weights.
We provide nonasymptotic high probability bounds for nonparametric estimation using
deep neural networks for a large class of statistical models. Our bounds appear to be
new to the literature and are the main theoretical contributions of the paper. We provide
results for a general class of smooth loss functions for nonparametric regression style
problems, covering as special cases generalized linear models and other empirically useful
contexts. For example, in our application to causal inference we specialize our results
to linear and logistic regression as concrete illustrations. Our bounds immediately yield
empirical and population L2 convergence rates. For a certain architecture, we obtain the
optimal rate. Our proof strategy employs a localization analysis that uses scale-insensitive
measures of complexity, allowing us to consider richer classes of neural networks. This
is in contrast to analyses which restrict the networks to have bounded parameters for
each unit (discussed more below) and to the application of scale sensitive measures such
as metric entropy. These approaches would not deliver our sharp bounds and fast rates
when treating standard, feasible neural networks. Recent developments in approximation
theory and complexity for deep ReLU networks are important building blocks for our
results.
We follow our main results by applying our nonasymptotic high probability bounds to
deliver valid inference on finite-dimensional parameters following first-step estimation
using deep learning. Our aim is not to innovate at the semiparametric step but to utilize
existing results. Our work contributes directly to this area of research by showing that
deep nets are a valid and useful first-step estimator for semiparametric inference in gen-
eral. Further, we show that inference after deep learning may not require sample splitting
or cross-fitting. In particular, we use localization to directly verify conditions required for
valid inference, which may be a novel application of this proof method that is useful in
future problems of inference following machine learning.
We illustrate these ideas in the context of causal inference for concreteness and wide
applicability, as well as to allow direct comparison to the literature. Program evaluation
with observational data is one of the most common and important inference problems,
and has often been used as a test case for theoretical study of inference following machine
learning (e.g., Belloni, Chernozhukov, and Hansen (2014), Farrell (2015), Belloni, Cher-
nozhukov, Fernández-Val, and Hansen (2017), Athey, Imbens, and Wager (2018)). Deep
neural networks have been argued (experimentally) to outperform the previous state-of-
the-art (Westreich, Lessler, and Funk (2010), Shalit, Johansson, and Sontag (2017), Hart-
ford et al. (2017)). We establish valid inference for treatment effects and counterfactual
expected utility/profits from treatment targeting strategies. We note that the selection on
observables framework yields identification of counterfactual average outcomes without
additional structural assumptions, so that, for example, expected profit from a counter-
factual treatment rule can be evaluated.
We numerically illustrate our results, and more generally the utility of deep learning,
with an empirical study of a direct mail marketing campaign. Our data come from a large
US consumer products retailer and consist of around three hundred thousand consumers
with one hundred fifty covariates. Hitsch and Misra (2018) recently used this data to study
various estimators, both traditional and modern, of heterogeneous treatment effects. We
study the effect of catalog mailings on consumer purchases, and moreover, compare dif-
ferent targeting strategies (i.e., to which consumers catalogs should be mailed). The cost
of sending out a single catalog can be close to one dollar, and with millions being sent out,
carefully assessing the targeting strategy is crucial. Our results suggest that deep nets are
at least as good as (and sometimes better than) the best methods found by Hitsch and
Misra (2018).
Our paper contributes to several rapidly growing literatures, and we cannot hope to do
justice to each here. We give only those citations of particular relevance; more references
can be found within these works. First, there has been much recent study of the statistical
properties of machine learning tools as an end in itself. Many studies have focused on the
lasso and its variants (Bickel, Ritov, and Tsybakov (2009), Belloni, Chernozhukov, and
Hansen (2014), Farrell (2015)) and tree/forest based methods (Wager and Athey (2018)).
Relatively less work has been done for deep neural networks. An important exception
is the recent work of Schmidt-Hieber (2019), who showed that a particular deep ReLU
network with uniformly bounded weights attains the optimal rate in expected risk for
squared loss. Further, Schmidt-Hieber (2019) formally shows that deep neural networks
can strictly improve on classical methods: if the unknown target function is itself a com-
position of simpler functions, then the composition-based deep net estimator is provably
superior to estimators that do not use compositions. This is a possible first step in theo-
retically understanding why deep learning is so successful empirically. Our work differs
substantially from Schmidt-Hieber (2019). First, our goal is not to demonstrate adap-
tation, and we do not study this property of deep nets, but focus on the common non-
parametric case. Second, our results and assumptions are quite different in that: (i) we
prove nonasymptotic high probability bounds instead of bounds on the expected risk, (ii)
we cover general, nonlinear regression problems, (iii) in linear models we allow for non-
Gaussian, heteroskedastic errors, relying only on boundedness, and (iv) we allow for un-
bounded weight parameters, which is crucial for feasible implementation and for approx-
imation results. Finally, our method of proof is entirely different from Schmidt-Hieber
(2019), and it is this proof which enables us to deliver (i)–(iv). Specializing our results
to the linear model treated by Schmidt-Hieber (2019), and looking at smooth functions,
our high probability bounds imply expected risk bounds similar to those obtained in that
paper, but under somewhat different regularity conditions: Schmidt-Hieber (2019) re-
quires bounded weights and errors that are Gaussian, independent of the covariates, and
homoskedastic with known variance. These differences between our work and Schmidt-Hieber
(2019) are elaborated on further below, after stating our main results. Bach (2017) and
Bauer and Kohler (2019) also make important contributions on adaptation properties
of deep nets on functions with certain low dimensional structure. Yarotsky (2017, 2018)
and Bartlett, Harvey, Liaw, and Mehrabian (2017) are important building blocks for our
results.
A second strand of literature focuses on inference following machine learning. Ini-
tial theoretical results were concerned with obtaining valid inference on a coefficient in a
high-dimensional regression, following model selection or regularization, with particular
focus on the lasso (Belloni, Chernozhukov, and Hansen (2014), Javanmard and Monta-
nari (2014), van de Geer, Buhlmann, Ritov, and Dezeure (2014)). Intuitively, this is a
semiparametric problem, where the coefficient of interest is estimable at the parametric
rate and the remaining coefficients are collectively a nonparametric nuisance parameter
estimated using machine learning methods. Building on this intuition, many have studied
the semiparametric stage directly, such as obtaining novel, weaker conditions easing the
application of machine learning methods (Belloni, Chernozhukov, and Hansen (2014),
Farrell (2015), Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins
(2018), and references therein). We build on this work, employing conditions therein, and
in particular, verifying them for deep ReLU nets.
The next section introduces deep ReLU networks and states our main theoretical re-
sults: nonasymptotic bounds for nonparametric regression-type losses. Semiparametric
inference is discussed in Section 3. The empirical study is in Section 4. Section 5 con-
cludes and proofs are given in the Appendix. We will use the following norms: for a random vector X ∈ R^d, with generic realization x and sample realization x_i, and a function g(x), ‖g‖_∞ := sup_x |g(x)|, ‖g‖_{L2(X)} := E[g(X)²]^{1/2}, and ‖g‖_n := E_n[g(x_i)²]^{1/2}, where E_n[·] denotes the sample average.
2. DEEP NEURAL NETWORKS

We focus on the architectures most commonly used in applied work because we want our results to inform empirical practice. However, our results are more general,
accommodating other architectures provided they are able to yield a universal approxi-
mation (in the appropriate function class), and so we review neural nets more generally
and give concrete examples.
Our goal is to estimate an unknown function f∗(x) that relates the covariates X ∈ R^d to a scalar outcome Y as the minimizer of the expectation of a per-observation loss function. Collecting these random variables into the vector Z = (Y, X′)′ ∈ R^{d+1}, with z = (y, x′)′ denoting a realization, we write

f∗ := arg min_f E[ℓ(f, Z)].
We allow for any loss function ℓ(f, z) that is Lipschitz in f and obeys a curvature condition around f∗. Specifically, for constants c_1, c_2, and C_ℓ that are bounded and bounded away from zero, we assume that ℓ(f, z) obeys

|ℓ(f, z) − ℓ(g, z)| ≤ C_ℓ |f(x) − g(x)|   and
c_1 E[(f − f∗)²] ≤ E[ℓ(f, Z)] − E[ℓ(f∗, Z)] ≤ c_2 E[(f − f∗)²].   (2.1)
Our results will be stated for a general loss obeying these two conditions.1 We give a
unified localization analysis of all such problems. This family of loss functions covers many
interesting problems. Two leading examples, used in our application to causal inference,
are least squares and logistic regression, corresponding to the outcome and propensity
score models, respectively. For least squares, the target function and loss are
f∗(x) := E[Y | X = x]   and   ℓ(f, z) = ½ (y − f(x))²,   (2.2)

respectively, while for logistic regression these are

f∗(x) := log( E[Y | X = x] / (1 − E[Y | X = x]) )   and   ℓ(f, z) = −y f(x) + log(1 + e^{f(x)}).   (2.3)
Lemma 8 verifies, with explicit constants, that (2.1) holds for these two. Losses obeying
(2.1) extend beyond these cases to other generalized linear models, such as count models,
and can even cover multinomial logistic regression (multiclass classification), as shown in
Lemma 9.
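As a concrete rendering of the two leading examples, the short Python sketch below (ours, not from the paper) evaluates ℓ(f, z) for least squares and logistic regression at a candidate value f(x); the function and variable names are illustrative assumptions.

```python
import numpy as np

def loss_least_squares(f_x, y):
    """Least squares loss from (2.2): (1/2) * (y - f(x))^2."""
    return 0.5 * (y - f_x) ** 2

def loss_logistic(f_x, y):
    """Logistic loss from (2.3): -y * f(x) + log(1 + exp(f(x)))."""
    return -y * f_x + np.logaddexp(0.0, f_x)  # numerically stable log(1 + e^{f(x)})

# Both losses are Lipschitz in f(x) for bounded outcomes, as required by (2.1).
print(loss_least_squares(0.3, 1.0), loss_logistic(0.3, 1.0))
```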
¹We thank an anonymous referee for suggesting this approach of exposition.
FIGURE 1.—Illustration of a feedforward neural network with W = 18, L = 2, U = 5, and input dimension
d = 2. The input units are shown in white at left, the output in black at right, and the hidden units in grey
between them.
A feedforward neural network has d input units, one for each covariate X ∈ R^d, and one output unit for the outcome Y. Between these are U hidden units, or com-
putational nodes or neurons. These are connected by a directed acyclic graph specifying
the architecture. The key graphical feature of a feedforward network is that hidden units
are grouped in a sequence of L layers, the depth of the network, where a node is in layer
l = 1, 2, ..., L, if it has a predecessor in layer l − 1 and no predecessor in any layer l′ ≥ l.
The width of the network at a given layer, denoted Hl , is the number of units in that layer.
The network is completed with the choice of an activation function σ : R → R applied to
the output of each node as described below. In this paper, we focus on the popular ReLU
activation function σ(x) = max(x, 0), though our results can be extended (at notational
cost) to cover piecewise linear activation functions (see also Remark 3).
An important and widely used subclass is fully connected between consecutive layers, has no other connections, and has hidden layers whose widths are all of the same order of magnitude. This architecture is often referred to as a Multilayer Perceptron (MLP), and we denote this class as F_MLP; see Figure 2, cf. Figure 1. We will assume that the widths of all layers share a common asymptotic order H, implying that for this class U ≍ LH.
We allow for generic feedforward networks, but we present special results for the MLP
case, as it is widely used in empirical practice. As we will see below, the architecture,
through its complexity, and equally importantly, approximation power, plays a crucial
role in the final bound. In particular, we find only a suboptimal rate for the MLP case, but
our upper bound is still sufficient for semiparametric inference.
To build intuition on the computation, and compare to other nonparametric methods,
let us focus on least squares for the moment, that is, equation (2.2), with a continuous out-
come, using a multilayer perceptron with constant width H. Each hidden unit u receives an input in the form of a linear combination x̃′w + b and then returns σ(x̃′w + b), where the vector x̃ collects the output of all the units with a directed edge into u (i.e., from prior layers), w is a vector of weights, and b is a constant term. (The constant term is often referred to as the "bias" in the deep learning literature, but given the loaded meaning of this term in inference, we will largely avoid referring to b as a bias.) To be precise, let x̃_{h,l} denote the scalar output of node u = (h, l), for h = 1, ..., H_l and l = 1, ..., L, and let x̃_l = (x̃_{1,l}, ..., x̃_{H_l,l})′ for layer l ≤ L. The full network is defined through recursion: each node computes x̃_{h,l} = σ(x̃′_{l−1} w_{h,l−1} + b_{h,l−1}), and the final output is ŷ = f̂_MLP(x) = x̃′_L w_L + b_L. The MLP
estimator can also be written as a composition as follows. Define W_l as the H_{l+1} × H_l matrix collecting the weight vectors {w_{h,l}}, where H_0 = d, let b_l be the vector collecting the corresponding constants {b_{h,l}}, and let σ(·) applied to a vector act componentwise. Then

f̂_MLP(x) = W_L σ(· · · σ(W_3 σ(W_2 σ(W_1 σ(W_0 x + b_0) + b_1) + b_2) + b_3) + · · ·) + b_L.
(This exact structure does not hold for the more general case of Section 2.3.) It is also
useful to write the output of the final layer as x̃L = x̃L (x), explicitly as a function of the
original covariates, and thus the final output may be seen as a basis function approxima-
tion (albeit a complex and data-dependent one), written as f̂_MLP(x) = x̃_L(x)′w_L + b_L, which is reminiscent of a traditional series (linear sieve) estimator. If all layers save the last were fixed, we could simply optimize using least squares directly: (ŵ_L, b̂_L) ∈ arg min_{w,b} ‖y_i − x̃′_L w − b‖²_n.
The crucial distinction is that the basis functions x̃L (·) are learned from the data.
The "basis" is x̃_L = (x̃_{1,L}, ..., x̃_{H_L,L})′, where each x̃_{h,L} = σ(x̃′_{L−1} w_{h,L−1} + b_{h,L−1}). Therefore, "before" we can solve the least squares problem above, we would have to estimate (w_{h,L−1}, b_{h,L−1}), h = 1, ..., H, anticipating the final estimation. These in turn de-
pend on the prior layer, and so forth back to the original inputs x. Optimization pro-
ceeds layer-by-layer using (variants of) stochastic gradient descent, with gradients of
the parameters calculated by back-propagation (implementing the chain rule) induced
by the network structure. Our results match standard optimization methods by not re-
quiring the weight parameters to be uniformly bounded. The collection, over all nodes,
of w and b, constitutes the parameters θ which are optimized in the final estima-
tion. We denote W as the total number of parameters of the network. For the MLP,
W = (d + 1)H + (L − 1)(H² + H) + H + 1.
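To make the recursion and the parameter count concrete, the following minimal numpy sketch (ours, with arbitrary dimensions and random placeholder weights rather than estimates) computes the MLP forward pass with ReLU activation and verifies the formula for W.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def mlp_forward(x, weights, biases):
    """Compute f_MLP(x) = W_L sigma(... sigma(W_1 sigma(W_0 x + b_0) + b_1) ...) + b_L."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                  # hidden layers apply the ReLU activation
    return weights[-1] @ h + biases[-1]      # the output layer is linear

d, H, L = 2, 5, 3                            # input dimension, width, depth (hidden layers)
rng = np.random.default_rng(0)
dims = [d] + [H] * L + [1]
weights = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(L + 1)]
biases = [rng.normal(size=dims[l + 1]) for l in range(L + 1)]

# Total parameter count matches W = (d + 1)H + (L - 1)(H^2 + H) + H + 1.
W_total = sum(w.size + b.size for w, b in zip(weights, biases))
assert W_total == (d + 1) * H + (L - 1) * (H ** 2 + H) + H + 1
print(mlp_forward(np.array([0.5, -0.2]), weights, biases))
```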
To further clarify the use of deep nets, it is useful to make explicit analogies to more
classical nonparametric techniques, leveraging the form fMLP (x) = x̃L (x) wL + bL . For
a traditional series estimator (such as splines) the two choices for the practitioner are
the basis (the spline shape and degree) and the number of terms (knots), commonly re-
ferred to as the smoothing and tuning parameters, respectively. In kernel regression, these
would respectively be the shape of the kernel (and degree of local polynomial) and the
bandwidth(s). For neural networks, the same phenomena are present: the architecture as
a whole (the graph structure and activation function) acts as the smoothing parameter, while
the width and depth play the role of tuning parameters.
The architecture plays a crucial role in that it determines the approximation power
of the network, and it is worth noting that because of the relative complexity of neural
networks, such approximations, and comparisons across architectures, are not simple. It
is comparatively obvious that quartic splines are more flexible than cubic splines (for the
same number of knots) as is a higher degree local polynomial (for the same bandwidth).
At a glance, it may not be clear what function class a given network architecture (width,
depth, graph structure, and activation function) can approximate. As we will show below,
the MLP architecture is not yet known to yield an optimal approximation (for a given
width and depth) and, therefore, we are only able to prove a bound with slower than
optimal rate. As a final note, computational considerations are important for deep nets
in a way that is not true conventionally; see Remarks 1, 2, and 3.
Just as for classical nonparametrics, for a fixed architecture it is the tuning parameters
that determine the rate of convergence (fixing smoothness of f∗ ). The recent wave of the-
oretical study of deep learning is still in its infancy. As such, there is no understanding
of optimal architectures or tuning parameters. These choices can be difficult and only
preliminary research has been done (see, e.g., Daniely (2017), Telgarsky (2016) and refer-
ences therein). However, it is interesting that in some cases, results can be obtained even
with a fixed width H, provided the network is deep enough; see Corollary 2.
In sum, for a user-chosen architecture F_DNN, encompassing the choices σ(·), U, L, W, and the graph structure, the final estimate is computed using observed samples z_i = (y_i, x_i′)′, i = 1, 2, ..., n, of Z, by solving

f̂_DNN := arg min_{f∈F_DNN, ‖f‖_∞≤2M} Σ_{i=1}^n ℓ(f, z_i).   (2.4)
Recall that θ collects, over all nodes, the weights and constants w and b. When (2.4) is
restricted to the MLP class, we denote the resulting estimator f̂_MLP. The choice of M may
be arbitrarily large, and is part of the definition of the class FDNN . This is neither a tun-
ing parameter nor regularization in the usual sense: it is not assumed to vary with n, and
beyond being finite and bounding f∗ ∞ (see Assumption 1), no properties of M are re-
quired. This is simply a formalization of the requirement that the optimizer is not allowed
to diverge on the function level in the ℓ∞ sense, the weakest form of constraint. It is im-
portant to note that while typically regularization will alter the approximation power of
the class, that is not the case with the choice of M as we will assume that the true function
f∗ (x) is bounded, as is standard in nonparametric analysis. With some extra notational
burden, one can make the dependence of the bound on M explicit, though we omit this
for clarity as it is not related to statistical issues.
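For illustration only, a stripped-down version of (2.4) for the least squares loss can be solved by full-batch gradient descent with backpropagation, as in the numpy sketch below. The simulated target function, width, depth, learning rate, and number of steps are all our own arbitrary choices; practical implementations use stochastic gradients in libraries such as TensorFlow, and the sup-norm constraint in (2.4) is omitted since, as just discussed, it is a formality rather than a tuning parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d, H, L):
    """Random starting values for an MLP with L hidden layers of width H."""
    dims = [d] + [H] * L + [1]
    Ws = [rng.normal(scale=1.0 / np.sqrt(dims[l]), size=(dims[l + 1], dims[l]))
          for l in range(L + 1)]
    bs = [np.zeros(dims[l + 1]) for l in range(L + 1)]
    return Ws, bs

def forward(X, Ws, bs):
    """Forward pass; returns predictions and the layer activations needed for backprop."""
    A = [X]                                                # X has shape (d, n)
    for W, b in zip(Ws[:-1], bs[:-1]):
        A.append(np.maximum(W @ A[-1] + b[:, None], 0.0))  # ReLU hidden layers
    return (Ws[-1] @ A[-1] + bs[-1][:, None]).ravel(), A   # linear output layer

def gradient_step(X, y, Ws, bs, lr=0.1):
    """One full-batch gradient step on the least squares loss (2.2)."""
    f, A = forward(X, Ws, bs)
    delta = ((f - y) / y.size)[None, :]                    # d(average loss) / d(output)
    for l in range(len(Ws) - 1, -1, -1):
        gW, gb = delta @ A[l].T, delta.sum(axis=1)
        if l > 0:
            delta = (Ws[l].T @ delta) * (A[l] > 0)         # propagate through the ReLU
        Ws[l] -= lr * gW
        bs[l] -= lr * gb

# Simulated regression with an arbitrary smooth target f*(x) = sin(pi x_1) + x_2^2.
d, n = 2, 2000
X = rng.uniform(-1, 1, size=(d, n))
y = np.sin(np.pi * X[0]) + X[1] ** 2 + 0.1 * rng.normal(size=n)

Ws, bs = init_mlp(d, H=32, L=3)
for _ in range(1000):
    gradient_step(X, y, Ws, bs)
print("in-sample MSE:", round(float(np.mean((forward(X, Ws, bs)[0] - y) ** 2)), 4))
```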
ASSUMPTION 1: Assume that z_i = (y_i, x_i′)′, 1 ≤ i ≤ n, are i.i.d. copies of Z = (Y, X′)′, where X ∈ [−1, 1]^d is continuously distributed and, for a fixed M > 0, |Y| ≤ M and ‖f∗‖_∞ ≤ M.

This assumption is fairly standard in nonparametrics. The only restriction worth men-
tioning is that the outcome is bounded. In many cases this holds by default (such as logistic
regression, where Y ∈ {0, 1}) or count models (where Y ∈ {0, 1, ..., M}, with M limited
by real-world constraints). For continuous outcomes, such as least squares regression, our
restriction is not substantially more limiting than the usual assumption of a model such
as Y = f∗ (X) + ε, where X is compact-supported, f∗ is bounded, and the stochastic error
ε possesses many moments. Indeed, in many applications such a structure is only coher-
ent with bounded outcomes, such as the common practice of including lagged outcomes
as predictors. Next, the assumption of continuously distributed covariates is quite stan-
dard. Discrete covariates taking on many values may be more realistically thought of as
continuous, and it may be more accurate to allow these to slow the convergence rates.
Our focus on L2 (X) convergence allows for these essentially automatically. Finally, from
a practical point of view, deep networks handle discrete covariates seamlessly and have
demonstrated excellent empirical performance, which is in contrast to other more classi-
cal nonparametric techniques that may require manual adaptation.
Our main result treats the multilayer perceptron architecture, with the ReLU activation
function and unbounded weights, matching perhaps the most standard deep neural net-
work. Such MLPs are now known to approximate smooth functions well (Yarotsky (2017,
2018)), leading to our next assumption: that the target function f∗ lies in a Hölder ball
with certain smoothness. Discussion of Hölder, Sobolev, and Besov spaces can be found
in Gine and Nickl (2016).
ASSUMPTION 2: Assume f∗ lies in the Hölder ball W^{β,∞}([−1, 1]^d), with smoothness β ∈ N_+,

f∗(x) ∈ W^{β,∞}([−1, 1]^d) := { f : max_{α, |α| ≤ β} ess sup_{x ∈ [−1,1]^d} |D^α f(x)| ≤ 1 }.
Under Assumptions 1 and 2, we obtain the following high probability bounds, cover-
ing a host of models, which, to the best of our knowledge, is new to the literature. In
some sense, this is our main result for deep learning, as it deals with the most common
architecture.
THEOREM 1: Suppose Assumptions 1 and 2 hold. Let f̂_MLP solve (2.4) over the multilayer perceptron class F_MLP, with width H ≍ n^{d/(2(β+d))} log^2 n and depth L ≍ log n. Then, with probability at least 1 − exp(−n^{d/(β+d)} log^8 n), for n large enough,
(a) ‖f̂_MLP − f∗‖²_{L2(X)} ≤ C · {n^{−β/(β+d)} log^8 n + (log log n)/n} and
(b) E_n[(f̂_MLP − f∗)²] ≤ C · {n^{−β/(β+d)} log^8 n + (log log n)/n},
for a constant C > 0 independent of n, which may depend on d, M, and other fixed constants.
Several points about this result deserve comment. First, the proof relies on a localization analysis with scale-insensitive complexity measures (Rademacher complexities), in contrast to a classic sieve analysis with scale-sensitive measures such as metric entropy. This allows for a
richer set of approximating possibilities, in particular allowing more flexibility in seeking
architectures with specific properties, as we explore in the next subsection. For the special
case of least squares regression, Koltchinskii (2011) used a similar approach, and a similar
result to our Theorem 1(a) can be derived for this case using his Theorem 5.2 and Exam-
ple 3 (p. 85f). This is perhaps the nearest antecedent to our result. To avoid repetition,
other important results are discussed following Theorem 2 below.
Second, we are able to attain a faster rate on the second term of the bound, order
n−1 in the sample size, instead of the n−1/2 that would result from a direct application of
uniform deviation bounds. This upper bound informs the trade offs between H and L,
and the approximation power, and may point toward optimal architectures for statistical
inference. Even with these choices of H and L, the bound of Theorem 1 is not optimal (for
fixed β, in the sense of Stone (1982)). We rely on the explicit approximating constructions
of Yarotsky (2017), and it is possible that in the future improved approximation properties
of MLPs will be found, allowing for a sharpening of the results of Theorem 1 immediately,
that is, without change to our theoretical argument. At present, it is not clear if this rate
can be improved, but it is sufficiently fast for valid inference.
Finally, we note that as is standard in nonparametrics, this result relies on choosing H
appropriately given the smoothness β of Assumption 2. Of course, the true smoothness
is unknown and thus in practice the “β” appearing in H, and consequently in the con-
vergence rates, need not match that of Assumption 2. In general, the rate will depend on
the smaller of the two. Most commonly, it is assumed that the user-chosen β is fixed and
that the truth is smoother; witness the ubiquity of cubic splines and local linear regres-
sion. Rather than spell out these consequences directly, we will tacitly assume the true
smoothness is not less than the β appearing in H (here and below). Smoothness adaptive
approaches, as in classical nonparametrics, may also be possible with deep nets, but are
beyond the scope of this study.
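As a purely numerical illustration of this choice, the width and depth orders appearing in Theorem 1 can be turned into concrete integers once constants are fixed; the theory pins down only the asymptotic orders, so the constants below (set to one) and the example values of n, d, and β are our own assumptions.

```python
import numpy as np

def mlp_tuning(n, d, beta, c_width=1.0, c_depth=1.0):
    """Width H ~ n^{d / (2(beta + d))} log^2 n and depth L ~ log n, as in Theorem 1.

    c_width and c_depth are unspecified constants; only the rates are pinned down."""
    H = int(np.ceil(c_width * n ** (d / (2.0 * (beta + d))) * np.log(n) ** 2))
    L = int(np.ceil(c_depth * np.log(n)))
    return H, L

# Example: a 5-dimensional design with assumed smoothness beta = 3 and n = 10,000.
print(mlp_tuning(n=10_000, d=5, beta=3))
```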
ASSUMPTION 3: Let f∗ lie in a class F. For the feedforward network class F_DNN used in (2.4), let the approximation error ε_DNN be

ε_DNN := sup_{f∈F} inf_{f′∈F_DNN} ‖f′ − f‖_∞.
It may be possible to require only an approximation in the L2 (X) norm, but this as-
sumption matches the current approximation theory literature and is more comparable
with other work in nonparametrics, and thus we maintain the uniform definition. We then
obtain the following result.
This result is more general than Theorem 1, covering the general deep ReLU net-
work problem defined in (2.4), general feedforward architectures, and the general class
of losses defined by (2.1). The same comments as were made following Theorem 1 apply
here as well: the same localization argument is used with the same benefits. We explicitly
use this in the next two corollaries, where we exploit the allowed flexibility in controlling
ε_DNN by stating results for particular architectures. The bound here is not directly appli-
cable without specifying the network structure, which will determine both the variance
portion (through W , L, and U) and the approximation error. With these set, the bound
becomes operational upon choosing γ, which can be optimized as desired.
Perhaps the most directly related existing result, in addition to the aforementioned
result of Koltchinskii (2011), is Theorem 2 of Schmidt-Hieber (2019), which also uses
generic approximation error. That result is not a high-probability bound, only a rate
on the expected risk, only covers squared loss, and requires Gaussian noise that is in-
dependent of the covariates and has known, homoskedastic variance, and, importantly,
requires uniformly bounded weights in the network. The assumption of bounded weights
may be difficult to impose computationally and can limit the approximation power of
the network. To see this last point, consider a simple example: suppose that d = 1 and
f∗ (x) = σ(ζx + 1)/2 − σ(ζx − 1)/2. This f∗ , for any ζ, is bounded and can be realized
by a ReLU network without norm constraints using only two hidden units, and is thus
estimable at 1/n. However, for ζ > 1 a network with weights bounded by one (as in
Schmidt-Hieber (2019)) must have width 2ζ, so ζ must be known, and yields expected
risk of order ζ/n.
Turning to special cases, we first show that the optimal rate of Stone (1982) can be
attained, up to log factors. However, this relies on a rather artificial network structure,
designed to approximate functions in a Sobolev space well, but without concern for
practical implementation. Thus, while the following rate improves upon Theorem 1, we
view this result as mainly of theoretical interest: establishing that (certain) deep ReLU
networks are able to attain the optimal rate.
COROLLARY 1—Optimal Rate: Suppose Assumptions 1 and 2 hold. Let f̂_OPT solve (2.4) using the (deep and wide) network of Yarotsky (2017, Theorem 1), with W ≍ U ≍ n^{d/(2β+d)} log n and depth L ≍ log n; the following hold with probability at least 1 − e^{−γ}, for n large enough,
(a) ‖f̂_OPT − f∗‖²_{L2(X)} ≤ C · {n^{−2β/(2β+d)} log^4 n + (log log n + γ)/n} and
(b) E_n[(f̂_OPT − f∗)²] ≤ C · {n^{−2β/(2β+d)} log^4 n + (log log n + γ)/n},
for a constant C > 0 independent of n, which may depend on d, M, and other fixed constants.
The same rate, up to log factors, albeit concerning only the expected risk and subject to
the other limitations above, can be obtained from Theorems 2 and 5 of Schmidt-Hieber
(2019). However, the main goal of Schmidt-Hieber (2019) is not the standard nonpara-
metric problem considered here, but rather in studying dimension adaptivity. Specifically,
the main result therein, Theorem 1, shows that if f∗ is itself a composition of functions
which are individually estimable faster than n^{−2β/(2β+d)}, then a sparsely connected deep ReLU
network adapts to this structure and attains the faster rate, an oracle type result. We do
not explicitly study sparse networks. Further, it is shown that estimators which are not
based on a composition structure do not possess the same adaptation property. For more
on the results and limitations of Schmidt-Hieber (2019), see the published discussions
(Ghorbani, Mei, Misiakiewicz, and Montanari (2019), Shamir (2019), Kutyniok (2019)).
Other work in this direction is Bach (2017) and Bauer and Kohler (2019). Polson and
Ročková (2018) also obtain bounds for deep nets, building on these works, but applied in
a Bayesian context.
Next, we turn to very deep networks that are very narrow, which have attracted sub-
stantial recent interest. Theorem 1 and Corollary 1 dealt with networks where the depth
and the width grow with sample size. This matches the most common empirical prac-
tice, and is what we use in Section 4. However, it is possible to allow for networks of
fixed width, provided the depth is sufficiently large. The next result is perhaps the largest
departure from the classical study of neural networks: earlier work considered networks
with diverging width but fixed depth (often a single layer), while the reverse is true here.
The activation function is of course qualitatively different as well, being piecewise lin-
ear instead of smooth. Using recent results (Mhaskar and Poggio (2016), Hanin (2019),
Yarotsky (2018)), we can establish the following rate for very deep, fixed-width MLPs.
COROLLARY 2—Fixed Width Networks: Let the conditions of Theorem 1 hold, with β ≥ 1 in Assumption 2. Let f̂_FW solve (2.4) for an MLP with fixed width H = 2d + 10 and depth L ≍ n^{d/(2(2+d))}. Then with probability at least 1 − e^{−γ}, for n large enough,
(a) ‖f̂_FW − f∗‖²_{L2(X)} ≤ C · {n^{−2/(2+d)} log^2 n + (log log n + γ)/n} and
(b) E_n[(f̂_FW − f∗)²] ≤ C · {n^{−2/(2+d)} log^2 n + (log log n + γ)/n},
for a constant C > 0 independent of n, which may depend on d, M, and other fixed constants.
This result is again mainly of theoretical interest. The class is only able to approximate
well functions with β = 1 (cf. the choice of L) which limits the potential applications of
the result because, in practice, d will be large enough to render this rate, unlike those
above, too slow for use in later inference procedures. In particular, if d ≥ 3, the sufficient
conditions of Theorem 3 fail.
Finally, as mentioned following Theorem 1, our theory here will immediately yield a
faster rate upon discovery of improved approximation power of this class of networks.
In other words, for example, if a proof became available that fixed-width, very deep net-
works can approximate β-smooth functions (as in Assumption 2), then Corollary 2 will
be trivially improvable to match the rate of Theorem 1. Similarly, if the MLP architecture
can be shown to share the approximation power with that of Corollary 1, then Theorem 1
will itself deliver the optimal rate. Our proofs will not require adjustment.
REMARK 2: Although there has been a great deal of work in easing implementation
(optimization and tuning) of deep nets, it still may be a challenge in some settings, par-
ticularly when using non-standard architectures. See also Remark 1. Given the renewed
interest in deep networks, this is already an active area of study (Hartford et al. (2017), Polson and Ročková (2018)), and we expect this work to continue and implementations to rapidly
evolve. This is perhaps another reason that Theorem 1 is, at the present time, the most
practically useful, but that (as just discussed) Theorem 2 will be increasingly useful in the
future.
REMARK 3: Our results can be extended easily to include piecewise linear activation
functions beyond ReLU, using the complexity result obtained in Bartlett et al. (2017). In
principle, similar rates of convergence could be attained for other activation functions,
given results on their approximation error. However, it is not clear what practical value
would be offered due to computational issues (in which the activation choice plays a cru-
cial role). Indeed, the recent switch to ReLU stems not from their greater approximation
power, but from the fact that optimizing a deep net with sigmoid-type activation is unsta-
ble or impossible in practice. Thus, while it is certainly possible that we could complement
the single-layer results with rates for sigmoid-based deep networks, these results would
have no consequences for real-world practice.
From a purely practical point of view, several variations of the ReLU activation function
have been proposed recently (including the so-called Leaky ReLU, Randomized ReLU,
(Scaled) Exponential Linear Units, and so forth) and have been found in some experi-
ments to improve optimization properties. It is not clear what theoretical properties these
activation functions have or if the computational benefits persist more generically, though
this area is rapidly evolving. We conjecture that our results could be extended to include
these activation functions.
ASSUMPTION 4: Let p(x) = P[T = 1 | X = x] denote the propensity score and μ_t(x) = E[Y(t) | X = x], t ∈ {0, 1}, denote the two outcome regression functions. For t ∈ {0, 1} and almost surely X, E[Y(t) | T, X = x] = E[Y(t) | X = x] and p̄ ≤ p(x) ≤ 1 − p̄ for some p̄ > 0.
Our approach to inference follows the current literature and uses sample averages of
the (uncentered) influence functions. This approach yields valid inference under weaker
conditions on the first step estimates (Farrell (2015), Chernozhukov et al. (2018)). Hahn
(1998) showed that the influence function for a single average potential outcome is
given by ψ_t(z) − E[Y(t)], for t ∈ {0, 1} and z = (y, t, x′)′, where ψ_t(z) = 1{T = t}(y − μ_t(x)) P[T = t | X = x]^{−1} + μ_t(x). We estimate the unknown functions with deep learning to form

ψ̂_t(z_i) = 1{t_i = t}(y_i − μ̂_t(x_i)) / P̂[T = t | X = x_i] + μ̂_t(x_i).   (3.1)
THEOREM 3: Suppose that {z_i = (y_i, t_i, x_i′)′}_{i=1}^n are i.i.d. obeying Assumption 4 and the conditions of Theorem 1 hold with β_p ∧ β_μ > d. Further assume that, for t ∈ {0, 1}, E[(s(X)ψ_t(Z))² | X] is bounded away from zero and, for some δ > 0, E[(s(X)ψ_t(Z))^{4+δ} | X] is bounded. Then the deep MLP-ReLU network estimators defined above obey the following, for t ∈ {0, 1}:
(a) E_n[(p̂(x_i) − p(x_i))²] = o_P(1) and E_n[(μ̂_t(x_i) − μ_t(x_i))²] = o_P(1),
(b) E_n[(μ̂_t(x_i) − μ_t(x_i))²]^{1/2} E_n[(p̂(x_i) − p(x_i))²]^{1/2} = o_P(n^{−1/2}), and
(c) E_n[(μ̂_t(x_i) − μ_t(x_i))(1 − 1{t_i = t}/P[T = t | X = x_i])] = o_P(n^{−1/2}),
and, therefore, if p̂(x_i) is bounded inside (0, 1), for a given s(x) and t ∈ {0, 1}, we have

√n E_n[s(x_i)ψ̂_t(z_i) − s(x_i)ψ_t(z_i)] = o_P(1)   and   E_n[(s(x_i)ψ̂_t(z_i))²] / E_n[(s(x_i)ψ_t(z_i))²] − 1 = o_P(1).
It is immediate from Theorem 3 that the estimators of (3.2), and other similar estima-
tors, are asymptotically Normal with estimable variance. Looking at π̂(s) to fix ideas,

√n Σ̂^{−1/2}(π̂(s) − π(s)) →_d N(0, 1),   with   Σ̂ = E_n[(s(x_i)ψ̂_1(z_i) + (1 − s(x_i))ψ̂_0(z_i))²] − π̂(s)².
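To fix ideas computationally, the numpy sketch below (ours, with all variable names hypothetical) builds the scores in (3.1) from already-fitted values μ̂_1, μ̂_0, and p̂, aggregates them under a targeting rule s(x) into π̂(s), and forms the Normal-based confidence interval using the variance estimator Σ̂ displayed above; the first-step deep nets themselves are not shown.

```python
import numpy as np

def dr_scores(y, t, mu1_hat, mu0_hat, p_hat):
    """Doubly robust scores psi_hat_1 and psi_hat_0 from (3.1)."""
    psi1 = t * (y - mu1_hat) / p_hat + mu1_hat
    psi0 = (1 - t) * (y - mu0_hat) / (1 - p_hat) + mu0_hat
    return psi1, psi0

def policy_value_ci(y, t, mu1_hat, mu0_hat, p_hat, s, z=1.96):
    """Estimate pi(s) = E[s(X)Y(1) + (1 - s(X))Y(0)] with a 95% confidence interval."""
    psi1, psi0 = dr_scores(y, t, mu1_hat, mu0_hat, p_hat)
    scores = s * psi1 + (1 - s) * psi0
    pi_hat = scores.mean()
    sigma2_hat = np.mean(scores ** 2) - pi_hat ** 2   # Sigma-hat from the display above
    half = z * np.sqrt(sigma2_hat / y.size)
    return pi_hat, (pi_hat - half, pi_hat + half)

# Toy usage with a randomized treatment and placeholder (correctly specified) first steps.
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(size=n)
t = rng.binomial(1, 0.5, size=n)
y = 1.0 + 2.0 * t * x + rng.normal(size=n)
print(policy_value_ci(y, t, mu1_hat=1.0 + 2.0 * x, mu0_hat=np.ones(n),
                      p_hat=np.full(n, 0.5), s=(x > 0.5).astype(float)))
```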
COROLLARY 3: Let the conditions of Theorem 3 hold but, instead of Assumption 4, assume T is independent of Y(0), Y(1), and X, and is distributed Bernoulli with parameter p∗ bounded inside (0, 1). Then, if β_μ > d, the deep MLP-ReLU networks obey (a′) E_n[(μ̂_t(x_i) − μ_t(x_i))²] = o_P(1) and (c′) E_n[(μ̂_t(x_i) − μ_t(x_i))(1 − 1{t_i = t}/p∗)] = o_P(n^{−1/2}), and the results of Theorem 3 hold.
Theorem 3 shows, for a specific context, how deep learning delivers valid asymptotic
inference for our parameters of interest. Theorem 1 (a generic result using Theorem 2
could be stated) proves that the nonparametric estimates converge sufficiently fast, as
formalized by conditions (a), (b), and (c), enabling feasible efficient semiparametric in-
ference. Proofs and further discussion of similar results can be found in Farrell (2015),
Chernozhukov et al. (2018). Here, it is worth mentioning that the condition (c), which
arises from a “leave-in” type remainder, can be weakened using sample splitting. Instead,
we employ our localization analysis, as was used to obtain the results of Section 2, to ver-
ify (c) directly (see Lemma 10); this appears to be a novel application of localization, and
this approach may be useful in future applications of second-step inference using machine
learning methods where the theoretical gain of weaker requirements may not be worth
the price paid in constants in finite samples.
Finally, we close this discussion by noting that our focus with Theorem 3 is showcasing
the practical utility of deep learning in inference, and not in attaining minimal condi-
tions. The requirement that βp ∧ βμ > d, or βp ∧ βμ > d/2 in Corollary 1, is not minimal.
Minimal conditions for semiparametric inference have been studied by many, dating at
least to Bickel and Ritov (1988); see Robins, Tchetgen, Li, and van der Vaart (2009) for
recent results and references. For causal inference, Chen, Hong, and Tarozzi (2008) and
Athey, Imbens, and Wager (2018) obtain efficiency under weaker conditions than ours on
p(x) (the former under minimal smoothness on μt (x) and the latter under a sparsity in
a high-dimensional linear model). Further, cross-fitting paired with local robustness may
yield weaker smoothness conditions by providing “underfitting” robustness, that is, weak-
ening bias-related assumptions (Chernozhukov et al. (2018)), but the cost may be too
high here. Weaker variance-related assumptions, or “overfitting” robustness (Cattaneo,
Jansson, and Ma (2019)), may also be possible following deep learning, but are less auto-
matic at present. Other methods for causal inference under relaxed assumptions may be
useful here, such as extensions to doubly robust inference (Tan (2020)) or robust inverse
weighting (Ma and Wang (2018)).
4. EMPIRICAL APPLICATION
To illustrate our results, we study data from a large US retailer of consumer prod-
ucts. The firm sells directly to the customer (as opposed to via retailers) using a variety
of channels such as the web and mail. Targeted marketing instruments, such as catalogs,
aim to induce demand and often contain advertising and informational content about the
firm's offerings. It is important to carefully select which customers should be sent this ma-
terial, that is, be targeted for treatment, since the costs of its creation and dissemination
accumulate rapidly. For a typical retailer, the costs of one catalog may be close to a dol-
lar. With millions of catalogs being sent, ascertaining the causal effects of such targeted
mailing, and then using these effects to evaluate potential targeting strategies, is crucial
for policy making. For a full discussion, see Hitsch and Misra (2018) (we use their 2015
sample).
The data consists of 292,657 consumers chosen at random from the retailer’s database.
Of these, 2/3 were randomly chosen to receive a catalog (the treatment). We observe treat-
ment status, roughly one hundred fifty covariates, including demographics, past purchase
behaviors, interactions with the firm, and other relevant information, and total consumer
spending, the outcome of interest, aggregated from all available purchase channels in-
cluding phone, mail, and the web, in a 3-month window. Average spending is $7.31, but
for the roughly 6% who made a purchase, the average spend is $117.73.
We implement equations (3.1) and (3.2) for eight different deep nets. All computation
was done using TensorFlowTM . The details of the eight deep net architectures are pre-
sented in Table I. A key measure of fit reported in the final column of the table is the
proportion of τ̂(x_i) that were negative. As argued by Hitsch and Misra (2018), it is implausi-
ble, for nearly all individuals, under standard marketing or economic theory that receipt
of a catalog causes lower purchasing. Here, deep nets perform as well as, and sometimes
better than, the best methods found by Hitsch and Misra (2018). Figure 3 shows the dis-
tribution of τ̂(x_i) across customers for each of the eight architectures. While there are
differences in the shapes, the mean and variance estimates are nonetheless similar. We
also conducted a placebo test: using only the untreated customers in the data and ran-
domly assigning half to treated status we then reran the eight architectures.2 Figure 4
TABLE I
DEEP NETWORK ARCHITECTURES

²We thank Guido Imbens for suggesting this analysis.
plots τ̂(x_i) for these, and we see that the "true" zero average effect is recovered and with
the expected distribution.
Table II shows the estimates of the average treatment effect and the counterfactual
profits from three different targeting strategies, along with their respective 95% con-
fidence intervals. The strategies are (i) never treat, s(x) ≡ 0; (ii) a blanket treatment,
s(x) ≡ 1; (iii) a loyalty policy, s(x) = 1 only for those who had purchased in the prior cal-
endar year. In all cases, we add a profit margin m and a mailing cost c to π(s) (our NDA
with the firm forbids revealing m and c). It is clear that profits from the three policies are
ordered as π(never) < π(blanket) < π(loyalty). In all results, there is broad agreement
among the eight architectures. This may be due to the fact that the data is experimental,
so that the propensity score is constant. We have explored this using simulations, which
are reported in the Supplemental Material (Farrell, Liang, and Misra (2021)).
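The comparison of targeting rules can be sketched as follows; this is our illustration rather than the authors' code, and the mapping from π(s) to profit through the margin m and cost c is an assumed functional form with made-up values, since the actual m and c are confidential.

```python
import numpy as np

def policy_profit(psi1, psi0, s, m, c):
    """Illustrative profit of rule s: margin m applied to counterfactual spending pi(s),
    net of the mailing cost c for each targeted customer (assumed functional form)."""
    pi_hat = np.mean(s * psi1 + (1 - s) * psi0)
    return m * pi_hat - c * np.mean(s)

# psi1 and psi0 would come from (3.1); here they are simulated placeholders.
rng = np.random.default_rng(1)
n = 1000
psi0 = rng.normal(5.0, 2.0, size=n)
psi1 = psi0 + rng.normal(2.5, 1.0, size=n)
purchased_last_year = rng.binomial(1, 0.3, size=n).astype(float)

policies = {"never": np.zeros(n), "blanket": np.ones(n), "loyalty": purchased_last_year}
for name, s in policies.items():
    print(name, round(policy_profit(psi1, psi0, s, m=0.30, c=0.90), 3))
```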
TABLE II
AVERAGE TREATMENT EFFECT ESTIMATES AND COUNTERFACTUAL PROFITS FROM THREE TARGETING
STRATEGIES, WITH 95% CONFIDENCE INTERVALS

Architecture   ATE                    Never                  Blanket                Loyalty
1              2.606 [2.273, 2.932]   2.016 [1.923, 2.110]   2.234 [2.162, 2.306]   2.367 [2.292, 2.443]
2              2.577 [2.252, 2.901]   2.022 [1.929, 2.114]   2.229 [2.157, 2.301]   2.363 [2.288, 2.438]
3              2.547 [2.223, 2.872]   2.027 [1.934, 2.120]   2.224 [2.152, 2.296]   2.358 [2.283, 2.434]
4              2.488 [2.160, 2.817]   2.037 [1.944, 2.130]   2.213 [2.140, 2.286]   2.350 [2.274, 2.425]
5              2.459 [2.127, 2.791]   2.043 [1.950, 2.136]   2.208 [2.135, 2.281]   2.345 [2.269, 2.422]
6              2.430 [2.093, 2.767]   2.048 [1.954, 2.142]   2.202 [2.128, 2.277]   2.341 [2.263, 2.418]
7              2.400 [2.057, 2.744]   2.053 [1.959, 2.148]   2.197 [2.122, 2.272]   2.336 [2.258, 2.414]
8              2.371 [2.021, 2.721]   2.059 [1.963, 2.154]   2.192 [2.116, 2.268]   2.332 [2.253, 2.411]
5. CONCLUSION
The utility of deep learning in social science applications is still a subject of interest
and debate. While there is an acknowledgment of its predictive power, there has been
limited adoption of deep learning in social sciences such as economics. Some part of
the reluctance to adopting these methods stems from the lack of theory facilitating use
and interpretation. We have shown, both theoretically as well as empirically, that these
methods can offer excellent performance.
In this paper, we have given a formal proof that inference can be valid after using deep
learning methods for first-step estimation. Our results thus contribute directly to the re-
cent explosion in both theoretical and applied research using machine learning methods
in economics, and to the recent adoption of deep learning in empirical settings. We ob-
tained novel bounds for deep neural networks, speaking directly to the modern (and em-
pirically successful) practice of using fully-connected feedforward networks. Our results
allow for different network architectures, including fixed width, very deep networks. Our
results cover general nonparametric regression-type loss functions, encompassing most non-
parametric practice. We used our bounds to deliver fast convergence rates allowing for
second-stage inference on a finite-dimensional parameter of interest.
There are practical implications of the theory presented in this paper. We focused on
semiparametric causal effects as a concrete illustration, but deep learning is a potentially
valuable tool in many diverse economic settings. Our results allow researchers to embed
deep learning into standard econometric models such as linear regressions, generalized
linear models, and other forms of limited dependent variables models (e.g., censored re-
gression). Our theory can also be used as a starting point for constructing deep learning
implementations of two-step estimators in the context of selection models, dynamic dis-
crete choice, and the estimation of games.
To be clear, we see our paper as an early step in the exploration of deep learning as a
tool for economic applications. There are a number of opportunities, questions, and chal-
lenges that remain. For some estimands, it may be crucial to estimate the density as well,
and this problem can be challenging in high dimensions. Deep nets, in the formulation
of GANs, are a promising tool for distribution estimation (Liang (2018), Athey, Imbens,
Metzger, and Munro (2019)). There are also interesting questions of network architec-
tures representing, and adapting to, the underlying function, and if these can be learned
from the data (Bach (2017), Dou and Liang (2020)). Lastly, further computational and
optimization guidance is needed. Research into these applications and structures is un-
derway.
APPENDIX A: PROOFS
In this appendix, we provide a proof of Theorems 1 and 2, our main theoretical re-
sults for deep ReLU networks, and their corollaries. The proof proceeds in several steps.
We first give the main breakdown and bound the bias (approximation error) term. We
then turn our attention to the empirical process term, to which we apply our localiza-
tion. Much of the proof uses a generic architecture, and thus pertains to both results. We
will specialize the architecture to the multi-layer perceptron only when needed later on.
Other special cases and related results are covered in Appendix A.4. Supporting lemmas
are stated in Appendix B.
The statements of Theorems 1 and 2 assume that n is large enough. Precisely, we require n > (2eM)² ∨ Pdim(F_DNN). For notational simplicity, we will denote f̂ := f̂_DNN, see (2.4), and ε_n := ε_DNN, see Assumption 3. As we are simultaneously considering Theorems 1 and 2, this generic notation will be used throughout.
Equation (A.1) is the main decomposition that begins the proof. The decomposition
must be done this way because of the above notes regarding f∗ and fn . The first term is the
empirical process term that will be treated in the subsequent subsection. For the second
term in (A.1), the bias term or approximation error, we apply Bernstein’s inequality to
≤ c_2 ε_n² + ε_n √(2 C_ℓ² γ̃ / n) + 7 C_ℓ M γ̃ / n,   (A.2)

using the Lipschitz and curvature conditions on the loss function defined in equation (2.1) and E[(f_n − f∗)²] ≤ ‖f_n − f∗‖²_∞, along with the definition of ε_n².
Once the empirical process term is controlled (in Appendix A.2), the two bounds will
be brought back together to compute the final result; see Appendix A.3.
Let η_1, ..., η_n be i.i.d. Rademacher random variables (P[η_i = 1] = P[η_i = −1] = 1/2), independent of the data, and define

R_n F := sup_{f∈F} | (1/n) Σ_{i=1}^n η_i f(x_i) |.
Intuitively, Rn F measures how flexible the function class is for predicting random
signs. Taking the expectation of Rn F conditioned on the data we obtain the empirical
Rademacher complexity, denoted Eη [Rn F ]. When the expectation is taken over both the
data and the draws ηi , ERn F , we get the Rademacher complexity.
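As a purely illustrative aside, not part of the proof, the empirical Rademacher complexity of a simple finite class can be approximated by Monte Carlo over the sign draws η_i, as in the sketch below; the class of linear functions on a grid and all constants are our own choices and are unrelated to the neural network classes analyzed here.

```python
import numpy as np

def empirical_rademacher(fx, n_draws=500, seed=0):
    """Monte Carlo estimate of E_eta[ sup_f | (1/n) sum_i eta_i f(x_i) | ].

    fx is a (num_functions, n) matrix whose rows are (f(x_1), ..., f(x_n)) for f in F."""
    rng = np.random.default_rng(seed)
    n = fx.shape[1]
    sups = []
    for _ in range(n_draws):
        eta = rng.choice([-1.0, 1.0], size=n)        # Rademacher signs
        sups.append(np.max(np.abs(fx @ eta)) / n)    # supremum over the class
    return float(np.mean(sups))

# Example class: f(x) = theta' x for theta on a grid in [-1, 1]^2, at 200 sample points.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
thetas = np.array([[a, b] for a in np.linspace(-1, 1, 21) for b in np.linspace(-1, 1, 21)])
print(empirical_rademacher(thetas @ X.T))
```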
$$E_n(f - f_*)^2 - E(f - f_*)^2 \le 3\,E R_n\bigl\{g = (f - f_*)^2 : f\in\mathcal{F},\ \|f - f_*\|_{L^2(X)}\le r\bigr\} + 3Mr\sqrt{\frac{2\tilde\gamma}{n}} + \frac{36M^2\tilde\gamma}{3n}$$
$$\le 18M\,E R_n\bigl\{f - f_* : f\in\mathcal{F},\ \|f - f_*\|_{L^2(X)}\le r\bigr\} + 3Mr\sqrt{\frac{2\tilde\gamma}{n}} + \frac{12M^2\tilde\gamma}{n}, \qquad (A.3)$$
where the second inequality applies Lemma 2 to the Lipschitz functions {g} (as a function
of the real values f (x)) and iterated expectations.
Suppose the radius r satisfies (A.4) and
$$r^2 \ge \frac{6\sqrt{6}\,M^2\tilde\gamma}{n}. \qquad (A.5)$$
Then we conclude from (A.3) that
$$E_n(f - f_*)^2 \le r^2 + r^2 + 3Mr\sqrt{\frac{2\tilde\gamma}{n}} + \frac{12M^2\tilde\gamma}{n} \le (2r)^2, \qquad (A.6)$$
where the first inequality uses (A.4) and the second uses (A.5). This means that for r above the "critical radius" (see Step III), the empirical $L^2$ norm is at most twice the population one, with probability at least $1 - \exp(-\tilde\gamma)$.
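The elementary arithmetic behind the final inequality in (A.6), spelled out here for completeness, is as follows: if $r^2 \ge 6\sqrt{6}\,M^2\tilde\gamma/n$, then
$$\frac{12M^2\tilde\gamma}{n} \le \frac{2}{\sqrt{6}}\,r^2 \quad\text{and}\quad 3Mr\sqrt{\frac{2\tilde\gamma}{n}} \le 3r^2\sqrt{\frac{2}{6\sqrt{6}}} \approx 1.11\,r^2,$$
$$\text{so that}\quad 3Mr\sqrt{\frac{2\tilde\gamma}{n}} + \frac{12M^2\tilde\gamma}{n} \le \left(\frac{2}{\sqrt{6}} + 3\sqrt{\frac{2}{6\sqrt{6}}}\right)r^2 \approx 1.92\,r^2 \le 2r^2.$$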
The middle term in (A.7) is due to the following variance calculation (recall Equation (2.1)):
$$V[g] \le E\bigl[g^2\bigr] = E\Bigl[\bigl(\ell(f, z) - \ell(f_*, z)\bigr)^2\Bigr] \le C_\ell^2\,E(f - f_*)^2 \le C_\ell^2 r_0^2.$$
Here, the fact that Lemma 5 is variance dependent, and that the variance depends on the radius $r_0$, is important. It is this property which enables a sharpening of the rate with step-by-step reductions in the variance bound, as in Appendix A.2.4.
For the empirical Rademacher complexity term, the first term of (A.7), Lemma 2,
Step I, and Lemma 3 (notation defined there), yield
with probability $1 - \exp(-\tilde\gamma)$ (when applying Step I). Recalling Lemma 4, one can further upper bound the entropy integral when $n > \mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})$:
$$\inf_{0<\alpha<2r_0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_{\alpha}^{2r_0}\sqrt{\log N\bigl(\delta,\ \mathcal{F}_{\mathrm{DNN}}|_{x_1,\dots,x_n},\ \|\cdot\|_\infty\bigr)}\,d\delta\right\}$$
$$\le \inf_{0<\alpha<2r_0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_{\alpha}^{2r_0}\sqrt{\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})\,\log\frac{2eMn}{\delta\cdot\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})}}\,d\delta\right\}$$
$$\le 32 r_0\sqrt{\frac{\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})}{n}\left(\log\frac{2eM}{r_0} + \frac{3}{2}\log n\right)},$$
with the particular choice $\alpha = 2r_0\sqrt{\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})/n} < 2r_0$. Therefore, whenever $r_0 \ge 1/n$ and $n \ge (2eM)^2$,
$$E_\eta\bigl[R_n\mathcal{G}\bigr] \le 128\,C_\ell\,r_0\sqrt{\frac{\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})}{n}\log n}.$$
Applying this bound to (A.7), we have
$$(E - E_n)\bigl[\ell(f, z) - \ell(f_*, z)\bigr] \le K r_0\sqrt{\frac{\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})}{n}\log n} + r_0\sqrt{\frac{2C_\ell^2\tilde\gamma}{n}} + \frac{23MC_\ell\tilde\gamma}{n}, \qquad (A.8)$$
where $K = 6 \times 128\,C_\ell$.
Going back now to the main decomposition, plug (A.8) and (A.2) into (A.1); overall, we have found that, with probability at least $1 - 4\exp(-\tilde\gamma)$, the following holds:
By construction this obeys (A.4), and thus so does $2r_*$. Denote the event $\mathcal{E}$ (depending on the data) to be
Then, applying the one-step improvement argument in Step II (again, the variance dependence captured in Lemma 5 is crucial, here reflected in the variance within each shell), equation (A.9) yields that, with probability at least $1 - 4\exp(-\tilde\gamma)$,
$$\bigl\|\hat f - f_*\bigr\|_{L^2(X)}^2 \le 2^j\bar r\cdot\frac{1}{c_1}\left(K\sqrt{C}\sqrt{\frac{WL\log W}{n}\log n} + \sqrt{\frac{2C_\ell^2\tilde\gamma}{n}}\right) + \frac{1}{c_1}\left(c_2\epsilon_n^2 + \frac{2C_\ell^2\tilde\gamma}{n} + 30MC_\ell\frac{\tilde\gamma}{n}\right) \le 2^{2j-2}\bar r^2,$$
$$\frac{1}{c_1}\left(K\sqrt{C}\sqrt{\frac{WL\log W}{n}\log n} + \sqrt{\frac{2C_\ell^2\tilde\gamma}{n}}\right) \le \frac{1}{2}\,2^j\bar r,$$
$$\frac{1}{c_1}\left(c_2\epsilon_n^2 + \frac{2C_\ell^2\tilde\gamma}{n} + 26MC_\ell\frac{\tilde\gamma}{n}\right) \le \frac{1}{4}\,2^{2j}\bar r^2.$$
$$\bar r = \frac{8}{c_1}\left(K\sqrt{C}\sqrt{\frac{WL\log W}{n}\log n} + \sqrt{\frac{2C_\ell^2\tilde\gamma}{n}}\right) + \frac{2(c_2 \vee 1)}{c_1}\,\epsilon_n + \frac{120MC_\ell}{c_1}\,\frac{\tilde\gamma}{n} + r_*. \qquad (A.14)$$
The “and” part of each line follows from Step I and the implication uses the above ar-
gument following Step II. Therefore, in the end, we conclude with probability at least
1 − 6l exp(−γ̃),
Therefore, by choosing $\gamma = -\log(6l) + \tilde\gamma$, we know from (A.14) and the upper bound on $r_*$ in (A.10) that
$$\bar r \le \frac{8}{c_1}\left(K\sqrt{C}\sqrt{\frac{WL\log W}{n}\log n} + \sqrt{\frac{2C_\ell^2(\log\log n + \gamma)}{n}}\right) + \frac{2(c_2 \vee 1)}{c_1}\,\epsilon_n + \frac{120MC_\ell}{c_1}\,\frac{\log\log n + \gamma}{n} + r_*$$
$$\le C\left(\sqrt{\frac{WL\log W}{n}\log n} + \sqrt{\frac{\log\log n + \gamma}{n}} + \epsilon_n\right), \qquad (A.17)$$
with some constant C > 0 that does not depend on n. This completes the proof of Theo-
rem 2.
FIGURE 5.—Illustration of how to embed a feedforward network into a multilayer perceptron, with auxiliary
hidden nodes (shown in dark grey).
PROOF: The idea is illustrated in Figure 5. For the edges in the directed graph of
f ∈ FDNN that connect nodes not in adjacent layers (shown in yellow in Figure 5), one
can insert auxiliary hidden units in order to simply “pass forward” the information. The
number of such auxiliary “pass forward units” is at most the number of offending edges
times the depth L (i.e., for each edge, at most L auxiliary nodes are required), and this is
bounded by W L. Therefore the width of the MLP network that subsumes the original is
upper bounded by W L + U while still maintaining the required embedding that for any
fθ ∈ FDNN , there is a gθ ∈ FMLP such that gθ = fθ . In order to match modern practice,
we only need to show that the auxiliary units can be implemented with ReLU activation. This can be done by setting the constant ("bias") term b of each auxiliary unit large enough to ensure $\sigma(\tilde x'w + b) = \tilde x'w + b$, where $\tilde x$ denotes the input covariates, and then subtracting the same b in the last receiving unit along the path. Q.E.D.
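Purely as a numerical sketch of the bias-shift trick in the last step (a toy computation with arbitrary weights, not the paper's implementation):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# "Pass-forward" unit: for inputs with |x'w| <= B, choosing b >= B makes the ReLU
# affine on the relevant range, so relu(x'w + b) = x'w + b, and subtracting the same
# b in the downstream receiving unit recovers x'w exactly.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(1000, 3))   # covariates in [-1, 1]^3
w = np.array([0.5, -2.0, 1.0])

B = np.abs(x @ w).max()                      # bound on |x'w| over the sample
b = B + 1.0                                  # any b >= B works

passed_forward = relu(x @ w + b) - b         # auxiliary unit, then undo the shift
assert np.allclose(passed_forward, x @ w)    # the signal is transmitted unchanged
print("max pass-forward error:", np.abs(passed_forward - x @ w).max())
```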
Next, we give two properties of the Rademacher complexity (see Mendelson, 2003).
LEMMA 3—Dudley's Chaining: Let $N(\delta, \mathcal{F}, \|\cdot\|_n)$ denote the covering number for the class $\mathcal{F}$ (with covering radius $\delta$ and metric $\|\cdot\|_n$). Then
$$E_\eta\bigl[R_n\{f : f\in\mathcal{F},\ \|f\|_n \le r\}\bigr] \le \inf_{0<\alpha<r}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_{\alpha}^{r}\sqrt{\log N\bigl(\delta, \mathcal{F}, \|\cdot\|_n\bigr)}\,d\delta\right\}.$$
Furthermore, because $\|f\|_n \le \max_i|f(x_i)|$, we have $N(\delta, \mathcal{F}, \|\cdot\|_n) \le N(\delta, \mathcal{F}|_{x_1,\dots,x_n}, \|\cdot\|_\infty)$, and so the upper bound in the conclusion also holds with $N(\delta, \mathcal{F}|_{x_1,\dots,x_n}, \|\cdot\|_\infty)$, where $\mathcal{F}|_{x_1,\dots,x_n}$ is the class $\mathcal{F}$ projected onto the data.
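As a toy sanity check on Lemma 3 (not used in the proofs), for a finite class the covering number is bounded by the cardinality of the class, so the chaining bound can be evaluated directly and compared with a Monte Carlo estimate of the empirical Rademacher complexity; the class below is an arbitrary construction of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, n_draws = 400, 30, 2000

# Finite class: K fixed functions recorded by their values at the n sample points,
# rescaled so that ||f||_n = sqrt( (1/n) sum_i f(x_i)^2 ) = r for every f.
r = 1.0
F = rng.normal(size=(K, n))
F *= r / np.sqrt((F ** 2).mean(axis=1, keepdims=True))

# Direct Monte Carlo estimate of E_eta R_n { f : ||f||_n <= r }.
eta = rng.choice([-1.0, 1.0], size=(n_draws, n))
mc = (eta @ F.T / n).max(axis=1).mean()

# Chaining bound: N(delta, F, ||.||_n) <= K for every delta, so letting alpha -> 0
# in Lemma 3 gives E_eta R_n <= (12 r / sqrt(n)) * sqrt(log K).
dudley = 12.0 * r / np.sqrt(n) * np.sqrt(np.log(K))

print(f"Monte Carlo estimate: {mc:.4f}   chaining bound: {dudley:.4f}")
```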
The next two results, Theorems 12.2 and 14.1 in Anthony and Bartlett (1999), show that
the metric entropy may be bounded in terms of the pseudo-dimension and that the latter
is bounded by the Vapnik-Chervonenkis (VC) dimension.
The following symmetrization lemma bounds the empirical process term using Rademacher complexity, and is thus a crucial piece of our localization. This is a standard result based on Talagrand's concentration, but here special care is taken with the dependence on the variance.
When bounding the complexity of FDNN , we use the following result. Bartlett et al.
(2017) also verify these bounds for the VC-dimension.
LEMMA 6—Theorem 6 in Bartlett et al. (2017), ReLU case: Consider a ReLU network architecture $\mathcal{F} = \mathcal{F}_{\mathrm{DNN}}(W, L, U)$; then the pseudo-dimension is sandwiched as
For the MLP, we use the following approximation result, Yarotsky (2017) Theorem 1.
LEMMA 7: There exists a network class $\mathcal{F}_{\mathrm{DNN}}$, with ReLU activation, such that for any $\epsilon > 0$:
(a) $\mathcal{F}_{\mathrm{DNN}}$ approximates $W^{\beta,\infty}([-1,1]^d)$ in the sense that, for any $f_* \in W^{\beta,\infty}([-1,1]^d)$, there exists an $f_n(\epsilon) := f_n \in \mathcal{F}_{\mathrm{DNN}}$ such that $\|f_n - f_*\|_\infty \le \epsilon$;
(b) $\mathcal{F}_{\mathrm{DNN}}$ has $L(\epsilon) \le C\cdot(\log(1/\epsilon) + 1)$ and $W(\epsilon), U(\epsilon) \le C\cdot\epsilon^{-d/\beta}(\log(1/\epsilon) + 1)$.
Here C only depends on d and β.
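The rates in Lemma 7 stem from Yarotsky's construction, whose basic building block approximates $x \mapsto x^2$ by composing "hat" functions, each representable exactly with a few ReLU units, so that the approximation error decays exponentially in depth. The following numerical sketch (for illustration only) checks that building block:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat(x):
    # Triangle ("hat") function on [0, 1], written exactly with three ReLU units:
    # 2*relu(x) - 4*relu(x - 1/2) + 2*relu(x - 1).
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def sq_approx(x, m):
    # Yarotsky (2017): f_m(x) = x - sum_{s=1}^m g_s(x) / 4^s, where g_s is the s-fold
    # composition of the hat function, satisfies |f_m(x) - x^2| <= 4^{-(m+1)} on [0, 1].
    # Each extra composition adds a fixed number of ReLU layers, so error decays
    # exponentially in depth.
    out, g = x.copy(), x.copy()
    for s in range(1, m + 1):
        g = hat(g)
        out -= g / 4.0 ** s
    return out

x = np.linspace(0.0, 1.0, 10001)
for m in (2, 4, 6):
    err = np.abs(sq_approx(x, m) - x ** 2).max()
    print(f"m = {m}: max error = {err:.2e}  (bound 4^-(m+1) = {4.0 ** -(m + 1):.2e})")
```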
For completeness, we verify the requirements on the loss functions, equation (2.1), for
several examples. We first treat least squares and logistic losses, in slightly more detail, as
these are used in our subsequent inference results and empirical application.
LEMMA 8: Both the least squares (2.2) and logistic (2.3) loss functions obey the requirements of Equation (2.1). For least squares, $c_1 = c_2 = 1/2$ and $C_\ell = M$. For logistic regression, $c_1 = \bigl(2(\exp(M) + \exp(-M) + 2)\bigr)^{-1}$, $c_2 = 1/8$, and $C_\ell = 1$.
PROOF: The Lipschitz conditions are trivial. For least squares, using iterated expecta-
tions
$$2E\bigl[\ell(f, Z)\bigr] - 2E\bigl[\ell(f_*, Z)\bigr] = E\bigl[-2Yf + f^2 + 2Yf_* - f_*^2\bigr] = E\bigl[-2f_*f + f^2 + 2f_*^2 - f_*^2\bigr] = E\bigl[(f - f_*)^2\bigr].$$
For logistic regression,
$$E\bigl[\ell(f, Z)\bigr] - E\bigl[\ell(f_*, Z)\bigr] = E\left[-\frac{\exp(f_*)}{1+\exp(f_*)}(f - f_*) + \log\frac{1+\exp(f)}{1+\exp(f_*)}\right] = E\bigl[h_{f_*}(f)\bigr],$$
where $h_a(b) := -\frac{\exp(a)}{1+\exp(a)}(b - a) + \log\frac{1+\exp(b)}{1+\exp(a)}$, so that $h_a(a) = 0$ and $h_a'(a) = 0$. A second-order expansion gives, for some $\xi \in [0, 1]$,
$$h_a(b) = h_a(a) + h_a'(a)(b - a) + \frac{1}{2}h_a''\bigl(\xi a + (1-\xi)b\bigr)(b - a)^2,$$
and $h_a''(b) = \frac{1}{\exp(b)+\exp(-b)+2} \le \frac{1}{4}$. The lower bound holds as $|\xi f_* + (1-\xi)f| \le M$. Q.E.D.
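These constants are easy to check by simulation; in the sketch below, the bound M, the functions f and $f_*$, and the Bernoulli outcome model are arbitrary illustrative choices, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 2.0                                    # sup-norm bound on the index functions
n = 200_000

# Simulated design: f and f_star are arbitrary bounded functions evaluated at X,
# and Y | X is Bernoulli with log-odds f_star(X), so f_star is the population minimizer.
x = rng.uniform(-1.0, 1.0, size=n)
f_star = M * np.sin(3 * x)
f = np.clip(f_star + rng.uniform(-1.0, 1.0, size=n), -M, M)

def logistic_risk(g, f_star):
    # E[ ell(g, Z) ] with ell(g, z) = -y g(x) + log(1 + exp(g(x))), using E[Y|X] = p_star.
    p_star = 1.0 / (1.0 + np.exp(-f_star))
    return np.mean(-p_star * g + np.log1p(np.exp(g)))

excess = logistic_risk(f, f_star) - logistic_risk(f_star, f_star)
l2sq = np.mean((f - f_star) ** 2)

c1 = 1.0 / (2.0 * (np.exp(M) + np.exp(-M) + 2.0))   # constants from Lemma 8
c2 = 1.0 / 8.0
print(c1 * l2sq <= excess <= c2 * l2sq)              # expected: True
```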
Beyond least squares and logistic regression, we give three further examples, discussed
in the general language of generalized linear models. Note that in the final example we
move beyond a simple scalar outcome.
LEMMA 9: For a convex function $g(\cdot): \mathbb{R}\to\mathbb{R}$, consider the generalized linear loss function $\ell(f, z) = -y\,f(x) + g(f(x))$. The curvature and Lipschitz conditions in (2.1) hold for the specific choices of $g(\cdot)$ below. In each case, the loss function corresponds to the negative log-likelihood function.
(a) Poisson: $g(t) = \exp(t)$, with $f_*(x) = \log E[y|X = x]$.
(b) Gamma: $g(t) = -\log t$, with $f_*(x) = -(E[y|X = x])^{-1}$.
(c) Multinomial logistic, $K + 1$ classes: $g(t) = \log\bigl(1 + \sum_{k\in K}\exp(t^{[k]})\bigr)$, with
$$\exp\bigl(f_*^{[k]}(x)\bigr)\Big/\Bigl(1 + \sum_{k'\in K}\exp\bigl(f_*^{[k']}(x)\bigr)\Bigr) = E\bigl[y^{[k]}\mid X = x\bigr].$$
PROOF: Denote by $\nabla g$ and $\mathrm{Hessian}[g]$ the gradient and Hessian of the convex function g. By the convexity of g, the optimal $f_*$ satisfies $E[\partial\ell(f_*, Z)/\partial f \mid X = x] = 0$, which implies $\nabla g(f_*) = E[Y|X = x]$. If $2c_0 \preceq \mathrm{Hessian}[g(f)] \preceq 2c_1$ for all f of interest, then the curvature condition in (2.1) holds, because
$$E\bigl[\ell(f, Z)\bigr] - E\bigl[\ell(f_*, Z)\bigr] = E\Bigl[-\nabla g(f_*)'(f - f_*) + g(f) - g(f_*)\Bigr] = \frac{1}{2}E\Bigl[(f - f_*)'\,\mathrm{Hessian}\bigl[g(\tilde f)\bigr](f - f_*)\Bigr] \ge c_0\,E\|f - f_*\|^2,$$
and the parallel argument gives $\le c_1\,E\|f - f_*\|^2$. The Lipschitz condition is equivalent to $\|\nabla g(f)\| \le C_\ell$ for all f of interest, with bounded Y.
For our three examples in particular, we have the following.
$$\frac{1}{\bigl(1 + K\exp(M)\bigr)^2} \le \lambda\bigl(\mathrm{Hessian}\bigl[g(f)\bigr]\bigr) \le \frac{\exp(M)}{1 + (K - 1)\exp(-M) + \exp(M)}.$$
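For the Poisson case (a), for instance, $\mathrm{Hessian}[g(f)] = \exp(f) \in [\exp(-M), \exp(M)]$ whenever $|f| \le M$, so the argument above applies with $c_0 = \exp(-M)/2$ and $c_1 = \exp(M)/2$; the following simulated check (with an arbitrary design of ours) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(4)
M, n = 1.5, 200_000

# Poisson GLM loss ell(f, z) = -y f(x) + exp(f(x)); conditionally on X, Y has mean
# exp(f_star(X)), so E[ ell(f, Z) | X ] = -exp(f_star) f + exp(f).
x = rng.uniform(-1.0, 1.0, size=n)
f_star = M * np.cos(2 * x)
f = np.clip(f_star + rng.normal(scale=0.5, size=n), -M, M)

def poisson_risk(g, f_star):
    return np.mean(-np.exp(f_star) * g + np.exp(g))

excess = poisson_risk(f, f_star) - poisson_risk(f_star, f_star)
l2sq = np.mean((f - f_star) ** 2)

c0, c1 = np.exp(-M) / 2.0, np.exp(M) / 2.0   # curvature bounds since exp(-M) <= g'' <= exp(M)
print(c0 * l2sq <= excess <= c1 * l2sq)      # expected: True
```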
Our last result is to verify condition (c) of Theorem 3. We do so using our localization,
which may be of future interest in second-step inference with machine learning methods.
PROOF: Without loss of generality, we can take $\bar p < 1/2$. The only estimated function here is $\mu_t(x)$, which plays the role of $f_*$. For function(als) $L(\cdot)$ of the form
$$L(f) := \bigl(f(x_i) - f_*(x_i)\bigr)\left(1 - \frac{1\{t_i = t\}}{P[T = t\mid X = x_i]}\right),$$
it is true that
$$E\bigl[L(f)\bigr] = E\left[\bigl(f(X) - f_*(X)\bigr)\left(1 - \frac{E\bigl[1\{t_i = t\}\mid x_i\bigr]}{P[T = t\mid X = x_i]}\right)\right] = 0$$
and
$$V\bigl[L(f)\bigr] \le (1/\bar p - 1)^2\,E\bigl[\bigl(f(X) - f_*(X)\bigr)^2\bigr] \le (1/\bar p - 1)^2\,\bar r^2, \qquad \bigl|L(f)\bigr| \le (1/\bar p - 1)\,2M,$$
where the first line is due to $\bar r > r_*$, and the second line uses Lemma 2.
Then, by the localization analysis and Lemma 5, for all $f\in\mathcal{F}$ with $\|f - f_*\|_{L^2(X)} \le \bar r$, $L(f)$ obeys
$$E_n\bigl[L(f)\bigr] = E_n\bigl[L(f)\bigr] - E\bigl[L(f)\bigr]$$
With probability at least $1 - \exp\bigl(-n^{\frac{d}{\beta+d}}\log^8 n\bigr)$, $\hat f_{\mathrm{MLP}}$ lies in this set of functions, and therefore
$$E_n\bigl[L(\hat f_{\mathrm{MLP}})\bigr] = E_n\left[\bigl(\hat f_{\mathrm{MLP}}(x) - f_*(x)\bigr)\left(1 - \frac{1(T = t)}{P(T = t\mid X = x)}\right)\right] \le C\cdot\left(n^{-\frac{\beta}{\beta+d}}\log^8 n + \frac{\log\log n}{n}\right),$$
as claimed. Q.E.D.
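The zero-mean property of $L(\cdot)$ and the variance bound above are transparent in simulation; the data-generating process below (the propensity score, outcome regression, and candidate f) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000

# Illustrative design (ours, not from the paper): known propensity score
# p(x) = P[T = t | X = x], bounded below by p_bar < 1/2, outcome regression
# mu_t = f_star, and an arbitrary candidate f close to f_star.
x = rng.uniform(-1.0, 1.0, size=n)
p = 0.25 + 0.5 / (1.0 + np.exp(-2.0 * x))
t = rng.binomial(1, p)                          # the indicator 1{T = t}
f_star = np.sin(np.pi * x)
f = f_star + 0.3 * x ** 2

# L(f) = (f(x_i) - f_star(x_i)) * (1 - 1{t_i = t} / P[T = t | X = x_i])
L = (f - f_star) * (1.0 - t / p)

p_bar = p.min()
print("sample mean of L(f):", L.mean())         # E[L(f)] = 0, so this is close to zero
print("variance bound holds:",
      L.var() <= (1.0 / p_bar - 1.0) ** 2 * np.mean((f - f_star) ** 2))
```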
REFERENCES
ANTHONY, M., AND P. L. BARTLETT (1999): Neural Network Learning: Theoretical Foundations. Cambridge University Press. [182,207]
ATHEY, S., G. W. IMBENS, J. METZGER, AND E. M. MUNRO (2019): “Using Wasserstein Generative Adversarial
Networks for the Design of Monte Carlo Simulations,” arXiv:1909.02210. Preprint. [198]
ATHEY, S., G. W. IMBENS, AND S. WAGER (2018): “Approximate Residual Balancing: Debiased Inference of
Average Treatment Effects in High Dimensions,” Journal of the Royal Statistical Society, Series B, 80, 597–623.
[183,195]
BACH, F. (2017): “Breaking the Curse of Dimensionality With Convex Neural Networks,” The Journal of Ma-
chine Learning Research, 18, 629–681. [184,192,198]
BARTLETT, P. L., O. BOUSQUET, AND S. MENDELSON (2005): “Local Rademacher Complexities,” The Annals
of Statistics, 33, 1497–1537. [189,200,208]
BARTLETT, P. L., N. HARVEY, C. LIAW, AND A. MEHRABIAN (2017): “Nearly-Tight VC-Dimension Bounds
for Piecewise Linear Neural Networks,” in Proceedings of the 22nd Annual Conference on Learning Theory
(COLT 2017). [184,189,193,208]
BAUER, B., AND M. KOHLER (2019): “On Deep Learning as a Remedy for the Curse of Dimensionality in
Nonparametric Regression,” Annals of Statistics, 47, 2261–2285. [184,192]
BELLONI, A., V. CHERNOZHUKOV, I. FERNÁNDEZ-VAL, AND C. HANSEN (2017): “Program Evaluation and
Causal Inference With High-Dimensional Data,” Econometrica, 85, 233–298. [183,193]
BELLONI, A., V. CHERNOZHUKOV, AND C. HANSEN (2014): “Inference on Treatment Effects After Selection
Amongst High-Dimensional Controls,” Review of Economic Studies, 81, 608–650. [183,184,195]
BICKEL, P. J., AND Y. RITOV (1988): “Estimating Integrated Squared Density Derivatives: Sharp Best Order
of Convergence Estimates,” Sankhyā, 50, 381–393. [195]
BICKEL, P. J., Y. RITOV, AND A. B. TSYBAKOV (2009): “Simultaneous Analysis of LASSO and Dantzig Selec-
tor,” The Annals of Statistics, 37, 1705–1732. [183]
CATTANEO, M. D., M. JANSSON, AND X. MA (2019): “Two-Step Estimation and Inference With Possibly Many
Included Covariates,” Review of Economic Studies, 86, 1095–1122. [195]
CHEN, X., AND H. WHITE (1999): “Improved Rates and Asymptotic Normality for Nonparametric Neural
Network Estimators,” IEEE Transactions on Information Theory, 45, 682–691. [182]
CHEN, X., H. HONG, AND A. TAROZZI (2008): “Semiparametric Efficiency in GMM Models With Auxiliary
Data,” The Annals of Statistics, 36, 808–843. [195]
CHERNOZHUKOV, V., D. CHETVERIKOV, M. DEMIRER, E. DUFLO, C. HANSEN, W. NEWEY, AND J. ROBINS
(2018): “Double/Debiased Machine Learning for Treatment and Structural Parameters,” The Econometrics
Journal, 21, C1–C68. [184,193-195]
DANIELY, A. (2017): “Depth Separation for Neural Networks,” in Proceedings of the 22nd Annual Conference
on Learning Theory (COLT 2017). [188]
DOU, X., AND T. LIANG (2020): “Training Neural Networks as Learning Data-Adaptive Kernels: Provable
Representation and Approximation Benefits,” Journal of the American Statistical Association, (Forthcoming).
[198]
FARRELL, M. H. (2015): “Robust Inference on Average Treatment Effects With Possibly More Covariates
Than Observations,” Journal of Econometrics, 189, 1–23. arXiv:1309.4686. [183,184,194,195]
FARRELL, M. H., T. LIANG, AND S. MISRA (2019a): “Deep Neural Networks for Estimation and Inference,”
arXiv:1809.09953. [193]
(2021): “Supplement to ‘Deep Neural Networks for Estimation and Inference’,” Econometrica Sup-
plemental Material, 89, https://doi.org/10.3982/ECTA16901. [197]
GHORBANI, B., S. MEI, T. MISIAKIEWICZ, AND A. MONTANARI (2019): “Discussion of ‘Nonparametric Regres-
sion Using Deep Neural Networks With ReLU Activation Function’,” Annals of Statistics, (Forthcoming).
[192]
GINÉ, E., AND R. NICKL (2016): Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press. [189]
GOODFELLOW, I., Y. BENGIO, AND A. COURVILLE (2016): Deep Learning. Cambridge: MIT Press. [185]
HAHN, J. (1998): “On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average
Treatment Effects,” Econometrica, 66, 315–331. [194]
HANIN, B. (2019): “Universal Function Approximation by Deep Neural Nets With Bounded Width and ReLU Activations,” Mathematics, 7, 992. [192]
HARTFORD, J., G. LEWIS, K. LEYTON-BROWN, AND M. TADDY (2017): “Deep IV: A Flexible Approach for Counterfactual Prediction,” in International Conference on Machine Learning, 1414–1423. [183,192]
HITSCH, G. J., AND S. MISRA (2018): “Heterogeneous Treatment Effects and Optimal Targeting Policy Evalu-
ation,” SSRN preprint 3111957. [183,196]
JAVANMARD, A., AND A. MONTANARI (2014): “Confidence Intervals and Hypothesis Testing for High-
Dimensional Regression,” The Journal of Machine Learning Research, 15, 2869–2909. [184]
KINGMA, D. P., AND J. BA (2014): “Adam: A Method for Stochastic Optimization,” arXiv:1412.6980. Preprint.
[182]
KOLTCHINSKII, V. (2006): “Local Rademacher Complexities and Oracle Inequalities in Risk Minimization,”
The Annals of Statistics, 34, 2593–2656. [189]
(2011): Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer-
Verlag. [189-191,200]
KOLTCHINSKII, V., AND D. PANCHENKO (2000): “Rademacher Processes and Bounding the Risk of Function
Learning,” in High Dimensional Probability II. Springer, 443–457. [189,200]
KUTYNIOK, G. (2019): “Discussion of ‘Nonparametric Regression Using Deep Neural Networks With ReLU
Activation Function’,” Annals of Statistics, (Forthcoming). [192]
LECUN, Y., L. BOTTOU, Y. BENGIO, AND P. HAFFNER (1998): “Gradient-Based Learning Applied to Docu-
ment Recognition,” Proceedings of the IEEE, 86, 2278–2324. [182]
LIANG, T. (2018): “On How Well Generative Adversarial Networks Learn Densities: Nonparametric and Para-
metric Results,” arXiv:1811.03179. [198]
LIANG, T., A. RAKHLIN, AND K. SRIDHARAN (2015): “Learning With Square Loss: Localization Through Offset
Rademacher Complexity,” in Conference on Learning Theory, 1260–1285. [189,200]
MA, X., AND J. WANG (2018): “Robust Inference Using Inverse Probability Weighting,” arXiv:1810.11397.
Preprint. [195]
MENDELSON, S. (2003): “A few Notes on Statistical Learning Theory,” in Advanced Lectures on Machine Learn-
ing. Springer, 1–40. [207]
(2014): “Learning Without Concentration,” in Conference on Learning Theory, 25–39. [200]
MHASKAR, H. N., AND T. POGGIO (2016): “Deep vs. Shallow Networks: An Approximation Theory Perspec-
tive,” Analysis and Applications, 14, 829–848. [192]
NAIR, V., AND G. E. HINTON (2010): “Rectified Linear Units Improve Restricted Boltzmann Machines,” in
Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807–814. [182]
POLSON, N. G., AND V. ROČKOVÁ (2018): “Posterior Concentration for Sparse Deep Learning,” in Advances
in Neural Information Processing Systems, 930–941. [192]
ROBINS, J., E. T. TCHETGEN, L. LI, AND A. VAN DER VAART (2009): “Semiparametric Minimax Rates,” Elec-
tronic Journal of Statistics, 3, 1305–1321. [195]
SCHMIDT-HIEBER, J. (2019): “Nonparametric Regression Using Deep Neural Networks With ReLU Activation
Function,” Annals of Statistics, arXiv:1708.06633. (Forthcoming). [183,184,191,192]
SHALIT, U., F. D. JOHANSSON, AND D. SONTAG (2017): “Estimating Individual Treatment Effect: Generalization Bounds and Algorithms,” in Proceedings of the 34th International Conference on Machine Learning, 3076–3085. [183]
SHAMIR, O. (2019): “Discussion of ‘Nonparametric Regression Using Deep Neural Networks With ReLU
Activation Function’,” Annals of Statistics, (Forthcoming). [192]
STONE, C. J. (1982): “Optimal Global Rates of Convergence for Nonparametric Regression,” The Annals of
Statistics, 10, 1040–1053. [190,191]
TAN, Z. (2020): “Model-Assisted Inference for Treatment Effects Using Regularized Calibrated Estimation
With High-Dimensional Data,” The Annals of Statistics, 48, 811–837. [195]
TELGARSKY, M. (2017): “Benefits of Depth in Neural Networks,” in 29th Annual Conference on Learning Theory, 1517–1539. [188]
VAN DE GEER, S., P. BUHLMANN, Y. RITOV, AND R. DEZEURE (2014): “On Asymptotically Optimal Confi-
dence Regions and Tests for High-Dimensional Models,” The Annals of Statistics, 42, 1166–1202. [184]
WAGER, S., AND S. ATHEY (2018): “Estimation and Inference of Heterogeneous Treatment Effects Using
Random Forests,” Journal of the American Statistical Association, 113, 1228–1242. [183]
WESTREICH, D., J. LESSLER, AND M. J. FUNK (2010): “Propensity Score Estimation: Neural Networks, Support
Vector Machines, Decision Trees (CART), and Meta-Classifiers as Alternatives to Logistic Regression,”
Journal of Clinical Epidemiology, 63, 826–833. [183]
WHITE, H. (1992): Artificial Neural Networks: Approximation and Learning Theory. Blackwell Publishers, Inc.
[182]
YAROTSKY, D. (2017): “Error Bounds for Approximations With Deep ReLU Networks,” Neural Networks, 94,
103–114. [184,189-191,206,208]
(2018): “Optimal Approximation of Continuous Functions by Very Deep ReLU Networks,” in 31st Annual Conference on Learning Theory, 639–649. [184,189,192,206]
ZHANG, C., S. BENGIO, M. HARDT, B. RECHT, AND O. VINYALS (2016): “Understanding Deep Learning
Requires Rethinking Generalization,” arXiv:1611.03530. Preprint. [188,189]