
Covariate-dependent Graphical Model Estimation

via Neural Networks with Statistical Guarantees

Jiahe Lin 1 Yikai Zhang 1 George Michailidis 2

1 Machine Learning Research, Morgan Stanley
2 Department of Statistics and Data Science, UCLA

arXiv:2504.16356v1 [stat.ML] 23 Apr 2025

Abstract

Graphical models are widely used in diverse application domains to model the conditional dependencies
amongst a collection of random variables. In this paper, we consider settings where the graph structure
is covariate-dependent, and investigate a deep neural network-based approach to estimate it. The method
allows for flexible functional dependency on the covariate, and fits the data reasonably well in the absence
of a Gaussianity assumption. Theoretical results with PAC guarantees are established for the method, under
assumptions commonly used in an Empirical Risk Minimization framework. The performance of the pro-
posed method is evaluated on several synthetic data settings and benchmarked against existing approaches.
The method is further illustrated on real datasets involving data from neuroscience and finance, respectively,
and produces interpretable results.

1 Introduction

An undirected graphical model captures conditional dependencies amongst a collection of random variables
and has been widely used in diverse application areas such as bioinformatics and the social sciences. It is
associated with an undirected graph G = (V, E) with node set V := {1, · · · , p} and edge set E ⊆ V × V that
does not contain self-loops, i.e., (j, j) ̸∈ E; ∀ j ∈ V ; x = (x1 , x2 , · · · , xp )′ is a random real vector indexed by
the nodes of V , with probability distribution P.
The connection between the graph G and the probability distribution P(x) comes through the concept of
graph factorization and the properties that G exhibits. The probability distribution P(x) factorizes with
respect to G if it can be written as P(x) = (1/Z) ∏_{C∈C(G)} φ_C(x_C), where C(G) denotes the set of all cliques
of G, φ_C > 0, ∀ C ∈ C(G), are potential functions, and Z is a normalizing constant. If the underlying graph
G in addition exhibits a Markov-type property (see, e.g., Lauritzen (1996) and Appendix D.3), then the
conditional dependence relationships are encoded in the potential functions φ_C. In particular, for G satisfying
the pairwise Markov property, P can be written as P(x) ∝ ∏_{{j,k}∈E} φ_{j,k}(x_{j,k}), and the absence of an edge
in G implies that the corresponding potential function is zero and consequently the two nodes are conditionally
independent given the remaining nodes. The upshot is that the conditional independence relationships for
x ∼ P can be read off from the graph, if P factorizes w.r.t. G and the latter is Markov.
For certain multivariate distributions, the form of the joint distribution P(x) enables its decomposition into
clique-wise potential functions based on G in a fairly straightforward manner; examples include the Gaussian
for continuous variables (Lauritzen, 1996, see also Appendix D.1), and the Ising and the Boltzmann machine
models (Wainwright et al., 2008) for binary and discrete random variables, respectively.

Jiahe Lin and Yikai Zhang contributed equally and are listed alphabetically.
# Correspondence to: George Michailidis. 〈[email protected]〉. § Code is available at https://github.com/GeorgeMichailidis/covariate-dependent-graphical-model.
This paper has been accepted by Transactions on Machine Learning Research (TMLR).

The corresponding potential functions φ_C(·) are parameterized by (functions of) the parameters of the underlying distribution
(see example in Appendix D.1). Consequently, a graphical model and the conditional independence rela-
tionships that G encodes can be obtained by estimating the parameters of the underlying distribution, either
through maximum likelihood (Banerjee et al., 2006; Friedman et al., 2008) or regression procedures using
neighborhood selection techniques (Meinshausen and Bühlmann, 2006).
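To make this concrete, the following is a minimal sketch of the Gaussian instance detailed in Appendix D.1: for x ∼ N(0, Θ⁻¹) with precision matrix Θ = (θ_jk), the joint density factorizes into node-wise and pairwise potentials indexed by the entries of Θ, so a zero entry removes the corresponding pairwise factor and hence the edge.

```latex
% Sketch (cf. Appendix D.1): Gaussian graphical model with precision matrix Theta
P(x) \;\propto\; \exp\!\Big(-\tfrac{1}{2}\,x^{\top}\Theta x\Big)
      \;=\; \prod_{j\in V} \exp\!\Big(-\tfrac{1}{2}\,\theta_{jj}x_j^{2}\Big)
            \prod_{\{j,k\}:\,j<k} \exp\!\big(-\theta_{jk}\,x_j x_k\big),
\qquad
\theta_{jk}=0 \;\Longleftrightarrow\; x_j \perp\!\!\!\perp x_k \mid x_{-\{j,k\}} .
```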
For other classes of graphical models beyond those mentioned above, to select the clique-wise potential
functions φC (·) becomes more involved. The difficulty stems from specifying potential functions based on
which a compatible joint distribution P(x) with respect to G can be obtained. One popular approach is
to define node conditional distributions through pairwise cliques {j, k}; i.e., for any fixed j, P(x_j | x_−j) ∝
∏_{k∈ne(j)} φ_{j,k}(x_j, x_k) for k in the neighborhood of j. Such a conditional distribution captures pairwise
conditional interactions between nodes; however, this construction does not directly define a proper graph-
ical model for the joint distribution P(x), unless additional conditions are satisfied (see example in Ap-
pendix D.2). In general, the question of the compatibility of a graphical model for the joint distribution
P(x) based on a particular specification of all the node conditional distributions in V is rather open, and is
generally addressed on a case by case basis (Berti et al., 2014). In this paper, we adopt the convention that
models specified through their node-conditional distributions are also referred to as graphical models, which
capture node-wise conditional independence relationships.
The presentation thus far focused on a single “static” graphical model, which is suitable for modeling tasks
involving homogeneous data sets; i.e., the observed samples xi , i = 1, · · · , n are iid from P(x). However,
in certain applications, the conditional independence relationships may be heterogeneous in that they are
modulated by external covariates z ∈ Rq . An example from biology is that the brain connectivity network—
wherein the coordinates of x correspond to different brain regions—varies with age (z ∈ R+ ). Similarly,
gene expression exhibits individual-level variation affected by single nucleotide polymorphisms (SNPs); see,
e.g., Kim et al. (2012). The SNPs are thereby covariates—potentially of large dimension—that impact the
corresponding gene co-expression network. Motivated by such applications, this work introduces a class of
covariate dependent, pairwise interaction graphical models, wherein the log-densities of the node conditional
distributions are defined by
    x_j = ∑_{k≠j} β_jk(z) x_k + ε_j,   j ∈ {1, · · · , p},    (1)

where z is a q-dimensional observed covariate, and β_jk(·) : R^q → R is a functional regression coefficient depending
on z that impacts the strength of the interaction between nodes j and k.¹

1.1 Related Work

We provide a brief overview of approaches aiming to define graphical models for continuous random vari-
ables that go beyond the Gaussian case, and also of prior work on covariate dependent graphical models.

Non-Gaussian Graphical models. One line of work focuses on graphical models defined by monotone
transformations of the Gaussian case, namely the non-paranormal (Liu et al., 2009, 2012). Their estimation
entails first learning in a nonparametric fashion the respective transformations that “Gaussianize” the data,
then subsequently leveraging approaches used for the Gaussian graphical model. Another broader line of
work focuses on defining graphical models through specification of node conditional distributions. Yang et al.
(2014) consider pairwise node conditional distributions from the exponential family, wherein the canonical
parameter of the distribution is defined as a linear combination of higher order products of univariate func-
tions of neighbors of each node j in G. This construction leads to a family of potential functions that defines
a proper graphical model under G for the joint distribution P(x) (see details in Appendix D.2).
More recently, Baptista et al. (2024b) introduce a framework wherein the conditional dependence/independence
between x_j and x_k is characterized by a score matrix given by Ω_jk := E_{p(x)}[∂_j ∂_k log p(x)]², where p(x) is
1 Note that in the case of a Gaussian distributed error term ε, a covariate dependent graphical model is defined in terms of the joint
distribution of x conditional on z, i.e., P(x|z), that factorizes over a graph G(z), “parameterized” by the external covariate z.

the density function of x ∼ P(x) and is further assumed to be continuously differentiable. The authors
propose to estimate the density p(x) via a lower triangular transport map, with the components of the map
parameterized through linear expansions with Hermite functions as the bases. The same score matrix Ω
is also leveraged in Zheng et al. (2023); however, instead of estimating the density p(x), their approach
parameterizes the data score ∇x log p(x) using neural networks and estimates it based on an implicit score
matching objective (Hyvärinen, 2005), with an additional sparsity-inducing term associated with Ω.

Covariate-dependent graphical models. At a high level, covariate-dependent graphical models fall under
the umbrella of “varying coefficient models”, where model parameters are assumed to evolve with “dynamic
features”. Traditionally, the specific functional dependency on these dynamic features is handled via poly-
nomial or smoothing splines, as well as kernel-local polynomial smoothing (see Fan and Zhang, 2008, and
references therein). More recently, it has been rebranded by Al-Shedivat et al. (2020, CEN), where dynamic
features are termed “contextual information”. Specifically, the authors consider a hierarchical probabilistic
framework in which the model parameters depend non-deterministically on the contextual information, and
the dependency is parameterized through deep neural networks.
Within the specific context of undirected graphical model estimation where the focus is on conditional in-
dependence relationships, the existing literature can be broadly segmented into two categories: (1) the
covariate is a categorical variable taking discrete values. Early work by Lafferty et al. (2001) considers a
conditional random field (CRF) for discrete x, further assuming the underlying graph to be a chain
graph. For high-dimensional settings, with a continuous x, the problem is cast as efficient estimation of
graphical models while taking into account their similarities across different categories (Guo et al., 2011;
Cai et al., 2016). (2) For more general dependency on covariates that can be continuous-valued, one line
of work focuses on partition-based estimation, namely, segmenting the samples based on the values of the
covariates, then performing estimation within each partition (Liu et al., 2010). More recently, Gaussian
graphical models, in which the graph structure is learned as a continuous function of the covariates, are
being actively investigated. Specifically, Zhang and Li (2023); Zeng et al. (2024) assume a linear func-
tional dependency on the covariate, namely, β_jk(z) = b_jk^⊤ z, with b_jk being sparse. The former adopts a
regularization-based approach, while the latter a Bayesian one with sparsity inducing prior distributions.
Within the Bayesian paradigm, notable contributions also include Ni et al. (2022); Niu et al. (2024) that
introduce covariate-dependent prior distributions to enable context-dependent precision matrix estimation.
Finally, we remark that graphical model estimation with non-linear dependency on the covariate has been
investigated leveraging the idea of kernel smoothing (e.g., Zhou et al., 2010; Kolar et al., 2010). How-
ever, these approaches are limited to the case where the covariate is univariate, due to limitations induced
by smoothing kernels. Table 1 presents a selective summary of the above-mentioned existing works, with
an emphasis on those in which the covariate takes continuous values and corresponding theoretical results
provided. We note that the score matrix-based formulation for estimating non-Gaussian graphical models
could potentially be extended to incorporate the dependency on covariates, and such an extension is briefly
discussed in Section 6.
As a concluding remark, we note that “contextualized” estimation has also been adopted in other settings,
such as the estimation of directed acyclic graphs (DAGs); see, e.g., Zhou et al. (2022); Thompson et al.
(2024) and references therein.

1.2 Our Contributions

This work differs from existing ones in that it investigates a deep neural network-based approach for esti-
mating covariate-dependent graphical models, and establishes the corresponding statistical guarantees un-
der the PAC learning framework. Unlike much of the above-mentioned statistical literature that assumes a
linear relationship with the covariate, the current work leverages parameterization via neural networks to
accommodate more flexible dependency structures. On the theoretical front, finite-sample error bounds are
established under certain regularity conditions, enabling the quantification of the relationship between the
size of the neural network and the estimation error. Note that although Al-Shedivat et al. (2020) outline

Table 1: Summary of representative works in extant literature on covariate-dependent graphical models.

Zhou et al. (2010)
    setup: univariate, Gaussian, MLE-based kernel estimator
    theor. res.: finite-sample Frobenius and large-deviation bound
Kolar et al. (2010)
    setup: univariate, nodewise regression-based kernel smoothing
    theor. res.: finite-sample support recovery consistency under sub-Gaussianity and minimum edge strength assumption
Liu et al. (2010)
    setup: multivariate, Gaussian, tree-based partition & MLE within the region
    theor. res.: guarantee on the excess risk of the estimator
Zhang and Li (2023)
    setup: multivariate, Gaussian, nodewise regression w/ linear dependency on covariate
    theor. res.: finite-sample ℓ2 consistency and support recovery consistency
Ni et al. (2022) (Bayesian)
    setup: multivariate, Gaussian, likelihood-based w/ linear dependency on covariate
    theor. res.: no guarantee on the posterior inference of the estimates provided
Niu et al. (2024) (Bayesian)
    setup: multivariate, covariate-dependent priors & Gaussian likelihood-based estimation (within partition)
    theor. res.: convergence of the posterior distribution in an α-divergence metric
This Work
    setup: multivariate, nodewise regression with flexible dependency on covariate, parameterized by DNN
    theor. res.: finite-sample ℓ2 consistency and edge-wise recovery guarantee with thresholding

Note: for setup, univariate (or multivariate) corresponds to the dimension of the covariate z in question, whereas “Gaussian” corresponds to the
distributional assumption for x if explicitly assumed in the paper.

a generic template for handling contextual information via deep neural networks, the theoretical results
therein primarily focus on (i) quantifying the contribution of the context to predictive performance, under
the assumption that the expected predictive accuracy is bounded below by 1 − ε (Proposition 4), and (ii)
analyzing the error bound of the estimated linear coefficient in binary classification tasks (Theorem 6). Both
results are established under the assumption that the underlying true dependency on the covariate is linear.
To this end, the contribution of this work can be summarized as follows.
• We consider a nonlinear covariate-dependent graphical model, which extends existing ones and al-
lows for flexible functional dependency on the covariate. Empirically, we demonstrate that such an
estimation framework effectively learns the dependency of the graph structure on the covariate z;
meanwhile, it accommodates complex conditional dependence structures amongst the x and exhibits
good performance even when the underlying data deviates from Gaussianity.
• On the theoretical front, we establish guarantees on the recovered edges based on a deep neural
network estimator. Concretely, a mis-specified setting2 is considered wherein both a generalization
error and an approximation error are present, and their bounds depend on the size of the underlying
problem (namely p and q), the size of the neural network and the sample size.
In particular, this work illustrates how existing results—specifically, generalization error bounds based
on local Rademacher complexity analysis (Bartlett et al., 2005) and approximation error bounds of an
MLP (Kohler and Langer, 2021)—established under generic ERM settings, can be synthesized to obtain
error bounds for a specific estimation problem. Such a roadmap is of interest for similar inference tasks.
Note that the roadmap for the technical analysis presented in this paper applies to a broad class of strongly
convex, Lipschitz, and uniformly bounded loss functions, by connecting the infinity norm of the function
approximation in Kohler and Langer (2021) to the excess error bound in Bartlett et al. (2005) through a
strong convexity-based argument. Hence, it goes beyond the result established for the mean squared error
loss function studied in Kohler and Langer (2021).

2 DNN-based Covariate-Dependent Graphical Model Estimation

We elaborate on the model under consideration in (1) and provide its formal definition next. We consider a
system of p variables x := (x1 , x2 , · · · , xp )′ taking values in X ⊆ Rp , whose conditional dependence structure
2 Here mis-specified is in the ERM context, where the target hypothesis does not necessarily live in the hypothesis class in which empirical
minimization is conducted.

depends on some q-dimensional “external” covariate z ∈ Z ⊆ Rq . Let [p] := {1, · · · , p} denote the index
set of the variables. To model such conditional dependence structures, we consider a nodewise regression
formulation as the working model, given in the form of

    x_j = ⟨β_j(z), x_−j⟩ + ε_j := ∑_{k≠j} β_jk(z) x_k + ε_j;   j ∈ [p].    (2)

Here, E(ε_j) = 0 and ε_j has finite moments; β_jk(·) : Z → R is a function that takes z as input and outputs the
regression coefficient of x_k when x_j is the response. For notational simplicity, we let β_j(z) : Z → R^{p−1} denote
the vector whose coordinates correspond to β_jk(z), k = 1, · · · , p, k ≠ j, and x_−j := (x_1, · · · , x_{j−1}, x_{j+1}, · · · , x_p)′.
We note the analogy to the node-wise regression formulation for the Gaussian graphical model (Meinshausen
and Bühlmann, 2006) (see also Appendix D.1 for a brief discussion), where in the absence of the external
covariate z, the regression coefficient βjk “degenerates” to a scalar. With the dependency on z, the graph-
ical model is solely captured in the β j (z)’s, and one can retrieve the estimated graph at the sample level,
based on the corresponding z value for the sample of interest. In particular, when x is jointly Gaussian, the
node-conditional distribution is given by P(x_j | x_−j, z) ∼ N(∑_{k≠j} β_jk(z) x_k, σ_j²(z)).
We consider parameterizing each βjk (·) by some neural network, e.g., a multi-layer perceptron (MLP). Let
θ denote the parameters of the neural networks in question collectively. Given independently identically
distributed (i.i.d.) samples {(xi , z i )}ni=1 , θ can be obtained by minimizing the empirical loss; e.g., with a
mean squared error loss, it is given by
    argmin_θ (1/n) ∑_{i=1}^n ∥x^i − x̂^i∥_2²,   where x̂^i_j = ⟨β_j(z^i; θ), x^i_−j⟩;    (3)

the dependency on the neural network parameter θ is made explicit. The training pipeline is outlined in
Exhibit 1.

Exhibit 1: DNN-based Covariate-dependent Graphical Model (DNN-CGM) Learning Pipeline

Input: i.i.d. samples {(x^i, z^i)}_{i=1}^n
1. [forward pass] for fixed θ, calculate x̂^i_j for all j = 1, · · · , p and i = 1, · · · , n according to (3);
2. [loss] calculate the loss based on all samples, by evaluating the distance between x^i and x̂^i;
3. [backward pass] update θ using the gradients calculated based on the loss (back-propagation);
Output: Estimated θ̂ and {β̂_j(z^i, θ̂)}_{j=1}^p, ∀ i

Remark 1 (On the implementation of βjk (·) networks). Instead of creating separate neural networks for each
βjk (·)—which requires p(p−1) networks, with each taking z as the input and generating a scalar output—one
can alternatively create neural network(s) with appropriately sized hidden/output layers so that “backbones”
are shared across the β_jk's and fewer networks are needed. This leads to a smaller total number of parameters
and, empirically, an easier optimization problem, which often yields superior performance. As a concrete
example where a single neural network is used, it takes z as the input and outputs a p(p − 1) dimensional
vector, with each coordinate corresponding to the value of βjk (z), j, k ∈ [p]; k ̸= j; such a design corresponds
to the case where βjk (z) is operationalized as βjk (z) = vjk (u(z)). Specifically, let h be the dimension of the
last hidden layer; u(·) : Rq 7→ Rh is shared across all βjk ’s and corresponds to stacked neural network layers
up until the last hidden layer; vjk (·) : Rh 7→ R is a linear layer that maps neurons from the last hidden layer
to the output unit indexed by jk, and it differs across the jk’s. In this case, θ = (θ u , {θ v,jk , j, k ∈ [p]; k ̸= j}).
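To make the pipeline of Exhibit 1 and the shared-backbone design of Remark 1 concrete, below is a minimal PyTorch sketch. It is not the released implementation; the class and function names (DNNCGM, fitted_values, train_epoch), the layer widths, and the data loader are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DNNCGM(nn.Module):
    """Shared-backbone parameterization of beta_jk(z) (Remark 1): stacked hidden
    layers u(.) shared across all (j, k), followed by a single linear head whose
    p*(p-1) outputs play the role of the per-edge heads v_jk(u(z))."""
    def __init__(self, q, p, hidden=64):
        super().__init__()
        self.p = p
        self.backbone = nn.Sequential(               # u(.): R^q -> R^h
            nn.Linear(q, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, p * (p - 1))   # all v_jk(.): R^h -> R, stacked

    def forward(self, z):
        beta = self.head(self.backbone(z))           # shape (n, p*(p-1))
        return beta.view(-1, self.p, self.p - 1)     # beta[i, j, :] = beta_j(z^i)

def fitted_values(beta, x):
    """x_hat[i, j] = <beta_j(z^i), x^i_{-j}>, as in (3)."""
    p = x.shape[1]
    cols = []
    for j in range(p):
        idx = [k for k in range(p) if k != j]        # coordinates of x_{-j}
        cols.append((beta[:, j, :] * x[:, idx]).sum(dim=1))
    return torch.stack(cols, dim=1)

def train_epoch(model, loader, optimizer):
    """One pass over the data: forward pass, MSE loss of (3), backward pass."""
    for x, z in loader:                              # loader yields (x^i, z^i) batches
        optimizer.zero_grad()
        loss = ((x - fitted_values(model(z), x)) ** 2).mean()
        loss.backward()
        optimizer.step()
```

As a usage sketch, model = DNNCGM(q=2, p=50) with a standard optimizer such as torch.optim.Adam would be trained by repeated calls to train_epoch; the sample-level graph estimate is then read off from model(z), up to the sparsification and AND-rule symmetrization discussed in Remark 3.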
Remark 2. The formulation under consideration is not restricted to z being continuous. In the case where
any coordinate(s) of z is categorical, one can first process the categories via an embedding layer, then
concatenate them with the numerical ones and proceed.

Some considerations on the formulation are discussed next. First note that for a node-conditional formula-
tion where the conditional distribution of each node is some function of its neighborhood set (Meinshausen

and Bühlmann, 2006), the model in its most general form can be written as xj = fj (x−j , εj ), j ∈ [p]. Such
a form can be too general, and potentially incur difficulty in actually identifying the neighborhood set based
on data³. In the same spirit as in Bühlmann et al. (2014), by considering an additive form, the model is
restricted to x_j = ∑_{k≠j} f_jk(x_k) + ε_j; in particular, f_jk(x_k) = 0 for k ∉ ne(j), where ne(j) ⊆ [p] \ {j} is the
neighborhood of node j. With the presence of the covariate z that impacts the neighborhood structure, a natural
extension is to augment the input space of the f_jk's, namely,

    x_j = ∑_{k≠j} f_jk(z, x_k) + ε_j;   j ∈ [p].

In this paper, we impose additional structure on the specification of fjk and let fjk (z, xk ) ≡ βjk (z)xk . The
implication is two-fold:

1. z and xk are assumed separable. In particular, we consider a multiplicative form, namely, fjk (z, xk ) ≡
βjk (z)γjk (xk ), that behaves like a “locally linear” model on γjk (xk ), and βjk (z) can be viewed as a gen-
eralized regression coefficient. The edge weight can be summarized by βjk (z), which can be interpreted
as the strength of relevance; in addition, βjk (z) = 0 ⇒ fjk (z, xk ) = 0. This preserves the interpretabil-
ity of the model in that the neighborhood set can be solely inferred based on {βjk (z)}j=1,··· ,p;k̸=j , a
primary quantity of interest that can be directly estimated and a probabilistic guarantee can be es-
tablished accordingly. Note that for fjk (z, xk ) that allows for general interactions between z and xk ,
extracting the underlying weighted graph edge that characterizes the strength of the dependency re-
quires post-hoc operations. For example, this can be achieved by considering quantities of the form
∇xk fjk (z, xk ).
A similar separable structure is employed in Alvarez Melis and Jaakkola (2018); Marcinkevičs and Vogt
(2021), where, despite a slightly different model setup that involves only x, the function is decomposed
as θ(x)h(x), where h(x) serves as a representation of x and θ(x) its corresponding relevance; those
models are interpretable and “self-explaining”.
2. The functions γjk (xk )’s are further simplified to γjk (xk ) ≡ xk . From a regression analysis perspective,
xk and its transformations can be interpreted as features. We use “raw” features directly, partly due
to the fact for the graphical model setting under consideration, they correspond to node values and
thus are inherently meaningful. This contrasts with, for example, image pixels that lack intrinsic
intepretability and thus benefit from transformations into higher level features (Alvarez Melis and
Jaakkola, 2018). One can potentially consider a more complex γjk (·), e.g., parameterize it by some
PL
neural network, or through basis expansion (e.g., Qiao et al., 2019), namely, γjk (x) := ℓ=1 ajk,ℓ ϕℓ (x)
with the bases ϕℓ (·) fixed apriori and ajk,ℓ learnable parameters. Empirically, we observe that the
current specification is essentially as powerful as its enriched counterparts (modulo tuning), in terms
of identifying the neighborhood of each node.

Remark 3. Up to this point, we do not impose any sparsity assumptions on the underlying graph and the
estimation procedure does not involve any regularization terms. In practice, to obtain a sparsified graph for
interpretability purposes, we consider a thresholding procedure, where the small entries can be effectively
treated as zeros. The thresholding level is selected by examining the edge magnitude histogram, and chosen
in the region where a “gap” is present4 . Finally, given the undirected nature of the graph, we adopt the
“AND” principle, namely, we consider an edge as present when both βbjk and βbkj are nonzero after hard-
thresholding.
3 With a general fj where interactions amongst the xk ’s are permitted, for xj ⊥ ⊥ xk | x−k to hold, xk needs to be absent from all terms,
which is difficult to operationalize either through constraints or in a post-hoc fashion.
4 See also the setting considered in Corollary 2. Note that in the case where the underlying true graph has edges that can be sep-

arated into weak and strong ones (with exact sparsity being a special case), one would expect a gap in the histogram, aside from
estimation/approximation errors. Empirically, one can first “normalize” the graph so that the maximum entry does not exceed 1 in
magnitude; the thresholding level is then scaled accordingly. This normalization step can potentially enhance the interpretability of
the thresholding level, without fundamentally altering the sparsification outcome.
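As a concrete illustration of the post-processing described in Remark 3 (a sketch only; the function name and the default threshold are illustrative, not part of the formal procedure):

```python
import numpy as np

def sparsify(beta_hat, threshold=0.05):
    """Hard-threshold an estimated coefficient matrix beta_hat (beta_hat[j, k] ~ beta_jk(z))
    and apply the "AND" rule: keep edge (j, k) only if both beta_jk and beta_kj survive."""
    B = np.asarray(beta_hat, dtype=float).copy()
    np.fill_diagonal(B, 0.0)
    B /= max(np.abs(B).max(), 1e-12)        # normalize so the largest magnitude is 1
    keep = np.abs(B) >= threshold           # hard-thresholding on the normalized scale
    keep = keep & keep.T                    # AND rule for the undirected edge set
    return np.where(keep, B, 0.0)
```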

3 Theoretical Results

This section provides theoretical guarantees for the model posited in (2), assuming that it corresponds to the
ground-truth data generating process (DGP). Specifically, let β ∗ (·) denote the target model (i.e., the true DGP
one); given the availability of i.i.d. data {(xi , z i )}ni=1 generated according to (2), we are interested in the
error bound of β̂(·) under the Empirical Risk Minimization (ERM) framework, with β_jk(·)'s parameterized by
deep neural networks. We establish finite-sample bounds for the generalization error using Local Rademacher
complexity tools (Bartlett et al., 2005), and the approximation error bound in terms of neural network hyper-
parameters (namely, depth and number of neurons) leveraging the techniques in Kohler and Langer (2021),
under certain regularity conditions. The two combined provide insight into consistency-type results for
β̂(·), relative to the ground truth β*(·). In the ensuing technical developments, we further assume that the
random vectors x, z are bounded, namely x ∈ [−C, C]p and z ∈ [−C, C]q , respectively. Without loss of
generality, we set C = 1, which only affects the scale in the analysis.

3.1 Preliminaries and A Road Map

Definitions. Let (Y, P) denote a probability space and Ey [·] denote the expectation with respect to P over
random variable y ∈ Y. Let H denote a hypothesis class; i.e., a class of measurable functions h : Y 7→ R. The
infinity norm between two functions h, h̃ is defined as ∥h − h̃∥∞ := supy∈Y |h(y) − h̃(y)|, while for vector
functions h = (h1 , · · · , hp ), the corresponding vectorized infinity norm as ∥h − h̃∥∞ := maxj∈{1,··· ,p} ∥hj −
h̃j ∥∞ . For two functions, h, g, the notation h ≳ g means that h ≥ cg for some universal constant c, and ≲
is analogously defined. Further, g ≃ h, if cg ≤ h ≤ c̄g for some universal constants c̄ and c. Finally, for a
real-valued vector y, its weighted norm with respect to a matrix A is given by ∥y∥2A := ⟨y, Ay⟩.
For the problem at hand, let y := (x, z); Sn := {(xi , z i )}ni=1 ⊂ (X × Z)n denote a random sample of n iid
data points. We consider a loss function ℓ(a; b) : R × R → R and H the hypothesis class of risk functions of
the form h(y) := (ℓ ∘ β)(y) := (1/p) ∑_{j=1}^p ℓ(∑_{k≠j} β_jk(z) x_k, x_j), i.e., the risk is a composite function of the loss
and regression coefficients. Then, the population and empirical versions of the risk are respectively defined
as follows, with [n] := {1, · · · , n}:

    R(β) := E_{x,z} [ (1/p) ∑_{j=1}^p ℓ(⟨β_j(z), x_−j⟩, x_j) ];    (4)

    R_n(β) := (1/n) ∑_{i∈[n]} (1/p) ∑_{j=1}^p ℓ(⟨β_j(z^i), x^i_−j⟩, x^i_j).

To perform ERM, we assume that the regression coefficient functions β_jk(·) : Z → R belong to some hypothesis
class F⁵, and denote F^{p×(p−1)} as the joint of p × (p − 1) hypotheses, with each individual hypothesis
β_jk(·) ∈ F, j, k ∈ [p], k ≠ j. We further define a collection of optimal hypotheses β_jk^opt's as

    β_jk^opt := argmin_{β_jk ∈ F} ∥β_jk − β*_jk∥_∞,   ∀ j, k ∈ [p]; k ≠ j,

and the approximation error can be defined as E_approx(F) := max_{j∈{1,··· ,p}} ∥β_j^opt − β*_j∥_∞. Finally, the ERM
estimator of the regression coefficient functions is given by

    β̂ := argmin_{β ∈ F^{p×(p−1)}} R_n(β).    (5)

Assumptions. Before presenting the results, we posit the assumptions used in the theoretical analysis.
Together with the boundedness assumption on (x, z) mentioned above, these assumptions lead to an O(1/n)
fast convergence rate for the risk excess error, and hence a smaller sample size requirement. Further, the
analysis enables us to balance the trade-off between approximation error and generalization error based on
the size of the neural network, the sample size and the dimension of (x, z).
5 F is typically selected by practitioners, such as neural networks with a fixed number of layers and neurons.

Assumption 1 (On the loss function). The loss function ℓ(a; b) : R × R → [0, M]; w.l.o.g. we set M = 1⁶. We
assume ℓ(a; b) is:

• L-Lipschitz in a: |ℓ(a1 ; b) − ℓ(a2 ; b)| ≤ L|a1 − a2 |, where L > 0 denotes the Lipschitz constant.
• α-strongly convex in a: ℓ(a_1; b) − ℓ(a_2; b) ≥ ℓ′(a; b)|_{a=a_2} (a_1 − a_2) + (α/2)(a_1 − a_2)², where α > 0 and ℓ′(·)
denotes the derivative of the function ℓ.
Remark 4. Assumption 1 is widely used in the statistical learning literature, especially when aiming to
establish fast rates for the generalization error bound (see, e.g., Klochkov and Zhivotovskiy, 2021).

Assumption 2 (On the optimality of β*_jk). The optimal regression coefficient functions β*_j, j ∈ [p], satisfy the
following condition:

    E_{x,z} [ ∑_{j=1}^p ℓ′(⟨β_j(z), x_−j⟩, x_j) |_{β_j = β*_j} ] = 0.


Remark 5. Note that the combination of Assumptions 1 and 2 implies that β*_jk, j, k ∈ [p], k ≠ j, minimizes
the population risk in (4) due to the strong convexity of the loss function ℓ. We also demonstrate that the
mean square error loss function satisfies Assumptions 1 and 2 in Appendix A.2.
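As a minimal sketch of that verification (deferred to Appendix A.2), assume the prediction a = ⟨β_j(z), x_−j⟩ and the response b = x_j both lie in [−B, B] for some constant B (this follows from the boundedness of (x, z) and of the hypothesis class; B is an assumed bound here). Then for the squared error loss:

```latex
\begin{align*}
  \ell(a;b) &= (a-b)^2, \qquad |a|,\;|b| \le B,\\
  |\ell(a_1;b)-\ell(a_2;b)| &= |a_1+a_2-2b|\,|a_1-a_2| \;\le\; 4B\,|a_1-a_2|
      && \text{(Lipschitz, $L = 4B$)},\\
  \ell(a_1;b)-\ell(a_2;b) &= 2(a_2-b)(a_1-a_2) + (a_1-a_2)^2
      && \text{(strongly convex, $\alpha = 2$)},\\
  \mathbb{E}_{x,z}\Big[\textstyle\sum_{j=1}^{p}\ell'\big(\langle\beta_j^{*}(z),x_{-j}\rangle,\,x_j\big)\Big]
      &= -2\,\mathbb{E}\Big[\textstyle\sum_{j=1}^{p}\varepsilon_j\Big] \;=\; 0
      && \text{(Assumption 2, since $\mathbb{E}(\varepsilon_j)=0$ under (2))}.
\end{align*}
% Rescaling \ell by 1/(4B^2) maps its range into [0,1], as required by Assumption 1.
```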
Assumption 3 (Bounded pseudo-dimension of hypothesis class F). The hypothesis class F of the regression
coefficient functions β j , j ∈ [p] has finite pseudo-dimension (Pollard, 1990); i.e., dP (F) < ∞ (see formal
definition of pseudo-dimension in Definition 3 in Appendix A.1).

This assumption is widely adopted in the statistical learning literature and encompasses a broad range of
functional classes, including linear and polynomial functions (Anthony and Bartlett, 1999), as well as various
families of neural networks (Bartlett and Maass, 2003; Bartlett et al., 2019; Khavari and Rabusseau, 2021).
The primary result established for the model postulated in (1) is the statistical consistency of the estimate
β̂ (Corollary 1). This is achieved through two key steps: (i) derive a bound on the excess error, defined as
R(β̂) − R(β*), as a function of β̂; see Theorem 1; and (ii) leveraging the strong convexity of the loss function
(mean squared error loss) and the result in Theorem 1, derive the error bound for β̂ under some projection
norm. The bound in Theorem 1 has two components that respectively capture the generalization error and
the approximation error, with the latter arising from the mis-specified setting assumption that allows the
true β ∗ to live outside of the working hypothesis class F. To derive a bound for the generalization error,
we first show that the hypothesis class—that encompasses composite functions quantifying the “delta in
risk” between β and the true β ∗ —satisfies Bernstein’s Condition (Lemma 1). Next, we select an appropriate
subroot function τ (·) tailored to the problem, enabling us to apply the proof strategy and results from Bartlett
et al. (2005) (specifically, Theorem 3.3, Corollary 2.2, and Corollary 3.7). A critical component of the proof
involves bounding the covering number of the function class βjk (z) under consideration.
With the working hypothesis class encompassing a class of MLPs, Theorem 2 characterizes the approxima-
tion error for such architectures, employing tools from Kohler and Langer (2021). Note that the pseudo-
dimension—which appears in the generalization error bound—can also be associated with the size of the
neural network. Consequently, the two error components for the MLP architecture under consideration can
be “balanced”, in that one can minimize the excess error bound by choosing the quantity that links the two
appropriately.

3.2 Main Results

The first result leverages Local Rademacher Complexity and empirical process tools to bound the generalization error:

    R(β̂) − R(β*)  ≲  R_n(β̂) − R_n(β*)  +  O(d_P(F)/n),

where R(β̂) − R(β*) is the excess error, R_n(β̂) − R_n(β*) the empirical excess error, and the O(d_P(F)/n) term the
generalization error; the O(·) notation ignores all problem-dependent constants and only depends on n and d_P(F). The
empirical excess error can be further bounded by an approximation error term controlled by the capacity of
the hypothesis class (see equation (15) in the proof of Theorem 1 in Appendix A.3).

6 For any loss function ℓ(a; b) : R × R → [0, M], it suffices to consider its rescaled version (1/M) · ℓ(a; b) to satisfy such an assumption.
Theorem 1. Under Assumptions 1-3, the following bound for the excess error as a function of the empirical
risk minimizer β̂ (see (5)) based on S_n samples holds with probability at least 1 − δ, for any δ > 0:

    R(β̂) − R(β*) ≲ L · E_approx(F) + (L² d_P(F) / (αn)) p² log(nLp) log(1/δ),    (6)

where the first term corresponds to the approximation error and the second to the generalization error.

Remark 6. The p² term in (6) shows up due to estimating p × (p − 1) functions using neural networks, which
naturally increases the entropy number of F with respect to S_n in Dudley's entropy integral (Dudley, 2016)
by a factor of p². This indicates that the size of the graphical model can grow at a p = o(√n) rate for
the excess error to vanish. A larger p can be accommodated, either by incorporating sparsity assumptions
(Meinshausen and Bühlmann, 2006; Wainwright, 2009) or through refined analysis as in multi-task learning
(Lounici et al., 2009), which however is outside the scope of the current paper.

The following corollary establishes the consistency of the empirical risk minimizer, leveraging the strong
convexity of the loss function.
Corollary 1 (Consistency of β̂). Let A_j(z) := E_x[x_−j x_−j^⊤ | z]. Under the Assumptions of Theorem 1, the
following inequality holds with probability at least 1 − δ for any δ > 0:

    (1/p) ∑_{j=1}^p E_z[ ∥β̂_j(z) − β*_j(z)∥²_{A_j(z)} ]  ≤  (L² d_P(F) / (α² n)) p² log(nLp) log(1/δ) + L · E_approx(F) / α.    (7)

Corollary 1 quantifies the quality of the recovery of the regression coefficient functions β j that capture the
strength of the conditional dependence relationships between the variables.

Note that to establish Theorem 1, the assumption of realizability—i.e., β*_jk(·) ∈ F—is not imposed. Con-
sequently, the empirical risk of the ERM estimator is not guaranteed to be smaller than that of β*; i.e.,
R_n(β̂) ≤ R_n(β*) need not hold. In particular, this can happen when the hypothesis class does not have
adequate capacity, and hence the approximation error term is nonzero and appears in (6). However, The-
orem 1 has not characterized the approximation error in terms of the sample size and dimension p of the
graphical model. This is established in Theorem 2, which also aims to balance the approximation error and
generalization error terms.
To that end, in addition to Assumptions 1-3 used to derive the excess risk bound, we need the following
additional assumption to apply tools from Kohler and Langer (2021) to quantify the approximation error
based on the number of layers and the total number of neurons used by the feed-forward neural networks
to approximate the regression coefficients βjk .
Assumption 4 ((m, C)-smoothness of β*_jk). We assume β*_jk is (m, C)-smooth for all j, k ∈ [p], k ≠ j (see the
formal Definition 4 in Appendix A.1).
Theorem 2. Let F be a family of fully connected neural networks with ReLU activation functions, having
number of layers H ≃ ξ^{−q/2m} and number of neurons r ≃ (2e)^q ((m+q) choose q) q². Under the Assumptions of
Theorem 1, together with Assumption 4 with m ≲ q, by setting

    ξ = ( L m⁴ d⁶ p² log²(nLp) log(p) log(1/δ) log(1/α) / (αn) )^{m/(m+q)},

the following holds with probability at least 1 − δ:

    R(β̂) − R(β*) ≲ Lξ,   and   (1/p) ∑_{j=1}^p E_z[ ∥β̂_j(z) − β*_j(z)∥²_{A_j(z)} ] ≲ Lξ/α.    (8)

Remark 7. The rates of the bounds in (8) are of the order n^{−m/(m+q)}, whereas the rate in Kohler and Langer
(2021) based on a quadratic loss is n^{−2m/(2m+q)}. This discrepancy is due to the fact that when analyzing
general L-Lipschitz functions, the excess error bound depends in a linear fashion on the approximation error
of β(z). For the mean squared error loss function, it depends in a quadratic fashion and hence the same rate
as in Kohler and Langer (2021) can be obtained, as shown in Appendix A.6.

Extensions to edge-wise recovery. Recall that in Corollary 1, an ℓ2 -type consistency result is estab-
lished for β b based on a projection norm. If we consider a setting where edges can be segmented into
strong and weak ones, then the result in Corollary 1 can be translated to an “edge-wise” type guaran-
tee. Formally, we define the set of strong edges as E*(z) := {(j, k) : |β*_jk(z)| ≥ β̲} and of weak ones as
E*ᶜ(z) := {(j, k) : |β*_jk(z)| ≤ β̄}, where β̲ and β̄ correspond to their respective minimum/maximum mag-
nitude. In addition, we assume that there exists a uniform lower bound ϕ > 0 on the margin between the
maximum and minimum magnitudes; namely, β̲ − β̄ ≥ ϕ > 0. The edge-wise result in Corollary 2 focuses on
establishing a guarantee that 1{|β̂_jk(z)| ≥ τ} matches 1{|β*_jk(z)| ≥ β̲} in expectation, for some threshold
τ := η β̄ + (1 − η)β̲, η > 0; i.e., τ is a convex combination of β̄ and β̲.
Assumption 5. A_j(z) := E[x_−j x_−j^⊤ | z] ≻ γI holds uniformly for all z ∈ Z with γ > 0.

This assumption ensures that the minimum eigenvalue of Aj (z) is bounded away from zero, uniformly for
all z. This assumption allows us to first establish a result analogous to (7), with the distance β̂_j(z) − β*_j(z)
measured in the Euclidean norm ∥ · ∥ instead of the projection norm induced by A_j(z); subsequently, the
Euclidean norm result is further translated to that on individual edges.
Corollary 2. Let τ := η β̄ + (1 − η)β̲ with η > 0. Under the Assumptions of Corollary 1 and Assumption 5,
the following inequality holds with probability at least 1 − δ:

    ∑_{j=1}^p ∑_{k≠j} E_z[ 1{|β̂_jk(z)| ≥ τ} ≠ 1{|β*_jk(z)| ≥ β̲} ]
        ≲ (L² d_P(F) / (α² min{η², (1 − η)²} γ (β̲ − β̄)² n)) p² log(nLp) log(1/δ)
          + L · E_approx(F) / (α γ min{η², (1 − η)²} (β̲ − β̄)²).    (9)
Remark 8. Some comments on the result in Corollary 2 are provided next.

• Similarly to the result in Corollary 1, the bound in (9) consists of two terms that correspond, respec-
tively, to the generalization error and the approximation error bounds. With a constant margin β − β̄,
the generalization error bound still vanishes, provided that the network size p and the sample size n
grow at certain rates. However, the approximation error bound does not vanish; in particular, when
the margin is small, the corresponding term can be large, resulting in a fairly loose bound. This aligns
with intuition that when the strong and weak edges are hard to “distinguish”, the edge-wise recovery
via thresholding becomes difficult.
• The established bound is in expectation and can be interpreted as follows: on average, the strong
edges can be identified via thresholding, provided that the threshold is selected within a certain range.
This differs from the graph recovery results established for sparse high-dimensional Gaussian graphical
models based on an ℓ1 penalty (e.g., Ravikumar et al., 2009). This is largely because we operate under
a PAC learning framework and consider a mis-specified setting, without assuming knowledge of the
functional class of the underlying true data generating process; both the setting and the tools adopted
therefore differ.
• An important implication of this result is that, in practice, when there is clear separation in magnitude
between strong and weak edges, one can effectively obtain a sparse graph by thresholding. This
approach practically treats the weak edges as zero, thereby enhancing interpretability.
Remark 9 (Refinement of Corollary 2). By incorporating the analysis on the functional approximation error,
the above corollary can be refined as follows, under the same setting as considered in Theorem 2:
    ∑_{j=1}^p ∑_{k≠j} E_z[ 1{|β̂_jk(z)| ≥ τ} ≠ 1{|β*_jk(z)| ≥ β̲} ] ≲ pLξ / (α γ min{η², (1 − η)²} (β̲ − β̄)²).

4 Synthetic Data Experiments

The performance of the proposed neural network-based method is assessed through a series of experiments
on synthetic data sets. Both Gaussian and non-Gaussian settings are considered, with samples (indexed by
i) generated according to one of the following three mechanisms:
• Gaussian: xi ∼ N (0, (Θi )−1 ).
• Non-paranormal (NPN, Liu et al. (2009)): xi = f −1 (x̌i ); x̌i ∼ N (0, (Θi )−1 ) with f being a monotone
and differentiable function, and its inverse f −1 is applied to x̌i in a coordinate-wise fashion.
• Directed Acyclic Graph (DAG): for each coordinate j, x^i_j = ∑_{k∈pa(j)} f^i_jk(x^i_k) + ϵ_j, where pa(j) is the
parent set of node j. In other words, xi is generated according to a structural equation model (SEM).
Note that the SEM serves solely as a data generation mechanism to introduce potentially nonlinear
dependencies (e.g., by parameterizing fjk ’s through basis functions) amongst the nodes in a flexible
way. However, the primary quantity of interest is still the conditional independence structure captured
in the corresponding undirected moralized graph, as described next.
For the Gaussian and the non-paranormal cases, Θi is determined by z i . For the DAG case, we start with a
binary Ai determined by z i that corresponds to the skeleton of the DAG (denoted by GiDAG ); the parent set
of node j satisfies pa(j) ≡ {k : Aijk ̸= 0}. The undirected conditional independence graph of interest that is
compatible with the data can then be obtained by moralizing the DAG (Cowell et al., 1999, Chapter 3.2.1),
namely, Gi = M(GiDAG ), whose graph structure is encoded in Θi . It can be viewed as the counterpart to those
in the Gaussian/non-paranormal case in terms of capturing conditional independence relationships7 . For all
three cases, Θi is the parameter of interest, and we are interested in how well its skeleton can be recovered by
the proposed method. The results are benchmarked against various competitors, including RegGMM (Zhang
and Li, 2023), glasso (Friedman et al., 2008) and nodewise Lasso (Meinshausen and Bühlmann, 2006).
RegGMM accounts for covariate dependence while assuming linearity, whereas the other two are graphical
model estimation methods that solely consume x as the input.

Settings. The description of the various simulation settings is outlined in Table 2. For settings where data
are generated according to the Gaussian or NPN models, we first generate “candidate” precision matrices
denoted by Ψl ’s, where Ψl either has a single band l steps from the main diagonal (for settings G1/N1) or
is block diagonal with nonzero entries on the l-th block (for settings G2/N2). Next, each Θi is a convex
combination — which automatically ensures the positive definiteness of Θi — of the candidate precision
matrices, and the mixing depends on the value of the covariate z i . For settings D1 and D2 where data are
generated according to a DAG, we first generate binary matrices B1 and B2 ; both B1 and B2 correspond to
the skeleton of some trees, with nodes having 1 to 3 children (randomly determined as the tree grows). The
covariate z^i dictates the skeleton of the DAG, i.e., A^i, and also governs f^i_jk either directly through the coefficients
on the xk ’s (linear case), or through a multiplier that impacts the coefficients of the basis functions (non-
linear case). For all settings, the samples effectively fall under different “clusters” according to their z i ’s.
Samples within the same cluster have identical skeletons, albeit the magnitude of the entries may differ
depending on the exact value of the z i ’s.
The degree of linearity in z varies by setting. For settings G1 and N1, the dependency on z ∈ R2 is linear
in its 1st coordinate which governs the mixing percentage, and the 2nd coordinate effectively dictates the
cluster membership. This is similar for settings D1 and D2, which however differ in that the dependency
on the 1st coordinate is quadratic. For settings G2 and N2, the dimension of z^i is set at 10; to induce
non-linearity, we use a radial basis function (RBF) network φ that first transforms z into a scalar, followed
by a sigmoid transformation, namely z̃^i := sigmoid(φ(z^i)), where φ(z) := ∑_{ℓ=1}^L α_ℓ exp(−β_ℓ ∥z − c_ℓ∥²); the
sigmoid transformation ensures that z̃ i ∈ (0, 1), which then dictates the mixing percentage. Finally, note that
amongst these settings, the dependency on the x_k's is linear in settings G1, G2 and D1. Additional details
and pictorial illustrations are deferred to Appendix B.
7 In the special case where x is generated according to a linear SEM, namely x = Ax + ϵ, the precision matrix Θ := (cov(x))⁻¹ satisfies
Θ ≡ (I − A^⊤) Ω⁻¹ (I − A^⊤)^⊤, where Ω = cov(ϵ); see, e.g., Loh and Bühlmann (2014). In the absence of linearity, the exact magnitude of the entries in Θ
becomes difficult to infer; however, one can still infer its skeleton from that of the DAG by the definition of moralization.

Table 2: Overview of simulation settings.

G1 (Gaussian): p = 50, q = 2; z^i_1, z^i_2 ∼ Unif(0, 1);
    Θ^i = z^i_1 Ψ_1 + (1 − z^i_1) Ψ_2 if z^i_2 ∈ (0, 1/3);  z^i_1 Ψ_2 + (1 − z^i_1) Ψ_3 if z^i_2 ∈ (1/3, 2/3);  z^i_1 Ψ_1 + (1 − z^i_1) Ψ_3 if z^i_2 ∈ (2/3, 1).

G2 (Gaussian): p = 90, q = 10; z^i ∼ standard multivariate Gaussian; z̃^i := sigmoid(φ(z^i)); φ(·) is an RBF network,
    φ(z) := ∑_{ℓ=1}^L α_ℓ exp(−β_ℓ ∥z − c_ℓ∥²), L = 10, α_ℓ ∼ Unif(−10, 10), β_ℓ ∼ Unif(0.1, 0.5), c_ℓ ∼ [Unif(−1, 1)]^L;
    Θ^i = z̃^i Ψ_1 + (1 − z̃^i) Ψ_3 if z̃^i > 0.9 or z̃^i < 0.1;  z̃^i Ψ_1 + 0.5 Ψ_2 + (0.5 − z̃^i) Ψ_3 otherwise.

N1 (NPN): x̌^i's are generated identically to those in G1; x^i = g(x̌^i) where g(x) = x + sin(x).

N2 (NPN): x̌^i's are generated identically to those in G2; x^i = g(x̌^i) where g(x) = x² sign(x).

D1 (DAG): p = 50, q = 2; z^i_1, z^i_2 ∼ Unif(−1, 1);
    Ã^i = B_1 if z^i_1 ∈ (0, 1/2);  B_2 if z^i_1 ∈ (−1/2, 0);  (z^i_2)² B_1 + (1 − (z^i_2)²) B_2 otherwise;
    f^i_jk(x) := Ã^i_jk x_k;  A^i := 1(Ã^i ≠ 0).

D2 (DAG): p, q, z^i's and Ã^i's are set identically to those in D1; f^i_jk(x) := α^i_jk,1 ψ_1(x) + α^i_jk,2 ψ_2(x) + α^i_jk,3 ψ_3(x);
    ψ_m(·)'s are Gauss-Hermite functions (Olver et al., 2010); α^i_jk,m = (Ã^i_k,j · c_jk,m), c_jk,m ∼ Unif(0.1, 0.5).
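For illustration, below is a sketch of the G1 generating mechanism (additional details are in Appendix B); the band value 0.4, the seed, and the helper name banded_precision are illustrative assumptions rather than the exact construction used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 10_000

def banded_precision(p, offset, value=0.4):
    """Candidate precision matrix Psi_l: identity plus a single band `offset` steps
    from the main diagonal; diagonal dominance keeps it positive definite."""
    Psi = np.eye(p)
    idx = np.arange(p - offset)
    Psi[idx, idx + offset] = Psi[idx + offset, idx] = value
    return Psi

Psi = [banded_precision(p, l) for l in (1, 2, 3)]          # Psi_1, Psi_2, Psi_3
mix = {0: (0, 1), 1: (1, 2), 2: (0, 2)}                    # which pair is mixed per z_2 band

Z = rng.uniform(0, 1, size=(n, 2))
X = np.empty((n, p))
for i in range(n):
    a, b = mix[min(int(3 * Z[i, 1]), 2)]                   # cluster determined by z_2
    Theta = Z[i, 0] * Psi[a] + (1 - Z[i, 0]) * Psi[b]      # convex combination -> PD
    X[i] = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta))
```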

For all settings, we train the model with β(·) parameterized with an MLP8 on 10,000 samples, and evaluate
it on a test set of size 1000; the estimated graphs (off-diagonals only) for these test samples (indexed by
i) are extracted as {−β̂_jk(z^i)}_{j,k∈[p]; j≠k}. For benchmarking methods that are not covariate-dependent, i.e.,
glasso and nodewise Lasso, their sample-level estimates are identical, namely Θ̂^i ≡ Θ̂ for all i. Instead of
directly running the method on the full set of training samples, we further partition them based on their
cluster membership, and conduct separate estimations using only within-cluster samples.9 Note that such
a partition would not be feasible in real-world settings as the cluster membership would not be known a priori.
All experiments are repeated over 5 data replicates.

Performance evaluation. We use AUROC and AUPRC as metrics, which adequately capture how well
the methods estimate entries with strong signals versus weak ones10 , and there is no need to apply hard-
thresholding to estimates to obtain these two metrics. For a single experiment based on (any) one data
replicate, the metrics are initially obtained at the sample level by comparing Θ̂^i against Θ^i, for all samples
under consideration11 ; to obtain the metrics of interest corresponding to a single experiment, we average
those over the samples. For glasso/nodewise Lasso, experiments are conducted over a sequence of penalty
parameters, and the highest metric values are reported. Table 3 reports these metrics, after averaging across
the data replicates (experiments), with the corresponding standard deviation reported in parentheses.
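A sketch of this evaluation step (using scikit-learn's average precision as the AUPRC surrogate; the function and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def sample_level_metrics(theta_true, beta_hat, tol=1e-8):
    """AUROC/AUPRC for one sample: score the binary off-diagonal skeleton of the true
    Theta^i with |beta_hat_jk(z^i)|, so no hard-thresholding of the estimate is needed."""
    p = theta_true.shape[0]
    off = ~np.eye(p, dtype=bool)
    labels = (np.abs(theta_true[off]) > tol).astype(int)
    scores = np.abs(beta_hat[off])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)

def replicate_metrics(theta_list, beta_list):
    """Average the sample-level metrics over all evaluated samples of one replicate."""
    vals = np.array([sample_level_metrics(T, B) for T, B in zip(theta_list, beta_list)])
    return vals.mean(axis=0)    # (mean AUROC, mean AUPRC)
```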

Table 3: Evaluation for the proposed and benchmarking methods, averaged over 5 data replicates
DNN-based CGM RegGMM glasso - est. by cluster nodewise Lasso - est. by cluster
AUROC AUPRC AUROC AUPRC AUROC AUPRC AUROC AUPRC
G1 0.99 (0.000) 0.98 (0.001) 0.96 (0.002) 0.79 (0.004) 0.97 (0.000) 0.59 (0.000) 0.99 (0.000) 0.98 (0.002)
G2 0.99 (0.003) 0.92 (0.024) 0.86 (0.017) 0.47 (0.020) 0.97 (0.000) 0.45 (0.031) 0.96 (0.007) 0.55 (0.141)
N1 0.99 (0.000) 0.98 (0.001) 0.99 (0.001) 0.89 (0.001) 0.98 (0.000) 0.59 (0.001) 0.99 (0.001) 0.98 (0.003)
N2 0.99 (0.004) 0.90 (0.028) 0.82 (0.015) 0.37 (0.023) 0.96 (0.000) 0.42 (0.034) 0.93 (0.016) 0.37 (0.164)
D1 0.95 (0.004) 0.76 (0.026) 0.94 (0.005) 0.74 (0.026) 0.92 (0.005) 0.46 (0.008) 0.94 (0.005) 0.85 (0.032)
D2 0.92 (0.008) 0.66 (0.022) 0.87 (0.013) 0.60 (0.017) 0.89 (0.009) 0.40 (0.017) 0.87 (0.015) 0.50 (0.113)

8 We use stacked linear layers with ReLU activations, and do not leverage any RBF-style layers or sigmoid ones which are used in the
data generating process.
9 Recall that the true graphs corresponding to samples within the same cluster have identical skeletons, although the magnitude

may still vary. glasso/nodewise Lasso are still “mis-specified”, however to a lesser extent when compared against the case where this
“cluster membership” information is ignored.
10 The estimated β̂ does not correspond to edge probabilities, but rather, it corresponds to edge weights. AUROC/AUPRC can effectively
summarize how well the graph skeleton is being captured after the entries are thresholded over a range of thresholding levels.
11 For methods that can produce sample-specific estimates, the evaluation is done on the test set; for those that can only produce a

single estimate based on all samples, the evaluation is done on the graph estimated based on the training samples directly.

The main observations are: (1) the proposed method exhibits superior performance than existing bench-
marks, and particularly so in AUPRC. As a matter of fact, all methods exhibit reasonable performance in
AUROC, even for glasso/nodewise Lasso that cannot perform sample-specific estimation. Notably, when the
dependency on z is linear within each cluster (settings G1 and N1), nodewise Lasso performs almost per-
fectly when separate estimation is conducted within each cluster and thus on samples whose underlying true
graphs have the same skeleton yet different magnitudes, even though the model assumes that all samples
have identical graphs. (2) In the case where the true data generating process is mildly non-linear in xk ’s—
in particular, the non-linearity is induced via monotonic transformations (i.e., N1 and N2), the proposed
formulation in (2) as a working model—by only considering linear dependency on the xk ’s—is still able to
recover well the skeleton. (3) In the case where the true DGP becomes highly non-linear (e.g., D2), the per-
formance of the method deteriorates. This is expected in that the moral graph can be inherently difficult to
recover, partly due to the existence of edges that are induced by “married” nodes and have small magnitude;
the non-linearity in xk ’s can further add to the complexity of edge recovery. The former is corroborated by
D1, where although the dependency on xk ’s is linear, the skeleton recovery is still inferior. Additional tables
and remarks are deferred to Appendix B for further benchmarking and illustration, including the skeleton
recovery (0/1) performance of DNN-CGM at different thresholding levels (Table 4), the performance for the
DAG setting when estimates are evaluated against the “pseudo” moralized graph without the edges from the
married nodes (Table 5), and the performance of glasso/nodewise Lasso without sample partition (Table 6).

5 Real Data Experiments

To demonstrate how the proposed method performs in real world settings, we consider two applications
from the domains of neuroscience and finance (deferred to Appendix C), and assess the network structures
recovered.
We consider a dataset from the Human Connectome Project analyzed in Lee et al. (2023), comprising
resting-state fMRI scans for 549 subjects. The scan of any subject is in the form of a spatial-temporal
matrix, with the rows corresponding to the “snapshot” value of 268 brain regions and the columns being
the temporal observations of the region, totaling 1200 time points. In addition, each subject is associated
with a Penn Progressive Matrices score, a surrogate of fluid intelligence. The quantity of interest is the
network of brain regions (p = 268) as a function of the intelligence score (q = 1). The dataset has been
pre-processed with global signal regression, where shared variance between the global signal and the time
course of each individual voxel is removed through linear regression (Murphy and Fox, 2017; Greene et al.,
2018). In our experiment, we flatten the time dimension of the fMRI scans and ignore the temporal nature
of these scans; therefore, the dataset contains 1200 × 549 ≈ 660k effective samples. However, note that since
the score is at the subject level, the estimation procedure yields a single unique network for each subject12 .

Results. We group the subjects based on the quantiles that their respective scores fall into, and obtain
the averaged graph over subjects that are within the same quantile band (labeled as q1, q2, q3 and q4).
Further, for visualization purposes, we partition the nodes into 6 sub-networks, corresponding to default
mode, frontoparietal, medial frontal, motor, subcortical cerebellar and visual, as in Lee et al.
(2023). Figure 1 shows the comparison between the medial frontal sub-networks of subjects in q1 and
q4, respectively, after the estimated graphs are sparsified (see Remark 3) and thresholded at 0.05 (on a
normalized scale). Note that each node is associated with a label13 ; in the plot, nodes with the same label
are grouped together.
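A minimal sketch of the quantile-band averaging and thresholding described above is shown below; the input names are assumptions (per-subject estimated graphs and scores), and the sparsification of Remark 3 is not included.

```python
import numpy as np

def quantile_band_graphs(beta_hat, scores, n_bands=4, threshold=0.05):
    """Average per-subject estimated graphs within score-quantile bands, then threshold.

    beta_hat: array of shape (n_subjects, p, p) with subject-specific estimates.
    scores:   array of shape (n_subjects,) with the covariate (e.g., intelligence score).
    """
    edges = np.quantile(scores, np.linspace(0, 1, n_bands + 1))
    bands = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bands - 1)
    averaged = {}
    for b in range(n_bands):
        g = beta_hat[bands == b].mean(axis=0)
        g = g / np.abs(g).max()          # put entries on a normalized scale
        g[np.abs(g) < threshold] = 0.0   # threshold weak entries
        averaged[f"q{b + 1}"] = g
    return averaged
```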
The difference in network density is rather pronounced; in particular, subjects with higher scores ex-
hibit a significantly more connected network. Similar patterns are also observed in other sub-networks:
the differential between the q1 and q4 networks of frontoparietal, motor and visual is of com-
parable scale to that of medial frontal, while that of default mode and subcortical-cerebellar is less
12 This is due to the fact that the model is specified as a function of the score, and thus effective samples from the same subject will
always have identical sample-specific estimated graphs.
13 Amongst all 268 nodes, there are 20 distinct labels.

(a) Network for subjects whose scores are in the 1st quantile band. (b) Network for subjects whose scores are in the 4th quantile band.

Figure 1: Estimated networks, as represented by −β̂(z), for subjects having low (left) and high (right) scores, after
thresholding at 0.05. Red cells indicate positive partial correlations and blue cells indicate negative ones.

pronounced. This is largely concordant with observations from existing literature (Song et al., 2008; Ohtani
et al., 2014) and is also corroborated in a validation dataset encompassing 828 subjects; the estimated
graphs therein exhibit the same pattern.

6 Discussion

A nonlinear covariate-dependent graphical model based on a node-conditional formulation is investigated.


The functional dependency on the covariate z is parameterized by neural networks, and hence the model
can capture this dependency in a flexible way. Theoretical guarantees are provided under the PAC learning
framework, wherein both the generalization error and the approximation error are taken into account.
At the methodological level, alternative approaches, similar in spirit to the formulation in Baptista et al. (2024b),
can potentially be adopted by considering a modified score matrix Ω that takes into account the external
covariate, namely $\Omega_{jk} := \mathbb{E}_{p(x|z)}[\partial_j \partial_k \log p(x|z)]^2$. Consequently, estimation can proceed with the use of
a lower triangular transport map to estimate the conditional density p(x|z) (Baptista et al., 2024a), or
score matching to approximate the conditional score $\nabla_x \log p(x|z)$ (Dasgupta et al., 2023). However, it is
worth noting that the consistency result established in Baptista et al. (2024b) holds under an asymptotic regime,
where the number of nodes p is assumed fixed and the sample size n grows. Their analysis relies on a Taylor
expansion, followed by steps that invoke the delta method and the continuous mapping theorem, under
some additional assumptions14. In contrast, the theoretical result in this paper establishes a finite-sample
error bound for the estimator under a regime where p can grow slowly with n. Hence, the analysis requires
a different set of technical tools, even in the absence of the covariate z. By adopting a node-conditional
formulation, the learning task reduces to an ERM problem, enabling us to leverage existing results from the
supervised learning literature.
Finally, the focus of the paper is on graphical models for continuous variables x, but with appropriate modifications
the proposed framework can be extended to discrete random variables, e.g., a covariate-dependent
Ising and/or Potts model (Wainwright et al., 2008), with a cross-entropy loss function. Theoretical results
can potentially be established using a similar set of arguments, leveraging existing ERM results for the case of
classification.

14 Two key assumptions include: (i) the parameterization of the transport map (based on basis expansions) is sufficiently rich to cover
the target density, and (ii) the estimated Ω before thresholding is an exact expectation, namely, $\hat\Omega_{jk} := \mathbb{E}[\partial_j \partial_k \log \hat p(x)]^2$, rather
than a sample-level estimate.

References
M. Al-Shedivat, A. Dubey, and E. Xing. Contextual explanation networks. Journal of Machine Learning
Research, 21(194):1–44, 2020. URL http://jmlr.org/papers/v21/18-856.html.
D. Alvarez Melis and T. Jaakkola. Towards robust interpretability with self-explaining neural networks.
Advances in Neural Information Processing Systems, 31, 2018.
M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations, volume 9. Cambridge
University Press, 1999.
O. Banerjee, L. E. Ghaoui, A. d’Aspremont, and G. Natsoulis. Convex optimization techniques for fitting
sparse Gaussian graphical models. In Proceedings of the 23rd International Conference on Machine Learning
(ICML), pages 89–96, 2006.
R. Baptista, Y. Marzouk, and O. Zahm. On the representation and learning of monotone triangular transport
maps. Foundations of Computational Mathematics, 24(6):2063–2108, 2024a.
R. Baptista, R. Morrison, O. Zahm, and Y. Marzouk. Learning non-Gaussian graphical models via Hessian
scores and triangular transport. Journal of Machine Learning Research, 25(85):1–46, 2024b. URL
http://jmlr.org/papers/v25/21-0022.html.
P. L. Bartlett and W. Maass. Vapnik-Chervonenkis dimension of neural nets. The handbook of brain theory
and neural networks, pages 1188–1192, 2003.
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results.
Journal of Machine Learning Research, 3(Nov):463–482, 2002.
P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33
(4):1497–1537, 2005.
P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds
for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.
P. Berti, E. Dreassi, and P. Rigo. Compatibility results for conditional distributions. Journal of Multivariate
Analysis, 125:190–203, 2014.
J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical
Society: Series B (Methodological), 36(2):192–225, 1974.
M. Billio, M. Getmansky, A. W. Lo, and L. Pelizzon. Econometric measures of connectedness and systemic
risk in the finance and insurance sectors. Journal of Financial Economics, 104(3):535–559, 2012.
P. Bühlmann, J. Peters, and J. Ernest. CAM: Causal additive models, high-dimensional order search and
penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.
T. T. Cai, H. Li, W. Liu, and J. Xie. Joint estimation of multiple high-dimensional precision matrices. Statistica
Sinica, 26(2):445, 2016.
R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Building and using probabilistic networks.
Probabilistic Networks and Expert Systems, pages 25–41, 1999.
A. Dasgupta, J. Murgoitio-Esandi, D. Ray, and A. Oberai. Conditional score-based generative models for
solving physics-based inverse problems. In NeurIPS 2023 Workshop on Deep Learning and Inverse Problems,
2023. URL https://openreview.net/forum?id=ZL5wlFMg0Y.
F. X. Diebold and K. Yılmaz. On the network topology of variance decompositions: Measuring the connect-
edness of financial firms. Journal of Econometrics, 182(1):119–134, 2014.
R. M. Dudley. VN Sudakov’s work on expected suprema of Gaussian processes. In High Dimensional Proba-
bility VII: The Cargèse Volume, pages 37–43. Springer, 2016.
J. Fan and W. Zhang. Statistical methods with varying coefficient models. Statistics and its Interface, 1(1):
179, 2008.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso.
Biostatistics, 9(3):432–441, 2008.
A. S. Greene, S. Gao, D. Scheinost, and R. T. Constable. Task-induced brain state manipulation improves
prediction of individual traits. Nature Communications, 9(1):2807, 2018.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Joint estimation of multiple graphical models. Biometrika, 98
(1):1–15, 2011.
D. Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis
dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.
A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learn-
ing Research, 6(4), 2005.
B. Khavari and G. Rabusseau. Lower and upper bounds on the pseudo-dimension of tensor network models.
Advances in Neural Information Processing Systems, 34:10931–10943, 2021.
S. Kim, H. Cho, D. Lee, and M. J. Webster. Association between SNPs and gene expression in multiple regions
of the human brain. Translational Psychiatry, 2(5):e113–e113, 2012.
Y. Klochkov and N. Zhivotovskiy. Stability and deviation optimal risk bounds with convergence rate o(1/n).
Advances in Neural Information Processing Systems, 34:5065–5076, 2021.
M. Kohler and S. Langer. On the rate of convergence of fully connected deep neural network regression
estimates. The Annals of Statistics, 49(4):2231–2249, 2021.
M. Kolar, A. P. Parikh, and E. P. Xing. On sparse nonparametric conditional covariance selection. In Proceed-
ings of the 27th International Conference on International Conference on Machine Learning (ICML), pages
559–566, 2010.
J. Lafferty, A. McCallum, F. Pereira, et al. Conditional random fields: Probabilistic models for segmenting and
labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML),
2001.
S. L. Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.
K.-Y. Lee, D. Ji, L. Li, T. Constable, and H. Zhao. Conditional functional graphical models. Journal of the
American Statistical Association, 118(541):257–271, 2023.
H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: semiparametric estimation of high dimensional
undirected graphs. Journal of Machine Learning Research, 10(10), 2009.
H. Liu, X. Chen, L. Wasserman, and J. Lafferty. Graph-valued regression. Advances in Neural Information
Processing Systems, 23, 2010.
H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman. High-dimensional semiparametric Gaussian copula
graphical models. The Annals of Statistics, 40(4):2293 – 2326, 2012. doi: 10.1214/12-AOS1037. URL
https://doi.org/10.1214/12-AOS1037.
P.-L. Loh and P. Bühlmann. High-dimensional learning of linear causal networks via inverse covariance
estimation. Journal of Machine Learning Research, 15(1):3065–3105, 2014.
K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning.
Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.
R. Marcinkevičs and J. E. Vogt. Interpretable models for Granger causality using self-explaining neural
networks. In International Conference on Learning Representations (ICLR), 2021.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. The
Annals of Statistics, pages 1436–1462, 2006.
K. Murphy and M. D. Fox. Towards a consensus regarding global signal regression for resting state functional
connectivity MRI. Neuroimage, 154:169–173, 2017.
Y. Ni, F. C. Stingo, and V. Baladandayuthapani. Bayesian covariate-dependent Gaussian graphical models
with varying structure. Journal of Machine Learning Research, 23(242):1–29, 2022.
Y. Niu, Y. Ni, D. Pati, and B. K. Mallick. Covariate-assisted Bayesian graph learning for heterogeneous data.
Journal of the American Statistical Association, 119(547):1985–1999, 2024.
T. Ohtani, P. G. Nestor, S. Bouix, Y. Saito, T. Hosokawa, and M. Kubicki. Medial frontal white and gray
matter contributions to general intelligence. PLoS One, 9(12):e112691, 2014.
F. W. Olver, D. W. Lozier, R. F. Boisvert, and C. W. Clark. NIST Handbook of Mathematical Functions. Cam-
bridge University Press, 2010.

D. Pollard. Empirical Processes, volume 2. Institute of Mathematical Statistics, 1990.
X. Qiao, S. Guo, and G. M. James. Functional graphical models. Journal of the American Statistical Association,
114(525):211–222, 2019.
P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical
Society Series B: Statistical Methodology, 71(5):1009–1030, 2009.
M. Song, Y. Zhou, J. Li, Y. Liu, L. Tian, C. Yu, and T. Jiang. Brain spontaneous functional connectivity and
intelligence. Neuroimage, 41(3):1168–1176, 2008.
R. Thompson, E. V. Bonilla, and R. Kohn. Contextual directed acyclic graphs. In International Conference on
Artificial Intelligence and Statistics (AISTATS), pages 2872–2880. PMLR, 2024.
V. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their
probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.
M. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained
quadratic programming. IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
M. J. Wainwright, M. I. Jordan, et al. Graphical models, exponential families, and variational inference.
Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
E. Yang, Y. Baker, P. Ravikumar, G. Allen, and Z. Liu. Mixed graphical models via exponential families. In Pro-
ceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 33,
pages 1042–1050. PMLR, 2014.
E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via univariate exponential family distribu-
tions. Journal of Machine Learning Research, 16(1):3813–3847, 2015.
Z. Zeng, M. Li, and M. Vannucci. Bayesian covariate-dependent graph learning with a dual group spike-and-
slab prior. arXiv preprint arXiv:2409.17404, 2024.
J. Zhang and Y. Li. High-dimensional Gaussian graphical regression models with covariates. Journal of the
American Statistical Association, 118(543):2088–2100, 2023.
T. Zhao, H. Liu, K. Roeder, J. Lafferty, and L. Wasserman. The huge package for high-dimensional undirected
graph estimation in R. Journal of Machine Learning Research, 13(1):1059–1062, 2012.
Y. Zheng, I. Ng, Y. Fan, and K. Zhang. Generalized precision matrix for scalable estimation of nonparametric
Markov networks. In The Eleventh International Conference on Learning Representations (ICLR), 2023. URL
https://openreview.net/forum?id=qBvBycTqVJ.
F. Zhou, K. He, and Y. Ni. Causal discovery with heterogeneous observational data. In Uncertainty in Artificial
Intelligence, pages 2383–2393. PMLR, 2022.
S. Zhou, J. Lafferty, and L. Wasserman. Time varying undirected graphs. Machine Learning, 80:295–319,
2010.

A Technical Appendix

A.1 Technical Definitions and A Useful Lemma


 
For notational convenience, the risk function is abbreviated to $L(\beta) := \frac{1}{p}\sum_{j=1}^{p}\ell\big(\sum_{k\neq j}\beta_{jk}(z)\,x_k,\ x_j\big)$ for
the remainder of the presentation. Let $\mathcal{H} := \{h : \mathcal{Y} \to \mathbb{R}\}$ be a family of measurable functions, and
$\{\sigma^i\}_{i=1}^n \in \{-1,+1\}^n$ a collection of i.i.d. Rademacher random variables. The Rademacher Complexity and
Rademacher Average (Bartlett and Mendelson, 2002) are defined as
$$ R_n\mathcal{H} = \mathbb{E}_{\{\sigma^i\}_{i=1}^n}\Big[\sup_{h\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^n\sigma^i h(y^i)\Big], \qquad R\mathcal{H} = \mathbb{E}_{y^{1:n},\{\sigma^i\}_{i=1}^n}\Big[\sup_{h\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^n\sigma^i h(y^i)\Big]. $$
Further, we define the empirical average and population expectation of h as
$$ P_n h := \frac{1}{n}\sum_{i=1}^n h(y^i) \quad\text{and}\quad Ph := \mathbb{E}_y[h(y)]. $$
In particular, we view y := (x, z) and h(y) := (ℓ ∘ β)(y) := L(β; x, z) throughout our analysis. We further
define the Local Rademacher Complexity and Local Rademacher Average with radius r as $R_n\{h\in\mathcal{H}, P_nh^2\le r\}$
and $R\{h\in\mathcal{H}, Ph^2\le r\}$, respectively. The star hull of a set of functions $\mathcal{F}$ is defined as
$$ *\mathcal{F} := \{\alpha f : f\in\mathcal{F},\ \alpha\in[0,1]\}. $$
The following concepts are used in the proofs of the main results:

Definition 1 (L2-Covering Number). Let $S_n := \{y^i\}_{i=1}^n$ be a set of n points with $y^i\in\mathcal{Y}$ for all i. A set $\mathcal{U}\subseteq\mathbb{R}^n$ is
an ε-cover of $\mathcal{F}$ on $\{y^i\}_{i=1}^n$ w.r.t. the $L_2$-norm if, for every $\beta\in\mathcal{F}$, there exists $u\in\mathcal{U}$ such that
$\sqrt{\frac{1}{n}\sum_{i=1}^n|u_i-\beta(y^i)|^2}\le\varepsilon$, where $u_i$ is the i-th coordinate of u. The covering number $N_2(\varepsilon,\mathcal{F},S_n)$ of $\mathcal{F}$ on $S_n$ with the $L_2$-norm is
$$ \min\{|\mathcal{U}|:\ \mathcal{U}\ \text{is an}\ \varepsilon\text{-cover of}\ \mathcal{F}\ \text{on}\ S_n\}, $$
and the covering number of $\mathcal{F}$ with the $L_2$-norm of size n is $N_2(\varepsilon,\mathcal{F},n) := \sup_{S_n\in\mathcal{Y}^n}N_2(\varepsilon,\mathcal{F},S_n)$.

Definition 2 (VC-dimension (Vapnik and Chervonenkis, 1971)). The VC-dimension $d_{VC}(\mathcal{H})$ of a hypothesis
class $\mathcal{H}=\{h:\mathcal{Y}\mapsto\{1,-1\}\}$ is the largest cardinality of a set $S\subseteq\mathcal{Y}$ such that for every subset $\bar S$ of S, there exists $h\in\mathcal{H}$ with
$$ h(y)=\begin{cases}1 & \text{if } y\in\bar S,\\ -1 & \text{if } y\in S\setminus\bar S.\end{cases} $$

Definition 3 (Pseudo-dimension (Pollard, 1990)). The pseudo-dimension $d_P(\mathcal{H})$ of a real-valued hypothesis
class $\mathcal{H}=\{h:\mathcal{Y}\mapsto[a,b]\}$ is the VC-dimension of the hypothesis class
$$ \tilde{\mathcal{H}}=\big\{\tilde h:\mathcal{Y}\times\mathbb{R}\mapsto\{-1,1\}\ \big|\ \tilde h(y,t)=\mathrm{sign}(h(y)-t),\ h\in\mathcal{H}\big\}. $$

Definition 4 ((m, C)-smoothness (Kohler and Langer, 2021)). Let m := t + s for some $t\in\mathbb{N}_0$ and $0<s\le 1$.
A function $f(\cdot):\mathbb{R}^q\mapsto\mathbb{R}$ is called (m, C)-smooth if for every $\alpha=(\alpha_1,\alpha_2,\ldots,\alpha_q)$ with $\sum_{j=1}^q\alpha_j=t$, the partial
derivative $\partial^t f/(\partial z_1^{\alpha_1}\cdots\partial z_q^{\alpha_q})$ exists and satisfies
$$ \Big|\frac{\partial^t f}{\partial z_1^{\alpha_1}\cdots\partial z_q^{\alpha_q}}(z)-\frac{\partial^t f}{\partial z_1^{\alpha_1}\cdots\partial z_q^{\alpha_q}}(y)\Big|\ \le\ C\,\|z-y\|^s. $$

The following lemma, which verifies the Bernstein condition, will be used in the proof of Theorem 1.

Lemma 1 (Verification of the Bernstein Condition). Under Assumptions 1 and 2, the following inequality
holds:
$$ \mathbb{E}_{x,z}\big[(L(\hat\beta)-L(\beta))^2\big]\ \le\ \frac{2L^2}{\alpha}\,\mathbb{E}_{x,z}\big[L(\hat\beta)-L(\beta)\big]. $$

Proof.
$$\begin{aligned}
\mathbb{E}_{x,z}\big[(L(\hat\beta)-L(\beta))^2\big] &\le \mathbb{E}_{x,z}\Big[\frac{L^2}{p}\sum_{j=1}^p\Big|\sum_{k\neq j}\hat\beta_{jk}(z)x_k-\sum_{k\neq j}\beta^*_{jk}(z)x_k\Big|^2\Big]\\
&\le \frac{2L^2}{\alpha}\,\mathbb{E}_{x,z}\Big[\frac{1}{p}\sum_{j=1}^p\Big(\ell\big(\textstyle\sum_{k\neq j}\hat\beta_{jk}(z)x_k,\,x_j\big)-\ell\big(\textstyle\sum_{k\neq j}\beta_{jk}(z)x_k,\,x_j\big)\Big)\Big]\\
&= \frac{2L^2}{\alpha}\,\mathbb{E}_{x,z}\big[L(\hat\beta)-L(\beta)\big],
\end{aligned}$$
where the first inequality uses the Lipschitz continuity of ℓ together with Jensen's inequality, and the second
uses the α-strong convexity of ℓ and the optimality of β (Assumption 2), under which the first-order term
vanishes in expectation.
A.2 Verification of Assumptions 1 and 2 For the Mean Squared Error Loss

Note that the mean squared error (MSE) loss function, i.e., ℓ(a, b) := (a − b)², is used to recover the structure
of the graphical model (see (3)). Next, we establish that it satisfies Assumptions 1 and 2.
Recall that the $x_j$'s are assumed uniformly bounded; without loss of generality, consider $\|x\|_2\le 1$. Further, let
$\beta_{jk}(\cdot)\in\mathcal{F}$, where $\mathcal{F}$ belongs to a family of uniformly bounded functions with $\|\beta(z)\|_2\le 1$.
First, we establish that the MSE loss is 1-strongly convex and 4-Lipschitz. Being a quadratic
function, it is trivially 1-strongly convex. The Lipschitz continuity claim follows from
$$ \ell(\hat x_1, x) - \ell(\hat x_2, x) = (\hat x_1 - \hat x_2)(\hat x_1 + \hat x_2 - 2x) \le 4\,|\hat x_1 - \hat x_2|, $$
using the boundedness of $\hat x_1$, $\hat x_2$ and x.
Next, we show that Assumption 2 holds; namely, $\mathbb{E}_{x,z}\big[\sum_{j=1}^p \ell'(\langle\beta_j, x_{-j}\rangle, x_j)\big|_{\beta_j=\beta^*_j}\big] = 0$. Recall that the
posited DGP is
$$ x_j = \sum_{k\neq j}\beta^*_{jk}(z)\,x_k + \varepsilon_j. $$
Then, the following calculation (up to a multiplicative constant) shows that Assumption 2 holds:
$$ \mathbb{E}_{x,z}\Big[\sum_{j=1}^p \ell'(\langle\beta_j, x_{-j}\rangle, x_j)\Big] = \mathbb{E}_{x,z}\Big[\sum_{j=1}^p\big(\langle\beta_j, x_{-j}\rangle - \langle\beta^*_j, x_{-j}\rangle - \varepsilon_j\big)\Big] = \mathbb{E}_{x,z}\Big[\sum_{j=1}^p\langle\beta_j - \beta^*_j,\ x_{-j}\rangle\Big], $$
which equals zero when $\beta_j = \beta^*_j$ for all j.
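As a quick numerical sanity check of the two properties verified above (a minimal sketch; the random grid on [-1, 1] and the tolerance are assumptions of the illustration, not part of the formal argument):

```python
import numpy as np

rng = np.random.default_rng(0)
loss = lambda a, b: (a - b) ** 2

# Draw predictions and targets from the bounded domain [-1, 1].
a1, a2, x = rng.uniform(-1, 1, size=(3, 100_000))

# 4-Lipschitz in the first argument: |l(a1, x) - l(a2, x)| <= 4 |a1 - a2|.
lip_ok = np.all(np.abs(loss(a1, x) - loss(a2, x)) <= 4 * np.abs(a1 - a2) + 1e-12)

# 1-strong convexity: l(a1, x) - l(a2, x) >= l'(a2, x) (a1 - a2) + 0.5 (a1 - a2)^2,
# where l'(a, x) = 2 (a - x).
grad = 2 * (a2 - x)
sc_ok = np.all(loss(a1, x) - loss(a2, x) >= grad * (a1 - a2) + 0.5 * (a1 - a2) ** 2 - 1e-12)

print(lip_ok, sc_ok)  # both should print True
```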

A.3 Proof of Theorem 1

For ease and conciseness of presentation, the hypothesis class H and its members are defined in terms of the
excess risk, the main quantity of interest in this analysis.

Proof. We start by defining the following function class:
$$ \mathcal{H} := \Delta\circ L\circ\mathcal{F} := \big\{\Delta L(\beta;\beta^*,x,z) = L(\beta)-L(\beta^*)\ :\ \beta\in\mathcal{F}^{p\times(p-1)}\big\}. $$
Correspondingly, we denote $h(x,z) := \Delta_{L,\beta} := \Delta L(\beta;\beta^*,x,z)$, which is a composite of $\beta\in\mathcal{F}^{p\times(p-1)}$, the loss
function L, and the difference operation ∆. It can easily be verified that $\forall\,h\in\mathcal{H}$, $h(x,z)\in[0,1]$. In addition,
h(·) satisfies the Bernstein condition by Lemma 1. On the other hand, directly bounding the pseudo-dimension
of H is not trivial. To that end, we leverage the Lipschitz property of ℓ and the covering number of F to
construct an upper bound on the covering number of H. We focus on analyzing the Local Rademacher Average
$\mathbb{E} R_n\{h\in\mathcal{H}: P_nh^2\le r\}$:
$$ \mathbb{E} R_n\{h\in\mathcal{H}, P_nh^2\le r\} = \mathbb{E}_{S_n,\{\sigma^i\}_{i=1}^n}\Big[\sup_{h\in\mathcal{H},\,P_nh^2\le r}\frac{1}{n}\sum_{i=1}^n\sigma^i h(x^i,z^i)\Big] = \mathbb{E}_{S_n,\{\sigma^i\}_{i=1}^n}\Big[\sup_{\beta\in\mathcal{F}^{p\times(p-1)},\,P_n\Delta^2_{L,\beta}\le r}\frac{1}{n}\sum_{i=1}^n\sigma^i\,\Delta_{L,\beta}\Big]. $$
Next, we bound $P\Delta_{L,\hat\beta} - P_n\Delta_{L,\hat\beta}$. To invoke Theorem 3.3 in Bartlett et al. (2005), we need to find a sub-root
function τ(r) such that
$$ \tau(r)\ \ge\ \frac{2L^2}{\alpha}\,\mathbb{E} R_n\{h\in\mathcal{H}: \mathbb{E}[h^2]\le r\}. $$
By Lemma 3.4 from Bartlett et al. (2005), it suffices to choose
$$ \tau(r) = \frac{20L^2}{\alpha}\,\mathbb{E} R_n\{*\mathcal{H}, Ph^2\le r\} + \frac{11\log n}{n}, $$
whose fixed point, satisfying $r^* = \tau(r^*)$, is denoted by $r^*$.
The following analysis largely follows the proof of Corollary 3.7 in Bartlett et al. (2005). Since $\Delta_{L,\hat\beta}$ is
uniformly bounded by 1, for any r ≥ τ(r), Corollary 2.2 in Bartlett et al. (2005) implies that with probability
at least 1 − 1/n, $\{h\in*\mathcal{H}: Ph^2\le r\}\subseteq\{h\in*\mathcal{H}: P_nh^2\le 2r\}$. Let E denote the event that this inclusion holds;
then the following holds:
$$\begin{aligned}
\mathbb{E} R_n\{*\mathcal{H}, Ph^2\le r\} &\le \mathbb{P}[E]\,\mathbb{E}\big[R_n\{*\mathcal{H}, Ph^2\le r\}\,\big|\,E\big] + \mathbb{P}[E^c]\,\mathbb{E}\big[R_n\{*\mathcal{H}, Ph^2\le r\}\,\big|\,E^c\big]\\
&\le \mathbb{E}\big[R_n\{*\mathcal{H}, P_nh^2\le 2r\}\big] + \frac{1}{n}.
\end{aligned}$$
Since $r^*$ is the fixed point of a sub-root function, namely $r^* = \tau(r^*)$, by Lemma 3.2 in Bartlett et al. (2005),
$r^*$ satisfies
$$ r^*\ \le\ \frac{20L^2}{\alpha}\,\mathbb{E} R_n\{*\mathcal{H}, P_nh^2\le 2r^*\} + \frac{11\log n + 20}{n}, \qquad (10) $$
where the Lipschitz constant L and the strong convexity parameter α enter through the Bernstein condition.
Next, we leverage Dudley's chaining bound (Dudley, 2016) to upper bound $\mathbb{E} R_n\{*\mathcal{H}, P_nh^2\le 2r^*\}$ using
the entropy integral of the covering number. Specifically, by applying the chaining bound, it follows from Theorem
B.7 in Bartlett et al. (2005) that
$$ \mathbb{E}_{S_n}\big[R_n(*\mathcal{H}, P_nh^2\le 2r^*)\big]\ \le\ \frac{\mathrm{const}}{\sqrt n}\,\mathbb{E}_{S_n}\int_0^{\sqrt{2r^*}}\sqrt{\log N_2(\varepsilon, *\mathcal{H}, S_n)}\,d\varepsilon, \qquad (11) $$
where const represents a universal constant. Next, we bound the covering number $N_2(\varepsilon, *\mathcal{H}, S_n)$ by
$N_2(\varepsilon/(Lp), \mathcal{F}^{p\times(p-1)}, S_n)$. We show that for all $S_n$, any $\frac{\varepsilon}{Lp}$-cover of $\mathcal{F}^{p\times(p-1)}$ induces an ε-cover of H, which implies
that $N_2(\varepsilon, \mathcal{H}, S_n) \le N_2(\varepsilon/(Lp), \mathcal{F}^{p\times(p-1)}, S_n)$. Specifically, let $\mathcal{U}_{jk}\subset[0,1]^n$ be an ε-cover of F on $S_n$, so that
for all $\beta_{jk}\in\mathcal{F}$, there exists $\{u^i_{jk}\}_{i=1}^n\in\mathcal{U}_{jk}$ with
$$ \sqrt{\frac{1}{n}\sum_{i\in[n]}\big(\beta_{jk}(z^i)-u^i_{jk}\big)^2}\ \le\ \varepsilon. $$
Further, let $\mathcal{U}\subseteq\mathbb{R}^{n\times p\times(p-1)}$, and let $\beta(z):\mathcal{Z}\to\mathbb{R}^{p\times(p-1)}$ with $\beta(z)\in\mathcal{F}^{p\times(p-1)}$ denote a family of $p\times(p-1)$ joint
functions, each element $\beta_{jk}\in\mathcal{F}$, $j,k\in[p]$, $j\neq k$. We say $\mathcal{U}$ is an ε-cover of $\mathcal{F}^{p\times(p-1)}$ on $\{z^i\}_{i=1}^n$ if
$\forall\,\beta\in\mathcal{F}^{p\times(p-1)}$, $\exists\,u\in\mathcal{U}$, s.t.
$$ \sqrt{\frac{1}{np(p-1)}\sum_{i\in[n]}\sum_{j\in[p]}\sum_{k\neq j}\big(u^i_{jk}-\beta_{jk}(z^i)\big)^2}\ \le\ \varepsilon. $$
Clearly, letting $\mathcal{U}_{jk}$ be any collection of $p\times(p-1)$ ε-covers of F, the Cartesian product of the $\mathcal{U}_{jk}$, $j,k\in[p]$,
$j\neq k$, forms an ε-cover of $\mathcal{F}^{p\times(p-1)}$, which implies that $|\mathcal{U}|\le|\mathcal{U}_{jk}|^{p^2}$. Thus we have
$N_2\big(\varepsilon, \mathcal{F}^{p\times(p-1)}, S_n\big)\le N_2\big(\varepsilon, \mathcal{F}, S_n\big)^{p^2}$. Next, we show that, given any ε-cover of $\mathcal{F}^{p\times(p-1)}$ on $S_n$, denoted by $\mathcal{U}$, one can construct
$$ \mathcal{V} := \Big\{v=(v^1,\cdots,v^n)'\in\mathbb{R}^n\ \Big|\ v^i := \frac{1}{p}\sum_{j}\Big(\ell\big(\langle u^i_j,x^i_{-j}\rangle,x^i_j\big)-\ell\big(\langle\beta^*_j,x^i_{-j}\rangle,x^i_j\big)\Big),\ i\in[n],\ u\in\mathcal{U}\Big\}, $$
which is an (Lpε)-cover for H; i.e., for all h ∈ H there exists v ∈ V so that $\sqrt{\frac{1}{n}\sum_{i\in[n]}(h(x^i,z^i)-v^i)^2}\le Lp\varepsilon$:
$$\begin{aligned}
&\sqrt{\frac{1}{n}\sum_{i\in[n]}\Big(\frac{1}{p}\sum_{j\in[p]}\Big(\ell\big(\langle u^i_j,x^i_{-j}\rangle,x^i_j\big)-\ell\big(\langle\beta^*_j,x^i_{-j}\rangle,x^i_j\big)\Big)-\Delta_{L,\beta}(x^i,z^i)\Big)^2}\\
&\qquad=\sqrt{\frac{1}{n}\sum_{i\in[n]}\Big(\frac{1}{p}\sum_{j\in[p]}\Big(\ell\big(\langle u^i_j,x^i_{-j}\rangle,x^i_j\big)-\ell\big(\langle\beta_j,x^i_{-j}\rangle,x^i_j\big)\Big)\Big)^2}\\
&\qquad\overset{(1)}{\le}\sqrt{\frac{1}{np^2}\sum_{i\in[n]}\Big(\sum_{j\in[p]}L\,\big|\langle u^i_j-\beta_j,x^i_{-j}\rangle\big|\Big)^2}
\ \overset{(2)}{\le}\ \sqrt{\frac{L^2}{np}\sum_{i\in[n]}\sum_{j\in[p]}\big|\langle u^i_j-\beta_j,x^i_{-j}\rangle\big|^2}
\ \overset{(3)}{\le}\ \varepsilon Lp.
\end{aligned}$$
In the above derivation, (1) leverages the fact that
$$ \Delta_{L,\beta}(x^i,z^i)=\frac{1}{p}\sum_{j\in[p]}\Big(\ell\big(\langle\beta_j(z^i),x^i_{-j}\rangle,x^i_j\big)-\ell\big(\langle\beta^*_j(z^i),x^i_{-j}\rangle,x^i_j\big)\Big) $$
together with the Lipschitz continuity of ℓ; (2) follows from the Cauchy-Schwarz inequality; and (3) is by the
definition of the covering number and the fact that $|x^i_k|\le 1$, $\forall\,k\in[p]$. Combining the above inequality with
Corollary 3.7 from Bartlett et al. (2005), we have
$$\begin{aligned}
\log N_2(\varepsilon, *\mathcal{H}, S_n) &\le \log\Big[N_2\big(\tfrac{\varepsilon}{2}, \mathcal{H}, S_n\big)\big(\lceil\tfrac{2}{\varepsilon}\rceil+1\big)\Big]\\
&\le \log\Big[N_2\big(\tfrac{\varepsilon}{8Lp}, \mathcal{F}^{p\times(p-1)}, S_n\big)\big(\lceil\tfrac{2}{\varepsilon}\rceil+1\big)\Big]\\
&\le p^2\log\Big[N_2\big(\tfrac{\varepsilon}{8Lp}, \mathcal{F}, S_n\big)\big(\lceil\tfrac{2}{\varepsilon}\rceil+1\big)\Big].
\end{aligned}$$
Next, we bound $\frac{\mathrm{const}}{\sqrt n}\,\mathbb{E}_{S_n}\int_0^{\sqrt{2r^*}}\sqrt{\log N_2(\varepsilon,*\mathcal{H},S_n)}\,d\varepsilon$ from (11). Note that by Haussler's bound on the covering
number (Haussler, 1995), we have
$$ \log N_2\Big(\frac{\varepsilon}{8Lp}, \mathcal{F}, S_n\Big)\ \le\ \mathrm{const}\cdot d_P(\mathcal{F})\,\log\Big(\frac{Lp}{\varepsilon}\Big),\qquad \forall\, S_n, $$
and therefore
$$\begin{aligned}
\frac{\mathrm{const}}{\sqrt n}\,\mathbb{E}_{S_n}\int_0^{\sqrt{2r^*}}\sqrt{\log N_2(\varepsilon,*\mathcal{H},S_n)}\,d\varepsilon
&\le \frac{\mathrm{const}}{\sqrt n}\,\mathbb{E}_{S_n}\int_0^{\sqrt{2r^*}}\sqrt{\log\Big[N_2\big(\tfrac{\varepsilon}{2},\mathcal{H},S_n\big)\big(\lceil\tfrac{2}{\varepsilon}\rceil+1\big)\Big]}\,d\varepsilon\\
&\le \frac{\mathrm{const}\cdot p}{\sqrt n}\,\mathbb{E}_{S_n}\int_0^{\sqrt{2r^*}}\sqrt{\log\Big[N_2\big(\tfrac{\varepsilon}{8Lp},\mathcal{F},S_n\big)\big(\lceil\tfrac{2}{\varepsilon}\rceil+1\big)\Big]}\,d\varepsilon\\
&\le \mathrm{const}\cdot p\,\sqrt{\frac{d_P(\mathcal{F})}{n}}\int_0^{\sqrt{2r^*}}\sqrt{\log\Big(\frac{Lp}{\varepsilon}\Big)}\,d\varepsilon
\ \le\ \mathrm{const}\cdot p\,\sqrt{\frac{d_P(\mathcal{F})\,r^*\log(L/r^*)}{n}}\\
&\le \mathrm{const}\cdot p\,\Bigg(\frac{d_P^2(\mathcal{F})}{n^2}+\sqrt{\frac{d_P(\mathcal{F})\,r^*\log\big(nLp/(e\,d_P(\mathcal{F}))\big)}{n}}\Bigg), \qquad (12)
\end{aligned}$$
where e in (12) refers to Euler's constant and const represents a universal constant that may change from
line to line in the above derivation. The inequality in (12) comes from the fact that $r^*\log(1/r^*)$ is monotone
increasing for $r^*\le e$: in the case where $r^*\le\frac{1}{e\,d_P(\mathcal{F})\,n\log^2(n/(e\,d_P(\mathcal{F})))}$, we have $\frac{d_P(\mathcal{F})\,r^*\log(1/r^*)}{n}\lesssim\frac{d_P^2(\mathcal{F})}{n^2}$; in
the case where $r^*\ge\frac{1}{e\,d_P(\mathcal{F})\,n\log^2(n/(e\,d_P(\mathcal{F})))}$, we have $\frac{d_P(\mathcal{F})\,r^*\log(1/r^*)}{n}\lesssim\frac{d_P(\mathcal{F})\,r^*}{n}\log\big(n/(e\,d_P(\mathcal{F}))\big)$. Combining these two
cases yields inequality (12).
Together with (10), one can solve for
$$ r^*\ \lesssim\ \frac{p^2L^4\,d_P(\mathcal{F})\,\log\big(nLp/d_P(\mathcal{F})\big)}{\alpha^2\, n}. $$

By Theorem 3.3 in Bartlett et al. (2005), we have that for all $\beta\in\mathcal{F}^{p\times(p-1)}$, with probability at least 1 − δ:
$$\begin{aligned}
R(\beta)-R(\beta^*) &\lesssim R_n(\beta)-R_n(\beta^*)+\frac{d_P(\mathcal{F})}{\alpha n}L^2p^2\log(nLp)\log\frac{1}{\delta},\\
R_n(\beta)-R_n(\beta^*) &\lesssim R(\beta)-R(\beta^*)+\frac{d_P(\mathcal{F})}{\alpha n}L^2p^2\log(nLp)\log\frac{1}{\delta}, \qquad (13)
\end{aligned}$$
which gives
$$ \underbrace{R(\hat\beta)-R(\beta^*)}_{\text{excess error}}\ \lesssim\ \underbrace{R_n(\hat\beta)-R_n(\beta^*)}_{\text{empirical excess error}}\ +\ \underbrace{\frac{d_P(\mathcal{F})}{\alpha n}L^2p^2\log(nLp)\log\frac{1}{\delta}}_{\text{generalization error}}. \qquad (14) $$
Due to the fact that $\hat\beta$ is the empirical risk minimizer, we have:
$$\begin{aligned}
\underbrace{R_n(\hat\beta)-R_n(\beta^*)}_{\text{empirical excess error}} &= \frac{1}{np}\sum_{i\in[n]}\Bigg(\sum_{j=1}^p\ell\big(\langle\hat\beta_j(z^i),x^i_{-j}\rangle,x^i_j\big)-\sum_{j=1}^p\ell\big(\langle\beta^*_j(z^i),x^i_{-j}\rangle,x^i_j\big)\Bigg)\\
&\le \frac{1}{np}\sum_{i\in[n]}\Bigg(\sum_{j=1}^p\ell\big(\langle\beta^{\mathrm{opt}}_j(z^i),x^i_{-j}\rangle,x^i_j\big)-\sum_{j=1}^p\ell\big(\langle\beta^*_j(z^i),x^i_{-j}\rangle,x^i_j\big)\Bigg) \qquad (15)\\
&\le \frac{L}{np}\sum_{i\in[n]}\sum_{j=1}^p\big\|\beta^{\mathrm{opt}}_j-\beta^*_j\big\|_\infty \qquad (16)\\
&\le L\cdot\underbrace{\mathcal{E}_{\mathrm{approx}}(\mathcal{F})}_{\text{approximation error}}.
\end{aligned}$$
Inequality (15) is due to the fact that $\hat\beta$ is the empirical risk minimizer within $\mathcal{F}^{p\times(p-1)}$ and thus
$R_n(\hat\beta)\le R_n(\beta^{\mathrm{opt}})$. Finally, inequality (16) follows from the Lipschitzness of ℓ(·, ·).
A.4 Proof of Corollary 1

Proof. Let $\varepsilon := \frac{p^2L^2d_P(\mathcal{F})}{\alpha n}\log(Lpn)\log\frac{1}{\delta}$; by (6) we have $R(\hat\beta)-R(\beta^*)\lesssim L\cdot\mathcal{E}_{\mathrm{approx}}(\mathcal{F})+\varepsilon$, which can
be expressed as
$$ \sum_{j=1}^p\mathbb{E}_{x,z}\Big[\ell\big(\langle\hat\beta_j(z),x_{-j}\rangle,x_j\big)\Big]-\sum_{j=1}^p\mathbb{E}_{x,z}\Big[\ell\big(\langle\beta^*_j(z),x_{-j}\rangle,x_j\big)\Big]\ \lesssim\ L\cdot\mathcal{E}_{\mathrm{approx}}(\mathcal{F})+\varepsilon. $$
By the α-strong convexity assumption (Assumption 1), namely,
$$ \ell(a_1;b)-\ell(a_2;b)\ \ge\ \ell'(a;b)\big|_{a=a_2}(a_1-a_2)+\frac{\alpha}{2}(a_1-a_2)^2, $$
setting $a_1=\langle\hat\beta_j(z),x_{-j}\rangle$, $a_2=\langle\beta^*_j(z),x_{-j}\rangle$, $b=x_j$, and using the optimality of β*, we have:
$$\begin{aligned}
\varepsilon+L\cdot\mathcal{E}_{\mathrm{approx}} &\ge \sum_{j=1}^p\mathbb{E}_{x,z}\Big[\ell\big(\langle\hat\beta_j(z),x_{-j}\rangle,x_j\big)\Big]-\sum_{j=1}^p\mathbb{E}_{x,z}\Big[\ell\big(\langle\beta^*_j(z),x_{-j}\rangle,x_j\big)\Big]\\
&\ge \sum_{j=1}^p\mathbb{E}_{x,z}\Big[\underbrace{\ell'(\langle\beta_j,x_{-j}\rangle,x_j)\big|_{\beta_j=\beta^*_j}\cdot\langle\hat\beta_j(z)-\beta^*_j(z),\,x_{-j}\rangle}_{=0\ \text{by optimality of}\ \beta^*_j}\Big]
+\frac{\alpha}{2}\sum_{j=1}^p\mathbb{E}_{x,z}\Big[\big|\langle\hat\beta_j(z)-\beta^*_j(z),\,x_{-j}\rangle\big|^2\Big],
\end{aligned}$$
which implies that
$$ \sum_{j=1}^p\mathbb{E}_{x,z}\Big[\big|\langle\hat\beta_j(z)-\beta^*_j(z),\,x_{-j}\rangle\big|^2\Big]=\sum_{j=1}^p\mathbb{E}_z\Big[\big\langle\hat\beta_j(z)-\beta^*_j(z),\ \mathbb{E}_x\big[x_{-j}x_{-j}^\top\,\big|\,z\big]\big(\hat\beta_j(z)-\beta^*_j(z)\big)\big\rangle\Big]\ \le\ \frac{2\varepsilon}{\alpha}+\frac{2L\cdot\mathcal{E}_{\mathrm{approx}}}{\alpha}. $$

A.5 Proof of Theorem 2

The result balances the function approximation error ($O(\mathcal{E}_{\mathrm{approx}})$) and the generalization error ($O(d_P/n)$).

Proof. The major tool that we leverage is from Kohler and Langer (2021). In particular, if we invoke
Theorem 2 from Kohler and Langer (2021) (in the notation therein) with $a=1$, $p\simeq m$, $M\simeq\xi^{-1/2m}$, $L\simeq\xi^{-q/2m}$,
$r\simeq(2e)^q\binom{m+q}{q}q^2$, we have that $\forall\,j\in[p]$, $k\in[p]\setminus\{j\}$, there exists $\beta^{\mathrm{opt}}_{jk}\in\mathcal{F}$ with
$\|\beta^{\mathrm{opt}}_{jk}-\beta^*_{jk}\|_\infty\le\xi$; thus $\mathcal{E}_{\mathrm{approx}}(\mathcal{F})\le\xi$. On the other hand, by
Theorem 6 from Bartlett et al. (2019), we can bound the VC-dimension in terms of the number of layers and
neurons. In the case of a deep neural network (Theorem 2b from Kohler and Langer (2021)):
$$\begin{aligned}
d_P &\le \mathrm{const}\cdot H^2r^2\log(Hr)\\
&\le \mathrm{const}\cdot\Big[(2eM)^q\, m\binom{m+q}{q}q^2\Big]^2\log(Hr)\\
&\le \mathrm{const}\cdot\big(2e^{2+m/q}M\big)^{2q}\,(m^2q^5)\,\big(\log(2e^{2+m/q}M)+\log(m)+\log(q)\big).
\end{aligned}$$
It suffices to pick $M=\mathrm{const}\cdot\xi^{-1/2m}$, which yields
$$ d_P\ \le\ \mathrm{const}\cdot\xi^{-q/m}\,m^4q^5\,\log(1/\xi). $$
Plugging the above into (6), one has:
$$ R(\hat\beta)-R(\beta^*)\ \lesssim\ L\xi+\frac{L^2\xi^{-q/m}m^4q^5}{\alpha n}\,p^2\log(1/\xi)\log(nLp)\log\frac{1}{\delta}. \qquad (17) $$
To minimize the right-hand side of the above inequality, it suffices to choose
$$ \xi=\bigg(\frac{Lm^4q^6p^2\log^2(nLp)\log(p)\log(1/\delta)\log(1/\alpha)}{\alpha n}\bigg)^{\frac{m}{m+q}}. \qquad (18) $$
Finally, Equation (8) can be derived using a strong convexity argument similar to that in Corollary 1, to bridge
the gap between the excess risk and the quality of $\hat\beta$: $\mathbb{E}_z[\|\hat\beta(z)-\beta^*(z)\|_{A(z)}]$.

A.6 Extension of Theorem 2 with An Additional Assumption on the Loss Function

In this section, we provide an extension of Theorem 2 under the following additional assumption:

Assumption 6. Assume that the following holds in addition for the loss function ℓ(·, ·), namely, its gradient is also
Lipschitz:
$$ \ell(a_1;b)-\ell(a_2;b)\ \le\ \ell'(a;b)\big|_{a=a_2}(a_1-a_2)+\frac{\lambda}{2}(a_1-a_2)^2. $$

This is equivalent to assuming that the second derivative of the loss function, ℓ″(·), is upper bounded. Next,
we state the result formally.

Theorem 3. Let F correspond to a family of fully connected neural networks with ReLU activation functions,
with number of layers $H\simeq\xi^{-q/2m}$ and number of neurons $r\simeq(2e)^q\binom{m+q}{q}q^2$. Then, under the assumptions
of Theorem 1, together with Assumption 4 with m ≲ q and Assumption 6, by setting
$$ \xi=\bigg(\frac{Lm^4q^6p^2\log^2(nLp)\log(p)\log(1/\delta)\log(1/\alpha)}{\lambda\alpha n}\bigg)^{\frac{m}{2m+q}}, \qquad (19) $$
the following holds with probability at least 1 − δ:
$$ R(\hat\beta)-R(\beta^*)\ \lesssim\ \lambda\xi^2, \qquad \mathbb{E}_z\Big[\frac{1}{p}\sum_{j=1}^p\big\|\hat\beta_j(z)-\beta^*_j(z)\big\|^2_{A_j(z)}\Big]\ \lesssim\ \frac{\lambda\xi^2}{\alpha}. \qquad (20) $$

Proof. Using an argument similar to the proof of Theorem 1, starting from (15):
$$\begin{aligned}
R_n(\hat\beta)-R_n(\beta^*) &= \frac{1}{np}\sum_{i\in[n]}\Bigg(\sum_{j=1}^p\ell\big(\langle\hat\beta_j(z^i),x^i_{-j}\rangle,x^i_j\big)-\sum_{j=1}^p\ell\big(\langle\beta^*_j(z^i),x^i_{-j}\rangle,x^i_j\big)\Bigg)\\
&\le \frac{1}{np}\sum_{i\in[n]}\Bigg(\sum_{j=1}^p\ell\big(\langle\beta^{\mathrm{opt}}_j(z^i),x^i_{-j}\rangle,x^i_j\big)-\sum_{j=1}^p\ell\big(\langle\beta^*_j(z^i),x^i_{-j}\rangle,x^i_j\big)\Bigg)\\
&= R_n(\beta^{\mathrm{opt}})-R_n(\beta^*).
\end{aligned}$$
With inequality (13), the following holds with probability at least 1 − δ:
$$ R_n(\beta^{\mathrm{opt}})-R_n(\beta^*)\ \lesssim\ R(\beta^{\mathrm{opt}})-R(\beta^*)+\frac{d_P(\mathcal{F})}{\alpha n}L^2p^2\log(nLp)\log\frac{1}{\delta}. $$
Together with (14), we have:
$$ R(\hat\beta)-R(\beta^*)\ \lesssim\ R(\beta^{\mathrm{opt}})-R(\beta^*)+\frac{d_P(\mathcal{F})}{\alpha n}L^2p^2\log(nLp)\log\frac{1}{\delta}. \qquad (21) $$
For the term $R(\beta^{\mathrm{opt}})-R(\beta^*)$ on the RHS, it satisfies
$$\begin{aligned}
R(\beta^{\mathrm{opt}})-R(\beta^*) &= \frac{1}{p}\,\mathbb{E}_{x,z}\Bigg[\sum_{j=1}^p\ell\big(\langle\beta^{\mathrm{opt}}_j(z),x_{-j}\rangle,x_j\big)-\sum_{j=1}^p\ell\big(\langle\beta^*_j(z),x_{-j}\rangle,x_j\big)\Bigg]\\
&\le \frac{1}{p}\sum_{j=1}^p\mathbb{E}_{x,z}\Big[\underbrace{\ell'(\langle\beta_j,x_{-j}\rangle,x_j)\big|_{\beta_j=\beta^*_j}\cdot\langle\beta^{\mathrm{opt}}_j(z)-\beta^*_j(z),\,x_{-j}\rangle}_{=0,\ \text{by the optimality of}\ \beta^*_j}\Big]
+\frac{\lambda}{2}\cdot\frac{1}{p}\sum_{j=1}^p\mathbb{E}_{x,z}\Big[\big|\langle\beta^{\mathrm{opt}}_j(z)-\beta^*_j(z),\,x_{-j}\rangle\big|^2\Big]\\
&\le \frac{\lambda}{2p}\sum_{j=1}^p\big\|\beta^{\mathrm{opt}}_j-\beta^*_j\big\|^2_\infty\ \le\ \frac{\lambda}{2}\cdot\mathcal{E}_{\mathrm{approx}}(\mathcal{F})^2,
\end{aligned}$$
where the first inequality uses Assumption 6. Plugging the above inequality into (21), we have:
$$ R(\hat\beta)-R(\beta^*)\ \lesssim\ \frac{\lambda}{2}\cdot\mathcal{E}_{\mathrm{approx}}(\mathcal{F})^2+\frac{d_P(\mathcal{F})}{\alpha n}L^2p^2\log(nLp)\log\frac{1}{\delta}. $$
Let $\mathcal{E}_{\mathrm{approx}}(\mathcal{F}):=\xi$; by plugging the bound on $d_P(\mathcal{F})$ used in (17) into the above inequality, we have:
$$ R(\hat\beta)-R(\beta^*)\ \lesssim\ \frac{\lambda\xi^2}{2}+\frac{\xi^{-q/m}m^4q^5}{\alpha n}L^2p^2\log(nLp)\log\frac{1}{\delta}. $$
To minimize the RHS of the above inequality, it suffices to choose
$$ \xi\simeq\bigg(\frac{Lm^4q^6p^2\log^2(nLp)\log(p)\log(1/\delta)\log(1/\alpha)}{\lambda\alpha n}\bigg)^{\frac{m}{2m+q}}. $$

Remark 10. The choice of ξ in (19) implies that $R(\hat\beta)-R(\beta^*)$ converges at a rate of order $n^{-2m/(2m+q)}$,
which matches the rate from Kohler and Langer (2021).

A.7 Proof of Corollary 2

From inequality (7), we have:
$$ \sum_{j=1}^p\mathbb{E}_z\Big[\big\langle\hat\beta_j(z)-\beta^*_j(z),\ A_j(z)\big(\hat\beta_j(z)-\beta^*_j(z)\big)\big\rangle\Big]\ \lesssim\ \frac{L^2d_P(\mathcal{F})}{\alpha^2(\beta-\bar\beta)^2}\cdot\frac{p^2\log(nLp)\log\frac{1}{\delta}}{n}+\frac{L\cdot\mathcal{E}_{\mathrm{approx}}(\mathcal{F})}{\alpha(\beta-\bar\beta)^2}. $$
Using Assumption 5, we have $\langle\hat\beta_j(z)-\beta^*_j(z),\,A_j(z)(\hat\beta_j(z)-\beta^*_j(z))\rangle\ge\frac{1}{\gamma}\|\hat\beta_j(z)-\beta^*_j(z)\|_2^2$ for each j =
1, · · · , p, and therefore:
$$ \sum_{j=1}^p\sum_{k\neq j}\mathbb{E}_z\Big[\big(\hat\beta_{jk}(z)-\beta^*_{jk}(z)\big)^2\Big]\ \lesssim\ \frac{L^2d_P(\mathcal{F})}{\alpha^2\gamma(\beta-\bar\beta)^2}\cdot\frac{p^2\log(nLp)\log\frac{1}{\delta}}{n}+\frac{L\cdot\mathcal{E}_{\mathrm{approx}}(\mathcal{F})}{\alpha\gamma(\beta-\bar\beta)^2}. $$
On the other hand, by the fact that the margin between strong and weak edges is bounded away from zero,
namely $(\beta-\bar\beta)\ge\phi>0$, we can select a threshold $\tau=\eta\bar\beta+(1-\eta)\beta$ and measure the edge-wise guarantee
using the binary risk $\mathbf{1}\{|\hat\beta_{jk}(z)|\ge\tau\}\ne\mathbf{1}\{|\beta^*_{jk}(z)|\ge\beta\}$.
Next, we establish the following inequality:
$$ \min\{\eta^2,(1-\eta)^2\}(\beta-\bar\beta)^2\ \mathbf{1}\Big\{\mathbf{1}\{|\hat\beta_{jk}(z)|\ge\tau\}\ne\mathbf{1}\{|\beta^*_{jk}(z)|\ge\beta\}\Big\}\ \le\ \big(\hat\beta_{jk}(z)-\beta^*_{jk}(z)\big)^2. \qquad (22) $$
The starting point is the following decomposition:
$$\begin{aligned}
\big(\hat\beta_{jk}(z)-\beta^*_{jk}(z)\big)^2 &= \underbrace{\mathbf{1}\Big\{\mathbf{1}\{|\hat\beta_{jk}(z)|\ge\tau\}\ne\mathbf{1}\{|\beta^*_{jk}(z)|\ge\beta\}\Big\}\big(\hat\beta_{jk}(z)-\beta^*_{jk}(z)\big)^2}_{\text{Term I}}\\
&\quad+\underbrace{\mathbf{1}\Big\{\mathbf{1}\{|\hat\beta_{jk}(z)|\ge\tau\}=\mathbf{1}\{|\beta^*_{jk}(z)|\ge\beta\}\Big\}\big(\hat\beta_{jk}(z)-\beta^*_{jk}(z)\big)^2}_{\text{Term II}}.
\end{aligned}$$
To show inequality (22), it suffices to bound Term I. In the case where $\mathbf{1}\{|\hat\beta_{jk}(z)|\ge\tau\}\ne\mathbf{1}\{|\beta^*_{jk}(z)|\ge\beta\}$, we have
either $|\hat\beta_{jk}(z)|\ge\tau$ and $|\beta^*_{jk}(z)|\le\bar\beta$, or $|\hat\beta_{jk}(z)|\le\tau$ and $|\beta^*_{jk}(z)|\ge\beta$. In both cases, we have
$\min\{\eta^2,(1-\eta)^2\}(\beta-\bar\beta)^2\le(\hat\beta_{jk}(z)-\beta^*_{jk}(z))^2$, and hence
$$ \min\{\eta^2,(1-\eta)^2\}(\beta-\bar\beta)^2\ \mathbf{1}\Big\{\mathbf{1}\{|\hat\beta_{jk}(z)|\ge\tau\}\ne\mathbf{1}\{|\beta^*_{jk}(z)|\ge\beta\}\Big\}\ \le\ \big(\hat\beta_{jk}(z)-\beta^*_{jk}(z)\big)^2. $$
The last expression implies that
$$ \min\{\eta^2,(1-\eta)^2\}(\beta-\bar\beta)^2\sum_{j=1}^p\sum_{k\neq j}\mathbf{1}\Big\{\mathbf{1}\{|\hat\beta_{jk}(z)|\ge\tau\}\ne\mathbf{1}\{|\beta^*_{jk}(z)|\ge\beta\}\Big\}\ \le\ \sum_{j=1}^p\sum_{k\neq j}\big(\hat\beta_{jk}(z)-\beta^*_{jk}(z)\big)^2. $$
Consequently, it follows that
$$ \mathbb{E}_z\Bigg[\sum_{j=1}^p\sum_{k\neq j}\mathbf{1}\Big\{\mathbf{1}\{|\hat\beta_{jk}(z)|\ge\tau\}\ne\mathbf{1}\{|\beta^*_{jk}(z)|\ge\beta\}\Big\}\Bigg]\ \lesssim\ \frac{L^2d_P(\mathcal{F})}{\alpha^2\min\{\eta^2,(1-\eta)^2\}\gamma(\beta-\bar\beta)^2}\cdot\frac{p^2\log(nLp)\log\frac{1}{\delta}}{n}+\frac{L\cdot\mathcal{E}_{\mathrm{approx}}(\mathcal{F})}{\alpha\gamma\min\{\eta^2,(1-\eta)^2\}(\beta-\bar\beta)^2}. $$

B Additional Details and Results for Synthetic Data Experiments

Additional illustration for the data generating process. Figure 2 provides a visualization of the candidate
skeletons Ψl 's used in settings G1, G2, N1 and N2, together with selected Θi 's obtained as convex
combinations of the candidates, where the mixing depends on the values of the corresponding z i 's.
Figure 3a provides a visualization of the candidate skeletons B1 , B2 , whose corresponding graphs have a tree
structure15 , and of their convex combination (which no longer corresponds to a tree); these serve as the skeletons for
the DAGs. Figure 3b shows the resulting Θi 's after moralization. The color (red/blue) and shade respectively
reflect the sign (positive/negative) and the magnitude of the entries in the moralized graphs for the linear
case, where the exact values can be calculated.
Of particular note, in the special case where the DAG possesses a tree structure, the moralized graph is
equivalent to removing the direction of the edges in the DAG, and therefore the skeleton matrix can be
15 Note that by definition of the DAG, the nodes possess a topological ordering and thus the corresponding matrix can be written as a
triangular one after reordering; here we are showing the skeleton matrix after such reordering.

(a) skeletons for Ψ1 , Ψ2 , Ψ3 , for settings G1 and N1 (b) examples for Θi ’s whose skeleton depends on z i ’s

(c) skeletons for Ψ1 , Ψ2 , Ψ3 , for settings G2 and N2 (d) examples for Θi ’s whose skeleton depends on z i ’s

Figure 2: Pictorial illustration for settings G1,G2,N1,N2. Left: candidate skeletons Ψ1 , Ψ2 , Ψ3 . Right: different Θi ’s that
are obtained from candidate skeletons, with the exact mixing depending on the values of the corresponding z i ’s; their
diagonals are suppressed for visualization purpose.

(a) skeletons for B1 , B2 and an example of their convex combination (b) moralized graphs corresponding to the DAGs in 3a

Figure 3: Pictorial illustration for setting D1. Left: candidate skeletons B1 , B2 and an example of their convex combina-
tion. Right: Moralized graphs for the respective DAGs, with their diagonals suppressed for visualization purpose.

obtained by adding the “transposed” entries to that of the DAG; e.g., see the first two plots in Figure 3b.
Otherwise, in addition to the “transposed” entries, there are additional “married” entries as a result of
connecting nodes having common children. As can be seen from the rightmost plot, these "married"
entries (the very faint ones in the middle of the heatmap) tend to be extremely weak in magnitude, and pose
challenges for estimation due to the very low signal-to-noise ratio. Finally, for the non-linear case, one can
obtain the values of the moralized graph approximately by considering a linear approximation to the $f^i_{jk}$'s (recall
that $x^i_j=\sum_{k\in\mathrm{pa}(j)}f^i_{jk}(x^i_k)+\epsilon^i_j$); these exhibit qualitatively similar patterns to the linear case.

Additional simulation results. Table 4 presents additional metrics for the DNN-based covariate-dependent
graphical model estimation method. In particular, given the sparse nature of the underlying true graphs, we
report the F1 score, $2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$, and the balanced accuracy, $\frac{\text{sensitivity}+\text{specificity}}{2}$, when the estimated graphs
are thresholded at different levels. The results show that, for practical purposes, practitioners can effectively
recover a sparse skeleton that is close to the truth by applying a reasonably small threshold.
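For concreteness, a minimal sketch of how these two metrics can be computed from a thresholded estimate and the true skeleton is given below; the function name, array inputs, and the use of scikit-learn are illustrative assumptions rather than the evaluation code used for the reported tables.

```python
import numpy as np
from sklearn.metrics import f1_score, balanced_accuracy_score

def skeleton_metrics(beta_hat, theta_true, threshold=0.05):
    """Compare the off-diagonal support of an estimated graph against the true skeleton."""
    p = beta_hat.shape[0]
    off_diag = ~np.eye(p, dtype=bool)
    y_pred = (np.abs(beta_hat[off_diag]) >= threshold).astype(int)
    y_true = (np.abs(theta_true[off_diag]) > 0).astype(int)
    return f1_score(y_true, y_pred), balanced_accuracy_score(y_true, y_pred)
```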
Table 5 presents additional results for settings D1 and D2, where instead of evaluating the performance
against the true moralized graph Θi , we evaluate against a “pseudo” moralized graph Θ̃i where we treat
edges that are present due to married nodes as if they were non-existent. As previously mentioned, these
edges typically exhibit very small magnitude and hence are difficult to recover. We expect the performance to
improve slightly if the comparison were done against the Θ̃i ’s, which is indeed the case: AUROC improves by
around 3% and AUPRC improves by around 5-7% (see, e.g., the last two rows in Table 3 for a comparison).

Table 4: F1 score (F1) and Balanced Accuracy (BA) for the estimated graphs by DNN-CGM at different thresholding levels.

            G1            G2            N1            N2            D1            D2
threshold   F1     BA     F1     BA     F1     BA     F1     BA     F1     BA     F1     BA
0.010       0.23   0.71   0.15   0.84   0.23   0.71   0.15   0.84   0.23   0.76   0.24   0.76
0.025       0.49   0.91   0.72   0.98   0.50   0.91   0.71   0.96   0.56   0.91   0.63   0.85
0.050       0.83   0.98   0.70   0.81   0.89   0.99   0.65   0.77   0.75   0.89   0.56   0.73
0.075       0.91   0.98   0.57   0.71   0.95   0.98   0.56   0.69   0.70   0.84   0.32   0.60
0.100       0.94   0.97   0.48   0.66   0.93   0.96   0.41   0.63   0.65   0.79   0.02   0.51

Note: reported metrics are first averaged across test samples for a single experiment, then averaged across experiments on 5 data replicates.

Table 5: Performance evaluation for DAG-based settings against the pseudo moralized graph, without edges from married nodes.

     DNN-CGM                       RegGMM                        glasso - est. by cluster      nodewise Lasso - est. by cluster
     AUROC         AUPRC           AUROC         AUPRC           AUROC         AUPRC           AUROC         AUPRC
D1   0.98 (0.003)  0.81 (0.028)    0.97 (0.006)  0.81 (0.028)    0.97 (0.003)  0.50 (0.007)    0.99 (0.002)  0.89 (0.020)
D2   0.96 (0.013)  0.74 (0.032)    0.91 (0.015)  0.65 (0.023)    0.94 (0.009)  0.44 (0.002)    0.92 (0.002)  0.52 (0.110)
Remark 11. For the DNN-based method, performance improves when training is done with a larger
sample size. In particular, for the challenging settings D1 and D2, for estimates obtained from a model
trained on 30,000 samples and evaluated against the true graph Θi , the AUROC is 0.97
(0.004) and 0.95 (0.007), respectively, and the AUPRC is 0.91 (0.004) and 0.75 (0.015).
When the estimates are evaluated against the pseudo moral graph Θ̃i (so that extremely weak entries from
married nodes are not counted toward the skeleton), the AUROC reaches around 0.99 for both settings and
the AUPRC reaches 0.95 and 0.82, respectively.

Finally, Table 6 shows the metrics for glasso and nodewise Lasso when estimation is performed on the full
set of samples; note that, in the absence of covariate-dependent estimation methods, this is how the two
methods would be used in practice, since the partitioning is not known a priori.

Table 6: Evaluation of glasso and nodewise Lasso on the full set of samples, without partitioning by clusters.

                        G1     G2     N1     N2     D1     D2     D1 - pseudo moralized graph   D2 - pseudo moralized graph
glasso          AUROC   0.95   0.98   0.96   0.98   0.91   0.89   0.95                          0.94
                AUPRC   0.47   0.46   0.47   0.45   0.39   0.37   0.43                          0.41
nodewise Lasso  AUROC   0.97   0.98   0.97   0.97   0.92   0.88   0.97                          0.94
                AUPRC   0.66   0.67   0.66   0.47   0.55   0.43   0.62                          0.53

Note: values correspond to the average over experiments on 5 data replicates. The last 2 columns show the evaluation against pseudo moralized graphs under the DAG settings.

Compared with the results in Table 3, nodewise Lasso exhibits a material improvement when the dependency
on the covariate is linear or mildly non-linear, whereas glasso appears more susceptible to the mis-specification
induced by the varying magnitudes across samples, as manifested by a much smaller improvement in performance
when estimation is conducted on partitioned samples.

C Additional Real Data Experiments

In this section, we present results obtained from applying the DNN-based covariate-dependent graphical
model to a finance dataset involving S&P 100 stocks. Such a model estimates the partial correlation across
stocks, while conditioning on covariates that correspond to the broad market condition.

The S&P 100 constituent dataset. We examine the inter-connectedness of the S&P 100 Index16 con-
stituent stocks under different market conditions. These constituent stocks correspond to 100 major blue
16 https://www.spglobal.com/spdji/en/indices/equity/sp-100/#overview

chip companies in the United States and span various industry sectors. In the sequel, stocks and tickers may
be used interchangeably, and they both correspond to the nodes of the network of interest.

Data collection and preprocessing. We collect daily stock return (calculated based on adjusted close
price) data for those that are components of the S&P 100 Index as of 2023-12-29. At the initial data gathering
stage, the first historical date is set to 2000-01-02, and the last to 2023-12-29. We require all tickers to have
valid data on the first historical date, and the set of tickers is filtered accordingly; tickers such
as ABBV, GM, GOOG, META, etc., are therefore excluded from subsequent analyses. After this filtering step, the
remaining set encompasses 79 tickers, and thus the size of the network is p = 79.
To obtain samples for x, we consider beta-adjusted SPX residual returns, so that the market information
is profiled out from the mean structure. Concretely, as a data-preprocessing step, let $m^i$ be the S&P 500
Index return on day i, which serves as a proxy for the market return of US large-cap companies; for each ticker
indexed by j, let $\check x^i_j$ be its raw return. We first estimate its beta $\beta^i_j$ by regressing its return against that of
the S&P 500 Index over a lookback window of 252 days; the beta-adjusted residual return is then given
by $x^i_j := \check x^i_j - \hat\beta^i_j m^i$. This gives rise to $\{x^i_j, j=1,\cdots,p\}$ for all tickers. After this pre-processing step, the
residual return data (namely, the $x^i$'s) effectively starts from 2001-01-02, and there are a total of
5785 observations. Note that the processed data exhibit de minimis serial correlation, and the $x^i$'s can be
high-level market condition of the day, and it encompasses the returns of the S&P 500 Index, the Nasdaq
Composite Index, respectively, and VIX. As such, q = 3. Data is further split into train/val/test periods that
respectively span 2001-2017, 2018-2019, 2020 onwards. Note that such a split is more for the purpose of
demonstrating that the trained model is capable of producing interpretable results when applied to unseen
data (e.g., COVID period); however, results are presented for all periods to demonstrate the patterns the
proposed method uncover.
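A minimal sketch of the beta-adjustment step is given below; it assumes a pandas DataFrame `returns` with one column per ticker and a Series `spx` of S&P 500 Index returns (names and the rolling-covariance formulation are illustrative, though the 252-day window mirrors the description above).

```python
import pandas as pd

def beta_adjusted_residuals(returns: pd.DataFrame, spx: pd.Series, window: int = 252) -> pd.DataFrame:
    """Compute x_j^i = raw_j^i - beta_j^i * m^i with a rolling OLS slope as the beta estimate."""
    # Rolling beta of each ticker w.r.t. the market: cov(r_j, m) / var(m) over the lookback window.
    beta = pd.DataFrame(
        {col: returns[col].rolling(window).cov(spx) for col in returns.columns}
    ).div(spx.rolling(window).var(), axis=0)
    resid = returns - beta.mul(spx, axis=0)
    return resid.dropna()
```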

Results. Figure 4 shows various graph connectivity metrics over time (left) and estimated graphs on se-
lected dates (right), after thresholding. The observations are three-fold: (1) the estimated networks exhibit signif-
icantly higher connectivity during volatile market conditions17 . (2) Under normal market conditions, intra-
sector connectivity level is higher, as manifested by the corresponding heatmap showing a block-diagonal
pattern; however, in adverse conditions, such a pattern is diluted due to increased connectivity across the
board. (3) Overall, the graph dynamics are slow-moving in the sense that the estimated graphs usually show
high concordance for dates that are not far apart.
Figure 5 presents the average partial correlation graphs for high-VIX and normal-VIX days in the test period,
where high-VIX days encompass dates on which the VIX exceeded the 90th percentile of the period
in question, and normal-VIX days correspond to those below the median.
After the averaging step, the increased connectivity during the high-VIX period is not as pronounced as
on the extreme dates (e.g., 2020-03-23), where the intra-sector pattern no longer stands out. Nonetheless,
compared with the average graph from normal-VIX days, connectivity is still elevated, as
manifested by additional inter-sector connections, while the intra-sector ones remain comparable.
Finally, note that we refrain from dictating any use case of the estimated graphs, which highly depends on
the specific context and application.
17 Similar patterns have been reported in the existing financial econometrics literature (e.g., Billio et al., 2012; Diebold and Yılmaz, 2014),
albeit the models/methods therein examine quantities other than the precision matrix. Note also that these methods do not
support sample-specific estimates, and thus their results were obtained through rolling-window analyses.

(a) Graph connectivity metrics (top) and VIX level (bottom). (b) Heatmaps for estimated graphs on selected high-VIX (left, 2020-03-23) and low-VIX (right, 2023-03-23) days.

Figure 4: Left panel: various connectivity metrics of the estimated graphs (top) and the corresponding VIX level (bottom),
from the beginning of 2001 to the end of 2023; notable events that significantly increased market volatility are marked.
Right panel: heatmaps of the estimated graphs on two representative dates with high and low VIX, respectively;
red cells indicate positive partial correlations and blue cells negative ones.

(a) Averaged graph for high-VIX days (b) Averaged graph for normal-VIX days

Figure 5: Averaged partial correlation graphs for high-VIX (left) and normal-VIX (right) days, with red cells indicating
positive partial correlation and blue cells negative ones.

D Graphical Model Preliminaries

D.1 Decomposition of a Gaussian Multivariate Distribution into Clique-wise Potential Functions According to a Graph G

Consider a Gaussian random vector $x\in\mathbb{R}^p$ with mean vector µ = 0 and covariance matrix Σ, assumed to be
positive definite. Its joint density function is given by
$$ \mathbb{P}(x)=\frac{1}{\sqrt{(2\pi)^p\det(\Sigma)}}\,\exp\Big(-\frac{1}{2}x^\top\Sigma^{-1}x\Big). $$
Then,
$$ \log\mathbb{P}(x) = C - \frac{1}{2}\sum_{j,k=1}^p x_jx_k\,\Sigma^{-1}_{jk}, $$
and hence the potential functions over pairwise cliques correspond to
$$ \varphi_{(j,k)}(x)\ \propto\ x_jx_k\,\Sigma^{-1}_{jk}; $$
they are functions of Σ, which is the parameter of the specific Gaussian distribution under consideration.
Further, it can easily be seen that $x_j\perp\!\!\!\perp x_k\mid x_{-\{j,k\}}$ if and only if $\Sigma^{-1}_{jk}=0$. Hence, to estimate a Gaussian
graphical model from data, it suffices to estimate the inverse covariance matrix, and all the pairwise
conditional independence relationships encoded by a graph G correspond to the zero elements of $\Sigma^{-1}$.
It has further been shown that the Gaussian graphical model can also be estimated via regression
techniques, namely, by estimating the regression coefficients of the following model (Meinshausen and
Bühlmann, 2006), a procedure referred to as neighborhood selection:
$$ x_j=\sum_{k=1,k\neq j}^p\beta_{jk}\,x_k+\varepsilon_j, \qquad\text{where }\beta_{jk}:=-\Sigma^{-1}_{jk}/\Sigma^{-1}_{jj}. $$
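For illustration, a minimal sketch of neighborhood selection with an ℓ1 penalty (nodewise Lasso) is given below; the use of scikit-learn's Lasso with a fixed penalty level and the "AND" symmetrization rule are illustrative choices, not the tuning procedure of the cited reference.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X: np.ndarray, lam: float = 0.1) -> np.ndarray:
    """Estimate the graph skeleton by regressing each node on all remaining nodes."""
    n, p = X.shape
    B = np.zeros((p, p))
    for j in range(p):
        others = np.delete(np.arange(p), j)
        fit = Lasso(alpha=lam).fit(X[:, others], X[:, j])
        B[j, others] = fit.coef_
    # Symmetrize the estimated support, e.g., with the "AND" rule.
    return (B != 0) & (B.T != 0)
```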

D.2 An Exponential Family based Node-conditional Graphical Model

Consider a random vector $x\in\mathbb{R}^p$ whose joint distribution is $\mathbb{P}(x)$. Further, consider the following collection
of node-conditional distributions from the univariate exponential family, where the neighbors of a node j
according to an underlying graph G are denoted by $N_G(j)$:
$$ \mathbb{P}(x_j\mid x_{-j})\ \propto\ \exp\Big\{g(x_j)\Big(\theta_j+\sum_{k\in N_G(j)}\theta_{jk}\,g(x_k)+\sum_{k,\ell\in N_G(j)}\theta_{jk\ell}\,g(x_k)g(x_\ell)
+\cdots+\sum_{m_1,\cdots,m_C\in N_G(j)}\theta_{j\,m_1\cdots m_C}\prod_{c=1}^Cg(x_{m_c})\Big)\Big\}, \qquad (23) $$
where the multivariate canonical parameter θ is defined through linear combinations of up to C-th order products of
positive univariate functions $g(x_k)$ of the nodes neighboring j according to the graph G.
An application of the Hammersley-Clifford theorem (see Section D.3) and some algebra (Yang et al., 2015)
show that the collection of node-conditional distributions in (23) defines a proper graphical model with
respect to G.
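As a concrete special case (stated here purely for illustration), taking g(x) = x and keeping only the pairwise interaction terms yields a Gaussian node-conditional model that connects back to the regression formulation of Section D.1; the conditional variance $\sigma_j^2$ and the explicit quadratic base-measure term below are assumptions of this illustration rather than part of (23) as written.

```latex
% Gaussian special case: g(x) = x, pairwise terms only; \sigma_j^2 denotes the conditional
% variance of x_j given x_{-j} (an assumption added for this illustration).
\mathbb{P}(x_j \mid x_{-j})
  \;\propto\; \exp\Big\{ x_j\Big(\theta_j + \sum_{k \in N_G(j)} \theta_{jk}\, x_k\Big)
  \;-\; \frac{x_j^2}{2\sigma_j^2} \Big\}
\;\;\Longleftrightarrow\;\;
x_j \mid x_{-j} \;\sim\; \mathcal{N}\Big(\sigma_j^2\,\theta_j
  + \sum_{k \in N_G(j)} \sigma_j^2\,\theta_{jk}\, x_k,\;\; \sigma_j^2\Big).
```

With µ = 0, this recovers, up to reparameterization, the node-conditional regression $x_j=\sum_{k\neq j}\beta_{jk}x_k+\varepsilon_j$ with $\beta_{jk}=\sigma_j^2\theta_{jk}$.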

D.3 Markov Properties of Graphical Models

The conditional independence/dependence relationships defined by a graphical model are a consequence
of the Markov properties encoded in the graph G. Specifically, the probability distribution P satisfies the
pairwise Markov property with respect to G if $x_j\perp\!\!\!\perp x_k\mid x_{-(j,k)}$ for all $\{j,k\}\notin E$; i.e., variables $x_j$ and
$x_k$ are conditionally independent given all the other variables if they are not connected according to the edge
set E of G. It satisfies the local Markov property with respect to G if $x_j\perp\!\!\!\perp x_{V\setminus(\mathrm{ne}_G(j)\cup\{j\})}\mid x_{\mathrm{ne}_G(j)}$; i.e.,
the conditional distribution of variable $x_j$ given all its neighbors is independent of any other nodes in G.
Finally, P satisfies the global Markov property with respect to G, if for any subsets A, B, C of V such that C
separates A and B (i.e., every path between a node in A and a node in B contains a node in C), the following
relationship holds: xA ⊥ ⊥ xB | xC . It can be shown that the global Markov property implies the local one
and both imply the pairwise one.
Next, we elaborate on the relationship between the joint probability distribution P(x) of the random vector x
and the underlying graph G regarding conditional independence/dependence relationships. The connection
between the graph G and the probability distribution P(x) comes through the concept of graph factorization.
Specifically, let $\mathcal{C}(G)$ denote the set of all cliques of G. Then, P(x) factorizes with respect to G if it can be
written as
$$ \mathbb{P}(x)=\frac{1}{Z}\prod_{C\in\mathcal{C}(G)}\phi_C(x_C),\qquad\text{with }\phi_C>0\ \text{for all }C\in\mathcal{C}(G); $$
i.e., the joint distribution of x can be written as a product of positive functions (called potential functions in
the literature), each depending only on the subset of random variables in the corresponding clique. However,
the concept of graph factorization does not per se imply conditional independence/dependence relationships
between subsets of the random vector x. Such relationships are induced through the various Markov
properties of G, as outlined above (see also Lauritzen, 1996, for an in-depth presentation). Hence, if P(x)
factorizes over G and G possesses a Markov property, then conditional independence relationships for subsets
of x hold under P. The reverse direction, namely that if P satisfies a Markov property with respect to G then
it also factorizes with respect to G, is established by the celebrated Hammersley-Clifford theorem, provided
that P(x) possesses a positive and continuous density (Besag, 1974).

E Notes on Implementation

Implementation of the neural network. For all synthetic and real data experiments, we use fully con-
nected MLPs (i.e., stacked modules consisting of Linear-ReLU-Dropout layers) with residual connections as
the underlying architecture to parameterize the βjk (·)’s. In particular, we adopt the implementation scheme
as mentioned in Remark 1, where we use a single neural network that takes z as the input (i.e., input dim
is q), but with a multi-dimensional output head that produces outputs for all βjk (·), j, k ∈ [p], k ̸= j (i.e.,
output dim is p(p − 1)).

Figure 6: Diagram for the β(z) : R^q → R^{p(p−1)} network; an MLP block consists of Linear-ReLU-Dropout layers.
The covariate z (input size q) is passed through stacked MLP blocks, with a copy of z concatenated to the hidden
representation, and a single output head of size p(p − 1) produces all entries β_{j,k}(z), j ≠ k.
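A minimal PyTorch sketch of this architecture is given below; the class name, default layer sizes, and dropout rate are placeholders (see Table 7 for the values actually used), and the exact residual/concatenation scheme in the released code may differ.

```python
import torch
import torch.nn as nn

class CovariateGraphNet(nn.Module):
    """Maps the covariate z in R^q to all off-diagonal entries beta_{jk}(z), of size p(p-1)."""
    def __init__(self, q: int, p: int, hidden_sizes=(128, 64), dropout: float = 0.3):
        super().__init__()
        blocks, in_dim = [], q
        for h in hidden_sizes:
            blocks.append(nn.Sequential(nn.Linear(in_dim, h), nn.ReLU(), nn.Dropout(dropout)))
            in_dim = h + q          # a copy of z is concatenated back in after each block
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Linear(in_dim, p * (p - 1))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = z
        for block in self.blocks:
            h = torch.cat([block(h), z], dim=-1)   # residual-style connection via concatenation
        return self.head(h)                        # shape (batch, p * (p - 1))

# Example usage with the fMRI setting (q = 1, p = 268):
net = CovariateGraphNet(q=1, p=268)
beta = net(torch.randn(32, 1))                     # (32, 268 * 267)
```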

Overall, we observe that the above-mentioned implementation scheme is fairly robust to the choice of hyper-
parameters; in addition, it shows a material improvement in performance, when compared with a single MLP
block without the residual connection, once q becomes moderately large. Other "intermediate" sharing-
backbone schemes have also been experimented with, yet their performance is generally inferior and more sensitive to
the choice of hyper-parameters. During training, we fix the batch size at 512 with gradients clipped at 1.0, while
other hyper-parameters are tuned over a hyper-grid, with the best set determined based on the performance on
a validation set of 1000 samples, using MSE as the criterion. All experiments are run on an NVIDIA RTX
A5000 GPU.
Table 7: Hyper-parameters for the MLPs and model training.

     hidden layer sizes / dropout   learning rate   scheduler type   scheduler step size (milestones) / decay   epochs
G1   [128, 64], [128] / 0.3         0.0005          StepLR           20 / 0.25                                  50
G2   [128, 64], [128] / 0.3         0.0005          StepLR           20 / 0.25                                  80
N1   [128, 64], [128] / 0.3         0.0005          StepLR           20 / 0.25                                  50
N2   [128, 64], [128] / 0.3         0.0005          StepLR           20 / 0.25                                  80
D1   [64, 32], [64] / 0.1           0.0005          StepLR           20 / 0.25                                  80
D2   [64, 32], [64] / 0.1           0.0005          StepLR           20 / 0.25                                  80

In regards to the benchmarking methods, for glasso and nodewise Lasso, we rely on the implementation in R
package huge18 (Zhao et al., 2012); for RegGMM, we did not find any publicly available packages/modules
and thus implemented the method ourselves leveraging PyTorch, where β(z) effectively reduces to a net-
work with a single linear layer.
18 https://cran.r-project.org/package=huge

Data and code availability. The resting-state fMRI scans dataset was provided by the authors of Lee et al.
(2023). The S&P 100 Index constituents dataset can be collected from Yahoo!Finance19 , with the list of tick-
ers corresponding to the constituents available through Wikipedia20 . The code repository containing all the
implementation is available at https://github.com/GeorgeMichailidis/covariate-dependent-graphical-model.

19 https://finance.yahoo.com
20 https://en.wikipedia.org/wiki/S%26P_100#Components
