A Stepwise Approach For High-Dimensional Gaussian Graphical Models
Abstract
We present a stepwise approach to estimate high-dimensional Gaussian graphical
models. We exploit the relation between the partial correlation coefficients and
the distribution of the prediction errors, and parametrize the model in terms of the
Pearson correlation coefficients between the prediction errors of the nodes’ best linear
predictors. We propose a novel stepwise algorithm for detecting pairs of conditionally
dependent variables. We show that the proposed algorithm outperforms existing
methods such as the graphical lasso and CLIME in simulation studies and real life
applications. In our comparison we report different performance measures that look at
different desirable features of the recovered graph and consider several model settings.
Keywords: Covariance Selection; Gaussian Graphical Model; Forward and Backward Se-
lection; Partial Correlation Coefficient.
Ginette Lafit is Postdoctoral Research Fellow, Research Group of Quantitative Psychology and Individual Differences,
KU Leuven, Leuven, Belgium (E-mail: [email protected]); Francisco J. Nogales is Professor,
Department of Statistics and UC3M-BS Institute of Financial Big Data, Universidad Carlos III de Madrid, Spain (E-mail:
[email protected]); Ruben H. Zamar is Professor, Department of Statistics, University of British Columbia, 3182
Earth Sciences Building, 2207 Main Mall, Vancouver, BC V6T 1Z4, Canada (E-mail: [email protected]); and Marcelo Ruiz
is Professor, Departamento de Matemática, FCEFQyNat, Universidad Nacional de Río Cuarto, Córdoba, Argentina (E-mail:
[email protected]).
1 Introduction
High-dimensional Gaussian graphical models (GGM) are widely used in practice to repre-
sent the linear dependency between variables. The underlying idea in GGM is to measure
linear dependencies by estimating partial correlations to infer whether there is an associ-
ation between a given pair of variables, conditionally on the remaining ones. Moreover,
there is a close relation between the nonzero partial correlation coefficients and the nonzero
entries in the inverse of the covariance matrix. Covariance selection procedures take ad-
vantage of this fact to estimate the GGM conditional dependence structure given a sample
(Dempster, 1972; Lauritzen, 1996; Edwards, 2000).
When the dimension p is larger than the number n of observations, the sample covariance
matrix S is not invertible and the maximum likelihood estimate (MLE) of Σ does not exist.
When p/n is smaller than but close to 1, S is invertible but ill-conditioned, increasing the estimation
error (Ledoit and Wolf, 2004). To deal with this problem, several covariance selection
procedures have been proposed based on the assumption that the inverse of the covariance
matrix, Ω, called precision matrix, is sparse.
We present an approach to perform covariance selection in a high dimensional GGM
based on a forward-backward algorithm called graphical stepwise (GS). Our procedure
takes advantage of the relation between the partial correlation and the Pearson correlation
coefficient of the residuals.
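As a small base R illustration of this relation (with an arbitrary equicorrelation covariance matrix of our own choosing), the Pearson correlation between the residuals from regressing $X_i$ and $X_l$ on the remaining variables agrees, up to sampling error, with the partial correlation $-\omega_{il}/\sqrt{\omega_{ii}\omega_{ll}}$ obtained from the precision matrix:

```r
set.seed(1)
p <- 4; n <- 5000
Sigma <- matrix(0.4, p, p); diag(Sigma) <- 1          # arbitrary equicorrelation covariance
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)       # sample with covariance Sigma
X <- scale(X, center = TRUE, scale = FALSE)           # center the data

i <- 1; l <- 2; rest <- setdiff(1:p, c(i, l))
r_i <- resid(lm(X[, i] ~ X[, rest]))                  # residuals of X_i on the remaining variables
r_l <- resid(lm(X[, l] ~ X[, rest]))                  # residuals of X_l on the remaining variables

Omega <- solve(Sigma)                                 # true precision matrix
c(residual_cor = cor(r_i, r_l),
  partial_cor  = -Omega[i, l] / sqrt(Omega[i, i] * Omega[l, l]))
# The two numbers agree up to sampling error.
```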
Existing methods to estimate the GGM can be classified into three classes: nodewise
regression methods, maximum likelihood methods, and limited-order partial correlation
methods. The nodewise regression method was proposed by Meinshausen and Bühlmann
(2006). This method estimates a lasso regression for each node in the graph. See for
example Peng et al. (2009), Yuan (2010), Liu and Wang (2012), Zhou et al. (2011) and
Ren et al. (2015). Penalized likelihood methods include Yuan and Lin (2007), Banerjee
et al. (2008), Friedman et al. (2008), Johnson et al. (2011) and Ravikumar et al. (2011)
among others. Cai et al. (2011) propose an estimator called CLIME that estimates precision
matrices by solving the dual of an `1 penalized maximum likelihood problem. Limited order
partial correlation procedures use lower order partial correlations to test for conditional
independence relations. See Spirtes et al. (2000), Kalisch and Bühlmann (2007), Rütimann
et al. (2009), Liang et al. (2015) and Huang et al. (2016).
The rest of the article is organized as follows. Section 2 introduces the stepwise approach
along with some notation. Section 3 gives simulation results and a real data example.
Section 4 presents some concluding remarks. The Appendix gives a detailed description
of the cross-validation procedure used to determine the required parameters in our stepwise
algorithm and some additional results from our simulation study.
2 The stepwise approach

A graphical model (GM) is an undirected graph $G = (V, E)$, with vertex set $V = \{1, \dots, p\}$, one node for each variable, and edge set $E$ such that

$$(i, j) \notin E \text{ if and only if } X_i \perp\!\perp X_j \mid X_{V \setminus \{i,j\}}. \qquad (2.1)$$

Here $\perp\!\perp$ denotes conditional independence.
Given a node $i \in V$, its neighborhood $A_i$ is defined as

$$A_i = \{ l \in V : l \neq i \text{ and } (i, l) \in E \}. \qquad (2.2)$$

Notice that $A_i$ gives the nodes directly connected with $i$ and therefore a GM can be
effectively described by giving the system of neighborhoods $\{A_i\}_{i=1}^p$.
We further assume that $(X_1, \dots, X_p)^\top \sim N(0, \Sigma)$, where $\Sigma = (\sigma_{ij})_{i,j=1,\dots,p}$ is a positive-definite covariance matrix. In this case the graph is called a Gaussian graphical model (GGM). The matrix $\Omega = (\omega_{ij})_{i,j=1,\dots,p} = \Sigma^{-1}$ is called the precision matrix.
There exists an extensive literature on GM and GGM. For a detailed treatment of the
theory see for instance Lauritzen (1996), Edwards (2000), and Bühlmann and Van De Geer
(2011).
Fix a pair of nodes $(i, l)$, $i \neq l$, and partition the vector of variables as $X_1 = (X_i, X_l)^\top$ and $X_2 = X_{V \setminus \{i, l\}}$, with covariance matrix

$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad (2.3)$$

such that $\Sigma_{11}$ has dimension $2 \times 2$, $\Sigma_{12}$ has dimension $2 \times (p - 2)$ and so on. The matrix in
(2.3) is a partition of a permutation of the original covariance matrix $\Sigma$, and will also be
denoted by $\Sigma$, after a small abuse of notation.
Moreover, we set

$$\Omega = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}.$$

Then, by (B.2) of Lauritzen (1996), the blocks $\Omega_{ij}$ can be written explicitly in terms of
$\Sigma_{ij}$ and $\Sigma_{ij}^{-1}$. In particular

$$\Omega_{11} = \left( \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right)^{-1}, \quad \text{where} \quad \Omega_{11} = \begin{pmatrix} \omega_{ii} & \omega_{il} \\ \omega_{li} & \omega_{ll} \end{pmatrix}$$

is the submatrix of $\Omega$ with rows $i$ and $l$ and columns $i$ and $l$. Hence,

$$\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} = \Omega_{11}^{-1} = \frac{1}{\omega_{ii}\omega_{ll} - \omega_{il}\omega_{li}} \begin{pmatrix} \omega_{ll} & -\omega_{il} \\ -\omega_{li} & \omega_{ii} \end{pmatrix}.$$
This gives the standard parametrization of $E$ in terms of the support of the precision matrix.
Alternatively, let $\hat X_1 = \beta^\top X_2$ be the best linear predictor of $X_1$ based on $X_2$, and define the prediction error

$$\varepsilon = X_1 - \hat X_1 = X_1 - \beta^\top X_2,$$

and let $\varepsilon_i$ and $\varepsilon_l$ denote the entries of $\varepsilon$ (i.e. $\varepsilon^\top = (\varepsilon_i, \varepsilon_l)$). The regression error $\varepsilon$ is
independent of $\hat X_1$ and has a normal distribution with mean 0 and covariance matrix

$$\Psi_{11} = \operatorname{cov}(X_1) + \operatorname{cov}(\hat X_1) - 2\operatorname{cov}(X_1, \hat X_1).$$
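A quick numerical check of the displayed identity $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Omega_{11}^{-1}$ in base R, with an arbitrary positive-definite covariance matrix of our own choosing:

```r
set.seed(2)
p <- 6
A <- matrix(rnorm(p * p), p, p)
Sigma <- crossprod(A) + diag(p)                 # arbitrary positive-definite covariance
Omega <- solve(Sigma)

idx1 <- c(1, 2); idx2 <- 3:p                    # rows/columns (i, l) and the rest
schur <- Sigma[idx1, idx1] -
  Sigma[idx1, idx2] %*% solve(Sigma[idx2, idx2]) %*% Sigma[idx2, idx1]
max(abs(schur - solve(Omega[idx1, idx1])))      # numerically zero
```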
Summarizing, the problem of determining the conditional dependence structure in a GGM
(represented by $E$) is equivalent to finding the pairs of nodes of $V$ with nonzero partial correlation,
a set which is equal to the support of the precision matrix, $\operatorname{supp}(\Omega)$, defined by (2.6).
Remark 1. As noticed above, under normality, partial and conditional correlation are the
same. However, in general they are different concepts (Lawrance, 1976).
Remark 2. Let $\beta_{i,l}$ be the regression coefficient of $X_l$ in the regression of $X_i$ versus $X_{V\setminus\{i\}}$
and, similarly, let $\beta_{l,i}$ be the regression coefficient of $X_i$ in the regression of $X_l$ versus
$X_{V\setminus\{l\}}$. Then it follows that $\rho_{il\cdot V\setminus\{i,l\}} = \operatorname{sign}(\beta_{l,i})\sqrt{\beta_{l,i}\,\beta_{i,l}}$. This allows for another popular
parametrization for $E$. Moreover, let $\epsilon_i$ be the error term in the regression of the $i$-th variable
on the remaining ones. Then by Lemma 1 in Peng et al. (2009) we have that $\operatorname{cov}(\epsilon_i, \epsilon_l) = \omega_{il}/(\omega_{ii}\omega_{ll})$ and $\operatorname{var}(\epsilon_i) = 1/\omega_{ii}$.
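These identities are easy to check numerically; the base R snippet below (with an arbitrary AR(1)-type covariance matrix of our own choosing) compares sample estimates with the values given by Lemma 1:

```r
set.seed(3)
p <- 4; n <- 20000
Sigma <- 0.4^abs(outer(1:p, 1:p, "-"))          # arbitrary AR(1)-type covariance
Omega <- solve(Sigma)
X <- scale(matrix(rnorm(n * p), n, p) %*% chol(Sigma), scale = FALSE)

# Errors from regressing each variable on all of the remaining ones
eps <- sapply(1:p, function(i) resid(lm(X[, i] ~ X[, -i])))

rbind(empirical = c(cov = cov(eps[, 1], eps[, 2]), var = var(eps[, 1])),
      lemma     = c(Omega[1, 2] / (Omega[1, 1] * Omega[2, 2]), 1 / Omega[1, 1]))
# Rows agree up to sampling error.
```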
Conditionally on its neighbors, $X_i$ is independent of all the other variables. Formally, for all $i$,

$$\text{if } l \notin A_i \text{ and } l \neq i \text{ then } X_i \perp\!\perp X_l \mid X_{A_i}. \qquad (2.9)$$
The graphical stepwise algorithm (GSA) iterates two basic steps: the forward and the backward steps. In the forward step, the algorithm adds a new edge $(j_0, l_0)$
if the largest absolute empirical partial correlation between the variables $X_{j_0}$, $X_{l_0}$ is above
the given threshold $\alpha_f$. In the backward step the algorithm deletes an edge $(j_0, l_0)$ if the
smallest absolute empirical partial correlation between the variables $X_{j_0}$, $X_{l_0}$ is below the
given threshold $\alpha_b$. A step-by-step description of GSA is as follows:
Input: the (centered) data $\{x_1, \dots, x_n\}$, and the forward and backward thresholds $\alpha_f$ and $\alpha_b$.

Initialization. Set $\hat A_j^0 = \emptyset$ for $j = 1, \dots, p$ (the algorithm begins with a family of empty neighborhoods).

Iteration Step. Given $\hat A_1^k, \hat A_2^k, \dots, \hat A_p^k$ we compute $\hat A_1^{k+1}, \hat A_2^{k+1}, \dots, \hat A_p^{k+1}$ as follows.
Forward. For each $j \in V$ and each $l \notin \hat A_j^k$, $l \neq j$, do the following.

(a) Regress the $j$-th variable on the variables with subscript in the set $\hat A_j^k$ and compute the regression residuals $\mathbf{e}_j^k = (e_{1j}^k, e_{2j}^k, \dots, e_{nj}^k)$.

(b) Regress the $l$-th variable on the variables with subscript in the set $\hat A_l^k$ and compute the regression residuals $\mathbf{e}_l^k = (e_{1l}^k, e_{2l}^k, \dots, e_{nl}^k)$.

(c) Obtain the partial correlation $f_{jl}^k$ by calculating the Pearson correlation between $\mathbf{e}_j^k$ and $\mathbf{e}_l^k$.

If
$$\max_{j \in V,\; l \notin \hat A_j^k} \big|f_{jl}^k\big| = \big|f_{j_0 l_0}^k\big| \ge \alpha_f,$$
add the edge $(j_0, l_0)$: set $\hat A_{j_0}^{k+1} = \hat A_{j_0}^k \cup \{l_0\}$, $\hat A_{l_0}^{k+1} = \hat A_{l_0}^k \cup \{j_0\}$, and $\hat A_j^{k+1} = \hat A_j^k$ for all other $j$, and go to the backward step. If instead
$$\max_{j \in V,\; l \notin \hat A_j^k} \big|f_{jl}^k\big| < \alpha_f, \text{ stop.}$$
Backward. For each $j = 1, \dots, p$ and each $l \in \hat A_j^{k+1}$ do the following.

(a) Regress the $j$-th variable on the variables with subscript in the set $\hat A_j^{k+1} \setminus \{l\}$ and compute the regression residuals $\mathbf{r}_j^k = (r_{1j}^k, r_{2j}^k, \dots, r_{nj}^k)$.

(b) Regress the $l$-th variable on the variables with subscript in the set $\hat A_l^{k+1} \setminus \{j\}$ and compute the regression residuals $\mathbf{r}_l^k = (r_{1l}^k, r_{2l}^k, \dots, r_{nl}^k)$.

(c) Compute the partial correlation $b_{jl}^k$ by calculating the Pearson correlation between $\mathbf{r}_j^k$ and $\mathbf{r}_l^k$.

If
$$\min_{j \in V,\; l \in \hat A_j^{k+1}} \big|b_{jl}^k\big| = \big|b_{j_0 l_0}^k\big| \le \alpha_b,$$
delete the edge $(j_0, l_0)$: remove $l_0$ from $\hat A_{j_0}^{k+1}$ and $j_0$ from $\hat A_{l_0}^{k+1}$.
Output. The final system of estimated neighborhoods $\hat A_1, \dots, \hat A_p$, which determines the estimated edge set.
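The sketch below is our base R reading of this pseudocode, not the authors' implementation (which is available by request); the helper names, the single-edge backward deletion per iteration and the max_steps safeguard are our own choices.

```r
# Residuals of column j of x regressed (without intercept; x is centered) on the
# columns indexed by nbrs; the centered variable itself if nbrs is empty.
resid_on <- function(x, j, nbrs) {
  if (length(nbrs) == 0) return(x[, j])
  resid(lm(x[, j] ~ x[, nbrs] - 1))
}

graphical_stepwise <- function(x, alpha_f, alpha_b, max_steps = 500) {
  p <- ncol(x)
  A <- vector("list", p)                      # start with empty neighborhoods
  for (k in seq_len(max_steps)) {
    ## Forward: largest absolute residual correlation among non-edges
    best <- c(0, NA, NA)
    for (j in 1:(p - 1)) {
      ej <- resid_on(x, j, A[[j]])
      for (l in (j + 1):p) {
        if (l %in% A[[j]]) next
        f <- abs(cor(ej, resid_on(x, l, A[[l]])))
        if (f > best[1]) best <- c(f, j, l)
      }
    }
    if (best[1] < alpha_f) break              # forward threshold not reached: stop
    j0 <- best[2]; l0 <- best[3]
    A[[j0]] <- c(A[[j0]], l0)                 # add the edge (j0, l0)
    A[[l0]] <- c(A[[l0]], j0)

    ## Backward: smallest absolute residual correlation among current edges
    worst <- c(Inf, NA, NA)
    for (j in 1:p) {
      for (l in A[[j]]) {
        if (l <= j) next                      # visit each edge once
        rj <- resid_on(x, j, setdiff(A[[j]], l))
        rl <- resid_on(x, l, setdiff(A[[l]], j))
        b  <- abs(cor(rj, rl))
        if (b < worst[1]) worst <- c(b, j, l)
      }
    }
    if (worst[1] <= alpha_b) {                # delete the weakest edge
      j0 <- worst[2]; l0 <- worst[3]
      A[[j0]] <- setdiff(A[[j0]], l0)
      A[[l0]] <- setdiff(A[[l0]], j0)
    }
  }
  A                                           # estimated neighborhoods
}
```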
2.4 Selection of the thresholds by cross-validation

We randomly partition the data into $K$ subsets of approximately equal sizes, the $t$-th subset being of size $n_t \ge 2$ with $\sum_{t=1}^{K} n_t = n$. For every $t$, let $\{x_i^{(t)}\}_{1 \le i \le n_t}$ be the $t$-th validation subset, and its complement $\{\tilde x_i^{(t)}\}_{1 \le i \le n - n_t}$ the $t$-th training subset. For every $t$ and for every pair $(\alpha_f, \alpha_b)$ of threshold parameters, let $\hat A_1^{(t)}, \dots, \hat A_p^{(t)}$ be the estimated neighborhoods given by GSA using the $t$-th training subset. For every $j = 1, \dots, p$, let $\hat\beta_{\hat A_j^{(t)}}$ be the estimated coefficient of the regression of the variable $X_j$ on the neighborhood $\hat A_j^{(t)}$.

Consider now the $t$-th validation subset. For every $j$, using $\hat\beta_{\hat A_j^{(t)}}$, we obtain the vector of predicted values $\hat X_j^{(t)}(\alpha_f, \alpha_b)$. If $\hat A_j^{(t)} = \emptyset$ we predict each observation of $X_j$ by zero (recall that the data are centered).
The cross-validation criterion $CV(\alpha_f, \alpha_b)$ aggregates the squared prediction errors $\|X_j^{(t)} - \hat X_j^{(t)}(\alpha_f, \alpha_b)\|^2$ over nodes $j = 1, \dots, p$ and folds $t = 1, \dots, K$, where $\|\cdot\|$ denotes the L2 norm or Euclidean distance. Hence the K-fold cross-validation forward-backward thresholds $(\hat\alpha_f, \hat\alpha_b)$ are

$$(\hat\alpha_f, \hat\alpha_b) := \operatorname*{argmin}_{(\alpha_f, \alpha_b) \in H} CV(\alpha_f, \alpha_b),$$

where $H$ is a grid of ordered pairs $(\alpha_f, \alpha_b)$ in $[0, 1] \times [0, 1]$ over which we perform the search.
For a detailed description see the Appendix.
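For concreteness, a base R sketch of this grid search, reusing graphical_stepwise() and resid_on() from the sketch above; the criterion coded below (sum of squared prediction errors over folds and nodes, with zero prediction for empty neighborhoods) is our reading of $CV(\alpha_f, \alpha_b)$ and may differ from the authors' exact normalization.

```r
# K-fold cross-validation over a grid H of (alpha_f, alpha_b) pairs (assumes centered x).
cv_gs <- function(x, H, K = 5) {
  n <- nrow(x); p <- ncol(x)
  fold <- sample(rep(1:K, length.out = n))      # random fold assignment
  cv_err <- numeric(nrow(H))
  for (h in seq_len(nrow(H))) {
    for (t in 1:K) {
      xtr <- x[fold != t, , drop = FALSE]       # t-th training subset
      xva <- x[fold == t, , drop = FALSE]       # t-th validation subset
      A <- graphical_stepwise(xtr, H[h, 1], H[h, 2])
      for (j in 1:p) {
        if (length(A[[j]]) == 0) {
          pred <- 0                             # empty neighborhood: predict by zero
        } else {
          beta <- coef(lm(xtr[, j] ~ xtr[, A[[j]]] - 1))
          pred <- xva[, A[[j]], drop = FALSE] %*% beta
        }
        cv_err[h] <- cv_err[h] + sum((xva[, j] - pred)^2)
      }
    }
  }
  H[which.min(cv_err), ]                        # thresholds minimizing the CV criterion
}

# One possible grid (illustrative choice of values):
# H <- expand.grid(alpha_f = seq(0.05, 0.5, length.out = 10),
#                  alpha_b = seq(0.01, 0.3, length.out = 10))
```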
2.5 Example
To illustrate the algorithm we consider the GGM with 16 edges given in the first panel
of Figure 1. We draw n = 1000 independent observations from this model (see the next
section for details). The values for the threshold parameters $\alpha_f = 0.17$ and $\alpha_b = 0.09$ are
determined by 5-fold cross-validation. The figure also displays the selected pairs of edges
at each step in a sequence of successive updates of $\hat A_j^k$, for k = 1, 4, 9, 12 and the final step
k = 16, showing that the estimated graph is identical to the true graph.
Figure 1: The true graph with 16 edges (first panel) and the graphs estimated by GS at steps k = 1, 4, 9, 12 and the final step k = 16.
3.1 Monte Carlo simulation study
Simulated Models
We consider three dimension values p = 50, 100, 150 and three different models for Ω:
Model 1. Autoregressive model of order 1, denoted AR(1). In this case $\Sigma_{ij} = 0.4^{|i-j|}$
for $i, j = 1, \dots, p$.
Model 2. Nearest neighbors model of order 2, denoted NN(2). For each node we
randomly select two neighbors and choose a pair of symmetric entries of Ω using the
NeighborOmega function of the R package Tlasso.
Model 3. Block diagonal matrix model with q blocks of size p/q, denoted BG. For
p = 50, 100 and 150, we use q = 10, 20 and 30 blocks, respectively. Each block, of
size p/q = 5, has diagonal elements equal to 1 and off-diagonal elements equal to 0.5.
For each p and each model we generate R = 50 random samples of size n = 100. These
graph models are widely used in the genetics literature to model gene expression data. See
for example Lee and Liu (2015) and Li and Gui (2006). Figure 2 displays graphs from
Models 1-3 with p = 100 nodes.
Figure 2: Graphs of AR(1), NN(2) and BG graphical models for p = 100 nodes.
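Models 1 and 3 can be written down directly; the base R snippet below does so for p = 100 and draws a sample from the AR(1) model (the block structure of Model 3 is taken here to describe Ω, since the models are stated for the precision matrix; Model 2 relies on Tlasso::NeighborOmega and is not reproduced).

```r
p <- 100; n <- 100

## Model 1, AR(1): Sigma_ij = 0.4^|i - j|; its inverse is the precision matrix.
Sigma_ar1 <- 0.4^abs(outer(1:p, 1:p, "-"))
Omega_ar1 <- solve(Sigma_ar1)

## Model 3, BG: block-diagonal matrix with q = p/5 blocks of size 5,
## diagonal entries 1 and off-diagonal entries 0.5 within each block.
block <- matrix(0.5, 5, 5); diag(block) <- 1
Omega_bg <- kronecker(diag(p / 5), block)

## A sample of size n from the AR(1) model
X_ar1 <- matrix(rnorm(n * p), n, p) %*% chol(Sigma_ar1)
```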
Methods
We compare the performance of GS with the graphical lasso (Glasso) and the constrained
$\ell_1$-minimization for inverse matrix estimation (CLIME), proposed by Friedman et al. (2008)
and Cai et al. (2011), respectively. The methods compared in our simulation
study are:
1. The proposed method GS with the forward and backward thresholds $(\alpha_f, \alpha_b)$ estimated by 5-fold cross-validation on a grid of 20 values in $[0, 1] \times [0, 1]$, as described in Subsection 2.4. The computing algorithm is available by request.

2. Glasso (Friedman et al., 2008), which maximizes the $\ell_1$-penalized Gaussian log-likelihood. In our simulations and examples we use the R-package CVglasso with the tuning parameter $\lambda$ selected by 5-fold cross-validation (the package default).

3. CLIME (Cai et al., 2011), which solves

$$\min \|\Omega\|_1 \quad \text{subject to} \quad |S\Omega - I|_\infty \le \lambda,$$

where $S$ is the sample covariance, $I$ is the identity matrix, $|\cdot|_\infty$ is the elementwise $\ell_\infty$ norm, and $\lambda$ is a tuning parameter. For computations, we use the R-package clime with the tuning parameter $\lambda$ selected by 5-fold cross-validation (the package default).
To evaluate the ability of the methods to find the pairs of edges, for each replicate we
compute the Matthews correlation coefficient (Matthews, 1975)

$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}, \qquad (3.3)$$
the Specificity = TN/(TN + FP) and the Sensitivity = TP/(TP + FN), where TP, TN,
FP and FN are, in this order, the number of true positives, true negatives, false positives
and false negatives, regarding the identification of the nonzero off-diagonal elements of Ω.
Larger values of MCC, Sensitivity and Specificity indicate a better performance (Fan et al.,
2009; Baldi et al., 2000).
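A small base R helper of our own that computes these support-recovery measures from an estimated and a true precision matrix:

```r
# Support-recovery measures over the off-diagonal entries (each pair counted once);
# as.numeric() avoids integer overflow in the MCC denominator.
recovery_measures <- function(Omega_hat, Omega_true, tol = 1e-8) {
  ut  <- upper.tri(Omega_true)
  est <- abs(Omega_hat[ut])  > tol              # estimated nonzeros
  tru <- abs(Omega_true[ut]) > tol              # true nonzeros
  TP <- as.numeric(sum(est & tru));  TN <- as.numeric(sum(!est & !tru))
  FP <- as.numeric(sum(est & !tru)); FN <- as.numeric(sum(!est & tru))
  c(MCC = (TP * TN - FP * FN) /
      sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)),
    Sensitivity = TP / (TP + FN),
    Specificity = TN / (TN + FP))
}
```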
For every replicate, the performance of $\hat\Omega$ as an estimate of $\Omega$ is measured by $m_F = \|\hat\Omega - \Omega\|_F$ (where $\|\cdot\|_F$ denotes the Frobenius norm) and by the normalized Kullback-Leibler divergence

$$D_{KL} = \frac{1}{2}\left\{ \operatorname{tr}\left(\hat\Omega\,\Omega^{-1}\right) - \log\det\left[\hat\Omega\,\Omega^{-1}\right] - p \right\}.$$
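Similarly, a base R helper (ours) following the two displayed formulas:

```r
# Frobenius-norm error and the Kullback-Leibler divergence displayed above.
estimation_measures <- function(Omega_hat, Omega_true) {
  M <- Omega_hat %*% solve(Omega_true)          # Omega_hat %*% Omega^{-1}
  c(mF  = norm(Omega_hat - Omega_true, type = "F"),
    DKL = 0.5 * (sum(diag(M)) -
                 as.numeric(determinant(M, logarithm = TRUE)$modulus) - ncol(M)))
}
```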
Results
Table 1 shows the MCC performance for the three methods under Models 1-3. GS
clearly outperforms the other two methods while CLIME just slightly outperforms Glasso.
Cai et al. (2011) pointed out that a procedure yielding a sparser $\hat\Omega$ is preferable
because this facilitates interpretation of the data. The sensitivity and specificity results,
reported in Table 4 in the Appendix, show that in general GS is sparser than CLIME
and Glasso, yielding fewer false positives (more specificity) but a few more false negatives
(less sensitivity). Table 2 shows that under models AR(1) and NN(2) the three methods
achieve fairly similar performances for estimating Ω. However, under model BG, GS clearly
outperforms the other two.
Figure 3 displays the heat-maps of the number of non-zero links identified in the 50
replications under model AR(1). Notice that among the three compared methods, the GS
sparsity patterns best match those of the true model. Figures 4 and 5 in the Appendix
lead to similar conclusions for models NN(2) and BG.
Table 1: Comparison of means and standard deviations (in brackets) of MCC over R = 50 replicates.
(a) p = 50
(b) p = 100
(c) p = 150
Figure 3: Model AR(1). Heatmaps of the frequency of the zeros identified for each entry of the precision matrix out of
R = 50 replicates. White color is 50 zeros identified out of 50 runs, and black is 0/50.
Table 2: Comparison of means and standard deviations (in brackets) of mF and mNKL over R = 50 replicates.
GS Glasso CLIME
Model p mNKL mF mNKL mF mNKL mF
50 0.70 3.82 0.64 3.90 0.63 3.91
(0.00) (0.00) (0.00) ( 0.02) (0.00) (0.01)
AR(1) 100 0.83 5.73 0.80 5.72 0.79 5.75
(0.00) (0.00) (0.00) (0.02) (0.00) (0.01)
150 1.25 7.16 1.17 7.21 1.17 7.25
(0.00) (0.00) (0.00) (0.02) (0.00) (0.01)
50 0.99 6.98 0.99 6.65 0.99 6.64
(0.00) (0.00) (0.00) (0.01) (0.00) (0.00)
NN(2) 100 0.10 10.11 1.00 9.64 1.00 9.601
(0.00) (0.00) (0.00) (0.009) (0.000) (0.005)
150 1.00 12.37 1.00 11.90 1.00 11.79
(0.00) (0.00) (0.00) (0.01) (0.00) (0.00)
BG 50 0.46 1.44 0.85 5.45 0.82 5.03
(0.00) (0.00) (0.00) (0.10) (0.00) (0.05)
100 0.71 2.94 0.93 9.16 0.92 8.71
(0.00) (0.00) (0.00) (0.07) (0.00) (0.02)
150 0.88 6.10 0.96 11.59 0.96 11.42
(0.00) (0.00) (0.00) (0.06) (0.00) (0.02)
3.2 Real data example

We analyze a breast cancer dataset in which each patient is classified either as achieving pathological complete response (pCR), or as belonging to the group of residual disease (RD), indicating that cancer still remains. The
data consist of 22283 gene expression levels for 133 patients, with 34 pCR and 99 RD.
Following Fan et al. (2009) and Cai et al. (2011) we randomly split the data into a training
set and a testing set. The testing set is formed by randomly selecting 5 pCR patients
and 16 RD patients (roughly 1/6 of the subjects) and the remaining patients form the
training set. From the training set, a two-sample t-test is performed to select the 50 most
significant genes. The data are then standardized using the standard deviation estimated
from the training set.
We apply a linear discriminant analysis (LDA) to predict whether a patient may achieve
pathological complete response (pCR), based on the estimated inverse covariance matrix
of the gene expression levels. We label with r = 1 the pCR group and r = 2 the RD group
and assume that data are normally distributed, with common covariance matrix Σ and
different means $\mu_r$. From the training set we obtain $\hat\mu_r$ and $\hat\Omega$, and for the test data compute

$$\delta_r(x) = x^\top \hat\Omega \hat\mu_r - \frac{1}{2}\hat\mu_r^\top \hat\Omega \hat\mu_r + \log \hat\pi_r, \qquad r = 1, 2, \qquad (3.4)$$

for each observation $x$ in the test set, where $\hat\pi_r$ is the proportion of group $r$ subjects in the training set. The classification rule assigns $x$ to the group with the larger value of $\delta_r(x)$.
For every method we use 5-fold cross validation on the training data to select the tuning
constants. We repeat this scheme 100 times.
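A base R sketch of this classification step, with our own function and variable names; it evaluates (3.4) for r = 1, 2 and assigns each test observation to the group with the larger score.

```r
# Linear discriminant scores built from an estimated precision matrix, following (3.4).
# x_train: training matrix; y_train in {1, 2}; x_test: test matrix; Omega_hat: p x p.
lda_predict <- function(x_train, y_train, x_test, Omega_hat) {
  scores <- sapply(1:2, function(r) {
    mu_r <- colMeans(x_train[y_train == r, , drop = FALSE])   # group mean
    pi_r <- mean(y_train == r)                                # group proportion
    drop(x_test %*% Omega_hat %*% mu_r) -
      0.5 * drop(t(mu_r) %*% Omega_hat %*% mu_r) + log(pi_r)  # delta_r(x)
  })
  max.col(scores)                 # assign each test row to the group with the larger score
}
```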
Table 3 displays the means and standard errors (in brackets) of Sensitivity, Specificity,
MCC and number of selected edges using $\hat\Omega$ over the 100 replications. Considering the
MCC, GS is slightly better than CLIME, and CLIME than Glasso. While the three methods
give similar performance considering the Specificity, GS and CLIME improve over Glasso
in terms of Sensitivity.
Table 3: Comparison of means and standard deviations (in brackets) of Sensitivity, Specificity, MCC and Number of selected
edges over 100 replications.
GS CLIME Glasso
Sensitivity 0.798 (0.02) 0.786 (0.02) 0.602 (0.02)
Specificity 0.784 (0.01) 0.788 (0.01) 0.767 (0.01)
MCC 0.520 (0.02) 0.516 (0.02) 0.334 (0.02)
Number of Edges 54 (2) 4823 (8) 2103 (76)
4 Concluding remarks
This paper introduces a stepwise procedure, called GS, to perform covariance selection in
high dimensional Gaussian graphical models. Our method uses a different parametrization
of the Gaussian graphical model based on Pearson correlations between the best-linear-
predictors prediction errors. The GS algorithm begins with a family of empty neighbor-
hoods and, using basic forward and backward steps, adds or deletes edges until appropriate
thresholds for each step are reached. These thresholds are automatically determined by
cross–validation.
GS is compared with Glasso and CLIME under different Gaussian graphical models
(AR(1), NN(2) and BG) and using different performance measures regarding network re-
covery and sparse estimation of the precision matrix Ω. GS is shown to have good support
recovery performance and to produce simpler models than the other two methods (i.e. GS
is a parsimonious estimation procedure).
We use GS for the analysis of breast cancer data and show that this method may be a
useful tool for applications in medicine and other fields.
Acknowledgements
The authors thank the generous support of NSERC, Canada, the Institute of Financial
Big Data, University Carlos III of Madrid and the CSIC, Spain.
A Appendix

A.1 The cross-validation procedure

For every fold $t$ and every node $j$, the vector of predicted values on the validation subset is $\hat X_j^{(t)} = X_{\hat A_j^{(t)}} \hat\beta_{\hat A_j^{(t)}}$, where $X_{\hat A_j^{(t)}}$ is the matrix with rows $(x_{il_1}^{(t)}, \dots, x_{il_q}^{(t)})$, $1 \le i \le n_t$, represented in (A.2) (in blue colour). If the neighborhood $\hat A_j^{(t)} = \emptyset$ we define the predicted values to be zero.
The cross-validation criterion $CV(\alpha_f, \alpha_b)$ aggregates the squared prediction errors $\|X_j^{(t)} - \hat X_j^{(t)}(\alpha_f, \alpha_b)\|^2$ over nodes and folds, where $\|\cdot\|$ denotes the L2 norm or Euclidean distance. Hence the K-fold cross-validation forward-backward thresholds $(\hat\alpha_f, \hat\alpha_b)$ are

$$(\hat\alpha_f, \hat\alpha_b) := \operatorname*{argmin}_{(\alpha_f, \alpha_b) \in H} CV(\alpha_f, \alpha_b), \qquad (A.1)$$

where $H$ is a grid of ordered pairs $(\alpha_f, \alpha_b)$ in $[0, 1] \times [0, 1]$ over which we perform the search.
$$\begin{array}{c}
\text{$t$-th training subset} \\[4pt]
\begin{pmatrix}
\cdots & \tilde x_{1j}^{(t)} & \cdots & \tilde x_{1l_1}^{(t)} & \cdots & \tilde x_{1l_q}^{(t)} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\cdots & \tilde x_{n-n_t,\,j}^{(t)} & \cdots & \tilde x_{n-n_t,\,l_1}^{(t)} & \cdots & \tilde x_{n-n_t,\,l_q}^{(t)} & \cdots
\end{pmatrix} \\[10pt]
\text{$t$-th validation subset} \\[4pt]
\begin{pmatrix}
\cdots & x_{1j}^{(t)} & \cdots & x_{1l_1}^{(t)} & \cdots & x_{1l_q}^{(t)} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\cdots & x_{n_t j}^{(t)} & \cdots & x_{n_t l_1}^{(t)} & \cdots & x_{n_t l_q}^{(t)} & \cdots
\end{pmatrix}
\end{array} \qquad (A.2)$$
Remark 3. Matrix (A.2) represents, for every node $j$, the comparison between estimated and predicted values for cross-validation. The coefficient $\hat\beta_{\hat A_j^{(t)}}$ is computed using the observations $\tilde X_j = (\tilde x_{1j}^{(t)}, \dots, \tilde x_{n-n_t,\,j}^{(t)})^\top$ and the matrix $\tilde X_{\hat A_j^{(t)}}$ with rows $(\tilde x_{il_1}^{(t)}, \dots, \tilde x_{il_q}^{(t)})$, $i = 1, \dots, n - n_t$, in the $t$-th training subset (red colour). Based on the $t$-th validation set, $\hat X_j^{(t)}$ is computed using $X_{\hat A_j^{(t)}}$ and $\hat\beta_{\hat A_j^{(t)}}$.
Table 4: Comparison of means and standard deviations (in brackets) of Specificity, Sensitivity and MCC over R = 50
replicates.
GS Glasso CLIME
Model p Sensitivity Specificity MCC Sensitivity Specificity MCC Sensitivity Specificity MCC
50 0.756 0.988 0.741 0.994 0.823 0.419 0.988 0.891 0.492
(0.015) (0.002) (0.009) (0.002) (0.012) (0.016) (0.002) (0.003) (0.006)
AR(1) 100 0.632 0.999 0.751 0.989 0.897 0.433 0.983 0.934 0.464
(0.007) (0.000) (0.004) (0.002) (0.009) (0.020) (0.002) (0.001) (0.004)
150 0.607 0.999 0.730 0.981 0.943 0.474 0.972 0.964 0.499
(0.006) (0.000) (0.004) (0.002) (0.007) (0.017) (0.002) (0.001) (0.003)
50 0.632 0.999 0.751 0.971 0.864 0.404 0.984 0.875 0.401
(0.007) (0.000) (0.004 ) (0.004) (0.010) (0.014) (0.003) (0.004) (0.007)
NN(2) 100 0.730 0.999 0.802 0.987 0.924 0.382 0.985 0.937 0.407
(0.008) (0.000) (0.005) (0.002) (0.004) (0.006) (0.002) (0.001) (0.005)
150 0.555 0.999 0.695 0.952 0.936 0.337 0.934 0.965 0.425
(0.017) (0.000) (0.007) (0.004) (0.002) (0.008) ( 0.003) (0.001) (0.003)
50 0.994 0.981 0.898 0.867 0.697 0.356 0.962 0.807 0.482
(0.002) (0.001) (0.005) (0.032) (0.021) (0.009) (0.004) (0.005) (0.005)
BG 100 0.949 0.989 0.857 0.569 0.908 0.348 0.818 0.920 0.4615
(0.007) (0.000) (0.005) (0.039) (0.011) ( 0.004) (0.005) (0.005) (0.002)
150 0.782 0.994 0.780 0.426 0.952 0.314 0.626 0.959 0.408
(0.021) (0.000) (0.008) (0.035) (0.006) (0.003) (0.006) (0.001) (0.003)
A.2 Complementary simulation results
(a) p = 50
(b) p = 100
(c) p = 150
Figure 4: Model NN(2). Heatmaps of the frequency of the zeros identified for each entry of the precision matrix out of
R = 50 replications. White color is 50 zeros identified out of 50 runs, and black is 0/50.
(a) p = 50
(b) p = 100
(c) p = 150
Figure 5: Model BG. Heatmaps of the frequency of the zeros identified for each entry of the precision matrix out of R = 50
replications. White color is 50 zeros identified out of 50 runs, and black is 0/50.
References
Anderson, T. (2003). An Introduction to Multivariate Statistical Analysis. John Wiley.
Baldi, P., S. Brunak, Y. Chauvin, C. Andersen, and H. Nielsen (2000). Assessing the
accuracy of prediction algorithms for classification: An overview. Bioinformatics 16 (5),
412–424.
Banerjee, O., L. El Ghaoui, and A. d’Aspremont (2008). Model selection through sparse
maximum likelihood estimation for multivariate gaussian or binary data. The Journal
of Machine Learning Research 9, 485–516.
Bühlmann, P. and S. Van De Geer (2011). Statistics for high-dimensional data: methods,
theory and applications. Springer Science & Business Media.
Cai, T., W. Liu, and X. Luo (2011). A constrained `1 minimization approach to sparse
precision matrix estimation. Journal of the American Statistical Association 106 (494),
594–607.
Fan, J., Y. Feng, and Y. Wu (2009). Network exploration via the adaptive lasso and scad
penalties. The Annals of Applied Statistics 3 (2), 521–541.
Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation
with the graphical lasso. Biostatistics 9 (3), 432–441.
Huang, S., J. Jin, and Z. Yao (2016). Partial correlation screening for estimating large
precision matrices, with applications to classification. The Annals of Statistics 44 (5),
2018–2057.
Johnson, C. C., A. Jalali, and P. Ravikumar (2011). High-dimensional sparse inverse
covariance estimation using greedy methods. arXiv preprint arXiv:1112.6411 .
Li, H. and J. Gui (2006). Gradient directed regularization for sparse gaussian concen-
tration graphs, with applications to inference of genetic networks. Biostatistics 7 (2),
302–317.
Lee, W. and Y. Liu (2015). Joint estimation of multiple precision matrices with common
structures. Journal of Machine Learning Research 16 (1), 1035–1062.
Liang, F., Q. Song, and P. Qiu (2015). An equivalent measure of partial correlation coeffi-
cients for high-dimensional gaussian graphical models. Journal of the American Statis-
tical Association 110 (511), 1248–1265.
Liu, H. and L. Wang (2012). Tiger: A tuning-insensitive approach for optimally estimating
gaussian graphical models. arXiv preprint arXiv:1209.2437 .
Meinshausen, N. and P. Bühlmann (2006). High-dimensional graphs and variable selection
with the lasso. The Annals of Statistics 34 (3), 1436–1462.
Peng, J., P. Wang, N. Zhou, and J. Zhu (2009). Partial correlation estimation by joint
sparse regression models. Journal of the American Statistical Association 104 (486),
735–746.
Ren, Z., T. Sun, C.-H. Zhang, H. H. Zhou, et al. (2015). Asymptotic normality and opti-
malities in estimation of large gaussian graphical models. The Annals of Statistics 43 (3),
991–1026.
Rütimann, P., P. Bühlmann, et al. (2009). High dimensional sparse covariance estimation
via directed acyclic graphs. Electronic Journal of Statistics 3, 1133–1160.
Spirtes, P., C. N. Glymour, and R. Scheines (2000). Causation, Prediction, and Search.
MIT press.
Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear pro-
gramming. The Journal of Machine Learning Research 11, 2261–2286.
Yuan, M. and Y. Lin (2007). Model selection and estimation in the gaussian graphical
model. Biometrika 94 (1), 19–35.