
A Stepwise Approach for High-Dimensional

Gaussian Graphical Models


Ginette LAFIT, Francisco J. NOGALES, Marcelo RUIZ and Ruben H. ZAMAR
arXiv:1808.06016v1 [stat.ME] 17 Aug 2018

Abstract
We present a stepwise approach to estimate high-dimensional Gaussian graphical
models. We exploit the relation between the partial correlation coefficients and
the distribution of the prediction errors, and parametrize the model in terms of the
Pearson correlation coefficients between the prediction errors of the nodes' best linear
predictors. We propose a novel stepwise algorithm for detecting pairs of conditionally
dependent variables. We show that the proposed algorithm outperforms existing
methods such as the graphical lasso and CLIME in simulation studies and real-life
applications. In our comparison we report several performance measures that capture
different desirable features of the recovered graph, and we consider several model settings.

Keywords: Covariance Selection; Gaussian Graphical Model; Forward and Backward Selection; Partial Correlation Coefficient.

Ginette Lafit is Postdoctoral Research Fellow, Research Group of Quantitative Psychology and Individual Differences,
KU Leuven, University of Leuven, Leuven, Belgium (E-mail: [email protected]). Francisco J. Nogales is Professor,
Department of Statistics and UC3M-BS Institute of Financial Big Data, Universidad Carlos III de Madrid, España (E-mail:
[email protected]). Ruben H. Zamar is Professor, Department of Statistics, University of British Columbia, 3182
Earth Sciences Building, 2207 Main Mall, Vancouver, BC V6T 1Z4, Canada (E-mail: [email protected]). Marcelo Ruiz
is Professor, Departamento de Matemática, FCEFQyNat, Universidad Nacional de Río Cuarto, Córdoba, Argentina (E-mail:
[email protected]).

1 Introduction
High-dimensional Gaussian graphical models (GGM) are widely used in practice to represent the linear dependency between variables. The underlying idea in GGM is to measure linear dependencies by estimating partial correlations to infer whether there is an association between a given pair of variables, conditionally on the remaining ones. Moreover, there is a close relation between the nonzero partial correlation coefficients and the nonzero entries in the inverse of the covariance matrix. Covariance selection procedures take advantage of this fact to estimate the GGM conditional dependence structure given a sample (Dempster, 1972; Lauritzen, 1996; Edwards, 2000).

When the dimension p is larger than the number n of observations, the sample covariance matrix S is not invertible and the maximum likelihood estimate (MLE) of Σ does not exist. When p/n is smaller than, but close to, 1, S is invertible but ill-conditioned, which increases the estimation error (Ledoit and Wolf, 2004). To deal with this problem, several covariance selection procedures have been proposed based on the assumption that the inverse of the covariance matrix, Ω, called the precision matrix, is sparse.
We present an approach to perform covariance selection in a high dimensional GGM
based on a forward-backward algorithm called graphical stepwise (GS). Our procedure
takes advantage of the relation between the partial correlation and the Pearson correlation
coefficient of the residuals.
Existing methods to estimate the GGM can be classified into three classes: nodewise regression methods, maximum likelihood methods and limited-order partial correlation methods. The nodewise regression method was proposed by Meinshausen and Bühlmann (2006). This method estimates a lasso regression for each node in the graph. See for example Peng et al. (2009), Yuan (2010), Liu and Wang (2012), Zhou et al. (2011) and Ren et al. (2015). Penalized likelihood methods include Yuan and Lin (2007), Banerjee et al. (2008), Friedman et al. (2008), Johnson et al. (2011) and Ravikumar et al. (2011), among others. Cai et al. (2011) propose an estimator called CLIME that estimates precision matrices by solving the dual of an ℓ1 penalized maximum likelihood problem. Limited-order partial correlation procedures use lower order partial correlations to test for conditional independence relations. See Spirtes et al. (2000), Kalisch and Bühlmann (2007), Rütimann et al. (2009), Liang et al. (2015) and Huang et al. (2016).
The rest of the article is organized as follows. Section 2 introduces the stepwise approach along with some notation. Section 3 gives simulation results and a real data example. Section 4 presents some concluding remarks. The Appendix gives a detailed description of the cross-validation procedure used to determine the required parameters in our stepwise algorithm and some additional results from our simulation study.

2 Stepwise Approach to Covariance Selection

2.1 Definitions and Notation


In this section we review some definitions and technical concepts needed later on. Let G = (V, E) be a graph where V ≠ ∅ is the set of nodes or vertices and E ⊆ V × V = V² is the set of edges. For simplicity we assume that V = {1, . . . , p}. We assume that the graph G is undirected, that is, (i, j) ∈ E if and only if (j, i) ∈ E. Two nodes i and j are called connected, adjacent or neighbors if (i, j) ∈ E.

A graphical model (GM) is a graph such that V indexes a set of variables {X_1, . . . , X_p} and E is defined by

$$(i, j) \notin E \quad \text{if and only if} \quad X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}. \qquad (2.1)$$

Here ⊥⊥ denotes conditional independence.

Given a node i ∈ V, its neighborhood A_i is defined as

$$A_i = \{l \in V \setminus \{i\} : (i, l) \in E\}. \qquad (2.2)$$

Notice that A_i gives the nodes directly connected with i and therefore a GM can be effectively described by giving the system of neighborhoods {A_i}_{i=1}^p.

We further assume that (X_1, . . . , X_p)^⊤ ∼ N(0, Σ), where Σ = (σ_ij)_{i,j=1,...,p} is a positive-definite covariance matrix. In this case the graph is called a Gaussian graphical model (GGM). The matrix Ω = (ω_ij)_{i,j=1,...,p} = Σ^{-1} is called the precision matrix.
There exists an extensive literature on GM and GGM. For a detailed treatment of the
theory see for instance Lauritzen (1996), Edwards (2000), and Bühlmann and Van De Geer
(2011).

2.2 Conditional dependence in a GGM


In a GGM the set of edges E represents the conditional dependence structure of the vector
(X1 , . . . , Xp ). To represent this dependence structure as a statistical model it is convenient
to find a parametrization for E.
In this subsection we introduce a convenient parametrization of E using well known
results from classical multivariate analysis. For an exhaustive treatment of these results
see, for instance, Anderson (2003), Cramér (1999), Lauritzen (1996) and Eaton (2007).
Given a subset A of V, X_A denotes the vector of variables with subscripts in A in increasing order. For a given pair of nodes (i, l), set X_1^⊤ = (X_i, X_l), X_2 = X_{V∖{i,l}} and X = (X_1^⊤, X_2^⊤)^⊤. Note that X has a multivariate normal distribution with mean 0 and covariance matrix

$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \qquad (2.3)$$

such that Σ_11 has dimension 2 × 2, Σ_12 has dimension 2 × (p − 2) and so on. The matrix in (2.3) is a partition of a permutation of the original covariance matrix Σ and, with a small abuse of notation, will also be denoted by Σ.

Moreover, we set

$$\Omega = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}.$$

Then, by (B.2) of Lauritzen (1996), the blocks Ω_ij can be written explicitly in terms of Σ_ij and Σ_ij^{-1}. In particular,

$$\Omega_{11} = \left( \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right)^{-1}, \quad \text{where} \quad \Omega_{11} = \begin{pmatrix} \omega_{ii} & \omega_{il} \\ \omega_{li} & \omega_{ll} \end{pmatrix}$$

is the submatrix of Ω with rows i and l and columns i and l. Hence,

$$\operatorname{cov}(X_1 \mid X_2) = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} = \Omega_{11}^{-1} = \frac{1}{\omega_{ii}\omega_{ll} - \omega_{il}\omega_{li}} \begin{pmatrix} \omega_{ll} & -\omega_{il} \\ -\omega_{li} & \omega_{ii} \end{pmatrix} \qquad (2.4)$$

and, in consequence, the partial correlation between X_i and X_l can be expressed as

$$\operatorname{corr}\left( X_i, X_l \mid X_{V \setminus \{i,l\}} \right) = -\frac{\omega_{il}}{\sqrt{\omega_{ii}\,\omega_{ll}}}. \qquad (2.5)$$

This gives the standard parametrization of E in terms of the support of the precision matrix,

$$\operatorname{supp}(\Omega) = \{(i, l) \in V^2 : i \neq l, \ \omega_{il} \neq 0\}. \qquad (2.6)$$

We now introduce another parametrization of E, which we need to define and implement our proposed method. We consider the regression error for the regression of X_1 on X_2,

$$\varepsilon = X_1 - \widehat{X}_1 = X_1 - \beta^{\top} X_2,$$

and let ε_i and ε_l denote the entries of ε (i.e. ε^⊤ = (ε_i, ε_l)). The regression error ε is independent of X̂_1 and has a normal distribution with mean 0 and covariance matrix Ψ_11 with elements denoted by

$$\Psi_{11} = \begin{pmatrix} \psi_{ii} & \psi_{il} \\ \psi_{li} & \psi_{ll} \end{pmatrix}. \qquad (2.7)$$

A straightforward calculation shows that

$$\Psi_{11} = \operatorname{cov}(X_1) + \operatorname{cov}(\widehat{X}_1) - 2\operatorname{cov}(X_1, \widehat{X}_1) = \Sigma_{11} + \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22}\Sigma_{22}^{-1}\Sigma_{21} - 2\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Omega_{11}^{-1}.$$

See Cramér (1999, Section 23.4).

Therefore, by this equality, (2.4) and (2.5), the partial correlation coefficient and the conditional correlation are equal:

$$\rho_{il \cdot V \setminus \{i,l\}} = \operatorname{corr}\left( X_i, X_l \mid X_{V \setminus \{i,l\}} \right) = \frac{\psi_{il}}{\sqrt{\psi_{ii}\,\psi_{ll}}}.$$

Summarizing, the problem of determining the conditional dependence structure in a GGM (represented by E) is equivalent to finding the pairs of nodes of V that belong to the set

$$\{(i, l) \in V^2 : i \neq l, \ \psi_{il} \neq 0\}, \qquad (2.8)$$

which is equal to the support of the precision matrix, supp(Ω), defined by (2.6).
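Indeed, writing out the inverse in (2.4) entrywise makes this equivalence explicit:

$$\psi_{il} = -\frac{\omega_{il}}{\omega_{ii}\omega_{ll} - \omega_{il}\omega_{li}}, \qquad \psi_{ii} = \frac{\omega_{ll}}{\omega_{ii}\omega_{ll} - \omega_{il}\omega_{li}}, \qquad \psi_{ll} = \frac{\omega_{ii}}{\omega_{ii}\omega_{ll} - \omega_{il}\omega_{li}},$$

so ψ_il ≠ 0 if and only if ω_il ≠ 0, and ψ_il/√(ψ_ii ψ_ll) = −ω_il/√(ω_ii ω_ll), in agreement with (2.5).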

Remark 1. As noticed above, under normality, partial and conditional correlation are the same. However, in general they are different concepts (Lawrance, 1976).

Remark 2. Let β_{i,l} be the regression coefficient of X_l in the regression of X_i versus X_{V∖{i}} and, similarly, let β_{l,i} be the regression coefficient of X_i in the regression of X_l versus X_{V∖{l}}. Then it follows that ρ_{il·V∖{i,l}} = sign(β_{l,i}) √(β_{l,i} β_{i,l}). This allows for another popular parametrization of E. Moreover, let ε_i be the error term in the regression of the ith variable on the remaining ones. Then by Lemma 1 in Peng et al. (2009) we have that cov(ε_i, ε_l) = ω_il/(ω_ii ω_ll) and var(ε_i) = 1/ω_ii.
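The equivalence between the two parametrizations is easy to check numerically. The following is a minimal R sketch, not taken from the paper's code; it assumes an AR(1)-type covariance and uses base R regressions to compare the Pearson correlation of the two regression residuals with the partial correlation obtained from Ω via (2.5).

```r
## Illustrative check: residual correlation vs. partial correlation from (2.5).
set.seed(123)
p <- 5; n <- 5000
Sigma <- 0.4^abs(outer(1:p, 1:p, "-"))          # Sigma_ij = 0.4^|i-j|
Omega <- solve(Sigma)                           # precision matrix
x <- matrix(rnorm(n * p), n, p) %*% chol(Sigma) # rows ~ N(0, Sigma)

i <- 1; l <- 2
others <- setdiff(1:p, c(i, l))
e_i <- residuals(lm(x[, i] ~ x[, others]))      # residual of X_i on X_{V \ {i,l}}
e_l <- residuals(lm(x[, l] ~ x[, others]))      # residual of X_l on X_{V \ {i,l}}

cor(e_i, e_l)                                   # empirical correlation of the errors
-Omega[i, l] / sqrt(Omega[i, i] * Omega[l, l])  # partial correlation from (2.5)
```

With n = 5000 the two quantities agree to about two decimal places.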

2.3 The Stepwise Algorithm

Conditionally on its neighbors, X_i is independent of all the other variables. Formally, for all i,

$$\text{if } l \notin A_i \text{ and } l \neq i, \text{ then } X_i \perp\!\!\!\perp X_l \mid X_{A_i}. \qquad (2.9)$$

Therefore, given a system of neighborhoods {A_i}_{i=1}^p and l ∉ A_i (and so i ∉ A_l), the partial correlation between X_i and X_l can be obtained by the following procedure: (i) regress X_i on X_{A_i} and compute the regression residual ε_i; regress X_l on X_{A_l} and compute the regression residual ε_l; (ii) calculate the Pearson correlation between ε_i and ε_l.

This reasoning motivates the graphical stepwise algorithm (GSA). It begins with the family of empty neighborhoods, Â_j^(0) = ∅ for each j ∈ V. There are two basic steps, the forward and the backward steps. In the forward step, the algorithm adds a new edge (j_0, l_0) if the largest absolute empirical partial correlation between the variables X_{j_0}, X_{l_0} is above the given threshold α_f. In the backward step the algorithm deletes an edge (j_0, l_0) if the smallest absolute empirical partial correlation between the variables X_{j_0}, X_{l_0} is below the given threshold α_b. A step-by-step description of GSA is as follows:

Graphical Stepwise Algorithm

Input: the (centered) data {x_1, ..., x_n}, and the forward and backward thresholds α_f and α_b.

Initialization (k = 0): set Â_1^0 = Â_2^0 = · · · = Â_p^0 = ∅.

Iteration step: given Â_1^k, Â_2^k, ..., Â_p^k, we compute Â_1^{k+1}, Â_2^{k+1}, ..., Â_p^{k+1} as follows.

Forward. For each j = 1, ..., p do the following.

For each l ∉ Â_j^k calculate the partial correlation f_{jl}^k as follows.

(a) Regress the jth variable on the variables with subscripts in the set Â_j^k and compute the regression residuals e_j^k = (e_{1j}^k, e_{2j}^k, ..., e_{nj}^k).

(b) Regress the lth variable on the variables with subscripts in the set Â_l^k and compute the regression residuals e_l^k = (e_{1l}^k, e_{2l}^k, ..., e_{nl}^k).

(c) Obtain the partial correlation f_{jl}^k by calculating the Pearson correlation between e_j^k and e_l^k.

If

$$\max_{l \notin \widehat{A}_j^k,\ j \in V} \left| f_{jl}^k \right| = \left| f_{j_0 l_0}^k \right| \geq \alpha_f,$$

set Â_{j_0}^{k+1} = Â_{j_0}^k ∪ {l_0}, Â_{l_0}^{k+1} = Â_{l_0}^k ∪ {j_0} and Â_l^{k+1} = Â_l^k for l ≠ j_0, l_0.

If

$$\max_{l \notin \widehat{A}_j^k,\ j \in V} \left| f_{jl}^k \right| = \left| f_{j_0 l_0}^k \right| < \alpha_f, \quad \text{stop.}$$

Backward. For each j = 1, ..., p do the following.

For each l ∈ Â_j^{k+1} calculate the partial correlation b_{jl}^k as follows.

(a) Regress the jth variable on the variables with subscripts in the set Â_j^{k+1} ∖ {l} and compute the regression residuals r_j^k = (r_{1j}^k, r_{2j}^k, ..., r_{nj}^k).

(b) Regress the lth variable on the variables with subscripts in the set Â_l^{k+1} ∖ {j} and compute the regression residuals r_l^k = (r_{1l}^k, r_{2l}^k, ..., r_{nl}^k).

(c) Compute the partial correlation b_{jl}^k by calculating the Pearson correlation between r_j^k and r_l^k.

If

$$\min_{l \in \widehat{A}_j^{k+1},\ j \in V} \left| b_{jl}^k \right| = \left| b_{j_0 l_0}^k \right| \leq \alpha_b,$$

set Â_{j_0}^{k+1} ← Â_{j_0}^{k+1} ∖ {l_0} and Â_{l_0}^{k+1} ← Â_{l_0}^{k+1} ∖ {j_0}.

Output

1. A collection of estimated neighborhoods Â_j, j = 1, ..., p.

2. The set of estimated edges Ê = {(i, l) ∈ V² : i ∈ Â_l}.

3. An estimate Ω̂ = (ω̂_il)_{i,l=1}^p of Ω, with ω̂_il defined as follows. In the case i = l, ω̂_ii = n/(e_i^⊤ e_i) for i = 1, ..., p, where e_i is the vector of the prediction errors in the regression of the ith variable on X_{Â_i}. In the case i ≠ l we must distinguish two cases: if l ∉ Â_i then ω̂_il = 0, otherwise ω̂_il = n e_i^⊤ e_l / ((e_i^⊤ e_i)(e_l^⊤ e_l)) (see Remark 2).
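For concreteness, a minimal R sketch of this forward-backward search is given below. It is only an illustration of the steps described above, not the authors' own code (which is available by request); the helper names resid_on and gs_stepwise are ours, and the backward scan is simplified to examine all current edges after each forward addition.

```r
## Residual of the regression of column j of x on the columns in `nbrs`
## (the centered column itself when the neighborhood is empty).
resid_on <- function(x, j, nbrs) {
  if (length(nbrs) == 0) return(x[, j] - mean(x[, j]))
  lm.fit(cbind(1, x[, nbrs, drop = FALSE]), x[, j])$residuals
}

gs_stepwise <- function(x, alpha_f, alpha_b, max_iter = 200) {
  p <- ncol(x)
  A <- vector("list", p)                 # neighborhoods, start empty
  for (k in seq_len(max_iter)) {
    ## Forward step: largest absolute residual correlation over pairs not yet linked.
    res  <- lapply(seq_len(p), function(j) resid_on(x, j, A[[j]]))
    best <- c(0, NA, NA)
    for (j in seq_len(p - 1)) for (l in (j + 1):p) {
      if (l %in% A[[j]]) next
      f <- abs(cor(res[[j]], res[[l]]))
      if (f > best[1]) best <- c(f, j, l)
    }
    if (best[1] < alpha_f) break         # no eligible pair above the threshold: stop
    j0 <- best[2]; l0 <- best[3]
    A[[j0]] <- c(A[[j0]], l0); A[[l0]] <- c(A[[l0]], j0)
    ## Backward step: drop the weakest current edge if it falls at or below alpha_b.
    worst <- c(Inf, NA, NA)
    for (j in seq_len(p)) for (l in A[[j]]) {
      if (l < j) next                    # scan each edge once
      rj <- resid_on(x, j, setdiff(A[[j]], l))
      rl <- resid_on(x, l, setdiff(A[[l]], j))
      b <- abs(cor(rj, rl))
      if (b < worst[1]) worst <- c(b, j, l)
    }
    if (is.finite(worst[1]) && worst[1] <= alpha_b) {
      j0 <- worst[2]; l0 <- worst[3]
      A[[j0]] <- setdiff(A[[j0]], l0); A[[l0]] <- setdiff(A[[l0]], j0)
    }
  }
  A                                      # list of estimated neighborhoods
}
```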

2.4 Threshold selection by cross-validation

Let X be the n × p matrix with rows x_i = (x_{i1}, . . . , x_{ip}), i = 1, . . . , n, corresponding to n observations. We randomly partition the dataset {x_i}_{1≤i≤n} into K disjoint subsets of approximately equal sizes, the tth subset being of size n_t ≥ 2 with Σ_{t=1}^K n_t = n. For every t, let {x_i^{(t)}}_{1≤i≤n_t} be the tth validation subset, and its complement {x̃_i^{(t)}}_{1≤i≤n−n_t} the tth training subset. For every t and for every pair (α_f, α_b) of threshold parameters, let Â_1^{(t)}, . . . , Â_p^{(t)} be the estimated neighborhoods given by GSA using the tth training subset. For every j = 1, . . . , p, let β̂_{Â_j^{(t)}} be the estimated coefficient vector of the regression of the variable X_j on the neighborhood Â_j^{(t)}.

Consider now the tth validation subset. For every j, using β̂_{Â_j^{(t)}}, we obtain the vector of predicted values X̂_j^{(t)}(α_f, α_b). If Â_j^{(t)} = ∅ we predict each observation of X_j by the sample mean of that variable's observations in the tth subset.

Then, we define the K-fold cross-validation function as

$$CV(\alpha_f, \alpha_b) = \frac{1}{n} \sum_{t=1}^{K} \sum_{j=1}^{p} \left\| X_j^{(t)} - \widehat{X}_j^{(t)}(\alpha_f, \alpha_b) \right\|^2,$$

where ‖·‖ denotes the L2 (euclidean) norm. Hence the K-fold cross-validation forward-backward thresholds (α̂_f, α̂_b) are

$$(\widehat{\alpha}_f, \widehat{\alpha}_b) = \operatorname*{argmin}_{(\alpha_f, \alpha_b) \in H} CV(\alpha_f, \alpha_b),$$

where H is a grid of ordered pairs (α_f, α_b) in [0, 1] × [0, 1] over which we perform the search. For a detailed description see the Appendix.
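A minimal R sketch of this grid search is given below; it assumes the illustrative gs_stepwise() helper sketched in Section 2.3, and the grid, fold assignment and helper name cv_thresholds are ours rather than the paper's.

```r
## K-fold cross-validation over a grid of (alpha_f, alpha_b) thresholds (a sketch).
cv_thresholds <- function(x, grid_f, grid_b, K = 5) {
  n <- nrow(x); p <- ncol(x)
  fold <- sample(rep_len(seq_len(K), n))          # random K-fold assignment
  cv_one <- function(af, ab) {
    err <- 0
    for (t in seq_len(K)) {
      tr <- x[fold != t, , drop = FALSE]          # training subset
      va <- x[fold == t, , drop = FALSE]          # validation subset
      A  <- gs_stepwise(tr, af, ab)               # neighborhoods from the training data
      for (j in seq_len(p)) {
        if (length(A[[j]]) == 0) {
          pred <- rep(mean(va[, j]), nrow(va))    # empty neighborhood: predict by the mean
        } else {
          fit  <- lm.fit(cbind(1, tr[, A[[j]], drop = FALSE]), tr[, j])
          pred <- cbind(1, va[, A[[j]], drop = FALSE]) %*% fit$coefficients
        }
        err <- err + sum((va[, j] - pred)^2)
      }
    }
    err / n
  }
  H <- expand.grid(alpha_f = grid_f, alpha_b = grid_b)
  H$cv <- mapply(cv_one, H$alpha_f, H$alpha_b)
  H[which.min(H$cv), c("alpha_f", "alpha_b")]     # minimizer of CV(alpha_f, alpha_b)
}
```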

2.5 Example
To illustrate the algorithm we consider the GGM with 16 edges given in the first panel of Figure 1. We draw n = 1000 independent observations from this model (see the next section for details). The values of the threshold parameters, α_f = 0.17 and α_b = 0.09, are determined by 5-fold cross-validation. The figure also displays the selected pairs of edges at each step in a sequence of successive updates of Â_j^k, for k = 1, 4, 9, 12 and the final step k = 16, showing that the estimated graph is identical to the true graph.


7 ●
6

5 ●
7 ●
6

5 ●
7 ●
6

5

8 ●
4 ●
8 ●
4 ●
8 ●
4

●9 ●3 ●9 ●3 ●9 ●3

●10 ●2 ●10 ●2 ●10 ●


2


11 ● 1 ●
11 ● 1 ●
11 ● 1

●12 ●
20 ●12 ●
20 ●12 ●
20


13 ●
19 ●
13 ●
19 ●
13 ●
19


14 ●
18 ●
14 ●
18 ●
14 ●
18

15

16 ●
17 ●
15

16 ●
17 ●
15

16 ●
17

True graph k=1 k=4


7 ●
6

5 ●
7 ●
6

5 ●
7 ●
6

5

8 ●
4 ●
8 ●
4 ●
8 ●
4

●9 ●3 ●9 ●3 ●9 ●3

●10 ●2 ●10 ●2 ●10 ●


2


11 ● 1 ●
11 ● 1 ●
11 ● 1

●12 ●
20 ●12 ●
20 ●12 ●
20


13 ●
19 ●
13 ●
19 ●
13 ●
19


14 ●
18 ●
14 ●
18 ●
14 ●
18

15

16 ●
17 ●
15

16 ●
17 ●
15

16 ●
17

k=9 k = 12 k = 16

bk , for k = 1, 4, 9, 12, 16 of the GSA.


Figure 1: True graph and sequence of successive updates of Aj

3 Numerical results and real data example


We conducted extensive Monte Carlo simulations to investigate the performance of GS. In
this section we report some results from this study and a numerical experiment using real
data.

3.1 Monte Carlo simulation study

Simulated Models

We consider three dimension values p = 50, 100, 150 and three different models for Ω:

Model 1. Autoregressive model of order 1, denoted AR(1). In this case Σ_ij = 0.4^{|i−j|} for i, j = 1, . . . , p.

Model 2. Nearest neighbors model of order 2, denoted NN(2). For each node we randomly select two neighbors and choose a pair of symmetric entries of Ω using the NeighborOmega function of the R package Tlasso.

Model 3. Block diagonal matrix model with q blocks of size p/q, denoted BG. For p = 50, 100 and 150, we use q = 10, 20 and 30 blocks, respectively. Each block, of size p/q = 5, has diagonal elements equal to 1 and off-diagonal elements equal to 0.5.

For each p and each model we generate R = 50 random samples of size n = 100. These graph models are widely used in the genetic literature to model gene expression data. See for example Lee and Liu (2015) and Li and Gui (2006). Figure 2 displays graphs from Models 1-3 with p = 100 nodes.
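The AR(1) and BG structures are simple to construct directly. The sketch below is illustrative only (it is not the simulation code of the paper): it builds Σ or Ω with base R plus the Matrix package for the block-diagonal precision matrix, omits NN(2) since that model relies on Tlasso::NeighborOmega, and samples Gaussian data through a Cholesky factor.

```r
## Sigma for Model 1: Sigma_ij = 0.4^|i-j|
make_ar1_sigma <- function(p, rho = 0.4) {
  rho^abs(outer(seq_len(p), seq_len(p), "-"))
}
## Omega for Model 3: blocks of size 5 with 1 on the diagonal and 0.5 off-diagonal
make_bg_omega <- function(p, block = 5, off = 0.5) {
  B <- matrix(off, block, block); diag(B) <- 1
  as.matrix(Matrix::bdiag(replicate(p / block, B, simplify = FALSE)))
}
## Draw n rows from N(0, Sigma) via a Cholesky factor
sample_gaussian <- function(n, Sigma) {
  p <- ncol(Sigma)
  matrix(rnorm(n * p), n, p) %*% chol(Sigma)
}

set.seed(1)
X_ar1 <- sample_gaussian(100, make_ar1_sigma(100))        # Model 1, p = 100, n = 100
X_bg  <- sample_gaussian(100, solve(make_bg_omega(100)))  # Model 3, p = 100, n = 100
```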

Figure 2: Graphs of the AR(1), NN(2) and BG graphical models for p = 100 nodes (panels, left to right: AR(1), NN(2), BG).

Methods

We compare the performance of GS with the graphical lasso (Glasso) and with constrained ℓ1-minimization for inverse matrix estimation (CLIME), proposed by Friedman et al. (2008) and Cai et al. (2011), respectively. Therefore, the methods compared in our simulation study are:

1. The proposed method GS with the forward and backward thresholds, (α_f, α_b), estimated by 5-fold cross-validation on a grid of 20 values in [0, 1] × [0, 1], as described in Subsection 2.4. The computing algorithm is available by request.

2. The Glasso estimate obtained by solving the ℓ1 penalized-likelihood problem

$$\min_{\Omega \succ 0} \; -\log\{\det[\Omega]\} + \operatorname{tr}\{\Omega X^{\top} X\} + \lambda \|\Omega\|_1. \qquad (3.1)$$

In our simulations and examples we use the R package CVglasso with the tuning parameter λ selected by 5-fold cross-validation (the package default).

3. The CLIME estimate obtained by symmetrization of the solution of

$$\min \left\{ \|\Omega\|_1 \ \text{subject to} \ |S\Omega - I|_\infty \leq \lambda \right\}, \qquad (3.2)$$

where S is the sample covariance matrix, I is the identity matrix, |·|_∞ is the elementwise ℓ∞ norm, and λ is a tuning parameter. For computations, we use the R package clime with the tuning parameter λ selected by 5-fold cross-validation (the package default).

To evaluate the ability of the methods to find the pairs of edges, for each replicate we compute the Matthews correlation coefficient (Matthews, 1975),

$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}, \qquad (3.3)$$

the Specificity = TN/(TN + FP) and the Sensitivity = TP/(TP + FN), where TP, TN, FP and FN are, in this order, the numbers of true positives, true negatives, false positives and false negatives regarding the identification of the nonzero off-diagonal elements of Ω. Larger values of MCC, Sensitivity and Specificity indicate a better performance (Fan et al., 2009; Baldi et al., 2000).
For every replicate, the performance of Ω̂ as an estimate of Ω is measured by m_F = ‖Ω̂ − Ω‖_F (where ‖·‖_F denotes the Frobenius norm) and by the normalized Kullback-Leibler divergence defined by m_NKL = D_KL/(1 + D_KL), where

$$D_{KL} = \frac{1}{2} \left\{ \operatorname{tr}\left( \Omega \widehat{\Omega}^{-1} \right) - \log\left[ \det\left( \Omega \widehat{\Omega}^{-1} \right) \right] - p \right\}$$

is the Kullback-Leibler divergence between Ω̂ and Ω.
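As a concrete reference for these measures, the following R helpers are an illustrative sketch (the function names are ours): the support comparison treats off-diagonal entries with absolute value below a small tolerance as zero, and the Kullback-Leibler term follows the expression for D_KL given above.

```r
## Support-recovery scores: sensitivity, specificity and MCC over off-diagonal entries.
support_scores <- function(Omega_hat, Omega, tol = 1e-8) {
  off <- upper.tri(Omega)                     # each off-diagonal pair counted once
  est <- abs(Omega_hat[off]) > tol
  tru <- abs(Omega[off]) > tol
  TP <- sum(est & tru); TN <- sum(!est & !tru)
  FP <- sum(est & !tru); FN <- sum(!est & tru)
  mcc <- (TP * TN - FP * FN) /
    sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  c(sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP),
    mcc = mcc)
}

## Estimation losses: Frobenius norm m_F and normalized Kullback-Leibler m_NKL.
estimation_losses <- function(Omega_hat, Omega) {
  mF  <- norm(Omega_hat - Omega, type = "F")
  M   <- Omega %*% solve(Omega_hat)           # as in the D_KL expression above
  dkl <- 0.5 * (sum(diag(M)) -
                as.numeric(determinant(M, logarithm = TRUE)$modulus) - nrow(M))
  c(mF = mF, mNKL = dkl / (1 + dkl))
}
```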

Results
Table 1 shows the MCC performance of the three methods under Models 1-3. GS clearly outperforms the other two methods, while CLIME only slightly outperforms Glasso. Cai et al. (2011) pointed out that a procedure yielding a sparser Ω̂ is preferable because this facilitates interpretation of the data. The sensitivity and specificity results, reported in Table 4 in the Appendix, show that in general GS is sparser than CLIME and Glasso, yielding fewer false positives (higher specificity) but a few more false negatives (lower sensitivity). Table 2 shows that under models AR(1) and NN(2) the three methods achieve fairly similar performance for estimating Ω. However, under model BG, GS clearly outperforms the other two.

Figure 3 displays the heat-maps of the number of non-zero links identified in the 50 replications under model AR(1). Notice that among the three compared methods, the GS sparsity patterns best match those of the true model. Figures 4 and 5 in the Appendix lead to similar conclusions for models NN(2) and BG.

Table 1: Comparison of means and standard deviations (in brackets) of MCC over R = 50 replicates.

Model p GS Glasso CLIME


50 0.741 (0.009) 0.419 (0.016) 0.492 (0.006)
AR(1) 100 0.751 (0.004) 0.433 (0.020) 0.464 (0.004)
150 0.730 (0.004) 0.474 (0.017) 0.499 (0.003)
50 0.751 (0.004) 0.404 (0.014) 0.401 (0.007)
NN(2) 100 0.802 (0.005) 0.382 (0.006) 0.407 (0.005)
150 0.695 (0.007) 0.337 (0.008) 0.425 (0.003)
50 0.898 (0.005) 0.356 (0.009) 0.482 (0.005)
BG 100 0.857 (0.005) 0.348 (0.004) 0.461 (0.002)
150 0.780 (0.008) 0.314 (0.003) 0.408 (0.003)

(a) p = 50

(b) p = 100

(c) p = 150

Figure 3: Model AR(1). Heatmaps of the frequency of the zeros identified for each entry of the precision matrix out of
R = 50 replicates. White color is 50 zeros identified out of 50 runs, and black is 0/50.

Table 2: Comparison of means and standard deviations (in brackets) of mF and mN KL over R = 50 replicates.

GS Glasso CLIME
Model p mN KL mF mN KL mF mN KL mF
50 0.70 3.82 0.64 3.90 0.63 3.91
(0.00) (0.00) (0.00) ( 0.02) (0.00) (0.01)
AR(1) 100 0.83 5.73 0.80 5.72 0.79 5.75
(0.00) (0.00) (0.00) (0.02) (0.00) (0.01)
150 1.25 7.16 1.17 7.21 1.17 7.25
(0.00) (0.00) (0.00) (0.02) (0.00) (0.01)
50 0.99 6.98 0.99 6.65 0.99 6.64
(0.00) (0.00) (0.00) (0.01) (0.00) (0.00)
NN(2) 100 0.10 10.11 1.00 9.64 1.00 9.601
(0.00) (0.00) (0.00) (0.009) (0.000) (0.005)
150 1.00 12.37 1.00 11.90 1.00 11.79
(0.00) (0.00) (0.00) (0.01) (0.00) (0.00)
BG 50 0.46 1.44 0.85 5.45 0.82 5.03
(0.00) (0.00) (0.00) (0.10) (0.00) (0.05)
100 0.71 2.94 0.93 9.16 0.92 8.71
(0.00) (0.00) (0.00) (0.07) (0.00) (0.02)
150 0.88 6.10 0.96 11.59 0.96 11.42
(0.00) (0.00) (0.00) (0.06) (0.00) (0.02)

3.2 Analysis of Breast Cancer Data


In preoperative chemotherapy, the complete eradication of all invasive cancer cells is referred to as pathological complete response, abbreviated as pCR. It is known in medicine that pCR is associated with the long-term cancer-free survival of a patient. Gene expression profiling (GEP), the measurement of the activity (expression level) of genes in a patient, could in principle be a useful predictor of the patient's pCR.

Using normalized gene expression data of patients in stages I-III of breast cancer, Hess et al. (2006) aim to identify patients that may achieve pCR under sequential anthracycline-paclitaxel preoperative chemotherapy. When a patient does not achieve pCR, the patient is classified in the residual disease (RD) group, indicating that cancer still remains. Their data consist of 22283 gene expression levels for 133 patients, 34 with pCR and 99 with RD. Following Fan et al. (2009) and Cai et al. (2011), we randomly split the data into a training set and a testing set. The testing set is formed by randomly selecting 5 pCR patients and 16 RD patients (roughly 1/6 of the subjects) and the remaining patients form the training set. From the training set, a two-sample t-test is performed to select the 50 most significant genes. The data are then standardized using the standard deviation estimated from the training set.
We apply a linear discriminant analysis (LDA) to predict whether a patient may achieve pathological complete response (pCR), based on the estimated inverse covariance matrix of the gene expression levels. We label with r = 1 the pCR group and with r = 2 the RD group, and assume that the data are normally distributed, with common covariance matrix Σ and different means μ_r. From the training set we obtain μ̂_r and Ω̂, and for each observation x in the test data we compute the linear discriminant score

$$\delta_r(x) = x^{\top} \widehat{\Omega} \widehat{\mu}_r - \frac{1}{2} \widehat{\mu}_r^{\top} \widehat{\Omega} \widehat{\mu}_r + \log \widehat{\pi}_r, \qquad r = 1, 2, \qquad (3.4)$$

where π̂_r is the proportion of group r subjects in the training set. The classification rule is

$$\widehat{r}(x) = \operatorname*{argmax}_{r = 1, 2} \delta_r(x). \qquad (3.5)$$
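A small R sketch of the rule (3.4)-(3.5) is given below. It assumes an estimated precision matrix Omega_hat, a list of estimated group means mu_hat and a vector of estimated priors pi_hat (all names are ours), and returns the predicted group for each row of the test matrix.

```r
## Linear discriminant scores delta_r(x) and the argmax classification rule.
lda_predict <- function(x_test, Omega_hat, mu_hat, pi_hat) {
  scores <- sapply(seq_along(mu_hat), function(r) {
    m <- mu_hat[[r]]
    drop(x_test %*% Omega_hat %*% m) -           # x' Omega_hat mu_r
      0.5 * drop(t(m) %*% Omega_hat %*% m) +     # - (1/2) mu_r' Omega_hat mu_r
      log(pi_hat[r])                             # + log pi_r
  })                                             # n_test x 2 matrix of scores
  max.col(scores)                                # r-hat(x): group with the largest score
}
```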

For every method we use 5-fold cross-validation on the training data to select the tuning constants. We repeat this scheme 100 times.

Table 3 displays the means and standard errors (in brackets) of Sensitivity, Specificity, MCC and number of selected edges based on Ω̂ over the 100 replications. In terms of MCC, GS is slightly better than CLIME, and CLIME is better than Glasso. While the three methods give similar performance in terms of Specificity, GS and CLIME improve over Glasso in terms of Sensitivity.

Table 3: Comparison of means and standard deviations (in brackets) of Sensitivity, Specificity, MCC and Number of selected
edges over 100 replications.

GS CLIME Glasso
Sensitivity 0.798 (0.02) 0.786 (0.02) 0.602 (0.02)
Specificity 0.784 (0.01) 0.788 (0.01) 0.767 (0.01)
MCC 0.520 (0.02) 0.516 (0.02) 0.334 (0.02)
Number of Edges 54 (2) 4823 (8) 2103 (76)

4 Concluding remarks
This paper introduces a stepwise procedure, called GS, to perform covariance selection in high-dimensional Gaussian graphical models. Our method uses a different parametrization of the Gaussian graphical model based on the Pearson correlations between the prediction errors of the nodes' best linear predictors. The GS algorithm begins with a family of empty neighborhoods and, using two basic steps, forward and backward, adds or deletes edges until the corresponding thresholds are reached. These thresholds are automatically determined by cross-validation.

GS is compared with Glasso and CLIME under different Gaussian graphical models (AR(1), NN(2) and BG) and using different performance measures regarding network recovery and sparse estimation of the precision matrix Ω. GS is shown to have good support recovery performance and to produce simpler models than the other two methods (i.e. GS is a parsimonious estimation procedure).
We use GS for the analysis of breast cancer data and show that this method may be a
useful tool for applications in medicine and other fields.

Acknowledgements
The authors gratefully acknowledge the generous support of NSERC, Canada, the Institute of Financial Big Data, University Carlos III of Madrid, and the CSIC, Spain.

A Appendix

A.1 Selection of the threshold parameters by cross-validation


Let X be the n × p matrix with rows x_i = (x_{i1}, . . . , x_{ip}), i = 1, . . . , n, corresponding to n observations. For each j = 1, . . . , p, let X_j = (x_{1j}, . . . , x_{nj})^⊤ denote the jth column of the matrix X.

We randomly partition the dataset {x_i}_{1≤i≤n} into K disjoint subsets of approximately equal size, the tth subset being of size n_t ≥ 2 with Σ_{t=1}^K n_t = n. For every t, let {x_i^{(t)}}_{1≤i≤n_t} be the tth validation subset, and its complement {x̃_i^{(t)}}_{1≤i≤n−n_t} the tth training subset.

For every t = 1, . . . , K and threshold parameters (α_f, α_b) ∈ [0, 1] × [0, 1], let Â_1^{(t)}, . . . , Â_p^{(t)} be the estimated neighborhoods given by GSA using the tth training subset {x̃_i^{(t)}}_{1≤i≤n−n_t}, with x̃_i^{(t)} = (x̃_{i1}^{(t)}, . . . , x̃_{ip}^{(t)}), 1 ≤ i ≤ n − n_t. Consider for every node j the estimated neighborhood Â_j^{(t)} = {l_1, . . . , l_q} and let β̂_{Â_j^{(t)}} be the estimated coefficient vector of the regression of X̃_j^{(t)} = (x̃_{1j}^{(t)}, . . . , x̃_{n−n_t,j}^{(t)})^⊤ on X̃_{l_1}^{(t)}, . . . , X̃_{l_q}^{(t)}, represented in (A.2) (in red).

Consider the tth validation subset {x_i^{(t)}}_{1≤i≤n_t} with x_i^{(t)} = (x_{i1}^{(t)}, . . . , x_{ip}^{(t)}), 1 ≤ i ≤ n_t, and for every j let X_j^{(t)} = (x_{1j}^{(t)}, . . . , x_{n_t j}^{(t)})^⊤ and define the vector of predicted values

$$\widehat{X}_j^{(t)}(\alpha_f, \alpha_b) = X^{(t)}_{\widehat{A}_j^{(t)}} \, \widehat{\beta}_{\widehat{A}_j^{(t)}},$$

where X^{(t)}_{Â_j^{(t)}} is the matrix with rows (x_{i l_1}^{(t)}, . . . , x_{i l_q}^{(t)}), 1 ≤ i ≤ n_t, represented in (A.2) (in blue). If the neighborhood Â_j^{(t)} = ∅ we define

$$\widehat{X}_j^{(t)}(\alpha_f, \alpha_b) = (\bar{x}_j^{(t)}, \ldots, \bar{x}_j^{(t)})^{\top},$$

where x̄_j^{(t)} is the mean of the sample of observations x_{1j}^{(t)}, . . . , x_{n_t j}^{(t)}.

We define the K-fold cross-validation function as

$$CV(\alpha_f, \alpha_b) = \frac{1}{n} \sum_{t=1}^{K} \sum_{j=1}^{p} \left\| X_j^{(t)} - \widehat{X}_j^{(t)}(\alpha_f, \alpha_b) \right\|^2,$$

where ‖·‖ denotes the L2 (euclidean) norm. Hence the K-fold cross-validation forward-backward thresholds (α̂_f, α̂_b) are

$$(\widehat{\alpha}_f, \widehat{\alpha}_b) = \operatorname*{argmin}_{(\alpha_f, \alpha_b) \in H} CV(\alpha_f, \alpha_b), \qquad (A.1)$$

where H is a grid of ordered pairs (α_f, α_b) in [0, 1] × [0, 1] over which we perform the search.

$$\left(\begin{array}{ccccccc}
\multicolumn{7}{c}{t\text{th training subset}} \\
\cdots & \tilde{x}^{(t)}_{1j} & \cdots & \tilde{x}^{(t)}_{1 l_1} & \cdots & \tilde{x}^{(t)}_{1 l_q} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\cdots & \tilde{x}^{(t)}_{n-n_t, j} & \cdots & \tilde{x}^{(t)}_{n-n_t, l_1} & \cdots & \tilde{x}^{(t)}_{n-n_t, l_q} & \cdots \\
\multicolumn{7}{c}{t\text{th validation subset}} \\
\cdots & x^{(t)}_{1j} & \cdots & x^{(t)}_{1 l_1} & \cdots & x^{(t)}_{1 l_q} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\cdots & x^{(t)}_{n_t j} & \cdots & x^{(t)}_{n_t l_1} & \cdots & x^{(t)}_{n_t l_q} & \cdots
\end{array}\right) \qquad (A.2)$$

Remark 3. Matrix (A.2) represents, for every node j, the comparison between estimated and predicted values used for cross-validation. β̂_{Â_j^{(t)}} is computed using the observations X̃_j^{(t)} = (x̃_{1j}^{(t)}, . . . , x̃_{n−n_t, j}^{(t)})^⊤ and the matrix X̃^{(t)}_{Â_j^{(t)}} with rows (x̃_{i l_1}^{(t)}, . . . , x̃_{i l_q}^{(t)}), i = 1, . . . , n − n_t, in the tth training subset (in red). Based on the tth validation set, X̂_j^{(t)} is computed using X^{(t)}_{Â_j^{(t)}} and compared with X_j^{(t)} (in blue).

Table 4: Comparison of means and standard deviations (in brackets) of Specificity, Sensitivity and MCC over R = 50
replicates.

GS Glasso CLIME
Model p Sensitivity Specificity MCC Sensitivity Specificity MCC Sensitivity Specificity MCC
50 0.756 0.988 0.741 0.994 0.823 0.419 0.988 0.891 0.492
(0.015) (0.002) (0.009) (0.002) (0.012) (0.016) (0.002) (0.003) (0.006)
AR(1) 100 0.632 0.999 0.751 0.989 0.897 0.433 0.983 0.934 0.464
(0.007) (0.000) (0.004) (0.002) (0.009) (0.020) (0.002) (0.001) (0.004)
150 0.607 0.999 0.730 0.981 0.943 0.474 0.972 0.964 0.499
(0.006) (0.000) (0.004) (0.002) (0.007) (0.017) (0.002) (0.001) (0.003)
50 0.632 0.999 0.751 0.971 0.864 0.404 0.984 0.875 0.401
(0.007) (0.000) (0.004 ) (0.004) (0.010) (0.014) (0.003) (0.004) (0.007)
NN(2) 100 0.730 0.999 0.802 0.987 0.924 0.382 0.985 0.937 0.407
(0.008) (0.000) (0.005) (0.002) (0.004) (0.006) (0.002) (0.001) (0.005)
150 0.555 0.999 0.695 0.952 0.936 0.337 0.934 0.965 0.425
(0.017) (0.000) (0.007) (0.004) (0.002) (0.008) ( 0.003) (0.001) (0.003)
50 0.994 0.981 0.898 0.867 0.697 0.356 0.962 0.807 0.482
(0.002) (0.001) (0.005) (0.032) (0.021) (0.009) (0.004) (0.005) (0.005)
BG 100 0.949 0.989 0.857 0.569 0.908 0.348 0.818 0.920 0.4615
(0.007) (0.000) (0.005) (0.039) (0.011) ( 0.004) (0.005) (0.005) (0.002)
150 0.782 0.994 0.780 0.426 0.952 0.314 0.626 0.959 0.408
(0.021) (0.000) (0.008) (0.035) (0.006) (0.003) (0.006) (0.001) (0.003)

A.2 Complementary simulation results

(a) p = 50

(b) p = 100

(c) p = 150

Figure 4: Model NN(2). Heatmaps of the frequency of the zeros identified for each entry of the precision matrix out of
R = 50 replications. White color is 50 zeros identified out of 50 runs, and black is 0/50.

(a) p = 50

(b) p = 100

(c) p = 150

Figure 5: Model BG. Heatmaps of the frequency of the zeros identified for each entry of the precision matrix out of R = 50
replications. White color is 50 zeros identified out of 50 runs, and black is 0/50.

References
Anderson, T. (2003). An Introduction to Multivariate Statistical Analysis. John Wiley.

Baldi, P., S. Brunak, Y. Chauvin, C. Andersen, and H. Nielsen (2000). Assessing the
accuracy of prediction algorithms for classification: An overview. Bioinformatics 16 (5),
412–424.

Banerjee, O., L. El Ghaoui, and A. d’Aspremont (2008). Model selection through sparse
maximum likelihood estimation for multivariate gaussian or binary data. The Journal
of Machine Learning Research 9, 485–516.

Bühlmann, P. and S. Van De Geer (2011). Statistics for high-dimensional data: methods,
theory and applications. Springer Science & Business Media.

Cai, T., W. Liu, and X. Luo (2011). A constrained `1 minimization approach to sparse
precision matrix estimation. Journal of the American Statistical Association 106 (494),
594–607.

Cramér, H. (1999). Mathematical Methods of Statistics. Princeton University Press.

Dempster, A. P. (1972). Covariance selection. Biometrics, 157–175.

Eaton, M. L. (2007). Multivariate Statistics : A Vector Space Approach. Institute of


Mathematical Statistics.

Edwards, D. (2000). Introduction to Graphical Modelling. Springer Science & Business


Media.

Fan, J., Y. Feng, and Y. Wu (2009). Network exploration via the adaptive lasso and scad
penalties. The Annals of Applied Statistics 3 (2), 521–541.

Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation
with the graphical lasso. Biostatistics 9 (3), 432–441.

Hess, K. R., K. Anderson, W. F. Symmans, V. Valero, N. Ibrahim, J. A. Mejia, D. Booser,


R. L. Theriault, A. U. Buzdar, P. J. Dempsey, et al. (2006). Pharmacogenomic predictor
of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin,
and cyclophosphamide in breast cancer. Journal of Clinical Oncology 24 (26), 4236–4244.

Huang, S., J. Jin, and Z. Yao (2016). Partial correlation screening for estimating large
precision matrices, with applications to classification. The Annals of Statistics 44 (5),
2018–2057.

Johnson, C. C., A. Jalali, and P. Ravikumar (2011). High-dimensional sparse inverse
covariance estimation using greedy methods. arXiv preprint arXiv:1112.6411 .

Kalisch, M. and P. Bühlmann (2007). Estimating high-dimensional directed acyclic graphs


with the pc-algorithm. The Journal of Machine Learning Research 8, 613–636.

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.

Lawrance, A. J. (1976). On conditional and partial correlation. The American Statisti-


cian 30 (3), 146–149.

Ledoit, O. and M. Wolf (2004). A well-conditioned estimator for large-dimensional covari-


ance matrices. Journal of Multivariate Analysis 88 (2), 365–411.

Li, H. and J. Gui (2006). Gradient directed regularization for sparse gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7 (2), 302–317.

Lee, W. and Y. Liu (2015). Joint estimation of multiple precision matrices with common structures. Journal of Machine Learning Research 16 (1), 1035–1062.

Liang, F., Q. Song, and P. Qiu (2015). An equivalent measure of partial correlation coeffi-
cients for high-dimensional gaussian graphical models. Journal of the American Statis-
tical Association 110 (511), 1248–1265.

Liu, H. and L. Wang (2012). Tiger: A tuning-insensitive approach for optimally estimating
gaussian graphical models. arXiv preprint arXiv:1209.2437 .

Matthews, B. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta 405 (2), 442–451.

Meinshausen, N. and P. Bühlmann (2006). High-dimensional graphs and variable selection
with the lasso. The Annals of Statistics 34 (3), 1436–1462.

Peng, J., P. Wang, N. Zhou, and J. Zhu (2009). Partial correlation estimation by joint
sparse regression models. Journal of the American Statistical Association 104 (486),
735–746.

Ravikumar, P., M. J. Wainwright, G. Raskutti, B. Yu, et al. (2011). High-dimensional


covariance estimation by minimizing `1 -penalized log-determinant divergence. Electronic
Journal of Statistics 5, 935–980.

Ren, Z., T. Sun, C.-H. Zhang, H. H. Zhou, et al. (2015). Asymptotic normality and opti-
malities in estimation of large gaussian graphical models. The Annals of Statistics 43 (3),
991–1026.

Rütimann, P., P. Bühlmann, et al. (2009). High dimensional sparse covariance estimation
via directed acyclic graphs. Electronic Journal of Statistics 3, 1133–1160.

Spirtes, P., C. N. Glymour, and R. Scheines (2000). Causation, Prediction, and Search.
MIT press.

Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear pro-
gramming. The Journal of Machine Learning Research 11, 2261–2286.

Yuan, M. and Y. Lin (2007). Model selection and estimation in the gaussian graphical
model. Biometrika 94 (1), 19–35.

Zhou, S., P. Rütimann, M. Xu, and P. Bühlmann (2011). High-dimensional covariance


estimation based on gaussian graphical models. The Journal of Machine Learning Re-
search 12, 2975–3026.

