Jury members:
Liliane BEL, Professor, AgroParisTech (MIA 518), President
David CAUSEUR, Professor, Agrocampus Ouest (IRMAR UMR 6625), Reviewer
Jean-Marc BARDET, Professor, Université Paris 1 (SAMM), Examiner
Pierre NEUVIAL
Doctoral thesis
1 Introduction
  1.1 Biological context
  1.2 Variable selection in the multivariate linear model
    1.2.1 State of the art
    1.2.2 Contributions of Chapters 2 and 3
  1.3 Block-structured correlation matrix estimation
    1.3.1 State of the art
    1.3.2 Contributions of Chapter 4
  1.4 Another application: a study of the dialogue between dendritic cells and Th lymphocytes
    1.4.1 Introduction to immunology
    1.4.2 Contributions of Chapter 5
  3.3.2 Choice of the dependence modeling
  3.3.3 Choice of the model selection criterion
  3.3.4 Numerical performance
  3.4 Application to the analysis of a LC-MS data set
    3.4.1 Data pre-processing
    3.4.2 Application of our four-step approach
    3.4.3 Comparison with existing methods
  3.5 Conclusion
Bibliography
Appendix
Introduction
... thymine (T), cytosine (C) and guanine (G). One of these strands is called the coding strand and the other the template strand. The nucleotide sequence of the coding strand is complementary to that of the template strand: thymine is replaced by adenine, cytosine by guanine, and conversely. Genes are encoded on fragments of these DNA strands. RNA polymerases move along the template strand of the DNA and transcribe it into messenger RNA (mRNA). The nucleotide sequence of the mRNA is the same as that of the coding strand, except that thymine (T) is replaced by uracil (U). A ribosome then translates this mRNA into chains of amino acids. Each group of three nucleotides among A, U, C and G is called a codon. Since mRNA chains are composed of four different nucleotides, there are 4^3 = 64 codons available to encode the amino acids. Three of them are stop codons, which halt the translation of mRNA into proteins; all the others are associated with an amino acid, and some amino acids are encoded by several codons. Once formed, a chain of amino acids folds, thereby completing the creation of a protein. When proteins are no longer functional or no longer used, they are degraded by our system into amino acids and small molecules that belong to the metabolites.
This sequential view of the path from genotype to phenotype is in fact far too simplistic. For instance, small RNAs have been discovered that appear to repress the translation of certain parts of the mRNA. These small RNAs are encoded at various locations of the genome, including in regions that were thought to be transcriptionally silent. Moreover, proteins act on the transcription of mRNA. The path from genotype to phenotype is therefore far from sequential, since the different steps influence one another. Furthermore, our knowledge of it keeps evolving and is probably still simplistic compared to reality.
Nevertheless, the differences between individuals of the same species are studied at each of these steps. Some phenotypic traits can, for example, be explained by the study of DNA sequences (called genomics). More precisely, one can study differences within nucleotides that vary frequently from one individual to another (Yang et al., 2010). Such a single-nucleotide polymorphism is called a SNP (Single Nucleotide Polymorphism). The variation of a single nucleotide can have important effects, since it can modify the amino acid, and hence the protein and the metabolites that derive from it. Similarly, loci (DNA regions that may encode one or several genes) varying from one individual to another can have important effects on the phenotype of the individual. These studies can also be carried out on the transcribed RNA strands; one then speaks of transcriptomics. Proteomics (respectively metabolomics), which studies the abundance of proteins (resp. metabolites), can also help explain differences between individuals of the same species that would not be explained by the study of genes or transcripts.
Figure 1.1 – Correlation matrix of SNPs along the genome (Wittenburg et al., 2016) on the left and of proteins on the right.
Indeed, it has been shown that variations in gene expression do not necessarily lead to proportional variations in metabolite abundance (Riekeberg & Powers, 2017). More generally, to explain a phenomenon of interest, the different types of -omic data sets (genomic, transcriptomic, proteomic, metabolomic) each have advantages and drawbacks, which are detailed in Table 1 of Karahalil (2016). This is why, in the following, we propose general methods for studying the links between markers of -omic data (genes, proteins, metabolites, SNPs, loci, ...) and a phenomenon of interest, while taking into account the dependence that exists between the markers. To this end, we have at our disposal data sets in which q markers are measured on n samples. The number of markers to be measured can be very large (up to several tens of thousands) and measuring them on a sample can be expensive. It is therefore frequent for the number of markers under study to be much larger than the number of samples on which they are measured. Thus, in this thesis, we propose methods suited to such cases, both for studying the dependence between the markers and for studying the presence or absence of a link with a phenomenon of interest.
To explain the links between different markers and different phenotypic traits, the values of the markers are modeled as a linear combination of the phenotypic trait(s). Suppose that we look for the links existing between q markers and p phenotypic traits, and that for n samples we have the value of these q markers and of these p phenotypic traits. The word value is used here as a generic term for an abundance, a quantity, or simply a value equal to 0 or 1 indicating whether or not a sample belongs to a group. Let $X_{i,k}$ denote the value of phenotypic trait k for sample i and $Y_{i,j}$ the value of marker j for sample i. If the phenotypic trait is categorical, $X_{i,k}$ equals 1 if sample i belongs to category k and 0 otherwise. We then seek to write $Y_{i,j}$ as

$$Y_{i,j} = \sum_{k=1}^{p} X_{i,k} B_{k,j} + E_{i,j}, \quad \forall i \in \llbracket 1, n \rrbracket, \ \forall j \in \llbracket 1, q \rrbracket, \qquad (1.1)$$

which can be rewritten in matrix form as

$$Y = XB + E. \qquad (1.2)$$
In this model, Y is a random n × q response matrix, X an n × p design matrix containing the characteristics of the phenomenon of interest, B a p × q matrix containing the values of the coefficients linking the responses to the phenomenon of interest, and E a random n × q error matrix. Throughout this thesis we will assume that the rows of E are independent and identically distributed but that its columns are, in general, not. Thus,

$$\forall i \in \llbracket 1, n \rrbracket, \quad (E_{i,1}, \dots, E_{i,q}) \overset{iid}{\sim} \mathcal{N}(0, \Sigma), \qquad (1.3)$$

where $\mathcal{N}(0, \Sigma)$ denotes the distribution of a Gaussian vector with zero mean and covariance matrix Σ.
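For concreteness, the following R sketch simulates data from Model (1.2) under Assumption (1.3). It is only an illustration: the dimensions, the sparsity pattern of B and the AR(1) form chosen for Σ are assumptions made for the example, not part of the model.

```r
set.seed(1)
n <- 30; p <- 2; q <- 100                 # illustrative dimensions
phi <- 0.7                                 # AR(1) parameter (assumed for the example)
Sigma <- phi^abs(outer(1:q, 1:q, "-")) / (1 - phi^2)
X <- model.matrix(~ 0 + gl(p, n / p))      # one-way ANOVA design matrix
B <- matrix(0, p, q); B[1, 1:5] <- 2       # sparse coefficient matrix
E <- matrix(rnorm(n * q), n, q) %*% chol(Sigma)  # rows iid N(0, Sigma)
Y <- X %*% B + E                           # Model (1.2)
```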
To drastically reduce the number of potential associations between markers and phenotypic traits, we focus on variable selection methods in the general linear model (1.2); see for instance Mardia et al. (1980). Performing variable selection in this model amounts to proposing a sparse estimator $\hat{B}$ of B. Such an estimator highlights potentially relevant links between the responses and the explanatory variables. Indeed, a positive (resp. negative) coefficient $\hat{B}_{i,j}$ indicates a potential positive (resp. negative) link between characteristic i of the phenomenon of interest and marker j. Conversely, a null coefficient indicates a potential absence of link. We therefore look for an estimator of B whose coefficients have the same sign (positive, negative or null) as those of B. To this end, the sign-consistency of an estimator $\hat{C}$ of C is defined by

$$P\left(\mathrm{sign}(\hat{C}) = \mathrm{sign}(C)\right) \to 1, \text{ as } n \to \infty, \quad \text{where } \mathrm{sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0. \end{cases}$$
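In R, checking this sign recovery for a given estimate is immediate (a minimal sketch):

```r
# TRUE when hat_B has exactly the same sign pattern (positive, negative, null) as B
sign_consistent <- function(hat_B, B) all(sign(hat_B) == sign(B))
```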
If $\hat{B}$ is an estimator of B satisfying this property, then with probability tending to one the positive (resp. negative, resp. null) entries of $\hat{B}$ are indeed positive (resp. negative, resp. null) entries of B. A non-null entry of $\hat{B}$ thus indicates a real association between a response variable and an explanatory variable. Such an approach therefore makes it possible to drastically reduce the number of experimental validations to be carried out, all the more so as $\hat{B}$ is sparse.
A possible approach for selecting variables in the multivariate linear model (1.2) is to do so independently in the q univariate linear models:

$$Y_{\bullet,r} = X B_{\bullet,r} + E_{\bullet,r}, \quad \forall r \in \llbracket 1, q \rrbracket, \qquad (1.4)$$

each of which is of the form

$$y = Xb + e. \qquad (1.5)$$
où l’inégalité est à comprendre composante par composante, 1 est un vecteur de 1
de taille (p − |J|) et J c est le complémentaire de J dans J1, pK.
1+c4
Alors on a, pour tout λ = O(n 2 ), où c3 < c4 < c2 − c1 :
c3
P sign(bb(λ)) = sign(b) = 1 − o(e−n ) → 1, lorsque n → ∞,
Condition (i) is satisfied, for instance, when the columns of the matrix X are centered and scaled. By forcing the singular values of the design matrix restricted to the relevant variables not to be too small compared to n, Condition (ii) requires the matrix $(X^\top X)_{J,J}/n$ to be invertible. Conditions (iii) and (iv) state respectively that the non-null entries of b must be neither too numerous nor too small with respect to the number n of samples. If these conditions are satisfied for the data at hand, Model (1.5) can be applied independently to each column of Y. However, this method does not take into account the dependence that may exist between the different response variables. We will focus here on methods that look for the non-null positions of the matrix B defined in Model (1.2) while taking this dependence into account. To do so, in the Gaussian case, these methods minimize the negative log-likelihood of the model, and hence the function

$$\ell(B, \Omega) = \frac{1}{n} \mathrm{tr}\left\{ (Y - XB)\, \Omega\, (Y - XB)^\top \right\} - \log(|\Omega|), \qquad (1.7)$$

where tr(A) denotes the trace of A, $\Omega = \Sigma^{-1}$ is the precision matrix and |Ω| its determinant. Mardia et al. (1980) show that the estimator of B minimizing ℓ for fixed Ω does not depend on Σ and is therefore the same as the one minimizing the quadratic error for each column of Y independently. However, this method does not perform variable selection, which is why a sparsity-inducing constraint on the coefficients of B is added to the function (1.7). The resulting estimators are no longer independent of Σ.
Rothman et al. (2010) propose to estimate both B and Ω sparsely. To do so, they minimize a cost function that adds two penalties to ℓ: one on the ℓ1-norm of the entries of B and one on the ℓ1-norm of the off-diagonal entries of Ω. To solve this problem they propose an iterative method. In a first step, they estimate B for fixed Ω; the problem then becomes a classical Lasso problem, which is solved using a coordinate descent algorithm. In a second step, they estimate Ω for fixed B, using the graphical-Lasso estimator proposed by Banerjee et al. (2008b) and obtained with the algorithm of Friedman et al. (2008).
This method was extended by Lee & Liu (2012), who propose two other approaches.

where

$$\Sigma_Z = \begin{pmatrix} \Sigma_{y,y} & \Sigma_{y,x} \\ \Sigma_{x,y} & \Sigma_{x,x} \end{pmatrix}.$$
$\tau I_q]^{-1}$, where $\tau \ge 0$. They then only have the matrix B left to estimate, and they propose an algorithm minimizing $\ell(B)$ under various constraints on B. This estimator has not yet been studied theoretically, but it has been validated through numerical experiments. However, it does not take into account dependence due to factors that are not in X.
Contributions of Chapter 2
$$Y \hat{\Sigma}^{-1/2} = X B \hat{\Sigma}^{-1/2} + E \hat{\Sigma}^{-1/2}. \qquad (1.12)$$

$$\tilde{Y} = \tilde{X}\, \mathcal{B} + \tilde{E}, \qquad (1.13)$$
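In R, this whitening-then-Lasso strategy can be sketched as follows; this is a minimal version assuming an estimator hat_Sigma of Σ is already available (the complete strategy is implemented in the R package MultiVarSel on CRAN):

```r
library(glmnet)

whitened_lasso <- function(Y, X, hat_Sigma, lambda) {
  p <- ncol(X); q <- ncol(Y)
  # Sigma^{-1/2} via an eigendecomposition of the symmetric estimator
  eig <- eigen(hat_Sigma, symmetric = TRUE)
  sqrt_inv <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)
  Y_tilde <- as.vector(Y %*% sqrt_inv)      # vec(Y Sigma^{-1/2}), cf. (1.12)
  X_tilde <- kronecker(sqrt_inv, X)         # design of the vectorized model (1.13)
  fit <- glmnet(X_tilde, Y_tilde, lambda = lambda, intercept = FALSE)
  matrix(as.numeric(coef(fit))[-1], nrow = p, ncol = q)   # sparse hat B
}
```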
Theorem 1.2. Let Y satisfy Model (1.2) under Assumption (1.3). Suppose that the irrepresentability condition (IC) holds. Suppose moreover that there exist positive constants $M_4$, $M_5$, $M_6$, $M_7$ and $c_1$, $c_2$ with $0 < c_1 + c_2 < 1/2$, such that

(i) $\|(X^\top X)/n\|_\infty \le M_4$,

(ii) $\lambda_{\min}((X^\top X)/n) \ge M_5$,

(iii) $|J| = O(q^{c_1})$, where $J = \{i \text{ such that } B_i \neq 0\}$ and |J| is the cardinality of J,

(iv) $q^{c_2} \min_{j \in J} |B_j| \ge M_3$,

(v) $\lambda_{\max}(\Sigma^{-1}) \le M_6$,

(vi) $\lambda_{\min}(\Sigma^{-1}) \ge M_7$.

Suppose also that, as n tends to infinity,

(vii) $\|\Sigma^{-1} - \hat{\Sigma}^{-1}\|_\infty = O_P((nq)^{-1/2})$,

(viii) $\rho(\Sigma - \hat{\Sigma}) = O_P((nq)^{-1/2})$.

Then, for all λ such that

$$q = q_n = o\left(n^{\frac{1}{2(c_1+c_2)}}\right), \quad \frac{\lambda}{\sqrt{n}} \to \infty \quad \text{and} \quad \frac{\lambda}{n} = o\left(q^{-(c_1+c_2)}\right), \text{ as } n \to \infty,$$

we have

$$P\left(\mathrm{sign}(\tilde{B}(\lambda)) = \mathrm{sign}(B)\right) \to 1, \text{ as } n \to \infty,$$

where $\tilde{B}(\lambda)$ is defined in (1.14). Here, $\lambda_{\max}(A)$, $\lambda_{\min}(A)$, $\rho(A)$ and $\|A\|_\infty$ denote respectively the largest eigenvalue, the smallest eigenvalue, the spectral radius and the infinite norm of A.
Conditions (i) to (iv) of Theorem 1.2 are similar to Conditions (i) to (iv) of Theorem 1.1. Conditions (v) and (vi) of Theorem 1.2 state that the eigenvalues of Σ and Ω are bounded from below by a strictly positive constant. Finally, Conditions (vii) and (viii) require that neither the infinite norm of the estimation error of the precision matrix nor the spectral radius of the estimation error of the covariance matrix be too large.
Contributions of Chapter 3
$$\hat{E} = Y - X(X^\top X)^{-1} X^\top Y = (\mathrm{Id} - P_X)\, Y = P_{X^\perp}\, Y, \qquad (1.15)$$
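In R, these residuals are simply those of a column-wise least-squares fit (a minimal sketch):

```r
# Residuals (1.15): projection of Y onto the orthogonal complement of col(X)
P_X <- X %*% solve(crossprod(X)) %*% t(X)
E_hat <- (diag(nrow(X)) - P_X) %*% Y
# equivalently: E_hat <- residuals(lm(Y ~ X - 1))
```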
The simplest model is such that each row of E is modeled as an autoregressive process of order 1 (AR(1)). This means that, for all i in $\llbracket 1, n \rrbracket$ and all t in $\llbracket 2, q \rrbracket$, $(E_{i,t})$ satisfies

$$E_{i,t} - \phi_1 E_{i,t-1} = W_{i,t},$$

where $|\phi_1| < 1$ and the $W_{i,t}$ are white noises with variance $\sigma^2$, denoted $WN(0, \sigma^2)$. When $\sigma^2 = 1$, the matrix $\Omega^{1/2}$ has the following explicit form:

$$\Omega^{1/2} = \begin{pmatrix} \sqrt{1-\phi_1^2} & -\phi_1 & 0 & \cdots & 0 \\ 0 & 1 & -\phi_1 & \cdots & 0 \\ 0 & 0 & \ddots & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & -\phi_1 \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix}. \qquad (1.16)$$
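The matrix (1.16) is directly transcribed in R as follows (a minimal sketch):

```r
# Square root of the AR(1) precision matrix, Eq. (1.16), with sigma^2 = 1
ar1_sqrt_prec <- function(q, phi) {
  M <- diag(q)
  M[1, 1] <- sqrt(1 - phi^2)
  M[cbind(1:(q - 1), 2:q)] <- -phi   # superdiagonal entries
  M   # t(M) %*% Sigma %*% M is the identity for the AR(1) covariance Sigma
}
```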
Slightly more general models are such that each row of E is modeled as an autoregressive moving-average process ARMA(p, q). In that case, for all i in $\llbracket 1, n \rrbracket$ and all t, $E_{i,t}$ is the solution of

$$E_{i,t} - \phi_1 E_{i,t-1} - \dots - \phi_p E_{i,t-p} = W_{i,t} + \theta_1 W_{i,t-1} + \dots + \theta_q W_{i,t-q},$$

with $W_{i,t} \sim WN(0, \sigma^2)$ and where the $\phi_j$ and the $\theta_j$ are real parameters.
When modeling by an ARMA process is not appropriate, each row of E can be modeled as a general weakly stationary process and Σ estimated as a symmetric Toeplitz matrix, that is, as

$$\hat{\Sigma} = \begin{pmatrix} \hat{\gamma}(0) & \hat{\gamma}(1) & \cdots & \hat{\gamma}(q-1) \\ \hat{\gamma}(1) & \hat{\gamma}(0) & \cdots & \hat{\gamma}(q-2) \\ \vdots & & \ddots & \vdots \\ \hat{\gamma}(q-1) & \hat{\gamma}(q-2) & \cdots & \hat{\gamma}(0) \end{pmatrix}, \qquad (1.17)$$

where

$$\hat{\gamma}(h) = \frac{1}{n} \sum_{i=1}^{n} \hat{\gamma}_i(h),$$

and $\hat{\gamma}_i(h)$ is the estimator of the autocovariance $\gamma_i(h) = E(E_{i,t} E_{i,t+h})$, for all t and all h in $\mathbb{Z}$.
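A minimal R transcription of the Toeplitz estimator (1.17), averaging the empirical autocovariances of the residual rows:

```r
# Toeplitz covariance estimator (1.17) from the residual matrix E_hat (n x q)
toeplitz_cov <- function(E_hat) {
  q <- ncol(E_hat)
  gam <- rowMeans(apply(E_hat, 1, function(e)
    acf(e, lag.max = q - 1, type = "covariance",
        demean = FALSE, plot = FALSE)$acf[, 1, 1]))
  toeplitz(gam)   # hat Sigma, constant along each diagonal
}
```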
To select the most suitable estimator, we propose in Chapter 3 a statistical test that is an adaptation of the Portmanteau test, itself based on Bartlett's theorem (Brockwell & Davis, 1991). We call $\hat{\Sigma}$ the final estimator of Σ thus obtained.
$$E_i = f_i B_f + U_i, \qquad (1.18)$$

$$\Sigma = B_f^\top\, \mathrm{Cov}(f)\, B_f + \Sigma_u, \qquad (1.19)$$
Hosseini & Lee (2016) also propose a method combining sparsity and factor models that estimates the precision matrix, rather than the covariance matrix, as a block-sparse matrix with possibly overlapping blocks. Such a matrix is very useful, for instance, for performing variable selection in the general linear model, as seen in Section 1.2.2. Finally, Perthame et al. (2016) combine variable selection and factor models in order to take into account the dependence among the response variables when X is an ANOVA matrix. To do so, they propose the model:

$$Y = XB + Z B_Z^\top + E, \qquad (1.20)$$

with $\Sigma_u$ a diagonal matrix. Each row of Z is assumed to be distributed according to a normal distribution with zero mean and identity covariance matrix. They propose an iterative method alternating between the estimation of B, $B_Z$ and $\Sigma_u$ and the inference of Z. With the transformation $Y - Z B_Z^\top$ they remove the dependence existing between the different response variables and apply variable selection methods such as those described in Section 1.2.1.
Scientific production

We are interested here in the estimation of a sparse block-structured correlation matrix, using a factor model and sparse matrix estimation methods.
$$\Sigma = ZZ^\top + D, \qquad (1.22)$$

where Z is a q × k matrix with $k \ll q$ and D is a diagonal matrix such that all diagonal entries of Σ are equal to 1. Examples, in the case of diagonal blocks and in the case of off-diagonal blocks, are given in Figure 1.2.
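For illustration, such a low-rank-plus-diagonal block correlation matrix can be built in R as follows; the block sizes and loading values are arbitrary choices for the example:

```r
# Block correlation matrix Sigma = Z Z' + D, Eq. (1.22), with unit diagonal
q <- 50; k <- 2
Z <- matrix(0, q, k)
Z[1:20, 1] <- 0.8    # first block: variables 1 to 20
Z[21:50, 2] <- 0.6   # second block: variables 21 to 50
D <- diag(1 - rowSums(Z^2))   # forces diag(Sigma) = 1
Sigma <- Z %*% t(Z) + D
stopifnot(all(abs(diag(Sigma) - 1) < 1e-12))
```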
[Figure 1.2: the matrices Z and Σ in the diagonal-block case (left) and in the off-diagonal-block case (right).]
[Figure: the matrices $\Sigma \in \mathcal{M}_{50\times50}$ (left) and $\Gamma \in \mathcal{M}_{49\times49}$ (right).]
... fight this type of antigen. Depending on these signals, Th lymphocytes have been categorized into different profiles. The first two profiles characterized in the literature are the Th1 and Th2 profiles (Mosmann et al., 1986; Mosmann & Coffman, 1989). In the presence of an intracellular pathogen, the DC secretes interleukin 12 (IL12); once captured by the naive T lymphocyte, the latter differentiates into a Th1 lymphocyte and secretes, in particular, interferon gamma (IFNg) to fight this pathogen. Similarly, in the presence of an extracellular parasite, the DCs secrete signals that lead the naive T lymphocyte to differentiate into a Th2 lymphocyte, which secretes interleukins 4, 5 and 13 (IL4, IL5, IL13). Numerous studies have brought to light new profiles, such as the Th17 profile, induced by the presence of IL6, TNFa, IL23 and TGFb, which secretes IL17A and IL17F in response to the presence of external bacteria and fungi (see Park et al., 2005). Figure 1.4, a simplified version of Figure 1 of Leung et al. (2010), shows several of the Th profiles that have been described.
[Figure 1.4: simplified Th differentiation profiles — IL12p70 induces Th1 (secreting IFNg and IL2); IL4 induces Th2 (secreting IL4, IL5 and IL13); TGFb and IL6 induce Th17 (secreting IL17A and IL17F).]
In this part we study data measuring the response of Th lymphocytes to DC signals under various perturbation conditions aiming to reproduce in vitro the in situ and in vivo environment in which dendritic cells and Th lymphocytes are immersed. This data set contains, for 428 DC group / Th lymphocyte group pairs, the values of 36 signals secreted by the DCs and of 18 signals coming from the Th lymphocytes.

From these experiments we obtain a 428 × 36 matrix X (resp. a matrix Y) containing the values of the 36 DC signals (resp. of the 18 Th signals) for the 428 samples. To explain the matrix Y as a function of the matrix X, we applied the methodology described in Chapter 2 and in Section 1.2.2. By looking at the non-null entries of $\hat{B}$, potential associations between DC signals and Th signals can then be highlighted. Some are known from the literature, others are not. For example, the model suggests an association between IL12p70 and IL17F. Yet IL12p70 is known to induce Th1 profiles, whereas IL17F is characteristic of Th17. Experiments were carried out, and it turned out that IL12p70 can indeed influence IL17F in certain contexts. This study led to an article, available in the appendix; a detailed summary is given in Chapter 5 of this thesis.
Chapter 2
Variable selection in multivariate linear models with high-dimensional covariance matrix estimation
Scientific production
Abstract
In this paper, we propose a novel variable selection approach in the framework of
multivariate linear models taking into account the dependence that may exist between
the responses. It consists in estimating beforehand the covariance matrix Σ of the
responses and plugging this estimator into a Lasso criterion, in order to obtain a sparse
estimator of the coefficient matrix. The properties of our approach are investigated
both from a theoretical and a numerical point of view. More precisely, we give general
conditions that the estimators of the covariance matrix and its inverse have to satisfy
in order to recover the positions of the null and non null entries of the coefficient
matrix when the size of Σ is not fixed and can tend to infinity. We prove that these
conditions are satisfied in the particular case of some Toeplitz matrices. Our approach is
implemented in the R package MultiVarSel available from the Comprehensive R Archive
Network (CRAN) and is very attractive since it benefits from a low computational load.
We also assess the performance of our methodology using synthetic data and compare
it with alternative approaches. Our numerical experiments show that including the
estimation of the covariance matrix in the Lasso criterion dramatically improves the
variable selection performance in many cases.
2.1 Introduction
The multivariate linear model consists in generalizing the classical linear model,
in which a single response is explained by p variables, to the case where the number
q of responses is larger than 1. Such a general modeling can be used in a wide va-
$$Y = XB + E, \qquad (2.1)$$

with

$$E_{i,\bullet} \sim \mathcal{N}(0, \Sigma), \qquad (2.2)$$

where Σ denotes the covariance matrix of the ith row of the error matrix E. We shall moreover assume that the different rows of E are independent. With such assumptions, there is some dependence between the columns of E but not between
the rows. Our goal is here to design a variable selection approach which is able to
identify the positions of the null and non null entries in the sparse matrix B by
taking into account the dependence between the columns of E.
This issue has recently been considered by Lee & Liu (2012), who extended the approach of Rothman et al. (2010). More precisely, Lee & Liu (2012) proposed three approaches for dealing with this issue based on penalized maximum likelihood with a weighted ℓ1 regularization. In their
first approach B is estimated by using a plug-in estimator of Σ−1 , in the second
one, Σ−1 is estimated by using a plug-in estimator of B and in the third one, Σ−1
and B are estimated simultaneously. Lee & Liu (2012) also investigate
the asymptotic properties of their methods when the sample size n tends to infinity
and the number of rows and columns q of Σ is fixed.
In this paper, we propose to estimate Σ beforehand and to plug this estimator in a
Lasso criterion, in order to obtain a sparse estimator of B. Hence, our methodology
is close to the first approach of Lee & Liu (2012). However, there are
two main differences.
The first one is the asymptotic framework in which our theoretical results are
established: q is assumed to depend on n and to tend to infinity at a rate which can
be larger than n as n tends to infinity. Moreover, p is assumed to be fixed. In this
framework, we give general conditions that the estimators of Σ and Σ−1 have to
satisfy in order to be able to recover the support of B that is to find the positions of
the null and non null entries of the matrix B. Such a framework in which q is much
larger than n and p is fixed may for instance occur in metabolomics which aims to
provide a global snapshot of the metabolism. In a typical metabolomic experiment,
we have access to the responses of q metabolites (features) for n samples belonging
to different groups. This information can be summarized in a n × q data matrix
where q ≈ 5000 and n ≈ 30. The goal is then to identify the most important features
distinguishing the different groups. Hence, this problem can be modeled by (2.1)
where X is the design matrix of a one-way ANOVA where p corresponds to the
number of groups.
The second main difference between Lee & Liu (2012) and our approach is the
strategy that we use for estimating Σ. In Lee & Liu (2012), Σ−1 is estimated by
using an adaptation of the Graphical Lasso (GLASSO) proposed by Friedman et al.
(2008). This technique has also been considered in Yuan & Lin (2007); Banerjee et al.
(2008a); Rothman et al. (2008). Here, we propose to estimate Σ beforehand from
the empirical covariance matrix of the residuals assuming that Σ has a particular
structure, Toeplitz for instance. We prove its efficiency in the particular case of an
AR(1) process in Section 2.2.3. Such a process is used, for instance, in population
genetics for modeling the phenomenon of recombination as shown in Chiquet et al.
(2016). More generally, we give general conditions that the estimators of Σ and Σ−1
have to satisfy in order to be able to recover the support of B. Hence, any approach
providing estimators satisfying these conditions can be used.
Let us now describe more precisely our methodology. We start by "whitening" the observations Y by applying the following transformation to Model (2.1):

$$Y \Sigma^{-1/2} = X B \Sigma^{-1/2} + E \Sigma^{-1/2}. \qquad (2.3)$$

The goal of such a transformation is to remove the dependence between the columns of Y. Then, for estimating B, we proceed as follows. Let us observe that (2.3) can be rewritten as:

$$\mathcal{Y} = \mathcal{X}\, \mathcal{B} + \mathcal{E}, \qquad (2.4)$$

with

$$\mathcal{Y} = \mathrm{vec}(Y \Sigma^{-1/2}), \quad \mathcal{X} = (\Sigma^{-1/2})^\top \otimes X, \quad \mathcal{B} = \mathrm{vec}(B), \quad \mathcal{E} = \mathrm{vec}(E \Sigma^{-1/2}). \qquad (2.5)$$
For a fixed nonnegative λ, the Lasso estimator of $\mathcal{B}$ is then defined by

$$\hat{\mathcal{B}}(\lambda) = \mathrm{Argmin}_{\mathcal{B}}\left\{ \|\mathcal{Y} - \mathcal{X}\mathcal{B}\|_2^2 + \lambda \|\mathcal{B}\|_1 \right\}, \qquad (2.6)$$

where $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the classical ℓ1-norm and ℓ2-norm, respectively. Inspired by Zhao & Yu (2006), Theorem 2.1 establishes some conditions under which the positions of the null and non null entries of $\mathcal{B}$ can be recovered by using $\hat{\mathcal{B}}$.
In practical situations, the covariance matrix Σ is generally unknown and has thus to be estimated. Let $\hat{\Sigma}$ denote an estimator of Σ. For a description of the methodology that we propose for estimating Σ, we refer the reader to the end of Section 2.2.2. Then, the estimator $\hat{\Sigma}^{-1/2}$ of $\Sigma^{-1/2}$ is such that

$$\hat{\Sigma}^{-1} = \hat{\Sigma}^{-1/2}(\hat{\Sigma}^{-1/2})^\top.$$
Model (2.1) is then transformed into

$$Y \hat{\Sigma}^{-1/2} = X B \hat{\Sigma}^{-1/2} + E \hat{\Sigma}^{-1/2}, \qquad (2.7)$$

which can be rewritten as $\tilde{Y} = \tilde{\mathcal{X}} \mathcal{B} + \tilde{E}$, where

$$\tilde{Y} = \mathrm{vec}(Y \hat{\Sigma}^{-1/2}), \quad \tilde{\mathcal{X}} = (\hat{\Sigma}^{-1/2})^\top \otimes X, \quad \mathcal{B} = \mathrm{vec}(B) \quad \text{and} \quad \tilde{E} = \mathrm{vec}(E \hat{\Sigma}^{-1/2}). \qquad (2.8)$$

The criterion (2.6) then becomes

$$\tilde{\mathcal{B}}(\lambda) = \mathrm{Argmin}_{\mathcal{B}}\left\{ \|\tilde{Y} - \tilde{\mathcal{X}}\mathcal{B}\|_2^2 + \lambda \|\mathcal{B}\|_1 \right\}. \qquad (2.9)$$
By extending Theorem 2.1, Theorem 2.5 gives some conditions on the eigenvalues of $\Sigma^{-1}$ and on the convergence rates of $\hat{\Sigma}$ and its inverse to Σ and $\Sigma^{-1}$, respectively, under which the positions of the null and non null entries of $\mathcal{B}$ can be recovered by using $\tilde{\mathcal{B}}$.
We prove in Section 2.2.3 that when Σ is a particular Toeplitz matrix, namely
the covariance matrix of an AR(1) process, the assumptions of Theorem 2.5 are
satisfied. This strategy has been implemented in the R package MultiVarSel, which
is available on the Comprehensive R Archive Network (CRAN), for more general
Toeplitz matrices Σ such as the covariance matrix of ARMA processes or general
stationary processes. For a successful application of this methodology to particular
“-omic” data, namely metabolomic data, we refer the reader to Perrot-Dockès et al.
(2018). For a review of the most recent methods for estimating high-dimensional
covariance matrices, we refer the reader to Pourahmadi (2013).
The paper is organized as follows. Section 2.2 is devoted to the theoretical results
of the paper. The assumptions under which the positions of the non null and null
entries of B can be recovered are established in Theorem 2.1 when Σ is known and
in Theorem 2.5 when Σ is unknown. Section 2.2.3 studies the specific case of the
AR(1) model. We present in Section 2.3 some numerical experiments in order to
support our theoretical results. The proofs of our main theoretical results are given
in Section 2.5.
Define

$$\mathcal{C} = \frac{1}{nq} \mathcal{X}^\top \mathcal{X} \quad \text{and} \quad J = \{j : 1 \le j \le pq,\ \mathcal{B}_j \neq 0\}, \qquad (2.10)$$

where $\mathcal{X}$ is defined in (2.5) and where $\mathcal{B}_j$ denotes the jth component of the vector $\mathcal{B}$ defined in (2.5).
Let also define

$$\mathcal{C}_{J,J} = \frac{1}{nq} (\mathcal{X}_{\bullet,J})^\top \mathcal{X}_{\bullet,J} \quad \text{and} \quad \mathcal{C}_{J^c,J} = \frac{1}{nq} (\mathcal{X}_{\bullet,J^c})^\top \mathcal{X}_{\bullet,J}, \qquad (2.11)$$

where $\mathcal{X}_{\bullet,J}$ and $\mathcal{X}_{\bullet,J^c}$ denote the columns of $\mathcal{X}$ belonging to the set J defined in (2.10) and to its complement $J^c$, respectively.
More generally, for any matrix A, AI,J denotes the partitioned matrix extracted
from A by considering the rows of A belonging to the set I and the columns of A
belonging to the set J, with • indicating all the rows or all the columns.
The following theorem gives some conditions under which the estimator $\hat{\mathcal{B}}$ defined in (2.6) is sign-consistent as defined by Zhao & Yu (2006), namely,

$$\lim_{n \to \infty} P\{\mathrm{sign}(\hat{\mathcal{B}}) = \mathrm{sign}(\mathcal{B})\} = 1,$$
where the sign function maps positive entries to 1, negative entries to −1 and zero
to 0.
Theorem 2.1. Assume that $\mathcal{Y} = (\mathcal{Y}_1, \dots, \mathcal{Y}_{nq})^\top$ satisfies Model (2.4). Assume also that there exist some positive constants $M_1$, $M_2$, $M_3$ and positive numbers $c_1$, $c_2$ such that $c_1 + c_2 \in (0, 1/2)$ satisfying

(A1) For all $n \in \mathbb{N}$ and $j \in \{1, \dots, pq\}$, $n^{-1} (\mathcal{X}_{\bullet,j})^\top \mathcal{X}_{\bullet,j} \le M_1$, where $\mathcal{X}_{\bullet,j}$ is the jth column of $\mathcal{X}$,

where $\hat{\mathcal{B}}(\lambda)$ is defined by (2.6).
Remark 2.1. Observe that if $c_1 + c_2 < 1/(2k)$ for some positive k, then the first condition of (L) becomes $q = o(n^k)$. Hence, for large values of k, the size q of Σ is much larger than n.
Proposition 2.2. Let $\hat{\mathcal{B}}(\lambda)$ be defined by (2.6). Then

$$P[\mathrm{sign}\{\hat{\mathcal{B}}(\lambda)\} = \mathrm{sign}(\mathcal{B})] \ge P(A_n \cap B_n),$$

where

$$A_n = \left\{ \left| (\mathcal{C}_{J,J})^{-1} W_J \right| < \sqrt{nq}\left( |\mathcal{B}_J| - \frac{\lambda}{2nq} \left| (\mathcal{C}_{J,J})^{-1} \mathrm{sign}(\mathcal{B}_J) \right| \right) \right\} \qquad (2.12)$$

and

$$B_n = \left\{ \left| \mathcal{C}_{J^c,J} (\mathcal{C}_{J,J})^{-1} W_J - W_{J^c} \right| \le \frac{\lambda}{2\sqrt{nq}}\left( 1 - \left| \mathcal{C}_{J^c,J} (\mathcal{C}_{J,J})^{-1} \mathrm{sign}(\mathcal{B}_J) \right| \right) \right\}, \qquad (2.13)$$

with $W = \mathcal{X}^\top \mathcal{E} / \sqrt{nq}$. In (2.12) and (2.13), $\mathcal{C}_{J,J}$ and $\mathcal{C}_{J^c,J}$ are defined in (2.11) and $W_J$ and $W_{J^c}$ denote the components of W belonging to J and $J^c$, respectively. Note that the inequalities hold element-wise.
Proposition 2.3. If there exist some positive constants $M_1'$, $M_2'$, $m_1$, $m_2$ such that, for all $n \in \mathbb{N}$,

(C1) For all $j \in \{1, \dots, p\}$, $n^{-1}(X_{\bullet,j})^\top X_{\bullet,j} \le M_1'$,

(C2) $n^{-1} \lambda_{\min}(X^\top X) \ge M_2'$,

(C3) $\lambda_{\max}(\Sigma^{-1}) \le m_1$,

(C4) $\lambda_{\min}(\Sigma^{-1}) \ge m_2$,

then Assumptions (A1) and (A2) of Theorem 2.1 are satisfied.
Remark 2.2. Observe that (C1) and (C2) hold in the case where the columns of
the matrix X are orthogonal.
We give in Proposition 2.6 in Section 2.2.3 some conditions under which Condi-
tion (IC) holds in the specific case where Σ is the covariance matrix of an AR(1)
process.
$$\tilde{\mathcal{C}} = \frac{1}{nq} \tilde{\mathcal{X}}^\top \tilde{\mathcal{X}} \qquad (2.14)$$

and

$$\tilde{\mathcal{C}}_{J,J} = \frac{1}{nq} (\tilde{\mathcal{X}}_{\bullet,J})^\top \tilde{\mathcal{X}}_{\bullet,J} \quad \text{and} \quad \tilde{\mathcal{C}}_{J^c,J} = \frac{1}{nq} (\tilde{\mathcal{X}}_{\bullet,J^c})^\top \tilde{\mathcal{X}}_{\bullet,J}, \qquad (2.15)$$

where $\tilde{\mathcal{X}}_{\bullet,J}$ and $\tilde{\mathcal{X}}_{\bullet,J^c}$ denote the columns of $\tilde{\mathcal{X}}$ belonging to the set J defined in (2.10) and to its complement $J^c$, respectively.
Proposition 2.4. Let $\tilde{\mathcal{B}}(\lambda)$ be defined by (2.9). Then

$$P[\mathrm{sign}\{\tilde{\mathcal{B}}(\lambda)\} = \mathrm{sign}(\mathcal{B})] \ge P(\tilde{A}_n \cap \tilde{B}_n),$$

where

$$\tilde{A}_n = \left\{ \left| (\tilde{\mathcal{C}}_{J,J})^{-1} \tilde{W}_J \right| < \sqrt{nq}\left( |\mathcal{B}_J| - \frac{\lambda}{2nq} \left| (\tilde{\mathcal{C}}_{J,J})^{-1} \mathrm{sign}(\mathcal{B}_J) \right| \right) \right\} \qquad (2.16)$$

and

$$\tilde{B}_n = \left\{ \left| \tilde{\mathcal{C}}_{J^c,J} (\tilde{\mathcal{C}}_{J,J})^{-1} \tilde{W}_J - \tilde{W}_{J^c} \right| \le \frac{\lambda}{2\sqrt{nq}}\left( 1 - \left| \tilde{\mathcal{C}}_{J^c,J} (\tilde{\mathcal{C}}_{J,J})^{-1} \mathrm{sign}(\mathcal{B}_J) \right| \right) \right\}, \qquad (2.17)$$

with $\tilde{W} = \tilde{\mathcal{X}}^\top \tilde{\mathcal{E}} / \sqrt{nq}$. In (2.16) and (2.17), $\tilde{\mathcal{C}}_{J,J}$ and $\tilde{\mathcal{C}}_{J^c,J}$ are defined in (2.15) and $\tilde{W}_J$ and $\tilde{W}_{J^c}$ denote the components of $\tilde{W}$ belonging to J and $J^c$, respectively. Note that the previous inequalities hold element-wise.
The following theorem extends Theorem 2.1 to the case where Σ is unknown and gives some conditions under which the estimator $\tilde{\mathcal{B}}$ defined in (2.9) is sign-consistent.
Theorem 2.5. Assume that Assumptions (A3), (A4), (IC) and (L) of Theorem 2.1 hold. Assume also that there exist some positive constants $M_4$, $M_5$, $M_6$ and $M_7$ such that, for all $n \in \mathbb{N}$,

(A5) $\|(X^\top X)/n\|_\infty \le M_4$,

(A6) $\lambda_{\min}\{(X^\top X)/n\} \ge M_5$,

(A7) $\lambda_{\max}(\Sigma^{-1}) \le M_6$,

(A8) $\lambda_{\min}(\Sigma^{-1}) \ge M_7$.

Suppose also that

(A9) $\|\Sigma^{-1} - \hat{\Sigma}^{-1}\|_\infty = O_P\{(nq)^{-1/2}\}$, as $n \to \infty$,

(A10) $\rho(\Sigma - \hat{\Sigma}) = O_P\{(nq)^{-1/2}\}$, as $n \to \infty$.

Let $\tilde{\mathcal{B}}(\lambda)$ be defined by (2.9); then

$$\lim_{n \to \infty} P[\mathrm{sign}\{\tilde{\mathcal{B}}(\lambda)\} = \mathrm{sign}(\mathcal{B})] = 1.$$

In the previous assumptions, $\lambda_{\max}(A)$, $\lambda_{\min}(A)$, $\rho(A)$ and $\|A\|_\infty$ denote the largest eigenvalue, the smallest eigenvalue, the spectral radius and the infinite norm (induced by the associated vector norm) of the matrix A.
Remark 2.3. In order to distinguish the assumptions that are required for the design matrix X and for the estimator $\hat{\Sigma}$ of Σ, the assumptions of Theorem 2.5 only involve X, Σ and $\Sigma - \hat{\Sigma}$ but not $\tilde{\mathcal{X}}$.

Remark 2.4. Observe that Assumptions (A5) and (A6) hold in the case where the columns of the matrix X are orthogonal. Note also that (A7) and (A8) are the same as (C3) and (C4) in Proposition 2.3.
The proof of Theorem 2.5 is given in Section 2.5 and is based on Proposition 2.4. In order to prove Theorem 2.5, it is enough to show that $P(\tilde{A}_n^c)$ and $P(\tilde{B}_n^c)$ tend to 0 as $n \to \infty$. The idea of the proof consists in rewriting $P(\tilde{A}_n^c)$ (resp. $P(\tilde{B}_n^c)$) by adding terms depending on $\Sigma - \hat{\Sigma}$ to $P(A_n^c)$ (resp. $P(B_n^c)$) and proving that these additional terms tend to zero as $n \to \infty$.
In order to estimate Σ, we propose the following strategy:

(a) Fitting a classical linear model to each column of the matrix Y in order to have access to an estimation $\hat{E}$ of the random error matrix E. This is possible since p is assumed to be fixed and smaller than n.

(b) Estimating Σ from $\hat{E}$ by assuming that Σ has a particular structure, Toeplitz for instance.

More precisely, $\hat{E}$ defined in the first step is such that

$$\hat{E} = Y - X(X^\top X)^{-1} X^\top Y = \Pi Y, \quad \text{with } \Pi = \mathrm{Id}_{\mathbb{R}^n} - X(X^\top X)^{-1}X^\top, \qquad (2.18)$$

so that, in vectorized form,

$$\mathrm{vec}(\hat{E}) = (\mathrm{Id}_{\mathbb{R}^q} \otimes \Pi)\, \mathrm{vec}(E). \qquad (2.19)$$
The following proposition gives some conditions under which the strong Irrepre-
sentable Condition (IC) of Theorem 2.1 holds.
Proposition 2.6. Assume that $(E_{1,t})_t, \dots, (E_{n,t})_t$ in Model (2.1) are independent AR(1) processes such that, for all $i \in \{1, \dots, n\}$, $E_{i,t} - \phi_1 E_{i,t-1} = Z_{i,t}$, where the $Z_{i,t}$s are zero-mean iid Gaussian random variables with variance $\sigma^2 = 1$ and $|\phi_1| < 1$. Assume also that X defined in (2.1) is such that $X^\top X = \nu\, \mathrm{Id}_{\mathbb{R}^p}$, where ν is a positive constant. Moreover, suppose that if $j \in J$, then $j > p$ and $j < pq - p$. Suppose also that, for all j, $j - p$ or $j + p$ is not in J. Then, the strong Irrepresentable Condition (IC) of Theorem 2.1 holds.
Sufficient conditions for Assumptions (A7), (A8), (A9) and (A10) of Theorem 2.5
The following proposition establishes that in the particular case where the $(E_{1,t})_t, \dots, (E_{n,t})_t$ are independent AR(1) processes, our strategy for estimating Σ provides an estimator satisfying the assumptions of Theorem 2.5.

Proposition 2.7. Assume that $(E_{1,t})_t, \dots, (E_{n,t})_t$ in Model (2.1) are independent AR(1) processes such that, for all $i \in \{1, \dots, n\}$, $E_{i,t} - \phi_1 E_{i,t-1} = Z_{i,t}$, where the $Z_{i,t}$s are zero-mean iid Gaussian random variables with variance $\sigma^2 = 1$ and $|\phi_1| < 1$. Let

$$\hat{\Sigma} = \frac{1}{1 - \hat{\phi}_1^2} \begin{pmatrix} 1 & \hat{\phi}_1 & \hat{\phi}_1^2 & \cdots & \hat{\phi}_1^{q-1} \\ \hat{\phi}_1 & 1 & \hat{\phi}_1 & \cdots & \hat{\phi}_1^{q-2} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ \vdots & & \ddots & \ddots & \vdots \\ \hat{\phi}_1^{q-1} & \cdots & \cdots & \cdots & 1 \end{pmatrix},$$

where

$$\hat{\phi}_1 = \frac{\sum_{i=1}^n \sum_{\ell=2}^q \hat{E}_{i,\ell} \hat{E}_{i,\ell-1}}{\sum_{i=1}^n \sum_{\ell=1}^{q-1} \hat{E}_{i,\ell}^2}, \qquad (2.20)$$

where $\hat{E} = (\hat{E}_{i,\ell})_{1 \le i \le n, 1 \le \ell \le q}$ is defined in (2.18). Then, Assumptions (A7), (A8), (A9) and (A10) of Theorem 2.5 are valid.
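The estimator (2.20) is a one-line computation in R (a minimal sketch, with E_hat the n × q residual matrix of (2.18)):

```r
# AR(1) parameter estimator, Eq. (2.20)
phi1_hat <- function(E_hat) {
  q <- ncol(E_hat)
  sum(E_hat[, 2:q] * E_hat[, 1:(q - 1)]) / sum(E_hat[, 1:(q - 1)]^2)
}
```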
The proof of Proposition 2.7 is given in Section 2.5. It is based on the following lemma.

Lemma 2.8. Assume that $(E_{1,t})_t, \dots, (E_{n,t})_t$ in Model (2.1) are independent AR(1) processes such that, for all $i \in \{1, \dots, n\}$, $E_{i,t} - \phi_1 E_{i,t-1} = Z_{i,t}$, where the $Z_{i,t}$s are zero-mean iid Gaussian random variables with variance $\sigma^2$ and $|\phi_1| < 1$. Let

$$\hat{\phi}_1 = \frac{\sum_{i=1}^n \sum_{\ell=2}^q \hat{E}_{i,\ell} \hat{E}_{i,\ell-1}}{\sum_{i=1}^n \sum_{\ell=1}^{q-1} \hat{E}_{i,\ell}^2},$$

where $\hat{E} = (\hat{E}_{i,\ell})_{1 \le i \le n, 1 \le \ell \le q}$ is defined in (2.18). Then, as $n \to \infty$, $\sqrt{nq_n}\,(\hat{\phi}_1 - \phi_1) = O_P(1)$.

Lemma 2.8 is proved in Section 2.5. Its proof is based on Lemma 2.10 in Section 2.6.
2.3 Numerical experiments
The goal of this section is twofold: (i) to provide sanity checks for our theoretical
results in a well-controlled framework; and (ii) to investigate the robustness of our
estimator to some violations of the assumptions of our theoretical results. The latter
may reveal a broader scope of applicability for our method than the one guaranteed
by the theoretical results.
We investigate (i) in the AR(1) framework presented in Section 2.2.3. Indeed, all
assumptions made in Theorems 2.1 and 2.5 can be specified with well-controllable
simulation parameters in the AR(1) case with balanced design matrix X.
Point (ii) aims to explore the limitations of our theoretical framework and assess
its robustness. To this end, we propose two numerical studies relaxing some of the
assumptions of our theorems: first, we study the effect of an unbalanced design —
which violates the sufficient condition of the irrepresentability condition (IC) given
in Proposition 2.6 — on the sign-consistency; and second, we study the effect of
other types of dependence than an AR(1).
In all experiments, the performance is assessed in terms of sign-consistency. In other words, we evaluate the probability for the sign of various estimators to be equal to sign($\mathcal{B}$). More precisely, we investigate for each estimator whether there exists at least one λ such that $\mathrm{sign}\{\hat{\mathcal{B}}(\lambda)\} = \mathrm{sign}(\mathcal{B})$. We compare the performance of three different estimators:
(i) $\hat{\mathcal{B}}$ defined in (2.6), which corresponds to the LASSO criterion applied to the
data whitened with the true covariance matrix Σ; we call this estimator oracle.
Its theoretical properties are established in Theorem 2.1.
(ii) $\tilde{\mathcal{B}}$ defined in (2.9), which corresponds to the LASSO criterion applied to the data whitened with an estimator $\hat{\Sigma}$ of the covariance matrix; we refer to this
estimator as whitened-lasso. Its theoretical properties are established in
Theorem 2.5.
(iii) the LASSO criterion applied to the raw data, which we call raw-lasso here-
after. Its theoretical properties are established only in the univariate case in
Alquier & Doukhan (2011).
2.3.1 AR(1) dependence structure with balanced one-way ANOVA

In this section, we consider Model (2.1) where X is the design matrix of a one-way ANOVA with two balanced groups. Each row of the random error matrix E is distributed as a centered Gaussian random vector as in Eq. (2.2), where the matrix
[Figure 2.1: frequencies of support recovery as a function of n for oracle, raw-lasso and whitened-lasso, with q ∈ {10, 50, 1000}, k ∈ {1, 2, 4} and φ1 ∈ {0.5, 0.95}.]
To this end, we consider the multivariate linear model (2.1) with the same AR(1)
dependence as the one considered in Section 2.3.1. Then, two different matrices
X are considered: first, a one-way ANOVA model with two unbalanced groups
with respective sizes n1 and n2 such that n1 + n2 = n; and second, a multiple
regression model with p correlated Gaussian predictors such that the rows of X are
iid N (0, ΣX ).
For the one-way ANOVA, violation of (IC) may occur when r = n1 /n is too
different from 1/2, as stated in Proposition 2.6. For the regression model, we choose
for $\Sigma^X$ a 9 × 9 matrix (p = 9) such that $\Sigma^X_{i,i} = 1$ and $\Sigma^X_{i,j} = \rho$ when $i \neq j$.
i,i = 1, Σi,j = ρ, when i 6= j.
The other simulation parameters are fixed as in Section 2.3.1.
We report in Figure 2.2 the results for the case where q = 1000 and k = 2
both for unbalanced one-way ANOVA (top panels) and regression with correlated
predictors (bottom panels). For the one-way ANOVA, r varies in {0.4, 0.2, 0.1}.
For the regression case, ρ varies in {0.2, 0.6, 0.8}. In both cases, the gray lines
correspond to the ideal situation, that is, either balanced (r = 0.5) or uncorrelated (ρ = 0) in the legend of Figure 2.2. The probability of support recovery is estimated
over 1000 runs.
From this figure, we note that correlated features or unbalanced designs deteriorate the support recovery of all estimators. This was expected for these LASSO-based methods, which all suffer from the violation of the irrepresentability condition (IC). However, we also note that whitened-lasso and oracle have similar performance, which means that the estimation of Σ is not altered, and that whitening always improves the support recovery.
[Figure 2.2 – Frequencies of support recovery in general linear models with unbalanced designs: one-way ANOVA and regression.]

[Figure 2.3 – Frequencies of support recovery in one-way ANOVA with AR(m) covariance matrix, for m ∈ {5, 10} and q = 500.]

2.4 Discussion

In this paper, we proposed a variable selection approach for multivariate linear models taking into account the dependence that may exist between the responses, and established its theoretical properties. More precisely, our method consists in estimating the covariance matrix Σ, which models the dependence between the responses, and plugging this estimator into a Lasso criterion, in order to obtain a sparse estimator of the coefficient matrix. We give general conditions that the estimators of the covariance matrix and its inverse have to satisfy in order to recover the positions of the null and non null entries of the coefficient matrix when the size of Σ is not fixed and can tend to infinity. In particular, we prove that these general conditions are satisfied for some specific Toeplitz matrices, such as the covariance matrix of an AR(1) process. Note that our approach has been successfully applied to a data set coming from a metabolomic experiment; for further details, we refer the reader to Perrot-Dockès et al. (2018). Since, in that paper, we used general Toeplitz covariance matrices such as those of ARMA(p,q) processes or of weakly dependent stationary processes, it would be interesting to prove that the strategy that we used provides estimators of Σ satisfying the assumptions of Theorem 2.5. Moreover, it would be interesting to see whether other types of structured covariance matrices satisfy the assumptions of our theorems.
2.5 Proofs
Proof of Proposition 2.2. For a fixed nonnegative λ, by (2.6),

$$\hat{\mathcal{B}} = \hat{\mathcal{B}}(\lambda) = \mathrm{Argmin}_{\mathcal{B}} \{\|\mathcal{Y} - \mathcal{X}\mathcal{B}\|_2^2 + \lambda \|\mathcal{B}\|_1\}.$$

Denoting $\hat{u} = \hat{\mathcal{B}} - \mathcal{B}$, we get

$$\|\mathcal{Y} - \mathcal{X}\hat{\mathcal{B}}\|_2^2 + \lambda\|\hat{\mathcal{B}}\|_1 = \|\mathcal{X}\mathcal{B} + \mathcal{E} - \mathcal{X}\hat{\mathcal{B}}\|_2^2 + \lambda\|\hat{u} + \mathcal{B}\|_1 = \|\mathcal{E} - \mathcal{X}\hat{u}\|_2^2 + \lambda\|\hat{u} + \mathcal{B}\|_1$$
$$= \|\mathcal{E}\|_2^2 - 2\hat{u}^\top \mathcal{X}^\top \mathcal{E} + \hat{u}^\top \mathcal{X}^\top \mathcal{X}\hat{u} + \lambda\|\hat{u} + \mathcal{B}\|_1.$$

Thus,

$$\hat{u} = \mathrm{Argmin}_u\, V(u),$$
where

$$V(u) = -2(\sqrt{nq}\, u)^\top W + (\sqrt{nq}\, u)^\top \mathcal{C} (\sqrt{nq}\, u) + \lambda \|u + \mathcal{B}\|_1.$$

Then $\hat{u}$ satisfies

$$\mathcal{C}_{J,J}(\sqrt{nq}\, \hat{u}_J) - W_J = -\frac{\lambda}{2\sqrt{nq}} \mathrm{sign}(\hat{u}_J + \mathcal{B}_J) = -\frac{\lambda}{2\sqrt{nq}} \mathrm{sign}(\hat{\mathcal{B}}_J)$$

if $\hat{u}_J + \mathcal{B}_J = \hat{\mathcal{B}}_J \neq 0$, and

$$\left| \mathcal{C}_{J^c,J}(\sqrt{nq}\, \hat{u}_J) - W_{J^c} \right| \le \frac{\lambda}{2\sqrt{nq}}.$$

Note that, if $|\hat{u}_J| < |\mathcal{B}_J|$, then $\hat{\mathcal{B}}_J \neq 0$ and $\mathrm{sign}(\hat{\mathcal{B}}_J) = \mathrm{sign}(\mathcal{B}_J)$.

Let us now prove that when $A_n$ and $B_n$, defined in (2.12) and (2.13), are satisfied, there exists $\hat{u}$ satisfying

$$\mathcal{C}_{J,J}(\sqrt{nq}\, \hat{u}_J) - W_J = -\frac{\lambda}{2\sqrt{nq}} \mathrm{sign}(\mathcal{B}_J), \qquad (2.21)$$

$$|\hat{u}_J| < |\mathcal{B}_J|, \qquad (2.22)$$

$$\left| \mathcal{C}_{J^c,J}(\sqrt{nq}\, \hat{u}_J) - W_{J^c} \right| \le \frac{\lambda}{2\sqrt{nq}}. \qquad (2.23)$$

By denoting

$$\hat{u}_J = \frac{1}{\sqrt{nq}} (\mathcal{C}_{J,J})^{-1} W_J - \frac{\lambda}{2nq} (\mathcal{C}_{J,J})^{-1} \mathrm{sign}(\mathcal{B}_J), \qquad (2.25)$$

we obtain from (2.24) that (2.21) and (2.22) hold. Note that $B_n$ implies

$$-\frac{\lambda}{2\sqrt{nq}} \{1 - \mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1}\mathrm{sign}(\mathcal{B}_J)\} \le \mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1} W_J - W_{J^c} \le \frac{\lambda}{2\sqrt{nq}} \{1 + \mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1}\mathrm{sign}(\mathcal{B}_J)\}.$$

Hence,

$$\left| \mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1} W_J - \frac{\lambda}{2\sqrt{nq}}\, \mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1} \mathrm{sign}(\mathcal{B}_J) - W_{J^c} \right| \le \frac{\lambda}{2\sqrt{nq}},$$

which corresponds to (2.23).
Proof of Theorem 2.1. By Proposition 2.2,

$$P[\mathrm{sign}\{\hat{\mathcal{B}}(\lambda)\} = \mathrm{sign}(\mathcal{B})] \ge P(A_n \cap B_n) = 1 - P(A_n^c \cup B_n^c) \ge 1 - P(A_n^c) - P(B_n^c),$$

where $A_n$ and $B_n$ are defined in (2.12) and (2.13). It is thus enough to prove that $P(A_n^c)$ and $P(B_n^c)$ tend to zero as $n \to \infty$. By definition of $A_n$,

$$P(A_n^c) = P\left( \left| (\mathcal{C}_{J,J})^{-1} W_J \right| \ge \sqrt{nq}\left( |\mathcal{B}_J| - \frac{\lambda}{2nq} |(\mathcal{C}_{J,J})^{-1} \mathrm{sign}(\mathcal{B}_J)| \right) \right) \le \sup_{j \in J} P\left( |\xi_j| \ge \sqrt{nq}\left( |\mathcal{B}_j| - \frac{\lambda}{2nq} |b_j| \right) \right),$$

where

$$\xi = (\xi_j)_{j \in J} = (\mathcal{C}_{J,J})^{-1} W_J = \frac{1}{\sqrt{nq}} (\mathcal{C}_{J,J})^{-1} (\mathcal{X}_{\bullet,J})^\top \mathcal{E} \equiv H_A \mathcal{E},$$

and $b = (b_j)_{j \in J} = (\mathcal{C}_{J,J})^{-1} \mathrm{sign}(\mathcal{B}_J)$. By definition of $B_n$ and (IC),

$$P(B_n^c) = P\left( |\mathcal{C}_{J^c,J} (\mathcal{C}_{J,J})^{-1} W_J - W_{J^c}| > \frac{\lambda}{2\sqrt{nq}}\left( 1 - |\mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1} \mathrm{sign}(\mathcal{B}_J)| \right) \right) \le P\left( |\mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1} W_J - W_{J^c}| > \frac{\lambda}{2\sqrt{nq}}\, \eta \right) \le \sup_{j \in J^c} P\left( |\zeta_j| > \frac{\lambda}{2\sqrt{nq}}\, \eta \right),$$

where $\zeta = (\zeta_j)_{j \in J^c} = \mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1} W_J - W_{J^c} \equiv H_B \mathcal{E}$.
Moreover,

$$\|b\|_2 = \|(\mathcal{C}_{J,J})^{-1}\mathrm{sign}(\mathcal{B}_J)\|_2 \le \|(\mathcal{C}_{J,J})^{-1}\|_2 \sqrt{|J|} \equiv \lambda_{\max}\{(\mathcal{C}_{J,J})^{-1}\}\, \sqrt{|J|},$$

where $\lambda_{\max}(A)$ denotes the largest eigenvalue of the matrix A. Observe that

$$\lambda_{\max}\{(\mathcal{C}_{J,J})^{-1}\} = \frac{1}{\lambda_{\min}(\mathcal{C}_{J,J})} = \frac{q}{\lambda_{\min}\{(\mathcal{X}^\top\mathcal{X})_{J,J}\}/n} \le \frac{q}{M_2}. \qquad (2.26)$$

Thus,

$$P(A_n^c) \le \sup_{j \in J} P\left( |\xi_j| \ge \sqrt{nq}\, M_3 q^{-c_2} - \frac{\lambda q |J|}{2\sqrt{nq}\, M_2} \right).$$

Since $\mathcal{E}$ is a centered Gaussian random vector having a covariance matrix equal to identity, $\xi = H_A \mathcal{E}$ is a centered Gaussian random vector with a covariance matrix equal to:

$$H_A H_A^\top = \frac{1}{nq} (\mathcal{C}_{J,J})^{-1} (\mathcal{X}_{\bullet,J})^\top \mathcal{X}_{\bullet,J} (\mathcal{C}_{J,J})^{-1} = (\mathcal{C}_{J,J})^{-1}.$$

Hence, by (2.26), we get that, for all $j \in J$, $\mathrm{Var}(\xi_j) = ((\mathcal{C}_{J,J})^{-1})_{jj} \le \lambda_{\max}\{(\mathcal{C}_{J,J})^{-1}\} \le q/M_2$. Thus,

$$P\left( |\xi_j| \ge \sqrt{nq}\, M_3 q^{-c_2} - \frac{\lambda q|J|}{2\sqrt{nq}\, M_2} \right) \le P\left( |Z| \ge \frac{\sqrt{M_2}}{\sqrt{q}}\left( M_3 q^{-c_2} \sqrt{nq} - \frac{\lambda q |J|}{2\sqrt{nq}\, M_2} \right) \right),$$

where Z is a standard Gaussian random variable. By Assumption (A3) of Theorem 2.1, we get that under the last condition of (L), as $n \to \infty$,

$$\lambda q |J| / \sqrt{nq} = o\left( q^{-c_2} \sqrt{nq} \right). \qquad (2.29)$$

Thus,

$$\lim_{n \to \infty} P(A_n^c) = 0. \qquad (2.30)$$
Let us now bound $P(B_n^c)$. Observe that $\zeta = H_B \mathcal{E}$ is a centered Gaussian random vector with a covariance matrix equal to

$$H_B H_B^\top = \frac{1}{nq} \{\mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1}(\mathcal{X}_{\bullet,J})^\top - \mathcal{X}_{\bullet,J^c}^\top\}\{\mathcal{X}_{\bullet,J}(\mathcal{C}_{J,J})^{-1}\mathcal{C}_{J,J^c} - \mathcal{X}_{\bullet,J^c}\}$$
$$= \mathcal{C}_{J^c,J^c} - \mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1}\mathcal{C}_{J,J^c} = \frac{1}{nq} (\mathcal{X}_{\bullet,J^c})^\top \left[\mathrm{Id}_{\mathbb{R}^{nq}} - \mathcal{X}_{\bullet,J}\{(\mathcal{X}_{\bullet,J})^\top\mathcal{X}_{\bullet,J}\}^{-1}(\mathcal{X}_{\bullet,J})^\top\right]\mathcal{X}_{\bullet,J^c}$$
$$= \frac{1}{nq} (\mathcal{X}_{\bullet,J^c})^\top \left(\mathrm{Id}_{\mathbb{R}^{nq}} - \Pi_{\mathrm{Im}(\mathcal{X}_{\bullet,J})}\right) \mathcal{X}_{\bullet,J^c},$$

where $\Pi_{\mathrm{Im}(\mathcal{X}_{\bullet,J})}$ denotes the orthogonal projection onto the column space of $\mathcal{X}_{\bullet,J}$. Note that, for all $j \in J^c$,

$$\mathrm{Var}(\zeta_j) = \frac{1}{nq}\left((\mathcal{X}_{\bullet,J^c})^\top (\mathrm{Id}_{\mathbb{R}^{nq}} - \Pi_{\mathrm{Im}(\mathcal{X}_{\bullet,J})}) \mathcal{X}_{\bullet,J^c}\right)_{jj} \le \frac{1}{nq}\left((\mathcal{X}_{\bullet,J^c})^\top \mathcal{X}_{\bullet,J^c}\right)_{jj} \le \frac{M_1}{q},$$

where the inequalities come from Lemma 2.9 and Assumption (A1) of Theorem 2.1. Thus, for all $j \in J^c$,

$$P\left( |\zeta_j| > \frac{\lambda}{2\sqrt{nq}}\, \eta \right) \le P\left( |Z| > \frac{\lambda \sqrt{q}}{2\sqrt{M_1}\, \sqrt{nq}}\, \eta \right),$$

which tends to zero as $n \to \infty$ by the second condition of (L).
Proof of Proposition 2.3. Let us first prove that (C1) and (C3) imply (A1). For $j \in \{1, \dots, pq\}$, by considering the Euclidean division of $j - 1$ by p given by $j - 1 = p k_j + r_j$, we observe that

$$(\mathcal{X}_{\bullet,j})^\top \mathcal{X}_{\bullet,j} = \{((\Sigma^{-1/2})^\top \otimes X)_{\bullet,j}\}^\top ((\Sigma^{-1/2})^\top \otimes X)_{\bullet,j} = \{(\Sigma^{-1/2})_{k_j+1,\bullet} \otimes (X_{\bullet,r_j+1})^\top\}\left[\{(\Sigma^{-1/2})_{\bullet,k_j+1}\}^\top \otimes X_{\bullet,r_j+1}\right]$$
$$= (\Sigma^{-1})_{k_j+1,k_j+1}\, (X_{\bullet,r_j+1})^\top X_{\bullet,r_j+1},$$

so that, by (C1),

$$\frac{1}{n}(\mathcal{X}_{\bullet,j})^\top \mathcal{X}_{\bullet,j} \le M_1'(\Sigma^{-1})_{k_j+1,k_j+1} \le M_1' \sup_{k \in \{0,\dots,q-1\}}\{(\Sigma^{-1})_{k+1,k+1}\} \le M_1'\lambda_{\max}(\Sigma^{-1}) \le M_1' m_1,$$

where the last inequality comes from (C3), which gives (A1).

Let us now prove that (C2) and (C4) imply (A2). Note that

$$\frac{1}{n}\lambda_{\min}\left( (\mathcal{X}^\top \mathcal{X})_{J,J} \right) \ge \frac{1}{n}\lambda_{\min}(X^\top X)\, \lambda_{\min}(\Sigma^{-1}) \ge M_2' m_2,$$

which gives (A2).
Proof of Theorem 2.5. By Proposition 2.4,

$$P[\mathrm{sign}\{\tilde{\mathcal{B}}(\lambda)\} = \mathrm{sign}(\mathcal{B})] \ge P(\tilde{A}_n \cap \tilde{B}_n) = 1 - P(\tilde{A}_n^c \cup \tilde{B}_n^c) \ge 1 - P(\tilde{A}_n^c) - P(\tilde{B}_n^c),$$

where $\tilde{A}_n$ and $\tilde{B}_n$ are defined in (2.16) and (2.17). By definition of $\tilde{A}_n$, we get

$$P(\tilde{A}_n^c) = P\left( \left|(\tilde{\mathcal{C}}_{J,J})^{-1} \tilde{W}_J\right| \ge \sqrt{nq}\left( |\mathcal{B}_J| - \frac{\lambda}{2nq} |(\tilde{\mathcal{C}}_{J,J})^{-1} \mathrm{sign}(\mathcal{B}_J)| \right) \right).$$
Observing that

$$(\tilde{\mathcal{C}}_{J,J})^{-1}\tilde{W}_J = (\mathcal{C}_{J,J})^{-1}W_J + (\mathcal{C}_{J,J})^{-1}(\tilde{W}_J - W_J) + \left\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\right\} W_J + \left\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\right\}(\tilde{W}_J - W_J),$$

and using the corresponding decomposition (2.32) of $P(\tilde{A}_n^c)$, we proceed term by term. The first term in the right-hand side of (2.32) tends to 0 by the definition of $A_n^c$ and (2.30). By (2.27), the last term of (2.32) satisfies, for all $j \in J$,

$$P\left( \left|\left[\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\}\,\mathrm{sign}(\mathcal{B}_J)\right]_j\right| \ge \frac{2nq}{5\lambda}\left( |\mathcal{B}_J| - \frac{\lambda}{2nq}\left|(\mathcal{C}_{J,J})^{-1}\mathrm{sign}(\mathcal{B}_J)\right| \right)_j \right) \le P\left( \left|\left[\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\}\,\mathrm{sign}(\mathcal{B}_J)\right]_j\right| \ge \frac{2nq}{5\lambda}\left( M_3 q^{-c_2} - \frac{\lambda q|J|}{2nq M_2} \right) \right).$$

Moreover, for any matrix U and any vector s with entries in {−1, 0, 1},

$$|(U s)_j| = \left| \sum_{k \in J} U_{jk}\, s_k \right| \le \sqrt{|J|}\; \|U\|_2. \qquad (2.33)$$
We focus on

$$\left\|(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\right\|_2 \le \frac{\rho(\mathcal{C}_{J,J} - \tilde{\mathcal{C}}_{J,J})}{\lambda_{\min}(\tilde{\mathcal{C}}_{J,J})\,\lambda_{\min}(\mathcal{C}_{J,J})} \le \frac{\rho(\mathcal{C}_{J,J} - \tilde{\mathcal{C}}_{J,J})}{\lambda_{\min}(\tilde{\mathcal{C}}_{J,J})\,(M_2/q)},$$

where the last inequality comes from Assumption (A2) of Theorem 2.1. By the definition of $\mathcal{C}$ and $\tilde{\mathcal{C}}$ given in (2.10) and (2.14), by using that the eigenvalues of a Kronecker product of two matrices are the products of the eigenvalues of the two matrices (Theorem 4.3.1 of Horn & Johnson, 1986), and by Assumptions (A5), (A6), (A8), (A9) and (A10), we get that, as $n \to \infty$,

$$\left\|(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\right\|_2 = O_P\{q\,(nq)^{-1/2}\}. \qquad (2.36)$$
Hence, by (2.33), we get that, for all $j \in J$,

$$P\left( \left|\left[\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\}\,\mathrm{sign}(\mathcal{B}_J)\right]_j\right| \ge \frac{2nq}{5\lambda}\left(M_3 q^{-c_2} - \frac{\lambda q|J|}{2nq M_2}\right) \right) \le P\left( \sqrt{|J|}\; \|(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\|_2 \ge \frac{2\sqrt{nq}}{5\lambda}\left( M_3 q^{-c_2}\sqrt{nq} - \frac{\lambda q|J|}{2\sqrt{nq}\,M_2} \right) \right).$$

By the last condition of (L), $(nq\,q^{-c_2}/\lambda)/q^{1+c_1} \to \infty$ as $n \to \infty$, and the result follows. Hence, the last term of (2.32) tends to zero as $n \to \infty$.
Let us now study the second term in the right-hand side of (2.32). We have

$$\tilde{W}_J - W_J = \frac{1}{\sqrt{nq}}\left\{\tilde{\mathcal{X}}^\top \tilde{\mathcal{E}} - \mathcal{X}^\top \mathcal{E}\right\}_J = \frac{1}{\sqrt{nq}}\left[ (\hat{\Sigma}^{-1/2} \otimes X^\top)\{(\hat{\Sigma}^{-1/2})^\top \otimes \mathrm{Id}_{\mathbb{R}^n}\}\mathrm{vec}(E) - (\Sigma^{-1/2} \otimes X^\top)\{(\Sigma^{-1/2})^\top \otimes \mathrm{Id}_{\mathbb{R}^n}\}\mathrm{vec}(E) \right]_J$$
$$= \frac{1}{\sqrt{nq}}\left\{ (\hat{\Sigma}^{-1} - \Sigma^{-1}) \otimes X^\top \right\}_J \mathrm{vec}(E) \overset{d}{=} A Z, \qquad (2.37)$$

where

$$A = \frac{1}{\sqrt{nq}}\left[ \left\{(\hat{\Sigma}^{-1} - \Sigma^{-1}) \otimes X^\top\right\}\left\{(\Sigma^{1/2})^\top \otimes \mathrm{Id}_{\mathbb{R}^n}\right\} \right]_{J,\bullet}. \qquad (2.38)$$

Thus, for all $j \in J$, for all γ in $\mathbb{R}$ and every $|J| \times |J|$ matrix D,

$$P\left( \left|\left\{D(\tilde{W}_J - W_J)\right\}_j\right| \ge \gamma \right) = P\left( |(DAZ)_j| \ge \gamma \right) \le P\left( \|D\|_2\, \|A\|_2\, \|Z\|_2 \ge \gamma \right), \qquad (2.40)$$

where A is defined in (2.38) and Z is a centered Gaussian random vector having a
covariance matrix equal to the identity. Hence, for all $j \in J$,

$$P\left( \left|\left[(\mathcal{C}_{J,J})^{-1}(\tilde{W}_J - W_J)\right]_j\right| \ge \frac{\sqrt{nq}}{5}\left( |\mathcal{B}_j| - \frac{\lambda}{2nq}\left[(\mathcal{C}_{J,J})^{-1}\mathrm{sign}(\mathcal{B}_J)\right]_j \right) \right) \le P\left( \|(\mathcal{C}_{J,J})^{-1}\|_2\, \|A\|_2\, \|Z\|_2 \ge \frac{\sqrt{nq}}{5}\left( M_3 q^{-c_2} - \frac{\lambda q|J|}{2nq M_2} \right) \right),$$

where the first inequality comes from Theorem 4.3.15 of Horn & Johnson (1986). Hence, by (A5), (A8) and (A9),

$$\|A\|_2 = \frac{1}{\sqrt{nq}}\left\| \left[\left\{(\hat{\Sigma}^{-1} - \Sigma^{-1}) \otimes X^\top\right\}\left\{(\Sigma^{1/2})^\top \otimes \mathrm{Id}\right\}\right]_{J,\bullet} \right\|_2 = O_P\{q^{-1/2}(nq)^{-1/2}\}. \qquad (2.43)$$

By (2.29), (2.34) and (2.43), it is enough to prove that

$$\lim_{n \to \infty} P\left( \sum_{k=1}^{nq} Z_k^2 \ge nq\; n\, q^{-2c_2} \right) = 0.$$

The result follows from Markov's inequality and the first condition of (L).
Let us now study the third term in the right-hand side of (2.32). Observe that

$$W_J = \frac{1}{\sqrt{nq}}\left[ (\Sigma^{-1/2} \otimes X^\top)\{(\Sigma^{-1/2})^\top \otimes \mathrm{Id}_{\mathbb{R}^n}\}\,\mathrm{vec}(E)\right]_J \overset{d}{=} \frac{1}{\sqrt{nq}}\left[ (\Sigma^{-1} \otimes X^\top)\{(\Sigma^{1/2})^\top \otimes \mathrm{Id}_{\mathbb{R}^n}\}\right]_{J,\bullet} Z \equiv A_1 Z. \qquad (2.44)$$

Using (2.39), we get for every $j \in J$, every $\gamma \in \mathbb{R}$, and every $|J| \times |J|$ matrix D,
where $A_1$ is defined in (2.44) and Z is a centered Gaussian random vector having a covariance matrix equal to identity. Hence, for all $j \in J$,

$$P\left( \left|\left[\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\}W_J\right]_j\right| \ge \frac{\sqrt{nq}}{5}\left( |\mathcal{B}_j| - \frac{\lambda}{2nq}\left[(\mathcal{C}_{J,J})^{-1}\mathrm{sign}(\mathcal{B}_J)\right]_j \right) \right) \le P\left( \left\|(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\right\|_2\, \|A_1\|_2\, \|Z\|_2 \ge \frac{\sqrt{nq}}{5}\left( M_3 q^{-c_2} - \frac{\lambda q|J|}{2nq M_2} \right) \right).$$

Moreover,

$$\left\|\left[(\Sigma^{-1} \otimes X^\top)\{(\Sigma^{1/2})^\top \otimes \mathrm{Id}_{\mathbb{R}^n}\}\right]_{J,\bullet}\right\|_2 = \left\|\left[\Sigma^{-1/2} \otimes X^\top\right]_{J,\bullet}\right\|_2 = \rho\left[\left\{\Sigma^{-1} \otimes (X^\top X)\right\}_{J,J}\right]^{1/2} \le \rho\left[\Sigma^{-1} \otimes (X^\top X)\right]^{1/2} \le \lambda_{\max}(\Sigma^{-1})^{1/2}\, \lambda_{\max}(X^\top X)^{1/2},$$

where the first inequality comes from Theorem 4.3.15 of Horn & Johnson (1986). Hence, by (A5) and (A7),

$$\|A_1\|_2 \le \frac{1}{\sqrt{nq}}\, \lambda_{\max}(\Sigma^{-1})^{1/2}\, \lambda_{\max}(X^\top X)^{1/2} = O_P(q^{-1/2}). \qquad (2.46)$$

The result follows from Markov's inequality and the first condition of (L).
Let us now study the fourth term in the right-hand side of (2.32). By (2.40), for all $j \in J$,

$$P\left( \left|\left[\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\}(\tilde{W}_J - W_J)\right]_j\right| \ge \frac{\sqrt{nq}}{5}\left( |\mathcal{B}_j| - \frac{\lambda}{2nq}\left[(\mathcal{C}_{J,J})^{-1}\mathrm{sign}(\mathcal{B}_J)\right]_j \right) \right) \le P\left( \left\|(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\right\|_2\, \|A\|_2\, \|Z\|_2 \ge \frac{\sqrt{nq}}{5}\left( M_3 q^{-c_2} - \frac{\lambda q|J|}{2nq M_2} \right) \right),$$

where A is defined in (2.38). By (2.29), (2.36) and (2.43), it is thus enough to prove that

$$\lim_{n \to \infty} P\left( \sum_{k=1}^{nq} Z_k^2 \ge (nq)\; n^2 q^{1-2c_2} \right) = 0.$$

The result follows from the Markov inequality and the fact that $c_2 < 1/2$.

Let us now study $P(\tilde{B}_n^c)$. By definition of $\tilde{B}_n$, we get that

$$P(\tilde{B}_n^c) = P\left( \left|\tilde{\mathcal{C}}_{J^c,J}(\tilde{\mathcal{C}}_{J,J})^{-1}\tilde{W}_J - \tilde{W}_{J^c}\right| \ge \frac{\lambda}{2\sqrt{nq}}\left( 1 - \left|\tilde{\mathcal{C}}_{J^c,J}(\tilde{\mathcal{C}}_{J,J})^{-1}\mathrm{sign}(\mathcal{B}_J)\right| \right) \right).$$
Observe that

$$\begin{aligned} \tilde{\mathcal{C}}_{J^c,J}(\tilde{\mathcal{C}}_{J,J})^{-1}\tilde{W}_J - \tilde{W}_{J^c} ={}& \mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1}W_J - W_{J^c} + \mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1}(\tilde{W}_J - W_J) \\ &+ \mathcal{C}_{J^c,J}\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\}W_J + \mathcal{C}_{J^c,J}\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\}(\tilde{W}_J - W_J) \\ &+ (\tilde{\mathcal{C}}_{J^c,J} - \mathcal{C}_{J^c,J})(\mathcal{C}_{J,J})^{-1}W_J + (\tilde{\mathcal{C}}_{J^c,J} - \mathcal{C}_{J^c,J})(\mathcal{C}_{J,J})^{-1}(\tilde{W}_J - W_J) + \cdots \end{aligned}$$

Moreover,

$$\cdots + (\tilde{\mathcal{C}}_{J^c,J} - \mathcal{C}_{J^c,J})(\mathcal{C}_{J,J})^{-1}\mathrm{sign}(\mathcal{B}_J) + (\tilde{\mathcal{C}}_{J^c,J} - \mathcal{C}_{J^c,J})\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\}\mathrm{sign}(\mathcal{B}_J). \qquad (2.47)$$

The first term in the right-hand side of (2.47) tends to 0 by (2.31). Let us now
study the second term of (2.47). By (2.40), we get that for all $j \in J^c$,

$$P\left( \left|\left[\mathcal{C}_{J^c,J}(\mathcal{C}_{J,J})^{-1}(\tilde{W}_J - W_J)\right]_j\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta \right) \le P\left( \|\mathcal{C}_{J^c,J}\|_2\, \|(\mathcal{C}_{J,J})^{-1}\|_2\, \|A\|_2\, \|Z\|_2 \ge \frac{\lambda}{24\sqrt{nq}}\,\eta \right).$$

Observe that

$$\|\mathcal{C}_{J^c,J}\|_2 = \rho\left( \frac{(\mathcal{X}_{\bullet,J^c})^\top \mathcal{X}_{\bullet,J}\,(\mathcal{X}_{\bullet,J})^\top \mathcal{X}_{\bullet,J^c}}{(nq)^2} \right)^{1/2} = \frac{1}{nq}\left\| (\mathcal{X}_{\bullet,J^c})^\top \mathcal{X}_{\bullet,J} \right\|_2 \le \frac{\|\mathcal{X}_{\bullet,J^c}\|_2\, \|\mathcal{X}_{\bullet,J}\|_2}{nq}$$
$$\le \rho\left( \frac{(\mathcal{X}_{\bullet,J^c})^\top\mathcal{X}_{\bullet,J^c}}{nq} \right)^{1/2} \rho\left( \frac{(\mathcal{X}_{\bullet,J})^\top\mathcal{X}_{\bullet,J}}{nq} \right)^{1/2} = \rho(\mathcal{C}_{J^cJ^c})^{1/2}\rho(\mathcal{C}_{J,J})^{1/2} \le \rho(\mathcal{C}) = \frac{\lambda_{\max}(\Sigma^{-1})}{q}\,\lambda_{\max}(X^\top X/n) = O_P(q^{-1}). \qquad (2.48)$$

In (2.48), the last inequality and the fourth equality come from Theorem 4.3.15 of Horn & Johnson (1986) and (2.35), respectively. The last equality comes from (A5) and (A7). By (2.34), (2.43) and (2.48), it is thus enough to prove that

$$\lim_{n\to\infty} P\left[ \sum_{k=1}^{nq} Z_k^2 \ge (nq)\left( q^{1/2}\, \frac{\lambda}{\sqrt{nq}} \right)^2 \right] = \lim_{n\to\infty} P\left\{ \sum_{k=1}^{nq} Z_k^2 \ge (nq)\left(\frac{\lambda}{\sqrt{n}}\right)^2 \right\} = 0,$$

which holds true by the second condition of (L) and Markov's inequality. Hence, the second term of (2.47) tends to zero as $n \to \infty$.
Let us now study the third term of (2.47). By (2.45), we get that for all $j \in J^c$,

$$P\left( \left|\left[\mathcal{C}_{J^c,J}\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\}W_J\right]_j\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta \right) \le P\left( \|\mathcal{C}_{J^c,J}\|_2\, \|(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\|_2\, \|A_1\|_2\, \|Z\|_2 \ge \frac{\lambda}{24\sqrt{nq}}\,\eta \right),$$

which holds true by the second condition of (L) and Markov's inequality. Hence, the third term of (2.47) tends to zero as $n \to \infty$.

Let us now study the fourth term of (2.47). By (2.40), it amounts to proving that

$$\lim_{n\to\infty} P\left( \|\mathcal{C}_{J^c,J}\|_2\, \|(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\|_2\, \|A\|_2\, \|Z\|_2 \ge \frac{\lambda}{24\sqrt{nq}}\,\eta \right) = 0,$$

which holds true by the second condition of (L). Hence, the fourth term of (2.47) tends to zero as $n \to \infty$.
Let us now study the fifth term of (2.47). By (2.45), proving that the fifth term of (2.47) tends to 0 amounts to proving that

$$\lim_{n\to\infty} P\left( \|\mathcal{C}_{J^c,J} - \tilde{\mathcal{C}}_{J^c,J}\|_2\, \|(\mathcal{C}_{J,J})^{-1}\|_2\, \|A_1\|_2\, \|Z\|_2 \ge \frac{\lambda}{24\sqrt{nq}}\,\eta \right) = 0.$$

Observe that

$$\|\mathcal{C}_{J^c,J} - \tilde{\mathcal{C}}_{J^c,J}\|_2 = \|(\mathcal{C} - \tilde{\mathcal{C}})_{J^c,J}\|_2 = \rho\{(\mathcal{C} - \tilde{\mathcal{C}})_{J^c,J}(\mathcal{C} - \tilde{\mathcal{C}})_{J,J^c}\}^{1/2} \le \|(\mathcal{C} - \tilde{\mathcal{C}})_{J^c,J}(\mathcal{C} - \tilde{\mathcal{C}})_{J,J^c}\|_\infty^{1/2}$$
$$\le \|(\mathcal{C} - \tilde{\mathcal{C}})(\mathcal{C} - \tilde{\mathcal{C}})\|_\infty^{1/2} \le \|\mathcal{C} - \tilde{\mathcal{C}}\|_\infty = \frac{1}{q}\|\Sigma^{-1} - \hat{\Sigma}^{-1}\|_\infty\, \|X^\top X/n\|_\infty = O_P\{q^{-1}(nq)^{-1/2}\}, \qquad (2.49)$$

which can be deduced from Markov's inequality and the second condition of (L).

Using similar arguments as those used for proving that the second, third and fourth terms of (2.47) tend to zero, we get that the sixth, seventh and eighth terms of (2.47) tend to zero, as $n \to \infty$, by replacing (2.48) with (2.49).

Let us now study the ninth term of (2.47). Replacing J by $J^c$ in (2.37), (2.38), (2.40), (2.41) and (2.43), in order to prove that the ninth term of (2.47) tends to 0 it is enough to prove that

$$\lim_{n\to\infty} P\left\{ \sum_{k=1}^{nq} Z_k^2 \ge nq\left(\frac{\lambda}{\sqrt{n}}\right)^2 \right\} = 0,$$
which holds using Markov’s inequality and the second condition of (L).
Let us now study the tenth term of (2.47). Using the same idea as the one used for proving (2.33), we get that

$$P\left( \left|\left[\mathcal{C}_{J^c,J}\{(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\}\,\mathrm{sign}(\mathcal{B}_J)\right]_j\right| \ge \frac{\eta}{12} \right) \le P\left( \sqrt{|J|}\; \|\mathcal{C}_{J^c,J}\|_2\; \|(\tilde{\mathcal{C}}_{J,J})^{-1} - (\mathcal{C}_{J,J})^{-1}\|_2 \ge \frac{\eta}{12} \right),$$

which tends to zero as $n \to \infty$ by (A3), (2.36), (2.48) and the fact that $c_1 < 1/2$.

Let us now study the eleventh term of (2.47). Using the same idea as the one used for proving (2.33), we get that
$\|S_{J^c,J}\|_\infty = \nu|\phi_1|$. Let $A = S_{J,J}$. Since $A = (a_{i,j})$ is a diagonally dominant matrix, then, by Theorem 1 of Varah (1975),

$$\|A^{-1}\|_\infty \le 1 \Big/ \min_k \Big( a_{k,k} - \sum_{\substack{1 \le j \le |J| \\ j \neq k}} |a_{k,j}| \Big).$$

If $k \in J$ then $k > p$ and $k < pq - p$. Thus, $a_{k,k} \ge \nu(1 + \phi_1^2)$. Hence, $\|A^{-1}\|_\infty \le 1/\{\nu(1 + \phi_1^2 - |\phi_1|)\}$ and

$$\|S_{J^c,J}(S_{J,J})^{-1}\|_\infty \le \|S_{J^c,J}\|_\infty\, \|(S_{J,J})^{-1}\|_\infty \le \frac{|\phi_1|}{1 + \phi_1^2 - |\phi_1|}.$$

Since $|\phi_1| < 1$, the strong Irrepresentability Condition holds when $|\phi_1| \le (1 - \eta)(1 + |\phi_1|^2 - |\phi_1|)$, which is true for a small enough η.
Proof of Proposition 2.7. Since $|\phi_1| < 1$, $\|\Sigma^{-1}\|_\infty \le |\phi_1| + |1 + \phi_1^2| \le 3$, which gives (A7) by Theorem 5.6.9 of Horn & Johnson (1986). Observe that

$$\|\Sigma\|_\infty \le \frac{1}{1-\phi_1^2}\left(1 + 2\sum_{h=1}^{q-1}|\phi_1|^h\right) \le \frac{1}{1-\phi_1^2}\left(1 + \frac{2}{1-|\phi_1|}\right) = \frac{3-|\phi_1|}{(1-|\phi_1|)(1-\phi_1^2)} \le \frac{3}{(1-|\phi_1|)(1-\phi_1^2)},$$

and

$$\|\Sigma^{-1} - \hat{\Sigma}^{-1}\|_\infty \le 2|\phi_1 - \hat{\phi}_1| + (\phi_1 - \hat{\phi}_1)^2.$$

Let us now check Assumption (A10) of Theorem 2.5. Since, by Theorem 5.6.9 of Horn & Johnson (1986), $\rho(\Sigma - \hat{\Sigma}) \le \|\Sigma - \hat{\Sigma}\|_\infty$, it is enough to prove that, as $n \to \infty$, $\|\Sigma - \hat{\Sigma}\|_\infty = O_P\{(nq)^{-1/2}\}$. Observe that
$$\|\Sigma - \hat{\Sigma}\|_\infty \le \left| \frac{1}{1-\phi_1^2} - \frac{1}{1-\hat{\phi}_1^2} \right| + 2\sum_{h=1}^{q-1}\left| \frac{\phi_1^h}{1-\phi_1^2} - \frac{\hat{\phi}_1^h}{1-\hat{\phi}_1^2} \right|$$
$$\le \frac{|(\phi_1-\hat{\phi}_1)(\phi_1+\hat{\phi}_1)|}{(1-\phi_1^2)(1-\hat{\phi}_1^2)}\left(1 + \frac{2}{1-|\phi_1|}\right) + 2\left( \frac{1}{|1-\phi_1^2|} + \frac{|(\phi_1-\hat{\phi}_1)(\phi_1+\hat{\phi}_1)|}{(1-\phi_1^2)(1-\hat{\phi}_1^2)} \right)\sum_{h=1}^{q-1}\left|\hat{\phi}_1^h - \phi_1^h\right|.$$

Moreover,

$$\sum_{h=1}^{q-1}\left|\hat{\phi}_1^h - \phi_1^h\right| \le \left|\hat{\phi}_1 - \phi_1\right| \sum_{h=1}^{q-1}\sum_{k=0}^{h-1}|\phi_1|^k |\hat{\phi}_1|^{h-k-1} \le \left|\hat{\phi}_1 - \phi_1\right| \frac{1 - |\hat{\phi}_1|^{q-1}}{1-|\hat{\phi}_1|}\, \frac{1 - |\phi_1|^{q-1}}{1-|\phi_1|} \le \left|\hat{\phi}_1 - \phi_1\right| \frac{1}{1-|\hat{\phi}_1|}\, \frac{1}{1-|\phi_1|}.$$
By (2.18),
$$\sum_{i=1}^{n}\sum_{\ell=2}^{q}\widehat{E}_{i,\ell}\widehat{E}_{i,\ell-1} = \sum_{\ell=2}^{q}(\widehat{E}_{\bullet,\ell})^\top\widehat{E}_{\bullet,\ell-1} = \sum_{\ell=2}^{q}(\Pi E_{\bullet,\ell})^\top(\Pi E_{\bullet,\ell-1}) = \sum_{\ell=2}^{q}(\phi_1\Pi E_{\bullet,\ell-1}+\Pi Z_{\bullet,\ell})^\top(\Pi E_{\bullet,\ell-1}) \qquad (2.51)
$$
$$= \phi_1\sum_{\ell=1}^{q-1}(\Pi E_{\bullet,\ell})^\top(\Pi E_{\bullet,\ell}) + \sum_{\ell=2}^{q}(\Pi Z_{\bullet,\ell})^\top(\Pi E_{\bullet,\ell-1}),
$$
where (2.51) comes from the definition of $(E_{i,t})$. Hence,
$$\sqrt{nq}\,(\widehat{\phi}_1-\phi_1) = \frac{\frac{1}{\sqrt{nq}}\sum_{\ell=2}^{q}(\Pi Z_{\bullet,\ell})^\top(\Pi E_{\bullet,\ell-1})}{\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\widehat{E}_{i,\ell}^{\,2}}.
$$
In order to prove that $\sqrt{nq}\,(\widehat{\phi}_1-\phi_1) = O_P(1)$, it is enough to prove that
$$\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}E_{i,\ell}^2 - \frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\widehat{E}_{i,\ell}^{\,2} = o_P(1), \quad \text{as } n\to\infty, \qquad (2.52)
$$
and that
$$\sum_{\ell=2}^{q}(\Pi Z_{\bullet,\ell})^\top(\Pi E_{\bullet,\ell-1}) = O_P(\sqrt{nq}\,). \qquad (2.53)
$$
Let us first prove (2.52). By (2.19), $\widehat{E} = (\mathrm{Id}_{\mathbb{R}^q} \otimes \Pi)E \equiv AE$. Note that $\mathrm{Cov}(\widehat{E}) = A(\Sigma \otimes \mathrm{Id}_{\mathbb{R}^n})A^\top = \Sigma \otimes \Pi$. Hence, for all $i \in \{1, \ldots, n\}$, $\mathrm{Var}(\widehat{E}_i) \le \lambda_{\max}(\Sigma)$. Since the covariance matrix of E is equal to $\Sigma \otimes \mathrm{Id}_{\mathbb{R}^n}$, it follows that, for all i, $\mathrm{Var}(E_i) \le \lambda_{\max}(\Sigma)$. By Markov's inequality,
$$\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}E_{i,\ell}^2 - \frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\widehat{E}_{i,\ell}^{\,2} = \frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q}E_{i,\ell}^2 - \frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q}\widehat{E}_{i,\ell}^{\,2} + o_P(1) = \frac{1}{nq}\left(\|E\|_2^2 - \|\widehat{E}\|_2^2\right) + o_P(1).
$$
Observe that
$$E(\widetilde{E}_i^2) = \mathrm{Cov}(\widetilde{E})_{i,i} \le \lambda_{\max}(\Sigma),
$$
Moreover,
$$E\left[\left\{\sum_{\ell=2}^{q}(\Pi Z_{\bullet,\ell})^\top(\Pi E_{\bullet,\ell-1})\right\}^{2}\right] = E\left[\left\{\sum_{\ell=2}^{q}\sum_{i=1}^{n}\left(\sum_{k=1}^{n}\Pi_{i,k}Z_{k,\ell}\right)\left(\sum_{j=1}^{n}\Pi_{i,j}E_{j,\ell-1}\right)\right\}^{2}\right]
$$
$$= \sum_{2\le \ell,\ell'\le q}\;\sum_{1\le i,j,k,i',j',k'\le n}\Pi_{i,k}\Pi_{i',k'}\Pi_{i,j}\Pi_{i',j'}\,E(Z_{k,\ell}Z_{k',\ell'}E_{j,\ell-1}E_{j',\ell'-1})
$$
$$= \sum_{2\le \ell,\ell'\le q}\;\sum_{1\le i,j,k,i',j',k'\le n}\;\sum_{r,s\ge 0}\Pi_{i,k}\Pi_{i',k'}\Pi_{i,j}\Pi_{i',j'}\,\phi_1^{r}\phi_1^{s}\,E(Z_{k,\ell}Z_{k',\ell'}Z_{j,\ell-1-r}Z_{j',\ell'-1-s}),
$$
since the $(E_{i,t})$ are AR(1) processes with $|\phi_1| < 1$. Note that the latter quantity is equal to
$$\frac{q\sigma^{4}}{1-\phi_1^{2}}\,\mathrm{tr}(\Pi) \le \frac{nq\sigma^{4}}{1-\phi_1^{2}},
$$
where tr(Π) denotes the trace of Π, which concludes the proof of (2.53) by Markov's inequality.
Proof. Observe that $A^\top\Pi A = A^\top\Pi^\top\Pi A = (\Pi A)^\top(\Pi A)$, since Π is an orthogonal projection matrix. Moreover, $(A^\top\Pi A)_{jj} = e_j^\top(\Pi A)^\top(\Pi A)e_j \ge 0$, since $(\Pi A)^\top(\Pi A)$ is a positive semidefinite symmetric matrix, where $e_j$ is a vector containing null entries except the jth entry, which is equal to 1.
Lemma 2.10. Assume that $(E_{1,t})_t, \ldots, (E_{n,t})_t$ are independent AR(1) processes such that, for all $i \in \{1, \ldots, n\}$, $E_{i,t} - \phi_1 E_{i,t-1} = Z_{i,t}$, where the $Z_{i,t}$'s are zero-mean i.i.d. Gaussian random variables with variance $\sigma^2$ and $|\phi_1| < 1$. Then, as $n \to \infty$,
$$\frac{1}{nq_n}\sum_{i=1}^{n}\sum_{\ell=1}^{q_n-1}E_{i,\ell}^2 \xrightarrow{P} \frac{\sigma^2}{1-\phi_1^2}.
$$
Proof. In the following, for notational simplicity, $q = q_n$. Since $E(E_{i,\ell}^2) = \sigma^2/(1-\phi_1^2)$,
it is enough to prove that, as $n \to \infty$,
$$\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\left\{E_{i,\ell}^2 - E(E_{i,\ell}^2)\right\} \xrightarrow{P} 0.
$$
Since
$$E_{i,\ell}^2 = \left(\sum_{j\ge 0}\phi_1^{j} Z_{i,\ell-j}\right)^2 = \sum_{j,j'\ge 0}\phi_1^{j}\phi_1^{j'}Z_{i,\ell-j}Z_{i,\ell-j'},
$$
$$\mathrm{Var}\left[\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\left\{E_{i,\ell}^2 - E(E_{i,\ell}^2)\right\}\right] = \frac{1}{(nq)^2}\sum_{i=1}^{n}\sum_{1\le \ell,\ell'\le q-1}\mathrm{Cov}(E_{i,\ell}^2;\, E_{i,\ell'}^2)
$$
$$= \frac{1}{(nq)^2}\sum_{i=1}^{n}\sum_{1\le \ell,\ell'\le q-1}\sum_{j,j'\ge 0}\sum_{k,k'\ge 0}\phi_1^{j}\phi_1^{j'}\phi_1^{k}\phi_1^{k'}\,\mathrm{Cov}(Z_{i,\ell-j}Z_{i,\ell-j'};\, Z_{i,\ell'-k}Z_{i,\ell'-k'}). \qquad (2.54)
$$
Chapter 3

A variable selection approach in the multivariate linear model: An application to LC-MS metabolomics data

Scientific production

Abstract
Omic data are characterized by the presence of strong dependence structures that re-
sult either from data acquisition or from some underlying biological processes. Applying
statistical procedures that do not adjust the variable selection step to the dependence
pattern may result in a loss of power and the selection of spurious variables. The goal
of this paper is to propose a variable selection procedure within the multivariate linear
model framework that accounts for the dependence between the multiple responses. We
shall focus on a specific type of dependence which consists in assuming that the responses
of a given individual can be modelled as a time series. We propose a novel Lasso-based
approach within the framework of the multivariate linear model taking into account
the dependence structure by using different types of stationary processes covariance
structures for the random error matrix. Our numerical experiments show that including
the estimation of the covariance matrix of the random error matrix in the Lasso criterion
dramatically improves the variable selection performance. Our approach is successfully
applied to an untargeted LC-MS (Liquid Chromatography-Mass Spectrometry) data set
made of African copals samples. Our methodology is implemented in the R package
MultiVarSel which is available from the Comprehensive R Archive Network (CRAN).
3.1 Introduction
Untargeted metabolomic approaches yield data sets containing a large number of features (hundreds or thousands) that can explain a difference between two or more
populations (see Zhang et al., 2012). It is well-known in the untargeted LC-MS
data analysis that the identification of metabolites discriminating these populations
remains a major bottleneck and therefore the selection of relevant features (metabo-
lites) is a crucial step, as explained in Verdegem et al. (2016). Our goal is to tackle
the task of feature selection by taking advantage of the specificities of the LC-MS
spectra.
We consider a typical untargeted metabolomic experiment where LC-MS spectra
(intensity vs m/z) are obtained from n samples, resulting in an n × q data matrix
where the q columns are ordered according to their m/z ratio. Note that the ab-
breviation m/z represents the quantity formed by dividing the ratio of the mass of
an ion to the unified atomic mass unit, by its charge number (regardless of sign).
Figure 3.1 displays an example of such a spectrum. It has to be noticed that the
data were first pre-processed using the methodology described in Section 3.4.1. We
further assume that the n samples are collected under C conditions and denote by $n_c$ the number of samples from Condition c, hence $\sum_{c} n_c = n$. Multivariate ANOVA
(MANOVA, see e.g. Mardia et al., 1980; Muller & Stewart, 2006) provides a natu-
ral framework to analyze such a data set. Denoting Y c,r the q-dimensional vector
corresponding to the spectrum from the rth replicate in Condition c, the MANOVA
model assumes that
$$Y_{c,r} = \mu_c + E_{c,r}, \qquad (3.1)
$$
follows:
$$Y = XB + E, \qquad (3.2)
$$
the problem of determining which metabolites are relevant boils down to finding the
non null coefficients in the matrix B and hence can be seen as a variable selection
problem in the multivariate linear model. Several approaches can be considered for
solving this task: either a posteriori methods such as classical statistical tests in
ANOVA models (see Mardia et al., 1980; Faraway, 2004) or methods embedding the
variable selection such as Lasso-type methodologies (Tibshirani, 1996). However,
a naive application of such approaches does not take into account the potential
dependence between the different columns of Y , which may affect the identification
of the relevant features. This drawback will be illustrated in Section 3.3.
Different supervised machine learning approaches have been used to analyze
“omics” data during the last few years (see Saccenti et al., 2013; Ren et al., 2015;
Boccard & Rudaz, 2016; Zhang et al., 2017). Among them, in metabolomics, the
most popular is the partial least squares-discriminant analysis (PLS-DA) which
has recently been extended to sPLS-DA (sparse partial least squares-discriminant
analysis) by Lê Cao et al. (2011) to include a variable selection step.
The originality of our approach lies in the modeling of the dependence that
exists among the columns of Y which comes from the fact that usually biomarkers
share biosynthetic pathways (Audoin et al., 2014). To account for this dependence,
we assume that the samples are all independent, namely, all the rows of E are
independent and for each sample i, the noise vector E i has a multivariate Gaussian
distribution:
$$E_i = (E_{i,1}, \ldots, E_{i,q}) \sim \mathcal{N}(0, \Sigma_q), \qquad (3.3)
$$
where Σq denotes the covariance matrix. The simplest assumption regarding the
covariance matrix is Σq = σ 2 I q , where I q denotes the q × q identity matrix. In
this case the different columns of Y are assumed to be independent. The other
extreme assumption consists in letting Σq free, assuming no specific form for this
dependence. However, in such a situation, q(q+1)/2 parameters should be estimated
which is not possible when n < q, which is the most standard case. Our approach
lies in between, assuming that some dependence exists but that it has a specific
structure. The form we consider is motivated by the nature of LC-MS spectra,
which can be seen as random functions of the m/z ratio. This suggests to consider
Figure 3.1 – An example of a LC-MS spectrum (an instance of $Y_{c,r}$) (top), the same spectrum centered and normalized (middle) and its empirical autocorrelation function (bottom).
Our methodology is described in Section 3.2. Some numerical experiments on synthetic data are provided in Section 3.3. Finally, an application to a metabolomic data set made of African copals samples is given in Section 3.4.
— Third step: Thanks to $\widehat{\Sigma}_q$, transforming the data in order to remove the dependence between the columns of Y. Such a transformation will be called "whitening" hereafter.
— Fourth step: Applying to the transformed observations the Lasso approach described in Section 3.2.2.
The first step provides a first estimate $\widetilde{B}$ of B. An estimate of E is then defined as
$$\widehat{E} = Y - X\widetilde{B}. \qquad (3.4)
$$
The second step provides an estimator $\widehat{\Sigma}_q$ of $\Sigma_q$ from $\widehat{E}$. Right-multiplying Model (3.2) by $\Sigma_q^{-1/2}$, namely considering
$$Y\Sigma_q^{-1/2} = XB\,\Sigma_q^{-1/2} + E\,\Sigma_q^{-1/2}, \qquad (3.5)
$$
removes the dependence between the columns of Y. Indeed, the covariance matrix of each row of $E\Sigma_q^{-1/2}$ is now equal to the identity matrix. Such a procedure will be called "whitening" hereafter.
Parametric dependence
The simplest model among the parametric models is the autoregressive process of order 1, denoted AR(1). More precisely, for each i in {1, . . . , n}, $E_{i,t}$ satisfies the following equation:
$$E_{i,t} - \phi_1 E_{i,t-1} = Z_{i,t}, \qquad (3.6)
$$
where $|\phi_1| < 1$ and $WN(0, \sigma^2)$ denotes a zero-mean white noise process of variance $\sigma^2$, defined as follows:
$$Z_t \sim WN(0,\sigma^2) \quad \text{if} \quad E(Z_t)=0,\;\; E(Z_tZ_{t'})=0 \text{ for } t\neq t', \;\text{and}\;\; E(Z_t^2)=\sigma^2. \qquad (3.7)
$$
Note that the closer the parameter $\phi_1$ is to one, the stronger the dependence between the $E_{i,t}$'s.
In this case, the inverse of the square root of the covariance matrix $\Sigma_q$ of $(E_{i,1}, \ldots, E_{i,q})$ has a simple closed-form expression given by
$$\Sigma_q^{-1/2} = \begin{pmatrix} \sqrt{1-\phi_1^2} & -\phi_1 & 0 & \cdots & 0 \\ 0 & 1 & -\phi_1 & \cdots & 0 \\ 0 & 0 & \ddots & \ddots & \vdots \\ \vdots & \vdots & & \ddots & -\phi_1 \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix}. \qquad (3.8)
$$
where $\widehat{\phi}_{1,i}$ denotes the estimator of $\phi_1$ obtained by the classical Yule-Walker equations from $(\widehat{E}_{i,1}, \ldots, \widehat{E}_{i,q})$; see Brockwell & Davis (1991) for more details.
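As an illustration, here is a minimal R sketch of this AR(1) whitening, assuming a residual matrix E_hat (n × q) computed as in (3.4) is available; the function name is ours and this is not the internal code of the MultiVarSel package:

```r
# Minimal sketch (not the MultiVarSel internals): AR(1) whitening of the
# observations, assuming E_hat is the n x q residual matrix from (3.4).
ar1_sqrt_inv <- function(q, phi1) {
  # Square-root inverse of the AR(1) covariance matrix, as in (3.8)
  M <- diag(q)
  M[1, 1] <- sqrt(1 - phi1^2)
  for (t in 1:(q - 1)) M[t, t + 1] <- -phi1
  M
}
# phi1 estimated by averaging the lag-1 autocorrelations over the n rows
phi1_hat <- mean(apply(E_hat, 1, function(e) acf(e, lag.max = 1, plot = FALSE)$acf[2]))
Y_white  <- Y %*% ar1_sqrt_inv(ncol(Y), phi1_hat)  # whitened observations
```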
More generally, it is also possible to have access to $\Sigma_q^{-1/2}$ for more complex processes such as the ARMA(p, q) process defined as follows: for each i in {1, . . . , n},
Nonparametric dependence case
where $\widehat{\gamma}_i(h)$ is the standard autocovariance estimator of $\gamma_i(h) = E(E_{i,t}E_{i,t+h})$, for all t. Usually, $\widehat{\gamma}_i(h)$ is referred to as the empirical autocovariance of the $\widehat{E}_{i,t}$'s at lag h (i.e. the empirical covariance between $(\widehat{E}_{i,1}, \ldots, \widehat{E}_{i,n-h})$ and $(\widehat{E}_{i,h+1}, \ldots, \widehat{E}_{i,n})$). For a definition of the standard autocovariance estimator we refer the reader to Chapter 7 of Brockwell & Davis (1991). The matrix $\widehat{\Sigma}_q^{-1/2}$ is then obtained by inverting the Cholesky factor of $\widehat{\Sigma}_q$.
In order to decide which dependence modeling better fits the data at hand, we propose hereafter a statistical test. If the whitening modeling used is well chosen, then each row of $\widetilde{E} = \widehat{E}\,\widehat{\Sigma}_q^{-1/2}$ should be a white noise as defined in (3.7), where $\widehat{E}$ is defined in (3.4).
One of the most popular approaches for testing whether a random process is a white noise or not is the Portmanteau test, which is based on the Bartlett theorem (for further details we refer the reader to Brockwell & Davis, 1991, Theorem 7.2.2). By this theorem, we get that under the null hypothesis (H0): "For each i in {1, . . . , n}, $(\widetilde{E}_{i,1}, \ldots, \widetilde{E}_{i,q})$ is a white noise",
$$q\sum_{h=1}^{H}\widehat{\rho}_i(h)^2 \approx \chi^2(H), \quad \text{as } q\to\infty, \qquad (3.11)
$$
for each i in {1, . . . , n}, where $\widehat{\rho}_i(h)$ denotes the empirical autocorrelation of $(\widetilde{E}_{i,1}, \ldots, \widetilde{E}_{i,q})$ at lag h and $\chi^2(H)$ denotes the chi-squared distribution with H degrees of freedom. Thus, by (3.11), we have at our disposal a p-value for each i in {1, . . . , n} that we denote by $\mathrm{Pval}_i$. In order to have a single p-value instead of n, we shall consider
$$\sum_{i=1}^{n}\sum_{h=1}^{H} q\,\widehat{\rho}_i(h)^2 \approx \chi^2(nH), \quad \text{as } q\to\infty, \qquad (3.12)
$$
where the approximation comes from the fact that the rows of E e are assumed to
be independent. Equation (3.12) thus provides a p-value: Pval. Hence, if Pval < α,
the null hypothesis (H0 ) is rejected at the level α, where α is usually equal to 5%
and a large value of Pval indicates that the modeling for the dependence structure
of E is well chosen.
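A minimal R sketch of the global test in (3.12), assuming the whitened residuals are stored in a matrix E_tilde; the function name and the choice H = 20 are ours:

```r
# Minimal sketch of the global Portmanteau-type test (3.12), assuming
# E_tilde is the n x q matrix of whitened residuals.
portmanteau_pval <- function(E_tilde, H = 20) {
  n <- nrow(E_tilde)
  q <- ncol(E_tilde)
  stat <- sum(apply(E_tilde, 1, function(e) {
    rho <- acf(e, lag.max = H, plot = FALSE)$acf[2:(H + 1)]  # lags 1..H
    q * sum(rho^2)
  }))
  pchisq(stat, df = n * H, lower.tail = FALSE)  # approximate chi^2(nH) p-value
}
```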
3.2.2 Estimation of B
Let us first explain briefly the usual framework in which the Lasso approach is
used. We consider a high-dimensional linear model of the following form
$$\mathcal{Y} = \mathcal{X}\mathcal{B} + \mathcal{E}, \qquad (3.13)
$$
where Y, B and E are vectors. Note that, in high-dimensional linear models, the
matrix X has usually more columns than rows which means that the number of
variables is larger than the number of observations but B is usually a sparse vector,
namely it contains a lot of null components.
In such models a very popular approach initially proposed by Tibshirani (1996) is the Least Absolute Shrinkage and Selection Operator (LASSO), which is defined as follows for a positive λ:
$$\widehat{\mathcal{B}}(\lambda) = \underset{\mathcal{B}}{\mathrm{Argmin}}\left\{\|\mathcal{Y}-\mathcal{X}\mathcal{B}\|_2^2 + \lambda\|\mathcal{B}\|_1\right\}.
$$
In our framework, the whitening step means that $\Sigma_q^{-1/2}$ in (3.5) is replaced by $\widehat{\Sigma}_q^{-1/2}$. By using the same vectorization trick that allows us to transform Model (3.2) into Model (3.13), the Lasso criterion can be applied to the vectorized version of Model (3.5) where $\Sigma_q^{-1/2}$ is replaced by $\widehat{\Sigma}_q^{-1/2}$. The specific expressions
of Y, X , B and E are given in Appendix B.
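A minimal R sketch of this vectorization and whitening, using glmnet as one possible Lasso solver; Sigma_half_inv stands for the estimated $\widehat{\Sigma}_q^{-1/2}$ and MultiVarSel may proceed differently internally:

```r
# Minimal sketch: vectorized whitened Lasso, assuming Y (n x q), X (n x p)
# and Sigma_half_inv (q x q) are available.
library(glmnet)
Y_w   <- Y %*% Sigma_half_inv                     # whitened responses
y_vec <- as.vector(Y_w)                           # Y = vec(Y Sigma_q^{-1/2})
X_vec <- kronecker(t(Sigma_half_inv), X)          # X = (Sigma_q^{-1/2})' %x% X, cf. Appendix B
fit   <- glmnet(X_vec, y_vec, intercept = FALSE)  # Lasso path over lambda
```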
Note that this idea of “whitening” the observations has also been proposed by
Rothman et al. (2010) where the estimation of Σq and B is performed simulta-
neously. An implementation is available in the R package MRCE. In our approach,
Σq is estimated first and then its estimator is used in (3.5) instead of Σq before
applying the Lasso criterion. Hence, our method can be seen as a variant of the
MRCE method in which Σq is estimated beforehand. Moreover, after some numer-
ical experiments, we observed that for the values of n and q that we aim at using,
the computational burden of the approach designed by Rothman et al. (2010) is
too high for addressing our datasets for fixed regularization parameters, contrary
to ours. In addition, in practical situations, the regularization parameters of the
MRCE approach have to be tuned. As a consequence, we have not been able to use
the MRCE approach for the purpose we consider here.
3.3 Simulation study
The goal of this section is to assess the statistical performance of our methodology
implemented in the R package MultiVarSel. In order to emphasize the benefits
of using a whitening approach from the variable selection point of view, we shall
first compare our approach to standard methodologies. Then, we shall analyze
the performance of our statistical test for choosing the best dependence modeling.
Finally, we shall investigate the performance of our model selection criterion.
To assess the performance of the different methodologies, we generate observa-
tions Y according to Model (3.2) with q = 1000, p = 3, n = 30 (n1 = 9, n2 = 8
and n3 = 13) and different dependence modelings, namely different matrices Σq
corresponding to the AR(1) model described in (3.6) with σ = 1 and φ1 = 0.7 or
0.9. Note that the values of the parameters p, q and n are chosen in order to match
the metabolomic data analyzed in Section 3.4.
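A minimal R sketch of this simulation design (the seed-free helpers below are our own choices; arima.sim from the stats package generates the AR(1) errors):

```r
# Minimal sketch of the simulation design: n = 30, q = 1000, p = 3,
# AR(1) errors with phi1 = 0.9 and sparsity level s = 0.01.
n <- 30; q <- 1000; p <- 3; s <- 0.01
X <- model.matrix(~ 0 + factor(rep(1:3, c(9, 8, 13))))   # one-way design matrix
B <- matrix(0, p, q)
B[sample(p * q, round(s * p * q))] <- 1                  # non null coefficients
E <- t(replicate(n, as.numeric(arima.sim(list(ar = 0.9), n = q))))
Y <- X %*% B + E                                         # Model (3.2)
```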
We shall also investigate the effect of the sparsity and of the signal to noise ratio (SNR). The sparsity level s corresponds to the proportion of non null elements in B. Different signal to noise ratios are obtained by multiplying B in (3.2) by a coefficient κ.
Figure 3.2 – Means of the ROC curves obtained from 200 replications for the different
methodologies in the AR(1) dependence modeling; κ is linked to the signal to noise
ratio (first row: κ = 1, second row κ = 2); φ1 is the correlation level in the AR(1)
and s the sparsity level (i.e. the fraction of nonzero elements in B).
The goal of this section is to compare the performance of our different whitening
strategies to standard existing methodologies. More precisely, we shall compare
Figure 3.3 – Means of the precision-recall curves obtained from 200 replications for
the different methodologies in the AR(1) dependence modeling; κ is linked to the
signal to noise ratio (first row: κ = 1, second row κ = 2); φ1 is the correlation level
in the AR(1) and s the sparsity level (i.e. the fraction of nonzero elements in B).
our approaches to the classical ANOVA method (denoted ANOVA), the standard
Lasso (denoted Lasso), namely the Lasso approach without the whitening step
and to sPLSDA (Lê Cao et al., 2011), implemented in the mixOmics R package
and also in MetaboAnalyst, which is widely used in the metabolomics field. By
ANOVA, we mean the classical one-way ANOVA applied to each column of the
observations matrix Y without taking the dependence into account. Our different
whitening approaches (described in Sections 3.2.1 and 3.2.1) are denoted by AR1
and Nonparam. These methods are also compared to the Oracle approach where
the matrix Σq is known, which is never the case in practical situations.
We shall use three classical criteria for comparison: ROC curves, AUC (Area
Under the ROC Curve) and Precision-Recall (PR) curves. ROC curves display the
true positive rates (TPR) as a function of the false positive rates (FPR) and the
closer to one the AUC the better the methodology. PR curves display the Precision
as a function of the Recall. Since the features selected by sPLSDA are not assigned
to a given condition c, we shall consider that as soon as a feature is selected it is a
true positive, which gives a great advantage to sPLSDA.
We can see from Figures 3.2, 3.3 and Table 3.1 that in the case of an AR(1) depen-
dence, taking into account this dependence provides better results than sPLSDA and
than approaches that consider the columns of the matrix E as independent. More-
over, we observe that the performance of the nonparametric modeling is on a par with that of the parametric and the oracle ones. We also note that the larger the sparsity level s, the smaller the difference in performance between the approaches.
Table 3.1 – AUC for the different methodologies in the AR(1) dependence modeling.

SNR   φ1    s      Lasso   ANOVA   AR1    Nonpar   Oracle   sPLSDA
1     0.7   0.01   0.78    0.78    0.83   0.84     0.84     0.73
1     0.7   0.3    0.74    0.74    0.80   0.80     0.80     0.72
1     0.9   0.01   0.63    0.64    0.83   0.83     0.83     0.58
1     0.9   0.3    0.63    0.64    0.77   0.77     0.77     0.61
2     0.7   0.01   0.91    0.91    0.95   0.95     0.95     0.86
2     0.7   0.3    0.85    0.85    0.88   0.88     0.88     0.84
2     0.9   0.01   0.77    0.77    0.91   0.91     0.91     0.72
2     0.9   0.3    0.75    0.76    0.86   0.86     0.86     0.74
As expected, the larger the signal to noise ratio κ the better the performance of the
different methodologies. We also conducted numerical experiments in a balanced
one-way ANOVA framework. Since the conclusions are similar, we did not report
the results here but they are available upon request.
The goal of this section is to assess the performance of the whitening test pro-
posed in Section 3.2.1. We generated observations Y as described at the beginning of Section 3.3, with AR(1) dependence, a sparsity level s = 0.01 and SNR such that κ = 1. The corresponding results are displayed in Figure 3.4.
Figure 3.4 – Means and standard deviations of the p-values of the test described in
Section 3.2.1 of the main paper for the different approaches in the AR(1) dependence
modeling when φ1 = 0.7 (left) and φ1 = 0.9 (right).
We observe that our test behaves properly: it provides p-values close to zero in
the case where no whitening strategy is used (Lasso) and that when one of the
proposed whitening approaches is used the p-values are larger than 0.7.
3.3.3 Choice of the model selection criterion
We investigate here the performance of our model selection criterion described
in Section 3.2.2. Figure 3.5 displays the TPR and the FPR for different values N of
the sampling replicates and different thresholds. We can see from this figure that taking N larger than 1000 and a threshold of 0.999 ensures a small false positive rate and a large true positive rate.
Figure 3.5 – TPR (top) and FPR (bottom) for the different thresholds (0.99, 0.999 and 1), numbers of sampling replicates N ∈ {10, 100, 1000, 5000} and φ1 ∈ {0.7, 0.9}.
Bullets (’•’) in Figure 3.6 show the positions of the variables selected by our four-
step approach for two possible thresholds (0.999 and 1) from N = 1000 replications.
The positions of the non null coefficients in B are displayed with '+'. Here Y is generated with the parameters described at the beginning of Section 3.3 in the case of an AR(1) dependence with φ1 = 0.9 and κ = 10. We observe from this figure that the positions of the non null coefficients are recovered for both thresholds. However, the performance is slightly better when the threshold is equal to 0.999.
Figure 3.6 – Positions of the variables selected by our approach (’•’) when κ = 10.
Values on the y-axis correspond to the 3 conditions. The results obtained when the
threshold is equal to 0.999 (resp. 1) are on the left (resp. on the right). The size of
the bullets is all the more large that the selection frequency is high.
Figure 3.7 – Times in seconds needed by our approach for values of q up to 5000.
We can see from this figure that the computational burden of MultiVarSel is very low and that it takes only a few minutes to analyze matrices having 5000 columns.
We consider the modeling proposed in Equations (3.1) and (3.2) with C = 3 conditions: CE, CW
and TE such that nCE = 9, nCW = 8 and nTE = 13. Our goal is to identify the
most important features (the m/z values) for distinguishing the different conditions.
In this section, we also compare the performance of our method with those of other
techniques which are widely used in metabolomics.
Missing data imputation was performed using the KNN algorithm described in Hrydziuszko & Viant
(2012). Subsequently, the spectra were normalized to equalize signal intensities to
the median profile in order to reduce any variance arising from differing dilutions
of the biological extracts and probabilistic quotient normalization (PQN) was used,
see Dieterle et al. (2006) for further details. In order to reduce the size of the
data matrix which contains 6327 metabolites, selection of the adducts of interest
[M+H]+ was then performed using the CAMERA package of Kuhl et al. (2012). An n × q matrix Y was then obtained with q = 1019 and submitted to the statistical
analyses.
— Fourth step: The Lasso approach described in Section 3.2.2 was then applied to the whitened observations. The stability selection is used with N = 1000 replications and a threshold equal to 0.999; a minimal sketch is given below.
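One possible R sketch of such a stability selection step, assuming y_vec and X_vec are the vectorized whitened data from the previous step; the sampling scheme is an illustration and may differ from the exact MultiVarSel implementation:

```r
# Minimal sketch of stability selection: Lasso fits on N random halves of
# the data; a variable is kept if its selection frequency exceeds the threshold.
library(glmnet)
stability_selection <- function(X_vec, y_vec, N = 1000, threshold = 0.999) {
  p <- ncol(X_vec)
  freq <- numeric(p)
  for (b in 1:N) {
    idx  <- sample(length(y_vec), floor(length(y_vec) / 2))
    cv   <- cv.glmnet(X_vec[idx, ], y_vec[idx], intercept = FALSE)
    beta <- as.vector(coef(cv, s = "lambda.min"))[-1]   # drop the intercept
    freq <- freq + (beta != 0)
  }
  which(freq / N >= threshold)   # indices of the stable variables
}
```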
Figure 3.8 – Venn diagram of the features selected for each condition by
MultiVarSel.
Figure 3.8 displays the Venn diagram of the features (m/z values) selected for
each condition CE, TE and CW. Among the 1019 features, 98 features have been
selected by MultiVarSel: 77 have been selected for Condition TE, 28 for Condition
CW and 5 for Condition CE. Note that there were no features selected for all the
conditions, 10 for both TE and CW and 2 for both CW and CE.
The goal of this section is to compare our approach with the sparse partial least
square discriminant analysis (sPLS-DA) which is classically used in metabolomics.
Additional simulations
Since in the case of real data, the position of the relevant features is of course un-
known, we propose the following additional simulations in order to further compare
these two approaches. We start by applying the first step of our approach in order to get $\widehat{E}$. Then, we perform M random samplings with replacement among the rows of $\widehat{E}$. Let E* denote one of them; we then generate a new observation matrix Y* = X*B + E*, where X* is the same as X except that its rows are permuted in order to ensure a correspondence between the rows of E* and X*. The matrix B is
obtained as in Section 3.3 with s = 0.01 and κ = 0.5 and 1. ROC curves averaged
over M = 50 random samplings are displayed in Figure 3.9. We can see from this
figure that our approach outperforms the classical ones. Other values of s and κ
have been tested. The corresponding results are not reported here but available
upon request.
Figure 3.9 – Means of the ROC curves obtained by MultiVarSel, Lasso and sPLS-
DA.
As recommended by Lê Cao et al. (2011), we used two components for sPLS-DA. Moreover, in order to make sPLS-DA comparable with our approach, 49 variables are kept for each component. However, as explained in Section 3.3, the main difference between our approach and sPLSDA is that the features selected by sPLSDA are not assigned to a given condition c, and are thus less interpretable.
Figure 3.11 displays the location of the features (m/z values) selected by our
approach and sPLS-DA. We can see from this figure that the features selected
for the condition TE are mainly located between 400 and 500 m/z whereas those
selected for the condition CE are around 600 m/z. The features selected by the
first component of the sPLS-DA are also mainly located between 400 and 500 m/z.
However, as previously explained, the features selected by sPLSDA are assigned to
a component built by the method and not to a condition of the experimental design.
Venn diagrams comparing the features selected by both methods are available in
Figure 3.10. We observe from these Venn diagrams that the features selected in
each component of sPLS-DA do not characterize the conditions of the MANOVA
model contrary to ours.
3.5 Conclusion
In this paper, we proposed a novel approach for feature selection taking into
account the dependence that may exist between the columns of the observations
matrix. Our approach is implemented in the R package MultiVarSel which is
available from The Comprehensive R Archive Network (CRAN). We have shown
that our method has two main features. Firstly, it is very efficient for selecting a
restricted number of stable features characterizing each condition. Secondly, its very
low computational burden makes its use possible on very large LC-MS metabolomics
data.
Acknowledgment: This project has been funded by La mission pour l’interdisciplinarité
du CNRS in the frame of the DEFI ENVIROMICS (project AREA). The authors
thank the Musée François Tillequin for providing the samples from the Guibourt
Collection.
Appendix A
Let vec(A) denote the vectorization of the matrix A formed by stacking the
columns of A into a single column vector. Let us apply the vec operator to Model
(3.2), then
$$\mathrm{vec}(Y) = \mathrm{vec}(XB + E) = \mathrm{vec}(XB) + \mathrm{vec}(E) = (I_q \otimes X)\,\mathrm{vec}(B) + \mathrm{vec}(E),
$$
where we used that
$$\mathrm{vec}(AXB) = (B' \otimes A)\,\mathrm{vec}(X),
$$
see (Mardia et al., 1979, Appendix A.2.5). In this equation, B' denotes the transpose of the matrix B. Thus,
$$\mathcal{Y} = \mathcal{X}\mathcal{B} + \mathcal{E},
$$
where $\mathcal{X} = I_q \otimes X$ and $\mathcal{Y}$, $\mathcal{B}$ and $\mathcal{E}$ are vectors of size nq, pq and nq, respectively.
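The vectorization identity can be checked numerically in R; the dimensions below are arbitrary choices of ours:

```r
# Quick numerical check of vec(XB) = (I_q %x% X) vec(B)
n <- 4; p <- 2; q <- 3
X <- matrix(rnorm(n * p), n, p)
B <- matrix(rnorm(p * q), p, q)
all.equal(as.vector(X %*% B),
          as.vector((diag(q) %x% X) %*% as.vector(B)))  # TRUE
```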
Appendix B
Let us apply the vec operator to Model (3.5) where $\Sigma_q^{-1/2}$ is replaced by $\widehat{\Sigma}_q^{-1/2}$; then
$$\mathrm{vec}(Y\widehat{\Sigma}_q^{-1/2}) = \big((\widehat{\Sigma}_q^{-1/2})' \otimes X\big)\,\mathrm{vec}(B) + \mathrm{vec}(E\widehat{\Sigma}_q^{-1/2}).
$$
Hence,
$$\mathcal{Y} = \mathcal{X}\mathcal{B} + \mathcal{E},
$$
where $\mathcal{Y} = \mathrm{vec}(Y\widehat{\Sigma}_q^{-1/2})$, $\mathcal{X} = (\widehat{\Sigma}_q^{-1/2})' \otimes X$ and $\mathcal{E} = \mathrm{vec}(E\widehat{\Sigma}_q^{-1/2})$.
Chapter 4

Estimation of large block structured covariance matrices: Application to “multi-omic” approaches to study seed quality

Scientific production
Abstract
Motivated by an application in high-throughput genomics and metabolomics, we
propose a novel, efficient and fully data-driven approach for estimating large block
structured sparse covariance matrices in the case where the number of variables is much
larger than the number of samples without limiting ourselves to block diagonal matrices.
Our approach consists in approximating such a covariance matrix by the sum of a
low-rank sparse matrix and a diagonal matrix. Our methodology also can deal with
matrices for which the block structure appears only if the columns and rows are permuted
according to an unknown permutation. Our technique is implemented in the R package
BlockCov which is available from the Comprehensive R Archive Network (CRAN) and
from GitHub. In order to illustrate the statistical and numerical performance of our
package some numerical experiments are provided as well as a thorough comparison
with alternative methods. Finally, our approach is applied to the use of “multi-omic”
approaches for studying seed quality.
4.1 Introduction
Plant functional genomics refers to the description of the biological function of
a single or a group of genes and both the dynamics and the plasticity of genome
expression to shape the phenotype. Combining multi-omics such as transcriptomic,
proteomic or metabolomic approaches allows us to address in a new light the di-
mension and the complexity of the different levels of gene expression control and
the delicacy of the metabolic regulation of plants under fluctuating environments.
Thus, our era marks a real conceptual shift in plant biology where the individual
is no longer considered as a simple sum of components but rather as a system
with a set of interacting components to maximize its growth, its reproduction and
its adaptation. Plant systems biology is therefore defined by multidisciplinary and
multi-scale approaches based on the acquisition of a wide range of data as exhaustive
as possible.
In this context, it is crucial to propose new methodologies for integrating het-
erogeneous data explaining the co-regulations/co-accumulations of products of gene
expression (mRNA, proteins) and metabolites. In order to better understand these
phenomena, our goal will thus be to propose a new approach for estimating block
structured covariance matrix in a high-dimensional framework where the dimension
of the covariance matrix is much larger than the sample size. In this setting, it is
well known that the commonly used sample covariance matrix performs poorly. In
recent years, researchers have proposed various regularization techniques to consis-
tently estimate large covariance matrices or the inverse of such matrices, namely
precision matrices. To estimate such matrices, one of the key assumptions made
in the literature is that the matrix of interest is sparse, namely many entries are
equal to zero. A number of regularization approaches including banding, tapering,
thresholding and `1 minimization, have been developed to estimate large covariance
matrices or their inverse such as, for instance, Ledoit & Wolf (2004), Bickel & Lev-
ina (2008), Banerjee et al. (2008b), Bien & Tibshirani (2011) and Rothman (2012)
among many others. For further references, we refer the reader to Cai & Yuan
(2012) and to the review of Fan et al. (2016).
In this paper, we shall consider the following framework. Let E 1 , E 2 , · · · , E n , n
zero-mean i.i.d. q-dimensional random vectors having a covariance matrix Σ such
that the number q of its rows and columns is much larger than n. The goal of
the paper is to propose a new estimator of Σ and of the square root of its inverse,
Σ−1/2 , in the particular case where Σ is assumed to have a block structure without
limiting ourselves to diagonal blocks. An accurate estimator of Σ can indeed be very
useful to better understand the links between the columns of the observation matrix
and may highlight some biological processes. Moreover, an estimator of Σ−1/2 can
be very useful in the general linear model in order to remove the dependence that
may exist between the columns of the observation matrix. For further details on
this point, we refer the reader to Perrot-Dockès et al. (2018), Perrot-Dockès et al.
(2018) and to the R package MultiVarSel in which such an approach is proposed
and implemented for performing variable selection in the multivariate linear model
in the presence of dependence between the columns of the observation matrix.
More precisely, in this paper, we shall assume that
$$\Sigma = ZZ' + D, \qquad (4.1)
$$
where Z is a q × k sparse matrix with k ≪ q and D is a diagonal matrix, so that ZZ' is a low-rank sparse matrix.
Figure 4.1 – Examples of matrices Σ generated from different matrices Z leading
to a block diagonal or to a more general block structure (extra-diagonal blocks).
We also propose a methodology to estimate Σ in the case where the block struc-
ture is latent; that is, permuting the columns and rows of Σ renders visible its block
structure. An example of such a matrix Σ is given in Figure 4.2 in the case where
k = 5 and q = 50.
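As an illustration, a matrix of the form (4.1) can be generated in R as follows; the dimensions, sparsity pattern and entry ranges are arbitrary choices of ours:

```r
# Minimal sketch: generating a covariance matrix Sigma = ZZ' + D with a
# block structure induced by sparse columns of Z (q = 50, k = 5).
set.seed(1)
q <- 50; k <- 5
Z <- matrix(0, q, k)
for (j in 1:k) Z[sample(q, 10), j] <- runif(10, 0.5, 1)  # sparse columns -> blocks
Sigma <- Z %*% t(Z) + diag(runif(q, 0.1, 0.5))           # Sigma = ZZ' + D
```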
Our approach is fully data-driven and consists in providing a low rank matrix approximation of the ZZ' part of Σ and then in using an ℓ1 regularization to obtain a sparse estimator of Σ. When the block structure is latent, a hierarchical clustering step must be applied first. With this estimator of Σ, we explain how to obtain an estimator of $\Sigma^{-1/2}$.
Our methodology is described in Section 4.2. Some numerical experiments on
synthetic data are provided in Section 4.3. An application to the analysis of “-omic”
data to study seed quality is performed in Section 4.4.
Figure 4.2 – Examples of matrices Σ of Figure 4.1 in which the columns and rows
are randomly permuted.
to $\widetilde{\Sigma}$ to ensure that the final estimator $\widehat{\Sigma}$ of Σ is positive definite.
— Fourth step: Estimation of $\Sigma^{-1/2}$. In this step, $\Sigma^{-1/2}$ is estimated from the spectral decomposition of $\widehat{\Sigma}$ obtained in the previous step.
The first step relies on the sample correlation matrix R whose entries are defined by
$$R_{i,j} = \frac{S_{i,j}}{\sigma_i\,\sigma_j}, \quad \forall\, 1 \le i, j \le q, \qquad (4.2)
$$
where
$$\sigma_i^2 = \frac{1}{n-1}\sum_{\ell=1}^{n}\left(E_{\ell,i} - \overline{E}_i\right)^2, \quad \text{with } \overline{E}_i = \frac{1}{n}\sum_{\ell=1}^{n}E_{\ell,i}, \quad \forall\, 1 \le i \le q.
$$
If S was the real matrix Σ, the corresponding matrix Γ would have a rank less than
or equal to k. Since S is an estimator of Σ, we shall use a rank r approximation
Γr of Γ. This will be performed by considering in its singular value decomposition
only the r largest singular values and by replacing the other ones by 0. By Eckart &
Young (1936), this corresponds to the best rank r approximation of Γ. The choice
of r will be discussed in Section 4.2.5.
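A minimal R sketch of this rank-r truncation, assuming the matrix Gamma is available (svd from base R; this is the Eckart & Young construction, not the BlockCov code itself):

```r
# Minimal sketch: best rank-r approximation of Gamma by truncating its
# singular value decomposition (Eckart & Young, 1936).
low_rank_approx <- function(Gamma, r) {
  s <- svd(Gamma)
  d <- s$d
  d[-(1:r)] <- 0                    # keep only the r largest singular values
  s$u %*% diag(d) %*% t(s$v)
}
```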
Let us first explain the usual framework in which the Lasso approach is used.
We consider a linear model of the following form
$$\mathcal{Y} = \mathcal{X}\mathcal{B} + \mathcal{E}, \qquad (4.4)
$$
where $\mathcal{Y}$, $\mathcal{B}$ and $\mathcal{E}$ are vectors and $\mathcal{B}$ is sparse, meaning that it has a lot of null components.
In such models a very popular approach initially proposed by Tibshirani (1996) is the Least Absolute Shrinkage and Selection Operator (Lasso), which is defined as follows for a positive λ:
$$\widehat{\mathcal{B}}(\lambda) = \underset{\mathcal{B}}{\mathrm{Argmin}}\left\{\|\mathcal{Y}-\mathcal{X}\mathcal{B}\|_2^2 + \lambda\|\mathcal{B}\|_1\right\}, \qquad (4.5)
$$
where, for $u = (u_1, \ldots, u_n)$, $\|u\|_2^2 = \sum_{i=1}^{n} u_i^2$ and $\|u\|_1 = \sum_{i=1}^{n} |u_i|$, i.e. the ℓ1-norm of the vector u. Observe that the first term of (4.5) is the classical least-squares criterion and that $\lambda\|\mathcal{B}\|_1$ can be seen as a penalty term. The interest of such a criterion is the sparsity enforcing property of the ℓ1-norm, ensuring that the number of non-zero components of the estimator $\widehat{\mathcal{B}}$ of $\mathcal{B}$ is small for large enough values of λ. Let
$$\mathcal{Y} = \mathrm{vecH}(\Gamma_r), \qquad (4.6)
$$
where vecH, defined in Section 16.4 of Harville (2001), is such that, for an n × n matrix A,
$$\mathrm{vecH}(A) = \begin{pmatrix} a_{1*} \\ a_{2*} \\ \vdots \\ a_{n*} \end{pmatrix},
$$
where $a_{i*}$ is the sub-vector of the column i of A obtained by striking out the i − 1 first elements. In order to estimate the sparse matrix ZZ', we need to propose a sparse estimator of $\Gamma_r$. To do this we apply the Lasso criterion described in (4.5), where $\mathcal{X}$ is the identity matrix. In the case where $\mathcal{X}$ is an orthogonal matrix, it has been shown in Giraud (2014) that the solution of (4.5) is:
$$\widehat{\mathcal{B}}(\lambda)_j = \begin{cases} \mathcal{X}_j'\mathcal{Y}\left(1 - \dfrac{\lambda}{2|\mathcal{X}_j'\mathcal{Y}|}\right), & \text{if } |\mathcal{X}_j'\mathcal{Y}| > \dfrac{\lambda}{2}, \\ 0, & \text{otherwise,} \end{cases}
$$
where $\mathcal{X}_j$ denotes the jth column of $\mathcal{X}$. Using the fact that $\mathcal{X}$ is the identity matrix we get
$$\widehat{\mathcal{B}}(\lambda)_j = \begin{cases} \mathcal{Y}_j\left(1 - \dfrac{\lambda}{2|\mathcal{Y}_j|}\right), & \text{if } |\mathcal{Y}_j| > \dfrac{\lambda}{2}, \\ 0, & \text{otherwise.} \end{cases} \qquad (4.7)
$$
We then reestimate the non null coefficients using the least-squares criterion and get:
$$\widetilde{\mathcal{B}}(\lambda)_j = \begin{cases} \mathcal{Y}_j, & \text{if } |\mathcal{Y}_j| > \dfrac{\lambda}{2}, \\ 0, & \text{otherwise.} \end{cases} \qquad (4.8)
$$
Since the diagonal terms of a correlation matrix are assumed to be equal to 1, we take the diagonal terms of $\widetilde{\Sigma}(\lambda)$ equal to 1. The lower triangular part of $\widetilde{\Sigma}(\lambda)$ is then obtained by symmetry. The choice of the best parameter λ, denoted $\lambda_{\mathrm{final}}$ in the following, will be discussed in Section 4.3.2.
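A minimal R sketch of the thresholding steps (4.7) and (4.8) applied to y = vecH(Gamma_r); the function name is ours:

```r
# Minimal sketch of (4.7)-(4.8): soft-threshold vecH(Gamma_r), then keep the
# surviving coefficients at their original values (the least-squares refit
# when the design is the identity matrix).
threshold_refit <- function(y, lambda) {
  keep <- abs(y) > lambda / 2
  b <- numeric(length(y))
  b[keep] <- y[keep]
  b
}
```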
Denoting by UDU' the spectral decomposition of $\widehat{\Sigma}$, $\Sigma^{-1/2}$ could be estimated by $UD^{-1/2}U'$, where $D^{-1/2}$ is a diagonal matrix having its diagonal terms equal to the square root of the inverse of the singular values of $\widehat{\Sigma}$. However, inverting the square root of too small eigenvalues may lead to poor estimators of $\Sigma^{-1/2}$. This is the reason why we propose to estimate $\Sigma^{-1/2}$ by
$$\widehat{\Sigma}_t^{-1/2} = U D_t^{-1/2} U', \qquad (4.9)
$$
where $D_t^{-1/2}$ is a diagonal matrix such that its diagonal entries are equal to the square root of the inverse of the diagonal entries of D, except for those which are smaller than a given threshold t, which are replaced by 0 in $D_t^{-1/2}$. The choice of t will be further discussed in Section 4.3.7.
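A minimal R sketch of (4.9), assuming Sigma_hat is the estimator obtained at the third step:

```r
# Minimal sketch of (4.9): thresholded inverse square root from the
# spectral decomposition of Sigma_hat; eigenvalues below t are zeroed out.
inv_sqrt_threshold <- function(Sigma_hat, t) {
  e <- eigen(Sigma_hat, symmetric = TRUE)
  d <- ifelse(e$values > t, 1 / sqrt(e$values), 0)
  e$vectors %*% diag(d) %*% t(e$vectors)
}
```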
— The Cattell criterion based on the Cattell’s scree plot described in Cattell
(1966) and
— the PA permutation method proposed by Horn (1965) and recently studied
from a theoretical point of view by Dobriban (2018).
To choose the parameter λ in (4.8), we shall compare two strategies in Section
4.3.2:
— The BL approach proposed in Bickel & Levina (2008) based on cross-validation
and
— the Elbow method, which consists in computing for different values of λ the Frobenius norm $\|R - \widetilde{\Sigma}(\lambda)\|_F$, where R and $\widetilde{\Sigma}(\lambda)$ are defined in (4.2) and at the end of Section 4.2.2, respectively. Then, it fits two simple linear regressions and chooses the value of λ achieving the best fit (a minimal sketch is given below).
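One possible R sketch of the Elbow idea; the split search below is our own reading of the "two simple linear regressions" criterion and not necessarily the exact BlockCov implementation:

```r
# Minimal sketch of the Elbow method: for each split point, fit a line on
# each side of the curve lambda -> ||R - Sigma_tilde(lambda)||_F and keep
# the split (hence the lambda) minimising the total residual sum of squares.
elbow_lambda <- function(lambdas, frob_norms) {
  m <- length(lambdas)
  rss <- sapply(2:(m - 1), function(k) {
    left  <- lm(frob_norms[1:k] ~ lambdas[1:k])
    right <- lm(frob_norms[k:m] ~ lambdas[k:m])
    sum(resid(left)^2) + sum(resid(right)^2)
  })
  lambdas[which.min(rss) + 1]
}
```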
Figure 4.3 – Illustration of PA and Cattell criteria for choosing r when q = 500
and n = 30 in the Extra-Diagonal-Unequal case. The value of r found by both
methodologies is displayed with a dotted line, the straight lines obtained for the
Cattell criterion and the eigenvalues of the permuted matrices in the PA methodology
are displayed in grey.
For n ∈ {10, 30, 50} and q ∈ {100, 500}, 100 n × q matrices E were generated such that their rows $E_1, E_2, \cdots, E_n$ are i.i.d. q-dimensional zero-mean Gaussian vectors having a covariance matrix Σ chosen according to the four previous cases: Diagonal-Equal, Diagonal-Unequal, Extra-Diagonal-Equal or Extra-Diagonal-Unequal.
4.3.1 Low rank approximation
The approaches for choosing r described in Section 4.2.5 are illustrated in Figure
4.3 in the Extra-Diagonal-Unequal case. We can see from this figure that both
methodologies find the right value of r which is here equal to 5.
To go further, we investigate the behavior of our methodologies from 100 repli-
cations of the matrix E for the four different types of Σ. Figure 4.4 displays the
barplots associated to the estimation of r made in the different replications by the
two approaches for the different scenarii. We can see from this figure that the PA
criterion seems to be slightly more stable than the Cattell criterion when n ≥ 30.
However, in the case where n = 10, the PA criterion underestimates the value of
r. Moreover, in terms of computational time, the performance of Cattell is much
better, see Figure 4.5.
Figure 4.4 – Barplots corresponding to the number of times where each value of r
is chosen in the low-rank approximation from 100 replications for the two method-
ologies in the different scenarii for the different values of n and q.
Figure 4.5 – Times in seconds needed to choose r with the Cattell and PA criteria for the different values of n and q.
The performance of the two approaches proposed to select λ in $\widetilde{\Sigma}(\lambda)$ is illustrated in Figure 4.6. This figure displays the True Positive Rate (TPR) and the False Positive Rate (FPR) of the methodologies from 100 replications of the matrix E for the four different types of Σ and for different values of n and q. We can see from this figure that the performance of Elbow is on a par with the one of BL, except for the case where n = 10, for which the performance of Elbow is slightly better in terms of True Positive Rate. Moreover, in terms of computational time, the performance of Elbow is much better, see Figure 4.7.
Figure 4.6 – Boxplots comparing the TPR (True Positive Rate) and the FPR (False
positive Rate) of the two methodologies proposed to select the parameter λ from
100 replications in the different scenarii.
Figure 4.7 – Times in seconds needed to choose λ with the BL and Elbow approaches for the different values of n and q.
The goal of this section is to compare the statistical performance of our approach
with other methodologies.
Since our goal is to estimate a covariance matrix containing blocks, we shall compare our approach with clustering techniques. Once the groups or blocks have been obtained, Σ is estimated by assuming that the corresponding matrix estimator is block-wise constant except for the diagonal blocks, for which the diagonal entries
are equal to 1 and the extra-diagonal terms are assumed to be equal. This gives a
great advantage to these methodologies in the Diagonal-Equal and in the Extra-
Diagonal-Equal scenarii. More precisely, let ρi,j denote the value of the entries in
the block having its rows corresponding to Group (or Cluster) i and its columns to
Group (or Cluster) j. Then, for a given clustering C:
$$\rho_{i,j} = \begin{cases} \dfrac{1}{\#C(i)\,\#C(j)} \displaystyle\sum_{k\in C(i),\,\ell\in C(j)} R_{k,\ell}, & \text{if } C(i)\ne C(j), \\[8pt] \dfrac{1}{\#C(i)\,(\#C(i)-1)} \displaystyle\sum_{k\in C(i),\,\ell\in C(i),\,k\ne\ell} R_{k,\ell}, & \text{if } C(i)=C(j), \end{cases} \qquad (4.10)
$$
where C(i) denotes the cluster i, #C(i) denotes the number of elements in the cluster C(i) and $R_{k,\ell}$ is the (k, ℓ) entry of the matrix R defined in Equation (4.2).
For the matrices Σ corresponding to the four scenarios previously described, we
shall compare the statistical performance of the following methods:
— empirical which estimates Σ by R defined in (4.2),
— blocks which estimates Σ using the methodology described in this article with
the criteria PA and BL for choosing r and λ, respectively,
— blocks fast which estimates Σ using the methodology described in this article
with the criteria Cattell and Elbow for choosing r and λ, respectively,
— blocks real which estimates Σ using the methodology described in this article
when r and the number of non null values are assumed to be known which gives
Figure 4.8 – Comparison of the Frobenius norm of $\widehat{\Sigma} - \Sigma$ for the different estimators of Σ in the different scenarii for the different values of n and q.
We can see from this figure that in the case where n = 10, the performance of blocks fast is on a par with the one of blocks real and is better than the one of blocks. In the case where n = 50, the performance of blocks is slightly better than the one of blocks fast and is similar to the one of blocks real. Moreover, in all cases, either blocks fast or blocks outperforms the other approaches.
Then, the estimators of Σ derived from blocks, blocks fast and blocks real
were compared to the PDSCE estimator proposed by Rothman (2012) and implemented in the R package PDSCE and to the estimator proposed by Blum et al. (2016b) and implemented in the FANet package (Blum et al., 2016a). Since the com-
putational burden of PDSCE is high for large values of q, we limit ourselves to the
Extra-Diagonal-Equal case when n = 30 and q = 100 for the comparison. Figure
4.9 displays the results. We can see from this figure that blocks, blocks fast and
blocks real provide better results than PDSCE and FANet. However, it has to
be noticed that PDSCE is not designed for dealing with block structured covariance
matrices but just for providing sparse estimators of large covariance matrices.
Figure 4.9 – Comparison of the Frobenius norm of $\widehat{\Sigma} - \Sigma$ in the Extra-Diagonal-Equal case for n = 30 and q = 100.
Figure 4.10 – Comparison of the Frobenius norm of $\widehat{\Sigma} - \Sigma$ for blocks and blocks fast (no permutation) and for blocks samp and blocks fast samp (randomly permuted columns) in the different scenarii.
The methodology described above is then applied to $E_{\mathrm{ord}}$, which should provide an efficient estimator of $\Sigma_{\mathrm{ord}}$. In order to get an estimator of Σ, the columns and rows are permuted according to the ordering coming from the hierarchical clustering.
To assess the corresponding loss of performance, we generated for each matrix
E used for making Figure 4.8 a matrix E perm in which the columns of E were
randomly permuted. The associated covariance matrix is denoted Σperm . Then, we
applied the methodology described in the previous paragraph, denoted blocks samp and blocks fast samp in Figure 4.10, thus providing $\widehat{\Sigma}_{\mathrm{perm}}$. The performance
of this new methodology was compared to the methodology that we proposed in
the previous sections (denoted blocks and blocks fast in Figure 4.10) when the
columns of E were not permuted. The results are displayed in Figure 4.10. We
can see from this figure that the performance of our approach does not seem to be
altered by the permutation of the columns.
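A minimal R sketch of this reordering step, using hclust on a correlation-based dissimilarity; the dissimilarity 1 − |R| is our choice for illustration:

```r
# Minimal sketch: recovering a column ordering that reveals the latent
# block structure by hierarchical clustering of the columns of E_perm.
R     <- cor(E_perm)                        # sample correlation matrix
ord   <- hclust(as.dist(1 - abs(R)))$order  # ordering from the dendrogram
E_ord <- E_perm[, ord]                      # columns reordered into blocks
```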
4.3.5 Numerical performance
Figure 4.11 displays the computational times for estimating Σ with the methods
blocks and blocks fast for different values of q ranging from 100 to 3000 and
n = 30. The timings were obtained on a workstation with 16 GB of RAM and
Intel Core i7 (3.66GHz) CPU. Our methodology is implemented in the R package
BlockCov which uses the R language (R Core Team, 2017) and relies on the R
package Matrix. We can see from this figure that it takes around 3 minutes to
estimate a 1000 × 1000 correlation matrix.
Figure 4.11 – Times in seconds to perform our methodology in the Extra-Diagonal
Unequal case.
Figure 4.12 – Frobenius norm of $\widehat{\Sigma}_t^{-1/2}\Sigma\widehat{\Sigma}_t^{-1/2} - \mathrm{Id}_q$, where $\widehat{\Sigma}_t^{-1/2}$ is computed for different thresholds t.
Figure 4.13 – Comparison of the Frobenius norm of the error $\widehat{\Sigma}^{-1/2}\Sigma\widehat{\Sigma}^{-1/2} - \mathrm{Id}_q$ for different estimators $\widehat{\Sigma}$ of Σ.
We can see from this figure that in the case where n = 10 the estimators of $\Sigma^{-1/2}$ derived from the empirical, the blocks fast and the blocks real estimators of Σ perform similarly and seem to be more adapted than the others to remove the dependence among the columns of E. However, when n = 50, the behavior is completely different. Firstly, in the Diagonal-Equal case, the estimator of $\Sigma^{-1/2}$ derived from the hclust estimator of Σ seems to perform better than the others. Secondly, in the Diagonal-Unequal case, the estimators derived from blocks, blocks fast and blocks real perform similarly to the one obtained from hclust. Thirdly, in the Extra-Diagonal cases, the estimators derived from the blocks, blocks fast and blocks real methodologies perform better than the other estimators.
Then, the estimators of Σ−1/2 derived from blocks, blocks fast and blocks real
were compared to the GRAB estimator proposed by Hosseini & Lee (2016). Since
the computational burden of GRAB is high for large values of q, we limit ourselves
to the Extra-Diagonal-Equal case when n = 30 and q = 100 for the comparison.
Figure 4.14 displays the results. We can see that blocks and blocks real provide
better results than GRAB. However, it has to be noticed that the latter approach
depends on a lot of parameters that were difficult to choose, thus we used the default
ones.
Figure 4.14 – Comparison of the Frobenius norm of $\widehat{\Sigma}^{-1/2}\Sigma\widehat{\Sigma}^{-1/2} - \mathrm{Id}_q$ in the Extra-Diagonal-Equal case.
The estimator of $\Sigma^{-1/2}$ provided by BlockCov can also be used to remove the dependence between the columns of the observation matrix in the multivariate linear model considered in the MultiVarSel R package:
$$Y = XB + E. \qquad (4.11)
$$
We investigate the effect of the sparsity level s and of the signal to noise ratio (SNR) for the four scenarii defining Σ on the selection of the non null values of B in (4.11). Different signal to noise ratios are obtained by multiplying B in (4.11) by a coefficient κ.
Since the results are barely influenced by the scenario chosen for Σ, only the
Extradiagonal-Equal case is displayed in Figure 4.15, the other scenarii are avail-
able in Annexe 4.6.1. We can see from this figure that when the signal to noise ratio
is low and the value of s is high, meaning that there is a lot of non-zero values, the
FADA methodology performs better than the BlockCov methodology. Nevertheless,
in the three other cases the performance of BlockCov is either better than or on a par with the one of the FADA methodology.
Figure 4.15 – Means of the ROC curves (left) and Precision Recall curves (right)
obtained from 100 replications comparing the variables selected by the MultiVarSel
strategy using either $\widehat{\Sigma}^{-1/2}$ obtained by BlockCov to remove the dependence or the FADA methodology. κ is linked to the signal to noise ratio and s denotes the sparsity level, i.e. the fraction of non-zero elements in B.
Seeds are the most efficient form of dispersal of flowering plants in the environment. Seeds
are remarkably adapted to harsh environmental conditions as long as they are in
a quiescent state. Dry mature seeds (so called “orthodox seeds”) are an appropriate resource for preservation of plant genetic diversity in seedbanks. It has been
reported that the temperature regime during seed production affects agronomical
traits such as seed germination potential, see Huang et al. (2014), MacGregor et al.
(2015) and Kerdaffrec & Nordborg (2017). In order to highlight biomarkers of seed
quality according to thermal environment of the mother plant, Arabidopsis seeds
were produced under three temperature regimes (14–16 °C, 18–22 °C or 25–28 °C under a long-day photoperiod). Dry mature seeds were analysed by shotgun proteomics and GC/MS-based metabolomics (Durand et al., 2019). The choice to use the
model plant, Arabidopsis, was motivated by the colossal effort of the international
scientific community for its genome annotation. This plant remains at the forefront
of modern genetics, genomics, plant modelling and system biology, see Provart et al.
(2016). Arabidopsis provides a very useful basis to study gene regulatory networks,
and develop modelling and systems biology approaches for translational research
towards agricultural applications.
We consider the following modeling for our observations:
$$Y = XB + E. \qquad (4.12)
$$
The correlation matrix Σ of the rows of E was estimated in Model (4.12) by using the methodology developed in this paper, namely the BlockCov package. By the results of Section 4.3, we know that the PA and BL approaches performed poorly when n = 10. Since here n = 9, we used the Cattell and Elbow criteria to choose r and λ, respectively. The results are displayed in Figure 4.16. The Cattell criterion chooses r = 7 and the Elbow criterion chooses λ = 0.472, which implies that among the 19701 coefficients of the correlation matrix only 6696 values are considered as non null values.
Figure 4.16 – Cattell criterion (left) and Elbow criterion (right) applied to the metabolomic data set: the selected value r = 7 and the 6696 coefficients kept among the 19701 are indicated on the x-axes.
Figure 4.17 – Estimator of the correlation matrix Σ of the rows of E once the rows
and the columns have been permuted according to the ordering provided by the
hierarchical clustering.
The estimation of Σ obtained with our methodology is displayed in Figure 4.17
once the rows and the columns have been permuted according to the ordering pro-
vided by the hierarchical clustering to make visible the latent block structure.
Using the estimator of Σ−1/2 provided by the BlockCov package in the R package
MultiVarSel provides the sparse estimator of the matrix B defined in Model 4.12
and displayed in Figure 4.18. We can see from this figure that, for the metabolite X5MTP, the coefficient of the matrix $\widehat{B}$ is positive when the temperature is
high which means that the production of the metabolite X5MTP is larger in high
temperature conditions than in low temperature conditions.
In order to go further in the biological interpretation, we wanted to better un-
derstand the underlying block structure of the estimator of the correlation matrix
of the residuals based on metabolite abundances, $\widehat{\Sigma}$. Thus, we applied a hierarchical clustering with 8 groups to this matrix in order to split it into blocks. The corresponding dendrogram is on the left part of Figure 4.17. The matrix containing
the correlation means within and between the blocks or groups of metabolites is
displayed in Figure 4.19. The composition of the metabolites groups is available in
Appendix 4.6.2.
Interestingly, we could observe that X5MTP belongs to Group 6 which displays
Figure 4.18 – Sparse estimator of the coefficients matrix B obtained thanks to the
package MultiVarSel with a threshold of 0.95.
Figure 4.19 – Correlation means within and between the blocks or groups of metabolites.
a high correlation mean, equal to 0.8, between the 14 metabolites that make it up.
At least 6 metabolites of this group belong to the same family, namely glucosinolates
(i.e. X4MTB, 4-methylthiobutyl glucosinolate; X5MTP, 5-methylthiopentyl glucosi-
nolate; X6MTH, 6-methylthiohexyl glucosinolate; X7MTH, 7-methylthioheptyl glu-
cosinolate; X8MTO, 8-methylthiooctyl glucosinolate; UGlucosinolate140.1, uniden-
tified glucosinolate). Glucosinolates (GLS) are specialized metabolites found in
Brassicaceae and related families (e.g. Capparaceae), containing a β-thioglucose
moiety, a sulfonated oxime moiety, and a variable aglycone side chain derived from
an α-amino acid. These compounds contribute to the plant's overall defense mechanism, see Wittstock & Halkier (2002). Methylthio-GLS are derived from methionine. Methionine is elongated through condensation with acetyl CoA and then
converted to aldoximes through the action of individual members of the cytochrome
P450 enzymes belonging to the CYP79 family, see Field et al. (2004). The aldoxime undergoes condensation with a sulfur donor, and is stepwise converted to GLS, followed
by the side chain modification. The present results suggest that the accumulation
of methionine-derived glucosinolate family is strongly coordinated in Arabidopsis
seed. Moreover, we can see that they are influenced by the effect of the mother
plant thermal environment.
The same study was conducted on the proteomic data. The estimator of the
correlation matrix of the residuals based on protein abundances, $\widehat{\Sigma}$, obtained with
our methodology is displayed in Figure 4.20 once the rows and the columns have
been permuted according to the ordering provided by the hierarchical clustering to
make visible the latent block structure. To better understand the underlying block
structure of Σ,
b we applied a hierarchical clustering with 9 groups to this matrix in
order to split it into blocks. The corresponding dendrogram is on the left part of
Figure 4.20.
The matrix containing the correlation means within and between the blocks or
groups of proteins is displayed in Figure 4.21. We can see from this figure that
Group 8 has the highest correlation mean equal to 0.47. It consists of 34 proteins
which are given in Appendix 4.6.3.
A basic gene ontology analysis (http://geneontology.org/) showed that proteins
involved in response to stress (biotic and abiotic), in nitrogen and phosphorus
metabolic processes, in photosynthesis and carbohydrate metabolic process and in
oxidation-reduction process are overrepresented in this group, see Figure 4.22. Thus,
the correlation estimated within Group 8 seems to reflect a functional coherence of
the proteins of this group.
The variable selection in the multivariate linear model using the R package MultiVarSel provided 31 proteins differentially accumulated in seeds produced under contrasted temperatures.
Figure 4.20 – Estimator of the correlation matrix of the residuals of the protein
accumulation measures once the rows and the columns of the residual matrix have
been permuted according to the ordering provided by the hierarchical clustering.
[Figure 4.21: correlation means within and between the protein groups.]
[Figure 4.22 bar labels: response to stress; nitrogen compound metabolic process; response to abiotic stimulus; response to biotic stimulus; phosphorus metabolic process; oxidation-reduction process; carbohydrate metabolic process; photosynthesis.]
Figure 4.22 – Gene ontology (GO) term enrichment analysis of the 34 proteins belonging to Group 8. Data from the PANTHER overrepresentation test (http://www.geneontology.org); one uploaded id (i.e. AT5G50370) mapped to two genes, so the GO term enrichment was performed on 35 elements. Blue bars: observed proteins in Group 8; orange bars: expected result from the reference Arabidopsis genome.
Among them, two proteins related to cell wall organization, a beta-glucosidase (BGLC1, AT5G20950) and a translation elongation factor (eEF-1Bβ1, AT1G30230), were differentially accumulated in seeds produced under contrasted temperatures. eEF-1Bβ1 is associated with plant development and is involved in cell wall formation, see Hossain et al. (2012). These results suggest that cell wall rearrangements occur under the effect of temperature during seed maturation.
As displayed in Figure 4.23, 6 other proteins involved in mRNA translation (AT1G02780, AT1G04170, AT1G18070, AT1G72370, AT2G04390 and AT3G04840) were selected. The absolute failure of seed germination in the presence of protein synthesis inhibitors underlines the essential role of translation in achieving this developmental process, see Rajjou et al. (2004). Previous studies highlighted the importance of selective and sequential mRNA translation during seed germination and seed dormancy, see Galland et al. (2014), Bai et al. (2017) and Bai et al. (2018). Thus, exploring translational regulation during seed maturation and germination, through the dynamics of mRNA recruitment onto polysomes or through the neosynthesized proteome, is an emerging field in seed research.
[Figure 4.23: rows are the selected proteins (AGI identifiers); columns: mother plant temperature (14-16°C, 18-22°C, 25-28°C); color scale: estimated coefficients.]
Figure 4.23 – Values of the coefficients obtained using the package MultiVarSel
with a threshold of 0.95 on the proteomic dataset.
4.5 Conclusion
In this paper, we propose a fully data-driven methodology for estimating large block-structured sparse covariance matrices in the case where the number of variables is much larger than the number of samples, without limiting ourselves to block-diagonal matrices. Our methodology can also deal with matrices for which the block structure only appears once the columns and rows are permuted according to an unknown permutation. Our technique is implemented in the R package BlockCov, which is available from the Comprehensive R Archive Network and from GitHub. In the course of this study, we have shown that BlockCov is a very efficient approach from both the statistical and the numerical points of view. Moreover, its very low computational load makes it usable even for very large covariance matrices with several thousand rows and columns.
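To give an idea of how such an estimator can be obtained when n << p, the sketch below (ours; a simplified caricature of this type of approach, not the BlockCov implementation, with an illustrative rank and threshold) combines a low-rank approximation of the sample correlation matrix with thresholding:

```r
# Conceptual sketch: block covariance estimation via a low-rank approximation
# of the sample correlation matrix followed by thresholding.
set.seed(2)
n <- 30; p <- 100
Sigma <- diag(p)
Sigma[1:40, 1:40]  <- 0.7
Sigma[41:70, 41:70] <- 0.5
diag(Sigma) <- 1
E <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)   # n observations of p variables

R  <- cor(E)                                      # sample correlation matrix
sv <- svd(R - diag(p))                            # off-diagonal part
r  <- 2                                           # retained rank (illustrative)
R_lr  <- sv$u[, 1:r] %*% diag(sv$d[1:r]) %*% t(sv$v[, 1:r])
R_hat <- ifelse(abs(R_lr) > 0.2, R_lr, 0)         # threshold small entries
diag(R_hat) <- 1                                  # back to a correlation matrix
```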
Acknowledgments
We thank the members of the EcoSeed European project (FP7 Environment, Grant/Award Number: 311840 EcoSeed, Coord. I. Kranner). IJPB was supported by the Saclay Plant Sciences LABEX (ANR-10-LABX-0040-SPS). We also thank the people who produced the biological material and performed the proteomic and metabolomic analyses. In particular, we would like to thank the University of Warwick (UWAR; Finch-Savage WE and Awan S) for the production of seeds, the Plant Observatory-Biochemistry platform (IJPB, Versailles; Bailly M, Cueff G) for preparing the samples for proteomics and metabolomics, the PAPPSO Proteomic Platform (GQE-Moulon; Balliau T, Zivy M) for the mass spectrometry-based proteome analysis, and the Plant Observatory-Chemistry/Metabolism platform (IJPB, Versailles; Clement G) for the GC/MS-based metabolome analyses.
4.6 Appendix
4.6.1 Variable selection performance
[Figure 4.24: ROC curves (TPR against FPR), one column per (s, κ) combination and one row per covariance structure (Diagonal_Equal, Diagonal_Unequal, Extradiagonal_Equal, Extradiagonal_Unequal); curve types: blocks_fast and FADA.]
Figure 4.24 – Means of the ROC curves obtained from 100 replications, comparing the variables selected by the MultiVarSel strategy using either the $\widehat{\Sigma}^{-1/2}$ obtained by BlockCov to remove the dependence, or the FADA methodology. κ is linked to the signal-to-noise ratio and s denotes the sparsity level, i.e. the fraction of non-zero elements in B.
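For reference, the quantities plotted in Figures 4.24 and 4.25 can be computed from an estimated support of B as follows (a small helper of ours, not package code):

```r
# Minimal sketch: TPR/FPR and precision/recall of an estimated support of B.
support_metrics <- function(B_hat, B_true, tol = 0) {
  sel  <- abs(B_hat)  > tol     # selected entries
  true <- abs(B_true) > 0       # truly non-zero entries
  c(TPR       = sum(sel & true)  / sum(true),
    FPR       = sum(sel & !true) / sum(!true),
    precision = sum(sel & true)  / max(sum(sel), 1),
    recall    = sum(sel & true)  / sum(true))
}
```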
[Figure 4.25: precision-recall curves, same layout as Figure 4.24, with one column per (s, κ) combination (s ∈ {0.01, 0.3}, κ ∈ {1, 10}).]
Figure 4.25 – Means of the precision-recall curves obtained from 100 replications, comparing the variables selected by the MultiVarSel strategy using either the $\widehat{\Sigma}^{-1/2}$ obtained by BlockCov to remove the dependence, or the FADA methodology. κ is linked to the signal-to-noise ratio and s denotes the sparsity level, i.e. the fraction of non-zero elements in B.
4.6.2 Groups of metabolites
Chapter 5
Study of the dialogue between dendritic cells and Th lymphocytes
5.1 Introduction
Communication between cells can take place through the exchange of molecular signals produced by a given cell and transmitted to a target cell, which in turn emits other signals in response. To simplify the discussion, we will call inputs the signals emitted by the given cell and outputs the signals emitted in response by the target cell. This mode of communication requires analyzing the multiple signals emitted by the cells. However, most studies are univariate and focus on the effect of one input, or one group of inputs, on a single output. By neglecting the diversity of the inputs and of the outputs, they cannot approach a realistic context of cellular communication.
encountered by the dendritic cell. More precisely, in the presence of an intracellular pathogen, the DC secretes interleukin 12 (IL12); once this signal is captured by the naive T lymphocyte, the latter differentiates into a Th1 lymphocyte and secretes, in particular, interferon gamma (IFNg) to fight this pathogen. Similarly, in the presence of an extracellular parasite, the DCs secrete signals that lead the naive T lymphocyte to differentiate into a Th2 lymphocyte, which secretes interleukins 4, 5 and 13 (IL4, IL5, IL13). Numerous studies have brought other profiles to light, such as the Th17 profile, induced by the presence of IL6, TNFa, IL23 and TGFb, whose cells secrete IL17A and IL17F in response to external bacteria and fungi (see Park et al., 2005). Figure 1.4, a simplified version of Figure 1 of Leung et al. (2010), shows different Th profiles that have been described.
[Diagram: a dendritic cell instructs a naive T lymphocyte; IL12p70 induces the Th1 profile (IFNg, IL2); IL4 induces the Th2 profile (IL4, IL5, IL13); TGFb and IL6 induce the Th17 profile (IL17A, IL17F).]
[Diagram: perturbators are applied to the dendritic cell; the DC signals are transmitted to the naive T lymphocyte, which emits Th signals in response.]
5.3 Modeling
We then apply the method described in Chapters 2 and 3, implemented in the MultiVarSel package, to model the values of the Th lymphocyte signals as a function of the DC signals. This method is based on Model (1.2):
$$Y = XB + E.$$
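A minimal sketch of this strategy (ours, with cv.glmnet standing in for the MultiVarSel internals; the variable names are not those of the package):

```r
# Whitening-then-lasso sketch for Y = XB + E with dependent responses.
library(glmnet)

set.seed(5)
n <- 30; p <- 5; q <- 8
X <- matrix(rnorm(n * p), n, p)
B <- matrix(0, p, q); B[1, ] <- 3                 # only one active input signal
Sigma <- toeplitz(0.7^(0:(q - 1)))                # AR(1)-type dependence
Y <- X %*% B + matrix(rnorm(n * q), n, q) %*% chol(Sigma)

# 1) estimate the dependence among the responses from the OLS residuals
Sigma_hat <- cor(residuals(lm(Y ~ X - 1)))

# 2) whiten the responses with Sigma_hat^{-1/2}
eig <- eigen(Sigma_hat, symmetric = TRUE)
S_inv_half <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)
Y_w <- Y %*% S_inv_half

# 3) lasso on the vectorized model: vec(Y_w) = (S_inv_half %x% X) vec(B) + noise
fit <- cv.glmnet(kronecker(S_inv_half, X), as.vector(Y_w), intercept = FALSE)
B_hat <- matrix(as.numeric(coef(fit, s = "lambda.min"))[-1], p, q)
```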
[Figure 5.3 panels: two heatmaps with color keys; rows: the 82 perturbation conditions (P1 to P82); columns: the 36 DC signals (left heatmap) and the 18 Th readouts, including the expansion fold (right heatmap).]
Figure 5.3 – Heatmap of the means of the different signals over the 82 perturbation conditions, obtained from several replicates. The DC signals are displayed on the left and the Th signals on the right.
[Figure 5.4 panel: heatmap of the coefficient values (scale −0.25 to 0.25); columns: the DC signals; rows: the Th outputs grouped into Th1 (IL21, IFNg, IL2), Th17 (IL9, IL17F, IL17A, TNFb, TNFa1, IL6_O), expansion-related (IL3, Exp, IL22, GMCSF, IL10_O) and Th2 (IL5, IL4, IL31, IL13) readouts.]
Figure 5.4 – Heatmap of the coefficients obtained when modeling the Th lymphocyte signals from the dendritic cell signals, using the strategy described in Chapters 2 and 3 with a threshold of 0.65.
Conclusion and perspectives
In this thesis we designed a method, suited to the high-dimensional setting, for performing variable selection in the multivariate linear model. More precisely, in Chapter 2 we propose a sparse estimator of the coefficients in the general linear model and establish conditions under which the sign consistency of this estimator holds. This estimator requires an estimation of the covariance existing between the different response variables. In the simple case where the covariance matrix is that of an autoregressive process of order 1 and the design matrix is that of a balanced one-way ANOVA, we showed that the conditions ensuring the sign consistency of our estimator are satisfied. We then developed methods for estimating covariance matrices in high dimension when the matrix is assumed to be symmetric Toeplitz (Chapter 3) and when it is assumed to be block structured (Chapter 4). These different methods were applied to problems arising in proteomics, metabolomics and immunology.
In this final chapter we first suggest directions that could extend our consistency results to other types of covariance matrices. We then propose other penalized estimators, stemming from the univariate linear model, which we generalize to the multivariate case in order to take into account the specific features of the biological problems described at the end of this chapter.
Similarly, when the covariance matrix is block diagonal, its inverse is also block diagonal, each block being inverted independently. In the Diagonal-Equal case described in Chapter 4, the covariance matrix has $L$ diagonal blocks where, for every $l$ in $\llbracket 1, L\rrbracket$, block $l$ has the form
$$a_l I_{q_l} + b_l J_{q_l},$$
where $q_l$ is the size of block $l$, $a_l$ and $b_l$ are real numbers (with $a_l$ nonzero), $I_{q_l}$ is the identity matrix of $\mathbb{R}^{q_l}$ and $J_{q_l}$ is the $q_l \times q_l$ matrix whose entries are all equal to 1. The inverse of such matrices has an explicit form:
$$\frac{1}{a_l}\left(I_{q_l} - \frac{b_l}{a_l + q_l b_l} J_{q_l}\right).$$
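As a quick check (ours), since $J_{q_l}^2 = q_l J_{q_l}$ and provided $a_l + q_l b_l \neq 0$:
$$\left(a_l I_{q_l} + b_l J_{q_l}\right)\cdot\frac{1}{a_l}\left(I_{q_l} - \frac{b_l}{a_l + q_l b_l} J_{q_l}\right) = I_{q_l} + \frac{1}{a_l}\left(b_l - \frac{b_l\,(a_l + q_l b_l)}{a_l + q_l b_l}\right) J_{q_l} = I_{q_l}.$$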
where $X^{(l)}$ is the matrix containing the variables belonging to group $l$ and $p_l$ is the number of variables belonging to group $l$.
Suppose now that there are reasons why the coefficients associated with certain columns of $X$ should be similar. To this end, Xin et al. (2014) propose the generalized fused lasso estimator:
$$\widehat{b}^{F} = \underset{b}{\mathrm{Argmin}}\ \|y - Xb\|_2^2 + \lambda_1 \sum_{(i,j)\in G} |b_i - b_j| + \lambda_2 \|b\|_1, \qquad (6.2)$$
where $G$ is the set of pairs $(i, j)$ indexing the coefficients presumed to be similar. Hoefling (2010) proposes an algorithm to compute $\widehat{b}^{F}$. Encouraging two or more variables to have similar coefficients will be referred to as fusion in what follows.
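As an illustration (ours; it uses the genlasso package as a stand-in for FusedLasso, encoding both penalties of (6.2) in a single penalty matrix $D$, so that $\lambda_1 \sum_{(i,j)\in G} |b_i - b_j| + \lambda_2 \|b\|_1 = \lambda_1 \|Db\|_1$ with the ratio $\lambda_2/\lambda_1$ held fixed):

```r
# Sketch of the generalized fused lasso (6.2) via a generic penalty matrix D.
library(genlasso)

set.seed(3)
n <- 50; p <- 4
X <- matrix(rnorm(n * p), n, p)
b_true <- c(1, 1, 0, 0)                     # coefficients 1 and 2 presumed similar
y <- drop(X %*% b_true) + rnorm(n, sd = 0.3)

G <- rbind(c(1, 2), c(3, 4))                # pairs (i, j) of similar coefficients
D_fuse <- t(apply(G, 1, function(ij) replace(numeric(p), ij, c(1, -1))))
ratio  <- 0.5                               # lambda2 / lambda1 (illustrative)
D <- rbind(D_fuse, ratio * diag(p))         # fusion rows + scaled identity rows

fit <- genlasso(y, X = X, D = D)            # whole solution path in lambda
coef(fit, lambda = 1)$beta                  # coefficients at lambda = 1
```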
In this section we adapt the penalties used in the group lasso and the fused lasso to the multivariate case. This is simply done using arguments similar to those of Section 1.2.2 and of Chapter 2 (more precisely, to Equation (1.13)). After such a transformation it is possible to apply the group lasso and fused lasso algorithms, varying the groups according to the problem at hand. Indeed, in the multivariate framework it can be interesting, for example, to select an explanatory variable, or a group of explanatory variables, for all the responses. Likewise, it can be interesting to encourage the fusion of the coefficients linking one or several explanatory variables to all the responses. We implemented these different models in the R package VariSel, which makes it possible to select an association (similar or different) of one or several explanatory variables with one or all of the response variables. This package relies on the lasso, group lasso and fused lasso algorithms implemented in the packages glmnet (Friedman et al., 2010b), gglasso (Yang & Zou, 2017) and FusedLasso (Hoefling, 2014), respectively. The package also contains functions to compare the models obtained with these different penalties, for instance by bootstrap or by cross-validation; a sketch of the vectorization underlying these multivariate versions is given below.
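The announced sketch (ours, not the VariSel code): the multivariate model is vectorized as in Equation (1.13), and each explanatory variable is made into one group across all the responses, so that it is selected for every response at once:

```r
# Group lasso on the vectorized multivariate model vec(Y) = (I_q %x% X) vec(B).
library(gglasso)

set.seed(4)
n <- 40; p <- 6; q <- 3
X <- matrix(rnorm(n * p), n, p)
B <- rbind(c(2, 2, 2), c(-2, -2, -2), matrix(0, p - 2, q))   # rows 1-2 active
Y <- X %*% B + matrix(rnorm(n * q), n, q)

y_vec  <- as.vector(Y)
X_kron <- kronecker(diag(q), X)
ord    <- order(rep(1:p, times = q))      # reorder so each group is contiguous
fit <- gglasso(X_kron[, ord], y_vec, group = rep(1:p, each = q), loss = "ls")

b_perm <- as.numeric(coef(fit, s = fit$lambda[20]))[-1]  # drop the intercept
b_orig <- numeric(p * q); b_orig[ord] <- b_perm          # undo the permutation
B_hat  <- matrix(b_orig, p, q)                           # back to the B layout
```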
Figure 6.1 presents a simple example with two response variables $Y_1$ and $Y_2$ that we try to explain with four explanatory variables $X_1^{G_1}$, $X_2^{G_1}$, $X_3^{G_2}$, $X_4^{G_2}$; these variables are divided into two groups: $G_1$, composed of $X_1^{G_1}$ and $X_2^{G_1}$, and $G_2$, composed of $X_3^{G_2}$ and $X_4^{G_2}$. In this figure we represent six models.
The first row shows three group-lasso-type models, which select either a group of explanatory variables for one response (left), an explanatory variable (without taking the groups into account) for all the responses (center), or a group of explanatory variables for all the responses (right). The second row shows three fused-lasso-type models, which encourage either a group of explanatory variables to share the same coefficient for one response, an explanatory variable to have a similar coefficient for all the responses, or a group of explanatory variables to have similar coefficients, both among themselves and across all the responses. Note that the matrices shown in Figure 6.1 are not actually estimated by the models; they only serve to highlight their differences. Indeed, in group-lasso-type models the coefficients belonging to the same group are forced to be selected together but are not forced to be equal, whereas in the fused-lasso case the coefficients are encouraged to be equal without being forced to be selected together.
[Figure 6.1 layout: rows "Group" and "Fused"; variables $X_1^{G_1}$, $X_2^{G_1}$, $X_3^{G_2}$, $X_4^{G_2}$; responses $Y_1$, $Y_2$; coefficient scale from −2 to 2.]
Figure 6.1 – Coefficient matrices that can be estimated using the group penalty (6.1) or the fusion penalty (6.2), applied to the responses, to the groups, or to both. Here variable $X_1$ is linked to variable $X_2$, and variable $X_3$ is linked to variable $X_4$.
6.2.3 Applications
These methods can be useful in many biological applications. We began to investigate them during a collaboration with Charlotte Brault, Agnès Doligez and Timothée Flutre of the AGAP and GQE units of INRA. The goal of this collaboration is to find grapevine plants that are little affected by drought. To do so, we propose to look for DNA regions, possibly linked to one or several genes, that are specific to traits characterizing drought resistance. These DNA regions are called quantitative trait loci (QTL). Here we focus on the regions characteristic of the alleles of the different genes. A column $X_i^{(a)}$ of a QTL matrix $X$ is the number of replicates of allele $a$ of a given gene $i$ for the $n$ samples. In order to select, at the same time, all the alleles linked to a given gene, we propose to use group-lasso-type models; building the required group structure is straightforward, as shown in the sketch below. Since drought resistance is characterized by several response variables, it can also be interesting to select all the alleles of the same gene for all the responses at the same time. Likewise, it can be interesting to encourage the coefficients linking the different response variables to an allele of a gene to be similar, which can be done with fused-lasso-type models.
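For instance (a purely illustrative sketch with hypothetical gene labels), the group indices required by gglasso follow directly from the gene to which each allele column belongs:

```r
# Hypothetical example: columns of X count allele copies; the alleles of a
# gene form one group, so a gene is selected together with all of its alleles.
gene_of_column <- c("G1", "G1", "G1", "G2", "G2", "G3")   # hypothetical labels
group <- as.integer(factor(gene_of_column, levels = unique(gene_of_column)))
group   # 1 1 1 2 2 3, ready for the 'group' argument of gglasso()
```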
Finally, these models could allow us to study more specifically the data described in Chapter 5. That chapter studies the link between the signals secreted by dendritic cells (inputs) and the signals secreted in response by Th lymphocytes (outputs). These data were obtained from two types of dendritic cells, which it would be interesting to take into account in the modeling. Thus, for each input/output pair we would have two coefficients, one for each type of dendritic cell. Studying these data with a fused-lasso-type model, encouraging the coefficients linking an input/output pair for the two types of dendritic cells to be equal, would highlight similarities and differences in the effect of the inputs on the outputs depending on the type of dendritic cell.
Bibliography
Audoin, C., Cocandeau, V., Thomas, O., Bruschini, A., Holderith, S., and Genta-
Jouve, G. (2014). Metabolome consistency : Additional parazoanthines from the
mediterranean zoanthid parazoanthus axinellae. Metabolites, 4 :421–432.
Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning.
Journal of Machine Learning Research, 9(Jun) :1179–1225.
Bai, B., Novák, O., Ljung, K., Hanson, J., and Bentsink, L. (2018). Combined trans-
criptome and translatome analyses reveal a role for tryptophan-dependent auxin
biosynthesis in the control of DOG1-dependent seed dormancy. New Phytologist,
217(3) :1077–1085.
Bai, B., Peviani, A., Horst, S., Gamm, M., Bentsink, L., and Hanson, J. (2017).
Extensive translational regulation during seed germination revealed by polysomal
profiling. New Phytologist, 214(1) :233–244.
Banerjee, O., Ghaoui, L. E., and D’aspremont, A. (2008a). Model selection through
sparse maximum likelihood estimation for multivariate gaussian or binary data.
Journal of Machine Learning Research, 9 :485–516.
Banerjee, O., Ghaoui, L. E., and d’Aspremont, A. (2008b). Model selection through
sparse maximum likelihood estimation for multivariate gaussian or binary data.
Journal of Machine learning research, 9(Mar) :485–516.
Bates, D. and Maechler, M. (2017). Matrix : Sparse and Dense Matrix Classes and
Methods. R package version 1.2-8.
Belloni, A., Chernozhukovand, V., and Wang, L. (2011). Square-root lasso : pivotal
recovery of sparse signals via conic programming. Biometrika, 98(4) :791–806.
Bien, J. and Tibshirani, R. J. (2011). Sparse estimation of a covariance matrix.
Biometrika, 98(4) :807–820.
Blum, Y., Houee-Bigot, M., and Causeur, D. (2016a). FANet : Sparse Factor
Analysis model for high dimensional gene co-expression Networks. R package
version 1.1.
Blum, Y., Houée-Bigot, M., and Causeur, D. (2016b). Sparse factor model for
co-expression networks with an application using prior biological knowledge.
Statistical applications in genetics and molecular biology, 15(3) :253—272.
Boccard, J. and Rudaz, S. (2016). Exploring omics data from designed experiments
using analysis of variance multiblock orthogonal partial least squares. Analytica
Chimica Acta, 920 :18 – 28.
Brockwell, P. and Davis, R. (1991). Time Series : Theory and Methods. Springer
Series in Statistics. Springer-Verlag New York.
Cai, T., Liu, W., and Luo, X. (2011). A constrained l1 minimization approach
to sparse precision matrix estimation. Journal of the American Statistical
Association, 106(494) :594–607.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate
behavioral research, 1(2) :245–276.
Chen, Y., Du, P., and Wang, Y. (2014). Variable selection in linear models. Wiley
Interdisciplinary Reviews : Computational Statistics, 6(1) :1–9.
Chiquet, J., Mary-Huard, T., and Robin, S. (2016). Structured regularization for
conditional Gaussian graphical models. Statistics and Computing, 27(3) :789–804.
Dieterle, F., Ross, A., Schlotterbeck, G., and Senn, H. (2006). Probabilistic quotient
normalization as robust method to account for dilution of complex biological mix-
tures. application in 1h nmr metabonomics. Analytical Chemistry, 78(13) :4281–
4290.
Dobriban, E. (2018). Permutation methods for factor analysis and PCA.
arXiv :1710.00479.
Durand, T. C., Cueff, G., Godin, B., Valot, B., Clément, G., Gaude, T., and Raj-
jou, L. (2019). Combined proteomic and metabolomic profiling of the arabidopsis
thaliana vps29 mutant reveals pleiotropic functions of the retromer in seed deve-
lopment. International journal of molecular sciences, 20(2) :362.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli-
hood and its oracle properties. Journal of the American Statistical Association,
96(456) :1348–1360.
Fan, J., Yuan, L., and Han, L. (2016). An overview of the estimation of large
covariance and precision matrices. The Econometrics Journal, 19(1) :C1–C32.
Field, B., Cardon, G., Traka, M., Botterman, J., Vancanneyt, G., and Mithen,
R. (2004). Glucosinolate and amino acid biosynthesis in arabidopsis. Plant
Physiology, 135(2) :828–839.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance esti-
mation with the graphical lasso. Biostatistics, 9(3) :432.
Friedman, J., Hastie, T., and Tibshirani, R. (2010a). Regularization paths for ge-
neralized linear models via coordinate descent. Journal of Statistical Software,
33(1) :1–22.
Friedman, J., Hastie, T., and Tibshirani, R. (2010b). Regularization paths for ge-
neralized linear models via coordinate descent. Journal of Statistical Software,
33(1) :1–22.
Galland, M., Huguet, R., Arc, E., Cueff, G., Job, D., and Rajjou, L. (2014). Dynamic
proteomics emphasizes the importance of selective mrna translation and protein
turnover during arabidopsis seed germination. Molecular & Cellular Proteomics,
13(1) :252–268.
Haddad, J. N. (2004). On the closed form of the covariance matrix and its inverse
of the causal arma process. Journal of Time Series Analysis, 25(4) :443–448.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical
Learning. Springer Series in Statistics. Springer New York Inc., New York, NY,
USA.
Heinze, G., Wallisch, C., and Dunkler, D. (2018). Variable selection - A review and
recommendations for the practicing statistician. Biom J, 60(3) :431–449.
Hoefling, H. (2010). A path algorithm for the fused lasso signal approximator.
Journal of Computational and Graphical Statistics, 19(4) :984–1006.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis.
Psychometrika, 30(2) :179–185.
Hossain, Z., Amyot, L., McGarvey, B., Gruber, M., Jung, J., and Hannoufa, A.
(2012). The translation elongation factor eef-1bβ1 is involved in cell wall biosyn-
thesis and plant development in arabidopsis thaliana. PLoS One. e30425.
Hosseini, M. J. and Lee, S.-I. (2016). Learning sparse gaussian graphical models
with overlapping blocks. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I.,
and Garnett, R., editors, Advances in Neural Information Processing Systems 29,
pages 3808–3816. Curran Associates, Inc.
Huang, Z., Footitt, S., and Finch-Savage, W. E. (2014). The effect of temperature
on reproduction in the summer and winter annual arabidopsis thaliana ecotypes
bur and cvi. Annals of botany, 113(6) :921–929.
Johnson, R. A. and Wichern, D. W., editors (1988). Applied Multivariate Statistical
Analysis. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Kendall, S. L., Hellwege, A., Marriot, P., Whalley, C., Graham, I. A., and Pen-
field, S. (2011). Induction of dormancy in arabidopsis summer annuals requires
parallel regulation of dog1 and hormone metabolism by low temperature and cbf
transcription factors. The Plant Cell, 23(7) :2568–2580.
Kirwan, J., Broadhurst, D., Davidson, R., and Viant, M. (2013). Characterising
and correcting batch variation in an automated direct infusion mass spectro-
metry (dims) metabolomics workflow. Analytical and Bioanalytical Chemistry,
405(15) :5147–5157.
Kuhl, C., Tautenhahn, R., Boettcher, C., Larson, T. R., and Neumann, S. (2012).
Camera : an integrated strategy for compound spectra extraction and annotation
of liquid chromatography/mass spectrometry data sets. Analytical Chemistry,
84 :283–289.
Lê Cao, K.-A., Boitard, S., and Besse, P. (2011). Sparse pls discriminant analy-
sis : biologically relevant feature selection and graphical displays for multiclass
problems. BMC Bioinformatics, 12(1) :253.
Lee, W. and Liu, Y. (2012). Simultaneous Multiple Response Regression and In-
verse Covariance Matrix Estimation via Penalized Gaussian Maximum Likelihood.
J. Multivar. Anal., 111 :241–255.
Leung, S., Liu, X., Fang, L., Chen, X., Guo, T., and Zhang, J. (2010). The cytokine
milieu in the interplay of pathogenic Th1/Th17 cells and regulatory T cells in
autoimmune disease. Cell. Mol. Immunol., 7(3) :182–189.
MacGregor, D. R., Kendall, S. L., Florance, H., Fedi, F., Moore, K., Paszkiewicz,
K., Smirnoff, N., and Penfield, S. (2015). Seed production temperature regula-
tion of primary dormancy occurs through control of seed coat phenylpropanoid
metabolism. New Phytologist, 205(2) :642–652.
Mallows, C. L. (1973). Some comments on c p. Technometrics, 15(4) :661–675.
Mardia, K., Kent, J., and Bibby, J. (1979). Multivariate analysis. Probability and
mathematical statistics. Academic Press.
Mehmood, T., Liland, K. H., Snipen, L., and Saebo, S. (2012). A review of variable
selection methods in partial least squares regression. Chemometrics and Intelligent
Laboratory Systems, 118 :62 – 69.
Meng, C., Kuster, B., Culhane, A. C., and Gholami, A. M. (2014). A multiva-
riate approach to the integration of multi-omics datasets. BMC Bioinformatics,
15(1) :162.
Molstad, A. J., Weng, G., Doss, C. R., and Rothman, A. J. (2018). An explicit
mean-covariance parameterization for multivariate response linear regression.
Mosmann, T. R., Cherwinski, H., Bond, M. W., Giedlin, M. A., and Coffman, R. L.
(1986). Two types of murine helper t cell clone. i. definition according to pro-
files of lymphokine activities and secreted proteins. The Journal of immunology,
136(7) :2348–2357.
Mosmann, T. R. and Coffman, R. (1989). Th1 and th2 cells : different patterns
of lymphokine secretion lead to different functional properties. Annual review of
immunology, 7(1) :145–173.
Park, H., Li, Z., Yang, X. O., Chang, S. H., Nurieva, R., Wang, Y.-H., Wang, Y.,
Hood, L., Zhu, Z., Tian, Q., et al. (2005). A distinct lineage of cd4 t cells regulates
tissue inflammation by producing interleukin 17. Nature immunology, 6(11) :1133.
Perrot-Dockès, M., Lévy-Leduc, C., Chiquet, J., Sansonnet, L., Brégère, M., Étienne,
M.-P., Robin, S., and Genta-Jouve, G. (2018). A variable selection approach in the
multivariate linear model : an application to lc-ms metabolomics data. Statistical
applications in genetics and molecular biology, 17(5).
Perrot-Dockès, M., Lévy-Leduc, C., Sansonnet, L., and Chiquet, J. (2018). Variable
selection in multivariate linear models with high-dimensional covariance matrix
estimation. Journal of Multivariate Analysis, 166 :78 – 97.
Perthame, E., Friguet, C., and Causeur, D. (2016). Stability of feature selec-
tion in classification issues for high-dimensional correlated data. Statistics and
Computing, 26(4) :783–796.
Perthame, E., Friguet, C., and Causeur, D. (2019). FADA : Variable Selection for
Supervised Classification in High Dimension. R package version 1.3.4.
Provart, N. J., Alonso, J., Assmann, S. M., Bergmann, D., Brady, S. M., Brkljacic,
J., ..., and Dangl, J. (2016). 50 years of arabidopsis research : highlights and
future directions. New Phytologist, 209(3) :921–944.
Rajjou, L., Gallardo, K., Debeaujon, I., Vandekerckhove, J., Job, C., and Job, D.
(2004). The effect of α-amanitin on the arabidopsis seed proteome highlights
the distinct roles of stored and neosynthesized mrnas during germination. Plant
physiology, 134(4) :1598–1613.
Raven, P., Singer, S., Johnson, G., Mason, K., Losos, J., Bouharmont, J., Masson,
P., and Van Hove, C. (2017). Biologie. Biologie. De Boeck Supérieur.
Ren, S., Hinzman, A. A., Kang, E. L., Szczesniak, R. D., and Lu, L. J. (2015).
Computational and statistical analysis of metabolomics data. Metabolomics,
11(6) :1492–1513.
Rinaldo, A. et al. (2009). Properties and refinements of the fused lasso. The Annals
of Statistics, 37(5B) :2922–2952.
Rothman, A. J. (2012). Positive definite estimators of large covariance matrices.
Biometrika, 99(3) :733–740.
Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation
invariant covariance estimation. Electron. J. Statist., 2 :494–515.
Rothman, A. J., Levina, E., and Zhu, J. (2010). Sparse multivariate regression
with covariance estimation. Journal of Computational and Graphical Statistics,
19(4) :947–962.
Saccenti, E., Hoefsloot, H. C. J., Smilde, A. K., Westerhuis, J. A., and Hendriks, M.
M. W. B. (2013). Reflections on univariate and multivariate analysis of metabo-
lomics data. Metabolomics, 10(3) :361–374.
Smith, C., Want, E., O’Maille, G., Abagyan, R., and Siuzdak, G. (2006). XCMS :
Processing mass spectrometry data for metabolite profiling using Nonlinear peak
alignment, matching, and identification. Analytical Chemistry, 78(3) :779–787.
Smith, R., Mathis, A., and Prince, J. (2014). Proteomics, lipidomics, metabolomics :
a mass spectrometry tutorial from a computer scientist’s point of view. BMC
Bioinformatics, 15.
Stuart, J. M., Segal, E., Koller, D., and Kim, S. K. (2003). A gene-coexpression net-
work for global discovery of conserved genetic modules. Science, 302(5643) :249–
255.
Varah, J. (1975). A lower bound for the smallest singular value of a matrix. Linear
Algebra and its Applications, 11(1) :3 – 5.
Verdegem, D., Lambrechts, D., Carmeliet, P., and Ghesquière, B. (2016). Improved
metabolite identification with midas and magma through ms/ms spectral dataset-
driven parameter optimization. Metabolomics, 12(6) :1–16.
Vogel, J. T., Cook, D., Fowler, S. G., and Thomashow, M. F. (2006). The cbf
cold response pathways of arabidopsis and tomato. Cold Hardiness in Plants :
Molecular Genetics, Cell Biology and Physiology, pages 11–29.
Volpe, E., Touzot, M., Servant, N., Marloie-Provost, M.-A., Hupé, P., Barillot, E.,
and Soumelis, V. (2009). Multiparametric analysis of cytokine-driven human th17
differentiation reveals a differential regulation of il-17 and il-22 production. Blood,
114(17) :3610–3614.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing,
17(4) :395–416.
Walker, A. (1964). Asymptotic properties of least-squares estimates of parameters
of the spectrum of a stationary non-deterministic time-series. Journal of the
Australian Mathematical Society, 4(3) :363–384.
Wen, F., Yang, Y., Liu, P., and Qiu, R. C. (2016). Positive definite estimation
of large covariance matrix using generalized nonconvex penalties. IEEE Access,
4 :4168–4182.
Wittenburg, D., Teuscher, F., Klosa, J., and Reinsch, N. (2016). Covariance between
genotypic effects and its use for genomic inference in half-sib families. G3 : Genes,
Genomes, Genetics, 6(9) :2761–2772.
Wittstock, U. and Halkier, B. A. (2002). Glucosinolate research in the arabidopsis
era. Trends in plant science, 7(6) :263–270.
Xin, B., Kawahara, Y., Wang, Y., and Gao, W. (2014). Efficient generalized fused
lasso and its application to the diagnosis of alzheimer’s disease. In Twenty-Eighth
AAAI Conference on Artificial Intelligence.
Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R.,
Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., et al. (2010).
Common snps explain a large proportion of the heritability for human height.
Nature genetics, 42(7) :565.
Yang, Y. and Zou, H. (2017). gglasso : Group Lasso Penalized Learning Using a
Unified BMD Algorithm. R package version 1.4.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with
grouped variables. Journal of the Royal Statistical Society Series B, 68 :49–67.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the gaussian gra-
phical model. Biometrika, 94(1) :19–35.
Zhang, A., Sun, H., Wang, P., Han, Y., and Wang, X. (2012). Modern analytical
techniques in metabolomics analysis. Analyst, 137 :293–300.
Zhang, H., Zheng, Y., Yoon, G., Zhang, Z., Gao, T., Joyce, B., Zhang, W., Schwartz,
J., Vokonas, P., Colicino, E., Baccarelli, A., Hou, L., and Liu, L. (2017). Regula-
rized estimation in sparse high-dimensional multivariate regression, with applica-
tion to a DNA methylation study. Stat Appl Genet Mol Biol, 16(3) :159–171.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of
Machine Learning Research, 7 :2541–2563.
Zheng, X. and Loh, W.-Y. (1995). Consistent variable selection in linear models.
Journal of the American Statistical Association, 90(429) :151–156.
Appendix
Article
[First page of the reprinted article (Grandclaudon et al., Cell, 2019; https://doi.org/10.1016/j.cell.2019.09.012; correspondence: [email protected]): graphical abstract (36 DC communication molecules, 17 T helper cytokines, 82 distinct DC conditions, prediction of 346 DC/T cell molecular associations), author list (including Julien Chiquet, Céline Lévy-Leduc and Vassili Soumelis) and affiliations (École Normale Supérieure PSL, Institut Curie, Mines Paris Tech), and highlights (428 protein-level measurements of 36 DC communication molecules and 17 Th cytokines).]
SUMMARY

Cell-cell communication involves a large number of molecular signals that function as words of a complex language whose grammar remains mostly unknown. Here, we describe an integrative approach involving (1) protein-level measurement of multiple communication signals coupled to output responses in receiving cells and (2) mathematical modeling to uncover input-output relationships and interactions between signals. Using human dendritic cell (DC)-T helper (Th) cell communication as a model, we measured 36 DC-derived signals and 17 Th cytokines broadly covering Th diversity in 428 observations. We developed a data-driven, computationally validated model capturing 56 already described and 290 potentially novel mechanisms of Th cell specification. By predicting context-dependent behaviors, we demonstrate a new function for IL-12p70 as an inducer of Th17 in an IL-1 signaling context. This work provides a unique resource to decipher the complex combinatorial rules governing DC-Th cell communication and guide their manipulation for vaccine design and immunotherapies.

INTRODUCTION

Cell-cell communication involves the exchange of molecular signals produced by a given cell and transmitting an effect through specific receptors expressed on target cells. This process requires integration of multiple communication signals of different nature during homeostatic or stress-related responses. For example, differentiation of pluripotent hematopoietic stem cells into mature myeloid or lymphoid blood cells requires the collective action of multiple cytokines, growth factors, and Notch ligands (Balan et al., 2018). In the context of stress, multiple signals need to be integrated by innate and adaptive immune cells, including cytokines, growth factors, inflammatory mediators, and immune checkpoints (Chen and Flies, 2013; Macagno et al., 2007). In most studies, these communication molecules have been studied as individual stimuli to a target cell by gain- and loss-of-function experiments. This provides important knowledge regarding the downstream effects of the signals but prevents us from widely addressing their function in various contexts of other co-expressed communication signals.

Context dependency is an important aspect of verbal language communication that can directly affect the meaning of individual words but also modify the logic of syntactic rules (Cariani and Rips, 2017; Kintsch and Mangalath, 2011). Similarly, context dependencies may dramatically affect the function of biologically active communication signals. For example, we have shown that 90% of the transcriptional response to type I interferon in human CD4 T cells depends on the cytokine context (T helper 1 [Th1], Th2, or Th17; Touzot et al., 2014). Other studies have identified major context-dependent functions of immune checkpoints, such as OX40-ligand (Ito et al., 2005), and regulatory cytokines, such as transforming growth factor b (TGF-b) (Ivanov et al., 2006; Manel et al., 2008; Volpe et al., 2008). These studies suggest that communication molecules function as words of a complex language with grammar defining combinatorial rules of co-expression and mutual influence of one signal over the function (meaning) of another signal.

Three levels of biological complexity need to be integrated to decipher those combinatorial rules: (1) the multiplicity of input communication signals to include as many possible contextual effects; (2) communication signals at their naturally occurring concentrations; and (3) a large number of output responses in target cells, reflecting the effect of cell-cell communication quantitatively and qualitatively. Those three levels create a bottleneck in deciphering cell-cell communication.
[Figures 1 and 2 of the reprint. Figure 1 residue: study design with 82 DC conditions (MoDC or CD11c+ DC, 44 donors total, multiple doses of perturbators such as LPS and zymosan), and a table of the 36 DC surface and secreted communication signals (n = 428 data points) giving, for each input, its log range, percentage of positive observations and coefficient of variation. Figure 2 legend (recovered fragments): (A) Ward hierarchical clustering on Pearson metrics for the DC signals and Euclidian distances for the 82 DC conditions, grouped into 5 condition groups. (B) Expression profiles (mean and SD) of the 36 DC signals; expression data were logged and scaled so that μ represents the mean and σ the SD of the expression of a given DC signal across the whole dataset. (C) Boxplot of selected DC signals for pairs of stimulatory conditions defined as being the most correlated within our dataset by Pearson correlation (t test). (D) Best number of groups by Gaussian mixture model, determined using the 428 points of the 36 DC parameters.]
[...] of co-stimulatory molecules and cytokines, with the exception of high IL-28a. Group 2 showed low expression for most DC-derived Th stimuli but high levels of integrins, VISTA and B7H3, suggesting a capacity to interact with T cells and transmit co-inhibitory signals. Group 3 showed a complementary pattern, lack of group 1- and group 2-specific molecules, and intermediate or high levels of co-stimulatory molecules such as CD83, CD86, HLA-DR, 4-1BBL, and OX40L. This suggested potent T cell stimulating functions. Group 4 exhibited high levels of molecules from the B7 and TNF superfamilies, such as CD80, CD86, PDL1, PDL2, and CD40, but intermediate or low cytokine levels. In contrast, group 5 showed the highest level of cytokines and molecules of the B7 and TNF superfamilies (Figure 2B).

Next we sought to analyze intra-cluster heterogeneity. We selected three pairs of perturbators most closely related as defined by Euclidian distance (C32 [MoDC HKLM, MOI 1] and C33 [MoDC HKCA, MOI 1], C47 [bDC LPS, 100 ng/mL] and C48 [bDC HKLM, MOI 1], and C61 [MoDC R848, 1 μg/mL] and C62 [MoDC PAM3, 10 μg/mL]) and [...] absence in C33 to over 1 ng/mL in C32 (Figure 2C). In contrast, the pairs C47/C48 and C61/C62 showed significant differences for multiple Th stimuli. C47 expressed significantly more CD86, PDL1, IL-12p70, and IL-1 than C48. On the contrary, C48 expressed higher levels of 4-1BBL. C61 and C62 showed marked differences in CD70 and IL-12 (higher in C61) and OX40L (higher in C62) levels. Hence, each DC condition expressed unique combinations of DC-derived Th stimuli, suggesting different communication potential with CD4 T cells.

An unsupervised Gaussian mixture model showed that the highest Bayesian information criterion (BIC) value corresponded to 82 groups, confirming that each DC condition induced a unique profile of the 36 communication molecules (Figure 2D). Using principal-component analysis (PCA), we showed that neither the date of the experiment nor the donor batch had a major effect on clustering (Figure S1C; STAR Methods).

The Heterogeneity of DC-Induced Th Cytokine Responses

We characterized the diversity of CD4 T cell output responses, as assessed by Th cytokine profiles, following co-culture of naive CD4 T cells with activated DCs across the 82 conditions described previously. Th cytokines exhibited important variations across the 428 observations (Figure 3A). Some cytokines, such as IL-2, TNF-a, GM-CSF, TNF-b, and IL-3, were always detected (Figure S2A).

[Figure 2C residue: boxplots of CD86, IL-1, 4-1BBL, PDL1, CD70, OX40L and IL-12p70 (pg/mL) for the condition pairs C32/C33, C47/C48 and C61/C62.]
[Figure 3 of the reprint (residue): Th cytokine concentrations (pg/mL) across the 428 observations; Ward hierarchical clustering of the 82 DC conditions (C1 to C82) into 6 groups; boxplots of selected Th cytokines (e.g., IL-3, IL-2, IFN-γ, IL-6, GM-CSF, IL-21) for closely related condition pairs; heatmap of the 17 T helper cytokines ordered by Th9/Th17/Th1-Tfh/Th22/Th2 clusters under Medium, LPS (100 ng/mL), Zym (10 μg/mL) and Flu (1X) conditions; panel E: best number of groups by Gaussian mixture model (BIC values).]
[Figure 4A flowchart (recovered): MODEL BUILDING: apply MultiVarSel to the experimental dataset and display the coefficients (Figure 4B); STABILITY SELECTION: 1,000 resamplings of one half of the dataset, selection frequency threshold 0.65; MODEL VALIDATION: numerical validation (prediction error by 10-fold cross-validation, Figure 4C) and literature validation (systematic analysis comparing the model predictions with literature data, Figure 4D).]
[Figure 4B-D residue: heatmap of the model's coefficient values linking the 36 dendritic cell-derived communication signals (inputs) to the Th outputs; histogram of the correlations of the model's coefficient values; squared prediction errors comparing the MultiVarSel multivariate model of Figure 4B with the best univariate model.]
Figure 4. A Data-Driven Lasso Penalized Regression Model Predicts Th Differentiation Outcomes from DC-Derived Communication Signals
(A) Mathematical modeling strategy.
(B) Heatmap of the model’s coefficient values of the MultiVarSel-derived model, explaining the 18 Th parameters based on the 36 DC-derived signals (Pearson
correlation-based hierarchical clustering).
(C) Mean and SE of prediction error values obtained by 10-fold cross-validation for Th parameters using the multivariate model (yellow) and the best univariate
model (gray) within the 36 DC signals.
(D) Literature-based validation score. For each DC signal, all predicted associations with Th cytokines were categorized as "new," "validated," or "contradictory."
[Figure 5 residue (experimental validation): (B) output expression fold change upon CD80/CD86 blocking (measured value versus in silico KO estimate) for IL-17A, IL-17F, IL-21, IL-31 and GM-CSF under LPS, Zym and Flu conditions; (C) experimental design: naive CD4 T cells cultured for 5 days with anti-CD3/anti-CD28, polarizing Th cytokines, and rhIL-1β, rhIL-12p70 or agonist anti-hICOS mAb, measuring the Th outputs predicted by the model to be modulated; (D) concentrations (pg/mL) of IL-1, IL-21, IL-22, TNF-β, IFN-γ, IL-10 and other cytokines with or without rhIL-12p70 under the indicated Th conditions (Th0, Th2, ZYM-DC, HKSA-DC), compared with the model predictions.]
[Figure 6A (recovered): construction of context-dependent variables for IL-12. Twelve "IL-12_with" variables were created, defined for a pair of inputs by I_j^(1)_with_I_j^(2) = 0 if I_j^(2) = 0, and I_j^(1) if I_j^(2) ≠ 0, giving 36 full inputs plus 12 "IL-12_with" variables; the coefficients were then displayed with a threshold of 0.6 (Figure 6C).]
[Figure 6B-C residue: panel titles "Multivariate modeling including context-dependent variables for IL-12" and "Computational validation assessed by cross-validation"; the selected context-dependent variables for IL-17F and IL-17A include IL-12_with_IL-1, IL-12_with_ICAM-2, IL-12_with_Jagged-2, IL-12_with_OX40L, IL-12_with_IL-28α, IL-12_with_CD70, IL-12_with_IL-23, IL-12_with_LIGHT, IL-12_with_TNF-α and IL-12_with_CD30L.]
[...] with IL-1b dramatically induced IL-17F at levels comparable with the positive control, without a detectable amount of IL-17A, which fully validated the model predictions (Figure 7B). This effect was specific to the IL-12+IL-1b combination; IL-6, IL-23, or TGF-b alone or combined with IL-12 could not induce IL-17F expression (Figure S6C). The exact same pattern of Th [...]
[Figure 7 residue: (A) stability selection frequencies of the DC signals in a multivariate model explaining the difference between IL-17F and IL-17A (CD80, CD70, ICAM-2, IL-23 and 4-1BBL among the most frequently selected); (B-D) IL-17F and IL-17A concentrations under Th0, Th2 and Th17 conditions with combinations of IL-12, IL-1β and IL-2; (E-I) intracellular FACS quantification of IL-17A+ and IL-17F+ cells, with IL-17F+ single producers at 22.2% (±3.2) of naive and 22.2% (±2.5) of memory CD4 T cells.]
Figure 7. Synergistic Interaction of IL-12 and IL-1 Promotes IL-17F without IL-17A
(A) Stability selection frequencies of the different DC signals by a multivariate model, explaining the difference between IL-17F and IL-17A.
(B) Concentration of cytokines measured on restimulated Th supernatants. Naive CD4 T cells were differentiated for 5 days with anti-CD3/CD28 beads under the
indicated conditions; n = 6 donors, paired t test.
(C) The same experimental design as in (B), with conditions as annotated; n = 6 donors, Wilcoxon test.
(D) Coated anti-CD2 and anti-CD3 together with soluble anti-CD28 were given for 5 days to naive CD4 T cells under Th0 or Th17 conditions. Cytokine
concentrations were measured after 24-h restimulation on day 5. Mean and SD are shown; n = 8, Wilcoxon test.
(E) Day 5 intracellular FACS analysis of Th cells differentiated as in (B). Dot plots show a representative donor.
(F) Quantification of live total CD4 T cells producing either IL-17A or IL-17F; n = 6 donors, paired t test.
(G) Representative donor of CD4 memory T cells with intracellular FACS staining for IL-17A versus IL-17F.
(H) Venn diagrams of IL-17F+/IL-17A− Th cells co-producing IL-22, IFN-g, and IL-21 from naive CD4 T cells under the IL-12+IL-1b condition; mean percentage and confidence interval, n = 6 donors.
(I) Venn diagrams of IL-17F+/IL-17A− Th cells co-producing IL-22, IFN-g, and IL-21 from memory CD4 T cells stimulated for 5 h with PMA and ionomycin; mean percentage of 6 donors with confidence interval.
DECLARATION OF INTERESTS

The authors declare no competing interests.

Received: February 19, 2019; Revised: June 20, 2019; Accepted: September 9, 2019; Published: October 3, 2019.

STAR+METHODS

Detailed methods are provided in the online version of this paper and include the following: key resources table; lead contact and materials availability; experimental model and subject details (human subjects); method details (PBMCs purification; MoDC generation and activation; blood dendritic cells purification; CD4+ T lymphocytes purification; paired protein measurement in DC/T coculture; IL-12 blocking experiment; CD28 blocking experiment; addition of rhIL-12p70 during DC/T coculture; DC-free Th cell polarization; ICOS agonism; CD2 agonism; flow cytometry analysis; cytokine quantification; gene expression quantification; anti-human ICOS monoclonal blocking antibody); quantification and statistical analysis (dataset quality control: batch effect and T cell expansion; statistical tests; statistical analysis; model comparison and ROC curves; modeling strategy; systematic literature review); data and code availability.

REFERENCES (beginning of the article's reference list)

Abou-Jaoudé, W., Monteiro, P.T., Naldi, A., Grandclaudon, M., Soumelis, V., Chaouiya, C., and Thieffry, D. (2015). Model checking to assess T-helper cell plasticity. Front. Bioeng. Biotechnol. 2, 86.
Acosta-Rodriguez, E.V., Napolitani, G., Lanzavecchia, A., and Sallusto, F. (2007). Interleukins 1beta and 6 but not transforming growth factor-beta are essential for the differentiation of interleukin 17-producing human T helper cells. Nat. Immunol. 8, 942-949.
Alculumbre, S., and Pattarini, L. (2016). Purification of Human Dendritic Cell Subsets from Peripheral Blood. Methods Mol. Biol. 1423, 153-167.
Antebi, Y.E., Reich-Zeliger, S., Hart, Y., Mayo, A., Eizenberg, I., Rimer, J., Putheti, P., Pe'er, D., and Friedman, N. (2013). Mapping differentiation under mixed culture conditions reveals a tunable continuum of T cell fates. PLoS Biol. 11, e1001616.
Balan, S., Arnold-Schrauf, C., Abbas, A., Couespel, N., Savoret, J., Imperatore, F., Villani, A.C., Vu Manh, T.P., Bhardwaj, N., and Dalod, M. (2018). Large-Scale Human Dendritic Cell Differentiation Revealing Notch-Dependent Lineage Bifurcation and Heterogeneity. Cell Rep. 24, 1902-1915.e6.
Banchereau, J., and Steinman, R.M. (1998). Dendritic cells and the control of immunity. Nature 392, 245-252.
Cariani, F., and Rips, L.J. (2017). Conditionals, Context, and the Suppression Effect. Cogn. Sci. (Hauppauge) 41, 540-589.
Chen, L., and Flies, D.B. (2013). Molecular mechanisms of T cell co-stimulation and co-inhibition. Nat. Rev. Immunol. 13, 227-242.
Ciofani, M., Madar, A., Galan, C., Sellars, M., Mace, K., Pauli, F., Agarwal, A., Huang, W., Parkhurst, C.N., Muratet, M., et al. (2012). A validated regulatory network for Th17 cell specification. Cell 151, 289-303.
Title: Regularized methods to study multivariate data in high dimensional settings: theory and applications.

Keywords: regularized methods, multivariate data, high dimension

Abstract: In this PhD thesis we study the general linear model (multivariate linear model) in high dimensional settings. We propose a novel variable selection approach in the framework of multivariate linear models taking into account the dependence that may exist between the responses. It consists in estimating beforehand the covariance matrix of the responses and in plugging this estimator into a Lasso criterion, in order to obtain a sparse estimator of the coefficient matrix. The properties of our approach are investigated both from a theoretical and a numerical point of view. More precisely, we give general conditions that the estimators of the covariance matrix and its inverse have to satisfy in order to recover the positions of the zero and non-zero entries of the coefficient matrix when the number of responses is not fixed and can tend to infinity. We also propose novel, efficient and fully data-driven approaches for estimating Toeplitz and large block structured sparse covariance matrices in the case where the number of variables is much larger than the number of samples, without limiting ourselves to block diagonal matrices. These approaches are applied to different biological issues in metabolomics, in proteomics and in immunology.
Université Paris-Saclay
Espace Technologique / Immeuble Discovery
Route de l’Orme aux Merisiers RD 128 / 91190 Saint-Aubin, France