
Méthodes régularisées pour l'analyse de données multivariées en grande dimension : théorie et applications

Marie Perrot-Dockès

To cite this version:

Marie Perrot-Dockès. Méthodes régularisées pour l'analyse de données multivariées en grande dimension : théorie et applications. Applications [stat.AP]. Université Paris-Saclay (COmUE), 2019. Français. NNT : 2019SACLS304. tel-02384541

HAL Id: tel-02384541
https://theses.hal.science/tel-02384541
Submitted on 28 Nov 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
NNT : 2019SACLS304

Méthodes régularisées pour l'analyse de données multivariées en grande dimension : théorie et applications

Doctoral thesis of Université Paris-Saclay, prepared at Université Paris-Sud and at AgroParisTech

École doctorale n° 574 Mathématiques Hadamard (EDMH)
Doctoral specialty: Mathematics at the interfaces

Thesis presented and defended in Paris, on 8 October 2019, by

Marie Perrot-Dockès
Composition of the jury:

Liliane BEL, Professor, AgroParisTech (MIA 518), President
David CAUSEUR, Professor, Agrocampus Ouest (IRMAR UMR 6625), Reviewer
Jean-Marc BARDET, Professor, Université Paris 1 (SAMM), Examiner
Pierre NEUVIAL, Research scientist, Institut de mathématiques de Toulouse (Statistique et Probabilités), Examiner
Loïc RAJJOU, Professor, AgroParisTech (UMR 1318), Examiner
Vassili SOUMELIS, University professor and hospital practitioner, Hôpital Saint-Louis (Immunologie humaine et mécanismes inflammatoires), Examiner
Céline LÉVY-LEDUC, Professor, AgroParisTech (MIA 518), Thesis supervisor
Julien CHIQUET, Research scientist, AgroParisTech-INRA (MIA 518), Thesis co-supervisor
Laure SANSONNET, Associate professor, AgroParisTech (MIA 518), Invited member
Table of contents

1 Introduction                                                            1
  1.1 Biological context                                                  1
  1.2 Variable selection in the multivariate linear model                 4
      1.2.1 State of the art                                              5
      1.2.2 Contributions of Chapters 2 and 3                             9
  1.3 Estimation of block-structured correlation matrices                14
      1.3.1 State of the art                                             14
      1.3.2 Contributions of Chapter 4                                   15
  1.4 Another application: a study of the dialogue between dendritic
      cells and Th lymphocytes                                           18
      1.4.1 An introduction to immunology                                18
      1.4.2 Contributions of Chapter 5                                   20

2 Variable selection in high dimensional multivariate linear models      21
  2.1 Introduction                                                       22
  2.2 Theoretical results                                                25
      2.2.1 Case where Σ is known                                        25
      2.2.2 Case where Σ is unknown                                      27
      2.2.3 The AR(1) case                                               29
  2.3 Numerical experiments                                              31
      2.3.1 AR(1) dependence structure with balanced one-way ANOVA       32
      2.3.2 Robustness to unbalanced designs and correlated features     32
      2.3.3 Robustness to more general autoregressive processes          34
  2.4 Discussion                                                         34
  2.5 Proofs                                                             35
  2.6 Technical lemmas                                                   53

3 A variable selection approach in the multivariate linear model         55
  3.1 Introduction                                                       56
  3.2 Statistical inference                                              59
      3.2.1 Estimation of the dependence structure of E                  59
      3.2.2 Estimation of B                                              62
  3.3 Simulation study                                                   64
      3.3.1 Variable selection performance                               64
      3.3.2 Choice of the dependence modeling                            66
      3.3.3 Choice of the model selection criterion                      67
      3.3.4 Numerical performance                                        67
  3.4 Application to the analysis of a LC-MS data set                    68
      3.4.1 Data pre-processing                                          69
      3.4.2 Application of our four-step approach                        69
      3.4.3 Comparison with existing methods                             70
  3.5 Conclusion                                                         72

4 Estimation of large block structured covariance matrices               75
  4.1 Introduction                                                       75
  4.2 Statistical inference                                              78
      4.2.1 Low rank approximation                                       78
      4.2.2 Detecting the position of the non null values                79
      4.2.3 Positive definiteness                                        81
      4.2.4 Estimation of Σ^{-1/2}                                       81
      4.2.5 Choice of the parameters                                     81
  4.3 Numerical experiments                                              82
      4.3.1 Low rank approximation                                       83
      4.3.2 Positions of the non null values                             83
      4.3.3 Comparison with other methodologies                          85
      4.3.4 Columns permutation                                          87
      4.3.5 Numerical performance                                        89
      4.3.6 Choice of the threshold t for estimating Σ^{-1/2}            89
      4.3.7 Use of Σ^{-1/2} to remove the dependence in multivariate
            linear models                                                91
  4.4 Application to "multi-omic" approaches to study seed quality       92
      4.4.1 Results obtained for the metabolomic data                    94
      4.4.2 Results obtained for the proteomic data                      97
  4.5 Conclusion                                                        100
  4.6 Appendix                                                          102
      4.6.1 Variable selection performance                              102
      4.6.2 Groups of metabolites                                       104
      4.6.3 Groups of proteins                                          104

5 A study of the dialogue between dendritic cells and Th lymphocytes    105
  5.1 Introduction                                                      105
  5.2 Description of the data                                           106
      5.2.1 Experimental protocol                                       106
      5.2.2 A great diversity                                           107
  5.3 Modeling                                                          107
  5.4 Biological validation                                             108

6 Conclusion and perspectives                                           111
  6.1 Towards other settings satisfying the sign-consistency
      conditions of our estimator                                       111
  6.2 Variable selection using other penalized regressions              112
      6.2.1 An introduction to other penalized regressions              112
      6.2.2 Extensions to the multivariate framework                    113
      6.2.3 Applications                                                115

Bibliography                                                            126

Appendix                                                                127
Introduction

1.1 Biological context


Understanding how certain phenotypic traits are passed from one generation to the next is a "mystery" that long intrigued scientists and philosophers¹. The "spermists" believed that each spermatozoon contained an embryo that merely had to grow inside the mother, who then supplied only the nourishment essential to its development. Conversely, the "ovists" believed that the baby was contained in the egg and that sperm only triggered the process. Neither theory, however, answers the recurring question: why does a child resemble both of its parents? According to Darwin, phenotypic traits and the functions of each cell are explained by very small particles called "gemmules". The gemmules of both parents then end up in the cells of their child, who therefore displays characteristics intermediate between those of its two parents. Under this theory, however, diversity would only decrease year after year, eventually yielding a species made up of many identical beings. Mendel then put forward a new theory that would revolutionize genetics: the characteristics of individuals are governed not by one but by two factors, one from the mother and one from the father. This theory was later confirmed: each of our characteristics is indeed governed not by one but by two genotypic factors. It then remains to understand how these genotypic factors (now known as genes) influence our phenotypic characteristics, which is why this thesis focuses on the path from genotype to phenotype. The steps involved are numerous and complex, and our knowledge of them has kept evolving over many years. A detailed description of these steps and of the evolution of our knowledge is available in Chapter VI of Raven et al. (2017). We first present the central dogma of molecular biology, then give a few examples showing that the sequential picture it proposes is in fact simplistic.

1. The interested reader may refer to "Éloge de la différence. La génétique et les hommes" by Jacquard (1978), which offers a very fine introduction to the subject.

The central dogma of molecular biology describes the path from genotype to phenotype as sequential: the DNA contained in each of our chromosomes is transcribed into RNA, which is then translated into proteins, which in turn directly influence the phenotype. More precisely, a chromosome contains two DNA strands. Each strand is made of nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G). One strand is called the coding strand and the other the template strand. The nucleotide sequence of the coding strand is the complement of that of the template strand, meaning that thymine is replaced by adenine, cytosine by guanine, and vice versa. Genes are encoded on fragments of these DNA strands. RNA polymerases move along the template strand of the DNA to transcribe it into messenger RNA (mRNA). The nucleotide sequence of the mRNA is the same as that of the coding strand, except that thymine (T) is replaced by uracil (U). A ribosome then translates this mRNA into chains of amino acids. Each group of three nucleotides among A, U, C and G is called a codon. Since mRNA chains are composed of four different nucleotides, there are 4³ = 64 codons available to encode the amino acids. Three of them are stop codons, which halt the translation of mRNA into proteins; all the others are associated with an amino acid, some amino acids being encoded by several codons. Once formed, a chain of amino acids folds, completing the creation of a protein. When proteins are no longer functional, or no longer used, they are degraded by our system and turned into amino acids and small molecules that belong to the metabolites.
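The codon count above (4³ = 64, three of which are stop codons) can be checked in a few lines of Python. This is only an illustrative aside; the identity of the three stop codons (UAA, UAG, UGA) is standard biology rather than something stated in the text.

```python
from itertools import product

# The mRNA alphabet: four nucleotides (thymine T is replaced by uracil U in RNA).
NUCLEOTIDES = "AUCG"

# Every codon is an ordered triple of nucleotides, hence 4^3 = 64 codons.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

# The three stop codons terminate translation; the remaining 61 code for
# the 20 amino acids, so most amino acids are encoded by several codons.
STOP_CODONS = {"UAA", "UAG", "UGA"}
coding_codons = [c for c in codons if c not in STOP_CODONS]

print(len(codons), len(coding_codons))  # 64 61
```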
This sequential picture of the path from genotype to phenotype is in fact far too simplistic. For instance, "small" RNAs have been discovered that appear to repress the translation of certain parts of the mRNA. These small RNAs are encoded at various locations of the genome, including in regions once thought to be transcriptionally silent. Moreover, proteins themselves act on the transcription of mRNA. The path from genotype to phenotype is therefore far from sequential, since its different steps influence one another. Furthermore, our knowledge of it keeps evolving and is probably still simplistic compared to reality.
Differences between individuals of the same species can nevertheless be studied at each of these steps. Some phenotypic traits can, for example, be explained by studying DNA sequences (genomics). More precisely, one can study differences at nucleotides that frequently vary from one individual to another (Yang et al., 2010). Such single-nucleotide polymorphism is called a SNP (Single Nucleotide Polymorphism). The variation of a single nucleotide can have important effects, since it may change the amino acid, hence the protein and the metabolites derived from it. Likewise, loci (DNA regions, possibly coding one or several genes) that vary from one individual to another can have important effects on the individual's phenotype. These studies can also be carried out on transcribed RNA strands, in which case one speaks of transcriptomics. Proteomics (respectively metabolomics), which studies the abundance of proteins (resp. metabolites), can also help explain differences between individuals of the same species that would not be explained by the study of genes or transcripts. Indeed, it has been shown that variations in gene expression do not necessarily entail proportional variations in metabolite abundance (Riekeberg & Powers, 2017). More generally, to explain a phenomenon of interest, the different types of "-omics" data sets (genomics, transcriptomics, proteomics, metabolomics) each have advantages and drawbacks, which are detailed in Table 1 of Karahalil (2016). This is why, in what follows, we propose general methods for studying the links between markers from "-omics" data (genes, proteins, metabolites, SNPs, loci, ...) and certain phenotypic traits.

Figure 1.1 – Correlation matrix of SNPs along the genome (Wittenburg et al., 2016) on the left, and of proteins on the right.


To propose such methods, it is important to understand the links that exist among the markers themselves, which may be due to the type of data considered. For example, during meiosis, chromosomes of the same pair exchange a fragment of their chromatin; this is called chromosomal crossover. Thus, the closer two SNPs or genes are along the DNA strand, the more likely they are to lie on the same chromatin fragment during chromosomal crossovers. The dependence between two SNPs or genes is therefore all the stronger as they are physically close (Gianola et al., 2003; Wittenburg et al., 2016; Chiquet et al., 2016). This dependence along the genome is displayed in Figure 1.1 through the correlation matrix of SNP data, ordered along the genome, from the 106 Holstein-Friesian cows available in Wittenburg et al. (2016). Likewise, genes or proteins involved in the same metabolic pathway are often co-expressed (Stuart et al., 2003). Such co-expression is displayed in Figure 1.1 for a proteomics example described in more detail in Section 1.3 and in Chapter 4.

The objective of this thesis is to propose methods for selecting a small number of markers linked to one or several traits of interest while taking into account the dependence that exists among the markers. To this end, we rely on data sets in which q markers are measured on n samples. The number of markers can be very large (up to several tens of thousands) and measuring them on a sample can be costly. It is therefore common to have far more markers to study than samples on which they are measured. Accordingly, this thesis proposes methods suited to such settings, both for studying the dependence among the markers and for studying the presence or absence of a link with a phenomenon of interest.

1.2 Variable selection in the multivariate linear model

To explain the links between markers and phenotypic traits, we model the marker values as a linear combination of the phenotypic trait(s). Suppose we are looking for links between q markers and p phenotypic traits, and that for n samples we have the values of these q markers and p phenotypic traits. The word "value" is used here to cover an abundance, a quantity, or simply a value equal to 0 or 1 indicating whether or not a sample belongs to a group. Let $X_{i,k}$ denote the value of phenotypic trait k for sample i and $Y_{i,j}$ the value of marker j for sample i. If the phenotypic trait is categorical, $X_{i,k}$ equals 1 if sample i belongs to category k and 0 otherwise. We then seek to write $Y_{i,j}$ as

$$Y_{i,j} = \sum_{k=1}^{p} X_{i,k} B_{k,j} + E_{i,j}, \qquad \forall i \in \llbracket 1, n \rrbracket,\ \forall j \in \llbracket 1, q \rrbracket, \tag{1.1}$$

where $B_{k,j}$ is a coefficient linking marker j to phenotypic trait k and $E_{i,j}$ is a random error term. Introducing the matrices Y, X, B and E, where for a matrix A, $A_{i,j}$ denotes the entry in row i and column j of A, model (1.1) can be rewritten as

$$Y = XB + E. \tag{1.2}$$

In this model, Y is an n × q random response matrix, X an n × p design matrix containing the characteristics of the phenomenon of interest, B a p × q matrix containing the coefficients linking the responses to the phenomenon of interest, and E an n × q random error matrix. Throughout this thesis, we assume that the rows of E are independent and identically distributed but that its columns are, in general, not. Thus,

$$\forall i \in \llbracket 1, n \rrbracket, \quad (E_{i,1}, \dots, E_{i,q}) \overset{\text{iid}}{\sim} \mathcal{N}(0, \Sigma), \tag{1.3}$$

where $\mathcal{N}(0, \Sigma)$ denotes the distribution of a Gaussian vector with zero mean and covariance matrix Σ.
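Model (1.2) under assumption (1.3) is straightforward to simulate, which is useful to keep in mind for the numerical experiments mentioned later. The sketch below uses numpy; the sizes, the sparse B and the AR(1)-type covariance are illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 30, 2, 50  # n samples, p phenotypic traits, q markers (illustrative)

# Design matrix X (e.g. 0/1 group membership) and a sparse coefficient matrix B.
X = rng.integers(0, 2, size=(n, p)).astype(float)
B = np.zeros((p, q))
B[0, :5] = 2.0          # only a few markers are linked to the first trait
B[1, 5:8] = -1.5

# Rows of E are i.i.d. N(0, Sigma) with correlated columns; AR(1)-type here.
rho = 0.7
Sigma = rho ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))
E = rng.multivariate_normal(np.zeros(q), Sigma, size=n)

Y = X @ B + E           # the multivariate linear model (1.2)
print(Y.shape)          # (30, 50)
```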
To drastically reduce the number of potential associations between markers and phenotypic traits, we focus on variable selection methods in the general linear model (1.2); see for instance Mardia et al. (1980). Performing variable selection in this model amounts to proposing a sparse estimator $\widehat{B}$ of B. Such an estimator highlights potentially relevant links between the responses and the explanatory variables. Indeed, a positive (resp. negative) coefficient $\widehat{B}_{i,j}$ indicates a potential positive (resp. negative) link between characteristic i of the phenomenon of interest and marker j, whereas a null coefficient indicates a potential absence of link. We therefore seek an estimator of B whose coefficients have the same sign (positive, negative or null) as those of B. To this end, the sign consistency of an estimator $\widehat{C}$ of C is defined by

$$P\big(\mathrm{sign}(\widehat{C}) = \mathrm{sign}(C)\big) \to 1 \text{ as } n \to \infty, \quad \text{where } \mathrm{sign}(x) = \begin{cases} 1 & \text{if } x > 0, \\ -1 & \text{if } x < 0, \\ 0 & \text{if } x = 0. \end{cases}$$

If $\widehat{B}$ is an estimator of B satisfying this property, then with probability tending to one the positive (resp. negative, resp. null) entries of $\widehat{B}$ are indeed positive (resp. negative, resp. null) entries of B. A nonzero entry of $\widehat{B}$ thus indicates a real association between a response variable and an explanatory variable. Such an approach therefore makes it possible to drastically reduce the number of experimental validations to carry out, all the more so as $\widehat{B}$ is sparse.
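Sign consistency compares only the sign patterns of the estimator and of B, not their magnitudes. A minimal check of this criterion (the matrices are illustrative):

```python
import numpy as np

def signs_agree(B_hat, B):
    """True when every entry of B_hat has the same sign (+, -, or 0) as B."""
    return bool(np.all(np.sign(B_hat) == np.sign(B)))

B     = np.array([[2.0, 0.0, -1.5],
                  [0.0, 0.0,  3.0]])
B_hat = np.array([[0.7, 0.0, -0.2],   # magnitudes differ, signs match
                  [0.0, 0.0,  1.1]])
print(signs_agree(B_hat, B))  # True
```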

1.2.1 State of the art

One possible approach to selecting variables in the multivariate linear model (1.2) is to do so independently in the q univariate linear models:

$$Y_{\bullet,r} = X B_{\bullet,r} + E_{\bullet,r}, \qquad \forall r \in \llbracket 1, q \rrbracket, \tag{1.4}$$

where $Y_{\bullet,r}$, $B_{\bullet,r}$ and $E_{\bullet,r}$ are the r-th columns of Y, B and E defined in model (1.2). To simplify notation, model (1.4) will be written as

$$y = Xb + e, \tag{1.5}$$

where $y \in \mathbb{R}^n$, $b \in \mathbb{R}^p$ and $e \in \mathbb{R}^n$. Performing variable selection in this model amounts to proposing a sparse estimator $\widehat{b}$ of b. Variables with a null coefficient are not selected, while those with a nonzero coefficient are. It remains to decide which components of $\widehat{b}$ should be set to 0. The coefficients of b can be estimated by maximum likelihood (see for instance Mardia et al., 1980; Draper & Smith, 1998), but this method does not lead to a sparse solution. Various methods for performing variable selection in model (1.2) are detailed in the following works: Thompson (1978), Zheng & Loh (1995), Chen et al. (2014), Heinze et al. (2018), Desboulets (2018). Some consist in testing the nullity of one or several coefficients, or in comparing two nested models using Student or Fisher tests (see Mardia et al., 1980; Draper & Smith, 1998, for example). Other methods trade off the reduction of the quadratic error against the number of nonzero entries of $\widehat{b}$; the most frequently used are the AIC criterion (Akaike, 1970), Mallows' Cp (Mallows, 1973), the BIC criterion (Schwarz et al., 1978) or, more generally, the GIC criterion (Nishii, 1984). Penalized regression methods, for their part, add a penalty on the parameters in order to select variables. A very widely used approach is the Lasso, introduced by Tibshirani (1996), which proposes the estimator

$$\widehat{b}(\lambda) = \underset{b \in \mathbb{R}^p}{\mathrm{Argmin}} \left\{ \|y - Xb\|_2^2 + \lambda \|b\|_1 \right\}, \tag{1.6}$$

where $\|x\|_2^2 = \sum_{i=1}^n x_i^2$ is the squared $\ell_2$ norm of $x = (x_1, \dots, x_n)$ and $\|b\|_1 = \sum_{i=1}^p |b_i|$ is the $\ell_1$ norm of $b = (b_1, \dots, b_p)$. This penalty leads to a sparse solution. Zhao & Yu (2006) established the sign consistency of the estimator $\widehat{b}(\lambda)$:
Theorem 1.1. Let y be a vector of $\mathbb{R}^n$ satisfying (1.5). Assume also that there exist positive constants $M_1$, $M_2$, $M_3$ and positive numbers $c_1$, $c_2$ with $0 < c_1 < c_2 < 1$ such that:

(i) $\forall j \in \llbracket 1, p \rrbracket$, $\frac{1}{n} (X_{\bullet,j})^\top X_{\bullet,j} \le M_1$, where $A^\top$ is the transpose of the matrix A and $X_{\bullet,j}$ is the j-th column of X;

(ii) $\alpha^\top \frac{1}{n}(X^\top X)_{J,J}\, \alpha \ge M_2$ for all $\alpha \in \mathbb{R}^{|J|}$ such that $\|\alpha\|_2^2 = 1$, where $J = \{i : b_i \neq 0\}$, $(X^\top X)_{J,J}$ is the matrix $X^\top X$ restricted to the rows and columns in J, and $|J|$ is the cardinality of J;

(iii) $|J| = O(n^{c_1})$;

(iv) $n^{\frac{1-c_2}{2}} \min_{j \in J} |b_j| \ge M_3$.

Assume moreover that there exists a constant $c_3$ with $0 \le c_3 < c_2 - c_1$ such that $p = O(e^{n^{c_3}})$, and that the following irrepresentable condition is satisfied: there exists a positive constant vector $\eta$ such that

$$\left| (X^\top X)_{J^c, J} \big((X^\top X)_{J,J}\big)^{-1} \mathrm{sign}(b_J) \right| \le \mathbf{1} - \eta, \tag{IC}$$

where the inequality holds componentwise, $\mathbf{1}$ is a vector of ones of size $(p - |J|)$ and $J^c$ is the complement of J in $\llbracket 1, p \rrbracket$.

Then, for all $\lambda = O(n^{\frac{1+c_4}{2}})$ with $c_3 < c_4 < c_2 - c_1$,

$$P\big(\mathrm{sign}(\widehat{b}(\lambda)) = \mathrm{sign}(b)\big) = 1 - o(e^{-n^{c_3}}) \to 1, \quad \text{as } n \to \infty,$$

where $\widehat{b}(\lambda)$ is defined by (1.6).
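The irrepresentable condition (IC) can be verified numerically for a given design and support. The sketch below uses a balanced one-way ANOVA design (an illustrative choice), for which the Gram matrix is diagonal and (IC) holds trivially:

```python
import numpy as np

# Balanced one-way ANOVA design: n samples split evenly among p groups.
n, p = 30, 6
X = np.kron(np.eye(p), np.ones((n // p, 1)))

J = np.array([0, 1])               # support of b (illustrative)
sign_bJ = np.array([1.0, -1.0])
Jc = np.setdiff1d(np.arange(p), J)

G = X.T @ X                        # Gram matrix; diagonal here
lhs = np.abs(G[np.ix_(Jc, J)] @ np.linalg.inv(G[np.ix_(J, J)]) @ sign_bJ)

# (IC) holds when every component is <= 1 - eta for some eta > 0; with a
# diagonal Gram matrix the left-hand side is identically 0.
print(lhs.max() < 1.0)  # True
```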

Condition (i) is satisfied, for instance, when the columns of the design matrix X are centered and standardized. By forcing the singular values of the design matrix restricted to the relevant variables not to be too small compared to n, condition (ii) requires the matrix $(X^\top X)_{J,J}/n$ to be invertible. Conditions (iii) and (iv) require, respectively, that the nonzero entries of b be neither too numerous nor too small relative to the number n of samples. If these conditions are satisfied by the data, model (1.5) can be applied independently to each column of Y. However, this approach does not take into account the dependence that may exist between the different response variables. Here we focus on methods that look for the nonzero entries of the matrix B defined in model (1.2) while taking this dependence into account. In the Gaussian case, such methods minimize the negative log-likelihood of the model, namely the function

$$\ell(B, \Omega) = \frac{1}{n}\, \mathrm{tr}\big((Y - XB)\,\Omega\,(Y - XB)^\top\big) - \log(|\Omega|), \tag{1.7}$$

where $\mathrm{tr}(A)$ denotes the trace of A, $\Omega = \Sigma^{-1}$ is the precision matrix and $|\Omega|$ its determinant. Mardia et al. (1980) show that the estimator of B minimizing $\ell$ for fixed Ω does not depend on Σ and is therefore the same as the one minimizing the quadratic error for each column of Y independently. Since this method performs no variable selection, a sparsity-inducing constraint on the coefficients of B is added to the function (1.7); the resulting estimators are no longer independent of Σ.
Rothman et al. (2010) propose to estimate both B and Ω sparsely. To this end, they minimize a cost function that adds two penalties to $\ell$: one on the $\ell_1$ norm of the entries of B and one on the $\ell_1$ norm of the off-diagonal entries of Ω. To solve this problem, they propose an iterative method. In a first step, B is estimated with Ω fixed; the problem then becomes a classical Lasso problem, solved using a coordinate descent algorithm. In a second step, Ω is estimated with B fixed, using the graphical Lasso estimator proposed by Banerjee et al. (2008b), obtained with the algorithm of Friedman et al. (2008).
This method was extended by Lee & Liu (2012), who propose two other approaches for estimating B. The first consists in estimating Ω first and then estimating B with Ω fixed. For the estimation of Ω, they start by regarding the matrix X as random, and hence consider the random variables

$$Z_i = \big(Y_{i,\bullet}^\top, X_{i,\bullet}^\top\big)^\top \overset{\text{iid}}{\sim} \mathcal{N}(0, \Sigma_Z), \quad \forall i \in \llbracket 1, n \rrbracket, \qquad \text{where } \Sigma_Z = \begin{pmatrix} \Sigma_{y,y} & \Sigma_{y,x} \\ \Sigma_{x,y} & \Sigma_{x,x} \end{pmatrix}.$$

Noting that the covariance matrix Σ is the covariance matrix of the $Y_{i,\bullet}$ conditionally on the $X_{i,\bullet}$, one obtains

$$\Sigma = \Sigma_{y,y} - \Sigma_{y,x}\, \Sigma_{x,x}^{-1}\, \Sigma_{x,y}.$$

Thus, estimating $\Sigma_Z$ yields Σ; more precisely, estimating $\Sigma_Z^{-1}$ with the graphical Lasso yields Ω. The second method, close to that of Rothman et al. (2010), estimates Ω and B iteratively. Its cost function is not convex in (B, Ω) and it can be unstable when q ≥ n. The authors nevertheless showed through numerical simulations that this method can outperform the one where Ω is estimated first. They then show that, under certain conditions, the sign consistency of these estimators is recovered when q is fixed.
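The conditional-covariance identity $\Sigma = \Sigma_{y,y} - \Sigma_{y,x}\Sigma_{x,x}^{-1}\Sigma_{x,y}$ used above is a Schur complement, and can be checked on a small, hand-chosen joint covariance (the numbers are illustrative):

```python
import numpy as np

# Joint covariance of Z = (Y, X) in block form; values are illustrative.
S_yy = np.array([[2.0, 0.5],
                 [0.5, 1.5]])
S_yx = np.array([[0.3],
                 [0.2]])
S_xx = np.array([[1.0]])

Sigma_Z = np.block([[S_yy,   S_yx],
                    [S_yx.T, S_xx]])

# Covariance of Y given X: the Schur complement of S_xx in Sigma_Z.
Sigma_cond = S_yy - S_yx @ np.linalg.inv(S_xx) @ S_yx.T
print(Sigma_cond)  # symmetric positive definite; (0, 0) entry is 2 - 0.3^2 = 1.91
```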
Zhang et al. (2017) propose a method which, like the first method of Lee & Liu (2012), estimates Ω before estimating B. To estimate Ω, they use the CLIME method (constrained $\ell_1$ minimization for inverse matrix estimation) proposed by Cai et al. (2011), which defines

$$\widehat{\Omega}_1 = \mathrm{Argmin}\, \|\Omega\|_1 \quad \text{subject to} \quad \big\|\Omega S - I_q\big\|_\infty \le \lambda, \tag{1.8}$$

where λ is a tuning parameter and S the empirical covariance matrix. To ensure symmetry, the final estimator $\widehat{\Omega} = (w_{i,j})_{1 \le i,j \le q}$ is then obtained by keeping, for each pair of symmetric entries, whichever of $w_{i,j}$ and $w_{j,i}$ has the smaller absolute value. They then use the Weighted Square-root Lasso algorithm (Belloni et al., 2011) to estimate B with Σ fixed, and establish an oracle-type inequality for their estimator. Finally, they compare by simulation the results obtained by their method with those obtained using the Square-root Lasso and the Lasso. In their various simulation scenarios, the false positive rate of their method is lower than those of the other methods. However, their method depends not on a single tuning parameter but on two parameters to calibrate in practice.
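The symmetrization step following (1.8), keeping for each pair of symmetric entries the one with the smaller absolute value, can be sketched as follows (the matrix W stands for a hypothetical asymmetric CLIME output):

```python
import numpy as np

def symmetrize_clime(W):
    """For each (i, j), keep whichever of W[i, j] and W[j, i] is smaller in
    absolute value, yielding a symmetric precision-matrix estimate."""
    keep_ij = np.abs(W) <= np.abs(W.T)
    return np.where(keep_ij, W, W.T)

W = np.array([[ 1.0, 0.8, -0.1],
              [ 0.2, 1.0,  0.0],
              [-0.3, 0.5,  1.0]])
W_sym = symmetrize_clime(W)
print(W_sym)  # symmetric: e.g. entries (0, 1) and (1, 0) both equal 0.2
```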
Finally, Molstad et al. (2018) propose a method for estimating B under the assumption that the more similarly two responses are explained by the predictors, the more correlated they are. They thus propose to replace Ω in $\ell$ by $[B^\top B + \tau I_q]^{-1}$, with τ ≥ 0, leaving only the matrix B to estimate, and provide an algorithm minimizing $\ell(B)$ under various constraints on B. This estimator has not yet been studied theoretically but has been validated through numerical experiments. However, it does not account for dependences due to factors that are not in X.

1.2.2 Contributions of Chapters 2 and 3

Scientific output

This section summarizes the following publications:

• M. Perrot-Dockès, C. Lévy-Leduc, L. Sansonnet, J. Chiquet. Variable selection in multivariate linear models with high-dimensional covariance matrix estimation. Journal of Multivariate Analysis, 166:78–97, 2018.
• M. Perrot-Dockès, C. Lévy-Leduc, J. Chiquet, L. Sansonnet, M. Brégère, M.-P. Étienne, S. Robin, G. Genta-Jouve. A variable selection approach in the multivariate linear model: An application to LC-MS metabolomics data. Statistical Applications in Genetics and Molecular Biology, 17:5, 2018.

The method described is implemented in the R package:

• M. Perrot-Dockès, C. Lévy-Leduc, J. Chiquet. MultiVarSel. R package version 1.1.2, 2018.

We propose to first estimate the covariance matrix Σ and then use it to estimate B. Our method is thus close to the first proposal of Lee & Liu (2012), with two major differences, however. The first comes from our asymptotic framework: Lee & Liu (2012) study the properties of their estimator when p may tend to infinity with q fixed, whereas we allow q to tend to infinity, potentially as a power of n, with p fixed. Indeed, in our framework the number of responses q can be much larger than the number of samples n. The second difference comes from our estimation of Ω. In Chapter 2, we give conditions that the covariance matrix estimator must satisfy in order to preserve the sign consistency of our estimator. In Chapter 3, we propose a new method for estimating the covariance matrix Σ when it is assumed to be a symmetric Toeplitz matrix, which amounts to assuming that each row of the matrix E is a realization of a second-order stationary process (see Brockwell & Davis, 1990). These two chapters are summarized below.


Contributions of Chapter 2

In order to remove any dependence within the columns of E, our method first consists
in "whitening", that is, "decorrelating", the data by applying the following
transformation:

Y Σ^{-1/2} = XB Σ^{-1/2} + E Σ^{-1/2}. (1.9)

Using the vectorization operator (vec), this model can be rewritten as follows:

𝒴 = 𝒳ℬ + ℰ, (1.10)

with 𝒴 = vec(Y Σ^{-1/2}), 𝒳 = (Σ^{-1/2})^⊤ ⊗ X, ℬ = vec(B) and ℰ = vec(E Σ^{-1/2}),
where ⊗ denotes the Kronecker product. Since ℬ = vec(B), estimating ℬ in Model (1.10)
amounts to estimating B in Model (1.2). Using the Lasso criterion to estimate ℬ yields
the estimator:

ℬ̂(λ) = Argmin_ℬ { ‖𝒴 − 𝒳ℬ‖₂² + λ‖ℬ‖₁ }. (1.11)
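The vectorization identity underlying (1.10) can be checked numerically. Below is a minimal NumPy sketch (illustrative only; the thesis's actual implementation is the R package MultiVarSel), where vec stacks the columns of a matrix, as in the definition above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 6, 3, 4
X = rng.standard_normal((n, p))
B = rng.standard_normal((p, q))
E = rng.standard_normal((n, q))
Y = X @ B + E

# Build a symmetric positive definite covariance Sigma and its inverse square root.
S = rng.standard_normal((q, q))
Sigma = S @ S.T + q * np.eye(q)
w, V = np.linalg.eigh(Sigma)
Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T  # symmetric Sigma^{-1/2}

vec = lambda M: M.flatten("F")  # column-stacking vectorization

# vec(Y Sigma^{-1/2}) = ((Sigma^{-1/2})^T kron X) vec(B) + vec(E Sigma^{-1/2})
lhs = vec(Y @ Sigma_inv_half)
rhs = np.kron(Sigma_inv_half.T, X) @ vec(B) + vec(E @ Sigma_inv_half)
print(np.allclose(lhs, rhs))  # True
```

This confirms that the whitened multivariate model is exactly a univariate linear model of size nq, to which the standard Lasso can be applied.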

Following Zhao & Yu (2006), we established the sign-consistency of the estimator ℬ̂(λ)
(see Theorem 2.1 of Chapter 2).

In practice Σ is unknown. We therefore propose to estimate it in a preliminary step and
denote this estimator by Σ̂. Replacing Σ^{-1/2} with Σ̂^{-1/2} in Model (1.9) yields:

Y Σ̂^{-1/2} = XB Σ̂^{-1/2} + E Σ̂^{-1/2}. (1.12)

Once again we rewrite Model (1.12) as

𝒴̃ = 𝒳̃ℬ + ℰ̃, (1.13)

where 𝒴̃ = vec(Y Σ̂^{-1/2}), 𝒳̃ = (Σ̂^{-1/2})^⊤ ⊗ X, ℬ = vec(B) and ℰ̃ = vec(E Σ̂^{-1/2}).
Using the Lasso criterion to estimate ℬ yields the estimator:

ℬ̃(λ) = Argmin_ℬ { ‖𝒴̃ − 𝒳̃ℬ‖₂² + λ‖ℬ‖₁ }. (1.14)

The theorem below (Theorem 2.5 of Chapter 2) establishes the sign-consistency of ℬ̃(λ)
under some conditions on X, ℬ, Σ, Ω and Σ̂.

Theorem 1.2. Let Y satisfy Model (1.2) under Assumption (1.3). Assume that the
irrepresentable condition (IC) holds. Assume moreover that there exist positive
constants M₃, M₄, M₅, M₆, M₇ and c₁, c₂ with 0 < c₁ + c₂ < 1/2, such that
(i) ‖(𝒳^⊤𝒳)/n‖_∞ ≤ M₄,
(ii) λ_min((𝒳^⊤𝒳)/n) ≥ M₅,
(iii) |J| = O(q^{c₁}), where J = {i such that ℬᵢ ≠ 0} and |J| is the cardinality of J,
(iv) q^{c₂} min_{j∈J} |ℬⱼ| ≥ M₃,
(v) λ_max(Σ^{-1}) ≤ M₆,
(vi) λ_min(Σ^{-1}) ≥ M₇.
Assume also that, as n tends to infinity,
(vii) ‖Σ^{-1} − Σ̂^{-1}‖_∞ = O_P((nq)^{-1/2}),
(viii) ρ(Σ − Σ̂) = O_P((nq)^{-1/2}).
Then, for any λ such that

q = qₙ = o(n^{1/(2(c₁+c₂))}), λ/√n → ∞ and λ/n = o(q^{−(c₁+c₂)}), as n → ∞,

we have

P( sign(ℬ̃(λ)) = sign(ℬ) ) → 1, as n → ∞,

where ℬ̃(λ) is defined in (1.14). Here, λ_max(A), λ_min(A), ρ(A) and ‖A‖_∞ denote
respectively the largest eigenvalue, the smallest eigenvalue, the spectral radius and
the infinity norm of A.

Conditions (i) to (iv) of Theorem 1.2 are similar to Conditions (i) to (iv) of
Theorem 1.1. Conditions (v) and (vi) state that the eigenvalues of Σ and Ω are bounded
from below by a positive constant. Finally, Conditions (vii) and (viii) state that
neither the infinity norm of the estimation error of the precision matrix nor the
spectral radius of the estimation error of the covariance matrix can be too large.

Contributions of Chapter 3

For the estimation of the covariance matrix and the computation of our estimator of B
when the matrix Σ is assumed to be symmetric Toeplitz, we propose a four-step method.

First step: estimation of E
E is estimated by

Ê = Y − X(X^⊤X)^{-1}X^⊤Y = (Id − P_X)Y = P_{X⊥}Y, (1.15)

where P_X (respectively P_{X⊥}) is the orthogonal projection matrix onto Vect(X), the
subspace spanned by the columns of X (respectively onto the orthogonal complement of
Vect(X)). Ê is thus the orthogonal projection of Y onto the orthogonal complement of
Vect(X).

Second step: estimation of Σ
We propose several estimators of Σ, corresponding to different stationary process
models for the rows of E.

The simplest model assumes that each row of E is an autoregressive process of order 1
(AR(1)). This means that for all i in ⟦1, n⟧ and all t in ⟦2, q⟧, E_{i,t} satisfies

E_{i,t} − φ₁ E_{i,t−1} = W_{i,t}, with W_{i,t} ∼ WN(0, σ²),

where |φ₁| < 1 and the W_{i,t} are white noises with variance σ², denoted WN(0, σ²).
When σ² = 1, the matrix Ω^{1/2} has the following explicit form:

          ⎛ √(1−φ₁²)  −φ₁    0    ⋯    0  ⎞
          ⎜    0        1   −φ₁   ⋯    0  ⎟
Ω^{1/2} = ⎜    0        0    ⋱    ⋱    ⋮  ⎟ .    (1.16)
          ⎜    ⋮        ⋮    ⋱    ⋱   −φ₁ ⎟
          ⎝    0        0    ⋯    0    1  ⎠

When σ² is not equal to 1, the covariance matrix of each row of EΩ^{1/2} is equal to
σ²Id and its correlation matrix is the identity matrix. An estimator Σ̂^{-1/2} of
Σ^{-1/2} is then obtained by replacing φ₁ with φ̂₁ in (1.16), where φ̂₁ is obtained from
the Yule-Walker equations (described in Walker, 1964) and from Ê:

φ̂₁ = ( ∑_{i=1}^{n} ∑_{ℓ=2}^{q} Ê_{i,ℓ} Ê_{i,ℓ−1} ) / ( ∑_{i=1}^{n} ∑_{ℓ=1}^{q−1} Ê_{i,ℓ}² ).

We showed in Chapter 2 that the matrix Σ̂^{-1/2} thus obtained satisfies the assumptions
of Theorem 1.2.
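As an illustration of (1.16) and of the Yule-Walker estimator above, the following NumPy sketch (illustrative; the actual implementation is in the R package MultiVarSel) checks that Ω^{1/2} whitens the AR(1) covariance and recovers φ₁ from simulated residuals:

```python
import numpy as np

phi, q = 0.5, 6

# AR(1) covariance with unit innovation variance: Sigma[s,t] = phi^|s-t| / (1 - phi^2)
idx = np.arange(q)
Sigma = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi ** 2)

# Upper bidiagonal Omega^{1/2} from (1.16)
M = np.eye(q)
M[0, 0] = np.sqrt(1 - phi ** 2)
M[idx[:-1], idx[:-1] + 1] = -phi

# M M^T equals the precision matrix Omega = Sigma^{-1},
# hence M^T Sigma M = Id: the rows of E M are decorrelated.
print(np.allclose(M @ M.T, np.linalg.inv(Sigma)))  # True
print(np.allclose(M.T @ Sigma @ M, np.eye(q)))     # True

# Yule-Walker estimate of phi_1 from simulated residual rows E_hat (n x q)
rng = np.random.default_rng(1)
n, q_long = 200, 400
E_hat = np.empty((n, q_long))
E_hat[:, 0] = rng.standard_normal(n) / np.sqrt(1 - phi ** 2)
for t in range(1, q_long):
    E_hat[:, t] = phi * E_hat[:, t - 1] + rng.standard_normal(n)
phi_hat = (E_hat[:, 1:] * E_hat[:, :-1]).sum() / (E_hat[:, :-1] ** 2).sum()
print(phi_hat)  # close to 0.5
```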

Slightly more general models take each row of E to be an autoregressive moving average
process ARMA(p, q). In this case, for all i in ⟦1, n⟧ and all t, E_{i,t} is a solution
of:

E_{i,t} − φ₁E_{i,t−1} − ⋯ − φ_pE_{i,t−p} = W_{i,t} + θ₁W_{i,t−1} + ⋯ + θ_qW_{i,t−q},

with W_{i,t} ∼ WN(0, σ²), where the φⱼ and θⱼ are real parameters.

When modeling by an ARMA process is not appropriate, each row of E can be modeled as a
general weakly stationary process and Σ estimated as a symmetric Toeplitz matrix, that
is, as

     ⎛  γ̂(0)    γ̂(1)   ⋯  γ̂(q−1) ⎞
Σ̂ =  ⎜  γ̂(1)    γ̂(0)   ⋯  γ̂(q−2) ⎟ ,    (1.17)
     ⎜    ⋮        ⋮     ⋱     ⋮   ⎟
     ⎝ γ̂(q−1)  γ̂(q−2)  ⋯   γ̂(0)  ⎠

where

γ̂(h) = (1/n) ∑_{i=1}^{n} γ̂ᵢ(h),

and γ̂ᵢ(h) is the estimator of the autocovariance γᵢ(h) = 𝔼(E_{i,t}E_{i,t+h}), for all t
and all h in ℤ.

To select the most suitable estimator, we propose in Chapter 3 a statistical test which
is an adaptation of the Portmanteau test, itself based on Bartlett's theorem
(Brockwell & Davis, 1991). We denote by Σ̂ the final estimator of Σ thus obtained.
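A sketch of the Toeplitz estimator (1.17) in NumPy (illustrative; biased sample autocovariances, averaged over the rows of Ê, are assumed here):

```python
import numpy as np

def toeplitz_cov_estimate(E_hat):
    """Estimate Sigma as in (1.17): average over rows the sample
    autocovariances gamma_i(h), then fill a symmetric Toeplitz matrix."""
    n, q = E_hat.shape
    E_c = E_hat - E_hat.mean(axis=1, keepdims=True)  # center each row
    gamma = np.empty(q)
    for h in range(q):
        # biased sample autocovariance at lag h, averaged over the n rows
        gamma[h] = (E_c[:, h:] * E_c[:, : q - h]).sum(axis=1).mean() / q
    idx = np.arange(q)
    return gamma[np.abs(idx[:, None] - idx[None, :])]  # Toeplitz fill

rng = np.random.default_rng(2)
E_hat = rng.standard_normal((30, 50))
Sigma_hat = toeplitz_cov_estimate(E_hat)
print(np.allclose(Sigma_hat, Sigma_hat.T))  # True: symmetric Toeplitz
```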

Third step: whitening the data
In order to decorrelate the columns of E as well as possible, Equation (1.2) is
multiplied on the right by the estimator Σ̂^{-1/2} (see Equation (1.12)).

Fourth step: variable selection
The Lasso criterion, described in (1.14), is applied to the transformed data. However,
this estimator depends on an unknown parameter λ which tunes the sparsity level of
ℬ̃(λ). To select the non-zero coefficients of ℬ, we propose an adaptation of stability
selection (introduced by Meinshausen & Bühlmann, 2010) based on cross-validation.

More precisely, we draw a random subsample of size nq/2, to which we apply the first
three steps described above in order to obtain "whitened" data. On these data we
compute λ_CV, the value minimizing the cross-validation error in Model (1.14). We thus
obtain ℬ̃(λ_CV), a sparse estimator of ℬ, and record the positions of its non-zero
entries. These subsampling steps are repeated N times. At the end, we have access to
the number of times Nᵢ that each component ℬ̃ᵢ of ℬ̃ has been estimated as non-zero. We
keep the components i whose frequency Nᵢ/N exceeds a given threshold. The influence of
the choices of N and of the threshold is studied in Section 3.3.3 of Chapter 3.
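The subsampling loop above can be sketched as follows, on a single response vector with a fixed λ and a basic iterative soft-thresholding (ISTA) Lasso solver. This is an illustrative simplification: the thesis's procedure works on the whitened vectorized model (1.14) and chooses λ by cross-validation on each subsample.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||y - X b||_2^2 + lam * ||b||_1 by iterative soft-thresholding."""
    L = 2 * np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the gradient
    t = 1.0 / L
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = b - t * 2 * X.T @ (X @ b - y)                    # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - t * lam, 0)  # soft threshold
    return b

def selection_frequencies(X, y, lam, n_rep=20, seed=0):
    """Stability-selection-style frequencies: refit the Lasso on random
    half-subsamples and count how often each coefficient is non-zero."""
    rng = np.random.default_rng(seed)
    n = len(y)
    counts = np.zeros(X.shape[1])
    for _ in range(n_rep):
        sub = rng.choice(n, size=n // 2, replace=False)
        counts += lasso_ista(X[sub], y[sub], lam) != 0
    return counts / n_rep

rng = np.random.default_rng(3)
n, d = 100, 10
X = rng.standard_normal((n, d))
b_true = np.zeros(d)
b_true[:2] = 3.0
y = X @ b_true + 0.5 * rng.standard_normal(n)
freq = selection_frequencies(X, y, lam=20.0)
print(freq[:2])  # high frequencies for the two truly active coefficients
```

Components whose frequency exceeds the chosen threshold are kept as selected variables.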

Our method was also compared numerically with existing methods and appears to be the
best suited in the scenarios studied. The details of these comparisons are given in
Section 3.3 of Chapter 3. Finally, the method was applied to an untargeted metabolomics
study in Section 3.4 of Chapter 3.


1.3 Block-structured correlation matrix estimation

1.3.1 State of the art

Section 1.2.2 presented a method to model "-omic" markers by explanatory variables
(phenotypic traits, for example) while taking into account the covariance existing
between these markers. The method proposed for estimating the covariance matrix
assumes that it is a symmetric Toeplitz matrix. However, "-omic" markers (genes,
proteins, metabolites, ...) are often co-expressed / co-accumulated in groups (see
Section 1.1 and Stuart et al., 2003), which translates into a block structure of the
covariance matrix.

This block structure may stem from underlying factors such as membership in the same
metabolic pathway or the stimulation of the markers by the same molecules. More
generally, links between variables may be due to a small number of factors, observable
or latent. To study such links, Fan et al. (2016) present factor model methods that
describe each row i of E, a matrix of size n × q, as:

Eᵢ = fᵢ B_f + Uᵢ, (1.18)

where B_f is a coefficient matrix of size k × q and, for all i, fᵢ is a random vector
of (observed or latent) factors of size k with k ≤ min(q, n) (the fᵢ are assumed
independent and identically distributed), and the Uᵢ are q-dimensional random error
vectors, independent and identically distributed, and assumed independent of B_f.
Under this independence assumption,

Σ = B_f^⊤ Cov(f) B_f + Σ_u, (1.19)

where Σ is the covariance matrix of the rows of E, Σ_u is the covariance matrix of the
Uᵢ and Cov(f) is the covariance matrix of the fᵢ. Since the number of factors k is
smaller than n, Cov(f) can be estimated by the empirical covariance matrix of f. It
then only remains to estimate Σ_u, for example in a sparse, or even diagonal, form.

When the factors f are Gaussian random vectors with zero mean and identity covariance,
Blum et al. (2016b) propose a method that estimates B_f in a sparse way, which makes
the matrix Σ sparse as well. They estimate B_f and Σ with an EM algorithm, adding to
the maximization step a penalty on the ℓ₁-norm of the entries of B_f to ensure
sparsity. This step can also be used to estimate the number k of factors: they run
their algorithm with some number k_max of factors, and then keep only those having at
least one non-zero component.

Hosseini & Lee (2016) also propose a method combining sparsity and factor models,
allowing the estimation of the precision matrix, rather than the covariance matrix, as
a sparse block matrix with possibly overlapping blocks. Such a matrix is very useful,
for example, for variable selection in the general linear model, as seen in
Section 1.2.2.

Finally, Perthame et al. (2016) combine variable selection and factor models to take
into account the dependence within the response variables when X is an ANOVA design
matrix. To this end they propose the model:

Y = XB + ZB_Z^⊤ + E, (1.20)

where B and B_Z are coefficient matrices of sizes p × q and k × q, Z is a matrix of
latent variables of size n × k with k ≤ q, and where

∀i ∈ ⟦1, n⟧, (E_{i,1}, …, E_{i,q}) iid ∼ N(0, Σ_u), (1.21)

with Σ_u a diagonal matrix. Each row of Z is assumed to follow a normal distribution
with zero mean and identity covariance matrix. They propose an iterative method that
alternates between the estimation of B, B_Z and Σ_u and the inference of Z. Through
the transformation Y − ZB_Z^⊤, they remove the dependence between the response
variables and then apply variable selection methods such as those described in
Section 1.2.1.

1.3.2 Contributions of Chapter 4

Scientific output

This section summarizes the publication:

• M. Perrot-Dockès, C. Lévy-Leduc, L. Rajjou. Estimation of large
block structured covariance matrices: Application to "multi-omic"
approaches to study seed quality. Submitted.

The method described here is implemented in the R package:
• M. Perrot-Dockès, C. Lévy-Leduc. BlockCov. R package version
0.1.1, 2019.

We focus here on the estimation of a sparse block-structured correlation matrix, using
a factor model together with sparse matrix estimation methods.


We assume that Σ can be written as follows:

Σ = ZZ^⊤ + D, (1.22)

where Z is a matrix of size q × k with k ≪ q and D is a diagonal matrix such that all
diagonal entries of Σ are equal to 1. Examples, in the case of diagonal blocks and of
extra-diagonal blocks, are given in Figure 1.2.

[Figure 1.2: two panels, "Diagonal" and "Extra-diagonal", each displaying a matrix Z of
size 50 × 5 and the resulting matrix Σ of size 50 × 50.]

Figure 1.2 – Examples of matrices Σ generated from different matrices Z.

Our method consists of three steps:

First step: low-rank approximation of a matrix containing the extra-diagonal entries
of Σ.
From Σ, we define the matrix Γ of size (q − 1) × (q − 1) (see Figure 1.3) by:

Γ_{i,j} = Σ_{i,j+1}, if 1 ≤ i ≤ j ≤ q − 1,
Γ_{i,j} = Γ_{j,i}, if 1 ≤ j < i ≤ q − 1. (1.23)

Unlike Σ, which has full rank, Γ has rank k ≪ q. Moreover, Γ contains all the
extra-diagonal entries of Σ. Since all the diagonal entries of Σ are equal to 1,
estimating Γ is enough to estimate Σ. The mapping from Σ to Γ is illustrated in
Figure 1.3.

[Figure 1.3: the matrices Σ ∈ M_{50×50} (left) and Γ ∈ M_{49×49} (right).]

Figure 1.3 – From Σ to Γ.

In practice Σ is unknown, so we define Γ̃, the estimator of Γ obtained by replacing Σ
with R, the empirical correlation matrix, in Equation (1.23). We approximate Γ̃ by
Γ̃^{(k)}, its rank-k approximation obtained from its singular value decomposition. The
rank k is most of the time unknown. Two methods for estimating k are proposed and
compared in Chapter 4: the first uses the Cattell criterion described in Cattell
(1966) and the second uses the permutation method PA proposed by Horn (1965) and
studied theoretically by Dobriban (2018). These two methods are compared in various
scenarios. The resulting estimator is denoted r.
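The key rank property of Γ stated above can be checked on a toy example. A minimal NumPy sketch (illustrative, not the BlockCov implementation) builds Σ = ZZ^⊤ + D for a block-structured Z, forms Γ as in (1.23), and verifies that Γ has rank k while Σ has full rank:

```python
import numpy as np

q, k = 10, 2
# Block-structured Z: two groups of five variables, constant loadings
Z = np.zeros((q, k))
Z[:5, 0] = 0.8
Z[5:, 1] = 0.6
D = np.diag(1.0 - np.sum(Z ** 2, axis=1))  # forces a unit diagonal for Sigma
Sigma = Z @ Z.T + D

# Gamma as in (1.23): Gamma[i, j] = Sigma[i, j+1] for i <= j, symmetrized
Gamma = np.empty((q - 1, q - 1))
for i in range(q - 1):
    for j in range(q - 1):
        Gamma[i, j] = Sigma[i, j + 1] if i <= j else Gamma[j, i]

print(np.linalg.matrix_rank(Sigma))  # 10: full rank
print(np.linalg.matrix_rank(Gamma))  # 2: the number k of factors
```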
Second step: detecting the positions of the non-zero entries of Γ
In order to obtain a sparse matrix, a Lasso criterion is applied to the entries of
Γ̃^{(r)}; the coefficients kept as non-zero are then re-estimated by least squares. The
matrix obtained by applying the Lasso criterion with a parameter λ is hereafter
denoted Γ̃^{(r)}(λ). Two methods for estimating the threshold are proposed and compared
in Chapter 4. The first is the cross-validation-based method proposed by Bickel &
Levina (2008). The second looks for a break in the slope linking the error
‖Γ̃ − Γ̃^{(r)}(λ)‖_F to the number of non-zero entries of Γ̃^{(r)}(λ), where ‖M‖_F
denotes the Frobenius norm of the matrix M.
The resulting estimator is denoted Γ̂^{(r)}. A first estimator Σ̃ of Σ is then obtained
as

Σ̃_{i,j} = Γ̂^{(r)}_{i,j−1}, if 1 ≤ i < j ≤ q,
Σ̃_{i,j} = 1, if 1 ≤ i = j ≤ q,
Σ̃_{i,j} = Σ̃_{j,i}, if 1 ≤ j < i ≤ q.

Third step: ensuring the positive definiteness of the estimator Σ̂
Σ̂ is obtained by applying to Σ̃ the method proposed by Higham (2002). This method
computes Σ̂ as the positive definite matrix with unit diagonal that is closest to Σ̃.
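A bare-bones version of this projection step can be sketched by alternating projections onto the positive semi-definite cone and the unit-diagonal set. This is an illustrative simplification: Higham's 2002 algorithm adds a Dykstra correction for exactness; here a final symmetric rescaling restores the unit diagonal while preserving positive semi-definiteness.

```python
import numpy as np

def nearest_correlation(A, n_iter=100):
    """Crude alternating projections onto {PSD matrices} and {unit diagonal}.
    (Higham's algorithm refines this with a Dykstra correction.)"""
    Y = (A + A.T) / 2
    for _ in range(n_iter):
        np.fill_diagonal(Y, 1.0)                 # project onto unit diagonal
        w, V = np.linalg.eigh(Y)
        Y = V @ np.diag(np.maximum(w, 0)) @ V.T  # project onto PSD cone
    d = np.sqrt(np.diag(Y))
    return Y / np.outer(d, d)  # exact unit diagonal, still PSD

# A symmetric unit-diagonal matrix that is not positive semi-definite
A = np.array([[1.0, 0.9, -0.9],
              [0.9, 1.0, 0.9],
              [-0.9, 0.9, 1.0]])
C = nearest_correlation(A)
print(np.allclose(np.diag(C), 1.0))  # True: unit diagonal
```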

We then propose a method to obtain Ω̂^{-1/2}, an estimator of Ω^{-1/2}, from Σ̂.

This strategy was investigated through numerical experiments. Various simulation
scenarios, varying the shape of Σ, the number of variables and the number of samples,
are studied in Section 4.3.3 of Chapter 4. In these scenarios the estimator Σ̂ obtained
by our method, implemented in the BlockCov package, outperforms, according to our
criteria, the estimators obtained by various clustering methods (k-means, spectral
clustering, hierarchical clustering) as well as those proposed by Blum et al. (2016b)
and Rothman (2012). Similarly, across the scenarios, the performance of our estimator
Ω̂^{-1/2} was compared with that of the estimators obtained from the clustering methods
and with the estimator proposed by Hosseini & Lee (2016); in most of the scenarios
considered, our estimator performs best. The estimator Ω̂^{-1/2} can also be plugged
into the method described in Section 1.2.2 to perform variable selection. ROC curves
and precision-recall curves, comparing the variables selected using our estimator with
those selected by the method described in Perthame et al. (2016), are given at the end
of Chapter 4.

This method was then applied to a metabolomics and proteomics study investigating the
impact of the production temperature on the germination capacity of Arabidopsis
thaliana seeds.
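One standard way to obtain such an inverse square root from a positive definite estimate Σ̂ is through its eigendecomposition. An illustrative sketch (assuming the eigenvalues are bounded away from zero):

```python
import numpy as np

def inv_sqrt(Sigma_hat, eps=1e-10):
    """Symmetric inverse square root of a positive definite matrix,
    computed from its eigendecomposition."""
    w, V = np.linalg.eigh(Sigma_hat)
    w = np.maximum(w, eps)  # guard against numerical non-positivity
    return V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(4)
q = 8
S = rng.standard_normal((q, q))
Sigma_hat = S @ S.T + q * np.eye(q)
W = inv_sqrt(Sigma_hat)
# W W^T equals the precision matrix, so right-multiplying the data by W whitens it
print(np.allclose(W @ W.T, np.linalg.inv(Sigma_hat)))  # True
```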

1.4 Another application: study of the dialogue between dendritic cells and Th
lymphocytes

These methods were then applied to an immunological dataset in order to better
understand the dialogue between dendritic cells and Th lymphocytes, two cell types
that are essential to our immune system.

1.4.1 Introduction to immunology

Dendritic cells (DC) are sentinel cells, present at all the entry points used by
infectious agents (antigens). When one of these enters our organism, it is ingested by
a DC, which then emits biochemical signals to stimulate the whole immune system, in
particular the T-helper (Th) lymphocytes, in order to fight the infectious agent.
Before receiving these biochemical signals, the Th lymphocyte is said to be naive;
upon receiving them, it differentiates and sends back signals specific to the type of
antigen to be fought. Based on these signals, Th lymphocytes have been categorized
into different profiles. The first two profiles characterized in the literature are
the Th1 and Th2 profiles (Mosmann et al., 1986; Mosmann & Coffman, 1989). In the
presence of an intracellular pathogen, the DC secretes interleukin 12 (IL12); once
this signal is received, the naive T lymphocyte differentiates into a Th1 lymphocyte
and secretes, in particular, interferon gamma (IFNg) to fight the pathogen. Likewise,
in the presence of an extracellular parasite, the DCs secrete signals that lead the
naive T lymphocyte to differentiate into a Th2 lymphocyte, which secretes interleukins
4, 5 and 13 (IL4, IL5, IL13). Many studies have brought to light new profiles, such as
the Th17 profile, induced by the presence of IL6, TNFa, IL23 and TGFb, whose cells
secrete IL17A and IL17F in response to the presence of external bacteria and fungi
(see Park et al., 2005). Figure 1.4, a simplified version of Figure 1 of Leung et al.
(2010), shows several of the Th profiles described.

[Figure 1.4: the main Th profiles. IL12p70 induces the Th1 profile (secreting IFNg and
IL2); IL4 induces the Th2 profile (secreting IL4, IL5 and IL13); TGFb and IL6 induce
the Th17 profile (secreting IL17F and IL17A).]

Figure 1.4 – The different Th profiles.

In practice, in the presence of an antigen, DCs secrete many signals, which are
received by Th lymphocytes that in turn secrete many signals; the response is then not
characterized by a single profile. To better understand the immune response, it is
important to identify which signals emitted by the DCs induce a specific Th lymphocyte
signal. Volpe et al. (2009) studied 5 dendritic cell signals to explain T lymphocyte
differentiation and highlighted the synergy of 4 of these parameters in the generation
of Th17. However, a large number of parameters (more than 40) are known to induce the
Th response, and they had never been studied all together until now.


1.4.2 Contributions of Chapter 5

Scientific output

This section summarizes the following publication:

• M. Grandclaudon*, M. Perrot-Dockès*, C. Trichot, O. Mostafa-Abouzid,
W. Abou-Jaoudé, F. Berger, P. Hupé, D. Thieffry, L. Sansonnet,
J. Chiquet, C. Lévy-Leduc, V. Soumelis. A quantitative multivariate
model of human dendritic cell-T helper cell communication. Accepted in
the journal Cell; reproduced in the appendix of this thesis.
*: these authors contributed equally to this publication.

In this part we study data measuring the response of Th lymphocytes to DC signals
under various perturbation conditions, designed to reproduce in vitro the in situ and
in vivo environment in which dendritic cells and Th lymphocytes are immersed. For 428
DC group / Th lymphocyte group pairs, this dataset contains the values of 36 signals
secreted by the DCs and of 18 signals coming from the Th lymphocytes.

From these experiments we obtain a matrix X of size 428 × 36 (resp. a matrix Y of
size 428 × 18) containing the values of the 36 DC signals (resp. the 18 Th signals)
for the 428 samples. To explain the matrix Y as a function of the matrix X, we
applied the methodology described in Chapter 2 and in Section 1.2.2. By looking at
the non-zero entries of B̂, potential associations between DC signals and Th signals
can then be highlighted. Some are known from the literature; others are not. For
example, the model suggests an association between IL12p70 and IL17F, whereas IL12p70
is known to induce Th1 profiles while IL17F is characteristic of Th17. Experiments
were carried out, and it turned out that IL12p70 can indeed influence IL17F in certain
contexts. This study led to an article, available in the appendix; a detailed summary
is given in Chapter 5 of this thesis.


Chapter 2

Variable selection in multivariate linear models with high-dimensional covariance
matrix estimation

Scientific production

The content of this chapter is contained in the article: M. Perrot-Dockès,
C. Lévy-Leduc, L. Sansonnet, J. Chiquet, "Variable selection in multivariate
linear models with high-dimensional covariance matrix estimation", Journal of
Multivariate Analysis, 166:78–97, 2018. The method which is presented is
implemented in the MultiVarSel R package available from the CRAN.

Abstract
In this paper, we propose a novel variable selection approach in the framework of
multivariate linear models taking into account the dependence that may exist between
the responses. It consists in estimating beforehand the covariance matrix Σ of the
responses and to plug this estimator in a Lasso criterion, in order to obtain a sparse
estimator of the coefficient matrix. The properties of our approach are investigated
both from a theoretical and a numerical point of view. More precisely, we give general
conditions that the estimators of the covariance matrix and its inverse have to satisfy
in order to recover the positions of the null and non null entries of the coefficient
matrix when the size of Σ is not fixed and can tend to infinity. We prove that these
conditions are satisfied in the particular case of some Toeplitz matrices. Our approach is
implemented in the R package MultiVarSel available from the Comprehensive R Archive
Network (CRAN) and is very attractive since it benefits from a low computational load.
We also assess the performance of our methodology using synthetic data and compare
it with alternative approaches. Our numerical experiments show that including the
estimation of the covariance matrix in the Lasso criterion dramatically improves the
variable selection performance in many cases.

2.1 Introduction
The multivariate linear model consists in generalizing the classical linear model,
in which a single response is explained by p variables, to the case where the number
q of responses is larger than 1. Such a general modeling can be used in a wide variety
of applications ranging from econometrics (Lütkepohl, 2005) to bioinformatics
(Meng et al., 2014). In the latter field, for instance, multivariate models have been
used to gain insight into complex biological mechanisms like metabolism or gene
regulation. This has been made possible thanks to recently developed sequencing
technologies. For further details, we refer the reader to Mehmood et al. (2012).
However, the downside of such a technological expansion is to include irrelevant
variables in the statistical models. To circumvent this, devising efficient variable
selection approaches in the multivariate setting has become a growing concern.
A first naive approach to deal with the variable selection issue in the multivariate
setting consists in applying classical univariate variable selection strategies to each
response separately. Some well-known variable selection methods include the least
absolute shrinkage and selection operator (LASSO) proposed by Tibshirani (1996) and
the smoothly clipped absolute deviation (SCAD) approach devised by Fan & Li (2001).
However, such a strategy does not take into account
the dependence that may exist between the different responses.
In this paper, we shall consider the following multivariate linear model:

Y = XB + E, (2.1)

where Y = (Yi,j )1≤i≤n,1≤j≤q denotes the n × q random response matrix, X de-


notes the n × p design matrix, B denotes a p × q coefficient matrix and E =
(Ei,j )1≤i≤n,1≤j≤q denotes the n × q random error matrix, where n is the sample size.
In order to model the potential dependence that may exist between the columns of
E, we shall assume that for each i in {1, . . . , n},

(Ei,1 , . . . , Ei,q ) ∼ N (0, Σ), (2.2)

where Σ denotes the covariance matrix of the ith row of the error matrix E. We
shall moreover assume that the different rows of E are independent. With such
assumptions, there is some dependence between the columns of E but not between
the rows. Our goal is here to design a variable selection approach which is able to
identify the positions of the null and non null entries in the sparse matrix B by
taking into account the dependence between the columns of E.
This issue has recently been considered by Lee & Liu (2012), who extended the approach
of Rothman et al. (2010). More precisely, Lee & Liu (2012) proposed three approaches
for dealing with this issue based on penalized maximum likelihood with a weighted ℓ1
regularization. In their first approach B is estimated by using a plug-in estimator of
Σ^{-1}, in the second one, Σ^{-1} is estimated by using a plug-in estimator of B and
in the third one, Σ^{-1} and B are estimated simultaneously. Lee & Liu (2012) also
investigate the asymptotic properties of their methods when the sample size n tends to
infinity and the number of rows and columns q of Σ is fixed.

In this paper, we propose to estimate Σ beforehand and to plug this estimator into a
Lasso criterion, in order to obtain a sparse estimator of B. Hence, our methodology
is close to the first approach of Lee & Liu (2012). However, there are
two main differences.
The first one is the asymptotic framework in which our theoretical results are
established: q is assumed to depend on n and to tend to infinity at a rate which can
be larger than n as n tends to infinity. Moreover, p is assumed to be fixed. In this
framework, we give general conditions that the estimators of Σ and Σ−1 have to
satisfy in order to be able to recover the support of B that is to find the positions of
the null and non null entries of the matrix B. Such a framework in which q is much
larger than n and p is fixed may for instance occur in metabolomics which aims to
provide a global snapshot of the metabolism. In a typical metabolomic experiment,
we have access to the responses of q metabolites (features) for n samples belonging
to different groups. This information can be summarized in a n × q data matrix
where q ≈ 5000 and n ≈ 30. The goal is then to identify the most important features
distinguishing the different groups. Hence, this problem can be modeled by (2.1)
where X is the design matrix of a one-way ANOVA where p corresponds to the
number of groups.
The second main difference between Lee & Liu (2012) and our approach is the
strategy that we use for estimating Σ. In Lee & Liu (2012), Σ−1 is estimated by
using an adaptation of the Graphical Lasso (GLASSO) proposed by Friedman et al.
(2008). This technique has also been considered in Yuan & Lin (2007); Banerjee et al.
(2008a); Rothman et al. (2008). Here, we propose to estimate Σ beforehand from
the empirical covariance matrix of the residuals assuming that Σ has a particular
structure, Toeplitz for instance. We prove its efficiency in the particular case of an
AR(1) process in Section 2.2.3. Such a process is used, for instance, in population
genetics for modeling the phenomenon of recombination as shown in Chiquet et al.
(2016). More generally, we give general conditions that the estimators of Σ and Σ−1
have to satisfy in order to be able to recover the support of B. Hence, any approach
providing estimators satisfying these conditions can be used.
Let us now describe more precisely our methodology. We start by "whitening"
the observations Y by applying the following transformation to Model (2.1):

Y Σ^{-1/2} = XB Σ^{-1/2} + E Σ^{-1/2}. (2.3)

The goal of such a transformation is to remove the dependence between the columns

of Y. Then, for estimating B, we proceed as follows. Let us observe that (2.3) can
be rewritten as:

𝒴 = 𝒳ℬ + ℰ, (2.4)

with

𝒴 = vec(Y Σ^{-1/2}), 𝒳 = (Σ^{-1/2})^⊤ ⊗ X, ℬ = vec(B) and ℰ = vec(E Σ^{-1/2}), (2.5)

where vec denotes the vectorization operator and ⊗ the Kronecker product.
With Model (2.4), estimating ℬ is equivalent to estimating B since ℬ = vec(B).
Then, for estimating ℬ, we use the classical LASSO criterion defined as follows for
a nonnegative λ:

ℬ̂(λ) = Argmin_ℬ { ‖𝒴 − 𝒳ℬ‖₂² + λ‖ℬ‖₁ }, (2.6)

where ‖·‖₁ and ‖·‖₂ denote the classical ℓ1-norm and ℓ2-norm, respectively. Inspired
by Zhao & Yu (2006), Theorem 2.1 establishes some conditions under which the
positions of the null and non-null entries of B can be recovered by using ℬ̂.
In practical situations, the covariance matrix Σ is generally unknown and has
thus to be estimated. Let Σ̂ denote an estimator of Σ. For a description of the
methodology that we propose for estimating Σ, we refer the reader to the end of
Section 2.2.2. Then, the estimator Σ̂^{-1/2} of Σ^{-1/2} is such that

Σ̂^{-1} = Σ̂^{-1/2} (Σ̂^{-1/2})^⊤.

When Σ^{-1/2} is replaced by Σ̂^{-1/2}, (2.3) becomes

Y Σ̂^{-1/2} = XB Σ̂^{-1/2} + E Σ̂^{-1/2}, (2.7)

which can be rewritten as

𝒴̃ = 𝒳̃ℬ + ℰ̃, (2.8)

where

𝒴̃ = vec(Y Σ̂^{-1/2}), 𝒳̃ = (Σ̂^{-1/2})^⊤ ⊗ X, ℬ = vec(B) and ℰ̃ = vec(E Σ̂^{-1/2}).

In Model (2.8), ℬ is estimated by

ℬ̃(λ) = Argmin_ℬ { ‖𝒴̃ − 𝒳̃ℬ‖₂² + λ‖ℬ‖₁ }. (2.9)

By extending Theorem 2.1, Theorem 2.5 gives some conditions on the eigenvalues
of Σ^{-1} and on the convergence rates of Σ̂ and its inverse to Σ and Σ^{-1},
respectively, under which the positions of the null and non-null entries of B can be
recovered by using ℬ̃.

We prove in Section 2.2.3 that when Σ is a particular Toeplitz matrix, namely
the covariance matrix of an AR(1) process, the assumptions of Theorem 2.5 are
satisfied. This strategy has been implemented in the R package MultiVarSel, which
is available on the Comprehensive R Archive Network (CRAN), for more general
Toeplitz matrices Σ such as the covariance matrix of ARMA processes or general
stationary processes. For a successful application of this methodology to particular
“-omic” data, namely metabolomic data, we refer the reader to Perrot-Dockès et al.
(2018). For a review of the most recent methods for estimating high-dimensional
covariance matrices, we refer the reader to Pourahmadi (2013).
The paper is organized as follows. Section 2.2 is devoted to the theoretical results
of the paper. The assumptions under which the positions of the non null and null
entries of B can be recovered are established in Theorem 2.1 when Σ is known and
in Theorem 2.5 when Σ is unknown. Section 2.2.3 studies the specific case of the
AR(1) model. We present in Section 2.3 some numerical experiments in order to
support our theoretical results. The proofs of our main theoretical results are given
in Section 2.5.

2.2 Theoretical results


2.2.1 Case where Σ is known
Let us first introduce some notations. Let

C = (1/(nq)) 𝒳^⊤𝒳 and J = {j : 1 ≤ j ≤ pq, ℬ_j ≠ 0},  (2.10)

where 𝒳 is defined in (2.5) and where ℬ_j denotes the jth component of the vector ℬ defined in (2.5). Let us also define

C_{J,J} = (1/(nq))(𝒳_{•,J})^⊤𝒳_{•,J} and C_{J^c,J} = (1/(nq))(𝒳_{•,J^c})^⊤𝒳_{•,J},  (2.11)

where 𝒳_{•,J} and 𝒳_{•,J^c} denote the columns of 𝒳 belonging to the set J defined in (2.10) and to its complement J^c, respectively.
More generally, for any matrix A, AI,J denotes the partitioned matrix extracted
from A by considering the rows of A belonging to the set I and the columns of A
belonging to the set J, with • indicating all the rows or all the columns.
The following theorem gives some conditions under which the estimator ℬ̂ defined in (2.6) is sign-consistent as defined by Zhao & Yu (2006), namely,

lim_{n→∞} P{sign(ℬ̂) = sign(ℬ)} = 1,

where the sign function maps positive entries to 1, negative entries to −1 and zero to 0.

Theorem 2.1. Assume that 𝒴 = (𝒴₁, ..., 𝒴_{nq})^⊤ satisfies Model (2.4). Assume also that there exist some positive constants M₁, M₂, M₃ and positive numbers c₁, c₂ such that c₁ + c₂ ∈ (0, 1/2), satisfying:

(A1) For all n ∈ ℕ and j ∈ {1, ..., pq}, n^{−1}(𝒳_{•,j})^⊤𝒳_{•,j} ≤ M₁, where 𝒳_{•,j} is the jth column of 𝒳 defined in (2.5).

(A2) For all n ∈ ℕ, n^{−1}λ_min{(𝒳^⊤𝒳)_{J,J}} ≥ M₂, where λ_min(A) denotes the smallest eigenvalue of A.

(A3) |J| = O(q^{c₁}), where J is defined in (2.10) and |J| is the cardinality of the set J.

(A4) q^{c₂} min_{j∈J} |ℬ_j| ≥ M₃.

Assume also that the following strong Irrepresentable Condition holds:

(IC) There exists a positive constant vector η such that

|(𝒳^⊤𝒳)_{J^c,J}{(𝒳^⊤𝒳)_{J,J}}^{−1} sign(ℬ_J)| ≤ 1 − η,

where 1 is a (pq − |J|) × 1 vector of 1s and the inequality holds element-wise.

Then, for all λ that satisfies

(L) q = q_n = o{n^{1/(2(c₁+c₂))}}, λ/√n → ∞ and λ/n = o{q^{−(c₁+c₂)}}, as n → ∞,

we have

lim_{n→∞} P[sign{ℬ̂(λ)} = sign(ℬ)] = 1,

where ℬ̂(λ) is defined by (2.6).

Remark 2.1. Observe that if c₁ + c₂ < 1/(2k) for some positive k, then the first condition of (L) becomes q = o(n^k). Hence, for large values of k, the size q of Σ is much larger than n.

Theorem 2.1 is established under assumptions similar to those required by Theorem 4 of Zhao & Yu (2006). However, Assumptions (5), (6), (7) and (8) of Zhao & Yu (2006) had to be adapted to our multivariate framework, where q tends to infinity, and were replaced by (A1), (A2), (A3) and (A4) in order to allow q and n to grow to infinity at different rates. Moreover, the assumptions on λ also had to be adapted to deal with our specific framework.
The proof of Theorem 2.1 is given in Section 2.5. It is based on the following Proposition 2.2, which is an adaptation to the multivariate case of Proposition 1 in Zhao & Yu (2006). More precisely, in order to prove Theorem 2.1, we show that P(A_n^c) and P(B_n^c) tend to 0 as n → ∞.

Proposition 2.2. Let ℬ̂(λ) be defined by (2.6). Then

P[sign{ℬ̂(λ)} = sign(ℬ)] ≥ P(A_n ∩ B_n),

where

A_n = { |(C_{J,J})^{−1}W_J| < √(nq)[ |ℬ_J| − (λ/(2nq))|(C_{J,J})^{−1} sign(ℬ_J)| ] }  (2.12)

and

B_n = { |C_{J^c,J}(C_{J,J})^{−1}W_J − W_{J^c}| ≤ (λ/(2√(nq)))[ 1 − |C_{J^c,J}(C_{J,J})^{−1} sign(ℬ_J)| ] },  (2.13)

with W = 𝒳^⊤ℰ/√(nq). In (2.12) and (2.13), C_{J,J} and C_{J^c,J} are defined in (2.11), and W_J and W_{J^c} denote the components of W belonging to J and J^c, respectively. Note that the previous inequalities hold element-wise.

The proof of Proposition 2.2 is given in Section 2.5.


In the following proposition, which is also proved in Section 2.5, we give some
conditions on X and Σ under which Assumptions (A1) and (A2) of Theorem 2.1
hold.

Proposition 2.3. If there exist some positive constants M₁′, M₂′, m₁, m₂ such that, for all n ∈ ℕ,

(C1) for all j ∈ {1, ..., p}, n^{−1}(X_{•,j})^⊤X_{•,j} ≤ M₁′,
(C2) n^{−1}λ_min(X^⊤X) ≥ M₂′,
(C3) λ_max(Σ^{−1}) ≤ m₁,
(C4) λ_min(Σ^{−1}) ≥ m₂,

then Assumptions (A1) and (A2) of Theorem 2.1 are satisfied.

Remark 2.2. Observe that (C1) and (C2) hold in the case where the columns of
the matrix X are orthogonal.

We give in Proposition 2.6 in Section 2.2.3 some conditions under which Condi-
tion (IC) holds in the specific case where Σ is the covariance matrix of an AR(1)
process.

2.2.2 Case where Σ is unknown


Similarly to (2.10) and (2.11), we introduce the notations

C̃ = (1/(nq)) 𝒳̃^⊤𝒳̃  (2.14)

and

C̃_{J,J} = (1/(nq))(𝒳̃_{•,J})^⊤𝒳̃_{•,J} and C̃_{J^c,J} = (1/(nq))(𝒳̃_{•,J^c})^⊤𝒳̃_{•,J},  (2.15)

where 𝒳̃_{•,J} and 𝒳̃_{•,J^c} denote the columns of 𝒳̃ belonging to the set J defined in (2.10) and to its complement J^c, respectively.

A straightforward extension of Proposition 2.2 leads to the following proposition for Model (2.8).

Proposition 2.4. Let ℬ̃(λ) be defined by (2.9). Then

P[sign{ℬ̃(λ)} = sign(ℬ)] ≥ P(Ã_n ∩ B̃_n),

where

Ã_n = { |(C̃_{J,J})^{−1}W̃_J| < √(nq)[ |ℬ_J| − (λ/(2nq))|(C̃_{J,J})^{−1} sign(ℬ_J)| ] }  (2.16)

and

B̃_n = { |C̃_{J^c,J}(C̃_{J,J})^{−1}W̃_J − W̃_{J^c}| ≤ (λ/(2√(nq)))[ 1 − |C̃_{J^c,J}(C̃_{J,J})^{−1} sign(ℬ_J)| ] },  (2.17)

with W̃ = 𝒳̃^⊤ℰ̃/√(nq). In (2.16) and (2.17), C̃_{J,J} and C̃_{J^c,J} are defined in (2.15), and W̃_J and W̃_{J^c} denote the components of W̃ belonging to J and J^c, respectively. Note that the previous inequalities hold element-wise.

The following theorem extends Theorem 2.1 to the case where Σ is unknown and gives some conditions under which the estimator ℬ̃ defined in (2.9) is sign-consistent.

Theorem 2.5. Assume that Assumptions (A3), (A4), (IC) and (L) of Theorem 2.1 hold. Assume also that there exist some positive constants M₄, M₅, M₆ and M₇ such that, for all n ∈ ℕ,

(A5) ‖(X^⊤X)/n‖_∞ ≤ M₄,
(A6) λ_min{(X^⊤X)/n} ≥ M₅,
(A7) λ_max(Σ^{−1}) ≤ M₆,
(A8) λ_min(Σ^{−1}) ≥ M₇.

Suppose also that

(A9) ‖Σ^{−1} − Σ̂^{−1}‖_∞ = O_P{(nq)^{−1/2}}, as n → ∞,
(A10) ρ(Σ − Σ̂) = O_P{(nq)^{−1/2}}, as n → ∞.

Let ℬ̃(λ) be defined by (2.9); then

lim_{n→∞} P[sign{ℬ̃(λ)} = sign(ℬ)] = 1.

In the previous assumptions, λ_max(A), λ_min(A), ρ(A) and ‖A‖_∞ denote the largest eigenvalue, the smallest eigenvalue, the spectral radius and the infinity norm (induced by the associated vector norm) of the matrix A, respectively.

Remark 2.3. In order to distinguish the assumptions that are required for the design matrix X from those required for the estimator Σ̂ of Σ, the assumptions of Theorem 2.5 only involve X, Σ and Σ − Σ̂, but not 𝒳̃.

Remark 2.4. Observe that Assumptions (A5) and (A6) hold in the case where the
columns of the matrix X are orthogonal. Note also that (A7) and (A8) are the same
as (C3) and (C4) in Proposition 2.3.

The proof of Theorem 2.5 is given in Section 2.5 and is based on Proposition 2.4. In order to prove Theorem 2.5, it is enough to show that P(Ã_n^c) and P(B̃_n^c) tend to 0 as n → ∞. The idea of the proof consists in rewriting P(Ã_n^c) (resp. P(B̃_n^c)) by adding terms depending on Σ − Σ̂ to P(A_n^c) (resp. P(B_n^c)), and in proving that these additional terms tend to zero as n → ∞.
In order to estimate Σ, we propose the following strategy:

(a) Fit a classical linear model to each column of the matrix Y in order to have access to an estimator Ê of the random error matrix E. This is possible since p is assumed to be fixed and smaller than n.

(b) Estimate Σ from Ê by assuming that Σ has a particular structure, Toeplitz for instance.

More precisely, Ê defined in the first step is such that:

Ê = {Id_{ℝⁿ} − X(X^⊤X)^{−1}X^⊤}E ≡ ΠE,  (2.18)

which implies that

ℰ̂ := vec(Ê) = (Id_{ℝ^q} ⊗ Π)vec(E),  (2.19)

where vec(E) denotes the vectorization of the random error matrix E.


We prove in Proposition 2.7 below that our strategy for estimating Σ provides an estimator satisfying the assumptions of Theorem 2.5 in the case where (E_{1,t})_t, ..., (E_{n,t})_t are assumed to be independent AR(1) processes.
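As an illustration of steps (a) and (b), the sketch below simulates the model with AR(1) error rows, computes the residual matrix via the projection Π of (2.18), and applies the moment estimator of φ₁ given later in (2.20). All sizes and the value of φ₁ are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, phi1 = 40, 2, 50, 0.6                 # illustrative sizes and AR(1) parameter
X = np.kron(np.eye(p), np.ones((n // p, 1)))   # balanced one-way ANOVA design
B = np.zeros((p, q))                           # any coefficient matrix works here

E = np.zeros((n, q))                           # rows are independent AR(1) processes
E[:, 0] = rng.normal(size=n) / np.sqrt(1 - phi1 ** 2)
for t in range(1, q):
    E[:, t] = phi1 * E[:, t - 1] + rng.normal(size=n)
Y = X @ B + E

# step (a): column-wise OLS residuals, eq. (2.18)
Pi = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T
E_hat = Pi @ Y

# step (b): moment estimator of phi_1, eq. (2.20)
phi1_hat = (E_hat[:, 1:] * E_hat[:, :-1]).sum() / (E_hat[:, :-1] ** 2).sum()
assert abs(phi1_hat - phi1) < 0.15
```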

2.2.3 The AR(1) case


Sufficient conditions for Assumption (IC) of Theorem 2.1

The following proposition gives some conditions under which the strong Irrepresentable Condition (IC) of Theorem 2.1 holds.

Proposition 2.6. Assume that (E_{1,t})_t, ..., (E_{n,t})_t in Model (2.1) are independent AR(1) processes such that, for all i ∈ {1, ..., n}, E_{i,t} − φ₁E_{i,t−1} = Z_{i,t}, where the Z_{i,t}'s are zero-mean iid Gaussian random variables with variance σ² = 1 and |φ₁| < 1. Assume also that X defined in (2.1) is such that X^⊤X = νId_{ℝ^p}, where ν is a positive constant. Moreover, suppose that if j ∈ J, then j > p and j < pq − p. Suppose also that, for all j ∈ J, j − p or j + p is not in J. Then the strong Irrepresentable Condition (IC) of Theorem 2.1 holds.

The proof of Proposition 2.6 is given in Section 2.5.
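The condition can also be checked numerically for a given configuration. A small sketch under the assumptions of Proposition 2.6 (AR(1) Σ, X^⊤X = νId, an isolated support J; all parameter values illustrative), using the fact that 𝒳^⊤𝒳 = Σ^{−1} ⊗ X^⊤X so that the 1/(nq) factor in C cancels inside (IC):

```python
import numpy as np

phi, q, p, nu = 0.5, 10, 2, 20.0               # illustrative parameters
Sigma = phi ** np.abs(np.subtract.outer(np.arange(q), np.arange(q))) / (1 - phi ** 2)
G = np.kron(np.linalg.inv(Sigma), nu * np.eye(p))   # = X^T X in Kronecker form

J = np.array([4, 11])          # 0-based support: away from the edges, neighbours j±p excluded
Jc = np.setdiff1d(np.arange(p * q), J)
s = np.ones(len(J))            # sign(B_J)

ic = np.abs(G[np.ix_(Jc, J)] @ np.linalg.inv(G[np.ix_(J, J)]) @ s)
assert ic.max() < 1            # strong irrepresentable condition (IC) holds
```

For this tridiagonal precision matrix, the largest entry of the (IC) vector equals φ/(1 + φ²), which stays below 1 for any |φ| < 1.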

Sufficient conditions for Assumptions (A7), (A8), (A9) and (A10) of Theorem 2.5

The following proposition establishes that in the particular case where the
(E1,t )t , . . . , (En,t )t are independent AR(1) processes, our strategy for estimating Σ
provides an estimator satisfying the assumptions of Theorem 2.5.

Proposition 2.7. Assume that (E_{1,t})_t, ..., (E_{n,t})_t in Model (2.1) are independent AR(1) processes such that, for all i ∈ {1, ..., n}, E_{i,t} − φ₁E_{i,t−1} = Z_{i,t}, where the Z_{i,t}'s are zero-mean iid Gaussian random variables with variance σ² = 1 and |φ₁| < 1. Let Σ̂ be the q × q Toeplitz matrix with entries

Σ̂_{k,ℓ} = φ̂₁^{|k−ℓ|}/(1 − φ̂₁²),  1 ≤ k, ℓ ≤ q,

where

φ̂₁ = (∑_{i=1}^n ∑_{ℓ=2}^q Ê_{i,ℓ}Ê_{i,ℓ−1}) / (∑_{i=1}^n ∑_{ℓ=1}^{q−1} Ê_{i,ℓ}²),  (2.20)

and Ê = (Ê_{i,ℓ})_{1≤i≤n,1≤ℓ≤q} is defined in (2.18). Then, Assumptions (A7), (A8), (A9) and (A10) of Theorem 2.5 are valid.

The proof of Proposition 2.7 is given in Section 2.5. It is based on the following
lemma.

Lemma 2.8. Assume that (E_{1,t})_t, ..., (E_{n,t})_t in Model (2.1) are independent AR(1) processes such that, for all i ∈ {1, ..., n}, E_{i,t} − φ₁E_{i,t−1} = Z_{i,t}, where the Z_{i,t}'s are zero-mean iid Gaussian random variables with variance σ² and |φ₁| < 1. Let

φ̂₁ = (∑_{i=1}^n ∑_{ℓ=2}^q Ê_{i,ℓ}Ê_{i,ℓ−1}) / (∑_{i=1}^n ∑_{ℓ=1}^{q−1} Ê_{i,ℓ}²),

where Ê = (Ê_{i,ℓ})_{1≤i≤n,1≤ℓ≤q} is defined in (2.18). Then, as n → ∞, √(nq_n)(φ̂₁ − φ₁) = O_p(1).

Lemma 2.8 is proved in Section 2.5. Its proof is based on Lemma 2.10 in Section 2.6.
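The √(nq_n)-consistency can be illustrated by simulation. A simplified Monte-Carlo sketch (applying the estimator directly to E rather than to the residuals Ê, with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(7)

def phi1_hat(n, q, phi=0.5):
    """Simulate n independent AR(1) rows of length q and estimate phi_1."""
    E = np.zeros((n, q))
    E[:, 0] = rng.normal(size=n) / np.sqrt(1 - phi ** 2)
    for t in range(1, q):
        E[:, t] = phi * E[:, t - 1] + rng.normal(size=n)
    return (E[:, 1:] * E[:, :-1]).sum() / (E[:, :-1] ** 2).sum()

small = [abs(phi1_hat(10, 20) - 0.5) for _ in range(200)]   # nq = 200
large = [abs(phi1_hat(40, 80) - 0.5) for _ in range(200)]   # nq = 3200
# the estimation error shrinks roughly like (nq)^{-1/2}
assert np.mean(large) < np.mean(small)
```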

2.3 Numerical experiments

The goal of this section is twofold: (i) to provide sanity checks for our theoretical
results in a well-controlled framework; and (ii) to investigate the robustness of our
estimator to some violations of the assumptions of our theoretical results. The latter
may reveal a broader scope of applicability for our method than the one guaranteed
by the theoretical results.
We investigate (i) in the AR(1) framework presented in Section 2.2.3. Indeed, all
assumptions made in Theorems 2.1 and 2.5 can be specified with well-controllable
simulation parameters in the AR(1) case with balanced design matrix X.
Point (ii) aims to explore the limitations of our theoretical framework and assess
its robustness. To this end, we propose two numerical studies relaxing some of the
assumptions of our theorems: first, we study the effect of an unbalanced design —
which violates the sufficient condition of the irrepresentability condition (IC) given
in Proposition 2.6 — on the sign-consistency; and second, we study the effect of
other types of dependence than an AR(1).
In all experiments, the performance is assessed in terms of sign-consistency. In other words, we evaluate the probability for the sign of various estimators to be equal to sign(ℬ). More precisely, we investigate for each estimator whether there exists at least one λ such that sign{ℬ̂(λ)} = sign(ℬ). We compare the performance of three different estimators:

(i) ℬ̂ defined in (2.6), which corresponds to the LASSO criterion applied to the data whitened with the true covariance matrix Σ; we call this estimator oracle. Its theoretical properties are established in Theorem 2.1.

(ii) ℬ̃ defined in (2.9), which corresponds to the LASSO criterion applied to the data whitened with an estimator Σ̂ of the covariance matrix Σ; we refer to this estimator as whitened-lasso. Its theoretical properties are established in Theorem 2.5.

(iii) the LASSO criterion applied to the raw data, which we call raw-lasso hereafter. Its theoretical properties are established only in the univariate case in Alquier & Doukhan (2011).
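A self-contained toy version of this comparison (not the MultiVarSel implementation: a basic proximal-gradient LASSO solver, small hypothetical sizes and a coarse λ grid) can be sketched as follows:

```python
import numpy as np

def lasso_ista(A, y, lam, iters=2000):
    """Minimise ||y - A b||_2^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    L = 2 * np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(A.shape[1])
    for _ in range(iters):
        g = b - 2 * A.T @ (A @ b - y) / L      # gradient step
        b = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-thresholding
    return b

rng = np.random.default_rng(3)
n, p, q, phi = 30, 2, 15, 0.9
X = np.kron(np.eye(p), np.ones((n // p, 1)))   # balanced one-way ANOVA, two groups
B = np.zeros((p, q)); B[0, 0] = B[0, 5] = 2.0; B[1, 10] = -2.0   # sparse signal
E = np.zeros((n, q))                           # AR(1) error rows
E[:, 0] = rng.normal(size=n) / np.sqrt(1 - phi ** 2)
for t in range(1, q):
    E[:, t] = phi * E[:, t - 1] + rng.normal(size=n)
Y = X @ B + E

Sigma = phi ** np.abs(np.subtract.outer(np.arange(q), np.arange(q))) / (1 - phi ** 2)
Sm12 = np.linalg.cholesky(np.linalg.inv(Sigma))               # Sigma^{-1/2}
A_raw, y_raw = np.kron(np.eye(q), X), Y.flatten('F')          # raw-lasso design
A_or, y_or = np.kron(Sm12.T, X), (Y @ Sm12).flatten('F')      # oracle design, eq. (2.5)

def recovers_sign(A, y, lams):
    return any(np.array_equal(np.sign(lasso_ista(A, y, lam)), np.sign(B.flatten('F')))
               for lam in lams)

lams = [1.0, 5.0, 10.0, 20.0, 40.0]
print("oracle:", recovers_sign(A_or, y_or, lams),
      "raw:", recovers_sign(A_raw, y_raw, lams))
```

The whitened-lasso variant would simply replace Sm12 by the factor built from the estimated Σ̂ of (2.20).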

2.3.1 AR(1) dependence structure with balanced one-way ANOVA
In this section, we consider Model (2.1), where X is the design matrix of a one-way ANOVA with two balanced groups. Each row of the random error matrix E is distributed as a centered Gaussian random vector as in Eq. (2.2), where the matrix Σ is the covariance matrix of an AR(1) process defined in Section 2.2.3.


In this setting, Assumptions (A1), (A2) and Condition (IC) of Theorem 2.1 are satisfied; see Propositions 2.3 and 2.6. The three remaining assumptions (A3), (A4) and (L) are related to more practical quantities: (A3) controls the sparsity level of the problem, involving c₁; (A4) basically controls the signal-to-noise ratio, involving c₂; and (L) links the sample size n, q and the two constants c₁, c₂, so that an appropriate range of penalties λ exists for having a large probability of support recovery. This latter assumption is used in our experiments to tune the difficulty of the support recovery as follows: we consider different values of n, q, c₁, c₂. For each 4-tuple (n, q, c₁, c₂), we generated the 2q-vector ℬ as follows: the absolute values of its |J| = q^{c₁} non-null components are sampled from a uniform distribution on the interval [1/q^{c₂}, 2/q^{c₂}]. Thus, Assumptions (A3) and (A4) are fulfilled. Hence, the problem difficulty is essentially driven by the validity of Assumption (L), where q = o(n^k) with c₁ + c₂ = 1/(2k), and so by the relationship between n, q and k.
We consider a large range of sample sizes n varying from 10 to 1000 and three different values of q ∈ {10, 50, 1000}. The constants c₁, c₂ are chosen such that c₁ + c₂ = 1/(2k), with c₁ = c₂ and k ∈ {1, 2, 4}. Additional values of c₁ and c₂ have also been considered and the corresponding results are available upon request. Finally, we consider two values for the parameter φ₁ appearing in the definition of the AR(1) process: φ₁ ∈ {0.5, 0.95}. Note that in this AR(1) setting with the estimator φ̂₁ of φ₁ defined in (2.20), all the assumptions of Theorem 2.5 are fulfilled; see Proposition 2.7.
The frequencies of support recovery for the three estimators, averaged over 1000 replications, are displayed in Figure 2.1.
We observe from Figure 2.1 that whitened-lasso and oracle have similar performance, since φ₁ is well estimated. These two approaches always exhibit better performance than raw-lasso, especially when φ₁ = 0.95. In this case, the sample size n required to reach the same performance is indeed ten times larger for raw-lasso than for oracle and whitened-lasso. Finally, the performance of all estimators is altered when n is too small, especially in situations where the signal-to-noise ratio (SNR) is small and the signal is not sparse enough, these two characteristics corresponding to small values of k.

2.3.2 Robustness to unbalanced designs and correlated features


The goal of this section is to study some particular design matrices X in Model (2.1) that may lead to violation of the Irrepresentability Condition (IC).

Figure 2.1 – Frequencies of support recovery in a multivariate one-way ANOVA model with two balanced groups and an AR(1) dependence. (Panels correspond to q ∈ {10, 50, 1000} and k ∈ {1, 2, 4}; curves show oracle, raw-lasso and whitened-lasso for φ₁ ∈ {0.5, 0.95}, as a function of n.)

To this end, we consider the multivariate linear model (2.1) with the same AR(1) dependence as the one considered in Section 2.3.1. Two different matrices X are then considered: first, a one-way ANOVA model with two unbalanced groups with respective sizes n₁ and n₂ such that n₁ + n₂ = n; and second, a multiple regression model with p correlated Gaussian predictors such that the rows of X are iid N(0, Σ^X).

For the one-way ANOVA, violation of (IC) may occur when r = n₁/n is too different from 1/2, as stated in Proposition 2.6. For the regression model, we choose for Σ^X a 9 × 9 matrix (p = 9) such that Σ^X_{i,i} = 1 and Σ^X_{i,j} = ρ when i ≠ j. The other simulation parameters are fixed as in Section 2.3.1.

We report in Figure 2.2 the results for the case where q = 1000 and k = 2, both for the unbalanced one-way ANOVA (top panels) and the regression with correlated predictors (bottom panels). For the one-way ANOVA, r varies in {0.4, 0.2, 0.1}; for the regression case, ρ varies in {0.2, 0.6, 0.8}. In both cases, the gray lines correspond to the ideal situation, that is, either balanced (r = 0.5) or uncorrelated (ρ = 0) in the legend of Figure 2.2. The probability of support recovery is estimated over 1000 runs.
From this figure, we note that correlated features or unbalanced designs deteriorate the support recovery of all estimators. This was expected for these LASSO-based methods, which all suffer from the violation of the irrepresentability condition (IC). However, we also note that whitened-lasso and oracle have similar performance, which means that the estimation of Σ is not altered, and that whitening always improves the support recovery.

2.3.3 Robustness to more general autoregressive processes


In this section, we consider the case where X is the design matrix of a one-way ANOVA with two balanced groups and where Σ is the covariance matrix of an AR(m) process with m ∈ {5, 10}. Figure 2.3 displays the performance of the different estimators when q = 500 and k = 2. Here, for computing Σ̂ in whitened-lasso, the parameters φ₁, ..., φ_m of the AR(m) process are estimated as follows: they are obtained by averaging over the n rows of Ê defined in (2.18) the estimates φ̂₁^{(i)}, ..., φ̂_m^{(i)} obtained for the ith row of Ê by using standard estimation approaches for AR processes described in Brockwell & Davis (1990). As previously, we observe from this figure that whitened-lasso and oracle have better performance than raw-lasso.

2.4 Discussion
In this paper, we proposed a variable selection approach for multivariate linear
models taking into account the dependence that may exist between the responses

Figure 2.2 – Frequencies of support recovery in general linear models with unbalanced designs: one-way ANOVA (top panels, r ∈ {0.4, 0.2, 0.1}) and regression with correlated predictors (bottom panels, ρ ∈ {0.2, 0.6, 0.8}), for q = 1000.

Figure 2.3 – Frequencies of support recovery in one-way ANOVA with AR(m) covariance matrix (m ∈ {5, 10}, q = 500).

and establish its theoretical properties. More precisely, our method consists in estimating the covariance matrix Σ, which models the dependence between the responses, and in plugging this estimator into a LASSO criterion, in order to obtain a sparse estimator of the coefficient matrix. We then give general conditions that the estimators of the covariance matrix and of its inverse have to satisfy in order to recover the positions of the null and non-null entries of the coefficient matrix when the size of Σ is not fixed and can tend to infinity. In particular, we prove that these general conditions are satisfied for some specific Toeplitz matrices such as the covariance matrix of an AR(1) process. Note that our approach has been successfully applied to a data set coming from a metabolomic experiment; for further details, we refer the reader to Perrot-Dockès et al. (2018). Since, in that paper, we used general Toeplitz covariance matrices such as those of ARMA(p,q) processes or of weakly dependent stationary processes, it would be interesting to prove that the strategy that we used provides estimators of Σ satisfying the assumptions of Theorem 2.5. Moreover, it would be interesting to see whether other types of structured covariance matrices would satisfy the assumptions of our theorems.

2.5 Proofs
Proof of Proposition 2.2. For a fixed nonnegative λ, by (2.6),

ℬ̂ = ℬ̂(λ) = Argmin_ℬ {‖𝒴 − 𝒳ℬ‖₂² + λ‖ℬ‖₁}.

Denoting û = ℬ̂ − ℬ, we get

‖𝒴 − 𝒳ℬ̂‖₂² + λ‖ℬ̂‖₁ = ‖𝒳ℬ + ℰ − 𝒳ℬ̂‖₂² + λ‖û + ℬ‖₁ = ‖ℰ − 𝒳û‖₂² + λ‖û + ℬ‖₁
= ‖ℰ‖₂² − 2û^⊤𝒳^⊤ℰ + û^⊤𝒳^⊤𝒳û + λ‖û + ℬ‖₁.

Thus,

û = Argmin_u V(u),

where

V(u) = −2(√(nq) u)^⊤W + (√(nq) u)^⊤C(√(nq) u) + λ‖u + ℬ‖₁.

Since the first derivative of V with respect to u is equal to

2√(nq){C(√(nq) u) − W} + λ sign(u + ℬ),

û satisfies

C_{J,J}(√(nq) û_J) − W_J = −(λ/(2√(nq))) sign(û_J + ℬ_J) = −(λ/(2√(nq))) sign(ℬ̂_J)

if û_J + ℬ_J = ℬ̂_J ≠ 0, and

|C_{J^c,J}(√(nq) û_J) − W_{J^c}| ≤ λ/(2√(nq)).

Note that, if |û_J| < |ℬ_J|, then ℬ̂_J ≠ 0 and sign(ℬ̂_J) = sign(ℬ_J).

Let us now prove that, when A_n and B_n defined in (2.12) and (2.13) are satisfied, there exists û satisfying

C_{J,J}(√(nq) û_J) − W_J = −(λ/(2√(nq))) sign(ℬ_J),  (2.21)

|û_J| < |ℬ_J|,  (2.22)

|C_{J^c,J}(√(nq) û_J) − W_{J^c}| ≤ λ/(2√(nq)).  (2.23)

Note that A_n implies

√(nq){−|ℬ_J| + (λ/(2nq))(C_{J,J})^{−1} sign(ℬ_J)} < (C_{J,J})^{−1}W_J < √(nq){|ℬ_J| + (λ/(2nq))(C_{J,J})^{−1} sign(ℬ_J)}.  (2.24)

By denoting

û_J = (1/√(nq))(C_{J,J})^{−1}W_J − (λ/(2nq))(C_{J,J})^{−1} sign(ℬ_J),  (2.25)

we obtain from (2.24) that (2.21) and (2.22) hold. Note that B_n implies

−(λ/(2√(nq))){1 − C_{J^c,J}(C_{J,J})^{−1} sign(ℬ_J)} ≤ C_{J^c,J}(C_{J,J})^{−1}W_J − W_{J^c} ≤ (λ/(2√(nq))){1 + C_{J^c,J}(C_{J,J})^{−1} sign(ℬ_J)}.

Hence,

|C_{J^c,J}{(C_{J,J})^{−1}W_J − (λ/(2√(nq)))(C_{J,J})^{−1} sign(ℬ_J)} − W_{J^c}| ≤ λ/(2√(nq)),

which is (2.23) by (2.25). This concludes the proof.

Proof of Theorem 2.1. By Proposition 2.2,

P[sign{ℬ̂(λ)} = sign(ℬ)] ≥ P(A_n ∩ B_n) = 1 − P(A_n^c ∪ B_n^c) ≥ 1 − P(A_n^c) − P(B_n^c),

where A_n and B_n are defined in (2.12) and (2.13). It is thus enough to prove that P(A_n^c) and P(B_n^c) tend to zero as n → ∞. By definition of A_n,

P(A_n^c) = P[|(C_{J,J})^{−1}W_J| ≥ √(nq){|ℬ_J| − (λ/(2nq))|(C_{J,J})^{−1} sign(ℬ_J)|}]
≤ sup_{j∈J} P[|ξ_j| ≥ √(nq){|ℬ_j| − (λ/(2nq))|b_j|}],

where

ξ = (ξ_j)_{j∈J} = (C_{J,J})^{−1}W_J = (1/√(nq))(C_{J,J})^{−1}(𝒳_{•,J})^⊤ℰ ≡ H_Aℰ

and b = (b_j)_{j∈J} = (C_{J,J})^{−1} sign(ℬ_J). By definition of B_n and (IC),

P(B_n^c) = P[|C_{J^c,J}(C_{J,J})^{−1}W_J − W_{J^c}| > (λ/(2√(nq))){1 − |C_{J^c,J}(C_{J,J})^{−1} sign(ℬ_J)|}]
≤ P[|C_{J^c,J}(C_{J,J})^{−1}W_J − W_{J^c}| > (λ/(2√(nq)))η]
≤ sup_{j∈J^c} P[|ζ_j| > (λ/(2√(nq)))η],

where

ζ = (ζ_j)_{j∈J^c} = C_{J^c,J}(C_{J,J})^{−1}W_J − W_{J^c} = (1/√(nq)){C_{J^c,J}(C_{J,J})^{−1}(𝒳_{•,J})^⊤ − (𝒳_{•,J^c})^⊤}ℰ ≡ H_Bℰ.

Note that, for all j ∈ J,

|b_j| ≤ ∑_{j∈J} |b_j| ≤ √|J| (∑_{j∈J} b_j²)^{1/2} = √|J| ‖b‖₂.
Moreover,

‖b‖₂ = ‖(C_{J,J})^{−1} sign(ℬ_J)‖₂ ≤ ‖(C_{J,J})^{−1}‖₂ √|J| ≡ λ_max{(C_{J,J})^{−1}} √|J|,

where λ_max(A) denotes the largest eigenvalue of the matrix A. Observe that

λ_max{(C_{J,J})^{−1}} = 1/λ_min(C_{J,J}) = q/[λ_min{(𝒳^⊤𝒳)_{J,J}}/n] ≤ q/M₂,  (2.26)

by Assumption (A2) of Theorem 2.1. Thus, for all j ∈ J, |b_j| ≤ q|J|/M₂. By Assumption (A4) of Theorem 2.1, we get thus that for all j ∈ J,

√(nq)[|ℬ_j| − (λ/(2nq))|{(C_{J,J})^{−1} sign(ℬ_J)}_j|] = √(nq){|ℬ_j| − (λ/(2nq))|b_j|}  (2.27)
≥ √(nq){M₃q^{−c₂} − λq|J|/(2nqM₂)}.  (2.28)

Thus,

P(A_n^c) ≤ sup_{j∈J} P[|ξ_j| ≥ √(nq){M₃q^{−c₂} − λq|J|/(2nqM₂)}].
Since ℰ is a centered Gaussian random vector having a covariance matrix equal to the identity, ξ = H_Aℰ is a centered Gaussian random vector with a covariance matrix equal to:

H_AH_A^⊤ = (C_{J,J})^{−1}{(1/(nq))(𝒳_{•,J})^⊤𝒳_{•,J}}(C_{J,J})^{−1} = (C_{J,J})^{−1}.

Hence, by (2.26), we get that for all j ∈ J, Var(ξ_j) = {(C_{J,J})^{−1}}_{jj} ≤ λ_max{(C_{J,J})^{−1}} ≤ q/M₂. Thus,

P[|ξ_j| ≥ √(nq){M₃q^{−c₂} − λq|J|/(2nqM₂)}] ≤ P[|Z| ≥ (√M₂/√q){M₃q^{−c₂}√(nq) − λq|J|/(2√(nq)M₂)}],

where Z is a standard Gaussian random variable. By Chernoff's inequality, we thus obtain that for all j ∈ J,

P[|ξ_j| ≥ √(nq){M₃q^{−c₂} − λq|J|/(2nqM₂)}] ≤ 2 exp[−(M₂/(2q)){M₃q^{−c₂}√(nq) − λq|J|/(2√(nq)M₂)}²].

By Assumption (A3) of Theorem 2.1, we get that under the last condition of (L), as n → ∞,

λq|J|/√(nq) = o{q^{−c₂}√(nq)}.  (2.29)

Thus,

lim_{n→∞} P(A_n^c) = 0.  (2.30)
Let us now bound P(B_n^c). Observe that ζ = H_Bℰ is a centered Gaussian random vector with a covariance matrix equal to

H_BH_B^⊤ = (1/(nq)){C_{J^c,J}(C_{J,J})^{−1}(𝒳_{•,J})^⊤ − (𝒳_{•,J^c})^⊤}{𝒳_{•,J}(C_{J,J})^{−1}C_{J,J^c} − 𝒳_{•,J^c}}
= C_{J^c,J^c} − C_{J^c,J}(C_{J,J})^{−1}C_{J,J^c}
= (1/(nq))(𝒳_{•,J^c})^⊤[Id_{ℝ^{nq}} − 𝒳_{•,J}{(𝒳_{•,J})^⊤𝒳_{•,J}}^{−1}(𝒳_{•,J})^⊤]𝒳_{•,J^c}
= (1/(nq))(𝒳_{•,J^c})^⊤(Id_{ℝ^{nq}} − Π_{Im(𝒳_{•,J})})𝒳_{•,J^c},

where Π_{Im(𝒳_{•,J})} denotes the orthogonal projection onto the column space of 𝒳_{•,J}. Note that, for all j ∈ J^c,

Var(ζ_j) = (1/(nq)){(𝒳_{•,J^c})^⊤(Id_{ℝ^{nq}} − Π_{Im(𝒳_{•,J})})𝒳_{•,J^c}}_{jj}
= (1/(nq)){(𝒳_{•,J^c})^⊤𝒳_{•,J^c}}_{jj} − (1/(nq)){(𝒳_{•,J^c})^⊤Π_{Im(𝒳_{•,J})}𝒳_{•,J^c}}_{jj}
≤ (1/(nq)){(𝒳_{•,J^c})^⊤𝒳_{•,J^c}}_{jj} ≤ M₁/q,

where the inequalities come from Lemma 2.9 and Assumption (A1) of Theorem 2.1.
Thus, for all j ∈ J^c,

P[|ζ_j| > (λ/(2√(nq)))η] ≤ P[|Z| > (λ√q/(2√(M₁nq)))η],

where Z is a standard Gaussian random variable. By Chernoff's inequality, for all j ∈ J^c,

P[|ζ_j| > (λ/(2√(nq)))η] ≤ 2 exp[−(1/2){λη/(2√(M₁n))}²].

Hence, under the assumption that λ/√n → ∞, which is the second condition of (L),

lim_{n→∞} P(B_n^c) = 0.  (2.31)

This completes the argument.

Proof of Proposition 2.3. Let us first prove that (C1) and (C3) imply (A1). For j ∈ {1, ..., pq}, by considering the Euclidean division of j − 1 by p given by j − 1 = pk_j + r_j, we observe that

(𝒳_{•,j})^⊤𝒳_{•,j} = [{(Σ^{−1/2})^⊤ ⊗ X}_{•,j}]^⊤{(Σ^{−1/2})^⊤ ⊗ X}_{•,j}
= {(Σ^{−1/2}) ⊗ X^⊤}_{j,•}{(Σ^{−1/2})^⊤ ⊗ X}_{•,j}
= {(Σ^{−1/2})_{k_j+1,•} ⊗ (X_{•,r_j+1})^⊤}[{(Σ^{−1/2})_{k_j+1,•}}^⊤ ⊗ X_{•,r_j+1}]
= (Σ^{−1/2})_{k_j+1,•}{(Σ^{−1/2})_{k_j+1,•}}^⊤ ⊗ (X_{•,r_j+1})^⊤X_{•,r_j+1}
= (Σ^{−1})_{k_j+1,k_j+1} ⊗ (X_{•,r_j+1})^⊤X_{•,r_j+1}
= (Σ^{−1})_{k_j+1,k_j+1}(X_{•,r_j+1})^⊤X_{•,r_j+1}.

Hence, using (C1), we get that for all j ∈ {1, ..., pq},

(1/n)(𝒳_{•,j})^⊤𝒳_{•,j} ≤ M₁′(Σ^{−1})_{k_j+1,k_j+1} ≤ M₁′ sup_{k∈{0,...,q−1}}{(Σ^{−1})_{k+1,k+1}} ≤ M₁′λ_max(Σ^{−1}) ≤ M₁′m₁,

where the last inequality comes from (C3); this gives (A1).

Let us now prove that (C2) and (C4) imply (A2). Note that

(𝒳^⊤𝒳)_{J,J} = [{(Σ^{−1/2})^⊤ ⊗ X}^⊤{(Σ^{−1/2})^⊤ ⊗ X}]_{J,J} = {Σ^{−1/2}(Σ^{−1/2})^⊤ ⊗ X^⊤X}_{J,J} = (Σ^{−1} ⊗ X^⊤X)_{J,J}.

Then, by Theorem 4.3.15 of Horn & Johnson (1986),

λ_min{(𝒳^⊤𝒳)_{J,J}} = λ_min{(Σ^{−1} ⊗ X^⊤X)_{J,J}} ≥ λ_min(Σ^{−1} ⊗ X^⊤X) = λ_min(X^⊤X)λ_min(Σ^{−1}).

Finally, by using Conditions (C2) and (C4), we obtain

(1/n)λ_min{(𝒳^⊤𝒳)_{J,J}} ≥ (1/n)λ_min(X^⊤X)λ_min(Σ^{−1}) ≥ M₂′m₂,

which gives (A2).

Proof of Theorem 2.5. By Proposition 2.4,

P[sign{ℬ̃(λ)} = sign(ℬ)] ≥ P(Ã_n ∩ B̃_n) = 1 − P(Ã_n^c ∪ B̃_n^c) ≥ 1 − P(Ã_n^c) − P(B̃_n^c),

where Ã_n and B̃_n are defined in (2.16) and (2.17). By definition of Ã_n, we get

P(Ã_n^c) = P[|(C̃_{J,J})^{−1}W̃_J| ≥ √(nq){|ℬ_J| − (λ/(2nq))|(C̃_{J,J})^{−1} sign(ℬ_J)|}].
Observing that

(C̃_{J,J})^{−1}W̃_J = (C_{J,J})^{−1}W_J + (C_{J,J})^{−1}(W̃_J − W_J) + {(C̃_{J,J})^{−1} − (C_{J,J})^{−1}}W_J + {(C̃_{J,J})^{−1} − (C_{J,J})^{−1}}(W̃_J − W_J)

and

(C̃_{J,J})^{−1} sign(ℬ_J) = (C_{J,J})^{−1} sign(ℬ_J) + {(C̃_{J,J})^{−1} − (C_{J,J})^{−1}} sign(ℬ_J),

and using the triangle inequality, we obtain

P(Ã_n^c) ≤ P[|(C_{J,J})^{−1}W_J| ≥ (√(nq)/5){|ℬ_J| − (λ/(2nq))|(C_{J,J})^{−1} sign(ℬ_J)|}]
+ P[|(C_{J,J})^{−1}(W̃_J − W_J)| ≥ (√(nq)/5){|ℬ_J| − (λ/(2nq))|(C_{J,J})^{−1} sign(ℬ_J)|}]
+ P[|{(C̃_{J,J})^{−1} − (C_{J,J})^{−1}}W_J| ≥ (√(nq)/5){|ℬ_J| − (λ/(2nq))|(C_{J,J})^{−1} sign(ℬ_J)|}]
+ P[|{(C̃_{J,J})^{−1} − (C_{J,J})^{−1}}(W̃_J − W_J)| ≥ (√(nq)/5){|ℬ_J| − (λ/(2nq))|(C_{J,J})^{−1} sign(ℬ_J)|}]
+ P[(λ/(2√(nq)))|{(C̃_{J,J})^{−1} − (C_{J,J})^{−1}} sign(ℬ_J)| ≥ (√(nq)/5){|ℬ_J| − (λ/(2nq))|(C_{J,J})^{−1} sign(ℬ_J)|}].  (2.32)

The first term on the right-hand side of (2.32) tends to 0 by the definition of A_n^c and (2.30). By (2.27), the last term of (2.32) satisfies, for all j ∈ J,

P[|{(C̃_{J,J})^{−1} − (C_{J,J})^{−1}} sign(ℬ_J)| ≥ (2nq/(5λ)){|ℬ_J| − (λ/(2nq))|(C_{J,J})^{−1} sign(ℬ_J)|}]
≤ P[|[{(C̃_{J,J})^{−1} − (C_{J,J})^{−1}} sign(ℬ_J)]_j| ≥ (2nq/(5λ)){M₃q^{−c₂} − λq|J|/(2nqM₂)}].

Let U = (C̃_{J,J})^{−1} − (C_{J,J})^{−1} and s = sign(ℬ_J); then, for all j ∈ J,

|(Us)_j| = |∑_{k∈J} U_{jk}s_k| ≤ √|J| ‖U‖₂.  (2.33)
We focus on

‖(C̃_{J,J})^{−1} − (C_{J,J})^{−1}‖₂ = ‖(C̃_{J,J})^{−1}(C_{J,J} − C̃_{J,J})(C_{J,J})^{−1}‖₂
≤ ‖(C̃_{J,J})^{−1}‖₂ ‖C_{J,J} − C̃_{J,J}‖₂ ‖(C_{J,J})^{−1}‖₂
= ρ(C_{J,J} − C̃_{J,J}) / {λ_min(C̃_{J,J})λ_min(C_{J,J})}
≤ ρ(C_{J,J} − C̃_{J,J}) / {λ_min(C̃_{J,J})(M₂/q)},

where the last inequality comes from Assumption (A2) of Theorem 2.1, which gives

‖(C_{J,J})^{−1}‖₂ ≤ q/M₂.  (2.34)

Using Theorem 4.3.15 of Horn & Johnson (1986), we get

‖(C̃_{J,J})^{−1} − (C_{J,J})^{−1}‖₂ ≤ qρ(C − C̃) / {λ_min(C̃)M₂}.

By the definitions of C and C̃ given in (2.10) and (2.14), respectively, we get

C = {Σ^{−1} ⊗ (X^⊤X)}/(nq) and C̃ = {Σ̂^{−1} ⊗ (X^⊤X)}/(nq).  (2.35)

By using the fact that the eigenvalues of the Kronecker product of two matrices are the products of the eigenvalues of the two matrices, we obtain

‖(C̃_{J,J})^{−1} − (C_{J,J})^{−1}‖₂ ≤ ρ(Σ^{−1} − Σ̂^{−1})λ_max{(X^⊤X)/n}q / [λ_min(Σ̂^{−1})λ_min{(X^⊤X)/n}M₂]
= ρ(Σ^{−1} − Σ̂^{−1})λ_max(Σ̂)λ_max{(X^⊤X)/n}q / [λ_min{(X^⊤X)/n}M₂]
≤ ρ(Σ^{−1} − Σ̂^{−1}){ρ(Σ̂ − Σ) + λ_max(Σ)}λ_max{(X^⊤X)/n}q / [λ_min{(X^⊤X)/n}M₂],

where the last inequality follows from Theorem 4.3.1 of Horn & Johnson (1986). Thus, by Assumptions (A5), (A6), (A8), (A9) and (A10), we get that, as n → ∞,

‖(C̃_{J,J})^{−1} − (C_{J,J})^{−1}‖₂ = O_P{q(nq)^{−1/2}}.  (2.36)
Hence, by (2.33), we get that, for all j ∈ J,

P[|[{(C̃_{J,J})^{−1} − (C_{J,J})^{−1}} sign(ℬ_J)]_j| ≥ (2nq/(5λ)){M₃q^{−c₂} − λq|J|/(2nqM₂)}]
≤ P[√|J| ‖(C̃_{J,J})^{−1} − (C_{J,J})^{−1}‖₂ ≥ (2√(nq)/(5λ)){M₃q^{−c₂}√(nq) − λq|J|/(2√(nq)M₂)}].

By (2.29), (2.36) and (A3), it is enough to prove that

lim_{n→∞} P{q^{c₁/2}q(nq)^{−1/2} ≥ nqq^{−c₂}/λ} = 0.

By the last condition of (L), (nqq^{−c₂}/λ)/q^{1+c₁} → ∞ as n → ∞, and the result follows since n → ∞. Hence, the last term of (2.32) tends to zero as n → ∞.

Let us now study the second term on the right-hand side of (2.32). We have

W̃_J − W_J = (1/√(nq))(𝒳̃^⊤ℰ̃ − 𝒳^⊤ℰ)_J
= (1/√(nq))[(Σ̂^{−1/2} ⊗ X^⊤){(Σ̂^{−1/2})^⊤ ⊗ Id_{ℝⁿ}}vec(E) − (Σ^{−1/2} ⊗ X^⊤){(Σ^{−1/2})^⊤ ⊗ Id_{ℝⁿ}}vec(E)]_J
= (1/√(nq))[{(Σ̂^{−1} − Σ^{−1}) ⊗ X^⊤}vec(E)]_J =_d AZ,  (2.37)

where Z is a centered Gaussian random vector having a covariance matrix equal to the identity and

A = (1/√(nq))[{(Σ̂^{−1} − Σ^{−1}) ⊗ X^⊤}{(Σ^{1/2})^⊤ ⊗ Id_{ℝⁿ}}]_{J,•}.  (2.38)

By the Cauchy–Schwarz inequality, we get, for every K × nq matrix B, every nq × 1 vector U and every k ∈ {1, ..., K},

|(BU)_k| = |∑_{ℓ=1}^{nq} B_{k,ℓ}U_ℓ| ≤ ‖B‖₂‖U‖₂.  (2.39)

Thus, for all j ∈ J, for all γ ∈ ℝ and every |J| × |J| matrix D,

P[|{D(W̃_J − W_J)}_j| ≥ γ] = P[|(DAZ)_j| ≥ γ] ≤ P(‖D‖₂‖A‖₂‖Z‖₂ ≥ γ),  (2.40)

where A is defined in (2.38) and Z is a centered Gaussian random vector having a
covariance matrix equal to the identity. Hence, for all j ∈ J,

P[|{(C_{J,J})^{−1}(W̃_J − W_J)}_j| ≥ (√(nq)/5){|ℬ_j| − (λ/(2nq))|{(C_{J,J})^{−1} sign(ℬ_J)}_j|}]
≤ P[‖(C_{J,J})^{−1}‖₂‖A‖₂‖Z‖₂ ≥ (√(nq)/5){M₃q^{−c₂} − λq|J|/(2nqM₂)}].

Let us bound $\|A\|_2$. Observe that
\[
\left\|\left[\left\{(\widehat{\Sigma}^{-1}-\Sigma^{-1})\otimes X^{\top}\right\}\left\{(\Sigma^{1/2})^{\top}\otimes \mathrm{Id}\right\}\right]_{J,\bullet}\right\|_2 \tag{2.41}
\]
\[
= \rho\left[\left\{(\widehat{\Sigma}^{-1}-\Sigma^{-1})\Sigma(\widehat{\Sigma}^{-1}-\Sigma^{-1})\otimes(X^{\top}X)\right\}_{J,J}\right]^{1/2}
\le \rho\left\{(\widehat{\Sigma}^{-1}-\Sigma^{-1})\Sigma(\widehat{\Sigma}^{-1}-\Sigma^{-1})\right\}^{1/2}\lambda_{\max}(X^{\top}X)^{1/2}
\le \rho(\widehat{\Sigma}^{-1}-\Sigma^{-1})\,\lambda_{\max}(\Sigma)^{1/2}\lambda_{\max}(X^{\top}X)^{1/2}, \tag{2.42}
\]
where the first inequality comes from Theorem 4.3.15 of Horn & Johnson (1986). Hence, by (A5), (A8) and (A9),
\[
\|A\|_2 = \frac{1}{\sqrt{nq}}\left\|\left[\left\{(\widehat{\Sigma}^{-1}-\Sigma^{-1})\otimes X^{\top}\right\}\left\{(\Sigma^{1/2})^{\top}\otimes \mathrm{Id}\right\}\right]_{J,\bullet}\right\|_2 = O_P\{q^{-1/2}(nq)^{-1/2}\}. \tag{2.43}
\]
By (2.29), (2.34) and (2.43), it is enough to prove that
\[
\lim_{n\to\infty} P\left(\sum_{k=1}^{nq} Z_k^2 \ge nq\,n\,q^{-2c_2}\right) = 0.
\]
The result follows from Markov's inequality and the first condition of (L).

Let us now study the 3rd term in the right-hand side of (2.32). Observe that
\[
W_J = \frac{1}{\sqrt{nq}}\left[(\Sigma^{-1/2}\otimes X^{\top})\left\{(\Sigma^{-1/2})^{\top}\otimes \mathrm{Id}_{\mathbb{R}^n}\right\}\mathrm{vec}(E)\right]_J
\stackrel{d}{=} \frac{1}{\sqrt{nq}}\left[(\Sigma^{-1}\otimes X^{\top})\left\{(\Sigma^{1/2})^{\top}\otimes \mathrm{Id}_{\mathbb{R}^n}\right\}\right]_{J,\bullet} Z \equiv A_1 Z,
\]
where $Z$ is a centered Gaussian random vector having a covariance matrix equal to the identity and
\[
A_1 = \frac{1}{\sqrt{nq}}\left[(\Sigma^{-1}\otimes X^{\top})\left\{(\Sigma^{1/2})^{\top}\otimes \mathrm{Id}_{\mathbb{R}^n}\right\}\right]_{J,\bullet}. \tag{2.44}
\]

Using (2.39), we get for every $j \in J$, every $\gamma \in \mathbb{R}$, and every $|J| \times |J|$ matrix $D$,
\[
P\{|(DW_J)_j| \ge \gamma\} = P\{|(DA_1Z)_j| \ge \gamma\} \le P\left(\|D\|_2\,\|A_1\|_2\,\|Z\|_2 \ge \gamma\right), \tag{2.45}
\]
where $A_1$ is defined in (2.44) and $Z$ is a centered Gaussian random vector having a covariance matrix equal to the identity. Hence, for all $j \in J$,
\[
P\left(\left|\left[\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}W_J\right]_j\right| \ge \frac{\sqrt{nq}}{5}\left\{|B_j| - \frac{\lambda}{2nq}\left|\left[(C_{J,J})^{-1}\mathrm{sign}(B_J)\right]_j\right|\right\}\right)
\le P\left(\|(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\|_2\,\|A_1\|_2\,\|Z\|_2 \ge \frac{\sqrt{nq}}{5}\left\{M_3 q^{-c_2} - \frac{\lambda q|J|}{2nqM_2}\right\}\right).
\]

Let us now bound $\|A_1\|_2$. Note that
\[
\left\|\left[(\Sigma^{-1}\otimes X^{\top})\left\{(\Sigma^{1/2})^{\top}\otimes \mathrm{Id}_{\mathbb{R}^n}\right\}\right]_{J,\bullet}\right\|_2
= \left\|\left[\Sigma^{-1/2}\otimes X^{\top}\right]_{J,\bullet}\right\|_2
= \rho\left[\left\{\Sigma^{-1}\otimes(X^{\top}X)\right\}_{J,J}\right]^{1/2}
\le \rho\left\{\Sigma^{-1}\otimes(X^{\top}X)\right\}^{1/2}
\le \lambda_{\max}(\Sigma^{-1})^{1/2}\lambda_{\max}(X^{\top}X)^{1/2},
\]
where the first inequality comes from Theorem 4.3.15 of Horn & Johnson (1986). Hence, by (A5) and (A7),
\[
\|A_1\|_2 \le \frac{1}{\sqrt{nq}}\,\lambda_{\max}(\Sigma^{-1})^{1/2}\lambda_{\max}(X^{\top}X)^{1/2} = O_P(q^{-1/2}). \tag{2.46}
\]

By (2.29), (2.36) and (2.46), it is thus enough to prove that
\[
\lim_{n\to\infty} P\left(\sum_{k=1}^{nq} Z_k^2 \ge nq\,n\,q^{-2c_2}\right) = 0.
\]
The result follows from Markov's inequality and the first condition of (L).
Let us now study the 4th term in the right-hand side of (2.32). By (2.40), for all $j \in J$,
\[
P\left(\left|\left[\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}(\widetilde{W}_J - W_J)\right]_j\right| \ge \frac{\sqrt{nq}}{5}\left\{|B_j| - \frac{\lambda}{2nq}\left|\left[(C_{J,J})^{-1}\mathrm{sign}(B_J)\right]_j\right|\right\}\right)
\le P\left(\|(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\|_2\,\|A\|_2\,\|Z\|_2 \ge \frac{\sqrt{nq}}{5}\left\{M_3 q^{-c_2} - \frac{\lambda q|J|}{2nqM_2}\right\}\right),
\]
where $A$ is defined in (2.38). By (2.29), (2.36) and (2.43), it is thus enough to prove that
\[
\lim_{n\to\infty} P\left\{\sum_{k=1}^{nq} Z_k^2 \ge (nq)\,n^2 q^{1-2c_2}\right\} = 0.
\]
The result follows from Markov's inequality and the fact that $c_2 < 1/2$.
Let us now study $P(\widetilde{B}_n)$. By definition of $\widetilde{B}_n$, we get that
\[
P(\widetilde{B}_n^c) = P\left(\left|\widetilde{C}_{J^c,J}(\widetilde{C}_{J,J})^{-1}\widetilde{W}_J - \widetilde{W}_{J^c}\right| \ge \frac{\lambda}{2\sqrt{nq}}\left\{1 - \left|\widetilde{C}_{J^c,J}(\widetilde{C}_{J,J})^{-1}\mathrm{sign}(B_J)\right|\right\}\right).
\]
Observe that
\[
\begin{aligned}
\widetilde{C}_{J^c,J}(\widetilde{C}_{J,J})^{-1}\widetilde{W}_J - \widetilde{W}_{J^c}
&= C_{J^c,J}(C_{J,J})^{-1}W_J - W_{J^c} + C_{J^c,J}(C_{J,J})^{-1}(\widetilde{W}_J - W_J)\\
&\quad+ C_{J^c,J}\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}W_J + C_{J^c,J}\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}(\widetilde{W}_J - W_J)\\
&\quad+ (\widetilde{C}_{J^c,J}-C_{J^c,J})(C_{J,J})^{-1}W_J + (\widetilde{C}_{J^c,J}-C_{J^c,J})(C_{J,J})^{-1}(\widetilde{W}_J - W_J)\\
&\quad+ (\widetilde{C}_{J^c,J}-C_{J^c,J})\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}W_J\\
&\quad+ (\widetilde{C}_{J^c,J}-C_{J^c,J})\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}(\widetilde{W}_J - W_J) + W_{J^c} - \widetilde{W}_{J^c}.
\end{aligned}
\]
Moreover,
\[
\begin{aligned}
\widetilde{C}_{J^c,J}(\widetilde{C}_{J,J})^{-1}\mathrm{sign}(B_J) &= C_{J^c,J}(C_{J,J})^{-1}\mathrm{sign}(B_J) + C_{J^c,J}\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}\mathrm{sign}(B_J)\\
&\quad+ (\widetilde{C}_{J^c,J}-C_{J^c,J})(C_{J,J})^{-1}\mathrm{sign}(B_J) + (\widetilde{C}_{J^c,J}-C_{J^c,J})\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}\mathrm{sign}(B_J).
\end{aligned}
\]

By (IC) and the triangle inequality, we obtain that
\[
\begin{aligned}
P(\widetilde{B}_n^c) &\le P\left(\left|C_{J^c,J}(C_{J,J})^{-1}W_J - W_{J^c}\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)
+ P\left(\left|C_{J^c,J}(C_{J,J})^{-1}(\widetilde{W}_J - W_J)\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)\\
&\quad+ P\left(\left|C_{J^c,J}\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}W_J\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)
+ P\left(\left|C_{J^c,J}\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}(\widetilde{W}_J - W_J)\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)\\
&\quad+ P\left(\left|(\widetilde{C}_{J^c,J}-C_{J^c,J})(C_{J,J})^{-1}W_J\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)
+ P\left(\left|(\widetilde{C}_{J^c,J}-C_{J^c,J})(C_{J,J})^{-1}(\widetilde{W}_J - W_J)\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)\\
&\quad+ P\left(\left|(\widetilde{C}_{J^c,J}-C_{J^c,J})\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}W_J\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)\\
&\quad+ P\left(\left|(\widetilde{C}_{J^c,J}-C_{J^c,J})\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}(\widetilde{W}_J - W_J)\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)
+ P\left(\left|W_{J^c} - \widetilde{W}_{J^c}\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)\\
&\quad+ P\left[\left|C_{J^c,J}\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}\mathrm{sign}(B_J)\right| \ge \frac{\eta}{12}\right]
+ P\left\{\left|(\widetilde{C}_{J^c,J}-C_{J^c,J})(C_{J,J})^{-1}\mathrm{sign}(B_J)\right| \ge \frac{\eta}{12}\right\}\\
&\quad+ P\left[\left|(\widetilde{C}_{J^c,J}-C_{J^c,J})\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}\mathrm{sign}(B_J)\right| \ge \frac{\eta}{12}\right].
\end{aligned} \tag{2.47}
\]

The first term in the right-hand side of (2.47) tends to 0 by (2.31). Let us now study the 2nd term of (2.47). By (2.40), we get that for all $j \in J^c$,
\[
P\left(\left|\left[C_{J^c,J}(C_{J,J})^{-1}(\widetilde{W}_J - W_J)\right]_j\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)
\le P\left(\|C_{J^c,J}\|_2\,\|(C_{J,J})^{-1}\|_2\,\|A\|_2\,\|Z\|_2 \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right).
\]

Observe that
\[
\begin{aligned}
\|C_{J^c,J}\|_2 &= \rho\left\{\frac{(\mathcal{X}_{\bullet,J^c})^{\top}\mathcal{X}_{\bullet,J}}{nq}\,\frac{(\mathcal{X}_{\bullet,J})^{\top}\mathcal{X}_{\bullet,J^c}}{nq}\right\}^{1/2}
= \frac{1}{nq}\left\|(\mathcal{X}_{\bullet,J^c})^{\top}\mathcal{X}_{\bullet,J}\right\|_2
\le \frac{\|(\mathcal{X}_{\bullet,J^c})^{\top}\|_2\,\|\mathcal{X}_{\bullet,J}\|_2}{\sqrt{nq}\,\sqrt{nq}}\\
&\le \rho\left\{\frac{(\mathcal{X}_{\bullet,J^c})^{\top}\mathcal{X}_{\bullet,J^c}}{nq}\right\}^{1/2}\rho\left\{\frac{(\mathcal{X}_{\bullet,J})^{\top}\mathcal{X}_{\bullet,J}}{nq}\right\}^{1/2}
= \rho(C_{J^c,J^c})^{1/2}\,\rho(C_{J,J})^{1/2}
\le \rho(C) = \frac{\lambda_{\max}(\Sigma^{-1})}{q}\,\lambda_{\max}(X^{\top}X/n) = O_P(q^{-1}).
\end{aligned} \tag{2.48}
\]

In (2.48) the last inequality and the fourth equality come from Theorem 4.3.15 of Horn & Johnson (1986) and (2.35), respectively. The last equality comes from (A5) and (A7). By (2.34), (2.43) and (2.48), it is thus enough to prove that
\[
\lim_{n\to\infty} P\left[\sum_{k=1}^{nq} Z_k^2 \ge \left\{(nq)^{1/2}\sqrt{q}\,\frac{\lambda}{\sqrt{nq}}\right\}^2\right]
= \lim_{n\to\infty} P\left\{\sum_{k=1}^{nq} Z_k^2 \ge (nq)\left(\frac{\lambda}{\sqrt{n}}\right)^2\right\} = 0,
\]
which holds true by the second condition of (L) and Markov's inequality. Hence, the second term of (2.47) tends to zero as $n \to \infty$.

Let us now study the 3rd term of (2.47). By (2.45), we get that for all $j \in J^c$,
\[
P\left(\left|\left[C_{J^c,J}\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}W_J\right]_j\right| \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right)
\le P\left(\|C_{J^c,J}\|_2\,\|(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\|_2\,\|A_1\|_2\,\|Z\|_2 \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right).
\]
By (2.36), (2.46) and (2.48), it is thus enough to prove that
\[
\lim_{n\to\infty} P\left[\sum_{k=1}^{nq} Z_k^2 \ge \left\{(nq)^{1/2}\sqrt{q}\,\frac{\lambda}{\sqrt{nq}}\right\}^2\right]
= \lim_{n\to\infty} P\left\{\sum_{k=1}^{nq} Z_k^2 \ge (nq)\left(\frac{\lambda}{\sqrt{n}}\right)^2\right\} = 0,
\]
which holds true by the second condition of (L) and Markov's inequality. Hence, the 3rd term of (2.47) tends to zero as $n \to \infty$.

Let us now study the 4th term of (2.47). By (2.40), it amounts to proving that
\[
\lim_{n\to\infty} P\left(\|C_{J^c,J}\|_2\,\|(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\|_2\,\|A\|_2\,\|Z\|_2 \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right) = 0.
\]
By (2.48), (2.36) and (2.43), it is enough to prove that
\[
\lim_{n\to\infty} P\left\{\sum_{k=1}^{nq} Z_k^2 \ge (nq)(nq)\left(\frac{\lambda}{\sqrt{n}}\right)^2\right\} = 0,
\]
which holds true by the second condition of (L). Hence, the 4th term of (2.47) tends to zero as $n \to \infty$.
Let us now study the 5th term of (2.47). By (2.45), proving that the 5th term of (2.47) tends to 0 amounts to proving that
\[
\lim_{n\to\infty} P\left(\|C_{J^c,J}-\widetilde{C}_{J^c,J}\|_2\,\|(C_{J,J})^{-1}\|_2\,\|A_1\|_2\,\|Z\|_2 \ge \frac{\lambda}{24\sqrt{nq}}\,\eta\right) = 0.
\]
Let us now bound $\|C_{J^c,J}-\widetilde{C}_{J^c,J}\|_2$ as follows:
\[
\begin{aligned}
\|C_{J^c,J}-\widetilde{C}_{J^c,J}\|_2 &= \|(C-\widetilde{C})_{J^c,J}\|_2 = \rho\left\{(C-\widetilde{C})_{J^c,J}\,(C-\widetilde{C})_{J^c,J}^{\top}\right\}^{1/2}\\
&\le \left\|(C-\widetilde{C})_{J^c,J}(C-\widetilde{C})_{J^c,J}^{\top}\right\|_{\infty}^{1/2} \le \left\|(C-\widetilde{C})(C-\widetilde{C})\right\|_{\infty}^{1/2} \le \|C-\widetilde{C}\|_{\infty}\\
&= \frac{1}{q}\,\|\Sigma^{-1}-\widehat{\Sigma}^{-1}\|_{\infty}\,\|X^{\top}X/n\|_{\infty} = O_P\{q^{-1}(nq)^{-1/2}\},
\end{aligned} \tag{2.49}
\]
as $n \to \infty$, where the last equality comes from (A5) and (A9).


By (2.34), (2.46) and (2.49), to prove that the 5th term of (2.47) tends to zero as $n \to \infty$, it is enough to prove that
\[
\lim_{n\to\infty} P\left\{\sum_{k=1}^{nq} Z_k^2 \ge nq\left(\frac{\lambda}{\sqrt{n}}\right)^2\right\} = 0,
\]
which can be deduced from Markov's inequality and the second condition of (L). Using similar arguments as those used for proving that the second, third and fourth terms of (2.47) tend to zero, we get that the 6th, 7th and 8th terms of (2.47) tend to zero, as $n \to \infty$, by replacing (2.48) by (2.49).
Let us now study the 9th term of (2.47). Replacing $J$ by $J^c$ in (2.37), (2.38), (2.40), (2.41) and (2.43), in order to prove that the ninth term of (2.47) tends to 0 it is enough to prove that
\[
\lim_{n\to\infty} P\left\{\sum_{k=1}^{nq} Z_k^2 \ge nq\left(\frac{\lambda}{\sqrt{n}}\right)^2\right\} = 0,
\]
which holds by Markov's inequality and the second condition of (L).
Let us now study the 10th term of (2.47). Using the same idea as the one used for proving (2.33), we get that
\[
P\left[\left|C_{J^c,J}\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}\mathrm{sign}(B_J)\right| \ge \frac{\eta}{12}\right]
\le P\left\{\sqrt{|J|}\,\|C_{J^c,J}\|_2\,\|(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\|_2 \ge \frac{\eta}{12}\right\},
\]
which tends to zero as $n \to \infty$ by (A3), (2.36), (2.48) and the fact that $c_1 < 1/2$.
Let us now study the 11th term of (2.47). Using the same idea as the one used for proving (2.33), we get that
\[
P\left\{\left|(\widetilde{C}_{J^c,J}-C_{J^c,J})(C_{J,J})^{-1}\mathrm{sign}(B_J)\right| \ge \frac{\eta}{12}\right\}
\le P\left\{\sqrt{|J|}\,\|\widetilde{C}_{J^c,J}-C_{J^c,J}\|_2\,\|(C_{J,J})^{-1}\|_2 \ge \frac{\eta}{12}\right\},
\]
which tends to zero as $n \to \infty$ by (A3), (2.34) and (2.49) and the fact that $c_1 < 1/2$.
Finally, the 12th term of (2.47) can be bounded as follows:
\[
P\left[\left|(\widetilde{C}_{J^c,J}-C_{J^c,J})\{(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\}\mathrm{sign}(B_J)\right| \ge \frac{\eta}{12}\right]
\le P\left\{\sqrt{|J|}\,\|\widetilde{C}_{J^c,J}-C_{J^c,J}\|_2\,\|(\widetilde{C}_{J,J})^{-1}-(C_{J,J})^{-1}\|_2 \ge \frac{\eta}{12}\right\},
\]
which tends to zero as $n \to \infty$ by (A3), (2.36) and (2.49) and the fact that $c_1 < 1/2$.

Proof of Proposition 2.6. Observe that
\[
\Sigma^{-1} = \begin{pmatrix}
1 & -\phi_1 & 0 & \cdots & 0\\
-\phi_1 & 1+\phi_1^2 & -\phi_1 & \cdots & 0\\
0 & -\phi_1 & \ddots & \ddots & \vdots\\
\vdots & \vdots & \ddots & 1+\phi_1^2 & -\phi_1\\
0 & 0 & \cdots & -\phi_1 & 1
\end{pmatrix}. \tag{2.50}
\]
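The tridiagonal form (2.50) can be verified numerically: with unit innovation variance, the AR(1) covariance matrix has $(i,j)$ entry $\phi_1^{|i-j|}/(1-\phi_1^2)$, and inverting it recovers exactly the matrix above. A small NumPy sketch (the values $\phi_1 = 0.5$ and $q = 6$ are arbitrary illustrative choices, not from the text):

```python
import numpy as np

phi, q = 0.5, 6
idx = np.arange(q)
# AR(1) covariance with unit innovation variance: Sigma_{ij} = phi^{|i-j|} / (1 - phi^2)
Sigma = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi ** 2)

# Tridiagonal candidate for Sigma^{-1}, as in (2.50)
P = np.diag(np.full(q, 1 + phi ** 2))
P[0, 0] = P[-1, -1] = 1.0                      # corner entries are 1
P[idx[:-1], idx[1:]] = -phi                    # super-diagonal
P[idx[1:], idx[:-1]] = -phi                    # sub-diagonal

assert np.allclose(P, np.linalg.inv(Sigma))    # (2.50) holds
```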

Let $S = \mathcal{X}^{\top}\mathcal{X} = \Sigma^{-1} \otimes X^{\top}X$. Then,
\[
S_{i,j} = \begin{cases}
n_{r_i+1} & \text{if } j = i \text{ and } k_i \in \{0, q-1\},\\
(1+\phi_1^2)\,n_{r_i+1} & \text{if } j = i \text{ and } k_i \notin \{0, q-1\},\\
-\phi_1\, n_{r_i+1} & \text{if } j = i+p \text{ or if } j = i-p,\\
0 & \text{otherwise},
\end{cases}
\]
where $i - 1 = (p-1)k_i + r_i$ corresponds to the Euclidean division of $i - 1$ by $p - 1$.


In order to prove (IC), it is enough to prove that $\|S_{J^c,J}(S_{J,J})^{-1}\|_{\infty} \le 1 - \eta$, where $\eta \in (0,1)$. Since for all $j$, we have $(j-p) \in J^c$ or $(j+p) \in J^c$, it follows that $\|S_{J^c,J}\|_{\infty} = \nu|\phi_1|$. Let $A = S_{J,J}$. Since $A = (a_{i,j})$ is a diagonally dominant matrix, then, by Theorem 1 of Varah (1975),
\[
\|A^{-1}\|_{\infty} \le 1\Big/\min_{k}\Big(|a_{k,k}| - \sum_{\substack{1\le j\le |J|\\ j\neq k}} |a_{k,j}|\Big).
\]

Using that for all $j$, $(j-p) \in J^c$ or $(j+p) \in J^c$,
\[
\sum_{\substack{1\le j\le |J|\\ j\neq k}} |a_{k,j}| \le \nu|\phi_1|.
\]
If $k \in J$ then $k > p$ and $k < pq - p$. Thus, $a_{k,k} \ge \nu(1+\phi_1^2)$. Hence, $\|A^{-1}\|_{\infty} \le 1/\{\nu(1+\phi_1^2-|\phi_1|)\}$ and
\[
\|S_{J^c,J}(S_{J,J})^{-1}\|_{\infty} \le \|S_{J^c,J}\|_{\infty}\,\|(S_{J,J})^{-1}\|_{\infty} \le \frac{|\phi_1|}{1+\phi_1^2-|\phi_1|}.
\]
Since $|\phi_1| < 1$, the strong Irrepresentability Condition holds when $|\phi_1| \le (1-\eta)(1+\phi_1^2-|\phi_1|)$, which is true for a small enough $\eta$.
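Varah's bound used above is easy to sanity-check numerically on a tridiagonal block of the same shape as $S_{J,J}$ ($\phi_1 = 0.4$ and the block size are arbitrary illustrative values):

```python
import numpy as np

phi, m = 0.4, 8
# Tridiagonal block shaped like S_{J,J}: diagonal 1 + phi^2, off-diagonals -phi
A = (np.diag(np.full(m, 1 + phi ** 2))
     + np.diag(np.full(m - 1, -phi), 1)
     + np.diag(np.full(m - 1, -phi), -1))

# Varah (1975): ||A^{-1}||_inf <= 1 / min_k (|a_kk| - sum_{j != k} |a_kj|)
off = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
gap = np.min(np.abs(np.diag(A)) - off)
inv_inf = np.max(np.sum(np.abs(np.linalg.inv(A)), axis=1))

assert gap > 0              # A is strictly diagonally dominant
assert inv_inf <= 1 / gap   # Varah's bound holds
```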

Proof of Proposition 2.7. Since $|\phi_1| < 1$, $\|\Sigma^{-1}\|_{\infty} \le |\phi_1| + |1+\phi_1^2| \le 3$, which gives (A7) by Theorem 5.6.9 of Horn & Johnson (1986). Observe that
\[
\|\Sigma\|_{\infty} \le \frac{1}{1-\phi_1^2}\left(1+2\sum_{h=1}^{q-1}|\phi_1|^h\right) \le \frac{1}{1-\phi_1^2}\left(1+\frac{2}{1-|\phi_1|}\right) = \frac{3-|\phi_1|}{(1-|\phi_1|)(1-\phi_1^2)} \le \frac{3}{(1-|\phi_1|)(1-\phi_1^2)},
\]
which gives (A8) by Theorem 5.6.9 of Horn & Johnson (1986).

Since $\widehat{\Sigma}^{-1}$ has the same expression as $\Sigma^{-1}$ defined in (2.50) except that $\phi_1$ is replaced by $\widehat{\phi}_1$ defined in (2.20), we get that
\[
\|\Sigma^{-1}-\widehat{\Sigma}^{-1}\|_{\infty} \le 2|\phi_1-\widehat{\phi}_1| + (\phi_1-\widehat{\phi}_1)^2,
\]
which implies Assumption (A9) of Theorem 2.5 by Lemma 2.8.

Let us now check Assumption (A10) of Theorem 2.5. Since, by Theorem 5.6.9 of Horn & Johnson (1986), $\rho(\Sigma-\widehat{\Sigma}) \le \|\Sigma-\widehat{\Sigma}\|_{\infty}$, it is enough to prove that, as $n \to \infty$, $\|\Sigma-\widehat{\Sigma}\|_{\infty} = O_P\{(nq)^{-1/2}\}$. Observe that
\[
\begin{aligned}
\|\Sigma-\widehat{\Sigma}\|_{\infty} &\le \left|\frac{1}{1-\phi_1^2}-\frac{1}{1-\widehat{\phi}_1^2}\right| + 2\sum_{h=1}^{q-1}\left|\frac{\phi_1^h}{1-\phi_1^2}-\frac{\widehat{\phi}_1^h}{1-\widehat{\phi}_1^2}\right|\\
&\le \left|\frac{\phi_1^2-\widehat{\phi}_1^2}{(1-\phi_1^2)(1-\widehat{\phi}_1^2)}\right| + 2\sum_{h=1}^{q-1}\left|\frac{\phi_1^h-\widehat{\phi}_1^h}{1-\phi_1^2}\right| + 2\sum_{h=1}^{q-1}|\widehat{\phi}_1^h|\left|\frac{1}{1-\phi_1^2}-\frac{1}{1-\widehat{\phi}_1^2}\right|\\
&\le \left|\frac{(\phi_1-\widehat{\phi}_1)(\phi_1+\widehat{\phi}_1)}{(1-\phi_1^2)(1-\widehat{\phi}_1^2)}\right| + 2\sum_{h=1}^{q-1}\left|\frac{\phi_1^h-\widehat{\phi}_1^h}{1-\phi_1^2}\right| + 2\sum_{h=1}^{q-1}\left|\widehat{\phi}_1^h-\phi_1^h\right|\left|\frac{1}{1-\phi_1^2}-\frac{1}{1-\widehat{\phi}_1^2}\right|\\
&\quad+ 2\sum_{h=1}^{q-1}|\phi_1^h|\left|\frac{1}{1-\phi_1^2}-\frac{1}{1-\widehat{\phi}_1^2}\right|\\
&\le \left|\frac{(\phi_1-\widehat{\phi}_1)(\phi_1+\widehat{\phi}_1)}{(1-\phi_1^2)(1-\widehat{\phi}_1^2)}\right|\left(1+\frac{2}{1-|\phi_1|}\right) + 2\left(\frac{1}{|1-\phi_1^2|}+\left|\frac{(\phi_1-\widehat{\phi}_1)(\phi_1+\widehat{\phi}_1)}{(1-\phi_1^2)(1-\widehat{\phi}_1^2)}\right|\right)\sum_{h=1}^{q-1}\left|\widehat{\phi}_1^h-\phi_1^h\right|.
\end{aligned}
\]

Moreover,
\[
\sum_{h=1}^{q-1}\left|\widehat{\phi}_1^h-\phi_1^h\right| \le \sum_{h=1}^{q-1}\left|\widehat{\phi}_1-\phi_1\right|\sum_{k=0}^{h-1}|\phi_1|^k\,|\widehat{\phi}_1|^{h-k-1}
\le \left|\widehat{\phi}_1-\phi_1\right|\frac{1-|\widehat{\phi}_1|^{q-1}}{1-|\widehat{\phi}_1|}\cdot\frac{1-|\phi_1|^{q-1}}{1-|\phi_1|}
\le \left|\widehat{\phi}_1-\phi_1\right|\left(\frac{1}{1-|\widehat{\phi}_1|}\right)\left(\frac{1}{1-|\phi_1|}\right).
\]
Thus, by Lemma 2.8, $\|\Sigma-\widehat{\Sigma}\|_{\infty} = O_P\{(nq)^{-1/2}\}$, which implies Assumption (A10) of Theorem 2.5.

Proof of Lemma 2.8. In the following, for notational simplicity, $q = q_n$. Observe that
\[
\sqrt{nq}\,\widehat{\phi}_1 = \frac{\frac{1}{\sqrt{nq}}\sum_{i=1}^{n}\sum_{\ell=2}^{q}\widehat{E}_{i,\ell}\widehat{E}_{i,\ell-1}}{\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\widehat{E}_{i,\ell}^2}.
\]
By (2.18),
\[
\sum_{i=1}^{n}\sum_{\ell=2}^{q}\widehat{E}_{i,\ell}\widehat{E}_{i,\ell-1} = \sum_{\ell=2}^{q}(\widehat{E}_{\bullet,\ell})^{\top}\widehat{E}_{\bullet,\ell-1} = \sum_{\ell=2}^{q}(\Pi E_{\bullet,\ell})^{\top}(\Pi E_{\bullet,\ell-1})
= \sum_{\ell=2}^{q}(\phi_1\Pi E_{\bullet,\ell-1}+\Pi Z_{\bullet,\ell})^{\top}(\Pi E_{\bullet,\ell-1}) \tag{2.51}
\]
\[
= \phi_1\sum_{\ell=1}^{q-1}(\Pi E_{\bullet,\ell})^{\top}(\Pi E_{\bullet,\ell}) + \sum_{\ell=2}^{q}(\Pi Z_{\bullet,\ell})^{\top}(\Pi E_{\bullet,\ell-1}),
\]
where (2.51) comes from the definition of $(E_{i,t})$. Hence,
\[
\sqrt{nq}\,(\widehat{\phi}_1-\phi_1) = \frac{\frac{1}{\sqrt{nq}}\sum_{\ell=2}^{q}(\Pi Z_{\bullet,\ell})^{\top}(\Pi E_{\bullet,\ell-1})}{\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\widehat{E}_{i,\ell}^2}.
\]


In order to prove that $\sqrt{nq}\,(\widehat{\phi}_1-\phi_1) = O_P(1)$, it is enough to prove that
\[
\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}E_{i,\ell}^2 - \frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\widehat{E}_{i,\ell}^2 = o_P(1), \quad \text{as } n \to \infty, \tag{2.52}
\]
by Lemma 2.10 and, as $n \to \infty$,
\[
\frac{1}{\sqrt{nq}}\sum_{\ell=2}^{q}(\Pi Z_{\bullet,\ell})^{\top}(\Pi E_{\bullet,\ell-1}) = O_P(1). \tag{2.53}
\]

Let us first prove (2.52). By (2.19), $\widehat{E} = (\mathrm{Id}_{\mathbb{R}^q}\otimes\Pi)E \equiv AE$. Note that $\mathrm{Cov}(\widehat{E}) = A(\Sigma\otimes \mathrm{Id}_{\mathbb{R}^n})A^{\top} = \Sigma\otimes\Pi$. Hence, for all $i \in \{1,\dots,n\}$,
\[
\mathrm{Var}(\widehat{E}_i) \le \lambda_{\max}(\Sigma).
\]
Since the covariance matrix of $E$ is equal to $\Sigma\otimes \mathrm{Id}_{\mathbb{R}^n}$, it follows that, for all $i$, $\mathrm{Var}(E_i) \le \lambda_{\max}(\Sigma)$. By Markov's inequality,
\[
\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}E_{i,\ell}^2 - \frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\widehat{E}_{i,\ell}^2
= \frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q}E_{i,\ell}^2 - \frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q}\widehat{E}_{i,\ell}^2 + o_P(1)
= \frac{1}{nq}\left(\|E\|_2^2 - \|\widehat{E}\|_2^2\right) + o_P(1).
\]

Observe that
\[
\|E\|_2^2 - \|\widehat{E}\|_2^2 = \|E\|_2^2 - \|AE\|_2^2 = E^{\top}E - E^{\top}A^{\top}AE = E^{\top}\left(\mathrm{Id}_{\mathbb{R}^{nq}} - \mathrm{Id}_{\mathbb{R}^q}\otimes\Pi\right)E
= E^{\top}\left\{\mathrm{Id}_{\mathbb{R}^q}\otimes(\mathrm{Id}_{\mathbb{R}^n}-\Pi)\right\}E = \sum_{i=1}^{pq}\widetilde{E}_i^2,
\]
where $\widetilde{E} = OE$, where $O$ is an orthogonal matrix. Using the fact that
\[
E(\widetilde{E}_i^2) = \mathrm{Cov}(\widetilde{E})_{i,i} \le \lambda_{\max}(\Sigma),
\]
and Markov's inequality, we get (2.52).


Let us now prove (2.53). By definition of (Ei,t ) and since |φ1 | < 1,

E{(ΠZ•,` )> (ΠE•,`−1 )} = 0.

52
Moreover,
\[
\begin{aligned}
E\left[\left\{\sum_{\ell=2}^{q}(\Pi Z_{\bullet,\ell})^{\top}(\Pi E_{\bullet,\ell-1})\right\}^2\right]
&= E\left[\left\{\sum_{\ell=2}^{q}\sum_{i=1}^{n}\left(\sum_{k=1}^{n}\Pi_{i,k}Z_{k,\ell}\right)\left(\sum_{j=1}^{n}\Pi_{i,j}E_{j,\ell-1}\right)\right\}^2\right]\\
&= \sum_{2\le\ell,\ell'\le q}\ \sum_{1\le i,j,k,i',j',k'\le n}\Pi_{i,k}\Pi_{i',k'}\Pi_{i,j}\Pi_{i',j'}\,E(Z_{k,\ell}Z_{k',\ell'}E_{j,\ell-1}E_{j',\ell'-1})\\
&= \sum_{2\le\ell,\ell'\le q}\ \sum_{1\le i,j,k,i',j',k'\le n}\ \sum_{r,s\ge 0}\Pi_{i,k}\Pi_{i',k'}\Pi_{i,j}\Pi_{i',j'}\,\phi_1^r\phi_1^s\,E(Z_{k,\ell}Z_{k',\ell'}Z_{j,\ell-1-r}Z_{j',\ell'-1-s}),
\end{aligned}
\]
since the $(E_{i,t})$ are AR(1) processes with $|\phi_1| < 1$. Note that
\[
E(Z_{k,\ell}Z_{k',\ell'}Z_{j,\ell-1-r}Z_{j',\ell'-1-s}) = 0
\]
except when $\ell = \ell'$, $k = k'$, $j = j'$ and $r = s$. Thus,
\[
E\left[\left\{\sum_{\ell=2}^{q}(\Pi Z_{\bullet,\ell})^{\top}(\Pi E_{\bullet,\ell-1})\right\}^2\right]
= \sigma^4\sum_{r\ge 0}\phi_1^{2r}\sum_{\ell=2}^{q}\ \sum_{1\le i,j,k,i'\le n}\Pi_{i,k}\Pi_{i',k}\Pi_{i,j}\Pi_{i',j}
= \frac{q\sigma^4}{1-\phi_1^2}\,\mathrm{tr}(\Pi) \le \frac{nq\sigma^4}{1-\phi_1^2},
\]
where $\mathrm{tr}(\Pi)$ denotes the trace of $\Pi$, which concludes the proof of (2.53) by Markov's inequality.

2.6 Technical lemmas


Lemma 2.9. Let $A \in \mathcal{M}_n(\mathbb{R})$ and $\Pi$ an orthogonal projection matrix. For any $j \in \{1,\dots,n\}$, $(A^{\top}\Pi A)_{jj} \ge 0$.

Proof. Observe that $A^{\top}\Pi A = A^{\top}\Pi^{\top}\Pi A = (\Pi A)^{\top}(\Pi A)$, since $\Pi$ is an orthogonal projection matrix. Moreover, $(A^{\top}\Pi A)_{jj} = e_j^{\top}(\Pi A)^{\top}(\Pi A)e_j \ge 0$, since $(\Pi A)^{\top}(\Pi A)$ is a positive semidefinite symmetric matrix, where $e_j$ is a vector containing null entries except the $j$th entry which is equal to 1.
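Lemma 2.9 is easy to illustrate numerically: for any orthogonal projection $\Pi$, $A^{\top}\Pi A = (\Pi A)^{\top}(\Pi A)$ is positive semidefinite, so its diagonal is nonnegative. A NumPy sketch with $\Pi$ the projection onto the column space of a random matrix (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
U = rng.standard_normal((n, 3))
Pi = U @ np.linalg.inv(U.T @ U) @ U.T    # orthogonal projection onto col(U)
A = rng.standard_normal((n, n))

M = A.T @ Pi @ A                         # = (Pi A)^T (Pi A), positive semidefinite
assert np.all(np.diag(M) >= -1e-12)      # Lemma 2.9: diagonal entries are >= 0
```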

Lemma 2.10. Assume that $(E_{1,t})_t,\dots,(E_{n,t})_t$ are independent AR(1) processes such that, for all $i \in \{1,\dots,n\}$, $E_{i,t}-\phi_1E_{i,t-1} = Z_{i,t}$, where the $Z_{i,t}$'s are zero-mean iid Gaussian random variables with variance $\sigma^2$ and $|\phi_1| < 1$. Then, as $n \to \infty$,
\[
\frac{1}{nq_n}\sum_{i=1}^{n}\sum_{\ell=1}^{q_n-1}E_{i,\ell}^2 \xrightarrow{\ P\ } \frac{\sigma^2}{1-\phi_1^2}.
\]

Proof. In the following, for notational simplicity, $q = q_n$. Since $E(E_{i,\ell}^2) = \sigma^2/(1-\phi_1^2)$,
it is enough to prove that, as $n \to \infty$,
\[
\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\left\{E_{i,\ell}^2 - E(E_{i,\ell}^2)\right\} \xrightarrow{\ P\ } 0.
\]
Since
\[
E_{i,\ell}^2 = \left(\sum_{j\ge 0}\phi_1^j Z_{i,\ell-j}\right)^2 = \sum_{j,j'\ge 0}\phi_1^j\phi_1^{j'}Z_{i,\ell-j}Z_{i,\ell-j'},
\]
\[
\begin{aligned}
\mathrm{Var}\left[\frac{1}{nq}\sum_{i=1}^{n}\sum_{\ell=1}^{q-1}\left\{E_{i,\ell}^2 - E(E_{i,\ell}^2)\right\}\right]
&= \frac{1}{(nq)^2}\sum_{i=1}^{n}\ \sum_{1\le\ell,\ell'\le q-1}\mathrm{Cov}(E_{i,\ell}^2;\,E_{i,\ell'}^2)\\
&= \frac{1}{(nq)^2}\sum_{i=1}^{n}\ \sum_{1\le\ell,\ell'\le q-1}\ \sum_{j,j'\ge 0}\ \sum_{k,k'\ge 0}\phi_1^j\phi_1^{j'}\phi_1^k\phi_1^{k'}\,\mathrm{Cov}(Z_{i,\ell-j}Z_{i,\ell-j'};\,Z_{i,\ell'-k}Z_{i,\ell'-k'}).
\end{aligned} \tag{2.54}
\]
By the Cauchy–Schwarz inequality, $|\mathrm{Cov}(Z_{i,\ell-j}Z_{i,\ell-j'};\,Z_{i,\ell'-k}Z_{i,\ell'-k'})|$ is bounded by a positive constant. Moreover, $\sum_{j\ge 0}|\phi_1|^j < \infty$, hence (2.54) tends to zero as $n \to \infty$, which concludes the proof of the lemma.
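A quick Monte-Carlo illustration of Lemma 2.10 (the values of $n$, $q$, $\phi_1$ and $\sigma$ below are arbitrary simulation settings): the empirical second moment of independent AR(1) rows approaches $\sigma^2/(1-\phi_1^2)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, phi, sigma = 200, 500, 0.5, 1.0

# n independent AR(1) rows; a burn-in brings the chains close to stationarity
burn = 200
Z = sigma * rng.standard_normal((n, q + burn))
E = np.zeros((n, q + burn))
for t in range(1, q + burn):
    E[:, t] = phi * E[:, t - 1] + Z[:, t]
E = E[:, burn:]

# (1/(nq)) sum_i sum_{l < q} E_{i,l}^2 should be close to sigma^2 / (1 - phi^2)
empirical = np.mean(E[:, : q - 1] ** 2)
assert abs(empirical - sigma ** 2 / (1 - phi ** 2)) < 0.05
```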

Chapter 3

A variable selection approach in the multivariate linear model: An application to LC-MS metabolomics data
Scientific production

The content of this chapter is contained in the article: M. Perrot-Dockès,


C. Lévy-Leduc, J. Chiquet, L. Sansonnet, M. Brégère, M.-P. Étienne,
S. Robin, G. Genta-Jouve ”A variable selection approach in the mul-
tivariate linear model: An application to LC-MS metabolomics data”
Statistical Applications in Genetics and Molecular Biology, 17(5), 2018.
The method which is presented is implemented in the MultiVarSel R
package available from the CRAN.

Abstract
Omic data are characterized by the presence of strong dependence structures that re-
sult either from data acquisition or from some underlying biological processes. Applying
statistical procedures that do not adjust the variable selection step to the dependence
pattern may result in a loss of power and the selection of spurious variables. The goal
of this paper is to propose a variable selection procedure within the multivariate linear
model framework that accounts for the dependence between the multiple responses. We
shall focus on a specific type of dependence which consists in assuming that the responses
of a given individual can be modelled as a time series. We propose a novel Lasso-based
approach within the framework of the multivariate linear model taking into account
the dependence structure by using different types of stationary processes covariance
structures for the random error matrix. Our numerical experiments show that including
the estimation of the covariance matrix of the random error matrix in the Lasso criterion
dramatically improves the variable selection performance. Our approach is successfully
applied to an untargeted LC-MS (Liquid Chromatography-Mass Spectrometry) data set
made of African copals samples. Our methodology is implemented in the R package
MultiVarSel which is available from the Comprehensive R Archive Network (CRAN).

3.1 Introduction

Metabolomics aims to provide a global snapshot (quantitative or qualitative) of


the metabolism at a given time and by extension phenotypic information (see Nichol-
son et al., 1999). It studies the concentration of small molecules called metabolites
that are the end products of the enzymatic machinery of the cell. Indeed, mi-
nor variations in gene or protein expression levels that are not observable via high
throughput experiments may have an influence on the metabolites and hence on
the phenotype of interest. Thus, metabolomics is a promising approach that can
advantageously complement usual transcriptomic and proteomic analyses. For fur-
ther details on metabolomics, we refer the reader to Smith et al. (2014). The
analysis of the metabolomic biological samples is often performed using High Reso-
lution Mass Spectrometry (HRMS), Nuclear Magnetic Resonance (NMR) or Liquid
Chromatography-Mass Spectrometry (LC-MS) and produces a large number of features (hundreds or thousands) that can explain a difference between two or more
populations (see Zhang et al., 2012). It is well-known in the untargeted LC-MS
data analysis that the identification of metabolites discriminating these populations
remains a major bottleneck and therefore the selection of relevant features (metabo-
lites) is a crucial step, as explained in Verdegem et al. (2016). Our goal is to tackle
the task of feature selection by taking advantage of the specificities of the LC-MS
spectra.
We consider a typical untargeted metabolomic experiment where LC-MS spectra
(intensity vs m/z) are obtained from n samples, resulting in an n × q data matrix
where the q columns are ordered according to their m/z ratio. Note that the ab-
breviation m/z represents the quantity formed by dividing the ratio of the mass of
an ion to the unified atomic mass unit, by its charge number (regardless of sign).
Figure 3.1 displays an example of such a spectrum. It has to be noticed that the
data were first pre-processed using the methodology described in Section 3.4.1. We
further assume that the n samples are collected under C conditions and denote nc
P
the number of samples from Condition c, hence c nc = n. Multivariate ANOVA
(MANOVA, see e.g. Mardia et al., 1980; Muller & Stewart, 2006) provides a natu-
ral framework to analyze such a data set. Denoting Y c,r the q-dimensional vector
corresponding to the spectrum from the rth replicate in Condition c, the MANOVA
model assumes that
Y c,r = µc + E c,r , (3.1)

where µc is the theoretical mean spectrum in Condition c and E c,r is a random


q-dimensional error vector. Each metabolite corresponds to a given component of
these three vectors. A “relevant” feature is then defined as the jth m/z value,
the theoretical concentration $\mu_c^{(j)}$ of which significantly varies between conditions.
Stacking the row vectors Y c,r and E c,r , the MANOVA model can be rephrased as

follows:
Y = XB + E, (3.2)

where $Y = (Y_{i,j})_{1\le i\le n,\,1\le j\le q}$ is the $n \times q$ observation matrix, $X$ is the $n \times C$ design matrix of a one-way ANOVA model, $B = (\mu_c^{(j)})_{1\le c\le C,\,1\le j\le q}$ is the $C \times q$ coefficient
matrix and E = (Ei,j )1≤i≤n, 1≤j≤q is the n × q random error matrix. Observe that
C corresponds to the number of covariates. For notational simplicity, the samples
indexed with (c, r) are now identified with a single index i ∈ {1, . . . n}, starting with
the n1 samples from Condition c = 1, then the n2 samples from Condition c = 2, etc.
In this framework, assuming that the mean spectrum $\mu = n^{-1}\sum_{c} n_c\,\mu_c$ is set to zero, the problem of determining which metabolites are relevant boils down to finding the
non null coefficients in the matrix B and hence can be seen as a variable selection
problem in the multivariate linear model. Several approaches can be considered for
solving this task: either a posteriori methods such as classical statistical tests in
ANOVA models (see Mardia et al., 1980; Faraway, 2004) or methods embedding the

variable selection such as Lasso-type methodologies (Tibshirani, 1996). However,
a naive application of such approaches does not take into account the potential
dependence between the different columns of Y , which may affect the identification
of the relevant features. This drawback will be illustrated in Section 4.3.
Different supervised machine learning approaches have been used to analyze
“omics” data during the last few years (see Saccenti et al., 2013; Ren et al., 2015;
Boccard & Rudaz, 2016; Zhang et al., 2017). Among them, in metabolomics, the
most popular is the partial least squares-discriminant analysis (PLS-DA) which
has recently been extended to sPLS-DA (sparse partial least squares-discriminant
analysis) by Lê Cao et al. (2011) to include a variable selection step.
The originality of our approach lies in the modeling of the dependence that
exists among the columns of Y which comes from the fact that usually biomarkers
share biosynthetic pathways (Audoin et al. (2014)). To account for this dependence,
we assume that the samples are all independent, namely, all the rows of E are
independent and for each sample i, the noise vector E i has a multivariate Gaussian
distribution:
E i = (Ei,1 , . . . , Ei,q ) ∼ N (0, Σq ), (3.3)

where Σq denotes the covariance matrix. The simplest assumption regarding the
covariance matrix is Σq = σ 2 I q , where I q denotes the q × q identity matrix. In
this case the different columns of Y are assumed to be independent. The other
extreme assumption consists in letting Σq free, assuming no specific form for this
dependence. However, in such a situation, q(q+1)/2 parameters should be estimated
which is not possible when n < q, which is the most standard case. Our approach
lies in between, assuming that some dependence exists but that it has a specific
structure. The form we consider is motivated by the nature of LC-MS spectra,
which can be seen as random functions of the m/z ratio. This suggests to consider


Figure 3.1 – An example of a LC-MS spectrum (an instance of Yc,r ) (top), the
same spectrum centered and normalized (middle) and its empirical autocorrelation
function (bottom).

each random vector E i as a time-series and to borrow classical dependence structure


from time-series analysis to model Σq . This approach is consistent with the fact
that the empirical autocorrelation function of LC-MS spectra (see Figure 3.1 for an
example) displays the typical characteristics of most time-series such as vanishing
autocorrelation when the lag increases.
On top of accounting for the dependence between the columns of Y , our method-
ology can deal with a potentially high number of features (columns of Y ) thanks to
the underlying Lasso-based feature selection and the modeling of the dependence
which produces sparse estimates of $\Sigma_q^{-1}$. We also couple the whole procedure to a
stability selection step to ensure robustness of the selected features. This method-
ology is implemented in the R package MultiVarSel which is available from the
Comprehensive R Archive Network (CRAN).
The rest of the paper is organized as follows. Our method is described in Section

4.2. Some numerical experiments on synthetic data are provided in Section 4.3.
Finally, an application to a metabolomic data set made of African copals samples
is given in Section 3.4.

3.2 Statistical inference


The strategy that we propose can be summarized as follows.
— First step: Fitting a one-way ANOVA to each column of the matrix $Y$ in order to have access to an estimation $\widehat{E}$ of the error matrix $E$.
— Second step: Estimating the matrix $\Sigma_q$ by using the methods described in Sections 3.2.1 and 3.2.1. Then, choosing the most convenient estimator $\widehat{\Sigma}_q$ thanks to a statistical test described in Section 3.2.1.
— Third step: Thanks to $\widehat{\Sigma}_q$, transforming the data in order to remove the dependence between the columns of $Y$. Such a transformation will be called “whitening” hereafter.
— Fourth step: Applying to the transformed observations the Lasso approach described in Section 3.2.2.
The first step provides a first estimate $\widetilde{B}$ of $B$. An estimate of $E$ is then defined as
\[
\widehat{E} = Y - X\widetilde{B}. \tag{3.4}
\]
In the following, we shall focus on the three other steps.
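In the one-way ANOVA setting of (3.2), the first step is a plain column-wise least-squares fit, so that $\widetilde{B}$ contains the condition means and $\widehat{E}$ the within-condition residuals. A minimal NumPy sketch of (3.4) (the sizes and the two-condition design are illustrative stand-ins, not the real data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, C = 10, 7, 2
X = np.kron(np.eye(C), np.ones((n // C, 1)))   # one-way ANOVA design: 2 conditions
B = rng.standard_normal((C, q))
Y = X @ B + rng.standard_normal((n, q))

# First step: column-wise least squares; here B_tilde is the matrix of condition means
B_tilde = np.linalg.lstsq(X, Y, rcond=None)[0]
E_hat = Y - X @ B_tilde                         # estimate of E, cf. (3.4)

# By the normal equations, residuals are centered within each condition
assert np.allclose(X.T @ E_hat, 0.0, atol=1e-10)
```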

3.2.1 Estimation of the dependence structure of E


We propose hereafter to model each row of E as a realization of a stationary pro-
cess and hence we shall use time-series models in order to describe the dependence
structure of E. We refer the reader to Brockwell & Davis (1991) for further details
on time series modeling.
We shall consider a large variety of models ranging from the simplest parametric to the most general nonparametric dependence modeling. In each case we focus on the estimation of $\Sigma_q^{-1/2}$, since the use of the following transformation:
\[
Y\Sigma_q^{-1/2} = XB\,\Sigma_q^{-1/2} + E\,\Sigma_q^{-1/2} \tag{3.5}
\]
removes the dependence between the columns of $Y$. Indeed, the covariance matrix of each row of $E\Sigma_q^{-1/2}$ is now equal to the identity matrix. Such a procedure will be called “whitening” hereafter.

Parametric dependence

The simplest model among the parametric models is the autoregressive process
of order 1 denoted AR(1). More precisely, for each i in {1, . . . , n}, Ei,t satisfies the
following equation:

Ei,t − φ1 Ei,t−1 = Wi,t , with Wi,t ∼ W N (0, σ 2 ), (3.6)

where $|\phi_1| < 1$ and $WN(0, \sigma^2)$ denotes a zero-mean white noise process of variance $\sigma^2$, defined as follows:
\[
Z_t \sim WN(0, \sigma^2) \quad \text{if} \quad
\begin{cases}
E(Z_t) = 0,\\
E(Z_t Z_{t'}) = 0 \ \text{if } t \neq t',\\
E(Z_t^2) = \sigma^2.
\end{cases} \tag{3.7}
\]


Note that the closer the parameter $\phi_1$ is to one, the stronger the dependence between the $E_{i,t}$'s.
In this case, the inverse of the square root of the covariance matrix $\Sigma_q$ of $(E_{i,1},\dots,E_{i,q})$ has a simple closed-form expression given by
\[
\Sigma_q^{-1/2} = \begin{pmatrix}
\sqrt{1-\phi_1^2} & -\phi_1 & 0 & \cdots & 0\\
0 & 1 & -\phi_1 & \cdots & 0\\
0 & 0 & \ddots & \ddots & \vdots\\
\vdots & \vdots & \ddots & \ddots & -\phi_1\\
0 & 0 & \cdots & 0 & 1
\end{pmatrix}. \tag{3.8}
\]

Hence, to obtain the expression of $\widehat{\Sigma}_q^{-1/2}$, it is enough to estimate the parameter $\phi_1$ and to replace it in (3.8). For this, we use the estimator $\widehat{E}$ defined in (3.4) and obtained by fitting a standard ANOVA model to the observations, which corresponds to the first step of our method. Then $\phi_1$ is estimated by $\widehat{\phi}_1$ defined by
\[
\widehat{\phi}_1 = \frac{1}{n}\sum_{i=1}^{n}\widehat{\phi}_{1,i},
\]
where $\widehat{\phi}_{1,i}$ denotes the estimator of $\phi_1$ obtained by the classical Yule–Walker equations from $(\widehat{E}_{i,1},\dots,\widehat{E}_{i,q})$; see Brockwell & Davis (1991) for more details.
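For the AR(1) model, the whole parametric whitening step fits in a few lines. The sketch below (NumPy, a simplified stand-in for the MultiVarSel implementation, which is in R) estimates $\phi_1$ by the lag-one Yule–Walker ratio averaged over rows, and checks that the matrix of (3.8) does whiten the theoretical AR(1) covariance:

```python
import numpy as np

def ar1_sqrt_inv(phi, q):
    """Matrix of (3.8); right-multiplying a row by it whitens AR(1) noise."""
    M = np.eye(q)
    M[0, 0] = np.sqrt(1 - phi ** 2)
    M[np.arange(q - 1), np.arange(1, q)] = -phi
    return M

def estimate_phi(E_hat):
    """Lag-one Yule-Walker estimate of phi_1, averaged over the rows of E_hat."""
    num = np.sum(E_hat[:, 1:] * E_hat[:, :-1], axis=1)
    den = np.sum(E_hat ** 2, axis=1)
    return np.mean(num / den)

# Whitening check on the theoretical AR(1) covariance (unit innovation variance)
phi, q = 0.6, 8
idx = np.arange(q)
Sigma = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi ** 2)
M = ar1_sqrt_inv(phi, q)
assert np.allclose(M.T @ Sigma @ M, np.eye(q), atol=1e-10)
```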
More generally, it is also possible to have access to $\Sigma_q^{-1/2}$ for more complex processes such as the ARMA($p$, $q$) process defined as follows: for each $i$ in $\{1,\dots,n\}$,
\[
E_{i,t} - \phi_1 E_{i,t-1} - \cdots - \phi_p E_{i,t-p} = W_{i,t} + \theta_1 W_{i,t-1} + \cdots + \theta_q W_{i,t-q}, \tag{3.9}
\]
where $W_{i,t} \sim WN(0, \sigma^2)$, and the $\phi_i$'s and the $\theta_i$'s are real parameters.
Nonparametric dependence case

In the situation where a parametric modeling is not relevant for $\Sigma_q$, it can be estimated by
\[
\widehat{\Sigma}_q = \begin{pmatrix}
\widehat{\gamma}(0) & \widehat{\gamma}(1) & \cdots & \widehat{\gamma}(q-1)\\
\widehat{\gamma}(1) & \widehat{\gamma}(0) & \cdots & \widehat{\gamma}(q-2)\\
\vdots & & & \vdots\\
\widehat{\gamma}(q-1) & \widehat{\gamma}(q-2) & \cdots & \widehat{\gamma}(0)
\end{pmatrix}, \tag{3.10}
\]
with
\[
\widehat{\gamma}(h) = \frac{1}{n}\sum_{i=1}^{n}\widehat{\gamma}_i(h),
\]
where $\widehat{\gamma}_i(h)$ is the standard autocovariance estimator of $\gamma_i(h) = E(E_{i,t}E_{i,t+h})$, for all $t$. Usually, $\widehat{\gamma}_i(h)$ is referred to as the empirical autocovariance of the $\widehat{E}_{i,t}$'s at lag $h$ (i.e. the empirical covariance between $(\widehat{E}_{i,1},\dots,\widehat{E}_{i,q-h})$ and $(\widehat{E}_{i,h+1},\dots,\widehat{E}_{i,q})$). For a definition of the standard autocovariance estimator we refer the reader to Chapter 7 of Brockwell & Davis (1991). The matrix $\widehat{\Sigma}_q^{-1/2}$ is then obtained by inverting the Cholesky factor of $\widehat{\Sigma}_q$.
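The nonparametric estimator amounts to building the Toeplitz matrix (3.10) from row-averaged empirical autocovariances and inverting its Cholesky factor. A NumPy sketch (the biased autocovariance estimator is used here, which keeps $\widehat{\Sigma}_q$ positive semidefinite; the simulated residual matrix is only a stand-in):

```python
import numpy as np

def nonparam_whitener(E_hat):
    """Toeplitz estimate (3.10) from row-averaged (biased) autocovariances,
    plus the whitening matrix obtained by inverting its Cholesky factor."""
    n, q = E_hat.shape
    gam = np.array([np.sum(E_hat[:, : q - h] * E_hat[:, h:]) / (n * q)
                    for h in range(q)])
    i = np.arange(q)
    Sigma_hat = gam[np.abs(i[:, None] - i[None, :])]   # Toeplitz matrix
    L = np.linalg.cholesky(Sigma_hat)                  # Sigma_hat = L L^T
    return Sigma_hat, np.linalg.inv(L).T               # right-multiply E_hat by W

rng = np.random.default_rng(4)
E_hat = rng.standard_normal((20, 30))                  # stand-in residual matrix
Sigma_hat, W = nonparam_whitener(E_hat)
# Right-multiplying by W turns the estimated row covariance into the identity
assert np.allclose(W.T @ Sigma_hat @ W, np.eye(30), atol=1e-8)
```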

Choice of the whitening modeling

In order to decide which dependence modeling better fits the data at hand, we propose hereafter a statistical test. If the whitening modeling used is well chosen, then each row of $\widetilde{E} = \widehat{E}\,\widehat{\Sigma}_q^{-1/2}$ should be a white noise as defined in (3.7), where $\widehat{E}$ is defined in (3.4).
One of the most popular approaches for testing whether a random process is a white noise or not is the Portmanteau test, which is based on the Bartlett theorem (for further details we refer the reader to Brockwell & Davis, 1991, Theorem 7.2.2). By this theorem, we get that under the null hypothesis $(H_0)$: “For each $i$ in $\{1,\dots,n\}$, $(\widetilde{E}_{i,1},\dots,\widetilde{E}_{i,q})$ is a white noise”,
\[
q\sum_{h=1}^{H}\widehat{\rho}_i(h)^2 \approx \chi^2(H), \quad \text{as } q \to \infty, \tag{3.11}
\]
for each $i$ in $\{1,\dots,n\}$, where $\widehat{\rho}_i(h)$ denotes the empirical autocorrelation of $(\widetilde{E}_{i,1},\dots,\widetilde{E}_{i,q})$ at lag $h$ and $\chi^2(H)$ denotes the chi-squared distribution with $H$ degrees of freedom.
Thus, by (3.11), we have at our disposal a p-value for each $i$ in $\{1,\dots,n\}$ that we denote by $\mathrm{Pval}_i$. In order to have a single p-value instead of $n$, we shall consider
\[
q\sum_{i=1}^{n}\sum_{h=1}^{H}\widehat{\rho}_i(h)^2 \approx \chi^2(nH), \quad \text{as } q \to \infty, \tag{3.12}
\]
where the approximation comes from the fact that the rows of $\widetilde{E}$ are assumed to
be independent. Equation (3.12) thus provides a p-value: Pval. Hence, if Pval < α,
the null hypothesis (H0 ) is rejected at the level α, where α is usually equal to 5%
and a large value of Pval indicates that the modeling for the dependence structure
of E is well chosen.
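The pooled Portmanteau statistic (3.12) is straightforward to compute; the sketch below uses scipy.stats.chi2 for the $\chi^2(nH)$ tail probability ($H = 10$ and the simulated AR(1) series are arbitrary illustrations, not the procedure's defaults):

```python
import numpy as np
from scipy.stats import chi2

def pooled_portmanteau_pval(E_tilde, H=10):
    """p-value from (3.12): q * sum_i sum_{h<=H} rho_i(h)^2 ~ chi2(n*H) under (H0)."""
    n, q = E_tilde.shape
    X = E_tilde - E_tilde.mean(axis=1, keepdims=True)
    stat = 0.0
    for i in range(n):
        gamma0 = np.sum(X[i] ** 2) / q
        for h in range(1, H + 1):
            rho = (np.sum(X[i, : q - h] * X[i, h:]) / q) / gamma0
            stat += q * rho ** 2
    return chi2.sf(stat, df=n * H)

rng = np.random.default_rng(5)
ar = np.zeros((10, 500))
for t in range(1, 500):
    ar[:, t] = 0.9 * ar[:, t - 1] + rng.standard_normal(10)

# Strongly autocorrelated rows: the whiteness hypothesis is clearly rejected
assert pooled_portmanteau_pval(ar) < 1e-6
```

A large p-value on the whitened residuals, by contrast, indicates that the chosen dependence model is adequate.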

3.2.2 Estimation of B

Lasso based approach

Let us first explain briefly the usual framework in which the Lasso approach is
used. We consider a high-dimensional linear model of the following form

Y = X B + E, (3.13)

where Y, B and E are vectors. Note that, in high-dimensional linear models, the
matrix X has usually more columns than rows which means that the number of
variables is larger than the number of observations but B is usually a sparse vector,
namely it contains a lot of null components.
In such models a very popular approach initially proposed by Tibshirani (1996) is the Least Absolute Shrinkage and Selection Operator (LASSO), which is defined as follows for a positive $\lambda$:
\[
\widehat{\mathcal{B}}(\lambda) = \underset{\mathcal{B}}{\operatorname{argmin}}\left\{\|\mathcal{Y}-\mathcal{X}\mathcal{B}\|_2^2 + \lambda\|\mathcal{B}\|_1\right\}, \tag{3.14}
\]
where, for $u = (u_1,\dots,u_n)$, $\|u\|_2^2 = \sum_{i=1}^{n}u_i^2$ and $\|u\|_1 = \sum_{i=1}^{n}|u_i|$, i.e. the $\ell_1$-norm of the vector $u$. Observe that the first term of (3.14) is the classical least-squares criterion and that $\lambda\|\mathcal{B}\|_1$ can be seen as a penalty term. The interest of such a criterion is the sparsity enforcing property of the $\ell_1$-norm, ensuring that the number of non-zero components of the estimator $\widehat{\mathcal{B}}$ of $\mathcal{B}$ is small for large enough values of $\lambda$.
This methodology cannot be directly applied to our model since we have to
deal with matrices and not with vectors. Nevertheless, as explained in Appendix
A, Model (3.2) can be rewritten as in (3.13) where Y, B and E are vectors of
size nq, pq and nq, respectively. Hence, retrieving the positions of the non null
components in B is a first approach for finding relevant variables. However, this
approach does not take into account the dependence between the columns of Y .
Hence, we propose hereafter a modified version of the standard Lasso criterion
(3.14) taking into account this potential dependence.
As explained previously, our contribution consists first in “whitening” the obser-
vations, namely removing the dependence that may exist within the observations
matrix, by multiplying (3.2) on the right by Σ b −1/2 , see (3.5) where Σ−1/2 is re-
q q

62
placed by Σb −1/2 . By using the same vectorization trick that allows us to transform
q
Model (3.2) into Model (3.13), the Lasso criterion can be applied to the vectorized
version of Model (3.5) where Σ−1/2 is replaced by Σ b −1/2 . The specific expressions
q q
of Y, X , B and E are given in Appendix B.
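The effect of this whitening step can be checked numerically. The sketch below (Python with numpy; the dimensions and the AR(1) parameter φ are arbitrary illustrative values, whereas the actual package works in R) draws a noise matrix whose rows have an AR(1) covariance $\Sigma_q$, right-multiplies it by $\Sigma_q^{-1/2}$, and verifies that the empirical covariance of the whitened columns is close to the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
q, n, phi = 40, 5000, 0.7

# AR(1) correlation matrix Sigma_q[i, j] = phi^|i - j| (illustrative choice).
Sigma_q = phi ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))

# Draw E with row covariance Sigma_q via its Cholesky factor.
L = np.linalg.cholesky(Sigma_q)
E = rng.normal(size=(n, q)) @ L.T

# Inverse square root of Sigma_q from its spectral decomposition.
vals, vecs = np.linalg.eigh(Sigma_q)
Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T

# Whitened observations: columns should now be (empirically) uncorrelated.
E_white = E @ Sigma_inv_half
emp_cov = np.cov(E_white, rowvar=False)
max_offdiag = np.max(np.abs(emp_cov - np.diag(np.diag(emp_cov))))
```

In population terms the whitened matrix has exactly the identity covariance; with a finite sample the off-diagonal entries are only close to zero, which is what `max_offdiag` measures.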
Note that this idea of “whitening” the observations has also been proposed by
Rothman et al. (2010) where the estimation of Σq and B is performed simulta-
neously. An implementation is available in the R package MRCE. In our approach,
Σq is estimated first and then its estimator is used in (3.5) instead of Σq before
applying the Lasso criterion. Hence, our method can be seen as a variant of the
MRCE method in which Σq is estimated beforehand. Moreover, after some numer-
ical experiments, we observed that for the values of n and q that we aim at using,
the computational burden of the approach designed by Rothman et al. (2010) is
too high for addressing our datasets for fixed regularization parameters, contrary
to ours. In addition, in practical situations, the regularization parameters of the
MRCE approach have to be tuned. As a consequence, we have not been able to use
the MRCE approach for the purpose we consider here.

Model selection issue

Estimator (3.14) depends on a parameter $\lambda$ which tunes the sparsity level in $\widehat{\mathcal{B}}$.
We propose to mix two standard approaches to estimate the positions of the non
null components in B: the 10-fold cross-validation method and the stability selection
approach of Meinshausen & Bühlmann (2010), which guarantees the robustness of
the selected variables.
We first divide our samples into ten groups and remove one group from the dataset, thus creating 10 training sets: $Y^{D_1}, \dots, Y^{D_{10}}$. For each training set $Y^{D_k}$, we apply the first three steps of our approach and the Lasso criterion with a 10-fold cross-validation procedure to get $\lambda_{CV}^{(k)}$. Then, we randomly select a subsample of size $qn_k/2$ of the vector of observations $\mathcal{Y}^{D_k} = \mathrm{vec}(Y^{D_k})$, which is of size $qn_k$. We then apply the Lasso criterion to this subsample with $\lambda = \lambda_{CV}^{(k)}$ and store the indices $i$ of the non null $\widehat{\mathcal{B}}_i$. This random selection of a subsample of the training set and the application of the Lasso criterion are repeated $N$ times. At the end, we have access to the number of times $N_i^{(k)}$ where each component $\widehat{\mathcal{B}}_i$ is non null among the $N$ replications for each group $k$. We only keep in the final set of selected variables the indices $i$ such that $\frac{1}{10}\sum_{k=1}^{10} N_i^{(k)}/N$ is larger than a given threshold.
The influence of $N$ and the choice of the threshold will be investigated in Section 3.3.3.
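The resampling part of this procedure can be sketched as follows (Python; here a single fixed λ plays the role of $\lambda_{CV}^{(k)}$ and a plain coordinate-descent Lasso replaces glmnet, so everything except the frequency-thresholding logic is an illustrative assumption):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Tiny coordinate-descent Lasso used as a stand-in for glmnet."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return b

def stability_select(X, y, lam, n_rep=100, threshold=0.9, seed=0):
    """Fit the Lasso on random half-subsamples and keep the variables whose
    selection frequency over the replications exceeds `threshold`."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_rep):
        idx = rng.choice(n, size=n // 2, replace=False)
        counts += lasso_cd(X[idx], y[idx], lam) != 0
    freq = counts / n_rep
    return np.flatnonzero(freq >= threshold), freq

# Illustrative data: only variables 0 and 3 truly matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
b_true = np.zeros(10)
b_true[[0, 3]] = [5.0, -4.0]
y = X @ b_true + 0.5 * rng.normal(size=100)

selected, freq = stability_select(X, y, lam=50.0)
```

The point of the replications is that variables which are only selected on a few lucky subsamples get a low frequency and are filtered out by the threshold.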
For some theoretical results supporting our approach we refer the reader to
Perrot-Dockès et al. (2018).

3.3 Simulation study
The goal of this section is to assess the statistical performance of our methodology
implemented in the R package MultiVarSel. In order to emphasize the benefits
of using a whitening approach from the variable selection point of view, we shall
first compare our approach to standard methodologies. Then, we shall analyze
the performance of our statistical test for choosing the best dependence modeling.
Finally, we shall investigate the performance of our model selection criterion.
To assess the performance of the different methodologies, we generate observa-
tions Y according to Model (3.2) with q = 1000, p = 3, n = 30 (n1 = 9, n2 = 8
and n3 = 13) and different dependence modelings, namely different matrices Σq
corresponding to the AR(1) model described in (3.6) with σ = 1 and φ1 = 0.7 or
0.9. Note that the values of the parameters p, q and n are chosen in order to match
the metabolomic data analyzed in Section 3.4.
We shall also investigate the effect of the sparsity and of the signal to noise ratio
(SNR). The sparsity level s corresponds to the proportion of non null elements in B.
Different signal to noise ratios are obtained by multiplying B in (3.2) by a coefficient
κ.
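Observations with this kind of dependence can be generated as follows (Python sketch; we use the AR(1) correlation $\phi^{|i-j|}$, since the σ parameter of (3.6) only rescales it, and the dimensions n = 30, q = 1000 match those given above):

```python
import numpy as np

def ar1_correlation(q, phi):
    """AR(1) correlation matrix: Sigma[i, j] = phi^|i - j|."""
    idx = np.arange(q)
    return phi ** np.abs(np.subtract.outer(idx, idx))

def simulate_rows(n, Sigma, rng):
    """n i.i.d. rows with covariance Sigma, via its Cholesky factor."""
    L = np.linalg.cholesky(Sigma)
    return rng.normal(size=(n, Sigma.shape[0])) @ L.T

rng = np.random.default_rng(3)
E = simulate_rows(30, ar1_correlation(1000, 0.7), rng)

# The pooled lag-1 autocorrelation of the rows should be close to phi = 0.7.
lag1 = np.mean(E[:, :-1] * E[:, 1:]) / np.mean(E ** 2)
```

The same matrix E can then be added to XB, possibly after multiplying B by κ to vary the signal to noise ratio.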

3.3.1 Variable selection performance

Figure 3.2 – Means of the ROC curves obtained from 200 replications for the different
methodologies in the AR(1) dependence modeling; κ is linked to the signal to noise
ratio (first row: κ = 1, second row κ = 2); φ1 is the correlation level in the AR(1)
and s the sparsity level (i.e. the fraction of nonzero elements in B).

The goal of this section is to compare the performance of our different whitening
strategies to standard existing methodologies. More precisely, we shall compare

Figure 3.3 – Means of the precision-recall curves obtained from 200 replications for
the different methodologies in the AR(1) dependence modeling; κ is linked to the
signal to noise ratio (first row: κ = 1, second row κ = 2); φ1 is the correlation level
in the AR(1) and s the sparsity level (i.e. the fraction of nonzero elements in B).

our approaches to the classical ANOVA method (denoted ANOVA), the standard
Lasso (denoted Lasso), namely the Lasso approach without the whitening step
and to sPLSDA (Lê Cao et al., 2011), implemented in the mixOmics R package
and also in MetaboAnalyst, which is widely used in the metabolomics field. By
ANOVA, we mean the classical one-way ANOVA applied to each column of the
observations matrix Y without taking the dependence into account. Our different
whitening approaches (described in Sections 3.2.1 and 3.2.1) are denoted by AR1
and Nonparam. These methods are also compared to the Oracle approach where
the matrix Σq is known, which is never the case in practical situations.
We shall use three classical criteria for comparison: ROC curves, AUC (Area
Under the ROC Curve) and Precision-Recall (PR) curves. ROC curves display the
true positive rates (TPR) as a function of the false positive rates (FPR), and the
closer to one the AUC, the better the methodology. PR curves display the Precision
as a function of the Recall. Since the features selected by sPLSDA are not assigned
to a given condition c, we shall consider that as soon as a feature is selected it is a
true positive, which gives a great advantage to sPLSDA.
We can see from Figures 3.2, 3.3 and Table 3.1 that in the case of an AR(1) depen-
dence, taking into account this dependence provides better results than sPLSDA and
than approaches that consider the columns of the matrix E as independent. Moreover, we observe that the performance of the nonparametric modeling is on a par with that of the parametric and the oracle ones. We also note that the larger the sparsity level s, the smaller the difference in performance between the approaches.

SNR φ1 s Lasso ANOVA AR1 Nonpar Oracle sPLSDA
1 0.7 0.01 0.78 0.78 0.83 0.84 0.84 0.73
1 0.7 0.3 0.74 0.74 0.80 0.80 0.80 0.72
1 0.9 0.01 0.63 0.64 0.83 0.83 0.83 0.58
1 0.9 0.3 0.63 0.64 0.77 0.77 0.77 0.61
2 0.7 0.01 0.91 0.91 0.95 0.95 0.95 0.86
2 0.7 0.3 0.85 0.85 0.88 0.88 0.88 0.84
2 0.9 0.01 0.77 0.77 0.91 0.91 0.91 0.72
2 0.9 0.3 0.75 0.76 0.86 0.86 0.86 0.74

Table 3.1 – AUC of the different methods corresponding to Figure 3.2

As expected, the larger the signal to noise ratio κ the better the performance of the
different methodologies. We also conducted numerical experiments in a balanced
one-way ANOVA framework. Since the conclusions are similar, we did not report
the results here but they are available upon request.

3.3.2 Choice of the dependence modeling

The goal of this section is to assess the performance of the whitening test pro-
posed in Section 3.2.1. We generated observations Y as described at the beginning
of Section 3.3, with AR(1) dependence, a sparsity level s = 0.01 and SNR such that
κ = 1. The corresponding results are displayed in Figure 3.4.

Figure 3.4 – Means and standard deviations of the p-values of the test described in
Section 3.2.1 for the different approaches in the AR(1) dependence
modeling when φ1 = 0.7 (left) and φ1 = 0.9 (right).

We observe that our test behaves properly: it provides p-values close to zero when no whitening strategy is used (Lasso), whereas the p-values are larger than 0.7 when one of the proposed whitening approaches is used.

3.3.3 Choice of the model selection criterion
We investigate here the performance of our model selection criterion described
in Section 3.2.2. Figure 3.5 displays the TPR and the FPR for different values N of
the sampling replicates and different thresholds. We can see from this figure that
taking N larger than 1000 and a threshold of 0.999 ensures a small false positive
rate and a large true positive rate.

Figure 3.5 – Influence of the number of replications N and of the threshold.

Bullets (’•’) in Figure 3.6 show the positions of the variables selected by our four-
step approach for two possible thresholds (0.999 and 1) from N = 1000 replications.
The positions of the non null coefficients in B are displayed with ’+’. Here Y is
generated with the parameters described at the beginning of Section 3.3 in the case of an AR(1) dependence with φ1 = 0.9 and κ = 10. We observe from this figure that the positions of the non null coefficients are recovered for both thresholds. However, the performance is slightly better when the threshold is equal to 0.999.

3.3.4 Numerical performance


In order to investigate the computational burden of our approach, we generated
matrices Y satisfying Model (3.2) with n = 30 and q ∈ {100, 1000, 2000, . . . , 5000}.
Here, the rows of the matrix E are generated as realizations of an AR(1) process and
the level of sparsity s of B is equal to 0.01. Figure 3.7 displays the computational
times of MultiVarSel, including the model selection step described in Section 3.2.2,
for different numbers of replications in the stability selection stage. Timings were
obtained on a workstation with 16 GB of RAM and Intel Core i7 (3.66GHz) CPU,
using 8 cores for parallel computing. Our implementation uses the R language (R
Core Team, 2017) and relies on the glmnet and Matrix packages (Friedman et al.,
2010a; Bates & Maechler, 2017). We can see from this figure that the computational

Figure 3.6 – Positions of the variables selected by our approach (’•’) when κ = 10. Values on the y-axis correspond to the 3 conditions. The results obtained when the threshold is equal to 0.999 (resp. 1) are on the left (resp. on the right). The higher the selection frequency, the larger the bullet.
Figure 3.7 – Computational times (in seconds) of MultiVarSel. The number of replications corresponds to the number N of subsamplings in the stability selection step.

burden of MultiVarSel is very low and that it takes only a few minutes to analyze
matrices having 5000 columns.

3.4 Application to the analysis of a LC-MS data set


In this section, MultiVarSel is applied to an LC-MS (Liquid Chromatography-Mass Spectrometry) data set made of African copal samples. The samples correspond to ethanolic extracts of copals produced by trees belonging to two genera,
Copaifera (C) and Trachylobium (T) with a second level of classification coming
from the geographical provenance of the Copaifera samples (West (W) or East (E)
Africa). Since all the Trachylobium samples come from East Africa, we can use the

modeling proposed in Equations (3.1) and (3.2) with C = 3 conditions: CE, CW
and TE, such that $n_{CE} = 9$, $n_{CW} = 8$ and $n_{TE} = 13$. Our goal is to identify the
most important features (the m/z values) for distinguishing the different conditions.
In this section, we also compare the performance of our method with those of other
techniques which are widely used in metabolomics.

3.4.1 Data pre-processing


LC-MS chromatograms were aligned using the R package XCMS proposed by Smith
et al. (2006) with the following parameters: a signal to noise ratio threshold of
10:1 for peak selection, a step size of 0.2 min and a minimum difference in m/z
for peaks with overlapping retention times of 0.05 amu. Sample filtering was also performed: to be considered as informative, as suggested by Kirwan et al. (2013), a peak was required to be present in at least 80% of the samples. Missing value imputation was performed using the KNN algorithm described in Hrydziuszko & Viant
(2012). Subsequently, the spectra were normalized to equalize signal intensities to the median profile, in order to reduce any variance arising from differing dilutions of the biological extracts; probabilistic quotient normalization (PQN) was used, see Dieterle et al. (2006) for further details. In order to reduce the size of the
data matrix which contains 6327 metabolites, selection of the adducts of interest
[M+H]+ was then performed using the CAMERA package of Kuhl et al. (2012). An n × q matrix Y was then obtained with q = 1019 and submitted to the statistical
analyses.
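The PQN step of this pipeline can be sketched in a few lines (Python; the actual pre-processing was done with the R tools cited above, and the data below are synthetic, with each sample being a dilution of a common profile):

```python
import numpy as np

def pqn_normalize(Y, eps=1e-12):
    """Probabilistic quotient normalization: divide each spectrum by the
    median of its feature-wise quotients to the median reference profile."""
    ref = np.median(Y, axis=0)               # median profile across samples
    quotients = Y / (ref + eps)              # feature-wise quotients
    dilution = np.median(quotients, axis=1)  # estimated dilution per sample
    return Y / dilution[:, None]

# Synthetic example: 15 samples, each a random dilution of one base spectrum.
rng = np.random.default_rng(4)
base = np.abs(rng.normal(size=200)) + 1.0
dilutions = rng.uniform(0.5, 2.0, size=15)
Y = dilutions[:, None] * base
Y_norm = pqn_normalize(Y)
```

In this noiseless example all normalized spectra coincide, showing that PQN removes exactly the sample-specific dilution factor.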

3.4.2 Application of our four-step approach


The observations matrix Y is first centered and scaled.
— First step: A one-way ANOVA is fitted to each column of the observation matrix Y in order to have access to an estimation $\widehat{E}$ of the matrix E. Then, the test proposed in Section 3.2.1 is applied to $\widehat{E}$, that is, without “whitening” the observations. We found a p-value equal to zero, which indicates that the columns of $\widehat{E}$ cannot be considered as independent and hence that applying the whitening strategy should improve the results.
— Second step: The different whitening strategies described in Section 3.2.1
were applied and the highest p-value for the test described in Section 3.2.1 is
obtained for the nonparametric whitening. More precisely, the p-values ob-
tained for the AR(1) and the nonparametric dependence modeling are equal
to 0 and 0.664, respectively. Hence, in the following we shall use the nonpara-
metric modeling.
— Third step: Observations were whitened with $\widehat{\Sigma}_q$ obtained by using the nonparametric modeling.

— Fourth step: The Lasso approach described in Section 3.2.2 was then applied
to the whitened observations. The stability selection is used with N = 1000
replications and a threshold equal to 0.999.


Figure 3.8 – Venn diagram of the features selected for each condition by
MultiVarSel.

Figure 3.8 displays the Venn diagram of the features (m/z values) selected for
each condition CE, TE and CW. Among the 1019 features, 98 features have been
selected by MultiVarSel: 77 have been selected for Condition TE, 28 for Condition
CW and 5 for Condition CE. Note that there were no features selected for all the
conditions, 10 for both TE and CW and 2 for both CW and CE.

3.4.3 Comparison with existing methods

The goal of this section is to compare our approach with the sparse partial least
square discriminant analysis (sPLS-DA) which is classically used in metabolomics.

Additional simulations

Since in the case of real data the position of the relevant features is of course unknown, we propose the following additional simulations in order to further compare these two approaches. We start by applying the first step of our approach in order to get $\widehat{E}$. Then, we perform M random samplings with replacement among the rows of $\widehat{E}$. Let $E^\star$ denote one of them; we then generate a new observation matrix $Y^\star = X^\star B + E^\star$, where $X^\star$ is the same as X except that its rows are permuted in order to ensure a correspondence between the rows of $E^\star$ and $X^\star$. The matrix B is obtained as in Section 3.3 with s = 0.01 and κ = 0.5 or 1. ROC curves averaged
over M = 50 random samplings are displayed in Figure 3.9. We can see from this
figure that our approach outperforms the classical ones. Other values of s and κ
have been tested. The corresponding results are not reported here but available
upon request.

Figure 3.9 – Means of the ROC curves obtained by MultiVarSel, Lasso and sPLS-
DA.

Figure 3.10 – Venn diagrams comparing the features selected by MultiVarSel in the three conditions with those selected by sPLS-DA in its two components.

Results on the LC-MS data set

As recommended by Lê Cao et al. (2011), we used two components for sPLS-DA.
Moreover, in order to make sPLS-DA comparable with our approach, 49 variables are
kept for each component. However, as explained in Section 3.3.1, the main difference between our approach and sPLS-DA is that the features selected by sPLS-DA are not assigned to a given condition c and are thus less interpretable.
Figure 3.11 displays the location of the features (m/z values) selected by our
approach and sPLS-DA. We can see from this figure that the features selected
for the condition TE are mainly located between 400 and 500 m/z whereas those
selected for the condition CE are around 600 m/z. The features selected by the
first component of the sPLS-DA are also mainly located between 400 and 500 m/z.
However, as previously explained, the features selected by sPLSDA are assigned to
a component built by the method and not to a condition of the experimental design.
Venn diagrams comparing the features selected by both methods are available in
Figure 3.10. We observe from these Venn diagrams that the features selected in
each component of sPLS-DA do not characterize the conditions of the MANOVA
model contrary to ours.

Figure 3.11 – Comparison of the features selected by MultiVarSel and sPLS-DA.



3.5 Conclusion

In this paper, we proposed a novel approach for feature selection taking into
account the dependence that may exist between the columns of the observations
matrix. Our approach is implemented in the R package MultiVarSel which is
available from The Comprehensive R Archive Network (CRAN). We have shown
that our method has two main features. Firstly, it is very efficient for selecting a
restricted number of stable features characterizing each condition. Secondly, its very
low computational burden makes its use possible on very large LC-MS metabolomics
data.
Acknowledgment: This project has been funded by La mission pour l’interdisciplinarité
du CNRS in the framework of the DEFI ENVIROMICS (project AREA). The authors
thank the Musée François Tillequin for providing the samples from the Guibourt
Collection.

Appendix A

Let vec(A) denote the vectorization of the matrix A formed by stacking the
columns of A into a single column vector. Let us apply the vec operator to Model
(3.2), then
vec(Y ) = vec(XB + E) = vec(XB) + vec(E).

Let $\mathcal{Y} = \mathrm{vec}(Y)$, $\mathcal{B} = \mathrm{vec}(B)$ and $\mathcal{E} = \mathrm{vec}(E)$. Hence,

$$\mathcal{Y} = \mathrm{vec}(XB) + \mathcal{E} = (I_q \otimes X)\mathcal{B} + \mathcal{E},$$

where we used that

$$\mathrm{vec}(AXB) = (B' \otimes A)\,\mathrm{vec}(X),$$

see (Mardia et al., 1979, Appendix A.2.5). In this equation, $B'$ denotes the transpose of the matrix $B$. Thus,

$$\mathcal{Y} = \mathcal{X}\mathcal{B} + \mathcal{E},$$

where $\mathcal{X} = I_q \otimes X$ and $\mathcal{Y}$, $\mathcal{B}$ and $\mathcal{E}$ are vectors of size $nq$, $pq$ and $nq$, respectively.
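Both vectorization identities used above can be checked numerically; a small numpy sketch (all dimensions are arbitrary), where `vec` stacks the columns, i.e. column-major (`'F'`) order:

```python
import numpy as np

rng = np.random.default_rng(5)
vec = lambda M: M.flatten(order="F")   # stack the columns of M

# General identity: vec(A X B) = (B' ⊗ A) vec(X)
A = rng.normal(size=(3, 4))
Xm = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 2))
lhs = vec(A @ Xm @ B)
rhs = np.kron(B.T, A) @ vec(Xm)

# Special case used in Appendix A: vec(X B) = (I_q ⊗ X) vec(B)
Xd = rng.normal(size=(6, 3))   # n x p design matrix
Bm = rng.normal(size=(3, 5))   # p x q coefficient matrix
lhs2 = vec(Xd @ Bm)
rhs2 = np.kron(np.eye(5), Xd) @ vec(Bm)
```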

Appendix B

Let us apply the vec operator to Model (3.5) where $\Sigma_q^{-1/2}$ is replaced by $\widehat{\Sigma}_q^{-1/2}$, then

$$\mathrm{vec}(Y\widehat{\Sigma}_q^{-1/2}) = \mathrm{vec}(XB\widehat{\Sigma}_q^{-1/2}) + \mathrm{vec}(E\widehat{\Sigma}_q^{-1/2}) = \big((\widehat{\Sigma}_q^{-1/2})' \otimes X\big)\,\mathrm{vec}(B) + \mathrm{vec}(E\widehat{\Sigma}_q^{-1/2}).$$

Hence,

$$\mathcal{Y} = \mathcal{X}\mathcal{B} + \mathcal{E},$$

where $\mathcal{Y} = \mathrm{vec}(Y\widehat{\Sigma}_q^{-1/2})$, $\mathcal{X} = (\widehat{\Sigma}_q^{-1/2})' \otimes X$ and $\mathcal{E} = \mathrm{vec}(E\widehat{\Sigma}_q^{-1/2})$.

Chapter 4

Estimation of large block structured covariance matrices: Application to “multi-omic” approaches to study seed quality

Scientific production

The content of this chapter is contained in the article submitted for publication: M. Perrot-Dockès, C. Lévy-Leduc, “Estimation of large block structured covariance matrices: Application to “multi-omic” approaches to study seed quality”, https://arxiv.org/abs/1806.10093. The method which is presented is implemented in the BlockCov R package available from the CRAN.

Abstract
Motivated by an application in high-throughput genomics and metabolomics, we
propose a novel, efficient and fully data-driven approach for estimating large block
structured sparse covariance matrices in the case where the number of variables is much larger than the number of samples, without limiting ourselves to block diagonal matrices.
Our approach consists in approximating such a covariance matrix by the sum of a
low-rank sparse matrix and a diagonal matrix. Our methodology can also deal with
matrices for which the block structure appears only if the columns and rows are permuted
according to an unknown permutation. Our technique is implemented in the R package
BlockCov which is available from the Comprehensive R Archive Network (CRAN) and
from GitHub. In order to illustrate the statistical and numerical performance of our package, some numerical experiments are provided, as well as a thorough comparison
with alternative methods. Finally, our approach is applied to the use of “multi-omic”
approaches for studying seed quality.

4.1 Introduction
Plant functional genomics refers to the description of the biological function of
a single or a group of genes and both the dynamics and the plasticity of genome

expression to shape the phenotype. Combining multi-omics such as transcriptomic,
proteomic or metabolomic approaches allows us to address in a new light the di-
mension and the complexity of the different levels of gene expression control and
the delicacy of the metabolic regulation of plants under fluctuating environments.
Thus, our era marks a real conceptual shift in plant biology where the individual
is no longer considered as a simple sum of components but rather as a system
with a set of interacting components to maximize its growth, its reproduction and
its adaptation. Plant systems biology is therefore defined by multidisciplinary and
multi-scale approaches based on the acquisition of a wide range of data as exhaustive
as possible.
In this context, it is crucial to propose new methodologies for integrating het-
erogeneous data explaining the co-regulations/co-accumulations of products of gene
expression (mRNA, proteins) and metabolites. In order to better understand these
phenomena, our goal will thus be to propose a new approach for estimating block
structured covariance matrix in a high-dimensional framework where the dimension
of the covariance matrix is much larger than the sample size. In this setting, it is
well known that the commonly used sample covariance matrix performs poorly. In
recent years, researchers have proposed various regularization techniques to consis-
tently estimate large covariance matrices or the inverse of such matrices, namely
precision matrices. To estimate such matrices, one of the key assumptions made
in the literature is that the matrix of interest is sparse, namely many entries are
equal to zero. A number of regularization approaches including banding, tapering,
thresholding and $\ell_1$ minimization, have been developed to estimate large covariance
matrices or their inverse such as, for instance, Ledoit & Wolf (2004), Bickel & Lev-
ina (2008), Banerjee et al. (2008b), Bien & Tibshirani (2011) and Rothman (2012)
among many others. For further references, we refer the reader to Cai & Yuan
(2012) and to the review of Fan et al. (2016).
In this paper, we shall consider the following framework. Let $E_1, E_2, \dots, E_n$ be $n$ zero-mean i.i.d. $q$-dimensional random vectors having a covariance matrix $\Sigma$ such
that the number q of its rows and columns is much larger than n. The goal of
the paper is to propose a new estimator of Σ and of the square root of its inverse,
Σ−1/2 , in the particular case where Σ is assumed to have a block structure without
limiting ourselves to diagonal blocks. An accurate estimator of Σ can indeed be very
useful to better understand the links between the columns of the observation matrix
and may highlight some biological processes. Moreover, an estimator of Σ−1/2 can
be very useful in the general linear model in order to remove the dependence that
may exist between the columns of the observation matrix. For further details on
this point, we refer the reader to Perrot-Dockès et al. (2018), Perrot-Dockès et al.
(2018) and to the R package MultiVarSel in which such an approach is proposed
and implemented for performing variable selection in the multivariate linear model
in the presence of dependence between the columns of the observation matrix.

More precisely, in this paper, we shall assume that

$$\Sigma = ZZ' + D, \qquad (4.1)$$

where $Z$ is a $q \times k$ sparse matrix with $k \ll q$, $Z'$ denotes the transpose of the matrix $Z$ and $D$ is a diagonal matrix such that the diagonal terms of $\Sigma$ are equal
to one. Two examples of such matrices Z and Σ are given in Figure 4.1 in the
case where k = 5 and q = 50 and in the case where the columns of Σ do not need
to be permuted in order to see the block structure. Based on (4.1), our model
could seem to be close to factor models described in Johnson & Wichern (1988) and
Fan et al. (2016). However, in Johnson & Wichern (1988), the high-dimensional
aspects are not considered and in Fan et al. (2016) the sparsity constraint is not
studied. Blum et al. (2016b) proposed a methodology which is based on the factor
model but with a sparsity constraint on the coefficients of Z which leads to a sparse
covariance matrix. Note also that the block diagonal assumption has already been
recently considered by Devijver & Gallopin (2018) for estimating the inverse of large
covariance matrices in high-dimensional Gaussian Graphical Models (GGM).

Figure 4.1 – Examples of matrices Σ generated from different matrices Z leading
to a block diagonal or to a more general block structure (extra-diagonal blocks).
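A matrix of the form (4.1) with the block structure of Figure 4.1 can be generated as follows (Python sketch; the block layout and the within-block correlation ρ are arbitrary illustrative choices, and a column of Z supported on two distant groups of rows would produce the extra-diagonal blocks):

```python
import numpy as np

def block_covariance(q, blocks, rho=0.7):
    """Sigma = Z Z' + D with unit diagonal: Z has one column per block, equal
    to sqrt(rho) on the rows of that block (row sets assumed disjoint), so
    that variables sharing a block have correlation rho."""
    Z = np.zeros((q, len(blocks)))
    for k, idx in enumerate(blocks):
        Z[list(idx), k] = np.sqrt(rho)
    D = np.diag(1.0 - np.diag(Z @ Z.T))    # forces a unit diagonal
    return Z @ Z.T + D, Z

# Two blocks among 10 variables; variables 7-9 belong to no block.
Sigma, Z = block_covariance(10, [range(0, 4), range(4, 7)], rho=0.7)
```

Since ρ < 1, both ZZ' and D are positive semi-definite with D strictly positive on the diagonal, so Σ is a valid (positive definite) correlation matrix.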

We also propose a methodology to estimate Σ in the case where the block struc-
ture is latent; that is, permuting the columns and rows of Σ renders visible its block
structure. An example of such a matrix Σ is given in Figure 4.2 in the case where
k = 5 and q = 50.
Our approach is fully data-driven and consists in providing a low rank matrix approximation of the $ZZ'$ part of $\Sigma$ and then in using an $\ell_1$ regularization to obtain a sparse estimator of $\Sigma$. When the block structure is latent, a hierarchical clustering step must be applied first. With this estimator of $\Sigma$, we explain how to obtain an estimator of $\Sigma^{-1/2}$.
Our methodology is described in Section 4.2. Some numerical experiments on
synthetic data are provided in Section 4.3. An application to the analysis of “-omic”
data to study seed quality is performed in Section 4.4.

Figure 4.2 – Examples of matrices Σ of Figure 4.1 in which the columns and rows
are randomly permuted.

4.2 Statistical inference


The strategy that we propose for estimating $\Sigma$ and $\Sigma^{-1/2}$ can be summarized as follows.
— First step: Low rank approximation. In this step, we propose to approximate the part $ZZ'$ of $\Sigma$ by a low rank matrix using a Singular Value Decomposition (SVD).
— Second step: Detecting the position of the non null values. In this step, we use a Lasso criterion to yield a sparse estimator $\widetilde{\Sigma}$ of $\Sigma$.
— Third step: Positive definiteness. We apply the methodology of Higham (2002) to $\widetilde{\Sigma}$ to ensure that the final estimator $\widehat{\Sigma}$ of $\Sigma$ is positive definite.
— Fourth step: Estimation of $\Sigma^{-1/2}$. In this step, $\Sigma^{-1/2}$ is estimated from the spectral decomposition of $\widehat{\Sigma}$ obtained in the previous step.

4.2.1 Low rank approximation


By definition of $Z$ in (4.1), $ZZ'$ is a $q \times q$ low rank matrix whose rank is smaller than or equal to $k \ll q$. In the first step, our goal is thus to propose a low rank approximation of an estimator of $ZZ'$.
Let $S$ be the sample $q \times q$ covariance matrix defined by

$$S = \frac{1}{n-1}\sum_{i=1}^n \left(E_i - \overline{E}\right)\left(E_i - \overline{E}\right)', \quad \text{with } \overline{E} = \frac{1}{n}\sum_{i=1}^n E_i,$$

where $E_i = (E_{i,1}, \dots, E_{i,q})'$. The corresponding $q \times q$ sample correlation matrix $R = (R_{i,j})$ is defined by:

$$R_{i,j} = \frac{S_{i,j}}{\sigma_i \sigma_j}, \quad \forall\, 1 \leq i, j \leq q, \qquad (4.2)$$

where

$$\sigma_i^2 = \frac{1}{n-1}\sum_{\ell=1}^n (E_{\ell,i} - \overline{E}_i)^2, \quad \text{with } \overline{E}_i = \frac{1}{n}\sum_{\ell=1}^n E_{\ell,i}, \quad \forall\, 1 \leq i \leq q.$$

Let us also consider the $(q-1) \times (q-1)$ matrix $\Gamma$ defined by:

$$\Gamma_{i,j} = R_{i,j+1}, \quad \forall\, 1 \leq i \leq j \leq q-1, \qquad (4.3)$$
$$\Gamma_{i,j} = \Gamma_{j,i}, \quad \forall\, 1 \leq j < i \leq q-1.$$

If $S$ were the real matrix $\Sigma$, the corresponding matrix $\Gamma$ would have a rank less than or equal to $k$. Since $S$ is an estimator of $\Sigma$, we shall use a rank $r$ approximation $\Gamma_r$ of $\Gamma$. This will be performed by keeping, in its singular value decomposition, only the $r$ largest singular values and by replacing the other ones by 0. By Eckart & Young (1936), this corresponds to the best rank $r$ approximation of $\Gamma$. The choice of $r$ will be discussed in Section 4.2.5.
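The truncated SVD step can be written in a few lines; a numpy sketch of the Eckart–Young approximation (the matrix here is synthetic, and the package itself works in R):

```python
import numpy as np

def low_rank_approx(G, r):
    """Best rank-r approximation in Frobenius norm (Eckart & Young): keep the
    r largest singular values and replace the remaining ones by zero."""
    U, s, Vt = np.linalg.svd(G)
    s[r:] = 0.0
    return (U * s) @ Vt          # U * s scales the columns of U by s

rng = np.random.default_rng(6)
A = rng.normal(size=(10, 3))
G = A @ A.T                      # an exactly rank-3 symmetric matrix

G3 = low_rank_approx(G, 3)       # recovers G (up to rounding)
G2 = low_rank_approx(G, 2)       # best rank-2 approximation
```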

4.2.2 Detecting the position of the non null values

Let us first explain the usual framework in which the Lasso approach is used. We consider a linear model of the following form

$$\mathcal{Y} = \mathcal{X}\mathcal{B} + \mathcal{E}, \qquad (4.4)$$

where $\mathcal{Y}$, $\mathcal{B}$ and $\mathcal{E}$ are vectors and $\mathcal{B}$ is sparse, meaning that it has a lot of null components.
In such models a very popular approach initially proposed by Tibshirani (1996) is the Least Absolute Shrinkage and Selection Operator (Lasso), which is defined as follows for a positive $\lambda$:

$$\widehat{\mathcal{B}}(\lambda) = \underset{\mathcal{B}}{\operatorname{argmin}}\left\{\|\mathcal{Y} - \mathcal{X}\mathcal{B}\|_2^2 + \lambda\|\mathcal{B}\|_1\right\}, \qquad (4.5)$$

where, for $u = (u_1, \dots, u_n)$, $\|u\|_2^2 = \sum_{i=1}^n u_i^2$ and $\|u\|_1 = \sum_{i=1}^n |u_i|$, i.e. the $\ell_1$-norm of the vector $u$. Observe that the first term of (4.5) is the classical least-squares criterion and that $\lambda\|\mathcal{B}\|_1$ can be seen as a penalty term. The interest of such a criterion is the sparsity enforcing property of the $\ell_1$-norm, ensuring that the number of non-zero components of the estimator $\widehat{\mathcal{B}}$ of $\mathcal{B}$ is small for large enough values of $\lambda$. Let

$$\mathcal{Y} = \mathrm{vec}_H(\Gamma_r), \qquad (4.6)$$

where vecH defined in Section 16.4 of Harville (2001) is such that for a n × n matrix

A,

    vecH(A) = (a₁*′, a₂*′, …, aₙ*′)′,

where aᵢ* is the sub-vector of the column i of A obtained by striking out the
i − 1 first elements. In order to estimate the sparse matrix ZZ′, we need to
propose a sparse estimator of Γr. To do this we apply the Lasso criterion
described in (4.5), where X is the identity matrix. In the case where X is an
orthogonal matrix, it has been shown in Giraud (2014) that the solution of
(4.5) is:
    B̂(λ)_j = X_j′Y (1 − λ/(2|X_j′Y|)),  if |X_j′Y| > λ/2,
    B̂(λ)_j = 0,                         otherwise,

where X_j denotes the jth column of X. Using the fact that X is the identity
matrix we get

    B̂(λ)_j = Y_j (1 − λ/(2|Y_j|)),  if |Y_j| > λ/2,                  (4.7)
    B̂(λ)_j = 0,                     otherwise.
We then reestimate the non null coefficients using the least-squares criterion
and get:

    B̃(λ)_j = Y_j,  if |Y_j| > λ/2,                                   (4.8)
    B̃(λ)_j = 0,    otherwise,

where Y is defined in (4.6).
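In scalar form, the soft-thresholding solution (4.7) and the hard-thresholding
refit (4.8) reduce to elementwise operations, sketched below (a NumPy
illustration under our notation, not the package code):

```python
import numpy as np

def soft_threshold(y, lam):
    """Lasso solution (4.7) when X is the identity: each entry is
    shrunk towards 0 by lam/2 and set to 0 below that threshold."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2, 0.0)

def hard_threshold(y, lam):
    """Least-squares refit (4.8): entries above the threshold lam/2
    are kept unchanged, the others are set to 0."""
    return np.where(np.abs(y) > lam / 2, y, 0.0)
```

For |y| > λ/2, sign(y)(|y| − λ/2) coincides with y(1 − λ/(2|y|)), so this
writing agrees term by term with (4.7).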


It has to be noticed that Γ̂r obtained in (4.7) satisfies the following
criterion:

    Γ̂r = Argmin_Θ { ‖Γr − Θ‖²_F + λ|Θ|₁ },

where ‖·‖_F denotes the Frobenius norm defined for a matrix A by
‖A‖²_F = Trace(A′A), and |M|₁ = ‖vec(M)‖₁ denotes the ℓ₁-norm of the vector
formed by stacking the columns of M. It is thus closely related to the
generalized thresholding estimator defined in Wen et al. (2016) and to the one
defined in Rothman (2012) with τ = 0, except that in our case |Θ⁻|₁ is replaced
by |Θ|₁, where Θ⁻ corresponds to the matrix Θ in which the diagonal terms are
replaced by 0. The diagonal terms of Σ were indeed already removed in Γr.
Hence, we get Γ̂r by elementwise soft-thresholding, that is by putting to zero
the values of Γr that are under a given threshold and by multiplying the non
null values by a coefficient containing this threshold.
Here, we choose to estimate Γr by Γ̃r(λ) defined through B̃(λ) in (4.8), which
corresponds to a hard-thresholding, and we set the upper triangular part of the
estimator Σ̃(λ) of Σ to be equal to Γ̃r(λ). Since the diagonal terms of Σ are
assumed to be equal to 1, we take the diagonal terms of Σ̃(λ) equal to 1. The
lower triangular part of Σ̃(λ) is then obtained by symmetry.
The choice of the best parameter λ, denoted λfinal in the following, will be
discussed in Section 4.3.2.

4.2.3 Positive definiteness


To ensure the positive definiteness of our estimator Σ̂ of Σ, we consider the
nearest correlation matrix to Σ̃(λfinal), which is computed by using the
methodology proposed by Higham (2002) and which is implemented in the function
nearPD of the R package Matrix, see Bates & Maechler (2017).
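The alternating-projections idea behind Higham's (2002) algorithm can be
sketched as follows (a simplified version of what nearPD implements, written
for illustration only; the function name is ours):

```python
import numpy as np

def nearest_correlation(A, n_iter=200):
    """Sketch of Higham's (2002) alternating projections with Dykstra's
    correction: alternate between the cone of positive semi-definite
    matrices and the set of symmetric matrices with unit diagonal."""
    Y, dS = A.copy(), np.zeros_like(A)
    for _ in range(n_iter):
        Rk = Y - dS
        w, V = np.linalg.eigh((Rk + Rk.T) / 2)
        X = (V * np.maximum(w, 0)) @ V.T     # projection onto the PSD cone
        dS = X - Rk                          # Dykstra's correction term
        Y = X.copy()
        np.fill_diagonal(Y, 1.0)             # projection onto unit diagonal
    return Y
```

The production implementation in Matrix::nearPD adds convergence tests and
refinements, but the fixed point is the same nearest correlation matrix.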

4.2.4 Estimation of Σ−1/2


Even if providing an estimator of a large covariance matrix can be very useful in
practice, it may also be interesting to efficiently estimate Σ−1/2 . Such an estimator
can indeed be used in the general linear model in order to remove the dependence
that may exist between the columns of the observations matrix. For further details
on this point, we refer the reader to Perrot-Dockès et al. (2018), Perrot-Dockès et al.
(2018) and to the R package MultiVarSel in which such an approach is proposed
and implemented for performing variable selection in the multivariate linear model
in the presence of dependence between the columns of the observation matrix.
Since Σ̂ is a symmetric matrix, it can be rewritten as U D U′, where D is a
diagonal matrix and U is an orthogonal matrix. The matrix Σ^{-1/2} can thus be
estimated by U D^{-1/2} U′, where D^{-1/2} is a diagonal matrix having its
diagonal terms equal to the square root of the inverse of the singular values
of Σ̂. However, inverting the square root of too small eigenvalues may lead to
poor estimators of Σ^{-1/2}. This is the reason why we propose to estimate
Σ^{-1/2} by

    Σ̂_t^{-1/2} = U D_t^{-1/2} U′,                                    (4.9)

where D_t^{-1/2} is a diagonal matrix such that its diagonal entries are equal
to the square root of the inverse of the diagonal entries of D, except for
those which are smaller than a given threshold t, which are replaced by 0 in
D_t^{-1/2}. The choice of t
will be further discussed in Section 4.3.7.
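The estimator (4.9) is a plain eigendecomposition in which small eigenvalues
are zeroed out before inversion; a minimal sketch (function name ours):

```python
import numpy as np

def inv_sqrt(S, t=0.1):
    """Estimator (4.9): write S = U D U' and return U D_t^{-1/2} U',
    where eigenvalues smaller than the threshold t contribute 0
    instead of being inverted."""
    w, U = np.linalg.eigh(S)
    d = np.zeros_like(w)
    keep = w > t
    d[keep] = 1.0 / np.sqrt(w[keep])
    return (U * d) @ U.T
```

For a well-conditioned S, inv_sqrt(S) @ S @ inv_sqrt(S) is close to the
identity; zeroed directions simply drop out of the product.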

4.2.5 Choice of the parameters


Our methodology for estimating Σ depends on two parameters: the number r of
singular values kept for defining Γr and the parameter λ, which controls the
sparsity level, namely the number of zero values in B̃(λ) defined in (4.8).
For choosing r, we shall compare two strategies in Section 4.3.1:

— The Cattell criterion based on Cattell's scree plot described in Cattell
(1966) and
— the PA permutation method proposed by Horn (1965) and recently studied
from a theoretical point of view by Dobriban (2018).
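As an illustration of the second strategy, a simplified version of Horn's PA
criterion can be written as follows (our own sketch: r is the number of
eigenvalues of the sample correlation matrix exceeding their averages over
column-wise permuted copies of E; names and defaults ours):

```python
import numpy as np

def parallel_analysis(E, n_perm=20, seed=0):
    """Simplified Horn (1965) PA: compare the sorted eigenvalues of the
    sample correlation matrix with their averages over column-wise
    permuted copies of E, and count how many exceed them."""
    rng = np.random.default_rng(seed)
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(E, rowvar=False)))[::-1]
    ref = np.zeros_like(obs)
    for _ in range(n_perm):
        # permuting each column independently destroys the correlations
        Ep = np.column_stack([rng.permutation(c) for c in E.T])
        ref += np.sort(np.linalg.eigvalsh(np.corrcoef(Ep, rowvar=False)))[::-1]
    return int(np.sum(obs > ref / n_perm))
```

On data with a genuine low-rank correlation structure, the count recovers the
number of latent factors.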
To choose the parameter λ in (4.8), we shall compare two strategies in Section
4.3.2:
— The BL approach proposed in Bickel & Levina (2008) based on cross-validation
and
— the Elbow method which consists in computing, for different values of λ, the
Frobenius norm ‖R − Σ̃(λ)‖_F, where R and Σ̃(λ) are defined in (4.2) and at the
end of Section 4.2.2, respectively. Then, it fits two simple linear regressions
and chooses the value of λ achieving the best fit.
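The Elbow rule can be sketched as a two-piece linear fit along the curve
λ ↦ ‖R − Σ̃(λ)‖_F; below, a generic breakpoint search of this kind (a sketch
under our own naming, not the package code):

```python
import numpy as np

def elbow_index(x, y):
    """Fit one simple linear regression on each side of every candidate
    breakpoint and return the split minimising the total residual sum
    of squares (the 'best fit' of the Elbow rule)."""
    def rss(u, v):
        coef = np.polyfit(u, v, 1)
        return float(np.sum((v - np.polyval(coef, u)) ** 2))
    best_k, best = None, np.inf
    for k in range(2, len(x) - 1):
        total = rss(x[:k], y[:k]) + rss(x[k:], y[k:])
        if total < best:
            best_k, best = k, total
    return best_k
```

On a curve that is exactly piecewise linear, the search recovers the true
breakpoint.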

4.3 Numerical experiments


Our methodology described in the previous section is implemented in the R
package BlockCov, which is available from the CRAN (Comprehensive R Archive
Network) and from GitHub.
We propose hereafter to investigate the performance of our approach for
different types of matrices Σ defined in (4.1) and for different values of n
and q. The four cases considered below correspond to different types of
matrices Z, the matrices D being chosen accordingly to ensure that the matrix
Σ has its diagonal terms equal to 1.
— Diagonal-Equal case. In this situation, Z has the structure displayed in the
left part of Figure 4.1, namely it has 5 columns such that the numbers of non
null values in the five columns are equal to 0.1 × q, 0.2 × q, 0.3 × q,
0.2 × q and 0.2 × q, respectively, and the non null values are equal to √0.7,
√0.75, √0.65, √0.8 and √0.7, respectively.
— Diagonal-Unequal case. In this scenario, Z has the same structure as for
the Diagonal-Equal case except that the non null values in the five columns
are not fixed but randomly chosen in [√0.6, √0.8], except for the third
column, for which the values are randomly chosen in [√0.3, √0.6].
— Extra-Diagonal-Equal case. Here, Z has the structure displayed in the right
part of Figure 4.1. The values of the columns of Z are the same as those of
the Diagonal-Equal case except for the fourth column, which is assumed to
contain additional non null values equal to -0.5 in the range
[0.35 × q, 0.45 × q].
— Extra-Diagonal-Unequal case. Z has the same structure as in the
Extra-Diagonal-Equal case except that the values are randomly chosen as in the
Diagonal-Unequal case, except for the fourth column where the additional non
null values are still equal to -0.5 in the range [0.35 × q, 0.45 × q].
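As an illustration, a matrix Σ of the Diagonal-Equal type can be built as
follows (our own sketch of the simulation design, with the block sizes and
values listed above; the function name is ours):

```python
import numpy as np

def diagonal_equal_sigma(q=100):
    """Toy Diagonal-Equal scenario: Sigma = Z Z' off the diagonal, with
    the diagonal set to 1 (the role of the matrix D). Z has 5 columns
    whose non null blocks have sizes 0.1q, 0.2q, 0.3q, 0.2q, 0.2q and
    values sqrt(0.7), sqrt(0.75), sqrt(0.65), sqrt(0.8), sqrt(0.7)."""
    sizes = [int(0.1 * q), int(0.2 * q), int(0.3 * q), int(0.2 * q), int(0.2 * q)]
    vals = np.sqrt([0.7, 0.75, 0.65, 0.8, 0.7])
    Z = np.zeros((q, 5))
    start = 0
    for k, (s, v) in enumerate(zip(sizes, vals)):
        Z[start:start + s, k] = v
        start += s
    Sigma = Z @ Z.T
    np.fill_diagonal(Sigma, 1.0)   # D compensates so that diag(Sigma) = 1
    return Sigma
```

Since the diagonal entries of ZZ′ are at most 0.8 < 1, the resulting matrix is
positive definite.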

Figure 4.3 – Illustration of PA and Cattell criteria for choosing r when q = 500
and n = 30 in the Extra-Diagonal-Unequal case. The value of r found by both
methodologies is displayed with a dotted line; the straight lines obtained for
the Cattell criterion and the eigenvalues of the permuted matrices in the PA
methodology are displayed in grey.

For n ∈ {10, 30, 50} and q ∈ {100, 500}, 100 n × q matrices E were generated
such that their rows E1, E2, …, En are i.i.d. q-dimensional zero-mean Gaussian
vectors having a covariance matrix Σ chosen according to the four previous
cases: Diagonal-Equal, Diagonal-Unequal, Extra-Diagonal-Equal or
Extra-Diagonal-Unequal.
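Generating such a matrix E is standard; a sketch via the Cholesky factor of Σ,
assumed positive definite (names ours):

```python
import numpy as np

def gaussian_rows(n, Sigma, seed=0):
    """Draw n i.i.d. zero-mean Gaussian q-dimensional rows with
    covariance Sigma, by right-multiplying standard normal rows by the
    transpose of the Cholesky factor of Sigma."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    return rng.standard_normal((n, Sigma.shape[0])) @ L.T
```

The empirical covariance of the generated rows converges to Σ as n grows.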

4.3.1 Low rank approximation
The approaches for choosing r described in Section 4.2.5 are illustrated in Figure
4.3 in the Extra-Diagonal-Unequal case. We can see from this figure that both
methodologies find the right value of r which is here equal to 5.
To go further, we investigate the behavior of our methodologies from 100 repli-
cations of the matrix E for the four different types of Σ. Figure 4.4 displays
the barplots associated with the estimation of r made in the different
replications by the two approaches for the different scenarios. We can see
from this figure that the PA
criterion seems to be slightly more stable than the Cattell criterion when n ≥ 30.
However, in the case where n = 10, the PA criterion underestimates the value of
r. Moreover, in terms of computational time, the performance of Cattell is much
better, see Figure 4.5.

4.3.2 Positions of the non null values


For the four scenarios, the performance of the two approaches: BL and Elbow
described in Section 4.2.5 for choosing λ and hence the number of non null values in

Figure 4.4 – Barplots corresponding to the number of times where each value of
r is chosen in the low-rank approximation from 100 replications for the two
methodologies in the different scenarios for the different values of n and q.

Figure 4.5 – Computational times of PA and Cattell criteria.

Σ̃(λ) is illustrated in Figure 4.6. This figure displays the True Positive Rate
(TPR) and the False Positive Rate (FPR) of the methodologies from 100
replications of the matrix E for the four different types of Σ and for
different values of n and q. We can see from this figure that the performance
of Elbow is on a par with the one of BL, except for the case where n = 10, for
which the performance of Elbow is slightly better in terms of True Positive
Rate. Moreover, in terms of computational time, the performance of Elbow is
much better, see Figure 4.7.

Figure 4.6 – Boxplots comparing the TPR (True Positive Rate) and the FPR (False
Positive Rate) of the two methodologies proposed to select the parameter λ from
100 replications in the different scenarios.

Figure 4.7 – Computational times of Elbow and BL criteria.

4.3.3 Comparison with other methodologies

The goal of this section is to compare the statistical performance of our approach
with other methodologies.
Since our goal is to estimate a covariance matrix containing blocks, we shall
compare our approach with clustering techniques. Once the groups or blocks have
been obtained, Σ is estimated by assuming that the corresponding matrix
estimator is block-wise constant except for the diagonal blocks, for which the
diagonal entries

are equal to 1 and the extra-diagonal terms are assumed to be equal. This gives
a great advantage to these methodologies in the Diagonal-Equal and in the
Extra-Diagonal-Equal scenarios. More precisely, let ρ_{i,j} denote the value of
the entries in the block having its rows corresponding to Group (or Cluster) i
and its columns to Group (or Cluster) j. Then, for a given clustering C:
    ρ_{i,j} = (1/(#C(i) #C(j))) ∑_{k∈C(i), ℓ∈C(j)} R_{k,ℓ},            if C(i) ≠ C(j),
    ρ_{i,j} = (1/(#C(i)(#C(i) − 1))) ∑_{k∈C(i), ℓ∈C(i), k≠ℓ} R_{k,ℓ},  if C(i) = C(j),    (4.10)

where C(i) denotes the cluster i, #C(i) denotes the number of elements in the
cluster C(i) and R_{k,ℓ} is the (k, ℓ) entry of the matrix R defined in
Equation (4.2).
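Equation (4.10) translates directly into code; a sketch with 0-based group
labels (names ours):

```python
import numpy as np

def block_constant(R, labels):
    """Clustering-based estimator of Eq. (4.10): each block of R is
    replaced by the mean of its entries, excluding the diagonal inside
    diagonal blocks, and the diagonal is set to 1."""
    labels = np.asarray(labels)
    S = np.empty_like(R)
    for gi in np.unique(labels):
        I = np.where(labels == gi)[0]
        for gj in np.unique(labels):
            J = np.where(labels == gj)[0]
            block = R[np.ix_(I, J)]
            if gi == gj:
                ni = len(I)
                # off-diagonal mean inside a diagonal block
                rho = (block.sum() - np.trace(block)) / (ni * (ni - 1)) if ni > 1 else 0.0
            else:
                rho = block.mean()
            S[np.ix_(I, J)] = rho
    np.fill_diagonal(S, 1.0)
    return S
```

When R is already block-constant with unit diagonal, the estimator reproduces
it exactly.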
For the matrices Σ corresponding to the four scenarios previously described, we
shall compare the statistical performance of the following methods:
— empirical which estimates Σ by R defined in (4.2),
— blocks which estimates Σ using the methodology described in this article with
the criteria PA and BL for choosing r and λ, respectively,
— blocks fast which estimates Σ using the methodology described in this article
with the criteria Cattell and Elbow for choosing r and λ, respectively,
— blocks real which estimates Σ using the methodology described in this article
when r and the number of non null values are assumed to be known, which gives
access to the best value of λ,


— hclust which estimates Σ by determining clusters using a hierarchical clus-
tering with the “complete” agglomeration method described in Hastie et al.
(2001) and then uses Equation (4.10) to estimate Σ,
— Specc which estimates Σ by determining clusters using spectral clustering
described in von Luxburg (2007) and estimates Σ with Equation (4.10),
— kmeans which estimates Σ by determining clusters from a k-means clustering
approach described in Hastie et al. (2001) and then uses Equation (4.10) to
estimate Σ.
In order to improve the performance of the clustering approaches: hclust, Specc
and kmeans, the real number of clusters has been provided to these methods. The
performance of the different approaches is assessed using the Frobenius norm of the
difference between Σ and its estimator.
Figure 4.8 displays the mean and standard deviations of the Frobenius norm
of the difference between Σ and its estimator for different values of n and q in
the four different cases: Diagonal-Equal, Diagonal-Unequal, Extra-Diagonal-
Equal and Extra-Diagonal-Unequal. We can see from this figure that in the

Figure 4.8 – Comparison of the Frobenius norm of Σ − Σ̂ for different
estimators Σ̂ of Σ and for different Σ.

case where n = 10, the performance of blocks fast is on a par with the one of
blocks real and is better than the one of blocks. In the case where n = 50, the
performance of blocks is slightly better than the one of blocks fast and is similar
to the one of blocks real. Moreover, in all cases, either blocks fast or blocks
outperforms the other approaches.
Then, the estimators of Σ derived from blocks, blocks fast and blocks real
were compared to the PDSCE estimator proposed by Rothman (2012) and implemented
in the R package PDSCE, and to the estimator proposed by Blum et al. (2016b)
and implemented in the FANet package, see Blum et al. (2016a). Since the
computational burden of PDSCE is high for large values of q, we limit ourselves
to the Extra-Diagonal-Equal case when n = 30 and q = 100 for the comparison.
Figure 4.9 displays the results. We can see from this figure that blocks,
blocks fast and blocks real provide better results than PDSCE and FANet.
However, it has to be noticed that PDSCE is not designed for dealing with block
structured covariance matrices but just for providing sparse estimators of
large covariance matrices.

4.3.4 Columns permutation


In practice, it may occur that the columns of the matrix E, whose rows are
E1, E2, …, En, are not ordered in a way which makes blocks appear in the matrix
Σ. To address this issue, we propose to perform a hierarchical clustering on E
beforehand and use the obtained permutation of the observations, which
guarantees that a cluster plot using this ordering will not have crossings of
the branches. Let us denote by E_ord the matrix E in which the columns have
been permuted according to this ordering and by Σ_ord the covariance matrix of
each row of E_ord. Then, we apply our methodology

Figure 4.9 – Comparison of the Frobenius norm of Σ̂ − Σ in the
Extra-Diagonal-Equal case for n = 30 and q = 100.

Figure 4.10 – Comparison of the Frobenius norms of Σ − Σ̂ and Σ_perm − Σ̂_perm.

to E_ord, which should provide an efficient estimator of Σ_ord. In order to get
an estimator of Σ, the columns and rows are permuted according to the ordering
coming from the hierarchical clustering.
To assess the corresponding loss of performance, we generated for each matrix
E used for making Figure 4.8 a matrix E_perm in which the columns of E were
randomly permuted. The associated covariance matrix is denoted Σ_perm. Then,
we applied the methodology described in the previous paragraph, denoted
blocks samp and blocks fast samp in Figure 4.10, thus providing Σ̂_perm. The
performance of this new methodology was compared to the methodology that we
proposed in the previous sections (denoted blocks and blocks fast in Figure
4.10) when the columns of E were not permuted. The results are displayed in
Figure 4.10. We can see from this figure that the performance of our approach
does not seem to be altered by the permutation of the columns.
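The reordering step can be emulated by any seriation of the columns. Below, a
greedy nearest-neighbour stand-in for the dendrogram leaf order actually used
by our methodology (a simplified sketch, not the package code):

```python
import numpy as np

def order_columns(E):
    """Greedy seriation: start from column 0 and repeatedly append the
    remaining column with the highest absolute correlation to the last
    one picked, so that correlated columns become contiguous."""
    C = np.abs(np.corrcoef(E, rowvar=False))
    q = C.shape[0]
    order, remaining = [0], set(range(1, q))
    while remaining:
        nxt = max(remaining, key=lambda j: C[order[-1], j])
        order.append(nxt)
        remaining.remove(nxt)
    return E[:, order], np.array(order)
```

On data with two interleaved groups of correlated columns, the returned order
makes each group contiguous.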

4.3.5 Numerical performance
Figure 4.11 displays the computational times for estimating Σ with the methods
blocks and blocks fast for different values of q ranging from 100 to 3000 and
n = 30. The timings were obtained on a workstation with 16 GB of RAM and
Intel Core i7 (3.66GHz) CPU. Our methodology is implemented in the R package
BlockCov which uses the R language (R Core Team, 2017) and relies on the R
package Matrix. We can see from this figure that it takes around 3 minutes to
estimate a 1000 × 1000 correlation matrix.

Figure 4.11 – Times in seconds to perform our methodology in the
Extra-Diagonal-Unequal case.

4.3.6 Choice of the threshold t for estimating Σ−1/2


Since we are interested in assessing the ability of Σ̂_t^{-1/2} defined in (4.9)
to remove the dependence that may exist between the columns of E, we shall
consider the Frobenius norm of Σ̂_t^{-1/2} Σ Σ̂_t^{-1/2} − Id_q, which should be
close to zero, where Id_q denotes the identity matrix of R^q. Figure 4.12
displays the Frobenius norm of Σ̂_t^{-1/2} Σ Σ̂_t^{-1/2} − Id_q for different
thresholds t. A threshold of 0.1 seems to provide a small error in terms of
Frobenius norm. Hence, in the following, t will be equal to 0.1 and
Σ̂_{0.1}^{-1/2} will be referred to as Σ̂^{-1/2}.
This technique was applied to all of the estimators of Σ discussed in Section
4.3.3 to get different estimators of Σ^{-1/2}. The Frobenius norm of the error
Σ̂^{-1/2} Σ Σ̂^{-1/2} − Id_q is used to compare the different estimators
obtained by considering the different estimators of Σ. The results are
displayed in Figure 4.13. We observe from


Figure 4.12 – Frobenius norm of Σ̂_t^{-1/2} Σ Σ̂_t^{-1/2} − Id_q, where
Σ̂_t^{-1/2} is computed for different thresholds t.
different thresholds t.

Figure 4.13 – Comparison of the Frobenius norm of the error
Σ̂^{-1/2} Σ Σ̂^{-1/2} − Id_q for different estimators Σ̂ of Σ.

this figure that in the case where n = 10 the estimators of Σ^{-1/2} derived
from the empirical, the blocks fast and the blocks real estimators of Σ perform
similarly and seem to be more adapted than the others to remove the dependence
among the columns of E. However, when n = 50, the behavior is completely
different. Firstly, in the Diagonal-Equal case, the estimator of Σ^{-1/2}
derived from the hclust estimator of Σ seems to perform better than the others.
Secondly, in the Diagonal-Unequal case, the estimators derived from blocks,
blocks fast and blocks real perform similarly to the one obtained from hclust.
Thirdly, in the Extra-Diagonal cases, the estimators derived from the blocks,
blocks fast and blocks real methodology perform better than the other
estimators.
Then, the estimators of Σ−1/2 derived from blocks, blocks fast and blocks real
were compared to the GRAB estimator proposed by Hosseini & Lee (2016). Since
the computational burden of GRAB is high for large values of q, we limit ourselves
to the Extra-Diagonal-Equal case when n = 30 and q = 100 for the comparison.
Figure 4.14 displays the results. We can see that blocks and blocks real provide
better results than GRAB. However, it has to be noticed that the latter approach
depends on a lot of parameters that were difficult to choose, thus we used the default
ones.

4.3.7 Use of Σ^{-1/2} to remove the dependence in multivariate linear models
Finally, we assess the performance of the BlockCov methodology to remove the
dependence in the columns of an observation matrix, in order to be used for
variable selection in the multivariate linear model as performed in the
MultiVarSel
Figure 4.14 – Comparison of the Frobenius norm of Σ̂^{-1/2} Σ Σ̂^{-1/2} − Id_q
in the Extra-Diagonal-Equal case.

R package:
Y = XB + E, (4.11)

where Y is a n × q response matrix, X is a n × p design matrix, B is a coefficients


matrix and E is an error matrix. Here, E 1 , E 2 , · · · , E n are n zero-mean i.i.d. q-
dimensional Gaussian random vectors having a covariance matrix Σ. To achieve
this goal, we generate observations Y according to this multivariate linear model.
We choose q = 100, p = 3, n = 30, and X is the design matrix of a one-way
ANOVA model. We compared our methodology with the one proposed by Perthame
et al. (2016) and implemented in the FADA R package, see Perthame et al.
(2019). We shall investigate the effect of the sparsity of B and of the signal
to noise ratio (SNR) for the four scenarios defining Σ on the selection of the
non null values of B in (4.11). Different signal to noise ratios are obtained
by multiplying B in (4.11) by a coefficient κ.
Since the results are barely influenced by the scenario chosen for Σ, only the
Extra-Diagonal-Equal case is displayed in Figure 4.15; the other scenarios are
available in Appendix 4.6.1. We can see from this figure that when the signal
to noise ratio is low and the value of s is high, meaning that there are a lot
of non-zero values, the FADA methodology performs better than the BlockCov
methodology. Nevertheless, in the three other cases the performance of
BlockCov is either better or on a par with the one of the FADA methodology.

4.4 Application to “multi-omic” approaches to study


seed quality
Climate change could lead to major crop failures in the world. In the present
study, we addressed the impact of the mother plant environment on seed
composition. Indeed, seed quality is of paramount ecological and agronomical
importance. Seeds

Figure 4.15 – Means of the ROC curves (left) and Precision-Recall curves
(right) obtained from 100 replications comparing the variables selected by the
MultiVarSel strategy using either Σ^{-1/2} obtained by BlockCov to remove the
dependence or the methodology proposed by FADA. κ is linked to the signal to
noise ratio and s denotes the sparsity level, i.e. the fraction of non-zero
elements in B.

are the most efficient form of dispersal of flowering plants in the environment. Seeds
are remarkably adapted to harsh environmental conditions as long as they are in
a quiescent state. Dry mature seeds (so-called “orthodox seeds”) are an
appropriate resource for preservation of plant genetic diversity in seedbanks.
It has been
reported that the temperature regime during seed production affects agronomical
traits such as seed germination potential, see Huang et al. (2014), MacGregor
et al. (2015) and Kerdaffrec & Nordborg (2017). In order to highlight
biomarkers of seed quality according to the thermal environment of the mother
plant, Arabidopsis seeds were produced under three temperature regimes
(14-16 °C, 18-22 °C or 25-28 °C under a long-day photoperiod). Dry mature
seeds were analysed by shotgun proteomics and GC/MS-based metabolomics, see
Durand et al. (2019). The choice to use the
model plant, Arabidopsis, was motivated by the colossal effort of the international
scientific community for its genome annotation. This plant remains at the forefront
of modern genetics, genomics, plant modelling and system biology, see Provart et al.
(2016). Arabidopsis provides a very useful basis to study gene regulatory networks,
and develop modelling and systems biology approaches for translational research
towards agricultural applications.

In this section, we apply our R packages BlockCov and MultiVarSel, see
Perrot-Dockès et al. (2019), to metabolomic and proteomic data to better
understand the impact of the temperature on the seed quality. More precisely,
we use the following
impact of the temperature on the seed quality. More precisely, we use the following

modeling for our observations:

Y = XB + E, (4.12)

where Y is a n × q matrix containing the responses of the q metabolites (resp. the q


proteins) for the n samples with n = 9, q = 199 (resp. q = 724) for the metabolomic
(resp. proteomic) dataset, X is a n × 3 design matrix of a one-way ANOVA model,
such that its first (resp. second, resp. third) column is a vector which is
equal to 1 if the corresponding sample grew under low (resp. medium, resp.
elevated) temperatures and 0 otherwise. B is a coefficient matrix and E is
such that its n
rows E 1 , E 2 , · · · , E n are n zero-mean i.i.d. q-dimensional random vectors having a
covariance matrix Σ. We used our R package BlockCov to estimate Σ and Σ−1/2
assuming that there exists a latent block structure in the covariance matrix of the
rows of E. More precisely, we assume that there exists some groups of metabolites
(resp. proteins) having the same behavior since they belong to the same biological
process. Then, we plugged this estimator into our R package MultiVarSel to obtain
a sparse estimation of B. Thanks to this estimator of B, we could identify the
metabolites (resp. proteins) having a higher (resp. lower) concentration when the
temperature is high or low.

4.4.1 Results obtained for the metabolomic data

We first estimated the matrices Σ and Σ^{-1/2} associated with E defined in
Equation (4.12) by using the methodology developed in this paper, namely the
BlockCov
package. By the results of Section 4.3, we know that the PA and BL approaches
performed poorly when n = 10. Since here n = 9, we used the Cattell and Elbow
criteria to choose r and λ, respectively. The results are displayed in Figure 4.16.
The Cattell criterion chooses r = 7 and the Elbow criterion chooses λ = 0.472, which
implies that among the 19701 coefficients of the correlation matrix only 6696 values
are considered as non null values.


Figure 4.16 – Illustration of the Cattell and Elbow criteria.

Figure 4.17 – Estimator of the correlation matrix Σ of the rows of E once the rows
and the columns have been permuted according to the ordering provided by the
hierarchical clustering.

The estimation of Σ obtained with our methodology is displayed in Figure 4.17
once the rows and the columns have been permuted according to the ordering pro-
vided by the hierarchical clustering to make visible the latent block structure.
Using the estimator of Σ−1/2 provided by the BlockCov package in the R package
MultiVarSel provides the sparse estimator of the matrix B defined in Model
(4.12) and displayed in Figure 4.18. We can see from this figure that for the
metabolite X5MTP the coefficient of the matrix B̂ is positive when the
temperature is high, which means that the production of the metabolite X5MTP
is larger in high temperature conditions than in low temperature conditions.
In order to go further in the biological interpretation, we wanted to better
understand the underlying block structure of the estimator Σ̂ of the
correlation matrix of the residuals based on metabolite abundances. Thus, we
applied a hierarchical clustering with 8 groups to this matrix in order to
split it into blocks. The corresponding dendrogram is on the left part of
Figure 4.17. The matrix containing the correlation means within and between
the blocks or groups of metabolites is displayed in Figure 4.19. The
composition of the metabolite groups is available in Appendix 4.6.2.
Interestingly, we could observe that X5MTP belongs to Group 6 which displays

Figure 4.18 – Sparse estimator of the coefficients matrix B obtained thanks to
the package MultiVarSel with a threshold of 0.95.

Figure 4.19 – Means of the correlations between the groups of metabolites.

a high correlation mean equal to 0.8 between the 14 metabolites that make it
up. At least 6 metabolites of this group belong to the same family, namely
glucosinolates (i.e. X4MTB, 4-methylthiobutyl glucosinolate; X5MTP,
5-methylthiopentyl glucosinolate; X6MTH, 6-methylthiohexyl glucosinolate;
X7MTH, 7-methylthioheptyl glucosinolate; X8MTO, 8-methylthiooctyl
glucosinolate; UGlucosinolate140.1, unidentified glucosinolate). Glucosinolates
(GLS) are specialized metabolites found in
Brassicaceae and related families (e.g. Capparaceae), containing a β-thioglucose
moiety, a sulfonated oxime moiety, and a variable aglycone side chain derived from
a α-amino acid. These compounds contribute to the plant’s overall defense mecha-
nism, see Wittstock & Halkier (2002). Methylthio-GLS are derivated from methio-
nine. Methionine is elongated through condensation with acetyl CoA and then, are
converted to aldoximes through the action of individual members of the cytochrome
P450 enzymes belonging to CYP79 family, see Field et al. (2004). The aldoxime un-
dergoes condensation with a sulfur donor, and stepwise converted to GLS, followed
by the side chain modification. The present results suggest that the accumulation
of methionine-derived glucosinolate family is strongly coordinated in Arabidopsis
seed. Moreover, we can see that they are influenced by the effect of the mother
plant thermal environment.

4.4.2 Results obtained for the proteomic data

The same study was conducted on the proteomic data. The estimator Σ̂ of the
correlation matrix of the residuals based on protein abundances obtained with
our methodology is displayed in Figure 4.20 once the rows and the columns have
been permuted according to the ordering provided by the hierarchical clustering
to make visible the latent block structure. To better understand the underlying
block structure of Σ̂, we applied a hierarchical clustering with 9 groups to
this matrix in order to split it into blocks. The corresponding dendrogram is
on the left part of Figure 4.20.
The matrix containing the correlation means within and between the blocks or
groups of proteins is displayed in Figure 4.21. We can see from this figure that
Group 8 has the highest correlation mean equal to 0.47. It consists of 34 proteins
which are given in Appendix 4.6.3.
A basic gene ontology analysis (http://geneontology.org/) showed that proteins
involved in response to stress (biotic and abiotic), in nitrogen and phosphorus
metabolic processes, in photosynthesis and carbohydrate metabolic process and in
oxidation-reduction process are overrepresented in this group, see Figure 4.22. Thus,
the correlation estimated within Group 8 seems to reflect a functional coherence of
the proteins of this group.
Figure 4.20 – Estimator of the correlation matrix of the residuals of the protein
accumulation measures once the rows and the columns of the residual matrix have
been permuted according to the ordering provided by the hierarchical clustering.

[Figure 4.21 matrix: the diagonal gives the within-group correlation means —
Group 6: 0.35, Group 2: 0.34, Group 5: 0.36, Group 3: 0.26, Group 9: 0.42,
Group 8: 0.47, Group 7: 0.27, Group 1: 0.37, Group 4: 0.44 — and the
off-diagonal entries give the between-group means.]

Figure 4.21 – Means of the correlations between the groups of proteins.

[Figure 4.22 bar chart: GO terms enriched in Group 8 — response to stress,
nitrogen compound metabolic process, response to abiotic stimulus, response to
biotic stimulus, phosphorus metabolic process, oxidation-reduction process,
carbohydrate metabolic process, photosynthesis.]

Figure 4.22 – Gene ontology (GO) term enrichment analysis of the 34 proteins
belonging to Group 8. Data from PANTHER overrepresentation test
(http://www.geneontology.org); one uploaded id (i.e. AT5G50370) mapped to two
genes, so the GO term enrichment was performed on 35 elements. Blue bars:
observed proteins in Group 8; orange bars: expected result from the reference
Arabidopsis genome.

The variable selection in the multivariate linear model using the R package
MultiVarSel provided 31 proteins differentially accumulated in seeds produced
under low, medium or elevated temperature. An aspartyl protease (AT3G54400),


belongs both to Group 8 and to the proteins selected by MultiVarSel. This
cell-wall-associated protein was up-accumulated in dry seeds produced under low
temperature. The gene encoding this protease was described as a cold-responsive
gene assigned to the C-repeat binding factor (CBF) regulatory pathway, see Vogel
et al. (2006). This pathway is required for the regulation of dormancy induced by
low temperatures, see Kendall et al. (2011). Consistently, in Figure 4.23, two
other proteins related to cell wall organization, a beta-glucosidase (BGLC1,
AT5G20950) and a translation elongation factor (eEF-1Bβ1, AT1G30230), were
differentially accumulated in seeds produced under contrasted temperatures.
eEF-1Bβ1 is associated with plant development and is involved in cell wall
formation, see Hossain et al. (2012). These results suggest that cell wall
rearrangements occur in response to temperature during seed maturation.
As displayed in Figure 4.23, six other proteins involved in mRNA translation
(AT1G02780, AT1G04170, AT1G18070, AT1G72370, AT2G04390 and AT3G04840) were
selected. The absolute failure of seed germination in the presence of protein
synthesis inhibitors underlines the essential role of translation in achieving
this developmental process, see Rajjou et al. (2004). Previous studies
highlighted the importance of selective and sequential mRNA translation during
seed germination and seed dormancy, see Galland et al. (2014), Bai et al. (2017)
and Bai et al. (2018). Thus, exploring translational regulation during seed
maturation and germination, through the dynamics of mRNA recruitment on polysomes
or through the neosynthesized proteome, is an emerging field in seed research.

[Figure 4.23 heatmap: coefficient estimates (color scale from below −5 to above
5) for the 31 selected proteins, from AT1G01470.1 to AT5G20950.1, across the
three maternal temperature conditions 14-16°C, 18-22°C and 25-28°C.]

Figure 4.23 – Values of the coefficients obtained using the package MultiVarSel
with a threshold of 0.95 on the proteomic dataset.

4.5 Conclusion
In this paper, we propose a fully data-driven methodology for estimating large
block-structured sparse covariance matrices in the case where the number of
variables is much larger than the number of samples, without limiting ourselves
to block-diagonal matrices. Our methodology can also deal with matrices for which
the block structure only appears once the columns and rows are permuted according
to an unknown permutation. Our technique is implemented in the R package BlockCov,
which is available from the Comprehensive R Archive Network and from GitHub. In
the course of this study, we have shown that BlockCov is a very efficient approach
from both the statistical and the numerical point of view. Moreover, its very low
computational load makes it usable even for very large covariance matrices with
several thousands of rows and columns.

Acknowledgments
We thank the members of the EcoSeed European project (FP7 Environment,
Grant/Award Number: 311840 EcoSeed, Coord. I. Kranner). IJPB was supported
by the Saclay Plant Sciences LABEX (ANR-10-LABX-0040-SPS). We also thank the
people who produced the biological material and performed the proteomic and
metabolomic analyses. In particular, we would like to thank Warwick University
(UWAR; Finch-Savage WE and Awan S) for the production of seeds, the Plant
Observatory-Biochemistry platform (IJPB, Versailles; Bailly M, Cueff G) for
having prepared the samples for proteomics and metabolomics, the PAPPSO Proteomic
Platform (GQE-Moulon; Balliau T, Zivy M) for the mass spectrometry-based proteome
analysis and the Plant Observatory-Chemistry/Metabolism platform (IJPB,
Versailles; Clement G) for the GC/MS-based metabolome analyses.

4.6 Appendix
4.6.1 Variable selection performance

[Figure 4.24 panels: mean ROC curves (TPR against FPR) for the four covariance
settings Diagonal_Equal, Diagonal_Unequal, Extradiagonal_Equal and
Extradiagonal_Unequal, for each combination of s ∈ {0.01, 0.3} and κ ∈ {1, 10},
comparing the blocks_fast and FADA approaches.]
Figure 4.24 – Means of the ROC curves obtained from 100 replications comparing
the variables selected by the MultiVarSel strategy using either Σ̂^{-1/2}
obtained by BlockCov to remove the dependence or the FADA methodology. κ is
linked to the signal-to-noise ratio and s denotes the sparsity level, i.e. the
fraction of non-zero elements in B.

[Figure 4.25 panels: mean precision-recall curves for the four covariance
settings Diagonal_Equal, Diagonal_Unequal, Extradiagonal_Equal and
Extradiagonal_Unequal, for each combination of s ∈ {0.01, 0.3} and κ ∈ {1, 10},
comparing the blocks_fast and FADA approaches.]

Figure 4.25 – Means of the precision-recall curves obtained from 100 replications
comparing the variables selected by the MultiVarSel strategy using either
Σ̂^{-1/2} obtained by BlockCov to remove the dependence or the FADA methodology.
κ is linked to the signal-to-noise ratio and s denotes the sparsity level, i.e.
the fraction of non-zero elements in B.

4.6.2 Groups of metabolites

Group 1 Group 2 Group 3 Group 4


Alanine Arginine Glutamate beta.Sitosterol
Asparagine Cystein alpha.Tocopherol Campesterol
Aspartate Gaba gamma.Tocopherol Eicosanoate
Glycine Glutamine Linolenic.acid Heptadecanoate
Isoleucine Tryptophan H2SO4 Stearic.acid
Leucine Linoleic.acid X2.Oxoglutarate Tetracosanoate
Lysine Quercetin Mannitol BenzoylX.Glucosinolate.3
Phenylalanine BenzoylGlucosinolate.3Breakdown Urea Sulfite
Proline Nonanenitrile.9.methylthio Fructose.6.P U2609.4.361
Serine UGlucosinolatebreakdown140.5 Digalactosylglycerol U3122.4.202.I3M.
Threonine X2.Hydroxyglutarate Galactinol dihydroxybenzoate
Tyrosine Citrate Galactosylglycerol beta.indole.3.acetonitrile
Valine Erythronate Rhamnose U1837.6.368
X5..methylthio.pentanenitrile Galactonate Stachyose U1841.9.394
Octanenitrile.8.methylthio Gluconate Sucrose U2003.8.293
UGlucosinolatebreakdown140.4 Glycerate U1093.6.147 U2371.1.361
Succinate Malate U1124.3.140 U2375.6.191
Threonate Allantoin U1530.2.314 U2513.2.296
Arabitol Erythritol U2053.6.321.1 U2513.2.296.1
myo.Inositol Ethanolamine U2109.3.305 U2692.9.361
Glycerol.3.P Sorbitol U2197.2.494 U2798.377
myo.Inositol.1.P Threitol U2315.2.245 U2942.2.556
Phosphate Xylitol U3898.1.204 U3063.0.361
U2206.2.299 Ethylphosphate U3415.9.498
Fructose Glucose
Glucopyranose..H2O. Mannose
U1154.3.156 Raffinose
U1393.172 Ribose
U1541.8.263 U1127.5.140
U1647.2.403 U1172.9.281
U1705.2.319.pentitol. U1559.4.217
U1729.0.273 U1628.9.233
U1816.2.228 U1849.2.285
U1859.2.246 U1927.0.204
U2076.9.204 U1939.1.210
U2170.6.361 U1983.0.217
U2184.1.299 U1983.0.217.1
U2251.5.361 U2012.7.361
U2278.6.361 U2282.4.349
U2550.7.149 U2400.1.179
U2731.2.160 U2779.9.361
U2857.8.342
U2929.1.297
U3041.1.361
U3080.7.361
U3100.8.361
Group 5 Group 6 Group 7 Group 8
Quercitrin X4MTB BenzoylGlucosinolate.2Breakdown Maleate
Dehydroascorbate X5MTP Hexanenitrile.6methylthio Pentonate.4
Fumarate X6MTH Sinapinate.trans U1408.4.298
Sinapinate.cis X7MTH Anhydroglucose U1617.8.146


Arabinose X8MTO U1125.1.140 U1767.3.243
Galactose UGlucosinolate140.1 U1290.198 U1904.9.204
U1127.4.169 U1129.9.184 U1371.5.151 U2828.8.361
U1718.0.157 U1270.1.240 U1549.7.130 U2839.3.312
U1931.5.202 U1897.2.327 U1568.5.313 U2882.5.297
U2261.0.218 U2473.361 U1592.8.217 U3008.3.457
U2412.1.157 U2529.8.361 U1700.6.288 U3168.2.290
U2588.9.535 U2756.4.271 U1759.4.331 U3218.5.297
U2688.5.333 U2924.3.361 U1852.0.217 U3910.6.597.Trigalactosylglycerol.
U3213.1.400 U3279.7.361 U1872.1.204.methyl.hexopyranoside. U2443.7.217
U1380.5.184 U1958.217
U2053.6.321
U2087.6.321
U2150.9.279
U2271.6.249
U3188.1.361
U3701.368
U4132.5.575

4.6.3 Groups of proteins


Proteins present in Group 8: AT1G14170.1, AT1G20260.1, AT1G42970.1, AT1G47980.1,
AT1G55210.1, AT1G75280.1, AT2G19900.1, AT2G22240.1, AT2G28900.1, AT2G32920.1,
AT2G37970.1, AT3G12580.1, AT3G13930.1, AT3G26650.1, AT3G26720.1, AT3G44300.1,
AT3G47930.1, AT3G54400.1, AT3G55800.1, AT4G16760.1, AT4G20830.1, AT4G25740.1,
AT4G34870.1, AT4G35790.1, AT5G11880.1, AT5G12040.1, AT5G14030.1, AT5G17380.1,
AT5G22810.1, AT5G26000.1, AT5G50370.1, AT5G66190.1, AT5G67360.1, ATCG00480.1.

Chapter 5
Study of the dialogue between dendritic cells and Th lymphocytes

Scientific output

This chapter is a summary of the article accepted in the journal Cell:

M. Grandclaudon*, M. Perrot-Dockès*, C. Trichot, O. Mostafa-Abouzid,
W. Abou-Jaoudé, F. Berger, P. Hupé, D. Thieffry, L. Sansonnet, J. Chiquet,
C. Lévy-Leduc, V. Soumelis. A quantitative multivariate model of
human dendritic cell-T helper cell communication.
*: these authors contributed equally to this publication. For more details, the
interested reader may refer to the full article available in the Appendix.

5.1 Introduction
Cells can communicate through the exchange of molecular signals produced by a
given cell and transmitted to a target cell, which in turn emits other signals in
response. To simplify the discussion, we will call "inputs" the signals emitted
by the source cell and "outputs" the signals emitted in response by the target
cell. This mode of communication requires the analysis of multiple signals
emitted by the cells. However, most studies are univariate and focus on the
effect of one input, or one group of inputs, on a single output. By neglecting
the diversity of inputs and outputs, they cannot approach a realistic context of
cell communication.
The objective of this project is to approach this realistic context in order to
better understand the communication between dendritic cells (DC) and T-helper
(Th) lymphocytes. A brief introduction to immunology and to the dialogue between
dendritic cells and Th lymphocytes is given in the introduction of this thesis,
in Section 1.4. A simple view of this dialogue is the following: when a dendritic
cell encounters a pathogen, it transmits various signals to a so-called naive Th
lymphocyte, which in response differentiates into a given profile and emits
other, specific signals. These signals vary with the type of pathogen encountered
by the dendritic cell. More precisely, in the presence of an intracellular
pathogen, the DC secretes interleukin 12 (IL12); once this signal is captured by
the naive T lymphocyte, the latter differentiates into a Th1 lymphocyte and
secretes, in particular, interferon gamma (IFNg) to fight this infection.
Likewise, in the presence of an extracellular parasite, the DCs secrete signals
that lead the naive T lymphocyte to differentiate into a Th2 lymphocyte, which
secretes interleukins 4, 5 and 13 (IL4, IL5, IL13). Numerous studies have
described other profiles, such as the Th17 profile, which is induced by the
presence of IL6, TNFa, IL23 and TGFb and whose cells secrete IL17A and IL17F in
response to the presence of external bacteria and fungi (see Park et al., 2005).
Figure 5.1, which is a simplified version of Figure 1 of Leung et al. (2010),
shows different Th profiles that have been described.

[Figure 5.1 diagram: a dendritic cell signals a naive T lymphocyte; IL12p70
drives differentiation into Th1 (secreting IFNg, IL2); IL4 drives Th2 (secreting
IL4, IL5, IL13); TGFb and IL6 drive Th17 (secreting IL17F, IL17A).]

Figure 5.1 – The different Th profiles

5.2 Description of the data

5.2.1 Experimental protocol

As part of a collaboration with Vassili Soumelis and Maximilien Grandclaudon of
the Institut Curie, we had access to data measuring the response of Th
lymphocytes to DC signals under various perturbation conditions designed to
reproduce in vitro the in situ and in vivo environment. In practice, a signal
acting on the activity of the DC (hereafter called a perturbator) is injected
into a group of DCs, which secretes signals in response; 24 hours after the
insertion of the perturbator into the dendritic cell culture, 36 signals secreted
by the DCs are measured. All the signals secreted by the DCs are then put in the
presence of a group of Th lymphocytes, which in turn secretes signals. Six days
later, 17 of the signals secreted by the Th lymphocytes are measured, as well as
the cell expansion. To simplify the discussion, we will hereafter call "inputs"
the DC signals and "outputs" the Th lymphocyte signals and the cell expansion.
This experiment was repeated on 428 pairs of DC group / Th lymphocyte group.
Figure 5.2 summarizes this protocol.

[Figure 5.2 diagram: perturbators act on a dendritic cell; the DC signals reach a
naive T lymphocyte, which emits the Th signals in response.]

Figure 5.2 – Representation of the experimental protocol and of the dialogue
between DCs and Th lymphocytes

5.2.2 A large diversity

In order to better understand the communication between DCs and Th lymphocytes,
the data are designed to capture a large diversity in DC secretion, and hence,
presumably, a large diversity in the signals secreted by the Th lymphocytes. To
this end, both the perturbator injected into the DCs and the type of DC are
varied in the experimental protocol described in Section 5.2.1. More precisely,
blood-derived DCs and monocyte-derived DCs are used, together with different
combinations and doses of 14 perturbators. The 428 samples are finally obtained
from 82 pairs of DC type / perturbator type. The diversity this generates in the
dendritic cell signals (resp. the Th lymphocyte signals) is represented by a
heatmap of the means, over several replicates, of the different signals under the
82 perturbation conditions (see the left (resp. right) part of Figure 5.3).
5.3 Modeling

We then apply the method described in Chapters 2 and 3, implemented in the
package MultiVarSel, to model the values of the Th lymphocyte signals as
functions of the DC signals. This method is based on Model (1.2):

Y = XB + E,
[Figure 5.3 heatmaps: rows are the 82 perturbation conditions (P1-P82); columns
are, on the left, the 36 DC signals (ICOSL, CD100, IL28a, CD11a, VISTA, ICAM2,
Jagged2, 41BBL, SLAMF3, CD29, B7H3, Galectin3, SLAMF5, CD18, ICAM3, CD58, PDL1,
CD40, CD54, CD80, PDL2, IL12p70, Nectin2, HLADR, CD83, CD86, PVR, IL23, IL10,
IL1, IL6, TNFa, CD70, LIGHT, OX40L, CD30L) and, on the right, the 17 Th signals
plus the cell expansion (Expansionfold, IL9, IL17F, IL17A, IL2, TNFa1, TNFb,
IL21, IL6, IFNg, IL10, IL22, IL5, GMCSF, IL3, IL31, IL13, IL4).]
Figure 5.3 – Heatmap representing the mean of the different signals over the 82
perturbation conditions, obtained from several replicates. The DC signals are
displayed on the left and the Th signals on the right.

where the matrix Y (resp. the matrix X) contains the concentrations of the q = 18
Th lymphocyte signals (resp. the p = 36 DC signals) for the 428 samples. In this
case n = 428 ≫ q = 18, so the empirical covariance matrix of E is used to model
the covariance matrix of the rows of E. Taking immunological knowledge into
account, and in order to keep numerous associations between the Th lymphocyte
signals and the dendritic cell signals, the stability selection threshold is set
here to 0.65. Once the variables are selected, the coefficients are re-estimated
by ordinary least squares. The values of the coefficients thus obtained are
displayed in Figure 5.4. Applying a hierarchical clustering to these coefficient
values recovers groups containing the Th lymphocyte signals characteristic of the
different profiles (see Figure 5.1), which supports the coherence of our results,
since signals belonging to the same Th profile are expected to be explained by
the same DC signals.
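The whitening idea behind this modeling can be sketched as follows (an illustrative Python reconstruction under our assumptions, not the MultiVarSel code): the empirical covariance of the residuals is inverted through its eigendecomposition, and the residuals are right-multiplied by Σ̂^{-1/2}, which decorrelates their columns.

```python
# Illustrative whitening sketch (not the MultiVarSel code): right-multiplying
# the residuals by the inverse square root of their empirical covariance
# decorrelates their columns.
import numpy as np

rng = np.random.default_rng(1)
n, q = 200, 4
A = rng.normal(size=(q, q))
Sigma = A @ A.T + q * np.eye(q)                 # a correlated noise model
E = rng.multivariate_normal(np.zeros(q), Sigma, size=n)
Sigma_hat = (E.T @ E) / n                       # empirical covariance (n >> q)
w, V = np.linalg.eigh(Sigma_hat)
Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T   # Sigma_hat^{-1/2}
E_white = E @ Sigma_inv_half
print(np.round(E_white.T @ E_white / n, 6))     # close to the identity
```

After this transformation the selection step can treat the columns of the whitened responses as uncorrelated.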
5.4 Biological validation

An interesting prediction of the model is the positive association between
IL12p70 and IL17F. Indeed, IL12p70 is known as a characteristic promoter of IFNg,
and hence of Th1, but is not known as a promoter of IL17F and of Th17. Moreover,
some studies have even shown an absence of impact, or even a negative impact, of
IL12p70 on Th17 cells. Two types of experiments were therefore set up to validate
this hypothesis.

[Figure 5.4 heatmap: estimated coefficients (color scale from −0.25 to 0.25)
linking each DC signal to each Th output; the Th outputs cluster into groups
matching the Th1 (IL21, IFNg, IL2), Th17 (IL9, IL17F, IL17A) and Th2 (IL5, IL4,
IL31, IL13) profiles.]

Figure 5.4 – Heatmap representing the coefficients obtained when modeling the Th
lymphocyte signals from the dendritic cell signals, using the strategy described
in Chapters 2 and 3 with a threshold of 0.65.


The first consists in perturbing dendritic cells with one of the pathogenic
agents present in the dataset (here zymosan, at a concentration of 10 µg/ml) and
then putting them in contact with Th lymphocytes, while blocking or not the
effect of IL12p70. This experiment was repeated on 7 independent samples. At the
5% level, it showed a decrease in IL17F when IL12p70 is blocked. However, it also
showed a decrease in the level of IL17A when IL12p70 is blocked, which was not
predicted by the model.

The second experiment consists in giving IL12p70 directly to a naive Th
lymphocyte. In this experiment, no significant difference was found. The effect
observed by the model and by the first experiment may be due to the effect of
IL12p70 in the presence of another input. We therefore propose a model designed
to reveal this type of effect. This model posits an association between IL12p70
and IL17F in the presence of the input IL1. This association was validated by
biological experiments, which indeed show that a Th lymphocyte produces more
IL17F in the presence of IL12p70 and IL1 than in the presence of IL1 alone.

This association between IL12p70, IL1 and IL17F is one example of a biological
validation obtained with the model. For further details on the different
biologically validated associations, the reader may refer to the article
submitted to the peer-reviewed journal Cell, available in the appendix.

Conclusion and perspectives

In this thesis we designed a method, suited to the high-dimensional setting, for
performing variable selection in the multivariate linear model. More precisely,
in Chapter 2, we propose a sparse estimator of the coefficients in the general
linear model and establish conditions under which the sign consistency of this
estimator holds. This estimator requires estimating the covariance existing
between the different response variables. In the simple case where the covariance
matrix is that of an autoregressive process of order 1 and the design matrix is
that of a balanced one-way ANOVA, we showed that the conditions for obtaining the
sign consistency of our estimator are satisfied. We then developed methods for
estimating high-dimensional covariance matrices when the matrix is assumed to be
a symmetric Toeplitz matrix (in Chapter 3) and when it is assumed to be a block
matrix (in Chapter 4). These different methods were applied to problems in
proteomics, metabolomics and immunology.

In this final chapter we first suggest directions that could extend our
consistency results to other types of covariance matrices. We then propose other
penalized estimators, stemming from the univariate linear model, that we
generalize to the multivariate case in order to take into account specificities
arising from the particular biological problems described at the end of this
chapter.

6.1 Towards other cases satisfying the sign-consistency conditions of our
estimator

The conditions of Theorem 2.5 involve the covariance matrix Σ as well as its
inverse. To show that these conditions hold in the AR(1) case, we benefited from
the fact that both matrices have a simple explicit form. To generalize to other
covariance matrices using the same strategy, one would ideally need explicit
forms of these matrices and of their inverses.

In the ARMA(p,q) case there is no direct explicit form for the covariance matrix
Σ_{p,q} or for its inverse. However, Haddad (2004) gives recurrence relations in
the parameter q to obtain Σ_{p,q} and Σ_{p,q}^{-1}, as well as recurrence
relations in the parameter p to obtain Σ_{p,0} and Σ_{p,0}^{-1}, which makes it
possible to obtain Σ_{p,q} and Σ_{p,q}^{-1} from the explicit matrices Σ_{1,0}
and Σ_{1,0}^{-1}.
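These explicit AR(1) forms are easy to check numerically. The following Python sketch (our own illustration) builds Σ_{1,0} with entries ρ^{|i−j|} together with its classical tridiagonal inverse and verifies that their product is the identity:

```python
# Numerical check of the explicit AR(1) forms: Sigma_{ij} = rho^{|i-j|}
# has a tridiagonal inverse.
import numpy as np

def ar1_covariance(n, rho):
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def ar1_precision(n, rho):
    # Tridiagonal inverse: diagonal (1, 1 + rho^2, ..., 1 + rho^2, 1),
    # off-diagonals -rho, all divided by 1 - rho^2.
    P = np.zeros((n, n))
    d = np.full(n, 1 + rho ** 2)
    d[0] = d[-1] = 1.0
    np.fill_diagonal(P, d)
    rows = np.arange(n - 1)
    P[rows, rows + 1] = P[rows + 1, rows] = -rho
    return P / (1 - rho ** 2)

n, rho = 8, 0.6
Sigma = ar1_covariance(n, rho)
print(np.max(np.abs(Sigma @ ar1_precision(n, rho) - np.eye(n))))
```

The same kind of numerical check could serve as a sanity test when implementing the ARMA(p,q) recurrences.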
Likewise, when the covariance matrix is block diagonal, its inverse is also block
diagonal, each block being inverted independently. In the Diagonal-Equal case
described in Chapter 4, the covariance matrix has L diagonal blocks where, for
every l in {1, ..., L}, the block l is of the form

a_l I_{q_l} + b_l J_{q_l},

where q_l is the size of block l, a_l and b_l are real numbers (with a_l
non-zero), I_{q_l} is the identity matrix of R^{q_l} and J_{q_l} is the
q_l × q_l matrix whose entries are all equal to 1. The inverse of such a matrix
has an explicit form:

(1/a_l) (I_{q_l} − (b_l / (a_l + q_l b_l)) J_{q_l}).
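This closed form can be checked numerically; a short Python sketch (illustration only, with J the all-ones matrix):

```python
# Numerical check of the closed-form inverse of a_l*I + b_l*J,
# where J is the all-ones matrix of size q_l.
import numpy as np

def block_inverse(a, b, q):
    I, J = np.eye(q), np.ones((q, q))
    return (I - (b / (a + q * b)) * J) / a

a, b, q = 2.0, 0.5, 5
M = a * np.eye(q) + b * np.ones((q, q))
print(np.max(np.abs(M @ block_inverse(a, b, q) - np.eye(q))))
```

The identity follows from J² = qJ: expanding the product leaves a coefficient b − (b/(a + qb))(a + qb) = 0 in front of J.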

However, in this case, the estimation of Σ depends not on one but on several
parameters. Moreover, the estimation errors for the matrix Σ and for its inverse
come not only from the error in estimating the parameters but also from the error
made in recovering the blocks. These two points make the verification of
assumptions (A9) and (A10) of Theorem 2.5 more intricate.

6.2 Variable selection using other penalized regressions

Chapters 2 and 3 of this thesis present a method for performing variable
selection in the general linear model in high-dimensional settings. However,
these methods do not make it possible to take into account biological knowledge,
or a specific problem, indicating that a group of response variables or a group
of explanatory variables is more likely, or even forced, to have similar effects.
In this section we focus on methods that can take such constraints into account.

6.2.1 Introduction to other penalized regressions

The lasso performs variable selection in the univariate linear model. However, in
some cases the variables are structured into L groups and it may be of interest
to select them group by group. To this end, Yuan & Lin (2006) propose the group
lasso estimator:

b̂_G = Argmin_{b_1, ..., b_L} { || y − Σ_{1≤l≤L} X^{(l)} b^{(l)} ||_2^2
      + λ Σ_{1≤l≤L} √(p_l) || b^{(l)} ||_2 },   (6.1)

where X^{(l)} is the matrix containing the variables belonging to group l and
p_l is the number of variables belonging to group l.
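The block-wise selection effect of the penalty in (6.1) can be illustrated through the proximal operator of the group-lasso penalty, which applies a group-wise soft-thresholding: a whole group of coefficients is either shrunk jointly or set to zero. A minimal Python sketch (the helper `group_soft_threshold` is our own illustration, not taken from the cited packages):

```python
# Illustrative group-wise soft-thresholding, the proximal operator behind
# the group-lasso penalty: each group is shrunk or zeroed as a block.
import numpy as np

def group_soft_threshold(b, groups, lam):
    out = np.zeros_like(b, dtype=float)
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        norm = np.linalg.norm(b[idx])
        thresh = lam * np.sqrt(len(idx))   # sqrt(p_l) group weight as in (6.1)
        if norm > thresh:
            out[idx] = (1 - thresh / norm) * b[idx]
    return out

b = np.array([3.0, 4.0, 0.1, -0.1])
groups = [1, 1, 2, 2]
shrunk = group_soft_threshold(b, groups, lam=1.0)
print(shrunk)
```

Here the second group, whose norm falls below the group threshold, is zeroed out entirely, while the first group is kept and shrunk as a block.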
Consider now that there may be reasons for the coefficients associated with
certain columns of X to be similar. To this end, Xin et al. (2014) propose the
generalized fused lasso estimator:

b̂_F = Argmin_b { || y − Xb ||_2^2 + λ_1 Σ_{(i,j)∈G} |b_i − b_j|
      + λ_2 || b ||_1 },   (6.2)

where G is the set of pairs (i, j) indexing the coefficients presumed to be
similar. Hoefling (2010) proposes an algorithm to compute b̂_F. Encouraging two
or more variables to have similar coefficients will hereafter be referred to as
fusion.
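To make the role of the fusion term concrete, here is a small Python sketch (our own illustration) of the objective in (6.2): with a sufficiently large λ_1, a vector whose presumed-similar coefficients are fused achieves a lower objective than the exact least-squares fit.

```python
# Illustrative evaluation of the generalized fused-lasso objective (6.2).
import numpy as np

def fused_lasso_objective(y, X, b, pairs, lam1, lam2):
    fit = np.sum((y - X @ b) ** 2)
    fusion = sum(abs(b[i] - b[j]) for i, j in pairs)
    return fit + lam1 * fusion + lam2 * np.sum(np.abs(b))

y = np.array([1.0, 2.0, 3.0])
X = np.eye(3)
pairs = [(0, 1)]                       # coefficients presumed similar
b_fused = np.array([1.5, 1.5, 3.0])    # pays a small fit cost, no fusion cost
b_free = np.array([1.0, 2.0, 3.0])     # exact fit, but pays the fusion cost
obj_fused = fused_lasso_objective(y, X, b_fused, pairs, lam1=2.0, lam2=0.0)
obj_free = fused_lasso_objective(y, X, b_free, pairs, lam1=2.0, lam2=0.0)
print(obj_fused, obj_free)
```

The solver of Hoefling (2010) minimizes this objective exactly; the sketch only evaluates it to show why fusion is favored.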

6.2.2 Extensions to the multivariate framework

In this section we adapt the penalties used in the group lasso and the fused
lasso to the multivariate case. This is done simply by using arguments similar to
those of Section 1.2.2 and of Chapter 2 (more precisely, to Equation (1.13)).
After such a transformation it is possible to apply the group lasso and fused
lasso algorithms, varying the groups according to the problem at hand. Indeed, in
the multivariate framework it may be of interest, for instance, to select an
explanatory variable, or a group of explanatory variables, for all the responses.
Likewise, it may be of interest to encourage the fusion of coefficients
associated with one or several explanatory variables across all the responses. We
implemented these different models in the R package VariSel, which makes it
possible to select an association (similar or different) of one or several
explanatory variables with one or all of the response variables. This package
uses the lasso, group lasso and fused lasso algorithms implemented respectively
in the packages glmnet (Friedman et al., 2010b), gglasso (Yang & Zou, 2017) and
FusedLasso (Hoefling, 2014). The package also contains functions to compare the
models obtained with these different penalties, for instance by bootstrap or by
cross-validation.
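The transformation mentioned above, which rewrites the multivariate model as a univariate one before applying these algorithms, rests on the vectorization identity vec(Y) = (I_q ⊗ X) vec(B) + vec(E), with vec stacking the columns. A minimal numpy check of the identity:

```python
# Minimal check of the vectorization identity vec(Y) = (I_q kron X) vec(B)
# (column-stacking convention), which turns the multivariate model into a
# univariate regression.
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 6, 3, 2
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, q))
Y = X @ B
vec = lambda M: M.reshape(-1, order="F")   # stack the columns of M
print(np.allclose(np.kron(np.eye(q), X) @ vec(B), vec(Y)))
```

Once the model is in this stacked form, the group and fusion penalties can be defined on any subset of the pq stacked coefficients.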
Figure 6.1 presents a simple example with two response variables Y1 and Y2 that
we try to explain with four explanatory variables X1^{G1}, X2^{G1}, X3^{G2},
X4^{G2}; these variables are divided into two groups: G1, composed of X1^{G1} and
X2^{G1}, and G2, composed of X3^{G2} and X4^{G2}. In this figure we represent six
models. The first row shows three "group-lasso"-type models, which select either
a group of explanatory variables for one response (left), or one explanatory
variable (without taking the groups into account) for all the responses (center),
or a group of explanatory variables for all the responses (right). The second row
shows three "fused-lasso"-type models, which encourage either a group of
explanatory variables to have the same coefficient for one response, or one
explanatory variable to have a similar coefficient for all the responses, or a
group of explanatory variables to have similar coefficients, both among
themselves and across all the responses. Note that the matrices represented in
Figure 6.1 are not actually estimated by the models; they serve to highlight
their differences. Indeed, in "group-lasso"-type models the coefficients
belonging to the same group are forced to be selected together but are not forced
to be equal, whereas in the "fused-lasso" case the coefficients are encouraged to
be equal without being forced to be selected together.

[Figure 6.1 diagram: six coefficient matrices with rows X1^{G1}, X2^{G1},
X3^{G2}, X4^{G2} and columns Y1, Y2 — group-type penalties (top row) and
fused-type penalties (bottom row), each applied to the variables, to the
responses, or to both.]

Figure 6.1 – Coefficient matrices that can be estimated using the group penalty
(6.1) or the fusion penalty (6.2), on the responses, on the groups, or on both.
Here the variable X1 is linked to the variable X2 and the variable X3 is linked
to the variable X4.

It would be interesting to show the sign consistency of these estimators under
certain conditions. To this end, mimicking our approach in the proof of
Theorem 2.5, one could draw on the theoretical properties of these estimators in
the univariate case. For instance, Bach (2008) gives assumptions under which the
sign consistency of the group lasso holds. Likewise, Rinaldo et al. (2009)
provide consistency results, adapted to the fused lasso case, when the matrix X
is the identity.
6.2.3 Applications
These methods can be useful in many biological applications. We first became interested in them through a collaboration with Charlotte Brault, Agnès Doligez, and Timothée Flutre of the AGAP and GQE units of INRA. The goal of this collaboration is to identify grapevine plants that are little affected by drought. To this end, we propose to look for DNA regions, possibly linked to one or several genes, that are specific to traits characterizing drought resistance. Such DNA regions are called quantitative trait loci (QTL). Here we focus on regions characteristic of the alleles of the different genes. A column X_i^(a) of a QTL matrix X then contains, for the n samples, the number of copies of allele a of a given gene i. In order to select, simultaneously, all the alleles linked to a given gene, we propose to use "group-lasso"-type models. Since drought resistance is characterized by several response variables, it can also be of interest to select all the alleles of a given gene for all the responses at once. Similarly, it may be worthwhile to encourage the coefficients linking the different response variables to an allele of a gene to be similar, which can be done with "fused-lasso"-type models.
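As an illustration of the kind of joint selection this enables, here is a minimal, self-contained sketch (not the implementation used in the thesis, which relies on dedicated R packages such as gglasso) of a group lasso fitted by proximal gradient descent on a toy QTL-like design, where each group gathers the allele columns of one gene; the data and group sizes are purely illustrative:

```python
import numpy as np

def prox_group(b, groups, t):
    """Block soft-thresholding: proximal operator of t * sum_g ||b_g||_2."""
    out = b.copy()
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        nrm = np.linalg.norm(b[idx])
        out[idx] = 0.0 if nrm <= t else (1.0 - t / nrm) * b[idx]
    return out

def group_lasso(X, y, groups, lam, n_iter=1000):
    """Proximal gradient descent for (1/2n)||y - Xb||^2 + lam * sum_g ||b_g||_2."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = prox_group(b - step * grad, groups, step * lam)
    return b

rng = np.random.default_rng(0)
n = 200
# toy design: 3 "genes" with 2 "alleles" each; only gene 0 influences y
groups = np.array([0, 0, 1, 1, 2, 2])
X = rng.standard_normal((n, 6))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

b_hat = group_lasso(X, y, groups, lam=0.5)
# both alleles of gene 0 are kept; the other genes are zeroed out jointly
```

The block soft-thresholding step is what sets entire groups of coefficients to zero at once, mirroring the selection of all alleles of a gene simultaneously.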

Finally, these models should allow a more specific analysis of the data described in Chapter 5. That chapter studies the link between signals secreted by dendritic cells (inputs) and the signals secreted in response by Th lymphocytes (outputs). These data were obtained from two types of dendritic cells, which it would be interesting to take into account in the modeling. Each input/output pair would then have two coefficients, one for each type of dendritic cell. Analyzing these data with a "fused-lasso"-type model, encouraging the two coefficients linking an input/output pair to the two types of dendritic cells to be equal, would highlight similarities and differences in the effect of the inputs on the outputs depending on the dendritic cell type.
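A minimal sketch of the key ingredient such a model would rely on (illustrative code, not the thesis' implementation): the proximal operator that fuses the two coefficients of a given input/output pair, one per dendritic cell type. Its closed form is standard: the mean of the pair is preserved while their difference is soft-thresholded, so sufficiently close coefficients become exactly equal.

```python
import numpy as np

def prox_fused_pair(z, t):
    """Proximal operator of t * |b1 - b2| for a pair z = (z1, z2):
    argmin_b 0.5 * ||b - z||^2 + t * |b1 - b2|.
    The mean of the pair is preserved and the difference is
    soft-thresholded by 2 * t."""
    z1, z2 = z
    m = (z1 + z2) / 2.0                               # mean, unaffected by the penalty
    d = z1 - z2
    d_new = np.sign(d) * max(abs(d) - 2.0 * t, 0.0)   # soft-threshold the difference
    return np.array([m + d_new / 2.0, m - d_new / 2.0])

# two coefficients of one input/output pair, one per dendritic cell type
print(prox_fused_pair((1.0, 0.4), t=0.5))  # fused: both become 0.7
print(prox_fused_pair((1.0, 0.4), t=0.1))  # kept distinct: 0.9 and 0.5
```

With a large fusion weight the two cell types are forced to share one coefficient; with a small weight the data may keep them distinct, which is exactly the similarity/difference contrast described above.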

The path from genotype to phenotype is certainly far less obscure than it was in the days of the "spermists" and the "ovists", yet it is not fully elucidated either. Being able to handle high-dimensional data can help clarify this problem, and it is within this framework that this thesis, which aims to provide suitable tools, is set. We hope that the methods described in this thesis will prove useful for addressing biological questions, while remaining aware of how long the road still is before the "mystery" of the genotype-to-phenotype mapping is solved.
Bibliography

Akaike, H. (1970). Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22(1) :203–217.

Alquier, P. and Doukhan, P. (2011). Sparsity considerations for dependent variables. Electron. J. Statist., 5 :750–774.

Audoin, C., Cocandeau, V., Thomas, O., Bruschini, A., Holderith, S., and Genta-
Jouve, G. (2014). Metabolome consistency : Additional parazoanthines from the
mediterranean zoanthid parazoanthus axinellae. Metabolites, 4 :421–432.

Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning.
Journal of Machine Learning Research, 9(Jun) :1179–1225.

Bai, B., Novák, O., Ljung, K., Hanson, J., and Bentsink, L. (2018). Combined trans-
criptome and translatome analyses reveal a role for tryptophan-dependent auxin
biosynthesis in the control of DOG1-dependent seed dormancy. New Phytologist,
217(3) :1077–1085.

Bai, B., Peviani, A., Horst, S., Gamm, M., Bentsink, L., and Hanson, J. (2017).
Extensive translational regulation during seed germination revealed by polysomal
profiling. New Phytologist, 214(1) :233–244.

Banerjee, O., Ghaoui, L. E., and D’aspremont, A. (2008a). Model selection through
sparse maximum likelihood estimation for multivariate gaussian or binary data.
Journal of Machine Learning Research, 9 :485–516.

Banerjee, O., Ghaoui, L. E., and d’Aspremont, A. (2008b). Model selection through
sparse maximum likelihood estimation for multivariate gaussian or binary data.
Journal of Machine learning research, 9(Mar) :485–516.

Bates, D. and Maechler, M. (2017). Matrix : Sparse and Dense Matrix Classes and
Methods. R package version 1.2-8.

Belloni, A., Chernozhukovand, V., and Wang, L. (2011). Square-root lasso : pivotal
recovery of sparse signals via conic programming. Biometrika, 98(4) :791–806.

Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. Ann. Statist., 36(6) :2577–2604.
Bien, J. and Tibshirani, R. J. (2011). Sparse estimation of a covariance matrix.
Biometrika, 98(4) :807–820.

Blum, Y., Houee-Bigot, M., and Causeur, D. (2016a). FANet : Sparse Factor
Analysis model for high dimensional gene co-expression Networks. R package
version 1.1.

Blum, Y., Houée-Bigot, M., and Causeur, D. (2016b). Sparse factor model for
co-expression networks with an application using prior biological knowledge.
Statistical applications in genetics and molecular biology, 15(3) :253—272.

Boccard, J. and Rudaz, S. (2016). Exploring omics data from designed experiments
using analysis of variance multiblock orthogonal partial least squares. Analytica
Chimica Acta, 920 :18 – 28.

Brockwell, P. and Davis, R. (1991). Time Series : Theory and Methods. Springer
Series in Statistics. Springer-Verlag New York.

Brockwell, P. J. and Davis, R. A. (1990). Time Series : Theory and Methods. Springer-Verlag New York, Inc., New York, NY, USA.

Cai, T., Liu, W., and Luo, X. (2011). A constrained l1 minimization approach
to sparse precision matrix estimation. Journal of the American Statistical
Association, 106(494) :594–607.

Cai, T. T. and Yuan, M. (2012). Adaptive covariance matrix estimation through block thresholding. Ann. Statist., 40(4) :2014–2042.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate
behavioral research, 1(2) :245–276.

Chen, Y., Du, P., and Wang, Y. (2014). Variable selection in linear models. Wiley
Interdisciplinary Reviews : Computational Statistics, 6(1) :1–9.

Chiquet, J., Mary-Huard, T., and Robin, S. (2016). Structured regularization for
conditional Gaussian graphical models. Statistics and Computing, 27(3) :789–804.

Desboulets, L. D. D. (2018). A Review on Variable Selection in Regression Analysis. Working paper or preprint.

Devijver, E. and Gallopin, M. (2018). Block-diagonal covariance selection for high-dimensional gaussian graphical models. Journal of the American Statistical Association, 113(521) :306–314.

Dieterle, F., Ross, A., Schlotterbeck, G., and Senn, H. (2006). Probabilistic quotient
normalization as robust method to account for dilution of complex biological mix-
tures. application in 1h nmr metabonomics. Analytical Chemistry, 78(13) :4281–
4290.
Dobriban, E. (2018). Permutation methods for factor analysis and PCA.
arXiv :1710.00479.

Draper, N. R. and Smith, H. (1998). Applied Regression Analysis. Wiley.

Durand, T. C., Cueff, G., Godin, B., Valot, B., Clément, G., Gaude, T., and Raj-
jou, L. (2019). Combined proteomic and metabolomic profiling of the arabidopsis
thaliana vps29 mutant reveals pleiotropic functions of the retromer in seed deve-
lopment. International journal of molecular sciences, 20(2) :362.

Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3) :211–218.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli-
hood and its oracle properties. Journal of the American Statistical Association,
96(456) :1348–1360.

Fan, J., Yuan, L., and Han, L. (2016). An overview of the estimation of large
covariance and precision matrices. The Econometrics Journal, 19(1) :C1–C32.

Faraway, J. J. (2004). Linear Models with R. Chapman & Hall/CRC.

Field, B., Cardon, G., Traka, M., Botterman, J., Vancanneyt, G., and Mithen,
R. (2004). Glucosinolate and amino acid biosynthesis in arabidopsis. Plant
Physiology, 135(2) :828–839.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance esti-
mation with the graphical lasso. Biostatistics, 9(3) :432.

Friedman, J., Hastie, T., and Tibshirani, R. (2010a). Regularization paths for ge-
neralized linear models via coordinate descent. Journal of Statistical Software,
33(1) :1–22.

Friedman, J., Hastie, T., and Tibshirani, R. (2010b). Regularization paths for ge-
neralized linear models via coordinate descent. Journal of Statistical Software,
33(1) :1–22.

Galland, M., Huguet, R., Arc, E., Cueff, G., Job, D., and Rajjou, L. (2014). Dynamic
proteomics emphasizes the importance of selective mrna translation and protein
turnover during arabidopsis seed germination. Molecular & Cellular Proteomics,
13(1) :252–268.

Gianola, D., Perez-Enciso, M., and Toro, M. A. (2003). On marker-assisted prediction of genetic value : beyond the ridge. Genetics, 163(1) :347–365.

Giraud, C. (2014). Introduction to High-Dimensional Statistics. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.
Haddad, J. N. (2004). On the closed form of the covariance matrix and its inverse
of the causal arma process. Journal of Time Series Analysis, 25(4) :443–448.

Harville, D. (2001). Matrix Algebra : Exercises and Solutions. Springer New York.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical
Learning. Springer Series in Statistics. Springer New York Inc., New York, NY,
USA.

Heinze, G., Wallisch, C., and Dunkler, D. (2018). Variable selection - A review and
recommendations for the practicing statistician. Biom J, 60(3) :431–449.

Higham, N. J. (2002). Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis, 22(3) :329–343.

Hoefling, H. (2010). A path algorithm for the fused lasso signal approximator.
Journal of Computational and Graphical Statistics, 19(4) :984–1006.

Hoefling, H. (2014). FusedLasso : Solves the generalized Fused Lasso. R package version 1.0.6.

Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis.
Psychometrika, 30(2) :179–185.

Horn, R. A. and Johnson, C. R. (1986). Matrix Analysis. Cambridge University Press, New York, NY, USA.

Hossain, Z., Amyot, L., McGarvey, B., Gruber, M., Jung, J., and Hannoufa, A.
(2012). The translation elongation factor eef-1bβ1 is involved in cell wall biosyn-
thesis and plant development in arabidopsis thaliana. PLoS One. e30425.

Hosseini, M. J. and Lee, S.-I. (2016). Learning sparse gaussian graphical models
with overlapping blocks. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I.,
and Garnett, R., editors, Advances in Neural Information Processing Systems 29,
pages 3808–3816. Curran Associates, Inc.

Hrydziuszko, O. and Viant, M. R. (2012). Missing values in mass spectrometry based metabolomics : an undervalued step in the data processing pipeline. Metabolomics, 8(1) :161–174.

Huang, Z., Footitt, S., and Finch-Savage, W. E. (2014). The effect of temperature
on reproduction in the summer and winter annual arabidopsis thaliana ecotypes
bur and cvi. Annals of botany, 113(6) :921–929.

Jacquard, A. (1978). Eloge de la différence. La génétique et les hommes.


Johnson, R. A. and Wichern, D. W., editors (1988). Applied Multivariate Statistical
Analysis. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

Karahalil, B. (2016). Overview of systems biology and omics technologies. Current medicinal chemistry, 23(37) :4221–4230.

Kendall, S. L., Hellwege, A., Marriot, P., Whalley, C., Graham, I. A., and Pen-
field, S. (2011). Induction of dormancy in arabidopsis summer annuals requires
parallel regulation of dog1 and hormone metabolism by low temperature and cbf
transcription factors. The Plant Cell, 23(7) :2568–2580.

Kerdaffrec, E. and Nordborg, M. (2017). The maternal environment interacts with genetic variation in regulating seed dormancy in swedish arabidopsis thaliana. PloS one, 12(12). e0190242.

Kirwan, J., Broadhurst, D., Davidson, R., and Viant, M. (2013). Characterising
and correcting batch variation in an automated direct infusion mass spectro-
metry (dims) metabolomics workflow. Analytical and Bioanalytical Chemistry,
405(15) :5147–5157.

Kuhl, C., Tautenhahn, R., Boettcher, C., Larson, T. R., and Neumann, S. (2012).
Camera : an integrated strategy for compound spectra extraction and annotation
of liquid chromatography/mass spectrometry data sets. Analytical Chemistry,
84 :283–289.

Lê Cao, K.-A., Boitard, S., and Besse, P. (2011). Sparse pls discriminant analy-
sis : biologically relevant feature selection and graphical displays for multiclass
problems. BMC Bioinformatics, 12(1) :253.

Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2) :365–411.

Lee, W. and Liu, Y. (2012). Simultaneous Multiple Response Regression and In-
verse Covariance Matrix Estimation via Penalized Gaussian Maximum Likelihood.
J. Multivar. Anal., 111 :241–255.

Leung, S., Liu, X., Fang, L., Chen, X., Guo, T., and Zhang, J. (2010). The cytokine
milieu in the interplay of pathogenic Th1/Th17 cells and regulatory T cells in
autoimmune disease. Cell. Mol. Immunol., 7(3) :182–189.

Lütkepohl, H. (2005). New introduction to multiple time series analysis. Springer, Berlin.

MacGregor, D. R., Kendall, S. L., Florance, H., Fedi, F., Moore, K., Paszkiewicz,
K., Smirnoff, N., and Penfield, S. (2015). Seed production temperature regula-
tion of primary dormancy occurs through control of seed coat phenylpropanoid
metabolism. New Phytologist, 205(2) :642–652.

Mallows, C. L. (1973). Some comments on c p. Technometrics, 15(4) :661–675.

Mardia, K., Kent, J., and Bibby, J. (1979). Multivariate analysis. Probability and
mathematical statistics. Academic Press.

Mardia, K. V., Kent, J. T., and Bibby, J. M. (1980). Multivariate Analysis (Probability and Mathematical Statistics). Academic Press.

Mehmood, T., Liland, K. H., Snipen, L., and Saebo, S. (2012). A review of variable
selection methods in partial least squares regression. Chemometrics and Intelligent
Laboratory Systems, 118 :62 – 69.

Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society : Series B (Statistical Methodology), 72(4) :417–473.

Meinshausen, N. and Buhlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society, 72(4) :417–473.

Meng, C., Kuster, B., Culhane, A. C., and Gholami, A. M. (2014). A multiva-
riate approach to the integration of multi-omics datasets. BMC Bioinformatics,
15(1) :162.

Molstad, A. J., Weng, G., Doss, C. R., and Rothman, A. J. (2018). An explicit
mean-covariance parameterization for multivariate response linear regression.

Mosmann, T. R., Cherwinski, H., Bond, M. W., Giedlin, M. A., and Coffman, R. L.
(1986). Two types of murine helper t cell clone. i. definition according to pro-
files of lymphokine activities and secreted proteins. The Journal of immunology,
136(7) :2348–2357.

Mosmann, T. R. and Coffman, R. (1989). Th1 and th2 cells : different patterns
of lymphokine secretion lead to different functional properties. Annual review of
immunology, 7(1) :145–173.

Muller, K. E. and Stewart, P. W. (2006). Linear Model Theory : Univariate, Multivariate, and Mixed Models. John Wiley & Sons.

Nicholson, J. K., Lindon, J. C., and Holmes, E. (1999). ’metabonomics’ : understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological nmr spectroscopic data. Xenobiotica, 29(11) :1181–1189. PMID : 10598751.

Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics, pages 758–765.

Park, H., Li, Z., Yang, X. O., Chang, S. H., Nurieva, R., Wang, Y.-H., Wang, Y.,
Hood, L., Zhu, Z., Tian, Q., et al. (2005). A distinct lineage of cd4 t cells regulates
tissue inflammation by producing interleukin 17. Nature immunology, 6(11) :1133.

Perrot-Dockès, M., Lévy-Leduc, C., Chiquet, J., Sansonnet, L., Brégère, M., Étienne,
M.-P., Robin, S., and Genta-Jouve, G. (2018). A variable selection approach in the
multivariate linear model : an application to lc-ms metabolomics data. Statistical
applications in genetics and molecular biology, 17(5).

Perrot-Dockès, M., Lévy-Leduc, C., Sansonnet, L., and Chiquet, J. (2018). Variable
selection in multivariate linear models with high-dimensional covariance matrix
estimation. Journal of Multivariate Analysis, 166 :78 – 97.

Perrot-Dockès, M., Lévy-Leduc, C., and Chiquet, J. (2019). MultiVarSel : Variable Selection in a Multivariate Linear Model. R package version 1.1.3.

Perthame, E., Friguet, C., and Causeur, D. (2016). Stability of feature selec-
tion in classification issues for high-dimensional correlated data. Statistics and
Computing, 26(4) :783–796.

Perthame, E., Friguet, C., and Causeur, D. (2019). FADA : Variable Selection for
Supervised Classification in High Dimension. R package version 1.3.4.

Pourahmadi, M. (2013). High-Dimensional Covariance Estimation. Wiley Series in Probability and Statistics.

Provart, N. J., Alonso, J., Assmann, S. M., Bergmann, D., Brady, S. M., Brkljacic,
J., ..., and Dangl, J. (2016). 50 years of arabidopsis research : highlights and
future directions. New Phytologist, 209(3) :921–944.

R Core Team (2017). R : A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Rajjou, L., Gallardo, K., Debeaujon, I., Vandekerckhove, J., Job, C., and Job, D.
(2004). The effect of α-amanitin on the arabidopsis seed proteome highlights
the distinct roles of stored and neosynthesized mrnas during germination. Plant
physiology, 134(4) :1598–1613.

Raven, P., Singer, S., Johnson, G., Mason, K., Losos, J., Bouharmont, J., Masson,
P., and Van Hove, C. (2017). Biologie. Biologie. De Boeck Supérieur.

Ren, S., Hinzman, A. A., Kang, E. L., Szczesniak, R. D., and Lu, L. J. (2015).
Computational and statistical analysis of metabolomics data. Metabolomics,
11(6) :1492–1513.

Riekeberg, E. and Powers, R. (2017). New frontiers in metabolomics : from measurement to insight. F1000Research, 6.

Rinaldo, A. et al. (2009). Properties and refinements of the fused lasso. The Annals
of Statistics, 37(5B) :2922–2952.

Rothman, A. J. (2012). Positive definite estimators of large covariance matrices.
Biometrika, 99(3) :733–740.

Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation
invariant covariance estimation. Electron. J. Statist., 2 :494–515.

Rothman, A. J., Levina, E., and Zhu, J. (2010). Sparse multivariate regression
with covariance estimation. Journal of Computational and Graphical Statistics,
19(4) :947–962.

Saccenti, E., Hoefsloot, H. C. J., Smilde, A. K., Westerhuis, J. A., and Hendriks, M.
M. W. B. (2013). Reflections on univariate and multivariate analysis of metabo-
lomics data. Metabolomics, 10(3) :361–374.

Schwarz, G. et al. (1978). Estimating the dimension of a model. The annals of statistics, 6(2) :461–464.

Smith, C., Want, E., O’Maille, G., Abagyan, R., and Siuzdak, G. (2006). XCMS :
Processing mass spectrometry data for metabolite profiling using Nonlinear peak
alignment, matching, and identification. Analytical Chemistry, 78(3) :779–787.

Smith, R., Mathis, A., and Prince, J. (2014). Proteomics, lipidomics, metabolomics :
a mass spectrometry tutorial from a computer scientist’s point of view. BMC
Bioinformatics, 15.

Stuart, J. M., Segal, E., Koller, D., and Kim, S. K. (2003). A gene-coexpression net-
work for global discovery of conserved genetic modules. Science, 302(5643) :249–
255.

Thompson, M. L. (1978). Selection of variables in multiple regression : Part i. a review and evaluation. International Statistical Review/Revue Internationale de Statistique, pages 1–19.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Royal. Statist. Soc B., 58(1) :267–288.

Varah, J. (1975). A lower bound for the smallest singular value of a matrix. Linear
Algebra and its Applications, 11(1) :3 – 5.

Verdegem, D., Lambrechts, D., Carmeliet, P., and Ghesquière, B. (2016). Improved
metabolite identification with midas and magma through ms/ms spectral dataset-
driven parameter optimization. Metabolomics, 12(6) :1–16.

Vogel, J. T., Cook, D., Fowler, S. G., and Thomashow, M. F. (2006). The cbf
cold response pathways of arabidopsis and tomato. Cold Hardiness in Plants :
Molecular Genetics, Cell Biology and Physiology, pages 11–29.

Volpe, E., Touzot, M., Servant, N., Marloie-Provost, M.-A., Hupé, P., Barillot, E.,
and Soumelis, V. (2009). Multiparametric analysis of cytokine-driven human th17
differentiation reveals a differential regulation of il-17 and il-22 production. Blood,
114(17) :3610–3614.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing,
17(4) :395–416.
Walker, A. (1964). Asymptotic properties of least-squares estimates of parameters
of the spectrum of a stationary non-deterministic time-series. Journal of the
Australian Mathematical Society, 4(3) :363–384.
Wen, F., Yang, Y., Liu, P., and Qiu, R. C. (2016). Positive definite estimation
of large covariance matrix using generalized nonconvex penalties. IEEE Access,
4 :4168–4182.
Wittenburg, D., Teuscher, F., Klosa, J., and Reinsch, N. (2016). Covariance between
genotypic effects and its use for genomic inference in half-sib families. G3 : Genes,
Genomes, Genetics, 6(9) :2761–2772.
Wittstock, U. and Halkier, B. A. (2002). Glucosinolate research in the arabidopsis
era. Trends in plant science, 7(6) :263–270.
Xin, B., Kawahara, Y., Wang, Y., and Gao, W. (2014). Efficient generalized fused
lasso and its application to the diagnosis of alzheimer’s disease. In Twenty-Eighth
AAAI Conference on Artificial Intelligence.
Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R.,
Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., et al. (2010).
Common snps explain a large proportion of the heritability for human height.
Nature genetics, 42(7) :565.
Yang, Y. and Zou, H. (2017). gglasso : Group Lasso Penalized Learning Using a
Unified BMD Algorithm. R package version 1.4.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with
grouped variables. Journal of the Royal Statistical Society Series B, 68 :49–67.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the gaussian gra-
phical model. Biometrika, 94(1) :19–35.
Zhang, A., Sun, H., Wang, P., Han, Y., and Wang, X. (2012). Modern analytical
techniques in metabolomics analysis. Analyst, 137 :293–300.
Zhang, H., Zheng, Y., Yoon, G., Zhang, Z., Gao, T., Joyce, B., Zhang, W., Schwartz,
J., Vokonas, P., Colicino, E., Baccarelli, A., Hou, L., and Liu, L. (2017). Regula-
rized estimation in sparse high-dimensional multivariate regression, with applica-
tion to a DNA methylation study. Stat Appl Genet Mol Biol, 16(3) :159–171.

Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of
Machine Learning Research, 7 :2541–2563.

Zheng, X. and Loh, W.-Y. (1995). Consistent variable selection in linear models.
Journal of the American Statistical Association, 90(429) :151–156.

Appendix

This appendix presents the article submitted to the journal Cell:

M. Grandclaudon*, M. Perrot-Dockès*, C. Trichot, O. Mostafa-Abouzid, W. Abou-Jaoudé, F. Berger, P. Hupé, D. Thieffry, L. Sansonnet, J. Chiquet, C. Lévy-Leduc, V. Soumelis
A quantitative multivariate model of human dendritic cell-T helper cell communication.
* : these authors contributed equally to this publication

A summary is available in Chapter 5.

Article

A Quantitative Multivariate Model of Human Dendritic Cell-T Helper Cell Communication

Authors
Maximilien Grandclaudon, Marie Perrot-Dockès, Coline Trichot, ..., Julien Chiquet, Céline Lévy-Leduc, Vassili Soumelis

Correspondence
[email protected]

In Brief
Grandclaudon et al. show that combinatorial rules that explain communication between dendritic cells and T helper cells can be helpful in vaccine design and immunotherapy.

[Graphical abstract: 1. Dataset generation: 36 DC communication molecules (TNF-α, IL-6, IL-12p70, IL-1, IL-23, IL-28α, IL-10, Galectin-3, LIGHT, CD70, CD30L, 4-1BBL, Jagged-2, OX40L, PDL1, ICAM-2, ICAM-3, SLAMF3, SLAMF5, CD80, CD40, CD100, VISTA, PVR, CD18, PDL2, ICOSL, CD86, CD54, CD83, CD29, CD58, HLA-DR, Nectin-2, CD11a, B7H3) measured across 82 distinct DC conditions, and 17 T helper cytokines (IL-2, IL-3, IL-4, IL-5, IL-6, IL-9, IL-10, IL-13, IL-17A, IL-17F, IL-21, IL-22, IL-31, IFN-γ, TNF-α, TNF-β, GM-CSF) measured in CD4 T cells. 2. Modeling: DC and T helper cell data integration into a mathematical model, Y = XB + E with (E_{i,1}, ..., E_{i,q}) i.i.d. ~ N(0, Σ_q) for all i in {1, ..., n}; prediction of 346 DC/T cell molecular associations. 3. Validation: (1) computational validation (stability selection, cross-validation); (2) literature validation (178 articles analyzed: 56 validated predictions, 290 novel predictions); (3) systematic experimental validation (70% and 73.2% validation); (4) validation of context-dependent effects (IL-12p70 + IL-1 inducing IL-17F in CD4 T cells).]

Highlights
- 428 protein-level measurements of 36 DC communication molecules and 17 Th cytokines
- Data-driven quantitative model of DC-T cell communication extensively validated
- Systematic and unbiased predictions of context-dependent mechanisms
- Validation of a new context-dependent role of IL-12p70 in Th17 differentiation

Grandclaudon et al., 2019, Cell 179, 432–447
October 3, 2019 © 2019 Elsevier Inc.
https://doi.org/10.1016/j.cell.2019.09.012
Article

A Quantitative Multivariate Model of Human


Dendritic Cell-T Helper Cell Communication
Maximilien Grandclaudon,1,2,8 Marie Perrot-Dockès,3,8 Coline Trichot,1,2,8 Léa Karpf,1,2 Omar Abouzid,1,2
Camille Chauvin,1,2 Philémon Sirven,1,2 Wassim Abou-Jaoudé,4 Frédérique Berger,1,5,6 Philippe Hupé,1,6,7
Denis Thieffry,4 Laure Sansonnet,3 Julien Chiquet,3 Céline Lévy-Leduc,3 and Vassili Soumelis1,2,9,*
1Institut Curie, Centre de Recherche, PSL Research University, 75005 Paris, France
2INSERM U932, Immunity and Cancer, 75005 Paris, France
3UMR MIA-Paris, AgroParisTech, INRA—Université Paris-Saclay, 75005 Paris, France
4Computational Systems Biology Team, Institut de Biologie de l’École Normale Supérieure, Centre National de la Recherche Scientifique UMR8197, INSERM U1024, École Normale Supérieure, PSL Université, 75005 Paris, France
5Institut Curie, PSL Research University, Unit of Biostatistics, 75005 Paris, France
6Institut Curie, PSL Research University, INSERM U900, 75005 Paris, France
7Mines Paris Tech, 77305 Cedex Fontainebleau, France
8These authors contributed equally
9Lead Contact

*Correspondence: [email protected]
https://doi.org/10.1016/j.cell.2019.09.012

SUMMARY

Cell-cell communication involves a large number of molecular signals that function as words of a complex language whose grammar remains mostly unknown. Here, we describe an integrative approach involving (1) protein-level measurement of multiple communication signals coupled to output responses in receiving cells and (2) mathematical modeling to uncover input-output relationships and interactions between signals. Using human dendritic cell (DC)-T helper (Th) cell communication as a model, we measured 36 DC-derived signals and 17 Th cytokines broadly covering Th diversity in 428 observations. We developed a data-driven, computationally validated model capturing 56 already described and 290 potentially novel mechanisms of Th cell specification. By predicting context-dependent behaviors, we demonstrate a new function for IL-12p70 as an inducer of Th17 in an IL-1 signaling context. This work provides a unique resource to decipher the complex combinatorial rules governing DC-Th cell communication and guide their manipulation for vaccine design and immunotherapies.

INTRODUCTION

Cell-cell communication involves the exchange of molecular signals produced by a given cell and transmitting an effect through specific receptors expressed on target cells. This process requires integration of multiple communication signals of different nature during homeostatic or stress-related responses. For example, differentiation of pluripotent hematopoietic stem cells into mature myeloid or lymphoid blood cells requires the collective action of multiple cytokines, growth factors, and Notch ligands (Balan et al., 2018). In the context of stress, multiple signals need to be integrated by innate and adaptive immune cells, including cytokines, growth factors, inflammatory mediators, and immune checkpoints (Chen and Flies, 2013; Macagno et al., 2007). In most studies, these communication molecules have been studied as individual stimuli to a target cell by gain- and loss-of-function experiments. This provides important knowledge regarding the downstream effects of the signals but prevents us from widely addressing their function in various contexts of other co-expressed communication signals.

Context dependency is an important aspect of verbal language communication that can directly affect the meaning of individual words but also modify the logic of syntactic rules (Cariani and Rips, 2017; Kintsch and Mangalath, 2011). Similarly, context dependencies may dramatically affect the function of biologically active communication signals. For example, we have shown that 90% of the transcriptional response to type I interferon in human CD4 T cells depends on the cytokine context (T helper 1 [Th1], Th2, or Th17; Touzot et al., 2014). Other studies have identified major context-dependent functions of immune checkpoints, such as OX40-ligand (Ito et al., 2005), and regulatory cytokines, such as transforming growth factor β (TGF-β) (Ivanov et al., 2006; Manel et al., 2008; Volpe et al., 2008). These studies suggest that communication molecules function as words of a complex language with grammar defining combinatorial rules of co-expression and mutual influence of one signal over the function (meaning) of another signal.

Three levels of biological complexity need to be integrated to decipher those combinatorial rules: (1) the multiplicity of input communication signals to include as many possible contextual effects; (2) communication signals at their naturally occurring concentrations; and (3) a large number of output responses in target cells, reflecting the effect of cell-cell communication quantitatively and qualitatively. Those three levels create a bottleneck in deciphering cell-cell communication.

Cell 179, 432–447, October 3, 2019 © 2019 Elsevier Inc.
Here we developed an integrative approach combining (1) coupled protein-level measurement of multiple communication signals and output response molecules in target cells; (2) a multivariate mathematical modeling strategy enabling us to infer the input-output relationships for individual signals, taking into account the context and configuration of all other signals; and (3) experimental validation of model-derived hypotheses. We applied this framework to decipher human dendritic cell (DC)-Th cell communication, which potentially involves over 70 individual molecular stimuli (Chen and Flies, 2013), including cytokines, tumor necrosis factor (TNF) family members, integrins, nectins, Notch ligands, and galectins (Tindemans et al., 2017; Zhu et al., 2010; Zygmunt and Veldhoen, 2011). These molecules can all be expressed by DCs and function as communication signals to T cells (hereafter called Th stimuli). They can be measured at the protein level by highly specific assays to optimize biological relevance.

By using this unbiased data-driven approach, we could capture the simultaneous effects of large numbers of DC-T cell communication signals in naturally occurring patterns and expression levels. Our systems-level model revealed novel emergent and context-dependent mechanisms controlling Th cell differentiation. A similar framework can be applied to systematically decipher the communication of other cell types.

RESULTS

Generation of a Unique Multivariate Dataset of Human DC-Th Cell Communication
To induce a broad range of DC molecular states expressing various patterns of communication signals, human monocyte-derived DCs (MoDCs) and primary blood CD11c+ DCs (bDCs) were activated for 24 h with a diversity of DC-modulating signals (hereafter called DC perturbators). These included 14 distinct stimuli that were grouped in three categories reflecting various physiopathological contexts: (1) the endogenous factors interferon β (IFN-β), GM-CSF, TSLP, and PGE2; (2) the Toll-like receptor ligands lipopolysaccharide (LPS) (a Toll-like receptor 4 [TLR4] agonist), PAM3CSK4 (a TLR1 and 2 agonist), Curdlan (a Dectin1 agonist), zymosan (a TLR2 and Dectin1 agonist), R848 (a TLR7 and 8 agonist), poly(I:C) (a TLR3 agonist), and aluminum potassium sulfate (Alum, an NLRP3 inflammasome inducer); and (3) the whole pathogens heat-killed Candida albicans (HKCA), heat-killed Listeria monocytogenes (HKLM), heat-killed Staphylococcus aureus (HKSA), heat-killed Streptococcus pneumoniae (HKSP), and influenza virus (flu). These 14 DC perturbators were used in distinct doses and combinations to further increase the diversity of DC communication molecules and downstream functional effects (Table S1). In each independent experiment, we included a medium condition as a negative control and LPS (100 ng/mL) and/or zymosan (10 μg/mL) as positive controls. A total of 66 perturbators were used on MoDCs and 16 on bDCs, totaling 82 distinct "DC conditions" (C1-C82; Table S1).

Under each DC condition, we measured 36 DC-expressed molecules that influence Th cell differentiation in at least one published study (STAR Methods) and can be measured with a highly specific antibody-based assay. Twenty-nine were measured by fluorescence-activated cell sorting (FACS) at the DC surface (Figure S1A), and 7 were measured in the 24-h DC culture supernatant (STAR Methods).

Following 24-h culture under each of the 82 DC perturbation conditions, the same DC batch was used to stimulate naive CD4 T cells in a heterologous co-culture system. On day 6 of co-culture, we measured Th cell expansion fold (Exp Fold) and a total of 17 distinct Th cytokines broadly representing the spectrum of Th cell output responses (STAR Methods). In total, we produced a unique dataset of coupled measurements of DC-derived Th stimuli and Th response cytokines from 428 independent observations from 44 independent donors (Figure 1A; Table S2).

Variability and Specificity of DC Communication Signals
We asked whether our systematic DC stimulation strategy could generate important variations in the expression of individual DC-derived Th stimuli. All Th stimuli were expressed over at least three logs (Figure 1B) with high coefficients of variation (>0.44; Figure 1C). Interleukins had higher variability (10^4-10^5) and high coefficients of variation, from 2.72 for interleukin-12p70 (IL-12) to 1.43 for IL-6. CD11a had a wide expression range (10^4) but the smallest coefficient of variation (0.44), with values distributed around the mean (Figure 1C). Hence, we were able to generate highly variable expression patterns for all Th stimuli.

We sought to identify conserved and specific patterns of Th stimuli in response to standard DC perturbators. We compared the expression levels of DC-derived Th stimuli under three conditions belonging to distinct classes of microbes, LPS (100 ng/mL, bacteria), zymosan (10 μg/mL, fungi), and flu (1X, viruses), that were used across at least 17 MoDC biological replicates (Figure 1D). Medium MoDCs (negative control) expressed lower levels of activation-associated communication molecules (Figures 1D and S1B). We confirmed previous findings, validating our experimental system: (1) zymosan specifically induced IL-10 and IL-23, (2) flu induced a large amount of IL-28α, and (3) LPS and zymosan induced a large amount of IL-12 (Figures 1D and S1B). In addition, we identified novel specific inductions of DC-derived Th stimuli: zymosan-treated MoDCs expressed the highest levels of CD54 and PVR, flu-treated MoDCs specifically induced ICOSL, and LPS-treated MoDCs induced the highest levels of CD30L and CD83 (Figure 1D). Specificity of expression of a given signal for a given DC stimulation was determined using the Wilcoxon statistical test (Figure S1B). Hence, standard DC perturbators induced specific patterns of Th stimuli.

Defining the Spectrum of DC Communication States
Next we aimed to assess the spectrum of DC communication states, as defined by their expression pattern of communication signals, across the 82 DC conditions. We computed the mean expression of biological replicates for each DC-derived Th stimulus and performed unsupervised hierarchical clustering to identify classes of the most similar conditions (C1-C82, y axis) and DC-derived Th stimuli (x axis) (Figure 2A). This revealed five groups of DC conditions (Figure 2B). Each of the four standard DC conditions (Figure 1D) belonged to a different group (Figure 2A).

Group 1 was defined by high expression of adhesion molecules such as CD18, ICAM-2, ICAM-3, and CD29 and low levels
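The specificity calls above (a given signal induced significantly more by one stimulus than by another) rest on a Wilcoxon test across biological replicates. A minimal scipy sketch on hypothetical paired per-donor values; the signal name, condition names, and effect sizes are illustrative, not the paper's data:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical paired measurements (pg/mL) of one DC signal in 17 donors
# under two stimulations; the effect size is made up for illustration.
il23_zymosan = rng.lognormal(mean=8.0, sigma=0.5, size=17)
il23_lps = rng.lognormal(mean=6.5, sigma=0.5, size=17)

# One-sided Wilcoxon signed-rank test on the paired per-donor differences:
# is the signal specifically higher under zymosan in the same donors?
stat, pval = wilcoxon(il23_zymosan, il23_lps, alternative="greater")
```

With paired donor-matched measurements the signed-rank test is the natural choice; for unpaired groups of conditions a rank-sum (Mann-Whitney) test would be used instead.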



[Figure 1, panels A, B, and D. (A) Experimental strategy: 82 distinct DC conditions (C1-C82; single signals such as flu, zymosan, or GM-CSF; multiple stimuli such as LPS + R848 or LPS + zymosan; multiple doses) applied to MoDCs or CD11c+ bDCs from 44 donors in total; inputs are 36 DC protein signals and outputs 17 T helper (Th) cytokines plus expansion fold (428 data points each). (B) Distributions of the raw expression values of the 36 DC surface and secreted communication signals. (D) Mean and SD per condition (medium, LPS 100 ng/mL, zymosan 10 μg/mL, flu 1X) for zymosan-specific (IL-23, IL-10, CD54, PVR), flu-specific (IL-28α, ICOSL), LPS-specific (CD83, CD30L), zymosan- and LPS-specific (IL-12p70), and common activation (CD80, CD86, HLA-DR, CD18, CD100) signals.]

Panel C tabulates, for each of the 36 DC communication signals, the expression range (in logs), the percentage of positive observations among the 428 data points, and the coefficient of variation:

Input        Range (log)   % positive   Coeff. of variation
TNF-α        5.00          63.32        2.04
IL-6         5.00          78.74        1.43
IL-12p70     4.00          41.12        2.72
LIGHT        4.00          50.00        2.30
IL-28α       4.00          14.25        1.82
IL-10        4.00          56.54        1.70
Galectin-3   4.00          98.36        1.09
B7H3         4.00          97.66        0.67
IL-1         3.00          41.12        2.10
IL-23        3.00          54.67        1.64
CD70         3.00          47.20        1.19
CD30L        3.00          79.21        1.08
4-1BBL       3.00          96.26        1.06
Jagged-2     3.00          79.67        1.05
OX40L        3.00          74.53        0.97
PDL1         3.00          96.50        0.94
ICAM-2       3.00          73.13        0.90
ICAM-3       3.00          100.00       0.89
SLAMF3       3.00          96.73        0.83
SLAMF5       3.00          98.60        0.82
CD80         3.00          99.77        0.78
CD40         3.00          99.77        0.76
CD100        3.00          98.13        0.72
VISTA        2.00          92.76        0.97
PVR          2.00          100.00       0.75
CD18         2.00          100.00       0.72
PDL2         2.00          93.22        0.71
ICOSL        2.00          90.65        0.65
CD86         2.00          100.00       0.65
CD54         2.00          100.00       0.64
CD83         2.00          97.90        0.60
CD29         2.00          98.13        0.59
CD58         2.00          99.77        0.57
HLA-DR       2.00          100.00       0.56
Nectin-2     2.00          100.00       0.53
CD11a        1.00          99.77        0.44
Figure 1. Variability and Specificity of DC Communication Signals


(A) Experimental strategy.
(B) Raw expression values of the 36 DC communication signals (n = 428 data points).
(C) Statistical descriptors of the 36 DC communication signals: expression range (log magnitude), percentage of positive observations among the 428 datapoints,
and coefficient of variation.
(D) Average expression values and SD for the four indicated DC signals for MoDCs.
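The three descriptors summarized in panel C can be computed directly from the raw values; a short numpy sketch on simulated data (the distribution parameters are arbitrary, chosen only to mimic a skewed expression signal with some negative observations):

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated raw expression values for one signal across 428 observations;
# a fraction of observations is negative (recorded here as 0).
x = rng.lognormal(mean=5.0, sigma=1.5, size=428)
x[rng.random(428) < 0.2] = 0.0

positive = x[x > 0]
range_log = np.log10(positive.max()) - np.log10(positive.min())  # range in logs
pct_positive = 100.0 * (x > 0).mean()       # % of positive observations
cv = x.std() / x.mean()                     # coefficient of variation (SD / mean)
```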



of co-stimulatory molecules and cytokines, with the exception of high IL-28α. Group 2 showed low expression for most DC-derived Th stimuli but high levels of integrins, VISTA, and B7H3, suggesting a capacity to interact with T cells and transmit co-inhibitory signals. Group 3 showed a complementary pattern, a lack of group 1- and group 2-specific molecules, and intermediate or high levels of co-stimulatory molecules such as CD83, CD86, HLA-DR, 4-1BBL, and OX40L. This suggested potent T cell stimulating functions. Group 4 exhibited high levels of molecules from the B7 and TNF superfamilies, such as CD80, CD86, PDL1, PDL2, and CD40, but intermediate or low cytokine levels. In contrast, group 5 showed the highest levels of cytokines and molecules of the B7 and TNF superfamilies (Figure 2B).

Next we sought to analyze intra-cluster heterogeneity. We selected three pairs of perturbators most closely related as defined by Euclidian distance (C32 [MoDC HKLM, MOI 1] and C33 [MoDC HKCA, MOI 1], C47 [bDC LPS, 100 ng/mL] and C48 [bDC HKLM, MOI 1], and C61 [MoDC R848, 1 μg/mL] and C62 [MoDC PAM3, 10 μg/mL]) and compared them regarding expression of the 36 DC-derived Th stimuli (Figure 2C). C32 and C33 did not exhibit significant differences in CD80 and CD86 expression, reflecting equal levels of DC activation. They were statistically different only for IL-6, with levels ranging from complete absence in C33 to over 1 ng/mL in C32 (Figure 2C). In contrast, the pairs C47/C48 and C61/C62 showed significant differences for multiple Th stimuli. C47 expressed significantly more CD86, PDL1, and IL-1 than C48. On the contrary, C48 expressed higher levels of 4-1BBL. C61 and C62 showed marked differences in CD70 and IL-12 (higher in C61) and OX40L (higher in C62) levels. Hence, each DC condition expressed unique combinations of DC-derived Th stimuli, suggesting different communication potential with CD4 T cells.

An unsupervised Gaussian mixture model showed that the highest Bayesian information criterion (BIC) value corresponded to 82 groups, confirming that each DC condition induced a unique profile of the 36 communication molecules (Figure 2D). Using principal-component analysis (PCA), we showed that neither the date of the experiment nor the donor batch had a major effect on clustering (Figure S1C; STAR Methods).

The Heterogeneity of DC-Induced Th Cytokine Responses
We characterized the diversity of CD4 T cell output responses, as assessed by Th cytokine profiles, following co-culture of naive CD4 T cells with activated DCs across the 82 conditions described previously. Th cytokines exhibited important variations across the 428 observations (Figure 3A). Some cytokines, such as IL-2, TNF-α, GM-CSF, TNF-β, and IL-3, were always detected (Figure S2A).

Figure 2. The Diversity of DC States Is Defined by Unique Combinations of Communication Molecules
(A) Heatmap of the expression values of each of the 36 DC-derived signals, with hierarchical clustering (Ward distance) on Pearson metrics for the DC signals and Euclidian distances for the 82 DC conditions (C1-C82).
(B) Expression profiles (mean and SD) of the 36 communication molecules within the five groups of DC conditions defined by hierarchical clustering. Expression data were logged and scaled, so μ represents the mean and σ the SD of the expression of a given DC signal across the whole dataset.
(C) Boxplots of selected DC signals for pairs of stimulatory conditions defined as being the most correlated within our dataset by Pearson correlation (t test).
(D) Best number of groups by Gaussian mixture model, determined using the 428 points of the 36 DC parameters.
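The PCA check for experiment-date and donor effects can be sketched with a plain SVD. The data and batch labels below are simulated, and the variance-ratio summary is an ad hoc numerical stand-in for the visual inspection done in Figure S1C:

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_signals = 428, 36
X = rng.normal(size=(n_obs, n_signals))   # toy logged/scaled signal matrix
batch = rng.integers(0, 4, size=n_obs)    # hypothetical experiment-date batches

# PCA via SVD of the column-centered matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                            # observations in PC coordinates
explained = s**2 / np.sum(s**2)           # variance share per component

# Share of PC1 variance explained by batch (close to 0 => no batch effect)
pc1 = scores[:, 0]
between = sum((batch == b).sum() * (pc1[batch == b].mean() - pc1.mean())**2
              for b in np.unique(batch))
eta_sq = between / np.sum((pc1 - pc1.mean())**2)
```

Since the toy batches are independent of the data, `eta_sq` stays near zero; a batch effect would inflate it.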



[Figure 3, panels A-E. (A) Raw values of 428 independent Th profiles of 17 distinct cytokines (plus expansion fold) measured at the protein level. (B) Mean and SD for each Th-derived parameter (Th1: IL-2, IFN-γ, TNF-α, TNF-β; Th17: IL-17A, IL-17F; Tfh: IL-21; Th22: IL-22; Th2: IL-4, IL-5, IL-13, IL-31; Th9: IL-9; plus IL-3, IL-6, IL-10, GM-CSF, and expansion fold) under the control conditions medium, LPS (100 ng/mL), zymosan (10 μg/mL), and flu (1X). (C) Hierarchical clustering (Ward distance on Pearson metrics) of the 82 DC conditions over the 18 Th parameters, defining six groups. (D) Boxplots for the most correlated condition pairs (C12/C33: IL-5, IL-17F, IL-22; C42/C47: IL-3, IL-2, IFN-γ; C49/C46: IL-6, GM-CSF, IL-21). (E) BIC values of Gaussian mixture models over the number of groups (1-82).]

Figure 3. Th Cytokine Responses Mirror the Variability in DC Communication States


(A) Raw expression values of each of the 18 Th-derived parameters (n = 418 data points).
(B) Average expression values and SD for all Th-derived signals under the MoDC conditions medium, LPS, zymosan, and flu.
(C) Heatmap of expression values of each of the 18 Th parameters, performed with hierarchical clustering on Pearson metrics for the Th signals and Euclidian distances for the T cell conditions.
(D) Boxplot of Th signals for pairs of conditions selected as being the most correlated within our dataset by Pearson correlation (t test).
(E) Best number of groups by Gaussian mixture model, determined only using the 428 points of the 18 Th parameters.
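Panel E's model selection can be sketched with a small EM fit and the BIC convention used in the figure (higher is better). This toy is one-dimensional with two well-separated groups, whereas the paper fits mixtures on the full 18-dimensional Th profiles:

```python
import numpy as np
from scipy.stats import norm

def gmm_bic(x, k, n_iter=200):
    """Fit a 1-D Gaussian mixture by EM; return BIC = 2*loglik - npar*log(n)
    (the 'higher is better' convention of Figures 2D and 3E)."""
    n = x.size
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # deterministic, spread-out init
    sd = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        dens = w * norm.pdf(x[:, None], mu, sd)     # E step: (n, k) densities
        resp = dens / dens.sum(axis=1, keepdims=True)
        nk = resp.sum(axis=0)                       # M step: weighted moments
        w = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        sd = np.maximum(sd, 0.1)                    # floor against EM singularities
    loglik = np.log((w * norm.pdf(x[:, None], mu, sd)).sum(axis=1)).sum()
    npar = 3 * k - 1                                # k means, k sds, k-1 free weights
    return 2.0 * loglik - npar * np.log(n)

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(6, 1, 150)])  # two true groups
bic = {k: gmm_bic(x, k) for k in (1, 2, 3)}
best_k = max(bic, key=bic.get)
```

On these data the two-component fit should beat the single Gaussian by far more than the parameter penalty, so BIC recovers the true group count.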



To identify Th subset signatures, we compared cytokine expression under our four standard conditions: medium (negative control), LPS, zymosan, and flu. The Th17 cytokines IL-17A and IL-17F were induced predominantly by zymosan MoDCs. LPS MoDCs induced mixed Th1, Th2, and Th9 responses characterized by high IFN-γ, IL-13, IL-3, and IL-9 compared with medium. Flu MoDCs induced the Th2 cytokines IL-4, IL-5, and IL-31 (Figures 3B and S2B). These results indicate that, under the LPS, zymosan, and flu conditions, each DC state induced a distinct set of Th cytokine responses corresponding to prototypical Th signatures or mixed Th profiles.

Th Cytokine Responses Mirror the Variability in DC Communication States
We asked whether Th cytokine responses would reveal distinct patterns or a continuum of responses mirroring each of the DC communication states (Figure 2A). We performed hierarchical Pearson clustering on our 18 distinct Th-derived variables across the entire set of 82 DC-activating conditions (Figure 3C). This revealed 6 distinct groups, although intra-group heterogeneity was evident in almost all groups. Interestingly, the DC perturbation conditions (C1-C82) did not appear in the same order as in the DC communication signal clustering (Figure 2A), indicating that closely related patterns of DC-derived Th stimuli did not necessarily induce the closest patterns in Th cytokine responses.

Group 1 was dominated by production of IL-10, IL-22, IL-5, GM-CSF, IL-3, IL-31, IL-13, and IL-4 (Figure S2C). Group 2 was the most heterogeneous and included the inflammatory cytokines TNF-α and IL-6 co-expressed with variable levels of the Th1 (IFN-γ) and Th2 (IL-4 and IL-13) cytokines (Figure S2C). Group 3 expressed IL-21, IFN-γ, and IL-17F but no or low IL-17A, suggesting the possibility of differential regulatory mechanisms (Figure S2C). Group 4 was dominated by the Th17 cytokines IL-17A and IL-17F, group 5 by IL-22, and group 6 by IL-2. Distinct sets of DC perturbation conditions and, hence, patterns of DC-derived communication molecules were associated with each of these groups (Figure 3C). This was the first suggestion of specific rules underlying input-output relationships in DC-Th communication.

Because of intra-group heterogeneity, we asked whether the most correlated conditions within the same cluster would differ from each other (Figure 3D). C12 and C33 were associated with different levels of IL-17F, whereas C42 and C47 differed in IL-2, and C46 and C49 differed in IL-6 and GM-CSF levels (Figure 3D). As for the DC dataset, we found that 82 was the best number of groups in our Th-derived dataset, based on a Gaussian mixture model (Figure 3E). This suggested that a single DC profile of communication molecules would induce a unique set of Th cytokines.

A Data-Driven Lasso Penalized Regression Model Predicts Th Cytokine Responses from Combinations of DC-Derived Th Stimuli
Having generated distinct patterns of DC-derived communication signals associated with a diversity of induced CD4 T cell cytokine responses, the question of their relationship appeared to be critical to decipher DC-Th communication. Given the complexity of the dataset and the lack of clear hypotheses concerning the majority of DC-derived Th stimuli, we applied an unsupervised mathematical modeling strategy (Figure 4A). The MultiVarSel strategy with stability selection performed as well as the internal positive control and better than the other methodologies tested (Figure S3A; STAR Methods). Therefore, we applied MultiVarSel to the modeling of our experimental data (Figure 4A). This methodology takes into account the dependencies that may exist among Th cell cytokines and combines the Lasso criterion with stability selection to select associations between DC-derived signals (inputs) and Th cytokines (outputs) (STAR Methods).

Our multivariate model identified a large number of significant positive (red) and negative (blue) associations of the 36 DC-derived Th stimuli with the 17 Th-derived cytokines (Figure 4B). White squares represent the absence of a significant association (Figure 4B). The frequency of selection obtained for each input-output association is provided in Figure S3B.

Our mathematical model revealed (1) the effect of each DC communication signal on Th output responses and (2) the critical regulators for each Th cytokine. For example, the negative regulators of IL-10 were OX40L, 4-1BBL, IL-12, TNF-α, CD58, VISTA, Galectin-3, CD80, CD29, IL-1, ICAM-3, SLAMF3, IL-28α, and CD83, and the positive regulators were Jagged-2, PDL1, IL-10, CD11a, HLA-DR, ICOSL, CD100, CD30L, CD18, ICAM-2, and CD86 (Figure 4B). Hence, the model can predict IL-10 production by responding Th cells for any DC, given the expression levels of these molecules. It also allows simulating loss or gain of function of an input. Similar insight can be obtained for each of the 17 Th cytokine responses, which may be explained by a combination of DC-derived communication signals.

We used computational cross-validation to evaluate the prediction error of our model (Figure 4C). For all Th cytokines, the multivariate model outperformed the best univariate model (Figure S3C). We ranked Th cytokines based on their prediction errors; the Th variables best explained by our model were IL-6, IL-17F, Exp Fold, and IL-3 (Figure 4C).

To address DC type specificity in model performance, we calculated the cross-validation error for each Th output of the MoDC and bDC datasets, respectively. Our model predicted the majority of the outputs equally well for the two DC types (Figure S3D). For a few outputs, mostly IL-22 and TNF-β, the model was more error prone in bDCs than in MoDCs (Figure S3D). Interestingly, a higher prediction error was found for TNF-β in 5 of 118 observations (Figure S3E), where TNF-β levels were very high (range, 6.7-22.2). This suggested that a TNF-β-promoting input signal might be involved in those 5 cases but not included in our model. For IL-22, more observations had a higher prediction error in bDCs compared with MoDCs, but the prediction error ranges and distributions were similar, suggesting that the input-output relationship was conserved (Figure S3E).

We performed hierarchical clustering of both DC- and T cell-derived variables to identify co-regulations between Th outputs. We retrieved relevant clusters of Th cytokines belonging classically to the same Th subset (Figure 4B). The Th2-related cytokines IL-13, IL-31, IL-5, IL-4, IL-10, and GM-CSF were found in the same cluster, suggesting that their induction would be controlled by similar mechanisms. IL-17A and IL-17F were also
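The two ingredients named above, the Lasso criterion and stability selection over resampled half-datasets, can be sketched in numpy. This is a toy with 8 candidate inputs, 2 of which truly act on one output; MultiVarSel additionally whitens the residual dependence among the Th cytokines before this step, which the sketch omits, and all sizes and penalty values here are arbitrary:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=50):
    """Lasso by cyclic coordinate descent (soft-thresholding updates)."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            resid = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

def stability_selection(X, y, lam, n_resample=200, threshold=0.65, seed=0):
    """Selection frequency of each input over Lasso fits on random data halves."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(n_resample):
        half = rng.choice(n, n // 2, replace=False)
        freq += lasso_cd(X[half], y[half], lam) != 0
    freq /= n_resample
    return freq, freq >= threshold          # keep inputs selected often enough

rng = np.random.default_rng(4)
n, p = 120, 8
X = rng.normal(size=(n, p))                 # toy standardized inputs
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.normal(size=n)
freq, selected = stability_selection(X, y, lam=25.0)
```

The 0.65 frequency threshold mirrors the one used in the paper's workflow; the truly active inputs are selected in essentially every resample, while noise inputs rarely survive the penalty.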



[Figure 4, panels A-D. (A) Modeling workflow: (1) modeling strategy selection, comparing 6 modeling strategies on a simulated dataset and retaining MultiVarSel (multivariate penalized regression with the Lasso, accounting for dependencies); (2) model building, applying MultiVarSel to the experimental dataset with stability selection (1,000 resamplings of half the dataset, selection frequency threshold 0.65) and displaying the coefficients (Figure 4B); (3) model validation, both numerical (prediction error by 10-fold cross-validation, Figure 4C) and literature based (systematic comparison of model predictions with literature data, Figure 4D). (B) Heatmap of the model's coefficient values (roughly -0.2 to 0.2; positive in red, negative in blue) linking the 36 DC-derived communication signals (inputs) to the 17 Th cytokines plus expansion fold (outputs), with Pearson correlation-based hierarchical clustering. (C) Square error of predictions per Th cytokine, multivariate model versus best univariate model. (D) Counts of predicted input-output associations per DC signal, classed as new, validated, or contradictory.]
Figure 4. A Data-Driven Lasso Penalized Regression Model Predicts Th Differentiation Outcomes from DC-Derived Communication Signals
(A) Mathematical modeling strategy.
(B) Heatmap of the model’s coefficient values of the MultiVarSel-derived model, explaining the 18 Th parameters based on the 36 DC-derived signals (Pearson
correlation-based hierarchical clustering).
(C) Mean and SE of prediction error values obtained by 10-fold cross-validation for Th parameters using the multivariate model (yellow) and the best univariate
model (gray) within the 36 DC signals.
(D) Literature-based validation score. For each DC signal, all predicted associations with Th cytokines were categorized as ‘‘new,’’ ‘‘validated,’’ or ‘‘contradictory.’’
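The comparison in panel C (the multivariate model versus the best single-input model, scored by cross-validated prediction error) can be sketched as follows. Plain least squares stands in for the penalized fit, and the data are simulated so that several inputs genuinely matter; fold counts and sizes are illustrative:

```python
import numpy as np

def cv_mse(X, Y, k=10, seed=0):
    """Mean squared prediction error of least squares by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), k)
    sse = 0.0
    for fold in folds:
        train = np.setdiff1d(np.arange(X.shape[0]), fold)
        B, *_ = np.linalg.lstsq(X[train], Y[train], rcond=None)
        sse += ((Y[fold] - X[fold] @ B) ** 2).sum()
    return sse / Y.size

rng = np.random.default_rng(5)
n, p, q = 200, 5, 3                       # observations, inputs, outputs
X = rng.normal(size=(n, p))
B_true = rng.normal(size=(p, q))
Y = X @ B_true + 0.3 * rng.normal(size=(n, q))

err_full = cv_mse(X, Y)                                      # all inputs at once
err_best_uni = min(cv_mse(X[:, [j]], Y) for j in range(p))   # best single input
```

Because every input carries signal, the model using all of them predicts held-out data better than any univariate competitor, which is the pattern reported for the Th cytokines.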



[Figure 5, panels A-G (legend on next page). (A) CD28 blocking design: MoDCs activated for 24 h with flu (1X), zymosan (10 μg/mL), or LPS (100 ng/mL), then co-cultured for 6 days with allogeneic naive CD4 T cells with or without blocking mAb, measuring the Th outputs predicted by the model to be modulated. (B) CD80/CD86 blocking: observed output expression fold change (value with blocking / value without blocking) versus the in silico knockout estimate, per output and condition; validation score 11/15. (C) DC-free design: naive CD4 T cells stimulated for 5 days by anti-CD3/anti-CD28 plus Th-polarizing cytokines, with rhIL-1β, agonist anti-hICOS mAb, or rhIL-12p70 added. (D) IL-1 predictions, validation score 7/10. (E) ICOSL predictions, validation score 10/16. (F and G) IL-12 predictions, validation score 13/15.]



in the same cluster, implying that the model associated them with closely related DC communication signals (Figure 4B). Surprisingly, our model closely related IL-9 expression to IL-17A and IL-17F, suggesting common regulators. It also clustered IL-22 closer to the Th2 than to the Th17 cytokines. IL-21 was associated with the Th1 cytokines IL-2 and IFN-γ (Figure 4B).

The Multivariate DC-Th Model Reveals Novel Regulators of Th Cytokine Responses
We systematically compared our model results with the literature, as a knowledge-based validation but also a novelty assessment. We screened 178 relevant articles (STAR Methods) and extracted information regarding the specific molecular control of a given Th cytokine by the DC-derived signals measured in our model (Table S3). We computed a validation score based on the number of articles identifying the same associations as our model (STAR Methods). IL-12 ranked as the top DC communication signal for which our model predictions globally recapitulated existing knowledge (8 of 13 predicted associations). Among other known associations, IL-23 was positively associated with IL-17A and IL-17F, IL-10 was positively associated with IL-10 and negatively with IFN-γ, and CD40 was positively associated with IFN-γ.

However, the model also predicted 290 associations that were not described previously. Putative novel regulators were identified for all Th outputs (Table S4). The robustness of each prediction could be estimated by the value of the coefficient and by the frequency of detection of the association (Table S4). Examples of high scores were the B7H3 and CD83 associations with IL-4, the 4-1BBL association with IL-9, the ICOSL association with IL-13, and the OX40L negative association with IL-22 (Table S4). Overall, literature knowledge was retrieved for 80 distinct input-output relationships presented in our model (Figure 4B); 56 were in agreement with our model, representing a global literature validation score of 70%.

Systematic and Independent Experimental Validation of Model's Predictions
We performed systematic experimental validation by selecting a subset of target inputs and systematically measuring the Th outputs selected by our model. We assessed the novelty of each validated prediction (Table S3).

First we addressed systematic validations of model predictions by blocking experiments (Figure 5A). We performed a double in silico knockout for CD80 and CD86 under the three conditions (LPS [100 ng/mL], flu [1X], and zymosan [10 μg/mL] MoDCs) in which CD80 and CD86 were highly expressed, and predicted an effect on 15 distinct Th outputs (Figure 5B), 11 of which were successfully experimentally validated (STAR Methods). The positive roles of CD80 and CD86 on IL-3 and IL-31 have, to our knowledge, not been described elsewhere. The predictions we failed to validate were for IL-4, IL-5, IL-10, and TNF-α (Figure S4A), all predicted to be decreased by CD80/CD86.

Then we validated the effects of three additional inputs used as exogenous factors: IL-1, ICOSL, and IL-12 (Figure 5C). First we gave the selected input together with anti-CD3/CD28 signals (Th0) and systematically measured all Th outputs predicted by the model to be influenced by that input. In the absence of any effect, we gave the selected input under a Th2 (IL-4) or Th17 (IL-6, IL-1β, IL-23, and TGF-β) condition to detect additional synergistic or inhibitory effects required to validate the predicted effect. For example, it is not possible to validate the inhibition of a Th2 cytokine without significant production of this cytokine at baseline.

We focused on the ten predictions made by our model for IL-1 (Figure 5D). By adding IL-1β to the Th0 condition, we were able to detect significant upregulation of IL-6 and IL-17F and significant downregulation of IL-10 and IL-13. IL-10 downregulation and IL-6 upregulation were also significant in the Th2 context (Figure S4B). Under a Th2 condition, we validated significant upregulation of TNF-α and downregulation of IL-9 by IL-1β (Figure S4B), not seen in Th0 (Figure S4B). Under a Th17 condition, we observed a positive effect of IL-1β on IL-17A. We could not validate the predictions regarding IL-21, IL-31, and IL-22 (Figure S4B). In total, 7 of 10 predicted effects of IL-1 were validated. Interestingly, the positive role of IL-1β in the induction of IL-6 by Th cells was not known (Table S3) and may suggest new biology and amplification loops in an inflammatory context.

We used a similar strategy to validate predictions regarding ICOSL using an agonistic anti-ICOS antibody. Overall, we validated 10 of 16 predictions (Figures 5E and S4C; STAR Methods). Interestingly, five of the 10 validated predictions were novel (Table S3; IL-5, IL-13, IL-3, GM-CSF, and IL-22), suggesting common pathways to induce IL-22 and Th2 responses.

Finally, we experimentally tested the predictions regarding IL-12 (Figure 5F). Adding IL-12 to the Th0 condition validated an induction of IFN-γ, IL-21, Exp Fold, and TNF-β. We also validated the inhibitory role of IL-12 on Th2 cytokine (IL-4, IL-5, and IL-13), IL-6, and IL-22 production. Using the Th2 condition, we further validated the inhibitory role of IL-12 on IL-10 and IL-31. The effects of IL-12 on TNF-β, IL-31, and IL-6 have not been described previously (Table S3).

Because our anti-CD3/CD28 system did not allow validating the IL-12 effects on IL-2, IL-17F, IL-3, and IL-9 (Figure S4D), we wondered whether DC-dependent factors could affect the role

Figure 5. Independent and Systematic Experimental Validation of the Model’s Prediction


(A) CD28 blocking experimental design in DC-T co-culture.
(B) Comparison of the predicted versus observed fold change following CD28 blocking; n = 6 donors.
(C) Experimental scheme of the ‘‘adding’’ validation procedure used in (D)–(F).
(D) DC-free validation experiment studying the effect of adding IL-1b in Th0, Th2, and Th17. Naive T cells were stimulated by anti-CD3/CD28 beads; n = 6 donors.
(E) DC-free validation experiment studying the effect of adding ICOS in Th0 and Th17. Naive T cells were stimulated by coated anti-CD3 and ICOS antibodies and
soluble anti-CD28; n = 6 donors.
(F) IL-12 validation experiments in the DC-free system. Naive T cells were stimulated by anti-CD3/CD28 beads under Th0 and Th2 conditions; n = 8 donors.
(G) Validation of IL-12 predictions regarding IL-3 and IL-9. bDCs were cultured with naive CD4 T cells. IL-12 at 10 ng/mL was added for 6 days; n = 6 donors.
For (B) and (D)–(G), each panel shows the mean and SD of cytokine concentration, measured on restimulated Th supernatants (Wilcoxon test).

440 Cell 179, 432–447, October 3, 2019
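The in silico double knockout described above amounts to zeroing the corresponding input columns of a fitted linear model and re-predicting the outputs. A minimal numpy sketch of the idea (function and variable names are hypothetical illustrations, not the authors' code):

```python
import numpy as np

def in_silico_knockout(I, B, ko_cols):
    """Re-predict Th outputs after forcing selected DC inputs to zero.

    I: (n_samples, n_inputs) matrix of DC communication signals
    B: (n_inputs, n_outputs) coefficient matrix of a fitted model O = I @ B
    ko_cols: indices of the inputs to knock out (e.g. CD80 and CD86)
    """
    I_ko = I.copy()
    I_ko[:, ko_cols] = 0.0        # double knockout: remove both signals
    return I_ko @ B               # predicted outputs without those inputs

# toy example with 3 inputs and 2 outputs
I = np.array([[1.0, 2.0, 0.5],
              [0.5, 1.0, 1.5]])
B = np.array([[1.0, 0.0],
              [0.5, 2.0],
              [0.0, 1.0]])
pred_wt = I @ B                                 # baseline predictions
pred_ko = in_silico_knockout(I, B, [0, 1])      # knock out inputs 0 and 1
fold_change = (pred_ko + 1e-9) / (pred_wt + 1e-9)
```

Predicted fold changes computed this way are comparable in spirit to the predicted-versus-observed fold changes shown in Figure 5B.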


of IL-12 on these cytokines. We selected DC conditions with very low production of IL-12 (C51 and C55; Figure 2A) and performed a co-culture with naive T cells, adding or not adding IL-12. As a positive control, IL-12 was able to induce IFN-g in both zymosan and HKSA conditions (Figure S4E). We did not validate the role of IL-12 on IL-2 or IL-17F regulation (data not shown). However, we validated that IL-3 was induced by IL-12 in both zymosan DCs (C51) and HKSA DCs (C54) (Figure 5G), whereas IL-9 was significantly upregulated only in HKSA DCs. Overall, we were able to experimentally validate 13 of 15 predictions regarding IL-12.

Our systematic strategy established a validated prediction of the input-output relationship in 41 of 56 cases (73.2%), 13 representing new mechanisms identified by the model. This number is similar to or higher than the computational cross-validation (Figure 4C). Predictions with higher stability selection frequencies were more validated than those with low stability selection (Figure S4F). However, the value of the model's coefficients was not statistically different between the two groups (Figure S4F), indicating that the model efficiently captured associations with low coefficient values.

Although IL-12 was the input best explained by our model, we could not validate the predicted association between IL-12 and IL-17F (Figure S4D), neither in the literature nor in our systematic experimental validation. Previous studies have shown either no effect (Volpe et al., 2008) or a negative effect (Acosta-Rodriguez et al., 2007) of IL-12 on Th17 differentiation. We hypothesized that context-dependent effects may lead to new functions of IL-12, not accomplished by IL-12 as a single agent.

A Context-Dependent Model Reveals a Role of IL-12 in Th17 Differentiation

We designed a strategy to capture context-dependent effects of one input on any given output by integrating new composite variables into the model (Figure 6A). These new input variables were based on the co-occurrence of a given input with other DC-derived communication signals (i.e., contexts). They adopted the value of the given input (for instance, IL-12) in each observation where the co-expressed DC signal was present, and they took a zero value when the co-signal was absent. We could derive 455 context-dependent variables.

The model including all context-dependent variables performed less well (higher error of prediction) than our classical MultiVarSel strategy (Figure S5A), most likely because of overfitting issues dependent on the dataset size, with a number of input variables exceeding the number of data points used to fit the model. Therefore, we derived 36 models, each one integrating the context dependencies of one input (Table S5). For each of these models, we reported the coefficient and the stability selection frequencies of each input (Table S5). To globally estimate the influence of context dependencies within our data, we quantified the number of times an input variable was selected, either "alone" or "with" another one. We derived percentages of context dependencies and represented the results either per input (Figure S5B) or per output (Figure S5C). The inputs most likely to present "context-dependent" functions were PDL1 and SLAMF3, whereas CD11a and CD70 were mostly context-independent (Figure S5B). When analyzing the outputs, the models revealed that all cytokines could be regulated by context-dependent mechanisms with relatively similar percentages (range, 0.13–0.22) (Figure S5C).

We used this strategy to explain the role of IL-12 in the control of Th17 differentiation through identification of context-dependent effects. We found that adding context-dependent variables for IL-12 improved the model predictions for IL-17F and performed equally well for IL-17A (Figure 6B). We then focused on DC-derived signals that were kept significant by the model and observed distinct associations of the new IL-12 context-dependent variables with IL-17A and IL-17F (Figure 6C), including some differentially associated with IL-17A and IL-17F, respectively. Among various contexts, we found that IL-12 in the context of IL-1, ICAM-2, or Jagged-2 was associated with IL-17F, whereas IL-12 in the context of CD70, IL-23, or LIGHT was associated with IL-17A.

As a first level of in silico validation, we selected a DC condition under which IL-12 was co-expressed with many of these contexts, and DC-derived signals induced IL-17A and IL-17F by responder Th cells. Zymosan (10 µg/mL) on MoDCs fulfilled these criteria (Figures 1D and 3C). To study the specific effects of IL-12 in the context of all other DC communication signals induced by zymosan, we performed in silico IL-12 knockout in the IL-12 context-dependent model. We compared predicted values for IL-17A and IL-17F when IL-12 was kept or not kept in the model (Figure 6D). In silico knockout of IL-12 diminished the production of both IL-17A and IL-17F under the zymosan (10 µg/mL) condition. As experimental validation, we performed independent DC/T cell co-culture experiments using MoDCs treated with 10 µg/mL zymosan in the presence and absence of IL-12-neutralizing antibody (Figure 6E). Blocking IL-12 significantly decreased the production of IL-17A and IL-17F, as predicted (Figure 6E), and inhibited IFN-g production (Figure S5D). The same model predicted no effect of blocking IL-12 in Curdlan MoDCs (Figure S5E), which we validated experimentally (Figure S5F).

Synergistic Interaction between IL-12 and IL-1 Explains Induction of IL-17F without IL-17A

Our model predicted distinct roles of IL-12 on IL-17A and IL-17F production depending on the context in which IL-12 is expressed. Interestingly, IL-12, IL-1, and CD80 were the top variables almost systematically selected by the model to explain the differences between IL-17A and IL-17F (Figure 7A). This corroborated the results in Figure 6C, where we found that IL-12 in the context of IL-1 was associated with IL-17F but not IL-17A. The model estimate for a stability selection of less than 0.8 indicated that IL-12, IL-1, and CD80 were positive contributors to the differences between IL-17A and IL-17F (Figure S6A). Consequently, we hypothesized that the combination of IL-12 with IL-1 would induce IL-17F independent of IL-17A.

To experimentally validate our hypothesis, we used a DC-free Th polarization assay, allowing us to specifically study the interaction between IL-12 and IL-1 regardless of any other molecular context. Naive CD4 T cells were polyclonally activated with anti-CD3/CD28 beads and put in distinct cytokine treatments: Th0 (no cytokine) and Th2 (IL-4) as negative controls; Th17 (IL-1b+IL-23+IL-6+TGF-b) as a positive control; IL-12; IL-1b; and IL-12+IL-1b. IL-12 alone induced IFN-g and IL-21 and inhibited Th2-related cytokines, as expected (Figure S6B). IL-12 alone induced neither IL-17F nor IL-17A, but combining IL-12



[Figure 6, panels A–E; only the content recoverable from the figure is summarized here.
Panel A, theoretical description of context-dependent modeling: without context, each output is modeled as O_j = a_j + b_j·I_j^(1) + e_j; taking the presence or absence of a second input into account, O_j = a_j + c_j·I_j^(1) + d_j·(I_j^(1)_with_I_j^(2)) + e_j, where I_j^(1)_with_I_j^(2) = 0 if I_j^(2) = 0 and I_j^(1) if I_j^(2) ≠ 0.
Application to IL-12p70: (1) select inputs absent in at least 20% of the samples (12 inputs selected; Figure 4); (2) create 12 variables IL12_with_Input_j, equal to 0 when Input_j^(i) = 0 and to IL-12^(i) otherwise; (3) fit the linear model O = IB + E, with I containing the 36 full inputs plus the 12 "IL-12_with" variables and O the 2 outputs IL-17A and IL-17F; (4) variable selection by MultiVarSel with stability selection; (5) display the coefficients (threshold = 0.6; Figure 6C).
Panel B: computational validation assessed by cross-validation; squared error of predictions (range ~0.5–0.7) for IL-17A and IL-17F.
Panel C: heatmap of the coefficient values of the context-dependent multivariate model for IL-17A and IL-17F across the inputs and IL-12_with_* variables.
Panel D: model predictions for IL-17A and IL-17F under IL-12 in silico knockout in 10 µg/mL zymosan-treated MoDCs.
Panel E: experimental validation; raw IL-17A and IL-17F concentrations (pg/mL) in anti-IL-12 versus isotype conditions on zymosan-treated MoDCs.]

Figure 6. A Context-Dependent Model Reveals a Role of IL-12 in Th17 Differentiation


(A) Context-dependent modeling and application to IL-12. I, input; O, output.
(B) Error of prediction values obtained by 10-fold cross-validation for IL-17A and IL-17F, comparing the best univariate model (gray), MultiVarSel (yellow), and
MultiVarSel with context dependencies (blue).
(C) Heatmap of the model’s coefficient value of the context-dependent multivariate model explaining IL-17A and IL-17F.
(D) Model predictions regarding IL-12 in silico knockout (KO) under the zymosan MoDC condition for IL-17A and IL-17F values (blue) compared with experimental
values in the presence of IL-12 (yellow); paired t test.
(E) Concentrations of IL-17A and IL-17F produced by Th cells after differentiation with zymosan MoDCs in the presence of anti-IL-12 neutralizing antibody or a
matched isotype; n = 6 donors, paired t test.
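The context-dependent variables defined in Figure 6A are straightforward to construct: each one copies the focal input wherever the context signal is present and is zero elsewhere. A minimal numpy sketch (illustrative code, not the authors' implementation):

```python
import numpy as np

def context_variables(focal, contexts):
    """Build 'focal_with_context' variables as in Figure 6A.

    focal:    (n_samples,) values of the focal input (e.g. IL-12)
    contexts: (n_samples, n_contexts) values of the other DC signals
    Returns (n_samples, n_contexts): column j equals the focal input
    where context j is non-zero, and 0 where that context is absent.
    """
    present = (contexts != 0).astype(float)   # presence/absence of each context
    return focal[:, None] * present

il12 = np.array([2.0, 0.5, 3.0, 1.0])         # focal input across 4 samples
il1 = np.array([0.0, 1.0, 4.0, 0.0])          # one context signal
ctx = context_variables(il12, il1[:, None])   # the "IL-12_with_IL-1" column
# ctx[:, 0] -> [0.0, 0.5, 3.0, 0.0]
```

Stacking these columns next to the raw inputs gives the design matrix I of the linear model O = IB + E in step 3 of Figure 6A.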

with IL-1b dramatically induced IL-17F at levels comparable with the positive control, without a detectable amount of IL-17A, which fully validated the model predictions (Figure 7B). This effect was specific to the IL-12+IL-1b combination: IL-6, IL-23, or TGF-b alone or combined with IL-12 could not induce IL-17F expression (Figure S6C). The exact same pattern of Th



[Figure 7, panels A–I; only the content recoverable from the figure is summarized here.
Panel A: stability selection frequencies (0–1) of the DC signals explaining the difference between IL-17F and IL-17A, with IL-12p70, IL-1, and CD80 as the top-ranked variables.
Panels B–D: IL-17F and IL-17A concentrations under the indicated cytokine and Th conditions, with significance marks.
Panels E–G: intracellular FACS dot plots of IL-17A versus IL-17F in polarized naive CD4 T cells (Th0, Th2, IL-12, IL-1b, IL-12+IL-1b, Th17) and in memory CD4 T cells, with quantification of IL-17A+ and IL-17F+ cells (% live cells).
Panels H and I: co-production of IL-21, IFN-g, and IL-22 by IL-17F+/IL-17A− cells (IL-17F+ single producers: 22.2% in both the in vitro and memory settings).]

Figure 7. Synergistic Interaction of IL-12 and IL-1 Promotes IL-17F without IL-17A
(A) Stability selection frequencies of the different DC signals by a multivariate model, explaining the difference between IL-17F and IL-17A.
(B) Concentration of cytokines measured on restimulated Th supernatants. Naive CD4 T cells were differentiated for 5 days with anti-CD3/CD28 beads under the
indicated conditions; n = 6 donors, paired t test.
(C) The same experimental design as in (B), with conditions as annotated; n = 6 donors, Wilcoxon test.
(D) Coated anti-CD2 and anti-CD3 together with soluble anti-CD28 were given for 5 days to naive CD4 T cells under Th0 or Th17 conditions. Cytokine
concentrations were measured after 24-h restimulation on day 5. Mean and SD are shown; n = 8, Wilcoxon test.




cytokine expression was obtained by combining IL-1a or IL-1b with IL-12, which fit model predictions because those two variables were highly correlated (Figure S6D). The capacity of IL-12+IL-1b to induce IL-17F was resistant to the presence of other Th differentiation factors, such as IL-4 (Figure S6E). Using CellTrace Violet (CTV; Figure S6F), we could show that the production of IL-17F could not be attributed to the distinct proliferation capacity of Th cells under the IL-12+IL-1b condition.

Next we questioned whether Th cells generated under the IL-12+IL-1b condition would express transcription factors classically associated with Th17 differentiation. We measured 63 RNA transcripts by qPCR under Th0, Th2, IL-1b, IL-12, IL-12+IL-1b, and Th17 conditions (Table S6). The 63 genes included master regulators of the Th1 and Th2 subsets, such as T-bet and GATA3, respectively, and Th17 regulators, such as RORc, STAT3, BATF, and SATB1 (Ciofani et al., 2012). IL-17A and IL-17F regulation at the mRNA level mirrored the protein level (Figure S6H). IL-12+IL-1b induced significantly more RORc, BATF, and Bcl6 than IL-12 or IL-1b alone (Figure S6H), which could explain the induction of IL-17F and IL-21. Still, the levels of RORc and Bcl6 were lower in IL-12+IL-1b than under the Th17 condition (Figure S6H). T-bet was highly induced in IL-12+IL-1b in comparison with the IL-12 or Th17 conditions, indicating that Th1 differentiation was maintained and that T-bet did not inhibit IL-17F production. IL-12Rb2, a Th1 marker, was downregulated by IL-1b when added to IL-12, whereas IL-12, IL-12+IL-1b, and Th17 conditions all induced the IL-23 receptor (Figure S6H). SATB1 was specifically upregulated in IL-12+IL-1b in comparison with Th17 or IL-1b alone (Figure S6H), suggesting that it could play a role in the specific upregulation of IL-17F.

To globally assess the expression of the various Th lineage-specific factors across IL-12- and IL-1-containing conditions, we performed a principal-component analysis (PCA) including all 63 mRNA variables (Figure S7A). Cells from the IL-12+IL-1b condition had an intermediate expression pattern between the IL-12 (Th1) and Th17 conditions. By decomposing the PCA space into vectors for each variable, we found that IL-17F, IL-23R, ICOS, and T-bet projected predominantly along the IL-12+IL-1b condition (Figure S7B), again pointing to mixed Th1/Th17 features.

We then addressed the link between IL-12 and IL-17A in various contexts. IL-12 with IL-23 was predicted to induce IL-17A but not IL-17F (Figure 6C). In a DC-free Th polarization assay, we used IL-12, IL-23, or IL-12+IL-23 and found that none of these conditions induced IL-17A (Figure 7C). We hypothesized that a third input could explain the positive link between "IL-12_with_IL-23" and IL-17A. Using an unsupervised analysis, we found IL-1 as a top variable with the highest correlation (Figure S7C). In addition, IL-12 and IL-17A positive correlation was significant specifically in the group of data points where IL-23 and IL-1 were expressed (Figures S7D and S7E) and was lost when only IL-1 or IL-23 was expressed with IL-12 (Figure S7D). Therefore, we tested whether IL-12+IL-23 would induce IL-17A in the presence of IL-1b. We validated a significant induction of IL-17A with no effect on IL-17F when IL-12 and IL-23 were given in the presence of IL-1b compared with IL-12 or IL-23 (Figure 7C). We measured IL-17A and IL-17F by qPCR and retrieved the same induction pattern (Figure S7F). Last, we could show that RORc was higher in IL-12+IL-23+IL-1b than in IL-12+IL-1b (Figure S7F).

Finally, we observed that our modeling strategy always identified CD58 as a main Th17 inducer because it positively affected both IL-17A and IL-17F (Figures 4B and 6C), an association that we had not seen during our systematic literature review (Figure 4D; Table S3). To test this hypothesis, we used an agonist anti-CD2 antibody that mimics the presence of CD58 (STAR Methods). As predicted, IL-17A and IL-17F were not induced by anti-CD2 alone under the Th0 condition. However, anti-CD2 significantly induced production of IL-17A and IL-17F under Th17 conditions (Figure 7D), which was confirmed by intracellular FACS staining (Figures S7H and S7I), with IL-17F upregulation restricted to IL-17A-positive cells (Figure S7I).

To establish the cytokine co-expression profiles of IL-12+IL-1b-treated Th cells at the single-cell level, we performed intracellular cytokine staining (Figure 7E). We confirmed that IL-12+IL-1b induced significantly more IL-17F-positive Th cells without co-production of IL-17A (Figure 7F). In naive CD4 T cells polarized with the Th17 cytokine cocktail (IL-1b, IL-6, TGF-b, and IL-23), we mainly found two subsets of Th17 cells producing either IL-17A or IL-17F, with very few cells co-producing both cytokines. To check for in vivo existence of those IL-17A and IL-17F single producers, we analyzed the human CD4 T cell memory compartment by intracellular FACS in healthy donor peripheral blood mononuclear cells (PBMCs). We could identify a small fraction of Th cells expressing only IL-17F in the absence of IL-17A, suggesting that this phenotype constitutes a differentiation endpoint (Figure 7G).

To gain more insight into the functional properties of these "Th17F" cells, we studied their co-production with IL-21, IFN-g, and IL-22, all relevant to the Th17 and/or IL-12 pathways, in vitro (Figure S7J) and ex vivo (Figure S7K). Among IL-17F+/IL-17A− cells generated with IL-12 and IL-1b, the majority co-produced IFN-g (41.8%), IL-21 (10.5%), or both (24.1%) (Figure 7H), reflecting a dominant role of IL-12. IL-17F+/IL-17A− memory CD4 cells preferentially co-expressed IL-21 (30.3%) and IL-21 together with IFN-g (17.5%) (Figure 7I), which matched the in vitro differentiated CD4 T cells. In addition, the percentage of IL-17F+/IL-17A−/IL-22−/IL-21−/IFN-g− cells between in vitro IL-12+IL-1b stimulation and the ex vivo restimulated memory

(E) Day 5 intracellular FACS analysis of Th cells differentiated as in (B). Dot plots show a representative donor.
(F) Quantification of live total CD4 T cells producing either IL-17A or IL-17F; n = 6 donors, paired t test.
(G) Representative donor of CD4 memory T cells with intracellular FACS staining for IL-17A versus IL-17F.
(H) Venn diagrams of IL-17F+/IL-17A− Th cells co-producing IL-22, IFN-g, and IL-21 of naive CD4 T cells under the IL-12+IL-1b condition; mean percentage and confidence interval, n = 6 donors.
(I) Venn diagrams of IL-17F+/IL-17A− Th cells co-producing IL-22, IFN-g, and IL-21 of memory CD4 T cells stimulated for 5 h with PMA and ionomycin; mean percentage of 6 donors with confidence interval.
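The "frequencies of selection" reported in Figure 7A come from stability selection: a sparse model is refit on many random subsamples, and each variable is scored by how often it receives a non-zero coefficient. A toy sketch of the idea, using a hand-rolled coordinate-descent lasso as a stand-in (illustrative only, not the MultiVarSel package):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Tiny coordinate-descent lasso (illustrative stand-in)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r / n
            z = (X[:, j] @ X[:, j]) / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta

def stability_selection(X, y, lam=0.1, n_sub=50, frac=0.5, seed=0):
    """Frequency with which each variable is selected across subsamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_sub):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        counts += lasso_cd(X[idx], y[idx], lam) != 0
    return counts / n_sub

# toy data: y depends only on the first 2 of 6 inputs
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=120)
freq = stability_selection(X, y)
```

A frequency threshold on the resulting scores (0.6 in Figure 6A, step 5) then decides which inputs are retained by the model.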



compartment was similar (22.2%), which indirectly supported that IL-12+IL-1b induced the emergence of IL-17F single producers.

Taken together, our results demonstrate a synergy between IL-12 and IL-1 in inducing IL-17F single-producing Th cells with possible physiopathological relevance.

DISCUSSION

Cell-cell communication may involve several tens of communication signals functioning concomitantly and possibly interacting with each other. These signals, in turn, modify many molecular and functional parameters in target cells. Such complexity cannot be captured and formalized without an integrated mathematical modeling approach. Theoretical models of Th cell differentiation have already been established (Abou-Jaoudé et al., 2015; Naldi et al., 2010) and include a large number of possible inputs to T cells. However, they suffer from three limitations: (1) they include input signals that may be expressed by diverse cell types in different anatomical locations; (2) they do not recapitulate combinations of input signals in their naturally occurring patterns and concentrations; and (3) they use prior knowledge to infer input-output relationships, which does not integrate possible context-dependencies and interactions. In parallel, data-driven models have been developed in response to predefined stimuli, such as Th17 (Yosef et al., 2013) or Th1/Th2 (Antebi et al., 2013), which do not recapitulate the integration of multiple communication signals. In our study, we applied an unbiased data-driven approach specifically designed to model DC-Th communication. Combinations and concentrations of input communication signals were measured as naturally determined by their intrinsic biological regulation. Subsequently, the input-output relationships were learned from the experimental data and integrated into any underlying context dependency and interaction, even when not described previously. This maximizes the relevance of the model and the potential for novel discoveries.

Cells can change state in response to environmental cues, a concept defined as plasticity (da Silva-Diz et al., 2018; Liu et al., 2001). Each cell state may be associated with different communication potential; i.e., different expression patterns of communication signals (Soumelis et al., 2002; Wang et al., 2014). To broadly cover the possible DC states, we used various DC-stimulatory conditions (cytokines, viruses, bacteria, fungi) at various doses and combinations and across a large number of observations (>400). This prevented us from biasing our observations toward certain quantitatively or qualitatively extreme behaviors. After the model has learned the rules from such an extended range of observations, we anticipate that it should be able to predict DC behaviors in situations not necessarily covered in our original dataset, as confirmed in our computational and experimental validations. This opens possibilities of application in many areas of immunology, inflammation, and immunotherapy.

RNA sequencing (RNA-seq) has offered a means of capturing the expression of many communication signals and their receptors to infer cell-cell communication between various cell types (Vento-Tormo et al., 2018). However, the RNA-to-protein correlation can be rather low (Liu et al., 2016) and varies a lot depending on the gene (Edfors et al., 2016). Consequently, RNA copies of a gene cannot be associated with a given functional output, preventing quantitative mathematical modeling. Functional response in target cells can only be estimated indirectly through surrogate activation markers, which is most often not performed. In our approach, all measurements of communication signals and output variables were done at the protein level, hence directly measuring the bioactive communication molecules with a direct link to a specific response in target cells. This ensures robustness of the modeling strategy, as evidenced by our model's ability to recapitulate most of the known relationships in DC-Th cell communication.

Modeling complex biological behaviors in a quantitative manner is challenging. In data-driven models, it relies in large part on the choice of explanatory (input) variables, which drive the induction or regulation of output variables. Here we selected DC-derived communication molecules through exhaustive literature mining. The model was able to integrate 36 input and 18 output variables in a quantitative manner, which makes it a reference in the field. We have been able to describe patterns of DC communication molecules in a way that goes beyond the classical view of immature versus mature DCs (Banchereau and Steinman, 1998; Guermonprez et al., 2002). In fact, we showed that almost every DC stimulatory condition leads to a distinct DC state. This is a first step in defining general combinatorial rules of DC-derived communication molecules: co-expressed molecules form the basis of putative context-dependent effects. Through the large number of variables handled by the model, we identified 290 novel associations explaining major immunoregulatory cytokines, which may lead to the discovery of novel functions of known DC molecules and suggest novel therapeutic targets.

Going further into the complexity of communication, we explored context dependencies of communication signals. In verbal communication, the context may dramatically alter the meaning of an individual word. Currently, there is no systematic way to search for context dependencies in biological communication. In our modeling strategy, we devised a method that introduces context-dependent variables for a given molecule. This allows unbiased identification of context-dependent functions that would have been missed by classical regression models. For example, we identified a new function for IL-12 in promoting IL-17F production by Th cells, which was completely unexpected based on prior knowledge (Korn et al., 2009). Identifying such context dependencies before therapeutic targeting of a DC-Th communication molecule may improve the prediction of its effect.

Given that DC-Th communication is central to a large number of physiopathological conditions (Keller, 2001), we can foresee multiple applications for the model. Based on the expression pattern of DC molecules, the model can predict the induced Th cytokine profile. Quantitative measurements of DC communication molecules in a given disease or in an individual patient ex vivo can be used to simulate the corresponding Th response. Depending on the outcome, strategies may be devised to re-orient the response toward a protective or less pathogenic profile, again through model-based predictions. Alternatively, starting from a Th profile (cytokine or groups of cytokines), the appropriate molecular targets can be manipulated through gain- or loss-of-function experiments to amplify or inhibit a given Th cytokine. Last, the model can help predict the most appropriate vaccine adjuvant to obtain



protective immunity against some microbes or to re-orient a path- SUPPLEMENTAL INFORMATION
ogenic Th response. For example, all DC molecules positively
Supplemental Information can be found online at https://doi.org/10.1016/j.
associated in the model to Th2 responses are potential targets
cell.2019.09.012.
to decrease pathogenic Th2 allergic inflammation (Ito et al.,
2005; Nakayama et al., 2017; Soumelis et al., 2002).
ACKNOWLEDGMENTS
Using DC-Th communication as a model, we established a
framework that can now be applied to other types of cell-cell We thank the Institut Curie Cytometry platform for cell sorting, P. Gestraud and
communication following 5 major steps: (1) systematic perturbation of the "sender" cell to generate a diversity of communication states; (2) broad, quantitative, and protein-level measurement of communication molecules relevant to the sender cell; (3) systematic quantitative assessment of the response in "receiver" or target cells; (4) MultiVarSel modeling of the input-output relationship, which defines communication rules; (5) in silico and experimental validation. Currently, we believe that cell type specificities in expression of communication molecules and in their function would prevent us from generalizing our DC-Th model to other cell types. Comparing different quantitative models of cell-cell communication will ultimately tell us whether cells speak the same language (i.e., whether they express similar patterns of communication molecules) and whether the same communication molecule has the same meaning (function) when expressed by two different cell types.

F. Coffin for advice regarding statistical analyses, and O. Lantz and N. Manel for important discussions. We wish to thank L. Pattarini, B. Fould, W. Cohen, O. Geneste, B. Lockhart, V. Blanc, and their teams from the Institut de Recherche Servier for having produced and generously shared the anti-ICOS antibody. This work was supported by funding from Agence Nationale de la Recherche: ANR-16-CE15-0024-01, ANR-10-IDEX-0001-02 PSL*, ANR-11-LABX-0043 and ANR-17-CE14-0025-02, from Center of Clinical Investigation IGR-Curie: CIC IGR-Curie 1428, and from Ligue contre le Cancer: EL2016.LNCC/VaS. M.G. was funded by ANRS and ARC.

AUTHOR CONTRIBUTIONS

M.G., C.T., L.K., C.C., and P.S. performed the experiments. M.G. and V.S. designed the experiments. M.P.-D. performed the statistical analysis. O.A., W.A.-J., and M.G. performed literature mining. P.H., D.T., F.B., L.S., J.C., C.L.-L., and V.S. participated in and supervised the statistical analysis. M.G. and V.S. wrote the manuscript. V.S. supervised the study.

DECLARATION OF INTERESTS

The authors declare no competing interests.

Received: February 19, 2019
Revised: June 20, 2019
Accepted: September 9, 2019
Published: October 3, 2019

STAR+METHODS

Detailed methods are provided in the online version of this paper and include the following:

d KEY RESOURCES TABLE
d LEAD CONTACT AND MATERIALS AVAILABILITY
d EXPERIMENTAL MODEL AND SUBJECT DETAILS
B Human subjects
d METHOD DETAILS
B PBMCs purification
B MoDC generation and activation
B Blood dendritic cells purification
B CD4+ T lymphocytes purification
B Paired protein measurement in DC/T coculture
B IL-12 blocking experiment
B CD28 blocking experiment
B Addition of rhIL-12p70 during DC/T coculture
B DC-free Th cell polarization
B ICOS agonism
B CD2 agonism
B Flow cytometry analysis
B Cytokine quantification
B Gene expression quantification
B Anti-human ICOS monoclonal blocking antibody
d QUANTIFICATION AND STATISTICAL ANALYSIS
B Dataset quality control – batch effect
B Dataset quality control – T cell expansion
B Statistical tests
B Statistical analysis
B Model comparison and ROC Curves
B Modeling strategy
B Systematic literature review
d DATA AND CODE AVAILABILITY

446 Cell 179, 432–447, October 3, 2019





Titre : Méthodes régularisées pour l'analyse de données multivariées en grande dimension : théorie et applications.
Mots clés : Méthodes régularisées, données multivariées, grande dimension
Résumé : Dans cette thèse nous nous intéressons au modèle linéaire général (modèle linéaire multivarié) en grande dimension. Nous proposons un nouvel estimateur parcimonieux des coefficients de ce modèle qui prend en compte la dépendance qui peut exister entre les différentes réponses. Cet estimateur est obtenu en estimant dans un premier temps la matrice de covariance des réponses puis en incluant cette matrice de covariance dans un critère Lasso. Les propriétés théoriques de cet estimateur sont étudiées lorsque le nombre de réponses peut tendre vers l'infini plus vite que la taille de l'échantillon. Plus précisément, nous proposons des conditions générales que doivent satisfaire les estimateurs de la matrice de covariance et de son inverse pour obtenir la consistance en signe des coefficients.
Nous avons ensuite mis en place des méthodes, adaptées à la grande dimension, pour l'estimation de matrices de covariance qui sont supposées être des matrices de Toeplitz ou des matrices avec une structure par blocs, pas nécessairement diagonaux.
Ces différentes méthodes ont enfin été appliquées à des problématiques de métabolomique, de protéomique et d'immunologie.

Title : Regularized methods to study multivariate data in high dimensional settings: theory and applications.
Keywords : Regularized methods, multivariate data, high dimension
Abstract : In this PhD thesis we study the general linear model (multivariate linear model) in high-dimensional settings. We propose a novel variable selection approach in the framework of multivariate linear models taking into account the dependence that may exist between the responses. It consists in first estimating the covariance matrix of the responses and then plugging this estimator into a Lasso criterion, in order to obtain a sparse estimator of the coefficient matrix. The properties of our approach are investigated both from a theoretical and a numerical point of view. More precisely, we give general conditions that the estimators of the covariance matrix and its inverse have to satisfy in order to recover the positions of the zero and non-zero entries of the coefficient matrix when the number of responses is not fixed and can tend to infinity.
We also propose novel, efficient and fully data-driven approaches for estimating Toeplitz and large block-structured sparse covariance matrices in the case where the number of variables is much larger than the number of samples, without limiting ourselves to block-diagonal matrices.
These approaches are applied to different biological issues in metabolomics, in proteomics and in immunology.

Université Paris-Saclay
Espace Technologique / Immeuble Discovery
Route de l’Orme aux Merisiers RD 128 / 91190 Saint-Aubin, France
