
Submitted to the Annals of Applied Statistics

MULTI-VIEW ORTHOGONAL PROJECTION REGRESSION WITH
APPLICATION IN MULTI-OMICS INTEGRATION

BY ZONGRUI DAI1,a, YVONNE J. HUANG2,c AND GEN LI*1,b


1 Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, a [email protected]; b [email protected]

2 Department of Internal Medicine, Division of Pulmonary and Critical Care Medicine, Department of Microbiology and
Immunology, University of Michigan Medical School, Ann Arbor, Michigan, c [email protected]

arXiv:2503.16807v1 [stat.ME] 21 Mar 2025

Multi-omics integration offers novel insights into complex biological
mechanisms by utilizing the fused information from various omics datasets.
However, the inherent within- and inter-modality correlations in multi-omics
data present significant challenges for traditional variable selection methods,
such as Lasso regression. These correlations can lead to multicollinearity,
compromising the stability and interpretability of selected variables. To ad-
dress these problems, we introduce the Multi-View Orthogonal Projection
Regression (MVOPR), a novel approach for variable selection in multi-omics
analysis. MVOPR leverages the unidirectional associations among omics lay-
ers, inspired by the Central Dogma of Molecular Biology, to transform pre-
dictors into an uncorrelated feature space. This orthogonal projection frame-
work effectively mitigates the correlations, allowing penalized regression
models to operate on independent components. Through simulations under
both well-specified and misspecified scenarios, MVOPR demonstrates su-
perior performance in variable selection, outperforming traditional Lasso-
based methods and factor-based models. In real-data analysis on the CAARS
dataset, MVOPR consistently identifies biologically relevant features, includ-
ing the Bacteroidaceae family and key metabolites that align well with
known asthma biomarkers. These findings illustrate MVOPR's ability to en-
hance variable selection while yielding biologically interpretable insights, of-
fering a robust tool for integrative multi-omics research.

1. Introduction. Multi-omics analysis provides a comprehensive understanding of bio-
logical mechanisms by integrating multiple types of data, such as genomics, transcriptomics,
proteomics, and metabolomics. These datasets offer novel insights into molecular processes
and immunological research that cannot be obtained from any single modality alone (Chen
et al. (2023); Chu et al. (2021); Clark et al. (2021)). Numerous studies have found that the
fusion of different omics data can bring additional insights into the exploration of biomark-
ers, improving diagnostics, and therapy development (Gillenwater et al. (2021); Garg et al.
(2024); Olivier et al. (2019); Hussein, Abou-Shanab and Badr (2024); Menyhárt and Győrffy
(2021)). For example, in our recent study of CAARS data, we collected the gut microbiome
and metabolome data of 51 patients to investigate the combined impact of these two omics
layers on asthma development.
Asthma is a complex respiratory disease that involves airway inflammation and aller-
gic reactions (Gautam, Johansson and Mersha (2022)). As one of the most prevalent chronic
airway diseases, it exhibits high heterogeneity, making diagnosis based on a single biomarker
challenging (Chung (2016); Abdel-Aziz et al. (2020)). Increasing evidence suggests that the
pathogenesis of asthma is closely linked to multiple omics layers. For instance, potential host-
microbiota interactions have been associated with an increased risk of asthma (Ruff, Greiling



Keywords and phrases: Multi-omics Integration, Variable selection, Latent Variables.


and Kriegel (2020)). Emerging studies are using multi-omics data to advance the un-
derstanding of asthma. For instance, multi-omics integration has been used to explore the disease's het-
erogeneity and underlying pathology. This approach not only helps identify new patient strati-
fications, but also paves the way for personalized treatment strategies (Zhang et al. (2024)).
Lasso-based regression is a widely used approach for variable selection in multi-omics data.
For example, IPF-LASSO (Integrative LASSO with Penalty Factors) and Priority-Lasso are
both Lasso-based methods that account for the heterogeneity of multi-omics data by as-
signing distinct penalty and priority weights in the regression model (Klau et al. (2018);
Boulesteix et al. (2017)). However, some studies have found that the performance of these
models deteriorates when highly correlated predictors are present, as these methods
do not account for the within- and inter-modality correlations inherent in multi-omics data
(Castel, Zhao and Thoresen (2024)).
To illustrate this challenge, we examine the correlation between the microbiome and metabolome
in the CAARS dataset using a Pearson correlation heatmap. Microbiome data are aggre-
gated to the family level and transformed with the centered log-ratio transformation. Metabolome
data are centered and scaled. The heatmap shows that the metabolome data have strong within-modality
correlation compared to the microbiome. Additionally, there are some negative inter-modality
correlations between the microbiome and metabolome (Figure 1).

[Figure 1: correlation heatmap of microbiome and metabolome features; color scale from −0.5 to 1.0.]

FIG 1. The Pearson correlation heatmap of microbiome and metabolome.

The within- and inter-modalities correlations pose significant challenges for variable se-
lection in traditional regression models. These correlations arise from latent associations be-
tween omics layers and intrinsic dependencies within each individual omic. Network-based
approaches provide an effective way to capture these complex relationships, such as protein-
protein and protein-RNA interactions (Richards, Eckhardt and Krogan (2021); Nasiri et al.

(2021); Szklarczyk et al. (2023)). Although these correlations can be quantified, they in-
troduce complications when modeling the relationship between multi-omics and a response
variable. Specifically, these strong correlations can lead to multicollinearity, making it diffi-
cult to distinguish the individual contributions of variables.
To formalize the problem, suppose there are two modalities M1 ∈ Rn×p1 and M2 ∈ Rn×p2
which are associated with the response vector Y through β1 and β2 . A linear model can be
constructed as:

Y = M1 β1 + M2 β2 + ϵ1 .

The existence of within- and inter-modality correlations implies the following latent relationships
between M1 and M2 :

M1 = F Λ + U (within-modality correlation)   and   M2 = M1 B + E (inter-modality correlation),

where F represents the latent factors in modality M1 with rank r , and F Λ captures the
low-rank structure of M1 , which is a commonly used approach to describe within-modality
correlation. For instance, factor-based models frequently employ it to represent dependencies
within predictors. The inter-modality correlation indicates that the modality M2 can be represented
as a linear combination of M1 through the coefficient matrix B with error term E . Standard
variable selection methods, such as Lasso regression, struggle in this setting because they will
arbitrarily select among correlated predictors, potentially leading to unstable variable selec-
tion. Addressing this issue requires a method that not only accounts for the relationships
between omics layers but also isolates the independent contributions of each omic.
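The practical impact of a small E can be seen in a toy numerical check (a sketch with arbitrary dimensions and seed, not the paper's simulation design): as the inter-modality noise shrinks, the stacked design (M1, M2) becomes nearly rank-deficient and its condition number explodes, which is exactly the regime in which Lasso-type selection becomes unstable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, r = 200, 20, 20, 2
M1 = rng.standard_normal((n, p1))
# low-rank coefficient matrix B linking M1 to M2
B = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))

def stacked_condition(noise_scale):
    """Condition number of the stacked design [M1, M2] with M2 = M1 B + E."""
    E = noise_scale * rng.standard_normal((n, p2))
    return np.linalg.cond(np.hstack([M1, M1 @ B + E]))

cond_small_E = stacked_condition(0.01)   # nearly collinear predictors
cond_large_E = stacked_condition(1.0)    # well-separated predictors
```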
The factor model has emerged as a powerful tool for correlated data by decomposing them
into latent structures comprising factors and idiosyncratic components. For instance, factor-
adjusted regularized regression can handle highly correlated data by identifying and re-
moving the low-rank structure from data and retaining the idiosyncratic components for
variable selection (Fan, Ke and Wang (2020)). Integrative Factor Regression is another fac-
tor decomposition-based model designed for multi-modal datasets. It can extract modality-
specific factors to account for the heterogeneity across modalities (Li and Li (2022)). How-
ever, factor-based models require data to have an approximate latent factor structure and only
reduce correlations within individual modalities. Thus, correlations between modalities still
exist.
An alternative approach is cooperative learning, which employs an agreement penalty based
on contrastive learning. This method encourages different modalities to contribute similarly.
By varying the hyperparameter of the agreement penalty, the solutions to this method span
the early- and late-fusion approaches for multi-modal data, providing robust performance in
different settings (Ding et al. (2022a)). However, this method ignores the inherent correla-
tions between the modalities and forces the different modalities to align their contributions. In
multi-omics scenarios, this assumption may not always hold, as different omics layers can
have distinct influences on the response variable. For example, miR-155 and miR-146a are
well-known miRNAs that can suppress E. coli-induced inflammatory responses in neuroin-
flammation. This example suggests that the host transcriptome may have an opposing effect
compared to the microbiome, leading to potential contradictions between the molecular sig-
nals originating from the host and those from the microbiota (Yang et al. (2021)).
To address these challenges, we introduce a novel Multi-View Orthogonal Projection Regres-
sion (MVOPR) for variable selection in multi-omics data. Unlike existing methods that im-
pose specific structural assumptions on the data, our model leverages unidirectional associ-
ations among different omics to mitigate the correlations. Our approach is inspired by the
Central Dogma of Molecular Biology, which states that DNA transcribes to RNA, and RNA
translates into protein, while the reverse process is impossible. For instance, once a pro-
tein is synthesized, it cannot alter its original RNA template. This inherent directionality in
molecular interactions suggests that multi-omics relationships can be represented by a di-
rected graph (digraph) with unidirectional pathways. Building on this biological insight, our
method accounts for the dependencies by removing redundant correlations in a structured
manner. Specifically, we employ an orthogonal projection framework that sequentially re-
moves the effects of upstream omics layers on downstream ones. By transforming the original
multi-omics data into an uncorrelated feature space, our approach ensures that variable se-
lection methods, such as penalized regression, operate on independent components free from
both within and across modality correlations. This enables MVOPR to overcome the limita-
tions of standard Lasso-based approaches, noted above.
In this study, we demonstrate the effectiveness of MVOPR for multi-omics variable selec-
tion through both theoretical analysis and empirical validation. Our simulations and real-data
analysis reveal that MVOPR consistently outperforms existing methods. We also show that,
when inter-modality correlation exists, factor-based models face problems that MVOPR
avoids. Importantly, even in cases where cross-modality correlations are absent,
MVOPR remains robust and performs comparably to standard Lasso regression, demonstrat-
ing its adaptability across different correlation structures. By incorporating biological direc-
tion assumptions, our approach not only enhances variable selection performance but also
aligns with the natural structure of molecular data, offering a robust framework for integra-
tive multi-omics analysis.
The rest of the article is organized as follows. In Section 2, we present the MVOPR frame-
work for both the two-modality and multiple-modality scenarios, followed by an introduc-
tion to three related methods for multi-modal data analysis. Section 3 provides a comparative
analysis of MVOPR against other competing methods under various settings. In Section 4,
we apply MVOPR to the CAARS dataset and evaluate its performance relative to alternative
approaches.

2. Methodology.

2.1. MVOPR for Two Modalities. Suppose we have two modalities, M1 ∈ Rn×p and
M2 ∈ Rn×q . Let Y be the response vector of length n, assumed to be associated with M1 and
M2 through regression coefficients β1 ∈ Rp and β2 ∈ Rq . The relationship is modeled as:
(1) Y = M1 β1 + M2 β2 + ϵ1 ,
where ϵ1 is an error term assumed to be uncorrelated with M1 and M2 , with E(ϵ1 ) = 0 and
Var(ϵ1 ) = σ²ϵ1 I . We further assume that M2 is influenced by M1 through a low-rank coef-
ficient matrix B ∈ Rp×q of rank r , with an error component E ∈ Rn×q that is uncorrelated
with M1 B and ϵ1 :
(2) M2 = M1 B + E.
Using this inter-modality correlation (2), we can reformulate Model (1) as:
(3) Y = M1 (β1 + Bβ2 ) + Eβ2 + ϵ1 .
If E is small, the model becomes almost unidentifiable, which affects the selection of the
variables for M2 . To handle this issue, we aim to remove the associated component M1 B
from M2 while retaining only the uncorrelated part E .
Let U ΣV T be the singular value decomposition (SVD) of M1 B , where Ur consists of the
first r left singular vectors of U . Substituting this decomposition into Model (3) gives:
(4) Y = M1 β1 + Eβ2 + Ur γ1 + ϵ1 ,

where γ1 is a nuisance parameter. Since Ur captures the principal directions of M1 B , it is
highly correlated with M1 . To eliminate this correlation, we project M1 onto a subspace
orthogonal to Ur .
Define the projection matrix P = Ur UrT , and let P ⊥ = I − P be its orthogonal complement.
Transforming M1 into P ⊥ M1 ensures that the new predictor no longer lies in the column
space of M1 B , thereby breaking the correlation with Ur . The transformed MVOPR model
for two modalities is then:
(5) Y = M1∗ β1 + M2∗ β2 + Ur γ ∗ + ϵ1 ,
where M1∗ = P ⊥ M1 and M2∗ = E . In this formulation, the predictors M1∗ , M2∗ , and Ur are
mutually uncorrelated.
This transformation enhances variable selection for both β1 and β2 by removing redundant
correlations between modalities. The decomposition effectively subtracts the linear contribu-
tion of M1 in M2 and projects M1 outside the column space of M1 B , ensuring no internal
dependencies between M1∗ , M2∗ , and the nuisance components Ur .
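The two-modality transformation can be sketched in a few lines of NumPy (an illustrative implementation: B is estimated here by plain least squares rather than the reduced-rank estimators of Section 2.4, and the rank r is assumed known):

```python
import numpy as np

def mvopr_transform(M1, M2, r):
    """Two-modality MVOPR transform (sketch): estimate B, keep the residual E
    as M2*, and project M1 off the rank-r column space of M1 B-hat."""
    # B is estimated by ordinary least squares here; Section 2.4 would use a
    # reduced-rank / sparse estimator instead.
    B_hat, *_ = np.linalg.lstsq(M1, M2, rcond=None)
    E = M2 - M1 @ B_hat                        # M2* = uncorrelated part of M2
    # SVD of the fitted component; Ur spans its top-r principal directions
    U, s, Vt = np.linalg.svd(M1 @ B_hat, full_matrices=False)
    Ur = U[:, :r]
    M1_star = M1 - Ur @ (Ur.T @ M1)            # (I - Ur Ur^T) M1
    return M1_star, E, Ur

rng = np.random.default_rng(0)
n, p, q, r = 200, 30, 40, 2
M1 = rng.standard_normal((n, p))
B = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))   # rank-r truth
M2 = M1 @ B + 0.1 * rng.standard_normal((n, q))
M1_star, M2_star, Ur = mvopr_transform(M1, M2, r)
```

After the transform, M1*, M2*, and Ur are numerically orthogonal to each other, so a penalized regression on (M1*, M2*, Ur) operates on de-correlated blocks, as described above.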

2.2. MVOPR for Multiple Modalities. Extending the model to three modalities, let M1 ∈
Rn×p1 , M2 ∈ Rn×p2 , and M3 ∈ Rn×p3 , with a known hierarchical dependency:
M2 = M1 B2,1 + E2 ,

M3 = M1 B3,1 + M2 B3,2 + E3 .
where E2 ∈ Rn×p2 and E3 ∈ Rn×p3 are independent error components, and B2,1 , B3,1 , and
B3,2 are low-rank coefficient matrices with ranks r1 , r2 , and r3 , respectively.
The response Y is modeled as:
(6) Y = M1 β1 + M2 β2 + M3 β3 + ϵ1 .
To remove the associated components, define two projection matrices P1 = U1 U1T and P2 =
U2 U2T , based on M1 B1′ and E2 B3,2 respectively, where B1′ = (B2,1 , B3,1 ) is the concatenation
of B2,1 and B3,1 . The transformed modalities are:
M1∗ = (I − P1 )M1 , M2∗ = (I − P2 )E2 , M3∗ = E3 .
Transforming the modalities in model (6), the MVOPR model for three modalities is:
(7) Y = M1∗ β1 + M2∗ β2 + M3∗ β3 + U1 γ1 + U2 γ2 + ϵ1 .
where γ1 and γ2 are nuisance parameters. The detailed derivations are provided in Supple-
mentary Appendix A.
For more than three modalities, the transformation follows a similar procedure. Suppose
there are k modalities with p1 , p2 , . . . , pk features. If each modality Mj (for j = 2, 3, . . . , k )
depends only on the previous modalities,

Mj = Σ_{i=1}^{j−1} Mi Bj,i + Ej ,

where Ej is independent noise, then the final regression model is:


(8) Y = M1 β1 + M2 β2 + · · · + Mk βk + ϵ1 .
Under the directional assumption above, we can derive the connection between the response Y
and all the modalities via the following algorithm.

Algorithm 1 Algorithm for multi-view regression on multiple modalities

1: Input: multiple modalities M1 , M2 , ..., Mk and response Y
2: Step 1: Obtain the estimates of B2,1 , B3,1 , ..., Bk,k−1 :
3:   for j in 2:k
4:     Regress Mj ∼ (M1 , ..., Mj−1 ) and compute the residuals Êj = Mj − M̂j
5: Step 2: Obtain the projection matrices P1 , ..., Pk−1 :
6:   Calculate:
7:     M1 B̂1′ = M1 (B̂2,1 , B̂3,1 , ..., B̂k,1 )
8:     Ê2 B̂2′ = Ê2 (B̂3,2 , B̂4,2 , ..., B̂k,2 ), ..., Êk−1 B̂′k−1 = Êk−1 B̂k,k−1
9:   Obtain the SVDs:
10:    M1 B̂1′ = U1 Σ1 V1T and projection P1 = U1 U1T with rank r1
11:    For j ≥ 2: Êj B̂j′ = Uj Σj VjT and projection Pj = Uj UjT with rank rj
12: Step 3: Transform M1 , M2 , ..., Mk into M1∗ , M2∗ , ..., Mk∗ :
13:    j = 1: M1∗ = P1⊥ M1
14:    j = 2, ..., k − 1: Mj∗ = Pj⊥ Êj
15:    j = k: Mk∗ = Êk
16:    Collect the nuisance variables U = (U1 , U2 , ..., Uk−1 )
17: Step 4: Obtain the estimates of β1 , ..., βk :
18:    Solve the penalized optimization
19:      min over (β, γ) of ∥Y − M1∗ β1 − ... − Mk∗ βk − U γ∥2 + Σ_{j=1}^{k} Pλ (βj )
20: return β̂1 , ..., β̂k
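A compact NumPy sketch of Algorithm 1 (illustrative: coefficient blocks are estimated by least squares, each stage's fitted downstream component is obtained by regressing the stacked downstream modalities on the current residualized modality, which is a simplification of the block-wise B-hats, and the ranks rj are assumed known):

```python
import numpy as np

def mvopr_multi(mods, ranks):
    """Sketch of Algorithm 1 for k modalities with a known ordering."""
    k = len(mods)
    resid = [mods[0]]                         # treat M1 as its own "residual"
    for j in range(1, k):                     # Step 1: Ej-hat = Mj - Mj-hat
        X = np.hstack(mods[:j])
        Bj, *_ = np.linalg.lstsq(X, mods[j], rcond=None)
        resid.append(mods[j] - X @ Bj)
    stars, nuis = [], []
    for j in range(k - 1):                    # Steps 2-3: project off Uj
        down = np.hstack(mods[j + 1:])        # all downstream modalities
        Cj, *_ = np.linalg.lstsq(resid[j], down, rcond=None)
        Uj, _, _ = np.linalg.svd(resid[j] @ Cj, full_matrices=False)
        Ur = Uj[:, :ranks[j]]
        stars.append(resid[j] - Ur @ (Ur.T @ resid[j]))
        nuis.append(Ur)
    stars.append(resid[-1])                   # Mk* = Ek-hat
    return stars, np.hstack(nuis)             # Step 4 would run a lasso on these

rng = np.random.default_rng(1)
n = 150
M1 = rng.standard_normal((n, 10))
M2 = 0.5 * M1 @ rng.standard_normal((10, 8)) + rng.standard_normal((n, 8))
M3 = M1 @ rng.standard_normal((10, 6)) + M2 @ rng.standard_normal((8, 6)) \
     + rng.standard_normal((n, 6))
stars, U = mvopr_multi([M1, M2, M3], ranks=[3, 3])
```

As in the two-modality case, every transformed modality is orthogonal to the stacked nuisance matrix U, so Step 4 reduces to a penalized regression on de-correlated blocks.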

2.3. Related methods. To evaluate the relative performance of MVOPR, we consider sev-
eral alternative models for multi-modality data. Specifically, we compare our method against
Cooperative Regularized Linear Regression (Cooperative; Ding et al. (2022b)), Integrative
Factor Regression (IntegFactor; Li and Li (2022)), and Factor-Adjusted Regularized Regres-
sion (Factor; Fan, Ke and Wang (2020)).

2.3.1. Cooperative Regularized Linear Regression. Cooperative regularized linear re-
gression is a widely used approach for multi-view learning. It integrates multiple modalities
by imposing an agreement penalty that encourages the predictions from different modalities
to be aligned. The level of agreement between modalities is controlled by the hyperparameter
ρ. When ρ = 0, this method reduces to traditional penalized regression with the chosen
penalty. When ρ = 1, it corresponds to a late fusion of all the modalities. Suppose there are
two modalities M1 and M2 ; the least squares problem can be written as:

(9)   min over (β1 , β2 ) of (1/2)∥Y − M1 β1 − M2 β2 ∥2 + (ρ/2)∥M1 β1 − M2 β2 ∥2 + λ1 ∥β1 ∥1 + λ2 ∥β2 ∥1 .

To simplify the optimization, λ1 and λ2 are set equal in this study. Since problem (9) is
convex, we can transform the original data as:

             (    M1        M2   )          ( Y )          ( β1 )
(10)  X̃ =   ( −√ρ M1     √ρ M2  ) ,  Ỹ =  ( 0 ) ,  β̃ =  ( β2 ) .

Based on the transformed data, problem (9) is equivalent to the generic lasso problem:

(11)  J(β̃) = (1/2)∥Ỹ − X̃ β̃∥2 + λ∥β̃∥1 .
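The data transformation in (10) can be verified numerically (a sketch; written with the convention of Ding et al. in which both quadratic terms carry a factor of one half, so that the √ρ stacking reproduces the agreement penalty exactly):

```python
import numpy as np

def cooperative_design(M1, M2, Y, rho):
    """Augmented design of Eq. (10): the sqrt(rho)-scaled contrast block turns
    the cooperative objective into a generic lasso problem on (X_t, Y_t)."""
    n = M1.shape[0]
    X_t = np.vstack([
        np.hstack([M1, M2]),
        np.hstack([-np.sqrt(rho) * M1, np.sqrt(rho) * M2]),
    ])
    Y_t = np.concatenate([Y, np.zeros(n)])
    return X_t, Y_t

# Check: the augmented residual sum of squares equals the two quadratic
# terms of (9) for an arbitrary coefficient vector.
rng = np.random.default_rng(0)
n, p, q, rho = 50, 6, 4, 0.8
M1, M2 = rng.standard_normal((n, p)), rng.standard_normal((n, q))
Y = rng.standard_normal(n)
b1, b2 = rng.standard_normal(p), rng.standard_normal(q)
X_t, Y_t = cooperative_design(M1, M2, Y, rho)
lhs = 0.5 * np.sum((Y_t - X_t @ np.concatenate([b1, b2])) ** 2)
rhs = (0.5 * np.sum((Y - M1 @ b1 - M2 @ b2) ** 2)
       + 0.5 * rho * np.sum((M1 @ b1 - M2 @ b2) ** 2))
```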

2.3.2. Factor-based Models. Factor-based methods assume that the predictor M follows an
approximate factor model
(12) M = F Λ + U,
where F is an n × K matrix of latent factors, Λ is a K × p loading matrix, and U is the n × p
matrix of idiosyncratic components. By separating the idiosyncratic components from M ,
it de-correlates the original M into a weakly correlated component U . The regression model:
(13) Y = Mβ + ϵ
can then be reformulated as:
(14) Y = U β + F γ + ϵ, where γ = Λβ is a nuisance parameter.
To estimate the factors F̂ and idiosyncratic components Û from M , we adopt the method of
Bai and Li ( Bai and Li (2012)) and Fan et al.( Fan, Liao and Mincheva (2013)). The optimal
number of latent factors K is selected based on Bai and Ng’s information criteria method
( Bai and Ng (2002)). Using these estimates, the least squares problem can be written as
follows, where γ is the nuisance parameter.
min over (β, γ) of ∥Y − Û β − F̂ γ∥2 + λρ(β).
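A minimal sketch of the factor-adjustment step via a truncated SVD (illustrative; the cited estimators of Bai and Li and of Fan et al. are more refined, and K would in practice be chosen by Bai and Ng's information criteria):

```python
import numpy as np

def factor_adjust(M, K):
    """Estimate F-hat and U-hat in M = F Lambda + U via truncated SVD: the
    top-K principal directions act as factors, the rest as idiosyncratic
    components."""
    Mc = M - M.mean(axis=0)                       # column-center the data
    Usvd, s, Vt = np.linalg.svd(Mc, full_matrices=False)
    F_hat = Usvd[:, :K] * s[:K]                   # n x K estimated factors
    Lam_hat = Vt[:K]                              # K x p estimated loadings
    U_hat = Mc - F_hat @ Lam_hat                  # idiosyncratic components
    return F_hat, Lam_hat, U_hat

rng = np.random.default_rng(1)
n, p, K = 300, 50, 3
F = rng.standard_normal((n, K))
Lam = rng.standard_normal((K, p))
M = F @ Lam + 0.5 * rng.standard_normal((n, p))
F_hat, Lam_hat, U_hat = factor_adjust(M, K)
```

Note that choosing K too large lets F-hat absorb genuine signal from M and distorts the idiosyncratic components, which is the failure mode of factor-based models in our setting.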

In Factor-Adjusted Regularized Regression, the multi-modal data are treated as a unified de-
sign matrix, and factor decomposition is applied globally across the entire dataset. Specif-
ically, let M = (M1 , ..., Mm ) represent the concatenation of all modalities (Fan, Ke and
Wang (2020)). In contrast, Integrative Factor Regression models each modality separately,
allowing for the extraction of modality-specific latent factors and idiosyncratic components:
each modality Mi has its own factors Fi and idiosyncratic component Ui , for i ∈ {1, ..., m}.
Then, the regression model is fitted using
the concatenated idiosyncratic components and latent factors, where U = (U1 , ..., Um ) and
F = (F1 , ..., Fm ) represent the concatenation of all modality-specific idiosyncratic compo-
nents and latent factors, respectively.
In our setting, M2 = M1 B + E can be interpreted as an approximate factor model with
M2 = F Λ + E , where F = Ur and Λ = ΣVrT , given that M1 B = Ur ΣVrT . However, despite
this approximate factor structure, factor-based models are not well-suited for our problem.
The decomposition used in Integrative Factor Regression will introduce a correlated nui-
sance parameter F , which may obscure the true effect of M1 . Furthermore, when M1 (I, B)
lacks spiked eigenvalues, selecting an appropriate number of factors becomes challenging in
Factor-Adjusted Regularized Regression. This often results in choosing an excessively large
number of factors, distorting the contribution of M1 and leading to suboptimal model perfor-
mance. Factor-based models impose structural assumptions that may not align well with the
dependencies present in multi-modal data. The risks associated with obscuring meaningful
relationships, introducing highly correlated nuisance parameters, and improperly selecting
the number of factors make these methods less effective in our problem setting. A more
detailed discussion of this issue is provided in the Supplementary Material Appendix A.2.

2.4. Estimation. To fit model (2), we first need to estimate the coefficient matrix B̂ that
captures the relationship between M1 and M2 . Several well-established reduced-rank re-
gression methods can be utilized for this estimation. For instance, row-sparse reduced-rank
regression ( Chen and Huang (2012)), sparse orthogonal factor regression ( Uematsu et al.
(2019)), and multivariate reduced-rank linear regression ( Chen, Dong and Chan (2013)) pro-
vide different sparsity assumptions for estimating B̂ . In general, the reduced-rank regression
problem can be formulated as the following optimization problem:

(15)  min over (U, D, V ) of ∥M2 − M1 U DV T ∥2F + λ1 ∥D∥1 + λ2 ρa (U D) + λ3 ρb (V D)
      s.t. U T U = I, V T V = I, B = U DV T ,

where the constraints U T U = I and V T V = I are introduced for identifiability. ρa and ρb are penalty
functions. They can be entry-wise L1 norm or row-wise L2,1 norm. λ1 , λ2 , λ3 are the tuning
parameters that control the magnitude of regularization. This framework generalizes sev-
eral well-known methods: Row-sparse Reduced-Rank Regression when λ1 = λ3 = 0 and
ρa = ∥ · ∥2,1 ; Multivariate Reduced-Rank Linear Regression when λ1 = λ2 = λ3 = 0; Sparse
Orthogonal Factor Regression when all tuning parameters are nonzero. The tuning parame-
ters and rank r are chosen based on the GIC (Fan and Tang (2013)). With the fitted model
above, we could obtain the coefficient matrix B̂ and residual term Ê .
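For the unpenalized special case of (15) (λ1 = λ2 = λ3 = 0), the reduced-rank solution admits a simple construction: fit OLS, then truncate the SVD of the fitted values to rank r. A sketch (illustrative only; the sparse variants would add the penalties and require iterative solvers):

```python
import numpy as np

def reduced_rank_fit(M1, M2, r):
    """Reduced-rank regression sketch: OLS fit, then rank-r truncation of the
    fitted values (unpenalized special case of problem (15))."""
    B_ols, *_ = np.linalg.lstsq(M1, M2, rcond=None)
    U, s, Vt = np.linalg.svd(M1 @ B_ols, full_matrices=False)
    fitted_r = (U[:, :r] * s[:r]) @ Vt[:r]        # best rank-r fit
    B_r, *_ = np.linalg.lstsq(M1, fitted_r, rcond=None)
    E_hat = M2 - M1 @ B_r                         # residual term E-hat
    return B_r, E_hat

rng = np.random.default_rng(2)
n, p, q, r = 200, 15, 12, 3
M1 = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
M2 = M1 @ B_true + 0.1 * rng.standard_normal((n, q))
B_hat, E_hat = reduced_rank_fit(M1, M2, r)
```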

Next, we estimate P as P̂ = Ûr ÛrT , constructed from the first r left singular vectors Ûr
of M1 B̂ . The transformed M1 and M2 can then be estimated following the previous
procedures. Once the transformed matrices are obtained, we estimate β̂1 and β̂2 . This is
done by solving the following penalized least squares problem:
(16)  min over (β1 , β2 , γ2 ) of ∥Y − M̂1∗ β1 − M̂2∗ β2 − Ur γ2 ∥2 + λρ(β1 ) + λρ(β2 ),

where ρ is a generic penalty function including the L1 norm, adaptive Lasso, MCP, and
SCAD penalties. λ is a tuning parameter that controls the regularization power on both β1
and β2 .
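Because Ur is orthonormal and orthogonal to the transformed predictors, the unpenalized nuisance γ2 in (16) can be profiled out by projecting Y off Ur, after which (16) reduces to an ordinary lasso problem. A sketch using a plain ISTA solver (illustrative; in practice one would use an off-the-shelf solver and possibly the adaptive Lasso, MCP, or SCAD penalties listed above):

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """ISTA for min_b ||y - X b||^2 + lam * ||b||_1 (generic L1 solver)."""
    b = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        z = b - step * 2.0 * X.T @ (X @ b - y)       # gradient step
        b = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft-threshold
    return b

def mvopr_fit(M1_star, M2_star, Ur, Y, lam):
    # Profile out gamma: with Ur orthogonal to the transformed predictors,
    # minimizing (16) over gamma amounts to replacing Y by (I - Ur Ur^T) Y.
    Y_proj = Y - Ur @ (Ur.T @ Y)
    X = np.hstack([M1_star, M2_star])
    beta = lasso_ista(X, Y_proj, lam)
    return beta[:M1_star.shape[1]], beta[M1_star.shape[1]:]

# Toy check on synthetic data already in the transformed (orthogonal) form
rng = np.random.default_rng(3)
n, p, q = 120, 10, 10
Ur, _ = np.linalg.qr(rng.standard_normal((n, 2)))
M1s = rng.standard_normal((n, p))
M1s -= Ur @ (Ur.T @ M1s)
M2s = rng.standard_normal((n, q))
M2s -= Ur @ (Ur.T @ M2s)
b1 = np.zeros(p); b1[:2] = 2.0
b2 = np.zeros(q); b2[:2] = -2.0
Y = M1s @ b1 + M2s @ b2 + Ur @ np.array([1.0, -1.0]) + 0.1 * rng.standard_normal(n)
beta1_hat, beta2_hat = mvopr_fit(M1s, M2s, Ur, Y, lam=1.0)
```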
For MVOPR with three modalities, the estimation of β1 , β2 , and β3 follows a similar penal-
ized least squares approach:
(17)  min over (β1 , β2 , β3 , γ1 , γ2 ) of ∥Y − M̂1∗ β1 − M̂2∗ β2 − M̂3∗ β3 − U1 γ1 − U2 γ2 ∥2
            + λρ(β1 ) + λρ(β2 ) + λρ(β3 ),

where M̂1∗ = (I − P̂1 )M1 , M̂2∗ = (I − P̂2 )Ê2 , and M̂3∗ = Ê3 . Here U1 and U2 are the left singular
vectors with non-zero singular values of M1 (B̂2,1 , B̂3,1 ) and Ê2 B̂3,2 , respectively.

3. Numerical analysis.

3.1. Variable selection on two modalities.

3.1.1. E with identity covariance matrix. To assess the performance of MVOPR in com-
parison to other methods, we carry out simulations under different noise levels for ϵ1 and
ϵ2 . In these simulations, suppose there are two modalities M1 and M2 , each with 300 features and
200 observations. M1 is generated from a multivariate normal distribution M V N (0p , ΣM1 )
with identity covariance matrix. M2 is connected with M1 through a low-rank, row-sparse
coefficient matrix B with 95% of its rows set to zero and rank r = 1. The response Y is asso-
ciated with both M1 and M2 through β1 and β2 , which are generated with 290 zero and
10 non-zero coefficients. The values of the non-zero coefficients are sampled from the uniform
distribution U (1, 2). We fix the signal-to-noise ratio (SNR) of ϵ1 at 100. By varying the
SNR of ϵ2 , we compare the variable selection performance of each model by AUC. Each
AUC is calculated over a 100-point grid of λ controlling the strength of sparsity. We
conduct each simulation experiment 100 times for each SNR of ϵ2 .
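The variable-selection AUC used here can be computed directly from a per-variable selection score (a sketch; hypothetically the score could be, for example, the largest λ at which a variable enters the lasso path):

```python
import numpy as np

def selection_auc(beta_true, score):
    """Variable-selection AUC: the probability that a truly non-zero
    coefficient receives a higher score than a truly zero one (ties count
    one half)."""
    pos = score[beta_true != 0]
    neg = score[beta_true == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))

beta_true = np.array([1.0, 2.0, 0.0, 0.0, 0.0])
perfect = selection_auc(beta_true, np.array([0.9, 0.8, 0.1, 0.2, 0.0]))  # 1.0
worst = selection_auc(beta_true, np.array([0.0, 0.1, 0.9, 0.8, 0.7]))    # 0.0
```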
[Figure 2: panels Total AUC, AUC (M1), and AUC (M2) versus SNR, comparing Adaptive Lasso, Cooperative (ρ=0.5), Cooperative (ρ=1), Factor, IntegFactor, and MVOPR.]

FIG 2. The AUC for each model by varying the SNR of ϵ2 from 5 to 10.

MVOPR outperforms other methods in terms of AUC across the entire range of SNR val-
ues (Figure.2). Factor-Adjusted Regularized Regression exhibits comparable performance to
MVOPR in scenarios with low SNR. However, as the SNR increases, which corresponds
to stronger correlations between M1 and M2 , MVOPR demonstrates clear superiority over
Factor-Adjusted Regularized Regression. In particular, MVOPR shows evident benefits in vari-
able selection for M2 when the SNR is large, reflecting its ability to better inte-
grate information across modalities and to maintain high AUC values even under
more challenging conditions. In contrast, Factor-Adjusted Regularized Regression appears
to struggle under high SNR conditions, likely due to its reliance on factor decomposition.
Moreover, other competing methods, such as Integrative Factor Regression and Cooperative
learning method, show declining performance as the SNR increases. These methods appear
to be less effective in maintaining robust performance when faced with strong inter-modality
correlations, highlighting the advantage of MVOPR in such scenarios.
Two alternative simulations are designed to show that factor-based models may not be ideal
for accounting for inter-modality correlations. In the first simulation, M1 and M2 are
generated from multivariate normal distributions with identity covariance matrices, with 50
and 300 features respectively and 200 samples each. M2 is connected with M1 through a low-rank,
row-sparse coefficient matrix B with 70% of its rows set to zero and rank r = 9. β1 and β2
each have only 10 non-zero coefficients. The second simulation showcases the performance
of MVOPR on low-dimensional data: M1 and M2 are generated from multivariate normal
distributions with identity covariance matrices, with 200 samples and 50 features each. The
low-rank, row-sparse coefficient matrix B has 50% of its rows set to zero and rank r = 3. β1 and β2
each have 25 non-zero coefficients. The SNR of ϵ1 in both simulations is fixed at 100.
[Figure 3: panels Total AUC, AUC (M1), and AUC (M2) versus SNR for the same six models.]

FIG 3. The AUC for each model by varying the SNR of ϵ2 from 20 to 30 when M1 and M2 have different numbers
of features.

[Figure 4: panels Total AUC, AUC (M1), and AUC (M2) versus SNR for the same six models.]

FIG 4. The AUC for each model by varying the SNR of ϵ2 from 15 to 20 when M1 and M2 each have 50 features.

Under strong correlations between M1 and M2 , MVOPR performs consistently well in
variable selection for both M1 and M2 compared to the other methods (Figures 3, 4). Factor-Adjusted
Regularized Regression shows comparable performance to MVOPR when the SNR is smaller
than 22. However, as the inter-modality correlation becomes stronger, its variable selection ability
for M1 declines dramatically. This decline is likely attributable to the selection of an excessively
large number of factors, which overwhelms the meaningful signal and reduces its perfor-
mance in isolating relevant variables from M1 . This problem becomes more obvious in the
low-dimensional simulation, where (M1 , M2 ) is more likely to be decomposed into a structure
with excessive factors (Figure 4). Integrative Factor Regression consistently underperforms
in selecting variables for M1 , even when it selects zero factors for this modality.
This can be attributed to its nuisance variables being correlated with M1 , which disrupts variable
selection for the predictors. This limitation underscores the difficulty of maintaining a balance
between factor decomposition and effective variable selection in the presence of strong inter-
modality correlations (Figure 3).

3.2. Misspecified Case.

3.2.1. E with correlated structure. In real-world settings, E may not always have an
independent covariance structure. To verify whether MVOPR still works in this misspecified
case, we consider two covariance patterns: auto-regressive (AR1) and compound symmetry (CS).
In the simulations below, we generate two modalities M1 and M2 , each with 300 features
and 200 observations. M1 is generated from M V N (0p , ΣM1 ) with an identity covariance matrix.
M2 is associated with M1 through a low-rank, row-sparse coefficient matrix B with 50% of its
rows set to zero and rank r = 1. The response Y is associated with both M1 and M2 through
β1 and β2 , which are generated with 90 zero and 10 non-zero coefficients. The absolute values
of the non-zero coefficients are sampled from the uniform distribution U (1, 2). The SNRs for
ϵ1 and ϵ2 are fixed at 3 and 5. We compare the variable selection performance of each model
by AUC. In the AR1 case, E is generated from M V N (0q , Σρ ). The diagonal elements of Σρ
are 1, with cov(ϵi2 , ϵj2 ) = ρ^|i−j| ; both ρ = 0.9 and ρ = 0.95 are included. The results are
shown in Figure 5A and Figure 5B. Under this misspecified case, MVOPR still achieves a
higher AUC than the other methods. In the compound symmetry case, E is generated from
M V N (0q , Σµ ). The diagonal elements of Σµ are 1, with cov(ϵi2 , ϵj2 ) = µ for i ̸= j . We test
the performance of each model under µ = 0.7 and µ = 0.9. The results are shown in Figure 5C
and Figure 5D. Under this condition, both MVOPR and the factor-based models perform well
compared to the adaptive lasso.
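The two misspecified error covariances above can be generated directly. The sketch below, in numpy, mirrors the described setup (rank-1 B with 50% zero rows); the seed and the helper names `ar1_cov` and `cs_cov` are illustrative choices, not part of the original implementation:

```python
import numpy as np

def ar1_cov(q, rho):
    """AR(1) covariance: Sigma[i, j] = rho ** |i - j|, ones on the diagonal."""
    idx = np.arange(q)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def cs_cov(q, mu):
    """Compound-symmetry covariance: 1 on the diagonal, mu off the diagonal."""
    return np.full((q, q), mu) + (1 - mu) * np.eye(q)

rng = np.random.default_rng(0)
n, p, q = 200, 300, 300
M1 = rng.normal(size=(n, p))                            # Sigma_M1 = I
B = rng.normal(size=(p, 1)) @ rng.normal(size=(1, q))   # rank-1 coefficient matrix
B[rng.choice(p, p // 2, replace=False), :] = 0.0        # 50% of rows set to zero
E = rng.multivariate_normal(np.zeros(q), ar1_cov(q, 0.9), size=n)
M2 = M1 @ B + E                                         # modality with correlated errors
```

Swapping `ar1_cov(q, 0.9)` for `cs_cov(q, 0.7)` reproduces the compound-symmetry condition.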

F IG 5. The AUC of each model when ϵ1 has a correlated structure. Panels A–B show the AUC for each model under the auto-regressive (AR1) covariance pattern (ρ = 0.9 and ρ = 0.95); panels C–D show the AUC for each model under the compound symmetry (CS) covariance pattern. Methods compared: Adaptive Lasso, Cooperative (ρ = 0.5), Cooperative (ρ = 1), Factor, IntegFactor, MVOPR.

3.2.2. Null Experiment: No inter-modality correlation. To evaluate whether MVOPR
can still perform well when the M2 = M1 B + E assumption does not hold, we generate both
M1 and M2 from M V N (0p , ΣM1,2 ) independently. In this simulation, we take ΣM1,2 to be
either the identity or an auto-regressive (ρ = 0.9) covariance matrix. Both M1 and M2 have 100
features and 200 samples. Y is associated with M1 and M2 through β1 and β2 , which are
generated by the same rule as in Section 3.2.1.
Based on the results in Figures 6 and 7, the models perform similarly
under ΣM1,2 = I , meaning that even when the unidirectional assumption does not hold, MVOPR
still works and matches the performance of the other methods. However, if ΣM1,2 follows
an auto-regressive (ρ = 0.9) covariance pattern, the factor-based models exhibit weaker perfor-
mance compared to adaptive Lasso and MVOPR. This may be attributed to the covariance
structure of each modality, as the absence of spiked eigenvalues hinders the effectiveness of
factor decomposition.
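The spiked-eigenvalue argument can be checked numerically: under an AR(1) covariance with no factor structure, the leading sample eigenvalues of a modality do not separate sharply from the rest. A small numpy sketch, with illustrative dimensions and seed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 200, 100, 0.9
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])    # AR(1), no factor structure
M = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Sorted sample eigenvalues, largest first.
evals = np.linalg.eigvalsh(np.cov(M, rowvar=False))[::-1]
spike_ratio = evals[0] / evals[1]
# An approximate factor model expects a few eigenvalues growing with p; here the
# leading eigenvalues decay smoothly, so factor decomposition is ill-suited.
```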

F IG 6. The AUC of each model when both M1 and M2 have a diagonal covariance matrix, without the M2 = M1 B assumption. Methods compared: Adaptive Lasso, Cooperative (ρ = 0, 0.5, 1), Factor, IntegFactor, Multi-view.

F IG 7. The AUC of each model when both M1 and M2 have an autoregressive (ρ = 0.9) covariance matrix, without the M2 = M1 B assumption.

3.3. Simulation for Multi-modalities. To evaluate the empirical performance of MVOPR
in the multi-modality setting, we consider the three-modality case from model (7). Suppose
there are three modalities M1 , M2 , and M3 . Each modality has the same number of variables,
p = p1 = p2 = p3 = 100, with 100 observations. B1 , B2 , and B3 are three low-rank coefficient
matrices with ranks r1 = 3 and r2 = r3 = 1. B1 , B2 , and B3 are dense matrices with no row-wise
sparsity. Y is the response variable associated with M1 , M2 , and M3 .
The estimates B̂1 , B̂2 , and B̂3 are obtained by multivariate reduced-rank regression. E2
and E3 are generated from M V N (0p , Σ), where Σ is the identity. In this simulation, the
SNRs for E2 and E3 are fixed at 10 and 20. We also consider a misspecified case where
E2 and E3 have correlated covariance structures.
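The reduced-rank estimation step can be sketched as follows, assuming the classical identity-weighted reduced-rank regression solution (an OLS fit followed by projection onto the top right singular directions of the fitted values); dimensions are scaled down from the simulation for illustration:

```python
import numpy as np

def rrr(X, Y, rank):
    """Reduced-rank regression (identity weighting): OLS fit, then projection of
    the coefficients onto the leading right singular directions of X @ B_ols."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    Vr = Vt[:rank].T
    return B_ols @ Vr @ Vr.T          # rank-constrained coefficient estimate

rng = np.random.default_rng(0)
n, p, q, r = 200, 50, 60, 1           # illustrative sizes, smaller than the simulation
M1 = rng.normal(size=(n, p))
B = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))   # true rank-r coefficient
M2 = M1 @ B + rng.normal(size=(n, q))
B_hat = rrr(M1, M2, rank=r)
E_hat = M2 - M1 @ B_hat               # residual modality used by MVOPR
```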

F IG 8. AUC of the variable selection in M1 , M2 , and M3 . (A) The AUC distributions of MVOPR and other
methods when E2 and E3 have identity covariance; (B) The AUC distributions of MVOPR and other methods
when E2 and E3 have AR1 covariance.

Based on Figure 8, we find that MVOPR outperforms the other methods in terms of overall
AUC, AUC in M2 , and AUC in M3 . MVOPR achieves the highest mean AUC values in
these categories, indicating its superior performance in multi-modal integration. Among the
competing methods, factor-based models show some improvement in overall AUC and in AUC for
M2 and M3 compared to Adaptive Lasso, but their performance remains worse than MVOPR's.
Specifically, Integrative Factor Regression exhibits a notable decline in performance when E2
and E3 share a correlated covariance structure. This result suggests that factor-based models
may struggle to capture the intricate inter-modality correlations. Cooperative Learning
performs worse than adaptive Lasso, indicating that the agreement penalty may
not always benefit variable selection in these settings. The simulation results reveal
the robustness and superiority of MVOPR in handling complex multi-omics settings. Even in
misspecified scenarios, MVOPR consistently outperforms competing methods, demonstrat-
ing its reliability and effectiveness in capturing intricate correlations.

4. Real data analysis.

4.1. CAARS Data Analysis. We apply the MVOPR model to the CAARS data, col-
lected from 55 patients. This dataset contains two omics layers: the microbiome and the metabolome.
The study aims to understand how these omics data influence the continuous eosinophil count.
To reduce the dimensionality of the microbiome and metabolome, we retain only the 200
metabolites with the highest variance, and the 139 microbial taxa are aggregated into 31
family-level groups. We then normalize the microbiome data with the centered log-ratio
transformation, and the metabolome data are centered and scaled. We use the square root of
the continuous eosinophil count as the response.
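The preprocessing can be sketched as follows; the `clr` helper and its pseudocount of 1 are illustrative choices (the exact implementation in the analysis may differ), and the random matrices stand in for the real microbiome counts and metabolite intensities:

```python
import numpy as np

def clr(counts, pseudo=1.0):
    """Centered log-ratio transform for compositional microbiome data.
    A pseudocount avoids log(0); rows are samples, columns are taxa."""
    x = np.log(counts + pseudo)
    return x - x.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
M1 = clr(rng.poisson(20, size=(55, 31)))        # 55 patients, 31 microbial families
M2 = rng.normal(size=(55, 200))                 # 200 highest-variance metabolites
M2 = (M2 - M2.mean(axis=0)) / M2.std(axis=0)    # center and scale each metabolite
```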
Multivariate reduced-rank regression is applied to estimate the coefficient matrix B̂, with
the metabolome as the response and the microbiome as the predictor. The original omics datasets are
transformed based on B̂ and the residuals Ê . To analyze the association between the square
root of the continuous eosinophil count and the transformed data, an L1 penalty is used for variable
selection. We use leave-one-out resampling to assess the robustness of the variable selection;
out-of-sample MSE and stability indicators quantify the performance of each model. If
a feature is selected with a non-zero coefficient in at least 85% of the iterations, it is considered
a selected feature. The stability indicators are defined as follows: let Si denote the set of vari-
ables selected by the model in the ith iteration of the leave-one-out resampling; stability is
computed over all pairs (Si , Sj ) with i ̸= j . We use three stability indicators: the Jaccard
similarity coefficient, the Otsuka–Ochiai coefficient, and the Sørensen–Dice coefficient
(Kwon et al. (2023)).

Jaccard(Si , Sj ) = |Si ∩ Sj | / |Si ∪ Sj |,   Ochiai(Si , Sj ) = |Si ∩ Sj | / √(|Si | · |Sj |),   Dice(Si , Sj ) = 2|Si ∩ Sj | / (|Si | + |Sj |)
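The three indicators translate directly into code on Python sets; `mean_stability` is a hypothetical helper (not from the original analysis) that averages an indicator over all leave-one-out pairs:

```python
import math

def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def ochiai(a, b):
    """Otsuka–Ochiai coefficient: |A ∩ B| / sqrt(|A| * |B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def dice(a, b):
    """Sørensen–Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

def mean_stability(sets, metric):
    """Average a pairwise stability metric over all pairs of selection sets."""
    pairs = [(s, t) for i, s in enumerate(sets) for t in sets[i + 1:]]
    return sum(metric(s, t) for s, t in pairs) / len(pairs)
```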
In the results (Table 4.1), MVOPR achieves the lowest mean squared error (MSE) of 177.73
and the highest Jaccard similarity (0.66), Otsuka–Ochiai coefficient (0.75), and Sørensen–Dice
coefficient (0.74). These results reflect its robustness in variable selection and model stability.
Additionally, MVOPR successfully selects five features. Notably, these selected features also
appear with non-zero coefficients in traditional Lasso regression during some iterations. How-
ever, their selection frequencies are low, and their confidence intervals cross zero, indicating
a lack of significance and stability in the traditional Lasso approach. In contrast, MVOPR
not only identifies these features consistently but also provides stronger evidence of their
significance. The Factor-Adjusted Regularized Regression and Integrative Factor Regression
models show worse stability and MSE than MVOPR. These results suggest
that although factor-based models may partially account for the within-modality correlations,
they may not leverage the inter-modality correlation as effectively as MVOPR.
Models                          MSE     Jaccard  Otsuka–Ochiai  Sørensen–Dice  Selected Features
Multi-view regression           177.73  0.66     0.75           0.74           5
Lasso regression                223.44  0.27     0.35           0.32           0
Factor Regression               188.59  0.44     0.62           0.56           1
Integrative Factor Regression   414.49  0.41     0.53           0.47           1

F IG 9. MVOPR for CAARS Data Analysis. (A) Selection frequency for the Microbiome (ID: 1–31) and Metabolome (ID: 32–231); (B) Confidence Intervals for the Coefficient Estimates; (C) Pearson Correlation Matrix between Selected Microbiome and Metabolome.
F IG 10. Pearson Correlation Matrix between Selected Microbiome and Metabolome within MVOPR.

In MVOPR, the Bacteroidaceae family is consistently selected with a nonzero coefficient across
52 iterations, showing a positive average effect on the square-root-transformed continuous
eosinophil count. This result aligns with previous research suggesting that an increased rel-
ative abundance of Bacteroidaceae is associated with asthma development (Zimmermann
et al. (2019)). Within Bacteroidaceae, the genus Bacteroides plays a particularly important
role in asthma pathophysiology, and several studies have identified Bacteroides as a key micro-
bial component in asthma progression (Mahdavinia et al. (2023); Aslam et al. (2024); Fiuza
et al. (2024)). The four selected metabolites are stearic acid (tearic_acid_duplicate_2), murocholic
acid (murocholic_acid_duplicate_2), 1-18:1-LysoPE (lyso_pe_18_1_9z_0_0_duplicate_2),
and leucine (leucine_d10_i_std). Stearic acid has been previously identified as a biomarker
for asthma, showing elevated levels in asthma patients. Studies by Tao et al. (2017, 2019)
demonstrated that stearic acid performed well in distinguishing asthma pa-
tients from healthy controls. In our analysis, stearic acid also exhibited a positive correla-
tion with the square root of the continuous eosinophil count, aligning with these findings.
Muricholic acid has been linked to asthma in obesity models: a study by Barosova et al.
(2023) found that obese mice with induced asthma had significantly higher muricholic acid
levels than obese control mice. In our results, muricholic acid was positively corre-
lated with eosinophilic inflammation, supporting its role in asthma pathophysiology. LysoPE
(lysophosphatidylethanolamine) belongs to the lysophospholipids, a large sub-
class of phospholipids. In previous studies of inflammatory diseases, researchers found
that lysophospholipid signals were associated with Chronic Obstructive Pulmonary
Disease (Madapoosi et al. (2022)). In our analysis, leucine is identified as a significant
metabolite associated with eosinophilic inflammation. Consistent with prior findings, higher
leucine levels have been reported in asthmatic individuals with elevated exhaled nitric ox-
ide (FeNO > 35), a biomarker indicative of eosinophil-driven inflammation (Comhair et al.
(2015)). This suggests that leucine may play a role in asthma pathophysiology, particularly
in individuals with an active eosinophilic response. The interplay between Bacteroides and these
metabolites is further supported by lipidomic analyses. Notably, differences in lipid profiles
among Bacteroides are largely driven by variations in plasmalogens, glycerophosphoinosi-
tols, and certain sphingolipids. These lipidomic distinctions may influence immune responses
and inflammation, providing insight into the mechanisms by which Bacteroides species con-
tribute to asthma pathogenesis (Ryan, Joyce and Clarke (2023); Ryan et al. (2023)). Overall,
MVOPR demonstrates strong performance in the real-data analysis, effectively identifying
key microbial and metabolic features associated with eosinophilic inflammation in asthma.
By selecting the Bacteroidaceae family and relevant metabolites, MVOPR aligns well with
established biological findings, highlighting its ability to capture meaningful microbiome-
metabolome interactions. These results underscore the robustness and reliability of MVOPR in
modeling complex multi-omics relationships, making it a powerful tool for uncovering biomark-
ers in asthma.

5. Discussion. The MVOPR model presents a novel approach to multi-omics data in-
tegration, using an orthogonal projection framework to handle correlated predictors and
enhance variable selection. Our model is effective under the unidirectional assumption,
aligning well with inherent biological pathways such as the Central Dogma of Molecu-
lar Biology. Traditional methods, such as Lasso-based regression and factor-based models,
struggle in multi-omics settings due to the strong within- and inter-modality correlations.
MVOPR effectively addresses these challenges by leveraging the unidirectional assumptions
between omics layers and employing an orthogonal projection framework to mitigate multi-
collinearity.
Based on the results from simulations and real data analysis, MVOPR shows superior
performance over competing methods. Unlike factor-based models, which require an
approximate factor structure on the predictors, MVOPR successfully eliminates redundant de-
pendencies while preserving meaningful signals for variable selection. Even in scenarios
where the inter-modality correlation assumption is violated, MVOPR maintains competitive
performance. This suggests that MVOPR generalizes well be-
yond ideal conditions, making it a reliable tool for real-world applications.
However, in cases where the model is severely misspecified, such as an incorrectly assumed
directionality, the performance of MVOPR can be affected. For instance, if the true causal direc-
tion is from M1 to M2 but a model with the reverse direction (M2 to M1 ) is fitted, the estimated
coefficient matrix B̂ may be poorly constructed, leading to poor projections and inaccurate
variable selection. To ensure proper unidirectional modeling, a strong understanding of the
latent relationships between modalities is crucial. This can be established through biological
knowledge, such as the Central Dogma of Molecular Biology, or through causal inference that
helps determine the correct directionality before model fitting.
In the current analysis, MVOPR operates within a linear regression framework. However, some
biological systems are inherently nonlinear and hierarchical, often involving complex inter-
actions between different omics layers. Future extensions of MVOPR could incorporate non-
linear models, such as kernel-based methods or deep-learning approaches, to capture these
intricate dependencies more effectively.
When applying MVOPR to the CAARS dataset, we successfully identify microbial and
metabolic markers linked to eosinophilic inflammation in asthma. Notably, MVOPR se-
lects biomarkers that align with prior research. Compared to competing approaches,
MVOPR demonstrates higher model stability and lower mean squared error (MSE) in the real-
data analysis. Traditional methods such as Lasso regression and factor-based models fail to
maintain consistent variable selection across iterations. In contrast, MVOPR achieves higher
stability indicators (Jaccard, Otsuka–Ochiai, and Sørensen–Dice coefficients), suggesting im-
proved stability in biomarker identification.
MVOPR represents an advancement in multi-omics variable selection, providing a robust,
interpretable, and biologically relevant framework for multi-view data integration. By suc-
cessfully mitigating within- and inter-modality correlations, MVOPR allows for more pre-
cise biomarker discovery, particularly in complex diseases such as asthma. As multi-omics
datasets continue to grow, MVOPR offers a powerful and stable method for integra-
tive analysis, providing a novel framework for personalized medicine and targeted therapeutic
strategies.

Acknowledgments. The authors would like to thank the anonymous referees, an Asso-
ciate Editor and the Editor for their constructive comments that improved the quality of this
paper.

Funding. The first author was supported by NSF Grant DMS-??-??????.


The second author was supported in part by NIH Grant ???????????.

APPENDIX A: EXTENSION TO MULTIPLE MODALITIES


A.1. Three modalities case. For the three-modality case, we first transform M2 and
M3 into their residual forms E2 and E3 . Model (8) can then be rewritten as below:
Y = M1 β1 + E2 β2 + E3 β3 + M1 (B2,1 β2 + B3,1 β3 ) + M2 B3,2 β3 + ϵ1
where M2 can be further decomposed as M1 B2,1 + E2 . Therefore,
Y = M1 β1 + E2 β2 + E3 β3 + M1 (B2,1 β2 + B3,1 β3 + B2,1 B3,2 β3 ) + E2 B3,2 β3 + ϵ1
Since the two nuisance variables M1 (B2,1 , B3,1 ) and E2 B3,2 are correlated with the predictors
M1 and E2 , we need to project those predictors onto the orthogonal subspace. Suppose
M1 (B2,1 , B3,1 ) = U1 Σ1 V1T and E2 B3,2 = U2 Σ2 V2T with ranks r1 and r2 . The two projection
matrices are P1 = U1 U1T and P2 = U2 U2T . Based on the projection, we have:
Y = (I − P1 )M1 β1 + (I − P2 )E2 β2 + E3 β3 + U1 γ1 + U2 γ2 + ϵ1
In this form, the regression has mutually uncorrelated predictors and nuisance variables. Below,
we briefly discuss the assumptions on the independence between predictors and nuisance variables:
1. E3 ⊥⊥ (I − P1 )M1 and E3 ⊥⊥ (I − P2 )E2 hold since E3 ⊥⊥ M1 and E3 ⊥⊥ E2 .
2. (I − P2 )E2 ⊥⊥ (I − P1 )M1 holds since E2 ⊥⊥ M1 .
3. (I − P1 )M1 ⊥⊥ U1,r and (I − P2 )E2 ⊥⊥ U2,r′ hold since the projection matrices P1 , P2
are orthogonal to their complements (I − P1 ), (I − P2 ).
4. (I − P2 )E2 ⊥⊥ U1,r and E3 ⊥⊥ U1,r since E2 ⊥⊥ M1 and E3 ⊥⊥ M1 .
5. (I − P1 )M1 ⊥⊥ U2,r′ and E3 ⊥⊥ U2,r′ since M1 ⊥⊥ E2 and E3 ⊥⊥ E2 .
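The projection step above can be sketched in numpy. The `orth_projector` helper builds P = UU^T from a thin SVD as described; the dimensions, seed, and rank-2 coefficient matrices are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def orth_projector(A, tol=1e-10):
    """Projection P = U U^T onto the column space of A, via thin SVD."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    r = int((s > tol * s[0]).sum())   # numerical rank
    Ur = U[:, :r]
    return Ur @ Ur.T

rng = np.random.default_rng(1)
n, p, r = 200, 50, 2
M1 = rng.normal(size=(n, p))
E2 = rng.normal(size=(n, p))
low_rank = lambda: rng.normal(size=(p, r)) @ rng.normal(size=(r, p))
B21, B31, B32 = low_rank(), low_rank(), low_rank()   # illustrative rank-2 coefficients

P1 = orth_projector(M1 @ np.hstack([B21, B31]))      # span of M1 (B_{2,1}, B_{3,1})
P2 = orth_projector(E2 @ B32)                        # span of E2 B_{3,2}
M1_perp = M1 - P1 @ M1                               # (I - P1) M1
E2_perp = E2 - P2 @ E2                               # (I - P2) E2
```

By construction the projected predictors are orthogonal to their nuisance spans, so the two can enter the regression without confounding each other.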

A.2. Multiple modalities case. For multi-omics data with k modalities, we first need to de-
termine an ordering of the modalities. Based on the Central Dogma of Molecular Biology, genomics
generally serves as the first modality, with the ability to influence all the downstream
elements; proteomics or metabolomics may serve as the last modality, which can be regu-
lated by upstream elements. Any omics layer between the first and last modality serves as an
intermediate modality, such as the transcriptome. Once we have the sequential information for the
multi-omics data, we transform each modality except the first into its residual form:
Y = M1 β1 + Σ_{j=2}^{k} Ej βj + M1 B1∗ γ1 + Σ_{i=2}^{k−1} Ei Bi∗ γi + ϵ1
where B1∗ = (B2,1 , B3,1 , ..., Bk,1 ) and, for any 2 ≤ i ≤ k − 1, Bi∗ = (Bi+1,i , Bi+2,i , ..., Bk,i ).
To remove the correlation between the predictors and the nuisance variables, we next project each
predictor onto the orthogonal subspace. Suppose we have the SVD of each nuisance variable:
M1 B1∗ = U1 Σ1 V1T ,  Ei Bi∗ = Ui Σi ViT
Assume U1 , U2 , ..., Uk−1 have ranks r1 , r2 , ..., rk−1 , with projection matrices P1 = U1 U1T , P2 =
U2 U2T , ..., Pk−1 = Uk−1 Uk−1T . Then the final model for MVOPR is:
Y = P1⊥ M1 β1 + Σ_{i=2}^{k−1} Pi⊥ Ei βi + Ek βk + Σ_{j=1}^{k−1} Uj γj∗ + ϵ1
The transformed modalities are mutually uncorrelated with each other in the regression.

APPENDIX B: CONNECTION TO FACTOR-BASED MODELS
Assume M1 ∈ Rn×p and M2 ∈ Rn×q are two omics datasets, and let Y denote the response
variable. Suppose Y is associated with M1 and M2 through β1 ∈ Rp and β2 ∈ Rq . The interplay
between M1 and M2 is captured by a low-rank coefficient matrix B with rank r . E and
ϵ1 are the error matrix and vector.
Y = M1 β1 + M2 β2 + ϵ1
M2 = M1 B + E
Suppose B has an SVD B = UB ΣB VBT . Then M2 can be expressed through
an approximate factor model with F = M1 UB and Λ = ΣB VBT . This structure aligns with the
settings of Integrative Factor Regression and Factor-Adjusted Regularized Regression:
M2 = M1 UB ΣB VBT + E
= FΛ + E
Suppose M1 does not follow an approximate factor model structure. In Integrative Factor Re-
gression, the factor decomposition of (M1 , M2 ) becomes (M1 , E) with the factors F as
nuisance parameters. However, since the factors are given by F = M1 UB , F
is a linear combination of M1 and is therefore highly correlated with it. When we fit the re-
gression Y = M1 β1 + Eβ2 + F γ + ϵ1 , the true contribution of M1 is obscured by the
correlated nuisance parameter F . When M1 has some spiked eigenvalues and can be ap-
proximated by a factor model, a similar problem persists. Suppose M1 = F1 Λ1 + U1 ; then the
decomposition of (M1 , M2 ) becomes (U1 , E) with nuisance parameters (F1 , F ). Since
F = M1 UB = F1 Λ1 UB + U1 UB , F is still correlated with F1 and U1 .
In Factor-Adjusted Regularized Regression, the matrix M = (M1 , M2 ) is treated as a whole
and decomposed accordingly. Since M follows the decomposition M = M1 (I, B) + (0, E),
which does not perfectly align with the model's assumption, the selection of the number of
factors is affected. When B has a rank r that is close or equal to p, M can be decom-
posed as F Λ + (0, E) with p factors. In this case, the transformed M1 is nearly zero, as
most of its information is absorbed by the factors. A similar issue arises when M1 (I, B)
lacks spiked eigenvalues, making the factor structure difficult to distinguish. This in-
creases the risk of selecting an excessively large number of factors, potentially distorting the
factor adjustment process.
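The correlation between the implied factors and M1 can be illustrated numerically: with M2 = M1 B + E and a low-rank B, the leading factor extracted from M2 lies almost entirely in the column space of M1, which is exactly why it behaves as a correlated nuisance regressor. The dimensions, signal strengths, and seed below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q, r = 200, 30, 40, 2
M1 = rng.normal(size=(n, p))
B = rng.normal(size=(p, r)) @ np.diag([5.0, 3.0]) @ rng.normal(size=(r, q))
M2 = M1 @ B + 0.1 * rng.normal(size=(n, q))     # M2 = M1 B + E with small noise

# Leading "factor" estimated from M2 alone, as a factor model would extract it.
Uhat, _, _ = np.linalg.svd(M2 - M2.mean(axis=0), full_matrices=False)
f = Uhat[:, 0]

# Regress the estimated factor on M1: the R^2 is near 1, i.e., the factor lies
# almost entirely in span(M1) and would mask M1's own contribution to Y.
coef, *_ = np.linalg.lstsq(M1, f, rcond=None)
resid = f - M1 @ coef
r2 = 1 - resid @ resid / ((f - f.mean()) @ (f - f.mean()))
```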
REFERENCES
ABDEL-AZIZ, M. I., NEERINCX, A. H., VIJVERBERG, S. J., KRANEVELD, A. D. and MAITLAND-VAN DER ZEE, A. H. (2020). Omics for the future in asthma. In Seminars in Immunopathology 42 111–126. Springer.
ASLAM, R., HERRLES, L., AOUN, R., PIOSKOWIK, A. and PIETRZYK, A. (2024). The Link between Gut Microbiota Dysbiosis and Childhood Asthma: Insights from a Systematic Review. Journal of Allergy and Clinical Immunology: Global 100289.
BAI, J. and LI, K. (2012). Statistical analysis of factor models of high dimension.
BAI, J. and NG, S. (2002). Determining the number of factors in approximate factor models. Econometrica 70 191–221.
BAROSOVA, R., BARANOVICOVA, E., HANUSRICHTEROVA, J. and MOKRA, D. (2023). Metabolomics in Animal Models of Bronchial Asthma and Its Translational Importance for Clinics. International Journal of Molecular Sciences 25 459.
BOULESTEIX, A.-L., DE BIN, R., JIANG, X. and FUCHS, M. (2017). IPF-LASSO: integrative L1-penalized regression with penalty factors for prediction based on multi-omics data. Computational and Mathematical Methods in Medicine 2017 7691937.
CASTEL, C., ZHAO, Z. and THORESEN, M. (2024). Comparison of the LASSO and Integrative LASSO with Penalty Factors (IPF-LASSO) methods for multi-omics data: Variable selection with Type I error control. arXiv preprint arXiv:2404.02594.
CHEN, K., DONG, H. and CHAN, K.-S. (2013). Reduced rank regression via adaptive nuclear norm penalization. Biometrika 100 901–920.
CHEN, L. and HUANG, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J. Am. Stat. Assoc. 107 1533–1545.
CHEN, C., WANG, J., PAN, D., WANG, X., XU, Y., YAN, J., WANG, L., YANG, X., YANG, M. and LIU, G.-P. (2023). Applications of multi-omics analysis in human diseases. MedComm 4 e315.
CHU, X., ZHANG, B., KOEKEN, V. A., GUPTA, M. K. and LI, Y. (2021). Multi-omics approaches in immunological research. Frontiers in Immunology 12 668045.
CHUNG, K. F. (2016). Asthma phenotyping: a necessity for improved therapeutic precision and new targeted therapies. Journal of Internal Medicine 279 192–204.
CLARK, C., DAYON, L., MASOODI, M., BOWMAN, G. L. and POPP, J. (2021). An integrative multi-omics approach reveals new central nervous system pathway alterations in Alzheimer's disease. Alzheimer's Research & Therapy 13 1–19.
COMHAIR, S. A., MCDUNN, J., BENNETT, C., FETTIG, J., ERZURUM, S. C. and KALHAN, S. C. (2015). Metabolomic endotype of asthma. The Journal of Immunology 195 643–650.
DING, D. Y., LI, S., NARASIMHAN, B. and TIBSHIRANI, R. (2022a). Cooperative learning for multiview analysis. Proceedings of the National Academy of Sciences 119 e2202113119.
DING, D. Y., LI, S., NARASIMHAN, B. and TIBSHIRANI, R. (2022b). Cooperative learning for multiview analysis. Proc. Natl. Acad. Sci. U. S. A. 119 e2202113119.
FAN, J., KE, Y. and WANG, K. (2020). Factor-adjusted regularized model selection. Journal of Econometrics 216 71–85.
FAN, J., LIAO, Y. and MINCHEVA, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society Series B: Statistical Methodology 75 603–680.
FAN, Y. and TANG, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society Series B: Statistical Methodology 75 531–552.
FIUZA, B. S. D., DE ANDRADE, C. M., MEIRELLES, P. M., DA SILVA, J. S., DE JESUS SILVA, M., SANTANA, C. V. N., PINHEIRO, G. P., MPAIRWE, H., COOPER, P., BROOKS, C. et al. (2024). Gut microbiome signature and nasal lavage inflammatory markers in young people with asthma. Journal of Allergy and Clinical Immunology: Global 3 100242.
GARG, M., KARPINSKI, M., MATELSKA, D., MIDDLETON, L., BURREN, O. S., HU, F., WHEELER, E., SMITH, K. R., FABRE, M. A., MITCHELL, J. et al. (2024). Disease prediction with multi-omics and biomarkers empowers case–control genetic discoveries in the UK Biobank. Nature Genetics 56 1821–1831.
GAUTAM, Y., JOHANSSON, E. and MERSHA, T. B. (2022). Multi-omics profiling approach to asthma: an evolving paradigm. Journal of Personalized Medicine 12 66.
GILLENWATER, L. A., HELMI, S., STENE, E., PRATTE, K. A., ZHUANG, Y., SCHUYLER, R. P., LANGE, L., CASTALDI, P. J., HERSH, C. P., BANAEI-KASHANI, F. et al. (2021). Multi-omics subtyping pipeline for chronic obstructive pulmonary disease. PLoS One 16 e0255337.
HUSSEIN, R., ABOU-SHANAB, A. M. and BADR, E. (2024). A multi-omics approach for biomarker discovery in neuroblastoma: a network-based framework. npj Systems Biology and Applications 10 52.
KLAU, S., JURINOVIC, V., HORNUNG, R., HEROLD, T. and BOULESTEIX, A.-L. (2018). Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinformatics 19 1–14.
KWON, Y., HAN, K., SUH, Y. J. and JUNG, I. (2023). Stability selection for LASSO with weights based on AUC. Scientific Reports 13 5207.
LI, Q. and LI, L. (2022). Integrative factor regression and its inference for multimodal data analysis. Journal of the American Statistical Association 117 2207–2221.
MADAPOOSI, S. S., CRUICKSHANK-QUINN, C., OPRON, K., ERB-DOWNWARD, J. R., BEGLEY, L. A., LI, G., BARJAKTAREVIC, I., BARR, R. G., COMELLAS, A. P., COUPER, D. J. et al. (2022). Lung microbiota and metabolites collectively associate with clinical outcomes in milder stage chronic obstructive pulmonary disease. American Journal of Respiratory and Critical Care Medicine 206 427–439.
MAHDAVINIA, M., FYOLEK, J. P., JIANG, J., THIVALAPILL, N., BILAVER, L. A., WARREN, C., FOX, S., NIMMAGADDA, S. R., NEWMARK, P. J., SHARMA, H. et al. (2023). Gut microbiome is associated with asthma and race in children with food allergy. Journal of Allergy and Clinical Immunology 152 1541–1549.
MENYHÁRT, O. and GYŐRFFY, B. (2021). Multi-omics approaches in cancer research with applications in tumor subtyping, prognosis, and diagnosis. Computational and Structural Biotechnology Journal 19 949–960.
NASIRI, E., BERAHMAND, K., ROSTAMI, M. and DABIRI, M. (2021). A novel link prediction algorithm for protein-protein interaction networks by attributed graph embedding. Computers in Biology and Medicine 137 104772.
OLIVIER, M., ASMIS, R., HAWKINS, G. A., HOWARD, T. D. and COX, L. A. (2019). The need for multi-omics biomarker signatures in precision medicine. International Journal of Molecular Sciences 20 4781.
RICHARDS, A. L., ECKHARDT, M. and KROGAN, N. J. (2021). Mass spectrometry-based protein–protein interaction networks for the study of human diseases. Molecular Systems Biology 17 e8792.
RUFF, W. E., GREILING, T. M. and KRIEGEL, M. A. (2020). Host–microbiota interactions in immune-mediated diseases. Nature Reviews Microbiology 18 521–538.
RYAN, E., JOYCE, S. A. and CLARKE, D. J. (2023). Membrane lipids from gut microbiome-associated bacteria as structural and signalling molecules. Microbiology 169 001315.
RYAN, E., GONZALEZ PASTOR, B., GETHINGS, L. A., CLARKE, D. J. and JOYCE, S. A. (2023). Lipidomic analysis reveals differences in Bacteroides species driven largely by plasmalogens, glycerophosphoinositols and certain sphingolipids. Metabolites 13 360.
SZKLARCZYK, D., KIRSCH, R., KOUTROULI, M., NASTOU, K., MEHRYARY, F., HACHILIF, R., GABLE, A. L., FANG, T., DONCHEVA, N. T., PYYSALO, S. et al. (2023). The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Research 51 D638–D646.
TAO, J.-L., WANG, S.-C., TIAN, M., LIANG, H., XIE, T., LIN, L.-L. and DAI, Q.-G. (2017). Metabonomics of syndrome markers in Infantile Bronchial Asthma Episode. Zhongguo Zhong xi yi jie he za zhi Zhongguo Zhongxiyi Jiehe Zazhi = Chinese Journal of Integrated Traditional and Western Medicine 37 319–325.
TAO, J.-L., CHEN, Y.-Z., DAI, Q.-G., TIAN, M., WANG, S.-C., SHAN, J.-J., JI, J.-J., LIN, L.-L., LI, W.-W. and YUAN, B. (2019). Urine metabolic profiles in paediatric asthma. Respirology 24 572–581.
UEMATSU, Y., FAN, Y., CHEN, K., LV, J. and LIN, W. (2019). SOFAR: Large-Scale Association Network Learning. IEEE Transactions on Information Theory 65 4924–4939. https://doi.org/10.1109/TIT.2019.2909889
YANG, B., YANG, R., XU, B., FU, J., QU, X., LI, L., DAI, M., TAN, C., CHEN, H. and WANG, X. (2021). miR-155 and miR-146a collectively regulate meningitic Escherichia coli infection-mediated neuroinflammatory responses. Journal of Neuroinflammation 18 114.
ZHANG, W., ZHANG, Y., LI, L., CHEN, R. and SHI, F. (2024). Unraveling heterogeneity and treatment of asthma through integrating multi-omics data. Frontiers in Allergy 5 1496392.
ZIMMERMANN, P., MESSINA, N., MOHN, W. W., FINLAY, B. B. and CURTIS, N. (2019). Association between the intestinal microbiota and allergic sensitization, eczema, and asthma: a systematic review. Journal of Allergy and Clinical Immunology 143 467–485.
