Multi-View_Orthogonal_Projection_Regression_with_A
2 Department of Internal Medicine, Division of Pulmonary and Critical Care Medicine, Department of Microbiology and
Immunology, University of Michigan Medical School, Ann Arbor, Michigan, c [email protected]
and Kriegel (2020) ). Emerging collaborations are using multi-omics data to advance the un-
derstanding of asthma. For instance, multi-omics integration has explored the disease’s het-
erogeneity and underlying pathology. This approach not only helps identify new patient strat-
ification, but also paves the way for personalized treatment strategies (Zhang et al. (2024)).
Lasso-based regression is a widely used approach for variable selection in multi-omics data.
For example, IPF-LASSO (Integrative LASSO with Penalty Factors) and Priority-Lasso are
both Lasso-based methods that account for the heterogeneity of multi-omics data by as-
signing distinct penalty and priority weights in the regression model (Klau et al. (2018);
Boulesteix et al. (2017)). However, some studies have found that the performance of these
models can be influenced when highly correlated predictors are present, as these methods
do not account for the within- and inter-modality correlations inherent in multi-omics data
(Castel, Zhao and Thoresen (2024)).
To illustrate this challenge, we measure the correlation between the microbiome and the metabolome in the CAARS dataset using a Pearson correlation heatmap. Microbiome data are aggregated to the family level and normalized with the centered log-ratio transformation. Metabolome data are centered and scaled. The heatmap shows that the metabolome data have stronger within-modality correlation than the microbiome data. Additionally, there are some negative inter-modality correlations between the microbiome and metabolome (Figure 1).
[Figure 1: Pearson correlation heatmap of the microbiome and metabolome features in the CAARS dataset (color scale from −0.5 to 1.0).]
The within- and inter-modality correlations pose significant challenges for variable selection in traditional regression models. These correlations arise from latent associations between omics layers and intrinsic dependencies within each individual omic. Network-based
approaches provide an effective way to capture these complex relationships, such as protein-
protein and protein-RNA interactions (Richards, Eckhardt and Krogan (2021); Nasiri et al.
A SAMPLE RUNNING HEAD TITLE 3
(2021); Szklarczyk et al. (2023)). Although these correlations can be quantified, they in-
troduce complications when modeling the relationship between multi-omics and a response
variable. Specifically, these strong correlations can lead to multicollinearity, making it diffi-
cult to distinguish the individual contributions of variables.
To make the problem concrete, suppose there are two modalities $M_1 \in \mathbb{R}^{n \times p_1}$ and $M_2 \in \mathbb{R}^{n \times p_2}$ that are associated with a response vector $Y$ through $\beta_1$ and $\beta_2$. A linear model can be constructed as follows:
\[ Y = M_1 \beta_1 + M_2 \beta_2 + \epsilon_1 . \]
The existence of within- and inter-modality correlations implies the following latent relationships for $M_1$ and $M_2$:
\[ \underbrace{M_1 = F\Lambda + U}_{\text{Within-Modality Correlation}} \quad \text{and} \quad \underbrace{M_2 = M_1 B + E}_{\text{Inter-Modality Correlation}} , \]
where F represents the latent factors in modality M1 with rank r , and F Λ captures the
low-rank structure of M1 , which is a commonly used approach to describe within-modality
correlation. For instance, factor-based models frequently employ it to represent dependencies
within predictors. The inter-modality correlation shows the M2 modality can be represented
by the linear combination of M1 through the coefficient matrix B with error term E . Standard
variable selection methods, such as Lasso regression, struggle in this setting because they will
arbitrarily select among correlated predictors, potentially leading to unstable variable selec-
tion. Addressing this issue requires a method that not only accounts for the relationships
between omics layers but also isolates the independent contributions of each omic.
The factor model has emerged as a powerful tool for correlated data by decomposing them
into latent structures comprising factors and idiosyncratic components. For instance, factor-
adjusted regularized regression can handle highly correlated data by identifying and re-
moving the low-rank structure from data and retaining the idiosyncratic components for
variable selection (Fan, Ke and Wang (2020)). Integrative Factor Regression is another factor decomposition-based model, designed for multi-modal datasets. It extracts modality-specific factors to account for the heterogeneity across modalities (Li and Li (2022)). However, factor-based models require the data to have an approximate latent factor structure and only
reduce correlations within individual modalities. Thus, correlations between modalities still
exist.
An alternative approach is cooperative learning, which employs an agreement penalty based
on contrastive learning. This method encourages different modalities to contribute similarly.
By varying the hyperparameter of the agreement penalty, the solutions for this method include
the early and late fusion approaches for multi-modal data, providing robust performance across different settings (Ding et al. (2022a)). However, this method ignores the inherent correlations between the modalities and forces the modalities to align their contributions. In multi-omics scenarios, this assumption may not always hold, as different omics layers can
have distinct influences on the response variable. For example, miR-155 and miR-146a are
well-known miRNAs that can suppress E. coli-induced inflammatory responses in neuroin-
flammation. This example suggests that the host transcriptome may have an opposing effect
compared to the microbiome, leading to potential contradictions between the molecular sig-
nals originating from the host and those from the microbiota (Yang et al. (2021)).
To address these challenges, we introduce a novel Multi-View Orthogonal Projection Regression (MVOPR) for variable selection in multi-omics data. Unlike existing methods that impose specific structural assumptions on the data, our model leverages unidirectional associations among different omics to mitigate the correlations. Our approach is inspired by the
Central Dogma of Molecular Biology, which states that DNA is transcribed into RNA and RNA is translated into protein, while the reverse flow of information from protein does not occur. For instance, once a protein is synthesized, it cannot alter its original RNA template. This inherent directionality in
molecular interactions suggests that multi-omics relationships can be represented by a di-
rected graph (digraph) with unidirectional pathways. Building on this biological insight, our
method accounts for the dependencies by removing redundant correlations in a structured
manner. Specifically, we employ an orthogonal projection framework that sequentially removes the effects of upstream omics layers on downstream ones. By transforming the original
multi-omics data into an uncorrelated feature space, our approach ensures that variable se-
lection methods, such as penalized regression, operate on independent components free from
both within and across modality correlations. This enables MVOPR to overcome the limita-
tions of standard Lasso-based approaches, noted above.
In this study, we demonstrate the effectiveness of MVOPR for multi-omics variable selec-
tion through both theoretical analysis and empirical validation. Our simulations and real-data
analysis reveal that MVOPR consistently outperforms existing methods. We also show that
when inter-modality correlation exists, factor-based models run into several problems that MVOPR avoids. Importantly, even in cases where cross-modality correlations are absent,
MVOPR remains robust and performs comparably to standard Lasso regression, demonstrat-
ing its adaptability across different correlation structures. By incorporating biological direc-
tion assumptions, our approach not only enhances variable selection performance but also
aligns with the natural structure of molecular data, offering a robust framework for integra-
tive multi-omics analysis.
The rest of the article is organized as follows. In Section 2, we present the MVOPR frame-
work for both the two-modality and multiple-modality scenarios, followed by an introduc-
tion to three related methods for multi-modal data analysis. Section 3 provides a comparative
analysis of MVOPR against other competing methods under various settings. In Section 4,
we apply MVOPR to the CAARS dataset and evaluate its performance relative to alternative
approaches.
2. Methodology.
2.1. MVOPR for Two Modalities. Suppose we have two modalities, M1 ∈ Rn×p and
M2 ∈ Rn×q . Let Y be the response vector of length n, assumed to be associated with M1 and
M2 through regression coefficients β1 ∈ Rp and β2 ∈ Rq . The relationship is modeled as:
(1) Y = M1 β1 + M2 β2 + ϵ1 ,
where ϵ1 is an error term assumed to be uncorrelated with M1 and M2 , with E(ϵ1 ) = 0 and
Var(ϵ1 ) = σϵ21 I . We further assume that M2 is influenced by M1 through a low-rank coef-
ficient matrix B ∈ Rp×q of rank r , with an error component E ∈ Rn×q that is uncorrelated
with M1 B and ϵ1 :
(2) M2 = M1 B + E.
Using this inter-modality correlation (2), we can reformulate Model (1) as:
(3) Y = M1 (β1 + Bβ2 ) + Eβ2 + ϵ1 .
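For completeness, the step from Models (1) and (2) to Model (3) is a direct substitution. The short derivation below uses only the symbols already defined in this section.

```latex
\begin{align*}
Y &= M_1\beta_1 + M_2\beta_2 + \epsilon_1 \\
  &= M_1\beta_1 + (M_1 B + E)\beta_2 + \epsilon_1 \\
  &= M_1(\beta_1 + B\beta_2) + E\beta_2 + \epsilon_1 .
\end{align*}
```

Read this way, the coefficient attached to M1 is the sum β1 + Bβ2, so the part of β2 that acts through M1 B is mixed with β1, and β2 itself enters separately only through E.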
If E is small, the model becomes almost unidentifiable, which affects the selection of the
variables for M2 . To handle this issue, we aim to remove the associated component M1 B
from M2 while retaining only the uncorrelated part E .
Let U ΣV T be the singular value decomposition (SVD) of M1 B , where Ur consists of the
first r left singular vectors of U . Substituting this decomposition into Model (3) gives:
(4) Y = M1 β1 + Eβ2 + Ur γ1 + ϵ1 ,
2.2. MVOPR for Multiple Modalities. Extending the model to three modalities, let M1 ∈
Rn×p1 , M2 ∈ Rn×p2 , and M3 ∈ Rn×p3 , with a known hierarchical dependency:
M2 = M1 B2,1 + E2 ,
M3 = M1 B3,1 + M2 B3,2 + E3 .
where E2 ∈ Rn×p2 and E3 ∈ Rn×p3 are independent error components, and B2,1 , B3,1 , and
B3,2 are low-rank coefficient matrices with ranks r1 , r2 , and r3 , respectively.
The response Y is modeled as:
(6) Y = M1 β1 + M2 β2 + M3 β3 + ϵ1 .
To remove the associated components, define two projection matrices, P1 = U1 U1^T and P2 = U2 U2^T, where U1 and U2 contain the left singular vectors of M1 B1′ and E2 B3,2, respectively, and B1′ = (B2,1 , B3,1 ) is the concatenation of B2,1 and B3,1 . The transformed modalities are:
M1∗ = (I − P1 )M1 , M2∗ = (I − P2 )E2 , M3∗ = E3 .
Transforming the modalities in model (6), the MVOPR model for three modalities is:
(7) Y = M1∗ β1 + M2∗ β2 + M3∗ β3 + U1 γ1 + U2 γ2 + ϵ1 .
where γ1 and γ2 are nuisance parameters. The detailed derivations are provided in Supple-
mentary Appendix A.
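The three-modality transformation can be sketched numerically as follows. This is a minimal illustration that assumes the coefficient matrices and residuals are known; the function names, toy dimensions, and the tolerance used to decide which singular values count as non-zero are hypothetical choices, not part of the original method description.

```python
import numpy as np

def column_space_basis(A, tol=1e-10):
    """Left singular vectors of A that belong to non-zero singular values."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, s > tol * s.max()]

def transform_three_modalities(M1, E2, E3, B21, B31, B32):
    """Return (M1*, M2*, M3*, U1, U2) following the definitions in Section 2.2."""
    U1 = column_space_basis(M1 @ np.hstack([B21, B31]))  # basis of M1 B1'
    U2 = column_space_basis(E2 @ B32)                     # basis of E2 B3,2
    M1_star = M1 - U1 @ (U1.T @ M1)                       # (I - P1) M1
    M2_star = E2 - U2 @ (U2.T @ E2)                       # (I - P2) E2
    return M1_star, M2_star, E3, U1, U2

# Toy data (hypothetical sizes) with rank-1 coefficient matrices.
rng = np.random.default_rng(0)
n, p1, p2, p3 = 50, 6, 5, 4
M1 = rng.standard_normal((n, p1))
B21 = np.outer(rng.standard_normal(p1), rng.standard_normal(p2))
B31 = np.outer(rng.standard_normal(p1), rng.standard_normal(p3))
B32 = np.outer(rng.standard_normal(p2), rng.standard_normal(p3))
E2 = rng.standard_normal((n, p2))
E3 = rng.standard_normal((n, p3))

M1_star, M2_star, M3_star, U1, U2 = transform_three_modalities(
    M1, E2, E3, B21, B31, B32
)
```

By construction, the columns of M1_star are orthogonal to U1 and the columns of M2_star are orthogonal to U2, which is the property the nuisance terms U1 γ1 and U2 γ2 in Model (7) rely on.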
For more than three modalities, the transformation follows a similar procedure. Suppose
there are k modalities with features p1 , p2 , . . . , pk . If each modality Mj (for j = 2, 3, . . . , k )
depends only on previous modalities:
\[ M_j = \sum_{i=1}^{j-1} M_i B_{j,i} + E_j , \]
2.3. Related methods. To evaluate the relative performance of MVOPR, we consider sev-
eral alternative models for multi-modality data. Specifically, we compare our method against
Cooperative Regularized Linear Regression (Cooperative; Ding et al. (2022b)), Integrative Factor Regression (IntegFactor; Li and Li (2022)), and Factor-Adjusted Regularized Regression (Factor; Fan, Ke and Wang (2020)).
2.4. Estimation. To fit model (2), we first need to estimate the coefficient matrix B that captures the relationship between M1 and M2 . Several well-established reduced-rank regression methods can be used for this estimation. For instance, row-sparse reduced-rank regression (Chen and Huang (2012)), sparse orthogonal factor regression (Uematsu et al. (2019)), and multivariate reduced-rank linear regression (Chen, Dong and Chan (2013)) impose different sparsity assumptions when estimating B̂ . In general, the reduced-rank regression problem can be formulated as the following optimization problem:
\[ \min_{U, D, V} \; \|M_2 - M_1 U D V^T\|_F^2 + \lambda_1 \|D\|_1 + \lambda_2 \rho_a(UD) + \lambda_3 \rho_b(VD) \quad \text{s.t.} \quad U^T U = I, \; V^T V = I, \; B = U D V^T , \tag{15} \]
where the constraints U^T U = I and V^T V = I are introduced for identifiability, and ρa and ρb are penalty functions, which can be the entry-wise L1 norm or the row-wise L2,1 norm. λ1 , λ2 , λ3 are tuning parameters that control the strength of regularization. This framework generalizes several well-known methods: Row-sparse Reduced-Rank Regression when λ1 = λ3 = 0 and ρa = ∥ · ∥2,1 ; Multivariate Reduced-Rank Linear Regression when λ1 = λ2 = λ3 = 0; and Sparse Orthogonal Factor Regression when all tuning parameters are nonzero. The tuning parameters and the rank r are chosen based on the GIC (Fan and Tang (2013)). From the fitted model, we obtain the coefficient matrix B̂ and the residual term Ê .
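As an illustration of the unpenalized special case (λ1 = λ2 = λ3 = 0), the sketch below computes the classical rank-constrained least-squares fit by truncating the SVD of the ordinary least-squares fitted values. It is a simplified stand-in, not the sparse estimators cited above: the function name is hypothetical, the rank r is supplied by hand rather than chosen by GIC, and no sparsity penalty is applied.

```python
import numpy as np

def reduced_rank_regression(M1, M2, r):
    """Rank-r least-squares fit of M2 on M1 (identity weighting, no sparsity)."""
    B_ols, *_ = np.linalg.lstsq(M1, M2, rcond=None)   # unconstrained fit
    fitted = M1 @ B_ols
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    V_r = Vt[:r].T                                    # top-r right singular vectors
    B_hat = B_ols @ V_r @ V_r.T                       # project onto the rank-r subspace
    E_hat = M2 - M1 @ B_hat                           # residual term
    return B_hat, E_hat

# Toy example with hypothetical dimensions and a true rank of 2.
rng = np.random.default_rng(0)
n, p, q, r = 200, 10, 8, 2
M1 = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
M2 = M1 @ B_true + 0.1 * rng.standard_normal((n, q))

B_hat, E_hat = reduced_rank_regression(M1, M2, r)
rel_error = np.linalg.norm(B_hat - B_true) / np.linalg.norm(B_true)
```

In a high-dimensional setting such as the simulations in Section 3, a penalized estimator from the family in (15) would presumably be needed instead, since the plain least-squares step is not well determined when p exceeds n.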
Next, we estimate the projection matrix P as P̂ = Ur Ur^T , where Ur contains the first r left singular vectors of M1 B̂ . The transformed M1 and M2 can then be estimated following the previous procedures. Once the transformed matrices are obtained, we estimate β̂1 and β̂2 by solving the following penalized least squares problem:
\[ \min_{\beta_1, \beta_2, \gamma_2} \; \|Y - \hat{M}_1^* \beta_1 - \hat{M}_2^* \beta_2 - U_r \gamma_2\|^2 + \lambda\rho(\beta_1) + \lambda\rho(\beta_2) , \tag{16} \]
where ρ is a generic penalty function, such as the L1 norm, adaptive Lasso, MCP, or SCAD penalty, and λ is a tuning parameter that controls the regularization strength on both β1 and β2 .
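The projection and penalized fit can be sketched as below, under several assumptions that go beyond the text: the L1 penalty is used for ρ, the objective is scaled by 1/2 (which only rescales λ), a simple proximal-gradient (ISTA) solver stands in for whatever solver is used in practice, and the transformed second modality is taken to be the residual Ê, mirroring M3∗ = E3 in the three-modality case. The unpenalized nuisance coefficient is profiled out by projecting the response and the design onto the orthogonal complement of Ur, which is equivalent to minimizing over it jointly. All function names are hypothetical, and the toy example plugs in the true B and E where estimates would normally be used.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=3000):
    """Minimise 0.5 * ||y - X b||^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = soft_threshold(b - step * (X.T @ (X @ b - y)), step * lam)
    return b

def mvopr_two_modalities(Y, M1, B_hat, E_hat, r, lam):
    """Sketch of the two-modality fit: project, stack, and run a Lasso."""
    U, _, _ = np.linalg.svd(M1 @ B_hat, full_matrices=False)
    U_r = U[:, :r]                               # first r left singular vectors

    def project_out(A):
        return A - U_r @ (U_r.T @ A)             # applies (I - P_hat)

    X = np.hstack([project_out(M1), E_hat])      # [M1*, M2*], with M2* = E_hat (assumption)
    # Profiling out the unpenalised nuisance term U_r * gamma is the same as
    # regressing the projected response on the projected design.
    beta = lasso_ista(project_out(X), project_out(Y), lam)
    p = M1.shape[1]
    return beta[:p], beta[p:]

# Toy example (hypothetical sizes); true B and E are used in place of estimates.
rng = np.random.default_rng(1)
n, p, q = 150, 8, 6
M1 = rng.standard_normal((n, p))
a = np.zeros(p)
a[-2:] = [1.0, -1.0]                             # row-sparse: two non-zero rows in B
B = np.outer(a, rng.standard_normal(q))          # rank-1 coefficient matrix
E = rng.standard_normal((n, q))
M2 = M1 @ B + E
beta1 = np.zeros(p)
beta1[0] = 2.0
beta2 = np.zeros(q)
beta2[0] = 1.5
Y = M1 @ beta1 + M2 @ beta2 + 0.1 * rng.standard_normal(n)

b1_hat, b2_hat = mvopr_two_modalities(Y, M1, B, E, r=1, lam=20.0)
```

In this toy setting the two truly active coefficients should remain clearly non-zero (shrunk slightly by the penalty), while the remaining coefficients are driven to zero.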
For MVOPR with three modalities, the estimation of β1 , β2 , and β3 follows a similar penalized least squares approach:
\[ \min_{\beta_1, \beta_2, \beta_3, \gamma_1, \gamma_2} \; \|Y - \hat{M}_1^* \beta_1 - \hat{M}_2^* \beta_2 - \hat{M}_3^* \beta_3 - U_1 \gamma_1 - U_2 \gamma_2\|^2 + \lambda\rho(\beta_1) + \lambda\rho(\beta_2) + \lambda\rho(\beta_3) , \]
where M̂1∗ = (I − P̂1 )M1 , M̂2∗ = (I − P̂2 )Ê2 , and M̂3∗ = Ê3 . U1 and U2 are the left singular vectors with non-zero singular values of M1 (B̂2,1 , B̂3,1 ) and Ê2 B̂3,2 , respectively.
3. Numerical analysis.
3.1.1. E with identity covariance matrix. To assess the performance of MVOPR in comparison to other methods, we carry out simulations under different noise levels on ϵ1 and ϵ2 . In these simulations, there are two modalities M1 and M2 , each with 300 features and 200 observations. M1 is generated from the multivariate normal distribution MVN(0p , ΣM1 ) with identity covariance matrix. M2 is connected with M1 through a low-rank, row-sparse coefficient matrix B of rank r = 1 in which 95% of the rows are zero. The response Y is associated with both M1 and M2 through β1 and β2 , each generated with 290 zero and 10 non-zero coefficients. The values of the non-zero coefficients are sampled from the uniform distribution U (1, 2). We fix the signal-to-noise ratio (SNR) for ϵ1 at 100. By varying the SNR of ϵ2 , we compare the variable selection performance of each model by AUC. Each AUC is calculated over a grid of 100 values of λ, which controls the strength of sparsity. We repeat each simulation experiment 100 times for each SNR of ϵ2 .
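A data-generating sketch for this design is given below. The text does not spell out how the SNR is defined or how the non-zero entries of B are drawn, so two assumptions are made and flagged in the comments: the SNR is taken as the ratio of signal variance to noise variance, and the non-zero block of B is built from standard normal factors. The function name and its arguments are hypothetical.

```python
import numpy as np

def simulate_two_modalities(n=200, p=300, q=300, r=1, zero_row_frac=0.95,
                            n_nonzero=10, snr_y=100.0, snr_e=5.0, seed=0):
    rng = np.random.default_rng(seed)
    M1 = rng.standard_normal((n, p))                  # identity covariance

    # Row-sparse, rank-r coefficient matrix B (ASSUMPTION: normal factors).
    n_active = p - int(round(zero_row_frac * p))
    active_rows = rng.choice(p, size=n_active, replace=False)
    B = np.zeros((p, q))
    B[active_rows] = rng.standard_normal((n_active, r)) @ rng.standard_normal((r, q))

    # ASSUMPTION: SNR = var(signal) / var(noise).
    signal_m2 = M1 @ B
    E = np.sqrt(signal_m2.var() / snr_e) * rng.standard_normal((n, q))
    M2 = signal_m2 + E

    def sparse_beta(dim):
        beta = np.zeros(dim)
        idx = rng.choice(dim, size=n_nonzero, replace=False)
        beta[idx] = rng.uniform(1.0, 2.0, size=n_nonzero)   # U(1, 2) magnitudes
        return beta

    beta1, beta2 = sparse_beta(p), sparse_beta(q)
    signal_y = M1 @ beta1 + M2 @ beta2
    Y = signal_y + np.sqrt(signal_y.var() / snr_y) * rng.standard_normal(n)
    return {"M1": M1, "M2": M2, "Y": Y, "B": B, "E": E,
            "beta1": beta1, "beta2": beta2}

sim = simulate_two_modalities(seed=1)
```

Varying snr_e over a grid while keeping snr_y fixed at 100 would correspond to the sweep over the SNR of ϵ2 described above.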
[Figure 2: AUC (Value) versus the SNR of ϵ2 (5 to 10) in three panels for Adaptive Lasso, Cooperative (rho=0.5), Cooperative (rho=1), Factor, IntegFactor, and MVOPR.]
MVOPR outperforms the other methods in terms of AUC across the entire range of SNR values (Figure 2). Factor-Adjusted Regularized Regression performs comparably to MVOPR when the SNR is low. However, as the SNR increases, which corresponds to stronger correlation between M1 and M2 , MVOPR shows a clear advantage over Factor-Adjusted Regularized Regression. In particular, MVOPR has evident benefits for variable selection in M2 when the SNR is large. This shows that MVOPR integrates information across modalities well enough to maintain high AUC values even under these more challenging conditions. In contrast, Factor-Adjusted Regularized Regression appears to struggle under high-SNR conditions, likely due to its reliance on factor decomposition. Moreover, the other competing methods, Integrative Factor Regression and cooperative learning, show declining performance as the SNR increases. These methods appear to be less robust when faced with strong inter-modality correlations, highlighting the advantage of MVOPR in such scenarios.
Two additional simulations are designed to show that factor-based models may not be ideal for handling inter-modality correlations. In the first simulation, M1 and M2 are generated from multivariate normal distributions with identity covariance matrices and have 50 and 300 features, respectively. Each has 200 samples. M2 is connected with M1 through a low-rank, row-sparse coefficient matrix B of rank r = 9 in which 70% of the rows are zero. β1 and β2 each have 10 non-zero coefficients. The second simulation showcases the performance of MVOPR on low-dimensional data. M1 and M2 are generated from multivariate normal distributions with identity covariance matrices, each with 200 samples and 50 features. The low-rank, row-sparse coefficient matrix B has rank r = 3, and 50% of its rows are zero. β1 and β2 each have 25 non-zero coefficients. The SNR for ϵ1 is fixed at 100 in both simulations.
[Plot omitted: three panels of AUC (Value) versus SNR for Adaptive Lasso, Cooperative (rho=0.5), Cooperative (rho=1), Factor, IntegFactor, and MVOPR.]
FIG 3. The AUC for each model when varying the SNR of ϵ2 from 20 to 30, where M1 and M2 have different numbers of features.
[Plot omitted: three panels of AUC (Value) versus SNR.]
FIG 4. The AUC for each model when varying the SNR of ϵ2 from 15 to 20, where M1 and M2 each have 50 features.
3.2.1. E with correlated structure. In real-world settings, E may not always have an independent covariance structure. To verify whether MVOPR still works under this misspecified case, we consider two covariance patterns: auto-regressive (AR1) and compound symmetry (CS). In the simulations below, we generate two modalities M1 and M2 , each with 300 features and 200 observations. M1 is generated from MVN(0p , ΣM1 ) with identity covariance matrix. M2 is associated with M1 through a low-rank, row-sparse coefficient matrix B of rank r = 1 in which 50% of the rows are zero. The response Y is associated with both M1 and M2 through β1 and β2 . β1 and β2 are generated with 90 zeros and 10 non-zero coefficients. The absolute values of the non-zero coefficients are sampled from the uniform distribution U (1, 2). The SNRs for ϵ1 and ϵ2 are fixed at 3 and 5, respectively. We compare the variable selection performance of each model by AUC. In the AR1 case, E is generated from MVN(0q , Σρ ), where the diagonal elements of Σρ are 1 and $\mathrm{cov}(\epsilon_2^{i}, \epsilon_2^{j}) = \rho^{|i-j|}$. We include the conditions ρ = 0.9 and ρ = 0.95. The results are shown in Figure 5.A and Figure 5.B. Under this misspecified case, MVOPR still achieves a higher AUC than the other methods. In the compound symmetry case, E is generated from MVN(0q , Σµ ), where the diagonal elements of Σµ are 1 and $\mathrm{cov}(\epsilon_2^{i}, \epsilon_2^{j}) = \mu$ for i ≠ j. We test the performance of each model under µ = 0.7 and µ = 0.9. The results are shown in Figure 5.C and Figure 5.D. Under this condition, both MVOPR and the factor-based models perform well compared to adaptive Lasso.
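The two covariance patterns can be constructed as in the following sketch. The function names are hypothetical, and sampling through a Cholesky factor is just one convenient way to draw rows with the stated covariance.

```python
import numpy as np

def ar1_cov(q, rho):
    """AR(1) covariance: entry (i, j) equals rho ** |i - j|."""
    idx = np.arange(q)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def cs_cov(q, mu):
    """Compound symmetry: 1 on the diagonal, mu everywhere else."""
    return (1.0 - mu) * np.eye(q) + mu * np.ones((q, q))

def sample_errors(n, cov, rng):
    """Draw n rows from MVN(0, cov) using the Cholesky factor of cov."""
    L = np.linalg.cholesky(cov)
    return rng.standard_normal((n, cov.shape[0])) @ L.T

rng = np.random.default_rng(0)
E_ar1 = sample_errors(200, ar1_cov(300, 0.9), rng)   # AR1 case with rho = 0.9
E_cs = sample_errors(200, cs_cov(300, 0.7), rng)     # CS case with mu = 0.7
```

Either matrix can replace the independent error term E when generating M2 = M1 B + E.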
[Plot omitted: panels A to D, each showing Total AUC, Beta1 AUC, and Beta2 AUC versus SNR (2 to 6).]
FIG 5. The AUC of each model when ϵ2 has a correlated structure. Fig. 5.A-B show the AUC for each model under the Auto-Regressive (AR1) covariance pattern. Fig. 5.C-D show the AUC for each model under the Compound Symmetry (CS) covariance pattern.
Based on the results in Figure 6 and Figure 7, the four models perform similarly under ΣM1,2 = I . This means that even when the unidirectional assumption does not hold, MVOPR still works and performs similarly to the other methods. However, if ΣM1,2 follows an auto-regressive (ρ = 0.9) covariance pattern, the factor-based models perform worse than adaptive Lasso and MVOPR. This may be attributed to the covariance structure of each modality, as the absence of spiked eigenvalues hinders the effectiveness of factor decomposition.
[Plot omitted: three panels of AUC (value) versus Var2 (2 to 6).]
FIG 6. The AUC of each model when both M1 and M2 have diagonal covariance matrices, without the M2 = M1 B assumption.
[Plot omitted: three panels of AUC (value) versus Var2 (2 to 6).]
FIG 7. The AUC of each model when both M1 and M2 have autoregressive (ρ = 0.9) covariance matrices, without the M2 = M1 B assumption.
[Plot omitted: panels (A) and (B), each showing AUC distributions for Overall, M1, M2, and M3 across AdapLasso, Cooperative (rho=0.5), Cooperative (rho=1), Factor, IntegFactor, and MVOPR.]
FIG 8. AUC of the variable selection in M1 , M2 , and M3 . (A) The AUC distributions of MVOPR and other methods when E2 and E3 have identity covariance; (B) The AUC distributions of MVOPR and other methods when E2 and E3 have AR1 covariance.
Based on Figure 8, MVOPR outperforms the other methods in terms of overall AUC, AUC in M2 , and AUC in M3 . MVOPR achieves the highest mean AUC values in these categories, indicating its superior performance in multi-modal integration. Among the competing methods, the factor-based models show some improvement over adaptive Lasso in overall AUC and in AUC for M2 and M3 . However, their performance is worse than that of MVOPR. Specifically, Integrative Factor Regression exhibits a notable decline in performance when E2 and E3 have a correlated covariance structure. This result suggests that factor-based models may struggle to capture the intricate inter-modality correlations. The cooperative learning models perform worse than adaptive Lasso, indicating that the agreement penalty may not always benefit variable selection in these settings. The simulation results demonstrate the robustness of MVOPR in handling complex multi-omics settings. Even in misspecified scenarios, MVOPR matches or outperforms the competing methods, demonstrating its reliability in capturing intricate correlations.
4.1. CAARS Data Analysis. We apply the MVOPR model to the CAARS data, collected from 55 patients. This dataset contains two omics layers: microbiome and metabolome. The study aims to understand how the omics data influence the continuous eosinophil count. To reduce the dimensionality of the microbiome and metabolome data, we retain only the 200 metabolites with the highest variance, and the 139 microbiome features are aggregated into 31 families. We then normalize the microbiome data by the centered log-ratio transformation, and the metabolome data are centered and scaled. We use the square root of the continuous eosinophil count as the response.
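The preprocessing steps can be sketched as follows. The pseudocount added before taking logarithms is an assumption made here to handle zero counts (the text does not say how zeros are treated), and the function names and toy arrays are hypothetical stand-ins for the CAARS data.

```python
import numpy as np

def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform per sample (row).
    ASSUMPTION: a pseudocount is added so that zero counts have a finite log."""
    log_x = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return log_x - log_x.mean(axis=1, keepdims=True)

def top_variance_columns(X, k):
    """Indices of the k columns of X with the largest sample variance."""
    variances = np.asarray(X, dtype=float).var(axis=0, ddof=1)
    return np.argsort(variances)[::-1][:k]

def center_and_scale(X):
    """Center each column and scale it to unit sample standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Toy stand-ins for family-level counts and metabolite intensities.
rng = np.random.default_rng(0)
family_counts = rng.poisson(5, size=(30, 6))
metabolites = rng.standard_normal((30, 4)) * np.array([1.0, 10.0, 0.1, 5.0])

microbiome_clr = clr_transform(family_counts)
keep = top_variance_columns(metabolites, k=2)
metabolome_scaled = center_and_scale(metabolites[:, keep])
```

With the real data, k would be 200 and the count matrix would hold the 31 family-level abundances.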
Multivariate reduced-rank regression is applied to estimate the coefficient matrix B̂ , with the metabolome as the response and the microbiome as the predictor. The original omics datasets are transformed based on B̂ and the residuals Ê . To analyze the association between the square root of the continuous eosinophil count and the transformed data, the L1 penalty is used for variable selection. We use leave-one-out resampling to assess the robustness of variable selection. Out-of-sample MSE and stability indicators are used to quantify the performance of each model. A feature is considered selected if it has a non-zero coefficient in at least 85% of the iterations. The stability indicators are defined as follows: let Si denote the set of variables selected by the model in the ith leave-one-out iteration. The sets Si and Sj , with i ≠ j , are paired to calculate the stability. We use three stability indicators: the Jaccard similarity coefficient, the Otsuka–Ochiai coefficient, and the Sørensen–Dice coefficient (Kwon et al. (2023)).
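The three stability indicators and the 85% selection rule can be computed as in the sketch below. Averaging each index over all pairs of leave-one-out selections is an assumption about how the pairwise values are summarized, as is the convention of returning 1.0 when both sets are empty; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def jaccard(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def ochiai(a, b):
    denom = sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else float(not a and not b)

def dice(a, b):
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0

def mean_pairwise_stability(selected_sets, index):
    """Average a set-similarity index over all pairs (S_i, S_j) with i != j."""
    pairs = list(combinations(selected_sets, 2))
    return sum(index(a, b) for a, b in pairs) / len(pairs)

def stable_features(selected_sets, threshold=0.85):
    """Features selected in at least `threshold` of the iterations."""
    counts = Counter(f for s in selected_sets for f in s)
    return {f for f, c in counts.items() if c / len(selected_sets) >= threshold}

# Three hypothetical leave-one-out selections.
selections = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
jaccard_stability = mean_pairwise_stability(selections, jaccard)
ochiai_stability = mean_pairwise_stability(selections, ochiai)
dice_stability = mean_pairwise_stability(selections, dice)
final_features = stable_features(selections)
```

All three indices range from 0 to 1, with higher values indicating that the selected sets overlap more across iterations.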
[Plot omitted: selection frequency and coefficient estimates plotted against Feature ID (0 to 200).]
FIG 9. MVOPR for CAARS Data Analysis. (A) Selection frequency for Microbiome (ID: 1-31) and Metabolome (ID: 32-231); (B) Confidence Intervals for Coefficient Estimates; (C) Pearson Correlation Matrix between Selected Microbiome and Metabolome.
[Plot omitted: heatmap with the microbiome family Bacteroidaceae on the Microbiome axis and the selected metabolites on the Metabolome axis; color scale from −0.5 to 0.5.]
FIG 10. Pearson Correlation Matrix between Selected Microbiome and Metabolome within MVOPR
leucine levels have been reported in asthmatic individuals with elevated exhaled nitric oxide (FeNO > 35), a biomarker indicative of eosinophil-driven inflammation (Comhair et al. (2015)). This suggests that leucine may play a role in asthma pathophysiology, particularly
in individuals with active eosinophilic reaction. The interplay between Bacteroides and these
metabolites is further supported by lipidomic analyses. Notably, differences in lipid profiles
among Bacteroides are largely driven by variations in plasmalogens, glycerophosphoinosi-
tols, and certain sphingolipids. These lipidomic distinctions may influence immune responses
and inflammation, providing insight into the mechanisms by which Bacteroides species contribute to asthma pathogenesis (Ryan, Joyce and Clarke (2023); Ryan et al. (2023)). Overall, MVOPR demonstrates strong performance in real data analysis, effectively identifying
key microbial and metabolic features associated with eosinophilic inflammation in asthma.
By selecting the Bacteroidaceae family and relevant metabolites, MVOPR aligns well with
established biological findings, highlighting its ability to capture meaningful microbiome-
metabolome interactions. These results highlight the robustness and reliability of MVOPR in
modeling complex multi-omics relationships, making it a powerful tool to uncover biomark-
ers in asthma.
5. Discussion. The MVOPR model presents a novel approach for multi-omics data in-
tegration by using the orthogonal projection framework to handle the correlated predictors,
enhancing variable selection. Our model is effective under the unidirectional assumption,
aligning well with the inherent biological pathways such as the Central Dogma of Molecu-
lar Biology. Traditional methods, such as Lasso-based regression and factor-based models,
struggle in multi-omics settings due to the strong within- and inter-modality correlations.
MVOPR effectively addresses these challenges by leveraging the unidirectional assumptions
between omics layers and employing an orthogonal projection framework to mitigate multi-
collinearity problems.
Based on the results from simulations and real data analysis, MVOPR showcases superior
performance over other competing methods. Unlike factor-based models, which require an
approximate factor structure on predictors, MVOPR successfully eliminates redundant de-
pendencies while preserving meaningful signals for variable selection. Even in scenarios
where the inter-modality correlation assumption is violated, MVOPR maintains competitive performance, matching or outperforming the other methods. This suggests that MVOPR generalizes well beyond ideal conditions, making it a reliable tool for real-world applications.
However, in cases where the model is severely misspecified, such as incorrectly assuming
directionality, performance of MVOPR can be affected. For instance, if the true causal direc-
tion is from M1 to M2 , but a model with reverse direction is fitted (M2 to M1 ), the estimated
coefficient matrix B̂ may not be well-constructed, leading to poor projections and inaccurate
variable selection. To ensure proper unidirectional modeling, a strong understanding of the
latent relationships between modalities is crucial. This can be established through biological
knowledge, such as the Central Dogma of Molecular Biology, or causal inference that helps
to determine the correct directionality before model fitting.
In the current analysis, MVOPR operates within a linear regression framework. However, some biological systems are inherently nonlinear and hierarchical, often involving complex interactions between different omics layers. Future extensions of MVOPR could incorporate nonlinear models, such as kernel-based methods or deep-learning approaches, to capture these intricate dependencies more effectively.
When applying MVOPR to the CAARS dataset, we successfully identify microbial and metabolic markers linked to eosinophilic inflammation in asthma. Notably, MVOPR selects some biomarkers that align with prior research. Compared to competing approaches, MVOPR demonstrates higher model stability and lower mean squared error (MSE) in the real-data analysis. Traditional methods such as Lasso regression and factor-based models failed to maintain consistent variable selection across iterations. In contrast, MVOPR achieved higher stability indicators (Jaccard, Otsuka–Ochiai, and Sørensen–Dice coefficients), suggesting improved stability in biomarker identification.
MVOPR represents an advancement in multi-omics variable selection, providing a robust, interpretable, and biologically relevant framework for multi-view data integration. By successfully mitigating within- and inter-modality correlations, MVOPR allows for more precise biomarker discovery, particularly in complex diseases such as asthma. As multi-omics datasets continue to grow, MVOPR offers a powerful and stable method for integrative analysis, providing a novel framework for personalized medicine and targeted therapeutic strategies.
Acknowledgments. The authors would like to thank the anonymous referees, an Asso-
ciate Editor and the Editor for their constructive comments that improved the quality of this
paper.
A.2. Multiple modalities case. For multi-omics data with k modalities, we need to determine the order of the modalities. Based on the Central Dogma of Molecular Biology, genomics will generally serve as the first modality, which can influence all the downstream elements. Proteomics or metabolomics may serve as the last modality, which can be regulated by upstream elements. Any omics layer between the first and last modalities, such as the transcriptome, will serve as an intermediate modality. Once the ordering of the omics layers is determined, we first transform each modality except the first into its residual form:
\[ Y = M_1 \beta_1 + \sum_{j=2}^{k} E_j \beta_j + M_1 B_1^* \gamma_1 + \sum_{i=2}^{k-1} E_i B_i^* \gamma_i + \epsilon_1 , \]
where B1∗ = (B2,1 , B3,1 , ..., Bk,1 ). For any 2 ≤ i ≤ k − 1, Bi∗ = (Bi+1,i , Bi+2,i , ..., Bk,i ).
To remove the correlation between predictors and nuisance variables, we next project each
predictors to the orthogonal subspace. Suppose we have SVD for each nuisance variable:
M1 B1∗ = U1 Σ1 V1T Ei Bi∗ = Ui Σi ViT
Assume U1 , U2 , ..., Uk−1 has rank r1 , r2 , ..., rk−1 . Projection matrix P1 = U1 U1T , P2 =
U2 U2T , ..., Pk−1 = Uk−1 Uk−1
T . Then, the final model for MVOPR will be:
\[
Y = P_1^{\perp} M_1 \beta_1 + \sum_{i=2}^{k-1} P_i^{\perp} E_i \beta_i + E_k \beta_k + \sum_{j=1}^{k-1} U_j \gamma_j^{*} + \epsilon_1
\]
The transformed modalities are mutually uncorrelated in the regression.
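The construction above can be sketched numerically. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `mvopr_design`, the `ranks` argument, and the simulated data are hypothetical, and ordinary least squares followed by a rank truncation of the fitted values stands in for whichever reduced-rank estimator is used to obtain the coefficient matrices. Indices in the code are 0-based, so `resid[0]` plays the role of $M_1$ and `resid[j]` the role of $E_{j+1}$.

```python
import numpy as np

def mvopr_design(modalities, ranks):
    """Return projected predictors and nuisance factors for k ordered modalities.

    modalities: list of (n, p_j) arrays, ordered upstream to downstream.
    ranks: list of k-1 integers, the assumed rank of each nuisance term.
    """
    k = len(modalities)
    resid = [modalities[0]]            # the first modality is kept as is
    coef = {}                          # coef[(j, i)] plays the role of B_{j,i}
    for j in range(1, k):
        upstream = np.hstack(resid)    # (M_1, E_2, ..., E_{j-1})
        # Assumption: plain least squares instead of a reduced-rank estimator.
        B, *_ = np.linalg.lstsq(upstream, modalities[j], rcond=None)
        start = 0
        for i, block in enumerate(resid):
            coef[(j, i)] = B[start:start + block.shape[1]]
            start += block.shape[1]
        resid.append(modalities[j] - upstream @ B)   # residual form E_j

    predictors, factors = [], []
    for i in range(k - 1):
        # Stack the coefficients of all downstream modalities on block i.
        B_star = np.hstack([coef[(j, i)] for j in range(i + 1, k)])
        nuisance = resid[i] @ B_star
        U, _, _ = np.linalg.svd(nuisance, full_matrices=False)
        U = U[:, :ranks[i]]                          # keep r_i leading directions
        predictors.append(resid[i] - U @ (U.T @ resid[i]))   # apply I - U U^T
        factors.append(U)
    predictors.append(resid[-1])       # the last residual block is not projected
    return predictors, factors

# Hypothetical simulated data with three modalities and low-rank coefficients.
rng = np.random.default_rng(0)
n = 200
M1 = rng.normal(size=(n, 6))
E2_true = rng.normal(size=(n, 5))
B21 = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
B31 = rng.normal(size=(6, 1)) @ rng.normal(size=(1, 4))
B32 = rng.normal(size=(5, 1)) @ rng.normal(size=(1, 4))
M2 = M1 @ B21 + E2_true
M3 = M1 @ B31 + E2_true @ B32 + rng.normal(size=(n, 4))

predictors, factors = mvopr_design([M1, M2, M3], ranks=[3, 1])
design = np.hstack(predictors + factors)
```

In this sketch each projected block is orthogonal to its own nuisance factors by construction, and `design` collects the predictor blocks followed by the factor blocks, so it could be passed to a penalized regression routine.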
\[
M_2 = M_1 B + E
\]
Suppose $B$ has the SVD $B = U_B \Sigma_B V_B^T$. Then $M_2$ can be expressed through an
approximate factor model with factors $F = M_1 U_B$ and loadings $\Lambda = \Sigma_B V_B^T$:
\[
M_2 = M_1 U_B \Sigma_B V_B^T + E = F\Lambda + E
\]
This structure aligns with the settings of Integrative Factor Regression and Factor-Adjusted
Regularized Regression.
Suppose $M_1$ does not follow an approximate factor model structure. In Integrative Factor
Regression, the factor decomposition turns $(M_1, M_2)$ into $(M_1, E)$, with the factors $F$ as
nuisance parameters. However, since $F = M_1 U_B$, $F$ is a linear combination of the columns
of $M_1$ and is therefore highly correlated with it. When we fit the regression
$Y = M_1 \beta_1 + E \beta_2 + F \gamma + \epsilon_1$, the true contribution of $M_1$ is obscured by the
correlated nuisance parameter $F$. When $M_1$ has some spiked eigenvalues and can be
approximated by a factor model, a similar problem remains. Suppose $M_1 = F_1 \Lambda_1 + U_1$; the
decomposition then turns $(M_1, M_2)$ into $(U_1, E)$ with nuisance parameters $(F_1, F)$. Since
$F = M_1 U_B = F_1 \Lambda_1 U_B + U_1 U_B$, $F$ is still correlated with both $F_1$ and $U_1$.
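The collinearity argument above can be checked on simulated data. The sketch below is illustrative only: the dimensions, the random seed, and the variable names are assumptions, and the true $B$ is used in place of an estimate. It verifies that $F\Lambda = M_1 B$, that appending $F = M_1 U_B$ to $(M_1, E)$ adds no new directions to the design, and that the left singular vectors of $M_1 B$ are orthogonal to the projected $M_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, r = 150, 8, 6, 2            # hypothetical sizes; B has rank r

M1 = rng.normal(size=(n, p))
B = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))
E = rng.normal(size=(n, q))
M2 = M1 @ B + E

# Factor representation M2 = F Lambda + E with F = M1 U_B.
U_B, s, Vt = np.linalg.svd(B, full_matrices=False)
U_B, s, Vt = U_B[:, :r], s[:r], Vt[:r]
F = M1 @ U_B
Lam = np.diag(s) @ Vt

# F lies in the column space of M1, so (M1, E, F) gains no rank from F.
design = np.hstack([M1, E, F])
rank_design = np.linalg.matrix_rank(design)

# Orthogonal projection: remove from M1 the directions spanned by M1 B.
U, _, _ = np.linalg.svd(M1 @ B, full_matrices=False)
U = U[:, :r]
M1_perp = M1 - U @ (U.T @ M1)
cross = U.T @ M1_perp                 # zero up to rounding error
```

Here `rank_design` equals `p + q` rather than `p + q + r`, which is the exact form of the correlation problem described above, while `cross` vanishes for the projected predictor.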
In Factor-Adjusted Regularized Regression, the matrix $M = (M_1, M_2)$ is treated as a whole
and decomposed accordingly. Since $M$ follows the decomposition $M = M_1 (I, B) + (0, E)$,
which does not perfectly align with the model's assumptions, the selection of the number of
factors is affected. When $B$ has a rank $r$ that is close or equal to $p$, $M$ can be
decomposed as $F \Lambda + (0, E)$ with $p$ factors. In this case, the transformed $M_1$ is nearly zero,
as most of its information is absorbed by the factors. A similar issue arises when $M_1 (I, B)$
lacks spiked eigenvalues, which makes the factor structure difficult to distinguish. This
increases the risk of selecting an excessively large number of factors, potentially distorting the
factor adjustment process.
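The absorption effect described above can be illustrated with a small simulation. This is a sketch under assumptions: the sizes, the scale of $B$, and the use of an uncentered SVD to extract $p$ principal directions are choices made here for illustration and are not the exact factor-adjustment procedure of the cited method.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 300, 4, 30                  # hypothetical sizes

M1 = rng.normal(size=(n, p))
B = 5.0 * rng.normal(size=(p, q))     # full row rank p (almost surely)
E = rng.normal(size=(n, q))
M = np.hstack([M1, M1 @ B + E])       # M = M1 (I, B) + (0, E)

# Remove the p leading principal directions from the stacked matrix.
U, s, _ = np.linalg.svd(M, full_matrices=False)
U_p = U[:, :p]
M_adj = M - U_p @ (U_p.T @ M)

# Fraction of M1 that survives the factor adjustment.
M1_adj = M_adj[:, :p]
ratio = np.linalg.norm(M1_adj) / np.linalg.norm(M1)
```

With a strong, full-rank $B$ the leading $p$ directions essentially span the column space of $M_1$, so `ratio` is small and little of $M_1$ remains for variable selection.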
REFERENCES
Abdel-Aziz, M. I., Neerincx, A. H., Vijverberg, S. J., Kraneveld, A. D. and Maitland-van der
Zee, A. H. (2020). Omics for the future in asthma. In Seminars in immunopathology 42 111–126. Springer.
Aslam, R., Herrles, L., Aoun, R., Pioskowik, A. and Pietrzyk, A. (2024). The Link between Gut
Microbiota Dysbiosis and Childhood Asthma: Insights from a Systematic Review. Journal of Allergy and
Clinical Immunology: Global 100289.
Bai, J. and Li, K. (2012). Statistical analysis of factor models of high dimension.
Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica 70
191–221.
Barosova, R., Baranovicova, E., Hanusrichterova, J. and Mokra, D. (2023). Metabolomics in
Animal Models of Bronchial Asthma and Its Translational Importance for Clinics. International Journal of
Molecular Sciences 25 459.
Boulesteix, A.-L., De Bin, R., Jiang, X. and Fuchs, M. (2017). IPF-LASSO: integrative L1-penalized
regression with penalty factors for prediction based on multi-omics data. Computational and mathematical
methods in medicine 2017 7691937.
Castel, C., Zhao, Z. and Thoresen, M. (2024). Comparison of the LASSO and Integrative LASSO with
Penalty Factors (IPF-LASSO) methods for multi-omics data: Variable selection with Type I error control.
arXiv preprint arXiv:2404.02594.
Chen, K., Dong, H. and Chan, K.-S. (2013). Reduced rank regression via adaptive nuclear norm penalization.
Biometrika 100 901–920.
Chen, L. and Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and
variable selection. J. Am. Stat. Assoc. 107 1533–1545.
Chen, C., Wang, J., Pan, D., Wang, X., Xu, Y., Yan, J., Wang, L., Yang, X., Yang, M. and Liu, G.-P.
(2023). Applications of multi-omics analysis in human diseases. MedComm 4 e315.
Chu, X., Zhang, B., Koeken, V. A., Gupta, M. K. and Li, Y. (2021). Multi-omics approaches in
immunological research. Frontiers in Immunology 12 668045.
Chung, K. F. (2016). Asthma phenotyping: a necessity for improved therapeutic precision and new targeted
therapies. Journal of internal medicine 279 192–204.
Clark, C., Dayon, L., Masoodi, M., Bowman, G. L. and Popp, J. (2021). An integrative multi-omics
approach reveals new central nervous system pathway alterations in Alzheimer’s disease. Alzheimer’s research
& therapy 13 1–19.
Comhair, S. A., McDunn, J., Bennett, C., Fettig, J., Erzurum, S. C. and Kalhan, S. C. (2015).
Metabolomic endotype of asthma. The Journal of Immunology 195 643–650.
Ding, D. Y., Li, S., Narasimhan, B. and Tibshirani, R. (2022a). Cooperative learning for multiview
analysis. Proceedings of the National Academy of Sciences 119 e2202113119.
Ding, D. Y., Li, S., Narasimhan, B. and Tibshirani, R. (2022b). Cooperative learning for multiview
analysis. Proc. Natl. Acad. Sci. U. S. A. 119 e2202113119.
Fan, J., Ke, Y. and Wang, K. (2020). Factor-adjusted regularized model selection. Journal of Econometrics
216 71–85.
Fan, J., Liao, Y. and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal
complements. Journal of the Royal Statistical Society Series B: Statistical Methodology 75 603–680.
Fan, Y. and Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal
of the Royal Statistical Society Series B: Statistical Methodology 75 531–552.
Fiuza, B. S. D., de Andrade, C. M., Meirelles, P. M., da Silva, J. S., de Jesus Silva, M.,
Santana, C. V. N., Pinheiro, G. P., Mpairwe, H., Cooper, P., Brooks, C. et al. (2024). Gut microbiome
signature and nasal lavage inflammatory markers in young people with asthma. Journal of Allergy and Clinical
Immunology: Global 3 100242.
Garg, M., Karpinski, M., Matelska, D., Middleton, L., Burren, O. S., Hu, F., Wheeler, E.,
Smith, K. R., Fabre, M. A., Mitchell, J. et al. (2024). Disease prediction with multi-omics and
biomarkers empowers case–control genetic discoveries in the UK Biobank. Nature Genetics 56 1821–1831.
Gautam, Y., Johansson, E. and Mersha, T. B. (2022). Multi-omics profiling approach to asthma: an
evolving paradigm. Journal of personalized medicine 12 66.
Gillenwater, L. A., Helmi, S., Stene, E., Pratte, K. A., Zhuang, Y., Schuyler, R. P., Lange, L.,
Castaldi, P. J., Hersh, C. P., Banaei-Kashani, F. et al. (2021). Multi-omics subtyping pipeline for
chronic obstructive pulmonary disease. PloS one 16 e0255337.
Hussein, R., Abou-Shanab, A. M. and Badr, E. (2024). A multi-omics approach for biomarker discovery
in neuroblastoma: a network-based framework. npj Systems Biology and Applications 10 52.
Klau, S., Jurinovic, V., Hornung, R., Herold, T. and Boulesteix, A.-L. (2018). Priority-Lasso: a
simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC bioinformatics
19 1–14.
Kwon, Y., Han, K., Suh, Y. J. and Jung, I. (2023). Stability selection for LASSO with weights based on
AUC. Scientific Reports 13 5207.
Li, Q. and Li, L. (2022). Integrative factor regression and its inference for multimodal data analysis. Journal of
the American Statistical Association 117 2207–2221.
Madapoosi, S. S., Cruickshank-Quinn, C., Opron, K., Erb-Downward, J. R., Begley, L. A., Li, G.,
Barjaktarevic, I., Barr, R. G., Comellas, A. P., Couper, D. J. et al. (2022). Lung microbiota and
metabolites collectively associate with clinical outcomes in milder stage chronic obstructive pulmonary
disease. American journal of respiratory and critical care medicine 206 427–439.
Mahdavinia, M., Fyolek, J. P., Jiang, J., Thivalapill, N., Bilaver, L. A., Warren, C., Fox, S.,
Nimmagadda, S. R., Newmark, P. J., Sharma, H. et al. (2023). Gut microbiome is associated with
asthma and race in children with food allergy. Journal of Allergy and Clinical Immunology 152 1541–1549.
Menyhárt, O. and Győrffy, B. (2021). Multi-omics approaches in cancer research with applications in tumor
subtyping, prognosis, and diagnosis. Computational and structural biotechnology journal 19 949–960.
Nasiri, E., Berahmand, K., Rostami, M. and Dabiri, M. (2021). A novel link prediction algorithm for
protein-protein interaction networks by attributed graph embedding. Computers in Biology and Medicine 137
104772.
Olivier, M., Asmis, R., Hawkins, G. A., Howard, T. D. and Cox, L. A. (2019). The need for multi-omics
biomarker signatures in precision medicine. International journal of molecular sciences 20 4781.
Richards, A. L., Eckhardt, M. and Krogan, N. J. (2021). Mass spectrometry-based protein–protein
interaction networks for the study of human diseases. Molecular systems biology 17 e8792.
Ruff, W. E., Greiling, T. M. and Kriegel, M. A. (2020). Host–microbiota interactions in immune-mediated
diseases. Nature Reviews Microbiology 18 521–538.
Ryan, E., Joyce, S. A. and Clarke, D. J. (2023). Membrane lipids from gut microbiome-associated bacteria
as structural and signalling molecules. Microbiology 169 001315.
Ryan, E., Gonzalez Pastor, B., Gethings, L. A., Clarke, D. J. and Joyce, S. A. (2023). Lipidomic
analysis reveals differences in Bacteroides species driven largely by plasmalogens, glycerophosphoinositols
and certain sphingolipids. Metabolites 13 360.
Szklarczyk, D., Kirsch, R., Koutrouli, M., Nastou, K., Mehryary, F., Hachilif, R., Gable, A. L.,
Fang, T., Doncheva, N. T., Pyysalo, S. et al. (2023). The STRING database in 2023: protein–protein
association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic acids
research 51 D638–D646.
Tao, J.-L., Wang, S.-C., Tian, M., Liang, H., Xie, T., Lin, L.-L. and Dai, Q.-G. (2017). Metabonomics
of syndrome markers in Infantile Bronchial Asthma Episode. Zhongguo Zhong xi yi jie he za zhi Zhongguo
Zhongxiyi Jiehe Zazhi = Chinese Journal of Integrated Traditional and Western Medicine 37 319–325.
Tao, J.-L., Chen, Y.-Z., Dai, Q.-G., Tian, M., Wang, S.-C., Shan, J.-J., Ji, J.-J., Lin, L.-L., Li, W.-W.
and Yuan, B. (2019). Urine metabolic profiles in paediatric asthma. Respirology 24 572–581.
Uematsu, Y., Fan, Y., Chen, K., Lv, J. and Lin, W. (2019). SOFAR: Large-Scale Association Network
Learning. IEEE Transactions on Information Theory 65 4924–4939. https://doi.org/10.1109/TIT.2019.2909889
Yang, B., Yang, R., Xu, B., Fu, J., Qu, X., Li, L., Dai, M., Tan, C., Chen, H. and Wang, X. (2021).
miR-155 and miR-146a collectively regulate meningitic Escherichia coli infection-mediated neuroinflammatory
responses. Journal of Neuroinflammation 18 114.
Zhang, W., Zhang, Y., Li, L., Chen, R. and Shi, F. (2024). Unraveling heterogeneity and treatment of asthma
through integrating multi-omics data. Frontiers in Allergy 5 1496392.
Zimmermann, P., Messina, N., Mohn, W. W., Finlay, B. B. and Curtis, N. (2019). Association between
the intestinal microbiota and allergic sensitization, eczema, and asthma: a systematic review. Journal of Allergy
and Clinical Immunology 143 467–485.