
Uncertainty-Aware Adaptation of Large Language Models for Protein-Protein Interaction Analysis

Sanket Jantre1⋆, Tianle Wang1, Gilchan Park1, Kriti Chopra1, Nicholas Jeon2, Xiaoning Qian1,2, Nathan M. Urban1, and Byung-Jun Yoon1,2

1 Computing and Data Sciences Directorate, Brookhaven National Laboratory, Upton, NY, USA.
2 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA.
⋆ Corresponding author.
© This work has been submitted to the IEEE for possible publication. arXiv:2502.06173v1 [cs.LG], 10 Feb 2025.

Abstract— Identification of protein-protein interactions (PPIs) helps derive cellular mechanistic understanding, particularly in the context of complex conditions such as neurodegenerative disorders, metabolic syndromes, and cancer. Large Language Models (LLMs) have demonstrated remarkable potential in predicting protein structures and interactions via automated mining of vast biomedical literature; yet their inherent uncertainty remains a key challenge for deriving reproducible findings, critical for biomedical applications. In this study, we present an uncertainty-aware adaptation of LLMs for PPI analysis, leveraging fine-tuned LLaMA-3 and BioMedGPT models. To enhance prediction reliability, we integrate LoRA ensembles and Bayesian LoRA models for uncertainty quantification (UQ), ensuring confidence-calibrated insights into protein behavior. Our approach achieves competitive performance in PPI identification across diverse disease contexts while addressing model uncertainty, thereby enhancing trustworthiness and reproducibility in computational biology. These findings underscore the potential of uncertainty-aware LLM adaptation for advancing precision medicine and biomedical research.

Index Terms— Large Language Model (LLM), Low Rank Adaptation (LoRA), Uncertainty Quantification (UQ), Bayesian Inference, Deep Ensemble, Protein-Protein Interaction (PPI).

I. INTRODUCTION

Protein–protein interactions (PPIs) form the molecular foundation of cellular function, orchestrating everything from gene regulation and signal transduction to metabolic processes and immune response. The intricate network of these interactions, often termed the interactome, represents one of the most complex and dynamic systems in biology [1]. Understanding this complex PPI network is particularly crucial in disease contexts, where aberrant protein interactions can lead to pathological states. Alterations in PPI networks have been implicated in numerous diseases, affecting fundamental cellular processes such as protein homeostasis, cell cycle regulation, and metabolic control. These disease-associated changes in the interactome can manifest through various mechanisms, from disrupted protein complex formation to altered signaling cascades, ultimately contributing to disease progression and severity. Elucidating these interaction networks is therefore essential for understanding disease mechanisms and developing therapeutic strategies [2].

Traditional experimental methods for identifying PPIs, such as yeast two-hybrid screening and co-immunoprecipitation, have generated vast amounts of validated interaction data. These experimentally determined interactions have been systematically collected in comprehensive databases such as STRING [3], BioGRID [4], and IntAct [5], creating valuable resources for the research community. While experimental methods remain the gold standard for PPI validation, their labor-intensive and time-consuming nature has motivated the development of computational approaches to predict novel protein interactions.

Early computational methods for PPI prediction primarily relied on sequence-based evolutionary patterns across species [6], [7], [8]. As the field advanced, machine learning (ML) approaches emerged, offering new ways to integrate multiple types of biological data [9]. These included convolutional neural networks (CNNs) for analyzing protein sequence patterns, recurrent neural networks (RNNs) for capturing sequential dependencies, and graph neural networks (GNNs) for modeling the topology of protein interaction networks [10].

Advances in large language models (LLMs) [11], [12], [13], [14], [15], [16], [17] have transformed multiple scientific domains through their unprecedented capabilities in understanding complex patterns and relationships in text data [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]. Building on their powerful language-processing capabilities, biology-specific models have been developed to tackle diverse tasks: ProtBERT [28] and ESM3 [29] focus on protein sequence analysis, while BioGPT [19] and BioMedGPT [30] have shown promise in extracting biological knowledge from scientific literature. Recently, [31] investigated PPI prediction using LLM-based approaches, demonstrating the potential of these models in specialized biomedical tasks. However, a critical challenge remains: such models often produce overly confident predictions, especially when trained on limited data, posing serious risks in high-stakes biomedical applications [32], [33], [34], [35]. Uncertainty-aware adaptation of LLMs is particularly critical in biomedical applications, where miscalibrated confidence levels could lead to erroneous conclusions about disease mechanisms or therapeutic targets. By incorporating principled uncertainty estimation and confidence calibration techniques, we can ensure that LLM-driven predictions are not only powerful but also reliable for guiding biomedical discoveries.
[Figure 1: a protein pair (Protein A, Protein B) and a disease-specific query ("Do these proteins interact in the presence of a certain disease?") are fed to the frozen LLM backbone equipped with uncertainty-aware LoRA adapters B ∈ R^{d1×r} and A ∈ R^{r×d2}; the model returns a Yes/No answer.]

Fig. 1. Illustration of our uncertainty-aware low-rank adaptation approach for pre-trained LLMs in protein-protein interaction prediction.
Within the broader machine learning community, uncertainty quantification has long been a challenge due to the high-dimensional parameter spaces of deep learning models. Traditional Bayesian methods relying on Markov chain Monte Carlo sampling often become intractable at scale [36], [37]. Consequently, approximate Bayesian methods such as variational inference [38], sparse learning [39], [40], and dimension reduction [41] have been explored. Alternately, ensemble-based methods like deep ensembles [42], SWAG [43], and SeBayS [44] also provide principled strategies for capturing predictive variability. On the other hand, parameter-efficient fine-tuning techniques—particularly Low-Rank Adaptation (LoRA) [45]—make it more tractable to adapt large models for specific downstream tasks. To this end, novel UQ-aware fine-tuning approaches have emerged that combine LoRA with Bayesian or ensemble-based ideas, such as Bayesian LoRA [46], [47], [48], [49] and LoRA ensembles [50], [51].

In this study, we specifically focus on LLaMA-3 [17] and BioMedGPT [30] as our primary LLM frameworks. We integrate LoRA-based fine-tuning with uncertainty-aware techniques to improve disease-specific PPI prediction. More concretely, we adopt Bayesian LoRA and LoRA ensemble methods to mitigate overconfidence and capture richer predictive variability. By leveraging the language-like structure of protein sequences, our approach naturally models intricate dependencies while generating well-calibrated estimates. To this end, our contributions include:

1) Low-rank adaptation-based fine-tuning of LLaMA-3-8B and BioMedGPT-LM-7B, comparing standard LoRA fine-tuning with Bayesian LoRA and LoRA ensembles to tackle disease-focused PPI prediction.
2) Uncertainty quantification (UQ) integration to assess the reliability and robustness of PPI predictions.
3) Comprehensive uncertainty-aware evaluation across disease-specific protein interaction networks, specifically those relevant to neurodegenerative disorders, metabolic diseases, and cancer.

Overall, our results confirm that incorporating UQ strategies not only enhances PPI prediction accuracy but also yields better-calibrated confidence measures—critical for drawing robust conclusions in biomedical research. By advancing uncertainty-aware methods in LLM-based modeling, we lay the groundwork for safer, more reliable, and more informative computational tools in precision medicine.

II. PRELIMINARIES

Throughout the paper, all vectors and matrices are denoted by bold lowercase (l) and uppercase letters (L), respectively.

A. Low-Rank Adaptation

To adapt a pre-trained language model to downstream tasks, the authors of [45] introduced LoRA, a parameter-efficient fine-tuning approach. Assuming that weight changes exhibit a low intrinsic rank, LoRA optimizes rank decomposition matrices while keeping the pre-trained weights frozen. Specifically, given that the weight update has a low-rank structure, the adapted forward pass is expressed as:

h = (W0 + ∆W)a = (W0 + BA)a.

Here a and h represent the input and output vectors, respectively, of a large frozen pre-trained weight matrix W0 ∈ R^{d1×d2}. The matrices B ∈ R^{d1×r} and A ∈ R^{r×d2} contain trainable parameters, with r ≪ min(d1, d2). This reduction in the number of parameters allows LoRA to provide efficient fine-tuning with a decrease in storage requirements. We adopt this model as a baseline in our experiments.

B. Bayesian model formulation

Let D = {(x_i, y_i)}_{i=1,...,N} be a training dataset with N i.i.d. samples, where x represents the input samples and y represents the output samples. Bayesian inference captures model uncertainty by inferring a probability distribution over model parameters, θ = (θ_1, ..., θ_T) ∈ R^T, instead of learning a single deterministic model p(y|x, θ). The posterior distribution follows Bayes' rule: p(θ|D) ∝ p(D|θ)p(θ), where p(D|θ) is the model likelihood and p(θ) is the prior distribution. To make predictions for a new input x_new, Bayesian model averaging (BMA) is applied using the posterior distribution p(θ|D) as follows:

p(y_new | x_new, D) = ∫ p(y_new | x_new, θ) p(θ|D) dθ ≈ (1/B) Σ_{b=1}^{B} p(y_new | x_new, θ_b),   θ_b ∼ p(θ|D).

This approach improves generalizability and model calibration by incorporating parameter uncertainty into predictions.
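As a concrete illustration of the adapted forward pass h = (W0 + BA)a from Section II-A, the sketch below wraps a frozen linear layer with trainable low-rank factors. This is a minimal sketch rather than our exact implementation; the α/r scaling follows the standard LoRA recipe, and the zero initialization of B with Kaiming-uniform A matches the settings reported later in Section IV.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal LoRA-adapted linear layer: h = W0 a + (alpha / r) * B A a."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        d1, d2 = base.out_features, base.in_features
        self.base = base
        self.base.weight.requires_grad_(False)     # W0 stays frozen
        self.A = nn.Parameter(torch.empty(r, d2))  # A in R^{r x d2}
        self.B = nn.Parameter(torch.zeros(d1, r))  # B in R^{d1 x r}, zero init
        nn.init.kaiming_uniform_(self.A)           # Kaiming Uniform init for A
        self.scale = alpha / r

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # Frozen pre-trained path plus the trainable low-rank update
        return self.base(a) + self.scale * ((a @ self.A.T) @ self.B.T)
```

Only A and B receive gradients during fine-tuning, which is what makes training several adapters for an ensemble, or placing a Laplace approximation over them, affordable.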
III. METHODOLOGY

A. LoRA Ensemble

We employ an ensemble of LoRA models, the LoRA Ensemble [50], [51], as an efficient strategy for uncertainty quantification in LLMs. Traditional deep ensembles yield better predictive performance and uncertainty estimation by training multiple models independently, but applying this directly to LLMs is often infeasible due to high memory and computational costs.

To circumvent these issues, each LoRA Ensemble member fine-tunes the same pre-trained backbone W0 with a low-rank trainable modification ∆W_m = B_m A_m, where B_m ∈ R^{d1×r} and A_m ∈ R^{r×d2} have significantly fewer parameters than the full model, with r ≪ min(d1, d2). These adapters are trained independently and in parallel, ensuring diverse solutions {W_1, W_2, ..., W_M}. The ensemble prediction is computed by averaging outputs across the M ensemble members. For a given input x_new, if y_new^m represents the prediction from the m-th ensemble member, the final ensemble output (for continuous outcomes) is given by:

p_ens(y_new | x_new) = (1/M) Σ_{m=1}^{M} p(y_new | x_new, W_m).

This approach retains the benefits of ensembling (improved accuracy, calibration, and robustness) while preserving efficiency by reusing the frozen backbone and only training lightweight LoRA adapters.
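For the binary True/False formulation used here, averaging the members' predictive distributions reduces to averaging their answer-token probabilities. The sketch below assumes a PEFT-wrapped causal LM with the M adapters already loaded under the names in `adapter_names`; restricting the softmax to two candidate answer tokens is an illustrative assumption, not a description of our exact inference code.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def lora_ensemble_probs(model, adapter_names, batch, answer_token_ids):
    """Average the predictive distributions of M LoRA adapters sharing one frozen backbone."""
    member_probs = []
    for name in adapter_names:
        model.set_adapter(name)                      # switch the active LoRA adapter; W0 is shared
        logits = model(**batch).logits[:, -1, :]     # next-token scores at the final position
        answer_logits = logits[:, answer_token_ids]  # keep only the candidate answer tokens (e.g., "True"/"False")
        member_probs.append(F.softmax(answer_logits, dim=-1))
    # p_ens(y | x) = (1/M) * sum_m p(y | x, W_m)
    return torch.stack(member_probs).mean(dim=0)
```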
B. Bayesian Low-Rank Adaptation

Despite the availability of scalable posterior inference methods like variational inference [38], a fully Bayesian treatment of LLMs remains computationally prohibitive. Instead, limiting Bayesian inference to the LoRA parameters offers a more tractable means of capturing uncertainty in model predictions. However, even Markov chain Monte Carlo approaches can become excessively costly for inferring posteriors over the millions of LoRA parameters involved in large-scale models. As a practical compromise, Bayesian LoRA [46] employs the Laplace approximation to estimate the posterior over these low-rank parameters, centered around their maximum a posteriori (MAP) estimate with covariance given by the inverse of the (Fisher-approximated) Hessian [52].

To this end, let θ denote the trainable LoRA parameters with a prior distribution N(0, λ^{-1} I). The Laplace approximation first calculates the MAP estimate, which is equivalent to maximizing the log-joint log P(y, X, θ):

θ_MAP = argmax_θ log P(y, X, θ)
      = argmax_θ [ log P(y|X, θ) + log P(θ) ]
      = argmax_θ [ log P(y|X, θ) − (λ/2) ||θ||_2^2 + const ],

where X represents the model inputs. The term associated with the log of the prior distribution provides L2-regularization on the trainable parameters. We can incorporate this in frequentist model training via a weight decay term with strength λ/2. As a result, parameters from any previously trained model that used a reasonable weight decay setting (for example, via AdamW with its weight decay) can be directly reused.

Next, to obtain an approximate posterior around θ_MAP, the Laplace method proceeds with a second-order Taylor expansion of the log-joint L(D, θ) = log P(y, X, θ) around θ_MAP. Ignoring the higher-order terms, this yields

L(D, θ) ≈ L(D, θ_MAP) + (1/2) (θ − θ_MAP)^⊤ H (θ − θ_MAP),

where the first-order term vanishes due to the zero gradient at θ_MAP, and H = ∇_θ^2 L(D, θ)|_{θ_MAP} is the Hessian of the log-joint at θ_MAP. Under this quadratic approximation,

p(θ | D) ≈ N(θ | θ_MAP, H^{-1}).   (1)

Hence, the Laplace approximation is a post-hoc Bayesian inference method which requires the additional step of computing the H^{-1} matrix at θ_MAP. In practice, computing the full Hessian H can be expensive, especially for large models, due to quadratic complexity with respect to the number of model parameters. We use the positive semi-definite Fisher information matrix to circumvent the issue of the potentially indefinite Hessian, which arises when local convexity conditions fail to hold in large machine learning models. Accordingly, the Fisher information is defined by

F(θ) = Σ_n E_{ŷ∼P(y|f_θ(x_n))} [ G G^⊤ ],

where G = ∇_θ log P(ŷ | f_θ(x_n)) is the gradient and the expectation above is over the model's output distribution.

Next, in order to estimate the Fisher information in a manner that is both tractable and memory-efficient, we employ a Kronecker-Factored Approximate Curvature (K-FAC) approach similar to [46]. In K-FAC, we treat the Fisher as a block-diagonal matrix with one block per linear layer and factorize each block into two smaller matrices. For the l-th linear layer, we compute the Fisher block F_l using that layer's input activations a_{l−1} and the log-likelihood gradients with respect to the layer's pre-activation output s_l, denoted by G_{s_l} = ∇_{s_l} log P(y|X, θ). Hence the expression is

F_l = Σ_{n=1}^{N} E_{P(y|f_θ(x_n))} [ a_{l−1} a_{l−1}^⊤ ] ⊗ E_{P(y|f_θ(x_n))} [ G_{s_l} G_{s_l}^⊤ ].   (2)

This approach avoids storing the full, dense Hessian, thereby reducing computational overhead. By applying K-FAC to the LoRA parameters, we maintain a compact representation of uncertainty while keeping the overhead similar to standard training. However, in Equation (2), the first expectation grows with the square of the layer's input width, while the second grows with the square of the output width. Because LoRA adapters alternate between wide-input-narrow-output configurations and vice versa, one of these expectations can become especially large. To address this, we use an incremental SVD to factorize the large matrix into two new low-rank factors, thereby saving memory. Further mathematical details are provided in Appendix E of [46].
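To make Equation (2) concrete, the two Kronecker factors for one adapted linear layer can be accumulated from a mini-batch as below. This is a simplified sketch with a hypothetical interface; it omits the incremental-SVD refinement discussed above and assumes the pre-activation gradients have already been computed with labels sampled from the model's own output distribution.

```python
import torch


def kfac_factors(activations: torch.Tensor, preact_grads: torch.Tensor):
    """Kronecker factors of the Fisher block for one linear layer (cf. Eq. (2)).

    activations  : (N, d_in)  inputs a_{l-1} to the layer
    preact_grads : (N, d_out) gradients of log P(y_hat | f_theta(x_n)) w.r.t. the
                   pre-activations s_l, with y_hat drawn from the model's output
                   distribution rather than taken from the labels
    """
    A_factor = activations.T @ activations      # sum_n a_{l-1} a_{l-1}^T, shape (d_in, d_in)
    G_factor = preact_grads.T @ preact_grads    # sum_n G_{s_l} G_{s_l}^T, shape (d_out, d_out)
    # The full Fisher block is approximated by the Kronecker product A_factor ⊗ G_factor,
    # but only the two small factors ever need to be stored.
    return A_factor, G_factor
```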
Once we infer the approximate posterior, which is Gaussian as per Equation (1), we can linearize the model predictions around the MAP estimate θ_MAP [53]. For a test input x_new,

f_θ(x_new) ≈ f_{θ_MAP}(x_new) + ∇_θ f_θ(x_new)|_{θ_MAP}^⊤ (θ − θ_MAP).

Because this expression is linear in θ, integrating out the Gaussian posterior over θ yields a Gaussian predictive distribution for the logits:

f_θ(x_new) ∼ N(y | f_{θ_MAP}(x_new), Λ),

where Λ = ∇_θ f_{θ_MAP}(x_new)^⊤ H^{-1} ∇_θ f_{θ_MAP}(x_new). Finally, to efficiently sample from this predictive posterior, we use the Cholesky decomposition Λ = LL^⊤. Then,

ŷ = f_θ(x_new) = f_{θ_MAP}(x_new) + Lz,   z ∼ N(0, I).

This linearized predictive step, combined with the Gaussian approximate posterior, yields efficient uncertainty estimates in the Bayesian LoRA approach for downstream tasks.

IV. EXPERIMENTAL RESULTS

In this section, we assess the performance of two uncertainty-aware LoRA adaptations—LoRA Ensemble and Bayesian LoRA—applied to LLaMA-3-8B and BioMedGPT-LM-7B models on publicly available protein-protein interaction datasets. As a baseline, we include a single LoRA model trained in a deterministic manner. All LoRA-based approaches were implemented using the PEFT library [54], with each configuration run three times using different random seeds. We evaluate model performance and robustness by accuracy (Acc), negative log-likelihood (NLL), and expected calibration error (ECE) on the test sets. Additional details on the NLL and ECE metrics can be found in Appendix V-A. Furthermore, we report Matthews Correlation Coefficient (MCC), specificity (Spec.), precision (Prec.), F1-score, and Area under the Receiver Operating Characteristic curve (AUROC) over the test sets for a comprehensive view of predictive capabilities. Final metrics are summarized by the mean and standard deviation across three independent runs.

PPI Datasets. The datasets analyzed here explore PPIs related to various diseases, providing valuable insights into their underlying mechanisms. The neurodegenerative diseases PPI (ND-PPI) dataset, sourced from the study [55], focuses on neurodegenerative diseases and examines a network of 820 proteins forming 11,762 interactions, evenly split between positive and negative pairs. The dataset is structured to assess whether specific protein pairs interact in the presence of neurodegenerative conditions. Similarly, the metabolic disorders PPI (M-PPI) dataset, also from [55], investigates metabolic disorders, encompassing 1,063 proteins and a total of 10,262 interactions. The cancer PPI (C-PPI) dataset, derived from the study [56], consists of 933 positive and 1,308 negative interactions; to ensure balanced representation, this dataset was curated to create an equal-sized collection of 1,866 total interactions. These datasets have been evaluated by prompting models to assess whether proteins interact in the corresponding conditions, and they collectively contribute to advancing computational models for predicting PPIs across different disease contexts, enhancing our understanding of disease-specific interaction networks. These tasks are formalized into binary (True/False) classification problems as illustrated in Fig. 1. Furthermore, each dataset is divided into 80% for training and 20% for testing, with all models evaluated on the fixed test set in each PPI prediction task. We refer readers to [31] for additional details and exploratory analyses of these datasets.

Implementation Details. In all experiments, we construct a LoRA ensemble using three individually fine-tuned LoRA learners. The LoRA matrices B are initialized to zero, while the entries of A follow a Kaiming Uniform initialization [57]. Optimization is performed using the AdamW optimizer with a learning rate of 1 × 10^{-4}, default hyperparameters, and a total of four training epochs. The batch size is set to 4 for the ND-PPI and M-PPI cases and 16 for the C-PPI case, following [31]. For Bayesian LoRA, the prior precision λ is fixed at 0.1. Lastly, LoRA is applied to the queries, values, and output layer across all methods, with hyperparameters set to r = 16, α = 32, a dropout rate of 0.05, and a maximum sequence length of 50.

Results. The results for the ND-PPI, M-PPI, and C-PPI tasks are summarized in Tables I, II, and III, respectively.

In the ND-PPI prediction task (Table I), we demonstrate that the LoRA ensemble achieves the highest predictive accuracy among all models in both LLM settings and has the lowest NLL in the LLaMA-3 fine-tuning case. Conversely, Bayesian LoRA demonstrates the best calibration in both scenarios, exhibiting the lowest ECE, and achieves the lowest NLL in the BioMedGPT fine-tuning case. Lastly, the LoRA ensemble reports the highest values for specificity, precision, F1-score, MCC, and AUROC among all models. In the M-PPI prediction task (Table II), we show that the LoRA ensemble achieves the highest predictive accuracy and lowest NLL in both LLM scenarios, while also attaining the lowest ECE in the LLaMA-3 case. Conversely, Bayesian LoRA achieves the best calibration in the BioMedGPT case and the highest specificity in both scenarios. Finally, the LoRA ensemble outperforms all the models by achieving the best precision, F1-score, MCC, and AUROC values.

In the C-PPI prediction task (Table III), we demonstrate that the LoRA ensemble once again achieves the highest predictive accuracy and lowest NLL in both settings, while also attaining the lowest ECE in the BioMedGPT scenario. Bayesian LoRA matches the best predictive accuracy in the BioMedGPT case and achieves the lowest ECE in the LLaMA-3 case. In the LLaMA-3 setting, the LoRA ensemble reports the highest values for specificity, precision, F1-score, MCC, and AUROC among all models. Additionally, it achieves the best specificity in the BioMedGPT case. Notably, both Bayesian LoRA and the LoRA ensemble attain the best precision, F1-score, and MCC values in the BioMedGPT case. Lastly, all three models yield identical AUROC values in the BioMedGPT case.
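The implementation details above map almost directly onto a PEFT configuration. The following is a sketch under stated assumptions rather than the exact training script: the checkpoint identifier and the target-module names ("q_proj", "v_proj", "o_proj" for the query, value, and output projections) are typical for LLaMA-style models and may need adjusting for BioMedGPT-LM-7B.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical checkpoint name; BioMedGPT-LM-7B would be loaded the same way.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                            # LoRA rank
    lora_alpha=32,                                   # scaling alpha
    lora_dropout=0.05,                               # dropout on the LoRA path
    target_modules=["q_proj", "v_proj", "o_proj"],   # queries, values, output layer (assumed names)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

# AdamW with lr 1e-4 and otherwise default hyperparameters, trained for four epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```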
TABLE I
ND-PPI Prediction: the best results among all compared methods for a given LLM pre-trained model are highlighted in bold. All metrics are reported as mean ± standard deviation over three independent runs.

LLM Model | Methods       | Acc (↑)    | NLL (↓)     | ECE (↓)     | Spec. (↑)   | Prec. (↑)   | F1 (↑)      | MCC (↑)     | AUROC (↑)
Llama-3   | Single LoRA   | 86.51±0.54 | 0.362±0.036 | 0.095±0.011 | 0.966±0.005 | 0.881±0.005 | 0.863±0.006 | 0.745±0.011 | 0.953±0.004
Llama-3   | LoRA Ensemble | 88.70±0.62 | 0.302±0.002 | 0.088±0.005 | 0.973±0.001 | 0.899±0.004 | 0.886±0.006 | 0.785±0.011 | 0.964±0.001
Llama-3   | Bayesian LoRA | 86.51±0.20 | 0.317±0.003 | 0.052±0.021 | 0.848±0.033 | 0.866±0.002 | 0.865±0.002 | 0.732±0.003 | 0.944±0.004
BioMedGPT | Single LoRA   | 85.44±2.16 | 0.539±0.053 | 0.119±0.025 | 0.963±0.016 | 0.874±0.011 | 0.852±0.023 | 0.726±0.034 | 0.944±0.005
BioMedGPT | LoRA Ensemble | 88.00±1.19 | 0.363±0.049 | 0.087±0.022 | 0.965±0.008 | 0.892±0.007 | 0.879±0.013 | 0.771±0.019 | 0.956±0.001
BioMedGPT | Bayesian LoRA | 86.82±0.60 | 0.320±0.012 | 0.033±0.007 | 0.869±0.015 | 0.868±0.006 | 0.868±0.006 | 0.737±0.012 | 0.937±0.003

TABLE II
M-PPI Prediction: the best results among all compared methods for a given LLM pre-trained model are highlighted in bold. All metrics are reported as mean ± standard deviation over three independent runs.

LLM Model | Methods       | Acc (↑)    | NLL (↓)     | ECE (↓)     | Spec. (↑)   | Prec. (↑)   | F1 (↑)      | MCC (↑)     | AUROC (↑)
Llama-3   | Single LoRA   | 85.82±0.26 | 0.398±0.016 | 0.084±0.006 | 0.908±0.036 | 0.863±0.007 | 0.858±0.002 | 0.721±0.009 | 0.937±0.002
Llama-3   | LoRA Ensemble | 87.45±0.16 | 0.308±0.013 | 0.051±0.010 | 0.922±0.016 | 0.878±0.003 | 0.874±0.002 | 0.752±0.004 | 0.950±0.002
Llama-3   | Bayesian LoRA | 83.41±1.17 | 0.374±0.005 | 0.071±0.018 | 0.932±0.038 | 0.850±0.003 | 0.832±0.013 | 0.683±0.013 | 0.925±0.004
BioMedGPT | Single LoRA   | 83.68±0.54 | 0.542±0.026 | 0.113±0.009 | 0.799±0.047 | 0.840±0.002 | 0.837±0.006 | 0.678±0.007 | 0.925±0.005
BioMedGPT | LoRA Ensemble | 87.14±1.39 | 0.354±0.028 | 0.062±0.010 | 0.864±0.036 | 0.872±0.013 | 0.871±0.014 | 0.744±0.027 | 0.941±0.005
BioMedGPT | Bayesian LoRA | 83.29±0.57 | 0.385±0.015 | 0.037±0.018 | 0.888±0.033 | 0.838±0.003 | 0.832±0.007 | 0.671±0.007 | 0.905±0.004

TABLE III
C-PPI Prediction: the best results among all compared methods for a given LLM pre-trained model are highlighted in bold. All metrics are reported as mean ± standard deviation over three independent runs.

LLM Model | Methods       | Acc (↑)    | NLL (↓)     | ECE (↓)     | Spec. (↑)   | Prec. (↑)   | F1 (↑)      | MCC (↑)     | AUROC (↑)
Llama-3   | Single LoRA   | 96.62±0.62 | 0.094±0.011 | 0.033±0.002 | 0.973±0.016 | 0.967±0.007 | 0.966±0.006 | 0.932±0.013 | 0.996±0.002
Llama-3   | LoRA Ensemble | 97.86±0.00 | 0.066±0.005 | 0.029±0.005 | 0.980±0.000 | 0.979±0.000 | 0.979±0.000 | 0.957±0.000 | 0.997±0.000
Llama-3   | Bayesian LoRA | 96.97±1.24 | 0.085±0.020 | 0.027±0.002 | 0.963±0.012 | 0.970±0.012 | 0.970±0.012 | 0.940±0.025 | 0.996±0.002
BioMedGPT | Single LoRA   | 97.68±0.82 | 0.059±0.011 | 0.025±0.011 | 0.976±0.006 | 0.977±0.008 | 0.977±0.008 | 0.954±0.016 | 0.998±0.001
BioMedGPT | LoRA Ensemble | 98.40±0.54 | 0.052±0.000 | 0.021±0.002 | 0.980±0.000 | 0.984±0.005 | 0.984±0.005 | 0.968±0.011 | 0.998±0.000
BioMedGPT | Bayesian LoRA | 98.40±0.54 | 0.064±0.005 | 0.031±0.001 | 0.976±0.006 | 0.984±0.005 | 0.984±0.005 | 0.968±0.011 | 0.998±0.001

V. CONCLUSION AND DISCUSSION

In this study, we presented a novel uncertainty-aware adaptation of LLMs for predicting protein-protein interactions across multiple disease contexts. Leveraging fine-tuned LLaMA-3 and BioMedGPT models with LoRA ensembles and Bayesian LoRA, our approach consistently improved prediction accuracy, reliability, and robustness, as confirmed by comprehensive metrics such as negative log-likelihood and calibration error. LoRA ensembles excelled at achieving higher accuracy and reliable uncertainty estimates, while Bayesian LoRA provided well-calibrated predictions. Together, they demonstrated robustness in neurodegenerative, metabolic, and cancer-related PPI tasks. These findings underscore the benefits of incorporating principled uncertainty quantification into parameter-efficient fine-tuning for LLMs.

Future work will explore more advanced LLM uncertainty quantification methods and apply this methodology to broader biomedical applications. Potential directions include elucidating disease mechanisms by predicting disrupted protein interactions and, consequently, advancing computational protein target discovery for therapeutic design.

APPENDIX

A. Robustness & Predictive Uncertainty Evaluation Metrics

To assess model robustness and predictive uncertainty, we use Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE). NLL evaluates how confidently a model predicts the correct labels. Given a test dataset {(x_i, y_i)}_{i=1}^{N}, NLL is computed as:

NLL = −(1/N) Σ_{i=1}^{N} log P_θ(y_i).

A lower NLL indicates better confidence calibration, as overconfident incorrect predictions increase this value. On the other hand, ECE measures how well predicted confidence aligns with actual accuracy. Predictions are grouped into M bins based on confidence, and ECE is calculated as:

ECE = Σ_{m=1}^{M} (|B_m| / N) |acc(B_m) − conf(B_m)|.

Here, acc(B_m) and conf(B_m) represent the average accuracy and confidence within bin B_m, respectively:

acc(B_m) = (1/|B_m|) Σ_{i∈B_m} 1(ŷ_i = y_i),
conf(B_m) = (1/|B_m|) Σ_{i∈B_m} P(ŷ_i),

where |B_m| is the number of samples in bin m. Across all experiments, we use M = 15 confidence bins.
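A compact NumPy sketch of both metrics as defined above, using equal-width confidence bins (15 bins, matching our experimental setting); the function name is illustrative:

```python
import numpy as np


def nll_and_ece(probs, labels, n_bins=15):
    """probs: (N, C) predicted class probabilities; labels: (N,) integer labels."""
    n = len(labels)
    # Negative log-likelihood of the correct labels (small epsilon for numerical stability)
    nll = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

    conf = probs.max(axis=1)                 # confidence of the predicted class
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)

    # Expected calibration error over equal-width confidence bins
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return nll, ece
```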
REFERENCES

[1] Michael E Cusick, Niels Klitgord, Marc Vidal, and David E Hill, "Interactome: gateway into systems biology," Human Molecular Genetics, vol. 14, no. suppl 2, pp. R171–R181, 2005.
[2] Mileidy W. Gonzalez and Maricel G. Kann, "Chapter 4: Protein interactions and disease," PLoS Computational Biology, vol. 8, no. 12, pp. e1002819, 2012.
[3] Damian Szklarczyk, Rebecca Kirsch, Mikaela Koutrouli, Katerina Nastou, Farrokh Mehryary, Radja Hachilif, Annika L. Gable, Tao Fang, Nadezhda T. Doncheva, Sampo Pyysalo, et al., "The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest," Nucleic Acids Research, vol. 51, no. D1, pp. D638–D646, 2023.
[4] Rose Oughtred, Chris Stark, Bobby-Joe Breitkreutz, Jennifer Rust, Lorrie Boucher, Christie Chang, Nadine Kolas, Lara O'Donnell, Genie Leung, Rochelle McAdam, et al., "The BioGRID interaction database: 2019 update," Nucleic Acids Research, vol. 47, no. D1, pp. D529–D541, 2019.
[5] Sandra Orchard, Mais Ammari, Bruno Aranda, Lionel Breuza, Leonardo Briganti, Fiona Broackes-Carter, Nancy H. Campbell, Gayatri Chavali, Carol Chen, Noemi del Toro, et al., "The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases," Nucleic Acids Research, vol. 42, no. D1, pp. D358–D363, 2014.
[6] Martin Weigt, Robert A White, Hendrik Szurmant, James A Hoch, and Terence Hwa, "Identification of direct residue contacts in protein–protein interaction by message passing," Proceedings of the National Academy of Sciences, vol. 106, no. 1, pp. 67–72, 2009.
[7] Ruben Sanchez-Garcia, C.O.S. Sorzano, J. M. Carazo, and Joan Segura, "BIPSPI: A method for the prediction of partner-specific protein-protein interfaces," Bioinformatics, vol. 35, no. 3, pp. 470–477, 2019.
[8] Kriti Chopra, Bhawna Burdak, Kaushal Sharma, Ajit Kembhavi, Shekhar C. Mande, and Radha Chauhan, "Cornea: A pipeline to decrypt the inter-protein interfaces from amino acid sequence information," Biomolecules, vol. 10, no. 6, pp. 938, 2020.
[9] Neel Kewalramani, Andrew Emili, and Mark Crovella, "State-of-the-art computational methods to predict protein–protein interactions with high accuracy and coverage," Proteomics, vol. 23, no. 1-2, pp. e2200292, 2023.
[10] Farzaneh Soleymani, Eric Paquet, Herna Viktor, Wojtek Michalowski, and Davide Spinello, "Protein-protein interaction prediction with deep learning: A comprehensive review," Computational and Structural Biotechnology Journal, vol. 20, pp. 5316–5341, 2022.
[11] Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z Ren, and Anirudha Majumdar, "A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions," arXiv:2412.05563, 2024.
[12] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao, "Large language models: A survey," arXiv:2402.06196, 2024.
[13] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, "Language models are unsupervised multitask learners," OpenAI, 2019.
[14] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), 2020.
[15] OpenAI et al., "GPT-4 technical report," arXiv:2303.08774, 2023.
[16] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv:2307.09288, 2023.
[17] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al., "The Llama 3 herd of models," arXiv:2407.21783, 2024.
[18] Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang, "scGPT: toward building a foundation model for single-cell multi-omics using generative AI," Nature Methods, pp. 1–11, 2024.
[19] Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu, "BioGPT: Generative pre-trained transformer for biomedical text generation and mining," Briefings in Bioinformatics, vol. 23, no. 6, pp. bbac409, 2022.
[20] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al., "Large language models encode clinical knowledge," Nature, vol. 620, no. 7972, pp. 172–180, 2023.
[21] Xi Yang, Aokun Chen, Nima PourNejatian, Hoo Chang Shin, Kaleb E Smith, Christopher Parisien, Colin Compas, Cheryl Martin, Mona G Flores, Ying Zhang, et al., "GatorTron: A large clinical language model to unlock patient information from unstructured electronic health records," arXiv:2203.03540, 2022.
[22] Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller, "ChemCrow: Augmenting large-language models with chemistry tools," arXiv:2304.05376, 2023.
[23] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic, "Galactica: A large language model for science," arXiv:2211.09085, 2022.
[24] Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang, "FinGPT: Open-source financial large language models," arXiv:2306.06031, 2023.
[25] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann, "BloombergGPT: A large language model for finance," arXiv:2303.17564, 2023.
[26] David Thulke, Yingbo Gao, Petrus Pelser, Rein Brune, Rricha Jalota, Floris Fok, Michael Ramos, Ian van Wyk, Abdallah Nasir, Hayden Goldstein, et al., "ClimateGPT: Towards AI synthesizing interdisciplinary research on climate change," arXiv:2401.09646, 2024.
[27] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al., "Evaluating large language models trained on code," arXiv:2107.03374, 2021.
[28] Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, and Michal Linial, "ProteinBERT: a universal deep-learning model of protein sequence and function," Bioinformatics, vol. 38, no. 8, pp. 2102–2110, 2022.
[29] Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q. Tran, Jonathan Deaton, Marius Wiggert, et al., "Simulating 500 million years of evolution with a language model," Science, 2025.
[30] Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie, "BioMedGPT: Open multimodal generative pre-trained transformer for biomedicine," arXiv:2308.09442, 2023.
[31] Ryan Engel and Gilchan Park, "Evaluating large language models for predicting protein behavior under radiation exposure and disease conditions," in Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, 2024, pp. 427–439.
[32] Moxin Li, Wenjie Wang, Fuli Feng, Fengbin Zhu, Qifan Wang, and Tat-Seng Chua, "Think twice before trusting: Self-detection for large language models through comprehensive answer reflection," in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024.
[33] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi, "Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs," arXiv:2306.13063, 2023.
[34] Jixuan Leng, Chengsong Huang, Banghua Zhu, and Jiaxin Huang, "Taming overconfidence in LLMs: Reward calibration in RLHF," arXiv:2410.09724, 2024.
[35] Guande He, Jianfei Chen, and Jun Zhu, "Preserving pre-trained features helps calibrate fine-tuned language models," arXiv:2305.19249, 2023.
[36] R. M. Neal, Bayesian Learning for Neural Networks, New York: Springer Verlag, 1996.
[37] Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Gordon Wilson, "What are Bayesian neural network posteriors really like?," in International Conference on Machine Learning (ICML), 2021.
[38] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra, "Weight uncertainty in neural network," in International Conference on Machine Learning (ICML), 2015.
[39] Sanket Jantre, Shrijita Bhattacharya, and Tapabrata Maiti, "Layer adaptive node selection in Bayesian neural networks: Statistical guarantees and implementation details," Neural Networks, vol. 167, pp. 309–330, 2023.
[40] Sanket Jantre, Shrijita Bhattacharya, and Tapabrata Maiti, "Spike-and-slab shrinkage priors for structurally sparse Bayesian neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2024.
[41] Sanket Jantre, Nathan M Urban, Xiaoning Qian, and Byung-Jun Yoon, "Learning active subspaces for effective and scalable uncertainty quantification in deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2024.
[42] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in Advances in Neural Information Processing Systems (NeurIPS), 2017.
[43] Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson, "A simple baseline for Bayesian uncertainty in deep learning," in Advances in Neural Information Processing Systems (NeurIPS), 2019.
[44] Sanket Jantre, Shrijita Bhattacharya, Nathan M Urban, Byung-Jun Yoon, Tapabrata Maiti, Prasanna Balaprakash, and Sandeep Madireddy, "Sequential Bayesian neural subnetwork ensembles," arXiv:2206.00794, 2022.
[45] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations (ICLR), 2022.
[46] Adam X. Yang, Maxime Robeyns, Xi Wang, and Laurence Aitchison, "Bayesian low-rank adaptation for large language models," in International Conference on Learning Representations (ICLR), 2024.
[47] Yibin Wang, Haizhou Shi, Ligong Han, Dimitris N. Metaxas, and Hao Wang, "BLoB: Bayesian low-rank adaptation by backpropagation for large language models," in Advances in Neural Information Processing Systems (NeurIPS), 2024.
[48] Cristian Meo, Ksenia Sycheva, Anirudh Goyal, and Justin Dauwels, "Bayesian-LoRA: LoRA based parameter efficient fine-tuning using optimal quantization levels and rank values trough differentiable Bayesian gates," in 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024), 2024.
[49] Emre Onal, Klemens Flöge, Emma Caldwell, Arsen Sheverdin, and Vincent Fortuin, "Gaussian stochastic weight averaging for Bayesian low-rank adaptation of large language models," in 6th Symposium on Advances in Approximate Bayesian Inference - Non Archival Track, 2024.
[50] Xi Wang, Laurence Aitchison, and Maja Rudolph, "LoRA ensembles for large language model fine-tuning," arXiv:2310.00035, 2023.
[51] Oleksandr Balabanov and Hampus Linander, "Uncertainty quantification in fine-tuned LLMs using LoRA ensembles," arXiv:2402.12264, 2024.
[52] Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig, "Laplace redux-effortless Bayesian deep learning," in Advances in Neural Information Processing Systems (NeurIPS), 2021.
[53] Javier Antorán, David Janz, James U Allingham, Erik Daxberger, Riccardo Rb Barbano, Eric Nalisnick, and José Miguel Hernández-Lobato, "Adapting the linearised Laplace model evidence for modern deep learning," in International Conference on Machine Learning (ICML), 2022.
[54] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and B Bossan, "PEFT: State-of-the-art parameter-efficient fine-tuning methods," https://github.com/huggingface/peft, 2022.
[55] Fen Pei, Qingya Shi, Haotian Zhang, and Ivet Bahar, "Predicting protein–protein interactions using symmetric logistic matrix factorization," Journal of Chemical Information and Modeling, vol. 61, no. 4, pp. 1670–1682, 2021.
[56] Jiajun Qiu, Kui Chen, Chunlong Zhong, Sihao Zhu, and Xiao Ma, "Network-based protein-protein interaction prediction method maps perturbations of cancer interactome," PLoS Genetics, vol. 17, no. 11, pp. e1009869, 2021.
[57] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
