
ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing

Liuzhenghao Lv¹, Zongying Lin¹, Hao Li¹,², Yuyang Liu¹, Jiaxi Cui¹, Calvin Yu-Chian Chen¹, Li Yuan¹,²*, Yonghong Tian¹,²*
¹Peking University, China  ²Peng Cheng Laboratory, China

arXiv:2402.16445v2 [cs.CE] 16 Jul 2024

Abstract

Large Language Models (LLMs) have achieved remarkable performance in multiple Natural Language Processing (NLP) tasks. Under the premise that protein sequences constitute the protein language, Protein Language Models (PLMs) have advanced the field of protein engineering. However, as of now, unlike LLMs in NLP, PLMs cannot handle the protein understanding task and the protein generation task simultaneously in the Protein Language Processing (PLP) field. This prompts us to delineate the inherent limitations in current PLMs: (i) the lack of natural language capabilities, (ii) insufficient instruction understanding, and (iii) high training resource demands. To address these challenges, we introduce a training framework to transform any general LLM into a PLM capable of handling multiple PLP tasks. To improve training efficiency, we propose Protein Vocabulary Pruning (PVP) for general LLMs. We construct a multi-task instruction dataset containing 13 million samples with superfamily information, facilitating better modeling of protein sequence-function landscapes. Through these methods, we develop the ProLLaMA model, the first known PLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results in the unconditional protein sequence generation task. In the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. As for the protein understanding task, ProLLaMA achieves a 62% exact match rate in superfamily prediction. Codes, model weights, and datasets are available at https://github.com/PKU-YuanGroup/ProLLaMA and https://huggingface.co/GreatCaptainNemo.

1 Introduction

Large Language Models (LLMs), like GPT-x and LLaMA2 [1, 2], have achieved outstanding
performance in handling a wide range of Natural Language Processing (NLP) tasks [3–9], including
both Natural Language Generation (NLG) and Natural Language Understanding (NLU) tasks, in a
generative manner. This surge in LLMs has extended their applications beyond traditional contexts,
including their adoption in the challenging field of protein engineering [10–14].
Taking protein sequences as the protein language, researchers train Protein Language Models (PLMs)
on vast protein corpora [6, 11, 12]. PLMs have the potential to significantly advance protein engi-
neering, holding immense promise for biomedical and biotechnological innovations [10]. However,
this progress faces challenges, particularly in extending PLMs' capabilities to multi-task Protein Language
Processing (PLP).

Correspondence to: [email protected] and [email protected]

Preprint. Under review.


[Figure 1 diagram: general LLMs (GPT-X, LLaMA, etc.) handle text generation AND text understanding, whereas current PLMs (ProtGPT2, ESM, etc.) handle protein generation OR protein understanding; ProLLaMA covers unconditional protein generation, controllable protein generation, and protein property prediction.]
Figure 1: Left: LLMs can handle both generation and understanding tasks, whereas PLMs cannot.
This highlights the disparity in capabilities between the two. Right: Our ProLLaMA can handle
generation tasks (unconditional protein generation, controllable protein generation) and understanding
tasks (protein superfamily prediction), surpassing current PLMs.

Analogous to NLP, tasks related to protein language can be viewed as PLP [15, 16]. Consequently,
PLP tasks can also be divided into two categories: Protein Language Generation (PLG), such as
protein sequence generation, and Protein Language Understanding (PLU), such as protein property
prediction. However, current PLMs tend to focus either exclusively on PLG [17–19] or PLU [20–24],
rather than being proficient in both simultaneously, as LLMs are in NLP.
These limitations prompt the need for innovative solutions to unleash the full potential of PLMs.
Developing a multi-task PLM would be highly beneficial for protein engineering and protein fitness
landscape modeling [25–28], but three main challenges must be considered:
(i) Necessity of Natural Language: Protein language is not fully sufficient for PLP tasks, meaning
it cannot fully represent all components of a task (the task instruction, the input, and the expected
output) [29, 30]. It requires a language beyond protein language (typically, natural language) for
representation, which current PLMs lack.
(ii) Instruction Following: To possess multi-task capabilities, models must execute tasks following
user instructions [31–33]. However, current PLMs are unable to follow instructions.
(iii) Training Resource Consumption: Substantial training resources are needed for models to learn
natural language, protein language, and user instructions [34], which can sometimes be unaffordable.
To address the challenges, we construct an instruction dataset that contains approximately 13 million
samples and encompasses both PLG and PLU tasks. We propose a two-stage training framework
to achieve a PLM for multi-task PLP. In the first stage, we leverage a pre-trained general LLM like
LLaMA2 to continually learn the protein language while maintaining the natural language knowledge.
In the second stage, the model is further trained on the multi-task instruction dataset. Additionally,
we propose Protein Vocabulary Pruning (PVP) for general LLMs to significantly improve training
efficiency. Furthermore, during both stages, we adopt Low-Rank Adaptation (LoRA) [35], which
prevents catastrophic forgetting in the first stage and reduces training costs in both stages.
Using these methods, we develop ProLLaMA, a model capable of multi-task PLP, distinguishing it from all other PLMs. Through a series of experiments, we demonstrate the multi-task capabilities of our ProLLaMA. Specifically, as for PLG, in unconditional protein generation, ProLLaMA outperforms current PLMs on common metrics such as pLDDT and TM-score. In controllable protein generation, based on a user-provided textual description, ProLLaMA generates novel proteins from scratch with desired functionalities, such as proteins of the SAM-MT superfamily. As for PLU, in protein superfamily prediction, ProLLaMA achieves a 62% exact match rate on the test dataset and obtains an F1-score above 0.9 in many specific categories. In summary, the contributions of our research are as follows:
• We propose a training framework that enables any general LLM to be trained as a proficient model
for multi-task PLP, including both PLG and PLU tasks.
• We propose Protein Vocabulary Pruning and are the first to apply Low-Rank Adaptation in the
training of large PLMs, which improves the efficiency of protein learning.
• We construct an instruction dataset that contains 13 million samples and over 11,000 kinds of
superfamily annotations, potentially facilitating better modeling of sequence-function landscapes.
• Experiments show that our ProLLaMA not only handles multiple PLP tasks but also achieves
state-of-the-art results in protein generation tasks.

2 Preliminaries

[Figure 2 diagram: (A) UniRef50 (61M sequences) is filtered into the protein language dataset (53M), and InterPro's protein2ipr annotations are retrieved per UniRef50 protein to build the instruction dataset (13M), with instances such as Instruction: [Generate by superfamily], Input: Superfamily=<Ankyrin repeat-containing domain superfamily>, Output: Seq=<MAPG...PVRKR>, and the reverse [Determine superfamily] task; (B) Stage 1 continued pre-training turns LLaMA2 into ProLLaMA, followed by Stage 2 instruction tuning; (C) the vocabulary and the Embed/Head layers are pruned before training and recovered afterwards using the reserved indices.]
Figure 2: (A) Overview of the dataset construction. The protein language dataset contains 53
million samples, which is used for training in Stage 1. The instruction dataset contains 13 million
instances with 11,268 unique superfamily annotations, which is used for training in Stage 2. (B)
Overview of the training framework. Stage 1: The pre-trained LLaMA2 learns the protein language,
resulting in ProLLaMA. Stage 2: ProLLaMA learns to perform multiple tasks by instruction tuning.
(C) Overview of protein vocabulary pruning. The vocabulary, Embed layer and Head layer are
pruned based on characteristics of protein datasets, aiming for higher training efficiency.
Necessity of Natural Language. As aforementioned, given the similarities between protein sequences
and natural language, tasks related to protein sequences can be considered as PLP, analogous to
NLP. However, we observe a fundamental difference between protein language and natural language:
natural language is complete for NLP tasks, whereas protein language is not complete for PLP tasks.
For example, in the sentiment analysis task, instructions, inputs, and outputs are straightforwardly
expressed in natural language, such as “Analyze the sentiment: I am happy” yielding “The sentiment
is positive.” However, in PLP tasks like protein property prediction, instructions like “Predict this
protein’s property: MAFCF...FEV” cannot be fully conveyed in protein language alone, necessitating
assistance from a language beyond protein language, in this case, natural language, for representation.
Therefore, multi-task PLMs must possess a certain level of natural language ability, especially as
more textual descriptions of proteins become available [29].
Low-Rank Adaptation (LoRA) [35] is a parameter-efficient technique for fine-tuning LLMs. Due
to the immense parameter size of LLMs, full-parameter fine-tuning can sometimes be impractical.
LoRA circumvents this by freezing the original parameters of LLMs and introducing additional
trainable low-rank adapters. It achieves fine-tuning with a significantly smaller number of trainable
parameters, yielding results comparable to full-parameter fine-tuning. LoRA prevents catastrophic
forgetting of the original knowledge, as the newly learned knowledge has a lower rank than the
original knowledge. The theoretical details of LoRA are provided in Appendix A.2.

3 Methods
In Section 3.1 and Figure 2(A), we show how to construct the protein language dataset and the
instruction dataset. In Section 3.2 and Figure 2(C), we show how Protein Vocabulary Pruning (PVP)
works. In Section 3.3, we show how the LLaMA2 model learns protein language on the protein
language dataset through continued pre-training, resulting in ProLLaMA. In Section 3.4, we show
the integration of various tasks into ProLLaMA through instruction tuning on the instruction dataset.
The overview of the training framework is shown in Figure 2(B). The overview of the ProLLaMA
model is shown in Figure 3.

3.1 Dataset Construction

The protein language dataset is utilized in the first training stage to enable LLaMA2 to grasp
the language of proteins. Specifically, the dataset is sourced from UniRef50_2023_03 [36] on the
UniProt website. We eliminate the descriptive parts of UniRef50, retaining only the pure protein
sequences. Furthermore, we filter UniRef50 to ensure that the protein sequences consist only of the 20 standard amino acids. We also retain only sequences with a length of less than 512, aligning

[Figure 3 diagram: (A) the ProLLaMA pipeline: Tokenizer → Embed → Decoder Blocks → Head, trained with causal language modeling; (B) the decoder structure, with trainable LoRA adapters (A, B) attached to frozen attention and MLP weights, and protein vocabulary pruning applied to the Embed/Head layers.]

Figure 3: The overview of the ProLLaMA model. We add low-rank adapters (LoRA) to certain
weights and apply protein vocabulary pruning (PVP). We freeze original parameters, focusing solely
on training LoRA (Embed and Head are also involved in the first training stage).

with ProGen [18]. Given that the lengths of protein sequences follow a long-tail distribution, the
sequences that are deleted constitute only a small portion of the total dataset. To preprocess the
protein sequences, we employ a specific prefix “Seq=<” and suffix “>”. This standardized format aids
LLaMA2 in distinguishing the new protein language from its existing natural language knowledge,
thereby reducing confusion. The original UniRef50 contains 60,952,894 sequences, while after the
above series of processing, our dataset comprises 52,807,283 protein sequences, with 90% for training
and 10% reserved for testing.
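As an illustration, the filtering and formatting described above could be sketched as follows in Python; the file names and the FASTA parsing here are assumptions for illustration, not the authors' released preprocessing code.

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MAX_LEN = 512                               # keep sequences shorter than 512 residues

def iter_fasta(path):
    """Yield raw sequences from a FASTA file, dropping descriptive header lines."""
    seq = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            elif line:
                seq.append(line)
    if seq:
        yield "".join(seq)

def preprocess(path):
    """Keep standard-amino-acid sequences below the length cutoff, wrapped as Seq=<...>."""
    for s in iter_fasta(path):
        if len(s) < MAX_LEN and set(s) <= STANDARD_AA:
            yield "Seq=<" + s + ">"

if __name__ == "__main__":
    # Hypothetical input/output paths for the Stage 1 corpus.
    with open("protein_language_dataset.txt", "w") as out:
        for sample in preprocess("uniref50.fasta"):
            out.write(sample + "\n")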
The instruction dataset is utilized in the second training stage to enable ProLLaMA to perform
various tasks. We first obtain the protein2ipr database from InterPro [37], which includes all proteins
from UniProtKB along with their corresponding InterPro annotation information. Subsequently, we
iterate through each protein’s rep_id in UniRef50 to retrieve the corresponding annotation information
from protein2ipr. This retrieval process is implemented using a distributed Redis database, and only
proteins with lengths less than 256 participate in the retrieval to enhance efficiency. We utilize regular expressions to extract the superfamily annotation from the full annotation. In the end, we obtain 6,350,106 data instances, each of which contains one protein sequence and its superfamily annotation. The number of unique superfamily annotations is 11,268.
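A minimal sketch of this retrieval step is shown below, assuming the protein2ipr annotations have already been loaded into Redis under a hypothetical key scheme; the key format and the superfamily-matching regular expression are illustrative only.

import re
import redis

# Illustrative pattern: InterPro superfamily annotations typically end in "superfamily".
SUPERFAMILY_RE = re.compile(r"([^;|]*superfamily)", re.IGNORECASE)

client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def lookup_superfamily(rep_id: str):
    """Fetch the InterPro annotation for one UniRef50 representative id and extract
    the superfamily part, if present."""
    annotation = client.get("protein2ipr:" + rep_id)  # hypothetical key scheme
    if not annotation:
        return None
    match = SUPERFAMILY_RE.search(annotation)
    return match.group(1).strip() if match else None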
Then, we process the obtained data into a multi-task instruction dataset following the Alpaca for-
mat [38], where each instance comprises three parts: instruction, input, and output. The instruction
specifies the task type. We design two tasks: generating proteins based on superfamily and determin-
ing the superfamily of the given protein. For the former task, the input is the superfamily annotation,
and the output is the expected protein sequence. The latter task is the reverse. We select these two tasks because they represent PLG and PLU, respectively. Additionally, modeling the relationships between
sequences and superfamilies is crucial for uncovering protein functions and evolutionary insights [39].
In the end, the instruction dataset comprises 12,700,212 (6,350,106 * 2) instances, with 90% for
training and the rest reserved for testing.
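To make the format concrete, the sketch below builds the two Alpaca-style instances for one (sequence, superfamily) pair; the instruction strings follow the examples in Figure 2(A), while the function name and the toy data are hypothetical.

import json

def make_instances(sequence: str, superfamily: str):
    """Build the generation-task and understanding-task instances for one protein."""
    generation = {
        "instruction": "[Generate by superfamily]",
        "input": "Superfamily=<" + superfamily + ">",
        "output": "Seq=<" + sequence + ">",
    }
    understanding = {
        "instruction": "[Determine superfamily]",
        "input": "Seq=<" + sequence + ">",
        "output": "Superfamily=<" + superfamily + ">",
    }
    return [generation, understanding]

# Toy example (truncated sequence, hypothetical data).
pairs = [("MAPGPVRKR", "Ankyrin repeat-containing domain superfamily")]
dataset = [inst for seq, sf in pairs for inst in make_instances(seq, sf)]
print(json.dumps(dataset, indent=2))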

3.2 Protein Vocabulary Pruning

It is known that LLMs are equipped with large vocabularies in their tokenizers, which is beneficial for covering multilingual corpora and shortening the length of input token sequences. However, a larger vocabulary also implies slower tokenization of raw text, along with increased parameters in the model's Embed layer and Head layer, as discussed in detail in Appendix A.9. When considering only the protein language, such large vocabularies are clearly unnecessary. Therefore, we propose a method called Protein Vocabulary Pruning (PVP). PVP helps ProLLaMA achieve almost the same performance with higher training efficiency, as shown in Appendix A.3.
PVP is rule-based. Specifically, we iterate through each token in the original vocabulary and retain it if it meets our predefined grammatical rules (e.g., consisting only of the 20 uppercase English letters representing standard amino acids). We save these tokens and their indices. Using the saved tokens, we construct a smaller vocabulary. We then use the indices to extract the corresponding vectors from the Embed and Head layers to construct smaller ones. Thus, we can train LLMs using a reduced vocabulary and fewer parameters.

Table 1: Effects of PVP. The vocabulary size, the number of parameters in the Embed and Head layers, and the number of parameters involved in training are listed.

Training Stage   Vocab    Embed and Head   Training
I (w/o PVP)      32,000   262M             582M
I (w/ PVP)       1,045    8.56M            328M
II (w/o PVP)     32,000   262M             160M
II (w/ PVP)      25,466   209M             160M
PVP is employed before training. After training, the vocabulary can be restored to its original size by reinstating the entries from the original vocabulary. This process is also applicable to the
Embed and Head layers, restoring them to their original dimensions. This ensures that the architecture
of LLMs remains entirely consistent both before and after training, facilitating potential training on
broader corpora in the future. Naturally, if subsequent training is not a consideration, omitting the
recovery operation is permissible.
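A rough sketch of the pruning step is given below, assuming a Hugging Face-style token-to-id vocabulary and PyTorch weight matrices for the Embed and Head layers; the rule shown only covers pure amino-acid tokens plus a few special markers, so it approximates the Stage 1 setting rather than reproducing the exact implementation.

import torch

def protein_vocab_prune(vocab, embed_weight, head_weight,
                        extra_tokens=("<s>", "</s>", "<unk>", "Seq", "=", "<", ">")):
    """Keep tokens composed of the 20 standard amino acids (plus a few special tokens),
    then slice the Embed and Head weight matrices down to the kept rows."""
    standard_aa = set("ACDEFGHIKLMNPQRSTVWY")
    kept = []  # list of (old_index, token), ordered by old index
    for token, old_idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        if token in extra_tokens or (token and set(token) <= standard_aa):
            kept.append((old_idx, token))

    new_vocab = {token: new_idx for new_idx, (_, token) in enumerate(kept)}
    old_indices = torch.tensor([old_idx for old_idx, _ in kept])
    # Embed/Head rows are indexed by token id, so selecting the saved indices
    # yields the smaller layers used during training.
    new_embed = embed_weight[old_indices].clone()
    new_head = head_weight[old_indices].clone()
    return new_vocab, old_indices, new_embed, new_head

# After training, the trained rows can be scattered back into copies of the original
# matrices at old_indices, restoring the original vocabulary and layer sizes.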
We use PVP in both the first and second training stages of ProLLaMA, and its effects are shown
in Table 1. It is evident that we achieve a significant compression rate in the first stage, primarily
because the dataset in the first stage contains only protein sequences, which can be summarized by a
simple rule. In contrast, the dataset in the second stage also includes English. More details about PVP are provided in Appendix A.3.

3.3 Learning Protein Language

As mentioned in Section 1, current PLMs lack natural language abilities, which hinders multi-task
capabilities. To solve this problem, we propose leveraging LLaMA2 to perform continued pre-training
on protein language. This approach is analogous to humans learning a foreign language, where the
model learns protein language while retaining its original natural language abilities.
We add Low-Rank Adapters (LoRA) into LLaMA2. To be specific, in each decoder block, we
add LoRA to certain weights, including W_q, W_k, W_v, W_o, W_up, W_gate, and W_down. The original
parameters are frozen, enabling only LoRA to be trained. Due to the significant differences between
protein language and natural language, we choose a relatively high rank for LoRA, which helps the
model learn protein sequences better and prevents under-fitting. We also include both the Embed and
Head layers in training. This is based on the premise that a token may have different meanings in
protein sequences and natural languages, requiring distinct embeddings for the same token.
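With the Hugging Face PEFT library, this setup might be expressed as in the sketch below; the target module names follow LLaMA2's standard naming, the rank of 128 matches Section 4.1, and the remaining hyperparameters (alpha, dropout, model identifier) are assumptions.

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=128,                      # relatively high rank for learning a new "language"
    lora_alpha=256,             # assumed scaling factor
    lora_dropout=0.05,          # assumed
    # W_q, W_k, W_v, W_o and the MLP's W_up, W_gate, W_down in each decoder block
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "gate_proj", "down_proj"],
    # The Embed and Head layers are also trained in the first stage
    modules_to_save=["embed_tokens", "lm_head"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable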
We trained the model using causal language modeling on the protein language dataset, resulting
in ProLLaMA. Details on causal language modeling are provided in Appendix A.1. Benefiting
from PVP and LoRA, as shown in Table 7, we train only about 5% of the parameters, in contrast to
full-parameter training, which significantly reduces training costs. Additionally, as the remaining
parameters are not involved in training, the inherent natural language abilities are preserved.
In summary, we have developed ProLLaMA, a model that comprehends both protein language
and natural language, with reduced training costs. Consequently, we have addressed two problems
mentioned in Section 1: the lack of natural language abilities and excessive training costs.

3.4 Performing Multiple Tasks

As aforementioned, current PLMs are unable to perform multiple tasks based on user instructions. To
solve this problem, we perform instruction tuning on ProLLaMA obtained from the previous section.
We train ProLLaMA on the instruction dataset mentioned in Section 3.1:
\mathcal{L}(\Theta) = \mathbb{E}_{x,\,u \sim \mathcal{D}}\big[-\log p(x \mid u;\, \Theta)\big] \tag{1}

Table 2: Comparison of proteins generated by different models. Our ProLLaMA achieves the best
performance on pLDDT, TM-score, and RMSD metrics, and is second-best in SC-Perp, demonstrating
ProLLaMA excels in de novo protein design. *: We list ProGen2 and ProGen as the same item,
referring to Appendix A.6 for explanation. AE: Auto-Encoder. AR: Auto-Regressive.

Type        Method                    pLDDT↑        SC-Perp↓    AFDB TM-score↑  AFDB RMSD↓  PDB TM-score↑  PDB RMSD↓
CNN         CARP [40]                 34.40±14.43   4.05±0.52   0.28            19.38       0.38           8.95
CNN         LRAR [40]                 49.13±15.50   3.59±0.54   0.40            14.47       0.43           9.47
PLM (AE)    ESM-1b [22]               59.57±15.36   3.47±0.68   0.34            20.88       0.44           8.59
PLM (AE)    ESM-2 [24]                51.16±15.52   3.58±0.69   0.20            35.70       0.41           9.57
Diffusion   EvoDiff [40]              44.29±14.51   3.71±0.52   0.32            21.02       0.41           10.11
PLM (AR)    ProtGPT2 [17]             56.32±16.05   3.27±0.59   0.44            12.60       0.43           9.19
PLM (AR)    ProGen2/ProGen* [19, 18]  61.07±18.45   2.90±0.71   0.43            15.52       0.44           11.02
PLM (AR)    ProLLaMA (ours)           66.49±12.61   3.10±0.65   0.49            9.50        0.48           7.63

Table 3: Controllable generation of ProLLaMA. SAM-MT, TPHD, Trx, and CheY are four superfamilies. High values of TM-score and H-Prob indicate that the generated proteins match the desired superfamily. "xx% seen" denotes that xx% of the residues of a protein sequence belonging to the related superfamily are provided to ESM-1b for further generation. Even so, ProLLaMA performs best.
                     SAM-MT              TPHD                Trx                 CheY
Method               TM-score↑  H-Prob↑  TM-score↑  H-Prob↑  TM-score↑  H-Prob↑  TM-score↑  H-Prob↑
ESM-1b               0.58       0.37%    0.55       0.48%    0.61       0.37%    0.63       0.27%
ESM-2                0.52       0.26%    0.51       0.25%    0.53       0.30%    0.57       0.18%
EvoDiff              0.46       1.17%    0.42       1.80%    0.42       1.10%    0.46       1.43%
ProtGPT2             0.45       3.86%    0.43       4.62%    0.44       2.53%    0.45       4.86%
ProGen2/ProGen*      0.44       1.90%    0.45       2.49%    0.43       2.44%    0.44       2.13%
ESM-1b (25% seen)    0.43       0.64%    0.40       4.13%    0.42       1.01%    0.49       4.47%
ESM-1b (50% seen)    0.59       61.08%   0.63       66.21%   0.64       62.75%   0.73       78.00%
ESM-1b (75% seen)    0.67       88.51%   0.73       90.23%   0.75       93.92%   0.78       96.93%
ProLLaMA (ours)      0.71       98.13%   0.82       100.00%  0.81       99.96%   0.93       100.00%

Here, Θ denotes the parameters to be optimized, L the loss function, and D the dataset. u denotes
the instruction and the input of one instance. For brevity, when we refer to the instruction in the
following text, it includes the input as well. x = {x_0, x_1, ..., x_{n-1}} denotes the output, where x_i is the i-th token of the output.
Since causal language modeling is employed, we need to combine Equation 1 with Equation 3:

\mathcal{L}(\Theta) = \mathbb{E}_{x \sim \mathcal{D}}\Big[-\sum_{i}\log p(x_i \mid u, x_0, x_1, \ldots, x_{i-1};\, \Theta)\Big] \tag{2}

Equation 2 is the optimization objective for instruction tuning of ProLLaMA. u is not involved in the loss calculation, whereas x is. This is because the latter is the output part, whose generation quality is crucial, while the former, u, only needs to be understood by the model. In the instruction tuning stage, we exclusively train LoRA, at a lower rank than specified in Section 3.3.
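One common way to realize this masking with a Hugging Face tokenizer is sketched below, relying on the convention that label positions set to -100 are ignored by the cross-entropy loss; it illustrates Equation 2 and is not the authors' training code.

def build_example(tokenizer, instruction_and_input, output, max_len=512):
    """Tokenize one instance so that only the output tokens x contribute to the loss."""
    prompt_ids = tokenizer(instruction_and_input, add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(output, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]

    input_ids = prompt_ids + output_ids
    # -100 masks the instruction/input part u, so the loss sums only over the x_i tokens.
    labels = [-100] * len(prompt_ids) + output_ids

    return {"input_ids": input_ids[:max_len], "labels": labels[:max_len]}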
In summary, through instruction tuning, we have made ProLLaMA capable of following instructions
and performing multiple tasks. Consequently, we have addressed the problems mentioned in Section 1:
the lack of instruction following and the lack of multi-task capabilities.
In addition, the tasks that ProLLaMA can perform depend on the tasks included in the instruction
dataset. Appendix A.7 shows that by training on an additional instruction dataset, ProLLaMA also
performs well in predicting protein solubility.

4 Experiments
We introduce the experiment setup in Section 4.1. We then evaluate the unconditional protein generation task in Section 4.2, the controllable protein generation task in Section 4.3, and the protein property prediction task in Section 4.4.

4.1 Experiment Setup

Training Settings: For continued pre-training, the LoRA rank is set to 128, employing the AdamW
optimizer alongside a cosine annealing scheduler with warm-up. The peak learning rate is 5e-5, with a total of one training epoch. Training takes six days on eight A6000 GPUs using FlashAttention-2 [41]. For instruction tuning, the LoRA rank is set to 64 with two training epochs, and all other settings remain consistent with the continued pre-training setup. It takes five days on eight A6000 GPUs. More
training details can be found in Appendix A.4.
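Expressed with the Hugging Face Trainer API, these choices roughly correspond to a configuration like the following sketch; batch size, gradient accumulation, warm-up ratio, and precision are assumptions not stated in the paper.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="prollama_stage1",     # hypothetical output path
    num_train_epochs=1,               # one epoch of continued pre-training
    learning_rate=5e-5,               # peak learning rate
    lr_scheduler_type="cosine",       # cosine annealing
    warmup_ratio=0.03,                # warm-up (assumed value)
    per_device_train_batch_size=8,    # assumed; chosen to fit A6000 memory
    gradient_accumulation_steps=4,    # assumed
    bf16=True,                        # mixed precision (assumed)
    logging_steps=100,
    save_steps=5000,
)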
Evaluation Settings: Unconditional protein generation involves generating protein sequences without
specific instructions. Controllable protein generation involves generating desired protein sequences
based on instructions that specify the required superfamily. Property prediction involves predicting
protein superfamily based on instructions, which include the protein sequences to be predicted. All
evaluations are conducted on one GPU with 24GB of VRAM, with model inference occupying
approximately 13GB of VRAM.
Evaluation Metrics: We use the following metrics to evaluate the generated protein sequences.
The pLDDT [42] is used to measure whether sequences are structurally plausible. Self-Consistency
Perplexity (SC-Perp) [40] serves as an additional metric of plausible structures since pLDDT falls
short in dealing with intrinsically disordered regions (IDRs) [43]. TM-score [44] reflects the structural
similarity between generated sequences and known ones in AFDB [45] and PDB [46]. RMSD also
reflects the structural similarity from the perspective of atomic distance. Homologous probability (H-
Prob) reflects the probability that the generated protein is homologous to a known one. Seq-Ident
reflects the sequence similarity between generated sequences and known ones. More details are
shown in Appendix A.5.

4.2 Unconditional Protein Generation

We compare our model with other models in protein sequence generation. These models cover a
variety of types, with parameter numbers shown in Table 7. Table 2 shows the results. Our ProLLaMA
is best on pLDDT, TM-score, and RMSD, and second-best on SC-Perp. This indicates that
ProLLaMA, through its training on protein sequence data, can generate structurally plausible proteins.
Notably, ProLLaMA-generated proteins exhibit a mean and standard deviation for pLDDT and
SC-Perp of 66.49±12.61 and 3.10±0.65, respectively. These values are comparable to those of natural
proteins as reported in [40], which are 68.25±17.85 and 3.09±0.63, respectively. It is noted that we
list ProGen2 and ProGen as the same item in the table; see Appendix A.6 for an explanation.
De novo design of long and structurally plausible protein sequences is highly challenging [17], yet
our ProLLaMA performs well. As shown in Figure 4(A)(B)(C), when the length is more than 300,
ProLLaMA performs the best on all three metrics. Although ProGen2's performance is better for short sequences (length ≤ 200), it decreases as the length increases. This indicates that ProLLaMA is
able to capture long-range dependencies between amino acids while other models struggle.

4.3 Controllable Protein Generation

We utilize four superfamily descriptions as instructions: the S-adenosyl-L-methionine-dependent methyltransferase superfamily (SAM-MT), the Tetratricopeptide-like helical domain
superfamily (TPHD), the Thioredoxin-like superfamily (Trx), and the CheY-like superfamily (CheY).
For each superfamily, ProLLaMA generates 100 protein sequences. We randomly select 100 natural
proteins from each of the four superfamilies as benchmarks for comparison. We employ Foldseek[47]
to compare generated proteins with natural ones.
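As an illustration of how such an instruction-driven generation call might look with the released weights on Hugging Face, the sketch below uses a standard transformers causal-LM interface; the model identifier, the prompt template (adapted from Figure 2(A)), and the sampling parameters are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GreatCaptainNemo/ProLLaMA"   # hypothetical repository name under the released account
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Prompt adapted from the instruction format in Figure 2(A); the exact template may differ.
prompt = "[Generate by superfamily] Superfamily=<Thioredoxin-like superfamily> Seq=<"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))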
The results shown in Table 3 demonstrate that ProLLaMA can generate desired protein sequences
based on instructions that specify the required functionalities, confirming the capability for control-
lable generation. For SAM-MT, the TM-scores of our generated sequences exceed 0.7; for TPHD

[Figure 4 panels: (A) pLDDT, (B) SC-Perp, and (C) TM-score versus sequence length for ProLLaMA, ProGen2, ProtGPT2, LRAR, and ESM1b; (D-F) pLDDT, SC-Perp, and TM-score (normalized by percentage, 0-100) of ProLLaMA-generated versus natural proteins for CheY, TPHD, Trx, and SAM-MT.]

Figure 4: Contrast experiments of ProLLaMA. (A-C): Compared to other models, ProLLaMA maintains a high quality of generated proteins as their length increases. (D-F): According to the three indicators normalized by percentage, proteins generated by ProLLaMA are roughly comparable to natural proteins in the four superfamilies of CheY, TPHD, Trx, and SAM-MT (SAM).

and Trx, they are over 0.8; and for CheY, they surpass 0.9. The high TM-score indicates that the
structures of the generated proteins closely resemble those of natural proteins in the same superfamily,
implying functional similarity. For SAM-MT, TPHD, Trx, and CheY, all of the H-prob values are
close to or even equal to 100%, indicating that the generated proteins are homologous to natural
proteins and belong to the same superfamily. In summary, these provide strong evidence that the
protein generation of ProLLaMA is controllable under instructions. In contrast, other models exhibit
low TM-score and very low H-Prob due to their uncontrollable generation.
We also conduct another experiment. We provide residues of natural proteins belonging to the
superfamily to ESM-1b, allowing ESM-1b to complete these sequences, rather than generating them
from scratch as done previously. As seen in Table 3, with more residues provided, the proteins
generated by ESM-1b increasingly exhibit the characteristics of the corresponding superfamily. Even
so, it still does not surpass ProLLaMA. Higher TM-score and H-prob indicate that, under instruction
control, ProLLaMA’s de novo protein generation even outperforms the non-de novo generation by
ESM-1b, which is provided with 75% of the residues. These results indicate that ProLLaMA can
effectively capture structural and evolutionary relationships (as reflected by TM-score and H-prob)
solely through learning from text and sequences.
Additionally, using natural proteins as a benchmark, we assess pLDDT, SC-Perp, and TM-score of
proteins generated by ProLLaMA. Figure 4(D) shows that for CheY, TPHD, Trx, and SAM-MT,
the average pLDDT of generated proteins is only 19.0%, 1.41%, 10.9%, and 10.3% lower than
that of natural proteins, respectively. Figure 4(E) shows that for TPHD and SAM-MT, the average
SC-Perp is 5.66% and 5.06% lower; for Trx and CheY, the average SC-Perp is 7.58% and 14.92%
higher. Figure 4(F) visualizes the TM-score, with scores near the maximum indicating a high degree
of structural similarity. These findings indicate that proteins generated by ProLLaMA are roughly
comparable to their natural counterparts in the same superfamily.
In Figure 5, we visualize four examples of proteins generated by ProLLaMA (colored in blue)
alongside the most structurally similar natural proteins from PDB (colored in yellow). The significant
overlap in 3D structures and the high TM-score confirm structural similarity. Low Seq-ident indicates
sequence diversity. In summary, through controllable protein generation, ProLLaMA is capable of
generating desired proteins with structures similar to natural proteins, yet with novel sequences.

4.4 Property Prediction

We use the test dataset to evaluate whether ProLLaMA can predict the superfamily to which a given
protein belongs. The test dataset consists of 10,000 samples. Although ProLLaMA performs a
classification task here, it is more complex than typical ones. The key difference is that typical
classification tasks require models to output a fixed label, often in one-hot encoding. In contrast,

Table 4: Protein property prediction. Results for ten selected superfamilies in the test dataset.

Metric      OBFD  UPF0145  NACD  U3S   CCHC  Kazal  SAM-MT  TPHD  Trx   CheY
Precision   0.33  1.00     1.00  1.00  0.75  1.00   0.77    0.94  0.86  0.93
Recall      1.00  1.00     1.00  1.00  0.95  1.00   0.94    0.91  0.94  1.00
F1-score    0.50  1.00     1.00  1.00  0.86  0.98   0.84    0.93  0.90  0.96

[Figure 5 annotations: SAM-MT (vs. PDB 3dh0_A): TM-score 0.775, H-prob 96%, Seq-ident 16.2%; TPHD (vs. PDB 2vq2_A): TM-score 0.833, H-prob 98%, Seq-ident 21.2%; Trx (vs. PDB 3gnj_A): TM-score 0.782, H-prob 94%, Seq-ident 21.0%; CheY (vs. PDB 2a9p_A): TM-score 0.922, H-prob 100%, Seq-ident 33.0%.]


Figure 5: Protein visualization. Four proteins of controllable generation by ProLLaMA using
SAM-MT, TPHD, Trx, and CheY as instructions. Blue is generated proteins, and yellow is natural.
They are similar in structure but different in sequence.
ProLLaMA outputs free-form text. The advantage of the latter lies in its flexibility, such as the ability to easily handle situations where a sample belongs to multiple categories simultaneously. However, this increases task difficulty due to the much larger number of potential classification categories.
Even so, as shown in Table 5, ProLLaMA generates superfamily descriptions that exactly match the real descriptions in 62% of the test dataset. In addition, Table 4 illustrates ProLLaMA's performance on ten specific superfamilies. The recall value exceeds 0.9 in all ten superfamilies and the F1-score exceeds 0.8 in nine superfamilies. The calculation formulas for these metrics can be found in Appendix A.5. Additional experiments on protein solubility prediction using ProLLaMA can be found in Appendix A.7.

Table 5: Protein property prediction. Results over all superfamilies in the test dataset.

Metric              Value
Exact Match         0.62
Jaccard Similarity  0.67
Precision           0.63
Recall              0.72
F1-score            0.67
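For reference, a simple way to compute the exact-match rate and per-superfamily precision, recall, and F1 from predicted and true superfamily strings is sketched below; the exact formulas used by the paper are in Appendix A.5, so this is only an illustrative reimplementation with toy data.

def exact_match(preds, golds):
    """Fraction of predictions that match the reference description exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def per_class_prf(preds, golds, target_class):
    """Precision, recall, and F1 for one superfamily, treated as the positive class."""
    tp = sum(p == target_class and g == target_class for p, g in zip(preds, golds))
    fp = sum(p == target_class and g != target_class for p, g in zip(preds, golds))
    fn = sum(p != target_class and g == target_class for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example (hypothetical predictions).
golds = ["CheY-like superfamily", "Thioredoxin-like superfamily", "CheY-like superfamily"]
preds = ["CheY-like superfamily", "CheY-like superfamily", "CheY-like superfamily"]
print(exact_match(preds, golds))                        # 0.666...
print(per_class_prf(preds, golds, "CheY-like superfamily"))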

5 Related Work
Protein Language Models. Recognizing the similarity between natural language sequences and
protein sequences, many methods from NLP have been applied to protein sequence data [48–51]. This
has led to the development of PLMs, which are broadly categorized into two types [12, 52]: Auto-
Regressive (AR) PLMs and Auto-Encoder (AE) PLMs. AR PLMs adopt the decoder-only architecture
and Causal Language Modeling (CLM) [53, 54]. They primarily concentrate on PLG [55, 17–19],
with a minority also focusing on fitness prediction [10]. AE PLMs adopt the encoder-only architecture
and Masked Language Modeling (MLM) [20–24]. They excel in PLU, with the learned protein
representations being applied to downstream predictive tasks [29]. However, they face challenges in
de novo protein generation. Our ProLLaMA is capable of multitasking, excelling in tasks that both of
the above types specialize in, surpassing existing PLMs. This multitasking capability is achieved
through instruction following, making it user-friendly. We have also noticed the recent emergence of
scientific LLMs [13, 14, 56–58]. In Appendix A.10, we discuss the differences between these models
and ProLLaMA.
Training LLMs. There is a general training framework for LLMs [4], where LLMs are first pre-
trained on large-scale corpora [59] and then undergo instruction tuning to follow user instructions [60].
However, we believe that our proposed two-stage training framework differs in motivation and insights,
as discussed in Appendix A.11. Considering the vast number of parameters in LLMs, various
parameter-efficient techniques have been proposed to accelerate training and conserve memory [61,
62, 35, 63, 64], including LoRA. For the same reason, some vocabulary pruning methods have been
studied [65], which rely on statistics specific to a provided dataset. This makes them more time-consuming and prevents them from leveraging the prior patterns of the dataset. In contrast, our PVP, derived
from the prior data patterns, achieves excellent compression rates.

6 Conclusion
Existing PLMs excel in either protein generation tasks or protein understanding tasks. In this work,
we introduce an efficient training framework to transform any general LLM into a multi-task PLM.
