
ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing

Liuzhenghao Lv¹, Zongying Lin¹, Hao Li¹,², Yuyang Liu¹, Jiaxi Cui¹, Calvin Yu-Chian Chen¹, Li Yuan¹,²*, Yonghong Tian¹,²*
¹Peking University, China  ²Peng Cheng Laboratory, China

arXiv:2402.16445v2 [cs.CE] 16 Jul 2024

Abstract

Large Language Models (LLMs) have achieved remarkable performance in multiple Natural Language Processing (NLP) tasks. Under the premise that protein sequences constitute the protein language, Protein Language Models (PLMs) have advanced the field of protein engineering. However, as of now, unlike LLMs in NLP, PLMs cannot handle the protein understanding task and the protein generation task simultaneously in the Protein Language Processing (PLP) field. This prompts us to delineate the inherent limitations in current PLMs: (i) the lack of natural language capabilities, (ii) insufficient instruction understanding, and (iii) high training resource demands. To address these challenges, we introduce a training framework to transform any general LLM into a PLM capable of handling multiple PLP tasks. To improve training efficiency, we propose Protein Vocabulary Pruning (PVP) for general LLMs. We construct a multi-task instruction dataset containing 13 million samples with superfamily information, facilitating better modeling of protein sequence-function landscapes. Through these methods, we develop the ProLLaMA model, the first known PLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results in the unconditional protein sequence generation task. In the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. As for the protein understanding task, ProLLaMA achieves a 62% exact match rate in superfamily prediction. Codes, model weights, and datasets are available at https://github.com/PKU-YuanGroup/ProLLaMA and https://huggingface.co/GreatCaptainNemo.

1 Introduction

Large Language Models (LLMs), like GPT-x and LLaMA2 [1, 2], have achieved outstanding
performance in handling a wide range of Natural Language Processing (NLP) tasks [3–9], including
both Natural Language Generation (NLG) and Natural Language Understanding (NLU) tasks, in a
generative manner. This surge in LLMs has extended their applications beyond traditional contexts,
including their adoption in the challenging field of protein engineering [10–14].
Taking protein sequences as the protein language, researchers train Protein Language Models (PLMs)
on vast protein corpora [6, 11, 12]. PLMs have the potential to significantly advance protein engi-
neering, holding immense promise for biomedical and biotechnological innovations [10]. However,
this progress faces challenges, particularly in extending PLMs' capabilities to multi-task Protein Language
Processing (PLP).

Correspondence to: [email protected] and [email protected]

Preprint. Under review.


[Figure 1 diagram: general LLMs (GPT-X, LLaMA, etc.) handle text generation AND text understanding, whereas current PLMs (ProtGPT2, ESM, etc.) handle protein generation OR protein understanding; ProLLaMA covers unconditional protein generation, controllable protein generation, and protein property prediction.]
Figure 1: Left: LLMs can handle both generation and understanding tasks, whereas PLMs cannot.
This highlights the disparity in capabilities between the two. Right: Our ProLLaMA can handle
generation tasks (unconditional protein generation, controllable protein generation) and understanding
tasks (protein superfamily prediction), surpassing current PLMs.

Analogous to NLP, tasks related to protein language can be viewed as PLP [15, 16]. Consequently,
PLP tasks can also be divided into two categories: Protein Language Generation (PLG), such as
protein sequence generation, and Protein Language Understanding (PLU), such as protein property
prediction. However, current PLMs tend to focus either exclusively on PLG [17–19] or PLU [20–24],
rather than being proficient in both simultaneously, as LLMs are in NLP.
These limitations prompt the need for innovative solutions to unleash the full potential of PLMs.
Developing a multi-task PLM would be highly beneficial for protein engineering and protein fitness
landscape modeling [25–28], but three main challenges must be considered:
(i) Necessity of Natural Language: Protein language is not fully sufficient for PLP tasks, meaning
it cannot fully represent all components of a task (the task instruction, the input, and the expected
output) [29, 30]. It requires a language beyond protein language (typically, natural language) for
representation, which current PLMs lack.
(ii) Instruction Following: To possess multi-task capabilities, models must execute tasks following
user instructions [31–33]. However, current PLMs are unable to follow instructions.
(iii) Training Resource Consumption: Substantial training resources are needed for models to learn
natural language, protein language, and user instructions [34], which can sometimes be unaffordable.
To address the challenges, we construct an instruction dataset that contains approximately 13 million
samples and encompasses both PLG and PLU tasks. We propose a two-stage training framework
to achieve a PLM for multi-task PLP. In the first stage, we leverage a pre-trained general LLM like
LLaMA2 to continually learn the protein language while maintaining the natural language knowledge.
In the second stage, the model is further trained on the multi-task instruction dataset. Additionally,
we propose Protein Vocabulary Pruning (PVP) for general LLMs to significantly improve training
efficiency. Furthermore, during both stages, we adopt Low-Rank Adaptation (LoRA) [35], which
prevents catastrophic forgetting in the first stage and reduces training costs in both stages.
Using these methods, we develop ProLLaMA, a model capable of multi-task PLP, distinguishing it from all other PLMs. Through a series of experiments, we demonstrate the multi-task capabilities of our ProLLaMA. Specifically, as for PLG, in unconditional protein generation, ProLLaMA outperforms current PLMs on common metrics such as pLDDT and TM-score. In controllable protein generation, based on a user-provided textual description, ProLLaMA generates novel proteins from scratch with desired functionalities, such as proteins of the SAM-MT superfamily. As for PLU, in protein superfamily prediction, ProLLaMA achieves a 62% exact match rate on the test dataset and obtains an F1-score above 0.9 in many specific categories. In summary, the contributions of our research are as follows:
• We propose a training framework that enables any general LLM to be trained as a proficient model
for multi-task PLP, including both PLG and PLU tasks.
• We propose Protein Vocabulary Pruning and are the first to apply Low-Rank Adaptation in the
training of large PLMs, which improves the efficiency of protein learning.
• We construct an instruction dataset that contains 13 million samples and over 11,000 kinds of
superfamily annotations, potentially facilitating better modeling of sequence-function landscapes.
• Experiments show that our ProLLaMA not only handles multiple PLP tasks but also achieves
state-of-the-art results in protein generation tasks.

2 Preliminaries

[Figure 2 diagram: (A) UniRef50 (61M sequences) is filtered into the protein language dataset (53M), and InterPro's protein2ipr annotations are retrieved per UniRef50 protein to build the instruction dataset (13M), with instances such as Instruction: [Generate by superfamily], Input: Superfamily=<Ankyrin repeat-containing domain superfamily>, Output: Seq=<MAPG...PVRKR>, and the reverse [Determine superfamily] task; (B) Stage 1 continued pre-training turns LLaMA2 into ProLLaMA, followed by Stage 2 instruction tuning; (C) the vocabulary and the Embed/Head layers are pruned before training and recovered afterwards using the reserved indices.]
Figure 2: (A) Overview of the dataset construction. The protein language dataset contains 53
million samples, which is used for training in Stage 1. The instruction dataset contains 13 million
instances with 11,268 unique superfamily annotations, which is used for training in Stage 2. (B)
Overview of the training framework. Stage 1: The pre-trained LLaMA2 learns the protein language,
resulting in ProLLaMA. Stage 2: ProLLaMA learns to perform multiple tasks by instruction tuning.
(C) Overview of protein vocabulary pruning. The vocabulary, Embed layer and Head layer are
pruned based on characteristics of protein datasets, aiming for higher training efficiency.
Necessity of Natural Language. As aforementioned, given the similarities between protein sequences
and natural language, tasks related to protein sequences can be considered as PLP, analogous to
NLP. However, we observe a fundamental difference between protein language and natural language:
natural language is complete for NLP tasks, whereas protein language is not complete for PLP tasks.
For example, in the sentiment analysis task, instructions, inputs, and outputs are straightforwardly
expressed in natural language, such as “Analyze the sentiment: I am happy” yielding “The sentiment
is positive.” However, in PLP tasks like protein property prediction, instructions like “Predict this
protein’s property: MAFCF...FEV” cannot be fully conveyed in protein language alone, necessitating
assistance from a language beyond protein language, in this case, natural language, for representation.
Therefore, multi-task PLMs must possess a certain level of natural language ability, especially as
more textual descriptions of proteins become available [29].
Low-Rank Adaptation (LoRA) [35] is a parameter-efficient technique for fine-tuning LLMs. Due
to the immense parameter size of LLMs, full-parameter fine-tuning can sometimes be impractical.
LoRA circumvents this by freezing the original parameters of LLMs and introducing additional
trainable low-rank adapters. It achieves fine-tuning with a significantly smaller number of trainable
parameters, yielding results comparable to full-parameter fine-tuning. LoRA prevents catastrophic
forgetting of the original knowledge, as the newly learned knowledge has a lower rank than the
original knowledge. The theoretical details of LoRA are provided in Appendix A.2.

3 Methods
In Section 3.1 and Figure 2(A), we show how to construct the protein language dataset and the
instruction dataset. In Section 3.2 and Figure 2(C), we show how Protein Vocabulary Pruning (PVP)
works. In Section 3.3, we show how the LLaMA2 model learns protein language on the protein
language dataset through continued pre-training, resulting in ProLLaMA. In Section 3.4, we show
the integration of various tasks into ProLLaMA through instruction tuning on the instruction dataset.
The overview of the training framework is shown in Figure 2(B). The overview of the ProLLaMA
model is shown in Figure 3.

3.1 Dataset Construction

The protein language dataset is utilized in the first training stage to enable LLaMA2 to grasp
the language of proteins. Specifically, the dataset is sourced from UniRef50_2023_03 [36] on the
UniProt website. We eliminate the descriptive parts of UniRef50, retaining only the pure protein
sequences. Furthermore, we filter UniRef50 to ensure that the protein sequences consist only of the 20 standard amino acids. We also retain only sequences with a length of less than 512, aligning

[Figure 3 diagram: (A) the ProLLaMA pipeline: Tokenizer → Embed → Decoder Blocks → Head, trained with causal language modeling; (B) the decoder structure, with trainable LoRA adapters (A, B) attached to frozen attention and MLP weights, and protein vocabulary pruning applied to the Embed/Head layers.]

Figure 3: The overview of the ProLLaMA model. We add low-rank adapters (LoRA) to certain
weights and apply protein vocabulary pruning (PVP). We freeze original parameters, focusing solely
on training LoRA (Embed and Head are also involved in the first training stage).

with ProGen [18]. Given that the lengths of protein sequences follow a long-tail distribution, the
sequences that are deleted constitute only a small portion of the total dataset. To preprocess the
protein sequences, we employ a specific prefix “Seq=<” and suffix “>”. This standardized format aids
LLaMA2 in distinguishing the new protein language from its existing natural language knowledge,
thereby reducing confusion. The original UniRef50 contains 60,952,894 sequences, while after the
above series of processing, our dataset comprises 52,807,283 protein sequences, with 90% for training
and 10% reserved for testing.
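As an illustration, the filtering and formatting described above could be sketched as follows in Python; the file names and the FASTA parsing here are assumptions for illustration, not the authors' released preprocessing code.

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MAX_LEN = 512                               # keep sequences shorter than 512 residues

def iter_fasta(path):
    """Yield raw sequences from a FASTA file, dropping descriptive header lines."""
    seq = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            elif line:
                seq.append(line)
    if seq:
        yield "".join(seq)

def preprocess(path):
    """Keep standard-amino-acid sequences below the length cutoff, wrapped as Seq=<...>."""
    for s in iter_fasta(path):
        if len(s) < MAX_LEN and set(s) <= STANDARD_AA:
            yield "Seq=<" + s + ">"

if __name__ == "__main__":
    # Hypothetical input/output paths for the Stage 1 corpus.
    with open("protein_language_dataset.txt", "w") as out:
        for sample in preprocess("uniref50.fasta"):
            out.write(sample + "\n")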
The instruction dataset is utilized in the second training stage to enable ProLLaMA to perform
various tasks. We first obtain the protein2ipr database from InterPro [37], which includes all proteins
from UniProtKB along with their corresponding InterPro annotation information. Subsequently, we
iterate through each protein’s rep_id in UniRef50 to retrieve the corresponding annotation information
from protein2ipr. This retrieval process is implemented using a distributed Redis database, and only
proteins with lengths less than 256 participate in the retrieval to enhance efficiency. We utilize regular expressions to extract the superfamily annotation from the full annotation. In the end, we obtain 6,350,106 data instances, each of which contains one protein sequence and its superfamily annotation. The number of unique superfamily annotations is 11,268.
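A minimal sketch of this retrieval step is shown below, assuming the protein2ipr annotations have already been loaded into Redis under a hypothetical key scheme; the key format and the superfamily-matching regular expression are illustrative only.

import re
import redis

# Illustrative pattern: InterPro superfamily annotations typically end in "superfamily".
SUPERFAMILY_RE = re.compile(r"([^;|]*superfamily)", re.IGNORECASE)

client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def lookup_superfamily(rep_id: str):
    """Fetch the InterPro annotation for one UniRef50 representative id and extract
    the superfamily part, if present."""
    annotation = client.get("protein2ipr:" + rep_id)  # hypothetical key scheme
    if not annotation:
        return None
    match = SUPERFAMILY_RE.search(annotation)
    return match.group(1).strip() if match else None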
Then, we process the obtained data into a multi-task instruction dataset following the Alpaca for-
mat [38], where each instance comprises three parts: instruction, input, and output. The instruction
specifies the task type. We design two tasks: generating proteins based on superfamily and determin-
ing the superfamily of the given protein. For the former task, the input is the superfamily annotation,
and the output is the expected protein sequence. The latter task is the reverse. We select these two tasks because they represent PLG and PLU, respectively. Additionally, modeling the relationships between
sequences and superfamilies is crucial for uncovering protein functions and evolutionary insights [39].
In the end, the instruction dataset comprises 12,700,212 (6,350,106 * 2) instances, with 90% for
training and the rest reserved for testing.
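To make the format concrete, the sketch below builds the two Alpaca-style instances for one (sequence, superfamily) pair; the instruction strings follow the examples in Figure 2(A), while the function name and the toy data are hypothetical.

import json

def make_instances(sequence: str, superfamily: str):
    """Build the generation-task and understanding-task instances for one protein."""
    generation = {
        "instruction": "[Generate by superfamily]",
        "input": "Superfamily=<" + superfamily + ">",
        "output": "Seq=<" + sequence + ">",
    }
    understanding = {
        "instruction": "[Determine superfamily]",
        "input": "Seq=<" + sequence + ">",
        "output": "Superfamily=<" + superfamily + ">",
    }
    return [generation, understanding]

# Toy example (truncated sequence, hypothetical data).
pairs = [("MAPGPVRKR", "Ankyrin repeat-containing domain superfamily")]
dataset = [inst for seq, sf in pairs for inst in make_instances(seq, sf)]
print(json.dumps(dataset, indent=2))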

3.2 Protein Vocabulary Pruning

It is known that LLMs are equipped with large vocabularies in their tokenizers, which is beneficial for covering multilingual corpora and shortening the length of input token sequences. However, a larger vocabulary also implies slower tokenization of raw text, along with increased parameters in the model's Embed layer and Head layer, as discussed in detail in Appendix A.9. When considering only the protein language, such large vocabularies are clearly unnecessary. Therefore, we propose a method called Protein Vocabulary Pruning (PVP). PVP helps ProLLaMA achieve almost the same performance with higher training efficiency, as shown in Appendix A.3.
PVP is rule-based. Specifically, we iterate through each token in the original vocabulary and retain it if it meets our predefined grammatical rules (e.g., consisting only of the 20 uppercase English letters representing standard amino acids). We save these tokens and their indices. Using the saved tokens, we construct a smaller vocabulary. We then use the indices to extract the corresponding vectors from the Embed and Head layers to construct smaller ones. Thus, we can train LLMs using a reduced vocabulary and fewer parameters.

Table 1: Effects of PVP. The vocabulary size, the number of parameters in the Embed and Head layers, and the number of parameters involved in training are listed.

Training Stage   Vocab    Embed and Head   Training
I (w/o PVP)      32,000   262M             582M
I (w/ PVP)       1,045    8.56M            328M
II (w/o PVP)     32,000   262M             160M
II (w/ PVP)      25,466   209M             160M
PVP is employed before training. After training, the vocabulary can be restored to its original size by reinstating the entries from the original vocabulary. This process is also applicable to the
Embed and Head layers, restoring them to their original dimensions. This ensures that the architecture
of LLMs remains entirely consistent both before and after training, facilitating potential training on
broader corpora in the future. Naturally, if subsequent training is not a consideration, omitting the
recovery operation is permissible.
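A rough sketch of the pruning step is given below, assuming a Hugging Face-style token-to-id vocabulary and PyTorch weight matrices for the Embed and Head layers; the rule shown only covers pure amino-acid tokens plus a few special markers, so it approximates the Stage 1 setting rather than reproducing the exact implementation.

import torch

def protein_vocab_prune(vocab, embed_weight, head_weight,
                        extra_tokens=("<s>", "</s>", "<unk>", "Seq", "=", "<", ">")):
    """Keep tokens composed of the 20 standard amino acids (plus a few special tokens),
    then slice the Embed and Head weight matrices down to the kept rows."""
    standard_aa = set("ACDEFGHIKLMNPQRSTVWY")
    kept = []  # list of (old_index, token), ordered by old index
    for token, old_idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        if token in extra_tokens or (token and set(token) <= standard_aa):
            kept.append((old_idx, token))

    new_vocab = {token: new_idx for new_idx, (_, token) in enumerate(kept)}
    old_indices = torch.tensor([old_idx for old_idx, _ in kept])
    # Embed/Head rows are indexed by token id, so selecting the saved indices
    # yields the smaller layers used during training.
    new_embed = embed_weight[old_indices].clone()
    new_head = head_weight[old_indices].clone()
    return new_vocab, old_indices, new_embed, new_head

# After training, the trained rows can be scattered back into copies of the original
# matrices at old_indices, restoring the original vocabulary and layer sizes.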
We use PVP in both the first and second training stages of ProLLaMA, and its effects are shown
in Table 1. It is evident that we achieve a significant compression rate in the first stage, primarily
because the dataset in the first stage contains only protein sequences, which can be summarized by a
simple rule. In contrast, the dataset in the second stage also includes English. More details about PVP are provided in Appendix A.3.

3.3 Learning Protein Language

As mentioned in Section 1, current PLMs lack natural language abilities, which hinders multi-task
capabilities. To solve this problem, we propose leveraging LLaMA2 to perform continued pre-training
on protein language. This approach is analogous to humans learning a foreign language, where the
model learns protein language while retaining its original natural language abilities.
We add Low-Rank Adapters (LoRA) into LLaMA2. To be specific, in each decoder block, we
add LoRA to certain weights, including W_q, W_k, W_v, W_o, W_up, W_gate, and W_down. The original
parameters are frozen, enabling only LoRA to be trained. Due to the significant differences between
protein language and natural language, we choose a relatively high rank for LoRA, which helps the
model learn protein sequences better and prevents under-fitting. We also include both the Embed and
Head layers in training. This is based on the premise that a token may have different meanings in
protein sequences and natural languages, requiring distinct embeddings for the same token.
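With the Hugging Face PEFT library, this setup might be expressed as in the sketch below; the target module names follow LLaMA2's standard naming, the rank of 128 matches Section 4.1, and the remaining hyperparameters (alpha, dropout, model identifier) are assumptions.

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=128,                      # relatively high rank for learning a new "language"
    lora_alpha=256,             # assumed scaling factor
    lora_dropout=0.05,          # assumed
    # W_q, W_k, W_v, W_o and the MLP's W_up, W_gate, W_down in each decoder block
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "gate_proj", "down_proj"],
    # The Embed and Head layers are also trained in the first stage
    modules_to_save=["embed_tokens", "lm_head"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable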
We trained the model using causal language modeling on the protein language dataset, resulting
in ProLLaMA. Details on causal language modeling are provided in Appendix A.1. Benefiting
from PVP and LoRA, as shown in Table 7, we train only about 5% of the parameters, in contrast to
full-parameter training, which significantly reduces training costs. Additionally, as the remaining
parameters are not involved in training, the inherent natural language abilities are preserved.
In summary, we have developed ProLLaMA, a model that comprehends both protein language
and natural language, with reduced training costs. Consequently, we have addressed two problems
mentioned in Section 1: the lack of natural language abilities and excessive training costs.

3.4 Performing Multiple Tasks

As aforementioned, current PLMs are unable to perform multiple tasks based on user instructions. To
solve this problem, we perform instruction tuning on ProLLaMA obtained from the previous section.
We train ProLLaMA on the instruction dataset mentioned in Section 3.1:
\mathcal{L}(\Theta) = \mathbb{E}_{x,\,u \sim \mathcal{D}}\big[-\log p(x \mid u;\, \Theta)\big] \tag{1}

Table 2: Comparison of proteins generated by different models. Our ProLLaMA achieves the best
performance on pLDDT, TM-score, and RMSD metrics, and is second-best in SC-Perp, demonstrating
ProLLaMA excels in de novo protein design. *: We list ProGen2 and ProGen as the same item,
referring to Appendix A.6 for explanation. AE: Auto-Encoder. AR: Auto-Regressive.

Type        Method                    pLDDT↑        SC-Perp↓    AFDB TM-score↑  AFDB RMSD↓  PDB TM-score↑  PDB RMSD↓
CNN         CARP [40]                 34.40±14.43   4.05±0.52   0.28            19.38       0.38           8.95
CNN         LRAR [40]                 49.13±15.50   3.59±0.54   0.40            14.47       0.43           9.47
PLM (AE)    ESM-1b [22]               59.57±15.36   3.47±0.68   0.34            20.88       0.44           8.59
PLM (AE)    ESM-2 [24]                51.16±15.52   3.58±0.69   0.20            35.70       0.41           9.57
Diffusion   EvoDiff [40]              44.29±14.51   3.71±0.52   0.32            21.02       0.41           10.11
PLM (AR)    ProtGPT2 [17]             56.32±16.05   3.27±0.59   0.44            12.60       0.43           9.19
PLM (AR)    ProGen2/ProGen* [19, 18]  61.07±18.45   2.90±0.71   0.43            15.52       0.44           11.02
PLM (AR)    ProLLaMA (ours)           66.49±12.61   3.10±0.65   0.49            9.50        0.48           7.63

Table 3: Controllable generation of ProLLaMA. SAM-MT, TPHD, Trx, and CheY are four superfamilies. High values of TM-score and H-Prob indicate that the generated proteins match the desired superfamily. "xx% seen" denotes that xx% of the residues of a protein sequence belonging to the related superfamily are provided to ESM-1b for further generation. Even so, ProLLaMA performs best.
                     SAM-MT              TPHD                Trx                 CheY
Method               TM-score↑  H-Prob↑  TM-score↑  H-Prob↑  TM-score↑  H-Prob↑  TM-score↑  H-Prob↑
ESM-1b               0.58       0.37%    0.55       0.48%    0.61       0.37%    0.63       0.27%
ESM-2                0.52       0.26%    0.51       0.25%    0.53       0.30%    0.57       0.18%
EvoDiff              0.46       1.17%    0.42       1.80%    0.42       1.10%    0.46       1.43%
ProtGPT2             0.45       3.86%    0.43       4.62%    0.44       2.53%    0.45       4.86%
ProGen2/ProGen*      0.44       1.90%    0.45       2.49%    0.43       2.44%    0.44       2.13%
ESM-1b (25% seen)    0.43       0.64%    0.40       4.13%    0.42       1.01%    0.49       4.47%
ESM-1b (50% seen)    0.59       61.08%   0.63       66.21%   0.64       62.75%   0.73       78.00%
ESM-1b (75% seen)    0.67       88.51%   0.73       90.23%   0.75       93.92%   0.78       96.93%
ProLLaMA (ours)      0.71       98.13%   0.82       100.00%  0.81       99.96%   0.93       100.00%

Here, Θ denotes the parameters to be optimized, L the loss function, and D the dataset. u denotes
the instruction and the input of one instance. For brevity, when we refer to the instruction in the
following text, it includes the input as well. x = {x_0, x_1, ..., x_{n-1}} denotes the output, where x_i is the i-th token of the output.
Since causal language modeling is employed, we need to combine Equation 1 with Equation 3:

\mathcal{L}(\Theta) = \mathbb{E}_{x \sim \mathcal{D}}\Big[-\sum_{i}\log p(x_i \mid u, x_0, x_1, \ldots, x_{i-1};\, \Theta)\Big] \tag{2}

Equation 2 is the optimization objective for instruction tuning of ProLLaMA. u is not involved in the loss calculation, whereas x is. This is because the latter is the output part, whose generation quality is crucial, while the former, u, only needs to be understood by the model. In the instruction tuning stage, we exclusively train LoRA, at a lower rank than specified in Section 3.3.
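One common way to realize this masking with a Hugging Face tokenizer is sketched below, relying on the convention that label positions set to -100 are ignored by the cross-entropy loss; it illustrates Equation 2 and is not the authors' training code.

def build_example(tokenizer, instruction_and_input, output, max_len=512):
    """Tokenize one instance so that only the output tokens x contribute to the loss."""
    prompt_ids = tokenizer(instruction_and_input, add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(output, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]

    input_ids = prompt_ids + output_ids
    # -100 masks the instruction/input part u, so the loss sums only over the x_i tokens.
    labels = [-100] * len(prompt_ids) + output_ids

    return {"input_ids": input_ids[:max_len], "labels": labels[:max_len]}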
In summary, through instruction tuning, we have made ProLLaMA capable of following instructions
and performing multiple tasks. Consequently, we have addressed the problems mentioned in Section 1:
the lack of instruction following and the lack of multi-task capabilities.
In addition, the tasks that ProLLaMA can perform depend on the tasks included in the instruction
dataset. Appendix A.7 shows that by training on an additional instruction dataset, ProLLaMA also
performs well in predicting protein solubility.

4 Experiments
We introduce the experiment setup in Section 4.1. We then evaluate the unconditional protein generation task in Section 4.2, the controllable protein generation task in Section 4.3, and the protein property prediction task in Section 4.4.

4.1 Experiment Setup

Training Settings: For continued pre-training, the LoRA rank is set to 128, employing the AdamW
optimizer alongside a cosine annealing scheduler with warm-up. The peak learning rate is 5e-5, with a total of one training epoch. Training takes six days on eight A6000 GPUs using FlashAttention-2 [41]. For instruction tuning, the LoRA rank is set to 64 with two training epochs, and all other settings remain consistent with the continued pre-training setup. It takes five days on eight A6000 GPUs. More
training details can be found in Appendix A.4.
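Expressed with the Hugging Face Trainer API, these choices roughly correspond to a configuration like the following sketch; batch size, gradient accumulation, warm-up ratio, and precision are assumptions not stated in the paper.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="prollama_stage1",     # hypothetical output path
    num_train_epochs=1,               # one epoch of continued pre-training
    learning_rate=5e-5,               # peak learning rate
    lr_scheduler_type="cosine",       # cosine annealing
    warmup_ratio=0.03,                # warm-up (assumed value)
    per_device_train_batch_size=8,    # assumed; chosen to fit A6000 memory
    gradient_accumulation_steps=4,    # assumed
    bf16=True,                        # mixed precision (assumed)
    logging_steps=100,
    save_steps=5000,
)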
Evaluation Settings: Unconditional protein generation involves generating protein sequences without
specific instructions. Controllable protein generation involves generating desired protein sequences
based on instructions that specify the required superfamily. Property prediction involves predicting
protein superfamily based on instructions, which include the protein sequences to be predicted. All
evaluations are conducted on one GPU with 24GB of VRAM, with model inference occupying
approximately 13GB of VRAM.
Evaluation Metrics: We use the following metrics to evaluate the generated protein sequences.
The pLDDT [42] is used to measure whether sequences are structurally plausible. Self-Consistency
Perplexity (SC-Perp) [40] serves as an additional metric of plausible structures since pLDDT falls
short in dealing with intrinsically disordered regions (IDRs) [43]. TM-score [44] reflects the structural
similarity between generated sequences and known ones in AFDB [45] and PDB [46]. RMSD also
reflects the structural similarity from the perspective of atomic distance. Homologous probability (H-
Prob) reflects the probability that the generated protein is homologous to a known one. Seq-Ident
reflects the sequence similarity between generated sequences and known ones. More details are
shown in Appendix A.5.

4.2 Unconditional Protein Generation

We compare our model with other models in protein sequence generation. These models cover a
variety of types, with parameter numbers shown in Table 7. Table 2 shows the results. Our ProLLaMA
is best on pLDDT, TM-score, and RMSD, and second-best on SC-Perp. This indicates that
ProLLaMA, through its training on protein sequence data, can generate structurally plausible proteins.
Notably, ProLLaMA-generated proteins exhibit a mean and standard deviation for pLDDT and
SC-Perp of 66.49±12.61 and 3.10±0.65, respectively. These values are comparable to those of natural
proteins as reported in [40], which are 68.25±17.85 and 3.09±0.63, respectively. It is noted that we
list ProGen2 and ProGen as the same item in the table; see Appendix A.6 for an explanation.
De novo design of long and structurally plausible protein sequences is highly challenging [17], yet
our ProLLaMA performs well. As shown in Figure 4(A)(B)(C), when the length is more than 300,
ProLLaMA performs the best on all three metrics. Although ProGen2's performance is better for short sequences (length ≤ 200), it decreases as the length increases. This indicates that ProLLaMA is
able to capture long-range dependencies between amino acids while other models struggle.

4.3 Controllable Protein Generation

We utilize four superfamily descriptions as instructions: the S-adenosyl-L-methionine-dependent methyltransferase superfamily (SAM-MT), the Tetratricopeptide-like helical domain
superfamily (TPHD), the Thioredoxin-like superfamily (Trx), and the CheY-like superfamily (CheY).
For each superfamily, ProLLaMA generates 100 protein sequences. We randomly select 100 natural
proteins from each of the four superfamilies as benchmarks for comparison. We employ Foldseek[47]
to compare generated proteins with natural ones.
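As an illustration of how such an instruction-driven generation call might look with the released weights on Hugging Face, the sketch below uses a standard transformers causal-LM interface; the model identifier, the prompt template (adapted from Figure 2(A)), and the sampling parameters are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GreatCaptainNemo/ProLLaMA"   # hypothetical repository name under the released account
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Prompt adapted from the instruction format in Figure 2(A); the exact template may differ.
prompt = "[Generate by superfamily] Superfamily=<Thioredoxin-like superfamily> Seq=<"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))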
The results shown in Table 3 demonstrate that ProLLaMA can generate desired protein sequences
based on instructions that specify the required functionalities, confirming the capability for control-
lable generation. For SAM-MT, the TM-scores of our generated sequences exceed 0.7; for TPHD

[Figure 4 panels: (A) pLDDT, (B) SC-Perp, and (C) TM-score versus sequence length for ProLLaMA, ProGen2, ProtGPT2, LRAR, and ESM1b; (D-F) pLDDT, SC-Perp, and TM-score (normalized by percentage, 0-100) of ProLLaMA-generated versus natural proteins for CheY, TPHD, Trx, and SAM-MT.]

Figure 4: Contrast experiments of ProLLaMA. (A-C): Compared to other models, ProLLaMA maintains a high quality of generated proteins as their length increases. (D-F): According to the three indicators normalized by percentage, proteins generated by ProLLaMA are roughly comparable to natural proteins in the four superfamilies of CheY, TPHD, Trx, and SAM-MT (SAM).

and Trx, they are over 0.8; and for CheY, they surpass 0.9. The high TM-score indicates that the
structures of the generated proteins closely resemble those of natural proteins in the same superfamily,
implying functional similarity. For SAM-MT, TPHD, Trx, and CheY, all of the H-prob values are
close to or even equal to 100%, indicating that the generated proteins are homologous to natural
proteins and belong to the same superfamily. In summary, these provide strong evidence that the
protein generation of ProLLaMA is controllable under instructions. In contrast, other models exhibit
low TM-score and very low H-Prob due to their uncontrollable generation.
We also conduct another experiment. We provide residues of natural proteins belonging to the
superfamily to ESM-1b, allowing ESM-1b to complete these sequences, rather than generating them
from scratch as done previously. As seen in Table 3, with more residues provided, the proteins
generated by ESM-1b increasingly exhibit the characteristics of the corresponding superfamily. Even
so, it still does not surpass ProLLaMA. Higher TM-score and H-prob indicate that, under instruction
control, ProLLaMA’s de novo protein generation even outperforms the non-de novo generation by
ESM-1b, which is provided with 75% of the residues. These results indicate that ProLLaMA can
effectively capture structural and evolutionary relationships (as reflected by TM-score and H-prob)
solely through learning from text and sequences.
Additionally, using natural proteins as a benchmark, we assess pLDDT, SC-Perp, and TM-score of
proteins generated by ProLLaMA. Figure 4(D) shows that for CheY, TPHD, Trx, and SAM-MT,
the average pLDDT of generated proteins is only 19.0%, 1.41%, 10.9%, and 10.3% lower than
that of natural proteins, respectively. Figure 4(E) shows that for TPHD and SAM-MT, the average
SC-Perp is 5.66% and 5.06% lower; for Trx and CheY, the average SC-Perp is 7.58% and 14.92%
higher. Figure 4(F) visualizes the TM-score, with scores near the maximum indicating a high degree
of structural similarity. These findings indicate that proteins generated by ProLLaMA are roughly
comparable to their natural counterparts in the same superfamily.
In Figure 5, we visualize four examples of proteins generated by ProLLaMA (colored in blue)
alongside the most structurally similar natural proteins from PDB (colored in yellow). The significant
overlap in 3D structures and the high TM-score confirm structural similarity. Low Seq-ident indicates
sequence diversity. In summary, through controllable protein generation, ProLLaMA is capable of
generating desired proteins with structures similar to natural proteins, yet with novel sequences.

4.4 Property Prediction

We use the test dataset to evaluate whether ProLLaMA can predict the superfamily to which a given
protein belongs. The test dataset consists of 10,000 samples. Although ProLLaMA performs a
classification task here, it is more complex than typical ones. The key difference is that typical
classification tasks require models to output a fixed label, often in one-hot encoding. In contrast,

Table 4: Protein property prediction. Results for ten selected superfamilies in the test dataset.

Metric      OBFD  UPF0145  NACD  U3S   CCHC  Kazal  SAM-MT  TPHD  Trx   CheY
Precision   0.33  1.00     1.00  1.00  0.75  1.00   0.77    0.94  0.86  0.93
Recall      1.00  1.00     1.00  1.00  0.95  1.00   0.94    0.91  0.94  1.00
F1-score    0.50  1.00     1.00  1.00  0.86  0.98   0.84    0.93  0.90  0.96

[Figure 5 annotations: SAM-MT (vs. PDB 3dh0_A): TM-score 0.775, H-prob 96%, Seq-ident 16.2%; TPHD (vs. PDB 2vq2_A): TM-score 0.833, H-prob 98%, Seq-ident 21.2%; Trx (vs. PDB 3gnj_A): TM-score 0.782, H-prob 94%, Seq-ident 21.0%; CheY (vs. PDB 2a9p_A): TM-score 0.922, H-prob 100%, Seq-ident 33.0%.]


Figure 5: Protein visualization. Four proteins of controllable generation by ProLLaMA using
SAM-MT, TPHD, Trx, and CheY as instructions. Blue is generated proteins, and yellow is natural.
They are similar in structure but different in sequence.
ProLLaMA outputs free-form text. The advantage of the latter lies in its flexibility, such as the ability to easily handle situations where a sample belongs to multiple categories simultaneously. However, this increases task difficulty due to the much larger number of potential classification categories.
Even so, as shown in Table 5, ProLLaMA generates superfamily descriptions that exactly match the real descriptions in 62% of the test dataset. In addition, Table 4 illustrates ProLLaMA's performance on ten specific superfamilies. The recall value exceeds 0.9 in all ten superfamilies and the F1-score exceeds 0.8 in nine superfamilies. The calculation formulas for these metrics can be found in Appendix A.5. Additional experiments on protein solubility prediction using ProLLaMA can be found in Appendix A.7.

Table 5: Protein property prediction. Results over all superfamilies in the test dataset.

Metric              Value
Exact Match         0.62
Jaccard Similarity  0.67
Precision           0.63
Recall              0.72
F1-score            0.67
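For reference, a simple way to compute the exact-match rate and per-superfamily precision, recall, and F1 from predicted and true superfamily strings is sketched below; the exact formulas used by the paper are in Appendix A.5, so this is only an illustrative reimplementation with toy data.

def exact_match(preds, golds):
    """Fraction of predictions that match the reference description exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def per_class_prf(preds, golds, target_class):
    """Precision, recall, and F1 for one superfamily, treated as the positive class."""
    tp = sum(p == target_class and g == target_class for p, g in zip(preds, golds))
    fp = sum(p == target_class and g != target_class for p, g in zip(preds, golds))
    fn = sum(p != target_class and g == target_class for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example (hypothetical predictions).
golds = ["CheY-like superfamily", "Thioredoxin-like superfamily", "CheY-like superfamily"]
preds = ["CheY-like superfamily", "CheY-like superfamily", "CheY-like superfamily"]
print(exact_match(preds, golds))                        # 0.666...
print(per_class_prf(preds, golds, "CheY-like superfamily"))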

5 Related Work
Protein Language Models. Recognizing the similarity between natural language sequences and
protein sequences, many methods from NLP have been applied to protein sequence data [48–51]. This
has led to the development of PLMs, which are broadly categorized into two types [12, 52]: Auto-
Regressive (AR) PLMs and Auto-Encoder (AE) PLMs. AR PLMs adopt the decoder-only architecture
and Causal Language Modeling (CLM) [53, 54]. They primarily concentrate on PLG [55, 17–19],
with a minority also focusing on fitness prediction [10]. AE PLMs adopt the encoder-only architecture
and Masked Language Modeling (MLM) [20–24]. They excel in PLU, with the learned protein
representations being applied to downstream predictive tasks [29]. However, they face challenges in
de novo protein generation. Our ProLLaMA is capable of multitasking, excelling in tasks that both of
the above types specialize in, surpassing existing PLMs. This multitasking capability is achieved
through instruction following, making it user-friendly. We have also noticed the recent emergence of
scientific LLMs [13, 14, 56–58]. In Appendix A.10, we discuss the differences between these models
and ProLLaMA.
Training LLMs. There is a general training framework for LLMs [4], where LLMs are first pre-
trained on large-scale corpora [59] and then undergo instruction tuning to follow user instructions [60].
However, we believe that our proposed two-stage training framework differs in motivation and insights,
as discussed in Appendix A.11. Considering the vast number of parameters in LLMs, various
parameter-efficient techniques have been proposed to accelerate training and conserve memory [61,
62, 35, 63, 64], including LoRA. For the same reason, some vocabulary pruning methods have been
studied [65], which rely on statistics specific to a provided dataset. This makes them more time-consuming and prevents them from leveraging the prior patterns of the dataset. In contrast, our PVP, derived
from the prior data patterns, achieves excellent compression rates.

6 Conclusion
Existing PLMs excel in either protein generation tasks or protein understanding tasks. In this work,
we introduce an efficient training framework to transform any general LLM into a multi-task PLM.
