PS4
Preprint, 2023 (Version 1)

Omar Peracha∗
Department for Continuing Education, University of Oxford, Rewley House, 1 Wellington Square, OX1 2JA, Oxford, United Kingdom
∗Corresponding author. [email protected]
Abstract
Motivation
Protein secondary structure prediction is a subproblem of protein folding. A lightweight algorithm capable
of accurately predicting secondary structure from only the protein residue sequence could therefore provide
a useful input for tertiary structure prediction, alleviating the reliance on MSA typically seen in today’s
best-performing models. This in turn could enable the development of protein folding algorithms which perform
better on orphan proteins, and which are much more accessible for both research and industry adoption
because they require far fewer computational resources to run. Unfortunately, existing datasets for secondary
structure prediction are small, creating a bottleneck in the rate of progress of automatic secondary structure
prediction. Furthermore, protein chains in these datasets are often not identified, hampering the ability of
researchers to use external domain knowledge when developing new algorithms.
Results
We present PS4, a dataset of 18,731 non-redundant protein chains and their respective Q8 secondary
structure labels. Each chain is identified by its PDB code, and the dataset is also non-redundant against
other secondary structure datasets commonly seen in the literature. We perform ablation studies by training
secondary structure prediction algorithms on the PS4 training set, and obtain state-of-the-art Q8 and Q3
accuracy on the CB513 test set in a zero-shot setting, without further fine-tuning. Furthermore, we provide a
software toolkit for the community to run our evaluation algorithms, train models from scratch and add new
samples to the dataset.
Availability
All code and data required to reproduce our results and make new inferences are available at
https://github.com/omarperacha/ps4-dataset
Key words: Databases, Protein Folding, Protein Secondary Structure, Machine Learning
1. Introduction

Recent years have seen great advances in automated protein structure prediction, with open-sourced algorithms, capable in many cases of matching the accuracy of traditional methods for determining the structure of a folded protein such as X-ray crystallography and cryo-EM, made increasingly available. It remains common for the best-performing approaches to rely on MSA data in order to provide strong results (Jumper et al., 2021; Baek et al., 2021; Zheng et al., 2022). One drawback of this is that these algorithms perform poorly on orphan proteins. Another is that quite significant extra resources are required when running these algorithms, particularly disk space for storing the database of potential homologues and computation time to adequately perform a search through several hundred gigabytes of this data. For example, at the time of writing, the lighter-weight version of AlphaFold2 currently made available by DeepMind as an official release¹ requires 600 GB of disk space and comes with accuracy tradeoffs; the full version occupies terabytes.

¹ Available at https://github.com/deepmind/alphafold

Improved performance on orphan proteins is particularly desirable as it may open up wider avenues for exploration when it comes to the set of possible human-designed proteins which
and Wiuf, 2010), to facilitate research on further hierarchical modelling tasks such as domain location prediction or domain classification.

We perform ablation studies by using the PS4 training set and evaluating on both the PS4 test set and the CB513 test set in a zero-shot manner, leaving out the CB513 training set from the process. We use the same protein language model as Elnaggar et al. (2022) to extract input embeddings and evaluate multiple neural network architectures for the end classifier, with no further inputs such as MSA. We obtain state-of-the-art results for Q3 and Q8 secondary structure prediction accuracy on the CB513, 86.8% and 76.3% respectively, by training solely on PS4.

We make the full dataset freely available along with code for evaluating our pretrained models, for training them from scratch to reproduce our results and for running predictions on new protein sequences. Finally, in the interests of obtaining a dataset of sufficient scale to truly maximise the potential of representation learning for secondary structure prediction, we provide a toolkit for any researcher to add new sequences to the dataset, ensuring the same criteria for non-redundancy. New additions will be released in labelled versions to ensure the possibility of consistent benchmarking in a future-proof manner.
Table 1. A comparison of commonly-cited datasets for secondary structure prediction in recent literature. Ours is by far the largest and the only one in which the proteins can be identified. Only ours and the CB513 are fully self-contained for training and evaluation with specified training and test sets. The CB513 achieves this distinction in many cases by masking subsequences of training samples, only sometimes including an entire sequence as a whole in the test set.

Dataset        Samples   Train/Test Split   Has PDB Codes
TS115          115       N/A                No
NEW364         364       N/A                No
CB513          511       via masking        No
NetSurfP-2.0   10792     N/A                No
PS4            18731     17799 / 932        Yes
2. Methods

2.1. Dataset Preparation

The PS4 dataset consists of 18,731 protein sequences, split into 17,799 training samples and 932 validation samples, where each sequence is from a distinct protein chain. We first obtained the entire precomputed DSSP database (Kabsch and Sander, 1983; Joosten et al., 2011), initiating the database download on 16th December 2021². The full database at that time contained secondary structure information for 169,579 proteins, many of which are multimers, in DSSP format, with each identified by its respective PDB code.

² Available by following the instructions at https://swift.cmbi.umcn.nl/gv/dssp/DSSP_1.html

We iterate through the precomputed DSSP files and create a separate record for each individual chain, noting its chain ID, its residue sequence as a string of one-letter amino acid codes, and its respective secondary structure sequence, assigning one of nine possible secondary structure classes to each residue in the given chain. The ninth class, the polyproline helix, has not generally been taken into consideration by other secondary structure prediction algorithms, and we too ignore this class when performing our own algorithmic assessments; however, the information is retained in the raw dataset provided.
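As an illustration of the per-chain records described above, the sketch below parses a classic-format .dssp file into chain-level sequence and secondary structure strings. It is not the authors' preprocessing code (which is written in Rust); the column offsets and the handling of blank codes are assumptions based on the standard DSSP text layout.

```python
from pathlib import Path

def chains_from_dssp(path):
    """Minimal sketch: one record per chain with first residue number, sequence and SS string."""
    records = {}        # chain ID -> {"first_resnum", "sequence", "ss"}
    in_body = False
    for line in Path(path).read_text().splitlines():
        # The per-residue table starts after the header line containing "#  RESIDUE".
        if line.lstrip().startswith("#") and "RESIDUE" in line:
            in_body = True
            continue
        if not in_body or len(line) < 17:
            continue
        aa = line[13]                      # one-letter amino acid code (assumed column)
        if aa == "!":                      # chain break / discontinuity marker
            continue
        chain = line[11]                   # chain identifier (assumed column)
        ss = line[16] if line[16] != " " else "-"   # blank DSSP code treated as coil
        rec = records.setdefault(
            chain, {"first_resnum": int(line[5:10]), "sequence": "", "ss": ""}
        )
        rec["sequence"] += aa
        rec["ss"] += ss
    return records
```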
We also store the residue number of the first residue denoted in the chain, as this is quite often not number one; being able to infer the residue number of any given residue in the chain could better facilitate the use of external data by future researchers. Following the first residue included in the DSSP file for that chain, we omit chains which are missing any subsequent residues. We further omit any chains containing fewer than 16 residues. Finally, we perform filtration to greatly reduce redundancy, checking for similarity below 40% against the entire CB513 dataset, and then for all remaining samples against each other.

We chose the Levenshtein distance to compute similarity due to its balance of effectiveness as a distance metric for biological sequences (Berger et al., 2021), its relative speed and its portability, with optimised implementations existing in several programming languages. This last property is of particular importance when factoring in our aim for the PS4 dataset to be extensible by the community, enabling a federated approach to maximally scale and leverage the capabilities of deep learning. The possibility of running similarity checks locally with a non-specialised computing setup means that even a bioinformatics hobbyist can add new sequences to future versions of the dataset and guarantee a consistent level of non-redundancy, without relying on precomputed similarity clusters. This removes hurdles to future growth and utility of the PS4 dataset, while also allowing lightweight similarity measurement against proteins which are not easily identifiable by a PDB or UniProt code, such as those in the CB513.
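A minimal sketch of the redundancy filter described above is shown below. The normalised similarity formula, the comparison order and the use of the python-Levenshtein package are assumptions; the paper specifies only the Levenshtein distance and the 40% similarity threshold.

```python
import Levenshtein  # optimised C implementation; any edit-distance library would do

SIMILARITY_THRESHOLD = 0.4  # "similarity below 40%" as described above

def similarity(a, b):
    # Normalised similarity assumed as 1 - distance / length of the longer sequence.
    return 1.0 - Levenshtein.distance(a, b) / max(len(a), len(b))

def filter_redundant(candidates, cb513_seqs):
    kept = []
    for seq in candidates:
        if any(similarity(seq, ref) >= SIMILARITY_THRESHOLD for ref in cb513_seqs):
            continue  # too similar to a CB513 sequence
        if any(similarity(seq, ref) >= SIMILARITY_THRESHOLD for ref in kept):
            continue  # too similar to a sequence already retained
        kept.append(seq)
    return kept
```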
As a last step, we omit any chains which do not have entries in the CATH database as of 16th December 2021, ruling out roughly 1,500 samples from the final dataset. We make the CATH data of all samples in PS4 available alongside the dataset, in case future research is able to leverage that structural data for improved performance on secondary structure prediction or related tasks. We ultimately chose to focus the scope of this work purely on prediction from the single sequence, and did not find great merit in early attempts to include domain classification within the secondary structure prediction pipeline; as such, this restriction will not be enforced for community additions to PS4.

The main secondary structure data is made available as a CSV file, which is 8.2 MB in size. The supplemental CATH data is a 1.3 MB file in pickle format, mapping chain ID to a list of domain boundary residue indices and the corresponding four-integer CATH classification. Finally, a file in compressed NumPy format (Harris et al., 2020) maps chain IDs to the training or validation set, according to the split used in our experiments.
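For orientation, the three released files could be loaded along the following lines. The file names, CSV columns and array keys here are hypothetical; the PS4 repository defines the actual ones.

```python
import pickle
import numpy as np
import pandas as pd

# Hypothetical file names; consult the PS4 repository for the real ones.
ss = pd.read_csv("ps4_secondary_structure.csv")       # chain ID, sequence, per-residue labels, first residue number
with open("ps4_cath.pkl", "rb") as f:
    cath = pickle.load(f)                              # chain ID -> domain boundaries + four-integer CATH class
split = np.load("ps4_split.npz", allow_pickle=True)    # chain IDs assigned to the train / validation sets
train_ids, val_ids = split["train"], split["validation"]
```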
2.2. Experimental Evaluation

We conduct experiments to validate the PS4 dataset's suitability for use in secondary structure prediction tasks. We train two models, each based on a different neural network algorithm, to predict eight-class secondary structure given single-sequence protein input. The models are trained only on the PS4 training set and then evaluated on both the PS4 test set and the CB513 test set. We avoid using any training data from the CB513 training set, meaning evaluation on its test set is conducted in a zero-shot setting.
We also do not provide surrounding ground truth data at evaluation time for those samples in the CB513 test set which are masked subsequences of a training sample, but rather predict the secondary structure for the whole sequence at once.

Both models make use of the pretrained, open-source, encoder-only version of the ProtT5-XL-UniRef50 model by Elnaggar et al. (2022) to generate input embeddings from the initial sequence. Our algorithms are composable such that any protein language model could be used in place of ProtT5-XL-UniRef50, opening up an avenue for potential future improvement; however, our choice was in part governed by a desire to maximise accessibility. We use the half-precision version of the model made publicly available by Elnaggar et al. (2022), which can fit in just 8 GB of GPU RAM. As such, our entire training and inferencing pipeline can fit on a single GPU.

The protein language model generates an N × 1024 encoding matrix, where N is the number of residues in the given protein chain. Our two models differ in the neural network architecture used to form the classifier component of our overall algorithm, which generates secondary structure predictions from these encoding matrices. Our first model, which we call PS4-Mega, leverages 11 moving average equipped gated attention (Mega) encoder layers (Ma et al., 2022) to compute a final encoding, which is then passed to an output affine layer and softmax to generate a probability distribution over the eight secondary structure classes.
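To make the classifier interface concrete, the sketch below maps an N × 1024 embedding matrix to per-residue Q8 probabilities. It uses a plain convolutional stack as a stand-in for the Mega-based and convolution-based classifiers described here; the hidden sizes and kernel widths are illustrative assumptions, not the actual PS4-Mega or PS4-Conv architectures.

```python
import torch
import torch.nn as nn

class SSClassifier(nn.Module):
    """Per-residue Q8 head: (N, 1024) language-model embeddings -> (N, 8) probabilities.

    A simple 1D-convolutional stand-in; layer sizes are illustrative assumptions.
    """
    def __init__(self, embed_dim=1024, hidden=256, n_classes=8):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(embed_dim, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
        )
        self.out = nn.Linear(hidden, n_classes)   # output affine layer

    def forward(self, emb):                        # emb: (N, 1024)
        h = self.convs(emb.T.unsqueeze(0))         # -> (1, hidden, N)
        logits = self.out(h.squeeze(0).T)          # -> (N, 8)
        return torch.softmax(logits, dim=-1)       # per-residue class probabilities
```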
Fig. 2: Overview of the neural network meta-architecture used in our experiments. The protein language model here refers to an encoder-only, half-precision version of ProtT5-XL-UniRef50, while the SS classifier is either a Mega-based or convolution-based network. We precompute the protein encodings generated by the pretrained language model, reducing the computations necessary during training to only the forward and backward passes of the classifier.

3. Implementation

3.1. Model Training

All neural network training, evaluation and inference logic is implemented using PyTorch (Paszke et al., 2019). We train both models for 30 epochs, using the Adam optimiser (Kingma and Ba, 2014) with 3 epochs of warmup and a batch size of 1, equating to 53,397 warmup steps. Both models increase the learning rate from 10⁻⁷ to a maximum value of 0.0001, chosen by conducting a learning rate range test (Smith and Topin, 2017), during the warmup phase, before gradually reducing back to 10⁻⁷ via cosine annealing.
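The schedule described above, a linear warmup from 10⁻⁷ to 10⁻⁴ over 53,397 steps followed by cosine annealing back down, can be sketched with a PyTorch LambdaLR as follows; the exact shape of the warmup used by the authors may differ.

```python
import math
import torch

WARMUP_STEPS = 53_397        # 3 epochs x 17,799 training samples at batch size 1
TOTAL_STEPS = 30 * 17_799    # 30 epochs
MIN_LR, MAX_LR = 1e-7, 1e-4

def lr_factor(step):
    # Multiplier applied to MAX_LR: linear warmup, then cosine decay back to MIN_LR.
    floor = MIN_LR / MAX_LR
    if step < WARMUP_STEPS:
        return floor + (1.0 - floor) * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(1024, 8)                       # placeholder for the SS classifier
optimiser = torch.optim.Adam(model.parameters(), lr=MAX_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimiser, lr_factor)
# call scheduler.step() after each optimiser.step()
```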
The input embeddings from ProtT5-XL-UniRef50 are precomputed, requiring roughly one hour to generate for the entire PS4 dataset on GPU. Hence only the weights of the classifier component are updated by gradient descent, while the encoder protein language model maintains its original weights from pretraining. For convenience and extensibility, we make a script available in our repository to generate these embeddings from any FASTA file, allowing for predictions on novel proteins.
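A script of the kind mentioned here might look roughly as follows. The Hugging Face checkpoint name and the residue preprocessing follow the publicly documented ProtT5 usage, and are assumptions about what the repository's script actually does.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

CKPT = "Rostlab/prot_t5_xl_half_uniref50-enc"   # publicly released half-precision encoder
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained(CKPT, do_lower_case=False)
model = T5EncoderModel.from_pretrained(CKPT).to(device).eval()
if device == "cuda":
    model = model.half()   # the checkpoint is intended for half-precision inference

def read_fasta(path):
    seqs, name = {}, None
    for line in open(path):
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name:
            seqs[name] += line
    return seqs

@torch.no_grad()
def embed(sequence):
    # Map rare amino acids to X and space-separate residues, as ProtT5 expects.
    prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
    ids = tokenizer(prepared, return_tensors="pt").to(device)
    hidden = model(**ids).last_hidden_state[0]
    return hidden[: len(sequence)]   # (N, 1024); drops the trailing special token
```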
PS4-Mega has 83.8 million parameters and takes roughly 40 minutes per epoch when training on a single GPU. PS4-Conv is much more lightweight, with just 4.8 million parameters, and requires only 13.5 minutes per epoch on GPU. Both models are trained in full precision. PS4-Mega obtains training and test set accuracies of 99.4% and 78.2% respectively on the PS4 dataset for Q8 secondary structure prediction, and 76.3% on CB513. PS4-Conv performs almost as well, obtaining 93.1% training and 77.9% test set accuracy on PS4, and 75.6% on CB513. Furthermore, both algorithms show an improvement over the state of the art for Q3 accuracy, as shown in Table 2.
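Q3 figures such as those in Table 2 are obtained by collapsing the eight DSSP states into helix, strand and coil. The sketch below uses the common reduction (H, G, I to helix; E, B to strand; everything else to coil); the paper does not spell out its exact mapping, so treat it as an assumption.

```python
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands
    "T": "C", "S": "C", "-": "C",   # everything else treated as coil
}

def q3_accuracy(pred_q8, true_q8):
    # Fraction of residues whose predicted 3-state label matches the ground truth.
    matches = sum(
        Q8_TO_Q3.get(p, "C") == Q8_TO_Q3.get(t, "C") for p, t in zip(pred_q8, true_q8)
    )
    return matches / len(true_q8)
```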
3.2. Dataset Extension

Our initial dataset preprocessing, from which the secondary structure CSV was obtained, was implemented in the Rust programming language. Filtering through over 160,000 proteins in Python was prohibitively slow. In particular, performing that many string comparisons to verify non-redundancy runs in O(n²) time complexity, and given a large value of n as seen in our case,
W. Wang, Z. Peng, and J. Yang. Single-sequence protein structure prediction using supervised transformer protein language models. Nature Computational Science, 2(12):804–814, 2022.
Y. Yang, J. Gao, J. Wang, R. Heffernan, J. Hanson, K. Paliwal, and Y. Zhou. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Briefings in Bioinformatics, 19(3):482–494, May 2018. ISSN 1477-4054 (Electronic); 1467-5463 (Print); 1467-5463 (Linking). doi: 10.1093/bib/bbw129.
W. Zheng, Q. Wuyun, and P. L. Freddolino. D-I-TASSER: Integrating deep learning with multi-MSAs and threading alignments for protein structure prediction. 15th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction, December 2022.