Harry Scholes Thesis Computable Function Prediction
Harry Scholes
I, Harry Scholes, confirm that the work presented in this thesis is my own. Where
information has been derived from other sources, I confirm that this has been indicated in
the work.
Abstract
Proteins are biological machines that perform the majority of functions necessary for life.
Nature has evolved many different proteins, each of which performs a subset of an organism’s functional repertoire. One aim of biology is to solve the sparse, high-dimensional
problem of annotating all proteins with their true functions. Experimental characterisa-
tion remains the gold standard for assigning function, but is a major bottleneck due to
resource scarcity. In this thesis, we develop a variety of computational methods to predict
protein function, reduce the functional search space for proteins, and guide the design of
experimental studies. Our methods take two distinct approaches: protein-centric methods
that predict the functions of a given protein, and function-centric methods that predict
which proteins perform a given function. We applied our methods to help solve a number
of open problems in biology. First, we identified new proteins involved in the progression
of Alzheimer’s disease using proteomics data of brains from a fly model of the disease.
Second, we predicted novel plastic hydrolase enzymes in a large data set of 1.1 billion pro-
tein sequences from metagenomes. Finally, we optimised a neural network method that
extracts a small number of informative features from protein networks, which we used to
predict functions of fission yeast proteins.
Impact statement
Second, we mined metagenomes for putative plastic hydrolase enzymes that break
plastic polymers down into their constituent monomers. Virgin plastics are produced from
non-renewable petroleum sources, with a large carbon footprint. Despite efforts to in-
crease the amount that is recycled, only 14% of the plastic produced each year is collected
for recycling. This is largely because the process is uneconomic, but also because recycled
plastics have inferior properties. In the future, plastic hydrolases may form the fulcrum of
a profitable, circular plastic economy, where high-quality recycled plastics are regenerated
from used plastics in an energy efficient process. Currently, many unrecycled plastics go to
landfill, but some end up as pollutants, either by being incinerated, which further increases
the carbon footprint, or by polluting the environment directly. Plastic hydrolases, and the
bacteria that use these enzymes to metabolise plastic, may, one day, help to clean up the
environmental plastic pollution that affects every continent and ocean on Earth. Taken
together, plastic hydrolases may have a profound impact on the health of organisms and
the environment globally.
Finally, we predicted the functions of proteins from a species of yeast that is an impor-
tant model for understanding cellular processes in higher eukaryotes, including humans.
Predicted functions help guide the design of experimental studies at the microscopic level,
to confirm or refute predicted functions; at the mesoscopic level, to disentangle the inter-
play between proteins and pathways to determine their cellular effects; and at the macro-
scopic level, to identify the causes of diseases and how to treat or prevent them. Beyond
basic science, predicted functions have a broad range of commercial applications, includ-
ing, but not limited to, synthetic biology, radical life extension and astrobiology.
Protein function prediction is, and will remain, an important research topic with fruit-
ful applications in both the public and private sectors.
For my parents, Siân and Tim,
and my partner, Gal.
Acknowledgements
I would particularly like to thank Christine Orengo, my supervisor, for her guidance, en-
couragement and understanding throughout my PhD.
I would also like to thank my other supervisors, Jürg Bähler, Rob Finn, Jon Lees, John
Shawe-Taylor and Kostas Thalassinos for helping to shape my research projects.
The work in this thesis would not have been possible without the incredible contribu-
tions from my collaborators, Nico Bordin, Adam Cryar, Sayoni Das, Fiona Kerr, Clemens
Rauer, Maria Rodriguez Lopez and Ian Sillitoe.
I would also like to extend my thanks to all members of the CATH group, past and
present, including Mahnaz Abbasian, Tolu Adeyelu, Sebastian Applewhite, Paul ‘Ash’ Ash-
ford, Joseph Bonello, Sean Le Cornu, Natalie Dawson, Su Datt Lam, Tony Lewis, Millie
Pang, Neeladri Sen, Vaishali Waman and Laurel Woodridge.
I am grateful to Rob Finn and Janet Thornton for hosting me at the European Bioin-
formatics Institute, and to Alex Almeida, Martin Hölzer, Sara Kashaf, Alex Mitchell, Lorna
Richardson, Paul Saary and all members of the Metagenome Informatics and Sequence
Families teams for making my time there so enjoyable.
My heartfelt appreciation goes to my partner, Gal Horesh, for her loving support,
boundless encouragement and brilliant scientific ideas, which kept me going over the past
four years.
Special thanks to my friends, Eddie Anderton, Emma Elliston and Archie Wall.
I would like to recognise my undergraduate tutors at Oriel College, Max Crispin,
Lynne Cox and Shona Murphy, without whom I would not have embarked on this journey.
Finally, I am grateful to Wellcome for funding me.
Thanks also to my examiners, Mark Wass and Nick Luscombe, for reading my thesis
so thoroughly, asking interesting questions and making the viva so enjoyable.
Contents
List of figures
List of tables
List of abbreviations
List of publications
1 Introduction
1.1 Protein function prediction
1.2 The pre-bioinformatic age
1.3 The dawn of bioinformatics
1.3.1 Sequence homology
1.3.2 Hidden Markov models
1.3.3 Structure homology
1.3.4 CATH
1.3.5 Symbolic representations of functions
1.3.6 Benchmarking performance
1.4 Machine learning
1.4.1 Artificial neural networks
1.4.2 Encoder-decoders
1.4.3 Autoencoders
1.4.4 Convolutional neural networks
1.4.5 Recurrent neural networks
1.4.6 Random forests
1.4.7 Support vector machines
1.5 Modern applications of machine learning to protein data
1.5.1 Protein networks
1.5.2 Protein sequences
References
List of figures
1.1 Anatomy of a hidden Markov model (HMM) that models the sequence dependencies of a 5’ splice site in DNA
1.2 cath-resolve-hits resolves the domain boundaries of multiple ‘Input Hits’ to an optimal subset of non-overlapping ‘Resolved Hits’ that form the MDA
1.3 An overview of the GeMMA and FunFHMMer algorithms
1.4 GroupSim predicts specificity-determining positions (SDPs) in alignments
1.5 An overview of the GARDENER algorithm
1.6 The Gene Ontology
1.7 General architecture of an artificial neural network
1.8 Architecture of a general encoder-decoder model
1.9 deepNF overview
1.10 word2vec neural network architecture
1.11 doc2vec neural network architecture
Abbreviation Phrase
Aβ amyloid beta
Aβ42 42 amino acid long amyloid beta
ABH α/β hydrolase
AD Alzheimer’s disease
ANN artificial neural network
APP Aβ precursor protein
AUPR area under the precision-recall curve
BIC Bayesian information criterion
BLAST basic local alignment search tool
BP biological process
CAFA Critical Assessment of Functional Annotation
CART classification and regression tree
CC cellular component
CCS collision cross section
CNN convolutional neural network
DAG directed acyclic graph
deepNF deep network fusion
DIA data-independent acquisition
DOPS diversity of position scores
E-value expect value
EC Enzyme Commission
ESI electrospray ionisation
• Sillitoe, I., Dawson, N., Lewis, T. E., Das, S., Lees, J. G., Ashford, P., Tolulope, A.,
Scholes, H. M., Senatorov, I., Bujan, A., Rodriguez-Conde, F. C., Dowling, B., Thorn-
ton, J. & Orengo, C. A. “CATH: expanding the horizons of structure-based functional
annotations for genome sequences.” Nucleic Acids Research (2019)
• Scholes, H. M., Cryar, A., Kerr, F., Sutherland, D., Gethings, L. A., Vissers, J. P. C.,
Lees, J. G., Orengo, C. A., Partridge, L. & Thalassinos, K. “Dynamic changes in the
brain protein interaction network correlates with progression of Aβ42 pathology in
Drosophila.” Scientific Reports (2020)
• Lam, S. D., Bordin, N., Waman, V. P., Scholes, H. M., Ashford, P., Sen, N., van Dorp,
L., Rauer, C., Dawson, N. L., Pang, C. S. M., Abbasian, M., Sillitoe, I., Edwards, S.
J. L., Fraternali, F., Lees, J. G., Santini J. M. & Orengo, C. A. “SARS-CoV-2 spike
protein predicted to form stable complexes with host receptor protein orthologues
from mammals.” Scientific Reports (2020)
• Das, S., Scholes, H. M., Sen, N. & Orengo, C. A. “CATH functional families predict
functional sites in proteins.” Bioinformatics (2020)
• Sillitoe, I., Bordin, N., Dawson, N., Waman, V. P., Ashford, P., Scholes, H. M., Pang,
C. S. M., Woodridge, L., Sen, N., Abbasian, M., Le Cornu, S., Lam, S. D., Berka, K.,
Hutařová Varekova, I., Svobodova, R., Lees, J. G. & Orengo, C. A. “CATH: increased
structural coverage of functional space.” Nucleic Acids Research (2021; in press)
A theory is something nobody believes, except the person
who made it. An experiment is something everybody
believes, except the person who made it.
Albert Einstein
Chapter 1
Introduction
tions, such as the behaviour of multiple organisms, from the same, or different, species.)
Lower-level biochemical functions can contribute to physiological functions on higher lev-
els [2]. Disruption of lower-level functions, for example through mutation, can cause catastrophic organismal phenotypes, as in the case of cancer. Therefore, an appropriate
vocabulary must be used when referring to function.
tions, mechanics and image processing. Within bioinformatics, protein function prediction
has predominantly relied on statistics, machine learning and graph theory to detect signals
and patterns in biological data.
The holy grail of protein function prediction is to be able to predict functions ab initio.
To do so will require an almost complete molecular description of life, which we appear
to be a long way away from. For now, protein functions are predicted transitively. For
example, given two proteins P1 and P2 that are sufficiently similar, let F be a function
that has been experimentally assigned to P1 . By the transitive property, F can also be
assigned to P2 . This process is referred to using different names, including, but not limited
to, ‘homology transfer’ and ‘guilt by association’.
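As a concrete illustration, homology transfer reduces to a very simple rule. The Python sketch below renders this idea; the function names, the `percent_identity` helper and the 60% threshold (discussed below) are illustrative stand-ins, not a real pipeline, which would compute identity from sequence alignments and draw annotations from curated databases.

```python
def transfer_annotations(query, annotated, percent_identity, threshold=60.0):
    """Predict functions for a query protein P2 by copying the functions F of
    every sufficiently similar, experimentally characterised protein P1."""
    predicted = set()
    for protein, functions in annotated.items():
        if percent_identity(query, protein) >= threshold:
            predicted |= functions  # transfer F from P1 to P2
    return predicted
```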
In biology, homology refers to similar features that have a common ancestry and are
inherited through evolution. A pair of proteins are said to be homologous if they share a
common ancestor. In such cases, transferring F from P1 to P2 may be appropriate if P1
and P2 are homologous. Generally speaking, > 60% sequence identity—the proportion
of identical residues between a pair of proteins—is required for conservation of function,
as measured by being able to transfer Enzyme Commission (EC) functional annotations
entirely with ≥ 90% accuracy [7, 8].
However, it is not always safe to transfer functions in such ways. Homologous proteins can be subdivided into orthologues—sequences that diverge following a speciation event—and paralogues—sequences that arise by duplication within a genome and subsequently diverge in function. Therefore, similar sequences do not
necessarily have similar structures or functions [1, 9], so care must be taken when trans-
ferring functions between homologous proteins [10, 11].
The ‘orthologue conjecture’ posits that “orthologous genes share greater functional
similarity than do paralogous genes” [12], so inheritance of function is safer between or-
thologues than paralogues [13]. This claim became largely accepted within the biological
community, without substantive evidence to back it up. Subsequently, various studies
have attempted to test the conjecture. One study found that, after controlling for multi-
ple biases in function annotation data sets, there is a weak effect that orthologues have
more similar functions than paralogues [14]. Another study found that orthologues and
paralogues evolve and diverge at similar rates, so orthologues may be as likely to have
different functions as paralogues [15]. Recent work shows that there are actually two or-
thologue conjectures: one being a statement about the evolution of function—which is
difficult to test—whilst the other is a statement about the prediction of function [16]. The
authors found that function prediction is improved when the amount of data is maximised
by using both orthologue and paralogue data [16].
Furthermore, biological data is noisy and incomplete. F may not actually be a func-
tion of P1 (a false positive). If F is not a function of P1 , it might, or might not, be a function
of P2 . If care is not taken, errors can be easily propagated through biological databases.
These errors are notoriously difficult to correct.
To allay fears about the rather large and powerful elephant in the room, we focus on
machine learning methods in Section 1.4 and on the application of these methods to protein
function prediction in Section 1.5. For now, we focus on classical methods to annotate
proteins with functions. Overwhelmingly, the most popular method for protein function
prediction is by sequence and structure homology, which we explain next.
1.3.1 Sequence homology
Sequence homology methods for protein function prediction rely on identifying related
proteins whose functions have already been characterised [1, 3, 11, 13]. Typically, se-
quence alignment and sequence clustering methods are exploited to cluster proteins into
evolutionarily related groups.
Basic local alignment search tool (BLAST) [17] is an incredibly popular method to
search a sequence database with a query sequence to find similar sequences. BLAST first
divides the query sequence up into short k-mers, or words. Each target sequence in the
database is searched for high-scoring matches to these words, using the BLOSUM62 sub-
stitution matrix. Exact matches are extended to form high-scoring segment pairs between
the query and target sequences, which are retained only if their score is high enough. Two
or more high-scoring segment pairs are then joined to make longer alignments using the
Smith-Waterman local sequence alignment algorithm. Target sequence hits are returned
by BLAST, and any annotations that they contain can be transferred to the query sequence
to predict its function. BLAST became, and remains, popular due to its advanced statistical
framework, based on the expect value (E-value). Given a database of a particular size and
a query sequence, the E-value of a hit measures the number of matches that one can expect to find in the database by chance with a score at least as high. The lower the E-value,
the more significant a match is. As such, the E-value can be used as a threshold to filter
out the random background noise when searching a database.
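The relationship between scores and E-values can be made concrete with the standard Karlin–Altschul formulae. The sketch below is a simplification: BLAST actually uses effective query and database lengths with edge-effect corrections, and the statistical parameters passed as arguments depend on the scoring system.

```python
import math

def evalue_from_raw_score(raw_score, query_len, db_len, lam, K):
    # Karlin-Altschul statistics: E = K * m * n * exp(-lambda * S)
    return K * query_len * db_len * math.exp(-lam * raw_score)

def evalue_from_bit_score(bit_score, query_len, db_len):
    # In bit-score units the parameters cancel: E = m * n * 2^(-S')
    return query_len * db_len * 2.0 ** (-bit_score)
```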
Proteins are composed of one or more domains, often arranged as a linear sequence
in the primary structure. Domains are protein structure units that are stable and can fold
independently of the wider protein context. Combinations of domains can give rise to
novel functions [5, 6]. The linear arrangement of domains in protein sequences—known
as the multi-domain architecture (MDA)—and the organisation of these domains in 3D
space, are determinants of function [20–23]. Sequence-based domain resources, such as
Pfam [24], organise domains into homologous domain families. In the larger Pfam families,
functions of members can be highly divergent between paralogues and distant orthologues
[11].
1.3.2 Hidden Markov models
Functions can be transferred within a protein domain family, from one member whose function has been characterised, to all other members of the family. Multiple sequence alignments (MSAs) can be used to infer the phylogeny that relates domains into domain families through evolutionary time. The sequence diversity in an MSA can be represented using a hidden Markov model
(HMM) [25, 26]. HMMs are statistical models of Markov processes, defined as processes
where the next state of a system depends only on its current state.
Given an MSA, the transitions between states in the HMM are trained using the ob-
servable distribution of amino acid, gap and deletion probabilities in the MSA (Fig. 1.1). A
sequence can be scanned against a family’s HMM to score whether the sequence matches
the family, and therefore whether the sequence is evolutionarily related to the sequences
in the family.
Figure 1.1: Anatomy of a hidden Markov model (HMM) that models the sequence depen-
dencies of a 5’ splice site in DNA. States for the exon (E), 5’ splice site and intron (I)
are shown. Transition probabilities determine the paths that are allowed between the
states, with one or more nucleotide in the exon, followed by the 5’ splice site at a G or
A nucleotide, followed by one or more nucleotides in the intron. Emission probabilities
are shown above the states that model the nucleotide composition of sequences in the
three states. Nucleotides are emitted at every state visited on a path through the model
from Start to End. Potential paths through the model for the ‘Sequence’ are shown in the
‘Parsing’ section, where the 5’ splice site state is a G or A. Log probabilities of paths are
calculated by multiplying all transition and emission probabilities together and taking
the logarithm. The most likely path is the path with the highest probability, in this case
log P = −41.22, which corresponds to the true ‘State path’ for the sequence. Figure
adapted from [26].
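The calculation described in the caption is straightforward to express in code. The sketch below scores one state path through a toy splice-site HMM; the transition and emission probabilities are assumptions loosely based on Fig. 1.1, not the exact published values.

```python
import math

# Toy parameters loosely based on Fig. 1.1; the exact values are assumptions.
emission = {
    'E': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25},  # exon
    '5': {'A': 0.05, 'G': 0.95},                        # 5' splice site
    'I': {'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},      # intron
}
transition = {
    ('start', 'E'): 1.0,
    ('E', 'E'): 0.9, ('E', '5'): 0.1,
    ('5', 'I'): 1.0,
    ('I', 'I'): 0.9, ('I', 'end'): 0.1,
}

def path_log_prob(sequence, path):
    """Log probability of one state path: the log of the product of all
    transition and emission probabilities along the path."""
    logp = math.log(transition[('start', path[0])])
    for i, (state, base) in enumerate(zip(path, sequence)):
        if i > 0:
            logp += math.log(transition[(path[i - 1], state)])
        logp += math.log(emission[state][base])
    logp += math.log(transition[(path[-1], 'end')])
    return logp

# The most likely parsing of a sequence is the path maximising this quantity.
path_log_prob('CAG', ['E', '5', 'I'])
```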
HMMER [27] and HHsuite [28] are the de facto standard tools to build HMMs and scan
sequences against them. HHsuite can also perform pairwise HMM alignment [29], which
can be thought of as comparing two MSAs. The jackhmmer program in HMMER, akin to
PSI-BLAST, is an iterative search tool. Given a single query sequence and a target sequence
database, an HMM is built from the sequence using a position-independent substitution
matrix, such as BLOSUM62. Sequences in the target database are scanned against the
HMM to identify hits, which are then aligned and used to build a new HMM. The whole
process can be repeated for a few iterations to pull in ever more distantly related sequences.
1.3.3 Structure homology
We saw in the previous sections on sequence homology and HMMs that the sequence of
amino acids—the primary protein structure—can be used to predict function. Therefore,
sequence-function relationships exist. In this section, we extend this concept to structure-
function and sequence-structure-function relationships [3, 11, 13].
The primary protein structure can fold, which allows proteins to adopt higher order
structures with many degrees of freedom. A 100 residue protein, with 99 peptide bonds,
has approximately 3^198 ≈ 3 × 10^94 stable φ and ψ bond angle conformations [30]. The
polypeptide chain folds into local secondary structures [31], which arrange in space rela-
tive to each other in the tertiary structure. Multiple polypeptide chains can interact with
each other in space to form the quaternary structure.
Higher order protein structures can be used to predict protein function [3, 13]. The
protein structure imparts knowledge about the 3D arrangement of amino acids to form
functional sites in proteins, such as catalytic sites, regulatory motifs and allosteric sites
[2].
Structure is more conserved than sequence [32], which is perhaps to be expected,
given the evolutionary need for proteins to be able to form stable arrangements in three-
dimensional space. Distantly related proteins, whose sequences have diverged consid-
erably, so that they share lower sequence similarity than can be reliably detected using
sequence comparison methods, may still be identified as homologues by comparison of
their structures [11]. Therefore, structural homology can be used to predict protein func-
tion. Conservation of structure between two proteins can be identified by measuring the
root-mean-square deviation (RMSD) between equivalent atomic positions in two protein
structures aligned in three-dimensional space.
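For already-superposed structures, RMSD is a one-line computation, as in the minimal NumPy sketch below. It assumes the superposition has been done; a full comparison would first find the optimal rigid-body fit, for example with the Kabsch algorithm.

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two (N, 3) arrays of equivalent
    atomic positions, assumed to be already aligned in 3D space."""
    diff = coords_a - coords_b
    return np.sqrt((diff ** 2).sum() / len(coords_a))
```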
CATH [33] (introduced in Section 1.3.4) and SCOP [34] classify domains into evolutionarily related structural families.
1.3.4 CATH
CATH [33, 38, 39] classifies protein domain structures into evolutionarily related families,
arranged in a four-level hierarchical taxonomy:
1. Class: secondary structure, all alpha-helical, all beta-sheet, mixed alpha-helical and
beta-sheet, or little secondary structure.
2. Architecture: global arrangement of secondary structure elements into the tertiary
structure.
3. Topology/fold: specific arrangement of secondary structure elements.
4. Homologous superfamily: evidence of evolutionary relatedness of domains.
The current version of CATH (v4.2) contains 6,119 superfamilies. To identify these
groups, protein structures are aligned using SSAP (structure and sequence alignment pro-
gram) [40]. SSAP employs double-dynamic programming to extend the concept behind
Needleman and Wunsch’s sequence alignment algorithm [41] to 3D protein structures. As
such, the double-dynamic programming method is guaranteed to find the optimal align-
ment of any two given proteins. The algorithm was subsequently modified to align struc-
tural motifs [42], akin to Smith and Waterman’s adaptation of the global sequence align-
ment algorithm to find local matches [43].
Crucial to its success is the way that SSAP represents protein structures. Instead of
using the letters of the amino acid alphabet, SSAP uses interatomic vectors between the Cβ
atoms of all residues, except glycines. The inclusion of positional information proved key
to SSAP’s success over contemporary methods that only considered interatomic distances
[44]. Furthermore, interatomic vectors proved to be necessary and sufficient to align pro-
tein structures [45]. Inclusion of additional structural information improved alignments
in only the most challenging cases [45].
In 1993, there were structures of 1,800 chains in the PDB. Because ∼ 50 new struc-
tures were being deposited each month, it became obvious that an automated method
would be needed to identify protein folds and classify proteins into protein fold families.
SSAP was initially used to align proteins containing globin domains that have very low
sequence similarity, but were found to conserve the same domain structure [40]. SSAP was
subsequently used to identify families of protein folds in groups of proteins with similar
sequences [46]. These families established the foundations of CATH [38, 39].
CATH can be used for protein function prediction in a number of ways. Firstly,
CATHEDRAL [47] can be used to align a query structure to structure representatives from
each superfamily. Functions of proteins within any matching superfamilies can be trans-
ferred. However, this classification is usually too broad to provide fine-grained functions.
Secondly, CATH-Gene3D can be used to assign a protein sequence to CATH superfamilies
(see Section 1.3.4.1). Finally, CATH-FunFams can be used to assign a protein sequence to
FunFams (see Section 1.3.4.2). Next, we introduce these two methods, Gene3D and FunFams.
1.3.4.1 Gene3D
Gene3D [21, 48, 49] allows CATH superfamily domains to be predicted in protein se-
quences, using sequence information alone. Gene3D exploits protein sequence-structure
relationships by representing the sequence diversity of each CATH superfamily as a
set of one or more HMMs. Sequences are scanned against these HMMs and assigned
to any matching superfamilies. Interestingly, Gene3D is a hybrid sequence-structure-
homology method because structures are used to define the domain boundaries of domain
sequences. The equivalent functionality of Gene3D is provided to SCOP via the SUPER-
FAMILY database [50].
In any given protein sequence, there may be many overlapping matches to S95 models (HMMs built from superfamily sequences clustered at 95% identity) from the same, or different, CATH superfamilies. cath-resolve-hits [52] is used to re-
duce the set of matches to an optimal subset of non-overlapping matches (Fig. 1.2). The
algorithm uses dynamic programming to deterministically find the set of domains that
maximises the sum of the bit scores.
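The optimisation that cath-resolve-hits performs is, at its core, weighted interval scheduling. The sketch below shows that core dynamic programme for contiguous hits; it is not the actual implementation, which also handles discontiguous domains and other refinements. Intervals are treated as half-open, so one hit may start where another stops.

```python
from bisect import bisect_right

def resolve_hits(hits):
    """Select a non-overlapping subset of (start, stop, bitscore) hits that
    maximises the sum of the bit scores (weighted interval scheduling)."""
    hits = sorted(hits, key=lambda h: h[1])       # order hits by stop position
    stops = [h[1] for h in hits]
    best = [0.0] * (len(hits) + 1)                # best[i]: optimum over first i hits
    choice = [None] * (len(hits) + 1)
    for i, (start, stop, score) in enumerate(hits, 1):
        j = bisect_right(stops, start, 0, i - 1)  # last hit ending at or before start
        if best[j] + score > best[i - 1]:
            best[i], choice[i] = best[j] + score, (i - 1, j)   # take hit i-1
        else:
            best[i], choice[i] = best[i - 1], None             # skip hit i-1
    resolved, i = [], len(hits)                   # trace back the chosen hits
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            idx, i = choice[i]
            resolved.append(hits[idx])
    return best[-1], resolved[::-1]
```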
1.3.4.2 FunFams
FunFams are groups of homologous proteins that perform the same function, or functions.
Specificity-determining positions (SDPs) are used to identify and group likely isofunctional proteins into FunFams. Sequences
in CATH superfamilies are sub-classified into fine-grained groups, according to sequence
alone. The current version of CATH (v4.2) contains 67,598 FunFams in 2,620 of the 6,119
superfamilies. Whilst FunFams are strictly a type of sequence-homology method, the se-
quences used to construct CATH-FunFams are from Gene3D hits, so are structural homo-
logues. FunFams were only made possible because Gene3D is now able to retrieve millions
of sequence homologues from UniProt. Having said this, FunFam is a general sequence-
based protocol that can be applied to any protein family. In addition to CATH-FunFams,
we also generate Pfam-FunFams from Pfam families.
Figure 1.2: cath-resolve-hits resolves the domain boundaries of multiple ‘Input Hits’ to an
optimal subset of non-overlapping ‘Resolved Hits’ that form the MDA. Figure
taken from https://cath-tools.readthedocs.io/en/latest/tools/cath-resolve-hits/.
Figure 1.3: An overview of the GeMMA and FunFHMMer algorithms. Scissors denote the
points where the FunFHMMer algorithm cuts the GeMMA tree. Specificity-determining
positions are identified using GroupSim (Fig. 1.4). Figure courtesy of Nicola Bordin.
Generally, we have noticed that the upper bound of starting clusters is 5,000 before
reaching the memory limits of our largest machines (with 3 TB memory). GeMMA
used to be applied to an entire superfamily [53]. As superfamilies have grown in
size, the protocol was changed so that GeMMA is run on subsets of proteins that
have the same MDA (see Section 1.3.4.3). Partitioning superfamilies by MDA makes
biological sense because the MDA is a determinant of function [20–22]. Proteins
with different MDAs are unlikely to be in the same FunFam, so it is reasonable to
segregate them ab initio.
3. FunFHMMer [54] determines the optimal partitioning of the GeMMA tree
into clades, each of which is a FunFam.
FunFHMMer operates on MSAs of leaves and internal nodes (Fig. 1.3). Diverse pro-
tein sequences are required for FunFHMMer to elucidate conservation patterns and
SDPs in MSAs. FunFHMMer traverses the GeMMA tree from the leaves towards
the root. Let vi and vj be two child nodes that are connected to a parent node vp . A
functional coherence index is calculated for vp to determine whether the tree should
be cut before vp (traversing in the direction from leaf to root). If the tree is cut before
vp , two FunFams F1 and F2 are produced, where F1 = vi and F2 = vj . Otherwise,
if the tree is subsequently cut after vp , vi and vj will form part of the same FunFam
F, where {vi ∪ vj} ⊆ F. The functional coherence index is powerful, generates func-
tionally pure FunFams and imbues FunFams with their predictive power. The index
considers three parameters:
• Information content of the MSA. Calculated using the diversity of position
scores (DOPS) from ScoreCons [55]. MSAs with DOPS > 70 are generally
considered to be sufficiently diverse.
• Proportion of predicted SDPs in an MSA. SDPs are predicted using Group-
Sim (Fig. 1.4), which calculates a score for each position in an MSA, whose
sequences are pre-assigned into two groups according to the sequences in the
child nodes [56]. The SDP-to-conserved position ratio determines whether the
tree is cut.
• Gaps in an MSA. Gaps in the parent alignment indicate that the child align-
ments were of different lengths. A multiplicative factor of 0 results if the num-
ber of gap positions is greater than the number of non-gap positions.
Some residues that are necessary for the function of a protein, such as catalytic
residues in the active site of an enzyme, may be highly conserved in a set of S90 clusters (sequence clusters at 90% identity). Whilst these residues may be predictive of the general function of a given
protein, they do not determine how the GeMMA tree is partitioned by FunFHMMer
into FunFams. Only differentially conserved residues determine the partitioning
of the tree, and therefore the functional specificity of FunFams. Highly-conserved
residues may also be useful for protein structure prediction, particularly for evolu-
tionary covariation techniques [57, 58].
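To give a flavour of differential conservation, the toy score below compares within-group conservation of a single MSA column against its overall conservation. It is a deliberately simplified stand-in for the actual GroupSim score [56]: positive values indicate SDP-like columns, whilst globally conserved or unconserved columns score near zero.

```python
from collections import Counter

def toy_sdp_score(column, group_of_row):
    """Toy differential-conservation score for one MSA column. `column` is a
    string of residues; `group_of_row[i]` assigns row i to group 0 or 1."""
    groups = [[], []]
    for i, residue in enumerate(column):
        groups[group_of_row[i]].append(residue)

    def conservation(residues):
        # Fraction of rows carrying the most common residue
        return Counter(residues).most_common(1)[0][1] / len(residues)

    within = (conservation(groups[0]) + conservation(groups[1])) / 2
    overall = conservation(groups[0] + groups[1])
    return within - overall   # high when groups are conserved but different

toy_sdp_score('KKKDDD', [0, 0, 0, 1, 1, 1])   # SDP-like column -> 0.5
toy_sdp_score('KKKKKK', [0, 0, 0, 1, 1, 1])   # globally conserved -> 0.0
```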
Figure 1.4: GroupSim predicts specificity-determining positions (SDPs) in alignments.
The FunFams generated by the previous step are known as the seed alignments. Seed
alignment sequences are then scanned against their corresponding HMM to assign
an inclusion threshold to each FunFam, which is the largest E-value for the worst
match. Full alignments for FunFams are generated using sequences from the super-
family that were in S90 clusters without an experimental GO term. These sequences
are scanned against the seed HMMs using the per FunFam inclusion thresholds. Se-
quences are assigned to any matching FunFam, but if sequences match to multiple
FunFams, they are assigned to the best matching FunFam.
1.3.4.3 GARDENER
The FunFam generation protocol has recently been improved in a new algorithm, GAR-
DENER, which performs two rounds of FunFamming, rather than the single round in the
original protocol set out above (Fig. 1.5). In the first round, GARDENER processes each
MDA partition separately using GeMMA and FunFHMMer to generate FunFams. These
initial FunFams, from all of the MDA partitions, are then pooled. Initial FunFams are then
treated as starting clusters for a second round of GeMMA and FunFHMMer. The FunFams
from the second round of FunFamming are used as FunFams for the superfamily. Note
that if a superfamily has a single MDA, then only one round of GeMMA and FunFHMMer
is performed, instead of two. GARDENER is advantageous because it allows over-split
FunFams and singleton FunFams, from the first round, to be merged in the second round.
GARDENER is explained in more detail in Section 3.2.10.
Figure 1.5: An overview of the GARDENER algorithm. Scissors denote the points where the
FunFHMMer algorithm cuts the GeMMA tree. Figure courtesy of Nicola Bordin.
1.3.5 Symbolic representations of functions
The EC classification arranges enzyme functions in a tree, in which child terms are strict descendants of one parent term. Child terms describe more specific functions that are a subset of the parent function. For example, the function of tripeptide aminopeptidases is described by the EC number “EC 3.4.11.4”, where the four levels correspond to the following (a short sketch of expanding this hierarchy follows the list):
• EC 3 hydrolases
• EC 3.4 peptidases
• EC 3.4.11 N-terminal exopeptidases
• EC 3.4.11.4 N-terminal exopeptidase for tripeptides
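Because each EC level strictly refines its parent, walking the hierarchy is a simple string operation, as the short sketch below illustrates (the helper name is illustrative).

```python
def ec_ancestors(ec_number):
    """Expand an EC number into its ancestors, from the most general class to
    the most specific: '3.4.11.4' -> ['3', '3.4', '3.4.11', '3.4.11.4']."""
    levels = ec_number.split('.')
    return ['.'.join(levels[:i + 1]) for i in range(len(levels))]
```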
7,936 different enzymatic functions have been described using the EC database. These
terms have been annotated 246,858 times to a total of 235,086 protein IDs from direct
experimental characterisation of an enzyme’s activity, reactants, products, cofactors and
specificities.
Whilst EC terms are focussed on enzyme function, the Gene Ontology (GO) [60]
is a more ambitious project than EC, in that it aims to characterise molecular function,
biological processes and subcellular location of proteins. So far, the GO describes 44,167
functions, across three disjoint namespaces: ‘biological process’ (BP), ‘molecular function’
(MF) and ‘cellular component’ (CC). Proteins from 4,728 species have been characterised
using a total of 8,047,744 GO annotations to 1,569,827 unique proteins. We use the GO
extensively in this thesis and attempt to increase the functional coverage of proteins by
‘Ontology’ is the branch of philosophy that deals with the nature of being and the many possible relationships between entities. Describing these relationships requires a more flex-
ible data structure than the tree used in EC. The GO uses a directed acyclic graph (DAG)
(Fig. 1.6), where edges in the graph are directed from child terms to parent terms. The
DAG is actually a multi-DAG (a DAG with multiple edge types), used to represent one-
to-many relationships between functions. Possible relationships in the GO DAG include
‘is a’, ‘part of’ and ‘regulates’. For example, ‘6-phosphofructokinase activity’ is part of
‘glycolytic process through fructose-6-phosphate’, which can be positively and negatively
regulated. Multiple levels of functions can be queried using the GO, for example, it can
be used to find all genes in a genome that are involved in signal transduction, or one can
focus only on the subset of genes that are tyrosine kinases.
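Such queries rely on propagating annotations up the DAG: annotating a protein with a term implies all of that term’s ancestors (the ‘true path rule’). A minimal sketch, using a made-up toy graph that collapses all edge types into one:

```python
# Toy parent graph with made-up term IDs; the real GO distinguishes edge
# types ('is a', 'part of', 'regulates'), which are collapsed here.
parents = {
    'GO:kinase': {'GO:catalytic_activity'},
    'GO:catalytic_activity': {'GO:molecular_function'},
    'GO:molecular_function': set(),
}

def ancestors(term, graph=parents):
    """All terms implied by annotating a protein with `term` (itself included)."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(graph[t])
    return seen
```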
Figure 1.6: The Gene Ontology. Subgraph of the Gene Ontology for ‘DNA damage checkpoint’
(GO:0000077). Figure taken from https://www.ebi.ac.uk/QuickGO/term/GO:0000077.
GO annotations are accompanied by evidence codes, categorised into groups, with functions assigned by: direct experiment (EXP, IDA, IPI, IMP,
IGI and IEP) and their high-throughput counterparts (HTP, HDA, HMP, HGI and HEP);
inferred phylogenetically (IBA, IBD, IKR and IRD) or computationally (ISS, ISO, ISA, ISM,
IGC and RCA); assigned by a curator (IC); automatically inferred annotations (IEA); and
low-quality, untrusted annotations without evidence to back them up (NAS, ND). These
groups are approximately ordered by how much confidence one would assign to an anno-
tation with that evidence code.
A number of other protein function databases are touched on in this thesis, including
the MIPS FunCat database [61, 62]; the Human Phenotype Ontology [63], used to anno-
tate proteins that cause phenotypic abnormalities in human diseases; and the Disorder
Ontology [64] that characterises intrinsically disordered proteins. FunCat was developed
during the initial sequencing of the Saccharomyces cerevisiae genome to describe yeast
protein function. Similar to the EC, MIPS terms are a controlled vocabulary arranged in
a three-layer hierarchy of increasing functional specificity. Despite MIPS being rather an
ancient resource, we use some of its annotations in Chapter 4 to benchmark our method
against two other methods that chose to use MIPS [65, 66]. Although we do not use the
Human Phenotype Ontology and the Disorder Ontology, we encounter them in Chapter 5
as part of the CAFA 4 evaluation of protein function prediction methods.
Now that we know how functions are represented computationally, next we introduce
how to compare the predictive performance of different methods on the same prediction
task.
1.3.6 Benchmarking performance
Many methods have been developed to predict protein function, each often claiming to
be the state-of-the-art. Such claims should always be treated with suspicion due to disin-
genuous science, overfitting on evaluation data, or luck. Transparent, public benchmarks
are required to evaluate different methods within a community. Protein structure pre-
diction first introduced the CASP [67] challenge in 1994 to compare structure prediction
methods on newly solved structures that were withheld from the community. Structure pre-
diction methods are also continuously evaluated by CAMEO [68]. The CAPRI challenge
benchmarks protein-protein docking for structure prediction [69, 70]. DREAM challenges
(http://dreamchallenges.org) are a set of various community evaluations of common bioin-
formatics, genomics and statistics methods, such as for network inference from expression
data [71].
Methods are evaluated blind, after a sufficient time delay, during which new, experimentally-
characterised functions are collected. Proteins are experimentally characterised in an un-
coordinated way by a decentralised network of scientists. To all intents and purposes, it
is impossible for any participant to cheat by colluding with experimental groups that may
have characterised new functions of a set of proteins. Therefore, it can be assumed that
no method has been trained using any of the annotations contained in the evaluation set.
As such, for a method to perform well on the evaluation set, it must have learnt general
patterns that link protein sequence to function from the training data.
Each team can enter predictions from three separate models, which are assessed for coverage and (semantic) precision using Fmax and Smin. Fmax is defined as

$$F_{\max} = \max_{\tau} \left\{ \frac{2 \cdot \mathrm{pr}(\tau) \cdot \mathrm{rc}(\tau)}{\mathrm{pr}(\tau) + \mathrm{rc}(\tau)} \right\}, \tag{1.1}$$

where

$$\mathrm{pr}(\tau) = \frac{TP}{TP + FP}, \qquad \mathrm{rc}(\tau) = \frac{TP}{TP + FN},$$

for true and false positive and negative predictions at prediction score threshold τ. Smin is defined as

$$S_{\min} = \min_{\tau} \left\{ \sqrt{\mathrm{ru}(\tau)^2 + \mathrm{mi}(\tau)^2} \right\}, \tag{1.2}$$

where the remaining uncertainty ru and misinformation mi, over the n_e evaluated proteins with predicted terms P_i(τ), true terms T_i and term information content ic(f), are

$$\mathrm{ru}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \notin P_i(\tau) \land f \in T_i\right)$$

$$\mathrm{mi}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \in P_i(\tau) \land f \notin T_i\right).$$
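As an illustration of Eq. 1.1, the sketch below computes Fmax for a toy set of term predictions. It pools true/false positive counts across proteins (a micro-average), whereas the CAFA evaluation averages precision over proteins with at least one prediction and propagates terms up the ontology first, so this is a simplification.

```python
def f_max(predictions, truths, thresholds):
    """`predictions` maps each protein to {term: score}; `truths` maps each
    protein to its set of true terms. Returns the maximum F1 over thresholds."""
    best = 0.0
    for tau in thresholds:
        tp = fp = fn = 0
        for protein, scored in predictions.items():
            predicted = {t for t, s in scored.items() if s >= tau}
            true = truths[protein]
            tp += len(predicted & true)
            fp += len(predicted - true)
            fn += len(true - predicted)
        if tp:
            pr, rc = tp / (tp + fp), tp / (tp + fn)
            best = max(best, 2 * pr * rc / (pr + rc))
    return best
```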
“For future CAFA experiments, it will therefore become even more important
to avoid ‘crowning winners’ (unless methods stand out by all means) and to
focus on method groups suited best for certain disciplines” [75].
1.4.1 Artificial neural networks
Artificial neural networks (ANNs) are computational systems that are inspired by the or-
ganisation of neurons in biological brains [76]. Neurons receive an input, which gets com-
bined with the internal state of the neuron to produce an output. Optionally, an activation
function can be applied to modulate the output in a non-linear way. Neurons are arranged
in layers, consisting of 10^2–10^4 neurons. ANNs derive their power from arranging mul-
tiple layers, each of which successively processes the input data to extract increasingly
specific features.
Although ANNs were invented in the 1940s, the field only started to gain mainstream traction this
millennium [77]. Three main factors drove the uptake: better algorithms; powerful and
affordable hardware; and the availability of large data sets. These advances hailed the ad-
vent of deep learning. From driverless cars to smart assistants to recommendation systems,
deep learning and ANN models are now pervasive in society and shape our everyday life.
The key text on artificial neural networks is Ian Goodfellow’s Deep Learning [78].
ANNs and deep learning have been reviewed in [77, 79] and their applications to bioinfor-
matics in [80–84]. Many more reviews of these topics have been written, but we believe,
subjectively, that the above references are the best and most relevant to this thesis.
Although beyond the scope of this thesis, we, as a society, must never lose sight of
the ethical implications of using this technology [85–87].
To understand how ANNs function as a whole, one must first understand their con-
stituent parts—of which there are many. Here, we introduce the components of ANNs
before outlining various classes of ANN architectures. In Section 1.5 we review recent
applications of ANNs within bioinformatics that allow biological data to be handled in
unique ways that are not possible with any other type of machine learning model.
1.4.1.1 Architecture
ANNs are composed of two or more layers, each of which processes the input in some way.
Deep learning uses ANNs that contain many hidden layers between the input and output
layers (Fig. 1.7). Each layer of neurons consists of weights (internal states of neurons)
and biases (trainable values that are added to each neuron’s internal state, akin to the constant b in the linear function y = ax + b). The width of a layer refers to the number of neurons it contains. Each fully-connected layer applies an affine transformation to its input,

W ∗ x + b,

where W are the weights and b is the bias term. Remarkably, an ANN (multi-layer per-
ceptron) with a single hidden layer of finite width is able to approximate any continuous function to arbitrary accuracy [88]! However, the required width of the hidden layer may be infeasibly large.
Instead of using one very wide hidden layer, multiple smaller hidden layers tend to be
stacked on top of each other.
Figure 1.7: General architecture of an artificial neural network. The input layer is processed
by two fully-connected hidden layers with four neurons each. The model outputs a
scalar number from a single neuron. Figure taken from https://medium.com/towards-
artificial-intelligence/artificial-neural-network-ship-crew-size-prediction-model-
c04017c7b6fa.
Why is layer stacking so effective? Applying multiple linear functions in series re-
sults in a linear function. Therefore, one might think that there is no added benefit to stack-
ing layers. However, the power of deep ANNs comes not from their ability to learn linear
functions, but from their ability to learn non-linear functions. To achieve non-linearity,
non-linear activation functions are applied to the output of each neuron. The sigmoid
(logistic) function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
has long been the workhorse of ANNs. Sigmoid neurons, however, suffer from the van-
ishing gradient problem, which is caused by the near-zero gradients of the sigmoid at large positive, or negative, values. More recently, the rectified linear unit (ReLU)
ReLU(x) = max(0, x)
and its variants mitigate this problem and perform better than sigmoid activations in
deeper networks. ReLU also has the benefit of being cheap to compute.
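The pieces above assemble into a forward pass of only a few lines. The NumPy sketch below is illustrative (random weights, arbitrary layer widths): an affine map W ∗ x + b followed by a ReLU at each hidden layer, with a linear output layer.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    """Forward pass through a stack of fully-connected layers; `layers` is a
    list of (W, b) pairs. The non-linear activation after each affine map is
    what lets the stack represent non-linear functions."""
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]
    return W @ x + b                                  # linear output layer

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 8)), np.zeros(4)),     # hidden layer, width 4
          (rng.normal(size=(1, 4)), np.zeros(1))]     # single output neuron
y = forward(rng.normal(size=8), layers)
```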
1.4.1.2 Learning
In a feedforward ANN, information flows forward through the network, through the hid-
den layers, to produce the output (Fig. 1.7). ANNs frame the learning process as an opti-
misation problem, where an objective function is optimised. Typically the objective is to
minimise the loss between the output and the ground truth. For example, in supervised
machine learning, the difference between the predicted class probabilities and the ground
truth class assignments is minimised.
Objective functions are optimised using gradient descent through a loss landscape.
The topology of this landscape is dependent on the model’s parameters. During the opti-
misation process, the model descends through the landscape until it reaches a minimum
loss. To do so, the model needs to know the direction (gradient) to travel in to reduce
its loss. The gradient of the objective function is computed by back-propagating the loss,
backwards through the network [89]. An optimiser algorithm uses the gradient to update
the model’s parameters so that the loss of the model is reduced.
So far, we have only dealt with ideal cases where the model’s loss always decreases.
Learning is made tricky by the non-convex nature of many loss landscapes. In these, local
minima, local maxima and saddle points are present where the gradient is 0. Therefore,
these points provide no information about which direction to travel to reach the global
minimum.
The partial derivative $\frac{\partial}{\partial x_i} f(x)$ of a function $f(x)$ gives the rate of change of the function w.r.t. $x_i$. By extension, the gradient $\nabla_x f(x)$ is a vector containing all of the partial derivatives w.r.t. each $x_i$. ANNs use back-propagation to calculate gradients using the chain rule of differentiation.
Classically, ANNs have used the stochastic gradient descent (SGD) optimiser. Calcu-
lating the gradient using all training data is expensive, so SGD gets round this by estimat-
ing the gradient using a small sample of training examples. SGD introduces the concept
of the learning rate ε that is multiplied with the estimate of the gradient ĝ to update the parameters, θ ← θ − εĝ. The learning rate can be used to limit large updates to θ, which could occur for certain samples of training examples.
Optimisation of the objective function can become stuck at local minima and saddle
points with SGD, which produces inferior models and increases training time. Contem-
porary optimisers use adaptive learning rates and momentum to train models rapidly by
finding the correct direction to move in, whilst not getting stuck in local minima. Mo-
mentum encourages models to continue moving in the direction of past gradients, with
an exponentially decaying impact. Adaptive learning rates use separate learning rates for
parameters that are scaled according to their previous values. One such optimiser that
combines both concepts, Adam [90], is particularly popular in the community.
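In code, the difference between plain SGD and momentum is one extra state variable. A minimal sketch with illustrative hyperparameter values; optimisers such as Adam additionally keep per-parameter adaptive scales.

```python
def sgd_step(theta, grad, lr=0.01):
    # Plain SGD: step against the gradient estimate, scaled by the learning rate
    return theta - lr * grad

def momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    # Momentum: an exponentially decaying average of past gradients keeps the
    # update moving through flat regions and shallow local minima
    velocity = beta * velocity - lr * grad
    return theta + velocity, velocity
```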
1.4.1.3 Regularisation
Whilst it is easy to train a neural network to perform well on the training data, it can
be hard to get good performance on unseen testing data. Regularisation can be applied
to encourage models to learn general patterns and to not overfit on the training data.
Overfitting is typically monitored on a validation set that is disjoint from the training
and testing sets. A simple way to prevent overfitting is to monitor the objective function
applied to the validation set. Training is stopped once the loss begins to increase on the
validation set, due to overfitting on the training set.
Classical regularisation penalties, such as L1 and L2 loss can be added to loss func-
tions. However, a popular, simple and cheap method of regularisation is dropout [91].
During training, neurons are dropped out at random with probability 1 − ρ. Dropout is
not applied at test time, so to ensure inputs are similar between training and testing, during
training the activations of retained neurons are scaled by 1/ρ.
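A minimal NumPy sketch of this ‘inverted’ dropout: neurons are retained with probability ρ and the survivors are rescaled by 1/ρ during training, so no scaling is needed at test time.

```python
import numpy as np

def dropout(activations, rho, rng, training=True):
    """Drop each neuron with probability 1 - rho; rescale survivors by 1/rho."""
    if not training:
        return activations          # dropout is not applied at test time
    mask = rng.random(activations.shape) < rho
    return activations * mask / rho
```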
Now we know how models can be trained, we next focus on different types of ANN
architectures.
1.4.2 Encoder-decoders
ENC and DEC are optimised such that ENC(X) = h and DEC(h) = X′. Transitively, if
the original data can be reconstructed by the encoder-decoder model, then the embeddings
must contain all salient information in the original data. As such, the embeddings can be
used as learnt features to train machine learning models.
Figure 1.8: Architecture of a general encoder-decoder model.
The encoder and decoder functions are learnt in an unsupervised way from the data
using an optimisation process. In order to do this, a loss function must be defined to mea-
sure the difference between the original data and its reconstruction from the embeddings.
The encoder and decoder functions are then optimised to minimise the reconstruction
loss, and, concomitantly, the embeddings are improved.
Encoder-decoder models can be divided into direct-encoding and generalised meth-
ods [92]. Direct-encoding methods learn an embedding matrix Z, where embeddings are
simply
ENC(vi) = Zvi, where vi is an indicator vector that selects node i’s column of Z.
1.4.3 Autoencoders
Autoencoders are unsupervised neural network encoder-decoder models that attempt to
reconstruct their input X via a hidden representation h (Fig. 1.8). Typically, autoencoders
are used for unsupervised feature learning, in which h is a small set of informative features, with |h| ≪ |X|, automatically extracted from X. This process is also
referred to as learning an embedding of X. The features learnt in h can be used for su-
pervised machine learning, distance estimation, dimensionality reduction, denoising data
and modelling latent variables. We focus on applications of autoencoders in Section 1.4.2.
A typical autoencoder architecture might look like
X → ENC → h → DEC → X′.
Here, an encoder function generates the hidden representation ENC(X) = h and a decoder
function decodes h to reconstruct X as closely as possible, DEC(h) = X′.
Autoencoders are trained to minimise the loss between X and X′, L(X′, X). The simplest way for the autoencoder to have a small loss is to learn the identity function, X = ENC(X) = h = DEC(h) = X′. However, as we shall see, this defeats the purpose
of why one would want to use an autoencoder in the first place.
Restrictions are placed on the autoencoder so that the encoder function must actu-
ally learn from X. The autoencoder is prevented from being able to reconstruct X per-
fectly and is forced to prioritise learning useful features in h. As such, overfitting is not a
problem when training autoencoders. Undercomplete autoencoders employ the simplest
restriction, where h is made to have fewer dimensions than X. Interestingly, when the en-
coder and decoder functions are linear and L is the mean squared error, an undercomplete
autoencoder recapitulates PCA. With nonlinear encoder and decoder functions, autoen-
coders are able to learn nonlinear decompositions of X that are more powerful than PCA.
In addition to restricting the number of neurons in h, other restrictions can be applied,
such as sparsity or noise. Sparse autoencoders apply an L1 sparsity penalty Ω(h) to the
hidden layer, which encourages only important neurons in the hidden layer to fire. In
denoising autoencoders, random noise is applied to the input to give X̃, and the autoencoder is trained to minimise L(X̃′, X). The autoencoder must learn to undo the noise and thus
learn the true structure of X.
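To make this concrete, the sketch below trains a linear undercomplete autoencoder by gradient descent on the mean squared reconstruction loss. All sizes, the learning rate and the data are illustrative; as noted above, with linear maps and MSE loss this recovers a PCA-like subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 200))            # 16-dimensional data, 200 examples
E = rng.normal(scale=0.1, size=(4, 16))   # encoder: 16 -> 4 (undercomplete)
D = rng.normal(scale=0.1, size=(16, 4))   # decoder: 4 -> 16
lr = 0.01

for step in range(2000):
    h = E @ X                             # embeddings, |h| << |X|
    R = D @ h - X                         # reconstruction error X' - X
    loss = (R ** 2).mean()
    grad_D = 2 * R @ h.T / X.size         # gradients of the MSE loss
    grad_E = 2 * D.T @ R @ X.T / X.size
    D -= lr * grad_D
    E -= lr * grad_E
```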
Like many machine learning methods, autoencoders learn to map inputs to a low-
dimensional manifold. Variational autoencoders train latent variables to learn the struc-
ture of this manifold. These latent variables can be used to generate synthetic examples
at some position on the manifold, and smoothly interpolate synthetic examples across the
surface of the manifold.
Pooling is applied to the output of a convolutional layer and calculates a statistic that
summarises the outputs in a window. One type, max pooling, calculates the maximum
value within a window of n inputs. For example, 4-max pooling of [0.4, 0.7, 0.3, 0.5] produces 0.7. Pooling helps to make features invariant to translation, that is, small changes
in the input do not change the output. This property is desirable if the task is to recognise
features in the input, regardless of their position. To improve a model’s computational
efficiency by reducing the number of parameters, pooling is often used to downsample an
output by using fewer pooling units than output units.
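Non-overlapping pooling reduces to a reshape and a reduction, as in this small NumPy sketch, which reproduces the 4-max pooling example above.

```python
import numpy as np

def max_pool(x, n):
    """Non-overlapping n-max pooling of a 1D output (length divisible by n)."""
    return x.reshape(-1, n).max(axis=1)

max_pool(np.array([0.4, 0.7, 0.3, 0.5]), 4)   # -> array([0.7])
```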
1.4.5 Recurrent neural networks
Recurrent neural networks (RNNs) are a class of neural network that can be used to process
sequences through repeated application of the network to each time step in the sequence.
Like CNNs, RNNs employ parameter sharing, whereby the same parameters are used to
process each time step. RNNs have been successfully applied to text, audio and sensor
data. Whilst an in-depth introduction to RNNs is beyond the scope of this chapter, we
introduce their salient details and limitations here.
Whilst RNNs are synonymous with sequence modelling [95], they do possess some sig-
nificant drawbacks, namely sequential dependencies, vanishing/exploding gradients and
poor memory, that we outline below. For these reasons, RNNs have fallen out of favour
at the big tech companies, in favour of non-recurrent architectures, such as attention-
based networks [96, 97]. A very recent class of CNN, temporal CNNs [95], are optimised
for modelling sequences and outperform RNNs on a wide range of sequence modelling
benchmarks.
• Sequential dependencies
RNNs are applied sequentially to input sequences x of length τ from time point 1 to
τ. The output of an RNN at time t is a function of x^(t) and all previous time steps x^(1), ..., x^(t−1). Note that time steps need not actually correspond to time, but can
instead correspond to position in the sequence. RNNs can even be used to predict
future states of a sequence at times t > τ , as is used in next word text prediction.
Despite the impressive results that have been obtained using RNNs, they can be
challenging to train, due to the dependency of processing all time steps 1, ..., t−1
before processing t.
• Vanishing/exploding gradients
To process a sequence, RNNs build a very deep computational graph, through which
it is often difficult to calculate the gradient. This can lead to the problem of vanish-
ing, or exploding, gradients during training, which make it difficult to know which
direction to update a model’s parameters—and in the case of exploding gradients
can make parameters fluctuate wildly during training.
• Poor memory
RNNs can learn dependencies between positions in sequences. However, learning
long-term dependencies is challenging, due to the weights of dependencies decreas-
ing in size exponentially with their distance. This means that short-term depen-
dencies with larger weights will tend to dominate the learning process and that it
will take a very long time to learn long-term dependencies. A vanilla RNN has been
shown to struggle to learn dependencies in sequences of length 20 [98]. Variants
of the RNN, such as the long short-term memory (LSTM) network [99] claim to in-
crease this limit to 1,000 time steps; however, in practice LSTMs are only able to
handle sequences up to 250 in length.
1.4.6 Random forests
Decision trees are binary trees that segregate data into groups. Items begin at the root
node, and traverse the tree towards the terminal nodes, with each internal decision node’s
predicate determining which node the item will visit next. Because finding the globally optimal partitioning of the data is NP-complete [100], the classification and regression trees
(CART) algorithm uses a recursive greedy procedure to grow decision trees.
In the training phase, trees are grown by splitting the data according to the predicate
that minimises the error at each internal node. A cost function is used to decide if a node
is worth splitting, otherwise it will become a terminal node. Cost functions typically take
into account the following criteria. If the node is split:
Once a tree has been grown to its full length, it can be pruned to prevent overfitting by reduc-
ing the number of terminal nodes without increasing the prediction error substantially.
In the prediction phase, each example begins at the root node and follows a path along
the tree’s branches—according to its feature values—until it reaches a terminal node. The
label associated with the leaf node is assigned.
Whilst the CART algorithm is useful, it does have several problems. Firstly, the data
may not be partitioned optimally, so the global minimum error will not be achieved. Sec-
ondly, CARTs are sensitive to the training data, and small changes to the training data can
result in the growth of very different trees. Random forests (RFs) attempt to mitigate these
two problems and help to reduce the variance of the model.
RFs are ensembles of CARTs. RFs use a technique called bootstrap aggregating, or
‘bagging’, to reduce the variance of the model. Each decision tree in an RF is trained on
random subsets of the features and examples, from which bootstrap samples are drawn.
Bootstrapping is a resampling method that can be used to estimate a sampling distribution. Let X be some sample data drawn from an unknown distribution S. S can be estimated by n bootstrap samples b, where each b_i is generated by sampling |X| items from X with replacement.
To increase the differences between each tree in an RF, a random subset of features can
also be selected to be included in the training data for each tree. The proportion of samples
and features to be included in each subset are hyperparameters, optimised during training.
The number of trees in the forest is not a hyperparameter that should be optimised
[101]. Generally, the largest number of trees that can be trained appropriately on the
available resources should be selected. Training and prediction with RFs are embarrassingly
parallel because each tree is independent. For prediction, however, the overhead
of splitting the task across multiple processors typically outweighs the benefit over
applying each tree serially on a single processor.
In the prediction phase, each tree in the RF has an equal vote. In the classification
paradigm, RFs use the majority vote to predict the class of an example, and the probability
that an example belongs to some class can be estimated as the proportion of trees that
predicted that class. In RF regression, the prediction is the mean of the individual trees'
predictions.
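As an illustrative sketch of these ideas, the following Python snippet trains an RF with scikit-learn on toy data; the hyperparameter values are arbitrary placeholders rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy binary classification data standing in for protein feature vectors.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features and max_samples control the random subsets of features
# and examples used to train each tree; n_estimators is set as large
# as resources allow rather than tuned.
rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",
    max_samples=0.8,
    n_jobs=-1,  # training is embarrassingly parallel across trees
    random_state=0,
)
rf.fit(X_train, y_train)

# Class probabilities are the proportion of trees voting for each class.
print(rf.predict_proba(X_test[:5]))
```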
Support vector machines (SVMs) are supervised machine learning models that classify data
into one of two classes [102]. During training, the SVM learns a hyperplane in the training
data space that maximises the separation between the data points from the two classes. In
the ideal case, data points from each class will be well separated, with a maximum margin
between the hyperplane and the nearest data points from each class. The data points that
lie on the margin on either side of the hyperplane are known as the support vectors. We
discuss applications of SVMs in Sections 1.5.1.2 and 1.5.1.5.
SVMs are inherently linear classifiers; however, they are able to perform more com-
plex, non-linear classification by use of the kernel trick. Instead of performing classifica-
tion in the original data space, SVMs first map the data to a high-dimensional space, in
which the two classes are more likely to be separable by a hyperplane. To do this, a ker-
nel function is applied to the data, which calculates dot products between data points in a
high-dimensional space, without actually mapping the data points to the high-dimensional
space. Implicitly, kernel functions allow SVMs to produce a curved decision plane in the
original feature space, which is actually a straight hyperplane in the kernel space. Ker-
nel functions should be selected according to the particular task at hand, but one popular
and flexible kernel that produces curved decision boundaries is the radial basis function
kernel.
Real-world data are rarely ideal, so, when optimising the separating hyperplane, SVMs
must trade off the width of the margin against the cost of misclassifications. To
do this, SVMs have a hyperparameter, the soft-margin penalty C, that
controls the penalty applied to errors. Lower values of C produce boundaries with larger
margins, which increase the risk of misclassifications but reduce the chance of overfitting.
On the other hand, larger values of C produce decision boundaries with a small margin be-
tween the two classes, at the risk of overfitting on the training data. The hyperparameter
should be optimised during model training, in addition to other hyperparameters that are
specific to particular kernel functions. For example, the radial basis function kernel has the
γ hyperparameter that controls the influence of the data points on the hyperplane, with
high values increasing the locality of the influence and low values extending the range of
influence.
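The following scikit-learn sketch illustrates tuning C and γ for an RBF-kernel SVM by cross-validation; the data set and grid values are illustrative only:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Non-linearly separable toy data.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# C trades off margin width against misclassification; gamma controls
# how local the influence of each data point is on the RBF boundary.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```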
We refer readers to the following reviews [77–84]. The ensuing explosion has resulted in
ANNs being used so widely in bioinformatics that it would be beyond the scope of any sin-
gle piece of work to cover all topics. Instead, we focus here on the more relevant advances
and applications to the task of protein function prediction and the related task of protein
family prediction. Specifically, we focus on methods that allow machine learning to be
applied directly to network and sequence data, without the need for feature engineering.
These methods feel almost like magic and would have been inconceivable just a decade
ago. Astute readers may notice some commonalities between the names of the methods that
we introduce here...
Advances in biological data generation and ANNs have coincided at an ideal time.
High-throughput methods now generate unprecedented volumes of ex-
perimental data that classical bioinformatics techniques struggle to process. ANNs are
powerful statistical models that require large amounts of data and computational power
to train. ANNs have already been proven to be a valuable tool in the bioinformatics tool-
box. We conclude that we have set out along a path in which ANNs will permanently
transform bioinformatics.
1.5.1.2 deepNF
deepNF takes a multigraph G with n nodes and learns an embedding
that represents the context of each node in this graph. Given k different types of edges in
G, an autoencoder is used to learn an encoder function that maps the adjacency matrix
of G to a low l-dimensional embedding space that represents the context of each node
across the k edge types. The encoder function is unary: an embedding is generated for
each node individually. An alternative view of deepNF is that k input networks, each with n nodes, are
used as input to a multimodal deep autoencoder (Fig. 1.9). Overall, the autoencoder maps
an [n × n × k] multigraph adjacency matrix to an [n × l] embedding matrix, where l ≪ n
for any reasonably sized network.
Figure 1.9: Overview of deepNF.
Random walks with restart are applied to the adjacency matrices to calculate a node
transition probability matrix. Local and medium-range topologies of the graphs are ex-
plored by the random walks. This, in turn, reduces the sparsity of the original adjacency
matrix. The pointwise mutual information that two nodes occur on the same random walk
is then calculated.
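A minimal sketch of these two steps is shown below; it follows the general recipe of random walks with restart followed by positive pointwise mutual information, rather than reproducing deepNF's exact implementation, and the toy network is our own:

```python
import numpy as np

def rwr(A, restart=0.5, n_steps=3):
    """Random walk with restart on adjacency matrix A (n x n).

    Returns a node transition probability matrix where row i is the
    probability of reaching each node on walks restarting at node i.
    """
    P1 = A / A.sum(axis=1, keepdims=True)  # one-step transition probabilities
    P0 = np.eye(len(A))  # restart distribution: each walk starts at its own node
    P = P0.copy()
    for _ in range(n_steps):
        P = (1 - restart) * P @ P1 + restart * P0
    return P

def ppmi(P):
    """Positive pointwise mutual information of a co-occurrence matrix P."""
    total = P.sum()
    row = P.sum(axis=1, keepdims=True)
    col = P.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log((P * total) / (row * col))
    return np.maximum(pmi, 0)  # clip negative values to zero

# Toy 4-node network: a triangle plus a pendant node.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(ppmi(rwr(A)))
```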
SVMs were trained to predict protein functions using the one-vs-rest multiclass strat-
egy. Radial-basis function kernels were precalculated and cached to speed up training time.
Terms were split into three levels according to how many proteins each term is annotated
to. As one might expect, more common functions that are annotated to 101–300 proteins
were predicted better than rarer functions annotated to 11–30 proteins.
deepNF, Deep Neural Graph Representations and Structural Deep Network Embed-
dings all suffer a number of limitations. Firstly, all require that the input dimension to the
autoencoder is |V|. In the case of deepNF, the input is even larger, at k|V| for k edge types.
Therefore, these methods cannot be applied to large networks with more than 10^5–10^6 nodes,
depending on memory resources. Secondly, models are fixed to the number of nodes in
V at training time. However, new embeddings can be generated if the original adjacency
matrix is rewired, for example under different developmental stages, cellular stresses or
other changes in proteome regulation.
We use deepNF extensively in Chapter 4. deepNF took much inspiration from another
graph embedding method, Mashup [65], which we introduce in the next section.
1.5.1.3 Mashup
Mashup first calculates a node transition probability matrix using random walks with
restart. Low-dimensional embeddings are then calculated by applying a novel dimension-
ality reduction method. Networks are high-dimensional, incomplete and noisy. We wish
to transform the original matrix into a low-dimensional matrix that explains the variance
of the original matrix, similar to PCA. Mashup achieves such a dimensionality reduction
by framing the process as an optimisation problem. Each node i is represented using two
vectors: x_i for the features of the node and w_i that captures the context of the node in the
topology of the network. (If x_i and w_j are close in direction and have a large inner product,
then node j will be visited often on random walks beginning at node i.) If these vectors
x and w do indeed capture topological features of the network, then they can be used to
identify similar nodes in the network. Mashup optimises the x and w vectors by minimis-
ing the Kullback–Leibler divergence between the observed transition probabilities s and
the predicted probabilities ŝ. This optimisation procedure can be extended to multiple net-
works, allowing node contexts across heterogeneous types of edges to be integrated. The
trick is that each of the k networks has its own context vectors w_i^k, but the node vectors x_i are
shared across all networks. The objective function is jointly optimised across all networks.
In so doing, latent features of nodes are learnt in an unsupervised way and are captured
in the node vectors x, which can be used to train machine learning methods.
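The following sketch illustrates the objective only, not Mashup's optimiser: it computes the KL divergence between observed transition probabilities and probabilities predicted from the x and w vectors via a softmax over inner products. All array shapes and names are illustrative assumptions:

```python
import numpy as np

def mashup_loss(X, W, S):
    """KL divergence between observed transition probabilities S and
    probabilities predicted from node vectors X and context vectors W.

    X: (n, d) node feature vectors, shared across networks.
    W: (n, d) context vectors for one network.
    S: (n, n) observed RWR transition probabilities (rows sum to 1).
    """
    logits = W @ X.T  # w_i . x_j for all pairs of nodes
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    s_hat = np.exp(logits)
    s_hat /= s_hat.sum(axis=1, keepdims=True)  # predicted probabilities
    eps = 1e-12
    return np.sum(S * (np.log(S + eps) - np.log(s_hat + eps)))

rng = np.random.default_rng(0)
n, d = 50, 8
X = rng.normal(size=(n, d))
W = rng.normal(size=(n, d))
S = rng.dirichlet(np.ones(n), size=n)  # stand-in for RWR probabilities
print(mashup_loss(X, W, S))
```

With multiple networks, one such loss per network, each with its own W but a shared X, would be summed and minimised jointly.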
In each of the following use cases, Mashup was benchmarked against a state-of-the-art
method and achieved higher performance:
• Predict protein function, benchmarked against GeneMANIA [108].
• Reconstruct Gene Ontology, benchmarked against NeXO [109].
• Predict genetic interactions, benchmarked against Ontotype [110].
• Predict drug efficacy, benchmarked against a synthetic lethality predictor for cancer
drugs [111].
Firstly, kernels can be used with a seed set of proteins that are known to have some function [117]. Under the guilt by association framework, the remaining
proteins in the kernel can be ranked by their similarity to the seed set. Highly-ranked
proteins are likely to have the function, and vice versa for low-ranked proteins.
Secondly, kernels can be used for data fusion because a combination of kernels is
also a kernel [102]. As such, heterogeneous information across multiple networks can be
fused by combining their graph kernels. This approach was taken by Hériché et al. to
fuse a range of human gene and protein association networks, in order to predict novel
genes involved in chromatin condensation [116]. Nine genes known to be involved in this
process were used as seeds to rank the remainder of the genome. An RNAi screen of the
100 best-ranked genes identified 32 that caused defective chromatin condensation when
knocked down. Hit rates in RNAi screens are notoriously low, so these results correspond
to an order of magnitude improvement on the median hit rate in mammalian cells [118].
Third and finally, kernels can be used directly in kernel-based machine learning mod-
els (Section 1.4.7). SVMs are an obvious type of kernel-based model, but many other types
of models exist [102], including kernel partial least squares regression, kernel principal
component analysis and kernel Fisher linear discriminant analysis. Often the radial basis
function kernel is used in SVMs, but, if desired, tailored kernel functions can be used and
the resulting kernel used directly in the SVM. For example, Lehtinen et al. [117] used a
commute time kernel of the STRING [119] network to predict protein function using ker-
nel partial least squares regression. This method outperformed GeneMANIA [120], the
best performing method at the time.
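A minimal sketch of kernel fusion is shown below: two Gram matrices are combined by a weighted sum, which is itself a valid kernel, and passed to an SVM as a precomputed kernel. The kernels and weights are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A (weighted) sum of kernels is itself a valid kernel, so heterogeneous
# similarity measures can be fused into one Gram matrix.
K = 0.5 * linear_kernel(X) + 0.5 * rbf_kernel(X, gamma=0.1)

clf = SVC(kernel="precomputed")
clf.fit(K, y)

# At prediction time, the kernel between test and training points is needed;
# here we score on the training Gram matrix for illustration only.
print(clf.score(K, y))
```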
One of the most powerful features of ANNs in computational biology is the ability for
models to be applied directly to DNA, RNA and protein sequences [80]. Classical ma-
chine learning methods—such as decision trees and support vector machines—require
manual feature engineering to extract information from sequences.
Studies using genomic sequences tend to divide sequences into shorter, more man-
ageable 600 bp [127] or 1,000 bp sequences [128–130] centred on a region of interest.
Protein sequences are much shorter than genomic sequences, so entire protein sequences
tend to be used.
The field of natural language processing has developed an array of machine learning meth-
ods to learn from text. One nascent approach is that of text embedding. These methods
embed variable-length sequences into fixed-length, low-dimensional vectors. The first
method to implement this approach was word2vec [131], which embedded words and
phrases. word2vec inspired many other approaches, including doc2vec [132], for longer
sequences of sentences, paragraphs and documents.
Recent natural language processing methods are able to capture information about
word order and semantics. Examples include the continuous bag of words model, where
a set of context words are used to predict a target word, and the skip-gram model, where
a target word is used to predict the context word. Using these methods in embedding
methods allows the embedding space to capture information about word order and se-
mantics in the original text space. Thus, ‘dog’ and ‘cat’ will be closer in the embedding
space than ‘cat’ and ‘car’, despite the latter pair differing by only one letter. Word embeddings can be
used to answer algebraic questions [132], such as ‘king’ − ‘man’ + ‘woman’ ≈ ‘queen’.
Embeddings allow off-the-shelf machine learning models, which require a fixed number of
features, to be trained on any set of sequences. Alternatively, vectors can be compared
directly using distance metrics in the vector space, such as the cosine distance.
Sequence embeddings can also be used for alignment-free sequence comparison [133].
Sequence alignment is expensive, particularly when performing pairwise alignments of
large sets of sequences. Other alignment-free sequence comparison methods, such as
those based on hashing, provide an approximate measure of the similarity between two se-
quences, without direct comparison of the sequences. These methods represent sequences
as a low-dimensional vector by selecting m unique k-mers from each sequence. Each k-
mer is hashed to a number (hash value), which is used to determine whether to select
this k-mer to represent the sequence. MinHash is one hashing method that represents
sequences using the m numerically smallest unique hash values obtained by hashing all
k-mers in a sequence [134–137]. Sequences can be compared rapidly by calculating an
estimate of the Jaccard index,

J(X, Y) = |X ∩ Y| / |X ∪ Y|,
between sets of hash values X and Y . Although hashing techniques are incredibly effi-
cient and powerful, sets of k-mers do not retain semantic information about the context
of k-mers in the sequence. The sequence embedding methods that we introduce below
are able to represent sequences using a low-dimensional vector that does retain semantic
information.
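For illustration, the following simplified MinHash sketch represents each sequence by its m numerically smallest k-mer hash values and estimates the Jaccard index from two sketches; production tools such as Mash use more careful sketch-merging conventions, and the example sequences are hypothetical:

```python
import hashlib

def minhash_sketch(seq, k=8, m=100):
    """Represent a sequence by the m numerically smallest unique hash
    values of its k-mers (a MinHash 'bottom sketch')."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    hashes = {int(hashlib.sha1(kmer.encode()).hexdigest(), 16)
              for kmer in kmers}
    return set(sorted(hashes)[:m])

def jaccard_estimate(x, y):
    """Approximate Jaccard index between two sketches."""
    return len(x & y) / len(x | y)

a = minhash_sketch("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ")
b = minhash_sketch("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVK")
print(jaccard_estimate(a, b))  # close to 1 for near-identical sequences
```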
Convolutional architectures have permitted networks to be trained on raw data, such
as DNA and protein sequences [80]. Hand-engineered features no longer need to be cal-
culated from sequences; instead, the CNN learns to extract (non-linear) features from se-
quences automatically. Not only does this save time, but the extracted features are high-
quality and lead to greater performance [80].
ANN text embedding methods are based on the encoder-decoder model that was in-
troduced in Section 1.4.2. However, whereas network embedding methods use autoen-
coders to learn embeddings in an unsupervised manner, in sequence embedding, natu-
ral language processing methods are used to generate embeddings from unlabelled data.
Whilst sequence embedding methods are, overall, unsupervised, the learning objective is
posed as a supervised classification problem. One interesting early application of this ap-
proach was Semantic Hashing [138] for document classification; here, however, we focus
on more recent methods.
1.5.2.2 word2vec
word2vec [131] takes a corpus of text, composed of words from word set V, and generates
n-dimensional embeddings for each word. A neural network is constructed, consisting of
a V -dimensional input layer, where V = |V|, an N -dimensional hidden layer and a V -
dimensional output layer (Fig. 1.10; we use the notation from this figure in the remainder of
this section). The architecture is intentionally shallow to allow it to be trained efficiently
on large corpuses, where V > 10^9. This architecture is reminiscent of an autoencoder
with a single hidden layer, but the training objective is different.
word2vec can be run in two modes: continuous bag of words or skip-gram. We use
the continuous bag of words mode as an example. Here, C context words xk are fed to
the network as one-hot V -dimensional vectors and are processed by a word matrix W.
Neurons in the hidden layer h are trained, using the average of the C context word projections,
to predict the target word y_j. Embeddings are processed by the W′ matrix to generate
outputs, which are converted to probabilities using the softmax function. The network
is optimised by reducing the loss between the target word y_j and the predicted word ŷ.
Figure 1.10: word2vec neural network architecture. The continuous bag of words mode is
shown. Figure taken from http://www.stokastik.in/understanding-word-vectors-and-word2vec/.
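For illustration, the following sketch trains a continuous bag of words word2vec model with the gensim library (our choice; the original implementation is a standalone tool) on a toy corpus; the corpus and hyperparameters are placeholders:

```python
from gensim.models import Word2Vec

# Toy corpus: each 'sentence' is a list of word tokens.
corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 selects the continuous bag of words mode; window sets the number
# of context words C either side of the target word.
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=0, epochs=100, seed=0)

# Embeddings for words appearing in similar contexts end up close together.
print(model.wv.most_similar("dog", topn=2))
```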
1.5.2.3 doc2vec
doc2vec [132] is based on word2vec and generates n-dimensional embeddings for each
paragraph P . One-hot V -dimensional vectors of words are shared across all paragraphs
and are processed by the word weight matrix W (Fig. 1.11). In addition to being trained on
context words, models are simultaneously trained on paragraph IDs. The paragraph ID is
shared across all contexts of a paragraph and is processed by the paragraph weight matrix
D. The paragraph ID can be thought of as an additional word that links the context words
to particular paragraphs. Therefore, the model is encouraged to learn the semantics of
individual words and how they shape the meaning of the paragraphs they occur in. Each
paragraph is mapped to an N -dimensional vector and each word is mapped to an M -
dimensional vector. These vectors are concatenated and used to predict the target word
in the same way as word2vec. At each step of the training process, C context words are
randomly sampled from P and are used to predict the target word.
Figure 1.11: doc2vec neural network architecture. The continuous bag of words mode is shown.
Figure taken from [132].
word2vec and doc2vec have been applied in many different ways to biological se-
quences. Below, we focus on applications to protein sequences. These methods have also
been applied to nucleic acid sequences [139, 140].
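As a sketch of how doc2vec can be applied to proteins, the snippet below treats each hypothetical sequence as a 'paragraph' of overlapping 3-mer 'words', in the spirit of the methods introduced below, using gensim's Doc2Vec; all names and sequences are our own illustrative assumptions:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def kmerize(seq, k=3):
    """Split a protein sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Hypothetical protein sequences; each tag plays the role of the
# paragraph ID that links context words to a particular 'paragraph'.
seqs = {"protA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "protB": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVK",
        "protC": "GSSGSSGSSGSSGSSGSSGSSGSSGSSGSSGSS"}
docs = [TaggedDocument(words=kmerize(s), tags=[name])
        for name, s in seqs.items()]

model = Doc2Vec(docs, vector_size=32, window=5, min_count=1,
                epochs=200, seed=0)

# Fixed-length vectors for variable-length sequences.
print(model.dv.similarity("protA", "protB"))  # high: near-identical
print(model.dv.similarity("protA", "protC"))  # lower: unrelated
```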
BioVec predicted protein families using a multiclass prediction strategy, rather than one-vs-rest. Protein family prediction
using ANNs is covered in more detail in Section 1.5.3.
1.5.2.5 seq2vec
seq2vec [143] is a sequence embedding method for biological sequences, based on doc2vec.
seq2vec improves upon BioVec because the overall context of k-mers (words) in sequences
(documents) are learnt, as well as the semantics of individual k-mers. Sequences were
embedded using 3-mers into a 250-dimensional space.
One-vs-rest SVMs were trained on embeddings to predict Pfam families for all of
Swiss-Prot with 95% accuracy. Multiclass prediction was also tested for seq2vec us-
ing a multiclass SVM trained with the one-vs-one strategy, in which a model is trained
for each pair of classes. Here, the accuracy drops to 81% for seq2vec and 77% for
BioVec. However, only the largest 25 Pfam families, < 1% of all families, were tested. To be
useful for protein family classification in the wild, these methods will need to be able to
classify all CATH or Pfam families simultaneously (see Section 1.5.3 for more details on
protein family classification). It would be interesting to see how well an MLP performs
that is trained to predict a larger number of families simultaneously.
Notably, seq2vec and BioVec were benchmarked against BLAST. Whilst seq2vec con-
sistently performed significantly better than BioVec, BLAST outperformed seq2vec. How-
ever, BLAST is one of the most highly engineered pieces of bioinformatics software, so it
is natural to expect it to outperform nascent embedding-based methods. At least for now!
1.5.2.6 dom2vec
dom2vec [144] is a sequence embedding method for protein sequences that learns to embed
protein domains. dom2vec is based on word2vec [131] using the continuous bag of words
and skip-gram strategies. Philosophically, BioVec and seq2vec treat k-mers as words
and proteins as sentences, whereas dom2vec treats domains as words and multi-domain
architectures (MDAs) as sentences. InterPro domain MDAs for all UniProt sequences were used
as input to a word2vec model. dom2vec was benchmarked against ProtVec [139], which is
also based on word2vec, and SeqVec, achieving competitive results. Extensive benchmarks
were conducted for predicting EC classes, molecular function GO terms, InterPro domain
hierarchies and SCOPe secondary structure classes.
1.5.2.7 SeqVec
SeqVec [145] is a sequence embedding method that is not based on word2vec. Instead,
SeqVec is based on ELMo [146], an NLP model that predicts the next word in a sequence,
given all previous words in the sequence. ELMo is comprised of a CharCNN, followed by
two bi-directional LSTM RNN layers. The CharCNN learns a context-insensitive vector
representation of each character, in this case each amino acid. The bi-directional LSTM
layers take the CharCNN embedding and introduce contextual information. The forward
and backward passes of the LSTM layers are trained independently to avoid the backward
and forward passes leaking information to each other. SeqVec used a 28 letter alphabet to
represent the standard amino acid code, ambiguous residues and special tokens to mark
the beginning and end of sequences and padding. SeqVec generates a 1,024D embedding
vector from the summation of the 1,024D outputs of the three layers at the C-terminus of
the sequence. SeqVec was trained on 3 × 10^7 protein sequences from UniRef50, containing
9 × 10^9 residues. Each residue is a unique sequence context and was thus treated as a separate
token in the NLP sense. SeqVec achieved competitive performance in benchmarking, but
had faster run times than the competing methods.
1.5.2.8 UniRep
UniRep [147] is another LSTM-based sequence embedding method based on an NLP lan-
guage model. UniRep and SeqVec were published two months apart, so neither benchmark
the other. The model is trained on protein sequences to predict the next residue in the se-
quence. Like SeqVec, an amino acid character embedding is used as input to an LSTM.
Each protein sequence is represented using a 1,900D embedding vector. In order to cap-
ture long-range and higher-order dependencies between residues, the embedding was the
average of the LSTM layer’s hidden state for each amino acid in the protein sequence. Con-
versely, SeqVec just uses the hidden state of the LSTM at the final residue in the sequence.
Whilst both SeqVec and UniRep are no doubt powerful models, they both took ap-
proximately three weeks to train for one epoch. This is in stark contrast to word2vec,
which is designed for fast run time on large corpuses.
Recently, ANNs have been applied to the problem of protein family classification. Protein
families are defined by clustering protein sequences into sets of homologous sequences
that share sufficiently high sequence identity. Until recently, HMMs have been trained to
model the sequence diversity of protein families. HMMs can be applied to new sequences,
that were not used to train the HMM, to assign them to one or more family, dependent on
obtaining a sufficiently high match score. Sequence-based ANNs, such as CNNs [125, 148–
150], RNNs [151] and natural language processing models [150], are well-suited to protein
family classification because they can be trained directly on protein sequences. Models
are trained to learn a mapping from protein sequence to a vector of protein families P .
ANNs are not only able to learn a good mapping function [125, 148, 149], but sequences
can be classified into protein families much faster than HMMs [125, 149], due to the fast
inference time of ANNs and parallel execution on GPUs. Below, we introduce a selection
of the best methods for protein family classification using ANNs.
1.5.3.1 DeepFam
DeepFam [125] classifies protein sequences into families of proteins. Convolutional
filters of eight lengths {8, 12, 16, 20, 24, 28, 32, 36} were applied to each one-hot en-
coded protein sequence. The model is able to handle variable-length sequences by 1-max
pooling the output of the convolutional stage, whereby the maximum activation from each
filter length is taken. This fixed-width vector is used as input to a single hidden layer used for
classification. A variant of one-hot encoding is used that accounts for three pairs of amino acids
with similar structure and chemistry. When one-hot encoding protein x to X, a residue x_i
contributes 1 to its own column and, if it belongs to one of the three pairs {D, N}, {E, Q} or
{I, L}, 0.5 to the column of its partner:

X_ij = 1, if x_i is the jth residue;
X_ij = 0.5, if x_i and the jth residue are distinct members of the same pair;
X_ij = 0, otherwise.
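The following sketch implements our reading of this encoding; the alphabet ordering and function names are our own, not DeepFam's:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = [("D", "N"), ("E", "Q"), ("I", "L")]
PARTNER = {a: b for pair in PAIRS for a, b in (pair, pair[::-1])}

def encode(seq):
    """Modified one-hot encoding: a residue's own column gets 1 and,
    if it belongs to one of the three similar pairs, its partner's
    column gets 0.5."""
    index = {aa: j for j, aa in enumerate(ALPHABET)}
    X = np.zeros((len(seq), len(ALPHABET)))
    for i, aa in enumerate(seq):
        X[i, index[aa]] = 1.0
        if aa in PARTNER:
            X[i, index[PARTNER[aa]]] = 0.5
    return X

X = encode("MDEIL")
print(X[1])  # row for D: 1 in D's column, 0.5 in N's column
```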
In the future, it would be useful if model performance were compared between one-hot en-
coded data using the alphabet of the 20 standard amino acids, Σ_20, and a reduced alphabet
that merges amino acids with similar properties, Σ_{N<20}. The method was benchmarked
using two protein family databases that were not built using HMMs: COGs and a manually
curated GPCR set.
1.5.3.2 DeepSF
DeepSF [148] predicts the structural fold of a protein using its sequence as input. The
input data for a protein sequence of length L consists of the sequence one-hot encoded
(20D), a position-specific scoring matrix generated by PSI-BLAST (20D), predicted sec-
ondary structure (α-helical, β-sheet or coil; 3D) and predicted solvent accessibility (ex-
posed or buried; 2D) to yield an [L × 45] matrix. DeepSF begins with a very deep CNN
of 10 convolutional layers, each with 10 filters. The model is able to handle
variable-width inputs by applying k-max pooling to the output of the 10th convolutional
layer: the 30 largest activations from each of the 20 L × 1 feature maps are taken and flat-
tened into a 600 neuron dense layer. An MLP maps the activations from the convolutional
layers to 1,195 protein folds. To train on variable-length sequences, proteins were binned
into mini-batches that contain proteins within a range of lengths and proteins within each
mini-batch were zero padded to the same length.
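A minimal NumPy sketch of k-max pooling is given below; DeepSF's actual implementation differs, but the idea of taking the k largest activations per feature map, preserving their order, is the same:

```python
import numpy as np

def k_max_pool(feature_maps, k=30):
    """k-max pooling: keep the k largest activations of each feature map,
    in their original order, giving a fixed-width output for
    variable-length inputs.

    feature_maps: (n_filters, L) activations for a length-L sequence.
    Returns a flattened vector of length n_filters * k.
    """
    pooled = []
    for fmap in feature_maps:
        idx = np.sort(np.argsort(fmap)[-k:])  # positions of top-k, in order
        pooled.append(fmap[idx])
    return np.concatenate(pooled)

# 20 feature maps over a length-150 sequence -> 20 * 30 = 600 values,
# matching the 600 neuron dense layer described above.
rng = np.random.default_rng(0)
print(k_max_pool(rng.normal(size=(20, 150))).shape)  # (600,)
```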
1.5.3.3 ProtCNN
ProtCNN [149] uses a CNN to classify protein sequences into protein families; ProtENN
is an ensemble of 13 ProtCNN models that classifies sequences by majority vote. ProtCNN alone had
higher error rates than BLASTp- or HMM-based classifications, but these classical methods
were consistently beaten by ProtENN.
1,100D embeddings were calculated for representative sequences from protein fami-
lies. Nearest neighbour classification was used to assign held out sequences to families by
calculating their embedding vector, followed by calculating cosine distances to all protein
families and finding the most similar family. ProtCNN was also able to embed sequences
from completely held out families into a similar region of embedding space, with small
cosine distances. This demonstrates that the CNN was able to learn general
features of protein sequences, rather than merely memorising the training data.
Similar to DeepSequence [152] (Section 1.5.4.3), ProtCNN was able to learn a substi-
tution matrix that is very similar to the BLOSUM62 matrix, but using the cosine similarity
between 5D vectors centred on the residue of interest.
Until recently, this was simply impossible. Extensive feature engineering
was required to extract low-dimensional, fixed-width representations of the information
within protein sequences. These features were reasonably predictive of protein function,
provided sufficient training examples were available. However, this feature engineering is
tedious. Ideally, we want these features to be extracted from the sequence automatically.
Over the past couple of years, this has been made possible by applying ANN models di-
rectly to protein sequences to predict protein function. Many of these applications have
already been covered in Section 1.5.2.1. Here, we introduce methods that are not based
on encoder-decoder embedding methods. First, we would like to highlight two studies
without introducing them in detail:
1.5.4.1 DeepGO
DeepGO [158] predicts GO term annotations for proteins using a model that is aware of
the graphical DAG structure of the GO. The model takes protein sequence as input and
maps overlapping 3-mer sequences to indices in a vocabulary of all 20^3 = 8,000 possible k-mers
of length 3. Sequences were fixed at 1,002 amino acids in length, corresponding to 1,000
3-mers. Longer sequences were ignored and shorter sequences were padded with zeros.
Sequences are then embedded in a 128D space, where each 3-mer index is represented by
a vector of 128 numbers. This is an interesting approach, but one wonders why a sequence
embedding method like seq2vec was not employed. Network data was also included in the
model by generating 256D embeddings of nodes within a cross-species knowledge-graph
[159].
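The input encoding can be sketched as follows; the vocabulary construction and padding convention follow the description above, while the names and the example sequence are our own:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Vocabulary of all 20^3 = 8,000 possible 3-mers; index 0 is reserved
# for padding.
VOCAB = {"".join(kmer): i + 1
         for i, kmer in enumerate(product(AMINO_ACIDS, repeat=3))}

MAX_LEN = 1002  # 1,002 residues -> 1,000 overlapping 3-mers

def to_indices(seq):
    """Map a protein sequence to a fixed-length vector of 3-mer indices,
    zero-padded; sequences longer than MAX_LEN would be discarded."""
    if len(seq) > MAX_LEN:
        return None
    idx = [VOCAB[seq[i:i + 3]] for i in range(len(seq) - 2)]
    return idx + [0] * (MAX_LEN - 2 - len(idx))

indices = to_indices("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(indices), indices[:5])  # 1000, first few 3-mer indices
```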
Each GO term is then modelled with its own small neural network. Each term consists of a single fully-connected
layer with sigmoid activations. Top-level terms in the GO DAG take as input the concate-
nated sequence and network information. Terms that have child terms in the ontology
feed the output of their fully-connected layer to the fully-connected layers of any child
terms. The output of the term layers are fed to an output vector that predicts GO terms in
a way that is aware of the correlations and dependencies between terms in the ontology.
However, DeepGO was unable to overcome the classical problems with GO term pre-
diction. Firstly, the problem is high-dimensional in the number of GO terms that exist.
Secondly, GO terms that are deep in the ontology, and describe specific functions, are an-
notated to only a few proteins. So, instead of predicting all GO terms, a subset of terms that
are annotated to many proteins was selected. Terms were selected if they are annotated
to at least 250 proteins for biological process, or at least 50 proteins for molecular function and cellular
component. This resulted in 932 terms for biological process, 589 for molecular function
and 436 for cellular component.
DeepGOPlus [160] is the prototypical ANN model for protein function prediction.
The model is simple and intuitive, taking protein sequences as input and predicting GO
terms as output. DeepGO was modified in four ways to create DeepGOPlus. Firstly, the
3-mer embedding stage is replaced by a one-hot encoding of the sequence, thus remov-
ing 128 × 8,000 parameters from the model and reducing the chance of overfitting. Fur-
thermore, this architecture allowed DeepGOPlus to be applied to sequences of any length.
Secondly, the CNN unit was converted to a deep CNN unit, consisting of stacked convolu-
tional layers. Thirdly, network information was not used because network information is
unavailable for most known proteins. Finally, GO terms were predicted using a flat fully-
connected layer, rather than the hierarchical set of layers used in DeepGO. DeepGOPlus
would have come in first or second place for the three GO ontologies in CAFA 3 [74].
1.5.4.2 ProLanGO
Neural machine translation models, such as Google Translate, allow text to be translated
between arbitrary pairs of languages using ANNs. ProLanGO [161] is a neural machine
translation model for GO term prediction. In this model, protein sequences and GO term
annotations are treated as languages and a mapping function is learnt by an ANN to trans-
late between the semantics of particular protein sequences and their equivalent GO term
semantics. The model is an RNN, composed of LSTM units.
The protein sequence language was constructed of words that are all k-mers of length
3, 4 or 5 that occur in UniProt > 1000 times. The GO term language is constructed by
assigning each GO term to a unique four letter code word in base 26. The four letter
code is the index of the term from a depth-first search of the GO DAG. For example, there
are 28,768 terms in the biological process ontology, so the root node of the ontology,
GO:0008150, is the 28768th term to be visited in the depth-first search, which corresponds
to BQKZ in four letter code.
1.5.4.3 DeepSequence
Second, in Chapter 3, we predict novel plastic hydrolase enzymes in a large data set
of 1.1 billion protein sequences from metagenomes using the CATH database. We mapped
a naturally-evolved plastic hydrolase from Ideonella sakaiensis to the alpha/beta hydrolase
CATH superfamily and FunFams. By scanning the metagenomic proteins against HMMs
of these families, we identified 500,000 putative sequences that may be able to hydrolyse
plastics, which we analysed further using associated metadata. Motivated by the size of
the metagenomic protein data set, we developed FRAN, a divide-and-conquer algorithm
that is able to generate FunFams on arbitrarily large sequence data sets.
Third, in Chapter 4, we perform feature learning from protein networks using a neural
network that generates embeddings of proteins, according to their context across multi-
ple networks. Using these embeddings, we trained supervised machine learning models
to predict protein function in budding and fission yeast. We show that, of the 3 × 10^4
dimensions in the yeast STRING networks, just 256 dimensions (< 1%) are sufficient to
adequately represent each protein. We also found that a vector of protein functions can be
predicted using structured learning with the same performance as predicting each function
using a separate classifier and the one-vs-rest strategy.
FunFams. We evaluated the performance of models trained on these data modes separately,
and in combination, finding that the best performing model used a combination of network
and evolutionary data. Finally, we entered the predictions from this model into the fourth
CAFA protein function prediction competition.
Chapter 2
2.1 Introduction
This chapter is a modified version of the paper: Scholes, H.M. Dynamic changes in
the brain protein interaction network correlates with progression of Aβ42 pathology in
Drosophila. Scientific Reports (2020) [163]. Adam Cryar, Fiona Kerr, David Sutherland, Lee
Gethings, Johannes Vissers, Jonathan Lees, Christine Orengo, Linda Partridge and Kon-
stantinos Thalassinos contributed to the research of the original publication. All final
language is my own.
The proteomics data set was collected predominantly by Adam Cryar. This project
was a collaboration between the Orengo and Thalassinos groups, which consisted of me
analysing the proteomics data using sophisticated bioinformatics techniques.
Amyloid plaques are extracellular aggregates of amyloid beta (Aβ) [166], whereas neurofibrillary tangles are intraneuronal aggregates of hyper-
phosphorylated tau [167, 168]. In addition to these hallmarks, the AD brain experiences
many other changes, including metabolic and oxidative dysregulation [169, 170], DNA
damage [171], cell cycle re-entry [172], axon loss [173] and, eventually, neuronal death
[170, 174].
Despite a substantial research effort, no cure for AD has been found. Effective treat-
ments are desperately needed to cope with the projected increase in the number of new
cases as a result of longer life expectancy and an ageing population. Sporadic onset is
the most common form of AD (SAD), for which age is the major risk factor. Familial AD
(FAD)—a less common (< 1%), but more aggressive, form of the disease—has an early on-
set of pathology before the age of 65 [175]. FAD is caused by fully penetrant mutations in
the Aβ precursor protein (APP) and two subunits—presenilin 1 and presenilin 2—of the γ-
secretase complex that processes APP in the amyloidogenic pathway to produce Aβ. APP
is a 770 amino acid integral membrane protein that is involved in a wide range of devel-
opmental processes in neurons—functioning as a cell surface receptor and cell adhesion
molecule [176]. Whilst the exact disease mechanisms of AD are not yet fully understood,
the genetics of FAD provide support for Aβ accumulation as a key player in its cause and progres-
sion [164]. Aβ42—a 42 amino acid variant of the peptide—is neurotoxic [177], necessary
for plaque deposition [178] and sufficient for tangle formation [179].
2.1.1.1 Aβ formation
Aβ is formed by cleavage of APP, a 695 to 770 amino acid transmembrane protein ex-
pressed in many tissues [180]. In addition to its role in AD, APP performs many cellu-
lar functions, notably being a cell surface receptor protein for stimulating intracellular
Ser/Thr kinases [181] and as a transcriptional regulator of miRNAs involved in neuronal
differentiation [182].
Cleavage of APP can occur via two distinct pathways involving the secretase family
of endopeptidases (Fig. 2.1). Most APP is processed in the non-amyloidogenic pathway by
the α-secretases that cleave at a residue located 83 amino acids from the C-terminus, near
the extracellular membrane surface [183]. Cleavage yields two fragments: a C-terminal
fragment CTF83 that remains in the membrane and an sAPPα extracellular protein. For-
mation of the Aβ peptide is inherently prevented due to the position of the cleavage site
in APP. In the complementary amyloidogenic pathway that produces Aβ, APP is cleaved
tau is a neuronal protein that interacts with tubulin to promote and maintain microtubule
assembly [197]. Intraneuronal aggregates of tau form NFTs, the second lesion of AD.
tau is also implicated in a number of other neurological conditions, known as tauopathies
[198]. A link between Aβ and tau has been established, although there is debate as to
the importance of these agents in AD progression. Mechanistically, Aβ42 induces the
abnormal hyperphosphorylation of tau, which subsequently forms paired helical filaments,
the building blocks of NFTs [198]. This was confirmed by injecting Aβ42 into the brains of
mice, causing NFTs to form locally [179]. In a similar vein to plaques, whilst NFTs are
a hallmark of AD, they appear to be inert [199]. But 40% of hyperphosphorylated tau is
monomeric in AD [198] and it is this form that is neurotoxic, sequesters normal tau and
promotes the disassembly of microtubules [200]. In summary, soluble Aβ and tau work in
conjunction to convert healthy neurons into a diseased state, acting independently of
plaques and NFTs.
2.1.1.5 Proteomics
Post-mortem proteomics studies on human brains have been valuable in adapting the amy-
loid cascade hypothesis of AD and helped develop the alternative neuro-inflammation hy-
pothesis of AD. These studies revealed that the brain undergoes oxidative damage as a
response to amyloid accumulation in the end stages of disease. Using model organisms,
such as fruit flies, allows molecular alterations in the brain to be tracked from the onset of
AD, during its progression, until death. Comparisons of proteomic analyses of post-mortem
human brains have further revealed an increase in metabolic processes and a reduction in
synaptic function in AD [201]. Oxidised proteins also accumulate at early stages in the AD
brain, probably as a result of mitochondrial ROS production [202], and redox proteomic
approaches suggest that enzymes involved in glucose metabolism are oxidised in mild
cognitive impairment and AD [203, 204]. Moreover, phospho-proteomic approaches have
revealed alterations in phosphorylation of metabolic enzymes and kinases that regulate
phosphorylation of chaperones such as HSP27 and crystallin alpha B [205]. Of note, how-
ever, there is little proteomic overlap between studies using post-mortem human brain
tissue, which may reflect the low sample numbers available for such studies, differences
in comorbidities between patients and confounding post-mortem procedures [201]. Al-
though valuable, post-mortem studies also reflect the end-stage of disease and, therefore,
lighter ions. Analysers exploit this property to sort ions according to m/z. For example,
time-of-flight analysers accelerate ions through a flight tube and measure the m/z by the
time it takes for the ion to reach the detector. Finally, any ions that pass through the
analyser are detected as an electrical current by the negatively charged detector.
One such burgeoning MS field is proteomics—the study of proteomes by MS [209,
210]. Technological developments in MS that are relevant to proteomics—and particularly
the methods used in this study—are introduced below.
2.1.2.2 Ion-mobility
Ion-mobility spectrometry is a method of separating ions in the gas phase [214]. Ions enter
a drift tube that is filled with a buffer gas and are moved through the drift tube with an
electric field. Particles of the buffer gas collide with the ions, which creates a frictional
force that impedes their progress. The time that it takes for a molecule to move through
the drift tube is known as the arrival time. The rate of these collisions depends on the ion’s
collision cross section (CCS), a measure that quantifies the overall shape of a particle. CCS
refers to the ensemble of all possible geometric orientations of a particle, and all possible
interaction types that it may have with other particles in the gas phase, averaged into a
cross sectional area of a circle [215]. Although the CCS of a protein is correlated with
its mass, there is a high variance in CCS for any given mass [215]. This makes intuitive
sense because proteins are not perfect spheres of uniform density—some are elongated and
others have pockets. CCS predicts that spheroidal globular proteins will collide less often
with the buffer gas than proteins with a large surface area-to-volume ratio. Furthermore,
chemically identical protein molecules do not have exactly the same arrival time, but rather
their arrival times are distributed, due to conformational differences. Differential CCS
therefore adds an additional dimension of separation based on shape, enabling ion-mobility
spectrometers to separate ions according to their arrival time, which depends on both the
m/z and the shape of the particle. Abstractly, ion-mobility is somewhat similar
to gel filtration chromatography, in that proteins are separated by their shape. Neither CCS
nor arrival time is an inherent property of a particle, because both depend on the buffer
gas used and the temperature [214].
Figure 2.3: Schematic of the Synapt G2-Si mass spectrometer. Figure produced by Waters
Corporation.
Sample preparation for label-free MS is simpler than for label-based methods, but label-based
methods remain the most accurate for quantification.
2.1.3 Contributions
AD, the most prevalent form of dementia, is a progressive and devastating neurodegener-
ative condition for which there are no effective treatments. Understanding the molecular
pathology of AD during disease progression may identify new ways to reduce neuronal
damage. Here, we present a longitudinal study tracking dynamic proteomic alterations
in the brains of an inducible Drosophila melanogaster model of AD expressing the Arctic
mutant Aβ42 gene. We identified 3093 proteins from flies that were induced to express
Aβ42 and age-matched healthy controls using label-free quantitative ion-mobility data in-
dependent analysis mass spectrometry. Of these, 228 proteins were significantly altered
by Aβ42 accumulation and were enriched for AD-associated processes. Network analy-
ses further revealed that these proteins have distinct hub and bottleneck properties in the
brain protein interaction network, suggesting that several may have significant effects on
brain function. Our unbiased analysis provides useful insights into the key processes gov-
erning the progression of amyloid toxicity and forms a basis for further functional analyses
in model organisms and translation to mammalian systems.
2.2 Methods
2.2.1 Data collection
All MS data were processed in Progenesis QI for proteomics. Data were imported into
Progenesis to generate a 3D representation of the data (m/z, retention time and peak
intensity). Samples were then time-aligned, with the software allowed to determine the best
reference run from the dataset automatically. Following alignment, peak picking was
performed on MS level data. A peak picking sensitivity of 4 (out of 5) was set. Peptide
features were tentatively aligned with their respective fragment ions based primarily on
the similarity of their chromatographic and mobility profiles. Requirements for features
to be included in post-processing database searching were as follows: 300 counts for low
energy ions, 50 counts for high energy ions and 750 counts for deconvoluted precursor
intensities. Subsequent data were searched against 20,049 sequences from the UniProt
canonical Drosophila database (appended with common contaminants). Trypsin was spec-
ified as the enzyme of choice and a maximum of two missed cleavages were permitted.
Carbamidomethyl (C) was set as a fixed modification whilst oxidation (M) and N-terminal
acetylation were set as variable modifications. Peptide identifications were grouped and
relative quantification was performed using non-conflicting peptides only.
and performs quasi-likelihood tests. limma fits linear models to the proteins and per-
forms empirical Bayes F-tests. maSigPro fits generalised linear models to the proteins
and performs log-likelihood ratio tests.
Significantly altered proteins were clustered using a Gaussian mixture model. Protein
abundances were log10-transformed and z scores were calculated. Gaussian mixture mod-
els were fitted for 1 to 228 clusters. Models were compared using the Bayesian
information criterion (BIC), which penalises complex models: BIC = −2 ln(L) + k ln(n),
where ln(L) is the log-likelihood of the model, n is the number of significantly altered
proteins and k is the number of clusters. The model with the lowest BIC was chosen.
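This model selection procedure can be sketched with scikit-learn as below; note that scikit-learn's bic() penalises the total number of free parameters, a stricter penalty than the simplified cluster-count formula given above. The data and cluster range are illustrative placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for the z-scored abundance profiles of the 228 proteins
# (rows: proteins, columns: time points).
Z = np.vstack([rng.normal(loc=m, size=(76, 4)) for m in (-1, 0, 1)])

# Fit models with increasing numbers of clusters and keep the one
# with the lowest BIC.
best_model, best_bic = None, np.inf
for k in range(1, 11):  # the chapter scans 1 to 228 clusters; truncated here
    gmm = GaussianMixture(n_components=k, random_state=0).fit(Z)
    bic = gmm.bic(Z)
    if bic < best_bic:
        best_model, best_bic = gmm, bic

print(best_model.n_components, best_bic)
```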
2.2.3.3 Networks
All network analysis was performed using the Drosophila melanogaster Search Tool for the
Retrieval of Interacting Genes/Proteins (STRING) network (version 10) [234]. Low confi-
dence interactions with a ‘combined score’ < 0.5 were removed in all network analyses.
Network properties of the significantly altered proteins were analysed in the brain
protein interaction network. A subgraph of the STRING network was induced on the
3,093 proteins identified by IM-DIA-MS in healthy or Aβ42 flies and the largest connected
component was selected (2,428 nodes and 44,561 edges). The subgraph contained 183 of
the 228 significantly altered proteins. For these proteins, four network properties were
calculated as test statistics: mean node degree; mean unweighted shortest path length
between a node and the remaining 182 nodes; the size of the largest connected component
in the subgraph induced on these nodes; and mean betweenness centrality. Hypothesis
testing was performed using the null hypothesis that there is no difference between the
nodes in the subgraph. Assuming the null hypothesis is true, null distributions of each test
statistic were simulated by randomly sampling 183 nodes from the network 10,000 times.
Using the null distributions, one-sided non-parametric P values were calculated as the
probability of observing a test statistic as extreme as the test statistic for the significantly
altered proteins.
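The following sketch illustrates this procedure for one of the four test statistics, the mean degree, on a toy graph; the real analysis used the STRING subgraph and all four statistics, and the names here are our own:

```python
import networkx as nx
import numpy as np

def degree_permutation_test(G, hit_nodes, n_perm=10_000, seed=0):
    """One-sided non-parametric P value for the mean degree of a node set,
    against a null built by repeatedly sampling equally many random nodes."""
    rng = np.random.default_rng(seed)
    degrees = dict(G.degree())
    observed = np.mean([degrees[n] for n in hit_nodes])
    nodes = list(G.nodes())
    null = np.array([
        np.mean([degrees[n] for n in rng.choice(nodes, size=len(hit_nodes),
                                                replace=False)])
        for _ in range(n_perm)
    ])
    # Probability of a null statistic at least as extreme as observed.
    return observed, (np.sum(null >= observed) + 1) / (n_perm + 1)

# Toy network standing in for the STRING subgraph; 'hits' stand in for
# the 183 significantly altered proteins.
G = nx.barabasi_albert_graph(500, 3, seed=0)
hits = list(G.nodes())[:20]
print(degree_permutation_test(G, hits, n_perm=1000))
```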
A subgraph of the STRING network was induced on the proteins significantly altered
in AD and their neighbours and the largest connected component was selected (4,842
nodes and 182,474 edges). The subgraph contained 198 of the 228 significantly altered
proteins and was assessed for enrichment of GO terms. Densely connected subgraphs
were identified using MCODE [235]. Modules were selected with an MCODE score >
2.3 Results
2.3.1 Proteome analysis of healthy and Aβ42-expressing fly brains
In this study, we used an inducible transgenic fly line expressing human Arctic mutant
Aβ42 (TgAD) [222, 223] (Fig. 2.4a). Flies were either chronically induced two days after
eclosion (Aβ42 flies) or remained uninduced (healthy flies) and were used as a control for normal
ageing.
We first sought to determine how the fly lifespan is affected in TgAD. We confirmed
a previously observed [207] reduction in lifespan following Aβ42 induction prior to pro-
teomic analyses (Fig. 2.4b).
To understand how the brain proteome is affected as Aβ42 toxicity progresses, fly
brains were dissected from healthy and Aβ42 flies at 5, 19, 31 and 46 days, and at 54 and 80
days for healthy controls, then analysed by label-free quantitative IM-DIA-MS (Fig. 2.4c).
1,854 proteins were identified in both healthy and Aβ42 fly brain from a total of 3,093
proteins (Fig. 2.4d), which is typical for recent fly proteomics studies [245, 246].
Figure 2.4: Proteome analysis of healthy and AD fly brains. (a) Drosophila melanogaster
transgenic model of AD (TgAD) that expresses Arctic mutant Aβ42 in a mifepristone-
inducible GAL4/UAS expression system under the pan-neuronal elav promoter. (b) Sur-
vival curves for healthy and Aβ42 flies. Aβ42 flies were induced to express Aβ42 at 2
days. Markers indicate days that MS samples were collected. (c) Experimental design
of the brain proteome analysis. Aβ42 flies were induced to express Aβ42 at 2 days. For
each of the three biological repeats, 10 healthy and 10 Aβ42 flies were collected at 5,
19, 31 and 46 days, and at 54 and 80 days for healthy controls. Proteins were extracted
from dissected brains and digested with trypsin. The resulting peptides were separated
by nanoscale liquid chromatography and analysed by label-free quantitative IM-DIA-
MS. (d) Proteins identified by IM-DIA-MS; 1,854 of the 3,093 proteins were identified in both healthy and Aβ42 flies. (e) Principal component analysis of the IM-
DIA-MS data. Axes are annotated with the percentage of variance explained by each
principal component. (f) Hierarchical biclustering using relative protein abundances
normalised to their abundance in healthy flies at 5 days.
For the 1,854 proteins identified in both healthy and Aβ42 flies, we assessed the relia-
bility of our data. Proteins were highly correlated between technical and biological repeats.
We used principal component analysis of the protein abundances to identify sources of
variance (Fig. 2.4e). Healthy and Aβ42 samples are clearly separated in the first principal
component, probably due to the effects of Aβ42. In the second principal component, sam-
ples are separated by increasing age, due to age-dependent or disease progression changes
in the proteome. These results show that, whilst ageing does contribute to changes in the
brain proteome (8.7% of the total variance), much larger changes are due to expression of
Aβ42 (70.6%), which may reflect either a correlation with the ageing process or progres-
sion of AD pathology. We confirmed this result using hierarchical biclustering of protein
abundances in Aβ42 versus healthy flies at 5 days (Fig. 2.4f). The results reveal that most
proteins do not vary significantly in abundance with age in healthy flies, but many proteins
are differentially abundant in Aβ42 flies.
We next identified the proteins that were significantly altered following Aβ42 expression
in the fly brain. To achieve this, we used five methods commonly used to analyse time
course RNA-Seq data [247] and classified proteins as significantly altered if at least two
methods detected them [248]. We identified 228 significantly altered proteins from 740
proteins that were detected by one or more methods (Fig. 2.5a). A comparison of popular
RNA-Seq analysis tools [249] showed that edgeR [231] has a high false positive rate and
variable performance on different data sets, whereas DESeq2 [229] and limma [232] have
low false positive rates and perform more consistently. We observed a similar trend in
our data set. limma and DESeq2 detected the lowest number of proteins, with 21 proteins
in common (Fig. 2.6a). edgeR detected more proteins, of which 38 were also detected by
DESeq2 and 16 by limma. EDGE [230] and maSigPro [233] detected vastly more proteins,
464 of which were only detected by one method. Principal component analysis shows that
edgeR, DESeq2 and limma detect similar proteins, whereas EDGE and maSigPro detect
very different proteins (Fig. 2.6b).
Figure 2.5: Brain proteome dysregulation in AD. (a) Proteins significantly altered in AD were
identified using five methods (EDGE, edgeR, DESeq2, limma and maSigPro) and clas-
sified as significantly altered if at least two methods detected them. (b) Significantly
altered proteins in AD (from a) and ageing. (c) Significantly altered protein abundances
were z score-transformed and clustered using a Gaussian mixture model into four clusters
(n = 75, 71, 39 and 43).
Although these methods should be able to differentiate proteins that are altered
in Aβ42 flies from those that change during normal ageing, we confirmed this by
analysing healthy flies separately. In total, 61 proteins were identified as significantly al-
tered with age, of which 30 were also identified as significantly altered in AD (Fig. 2.5b)
and 31 in normal ageing alone. These proteins are not significantly enriched for any path-
ways or functions. Based on our results, we concluded that the vast majority of proteins
that are significantly altered in AD are not altered in normal ageing and that AD causes
significant dysregulation of the brain proteome.
Figure 2.6: Analysis of the five statistical methods used to identify significantly altered
proteins. (a) Heat map of the proteins detected by each method. (b) Principal com-
ponent analysis of these results. Axes are annotated with the percentage of variance
explained by each principal component.
Of the 31 proteins specifically altered in ageing, 10 decreased with age (Acp1, CG7203,
mRPL12, qm, CG11017, HIP, HIP-R, PP02 and Rpn1), 15 increased with age (ade5, CG7352,
RhoGAP68F, CG9112, PCB, Aldh, D2hgdh, CG7470, CG7920, RhoGDI, Aldh7A1, CG8036,
Ssadh, muc and FKBP14) and four fluctuated throughout life (CG14095, His2A, RpL6 and
SERCA).
disease progresses. Proteins in cluster 3 follow a similar trend in healthy and Aβ42 flies
and increase in abundance with age. However, cluster 4 proteins decrease in abundance
as the disease progresses, whilst remaining steady in healthy flies.
We performed a statistical GO enrichment analysis on each cluster, but found no
enrichment of terms. Furthermore, we also saw no enrichment when we analysed all 228
proteins together.
Following the analyses of brain proteome dysregulation in Aβ42 flies, we analysed the
228 significantly altered proteins in the context of the brain protein interaction network
to determine whether their network properties are significantly different to the other brain
proteins. We used a subgraph of the STRING [234] network induced on the 3,093 proteins
identified by IM-DIA-MS (see Section 2.2.3.3 for more details). This subgraph contained
183 of the 228 significantly altered proteins. We then calculated four graph theoretic net-
work properties (Fig. 2.7a) of these 183 significantly altered proteins: mean node degree;
mean shortest path length; the size of the largest connected component induced on these
nodes; and mean betweenness centrality.
We performed hypothesis tests and found that these proteins have statistically signif-
icant network properties. Firstly, the significantly altered proteins make more interactions
than expected (mean degree P < 0.05; Fig. 2.7b). Therefore, these proteins may further
imbalance the proteome by disrupting the expression or activity of proteins they inter-
act with. Secondly, not only are these proteins close to each other (mean shortest path
P < 0.05; Fig. 2.7c), but also 129 of them form a connected component (size of largest
connected component P < 0.01; Fig. 2.7d). These two pieces of evidence suggest that
Aβ42 disrupts proteins at the centre of the proteome. Lastly, these proteins lie along short-
est paths between many pairs of nodes (mean betweenness centrality P < 0.01; Fig. 2.7e)
and may control how signals are transmitted in cells. Proteins with high betweenness cen-
Figure 2.7: Significantly altered proteins have statistically significant network properties
in the brain protein interaction network. (a) Network properties that were calcu-
lated: degree, the number of edges that a node has; shortest path, the smallest node
set that connect any two nodes; largest connected component, the largest node set for
which all nodes have at least one edge to any of the other nodes; and betweenness cen-
trality, the proportion of all shortest paths in the network that a particular node lies on.
Using a subgraph of the STRING network induced on the 3,093 proteins identified by
IM-DIA-MS in healthy and Aβ42 flies, the significance of four network characteristics
was calculated for the 183 significantly altered proteins contained in this subgraph.
(b) mean degree; (c) mean shortest path length between a node and the remaining 182
nodes; (d) the size of the largest connected component in the subgraph induced on these
nodes; and (e) mean betweenness centrality. One-sided non-parametric P values were
calculated using null distributions of the test statistics, simulated by randomly sampling
183 nodes from the network 10,000 times.
trality are also more likely to be essential genes for viability [250]. Taken together, these
findings suggest that the proteins significantly altered in AD are important in the protein
interaction network.
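To make the permutation scheme concrete, the following is a minimal sketch of how such a null distribution and one-sided non-parametric P value could be computed with networkx, assuming a graph G of the brain interaction network and using mean degree as the test statistic; the function names are illustrative and are not the code used in this study.

import random
import networkx as nx

def null_distribution(G, sample_size=183, n_samples=10_000, seed=0):
    # Simulate the null by repeatedly sampling random node sets of
    # the same size as the significantly altered protein set.
    rng = random.Random(seed)
    nodes = list(G.nodes)
    null = []
    for _ in range(n_samples):
        sample = rng.sample(nodes, sample_size)
        null.append(sum(d for _, d in G.degree(sample)) / sample_size)
    return null

def one_sided_p(observed, null):
    # Fraction of null statistics at least as extreme as the observed one.
    return sum(stat >= observed for stat in null) / len(null)

The same scheme applies to the other three statistics by swapping the mean degree for mean shortest path length, largest connected component size, or mean betweenness centrality.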
We predicted how severely particular Aβ42-associated protein alterations may affect the
brain using two network properties—the tendency of a node to be a hub or a bottleneck.
In networks, nodes with high degree are hubs for communication, whereas nodes with
high betweenness centrality are bottlenecks that regulate how signals propagate through
the network. Protein expression tends to be highly correlated to that of its neighbours
in the protein interaction network. One exception to this rule, however, is bottleneck
proteins, whose expression tends to be poorly correlated with that of their neighbours [250].
This suggests that the proteome is finely balanced and that the expression of bottleneck
proteins is tightly regulated to maintain homeostasis. We analysed the hub and bottleneck
properties of the significantly altered proteins and identified four hub-bottlenecks and five
nonhub-bottlenecks that correlate with Aβ42 expression (Fig. 2.8a) and analysed how their
abundances change during normal ageing and as pathology progresses (Fig. 2.8b).
Figure 2.8: Analysis of hubs and bottlenecks in the brain protein interaction network. In
networks, nodes with high degree are hubs and nodes with high betweenness cen-
trality are bottlenecks. (a) Degree (hub-ness) is plotted against betweenness central-
ity (bottleneck-ness) in the brain protein interaction network for all proteins identified
by IM-DIA-MS (grey circles). Of the significantly altered proteins (red circles), hub-
bottleneck (> 90th percentile (PC) for degree and betweenness centrality) and nonhub-
bottleneck proteins (> 90th PC for betweenness centrality) are highlighted (filled red
circles). (b) Profiles of significantly altered bottleneck proteins implicated in Aβ42 tox-
icity. Maximum abundances are scaled to unity. Numbers in parentheses denote which
cluster from Fig. 2.5c the protein was in.
Levels of the nonhub-bottleneck proteins Acs1 and Got2 (Fig. 2.8b) were stable
throughout normal ageing in our healthy flies but increased upon Aβ42 induction and
continued to rise with age in Aβ42 flies. On the other hand, Echs1 abundance increased
in healthy flies during normal ageing, but its levels were reduced upon Aβ42 induction
and its ageing-dependent increase was diminished in Aβ42 flies compared to controls.
Levels of mt:CoII (a COX subunit) declined with age in healthy control, but not in Aβ42, fly
brain, although its expression was downregulated compared to controls at all time-points
following Aβ42 induction. Finally, the cuticle protein Acp65Aa was also upregulated in
Aβ42 flies compared to controls, but levels fell sharply between 5 and 19 days of age.
Of the four hub-bottlenecks (Fig. 2.8b), Hsp70A was significantly upregulated at early
time-points (5 days) in Aβ42 flies, dropped between days 5 and 31 post-induction, then
increased at later time-points, compared to healthy controls which exhibited stable ex-
pression of this protein throughout life. We found that Gp93 was increased across age
in Aβ42 flies compared to controls, possibly suggesting an early and sustained protective
mechanism against Aβ42-induced damage. DNA topoisomerase 2 (Top2), an essential en-
zyme for DNA double-strand break repair, was decreased in Aβ42 flies, following a pattern
which mirrors changes in its expression with normal ageing. Finally, we found that actin
(Act57B) was increased in Aβ42 flies but declined with age, in comparison to control fly
brains which displayed stable expression across life.
Due to the importance of these hub and bottleneck proteins in the protein interaction
network, we predict that AD-associated alterations in their abundance will likely have a
significant effect on the cellular dynamics of the brain.
Finally, we clustered the protein interaction network into modules and performed a GO
enrichment analysis on modules that contained any of the 228 significantly altered pro-
teins. We saw no GO term enrichment when we tested these proteins clustered according
to their abundance profiles (Fig. 2.5c), presumably because the proteins affected in AD are
diverse and involved in many different biological processes. However, by testing network
modules for functional enrichment, we exploited the principle that interacting proteins
are functionally associated. Using a subgraph of the STRING network containing the sig-
nificantly altered proteins and their directly-interacting neighbours (Section 2.2.3.3), we
used MCODE [235] to find modules of densely interconnected nodes. We chose to include
neighbouring proteins to compensate for proteins that may not have been detected in the
MS experiments due to the stochastic nature of observing peptides and the wide dynamic
range of biological samples [227]. The resulting subgraph contained 4,842 proteins, in-
cluding 183 of the 228 significantly altered proteins, as well as 477 proteins that were
only identified in healthy or Aβ42 flies and 3,125 proteins that were not identified in our
IM-DIA-MS experiments.
Twelve modules were present in the network (Fig. 2.9a). Module sizes ranged from 17
to 302 nodes. The proportion of these modules that were composed of significantly
altered proteins ranged from 0 to 8%. All but one of the modules were enriched for pro-
cesses implicated in AD and ageing (Fig. 2.9), including respiration and oxidative phos-
phorylation, transcription and translation, proteolysis, DNA replication and repair, and
cell cycle regulation. These modules contained two proteins that were recently found to
be significantly altered in the brain of AD mice [251] and are both upregulated four-fold in
AD: adenylate kinase, an adenine nucleotide phosphotransferase, and the armadillo pro-
tein Arm, involved in creating long-term memories.
Figure 2.9: Analysis of network modules enriched for AD or ageing processes. MCODE was
used to identify network modules in a subgraph of the STRING network containing the
significantly altered proteins and their directly-interacting neighbours. The size of the
resulting 12 modules is plotted against the fraction of proteins in these modules that
are significantly altered in AD. Module 2 is annotated as containing ApoB. Marker sizes
denote the MCODE score for the module.
In humans, the greatest genetic risk factor for AD is the ε4 allele of ApoE—an
apolipoprotein involved in cholesterol transport and repairing brain injuries [253]. A re-
cent study showed that ApoE is only upregulated in regions of the mouse brain that have
increased levels of Aβ [251], indicating a direct link between the two proteins. Although
flies lack a homolog of ApoE, they do possess a homolog of the related apolipoprotein
ApoB (Apolpp) [254], which contributes to AD in mice [255, 256] and is correlated with
AD in humans [257, 258]. Interestingly, whilst it was not identified by IM-DIA-MS, ApoB
interacts with 12 significantly altered proteins in the STRING network, so is included in
the subgraph induced on the significantly altered proteins and their neighbours. ApoB was
found in the second highest scoring module that contains proteins involved in translation
and glucose transport [252] (Fig. 2.9).
We analysed the 31 proteins significantly altered in normal ageing, but not AD. Of
the 29 proteins that were contained in the STRING network, 24 interact directly with at
least one of the AD significantly altered proteins, suggesting an interplay between age-
ing and AD at the pathway level. Using a subgraph of the STRING network induced on
these proteins and their 1,603 neighbours, we identified eight network modules that were
enriched for ageing processes [259], including respiration, unfolded protein and oxidative
damage stress responses, cell cycle regulation, DNA damage repair, and apoptosis.
2.4 Discussion
Despite the substantial research effort spent on finding drugs against AD, effective treat-
ments remain elusive. We need to better understand the molecular processes that govern
the onset and progression of the complex pathologies observed in AD.
This knowledge will help to identify new drug targets to treat and prevent AD.
2.4.2 The brain proteome becomes dysregulated with age, in the absence
of AD
We identified 61 proteins which were significantly altered with age in fly brain, 31 of
which were not altered in response to Aβ42. Of these, structural chitin proteins (Acp1 and
CG7203), mitochondrial associated proteins (HemK1 and mRPL12), geranylgeranyl trans-
ferases (qm), and proteostasis proteins (HIP, HIP-R and Rpn1) were significantly downregulated
with age. Indeed, loss of mitochondrial function and proteostasis are key features of
the ageing brain [260] in agreement with these observations. Recent studies also suggest a
role for geranylgeranyl transferase I-mediated protein prenylation in mediating synapto-
genesis and learning and memory [261, 262]. Our findings suggest that this may represent
a novel mechanism of regulation and maintenance of these functions during ageing, which
warrants further investigation.
Rho GTPases are involved in maintenance of synaptic function [261], and reductions
in their levels correlate with ageing and increases in their expression with foraging be-
haviour in the brain of honey bees [266]. Our finding that inhibitors of these enzymes (Rho
GAPs and Rho GDIs) are upregulated in ageing fly brain further suggests that changes in
their activity may mediate loss of synaptic function throughout life.
Finally, several proteins fluctuated in expression across age in our flies, including
those involved in DNA repair (His2A), protein translation (RpL6), and ER calcium home-
ostasis (SERCA), processes which have been previously reported in association with brain
ageing [268–270]. Although alterations in these proteins are independent of Aβ42 expres-
sion in our flies, further work is required to investigate their functional role in preserving
brain function with age and their potential to increase the vulnerability of the ageing brain
to neurodegenerative diseases.
The nonhub-bottlenecks Acyl-CoA synthetase long chain (Acs1), Enoyl-CoA hydratase,
short chain 1 (Echs1) and Aspartate aminotransferase (Got2) are metabolic enzymes with previous links to neu-
ronal function and damage [169, 170]. Got2 produces the neurotransmitter L-glutamate
from aspartate, is involved in assembly of synapses and becomes elevated following brain
injury [271].
Hsp70A is a heat shock protein that responds to hypoxia and Gp93 is a stress re-
sponse protein that binds unfolded proteins, consistent with responses to abnormal Aβ42
aggregation in our flies.
The cuticle protein Acp65Aa is chitin-associated; chitin has been detected in AD
brains and suggested to facilitate Aβ nucleation [275].
Assessing the human orthologs of these genes, identified using DIOPT [277], indi-
cates that several of these bottleneck proteins have been previously implicated in asso-
ciation with AD or other neurological conditions in humans or mammalian models of
disease. ACSL4 (Acs1 ortholog) has been shown to associate with synaptic growth cone
development and mental retardation [278]. Mutations in ECHS1 (Echs1 ortholog), an enzyme
involved in mitochondrial fatty acid oxidation, associate with Leigh Syndrome, a
severe developmental neurological disorder [279]. Proteomic studies have revealed that
GOT2 (Got2 ortholog) is down-regulated in infarct regions following stroke [280], and in
AD patient brain [281]. Integrating data from human post-mortem brain studies, HSPA1A
(Hsp70Aa ortholog) is upregulated in the protein interaction network of AD patients com-
pared to healthy controls [282], and has recently been suggested to block APP processing
and Aβ production in mouse brain [283]. Synthetic, fibrillar, Aβ42 reduces expression
of TOP2B (Top2 ortholog) in rat cerebellar granule cells and in a human mesenchymal
cell line, suggesting this may contribute to DNA damage in response to amyloid [284].
HSP90B1 (Gp93 ortholog) shows increased expression following TBI in mice [285], and
associates with animal models of Huntington’s disease [286]. Finally, ACTB (Act57B or-
tholog) has been implicated as a significant AD risk gene and central hub node using in-
tegrated network analyses across GWAS [287].
Three of the nonhub-bottlenecks, Acyl-CoA synthetase long chain (Acs1), Enoyl-CoA hy-
dratase, short chain 1 (Echs1), and Aspartate aminotransferase (Got2), are metabolic en-
zymes with previous links to neuronal function and damage. Acs1 and Echs1 are involved
in the production of acetyl-CoA from fatty acids. Many enzymes involved in acetyl-CoA
metabolism associate with AD leading to acetyl-CoA deficits in the brain and loss of cholin-
ergic neurons [170]. Got2 produces the neurotransmitter L-glutamate from aspartate, is
involved in assembly of synapses and becomes elevated following brain injury [271]. Brain
Acs1 and Got2 levels were stably expressed throughout normal ageing in our healthy flies
but increased upon Aβ42 induction and continued to rise with age in Aβ42 flies. This
suggests that levels of these proteins increase independently of ageing in AD, but corre-
late closely with disease progression. On the other hand, Echs1 abundance increased in
healthy flies during normal ageing, but its levels were reduced upon Aβ42 induction and
its ageing-dependent increase was diminished in Aβ42 flies compared to controls. This
may reflect a protective response with ageing that is suppressed by Aβ42 toxicity.
Levels of mt:CoII (a COX subunit) were downregulated compared to controls at all time-points and were stable across age following
Aβ42 induction. The link between COX and AD is unclear, although Aβ is known to inhibit
COX activity [272]. For example, in AD patients, COX activity—but not abundance—is
reduced, resulting in increased levels of ROS [273]. However, in COX-deficient mouse
models of AD, plaque deposition and oxidative damage are reduced [274]. Hence, the
ageing-dependent decline in mt:CoII may represent either a reduction in COX function
which renders the brain vulnerable to damage and is exacerbated by Aβ42 toxicity, or a
protective mechanism against both ageing and amyloid toxicity.
The cuticle protein Acp65Aa was also upregulated in Aβ42 flies, but levels fell sharply
between 5 and 19 days. However, it is surprising that we identified Acp65Aa in our sam-
ples, as it is not expected to be expressed in the brain. One explanation may involve chitin,
which has been detected in AD brains and has been suggested to facilitate Aβ nucleation
[275]. Amyloid aggregation has previously been shown to plateau around 15 days post-
induction [288], which is around the same time that Acp65Aa drops in Aβ42 flies. Our
results suggest that Aβ42 causes an increase in Acp65Aa expression early in the disease,
but further experiments are needed to confirm this and to investigate its relationship with
nucleation and the aggregation process.
The four hub-bottlenecks are consistent with Aβ42 inducing stress. Hsp70A, a heat shock
protein that responds to hypoxia, was significantly upregulated at early time-points (5
days) in Aβ42 flies, compared to healthy controls which exhibited stable expression of this
protein throughout life. Although the levels dropped in Aβ42 flies between days 5 and 31
post-induction, at later time-points Hsp70A increased again, possibly suggesting a two-
phase response to hypoxia in Aβ42 flies. We found that Gp93—a stress response protein
that binds unfolded proteins—to be increased across age in Aβ42 flies compared to controls
possibly suggesting an early and sustained protective mechanism against Aβ42-induced
damage. DNA topoisomerase 2 (Top2), an essential enzyme for DNA double-strand break
repair, was decreased in Aβ42 flies, following a pattern which mirrors changes in its ex-
pression with normal ageing. Double-strand breaks occur naturally in the brain as a conse-
quence of neuronal activity—an effect that is aggravated by Aβ [171]. As a consequence of
deficient DNA repair machinery, deleterious genetic lesions may accumulate in the brain
and exacerbate neuronal loss.
Finally, we found that actin (Act57B) was increased in Aβ42 flies, in agreement with
two recent studies on mouse brains [251, 276]. Kommaddi and colleagues found that Aβ
causes depolymerisation of F-actin filaments in a mouse AD model before onset of AD
pathology [276]. The authors showed that although the concentration of monomeric
G-actin increases, the total concentration of actin remains unchanged. It has long been
known that G-, but not F-, actin is susceptible to cleavage by trypsin [289], permitting its
detection and quantification by IM-DIA-MS. Hence, the apparent increase of actin in Aβ42
flies may be due to F-actin depolymerisation, which increases the pool of trypsin-digestible
G-actin, and is consistent with the findings of Kommaddi et al. To confirm whether total
actin levels remain the same in the brains of Aβ42 flies, additional experiments would have
to be carried out in the future, for example tryptic digestion in the presence of MgADP—
which makes F-actin susceptible to cleavage [290]—and transcriptomic analysis of actin
mRNA. Furthermore, actin polymerisation is ATP-dependent, so increased levels of G-
actin may indicate reduced intracellular ATP. In addition, ATP is important for correct
protein folding and therefore reduced levels may lead to increased protein aggregation in
AD.
However, ACSL4, ECHS1 and HSP90B1 have no reported association with AD or related
dementias, which suggests that our study has the potential to identify new targets
in the molecular pathogenesis of this disease. Our study also provides additional informa-
tion about the homeostasis of these proteins across life from the point of amyloid produc-
tion. For example, the abundances of Acs1 and Got2 are elevated following Aβ42 induction
and continue to increase with age relative to controls. Echs1 is reduced in Aβ42 flies com-
pared to controls but increases across life in parallel with ageing-dependent increases in
this protein. Structural proteins Acp65Aa and Act57B are elevated in response to Aβ42,
but decline across life whilst remaining stable in control flies. Gp93 and Top2 are either
elevated or reduced in response to Aβ42 but mirror ageing-dependent alterations in their
expression. mt:CoII is reduced following Aβ42 expression at all time-points, whereas,
in controls, it is reduced with ageing. Hsp70A is increased early in Aβ42 flies, reduced to control lev-
els in mid-life, then elevated at late pathological stages whilst remaining stable in healthy
controls.
Analysing GO enrichment using network modules, to capture the diverse biological pro-
cesses modified in AD, we identified 12 modules enriched for processes previously im-
plicated in ageing and AD. This validates the use of our Drosophila model in identifying
progressive molecular changes in response to Aβ42 that are likely to correlate with pro-
gression of cognitive decline in human disease. Further work is required to modify the
genes identified in our study at different ages, in order to elucidate whether they represent
mediators of toxicity as the disease progresses, factors which increase neuronal suscepti-
bility to disease with age or compensatory protective mechanisms. Model organisms will
be essential in unravelling these complex interactions. Our study therefore forms a basis
for future analyses that may identify new targets for disease intervention that are specific
to age and/or pathological stage of AD.
2.4.7 Conclusion
mammalian systems.
Addendum
During the viva for this thesis, it was brought to our attention that some of the statisti-
cal methods that we used to calculate differential gene expression were inappropriate for
the type of data analysed in this project. We chose the five methods that we used in this
study because they can identify differentially expressed genes in time-course data, albeit
in microarray or RNA-Seq data, rather than quantitative mass spectrometry data. The key
issue here is that RNA-Seq data are count-based, so should be modelled using discrete
distributions, whereas, microarray data and quantitative mass spectrometry data are con-
tinuous, so should be modelled using continuous distributions. Discrete data should not
be modelled using continuous distributions and vice versa.
EDGE, limma and maSigPro were used appropriately in this study to model differ-
ential gene expression using continuous distributions. However, DESeq2 and edgeR were
inappropriate for these data, as these methods are based on the (discrete) negative bino-
mial distribution. Interestingly, edgeR, DESeq2 and limma detect similar proteins, whereas,
EDGE and maSigPro detect very different proteins (Fig. 2.6).
When we began this work, we were unable to find any methods designed to identify
differentially expressed genes in quantitative proteomics data. Subsequently, a number
of methods have been developed, for example DEqMS [292], from the same group that
developed DESeq and DESeq2, and DEP [293], which is based on limma.
We thank the examiners for bringing this error to our attention.
Chapter 3
Mining metagenomes for new protein functions
3.1 Introduction
3.1.1 Metagenomics and metagenome-assembled genomes
Metagenomics is the study of all genetic material in an environment (biome or micro-
biome) [294]. Metagenomics typically refers to the study of DNA, whereas, metatranscrip-
tomics refers to RNA. Microbiomes contain a large number of unknown species that have
not yet been cultured. It has been estimated [295] that only 1% of microorganisms have
been cultured—dubbed ‘the uncultured majority’ [296]. Many of these species are likely
to be eukaryotic [297], whether small or single-celled. Metagenome-assembled genomes
(MAGs) are genomes (assemblies) that are recovered from metagenomes by co-assembly
of metagenome sequences.
Recently, there has been a growing interest in metagenomics. A number of causes
have contributed to this, including sequencing becoming cheaper and easier, improved
databases to store metagenomes and metadata, and increasing maturity of bioinformat-
ics tools to process metagenomes. This has allowed microbiomes to be investigated on
unprecedented scales, for example to understand the distribution of species in humans
[298], the human gut [299, 300], the oceans [301, 302], and distributions of phages across
many of the Earth’s ecosystems [303]. This has, in turn, bettered our understanding of
the interactions between biomes, hosts and microbes, and improved our knowledge of how
microbiomes impact host health. The human gut microbiome has been shown to be domi-
nated by environmental factors, rather than host genetics [304]. In other words, cohabiting
individuals share microbiomes, whereas, family members who do not live together have
different microbiomes. Furthermore, the gut microbiome has been linked to mental health,
quality of life and depression, via the microbiota-gut-brain axis [305].
Metagenomics can be divided into a number of steps, outlined in the sections below.
Recently, long-read Nanopore sequencing [318] using the MinION sequencer has
been gaining in popularity—even being used by PuntSeq in my own backyard to study
the metagenome of the river Cam in Cambridge [319]. Compared to Illumina sequencing,
MinIONs are small, portable, inexpensive and simple to operate [320], which has democra-
tised sequencing and genomics. In Nanopore sequencing, a polymer is ratcheted through
a protein nanopore, which disrupts the ionic current across the nanopore [321]. Polymer
sequences can be decoded from the characteristic patterns of currents across nanopores.
Nanopore sequencing can even be used to sequence proteins in real-time [322]. Unlike
Illumina sequencing, megabase-long molecules can be sequenced in one go using Nanopore
sequencing, without needing to be fragmented. The main drawback of the platform is
the high error rate, between 5% and 15% [323]. As the errors are uniformly distributed
across the length of the sequence, high coverage depth can mitigate errors. Mul-
tiple sources contribute to the error rate, including the simultaneous influence of multiple
bases on the current across the pore [323]. It is thought that between five and six adjacent
bases (corresponding to 4⁵ or 4⁶ possible k-mers) contribute to the signal at each position
in the sequence. Efforts are being made to reduce the error rate, including updates to the
sequencing chemistry. Contributions of multiple adjacent bases and a low signal-to-noise
ratio produce an enigmatic signal, but computational approaches are being developed to
improve the base calling accuracy [323, 324].
Short- and long-read sequencing work synergistically and can be applied in tandem,
each alleviating the drawbacks of the other. A key motivation for this is hybrid assembly
of reads into contigs [325]. On one hand, short-reads have high accuracy, but are hard to
assemble into long contigs. On the other hand, long-reads have low accuracy, but can help
to bridge gaps in assemblies to form longer contigs [326].
Metagenomics produces large volumes of data, on an unprecedented scale for the biolog-
ical sciences. Processing these data requires new methods, such as MetaSPAdes [327] for
genome assembly, MMseqs2 [328] for sequence searches and Linclust [137] for sequence
clustering. Performances of various metagenomics tools were compared in a rigorous as-
sessment [329]. MGnify [330] provides a turnkey analysis pipeline for metagenome se-
quencing reads (https://github.com/EBI-Metagenomics/pipeline-v5) written in Common
Workflow Language [331]. Users need only upload their raw reads to the European Nu-
cleotide Archive [332] to be processed by the MGnify pipeline. MGnify is also a key
database for metagenomics data.
with low species diversity [342]. Restrictions on the maximum diversity of species
in a microbiome have since been lifted [299, 300, 343] by longer read lengths, im-
proved assemblers and better binning algorithms, which has allowed MAGs to be
recovered from diverse communities [344]. Binning assigns contigs to a bin if it is
likely they are from the same genome. Various properties are used to determine
the likelihood, including GC content, tetra-nucleotide frequency and depth of se-
quencing coverage [344]. MetaBAT is a popular binning algorithm [345]. Binning
increases the ability to recover genomes of rare species with low abundance in com-
munities [346]. A Nextflow pipeline for recovering MAGs is available in nf-core
[347] (https://github.com/nf-core/mag). MAGs are assigned to a taxonomic group
using GTDB-Tk [348].
4. Assessing the quality of metagenomes
Genome quality, measured by completeness and contamination, can be assessed by
checking for the presence of universally-conserved genes in particular taxa. Examples
of tools include CheckM [349] for prokaryotes, EukCC [297] for eukaryotes, and
BUSCO [350] for both. Scores generated by these tools are complementary to classical
genome metrics, such as N50 (the length of the shortest contig among the longest
contigs that together cover 50% of the genome length; see the sketch after this list)
or L50 (the number of contigs that make up 50% of the genome length). Other
scores, such as the minimum information about a metagenome-assembled genome
(MIMAG), have been devised [344].
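As a concrete illustration of the N50 metric described above, the following is a minimal Python sketch (not part of the thesis pipelines):

def n50(contig_lengths):
    # N50: length of the contig at which contigs, sorted by descending
    # length, cumulatively reach 50% of the total assembly length.
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    total = 0
    for length in lengths:
        total += length
        if total >= half:
            return length

assert n50([100, 80, 50, 30, 20]) == 80  # 100 + 80 = 180 >= 140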
α/β hydrolases (ABHs) are hydrolases that contain an ABH domain (Fig. 3.1). The ABH
domain fold was first described in 1992 by structural comparisons of proteins that were of
very different phylogenetic origin and catalytic function [351]. These proteins, whilst not
sharing common sequences or substrates, had a characteristic fold, consisting of an eight-
stranded beta-sheet sandwiched between two planes of alpha-helices. A conserved cat-
alytic triad, consisting of nucleophile—histidine—acid residues located on loops between
the alpha-helices and beta-sheets, performs catalysis [351]. The triad is reminiscent of
the prototypical catalytic triad of serine proteases [352], but the residues are arranged
in the mirror inverse—a prototypical example of convergent evolution. Whilst the posi-
tions of the catalytic triad’s side-chains are exquisitely conserved, the rest of the sequence
is not conserved, leading to great structural diversity of the non-core domain structure
[353]. As such, the CATH ABH domain superfamily contains 34 structurally similar groups
(5Å complete-linkage clusters), each composed of a conserved ABH domain core that is
embellished with many non-conserved structural elements. One consequence of this is that
the substrate-binding site is not conserved, permitting ABHs to act on multifarious sub-
strates [354, 355]. To show how functionally diverse the superfamily is, its members are
annotated with 1,277 unique Gene Ontology (GO) terms and 458 unique Enzyme Com-
mission (EC) terms. The superfamily includes a diverse set of important enzymes, such as acetyl-
cholinesterase, lipase, thioesterase, and various types of peptidases [356]. A number of
proteins relevant to biotechnology and pharmaceuticals also contain ABH domains, in-
cluding many plastic-degrading enzymes, introduced in Section 3.1.4. Rather remarkably,
the same ABH can catalyse endopeptidyl hydrolysis and epoxide ring opening, using a
single catalytic triad [357].
Figure 3.1: Structure of the α/β hydrolase domain fold. The β-sheet (yellow) is sandwiched be-
tween two planes of α-helices (blue). The nucleophile—histidine—acid (S-H-D in this ex-
ample) residues (red) are on loops. Figure from [358]. Structure of DUF2319 C-terminal
catalytic domain from Mycobacterium tuberculosis H37Rv. DUF2319 proteins lack β1
strand of the canonical fold.
In CATH, ABH domains have a three-layer αβα sandwich architecture and Rossmann fold [359] topology. The ABH domain is one of the
largest superfamilies in nature: Gene3D (v16) predicts 557,283 ABH domains in UniProt
[49, 360]. The ABH superfamily contains 377 FunFams in CATH (v4.2).
In this chapter, we search for ABHs in metagenomes, focussing on sequences that are similar to a novel PET hydrolase,
PETase.
3.1.5 PETase
In 2016, a new bacterial species was discovered that is able to metabolise PET as its only
carbon source [363, 374]. Ideonella sakaiensis (I. sakaiensis) was isolated from outside a
plastic recycling plant in Japan, living on PET bottles—an extreme, unnatural environ-
ment, with a high selection pressure to evolve PET metabolism. Two enzymes are respon-
sible: PETase, which depolymerises PET into MHET, and MHETase, which breaks MHET
down (Fig. 3.2). The resulting ethylene glycol and terephthalic acid are metabolised by I.
sakaiensis to provide ATP.
A number of concerns [375] were voiced by Yang et al. about the work presented
by Yoshida et al. [374]. Firstly, Yang identified that low-crystallinity PET (1.9%) was
used to test the efficiency of PETase. Typically, commercial PET bottles have 30 to 40%
crystallinity. A cutinase from Fusarium solani pisi that hydrolyses PET was shown to have a
non-linear decrease in PET hydrolysis efficiency as the crystallinity increased [372]. There-
fore, PETase was tested on a substrate that is not representative of the PET that the enzyme
would most likely encounter in real-world use. Secondly, Yoshida did not measure the mass of PET to
show that it was broken down by I. sakaiensis. Instead, they showed a gel permeation
chromatogram and presented almost identical traces between a 22 day PETase experi-
ment and a 0 day negative control. Yoshida argue that PET hydrolysis only occurred at
the surface of the PET film, but Yang counter argue that breakdown of PET on the surface
could be caused by the mechanical action (i.e. not enzymatic) of I. sakaiensis on the surface
of the film. However, Yoshida argue [376] that their intention was simply to present a mi-
croorganism that can grow on PET, and to identify the enzymes that permit this growth.
They also argue that microbial degradation of PET had previously been confirmed [369].
In sum, I. sakaiensis is able to grow on PET, using PETase, but PETase has not been shown
to be able to efficiently hydrolyse the types of PET that are used prolifically in commercial
situations.
Whilst PETase and MHETase are responsible for PET degradation [374], the exact
mechanisms are unknown. Structural studies have proposed potential mechanisms [377–
380]. Despite evolving in a PET-rich environment, PETase does not hydrolyse PET opti-
mally [379]. PETase was modified to narrow the substrate binding cleft, by introducing
Figure 3.2: Two enzymes are responsible for PET hydrolysis in I. sakaiensis. PETase de-
polymerises PET into MHET, followed by MHETase, which breaks MHET down into
ethylene glycol and terephthalic acid. Figure from [363].
conserved amino acids that are found in the equivalent positions in cutinases, which im-
proved PET degradation [379]. But this improvement pales in comparison to leaf-branch
compost cutinase (LCC) that degrades PET four orders of magnitude faster than PETase
[381]. Disulfide bridges were engineered into LCC to make it more thermostable. Post-
consumer PET waste was efficiently processed by LCC, with an estimated material cost of
required protein at 4% of the ton-price of virgin PET. As a tour de force, PET was then
completely recycled using LCC, whereby the breakdown products of ethylene glycol and
terephthalic acid, were used to re-form PET, demonstrating the potential for a circular
economy [381].
3.1.6 Contributions
The plastic bottle has transitioned from a modern miracle to an environmental scourge,
within a generation. Enzymatic bioremediation and recycling of PET is a promising solu-
tion to this problem. PET hydrolases are an exciting class of natural enzymes that have
recently evolved to degrade PET. Metagenomes are large, untapped reservoirs of novel
functions.
In this chapter, we mine metagenomes for ABH domains that may degrade PET or
other plastics. We analyse how metagenomic proteins from the MGnify database are dis-
tributed between the biomes. Using hidden Markov models (HMMs) of CATH superfami-
lies from Gene3D and FunFams, we identified 500,000 ABH domains and 400,000 FunFam
domains in these sequences.
Large sequence data sets are becoming increasingly common with the advent of large-
scale genome sequencing and metagenomics. GARDENER, the current method for gener-
ating FunFams, is unable to scale to data of this size. In this chapter, we also develop a new
divide-and-conquer algorithm, FRAN, to generate FunFams on large sequence data sets.
We rigorously benchmark the performance of FRAN and compare it to GARDENER.
3.2 Methods
3.2.1 Data
3.2.1.1 MGnify
MGnify [330] is a microbiome database. MGnify hosts metagenome sequences from mi-
crobiome studies. Typically, uploaded data sets are from shotgun metagenomics [382].
MGnify assembles reads into contigs using metaSPAdes [327], a De Bruijn graph assem-
bler. With contigs longer than 500 nucleotides in hand, RNA genes are predicted using
Rfam [383] and masked. ORFs are predicted in the remaining sequences using the prokary-
otic gene callers Prodigal [339] and FragGeneScan [340]. Prodigal is run first, followed by
FragGeneScan for regions in which Prodigal does not predict any proteins. Protein se-
quences were clustered using Linclust at 90% sequence identity with 90% coverage of
the centroid sequence (command line arguments: --min-seq-id 0.90 -c 0.9 --cov-mode 1).
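For illustration, this clustering step could be reproduced with the easy-linclust convenience workflow of MMseqs2. The following is a sketch with illustrative file names, not the exact MGnify invocation:

import subprocess

# Cluster predicted protein sequences at 90% identity with 90%
# coverage of the centroid (cov-mode 1). File names are illustrative.
subprocess.run(
    [
        "mmseqs", "easy-linclust",
        "mgnify_proteins.fasta",   # input sequences
        "mgnify_s90",              # output prefix
        "tmp",                     # temporary working directory
        "--min-seq-id", "0.90",
        "-c", "0.9",
        "--cov-mode", "1",
    ],
    check=True,
)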
Metagenome studies are associated with metadata. One such datum is the biome from
which the metagenome was sampled. Biomes are classified according to the Genomes On-
Line Database (GOLD) [384] microbiome ontology, which is a directed acyclic graph that
describes hierarchical relationships between biomes, akin to the Gene Ontology for func-
3.2. Methods 129
tion annotations. Whilst biome classification at the study level is mutually exclusive, it
is not so at the protein sequence level because the same sequence can be found in multi-
ple biomes. A selection of 13 high-level ontological terms are associated with sequences
(demarcated by *):
• Engineered*
• Environmental
– Aquatic*
∗ Marine*
∗ Freshwater*
– Soil*
∗ Clay*
∗ Shrubland*
• Host-associated
– Plants*
– Human*
∗ Digestive system*
– Host but not root
∗ Digestive system*
– Animal*
• None of the above*
CATH, Gene3D and FunFams were introduced in Sections 1.3.4, 1.3.4.1 and 1.3.4.2. CATH
[33] v4.2 and Gene3D [49] v16 were used. Superfamily HMMs in Gene3D v16 were gen-
erated using UniProt [360] August 2017 release. Gene3D contains models for 6,119 su-
perfamilies, represented by 65,014 HMMs trained on multiple sequence alignments of S95
Figure 3.3: Number of sequences in Swiss-Prot, TrEMBL and MGnify. MGnify non-redundant
sequences are S90 cluster representative sequences.
sequence identity clusters. ABH domain sequences predicted by Gene3D from UniProt
August 2017 release were used.
3.2.2 Software
3.2.2.1 Containers
Singularity [385] v2.6.0 was used to run Singularity and Docker containers. Docker
(https://www.docker.com) images were downloaded from Docker Hub (https://hub.docker.com)
and BioContainers [386].
3.2.2.2 HMMER
HMMs and HMMER were introduced in Section 1.3.2. HMMER [27] v3.2.1 was used from
the Docker image “biocontainers/hmmer:v3.2.1dfsg-1-deb_cv1”.
3.2.2.3 Linclust
Linclust [137] is a fast, greedy, single-linkage sequence clustering method in MMseqs2
[328]. Due to some clever tricks, Linclust’s runtime scales linearly with the number of
sequences, independently of the number of clusters. First, Linclust performs an approxi-
mate and inexpensive clustering by assigning each sequence to multiple sets, or canopies
[387]. Canopy clustering approaches improve time complexities by applying exact but
expensive clustering methods to each canopy independently. Linclust uses sequence Min-
Hashing [133–135, 388, 389] to generate canopies. p different hash functions are used to
select the p k-mers from each sequence with the lowest hash value. Here, we used p = 21.
3.2. Methods 131
Sequences that share a k-mer are assigned to the same canopy. The longest sequence
in each canopy is selected as the centroid sequence. Edges are added between centroids
and the sequences in the canopy to form an initial single-linkage clustering. Expensive
comparisons, such as sequence alignment, are only performed between ≤ pn connected
sequence pairs. O(pn) runtime complexity is achieved where p ≪ n, compared with
O(n²) for all-against-all comparison. Finally, edges are removed between sequence pairs
if one or more clustering criteria, such as the sequence identity threshold, are not met.
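The following is a minimal sketch of the k-mer MinHashing step described above, in which each of the p hash functions selects the k-mer with the lowest hash value; Python's salted built-in hash stands in for proper hash functions, and all names and sequences are illustrative:

def min_hash_kmers(sequence, k=10, p=21):
    # Return p k-mers, one per simulated hash function (salt), each
    # with the lowest hash value under that function. Real
    # implementations use dedicated hash functions rather than
    # Python's (per-process randomised) built-in hash.
    kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    return {min(kmers, key=lambda s: hash((salt, s))) for salt in range(p)}

sequences = {
    "seq1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "seq2": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA",
}

# Sequences that share a selected k-mer fall into the same canopy.
canopies = {}
for name, seq in sequences.items():
    for kmer in min_hash_kmers(seq):
        canopies.setdefault(kmer, []).append(name)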
Linclust v10 was used from the Docker image “soedinglab/mmseqs2:version-10”. Se-
quences were clustered at different sequence identity thresholds and coverage of centroid
sequences, typically 30%, 50%, 70% or 90% (command line arguments: --min-seq-id <X> -c <X> --cov-mode 1, where X is the threshold, provided as a fraction).
3.2.2.4 cath-resolve-hits
cath-resolve-hits was introduced in Section 1.3.4.1. cath-resolve-hits [52] v0.16.2 was used
to resolve domain boundaries in multi-domain architectures (MDAs). Different parameters
were used depending on the context, so these are defined in the protocols. cath-resolve-
hits was used from the Docker image “harryscholes/cath-resolve-hits:0.16.2”.
3.2.2.5 Nextflow
Nextflow [390] (v19.07.0.5106) was used to write and execute reproducible pipelines.
Nextflow introduces a number of useful features for deploying pipelines, particularly in
high-performance computing cluster environments. Nextflow pipelines are constructed
from building blocks of ‘processes’, arranged in linear or branched topologies. Each pro-
cess contains a set of inputs, a script that processes the inputs, and a set of outputs. The
script can be in any language, be it Bash, Julia or Python. Parallelism can be easily in-
troduced by splitting inputs into chunks, to be processed separately. Processes only re-
quest their required CPU and memory resources from cluster queue managers. To achieve
true portability, each process can be executed in a container (Docker and Singularity are
supported), which can be automatically pulled from various container repositories. Inter-
mediate results are cached, which allows pipelines to be resumed and updated without
repeating a lot of computation. Many expert-written pipelines are available from nf-core
[347], a community effort to provide complex bioinformatics pipelines to users, requiring
little (computational) domain-specific knowledge. In sum, Nextflow is one of many new
tools, which make it easy to write and execute reliable and reproducible pipelines.
Where appropriate, pipelines described in this work are implemented in Nextflow.
These pipelines are designed to be portable and reproducible. For example, input data are
automatically downloaded where possible, processes are executed in Singularity contain-
ers, and data are processed in parallel for speed.
for all i residue numbers in D and all j residue numbers in P . MDAs were assigned by
sorting domains according to their centre of mass.
ABH domain sequences were extracted from full-length sequences using the resolved
start and stop positions of continuous domains and, for discontinuous domains, the
concatenated sequence from their multiple start and stop positions.
Algorithm 3.2 Predict CATH superfamily domains in large sequence data sets.
1: procedure
2: scan sequences against a subset of the Gene3D HMM library
3: create a subset of the sequences with a hit to one of the HMMs
4: Algorithm 3.1 continues from Line 1
5: end procedure
First, the sequence database was searched using a subset of models. In this case, the
MGnify proteins were searched using the Gene3D HMMs from the ABH superfamily and
the same settings as Section 3.2.4.1. Second, the sequences that have hits to the subset
of models are extracted to create a derivative database that is likely to be much smaller
than the original database. The derivative database is used as input to the protocol in
Section 3.2.4.1. This modified protocol is implemented as a Nextflow pipeline.
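A minimal sketch of the two-stage protocol, assuming HMMER's hmmsearch and illustrative file names (the real pipeline is implemented in Nextflow):

import subprocess

# Step 1: scan sequences against the ABH-superfamily subset of the
# Gene3D HMM library (file names are illustrative).
subprocess.run(
    ["hmmsearch", "--tblout", "abh_hits.tbl", "-E", "0.001",
     "abh_superfamily.hmm", "mgnify_proteins.fasta"],
    check=True,
)

# Step 2: collect the identifiers of sequences with at least one hit.
hit_ids = set()
with open("abh_hits.tbl") as f:
    for line in f:
        if not line.startswith("#"):
            hit_ids.add(line.split()[0])  # target sequence name

# These identifiers define a much smaller derivative database, which
# is then processed by the full protocol of Section 3.2.4.1.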
FunFam domains were identified in sequences that contain ABH superfamily domains
from Section 3.2.4.2 using Algorithm 3.3. These sequences were scanned against the
FunFam HMM library using HMMER. The per-FunFam inclusion threshold was used
(command line option --cut_tc). MDAs were resolved using cath-resolve-hits (command
line options --min-dc-hmm-coverage=80 --worst-permissible-bitscore=25
--long-domains-preference=2). This protocol is also implemented as a Nextflow pipeline.
ABH domains from full-length MGnify sequences and UniProt were pooled to create a
combined ABH domain data set. These sequences were clustered into single-linkage clus-
ters using Linclust at 30%, 50%, 70% and 90% sequence identity and coverage of the
cluster centroid sequence, implemented as a Nextflow pipeline.
Mixed clusters, composed of ABH domains from MGnify and UniProt, from clustering at
S30, S50, S70 and S90 (Section 3.2.5) were used. The kingdom-level taxonomy of UniProt
sequences in mixed clusters were downloaded using the Proteins API [393].
3.2. Methods 135
C- and N-terminal regions and inter-domain regions of sequence may contain novel do-
mains that evolved in metagenomes. To test the evidence for novel domain evolution, termi-
nal and inter-domain sequence lengths were calculated from Gene3D hits (Section 3.2.4.2).
Inter-domain regions are defined as contiguous regions of sequence that do not have a
significant match to any Gene3D HMM in the protein’s MDA. Full-length proteins whose
MDAs only contain continuous domains and do not contain discontinuous domains were
considered. Single-domain proteins have two terminal regions, whereas, multi-domain
proteins can also have inter-domain sequences between each pair of adjacent domains.
Examples are shown below for domains (@) and terminal and inter-domain regions (-).
i. Single-domain protein:
-----@@@@@@@-----
Terminal: 1 2
ii. Multi-domain protein:
-----@@@@@@@---@@@@@@@---@@@@@@@-----
Terminal: 1 4
Inter-domain: 2 3
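A minimal sketch of this calculation for proteins whose MDAs contain only continuous domains, assuming 1-based, sorted (start, stop) domain boundaries; names are illustrative:

def region_lengths(protein_length, domains):
    # Terminal and inter-domain region lengths. `domains` is a list
    # of (start, stop) residue positions, 1-based and sorted.
    n_term = domains[0][0] - 1
    c_term = protein_length - domains[-1][1]
    inter = [b[0] - a[1] - 1 for a, b in zip(domains, domains[1:])]
    return n_term, c_term, inter

# Single-domain protein of length 17 with a domain at residues 6-12,
# matching diagram i above:
assert region_lengths(17, [(6, 12)]) == (5, 5, [])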
\[ A_{S_1 S_2} = \sum m + \sum g \]
where m are the substitution matrix scores for each match and mismatch, and g are the
gap opening penalties. Distances D were calculated from alignment scores [395] by
\[ D_{S_1 S_2} = 1 - \frac{A_{S_1 S_2}}{\min(A_{S_1 S_1}, A_{S_2 S_2})}. \]
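As a sketch, this distance could be computed with Biopython's PairwiseAligner; the substitution matrix and gap penalty shown are illustrative stand-ins for the parameters used in this work:

from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10  # illustrative gap opening penalty

def distance(s1, s2):
    # Distance from alignment scores, normalised by the smaller
    # self-alignment score, so identical sequences have distance 0.
    a12 = aligner.score(s1, s2)
    return 1 - a12 / min(aligner.score(s1, s1), aligner.score(s2, s2))

d = distance("MKTAYIAKQR", "MKTAYIAKQL")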
still exceeds memory limits because all sequences must be available at all times. However,
GARDENER did show that it is possible to merge FunFams by treating them as starting
clusters. We use this knowledge in designing a method to generate FunFams at gigascale.
An aside on scaling FunFams There may be a low-hanging fruit solution to the FunFam
generation problem that will not be tackled in this chapter. Currently, GeMMA grows a
tree from leaves to root. FunFHMMer then partitions the tree into FunFams. GeMMA
trees are cut closer to the leaves than the root, so the later node merges are redundant.
Although there are fewer of them, later merges are much more costly than early ones. The
low-hanging fruit is to only grow a partial tree. To do this, FunFHMMer would be run in
concert with GeMMA, every k merges, where k is sufficiently large. GeMMA would be
stopped prematurely if the FunFam criteria are met. Running GeMMA and FunFHMMer
in serial is O(n³) + O(n²), which reduces to O(n³) in big O notation. If the entire tree is
grown, the iterative procedure proposed here is O(n³) + O((n/k) n²) in the worst case, when
FunFHMMer is run n/k times, which also reduces to O(n³). On average, the empirical
runtime will be much less than this because the FunFam criteria will be met before the
entire tree is grown. We were interested in implementing this new algorithm, but time
constraints and lack of Perl knowledge prevented it.
Let M be the set of protein sequences from a superfamily that have the same MDA.
M is also known as an MDA partition. The goal is to subset M into k representative
groups, Gk , each of size n. To do this, sequences are sampled into groups, such that cer-
tain characteristics of M are preserved. Namely, we wish Gi to have approximately the
same sequence diversity as M , whilst not being biased to any dominant regions of se-
quence space. In designing this algorithm, we were inspired by two clever algorithms, one
for clustering and another for sampling, which we now explain briefly. Canopy clustering
[387] initially clusters items into sets, or canopies, using a cheap method, followed by ex-
pensive clusterings of items within each canopy. The second method, geometric sketching
[397], uses by ‘geometric’ sampling to summarise large data sets using a small subset of the
data that preserves the geometry, rather than the density, of the whole data set. Initially,
a high-dimensional data space is covered with a lattice of equal-sized boxes, from which
sketches are generated by uniformly sampling a box, followed by uniformly sampling an
item from that box, repeated until the sketch is the desired size.
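A minimal sketch of the geometric sampling idea, assuming a NumPy array of points; the box width and all names are illustrative:

import numpy as np

def geometric_sample(X, n, box_width=1.0, seed=0):
    # Cover the data space with a lattice of equal-sized boxes, then
    # repeatedly pick a box uniformly and a point uniformly from
    # within it, preserving the geometry rather than the density.
    rng = np.random.default_rng(seed)
    boxes = {}
    for i, row in enumerate(np.floor(X / box_width).astype(int)):
        boxes.setdefault(tuple(row), []).append(i)
    keys = list(boxes)
    sample = set()
    while len(sample) < n:
        box = keys[rng.integers(len(keys))]
        sample.add(boxes[box][rng.integers(len(boxes[box]))])
    return sorted(sample)

X = np.random.default_rng(1).normal(size=(1000, 2))  # toy 2D data
subset = geometric_sample(X, 50)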
3.2.10.2 FRAN
3.2.10.3 FRANgeometric
FRANgeometric is FunFam generation by random geometric sampling (Algorithm 3.6), a
geometric sampling method. The difference in performance between FRANgeometric and
FRAN will demonstrate the added value of considering the evolutionary relationships
between sequences to maintain the desirable characteristics of M. In practice, we actually
implemented Algorithm 3.6 as Algorithm 3.7.
Algorithm 3.7, line 6 is O(|M |), whilst line 7 is O(|C|), where |C| < |M |. These time
savings can be made because each cluster representative represents the sequence diversity
of its cluster. Each group Gi then takes the place of M in the extant FunFam algorithm,
GARDENER. As sequence data sets continue to grow in size, FunFam generation can be
made computationally tractable once more because |G_i| ≪ |M|.
ii. Rand index: Let S = {o1 , ..., on } be a set of n elements. The Rand index R can be used
to compare two clustering methods that partition S into k subsets, X = {X1 , ..., Xk }, and
l subsets, Y = {Y1 , ..., Yl }, respectively. The Rand index is defined as
\[ R = \frac{a+b}{\binom{n}{2}} \]
where
• a is the number of pairs of elements in S that are in the same cluster in X and Y ,
• b is the number of pairs of elements in S that are in different clusters in X and Y .
R = 1 if and only if X = Y . R = 0 if elements are clustered completely differently in
X and Y . Rand indexes were calculated to compare FunFam clusterings, whereby only
FunFams that were composed of two or more starting clusters were considered.
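A minimal sketch of the Rand index calculation, with clusterings given as mappings from elements to cluster labels; the toy clusterings are illustrative:

from itertools import combinations

def rand_index(x, y):
    # Count pairs that agree: in the same cluster in both clusterings,
    # or in different clusters in both clusterings.
    pairs = list(combinations(x.keys(), 2))
    agree = sum((x[a] == x[b]) == (y[a] == y[b]) for a, b in pairs)
    return agree / len(pairs)

x = {"o1": 1, "o2": 1, "o3": 2, "o4": 2}
y = {"o1": 1, "o2": 1, "o3": 2, "o4": 3}
print(rand_index(x, y))  # 5 agreeing pairs of 6, approximately 0.83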
iii. Graph theoretic measures: Let S = {S1 , ..., Sk } be a set of starting clusters that are
clustered into a set of FunFams F = {F1 , ..., Fl } by some method M .
Let G = (S, E) be a graph of S where pairs of starting clusters are connected by an edge
if M clusters them into the same FunFam (Fig. 3.4). Therefore the l FunFams are both the
maximal cliques and connected components of G. A clique is a subgraph, where every
pair of nodes are adjacent. A maximal clique is a clique that cannot be extended, to form
a larger clique, by including one adjacent node.
FunFam graphs were constructed for FunFams generated by FRAN, FRANgeometric and
GARDENER. For all combinations of these graphs, a new graph Gu = G1 ∪ G2 was con-
structed from the union of nodes and edges (Fig. 3.4). The number of maximal cliques
and connected components in Gu were calculated. If the FunFams in G1 and G2 agree,
the number of maximal cliques and connected components will be the same. However,
Figure 3.4: FunFam graphs and their graph unions. Nodes in FunFam graphs are starting clus-
ters that are connected by an edge if they are clustered into the same FunFam. FunFams
are both the maximal cliques and connected components in FunFam graphs. Graph
unions Gu can be constructed from the node and edge sets of two FunFam graphs G1
and G2 .
if the FunFams in G1 and G2 partially agree, the number of connected components will
decrease, whilst the number of maximal cliques will increase. This will occur when a pair
of FunFams share at least one, but not all, starting clusters. As such, maximal cliques that
were separate connected components are now connected in the same component.
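A sketch of these graph theoretic measures using networkx; the toy graphs are illustrative, not the FunFam graphs analysed here:

import networkx as nx

# Toy FunFam graphs: nodes are starting clusters, edges connect
# clusters placed in the same FunFam by a method. The two methods
# partially agree on clusters 1-4 and disagree on 5-7.
G1 = nx.Graph([(1, 2), (2, 3), (3, 4), (5, 6), (5, 7), (6, 7)])
G2 = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 5), (6, 7)])

Gu = nx.compose(G1, G2)  # union of node and edge sets
n_components = nx.number_connected_components(Gu)
n_cliques = len(list(nx.find_cliques(Gu)))  # maximal cliques
print(n_components, n_cliques)  # fewer components, more cliques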
3.3 Results
3.3.1 Biome distribution of proteins found in metagenomes
We assessed which biomes the metagenomic protein sequences were sampled from
(Fig. 3.5). We used the S90 cluster representatives of MGnify’s predicted protein sequences.
The most populous biome is ‘Aquatic’ with 142,232,832 sequences. This is followed by
‘Marine’, a sub-biome of ‘Aquatic’, with 104,242,612 sequences, which corresponds to
73.3% of sequences in the ‘Aquatic’ biome. The ‘Engineered’ biome contains the third most
sequences at 78,057,115. Engineered biomes encompass a broad range of non-natural
biomes, including industrial settings, laboratory conditions or waste treatment. Fourth
is ‘Human’ with 67,475,113 sequences, followed by the human ‘Digestive system’ sub-
Figure 3.5: Biome distribution of MGnify protein sequences. Biomes are ordered by number
of sequences. † = ‘Host but not root:Digestive system’.
The predicted ABH domains comprised 1,444,433 unique ABH domain sequences. Following
assignment of MDAs, 1,435,764 sequences contained ABH domains, whilst 194,402 se-
quences were subsequently found not to contain ABH domains.
Prodigal was used to predict whether sequences are full-length, truncated at the N-
terminus, C-terminus, or both (Fig. 3.6). 508,693 proteins (35%) are predicted to be full-
length. 46% of sequences were truncated at one end, split equally between 330,328 N-
terminal and 330,043 C-terminal truncations. Finally, 266,700 (19%) of sequences were
truncated at both ends. Due to the size of the data, we only took full-length sequences,
and any ABH domains contained in full-length sequences, forward for further analysis.
Figure 3.6: Truncation of MGnify protein sequences. Prodigal was used to predict whether
sequences are full-length, truncated at the N-terminus, C-terminus, or both.
Lengths of ABH domains from different databases were compared. Length distribu-
tion of ABH domains from MGnify agree well with those from UniProt and Gene3D HMMs
(Fig. 3.7). Gene3D ABH HMMs, built from structures in CATH v4.2, have median length
284 residues. ABH domains in UniProt sequences from Gene3D have median length 259.
It is reasonable to expect that the median match length will be shorter than the median
HMM length because subsequences can match HMMs with significant E-values. ABH do-
mains in MGnify have median length 247 residues.
Figure 3.7: Distribution of ABH domain lengths. Lengths of ABH domains from MGnify,
UniProt and Gene3D HMMs are plotted. The probability density function of each dis-
tribution was estimated using kernel density estimation.
We scanned the full-length sequences that contain the 508,693 ABH domains against the FunFam HMM library. Whilst CATH su-
perfamily domains are assigned using a lenient significance threshold of E < 0.001 in
Gene3D, sequences are assigned to FunFams using a much stricter threshold, known as the
FunFam inclusion threshold. An inclusion threshold is generated by scanning sequences
from a FunFam alignment against the FunFam HMM. The lowest (worst) bit score is the
inclusion threshold. There may be many and overlapping FunFam matches to a sequence.
A researcher may wish to know all FunFam matches, in which case these matches are the
desired output. Instead, if a researcher prefers to know an MDA, then cath-resolve-hits
can be run to resolve the matches to the optimal set of non-overlapping FunFams.
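By way of illustration, a minimal sketch of how an inclusion threshold could be computed and applied; the function and variable names are hypothetical, and the bit scores are assumed to have been obtained already by scanning the FunFam's alignment sequences against its HMM:

```python
def inclusion_threshold(alignment_bit_scores):
    """The inclusion threshold is the lowest (worst) bit score obtained when
    sequences from the FunFam's own alignment are scanned against its HMM."""
    return min(alignment_bit_scores)

def assign_to_funfam(query_bit_score, threshold):
    """Assign a query domain to the FunFam only if its bit score meets the
    inclusion threshold (much stricter than Gene3D's E < 0.001 cut-off)."""
    return query_bit_score >= threshold

threshold = inclusion_threshold([105.2, 98.7, 122.4])  # placeholder scores
assert assign_to_funfam(110.0, threshold)
```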
After resolving FunFam MDAs, we found 398,580 significant FunFam hits in 360,119
sequences. Of these, there were 357,073 hits to ABH FunFams, in 351,853 protein se-
quences. There is not a one-to-one mapping between ABH hits from Gene3D and Fun-
Fams. Some proteins did not have any FunFam hits: 148,574 proteins that were predicted
to have an ABH domain do not have hits to FunFams from any superfamily. Other proteins
did not have any ABH FunFam hits: 156,840 proteins that were predicted to have ABH
domains do not have hits to ABH FunFams.
Lengths of ABH FunFam matches are distributed similarly to superfamily matches
(Fig. 3.8). Large peaks for FunFam HMMs and FunFam matches exist at a length of 260
amino acids.
Figure 3.8: Length distribution of ABH superfamily and ABH FunFam matches. ABH
HMMs from Gene3D and FunFams are also plotted. The probability density function of
each distribution was estimated using kernel density estimation.
We examined whether any biomes are enriched with ABH domains and may be promising
biomes to search for candidate plastic-degrading enzymes. A linear relationship exists
between the number of sequences found in a biome and the number of ABH domains
found in those sequences (Fig. 3.9), suggesting that ABH domains occur at a roughly
constant rate across biomes. Most biomes contain the expected number of ABH domains,
given the number of proteins that were found in the biome. According to a fitted regression
model, ‘Engineered’ biomes have significantly more ABH domains than expected (Fisher’s
P ≈ 0). The regression model also predicts that proteins from ‘Human’ and human
‘Digestive system’ biomes are depleted of ABH domains (Fisher’s P ≈ 0).
Figure 3.9: Relationship between the number of MGnify proteins and the number of ABH
domains per biome. A linear regression model was fitted to the data and plotted.
Biomes that deviate from the regression line are labelled.
319,196 sequences (30%) share more than 90% sequence identity with another sequence
in the data set. As the MGnify proteins are S90 cluster representatives, they will not cluster
together at S90. Therefore, some UniProt and MGnify sequences may be clustering into
mixed origin clusters (Section 3.3.5).
Figure 3.10: Number of S90 clusters in MGnify and UniProt ABH domains. 1,065,976 ABH
domain sequences from MGnify (508,693) and UniProt (557,283) were clustered at 30%,
50%, 70% and 90% sequence identity using Linclust.
As the sequence identity threshold increases, the number of singleton clusters, that
contain a single sequence, increases rapidly (Fig. 3.11). 230,666 ABH domain sequences
(21%) share less than 70% sequence identity with all other sequences and are singletons.
These singletons could represent novel functions, whose sequence diversity is not rep-
resented in gold standard databases, such as UniProt. Conventional wisdom states that
protein function is conserved to approximately 60% sequence identity. An analysis in
2002 found that < 30% of proteins with > 50% sequence identity have exactly the same
function, according to all four digits of the EC annotation being the same [8]. But hard
and fast sequence identity thresholds of functional conservation for enzymes are unwise
because catalysis and substrate-specificity are often determined by only a small number
of residues [398].
Figure 3.11: Number of singleton S90 clusters in MGnify and UniProt ABH domains.
1,065,976 ABH domain sequences from MGnify (508,693) and UniProt (557,283) were
clustered at 30%, 50%, 70% and 90% sequence identity using Linclust.
UniProt is biased towards prokaryotes, with a prokaryotic-to-eukaryotic sequence ratio of 3 : 1 (75%) (Fig. 3.12 UniProt all). For proteins containing
ABH domains, UniProt remains biased towards prokaryotes (64%, Fisher’s P ≈ 0), but is
less biased than all of UniProt (Fig. 3.12 UniProt ABH). Mixed clusters follow a Bernoulli
distribution that models the probability that a UniProt sequence is prokaryotic. The null
hypothesis for mixed clusters is B(0.64).
Figure 3.12: Assessing the similarity of MGnify ABH domains to UniProt ABH domains
from prokaryotes or eukaryotes. 1,065,976 ABH domain sequences from MGnify
(508,693) and UniProt (557,283) were clustered at 30%, 50%, 70% and 90% sequence identity using
Linclust. Prokaryotic-to-eukaryotic ratios are plotted at each sequence identity thresh-
old for UniProt sequences in mixed clusters. The prokaryotic-to-eukaryotic ratio for
all UniProt sequences (UniProt all) and ABH domains in UniProt (UniProt ABH) is also
plotted.
ABH domains from MGnify proteins cluster with ABH domains from UniProt, showing
that metagenomes contain at least some of the sequence diversity previously identified
in UniProt. The prokaryotic fraction of these mixed clusters increases at higher
sequence identities (Fig. 3.12) (Fisher’s P ≈ 0 for testing all sequence identity threshold
clusterings against UniProt all or UniProt ABH). There may be a number of causes for this
effect, including the high fraction of non-culturable species in metagenomes, how micro-
biome samples are prepared before sequencing and how the protein sequences were pre-
dicted in the microbiome assemblies. These points are discussed further in Section 3.4.3.
We next examined the taxonomy of UniProt sequences in mixed clusters at 30%, 50%,
70% and 90% sequence identity. Whilst the number of mixed clusters remains stable across
the sequence identity thresholds (data not shown), the prokaryotic fraction increases at
higher sequence identities (Fig. 3.12).
Many metagenomic ABHs are novel and rare, but those that are common in MGnify
are also found in UniProt (Fig. 3.13). Most ABHs from MGnify are rare and found in small
clusters with fewer than 10 sequences. Many of the rare ABHs are also functionally novel
because they are not represented in UniProt. Only 20% are in mixed clusters with UniProt
ABHs. In contrast, functions of common metagenomic ABHs are already represented
in UniProt. 85% of clusters with 10 or more sequences are mixed.
Figure 3.13: Analysis of rare and common MGnify ABH domains. 1,065,976 ABH domain
sequences from MGnify (508,693) and UniProt (557,283) were clustered at 70% sequence identity us-
ing Linclust. Clusters containing MGnify sequences were grouped by size into clusters
with < 10 sequences or ≥ 10 sequences. The number of clusters is plotted grouped
by whether the cluster contains ‘MGnify only’ ABHs, or ‘mixed MGnify and UniProt
ABHs’.
We next examined the inter-domain sequences that are not contained in MDAs (Section 3.2.7).
The distribution of inter-domain sequence lengths is shown in Fig. 3.14. The modal
value is a gap length of 0 residues, i.e. contiguous domains. Gap length probability de-
creases exponentially with a median of 3 residues and a mean of 22 residues. For compar-
ison, CATH S95 model lengths are also plotted, which follow a positively skewed normal
distribution, or log-normal distribution. The median length of these models is 145 residues,
yet 95.4% of the gaps in the MGnify protein sequences are less than 145 residues long.
Furthermore, 64.8% of the gaps are shorter than the shortest HMM, which is 16 residues
long.
Figure 3.14: Inter-domain sequence lengths. The distribution of inter-domain sequences
lengths that are not contained in MDAs is plotted. For comparison, the distribution
of Gene3D HMM lengths is shown. The x-axis is truncated at 500 residues. The prob-
ability density function of each distribution was estimated using kernel density esti-
mation.
We noticed that some ABH domain sequences found in the MGnify proteins were not
unique. Note that the MGnify protein data set that we used in this analysis consisted of
cluster representatives. For identical domain sequences to be in different S90 clusters, the
following is true: regions of sequence that flank identical ABH domains must be suffi-
ciently different to reduce the overall sequence identity below 90%. We investigated these
flanking regions to determine how similar the overall sequence is for proteins that contain
identical ABH domains. For sequences that contain identical ABH domains, we pairwise
aligned the full-length sequences using global sequence alignment with a constant gap
penalty. Most pairs of sequences that have identical ABH domains are very similar across
the entire sequence length (median distance = 0.011; Fig. 3.15). Moreover, 61 sequence
pairs have a distance of zero. Not only are the domain sequences identical, but the entire
sequences are completely identical. Finally, few sequence pairs have a distance > 0.1,
which is approximately equal to sequence identity < 90%.
Figure 3.15: Sequence alignment distances for pairs of MGnify proteins that contain iden-
tical ABH domain sequences. Full-length sequences were globally aligned with a
constant gap penalty. Alignment scores were converted to distances using
D_{S1,S2} = 1 − A_{S1,S2} / min(A_{S1,S1}, A_{S2,S2}), where A_{S1,S2} is the global
alignment score and D_{S1,S2} is the alignment distance between two sequences S1 and S2.
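As an illustration, the distance calculation could be sketched with Biopython's PairwiseAligner as below; the scoring parameters are placeholders, not the exact values used in this analysis:

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 1
aligner.mismatch_score = 0
aligner.open_gap_score = -1   # constant gap penalty: opening costs, extending is free
aligner.extend_gap_score = 0

def alignment_distance(s1, s2):
    """D = 1 - A(s1, s2) / min(A(s1, s1), A(s2, s2))."""
    a12 = aligner.score(s1, s2)
    return 1.0 - a12 / min(aligner.score(s1, s1), aligner.score(s2, s2))

print(alignment_distance("MKTAYIAK", "MKTAYIAK"))  # 0.0 for identical sequences
```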
Whilst errors in clustering are not ideal, only 3,676 unique domains in 9,379 sequences
are affected. We should not worry too much about this, as 9,379 sequences is less than 2%
of the 508,693 sequences that contain an ABH domain. But this conclusion is not
satisfactory to a researcher: why were these identical, or nearly identical, sequences
not clustered together? Put simply, Linclust trades off accuracy for speed.
The MMseqs2 issue tracker on GitHub (https://github.com/soedinglab/MMseqs2) has two
issues, #88 and #104, related to identical sequences not being clustered together. The
solution suggested by the developers is to increase the number of k-mers selected from
each sequence from 21 to 80. Doing so will, on average, increase the number of k-mers
that are shared between sequences and cluster centroid sequences, which will increase
the probability of identical sequences ending up in the same cluster. The runtime memory
requirements will quadruple from 400 GB to 1.6 TB. Given the current database size, this
would be feasible on EBI’s 1.9 TB ‘big memory’ machines. The sequence database need
only grow by 25% before this approach would no longer be tenable. To counter explod-
ing memory requirements, MMseqs2 provides an option to load chunks of sequences into
memory, at the expense of speed. We have not tested these approaches yet and debugging
this workflow is tedious because it takes ∼ 24 hours to run. But, in time, such solutions
will need to be explored as the database size increases.
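For reference, a sketch of the suggested fix; the flag names are taken from the MMseqs2 documentation as we understand it, and the database paths are placeholders:

```python
import subprocess

# Select more k-mers per sequence (default 21) so that identical sequences
# are more likely to share a cluster centre, and cap memory use by loading
# the database in chunks at the expense of speed.
subprocess.run([
    "mmseqs", "linclust", "seqDB", "cluDB", "tmp",
    "--kmer-per-seq", "80",
    "--split-memory-limit", "300G",
], check=True)
```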
We explored the distribution of ABH FunFam domains in the MGnify proteins and com-
pared it to the distribution in UniProt. CATH v4.2 has 377 ABH FunFams. ABH do-
mains from the MGnify proteins match 148 of the 377 ABH FunFams (39%; Fig. 3.16)
using the per-FunFam inclusion thresholds. The distribution of domains across these 148
FunFams differs from that of ABH domains in UniProt. For example, the largest FunFam in UniProt
(3.40.50.1820/FF/115309) has 53,144 members (16% of ABH domains in the 148 FunFams),
but is 70th largest in MGnify with only 34 members (∼ 0%). This FunFam is associated
with one molecular function: ‘Hydrolase activity, acting on ester bonds’ (GO:0016788).
Conversely, the largest FunFam in MGnify (3.40.50.1820/FF/115552) has 131,349 mem-
bers (37%) and is the fourth largest in CATH with 20,873 members (6%). This Fun-
Fam is also associated with GO:0016788, as well as many other molecular function and
biological process annotations. These include molecular functions ‘Chlorophyllase activ-
ity’ (GO:0047746), ‘Pheophytinase activity’ (GO:0080124) and ‘Bromide peroxidase activ-
ity’ (GO:0019806); and biological processes ‘Chlorophyll metabolic process’ (GO:0015994),
‘Response to toxic substance’ (GO:0009636) and ‘Aromatic compound catabolic process’
(GO:0019439).
Figure 3.16: Number of ABH FunFam domains in UniProt and MGnify, with FunFams ranked by size in UniProt. PETase FunFams are indicated.
Table 3.1: FunFams generated by FRAN, FRANgeometric and GARDENER. FunFams were gen-
erated for the α/β hydrolase family, CATH superfamily 3.40.50.1820. The number of start-
ing clusters, FunFams, FunFams with EC terms and the Rand index of FunFam agreement
with GARDENER are reported.
We confirmed that FRAN and FRANgeometric partition starting clusters into similar Fun-
Fams by constructing graphs of FunFams and calculating graph theoretic measures on
them (Fig. 3.17). FRAN, FRANgeometric and GARDENER have as many maximal cliques as
connected components because FunFams are hard clusters. FunFam clustering by FRAN
and FRANgeometric have very similar global agreement, as shown by the graph union of
GARDENER and FRAN (G ∪ F) having approximately the same number of connected com-
ponents and maximal cliques as the graph union of GARDENER and FRANgeometric (G ∪
Fg). Both graph unions are below the y = x line, which shows that they have fewer
connected components than maximal cliques. This means that FRAN and FRANgeometric
generate different FunFams to GARDENER, but that these FunFams form larger connected
components, comprised of multiple maximal cliques, in the graph union because some
starting clusters are shared between the FunFams. This is confirmed by the graph union
of FRAN and FRANgeometric (F ∪ Fg), which has fewer connected components than maxi-
mal cliques, showing that many of the FunFams generated by these two methods intersect.
Finally, the graph union of all three methods (G ∪ F ∪ Fg) has slightly fewer cliques than G
∪ F, or G ∪ Fg, but far fewer connected components. More FunFams are being merged into
the same connected components in (G ∪ F ∪ Fg) because FRAN and FRANgeometric partition
the starting clusters into different FunFams. However, these FunFams intersect with other
FunFams generated by the other method and by GARDENER. Whilst FunFams generated
by FRAN and FRANgeometric do not agree perfectly with GARDENER, the FunFams are
similar, as shown by the large number of FunFams that have intersecting starting clusters.
Overall, the FRAN and FRANgeometric algorithms produce sufficiently similar FunFams to
be taken forward for further analysis.
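A minimal sketch of this benchmark using networkx, assuming each FunFam is represented by the set of starting clusters it contains (a simplification of our actual protocol); two FunFams are joined if they share at least one starting cluster:

```python
import networkx as nx

def funfam_graph(funfams):
    """funfams: dict mapping FunFam ID -> set of starting cluster IDs."""
    G = nx.Graph()
    G.add_nodes_from(funfams)
    ids = list(funfams)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if funfams[a] & funfams[b]:  # shared starting cluster
                G.add_edge(a, b)
    return G

G = funfam_graph({"F1": {1, 2}, "F2": {2, 3}, "F3": {4}})
n_components = nx.number_connected_components(G)  # 2
n_cliques = sum(1 for _ in nx.find_cliques(G))    # 2; equal counts lie on y = x
```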
The EC purity of FunFams generated by FRAN and FRANgeometric were similar to
GARDENER (Fig. 3.18). There was no significant difference between the EC purity dis-
tribution for FunFams produced by GARDENER and FRAN (two-sided Mann-Whitney
P = 0.72), or GARDENER and FRANgeometric (two-sided Mann-Whitney P = 0.85).
Therefore, the functional purity of FunFams generated by FRAN and FRANgeometric were
comparable to FunFams generated by GARDENER.
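This comparison can be reproduced with scipy; the purity values below are illustrative placeholders:

```python
from scipy.stats import mannwhitneyu

# Hypothetical EC purity distributions for two sets of FunFams.
purity_gardener = [1.0, 0.9, 0.8, 1.0, 0.95, 0.9]
purity_fran = [1.0, 0.85, 0.9, 1.0, 0.9, 0.95]

u, p = mannwhitneyu(purity_gardener, purity_fran, alternative="two-sided")
print(f"two-sided Mann-Whitney P = {p:.2f}")
```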
As FRAN and FRANgeometric produce very similar FunFams, but FRANgeometric is more
computationally expensive, FRAN will be taken forward to generate FunFams
of large CATH superfamilies.
3.4 Discussion
3.4.1 A large number of α/β hydrolase domains were found in metagenomes
Whilst MGnify contains many analyses of metagenomic sequencing studies (315,181 as
of April 14, 2020), only a small subset (6%) of studies have so far been assembled (18,291
as of April 14, 2020). The number of protein sequences identified in each biome proba-
bly does not reflect the sequence and functional diversity of the biome (Fig. 3.5). Rather,
the number of sequences reflects the degree to which each biome has been sampled, which
samples have been assembled, or which assemblies have had ORFs predicted. For example,
Figure 3.17: FunFam graph theoretic benchmarks for FRAN (F), FRANgeometric (Fg) and
GARDENER (G). Graphs were constructed for FunFams that are composed of > 1
starting cluster. Perfect FunFam clustering is shown as a dotted line at y = x. ∪
denotes the graph union operation.
there have been vast metagenomics projects of marine [301] and human gut [299, 300] mi-
crobiomes that will have captured the sequence diversity in these biomes. Comparatively,
plants, soil and other animal biomes have been neglected.
The modified protocol that we used to find superfamily domains in large sequence
data sets (Section 3.2.4.2) saved vast computational resources. Scanning 10⁶ MGnify
protein sequences against the 416 ABH Gene3D HMMs took approximately 30 minutes on
four cores—that is, 3.6 CPU weeks in total. Scanning 10⁴ sequences against all Gene3D
HMMs took approximately 60 minutes on four cores—that is, 3.9 CPU weeks for all
1,630,166 sequences that had a significant ABH hit.
Figure 3.18: FunFam EC purity benchmarks for FRAN, FRANgeometric and GARDENER.
Using these timings, we can estimate that searching all of the MGnify proteins against all Gene3D HMMs would take 14
CPU years, whilst searching the redundant database of 1.1 billion sequences would take
50 CPU years. Additionally, by implementing this pipeline in Nextflow, we could conve-
niently split up the sequence data sets into manageable chunks that could be processed in
an embarrassingly parallel way, by many independent jobs, each with low memory and
CPU requirements.
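As a worked example, the extrapolation can be sketched as below; the MGnify total is inferred from the 3.6 CPU week figure above (roughly 3 × 10⁸ S95 representatives), so it is an approximation rather than the exact count:

```python
# 10^4 sequences against all Gene3D HMMs took 60 minutes on four cores,
# i.e. 4 CPU hours per 10^4 sequences.
cpu_hours_per_10k = 4

for name, n in [("MGnify S95 representatives", 300_000_000),
                ("redundant database", 1_100_000_000)]:
    cpu_years = n / 10_000 * cpu_hours_per_10k / (24 * 365)
    print(f"{name}: ~{cpu_years:.0f} CPU years")  # ~14 and ~50
```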
We identified a large number of diverse ABH domain sequences from full-length and
truncated ORFs (Fig. 3.6). It is reasonable to assume that sequences truncated at one ter-
minus would, on average, be longer than sequences truncated at both termini. However,
the median length of sequences truncated at either terminus is 136 residues, whereas se-
quences truncated at both termini have median length 164 residues. This could have been
caused by the Prodigal algorithm [339]. Prodigal looks for ribosome binding sequence mo-
tifs, such as the Shine-Dalgarno sequence, and an in-frame stop codon to predict proteins
from ORFs. To reduce the number of false positive ORFs, Prodigal penalises sequences
shorter than 250 bp by a linear factor that is proportional to their length [339]. These
features are strong, so short sequences that are truncated at one end can still end up with
favourable scores. However, sequences that are truncated at both ends must be long to
have favourable scores.
ABH domains identified in the MGnify proteins follow the same length distribution as
known ABH domains from UniProt (Fig. 3.7). This finding attests to the quality of MG-
nify data. Firstly, it suggests that the MGnify assemblies represent real contigs found in
microbiomes. Secondly, and propter hoc, the predicted ORFs and protein sequences are real.
It is encouraging that ABH domains were found in engineered biomes more fre-
quently than the background rate for MGnify proteins (Fig. 3.9). This finding may rep-
resent an increased potential to find proteins with biotechnology applications similar to
PETase, which was found in an engineered environment [374]. Conversely, ABH domains
were depleted in the human digestive system (Fig. 3.9). Whilst α/β hydrolase proteins
are present in the digestive system to degrade proteins and lipids, the diversity of these
proteins may not be that high. Considering that the MGnify proteins are cluster represen-
tatives, the importance (concentration) of these enzymes may be obscured by the lack of
sequence diversity.
To understand their diversity and novelty, ABH domains from MGnify proteins were clus-
tered with ABH domains from UniProt proteins. A large number of clusters (Fig. 3.10)
and singleton clusters (Fig. 3.11) result at high sequence identity. This demonstrates that
the ABH domain fold can accommodate a high diversity of sequences, whilst leaving the
structure intact. Evidently, this robustness has been exploited by evolution to produce a
wide variety of ABH domain functions.
Sequences clustered together, as shown by the number of clusters always being much
lower than the number of sequences. The MGnify proteins are S90 cluster representatives.
So this result suggests that some of the MGnify ABH domain sequences are clustering with
those from UniProt. It should be noted that the domain sequences from UniProt have not
been filtered to remove redundant sequences, however, it is unlikely that only the UniProt
sequences clustered together. If MGnify and UniProt ABH domains clustered together,
this would further increase our confidence in the quality of the assemblies and ORFs in
MGnify.
It is remarkable that 21% of ABH domain sequences are singletons at S70. This means
that, whilst the overall chemistry of the catalysis may be the same, these sequences are
likely to have different (specific) functions, or be regulated differently. Additionally, the
large number of singletons at S70 suggests that biomes have not been sampled exhaus-
tively. Many more functions are out there waiting to be discovered. As more biomes are
studied and more samples are collected, I expect the number of clusters and singletons to
continue to increase.
Given both of these findings, it appears that clustering at S70 or S90 is too stringent
for the degree of sequence diversity in a superfamily and clustering at S60 is more likely
to obtain meaningful clusters. Clustering at this level makes sense because protein func-
tion is conserved to approximately 60% sequence identity [8, 398]. Whilst this sequence
identity threshold may be true for ABH domains, it may not be true for the other CATH
superfamilies.
UniProt is biased towards prokaryotes, but this bias is less so in proteins that contain ABH
domains (Fig. 3.12). We used the taxonomic bias for proteins containing ABH domains
to define a null distribution for mixed clusters. Other biases may affect the taxonomic
distribution of proteins in MGnify, for example which biomes were sampled, how high
the coverage of sampling was, how samples were prepared prior to sequencing, which
samples were assembled, and choice of gene callers. Despite these biases, we can arguably
be more confident about mixed, rather than single origin, clusters because these sequences
are present (independently) in both MGnify and UniProt.
Firstly, metagenomes contain a high fraction of non-culturable species [295]. Many of these species are likely to be eukaryotic [297], whether small or single-
celled. It is reasonable to assume that ABH domains have evolved in these species to allow
them to occupy particular niches. Therefore, it is likely that these domain sequences will
not be present in UniProt. Secondly, metagenomic samples are size fractionated to remove
larger objects in samples, so eukaryotes are more likely to be removed and prokaryotes
are more likely to be retained. Thirdly, MGnify only uses prokaryotic gene callers to
predict ORFs in assemblies (Section 3.2.1.1). Prokaryotic gene callers may have a high
false negative rate on eukaryotic contigs. Eukaryotic gene callers, such as GeneMark-
EP [399], MetaEuk [400], or EuGene [401], could be applied in parallel with the current
prokaryotic gene callers. These three factors will contribute to eukaryotic proteins being
underrepresented in the MGnify proteins. This analysis should be repeated when MGnify
incorporates eukaryotic gene calling into its protein prediction pipeline.
Overall, presence of mixed clusters shows that ABH domains in metagenomes are not
distinct from previously known sequences in UniProt. Rather, we see sequence conserva-
tion alongside evolution of new functions. New functions are generated by exploration of
previously unexplored regions of sequence space to enable organisms to survive and oc-
cupy different niches. Retention of function by an organism is a fitness cost: if functions
are not required, they will be lost. Conservation of previously-known functions demon-
strates that these functions are required by species to survive in particular biomes.
We assessed whether novel domains had evolved in metagenomes (Fig. 3.14), but found lit-
tle evidence for it. Inter-domain lengths were short, so the vast majority of metagenomic
protein sequence is covered by a significant Gene3D hit to a CATH superfamily. There-
fore, it appears that novel domains have not evolved in metagenomes. There are, however,
some caveats to this conclusion. Although we have little evidence of novel domain evo-
lution in proteins containing ABH domains, we cannot extrapolate these conclusions to
metagenomic protein sequences in general. Similar analyses on different superfamilies,
subsets of superfamilies, or all superfamilies would be required before drawing general
conclusions about metagenomes. Further caution should be exercised because 91% of
sequences containing ABH domains in Pfam are single-domain proteins (73,297 out of
80,360 sequences in Pfam family PF00561), which only have two terminal regions and no
inter-domain regions.
As FunFams are functionally pure, functions of proteins can be predicted by mapping do-
mains to FunFams. The ABH domain from PETase was mapped to three FunFams involved
in the hydrolysis of large organic biomolecules (Fig. 3.16). PET is a polymer of ethylene
terephthalate, a monomer composed of an aromatic ring with carboxy groups at the 1 and
4 ring positions. A carboxy group at the 1 position of one monomer reacts with a second
monomer’s ethanol moiety attached to the single-bonded oxygen of the carboxy group at
the 4 position. PET resembles the substrates of proteins that map to the same FunFams.
Lipids are carbon polymers. Many plastics, including PET, are polyesters, i.e. polymers
formed by esterification between carboxylic acids and alcohols of monomers. Chloro-
phyll, pheophytin and steroid hormones are composed of many aromatic groups. These
findings naturally give rise to a hypothesis for the origins of PETase. A hydrolase, whose
cognate ligand is some type of large organic biomolecule, evolved to degrade PET under
massive selection pressures.
In addition, MGnify ABH domains were mapped to FunFams (Fig. 3.16). One possi-
ble reason for finding so few matches in MGnify to the largest ABH FunFam in UniProt
(3.40.50.1820/FF/115309) could be that the FunFam is large and so the sequence alignment
might be poor. 16% of UniProt ABH domains are in this family. Thus, the resulting HMM
would not be very informative, so few sequences would have high-scoring matches. Low-
scoring matches would be trumped by higher-scoring matches when the MDAs were re-
solved. 3.40.50.1820/FF/115552 contains 37% of MGnify ABH domains. Compared with
UniProt, this family is significantly expanded in metagenomes, which suggests that this
FunFam increases the fitness of species living in diverse microbiomes. This family is as-
sociated with hydrolysis of ester bonds and large biomolecules. This might mean that
hydrolases have evolved to metabolise other man-made materials, such as other
types of plastics. These materials are certainly in the environment, so there is a selection
pressure for organisms to make use of these energy sources. If so, it is likely that the
ABH domain will evolve to degrade these materials because of its incredible functional
plasticity.
Presently, we are facing challenges generating FunFams of large superfamilies using GAR-
DENER because all starting clusters are required to be kept in memory as the GeMMA
tree is grown. Our current workaround is to run large superfamilies on a high-memory
machine with 3 TB memory, but we are already approaching the memory limit of these
machines. Memory is expensive and 3 TB is already a lot of memory, so we need to develop
a low-memory strategy that will allow FunFams to be scaled to extremely large superfam-
ilies in the future. Whole-genome sequencing and metagenomics are growing rapidly in
popularity. As such, a scalable protocol would allow
FunFams to be generated using protein sequences from metagenomes and from UniProt.
Also, we currently only generate FunFams from S90 clusters that have at least one exper-
imental GO term annotation. A scalable protocol will allow us to remove this restriction.
The current restriction means that every FunFam is associated with a known func-
tion, which is useful for function prediction. However, this means that FunFams cannot be
generated for proteins with novel functions. If a FunFam has no annotated functions, pu-
tative functions could be predicted by transferring any functions from the nearest k neigh-
bouring FunFams, within some E-value radius r. Neighbouring FunFams can be found by
pairwise HMM alignment to all other FunFams in the same superfamily. These putative
functions will be useful to experimentalists to guide their choice of proteins to validate,
which will be particularly important when validating proteins with novel functions from
metagenomes. Alternatively, FunFams can be annotated post hoc, whenever one of the
members has been experimentally characterised. We hypothesise that the number of Fun-
Fams will increase dramatically to reflect the increase in sequence and functional diversity
present in these additional sequences.
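A sketch of the proposed annotation-transfer scheme; all names are hypothetical, and the neighbour list is assumed to come from pairwise HMM-HMM alignments of FunFams within the same superfamily:

```python
def transfer_annotations(neighbours, annotations, k=3, r=1e-3):
    """neighbours: list of (funfam_id, e_value) pairs for one unannotated
    FunFam; annotations: dict mapping FunFam ID -> set of GO terms.
    Transfers functions from the k nearest FunFams within E-value radius r."""
    within_radius = sorted((e, f) for f, e in neighbours if e <= r)
    transferred = set()
    for _, funfam in within_radius[:k]:
        transferred |= annotations.get(funfam, set())
    return transferred

neighbours = [("FF/1", 1e-6), ("FF/2", 5e-4), ("FF/3", 0.2)]
annotations = {"FF/1": {"GO:0016788"}, "FF/2": {"GO:0047746"}}
print(transfer_annotations(neighbours, annotations))  # FF/3 is outside r
```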
3.4.7 Conclusion
This work laid the foundations for two exciting new avenues of research. First, we con-
ducted a proof-of-concept study into mining very large protein sequence data sets for novel
functions. We used our tools—CATH, Gene3D and FunFams—to search metagenomes for
plastic-degrading enzymes similar to PETase. Sequence databases continue to grow
in size, so following on from this, we investigated approaches that would allow FunFams
to be generated on an arbitrarily large number of sequences. We decomposed the Fun-
Fam generation algorithm, GARDENER, using a divide-and-conquer random sampling
approach, FRAN. This approach will allow FunFams to be generated on ever-growing se-
quence databases, including metagenomes, future-proofing FunFams for many years to
come.
Chapter 4
Feature learning from graphs
4.1 Introduction
In this chapter, we explore a state-of-the-art protein function prediction method, deep
network fusion (deepNF), that uses protein networks as its sole training data.
Graphs are ubiquitous data structures that are suited to modelling problems that are
high-dimensional and sparse. As such, graphs are an ideal data structure to represent
the myriad protein interactions that give rise to biological life. However, due to many
biological and experimental factors, networks are models that try to capture as many as
possible of the true interactions that proteins make, but are often incomplete.
Given some protein network, one may want to perform link prediction to predict
additional interactions that proteins might make, but are missing from the network. In
the past decade, machine learning has become the de facto framework to model predic-
tion problems. Due to memory constraints, it may not be practically feasible to represent
graphs as dense matrices of features, as is required for many types of machine learning
algorithms. For example, storing a dense adjacency matrix of a 10⁶ node graph in 64-bit
precision requires 8 TB of memory. Furthermore, even if dense graph matrices can be
stored, machine learning algorithms may not be able to learn from them because of high
sparsity, the curse of dimensionality and the ‘many features; few examples’ problem.
Classically, hand-engineered features would be calculated from graphs for use in ma-
chine learning. For example, in link prediction, one may want to encode the strength of
interactions between pairs of nodes. Alternatively, in node classification, one may want to
encode information about the local and global context of nodes. Encoding this information
by hand is laborious and may not capture the most informative properties of the graph.
4.1.3 Encoder-decoders
Encoder-decoders are a general framework used by embedding methods [92]. In encoder-
decoders, the encoder first maps nodes to low-dimensional embeddings, followed by re-
construction of the original data by the decoder, using only the embeddings. Thus,
if the original data can be reconstructed by the encoder-decoder model, then the embed-
dings must contain all salient information in the graph. As such, the embeddings can be
used for machine learning.
The encoder and decoder functions are learnt in an unsupervised way from the data
using an optimisation process. In order to do this, a loss function must be defined to mea-
sure the difference between the original data and its reconstruction from the embeddings.
The encoder and decoder functions are then optimised to minimise the reconstruction loss,
and, concomitantly, the embeddings are improved.
4.1.4 Contributions
Functional association data are powerful predictors of protein function. Here, we perform
feature learning from protein networks using multimodal deep autoencoders (MDAEs) to
embed proteins into a latent space, according to their context across multiple networks.
Using these embeddings, we train supervised machine learning models to predict pro-
tein function in budding and fission yeast. We began by replicating the published per-
formance of deepNF [66] at predicting S. cerevisiae protein function. Following this, we
improved upon deepNF in three ways. Firstly, we showed that smaller MDAE architec-
tures, and secondly, smaller embedding dimensions, achieve comparable performance to
deepNF. Thirdly, we found that protein functions can be predicted using structured learn-
ing with the same performance as predicting each function using a separate classifier. This
not only reduced training and prediction time, but also allowed non-linear correlations to
be learnt between features and labels. We then applied this improved model to predict S.
pombe protein function using structured learning. Finally, we attempted to improve the
predicted protein functions by learning features from a larger set of orthogonal types of
protein interactions. We take this approach forward to Chapter 5, where we predict S.
pombe protein function in combination with phenotypic and protein evolution data.
4.2 Methods
4.2.1 Protein functional association data
Protein networks were prepared as in deepNF [66]. Interactions from the Search Tool for
the Retrieval of Interacting Genes/Proteins (STRING) database [405] were used. STRING
has seven interaction types:
• ‘neighborhood’,
• ‘fusion’,
• ‘cooccurence’,
• ‘coexpression’,
• ‘experimental’,
• ‘database’, and
• ‘textmining’ (not used in this study).
Briefly, adjacency matrices were generated by the following protocol for each of the
six interaction types:
1. Protein IDs were sorted alphabetically and converted to numerical node IDs.
2. Edge weights were normalised to the interval [0, 1].
3. Each adjacency matrix was scaled so that rows sum to unity.
4. Random walks with restart was applied to each adjacency matrix A. P(t) is a matrix
whose rows are the probabilities of a random walk from the ith protein reaching
the jth protein after t steps. α = 0.98 is the restart probability. Random walks with
restart was run for t = 3 time steps (see the sketch after this list).
5. Positive pointwise mutual information was applied to P to remove low probability
edges.
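A sketch of steps 4 and 5; the exact random-walk update rule is an assumption, following the deepNF reference implementation (P(t) = αP(t−1)A + (1 − α)P(0), with P(0) = I):

```python
import numpy as np

def rwr(A, t=3, alpha=0.98):
    """Random walk with restart on a row-normalised adjacency matrix A."""
    P0 = np.eye(A.shape[0])  # walks start from each protein in turn
    P = P0.copy()
    for _ in range(t):
        P = alpha * (P @ A) + (1.0 - alpha) * P0
    return P

def ppmi(P):
    """Positive pointwise mutual information; low-probability edges fall
    below zero PMI and are removed by the final clamp."""
    total = P.sum()
    row = P.sum(axis=1, keepdims=True)
    col = P.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((P * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)
```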
STRING v9.1 [406] was used to generate Saccharomyces cerevisiae (S. cerevisiae) ad-
jacency matrices. Despite STRING v10.5 [234] being available when this work was per-
formed, deepNF used STRING v9.1 in [66], so we also used v9.1 so that we could com-
pare the performance of our models to deepNF. STRING v10.5 [234] was used to generate
Schizosaccharomyces pombe (S. pombe) matrices. A number of additional S. pombe networks were also prepared (Section 4.3.5).
As neither deepNF nor Mashup benchmarked S. pombe, rather than use MIPS annota-
tions, we chose to use Gene Ontology (GO) annotations because the data is more compre-
hensive and up-to-date. GO terms were divided into three levels per ontology (Table 4.2)
using the same criteria as S. cerevisiae MIPS annotations. Annotations with experimental
(EXP, IDA, IPI, IMP, IGI, IEP, HDA and HMP) and curated (IC and TAS) evidence codes
were used.
Table 4.2: S. pombe GO term annotations for each of the biological process (P), molecular
function (F) and cellular component (C) ontologies.
MDAE architectures are denoted by the sizes of their encoding layers, where the decoding portion is always the mirror image of the encoding portion.
In words, the 2000-600 MDAE takes g adjacency matrices with shape [n × n] as
input, processes each matrix with a separate 2000 neuron layer and passes the outputs
of these g layers to a single 600 neuron encoding layer. For each of the n proteins, this
layer outputs a 600D vector, which is the embedding of each protein in a 600D space. Embeddings
are then passed to g 2000 neuron layers, whose outputs are used to reconstruct the g
adjacency matrices. deepNF uses g = 6 adjacency matrices from six types of interaction
from STRING (Section 4.2.1).
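As an illustration, a minimal Keras sketch of the 2000-600 architecture, assuming ReLU hidden activations and sigmoid output activations (as specified in Section 4.2.3.2); the number of proteins n is a placeholder:

```python
from keras.layers import Dense, Input, concatenate
from keras.models import Model

n, g = 6400, 6  # illustrative numbers of proteins and networks

# One input and one 2000-neuron layer per network.
inputs = [Input(shape=(n,)) for _ in range(g)]
hidden = [Dense(2000, activation="relu")(x) for x in inputs]

# The g hidden outputs are fused into a single 600D embedding layer.
embedding = Dense(600, activation="relu")(concatenate(hidden))

# The decoder mirrors the encoder: g 2000-neuron layers reconstruct the
# g adjacency matrices via sigmoid outputs.
outputs = [Dense(n, activation="sigmoid")(Dense(2000, activation="relu")(embedding))
           for _ in range(g)]

mdae = Model(inputs=inputs, outputs=outputs)
mdae.compile(optimizer="adam", loss="binary_crossentropy")
```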
4.2.3.2 Autoencoders
Autoencoders are a type of neural network that consists of an encoder and a decoder. The
network tries to reconstruct the input as best it can. Reconstructions are rarely perfect,
but that is not why autoencoders are used. Instead, they are used to learn a set of latent
features from the input data. To do this, constraints are imposed on the encoder part of
the network that prevent a simple identity function from being learnt. In this work, we
impose a size constraint on the encoder, where inputs are embedded into a low-dimensional
latent space. As such, the reconstructions will be lossy, but the embedding will be forced
to be informative.
Sigmoid activations were always used on the output layer. Autoencoders were trained
using data from 90% of proteins. The remaining 10% of proteins were used as a validation
set to monitor training. Models were trained using binary crossentropy loss. Batch sizes
of 128 examples were used. Unless specified otherwise:
• Rectified linear unit activation functions were used on hidden layers.
• The Adam optimiser [90] was used.
• Models were trained for 500 epochs, where the weights from the epoch with the
lowest validation loss were used to generate the embeddings.
Models were implemented in Python v3.6 [410] using Keras v2.1.5 [411] (TensorFlow v1.8.0
[412] backend).
Hyperparameters were estimated using an exhaustive grid search of C = {1, 5, 10, 50, 100} and
γ = {0.001, 0.005, 0.01, 0.05, 0.1}, selecting values that maximised mAUPR. The soft-
margin penalty C is common to all SVMs and controls the penalty applied to errors, with
large values producing decision boundaries with a small margin between the two classes.
The radial basis function kernel parameter γ controls the influence of the data points,
with high values increasing the locality of influence that the support vectors have on ker-
nel values. Models were evaluated using 10 independent trials of 5-fold cross-validation
with 80% of the data as a training set and 20% of the data as a test set. Models were
implemented in Python 3.6 [410] using scikit-learn 0.19.1.
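A sketch of this grid search with scikit-learn, using average precision as a stand-in for AUPR; the data are random placeholders, and in the one-vs-rest strategy one such search is run per term:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data: 256D embeddings and binary labels for one term.
rng = np.random.RandomState(0)
X, y = rng.rand(100, 256), rng.randint(0, 2, 100)

param_grid = {"C": [1, 5, 10, 50, 100],
              "gamma": [0.001, 0.005, 0.01, 0.05, 0.1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      scoring="average_precision", cv=5)
search.fit(X, y)
print(search.best_params_)
```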
4.3 Results
Each experiment performed in this chapter had a common structure:
1. Use an MDAE for unsupervised protein feature learning from multiple protein net-
works.
2. Generate a small number of informative features for each protein.
3. Train a classifier to predict protein function using these features.
The reconstruction loss on the training data and a validation set shows that the MDAE is not fully trained because
the loss has not levelled out by the 10th epoch (Fig. 4.1). Training for more epochs, until
the validation loss is minimised, may improve the performance.
Figure 4.1: MDAE reconstruction loss when replicating deepNF results. Binary crossentropy
loss on the validation set (solid line) and on the training set (dotted line).
The MDAE is not overfitting to the training data, so should be generalisable. When
overfitting occurs, the validation loss increases, whilst the training loss continues to de-
crease.
When training for more epochs, overfitting should be controlled by early stopping or
by saving the weights from the epoch with the lowest validation loss. Both strategies have
their merits: training with early stopping is faster, but it is not possible to know whether
the validation loss may begin to decrease again at a later epoch. Saving the weights from
the best epoch may overcome the problem of not knowing how long to wait before early
stopping, however, the model may need to be trained for a long time to be confident.
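In Keras, both strategies can be combined; a minimal sketch with a stand-in autoencoder (the model, data and patience value are placeholders):

```python
import numpy as np
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Input
from keras.models import Model

# Tiny stand-in autoencoder; in practice this would be the MDAE.
x = Input(shape=(100,))
out = Dense(100, activation="sigmoid")(Dense(16, activation="relu")(x))
model = Model(inputs=x, outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")

data = np.random.rand(1000, 100)

# Save the weights from the epoch with the lowest validation loss, and stop
# early if the validation loss has not improved for 50 epochs.
callbacks = [
    ModelCheckpoint("best.h5", monitor="val_loss",
                    save_best_only=True, save_weights_only=True),
    EarlyStopping(monitor="val_loss", patience=50),
]
model.fit(data, data, epochs=500, batch_size=128,
          validation_split=0.1, callbacks=callbacks)
```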
Proteins were embedded into a 600D space by the MDAE, using weights from the
10th epoch. Embeddings were used as features to train SVMs to predict MIPS terms. The
one-vs-rest multiclass strategy was used, where classes are treated separately and one
classifier is trained per class. We successfully replicated results of deepNF published in
[66] for mAUPR, MAUPR, accuracy and f1 score.
Figure 4.2: Protein function prediction performance when replicating deepNF results for
S. cerevisiae. MIPS terms were divided into three levels according to the number of
proteins that are annotated with each term. Bars are the mean performance across 10
independent trials of 5-fold cross-validation and error bars are the standard deviation.
We chose architectures according to two rules of thumb:
• Due to the layout of memory on the GPU, neural networks train most efficiently when
using a power of 2 number of neurons in each layer (e.g. 128, 256, 512 and 1024).
• The output of a preceding layer should always be used as input to a layer of the same,
or smaller, size (e.g. 512 → 512 → 128). In autoencoders, this rule should only be
applied to the encoding part of the network; decoders are the mirror image of the
encoder.
We trained nine different MDAE architectures that embed proteins into a 128D or
256D space. 128D embeddings were generated by 256-128, 512-128, 1024-128, 256-256-
128, 512-256-128 and 512-512-128 MDAE architectures. 256D embeddings were gener-
ated by 256-256, 512-256 and 512-512-256 MDAE architectures. The 256-256 MDAE had
the lowest validation loss of 0.162 after 161 epochs (Fig. 4.3). Embeddings from this MDAE
were taken forward for further experiments.
As smaller models have fewer parameters to train, they are less likely to be underfit on small training data sets, and they can
be trained much faster. Finally, SVM training time will be faster on smaller embeddings.
Figure 4.4: S. cerevisiae protein function prediction performance on 256D embeddings.
SVMs were trained to predict MIPS terms in the one-vs-rest strategy. MIPS terms were
divided into three levels according to the number of proteins that are annotated with
each term. Results for level 1 terms are plotted (light grey bars). deepNF results from
Fig. 4.2 are shown for comparison (dark grey bars). Bars are the mean performance
across 10 independent trials of 5-fold cross-validation and error bars are the standard
deviation.
Structured learning also allows non-linear correlations between features and labels to be
learnt. Output vectors can be converted to probabilities by applying the softmax function
S(x_i) = e^{x_i} / Σ_j e^{x_j}.
After much experimentation, we settled on an MLP with two hidden layers of 512
and 256 neurons, a dropout probability of 0.5 and a batch size of 128. The structured pre-
diction performance of the MLP is comparable with deepNF’s one-vs-rest SVM predictions
(Fig. 4.5). Whilst the MLP has slightly lower performance according to mAUPR, MAUPR
and accuracy, their f1 performance is higher. On balance, we believe that these moderate
reductions in performance are worthwhile, due to the concomitant benefits of structured
prediction and the orders of magnitude faster training time compared to one-vs-rest SVMs.
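A sketch of this MLP in Keras; the number of terms is a placeholder, and the output activation and loss are assumptions (sigmoid with binary cross-entropy for the multi-label setting, with the softmax above applied where mutually exclusive probabilities are required):

```python
from keras.layers import Dense, Dropout, Input
from keras.models import Model

n_terms = 100  # placeholder number of terms predicted simultaneously

# 512-256 MLP with dropout probability 0.5, taking 256D embeddings as input.
x = Input(shape=(256,))
h = Dropout(0.5)(Dense(512, activation="relu")(x))
h = Dropout(0.5)(Dense(256, activation="relu")(h))
y = Dense(n_terms, activation="sigmoid")(h)

mlp = Model(inputs=x, outputs=y)
mlp.compile(optimizer="adam", loss="binary_crossentropy")
```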
Figure 4.5: S. cerevisiae protein function prediction performance using structured learning
on 256D embeddings. MLPs were trained to predict all MIPS terms in a level simulta-
neously. MIPS terms were divided into three levels according to the number of proteins
that are annotated with each term. Results for level 1 terms are plotted (light grey bars).
deepNF results from Fig. 4.2 are shown for comparison (dark grey bars). Bars are the
mean performance across 10 independent trials of 5-fold cross-validation and error bars
are the standard deviation.
With a performant protein function prediction model in hand, we next focussed our at-
tention on predicting S. pombe protein functions. First, we generated embeddings for the
5100 S. pombe proteins contained in STRING (v10.5), using the same MDAE architectures
as S. cerevisiae. Similar to S. cerevisiae, we found that the 256-256 MDAE was the best
architecture, achieving a validation loss of 0.221 after 127 epochs. Validation losses were
much higher for S. pombe than for S. cerevisiae, which may reflect the higher sparsity of S.
pombe data, compared with S. cerevisiae, which has been studied extensively.
Using the 256D embeddings, we predicted GO term annotations from each ontology
using a 512-256 MLP and structured learning (Fig. 4.6). We achieved good performance
for predicting functions of proteins from level 1 for all ontologies. Level 2 terms were also
predicted well according to mAUPR, MAUPR and f1, but not according to accuracy. In this
case, the subset accuracy is a harsh metric for multiclass prediction because the vector of
predictions must exactly match the vector of labels. Biological process and molecular func-
tion terms were predicted less well than cellular component terms in level 3. Overall, the
performance of predicting terms from the cellular component ontology is more consistent
across the three levels.
Figure 4.6: S. pombe protein function prediction performance using structured learning
on 256D embeddings. MLPs were trained to predict all GO terms in a level simultane-
ously. GO terms were divided into three levels according to the number of proteins that
are annotated with each term. Bars are the mean performance across 10 independent
trials of 5-fold cross-validation and error bars are the standard deviation.
Splitting terms into three levels according to how many proteins they are annotated
to seems quite arbitrary. Instead, we trained models to predict all terms from an ontology
simultaneously using structured learning (Fig. 4.7). Generally speaking, we are able to
predict all terms in an ontology with approximately the same performance as predicting
level 3 terms in Fig. 4.6. Whilst these trends are true for mAUPR, MAUPR and f1, it is not
true for accuracy, due to a subset accuracy of ∼ 0.25 for each ontology.
Figure 4.7: S. pombe protein function prediction performance using structured learning
on 256D embeddings. MLPs were trained to predict all GO terms in an ontology si-
multaneously. Bars are the mean performance across 10 independent trials of 5-fold
cross-validation and error bars are the standard deviation.
4.3.5 Including orthogonal protein network data did not increase predic-
tion performance
In order to learn a low-dimensional embedding space, MDAEs perform data fusion of
multiple networks. The goal of data fusion is to combine multiple, heterogeneous and orthogonal
data sets together into a composite data set, whose predictive power is higher than any
one data set alone. deepNF fused six types of protein network data encoded in the STRING
database. Many more types of interactions between proteins are possible, but were not in-
cluded in deepNF. Here, we tested whether the predictive power of protein embeddings
could be increased by including other, orthogonal types of interactions:
• genetic interactions from the BioGRID database,
• gene co-expression correlations from a meta-analysis of S. pombe protein expression
studies [408], and
• experimental phenotypes from the fission yeast phenotype ontology.
We fused these three networks and the six STRING networks using the same set of
MDAE architectures that we used in Section 4.3.4. The 256-256 MDAE architecture had
the best validation loss of 0.403. For comparison, this loss is much higher than the best
loss of 0.221 when autoencoding the six STRING networks (Section 4.3.4). This suggests
that the three additional networks are noisy, reducing the MDAE’s ability to reconstruct
the networks. An alternative argument to explain this result could be that the reconstruc-
tion loss is higher for nine input networks because the 256-256 MDAE’s neurons were
saturated and the MDAE has reached its maximum reconstruction capacity. However, this
situation is unlikely to be true because 31 embedding dimensions remain untrained in the
MDAE, where all values in these dimensions are zero for all proteins. Also, other archi-
tectures that were larger, such as the 512-512-256 MDAE, had very similar loss curves.
Either way, the 256D embeddings generated by feature learning from nine networks had a
lower prediction performance (Fig. 4.8) than 256D embeddings from six STRING networks
(Fig. 4.6).
Figure 4.8: S. pombe protein function prediction performance on 256D embeddings gen-
erated by feature learning from nine networks. MLPs were trained to predict all
GO terms in a level simultaneously. GO terms were divided into three levels according
to the number of proteins that are annotated with each term. Bars are the mean per-
formance across 10 independent trials of 5-fold cross-validation and error bars are the
standard deviation.
We also tried fusing three other combinations of data, but could not improve upon
the performance of six STRING networks alone. However, we chose not to show these
results because they are similar to Fig. 4.8 and are highly repetitive. We tested the following
combinations:
1. six STRING networks, genetic interaction and gene co-expression,
2. five STRING (not co-expression), genetic interaction and gene co-expression, and
3. STRING physical interaction network, genetic interaction and gene co-expression.
Combinations 1 and 2 produced slightly lower prediction performance than the
six STRING networks. Combination 3 yielded much worse performance than using six
STRING networks.
4.4 Discussion
4.4.1 Small embeddings achieve comparable performance to deepNF
deepNF used a 2000-600 MDAE, which contains an unnecessarily large number of pa-
rameters to model the amount of information contained in S. cerevisiae protein networks.
Models of this size require ∼10⁶ examples to train all parameters sufficiently. Because
we only have on the order of 10³–10⁴ proteins as examples, smaller architectures should
be used. We tested whether smaller embeddings of 128D or 256D could be used (Sec-
tion 4.3.2). We found that 256D were sufficient to replicate the performance of deepNF’s
600D embeddings (Fig. 4.4). Although we were unable to surpass the published perfor-
mance of deepNF, this is still an improvement over deepNF. By using smaller embeddings,
MDAEs, SVMs and MLPs can be trained faster, which is beneficial when performing re-
search using machine learning.
We applied this model to predict S. pombe protein function using 256D embeddings and
structured learning (Section 4.3.4). We either predicted all terms from an ontology simul-
taneously, or split terms from each ontology into three levels and predicted separately all
terms from each level simultaneously.
Going forward, one strategy for protein function prediction could be to train an en-
semble model, consisting of a structured learning model and a one-vs-rest model. Clas-
sifiers trained using the one-vs-rest strategy would learn patterns that predict individual
functions well. On the other hand, structured learning models would learn non-linear
correlations between features and labels to predict subtle, complex functions. The ensem-
ble model E could combine predictions from the one-vs-rest model O and the structured
learning model S using a strategy such as taking the element-wise mean of their predicted probabilities.
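A sketch of one such combination rule; averaging the two models' predicted probabilities is our assumption here, not a strategy prescribed above:

```python
import numpy as np

def ensemble_predict(ovr_probs, structured_probs):
    """Combine per-term probabilities from the one-vs-rest model O and the
    structured learning model S by taking their element-wise mean."""
    return (np.asarray(ovr_probs) + np.asarray(structured_probs)) / 2.0

print(ensemble_predict([0.9, 0.1], [0.7, 0.3]))  # -> [0.8 0.2]
```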
4.4.4 Including orthogonal protein network data did not increase predic-
tion performance
We tested whether the performance of predicting S. pombe protein functions could be
improved by including additional, orthogonal protein network data. We trained MDAEs
to learn features from different combinations of network data, with and without the six
STRING networks used by deepNF. In all cases, we were unable to improve upon the fea-
tures learnt from six STRING networks. Compared to the six STRING networks alone,
the reconstruction losses for MDAEs trained on different combinations of networks were
much higher. Furthermore, the protein function prediction performance was also much
lower when using these different combinations of networks, compared to the six STRING
networks alone.
It is unclear why this is the case. STRING is a trusted resource that is curated to a
high standard. The six STRING networks may have been processed in some way, such that
the six networks correlate well with each other. Exogenous data, such as genetic
interactions, gene co-expression correlations and experimental phenotypes, may not
correlate well with the STRING data. Alternatively, these other three types of interaction data
may be more noisy than STRING data, due to the nature of the data, how it was processed,
or the level to which the database has been curated.
In the future, it will be interesting to see whether additional, orthogonal types of in-
teraction data can be fused together successfully. Protein functional association resources
continue to improve as measurement accuracies increase, large-scale experiments become
commonplace and previous experimental results are independently validated.
4.4.5 Conclusion
The work in this chapter successfully replicated the published performance of a novel
protein function prediction method, deepNF, that uses MDAEs and protein network data to
generate embeddings of proteins. We then improved upon deepNF in three ways. Firstly,
we showed that smaller MDAE architectures, and secondly, smaller embedding dimen-
sions, achieve comparable performance to deepNF. Thirdly, we found that protein func-
tions can be predicted using structured learning with the same performance as predicting
each function using a separate classifier. This reduced training and prediction time, whilst
also allowing non-linear correlations to be learnt between features and labels. We applied
our improved model to successfully predict S. pombe protein function using structured
learning. We use this approach in Chapter 5, in combination with phenotypic and protein
evolution data, to predict S. pombe protein function.
Chapter 5
5.1 Introduction
In this chapter, we develop a machine learning model to predict protein function using
a combination of network, evolutionary and phenomics data. We train machine learn-
ing models using protein network embeddings from Chapter 4, evolutionary family data
from CATH-FunFams and phenomic data from high-throughput phenomics screens of
Schizosaccharomyces pombe (S. pombe; fission yeast) gene deletion mutants. First, we analyse the phenomics data, then rigorously benchmark the machine learning models, before
predicting functions of S. pombe proteins. Finally, we enter our predictions into the Criti-
cal Assessment of Functional Annotation (CAFA) evaluation of protein function prediction
methods. We begin the Introduction with pertinent information about fission yeast, phe-
nomics screening and CAFA.
similar to mammalian mitosis [416]. S. pombe is usually haploid, but, in stressful situa-
tions, two haploid cells of opposite mating types can fuse to form a diploid cell, which
then divides into four haploid spores [417].
Functional genomics in S. pombe has been aided by the generation of a library of hap-
loid gene deletion strains for all non-essential genes [421]. Each gene was replaced with a
selectable KanMX marker that allows for positive selection of the knockout strains. Short
oligonucleotide sequences, known as barcodes, flank the marker with a unique sequence
for each knocked out gene. All barcodes can be amplified by the polymerase chain reaction
simultaneously using universal sequences. This allows strains to be profiled in a parallel
and competitive fashion using barcode sequencing (Bar-seq) [422, 423].
Since 1992, when the budding yeast genome was sequenced, the percentage of functional
characterisation of its proteome initially increased rapidly, but over the past decade has
plateaued at 82% [424]. Fission yeast saw a much faster increase over a shorter time
frame, but it too has topped out at 84% coverage [424]. Biological roles for the remaining
∼ 20% of proteins remain elusive. Strangely, many of these proteins are conserved in
humans, which suggests that they are involved in key cellular processes and therefore are
a priority to study [424]. There are many reasons why a protein may not have been studied
[425], ranging from biological biases (experimental assays, detectability of functions, cost
effectiveness) to cultural biases (funding priorities, citability, fashion trends). Next, we
touch on some explanations for these biases.
Genes whose deletion mutants are unable to grow in standard laboratory conditions on rich media are deemed to be 'essential' genes. In fission yeast, there are 1,390 essential genes, found by searching PomBase [426] for 'inviable cell population' (Fission Yeast Phenotype Ontology (FYPO) [409] term FYPO:0002059; null expression, single allele
genotypes) on March 23, 2020. Of the non-essential genes in budding yeast, 34% of gene
deletion mutant strains show a growth phenotype when grown in standard conditions, but
97% of genes are essential for normal growth in at least one of 1,144 different chemical
genomic assays [427]. The remaining 3% of genes that did not display a phenotype may
do so in at least one condition that has not (yet) been tested. Experimental function anno-
tations are known to be biased towards high-throughput assays that are able to generate
large volumes of functional annotations for many proteins (see Section 5.1.2 for an exam-
ple), but usually only for a tight range of functions [428]. Furthermore, these annotations
are usually subject to the curse of ‘few articles, many proteins’, whereby 0.14% of papers
in the Gene Ontology Consortium account for the annotations of 25% of proteins [428].
Functional annotations derived from high-throughput experiments have a high error rate,
so would benefit from being confirmed by independent high-throughput studies, or tar-
geted low-throughput experimental validation. As such, the degree to which we can trust
functional annotations, across all species, is questionable. We explore the scope of op-
portunities for protein function prediction in S. pombe in Section 5.3.1 by analysing Gene
Ontology terms annotated to proteins. Finally, the law of diminishing returns suggests that
publishing papers on highly studied proteins such as p53 (which has amassed over 33,000
publications since 2007 [424]) will not be as fruitful as investigating a conserved human
protein whose orthologues in any species have not been characterised.
A previous joint PhD project between the Orengo and Bähler groups involved the develop-
ment of Compass, a method to predict S. pombe protein function using protein networks.
Using the S. pombe Search Tool for the Retrieval of Interacting Genes/Proteins (STRING)
network [234], weighted adjacency matrices were generated for each of the seven edge
types: ‘neighborhood’, ‘fusion’, ‘cooccurence’, ‘coexpression’, ‘experimental’, ‘database’
and ‘textmining’ (see Section 4.2.1 for details). These adjacency matrices were summed to
form a combined graph—a cheap and cheerful data fusion technique. To account for false
negative and false positive edges, a kernel function was applied to the adjacency matrix
to generate a kernel matrix. In this case, the commute-time kernel [115] was generated, which represents the expected number of steps a random walk on the graph takes to travel from node vᵢ to node vⱼ and return back to vᵢ. The commute-time is not only
able to measure distances between nodes, but is also able to capture topological features of
the graph’s wiring pattern at mesoscopic resolution. This is beneficial because the Bähler
group has previously shown that topological features of S. pombe functional networks are
predictive of protein function [429, 430]. Network commute-times, and kernels thereof,
were used previously by the Orengo group to successfully predict protein function in a
guilt-by-association framework using combinations of kernels to fuse heterogeneous data and kernel matrices to measure the similarity between pairs of proteins [116, 431]. Lehtinen et al. [117] used kernel matrices to train a partial least squares regression model [102] to predict protein function by supervised learning. Partial least squares regres-
sion first learns a low-dimensional representation of the kernel matrix, in a similar manner
to dimensionality reduction into principal components, followed by ordinary least squares
regression in this low-dimensional space.
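For concreteness, the commute-time kernel can be sketched in Julia (the language used for most of this work; see Section 5.2.3.2), assuming the standard construction as the Moore-Penrose pseudoinverse of the graph Laplacian—not necessarily the exact implementation used by Compass:

    using LinearAlgebra

    # Commute-time kernel: pseudoinverse of the graph Laplacian.
    # Commute time between nodes i and j ∝ K[i,i] + K[j,j] - 2K[i,j].
    function commute_time_kernel(A::AbstractMatrix)
        d = vec(sum(A, dims=2))  # weighted node degrees
        L = Diagonal(d) - A      # graph Laplacian
        pinv(Matrix(L))
    end

    A = [0.0 1.0 1.0;            # toy weighted adjacency matrix
         1.0 0.0 0.0;
         1.0 0.0 0.0]
    K = commute_time_kernel(A)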
Compass was benchmarked against GeneMANIA [108], the de facto standard network-based protein function prediction method at that time. First, GeneMANIA combines multiple functional association networks
by calculating a weighted average of the adjacency matrices. A ridge regression model (L2
regularised linear regression) is trained to learn the weightings for each network, with
the ability to downweight redundant and irrelevant information. Then, GeneMANIA runs
label propagation on the composite network to predict protein function. The label prop-
agation algorithm takes a set of seed proteins, a set of positive examples (positive node
bias), and optionally a set of negative examples (negative node bias). Labels are propa-
gated through the network in an optimisation process guided by a cost function that tries
to minimise the difference between the label of neighbouring nodes and also the differ-
ence between the predicted label and the initial bias for each node. Compass consistently
outperformed GeneMANIA on a variety of benchmarks and across a panel of metrics [117].
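A label-propagation step in the spirit of the algorithm described above can be sketched as follows (an illustrative assumption, not GeneMANIA's exact implementation):

    using LinearAlgebra

    # Minimises fᵀ(I - S)f + λ‖f - y‖², trading off smoothness of the labels f
    # over the network against fidelity to the initial node biases y.
    function propagate(W::AbstractMatrix, y::AbstractVector; λ=1.0)
        d = vec(sum(W, dims=2))
        S = Diagonal(d .^ -0.5) * W * Diagonal(d .^ -0.5)  # symmetric normalisation
        (I + (1 / λ) * (I - S)) \ y                        # closed-form solution
    end

    W = [0.0 1.0 1.0 0.0;
         1.0 0.0 1.0 0.0;
         1.0 1.0 0.0 1.0;
         0.0 0.0 1.0 0.0]
    y = [1.0, 0.0, 0.0, -1.0]  # positive bias, unlabelled, unlabelled, negative bias
    f = propagate(W, y)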
The field of protein function prediction has seen rapid development and progress in the past decade, enabled on the one hand by higher-quality data that covers more proteins and functions, and on the other hand by advances in machine learning methods and computational power,
which together have produced more performant models. We make use of these advances
in this study to make better predictions of S. pombe protein function.
5.1.2 Phenomics
Advances in DNA sequencing, proteomics, systems biology and bioinformatics have enabled large, heterogeneous data sets to be interrogated to understand genotype-phenotype relationships [433, 434]. An early commentary about the nascent field of proteomics suggested
that phenomics is “an all embracing term (on a par with genomics) to describe functional
genomics” [435]. Phenomics screens are typically genome-wide high-throughput experi-
ments that produce high-dimensional phenotypic data.
Some phenotypes have direct one-to-one relationships with their causal genotypes.
Examples include fully penetrant mutations in the amyloid-β precursor protein that cause
familial Alzheimer’s disease (Section 2.1.1), or Gregor Mendel’s genetic hybridisation ex-
periments in peas [436]. However, the majority of phenotypes are complex (meaning 'multivariate', where more than one gene determines a phenotype), with no clear relationship between genotype and phenotype. It cannot be stated more succinctly than in [437]:
Knowledge of the laws of the lower level is necessary for a full understanding
of the higher level; yet the unique properties of phenomena at the higher level
can not be predicted, a priori, from the laws of the lower level.
In other words, phenomics can elicit higher level phenotypes and can predict which
lower levels (genes or variants) should be investigated further.
Phenomics has become a popular tool in the geneticist's toolbox. For example, it has been used to study the effects of obesity in humans [440], increase crop yields [441, 442], and understand mammalian gene function in the mouse [443]. Over the next decade, many
more genotype-phenotype relationships will need to be untangled now that gene editing
has been trivialised with CRISPR-Cas9 [444, 445]. Phenomics will likely play a key role
[446].
Phenomics has also been an important tool for studying genotype-phenotype rela-
tionships in S. pombe [447]. Genome-wide screens in fission yeast are facilitated by the
Bioneer library of gene deletion strains for all non-essential genes [421] and the ease with
which quantitative phenotypes can be collected. A number of phenotypic proxies are used
in S. pombe, including automated sizing of colonies grown on solid media, with computational processing of the large volume of data [448–451], or the abundance of bar-
codes in multiplexed Bar-seq experiments [423, 452].
A common strategy is to screen strains in the presence of some stressor that elicits the
conditional essentiality of genes. One study identified 33 new ageing genes by inhibiting
the TORC1 signalling pathway in the gene deletion mutant strain library with rapamycin and caffeine [453]. It was known that a high dosage of caffeine is toxic to fission yeast whilst a low dosage can be tolerated [454, 455], but the genetic reasons were unknown. A second screen of the gene deletion mutant strain library in media containing caffeine found that oxidative stress pathways are responsible for tolerance [456]. A third study identified 60 new genes by screening the gene deletion mutant strain library in the presence of the
transcriptional inhibitor 6-azauracil [457].
5.1.3 CAFA 4
The latest iteration of the CAFA protein function prediction challenge [72–74], CAFA 4,
took place between October 2019 and February 2020. CAFA’s organisers provide a set
of target protein sequences, where the aim is for participants to predict the functions of
these sequences. In CAFA 4, the targets were whole proteomes from 18 model organism
species, including human, mouse, fly, worm, baker’s yeast, and, crucially, fission yeast.
Participants could predict functions from any of three disjoint ontologies: Gene Ontology,
Human Phenotype Ontology [63], and Disorder Ontology [64], which was included for the
first time. Viewed in its wider context, CAFA 4 ultimately aimed to improve the quality
and coverage of functional annotations for a large number of important proteins in the
biological sciences. Each participating team could enter predictions from three separate
models, which were assessed for coverage using Fmax (Eq. (1.1)) and (semantic) precision
using Smin (Eq. (1.2)). Teams were ranked according to their best performing model on
each metric, which allows teams to enter a ‘strict’ model that is optimised for Smin and a
‘relaxed’ model for Fmax .
An initial evaluation of CAFA 4 took place and was presented at the ISMB conference
in July 2020. The performance of the top 10 models under Fmax and Smin was presented
for each of the three Gene Ontology (GO) name spaces. The final evaluation is expected
to take place in November 2020, with a publication expected in mid-2021.
5.1.3.1 Contributions
Fission yeast is an important model organism for understanding the biological mechanisms of cellular processes. In this chapter, we develop machine learning models that predict the functions of S. pombe proteins. To do this, we collected a large, phenomic data set of S. pombe
gene deletion mutant strains grown in 131 different conditions. Using an extensive array
of analyses, we consistently found that the growth phenotypes were not reliable, repro-
ducible or predictive of protein function. We trained machine learning models using this
bespoke experimental data, alongside orthogonal data from protein networks and evolu-
tionary information from CATH. We evaluated the performance of models trained on these
data modes separately, and in combination, finding that the best performing model used
a combination of network and evolutionary data. In particular, we focussed on functions
that were predicted for proteins conserved in vertebrates, and those that currently have
no experimentally characterised functions in S. pombe. Finally, we entered the predictions
from this model into the fourth CAFA evaluation of protein function prediction methods.
5.2 Methods
5.2.1 Growth phenotyping
5.2.1.1 Collection
All genes have been systematically deleted from S. pombe. Strains of the ∼ 3,700 non-
essential gene deletion mutants (out of 5,137 genes in total) are available commercially as
a library, known as the Bioneer library [421].
To measure the effect of gene deletions on the growth phenotype of S. pombe, the
library was plated onto agar media supplemented with different molecules. The plates
were incubated at various temperatures and for various times to allow the colonies to
grow. The particular combination of media, added molecules, temperature and incubation
time is referred to as the 'condition'. Growth phenotypes were collected from 131 unique conditions.
5.2.1.2 Processing
Plate images were processed using pyphe [450]. pyphe was inspired by scan-o-matic [449],
a pipeline for processing plate-based high-throughput growth phenotyping experiments.
The grid of wild type colonies is used to interpolate a ‘reference surface’ across the
plate, corresponding to the expected growth of a wild type colony at any point on the
plate. Colony sizes are normalised by dividing the colony size by the reference surface
value for the colony’s position on the plate. Colonies of gene deletion mutants that are
less fit than wild type will have a relative colony size < 1, whilst those that are fitter will
have a relative colony size > 1.
In total, 2,832,384 colonies were present in the data set (Fig. 5.2). 'Grid' and 'empty' colonies were removed first, leaving 2,389,858 colonies (Fig. 5.2).
2,256,475 colonies remained after poor-quality colonies were removed. Sizes were rounded to 3 decimal places to remove excess numerical precision that is probably experimental noise. Phloxine B is a red food dye that is used in yeast cell-viability assays [458]. Cell membranes prevent the dye from entering cells; however, when cells die and membrane integrity is lost, the dye enters cells. Colonies that were grown in conditions with, and without, phloxine B were pooled because phloxine B does not interfere with growth. Colony sizes for each strain were normalised, to remove effects that knocking out the gene may have on growth, by dividing by the mean colony size of the strain in plain media at 32°C: 'YES_32' for conditions based on YES media, or 'EMM_32' for EMM media. Colony sizes were log2-transformed to make inverse growth differences symmetric about 0. For example, log2(2) = 1 and log2(2^-1) = log2(0.5) = −1. Each strain-condition pair was then represented by the colony whose size was most different to that of the wild type:

X[argmax(abs(X))]. (5.1)
Finally, for each condition, any missing colony sizes were imputed using the mean
colony size of strains in that condition. Mean imputation was chosen so that imputed
values would not have any effect when training machine learning models. For brevity, from now on we use the term 'colony size' to refer to colony sizes that have been normalised and log2-transformed.
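The processing steps described above can be summarised in a short Julia sketch (illustrative toy values):

    # Normalise to growth in the control condition, log2-transform, then take
    # the point estimate of Eq. (5.1) across repeats.
    normalise(size, control_mean) = log2(size / control_mean)
    most_extreme(X) = X[argmax(abs.(X))]  # Eq. (5.1)

    repeats = normalise.([0.8, 1.1, 0.4], 1.0)  # three repeats; control mean = 1.0
    most_extreme(repeats)                        # ≈ log2(0.4) = -1.32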
The Bonferroni method [460, 461] controls the family-wise error rate (FWER), i.e. the
probability of making at least one type I error (false positive rate). The Bonferroni method
rejects all hypotheses for which Pⱼ < α/J. For J hypothesis tests, the FWER at a given significance threshold α is 1 − (1 − α)^J. For example, at the 0.01 significance level, if 100 tests are performed, the FWER = 1 − 0.99^100 = 0.634. In other words, there is a 63%
chance of making one or more type I errors. FWER methods provide strict control of type
I errors, but have correspondingly low power. If it is desirable for the false positive rate (the type I error rate) to be low—for example, when hits are to be analysed manually—the Bonferroni method is a suitable choice.
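As a hedged sketch, both corrections are available in MultipleTesting.jl (one of the packages listed in Section 5.2.3.2); the calls below are illustrative:

    using MultipleTesting

    # Strict FWER control (Bonferroni) versus the higher-power FDR control
    # (Benjamini-Hochberg) used for the significance testing in this chapter.
    pvals = [0.001, 0.008, 0.02, 0.04, 0.3]
    adjust(pvals, Bonferroni())
    adjust(pvals, BenjaminiHochberg())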
Random forests are ensembles of classification and regression trees (CART). We initially experimented with 100 trees and later increased this to 500 trees for final model evaluation. There is usually only a mild improvement in the performance when increasing from 100 to 500 trees (< 0.1 AUPR improvement). Trees were grown using the following criteria:
• No maximum depth, so trees could grow arbitrarily deep.
• A minimum of two samples is needed to split a node, resulting in terminal nodes
with single samples in each.
• No minimum purity increase, here defined according to minimising entropy.
Trees were not pruned in this study.
To better compare the performance using different input data, we implemented a strategy to obtain the same train-test splits for each label and repeat, regardless of the
input data. Before predicting the ith label in the nth repeat, the pseudo-random number
generator is seeded with i + n before being used to generate the cross-validation splits.
However, if the input data does not contain the same set of strains (rows) as other input
data, then the splitting strategy will not be the same.
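A minimal sketch of this seeding strategy (illustrative; MLBase is one of the packages listed in Section 5.2.3.2):

    using Random, MLBase

    # Seeding with i + n before splitting makes the folds identical for a given
    # label i and repeat n, regardless of which input data are used.
    function cv_splits(i, n, nrows; k=5)
        Random.seed!(i + n)
        collect(Kfold(nrows, k))  # k sets of training indices
    end

    cv_splits(3, 2, 100)  # deterministic: same folds every call with (3, 2, 100)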
Hyperparameters were chosen using an exhaustive grid search over the parameter
ranges. Each combination of parameters was assessed using a nested five-fold stratified
cross-validation. The parameter combination that produced the highest area under the precision-recall curve (AUPR) was chosen; the chosen values are shown in Table 5.1.
Table 5.1: RF hyperparameter optimisation. Hyperparameters that were optimised and the pa-
rameter ranges over which an exhaustive grid search was carried out are listed.
Model performance was estimated using five-fold stratified cross-validation. The data
was shuffled before each cross-validation. Five independent runs of cross-validation were
performed to estimate the model performance under different train-test splits. Terms were
predicted using the one-vs-rest multiclass strategy.
5.2.2.5 Prediction
After benchmarking, GO term annotations were predicted for S. pombe genes. We took
two approaches to do this. Initially, we used the canonical approach of training a model
using all available training data, then predicting labels. Latterly, after realising that the
random forest models had overfit on the training data and were conservative at predicting
positive labels, we implemented a more liberal prediction strategy based on 5-fold cross-
validation. Under this strategy, models were trained using 80% of the data and used to
predict labels for the remaining 20%, and so on for each of the five train-test splits. This
means that models are never trained on the same examples that they predict labels for,
which forces models to predict labels using general principles that were learned in the
training phase.
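A sketch of this out-of-fold prediction strategy (illustrative; DecisionTree and MLBase are among the packages listed in Section 5.2.3.2):

    using MLBase, DecisionTree

    # Labels for each example are predicted by a model that never saw that
    # example during training.
    function oof_predict(X::Matrix{Float64}, y::Vector{Int}; k=5)
        n = size(X, 1)
        ŷ = zeros(n)
        for train in Kfold(n, k)
            test = setdiff(1:n, train)
            forest = build_forest(y[train], X[train, :], -1, 500)  # 500 trees
            ŷ[test] = apply_forest_proba(forest, X[test, :], [0, 1])[:, 2]
        end
        ŷ
    end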
GO annotations were propagated to their parent terms in the GO DAG. Terms an-
notated to between 50 and 1,000 proteins were included in the target set, so that suf-
ficient training examples were present for machine learning. Random forest classifiers
were trained using 500 trees. Because we did not use growth phenotype features to make
final predictions, we did not exclude any genes from our data set. We observed that mod-
els were robust to values of hyperparameters, so for the sake of expediency, we did not
perform hyperparameter optimisation. We chose sensible default values for the number of sub-features, √n (where n is the number of features), and partial sampling of 0.7. GO terms were predicted using 5-fold
cross-validation. Predictive performance was evaluated using AUPR.
5.2.2.6 CAFA
CAFA is a protein function prediction challenge. GO terms were predicted similarly to the approach explained in Section 5.2.2.5, but a number of updates and mod-
ifications were made. STRING v11.0 [466] was used. Due to potential circularity, the text
mining STRING network was not included in prior benchmarks, but we included it here
to boost our performance. 256D network embeddings were generated using networks
of the seven STRING edge types. The latest version of FunFams, v4.3, was generated in January 2020, so was used. S. pombe protein sequences were searched against the v4.2 and
v4.3 FunFams using hmmsearch. GO terms associated with FunFams were downloaded in
February 2020, and all terms, except those with NAS, ND, TAS and IEA, were included, as
well as UniProtKB-kw IEA terms. CAFA provided a frozen version of the Gene Ontology
for every team to use (October 7, 2019 release). Up-to-date S. pombe GO annotations were
downloaded from PomBase on January 14, 2020.
v4.3 FunFams were used in model 1 and v4.2 FunFams in model 2. Inclusion thresholds were used for v4.2 FunFam HMMs, whereas a threshold of E < 10⁻⁴ was used for v4.3 FunFam HMMs to mitigate any potential over-splitting that may have occurred whilst generating these FunFams. E-values were −log10-transformed to be used as FunFam features.
5.2.3.1 Clustering
Growth phenotype data was clustered [467] and heatmaps with dendrograms were
plotted. Euclidean distance matrices were calculated [468] (‘euclidean’ metric in
scipy.spatial.distance.pdist [469]). The distance matrix was clustered
using the unweighted pair group method with arithmetic mean (UPGMA) algorithm
[470] (‘average’ method in scipy.cluster.hierarchy.linkage [469]). UP-
GMA performs bottom-up hierarchical agglomerative clustering to form a dendrogram
by iteratively merging the two most-similar clusters. Distances between clusters are cal-
culated as the mean distance between all pairwise combinations of items between each
cluster. By considering all items in clusters, UPGMA is more robust to outliers than single-
linkage clustering, which only considers the nearest pair of items. Heatmaps were plotted
using seaborn.clustermap.
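An equivalent pipeline can be sketched in Julia for consistency with the rest of this work (an illustrative translation; the analysis itself used scipy and seaborn as described, and Clustering.jl is not among the packages listed below):

    using Distances, Clustering

    X = rand(131, 370)                    # toy data: conditions × strains
    D = pairwise(Euclidean(), X, dims=1)  # Euclidean distances between conditions
    tree = hclust(D, linkage=:average)    # UPGMA ('average' linkage)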
5.2.3.2 Code
Julia 1.1-1.4 was used for most of this work. Julia packages used were: DataFrames
v0.19.1, DataFramesMeta v0.5.0, DecisionTree v0.10.1, Distances v0.8.0, GLM v1.1.1, Hy-
pothesisTests v0.8.0, MLBase v0.8.0, MLDataUtils v0.5.0, MultipleTesting v0.4.1, OBOParse
v0.0.1, PlotlyJS v0.13.1, Plots v0.27.1, StatsBase v0.31.0, StatsPlots v0.10.2, and UMAP
v0.1.4. Custom code is available as a Julia package PombeAgeingGenes.jl on GitHub
(https://github.com/harryscholes/PombeAgeingGenes.jl). Python 3.7 was used to plot
clustermaps of matrices using numpy v1.15.4, pandas v0.23.4, matplotlib v3.0.2 and seaborn
v0.9.0.
5.3 Results
5.3.1 Known functions of fission yeast proteins
Overall, this project aims to accurately predict functions of S. pombe genes, so that experimentalists can be informed about which targeted functional validations to perform, thereby increasing the total number of genes with at least one 'experimental' annotation and the total number of 'experimental' annotations. 4,523 (88%) of S. pombe's
5,137 protein-coding genes have at least one ‘experimental’ annotation in any of the three
ontologies—biological process, molecular function and cellular component (Fig. 5.1). However, if only biological process and molecular function terms are considered, just 1,963 (38%) genes have at least one 'experimental' annotation. This is because it is relatively
easy to determine the subcellular location of proteins in high-throughput screens, such as
automated fluorescence microscopy.
For the biological process and molecular function ontologies separately, 4,190 (82%)
and 3,310 (64%) of genes have at least one high-quality annotation in the ‘experimen-
tal’ or ‘curated’ classes of GO evidence codes. In the cellular component ontology, 4,449
(87%) of genes have at least one ‘experimental’ annotation, of which 2,790 (63%) are from
high-throughput studies with evidence codes HTP, HDA, HMP, HGI or HEP, which may
represent a significant bias in the functional distribution of S. pombe gene annotations
[428]. Annotations in the ‘bad’ class should not be trusted. Together with genes that have
no annotations, there is an opportunity to assign functions from the biological process,
molecular function and cellular component ontologies to 845 (16%), 1,439 (28%) and 231
(5%) genes, respectively.
[Bar chart: number of genes (y-axis, 0–5,000) per annotation class—none, bad, automatic, curated, experimental—for each ontology (x-axis: All, Biological process, Molecular function, Cellular component).]
Figure 5.1: Assessing the quality of GO term annotations in S. pombe genes. For each
ontology—biological process, molecular function and cellular component—together
(All) and separately, S. pombe genes are assigned to one of five classes—experimental,
curated, automatic, bad and none—in that order of preference.
Colonies that were grown in conditions that were, or were not, supplemented with phloxine B were pooled because phloxine B does not interfere with growth [472].
[Bar chart: number of colonies (y-axis, 0–3×10⁶) for 'All data', 'All colonies' and 'High-quality colonies'.]
Figure 5.2: Processing colonies in growth phenotype data. ‘All data’: all colony size measure-
ments in the data set, ‘All colonies’: colonies remaining after ‘grid’ and ‘empty’ colonies
were removed, ‘High-quality colonies’: colonies remaining after poor-quality colonies
were removed.
5.3. Results 201
For each strain-condition pair, the variance in colony size was calculated (Fig. 5.3).
Variances are distributed exponentially, with a heavy positive skew. Generally, strain-
condition pairs have low variance, with mean 0.0733 and 95th percentile 0.287.
[Histogram (log-scaled y-axis): number of strain-condition pairs against colony size variance (x-axis, 0–30).]
Figure 5.3: Distribution of colony size variance across all strain-condition pairs. Variance in colony size was calculated for strain-condition pairs with > 1 repeat.
We assessed the reliability of strain colony sizes from multiple repeats in the same
condition. By reliability, we mean how similar the sizes of pairs of colonies from the same strain, grown in the same condition, are. We chose the condition 'YES_SDS_0.04percent'
because it had the most pairs of repeats, except for ‘YES_32’ and ‘EMM_32’. Colony sizes
from the 1,055,687 pairs of repeats for particular strains agreed poorly (Fig. 5.4). These
pairs of colony sizes had a Pearson correlation coefficient of r = 0.215. We also fitted a
linear regression model to the pairs of colony sizes and found that this line deviates signif-
icantly from the ideal y = x line—corresponding to perfect reliability—with a coefficient
of determination R² = 0.036.
We observed that colonies of S. pombe gene deletion mutants displayed a high false
negative rate, i.e. a growth phenotype was not observed, despite the knocked out gene
being affected by the growth condition. In other words, strains would sometimes show a
phenotype in one repeat, but inexplicably would not show any phenotype in other repeats.
Each strain-condition pair was then represented by the colony whose size had the largest effect size (Eq. (5.1)). From now on, we refer to these values as the 'colony
[Scatter plot: colony size in one repeat against colony size in another repeat of the same strain.]
Figure 5.4: Reliability of colony sizes. The condition 'YES_SDS_0.04percent' was used, for which 1,055,687 pairs of repeats for particular strains exist. 20,000 pairs were randomly sampled and plotted.
size’ for each strain-condition pair. Colony sizes across all strains and conditions follow
a normal distribution with N(µ = −0.046, σ² = 0.281) (Fig. 5.5). The mean colony size of −0.046 shows that strains tend not to have strong phenotypes in conditions, with a small trend towards being less fit. The variance of the fitted normal distribution agrees well
with the empirical mean of colony size variance of each strain-condition pair (Fig. 5.3).
Assuming normality, a standard deviation of 0.530 means that 95% of colonies will be
within
µ ± 2σ = −0.046 ± 1.060, i.e. −1.106 < x < 1.014.

So taking the inverse logarithm results in 95% of corrected colony sizes falling between 0.46 and 2.02 times the wild type colony size.
On closer inspection, the distribution is bimodal about 0, due to the way that we calculated point estimates of colony sizes using X[argmax(abs(X))]. As the most extreme colony size, x, tends to zero, the probability, P(x), also tends to zero:

lim_{x→0} P(x) = 0.
[Two histograms: frequency of colony sizes for all colonies (left) and for −0.5 ≤ x ≤ 0.5 (right).]
Figure 5.5: Distribution of colony sizes from all strains and conditions. Colony sizes for
all colonies (left) and −0.5 ≤ x ≤ 0.5 (right) are shown. Colony sizes were cor-
rected using the reference surface from the grid of wild type colonies on each plate,
normalised to the strain’s growth in control conditions, and log2 -transformed. Point
estimates of the colony size distribution for each strain-condition pair, X, was calcu-
lated by X[argmax(abs(X))].
[Scatter plot: mean colony size per condition (x-axis, −0.4 to 0.4) against variance (y-axis, 0–1.5); labelled outliers include YES_EGTA_10mM_day2, YES_EGTA_10mM_5days, EMM_C_2ugml, YES_EGTA_10mM, YES_KCl_0.5M_SDS_0.04percent and YES_Diamide_2mM.]
Figure 5.6: Mean colony size per condition across all strains, against the variance. Condi-
tions that elicit strong phenotypes, with mean colony size > 0.2 or < −0.2, or variance
> 1.0, were labelled.
Figure 5.7: Sensitivity of strains in conditions. A threshold of 1.25-fold growth difference was
set for resistant strains, where log2 (1.25) = 0.32. Resistant colonies (yellow) have log2 -
transformed sizes > 0.32, and sensitive colonies (blue) have log2 -transformed sizes
< −0.32. The Euclidean distance matrix was clustered using Ward’s method. Columns
are coloured by media type, where green denotes YES media and pink denotes EMM
media.
The clustered heatmap of condition correlations (Fig. 5.8) is the easiest of the three clustered heatmap figures to interpret because it is the smallest. Conditions based on EMM media cluster into two clades that are not contaminated with any YES-based conditions. Furthermore, 11 out of the 17 EMM conditions can be easily separated from the remaining conditions in the first UMAP embedding dimension [475] (data not shown). UMAP is a dimensionality reduction method similar to, but more powerful than, principal component analysis. Most YES glucose conditions form a clade at the bottom right corner of Fig. 5.8, with some contamination from other sugars (galactose, xylose and mannitol) and alcohols (glycerol and ethanol), which are metabolised similarly.
Interestingly, although tea tree oil is a potent antifungal [476], the two tea tree conditions
also clustered into this clade. This is likely because all of these conditions result in small
colonies.
[Clustered heatmap: Pearson correlation coefficients (PCC, colour scale −1.0 to 1.0) between all pairs of conditions across all strains; rows and columns list the 131 condition labels, ordered by UPGMA clustering.]
Figure 5.8: Correlation of conditions. The Pearson correlation coefficient (PCC) matrix of con-
ditions across all strains was clustered using UPGMA.
The study that identified tea tree as lethal to S. pombe found the minimum inhibitory
concentration to be 0.5% v/v, but the conditions that we used were 0.25 µl/ml and 0.50
µl/ml, i.e. 5% and 10% of the minimum inhibitory concentration. Low concentrations
of antibiotics are known to act as environmental signals for bacteria that cause multifar-
ious cellular changes [477]. Although the mechanism of tea tree’s antifungal properties
is not known, it is thought to affect the permeability of membranes [478]. However, a
tight clade was formed with the two tea tree conditions, YES_glucose_0.5percent_32C and
YES_glucose_1percent_32C, so it is also possible that tea tree had a minimal effect on cells
at such low concentrations.
Figure 5.9: Correlation of strains. The Pearson correlation coefficient (PCC) matrix of strains
across all conditions was clustered using UPGMA.
S. pombe proteins were mapped to FunFams, to encode homology information when using machine learning to predict GO annotations. Sequences of the 5,137 proteins encoded in S. pombe's genome were scanned against HMMs from each of the 68,065 FunFams. At a threshold of E < 10⁻³, 3,319 (64%) proteins had 149,099 hits to 23,900 (35%) of the FunFams.
[Figure 5.10 panels: (a) histogram of uncorrected and corrected P values (0–1); (b) histogram of the number of hits per condition; (c) histogram of the number of hits per strain.]
Figure 5.10: Significance testing. a. Testing whether strain growth is significantly affected by conditions. The null hypothesis was that there is no difference between how a strain grows in a condition and its corresponding control condition—'YES_32' for conditions based on YES media, or 'EMM_32' for EMM media. Two-sample unequal variance t-tests were performed for colony size distributions and their corresponding null distributions. b. Number of hits per condition. c. Number of hits per strain. For b and c, P values were corrected for multiple testing using the Benjamini-Hochberg method and hits were called at the Q = 0.01 significance level.
10,136 of these FunFams (42%) were hit by only one protein (Fig. 5.11). The me-
dian number of hits per FunFam was 2 and the mean was 5.35. The maximum number
of hits per FunFam was 126 proteins for the WD40 repeat-containing serine/threonine ki-
nase FunFam from the YVTN repeat-like/Quinoprotein amine dehydrogenase superfamily
(2.130.10.10/FF/102735).
[Cumulative frequency plot: number of FunFams (y-axis, up to 2.39×10⁴) against number of S. pombe protein hits per FunFam (x-axis, 0–126).]
Figure 5.11: Number of hits per FunFam for S. pombe proteins. Proteins were scanned against the CATH v4.2 FunFam HMM library using an E-value threshold of E < 10⁻³. Only FunFams with at least one hit are shown.
indicates that ‘vesicle mediated transport’ has significantly more descendant terms than
expected, P = 0.011, which may go some way to explain why 17% of FunFams contain
sequences that are annotated with terms related to ‘vesicle mediated transport’.
Figure 5.12: Numbers of FunFams associated with GO Slim terms. GO terms were associated
to FunFams by identifying all GO terms that are annotated to proteins in each FunFam.
All annotations were included, except those with NAS, ND, TAS or IEA evidence codes,
but UniProtKB-kw IEA curated terms were included. Ancestor terms that have ‘is_a’,
‘has_part’, ‘part_of’ and ‘regulates’ relationships were also included.
[Precision-recall curves; micro-averaged AUPR: NE_FF = 0.599, NE = 0.583, GP_NE = 0.544, FF = 0.424, GP = 0.126.]
Figure 5.13: Precision-recall curves for predicting GO Slim terms. The 53 GO Slim terms were
predicted using RFs and a one-vs-rest classification strategy. The prediction error was
estimated using five independent repeats of 5-fold cross-validation. Precision-recall
curves are plotted for each repeat (thin translucent curves) as well as a micro-averaged
curve (thick solid curves). Numbers in legend correspond to the micro-averaged AUPR.
Legend abbreviations: GP, growth phenotypes; NE, network embeddings; FF, FunFam
homology data.
A random classifier would have AUPR = 0.0329. Therefore, these growth phenotype data
are able to predict GO Slim terms 3.8 times better than a random classifier, but, compared
to network embeddings and FunFam data, the growth phenotypes are poorly predictive of
GO Slim annotations.
GO terms that are annotated to between 50 and 1,000 S. pombe proteins were predicted with
AUPR = 0.569 (Fig. 5.14). 2,390,915 annotations were predicted, of which 97,017 were al-
ready known and in the target set. After removing known annotations, with experimental
or curated evidence codes, and 2,167,057 predictions with probability P (annotation) <
0.1, 126,841 predictions remained for 534 GO terms in 4,456 proteins. Only 2,628 (2%) of
these predictions were present in the S. pombe GO annotations with IEA evidence codes,
and 7,710 (6%) with IEA, NAS or ND evidence codes. We do not predict any functions for
proteins that had no annotations (regardless of evidence codes), and we also do not predict
functions of proteins that had no experimental or curated annotations (i.e. only IEA, NAS
or ND evidence codes). We do, however, predict 117,654 new functions for proteins that
previously had experimentally validated functions.
[Precision-recall curve; micro-averaged AUPR = 0.569, random-classifier baseline = 0.038.]
Figure 5.14: Performance of predicting functions of S. pombe proteins. The model was trained on network embeddings and FunFams. GO terms that are annotated to between 50 and 1,000 proteins are included. A precision-recall curve is plotted. The number in the legend corresponds to the micro-averaged AUPR.
692 S. pombe proteins have unknown function, of which 409 have orthologues in
other organisms and 145 of these are conserved in vertebrates—the so-called 'priority unstudied genes' defined by PomBase. (NB some 'unknown function' and 'priority unstudied'
proteins have automatically assigned functions with the IEA evidence code.) We analysed
the functions that were predicted for proteins conserved in vertebrates and the priority
unstudied proteins (accessed on June 10, 2020). For the conserved proteins, we predicted
4,285 functions for 397 GO terms in 153 proteins (22%). 18 of these annotations were
previously predicted with IEA evidence codes. For the priority unstudied proteins, we
predicted 1,865 functions for 350 GO terms in 64 (44%) proteins. 18 of these annotations
were previously predicted with IEA evidence codes. 133 of the 145 priority unstudied
proteins are in the Bioneer gene deletion mutant library, and 60 out of the 64 priority
unstudied proteins that we predicted functions for are in the Bioneer collection.
5.5.4 CAFA 4
We entered CAFA 4 with predictions for fission yeast GO terms, under the team name
‘OrengoFunFamLab2’. We experimented with many different combinations of features and
submitted the three models that had the best AUPR from cross-validation (Fig. 5.15). Our
three models were all trained on network embeddings. Additionally, v4.3 FunFams were
used in model 1 and v4.2 FunFams were used in model 2. Our initial model—using net-
work embeddings and v4.3 FunFams—produced AUPR = 0.473 (results not shown). After
extensive experimentation (results not shown), we were able to increase the AUPR of our final models considerably. Despite this, model performance appears to be asymptotic
to AUPR ≈ 0.58.
[Precision-recall curves; micro-averaged AUPR: NE FF v4.2 = 0.582, NE FF v4.3 = 0.58, NE = 0.573.]
Figure 5.15: Performance of three models submitted to CAFA 4. Models were trained on dif-
ferent combinations of network embeddings (NE) and FunFams (FF). Precision-recall
curves are plotted. Numbers in legend correspond to the micro-averaged AUPR.
We can understand how the various features contribute to the models by analysing
the shape of the precision-recall curves. Network embeddings produce high-precision predictions at low recall, whereas FunFams increase the precision of predictions at high recall. If the goal is to assign functions to a set of proteins without making incorrect predictions or missing any predictions, then it may be beneficial to use FunFam information, possibly in combination with network embeddings.
Though not directly relevant to this chapter, I also participated in the Orengo group's FunFam-based predictions for all 18 species, under OrengoFunFamLab teams 1, 3, 4 and 5. The preliminary results of CAFA 4 were presented at the ISMB conference in July 2020. For each ontology, the top 10 methods were presented, where each research group could only occur once in the top 10. The OrengoFunFamLab3 method came top in molecular function, third in biological process, and was not placed in the top 10 for cellular component. OrengoFunFamLab3 used CATH v4.3 and InterPro data to train an XGBoost classifier using a learning-to-rank strategy.
5.6 Discussion
5.6.1 Many factors may contribute to colony size phenomics being unre-
liable
Colonies of S. pombe gene deletion mutants displayed a high false negative rate for phe-
notypes in conditions and some strains had a high variance in colony sizes in particular
conditions (Section 5.3.3). We assessed the variance in measuring colony sizes, due to tech-
nical errors associated with scanning plates, and found it to be very low. We normalised
colony sizes to account for spatial biases, and the position of strains relative to each other,
associated with plates. Therefore, false negative phenotypes appear to be genuine biolog-
ical phenomena, which could have many contributing factors, including:
• viability of cells after thawing the gene deletion mutant collection,
• number of cells pinned on the plate to seed each colony,
• temperature and other environmental conditions of the laboratory during prepara-
tion of plates,
• temperature and other environmental conditions in the incubator during growth,
• nutrients and their concentrations within the agar plate.
The Bioneer collection contains gene deletion mutant strains of S. pombe for all non-
essential genes. Genes may be non-essential because of genetic redundancy, or because
the gene is not required in benign standard laboratory growth conditions. By stressing
gene deletion mutants in a panel of growth conditions, we hoped to trigger condition-specific phenotypes.
Data from multiple, independent repeats are often processed to obtain a point esti-
mate that estimates the population distribution. A measure of central tendency, like the
mean or median, is usually taken. If the false negative rate for phenotypes is caused by
an underlying stochastic mechanism, or a mechanism that we perceive as stochastic, then
central tendencies will obfuscate any true positives. For example, consider four colonies with sizes [−1.0, −0.1, 0.0, 0.1], where −1.0 is a true positive and −0.1, 0.0 and 0.1 are false negatives. Here, the mean is −0.25 and the median is −0.05, which would both fail to capture the ground truth phenotype. To combat the high false negative error rate, we took as point estimates the maximum observed effect size from any repeat. That is, in our toy example above, the point estimate is −1.0, which successfully captures the ground
truth phenotype. The downside of this approach is the possibility of increasing the false
positive rate, which may arise if, for example, the grid normalisation failed to adequately
normalise a plate’s spatial biases. False positives are not as detrimental as false negatives.
False positives can be filtered out by experimental screening or using evidence from the
literature, but false negatives will not be tested because they would not be in the set of
predicted functions.
This is the first time that we have used FunFam homology information in a machine learning context. We encoded this information as a matrix of log-transformed HMM E-values. The resulting matrix is high-dimensional and sparse. HMM-based features have been used previously to predict protein function with machine learning [487–491], including logarithms of E-values [487]. High dimensionality and sparsity are not ideal properties for machine learning, but we attempted to mitigate their negative effects using a training strategy that, to our knowledge, has not been used before. We trained models using the one-vs-rest strategy and only included FunFams that are associated with the
GO term being predicted, thus reducing the dimensionality of the feature space (Fig. 5.12).
When this work was conducted, deepNF was one of the best function prediction methods
[66], so it was encouraging to see that the FunFams were able to improve on deepNF’s
performance.
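A sketch of this per-term feature-reduction strategy (illustrative; funfams_for_term is a hypothetical lookup from GO terms to their associated FunFams):

    using DecisionTree

    function train_per_term(X, funfam_ids, labels_by_term, funfams_for_term)
        models = Dict{String,Any}()
        for (term, y) in labels_by_term
            # Keep only the columns for FunFams associated with this GO term.
            cols = findall(in(funfams_for_term[term]), funfam_ids)
            models[term] = build_forest(y, X[:, cols], -1, 500)
        end
        models
    end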
All GO Slim terms are in the ‘biological process’ ontology. Network-based features
were more predictive of GO Slim terms than FunFam-based features. At least two factors
may contribute to this phenomenon. First, FunFams are groups of functionally pure pro-
teins from different species, therefore they tend to be better at predicting GO terms from
the ‘molecular function’ ontology, rather than the ‘biological process’ ontology [73, 74].
In CAFA 3, for example, the Orengo-FunFam team was ranked second place for predicting
‘molecular function’ terms and fourth place for ‘biological process’ terms. Therefore, this
may explain only the modest performance improvements achieved when training models
using FunFam and network embedding features. Second, S.pombe has been characterised
extensively, so has comprehensive network data compiled across a large number of sepa-
rate experiments. It could be that the quality of network data in S. pombe is very high and
this effect would also be observed in popular model organisms, but not in less well-studied
species.
The WD40 repeat-containing serine/threonine kinase FunFam that was hit by the
most S. pombe proteins is large and contains 113,743 sequences from UniProt. The inclu-
sion threshold—the HMM’s trusted cutoff value—for this FunFam is 6.00, which is very low
and suggests that either the FunFam multiple sequence alignment is poor, or this FunFam
is affected by a known bug in CATH v4.2 FunFams. Sequences from a FunFam multiple
alignment are scanned against the corresponding FunFam HMM and the lowest bit score
is used as the inclusion threshold. We recently discovered a problem that affects some
FunFams, whereby short sequences, or subsequences, matched the HMMs with commen-
surately small bit scores. As such, sequences may be assigned to FunFams erroneously.
Despite this, we can be reassured about the false positive rate because 95% of FunFams
are hit by no more than 21 proteins.
Network embeddings were the most predictive set of features for S. pombe protein func-
tion. We used deepNF to generate low-dimensional embeddings of proteins using infor-
mation about their context across multiple networks [66]. deepNF is a highly competitive
protein function prediction method. This is somewhat surprising, given that the method
only uses network data. Until recently, network data had a bad reputation for being noisy
and incomplete [492–495], but recent work suggests that networks are now more reliable
[66, 116, 117, 496–498]. For example, GOLabeler [499] (the Zhu Lab team) was the best method overall in CAFA 3 [74], but NetGO [498], a model that adds network data to GOLabeler, was found to improve performance further.
It has not escaped our attention that all of the work cited above uses STRING [234, 466] as its sole source of network data. The authors state that STRING "aims to collect, score
and integrate all publicly available sources of protein–protein interaction information, and
to complement these with computational predictions” [466]. It is conceivable that the
power of network data actually results from STRING’s coverage, high-quality curation and
accurate predictions, because it is known that integrating information from independent
sources improves predictions [500].
The growth phenotype data were acquired using a high-throughput plate colony size
assay. These data constitute a single screen that has associated biological and technical
error rates [501], as opposed to STRING [466], which is, in essence, a meta-analysis of all
known protein-protein interaction information, with far lower error rates.
It may appear surprising that, when combined with the network embeddings, the
growth phenotypes cause a reduction in performance, but this can happen for the follow-
ing reasons. Firstly, each tree in the forest is trained on random subsets of features from
the training data that are selected from 131 growth phenotypes and 221 network embed-
ding dimensions, so the growth phenotypes account for 37% of features. Given that we
know growth phenotypes are much less predictive than network embeddings, it is almost
surprising that the reduction in performance is not larger than 7%, relative to network
embeddings alone. Secondly, growing trees uses a greedy algorithm, so the solution found may not correspond to the global minimum of the error, but rather may be an artefact of the heuristics used in the algorithm.
Some interesting questions were raised during CAFA due to the evaluation strategy:
Which models should be developed? To what extent, during development, should model
choice be influenced by performance on benchmarks? How can overfitting on benchmarks
be avoided, whilst still performing well on the evaluation data set? It is vital to not overfit
models to benchmarks because the annotations in benchmarks are unlikely to be repre-
sentative of the annotations that will accumulate to form the evaluation data set. Instead,
models should be developed using general biological principles and our intuition [502]. In other words, if a model that performs well in benchmarking looks unlikely, it probably is. Here, we stuck to three biological principles: functions are conserved through evolution (FunFams), encoded in how proteins interact (network embeddings), and have phenotypic consequences (growth phenotypes).
Here, we only predicted functions for one of the 18 model organism proteomes in-
cluded in CAFA 4. The small number of proteins in the S. pombe proteome meant that
evaluating the performance of our models using a time-delay strategy was infeasible, so
instead we used cross-validation and precision-recall curves. We will have to wait until
the final results are published in October to understand how well our models
perform. We do not know whether performance is limited by the RF, our training strategy,
or inherent inaccuracies and noise in the features and GO term annotations. Either way,
CAFA 4 is likely to generate a large number of high-quality predicted functions for fission
yeast, which, if made public, could be hosted on PomBase. However, these predictions will
need to be validated by the community.
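For a single GO term treated as a binary label, the cross-validation strategy described above can be sketched as follows; `X` and `y` are synthetic stand-ins for the real feature matrix and annotations, and the out-of-fold probabilities from `cross_val_predict` feed directly into a precision-recall curve.

```python
"""Cross-validated precision-recall evaluation for one GO term: a sketch."""
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 221))              # stand-in feature matrix
y = (X[:, 0] + rng.normal(size=500)) > 0     # stand-in GO-term labels

# Out-of-fold probabilities: every protein is scored by a model that
# never saw it during training, giving an honest precision-recall curve.
proba = cross_val_predict(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, method="predict_proba",
)[:, 1]

precision, recall, _ = precision_recall_curve(y, proba)
print(f"average precision: {average_precision_score(y, proba):.3f}")
```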
Preliminary results from CAFA 4 suggest that FunFam-based predictors remain cutting edge, even amongst tough competition from advanced neural network-based predictors. We were delighted to be placed top for molecular function, as we believe Fun-
Fams capture molecular function information well. We were also encouraged by achieving
third place for biological process, as these terms are harder to predict using the type of in-
formation encoded by FunFams. For comparison, in CAFA 3, we were second for molecular
function and fourth for biological process.
5.6.5 Conclusion
Here, we trained machine learning models to predict functions of S. pombe proteins. We
obtained encouraging results from evaluating our models using cross-validation and also
entered our predictions into CAFA 4. However, to be confident about the quality of our
predictions, we need experimental validation by growing gene deletion mutant strains in
conditions that would elicit loss of function phenotypes. We applied our protein func-
tion prediction method to fission yeast as a proof of principle, but the method is species-
agnostic, providing that feature and target data are available for the species of interest. Despite this, our method is time-consuming to train and is restricted to information from one species at a time. Going forward, methods that are not restricted to a single species, such as deepFRI [503], which aggregates information from any species, may be preferable.
Aetiologies of protein function at the residue-, domain-, molecular-, cellular-, or
organism-level remain a partial mystery, but recent developments in the field have gone some way towards making functions predictable. We are grateful to
have been able to make a small contribution to the field and its development. In the future,
a greater emphasis will be placed on de-blackboxification of predictions and on uncovering
general principles that explain how proteins are bestowed with functions.
Chapter 6
The overarching theme of this thesis was the development and application of protein func-
tion prediction methods.
In this chapter, we identify commonalities between these research projects, draw general
conclusions from them, and sketch out future directions of research in these areas.
Network-based methods have several drawbacks. Firstly, they are restricted to organisms that have network data at all, let alone high-quality data from well characterised species, and are often constrained to a single organism. Secondly, network-based methods are not applicable to novel data,
such as the metagenomic protein sequences we encountered in Chapter 3. Finally, net-
work data can be noisy, as many databases infer edges between proteins from correlations
in gene expression, such as from RNA-Seq experiments. Physically-interacting proteins
in humans, mice and budding yeast only have a slightly higher correlation in their gene
expression than randomly selected pairs of proteins [504]. Despite this, network data has
been improving and will continue to do so as high-throughput experiments become more
reliable and interactions are confirmed by independent studies.
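The comparison reported in [504] can be sketched as follows, with synthetic stand-ins for the expression matrix and interaction list; on real data, the two distributions of correlations would be compared with a rank-based test rather than just their means.

```python
"""Co-expression of interacting versus random protein pairs: a sketch."""
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_conditions = 1000, 40
expr = rng.normal(size=(n_genes, n_conditions))   # stand-in expression matrix
edges = [tuple(rng.choice(n_genes, 2, replace=False)) for _ in range(500)]

def mean_correlation(pairs):
    """Mean Pearson correlation of expression profiles over gene pairs."""
    return float(np.mean([np.corrcoef(expr[i], expr[j])[0, 1] for i, j in pairs]))

random_pairs = [tuple(rng.choice(n_genes, 2, replace=False)) for _ in range(500)]
print(f"interacting pairs: {mean_correlation(edges):+.3f}")
print(f"random pairs:      {mean_correlation(random_pairs):+.3f}")
```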
6.2 Determine which types of data and models are most pre-
dictive of protein function
Protein function prediction performance is limited by the data used to train predictive
models. As the quality, volume and coverage of training data are increased, one would expect model performance to increase. At some point, this relationship may break down, as higher-order effects that are not present in the training data cannot be accounted for.
It is reasonable to assume, however, that we have not reached this point yet. Therefore,
improving the training data should in turn improve model performance.
The question then becomes: which types of data should we use to predict protein func-
tion? It will be extremely useful to understand which data are most predictive of protein
function, separately or in combination, to guide experimental and curational data collec-
tion efforts going forward. To some extent, the answer depends on what the question is.
On one hand, sequence data is ubiquitously available, so can be used to build general func-
tion prediction methods (Chapters 3 and 5). Network data, on the other hand, is powerful,
but is essentially limited to model organisms, as network data is nonexistent for novel and
neglected species (Chapters 2, 4 and 5). Targeted molecular biology and high-throughput
screens are possible for culturable species (Chapter 3), but limited to smaller organisms
with short life spans (Chapter 5).
In addition, it will be useful to understand which models, given the optimal training
data, are most predictive of protein function. Neural networks have shown great promise
in recent years and may prove to be the model of choice for protein function prediction.
However, whilst neural networks are flexible models, their application to biological se-
quences is not yet as flexible as HMMs, which have performed well in previous CAFA
challenges. Analysis of the best performing methods in the CAFA challenges will help to
shed some light on which types of data and which models are most predictive of function.
This will be especially true for CAFA 4, for which models had a great deal more training data available than in CAFA 3, and for which neural networks were a more popular choice of method.
We will generate FunFams using Gene3D hits from UniProt and MGnify. Doing so will help to improve
FunFams and protein functions predicted using them. This is an exciting new direction
for CATH, which will help the database and methods to remain competitive when faced
with an onslaught of competition from neural network-based methods.
We will use the findings from the analyses performed in the plastic hydrolase project
to improve FunFHMMer. The FunFHMMer algorithm was developed, tuned and bench-
marked using only three of the superfamilies in CATH [54]. FunFams were then generated for all superfamilies in CATH. Whilst FunFHMMer works well on the three superfamilies used to develop it, and produces high-quality FunFams for them,
we know that FunFHMMer does not generate such high-quality FunFams for other su-
perfamilies. For example, during our search for novel plastic hydrolases, our analysis of
the α/β hydrolase superfamily FunFams has demonstrated that FunFHMMer may be over-
splitting sequences into too many FunFams. Compared with CATH v4.2, the latest version,
v4.3, has many more FunFams for the α/β hydrolase superfamily. One reason for this may
be that v4.3 contains more sequences that are more diverse, so, in turn, these sequences will
segregate into more families, each with different SDPs and, therefore, functions. However,
we have recently noticed that FunFam alignments tend to have low sequence diversity,
as measured by the Neff score for the number of effective sequences in an alignment [18,
505]. In general, sequence diversity in alignments is good for structure prediction, but
not for function prediction. As we have recently begun a collaboration that uses FunFams
for structure prediction, we are exploring ways to merge FunFams to create ‘StructFams’
of more diverse sequences that are better for structure prediction. We hope that these
improvements will produce better FunFams for all superfamilies.
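For reference, one common formulation of such a diversity score is the exponential of the mean per-column Shannon entropy of the alignment. Implementations of Neff differ (some weight sequences or cluster at an identity threshold first), so the sketch below is illustrative rather than the exact score cited above.

```python
"""Number of effective sequences (Neff) of an alignment: a sketch."""
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (nats) of residue frequencies, ignoring gaps."""
    residues = [c for c in column if c not in "-."]
    if not residues:
        return 0.0
    counts, total = Counter(residues), len(residues)
    return -sum((k / total) * math.log(k / total) for k in counts.values())

def neff(alignment):
    """exp(mean column entropy) over an equal-length list of sequences."""
    entropies = [column_entropy(col) for col in zip(*alignment)]
    return math.exp(sum(entropies) / len(entropies))

aln = ["MKV-LL", "MKVALL", "MRVALI", "MKVGLL"]  # toy alignment
print(f"Neff = {neff(aln):.2f}")  # a low value indicates low diversity
```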
[1] Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., and Yuan, Y.
“Predicting function: From genes to genomes and back”. Journal of Molecular Biol-
ogy (1998).
[2] Friedberg, I. “Automated protein function prediction - The genomic challenge”.
Briefings in Bioinformatics 7.3 (2006), pp. 225–242.
[3] Watson, J. D., Laskowski, R. A., and Thornton, J. M. “Predicting protein function from sequence and structural data”. Current Opinion in Structural Biology (2005).
[4] Itakura, K., Hirose, T., Crea, R., Riggs, A. D., Heyneker, H. L., Bolivar, F., and Boyer,
H. W. “Expression in Escherichia coli of a chemically synthesized gene for the hor-
mone somatostatin”. Science (1977).
[5] Apic, G., Gough, J., and Teichmann, S. A. “Domain combinations in archaeal, eu-
bacterial and eukaryotic proteomes”. Journal of Molecular Biology (2001).
[6] Björklund, Å. K., Ekman, D., Light, S., Frey-Skött, J., and Elofsson, A. “Domain
rearrangements in protein evolution”. Journal of Molecular Biology (2005).
[7] Tian, W. and Skolnick, J. “How well is enzyme function conserved as a function of
pairwise sequence identity?” Journal of Molecular Biology (2003).
[8] Rost, B. “Enzyme function less conserved than anticipated”. Journal of Molecular
Biology (2002).
[9] Sander, C. and Schneider, R. “Database of homology-derived protein structures and
the structural meaning of sequence alignment”. Proteins: Structure, Function, and
Bioinformatics (1991).
[10] Fitch, W. M. “Homology a personal view”. Trends in Genetics (2000).
[11] Lee, D., Redfern, O., and Orengo, C. “Predicting protein function from sequence
and structure”. Nature Reviews Molecular Cell Biology 8.12 (2007), pp. 995–1005.
[12] Nehrt, N. L., Clark, W. T., Radivojac, P., and Hahn, M. W. “Testing the ortholog
conjecture with comparative functional genomic data from mammals”. PLoS Com-
putational Biology (2011).
[13] Loewenstein, Y., Raimondo, D., Redfern, O. C., Watson, J., Frishman, D., Linial,
M., Orengo, C., Thornton, J., and Tramontano, A. “Protein function annotation by
homology-based inference”. Genome Biology 10.2 (2009), p. 207.
[14] Altenhoff, A. M., Studer, R. A., Robinson-Rechavi, M., and Dessimoz, C. “Resolv-
ing the ortholog conjecture: Orthologs tend to be weakly, but significantly, more
similar in function than paralogs”. PLoS Computational Biology (2012).
[15] Studer, R. A. and Robinson-Rechavi, M. “How confident can we be that orthologs
are similar, but paralogs differ?” Trends in Genetics (2009).
[16] Stamboulian, M., Guerrero, R. F., Hahn, M. W., and Radivojac, P. “The ortholog
conjecture revisited: the value of orthologs and paralogs in function prediction”.
Bioinformatics (2020).
[17] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. “Basic local alignment search tool”. Journal of Molecular Biology 215.3 (1990), pp. 403–10.
[18] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”. Nucleic Acids Research 25.17 (1997), pp. 3389–402.
[19] Müller, A., MacCallum, R. M., and Sternberg, M. J. “Benchmarking PSI-BLAST in
genome annotation”. Journal of Molecular Biology (1999).
[20] Yu, L., Tanwar, D. K., Penha, E. D. S., Wolf, Y. I., Koonin, E. V., and Basu, M. K.
“Grammar of protein domain architectures.” Proceedings of the National Academy
of Sciences of the United States of America 116.9 (2019), pp. 3636–3645.
[21] Lees, J. G., Lee, D., Studer, R. A., Dawson, N. L., Sillitoe, I., Das, S., Yeats, C., Des-
sailly, B. H., Rentzsch, R., and Orengo, C. A. “Gene3D: Multi-domain annotations
for protein sequence and comparative genome analysis”. Nucleic Acids Research
42.D1 (2014), pp. D240–D245.
[22] Bashton, M. and Chothia, C. “The Generation of New Protein Functions by the
Combination of Domains”. Structure (2007).
[23] Dessailly, B. H., Redfern, O. C., Cuff, A., and Orengo, C. A. “Exploiting structural
classifications for function prediction: towards a domain grammar for protein func-
tion”. Current Opinion in Structural Biology 19.3 (2009), pp. 349–356.
[24] El-Gebali, S., Mistry, J., Bateman, A., Eddy, S. R., Luciani, A., Potter, S. C., Qureshi,
M., Richardson, L. J., Salazar, G. A., Smart, A., Sonnhammer, E. L., Hirsh, L., Paladin,
L., Piovesan, D., Tosatto, S. C., and Finn, R. D. “The Pfam protein families database
in 2019”. Nucleic Acids Research (2019).
[25] Grewal, J. K., Krzywinski, M., and Altman, N. “Markov models — hidden Markov
models”. Nature Methods (2019).
[26] Eddy, S. R. “What is a hidden Markov model?” Nature Biotechnology 22.10 (2004),
pp. 1315–1316.
[27] Mistry, J., Finn, R. D., Eddy, S. R., Bateman, A., and Punta, M. “Challenges in ho-
mology search: HMMER3 and convergent evolution of coiled-coil regions”. Nucleic
Acids Research 41.12 (2013), e121–e121.
[28] Steinegger, M., Meier, M., Mirdita, M., Vöhringer, H., Haunsberger, S. J., and Söding,
J. “HH-suite3 for fast remote homology detection and deep protein annotation”.
BMC Bioinformatics (2019).
[29] Söding, J. “Protein homology detection by HMM-HMM comparison”. Bioinformat-
ics (2005).
[30] Levinthal, C. “How to Fold Graciously”. University of Illinois Press (1969).
[31] Mizuguchi, K. and Go, N. “Comparison of spatial arrangements of secondary struc-
tural elements in proteins”. Protein Engineering (1995).
[32] Chothia, C. and Lesk, A. “The relation between the divergence of sequence and
structure in proteins.” The EMBO Journal (1986).
[33] Sillitoe, I., Dawson, N., Lewis, T. E., Das, S., Lees, J. G., Ashford, P., Tolulope, A.,
Scholes, H. M., Senatorov, I., Bujan, A., Ceballos Rodriguez-Conde, F., Dowling, B.,
Thornton, J., and Orengo, C. A. “CATH: Expanding the horizons of structure-based
functional annotations for genome sequences”. Nucleic Acids Research (2019).
[34] Andreeva, A., Kulesha, E., Gough, J., and Murzin, A. G. “The SCOP database in
2020: Expanded classification of representative family and superfamily domains of
known protein structures”. Nucleic Acids Research (2020).
[35] Grabowski, M., Joachimiak, A., Otwinowski, Z., and Minor, W. “Structural ge-
nomics: keeping up with expanding knowledge of the protein universe”. Current
Opinion in Structural Biology (2007).
[36] Glasner, M. E., Gerlt, J. A., and Babbitt, P. C. “Evolution of enzyme superfamilies”.
Current Opinion in Chemical Biology (2006).
[37] Hannenhalli, S. S. and Russell, R. B. “Analysis and prediction of functional sub-
types from protein sequence alignments”. Journal of Molecular Biology (2000).
[38] Orengo, C. A., Pearl, F. M., Bray, J. E., Todd, A. E., Martin, A. C., Lo Conte, L.,
and Thornton, J. M. “The CATH database provides insights into protein struc-
ture/function relationships”. Nucleic Acids Research (1999).
[39] Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B., and Thornton,
J. M. “CATH - A hierarchic classification of protein domain structures”. Structure
(1997).
[40] Taylor, W. R. and Orengo, C. A. “Protein structure alignment”. Journal of Molecular
Biology (1989).
[41] Needleman, S. B. and Wunsch, C. D. “A general method applicable to the search
for similarities in the amino acid sequence of two proteins”. Journal of Molecular
Biology 48.3 (1970), pp. 443–453.
[42] Orengo, C. A. and Taylor, W. R. “A local alignment method for protein structure
motifs”. Journal of Molecular Biology (1993).
[43] Smith, T. F. and Waterman, M. S. “Identification of common molecular subse-
quences”. Journal of Molecular Biology (1981).
[44] Orengo, C. A. and Taylor, W. R. “SSAP: Sequential structure alignment program for
protein structure comparison”. Methods in enzymology. Vol. 266. 1996, pp. 617–635.
[45] Taylor, W. R. and Orengo, C. A. “A holistic approach to protein structure align-
ment”. Protein Engineering, Design and Selection (1989).
[46] Orengo, C. A., Flores, T. P., Taylor, W. R., and Thornton, J. M. “Identification and
classification of protein fold families”. Protein Engineering, Design and Selection
(1993).
[47] Redfern, O. C., Harrison, A., Dallman, T., Pearl, F. M., and Orengo, C. A. “CATHE-
DRAL: A fast and effective algorithm to predict folds and domain boundaries from
multidomain protein structures”. PLoS Computational Biology (2007).
[48] Lees, J., Yeats, C., Perkins, J., Sillitoe, I., Rentzsch, R., Dessailly, B. H., and Orengo, C.
“Gene3D: A domain-based resource for comparative genomics, functional annota-
tion and protein network analysis”. Nucleic Acids Research 40.D1 (2012), pp. D465–
D471.
[49] Lewis, T. E., Sillitoe, I., Dawson, N., Lam, S. D., Clarke, T., Lee, D., Orengo, C., and
Lees, J. “Gene3D: Extensive prediction of globular domains in proteins”. Nucleic
Acids Research 46.D1 (2018), pp. D435–D439.
[50] Pandurangan, A. P., Stahlhacke, J., Oates, M. E., Smithers, B., and Gough, J. “The
SUPERFAMILY 2.0 database: A significant proteome update and a new webserver”.
Nucleic Acids Research (2019).
[51] Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. “CD-HIT: Accelerated for clustering the
next-generation sequencing data”. Bioinformatics (2012).
[52] Lewis, T. E., Sillitoe, I., and Lees, J. G. “Cath-resolve-hits: A new tool that resolves
domain matches suspiciously quickly”. Bioinformatics (2019).
[53] Lee, D. A., Rentzsch, R., and Orengo, C. “GeMMA: functional subfamily classifica-
tion within superfamilies of predicted protein structural domains”. Nucleic Acids
Research 38.3 (2010), pp. 720–737.
[54] Das, S., Lee, D., Sillitoe, I., Dawson, N. L., Lees, J. G., and Orengo, C. A. “Func-
tional classification of CATH superfamilies: a domain-based approach for protein
function annotation”. Bioinformatics 31.21 (2015), pp. 3460–7.
[55] Valdar, W. S. “Scoring residue conservation”. Proteins: Structure, Function and Ge-
netics (2002).
[56] Capra, J. A. and Singh, M. “Characterization and prediction of residues determining
protein functional specificity”. Bioinformatics (2008).
[57] Marks, D. S., Colwell, L. J., Sheridan, R., Hopf, T. A., Pagnani, A., Zecchina, R., and
Sander, C. “Protein 3D structure computed from evolutionary sequence variation”.
PLoS ONE (2011).
[58] Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek,
A., Nelson, A. W., Bridgland, A., Penedones, H., Petersen, S., Simonyan, K., Crossan,
S., Kohli, P., Jones, D. T., Silver, D., Kavukcuoglu, K., and Hassabis, D. “Improved
protein structure prediction using potentials from deep learning”. Nature (2020).
[59] Bairoch, A. “The ENZYME database in 2000”. Nucleic Acids Research (2000).
[60] Carbon, S., Douglass, E., Dunn, N., Good, B., Harris, N. L., Lewis, S. E., Mungall,
C. J., Basu, S., Chisholm, R. L., Dodson, R. J., Hartline, E., Fey, P., Thomas, P. D.,
Albou, L. P., Ebert, D., Kesling, M. J., Mi, H., Muruganujan, A., Huang, X., Poudel,
S., et al. “The Gene Ontology Resource: 20 years and still GOing strong”. Nucleic
Acids Research (2018).
[61] Pagel, P., Kovac, S., Oesterheld, M., Brauner, B., Dunger-Kaltenbach, I., Frishman,
G., Montrone, C., Mark, P., Stumpflen, V., Mewes, H.-W., Ruepp, A., and Frishman,
D. “The MIPS mammalian protein-protein interaction database”. Bioinformatics
21.6 (2005), pp. 832–834.
[62] Ruepp, A., Zollner, A., Maier, D., Albermann, K., Hani, J., Mokrejs, M., Tetko, I.,
Güldener, U., Mannhaupt, G., Münsterkötter, M., and Mewes, H. W. “The FunCat, a
functional annotation scheme for systematic classification of proteins from whole
genomes”. Nucleic Acids Research (2004).
[63] Köhler, S., Carmody, L., Vasilevsky, N., Jacobsen, J. O., Danis, D., Gourdine, J. P.,
Gargano, M., Harris, N. L., Matentzoglu, N., McMurry, J. A., Osumi-Sutherland, D.,
Cipriani, V., Balhoff, J. P., Conlin, T., Blau, H., Baynam, G., Palmer, R., Gratian, D.,
Dawkins, H., Segal, M., et al. “Expansion of the Human Phenotype Ontology (HPO)
knowledge base and resources”. Nucleic Acids Research (2019).
[64] Hatos, A., Hajdu-Soltész, B., Monzon, A. M., Palopoli, N., Álvarez, L., Aykac-Fas, B.,
Bassot, C., Benítez, G. I., Bevilacqua, M., Chasapi, A., Chemes, L., Davey, N. E., Davi-
dović, R., Dunker, A. K., Elofsson, A., Gobeill, J., Foutel, N. S., Sudha, G., Guharoy,
M., Horvath, T., et al. “DisProt: Intrinsic protein disorder annotation in 2020”. Nu-
cleic Acids Research (2020).
[65] Cho, H., Berger, B., and Peng, J. “Compact Integration of Multi-Network Topology
for Functional Analysis of Genes”. Cell Systems (2016).
[66] Gligorijević, V., Barot, M., and Bonneau, R. “deepNF: deep network fusion for pro-
tein function prediction”. Bioinformatics 34.22 (2018). Ed. by Wren, J., pp. 3873–
3881.
[67] Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K., and Moult, J. “Critical assess-
ment of methods of protein structure prediction (CASP)—Round XIII”. Proteins:
Structure, Function, and Bioinformatics 87.12 (2019), pp. 1011–1020.
[68] Haas, J., Barbato, A., Behringer, D., Studer, G., Roth, S., Bertoni, M., Mostaguir,
K., Gumienny, R., and Schwede, T. “Continuous Automated Model EvaluatiOn
(CAMEO) complementing the critical assessment of structure prediction in
CASP12”. Proteins: Structure, Function and Bioinformatics (2018).
[69] Janin, J. “Welcome to CAPRI: A Critical Assessment of PRedicted Interactions”. Proteins: Structure, Function, and Genetics (2002).
[70] Janin, J. “Assessing predictions of protein-protein interaction: The CAPRI experi-
ment”. Protein Science (2005).
[71] Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M.,
Allison, K. R., Aderhold, A., Allison, K. R., Bonneau, R., Camacho, D. M., Chen, Y.,
Collins, J. J., Cordero, F., Costello, J. C., Crane, M., Dondelinger, F., Drton, M., Es-
posito, R., Foygel, R., et al. “Wisdom of crowds for robust gene network inference”.
Nature Methods 9.8 (2012), pp. 796–804.
[72] Radivojac, P., Clark, W. T., Oron, T. R., Schnoes, A. M., Wittkop, T., Sokolov, A.,
Graim, K., Funk, C., Verspoor, K., Ben-Hur, A., Pandey, G., Yunes, J. M., Talwalkar,
A. S., Repo, S., Souza, M. L., Piovesan, D., Casadio, R., Wang, Z., Cheng, J., Fang,
H., et al. “A large-scale evaluation of computational protein function prediction”.
Nature Methods 10.3 (2013), pp. 221–227.
[73] Jiang, Y., Oron, T. R., Clark, W. T., Bankapur, A. R., D’Andrea, D., Lepore, R., Funk,
C. S., Kahanda, I., Verspoor, K. M., Ben-Hur, A., Koo, D. C. E., Penfold-Brown, D.,
Shasha, D., Youngs, N., Bonneau, R., Lin, A., Sahraeian, S. M. E., Martelli, P. L., Prof-
iti, G., Casadio, R., et al. “An expanded evaluation of protein function prediction
methods shows an improvement in accuracy”. Genome Biology 17.1 (2016), p. 184.
[74] Zhou, N., Jiang, Y., Bergquist, T. R., Lee, A. J., Kacsoh, B. Z., Crocker, A. W., Lewis,
K. A., Georghiou, G., Nguyen, H. N., Hamid, M. N., Davis, L., Dogan, T., Atalay, V.,
Rifaioglu, A. S., Dalklran, A., Cetin Atalay, R., Zhang, C., Hurto, R. L., Freddolino,
P. L., Zhang, Y., et al. “The CAFA challenge reports improved protein function pre-
diction and new functional annotations for hundreds of genes through experimen-
tal screens”. Genome Biology 20.1 (2019), p. 244.
[75] Hamp, T., Kassner, R., Seemayer, S., Vicedo, E., Schaefer, C., Achten, D., Auer, F.,
Boehm, A., Braun, T., Hecht, M., Heron, M., Hönigschmid, P., Hopf, T. A., Kauf-
mann, S., Kiening, M., Krompass, D., Landerer, C., Mahlich, Y., Roos, M., and Rost,
B. “Homology-based inference sets the bar high for protein function prediction”.
BMC Bioinformatics (2013).
[76] McCulloch, W. S. and Pitts, W. “A logical calculus of the ideas immanent in nervous
activity”. The Bulletin of Mathematical Biophysics 5.4 (1943), pp. 115–133.
[77] Schmidhuber, J. “Deep learning in neural networks: An overview”. Neural Networks
61 (2015), pp. 85–117.
[78] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
[79] Lecun, Y., Bengio, Y., and Hinton, G. “Deep learning”. Nature 521.7553 (2015),
pp. 436–444.
[80] Angermueller, C., Pärnamaa, T., Parts, L., and Stegle, O. “Deep learning for com-
putational biology”. Molecular Systems Biology 12.7 (2016), p. 878.
[81] Min, S., Lee, B., and Yoon, S. “Deep learning in bioinformatics”. Briefings in Bioin-
formatics March (2016), bbw068.
[82] Baldi, P. “Deep Learning in Biomedical Data Science”. Annual Review of Biomedical
Data Science 1.1 (2018), pp. 181–205.
[83] Wainberg, M., Merico, D., Delong, A., and Frey, B. J. “Deep learning in biomedicine”.
Nature Biotechnology (2018).
[84] Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way,
G. P., Ferrero, E., Agapow, P. M., Zietz, M., Hoffman, M. M., Xie, W., Rosen, G. L.,
Lengerich, B. J., Israeli, J., Lanchantin, J., Woloszynek, S., Carpenter, A. E., Shriku-
mar, A., Xu, J., Cofer, E. M., et al. “Opportunities and obstacles for deep learning in
biology and medicine”. Journal of the Royal Society Interface 15.141 (2018).
[85] Bostrom, N. Superintelligence. 2017.
[86] Tegmark, M. Life 3.0. 2017.
[87] Russell, S. Human Compatible. 2019.
[88] Lin, H. W., Tegmark, M., and Rolnick, D. “Why Does Deep and Cheap Learning
Work So Well?” Journal of Statistical Physics 168.6 (2017), pp. 1223–1247.
[89] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. “Learning representations by
back-propagating errors”. Nature 323.6088 (1986), pp. 533–536.
[90] Kingma, D. P. and Ba, J. “Adam: A Method for Stochastic Optimization” (2014).
[91] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.
“Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. Journal of Machine Learning Research 15 (2014), pp. 1929–1958.
[92] Hamilton, W. L., Ying, R., and Leskovec, J. “Representation Learning on Graphs:
Methods and Applications” (2017).
[93] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and
Jackel, L. D. “Backpropagation Applied to Handwritten Zip Code Recognition”.
Neural Computation 1.4 (1989), pp. 541–551.
[94] Krizhevsky, A., Sutskever, I., and Hinton, G. E. “ImageNet Classification with Deep Convolutional Neural Networks”. NIPS’12 Proceedings of the 25th International Conference 1 (2012), pp. 1–9.
[95] Bai, S., Kolter, J. Z., and Koltun, V. “An Empirical Evaluation of Generic Convolu-
tional and Recurrent Networks for Sequence Modeling” (2018).
[96] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and
Bengio, Y. “Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention” (2015).
[97] Elbayad, M., Besacier, L., and Verbeek, J. “Pervasive Attention: 2D Convolutional
Neural Networks for Sequence-to-Sequence Prediction” (2018).
[98] Bengio, Y., Simard, P., and Frasconi, P. “Learning long-term dependencies with gra-
dient descent is difficult”. IEEE Transactions on Neural Networks 5.2 (1994), pp. 157–
166.
[99] Hochreiter, S. and Schmidhuber, J. “Long Short-Term Memory”. Neural Computa-
tion 9.8 (1997), pp. 1735–1780.
[100] Hyafil, L. and Rivest, R. “Constructing Optimal Binary Decision Trees is NP-Complete”. Information Processing Letters (1976).
[101] Probst, P., Wright, M., and Boulesteix, A.-L. “Hyperparameters and Tuning Strate-
gies for Random Forest” (2018).
[102] Cristianini, N. and Shawe-Taylor, J. Kernel Methods for Pattern Analysis. 2004.
[103] Cai, H., Zheng, V. W., and Chang, K. C. C. “A Comprehensive Survey of Graph Em-
bedding: Problems, Techniques, and Applications”. IEEE Transactions on Knowledge
and Data Engineering 30.9 (2018), pp. 1616–1637.
[104] Grover, A. and Leskovec, J. “node2vec: Scalable Feature Learning for Networks”
(2016).
[105] Perozzi, B., Al-Rfou, R., and Skiena, S. “DeepWalk: Online Learning of Social Rep-
resentations” (2014), pp. 701–710.
[106] Cao, S., Lu, W., and Xu, Q. “Deep neural networks for learning graph representa-
tions”. 30th AAAI Conference on Artificial Intelligence, AAAI 2016. 2016.
[107] Wang, D., Cui, P., and Zhu, W. “Structural deep network embedding”. Proceedings
of the ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. 2016.
[108] Franz, M., Rodriguez, H., Lopes, C., Zuberi, K., Montojo, J., Bader, G. D., and Morris,
Q. “GeneMANIA update 2018”. Nucleic Acids Research (2018).
[109] Dutkowski, J., Kramer, M., Surma, M. A., Balakrishnan, R., Cherry, J. M., Krogan, N. J., and Ideker, T. “A gene ontology inferred from molecular networks”. Nature Biotechnology 31.1 (2013), pp. 38–45.
[110] Yu, M. K., Kramer, M., Dutkowski, J., Srivas, R., Licon, K., Kreisberg, J. F., Ng, C. T.,
Krogan, N., Sharan, R., and Ideker, T. “Translation of genotype to phenotype by a
hierarchy of cell subsystems”. Cell Systems (2016).
[111] Jerby-Arnon, L., Pfetzer, N., Waldman, Y. Y., McGarry, L., James, D., Shanks, E.,
Seashore-Ludlow, B., Weinstock, A., Geiger, T., Clemons, P. A., Gottlieb, E., and
Ruppin, E. “Predicting cancer-specific vulnerability via data-driven detection of
synthetic lethality”. Cell (2014).
[112] Zitnik, M. and Leskovec, J. “Predicting multicellular function through multi-layer
tissue networks”. Bioinformatics. 2017.
[113] Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. “LINE: Large-scale in-
formation network embedding”. WWW 2015 - Proceedings of the 24th International
Conference on World Wide Web. 2015.
[114] Chen, H., Hu, Y., Perozzi, B., and Skiena, S. “HARP: Hierarchical representation
learning for networks”. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018.
2018.
[115] Fouss, F., Francoisse, K., Yen, L., Pirotte, A., and Saerens, M. “An experimental in-
vestigation of kernels on graphs for collaborative recommendation and semisuper-
vised classification”. Neural Networks 31 (2012), pp. 53–72.
[116] Heriche, J.-K., Lees, J. G., Morilla, I., Walter, T., Petrova, B., Roberti, M. J., Hossain,
M. J., Adler, P., Fernandez, J. M., Krallinger, M., Haering, C. H., Vilo, J., Valencia, A.,
Ranea, J. A., Orengo, C., and Ellenberg, J. “Integration of biological data by kernels
on graph nodes allows prediction of new genes involved in mitotic chromosome
condensation”. Molecular Biology of the Cell 25.16 (2014), pp. 2522–2536.
[117] Lehtinen, S., Lees, J., Bähler, J., Shawe-Taylor, J., and Orengo, C. “Gene function
prediction from functional association networks using kernel partial least squares
regression”. PLoS ONE 10.8 (2015), pp. 1–14.
[118] Sigoillot, F. D. and King, R. W. “Vigilance and validation: Keys to success in RNAi
screening”. ACS Chemical Biology 6.1 (2011), pp. 47–60.
[119] Mering, C. von, Jensen, L. J., Snel, B., Hooper, S. D., Krupp, M., Foglierini, M., Jouffre,
N., Huynen, M. A., and Bork, P. “STRING: known and predicted protein-protein
associations, integrated and transferred across organisms”. Nucleic Acids Research
33.Database issue (2005), pp. D433–7.
[120] Mostafavi, S., Ray, D., Warde-Farley, D., Grouios, C., and Morris, Q. “GeneMANIA:
A real-time multiple association network integration algorithm for predicting gene
function”. Genome Biology 9.SUPPL. 1 (2008), S4.
[121] Saerens, M., Fouss, F., Yen, L., and Dupont, P. “The principal components analysis
of a graph, and its relationships to spectral clustering”. Lecture Notes in Artificial
Intelligence (Subseries of Lecture Notes in Computer Science). 2004.
[122] Bach, F. R. and Jordan, M. I. “Learning spectral clustering”. Advances in Neural
Information Processing Systems. 2004.
[123] Das, S., Scholes, H. M., Sen, N., and Orengo, C. “CATH functional families predict
functional sites in proteins”. Bioinformatics (2020). Ed. by Elofsson, A.
[124] Zhang, D. and Kabuka, M. “Protein Family Classification from Scratch: A CNN
based Deep Learning Approach”. IEEE/ACM Transactions on Computational Biology
and Bioinformatics (2020).
[125] Seo, S., Oh, M., Park, Y., and Kim, S. “DeepFam: deep learning based alignment-free
method for protein family modeling and prediction”. Bioinformatics 34.13 (2018),
pp. i254–i262.
[126] Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. “Predicting the sequence
specificities of DNA- and RNA-binding proteins by deep learning”. Nature Biotech-
nology 33.8 (2015), pp. 831–838.
[127] Kelley, D. R., Snoek, J., and Rinn, J. L. “Basset: learning the regulatory code of the ac-
cessible genome with deep convolutional neural networks”. Genome Research 26.7
(2016), pp. 990–9.
[128] Quang, D. and Xie, X. “DanQ: a hybrid convolutional and recurrent deep neural
network for quantifying the function of DNA sequences”. Nucleic Acids Research
44.11 (2016), e107–e107.
[129] Zhou, J. and Troyanskaya, O. G. “Predicting effects of noncoding variants with
deep learning-based sequence model”. Nature Methods 12.10 (2015), pp. 931–934.
[130] Angermueller, C., Lee, H. J., Reik, W., and Stegle, O. “DeepCpG: Accurate prediction
of single-cell DNA methylation states using deep learning”. Genome Biology (2017).
[131] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. “Distributed Representations of Words and Phrases and their Compositionality” (2013).
[132] Le, Q. V. and Mikolov, T. “Distributed Representations of Sentences and Docu-
ments”. 31st International Conference on Machine Learning, ICML 2014 4 (2014),
pp. 2931–2939.
[133] Ren, J., Bai, X., Lu, Y. Y., Tang, K., Wang, Y., Reinert, G., and Sun, F. “Alignment-Free
Sequence Analysis and Applications”. Annual Review of Biomedical Data Science 8.1
(2018).
[134] Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B., Bergman, N. H., Koren,
S., and Phillippy, A. M. “Mash: Fast genome and metagenome distance estimation
using MinHash”. Genome Biology 17.1 (2016), p. 132.
[135] Buhler, J. “Efficient large-scale sequence comparison by locality-sensitive hashing”.
Bioinformatics 17.5 (2001), pp. 419–428.
[136] Luo, Y., Yu, Y. W., Zeng, J., Berger, B., and Peng, J. “Metagenomic binning through
low-density hashing”. Bioinformatics (2019).
[137] Steinegger, M. and Söding, J. “Clustering huge protein sequence sets in linear time”.
Nature Communications 9.1 (2018), p. 2542.
[138] Salakhutdinov, R. and Hinton, G. “Semantic hashing”. International Journal of Ap-
proximate Reasoning (2009).
[155] Keshava Prasad, T. S., Goel, R., Kandasamy, K., Keerthikumar, S., Kumar, S., Mathivanan, S., Telikicherla, D., Raju, R., Shafreen, B., Venugopal, A., Balakrishnan, L., Marimuthu, A., Banerjee, S., Somanathan, D. S., Sebastian, A., Rani, S., Ray, S., Harrys Kishore, C. J., Kanth, S., Ahmed, M., et al. “Human Protein Reference Database - 2009 update”. Nucleic Acids Research (2009).
[156] You, R., Huang, X., and Zhu, S. “DeepText2GO: Improving large-scale protein func-
tion prediction with deep semantic text representation”. Methods (2018).
[157] Fa, R., Cozzetto, D., Wan, C., and Jones, D. T. “Predicting human protein function
with multitask deep neural networks”. PLoS ONE (2018).
[158] Kulmanov, M., Khan, M. A., and Hoehndorf, R. “DeepGO: Predicting protein func-
tions from sequence and interactions using a deep ontology-aware classifier”.
Bioinformatics (2018).
[159] Alshahrani, M., Khan, M. A., Maddouri, O., Kinjo, A. R., Queralt-Rosinach, N., and
Hoehndorf, R. “Neuro-symbolic representation learning on biological knowledge
graphs”. Bioinformatics 33.17 (2017), pp. 2723–2730.
[160] Kulmanov, M., Hoehndorf, R., and Cowen, L. “DeepGOPlus: Improved protein func-
tion prediction from sequence”. Bioinformatics (2020).
[161] Cao, R., Freitas, C., Chan, L., Sun, M., Jiang, H., and Chen, Z. “ProLanGO: Protein
function prediction using neural machine translation based on a recurrent neural
network”. Molecules (2017).
[162] Sinai, S., Kelsic, E., Church, G. M., and Nowak, M. A. “Variational auto-encoding of
protein sequences” (2018).
[163] Scholes, H. M., Cryar, A., Kerr, F., Sutherland, D., Gethings, L. A., Vissers, J. P. C.,
Lees, J. G., Orengo, C. A., Partridge, L., and Thalassinos, K. “Dynamic changes in the
brain protein interaction network correlates with progression of Aβ42 pathology
in Drosophila”. Scientific Reports 10.1 (2020), p. 18517.
[164] Lane, C. A., Hardy, J., and Schott, J. M. “Alzheimer’s disease”. European Journal of
Neurology 25.1 (2018), pp. 59–70.
[165] Alzheimer, A. “Über eine eigenartige Erkrankung der Hirnrinde”. Allgemeine
Zeitschrift für Psychiatrie und psychisch-gerichtliche Medizin 64 (1907), pp. 146–148.
[166] Glenner, G. G. and Wong, C. W. “Alzheimer’s disease: initial report of the purifica-
tion and characterization of a novel cerebrovascular amyloid protein. 1984.” Bio-
chemical and Biophysical Research Communications 425.3 (2012), pp. 534–539.
[167] Grundke-Iqbal, I., Iqbal, K., Tung, Y. C., Quinlan, M., Wisniewski, H. M., and
Binder, L. I. “Abnormal phosphorylation of the microtubule-associated protein tau
in Alzheimer cytoskeletal pathology.” Proceedings of the National Academy of Sci-
ences of the United States of America 83.13 (1986), pp. 4913–7.
[168] Goedert, M., Wischik, C. M., Crowther, R. A., Walker, J. E., and Klug, A. “Cloning
and sequencing of the cDNA encoding a core protein of the paired helical filament
of Alzheimer disease: identification as the microtubule-associated protein tau.” Pro-
ceedings of the National Academy of Sciences of the United States of America 85.11
(1988), pp. 4051–4055.
[169] Cai, H., Cong, W.-n., Ji, S., Rothman, S., Maudsley, S., and Martin, B. “Metabolic Dys-
function in Alzheimers Disease and Related Neurodegenerative Disorders”. Current
Alzheimer Research 9.1 (2012), pp. 5–17.
[170] Szutowicz, A., Bielarczyk, H., Jankowska-Kulawy, A., Pawełczyk, T., and
Ronowska, A. “Acetyl-CoA the key factor for survival or death of cholinergic
neurons in course of neurodegenerative diseases”. Neurochemical Research 38.8
(2013), pp. 1523–42.
[171] Suberbielle, E., Sanchez, P. E., Kravitz, A. V., Wang, X., Ho, K., Eilertson, K., Devidze,
N., Kreitzer, A. C., and Mucke, L. “Physiologic brain activity causes DNA double-strand breaks in neurons, with exacerbation by amyloid-β”. Nature Neuroscience 16.5 (2013), pp. 613–621.
[186] Mullan, M., Crawford, F., Axelman, K., Houlden, H., Lilius, L., Winblad, B., and
Lannfelt, L. “A pathogenic mutation for probable Alzheimer’s disease in the APP
gene at the N–terminus of β–amyloid”. Nature Genetics 1.5 (1992), pp. 345–347.
[187] Nilsberth, C., Westlind-Danielsson, A., Eckman, C. B., Condron, M. M., Axelman,
K., Forsell, C., Stenh, C., Luthman, J., Teplow, D. B., Younkin, S. G., Näslund, J.,
and Lannfelt, L. “The ‘Arctic’ APP mutation (E693G) causes Alzheimer’s disease
by enhanced Aβ protofibril formation”. Nature Neuroscience 4.9 (2001), pp. 887–
893.
[188] Serpell, L. C. “Alzheimer’s amyloid fibrils: structure and assembly”. Biochimica et
Biophysica Acta (BBA) - Molecular Basis of Disease 1502.1 (2000), pp. 16–30.
[189] Ahmed, M., Davis, J., Aucoin, D., Sato, T., Ahuja, S., Aimoto, S., Elliott, J. I., Van Nos-
trand, W. E., and Smith, S. O. “Structural conversion of neurotoxic amyloid-β1–42
oligomers to fibrils”. Nature Structural & Molecular Biology 17.5 (2010), pp. 561–
567.
[190] Miller, D. L., Papayannopoulos, I. A., Styles, J., Bobin, S. A., Lin, Y. Y., Biemann, K.,
and Iqbal, K. “Peptide compositions of the cerebrovascular and senile plaque core
amyloid deposits of Alzheimer’s disease”. Archives of Biochemistry and Biophysics
301.1 (1993), pp. 41–52.
[191] Drummond, E., Nayak, S., Faustin, A., Pires, G., Hickman, R. A., Askenazi, M., Co-
hen, M., Haldiman, T., Kim, C., Han, X., Shao, Y., Safar, J. G., Ueberheide, B., and
Wisniewski, T. “Proteomic differences in amyloid plaques in rapidly progressive
and sporadic Alzheimer’s disease”. Acta Neuropathologica (2017).
[192] Younkin, S. G. “The role of Aβ42 in Alzheimer’s disease”. Journal of Physiology
Paris 92.3-4 (1998), pp. 289–292.
[193] Hatami, A., Monjazeb, S., Milton, S., and Glabe, C. G. “Familial Alzheimer’s Dis-
ease Mutations within the Amyloid Precursor Protein Alter the Aggregation and
Conformation of the Amyloid-β Peptide”. The Journal of Biological Chemistry 292.8
(2017), pp. 3172–3185.
[194] Murakami, K., Irie, K., Morimoto, A., Ohigashi, H., Shindo, M., Nagao, M., Shimizu,
T., and Shirasawa, T. “Synthesis, aggregation, neurotoxicity, and secondary struc-
ture of various Aβ1–42 mutants of familial Alzheimer’s disease at positions 21–23”.
Biochemical and Biophysical Research Communications 294.1 (2002), pp. 5–10.
[195] Masters, C. L., Multhaup, G., Simms, G., Pottgiesser, J., Martins, R. N., and
Beyreuther, K. “Neuronal origin of a cerebral amyloid: neurofibrillary tangles
of Alzheimer’s disease contain the same protein as the amyloid of plaque cores
and blood vessels”. The EMBO Journal 4.11 (1985), pp. 2757–2763.
[196] Gouras, G. K., Tsai, J., Näslund, J., Vincent, B., Edgar, M., Checler, F., Greenfield, J. P., Haroutunian, V., Buxbaum, J. D., Xu, H., Greengard, P., and Relkin, N. R. “Intraneuronal Aβ42 Accumulation in Human Brain”. American Journal of Pathology 156.1 (2000), pp. 15–20.
[197] Weingarten, M. D., Lockwood, A. H., Hwo, S. Y., and Kirschner, M. W. “A protein
factor essential for microtubule assembly.” Proceedings of the National Academy of
Sciences of the United States of America 72.5 (1975), pp. 1858–62.
[198] Iqbal, K., Liu, F., Gong, C.-X., and Grundke-Iqbal, I. “Tau in Alzheimer disease and
related tauopathies”. Current Alzheimer Research 7.8 (2010), pp. 656–64.
[199] Alonso, A. d. C., Li, B., Grundke-Iqbal, I., and Iqbal, K. “Polymerization of hyper-
phosphorylated tau into filaments eliminates its inhibitory activity”. Proceedings of
the National Academy of Sciences 103.23 (2006), pp. 8864–8869.
[200] Alonso, A. C., Zaidi, T., Grundke-Iqbal, I., and Iqbal, K. “Role of abnormally phos-
phorylated tau in the breakdown of microtubules in Alzheimer disease.” Proceed-
ings of the National Academy of Sciences of the United States of America 91.12 (1994),
pp. 5562–6.
[201] Moya-Alvarado, G., Gershoni-Emek, N., Perlson, E., and Bronfman, F. C. “Neurode-
generation and Alzheimer’s disease (AD). What can proteomics tell us about the
Alzheimer’s brain?” Molecular and Cellular Proteomics 15.2 (2016), pp. 409–425.
[202] Lynn, B. C., Wang, J., Markesbery, W. R., and Lovell, M. A. “Quantitative changes
in the mitochondrial proteome from subjects with mild cognitive impairment,
early stage, and late stage Alzheimer’s disease”. Journal of Alzheimer’s Disease 19.1
(2010), pp. 325–339.
[203] Butterfield, D. A., Di Domenico, F., Swomley, A. M., Head, E., and Perluigi, M. “Re-
dox proteomics analysis to decipher the neurobiology of Alzheimer-like neurode-
generation: Overlaps in Down’s syndrome and Alzheimer’s disease brain”. Bio-
chemical Journal 463.2 (2014), pp. 177–189.
[204] Aluise, C. D., Robinson, R. A., Cai, J., Pierce, W. M., Markesbery, W. R., and But-
terfield, D. A. “Redox proteomics analysis of brains from subjects with amnes-
tic mild cognitive impairment compared to brains from subjects with preclinical
alzheimer’s disease: Insights into memory loss in MCI”. Journal of Alzheimer’s Dis-
ease 23.2 (2011), pp. 257–269.
[205] Dammer, E. B., Lee, A. K., Duong, D. M., Gearing, M., Lah, J. J., Levey, A. I., and
Seyfried, N. T. “Quantitative phosphoproteomics of Alzheimer’s disease reveals
cross-talk between kinases and small heat shock proteins”. Proteomics 15.2-3 (2015),
pp. 508–519.
[206] Sultana, R., Robinson, R. A., Di Domenico, F., Abdul, H. M., St. Clair, D. K., Markes-
bery, W. R., Cai, J., Pierce, W. M., and Butterfield, D. A. “Proteomic identification
of specifically carbonylated brain proteins in APP NLh/APP NLh×PS-1 P264L/PS-1
P264L human double mutant knock-in mice model of Alzheimer disease as a func-
tion of age”. Journal of Proteomics 74.11 (2011), pp. 2430–2440.
[207] Sofola, O., Kerr, F., Rogers, I., Killick, R., Augustin, H., Gandy, C., Allen, M. J., Hardy,
J., Lovestone, S., and Partridge, L. “Inhibition of GSK-3 Ameliorates Aβ Pathology
in an Adult-Onset Drosophila Model of Alzheimer’s Disease”. PLoS Genetics 6.9
(2010). Ed. by Lu, B., e1001087.
[208] Keifer, D. Z., Motwani, T., Teschke, C. M., and Jarrold, M. F. “Measurement of the
accurate mass of a 50 MDa infectious virus”. Rapid Communications in Mass Spectrometry 30.17 (2016), pp. 1957–1962.
[209] Wasinger, V. C., Cordwell, S. J., Poljak, A., Yan, J. X., Gooley, A. A., Wilkins, M. R.,
Duncan, M. W., Harris, R., Williams, K. L., and Humphery-Smith, I. “Progress with
gene-product mapping of the Mollicutes: Mycoplasma genitalium”. Electrophoresis
16.1 (1995), pp. 1090–1094.
[210] Aebersold, R. and Mann, M. “Mass spectrometry-based proteomics”. Nature (2003).
[211] Tanaka, K., Waki, H., Ido, Y., Akita, S., Yoshida, Y., Yoshida, T., and Matsuo, T. “Pro-
tein and polymer analyses up to m/z 100 000 by laser ionization time-of-flight mass
spectrometry”. Rapid Communications in Mass Spectrometry 2.8 (1988), pp. 151–153.
[212] Karas, M., Bachmann, D., and Hillenkamp, F. “Influence of the Wavelength in High-
Irradiance Ultraviolet Laser Desorption Mass Spectrometry of Organic Molecules”.
Analytical Chemistry 57.14 (1985), pp. 2935–2939.
[213] Whitehouse, C. M., Dreyer, R. N., Yamashita, M., and Fenn, J. B. “Electrospray In-
terface for Liquid Chromatographs and Mass Spectrometers”. Analytical Chemistry
57.3 (1985), pp. 675–679.
[214] Gabelica, V. and Marklund, E. “Fundamentals of ion mobility spectrometry”. Cur-
rent Opinion in Chemical Biology 42 (2018), pp. 51–59.
[215] May, J. C., Morris, C. B., and McLean, J. A. “Ion mobility collision cross section
compendium”. Analytical Chemistry 89.2 (2017), pp. 1032–1044.
[216] Schubert, O. T., Röst, H. L., Collins, B. C., Rosenberger, G., and Aebersold, R. “Quan-
titative proteomics: Challenges and opportunities in basic and applied research”.
Nature Protocols 12.7 (2017), pp. 1289–1294.
[217] Noor, Z., Ahn, S. B., Baker, M. S., Ranganathan, S., and Mohamedali, A. “Mass spec-
trometry–based protein identification in proteomics—a review”. Briefings In Bioin-
formatics (2019).
[218] Henderson, R. A., Michel, H., Sakaguchi, K., Shabanowitz, J., Appella, E., Hunt,
D. F., and Engelhard, V. H. “HLA-A2.1-Associated peptides from a mutant cell line:
A second pathway of antigen presentation”. Science 255.5049 (1992), pp. 1264–1266.
[219] Gillet, L. C., Leitner, A., and Aebersold, R. “Mass Spectrometry Applied to Bottom-
Up Proteomics: Entering the High-Throughput Era for Hypothesis Testing”. An-
nual Review of Analytical Chemistry 9.1 (2016), pp. 449–472.
[220] Chapman, J. D., Goodlett, D. R., and Masselon, C. D. “Multiplexed and data-
independent tandem mass spectrometry for global proteome profiling”. Mass Spec-
trometry Reviews 33.6 (2014), pp. 452–470.
[221] Geromanos, S. J., Hughes, C., Ciavarini, S., Vissers, J. P., and Langridge, J. I. “Us-
ing ion purity scores for enhancing quantitative accuracy and precision in com-
plex proteomics samples”. Analytical and Bioanalytical Chemistry 404.4 (2012),
pp. 1127–1139.
[222] Crowther, D. C., Kinghorn, K. J., Miranda, E., Page, R., Curry, J. A., Duthie, F. A. I.,
Gubb, D. C., and Lomas, D. A. “Intraneuronal Aβ, non-amyloid aggregates and
neurodegeneration in a Drosophila model of Alzheimer’s disease”. Neuroscience
132.1 (2005), pp. 123–135.
[223] Osterwalder, T., Yoon, K. S., White, B. H., and Keshishian, H. “A conditional tissue-
specific transgene expression system using inducible GAL4.” Proceedings of the Na-
tional Academy of Sciences of the United States of America 98.22 (2001), pp. 12596–
12601.
[224] Li, G. Z., Vissers, J. P., Silva, J. C., Golick, D., Gorenstein, M. V., and Geromanos,
S. J. “Database searching and accounting of multiplexed precursor and product ion
spectra from the data independent analysis of simple and complex peptide mix-
tures”. Proteomics 9.6 (2009), pp. 1696–1719.
[225] Distler, U., Kuharev, J., Navarro, P., Levin, Y., Schild, H., and Tenzer, S. “Drift time-
specific collision energies enable deep-coverage data-independent acquisition pro-
teomics”. Nature Methods 11.2 (2014), pp. 167–170.
[226] Silva, J. C., Gorenstein, M. V., Li, G.-Z., Vissers, J. P. C., and Geromanos, S. J. “Ab-
solute quantification of proteins by LCMSE: a virtue of parallel MS acquisition”. Molecular & Cellular Proteomics 5.1 (2006), pp. 144–56.
[227] Lazar, C., Gatto, L., Ferro, M., Bruley, C., and Burger, T. “Accounting for the Multiple
Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Com-
pare Imputation Strategies”. Journal of Proteome Research 15.4 (2016), pp. 1116–
1125.
[228] Bolstad, B. M., Irizarry, R. A., Åstrand, M., and Speed, T. P. “A comparison of nor-
malization methods for high density oligonucleotide array data based on variance
and bias”. Bioinformatics 19.2 (2003), pp. 185–193.
[229] Love, M. I., Huber, W., and Anders, S. “Moderated estimation of fold change and
dispersion for RNA-seq data with DESeq2”. Genome Biology 15.12 (2014), p. 550.
[230] Woo, S., Leek, J. T., and Storey, J. D. “A computationally efficient modular optimal
discovery procedure”. Bioinformatics 27.4 (2011), pp. 509–515.
[231] Robinson, M. D., McCarthy, D. J., and Smyth, G. K. “edgeR: A Bioconductor package
for differential expression analysis of digital gene expression data”. Bioinformatics
26.1 (2009), pp. 139–140.
[232] Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K.
“Limma powers differential expression analyses for RNA-sequencing and microar-
ray studies”. Nucleic Acids Research 43.7 (2015), e47.
[233] Nueda, M. J., Tarazona, S., and Conesa, A. “Next maSigPro: Updating maSigPro bio-
conductor package for RNA-seq time series”. Bioinformatics 30.18 (2014), pp. 2598–
2602.
[234] Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas,
J., Simonovic, M., Roth, A., Santos, A., Tsafou, K. P., Kuhn, M., Bork, P., Jensen, L. J.,
and Von Mering, C. “STRING v10: Protein-protein interaction networks, integrated
over the tree of life”. Nucleic Acids Research 43.D1 (2015), pp. D447–D452.
[235] Bader, G. D. and Hogue, C. W. “An automated method for finding molecular com-
plexes in large protein interaction networks”. BMC Bioinformatics 4.1 (2003), p. 2.
[236] Mi, H., Muruganujan, A., Casagrande, J. T., and Thomas, P. D. “Large-scale gene
function analysis with the PANTHER classification system”. Nature Protocols 8.8
(2013), pp. 1551–1566.
[237] Jones, E., Oliphant, T., Peterson, P., et al. “SciPy: Open Source Scientific Tools for Python” (2015).
[238] Oliphant, T. E. A guide to NumPy. Vol. 1. Trelgol Publishing USA, 2006.
[239] McKinney, W. “Data Structures for Statistical Computing in Python”. Proceedings of
the 9th Python in Science Conference. Vol. 1697900. Scipy. Austin, TX. 2010, pp. 51–
56.
[240] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blon-
del, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Courna-
peau, D., Brucher, M., Perrot, M., and Duchesnay, E. “Scikit-learn: Machine Learn-
ing in Python”. Journal of Machine Learning Research 12.Oct (2011), pp. 2825–2830.
[241] Hagberg, A. A., Schult, D. A., and Swart, P. J. Exploring network structure, dynam-
ics, and function using NetworkX. Tech. rep. Los Alamos National Lab.(LANL), Los
Alamos, NM (United States), 2008, pp. 11–15.
[242] Pérez, F. and Granger, B. E. “IPython: A system for interactive scientific comput-
ing”. Computing in Science and Engineering 9.3 (2007), pp. 21–29.
[243] Kluyver, T., Ragan-kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., Kel-
ley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S., and Willing,
C. “Jupyter Notebooks—a publishing format for reproducible computational work-
flows”. Positioning and Power in Academic Publishing: Players, Agents and Agendas.
2016, pp. 87–90.
[244] Hunter, J. D. “Matplotlib: A 2D graphics environment”. Computing in Science and
Engineering 9.3 (2007), pp. 99–104.
[245] Brown, C. J., Kaufman, T., Trinidad, J. C., and Clemmer, D. E. “Proteome changes in
the aging Drosophila melanogaster head”. International Journal of Mass Spectrom-
etry (2018).
[246] Tain, L. S., Sehlke, R., Jain, C., Chokkalingam, M., Nagaraj, N., Essers, P., Rassner,
M., Grönke, S., Froelich, J., Dieterich, C., Mann, M., Alic, N., Beyer, A., and Partridge,
L. “A proteomic atlas of insulin signalling reveals tissue-specific mechanisms of
longevity assurance”. Molecular Systems Biology (2017).
[247] Anders, S. and Huber, W. “Differential expression analysis for sequence count
data”. Genome Biology (2010).
[248] Zhang, Z. H., Jhaveri, D. J., Marshall, V. M., Bauer, D. C., Edson, J., Narayanan,
R. K., Robinson, G. J., Lundberg, A. E., Bartlett, P. F., Wray, N. R., and Zhao, Q. Y. “A
comparative study of techniques for differential expression analysis on RNA-seq
data”. PLoS ONE 9.8 (2014). Ed. by Provero, P., e103207.
[249] Seyednasrollah, F., Laiho, A., and Elo, L. L. “Comparison of software packages for
detecting differential expression in RNA-seq studies”. Briefings in Bioinformatics
16.1 (2013), pp. 59–70.
[250] Yu, H., Kim, P. M., Sprecher, E., Trifonov, V., and Gerstein, M. “The importance of
bottlenecks in protein networks: Correlation with gene essentiality and expression
dynamics”. PLoS Computational Biology 3.4 (2007), e59.
[251] Savas, J. N., Wang, Y.-Z., DeNardo, L. A., Martinez-Bartolome, S., McClatchy, D. B.,
Hark, T. J., Shanks, N. F., Cozzolino, K. A., Lavallée-Adam, M., Smukowski, S. N.,
Park, S. K., Kelly, J. W., Koo, E. H., Nakagawa, T., Masliah, E., Ghosh, A., and Yates,
J. R. “Amyloid Accumulation Drives Proteome-wide Alterations in Mouse Models
of Alzheimer’s Disease-like Pathology”. Cell Reports 21.9 (2017), pp. 2614–2627.
[252] Niccoli, T., Cabecinha, M., Tillmann, A., Kerr, F., Wong, C. T., Cardenes, D., Vin-
cent, A. J., Bettedi, L., Li, L., Grönke, S., Dols, J., and Partridge, L. “Increased Glu-
cose Transport into Neurons Rescues Aβ Toxicity in Drosophila”. Current Biology
(2016).
[253] Liu, C. C., Kanekiyo, T., Xu, H., and Bu, G. “Apolipoprotein e and Alzheimer disease:
Risk, mechanisms and therapy”. Nature Reviews Neurology (2013).
[254] Palm, W., Sampaio, J. L., Brankatschk, M., Carvalho, M., Mahmoud, A., Shevchenko,
A., and Eaton, S. “Lipoproteins in Drosophila melanogaster-assembly, function, and
influence on tissue lipid composition”. PLoS Genetics (2012).
[255] Bereczki, E., Bernat, G., Csont, T., Ferdinandy, P., Scheich, H., and Sántha, M. “Over-
expression of human apolipoprotein B-100 induces severe neurodegeneration in
transgenic mice”. Journal of Proteome Research (2008).
[256] Löffler, T., Flunkert, S., Havas, D., Sántha, M., Hutter-Paier, B., Steyrer, E., and
Windisch, M. “Impact of ApoB-100 expression on cognition and brain pathology
in wild-type and hAPPsl mice”. Neurobiology of Aging 34.10 (2013), pp. 2379–2388.
[257] Caramelli, P., Nitrini, R., Maranhao, R., Lourenço, A. C., Damasceno, M. C., Vina-
gre, C., and Caramelli, B. “Increased apolipoprotein B serum concentration in
Alzheimer’s disease”. Acta Neurologica Scandinavica (1999).
[258] Zhang, R., Barker, L., Pinchev, D., Marshall, J., Rasamoelisolo, M., Smith, C.,
Kupchak, P., Kireeva, I., Ingratta, L., and Jackowski, G. “Mining biomarkers in hu-
man sera using proteomic tools”. Proteomics (2004).
[259] López-Otín, C., Blasco, M. A., Partridge, L., Serrano, M., and Kroemer, G. “The hall-
marks of aging”. Cell 153.6 (2013), pp. 1194–1217.
[260] Hou, Y., Dan, X., Babbar, M., Wei, Y., Hasselbalch, S. G., Croteau, D. L., and Bohr,
V. A. “Ageing as a risk factor for neurodegenerative disease”. Nature Reviews Neu-
rology (2019).
[261] Afshordel, S., Wood, W. G., Igbavboa, U., Muller, W. E., and Eckert, G. P. “Impaired
geranylgeranyltransferase-I regulation reduces membrane-associated Rho protein
levels in aged mouse brain”. Journal of Neurochemistry (2014).
[262] Gao, S., Yu, R., and Zhou, X. “The Role of Geranylgeranyltransferase I-Mediated
Protein Prenylation in the Brain”. Molecular Neurobiology (2016).
[263] D’Souza, Y., Elharram, A., Soon-Shiong, R., Andrew, R. D., and Bennett, B. M. “Char-
acterization of Aldh2-/- mice as an age-related model of cognitive impairment and
Alzheimer’s disease”. Molecular Brain (2015).
[264] Ohsawa, I., Nishimaki, K., Murakami, Y., Suzuki, Y., Ishikawa, M., and Ohta, S. “Age-
dependent neurodegeneration accompanying memory loss in transgenic mice de-
fective in mitochondrial aldehyde dehydrogenase 2 activity”. Journal of Neuro-
science (2008).
[265] Sade, Y., Toker, L., Kara, N. Z., Einat, H., Rapoport, S., Moechars, D., Berry,
G. T., Bersudsky, Y., and Agam, G. “IP3 accumulation and/or inositol depletion:
Two downstream lithium’s effects that may mediate its behavioral and cellular
changes”. Translational Psychiatry (2016).
[266] Dobrin, S. E. and Fahrbach, S. E. “Rho GTPase activity in the honey bee mushroom
bodies is correlated with age and foraging experience”. Journal of Insect Physiology
(2012).
[267] Owen, L. and Sunram-Lea, S. I. “Metabolic agents that enhance ATP can improve
cognitive functioning: A review of the evidence for glucose, oxygen, pyruvate, cre-
atine, and L-carnitine”. Nutrients 3.8 (2011), pp. 735–755.
[268] Maynard, S., Fang, E. F., Scheibye-Knudsen, M., Croteau, D. L., and Bohr, V. A.
“DNA damage, DNA repair, aging, and neurodegeneration”. Cold Spring Harbor
Perspectives in Medicine (2015).
[269] Anisimova, A. S., Alexandrov, A. I., Makarova, N. E., Gladyshev, V. N., and Dmitriev,
S. E. “Protein synthesis and quality control in aging”. Aging 10.12 (2018), pp. 4269–
4288.
[270] Mattson, M. P. and Arumugam, T. V. “Hallmarks of Brain Aging: Adaptive and
Pathological Modification by Metabolic States”. Cell Metabolism 27.6 (2018),
pp. 1176–1199.
[271] Maas, A. I. “Cerebrospinal fluid enzymes in acute brain injury. 2. Relation of CSF
enzyme activity to extent of brain injury”. Journal of Neurology, Neurosurgery and
Psychiatry 40.7 (1977), pp. 666–674.
[272] Casley, C. S., Canevari, L., Land, J. M., Clark, J. B., and Sharpe, M. A. “β-Amyloid
inhibits integrated mitochondrial respiration and key enzyme activities”. Journal
of Neurochemistry (2002).
[273] Cardoso, S. M., Proença, M. T., Santos, S., Santana, I., and Oliveira, C. R. “Cy-
tochrome c oxidase is decreased in Alzheimer’s disease platelets”. Neurobiology
of Aging (2004).
[274] Fukui, H., Diaz, F., Garcia, S., and Moraes, C. T. “Cytochrome c oxidase deficiency in
neurons decreases both oxidative stress and amyloid formation in a mouse model of
Alzheimer’s disease”. Proceedings of the National Academy of Sciences of the United
States of America (2007).
[275] Castellani, R., Siedlak, S., Fortino, A., Perry, G., Ghetti, B., and Smith, M. “Chitin-
like Polysaccharides in Alzheimer’s Disease Brains”. Current Alzheimer Research
(2005).
[276] Kommaddi, R. P., Das, D., Karunakaran, S., Nanguneri, S., Bapat, D., Ray, A., Shaw,
E., Bennett, D. A., Nair, D., and Ravindranath, V. “Aβ mediates F-actin disassembly
in dendritic spines leading to cognitive deficits in Alzheimer’s disease”. Journal of
Neuroscience (2018).
[277] Hu, Y., Flockhart, I., Vinayagam, A., Bergwitz, C., Berger, B., Perrimon, N., and
Mohr, S. E. “An integrative approach to ortholog prediction for disease-focused
and other functional studies”. BMC Bioinformatics (2011).
[278] Meloni, I., Muscettola, M., Raynaud, M., Longo, I., Bruttini, M., Moizard, M. P., Go-
mot, M., Chelly, J., Des Portes, V., Fryns, J. P., Ropers, H. H., Magi, B., Bellan, C.,
Volpi, N., Yntema, H. G., Lewis, S. E., Schaffer, J. E., and Renieri, A. “FACL4, encod-
ing fatty acid-CoA ligase 4, is mutated in nonspecific X-linked mental retardation”.
Nature Genetics (2002).
[279] Peters, H., Buck, N., Wanders, R., Ruiter, J., Waterham, H., Koster, J., Yaplito-Lee, J.,
Ferdinandusse, S., and Pitt, J. “ECHS1 mutations in Leigh disease: A new inborn
error of metabolism affecting valine metabolism”. Brain (2014).
[280] Datta, A., Akatsu, H., Heese, K., and Sze, S. K. “Quantitative clinical proteomic
study of autopsied human infarcted brain specimens to elucidate the deregulated
pathways in ischemic stroke pathology”. Journal of Proteomics (2013).
[281] McKenzie, A. T., Moyon, S., Wang, M., Katsyv, I., Song, W. M., Zhou, X., Dammer,
E. B., Duong, D. M., Aaker, J., Zhao, Y., Beckmann, N., Wang, P., Zhu, J., Lah, J. J.,
Seyfried, N. T., Levey, A. I., Katsel, P., Haroutunian, V., Schadt, E. E., Popko, B., et
al. “Multiscale network modeling of oligodendrocytes reveals molecular compo-
nents of myelin dysregulation in Alzheimer’s disease”. Molecular Neurodegenera-
tion (2017).
[282] Chi, L. M., Wang, X., and Nan, G. X. “In silico analyses for molecular genetic mech-
anism and candidate genes in patients with Alzheimer’s disease”. Acta Neurologica
Belgica (2016).
[283] Gerber, H., Mosser, S., Boury-Jamot, B., Stumpe, M., Piersigilli, A., Goepfert, C.,
Dengjel, J., Albrecht, U., Magara, F., and Fraering, P. C. “The APMAP interactome
reveals new modulators of APP processing and beta-amyloid production that are
altered in Alzheimer’s disease”. Acta Neuropathologica Communications (2019).
[284] Terzioglu-Usak, S., Negis, Y., Karabulut, D. S., Zaim, M., and Isik, S. “Cellular Model
of Alzheimer’s Disease: Aβ1-42 Peptide Induces Amyloid Deposition and a De-
crease in Topo Isomerase IIβ and Nurr1 Expression”. Current Alzheimer Research
(2017).
[285] Tzekov, R., Dawson, C., Orlando, M., Mouzon, B., Reed, J., Evans, J., Crynen, G.,
Mullan, M., and Crawford, F. “Sub-Chronic neuropathological and biochemical
changes in mouse visual system after repetitive mild traumatic brain injury”. PLoS
ONE (2016).
[286] Futschik, M. E., Kalathur, R. K. R., Giner-Lamia, J., Machado, S., and Ayasolla,
K. R. S. “The unfolded protein response and its potential role in Huntington’s disease
elucidated by a systems biology approach”. F1000Research (2015).
[287] Talwar, P., Silla, Y., Grover, S., Gupta, M., Agarwal, R., Kushwaha, S., and Kukreti, R.
“Genomic convergence and network analysis approach to identify candidate genes
in Alzheimer’s disease”. BMC Genomics (2014).
[288] Rogers, I., Kerr, F., Martinez, P., Hardy, J., Lovestone, S., and Partridge, L. “Ageing
increases vulnerability to Aβ42 toxicity in Drosophila”. PLoS ONE (2012).
[289] Jacobson, G. R. and Rosenbusch, J. P. “ATP binding to a protease resistant core
of actin”. Proceedings of the National Academy of Sciences of the United States of
America (1976).
[290] Hozumi, T. “Structural aspects of skeletal muscle F-actin as studied by tryptic di-
gestion: Evidence for a second nucleotide interacting site”. Journal of Biochemistry
(1988).
[291] Rodriguez-Suarez, E., Hughes, C., Gethings, L., Giles, K., Wildgoose, J., Stapels, M.,
Fadgen, K. E., Geromanos, S. J., Vissers, J. P. C., Elortza, F., and Langridge, J. I. “An
Ion Mobility Assisted Data Independent LC-MS Strategy for the Analysis of Com-
plex Biological Samples”. Current Analytical Chemistry 9.2 (2013), pp. 199–211.
[292] Zhu, Y., Orre, L. M., Tran, Y. Z., Mermelekas, G., Johansson, H. J., Malyutina, A.,
Anders, S., and Lehtiö, J. “DEqMS: A method for accurate variance estimation in
differential protein expression analysis”. Molecular and Cellular Proteomics (2020).
[293] Zhang, X., Smits, A. H., Van Tilburg, G. B., Ovaa, H., Huber, W., and Vermeulen,
M. “Proteome-wide identification of ubiquitin interactions using UbIA-MS”. Nature
Protocols (2018).
[294] Sberro, H., Fremin, B. J., Zlitni, S., Edfors, F., Greenfield, N., Snyder, M. P., Pavlopou-
los, G. A., Kyrpides, N. C., and Bhatt, A. S. “Large-Scale Analyses of Human Micro-
biomes Reveal Thousands of Small, Novel Genes”. Cell 178.5 (2019), 1245–1259.e14.
[295] Rinke, C., Schwientek, P., Sczyrba, A., Ivanova, N. N., Anderson, I. J., Cheng, J. F.,
Darling, A., Malfatti, S., Swan, B. K., Gies, E. A., Dodsworth, J. A., Hedlund, B. P.,
Tsiamis, G., Sievert, S. M., Liu, W. T., Eisen, J. A., Hallam, S. J., Kyrpides, N. C.,
Stepanauskas, R., Rubin, E. M., et al. “Insights into the phylogeny and coding po-
tential of microbial dark matter”. Nature (2013).
[296] Rappé, M. S. and Giovannoni, S. J. “The Uncultured Microbial Majority”. Annual
Review of Microbiology (2003).
[297] Saary, P., Mitchell, A. L., and Finn, R. D. “Estimating the quality of eu-
karyotic genomes recovered from metagenomic analysis”. bioRxiv (2019),
p. 2019.12.19.882753.
[298] Huttenhower, C., Gevers, D., Knight, R., Abubucker, S., Badger, J. H., Chinwalla,
A. T., Creasy, H. H., Earl, A. M., Fitzgerald, M. G., Fulton, R. S., Giglio, M. G.,
Hallsworth-Pepin, K., Lobos, E. A., Madupu, R., Magrini, V., Martin, J. C., Mitreva,
M., Muzny, D. M., Sodergren, E. J., Versalovic, J., et al. “Structure, function and
diversity of the healthy human microbiome”. Nature (2012).
[299] Almeida, A., Mitchell, A. L., Boland, M., Forster, S. C., Gloor, G. B., Tarkowska,
A., Lawley, T. D., and Finn, R. D. “A new genomic blueprint of the human gut
microbiota”. Nature (2019).
[300] Forster, S. C., Kumar, N., Anonye, B. O., Almeida, A., Viciani, E., Stares, M. D., Dunn,
M., Mkandawire, T. T., Zhu, A., Shao, Y., Pike, L. J., Louie, T., Browne, H. P., Mitchell,
A. L., Neville, B. A., Finn, R. D., and Lawley, T. D. “A human gut bacterial genome
and culture collection for improved metagenomic analyses”. Nature Biotechnology
(2019).
[301] Sunagawa, S., Coelho, L. P., Chaffron, S., Kultima, J. R., Labadie, K., Salazar, G., Dja-
hanschiri, B., Zeller, G., Mende, D. R., Alberti, A., Cornejo-Castillo, F. M., Costea,
P. I., Cruaud, C., D’Ovidio, F., Engelen, S., Ferrera, I., Gasol, J. M., Guidi, L., Hilde-
brand, F., Kokoszka, F., et al. “Structure and function of the global ocean micro-
biome”. Science (2015).
[302] Sunagawa, S., Acinas, S. G., Bork, P., Bowler, C., Eveillard, D., Gorsky, G., Guidi,
L., Iudicone, D., Karsenti, E., Lombard, F., Ogata, H., Pesant, S., Sullivan, M. B.,
Wincker, P., and Vargas, C. de. “Tara Oceans: towards global ocean ecosystems
biology”. Nature Reviews Microbiology (2020), pp. 1–18.
[303] Al-Shayeb, B., Sachdeva, R., Chen, L. X., Ward, F., Munk, P., Devoto, A., Castelle,
C. J., Olm, M. R., Bouma-Gregson, K., Amano, Y., He, C., Méheust, R., Brooks, B.,
Thomas, A., Lavy, A., Matheus-Carnevali, P., Sun, C., Goltsman, D. S., Borton, M. A.,
Sharrar, A., et al. “Clades of huge phages from across Earth’s ecosystems”. Nature
(2020).
[304] Rothschild, D., Weissbrod, O., Barkan, E., Kurilshikov, A., Korem, T., Zeevi, D.,
Costea, P. I., Godneva, A., Kalka, I. N., Bar, N., Shilo, S., Lador, D., Vila, A. V., Zmora,
N., Pevsner-Fischer, M., Israeli, D., Kosower, N., Malka, G., Wolf, B. C., Avnit-Sagi,
T., et al. “Environment dominates over host genetics in shaping human gut micro-
biota”. Nature 555.7695 (2018), pp. 210–215.
[305] Valles-Colomer, M., Falony, G., Darzi, Y., Tigchelaar, E. F., Wang, J., Tito, R. Y., Schi-
weck, C., Kurilshikov, A., Joossens, M., Wijmenga, C., Claes, S., Van Oudenhove, L.,
Zhernakova, A., Vieira-Silva, S., and Raes, J. “The neuroactive potential of the hu-
man gut microbiota in quality of life and depression”. Nature Microbiology (2019).
[306] Bachmann, N. L., Rockett, R. J., Timms, V. J., and Sintchenko, V. “Advances in
clinical sample preparation for identification and characterization of bacterial
pathogens using metagenomics”. Frontiers in Public Health 6 (2018).
[307] Franzosa, E. A., Morgan, X. C., Segata, N., Waldron, L., Reyes, J., Earl, A. M., Gi-
annoukos, G., Boylan, M. R., Ciulla, D., Gevers, D., Izard, J., Garrett, W. S., Chan,
A. T., and Huttenhower, C. “Relating the metatranscriptome and metagenome of
the human gut”. Proceedings of the National Academy of Sciences of the United States
of America (2014).
[308] Bowers, R. M., Clum, A., Tice, H., Lim, J., Singh, K., Ciobanu, D., Ngan, C. Y., Cheng,
J. F., Tringe, S. G., and Woyke, T. “Impact of library preparation protocols and tem-
plate quantity on the metagenomic reconstruction of a mock microbial commu-
nity”. BMC Genomics (2015).
[309] Lozupone, C. A., Stombaugh, J., Gonzalez, A., Ackermann, G., Wendel, D., Vázquez-
Baeza, Y., Jansson, J. K., Gordon, J. I., and Knight, R. “Meta-analyses of studies of
the human microbiota”. Genome Research (2013).
[310] Voigt, A. Y., Costea, P. I., Kultima, J. R., Li, S. S., Zeller, G., Sunagawa, S., and Bork, P.
“Temporal and technical variability of human gut metagenomes”. Genome Biology
(2015).
[311] Solonenko, S. A., Ignacio-Espinoza, J. C., Alberti, A., Cruaud, C., Hallam, S., Kon-
stantinidis, K., Tyson, G., Wincker, P., and Sullivan, M. B. “Sequencing platform and
library preparation choices impact viral metagenomes”. BMC Genomics (2013).
[312] Raes, J. and Bork, P. “Molecular eco-systems biology: Towards an understanding
of community function”. Nature Reviews Microbiology (2008).
[313] Costea, P. I., Zeller, G., Sunagawa, S., Pelletier, E., Alberti, A., Levenez, F., Tramon-
tano, M., Driessen, M., Hercog, R., Jung, F. E., Kultima, J. R., Hayward, M. R., Coelho,
L. P., Allen-Vercoe, E., Bertrand, L., Blaut, M., Brown, J. R., Carton, T., Cools-Portier,
S., Daigneault, M., et al. “Towards standards for human fecal sample processing in
metagenomic studies”. Nature Biotechnology (2017).
[314] Kallies, R., Hölzer, M., Toscan, R. B., Rocha, U. N. da, Anders, J., Marz, M., and
Chatzinotas, A. “Evaluation of sequencing library preparation protocols for viral
metagenomic analysis from pristine aquifer groundwaters”. Viruses (2019).
[315] Sato, M. P., Ogura, Y., Nakamura, K., Nishida, R., Gotoh, Y., Hayashi, M., Hisatsune,
J., Sugai, M., Itoh, T., and Hayashi, T. “Comparison of the sequencing bias of
currently available library preparation kits for Illumina sequencing of bacterial
genomes and metagenomes”. DNA Research 26.5 (2019), pp. 391–398.
[316] Sevim, V., Lee, J., Egan, R., Clum, A., Hundley, H., Lee, J., Everroad, R. C., Detweiler,
A. M., Bebout, B. M., Pett-Ridge, J., Göker, M., Murray, A. E., Lindemann, S. R.,
Klenk, H. P., O’Malley, R., Zane, M., Cheng, J. F., Copeland, A., Daum, C., Singer,
E., and Woyke, T. “Shotgun metagenome data of a defined mock community using
Oxford Nanopore, PacBio and Illumina technologies”. Scientific Data (2019).
[317] Rodrigue, S., Materna, A. C., Timberlake, S. C., Blackburn, M. C., Malmstrom, R. R.,
Alm, E. J., and Chisholm, S. W. “Unlocking short read sequencing for metage-
nomics”. PLoS ONE (2010).
[318] Deamer, D., Akeson, M., and Branton, D. “Three decades of nanopore sequencing”.
Nature Biotechnology 34.5 (2016), pp. 518–524.
[319] Urban, L., Holzer, A., Baronas, J. J., Hall, M., Braeuninger-Weimer, P., Scherm,
M. J., Kunz, D. J., Perera, S. N., Martin-Herranz, D. E., Tipper, E. T., Salter, S. J.,
and Stammnitz, M. R. “Freshwater monitoring by nanopore sequencing”. bioRxiv
(2020), p. 2020.02.06.936302.
[320] Jain, M., Olsen, H. E., Paten, B., and Akeson, M. “The Oxford Nanopore MinION:
Delivery of nanopore sequencing to the genomics community”. Genome Biology
(2016).
[321] Church, G. and Deamer, D. W. Characterization of individual polymer molecules
based on monomer-interface interactions. 1996.
[322] Ouldali, H., Sarthak, K., Ensslen, T., Piguet, F., Manivet, P., Pelta, J., Behrends, J. C.,
Aksimentiev, A., and Oukhaled, A. “Electrical recognition of the twenty proteino-
genic amino acids using an aerolysin nanopore”. Nature Biotechnology 38.2 (2020),
pp. 176–181.
[323] Rang, F. J., Kloosterman, W. P., and Ridder, J. de. “From squiggle to basepair: Com-
putational approaches for improving nanopore sequencing read accuracy”. Genome
Biology 19.1 (2018).
[324] Wick, R. R., Judd, L. M., and Holt, K. E. “Performance of neural network basecalling
tools for Oxford Nanopore sequencing”. Genome Biology (2019).
[325] Antipov, D., Korobeynikov, A., McLean, J. S., and Pevzner, P. A. “hybridSPAdes:
An algorithm for hybrid assembly of short and long reads”. Bioinformatics (2016).
[326] Overholt, W. A., Hölzer, M., Geesink, P., Diezel, C., Marz, M., and Küsel, K.
“Inclusion of Oxford Nanopore long reads improves all microbial and phage
metagenome-assembled genomes from a complex aquifer system”. bioRxiv (2019).
[327] Nurk, S., Meleshko, D., Korobeynikov, A., and Pevzner, P. A. “metaSPAdes: A new
versatile metagenomic assembler”. Genome Research 27.5 (2017), pp. 824–834.
[328] Steinegger, M. and Söding, J. “MMseqs2 enables sensitive protein sequence search-
ing for the analysis of massive data sets”. Nature Biotechnology 35.11 (2017),
pp. 1026–1028.
[329] Sczyrba, A., Hofmann, P., Belmann, P., Koslicki, D., Janssen, S., Dröge, J., Gregor, I.,
Majda, S., Fiedler, J., Dahms, E., Bremges, A., Fritz, A., Garrido-Oter, R., Jørgensen,
T. S., Shapiro, N., Blood, P. D., Gurevich, A., Bai, Y., Turaev, D., Demaere, M. Z., et
al. “Critical Assessment of Metagenome Interpretation - a benchmark of metagenomics
software”. Nature Methods 14.11 (2017), pp. 1063–1071.
[361] Geyer, R., Jambeck, J. R., and Law, K. L. “Production, use, and fate of all plastics
ever made”. Science Advances 3.7 (2017), e1700782.
[362] Ragaert, K., Delva, L., and Van Geem, K. “Mechanical and chemical recycling of
solid plastic waste”. Waste Management 69 (2017), pp. 24–58.
[363] Bornscheuer, U. T. “Feeding on plastic”. Science 351.6278 (2016), pp. 1154–1155.
[364] Wilcox, C., Van Sebille, E., Hardesty, B. D., and Estes, J. A. “Threat of plastic pol-
lution to seabirds is global, pervasive, and increasing”. Proceedings of the National
Academy of Sciences of the United States of America 112.38 (2015), pp. 11899–11904.
[365] Law, K. L. and Thompson, R. C. “Microplastics in the seas”. Science 345.6193 (2014),
pp. 144–145.
[366] Rochman, C. M. “Microplastics research - from sink to source”. Science 360.6384
(2018), pp. 28–29.
[367] Lebreton, L., Slat, B., Ferrari, F., Sainte-Rose, B., Aitken, J., Marthouse, R., Ha-
jbane, S., Cunsolo, S., Schwarz, A., Levivier, A., Noble, K., Debeljak, P., Maral, H.,
Schoeneich-Argent, R., Brambini, R., and Reisser, J. “Evidence that the Great Pa-
cific Garbage Patch is rapidly accumulating plastic”. Scientific Reports 8.1 (2018),
pp. 1–15.
[368] Lacerda, A. L., Rodrigues, L. d. S., Sebille, E. van, Rodrigues, F. L., Ribeiro, L., Sec-
chi, E. R., Kessler, F., and Proietti, M. C. “Plastics in sea surface waters around the
Antarctic Peninsula”. Scientific Reports 9.1 (2019), pp. 1–12.
[369] Sharon, C. and Sharon, M. “Studies on biodegradation of polyethylene tereph-
thalate: A synthetic polymer”. Journal of Microbiology and Biotechnology Research
(2013).
[370] Müller, R.-J., Schrader, H., Profe, J., Dresler, K., and Deckwer, W.-D. “Enzymatic
Degradation of Poly(ethylene terephthalate): Rapid Hydrolyse using a Hydrolase
from T. fusca”. Macromolecular Rapid Communications 26.17 (2005), pp. 1400–1405.
[371] Ronkvist, Å. M., Xie, W., Lu, W., and Gross, R. A. “Cutinase-Catalyzed hydrolysis
of poly(ethylene terephthalate)”. Macromolecules 42.14 (2009), pp. 5128–5138.
[372] Vertommen, M. A., Nierstrasz, V. A., Veer, M. V. D., and Warmoeskerken, M. M. “En-
zymatic surface modification of poly(ethylene terephthalate)”. Journal of Biotech-
nology 120.4 (2005), pp. 376–386.
[373] Gan, Z. and Zhang, H. “PMBD: a Comprehensive Plastics Microbial Biodegradation
Database”. Database 2019 (2019).
[374] Yoshida, S., Hiraga, K., Takehana, T., Taniguchi, I., Yamaji, H., Maeda, Y., Toyohara,
K., Miyamoto, K., Kimura, Y., and Oda, K. “A bacterium that degrades and assimi-
lates poly(ethylene terephthalate)”. Science (2016).
[375] Yang, Y., Yang, J., and Jiang, L. “Comment on "A bacterium that degrades and as-
similates poly(ethylene terephthalate)"”. Science 353.6301 (2016), p. 759.
[376] Yoshida, S., Hiraga, K., Takehana, T., Taniguchi, I., Yamaji, H., Maeda, Y., Toy-
ohara, K., Miyamoto, K., Kimura, Y., and Oda, K. “Response to Comment on "A
bacterium that degrades and assimilates poly(ethylene terephthalate)"”. Science
353.6301 (2016), p. 759.
[377] Han, X., Liu, W., Huang, J. W., Ma, J., Zheng, Y., Ko, T. P., Xu, L., Cheng, Y. S.,
Chen, C. C., and Guo, R. T. “Structural insight into catalytic mechanism of PET
hydrolase”. Nature Communications 8.1 (2017), pp. 1–6.
[378] Joo, S., Cho, I. J., Seo, H., Son, H. F., Sagong, H. Y., Shin, T. J., Choi, S. Y., Lee,
S. Y., and Kim, K. J. “Structural insight into molecular mechanism of poly(ethylene
terephthalate) degradation”. Nature Communications 9.1 (2018), pp. 1–12.
[379] Austin, H. P., Allen, M. D., Donohoe, B. S., Rorrer, N. A., Kearns, F. L., Silveira, R. L.,
Pollard, B. C., Dominick, G., Duman, R., El Omari, K., Mykhaylyk, V., Wagner, A.,
Michener, W. E., Amore, A., Skaf, M. S., Crowley, M. F., Thorne, A. W., Johnson,
C. W., Woodcock, H. L., McGeehan, J. E., and Beckham, G. T. “Characterization
and engineering of a plastic-degrading aromatic polyesterase”. Proceedings of the
National Academy of Sciences 115.19 (2018), E4350–E4357.
[380] Palm, G. J., Reisky, L., Böttcher, D., Müller, H., Michels, E. A., Walczak, M. C.,
Berndt, L., Weiss, M. S., Bornscheuer, U. T., and Weber, G. “Structure of the plastic-
degrading Ideonella sakaiensis MHETase bound to a substrate”. Nature Communi-
cations 10.1 (2019), pp. 1–10.
[381] Tournier, V., Topham, C. M., Gilles, A., David, B., Folgoas, C., Moya-Leclair, E.,
Kamionka, E., Desrousseaux, M.-L., Texier, H., Gavalda, S., Cot, M., Guémard, E.,
Dalibey, M., Nomme, J., Cioci, G., Barbe, S., Chateau, M., André, I., Duquesne, S.,
and Marty, A. “An engineered PET depolymerase to break down and recycle plastic
bottles”. Nature 580.7802 (2020), pp. 216–219.
[382] Quince, C., Walker, A. W., Simpson, J. T., Loman, N. J., and Segata, N. “Shot-
gun metagenomics, from sampling to analysis”. Nature Biotechnology 35.9 (2017),
pp. 833–844.
[383] Kalvari, I., Argasinska, J., Quinones-Olvera, N., Nawrocki, E. P., Rivas, E., Eddy,
S. R., Bateman, A., Finn, R. D., and Petrov, A. I. “Rfam 13.0: shifting to a genome-
centric resource for non-coding RNA families.” Nucleic Acids Research 46.D1 (2018),
pp. D335–D342.
[384] Mukherjee, S., Stamatis, D., Bertsch, J., Ovchinnikova, G., Katta, H. Y., Mojica, A.,
Chen, I. M. A., Kyrpides, N. C., and Reddy, T. B. “Genomes OnLine database (GOLD)
v.7: Updates and new features”. Nucleic Acids Research (2019).
[385] Kurtzer, G. M., Sochat, V., and Bauer, M. W. “Singularity: Scientific containers for
mobility of compute”. PLOS ONE 12.5 (2017). Ed. by Gursoy, A., e0177459.
[386] da Veiga Leprevost, F., Grüning, B. A., Alves Aflitos, S., Röst, H. L., Uszkoreit, J.,
Barsnes, H., Vaudel, M., Moreno, P., Gatto, L., Weber, J., Bai, M., Jimenez, R. C.,
Sachsenberg, T., Pfeuffer, J., Vera Alvarez, R., Griss, J., Nesvizhskii, A. I., and Perez-
Riverol, Y. “BioContainers: an open-source and community-driven framework for
software standardization”. Bioinformatics (2017).
[387] McCallum, A., Nigam, K., and Ungar, L. H. “Efficient clustering of high-dimensional
data sets with application to reference matching”. Proceedings of the Sixth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining. 2000.
[388] Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M. “Min-wise inde-
pendent permutations (extended abstract)”. Proceedings of the thirtieth annual ACM
symposium on Theory of computing - STOC ’98. New York, New York, USA: Associ-
ation for Computing Machinery (ACM), 1998, pp. 327–336.
[389] Broder, A. Z. “On the resemblance and containment of documents”. Proceedings
of the International Conference on Compression and Complexity of Sequences. IEEE,
1997, pp. 21–29.
[390] Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., and
Notredame, C. “Nextflow enables reproducible computational workflows”. Nature
Biotechnology 35.4 (2017), pp. 316–319.
[391] Bezanson, J., Karpinski, S., Shah, V. B., and Edelman, A. “Julia: A Fast Dynamic
Language for Technical Computing”. arXiv preprint arXiv:1209.5145 (2012).
[392] Scholes, H. harryscholes/CATHBase.jl: v0.1.0. 2020.
[393] Nightingale, A., Antunes, R., Alpi, E., Bursteinas, B., Gonzales, L., Liu, W., Luo, J., Qi,
G., Turner, E., and Martin, M. “The Proteins API: accessing key integrated protein
and genome information”. Nucleic Acids Research 45.W1 (2017), W539–W544.
[394] Henikoff, S. and Henikoff, J. G. “Amino acid substitution matrices from protein
blocks”. Proceedings of the National Academy of Sciences of the United States of Amer-
ica 89.22 (1992), pp. 10915–10919.
[395] Katoh, K., Misawa, K., Kuma, K., and Miyata, T. “MAFFT: a novel method for rapid multiple sequence alignment based
on fast Fourier transform”. Nucleic Acids Research (2002).
[396] Müllner, D. “Modern hierarchical, agglomerative clustering algorithms”. arXiv preprint arXiv:1109.2378 (2011).
[397] Hie, B., Cho, H., DeMeo, B., Bryson, B., and Berger, B. “Geometric Sketching Com-
pactly Summarizes the Single-Cell Transcriptomic Landscape”. Cell Systems (2019).
[398] Rentzsch, R. and Orengo, C. A. “Protein function prediction - the power of multi-
plicity”. Trends in Biotechnology 27.4 (2009), pp. 210–219.
[399] Bruna, T., Lomsadze, A., and Borodovsky, M. “GeneMark-EP and -EP+: automatic
eukaryotic gene prediction supported by spliced aligned proteins”. bioRxiv (2020), p. 2019.12.31.891218.
[400] Levy Karin, E., Mirdita, M., and Söding, J. “MetaEuk—sensitive, high-throughput
gene discovery, and annotation for large-scale eukaryotic metagenomics”. Micro-
biome 8.1 (2020), p. 48.
[401] Sallet, E., Gouzy, J., and Schiex, T. “EuGene: An automated integrative gene finder
for eukaryotes and prokaryotes”. Methods in Molecular Biology. Vol. 1962. Humana
Press Inc., 2019, pp. 97–120.
[402] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis,
A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L.,
Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M.,
Sherlock, G., and the Gene Ontology Consortium. “Gene Ontology: Tool for the Unification of
Biology”. Nature Genetics 25.1 (2000), pp. 25–29.
[403] Peled, S., Leiderman, O., Charar, R., Efroni, G., Shav-Tal, Y., and Ofran, Y. “De-novo
protein function prediction using DNA binding and RNA binding proteins as a test
case.” Nature Communications 7 (2016), p. 13424.
[404] Gillis, J. and Pavlidis, P. “"Guilt by association" is the exception rather than the
rule in gene networks”. PLoS Computational Biology 8.3 (2012). Ed. by Rzhetsky,
A., e1002444.
[405] Szklarczyk, D., Morris, J. H., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., San-
tos, A., Doncheva, N. T., Roth, A., Bork, P., Jensen, L. J., and Von Mering, C.
“The STRING database in 2017: Quality-controlled protein-protein association net-
works, made broadly accessible”. Nucleic Acids Research 45.D1 (2017), pp. D362–
D368.
[406] Franceschini, A., Szklarczyk, D., Frankild, S., Kuhn, M., Simonovic, M., Roth, A., Lin,
J., Minguez, P., Bork, P., Von Mering, C., and Jensen, L. J. “STRING v9.1: Protein-
protein interaction networks, with increased coverage and integration”. Nucleic
Acids Research (2013).
[407] Chatr-Aryamontri, A., Oughtred, R., Boucher, L., Rust, J., Chang, C., Kolas, N. K.,
O’Donnell, L., Oster, S., Theesfeld, C., Sellam, A., Stark, C., Breitkreutz, B. J., Dolin-
ski, K., and Tyers, M. “The BioGRID interaction database: 2017 update”. Nucleic
Acids Research 45.D1 (2017), pp. D369–D379.
[408] Pancaldi, V., Schubert, F., and Bähler, J. “Meta-analysis of genome regulation and
expression variability across hundreds of environmental and genetic perturbations
in fission yeast.” Molecular BioSystems 6.3 (2010), pp. 543–552.
[409] Harris, M. A., Lock, A., Bähler, J., Oliver, S. G., and Wood, V. “FYPO: The fission
yeast phenotype ontology”. Bioinformatics 29.13 (2013), pp. 1671–1678.
[410] Python. url: https://www.python.org/ (visited on 03/14/2016).
[411] Chollet, F. et al. Keras. 2015.
[412] Google Research. “TensorFlow: Large-scale machine learning on heterogeneous
systems”. Google Research (2015).
[413] Bähler, J. and Wood, V. “Probably the best model organism in the world”. Yeast
23.13 (2006), pp. 899–900.
[414] Jeffares, D. C., Rallis, C., Rieux, A., Speed, D., Převorovský, M., Mourier, T., Marsel-
lach, F. X., Iqbal, Z., Lau, W., Cheng, T. M. K., Pracana, R., Mülleder, M., Lawson,
J. L. D., Chessel, A., Bala, S., Hellenthal, G., O’Fallon, B., Keane, T., Simpson, J. T.,
Bischof, L., et al. “The genomic and phenotypic diversity of Schizosaccharomyces
pombe”. Nature Genetics 47.3 (2015), pp. 235–241.
[415] Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert,
F., Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y.,
Philippsen, P., Tettelin, H., and Oliver, S. G. “Life with 6000 genes”. Science 274.5287
(1996), pp. 546–567.
[416] Hoffman, C. S., Wood, V., and Fantes, P. A. “An ancient yeast for young geneticists:
A primer on the Schizosaccharomyces pombe model system”. Genetics 201.2 (2015),
pp. 403–423.
[417] Lee, M. G. “Molecular biology of the fission yeast”. Trends in Genetics 6 (1990),
p. 270.
[418] Wood, V., Gwilliam, R., Rajandream, M. A., Lyne, M., Lyne, R., Stewart, A., Sgouros,
J., Peat, N., Hayles, J., Baker, S., Basham, D., Bowman, S., Brooks, K., Brown, D.,
Brown, S., Chillingworth, T., Churcher, C., Collins, M., Connor, R., Cronin, A., et al.
“The genome sequence of Schizosaccharomyces pombe”. Nature 415.6874 (2002),
pp. 871–880.
[419] Sipiczki, M. “Where does fission yeast sit on the tree of life?” Genome Biology 1.2
(2000), reviews1011.1.
[420] Nurse, P., Thuriaux, P., and Nasmyth, K. “Genetic control of the cell division cy-
cle in the fission yeast Schizosaccharomyces pombe”. MGG Molecular & General
Genetics 146.2 (1976), pp. 167–178.
[421] Kim, D.-U., Hayles, J., Kim, D., Wood, V., Park, H.-O., Won, M., Yoo, H.-S., Duhig, T.,
Nam, M., Palmer, G., Han, S., Jeffery, L., Baek, S.-T., Lee, H., Shim, Y. S., Lee, M., Kim,
L., Heo, K.-S., Noh, E. J., Lee, A.-R., et al. “Analysis of a genome-wide set of gene
deletions in the fission yeast Schizosaccharomyces pombe”. Nature Biotechnology
28.6 (2010), pp. 617–623.
[422] Han, T. X., Xu, X. Y., Zhang, M. J., Peng, X., and Du, L. L. “Global fitness profiling of
fission yeast deletion strains by barcode sequencing”. Genome Biology 11.6 (2010),
R60.
[423] Smith, A. M., Heisler, L. E., Mellor, J., Kaper, F., Thompson, M. J., Chee, M., Roth,
F. P., Giaever, G., and Nislow, C. “Quantitative phenotyping via deep barcode se-
quencing”. Genome Research 19.10 (2009), pp. 1836–1842.
[424] Wood, V., Lock, A., Harris, M. A., Rutherford, K., Bähler, J., and Oliver, S. G. “Hidden
in plain sight: What remains to be discovered in the eukaryotic proteome?” Open
Biology (2019).
[425] Stoeger, T., Gerlach, M., Morimoto, R. I., and Nunes Amaral, L. A. “Large-scale
investigation of the reasons why potentially important genes are ignored”. PLOS
Biology 16.9 (2018). Ed. by Freeman, T., e2006643.
[426] Lock, A., Rutherford, K., Harris, M. A., and Wood, V. “PomBase: The scientific re-
source for fission yeast”. Methods in Molecular Biology. 2018.
[427] Hillenmeyer, M. E., Fung, E., Wildenhain, J., Pierce, S. E., Hoon, S., Lee, W., Proctor,
M., St.Onge, R. P., Tyers, M., Koller, D., Altman, R. B., Davis, R. W., Nislow, C., and
Giaever, G. “The Chemical Genomic Portrait of Yeast: Uncovering a Phenotype for
All Genes”. Science 320.5874 (2008), pp. 362–365.
[428] Schnoes, A. M., Ream, D. C., Thorman, A. W., Babbitt, P. C., and Friedberg, I. “Bi-
ases in the Experimental Annotations of Protein Function and Their Effect on Our
Understanding of Protein Function Space”. PLoS Computational Biology 9.5 (2013).
Ed. by Orengo, C. A., e1003063.
[429] Saraç, Ö. S., Pancaldi, V., Bähler, J., and Beyer, A. “Topology of functional networks
predicts physical binding of proteins”. Bioinformatics (2012).
[430] Pancaldi, V., Saraç, O. S., Rallis, C., McLean, J. R., Převorovský, M., Gould, K., Beyer,
A., and Bähler, J. “Predicting the fission yeast protein interaction network.” G3
(Bethesda, Md.) 2.4 (2012), pp. 453–467.
[431] Lees, J. G., Hériché, J. K., Morilla, I., Fernández, J. M., Adler, P., Krallinger, M., Vilo, J.,
Valencia, A., Ellenberg, J., Ranea, J. A., and Orengo, C. “FUN-L: Gene prioritization
for RNAi screens”. Bioinformatics 31.12 (2015), pp. 2052–2053.
[432] Houle, D., Govindaraju, D. R., and Omholt, S. “Phenomics: The next challenge”.
Nature Reviews Genetics 11.12 (2010), pp. 855–866.
[433] Shuler, M. L. “Functional genomics: An opportunity for bioengineers”. Biotechnol-
ogy Progress 15.3 (1999), p. 287.
[434] Schilling, C. H., Edwards, J. S., and Palsson, B. O. “Toward metabolic phenomics:
Analysis of genomic data using flux balances”. Biotechnology Progress 15.3 (1999),
pp. 288–295.
[435] Dove, A. “Proteomics: Translating genomics into products?” Nature Biotechnology
17.3 (1999), pp. 233–236.
[436] Mendel, G. “Experiments in plant hybridisation”. Verhandlungen des naturforschenden Vereines in Brünn 4 (1866), pp. 3–47.
[437] Houle, D. “Numbering the hairs on our heads: The shared challenge and promise
of phenomics”. Proceedings of the National Academy of Sciences of the United States
of America 107.Suppl. 1 (2010), pp. 1793–1799.
[438] Novikoff, A. B. “The concept of integrative levels and biology”. Science 101.2618
(1945), pp. 209–215.
[439] Murren, C. J. “The integrated phenotype”. Integrative and Comparative Biology.
2012.
[440] Hoyles, L., Fernández-Real, J. M., Federici, M., Serino, M., Abbott, J., Charpentier,
J., Heymes, C., Luque, J. L., Anthony, E., Barton, R. H., Chilloux, J., Myridakis,
A., Martinez-Gili, L., Moreno-Navarrete, J. M., Benhamed, F., Azalbert, V., Blasco-
Baque, V., Puig, J., Xifra, G., Ricart, W., et al. “Molecular phenomics and metage-
nomics of hepatic steatosis in non-diabetic obese women”. Nature Medicine 24.7
(2018), pp. 1070–1080.
[441] Zhao, C., Zhang, Y., Du, J., Guo, X., Wen, W., Gu, S., Wang, J., and Fan, J. “Crop
phenomics: Current status and perspectives”. Frontiers in Plant Science 10 (2019),
p. 714.
[442] Watt, M., Fiorani, F., Usadel, B., Rascher, U., Muller, O., and Schurr, U. “Phenotyp-
ing: New Windows into the Plant for Breeders”. Annual Review of Plant Biology 71
(2020), pp. 689–712.
[443] Brown, S. D. M., Holmes, C. C., Mallon, A.-M., Meehan, T. F., Smedley, D., and
Wells, S. “High-throughput mouse phenomics for characterizing mammalian gene
function”. Nature Reviews Genetics 19.6 (2018), pp. 357–370.
[444] Mali, P., Yang, L., Esvelt, K. M., Aach, J., Guell, M., DiCarlo, J. E., Norville, J. E.,
and Church, G. M. “RNA-guided human genome engineering via Cas9”. Science
339.6121 (2013), pp. 823–826.
[445] Cong, L., Ran, F. A., Cox, D., Lin, S., Barretto, R., Habib, N., Hsu, P. D., Wu, X.,
Jiang, W., Marraffini, L. A., and Zhang, F. “Multiplex genome engineering using
CRISPR/Cas systems”. Science 339.6121 (2013), pp. 819–823.
[446] Gerlai, R. “Phenomics: Fiction or the future?” Trends in Neurosciences 25.10 (2002),
pp. 506–509.
[447] Rallis, C. and Bähler, J. “Cell-based screens and phenomics with fission yeast”. Crit-
ical Reviews in Biochemistry and Molecular Biology 51.2 (2016), pp. 86–95.
[448] Bischof, L., Převorovský, M., Rallis, C., Jeffares, D. C., Arzhaeva, Y., and Bähler, J.
“Spotsizer: High-throughput quantitative analysis of microbial growth”. BioTech-
niques 61.4 (2016), pp. 191–201.
[449] Zackrisson, M., Hallin, J., Ottosson, L.-G., Dahl, P., Fernandez-Parada, E., Länd-
ström, E., Fernandez-Ricaud, L., Kaferle, P., Skyman, A., Stenberg, S., Omholt, S.,
Petrovič, U., Warringer, J., and Blomberg, A. “Scan-o-matic: High-Resolution Mi-
crobial Phenomics at a Massive Scale.” G3 (Bethesda, Md.) 6.9 (2016), pp. 3003–3014.
[450] Kamrad, S., Rodríguez-López, M., Cotobal, C., Correia-Melo, C., Ralser, M., and Bäh-
ler, J. “Pyphe: A python toolbox for assessing microbial growth and cell viability
in high-throughput colony screens”. bioRxiv (2020), p. 2020.01.22.915363.
[451] Sailem, H. Z., Rittscher, J., and Pelkmans, L. “KCML: a machine-learning framework
for inference of multi-scale gene functions from genetic perturbation screens”.
Molecular Systems Biology 16.3 (2020).
[452] Romila, C.-A. “High-Throughput Chronological Lifespan Screening of the Fission
Yeast Deletion Library Using Barcode Sequencing”. Doctoral thesis, UCL (University
College London), 2019.
[453] Rallis, C., López-Maury, L., Georgescu, T., Pancaldi, V., and Bähler, J. “Systematic
screen for mutants resistant to TORC1 inhibition in fission yeast reveals genes
involved in cellular ageing and growth”. Biology Open 3.2 (2014), pp. 161–171.
[454] Gentner, N. E. and Werner, M. M. “Repair in Schizosaccharomyces pombe as mea-
sured by recovery from caffeine enhancement of radiation-induced lethality”. MGG
Molecular & General Genetics 142.3 (1975), pp. 171–183.
[455] Osman, F. and McCready, S. “Differential effects of caffeine on DNA damage
and replication cell cycle checkpoints in the fission yeast Schizosaccharomyces
pombe”. Molecular and General Genetics 260.4 (1998), pp. 319–334.
[456] Calvo, I. A., Gabrielli, N., Iglesias-Baena, I., García-Santamarina, S., Hoe, K.-L., Kim,
D. U., Sansó, M., Zuin, A., Pérez, P., Ayté, J., and Hidalgo, E. “Genome-Wide Screen
of Genes Required for Caffeine Tolerance in Fission Yeast”. PLoS ONE 4.8 (2009).
Ed. by Bonini, M., e6619.
[457] Zhou, H., Liu, Q., Shi, T., Yu, Y., and Lu, H. “Genome-wide screen of fission yeast
mutants for sensitivity to 6-azauracil, an inhibitor of transcriptional elongation”.
Yeast 32.10 (2015), pp. 643–655.
[458] Noda, T. “Viability Assays to Monitor Yeast Autophagy”. Methods in Enzymology.
Vol. 451. Academic Press, 2008, pp. 27–32.
[459] Benjamini, Y. and Hochberg, Y. “Controlling the False Discovery Rate: A Practical
and Powerful Approach to Multiple Testing”. Journal of the Royal Statistical Society
B 57.1 (1995), pp. 289–300.
[460] Dunn, O. J. “Multiple Comparisons among Means”. Journal of the American Statis-
tical Association 56.293 (1961), pp. 52–64.
[461] Dunn, O. J. “Estimation of the Medians for Dependent Variables”. The Annals of
Mathematical Statistics 30.1 (1959), pp. 192–197.
[462] The Gene Ontology Consortium. “Gene ontology consortium: Going forward”. Nu-
cleic Acids Research 43.D1 (2015), pp. D1049–D1056.
[463] Martignon, L., Vitouch, O., Takezawa, M., and Forster, M. R. “Naive and yet En-
lightened: From Natural Frequencies to Fast and Frugal Decision Trees”. Thinking:
Psychological Perspectives on Reasoning, Judgment and Decision Making. John Wiley
& Sons, Ltd, 2005, pp. 189–211.
[464] Raab, M. and Gigerenzer, G. “The power of simplicity: a fast-and-frugal heuristics
approach to performance science”. Frontiers in Psychology 6 (2015), p. 1672.
[465] Murphy, K. P. Machine learning: a probabilistic perspective. MIT Press, 2012, p. 1067.
[466] Szklarczyk, D., Gable, A. L., Lyon, D., Junge, A., Wyder, S., Huerta-Cepas, J., Si-
monovic, M., Doncheva, N. T., Morris, J. H., Bork, P., Jensen, L. J., and Mering, C.
von. “STRING v11: protein–protein association networks with increased coverage,
supporting functional discovery in genome-wide experimental datasets”. Nucleic
Acids Research 47.D1 (2019), pp. D607–D613.
[467] D’Haeseleer, P. “How does gene expression clustering work?” Nature Biotechnology
23.12 (2005), pp. 1499–1501.
[468] Gibbons, F. D. and Roth, F. P. “Judging the quality of gene expression-based clus-
tering methods using gene annotation”. Genome Research 12.10 (2002), pp. 1574–
1581.
[469] Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau,
D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., Walt, S. J. van der, Brett, M.,
Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R., Jones, E., Kern, R., Larson, E.,
Carey, C. J., et al. “SciPy 1.0: fundamental algorithms for scientific computing in
Python”. Nature Methods 17.3 (2020), pp. 261–272.
[470] Sokal, R. and Michener, C. “A statistical method for evaluating systematic relation-
ships”. Univ Kans Sci Bull 38 (1958), pp. 1409–1438.
[471] Mirisola, M. G., Braun, R. J., and Petranovic, D. “Approaches to study yeast cell
aging and death”. FEMS Yeast Research 14.1 (2014), pp. 109–118.
[472] Kwolek-Mirek, M. and Zadrag-Tecza, R. “Comparison of methods used for assess-
ing the viability and vitality of yeast cells”. FEMS Yeast Research 14.7 (2014).
[473] Alao, J. P., Weber, A. M., Shabro, A., and Sunnerhagen, P. “Suppression of sensitivity
to drugs and antibiotics by high external cation concentrations in fission yeast”.
PLoS ONE 10.3 (2015), e0119297.
[491] Zaman, R., Chowdhury, S. Y., Rashid, M. A., Sharma, A., Dehzangi, A., and
Shatabda, S. “HMMBinder: DNA-Binding Protein Prediction Using HMM Profile
Based Features”. BioMed Research International 2017 (2017).
[492] Sprinzak, E., Sattath, S., and Margalit, H. “How reliable are experimental protein-
protein interaction data?” Journal of Molecular Biology 327.5 (2003), pp. 919–923.
[493] Silva, E. de, Thorne, T., Ingram, P., Agrafioti, I., Swire, J., Wiuf, C., and Stumpf, M. P.
“The effects of incomplete protein interaction data on structural and evolutionary
inferences”. BMC Biology 4.1 (2006), pp. 1–13.
[494] Kuchaiev, O., Rašajski, M., Higham, D. J., and Pržulj, N. “Geometric de-noising
of protein-protein interaction networks”. PLoS Computational Biology 5.8 (2009),
e1000454.
[495] Zaki, N., Efimov, D., and Berengueres, J. “Protein complex detection using interac-
tion reliability assessment and weighted clustering coefficient”. BMC Bioinformat-
ics 14.1 (2013), p. 163.
[496] Piovesan, D., Giollo, M., Leonardi, E., Ferrari, C., and Tosatto, S. C. E. “INGA: pro-
tein function prediction combining interaction networks, domain assignments and
sequence similarity.” Nucleic Acids Research 43.W1 (2015), W134–W140.
[497] Moya-García, A., Adeyelu, T., Kruger, F. A., Dawson, N. L., Lees, J. G., Overington,
J. P., Orengo, C., and Ranea, J. A. “Structural and Functional View of Polypharma-
cology”. Scientific Reports 7.1 (2017), pp. 1–14.
[498] You, R., Yao, S., Xiong, Y., Huang, X., Sun, F., Mamitsuka, H., and Zhu, S. “NetGO:
improving large-scale protein function prediction with massive network informa-
tion.” Nucleic Acids Research 47.W1 (2019), W379–W387.
[499] You, R., Zhang, Z., Xiong, Y., Sun, F., Mamitsuka, H., and Zhu, S. “GOLabeler:
improving sequence-based large-scale protein function prediction by learning to
rank.” Bioinformatics 34.14 (2018), pp. 2465–2473.
[500] Lees, J. G., Heriche, J. K., Morilla, I., Ranea, J. A., and Orengo, C. A. “Systematic com-
putational prediction of protein interaction networks”. Physical Biology 8.3 (2011),
p. 035008.
[501] Bilder, R. M., Sabb, F. W., Cannon, T. D., London, E. D., Jentsch, J. D., Parker, D. S.,
Poldrack, R. A., Evans, C., and Freimer, N. B. “Phenomics: The systematic study of
phenotypes on a genome-wide scale”. Neuroscience 164.1 (2009), pp. 30–42.
[502] Gigerenzer, G. Gut feelings: the intelligence of the unconscious. Penguin Books, 2014.
[503] Gligorijevic, V., Renfrew, P. D., Kosciolek, T., Leman, J. K., Cho, K., Vatanen,
T., Berenberg, D., Taylor, B., Fisk, I. M., Xavier, R. J., Knight, R., and Bonneau,
R. “Structure-Based Function Prediction using Graph Convolutional Networks”.
bioRxiv (2019), p. 786236.
[504] Bhardwaj, N. and Lu, H. “Correlation between gene expression profiles and
protein-protein interactions within and across genomes”. Bioinformatics (2005).
[505] Peng, J. and Xu, J. “Low-homology protein threading”. Bioinformatics (2010).
[506] Colin, P.-Y., Kintses, B., Gielen, F., Miton, C. M., Fischer, G., Mohamed, M. F., Hyvö-
nen, M., Morgavi, D. P., Janssen, D. B., and Hollfelder, F. “Ultrahigh-throughput dis-
covery of promiscuous enzymes by picodroplet functional metagenomics”. Nature
Communications 6.1 (2015), p. 10008.