doi: 10.1093/bioinformatics/btaa524
Advance Access Publication Date: 19 May 2020
Original Paper
Structural bioinformatics
TransformerCPI: improving compound–protein
interaction prediction by sequence-based deep learning
1Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China, 2University of Chinese Academy of Sciences, Beijing 100049, China and 3Shanghai Institute for Advanced Immunochemical Studies, School of Life Science and Technology, ShanghaiTech University, Shanghai 200031, China
*To whom correspondence should be addressed.
Associate Editor: Arne Elofsson
Received on February 26, 2020; revised on April 13, 2020; editorial decision on May 12, 2020; accepted on May 14, 2020
Abstract
Motivation: Identifying compound–protein interaction (CPI) is a crucial task in drug discovery and chemogenomics
studies, and proteins without three-dimensional structure account for a large part of potential biological targets,
which calls for methods that can predict CPI from protein sequence information alone. However, sequence-
based CPI models may face some specific pitfalls, including using inappropriate datasets, hidden ligand bias and
splitting datasets inappropriately, resulting in overestimation of their prediction performance.
Results: To address these issues, we here constructed new datasets specific for CPI prediction, proposed a novel
transformer neural network named TransformerCPI, and introduced a more rigorous label reversal experiment to
test whether a model learns true interaction features. TransformerCPI achieved much improved performance on the
new experiments, and it can be deconvolved to highlight important interacting regions of protein sequences and
compound atoms, which may provide chemical biology studies with useful guidance for further ligand structural optimization.
Availability and implementation: https://github.com/lifanchen-simm/transformerCPI.
Contact: [email protected] or [email protected]
1 Introduction

Identifying compound–protein interactions (CPIs) plays an important role in discovering hit compounds (Vamathevan et al., 2019). Conventional methods, such as structure-based virtual screening and ligand-based virtual screening, have been studied for decades and have gained great success in drug discovery. However, conventional screening methods are not applicable in some cases, for example when the protein three-dimensional (3D) structure is unknown or the set of known ligands is too small. Therefore, Bredel and Jacoby (2004) introduced a novel perspective called chemogenomics to predict CPI without protein 3D structures. A variety of machine learning-based algorithms have been proposed since then, which consider compound information and protein information at the same time in a unified model (Bleakley and Yamanishi, 2009; Cheng et al., 2012; Gonen, 2012; Jacob and Vert, 2008; van Laarhoven et al., 2011; Wang et al., 2011; Wang and Zeng, 2013; Yamanishi et al., 2008).

With the rapid development of deep learning, many types of end-to-end frameworks have been utilized in CPI research. In comparison with traditional machine learning algorithms, end-to-end learning integrates representation learning and model training in a unified architecture, so that no descriptors need to be defined and calculated before modeling. Although deep neural networks have been used in several CPI models, these methods take predefined molecular fingerprints and protein descriptors as input features, which are fixed during the training process and contain less information than end-to-end learned representations (Hamanaka et al.,
2017; Tian et al., 2016; Wan and Zeng, 2016). Regarding the CPI problem as a binary classification task, compounds can be considered as 1D sequences or molecular graphs (i.e. traditionally called 2D structures), and protein sequences can be regarded as 1D sequences. DeepDTA (Ozturk et al., 2018) used convolutional neural networks (CNNs) to extract low-dimensional real-valued features of compounds and proteins, then concatenated the two feature vectors and passed them through fully connected layers to calculate the final output. WideDTA (Öztürk et al., 2019) and Conv-DTI (Lee et al., 2019) followed a similar idea, with WideDTA additionally utilizing two extra features, ligand maximum common structures and protein motifs and domains, to improve model performance. From another perspective that regards a compound structure as a molecular graph, CPI–GNN (Tsubaki et al., 2019) and GraphDTA (Nguyen et al., 2019) used graph neural networks to learn compound representations.

It is necessary to investigate whether a model learns in the manner expected. The hidden ligand bias issue has been reported for the DUD-E and MUV datasets (Sieg et al., 2019), raising extensive concerns in the field of drug design. Structure-based virtual screening, 3D-CNN-based models (Chen et al., 2019) and other models trained on the DUD-E dataset (Sieg et al., 2019) have been shown to make predictions mainly based on ligand patterns rather than interaction features, leading to a mismatch between theoretical modeling and practical application. We wondered whether chemogenomics-based CPI modeling faces a similar problem, and thus revisited a previously reported model, CPI–GNN trained on the Human dataset, as an example to study the potential effects of hidden ligand bias.

Figure 1A shows the weight distribution plot of the CPI–GNN model trained on the Human dataset. Judging from the weights of the CNN blocks used to process protein features, compounds were always predicted to interact or not interact with different proteins. These results highlighted the possibility that ligand patterns can mislead the model.
1.3 Splitting dataset inappropriately
The risk of hidden ligand bias is difficult to eliminate but can be reduced. Usually, machine learning researchers split data into training and test sets at random. However, using conventional classification measurements on a randomly split test set, we cannot tell whether the model learns true interaction features or other unexpected hidden variables, which may produce precise models that answer the wrong questions (Riley, 2019). Thus, test sets should be designed according to the real goal of modeling and its application, with experiments that evaluate whether a data-driven model falls into common pitfalls of AI. As a result, TransformerCPI achieved the best performance on three public datasets and two label reversal datasets. Moreover, we further studied the interpretability of TransformerCPI to uncover its underlying prediction mechanism by mapping attention weights back to protein sequences and compound molecules, and the results also confirmed that the self-attention mechanism of TransformerCPI is useful in capturing the desired interaction features. We hope that these findings may draw attention to improving the generalization and interpretation capability of CPI modeling.
2 Materials and methods

2.1 Model architecture of TransformerCPI
The model we propose is based on the transformer architecture (Vaswani et al., 2017), which was originally devised for neural machine translation tasks. Transformer is an autoregressive encoder–decoder model that uses a combination of multiheaded attention layers and position-wise feed-forward layers to solve sequence-to-sequence (seq2seq) tasks. Recently, the transformer architecture has achieved great success in language representation learning, and many novel and powerful pre-training models have been established, such as BERT (Devlin et al., 2019), GPT-2, Transformer-XL (Dai et al., 2019) and XLNet (Yang et al., 2019). Transformer has also been applied to chemical reaction prediction (Schwaller et al., 2019); however, it is still confined to seq2seq tasks. Inspired by its great ability to capture features between two sequences, we modified the transformer architecture to predict CPI, regarding compounds and proteins as two kinds of sequences. An overview of the proposed TransformerCPI is shown in Figure 2, where we retained the decoder of the transformer and modified its encoder and final linear layers.

To convert protein sequences into a sequential representation, we first split a protein sequence into an overlapping 3-gram amino acid sequence, and then translated all words into real-valued embeddings by the pretraining approach word2vec (Mikolov et al., 2013a,b). Word2vec is an unsupervised technique to learn high-quality distributed vector representations that describe sophisticated syntactic and semantic word relationships, comprising two pretraining techniques called Skip-Gram and Continuous Bag-of-Words (CBOW). Skip-Gram is used to predict the context from a given word, while CBOW is used to predict a certain word from its context. Integrating Skip-Gram and CBOW, word2vec can finally map words to low-dimensional real-valued vectors, where words that have similar semantics map to vectors that are close to each other. There have been some works applying word2vec to represent protein sequences (Kimothi et al., 2016; Kobeissy et al., 2015; Mazzaferro and Carlo, 2017; Yang et al., 2018), in which amino acid subsequences of constant length k (k-mers) were treated as words and the whole amino acid sequence was regarded as a document. We followed these works to preprocess protein sequences, included all human protein sequences in UniProt as the corpus to pretrain the word2vec model, and set the hidden dimension to 100. After training the word2vec model for 30 epochs on the large corpus we built, protein sequences can be inferred to real-valued 100-dimensional vectors.
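As a concrete illustration of this preprocessing step, the following is a minimal sketch using the gensim Word2Vec API (gensim ≥ 4 assumed) on a toy corpus; the placeholder sequences, window size and skip-gram choice are assumptions, and only the 100-dimensional embeddings and 30 training epochs are taken from the text above.

```python
# Minimal sketch: 3-gram protein "words" embedded with word2vec.
# Assumptions: gensim >= 4 API; `toy_corpus` stands in for the full set of
# human UniProt sequences; window size and skip-gram choice are illustrative.
from gensim.models import Word2Vec

def to_trigrams(seq):
    """Split an amino acid sequence into overlapping 3-gram words."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

toy_corpus = [
    "MTEYKLVVVGAGGVGKSALTIQLIQNHFV",    # placeholder sequences, not real training data
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLI",
]
sentences = [to_trigrams(s) for s in toy_corpus]

w2v = Word2Vec(
    sentences=sentences,
    vector_size=100,   # 100-dimensional embeddings, as stated above
    window=5,          # assumed context window
    min_count=1,
    sg=1,              # skip-gram variant; CBOW would use sg=0
    epochs=30,         # 30 training epochs, as stated above
)

# A protein is then encoded as a sequence of 100-d vectors, one per 3-gram.
protein_vectors = [w2v.wv[w] for w in to_trigrams(toy_corpus[0])]
print(len(protein_vectors), len(protein_vectors[0]))   # n_trigrams, 100
```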
Sequential feature vectors of proteins were then passed to the encoder to learn more abstract representations of proteins. Of note, here we replaced the original self-attention layers in the encoder with a relatively simple structure. Considering that the conventional transformer architecture usually requires a large training corpus and easily overfits on small or modestly sized datasets (Qiu et al., 2020), we used a gated convolutional network (Dauphin et al., 2016) with Conv1D and gated linear units instead, because it showed better performance on our designed datasets. The input to the gated convolutional network is a sequence of protein feature vectors. We compute the hidden layers h_0, \ldots, h_L as Equation (1):

h_l(X) = (X * W_1 + b) \otimes \sigma(X * W_2 + c),   (1)

where X \in \mathbb{R}^{n \times m_1} is the input of the layer, W_1, W_2 \in \mathbb{R}^{k \times m_1 \times m_2} and b, c \in \mathbb{R}^{m_2} are learned parameters, L is the number of hidden layers, n is the sequence length, m_1 and m_2 are the dimensions of the input and hidden features, respectively, k is the patch size, \sigma is the sigmoid function and \otimes is the element-wise product between matrices (Dauphin et al., 2016). The output of the gated convolutional network is the final representation of protein sequences, as shown in Figure 2. In our implementation, L is 3, m_1 is 64, m_2 is 128 and k is 7. The output of the encoder is the protein sequence p_1, p_2, \ldots, p_b, where b is the length of the protein sequence.
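To make Equation (1) concrete, a minimal PyTorch sketch of such a gated convolutional (GLU) encoder is shown below, using the layer sizes quoted above (L = 3, m1 = 64, m2 = 128, k = 7). It is an illustration of the gated linear unit under those assumptions, not the released TransformerCPI implementation; padding and layer stacking details are guesses.

```python
# Sketch: gated convolutional encoder in the spirit of Equation (1),
# h(X) = (X * W1 + b) (x) sigmoid(X * W2 + c), built from Conv1d layers.
import torch
import torch.nn as nn

class GatedConvEncoder(nn.Module):
    def __init__(self, in_dim=64, hid_dim=128, kernel_size=7, n_layers=3):
        super().__init__()
        pad = (kernel_size - 1) // 2  # keep the sequence length unchanged
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            d_in = in_dim if i == 0 else hid_dim
            # one Conv1d produces both the linear branch and the gate branch
            self.layers.append(nn.Conv1d(d_in, 2 * hid_dim, kernel_size, padding=pad))

    def forward(self, x):
        # x: (batch, seq_len, in_dim); Conv1d expects (batch, channels, seq_len)
        h = x.transpose(1, 2)
        for conv in self.layers:
            a, b = conv(h).chunk(2, dim=1)   # linear branch and gate branch
            h = a * torch.sigmoid(b)         # gated linear unit
        return h.transpose(1, 2)             # (batch, seq_len, hid_dim)

enc = GatedConvEncoder()
protein = torch.randn(2, 50, 64)             # toy batch of protein feature vectors
print(enc(protein).shape)                    # torch.Size([2, 50, 128])
```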
Each atom was initially represented as a feature vector of size 34 using the RDKit Python package; the list of atom features is summarized in Table 1. We then used graph convolutional networks (GCNs) to learn the representation of each atom by integrating its neighbor atom features. The GCN was originally devised to solve the problem of semisupervised node classification, and it can be transferred to the molecular representation problem. We here denote a graph for a compound molecule as G = (V, E), where V \in \mathbb{R}^{a \times f} is the set of a atoms in a molecule, each represented as an f-dimensional feature vector, and E is the set of covalent bonds in a molecule, represented as an adjacency matrix A \in \mathbb{R}^{a \times a}. The propagation rule is shown in Equation (2):

H^{(l+1)} = f(H^{(l)}, A) = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right),   (2)

where \tilde{A} = A + I, I is the identity matrix, H^{(l)} \in \mathbb{R}^{a \times f} is the output of the l-th layer, W^{(l)} \in \mathbb{R}^{f \times f} is the weight matrix of the l-th neural network layer, \tilde{D} \in \mathbb{R}^{a \times a} is the diagonal node degree matrix of \tilde{A} \in \mathbb{R}^{a \times a}, and \sigma(\cdot) is a nonlinear activation function. In our implementation, we chose f to be 34 and the number of GCN layers to be 1. After processing by the GCN layer, the atom sequence c_1, c_2, \ldots, c_a is obtained, where a is the number of atoms.

Table 1. List of compound atom features

Feature                               Values
Atom type                             C, N, O, F, P, S, Cl, Br, I, other (one hot)
Degree of atom                        0, 1, 2, 3, 4, 5, 6 (one hot)
Formal charge                         0 or 1
Number of radical electrons           0 or 1
Hybridization type                    sp, sp2, sp3, sp3d, sp3d2, other (one hot)
Aromatic                              0 or 1
Number of hydrogen atoms attached     0, 1, 2, 3, 4 (one hot)
Chirality                             0 (False) or 1 (True)
Configuration                         R, S (one hot)
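The propagation rule in Equation (2) can be written compactly in PyTorch; the sketch below handles a single molecule with f = 34 features and one layer, as stated above. The ReLU activation and the toy adjacency matrix are illustrative assumptions rather than details taken from the paper.

```python
# Sketch: one GCN layer implementing Equation (2),
# H' = sigma(D^-1/2 (A + I) D^-1/2 H W), for a single molecule.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_feats=34, out_feats=34):
        super().__init__()
        self.weight = nn.Linear(in_feats, out_feats, bias=False)

    def forward(self, h, adj):
        # h: (num_atoms, in_feats), adj: (num_atoms, num_atoms) bond adjacency
        a_tilde = adj + torch.eye(adj.size(0))        # add self-loops: A + I
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)     # D^-1/2 from node degrees
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm_adj @ self.weight(h))  # sigma(norm_adj H W)

gcn = GCNLayer()
atoms = torch.randn(5, 34)          # toy molecule with 5 atoms
bonds = torch.zeros(5, 5)
bonds[0, 1] = bonds[1, 0] = 1.0     # a single covalent bond for illustration
print(gcn(atoms, bonds).shape)      # torch.Size([5, 34])
```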
When the protein sequence representation and the atom representation were obtained, we had successfully converted proteins and compounds into two sequences, which fit the transformer architecture. Interaction features are learned through the decoder of the transformer, which consists of self-attention layers and feed-forward layers. In our work, the protein sequence is the input of the encoder, while the atom sequence is the input of the decoder, and the output of the decoder is the interaction sequence, which contains the interaction features and has the same length as the atom sequence.
Given that the order of atom feature vectors has no effect on CPI modeling, we removed positional embeddings in TransformerCPI.

The key technique in the decoder is the multiheaded self-attention layer. A multiheaded self-attention layer consists of several scaled-dot attention layers that extract interaction information between the encoder and the decoder. The self-attention layer takes three inputs, the keys K, the values V and the queries Q, and calculates the attention as follows:

\mathrm{attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V,   (3)

where d_k is a scaling factor depending on the layer size.
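A direct transcription of Equation (3) is shown below as a small PyTorch function; in TransformerCPI the queries come from the atom (decoder) side and the keys and values from the protein (encoder) side, but the tensor shapes here are toy values and multi-head splitting is omitted.

```python
# Sketch: scaled dot-product attention from Equation (3).
import torch
import torch.nn.functional as F

def scaled_dot_attention(q, k, v):
    # q: (batch, n_q, d_k), k: (batch, n_k, d_k), v: (batch, n_k, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ v, weights

atoms = torch.randn(1, 20, 64)      # queries: compound atom features
protein = torch.randn(1, 300, 64)   # keys/values: encoded protein 3-grams
out, attn = scaled_dot_attention(atoms, protein, protein)
print(out.shape, attn.shape)        # (1, 20, 64) (1, 20, 300)
```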
The final interaction feature vector is calculated as the weighted sum of the interaction vectors with attention weights:

y_{\mathrm{interaction}} = \sum_{i=1}^{a} \alpha_i x_i.   (6)

At last, the final interaction feature vector y_{\mathrm{interaction}} is fed to the following fully connected layers, and the probability \hat{y} that a compound interacts with a protein is returned. As a conventional binary classification task, we used the binary cross entropy loss to train the TransformerCPI model:

\mathrm{Loss} = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right].   (7)
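The following sketch ties Equations (6) and (7) together: attention-weighted pooling of the interaction vectors followed by fully connected layers and a binary cross entropy loss. How the per-atom weights α_i are computed is an assumption here (a simple linear scoring layer), since the equations defining them are not reproduced above, and the layer sizes are placeholders.

```python
# Sketch: pooling the interaction sequence with attention weights (Eq. 6)
# and training with binary cross entropy (Eq. 7, in its numerically stable form).
import torch
import torch.nn as nn

class InteractionHead(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # assumed: attention weights from a linear layer
        self.classifier = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        # x: (batch, num_atoms, dim) interaction vectors from the decoder
        alpha = torch.softmax(self.score(x), dim=1)    # per-atom weights alpha_i
        y_inter = (alpha * x).sum(dim=1)               # Eq. (6): weighted sum
        return self.classifier(y_inter).squeeze(-1)    # interaction logit

head = InteractionHead()
interaction_seq = torch.randn(8, 20, 64)               # toy decoder output
labels = torch.randint(0, 2, (8,)).float()
loss = nn.BCEWithLogitsLoss()(head(interaction_seq), labels)   # Eq. (7)
print(loss.item())
```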
experimentally validated database; (ii) each ligand should exist in both classes. Many previous studies generated negative samples by random cross-combination of CPI pairs or by using similarity-based approaches, which may introduce unexpected noise and unnoticed bias. Here, we compiled negative data that have been experimentally validated.

First, we constructed a GPCR dataset from the GLASS database (Chan et al., 2015). The GLASS database provides a great amount of experimentally validated GPCR–ligand associations (Chan et al., 2015), which satisfies our first rule. The GLASS database uses IC50, Ki and EC50 as binding affinity values, which were transformed into their negative logarithms, pIC50, pKi and pEC50. Following earlier works (Liu et al., 2007; Wan et al., 2019), a threshold of 6.0 was set to divide the original dataset into a positive set and a negative set.

Table 3. Summary of the datasets

Dataset    Proteins    Compounds    Interactions    Positive    Negative
GPCR       356         5359         15 343          7989        7354
Kinase     229         1644         111 237         23 190      88 047
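As a sketch of the activity-threshold labeling described above, the snippet below assumes a pandas DataFrame with hypothetical column names and treats pairs at or above pActivity 6.0 as positive; the exact boundary handling and column naming in the released preprocessing scripts may differ.

```python
# Sketch: labeling compound-protein pairs by activity threshold.
# Column names and example rows are hypothetical placeholders.
import pandas as pd

df = pd.DataFrame({
    "protein_id": ["prot1", "prot2", "prot1"],
    "smiles": ["CCO", "c1ccccc1", "CCN"],
    "pActivity": [7.2, 5.1, 6.0],          # pIC50 / pKi / pEC50 values
})

THRESHOLD = 6.0                             # threshold quoted in the text
df["label"] = (df["pActivity"] >= THRESHOLD).astype(int)   # 1 = positive, 0 = negative
positives, negatives = df[df.label == 1], df[df.label == 0]
print(len(positives), len(negatives))
```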
machines (SVM), newly reported sequence-based models CPI–GNN (Tsubaki et al., 2019) and DrugVQA (Zheng et al., 2020) have been evaluated on these datasets. GraphDTA (Nguyen et al., 2019) was originally designed for a regression task; here, we tailored its last layer to the binary classification task. It should be noted that models relying on 3D structural information of proteins are not compared here, due to the absence of such information for these two datasets. We followed the same training and evaluation strategies as CPI–GNN (Tsubaki et al., 2019) and repeated the experiments with three different random seeds, following DrugVQA (Zheng et al., 2020), to evaluate TransformerCPI; the Area Under the Receiver Operating Characteristic Curve (AUC), precision and recall of each model are shown in Tables 4 and 5. Since the implementations of KNN, RF, L2 and SVM are not mentioned in the literature (Tsubaki et al., 2019), these models are not compared on
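The reported metrics can be computed with scikit-learn as sketched below on hypothetical predictions; the 0.5 decision cutoff for precision and recall is an assumption, and in the protocol above the whole evaluation is repeated over three random seeds.

```python
# Sketch: AUC, PRC (average precision), precision and recall on toy predictions.
import numpy as np
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # hypothetical labels
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])   # model probabilities

auc = roc_auc_score(y_true, y_score)
prc = average_precision_score(y_true, y_score)
y_pred = (y_score >= 0.5).astype(int)                           # assumed 0.5 cutoff
print(f"AUC={auc:.3f} PRC={prc:.3f} "
      f"precision={precision_score(y_true, y_pred):.3f} "
      f"recall={recall_score(y_true, y_pred):.3f}")
```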
patterns of the GPCR dataset may bring a non-negligible influence as ligand bias into CPI–GNN. On the Kinase set, TransformerCPI outperforms CPI–GNN, GraphDTA and GCN in terms of AUC and PRC, and the AUCs of the reference models are all smaller than 0.5, so we argue that ligand patterns of the Kinase dataset may have brought a non-negligible influence into all reference models. Moreover, GraphDTA and GCN achieved good performance on the GPCR dataset, close to that of TransformerCPI, but performed much worse on the Kinase set. In comparison, TransformerCPI achieved the best performance on both datasets, revealing its robustness and generalization ability. Overall, these results suggested that our proposed TransformerCPI possesses the capability of learning interactions between proteins and ligands, and the label reversal experiments can effectively assess the impact of hidden ligand bias on models, and, more importantly, the proposed
decoder by conventional vector concatenation on the same label reversal experiment. As shown in Figure 3C, this ablation procedure significantly compromised the performance of TransformerCPI on both the GPCR set and the Kinase set, demonstrating that the self-attention mechanism together with the encoder–decoder architecture indeed plays a key role in extracting CPI features between the two types of sequences.

to different proteins. This result also explains why TransformerCPI shows better performance in the label reversal experiments. Dynamic feature extraction by TransformerCPI based on a specific protein context helps the model extract the key information of the interaction, while also reducing the probability of hidden ligand bias. Moreover, the decoder of TransformerCPI integrates the features of protein sequences and compound atoms dynamically to form direct interaction features, which is similar to a language translation task and agrees well with the binding process of ligands to proteins.
3.5 Model interpretation
Although deep learning is known as a black-box algorithm, it is essential to understand how the model makes a prediction and whether the model can provide suggestions or guidance for optimization. Due to the transformer architecture and self-attention

To further verify the meaning of the attention weights of atoms, we selected the compound phenothiazine to show the interpretation of TransformerCPI. Phenothiazine is a classic antipsychotic drug targeting the dopamine receptor, and its structure–activity relationship (SAR) has been thoroughly explored. As illustrated in Figure 4B, the
Fig. 5. Attention weights of protein sequences. The regions in proteins that have high attention weights extracted from TransformerCPI are highlighted in purple. (A) Attention weights of the histamine H1 receptor (PDB: 3RZE). (B) Attention weights of the 5-HT1B receptor (PDB: 4IAQ). (C) Attention weights of MAPK8 (PDB: 1UKI).
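One simple way to surface such attention-based interpretations is sketched below: given a cross-attention weight tensor from the decoder, the weights are averaged over heads and summed over the opposite axis to rank protein positions and compound atoms. This is an illustrative post-processing recipe under that assumption, not the deconvolution procedure used to produce Figures 4 and 5.

```python
# Sketch: ranking protein positions and compound atoms by attention weight.
# Assumes `attn` is a cross-attention tensor of shape (heads, n_atoms, protein_len);
# the values below are random placeholders.
import torch

def rank_by_attention(attn, top_k=5):
    mean_attn = attn.mean(dim=0)             # average over attention heads
    protein_scores = mean_attn.sum(dim=0)    # importance per protein position
    atom_scores = mean_attn.sum(dim=1)       # importance per compound atom
    return (protein_scores.topk(top_k).indices,
            atom_scores.topk(min(top_k, atom_scores.numel())).indices)

attn = torch.rand(8, 20, 300)                # toy: 8 heads, 20 atoms, 300 3-grams
top_positions, top_atoms = rank_by_attention(attn)
print(top_positions.tolist(), top_atoms.tolist())
```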
4 Conclusion

In this work, a transformer architecture with a self-attention mechanism was modified to address the sequence-based CPI classification task, resulting in a model named TransformerCPI that shows high performance on three benchmark datasets. Intriguingly, we compared it with previously reported CPI models and conventional machine learning-based control models, and noticed that most of these models yielded impressive results on those benchmark tests. Given the challenging nature of CPI prediction, we argue that these models might face potential pitfalls of deep learning. To address these potential risks, we constructed new datasets specific for the chemogenomics-based CPI task, and designed more rigorous label reversal experiments as new measurements for the chemogenomics-based CPI task.

References

Gilson,M.K. et al. (2016) BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res., 44, D1045–D1053.
Gonen,M. (2012) Predicting drug–target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics, 28, 2304–2310.
Gunther,S. et al. (2007) SuperTarget and Matador: resources for exploring drug–target relationships. Nucleic Acids Res., 36, D919–D922.
Hamanaka,M. et al. (2017) CGBVS-DNN: prediction of compound–protein interactions based on deep learning. Mol. Inform., 36, 1–2.
He,T. et al. (2017) SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines. J. Cheminform., 9, 24.
Jacob,L. and Vert,J.P. (2008) Protein–ligand interaction prediction: an improved chemogenomics approach. Bioinformatics, 24, 2149–2156.
Tsubaki,M. et al. (2019) Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics, 35, 309–318.
Vamathevan,J. et al. (2019) Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov., 18, 463–477.
van Laarhoven,T. et al. (2011) Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics, 27, 3036–3043.
Vaswani,A. et al. (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010. Curran Associates Inc., Long Beach, CA, USA.
Wan,F. and Zeng,J. (2016) Deep learning with feature embedding for compound–protein interaction prediction. bioRxiv, doi:10.1101/086033.
Wan,F. et al. (2019) DeepCPI: a deep learning-based framework for large-scale in silico drug screening. Genomics Proteomics Bioinf., 17, 478–495.
Wang,Y. and Zeng,J. (2013) Predicting drug–target interactions using restricted Boltzmann machines. Bioinformatics, 29, i126–i134.
Wishart,D.S. et al. (2008) DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res., 36, D901–D906.
Yamanishi,Y. et al. (2008) Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24, i232–i240.
Yang,K.K. et al. (2018) Learned protein embeddings for machine learning. Bioinformatics, 34, 2642–2648.
Yang,Z. et al. (2019) XLNet: generalized autoregressive pretraining for language understanding. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL 2019), Florence, Italy, pp. 2978–2988. Association for Computational Linguistics.
Zhang,M. et al. (2019) Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems.