DeepDTA: Deep Drug-Target Binding Affinity Prediction
doi: 10.1093/bioinformatics/bty593
ECCB 2018
Abstract
Motivation: The identification of novel drug–target (DT) interactions is a substantial part of the drug discovery process. Most of the computational methods that have been proposed to predict DT interactions have focused on binary classification, where the goal is to determine whether a DT pair interacts or not. However, protein–ligand interactions assume a continuum of binding strength values, also called binding affinity, and predicting this value remains a challenge. The increase in the affinity data available in DT knowledge bases allows the use of advanced learning techniques such as deep learning architectures in the prediction of binding affinities. In this study, we propose a deep-learning based model that uses only sequence information of both targets and drugs to predict DT interaction binding affinities. The few studies that focus on DT binding affinity prediction use either 3D structures of protein–ligand complexes or 2D features of compounds. One novel approach used in this work is the modeling of protein sequences and compound 1D representations with convolutional neural networks (CNNs).
Results: The results show that the proposed deep learning based model that uses the 1D representations of targets and drugs is an effective approach for drug–target binding affinity prediction. The model in which high-level representations of a drug and a target are constructed via CNNs achieved the best Concordance Index (CI) performance on one of our larger benchmark datasets, outperforming the KronRLS algorithm and SimBoost, a state-of-the-art method for DT binding affinity prediction.
Availability and implementation: https://github.com/hkmztrk/DeepDTA
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction

The successful identification of drug–target interactions (DTI) is a critical step in drug discovery. As the field of drug discovery expands with the discovery of new drugs, repurposing of existing drugs and identification of novel interacting partners for approved drugs are also gaining interest (Oprea and Mestres, 2012). Until recently, DTI prediction was approached as a binary classification problem (Bleakley and Yamanishi, 2009; Cao et al., 2012, 2014; Cobanoglu et al., 2013; Gönen, 2012; Öztürk et al., 2016; van Laarhoven et al., 2011; Yamanishi et al., 2008), neglecting an important piece of information about protein–ligand interactions, namely the binding affinity values. Binding affinity provides information on the strength of the interaction between a drug–target (DT) pair and is usually expressed in measures such as the dissociation constant (Kd), the inhibition constant (Ki) or the half maximal inhibitory concentration (IC50). IC50 depends on the concentration of the target and ligand (Cer et al., 2009), and low IC50 values signal strong binding. Similarly, low Ki values indicate high binding affinity. Kd and Ki values are usually represented in terms of pKd or pKi, the negative logarithm of the dissociation or inhibition constants.

In binary classification based DTI prediction studies, construction of the datasets constitutes a major step, since the designation of the negative (not-binding) samples directly affects the performance of the model. Over the last decade, most DTI studies utilized the four major datasets of Yamanishi et al. (2008), in which DT pairs with no known binding information are treated as negative (not-binding) samples. Recently, DTI studies that rely on databases with binding affinity information have been providing more realistic binary datasets created with a chosen binding affinity threshold value (Wan and Zeng, 2016). Formulating the DT prediction task as a binding affinity prediction problem enables the creation of more realistic datasets, where the binding affinity scores are directly used. Furthermore, a regression-based model brings the advantage of predicting an approximate value for the strength of the interaction between the drug and the target, which in turn would be significantly beneficial for limiting the large compound search space in drug discovery studies.

Prediction of protein–ligand binding affinities has been the focus of protein–ligand scoring, which is frequently used after virtual screening and docking campaigns in order to predict the putative strengths of the proposed ligands to the target (Ragoza et al., 2017). Non-parametric machine learning methods such as the Random Forest (RF) algorithm have been used as a successful alternative to scoring functions that depend on multiple parameters (Ballester and Mitchell, 2010).

In this study, we propose an approach to predict the binding affinities of protein–ligand interactions with deep learning models using only sequences (1D representations) of proteins and ligands. To this end, the sequences of the proteins and SMILES (Simplified Molecular Input Line Entry System) representations of the compounds are used rather than external features or 3D structures of the binding complexes. We employ CNN blocks to learn representations from the raw protein sequences and SMILES strings and combine these representations to feed into a fully connected layer block that we call DeepDTA. We use the Davis kinase binding affinity dataset (Davis et al., 2011) and the KIBA large-scale kinase inhibitor bioactivity data (He et al., 2017; Tang et al., 2014) to evaluate the performance of the proposed model.
weak binding affinities (Kd > 10 000 nM) or are not observed in the primary screen (Pahikkala et al., 2014). As such, they are true negatives.

The distribution of the KIBA scores is depicted in the right panel of Figure 1A. He et al. (2017) pre-processed the KIBA scores as follows: (i) for each KIBA score, its negative was taken, (ii) the minimum value among the negatives was chosen and (iii) the absolute value of the minimum was added to all negative scores, thus constructing the final form of the KIBA scores.

The compound SMILES strings of the Davis dataset were extracted from the PubChem compound database based on their PubChem CIDs (Bolton et al., 2008). For KIBA, first the ChEMBL IDs were converted into PubChem CIDs, and then the corresponding CIDs were used to extract the SMILES strings. Figure 1B illustrates the distribution of the lengths of the SMILES strings of the compounds in the Davis (left) and KIBA (right) datasets. For the compounds of the Davis dataset, the maximum length of a SMILES is 103, while the average length is 64. For the compounds of KIBA, the maximum length of a SMILES is 590, while the average length is 58.

The protein sequences of the Davis dataset were extracted from the UniProt protein database based on gene names/RefSeq accession numbers (Apweiler et al., 2004). Similarly, the UniProt IDs of the targets in the KIBA dataset were used to collect the protein sequences. Figure 1C (left panel) shows the lengths of the sequences of the proteins in the Davis dataset. The maximum length of a protein sequence is 2549 and the average length is 788 characters. Figure 1C (right panel) depicts the distribution of protein sequence lengths for the KIBA targets. The maximum length of a protein sequence is 4128 and the average length is 728 characters.

We should also note that the Smith–Waterman (S–W) similarity among proteins of the KIBA dataset is at most 60% for 99% of the protein pairs. The target similarity is at most 60% for 92% of the protein pairs for the Davis dataset. These statistics indicate that both datasets are non-redundant.

2.2 Input representation

We used integer/label encoding, which represents each input category with an integer. We scanned approximately 2 M SMILES strings
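The three KIBA pre-processing steps of He et al. (2017) described above (negate each score, find the minimum of the negated scores, shift all scores by its absolute value) can be sketched as follows; the scores used in the example are toy values for illustration, not real KIBA entries:

```python
def preprocess_kiba(scores):
    """Transform KIBA scores as described by He et al. (2017):
    (i) take the negative of each score, (ii) find the minimum
    among the negated scores and (iii) add the absolute value of
    that minimum to every negated score, so the final scores are
    non-negative and higher values mean stronger binding."""
    negated = [-s for s in scores]
    shift = abs(min(negated))
    return [s + shift for s in negated]

# Toy input (illustrative only):
print(preprocess_kiba([3.0, 0.0, -2.0]))  # [0.0, 3.0, 5.0]
```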
[C N = C = O] = [1 3 63 1 63 5]
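The label encoding above can be sketched with a small dictionary. The character-to-integer mapping below is hypothetical (chosen so the "CN=C=O" example reproduces the indices shown); the actual DeepDTA dictionary covers all characters observed in the scanned SMILES strings, and the padding length here is likewise an illustrative choice:

```python
# Hypothetical label dictionary; indices chosen to match the
# worked example, not the paper's actual character assignments.
CHAR_TO_INT = {"C": 1, "N": 3, "O": 5, "=": 63}

def label_encode(smiles: str, max_len: int) -> list:
    """Map each SMILES character to its integer label and
    zero-pad the result to a fixed length, so every compound
    yields an input vector of the same size."""
    codes = [CHAR_TO_INT[ch] for ch in smiles]
    return codes + [0] * (max_len - len(codes))

print(label_encode("CN=C=O", max_len=8))  # [1, 3, 63, 1, 63, 5, 0, 0]
```

Protein sequences are encoded the same way, with a dictionary over amino-acid characters instead of SMILES characters.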
where k is the kernel function. In order to represent compounds, they utilized a similarity matrix computed using the PubChem structure clustering server (PubChem Sim, http://pubchem.ncbi.nlm.nih.gov), a tool that utilizes single-linkage clustering and uses 2D properties of the compounds to measure their similarity. As for proteins, the Smith–Waterman algorithm was used to construct a protein similarity matrix (Smith and Waterman, 1981).

Fig. 3. Experiment setup

CI = \frac{1}{Z} \sum_{d_i > d_j} h(b_i - b_j)    (7)

where b_i is the prediction value for the larger affinity d_i and b_j is the prediction value for the smaller affinity d_j.
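Equation (7) can be computed directly by iterating over ordered pairs. The definition of the step function h is cut off in this excerpt, so the sketch below assumes the usual convention for the concordance index: h scores 1 for a correctly ordered pair, 0.5 for a tie in the predictions and 0 otherwise, with Z the number of pairs having d_i > d_j:

```python
def concordance_index(y_true, y_pred):
    """Concordance index per Equation (7): over all pairs with
    d_i > d_j, award 1 if the predictions preserve the ordering,
    0.5 if they are tied, 0 otherwise; normalize by the pair
    count Z (assumed step-function convention)."""
    num, z = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:  # d_i > d_j
                z += 1
                diff = y_pred[i] - y_pred[j]
                if diff > 0:
                    num += 1.0
                elif diff == 0:
                    num += 0.5
    return num / z

# Perfectly ordered predictions give CI = 1.0
print(concordance_index([5.0, 6.0, 7.0], [0.1, 0.2, 0.3]))  # 1.0
```

A CI of 0.5 corresponds to random ordering, which is why the baselines and DeepDTA are all well above that value in Tables 3 and 4.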
Table 2. Parameter settings for the CNN-based DeepDTA model

  Parameters                 Range
  Number of filters          32*1; 32*2; 32*3
  Filter length (compounds)  [4, 6, 8]
  Filter length (proteins)   [4, 8, 12]
  epoch                      100
  hidden neurons             1024; 1024; 512
  batch size                 256
  dropout                    0.1
  optimizer                  Adam
  learning rate (lr)         0.001

Table 3. The average CI and MSE scores of the test set trained on five different training sets for the Davis dataset

Table 4. The average CI and MSE scores of the test set trained on five different training sets for the KIBA dataset

                                     Proteins  Compounds    CI (std)        MSE
  KronRLS (Pahikkala et al., 2014)   S–W       Pubchem Sim  0.782 (0.0009)  0.411
  SimBoost (He et al., 2017)         S–W       Pubchem Sim  0.836 (0.001)   0.222
  DeepDTA                            S–W       Pubchem Sim  0.710 (0.002)   0.502
  DeepDTA                            CNN       Pubchem Sim  0.718 (0.004)   0.571
  DeepDTA                            S–W       CNN          0.854 (0.001)   0.204

  Note: The standard deviations are given in parentheses.

Table 5. The average r_m^2 and AUPR scores of the test set trained on five different training sets for the Davis dataset
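As a rough illustration of what one DeepDTA CNN branch computes, the numpy sketch below runs a single 1D convolution block over an embedded sequence and applies global max pooling to obtain a fixed-size representation. The weights are random and the embedding width (128) is an illustrative choice, not the trained configuration; Table 2's "32*1; 32*2; 32*3" is read here as 32 filters in the first block, with 64 and 96 in the deeper blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_block(x, n_filters=32, kernel_len=4):
    """One CNN block, sketched: a 1D convolution over an embedded
    sequence followed by ReLU. Weights are random here; in the
    trained model they are learned."""
    seq_len, emb_dim = x.shape
    w = rng.standard_normal((n_filters, kernel_len, emb_dim)) * 0.1
    out = np.zeros((seq_len - kernel_len + 1, n_filters))
    for t in range(out.shape[0]):
        window = x[t:t + kernel_len]  # (kernel_len, emb_dim)
        out[t] = np.maximum(0.0, np.einsum("kd,fkd->f", window, w))
    return out

# An embedded SMILES of length 85 with 128-dim embeddings
# (illustrative sizes only).
x = rng.standard_normal((85, 128))
h = conv1d_block(x, n_filters=32, kernel_len=4)  # shape (82, 32)
feat = h.max(axis=0)  # global max pooling -> 32-dim feature vector
print(feat.shape)
```

In the full model, the pooled drug and protein feature vectors are concatenated and passed through the fully connected layers (1024; 1024; 512 hidden neurons with dropout 0.1, per Table 2) to produce the affinity prediction.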
employed to tasks such as detecting homology (Hochreiter et al., 2007), constructive peptide design (Muller et al., 2018) and function prediction (Liu, 2017) that utilize amino-acid sequences. As future work, we also aim to utilize a recent ligand-based protein representation method proposed by our team that uses SMILES sequences of the interacting ligands to describe proteins (Öztürk et al., 2018).

The results indicated that deep-learning based methodologies performed notably better than the baseline methods, with statistical significance, as the dataset grows in size; the KIBA dataset is four times larger than the Davis dataset. The improvement over the baseline was significantly higher for the KIBA dataset (from a CI score of 0.836 to 0.863) compared to the Davis dataset (from a CI score of

References

Cer,R.Z. et al. (2009) IC50-to-Ki: a web-based tool for converting IC50 to Ki values for inhibitors of enzyme activity and ligand binding. Nucleic Acids Res., 37, W441–W445.
Chan,K.C. et al. (2016) Large-scale prediction of drug–target interactions from deep representations. In: 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada. IEEE, pp. 1236–1243.
Chen,T. and Guestrin,C. (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA. ACM, pp. 785–794.
Chen,T. and He,T. (2015) Higgs boson discovery with boosted trees. In: NIPS 2014 Workshop on High-Energy Physics and Machine Learning, Montreal, Canada, pp. 69–80.
Chetlur,S. et al. (2014) cuDNN: efficient primitives for deep learning. arXiv preprint.
LeCun,Y. et al. (2015) Deep learning. Nature, 521, 436–444.
Leung,M.K. et al. (2014) Deep learning of the tissue-regulated splicing code. Bioinformatics, 30, i121–i129.
Li,H. et al. (2015) Low-quality structural and interaction data improves binding affinity prediction via random forest. Molecules, 20, 10947–10962.
Liu,X. (2017) Deep recurrent neural network for protein function prediction from sequence. arXiv preprint arXiv:1701.08318.
Ma,J. et al. (2015) Deep neural nets as a method for quantitative structure–activity relationships. J. Chem. Inf. Model., 55, 263–274.
Muller,A.T. et al. (2018) Recurrent neural network model for constructive peptide design. J. Chem. Inf. Model., 58, 472–479.
Nair,V. and Hinton,G.E. (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, pp. 807–814.
predictions: emphasis on scaling of response data. J. Comput. Chem., 34, 1071–1082.
Shar,P.A. et al. (2016) Pred-binding: large-scale protein–ligand binding affinity prediction. J. Enzyme Inhib. Med. Chem., 31, 1443–1450.
Simonyan,K. and Zisserman,A. (2015) Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 7–9, 2015.
Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
Srivastava,N. et al. (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15, 1929–1958.
Tang,J. et al. (2014) Making sense of large-scale kinase inhibitor bioactivity