
Review · www.molinf.com
DOI: 10.1002/minf.201501008 · Mol. Inf. 2016, 35, 3 – 14
© 2016 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Deep Learning in Drug Discovery


Erik Gawehn,[a] Jan A. Hiss,[a] and Gisbert Schneider*[a]

Abstract: Artificial neural networks had their first heyday in molecular informatics and drug discovery approximately two decades ago. Currently, we are witnessing renewed interest in adapting advanced neural network architectures for pharmaceutical research by borrowing from the field of “deep learning”. Compared with some of the other life sciences, their application in drug discovery is still limited. Here, we provide an overview of this emerging field of molecular informatics, present the basic concepts of prominent deep learning methods and offer motivation to explore these techniques for their usefulness in computer-assisted drug discovery and design. We specifically emphasize deep neural networks, restricted Boltzmann machine networks and convolutional networks.
Keywords: bioinformatics · cheminformatics · drug design · machine-learning · neural network · virtual screening

1 Introduction

Machine-learning provides a theoretical framework for the discovery and prioritization of bioactive compounds with desired pharmacological effects and their optimization as drug-like leads. Biological target identification and protein design are emerging areas of application. Among the many machine-learning approaches in molecular informatics, chemocentric methods have found widespread application. Their underlying logic typically follows three steps. First, there is the selection of a problem-specific set of descriptors that are believed to capture the essential properties of the molecules involved. At present, there are over 5000 diverse molecular representations (“descriptors”) that address the various properties of chemical entities.[1] Second, a metric or scoring scheme is used to compare the encoded molecules to one another.[2] Finally, a machine-learning algorithm is employed to identify the features that may serve to qualitatively or quantitatively distinguish the active from the inactive compounds.[3] Artificial neural networks (ANNs) were among the first methods borrowed from the computer sciences for this purpose.[4]
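To make the second and third of these steps concrete, the short Python sketch below compares two binary fingerprint descriptors with the Tanimoto coefficient, one of the similarity metrics commonly used for this purpose; the fingerprints themselves are invented toy values, not data from any of the cited studies.

```python
# Minimal sketch of step 2 of the chemocentric workflow: comparing two
# molecules via their binary descriptor vectors ("fingerprints") with the
# Tanimoto coefficient. The fingerprints below are invented toy examples.
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints given as sets of 'on' bits."""
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# toy fingerprints: indices of descriptor bits that are set
mol_query  = {1, 5, 8, 13, 21}
mol_screen = {1, 5, 9, 13, 34}

print(tanimoto(mol_query, mol_screen))   # 3 shared bits out of 7 -> ~0.43
```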
In 2013, public attention was drawn to a multi-problem QSAR machine-learning challenge in drug discovery posted by Merck. This competition on drug property and activity prediction was won by a deep learning network with a relative accuracy improvement of approximately 14 % over Merck’s in-house systems and resulted in an article in The New York Times.[5] Here, we present the state-of-the-art of advanced chemocentric machine-learning methods with a focus on emerging “deep learning” concepts. We highlight recent advances in the field and point to prospective applications and developments of this potentially game-changing technology for drug discovery.

A general task for machine-learning is to uncover the relationship between the molecular descriptors used and the measured activity of the compounds to obtain qualitative classifiers or quantitative structure-activity relationship (QSAR) models. Feature extraction from the descriptor patterns is the decisive step in the model development process.[6,7] In current cheminformatics applications, the prevalent machine-learning architectures are “shallow” and contain a single layer of feature transformation. These architectures include linear and nonlinear principal component analysis, k-means clustering methods, partial least squares projection to latent structures, decision trees, multivariate linear regression, linear discriminant analysis, support vector machines (SVMs), logistic and kernel regression, multi-layer Perceptrons and related neural network approaches.[8] Although all of these methods have proven to be useful for (Q)SAR modeling and molecular design,[9] the single feature transformation step into a suitable space for the subsequent application of a linear pattern separation model might limit their modeling and representational power when applied to more complex data and setups. A reason for their success in pharmacological applications may stem from the fact that a major part of the complexity inherent to molecular interactions has been engineered into the descriptors employed as patterns for model training, thereby allowing single-layer machine-learning architectures to tackle the problem.[10] One challenging question is whether the underlying data complexity and hidden features can be more efficiently dealt with by shifting attention from descriptor engineering to the architecture of the machine-learning system and the training of the algorithms involved. This is the domain of “deep learning”.[11,12]

[a] E. Gawehn, J. A. Hiss, G. Schneider
Swiss Federal Institute of Technology (ETH), Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
Fax: +41 44 633 13 79, Tel: +41 44 633 74 38
*e-mail: [email protected]


Deep neural networks possess multiple hidden layers and are capable of computing layers of adaptive non-linear features that capture increasingly complex data patterns with each additional layer.[12,13] Deep learning may be particularly well-suited for data mining in the life sciences because this approach deals with complex patterns in nature, systems biology and heterogeneous “big data”.[14] Genomic studies currently lead the way in this field.[15] The first deep ANN learning methods were developed in approximately 1980 with the introduction of the Neocognitron[16] and the back-propagation of errors algorithm.[17] In 1988, Qian and Sejnowski published the first application of neural networks in computational (bio)chemistry[18] (protein secondary structure prediction). This application has since been improved through optimized ANN architectures,[19] including some of the recently promoted deep learning approaches.[20] In 1989, LeCun presented a neural network for the recognition of zip codes that learned by error back-propagation.[21] However, this and following deep neural networks suffered from the “vanishing gradient” problem,[22] which was one reason why other machine-learning methods became more prominent at the time. This problem was partially alleviated in 2006 when Hinton et al. proposed an unsupervised, greedy layer-by-layer pre-training scheme that targeted the vanishing gradient effect and introduced a new class of deep generative algorithms called “deep belief networks”.[23] Although ANNs were soon identified as useful tools for computational drug design,[24] computationally-oriented medicinal chemists and drug designers have never intensively used deep learning. Instead, other machine-learning methods such as SVMs and random forests[25,26] progressed to dominate the field today. This development occurred in part due to the development of more efficient learning algorithms,[27] but was also due to the desire to interpret and understand the extracted features in a chemically meaningful way, which became increasingly difficult for higher-order features.[28]

There are several examples of cascaded learning, which can be viewed as an intermediate step towards contemporary deep learning. Cascaded approaches differ from today’s deep neural networks in that depth is achieved by combining the strengths of different types of machine-learning algorithms. One representative example is a jury network (weighted voting network).[29,30] The cascaded jury shown in Figure 1 was successfully applied to predict peptide binding to major histocompatibility complex 1.[31] Its first feature extraction layer consists of ANN and SVM models that are fed by different descriptor sets, and the output values of the first-layer models serve as the input to the jury network. This machine-learning architecture already contains important aspects of deep learning. Since the introduction of deep belief networks approximately one decade ago, deep learning has seen remarkable improvements, such as better regularization techniques,[32] enhanced activation functions,[33] and parallel computing on graphics processing units.[34] For an extensive historic account of deep learning, we refer to a recent article by Schmidhuber.[35]

A rapidly growing number of university groups, start-up companies, and global IT players such as Google[36] and Facebook[37] have joined research efforts to improve deep learning algorithms and develop specialized hardware. For image recognition tasks, deep learning already achieves accuracies comparable to or even surpassing humans on several specialized datasets, including traffic signs,[38] human faces,[39] and handwritten digits.[40] In February 2015, deep learning beat human performance on the 1000-class ImageNet dataset.[41]

Erik Gawehn studied physics at the Ludwig-Maximilians-Universität Munich, Germany. In 2015, he joined the Computer-Assisted Drug Design group at ETH Zurich, Switzerland, as a PhD student. His research focuses on the application of deep neural networks to virtual screening and the de novo design of drug-like molecules with tailored target profiles.

Dr. Jan A. Hiss is a staff scientist in the Computer-Assisted Drug Design group at ETH Zurich. He studied bioinformatics and received his doctoral degree from the Goethe-University Frankfurt, Germany. During his post-doctoral studies at Goethe-University and ETH Zurich, he developed computational methods for analyzing peptide-membrane interactions. His current research focus is on nature-inspired algorithms and computer-assisted peptide design.

Gisbert Schneider is a Full Professor of Computer-Assisted Drug Design at ETH Zurich. He received his doctoral degree in biochemistry from the Freie Universität Berlin. Prior to joining ETH, he worked at F. Hoffmann-La Roche Pharmaceuticals, Basel, and held the Beilstein Endowed Chair for Chem- and Bioinformatics at Goethe-University, Frankfurt, where he now is an adjunct professor. In 2015, he became a Fellow of the University of Tokyo. His main research interests are adaptive autonomous systems for molecular design.


Figure 1. Schematic of a jury network (adapted from ref. 31). In this particular architecture, feature extraction is performed in two stages. The second-stage model is a feed-forward network that weighs the relevance of the first-stage models and improves the overall prediction performance. Note that each first-stage model receives a different set of molecular descriptors as input. SVM, support vector machine; MLP, multi-layer Perceptron.

2 A Simple Deep Network

The most basic form of a deep neural network is a fully connected feed-forward network with more than one hidden layer, as depicted in Figure 2. The term “fully connected” refers to an architecture in which each neuron in a network layer is connected to all non-bias neurons in the next layer, and “feed-forward” indicates that there are no loops or backward connections within the network architecture. A supervised learning problem for an ANN consists of approximating a multidimensional nonlinear function f(x). This approximation is attempted by multiple nonlinear, weighted transformations of the input data vector within the hidden layers. Nonlinearity is implemented into the overall network by defining the appropriate architecture and the use of nonlinear activation functions. Common choices for activation functions are sigmoidal functions, such as the logistic function a(z) = 1/(1 + exp(−z)), and the rectifier function a(z) = max(0, z).

During the learning process, the approximation of the network function f(x) is improved by gradually tweaking the network’s weights in such a way that the mapping of input data to the computed output values better resembles the unknown function f(x). The network’s mapping performance is estimated by comparing its output values o_l calculated for an input data vector x to the desired true output values t_l using a cost function or error-function C(o_l). An example of a frequently used cost function is the sum of squared errors function (SSE),

$$C(o_l^s) = \frac{1}{2}\sum_s \sum_l \left(t_l^s - o_l^s\right)^2,$$

where s is an index denoting one training sample (molecular descriptor vector) fed to the network. For the sake of clarity, Figure 2 shows only a forward and backward pass for one sample, thereby omitting the sum over different input data vectors. Once the cost function is calculated, attempts can be made to minimize it using the weights to increase the network performance. Calculating the gradient of the cost function and subsequently adjusting the weights accordingly achieves weight optimization:

$$\Delta w_{qp}^{(n)} = -\eta\,\frac{\partial C}{\partial w_{qp}^{(n)}} =
\begin{cases}
-\eta\,\delta_q^{(n)}\,x_p, & n = 1 \\
-\eta\,\delta_q^{(n)}\,h_p^{(n-1)}, & n \neq 1
\end{cases}
\quad\text{with}\quad
\delta_q^{(n)} = \frac{\partial C}{\partial z_q^{(n)}} =
\begin{cases}
E'(o_q)\,a'(z_q^{(N)}), & n = N \\
a'(z_q^{(n)})\sum_r \delta_r^{(n+1)} w_{rq}^{(n+1)}, & n \neq N
\end{cases}$$

where Δw_qp^(n) denotes the change of one of the network’s weights (p, q and r are arbitrary indices for neurons within the involved layers), η is the learning rate, n ∈ {1, ..., N} represents the index over the number of network layers (excluding the input layer, which contains fan-out units only), h_p^(n−1) is the output value of a hidden neuron p in layer (n−1), z_q^(n) is the pre-activation of a neuron q in layer n, and a(...) is a nonlinear activation function with a'(...) denoting its derivative. The error function depends indirectly on all of the network’s weights through the output values o_l(w_lk^(N), h_k^(N−1)) (Figure 2).
connected feed-forward network with more than one
hidden layer, as depicted in Figure 2. The term “fully con- The definition of the quantity dðqnÞ (called “sensitivity”)
nected” refers to an architecture in which each neuron in merely facilitates notation.[42] The sensitivity dðqnÞ and thus
a network layer is connected to all non-bias neurons in the the weight update depend multiplicatively on  the sensitivi-
Ž
next layer, and “feed-forward” indicates that there are no ty dðr nþ1Þ, the activation function derivative a0 zðqnþ1Þ and
loops or backward connections within the network archi- the weights wðrqnþ1Þ of the next layer. This dependence leads
tecture. A supervised learning problem for an ANN consists to the problem of vanishing gradients: if these quantities
of approximating a multidimensional nonlinear function are smaller than one, then the repeated multiplication of
f(x). This approximation is attempted by multiple nonlinear, small values leads to cost function gradients in the lower
weighted transformations of the input data vector within layers that rapidly approach zero. Thus, no substantial up-
the hidden layers. Nonlinearity is implemented into the dates are achieved for the network’s lower level weights.
overall network by defining the appropriate architecture For a choice of a logistic function as the activation function,
and the use of nonlinear activation functions. Common a’(…) will be in a range [0,1] that often results in vanishing
choices for activation functions are sigmoidal functions, gradients. However, the problem has been observed for
such as the logistic function a(z) = 1/(1 + exp(¢z)) and the a wide range of setups. Since Hochreiter et al. formally in-
rectifier function a(z) = max(0,z). vestigated this problem for recurrent neural networks,[43]
During the learning process, the approximation of the many additional studies have focused on the issue of van-
network function f(x) is improved by gradually tweaking ishing gradients and the closely related problems of ex-
the network’s weights in such a way that the mapping of ploding gradients or oscillating weights for recurrent neural
input data to the computed output values better resembles networks and other deep architectures.[44] Several ap-
the unknown function f(x). The network’s mapping per- proaches have been proposed to mitigate these issues, in-
formance is estimated by comparing its output values ol cluding unsupervised pre-training of deep neural networks
calculated for an input data vector x to the desired true layer-by-layer[45] or special architectures such as the Long
output values tl using a cost function or error-function C(ol). Short Term Memory networks.[46] Successful training of
An example of a frequently used cost function is the sum a back-propagation neural network is problem-dependent,
¨ ¦ 1PP¨ s
of squared errors function (SSE) C osl ¼ 2 tl ¢ osl 2 ,
¦ and even for a fully connected feed-forward network there
s l are hyper-parameters and choices (e.g., error and activation
where s is an index denoting one training sample (molecu- functions) that may influence the final result. However,
lar descriptor vector) fed to the network. For the sake of some heuristics have been identified that may assist the
clarity, Figure 2 shows only a forward and backward pass user in decision-making.[47] For QSAR research, a preliminary


Figure 2. Schematic of a deep feed-forward neural network with one input layer x, two hidden layers h^(1) and h^(2), and one output layer o. The input values for the biases are x_0 = h_0^(1) = h_0^(2) = 1. During a forward pass (red arrow), the pre-activation z_j^(1), which is a linear combination of the preceding layer’s output values, is calculated for each non-bias neuron in the first hidden layer. Next, a nonlinear activation function a(z_j^(1)) is applied to compute the output values h_j^(1). Once the neurons’ activations for one layer are calculated, they serve as input values for the following layer until one finally arrives at the network’s output values o_l. Then, the output values are evaluated against the true output values t_l using an error-function C(o_l). To resolve the indirect dependency of the error function on the network’s weights, the error is back-propagated through the network (blue arrow) using the chain rule for derivatives. Other training techniques may be applied.

For QSAR research, a preliminary guideline for choosing hyper-parameters has already been developed.[4,5] Meta-optimization of neural network parameters by particle swarms and related adaptive stochastic algorithms has also been suggested.[48]

Although adding neurons can increase the capacity of deep learning architectures until they can cope with very large data sets, deep neural networks with excessive free parameters can easily overfit even the largest datasets. Therefore, training techniques attempt to stimulate the network to learn the most general of the possible weight combinations. Prominent examples are l1 and l2 regularization, Bayesian regularization with penalties built into the cost function, and the automatic relevance determination of the input variables.[49]

Another popular technique is referred to as “early termination”:[50] the idea behind this technique is to not only separate the dataset into a training and a test set, but also to further separate the training set into a training and at least one external validation set. Then, the network is optimized using the training data and simultaneously tested on the validation set. By stopping the training procedure as soon as the accuracy starts to decrease on the validation set, the risk of overfitting on the training data can be reduced.
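A minimal sketch of such an early-termination loop is given below; `train_one_epoch` and `validate` are hypothetical placeholder callables standing in for whatever model and accuracy metric are actually used, and the patience of three epochs is an arbitrary choice.

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=200, patience=3):
    """Generic early-termination loop.

    train_one_epoch(): performs one pass over the training split and returns
                       the current weights (any object). Placeholder callable.
    validate(weights): returns validation-set accuracy for those weights.
    """
    best_acc, best_weights, stale = -1.0, None, 0
    for _ in range(max_epochs):
        weights = train_one_epoch()
        acc = validate(weights)
        if acc > best_acc:                 # validation accuracy still improving
            best_acc, best_weights, stale = acc, weights, 0
        else:                              # accuracy starts to decrease
            stale += 1
            if stale >= patience:
                break                      # stop training to limit overfitting
    # the untouched test set is used only once, for the final performance estimate
    return best_weights, best_acc
```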
Another widely used technique is called “dropout”.[32] For the dropout method, a certain percentage of randomly selected neurons in a layer “vanish” (i.e., the activation function is fixed to output = 0) on each presentation of each training case. Dropout is believed to lead to a competition of neurons, thereby preventing collaboration because one neuron cannot rely on the presence of other neurons. Thus, each neuron is forced to learn some general feature. Another way of looking at dropout regularization is as an averaging model over an exponential number of the many different neural networks produced by deleting random subsets of hidden units and inputs. This idea of an adaptive network architecture has been proposed before and is currently receiving renewed attention.[51] One of the early applications was the sequence-based prediction of the transmembrane helices of integral membrane proteins.[52]
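A minimal numpy illustration of the dropout idea for a single vector of hidden-layer activations; the dropout rate and the activations are toy values, and the rescaling of the surviving activations is one common convention rather than part of the method as described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(h, p_drop=0.5, training=True):
    """Randomly silence a fraction p_drop of the neurons in activation vector h.

    During training, each neuron's output is set to 0 with probability p_drop
    (its activation 'vanishes' for this presentation of the training case).
    Surviving activations are rescaled by 1/(1 - p_drop) ("inverted dropout"),
    a common convention that avoids rescaling at prediction time.
    """
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.array([0.2, 0.9, 0.4, 0.7, 0.1, 0.8])   # toy hidden-layer activations
print(dropout(h))                               # a different random subset vanishes each call
```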
3 Properties of Deep Learning Architectures

There are several advantages of deep learning neural networks that suggest that they should replace shallow ANNs:


(i) A deep, hierarchical architecture enables ANNs to perform many nonlinear transformations, which leads to the learning of more abstract features compared to shallow networks.[53] This development allows for a more sophisticated combination of low-level features (i.e., nonlinear descriptor combinations) to achieve a better molecular representation and thus classification of compounds. This approach reduces the need for the a priori engineering of more sophisticated descriptors. More abstract nonlinear features tend to be invariant to local changes in the input, leading to inherent noise reduction and increased network robustness. Therefore, deep learning architectures capture certain families of functions exponentially more efficiently than shallow architectures.

(ii) Deep architectures promote the reuse of features. This ensemble view of the same data corresponds to an improvement on the principle of cascading networks that share the intelligence of different algorithms. How well-regularized deep learning networks can actually exploit commonalities between different tasks to transfer knowledge as inductive bias remains a matter of debate. This aspect is relevant for missing classes within the training set, which is typical for large QSAR data sets. Aside from enabling “educated guesses” on missing information, multitask learning also has the benefit of presenting a shared, learned feature extraction pipeline for multiple tasks. Dahl suggested that this concept might have a regularization effect because weights tend to develop in a way that is useful for many targets instead of overfitting one target in particular.[54] All of these findings promise the simultaneous testing of bioactivities against many related targets at a realistic computational cost. Such applications may be driven even further by the field of polypharmacology.[55] In this case, deep learning might accompany current kernel methods to find additional applications for drugs (“re-purposing”) and to assist in the identification of undesired off-target activities.[56]

(iii) As highlighted by the Merck QSAR contest,[5] deep multitask neural networks can be successfully applied to QSAR modeling (a minimal multitask layout is sketched after this list).[57] Generative deep architectures complement evolutionary algorithms in computational drug discovery that are often used to find new compounds with specific features identified by some previously trained machine-learning architecture.[58] A trained deep generative neural network has learned to separate compound activity classes within its architecture while being able to generate output according to the learned representation; thus, this network contains both functions in one algorithm. Therefore, deep learning might provide a fresh approach to solving the “inverse QSAR problem”.[59]
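As a minimal illustration of the multitask idea from points (ii) and (iii), the numpy sketch below shares one hidden representation across several targets and simply masks out missing activity labels; the sizes, parameters and data are invented for illustration and do not reproduce any published architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
n_desc, n_hidden, n_targets = 16, 8, 3          # toy descriptor length, hidden size, targets

W_shared = rng.normal(scale=0.1, size=(n_hidden, n_desc))    # trunk learned by all tasks
W_heads  = rng.normal(scale=0.1, size=(n_targets, n_hidden)) # one output unit per target

def predict(x):
    """Shared nonlinear feature extraction, then one prediction per target."""
    h = np.tanh(W_shared @ x)                    # features reused by every task
    return 1.0 / (1.0 + np.exp(-(W_heads @ h)))  # predicted activity per target

x = rng.random(n_desc)                           # one compound's descriptor vector (toy)
y = np.array([1.0, np.nan, 0.0])                 # activity known for targets 0 and 2 only

pred = predict(x)
known = ~np.isnan(y)                             # missing labels are masked out, so every
loss = np.sum((pred[known] - y[known]) ** 2)     # compound still trains the shared trunk
print(pred, loss)
```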
However, simply trading network width for depth alone does not automatically lead to better models. In the following sections, we highlight two deep learning architectures that we think deserve special attention and consideration for drug design and discovery: the Restricted Boltzmann Machine (RBM) and Convolutional Neural Networks (CNN). Many more deep learning approaches have been conceived, and the interested reader is referred to the respective literature.[35,66]

4 Restricted Boltzmann Machine

Boltzmann machines are undirected, generative, stochastic neural networks that rose to prominence when Hinton and coworkers proposed contrastive divergence[60] as a fast unsupervised learning algorithm.[61] Restricted Boltzmann Machines (RBM) are instances of Boltzmann machines without intra-layer interactions (Figure 3).

Figure 3. Schematic of a restricted Boltzmann machine. The visible layer is the interface between the network and the outside world. Input can be clamped to this layer, and the resulting visible neuron configuration represents a sample from the network’s equilibrium distribution after an equilibration process. Hidden neurons (latent variables) transform the input data and detect prominent features. The hidden layer models the visible layer in conjunction with the weight matrix.

Boltzmann machines are not feed-forward models, which is a crucial difference from the deep feed-forward neural networks presented in Figure 2. The given input states serve as input to the hidden neurons but in turn are dependent on the resulting hidden neuron states that modify the “visible” neurons. This back and forth of mutual layer influencing eventually leads to an equilibrium of neuron activations regardless of the originally given input vector (Figure 4). The overall distribution governing the activation of visible neurons at one of these steps is the network’s “belief” of how the visible neurons behave. The goal is to nudge the network’s belief of the visible neuron state distribution into the direction of the actual input data distribution by intermittently adding new input data vectors and adjusting the weights in such a way that their probability is increased. The idea is that over time the believed visible state distribution eventually approximates the real state distribution, i.e. the network “learns” the input distribution in an unsupervised fashion. A popular criterion to evaluate the learning process is the reconstruction error, which is a measure of the ability of the network to produce an output that is consistent with the input data.[62]


Figure 4. Restricted Boltzmann machine-learning. First, an input vector from the dataset enters the network. Subsequently, the hidden layer state is calculated with regard to the new input vector, from which a new input layer state can be calculated by the network. This iterative process continues for k steps, and the resulting visible layer state serves as a representative for the network’s belief in the visible layer states. A weight update is calculated and the training cycle is re-iterated from the data input vector and the belief state vector.

Neurons in a basic RBM are binary and stochastic. The RBM’s overall state under the current weight configuration is described by the energy function

$$E(x, h) = -\sum_i \sum_j w_{ji}\, x_i h_j - \sum_i b_i x_i - \sum_j c_j h_j,$$

where x_i denotes the value of visible neuron i, h_j is the output value of hidden neuron j, and w_ji is the weight connecting the two. The parameters b_i and c_j denote the bias for visible layer neurons and hidden layer neurons, respectively. These parameters differ from the deep feed-forward network, where input neuron bias is not considered and the hidden layer bias is usually incorporated into the weight matrix. The dynamics within the RBM network depend on whether w_ji is a positive or negative connection between neurons x_i and h_j, which indicates whether they tend to collaboratively fire or hinder each other’s firing, as well as on each neuron’s intrinsic bias to fire. The Boltzmann probability of a joint configuration of visible and hidden units is

$$p(x, h) = \frac{e^{-E(x,h)}}{Z}, \quad\text{with}\quad Z = \sum_{x,h} e^{-E(x,h)},$$

where Z denotes the RBM’s combinatorial partition function. The sums over x and h run over all possible states of the input layer neurons and hidden layer neurons, respectively. Due to the generally extremely large number of possible states, the partition function is considered to be intractable. Marginalizing over all corresponding hidden configurations yields the probability of a particular configuration x of the visible units:

$$p(x) = \frac{\sum_h e^{-E(x,h)}}{Z}.$$
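To illustrate why Z is intractable, the numpy sketch below evaluates E(x, h) and computes Z by brute force for a deliberately tiny RBM with three visible and two hidden units; with realistic layer sizes, the 2^(visible + hidden) terms of this sum are exactly what makes the computation infeasible. All parameter values are arbitrary.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n_vis, n_hid = 3, 2                               # deliberately tiny; real RBMs are far larger
W = rng.normal(scale=0.5, size=(n_hid, n_vis))    # w_ji connecting hidden j and visible i
b = rng.normal(scale=0.5, size=n_vis)             # visible biases b_i
c = rng.normal(scale=0.5, size=n_hid)             # hidden biases c_j

def energy(x, h):
    """E(x,h) = -sum_ij w_ji x_i h_j - sum_i b_i x_i - sum_j c_j h_j"""
    return -(h @ W @ x) - (b @ x) - (c @ h)

# brute-force partition function: sum over all 2^(n_vis + n_hid) joint states
Z = sum(np.exp(-energy(np.array(x), np.array(h)))
        for x in product([0, 1], repeat=n_vis)
        for h in product([0, 1], repeat=n_hid))

x = np.array([1, 0, 1])
p_x = sum(np.exp(-energy(x, np.array(h))) for h in product([0, 1], repeat=n_hid)) / Z
print(p_x)   # p(x): marginal probability the model assigns to this visible configuration
```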

To maximize the probability of the data vectors, the parameters θ (weights and biases) are adjusted so that the negative log-likelihood of the corresponding states is minimized:

$$\theta \leftarrow \theta - \eta\, \nabla_\theta \left( -\log p\!\left(x = x^{(t)}\right) \right),$$

where x^(t) signifies the visible layer configuration of a particular sample inserted into the network at time t, and θ is a parameter at that time t. Deriving the log-likelihood within the brackets with regard to the parameter θ yields:

$$\frac{\partial}{\partial \theta}\left(-\log p(x^{(t)})\right)
= -\frac{\sum_{x,h} e^{-E(x,h)}\, \partial_\theta E(x,h)}{Z}
+ \frac{\sum_h e^{-E(x^{(t)},h)}\, \partial_\theta E(x^{(t)},h)}{\sum_h e^{-E(x^{(t)},h)}}
= -\left\langle \partial_\theta E(x,h) \right\rangle_{x,h} + \left\langle \partial_\theta E(x^{(t)},h) \right\rangle_{h\,|\,x^{(t)}}.$$

The expressions ⟨...⟩_{x,h} and ⟨...⟩_{h|x^(t)} denote the expected values based on the probability distributions over all possible configurations or the conditional values depending on the current sample configuration. The intuition for the two expected values (called the “negative phase” and the “positive phase”) is as follows: the positive phase increases the probability of the input vector x^(t), whereas the negative phase decreases the probability of the model’s belief (i.e., the probability of all other model configurations).

For a weight parameter θ = w_ji, the positive phase becomes

$$-\left\langle h_j x_i \right\rangle_{h\,|\,x^{(t)}} = \sum_{h_j \in \{0,1\}} \left(-h_j x_i\right)\, p\!\left(h_j \,|\, x^{(t)}\right) = -x_i\, p\!\left(h_j = 1 \,|\, x^{(t)}\right).$$

The lack of intra-layer interactions in RBMs means that the hidden layer neurons are independent of one another given the visible layer, and likewise the visible layer neurons are independent given the hidden layer: p(h|x) = ∏_j p(h_j|x) and p(x|h) = ∏_i p(x_i|h). Using this independence leads to expressions for the inference of states that resemble the calculation of neuron activations in deep neural networks:

$$p(h_j = 1 \,|\, x) = \sigma\!\left(\sum_i w_{ji} x_i + c_j\right), \qquad p(x_i = 1 \,|\, h) = \sigma\!\left(\sum_j h_j w_{ji} + b_i\right).$$

A random number rand uniformly drawn from the interval [0,1] can be used to model the neurons’ stochastic behavior:

$$h_j = \begin{cases} 1, & p(h_j = 1 \,|\, x) > rand \\ 0, & p(h_j = 1 \,|\, x) \leq rand \end{cases}$$

The input vector x^(t) can now be used to infer a sample vector h̃^(t) from the hidden layer to calculate the positive phase. Moreover, this inferred layer configuration can be reused to sample from the original layer distribution: x^(t) → h̃^(t) → x̃^(t+1). This method, in which each variable draws a sample from its posterior while keeping the other variables fixed, is also known as Gibbs sampling.[63] Essentially, the same trick of inferring states by using conditional probabilities may be used to derive an expression for the negative phase term. The expected value of the negative phase is replaced by a point estimate x̃ that corresponds to a one-step Monte-Carlo approximation of the expected value. This x̃ is obtained by performing k steps of Gibbs sampling starting from a data vector x^(t) (Figure 4). For an infinite number of Gibbs sampling steps (k = ∞), this procedure results in an unbiased sample from the network’s believed distribution over the visible states. Hence, the new expression for the log-likelihood derivative reads

$$\frac{\partial}{\partial \theta}\left(-\log p(x^{(t)})\right) \approx -\partial_\theta E\!\left(\tilde{x}^{(t+k)}, \tilde{h}^{(t+k-1)}\right) + \partial_\theta E\!\left(x^{(t)}, \tilde{h}^{(t)}\right).$$

Hinton and coworkers showed that a single iteration (k = 1) of Gibbs sampling might be sufficient for an RBM to efficiently find features in the input data, even if the log-likelihood estimate might not be perfect.[64] The corresponding method was termed “contrastive divergence learning”, which is not derived by minimizing a negative log-likelihood but by minimization of another objective function.
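A minimal numpy sketch of a single contrastive-divergence (k = 1) update for a binary RBM, following the conditional probabilities and the positive/negative phase expressions above; the layer sizes, learning rate and input vector are toy values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_vis, n_hid, eta = 6, 4, 0.1
W = rng.normal(scale=0.1, size=(n_hid, n_vis))
b = np.zeros(n_vis)                               # visible biases
c = np.zeros(n_hid)                               # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    """Stochastic binary neurons: fire if p > rand, with rand uniform on [0,1]."""
    return (p > rng.random(p.shape)).astype(float)

x_t = np.array([1, 0, 1, 1, 0, 0], dtype=float)   # one data vector x(t) (toy input)

# one step of Gibbs sampling: x(t) -> h~(t) -> x~(t+1) -> p(h | x~(t+1))
p_h_data = sigmoid(W @ x_t + c)                   # p(h_j = 1 | x(t))
h_t      = sample(p_h_data)
x_recon  = sample(sigmoid(W.T @ h_t + b))         # "belief" sample x~ of the visible layer
p_h_rec  = sigmoid(W @ x_recon + c)

# contrastive divergence (k = 1) update:
# the positive phase raises p(x(t)), the negative phase lowers p of the model's belief
W += eta * (np.outer(p_h_data, x_t) - np.outer(p_h_rec, x_recon))
b += eta * (x_t - x_recon)
c += eta * (p_h_data - p_h_rec)

reconstruction_error = np.sum((x_t - x_recon) ** 2)   # popular progress criterion
print(reconstruction_error)
```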
In 2006, Hinton et al. proposed a new method to stack RBM modules and initialize the weights between all of the layers in a way that enabled fast training of the resulting “deep belief network”.[23] An advantage of using several layers is that the lower bound on log p(x^(t)) increases with each additional layer added. Although this does not necessarily mean that log p(x^(t)) improves with additional layers, this approach ensures that the model does not deteriorate beyond a certain point.[23,65] To make a deep belief network discriminative, an additional classifier can be added on top. If the training data labels are known (e.g., compound activity classes), back-propagation can subsequently be used to fine-tune the network weights. The resulting deep architecture has its weights initialized so that they already correspond to a representation of the input data distribution. This type of unsupervised RBM pre-training can lead to better weight initialization with the additional advantage of using unlabeled data. Because the method of unsupervised pre-training addresses the problem of network overfitting, it can be considered a strong regularization procedure.[66] These properties render the approach particularly appealing for big data analysis in drug discovery. A recent pioneering RBM study achieved near-perfect ROC AUC values for the prediction of drug-target interactions.[67]

5 Convolutional Neural Network

CNNs are inspired by neuroscience and imitate the image representation within the visual cortex that is created by successive transformations of the retinal signals in cortical areas called the “ventral stream” that end in the inferior temporal (IT) cortex.[68] Contemporary CNNs rival the IT cortex’s representational performance of a primate in “core visual object recognition”.[69] The prototype of CNNs is the Neocognitron,[16] which was conceived in the early 1980s. Related architectures for phoneme recognition followed a few years later; thereafter, CNNs for object recognition in images and document reading were developed just prior to the turn of the millennium.[70] CNN improvement, together with the general advances in deep learning, became a focal point of interest when a deep CNN won the ILSVRC image recognition challenge in 2012 by a wide margin.[71] CNNs have dominated this challenge ever since, as well as the field of object recognition and detection in images in general. They now constitute the underlying architecture for a wide range of applications ranging from image understanding, video analysis, speech recognition and language understanding to artificial intelligence.[72]

The two guiding principles of weight-sharing and local connectivity in a convolutional net allow for the detection of features within topologically structured input data. The additional use of pooling and many layers enables the network to learn increasingly abstract features with each additional layer. The concept of weight sharing and local connectivity is depicted in Figure 5. Here, the input data corresponds to the input image, the filter corresponds to the weight connections, and the feature map (activation map) corresponds to the first hidden layer.


Figure 5. The figure depicts a toy example of a greyscale input image with pixel values between 0 and 1, a weight matrix (filter) and the resulting feature map. Pre-activations of hidden neurons within the feature map are obtained by weighting pixel values within a 2 × 2 receptive field with the corresponding weights of the filter and summing up the resulting four terms as shown in orange. To obtain the next feature map value, the filter is moved across the image with a stride of 1 (e.g., from the orange to the green quadrant). The resulting feature map is a map of where the feature encoded in the filter has occurred in the input image. Here, the filter corresponds to a diagonal that results in a strong pre-activation signal for the green, red and blue filter positions over the input image, which is reflected in the feature map accordingly.

In contrast to a fully connected network (Figure 2), hidden-layer neurons are only connected to a local patch within the input image. In Figure 5, only four input values are connected to four weights and contribute to the pre-activation of a hidden neuron in the feature map, instead of all 16 input values weighted accordingly in a fully connected layer. The spatial extent of this local patch is a hyper-parameter called the receptive field. Not considering biases, the pre-activation z of a hidden-layer neuron in the feature map is computed by summing up the input values within the receptive field multiplied by their corresponding weights (in Figure 5 this step is highlighted in orange: 0.1 · 0.1 + 0.5 · 0.5 + 0.4 · 0.5 + 0.8 · 0.1 = 0.54):

$$(x \star w)_{ij} = \sum_{ab} x_{i+a,\,j+b}\; w_{a,b},$$

where x corresponds to the input matrix, w signifies the weight matrix, and a and b are two indices running from 0 to r−1 to capture the entire quadrant of size r × r. Another way to achieve the same result is to transform the weight matrix into a convolution kernel k by flipping its rows and columns and subsequently performing a convolution of the input image:

$$(x * k)_{ij} = \sum_{ab} x_{i+a,\,j+b}\; k_{r-a,\,r-b}.$$

This latter approach is usually employed, and it is this perspective that coined the notion of a “convolutional network”.

Further pre-activation values within a feature map are calculated by moving the filter across the image by a predefined step size (stride) and calculating the pre-activations for each new position. The weights are kept constant while the weight filter moves across the image, leading to the aforementioned sharing of weights among hidden neurons organized in the same feature map. This reduces the overall number of free parameters and the computational cost of calculating the linear pre-activation of a hidden neuron. Ultimately, this training process results in a feature map (a map of where the feature encoded in the filter has been found in the input data).[73,74]
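The following numpy sketch implements this sliding-window computation in its first, cross-correlation form. Only the highlighted 2 × 2 patch and the filter weights are taken from the worked example in the Figure 5 caption; the remaining pixels of the toy image are invented, so only the first feature-map entry is expected to reproduce the value 0.54.

```python
import numpy as np

def feature_map(image, filt, stride=1):
    """Slide a 2D filter over a 2D image: multiply the pixels in each receptive
    field element-wise with the filter weights and sum them (no padding)."""
    r = filt.shape[0]
    rows = (image.shape[0] - r) // stride + 1
    cols = (image.shape[1] - r) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = image[i * stride:i * stride + r, j * stride:j * stride + r]
            out[i, j] = np.sum(patch * filt)       # pre-activation of one hidden neuron
    return out

# Toy greyscale image; only the top-left 2x2 patch and the filter values follow
# the worked example of Figure 5, the remaining pixels are invented.
image = np.array([[0.1, 0.5, 0.0, 0.0],
                  [0.4, 0.8, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
filt = np.array([[0.1, 0.5],
                 [0.5, 0.1]])

fmap = feature_map(image, filt)
print(fmap[0, 0])    # 0.54 (up to floating point), as in the orange position of Figure 5
```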
Different filters result in different feature maps. Similar to fully connected networks, a CNN gradually adjusts the weights to better approximate the target function. This corresponds to an adjustment of the filters and hence of the features that are detected in the input data. The network “learns” the features it requires to minimize the error function.

Often the notion of an “input volume” and an “output volume” is employed, which becomes clearer when the input data is composed of several input channels with a similar topology, as in the case of an RGB image (Figure 6). The multi-component weight filters (filter banks) between the input volume and the first hidden volume usually have as many components as the input volume. For the connections between higher hidden volumes, filter banks do not necessarily act on the entire depth of the input hidden volume.

A closer look at the convolutional method described and depicted in Figure 5 shows that the pixel values at the image boundaries are considered less often when sliding the convolution kernel across the image than values from the center of the image. To counteract these boundary effects, the input data for the convolution operation can be lined with additional rows and columns of zeroes, with the effect that the original data is now located further towards the center within the new, larger data set.
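A one-line numpy illustration of this zero padding, reusing a tiny toy image (np.pad with its default constant mode adds zero-valued borders):

```python
import numpy as np

image = np.array([[0.1, 0.5],
                  [0.4, 0.8]])                 # tiny toy image
padded = np.pad(image, pad_width=1)            # one row/column of zeroes on every side
# padded is now 4x4; the original pixels sit away from the border, so boundary
# values are visited by the sliding filter about as often as central ones.
print(padded)
```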

It is also common to subsample feature maps by intermittently inserting pooling layers between convolutional layers.[75]

Figure 6. The image shows a toy example of different filter banks being applied to the three components of an RGB input volume. This results in different convolutional neural network feature maps forming an output volume (the first hidden volume).


Table 1. Applications of deep learning in drug discovery. There is ample opportunity to fill the uncharted fields.

Deep learning architecture | Virtual screening, proteome mining | QSAR, ADMET properties | Protein structure and function | Other applications
Deep feed-forward network | [5, 54, 57, 84–87] | [88, 98] | [85, 89–92] | [93]
Deep restricted Boltzmann machine | – | – | [94, 95] | –
Deep auto-encoder network | – | – | [90–92, 96] | –
Recursive neural network | – | [97, 98] | [99–101] | [102]
Deep convolutional network | [103, 104] | [105] | – | [102, 106]
Cascaded neural network | [31] | [107] | – | –
Other architectures | [85] | – | [85, 108–109] | [110]

In addition to reducing the size of the feature maps and thus the number of parameters, this method also improves the generalization of the features the network learns (because these features are derived from more coarse-grained feature maps in the preceding hidden volume). The most common method is MAX pooling, in which the feature map is subdivided and the MAX operation is applied to every resulting tile. This step is repeated for every feature map within a hidden volume.
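A minimal numpy sketch of MAX pooling with non-overlapping 2 × 2 tiles; the feature-map values are arbitrary toy numbers.

```python
import numpy as np

def max_pool(fmap, tile=2):
    """Subdivide a feature map into non-overlapping tile x tile regions and keep
    only the maximum of each region (repeated for every map in a hidden volume)."""
    rows, cols = fmap.shape[0] // tile, fmap.shape[1] // tile
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.max(fmap[i * tile:(i + 1) * tile, j * tile:(j + 1) * tile])
    return out

fmap = np.array([[0.54, 0.10, 0.00, 0.20],
                 [0.05, 0.30, 0.60, 0.10],
                 [0.00, 0.90, 0.20, 0.00],
                 [0.40, 0.10, 0.00, 0.80]])
print(max_pool(fmap))    # [[0.54, 0.6], [0.9, 0.8]] -- a coarser 2x2 feature map
```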
Recently, the notion of an all-convolutional net has been introduced, which abandons pooling layers altogether and seeks to achieve better generalization using a larger stride for the convolution operation.[76] Following the concept of jury networks, advanced CNN versions include “multi-column” deep neural networks that effectively represent a combination of separately and differently initialized CNNs combined into one big network. The final prediction results from averaging all of the CNN predictions.[40]

A challenge for successful CNN applications in drug discovery will be to find appropriate molecular representation schemes to serve as the input. For example, a first positron emission tomography image analysis has been performed with CNNs.[77] Further applications are listed in Table 1.

6 Conclusion: Deep Learning to the Rescue?

Although the concepts of deep learning were introduced with the conception of multi-layer neural networks,[78] the life science community has adopted some of these techniques only recently (Table 1). In fact, neural networks have a long and fruitful history in drug discovery and design. Because they bear the risk of being easily over-trained and are perceived as a “black box”, they have often been substituted by other approaches such as SVM models.[27,79] As noted by Winkler in a review article from 2004, continuous methodological advances in the field of neural networks alleviate some of the pitfalls and may have much to offer for hit and lead discovery.[80] Evidently, deep network architectures will require a particularly careful analysis and thorough definition of their respective domains of applicability,[81,82] and should ideally provide hands-on guidelines for chemists. Appropriate confidence measures for neural network classifiers have been developed and can readily be applied.[83] Full appreciation of deep learning methods will likely only be achieved when the user gains an understanding of the underlying molecular features that enable pattern association and classification. With the development of new deep learning concepts such as RBMs and CNNs, the molecular modeler’s tool box has been equipped with potentially game-changing methods. Judging from the success of recent pioneering studies, we are convinced that modern deep learning architectures will be useful for the coming age of big data analysis in pharmaceutical research, toxicity prediction, genome mining and chemogenomic applications, to name only some of the immediate areas of application. Personalized health care, in particular, could benefit from these capabilities. However, it may be wise not to put all of our eggs in one basket. We still need to fully understand the advantages and limitations of deep learning techniques. One should not blindly use them without simultaneously applying straightforward linear methods and pursuing chemical similarity approaches. By maintaining some healthy skepticism, we feel that the time is ripe for (re)discovering and exploring the usefulness of deep learning in drug discovery.

Conflict of Interest

G. S. is a co-founder of inSili.com LLC, Zurich, and a consultant to the pharmaceutical industry.

Acknowledgements

This research was supported by the Swiss National Science Foundation (grant no. 200021_157190 and CR32I2_159737).

References

[1] a) R. Todeschini, V. Consonni, Molecular Descriptors for Cheminformatics, Wiley-VCH, Weinheim, 2009; b) R. Sawada, M. Kotera, Y. Yamanishi, Mol. Inf. 2014, 33, 719 – 731.
[2] P. Willett, Mol. Inf. 2014, 33, 403 – 413.


[3] a) A. Lavecchia, Drug Discovery Today 2015, 20, 318 – 331; b) E. J. Gardiner, V. J. Gillet, J. Chem. Inf. Model. 2015, 55, 1781 – 1803.
[4] a) J. Devillers (ed.), Neural Networks in QSAR and Drug Design, Elsevier, Amsterdam; b) G. Schneider, P. Wrede, Prog. Biophys. Mol. Biol. 1998, 70, 175 – 222; c) J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, Wiley-VCH, Weinheim, 1999; d) S. Agatonovic-Kustrin, R. Beresford, J. Pharm. Biomed. Anal. 2000, 22, 717 – 727.
[5] a) J. Ma, R. P. Sheridan, A. Liaw, G. E. Dahl, V. Svetnik, J. Chem. Inf. Model. 2015, 55, 263 – 274; b) J. Markoff, Scientists See Promise in Deep-Learning Programs, The New York Times, November 23, 2012.
[6] a) G. Schneider, S.-S. So, Adaptive Systems in Drug Design, Landes Bioscience, Georgetown, 2001; b) M. Reutlinger, G. Schneider, J. Mol. Graph. Model. 2012, 34, 108 – 117.
[7] G. E. Hinton, Cogn. Sci. 2014, 38, 1078 – 1101.
[8] a) J. Gasteiger, T. Engel (eds.), Chemoinformatics: A Textbook, Wiley-VCH, 2003; b) G. Schneider, G. Downs (eds.), Machine Learning Methods in QSAR Modelling, QSAR Comb. Sci. 2003, 22(5); c) J. Bajorath (ed.), Chemoinformatics: Concepts, Methods, and Tools for Drug Discovery, Humana Press, Totowa, 2004; d) M. Dehmer, K. Varmuza, D. Bonchev (eds.), Statistical Modelling of Molecular Descriptors in QSAR/QSPR, Wiley-Blackwell, New York, 2011.
[9] a) G. Schneider, K.-H. Baringhaus, Molecular Design – Concepts and Applications, Wiley-VCH, Weinheim, 2008; b) G. Schneider (ed.), De Novo Molecular Design, Wiley-VCH, Weinheim, 2013.
[10] J. Gasteiger, Mini Rev. Med. Chem. 2003, 3, 789 – 796.
[11] Y. Bengio, Foundations and Trends in Machine Learning 2009, 2, 1 – 127.
[12] a) G. E. Hinton, R. R. Salakhutdinov, Science 2006, 313, 504 – 507; b) Y. LeCun, Y. Bengio, G. Hinton, Nature 2015, 521, 436 – 444.
[13] Y. Roudi, G. Taylor, Curr. Opin. Neurobiol. 2015, 35, 110 – 118.
[14] a) J. Fan, H. Liu, Adv. Drug Deliv. Rev. 2013, 65, 987 – 1000; b) D. C. Cireşan, A. Giusti, L. M. Gambardella, J. Schmidhuber, Med. Image Comput. Comput. Assist. Interv. 2013, 16, 411 – 418; c) N. T. Issa, S. W. Byers, S. Dakshanamurthy, Expert Rev. Clin. Pharmacol. 2014, 7, 293 – 298; d) F. F. Costa, Drug Discovery Today 2014, 19, 433 – 440; e) S. J. Lusher, R. McGuire, R. C. van Schaik, C. D. Nicholson, J. de Vlieg, Drug Discovery Today 2014, 19, 859 – 868; f) J. B. Brown, M. Nakatsui, Y. Okuno, Mol. Inf. 2014, 33, 732 – 741; g) L. Richter, G. F. Ecker, Drug Discovery Today Technol. 2015, 14, 37 – 41.
[15] Y. Park, M. Kellis, Nat. Biotechnol. 2015, 33, 825 – 826.
[16] a) K. Fukushima, S. Miyake, Pattern Recognit. 1982, 15, 455 – 469; b) K. Fukushima, Neural Netw. 2013, 37, 103 – 119.
[17] a) P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, PhD thesis, Harvard University, 1974; b) D. E. Rumelhart, G. E. Hinton, R. J. Williams, Nature 1986, 323, 533 – 536.
[18] N. Qian, T. J. Sejnowski, J. Mol. Biol. 1988, 202, 865 – 884.
[19] M. Punta, B. Rost, Methods Mol. Biol. 2008, 458, 203 – 230.
[20] M. Spencer, J. Eickholt, J. Cheng, IEEE/ACM Trans. Comput. Biol. Bioinform. 2015, 12, 103 – 112.
[21] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Adv. Neural Inf. Proc. Sys. (NIPS) 1989, 2, 396 – 404.
[22] a) Y. Bengio, P. Simard, P. Frasconi, IEEE Trans. Neural Networks 1994, 5, 157 – 166; b) S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, in: A Field Guide to Dynamical Recurrent Neural Networks (S. C. Kremer, J. F. Kolen, eds.), IEEE Press, New York, 2001.
[23] a) G. E. Hinton, Trends Cogn. Sci. 2007, 11, 428 – 434; b) G. E. Hinton, S. Osindero, Y. W. Teh, Neural Comput. 2006, 7, 1527 – 1554.
[24] a) G. Schneider, J. Schuchhardt, P. Wrede, Comput. Appl. Biosci. 1994, 10, 635 – 645; b) G. Schneider, Neural Netw. 2000, 13, 15 – 16; c) S. Agatonovic-Kustrin, R. Beresford, J. Pharm. Biomed. Anal. 2000, 22, 717 – 727.
[25] a) L. Breiman, Machine Learning 2001, 45, 5 – 32; b) N. Meinshausen, J. Mach. Learn. Res. 2006, 7, 983 – 999.
[26] a) Y. Sakiyama, Expert Opin. Drug Metab. Toxicol. 2009, 5, 149 – 169; b) D. Plewczynski, S. A. Spieser, U. Koch, Comb. Chem. High Throughput Screen. 2009, 12, 358 – 368; c) B. Sprague, Q. Shi, M. T. Kim, L. Zhang, A. Sedykh, E. Ichiishi, H. Tokuda, K. H. Lee, H. Zh, J. Comput. Aided Mol. Des. 2014, 28, 631 – 646; d) M. A. Khamis, W. Gomaa, W. F. Ahmed, Artif. Intell. Med. 2015, 63, 135 – 152; e) T. Rodrigues, D. Reker, M. Welin, M. Caldera, C. Brunner, G. Gabernet, P. Schneider, B. Walse, G. Schneider, Angew. Chem. Int. Ed. 2015, in press.
[27] a) E. Byvatov, U. Fechner, J. Sadowski, G. Schneider, J. Chem. Inf. Comput. Sci. 2003, 43, 1882 – 1889; b) E. Byvatov, G. Schneider, Appl. Bioinformatics 2003, 2, 67 – 77.
[28] a) L. Yang, P. Wang, Y. Jiang, J. Chen, J. Chem. Inf. Model. 2005, 45, 1804 – 1811; b) I. I. Baskin, V. A. Palyulin, N. S. Zefirov, Methods Mol. Biol. 2008, 458, 137 – 158.
[29] A. Givehchi, G. Schneider, Mol. Divers. 2005, 9, 371 – 383.
[30] a) A. C. Tsoi, A. D. Back, IEEE Trans. Neural Netw. 1994, 5, 229 – 239; b) B. Hammer, A. Micheli, A. Sperduti, M. Strickert, Neural Netw. 2004, 17, 1061 – 1085; c) D. V. Buonomano, Neuron 2009, 63, 423 – 425; d) D. Plewczynski, J. Mol. Model. 2011, 17, 2133 – 2141.
[31] a) J. A. Hiss, A. Bredenbeck, F. O. Losch, P. Wrede, P. Walden, G. Schneider, Protein Eng. Des. Sel. 2007, 20, 99 – 108; b) C. P. Koch, A. M. Perna, M. Pillong, N. K. Todoroff, P. Wrede, G. Folkers, J. A. Hiss, G. Schneider, PLoS Comput. Biol. 2013, 9, e1003088; c) C. P. Koch, A. M. Perna, S. Weissmüller, S. Bauer, M. Pillong, R. B. Baleeiro, M. Reutlinger, G. Folkers, P. Walden, P. Wrede, J. A. Hiss, Z. Waibler, G. Schneider, ACS Chem. Biol. 2013, 8, 1876 – 1881.
[32] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, Cornell University Library 2012, arXiv:1207.0580.
[33] F. Agostinelli, M. Hoffman, P. Sadowski, P. Baldi, Cornell University Library 2014, arXiv:1412.6830.
[34] R. Raina, A. Madhavan, A. Y. Ng, Proc. 26th Annual Int. Conf. Mach. Learn. ICML 2009, 40, 1 – 8.
[35] J. Schmidhuber, Neural Networks 2015, 61, 85 – 117.
[36] Q. V. Le, IEEE International Conference on Acoustics, Speech and Signal Processing 2013, 8595 – 8598.
[37] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, IEEE Conference on Computer Vision and Pattern Recognition 2014, 1701 – 1708.
[38] D. Cireşan, U. Meier, J. Masci, J. Schmidhuber, Neural Networks 2012, 32, 333 – 338.
[39] Y. Sun, Y. Chen, X. Wang, X. Tang, Advances in Neural Information Processing Systems 2014, 1988 – 1996.
[40] a) D. Ciresan, U. Meier, J. Schmidhuber, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2012, 3642 – 3649, doi:10.1109/CVPR.2012.6248110; b) L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, R. Fergus, Proceedings of the 30th International Conference on Machine Learning (ICML-13) 2013, 1058 – 1066.
[41] K. He, X. Zhang, S. Ren, J. Sun, Cornell University Library 2015, arxiv.org/abs/1502.01852.


[42] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd ed., Wiley, New York, 2000.
[43] S. Hochreiter, Diploma Thesis, Tech. Univ. München, Title: “Untersuchungen zu dynamischen neuronalen Netzen”, 1991.
[44] R. Pascanu, T. Mikolov, Y. Bengio, Proceedings of the 30th International Conference on Machine Learning 2012, 1310 – 1318.
[45] J. Schmidhuber, Neural Comput. 1992, 4, 131 – 139.
[46] S. Hochreiter, J. Schmidhuber, Neural Comput. 1997, 9, 1735 – 1780.
[47] G. Montavon, G. B. Orr, K.-R. Müller (eds.), Neural Networks: Tricks of the Trade, Springer, Berlin, 2012.
[48] a) M. Meissner, M. Schmuker, G. Schneider, BMC Bioinformatics 2006, 7, 125; b) G. Hanrahan, Analyst 2011, 136, 3587 – 3594; c) W. Van Geit, E. De Schutter, P. Achard, Biol. Cybern. 2008, 99, 241 – 251.
[49] a) I. I. Baskin, V. A. Palyulin, N. S. Zefirov, Neural networks in building QSAR models, Methods Mol. Biol. 2008, 458, 137 – 158; b) F. Burden, D. Winkler, Bayesian regularization of neural networks, Methods Mol. Biol. 2008, 458, 25 – 44; c) J. Tao, S. Wen, W. Hu, L1-norm locally linear representation regularization multi-source adaptation learning, Neural Networks 2015, 69, 80 – 98; d) K. Li, J. Deng, H. B. He, Y. Li, D. J. Du, Int. J. Comput. Biol. Drug Des. 2010, 3, 112 – 132.
[50] A. Tropsha, Mol. Inf. 2010, 29, 476 – 488.
[51] a) M. M. Islam, M. A. Sattar, M. F. Amin, X. Yao, K. Murase, IEEE Trans. Syst. Man Cybern. B Cybern. 2009, 39, 705 – 722; b) H. G. Han, L. D. Wang, J. F. Qiao, Neural Networks 2013, 43, 22 – 32.
[52] R. Lohmann, G. Schneider, P. Wrede, Biopolymers 1996, 38, 13 – 29.
[53] Y. Bengio, A. Courville, P. Vincent, IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798 – 1828.
[54] G. Dahl, Deep Learning Approaches to Problems in Speech Recognition, Computational Chemistry, and Natural Language Text Processing, PhD thesis, University of Toronto, 2015.
[55] a) M. A. Yildirim, K.-I. Goh, M. E. Cusick, A.-L. Barabási, M. Vidal, Nat. Biotechnol. 2007, 25, 1119 – 1126; b) G. Schneider, D. Reker, T. Rodrigues, P. Schneider, Chimia 2014, 68, 648 – 653.
[56] a) J. Scheiber, B. Chen, M. Milik, S. C. Sukuru, A. Bender, D. Mikhailov, S. Whitebread, J. Hamon, K. Azzaoui, L. Urban, M. Glick, J. W. Davies, J. L. Jenkins, J. Chem. Inf. Model. 2009, 49, 308 – 317; b) T. Van Laarhoven, S. B. Nabuurs, E. Marchiori, Bioinformatics 2011, 27, 3036 – 3043; c) M. Reutlinger, T. Rodrigues, P. Schneider, G. Schneider, Angew. Chem. Int. Ed. 2014, 53, 4244 – 4248.
[57] D. Erhan, P.-J. L’heureux, S. Y. Yue, Y. Bengio, J. Chem. Inf. Model. 2006, 46, 626 – 635.
[58] G. Schneider, O. Clément-Chomienne, L. Hilfiger, P. Schneider, S. Kirsch, H.-J. Böhm, W. Neidhart, Angew. Chem. Int. Ed. 2000, 39, 4130 – 4133.
[59] a) J. V. de Julian-Ortiz, Comb. Chem. High Throughput Screen. 2001, 4, 295 – 310; b) D. P. Visco Jr, R. S. Pophale, M. D. Rintoul, J. L. Faulon, J. Mol. Graph. Model. 2002, 20, 429 – 438; c) N. Brown, B. McKay, J. Gasteiger, J. Comput. Aided Mol. Des. 2006, 20, 333 – 341; d) W. W. Wong, F. J. Burkowski, J. Cheminform. 2009, 1, 4.
[60] M. A. Carreira-Perpignan, G. E. Hinton, in: Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS) 2005, 59 – 66.
[61] D. H. Ackley, G. E. Hinton, T. J. Sejnowski, Cogn. Sci. 1985, 9, 147 – 169.
[62] D. Buchaca, E. Romero, F. Mazzanti, J. Delgado, Cornell University Library 2013, arXiv:1312.6062.
[63] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, Heidelberg, 2006.
[64] G. E. Hinton, Neural Comput. 2002, 8, 1771 – 1800.
[65] R. Salakhutdinov, G. Hinton, Neural Comput. 2012, 8, 1967 – 2006.
[66] L. Deng, APSIPA Trans. Signal Inf. Process. 2014, 3, e2.
[67] Y. Wang, J. Zeng, Bioinformatics 2013, 29, i126 – i134.
[68] a) B. Fasel, Acta Neurol. Belg. 2003, 103, 6 – 12; b) P. Wang, G. Cottrell, J. Vis. 2015, 15, 1091.
[69] a) J. J. DiCarlo, D. Zoccolan, N. C. Rust, Neuron 2012, 73, 415 – 434; b) C. F. Cadieu, H. Hong, D. L. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, J. J. DiCarlo, PLoS Comput. Biol. 2014, 10, e1003963.
[70] a) R. P. Lippmann, Neural Comput. 1989, 1, 1 – 38; b) S. Lawrence, C. L. Giles, A. C. Tsoi, A. D. Back, IEEE Trans. Neural Netw. 1997, 8, 98 – 113.
[71] A. Krizhevsky, I. Sutskever, G. E. Hinton, in: Advances in Neural Information Processing Systems 2012, 1097 – 1105.
[72] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Nature 2015, 518, 529 – 533.
[73] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Proc. IEEE 1998, 11, 2278 – 2323.
[74] M. Zeiler, R. Fergus, Comput. Vision – ECCV 2014, 2014, 818 – 833.
[75] K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, Y. LeCun, in: 2009 IEEE 12th International Conference on Computer Vision, 2009, 2146 – 2153.
[76] J. T. Springenberg, A. Dosovitskiy, T. Brox, M. A. Riedmiller, Cornell University Library 2014, arXiv:1412.6806.
[77] P. P. Ypsilantis, M. Siddique, H. M. Sohn, A. Davies, G. Cook, V. Goh, G. Montana, PLoS One 2015, 10, e0137036.
[78] D. E. Rumelhart, J. L. McClelland, PDP Research Group, Parallel Distributed Processing, MIT Press, Cambridge, 1986.
[79] A. Mendyk, S. Güres, R. Jachowicz, J. Szlęk, S. Polak, B. Wiśniowska, P. Kleinebudde, Comput. Math. Methods Med. 2015, 2015, 863874.
[80] D. A. Winkler, Mol. Biotechnol. 2004, 27, 139 – 168.
[81] T. I. Netzeva, A. Worth, T. Aldenberg, R. Benigni, M. T. D. Cronin, P. Gramatica, J. S. Jaworska, S. Kahn, G. Klopman, C. A. Marchant, G. Myatt, N. Nikolova-Jeliazkova, G. Y. Patlewicz, R. Perkins, D. Roberts, T. Schultz, D. W. Stanton, J. J. van de Sandt, W. Tong, G. Veith, C. Yang, Altern. Lab. Anim. 2005, 33, 155 – 173.
[82] a) K. Roy, Expert Opin. Drug Discov. 2007, 2, 1567 – 1577; b) I. Tetko, Methods Mol. Biol. 2008, 458, 185 – 202; c) A. Tropsha, Mol. Inf. 2010, 29, 476 – 488; d) N. Fjodorova, M. Novič, A. Roncaglioni, E. Benfenati, J. Comput. Aided Mol. Des. 2011, 25, 1147 – 1158.
[83] a) D. T. Manallack, B. G. Tehan, E. Gancia, B. D. Hudson, M. G. Ford, D. J. Livingstone, D. C. Whitley, W. R. Pitt, J. Chem. Inf. Comput. Sci. 2003, 43, 674 – 679; b) I. V. Tetko, D. J. Livingstone, A. I. Luik, J. Chem. Inf. Comput. Sci. 1995, 35, 826 – 833; c) I. Sushko, S. Novotarskyi, R. Körner, A. K. Pandey, V. V. Kovalishyn, V. V. Prokopenko, I. V. Tetko, J. Chemometr. 2010, 24, 202 – 208; d) I. Sushko, S. Novotarskyi, R. Körner, A. K. Pandey, A. Cherkasov, J. Li, P. Gramatica, K. Hansen, T. Schroeter, K.-R. Müller, L. Xi, H. Liu, X. Yao, T. Öberg, F. Hormozdiari, P. Dao, C. Sahinalp, R. Todeschini, P. Polishchuk, A. Artemenko, V. Kuz’min, T. M. Martin, D. M. Young, D. Fourches, E. Muratov, A. Tropsha, I. Baskin, D. Horvath, G. Marcou, C. Müller, A. Varnek, V. V. Prokopenko, I. V. Tetko, J. Chem. Inf. Model. 2010, 50, 2094 – 2111.


[84] G. E. Dahl, N. Jaitly, R. Salakhutdinov, Cornell University Library 2014, arXiv:1406.1231.
[85] Y. Qi, M. Oja, J. Weston, W. S. Noble, PLoS One 2012, 3, e32235.
[86] B. Ramsundar, S. Kearnes, P. Riley, D. Webster, D. Konerding, V. Pande, Cornell University Library 2015, arXiv:1502.02072.
[87] T. Unterthiner, A. Mayr, G. Klambauer, M. Steijaert, J. K. Wegner, H. Ceulemans, S. Hochreiter, Deep Learning and Representation Learning Workshop, NIPS 2014.
[88] T. Unterthiner, A. Mayr, G. Klambauer, S. Hochreiter, Cornell University Library 2015, arXiv:1503.01445.
[89] P. Di Lena, K. Nagata, P. Baldi, Bioinformatics 2012, 19, 2449 – 2457.
[90] J. Lyons, A. Dehzangi, R. Heffernan, A. Sharma, K. Paliwal, A. Sattar, Y. Zhou, Y. Yang, J. Comput. Chem. 2014, 28, 2040 – 2046.
[91] R. Heffernan, K. Paliwal, J. Lyons, A. Dehzangi, A. Sharma, J. Wang, A. Sattar, Y. Yang, Y. Zhou, Sci. Rep. 2015, 11476.
[92] S. P. Nguyen, Y. Shang, D. Xu, Proc. Int. Jt. Conf. Neural Netw. 2014, 2071 – 2078.
[93] M. K. K. Leung, H. Y. Xiong, L. J. Lee, B. J. Frey, Bioinformatics 2014, 12, i121 – i129.
[94] J. Eickholt, J. Cheng, Bioinformatics 2012, 23, 3066 – 3072.
[95] J. Eickholt, J. Cheng, BMC Bioinformatics 2013, 14, S12.
[96] J. Zhou, O. G. Troyanskaya, Cornell University Library 2014, arXiv:1403.1347.
[97] A. Lusci, G. Pollastri, P. Baldi, J. Chem. Inf. Model. 2013, 7, 1563 – 1575.
[98] Y. Xu, Z. Dai, F. Chen, S. Gao, J. Pei, L. Lai, J. Chem. Inf. Model. 2015, 55, 2085 – 2093.
[99] P. Baldi, S. Brunak, P. Frasconi, G. Soda, G. Pollastri, Bioinformatics 1999, 11, 937 – 946.
[100] P. Baldi, G. Pollastri, J. Mach. Learn. Res. 2004, 4, 575 – 602.
[101] S. K. Sønderby, O. Winther, Cornell University Library 2014, arXiv:1412.7828.
[102] S. K. Sønderby, C. K. Sønderby, H. Nielsen, O. Winther, Cornell University Library 2015, arXiv:1503.01919.
[103] Y. Park, M. Kellis, Nat. Biotechnol. 2015, 8, 825 – 826.
[104] B. Alipanahi, A. Delong, M. T. Weirauch, B. J. Frey, Nat. Biotechnol. 2015, 8, 831 – 838.
[105] T. B. Hughes, G. P. Miller, S. J. Swamidass, ACS Cent. Sci. 2015, 4, 168 – 180.
[106] O. Denas, J. Taylor, Representation Learning Workshop, ICML 2013.
[107] W. Kew, J. B. O. Mitchell, Mol. Inf. 2015, 9, 634 – 647.
[108] P. Di Lena, K. Nagata, P. F. Baldi, Advances in Neural Information Processing Systems 2012, 512 – 520.
[109] M. J. Skwark, D. Raimondi, M. Michel, A. Elofsson, PLoS Comput. Biol. 2014, 11, e1003889.
[110] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, R. P. Adams, Cornell University Library 2015, arXiv:1509.09292.

Received: October 25, 2015
Accepted: December 1, 2015
Published online: December 30, 2015

