On The Origin of Deep Learning
Haohan Wang, Bhiksha Raj
Abstract
This paper is a review of the evolutionary history of deep learning models. It covers
the genesis of neural networks in the associationist modeling of the brain, through to the
models that have dominated the last decade of research in deep learning, such as convolutional
neural networks, deep belief networks, and recurrent neural networks. Beyond reviewing
these models, this paper primarily focuses on their precedents, examining how the initial
ideas were assembled to construct the early models and how these preliminary models
developed into their current forms. Many of these evolutionary paths last more than half a
century and branch in a diversity of directions. For example, CNNs are built on prior
knowledge of the biological vision system; DBNs evolved from a trade-off between the modeling
power and computational complexity of graphical models; and many of today's models are
neural counterparts of ancient linear models. This paper reviews these evolutionary paths,
offers a concise account of how these models were developed, and aims to provide a
thorough background for deep learning. More importantly, along the way, this paper
summarizes the gist behind these milestones and proposes many directions to guide
future research in deep learning.
1. Introduction
Deep learning has dramatically improved the state of the art in many different artificial
intelligence tasks such as object detection, speech recognition, and machine translation (LeCun et al.,
2015). Its deep architecture grants deep learning the possibility of solving many more
complicated AI tasks (Bengio, 2009). As a result, researchers are extending deep learning
to a variety of different modern domains and tasks in addition to traditional tasks like
object detection, face recognition, or language modeling. For example, Osako et al. (2015) use
a recurrent neural network to denoise speech signals, Gupta et al. (2015) use stacked
autoencoders to discover clustering patterns of gene expressions, Gatys et al. (2015) use a
neural model to generate images with different styles, and Wang et al. (2016) use deep learning
for sentiment analysis from multiple modalities simultaneously. This is an era that is
witnessing the blooming of deep learning research.
However, to fundamentally push the deep learning research frontier forward, one needs
to thoroughly understand what has been attempted in history and why current models
exist in their present forms. This paper summarizes the evolutionary history of several different
deep learning models and explains the main ideas behind these models and their relationship
to their ancestors. Understanding the past work is not trivial, as deep learning has evolved
over a long history, as shown in Table 1. Therefore, this paper aims to offer
readers a walk-through of the major milestones of deep learning research. We will cover the
milestones shown in Table 1, as well as many additional works. We split the story
into different sections for clarity of presentation.
This paper starts the discussion from research on modeling the human brain. Although
the success of deep learning nowadays is not necessarily due to its resemblance to the human
brain (it owes more to its deep architecture), the ambition to build a system that simulates the brain
indeed drove the initial development of neural networks. Therefore, the next section begins
with connectionism and naturally leads to the age when shallow neural networks matured.
With the maturity of neural networks, this paper continues to briefly discuss the
necessity of extending shallow neural networks into deeper ones, as well as the promises deep
neural networks make and the challenges deep architectures introduce.
With the establishment of deep neural networks, this paper diverges into three different
popular deep learning topics. Specifically, in Section 4, this paper elaborates how
Deep Belief Nets and their building block, the Restricted Boltzmann Machine, evolved as a
trade-off between modeling power and computational load. In Section 5, this paper focuses on the
development history of Convolutional Neural Networks, featuring the prominent steps
along the ladder of the ImageNet competition. In Section 6, this paper discusses the development
of Recurrent Neural Networks and their successors, such as LSTM and attention models, and the
successes they achieved.
While this paper primarily discusses deep learning models, optimization of deep architectures
is an unavoidable topic in this field. Section 7 is devoted to a brief summary of
optimization techniques, including advanced gradient methods, Dropout, Batch Normalization,
etc.
This paper can be read as a complement to (Schmidhuber, 2015). Schmidhuber's
paper aims to assign credit to all those who contributed to the present state of the art,
so it focuses on every single incremental work along the path and therefore cannot elaborate
in depth on each of them. Our paper, on the other hand, aims to provide the
background for readers to understand how these models were developed. Therefore, we
emphasize the milestones and elaborate on their ideas to help build associations between them.
In addition to the paths of the classical deep learning models covered in (Schmidhuber, 2015),
we also discuss recent deep learning work that builds on classical linear models.
Another article that readers could read as a complement is (Anderson and Rosenfeld,
2000), in which the authors conducted extensive interviews with well-known scientific leaders
of the 90s on the topic of the history of neural networks.
2.1 Associationism
“When, therefore, we accomplish an act of reminiscence, we pass through a
certain series of precursive movements, until we arrive at a movement on which
the one we are in quest of is habitually consequent. Hence, too, it is that we
hunt through the mental train, excogitating from the present or some other,
and from similar or contrary or coadjacent. Through this process reminiscence
takes place. For the movements are, in these cases, sometimes at the same time,
sometimes parts of the same whole, so that the subsequent movement is already
more than half accomplished.”
• Similarity: Thought of one event tends to trigger the thought of a similar event.
• Contrast: Thought of one event tends to trigger the thought of an opposite event.
Back then, Aristotle described the implementation of these laws in our minds as common
sense. For example, the feel, the smell, or the taste of an apple should naturally lead to
the concept of an apple, as common sense. Nowadays, it is surprising to see that these
laws, proposed more than 2000 years ago, still serve as fundamental assumptions of
machine learning methods. For example, samples that are near each other (under a defined
distance) are clustered into one group; explanatory variables that frequently occur with
response variables draw more attention from the model; and similar/dissimilar data are usually
represented with more similar/dissimilar embeddings in latent space.
Contemporaneously, similar laws were also proposed by Zeno of Citium, Epicurus and
St Augustine of Hippo. The theory of associationism was later strengthened by a variety
of philosophers and psychologists. Thomas Hobbes (1588-1679) stated that complex
experiences were associations of simple experiences, which were themselves associations of sensations.
He also believed that association operates by means of coherence, with frequency as its strength
factor. Meanwhile, John Locke (1632-1704) introduced the concept of the “association of ideas”.
He separated ideas of sensation from ideas of reflection, and he stated
that complex ideas could be derived from a combination of these two kinds of simple ideas. David
Hume (1711-1776) later reduced Aristotle's four laws to three: resemblance (similarity),
contiguity, and cause and effect. He believed that whatever coherence the world seemed to
have was a matter of these three laws. Dugald Stewart (1753-1828) extended these three
laws with several other principles, among them an obvious one: accidental coincidence in the
sounds of words. Thomas Reid (1710-1796) believed that no original quality of mind was
required to explain the spontaneous recurrence of thinking, other than habits. James Mill
(1773-1836) emphasized the law of frequency as the key to learning, which is very similar
to later stages of research.
David Hartley (1705-1757), a physician, is widely regarded as the one who
made associationism popular (Hartley, 2013). In addition to the existing laws, he proposed the
argument that memory could be conceived as smaller-scale vibrations in the same regions
of the brain as the original sensory experience. These vibrations can link up to represent
complex ideas and therefore act as a material basis for the stream of consciousness. This
idea potentially inspired the Hebbian learning rule, which will be discussed later in this paper
as it lays the foundation of neural networks.
networks of today: an individual cell summarizes the stimulation from other selected
linked cells within a grouping, as shown in Figure 1. The joint stimulation from a and b
triggers X, stimulation from b and c triggers Y, and stimulation from a and c triggers Z. In
his original illustration, a, b, c stand for stimulations, and X, Y, Z are the outcomes of cells.
With the establishment of how this associative structure of neural groupings can function
as memory, Bain proceeded to describe the construction of these structures. He followed the
directions of associationism and stated that relevant impressions of neural groupings must
be made in temporal contiguity for a period, either on one occasion or on repeated occasions.
Further, Bain described the computational properties of neural groupings: connections
are strengthened or weakened through experience via changes of the intervening cell-substance.
Therefore, the induced circuits would be comparatively strong or weak.
As we will see in the following section, Hebb's postulate highly resembles Bain's description,
although nowadays we usually attribute this postulate to Hebb rather than Bain,
according to (Wilkes and Wade, 1997). This omission of Bain's contribution may also be
due to Bain's lack of confidence in his own theory: eventually, Bain was not convinced
himself and doubted the practical value of neural groupings.
This archaic paragraph can be re-written in modern machine learning language as follows:
$$\Delta w_i = \eta x_i y$$
where Δw_i stands for the change of the synaptic weight w_i of Neuron i, whose input
signal is x_i; y denotes the postsynaptic response and η denotes the learning rate. In other
words, the Hebbian Learning Rule states that the connection between two units should be
strengthened as the frequency of co-occurrence of these two units increases.
Although the Hebbian Learning Rule is seen as laying the foundation of neural networks,
its drawbacks are obvious when viewed today: as co-occurrences accumulate, the weights of the
connections keep increasing, and the weight of a dominant signal will increase exponentially. This
is known as the instability of the Hebbian Learning Rule (Principe et al., 1999). Fortunately,
these problems did not affect Hebb's status as the father of neural networks.
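The following short numpy sketch (our own illustration, not from the original literature) makes this instability concrete: repeatedly applying the Hebbian update to the same input pattern makes the weight norm grow without bound.

import numpy as np

# Minimal sketch of the Hebbian update delta_w_i = eta * x_i * y.
# Repeated co-occurrence of the same (dominant) input pattern makes the
# weights grow without bound, illustrating the instability discussed above.
rng = np.random.default_rng(0)
eta = 0.1                        # learning rate
w = rng.normal(size=3) * 0.01    # synaptic weights, small random init
x = np.array([1.0, 0.5, -0.2])   # a fixed (dominant) input pattern

for step in range(1, 101):
    y = w @ x                    # postsynaptic response
    w += eta * x * y             # Hebbian update: strengthen co-active units
    if step % 25 == 0:
        print(step, np.linalg.norm(w))  # the norm keeps growing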
Written with an explicit iteration index t, the Hebbian update is w_i^{t+1} = w_i^t + η x_i y, where t denotes
the iteration. A straightforward way to avoid the explosion of weights is
to apply normalization at the end of each iteration, yielding:
$$w_i^{t+1} = \frac{w_i^t + \eta x_i y}{\left(\sum_{i=1}^{n}\left(w_i^t + \eta x_i y\right)^2\right)^{\frac{1}{2}}}$$
where n denotes the number of neurons. The above equation can be further expanded into
the following form:
$$w_i^{t+1} = \frac{w_i^t}{Z} + \eta\left(\frac{y x_i}{Z} - \frac{w_i^t \sum_j^n y x_j w_j}{Z^3}\right) + O(\eta^2)$$
where $Z = (\sum_i^n w_i^2)^{\frac{1}{2}}$. Further, two more assumptions are introduced: 1) η is small,
therefore $O(\eta^2)$ is approximately 0; 2) the weights are normalized, therefore $Z = (\sum_i^n w_i^2)^{\frac{1}{2}} = 1$.
When these two assumptions are substituted back into the previous equation, Oja's rule
follows:
$$w_i^{t+1} = w_i^t + \eta y \left(x_i - y w_i^t\right) \qquad (2)$$
Oja took a step further to show that a neuron updated with this rule effectively
performs Principal Component Analysis on the data. To show this, Oja first
re-wrote Equation 2 in the following form under two additional assumptions (Oja, 1982):
$$\frac{d}{dt} w_i^t = C w_i^t - \left((w_i^t)^T C w_i^t\right) w_i^t$$
where C is the covariance matrix of the input X. He then proceeded to show this property
using several conclusions from another of his works (Oja and Karhunen, 1985), and linked it back
to PCA through the fact that the components from PCA are eigenvectors and that the first component
is the eigenvector corresponding to the largest eigenvalue of the covariance matrix. Intuitively,
we can interpret this property with a simpler explanation: the eigenvectors of C are the
solutions that maximize the objective of the update rule. Since the stable points w_i^t of the
rule are eigenvectors of the covariance matrix of X, the neuron effectively learns the principal components of the data.
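As a numerical illustration of this property, the following sketch (our own, with synthetic data) applies Oja's rule to two-dimensional samples and compares the learned weight vector with the leading eigenvector of the sample covariance matrix.

import numpy as np

# Sketch of Oja's rule  w <- w + eta * y * (x - y * w),  y = w . x,
# on data stretched along one direction; w converges (up to sign) to the
# first principal component of the data covariance.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2)) @ np.diag([3.0, 0.5])      # anisotropic data
X = X @ np.array([[1, 1], [-1, 1]]) / np.sqrt(2)          # rotate the axes

w = rng.normal(size=2)
w /= np.linalg.norm(w)
eta = 0.01
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)       # Oja's update keeps ||w|| bounded

C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
pc1 = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
print("Oja weight :", w / np.linalg.norm(w))
print("First PC   :", pc1)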
Oja’s learning rule concludes our story of learning rules of the early-stage neural network.
Now we proceed to visit the ideas on neural models.
where y stands for the output, x_i stands for the input signals, w_i stands for the corresponding
weights, z_j stands for the inhibitory inputs, and θ stands for the threshold. The function is
designed in such a way that the activity of any inhibitory input completely prevents excitation
of the neuron at any time.
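A minimal sketch of the MCP computation described above is given below; the function signature and the example inputs are our own illustrative choices.

# Illustrative McCulloch-Pitts neuron: it fires (outputs 1) only if the
# weighted sum of excitatory inputs reaches the threshold AND no inhibitory
# input is active.
def mcp_neuron(x, w, z, theta):
    """x: excitatory inputs, w: fixed weights, z: inhibitory inputs, theta: threshold."""
    if any(z_j > 0 for z_j in z):          # any active inhibitory input vetoes the neuron
        return 0
    s = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if s >= theta else 0

# Example: a two-input neuron computing AND, silenced by an inhibitory signal
print(mcp_neuron([1, 1], [1, 1], z=[0], theta=2))   # -> 1
print(mcp_neuron([1, 1], [1, 1], z=[1], theta=2))   # -> 0 (inhibited)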
Despite the resemblance between the MCP Neural Model and the modern perceptron, they
still differ distinctly in many aspects:
• The MCP Neural Model was initially built as an electrical circuit. Later we will see that the
study of neural networks has borrowed many ideas from the field of electrical circuits.
• The weights w_i of the MCP Neural Model are fixed, in contrast to the adjustable weights
in the modern perceptron. All the weights must be assigned by manual calculation.
• The idea of the inhibitory input is quite unconventional even today. It might be an
idea worth further study in modern deep learning research.
2.6 Perceptron
With the success of the MCP Neural Model, Frank Rosenblatt further substantiated the Hebbian
Learning Rule with the introduction of perceptrons (Rosenblatt, 1958). While theorists
like Hebb were focusing on the biological system in the natural environment, Rosenblatt
constructed an electronic device named the Perceptron that was shown to have the ability to
learn in accordance with associationism.
Rosenblatt (1958) introduced the perceptron in the context of the vision system, as
shown in Figure 2(a). He introduced the rules of the organization of a perceptron as
follows:
• Stimuli impinge on a retina of sensory units, which respond in such a manner that the
pulse amplitude or frequency is proportional to the stimulus intensity.
• Impulses are transmitted to a Projection Area (A_I). This projection area is optional.
Figure 2: (a) Illustration of the organization of a perceptron in (Rosenblatt, 1958). (b) A typical perceptron in modern machine learning literature.
Figure 2(a) illustrates his description of the perceptron. From left to right, the four units
are the sensory unit, projection unit, association unit and response unit respectively. The projection
unit receives information from the sensory unit and passes it on to the association unit. This unit
is often omitted in other descriptions of similar models. With the omission of the projection
unit, the structure resembles the structure of the perceptron in a modern neural network (as
shown in Figure 2(b)): sensory units collect data, association units linearly add these data
with different weights and apply a non-linear transform to the thresholded sum, and then pass
the results to the response units.
One distinction between the early-stage neuron models and modern perceptrons is the
introduction of non-linear activation functions (we use the sigmoid function as an example
in Figure 2(b)). This originates from the argument that the linear threshold function should
be softened to simulate biological neural networks (Bose et al., 1996), as well as from the
computational consideration of replacing the step function with a continuous
one (Mitchell et al., 1997).
After Rosenblatt's introduction of the Perceptron, Widrow et al. (1960) introduced a follow-up
model called ADALINE. However, the difference between Rosenblatt's Perceptron and
ADALINE lies mainly in the learning algorithm. As the primary focus of this paper is neural
network models, we skip the discussion of ADALINE.
To show a more concrete example, we introduce a linear perceptron with only two inputs
x_1 and x_2; its decision boundary w_1 x_1 + w_2 x_2 = threshold forms a line in a two-dimensional
space. The choice of threshold magnitude shifts the line, and the sign of the
function picks one side of the line as the halfspace the function represents. The halfspace
is shown in Figure 3 (a).
In Figure 3 (b)-(d), we present two nodes a and b to denote the two inputs, as well as a node
to denote the situation when both of them trigger and a node to denote the situation when
neither of them triggers. Figure 3 (b) and Figure 3 (c) show clearly that a linear perceptron
can describe the AND and OR operations of these two inputs. However, in Figure 3
(d), when we are interested in the XOR operation, the operation can no longer be described by
a single linear decision boundary.
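The following small numpy sketch makes this point numerically: hand-picked weights and thresholds (our own choices) realize AND and OR, while a brute-force search over a grid of weights and thresholds finds no single linear perceptron realizing XOR.

import numpy as np

# A linear perceptron: output 1 if w1*x1 + w2*x2 exceeds the threshold.
def perceptron(x, w, theta):
    return 1 if x @ w - theta > 0 else 0

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# AND: fires only when both inputs are on
print([perceptron(x, np.array([1.0, 1.0]), 1.5) for x in inputs])  # [0, 0, 0, 1]
# OR: fires when at least one input is on
print([perceptron(x, np.array([1.0, 1.0]), 0.5) for x in inputs])  # [0, 1, 1, 1]

# XOR: a brute-force grid search over weights/thresholds finds no solution
target = [0, 1, 1, 0]
grid = np.linspace(-2, 2, 21)
found = any(
    [perceptron(x, np.array([w1, w2]), t) for x in inputs] == target
    for w1 in grid for w2 in grid for t in grid
)
print("XOR representable by a single perceptron:", found)   # False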
In the next section, we will show that the representation ability is greatly enlarged when
we put perceptrons together to make a neural network. However, when we keep stacking
one neural network upon the other to make a deep learning model, the representation power
will not necessarily increase.
• Boolean Approximation: an MLP of one hidden layer1 can represent any boolean
function exactly.
• Continuous Approximation: an MLP of one hidden layer can approximate any bounded
continuous function with arbitrary accuracy.
• Arbitrary Approximation: an MLP of two hidden layers can approximate any function
with arbitrary accuracy.
We will discuss these three properties in detail in the following paragraphs. To suit different
readers’ interest, we will first offer an intuitive explanation of these properties and then offer
the proofs.
1. Throughout this paper, we follow the widely accepted naming convention that refers to a two-layer
neural network as a one-hidden-layer neural network.
f(x) is dense in the subspace in which it lies. In other words, for an arbitrary function
g(x) in the same subspace as f(x), we have $|f(x) - g(x)| < \epsilon$,
where ε > 0. In Equation 3, σ denotes the activation function (a squashing function back
then), w_i denotes the weights for the input layer and ω_i denotes the weights for the hidden
layer.
This conclusion was drawn with a proof by contradiction: With Hahn-Banach Theorem
and Riesz Representation Theorem, the fact that the closure of f (x) is not all the subspace
where f (x) is in contradicts the assumption that σ is an activation (squashing) function.
To this day, this property has drawn thousands of citations. Unfortunately, many later
works cite this property inappropriately (Castro et al., 2000), because Equation 3
is not the widely accepted form of a one-hidden-layer neural network: it does not
deliver a thresholded/squashed output, but a linear output instead. Ten years after
this property was shown, Castro et al. (2000) concluded this story by showing that when
the final output is squashed, the universal approximation property still holds.
Note that this property was shown in the context where activation functions are squashing
functions. By definition, a squashing function σ : R → [0, 1] is a non-decreasing function
with the properties $\lim_{x\to\infty} \sigma(x) = 1$ and $\lim_{x\to-\infty} \sigma(x) = 0$. Many activation functions of
recent deep learning research do not fall into this category.
Before we move on to explain this property, we first need to show a major property regarding
combining linear perceptrons into neural networks. Figure 5 shows that as the number of
linear perceptrons used to bound the target function increases, the area outside the polygon with
a sum close to the threshold shrinks. Following this trend, we can use a large number of
perceptrons to bound a circle, and this can be achieved even without knowing the threshold,
because the area close to the threshold shrinks to nothing. What is left outside the circle is,
in fact, the area whose sum is N/2, where N is the number of perceptrons used.
Therefore, a neural network with one hidden layer can represent a circle of arbitrary
diameter. Further, we introduce another hidden layer that is used to combine the outputs of
many different circles. This newly added hidden layer is only used to perform the OR operation.
Figure 6 shows an example in which the extra hidden layer is used to merge the circles
from the previous layer, so that the neural network can approximate any function. The
target function is not necessarily continuous. However, each circle requires a large number
of neurons, and consequently the entire function requires even more.
This property was shown in (Lapedes and Farber, 1988) and (Cybenko, 1988) respectively.
Looking back at this property today, it is not arduous to build the connection
between this property and Fourier series approximation, which, in informal words, states
that every function curve can be decomposed into the sum of many simpler curves. With
this linkage, showing this universal approximation property amounts to showing that a one-hidden-layer
neural network can represent one simple surface, and that the second hidden layer then sums
up these simple surfaces to approximate an arbitrary function.
As we know, a one-hidden-layer neural network simply performs a thresholded sum
operation, therefore the only step left is to show that the first hidden layer can represent a
simple surface. To understand the “simple surface”, with linkage to the Fourier transform, one
can imagine one cycle of the sinusoid in the one-dimensional case or a “bump” of a plane
in the two-dimensional case.
Figure 6: How a neural network can be used to approximate a leaf shaped function.
For one dimension, to create a simple surface, we only need two sigmoid functions
appropriately placed, for example, as following:
$$f_1(x) = \frac{h}{1 + e^{-(x+t_1)}}, \qquad f_2(x) = \frac{h}{1 + e^{x-t_2}}$$
Then, with f1 (x) + f2 (x), we create a simple surface with height 2h from t1 ≤ x ≤ t2 .
This can easily be generalized to the n-dimensional case, where we need 2n sigmoid functions
(neurons) for each simple surface. Then, for each simple surface that contributes to the final
function, one neuron is added to the second hidden layer. Therefore, regardless of the number
of neurons needed, one never needs a third hidden layer to approximate any function.
Similar to how the Gibbs phenomenon affects Fourier series approximation, this approximation
cannot guarantee an exact representation.
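A quick numerical check of this construction is sketched below; the particular values of h, t1 and t2 are arbitrary illustrative choices.

import numpy as np

# Two appropriately placed sigmoids of height h sum to roughly 2h on an
# interval and roughly h outside it, forming the "simple surface" (bump).
h, t1, t2 = 1.0, 2.0, 2.0
f1 = lambda x: h / (1.0 + np.exp(-(x + t1)))   # rises around x = -t1
f2 = lambda x: h / (1.0 + np.exp(x - t2))      # falls around x =  t2

for x in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    print(f"x={x:+.1f}  f1+f2={f1(x) + f2(x):.3f}")
# inside the interval the sum approaches 2h = 2.0, far outside it approaches h = 1.0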
The universal approximation properties show the great potential of shallow neural networks,
but at the price of exponentially many neurons in these layers. One follow-up question
is how to reduce the number of required neurons while maintaining the representation
power. This question motivates people to proceed to deeper neural networks even though
shallow neural networks already have unlimited modeling power. Another issue worth attention
is that, although neural networks can approximate any function, it is not trivial to
find the set of parameters that explains the data. In the next two sections, we will discuss
these two questions respectively.
The universal approximation properties of shallow neural networks come at the price of exponentially
many neurons and are therefore not realistic. The question of how to maintain
this expressive power of the network while reducing the number of computation units has
been asked for years. Intuitively, Bengio and Delalleau (2011) suggested that it is natural
to pursue deeper networks because 1) the human neural system is a deep architecture (as we
will see in examples in Section 5 about the human visual cortex) and 2) humans tend to
represent concepts at one level of abstraction as the composition of concepts at lower levels.
Nowadays, the solution is to build deeper architectures, which follows from a conclusion
stating that the representation power of a k-layer neural network with polynomially many neurons
needs exponentially many neurons to be expressed by a (k−1)-layer structure.
However, this conclusion is still being completed theoretically.
This conclusion can be traced back three decades to when Yao (1985) showed the
limitations of shallow circuits. Hastad (1986) later showed this property with
parity circuits: “there are functions computable in polynomial size and depth k but requires
exponential size when depth is restricted to k − 1”. He showed this property mainly through
the application of DeMorgan's law, which states that any ANDs or ORs can be rewritten
as ORs of ANDs, and vice versa. Therefore, he simplified a circuit in which ANDs and ORs
appear one after the other by rewriting one layer of ANDs into ORs and then merging
this layer into its neighboring layer of ORs. By repeating this procedure, he was able to
represent the same function with fewer layers, but more computations.
Moving from circuits to neural networks, Delalleau and Bengio (2011) compared deep
and shallow sum-product neural networks. They showed that a function that can be
expressed with O(n) neurons on a network of depth k requires at least $O(2^{\sqrt{n}})$ and
$O((n-1)^k)$ neurons on a two-layer neural network.
Further, Bianchini and Scarselli (2014) extended this study to general neural networks
with many major activation functions, including tanh and sigmoid. They derived
their conclusion using the concept of Betti numbers, and used this number to describe the
representation power of neural networks. They showed that for a shallow network, the representation
power can only grow polynomially with respect to the number of neurons, but
for a deep architecture, the representation power can grow exponentially with respect to the number
of neurons. They also related their conclusion to the VC-dimension of neural networks, which
is $O(p^2)$ for tanh (Bartlett and Maass, 2003), where p is the number of parameters.
Recently, Eldan and Shamir (2015) presented a more thorough proof showing that the depth
of a neural network is exponentially more valuable than its width, for a
standard MLP with any popular activation function. Their conclusion is drawn with only a
few weak assumptions that constrain the activation functions to be mildly increasing, measurable,
and able to allow shallow neural networks to approximate any univariate Lipschitz
function. Finally, we have a well-grounded theory to support the fact that deeper networks
are preferred over shallow ones. However, in reality, many problems arise if we keep
increasing the number of layers. Among them, the increased difficulty of learning proper parameters
is probably the most prominent one. In the next section, we will discuss the
main driver of the search for parameters of a neural network: Backpropagation.
• The number of input neurons cannot be smaller than the number of classes/patterns in the data.
However, their approaches may not be relevant anymore, as they require the data to be
linearly separable, a condition under which many other models can be applied.
was able to overfit the data. Therefore, this phenomenon should have played a critical role
in the research on improving optimization techniques. Recently, studies of the cost
surfaces of neural networks have indicated the existence of saddle points (Choromanska
et al., 2015; Dauphin et al., 2014; Pascanu et al., 2014), which may explain the findings of
Brady et al. back in the late 80s.
Backpropagation enables the optimization of deep neural networks. However, there is
still a long way to go before we can optimize them well. Later, in Section 7, we will briefly
discuss more techniques related to the optimization of neural networks.
Figure 7: Trade-off between the representation power and computational complexity of several models,
which guides the development of better models.
With the background of how the modern neural network is set up, we proceed to visit
each prominent branch of the current deep learning family. Our first stop is the branch that
leads to the popular Restricted Boltzmann Machines and Deep Belief Nets, and it starts with
a model for understanding data in an unsupervised manner.
Figure 7 summarizes the models that will be covered in this section. The horizontal axis
stands for the computational complexity of these models while the vertical axis stands for their
representation power. The six milestones that will be focused on in this section are placed in
the figure.
represent the data while the lower circles denote the data. There are no connections between the
nodes in SOM2.
The position of each node is fixed. The representation should not be viewed as only a
numerical value; instead, its position also matters. This property differs from
some widely accepted representation criteria. For example, we compare the case when a
one-hot vector and a one-dimensional SOM are used to denote colors: to denote green out
of the set C = {green, red, purple}, a one-hot representation can use any of the vectors (1, 0, 0),
(0, 1, 0) or (0, 0, 1), as long as we specify the bit for green correspondingly. However, for a
one-dimensional SOM, only two vectors are possible: (1, 0, 0) or (0, 0, 1). This is because
SOM aims to represent the data while retaining similarity; since red and
purple are much more similar to each other than green is to red or to purple, green should not be
represented in a way that splits red and purple. One should notice that this example is
only used to demonstrate that the position of each unit in a SOM matters. In practice, the
values of SOM units are not restricted to integers.
The learned SOM is usually a good tool for visualizing data. For example, suppose we conduct
a survey on the happiness level and richness level of each country and feed the data into
a two-dimensional SOM. Then the trained units should represent the happiest and richest
country at one corner and the opposite country at the furthest corner. The other
two corners represent the richest yet unhappiest country and the poorest but happiest country.
The remaining countries are positioned accordingly. The advantage of SOM is that it allows one
2. In some other literature, (Bullinaria, 2004) for example, one may notice that there are connections
in the illustrations of the models. However, those connections are only used to represent the neighborhood
relationship of nodes, and no information flows via those connections. In this paper, as we
will show many other models that rely on a clear illustration of information flow, we reserve
connections to denote information flow.
to easily tell how a country ranks in the world with a simple glance at the learned
units (Guthikonda, 2005).
With an understanding of the representation power of SOM, we now proceed to its parameter
learning algorithm. The classic algorithm is heuristic and intuitive, as shown below:
Here we use a two-dimensional SOM as an example, and i, j are the indexes of units; w is the weight
of a unit; v denotes a data vector; k is the index of data; t denotes the current iteration; N
constrains the maximum number of steps allowed; P(·) denotes the penalty considering the
distance between unit p, q and unit i, j; l is the learning rate; and r denotes a radius used to select
neighbor nodes. Both l and r typically decrease as t increases. $\|\cdot\|_2^2$ denotes the squared Euclidean
distance and dist(·) denotes the distance between the positions of units.
This algorithm explains how a SOM can be used to learn a representation and how
similarities are retained: it always selects a subset of units that are similar to the sampled data
and adjusts the weights of those units to match the sampled data.
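As a concrete illustration of the heuristic just described, the following numpy sketch implements one plausible version of the SOM training loop. The decay schedules for l and r and the Gaussian form of the penalty P(·) are our own choices for illustration; the classic algorithm leaves these as design decisions.

import numpy as np

# Sketch of SOM training: sample a data vector, find the best matching unit,
# then pull the winner and its grid neighbors toward the sample.
rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3
w = rng.random((grid_h, grid_w, dim))          # unit weights on a 2-D grid
v = rng.random((500, dim))                     # data vectors

N = 2000                                       # maximum number of steps
for t in range(N):
    l = 0.5 * np.exp(-t / N)                   # learning rate decays with t
    r = max(grid_h, grid_w) / 2 * np.exp(-t / N)   # radius decays with t
    vk = v[rng.integers(len(v))]               # sample one data vector
    d = ((w - vk) ** 2).sum(axis=2)            # squared distances to all units
    bi, bj = np.unravel_index(d.argmin(), d.shape)  # best matching unit
    for i in range(grid_h):
        for j in range(grid_w):
            dist = np.hypot(i - bi, j - bj)    # distance on unit positions
            if dist <= r:
                P = np.exp(-dist**2 / (2 * (r + 1e-9)**2))   # neighborhood penalty
                w[i, j] += l * P * (vk - w[i, j])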
However, this algorithm relies on a careful selection of the radius of neighbor selection
and a good initialization of the weights. Otherwise, although the learned weights will have the
local property of topological similarity, they may lose this property globally: sometimes, two similar
clusters of events are separated by another, dissimilar cluster. In
simpler words, units of green may actually separate units of red and units of purple if the
network is not appropriately trained (Germano, 1999).
3. The term “recurrent” is very confusing nowadays because of the popularity that recurrent neural networks
(RNN) have gained.
The magnetic system will evolve until this potential energy reaches its minimum.
where s is the state of a unit, b denotes the bias, w denotes the bidirectional weights, and i, j
are the indexes of units. This energy function closely connects to the potential energy function
of spin glass, as shown in Equation 4.
A Hopfield Network is typically applied to memorize states of data. The weights of a
network are designed or learned to make sure that the energy is minimized given the states
of interest. Therefore, when another state is presented to the network, with the weights
fixed, the Hopfield Network can search for the states that minimize the energy and recover the
state in memory. For example, in a face completion task, when some images of faces are
presented to a Hopfield Network (in a way that each unit of the network corresponds to each
pixel of one image, and images are presented one after the other), the network can calculate
the weights to minimize the energy given these faces. Later, if an image is corrupted or
distorted and presented to this network again, the network is able to recover the original
image by searching for a configuration of states that minimizes the energy, starting from the
corrupted input presented.
The term “energy” may remind people of physics. Explaining how the Hopfield Network
works with a physics analogy makes this clearer: nature uses a Hopfield Network to memorize the
equilibrium position of a pendulum because, in the equilibrium position, the pendulum has
the lowest gravitational potential energy. Therefore, wherever the pendulum is placed, it will
converge back to the equilibrium position.
network will invert the state and proceed to test the next unit. This procedure is called
asynchronous update, and it is obviously subject to the sequential order in which units are
selected. The counterpart is known as synchronous update, in which the network first
tests all the units and then inverts all the units-to-invert simultaneously. Both of these
methods may lead to a local optimum. Synchronous update may even result in an increase
of energy and may converge to an oscillation or a loop of states.
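The following numpy sketch illustrates asynchronous updates on a toy Hopfield Network. The Hebbian outer-product construction of the weights and the specific toy patterns are our own illustrative choices rather than a prescription from the text.

import numpy as np

# Toy Hopfield recall: weights from a Hebbian outer-product rule over stored
# patterns; asynchronous updates set each unit to the sign of its local field,
# which never increases the energy E = -0.5 * s^T W s.
rng = np.random.default_rng(0)
patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]])
N = patterns.shape[1]
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)                          # no self-connections

def energy(s):
    return -0.5 * s @ W @ s

s = patterns[0].copy()
s[[1, 2]] *= -1                                 # corrupt two pixels
for sweep in range(5):                          # asynchronous update sweeps
    for i in rng.permutation(N):                # unit order matters in general
        s[i] = 1 if W[i] @ s >= 0 else -1       # update one unit at a time
print("recovered:", s, "energy:", energy(s))
print("matches stored pattern:", np.array_equal(s, patterns[0]))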
4.2.4 Capacity
One distinct disadvantage of the Hopfield Network is that it cannot store memory very
efficiently: a network of N units can only store up to $0.15N^2$ bits of memory, while a
network with N units has $N^2$ edges. In addition, after storing M memories (M instances
of data), each connection has an integer value in the range [−M, M]. Thus, the number of bits
required to store the N units is $N^2 \log(2M + 1)$ (Hopfield, 1982). Therefore, we can safely
draw the conclusion that, although the Hopfield Network is a remarkable idea that enables a
network to memorize data, it is extremely inefficient in practice.
As follow-ups to the invention of the Hopfield Network, many works attempted to study
and increase the capacity of the original Hopfield Network (Storkey, 1997; Liou and Yuan,
1999; Liou and Lin, 2006). Despite these attempts, the Hopfield Network still gradually
faded out of the community. It was replaced by other models that were inspired by it. Immediately
following this section, we will discuss the popular Boltzmann Machine and Restricted Boltzmann
Machine and study how these models were upgraded from the initial ideas of the Hopfield
Network and evolved to replace it.
$$F(s) \propto e^{-\frac{E_s}{kT}}$$
where s stands for the state and E_s is the corresponding energy. k and T are Boltzmann's
constant and the thermodynamic temperature respectively. Naturally, the ratio of two
distributions is characterized only by the difference of their energies, as follows:
Figure 10: Illustration of Boltzmann Machine. With the introduction of hidden units
(shaded nodes), the model conceptually splits into two parts: visible units and
hidden units. The red dashed line is used to highlight the conceptual separation.
Given how the distribution is specified by the energy, the probability of each state is defined as its
exponential term divided by a normalizer, as follows:
$$p_{s_i} = \frac{e^{-\frac{E_{s_i}}{kT}}}{\sum_j e^{-\frac{E_{s_j}}{kT}}}$$
algorithm called Simulated Annealing (Khachaturyan et al., 1979; Aarts and Korst, 1988)
back then, but Simulated Annealing is hardly relevant to today's deep learning community.
Regardless of the historical importance that the term T introduces, within this section we
will assume T = 1 as a constant, for the sake of simplification.
where v stands for the visible units and h stands for the hidden units. This equation also connects
back to Equation 4, except that the Boltzmann Machine splits the energy function according
to hidden units and visible units.
Based on this energy function, the probability of a joint configuration over both the visible
units and the hidden units can be defined as follows:
$$p(v, h) = \frac{e^{-E(v,h)}}{\sum_{m,n} e^{-E(m,n)}}$$
The probability of the visible/hidden units can be obtained by marginalizing this joint probability.
For example, by marginalizing out the hidden units, we get the probability distribution
of the visible units:
$$p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{m,n} e^{-E(m,n)}}$$
function is usually performed to determine the parameters. For simplicity, the following
derivation is based on a single observation.
First, we have the log-likelihood function of the visible units as
$$l(v; w) = \log p(v; w) = \log \sum_h e^{-E(v,h)} - \log \sum_{m,n} e^{-E(m,n)}$$
Figure 11: Illustration of the Restricted Boltzmann Machine. With the restriction that there are
no connections between hidden units (shaded nodes) and no connections between
visible units (unshaded nodes), the Boltzmann Machine turns into a Restricted
Boltzmann Machine. The model is now a bipartite graph.
$$\frac{\partial l(v; w)}{\partial w} = -\langle s_i, s_j \rangle_{p_0} + \langle s_i, s_j \rangle_{p_\infty} \qquad (8)$$
here we use p_0 to denote the data distribution and p_∞ to denote the model distribution; other
notations remain unchanged. Therefore, the difficulty of the aforementioned methods in learning the
parameters is that they require potentially “infinitely” many sampling steps to approximate
the model distribution.
Hinton (2002) overcame this issue almost magically with the introduction of a method named
Contrastive Divergence. Empirically, he found that one does not have to perform “infinitely”
many sampling steps to converge to the model distribution; a finite number k of sampling steps is
enough. Therefore, Equation 8 is effectively re-written as:
$$\frac{\partial l(v; w)}{\partial w} = -\langle s_i, s_j \rangle_{p_0} + \langle s_i, s_j \rangle_{p_k}$$
Remarkably, Hinton (2002) showed that k = 1 is sufficient for the learning algorithm to
work well in practice.
Figure 12: Illustration of Deep Belief Networks. Deep Belief Networks are not simply stacked
RBMs: the bottom layers (all layers except the top one) do not have
bidirectional connections, only top-down connections.

Carreira-Perpinan and Hinton (2005) attempted to justify Contrastive Divergence in
theory, but their derivation led to a negative conclusion: Contrastive Divergence is a
biased algorithm, and a finite k cannot represent the model distribution. However, their
empirical results suggested that a finite k can approximate the model distribution well enough,
resulting in a small enough bias. In addition, the algorithm works well in practice, which
strengthened the case for Contrastive Divergence.
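To make the contrast between the idealized gradient in Equation 8 and the practical CD-k update concrete, below is a minimal, illustrative CD-1 sketch for a binary RBM written in numpy. The bias terms, the mini-batching, and the toy dimensions are our own simplifications, and we follow the standard convention in which the data-driven statistics enter the update with a positive sign.

import numpy as np

# Schematic CD-1 step for a binary RBM: positive statistics from the data,
# negative statistics from a single Gibbs step (k = 1).
rng = np.random.default_rng(0)
n_visible, n_hidden, eta = 6, 4, 0.1
W = 0.01 * rng.normal(size=(n_visible, n_hidden))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W):
    ph0 = sigmoid(v0 @ W)                       # hidden probabilities given data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T)                     # one Gibbs step: reconstruct visibles
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W)                       # hidden probabilities given reconstruction
    # positive (data) statistics minus negative (one-step reconstruction) statistics
    return np.outer(v0, ph0) - np.outer(v1, ph1)

v = rng.integers(0, 2, size=n_visible).astype(float)   # one binary observation
for _ in range(100):
    W += eta * cd1_step(v, W)                           # approximate likelihood ascent
print("learned weights:\n", np.round(W, 2))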
With its reasonable modeling power and a fast approximation algorithm, the RBM quickly
drew great attention and became one of the most fundamental building blocks of deep
neural networks. In the following two sections, we will introduce two distinguished deep
neural networks that are built upon the RBM/Boltzmann Machine, namely Deep Belief
Nets and the Deep Boltzmann Machine.
8. This paper is generally seen as the opening of the current Deep Learning era, as it first introduced the
possibility of training a deep neural network through layerwise training.
• Bengio et al. (2007) suggested that unsupervised pre-training initializes the model to
a point in parameter space which leads to a more effective optimization process, that
the optimization can find a lower minimum of the empirical cost function.
• Erhan et al. (2010) empirically argued for a regularization explanation: unsupervised
pretraining guides the learning towards basins of attraction of minima that
support better generalization from the training data set.
In addition to Deep Belief Networks, this pretraining mechanism also inspires the pre-
training for many other classical models, including the autoencoders (Poultney et al., 2006;
Bengio et al., 2007), Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009) and some
models inspired by these classical models like (Yu et al., 2010).
After pre-training is performed, fine-tuning is carried out to further optimize the network,
searching for the parameters that lead to a lower minimum. For Deep Belief Networks,
there are two different fine-tuning strategies, depending on the goals of the network.
Fine Tuning for Generative Model Fine-tuning for a generative model is achieved
with a contrastive version of wake-sleep algorithm (Hinton et al., 1995). This algorithm is
intriguing for the reason that it is designed to interpret how the brain works. Scientists have
found that sleeping is a critical process of brain function and it seems to be an inverse version
of how we learn when we are awake. The wake-sleep algorithm also has two phases. In the wake
phase, we propagate information bottom-up to adjust the top-down weights for reconstructing
the layer below. The sleep phase is the inverse of the wake phase: we propagate information
top-down to adjust the bottom-up weights for reconstructing the layer above.
The contrastive version of this wake-sleep algorithm adds a Contrastive
Divergence phase between the wake phase and the sleep phase. The wake phase only goes up to the
visible layer of the top RBM; then we sample the top RBM with Contrastive Divergence;
then a sleep phase starts from the visible layer of the top RBM.
Fine Tuning for Discriminative Model The strategy for fine-tuning a DBN as a
discriminative model is to simply apply standard backpropagation to the pre-trained model,
since we have labels for the data. However, pre-training is still necessary in spite of the generally
good performance of backpropagation.

Figure 13: Illustration of the Deep Boltzmann Machine. The Deep Boltzmann Machine is more like
stacking RBMs together: connections between every two layers are bidirectional.
$$E(v, h) = -\sum_i v_i b_i - \sum_{n=1}^{N}\sum_k h_{n,k} b_{n,k} - \sum_{i,k} v_i w_{ik} h_k - \sum_{n=1}^{N-1}\sum_{k,l} h_{n,k} w_{n,k,l} h_{n+1,l}$$
4.6.1 Deep Boltzmann Machine (DBM) v.s. Deep Belief Networks (DBN)
As their acronyms suggest, the Deep Boltzmann Machine and Deep Belief Networks have many
similarities, especially at first glance. Both of them are deep neural networks that originate
from the idea of the Restricted Boltzmann Machine. (The name “Deep Belief Network”
seems to indicate that it also partially originates from the Bayesian Network (Krieg, 2001).)
Both of them also rely on layerwise pre-training for successful parameter learning.
However, the fundamental differences between these two models are dramatic, introduced
by how the connections are made between the bottom layers (undirected/bidirectional
vs. directed). The bidirectional structure of the DBM grants it the possibility of learning
more complex patterns of data. It also allows the approximate inference
procedure to incorporate top-down feedback in addition to an initial bottom-up pass, allowing
Deep Boltzmann Machines to better propagate uncertainty about ambiguous inputs.
• The retina converts the light energy that comes from rays bouncing off an object
into chemical energy. This chemical energy is then converted into action potentials
that are transferred to the primary visual cortex. (In fact, there are several other
brain structures involved between the retina and V1, but we omit these structures for
simplicity9.)
9. We deliberately discuss the components that have connections to established techniques in convolutional
neural networks; readers interested in developing more powerful models are encouraged to
investigate the other components.
Figure 14: A brief illustration of ventral stream of the visual cortex in human vision system.
It consists of primary visual cortex (V1), visual areas (V2 and V4) and inferior
temporal gyrus.
• The primary visual cortex (V1) mainly fulfills the task of edge detection, where an edge
is an area with the strongest local contrast in the visual signals.
• V2, also known as the secondary visual cortex, is the first region within the visual
association area. It receives strong feedforward connections from V1 and sends strong
connections to later areas. In V2, cells are tuned to extract mainly simple properties
of the visual signals, such as orientation, spatial frequency, and colour, as well as a few more
complex properties.
• The inferior temporal gyrus (IT) is responsible for identifying an object based on its color
and form and comparing that processed information to stored memories
of objects to identify it (Kolb et al., 2014). In other words, IT performs
semantic-level tasks, like face recognition.
Many of the descriptions of the functions of the visual cortex should revive a recollection
of convolutional neural networks for readers who have been exposed to the relevant
technical literature. Later in this section, we will discuss more details about convolutional
neural networks, which will help build explicit connections. Even for readers who barely
have knowledge of convolutional neural networks, this hierarchical structure of the visual cortex
should immediately ring a bell about neural networks.
Besides convolutional neural networks, the visual cortex has been inspiring work in
computer vision for a long time. For example, Li (1998) built a neural model inspired
by the primary visual cortex (V1). At another granularity, Serre et al. (2005) introduced
a system with feature detectors inspired by the visual cortex. De Ladurantaye et al.
(2012) published a book describing models of information processing in the visual cortex.
Poggio and Serre (2013) conducted a more comprehensive survey on the relevant topic, but
they did not focus on any particular subject in detail in their survey. In this section, we
discuss the connections between the visual cortex and convolutional neural networks in detail.
We will begin with the Neocognitron, which borrows some ideas from the visual cortex and later
inspired the convolutional neural network.
The Neocognitron, proposed by Fukushima (1980), is generally seen as the model that inspired
Convolutional Neural Networks on the computational side. It is a neural network that consists
of two different kinds of layers (S-layers as feature extractors and C-layers as structured
connections that organize the extracted features).
An S-layer consists of a number of S-cells that are inspired by the cells in the primary visual
cortex. It serves as a feature extractor. Each S-cell can ideally be trained to be responsive
to a particular feature presented in its receptive field. Generally, local features such as edges
in particular orientations are extracted in lower layers, while global features are extracted
in higher layers. This structure highly resembles how humans conceive objects. A C-layer
resembles the complex cells in the higher pathway of the visual cortex. It is mainly introduced for
the shift-invariance of the features extracted by the S-layer.
During the parameter learning process, only the parameters of the S-layers are updated. The Neocognitron
can also be trained unsupervisedly to obtain a good feature extractor out of its S-layers. The
training process for the S-layer is very similar to the Hebbian Learning rule: it strengthens the
connections between the S-layer and the C-layer for whichever S-cell shows the strongest response.
This training mechanism also introduces the problem the Hebbian Learning rule introduces,
namely that the strength of the connections saturates (since it keeps increasing). A solution was
also introduced by Fukushima (1980) under the name “inhibitory
cell”, which performs a normalization to avoid this problem.
Now we proceed from the Neocognitron to the Convolutional Neural Network. First, we will introduce
the building components: the convolutional layer and the subsampling layer. Then we
assemble these components to present the Convolutional Neural Network, using LeNet as an
example.
and denoted as h = f ∗ g.
Convolutional neural networks typically work with the two-dimensional convolution operation,
which is summarized in Figure 15.
As shown in Figure 15, the leftmost matrix is the input matrix. The middle one is
usually called the kernel matrix. Convolution is applied to these matrices and the result
is shown as the rightmost matrix. The convolution process is an element-wise product
followed by a sum, as shown in the example. When the upper-left 3 × 3 sub-matrix is convolved
with the kernel, the result is 29. Then we slide the target 3 × 3 sub-matrix one column to the right,
convolve it with the kernel and get the result 12. We keep sliding and record the results as
a matrix. Because the kernel is 3 × 3, every target sub-matrix is 3 × 3; thus, every 3 × 3 sub-matrix
is convolved into one number and the whole 5 × 5 matrix shrinks into a 3 × 3 matrix (because
5 − (3 − 1) = 3, where the 3 in (3 − 1) is the size of the kernel matrix).
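The sliding-window computation described above can be written down directly, as in the following sketch; the input and kernel values are arbitrary illustrations and not the ones shown in Figure 15.

import numpy as np

# Direct implementation of the sliding-window convolution (cross-correlation
# in deep-learning usage): element-wise product of each window with the
# kernel, followed by a sum.
def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1   # 5 - (3 - 1) = 3
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.integers(0, 10, size=(5, 5)).astype(float)   # a 5x5 input matrix
kernel = np.array([[1., 0., -1.],                         # a 3x3 kernel
                   [1., 0., -1.],
                   [1., 0., -1.]])
print(conv2d(image, kernel))                              # a 3x3 output matrix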
One should realize that convolution is locally shift invariant, which means that for many
different combinations of how the nine numbers in the upper 3 × 3 sub-matrix are placed, the
convolved result will still be 29. This invariance plays a critical role in vision problems
because, in an ideal case, the recognition result should not change due to shifts or
rotations of features. This property used to be handled elegantly by Lowe (1999) and
Bay et al. (2006), but the convolutional neural network brought the performance up to a new
level.
Figure 16: Convolutional kernel examples. Different kernels applied to the same image
result in differently processed images. Note that a factor of 1/9 is applied to
these kernels.
Figure 17: An illustration of LeNet, where Conv stands for convolutional layer and Sampling stands for subsampling layer.
many well-trained popular models can usually perform well in other tasks with only a limited
fine-tuning process: the kernels have been well trained and are universally applicable.
With an understanding of the essential role the convolution operation plays in vision tasks,
we proceed to investigate some major milestones along the way.
f (x) = max(0, x)
which is a transform that removes the negative part of the input, resulting in a clearer
contrast of meaningful features as opposed to the other side products the kernel produces.
Therefore, this non-linearity grants the convolution more power in extracting useful
features and allows it to simulate the functions of the visual cortex more closely.
5.4.3 LeNet
With the two most important components introduced, we can stack them together to assemble
a convolutional neural network. Following the recipe of Figure 17, we end up
with the famous LeNet.
LeNet is known for its ability to classify digits and can handle a variety of different
problems with digits, including variance in position and scale, rotation and squeezing of digits,
and even different stroke widths. Meanwhile, with the introduction of LeNet,
LeCun et al. (1998b) also introduced the MNIST database, which later became the standard
benchmark in the digit recognition field.
5.5.1 AlexNet
While LeNet is the model that started the era of convolutional neural networks, AlexNet,
invented by Krizhevsky et al. (2012), is the one that started the era of CNNs used for ImageNet
classification. AlexNet is the first evidence that CNNs can perform well on the historically
difficult ImageNet dataset, and it performed so well that it led the community into a competition
of developing CNNs.
The success of AlexNet is not only due to its unique architecture design but also
due to clever training mechanisms. To handle the computationally expensive training
process, AlexNet is split into two streams and trained on two GPUs. It also used
data augmentation techniques consisting of image translations, horizontal reflections, and
patch extractions.
The recipe of AlexNet is shown in Figure 18. However, hardly any lessons can be
learned from the architecture of AlexNet despite its remarkable performance. Even more
unfortunately, the fact that this particular architecture of AlexNet does not have well-grounded
theoretical support has pushed many researchers to blindly burn computing resources
to test new architectures. Many models have been introduced during this period, but
only a few may be worth mentioning in the future.
5.5.2 VGG
In the blind competition of exploring different architectures, Simonyan and Zisserman (2014)
showed that simplicity is a promising direction with a model named VGG. Although VGG
is deeper (19 layers) than other models around that time, its architecture is extremely
simple: all the layers are 3 × 3 convolutional layers followed by 2 × 2 pooling layers. This
simple usage of convolutional layers simulates a larger filter while keeping the benefits of
smaller filter sizes, because the combination of two 3 × 3 convolutional layers has an effective
receptive field of a 5 × 5 convolutional layer, but with fewer parameters.
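The parameter argument can be verified with a short back-of-the-envelope computation, sketched below under the simplifying assumption of C input and C output channels and no biases.

# Parameter count: two stacked 3x3 layers cover the same 5x5 receptive field
# as a single 5x5 layer but use fewer weights.
def conv_params(kernel_size, in_channels, out_channels):
    return kernel_size * kernel_size * in_channels * out_channels

C = 64
two_3x3 = 2 * conv_params(3, C, C)      # 2 * 9 * C^2 = 73,728 for C = 64
one_5x5 = conv_params(5, C, C)          # 25 * C^2    = 102,400 for C = 64
print(two_3x3, "<", one_5x5, "->", two_3x3 < one_5x5)

# receptive field of n stacked 3x3 layers (stride 1): 2n + 1
print([2 * n + 1 for n in (1, 2, 3)])   # [3, 5, 7]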
The spatial size of the input volumes at each layer will decrease as a result of the
convolutional and pooling layers, but the depth of the volumes increases because of the
increased number of filters (in VGG, the number of filters doubles after each pooling layer).
This behavior reinforces the idea of VGG to shrink spatial dimensions, but grow depth.
VGG was not the winner of the ImageNet competition that year (the winner was
GoogLeNet, invented by Szegedy et al. (2015)). GoogLeNet introduced several important
concepts, like the Inception module, and concepts later used by R-CNN (Girshick et al., 2014;
Girshick, 2015; Ren et al., 2015), but its arbitrary/creative architecture design barely
contributed more to the community than VGG did, especially considering that Residual
Net, following the path of VGG, won the ImageNet challenge at an unprecedented level.
increasing the number of layers will only result in worse results, for both training cases and
testing cases (He et al., 2015).
The breakthrough that ResNet introduces, which allows ResNet to be substantially deeper
than previous networks, is called the Residual Block. The idea behind a Residual Block is
that some input of a certain layer (denoted as x) can be passed to the component two
layers later, either following the traditional path through a succession of convolutional layers and
ReLU transforms (we denote the result as f(x)), or through an express way
that passes x there directly. As a result, the input to the component two layers later is
f(x) + x instead of the usual f(x). The idea of the Residual Block is illustrated
in Figure 19.
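The following toy numpy sketch captures the essence of the Residual Block; fully connected layers stand in for the convolutional layers, and the weights are arbitrary, so it is an illustration of the f(x) + x idea rather than a faithful ResNet layer.

import numpy as np

# A residual block: output = relu(f(x) + x), where f is the usual
# transform path and x is carried along an identity shortcut.
def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    f_x = W2 @ relu(W1 @ x)     # the traditional two-layer transform f(x)
    return relu(f_x + x)        # identity shortcut: add the input back

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W1, W2 = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
print(residual_block(x, W1, W2))

# if f collapses to zero (W2 = 0), the block still passes relu(x) through,
# which is what makes very deep stacks easier to optimize
print(np.allclose(residual_block(x, W1, np.zeros((d, d))), relu(x)))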
In a complementary work, He et al. (2016) validated that residual blocks are essential
for propagating information smoothly and therefore simplify optimization. They also
extended ResNet to a 1000-layer version with success on the CIFAR data set.
Another interesting perspective on ResNet is provided by Veit et al. (2016). They
showed that ResNet behaves like an ensemble of shallow networks: the express ways
introduced allow ResNet to perform as a collection of independent networks, each of which
is significantly shallower than the integrated ResNet itself. This also explains why gradients
can be passed through the ultra-deep architecture without vanishing. (We will talk
more about the vanishing gradient problem when we discuss recurrent neural networks in the
next section.) Another work, which is not directly relevant to ResNet but may help in
understanding it, was conducted by Hariharan et al. (2015). They showed that features from
lower layers are informative in addition to what can be summarized from the final layer.
ResNet is still not completely free of ad hoc design choices. The number of layers in the
whole network and the number of layers that the Residual Block allows the identity to bypass are
still choices that require experimental validation. Nonetheless, to some extent, ResNet
has shown that critical reasoning can help the development of CNNs better than blind
experimental trials. In addition, the idea of the Residual Block has been found in the actual
visual cortex (in the ventral stream of the visual cortex, V4 can directly accept signals from
the primary visual cortex), although ResNet was not designed according to this in the first place.
With the introduction of these state-of-the-art neural models that are successful in these
challenges, Canziani et al. (2016) conducted a comprehensive experimental study comparing
them. Upon comparison, they showed that there is still room for improvement in
fully connected layers, which show strong inefficiencies for smaller batches of images.
Figure 20: Illustrations of some mistakes of neural networks. (a)-(d) (from (Szegedy et al.,
2013)) are adversarial images that are generated based on original images. The
differences between these and the original ones are unobservable by the naked eye,
but the neural network can successfully classify the original ones yet fails on the adversarial
ones. (e)-(h) (from (Nguyen et al., 2015)) are generated patterns. A
neural network classifies them as (e) school bus, (f) guitar, (g) peacock and (h)
Pekinese respectively.
misclassification with high confidence. The null space works like a blind spot of a matrix:
changes within the null space are never perceptible to the corresponding matrix.
This blind spot should not discourage the promising future of neural networks. On the
contrary, it makes the convolutional neural network resemble the human vision system at a
deeper level. In the human vision system, blind spots (Gregory and Cavanagh, 2011) also
exist (Wandell, 1995). Interesting future work might link the flaws of the human
vision system to the defects of neural networks and help to overcome these defects.
43
Wang and Raj
Figure 21: Some failed images of ImageNet classification by ResNet and the primary label
associated with the image.
To further improve the performance ResNet reached, one direction might be to model
the annotators' labeling preferences. One assumption could be that annotators prefer to label
an image in a way that makes it distinguishable. Some established work on modeling human factors
(Wilson et al., 2015) could be helpful.
However, the more important question is whether it is worth optimizing the model
to increase the testing results on the ImageNet dataset, since the remaining misclassifications may
not be a result of the incompetency of the model, but of problems with the annotations.
The introduction of other data sets, like COCO (Lin et al., 2014), Flickr (Plummer et al.,
2015), and VisualGenome (Krishna et al., 2016) may open a new era of vision problems with
more competitive challenges. However, the fundamental problems and experiences that this
section introduces should never be forgotten.
Figure 22: The difference in recurrent structure between the Jordan Network and the Elman Network.
Figure 23: The unfolded structure of a BRNN. The temporal order is from left to right.
Hidden layer 1 is unfolded in the standard way of an RNN. Hidden layer 2 is
unfolded to simulate the reverse connection.
The Bidirectional Recurrent Neural Network (BRNN) was invented by Schuster and Paliwal
(1997) with the goal of introducing a structure that unfolds into a bidirectional neural
network. When it is applied to time series data, information can therefore not only be
passed forward following the natural temporal order, but information from the future can also
flow backward to provide knowledge to previous time steps.
Figure 23 shows the unfolded structure of a BRNN. Hidden layer 1 is unfolded in the
standard way of an RNN. Hidden layer 2 is unfolded to simulate the reverse connection.
Transparency (in Figure 23) is applied to emphasize that unfolding an RNN is only a concept
used for illustration purposes. The actual model handles data from different time
steps with the same single model.
BRNN is formulated as follows:
where the subscripts 1 and 2 denote the variables associated with hidden layers 1 and 2,
respectively.
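To make this concrete, one common BRNN formulation consistent with this notation (a sketch; the exact parameterization may differ) is:

$h_1^t = \sigma(W_1 x^t + U_1 h_1^{t-1})$
$h_2^t = \sigma(W_2 x^t + U_2 h_2^{t+1})$
$y^t = \mathrm{softmax}(V_1 h_1^t + V_2 h_2^t)$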
With the introduction of “recurrent” connections back from the future, Backpropagation
Through Time is no longer directly feasible. The solution is to treat this model as a
combination of two RNNs, a standard one and a reverse one, and apply BPTT to each
of them. Weights are updated simultaneously once the two gradients are computed.
that is designed to overcome the vanishing gradient problem with the help of a specially designed
memory cell. Nowadays, “LSTM” is widely used to denote any recurrent network
with that memory cell, which is now referred to as an LSTM cell.
LSTM was introduced to overcome the problem that RNNs cannot capture long-term dependencies
(Bengio et al., 1994). Overcoming this issue requires the specially designed memory
cell, as illustrated in Figure 24 (a).
LSTM consists of several critical components:
• states: values that are used to offer the information for output.
• gates: values that are used to decide the information flow of states.
  – input gate: it decides whether the input state enters the internal state. It is denoted as
    g, and we have:
    $g^t = \sigma(W_{gi} i^t)$ (10)
  – forget gate: it decides whether the internal state forgets the previous internal state.
    It is denoted as f, and we have:
    $f^t = \sigma(W_{fi} i^t)$ (11)
  – output gate: it decides whether the internal state passes its value to the output and
    the hidden state of the next time step. It is denoted as o, and we have:
    $o^t = \sigma(W_{oi} i^t)$ (12)
Finally, considering how the gates decide the information flow of states, we have the last two
equations to complete the formulation of LSTM:
$m^t = g^t \, i^t + f^t \, m^{t-1}$ (13)
$h^t = o^t \, m^t$ (14)
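The sketch below walks through one LSTM step implementing Equations 10–14, with the gating applied as elementwise products. The way the input state $i^t$ is formed from the input data and the previous hidden state (a tanh of a linear map of their concatenation) is an assumption made only for illustration, as are all weight names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, m_prev, W_gi, W_fi, W_oi, W_in):
    # Input data and previous hidden state form the input state (Figure 24 (b)).
    # The exact construction of i_t here is an assumption of this sketch.
    i_t = np.tanh(W_in @ np.concatenate([x_t, h_prev]))
    g_t = sigmoid(W_gi @ i_t)        # input gate,  Equation 10
    f_t = sigmoid(W_fi @ i_t)        # forget gate, Equation 11
    o_t = sigmoid(W_oi @ i_t)        # output gate, Equation 12
    m_t = g_t * i_t + f_t * m_prev   # internal state update, Equation 13
    h_t = o_t * m_t                  # output / next hidden state, Equation 14
    return h_t, m_t

# Usage with toy dimensions (hidden size equals input-state size for simplicity).
d_x, d = 4, 8
x_t, h_prev, m_prev = np.random.randn(d_x), np.zeros(d), np.zeros(d)
W_in = np.random.randn(d, d_x + d)
W_gi, W_fi, W_oi = (np.random.randn(d, d) for _ in range(3))
h_t, m_t = lstm_step(x_t, h_prev, m_prev, W_gi, W_fi, W_oi, W_in)
```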
Figure 24: The LSTM cell. (a) The LSTM “memory” cell; (b) input data and the previous hidden state form the input state; (c) calculating the input gate and forget gate; (d) calculating the output gate; (e) updating the internal state; (f) output and updating the hidden state.
Figure 24 (c) shows how the input gate and forget gate are computed, as described in Equation 10 and Equation 11.
Figure 24 (d) shows how the output gate is computed, as described in Equation 12. Figure 24
(e) shows how the internal state is updated, as described in Equation 13. Figure 24 (f) shows
how the output and hidden state are updated, as described in Equation 14.
All the weights are parameters that need to be learned during training. Therefore,
theoretically, LSTM can learn to memorize long-term dependencies when necessary and can
learn to forget the past when necessary, making it a powerful model.
With this important theoretical guarantee, many works have attempted to improve
LSTM. For example, Gers and Schmidhuber (2000) added a peephole connection that allows
the gates to use information from the internal state. Cho et al. (2014) introduced the Gated
Recurrent Unit, known as GRU, which simplified LSTM by merging the internal state and
hidden state into one state, and merging the forget gate and input gate into a single update
gate. Integrating the LSTM cell into a bidirectional RNN is also an intuitive follow-up to look
into (Graves et al., 2013).
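For reference, the standard GRU formulation from Cho et al. (2014) can be sketched as follows; the symbols here ($x^t$ for the input, $h^t$ for the merged state, $z^t$ for the update gate and $r^t$ for the reset gate) are our own notation for this sketch:

$z^t = \sigma(W_z x^t + U_z h^{t-1})$
$r^t = \sigma(W_r x^t + U_r h^{t-1})$
$\tilde{h}^t = \tanh(W x^t + U(r^t \odot h^{t-1}))$
$h^t = (1 - z^t) \odot h^{t-1} + z^t \odot \tilde{h}^t$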
Interestingly, despite the novel LSTM variants proposed now and then, Greff et al.
(2015) conducted a large-scale experiment investigating the performance of LSTMs and reached
the conclusion that none of the variants improves upon the standard LSTM architecture
significantly. Probably, the improvement of LSTM lies in another direction rather than in
updating the structure inside a cell. Attention models seem to be such a direction.
Attention Models are loosely based on a bionic design that simulates the human visual
attention mechanism: when humans look at an image, we do not scan it bit by bit
or stare at the whole image, but we focus on some major part of it and gradually build the
context after capturing the gist. Attention mechanisms were first discussed by Larochelle
and Hinton (2010) and Denil et al. (2012). Attention models nowadays mostly refer to the models
that were introduced in (Bahdanau et al., 2014) for machine translation and were soon applied to
many different domains, like (Chorowski et al., 2015) for speech recognition and (Xu et al.,
2015) for image caption generation.
Attention models are mostly used for sequence output prediction. Instead of seeing the
whole sequential data and making one single prediction (as in, for example, a language model), the
model needs to make a sequential prediction over the sequential input for tasks like machine
translation or image caption generation. Therefore, the attention model is mostly used to
answer the question of where to pay attention based on previously predicted labels or
hidden states.
The output sequence does not have to be linked one-to-one to the input sequence, and the
input data may not even be a sequence. Therefore, an encoder-decoder framework
(Cho et al., 2015) is usually necessary. The encoder is used to encode the data into representations
and the decoder is used to make sequential predictions. The attention mechanism is used to locate
a region of the representation for predicting the label at the current time step.
Figure 25 shows a basic attention model under the encoder-decoder network structure. The
representation the encoder produces is all accessible to the attention model, and the attention model
only selects some regions to pass onto the LSTM cell for making predictions.
Figure 25: The unfolded structure of an attention model. Transparency is used to show
that unfolding is only conceptual. The representations the encoder learns are all
available to the decoder across all time steps. The attention module only selects
some to pass onto the LSTM cell for prediction.
Therefore, all the magic of attention models is about how the attention module in
Figure 25 helps to localize the informative representations.
To formalize how it works, we use r to denote the encoded representation (there is a
total of M regions of representation) and h to denote the hidden state of the LSTM cell. Then,
the attention module can generate the unscaled weight for the ith region of the encoded
representation as:
$\beta_i^t = f(h^{t-1}, r, \{\alpha_j^{t-1}\}_{j=1}^{M})$
where $\alpha_j^{t-1}$ are the attention weights computed at the previous time step, which can be
computed at the current time step as a simple softmax function:
$\alpha_i^t = \frac{\exp(\beta_i^t)}{\sum_{j=1}^{M} \exp(\beta_j^t)}$
Therefore, we can further use the weights α to reweight the representation r for prediction.
There are two ways for the representation to be reweighted:
• Soft attention: The result is a simple weighted sum of the context vectors, such that:
$r^t = \sum_{j=1}^{M} \alpha_j^t c_j$
• Hard attention: The model is forced to make a hard decision by localizing only one
region: sampling one region according to a multinoulli distribution.
Figure 26: (a) Deep input architecture; (b) deep recurrent architecture; (c) deep output architecture.
One problem with hard attention is that sampling from a multinoulli distribution is
not differentiable, so gradient based methods can hardly be applied directly. Variational
methods (Ba et al., 2014) or policy gradient based methods (Sutton et al., 1999) can be
considered instead.
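A minimal sketch of the soft attention computation above is given below. The scoring function that produces the unscaled weights β from the previous hidden state and a region is assumed to be provided (in practice it is often a small feedforward network), and all names here are illustrative:

```python
import numpy as np

def soft_attention(h_prev, regions, score):
    # regions: array of shape (M, d), one encoded representation per region
    # score:   callable returning the unscaled weight beta_i for (h_prev, r_i)
    betas = np.array([score(h_prev, r_i) for r_i in regions])
    alphas = np.exp(betas - betas.max())
    alphas /= alphas.sum()              # softmax over the M regions
    context = alphas @ regions          # weighted sum: r^t = sum_j alpha_j c_j
    return context, alphas

# Usage with a simple dot-product score (an assumption of this sketch).
M, d = 5, 16
h_prev, regions = np.random.randn(d), np.random.randn(M, d)
context, alphas = soft_attention(h_prev, regions, score=lambda h, r: float(h @ r))
```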
In this very last section of the evolutionary path of the RNN family, we will visit some ideas
that have not been fully explored.
Although the recurrent neural network suffers from many of the issues that deep neural networks have
because of its recurrent connections, current RNNs are still not deep models in terms of
representation learning compared to models in other families.
Pascanu et al. (2013a) formalized the idea of constructing deep RNNs by extending
current RNNs. Figure 26 shows three different directions to construct a deep recurrent
neural network by increasing the layers of the input component (Figure 26 (a)), the recurrent
component (Figure 26 (b)) and the output component (Figure 26 (c)), respectively.
$\theta^{t+1} = \theta^t - \eta \nabla_\theta^t$
7.1.1 Rprop
Rprop was introduced by Riedmiller and Braun (1993). It is a unique method, still studied
today, as it does not fully utilize the gradient information but only considers the
sign of the gradient. In other words, it updates the parameters following:
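The rule can be written as follows (a standard statement of the Rprop rule; the per-parameter step size $\Delta_i$ and the factors $\eta^+$ and $\eta^-$ are part of that standard formulation rather than of the text above):

$\theta_i^{t+1} = \theta_i^t - \mathrm{sign}(\nabla_{\theta_i}^t)\,\Delta_i^t$

where the step size $\Delta_i^t$ is increased by a factor $\eta^+ > 1$ when the gradient keeps its sign between consecutive steps, and decreased by a factor $\eta^- < 1$ when the sign flips.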
7.1.2 AdaGrad
AdaGrad was introduced by Duchi et al. (2011). It follows the idea of introducing an
adaptive learning rate mechanism that assigns a higher learning rate to the parameters that
have been updated more mildly and a lower learning rate to the parameters that have
been updated more dramatically. The measure of the degree of the update applied is the ℓ2
norm of the historical gradients, $S^t = \|\nabla_\theta^1, \nabla_\theta^2, \ldots, \nabla_\theta^t\|_2$, therefore we have the update rule:
$\theta^{t+1} = \theta^t - \frac{\eta}{S^t + \epsilon} \nabla_\theta^t$
where $\epsilon$ is a small term to avoid η being divided by zero.
AdaGrad has been shown to greatly improve the robustness of traditional gradient
methods (Dean et al., 2012). However, the problem is that as the ℓ2 norm accumulates,
the fraction of η over this norm decays to a vanishingly small term.
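In code, the per-parameter update can be sketched as below; the hyperparameter defaults, the name of the accumulator and the placement of the small constant ε are conventions assumed for this sketch:

```python
import numpy as np

def adagrad_update(theta, grad, hist_sq, eta=0.01, eps=1e-8):
    # hist_sq accumulates the squared gradients per parameter, so that
    # np.sqrt(hist_sq) is the l2 norm S^t of the gradient history.
    hist_sq += grad ** 2
    theta -= eta / (np.sqrt(hist_sq) + eps) * grad
    return theta, hist_sq
```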
7.1.3 AdaDelta
AdaDelta is an extension of AdaGrad that aims to reduce the decay rate of the learning
rate, proposed in (Zeiler, 2012). Instead of accumulating the gradients of every time step as
in AdaGrad, AdaDelta re-weights the previous accumulation before adding the current term onto
the previously accumulated result, resulting in:
$(S^t)^2 = \beta (S^{t-1})^2 + (1 - \beta)(\nabla_\theta^t)^2$
where β is the weight for re-weighting. Then the update rule is the same as AdaGrad:
$\theta^{t+1} = \theta^t - \frac{\eta}{S^t + \epsilon} \nabla_\theta^t$
which is almost the same as another famous gradient variant named RMSprop.10
7.1.4 Adam
Adam stands for Adaptive Moment Estimation, proposed in (Kingma and Ba, 2014). Adam
is like a combination of the momentum method and the AdaGrad method, but each component
is re-weighted at time step t. Formally, at time step t, we have:
$\Delta_\theta^t = \alpha \Delta_\theta^{t-1} + (1 - \alpha)\nabla_\theta^t$
$(S^t)^2 = \beta (S^{t-1})^2 + (1 - \beta)(\nabla_\theta^t)^2$
$\theta^{t+1} = \theta^t - \frac{\eta}{S^t + \epsilon} \Delta_\theta^t$
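One Adam step can be sketched as follows. Note that the sketch follows the published algorithm of Kingma and Ba (2014), which additionally applies a bias correction to both moments; the default hyperparameter values and variable names are assumptions of this sketch:

```python
import numpy as np

def adam_update(theta, grad, delta, s_sq, t, eta=1e-3, alpha=0.9, beta=0.999, eps=1e-8):
    # delta: re-weighted (momentum-like) gradient; s_sq: re-weighted squared gradient.
    delta = alpha * delta + (1 - alpha) * grad
    s_sq = beta * s_sq + (1 - beta) * grad ** 2
    # Bias correction as in Kingma and Ba (2014); t counts steps starting from 1.
    delta_hat = delta / (1 - alpha ** t)
    s_hat = s_sq / (1 - beta ** t)
    theta = theta - eta / (np.sqrt(s_hat) + eps) * delta_hat
    return theta, delta, s_sq
```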
All these modern gradient variants have been published with the promising claim that they
help to improve the convergence rate of previous methods. Empirically, these methods
do seem helpful; however, in many cases a careful choice among them seems
only to benefit to a limited extent.
7.2 Dropout
Dropout was introduced in (Hinton et al., 2012; Srivastava et al., 2014). The technique soon
became influential, not only because of its good performance but also because of its simplicity
of implementation. The idea is very simple: randomly drop out some of the units during
training. More formally: on each training case, each hidden unit is randomly omitted from
the network with probability p.
As suggested by Hinton et al. (2012), Dropout can be seen as an efficient way to perform
model averaging across a large number of different neural networks, where overfitting can
be avoided at much less computational cost.
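A minimal sketch of this training-time masking is shown below; the “inverted” scaling by 1/(1 − p) is one common convention (the original proposal instead rescales the weights at test time), so treat it as an assumption of the sketch:

```python
import numpy as np

def dropout(h, p, training=True, rng=np.random):
    # On each training case, omit each hidden unit with probability p.
    if not training:
        return h
    mask = (rng.rand(*h.shape) >= p).astype(h.dtype)
    # "Inverted" scaling keeps the expected activation unchanged during training.
    return h * mask / (1.0 - p)
```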
10. It seems this method was never formally published; the resources all trace back to Hinton's slides at
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Because of the performance gains it brings, Dropout became very popular soon after its
introduction, and a lot of work has attempted to understand its mechanism from different
perspectives, including (Baldi and Sadowski, 2013; Cho, 2013; Ma et al., 2016). It has also
been applied to training other models, like SVMs (Chen et al., 2014).
where µB and σB denote the mean and variance of that batch, µL and σL are two parameters
learned by the algorithm to rescale and shift the output, and xi and yi are the inputs and outputs
of that function, respectively.
These steps are performed for every batch during training. Batch Normalization turned
out to work very well in training empirically and soon became popular.
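For reference, the standard per-batch transform of Ioffe and Szegedy (2015) can be written in the notation above as follows; here $\sigma_L$ is read as the learned scale and $\mu_L$ as the learned shift, with $\epsilon$ a small constant for numerical stability, and this reading of the notation is an assumption:

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \sigma_L \hat{x}_i + \mu_L$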
As a follow-up, Ba et al. (2016) proposed the technique of Layer Normalization, in which
they “transpose” batch normalization into layer normalization by computing the mean and
variance used for normalization from all of the summed inputs to the neurons in a layer on a
single training case. Therefore, this technique has a natural advantage of being straightforwardly
applicable to recurrent neural networks. However, it seems that this “transposed batch
normalization” cannot be implemented as simply as Batch Normalization, and it has
not become as influential as Batch Normalization.
Two remarks need to be made before we proceed: 1) Obviously, most of these methods
can be traced back to counterparts in the non-parametric machine learning field, but because
most of them did not perform well enough to make an impact, focusing the discussion on
their evolutionary path may mislead readers. Instead, we only list these methods for
readers who seek inspiration. 2) Many of these methods are not exclusively optimization
techniques, because they are usually proposed together with a particularly designed architecture.
Technically speaking, these methods should be distributed to previous sections
according to the models they are associated with. However, because these methods can barely inspire
modern modeling research, but may have a chance to inspire modern optimization research,
we list them in this section.
One of the earliest and most important works on this topic was proposed by Fahlman and
Lebiere (1989). They introduced a model, as well as its corresponding algorithm, named
Cascade-Correlation Learning. The idea is that the algorithm starts with a minimal network
and builds up towards a bigger network. Whenever another hidden unit is added,
the parameters of the previous hidden units are fixed, and the algorithm only searches for
optimal parameters for the newly-added hidden unit.
Interestingly, the unique architecture of Cascade-Correlation Learning allows the network
to grow deeper and wider at the same time, because every newly added hidden unit
takes the data together with the outputs of previously added units as input.
Two important questions for this algorithm are 1) when to fix the parameters of the current
hidden units and proceed to add and tune a newly added one, and 2) when to terminate the
entire algorithm. These two questions are answered in a similar manner: the algorithm
adds a new hidden unit when there are no significant changes in the existing architecture and
terminates when the overall performance is satisfying. This training process may introduce
problems of overfitting, which might account for the fact that this method is rarely seen in
modern deep learning research.
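The overall control flow can be sketched as follows for a simple regression setting; the candidate-unit objective here is a simplified covariance-maximization step and the least-squares output fit is a convenience, so the details are assumptions rather than the exact procedure of Fahlman and Lebiere (1989):

```python
import numpy as np

def fit_output(H, y, l2=1e-3):
    # Least-squares fit of the output weights on the current features H (plus bias).
    Hb = np.hstack([H, np.ones((H.shape[0], 1))])
    w = np.linalg.solve(Hb.T @ Hb + l2 * np.eye(Hb.shape[1]), Hb.T @ y)
    return w, Hb @ w

def train_candidate(H, residual, steps=500, lr=0.1, rng=np.random):
    # Train one candidate unit to (approximately) maximize the covariance between
    # its output and the current residual error, then freeze it.
    w = rng.randn(H.shape[1]) * 0.1
    r_c = residual - residual.mean()
    for _ in range(steps):
        a = np.tanh(H @ w)
        cov = (a - a.mean()) @ r_c
        w += lr * np.sign(cov) * (H.T @ ((1 - a ** 2) * r_c)) / len(a)
    return np.tanh(H @ w)

def cascade_correlation(X, y, max_units=5, tol=1e-3):
    H = X.copy()
    w_out, pred = fit_output(H, y)
    for _ in range(max_units):
        if np.mean((y - pred) ** 2) < tol:
            break
        # The new unit sees the data together with outputs of previously added units.
        new_unit = train_candidate(H, y - pred)
        H = np.hstack([H, new_unit[:, None]])
        w_out, pred = fit_output(H, y)   # only the output weights are retrained
    return w_out, H
```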
Mézard and Nadal (1989) presented the Tiling Algorithm, which learns the parameters,
the number of layers, and the number of hidden units in each layer simultaneously
for feedforward neural networks on Boolean functions. Later this algorithm was extended to
a multi-class version by Parekh et al. (1997).
The algorithm works in such a way that, at every layer, it tries to build a layer of hidden
units that clusters the data into groups where each group contains only one label. The
algorithm keeps increasing the number of hidden units until such a clustering
pattern is achieved, and then proceeds to add another layer.
Mézard and Nadal (1989) also offered a proof of theoretical guarantees for the Tiling Algorithm.
Basically, the theorem says that the Tiling Algorithm can greedily improve the performance
of a neural network.
8. Conclusion
In this paper, we have revisited the evolutionary paths of today's deep learning models.
We revisited the paths of three major families of deep learning models: the deep generative
model family, the convolutional neural network family, and the recurrent neural network family, as
well as some topics on optimization techniques.
This paper serves two goals: 1) First, it documents the major milestones in the history of
science that have impacted the current development of deep learning; these milestones
are not limited to developments in computer science. 2) More importantly,
by revisiting the evolutionary path of the major milestones, this paper should suggest to
the readers how these remarkable works were developed among thousands of other
contemporaneous publications. Here we briefly summarize three directions that many of
these milestones pursue:
• Occam’s razor: While it seems that part of the community tends to favor more complex
models, layering one architecture onto another and hoping backpropagation can
find the optimal parameters, history says that masterminds tend to think simple:
Dropout is widely recognized not only because of its performance, but more because
of its simplicity of implementation and intuitive (if tentative) reasoning. From the Hopfield
Network to the Restricted Boltzmann Machine, models were simplified over the iterations
until the RBM was ready to be stacked up.
• Be ambitious: If a model is proposed with substantially more parameters than
its contemporaries, it must solve a problem that no other model can solve nicely in order to be
remarkable. LSTM is much more complex than the traditional RNN, but it bypasses the
vanishing gradient problem nicely. The Deep Belief Network is famous not because its
authors were the first to come up with the idea of stacking one RBM onto another,
but because they came up with an algorithm that allows deep architectures to be trained
effectively.
• Widely read: Many models are inspired by domain knowledge outside of the machine
learning or statistics fields. The human visual cortex has greatly inspired the development
of convolutional neural networks. Even the recently popular Residual Networks can find
a corresponding mechanism in the human visual cortex. Generative Adversarial Networks
can also find a connection with game theory, which was developed more than half a century ago.
We hope these directions can help some readers make a greater impact on the community. Readers
should also be able to summarize more directions of their own through our revisit of these
milestones.
Acknowledgements
Thanks to the demo from http://beej.us/blog/data/convolution-image-processing/ for a
quick generation of examples in Figure 16. Thanks to Bojian Han at Carnegie Mellon University
for the examples in Figure 21. Thanks to the blog at
http://sebastianruder.com/optimizing-gradient-descent/index.html for a summary of gradient
methods in Section 7.1. Thanks to Yutong Zheng and Xupeng Tong at Carnegie Mellon University
for suggesting some relevant contents.
References
Emile Aarts and Jan Korst. Simulated annealing and boltzmann machines. 1988.
David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for
boltzmann machines. Cognitive science, 9(1):147–169, 1985.
James A Anderson and Edward Rosenfeld. Talking nets: An oral history of neural networks.
MiT Press, 2000.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint
arXiv:1701.07875, 2017.
Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with
visual attention. arXiv preprint arXiv:1412.7755, 2014.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv
preprint arXiv:1607.06450, 2016.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by
jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Alexander Bain. Mind and Body the Theories of Their Relation by Alexander Bain. Henry
S. King & Company, 1873.
Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in Neural Infor-
mation Processing Systems, pages 2814–2822, 2013.
Peter L Bartlett and Wolfgang Maass. Vapnik chervonenkis dimension of neural nets. The
handbook of brain theory and neural networks, pages 1188–1192, 2003.
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In
European conference on computer vision, pages 404–417. Springer, 2006.
Yoshua Bengio. Learning deep architectures for ai. Foundations and Trends in Machine
Learning, 2(1):1–127, 2009.
Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In
International Conference on Algorithmic Learning Theory, pages 18–36. Springer, 2011.
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with
gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte.
Convex neural networks. In Advances in neural information processing systems, pages
123–130, 2005.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise
training of deep networks. Advances in neural information processing systems, 19:153,
2007.
Monica Bianchini and Franco Scarselli. On the complexity of shallow and deep neural
network classifiers. In ESANN, 2014.
James G Booth and James P Hobert. Maximizing generalized linear mixed model likelihoods
with an automated monte carlo em algorithm. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 61(1):265–285, 1999.
Nirmal K Bose et al. Neural network fundamentals with graphs, algorithms, and applications.
1996.
Martin L Brady, Raghu Raghavan, and Joseph Slawny. Back propagation fails to separate
where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665–674,
1989.
George W Brown. Iterative solution of games by fictitious play. Activity analysis of pro-
duction and allocation, 13(1):374–376, 1951.
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders.
arXiv preprint arXiv:1509.00519, 2015.
Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural
network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
Juan Luis Castro, Carlos Javier Mantas, and JM Benítez. Neural networks with a continuous
squashing function in the output are universal approximators. Neural Networks, 13(6):
561–563, 2000.
Ning Chen, Jun Zhu, Jianfei Chen, and Bo Zhang. Dropout training for support vector
machines. arXiv preprint arXiv:1404.4171, 2014.
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel.
Infogan: Interpretable representation learning by information maximizing generative ad-
versarial nets. In Advances In Neural Information Processing Systems, pages 2172–2180,
2016.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using
rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078,
2014.
Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. Describing multimedia content
using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17
(11):1875–1886, 2015.
Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun.
The loss surfaces of multilayer networks. In AISTATS, 2015.
Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Ben-
gio. Attention-based models for speech recognition. In Advances in Neural Information
Processing Systems, pages 577–585, 2015.
Avital Cnaan, NM Laird, and Peter Slasor. Tutorial in biostatistics: Using the general
linear mixed model to analyse unbalanced repeated measures and longitudinal data. Stat
Med, 16:2349–2380, 1997.
Alberto Colorni, Marco Dorigo, Vittorio Maniezzo, et al. Distributed optimization by ant
colonies. In Proceedings of the first European conference on artificial life, volume 142,
pages 134–142. Paris, France, 1991.
David Daniel Cox and Thomas Dean. Neural networks and neuroscience-inspired computer
vision. Current Biology, 24(18):R921–R929, 2014.
G Cybenko. Continuous valued neural networks with two hidden layers are sufficient. 1988.
Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Cali-
brating energy-based generative adversarial networks. ICLR submission, 2017.
Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and
Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional
non-convex optimization. In Advances in neural information processing systems, pages
2933–2941, 2014.
Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks.
In Neural Information Processing Systems (NIPS), 2016.
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew
Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks.
In Advances in neural information processing systems, pages 1223–1231, 2012.
Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances
in Neural Information Processing Systems, pages 666–674, 2011.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to
attend with deep architectures for image tracking. Neural computation, 24(8):2151–2184,
2012.
Jean-Pierre Didier and Emmanuel Bigand. Rethinking physical and rehabilitation medicine:
New technologies induce new learning strategies. Springer Science & Business Media,
2011.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):
2121–2159, 2011.
Angela Lee Duckworth, Eli Tsukayama, and Henry May. Establishing causality using lon-
gitudinal hierarchical linear modeling: An illustration predicting achievement from self-
control. Social psychological and personality science, 2010.
Samuel Frederick Edwards and Phil W Anderson. Theory of spin glasses. Journal of Physics
F: Metal Physics, 5(5):965, 1975.
Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv
preprint arXiv:1512.03965, 2015.
Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent,
and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of
Machine Learning Research, 11(Feb):625–660, 2010.
Marcus Frean. The upstart algorithm: A method for constructing and training feedforward
neural networks. Neural computation, 2(2):198–209, 1990.
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic
style. arXiv preprint arXiv:1508.06576, 2015.
Felix A Gers and Jürgen Schmidhuber. Recurrent nets that time and count. In Neural
Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint
Conference on, volume 3, pages 189–194. IEEE, 2000.
Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Com-
puter Vision, pages 1440–1448, 2015.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies
for accurate object detection and semantic segmentation. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 580–587, 2014.
Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint
arXiv:1701.00160, 2016.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in
Neural Information Processing Systems, pages 2672–2680, 2014.
Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 14(1):76–86, 1992.
Céline Gravelines. Deep Learning via Stacked Sparse Autoencoders for Automated Voxel-
Wise Brain Parcellation Based on Functional Connectivity. PhD thesis, The University
of Western Ontario, 1991.
Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with
deep bidirectional lstm. In Automatic Speech Recognition and Understanding (ASRU),
2013 IEEE Workshop on, pages 273–278. IEEE, 2013.
Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen
Schmidhuber. Lstm: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
Richard Gregory and Patrick Cavanagh. The blind spot. Scholarpedia, 6(10):9618, 2011.
Aman Gupta, Haohan Wang, and Madhavi Ganapathiraju. Learning structure in gene
expression data using deep architectures, with an application to gene clustering. In
Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, pages
1328–1335. IEEE, 2015.
Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for
object segmentation and fine-grained localization. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 447–456, 2015.
Johan Hastad. Almost optimal lower bounds for small depth circuits. In Proceedings of the
eighteenth annual ACM symposium on Theory of computing, pages 6–20. ACM, 1986.
James V Haxby, Elizabeth A Hoffman, and M Ida Gobbini. The distributed human neural
system for face perception. Trends in cognitive sciences, 4(6):223–233, 2000.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. arXiv preprint arXiv:1512.03385, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep
residual networks. arXiv preprint arXiv:1603.05027, 2016.
Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The “wake-sleep”
algorithm for unsupervised neural networks. Science, 268(5214):1158, 1995.
Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep
belief nets. Neural computation, 18(7):1527–1554, 2006.
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R
Salakhutdinov. Improving neural networks by preventing co-adaptation of feature de-
tectors. arXiv preprint arXiv:1207.0580, 2012.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
John J Hopfield. Neural networks and physical systems with emergent collective computa-
tional abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks
are universal approximators. Neural networks, 2(5):359–366, 1989.
Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep
neural networks with logic rules. arXiv preprint arXiv:1603.06318, 2016.
David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in the cat’s striate
cortex. The Journal of physiology, 148(3):574–591, 1959.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals,
Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint
arXiv:1610.00527, 2016.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-
supervised learning with deep generative models. In Advances in Neural Information
Processing Systems, pages 3581–3589, 2014.
Tinne Hoff Kjeldsen. John von neumann’s conception of the minimax theorem: a journey
through different mathematical contexts. Archive for history of exact sciences, 56(1):
39–68, 2001.
Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.
Bryan Kolb, Ian Q Whishaw, and G Campbell Teskey. An introduction to brain and behav-
ior, volume 1273. 2014.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,
Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome:
Connecting language and vision using crowdsourced dense image annotations. arXiv
preprint arXiv:1602.07332, 2016.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.
2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems,
pages 1097–1105, 2012.
Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convo-
lutional inverse graphics network. In Advances in Neural Information Processing Systems,
pages 2539–2547, 2015.
Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept
learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
Alan S Lapedes and Robert M Farber. How neural nets work. In Neural information
processing systems, pages 442–456, 1988.
Hugo Larochelle and Geoffrey E Hinton. Learning to combine foveal glimpses with a third-
order boltzmann machine. In Advances in neural information processing systems, pages
1243–1251, 2010.
Yann Le Cun, Bernhard Boser, John S Denker, D Henderson, Richard E Howard, W Hubbard, and
Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In
Advances in neural information processing systems. Citeseer, 1990.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a.
Yann LeCun, Corinna Cortes, and Christopher JC Burges. The mnist database of hand-
written digits, 1998b.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):
436–444, 2015.
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep
belief networks for scalable unsupervised learning of hierarchical representations. In Pro-
ceedings of the 26th annual international conference on machine learning, pages 609–616.
ACM, 2009.
Zhaoping Li. A neural model of contour integration in the primary visual cortex. Neural
computation, 10(4):903–940, 1998.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In
European Conference on Computer Vision, pages 740–755. Springer, 2014.
Cheng-Yuan Liou and Shiao-Lin Lin. Finite memory loading in hairy neurons. Natural
Computing, 5(1):15–42, 2006.
Cheng-Yuan Liou and Shao-Kuo Yuan. Error tolerant associative memory. Biological Cy-
bernetics, 81(4):331–342, 1999.
Christoph Lippert, Jennifer Listgarten, Ying Liu, Carl M Kadie, Robert I Davidson, and
David Heckerman. Fast linear mixed models for genome-wide association studies. Nature
methods, 8(10):833–835, 2011.
Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural
networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
David G Lowe. Object recognition from local scale-invariant features. In Computer vision,
1999. The proceedings of the seventh IEEE international conference on, volume 2, pages
1150–1157. Ieee, 1999.
Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf.
arXiv preprint arXiv:1603.01354, 2016.
Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, and Eduard Hovy.
Dropout with expectation-linear regularization. arXiv preprint arXiv:1609.08017, 2016.
Michael Maschler, Eilon Solan, and Shmuel Zamir. Game theory. Translated from the Hebrew by
Ziv Hellman and edited by Mike Borns, 2013.
Charles E McCulloch and John M Neuhaus. Generalized linear mixed models. Wiley Online
Library, 2001.
Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous
activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
Marc Mézard and Jean-P Nadal. Learning in feedforward layered networks: The tiling
algorithm. Journal of Physics A: Mathematical and General, 22(12):2191, 1989.
Marc Mézard, Giorgio Parisi, and Miguel-Angel Virasoro. Spin glass theory and beyond.
1990.
Marvin L Minsky and Seymour A Papert. Perceptrons: an introduction to computational
geometry. MA: MIT Press, Cambridge, 1969.
Melanie Mitchell. An introduction to genetic algorithms. MIT press, 1998.
Tom M Mitchell. Machine learning. WCB/McGraw-Hill, 1997.
Jeffrey Moran and Robert Desimone. Selective attention gates visual processing in the
extrastriate cortex. Frontiers in cognitive neuroscience, 229:342–345, 1985.
Michael C Mozer. A focused back-propagation algorithm for temporal pattern recognition.
Complex systems, 3(4):349–381, 1989.
Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann
machines. In Proceedings of the 27th International Conference on Machine Learning
(ICML-10), pages 807–814, 2010.
John Nash. Non-cooperative games. Annals of mathematics, pages 286–295, 1951.
John F Nash et al. Equilibrium points in n-person games. Proc. Nat. Acad. Sci. USA, 36
(1):48–49, 1950.
Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High
confidence predictions for unrecognizable images. In 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 427–436. IEEE, 2015.
Danh V Nguyen, Damla Şentürk, and Raymond J Carroll. Covariate-adjusted linear mixed
effects model with an application to longitudinal data. Journal of nonparametric statis-
tics, 20(6):459–481, 2008.
Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of mathe-
matical biology, 15(3):267–273, 1982.
Erkki Oja and Juha Karhunen. On stochastic approximation of the eigenvectors and eigen-
values of the expectation of a random matrix. Journal of mathematical analysis and
applications, 106(1):69–84, 1985.
Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural
networks. arXiv preprint arXiv:1601.06759, 2016.
Keiichi Osako, Rita Singh, and Bhiksha Raj. Complex recurrent neural networks for denois-
ing speech signals. In Applications of Signal Processing to Audio and Acoustics (WAS-
PAA), 2015 IEEE Workshop on, pages 1–5. IEEE, 2015.
Rajesh G Parekh, Jihoon Yang, and Vasant Honavar. Constructive neural network learning
algorithms for multi-category real-valued pattern classification. 1997.
Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct
deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013a.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent
neural networks. ICML (3), 28:1310–1318, 2013b.
Razvan Pascanu, Yann N Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point
problem for non-convex optimization. arXiv preprint arXiv:1405.4604, 2014.
Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier,
and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences
for richer image-to-sentence models. In Proceedings of the IEEE International Conference
on Computer Vision, pages 2641–2649, 2015.
Tomaso Poggio and Thomas Serre. Models of visual cortex. Scholarpedia, 8(4):3516, 2013.
Christopher Poultney, Sumit Chopra, Yann L Cun, et al. Efficient learning of sparse rep-
resentations with an energy-based model. In Advances in neural information processing
systems, pages 1137–1144, 2006.
Jose C Principe, Neil R Euliano, and W Curt Lefebvre. Neural and adaptive systems:
fundamentals through simulations with CD-ROM. John Wiley & Sons, Inc., 1999.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-
time object detection with region proposal networks. In Advances in neural information
processing systems, pages 91–99, 2015.
Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropa-
gation learning: The rprop algorithm. In Neural Networks, 1993., IEEE International
Conference On, pages 586–591. IEEE, 1993.
AJ Robinson and Frank Fallside. The utility driven dynamic error propagation network.
University of Cambridge Department of Engineering, 1987.
Frank Rosenblatt. The perceptron: a probabilistic model for information storage and orga-
nization in the brain. Psychological review, 65(6):386, 1958.
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal repre-
sentations by error propagation. Technical report, DTIC Document, 1985.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi-
heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large
scale visual recognition challenge. International Journal of Computer Vision, 115(3):
211–252, 2015.
S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology.
1990.
Ruslan Salakhutdinov and Geoffrey E Hinton. Deep boltzmann machines. In AISTATS,
volume 1, page 3, 2009.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and
Xi Chen. Improved techniques for training gans. In Advances in Neural Information
Processing Systems, pages 2226–2234, 2016.
Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks,
61:85–117, 2015.
Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Trans-
actions on Signal Processing, 45(11):2673–2681, 1997.
Thomas Serre, Lior Wolf, and Tomaso Poggio. Object recognition with features inspired
by visual cortex. In 2005 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’05), volume 2, pages 994–1000. IEEE, 2005.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hin-
ton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-
experts layer. arXiv preprint arXiv:1701.06538, 2017.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
Paul Smolensky. Information processing in dynamical systems: Foundations of harmony
theory. Technical report, DTIC Document, 1986.
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation
using deep conditional generative models. In Advances in Neural Information Processing
Systems, pages 3483–3491, 2015.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut-
dinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of
Machine Learning Research, 15(1):1929–1958, 2014.
Amos Storkey. Increasing the capacity of a hopfield network without sacrificing functionality.
In International Conference on Artificial Neural Networks, pages 451–456. Springer, 1997.
Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. Pol-
icy gradient methods for reinforcement learning with function approximation. In NIPS,
volume 99, pages 1057–1063, 1999.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint
arXiv:1312.6199, 2013.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper
with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 1–9, 2015.
Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.
Conditional image generation with pixelcnn decoders. In Advances In Neural Information
Processing Systems, pages 4790–4798, 2016.
Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like en-
sembles of relatively shallow networks. In Advances in Neural Information Processing
Systems, pages 550–558, 2016.
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Man-
zagol. Stacked denoising autoencoders: Learning useful representations in a deep network
with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–
3408, 2010.
Haohan Wang and Jingkang Yang. Multiple confounders correction with regularized linear
mixed effect models, with application in biological processes. 2016.
Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P Xing. Select-additive
learning: Improving cross-individual generalization in multimodal sentiment analysis.
arXiv preprint arXiv:1609.05244, 2016.
Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings
of the IEEE, 78(10):1550–1560, 1990.
Bernard Widrow et al. Adaptive “Adaline” neuron using chemical “memistors”. 1960.
Alan L Wilkes and Nicholas J Wade. Bain on neural networks. Brain and cognition, 33(3):
295–305, 1997.
Andrew G Wilson, Christoph Dann, Chris Lucas, and Eric P Xing. The human kernel. In
Advances in Neural Information Processing Systems, pages 2854–2862, 2015.
SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-
chun Woo. Convolutional lstm network: A machine learning approach for precipitation
nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdi-
nov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption
generation with visual attention. arXiv preprint arXiv:1502.03044, 2(3):5, 2015.
Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. Multi-task cross-lingual sequence
tagging from scratch. arXiv preprint arXiv:1603.06270, 2016.
Andrew Chi-Chih Yao. Separating the polynomial-time hierarchy by oracles. In 26th Annual
Symposium on Foundations of Computer Science (sfcs 1985), 1985.
Xin Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447,
1999.
Dong Yu, Li Deng, and George Dahl. Roles of pre-training and fine-tuning in context-
dependent dbn-hmms for real-world speech recognition. In Proc. NIPS Workshop on
Deep Learning and Unsupervised Feature Learning, 2010.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint
arXiv:1605.07146, 2016.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.
Understanding deep learning requires rethinking generalization. arXiv preprint
arXiv:1611.03530, 2016a.
Ke Zhang, Miao Sun, Tony X Han, Xingfang Yuan, Liru Guo, and Tao Liu. Resid-
ual networks of residual networks: Multilevel residual networks. arXiv preprint
arXiv:1608.02908, 2016b.
Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial net-
work. arXiv preprint arXiv:1609.03126, 2016.
Xiang Zhou and Matthew Stephens. Genome-wide efficient mixed-model analysis for asso-
ciation studies. Nature genetics, 44(7):821–824, 2012.