• Ian Goodfellow, Yoshua Bengio and Aaron Courville: http://www.deeplearningbook.org/

• The identity function
φ(x) = x.

• The sigmoid function
φ(x) = 1 / (1 + exp(−x)).

• The hard threshold function
φ_β(x) = 1_{x ≥ β}.
Historically, the sigmoid was the most widely used activation function, since it is differentiable and keeps values in the interval [0, 1]. Nevertheless, it is problematic, since its gradient is very close to 0 whenever |x| is not close to 0. Figure 3 represents the sigmoid function and its derivative.

Figure 3: Sigmoid function (in black) and its derivative (in red)
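As a quick numerical illustration of this vanishing-gradient effect, here is a minimal NumPy sketch (the evaluation points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # derivative of the sigmoid

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_prime(x))    # 0.25, 0.105, 0.0066, 4.5e-05
```

The derivative is already of order 10⁻² at x = 5, which is what slows down backpropagation in deep networks.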
For neural networks with a large number of layers (which is the case in deep learning), this causes trouble for the backpropagation algorithm used to estimate the parameters (backpropagation is explained in the following). This is why the sigmoid function was supplanted by the rectified linear function (ReLU), φ(x) = max(x, 0). This function is not differentiable at 0, but in practice this is not really a problem, since the probability that an entry is exactly equal to 0 is generally null. The ReLU function also has a sparsification effect. The ReLU function and its derivative are equal to 0 for negative values, and no information can be obtained in this case for such a unit; this is why it is advised to add a small positive bias to ensure that each unit is active. Several variations of the ReLU function are considered to make sure that all units have a non-vanishing gradient and that for x < 0 the derivative is not equal to 0, namely

φ(x) = max(x, 0) + α min(x, 0),

where α is either a fixed parameter set to a small positive value, or a parameter to estimate.

Figure 1: source: andrewjamesturner.co.uk

Figure 2 represents the activation functions described above.

2.2 Multilayer perceptron

A multilayer perceptron (or neural network) is a structure composed of several hidden layers of neurons, in which the output of a neuron of a layer becomes the input of a neuron of the next layer. Moreover, the output of a neuron can also be the input of a neuron of the same layer or of a neuron of a previous layer (this is the case for recurrent neural networks). On the last layer, called the output layer, we may apply an activation function different from the ones used on the hidden layers, depending on the type of problem at hand: regression or classification. Figure 4 represents a neural network with three input variables, one output variable, and two hidden layers.

Multilayer perceptrons have a basic architecture, since each unit (or neuron) of a layer is linked to all the units of the next layer but has no link with the neurons of the same layer. The parameters of the architecture are the number of hidden layers and the number of neurons in each layer. The activation functions also have to be chosen by the user. For the output layer, as mentioned previously, the activation function is generally different from the ones used on the hidden layers. In the case of regression, we apply no activation function on the output layer. For binary classification, the output gives a prediction of P(Y = 1|X); since this value must lie in [0, 1], the sigmoid activation function is generally considered. For multi-class classification, the output layer contains one neuron per class i, giving a prediction of P(Y = i|X). The sum of all these values has to be equal to 1, so the multidimensional softmax function is generally used:

softmax(z)_i = exp(z_i) / Σ_j exp(z_j).

Let us summarize the mathematical formulation of a multilayer perceptron with L hidden layers.
We set h^(0)(x) = x.
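To fix ideas, here is a minimal sketch of the forward pass of such a network (the ReLU hidden activation and the softmax output are illustrative choices, corresponding to the multi-class case):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # numerically stabilized
    return e / e.sum()

def forward(x, params):
    """params = [(W1, b1), ..., (W_{L+1}, b_{L+1})]; h^(0)(x) = x."""
    h = x
    for W, b in params[:-1]:
        a = b + W @ h                        # a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)
        h = np.maximum(a, 0.0)               # h^(k)(x) = phi(a^(k)(x)), ReLU here
    W, b = params[-1]
    return softmax(b + W @ h)                # output layer, multi-class case
```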
2.4.2 Penalized empirical risk

Given a loss function ℓ, the expected loss can be written as

L(θ) = E_{(X,Y)∼P} [ℓ(f(X, θ), Y)].

In order to estimate the parameters θ, we use a training sample (X_i, Y_i)_{1≤i≤n} and we minimize the empirical loss

L̃_n(θ) = (1/n) Σ_{i=1}^n ℓ(f(X_i, θ), Y_i),

possibly adding a regularization term. This leads to minimizing the penalized empirical risk

L_n(θ) = (1/n) Σ_{i=1}^n ℓ(f(X_i, θ), Y_i) + λΩ(θ).

We can consider L² regularization. Using the same notations as in Section 2.2,

Ω(θ) = Σ_k Σ_i Σ_j (W^(k)_{i,j})² = Σ_k ‖W^(k)‖²_F,

where ‖W‖_F denotes the Frobenius norm of the matrix W. Note that only the weights are penalized; the biases are not. It is easy to compute the gradient of Ω(θ):

∇_{W^(k)} Ω(θ) = 2 W^(k).
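In NumPy, this penalty and its gradient read, for instance (a minimal sketch; the list of weight matrices is assumed given):

```python
import numpy as np

def l2_penalty(weights):
    """Omega(theta) = sum_k ||W^(k)||_F^2 (biases are not penalized)."""
    return sum(np.sum(W ** 2) for W in weights)

def l2_penalty_grad(W_k):
    return 2.0 * W_k              # gradient of Omega with respect to W^(k)
```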
One can also consider L¹ regularization, leading to parsimonious (sparse) solutions:

Ω(θ) = Σ_k Σ_i Σ_j |W^(k)_{i,j}|.

The minimization is performed by stochastic gradient descent, combined with the backpropagation algorithm to compute the gradients; although backpropagation was introduced by Rumelhart et al. (1988), it is still crucial for deep learning. The stochastic gradient descent algorithm performs as follows:

• Initialization of θ = (W^(1), b^(1), ..., W^(L+1), b^(L+1)).

• For N iterations:

  – For each batch B of training data,

    θ ← θ − (ε/m) Σ_{i∈B} [∇_θ ℓ(f(X_i, θ), Y_i) + λ ∇_θ Ω(θ)].

Note that, in the previous algorithm, we do not compute the gradient of the loss function over the whole training set at each step, but only on a subset B of cardinality m (called a batch). This is what is classically done for big data sets (and for deep learning) or for sequential data. B is taken at random without replacement. An iteration over all the training examples is called an epoch. The number of epochs is a parameter of the deep learning algorithms. The total number of iterations equals the number of epochs times the sample size n divided by m, the size of a batch. This procedure is called batch learning; sometimes, one also takes batches of size 1, reduced to a single training example (X_i, Y_i).
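As an illustration, here is a minimal NumPy sketch of one such stochastic gradient step with the L² penalty; the helper `loss_grads`, which would implement backpropagation (Section 2.4.3), is a hypothetical stand-in:

```python
import numpy as np

def sgd_step(params, batch, loss_grads, eps=0.01, lam=1e-4):
    """One SGD step: params is a list of (W, b) pairs, batch a list of (x, y).

    loss_grads(params, x, y) is assumed to return per-example gradients
    [(dW, db), ...] computed by backpropagation.
    """
    m = len(batch)
    for x, y in batch:
        grads = loss_grads(params, x, y)
        for (W, b), (dW, db) in zip(params, grads):
            W -= (eps / m) * (dW + lam * 2 * W)  # penalty gradient: 2*lambda*W
            b -= (eps / m) * db                  # biases are not penalized
    return params
```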
2.4.3 Backpropagation algorithm for regression with the quadratic loss

We consider the regression case and explain in this section how to compute the gradient of the empirical quadratic loss by the backpropagation algorithm. To simplify, we do not consider here the penalization term, which can easily be added. Assuming that the output of the multilayer perceptron is of size K, and using the notations of Section 2.2, the empirical quadratic loss is proportional to

Σ_{i=1}^n R_i(θ), with R_i(θ) = ‖Y_i − f(X_i, θ)‖².

In a regression model, the output activation function ψ is generally the identity function; to be more general, we assume that

ψ(a_1, ..., a_K) = (g_1(a_1), ..., g_K(a_K)),

where g_1, ..., g_K are functions from R to R. Let us compute the partial derivatives of R_i with respect to the weights of the output layer. Recalling that

a^(L+1)(x) = b^(L+1) + W^(L+1) h^(L)(x),

we get

∂R_i/∂W^(L+1)_{k,m} = −2 (Y_{i,k} − f_k(X_i, θ)) g'_k(a^(L+1)_k(X_i)) h^(L)_m(X_i).

Differentiating now with respect to the weights of the previous layer,

∂R_i/∂W^(L)_{m,l} = −2 Σ_{k=1}^K (Y_{i,k} − f_k(X_i, θ)) g'_k(a^(L+1)_k(X_i)) ∂a^(L+1)_k(X_i)/∂W^(L)_{m,l}.

Setting δ_{k,i} = −2 (Y_{i,k} − f_k(X_i, θ)) g'_k(a^(L+1)_k(X_i)) and, by the chain rule, s_{m,i} = φ'(a^(L)_m(X_i)) Σ_{k=1}^K δ_{k,i} W^(L+1)_{k,m}, we then have

∂R_i/∂W^(L+1)_{k,m} = δ_{k,i} h^(L)_m(X_i),   (1)

∂R_i/∂W^(L)_{m,l} = s_{m,i} h^(L−1)_l(X_i),   (2)

known as the backpropagation equations. The values of the gradient are used to update the parameters in the gradient descent algorithm. At step r + 1, we have:

W^(L+1,r+1)_{k,m} = W^(L+1,r)_{k,m} − ε_r Σ_{i∈B} ∂R_i/∂W^(L+1,r)_{k,m}

W^(L,r+1)_{m,l} = W^(L,r)_{m,l} − ε_r Σ_{i∈B} ∂R_i/∂W^(L,r)_{m,l}

where B is a batch (either the whole training sample or a subsample, possibly of size 1) and ε_r > 0 is the learning rate, which satisfies ε_r → 0, Σ_r ε_r = ∞ and Σ_r ε_r² < ∞; for example, ε_r = 1/r.
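To make equation (1) concrete, here is a small NumPy sketch that checks the output-layer formula against a finite-difference approximation on a one-hidden-layer network (the sizes and the choice g_k = identity are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, H, K = 3, 4, 2                      # input, hidden and output sizes
W1, b1 = rng.normal(size=(H, d)), np.zeros(H)
W2, b2 = rng.normal(size=(K, H)), np.zeros(K)
x, y = rng.normal(size=d), rng.normal(size=K)

relu = lambda a: np.maximum(a, 0.0)

def R(W2_):
    h = relu(W1 @ x + b1)              # h^(L)(x)
    f = W2_ @ h + b2                   # identity output activation (g_k = id)
    return np.sum((y - f) ** 2)        # R_i(theta)

# Backpropagation formula (1): dR/dW2[k, m] = delta_k * h_m,
# with delta_k = -2 (y_k - f_k) g'_k = -2 (y_k - f_k) here.
h = relu(W1 @ x + b1)
f = W2 @ h + b2
grad_bp = np.outer(-2 * (y - f), h)

# Finite-difference check
eps, grad_fd = 1e-6, np.zeros_like(W2)
for k in range(K):
    for m in range(H):
        E = np.zeros_like(W2); E[k, m] = eps
        grad_fd[k, m] = (R(W2 + E) - R(W2 - E)) / (2 * eps)

print(np.allclose(grad_bp, grad_fd, atol=1e-5))   # True
```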
For the hyperbolic tangent function ("tanh"),

φ(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x)),   φ'(x) = 1 − φ²(x).

The backpropagation algorithm is also used for classification with the cross-entropy, as explained in the next section.

2.4.4 Backpropagation algorithm for classification with the cross-entropy

We consider here a K-class classification problem. The output of the MLP is

f(x) = (P(Y = 1|x), ..., P(Y = K|x))',

where (f(x))_k is the kth component of f(x): (f(x))_k = P(Y = k|x). We assume that the output activation function is the softmax function,

softmax(x_1, ..., x_K) = (1 / Σ_{k=1}^K e^{x_k}) (e^{x_1}, ..., e^{x_K}).

The associated loss function ℓ is the cross-entropy:

ℓ(f(x), y) = −log(f(x))_y = −Σ_{k=1}^K 1_{y=k} log(f(x))_k.

Using the notations of Section 2.2, we want to compute the gradients

Output weights: ∂ℓ(f(x), y)/∂W^(L+1)_{i,j}    Output biases: ∂ℓ(f(x), y)/∂b^(L+1)_i

Hidden weights: ∂ℓ(f(x), y)/∂W^(h)_{i,j}    Hidden biases: ∂ℓ(f(x), y)/∂b^(h)_i

for 1 ≤ h ≤ L. We use the chain rule: if z(x) = φ(a_1(x), ..., a_J(x)), then

∂z/∂x_i = Σ_j (∂z/∂a_j)(∂a_j/∂x_i) = ⟨∇φ, ∂a/∂x_i⟩.

Hence we have

∂ℓ(f(x), y)/∂(a^(L+1)(x))_i = Σ_j [∂ℓ(f(x), y)/∂f(x)_j] [∂f(x)_j/∂(a^(L+1)(x))_i].

Let us make some useful computations to obtain the gradient. First,

∂ℓ(f(x), y)/∂f(x)_j = −1_{y=j} / (f(x))_y,

and the derivative of the softmax is

∂softmax(x)_i/∂x_j = softmax(x)_i (1 − softmax(x)_i) if i = j,
∂softmax(x)_i/∂x_j = −softmax(x)_i softmax(x)_j if i ≠ j.

Therefore

∂ℓ(f(x), y)/∂(a^(L+1)(x))_i = −Σ_j (1_{y=j} / (f(x))_y) ∂softmax(a^(L+1)(x))_j/∂(a^(L+1)(x))_i
= (−1 + f(x)_y) 1_{y=i} + f(x)_i 1_{y≠i}.

Hence we obtain

∇_{a^(L+1)(x)} ℓ(f(x), y) = f(x) − e(y),
where, for y ∈ {1, 2, ..., K}, e(y) is the R^K vector with ith component 1_{i=y}. We now obtain easily the partial derivative of the loss function with respect to the output biases: since

∂(a^(L+1)(x))_j/∂(b^(L+1))_i = 1_{i=j},

we get

∇_{b^(L+1)} ℓ(f(x), y) = f(x) − e(y).   (3)

Let us now compute the partial derivative of the loss function with respect to the output weights:

∂ℓ(f(x), y)/∂W^(L+1)_{i,j} = Σ_k [∂ℓ(f(x), y)/∂(a^(L+1)(x))_k] [∂(a^(L+1)(x))_k/∂W^(L+1)_{i,j}]

and

∂(a^(L+1)(x))_k/∂W^(L+1)_{i,j} = h^(L)(x)_j 1_{i=k}.

Hence

∇_{W^(L+1)} ℓ(f(x), y) = (f(x) − e(y)) (h^(L)(x))'.   (4)

Let us now compute the gradient of the loss function at the hidden layers; we use the chain rule again. Recalling that h^(k)(x)_j = φ(a^(k)(x)_j),

∂ℓ(f(x), y)/∂a^(k)(x)_j = [∂ℓ(f(x), y)/∂h^(k)(x)_j] φ'(a^(k)(x)_j).

Hence,

∇_{a^(k)(x)} ℓ(f(x), y) = ∇_{h^(k)(x)} ℓ(f(x), y) ⊙ (φ'(a^(k)(x)_1), ..., φ'(a^(k)(x)_J))',

where ⊙ denotes the element-wise product. This leads to

∂ℓ(f(x), y)/∂W^(k)_{i,j} = [∂ℓ(f(x), y)/∂a^(k)(x)_i] [∂a^(k)(x)_i/∂W^(k)_{i,j}] = [∂ℓ(f(x), y)/∂a^(k)(x)_i] h^(k−1)(x)_j.

Finally, the gradient of the loss function with respect to the hidden weights is

∇_{W^(k)} ℓ(f(x), y) = ∇_{a^(k)(x)} ℓ(f(x), y) h^(k−1)(x)'.   (5)

The last step is to compute the gradient with respect to the hidden biases. We simply have

∂ℓ(f(x), y)/∂b^(k)_i = ∂ℓ(f(x), y)/∂a^(k)(x)_i,

that is, ∇_{b^(k)} ℓ(f(x), y) = ∇_{a^(k)(x)} ℓ(f(x), y).
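The following minimal NumPy sketch implements equations (3), (4) and (5) for one observation, on a network with one hidden layer (the ReLU hidden activation and the layer sizes are illustrative assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())            # stabilized softmax
    return e / e.sum()

def backprop_cross_entropy(x, y, W1, b1, W2, b2):
    """Gradients of -log f(x)_y for a 1-hidden-layer MLP with ReLU units."""
    # Forward pass
    a1 = b1 + W1 @ x                   # a^(1)(x)
    h1 = np.maximum(a1, 0.0)           # h^(1)(x)
    a2 = b2 + W2 @ h1                  # a^(L+1)(x)
    f = softmax(a2)
    # Output layer: grad_a ell = f(x) - e(y)
    e_y = np.zeros_like(f); e_y[y] = 1.0
    g2 = f - e_y
    dW2, db2 = np.outer(g2, h1), g2    # equations (4) and (3)
    # Hidden layer: grad_h = W' g2, then elementwise product with phi'(a)
    gh1 = W2.T @ g2
    g1 = gh1 * (a1 > 0)                # phi'(a) for ReLU
    dW1, db1 = np.outer(g1, x), g1     # equation (5) and hidden biases
    return dW1, db1, dW2, db2
```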
These formulas are applied recursively, from the output layer down to the first hidden layer:

* Compute the gradient at the hidden layer k:

∇_{W^(k)} ℓ(f(x), y) = ∇_{a^(k)(x)} ℓ(f(x), y) h^(k−1)(x)'

∇_{b^(k)} ℓ(f(x), y) = ∇_{a^(k)(x)} ℓ(f(x), y)

* Compute the gradient at the previous layer:

∇_{h^(k−1)(x)} ℓ(f(x), y) = (W^(k))' ∇_{a^(k)(x)} ℓ(f(x), y)

and

∇_{a^(k−1)(x)} ℓ(f(x), y) = ∇_{h^(k−1)(x)} ℓ(f(x), y) ⊙ (..., φ'(a^(k−1)(x)_j), ...)'

2.4.5 Initialization

The input data have to be normalized to have approximately the same range. The biases can be initialized to 0. The weights cannot be initialized to 0 since, with the tanh activation function, the gradient of the objective then vanishes: 0 is a saddle point. They also cannot be initialized with the same values; otherwise, all the neurons of a hidden layer would have the same behaviour. We generally initialize the weights at random: the values W^(k)_{i,j} are i.i.d. Uniform on [−c, c], with for example c = √(6/(N_k + N_{k−1})), where N_k is the size of the hidden layer k. We also sometimes initialize the weights with a normal distribution N(0, 0.01) (see Glorot and Bengio, 2010).
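A minimal sketch of this initialization (the layer sizes in the usage line are illustrative):

```python
import numpy as np

def init_params(sizes, rng=np.random.default_rng(0)):
    """Glorot-style uniform initialization; biases are set to 0."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        c = np.sqrt(6.0 / (n_in + n_out))           # c = sqrt(6/(N_{k-1}+N_k))
        W = rng.uniform(-c, c, size=(n_out, n_in))  # i.i.d. Uniform[-c, c]
        b = np.zeros(n_out)
        params.append((W, b))
    return params

params = init_params([3, 10, 10, 1])                # e.g. two hidden layers
```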
2.4.6 Optimization algorithms

Many algorithms can be used to minimize the loss function. All of them have hyperparameters that have to be calibrated and that have an important impact on the convergence of the algorithms. The elementary tool of all these algorithms is the Stochastic Gradient Descent (SGD) algorithm. It is the simplest one:

θ_i^new = θ_i^old − ε (∂L/∂θ_i)(θ^old),

where ε is the learning rate, whose calibration is very important for the convergence of the algorithm. If it is too small, the convergence is very slow and the optimization can be blocked at a local minimum. If the learning rate is too large, the network will oscillate around an optimum without stabilizing and converging. A classical way to proceed is to adapt the learning rate during the training: it is recommended to begin with a "large" value of ε (for example 0.1) and to reduce it during the successive iterations. However, there is no general rule on how to adjust the learning rate; it is rather the experience of the engineer, observing the evolution of the loss function, that indicates how to proceed.

The stochasticity of the SGD algorithm lies in the computation of the gradient. Indeed, we consider batch learning: at each step, m training examples are randomly chosen without replacement and the mean of the m corresponding gradients is used to update the parameters. An epoch corresponds to a pass through all the learning data; for example, if the batch size m is 1/100 times the sample size n, an epoch corresponds to 100 batches. We iterate the process over a certain number nb of epochs that is fixed in advance. If the algorithm did not converge after nb epochs, we have to continue for nb' more epochs. Another stopping rule, called early stopping, is also used: it consists in considering a validation sample and stopping the learning when the loss function on this validation sample stops decreasing. Batch learning is used for computational reasons: indeed, as we have seen, the backpropagation algorithm needs to store all the intermediate values computed at the forward step in order to compute the gradient during the backward pass, and for big data sets, such as millions of images, this is not feasible, all the more so as deep networks have millions of parameters to calibrate. The batch size m is also a parameter to calibrate. Small batches generally lead to better generalization properties. The particular case of batches of size 1 is called On-line Gradient Descent; the disadvantage of this procedure is the very long computation time. Let us summarize the classical SGD algorithm.

ALGORITHM 1: Stochastic Gradient Descent

• Fix the parameters ε (learning rate), m (batch size) and nb (number of epochs).

• For e = 1 to nb epochs:

• For l = 1 to n/m:
– Take a random batch of size m without replacement in the learning sample: (X_i, Y_i)_{i∈B_l}.

– Compute the gradients with the backpropagation algorithm:

∇̃_θ = (1/m) Σ_{i∈B_l} ∇_θ ℓ(f(X_i, θ), Y_i).

– Update the parameters:

θ^new = θ^old − ε ∇̃_θ.
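A compact sketch of Algorithm 1 follows; the function `backprop_gradients` is a hypothetical stand-in for the computations of Sections 2.4.3 and 2.4.4, and the parameters are flattened into a single vector for simplicity:

```python
import numpy as np

def sgd(theta, X, Y, backprop_gradients, eps=0.1, m=32, nb_epochs=10,
        rng=np.random.default_rng(0)):
    n = len(X)
    for _ in range(nb_epochs):             # one epoch = one pass over the data
        order = rng.permutation(n)         # sampling without replacement
        for l in range(n // m):
            B = order[l * m:(l + 1) * m]   # batch B_l of size m
            grad = np.mean([backprop_gradients(theta, X[i], Y[i]) for i in B],
                           axis=0)         # mean of the m gradients
            theta = theta - eps * grad     # update step
    return theta
```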
Since the choice of the learning rate is delicate and very influential on the convergence of the SGD algorithm, variations of the algorithm have been proposed that are less sensitive to the learning rate. The principle is to add a correction when we update the gradient, called momentum. The method is due to Polyak (1964) [9]:

(∇̃_θ)^(r) = γ (∇̃_θ)^(r−1) + (ε/m) Σ_{i∈B_l} ∇_θ ℓ(f(X_i, θ^(r−1)), Y_i).

Concerning regularization in deep learning, the most widely used method is dropout, introduced by Hinton et al. (2012) [2]. With a certain probability p, and independently of the others, each unit of the network is set to 0. The probability p is another hyperparameter; it is classical to set it to 0.5 for units in the hidden layers, and to 0.2 for the entry layer. The computational cost is low, since we just have to set some weights to 0 with probability p. This method significantly improves the generalization properties of deep neural networks and is now the most popular regularization method in this context. The disadvantage is that training is much slower (it needs an increased number of epochs). Ensembling models (aggregating several models) can also be used. It is also classical to use data augmentation or adversarial examples.
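Minimal sketches of the two updates just described — a momentum step and a dropout mask — with illustrative values of γ, ε and p (test-time rescaling of the activations is omitted for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Momentum: v^(r) = gamma * v^(r-1) + (eps/m) * sum of the batch gradients
def momentum_step(theta, v, batch_grads, gamma=0.9, eps=0.1):
    v = gamma * v + eps * np.mean(batch_grads, axis=0)
    return theta - v, v

# Dropout: each unit is set to 0 with probability p during training
def dropout(h, p=0.5):
    mask = rng.random(h.shape) >= p    # kept with probability 1 - p
    return h * mask
```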
removed the manual extraction of features. CNNs act directly on matrices, or even on tensors for images with three RGB color channels. CNNs are now widely used for image classification, image segmentation, object recognition, face recognition...

3.0.7 Layers in a CNN

A Convolutional Neural Network is composed of several kinds of layers, which are described in this section: convolutional layers, pooling layers and fully connected layers.
3.0.8 Convolution layer

The discrete convolution between two functions f and g is defined as

(f ∗ g)(x) = Σ_t f(t) g(x + t).

Figure 6: Image annotation. Source: http://danielnouri.org/media/deep-learning-whales-krizhevsky-lsvrc-2012-predictions.jpg

As shown in Figure 8, the principle of 2D convolution is to drag a convolution kernel over the image. At each position, we get the convolution between the kernel and the part of the image that is currently treated. Then the kernel moves by a number s of pixels; s is called the stride. When the stride is small, we get redundant information. Sometimes we also add a zero padding, which is a margin of size p containing zero values around the image, in order to control the size of the output. Assume that we apply C_0 kernels (also called filters), each of size k × k, on an image. If the size of the input image is W_i × H_i × C_i (W_i denotes the width, H_i the height, and C_i the number of channels, typically C_i = 3), the volume of the output is W_0 × H_0 × C_0, where C_0 corresponds to the number of kernels that we consider, and

W_0 = (W_i − k + 2p)/s + 1,

and similarly H_0 = (H_i − k + 2p)/s + 1.
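A small sketch of these computations — the output-size formula and a naive single-channel 2D convolution with stride s and zero padding p, following the definition above:

```python
import numpy as np

def out_size(w_in, k, p, s):
    """W0 = (Wi - k + 2p)/s + 1"""
    return (w_in - k + 2 * p) // s + 1

def conv2d(img, kernel, s=1, p=0):
    """Naive single-channel 2D convolution, (f*g)(x) = sum_t f(t) g(x+t)."""
    img = np.pad(img, p)                       # zero padding of size p
    k = kernel.shape[0]
    w0 = out_size(img.shape[0], k, 0, s)       # padding already applied
    h0 = out_size(img.shape[1], k, 0, s)
    out = np.zeros((w0, h0))
    for i in range(w0):
        for j in range(h0):
            patch = img[i * s:i * s + k, j * s:j * s + k]
            out[i, j] = np.sum(patch * kernel)
    return out
```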
Figure 10: Source: http://image.slidesharecdn.com/

but another advantage of the pooling is that it makes the network less sensitive to small translations of the input images.

3.1 Architectures

We have described the different types of layers composing a CNN. We now present how these layers are combined to form the architecture of the network. Choosing an architecture is very complex, and this is more a matter of engineering than an exact science. It is therefore important to study the architectures that have proved to be effective and to draw inspiration from these famous examples. In the most classical CNNs, we chain several times a convolution layer followed by a pooling layer, and we add fully connected layers at the end. The LeNet network, proposed by the inventor of the CNN, Yann LeCun [12], is of this type, as shown in Figure 12. This network was devoted to digit recognition. It is composed of only a few layers and a few filters, due to the computer limitations at that time.
A few years later, with the appearance of GPU (Graphical Processing Unit) cards, much more complex architectures for CNNs were proposed, like the AlexNet network (see [6]), which won the ImageNet competition and of which a simplified version is presented in Figure 13. This competition was devoted to the classification of one million color images into 1000 classes. The resolution of the images was 224 × 224. AlexNet is composed of 5 convolution layers, 3 max-pooling layers and fully connected layers. As shown in Figure 13, the kernel shape of the first convolution layer is (11, 11, 3, 96) with a stride of s = 4, and the first output shape is (55, 55, 96). We detail the architecture of the network in the following table:

Input       227 × 227 × 3
Conv 1      55 × 55 × 96      96 11 × 11 filters at stride 4, pad 0
Max Pool 1  27 × 27 × 96      3 × 3 filters at stride 2
Conv 2      27 × 27 × 256     256 5 × 5 filters at stride 1, pad 2
Max Pool 2  13 × 13 × 256     3 × 3 filters at stride 2
Conv 3      13 × 13 × 384     384 3 × 3 filters at stride 1, pad 1
Conv 4      13 × 13 × 384     384 3 × 3 filters at stride 1, pad 1
Conv 5      13 × 13 × 256     256 3 × 3 filters at stride 1, pad 1
Max Pool 3  6 × 6 × 256       3 × 3 filters at stride 2
FC1         4096              4096 neurons
FC2         4096              4096 neurons
FC3         1000              1000 neurons (softmax logits)

Figure 14 presents another example.

The network that won the competition in 2014 is the GoogLeNet network [1], which is a new kind of CNN, composed not only of successive convolution and pooling layers, but also of new modules called Inception, which are a kind of network within the network. An example is represented in Figure 15.

The most recent innovations concern the ResNet networks (see [1]). The originality of the ResNets is to add a connection linking the input of a layer (or a set of layers) to its output. In order to reduce the number of parameters, the ResNets do not have fully connected layers. GoogLeNet and ResNet are much deeper than the previous CNNs, but contain many fewer parameters. They are nevertheless more costly in memory than more classical CNNs such as VGG or AlexNet.

Figure 17 shows a comparison of the depth and of the performance of the different networks on the ImageNet challenge.
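As a quick sanity check of the table above, the spatial sizes follow from the formula W_0 = (W_i − k + 2p)/s + 1:

```python
def out_size(w_in, k, p, s):
    return (w_in - k + 2 * p) // s + 1

w = 227
for name, k, p, s in [("Conv 1", 11, 0, 4), ("Max Pool 1", 3, 0, 2),
                      ("Conv 2", 5, 2, 1),  ("Max Pool 2", 3, 0, 2),
                      ("Conv 3", 3, 1, 1),  ("Conv 4", 3, 1, 1),
                      ("Conv 5", 3, 1, 1),  ("Max Pool 3", 3, 0, 2)]:
    w = out_size(w, k, p, s)
    print(name, w)        # 55, 27, 27, 13, 13, 13, 13, 6
```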
Figure 16: Inception-v4, Inception-ResNet (Szegedy, C. et al., 2016 [1])

4 Recurrent neural networks

In order to infer from sequential data such as text or time series, Recurrent Neural Networks (RNN) are considered. The most simple recurrent networks were developed in the 1980's: a hidden layer at time t depends on the entry x_t at time t, but also on the same hidden layer at time t − 1, or on the output at time t − 1. We therefore have a loop from a hidden layer to itself, or from the output to the hidden layer, as shown in Figure 18.

Figure 18: Diagram of a RNN. Source: Understanding LSTM Networks by Christopher Olah - http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN may seem, at first glance, very different from classical neural networks. In fact, this is not the case: RNN can be seen as multiple copies of the same network (as in Figures 18 and 19), each passing information to its successor. This is the unrolled representation of a RNN, shown in Figure 19.

Figure 19: Unrolled representation of a RNN. Source: Understanding LSTM Networks by Christopher Olah - http://colah.github.io/posts/2015-08-Understanding-LSTMs/

In the simplest model, due to Elman, the hidden layer at time t is defined by

ẑ_l(t) = σ(Σ_j W_{l,j} x_j(t) + Σ_{l'} U_{l,l'} ẑ_{l'}(t − 1) + b_l),

where σ is an activation function. The neurons of the hidden layer that are looped to themselves are called context units. In the model introduced by Jordan, ẑ_l(t − 1) is replaced in the last equation by ŷ_l(t − 1); in this case, the context units are the output neurons. These models were first introduced for linguistic analysis, and are widely used in natural language processing. Nevertheless, the basic version of recurrent neural networks fails to learn long-term dependencies, and new architectures have been introduced to tackle this problem.

4.1 Long Short-Term Memory

In recent years, RNN have been successfully used again for various applications such as speech recognition, translation, image captioning... This success is mostly due to the performance of LSTMs (Long Short-Term Memory networks), a special kind of recurrent neural network. Long Short-Term Memory (LSTM) cells were introduced by Hochreiter and Schmidhuber (1997) [10], and were created in order to be able to learn long-term dependencies. An LSTM cell comprises, at time t, a state C_t and an output h_t. As input, this cell receives x_t, C_{t−1} and h_{t−1}. Inside the LSTM, the computations are defined by gates that allow or block the transmission of information. These computations are governed by the following equations, described in [10].
u_t = σ(W^u h_{t−1} + I^u x_t + b^u)   (update gate)
f_t = σ(W^f h_{t−1} + I^f x_t + b^f)   (forget gate)
c̃_t = tanh(W^c h_{t−1} + I^c x_t + b^c)   (cell candidate)
C_t = f_t ⊙ C_{t−1} + u_t ⊙ c̃_t   (cell state)
o_t = σ(W^o h_{t−1} + I^o x_t + b^o)   (output gate)
h_t = o_t ⊙ tanh(C_t)   (output)

The parameters, for an input of dimension N and H hidden units, are:

W^u, W^f, W^c, W^o   Recurrent weights   H × H
I^u, I^f, I^c, I^o   Input weights   N × H
b^u, b^f, b^c, b^o   Biases   H

Figure 20 reveals the main difference between a classical RNN and an LSTM. For a standard RNN, the repeated module A is very simple: it contains a single layer. For the LSTM, the repeated module contains four layers (the yellow boxes), interacting as described by the above equations.
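A minimal NumPy sketch of one LSTM cell step following these equations (the weight shapes are chosen so that the products are well-defined — the input weights are stored as H × N here — and the random initialization is purely illustrative):

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, params):
    """One time step of an LSTM cell; params maps gate names to (W, I, b)."""
    gates = {}
    for g in ("u", "f", "c", "o"):
        W, I, b = params[g]                       # W: H x H, I: H x N, b: H
        act = np.tanh if g == "c" else sigma      # cell candidate uses tanh
        gates[g] = act(W @ h_prev + I @ x_t + b)
    C_t = gates["f"] * C_prev + gates["u"] * gates["c"]   # cell state update
    h_t = gates["o"] * np.tanh(C_t)                       # cell output
    return h_t, C_t

# Illustrative usage with N = 3 inputs and H = 5 hidden units
rng = np.random.default_rng(0)
N, H = 3, 5
params = {g: (rng.normal(size=(H, H)), rng.normal(size=(H, N)), np.zeros(H))
          for g in ("u", "f", "c", "o")}
h, C = lstm_step(rng.normal(size=N), np.zeros(H), np.zeros(H), params)
```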
Exercise. — On Figure 20, place the different elements mentioned in the equations defining the LSTM.

There are also variants of LSTMs, and this field of research is still very active, aiming at more and more powerful models.
References

[1] Szegedy C., Ioffe S., Vanhoucke V., and Alemi A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv, 1602.07261, 2016.

[2] Hinton G.E., Srivastava N., Krizhevsky A., Sutskever I., and Salakhutdinov R. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

[10] Hochreiter S. and Schmidhuber J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[11] Sutskever I., Martens J., Dahl G.E., and Hinton G.E. On the importance of initialization and momentum in deep learning. ICML, 28(3):1139–1147, 2013.

[12] LeCun Y., Bottou L., Bengio Y., and Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[13] LeCun Y., Jackel L., Boser B., Denker J., Graf H., Guyon I., Henderson D., Howard R., and Hubbard W. Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine, 27(11):41–46, 1989.