Session 03 - Neural Networks
In this session
• We will learn about artificial neural networks (ANNs):
• Biological inspiration
• Architectures
• Perceptron
• Multi-layer perceptron (MLP)
• Training algorithm
• Controlling for overfitting
Definition
• A neural network is a massively parallel distributed processor made up of simple processing
units that has a natural propensity for storing experiential knowledge and making it
available for use.
[*] Example image: a computer algorithm recreates a Van Gogh painting in the style of Picasso. https://www.theguardian.com/technology/2015/sep/02/computer-algorithm-recreates-van-gogh-painting-picasso
Models of neurons
• The adder computes the induced local field of neuron $k$ from its inputs, weights, and bias (this is the output signal of the adder):
$$v_k = b_k + \sum_{j=1}^{m} w_{kj}\, x_j$$
• The output signal of the neuron is obtained by passing $v_k$ through the activation function $\varphi$:
$$y_k = \varphi(v_k) = \varphi\!\left(b_k + \sum_{j=1}^{m} w_{kj}\, x_j\right)$$
• In matrix form (a dot product), with the bias treated as a weight on a constant input of 1:
$$v_k = \left(b_k, w_{k1}, \dots, w_{km}\right)\begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_m \end{pmatrix} = \mathbf{w}_k^{T}\mathbf{x}, \qquad y_k = \varphi\!\left(\mathbf{w}_k^{T}\mathbf{x}\right)$$
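To make the notation concrete, here is a minimal NumPy sketch (not from the slides; all values are illustrative) of a single neuron computing $y_k = \varphi(\mathbf{w}_k^T \mathbf{x})$ with a sigmoid activation:

```python
import numpy as np

def sigmoid(v):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative values: m = 3 inputs feeding one neuron k
x = np.array([0.5, -1.0, 2.0])     # input signals x_1..x_m
w_k = np.array([0.2, 0.4, -0.1])   # synaptic weights w_k1..w_km
b_k = 0.3                          # bias

v_k = b_k + np.dot(w_k, x)         # output signal of the adder (induced local field)
y_k = sigmoid(v_k)                 # output signal of the neuron

# Matrix form: absorb the bias as a weight on a constant input of 1
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b_k], w_k))
assert np.isclose(v_k, np.dot(w_aug, x_aug))   # v_k = w_k^T x
print(v_k, y_k)
```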
Activation functions
• Linear function:
$$y_k = \varphi(v_k) = v_k = b_k + \sum_{j=1}^{m} w_{kj}\, x_j$$
• Step function, with threshold $\theta$:
$$y_k = \begin{cases} 1 & \text{if } v_k \geq \theta \\ 0 & \text{otherwise} \end{cases}$$
• Rectified linear (ReLU):
$$y_k = \begin{cases} v_k & \text{if } v_k > 0 \\ 0 & \text{otherwise} \end{cases}$$
Activation functions
• Logistic (sigmoid) function, with output range $[0, 1]$:
$$y_k = \varphi(v_k) = \frac{1}{1 + e^{-a v_k}}$$
where $a$ is the slope parameter of the sigmoid function. For $a = 1$, the derivative of the sigmoid is
$$\frac{\partial \varphi(v)}{\partial v} = \varphi(v)\,\bigl(1 - \varphi(v)\bigr)$$
• Hyperbolic tangent function:
$$y_k = \varphi(v_k) = \tanh\!\left(\frac{v_k}{2}\right) = \frac{1 - e^{-v_k}}{1 + e^{-v_k}}$$
Derivative of the tanh function:
$$\frac{\partial \varphi(v)}{\partial v} = \frac{1}{2}\,\bigl(1 + \varphi(v)\bigr)\bigl(1 - \varphi(v)\bigr)$$
Softmax function
• Let’s define $c$ as the number of output neurons. Each output neuron is defined as:
$$y_i = \frac{e^{v_i}}{\sum_{j=1}^{c} e^{v_j}}$$
• The use of softmax is appropriate when the network is to be used for estimating probabilities.
̶ E.g., each output $y_i$ can represent the probability that the corresponding neural network input belongs to class $C_i$ (a short code sketch of these activation functions follows below).
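As a quick reference, here is a hedged NumPy sketch of the activation functions above (illustrative, not taken from the slides); the softmax subtracts the maximum for numerical stability:

```python
import numpy as np

def step(v, theta=0.0):
    """Step function: 1 if v >= theta, else 0."""
    return np.where(v >= theta, 1.0, 0.0)

def relu(v):
    """Rectified linear: v if v > 0, else 0."""
    return np.maximum(0.0, v)

def sigmoid(v, a=1.0):
    """Logistic function with slope parameter a; output in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-a * v))

def tanh_act(v):
    """tanh(v / 2), equivalently (1 - e^-v) / (1 + e^-v)."""
    return np.tanh(v / 2.0)

def softmax(v):
    """Softmax over a vector of induced local fields v_1..v_c."""
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / np.sum(e)

v = np.array([1.0, -0.5, 2.0])
print(softmax(v), softmax(v).sum())   # outputs sum to 1, usable as class probabilities
```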
Perceptrons
• Perceptrons:
̶ the first generation of neural networks.
• A perceptron computes the induced local field
$$v_k = b_k + \sum_{j=1}^{m} w_{kj}\, x_j$$
and outputs 1 if $v_k \geq 0$, and 0 otherwise.
• Perceptron learning rule (sketched in code below):
o If the output unit is correct, leave its weights alone.
o If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
o If the output unit incorrectly outputs a one, subtract the input vector from the weight vector.
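A minimal NumPy sketch of this procedure (illustrative: the toy data, epoch count, and the convention of folding the bias into the weight vector as a constant input of 1 are assumptions):

```python
import numpy as np

def train_perceptron(X, t, epochs=20):
    """Perceptron learning rule: X holds one training case per row, t holds 0/1 targets."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a constant 1 so w[0] acts as the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            y_n = 1.0 if np.dot(w, x_n) >= 0 else 0.0
            if y_n == t_n:
                continue       # correct output: leave the weights alone
            elif t_n == 1:
                w += x_n       # incorrectly output a zero: add the input vector
            else:
                w -= x_n       # incorrectly output a one: subtract the input vector
    return w

# Toy linearly separable problem (logical AND)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)
print(train_perceptron(X, t))
```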
• The perceptron convergence procedure works by ensuring that every time the weights
change, they get closer to every “generously feasible” set of weights.
̶ This type of guarantee cannot be extended to more complex networks in which the
average of two good solutions may be a bad solution.
• Differentiating the error with respect to weight $w_i$ and summing over training cases $n$ gives
$$\frac{\partial E}{\partial w_i} = -\sum_{n} x_i^{\,n}\,\bigl(t^{\,n} - y^{\,n}\bigr)$$
• The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases:
$$\Delta w_i = -\varepsilon\,\frac{\partial E}{\partial w_i} = \varepsilon \sum_{n} x_i^{\,n}\,\bigl(t^{\,n} - y^{\,n}\bigr)$$
(A minimal sketch of this update follows below.)
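A hedged NumPy sketch of the batch delta rule for a linear neuron ($y^n = \mathbf{w}\cdot\mathbf{x}^n$); the toy data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def batch_delta_rule(X, t, eps=0.005, epochs=2000):
    """Batch delta rule for a linear neuron y^n = w . x^n."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = X @ w                  # outputs for all training cases
        grad = -X.T @ (t - y)      # dE/dw_i = -sum_n x_i^n (t^n - y^n)
        w -= eps * grad            # delta w_i = eps * sum_n x_i^n (t^n - y^n)
    return w

# Toy example: recover the weights of a noiseless linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
t = X @ true_w
print(batch_delta_rule(X, t))      # approximately [1.5, -2.0, 0.5]
```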
• For multilayer non-linear nets the error surface is much more complicated.
• As long as the weights are not too big, the error surface will still be smooth, but it may have many local minima.
[Figure: vertical and horizontal cross sections of the error surface over weights $w_1$ and $w_2$.]
• We can compute error derivatives for all the hidden units efficiently at the same time.
• Once we have the error derivatives for the hidden activities, it is easy to get the error derivatives for the weights going into a hidden unit (see the sketch below).
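To illustrate, here is a hedged NumPy sketch of backpropagation for a single training case in a network with one hidden layer of sigmoid units and a linear output; the shapes, names, and squared-error loss are illustrative assumptions:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_single_case(x, t, W1, b1, W2, b2):
    """One forward and backward pass; returns error derivatives for all weights."""
    # Forward pass
    h = sigmoid(W1 @ x + b1)      # hidden activities
    y = W2 @ h + b2               # linear output
    # Backward pass for E = 0.5 * ||y - t||^2
    dE_dy = y - t
    dE_dW2 = np.outer(dE_dy, h)   # derivatives for hidden-to-output weights
    dE_db2 = dE_dy
    dE_dh = W2.T @ dE_dy          # error derivatives for ALL hidden activities at once
    dE_dv1 = dE_dh * h * (1 - h)  # back through the sigmoid: phi'(v) = phi(v)(1 - phi(v))
    dE_dW1 = np.outer(dE_dv1, x)  # derivatives for input-to-hidden weights
    dE_db1 = dE_dv1
    return dE_dW1, dE_db1, dE_dW2, dE_db2

# Illustrative shapes: 3 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(1)
x, t = rng.normal(size=3), np.array([0.5])
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
dE_dW1, dE_db1, dE_dW2, dE_db2 = backprop_single_case(x, t, W1, b1, W2, b2)
print(dE_dW1.shape, dE_dW2.shape)   # (4, 3) (1, 4)
```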
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \varepsilon\,\frac{\partial E}{\partial \mathbf{w}^{(t)}}$$
where $\varepsilon > 0$ is the learning rate.
• After each such update, the gradient is re-evaluated for the new weight vector and the
process repeated.
• Note that the error function is defined with respect to a training set, and so each step
requires that the entire training set be processed in order to evaluate the gradient.
[*] Bishop, Pattern Recognition and Machine Learning. Springer, 2006. [page 240]
Overfitting
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
• So it fits both kinds of regularity.
• If the model is very flexible, it can model the sampling error really well. This is a disaster.
Autoencoders
• An auto-encoder is trained, with an absolutely standard weight-adjustment algorithm, to reproduce its input.
• Making this happen with (many) fewer hidden units than inputs forces the ‘hidden layer’ units to become good feature detectors (a minimal sketch follows below).
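A hedged Keras sketch of this idea, assuming TensorFlow is available; the data, layer sizes, and hyperparameters are illustrative rather than taken from the slides:

```python
import numpy as np
import tensorflow as tf   # assumption: TensorFlow/Keras is installed

# Illustrative data: 1000 examples with 64 input features
X = np.random.rand(1000, 64).astype("float32")

inputs = tf.keras.Input(shape=(64,))
code = tf.keras.layers.Dense(8, activation="relu")(inputs)        # bottleneck: far fewer units than inputs
outputs = tf.keras.layers.Dense(64, activation="sigmoid")(code)   # reconstruction of the input

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)        # target = input: learn to reproduce it

encoder = tf.keras.Model(inputs, code)
features = encoder.predict(X, verbose=0)    # hidden-layer activities serve as learned features
print(features.shape)                       # (1000, 8)
```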
Deep learning
• Deep learning means using a neural network with several layers of nodes between input and output.
• The series of layers between input and output can be trained as autoencoders that do feature identification and processing in a series of stages, just as our brains seem to.
Why is DL useful?
o Manually designed features are often over-specified, incomplete, and take a long time to design and validate.
o Learned features are easy to adapt and fast to learn.
o Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual, and linguistic information.
o It can learn in both unsupervised and supervised settings.
o It enables effective end-to-end joint system learning.
o It can utilize large amounts of training data.
Convolutional Neural Networks (CNNs)
• Fully connected networks tend to learn slowly on some datasets, such as images.
• One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units.
• This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology.
• Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).
• Natural images have the property of being "stationary", meaning that the statistics of one part of the image are the same as any other part. This suggests that the features we learn at one part of the image can also be applied to other parts of the image, and we can use the same features at all locations.
CNN architecture
1) Convolutional filter: a small filter (e.g., 3x3) is convolved (*) with the input matrix to produce a feature map.
2) Max-pooling filter: the feature map is down-sampled by keeping only the maximum value within each small region.
(Both operations are sketched in code below.)
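A hedged NumPy sketch of these two operations: a "valid" 2D convolution (implemented as cross-correlation, as most deep-learning libraries do) followed by non-overlapping 2x2 max pooling; the example matrices are illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D cross-correlation of a small filter over the input matrix."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: keep the maximum of each size x size region."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.max(feature_map[i * size:(i + 1) * size,
                                           j * size:(j + 1) * size])
    return out

image = np.arange(36, dtype=float).reshape(6, 6)                       # toy 6x6 input matrix
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)   # 3x3 vertical-edge filter
fmap = conv2d(image, kernel)    # 4x4 feature map
pooled = max_pool2d(fmap)       # 2x2 after max pooling
print(fmap.shape, pooled.shape)
```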
Feature extraction
From image to text
Summary
• We learnt about artificial neural networks, typical architectures, how perceptrons and multi-layer perceptrons learn, and how to control for learning issues such as overfitting.