Session 03 - Neural Networks

This document summarizes a session on neural networks and deep learning presented by Dr. Ivan Olier. The session covered biological inspiration for artificial neural networks, common network architectures like perceptrons and multi-layer perceptrons, training algorithms, and controlling overfitting. It also defined neural networks, described their benefits like nonlinear learning and adaptation, and covered concepts like activation functions, softmax functions, and representing probabilities with network outputs.



Session 3 – Neural Networks and deep learning
Dr Ivan Olier
[email protected]
ECI – International Summer School / Machine Learning
2019

In this session
• We will learn about artificial neural networks (ANNs):

• Biological inspiration
• Architectures
• Perceptron
• Multi-layer perceptron (MLP)
• Training algorithm
• Controlling for overfitting


Definition
• A neural network is a massively parallel distributed processor made up of simple processing
units that has a natural propensity for storing experiential knowledge and making it
available for use.

• It resembles the brain in two respects:


1. Knowledge is acquired by the network from its environment through a learning
process.
2. Inter-neuron connection strengths, known as synaptic weights, are used to store the
acquired knowledge.


More about neural networks


• A neural network derives its computing power through:
• Its massively parallel distributed structure and
• its ability to learn and therefore generalise.

• Generalisation refers to the production of reasonable outputs for


inputs not encountered during training (learning).

A curious application: ‘A Neural Net Algorithm of Artistic Style’

Publication: ‘Image style transfer using Convolutional Neural Networks’, 2016.
[Figure: the Eiffel tower gets the Van Gogh treatment. Credit: kaishengtai/GitHub]

[*] https://www.theguardian.com/technology/2015/sep/02/computer-algorithm-recreates-van-gogh-painting-picasso


Benefits of neural networks


• Neural networks are inherently nonlinear:
• the network learns from the examples by constructing an input–output mapping for the
problem at hand.
• They have the capability to adapt their synaptic weights to changes in the surrounding
environment.
• They can be designed to provide information not only about which particular pattern to
select, but also about the confidence in the decision made.
• This may be used to reject ambiguous patterns, and thereby improve classification
performance.
• Knowledge is represented by the very structure and activation state of a neural net.
• Every neuron in the network is potentially affected by the global activity of all other
neurons in the network.
• Consequently, contextual information is dealt with naturally by a neural network.


Models of neurons

Output signal of the adder:

$v_k = b_k + \sum_{j=1}^{m} w_{kj} x_j$

Output signal of the neuron:

$y_k = \varphi(v_k) = \varphi\left( b_k + \sum_{j=1}^{m} w_{kj} x_j \right) = \varphi(\mathbf{w}_k^T \mathbf{x})$

In matrix form (a dot product):

$v_k = [\, b_k, w_{k1}, \ldots, w_{km} \,]\,[\, 1, x_1, \ldots, x_m \,]^T = \mathbf{w}_k^T \mathbf{x}$
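
As a concrete illustration (a minimal sketch, not part of the original slides; the function and variable names are assumptions), the neuron model above can be written directly in NumPy:

import numpy as np

def neuron_output(w, b, x, phi=lambda v: v):
    """Compute y_k = phi(v_k), where v_k = b_k + sum_j w_kj * x_j."""
    v = b + np.dot(w, x)      # output signal of the adder
    return phi(v)             # activation function applied to v_k

# Example with 3 inputs and the identity activation
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])
print(neuron_output(w, b=0.1, x=x))   # 0.1 + 0.5 - 2.0 + 1.0 = -0.4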


Activation functions

Linear function:

$y_k = \varphi(v_k) = b_k + \sum_{j=1}^{m} w_{kj} x_j$

Step (threshold) function:

$y_k = \begin{cases} 1 & \text{if } v_k \ge \theta \\ 0 & \text{otherwise} \end{cases}$

Rectified linear (ReLU):

$y_k = \begin{cases} v_k & \text{if } v_k > 0 \\ 0 & \text{otherwise} \end{cases}$
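
A minimal sketch (assumed code, not from the slides) of these three activations in NumPy:

import numpy as np

def linear(v):              # identity: y = v
    return v

def step(v, theta=0.0):     # 1 if v >= theta, else 0
    return np.where(v >= theta, 1.0, 0.0)

def relu(v):                # v if v > 0, else 0
    return np.maximum(v, 0.0)

v = np.array([-1.5, 0.0, 2.0])
print(step(v), relu(v))     # [0. 1. 1.] [0. 0. 2.]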


Activation functions

Logistic (sigmoid) function (output range: [0, 1]):

$y_k = \varphi(v_k) = \frac{1}{1 + e^{-a v_k}}$

where $a$ is the slope parameter of the sigmoid function.

Derivative of the sigmoid function:

$\frac{\partial \varphi(v)}{\partial v} = \varphi(v) \left( 1 - \varphi(v) \right)$

Hyperbolic tangent (tanh) function:

$y_k = \varphi(v_k) = \tanh\left( \frac{v_k}{2} \right) = \frac{1 - e^{-v_k}}{1 + e^{-v_k}}$

Derivative of the tanh function:

$\frac{\partial \varphi(v)}{\partial v} = \frac{1}{2} \left( 1 + \varphi(v) \right) \left( 1 - \varphi(v) \right)$
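
To make the derivative identities concrete, here is a small sketch (assumed, not from the slides) that checks them against a finite-difference approximation:

import numpy as np

def sigmoid(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def d_sigmoid(v, a=1.0):           # slide identity (stated for a = 1): phi * (1 - phi)
    p = sigmoid(v, a)
    return a * p * (1.0 - p)

def tanh_half(v):
    return np.tanh(v / 2.0)        # equals (1 - e^{-v}) / (1 + e^{-v})

def d_tanh_half(v):                # slide identity: (1/2) (1 + phi) (1 - phi)
    p = tanh_half(v)
    return 0.5 * (1.0 + p) * (1.0 - p)

v, h = 0.7, 1e-6
print(np.isclose(d_sigmoid(v), (sigmoid(v + h) - sigmoid(v - h)) / (2 * h)))        # True
print(np.isclose(d_tanh_half(v), (tanh_half(v + h) - tanh_half(v - h)) / (2 * h)))  # True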

Softmax activation function

• One approach toward approximating probabilities is to choose the softmax function, which
is an exponential function with a normalised output such that the sum of all outputs is 1.

• Let’s define $c$ as the number of output neurons. Each output neuron is defined as:

$y_i = \frac{e^{v_i}}{\sum_{j=1}^{c} e^{v_j}}$

• The use of softmax is appropriate when the network is to be used for estimating
probabilities (a code sketch follows below).
̶ E.g., each output $y_i$ can represent a probability that the corresponding neural network
input belongs to class $C_i$.
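
A minimal sketch (assumed, not from the slides) of the softmax function:

import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # subtracting max(v) for numerical stability; result unchanged
    return e / e.sum()

v = np.array([2.0, 1.0, 0.1])
y = softmax(v)
print(y, y.sum())               # all outputs lie in [0, 1] and they sum to 1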


Neural network architectures

Feed-forward neural networks

• These are the commonest type of neural network in practical applications.
̶ The first layer is the input and the last layer is the output.
̶ If there is more than one hidden layer, we call them “deep” neural networks.
• They compute a series of transformations that change the similarities between cases.
̶ The activities of the neurons in each layer are a non-linear function of the activities in
the layer below.

[Figure: a feed-forward network with an input layer, a hidden layer and an output layer;
signals flow from the inputs to the outputs.]


Neural network architectures

Recurrent neural networks

• These have directed cycles in their connection graph:
̶ That means you can sometimes get back to where you started by following the arrows.
• They can have complicated dynamics and this can make them very difficult to train.
̶ There is a lot of interest at present in finding efficient ways of training recurrent nets.
• They are more biologically realistic.
• Recurrent neural networks are a very natural way to model sequential data.

Note: recurrent nets with multiple hidden layers are just a special case that has some of the
hidden → hidden connections missing.

Neural network architectures


Symmetrically connected networks

• These are like recurrent networks, but the connections between


units are symmetrical (they have the same weight in both
directions).
̶ John Hopfield (and others) realized that symmetric networks
are much easier to analyse than recurrent networks.
̶ They are also more restricted in what they can do, because
they obey an energy function.
▪ For example, they cannot model cycles.

• Symmetrically connected nets without hidden units are called


“Hopfield nets”.


Neural network architectures

Symmetrically connected networks with hidden units

• These are called “Boltzmann machines”:
• They have a beautifully simple learning algorithm.
• The Boltzmann machine would theoretically be a rather general computational medium.
• For instance, if trained on photographs, the machine would theoretically model the
distribution of photographs, and could use that model to, for example, complete a partial
photograph.
• Unfortunately, Boltzmann machines have scalability issues: in order to reach convergence
(equilibrium), the statistics grow exponentially with the machine’s size and connection
strengths.

[Figure: a small Boltzmann machine. Each undirected edge represents a dependency. In this
example there are 3 hidden units and 4 visible units. This is not a restricted Boltzmann
machine.]
[*] Ackley, Hinton, Sejnowski (1985). A Learning Algorithm for Boltzmann Machines. Cognitive Science, 9(1):147-169.

Neural network architectures

Symmetrically connected networks with hidden units

• Although learning is impractical in general Boltzmann machines, it can be made quite
efficient in an architecture called the Restricted Boltzmann Machine (RBM).
• An RBM does not allow intra-layer connections (dependencies):
• it only allows connections between hidden and visible units.
• After training one RBM, the activities of its hidden units can be treated as data for
training a higher-level RBM.
• This method of stacking RBMs makes it possible to train many layers of hidden units
efficiently and is one of the most common deep learning strategies.
• As each new layer is added, the overall generative model gets better.

[Figure: a restricted Boltzmann machine, with connections only between the visible units and
the hidden units.]

Perceptrons

• Perceptrons:
̶ the first generation of neural networks.

• They were popularised by Frank Rosenblatt in the early 1960s.
̶ They appeared to have a very powerful learning algorithm.
̶ Lots of grand claims were made for what they could learn to do.

• In 1969, Minsky and Papert published a book called “Perceptrons” that analysed what they
could do and showed their limitations.
̶ Many people thought these limitations applied to all neural network models.

[Figure: the Mark I Perceptron machine was the 1st implementation of the perceptron
algorithm.]


Perceptron as a binary classifier

The perceptron computes a weighted sum of its inputs and applies a step function with
threshold 0:

$v_k = b_k + \sum_{j=1}^{m} w_{kj} x_j, \qquad y_k = \begin{cases} 1 & \text{if } v_k \ge 0 \\ 0 & \text{otherwise} \end{cases}$

Steps for training the perceptron as a classifier:
̶ Pick training cases using any policy that ensures that every training case will keep
getting picked.
o If the output unit is correct, leave its weights alone.
o If the output unit incorrectly outputs a zero, add the input vector to the weight
vector.
o If the output unit incorrectly outputs a 1, subtract the input vector from the weight
vector.

Perceptron as a binary classifier

• Steps for training binary output neurons as classifiers – in more detail:
1. Add an extra component for the bias: $x_0^n = 1$. Initialise the weights and the threshold.
2. For each example $n$ in the training set, perform the following steps over the input
$\mathbf{x}^n$ (with $m$ features) and desired output (target) $t^n$:
a) Calculate the actual output in this iteration $i$:
$y^n(i) = f\left( \mathbf{w}(i) \cdot \mathbf{x}^n \right) = f\left( w_0(i)\, x_0^n + w_1(i)\, x_1^n + \cdots + w_m(i)\, x_m^n \right)$
b) Update the weights (for each $j = 0, \ldots, m$):
$w_j(i+1) = w_j(i) + \left( t^n - y^n(i) \right) x_j^n$
3. Repeat step 2 until the error is ‘acceptable’. (A minimal code sketch of these steps
follows below.)

• Convergence:
̶ While the algorithm is guaranteed to converge on some solution in the case of a linearly
separable training set, it may still pick any solution, and problems may admit many
solutions of varying quality.
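
A minimal sketch (assumed, not from the slides) of the perceptron training rule above, applied to the linearly separable AND function:

import numpy as np

def train_perceptron(X, t, epochs=20):
    """X: (n_samples, m) inputs; t: (n_samples,) targets in {0, 1}."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])        # prepend x_0 = 1 for the bias
    w = np.zeros(X.shape[1])                            # initialise weights (and bias) to zero
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            y_n = 1.0 if np.dot(w, x_n) >= 0 else 0.0   # step activation with threshold 0
            w += (t_n - y_n) * x_n                      # w_j <- w_j + (t^n - y^n) x_j^n
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)
w = train_perceptron(X, t)
print([1.0 if np.dot(w, np.r_[1.0, x]) >= 0 else 0.0 for x in X])   # expect [0.0, 0.0, 0.0, 1.0]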

From perceptron to deep learning neural networks

• Perceptron: 1957
• Multi-layer perceptron: 1973, 80s
• Deep learning neural network[*]: 2010s

[*] We will come back to them in the last session.



Multi-layer perceptron (MLP)


• Networks without hidden units are very limited in the input-output mappings they can
learn to model.
̶ More layers of linear units do not help. It is still linear.
̶ Fixed output non-linearities are not enough.

• The perceptron convergence procedure works by ensuring that every time the weights
change, they get closer to every “generously feasible” set of weights.
̶ This type of guarantee cannot be extended to more complex networks in which the
average of two good solutions may be a bad solution.

• So “multi-layer” neural networks do not use the perceptron learning procedure.


̶ They should never have been called multi-layer perceptrons.

We want a method that can be generalised to


multi-layer, non-linear neural networks.

Deriving the delta rule

1. Define the error as the squared residuals summed over all training cases
($n \in$ training set):

$E = \frac{1}{2} \sum_n (t^n - y^n)^2$

2. Now differentiate to get the error derivatives for the weights:

$\frac{\partial E}{\partial w_i} = \frac{1}{2} \sum_n \frac{\partial y^n}{\partial w_i} \frac{d E^n}{d y^n} = -\sum_n x_i^n (t^n - y^n)$

3. The batch delta rule changes the weights in proportion to their error derivatives summed
over all training cases (sketched in code below):

$\Delta w_i = -\varepsilon \frac{\partial E}{\partial w_i} = \sum_n \varepsilon\, x_i^n (t^n - y^n)$
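
A minimal sketch (assumed, not from the slides) of the batch delta rule for a single linear neuron $y^n = \mathbf{w} \cdot \mathbf{x}^n$:

import numpy as np

def batch_delta_rule(X, t, lr=0.01, epochs=500):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = X @ w                   # predictions for all training cases
        dE_dw = -X.T @ (t - y)      # dE/dw_i = -sum_n x_i^n (t^n - y^n)
        w -= lr * dE_dw             # delta rule: w <- w - eps * dE/dw
    return w

# Recover a known linear mapping t = 2*x1 - 1*x2 from noiseless data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
t = X @ np.array([2.0, -1.0])
print(batch_delta_rule(X, t))       # approximately [ 2. -1.]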


The error surface for a linear neuron

The error surface means a surface that lies in a space where the horizontal axes correspond
to the weights of the neural net, and the vertical axis corresponds to the error it makes.
• For a linear neuron with a squared error, that surface always forms a quadratic bowl.
• Vertical cross sections are parabolas.
• Horizontal cross sections are ellipses.

For multilayer non-linear nets the error surface is much more complicated.
• As long as the weights are not too big, the error surface will still be smooth, but it may
have many local minima.

[Figure: the error surface E of a linear neuron with two input weights w1 and w2; vertical
cross sections are parabolas and horizontal cross sections are ellipses.]


The error surface for a linear neuron

• Using this error surface we can get a picture of what is happening as we do gradient
descent learning using the delta rule.
• So what the delta rule does is:
– it computes the derivative of the error with respect to the weights, and if you change
the weights in proportion to that derivative, that is equivalent to doing steepest
descent on the error surface.
– To put it another way, if we look at the error surface from above, we get elliptical
contour lines, and the delta rule moves at right angles to those contour lines. That is
what happens with what is called batch learning, where we get the gradient summed over
all training cases.

The simplest kind of batch learning does steepest descent on the error surface.
– This travels perpendicular to the contour lines.

The idea behind backpropagation


• We do not know what the hidden units ought to do, but we can compute how fast the error
changes as we change a hidden activity on a particular training case.
• Instead of using desired activities to train the hidden units, use error derivatives w.r.t.
hidden activities.
• Each hidden activity can affect many output units and can therefore have many
separate effects on the error. These effects must be combined.

• We can compute error derivatives for all the hidden units efficiently at the same time.
• Once we have the error derivatives for the hidden activities, it’s easy to get the error
derivatives for the weights going into a hidden unit.


Sketch of the backpropagation algorithm on a single case

1. Convert the discrepancy between each output and its target value into an error derivative.
We define the error as the squared difference between the target value of each output
unit $j$ and the actual value that the net produces for that unit:

$E = \frac{1}{2} \sum_{j \in \text{output}} (t_j - y_j)^2$

Differentiating gives an expression for how the error changes as we change the activity
of an output unit $j$:

$\frac{\partial E}{\partial y_j} = -(t_j - y_j)$

2. Then compute error derivatives in each hidden layer from the error derivatives in the
layer above.
The core of backpropagation is taking error derivatives in one layer (layer $j$) and from
them computing the error derivatives in the layer that comes before it (layer $i$): we want
to compute $\partial E / \partial y_i$ from $\partial E / \partial y_j$.

3. Then use the error derivatives w.r.t. activities to get error derivatives w.r.t. the
incoming weights. (A code sketch of these three steps follows below.)
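
A minimal sketch (assumed, not from the slides) of backpropagation on a single training case, for a tiny network with one sigmoid hidden layer and linear output units, using the squared error above:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_single_case(x, t, W1, W2):
    # Forward pass
    h = sigmoid(W1 @ x)                  # hidden activities y_i
    y = W2 @ h                           # output activities y_j (linear output units)
    # Backward pass
    dE_dy = -(t - y)                     # step 1: dE/dy_j = -(t_j - y_j)
    dE_dh = W2.T @ dE_dy                 # step 2: error derivatives w.r.t. hidden activities
    dE_dv1 = dE_dh * h * (1.0 - h)       # through the sigmoid: dy/dv = y (1 - y)
    dE_dW2 = np.outer(dE_dy, h)          # step 3: derivatives w.r.t. the incoming weights
    dE_dW1 = np.outer(dE_dv1, x)
    return dE_dW1, dE_dW2

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
g1, g2 = backprop_single_case(np.array([0.5, -1.0]), np.array([1.0]), W1, W2)
print(g1.shape, g2.shape)                # (3, 2) (1, 3)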

Limitation of gradient descent with backpropagation


• Gradient descent with backpropagation is not
guaranteed to find the global minimum of the
error function, but only a local minimum.
• Also, it has trouble crossing plateaux in the error
function landscape.
• This issue, caused by the non-convexity of error
functions in neural networks, was long thought to
be a major drawback.
• But in a 2015 review article[*], the authors argue
that in many practical problems, it is not.

[*] LeCun et al., “Deep learning”. Nature, 521(7553):436-444, 2015.



Gradient descent [*]

• The simplest approach to using gradient information is to choose the weight update in
$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \Delta \mathbf{w}^{(t)}$, where $t$ labels the iteration step, to comprise a small step in the
direction of the negative gradient, so that

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \varepsilon \frac{\partial E}{\partial \mathbf{w}^{(t)}}$

where $\varepsilon > 0$ is the learning rate.

• After each such update, the gradient is re-evaluated for the new weight vector and the
process repeated.
• Note that the error function is defined with respect to a training set, and so each step
requires that the entire training set be processed in order to evaluate the gradient (a code
sketch of this update follows below).

[*] Bishop, Pattern Recognition and Machine Learning. Springer, 2006. [page 240]
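
A minimal sketch (assumed, not from the slides) of this generic batch gradient-descent update, for any error function whose gradient is available:

import numpy as np

def gradient_descent(grad_E, w0, eps=0.1, steps=100):
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eps * grad_E(w)    # small step in the direction of the negative gradient
    return w

# Example error: E(w) = (w1 - 3)^2 + (w2 + 1)^2, minimised at w = (3, -1)
grad = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
print(gradient_descent(grad, w0=[0.0, 0.0]))   # approximately [ 3. -1.]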

Overfitting: the downside of using powerful models


• The training data contains information about the regularities in the mapping from input to
output. But it also contains two types of noise.
• The target values may be unreliable (usually only a minor worry).
• There is sampling error. There will be accidental regularities just because of the
particular training cases that were chosen.

• When we fit the model, it cannot tell which regularities are real and which are caused by
sampling error.
• So it fits both kinds of regularity.
• If the model is very flexible it can model the sampling error really well. This is a disaster.


Ways to reduce overfitting


A large number of different methods have been developed:
• Weight-decay:
• where you try and keep the weights of the networks small. It will make the model
simpler.
• Weight-sharing:
• where again, you make the model simpler by insisting that many of the weights have
exactly the same value as each other. You do not know what the value is and you are
going to learn it but it has to be exactly the same for many of the weights.
• Early stopping:
• where you make yourself a fake test set. And as you are training the network, you peek
at what is happening on this fake test set. And once the performance on the fake test set
starts getting worse, you stop training (see the sketch after this list).
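
A minimal sketch (assumed, not from the slides) of early stopping; the two callables, and the "patience" refinement of waiting a few epochs before stopping, are assumptions for illustration:

def train_with_early_stopping(train_step, validation_error, max_epochs=1000, patience=5):
    """train_step(): runs one epoch of training; validation_error(): error on the fake test set."""
    best_err, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        err = validation_error()
        if err < best_err:
            best_err, epochs_without_improvement = err, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:   # performance has stopped improving
                break
    return epoch, best_err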


Ways to reduce overfitting


• Model averaging:
• where you train lots of different neural networks. And you average them together in the
hope that that will reduce the errors you are making.
• Bayesian fitting of neural networks:
• which is a fancy form of model averaging.
• Dropout:
• where you try and make your model more robust by randomly omitting hidden units
when you're training it (see the sketch after this list).
• Generative pre-training:
• which is somewhat more complicated.
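
A minimal sketch (assumed, not from the slides) of dropout applied to a vector of hidden activities, here using the common "inverted dropout" scaling so that nothing changes at test time:

import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    if not training:
        return h                           # keep all units at test time
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= p        # keep each hidden unit with probability 1 - p
    return h * mask / (1.0 - p)            # rescale so the expected activity is unchanged

h = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout(h, p=0.5))                   # roughly half of the units are zeroed at random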


Autoencoders
• An auto-encoder is trained, with an absolutely standard weight-adjustment algorithm, to
reproduce its input.
• By making this happen with (many) fewer hidden units than inputs, this forces the ‘hidden
layer’ units to become good feature detectors (see the sketch below).
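
A minimal sketch (assumed, not from the slides) of a tiny linear autoencoder: 4 inputs compressed to 2 hidden units and decoded back to 4 outputs, trained with plain gradient descent to reproduce its input:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4)) * 0.5    # correlated 4-d data
W_enc, W_dec = rng.normal(size=(4, 2)) * 0.1, rng.normal(size=(2, 4)) * 0.1

def recon_error():
    return np.mean((X - (X @ W_enc) @ W_dec) ** 2)

print(recon_error())                     # before training: reconstructions are poor
lr = 0.01
for _ in range(2000):
    H = X @ W_enc                        # hidden code (2 features, fewer than the 4 inputs)
    err = H @ W_dec - X                  # derivative of 1/2 * ||reconstruction - input||^2
    grad_dec = H.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
print(recon_error())                     # after training: lower reconstruction error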


Deep learning

• Deep learning means using a neural network with several layers of nodes
between input and output.
• The series of layers between input & output are autoencoders that do feature
identification and processing in a series of stages, just as our brains seem to.

What is Deep Learning (DL)?

• A machine learning subfield of learning representations of data. Exceptionally effective
at learning patterns.

• Deep learning algorithms attempt to learn (multiple levels of) representation by using a
hierarchy of multiple layers.

• If you provide the system tons of information, it begins to understand it and respond in
useful ways.

[Image: https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png]

Why is DL useful?
o Manually designed features are often
over-specified, incomplete and take a
long time to design and validate
o Learned Features are easy to adapt,
fast to learn
o Deep learning provides a very flexible,
(almost?) universal, learnable
framework for representing world,
visual and linguistic information.
o Can learn both unsupervised and
supervised
o Effective end-to-end joint system
learning
o Utilize large amounts of training data

Convolutional Neural Networks (CNNs)

• Fully connected networks tend to learn slowly on some datasets, such as images.
• One simple solution to this problem is to restrict the connections between the hidden
units and the input units, allowing each hidden unit to connect to only a small subset of
the input units.
• This idea of having locally connected networks also draws inspiration from how the early
visual system is wired up in biology.
• Specifically, neurons in the visual cortex have localized receptive fields (i.e., they
respond only to stimuli in a certain location).
• Natural images have the property of being ‘stationary’, meaning that the statistics of one
part of the image are the same as any other part. This suggests that the features that we
learn at one part of the image can also be applied to other parts of the image, and we can
use the same features at all locations.

CNN architecture

1) Convolutional filter
[Figure: an input matrix convolved (*) with a 3x3 convolutional filter.]

2) Max-pooling filter
[Figure: a max-pooling filter.]

(A minimal sketch of both operations follows below.)
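
A minimal sketch (assumed, not from the slides) of the two CNN building blocks above: a 3x3 convolutional filter slid over an input matrix, followed by 2x2 max-pooling:

import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)   # local dot product
    return out

def max_pool(x, size=2):
    out = np.zeros((x.shape[0] // size, x.shape[1] // size))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = x[r * size:(r + 1) * size, c * size:(c + 1) * size].max()
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0              # a simple 3x3 averaging filter
print(max_pool(conv2d(image, kernel)))      # conv output is 4x4, pooled output is 2x2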


Convolutional neural networks (CNNs)


Feature extraction


Recurrent Neural Networks (RNNs)

• RNNs can be stacked up many times.
• They suffer from the vanishing gradient problem (they tend to forget).

The hidden state at time $t$ is (implemented in the sketch below):

$h_t = \sigma\left( W^{(hh)} h_{t-1} + W^{(hx)} x_t \right)$
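
A minimal sketch (assumed, not from the slides) of this recurrence unrolled over a short input sequence; choosing tanh as the nonlinearity sigma is an assumption:

import numpy as np

def rnn_forward(xs, W_hh, W_hx, h0):
    h = xs and h0 or h0          # start from the initial hidden state h0
    h = h0
    states = []
    for x_t in xs:                                   # one step per element of the sequence
        h = np.tanh(W_hh @ h + W_hx @ x_t)           # h_t = sigma(W_hh h_{t-1} + W_hx x_t)
        states.append(h)
    return states

rng = np.random.default_rng(2)
W_hh, W_hx = rng.normal(size=(4, 4)) * 0.1, rng.normal(size=(4, 3)) * 0.1
xs = [rng.normal(size=3) for _ in range(5)]          # a sequence of five 3-dimensional inputs
print(len(rnn_forward(xs, W_hh, W_hx, h0=np.zeros(4))))   # 5 hidden states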

From image to text


Summary
• We learnt about artificial neural networks, typical architectures, how perceptrons
and multi-layer perceptrons learn, and how to control for learning issues such as
overfitting.

• We also learnt about deep learning, which is basically an extension of traditional
neural networks.
