This document provides legal notices and disclaimers for an informational presentation by Intel. It states that the presentation is for informational purposes only and that Intel makes no warranties. It also notes that Intel technologies' features and benefits depend on system configuration. Finally, it specifies that the sample source code in the presentation is released under the Intel Sample Source Code License Agreement and that Intel and its logo are trademarks.
NEURAL NETWORK IN MACHINE LEARNING FOR STUDENTS (hemasubbu08)
- Artificial neural networks are computational models inspired by the human brain that use algorithms to mimic brain functions. They are made up of simple processing units (neurons) connected in a massively parallel distributed system. Knowledge is acquired through a learning process that adjusts synaptic connection strengths.
- Neural networks can be used for pattern recognition, function approximation, and associative memory in domains like speech recognition, image classification, and financial prediction. They offer flexible inputs, resistance to errors, and fast evaluation, though interpretation is difficult.
- The document discusses multi-layer perceptrons (MLPs), a type of artificial neural network. MLPs have multiple layers of nodes and can classify non-linearly separable data using backpropagation.
- It describes the basic components and working of perceptrons, the simplest type of neural network, and how they led to the development of MLPs. MLPs use backpropagation to calculate error gradients and update weights between layers.
- Various concepts are explained like activation functions, forward and backward propagation, biases, and error functions used for training MLPs. Applications mentioned include speech recognition, image recognition and machine translation.
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ... (Simplilearn)
- TensorFlow is a popular deep learning library that provides both C++ and Python APIs to make working with deep learning models easier. It supports both CPU and GPU computing and has a faster compilation time than other libraries like Keras and Torch.
- Tensors are multidimensional arrays that represent inputs, outputs, and parameters of deep learning models in TensorFlow. They are the fundamental data structure that flows through graphs in TensorFlow.
- The main programming elements in TensorFlow include constants, variables, placeholders, and sessions. Constants are parameters whose values do not change, variables allow adding trainable parameters, placeholders feed data from outside the graph, and sessions run the graph to evaluate nodes.
The field of artificial neural networks is often just called neural networks or multi-layer perceptrons after perhaps the most useful type of neural network. A perceptron is a single neuron model that was a precursor to larger neural networks.
It is a field that investigates how simple models of biological brains can be used to solve difficult computational tasks like the predictive modeling tasks we see in machine learning. The goal is not to create realistic models of the brain but instead to develop robust algorithms and data structures that we can use to model difficult problems.
The power of neural networks comes from their ability to learn the representation in your training data and how best to relate it to the output variable you want to predict. In this sense, neural networks learn a mapping. Mathematically, they are capable of learning any mapping function and have been shown to be universal approximators.
The predictive capability of neural networks comes from the hierarchical or multi-layered structure of the networks. The data structure can pick out (learn to represent) features at different scales or resolutions and combine them into higher-order features, for example, from lines to collections of lines to shapes.
This document provides an overview of artificial intelligence and machine learning techniques, including:
1. It defines artificial intelligence and lists some common applications such as gaming, natural language processing, and robotics.
2. It describes different machine learning algorithms like supervised learning, unsupervised learning, and reinforcement learning, and their applications in areas such as healthcare, finance, and retail.
3. It explains deep learning concepts such as neural networks, activation functions, loss functions, and architectures like convolutional neural networks and recurrent neural networks.
The document provides an introduction to the back-propagation algorithm, which is commonly used to train artificial neural networks. It discusses how back-propagation calculates the gradient of a loss function with respect to the network's weights in order to minimize the loss through methods like gradient descent. The document outlines the history of neural networks and perceptrons, describes the limitations of single-layer networks, and explains how back-propagation allows multi-layer networks to learn complex patterns through error propagation during training.
This document provides an overview of artificial neural networks. It describes how biological neurons work and how artificial neurons are modeled after them. An artificial neuron receives weighted inputs, sums them, and outputs the result through an activation function. A neural network consists of interconnected artificial neurons arranged in layers. The network is trained using a learning algorithm called backpropagation that adjusts the weights to minimize error between the network's output and target output. Neural networks can be used for applications like handwritten digit recognition.
Artificial Neural Networks have been very successfully used in several machine learning applications. They are often the building blocks when building deep learning systems. We discuss the hypothesis, training with backpropagation, update methods, regularization techniques.
The document provides an overview of backpropagation, a common algorithm used to train multi-layer neural networks. It discusses:
- How backpropagation works by calculating error terms for output nodes and propagating these errors back through the network to adjust weights.
- The stages of feedforward activation and backpropagation of errors to update weights.
- Options like initial random weights, number of training cycles and hidden nodes.
- An example of using backpropagation to train a network to learn the XOR function over multiple training passes of forward passing and backward error propagation and weight updating.
This is a single-day course that lets the learner gain experience with the basic details of deep learning. The first half builds a network using Python/NumPy only, and in the second half we build a more advanced network using TensorFlow/Keras.
At the end you will find a list of useful pointers for further study.
course git: https://gitlab.com/eshlomo/EazyDnn
Recurrent Neural Networks have been shown to be very powerful models as they can propagate context over several time steps. Due to this they can be applied effectively to several problems in Natural Language Processing, such as Language Modelling, Tagging problems, Speech Recognition, etc. In this presentation we introduce the basic RNN model and discuss the vanishing gradient problem. We describe LSTM (Long Short Term Memory) and Gated Recurrent Units (GRU). We also discuss Bidirectional RNN with an example. RNN architectures can be considered as deep learning systems where the number of time steps can be considered as the depth of the network. It is also possible to build the RNN with multiple hidden layers, each having recurrent connections from the previous time steps that represent the abstraction both in time and space.
This document provides an overview of neural networks. It discusses how neural networks were inspired by biological neural systems and attempt to model their massive parallelism and distributed representations. It covers the perceptron algorithm for learning basic neural networks and the development of backpropagation for learning in multi-layer networks. The document discusses concepts like hidden units, representational power of neural networks, and successful applications of neural networks.
This document provides an overview of deep learning including:
1. Why deep learning performs better than traditional machine learning for tasks like image and speech recognition.
2. Common deep learning applications such as image recognition, speech recognition, and healthcare.
3. Challenges of deep learning like the need for large datasets and lack of interpretability.
This document provides an overview of deep learning including why it is used, common applications, strengths and challenges, common algorithms, and techniques for developing deep learning models. Deep learning methods like neural networks can learn complex patterns in large, unlabeled datasets and are better than traditional machine learning for tasks like image recognition. Popular deep learning algorithms include convolutional neural networks for image data and recurrent neural networks for sequential data. Effective deep learning requires techniques like regularization, dropout, data augmentation, and hyperparameter optimization to prevent overfitting on training data.
Neural networks are inspired by biological neurons and are used to learn relationships in data. The document defines an artificial neural network as a large number of interconnected processing elements called neurons that learn from examples. It outlines the key components of artificial neurons including weights, inputs, summation, and activation functions. Examples of neural network architectures include single-layer perceptrons, multi-layer perceptrons, convolutional neural networks, and recurrent neural networks. Common applications of neural networks include pattern recognition, data classification, and processing sequences.
Principles of soft computing - Associative memory networks (Sivagowry Shathesh)
The document discusses various types of associative memory networks including auto-associative, hetero-associative, bidirectional associative memory (BAM), and Hopfield networks. It describes the architecture, training algorithms, and testing procedures for each type of network. The key points are: Auto-associative networks store and recall patterns using the same input and output vectors, while hetero-associative networks use different input and output vectors. BAM networks perform bidirectional retrieval of patterns. Hopfield networks are auto-associative single-layer recurrent networks that can converge to stable states representing stored patterns. Hebbian learning and energy functions are important concepts in analyzing the storage and recall capabilities of these associative memory networks.
Cerebellar Model Articulation Controller (Zahra Sadeghi)
The document provides an overview of the Cerebellar Model Articulation Controller (CMAC) neural network model. Some key points:
- CMAC is a 3-layer feedforward neural network that mimics the functionality of the mammalian cerebellum. It uses coarse coding to store weights in a localized associative memory.
- The input layer uses threshold units to activate a fixed number of neurons. The second layer performs logic AND operations. The third layer computes the weighted sum to produce the output.
- Learning involves comparing the actual output to the desired output and adjusting weights using methods like least mean square. Generalization occurs due to overlapping receptive fields between neurons.
- Applications include robot control,
AI Competitor Analysis: How to Monitor and Outperform Your Competitors (Contify)
AI competitor analysis helps businesses watch and understand what their competitors are doing. Using smart competitor intelligence tools, you can track their moves, learn from their strategies, and find ways to do better. Stay smart, act fast, and grow your business with the power of AI insights.
For more information please visit here https://www.contify.com/
2. Functions Can be Made Linear
• Data is not linearly separable in one dimension
• Not separable if you insist on using a specific class of functions
[Figure: data points along the 𝑥 axis]
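To make the bullet concrete, here is a minimal NumPy sketch (not from the slides; the toy data and the feature map x → (x, x²) are illustrative assumptions): in one dimension no threshold separates the classes, but the same data becomes linearly separable after a simple feature map.

```python
import numpy as np

# Toy 1-D data (illustrative): positives sit near 0, negatives lie on both sides,
# so no single threshold on x separates the two classes.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

thresholds = [t for t in np.linspace(-4.0, 4.0, 81)
              if np.all(y * np.sign(x - t) > 0) or np.all(y * np.sign(t - x) > 0)]
print("separating thresholds on x alone:", thresholds)   # -> []

# Map x -> (x, x^2): the same data is now linearly separable,
# e.g. by the hyperplane 0*x - 1*x^2 + 1 = 0 (i.e. the rule x^2 < 1).
phi = np.stack([x, x ** 2], axis=1)
w, b = np.array([0.0, -1.0]), 1.0
print("all correct in feature space:", bool(np.all(np.sign(phi @ w + b) == y)))
```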
4. Neural Networks
• Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element.
[Figure: a linear threshold unit and an Input-Hidden-Output network]
𝑦 = 𝑠𝑖𝑔𝑛(∑w𝑖𝑥𝑖 − 𝑇)
5. History: Neural Computation
• McCulloch and Pitts (1943) showed how linear threshold units can be used to compute logical functions
𝑦 = 𝑠𝑖𝑔𝑛(∑w𝑖𝐼𝑖 − 𝑇)
7. Neural Networks
• Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element.
[Figure: a linear threshold unit and an Input-Hidden-Output network]
8. Neural Networks
• Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element.
• The idea is to stack several layers of threshold elements, each layer using the output of the previous layer as input.
• Multi-layer networks can represent arbitrary functions, but building effective learning methods for such networks was [thought to be] difficult.
[Figure: Input-Hidden-Output network]
10. Neural Networks
• Neural Networks are functions: 𝑁𝑁: 𝑿 → 𝑌, where 𝑿 = {0,1}ⁿ or ℝⁿ and 𝑌 = [0,1] or {0,1}
• Robust approach to approximating real-valued, discrete-valued and vector-valued target functions.
• Among the most effective general-purpose supervised learning methods currently known.
• Effective especially for complex and hard-to-interpret input data, such as real-world sensory data, where a lot of supervision is available.
• Learning: the Backpropagation algorithm for neural networks has been shown successful in many practical problems.
11. Motivation for Neural Networks
• Inspired by biological neural network systems
• But are not identical to them
• We are currently on the rising part of a wave of interest in NN architectures, after a long downtime from the mid-1990s.
• Better computer architecture (parallelism on GPUs & TPUs)
• A lot more data than before; in many domains, supervision is available.
12. Motivation for Neural Networks
• One potentially interesting perspective:
• We used to think about NNs only as function approximators.
• Geoffrey Hinton introduced “Restricted Boltzmann Machines” (RBMs) in the mid-2000s, a method to learn high-level representations of the input
• Many other ideas focus on the intermediate representations of NNs
• Ideas are being developed on the value of these intermediate representations for transfer learning, for the meaning they represent, etc.
• In the next two lectures we will present a few of the basic architectures and learning algorithms, and provide some examples of applications
13. Basic Unit in Multi-Layer Neural Network
• Threshold units: 𝑜𝑗 = sgn(𝒘 ⋅ 𝒙 − 𝑇) introduce non-linearity
• But are not differentiable,
• Hence unsuitable for learning via Gradient Descent
[Figure: Input-Hidden-Output network with a threshold activation]
14. Logistic Neuron / Sigmoid Activation
• Neuron is modeled by a unit 𝑗 connected by weighted links 𝑤𝑖𝑗 to other units 𝑖.
• Use a non-linear, differentiable output function such as the sigmoid or logistic function
• Net input to a unit is defined as: $\text{net}_j = \sum_i w_{ij} \, x_i$
• Output of a unit is defined as: $o_j = \sigma(\text{net}_j) = \dfrac{1}{1 + \exp(-(\text{net}_j - T_j))}$
[Figure: unit 𝑗 with inputs 𝑥1 … 𝑥6, weights 𝑤1𝑗 … 𝑤6𝑗, and output 𝑜𝑗]
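A minimal NumPy sketch of this unit (not from the slides; the input values, weights, and threshold below are made-up illustrative numbers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs x_1..x_6, weights w_1j..w_6j, and threshold T_j.
x = np.array([0.5, -1.0, 0.25, 0.0, 1.0, -0.5])
w = np.array([0.1,  0.4, -0.2, 0.3, 0.7,  0.05])
T = 0.2

net_j = w @ x              # net_j = sum_i w_ij * x_i
o_j = sigmoid(net_j - T)   # o_j = 1 / (1 + exp(-(net_j - T_j)))
print(net_j, o_j)
```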
15. Representational Power
• Any Boolean function can be represented by a two-layer network (simulate a two-layer AND-OR network)
• Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.
• Sigmoid functions provide a set of basis functions from which arbitrary functions can be composed.
• Any function can be approximated to arbitrary accuracy by a three-layer network.
16. Quiz Time!
• Given a neural network, how can we make predictions?
• Given the input, calculate the output of each layer (starting from the first layer) until you get to the output.
• What is required to fully specify a neural network?
• The weights.
• Why can NN predictions be quick?
• Because many of the computations can be parallelized.
• What makes a neural network expressive?
• The non-linear units.
18. History: Learning Rules
• Hebb (1949) suggested that if two units are both active (firing) then the weights between them should increase:
𝑤𝑖𝑗 = 𝑤𝑖𝑗 + 𝑅𝑜𝑖𝑜𝑗
• 𝑅 is a constant called the learning rate
• Supported by physiological evidence
• Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule.
• Assumes binary output units; a single linear threshold unit
• Led to the Perceptron Algorithm
• See: http://people.idsia.ch/~juergen/who-invented-backpropagation.html
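To make the Perceptron learning rule concrete, here is a minimal NumPy sketch on a toy, linearly separable dataset (the data, learning rate, and epoch count are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy linearly separable data: label is +1 iff x1 + x2 > 0 (illustrative only).
X = np.array([[ 1.0,  1.0], [ 2.0, -0.5], [-1.0, -1.0], [-2.0,  0.5]])
y = np.array([ 1,  1, -1, -1])

w, T, R = np.zeros(2), 0.0, 1.0           # weights, threshold, learning rate

for epoch in range(10):
    for x_i, t in zip(X, y):
        o = 1 if w @ x_i - T > 0 else -1  # o = sign(w . x - T)
        if o != t:                        # Perceptron rule: update only on mistakes
            w += R * t * x_i
            T -= R * t
print(w, T)
```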
20. Gradient Descent
• We use gradient descent to determine the weight vector that minimizes some scalar-valued loss function 𝐸𝑟𝑟(𝒘);
• Fixing the set 𝐷 of examples, 𝐸𝑟𝑟 is a function of 𝒘
• At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.
[Figure: error surface 𝐸𝑟𝑟(𝒘) over 𝒘, with successive iterates 𝒘0, 𝒘1, 𝒘2, 𝒘3 descending toward the minimum]
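A minimal sketch of one gradient-descent loop (not from the slides; the quadratic loss, its known gradient, and the step size are illustrative assumptions):

```python
import numpy as np

# Illustrative loss Err(w) = ||w - w_star||^2 with gradient 2 (w - w_star).
w_star = np.array([1.0, -2.0])
def err(w):  return float(np.sum((w - w_star) ** 2))
def grad(w): return 2.0 * (w - w_star)

w = np.zeros(2)
R = 0.1                       # learning rate
for step in range(100):
    w = w - R * grad(w)       # move along the steepest-descent direction
print(w, err(w))              # w approaches w_star, Err(w) approaches 0
```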
21. Backpropagation Learning Rule
• Since there could be multiple output units, we define the error as the sum over all the network output units.
• $Err(\boldsymbol{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in K} (t_{kd} - o_{kd})^2$
• where 𝐷 is the set of training examples,
• 𝐾 is the set of output units
• This is used to derive the (global) learning rule which performs gradient descent in the weight space in an attempt to minimize the error function:
$\Delta w_{ij} = -R \, \dfrac{\partial E}{\partial w_{ij}}$
[Figure: network with outputs 𝑜1 … 𝑜𝑘 and target vector (1, 0, 1, 0, 0)]
22. Learning with a Multi-Layer Perceptron
• It’s easy to learn the top layer – it’s just a linear unit.
• Given feedback (truth) at the top layer, and the activation at the layer below it, you can use the Perceptron update rule (more generally, gradient descent) to update these weights.
• The problem is what to do with the other set of weights – we do not get feedback in the intermediate layer(s).
[Figure: Input-Hidden-Output network with weight matrices 𝑤1𝑖𝑗 (input to hidden) and 𝑤2𝑖𝑗 (hidden to output)]
23. Learning with a Multi-Layer Perceptron
• The problem is what to do with the other set of weights – we do not get feedback in the intermediate layer(s).
• Solution: If all the activation functions are differentiable, then the output of the network is also a differentiable function of the input and weights in the network.
• Define an error function (e.g., sum of squares) that is a differentiable function of the output, i.e. this error function is also a differentiable function of the weights.
• We can then evaluate the derivatives of the error with respect to the weights, and use these derivatives to find weight values that minimize this error function, using gradient descent (or other optimization methods).
• This results in an algorithm called back-propagation.
[Figure: Input-Hidden-Output network with weight matrices 𝑤1𝑖𝑗 and 𝑤2𝑖𝑗]
26. Some facts from real analysis
• First let’s get the notation right:
• The arrow shows functional dependence of 𝑧 on 𝑦
• i.e. given 𝑦, we can calculate 𝑧.
• e.g.: 𝑧(𝑦) = 2𝑦²
• $\frac{\partial z}{\partial y}$ : the derivative of 𝑧 with respect to 𝑦.
27. Some facts from real analysis
• Simple chain rule
• If 𝑧 is a function of 𝑦, and 𝑦 is a function of 𝑥
• Then 𝑧 is a function of 𝑥 as well.
• Question: how do we find $\frac{\partial z}{\partial x}$?
$\dfrac{\partial z}{\partial x} = \dfrac{\partial z}{\partial y} \, \dfrac{\partial y}{\partial x}$
We will use these facts to derive the details of the Backpropagation algorithm.
- 𝑧 will be the error (loss) function; we need to know how to differentiate 𝑧.
- Intermediate nodes use a logistic function (or another differentiable step function); we need to know how to differentiate it.
28. Some facts from real analysis
• Multiple path chain rule
$\dfrac{\partial z}{\partial x} = \dfrac{\partial z}{\partial y_1} \dfrac{\partial y_1}{\partial x} + \dfrac{\partial z}{\partial y_2} \dfrac{\partial y_2}{\partial x}$
Slide Credit: Richard Socher
29. Some facts from real analysis
• Multiple path chain rule: general
$\dfrac{\partial z}{\partial x} = \sum_{i=1}^{n} \dfrac{\partial z}{\partial y_i} \dfrac{\partial y_i}{\partial x}$
Slide Credit: Richard Socher
30. Key Intuitions Required for BP
• Gradient Descent
• Change the weights in the direction of the gradient to minimize the error function.
• Chain Rule
• Use the chain rule to calculate the gradients with respect to the intermediate weights.
• Dynamic Programming (Memoization)
• Memoize the weight updates to make the updates faster.
[Figure: network from input to output, with the gradient $\frac{\partial E}{\partial w_{ij}}$]
31. Backpropagation: the big picture
• Loop over instances:
1. The forward step
• Given the input, make predictions layer-by-layer, starting from the first layer
2. The backward step
• Calculate the error in the output
• Update the weights layer-by-layer, starting from the final layer
[Figure: network from input to output, with the gradient $\frac{\partial E}{\partial w_{ij}}$]
32. Quiz time!
• What is the purpose of the forward step?
• To make predictions, given an input.
• What is the purpose of the backward step?
• To update the weights, given an output error.
• Why do we use the chain rule?
• To calculate gradients in the intermediate layers.
• Why could backpropagation be efficient?
• Because it can be parallelized.
34. Reminder: Model Neuron (Logistic)
• Neuron is modeled by a unit 𝑗 connected by weighted links 𝑤𝑖𝑗 to other units 𝑖.
• Use a non-linear, differentiable output function such as the sigmoid or logistic function
• Net input to a unit is defined as: $\text{net}_j = \sum_i w_{ij} \, x_i$
• Output of a unit is defined as: $o_j = \dfrac{1}{1 + \exp(-(\text{net}_j - T_j))}$
[Figure: unit 7 with inputs 𝑥1 … 𝑥6, weights 𝑤17 … 𝑤67, and output 𝑜𝑗]
Note:
Other gates, beyond Sigmoid, can be used (TanH, ReLU).
Other loss functions, beyond LMS, can be used.
35. Derivation of Learning Rule
• The weights are updated incrementally; the error is computed for each example and the weight update is then derived.
$E_d(\boldsymbol{w}) = \frac{1}{2} \sum_{k \in K} (t_k - o_k)^2$
• 𝑤𝑖𝑗 influences the output 𝑜𝑗 only through net𝑗
• Therefore:
$\dfrac{\partial E_d}{\partial w_{ij}} = \dfrac{\partial E_d}{\partial o_j} \, \dfrac{\partial o_j}{\partial \text{net}_j} \, \dfrac{\partial \text{net}_j}{\partial w_{ij}}$
where $o_j = \dfrac{1}{1 + \exp(-(\text{net}_j - T))}$ and $\text{net}_j = \sum_i w_{ij} \, x_i$
[Figure: network with outputs 𝑜1, …, 𝑜𝑗, …, 𝑜𝑘; weight 𝑤𝑖𝑗 connects input 𝑥𝑖 to unit 𝑗]
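For completeness, a short worked expansion of these three factors (standard calculus for the squared error and the sigmoid, not spelled out on the slide):

$\dfrac{\partial E_d}{\partial o_j} = -(t_j - o_j), \qquad \dfrac{\partial o_j}{\partial \text{net}_j} = o_j (1 - o_j), \qquad \dfrac{\partial \text{net}_j}{\partial w_{ij}} = x_i$

$\Longrightarrow \quad \dfrac{\partial E_d}{\partial w_{ij}} = -(t_j - o_j) \, o_j (1 - o_j) \, x_i$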
38. Derivation of Learning Rule (3)
• Weights of output units:
• 𝑤𝑖𝑗 is changed by:
$\Delta w_{ij} = R \, (t_j - o_j) \, o_j (1 - o_j) \, x_i = R \, \delta_j \, x_i$
where we defined:
$\delta_j = -\dfrac{\partial E_d}{\partial \text{net}_j} = (t_j - o_j) \, o_j (1 - o_j)$
[Figure: output unit 𝑗 with input 𝑥𝑖 from unit 𝑖 through weight 𝑤𝑖𝑗, producing 𝑜𝑗]
39. Derivation of Learning Rule (4)
• Weights of hidden units: we want $\frac{\partial E_d}{\partial w_{ij}}$
• 𝑤𝑖𝑗 influences the output only through all the units whose direct input includes 𝑗
[Figure: hidden unit 𝑗 receives 𝑥𝑖 through weight 𝑤𝑖𝑗 and feeds units 𝑘 with outputs 𝑜1, …, 𝑜𝑘, which determine 𝐸𝑑]
41. Derivation of Learning Rule (5)
• Weights of hidden units:
• 𝑤𝑖𝑗 influences the output only through all the units whose direct input includes 𝑗
$\dfrac{\partial E_d}{\partial w_{ij}} = \sum_{k \in \mathrm{parent}(j)} -\delta_k \, \dfrac{\partial \text{net}_k}{\partial \text{net}_j} \, x_i = \sum_{k \in \mathrm{parent}(j)} -\delta_k \, \dfrac{\partial \text{net}_k}{\partial o_j} \, \dfrac{\partial o_j}{\partial \text{net}_j} \, x_i = \sum_{k \in \mathrm{parent}(j)} -\delta_k \, w_{jk} \, o_j (1 - o_j) \, x_i$
[Figure: hidden unit 𝑗 receives 𝑥𝑖 through weight 𝑤𝑖𝑗 and feeds units 𝑘 with outputs 𝑜𝑘]
42. Derivation of Learning Rule (6)
• Weights of hidden units:
• 𝑤𝑖𝑗 is changed by:
$\Delta w_{ij} = -R \, \dfrac{\partial E_d}{\partial w_{ij}} = R \, o_j (1 - o_j) \left( \sum_{k \in \mathrm{parent}(j)} \delta_k \, w_{jk} \right) x_i = R \, \delta_j \, x_i$
• where $\delta_j = o_j (1 - o_j) \sum_{k \in \mathrm{parent}(j)} \delta_k \, w_{jk}$
• First determine the error for the output units.
• Then, backpropagate this error layer by layer through the network, changing weights appropriately in each layer.
[Figure: hidden unit 𝑗 feeding units 𝑘 with outputs 𝑜𝑘; weight 𝑤𝑖𝑗 connects unit 𝑖 to unit 𝑗]
43. The Backpropagation Algorithm
• Create a fully connected three-layer network. Initialize weights.
• Until all examples produce the correct output within 𝜖 (or other criteria)
For each example in the training set do:
1. Compute the network output for this example
2. For each output unit 𝑘, compute the error term between the output and target value:
$\delta_k = (t_k - o_k) \, o_k (1 - o_k)$
3. For each hidden unit 𝑗, compute the error term:
$\delta_j = o_j (1 - o_j) \sum_{k \in \mathrm{downstream}(j)} \delta_k \, w_{jk}$
4. Compute the weight updates: $\Delta w_{ij} = R \, \delta_j \, x_i$
5. Update network weights with Δ𝑤𝑖𝑗
End epoch
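A minimal NumPy sketch of this algorithm on a tiny 2-3-1 sigmoid network (not from the slides; the XOR-style data, layer sizes, learning rate, and epoch count are illustrative assumptions, and the biases play the role of the thresholds 𝑇):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # toy inputs (XOR-style)
t = np.array([0., 1., 1., 0.])                           # targets
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)            # input -> hidden weights, biases
W2, b2 = rng.normal(size=3), 0.0                         # hidden -> output
R = 0.5                                                  # learning rate

for epoch in range(10000):
    for x, tk in zip(X, t):
        # 1. Forward: compute the network output layer by layer.
        h = sigmoid(x @ W1 + b1)
        o = sigmoid(h @ W2 + b2)
        # 2. Error term for the output unit: delta_k = (t_k - o_k) o_k (1 - o_k).
        delta_o = (tk - o) * o * (1 - o)
        # 3. Error terms for hidden units: delta_j = o_j (1 - o_j) sum_k delta_k w_jk.
        delta_h = h * (1 - h) * (delta_o * W2)
        # 4./5. Weight updates Delta w_ij = R * delta_j * x_i, applied immediately.
        W2 += R * delta_o * h
        b2 += R * delta_o
        W1 += R * np.outer(x, delta_h)
        b1 += R * delta_h

# After training, the predictions typically end up close to the targets [0, 1, 1, 0].
print([float(round(sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2), 2)) for x in X])
```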
50. The Backpropagation Algorithm
• Create a fully connected network. Initialize weights.
• Until all examples produce the correct output within 𝜖 (or other criteria)
For each example (𝑥𝑖, 𝑡𝑖) in the training set do:
1. Compute the network output 𝑦𝑖 for this example
2. Compute the error between the output and target value: $E = \sum_k (t_i^k - o_i^k)^2$
3. Compute the gradient for all weight values, Δ𝑤𝑖𝑗
4. Update network weights with 𝑤𝑖𝑗 = 𝑤𝑖𝑗 − R ∗ Δ𝑤𝑖𝑗
End epoch
Auto-differentiation packages such as TensorFlow, Torch, etc. help!
Quick example in code:
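The slide only gestures at a code example, so here is a minimal PyTorch sketch of the same loop, where loss.backward() computes all the gradients Δ𝑤𝑖𝑗 by automatic differentiation (the layer sizes, toy data, learning rate, and epoch count are illustrative assumptions):

```python
import torch

# Toy data: 4 examples, 2 inputs, 1 target each (illustrative only).
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = torch.tensor([[0.], [1.], [1.], [0.]])

# Fully connected 2-3-1 network with sigmoid units.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 3), torch.nn.Sigmoid(),
    torch.nn.Linear(3, 1), torch.nn.Sigmoid(),
)
opt = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = torch.nn.MSELoss()

for epoch in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(X), t)  # squared error between outputs and targets (mean form)
    loss.backward()              # autodiff computes dE/dw_ij for every weight
    opt.step()                   # w_ij <- w_ij - R * dE/dw_ij

# Predictions typically approach the targets after training.
print(model(X).detach().round())
```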
51. Comments on Training
• No guarantee of convergence; neural networks form non-convex functions with multiple local minima
• In practice, many large networks can be trained on large amounts of data for realistic problems.
• Many epochs (tens of thousands) may be needed for adequate training. Large data sets may require many hours of CPU time.
• Termination criteria: number of epochs; threshold on training-set error; no decrease in error; increased error on a validation set.
• To avoid local minima: several trials with different random initial weights, combined with majority or voting techniques
52. Over-training Prevention
• Running too many epochs and/or a NN with many hidden layers may lead to an overfit network
• Keep a held-out validation set and test accuracy after every epoch
• Early stopping: maintain the weights of the best-performing network on the validation set and return it when performance decreases significantly beyond that.
• To avoid losing training data to validation:
• Use 10-fold cross-validation to determine the average number of epochs that optimizes validation performance
• Train on the full data set using this many epochs to produce the final results
53. Over-fitting prevention
• Too few hidden units prevent the system from adequately fitting the data and learning the concept.
• Using too many hidden units leads to over-fitting.
• A similar cross-validation method can be used to determine an appropriate number of hidden units. (general)
• Another approach to preventing over-fitting is weight decay: all weights are multiplied by some fraction in (0,1) after every epoch.
• Encourages smaller weights and a less complex hypothesis
• Equivalently: change the error function to include a term for the sum of the squares of the weights in the network. (general)
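A minimal sketch of both views of weight decay (the decay fraction, penalty coefficient, and stand-in gradient below are illustrative assumptions, not from the slides):

```python
import numpy as np

W = np.random.default_rng(0).normal(size=(3, 3))  # some weight matrix
R, decay, lam = 0.1, 0.99, 0.01                   # learning rate, decay fraction, L2 coefficient
grad_E = np.ones((3, 3))                          # stand-in for dE/dW from backprop

# View 1: multiply all weights by a fraction in (0, 1) after every update.
W_decayed = decay * (W - R * grad_E)

# View 2: add (lam/2) * sum(W**2) to the error, so the gradient gains a lam * W term.
# The effect closely matches view 1 when decay is about 1 - R * lam.
W_penalized = W - R * (grad_E + lam * W)
```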
54. Dropout training
• Proposed by (Hinton et al., 2012)
• Each time, decide whether to delete each hidden unit with some probability 𝑝
56. Dropout training
• Model averaging effect
• Among 2^𝐻 models, with shared parameters
• 𝐻: number of units in the network
• Only a few get trained
• Much stronger than the known regularizer
• What about the input space?
• Do the same thing!
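A minimal sketch of a dropout mask applied to one layer's activations during training (the layer size, the value of 𝑝, and the "inverted" scaling choice are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.uniform(size=8)   # activations of a hidden layer (illustrative)
p = 0.5                   # probability of deleting a unit

# Training: delete each hidden unit independently with probability p,
# scaling the survivors so the expected activation is unchanged ("inverted" dropout).
mask = (rng.uniform(size=h.shape) >= p).astype(h.dtype)
h_train = h * mask / (1.0 - p)

# Prediction: use all units (no mask needed with inverted scaling).
h_test = h
print(mask, h_train)
```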
57. Recap: Multi-Layer Perceptrons
• Multi-layer network
• A global approximator
• Different rules for training it
• The Back-propagation
• Forward step
• Back propagation of errors
• Congrats! Now you know the most important algorithm in neural networks!
• Next Time:
• Convolutional Neural Networks
• Recurrent Neural Networks
[Figure: Input-Hidden-Output network with activations]