04 - Neural Networks

The document discusses feed-forward neural networks and machine learning. It begins by describing a simple linear classifier model and then represents it as a neural network with input nodes, weights, and a sign activation function. It explains that neural networks can learn feature representations tailored to the task, unlike kernel methods. The document then discusses biological neural networks and how feed-forward networks are an abstraction, using linear combinations and nonlinear activations instead of spiking signals. Finally, it covers deep neural networks with multiple hidden layers and reasons for their recent successes, including abundant data, computational resources, and flexible architectures.

Machine Learning

2
Feed-Forward Neural networks
• So far our classifiers have relied on fixed, pre-computed features:

ŷ = sign(θ · φ(x))

• This is a simple linear classifier: it takes an expanded feature representation φ(x) of an input vector x, combines it with the associated parameters θ, and passes the result through a sign function to obtain the classification decision.
• Now let's write this model slightly differently
[Diagram: inputs x1, ..., xd are mapped to features φ1(x), ..., φD(x); each feature is weighted by a parameter θ1, ..., θD, and the weighted sum is passed through a sign unit.]

Once we have this feature representation φ(x), we take a linear combination of these coordinate values to produce the classification decision.
Feed-Forward Neural networks 3
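As a minimal sketch (not from the slides, assuming Python/NumPy), the fixed-feature classifier ŷ = sign(θ · φ(x)) can be written in a few lines; the particular feature map phi below is a made-up example of a hand-specified expansion.

import numpy as np

def phi(x):
    # Hand-chosen (fixed) feature expansion of a 2-D input:
    # this mapping is NOT learned, it is specified up front.
    return np.array([x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2])

def predict(theta, x):
    # Linear classifier on top of the fixed features: y_hat = sign(theta . phi(x))
    return np.sign(theta @ phi(x))

theta = np.array([0.5, -1.0, 2.0, 0.1, 0.1])   # example parameters
print(predict(theta, np.array([1.0, -2.0])))   # -> -1.0 or 1.0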
Feed-Forward Neural networks
• One particular aspect of this way of solving a nonlinear classification problem is that the mapping from the input x to a longer feature representation φ(x) is fixed; it is not tailored to the task we are trying to solve.
• This is the key difference between a nonlinear method such as a kernel method and a neural network: a neural network tries to optimize the feature representation for the task you are trying to solve.
• In terms of deep neural networks, we are talking about representation learning:
learning to represent the examples in a way that is actually helpful for the
eventual classification task.
• This task is actually quite a bit harder than the task that we have presented here:
– In order to do a good classification, we would need to know what that feature
representation is.
– In order to understand what a good feature representation would be, we would need
to know how that feature representation is exercised in the ultimate classification task.

Feed-Forward Neural networks 4


Neural networks
• Our feed forward neural networks don't actually look very much like real neural
networks in the brain.

• Real neural networks are composed of cells called neurons. They aggregate incoming signals through dendrites, and about 1,000 to 10,000 connections are formed by other neurons onto these dendrites.

Feed-Forward Neural networks 5


Neural networks
• The signals from the connections, called synapses, propagate through the dendrites into the cell body.

• The potential in the cell body increases; once it reaches a threshold, the neuron sends a spike along its axon, which connects to roughly 100 other neurons. So we have highly connected, parallel signals that are propagated in a temporal fashion.
Feed-Forward Neural networks 6
Neural networks
• Our abstraction of this network is in terms of a simple linear classifier.
• The input nodes take the role of the synapses and the dendrites.
• The weights or the parameters associated with these input coordinates try to
mimic the strength of the connection that propagates to the cell body.
• The response of the cell is just a nonlinear transformation of the aggregated signal.
• And the output is simply a real number, not actually a temporal signal that's
propagated through.
[Diagram: inputs x1, ..., xd with weights θ1, ..., θd feed a single unit with activation f; the output is a real number, f ∈ R.]

Feed-Forward Neural networks 7


Neural networks

[Diagram: inputs x1, ..., xd with weights w1, ..., wd feed a single unit with activation f; the output is a real number, f ∈ R.]

z = ∑_{i=1}^{d} w_i x_i + w_0

f = f(z), with f a nonlinear activation function

Feed-Forward Neural networks 8
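To make the single-unit computation concrete, here is a small sketch (assuming Python/NumPy; the input and weight values are made up): the pre-activation z = ∑_{i=1}^{d} w_i x_i + w_0 followed by a nonlinear activation f.

import numpy as np

def unit(x, w, w0, activation=np.tanh):
    # Aggregate the inputs: z = sum_i w_i * x_i + w_0
    z = np.dot(w, x) + w0
    # Pass the aggregate through a nonlinear activation to get a real-valued output
    return activation(z)

x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.3, -0.7])
print(unit(x, w, w0=0.1))   # a single real number, in (-1, 1) for tanh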


Neural networks
• What could be the activation function (transfer function)?
[Plots: candidate activation functions f(z).]
– Linear
– ReLU: f(z) = max(0, z)
– tanh
Feed-Forward Neural networks 9
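For reference, the candidate activation functions above can be written directly as follows (a sketch assuming Python/NumPy; the linear case is taken to be the identity):

import numpy as np

def linear(z):
    return z                   # identity: no nonlinearity

def relu(z):
    return np.maximum(0.0, z)  # ReLU: f(z) = max(0, z)

def tanh(z):
    return np.tanh(z)          # squashes z into (-1, 1)

z = np.linspace(-3.0, 3.0, 7)
for f in (linear, relu, tanh):
    print(f.__name__, f(z))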


Neural networks

[Diagrams: the single unit with inputs x1, ..., xd, weights w1, ..., wd, and output f ∈ R, shown again in a more compact form without the weight labels.]

Feed-Forward Neural networks 10


Deep Neural networks

Loosely motivated by biological neurons and networks.

Adjustable processing units (~ linear classifiers).

Highly parallel, typically organized in layers.

Deep = many transformations (layers) before output

e.g., edges → Simple parts → Parts → Objects → scenes

Feed-Forward Neural networks 11


Deep Neural networks

[Diagram: a deep network; the number of units per layer is its width, and the number of layers is its depth.]

Feed-Forward Neural networks 12


Deep Neural networks
• Deep learning has overtaken a number of academic disciplines in just a few years:
– Computer vision
– Natural language processing
– Speech recognition
– Computational biology, etc.
• Key roles in recent successes:
– Self driving vehicles
– Speech interfaces
– Conversational agents
– Superhuman game playing
• Many more underway
– Personalized / automated medicine
– Chemistry, robotics, …

Feed-Forward Neural networks 13


Deep Learning … Why Now?
• Reason #1: lots of data.
Many significant problems can only be solved at scale.

• Reason #2: Computational resources (esp. GPUs).


Platforms / Systems that support running deep (machine) learning algorithms at
scale.

• Reason #3: Large models are easier to train.


Large models can be successfully estimated with simple gradient based learning
algorithms.

• Reason #4: flexible neural “lego pieces”.


Common representation, diversity of architectural choices.

Feed-Forward Neural networks 14


One Hidden layer model
Layer 0 (input) → Layer 1 (tanh) → Layer 2 (linear)

[Diagram: inputs x1, x2 feed two hidden units through weights w11, w12, w21, w22, giving pre-activations z1, z2 and activations f1, f2; a final linear unit combines f1, f2 into z and the output f.]

z_1 = ∑_{j=1}^{2} x_j w_{j1} + w_{01},   f_1 = f(z_1) = tanh(z_1)
z_2 = ∑_{j=1}^{2} x_j w_{j2} + w_{02},   f_2 = f(z_2) = tanh(z_2)

Feed-Forward Neural networks 15
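A minimal sketch of this one-hidden-layer forward pass (assuming Python/NumPy; all weight values below are made up for illustration): the two hidden units apply tanh, and the output layer is linear.

import numpy as np

def forward(x, W, w0, v, v0):
    # Hidden layer: z_j = sum_i x_i * W[i, j] + w0_j, f_j = tanh(z_j)
    z = x @ W + w0          # shape (2,)
    f_hidden = np.tanh(z)
    # Output layer (linear): z = sum_j f_j * v_j + v0
    return f_hidden @ v + v0

x  = np.array([1.0, -2.0])           # input (x1, x2)
W  = np.array([[1.0, -1.0],          # W[i, j] = w_ij, input i -> hidden unit j
               [0.5,  2.0]])
w0 = np.array([0.0, 0.1])            # hidden offsets w_01, w_02
v  = np.array([1.5, -0.5])           # output weights
v0 = 0.2                             # output offset
print(forward(x, W, w0, v, v0))      # a single real number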


One Hidden layer model
Layer 0 (input) → Layer 1 (tanh) → Layer 2 (linear)

[Diagram: the same one-hidden-layer network, redrawn compactly as x1, x2 → f1, f2 → f.]

Feed-Forward Neural networks 16


One Hidden layer model
Layer 0 (input) → Layer 1 (tanh)

[Diagram: the first hidden unit drawn as a linear classifier in the (x1, x2) plane, with weight vector w_1 = [w_11, w_21]^T; its decision boundary w_1 · x + w_01 = 0 is oriented the same way as in linear classifiers.]

z_1 = w_1 · x + w_01   (a linear combination of the inputs)

We can understand the first hidden unit, in terms of the x1, x2 coordinates, as a linear classifier whose weight vector is w_1. The decision boundary corresponds to the points where the aggregate input to that unit is 0.

Feed-Forward Neural networks 17




Example Problem: Hidden Layer representation

[Plots: left, "Hidden layer units": the training data and the two hidden units shown in the (x1, x2) input space; right, the same points in the hidden-unit coordinates (f1, f2) under a linear activation.]

Feed-Forward Neural networks 19


Example Problem: Hidden Layer representation

[Plots: left, the hidden units in the (x1, x2) input space; right, the points in the (f1, f2) coordinates under a tanh activation, so the coordinates now lie in (-1, 1).]

Feed-Forward Neural networks 20


Example Problem: Hidden Layer representation

[Plots: left, the hidden units in the (x1, x2) input space; right, the points in the (f1, f2) coordinates under a ReLU activation, so negative pre-activations are mapped to 0.]

Feed-Forward Neural networks 21


Does the orientation matter ?

[Plots: left, the hidden units oriented differently in the (x1, x2) plane; right, the resulting (f1, f2) coordinates under a tanh activation.]

Feed-Forward Neural networks 22


Does the orientation matter ?

[Plots: left, the same differently oriented hidden units; right, the resulting (f1, f2) coordinates under a ReLU activation.]

Feed-Forward Neural networks 23


Random Hidden Units

[Plots: left, randomly chosen hidden units in the (x1, x2) plane; right, the points in the resulting (f1, f2) coordinates under a tanh activation.]

Feed-Forward Neural networks 24


Random Hidden Units
[Diagram: the 2-D input (x1, x2) is mapped through 10 randomly chosen hidden units f1, ..., f10, so each point is represented as a vector in R^10. Left plot: the training data in the (x1, x2) plane.]

Are the points linearly separable in the resulting 10-dimensional space?

(10 randomly chosen units) YES

Feed-Forward Neural networks 25
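To illustrate the question posed on this slide, here is a sketch (assuming Python/NumPy; the toy ring-shaped dataset and the random weights are made up, since the slide's data is not available): the 2-D points are mapped through 10 random tanh hidden units, and a plain perceptron is then run in the resulting 10-dimensional space to test for linear separability. Whether the perceptron converges depends on the random draw, but it typically does.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: a ring of positive points around a cluster of negative points
# (not linearly separable in the original 2-D coordinates).
n = 100
neg = rng.normal(0.0, 0.5, size=(n, 2))
angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
pos = np.c_[3.0 * np.cos(angles), 3.0 * np.sin(angles)] + rng.normal(0.0, 0.2, size=(n, 2))
X = np.vstack([neg, pos])
y = np.r_[-np.ones(n), np.ones(n)]

# 10 randomly chosen hidden units (random weights and offsets), tanh activation
W = rng.normal(size=(2, 10))
b = rng.normal(size=10)
F = np.tanh(X @ W + b)              # each point now lives in R^10

# Plain perceptron in the 10-dimensional feature space
theta, theta0 = np.zeros(10), 0.0
for epoch in range(100):
    mistakes = 0
    for f, t in zip(F, y):
        if t * (theta @ f + theta0) <= 0:
            theta, theta0 = theta + t * f, theta0 + t
            mistakes += 1
    if mistakes == 0:
        print("separable: perceptron converged after", epoch + 1, "epochs")
        break
else:
    print("no separating hyperplane found within 100 epochs")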


Random Hidden Units

[Plots: left, the training data in the (x1, x2) plane; right, the data after the mapping through the 10 randomly chosen hidden units.]

What are the coordinates?

(10 randomly chosen units)

Feed-Forward Neural networks 26


Feed-Forward Neural Networks
• Feed-forward neural networks with multiple hidden layers mediating the
calculation from the input to the output are complicated models that are trying to
capture the representation of the examples towards the output unit in such a way
as to facilitate the actual prediction task.
• It is this representation learning part – you're learning the feature representation
as well as how to make use of it – that makes the learning problem difficult.
• It turns out that a simple stochastic gradient descent algorithm typically succeeds in finding a good solution for the parameters, provided that we give the model a little bit of overcapacity.
• The main algorithmic question, then, is how to evaluate that gradient, the derivative of the loss with respect to the parameters. This can be computed efficiently using the so-called back-propagation algorithm.

Feed-Forward Neural networks 27


Learning Neural Networks

[Diagram: input units x on the left, layers of units in between, and the output f(x; w) on the right.]

A simple example of a feed-forward neural network.

The units on the left are the input units. The network evaluates a complicated mapping using parameters, collectively denoted w, that mediate each of the layer-wise transformations on the way to the output. The output f(x; w) is therefore a complicated function of the input x and of the parameters w.
Feed-Forward Neural networks 28
Learning Neural Networks

[Diagram: the same network, mapping input x to output f(x; w).]

In training, we are also given a target y. Our task in applying stochastic gradient descent is, for each example, to evaluate a loss that measures how much our predicted output f(x; w) differs from the desired target y, and then to nudge the parameters in the direction opposite to the gradient of that loss.

The main question, since the mapping is complicated, is how we actually evaluate these derivatives.

Feed-Forward Neural networks 29


Learning Neural Networks

[Diagram: the network mapping x to f(x; w).]

Example (x, y):

w_{ij}^l ← w_{ij}^l − η · ∂Loss(y, f(x; w)) / ∂w_{ij}^l

How do we evaluate this derivative? With the back-propagation algorithm.
Feed-Forward Neural networks 30
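In code form, the parameter update above is just a step in the negative gradient direction (a generic sketch assuming Python/NumPy, with the gradients assumed to be already available, e.g. from back-propagation as described next):

import numpy as np

def sgd_step(weights, grads, eta=0.1):
    # One stochastic gradient descent update per parameter array:
    # w_ij^l <- w_ij^l - eta * dLoss(y, f(x; w)) / dw_ij^l
    return [w - eta * g for w, g in zip(weights, grads)]

weights = [np.array([0.5, -1.0]), np.array([2.0])]     # made-up parameters
grads   = [np.array([0.1,  0.2]), np.array([-0.3])]    # made-up gradients
print(sgd_step(weights, grads, eta=0.5))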
Simple Example

A long, chain-like neural network:

x → [w_1] → z_1 → f_1 → [w_2] → z_2 → f_2 → ... → [w_L] → z_L → f_L,   with x ∈ R and y ∈ R

z_1 = x · w_1,   f_1 = tanh(z_1) = tanh(x · w_1)
...
z_L = f_{L-1} · w_L,   f_L = tanh(f_{L-1} · w_L)

Loss(y, f_L) = 1/2 (y − f_L)^2

(The back-propagation steps below will repeatedly use layer-wise derivatives such as ∂f_L / ∂f_{L-1}.)

Feed-Forward Neural networks 31
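A sketch of the forward pass through this chain (assuming Python/NumPy, a scalar input and scalar weights as on the slide; the particular values are made up):

import numpy as np

def forward_chain(x, ws):
    # f_0 = x, then z_l = f_{l-1} * w_l and f_l = tanh(z_l) for l = 1..L
    fs = [x]
    for w in ws:
        fs.append(np.tanh(fs[-1] * w))
    return fs                          # activations f_0, f_1, ..., f_L

def loss(y, f_L):
    return 0.5 * (y - f_L) ** 2        # Loss(y, f_L) = 1/2 (y - f_L)^2

ws = [2.0, -1.0, 0.5]                  # w_1, ..., w_L for a chain of depth L = 3
fs = forward_chain(x=0.3, ws=ws)
print(fs[-1], loss(1.0, fs[-1]))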


Back-Propagation

x → [w_1] → z_1 → f_1 → [w_2] → z_2 → f_2 → ... → [w_L] → z_L → f_L,   Loss(y, f_L)

∂Loss(y, f_L) / ∂w_1 = (∂f_1 / ∂w_1) · (∂Loss(y, f_L) / ∂f_1)

∂f_1 / ∂w_1 = ∂tanh(w_1 x) / ∂w_1 = (1 − tanh^2(w_1 x)) · x = (1 − f_1^2) · x

We know how to compute the first part.

Feed-Forward Neural networks 32


Back-Propagation

x → [w_1] → z_1 → f_1 → [w_2] → z_2 → f_2 → ... → [w_L] → z_L → f_L,   Loss(y, f_L)

∂Loss(y, f_L) / ∂f_1 = (∂f_2 / ∂f_1) · (∂Loss(y, f_L) / ∂f_2)

∂f_2 / ∂f_1 = ∂tanh(w_2 f_1) / ∂f_1 = (1 − tanh^2(w_2 f_1)) · w_2 = (1 − f_2^2) · w_2

Feed-Forward Neural networks 33


Back-Propagation

x → [w_1] → z_1 → f_1 → [w_2] → z_2 → f_2 → ... → [w_L] → z_L → f_L,   Loss(y, f_L)

∂Loss(y, f_L) / ∂f_1 = (∂f_2 / ∂f_1) (∂f_3 / ∂f_2) ... (∂f_L / ∂f_{L-1}) · (∂Loss(y, f_L) / ∂f_L)

∂Loss(y, f_L) / ∂f_L = ∂[1/2 (y − f_L)^2] / ∂f_L = −(y − f_L)

We can move backwards by always multiplying by the Jacobian between the output and the input of each layer. Back-propagation simply computes these quantities at the point where we need to adjust the parameters, propagating this training signal from the output, where it is available, all the way down to the parameters that we are interested in modifying.

Feed-Forward Neural networks 34
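Putting the derivation on the last few slides into code (a sketch assuming Python/NumPy, scalar weights, and the same chain and loss as in the earlier forward-pass sketch): the error signal −(y − f_L) is propagated backwards, multiplying at each layer by the local derivative (1 − f_l^2), and the weight gradients drop out along the way.

import numpy as np

def forward_chain(x, ws):
    fs = [x]
    for w in ws:
        fs.append(np.tanh(fs[-1] * w))
    return fs

def backward_chain(y, fs, ws):
    delta = -(y - fs[-1])                        # dLoss/df_L for Loss = 1/2 (y - f_L)^2
    grads = [0.0] * len(ws)
    for l in reversed(range(len(ws))):           # ws[l] is w_{l+1}, fs[l+1] is f_{l+1}
        # dLoss/dw_{l+1} = (1 - f_{l+1}^2) * f_l * dLoss/df_{l+1}
        grads[l] = (1.0 - fs[l + 1] ** 2) * fs[l] * delta
        # dLoss/df_l = (1 - f_{l+1}^2) * w_{l+1} * dLoss/df_{l+1}
        delta = (1.0 - fs[l + 1] ** 2) * ws[l] * delta
    return grads

ws, x, y = [2.0, -1.0, 0.5], 0.3, 1.0
fs = forward_chain(x, ws)
print(backward_chain(y, fs, ws))                 # dLoss/dw_1, ..., dLoss/dw_L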


Back-Propagation
• Based on the nature of this calculation (we evaluate the loss at the very output and then multiply by these Jacobians), we can also see how it can go wrong.

• Imagine that these Jacobians, the derivatives of the layer-wise mappings, are very small. Then the gradient vanishes very quickly as the depth of the architecture increases.

• If these derivatives are large, then the gradients can also explode.

• So there are issues that we need to deal with when the architecture is deep.

Feed-Forward Neural networks 35


Back-Propagation: Interpretation
f(x, y, z) = (x + y) × z,   e.g. x = −2, y = 5, z = −4

Forward pass:
q = x + y = 3
f = q × z = −12

Backward pass:
∂f/∂z = q = 3
∂f/∂q = z = −4
∂f/∂x = (∂f/∂q) · (∂q/∂x) = −4,   since ∂q/∂x = 1
∂f/∂y = (∂f/∂q) · (∂q/∂y) = −4,   since ∂q/∂y = 1

Feed-Forward Neural networks 36
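The small computation graph above can be checked numerically in a few lines (a sketch; the numbers are exactly those given on the slide):

# Forward pass of f(x, y, z) = (x + y) * z at x = -2, y = 5, z = -4
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass (chain rule)
df_dq = z            # df/dq = z = -4
df_dz = q            # df/dz = q = 3
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = -4
print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0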


Back-Propagation: Interpretation

[Diagram: a node z = f(x, y) inside a larger network. The local gradients are ∂z/∂x and ∂z/∂y; the upstream gradient ∂L/∂z arrives from the output side.]

∂L/∂x = (∂L/∂z) · (∂z/∂x)
∂L/∂y = (∂L/∂z) · (∂z/∂y)

Feed-Forward Neural networks 37


2 hidden units training
[Diagram: inputs x1, x2, two hidden units producing f1, f2, and an output unit f.]

We illustrate these hidden units as linear classifiers in the plot below.

[Plots: left, the training data and the two hidden units in the (x1, x2) plane; right, the average hinge loss over the training iterations.]

Feed-Forward Neural networks 38


2 hidden units training
[Diagram: inputs x1, x2, two hidden units producing f1, f2, and an output unit f.]

[Plots: left, the initial network's hidden layer units in the (x1, x2) plane; right, the average hinge loss per epoch.]

Feed-Forward Neural networks 39


2 hidden units training
[Diagram: inputs x1, x2, two hidden units producing f1, f2, and an output unit f.]

[Plots: left, the initial network's hidden layer units; right, the average hinge loss per epoch.]

I have also plotted how the average hinge loss on this classification task evolves as a function of the iterations, or epochs. Each iteration is a run through all the training examples, performing a stochastic gradient descent update on the training samples in random order.

Feed-Forward Neural networks 40


2 hidden units training
[Diagram: inputs x1, x2, two hidden units producing f1, f2, and an output unit f.]

[Plots: left, the initial network's hidden layer units; right, the average hinge loss per epoch.]

We have represented an initial network with the two input coordinates, x1 and x2, two hidden layer units, and a final classifier f. After a few runs over the training examples, the network succeeds in finding a solution that has zero hinge loss.

Feed-Forward Neural networks 41


2 hidden units training

After ∼ 10 passes through the data:

[Plots: left, the hidden units in the (x1, x2) plane after training; right, the hidden unit activations of the training points.]
Feed-Forward Neural networks 42
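A rough sketch of the kind of training loop used here (assuming Python/NumPy; the toy data, learning rate, and initialization are made-up stand-ins for the slide's setup): two tanh hidden units, a linear output, hinge loss, and stochastic gradient descent in random order over the training examples.

import numpy as np

rng = np.random.default_rng(1)

# Made-up 2-D training set with labels in {-1, +1}
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=40))

# Two tanh hidden units + linear output, randomly initialized
W  = rng.normal(scale=0.5, size=(2, 2));  b  = np.zeros(2)
v  = rng.normal(scale=0.5, size=2);       v0 = 0.0
eta = 0.05

for epoch in range(10):
    losses = []
    for i in rng.permutation(len(X)):          # random order over the examples
        x, t = X[i], y[i]
        h = np.tanh(x @ W + b)                 # hidden activations f_1, f_2
        out = h @ v + v0                       # linear output
        losses.append(max(0.0, 1.0 - t * out)) # hinge loss
        if t * out < 1.0:                      # only margin violations give a gradient
            d_out = -t                         # dLoss/dout for the hinge loss
            d_h = d_out * v * (1.0 - h ** 2)   # back-propagate through the tanh units
            v  -= eta * d_out * h
            v0 -= eta * d_out
            W  -= eta * np.outer(x, d_h)
            b  -= eta * d_h
    print(epoch, np.mean(losses))              # average hinge loss per epoch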


10 hidden units

Randomly initialized weights (zero offset) for the hidden units.

[Plot: the 10 hidden units drawn in the (x1, x2) plane.]

After ∼ 10 epochs, the hidden units are arranged in a manner sufficient for the task (but not otherwise perfect).
Feed-Forward Neural networks 43
Decisions (and a harder task)

Two hidden units can no longer solve this task.

[Plot: the decision boundary obtained with 10 hidden units on the harder task.]

Feed-Forward Neural networks 44


Decisions (and a harder task)

[Plot: the decision boundary obtained with 100 hidden units, zero offset initialization.]

Feed-Forward Neural networks 45


Decisions (and a harder task)

[Plots: decision boundaries with 100 hidden units, comparing zero offset initialization (left) and random offset initialization (right).]

Feed-Forward Neural networks 46
