04 - Neural Networks PDF
Feed-Forward Neural networks
• So far our classifiers rely on pre-compiled features:

ŷ = sign(θ · φ(x))

• Here we have a simple linear classifier: it takes an expanded feature representation φ(x) of an input vector x, combines it with the associated parameters θ, and passes the result through a sign function to get the classification decision.
• Now let's write this model slightly differently
[Figure: the input coordinates x1, …, xd are mapped to features φ1(x), …, φD(x), which are combined with weights θ1, …, θD and passed through a sign unit.]

Once we have this feature representation φ(x), we take a linear combination of the coordinate values to produce the classification decision.
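As a minimal sketch of this fixed-feature classifier (the names and the degree-2 feature map are my own assumptions, not from the slides):

```python
import numpy as np

def phi(x):
    """A fixed, hand-designed feature expansion (here: degree-2 polynomial terms).
    This mapping is chosen in advance and is NOT learned."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

def predict(theta, x):
    """Linear classifier on the expanded features: sign(theta . phi(x))."""
    return np.sign(theta @ phi(x))

theta = np.array([0.0, 0.0, 0.0, 1.0, 1.0])    # example parameters
label = predict(theta, np.array([1.0, -2.0]))  # -> 1.0, since x1^2 + x2^2 > 0
```

The point is that `phi` is hard-coded: only `theta` is adjusted during learning.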
Feed-Forward Neural networks
• One particular aspect of this way of solving a nonlinear classification problem is that the mapping from the input x to the longer feature representation φ(x) is fixed: it is not tailored to the task we are trying to solve.
• This is the key difference between a nonlinear method such as a kernel method and a neural network. A neural network tries to optimize the feature representation for the task that you are trying to solve.
• In terms of deep neural networks, we are talking about representation learning:
learning to represent the examples in a way that is actually helpful for the
eventual classification task.
• This task is actually quite a bit harder than the one we have presented here:
– In order to classify well, we would need to know what a good feature representation is.
– In order to understand what a good feature representation would be, we would need to know how that representation is exercised in the ultimate classification task.
• Real neural networks are composed of cells called neurons. They aggregate inputs through dendrites; roughly 1,000 to 10,000 connections are formed by other neurons onto these dendrites.
• The potential increases in the cell body, and once it reaches a threshold, the neuron sends a spike along the axon, which connects to roughly 100 other neurons. So we have highly connected, parallel signals that are propagated in a temporal fashion.
Neural networks
• Our abstraction of this network is a simple linear classifier.
• The input nodes play the role of the synapses and the dendrites.
• The weights, or parameters, associated with the input coordinates mimic the strength of the connections that propagate to the cell body.
• The response of the cell is a nonlinear transformation of the aggregated signal.
• The output is simply a real number, not a temporal signal that is propagated through.
[Figure: a single unit with inputs x1, …, xd, weights w1, …, wd, and a real-valued output f ∈ ℝ; stacking such units gives a network with a certain width (units per layer) and depth (number of layers).]

The unit first computes the aggregate input

z = ∑_{i=1}^{d} wi xi + w0

and then applies a nonlinear activation, e.g. f = tanh(z).
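A single unit can be sketched as follows (variable names are my own, not from the slides):

```python
import numpy as np

def unit(x, w, w0):
    """One unit: aggregate input z = sum_i w_i x_i + w0, passed through tanh."""
    z = w @ x + w0
    return np.tanh(z)

x = np.array([1.0, -1.0])
w = np.array([0.5, 2.0])
f = unit(x, w, w0=0.25)   # tanh(0.5 - 2.0 + 0.25) = tanh(-1.25)
```

Here both the weights `w` and the offset `w0` are parameters to be learned, unlike the fixed feature map earlier.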
[Figure: a network with inputs x1, x2, hidden units f1, f2, and output f; the first hidden unit has weight vector w1 = (w11, w21).]

We can understand the first hidden unit here, in terms of the x1, x2 coordinates, as a linear classifier with weight vector w1. Its decision boundary corresponds to the points where the aggregate input to that unit is 0.
[Figures: the hidden-unit activations f1, f2 plotted over the input space for several weight settings, showing how the examples become separable in the learned feature coordinates; the network as a whole maps x to f(x; w).]
In training, we are also given a target y. For each pair of an input x and the desired output y, applying stochastic gradient descent means nudging the parameters in the direction opposite to the gradient.
To do this, we evaluate a loss that measures how much our predicted output differs from the desired target.
The main question is: since the mapping is complicated, how do we actually evaluate these derivatives?
Consider a training example (x, y).
[Figure: a chain of layers x → (w1) z1 → f1 → (w2) z2 → f2 → ⋯ → (wL) zL → fL, with x ∈ ℝ and y ∈ ℝ.]

z1 = x · w1
f1 = tanh(x · w1)
…
fL = tanh(fL-1 · wL)

Loss(y, fL) = ½ (y − fL)²

The quantities linking consecutive layers are the layer-wise derivatives ∂fL/∂fL-1.
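The forward pass through this scalar chain can be sketched as follows (function and variable names are hypothetical):

```python
import numpy as np

def forward(x, ws, y):
    """Forward pass through the chain f_l = tanh(w_l * f_{l-1}), with f_0 = x.
    Returns all activations (needed later for backpropagation) and the loss
    0.5 * (y - f_L)^2."""
    fs = [x]
    for w in ws:
        fs.append(np.tanh(w * fs[-1]))
    loss = 0.5 * (y - fs[-1]) ** 2
    return fs, loss

fs, loss = forward(x=0.9, ws=[0.7, -1.2, 0.4], y=1.0)
```

Storing every intermediate activation in `fs` is deliberate: the backward pass reuses them.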
∂Loss(y, fL)/∂w1 = (∂f1/∂w1) · (∂Loss(y, fL)/∂f1)

∂f1/∂w1 = ∂tanh(w1 x)/∂w1 = (1 − tanh²(w1 x)) x = (1 − f1²) x
∂Loss(y, fL)/∂f1 = (∂f2/∂f1) · (∂Loss(y, fL)/∂f2)

∂f2/∂f1 = ∂tanh(w2 f1)/∂f1 = (1 − tanh²(w2 f1)) w2 = (1 − f2²) w2
∂Loss(y, fL)/∂f1 = (∂f2/∂f1)(∂f3/∂f2) ⋯ (∂fL/∂fL-1) · ∂Loss(y, fL)/∂fL

∂Loss(y, fL)/∂fL = ∂[½ (y − fL)²]/∂fL = −(y − fL)

We can move backwards here by always multiplying by the Jacobian between the output and the input of each layer. Backpropagation is simply computing these quantities at the point where we need to adjust the parameters, propagating the training signal from the output, where it is available, all the way down to the parameters that we are interested in modifying.
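Putting the pieces together, the backward pass for this scalar chain might look as follows (a sketch; names and the test values are my own):

```python
import numpy as np

def backprop(x, ws, y):
    """Compute dLoss/dw_l for the chain f_l = tanh(w_l * f_{l-1}), f_0 = x,
    with Loss = 0.5 * (y - f_L)^2, by propagating the gradient from the output."""
    fs = [x]                       # forward pass, storing activations
    for w in ws:
        fs.append(np.tanh(w * fs[-1]))
    grad_f = -(y - fs[-1])         # dLoss/df_L
    grads = [0.0] * len(ws)
    for l in range(len(ws) - 1, -1, -1):
        dtanh = 1.0 - fs[l + 1] ** 2        # tanh'(z) = 1 - tanh(z)^2
        grads[l] = grad_f * dtanh * fs[l]   # dLoss/dw_l
        grad_f = grad_f * dtanh * ws[l]     # dLoss/df_{l-1}, passed further back
    return grads

ws = [0.7, -1.2, 0.4]
g = backprop(0.9, ws, 1.0)
```

A finite-difference check confirms that each `g[l]` matches the numerical derivative of the loss with respect to `ws[l]`.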
• Imagine that these Jacobians (the derivatives of the layer-wise mappings) have very small values. Then the gradient vanishes very quickly as the depth of the architecture increases.
• If these derivatives are large, then the gradients can also explode.
• So there are issues that we need to deal with when the architecture is deep.
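To see the vanishing effect numerically, one can multiply the layer-wise derivatives of a scalar tanh chain (a toy sketch; the random weights and the scale 0.5 are arbitrary choices of mine):

```python
import numpy as np

def input_gradient_magnitude(depth, scale, seed=0):
    """Magnitude of the gradient reaching the input of a scalar tanh chain:
    the product of the layer-wise derivatives (1 - f_l^2) * w_l."""
    rng = np.random.default_rng(seed)
    f, grad = 0.5, 1.0
    for _ in range(depth):
        w = scale * rng.standard_normal()
        f = np.tanh(w * f)
        grad *= (1.0 - f ** 2) * w   # layer-wise derivative df_l/df_{l-1}
    return abs(grad)

shallow = input_gradient_magnitude(depth=5, scale=0.5)
deep = input_gradient_magnitude(depth=50, scale=0.5)   # many orders of magnitude smaller
```

With small weights each factor has magnitude well below 1, so the product, and hence the gradient, shrinks exponentially with depth.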
[Figure: a node f with inputs x and y and output z; the upstream gradient ∂L/∂z flows in from the output and is multiplied by the local gradients ∂z/∂x and ∂z/∂y.]

∂L/∂x = (∂L/∂z)(∂z/∂x)
∂L/∂y = (∂L/∂z)(∂z/∂y)
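The local-gradient rule at a single node can be sketched as follows (assuming, purely for illustration, that the node computes z = x * y):

```python
def node_backward(x, y, dL_dz):
    """Chain rule at one node z = x * y: multiply the upstream gradient dL/dz
    by the node's local gradients dz/dx = y and dz/dy = x."""
    dL_dx = dL_dz * y
    dL_dy = dL_dz * x
    return dL_dx, dL_dy

dL_dx, dL_dy = node_backward(x=3.0, y=-2.0, dL_dz=0.5)   # -> (-1.0, 1.5)
```

Each node only needs its own local gradients; the upstream factor carries everything that happens later in the network.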
[Figure: the initial network (hidden layer units) plotted in the input space, alongside the average hinge loss per epoch.]

We have represented an initial network with the two input coordinates, x1 and x2, two hidden layer units, and a final classifier f.

I also plotted how the average hinge loss on this classification task evolves as a function of the iterations, or epochs. Each iteration runs through all the training examples, performing a stochastic gradient descent update on the training samples in random order.

After a few runs over the training examples, the network succeeds in finding a solution that has zero hinge loss.
After ∼10 epochs, the hidden units are arranged in a manner sufficient for the task (though not otherwise perfect).
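A toy version of this experiment might look as follows (the data, learning rate, and initialization are all my own assumptions; two tanh hidden units and a linear output are trained with SGD on the hinge loss, visiting examples in random order):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data with labels in {-1, +1}.
X = rng.standard_normal((100, 2))
y = np.sign(X[:, 0] + X[:, 1])

W = 0.1 * rng.standard_normal((2, 2))   # hidden weights: 2 units x 2 inputs
v = 0.1 * rng.standard_normal(2)        # output weights
lr = 0.1

avg_losses = []
for epoch in range(30):
    total = 0.0
    for i in rng.permutation(len(X)):        # random order of training samples
        h = np.tanh(W @ X[i])                # hidden-unit activations
        out = v @ h                          # final classifier
        loss = max(0.0, 1.0 - y[i] * out)    # hinge loss
        total += loss
        if loss > 0.0:                       # subgradient SGD update
            dW = np.outer(-y[i] * v * (1.0 - h ** 2), X[i])
            dv = -y[i] * h
            W -= lr * dW
            v -= lr * dv
    avg_losses.append(total / len(X))        # average hinge loss per epoch
```

Tracking `avg_losses` reproduces the kind of per-epoch curve described above: the average hinge loss falls as the hidden units rearrange themselves.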
Decisions (and a harder task)
[Figure: the decision boundary learned with 10 hidden units, starting from a zero-offset initialization.]