
Unit-3

Neural Networks
The first learning models (decision trees and nearest neighbor models)
created complex, non-linear decision boundaries. We moved from there to
the perceptron, perhaps the most classic linear model.

Neural networks extend perceptron learning to non-linear decision boundaries, taking the biological inspiration of neurons even further. In the perceptron, we thought of the input data point (e.g., an image) as being directly connected to an output (e.g., a label). This is called a single-layer network because there is one layer of weights. Now, instead of directly connecting the inputs to the outputs, we will insert a layer of “hidden” nodes, moving from a single-layer network to a multi-layer network.

Bio-inspired Multi-Layer Networks


One approach to doing this is to chain together a collection of perceptrons to build more complex neural networks. An example of a two-layer network is shown in Figure 8.1. Here, you can see five inputs (features) that are fed into two hidden units. These hidden units are then fed into a single output unit. Each edge in this figure corresponds to a different weight. (Even though it looks like there are three layers, this is called a two-layer network because we don’t count the inputs as a real layer. That is, it’s two layers of trained weights.)

[Figure 8.1: a two-layer network with an input layer, a hidden layer, and an output layer.]

Each neuron receives connections from the neurons in the previous layer.

Prediction with a neural network is a straightforward generalization of prediction with a perceptron. First you compute the activations of the nodes in the hidden layer based on the inputs and the first layer of weights. Then you compute the activation of the output unit given the hidden unit activations and the second layer of weights.

The only major difference between this computation and the perceptron computation is that the hidden units compute a non-linear function of their inputs. This is usually called the activation function or link function. More formally, if w_{i,d} is the weight on the edge connecting input d to hidden unit i, then the activation of hidden unit i is computed as:

h_i = f(w_i · x)   (8.1)

where f is the link function and w_i refers to the vector of weights feeding in to node i.

One example link function is the sign function: if the incoming signal is negative, the activation is −1; otherwise the activation is +1. This is a potentially useful activation function, but it is non-differentiable.

In addition to the weights on the features, each unit also has a bias: a weight on a constant input that is always +1. The bias shifts the threshold at which the unit activates, and it is learned just like any other weight.
A more popular link function is the hyperbolic tangent function, tanh. A comparison between the sign function and the tanh function is shown in Figure 8.2. tanh is a reasonable approximation to the sign function, but is convenient in that it is differentiable. Because it looks like an “S” and because the Greek character for “S” is “Sigma,” such functions are usually called sigmoid functions.
Assuming that we are using tanh as the link function, the overall prediction made by a two-layer network can be computed using Algorithm 8.1. This function takes a matrix of weights W corresponding to the first layer and a vector of weights v corresponding to the second layer. You can write this entire computation out in one line as:

ŷ = Σ_i v_i tanh(w_i · x) = v · tanh(Wx)

where the second form is shorthand, assuming that tanh can take a vector as input and produce a vector as output.
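As a concrete illustration, here is a minimal NumPy sketch of this two-layer prediction (the names `W`, `v`, and `predict` are illustrative, not from the text):

```python
import numpy as np

def predict(W, v, x):
    """Two-layer network prediction: y_hat = v . tanh(W x).

    W : (K, D) matrix of first-layer weights (one row per hidden unit)
    v : (K,) vector of second-layer weights
    x : (D,) input feature vector (a bias can be folded in as an always-+1 feature)
    """
    h = np.tanh(W @ x)   # hidden unit activations, one per row of W
    return v @ h         # linear combination at the output unit

# Example: 2 hidden units, 3 features
W = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.8, -0.5]])
v = np.array([1.0, -2.0])
x = np.array([1.0, 0.0, -1.0])
print(predict(W, v, x))
```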

The claim is that two-layer neural networks are more expressive than single-layer networks (i.e., perceptrons). To see this, you can construct a very small two-layer network for solving the XOR problem.

Suppose that the data set consists of four data points, given in Table 8.1. The classification rule is that y = +1 if and only if x_1 = x_2, where the features are just ±1.

You can solve this problem using a two-layer network with two hidden units. The key idea is to make the first hidden unit compute an “or” function, x_1 ∨ x_2, and the second hidden unit compute an “and” function, x_1 ∧ x_2. The output can then combine these into a single prediction that mimics XOR. Once you have the first hidden unit activate for “or” and the second for “and,” you need only set the output weights as −2 and +1, respectively.

To achieve the “or” behavior, you can start by setting the bias weight to +0.5 and the weights for the two “real” features as both being 1. (The bias enters as a weight on a constant +1 input, so the unit computes sign(x_1 + x_2 + 0.5), which is −1 only when both features are −1.) You can check for yourself that this does the “right thing” if the link function were the sign function. Of course it’s not; it’s tanh. To get tanh to mimic sign, you need to make the dot product either really really large or really really small. You can accomplish this by scaling everything up: set the bias to 500,000 and both of the feature weights to 1,000,000. Now, the activation of this unit will be just slightly above −1 for x = ⟨−1, −1⟩ and just slightly below +1 for the other three examples.
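As a quick numerical check of this construction (a minimal sketch; the variable names are illustrative, not from the text):

```python
import numpy as np

# "Or" unit over +/-1 features: sign(x_1 + x_2 + 0.5) is -1 only for (-1, -1).
# Scaling the weights by a huge factor pushes tanh into its saturated regime,
# so the differentiable tanh behaves almost exactly like the sign function.
w = np.array([1e6, 1e6])   # feature weights
b = 5e5                    # bias weight (on a constant +1 input)

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    h = np.tanh(w @ np.array([x1, x2]) + b)
    print((x1, x2), h)
# Prints ~-1.0 for (-1, -1) and ~+1.0 for the other three: an "or" unit.
```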

One-layer networks can represent any linear function and only linear
functions. You’ve also seen that two-layer networks can represent non-linear
functions like XOR.

Theorem 9 (Two-Layer Networks are Universal Function Approximators). Let F be a continuous function on a bounded subset of D-dimensional space. Then, for any ε > 0, there exists a two-layer neural network with a finite number of hidden units whose output ŷ(x) satisfies |F(x) − ŷ(x)| ≤ ε for all x in the domain of F.

Or, in colloquial terms, “two-layer networks can approximate any function.”

This is a remarkable theorem. Practically, it says that if you give me a function F and some error tolerance parameter ε, I can construct a two-layer network that computes F to within ε. In a sense, it says that going from one layer to two layers completely changes the representational capacity of your model.

When working with two-layer networks, if your data is D-dimensional and you have K hidden units, then the total number of parameters is (D + 2)K: each hidden unit has D feature weights plus a bias (the first +1), plus one second-layer weight (the second +1). Following on from the heuristic that you should have one to two examples for each parameter you are trying to estimate, this suggests choosing the number of hidden units as roughly K ≈ N/(D + 2), where N is the number of training examples. In other words, if you have tons and tons of examples, you can safely have lots of hidden units. If you only have a few examples, you should probably restrict the number of hidden units in your network.
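As a quick worked example of this heuristic: with D = 20 features and N = 1,000 training examples, K ≈ 1000/22 ≈ 45 hidden units gives (20 + 2) · 45 = 990 parameters, i.e., roughly one parameter per training example.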
The number of hidden units is both a form of inductive bias and a form of regularization. On both views, the number of hidden units controls how complex your function will be: lots of hidden units ⇒ a very complicated function. As the number increases, training performance continues to get better, but at some point test performance gets worse because the network has overfit the data.

The Back-propagation Algorithm


The back-propagation algorithm is a classic approach to training neural networks. To fit the training data, we have to update the network’s weights and biases; we do this with gradient descent, using back-propagation to compute the gradients.

The back-propagation algorithm calculates the gradient of the error function with respect to the network’s weights, one layer at a time, following a gradient descent approach that exploits the chain rule. In short:

back-propagation = gradient descent + chain rule

We are going to optimize the weights in the network to minimize some objective function, just as before. The only difference is that the predictor is no longer linear but non-linear.

1. Inputs x arrive through the input connections.

2. The input is modelled using real-valued weights w, which are usually initialized randomly.

3. Calculate the output of every neuron, from the input layer, through the hidden layers, to the output layer.

4. Calculate the error at the output:

Error = Actual Output − Desired Output

5. Travel back from the output layer to the hidden layer and adjust the weights so that the error decreases.

6. Keep repeating the process until the desired output is achieved.

To be completely explicit, we will focus on optimizing squared error. Again, this is mostly for historic reasons; you could easily replace squared error with your loss function of choice. Our overall objective is:

min_{W, v} Σ_n ½ [y_n − v · tanh(W x_n)]²

where the term inside the brackets is the error the network makes on the nth example.
The easy case is to differentiate this with respect to v, the weights for the output unit. Without even doing any math, you should be able to guess what this looks like. The way to think about it is that, from v’s perspective, it is just a linear model attempting to minimize squared error. The only “funny” thing is that its inputs are the activations h rather than the examples x. So the gradient with respect to v is just as in the linear case.

To make things notationally more convenient, let e_n = y_n − v · h_n denote the error on the nth example, and let h_n denote the vector of hidden unit activations on that example. Then the gradient with respect to v is:

∇_v = −Σ_n e_n h_n

The output weights can directly measure how their changes affect the output. The first-layer weights are harder: their effect on the output is mediated by the second layer.

The weights in the first layer aren’t necessarily trying to produce specific values, say 0 or 5 or −2.1. They are simply trying to produce activations that get fed to the output layer. So the change they want to make depends crucially on how the output layer interprets them.

Ignoring the sum over data points, we can compute the first-layer gradient by the chain rule:

∂L/∂w_i = (∂L/∂ŷ) (∂ŷ/∂h_i) (∂h_i/∂w_i)

where ŷ = v · h is the network’s output. The first factor is −e, the second is v_i, and the third is f′(w_i · x) x. Putting this together, we get that the gradient with respect to w_i is:

∇_{w_i} = −e v_i f′(w_i · x) x

(for tanh, f′(a) = 1 − tanh²(a)).

This gradient makes intuitive sense. If the overall error e of the predictor is small, you want to make small steps. If v_i is small for hidden unit i, then the output is not particularly sensitive to the activation of the ith hidden unit, so its gradient should be small. And if v_i flips sign, the gradient at w_i should also flip sign. The name back-propagation comes from the fact that you propagate gradients backward through the network, starting at the end.
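Putting the two gradients together, here is a minimal NumPy sketch of one stochastic gradient step for this two-layer squared-error network (the function name and learning rate are illustrative, not from the text):

```python
import numpy as np

def backprop_step(W, v, x, y, eta=0.1):
    """One gradient step on example (x, y) for the model y_hat = v . tanh(W x)."""
    h = np.tanh(W @ x)   # hidden unit activations
    e = y - v @ h        # error on this example

    grad_v = -e * h                            # gradient wrt output weights v
    grad_W = -e * np.outer(v * (1 - h**2), x)  # gradient wrt first-layer rows w_i

    return W - eta * grad_W, v - eta * grad_v  # gradient descent update
```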

Implementing the back-propagation algorithm can be a bit tricky; sign errors often abound. A useful trick is to first keep W fixed and work on just training v. Then keep v fixed and work on training W. Then put them together.
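Another common way to catch such sign errors (a standard practice, though not mentioned in the text) is to compare the analytic gradient against a finite-difference estimate of the loss:

```python
import numpy as np

def loss(W, v, x, y):
    return 0.5 * (y - v @ np.tanh(W @ x))**2

def check_grad_v(W, v, x, y, eps=1e-6):
    """Compare the analytic gradient wrt v against central finite differences."""
    h = np.tanh(W @ x)
    analytic = -(y - v @ h) * h
    numeric = np.zeros_like(v)
    for i in range(len(v)):
        v_hi, v_lo = v.copy(), v.copy()
        v_hi[i] += eps
        v_lo[i] -= eps
        numeric[i] = (loss(W, v_hi, x, y) - loss(W, v_lo, x, y)) / (2 * eps)
    return np.max(np.abs(analytic - numeric))   # should be tiny (~1e-9)
```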

Initialization and Convergence of Neural Networks:


Based on linear models, you might be tempted to initialize all the
weights in a neural network to zero.

An initialization of W = 0 and v = 0 will lead to “uninteresting” solutions. In other words, if you initialize the model in this way, it will eventually get stuck in a bad local optimum. To see this, first realize that on any example x, the activations h_i of the hidden units will all be zero, since W = 0. This means that on the first iteration, the gradient on the output weights v will be zero, so they will stay put. Furthermore, the gradient on w_{1,d}, the weight for the dth feature of the first hidden unit, will be exactly the same as the gradient on w_{2,d}, the weight for the same feature of the second hidden unit. This means that the weight matrix, after a gradient step, will change in exactly the same way for every hidden unit. Thinking through this example for iterations 2, 3, ..., the values of the hidden units will always be exactly the same, which means the weights feeding in to each of the hidden units will be exactly the same. Eventually the model will converge, but it will converge to a solution that does not take advantage of having access to the hidden units.
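A short sketch that makes this symmetry concrete (illustrative code, not from the text): with zero initialization, every hidden unit receives an identical gradient row, so the units can never differentiate from one another.

```python
import numpy as np

K, D = 3, 4
W = np.zeros((K, D))   # zero-initialized first layer
v = np.zeros(K)        # zero-initialized output weights

x = np.array([1.0, -2.0, 0.5, 3.0])
y = 1.0

h = np.tanh(W @ x)     # all zeros, since W = 0
e = y - v @ h

grad_v = -e * h                            # all zeros: v stays put
grad_W = -e * np.outer(v * (1 - h**2), x)  # every row identical (here, zero)
print(grad_v)          # [0. 0. 0.]
print(grad_W)          # the hidden units remain interchangeable forever
```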

Neural networks are sensitive to their initialization. In particular, the function that they optimize is non-convex, meaning that it might have plentiful local optima. In a sense, neural networks must have local optima. Suppose you have a two-layer network with two hidden units that has been optimized. You have weights w_1 from the inputs to the first hidden unit, weights w_2 from the inputs to the second hidden unit, and weights (v_1, v_2) from the hidden units to the output. If I give you back another network with w_1 and w_2 swapped, and v_1 and v_2 swapped, the network computes exactly the same thing, but with a markedly different weight structure. This phenomenon is known as symmetric modes (“mode” referring to an optimum), meaning that there are symmetries in the weight space. It would be one thing if there were lots of modes and they were all symmetric: then finding one of them would be as good as finding any other. Unfortunately, there are additional local optima that are not global optima.

By initializing a network with small random weights (say, uniform between −0.1 and 0.1), the network is unlikely to fall into the trivial, symmetric local optimum. By training a collection of networks, each with a different random initialization, you can often obtain better solutions than with just one initialization. For example, you can train ten networks with different random seeds and then pick the one that does best on held-out data. Figure 8.3 shows prototypical test-set performance for ten networks with different random initializations, plus an eleventh plot for the trivial symmetric network initialized with zeros.
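A minimal sketch of this random-restart recipe (here `train`, `dev_error`, and the data variables are placeholders for your own training loop and held-out evaluation, not names from the text):

```python
import numpy as np

def init_weights(K, D, rng, scale=0.1):
    """Small random initialization, uniform in [-scale, +scale]."""
    W = rng.uniform(-scale, scale, size=(K, D))
    v = rng.uniform(-scale, scale, size=K)
    return W, v

best, best_err = None, float("inf")
for seed in range(10):                     # ten different random initializations
    rng = np.random.default_rng(seed)
    W, v = init_weights(K=20, D=X_train.shape[1], rng=rng)
    W, v = train(W, v, X_train, y_train)   # placeholder: your training loop
    err = dev_error(W, v, X_dev, y_dev)    # placeholder: held-out error
    if err < best_err:
        best, best_err = (W, v), err       # keep the network that does best
```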

One of the typical complaints about neural networks is that they are
finicky. In particular, they have a rather large number of knobs to tune:

1. The number of layers

2. The number of hidden units per layer

3. The gradient descent learning rate η

4. The initialization

5. The stopping iteration or weight regularization

For two-layer networks, having to choose the number of hidden units and then getting the learning rate and initialization “right” can take a bit of work. Clearly this can be automated, but it nonetheless takes time.

Another difficulty of neural networks is that their weights can be difficult to interpret. You’ve seen that, for linear models, you can often interpret high weights as indicative of positive examples and low weights as indicative of negative examples. In multi-layer networks, it becomes very difficult to understand what the different hidden units are doing.

Beyond Two Layers:


The definition of neural networks and the back-propagation algorithm
can be generalized beyond two layers to any arbitrary directed acyclic
graph.
Suppose that your network structure is stored in some directed acyclic graph, like that in Figure 8.5. We index nodes in this graph as u, v. The activation at a node u before applying the non-linearity is a_u, and after the non-linearity it is h_u. The graph has a single sink, which is the output node y with activation a_y (no non-linearity is performed on the output unit). The graph has D-many inputs (i.e., nodes with no parent), whose activations h_u are given by an input example. An edge (u, v) goes from a parent to a child (i.e., from an input to a hidden unit, or from a hidden unit to the sink). Each edge has a weight w_{u,v}. We say that par(u) is the set of parents of u.
There are two relevant algorithms: forward-propagation and back-propagation. Forward-propagation tells you how to compute the activation of the sink y given the inputs. Back-propagation computes derivatives with respect to the edge weights for a given input.

The key aspect of the forward-propagation algorithm is to iteratively compute activations, going deeper and deeper in the DAG. Once the activations of all the parents of a node u have been computed, you can compute the activation of node u itself.
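Here is a sketch of forward-propagation over such a DAG, visiting nodes in topological order so that every parent is computed before its children (the dictionary-based graph representation is an assumption for illustration, not from the text):

```python
import numpy as np

def forward_propagate(topo_order, parents, weights, f, inputs):
    """Compute the sink activation of a DAG-structured network.

    topo_order : node ids listed so every parent precedes its children
    parents    : dict node -> list of parent nodes (empty for inputs)
    weights    : dict (parent, child) -> edge weight w_{u,v}
    f          : link function for hidden units (e.g., np.tanh)
    inputs     : dict input-node -> activation h_u from the example
    """
    h = dict(inputs)                 # post-nonlinearity activations
    sink = topo_order[-1]            # the single output node y
    for u in topo_order:
        if u in inputs:
            continue                 # input activations are given
        a_u = sum(weights[(p, u)] * h[p] for p in parents[u])
        h[u] = a_u if u == sink else f(a_u)   # no non-linearity at the output
    return h[sink]
```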
Back-propagation (see Algorithm 8.4) does the opposite: it computes gradients top-down in the network. The key idea is to compute an error for each node in the network. The error at the output unit is the “true error.” For any other unit, the error is the amount of gradient coming from its children (i.e., nodes higher in the network). These errors are computed backwards through the network (hence the name back-propagation), along with the gradients themselves. This is also explained pictorially in Figure 8.7.

Given the back-propagation algorithm, you can directly run gradient descent, using back-propagation as a subroutine for computing the gradients.

Breadth versus Depth:


The goal is to show that there are functions for which it might be a “good idea” to use a deep network: functions that require a huge number of hidden units if you force the network to be shallow, but that can be computed with a small number of units if you allow the network to be deep.

The example that we’ll use is the parity function, which is a generalization of the XOR problem. The function is defined over D binary inputs as:

parity(x) = (Σ_d x_d) mod 2 — that is, 1 if the number of 1s in x is odd, and 0 otherwise.

It is easy to define a circuit of depth O(log₂ D) with O(D)-many gates for computing the parity function: each gate is an XOR, arranged in a complete binary tree.
This shows that if you are allowed to be deep, you can construct a circuit that computes parity using a number of hidden units that is linear in the dimensionality. We cannot do the same with shallow circuits: it is a famous result of circuit complexity that parity requires exponentially many gates to compute in constant depth.
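A small sketch of this tree construction, computing the parity of D bits with D − 1 XOR gates in about log₂ D layers (the function name is illustrative):

```python
def parity_tree(bits):
    """Parity via a complete binary tree of XOR gates.

    Uses D - 1 XOR gates arranged in ~log2(D) layers, in contrast to the
    exponentially many gates a constant-depth circuit would need.
    """
    layer = list(bits)
    while len(layer) > 1:
        nxt = [a ^ b for a, b in zip(layer[::2], layer[1::2])]
        if len(layer) % 2:       # an odd leftover bit passes through unchanged
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

print(parity_tree([1, 0, 1, 1]))  # 1: three ones, so the parity is odd
```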

The formal theorem is below:

Theorem (Parity Function Complexity). Any circuit of depth K < log₂ D that computes the parity function of D input bits must contain exponentially many gates (on the order of 2^D).

A neural network isn’t exactly the same as a circuit, but it is generally believed that the same result holds for neural networks. This gives a strong indication that depth might be an important consideration in neural networks.

One way of thinking about the issue of breadth versus depth has to do
with the number of parameters that need to be estimated. By the heuristic
that you need roughly one or two examples for every parameter, a deep
model could potentially require exponentially fewer examples to train than a
shallow model!

Deep networks make the architecture selection problem more significant. Namely, when you use a two-layer network, the only hyperparameter to choose is how many hidden units should go in the middle layer. When you choose a deep network, you need to choose how many layers there are and the width of each of those layers. This can be somewhat daunting.

As back-propagation works its way down through the model, the sizes of the gradients shrink. If a weight is at the beginning of a very deep network, changing it is unlikely to have a significant effect on the output, since its effect has to pass through so many other units before getting there. This directly implies that the derivatives are small. As a result, back-propagation essentially never moves far from its initialization when run on very deep networks.

Although these small derivatives make training difficult, they might be good for other reasons: for example, never moving far from a small initialization can act as a form of regularization.
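A toy demonstration of this shrinking-gradient effect (illustrative code, not from the text): push a gradient backward through twenty tanh layers with small random weights and watch its norm decay.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 32, 20                  # layer width and network depth
Ws, pre_acts = [], []

# Forward pass through L tanh layers with small random weights.
h = rng.normal(size=D)
for _ in range(L):
    W = rng.uniform(-0.1, 0.1, size=(D, D))
    a = W @ h
    Ws.append(W)
    pre_acts.append(a)
    h = np.tanh(a)

# Backward pass: the gradient norm shrinks with every layer it crosses.
grad = np.ones(D)              # stand-in gradient arriving from the loss
for W, a in zip(reversed(Ws), reversed(pre_acts)):
    grad = W.T @ (grad * (1 - np.tanh(a)**2))   # chain rule through one layer
    print(np.linalg.norm(grad))                 # decays rapidly toward zero
```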

Finding good ways to train deep networks is an active research area. There are two general strategies. The first is to attempt to initialize the weights better, often by a layer-wise initialization strategy; this can often be done using unlabeled data. After this initialization, back-propagation can be run to tweak the weights for whatever classification problem you care about. A second approach is to use a more complex optimization procedure rather than gradient descent.

Basis Functions:
We’ve seen that: (a) neural networks can mimic linear functions and
(b) they can learn more complex functions.

A natural way to train a neural network to mimic a KNN classifier is to replace the sigmoid link function with a radial basis function (RBF). In a sigmoid network (i.e., a network with sigmoid links), the hidden units are computed as h_i = tanh(w_i · x). In an RBF network, the hidden units are computed as:

h_i = exp(−γ_i ‖w_i − x‖²)

The hidden units behave like little Gaussian “bumps” centered at locations specified by the vectors w_i. The parameter γ_i (gamma) specifies the width of the Gaussian bump: if γ_i is large, then only data points that are really close to w_i have non-zero activations. To distinguish sigmoid networks from RBF networks, the hidden units are typically drawn with sigmoids or with Gaussian bumps, respectively.

Training RBF networks involves finding good values for the Gaussian widths γ_i, the centers of the Gaussian bumps w_i, and the connections between the Gaussian bumps and the output unit, v. This can all be done using back-propagation. The gradient terms for v remain unchanged from before; the derivatives for the other variables differ.
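A minimal sketch of prediction in such an RBF network, following the formula above (the function name is illustrative):

```python
import numpy as np

def rbf_predict(W, gamma, v, x):
    """RBF-network prediction: y_hat = sum_i v_i * exp(-gamma_i * ||w_i - x||^2).

    W     : (K, D) matrix of bump centers, one row per hidden unit
    gamma : (K,) bump widths; a larger gamma_i means a narrower bump
    v     : (K,) output weights
    x     : (D,) input feature vector
    """
    sq_dists = np.sum((W - x)**2, axis=1)   # squared distance to each center
    h = np.exp(-gamma * sq_dists)           # Gaussian "bump" activations
    return v @ h

# Example: two bumps in two dimensions
W = np.array([[0.0, 0.0], [1.0, 1.0]])
gamma = np.array([1.0, 5.0])
v = np.array([1.0, -1.0])
print(rbf_predict(W, gamma, v, np.array([0.1, 0.0])))
```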
