Neural Networks: Unit 3
Neural Networks
The first learning models (decision trees and nearest neighbor models)
created complex, non-linear decision boundaries. We moved from there to
the perceptron, perhaps the most classic linear model.
h_i = f(w_i · x)    (8.1)
where f is the link function and w_i refers to the vector of weights feeding into node i.
One example link function is the sign function. That is, if the incoming
signal is negative, the activation is −1. Otherwise the activation is +1. This is
a potentially useful activation function, but it is non-differentiable, which rules
out gradient-based training.
Each unit also has a bias: a constant offset, independent of the inputs, that
shifts the point at which the unit switches from −1 to +1. In practice the bias is
usually folded into w_i as one extra weight on a feature that is held at a constant
value, so it can be learned like any other weight.
A more popular link function is the hyperbolic tangent function, tanh. A
comparison between the sign function and the tanh function is in Figure 8.2.
It is a reasonable approximation to the sign function, but is convenient in
that it is differentiable. Because it looks like an “S” and because the Greek
character for “S” is “Sigma,” such functions are usually called sigmoid
functions.
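To make the comparison concrete, here is a small numeric check (my own, not from the text) of the two link functions, using NumPy's sign and tanh on a few arbitrary sample points:

```python
import numpy as np

# sign is a hard threshold; tanh is a smooth, differentiable approximation
# that saturates at -1 and +1 as the incoming signal grows in magnitude.
for a in [-5.0, -1.0, -0.1, 0.1, 1.0, 5.0]:
    print(f"a = {a:+.1f}   sign(a) = {np.sign(a):+.0f}   tanh(a) = {np.tanh(a):+.3f}")
```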
Assuming that we are using tanh as the link function, the overall
prediction made by a two-layer network can be computed using Algorithm
8.1. This function takes a matrix of weights W corresponding to the first-layer
weights and a vector of weights v corresponding to the second layer.
You can write this entire computation out in one line as

ŷ = Σ_i v_i tanh(w_i · x) = v · tanh(Wx)

where the second form is shorthand assuming that tanh can take a vector as
input and produce a vector as output.
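A minimal sketch of this forward pass in NumPy, assuming the conventions above (each row of W is one w_i, and any bias is already folded into the weights); the function name is my own:

```python
import numpy as np

def two_layer_predict(W, v, x):
    """Prediction of a two-layer tanh network: yhat = v . tanh(W x).

    W : (num_hidden, num_features) first-layer weights, one hidden unit per row
    v : (num_hidden,) second-layer (output) weights
    x : (num_features,) input example
    """
    h = np.tanh(W @ x)   # hidden-unit activations h_i = tanh(w_i . x)
    return float(v @ h)  # linear combination at the output unit
```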
The claim is that two-layer neural networks are more expressive than
single layer networks (i.e., perceptrons). To see this, you can construct a
very small two-layer network for solving the XOR problem.
Suppose that the data set consists of four data points, given in Table 8.1.
The classification rule is that y = +1 if and only if x1 = x2, where the
features are just ±1.
You can solve this problem using a two-layer network with two hidden
units. The key idea is to make the first hidden unit compute an “or” function:
x1 ∨ x2. The second hidden unit can compute an “and” function: x1 ∧ x2. The
output can combine these into a single prediction that mimics XOR.
Once you have the first hidden unit activate for “or” and the second for
“and,” you need only set the output weights as −2 and +1, respectively.
To achieve the “or” behavior, you can start by setting the bias to −0.5
and the weights for the two “real” features as both being 1. You can check
for yourself that this will do the “right thing” if the link function were the sign
function. Of course it’s not; it’s tanh. To get tanh to mimic sign, you need to
make the dot product either really really large or really really small. You can
accomplish this by setting the bias to −500,000 and both of the two weights
to 1,000,000. Now, the activation of this unit will be just slightly above −1
for x = ⟨−1, −1⟩ and just slightly below +1 for the other three examples.
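As a sanity check (my own, not from the text), the snippet below evaluates the “or” unit on all four examples. The stated numbers produce the described behavior if the bias is interpreted as a threshold subtracted from the weighted sum; that sign convention is an assumption here (with a bias added directly, the signs would flip).

```python
import numpy as np

bias = -500_000                        # threshold convention: activation = tanh(w.x - bias)
w = np.array([1_000_000, 1_000_000])   # weights on the two "real" features

for x1, x2 in [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]:
    x = np.array([x1, x2])
    activation = np.tanh(w @ x - bias)
    print((x1, x2), f"{activation:+.6f}")
# Only (-1, -1) comes out at (essentially) -1; the other three are (essentially) +1.
```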
One-layer networks can represent any linear function and only linear
functions. You’ve also seen that two-layer networks can represent non-linear
functions like XOR.
Networks of this sort are typically trained with back-propagation, which proceeds roughly as follows:
1. Inputs are presented to the network through the input layer.
2. The input is modelled using real-valued weights w, which are usually initialized randomly.
3. Calculate the output for every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error at the output layer (the difference between the actual and the desired output).
5. Travel back from the output layer to the hidden layer(s) and adjust the weights so that the error is decreased.
The easy case is to differentiate the squared-error objective with respect to v, the weights for
the output unit. Without even doing any math, you should be able to guess
what this looks like. The way to think about it is that, from v’s perspective, it
is just a linear model attempting to minimize squared error. The only
“funny” thing is that its inputs are the activations h rather than the
examples x. So the gradient with respect to v is just as in the linear case.
To make things notationally more convenient, let e_n denote the error on the
nth example (the difference y_n − ŷ_n between the true and predicted outputs), and let h_n denote the vector of
hidden unit activations on that example. Then:

∇_v = −Σ_n e_n h_n
The harder case is the gradient with respect to the first-layer weights W, because
those weights affect the output only indirectly, through the hidden units; they cannot
so easily “measure” how their changes affect the output. The intuitions are still simple, though.
If the overall error of the predictor (e) is small, you want to make small
steps. If v_i is small for hidden unit i, then this means that the output is not
particularly sensitive to the activation of the ith hidden unit. Thus, its
gradient should be small. If v_i flips sign, the gradient at w_i should also flip
sign. The name back-propagation comes from the fact that you propagate
gradients backward through the network, starting at the end.
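The sketch below (my own naming and array layout, with one example per row of X) computes both sets of gradients for the two-layer tanh network under squared error; a gradient-descent step would then be W -= eta * grad_W and v -= eta * grad_v.

```python
import numpy as np

def two_layer_gradients(W, v, X, y):
    """Back-propagation for yhat_n = v . tanh(W x_n) under squared error.

    X : (num_examples, num_features), y : (num_examples,)
    Returns gradients of 0.5 * sum_n (y_n - yhat_n)^2 with respect to W and v.
    """
    H = np.tanh(X @ W.T)      # hidden activations h_n, one row per example
    e = y - H @ v             # per-example errors e_n = y_n - yhat_n
    grad_v = -H.T @ e         # gradient w.r.t. v: -sum_n e_n h_n
    # For W, the error is propagated back through v and the tanh derivative
    # (1 - h^2): a small e, a small v_i, or a saturated hidden unit all shrink it.
    delta = -(e[:, None] * v[None, :]) * (1.0 - H ** 2)
    grad_W = delta.T @ X      # shape (num_hidden, num_features)
    return grad_W, grad_v
```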
One of the typical complaints about neural networks is that they are
finicky. In particular, they have a rather large number of knobs to tune:
1. The number of layers
2. The number of hidden units per layer
3. The gradient descent learning rate
4. The initialization
For two-layer networks, having to choose the number of hidden units,
and then get the learning rate and initialization “right,” can take a bit of
work. Clearly it can be automated, but nonetheless it takes time.
A classic argument for depth comes from circuit complexity:

Theorem (Parity Function Complexity). Any circuit of depth K < log2(D) that
computes the parity function of D input bits must contain O(e^D) gates.
One way of thinking about the issue of breadth versus depth has to do
with the number of parameters that need to be estimated. By the heuristic
that you need roughly one or two examples for every parameter, a deep
model could potentially require exponentially fewer examples to train than a
shallow model!
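To make the parameter-counting argument concrete, here is a rough back-of-the-envelope comparison (the specific sizes are illustrative assumptions, not from the text): a shallow network whose width grows like 2^D, as the parity theorem suggests a shallow circuit may need, versus a deep network with a few layers of width D.

```python
D = 16  # number of input features (illustrative)

# Shallow: one hidden layer with about 2**D units, then a single output unit.
shallow_hidden = 2 ** D
shallow_params = shallow_hidden * (D + 1) + (shallow_hidden + 1)

# Deep: four hidden layers of D units each, then a single output unit.
layers, width = 4, D
deep_params = width * (D + 1)                      # first hidden layer (weights + biases)
deep_params += (layers - 1) * width * (width + 1)  # remaining hidden layers
deep_params += width + 1                           # output unit

print(f"shallow: ~{shallow_params:,} parameters")  # roughly 1.2 million
print(f"deep:    ~{deep_params:,} parameters")     # roughly 1.1 thousand
```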
As back-propagation works its way down through the model, the sizes
of the gradients shrink. If you are at the beginning of a very deep network,
changing one single weight is unlikely to have a significant effect on the
output, since its influence has to pass through so many other units before getting there.
This directly implies that the derivatives are small. This means that
back-propagation essentially never moves far from its initialization when run
on very deep networks.
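A tiny numerical illustration of this effect (my own construction): a chain of scalar tanh units, where the gradient of the output with respect to the very first weight is a product of one factor per layer, and so tends to shrink as the chain gets deeper.

```python
import numpy as np

rng = np.random.default_rng(0)

for depth in [2, 5, 10, 20, 40]:
    weights = rng.normal(size=depth)
    a, grad = 1.0, 1.0            # input value and running gradient d(output)/d(w1)
    for i, w in enumerate(weights):
        z = np.tanh(w * a)
        # The first layer contributes dz/dw1 = a * (1 - z^2); every later layer
        # contributes dz/da = w * (1 - z^2) via the chain rule.
        grad *= (a if i == 0 else w) * (1.0 - z ** 2)
        a = z
    print(f"depth {depth:2d}: |d output / d w1| = {abs(grad):.2e}")
```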
While these small derivatives might make training difficult, they might be good
for other reasons: what reasons?
Basis Functions:
We’ve seen that: (a) neural networks can mimic linear functions and
(b) they can learn more complex functions.
One alternative to the sigmoid link is the radial basis function (RBF), where hidden
unit i computes h_i = exp(−γ_i ||w_i − x||²). The hidden units then behave like little
Gaussian “bumps” centered around locations specified by the vectors w_i. The
parameter γ_i specifies the width of the Gaussian bump. If γ_i is large, then only
data points that are really close to w_i have non-zero activations. To distinguish
sigmoid networks from RBF networks, the hidden units are typically drawn with
sigmoids or with Gaussian bumps, respectively.
Training RBF networks involves finding good values for the Gaussian
widths γ_i, the centers of the Gaussian bumps w_i, and the
connections between the Gaussian bumps and the output unit, v. This can all
be done using back-propagation. The gradient terms for v remain unchanged
from before; the derivatives for the other variables differ.
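A minimal sketch of the RBF forward pass under the Gaussian form assumed above (the function name and array layout are my own):

```python
import numpy as np

def rbf_predict(W, gamma, v, x):
    """Prediction of an RBF network: yhat = sum_i v_i * exp(-gamma_i * ||w_i - x||^2).

    W     : (num_hidden, num_features) centers w_i of the Gaussian bumps
    gamma : (num_hidden,) bump width parameters; larger gamma means a narrower bump
    v     : (num_hidden,) output weights
    x     : (num_features,) input example
    """
    sq_dists = np.sum((W - x) ** 2, axis=1)  # ||w_i - x||^2 for each hidden unit
    h = np.exp(-gamma * sq_dists)            # Gaussian bump activations
    return float(v @ h)
```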