
Unit-3

Neural Networks
The first learning models (decision trees and nearest neighbor models)
created complex, non-linear decision boundaries. We moved from there to
the perceptron, perhaps the most classic linear model.

Neural networks extend perceptron learning to non-linear decision boundaries, taking the biological inspiration of neurons even further. In the perceptron, we thought of the input data point (e.g., an image) as being directly connected to an output (e.g., a label). This is called a single-layer network because there is one layer of weights. Now, instead of directly connecting the inputs to the outputs, we will insert a layer of “hidden” nodes, moving from a single-layer network to a multi-layer network.

Bio-inspired Multi-Layer Networks


One approach to doing this is to chain together a collection of perceptrons to build more complex neural networks. An example of a two-layer network is shown in Figure 8.1. Here, you can see five inputs (features) that are fed into two hidden units. These hidden units are then fed into a single output unit. Each edge in this figure corresponds to a different weight. (Even though it looks like there are three layers, this is called a two-layer network because we don’t count the inputs as a real layer. That is, it’s two layers of trained weights.)

[Figure 8.1: a two-layer network with an input layer, a hidden layer, and an output layer.]

Each neuron receives connections from the neurons in the previous layer.

Prediction with a neural network is a straightforward generalization of prediction with a perceptron. First you compute the activations of the nodes in the hidden layer based on the inputs and the first layer of weights. Then you compute the activation of the output unit given the hidden unit activations and the second layer of weights.

The only major difference between this computation and the perceptron computation is that the hidden units compute a non-linear function of their inputs. This is usually called the activation function or link function. More formally, if w_{i,d} is the weight on the edge connecting input d to hidden unit i, then the activation of hidden unit i is computed as:

h_i = f(w_i · x)   (8.1)

where f is the link function and w_i refers to the vector of weights feeding in to node i.

One example link function is the sign function: if the incoming signal is negative, the activation is −1; otherwise the activation is +1. This is a potentially useful activation function, but it is non-differentiable.

In addition to the weights on the features, each unit also has a bias: a weight on a constant input that is always +1. The bias shifts the threshold at which the unit activates, and it is learned just like any other weight.
A more popular link function is the hyperbolic tangent function, tanh. A comparison between the sign function and the tanh function is shown in Figure 8.2. tanh is a reasonable approximation to the sign function, but is convenient in that it is differentiable. Because it looks like an “S” and because the Greek character for “S” is “Sigma,” such functions are usually called sigmoid functions.
Assuming that we are using tanh as the link function, the overall prediction made by a two-layer network can be computed using Algorithm 8.1. This function takes a matrix of weights W corresponding to the first layer and a vector of weights v corresponding to the second layer. You can write this entire computation out in one line as:

ŷ = Σ_i v_i tanh(w_i · x) = v · tanh(Wx)

where the second form is shorthand, assuming that tanh can take a vector as input and produce a vector as output.
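As a concrete illustration, here is a minimal NumPy sketch of this two-layer prediction (the names `W`, `v`, and `predict` are illustrative, not from the text):

```python
import numpy as np

def predict(W, v, x):
    """Two-layer network prediction: y_hat = v . tanh(W x).

    W : (K, D) matrix of first-layer weights (one row per hidden unit)
    v : (K,) vector of second-layer weights
    x : (D,) input feature vector (a bias can be folded in as an always-+1 feature)
    """
    h = np.tanh(W @ x)   # hidden unit activations, one per row of W
    return v @ h         # linear combination at the output unit

# Example: 2 hidden units, 3 features
W = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.8, -0.5]])
v = np.array([1.0, -2.0])
x = np.array([1.0, 0.0, -1.0])
print(predict(W, v, x))
```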

The claim is that two-layer neural networks are more expressive than single-layer networks (i.e., perceptrons). To see this, you can construct a very small two-layer network for solving the XOR problem.

Suppose that the data set consists of four data points, given in Table 8.1. The classification rule is that y = +1 if and only if x_1 = x_2, where the features are just ±1.

You can solve this problem using a two-layer network with two hidden units. The key idea is to make the first hidden unit compute an “or” function, x_1 ∨ x_2, and the second hidden unit compute an “and” function, x_1 ∧ x_2. The output can then combine these into a single prediction that mimics XOR. Once you have the first hidden unit activate for “or” and the second for “and,” you need only set the output weights as −2 and +1, respectively.

To achieve the “or” behavior, you can start by setting the bias weight to +0.5 and the weights for the two “real” features as both being 1. (The bias enters as a weight on a constant +1 input, so the unit computes sign(x_1 + x_2 + 0.5), which is −1 only when both features are −1.) You can check for yourself that this does the “right thing” if the link function were the sign function. Of course it’s not; it’s tanh. To get tanh to mimic sign, you need to make the dot product either really really large or really really small. You can accomplish this by scaling everything up: set the bias to 500,000 and both of the feature weights to 1,000,000. Now, the activation of this unit will be just slightly above −1 for x = ⟨−1, −1⟩ and just slightly below +1 for the other three examples.
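As a quick numerical check of this construction (a minimal sketch; the variable names are illustrative, not from the text):

```python
import numpy as np

# "Or" unit over +/-1 features: sign(x_1 + x_2 + 0.5) is -1 only for (-1, -1).
# Scaling the weights by a huge factor pushes tanh into its saturated regime,
# so the differentiable tanh behaves almost exactly like the sign function.
w = np.array([1e6, 1e6])   # feature weights
b = 5e5                    # bias weight (on a constant +1 input)

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    h = np.tanh(w @ np.array([x1, x2]) + b)
    print((x1, x2), h)
# Prints ~-1.0 for (-1, -1) and ~+1.0 for the other three: an "or" unit.
```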

One-layer networks can represent any linear function and only linear
functions. You’ve also seen that two-layer networks can represent non-linear
functions like XOR.

Theorem 9 (Two-Layer Networks are Universal Function Approximators). Let F be a continuous function on a bounded subset of D-dimensional space. Then, for any ε > 0, there exists a two-layer neural network with a finite number of hidden units whose output ŷ(x) satisfies |F(x) − ŷ(x)| ≤ ε for all x in the domain of F.

Or, in colloquial terms, “two-layer networks can approximate any function.”

This is a remarkable theorem. Practically, it says that if you give me a function F and some error tolerance parameter ε, I can construct a two-layer network that computes F to within ε. In a sense, it says that going from one layer to two layers completely changes the representational capacity of your model.

When working with two-layer networks, if your data is D-dimensional and you have K hidden units, then the total number of parameters is (D + 2)K: each hidden unit has D feature weights plus a bias (the first +1), plus one second-layer weight (the second +1). Following on from the heuristic that you should have one to two examples for each parameter you are trying to estimate, this suggests choosing the number of hidden units as roughly K ≈ N/(D + 2), where N is the number of training examples. In other words, if you have tons and tons of examples, you can safely have lots of hidden units. If you only have a few examples, you should probably restrict the number of hidden units in your network.
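As a quick worked example of this heuristic: with D = 20 features and N = 1,000 training examples, K ≈ 1000/22 ≈ 45 hidden units gives (20 + 2) · 45 = 990 parameters, i.e., roughly one parameter per training example.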
The number of hidden units is both a form of inductive bias and a form of regularization. On both views, the number of hidden units controls how complex your function will be: lots of hidden units ⇒ a very complicated function. As the number increases, training performance continues to get better, but at some point test performance gets worse because the network has overfit the data.

The Back-propagation Algorithm


The back-propagation algorithm is a classic approach to training neural networks. To fit the training data, we have to update the network’s weights and biases; we do this with gradient descent, using back-propagation to compute the gradients.

The back-propagation algorithm calculates the gradient of the error function with respect to the network’s weights, one layer at a time, following a gradient descent approach that exploits the chain rule. In short:

back-propagation = gradient descent + chain rule

We are going to optimize the weights in the network to minimize some objective function, just as before. The only difference is that the predictor is no longer linear but non-linear.

1. Inputs x arrive through the input connections.

2. The input is modelled using real-valued weights w, which are usually initialized randomly.

3. Calculate the output of every neuron, from the input layer, through the hidden layers, to the output layer.

4. Calculate the error at the output:

Error = Actual Output − Desired Output

5. Travel back from the output layer to the hidden layer and adjust the weights so that the error decreases.

6. Keep repeating the process until the desired output is achieved.

To be completely explicit, we will focus on optimizing squared error. Again, this is mostly for historic reasons; you could easily replace squared error with your loss function of choice. Our overall objective is:

min_{W, v} Σ_n ½ [y_n − v · tanh(W x_n)]²

where the term inside the brackets is the error the network makes on the nth example.
The easy case is to differentiate this with respect to v, the weights for the output unit. Without even doing any math, you should be able to guess what this looks like. The way to think about it is that, from v’s perspective, it is just a linear model attempting to minimize squared error. The only “funny” thing is that its inputs are the activations h rather than the examples x. So the gradient with respect to v is just as in the linear case.

To make things notationally more convenient, let e_n = y_n − v · h_n denote the error on the nth example, and let h_n denote the vector of hidden unit activations on that example. Then the gradient with respect to v is:

∇_v = −Σ_n e_n h_n

The output weights can directly measure how their changes affect the output. The first-layer weights are harder: their effect on the output is mediated by the second layer.

The weights in the first layer aren’t necessarily trying to produce specific values, say 0 or 5 or −2.1. They are simply trying to produce activations that get fed to the output layer. So the change they want to make depends crucially on how the output layer interprets them.

Ignoring the sum over data points, we can compute the first-layer gradient by the chain rule:

∂L/∂w_i = (∂L/∂ŷ) (∂ŷ/∂h_i) (∂h_i/∂w_i)

where ŷ = v · h is the network’s output. The first factor is −e, the second is v_i, and the third is f′(w_i · x) x. Putting this together, we get that the gradient with respect to w_i is:

∇_{w_i} = −e v_i f′(w_i · x) x

(for tanh, f′(a) = 1 − tanh²(a)).

This gradient makes intuitive sense. If the overall error e of the predictor is small, you want to make small steps. If v_i is small for hidden unit i, then the output is not particularly sensitive to the activation of the ith hidden unit, so its gradient should be small. And if v_i flips sign, the gradient at w_i should also flip sign. The name back-propagation comes from the fact that you propagate gradients backward through the network, starting at the end.
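Putting the two gradients together, here is a minimal NumPy sketch of one stochastic gradient step for this two-layer squared-error network (the function name and learning rate are illustrative, not from the text):

```python
import numpy as np

def backprop_step(W, v, x, y, eta=0.1):
    """One gradient step on example (x, y) for the model y_hat = v . tanh(W x)."""
    h = np.tanh(W @ x)   # hidden unit activations
    e = y - v @ h        # error on this example

    grad_v = -e * h                            # gradient wrt output weights v
    grad_W = -e * np.outer(v * (1 - h**2), x)  # gradient wrt first-layer rows w_i

    return W - eta * grad_W, v - eta * grad_v  # gradient descent update
```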

Implementing the back-propagation algorithm can be a bit tricky; sign errors often abound. A useful trick is to first keep W fixed and work on just training v. Then keep v fixed and work on training W. Then put them together.
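Another common way to catch such sign errors (a standard practice, though not mentioned in the text) is to compare the analytic gradient against a finite-difference estimate of the loss:

```python
import numpy as np

def loss(W, v, x, y):
    return 0.5 * (y - v @ np.tanh(W @ x))**2

def check_grad_v(W, v, x, y, eps=1e-6):
    """Compare the analytic gradient wrt v against central finite differences."""
    h = np.tanh(W @ x)
    analytic = -(y - v @ h) * h
    numeric = np.zeros_like(v)
    for i in range(len(v)):
        v_hi, v_lo = v.copy(), v.copy()
        v_hi[i] += eps
        v_lo[i] -= eps
        numeric[i] = (loss(W, v_hi, x, y) - loss(W, v_lo, x, y)) / (2 * eps)
    return np.max(np.abs(analytic - numeric))   # should be tiny (~1e-9)
```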

Initialization and Convergence of Neural Networks:


Based on linear models, you might be tempted to initialize all the
weights in a neural network to zero.

An initialization of W = 0 and v = 0 will lead to “uninteresting” solutions. In other words, if you initialize the model in this way, it will eventually get stuck in a bad local optimum. To see this, first realize that on any example x, the activations h_i of the hidden units will all be zero, since W = 0. This means that on the first iteration, the gradient on the output weights v will be zero, so they will stay put. Furthermore, the gradient on w_{1,d}, the weight for the dth feature of the first hidden unit, will be exactly the same as the gradient on w_{2,d}, the weight for the same feature of the second hidden unit. This means that the weight matrix, after a gradient step, will change in exactly the same way for every hidden unit. Thinking through this example for iterations 2, 3, ..., the values of the hidden units will always be exactly the same, which means the weights feeding in to each of the hidden units will be exactly the same. Eventually the model will converge, but it will converge to a solution that does not take advantage of having access to the hidden units.
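A short sketch that makes this symmetry concrete (illustrative code, not from the text): with zero initialization, every hidden unit receives an identical gradient row, so the units can never differentiate from one another.

```python
import numpy as np

K, D = 3, 4
W = np.zeros((K, D))   # zero-initialized first layer
v = np.zeros(K)        # zero-initialized output weights

x = np.array([1.0, -2.0, 0.5, 3.0])
y = 1.0

h = np.tanh(W @ x)     # all zeros, since W = 0
e = y - v @ h

grad_v = -e * h                            # all zeros: v stays put
grad_W = -e * np.outer(v * (1 - h**2), x)  # every row identical (here, zero)
print(grad_v)          # [0. 0. 0.]
print(grad_W)          # the hidden units remain interchangeable forever
```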

Neural networks are sensitive to their initialization. In particular, the function that they optimize is non-convex, meaning that it might have plentiful local optima. In a sense, neural networks must have local optima. Suppose you have a two-layer network with two hidden units that has been optimized. You have weights w_1 from the inputs to the first hidden unit, weights w_2 from the inputs to the second hidden unit, and weights (v_1, v_2) from the hidden units to the output. If I give you back another network with w_1 and w_2 swapped, and v_1 and v_2 swapped, the network computes exactly the same thing, but with a markedly different weight structure. This phenomenon is known as symmetric modes (“mode” referring to an optimum), meaning that there are symmetries in the weight space. It would be one thing if there were lots of modes and they were all symmetric: then finding one of them would be as good as finding any other. Unfortunately, there are additional local optima that are not global optima.

By initializing a network with small random weights (say, uniform between −0.1 and 0.1), the network is unlikely to fall into the trivial, symmetric local optimum. By training a collection of networks, each with a different random initialization, you can often obtain better solutions than with just one initialization. For example, you can train ten networks with different random seeds and then pick the one that does best on held-out data. Figure 8.3 shows prototypical test-set performance for ten networks with different random initializations, plus an eleventh plot for the trivial symmetric network initialized with zeros.
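A minimal sketch of this random-restart recipe (here `train`, `dev_error`, and the data variables are placeholders for your own training loop and held-out evaluation, not names from the text):

```python
import numpy as np

def init_weights(K, D, rng, scale=0.1):
    """Small random initialization, uniform in [-scale, +scale]."""
    W = rng.uniform(-scale, scale, size=(K, D))
    v = rng.uniform(-scale, scale, size=K)
    return W, v

best, best_err = None, float("inf")
for seed in range(10):                     # ten different random initializations
    rng = np.random.default_rng(seed)
    W, v = init_weights(K=20, D=X_train.shape[1], rng=rng)
    W, v = train(W, v, X_train, y_train)   # placeholder: your training loop
    err = dev_error(W, v, X_dev, y_dev)    # placeholder: held-out error
    if err < best_err:
        best, best_err = (W, v), err       # keep the network that does best
```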

One of the typical complaints about neural networks is that they are
finicky. In particular, they have a rather large number of knobs to tune:

1. The number of layers

2. The number of hidden units per layer

3. The gradient descent learning rate η

4. The initialization

5. The stopping iteration or weight regularization

For two-layer networks, having to choose the number of hidden units and then getting the learning rate and initialization “right” can take a bit of work. Clearly this can be automated, but it nonetheless takes time.

Another difficulty of neural networks is that their weights can be difficult to interpret. You’ve seen that, for linear models, you can often interpret high weights as indicative of positive examples and low weights as indicative of negative examples. In multi-layer networks, it becomes very difficult to understand what the different hidden units are doing.

Beyond Two Layers:


The definition of neural networks and the back-propagation algorithm
can be generalized beyond two layers to any arbitrary directed acyclic
graph.
Suppose that your network structure is stored in some directed acyclic graph, like that in Figure 8.5. We index nodes in this graph as u, v. The activation at a node u before applying the non-linearity is a_u, and after the non-linearity it is h_u. The graph has a single sink, which is the output node y with activation a_y (no non-linearity is performed on the output unit). The graph has D-many inputs (i.e., nodes with no parent), whose activations h_u are given by an input example. An edge (u, v) goes from a parent to a child (i.e., from an input to a hidden unit, or from a hidden unit to the sink). Each edge has a weight w_{u,v}. We say that par(u) is the set of parents of u.
There are two relevant algorithms: forward-propagation and back-propagation. Forward-propagation tells you how to compute the activation of the sink y given the inputs. Back-propagation computes derivatives with respect to the edge weights for a given input.

The key aspect of the forward-propagation algorithm is to iteratively compute activations, going deeper and deeper in the DAG. Once the activations of all the parents of a node u have been computed, you can compute the activation of node u itself.
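Here is a sketch of forward-propagation over such a DAG, visiting nodes in topological order so that every parent is computed before its children (the dictionary-based graph representation is an assumption for illustration, not from the text):

```python
import numpy as np

def forward_propagate(topo_order, parents, weights, f, inputs):
    """Compute the sink activation of a DAG-structured network.

    topo_order : node ids listed so every parent precedes its children
    parents    : dict node -> list of parent nodes (empty for inputs)
    weights    : dict (parent, child) -> edge weight w_{u,v}
    f          : link function for hidden units (e.g., np.tanh)
    inputs     : dict input-node -> activation h_u from the example
    """
    h = dict(inputs)                 # post-nonlinearity activations
    sink = topo_order[-1]            # the single output node y
    for u in topo_order:
        if u in inputs:
            continue                 # input activations are given
        a_u = sum(weights[(p, u)] * h[p] for p in parents[u])
        h[u] = a_u if u == sink else f(a_u)   # no non-linearity at the output
    return h[sink]
```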
Back-propagation (see Algorithm 8.4) does the opposite: it computes gradients top-down in the network. The key idea is to compute an error for each node in the network. The error at the output unit is the “true error.” For any other unit, the error is the amount of gradient coming from its children (i.e., nodes higher in the network). These errors are computed backwards through the network (hence the name back-propagation), along with the gradients themselves. This is also explained pictorially in Figure 8.7.

Given the back-propagation algorithm, you can directly run gradient descent, using back-propagation as a subroutine for computing the gradients.

Breadth versus Depth:


The goal is to show that there are functions for which it might be a “good idea” to use a deep network: functions that require a huge number of hidden units if you force the network to be shallow, but that can be computed with a small number of units if you allow the network to be deep.

The example that we’ll use is the parity function, which is a generalization of the XOR problem. The function is defined over D binary inputs as:

parity(x) = (Σ_d x_d) mod 2 — that is, 1 if the number of 1s in x is odd, and 0 otherwise.

It is easy to define a circuit of depth O(log₂ D) with O(D)-many gates for computing the parity function: each gate is an XOR, arranged in a complete binary tree.
This shows that if you are allowed to be deep, you can construct a circuit that computes parity using a number of hidden units that is linear in the dimensionality. We cannot do the same with shallow circuits: it is a famous result of circuit complexity that parity requires exponentially many gates to compute in constant depth.
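A small sketch of this tree construction, computing the parity of D bits with D − 1 XOR gates in about log₂ D layers (the function name is illustrative):

```python
def parity_tree(bits):
    """Parity via a complete binary tree of XOR gates.

    Uses D - 1 XOR gates arranged in ~log2(D) layers, in contrast to the
    exponentially many gates a constant-depth circuit would need.
    """
    layer = list(bits)
    while len(layer) > 1:
        nxt = [a ^ b for a, b in zip(layer[::2], layer[1::2])]
        if len(layer) % 2:       # an odd leftover bit passes through unchanged
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

print(parity_tree([1, 0, 1, 1]))  # 1: three ones, so the parity is odd
```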

The formal theorem is below:

Theorem (Parity Function Complexity). Any circuit of depth K < log₂ D that computes the parity function of D input bits must contain exponentially many gates (on the order of 2^D).

A neural network isn’t exactly the same as a circuit, but it is generally believed that the same result holds for neural networks. This gives a strong indication that depth might be an important consideration in neural networks.

One way of thinking about the issue of breadth versus depth has to do
with the number of parameters that need to be estimated. By the heuristic
that you need roughly one or two examples for every parameter, a deep
model could potentially require exponentially fewer examples to train than a
shallow model!

Deep networks make the architecture selection problem more significant. Namely, when you use a two-layer network, the only hyperparameter to choose is how many hidden units should go in the middle layer. When you choose a deep network, you need to choose how many layers there are and the width of each of those layers. This can be somewhat daunting.

As back-propagation works its way down through the model, the sizes of the gradients shrink. If a weight is at the beginning of a very deep network, changing it is unlikely to have a significant effect on the output, since its effect has to pass through so many other units before getting there. This directly implies that the derivatives are small. As a result, back-propagation essentially never moves far from its initialization when run on very deep networks.

Although these small derivatives make training difficult, they might be good for other reasons: for example, never moving far from a small initialization can act as a form of regularization.
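A toy demonstration of this shrinking-gradient effect (illustrative code, not from the text): push a gradient backward through twenty tanh layers with small random weights and watch its norm decay.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 32, 20                  # layer width and network depth
Ws, pre_acts = [], []

# Forward pass through L tanh layers with small random weights.
h = rng.normal(size=D)
for _ in range(L):
    W = rng.uniform(-0.1, 0.1, size=(D, D))
    a = W @ h
    Ws.append(W)
    pre_acts.append(a)
    h = np.tanh(a)

# Backward pass: the gradient norm shrinks with every layer it crosses.
grad = np.ones(D)              # stand-in gradient arriving from the loss
for W, a in zip(reversed(Ws), reversed(pre_acts)):
    grad = W.T @ (grad * (1 - np.tanh(a)**2))   # chain rule through one layer
    print(np.linalg.norm(grad))                 # decays rapidly toward zero
```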

Finding good ways to train deep networks is an active research area. There are two general strategies. The first is to attempt to initialize the weights better, often by a layer-wise initialization strategy; this can often be done using unlabeled data. After this initialization, back-propagation can be run to tweak the weights for whatever classification problem you care about. A second approach is to use a more complex optimization procedure rather than gradient descent.

Basis Functions:
We’ve seen that: (a) neural networks can mimic linear functions and
(b) they can learn more complex functions.

A natural way to train a neural network to mimic a KNN classifier is to replace the sigmoid link function with a radial basis function (RBF). In a sigmoid network (i.e., a network with sigmoid links), the hidden units are computed as h_i = tanh(w_i · x). In an RBF network, the hidden units are computed as:

h_i = exp(−γ_i ‖w_i − x‖²)

The hidden units behave like little Gaussian “bumps” centered at locations specified by the vectors w_i. The parameter γ_i (gamma) specifies the width of the Gaussian bump: if γ_i is large, then only data points that are really close to w_i have non-zero activations. To distinguish sigmoid networks from RBF networks, the hidden units are typically drawn with sigmoids or with Gaussian bumps, respectively.

Training RBF networks involves finding good values for the Gaussian widths γ_i, the centers of the Gaussian bumps w_i, and the connections between the Gaussian bumps and the output unit, v. This can all be done using back-propagation. The gradient terms for v remain unchanged from before; the derivatives for the other variables differ.
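A minimal sketch of prediction in such an RBF network, following the formula above (the function name is illustrative):

```python
import numpy as np

def rbf_predict(W, gamma, v, x):
    """RBF-network prediction: y_hat = sum_i v_i * exp(-gamma_i * ||w_i - x||^2).

    W     : (K, D) matrix of bump centers, one row per hidden unit
    gamma : (K,) bump widths; a larger gamma_i means a narrower bump
    v     : (K,) output weights
    x     : (D,) input feature vector
    """
    sq_dists = np.sum((W - x)**2, axis=1)   # squared distance to each center
    h = np.exp(-gamma * sq_dists)           # Gaussian "bump" activations
    return v @ h

# Example: two bumps in two dimensions
W = np.array([[0.0, 0.0], [1.0, 1.0]])
gamma = np.array([1.0, 5.0])
v = np.array([1.0, -1.0])
print(rbf_predict(W, gamma, v, np.array([0.1, 0.0])))
```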
