Unit-7_ANN

The document provides an overview of deep learning and artificial neural networks, focusing on the structure and function of biological neurons as well as the principles behind artificial neurons. It discusses the McCulloch-Pitts neuron model and the perceptron model, highlighting their capabilities in representing boolean functions and making binary classifications. Additionally, it introduces the perceptron learning algorithm for optimizing weights and thresholds in neural networks.


2CEIT602: Artificial Intelligence

Unit-6: Deep Learning: Basics of Neural Network

Department of Computer Engineering & Information Technology,


U V Patel College of Engineering,
Ganpat University
History of Biological Neurons
Biological Neurons
Biological Neural Networks (BNN)
➢ Nervous System
➢ Neurons
• What? → A neuron is a nerve cell that is the fundamental building block of the biological nervous system. Neurons are similar to other cells in the human body in many ways, but there is one key difference: neurons are specialized to transmit information throughout the body.
• The brain contains roughly 10 – 100 billion neurons
• Each neuron connects to roughly 100 – 10,000 other neurons
• Neurons come in different types
• Neurons communicate via electrochemical signals
Biological Neural Networks
➢ How does it work?
Basic Components of Biological Neurons
➢The majority of neurons encode their activations or outputs as a series of brief electrical
pulses (i.e. action potentials).
➢The neuron’s cell body (soma) processes the incoming activations and converts them
into output activations.
➢The neuron’s nucleus contains the genetic material in the form of DNA. This exists in
most types of cells, not just neurons.
➢Dendrites are fibers which start from the cell body and provide the receptive zones that
receive activation from other neurons.
➢Axons are fibers acting as transmission lines that send activation to other neurons.
➢The junctions that allow signal transmission between the axons and dendrites are called
synapses.
➢The process of transmission is by diffusion of chemicals called neurotransmitters
across the synaptic cleft.
➢At the other end of the axon, there exists a junction called the synapse. This unit controls the flow of neuronal current from the originating neuron to the receiving dendrites of neighbouring neurons.
➢Synapses have a processing value or weight.
Basic Components of Biological Neurons
➢Communication Between Synapses

➢Once an electrical impulse has reached the end of an axon, the information must be
transmitted across the synaptic gap to the dendrites of the adjoining neuron.

➢In some cases, the electrical signal can almost instantaneously bridge the gap between
the neurons and continue along its path.

➢In other cases, neurotransmitters are needed to send the information from one neuron to
the next.

➢Neurotransmitters are chemical messengers that are released from the axon terminals to
cross the synaptic gap and reach the receptor sites of other neurons.

➢In a process known as reuptake, these neurotransmitters are reabsorbed by the releasing neuron so that they can be reused.
Artificial Neural Network (ANN)
History of Artificial Neural Network
Artificial Neural Network
The most fundamental unit of a deep neural network is called an artificial neuron.
Why is it called a neuron? Where does the inspiration come from?
The inspiration comes from biology (more specifically, from the brain): biological neurons = neural cells = neural processing units.
[Figure: an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, aggregation σ, and output y]
Artificial Neuron
McCulloch Pitts Neuron
McCulloch (a neuroscientist) and Pitts (a logician) proposed a highly simplified computational model of the neuron (1943).
The function g aggregates the inputs and the function f takes a decision based on this aggregation.
The inputs can be excitatory or inhibitory: y = 0 if any x_i is inhibitory, else

g(x1, x2, ..., xn) = g(x) = Σ_{i=1}^{n} x_i
y = f(g(x)) = 1 if g(x) ≥ θ
            = 0 if g(x) < θ

Here the inputs x1, x2, ..., xn ∈ {0, 1} and the output y ∈ {0, 1}.
θ is called the thresholding parameter. This is called Thresholding Logic.
Let us implement some boolean functions using this McCulloch Pitts (MP) neuron
...
[Figures: McCulloch Pitts units with output y ∈ {0, 1}]
A McCulloch Pitts unit with threshold θ
AND function over x1, x2, x3: θ = 3 (fires only when all three inputs are 1)
OR function over x1, x2, x3: θ = 1 (fires when at least one input is 1)
x1 AND !x2: θ = 1 with x2 as an inhibitory input
NOR function: θ = 0 with both inputs inhibitory
NOT function: θ = 0 with the single input x1 inhibitory
A circle at the end of an input line indicates an inhibitory input: if any inhibitory input is 1, the output will be 0.
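The units above can be sketched in a few lines of code. This is a minimal illustration of the MP neuron, assuming the convention just described (inhibitory inputs force the output to 0, otherwise threshold the sum); the function name and signature are my own.

```python
# A minimal sketch of a McCulloch-Pitts unit (names are illustrative,
# not from the slides).
def mp_neuron(inputs, theta, inhibitory=()):
    """Fire (return 1) iff no inhibitory input is 1 and sum(inputs) >= theta."""
    if any(inputs[i] == 1 for i in inhibitory):
        return 0
    return 1 if sum(inputs) >= theta else 0

# AND of 3 inputs: theta = 3
assert mp_neuron([1, 1, 1], theta=3) == 1
assert mp_neuron([1, 1, 0], theta=3) == 0

# OR of 3 inputs: theta = 1
assert mp_neuron([0, 0, 1], theta=1) == 1
assert mp_neuron([0, 0, 0], theta=1) == 0

# NOT: theta = 0, the single input is inhibitory
assert mp_neuron([0], theta=0, inhibitory=[0]) == 1
assert mp_neuron([1], theta=0, inhibitory=[0]) == 0
```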
Can any boolean function be represented using a McCulloch Pitts unit?
Let us first see the geometric interpretation of an MP unit ...
What if we have more than 2 inputs?
Well, instead of a line we will have a plane.
For the OR function over x1, x2, x3 (with θ = 1), we want a plane such that the point (0,0,0) lies on one side and the remaining 7 corners of the unit cube (the points (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0), and (1,1,1)) lie on the other side.
The separating plane is x1 + x2 + x3 = θ = 1.
A single McCulloch Pitts neuron can be used to represent boolean functions which are linearly separable.
Linear separability (for boolean functions): there exists a line (plane) such that all inputs which produce a 1 lie on one side of the line (plane) and all inputs which produce a 0 lie on the other side of the line (plane).
Perceptron
What about non-boolean (say, real) inputs ?
Do we always need to hand code the threshold ?
Are all inputs equal ? What if we want to assign more weight (importance) to
some inputs ?
What about functions which are not linearly separable ?
Frank Rosenblatt, an American psychologist, proposed the classical perceptron model (1958), a more general computational model than McCulloch-Pitts neurons.
Main differences: the introduction of numerical weights (w1, w2, ..., wn) for the inputs (x1, x2, ..., xn) and a mechanism for learning these weights. Inputs are no longer limited to boolean values.
The model was refined and carefully analyzed by Minsky and Papert (1969); their model is referred to as the perceptron model here.
Why are we trying to implement boolean functions?
Why do we need weights ?
Why is w0 = −θ called the bias ?
Consider the task of predicting whether we would like a movie or not.
Suppose we base our decision on 3 inputs (binary, for simplicity), with x0 = 1 and w0 = −θ:
x1 = isActorDamon
x2 = isGenreThriller
x3 = isDirectorNolan
Based on our past viewing experience (data), we may give a high weight to isDirectorNolan as compared to the other inputs. Specifically, even if the actor is not Matt Damon and the genre is not thriller, we would still want to cross the threshold θ by assigning a high weight to isDirectorNolan.
w0 is called the bias as it represents the prior (prejudice).
A movie buff may have a very low threshold and may watch any movie irrespective of the genre, actor, or director [θ = 0]. On the other hand, a selective viewer may only watch thrillers starring Matt Damon and directed by Nolan [θ = 3].
The weights (w1, w2, ..., wn) and the bias (w0) will depend on the data (viewer history in this case).
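The weighted-decision idea above can be sketched directly. The weight values below are illustrative assumptions (the slides do not fix them); they only demonstrate that a high weight on isDirectorNolan crosses the threshold on its own.

```python
# A sketch of the perceptron decision rule for the movie example.
# The weight values are illustrative assumptions, not from the slides.
def perceptron(x, w, b):
    """Fire iff w . x + b >= 0 (b = w0 = -theta)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

# A high weight on isDirectorNolan (third input) crosses theta = 1 alone.
w = [0.2, 0.2, 1.0]   # [isActorDamon, isGenreThriller, isDirectorNolan]
b = -1.0              # w0 = -theta, theta = 1

assert perceptron([0, 0, 1], w, b) == 1  # Nolan film: watch
assert perceptron([1, 1, 0], w, b) == 0  # Damon thriller, not Nolan: skip
```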
What kind of functions can be implemented using the perceptron? Any difference
from McCulloch Pitts neurons?
Errors and Error Surfaces
Let us fix the threshold (−w0 = 1) and try different values of w1, w2 on the four inputs (0,0), (0,1), (1,0), (1,1), where (0,0) should output 0 and the rest should output 1 (the OR function).
Say, w1 = −1, w2 = −1, i.e., the line −1 + (−1)x1 + (−1)x2 = 0. What is wrong with this line? We make errors on 3 out of the 4 inputs.
Let us try some more values of w1, w2 and note how many errors we make:

w1    w2    errors
-1    -1    3
1.5   0     1
0.45  0.45  3

(The corresponding lines are −1 + (1.5)x1 + (0)x2 = 0 and −1 + (0.45)x1 + (0.45)x2 = 0; a line such as −1 + 1.1x1 + 1.1x2 = 0 makes 0 errors.)
We are interested in those values of w0, w1, w2 which result in 0 errors.
Let us plot the error surface corresponding to different values of w0, w1, w2.
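The error counts in the table can be checked mechanically. The sketch below assumes the target is the OR function on the four boolean inputs with w0 fixed at −1, as in the discussion above.

```python
# A short check of the error counts above, assuming the target is the
# OR function on inputs (0,0), (0,1), (1,0), (1,1) with w0 = -1.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

def errors(w0, w1, w2):
    """Count inputs where the thresholded prediction disagrees with OR."""
    return sum(
        (1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0) != y
        for (x1, x2), y in points
    )

assert errors(-1, -1.0, -1.0) == 3
assert errors(-1, 1.5, 0.0) == 1
assert errors(-1, 0.45, 0.45) == 3
assert errors(-1, 1.1, 1.1) == 0   # a line with 0 errors exists
```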
Perceptron Learning Algorithm
We will now see a more principled approach for learning these weights and threshold, but before that let us answer this question...
Apart from implementing boolean functions (which does not look very interesting), what can a perceptron be used for?
Our interest lies in the use of the perceptron as a binary classifier. Let us see what this means...
Let us reconsider our problem of deciding whether to watch a movie or not.
Suppose we are given a list of m movies and a label (class) associated with each movie indicating whether the user liked this movie or not: a binary decision.
Further, suppose we represent each movie with n features (some boolean, some real valued):
x1 = isActorDamon
x2 = isGenreThriller
x3 = isDirectorNolan
x4 = imdbRating (scaled to 0 to 1)
...
xn = criticsRating (scaled to 0 to 1)
We will assume that the data is linearly separable and we want a perceptron to learn how to make this decision.
In other words, we want the perceptron to find the equation of the separating plane (or, equivalently, find the values of w0, w1, w2, ..., wn).
Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
Initialize w randomly;
while !convergence do
    Pick random x ∈ P ∪ N;
    if x ∈ P and w·x < 0 then
        w = w + x;
    end
    if x ∈ N and w·x ≥ 0 then
        w = w − x;
    end
end
//the algorithm converges when all the inputs are classified correctly
Consider some points (vectors) which lie in the positive half space of this line (i.e., wᵀx ≥ 0). What will be the angle between any such vector and w? Obviously, less than 90°.
What about points (vectors) which lie in the negative half space of this line (i.e., wᵀx < 0)? What will be the angle between any such vector and w? Obviously, greater than 90°.
Of course, this also follows from the formula cos α = wᵀx / (||w|| ||x||).
[Figure: positive points p1, p2, p3 and negative points n1, n2, n3 around the line wᵀx = 0]
Keeping this picture in mind, let us revisit the algorithm.
We will now see this algorithm in action for a toy dataset.
We initialized w to a random value.
We observe that currently w·x < 0 (∵ angle > 90°) for all the positive points and w·x ≥ 0 (∵ angle < 90°) for all the negative points (the situation is exactly opposite of what we actually want it to be).
We now run the algorithm by randomly going over the points.
Randomly pick a point (say, p1) and apply the correction w = w + x, since w·x < 0 (you can check the angle visually).
[Figure: toy dataset with positive points p1, p2, p3 and negative points n1, n2, n3]
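The algorithm above can be sketched end to end on a small toy dataset. The dataset and function names below are illustrative; x0 = 1 is appended to each point so that w[0] plays the role of the bias w0 = −θ.

```python
import random

# A sketch of the perceptron learning algorithm described above, on a toy
# linearly separable dataset (the points are illustrative).
def train_perceptron(P, N, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    dot = lambda w, x: sum(wi * xi for wi, xi in zip(w, x))
    w = [rng.uniform(-1, 1) for _ in range(len(P[0]))]  # random init
    for _ in range(max_epochs):
        converged = True
        for x in rng.sample(P + N, len(P) + len(N)):    # random order
            if x in P and dot(w, x) < 0:       # positive point misclassified
                w = [wi + xi for wi, xi in zip(w, x)]
                converged = False
            elif x in N and dot(w, x) >= 0:    # negative point misclassified
                w = [wi - xi for wi, xi in zip(w, x)]
                converged = False
        if converged:   # all inputs classified correctly
            return w
    return w

# x0 = 1 is appended so that w[0] acts as the bias w0 = -theta.
P = [(1, 2.0, 2.5), (1, 1.5, 3.0), (1, 2.5, 2.0)]        # label 1
N = [(1, -1.0, -0.5), (1, -2.0, -1.5), (1, -0.5, -2.0)]  # label 0

w = train_perceptron(P, N)
assert all(sum(wi * xi for wi, xi in zip(w, x)) >= 0 for x in P)
assert all(sum(wi * xi for wi, xi in zip(w, x)) < 0 for x in N)
```

Because this toy dataset is linearly separable, the perceptron convergence theorem guarantees the loop terminates after finitely many corrections.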
Coming back to our questions ...
What about non-boolean (say, real) inputs? Real valued inputs are allowed in a perceptron.
Do we always need to hand code the threshold? No, we can learn the threshold.
Are all inputs equal? What if we want to assign more weight (importance) to some inputs? A perceptron allows weights to be assigned to inputs.
What about functions which are not linearly separable? Not possible with a single perceptron, but we will see how to handle this ...
Linearly Separable Boolean Functions
So what do we do about functions which are not linearly separable?
Let us see one such simple boolean function first.
Most real world data is not linearly separable and will always contain some outliers.
In fact, sometimes there may not be any outliers, but still the data may not be linearly separable.
We need computational units (models) which can deal with such data.
While a single perceptron cannot deal with such data, we will show that a network of perceptrons can indeed deal with such data.
[Figure: a dataset where a cluster of + points is surrounded by o points; no single line can separate the two classes]
Before seeing how a network of perceptrons can deal with linearly inseparable data, we need to discuss boolean functions in some more detail ...
How many boolean functions can you design from 2 inputs ?
Let us begin with some easy ones which you already know ..
x1 x2 | f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16
0  0  | 0  0  0  0  0  0  0  0  1  1   1   1   1   1   1   1
0  1  | 0  0  0  0  1  1  1  1  0  0   0   0   1   1   1   1
1  0  | 0  0  1  1  0  0  1  1  0  0   1   1   0   0   1   1
1  1  | 0  1  0  1  0  1  0  1  0  1   0   1   0   1   0   1

Of these, how many are linearly separable? (It turns out all except XOR and !XOR; feel free to verify.)
In general, how many boolean functions can you have for n inputs? 2^(2^n)
How many of these 2^(2^n) functions are not linearly separable? For the time being, it suffices to know that at least some of these are not linearly separable (I encourage you to figure out the exact answer :-) )
Representation Power of a Network of Perceptrons
We will now see how to implement any boolean function using a network of perceptrons ...
For this discussion, we will assume True = +1 and False = -1.
We consider 2 inputs and 4 perceptrons.
Each input is connected to all the 4 perceptrons with specific weights (a red edge indicates w = -1, a blue edge indicates w = +1).
The bias (w0) of each perceptron is -2 (i.e., each perceptron will fire only if the weighted sum of its inputs is ≥ 2).
Each of these perceptrons is connected to an output perceptron by weights w1, w2, w3, w4 (which need to be learned).
The output of this perceptron (y) is the output of this network.
Terminology:
This network contains 3 layers.
The layer containing the inputs (x1, x2) is called the input layer.
The middle layer containing the 4 perceptrons is called the hidden layer.
The final layer containing one output neuron is called the output layer.
The outputs of the 4 perceptrons in the hidden layer are denoted by h1, h2, h3, h4.
The red and blue edges are called layer 1 weights; w1, w2, w3, w4 are called layer 2 weights.
We claim that this network can be used to implement any boolean function (linearly separable or not)!
In other words, we can find w1, w2, w3, w4 such that the truth table of any boolean function can be represented by this network.
Astonishing claim! Well, not really, if you understand what is going on.
Each perceptron in the middle layer fires only for a specific input (and no two perceptrons fire for the same input):
the first perceptron fires for {-1,-1}
the second perceptron fires for {-1,1}
the third perceptron fires for {1,-1}
the fourth perceptron fires for {1,1}
Let us see why this network works by taking an example of the XOR function.
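The XOR construction above can be sketched numerically. Assumptions of mine: each hidden perceptron's weights equal the input pattern it detects (so only that pattern reaches a weighted sum of 2, matching the bias of -2), and the output perceptron fires when its weighted sum is ≥ 0.

```python
# A sketch of the 4-perceptron network implementing XOR, with True = +1
# and False = -1. Layer-1 weights follow the red/blue (+1/-1) pattern.
def fires(w, x, bias=-2):
    """Hidden perceptron: output 1 iff w.x + bias >= 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias >= 0 else 0

# Hidden perceptron i has weights equal to the pattern it detects, so only
# that pattern reaches the weighted sum of 2 needed to overcome the bias.
patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def network(x, layer2):
    h = [fires(p, x) for p in patterns]   # exactly one h_i is 1
    return 1 if sum(w * hi for w, hi in zip(layer2, h)) >= 0 else 0

# For XOR, set w2 = w3 = +1 (fire) and w1 = w4 = -1 (do not fire).
xor_weights = [-1, 1, 1, -1]
assert network((-1, -1), xor_weights) == 0
assert network((-1, 1), xor_weights) == 1
assert network((1, -1), xor_weights) == 1
assert network((1, 1), xor_weights) == 0
```

Changing only `xor_weights` reproduces any of the 16 two-input boolean functions, which is exactly the claim above.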
It should be clear that the same network can be used to represent the remaining 15 boolean functions also.
Each boolean function will result in a different set of non-contradicting inequalities which can be satisfied by appropriately setting w1, w2, w3, w4.
Try it!
What if we have more than 2 inputs? Consider 3 inputs x1, x2, x3 connected to 8 perceptrons, each with bias = -3.
Again, each of the 8 perceptrons will fire only for one of the 8 inputs.
Each of the 8 weights (w1, ..., w8) in the second layer is responsible for one of the 8 inputs and can be adjusted to produce the desired output for that input.
What if we have n inputs?
Theorem
Any boolean function of n inputs can be represented exactly by a network of perceptrons containing 1 hidden layer with 2^n perceptrons and one output layer containing 1 perceptron.

Proof (informal): We just saw how to construct such a network.

Note: A network of 2^n + 1 perceptrons is not necessary but sufficient. For example, we already saw how to represent the AND function with just 1 perceptron.

Catch: As n increases, the number of perceptrons in the hidden layer increases exponentially.
Again, why do we care about boolean functions ?
Networks of the form that we just saw (containing an input layer, an output layer, and one or more hidden layers) are called Multilayer Perceptrons (MLP, in short).
A more appropriate name would be "Multilayered Network of Perceptrons", but MLP is the more commonly used name.
The theorem that we just saw gives us the representation power of an MLP with a single hidden layer: specifically, it tells us that an MLP with a single hidden layer can represent any boolean function.
Sigmoid Neuron
Enough about boolean functions!
What about arbitrary functions of the form y = f(x) where x ∈ Rⁿ (instead of {0, 1}ⁿ) and y ∈ R (instead of {0, 1})?
Can we have a network which can (approximately) represent such functions ?
Before answering the above question we will have to first graduate from perceptrons to sigmoidal neurons ...
A perceptron will fire if the weighted sum of its inputs is greater than the threshold (-w0).
The thresholding logic used by a perceptron is very harsh!
For example, let us return to our problem of deciding whether we will like or dislike a movie. Consider that we base our decision only on one input (x1 = criticsRating, which lies between 0 and 1).
If the threshold is 0.5 (w0 = -0.5) and w1 = 1, then what would be the decision for a movie with criticsRating = 0.51? (like)
What about a movie with criticsRating = 0.49? (dislike)
It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49.
This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose. It is a characteristic of the perceptron function itself, which behaves like a step function.
There will always be this sudden change in the decision (from 0 to 1) when z = Σ_{i=1}^{n} w_i x_i crosses the threshold (-w0).
For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1.
[Figure: step function of z = Σ_{i=1}^{n} w_i x_i, jumping from 0 to 1 at z = -w0]
Introducing sigmoid neurons, where the output function is much smoother than the step function. Here is one form of the sigmoid function, called the logistic function:

y = 1 / (1 + e^(-z)),  where z = Σ_{i=1}^{n} w_i x_i

We no longer see a sharp transition around the threshold -w0. Also, the output y is no longer binary but a real value between 0 and 1, which can be interpreted as a probability. Instead of a like/dislike decision, we get the probability of liking the movie.
Perceptron vs Sigmoid (logistic) Neuron
Perceptron: y = 1 if z = Σ_{i=1}^{n} w_i x_i ≥ -w0, else 0. Not smooth, not continuous (at -w0), not differentiable.
Sigmoid neuron: y = 1 / (1 + e^(-z)). Smooth, continuous, differentiable.
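The contrast can be checked numerically on the one-input movie example above (w1 = 1, w0 = -0.5); this is a minimal sketch, not part of the slides.

```python
import math

# Contrasting the perceptron's step decision with the sigmoid's smooth
# output for the one-input movie example (w1 = 1, w0 = -0.5).
w1, w0 = 1.0, -0.5

def step(x):
    return 1 if w1 * x + w0 >= 0 else 0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-(w1 * x + w0)))

# The step function flips abruptly between ratings 0.49 and 0.51 ...
assert step(0.49) == 0 and step(0.51) == 1
# ... while the sigmoid changes smoothly around the threshold.
assert abs(sigmoid(0.49) - sigmoid(0.51)) < 0.01
assert 0.49 < sigmoid(0.49) < 0.51 and 0.49 < sigmoid(0.51) < 0.51
```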
A typical Supervised Machine Learning Setup
What next? Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron. Before we see such an algorithm we will revisit the concept of error.
[Figure: sigmoid (logistic) neuron with inputs x0 = 1, x1, ..., xn and weights w0 = -θ, w1, ..., wn]
Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable.
What does "cannot deal with" mean? What would happen if we use a perceptron model to classify this data? We would probably end up with a line like this ...
This line doesn't seem to be too bad. Sure, it misclassifies 3 blue points and 3 red points, but we could live with this error in most real world applications.
From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error.
As an illustration, consider our movie example.
Data: {x_i = movie, y_i = like/dislike}, i = 1, ..., n
Model: Our approximation of the relation between x and y (the probability of liking a movie):

ŷ = 1 / (1 + e^(-(wᵀx)))

Parameters: w
Learning algorithm: Gradient descent [we will see soon]
Objective/Loss/Error function: One possibility is the squared error

𝓛(w) = (1/2) Σ_i (ŷ_i - y_i)²

The learning algorithm should aim to find a w which minimizes the above function (the squared error between y and ŷ).
Learning Parameters: (Infeasible) guess work
Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function.
σ stands for the sigmoid function (the logistic function in this case):

f(x) = 1 / (1 + e^(-(w·x + b)))

For ease of explanation, we will consider a very simplified version of the model having just 1 input.
Further, to be consistent with the literature, from now on we will refer to w0 as b (bias).
Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating (y) given imdbRating (x) (for no particular reason).
[Figure: x → w → σ → ŷ = f(x), with the bias b fed from a constant input 1]
What does it mean to train the network?
Suppose we train the network with (x, y) = (0.5, 0.2) and (2.5, 0.9).

f(x) = 1 / (1 + e^(-(w·x + b)))

At the end of training we expect to find w*, b* such that f(0.5) → 0.2 and f(2.5) → 0.9.
In other words, we hope to find a sigmoid function such that the points (0.5, 0.2) and (2.5, 0.9) lie on this sigmoid.
Let us see this in more detail ...
Can we try to find such a w*, b* manually?
Let us try a random guess (say, w = 0.5, b = 0). Clearly not good, but how bad is it? Let us revisit the loss to see how bad it is ...

σ(x) = 1 / (1 + e^(-(wx + b)))

We want the loss to be as close to 0 as possible.
Let us try some other values of w, b:

w      b      loss
0.50   0.00   0.0730
-0.10  0.00   0.1481   (Oops!! this made things even worse...)
0.94   -0.94  0.0214   (Perhaps it would help to push w and b in the other direction...)
1.42   -1.73  0.0028   (Let us keep going in this direction, i.e., increase w and decrease b)
1.65   -2.08  0.0003
1.78   -2.27  0.0000

σ(x) = 1 / (1 + e^(-(wx + b)))

With some guess work and intuition we were able to find the right values for w and b.
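The table above can be reproduced in a few lines. This sketch assumes the squared-error loss 𝓛 = (1/2) Σ (f(x) − y)² over the two training points (0.5, 0.2) and (2.5, 0.9); under that assumption the numbers match the table to the printed precision.

```python
import math

# Reproducing the "guess work" loss table, assuming the loss is
# L = (1/2) * sum((f(x) - y)^2) over the two training points.
data = [(0.5, 0.2), (2.5, 0.9)]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    return 0.5 * sum((f(x, w, b) - y) ** 2 for x, y in data)

assert abs(loss(0.50, 0.00) - 0.0730) < 1e-3
assert abs(loss(-0.10, 0.00) - 0.1481) < 1e-3
assert abs(loss(0.94, -0.94) - 0.0214) < 1e-3
assert abs(loss(1.78, -2.27) - 0.0000) < 1e-3
```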
Let us look at something better than our "guess work" algorithm ...
Let us look at the geometric interpretation of our "guess work" algorithm in terms of the error surface. We will figure this out over the next few slides ...
If we take the logistic function and set w to a very high value, we will recover the step function. Let us see what happens as we change the value of w:

σ(x) = 1 / (1 + e^(-(wx + b)))

[Figures: plots of σ(x) for w = 0, 5, 10, 20, 30, 50 (with b = 0); as w grows, the curve approaches a step function]

Further, we can adjust the value of b to control the position on the x-axis at which the function transitions from 0 to 1.

[Figures: plots for w = 50 with b = 10, 25, 35; changing b shifts the transition point]
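The limiting behaviour described above can be verified numerically; this is a quick sketch of my own, checking both the sharpening effect of large w and the fact that the transition sits where wx + b = 0, i.e., at x = -b/w.

```python
import math

# A numeric check that the logistic function approaches a step function
# as w grows, and that b shifts the transition point.
def sigma(x, w, b=0.0):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# With a small w the transition is gradual ...
assert 0.6 < sigma(0.1, w=5) < 0.7
# ... with a large w the output is nearly 0/1 on either side of x = 0.
assert sigma(0.1, w=50) > 0.99
assert sigma(-0.1, w=50) < 0.01
# The transition point sits where wx + b = 0, i.e., at x = -b/w.
assert abs(sigma(-10 / 50, w=50, b=10) - 0.5) < 1e-9
```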
Multi-Layered Perceptron (MLP)
Feedforward Neural Networks (a.k.a. multilayered network
of neurons)
The input to the network is an n-dimensional vector.
The network contains L − 1 hidden layers (2, in this case) having n neurons each.
Finally, there is one output layer containing k neurons (say, corresponding to k classes).
Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation (ai and hi are vectors).
The input layer can be called the 0-th layer and the output layer can be called the L-th layer.
Wi ∈ R^(n×n) and bi ∈ R^n are the weight and bias between layers i − 1 and i (0 < i < L).
The pre-activation at layer i is given by

ai(x) = bi + Wi hi−1(x)

The activation at layer i is given by

hi(x) = g(ai(x))

where g is called the activation function (for example, logistic, tanh, linear, etc.)

The activation at the output layer is given by

f(x) = hL(x) = O(aL(x))

where O is the output activation function (for example, softmax, linear, etc.)

To simplify notation we will refer to ai(x) as ai and hi(x) as hi, so that

ai = bi + Wi hi−1,  hi = g(ai),  f(x) = hL = O(aL)
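The layer-wise equations above can be sketched directly in NumPy (a minimal illustration; the layer sizes, g = logistic, and O = softmax are our choices, not fixed by the definitions):

```python
import numpy as np

def g(a):
    # activation function g: here the logistic function
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # output activation O: softmax over the output layer's pre-activation
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Ws, bs):
    """Compute f(x) = h_L = O(a_L) with a_i = b_i + W_i h_{i-1}, h_i = g(a_i)."""
    h = x                          # h_0 = x (the input layer)
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = b + W @ h              # pre-activation at layer i
        h = g(a)                   # activation at layer i
    aL = bs[-1] + Ws[-1] @ h       # pre-activation at the output layer
    return softmax(aL)             # activation at the output layer

# Example: n = 3 inputs, two hidden layers of 3 neurons, k = 2 output classes
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 3)), rng.standard_normal((3, 3)),
      rng.standard_normal((2, 3))]
bs = [np.zeros(3), np.zeros(3), np.zeros(2)]
y_hat = forward(np.array([1.0, 0.5, -0.5]), Ws, bs)
```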
Data: {xi, yi} for i = 1, ..., N
Model:
ŷi = f(xi) = O(W3 g(W2 g(W1 xi + b1) + b2) + b3)
Parameters:
θ = W1, ..., WL, b1, ..., bL (L = 3)
Algorithm: Gradient Descent with Back-propagation (we will see soon)
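For L = 3 the model is literally this one nested expression; here is a tiny sketch (the weight values and the sizes n = 2, k = 2 are made up purely for illustration):

```python
import numpy as np

g = lambda a: 1.0 / (1.0 + np.exp(-a))  # hidden activation: logistic
O = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()  # output: softmax

W1 = np.array([[0.5, -0.3], [0.8, 0.2]]); b1 = np.zeros(2)
W2 = np.array([[0.1, 0.4], [-0.6, 0.7]]); b2 = np.zeros(2)
W3 = np.array([[0.3, -0.1], [0.2, 0.5]]); b3 = np.zeros(2)

x = np.array([1.0, -1.0])
# the model, written exactly as the nested expression above
y_hat = O(W3 @ g(W2 @ g(W1 @ x + b1) + b2) + b3)

# the parameters theta are all the Wi and bi together
n_params = sum(M.size for M in [W1, W2, W3, b1, b2, b3])  # 3*(2*2) + 3*2 = 18
```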
More Information on MLP
Learning Parameters of Feedforward Neural
Networks (Intuition)
Recall our gradient descent algorithm:

Algorithm: gradient descent()
t ← 0;
max iterations ← 1000;
Initialize w0, b0;
while t++ < max iterations do
    wt+1 ← wt − η∇wt;
    bt+1 ← bt − η∇bt;
end

We can write it more concisely by stacking the parameters into a single vector θ = [w, b]:

Algorithm: gradient descent()
t ← 0;
max iterations ← 1000;
Initialize θ0 = [w0, b0];
while t++ < max iterations do
    θt+1 ← θt − η∇θt;
end
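The same loop in executable form, on a toy one-parameter loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3) (the loss and all values are our own, chosen only to show the update rule):

```python
eta = 0.1             # learning rate (eta)
theta = 0.0           # initialize theta_0
max_iterations = 1000

for t in range(max_iterations):
    grad = 2 * (theta - 3.0)    # gradient of the loss at theta_t
    theta = theta - eta * grad  # theta_{t+1} <- theta_t - eta * grad_theta_t

# theta converges to the minimizer theta = 3
```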
Output Functions and Loss Functions
Softmax Function
• The softmax function converts raw values (the outputs of a function) into
probabilities: softmax(z)j = e^(zj) / Σk e^(zk).
• The outputs of the softmax function lie in the [0, 1] range, and they sum to 1.
Thus, the output of the softmax function is a probability distribution.
• The softmax function is used in classification algorithms where there is a
need to obtain a probability or probability distribution as the output. Some
of these algorithms are the following:
• Neural networks
• Multinomial logistic regression (softmax regression)
• Naive Bayes classifier
• Multi-class linear discriminant analysis
• In artificial neural networks, the softmax function is used in the final /
last layer.
• The softmax function is also used in reinforcement learning to output
probabilities for the different actions to be taken.
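A minimal softmax implementation illustrating these properties (subtracting max(z) before exponentiating is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    # exponentiate and normalize so the outputs sum to 1;
    # shifting by max(z) avoids overflow and does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# p is a probability distribution: every entry lies in [0, 1], the total is 1,
# and the largest raw value receives the largest probability
```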
Outputs:             Real Values (Regression)    Probabilities (Classification)
Output Activation:   Linear                      Softmax
Loss Function:       Squared Error               Cross Entropy

Of course, there could be other loss functions depending on the problem at hand,
but the two loss functions that we just saw are encountered very often.
For the rest of this lecture we will focus on the case where the output activation is
a softmax function and the loss function is cross entropy.
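Putting the two together — a softmax output with the cross-entropy loss — in a short sketch (the scores are made up; for a one-hot target the cross entropy reduces to −log of the predicted probability of the true class):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_hat, true_class):
    # for a one-hot target: -log(predicted probability of the true class)
    return -np.log(y_hat[true_class])

z = np.array([2.0, 0.5, -1.0])       # pre-activation at the output layer
y_hat = softmax(z)                   # predicted class probabilities
loss_good = cross_entropy(y_hat, 0)  # true class already most probable: small loss
loss_bad = cross_entropy(y_hat, 2)   # true class least probable: larger loss
```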
