Unit-7_ANN

The document provides an overview of deep learning and artificial neural networks, focusing on the structure and function of biological neurons as well as the principles behind artificial neurons. It discusses the McCulloch-Pitts neuron model and the perceptron model, highlighting their capabilities in representing boolean functions and making binary classifications. Additionally, it introduces the perceptron learning algorithm for optimizing weights and thresholds in neural networks.


2CEIT602: Artificial Intelligence

Unit-6: Deep Learning: Basics of Neural Network

Department of Computer Engineering & Information Technology,


U V Patel College of Engineering,
Ganpat University
History of Biological Neurons
Biological Neurons
Biological Neural Networks (BNN)
➢ Nervous System
➢ Neurons
• What? → A neuron is a nerve cell that is the fundamental building block of the biological nervous system. Neurons are similar to other cells in the human body in many ways, but there is one key difference: neurons are specialized to transmit information throughout the body.
• The brain contains roughly 10 – 100 billion neurons
• Each neuron connects to roughly 100 – 10,000 other neurons
• Neurons come in different types
• Neurons communicate via electrochemical signals
Biological Neural Networks
➢ How does it work?
Basic Components of Biological Neurons
➢The majority of neurons encode their activations or outputs as a series of brief electrical
pulses (i.e. action potentials).
➢The neuron’s cell body (soma) processes the incoming activations and converts them
into output activations.
➢The neuron’s nucleus contains the genetic material in the form of DNA. This exists in
most types of cells, not just neurons.
➢Dendrites are fibers which start from the cell body and provide the receptive zones that
receive activation from other neurons.
➢Axons are fibers acting as transmission lines that send activation to other neurons.
➢The junctions that allow signal transmission between the axons and dendrites are called
synapses.
➢The process of transmission is by diffusion of chemicals called neurotransmitters
across the synaptic cleft.
➢At the other end of the axon, there exists a junction called the synapse. This unit controls the flow of neuronal current from the originating neuron to the receiving dendrites of neighbouring neurons.
➢Synapses have a processing value or weight.
Basic Components of Biological Neurons
➢Communication Between Synapses

➢Once an electrical impulse has reached the end of an axon, the information must be
transmitted across the synaptic gap to the dendrites of the adjoining neuron.

➢In some cases, the electrical signal can almost instantaneously bridge the gap between
the neurons and continue along its path.

➢In other cases, neurotransmitters are needed to send the information from one neuron to
the next.

➢Neurotransmitters are chemical messengers that are released from the axon terminals to
cross the synaptic gap and reach the receptor sites of other neurons.

➢In a process known as reuptake, these neurotransmitters are reabsorbed by the releasing neuron so that they can be reused.
Artificial Neural Network (ANN)
History of Artificial Neural Network
Artificial Neural Network
The most fundamental unit of a deep neural network is called an artificial neuron.
Why is it called a neuron? Where does the inspiration come from?
The inspiration comes from biology (more specifically, from the brain): biological neurons = neural cells = neural processing units.
[Figure: an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, aggregation σ, and output y]
Artificial Neuron
McCulloch Pitts Neuron
McCulloch (a neuroscientist) and Pitts (a logician) proposed a highly simplified computational model of the neuron (1943).
The function g aggregates the inputs and the function f takes a decision based on this aggregation.
The inputs can be excitatory or inhibitory: y = 0 if any x_i is inhibitory, else

g(x1, x2, ..., xn) = g(x) = Σ_{i=1}^{n} x_i
y = f(g(x)) = 1 if g(x) ≥ θ
            = 0 if g(x) < θ

Here the inputs x1, x2, ..., xn ∈ {0, 1} and the output y ∈ {0, 1}.
θ is called the thresholding parameter. This is called Thresholding Logic.
Let us implement some boolean functions using this McCulloch Pitts (MP) neuron
...
[Figures: McCulloch Pitts units with output y ∈ {0, 1}]
A McCulloch Pitts unit with threshold θ
AND function over x1, x2, x3: θ = 3 (fires only when all three inputs are 1)
OR function over x1, x2, x3: θ = 1 (fires when at least one input is 1)
x1 AND !x2: θ = 1 with x2 as an inhibitory input
NOR function: θ = 0 with both inputs inhibitory
NOT function: θ = 0 with the single input x1 inhibitory
A circle at the end of an input line indicates an inhibitory input: if any inhibitory input is 1, the output will be 0.
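The units above can be sketched in a few lines of code. This is a minimal illustration of the MP neuron, assuming the convention just described (inhibitory inputs force the output to 0, otherwise threshold the sum); the function name and signature are my own.

```python
# A minimal sketch of a McCulloch-Pitts unit (names are illustrative,
# not from the slides).
def mp_neuron(inputs, theta, inhibitory=()):
    """Fire (return 1) iff no inhibitory input is 1 and sum(inputs) >= theta."""
    if any(inputs[i] == 1 for i in inhibitory):
        return 0
    return 1 if sum(inputs) >= theta else 0

# AND of 3 inputs: theta = 3
assert mp_neuron([1, 1, 1], theta=3) == 1
assert mp_neuron([1, 1, 0], theta=3) == 0

# OR of 3 inputs: theta = 1
assert mp_neuron([0, 0, 1], theta=1) == 1
assert mp_neuron([0, 0, 0], theta=1) == 0

# NOT: theta = 0, the single input is inhibitory
assert mp_neuron([0], theta=0, inhibitory=[0]) == 1
assert mp_neuron([1], theta=0, inhibitory=[0]) == 0
```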
Can any boolean function be represented using a McCulloch Pitts unit?
Let us first see the geometric interpretation of an MP unit ...
What if we have more than 2 inputs?
Well, instead of a line we will have a plane.
For the OR function over x1, x2, x3 (with θ = 1), we want a plane such that the point (0,0,0) lies on one side and the remaining 7 corners of the unit cube (the points (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0), and (1,1,1)) lie on the other side.
The separating plane is x1 + x2 + x3 = θ = 1.
A single McCulloch Pitts neuron can be used to represent boolean functions which are linearly separable.
Linear separability (for boolean functions): there exists a line (plane) such that all inputs which produce a 1 lie on one side of the line (plane) and all inputs which produce a 0 lie on the other side of the line (plane).
Perceptron
What about non-boolean (say, real) inputs ?
Do we always need to hand code the threshold ?
Are all inputs equal ? What if we want to assign more weight (importance) to
some inputs ?
What about functions which are not linearly separable ?
Frank Rosenblatt, an American psychologist, proposed the classical perceptron model (1958), a more general computational model than McCulloch-Pitts neurons.
Main differences: the introduction of numerical weights (w1, w2, ..., wn) for the inputs (x1, x2, ..., xn) and a mechanism for learning these weights. Inputs are no longer limited to boolean values.
The model was refined and carefully analyzed by Minsky and Papert (1969); their model is referred to as the perceptron model here.
Why are we trying to implement boolean functions?
Why do we need weights ?
Why is w0 = −θ called the bias ?
Consider the task of predicting whether we would like a movie or not.
Suppose we base our decision on 3 inputs (binary, for simplicity), with x0 = 1 and w0 = −θ:
x1 = isActorDamon
x2 = isGenreThriller
x3 = isDirectorNolan
Based on our past viewing experience (data), we may give a high weight to isDirectorNolan as compared to the other inputs. Specifically, even if the actor is not Matt Damon and the genre is not thriller, we would still want to cross the threshold θ by assigning a high weight to isDirectorNolan.
w0 is called the bias as it represents the prior (prejudice).
A movie buff may have a very low threshold and may watch any movie irrespective of the genre, actor, or director [θ = 0]. On the other hand, a selective viewer may only watch thrillers starring Matt Damon and directed by Nolan [θ = 3].
The weights (w1, w2, ..., wn) and the bias (w0) will depend on the data (viewer history in this case).
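The weighted-decision idea above can be sketched directly. The weight values below are illustrative assumptions (the slides do not fix them); they only demonstrate that a high weight on isDirectorNolan crosses the threshold on its own.

```python
# A sketch of the perceptron decision rule for the movie example.
# The weight values are illustrative assumptions, not from the slides.
def perceptron(x, w, b):
    """Fire iff w . x + b >= 0 (b = w0 = -theta)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

# A high weight on isDirectorNolan (third input) crosses theta = 1 alone.
w = [0.2, 0.2, 1.0]   # [isActorDamon, isGenreThriller, isDirectorNolan]
b = -1.0              # w0 = -theta, theta = 1

assert perceptron([0, 0, 1], w, b) == 1  # Nolan film: watch
assert perceptron([1, 1, 0], w, b) == 0  # Damon thriller, not Nolan: skip
```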
What kind of functions can be implemented using the perceptron? Any difference
from McCulloch Pitts neurons?
Errors and Error Surfaces
Let us fix the threshold (−w0 = 1) and try different values of w1, w2 on the four inputs (0,0), (0,1), (1,0), (1,1), where (0,0) should output 0 and the rest should output 1 (the OR function).
Say, w1 = −1, w2 = −1, i.e., the line −1 + (−1)x1 + (−1)x2 = 0. What is wrong with this line? We make errors on 3 out of the 4 inputs.
Let us try some more values of w1, w2 and note how many errors we make:

w1    w2    errors
-1    -1    3
1.5   0     1
0.45  0.45  3

(The corresponding lines are −1 + (1.5)x1 + (0)x2 = 0 and −1 + (0.45)x1 + (0.45)x2 = 0; a line such as −1 + 1.1x1 + 1.1x2 = 0 makes 0 errors.)
We are interested in those values of w0, w1, w2 which result in 0 errors.
Let us plot the error surface corresponding to different values of w0, w1, w2.
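The error counts in the table can be checked mechanically. The sketch below assumes the target is the OR function on the four boolean inputs with w0 fixed at −1, as in the discussion above.

```python
# A short check of the error counts above, assuming the target is the
# OR function on inputs (0,0), (0,1), (1,0), (1,1) with w0 = -1.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

def errors(w0, w1, w2):
    """Count inputs where the thresholded prediction disagrees with OR."""
    return sum(
        (1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0) != y
        for (x1, x2), y in points
    )

assert errors(-1, -1.0, -1.0) == 3
assert errors(-1, 1.5, 0.0) == 1
assert errors(-1, 0.45, 0.45) == 3
assert errors(-1, 1.1, 1.1) == 0   # a line with 0 errors exists
```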
Perceptron Learning Algorithm
We will now see a more principled approach for learning these weights and threshold, but before that let us answer this question...
Apart from implementing boolean functions (which does not look very interesting), what can a perceptron be used for?
Our interest lies in the use of the perceptron as a binary classifier. Let us see what this means...
Let us reconsider our problem of deciding whether to watch a movie or not.
Suppose we are given a list of m movies and a label (class) associated with each movie indicating whether the user liked this movie or not: a binary decision.
Further, suppose we represent each movie with n features (some boolean, some real valued):
x1 = isActorDamon
x2 = isGenreThriller
x3 = isDirectorNolan
x4 = imdbRating (scaled to 0 to 1)
...
xn = criticsRating (scaled to 0 to 1)
We will assume that the data is linearly separable and we want a perceptron to learn how to make this decision.
In other words, we want the perceptron to find the equation of the separating plane (or, equivalently, find the values of w0, w1, w2, ..., wn).
Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
Initialize w randomly;
while !convergence do
    Pick random x ∈ P ∪ N;
    if x ∈ P and w·x < 0 then
        w = w + x;
    end
    if x ∈ N and w·x ≥ 0 then
        w = w − x;
    end
end
//the algorithm converges when all the inputs are classified correctly
Consider some points (vectors) which lie in the positive half space of this line (i.e., wᵀx ≥ 0). What will be the angle between any such vector and w? Obviously, less than 90°.
What about points (vectors) which lie in the negative half space of this line (i.e., wᵀx < 0)? What will be the angle between any such vector and w? Obviously, greater than 90°.
Of course, this also follows from the formula cos α = wᵀx / (||w|| ||x||).
[Figure: positive points p1, p2, p3 and negative points n1, n2, n3 around the line wᵀx = 0]
Keeping this picture in mind, let us revisit the algorithm.
We will now see this algorithm in action for a toy dataset.
We initialized w to a random value.
We observe that currently w·x < 0 (∵ angle > 90°) for all the positive points and w·x ≥ 0 (∵ angle < 90°) for all the negative points (the situation is exactly opposite of what we actually want it to be).
We now run the algorithm by randomly going over the points.
Randomly pick a point (say, p1) and apply the correction w = w + x, since w·x < 0 (you can check the angle visually).
[Figure: toy dataset with positive points p1, p2, p3 and negative points n1, n2, n3]
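The algorithm above can be sketched end to end on a small toy dataset. The dataset and function names below are illustrative; x0 = 1 is appended to each point so that w[0] plays the role of the bias w0 = −θ.

```python
import random

# A sketch of the perceptron learning algorithm described above, on a toy
# linearly separable dataset (the points are illustrative).
def train_perceptron(P, N, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    dot = lambda w, x: sum(wi * xi for wi, xi in zip(w, x))
    w = [rng.uniform(-1, 1) for _ in range(len(P[0]))]  # random init
    for _ in range(max_epochs):
        converged = True
        for x in rng.sample(P + N, len(P) + len(N)):    # random order
            if x in P and dot(w, x) < 0:       # positive point misclassified
                w = [wi + xi for wi, xi in zip(w, x)]
                converged = False
            elif x in N and dot(w, x) >= 0:    # negative point misclassified
                w = [wi - xi for wi, xi in zip(w, x)]
                converged = False
        if converged:   # all inputs classified correctly
            return w
    return w

# x0 = 1 is appended so that w[0] acts as the bias w0 = -theta.
P = [(1, 2.0, 2.5), (1, 1.5, 3.0), (1, 2.5, 2.0)]        # label 1
N = [(1, -1.0, -0.5), (1, -2.0, -1.5), (1, -0.5, -2.0)]  # label 0

w = train_perceptron(P, N)
assert all(sum(wi * xi for wi, xi in zip(w, x)) >= 0 for x in P)
assert all(sum(wi * xi for wi, xi in zip(w, x)) < 0 for x in N)
```

Because this toy dataset is linearly separable, the perceptron convergence theorem guarantees the loop terminates after finitely many corrections.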
Coming back to our questions ...
What about non-boolean (say, real) inputs? Real valued inputs are allowed in a perceptron.
Do we always need to hand code the threshold? No, we can learn the threshold.
Are all inputs equal? What if we want to assign more weight (importance) to some inputs? A perceptron allows weights to be assigned to inputs.
What about functions which are not linearly separable? Not possible with a single perceptron, but we will see how to handle this ...
Linearly Separable Boolean Functions
So what do we do about functions which are not linearly separable?
Let us see one such simple boolean function first.
Most real world data is not linearly separable and will always contain some outliers.
In fact, sometimes there may not be any outliers, but still the data may not be linearly separable.
We need computational units (models) which can deal with such data.
While a single perceptron cannot deal with such data, we will show that a network of perceptrons can indeed deal with such data.
[Figure: a dataset where a cluster of + points is surrounded by o points; no single line can separate the two classes]
Before seeing how a network of perceptrons can deal with linearly inseparable data, we need to discuss boolean functions in some more detail ...
How many boolean functions can you design from 2 inputs ?
Let us begin with some easy ones which you already know ..
x1 x2 | f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16
0  0  | 0  0  0  0  0  0  0  0  1  1   1   1   1   1   1   1
0  1  | 0  0  0  0  1  1  1  1  0  0   0   0   1   1   1   1
1  0  | 0  0  1  1  0  0  1  1  0  0   1   1   0   0   1   1
1  1  | 0  1  0  1  0  1  0  1  0  1   0   1   0   1   0   1

Of these, how many are linearly separable? (It turns out all except XOR and !XOR; feel free to verify.)
In general, how many boolean functions can you have for n inputs? 2^(2^n)
How many of these 2^(2^n) functions are not linearly separable? For the time being, it suffices to know that at least some of these are not linearly separable (I encourage you to figure out the exact answer :-) )
Representation Power of a Network of Perceptrons
We will now see how to implement any boolean function using a network of perceptrons ...
For this discussion, we will assume True = +1 and False = -1.
We consider 2 inputs and 4 perceptrons.
Each input is connected to all the 4 perceptrons with specific weights (a red edge indicates w = -1, a blue edge indicates w = +1).
The bias (w0) of each perceptron is -2 (i.e., each perceptron will fire only if the weighted sum of its inputs is ≥ 2).
Each of these perceptrons is connected to an output perceptron by weights w1, w2, w3, w4 (which need to be learned).
The output of this perceptron (y) is the output of this network.
Terminology:
This network contains 3 layers.
The layer containing the inputs (x1, x2) is called the input layer.
The middle layer containing the 4 perceptrons is called the hidden layer.
The final layer containing one output neuron is called the output layer.
The outputs of the 4 perceptrons in the hidden layer are denoted by h1, h2, h3, h4.
The red and blue edges are called layer 1 weights; w1, w2, w3, w4 are called layer 2 weights.
We claim that this network can be used to implement any boolean function (linearly separable or not)!
In other words, we can find w1, w2, w3, w4 such that the truth table of any boolean function can be represented by this network.
Astonishing claim! Well, not really, if you understand what is going on.
Each perceptron in the middle layer fires only for a specific input (and no two perceptrons fire for the same input):
the first perceptron fires for {-1,-1}
the second perceptron fires for {-1,1}
the third perceptron fires for {1,-1}
the fourth perceptron fires for {1,1}
Let us see why this network works by taking an example of the XOR function.
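The XOR construction above can be sketched numerically. Assumptions of mine: each hidden perceptron's weights equal the input pattern it detects (so only that pattern reaches a weighted sum of 2, matching the bias of -2), and the output perceptron fires when its weighted sum is ≥ 0.

```python
# A sketch of the 4-perceptron network implementing XOR, with True = +1
# and False = -1. Layer-1 weights follow the red/blue (+1/-1) pattern.
def fires(w, x, bias=-2):
    """Hidden perceptron: output 1 iff w.x + bias >= 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias >= 0 else 0

# Hidden perceptron i has weights equal to the pattern it detects, so only
# that pattern reaches the weighted sum of 2 needed to overcome the bias.
patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def network(x, layer2):
    h = [fires(p, x) for p in patterns]   # exactly one h_i is 1
    return 1 if sum(w * hi for w, hi in zip(layer2, h)) >= 0 else 0

# For XOR, set w2 = w3 = +1 (fire) and w1 = w4 = -1 (do not fire).
xor_weights = [-1, 1, 1, -1]
assert network((-1, -1), xor_weights) == 0
assert network((-1, 1), xor_weights) == 1
assert network((1, -1), xor_weights) == 1
assert network((1, 1), xor_weights) == 0
```

Changing only `xor_weights` reproduces any of the 16 two-input boolean functions, which is exactly the claim above.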
It should be clear that the same network can be used to represent the remaining 15 boolean functions also.
Each boolean function will result in a different set of non-contradicting inequalities which can be satisfied by appropriately setting w1, w2, w3, w4.
Try it!
What if we have more than 2 inputs? Consider 3 inputs x1, x2, x3 connected to 8 perceptrons, each with bias = -3.
Again, each of the 8 perceptrons will fire only for one of the 8 inputs.
Each of the 8 weights (w1, ..., w8) in the second layer is responsible for one of the 8 inputs and can be adjusted to produce the desired output for that input.
What if we have n inputs?
Theorem
Any boolean function of n inputs can be represented exactly by a network of perceptrons containing 1 hidden layer with 2^n perceptrons and one output layer containing 1 perceptron.

Proof (informal): We just saw how to construct such a network.

Note: A network of 2^n + 1 perceptrons is not necessary but sufficient. For example, we already saw how to represent the AND function with just 1 perceptron.

Catch: As n increases, the number of perceptrons in the hidden layer increases exponentially.
Again, why do we care about boolean functions ?
Networks of the form that we just saw (containing an input layer, an output layer, and one or more hidden layers) are called Multilayer Perceptrons (MLP, in short).
A more appropriate name would be "Multilayered Network of Perceptrons", but MLP is the more commonly used name.
The theorem that we just saw gives us the representation power of an MLP with a single hidden layer: specifically, it tells us that an MLP with a single hidden layer can represent any boolean function.
Sigmoid Neuron
Enough about boolean functions!
What about arbitrary functions of the form y = f(x) where x ∈ Rⁿ (instead of {0, 1}ⁿ) and y ∈ R (instead of {0, 1})?
Can we have a network which can (approximately) represent such functions ?
Before answering the above question we will have to first graduate from perceptrons to sigmoidal neurons ...
A perceptron will fire if the weighted sum of its inputs is greater than the threshold (-w0).
The thresholding logic used by a perceptron is very harsh!
For example, let us return to our problem of deciding whether we will like or dislike a movie. Consider that we base our decision only on one input (x1 = criticsRating, which lies between 0 and 1).
If the threshold is 0.5 (w0 = -0.5) and w1 = 1, then what would be the decision for a movie with criticsRating = 0.51? (like)
What about a movie with criticsRating = 0.49? (dislike)
It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49.
This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose. It is a characteristic of the perceptron function itself, which behaves like a step function.
There will always be this sudden change in the decision (from 0 to 1) when z = Σ_{i=1}^{n} w_i x_i crosses the threshold (-w0).
For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1.
[Figure: step function of z = Σ_{i=1}^{n} w_i x_i, jumping from 0 to 1 at z = -w0]
Introducing sigmoid neurons, where the output function is much smoother than the step function. Here is one form of the sigmoid function, called the logistic function:

y = 1 / (1 + e^(-z)),  where z = Σ_{i=1}^{n} w_i x_i

We no longer see a sharp transition around the threshold -w0. Also, the output y is no longer binary but a real value between 0 and 1, which can be interpreted as a probability. Instead of a like/dislike decision, we get the probability of liking the movie.
Perceptron vs Sigmoid (logistic) Neuron
Perceptron: y = 1 if z = Σ_{i=1}^{n} w_i x_i ≥ -w0, else 0. Not smooth, not continuous (at -w0), not differentiable.
Sigmoid neuron: y = 1 / (1 + e^(-z)). Smooth, continuous, differentiable.
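The contrast can be checked numerically on the one-input movie example above (w1 = 1, w0 = -0.5); this is a minimal sketch, not part of the slides.

```python
import math

# Contrasting the perceptron's step decision with the sigmoid's smooth
# output for the one-input movie example (w1 = 1, w0 = -0.5).
w1, w0 = 1.0, -0.5

def step(x):
    return 1 if w1 * x + w0 >= 0 else 0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-(w1 * x + w0)))

# The step function flips abruptly between ratings 0.49 and 0.51 ...
assert step(0.49) == 0 and step(0.51) == 1
# ... while the sigmoid changes smoothly around the threshold.
assert abs(sigmoid(0.49) - sigmoid(0.51)) < 0.01
assert 0.49 < sigmoid(0.49) < 0.51 and 0.49 < sigmoid(0.51) < 0.51
```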
A typical Supervised Machine Learning Setup
What next? Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron. Before we see such an algorithm we will revisit the concept of error.
[Figure: sigmoid (logistic) neuron with inputs x0 = 1, x1, ..., xn and weights w0 = -θ, w1, ..., wn]
Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable.
What does "cannot deal with" mean? What would happen if we use a perceptron model to classify this data? We would probably end up with a line like this ...
This line doesn't seem to be too bad. Sure, it misclassifies 3 blue points and 3 red points, but we could live with this error in most real world applications.
From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error.
As an illustration, consider our movie example.
Data: {x_i = movie, y_i = like/dislike}, i = 1, ..., n
Model: Our approximation of the relation between x and y (the probability of liking a movie):

ŷ = 1 / (1 + e^(-(wᵀx)))

Parameters: w
Learning algorithm: Gradient descent [we will see soon]
Objective/Loss/Error function: One possibility is the squared error

𝓛(w) = (1/2) Σ_i (ŷ_i - y_i)²

The learning algorithm should aim to find a w which minimizes the above function (the squared error between y and ŷ).
Learning Parameters: (Infeasible) guess work
Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function.
σ stands for the sigmoid function (the logistic function in this case):

f(x) = 1 / (1 + e^(-(w·x + b)))

For ease of explanation, we will consider a very simplified version of the model having just 1 input.
Further, to be consistent with the literature, from now on we will refer to w0 as b (bias).
Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating (y) given imdbRating (x) (for no particular reason).
[Figure: x → w → σ → ŷ = f(x), with the bias b fed from a constant input 1]
What does it mean to train the network?
Suppose we train the network with (x, y) = (0.5, 0.2) and (2.5, 0.9).

f(x) = 1 / (1 + e^(-(w·x + b)))

At the end of training we expect to find w*, b* such that f(0.5) → 0.2 and f(2.5) → 0.9.
In other words, we hope to find a sigmoid function such that the points (0.5, 0.2) and (2.5, 0.9) lie on this sigmoid.
Let us see this in more detail ...
Can we try to find such a w*, b* manually?
Let us try a random guess (say, w = 0.5, b = 0). Clearly not good, but how bad is it? Let us revisit the loss to see how bad it is ...

σ(x) = 1 / (1 + e^(-(wx + b)))

We want the loss to be as close to 0 as possible.
Let us try some other values of w, b:

w      b      loss
0.50   0.00   0.0730
-0.10  0.00   0.1481   (Oops!! this made things even worse...)
0.94   -0.94  0.0214   (Perhaps it would help to push w and b in the other direction...)
1.42   -1.73  0.0028   (Let us keep going in this direction, i.e., increase w and decrease b)
1.65   -2.08  0.0003
1.78   -2.27  0.0000

σ(x) = 1 / (1 + e^(-(wx + b)))

With some guess work and intuition we were able to find the right values for w and b.
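The table above can be reproduced in a few lines. This sketch assumes the squared-error loss 𝓛 = (1/2) Σ (f(x) − y)² over the two training points (0.5, 0.2) and (2.5, 0.9); under that assumption the numbers match the table to the printed precision.

```python
import math

# Reproducing the "guess work" loss table, assuming the loss is
# L = (1/2) * sum((f(x) - y)^2) over the two training points.
data = [(0.5, 0.2), (2.5, 0.9)]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    return 0.5 * sum((f(x, w, b) - y) ** 2 for x, y in data)

assert abs(loss(0.50, 0.00) - 0.0730) < 1e-3
assert abs(loss(-0.10, 0.00) - 0.1481) < 1e-3
assert abs(loss(0.94, -0.94) - 0.0214) < 1e-3
assert abs(loss(1.78, -2.27) - 0.0000) < 1e-3
```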
Let us look at something better than our "guess work" algorithm ...
Let us look at the geometric interpretation of our "guess work" algorithm in terms of the error surface. We will figure this out over the next few slides ...
If we take the logistic function and set w to a very high value, we will recover the step function. Let us see what happens as we change the value of w:

σ(x) = 1 / (1 + e^(-(wx + b)))

[Figures: plots of σ(x) for w = 0, 5, 10, 20, 30, 50 (with b = 0); as w grows, the curve approaches a step function]

Further, we can adjust the value of b to control the position on the x-axis at which the function transitions from 0 to 1.

[Figures: plots for w = 50 with b = 10, 25, 35; changing b shifts the transition point]
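The limiting behaviour described above can be verified numerically; this is a quick sketch of my own, checking both the sharpening effect of large w and the fact that the transition sits where wx + b = 0, i.e., at x = -b/w.

```python
import math

# A numeric check that the logistic function approaches a step function
# as w grows, and that b shifts the transition point.
def sigma(x, w, b=0.0):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# With a small w the transition is gradual ...
assert 0.6 < sigma(0.1, w=5) < 0.7
# ... with a large w the output is nearly 0/1 on either side of x = 0.
assert sigma(0.1, w=50) > 0.99
assert sigma(-0.1, w=50) < 0.01
# The transition point sits where wx + b = 0, i.e., at x = -b/w.
assert abs(sigma(-10 / 50, w=50, b=10) - 0.5) < 1e-9
```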
Multi-Layered Perceptron (MLP)
Feedforward Neural Networks (a.k.a. multilayered network
of neurons)
The input to the network is an n-dimensional vector.
The network contains L − 1 hidden layers (2, in this case) having n neurons each.
Finally, there is one output layer containing k neurons (say, corresponding to k classes).
Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation (ai and hi are vectors).
The input layer can be called the 0-th layer and the output layer can be called the L-th layer.
Wi ∈ R^(n×n) and bi ∈ R^n are the weight and bias between layers i − 1 and i (0 < i < L).
The pre-activation at layer i is given by

ai(x) = bi + Wi hi−1(x)

The activation at layer i is given by

hi(x) = g(ai(x))

where g is called the activation function (for example, logistic, tanh, linear, etc.)

The activation at the output layer is given by

f(x) = hL(x) = O(aL(x))

where O is the output activation function (for example, softmax, linear, etc.)

To simplify notation we will refer to ai(x) as ai and hi(x) as hi, so that

ai = bi + Wi hi−1,  hi = g(ai),  f(x) = hL = O(aL)
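The layer-wise equations above can be sketched directly in NumPy (a minimal illustration; the layer sizes, g = logistic, and O = softmax are our choices, not fixed by the definitions):

```python
import numpy as np

def g(a):
    # activation function g: here the logistic function
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # output activation O: softmax over the output layer's pre-activation
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Ws, bs):
    """Compute f(x) = h_L = O(a_L) with a_i = b_i + W_i h_{i-1}, h_i = g(a_i)."""
    h = x                          # h_0 = x (the input layer)
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = b + W @ h              # pre-activation at layer i
        h = g(a)                   # activation at layer i
    aL = bs[-1] + Ws[-1] @ h       # pre-activation at the output layer
    return softmax(aL)             # activation at the output layer

# Example: n = 3 inputs, two hidden layers of 3 neurons, k = 2 output classes
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 3)), rng.standard_normal((3, 3)),
      rng.standard_normal((2, 3))]
bs = [np.zeros(3), np.zeros(3), np.zeros(2)]
y_hat = forward(np.array([1.0, 0.5, -0.5]), Ws, bs)
```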
Data: {xi, yi} for i = 1, ..., N
Model:
ŷi = f(xi) = O(W3 g(W2 g(W1 xi + b1) + b2) + b3)
Parameters:
θ = W1, ..., WL, b1, ..., bL (L = 3)
Algorithm: Gradient Descent with Back-propagation (we will see soon)
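For L = 3 the model is literally this one nested expression; here is a tiny sketch (the weight values and the sizes n = 2, k = 2 are made up purely for illustration):

```python
import numpy as np

g = lambda a: 1.0 / (1.0 + np.exp(-a))  # hidden activation: logistic
O = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()  # output: softmax

W1 = np.array([[0.5, -0.3], [0.8, 0.2]]); b1 = np.zeros(2)
W2 = np.array([[0.1, 0.4], [-0.6, 0.7]]); b2 = np.zeros(2)
W3 = np.array([[0.3, -0.1], [0.2, 0.5]]); b3 = np.zeros(2)

x = np.array([1.0, -1.0])
# the model, written exactly as the nested expression above
y_hat = O(W3 @ g(W2 @ g(W1 @ x + b1) + b2) + b3)

# the parameters theta are all the Wi and bi together
n_params = sum(M.size for M in [W1, W2, W3, b1, b2, b3])  # 3*(2*2) + 3*2 = 18
```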
More Information on MLP
Learning Parameters of Feedforward Neural
Networks (Intuition)
Recall our gradient descent algorithm:

Algorithm: gradient descent()
t ← 0;
max iterations ← 1000;
Initialize w0, b0;
while t++ < max iterations do
    wt+1 ← wt − η∇wt;
    bt+1 ← bt − η∇bt;
end

We can write it more concisely by stacking the parameters into a single vector θ = [w, b]:

Algorithm: gradient descent()
t ← 0;
max iterations ← 1000;
Initialize θ0 = [w0, b0];
while t++ < max iterations do
    θt+1 ← θt − η∇θt;
end
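The same loop in executable form, on a toy one-parameter loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3) (the loss and all values are our own, chosen only to show the update rule):

```python
eta = 0.1             # learning rate (eta)
theta = 0.0           # initialize theta_0
max_iterations = 1000

for t in range(max_iterations):
    grad = 2 * (theta - 3.0)    # gradient of the loss at theta_t
    theta = theta - eta * grad  # theta_{t+1} <- theta_t - eta * grad_theta_t

# theta converges to the minimizer theta = 3
```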
Output Functions and Loss Functions
Softmax Function
• The softmax function converts raw values (the outputs of a function) into
probabilities: softmax(z)j = e^(zj) / Σk e^(zk).
• The outputs of the softmax function lie in the [0, 1] range, and they sum to 1.
Thus, the output of the softmax function is a probability distribution.
• The softmax function is used in classification algorithms where there is a
need to obtain a probability or probability distribution as the output. Some
of these algorithms are the following:
• Neural networks
• Multinomial logistic regression (softmax regression)
• Naive Bayes classifier
• Multi-class linear discriminant analysis
• In artificial neural networks, the softmax function is used in the final /
last layer.
• The softmax function is also used in reinforcement learning to output
probabilities for the different actions to be taken.
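A minimal softmax implementation illustrating these properties (subtracting max(z) before exponentiating is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    # exponentiate and normalize so the outputs sum to 1;
    # shifting by max(z) avoids overflow and does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# p is a probability distribution: every entry lies in [0, 1], the total is 1,
# and the largest raw value receives the largest probability
```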
Outputs:             Real Values (Regression)    Probabilities (Classification)
Output Activation:   Linear                      Softmax
Loss Function:       Squared Error               Cross Entropy

Of course, there could be other loss functions depending on the problem at hand,
but the two loss functions that we just saw are encountered very often.
For the rest of this lecture we will focus on the case where the output activation is
a softmax function and the loss function is cross entropy.
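Putting the two together — a softmax output with the cross-entropy loss — in a short sketch (the scores are made up; for a one-hot target the cross entropy reduces to −log of the predicted probability of the true class):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_hat, true_class):
    # for a one-hot target: -log(predicted probability of the true class)
    return -np.log(y_hat[true_class])

z = np.array([2.0, 0.5, -1.0])       # pre-activation at the output layer
y_hat = softmax(z)                   # predicted class probabilities
loss_good = cross_entropy(y_hat, 0)  # true class already most probable: small loss
loss_bad = cross_entropy(y_hat, 2)   # true class least probable: larger loss
```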
