465-Lecture 2-4

This document summarizes key concepts about deep feedforward neural networks: 1. It describes the basic perceptron model, which consists of a single neuron that calculates weighted sums of inputs and applies an activation function. 2. It explains that multilayer perceptrons are needed to model nonlinear data using multiple hidden layers of neurons stacked together. 3. It outlines the main components of neural networks, including the input layer, hidden layers, weight connections, and output layer. It also discusses different types of activation functions.

Uploaded by

Faisal Bin Abdur Rahman 1912038642

CSE 465

Lecture 2-4
Deep Feed Forward Neural Networks
Artificial Neural Network (ANN)
Perceptron
• The simplest neural network is the perceptron, which consists of a single
neuron
• Conceptually, the perceptron functions in a manner like a biological neuron
Perceptron
• The artificial neuron performs two consecutive functions
• It calculates the weighted sum of the inputs to represent the total strength
of the input signals
• Then it applies a step function to the result
• This determines whether the neuron fires: the output is 1 if the signal exceeds a certain threshold, or
• 0 if the signal doesn’t exceed the threshold
• Not all input features are equally useful or important.
• To represent that, each input node is assigned a weight value, called its connection
weight, to reflect its importance
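The two consecutive functions described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the names `step` and `perceptron_output`, the bias term, and the example weights are all assumptions chosen for the demonstration.

```python
def step(z, threshold=0.0):
    """Return 1 if the signal exceeds the threshold, otherwise 0."""
    return 1 if z > threshold else 0


def perceptron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by the step function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(z)


# Hypothetical values: the second feature carries more weight than the first.
fired = perceptron_output([1.0, 2.0], [0.1, 0.8], bias=-1.0)   # z = 0.7, above 0
silent = perceptron_output([1.0, 0.5], [0.1, 0.8], bias=-1.0)  # z = -0.5, below 0
```

Because the second weight is larger, changing the second input from 2.0 to 0.5 is enough to flip the output from 1 to 0.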
Perceptron

Input vector: The feature vector that is fed to the neuron. It is usually denoted with an uppercase X to represent a vector of inputs (x1, x2, . . ., xn)

Weights matrix: Each xi is assigned a weight value wi

Functions: The calculations performed within the neuron to modulate the input signals: the weighted sum and step activation function

Output: Controlled by the type of activation function

Weights matrix: Feature weights
• Not all input features are equally important (or useful)
• Each input feature (x1) is assigned its own weight (w1) that reflects its
importance in the decision-making process
• Inputs assigned greater weight have a greater effect on the output
• If the weight is high, it amplifies the input signal; and if the weight is low, it
diminishes the input signal
• In common representations of neural networks, the weights are
represented by lines or edges from the input node to the perceptron
How does a perceptron learn?
• The perceptron uses trial and error to learn from its mistakes
• It uses the weights as knobs by tuning their values up and down until the network is
trained
• This process is repeated many times, and the neuron continues to update
the weights to improve its predictions
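The slides do not name a specific update rule, so the sketch below assumes the classic perceptron learning rule: after each prediction, every weight is nudged by (label - prediction) times its input. The function names, the learning rate of 1.0, and the toy AND dataset are illustrative choices only.

```python
def predict(x, w, b):
    """Step-activated weighted sum."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1 if z > 0 else 0


def train_perceptron(samples, labels, lr=1.0, epochs=20):
    """Classic perceptron rule (an assumption, not stated in the slides)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(x, w, b)  # -1, 0, or +1
            # Turn the "knobs": weights move only when the prediction is wrong.
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b


# Toy linearly separable data: logical AND.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
predictions = [predict(x, w, b) for x in X]
```

On this data the weights stop changing after a handful of passes, at which point every training example is classified correctly.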
Is one neuron enough?
• Suppose we want to train a perceptron to predict whether a player will be
accepted into the national Cricket squad
• We collect all the data from previous years and train the perceptron to predict
whether players will be accepted based on only two features (height and age)
• z = height · w1 + age · w2 + b
• After the training is complete on the training data, we can start using the
perceptron to make predictions about new players
• When we get a player who is 150 cm in height and 12 years old, we compute the
previous linear expression with these values: (height, age) = (150, 12)
• In this simple case, the single perceptron may work fine, provided the data is
linearly separable
Model for accepting/rejecting players
Linear/Nonlinear data
• Linear datasets—The data can be split with a single straight line
• Nonlinear datasets—The data cannot be split with a single straight line
• We need more than one line to form a shape that splits the data
• In the linear problem, the stars and dots can be easily classified by drawing
a single straight line
• In nonlinear data, a single line will not separate both shapes
Multiple perceptron
• To split a nonlinear dataset, we
need more than one line
• Example of a small neural
network that is used to model
nonlinear data
• In this network, we use three
neurons stacked together in
one layer called a hidden layer
• Hidden as we don’t see the
output of these layers during the
training process
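One way to see why several neurons help is that each hidden neuron draws one straight line, and the output neuron combines the lines into a shape. The sketch below is an assumption-laden illustration: the weights are picked by hand rather than learned, it uses two hidden neurons instead of the three in the slide's network, and the XOR data stands in for a generic nonlinear dataset.

```python
def step(z):
    return 1 if z > 0 else 0


def neuron(inputs, weights, bias):
    """One perceptron: weighted sum followed by the step function."""
    return step(sum(x * w for x, w in zip(inputs, weights)) + bias)


def xor_network(x1, x2):
    # Hidden layer: two neurons, each drawing one line in the (x1, x2) plane.
    h1 = neuron([x1, x2], [1, 1], -0.5)   # fires when x1 + x2 > 0.5
    h2 = neuron([x1, x2], [1, 1], -1.5)   # fires when x1 + x2 > 1.5
    # Output neuron: fires only for points between the two lines.
    return neuron([h1, h2], [1, -1], -0.5)


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_network(a, b) for a, b in points]
```

No single line separates the points labelled 1 from those labelled 0 here, but the region between the two hidden-layer lines does.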
Multi-layer perceptron
Main components of a Neural Network
• Input layer— Contains the feature vector
• Hidden layers—The neurons are stacked on top of each other in hidden layers
• They are called “hidden” layers because we don’t see or control the input going into these
layers or the output
• All we do is feed the feature vector to the input layer and see the output coming out of the
output layer
• Weight connections (edges)—Weights are assigned to each connection between
the nodes to reflect the importance of their influence on the final output
prediction
• In graph network terms, these are called edges connecting the nodes
• Output layer — We get the answer or prediction from our model from the
output layer
• Depending on the setup of the neural network, the final output
• Could be a real-valued output (regression problem) or
• A set of probabilities (classification problem)
Activation functions
• Activation functions are sometimes referred to as transfer functions or
nonlinearities because they transform the linear combination of a weighted
sum into a nonlinear model
• The purpose of the activation function is to introduce nonlinearity into the
network
• Without it, a multilayer perceptron will perform similarly to a single
perceptron no matter how many layers we add
• Some activation functions (such as sigmoid and tanh) also restrict the output
value to a finite range
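The claim that stacked layers without a nonlinearity behave like a single layer can be checked with simple arithmetic. The sketch below uses one-input, one-output layers with made-up weights (all values are assumptions) and shows that two linear layers give exactly the same outputs as one collapsed linear layer.

```python
def linear_layer(x, w, b):
    """A layer with no activation function: just the weighted sum."""
    return w * x + b


# Hypothetical parameters for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5


def two_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)


# Expanding w2 * (w1 * x + b1) + b2 gives one equivalent layer.
w_collapsed = w2 * w1
b_collapsed = w2 * b1 + b2


def one_layer(x):
    return linear_layer(x, w_collapsed, b_collapsed)


inputs = [-2.0, 0.0, 1.5, 4.0]
stacked = [two_layers(x) for x in inputs]
collapsed = [one_layer(x) for x in inputs]
```

The same algebra applies with matrices, which is why a nonlinear activation between layers is what gives depth its extra modelling power.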
Linear Activation Function
• A linear transfer function, also called an
identity function, indicates that the function
passes a signal through unchanged
• In practical terms, the output will be equal to
the input, which means we don’t actually
have an activation function
• So no matter how many layers our neural
network has, all it is doing is computing a
linear activation function or, at most, scaling
the weighted average coming in
Step Activation Function
• The step function produces a binary output
• It basically says that
• If the input x > 0, it fires (output y = 1)
• Else (input x ≤ 0), it doesn’t fire (output y = 0)
• It is mainly used in binary classification
problems like true or false, spam or not
spam, and pass or fail
Sigmoid/Logistic Activation Function
• This is one of the most common activation
functions
• It is often used in binary classifiers to predict
the probability of a class when we have two
classes
• Sigmoid or logistic functions convert infinite
continuous variables (range between –∞ to
+∞) into simple probabilities between 0
and 1
• It is also called the S-shape curve because
when plotted in a graph, it produces an S-
shaped curve.
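For reference, the sigmoid is sigmoid(z) = 1 / (1 + e^(-z)). The sketch below is one possible implementation; the branch on the sign of z is an implementation choice (an assumption, not something from the slides) that avoids overflow in `math.exp` for very negative inputs.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) to avoid overflow.
    e = math.exp(z)
    return e / (1.0 + e)


# Sample points along the S-shaped curve.
curve = [(z, sigmoid(z)) for z in (-5.0, -1.0, 0.0, 1.0, 5.0)]
```

An input of 0 maps to exactly 0.5, and the outputs approach 0 and 1 as z moves toward –∞ and +∞.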
Hyperbolic Tangent (tanh) Activation Function
• The hyperbolic tangent function is a shifted and rescaled version of the sigmoid function: tanh(z) = 2 · sigmoid(2z) – 1
• Instead of squeezing the signal values between 0 and 1, tanh squishes all
values to the range –1 to 1
• Tanh almost always works better than the sigmoid function in hidden layers
because it has the effect of centering data so that the mean of the data is
close to zero rather than 0.5, which makes learning for the next layer a little
bit easier
• One of the downsides of both sigmoid and tanh functions is that if (z) is
very large (strongly positive) or very small (strongly negative)
• Then the gradient (or derivative or slope) of this function becomes very small (close
to zero), which will slow down gradient descent
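The shrinking gradient can be made concrete using the standard derivatives sigmoid'(z) = sigmoid(z)(1 - sigmoid(z)) and tanh'(z) = 1 - tanh(z)^2. The sketch below evaluates both at a few illustrative points (the sample values of z are arbitrary) and also checks the identity tanh(z) = 2 · sigmoid(2z) - 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2


# Gradients are largest at z = 0 and shrink quickly as |z| grows.
gradients = {z: (sigmoid_grad(z), tanh_grad(z)) for z in (0.0, 2.0, 10.0)}

# tanh expressed through the sigmoid.
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * 0.7) - 1.0
```

At z = 10 both gradients are already far below one thousandth of their value at z = 0, so a weight update that is proportional to the gradient barely moves.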
Hyperbolic Tangent (tanh) Activation Function
Rectified Linear (ReLU) Activation Function
• The rectified linear unit (ReLU) activation function
activates a node only if the input is above zero
• If the input is below zero, the output is always zero
• But when the input is higher than zero, it has a linear
relationship with the output variable
• ReLU is considered the state-of-the-art activation
function because it works well in many different
situations, and it tends to train better than
sigmoid and tanh in hidden layers
Leaky ReLU Activation Function
• One disadvantage of ReLU activation is that
the derivative is equal to zero when (x) is
negative
• Leaky ReLU is a ReLU variation that tries to
mitigate this issue
• Instead of having the function be zero when
x < 0, leaky ReLU uses a small slope (around
0.01) when (x) is negative, so the output is slightly negative rather than zero
• It sometimes works better than the ReLU
function, although the improvement is not guaranteed
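ReLU and leaky ReLU differ only in how they treat negative inputs, which a short sketch makes easy to compare. The function names and the `negative_slope` parameter name are assumptions for illustration; the value 0.01 is the slope quoted above.

```python
def relu(x):
    """Pass positive inputs through unchanged, output zero otherwise."""
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but negative inputs are scaled by a small slope."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.01):
    return 1.0 if x > 0 else negative_slope


samples = [-3.0, -0.5, 0.0, 2.5]
relu_values = [relu(x) for x in samples]
leaky_values = [leaky_relu(x) for x in samples]
```

For x = -3.0 the ReLU gradient is 0, so no learning signal passes back through that node, while the leaky ReLU gradient is 0.01.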
Softmax Function
• The softmax function is a generalization of the sigmoid function
• It is used to obtain classification probabilities when we have more than two
classes
• It forces the outputs of a neural network to sum to 1, with each individual
output between 0 and 1
• A very common use case in deep learning problems is to predict a single class out
of many options (more than two)
• Softmax is the go-to function that we often use at the output layer of a classifier
when we are working on a problem where we need to predict one class out of
more than two classes
• Softmax works fine classifying two classes, as well – it will basically work like a
sigmoid function
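Softmax turns a list of raw scores z into probabilities with softmax(z)_i = e^(z_i) / Σ_j e^(z_j). The sketch below is one way to write it; subtracting the maximum score first is a common numerical-stability trick (an addition not mentioned in the slides) that leaves the result unchanged, and the example scores are made up.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])

# Two-class case: softmax over [z, 0] gives the same value as sigmoid(z).
two_class = softmax([1.3, 0.0])[0]
sigmoid_value = 1.0 / (1.0 + math.exp(-1.3))
```

The largest score receives the largest probability, and the two-class check illustrates why softmax is described as a generalization of the sigmoid.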
Softmax Activation Function (2)
The feedforward process
• The term feedforward is used to imply the forward direction in which the
information flows from the input layer through the hidden layers, all the
way to the output layer
• This process happens through the implementation of two consecutive
functions
• The weighted sum and
• The activation function
• In short, the forward pass is the calculations through the layers to make a
prediction
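The forward pass described above, repeated layer by layer, can be sketched as follows. The layer sizes, weights, and the choice of sigmoid everywhere are assumptions made for this illustration and do not come from the slides' figures; `weights[j]` holds the incoming weights of neuron j.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One layer: weighted sum per neuron, then the activation function."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Feed the input through every layer in order."""
    a = x
    for weights, biases, activation in layers:
        a = dense(a, weights, biases, activation)
    return a


# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], sigmoid),
    ([[1.0, -1.0]], [0.0], sigmoid),
]
y_hat = forward([1.0, 2.0, 3.0], layers)
```

Here the two hidden neurons receive weighted sums of 0.4 and 0.5, and the output neuron applies the sigmoid to the difference of their activations.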
A fully connected Neural Network
Forward calculation
The weights break-down
Feedforward calculation
Feedforward process (2)
Feature learning
• The nodes in the hidden layers (ai) are the new features that are learned after
each layer
• For example, in the network of two slides earlier, we see that we have three
feature inputs (x1, x2, and x3)
• After computing the forward pass in the first layer, the network learns patterns,
and these features are transformed to three new features with different values
(a1[1], a2[1], a3[1]), where the bracketed number is the layer index
• Then, in the next layer, the network learns patterns within the patterns and
produces new features (a1[2], a2[2], a3[2], a4[2], and so forth)
• The produced features after each layer are not totally understood, and we don’t
see them, nor do we have much control over them
• That’s why they are called hidden layers
• What we do is this: we look at the final output prediction and keep tuning some
parameters until we are satisfied by the network’s performance
Feature Learning Process
Cost/Loss/Error Function
• The error function is a measure of how “wrong” the neural network
prediction is with respect to the expected output (the label)
• It quantifies how far we are from the correct solution
• For example, if we have a high loss, then our model is not doing a good
job
• The smaller the loss, the better the job the model is doing
• The larger the loss, the more our model needs to be trained to increase its
accuracy
Mean Square Error Loss Function
• Mean squared error (MSE) is commonly used in regression problems that require the output to
be a real value (like house pricing)

• Instead of just comparing the prediction output with the label (ŷi – yi), the error is squared and
averaged over the number of data points
• MSE is a good choice for a few reasons
• The square ensures the error is always positive, and
• Larger errors are penalized more than smaller errors
• MSE is quite sensitive to outliers, since it squares the error value
• A variation of MSE called mean absolute error (MAE) is less sensitive to this issue
• It averages the absolute error over the entire dataset without taking the square of the error
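Written out, the two losses are MSE = (1/N) Σ (ŷi – yi)² and MAE = (1/N) Σ |ŷi – yi|. The sketch below computes both on a small made-up dataset (the numbers are assumptions, loosely in the spirit of the house-pricing example) to show how a single outlier affects each.

```python
def mse(preds, labels):
    """Mean squared error: average of the squared differences."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)


def mae(preds, labels):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


labels = [3.0, 5.0, 2.0, 7.0]
close = [2.5, 5.5, 2.0, 7.0]      # two predictions off by 0.5
outlier = [2.5, 5.5, 2.0, 17.0]   # same, plus one prediction off by 10

mse_close, mae_close = mse(close, labels), mae(close, labels)
mse_outlier, mae_outlier = mse(outlier, labels), mae(outlier, labels)
```

The single bad prediction raises MSE from 0.125 to 25.125 but MAE only from 0.25 to 2.75, which is the outlier sensitivity the bullets describe.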
Cross-entropy loss function

• The loss function L(ŷ; y) produces a single scalar value

• Most common for Neural Networks
• Cross-entropy is commonly used in classification problems because it
quantifies the difference between two probability distributions
• How close is the predicted distribution to the true distribution?
• That is what the cross entropy loss function determines
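For a true distribution y and a predicted distribution ŷ over the classes, cross-entropy is L(ŷ; y) = –Σ_i y_i · log(ŷ_i). The sketch below is a minimal version with made-up probabilities; the small `eps` floor is an assumption added to avoid taking log(0).

```python
import math


def cross_entropy(predicted, true, eps=1e-12):
    """Cross-entropy between a true and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(predicted, true))


# One-hot label: the second class is the correct one.
true = [0.0, 1.0, 0.0]
good = [0.1, 0.8, 0.1]   # confident and correct
bad = [0.6, 0.2, 0.2]    # most probability on the wrong class

loss_good = cross_entropy(good, true)
loss_bad = cross_entropy(bad, true)
```

With a one-hot label the loss reduces to –log of the probability assigned to the correct class, so the confident correct prediction scores about 0.22 and the poor one about 1.61.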
Loss Function to compare models
Training a neural network
• Training a neural network involves showing the network many examples (a
training dataset)
• The network makes predictions through feedforward calculations and
compares them with the correct labels to calculate the error
• Finally, the neural network needs to adjust the weights (on all edges) until
it gets the minimum error value, which means maximum accuracy
• In neural networks, optimizing the error function means updating the
weights and biases until we find the optimal weights, or the best values for
the weights to produce the minimum error
• In mathematical terms: find the weights and biases (W, b) that minimize the error function
Optimization Algorithm: Gradient Descent
• The general definition of a gradient (also known as a derivative) is that it is
the function that tells the slope or rate of change of the line that is tangent
to the curve at any given point
• It is just a fancy term for the slope or steepness of the curve
Gradient descent
• Gradient descent simply means updating the weights iteratively to descend
the slope of the error curve until we get to the point with minimum error
• Gradient descent has several variations: batch gradient descent (BGD),
stochastic gradient descent (SGD), and mini-batch GD (MB-GD)
How does Gradient Descent work?
• The random initial weight (starting weight)
is at point A, and our goal is to descend
this error mountain to the goal w1 and w2
weight values, which produce the
minimum error value
• The way we do that is by taking a series of
steps down the curve until we get the
minimum error
• In order to descend the error mountain,
we need to determine two things for each
step
• The step direction (gradient)
• The step size (learning rate)
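The two ingredients of each step, direction and size, appear directly in the update w ← w – learning_rate · gradient. The sketch below applies it to a made-up one-dimensional error curve, E(w) = (w – 3)², chosen only because its minimum is easy to verify; the starting point and learning rate are likewise assumptions.

```python
def error(w):
    """Hypothetical error curve with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Slope of the error curve at w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate=0.1, steps=100):
    w = w_start
    for _ in range(steps):
        # Step direction: against the gradient. Step size: the learning rate.
        w -= learning_rate * gradient(w)
    return w


w_start = -4.0
w_final = gradient_descent(w_start)
```

Each step shrinks the distance to the minimum by a constant factor here, so the weight settles very close to 3 after 100 steps.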
Optimizing the network: Cost function visualization
Stochastic/Mini-batch Gradient Descent
• In stochastic gradient descent, we update the weights based on each data
point at a time
• This makes the algorithm jump around, so it can be unstable
• However, the noise can help it escape poor local minima, so it often ends up at a better optimum
• Mini-batch gradient descent is a compromise between Gradient Descent
and stochastic GD
• Instead of computing the gradient from one sample (SGD) or all samples (BGD), we
divide the training sample into mini-batches from which to compute the gradient
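The three variants differ only in how many samples feed each gradient estimate, so one function with a `batch_size` argument can cover all of them. The sketch below is an illustration under assumptions: it fits a single weight in y ≈ w · x on noise-free made-up data, so it shows the mechanics of batching rather than the instability of SGD on real data.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, averaged over the batch only.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [2.0 * x for x in xs]  # generated with a true weight of 2

w_sgd = minibatch_gd(xs, ys, batch_size=1)         # stochastic GD
w_mini = minibatch_gd(xs, ys, batch_size=2)        # mini-batch GD
w_batch = minibatch_gd(xs, ys, batch_size=len(xs)) # batch GD
```

With `batch_size=1` the weight is updated six times per epoch, with `batch_size=2` three times, and with the full batch only once.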
GD vs SGD vs MBSGD
• In practice, we use the mini-batch gradient descent (MBSGD) algorithm
• The GD (also called BGD, batch gradient descent) can get stuck at a poor
local minimum on a non-convex error surface, and it needs to load all data
points into memory before each update
• The SGD is better at finding the optimal point than the GD, but is unstable
• The MBSGD is the sweet spot between GD and SGD: it updates the weights
on small batches, so it can use memory efficiently
• Also better at finding the optimum than GD
Backpropagation
• Backpropagation is the core of how neural networks learn
• Up until this point, we have learned that training a neural network typically
happens by the repetition of the following three steps
• Forward pass: get the linear combination (weighted sum), and apply the activation
function to get the output prediction (ŷ)
• Find the loss: Compare the prediction with the label to calculate the error or loss
function
• Update: Use a gradient descent optimization algorithm to compute the Δw that
optimizes the error function
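The three repeated steps can be sketched for the simplest possible case, a single sigmoid neuron trained with cross-entropy loss. For that combination the chain rule gives the gradient of the loss with respect to the weighted sum as ŷ – y, so the backward step is one line; a multilayer network applies the same chain rule layer by layer. The dataset (logical OR), learning rate, and function names are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)


def loss(y_hat, y, eps=1e-12):
    """Binary cross-entropy for one example."""
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))


def train(samples, labels, learning_rate=0.5, epochs=5000):
    w, b = [0.0] * len(samples[0]), 0.0
    n = len(samples)
    history = []
    for _ in range(epochs):
        # 1. Forward pass: weighted sum plus activation for every sample.
        preds = [forward(x, w, b) for x in samples]
        # 2. Find the loss: compare predictions with labels.
        history.append(sum(loss(p, y) for p, y in zip(preds, labels)) / n)
        # 3. Update: chain rule gives dL/dz = y_hat - y for sigmoid + cross-entropy.
        deltas = [p - y for p, y in zip(preds, labels)]
        grad_w = [sum(d * x[j] for d, x in zip(deltas, samples)) / n
                  for j in range(len(w))]
        grad_b = sum(deltas) / n
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b, history


X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]  # logical OR
w, b, history = train(X, y)
predictions = [round(forward(x, w, b)) for x in X]
```

The recorded `history` starts near log 2 (every prediction is 0.5 with zero weights) and falls as the loop repeats.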
Initialize the W/b matrices
• We cannot initialize all parameters to 0
• If we do that then z[3] = W[3]a[2]+b[3] will be 0
• However, the output of the neural network is defined as a[3] = g(z[3])
• Here g(·) is defined as the sigmoid function. This means a[3] = g(0) = 0.5. Thus, no matter
what value of x(i) we provide, the network will output ŷ = 0.5
• What if we had initialized all parameters to be the same non-zero value? In this
case, consider the activations of the first layer

• Each element of the activation vector a[1] will be the same (because W[1] contains all the
same values)
• This behavior will occur at all layers of the neural network, so the neurons within a layer stay identical and cannot learn different features
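Both initialization problems can be observed with a small forward pass. The sketch below is a simplified stand-in for the network in the bullets: it assumes one hidden layer of three sigmoid neurons and one sigmoid output (rather than three layers), and the helper name `constant_network` and the inputs are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    return [
        sigmoid(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]


def constant_network(x, value, hidden=3):
    """Forward pass with every weight and bias set to the same value."""
    a1 = layer(x, [[value] * len(x)] * hidden, [value] * hidden)
    out = layer(a1, [[value] * hidden], [value])
    return a1, out[0]


# All zeros: the output is 0.5 regardless of the input.
_, y_zero_a = constant_network([1.0, 2.0], 0.0)
_, y_zero_b = constant_network([-5.0, 9.0], 0.0)

# Same non-zero value everywhere: all hidden activations are identical.
a1_same, _ = constant_network([1.0, 2.0], 0.3)
```

Because the hidden activations coincide, each hidden neuron would also receive the same gradient, which is why random initialization is normally used to break the symmetry.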
