Loss functions

The document discusses loss functions, model selection, and evaluation in neural networks, emphasizing the importance of choosing the right loss function for training and optimization. It explains various loss functions such as Mean Squared Error, Absolute Error, Huber Loss, and different classification loss functions like Binary Cross Entropy and Hinge Loss. Additionally, it covers model evaluation metrics, the bias-variance tradeoff, and techniques for model selection like K-Fold Cross-Validation and BootStrap.

Uploaded by

phantomx443

Loss functions & Model Selection and Evaluation

• Neural networks are trained using stochastic
gradient descent and require that you choose
a loss function when designing and
configuring your model.
• We need to discover the role of loss and loss
functions in training deep learning neural
networks and how to choose the right loss
function for your predictive modeling
problems.
Say you are at the top of a hill and need to climb down. How do you decide which way to walk?

1. Look around to see all the possible paths.
2. Reject the ones going up, because these paths would cost more energy and make the
task even more difficult.
3. Finally, take the path that has the steepest downhill slope.

A loss function maps decisions to their associated costs.

In supervised machine learning algorithms, we want to minimize the error for each
training example during the learning process. This is done using an optimization
strategy such as gradient descent, and the error itself comes from the loss function.

Although the terms cost function and loss function are often used
interchangeably, they are different.
A loss function is for a single training example. It is also sometimes called
an error function.
A cost function, on the other hand, is the average loss over the entire training
dataset.

The optimization strategies aim at minimizing the cost function.

[Figure: gradient descent applied to linear regression, with the weight (coefficient) update strategy]
The steps that will be followed for each loss function below:

1. Write the expression for our predictor function, f(X) and identify the
parameters that we need to find
2. Identify the loss to use for each training example
3. Find the expression for the Cost Function – the average loss on all
examples
4. Find the gradient of the Cost Function with respect to each unknown
parameter
5. Decide on the learning rate and run the weight update rule for a fixed
number of iterations
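The five steps above can be sketched as follows for a one-feature linear predictor f(x) = m*x + b with the MSE cost. This is a minimal illustration rather than the code behind the slides: the function name `gradient_descent_mse`, the synthetic data (used in place of the housing data) and the chosen learning rate are all assumptions.

```python
import numpy as np

def gradient_descent_mse(x, y, learning_rate=0.01, n_iters=500):
    """Fit f(x) = m*x + b by minimizing the MSE cost with batch gradient descent."""
    m, b = 0.0, 0.0                              # step 1: parameters of the predictor
    n = len(x)
    for _ in range(n_iters):                     # step 5: fixed number of iterations
        error = y - (m * x + b)                  # step 2: residual behind each squared-error loss
        grad_m = -2.0 / n * np.sum(x * error)    # step 4: d(cost)/dm
        grad_b = -2.0 / n * np.sum(error)        # step 4: d(cost)/db
        m -= learning_rate * grad_m              # weight update rule
        b -= learning_rate * grad_b
    cost = np.mean((y - (m * x + b)) ** 2)       # step 3: average loss over all examples
    return m, b, cost

# Synthetic data (an assumption, not the housing data): y = 3x + 2 plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=200)

m, b, cost = gradient_descent_mse(x, y, learning_rate=0.1, n_iters=5000)
print(f"m={m:.3f}, b={b:.3f}, cost={cost:.4f}")
```

With a learning rate that is too large the updates diverge, and with one that is too small the cost decreases very slowly, which is why the learning rate is treated as something to decide on in step 5.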
Let's say we use the well-known Boston Housing Dataset for understanding loss functions.

To keep things simple, we will use only one feature, the average number of rooms per
house (X), to predict the dependent variable, the median value (Y) of houses in $1000s.
1. Squared Error Loss
Squared Error loss for each training example is the square of the difference between
the actual and the predicted values. It is also known as the L2 loss:

$$L = (y - f(x))^2$$

The corresponding cost function is the Mean of these Squared Errors (MSE).

[Figure: MSE cost on the Boston dataset for different values of the learning rate, 500 iterations each]

A bit more about the MSE loss function: it is a positive quadratic function
(of the form ax^2 + bx + c where a > 0), so its graph is an upward-opening parabola.

A positive quadratic function has only a global minimum. Since there are no local
minima, we will never get stuck in one.
Hence, Gradient Descent, if it converges at all, is guaranteed to converge to the
global minimum.

The MSE loss function penalizes the model for making large errors by squaring them.
Squaring a large quantity makes it even larger, right? But there’s a caveat.
This property makes the MSE cost function less robust to outliers.

Therefore, it should not be used if our data is prone to many outliers.

2. Absolute Error Loss
Absolute Error for each training example is the distance between the predicted and the
actual values, irrespective of the sign. Absolute Error is also known as the L1 loss:

$$L = |y - f(x)|$$

The cost is the Mean of these Absolute Errors (MAE).
The MAE cost is more robust to outliers as compared to MSE.
However, the absolute value (modulus) operator is not differentiable at zero, which
makes it harder to handle in mathematical equations.
We can consider this as a disadvantage of MAE.
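To make the outlier comparison concrete, the sketch below computes both costs on a small made-up set of targets with one outlier, and shows one common way to handle the modulus operator: a subgradient based on the sign of the residual. The function names and the numbers are illustrative assumptions.

```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean of the squared errors."""
    return np.mean((y_true - y_pred) ** 2)

def mae_cost(y_true, y_pred):
    """Mean of the absolute errors."""
    return np.mean(np.abs(y_true - y_pred))

def mae_subgradient(x, y, m, b):
    """Subgradient of the MAE cost for f(x) = m*x + b.

    np.sign returns 0 where the residual is exactly 0, which is one valid
    choice at the point where |.| is not differentiable.
    """
    s = np.sign(y - (m * x + b))
    return -np.mean(s * x), -np.mean(s)

# Made-up targets: the last value is an outlier the predictions miss by 95.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y_pred = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mse = mse_cost(y_true, y_pred)   # 95**2 / 5 = 1805.0
mae = mae_cost(y_true, y_pred)   # 95 / 5 = 19.0
print(mse, mae)
```

The single outlier contributes its squared error to the MSE but only its absolute error to the MAE, so it dominates the MSE far more strongly.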

[Figure: MAE cost after running the code for 500 iterations with different learning rates]
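3. Huber Loss
The Huber loss combines the best properties of MSE and MAE.
It is quadratic for smaller errors and linear otherwise (its gradient is correspondingly
linear for smaller errors and constant otherwise).
It is identified by its delta parameter:

$$L_\delta = \begin{cases} \frac{1}{2}(y - f(x))^2 & \text{if } |y - f(x)| \le \delta \\ \delta\,|y - f(x)| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$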

[Figure: Huber cost over 500 iterations of weight update at a learning rate of 0.0001 for different values of the delta parameter]

Huber loss is more robust to outliers than MSE. It is used in Robust Regression,
M-estimation and Additive Modelling. A variant of Huber Loss is also used in
classification.
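A minimal sketch of the Huber loss and its gradient, using the standard piecewise definition (quadratic when the absolute residual is at most delta, linear beyond it). The function names are illustrative.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Per-example Huber loss: quadratic for |residual| <= delta, linear otherwise."""
    residual = y_true - y_pred
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * abs_r - 0.5 * delta ** 2
    return np.where(abs_r <= delta, quadratic, linear)

def huber_gradient(y_true, y_pred, delta=1.0):
    """Derivative of the loss with respect to y_pred.

    Inside the quadratic zone it equals -(residual); outside it is
    capped at -delta or +delta, so outliers cannot dominate the update.
    """
    residual = y_true - y_pred
    return -np.clip(residual, -delta, delta)

losses = huber_loss(np.array([0.0, 0.0]), np.array([0.5, 3.0]), delta=1.0)
print(losses)  # small error: 0.5 * 0.5**2 = 0.125, large error: 1 * 3 - 0.5 = 2.5
```

A larger delta makes the loss behave more like MSE, and a smaller delta makes it behave more like MAE.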
Binary Classification Loss Functions
Binary Classification refers to assigning an object into one of two classes.
This classification is based on a rule applied to the input feature vector.
For example, classifying an email as spam or not spam based on, say, its subject line.

Cancer dataset
We want to classify a tumor as ‘Malignant’ or ‘Benign’ based on features like average
radius, area, perimeter, etc.
For simplification, we will use only two input features (X_1 and X_2), namely
‘worst area’ and ‘mean symmetry’, for classification.
The target value Y can be 0 (Malignant) or 1 (Benign).
1. Binary Cross Entropy Loss or log loss

Entropy is used to indicate disorder or uncertainty.
It is measured for a random variable X with probability distribution p(X). For a
discrete X:

$$S = -\sum_{x} p(x) \log p(x)$$

A greater value of entropy for a probability distribution indicates a greater
uncertainty in the distribution. Likewise, a smaller value indicates a more certain
distribution.
This makes binary cross-entropy suitable as a loss function – you want to
minimize its value.
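As a quick numerical check of the statement about uncertainty, the sketch below computes the entropy of a few two-outcome distributions. The helper name `entropy` and the example probabilities are assumptions made for illustration; the natural logarithm is used.

```python
import numpy as np

def entropy(p):
    """Entropy -sum(p * log p) of a discrete distribution, in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))

uniform = entropy([0.5, 0.5])     # most uncertain two-outcome distribution, equals log(2)
skewed = entropy([0.9, 0.1])      # more certain, so lower entropy
certain = entropy([1.0, 0.0])     # no uncertainty at all, entropy is zero
print(uniform, skewed, certain)
```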
We use binary cross-entropy loss for classification models which output a
probability p.
Probability that the element belongs to class 1 (or positive class) = p
Then, the probability that the element belongs to class 0 (or negative class) = 1 - p

Then, the cross-entropy loss for output label y (can take values 0 and 1)
and predicted probability p is defined as:

$$L = -y \log(p) - (1 - y) \log(1 - p)$$

This is also called Log-Loss. To calculate the probability p, we can use the sigmoid
function. Here, z is a function of our input features:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

The sigmoid function maps any real z to a value in (0, 1), which makes it suitable for calculating probability.
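The sketch below puts the sigmoid, the log-loss and one gradient-descent weight update together for a linear score z = X·w + b. It is an illustration under assumptions: the function names, the clipping constant `eps` and the tiny two-example dataset are not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    """Log-loss for label y in {0, 1} and predicted probability p."""
    p = np.clip(p, eps, 1 - eps)          # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_gradient_step(X, y, w, b, alpha):
    """One gradient-descent update of (w, b) on the mean log-loss."""
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)       # derivative of the mean loss w.r.t. w
    grad_b = np.mean(p - y)               # derivative of the mean loss w.r.t. b
    return w - alpha * grad_w, b - alpha * grad_b

# Tiny made-up dataset with one feature.
X = np.array([[0.0], [1.0]])
y = np.array([0.0, 1.0])
w, b = np.zeros(1), 0.0

before = np.mean(binary_cross_entropy(y, sigmoid(X @ w + b)))
w, b = bce_gradient_step(X, y, w, b, alpha=1.0)
after = np.mean(binary_cross_entropy(y, sigmoid(X @ w + b)))
print(before, after)
```

A confident wrong prediction is penalized heavily: for y = 1 the loss at p = 0.1 is much larger than at p = 0.9.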

[Figure: binary cross-entropy cost using the weight update rule for 1000 iterations with different values of alpha]
2. Hinge Loss

Hinge loss is mainly used with Support Vector Machine classifiers with class labels -1 and 1.

Hinge Loss not only penalizes wrong predictions but also right predictions that are not confident.

Hinge loss for an input-output pair (x, y) is:

$$L = \max(0,\ 1 - y \cdot f(x))$$

[Figure: hinge cost after running the update function for 2000 iterations with three different values of alpha]

Hinge loss is used when we want to make real-time decisions without a laser-sharp focus on accuracy.
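A minimal sketch of the hinge loss, max(0, 1 - y·f(x)), for labels in {-1, 1}, together with a subgradient for a linear score f(x) = w·x + b. The function names and the example scores are assumptions chosen to show the three cases: confident and right, right but not confident, and wrong.

```python
import numpy as np

def hinge_loss(y, score):
    """Hinge loss for labels y in {-1, 1} and raw model scores f(x)."""
    return np.maximum(0.0, 1.0 - y * score)

def hinge_subgradient(x, y, w, b):
    """Subgradient of the hinge loss w.r.t. (w, b) for f(x) = w.x + b."""
    if y * (np.dot(w, x) + b) < 1.0:      # inside the margin or misclassified
        return -y * x, -y
    return np.zeros_like(x), 0.0          # confident and correct: no update

y = np.array([1, 1, 1, -1])
scores = np.array([2.0, 0.5, -1.0, -3.0])
losses = hinge_loss(y, scores)
print(losses)  # confident right: 0, unconfident right: 0.5, wrong: 2, confident right: 0
```

The second example is classified correctly (its score is positive) but still incurs a loss of 0.5 because it lies inside the margin.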
Multi-Class Classification Loss Functions

We use the Iris Dataset for understanding the multi-class loss functions.

We use 2 features, X_1 (sepal length) and X_2 (petal width), to predict the class (Y) of
the Iris flower: Setosa, Versicolor or Virginica.
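Multi-Class Cross Entropy Loss
The multi-class cross-entropy loss is a generalization of the Binary Cross Entropy loss.
The loss for input vector X_i and the corresponding one-hot encoded target vector Y_i, over c classes, is:

$$L(X_i, Y_i) = -\sum_{j=1}^{c} y_{ij} \log(p_{ij})$$

where y_ij is 1 if example i belongs to class j and 0 otherwise, and p_ij is the
predicted probability that example i belongs to class j.

We use the softmax function to find the probabilities p_ij from the raw class scores z_ij:

$$p_{ij} = \frac{e^{z_{ij}}}{\sum_{k=1}^{c} e^{z_{ik}}}$$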

Softmax is implemented through a neural network layer just before the output layer.
The Softmax layer must have the same number of nodes as the output layer.

The output is the class with the maximum probability for the given input.
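A small NumPy sketch of softmax, the multi-class cross-entropy for a one-hot target, and the final class choice by maximum probability. The function names, the clipping constant and the example scores are illustrative assumptions, not the Keras model used for the plots.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """Per-example loss -sum_j y_ij * log(p_ij) for one-hot targets."""
    return -np.sum(y_onehot * np.log(np.clip(p, eps, 1.0)), axis=-1)

# One example with made-up raw scores for three classes.
logits = np.array([[2.0, 1.0, 0.1]])
target = np.array([[1.0, 0.0, 0.0]])           # true class is class 0

probs = softmax(logits)
loss = categorical_cross_entropy(target, probs)
predicted_class = np.argmax(probs, axis=-1)    # class with the maximum probability
print(probs, loss, predicted_class)
```

Because the target is one-hot, only the log-probability of the true class contributes to the loss.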

We build a model using an input layer and an output layer and compile it with different
learning rates.
[Figure: plots of cost and accuracy, respectively, after training for 200 epochs]
Neural Network Learning as Optimization

• A deep learning neural network learns to map a set of
inputs to a set of outputs from training data.
• We cannot calculate the perfect weights for a neural
network; there are too many unknowns. Instead, the
problem of learning is cast as a search or optimization
problem and an algorithm is used to navigate the space
of possible sets of weights the model may use in order
to make good or good enough predictions.
• A neural network model is trained using the stochastic
gradient descent optimization algorithm and weights
are updated using the backpropagation of error
algorithm.
• The “gradient” in gradient descent refers to an
error gradient. The model with a given set of
weights is used to make predictions and the
error for those predictions is calculated.
• The gradient descent algorithm seeks to
change the weights so that the next
evaluation reduces the error, meaning the
optimization algorithm is navigating down the
gradient (or slope) of error.
• In the context of an optimization algorithm, the
function used to evaluate a candidate solution (i.e. a
set of weights) is referred to as the objective function.
• We may seek to maximize or minimize the objective
function, meaning that we are searching for a
candidate solution that has the highest or lowest score
respectively.
• Typically, with neural networks, we seek to minimize
the error. As such, the objective function is often
referred to as a cost function or a loss function and the
value calculated by the loss function is referred to as
simply “loss.”
• The cost or loss function has an important job in that it
must faithfully distill all aspects of the model down into
a single number in such a way that improvements in
that number are a sign of a better model.
• In calculating the error of the model during the
optimization process, a loss function must be chosen.
• This can be a challenging problem as the function must
capture the properties of the problem and be
motivated by concerns that are important to the
project and stakeholders.
Model Selection
Model selection is a technique for selecting the best model after the individual models are
evaluated based on the required criteria.
Resampling methods are simple techniques of rearranging data samples to inspect whether the
model performs well on data samples that it has not been trained on.
Random Splits: a percentage of the data is randomly sampled into training, testing, and
preferably validation sets.
Time-Wise Split: the data is split by time rather than at random. For example, the training
set can have data for the last three years and the first 10 months of the present year, while
the last two months are reserved for the testing or validation set.
K-Fold Cross-Validation: The cross-validation technique works by randomly shuffling the
dataset and then splitting it into k groups. Then, iterating over the groups, each group in
turn is used as the test set while all other groups are combined into the training set.
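The shuffle, split and iterate procedure can be sketched with plain NumPy index arrays. The generator name `k_fold_indices` and its `seed` parameter are assumptions made for illustration; libraries such as scikit-learn provide their own implementations.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation (k >= 2)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)        # randomly shuffle the dataset
    folds = np.array_split(shuffled, k)          # split it into k groups
    for i in range(k):
        test_idx = folds[i]                      # group i is the test set
        train_idx = np.concatenate(              # all other groups form the training set
            [folds[j] for j in range(k) if j != i]
        )
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))
```

Every sample appears in exactly one test fold, so averaging the k test scores uses each data point for evaluation once.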
BootStrap: The first step is to select a sample size (which is usually equal to the size of the
original dataset). Thereafter, a data point is randomly selected from the original
dataset and added to the bootstrap sample. After the addition, the data point is put
back into the original dataset, so it can be selected again (sampling with replacement).
This process is repeated N times, where N is the sample size.
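Drawing N points one at a time and putting each back is equivalent to drawing N random indices with replacement, which the sketch below does in a single call. The function name `bootstrap_sample` and the toy data are illustrative assumptions.

```python
import numpy as np

def bootstrap_sample(data, seed=0):
    """Draw a bootstrap sample of the same size as data, with replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    idx = rng.integers(0, n, size=n)    # each index may be drawn more than once
    return data[idx], idx

data = np.arange(10) * 10               # toy dataset: 0, 10, ..., 90
sample, idx = bootstrap_sample(data)
print(sample)
```

Because of the replacement, some original points typically appear several times in the sample while others do not appear at all.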
Model Evaluation
For every classification model, a matrix called the confusion matrix can be
constructed, which shows the number of test cases correctly and incorrectly classified.

              Actual 0               Actual 1
Predicted 0   True Negatives (TN)    False Negatives (FN)
Predicted 1   False Positives (FP)   True Positives (TP)

Accuracy is the simplest metric and can be defined as the number of test cases correctly
classified divided by the total number of test cases: (TP + TN) / (TP + TN + FP + FN).
Precision is the fraction of cases predicted as positive that are actually positive:
TP / (TP + FP).

Recall is the fraction of actual positive cases that are correctly identified:
TP / (TP + FN).

F1 score is the harmonic mean of Recall and Precision,
2 · Precision · Recall / (Precision + Recall), and therefore balances the two.
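The four metrics follow directly from the confusion-matrix counts. The sketch below uses made-up labels and predictions; the function names and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return tn, fp, fn, tp

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the confusion-matrix counts."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # assumed convention for 0/0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 3 TP, 1 FN, 1 FP, 3 TN.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```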
AUC-ROC: The ROC curve is a plot of true positive rate (recall) against false positive rate
(FP / (FP + TN)). AUC-ROC stands for Area Under the Receiver Operating Characteristic
curve; the higher the area, the better the model performance.

If the curve is somewhere near the 50% diagonal line, it suggests that the model
randomly predicts the output variable.
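One way to trace the ROC curve is to sweep a decision threshold over the predicted scores, record the true and false positive rates at each threshold, and integrate with the trapezoidal rule. The sketch below is a simple illustration with made-up scores; the function names are assumptions.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by thresholding the scores from high to low."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]                       # threshold above every score
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t                        # predicted positive at this threshold
        tpr.append(np.sum(pred & (y_true == 1)) / n_pos)   # TP / (TP + FN)
        fpr.append(np.sum(pred & (y_true == 0)) / n_neg)   # FP / (FP + TN)
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]                    # made-up predicted probabilities
fpr, tpr = roc_curve_points(y_true, scores)
area = auc(fpr, tpr)
print(area)  # 0.75
```

An area of 0.5 corresponds to the diagonal line of a random classifier, and 1.0 to a model that ranks every positive above every negative.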
Bias and Variance
Bias occurs when a model is strictly ruled by assumptions – like the linear
regression model assumes that the relationship of the output variable with
the independent variables is a straight line. This leads to underfitting when
the actual values are non-linearly related to the independent variables.

Variance is high when a model focuses on the training set too much and learns
the variations very closely, compromising on generalization. This leads
to overfitting.

An optimal model is one that has the lowest bias and variance. Since these two
attributes are inversely related (lowering one tends to raise the other), the only way
to achieve this is through a tradeoff between the two.
Therefore, the model selection should target the point at which the bias and variance
curves intersect.
[Figure: bias and variance curves intersecting]
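The tradeoff can be illustrated by fitting polynomials of increasing degree to noisy samples of a non-linear function and comparing training and test error. Everything in this sketch is an assumption made for illustration: the sine target, the noise level, the sample sizes and the degrees 1, 3 and 9.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a non-linear target (a sine wave, chosen for illustration)."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def fit_and_score(degree):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((y_train - model(x_train)) ** 2)
    test_mse = np.mean((y_test - model(x_test)) ** 2)
    return train_mse, test_mse

results = {d: fit_and_score(d) for d in (1, 3, 9)}
for d, (train_mse, test_mse) in results.items():
    print(f"degree {d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The straight line (degree 1) underfits the sine wave, which is the high-bias case. Training error can only go down as the degree grows, so a widening gap between training and test error at high degree is the usual sign of high variance.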
