Loss functions
Although the terms cost function and loss function are often used interchangeably, they are not quite the same.
A loss function is for a single training example. It is also sometimes called
an error function.
A cost function, on the other hand, is the average loss over the entire training
dataset.
Gradient descent applied to linear regression (weight/coefficient update strategy)
We will follow these steps for each loss function below:
1. Write the expression for our predictor function, f(X) and identify the
parameters that we need to find
2. Identify the loss to use for each training example
3. Find the expression for the Cost Function – the average loss on all
examples
4. Find the gradient of the Cost Function with respect to each unknown
parameter
5. Decide on the learning rate and run the weight update rule for a fixed
number of iterations
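For reference, the weight update rule in step 5 takes the standard gradient descent form (here α is the learning rate and θ_j stands for any unknown parameter of f):

\theta_j \leftarrow \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}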
Let's say we use the famous Boston Housing dataset to understand loss functions.
To keep things simple, we will use only one feature, the average number of rooms per
house (X), to predict the dependent variable, the median value (Y) of houses in $1000s.
1. Squared Error Loss
Squared error loss for a single example is the square of the difference between the actual
and the predicted value, L = (y - f(x))^2. The corresponding cost function is the Mean of
these Squared Errors (MSE): J = (1/n) * Σ (y_i - f(x_i))^2, summed over all n training examples.
We apply gradient descent on the Boston dataset for different values of the learning rate, running 500 iterations each.
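As a sketch of how steps 1-5 look in code for the squared error loss, here is a minimal NumPy implementation; it assumes X (average number of rooms) and y (median value) are already loaded as 1-D arrays, and all names and default values are illustrative.

import numpy as np

def gradient_descent_mse(X, y, lr=0.01, n_iters=500):
    # Step 1: predictor f(X) = w*X + b with unknown parameters w and b
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(n_iters):
        y_pred = w * X + b          # current predictions
        error = y_pred - y          # per-example errors
        # Step 4: gradients of the MSE cost J = (1/n) * sum((f(x_i) - y_i)^2)
        dw = (2.0 / n) * np.sum(error * X)
        db = (2.0 / n) * np.sum(error)
        # Step 5: weight update rule with learning rate lr
        w -= lr * dw
        b -= lr * db
    return w, b

# Usage (arrays standing in for the Boston feature and target):
# w, b = gradient_descent_mse(X, y, lr=0.01, n_iters=500)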
A bit more about the MSE loss function: it is a positive quadratic function
(of the form ax^2 + bx + c, where a > 0), so its graph is an upward-opening parabola.
A positive quadratic function has only a global minimum. Since there are no local minima, we
will never get stuck in one.
Hence it is guaranteed that, if gradient descent converges at all, it converges to the global minimum.
The MSE loss function penalizes the model for making large errors by squaring them.
Squaring a large quantity makes it even larger: an error of 10 contributes 100 to the cost, while an error of 1 contributes only 1. But there's a caveat:
this property also makes the MSE cost function less robust to outliers, since a few extreme examples can dominate the average.
Cancer dataset
The goal is to classify a tumor as 'Malignant' or 'Benign' based on features like average radius, area, perimeter, etc.
For simplification, we will use only two input features (X_1 and X_2), namely 'worst area' and 'mean symmetry', for classification.
The target value Y can be 0 (Malignant) or 1 (Benign).
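A minimal sketch of preparing this data with scikit-learn's built-in breast cancer dataset, which uses the same encoding (0 = malignant, 1 = benign) and contains features named 'worst area' and 'mean symmetry':

import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
names = list(data.feature_names)

# X_1 = 'worst area', X_2 = 'mean symmetry', as in the text
cols = [names.index('worst area'), names.index('mean symmetry')]
X = data.data[:, cols]
y = data.target          # 0 = malignant, 1 = benign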
1. Binary Cross Entropy Loss or log loss
Then the cross-entropy loss for an output label y (which can take the values 0 or 1)
and a predicted probability p is defined as

L = -(y * log(p) + (1 - y) * log(1 - p))

This is also called Log Loss. To calculate the probability p, we can use the sigmoid
function, where z is a linear function of our input features (e.g., z = w_1*X_1 + w_2*X_2 + b):

p = sigmoid(z) = 1 / (1 + e^(-z))

The range of the sigmoid function is (0, 1), which makes it suitable for interpreting its output as a probability.
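A minimal sketch of these two formulas in code; the small epsilon clamp is an illustrative guard against log(0).

import numpy as np

def sigmoid(z):
    # Squashes any real z into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    # y is the true label (0 or 1), p the predicted probability of class 1
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident correct prediction gives a small loss,
# a confident wrong prediction a much larger one:
print(binary_cross_entropy(1, 0.9))   # ~0.105
print(binary_cross_entropy(1, 0.1))   # ~2.303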
2. Hinge Loss
Hinge loss is mainly used with Support Vector Machine classifiers, where the class labels are -1 and 1.
For a label y in {-1, 1} and a raw model score f(x), it is defined as L = max(0, 1 - y * f(x)).
Hinge loss not only penalizes wrong predictions but also correct predictions that are not confident (those with a margin less than 1).
It is used when we want to make fast, real-time decisions without a laser-sharp focus on accuracy.
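A minimal sketch of hinge loss, assuming labels in {-1, 1} and a raw (unsquashed) model score f(x):

import numpy as np

def hinge_loss(y, score):
    # y in {-1, 1}; score is the raw classifier output f(x)
    return np.maximum(0.0, 1.0 - y * score)

print(hinge_loss(1, 2.0))    # 0.0  -> correct and confident (margin >= 1)
print(hinge_loss(1, 0.4))    # 0.6  -> correct but not confident
print(hinge_loss(1, -1.0))   # 2.0  -> wrong prediction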
Multi-Class Classification Loss Functions
The binary cross-entropy loss generalizes to multiple classes as categorical cross-entropy: the negative log of the probability the model assigns to the true class, L = -log(p_true).
Evaluation metrics
Confusion matrix for a binary classifier:
               Actual 0    Actual 1
  Predicted 0  TN          FN
  Predicted 1  FP          TP
Accuracy is the simplest metric: the number of test cases correctly classified divided by the
total number of test cases, (TP + TN) / (TP + TN + FP + FN).
Precision measures how many of the cases predicted as positive are actually positive, TP / (TP + FP).
Recall tells us how many of the actual positive cases are correctly identified, TP / (TP + FN).
F1 score is the harmonic mean of Precision and Recall, 2 * Precision * Recall / (Precision + Recall),
and therefore balances out the strengths of each.
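A minimal sketch of these four formulas, taking the confusion-matrix counts as inputs (the counts in the example are made up for illustration):

def classification_metrics(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))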
AUC-ROC: the ROC curve is a plot of the true positive rate (recall, TP / (TP + FN)) against the
false positive rate (FP / (FP + TN)) as the decision threshold is varied. AUC-ROC stands for
Area Under the Receiver Operating Characteristic curve; the higher the area, the better the
model's performance.
If the curve lies close to the 50% diagonal line, the model is essentially predicting the output
variable at random.
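A minimal sketch of computing the ROC curve and AUC with scikit-learn; the labels and scores here are a tiny illustrative example, not real model output.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = np.array([0, 0, 1, 1])            # true labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print(roc_auc_score(y_true, y_score))               # area under that curve (0.75 here)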
Bias and Variance
Bias occurs when a model is strictly ruled by its assumptions. For example, linear regression
assumes that the relationship between the output variable and the independent variables is a
straight line; when the actual relationship is non-linear, this leads to underfitting.
Variance is high when a model focuses too closely on the training set and learns its variations
(including noise), compromising generalization. This leads to overfitting.
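A minimal sketch of how this trade-off is often illustrated: fitting polynomials of different degrees to noisy non-linear data and comparing train and test error (the data and degrees here are illustrative).

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, 60)   # non-linear target + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 15):   # degree 1: high bias (underfits); degree 15: high variance (tends to overfit)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error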