Note that an underfitting model has high bias and low variance.
Underfitting
Reasons for Underfitting
• The model is too simple, so it may not be capable of representing the
complexities in the data.
• The input features used to train the model are not an adequate
representation of the underlying factors influencing the target
variable.
• The size of the training dataset is not large enough.
• Excessive regularization is used to prevent overfitting, which
constrains the model from capturing the data well.
• Features are not scaled.
Advantages of bootstrap
• It is a non-parametric method, which means it does not require
any assumptions about the underlying distribution of the data.
• It can be used to estimate standard errors and confidence intervals
for a wide range of statistics.
• It can be used to estimate the uncertainty of a statistic even when
the sample size is small.
• It can be used to perform hypothesis tests and compare the
distributions of different statistics.
• It is widely used in many fields, such as statistics, finance, and
machine learning.
Bootstrapping
Disadvantages of bootstrap:
• It can be computationally intensive, especially when working with
large datasets.
• It may not be appropriate for all types of data, such as highly
skewed or heavy-tailed distributions.
• It may not be appropriate for estimating the uncertainty of
statistics that have very large variances.
• It may not be appropriate for estimating the uncertainty of
statistics that are not smooth or have very different variances.
• It may not always be a good substitute for other statistical
methods when large sample sizes are available.
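As a sketch of the resampling idea described above (the function name, the 2,000-resample default, and the example data are illustrative choices, not from the source), a percentile-bootstrap confidence interval can be computed with nothing but the standard library:

```python
import random

def bootstrap_ci(data, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an arbitrary statistic.
    No distributional assumptions are made about the data."""
    rng = random.Random(seed)
    n = len(data)
    # Draw n_resamples bootstrap samples (with replacement) and
    # compute the statistic on each one.
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

sample = [2.1, 2.5, 2.2, 3.0, 2.8, 2.4, 2.9, 2.6]
# 95% CI for the mean of the sample.
low, high = bootstrap_ci(sample, stat=lambda xs: sum(xs) / len(xs))
```

Because `stat` is an arbitrary function, the same routine works for the median, a standard deviation, or any other statistic, which is the flexibility the advantages above refer to.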
Bagging
Step 3: Each model is learned in parallel on its own training set,
independently of the others.
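Only Step 3 of the bagging procedure appears here; as an illustrative sketch (all names are hypothetical, and each per-resample "model" is just the mean of its resample rather than, say, a decision tree), the pattern of drawing bootstrap samples, fitting each model independently, and aggregating looks like this:

```python
import random
from statistics import mean

def bagged_predict(train_y, n_models=25, seed=0):
    """Bagging sketch: draw bootstrap resamples of the training data and
    fit one model per resample, each independently of the others (Step 3).
    Here each 'model' is just the mean of its resample; real bagging would
    fit e.g. a decision tree on each resample instead."""
    rng = random.Random(seed)
    n = len(train_y)
    models = []
    for _ in range(n_models):
        resample = [train_y[rng.randrange(n)] for _ in range(n)]  # bootstrap sample
        models.append(mean(resample))  # this model is trained independently
    return mean(models)  # aggregate the individual predictions

result = bagged_predict([1, 2, 3, 4, 5])
```

Because the models do not depend on each other, the loop body can be distributed across workers, which is what "learned in parallel" means in Step 3.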
I. Accuracy
The accuracy metric is one of the simplest classification metrics to
implement, and it is defined as the ratio of the number of correct
predictions to the total number of predictions.
It can be formulated as:
Accuracy = (Number of correct predictions) / (Total number of predictions)
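A minimal sketch of the accuracy formula above (the function name and example labels are just for illustration):

```python
def accuracy(y_true, y_pred):
    """Accuracy = number of correct predictions / total number of predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 3 of the 4 predictions match the true labels.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```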
III. Precision
V. F-Scores
VI. AUC-ROC
Sometimes we need to visualize the performance of the classification
model on charts; then, we can use the AUC-ROC curve. It is one of the
popular and important metrics for evaluating the performance of the
classification model.
Firstly, let's understand ROC (Receiver Operating Characteristic curve)
curve. ROC represents a graph to show the performance of a
classification model at different threshold levels. The curve is plotted
between two parameters, which are:
True Positive Rate
False Positive Rate
TPR, or True Positive Rate, is a synonym for Recall, and hence can be
calculated as:
TPR = TP / (TP + FN)
where TP is the number of true positives and FN is the number of false
negatives.
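A minimal sketch of computing TPR and FPR at a single threshold (the function and variable names are illustrative); sweeping the threshold over the model's scores yields the points that trace out the ROC curve:

```python
def tpr_fpr(y_true, scores, threshold):
    """TPR = TP / (TP + FN) (i.e. Recall); FPR = FP / (FP + TN)."""
    tp = fp = fn = tn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0  # classify at this threshold
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)

# Each threshold gives one (TPR, FPR) point on the ROC curve.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
roc_points = [tpr_fpr(y_true, scores, t) for t in (0.0, 0.3, 0.5, 0.7, 1.0)]
```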
Performance Metrics for Regression
Regression is a supervised learning technique that aims to find the
relationships between the dependent and independent variables. A
predictive regression model predicts a numeric (continuous) value. The
metrics used for regression are different from the classification metrics.
This means we cannot use the accuracy metric (explained above) to
evaluate a regression model; instead, the performance of a regression
model is reported as the error in its predictions. The following are the
popular metrics used to evaluate the performance of regression models.
• Mean Absolute Error
• Mean Squared Error
• R² Score
• Adjusted R²
I. Mean Absolute Error (MAE)
Mean Absolute Error or MAE is one of the simplest metrics; it
measures the absolute difference between actual and predicted values,
where absolute means the difference is always taken as positive.
To understand MAE, let's take the example of Linear Regression, where
the model draws a best-fit line between the dependent and independent
variables. To measure the error in prediction, we calculate the
difference between each actual value and its predicted value. Then, to
find the error over the complete dataset, we take the mean of these
absolute differences.
The below formula is used to calculate MAE:
MAE = (1/N) × Σ |Y − Y'|
Here,
Y is the Actual outcome, Y' is the predicted outcome, and N is the total
number of data points.
MAE is much more robust to outliers. One of the limitations of MAE
is that it is not differentiable at zero, so gradient-based optimizers
such as Gradient Descent cannot be applied to it directly. To overcome
this limitation, another metric can be used: Mean Squared Error, or
MSE.
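A minimal sketch of the MAE formula above (the function name and example values are illustrative):

```python
def mae(y_true, y_pred):
    """MAE = (1/N) * sum(|Y - Y'|): mean of the absolute differences
    between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Absolute errors are 1, 0, and 2, so the mean absolute error is 1.0.
error = mae([3, 5, 2], [2, 5, 4])
```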
II. Mean Squared Error
Mean Squared Error or MSE is one of the most suitable metrics for
regression evaluation. It measures the average of the squared
differences between the predicted values and the actual values given by
the model. Since the errors are squared, MSE only assumes
non-negative values, and it is usually positive and non-zero.
Moreover, because the differences are squared, large errors are
penalized much more heavily than small ones, and hence MSE can
over-estimate how bad the model is when a few predictions are far off.
MSE is a much-preferred metric compared to other regression metrics
as it is differentiable and hence can be optimized more easily.
The formula for calculating MSE is given below:
MSE = (1/N) × Σ (Y − Y')²
Here, Y is the Actual outcome, Y' is the predicted outcome, and N is the
total number of data points.
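A minimal sketch of the MSE formula (the function name and example values are illustrative):

```python
def mse(y_true, y_pred):
    """MSE = (1/N) * sum((Y - Y')^2): mean of the squared differences
    between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Squared errors are 1, 0, and 4, so the mean squared error is 5/3.
error = mse([3, 5, 2], [2, 5, 4])
```

Note how the single error of 2 contributes 4 to the sum, illustrating that squaring weights large errors more heavily than MAE does.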
III. R Squared Error
R squared error, also known as the Coefficient of Determination, is
another popular metric used for regression model evaluation. The R-
squared metric enables us to compare our model against a constant
baseline to determine the performance of the model. The baseline is
obtained by taking the mean of the target values and drawing a
horizontal line at that mean.
The R squared score will always be less than or equal to 1, regardless
of whether the values are large or small.
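A minimal sketch of R squared as described above, using the standard definition R² = 1 − SS_res / SS_tot, where the total sum of squares SS_tot comes from the mean baseline (the function name is illustrative):

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: compares the model's squared error
    against the constant mean-baseline's squared error."""
    baseline = sum(y_true) / len(y_true)  # the constant baseline: the mean
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - baseline) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

perfect = r2_score([1, 2, 3], [1, 2, 3])   # no residual error -> 1.0
as_mean = r2_score([1, 2, 3], [2, 2, 2])   # no better than the baseline -> 0.0
```

Because SS_res is never negative, the score can never exceed 1, which is the upper bound stated above; a model worse than the mean baseline yields a negative score.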
IV. Adjusted R Squared
Adjusted R squared, as the name suggests, is the improved version of R
squared error. R squared has a limitation: its score can improve as more
terms (predictors) are added, even when the model is not actually
improving, and this may mislead data scientists.
To overcome this issue, adjusted R squared is used, which will always
show a value lower than or equal to R². This is because it penalizes the
addition of predictors and only increases when there is a real
improvement.
We can calculate the adjusted R squared as follows:
Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)
where n is the total number of data points and k is the number of
independent variables (predictors).
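A minimal sketch of the adjusted R squared formula (the function name and example numbers are illustrative):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the number of data points and k the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With R^2 = 0.9, 10 data points, and 2 predictors, the adjusted score
# is pulled below the raw R^2, penalizing the extra predictors.
adj = adjusted_r2(0.9, n=10, k=2)
```

Adding a predictor increases k, which shrinks the denominator (n − k − 1) and drags the score down unless R² rises enough to compensate; that is the penalty described above.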