
Machine Learning

Overfitting vs Underfitting

Mariam Fakih
Outlines
• Generalization in ML
• How good is the fit
• Overfitting and how to avoid it
• Underfitting and how to solve it
• Summary

Generalization in ML
• Machine learning can be described as the process of "generalizing from examples".
• Generalization in machine learning measures how well a model classifies unseen data samples.
• A model is said to generalize well if it can make accurate predictions on data samples from diverse sets.
• The model should be fitted correctly: neither over-fitting nor under-fitting should occur.
Good Fit in ML

Some key terms
❑ Statistical fit: Goodness of fit is a statistical measure of how well the data points fit the model's curve. A model is a good fit if the expected and observed data points agree closely, both on the training data and, more importantly, on new, unseen data.
❑ Training error: The fitness of a predictive model is calculated by comparing the actual and predicted values. When this is done on the training data itself, it gives us the training error.
❑ Common metrics for calculating training error include mean squared error (MSE), mean absolute error (MAE), or other appropriate loss functions, as in the sketch below.
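A minimal sketch (with made-up values, for illustration only) of computing these training-error metrics:

```python
import numpy as np

# Hypothetical actual and predicted values on the training set.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 7.5])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error

print(f"MSE: {mse:.3f}, MAE: {mae:.3f}")
```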

Some key terms
❑ Test error: Predictive models learn a mapping between the input variables and the target value. The error measured when the model is tested on a previously unseen dataset is called the test error.
❑ NB: Balancing low training error with low test error is crucial for building models that are not only accurate but also robust and applicable to new situations.
❑ Bias: Bias measures the difference between the model's predictions and the target values.
❑ If a model is too simple and makes assumptions that do not align with the true underlying patterns (ground truth) in the data, it may consistently underpredict or overpredict the target values, resulting in a biased model.

Some key terms
❑ Variance: Variance measures the inconsistency of a model's predictions across varied datasets. Suppose the model's performance is tested on different datasets: the more consistent the predictions, the lower the variance. High variance indicates overfitting, in which the model loses the ability to generalize.
How good the fit is

Overfitting
• Over-fitting is a modeling error that occurs when a function is fit too closely to a limited set of data points during the training phase.
• It makes the model mimic a dataset that may not be fully representative of the data points the model will encounter in the future.
• One can tell a model is overfitting when it performs well on the training set but poorly on the test set (or new data), as the sketch below illustrates.
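A minimal sketch of detecting overfitting from the train/test gap, using a deliberately unconstrained model on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; a deep, unpruned decision tree can memorize it.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # near 1.0
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower
```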
Overfitting
• A statistical model is said to be overfitted when it fits the training data closely but does not make accurate predictions on testing data.
• Depending on how precisely it fits the data, the best fit line falls into one of three types: underfit, good fit, and overfit.
• Loosely speaking, the error on the training dataset reflects bias, and the error on the testing dataset reflects variance.
• The aim of any model is to obtain low bias and low variance.
Overfitting Solution?
❑ K-fold cross-validation is one of the most common techniques used to detect overfitting.
❑ In k-fold cross-validation, the data points are split into k equally sized subsets, called "folds."
❑ One fold acts as the test set while the remaining folds are used to train the model; this is repeated until every fold has served as the test set once (see the sketch below).
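A minimal sketch of k-fold cross-validation with scikit-learn, using synthetic data and a simple classifier for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

# cv=5: five folds, so each fold serves as the test set exactly once.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())
```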

K-fold cross-validation

Overfitting happens when:
❑ The training data is not cleaned and contains "garbage" values. The model captures the noise in the training data and fails to generalize.
❑ The model has high variance.
❑ The training data size is insufficient, and the model trains on the limited training data for many epochs.
❑ The architecture of the model has several neural layers bundled together. Deep neural networks are complex, require a significant amount of time to train, and often end up overfitting the training set.
❑ Incorrect tuning of hyper-parameters in the training phase leads to over-observing the training set, resulting in memorized features.
Underfitting
• A statistical model or a machine learning algorithm is said to underfit when it is too simple to capture the complexities of the data.
• Underfitting represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and testing data.
• In simple terms, an underfit model's predictions are inaccurate, especially when applied to new, unseen examples.
• It mainly happens when we use a very simple model with overly simplified assumptions.
• To address underfitting, we need more complex models, enhanced feature representation, and less regularization.
• An underfitting model has high bias and low variance.
Underfitting happens when:
❑ Unclean training data containing noise or outliers can prevent the model from deriving patterns from the dataset.
❑ The model has high bias due to an inability to capture the relationship between the input examples and the target values. This usually happens with varied datasets.
❑ The model is too simple; for example, we train a linear model in a complex scenario.
❑ Incorrect hyper-parameter tuning often leads to underfitting because the model under-observes the features.
Overfitting, Underfitting and Bias-variance tradeoff

Overfitting vs Underfitting

Overfitting & Underfitting

Techniques to Reduce Underfitting
• Increase model complexity (see the sketch below).
• Increase the number of features by performing feature engineering.
• Remove noise from the data.
• Increase the number of epochs or the duration of training to get better results.
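A minimal sketch of reducing underfitting by increasing model complexity: a plain linear model underfits quadratic data, while adding polynomial features (a simple form of feature engineering) captures the curvature. All data here is synthetic, for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.3, size=200)  # quadratic relationship

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:    ", linear.score(X, y))  # low: the model underfits
print("polynomial R^2:", poly.score(X, y))    # much higher
```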
How to achieve a Good Fit
1. Introduction of the validation set
2. Resampling methods

Validation set
• A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set.
• If training error decreases but validation error increases, the model is likely overfitting and training should be stopped (see the sketch below).
• Bias is an error from erroneous assumptions in the learning algorithm.
• High bias can cause an algorithm to miss the relevant relations between features and target outputs (under-fitting).
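A minimal sketch of stopping training when validation error stops improving, using an SGD classifier trained one epoch at a time on synthetic data (the patience threshold of 3 is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = SGDClassifier(random_state=0)
classes = np.unique(y_train)

best_val_error, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    model.partial_fit(X_train, y_train, classes=classes)  # one epoch
    val_error = 1.0 - model.score(X_val, y_val)           # validation error
    if val_error < best_val_error:
        best_val_error, bad_epochs = val_error, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # validation error stopped improving
            break

print(f"stopped after {epoch + 1} epochs, best validation error {best_val_error:.3f}")
```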
How to avoid Overfitting?
1. Train with more data
2. Data augmentation
3. Addition of noise to the input data
4. Feature selection
5. Cross-validation
6. Simplify data
7. Regularization
8. Ensembling
9. Early stopping
10. Adding dropout layers
How to reduce Underfitting?
1. Decrease regularization
2. Increase the duration of training
3. Feature selection
4. Remove noise from data

Overfitting vs Underfitting
❑ Overfitting and underfitting are the two biggest causes of poor performance in machine learning algorithms and models.
❑ The scenario in which the model performs well in the training phase but gives poor accuracy on the test dataset is called overfitting.
❑ The scenario in which the model performs poorly even on the training dataset, because it cannot derive features from the training set, is called underfitting.
❑ To achieve a good fit means to stop training the model at an optimal point, such that the model is neither under-observing the features nor learning the unnecessary details and noise in the training set.
Overfitting & Underfitting
• In an underfit, the best fit line doesn't cover many of the data points present. Thus, there is a high chance of error in the training dataset itself (high bias), and eventually in the testing dataset as well (low variance).
• In an overfit, the best fit line covers every single data point. You might think, isn't that what we want? But no: this may be well and good on the training dataset (low bias), but when a testing dataset is provided to the same model, there will be a high error on the testing dataset (high variance).
• Only with a good fit will the best fit line be such that any point to be predicted is predicted accurately. Consequently, it has both low bias and low variance.

Overfitting and Random Forest
• Instead of relying on one decision tree, a Random Forest takes the prediction from each tree and, based on the majority vote of those predictions, outputs the final prediction.
• A greater number of trees in the forest leads to higher accuracy and prevents the problem of over-fitting.
• Each tree draws a random sample from the original dataset when generating its splits, adding a further element of randomness that prevents over-fitting (see the sketch below).
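A minimal sketch contrasting a single decision tree with a random forest on synthetic data; the forest, combining many trees by majority vote, usually generalizes better:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("single tree test accuracy:  ", tree.score(X_test, y_test))
print("random forest test accuracy:", forest.score(X_test, y_test))
```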

Importance of Feature Selection
• Machine learning models work on a simple rule: garbage in, garbage out.
• Garbage here means noise in the data.
• A high-dimensional dataset can cause many problems, namely long training time and an overly complex model, which in turn may lead to over-fitting.
• We do not need to use every feature present in the dataset.
• One can assist modeling by feeding in only those features that are really important.
• Sometimes, less is better.
Importance of Feature Selection
• In a case with very high dimensions, there are often several redundant features that do not contribute much and are simply extensions of other important features.
• These redundant features do not contribute to the model's predictive capability.
• Clearly, there is a need to remove these redundant features from the dataset to get the most effective predictive modeling performance.
• We can summarize the objectives of feature selection as follows (see the sketch below):
– It enables faster training of a model.
– It reduces the complexity of the model, making it easier to interpret, and avoids over-fitting.
– It improves the prediction accuracy of a model.
– There is less data to store and process.
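A minimal sketch of feature selection: keep only the k features most related to the target and discard the rest. The synthetic data and the choice k=5 are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 50 features, only 5 of which are informative.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("original shape:", X.shape)           # (300, 50)
print("selected shape:", X_selected.shape)  # (300, 5)
```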
Normalization
• Normalization is the process of rescaling data so that they have the same scale, typically the interval [-1, +1] or [0, 1].
• Normalization is used to avoid over-fitting and to improve computation speed.
• Normalization is used when the attributes in the data have varying scales (see the sketch below).
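A minimal sketch of min-max normalization, rescaling each attribute to [0, 1], with a tiny made-up dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two attributes on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)  # every column now lies in [0, 1]
```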

Regularization
• Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the loss function.
• The penalty term discourages overly complex models that may fit the training data too closely, making them less likely to generalize well to new, unseen data.
• Regularization helps to control the model's complexity and prevents it from capturing noise in the training data.
• There are different types of regularization; two common ones are L1 regularization (Lasso) and L2 regularization (Ridge), as sketched below.
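A minimal sketch of L1 (Lasso) and L2 (Ridge) regularization in scikit-learn; alpha controls the strength of the penalty term, and the regression data is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: can zero out coefficients entirely

print("ridge zero coefficients:", (ridge.coef_ == 0).sum())
print("lasso zero coefficients:", (lasso.coef_ == 0).sum())
```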
Summary

Summary
• A statistical model is said to be overfitted when it performs well on training data but does not make accurate predictions on testing data.
• Underfitting occurs when the model is too simple and is unable to find relationships and patterns accurately.
