
Overfitting & Feature Engineering
- Techniques available to a data scientist
Overfitting

❖ Goodness of fit refers to how closely a model’s predicted values match the observed
(true) values.

❖ A model that has learned the noise instead of the signal is considered “overfit”.

❖ “Signal” is the true underlying pattern that you wish to learn from the data.

❖ “Noise,” on the other hand, refers to the irrelevant information or randomness in a
dataset.
Underfitting

❖ Underfitting occurs when a model is too simple – informed by too few features or
regularized too much – which makes it inflexible in learning from the dataset.

❖ Simple learners tend to have less variance in their predictions but more bias towards
wrong outcomes.

❖ On the other hand, complex learners tend to have more variance in their predictions
but less bias.
Error from Bias

❖ Bias is the error introduced by approximating a real-world problem with a model that
is too simple.

❖ No matter how many more observations you collect, a linear regression won't be able
to model the curves in nonlinear data!
Error from Variance

❖ Variance refers to your algorithm's sensitivity to specific sets of training data.

❖ High variance algorithms will produce dramatically different models depending on the
training set.

❖ In the extreme case, an unconstrained model can essentially memorize the training set,
including all of its noise.
Bias-Variance Tradeoff

❖ Low variance (high bias) algorithms tend to be less complex, with simple or rigid
underlying structure. They train models that are consistent, but inaccurate on average.

❖ On the other hand, low bias (high variance) algorithms tend to be more complex, with
flexible underlying structure. They train models that are accurate on average, but
inconsistent.
Key Point

The tradeoff in complexity is why there is a tradeoff between bias and variance – an
algorithm cannot simultaneously be more complex and less complex.


Total Error

Total Error = Bias² + Variance + Irreducible Error

The ultimate goal of supervised machine learning is to isolate the signal from the dataset
while ignoring the noise!
Hyperparameters vs Parameters

❖ A model parameter is a configuration variable that is internal to the model and whose
value is learned by the model from data.

❖ For example, the coefficients in a linear regression or logistic regression.

❖ A model hyperparameter is a configuration that is external to the model and whose
value cannot be estimated from data.

❖ Hyperparameters are often specified by the practitioner, and they are set before
parameters are learned.

❖ For example, the learning rate in a gradient descent algorithm, or the choice of an L1
vs. L2 penalty in a logistic regression.
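A minimal sketch of the distinction, assuming scikit-learn and a synthetic dataset:

# Sketch: parameters vs. hyperparameters (scikit-learn and synthetic data assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: chosen by the practitioner *before* fitting.
model = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs")

# Parameters: learned *from the data* during fitting.
model.fit(X, y)
print("Learned coefficients:", model.coef_)   # model parameters
print("Learned intercept:", model.intercept_)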
Detecting Overfitting

❖ Compare the evaluation metrics: if our model does much better on the training set
than on the test set, then we’re likely overfitting.

❖ For example, it would be a big red flag if our model saw 95% accuracy on the training

set but only 65% accuracy on the test set.
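A short sketch of this check, assuming scikit-learn; the 0.10 gap threshold is an assumed
rule of thumb, not a fixed standard:

# Sketch: flag a large train/test accuracy gap (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # unconstrained tree
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train={train_acc:.2f} test={test_acc:.2f}")
if train_acc - test_acc > 0.10:   # assumed threshold
    print("Large gap -> likely overfitting")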


Handling Overfitting

❖ Occam’s Razor

❖ Cross-validation

❖ Train with more data

❖ Remove features

❖ Early stopping

❖ Regularization

❖ Ensembling
Occam’s Razor

❖ Start with a very simple model to serve as a benchmark

❖ As you try more complex algorithms, you’ll have a reference point to see if the
additional complexity is worth it.

❖ This is the Occam’s razor test. If two models have comparable performance, then you
should usually pick the simpler one
Cross-validation

❖ Cross-validation is a powerful preventative measure against overfitting. Here, we use
the initial training data to generate multiple mini train-test splits.

❖ Using these splits, we tune our model.

K-fold Cross-validation

❖ In standard k-fold cross-validation, we partition the data into k subsets, called folds.
Then, we iteratively train the algorithm on k-1 folds while using the remaining fold as
the test set (called the “holdout fold”).

❖ Cross-validation allows you to tune hyperparameters with only your original training
set. This allows you to keep your test set as a truly unseen dataset for selecting your
final model.
Step by Step

1. Split your training data into 10 equal parts, or "folds."


2. From all sets of hyperparameters you wish to consider, choose a set of
hyperparameters.
3. Train your model with that set of hyperparameters on the first 9 folds.
4. Evaluate it on the 10th fold, or the "hold-out" fold.
5. Repeat steps (3) and (4) 10 times with the same set of hyperparameters, each time
holding out a different fold.
6. Aggregate the performance across all 10 folds. This is your performance metric for the
set of hyperparameters.
7. Repeat steps (2) to (6) for all sets of hyperparameters you wish to consider.
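A sketch of this 10-fold procedure, assuming scikit-learn, whose GridSearchCV performs
steps (2) through (7) internally:

# Sketch: 10-fold cross-validated hyperparameter search (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# The sets of hyperparameters to consider (step 2).
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X, y)  # trains on 9 folds and evaluates on the hold-out fold, 10 times per set
print("Best hyperparameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)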
More Training Data

❖ Training with more data can help algorithms detect the signal better in general.

❖ It is harder for a model to memorize a larger dataset, so the algorithm is better off learning the signal.

❖ If we just add more noisy data, this technique won’t help. That’s why you should
always ensure your data is clean and relevant
Remove features

❖ We can manually improve the algorithm’s generalizability by removing irrelevant input
features.

❖ Here we are essentially reducing the noise in the data and making sure we give only
relevant features (those that contain signal) to the algorithm.

❖ A stepwise approach can be used for feature selection.

❖ If variables are highly correlated, some of them can be safely removed.

❖ If a variable doesn’t change much, then it doesn’t add value and can be disregarded.

❖ Use dimensionality-reduction algorithms such as PCA or LDA, as sketched below.
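A minimal sketch of these ideas, assuming pandas/scikit-learn; the |r| > 0.95 correlation
cutoff and the variance threshold are assumed values:

# Sketch: drop near-constant and highly correlated features, then reduce with PCA.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame(np.random.rand(100, 5), columns=list("abcde"))
df["f"] = df["a"] * 0.99 + 0.01   # nearly duplicates column "a"
df["g"] = 1.0                     # constant: adds no information

# Drop features whose variance is (near) zero.
reduced = VarianceThreshold(threshold=1e-4).fit_transform(df)

# Drop one member of each highly correlated pair (assumed cutoff |r| > 0.95).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]

# Or project onto a few dimensions with PCA.
components = PCA(n_components=3).fit_transform(df)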


Early Stopping

❖ When learning parameters through an iterative process, new iterations improve the
model up until a certain point. After that point, however, the model’s ability to
generalize can weaken as it begins to overfit the training data.
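One way to apply this, assuming scikit-learn's gradient boosting, which halts training
once an internal validation score stops improving:

# Sketch: early stopping in gradient boosting (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on iterations
    validation_fraction=0.1,   # portion of training data held out internally
    n_iter_no_change=10,       # stop after 10 iterations without improvement
    random_state=0,
)
model.fit(X, y)
print("Iterations actually used:", model.n_estimators_)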
Regularization

❖ Regularization refers to a broad range of techniques for artificially forcing your model
to be simpler

❖ The method will depend on the type of learner you’re using. For example, you could
prune a decision tree, use dropout on a neural network, or add a penalty parameter to
the cost function in regression.

❖ Oftentimes, the regularization method is a hyperparameter as well, which means it can
be tuned through cross-validation.
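A sketch of the regression case mentioned above, assuming scikit-learn: an L2 penalty
(ridge regression) whose strength alpha is tuned through cross-validation.

# Sketch: penalized regression with the penalty strength tuned by CV.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# alpha is the regularization hyperparameter: larger alpha -> simpler model.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print("Best alpha:", search.best_params_["alpha"])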
Ensembling
❖ Ensembles are machine learning methods for combining predictions from multiple separate
models. There are a few different methods for ensembling, but the two most common are
Bagging and Boosting
❖ Bagging attempts to reduce the chance of overfitting complex models.
➢ It trains a large number of "strong" learners in parallel.
➢ A strong learner is a model that's relatively unconstrained.
➢ Bagging then combines all the strong learners together in order to "smooth out" their
predictions
❖ Boosting attempts to improve the predictive flexibility of simple models.
➢ It trains a large number of "weak" learners in sequence.
➢ A weak learner is a constrained model (e.g., you could limit the max depth of each
decision tree).
➢ Each one in the sequence focuses on learning from the mistakes of the one before it.
➢ Boosting then combines all the weak learners into a single strong learner.
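A sketch contrasting the two approaches, assuming scikit-learn: bagging of unconstrained
("strong") trees versus boosting of shallow ("weak") trees.

# Sketch: bagging vs. boosting (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: many deep (strong) trees trained in parallel, predictions averaged out.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Boosting: many shallow (weak) trees trained in sequence, each fixing prior mistakes.
boosting = GradientBoostingClassifier(max_depth=2, n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())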
Good ML Workflow

A proper machine learning workflow includes:

❖ Separate training and test sets

❖ Trying appropriate algorithms

❖ Fitting model parameters

❖ Tuning impactful hyperparameters

❖ Proper performance metrics

❖ Systematic cross-validation
Feature Engineering

❖ Preparing the input dataset, compatible with the machine learning algorithm
requirements

❖ Improving the performance of machine learning models. Here, performance can be
measured in two ways: the time required to learn the parameters, and the performance
w.r.t. evaluation metrics.
Feature Engineering Techniques

❖ Imputation
❖ Handling Outliers
❖ Binning
❖ Transformations
❖ Encoding
❖ Feature Split
❖ Scaling
❖ Extracting Date
Data Imputation

❖ Most algorithms do not accept datasets with missing values and will raise an error.

❖ The simplest solution for missing values is to drop the affected rows, or the entire
column if it has a lot of missing values.

❖ Numerical imputation involves filling the missing values with a default numerical
value. Most of the time, we use the median.

❖ Categorical imputation involves filling the missing values with the most frequently
occurring value, or creating a new categorical value like “other” for them.
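A minimal sketch of both imputation styles, assuming pandas/scikit-learn and a toy
DataFrame with hypothetical columns:

# Sketch: median imputation for numeric columns, "most frequent"/"other" for categoricals.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["NY", None, "NY", "SF"]})

# Numerical imputation: replace NaN with the column median.
df[["age"]] = SimpleImputer(strategy="median").fit_transform(df[["age"]])

# Categorical imputation: most frequent value, or a new "other" category.
df["city"] = df["city"].fillna(df["city"].mode()[0])   # most frequent
# df["city"] = df["city"].fillna("other")              # alternative: new category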
Handling Outliers

❖ Outlier Detection with Standard Deviation

❖ Outlier Detection with Percentiles

❖ Outlier Dilemma: Drop or Cap
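A sketch of all three bullets, assuming pandas/numpy; the 3-standard-deviation factor
and the 1st/99th percentile band are assumed conventions:

# Sketch: detect outliers by std and by percentiles, then drop or cap them.
import numpy as np
import pandas as pd

s = pd.Series(np.append(np.random.normal(50, 5, 100), [120, -40]))

# Detection with standard deviation: flag points beyond mean +/- 3*std.
mask_std = (s - s.mean()).abs() > 3 * s.std()

# Detection with percentiles: flag points outside the 1st-99th percentile band.
lo, hi = s.quantile(0.01), s.quantile(0.99)
mask_pct = (s < lo) | (s > hi)

dropped = s[~mask_pct]      # option 1: drop the outliers
capped = s.clip(lo, hi)     # option 2: cap (winsorize) them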


Binning

❖ Binning can be applied on both categorical and numerical data.

❖ The main motivation of binning is to make the model more robust and prevent
overfitting; however, it has a cost in performance.

❖ The trade-off between performance and overfitting is the key point of the binning
process

❖ For categorical columns, the labels with low frequencies probably affect the
robustness of statistical models negatively. Thus, assigning a general category to
these less frequent values helps to keep the robustness of the model.

❖ For example, if your data size is 100,000 rows, it might be a good option to unite the
labels with a count less than 100 to a new category like “Other”.
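A sketch of both kinds of binning, assuming pandas; the bin edges are illustrative, and
the rare-label cutoff mirrors the count threshold described above:

# Sketch: numerical binning with pd.cut, rare categorical labels grouped into "Other".
import pandas as pd

df = pd.DataFrame({"age": [3, 17, 25, 42, 70],
                   "country": ["US", "US", "IN", "IN", "MC"]})

# Numerical binning: map a continuous value to ordered categories.
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 40, 65, 120],
                       labels=["child", "young", "middle", "senior"])

# Categorical binning: collapse low-frequency labels into "Other".
counts = df["country"].value_counts()
rare = counts[counts < 2].index.tolist()   # the slide's example uses a cutoff of 100
df["country"] = df["country"].replace(rare, "Other")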
Data Transformations

❖ Helps to handle skewed data; after transformation, the distribution becomes closer
to normal.

❖ It also decreases the effect of outliers, due to the normalization of magnitude
differences, and the model becomes more robust.

❖ Examples: log transformation, Box-Cox transformation (requires strictly positive
values), and Yeo-Johnson transformation.
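A minimal sketch of the three transformations, assuming numpy/scikit-learn and synthetic
right-skewed data:

# Sketch: log, Box-Cox, and Yeo-Johnson transforms for skewed data.
import numpy as np
from sklearn.preprocessing import PowerTransformer

x = np.random.lognormal(mean=0.0, sigma=1.0, size=(200, 1))  # right-skewed, positive

log_x = np.log1p(x)  # log transform; log1p handles zeros safely

# Box-Cox needs strictly positive inputs; Yeo-Johnson also accepts zero/negative values.
boxcox = PowerTransformer(method="box-cox").fit_transform(x)
yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(x)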
Encoding

❖ One-hot Encoding

❖ Label Encoding
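A sketch of both encodings, assuming pandas/scikit-learn and a hypothetical "color"
column:

# Sketch: one-hot vs. label encoding.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category (no implied order).
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: one integer per category (implies an order; use with care).
df["color_id"] = LabelEncoder().fit_transform(df["color"])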
Feature Split

❖ Sometimes the dataset contains string columns with potential information that is
useful for the model.
❖ By extracting the utilizable parts of such a column into new features, we:
➢ Enable machine learning algorithms to comprehend them.
➢ Make it possible to bin and group them.
➢ Improve model performance by uncovering potential information.
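A minimal sketch, assuming pandas; the "name" column and its format are hypothetical:

# Sketch: splitting a raw string column into usable parts.
import pandas as pd

df = pd.DataFrame({"name": ["Doe, John", "Roe, Jane"]})

# Extract surname and first name into separate features.
df[["surname", "first_name"]] = df["name"].str.split(", ", expand=True)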
Scaling

❖ The numerical features of a dataset rarely share a common range. In real life, it is
unreasonable to expect the age and income columns to have the same range.

❖ Scaling solves this problem. The continuous features become identical in terms of the
range, after a scaling process. This process is not mandatory for many algorithms, but
it might be still nice to apply. However, the algorithms based on distance calculations
such as k-NN or k-Means need to have scaled continuous features as model input.
Scaling - Normalization

❖ Normalization (or min-max normalization) scales all values into a fixed range between 0
and 1: x' = (x − min) / (max − min). This transformation does not change the shape of the
feature’s distribution.

Scaling - Standardization

❖ Standardization (or z-score normalization) scales the values while taking the standard
deviation into account: z = (x − μ) / σ, giving each feature zero mean and unit variance.
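A sketch of both scalers, assuming scikit-learn and the age/income example above:

# Sketch: min-max normalization vs. z-score standardization.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25, 30000], [40, 90000], [60, 55000]], dtype=float)  # age, income

normalized = MinMaxScaler().fit_transform(X)      # each column scaled into [0, 1]
standardized = StandardScaler().fit_transform(X)  # each column to mean 0, std 1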
Extracting Date

❖ Extracting the parts of the date into different columns: Year, month, day, etc.

❖ Extracting the time period between the current date and the date column, in terms of
years, months, days, etc.

❖ Extracting some specific features from the date: Name of the weekday, Weekend or
not, holiday or not, etc.
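A sketch of all three extractions, assuming pandas and a hypothetical "signup" column:

# Sketch: extracting parts, elapsed time, and specific features from a date column.
import pandas as pd

df = pd.DataFrame({"signup": pd.to_datetime(["2021-03-15", "2023-12-24"])})

# Parts of the date as separate columns.
df["year"] = df["signup"].dt.year
df["month"] = df["signup"].dt.month
df["day"] = df["signup"].dt.day

# Time elapsed between the current date and the column, in days.
df["days_since"] = (pd.Timestamp.today() - df["signup"]).dt.days

# Specific features: weekday name and a weekend flag.
df["weekday"] = df["signup"].dt.day_name()
df["is_weekend"] = df["signup"].dt.dayofweek >= 5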
