Overfitting & Feature Engineering.pptx
Feature Engineering
- Techniques available to a data scientist
Overfitting
❖ Underfitting occurs when a model is too simple – informed by too few features or
regularized too much – which makes it inflexible in learning from the dataset.
❖ Simple learners tend to have less variance in their predictions but more bias towards
wrong outcomes
❖ On the other hand, complex learners tend to have more variance in their predictions
Error from bias
❖ No matter how many more observations you collect, a linear regression won't be able
to model the curves in that data!
Error from Variance
❖ High variance algorithms will produce dramatically different models depending on the
training set.
❖ Here, the unconstrained model has basically memorized the training set, including all
of the noise
Bias-Variance Tradeoff
❖ Low variance (high bias) algorithms tend to be less complex, with simple or rigid
underlying structure. They train models that are consistent, but inaccurate on average.
❖ On the other hand, low bias (high variance) algorithms tend to be more complex, with
flexible underlying structure. They train models that are accurate on average, but
inconsistent.
Key Point
The ultimate goal of supervised machine learning is to isolate the signal from the dataset while
ignoring the noise!
Hyperparameters vs Parameters
❖ A model parameter is a configuration variable that is internal to the model and whose
value is learned by the model from data.
❖ Hyperparameters are often specified by the practitioner, and they are set before the
parameters are learned
❖ Examples include the learning rate in a gradient descent algorithm and the choice of an L1
or L2 penalty in logistic regression
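A minimal sketch of the distinction (assuming Python with scikit-learn; the dataset is synthetic and only for illustration): the penalty type and regularization strength C are hyperparameters chosen before training, while the coefficients are parameters learned from the data.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: set by the practitioner before training
model = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs")

# Parameters: learned from the data during training
model.fit(X, y)
print("learned coefficients (parameters):", model.coef_)
print("learned intercept (parameter):", model.intercept_)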
Detecting Overfitting
❖ Check the evaluation metrics: if our model does much better on the training set than on the
test set, it is likely overfitting (see the sketch after this list)
❖ For example, it would be a big red flag if our model saw 95% accuracy on the training set
but far lower accuracy on the test set
❖ Occam’s Razor
❖ Cross-validation
❖ Remove features
❖ Early stopping
❖ Regularization
❖ Ensembling
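A minimal sketch of checking for overfitting by comparing train and test scores (assuming Python with scikit-learn; the synthetic data and the unconstrained decision tree are only illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree can memorise the training set, including the noise
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A large gap between the two scores is a red flag for overfitting
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))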
Occam’s Razor
❖ Start with a very simple model as a benchmark. As you try more complex algorithms, you'll
have a reference point to see if the additional complexity is worth it.
❖ This is the Occam’s razor test. If two models have comparable performance, then you
should usually pick the simpler one
Cross-validation
❖ In standard k-fold cross-validation, we partition the data into k subsets, called folds.
Then, we iteratively train the algorithm on k-1 folds while using the remaining fold as
the test set (called the “holdout fold”).
❖ Cross-validation allows you to tune hyperparameters with only your original training
set. This allows you to keep your test set as a truly unseen dataset for selecting your
final model.
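A minimal sketch of k-fold cross-validation with k = 5 (assuming Python with scikit-learn; the data is synthetic). Each iteration trains on 4 folds and evaluates on the held-out fold:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# cv=5 partitions the data into 5 folds and rotates the holdout fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())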
Training with More Data
❖ Training with more data can help algorithms detect the signal better in general.
❖ With more data, it is harder for the model to simply memorise every observation, so the algorithm is better off learning the signal.
❖ If we just add more noisy data, this technique won’t help. That’s why you should
always ensure your data is clean and relevant
Remove features
❖ Here we are basically trying to reduce the noisy data and to make sure we give
relevant features (that contain signal) to the algorithm
❖ If a variable doesn’t change much, then the variable doesn’t add value and can be
disregarded
Early stopping
❖ If using an iterative process to learn the parameters, then up until a certain number of
iterations, new iterations improve the model. After that point, however, the model's
ability to generalize can weaken as it begins to overfit the training data. Stopping training
at that point is called early stopping.
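A minimal sketch of early stopping (assuming Python with scikit-learn's SGDClassifier, which can hold out a validation fraction and stop once the validation score stops improving; the data and settings are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)

model = SGDClassifier(
    max_iter=1000,
    early_stopping=True,      # stop when the validation score stops improving
    validation_fraction=0.1,  # portion of training data held out for validation
    n_iter_no_change=5,       # patience: iterations allowed without improvement
    random_state=0,
)
model.fit(X, y)
print("stopped after", model.n_iter_, "iterations")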
Regularization
❖ Regularization refers to a broad range of techniques for artificially forcing your model
to be simpler
❖ The method will depend on the type of learner you’re using. For example, you could
prune a decision tree, use dropout on a neural network, or add a penalty parameter to
the cost function in regression.
❖ The strength of the regularization is itself a hyperparameter, so it should be tuned through
systematic cross-validation; a sketch of the penalty approach follows below
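A minimal sketch of the penalty approach in regression (assuming Python with scikit-learn; the data is synthetic, and alpha is the penalty strength you would tune with cross-validation):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=0)

# The L2 (ridge) and L1 (lasso) penalties shrink coefficients toward zero,
# forcing the model to be simpler than the unpenalized fit
for name, model in [("no penalty ", LinearRegression()),
                    ("ridge (L2) ", Ridge(alpha=10.0)),
                    ("lasso (L1) ", Lasso(alpha=1.0))]:
    model.fit(X, y)
    print(name, model.coef_.round(1))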
Feature Engineering
❖ Preparing the input dataset so that it is compatible with the requirements of the machine
learning algorithm
❖ Imputation
❖ Handling Outliers
❖ Binning
❖ Transformations
❖ Encoding
❖ Feature Split
❖ Scaling
❖ Extracting Date
Data Imputation
❖ Most algorithms do not accept datasets with missing values and will raise an error
❖ The simplest solution to missing values is to drop the rows, or the entire column if it has a
lot of missing values
❖ Numerical imputation involves filling the missing values with a default numerical value.
Most of the time, we use the median
❖ Categorical imputation involves filling the missing values with the most frequently occurring
value, or creating a new categorical value like 'other' for them
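A minimal sketch of both kinds of imputation (assuming Python with pandas; the column names and values are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 40, 31, np.nan],
    "city": ["London", None, "Paris", "Paris", None],
})

# Numerical imputation: fill missing values with the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical imputation: fill with the most frequent value, or a new 'other' label
df["city"] = df["city"].fillna(df["city"].mode()[0])   # or .fillna("other")
print(df)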
Binning
❖ The main motivation of binning is to make the model more robust and prevent
overfitting; however, it comes at a cost to performance.
❖ The trade-off between performance and overfitting is the key point of the binning
process
❖ For categorical columns, the labels with low frequencies probably affect the
robustness of statistical models negatively. Thus, assigning a general category to
these less frequent values helps to keep the robustness of the model.
❖ For example, if your data size is 100,000 rows, it might be a good option to unite the
labels with a count less than 100 to a new category like “Other”.
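A minimal sketch of numerical binning and of grouping rare categorical labels into an 'Other' bucket (assuming Python with pandas; the columns, thresholds, and labels are only illustrative):

import pandas as pd

df = pd.DataFrame({
    "age":     [5, 17, 25, 42, 67, 80],
    "country": ["UK", "UK", "UK", "France", "Fiji", "Nepal"],
})

# Numerical binning: cut a continuous column into labelled ranges
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 65, 120],
                       labels=["child", "adult", "senior"])

# Categorical binning: send low-frequency labels to a general 'Other' category
counts = df["country"].value_counts()
rare = counts[counts < 2].index
df["country_bin"] = df["country"].where(~df["country"].isin(rare), "Other")
print(df)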
Data Transformations
❖ Transformations such as the log transform help to handle skewed data: after the
transformation, the distribution becomes closer to normal
❖ They also decrease the effect of outliers, because differences in magnitude are
compressed, and the model becomes more robust
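A minimal sketch of a log transform on a right-skewed column (assuming Python with numpy and pandas; np.log1p computes log(1 + x), so zero values are handled safely):

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [0, 20_000, 35_000, 50_000, 1_000_000]})

# The transform compresses large magnitudes, pulling in the outlier
df["income_log"] = np.log1p(df["income"])
print(df)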
Feature Split
❖ Sometimes the dataset contains string columns that hold potentially useful information
for the model
❖ By extracting the utilizable parts of a column into new features:
➢ We enable machine learning algorithms to comprehend them.
➢ Make it possible to bin and group them.
➢ Improve model performance by uncovering potential information.
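A minimal sketch of a feature split (assuming Python with pandas; the 'name' column and its format are made-up examples):

import pandas as pd

df = pd.DataFrame({"name": ["Luther N. Gonzalez", "Charles M. Young", "Terry Lawson"]})

# Extract the utilizable parts of the string column into new features
df["first_name"] = df["name"].str.split(" ").str[0]
df["last_name"]  = df["name"].str.split(" ").str[-1]
print(df)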
Scaling
❖ The numerical features of a dataset do not share a common range; they differ from
each other. In real life, it makes no sense to expect the age and income columns to have the
same range.
❖ Scaling solves this problem. The continuous features become comparable in range
after a scaling process. This step is not mandatory for many algorithms, but
it might still be nice to apply. However, algorithms based on distance calculations,
such as k-NN or k-Means, need scaled continuous features as model input.
Scaling - Normalization
❖ Normalization (or min-max normalization) scales all values into a fixed range between 0
and 1. This transformation does not change the shape of the feature's distribution
❖ X_norm = (X - X_min) / (X_max - X_min)
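A minimal sketch of min-max normalization done by hand (assuming Python with pandas; scikit-learn's MinMaxScaler performs the same transformation):

import pandas as pd

df = pd.DataFrame({"value": [2, 45, -23, 85, 28]})

# Every value is mapped into the [0, 1] range
df["normalized"] = (df["value"] - df["value"].min()) / (df["value"].max() - df["value"].min())
print(df)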
Scaling - Standardisation
❖ Standardization (or z-score normalization) scales the values while taking into account
standard deviation
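A minimal sketch of z-score standardization, z = (X - mean) / standard deviation, done by hand (assuming Python with pandas; pandas' .std() uses the sample standard deviation, while scikit-learn's StandardScaler uses the population version, but the idea is the same):

import pandas as pd

df = pd.DataFrame({"value": [2, 45, -23, 85, 28]})

# Subtract the mean and divide by the standard deviation
df["standardized"] = (df["value"] - df["value"].mean()) / df["value"].std()
print(df)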
Extracting Date
❖ Extracting the parts of the date into different columns: Year, month, day, etc.
❖ Extracting the time period between the current date and the date column, in terms of years,
months, days, etc.
❖ Extracting some specific features from the date: Name of the weekday, Weekend or
not, holiday or not, etc.
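A minimal sketch of date extraction (assuming Python with pandas; the 'date' column and its values are only illustrative):

import pandas as pd

df = pd.DataFrame({"date": ["2017-01-01", "2008-12-04", "1988-06-23"]})
df["date"] = pd.to_datetime(df["date"])

# Parts of the date as separate columns
df["year"]  = df["date"].dt.year
df["month"] = df["date"].dt.month

# Time period between the current date and the column, in years
df["years_ago"] = pd.Timestamp("today").year - df["date"].dt.year

# Specific features: weekday name and weekend flag
df["day_name"]   = df["date"].dt.day_name()
df["is_weekend"] = df["date"].dt.dayofweek >= 5   # Saturday=5, Sunday=6
print(df)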