
Overfitting & Feature Engineering
- Techniques available to a data scientist
Overfitting

❖ Goodness of fit refers to how closely a model’s predicted values match the observed
(true) values.

❖ A model that has learned the noise instead of the signal is considered “overfit”.

❖ “Signal” is the true underlying pattern that you wish to learn from the data.

❖ “Noise,” on the other hand, refers to the irrelevant information or randomness in a
dataset.
Underfitting

❖ Underfitting occurs when a model is too simple – informed by too few features or
regularized too much – which makes it inflexible in learning from the dataset.

❖ Simple learners tend to have less variance in their predictions but more bias towards
wrong outcomes.

❖ On the other hand, complex learners tend to have more variance in their predictions
but less bias.
Error from Bias

❖ Bias is the error introduced by approximating a real-world problem with a model that
is too simple.

❖ No matter how many more observations you collect, a linear regression won't be able
to model the curves in nonlinear data!
Error from Variance

❖ Variance refers to your algorithm's sensitivity to specific sets of training data.

❖ High variance algorithms will produce dramatically different models depending on the
training set.

❖ In the extreme case, an unconstrained model can essentially memorize the training set,
including all of its noise.
Bias-Variance Tradeoff

❖ Low variance (high bias) algorithms tend to be less complex, with simple or rigid
underlying structure. They train models that are consistent, but inaccurate on average.

❖ On the other hand, low bias (high variance) algorithms tend to be more complex, with
flexible underlying structure. They train models that are accurate on average, but
inconsistent.
Key Point

The tradeoff in complexity is why there is a tradeoff between bias and variance – an
algorithm cannot simultaneously be more complex and less complex.


Total Error

Total Error = Bias² + Variance + Irreducible Error

The ultimate goal of supervised machine learning is to isolate the signal from the dataset
while ignoring the noise!
Hyperparameters vs Parameters

❖ A model parameter is a configuration variable that is internal to the model and whose
value is learned by the model from data.

❖ For example, the coefficients in a linear regression or logistic regression.

❖ A model hyperparameter is a configuration that is external to the model and whose
value cannot be estimated from data.

❖ Hyperparameters are often specified by the practitioner, and they are set before
parameters are learned.

❖ For example, the learning rate in a gradient descent algorithm, or the choice of an L1
vs. L2 penalty in a logistic regression.
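A minimal sketch of the distinction, assuming scikit-learn and a synthetic dataset:

# Sketch: parameters vs. hyperparameters (scikit-learn and synthetic data assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: chosen by the practitioner *before* fitting.
model = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs")

# Parameters: learned *from the data* during fitting.
model.fit(X, y)
print("Learned coefficients:", model.coef_)   # model parameters
print("Learned intercept:", model.intercept_)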
Detecting Overfitting

❖ Compare the evaluation metrics: if our model does much better on the training set
than on the test set, then we’re likely overfitting.

❖ For example, it would be a big red flag if our model saw 95% accuracy on the training

set but only 65% accuracy on the test set.
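A short sketch of this check, assuming scikit-learn; the 0.10 gap threshold is an assumed
rule of thumb, not a fixed standard:

# Sketch: flag a large train/test accuracy gap (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # unconstrained tree
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train={train_acc:.2f} test={test_acc:.2f}")
if train_acc - test_acc > 0.10:   # assumed threshold
    print("Large gap -> likely overfitting")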


Handling Overfitting

❖ Occam’s Razor

❖ Cross-validation

❖ Train with more data

❖ Remove features

❖ Early stopping

❖ Regularization

❖ Ensembling
Occam’s Razor

❖ Start with a very simple model to serve as a benchmark

❖ As you try more complex algorithms, you’ll have a reference point to see if the
additional complexity is worth it.

❖ This is the Occam’s razor test. If two models have comparable performance, then you
should usually pick the simpler one
Cross-validation

❖ Cross-validation is a powerful preventative measure against overfitting. Here, we use
the initial training data to generate multiple mini train-test splits.

❖ Using these splits, we tune our model.

K-fold Cross-validation

❖ In standard k-fold cross-validation, we partition the data into k subsets, called folds.
Then, we iteratively train the algorithm on k-1 folds while using the remaining fold as
the test set (called the “holdout fold”).

❖ Cross-validation allows you to tune hyperparameters with only your original training
set. This allows you to keep your test set as a truly unseen dataset for selecting your
final model.
Step by Step

1. Split your training data into 10 equal parts, or "folds."


2. From all sets of hyperparameters you wish to consider, choose a set of
hyperparameters.
3. Train your model with that set of hyperparameters on the first 9 folds.
4. Evaluate it on the 10th fold, or the "hold-out" fold.
5. Repeat steps (3) and (4) 10 times with the same set of hyperparameters, each time
holding out a different fold.
6. Aggregate the performance across all 10 folds. This is your performance metric for the
set of hyperparameters.
7. Repeat steps (2) to (6) for all sets of hyperparameters you wish to consider.
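A sketch of this 10-fold procedure, assuming scikit-learn, whose GridSearchCV performs
steps (2) through (7) internally:

# Sketch: 10-fold cross-validated hyperparameter search (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# The sets of hyperparameters to consider (step 2).
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X, y)  # trains on 9 folds and evaluates on the hold-out fold, 10 times per set
print("Best hyperparameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)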
More Training Data

❖ Training with more data can help algorithms detect the signal better in general.

❖ It is harder for a model to memorize a larger dataset, so the algorithm is better off learning the signal.

❖ If we just add more noisy data, this technique won’t help. That’s why you should
always ensure your data is clean and relevant
Remove features

❖ We can manually improve the algorithm’s generalizability by removing irrelevant input
features.

❖ Here we are essentially reducing the noise in the data and making sure we give only
relevant features (those that contain signal) to the algorithm.

❖ A stepwise approach can be used for feature selection.

❖ If variables are highly correlated, some of them can be safely removed.

❖ If a variable doesn’t change much, then it doesn’t add value and can be disregarded.

❖ Use dimensionality-reduction algorithms such as PCA or LDA, as sketched below.
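A minimal sketch of these ideas, assuming pandas/scikit-learn; the |r| > 0.95 correlation
cutoff and the variance threshold are assumed values:

# Sketch: drop near-constant and highly correlated features, then reduce with PCA.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame(np.random.rand(100, 5), columns=list("abcde"))
df["f"] = df["a"] * 0.99 + 0.01   # nearly duplicates column "a"
df["g"] = 1.0                     # constant: adds no information

# Drop features whose variance is (near) zero.
reduced = VarianceThreshold(threshold=1e-4).fit_transform(df)

# Drop one member of each highly correlated pair (assumed cutoff |r| > 0.95).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]

# Or project onto a few dimensions with PCA.
components = PCA(n_components=3).fit_transform(df)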


Early Stopping

❖ When learning parameters through an iterative process, new iterations improve the
model up until a certain point. After that point, however, the model’s ability to
generalize can weaken as it begins to overfit the training data.
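One way to apply this, assuming scikit-learn's gradient boosting, which halts training
once an internal validation score stops improving:

# Sketch: early stopping in gradient boosting (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on iterations
    validation_fraction=0.1,   # portion of training data held out internally
    n_iter_no_change=10,       # stop after 10 iterations without improvement
    random_state=0,
)
model.fit(X, y)
print("Iterations actually used:", model.n_estimators_)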
Regularization

❖ Regularization refers to a broad range of techniques for artificially forcing your model
to be simpler

❖ The method will depend on the type of learner you’re using. For example, you could
prune a decision tree, use dropout on a neural network, or add a penalty parameter to
the cost function in regression.

❖ Oftentimes, the regularization method is a hyperparameter as well, which means it can
be tuned through cross-validation.
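A sketch of the regression case mentioned above, assuming scikit-learn: an L2 penalty
(ridge regression) whose strength alpha is tuned through cross-validation.

# Sketch: penalized regression with the penalty strength tuned by CV.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# alpha is the regularization hyperparameter: larger alpha -> simpler model.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print("Best alpha:", search.best_params_["alpha"])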
Ensembling
❖ Ensembles are machine learning methods for combining predictions from multiple separate
models. There are a few different methods for ensembling, but the two most common are
Bagging and Boosting
❖ Bagging attempts to reduce the chance of overfitting complex models.
➢ It trains a large number of "strong" learners in parallel.
➢ A strong learner is a model that's relatively unconstrained.
➢ Bagging then combines all the strong learners together in order to "smooth out" their
predictions
❖ Boosting attempts to improve the predictive flexibility of simple models.
➢ It trains a large number of "weak" learners in sequence.
➢ A weak learner is a constrained model (e.g., you could limit the max depth of each
decision tree).
➢ Each one in the sequence focuses on learning from the mistakes of the one before it.
➢ Boosting then combines all the weak learners into a single strong learner.
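A sketch contrasting the two approaches, assuming scikit-learn: bagging of unconstrained
("strong") trees versus boosting of shallow ("weak") trees.

# Sketch: bagging vs. boosting (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: many deep (strong) trees trained in parallel, predictions averaged out.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Boosting: many shallow (weak) trees trained in sequence, each fixing prior mistakes.
boosting = GradientBoostingClassifier(max_depth=2, n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())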
Good ML Workflow

A proper machine learning workflow includes:

❖ Separate training and test sets

❖ Trying appropriate algorithms

❖ Fitting model parameters

❖ Tuning impactful hyperparameters

❖ Proper performance metrics

❖ Systematic cross-validation
Feature Engineering

❖ Preparing the input dataset, compatible with the machine learning algorithm
requirements

❖ Improving the performance of machine learning models. Here, performance can be
measured in two ways: the time required to learn the parameters, and the performance
w.r.t. evaluation metrics.
Feature Engineering Techniques

❖ Imputation
❖ Handling Outliers
❖ Binning
❖ Transformations
❖ Encoding
❖ Feature Split
❖ Scaling
❖ Extracting Date
Data Imputation

❖ Most algorithms do not accept datasets with missing values and will raise an error.

❖ The simplest solution for missing values is to drop the affected rows, or the entire
column if it has a lot of missing values.

❖ Numerical imputation involves filling the missing values with a default numerical
value. Most of the time, we use the median.

❖ Categorical imputation involves filling the missing values with the most frequently
occurring value, or creating a new categorical value like “other” for them.
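A minimal sketch of both imputation styles, assuming pandas/scikit-learn and a toy
DataFrame with hypothetical columns:

# Sketch: median imputation for numeric columns, "most frequent"/"other" for categoricals.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["NY", None, "NY", "SF"]})

# Numerical imputation: replace NaN with the column median.
df[["age"]] = SimpleImputer(strategy="median").fit_transform(df[["age"]])

# Categorical imputation: most frequent value, or a new "other" category.
df["city"] = df["city"].fillna(df["city"].mode()[0])   # most frequent
# df["city"] = df["city"].fillna("other")              # alternative: new category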
Handling Outliers

❖ Outlier Detection with Standard Deviation

❖ Outlier Detection with Percentiles

❖ Outlier Dilemma: Drop or Cap
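A sketch of all three bullets, assuming pandas/numpy; the 3-standard-deviation factor
and the 1st/99th percentile band are assumed conventions:

# Sketch: detect outliers by std and by percentiles, then drop or cap them.
import numpy as np
import pandas as pd

s = pd.Series(np.append(np.random.normal(50, 5, 100), [120, -40]))

# Detection with standard deviation: flag points beyond mean +/- 3*std.
mask_std = (s - s.mean()).abs() > 3 * s.std()

# Detection with percentiles: flag points outside the 1st-99th percentile band.
lo, hi = s.quantile(0.01), s.quantile(0.99)
mask_pct = (s < lo) | (s > hi)

dropped = s[~mask_pct]      # option 1: drop the outliers
capped = s.clip(lo, hi)     # option 2: cap (winsorize) them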


Binning

❖ Binning can be applied on both categorical and numerical data.

❖ The main motivation of binning is to make the model more robust and prevent
overfitting; however, it has a cost in performance.

❖ The trade-off between performance and overfitting is the key point of the binning
process

❖ For categorical columns, the labels with low frequencies probably affect the
robustness of statistical models negatively. Thus, assigning a general category to
these less frequent values helps to keep the robustness of the model.

❖ For example, if your data size is 100,000 rows, it might be a good option to unite the
labels with a count less than 100 to a new category like “Other”.
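A sketch of both kinds of binning, assuming pandas; the bin edges are illustrative, and
the rare-label cutoff mirrors the count threshold described above:

# Sketch: numerical binning with pd.cut, rare categorical labels grouped into "Other".
import pandas as pd

df = pd.DataFrame({"age": [3, 17, 25, 42, 70],
                   "country": ["US", "US", "IN", "IN", "MC"]})

# Numerical binning: map a continuous value to ordered categories.
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 40, 65, 120],
                       labels=["child", "young", "middle", "senior"])

# Categorical binning: collapse low-frequency labels into "Other".
counts = df["country"].value_counts()
rare = counts[counts < 2].index.tolist()   # the slide's example uses a cutoff of 100
df["country"] = df["country"].replace(rare, "Other")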
Data Transformations

❖ Helps to handle skewed data; after transformation, the distribution becomes closer
to normal.

❖ It also decreases the effect of outliers, due to the normalization of magnitude
differences, and the model becomes more robust.

❖ Examples: log transformation, Box-Cox transformation (requires strictly positive
values), and Yeo-Johnson transformation.
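A minimal sketch of the three transformations, assuming numpy/scikit-learn and synthetic
right-skewed data:

# Sketch: log, Box-Cox, and Yeo-Johnson transforms for skewed data.
import numpy as np
from sklearn.preprocessing import PowerTransformer

x = np.random.lognormal(mean=0.0, sigma=1.0, size=(200, 1))  # right-skewed, positive

log_x = np.log1p(x)  # log transform; log1p handles zeros safely

# Box-Cox needs strictly positive inputs; Yeo-Johnson also accepts zero/negative values.
boxcox = PowerTransformer(method="box-cox").fit_transform(x)
yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(x)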
Encoding

❖ One-hot Encoding

❖ Label Encoding
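A sketch of both encodings, assuming pandas/scikit-learn and a hypothetical "color"
column:

# Sketch: one-hot vs. label encoding.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category (no implied order).
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: one integer per category (implies an order; use with care).
df["color_id"] = LabelEncoder().fit_transform(df["color"])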
Feature Split

❖ Sometimes the dataset contains string columns with potential information that is
useful for the model.
❖ By extracting the utilizable parts of such a column into new features, we:
➢ Enable machine learning algorithms to comprehend them.
➢ Make it possible to bin and group them.
➢ Improve model performance by uncovering potential information.
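A minimal sketch, assuming pandas; the "name" column and its format are hypothetical:

# Sketch: splitting a raw string column into usable parts.
import pandas as pd

df = pd.DataFrame({"name": ["Doe, John", "Roe, Jane"]})

# Extract surname and first name into separate features.
df[["surname", "first_name"]] = df["name"].str.split(", ", expand=True)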
Scaling

❖ The numerical features of a dataset rarely share a common range. In real life, it is
unreasonable to expect the age and income columns to have the same range.

❖ Scaling solves this problem. The continuous features become identical in terms of the
range, after a scaling process. This process is not mandatory for many algorithms, but
it might be still nice to apply. However, the algorithms based on distance calculations
such as k-NN or k-Means need to have scaled continuous features as model input.
Scaling - Normalization

❖ Normalization (or min-max normalization) scales all values into a fixed range between 0
and 1: x' = (x − min) / (max − min). This transformation does not change the shape of the
feature’s distribution.

Scaling - Standardization

❖ Standardization (or z-score normalization) scales the values while taking the standard
deviation into account: z = (x − μ) / σ, giving each feature zero mean and unit variance.
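A sketch of both scalers, assuming scikit-learn and the age/income example above:

# Sketch: min-max normalization vs. z-score standardization.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25, 30000], [40, 90000], [60, 55000]], dtype=float)  # age, income

normalized = MinMaxScaler().fit_transform(X)      # each column scaled into [0, 1]
standardized = StandardScaler().fit_transform(X)  # each column to mean 0, std 1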
Extracting Date

❖ Extracting the parts of the date into different columns: Year, month, day, etc.

❖ Extracting the time period between the current date and the date column, in terms of
years, months, days, etc.

❖ Extracting some specific features from the date: Name of the weekday, Weekend or
not, holiday or not, etc.
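A sketch of all three extractions, assuming pandas and a hypothetical "signup" column:

# Sketch: extracting parts, elapsed time, and specific features from a date column.
import pandas as pd

df = pd.DataFrame({"signup": pd.to_datetime(["2021-03-15", "2023-12-24"])})

# Parts of the date as separate columns.
df["year"] = df["signup"].dt.year
df["month"] = df["signup"].dt.month
df["day"] = df["signup"].dt.day

# Time elapsed between the current date and the column, in days.
df["days_since"] = (pd.Timestamp.today() - df["signup"]).dt.days

# Specific features: weekday name and a weekend flag.
df["weekday"] = df["signup"].dt.day_name()
df["is_weekend"] = df["signup"].dt.dayofweek >= 5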
