3. Cross Validation

Understanding Cross-Validation
Introduction
• Machine learning validation methods provide a means to estimate
generalization error.
• This is crucial for determining which model provides the best
predictions for unobserved data.
• In cases where large amounts of data are available, machine learning
data validation begins with splitting the data into three separate
datasets:
1. A training set is used to train the machine learning model(s) during
development.
2. A validation set is used to estimate the generalization error of each
model created from the training set, for the purpose of model selection.
3. A test set is used to obtain a final, unbiased estimate of the
generalization error of the selected model.
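As a concrete illustration, the three-way split can be sketched in plain Python. The function name `train_val_test_split` and the 60/20/20 fractions are illustrative choices, not a fixed convention:

```python
import random

def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle sample indices and split them into train/validation/test sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # deterministic shuffle for the sketch
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

In practice a library helper (for example scikit-learn's `train_test_split`, applied twice) is typically used instead of hand-rolled index slicing.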
Cross-Validation in Machine Learning
• The model validation process described in the previous section
works when we have large datasets.
• When data is limited, we must instead use a technique
called cross-validation.
• The purpose of cross-validation is to provide a
better estimate of a model's ability to perform on
unseen data.
• It provides a nearly unbiased estimate of the generalization
error, especially in the case of limited data.
Cross Validation

• There are many reasons we may want to do this:
1. To obtain a clearer measure of how our model performs.
2. To tune hyperparameters.
3. To make model selections.
• The intuition behind cross-validation is simple: rather
than training our model on one training set, we train
it on multiple subsets of the data.
The basic steps of cross-validation
are:
1. Split the data into portions.
2. Train our model on a subset of the portions.
3. Test our model on the remaining portion of the data.
4. Repeat steps 2-3 until the model has been trained and
tested on the entire dataset.
5. Average the model performance across all iterations of
testing to get the overall model performance.
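The steps above can be sketched as a simple loop. Here `train_and_score` is a hypothetical stand-in for fitting a model on the training portion and scoring it on the held-out portion; the toy scorer below exists only to make the sketch runnable:

```python
def cross_validate(data, k, train_and_score):
    """Split data into k portions, hold each out once, and average the scores."""
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]          # held-out portion
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]  # everything else
        scores.append(train_and_score(train, test))
    return sum(scores) / k                                      # average performance

# Toy scorer: fraction of test values below the training mean (illustrative only).
def train_and_score(train, test):
    mean = sum(train) / len(train)
    return sum(1 for x in test if x < mean) / len(test)

avg = cross_validate(list(range(20)), k=5, train_and_score=train_and_score)
```

Any real metric (accuracy, RMSE, etc.) slots into the same loop; only the `train_and_score` callable changes.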
Common Cross-Validation Methods
• Though the basic concept of cross-validation is fairly
simple, there are several ways to go about each step.
• A few examples of cross-validation methods include:
1. k-Fold Cross-Validation
2. Stratified k-Fold Cross-Validation
3. Leave-One-Out Cross-Validation
4. Time-Series Cross-Validation
k-Fold Cross-Validation
• In k-fold cross-validation:
• The dataset is divided into k equal-sized folds.
• The model is trained on k-1 folds and tested on the
remaining fold.
• The process is repeated k times, with each fold serving
as the test set exactly once.
• The performance metrics are averaged over the k
iterations.
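A minimal sketch of how the k test folds can be generated. Libraries such as scikit-learn provide this via `KFold`, but the index logic is simple enough to show directly; distributing the remainder over the first folds is one common convention:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    The first n_samples % k folds get one extra sample, so every
    sample appears in exactly one test fold.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))  # fold sizes 4, 3, 3
```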
Stratified k-Fold Cross-Validation
• This process is similar to k-fold cross-validation with
minor but important exceptions:
• The class distribution in each fold is preserved.
• It is useful for imbalanced datasets.
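One simple way to preserve class proportions is to deal each class's samples round-robin across the folds. This is a sketch of the idea, not scikit-learn's exact `StratifiedKFold` algorithm:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, preserving class
    proportions by dealing each class's samples round-robin across folds."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for cls_indices in by_class.values():
        for j, idx in enumerate(cls_indices):
            folds[j % k].append(idx)
    return folds

labels = [0] * 8 + [1] * 4          # imbalanced: 8 of class 0, 4 of class 1
folds = stratified_folds(labels, 4)
# every fold holds 2 samples of class 0 and 1 of class 1
```

With a plain (unstratified) split of such data, some folds could contain no minority-class samples at all, which is exactly what stratification prevents.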
Leave-One-Out Cross-Validation
• The leave-one-out cross-validation process:
• Trains the model using all data observations except one.
• Tests the model on the single held-out data point.
• Repeats this for n iterations until each data point is used
exactly once as a test set.
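Leave-one-out is just k-fold with k equal to the number of samples, so the split logic reduces to a sketch like this:

```python
def leave_one_out(n_samples):
    """Yield (train_indices, test_index) pairs, holding out each sample once."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]  # all but the held-out point
        yield train, i

splits = list(leave_one_out(4))
# 4 splits; each trains on 3 indices and tests on the one left out
```

Note that this means n model fits, which is why LOOCV is reserved for small datasets.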
Time-Series Cross-Validation
• This cross-validation method, designed specifically for
time series:
• Splits the data into training and testing sets in
chronological order, using, for example, sliding or
expanding windows.
• Trains the model on past data and tests the model on
future data, based on the splitting point.
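An expanding-window scheme can be sketched as follows; scikit-learn's `TimeSeriesSplit` implements a similar idea. The `n_splits` and `test_size` parameters here are illustrative:

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield chronological (train, test) index lists with an expanding window:
    the model always trains on the past and is tested on the next block."""
    first_train_end = n_samples - n_splits * test_size
    for i in range(n_splits):
        train_end = first_train_end + i * test_size
        train = list(range(train_end))                       # all past observations
        test = list(range(train_end, train_end + test_size)) # the next future block
        yield train, test

splits = list(expanding_window_splits(10, n_splits=3, test_size=2))
# train windows grow: 0-3, then 0-5, then 0-7; each test block is the next 2 points
```

Unlike ordinary k-fold, the data is never shuffled and test indices always come after training indices, which respects the temporal ordering.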
Method: k-Fold Cross-Validation
Advantages:
• Provides a good estimate of the model's performance by using all the
data for both training and testing.
• Reduces the variance in performance estimates compared to other
methods.
Disadvantages:
• Can be computationally expensive, especially for large datasets or
complex models.
• May not work well for imbalanced datasets or when there is a specific
order to the data.
Method: Stratified k-Fold Cross-Validation
Advantages:
• Ensures that each fold has a representative distribution of classes,
which can improve performance estimates for imbalanced datasets.
• Reduces the variance in performance estimates.
Disadvantages:
• Can still be computationally expensive, especially for large datasets
or complex models.
• May not be necessary for balanced datasets where class distribution
is already even.
Method: Leave-One-Out Cross-Validation (LOOCV)
Advantages:
• Provides the least biased estimate of the model's performance, as the
model is tested on every data point.
• Can be useful when dealing with very limited data.
Disadvantages:
• Can be computationally expensive, as it requires training and testing
the model n times.
• May have high variance in performance estimates, due to the small
size of the test set.
Method: Time-Series Cross-Validation
Advantages:
• Accounts for temporal dependencies in time-series data.
• Provides a realistic estimate of the model's performance in
real-world use.
Disadvantages:
• May not be applicable for non-time-series data.
• Can be sensitive to the choice of window size and data splitting
strategy.