Cross-Validation in Machine Learning
• Cross-validation is a technique for validating model performance by training the model on a subset of the input data and testing it on a previously unseen subset of that data. In other words, it is a technique to check how well a statistical model generalizes to an independent dataset.
• In machine learning there is always a need to test the stability of the model, and this cannot be judged from the training dataset alone. For this purpose, we reserve a particular sample of the dataset that was not part of training. We then test the model on that sample before deployment, and this complete process comes under cross-validation. This goes beyond the simple train-test split.
• Hence the basic steps of cross-validation are (a minimal sketch follows the list):
• Reserve a subset of the dataset as a validation set.
• Train the model using the remaining training dataset.
• Evaluate model performance using the validation set. If the model performs well on the validation set, proceed to the next step; otherwise, check for issues.
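• The sketch below illustrates these three steps on a toy dataset; the dataset and classifier (load_iris, DecisionTreeClassifier) are placeholders chosen for illustration, not taken from the slides.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: reserve 20% of the data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: train the model on the training portion only.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Step 3: evaluate on the held-out validation set.
print("Validation accuracy:", model.score(X_val, y_val))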
Key aspects of evaluating the quality of a model are:
• How accurate the model is
• How well the model generalizes
• When we build a model and train it with the ‘entire’ dataset, we can easily calculate its accuracy on that training data. But we cannot test how the model will behave with new data that is not present in the training set, so its generalization cannot be determined.
• Hence we need techniques that make use of the same dataset for both training and testing of models.
• In machine learning, cross-validation is the technique used to evaluate how well a model has generalized and its overall accuracy. For this purpose, it repeatedly samples data from the dataset to create training and testing sets. There are multiple cross-validation approaches, demonstrated below.
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score  # used to run each CV strategy below
Reading CSV Data into Pandas
• Next, we load the dataset from the CSV file into a pandas dataframe and check the top 5 rows.
df = pd.read_csv("Parkinsson disease.csv")
df.head()
Data Preprocessing
• The "name" column is not going to add any value in training the model and can be discarded, so we drop it below.
df.drop(df.columns[0], axis=1, inplace=True)  # drop the first column ("name")
• Next, we separate the features (X) and the target (y) as shown below.
# Independent and dependent features
X = df.drop('status', axis=1)
y = df['status']
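• Before choosing a validation strategy, it can help to check how balanced the target classes are, since imbalance is what motivates stratified splits such as StratifiedKFold. The quick check below is an illustrative addition, not part of the original slides.

# Illustrative: fraction of each class in the target (sketch, not slide code).
print(y.value_counts(normalize=True))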
Hold out Approach in Sklearn
Out[41]:
array([0.61538462, 0.79487179, 0.71794872, 0.74358974, 0.71794872])
0.717948717948718
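• The code cell that produced the output above does not appear in the slide text, and the five scores shown resemble a 5-fold cross_val_score run rather than a single split. A minimal sketch of the hold-out approach itself, reusing the X and y prepared above (illustrative, not the original code):

# Hold-out: one random 80/20 train-test split (sketch).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out 20%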
Leave One Out Cross Validation (LOOCV)
• In Sklearn, Leave One Out Cross Validation (LOOCV) can be applied by using the LeaveOneOut module of sklearn.model_selection.
from sklearn.model_selection import LeaveOneOut

model = DecisionTreeClassifier()
leave_validation = LeaveOneOut()  # one fold per sample: the model is fit n times
results = cross_val_score(model, X, y, cv=leave_validation)
results
Out[22]:
array([1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1.,
       ...])
print(np.mean(results))  # average accuracy across all leave-one-out folds
0.8358974358974359
Repeated Random Test-Train Splits
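• In Sklearn, repeated random test-train splits can be performed with the ShuffleSplit module of sklearn.model_selection. As a minimal sketch, reusing the X and y prepared above (the split count and test size are illustrative choices, not from the slides):

from sklearn.model_selection import ShuffleSplit

# Ten independent random 80/20 splits; each split trains and scores afresh.
shuffle_validation = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
model = DecisionTreeClassifier()
results = cross_val_score(model, X, y, cv=shuffle_validation)
print(results)
print(np.mean(results))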