Model Evaluation
Lecture 5
Agenda
▪Data processing
▪ Data cleaning and transforming
▪ Feature selection and visualization
▪Model selection and tuning
▪Over-fitting and Under-fitting
▪ Bias, variance
▪Cross-Validation and Re-sampling methods
▪ K-Fold Cross-Validation
▪Gradient descent (batch, stochastic)
▪Performance evaluation methods
Data processing
▪Data Processing is the task of converting data from a
given form to a much more usable and desired form
▪ i.e., making it more meaningful and informative.
▪Using Machine Learning algorithms, mathematical
modeling, and statistical knowledge, this entire
process can be automated.
▪The output of this process can take any desired form,
such as graphs, videos, charts, tables, or images.
Data - Preprocessing
▪The technique of preparing raw data so that it is suitable
for building and training ML models.
▪Why we need it:
▪ Real-world data is often incomplete, inconsistent, or inaccurate,
and frequently lacks specific attribute values or trends.
▪ Duplicate or missing values may give an incorrect view of the
overall statistics of data.
▪ Outliers and inconsistent data points often tend to disturb the
model’s overall learning, leading to false predictions.
▪It is a common rule of thumb in ML that the more data we
have, the better the models we can train, provided the
data is of good quality.
Data - Preprocessing
▪Features in machine learning
▪ Individual independent variables that act as inputs
to our machine learning model.
▪ They can be thought of as representations or attributes
that describe the data and help the models to predict the
classes/labels.
▪ In a structured dataset such as a CSV file, each column
is a feature: a measurable piece of data that can be
used for analysis:
▪ E.g., Name, Age, Sex, Fare, and so on.
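The idea that each CSV column is a feature can be sketched as follows. This is a minimal illustration using only the Python standard library; the CSV content and the helper name `load_rows` are made up for the example, with column names borrowed from the slide.

```python
import csv
import io

# Hypothetical CSV content; column names follow the slide's example.
CSV_TEXT = """Name,Age,Sex,Fare
Alice,29,female,71.28
Bob,41,male,8.05
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per row (tuple)."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(CSV_TEXT)
feature_names = list(rows[0].keys())       # each column is one feature
ages = [float(r["Age"]) for r in rows]     # one feature as a numeric column

print(feature_names)
print(ages)
```

Values arrive as strings, so numeric features such as Age need an explicit conversion before they can be used by a model.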
4 Steps in Data Preprocessing
▪Data cleaning
▪Data integration
▪Data transformation
▪Data reduction
Data Preprocessing: Cleaning
▪Missing values: handle them in one of two ways:
1. Ignore those tuples:
▪ appropriate when the dataset is huge and a tuple contains
numerous missing values.
2. Fill in the missing values:
▪ There are many methods to achieve this, such as
▪ filling in the values manually,
▪ predicting the missing values with a regression method, or
▪ using a numerical summary such as the attribute mean.
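The two strategies above can be sketched as follows. This is a minimal illustration, assuming rows are stored as dicts and missing values are represented by `None`; the function names `drop_incomplete` and `fill_with_mean` are hypothetical, not taken from any library.

```python
from statistics import mean

def drop_incomplete(rows, max_missing=0):
    """Strategy 1: keep only rows with at most max_missing None values."""
    return [r for r in rows
            if sum(v is None for v in r.values()) <= max_missing]

def fill_with_mean(rows, column):
    """Strategy 2: replace None in one numeric column with the attribute mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    col_mean = mean(observed)  # raises StatisticsError if nothing was observed
    return [{**r, column: col_mean if r[column] is None else r[column]}
            for r in rows]

rows = [
    {"age": 20, "fare": 10.0},
    {"age": None, "fare": 30.0},
    {"age": 40, "fare": None},
]

print(drop_incomplete(rows))          # only the first row is complete
print(fill_with_mean(rows, "age"))    # the missing age becomes the mean of 20 and 40
```

Dropping rows is simple but discards information, which is why it is usually reserved for large datasets; mean filling keeps every row at the cost of reducing the variance of the filled column.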
Data Preprocessing: Cleaning…
▪Noisy data: noise is random error or variance in a measured
variable; cleaning involves removing or smoothing it.
1. Binning: works on sorted data values to smooth out any
noise present in them.
▪ The data is divided into equal-sized bins, and each bin/bucket is
dealt with independently.
▪ All values in a bin can be replaced by the bin mean, the bin median, or
the nearest bin boundary value.
2. Regression: normally used for prediction.
▪ Here it smooths noise by fitting the data points to a regression function.
3. Clustering: creates groups/clusters from data having
similar values.
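Binning can be sketched as follows. This is a minimal illustration, assuming equal-sized bins means bins holding the same number of values; the function names and the sample `prices` list are made up for the example.

```python
from statistics import mean

def split_into_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_bin_means(values, bin_size):
    """Replace every value by the mean of its bin."""
    smoothed = []
    for bin_values in split_into_bins(values, bin_size):
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

def smooth_by_bin_boundaries(values, bin_size):
    """Replace every value by the closer of its bin's min and max."""
    smoothed = []
    for bin_values in split_into_bins(values, bin_size):
        low, high = bin_values[0], bin_values[-1]
        smoothed.extend(low if v - low <= high - v else high
                        for v in bin_values)
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))       # bin means are 9, 22 and 29
print(smooth_by_bin_boundaries(prices, 3))
```

Each bin is processed independently, so a single noisy value only affects the values that share its bin.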
Data Preprocessing: Cleaning…
▪Removing outliers:
▪ Clustering techniques group together similar data points.
▪ Tuples that lie outside every cluster are treated as
outliers/inconsistent data.
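Cluster-based outlier detection can be sketched as follows. The slides do not name a clustering algorithm, so this example assumes a very simple one-dimensional scheme: sorted values belong to the same cluster while the gap between neighbours stays within `max_gap`, and clusters smaller than `min_cluster_size` are reported as outliers. Both parameters and both function names are hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; a gap larger than max_gap starts a new cluster."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # too far: open a new cluster
    return clusters

def find_outliers(values, max_gap, min_cluster_size):
    """Values that fall in clusters smaller than min_cluster_size."""
    return [v
            for cluster in cluster_1d(values, max_gap)
            if len(cluster) < min_cluster_size
            for v in cluster]

readings = [10, 11, 12, 13, 50, 51, 52, 120]
print(cluster_1d(readings, max_gap=5))
print(find_outliers(readings, max_gap=5, min_cluster_size=2))
```

Here 120 forms a cluster of its own and is flagged, while the two dense groups are kept. Real datasets are usually multi-dimensional and would need a proper clustering algorithm, but the principle is the same.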
Data Preprocessing…
▪Data Integration: Merge the data present in multiple
sources into a single larger data store, such as a data
warehouse.
▪Data Transformation: Consolidate data into alternate
forms by changing the value, structure, or format of
the data.
▪Data Reduction: A dataset in a data warehouse can be
too large for data analysis and data mining algorithms
to handle; data reduction produces a smaller
representation that preserves the essential information.
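The three steps above can be sketched with one small function each. The slides do not prescribe specific techniques, so the choices here are assumptions made for illustration: an inner join on a shared key for integration, min-max scaling for transformation, and random sampling for reduction. All function names and sample records are hypothetical.

```python
import random

def integrate(source_a, source_b, key):
    """Integration: merge records from two sources that share a key."""
    index_b = {r[key]: r for r in source_b}
    return [{**r, **index_b[r[key]]} for r in source_a if r[key] in index_b]

def min_max_scale(values):
    """Transformation: rescale numeric values to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]   # all values equal: nothing to spread
    return [(v - low) / (high - low) for v in values]

def sample_rows(rows, n, seed=0):
    """Reduction: keep n rows drawn at random without replacement."""
    return random.Random(seed).sample(rows, n)

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Ben"},
             {"id": 3, "name": "Cy"}]
orders = [{"id": 1, "total": 10}, {"id": 2, "total": 20},
          {"id": 3, "total": 30}]

merged = integrate(customers, orders, "id")
scaled = min_max_scale([r["total"] for r in merged])
reduced = sample_rows(merged, 2)

print(merged)
print(scaled)
print(reduced)
```

Sampling is only one way to reduce data; dropping attributes or aggregating rows are alternatives that fit the same step.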
Data Preprocessing: Best practices
▪The first step in Data Preprocessing is to understand your
data.
▪Use statistical methods or pre-built libraries that help you
visualize the dataset and give a clear picture of how your
data looks in terms of class distribution.
▪Summarize your data in terms of the number of duplicates,
missing values, and outliers present in the data.
▪Drop the fields that are of no use for modeling or that
are closely related to other attributes (redundant).
▪Do some feature engineering and figure out which
attributes contribute most towards model training.
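The summary step can be sketched as follows. This is a minimal illustration, assuming rows are dicts with hashable values, missing values are `None`, and the class label lives in a column called `survived`; the function name `summarize` and the sample data are made up. Outlier counting is left out to keep the example short.

```python
from collections import Counter

def summarize(rows, label):
    """Report row count, duplicates, missing values and class distribution."""
    # Turn each row into a hashable, order-independent key.
    keys = [tuple(sorted(r.items(), key=lambda kv: kv[0])) for r in rows]
    duplicates = sum(count - 1 for count in Counter(keys).values())
    missing = Counter(k for r in rows for k, v in r.items() if v is None)
    classes = Counter(r[label] for r in rows)
    return {
        "rows": len(rows),
        "duplicates": duplicates,
        "missing": dict(missing),
        "class_distribution": dict(classes),
    }

rows = [
    {"age": 22, "fare": 7.25, "survived": 0},
    {"age": 22, "fare": 7.25, "survived": 0},     # exact duplicate
    {"age": None, "fare": 71.28, "survived": 1},  # missing age
    {"age": 35, "fare": None, "survived": 1},     # missing fare
    {"age": 54, "fare": 51.86, "survived": 0},
]

report = summarize(rows, "survived")
print(report)
```

A report like this makes it easier to decide which cleaning steps are worth applying before any model is trained.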
Feature selection and visualization
▪Garbage in, garbage out (GIGO)
▪ Whatever goes in, comes out.
▪ If we put garbage into our model, we can expect the
output to be garbage too.
▪ In this case, garbage refers to noise in our data.
▪Feature Selection is the method of reducing the number of
input variables to your model by using only relevant data
and getting rid of noise in the data.
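One simple way to select features can be sketched as follows. The slides do not name a selection method, so this example assumes a filter approach: keep the features whose absolute Pearson correlation with the target reaches a chosen threshold. The function names, the threshold of 0.5, and the sample data are all hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0   # a constant column carries no linear signal
    return cov / (spread_x * spread_y)

def select_features(columns, target, threshold):
    """Keep features whose |correlation| with the target is >= threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [40, 44, 38, 44, 40],
    "constant": [1, 1, 1, 1, 1],
}
target = [52, 58, 61, 70, 74]

print(select_features(columns, target, threshold=0.5))
```

A correlation filter only detects linear relationships, so a feature it rejects may still be useful to a non-linear model.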
Model selection and tuning