IML 2 - Data Preparation


INTRODUCTION TO MACHINE LEARNING

UNIT # 2



ACKNOWLEDGEMENT

 The slides and code in this lecture are primarily taken from
 Machine Learning with PyTorch and Scikit-Learn by Raschka et al.
 Discussion and figures on CRISP-ML(Q) are taken from
 https://ml-ops.org/content/crisp-ml



TODAY’S AGENDA

 Continuation of the previous lecture (EDA Part):


 Removing and imputing missing values from the dataset
 Data Types
 Data Encoding, Scaling and Normalization
 ML Roadmap and CRISP-ML(Q)



HANDLING MISSING VALUES

    A    B    C    D
    2    6    3    4
    5    6    NaN  8
    10   11   12   NaN

 One of the easiest ways to deal with missing data is simply to remove the corresponding features (columns) or training examples (rows) from the dataset entirely (a runnable version of these calls follows the list below).
 df.dropna(axis=0)
 (# drop rows that contain at least one NaN)
 df.dropna(axis=1)
 (# drop columns that have at least one NaN in any row)
 df.dropna(how='all')
 (# only drop rows where all columns are NaN)
 df.dropna(thresh=4)
 (# drop rows that have fewer than 4 real values)
 df.dropna(subset=['C'])
 (# only drop rows where NaN appear in specific columns (here: 'C'))
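A minimal runnable sketch of the calls above, assuming a small DataFrame built from the table on this slide:

```python
import numpy as np
import pandas as pd

# Recreate the small table shown on the slide
df = pd.DataFrame([[2.0, 6.0, 3.0, 4.0],
                   [5.0, 6.0, np.nan, 8.0],
                   [10.0, 11.0, 12.0, np.nan]],
                  columns=['A', 'B', 'C', 'D'])

print(df.dropna(axis=0))        # drops the two rows that contain a NaN
print(df.dropna(axis=1))        # drops columns C and D
print(df.dropna(how='all'))     # drops nothing: no row is entirely NaN
print(df.dropna(thresh=4))      # keeps only rows with at least 4 non-NaN values
print(df.dropna(subset=['C']))  # drops only the row where column 'C' is NaN
```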
IMPUTING MISSING VALUES

 Removal of training examples or dropping of entire feature columns may not be feasible as
we might lose too much valuable data.
 In this case, we can use different interpolation techniques to estimate the missing values
from the other training examples in our dataset.
 One of the most common interpolation techniques is mean imputation, where we simply
replace the missing value with the mean value of the entire feature column.
 A convenient way to achieve this is by using the SimpleImputer class from scikit-learn (a short sketch follows below).
 There is another popular class (KNNImputer) in scikit-learn that employs the k-Nearest Neighbors (kNN) method. We will discuss it after we have studied the kNN method.
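A minimal sketch of mean imputation with SimpleImputer, reusing the same small DataFrame as in the dropna example above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Replace every NaN with the mean of its feature column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imputer.fit_transform(df.values)
print(imputed_data)

# The pandas equivalent: df.fillna(df.mean())
```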



FEATURES/VARIABLES

 An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature.
 The term dimension is commonly used in data warehousing.
 Machine learning literature tends to use the term feature
 Statisticians prefer the term variable.
 Data mining and database professionals commonly use the term attribute.



DATA TYPES

 Nominal (Categorical) Attributes


 Hair_color, marital_status, customer_id (why is it categorical?)
 Binary Attributes
 is_smoker, medical_test_result (+ve/-ve)
 Ordinal Attributes
 Shirt_size, grades, professional_rank, customer_satisfaction
 Numerical Attributes
 age, temperature, salary, number_of_dependents



ENCODING: ONE-HOT, LABEL AND ORDINAL

 The idea behind one-hot encoding is to create a new dummy feature for each unique value in
the nominal feature column.
 Suppose we have three possible values under color feature: blue, green and red.
 We would convert the color feature into three new features: blue, green, and red.
 Binary values can then be used to indicate the particular color of an example; for example, a
blue example can be encoded as blue=1, green=0, red=0.
 A convenient way to create those dummy features via one-hot encoding is to use the
get_dummies method implemented in pandas.
 Scikit-learn also provides label and ordinal encoding (LabelEncoder and OrdinalEncoder); a short sketch of both approaches follows below.
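A minimal sketch of both approaches; the color values are from the slide, while the size column and its S < M < L ordering are illustrative assumptions:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'color': ['blue', 'green', 'red'],
                   'size': ['S', 'M', 'L']})

# One-hot encoding: one dummy column per unique value of the nominal feature
print(pd.get_dummies(df[['color']]))   # columns color_blue, color_green, color_red

# Ordinal encoding: map an ordered feature to integers using an explicit order
ordinal = OrdinalEncoder(categories=[['S', 'M', 'L']])
df['size_encoded'] = ordinal.fit_transform(df[['size']]).ravel()
print(df)
```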



SCALING

    Age   Salary
    28    100,000
    34    150,000
    26    140,000
    38    300,000

 Feature scaling is a crucial step in our preprocessing pipeline that can easily be
forgotten.
 Some ML algorithms (such as decision trees and random forests) are scale-invariant, so we do not need to worry about feature scaling for them.
 However, many other ML algorithms behave much better if features are on the
same scale.
 If we need to compute the similarity among records using Euclidean distance,
then Salary would play a more significant role than Age in the given table.
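A quick numeric check of that claim, using the first two rows of the table above:

```python
import numpy as np

a = np.array([28, 100_000])  # Age, Salary of the first record
b = np.array([34, 150_000])  # Age, Salary of the second record

# The Euclidean distance is dominated almost entirely by the Salary difference
print(np.linalg.norm(a - b))   # ~50000.0 (the Age difference of 6 barely matters)
```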



MIN-MAX (NORMALIZATION) AND Z-SCORE
(STANDARDIZATION)

 There are two common approaches to bringing different features onto the same scale: normalization and standardization.
 Normalization most often refers to min-max scaling: to normalize our data, we simply apply min-max scaling to each feature column, which rescales its values to the range [0, 1].
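A minimal min-max scaling sketch with scikit-learn, assuming the Age/Salary values from the table above as the feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[28, 100_000],
              [34, 150_000],
              [26, 140_000],
              [38, 300_000]], dtype=float)

# x_norm = (x - x_min) / (x_max - x_min), applied to each feature column
mms = MinMaxScaler()
X_norm = mms.fit_transform(X)
print(X_norm)  # every column now lies in the range [0, 1]
```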



STANDARDIZATION (CONT’D)

 Although normalization via min-max scaling is a commonly used technique that is useful
when we need values in a bounded interval, standardization can be more practical for
many machine learning algorithms, especially for optimization algorithms such as gradient
descent.
 Using standardization, we center the feature columns at mean 0 with standard deviation 1
so that the feature columns have the same parameters as a standard normal distribution
(zero mean and unit variance), which makes it easier to learn the weights.
 However, it must be emphasized that standardization does not change the shape of the
distribution, and it does not transform non-normally distributed data into normally
distributed data.
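The corresponding sketch for standardization, assuming the same feature matrix as in the min-max example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[28, 100_000],
              [34, 150_000],
              [26, 140_000],
              [38, 300_000]], dtype=float)

# z = (x - mean) / standard deviation, computed per feature column
stdsc = StandardScaler()
X_std = stdsc.fit_transform(X)
print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```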



NORMALIZATION VS STANDARDIZATION

 When using distance-based algorithms (like kNN or clustering algorithms), normalization is preferable.
 Normalization is more intuitive and offers better interpretability than standardization.
 Standardization is less sensitive to outliers and, hence, is preferable when
outliers are present.
 Standardization is required/preferable when working with Neural Networks
or PCA.



 AutoViz: understand patterns, trends, and relationships in the data by
creating insightful visualizations with minimal effort
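A minimal sketch of typical AutoViz usage; the file name is a placeholder, and the exact arguments may vary between AutoViz versions:

```python
from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
# Point AutoViz at a CSV file (a pandas DataFrame can be passed via the dfte argument instead)
report = AV.AutoViz('your_data.csv')
```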

