IML 2 - Data Preparation
UNIT # 2
The slides and code in this lecture are primarily taken from
Machine Learning with PyTorch and Scikit-Learn by Raschka et al.
The discussion and figures on CRISP-ML(Q) are taken from
https://ml-ops.org/content/crisp-ml
One of the easiest ways to deal with missing data is simply to remove the corresponding
features (columns) or training examples (rows) from the dataset entirely.
df.dropna(axis=0)        # drop rows that have at least one NaN
df.dropna(axis=1)        # drop columns that have at least one NaN in any row
df.dropna(how='all')     # only drop rows where all columns are NaN
df.dropna(thresh=4)      # drop rows that have fewer than 4 real values
df.dropna(subset=['C'])  # only drop rows where NaN appears in specific columns (here: 'C')
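A minimal sketch of these calls on a small illustrative DataFrame (the column names and values below are made up for demonstration):

import numpy as np
import pandas as pd

# illustrative DataFrame with missing entries (values are hypothetical)
df = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                   'B': [2.0, 6.0, 11.0],
                   'C': [3.0, np.nan, 12.0],
                   'D': [4.0, 8.0, np.nan]})

print(df.dropna(axis=0))        # keeps only the first (complete) row
print(df.dropna(axis=1))        # keeps only columns A and B
print(df.dropna(subset=['C']))  # drops only the row with NaN in column 'C'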
IMPUTING MISSING VALUES
Removal of training examples or dropping of entire feature columns may not be feasible as
we might lose too much valuable data.
In this case, we can use different interpolation techniques to estimate the missing values
from the other training examples in our dataset.
One of the most common interpolation techniques is mean imputation, where we simply
replace the missing value with the mean value of the entire feature column.
A convenient way to achieve this is by using the SimpleImputer class from scikit-learn.
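A minimal sketch of mean imputation with SimpleImputer (the array values are made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer

# replace each NaN with the mean of its feature column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

X_imputed = imputer.fit_transform(X)  # the NaN becomes (2.0 + 8.0) / 2 = 5.0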
There is another popular class (KNNImputer) in scikit-learn that employs the k-nearest neighbors (kNN) method. We will discuss it after we have studied the kNN method.
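For reference, a minimal sketch of the scikit-learn API (the details of the method are deferred until we cover kNN; the array values are made up for illustration):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# each missing value is filled using the k nearest rows (here k = 2),
# with similarity measured on the non-missing features
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)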
The idea behind one-hot encoding is to create a new dummy feature for each unique value in
the nominal feature column.
Suppose the color feature has three possible values: blue, green, and red.
We would convert the color feature into three new features: blue, green, and red.
Binary values can then be used to indicate the particular color of an example; for example, a
blue example can be encoded as blue=1, green=0, red=0.
A convenient way to create those dummy features via one-hot encoding is to use the
get_dummies method implemented in pandas.
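A minimal sketch using pandas (the DataFrame is made up for illustration):

import pandas as pd

df = pd.DataFrame({'color': ['blue', 'green', 'red'],
                   'price': [10.1, 13.5, 15.3]})

# creates one dummy column per unique color; non-categorical columns pass through
pd.get_dummies(df, columns=['color'])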
Scikit-learn also provides label encoding and ordinal encoding (via the LabelEncoder and OrdinalEncoder classes).
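A minimal sketch of both encoders (the category values are made up for illustration):

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder maps class labels to integers (intended for target labels)
le = LabelEncoder()
y = le.fit_transform(['blue', 'green', 'red', 'green'])  # -> [0, 1, 2, 1]

# OrdinalEncoder maps feature categories to integers, with an explicit order
oe = OrdinalEncoder(categories=[['S', 'M', 'L']])
X = oe.fit_transform([['M'], ['L'], ['S']])              # -> [[1.], [2.], [0.]]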
Feature scaling is a crucial step in our preprocessing pipeline that can easily be
forgotten.
Some ML algorithms (such as decision trees and random forests) are scale-invariant, so we don't need to worry about feature scaling for them.
However, many other ML algorithms behave much better if features are on the
same scale.
If we compute the similarity among records using Euclidean distance on a table with, say, Age and Salary columns, then Salary would dominate the distance simply because its values span a much larger range than Age, as the sketch below illustrates.
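A minimal numerical sketch of this effect (the Age and Salary values are made up for illustration):

import numpy as np

# two records: (Age, Salary); the numbers are hypothetical
a = np.array([25, 50_000])
b = np.array([55, 52_000])

# the Euclidean distance is dominated by the Salary difference (2000),
# even though the Age difference (30) is large relative to typical ages
dist = np.linalg.norm(a - b)  # ~2000.2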
Although normalization via min-max scaling is a commonly used technique that is useful
when we need values in a bounded interval, standardization can be more practical for
many machine learning algorithms, especially for optimization algorithms such as gradient
descent.
Using standardization, we center the feature columns at mean 0 with standard deviation 1
so that the feature columns have the same parameters as a standard normal distribution
(zero mean and unit variance), which makes it easier to learn the weights.
However, it must be emphasized that standardization does not change the shape of the
distribution, and it does not transform non-normally distributed data into normally
distributed data.
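A minimal sketch of both approaches using scikit-learn (the example array is made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25.0, 50_000.0],
              [35.0, 60_000.0],
              [55.0, 52_000.0]])

# min-max scaling: x_norm = (x - x_min) / (x_max - x_min), bounded to [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

# standardization: x_std = (x - mean) / std, giving zero mean and unit variance per column
X_std = StandardScaler().fit_transform(X)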