
Introduction to

Data Processing II

Prof. Denio Duarte


[email protected]
Introduction
● We need two more tasks to prepare our data
– Scaling the values
– Getting rid of missing values
Scaling
● The values stored in the dataset may lie in very different
ranges
○ Blood tests - glucose (around 100), urine pH (<10),
platelet count (>100,000 and <400,000)

Source: http://enroute.pl/skalowanie-danych/
Scaling
● Some ML algorithms suffer when dealing with features in
different ranges of values
○ [1,000, 10,000], [0, 1], [-50, 50] …
● Scaling transforms all values of the dataset into the
same range
○ Min and max values
○ Standardization
Scaling
● pandas’ DataFrame has a method to describe the values
of the dataset
○ We can check the value ranges of the attributes
■ data.describe()
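
For example, a quick look at the value ranges of the tips dataset (a minimal sketch, assuming the dataset is loaded through seaborn, which ships it as a toy dataset):

import seaborn as sns

tips = sns.load_dataset('tips')   # toy dataset with restaurant bills and tips
print(tips.describe())            # min/max/mean of every numeric attribute
# total_bill, tip and size live in quite different ranges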
Scaling
● Scaling data makes algorithms more efficient
○ Computationally cheaper training
○ Often better models
Scaling
● Sklearn:
from sklearn.preprocessing import MinMaxScaler
# scale values to the [0, 1] range; X is the feature matrix
mmax = MinMaxScaler().fit(X)
X_mm = mmax.transform(X)
# or, keeping a DataFrame: X_mm = pd.DataFrame(mmax.transform(X), columns=X.columns)
Tips Min Max scaled

[before/after view of the tips dataset with min-max scaling applied]

● Qualitative attributes should not be scaled
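
A hedged sketch of how the qualitative columns can be left out of the scaling (it assumes the tips dataset from seaborn; the column lists are that dataset's usual ones):

import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

tips = sns.load_dataset('tips')
num_cols = ['total_bill', 'tip', 'size']   # quantitative attributes only
tips_mm = tips.copy()
# scale only the quantitative columns; sex, smoker, day and time stay as they are
tips_mm[num_cols] = MinMaxScaler().fit_transform(tips[num_cols])
print(tips_mm.head())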
Standardization
● Sklearn:
from sklearn.preprocessing import StandardScaler
# features are centered to mean 0 and scaled to unit variance
scaler = StandardScaler()
scaler.fit(X)
X_stand = scaler.transform(X)
# or, keeping a DataFrame: X_stand = pd.DataFrame(scaler.transform(X), columns=X.columns)
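
A quick check of the result (a small sketch; X_stand is the array produced above):

import numpy as np

print(np.round(X_stand.mean(axis=0), 6))  # close to 0 for every column
print(np.round(X_stand.std(axis=0), 6))   # close to 1 for every column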
Standardization

[before/after view of the tips dataset with standardization applied]

● Qualitative attributes should not be standardized
Exercises
● Build two new datasets from the previous dataset (tips)
○ Minimum and maximum values (MinMaxScaler)
○ Standardize values (StandardScaler)
● Using random forests, create 2 models: regression and
classification (one possible setup is sketched after this list)
○ from sklearn.ensemble import RandomForestClassifier as rfc, RandomForestRegressor as rfr
○ myrfc=rfc()
○ myrfr=rfr()
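
One possible way to set the exercise up (a sketch, not an official solution; it assumes tips comes from seaborn, and that 'tip' is used as the regression target and 'smoker' as the classification target):

import seaborn as sns
from sklearn.ensemble import RandomForestClassifier as rfc, RandomForestRegressor as rfr
from sklearn.preprocessing import MinMaxScaler, StandardScaler

tips = sns.load_dataset('tips')
num_cols = ['total_bill', 'tip', 'size']

# the two new datasets asked for in the exercise
tips_mm = tips.copy()
tips_mm[num_cols] = MinMaxScaler().fit_transform(tips[num_cols])
tips_std = tips.copy()
tips_std[num_cols] = StandardScaler().fit_transform(tips[num_cols])

# regression: predict the tip; classification: predict smoker (yes/no)
myrfr = rfr().fit(tips_mm[['total_bill', 'size']], tips['tip'])
myrfc = rfc().fit(tips_std[['total_bill', 'size']], tips['smoker'])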
Time to hunt missing values
● What if some features have no valid values (missing values)?
○ Delete the entire example (tuple/row)
■ The set of deleted examples must be small
○ Replace the missing value
■ Mean
■ Median
■ Most frequent
■ Constant
Missing values
● sklearn implements SimpleImputer in the impute package

import numpy as np
from sklearn.impute import SimpleImputer

simp = SimpleImputer(strategy='mean')  # the mean becomes the new value
# SimpleImputer expects a 2D array, hence the reshape
tbarray = np.array(X['total_bill']).reshape(-1, 1)
simp.fit(tbarray)  # training
X['total_bill'] = simp.transform(tbarray)  # replace the missing values
# available strategies: mean, median, most_frequent, constant (with fill_value)
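
The constant strategy works the same way; a small sketch (the fill value 0 is just an arbitrary example):

from sklearn.impute import SimpleImputer

# replace every missing value with a fixed constant (0 here, chosen arbitrarily)
simp_const = SimpleImputer(strategy='constant', fill_value=0)
X[['total_bill']] = simp_const.fit_transform(X[['total_bill']])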
Missing Values
● Let’s study missing-value approaches using the horse-colic.csv dataset

import pandas as pd
from sklearn.impute import SimpleImputer

hdf = pd.read_csv('horse-colic.csv', header=None, na_values='?')
hdf.describe()
# total number of null values by column
hdf.isnull().sum()
hdf.iloc[row, :].isnull().sum()  # number of null values in row `row`
hdf.iloc[:, col].isnull().sum()  # number of null values in column `col`
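
A useful complement is the share of missing values per column (a short sketch using the hdf DataFrame loaded above):

# percentage of missing values per column, largest first
missing_pct = hdf.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False).head(10))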
Missing Values
● Consider the following
– When a missing value must be deleted
– When a missing value must be imputed
● In both cases, the dataset must be carefully analyzed
● The imputation must respect the attributes’ types
– Qualitative (discrete values – classes)
– Quantitative (continuous values)
Missing Values
● Deleting (dropping)
– Row

# drop rows with more than `threshold` missing attributes (propose a threshold)

threshold = 5  # example value; adjust it to your dataset
idx_del = []
for i in range(hdf.shape[0]):  # visit all hdf's rows
    if hdf.iloc[i, :].isnull().sum() > threshold:
        idx_del.append(i)
new_hdf = hdf.drop(idx_del, axis=0)
# or, in place: hdf.drop(idx_del, axis=0, inplace=True)
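
pandas can achieve the same effect directly; a sketch using DataFrame.dropna, whose thresh parameter keeps only the rows with at least that many non-missing values:

# keep rows having at least (number of columns - threshold) valid values
min_valid = hdf.shape[1] - threshold
new_hdf = hdf.dropna(axis=0, thresh=min_valid)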
Missing Values
● Deleting (dropping)
– Columns

# drop columns with more than `threshold` missing values (propose a threshold)

threshold = 50  # example value; adjust it to your dataset
col_del = []
for i in range(hdf.shape[1]):  # visit all hdf's columns
    if hdf.iloc[:, i].isnull().sum() > threshold:
        col_del.append(i)
new_hdf = hdf.drop(columns=col_del)
# or: new_hdf = hdf.drop(col_del, axis=1)
# or, in place: hdf.drop(columns=col_del, inplace=True)
Missing Values
● Imputing
– Check which attributes are qualitative or quantitative

Approach: count the number of unique values of each attribute

threshold = 10  # example cut-off: more unique values than this => quantitative
att_quali = []
att_quant = []
for c in hdf.columns:
    nunique = hdf[c].unique()
    if len(nunique) > threshold:
        att_quant.append(c)
    else:
        att_quali.append(c)
Missing Values
● Imputing

# for qualitative attributes it is better to impute a discrete value
# (here, the most frequent one)

from sklearn.impute import SimpleImputer

imp_quali = SimpleImputer(strategy='most_frequent')
imp_quali.fit(hdf[att_quali])
hdf_quali = imp_quali.transform(hdf[att_quali])
# hdf_quali is a numpy array
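
If a DataFrame is more convenient than the numpy array, it can be rebuilt with the original column labels (a small sketch):

import pandas as pd

# back to a DataFrame, keeping the original column labels and row index
hdf_quali_df = pd.DataFrame(hdf_quali, columns=att_quali, index=hdf.index)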
Missing Values
● Imputing quantitative values
– SimpleImputer is limited to mean, median ...
Advanced Missing Values Substitution
● sklearn offers a set of packages to perform advanced
missing-value substitution
from sklearn.experimental import enable_iterative_imputer
# now we can use IterativeImputer
from sklearn.impute import IterativeImputer
# better than using only mean, median, ...
from sklearn.linear_model import BayesianRidge, LinearRegression, \
    HuberRegressor, RidgeCV, ARDRegression, \
    PassiveAggressiveRegressor, TheilSenRegressor
# all of these algorithms can be used to estimate values for the missing ones
Advanced Missing Values Substitution
# useful for quantitative features
from sklearn.experimental import enable_iterative_imputer
# now we can use IterativeImputer
from sklearn.impute import IterativeImputer
# better than using only mean, median, ...
from sklearn.linear_model import BayesianRidge, LinearRegression, \
    HuberRegressor, RidgeCV, ARDRegression, \
    PassiveAggressiveRegressor, TheilSenRegressor
# all of these algorithms can be used to estimate values for the missing ones

it_imp = IterativeImputer(estimator=RidgeCV(), max_iter=100, random_state=0)
it_imp.fit(hdf[att_quant])
hdf_quant = it_imp.transform(hdf[att_quant])
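
A quick sanity check after the imputation (a sketch; hdf_quant is the numpy array produced above):

import numpy as np

# should print 0: no missing values are left after the iterative imputation
print(np.isnan(hdf_quant).sum())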
Missing Values
● Imputing - wrap up

# Building a new dataset without missing values

import numpy as np

# append the qualitative columns after the quantitative ones
hdf_womv = np.insert(hdf_quant, [hdf_quant.shape[1]], hdf_quali, axis=1)
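
An equivalent wrap-up that keeps the original column labels, using pandas instead of np.insert (a sketch; the column order is quantitative first, then qualitative, as above):

import pandas as pd

hdf_womv_df = pd.concat(
    [pd.DataFrame(hdf_quant, columns=att_quant, index=hdf.index),
     pd.DataFrame(hdf_quali, columns=att_quali, index=hdf.index)],
    axis=1)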
Exercises
● Using the post-processed dataset
– Build a model using the random forest algorithm
– Split the dataset into 2 subsets: training and test
– Build the model using the training set
– Evaluate the model using the test set
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.model_selection import train_test_split as tts
# train_test_split returns X_train, X_test, y_train, y_test (in that order);
# y is the target column of the horse-colic dataset
X_tr, X_te, y_tr, y_te = tts(hdf_womv, y, test_size=0.3, random_state=42)
myrf = rfc()  # create an object
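
One way to finish the exercise (a sketch; it assumes y holds the target labels of the horse-colic dataset and hdf_womv the imputed features):

from sklearn.metrics import accuracy_score

myrf.fit(X_tr, y_tr)                  # build the model on the training set
y_pred = myrf.predict(X_te)           # predict on the test set
print(accuracy_score(y_te, y_pred))   # evaluate the model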
