
Introduction to

Data Processing II

Prof. Denio Duarte


[email protected]
Introduction
● We need two more tasks to prepare our data
– Scaling the values
– Getting rid of missing values
Scaling
● The values stored in the dataset may lie in very different
ranges
○ Blood tests - glucose (around 100), urine pH (<10),
platelet count (>100,000 and <400,000)

Source: http://enroute.pl/skalowanie-danych/
Scaling
● Some ML algorithms suffer when dealing with features in
different ranges of values
○ [1,000, 10,000], [0, 1], [-50, 50] …
● Scaling transforms all values of the dataset into the
same range
○ Min and max values
○ Standardization
Scaling
● pandas’ DataFrame has a method to describe the values
of the dataset
○ We can check the value ranges of the attributes
■ data.describe()
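
For example, a quick look at the value ranges of the tips dataset (a minimal sketch, assuming the dataset is loaded through seaborn, which ships it as a toy dataset):

import seaborn as sns

tips = sns.load_dataset('tips')   # toy dataset with restaurant bills and tips
print(tips.describe())            # min/max/mean of every numeric attribute
# total_bill, tip and size live in quite different ranges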
Scaling
● Scaling data makes algorithms more efficient
○ Computationally cheaper training
○ Often better models
Scaling
● Sklearn:
from sklearn.preprocessing import MinMaxScaler
# scale values to the [0, 1] range; X is the feature matrix
mmax = MinMaxScaler().fit(X)
X_mm = mmax.transform(X)
# or, keeping a DataFrame: X_mm = pd.DataFrame(mmax.transform(X), columns=X.columns)
Tips Min Max scaled

[before/after view of the tips dataset with min-max scaling applied]

● Qualitative attributes should not be scaled
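
A hedged sketch of how the qualitative columns can be left out of the scaling (it assumes the tips dataset from seaborn; the column lists are that dataset's usual ones):

import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

tips = sns.load_dataset('tips')
num_cols = ['total_bill', 'tip', 'size']   # quantitative attributes only
tips_mm = tips.copy()
# scale only the quantitative columns; sex, smoker, day and time stay as they are
tips_mm[num_cols] = MinMaxScaler().fit_transform(tips[num_cols])
print(tips_mm.head())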
Standardization
● Sklearn:
from sklearn.preprocessing import StandardScaler
# features are centered to mean 0 and scaled to unit variance
scaler = StandardScaler()
scaler.fit(X)
X_stand = scaler.transform(X)
# or, keeping a DataFrame: X_stand = pd.DataFrame(scaler.transform(X), columns=X.columns)
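
A quick check of the result (a small sketch; X_stand is the array produced above):

import numpy as np

print(np.round(X_stand.mean(axis=0), 6))  # close to 0 for every column
print(np.round(X_stand.std(axis=0), 6))   # close to 1 for every column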
Standardization

[before/after view of the tips dataset with standardization applied]

● Qualitative attributes should not be standardized
Exercises
● Build two new datasets from the previous dataset (tips)
○ Minimum and maximum values (MinMaxScaler)
○ Standardize values (StandardScaler)
● Using random forests, create 2 models: regression and
classification (one possible setup is sketched after this list)
○ from sklearn.ensemble import RandomForestClassifier as rfc, RandomForestRegressor as rfr
○ myrfc=rfc()
○ myrfr=rfr()
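
One possible way to set the exercise up (a sketch, not an official solution; it assumes tips comes from seaborn, and that 'tip' is used as the regression target and 'smoker' as the classification target):

import seaborn as sns
from sklearn.ensemble import RandomForestClassifier as rfc, RandomForestRegressor as rfr
from sklearn.preprocessing import MinMaxScaler, StandardScaler

tips = sns.load_dataset('tips')
num_cols = ['total_bill', 'tip', 'size']

# the two new datasets asked for in the exercise
tips_mm = tips.copy()
tips_mm[num_cols] = MinMaxScaler().fit_transform(tips[num_cols])
tips_std = tips.copy()
tips_std[num_cols] = StandardScaler().fit_transform(tips[num_cols])

# regression: predict the tip; classification: predict smoker (yes/no)
myrfr = rfr().fit(tips_mm[['total_bill', 'size']], tips['tip'])
myrfc = rfc().fit(tips_std[['total_bill', 'size']], tips['smoker'])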
Time to hunt missing values
● What if some features have no valid values (missing values)?
○ Delete the entire example (tuple/row)
■ The set of deleted examples must be small
○ Replace the missing value
■ Mean
■ Median
■ Most frequent
■ Constant
Missing values
● sklearn implements SimpleImputer in the impute package

import numpy as np
from sklearn.impute import SimpleImputer

simp = SimpleImputer(strategy='mean')  # the mean becomes the new value
# SimpleImputer expects a 2D array, hence the reshape
tbarray = np.array(X['total_bill']).reshape(-1, 1)
simp.fit(tbarray)  # training
X['total_bill'] = simp.transform(tbarray)  # replace the missing values
# available strategies: mean, median, most_frequent, constant (with fill_value)
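
The constant strategy works the same way; a small sketch (the fill value 0 is just an arbitrary example):

from sklearn.impute import SimpleImputer

# replace every missing value with a fixed constant (0 here, chosen arbitrarily)
simp_const = SimpleImputer(strategy='constant', fill_value=0)
X[['total_bill']] = simp_const.fit_transform(X[['total_bill']])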
Missing Values
● Let’s study missing-value approaches using the horse-colic.csv dataset

import pandas as pd
from sklearn.impute import SimpleImputer

hdf = pd.read_csv('horse-colic.csv', header=None, na_values='?')
hdf.describe()
# total number of null values by column
hdf.isnull().sum()
hdf.iloc[row, :].isnull().sum()  # number of null values in row `row`
hdf.iloc[:, col].isnull().sum()  # number of null values in column `col`
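
A useful complement is the share of missing values per column (a short sketch using the hdf DataFrame loaded above):

# percentage of missing values per column, largest first
missing_pct = hdf.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False).head(10))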
Missing Values
● Consider the following
– When a missing value must be deleted
– When a missing value must be imputed
● In both cases, the dataset must be carefully analyzed
● The imputation must respect the attributes’ types
– Qualitative (discrete values – classes)
– Quantitative (continuous values)
Missing Values
● Deleting (dropping)
– Row

# drop rows with more than `threshold` missing attributes (propose a threshold)

threshold = 5  # example value; adjust it to your dataset
idx_del = []
for i in range(hdf.shape[0]):  # visit all hdf's rows
    if hdf.iloc[i, :].isnull().sum() > threshold:
        idx_del.append(i)
new_hdf = hdf.drop(idx_del, axis=0)
# or, in place: hdf.drop(idx_del, axis=0, inplace=True)
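
pandas can achieve the same effect directly; a sketch using DataFrame.dropna, whose thresh parameter keeps only the rows with at least that many non-missing values:

# keep rows having at least (number of columns - threshold) valid values
min_valid = hdf.shape[1] - threshold
new_hdf = hdf.dropna(axis=0, thresh=min_valid)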
Missing Values
● Deleting (dropping)
– Columns

# drop columns with more than `threshold` missing values (propose a threshold)

threshold = 50  # example value; adjust it to your dataset
col_del = []
for i in range(hdf.shape[1]):  # visit all hdf's columns
    if hdf.iloc[:, i].isnull().sum() > threshold:
        col_del.append(i)
new_hdf = hdf.drop(columns=col_del)
# or: new_hdf = hdf.drop(col_del, axis=1)
# or, in place: hdf.drop(columns=col_del, inplace=True)
Missing Values
● Imputing
– Check which attributes are qualitative or quantitative

Approach: count the number of unique values of each attribute

threshold = 10  # example cut-off: more unique values than this => quantitative
att_quali = []
att_quant = []
for c in hdf.columns:
    nunique = hdf[c].unique()
    if len(nunique) > threshold:
        att_quant.append(c)
    else:
        att_quali.append(c)
Missing Values
● Imputing

# for qualitative attributes it is better to impute a discrete value
# (here, the most frequent one)

from sklearn.impute import SimpleImputer

imp_quali = SimpleImputer(strategy='most_frequent')
imp_quali.fit(hdf[att_quali])
hdf_quali = imp_quali.transform(hdf[att_quali])
# hdf_quali is a numpy array
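
If a DataFrame is more convenient than the numpy array, it can be rebuilt with the original column labels (a small sketch):

import pandas as pd

# back to a DataFrame, keeping the original column labels and row index
hdf_quali_df = pd.DataFrame(hdf_quali, columns=att_quali, index=hdf.index)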
Missing Values
● Imputing quantitative values
– SimpleImputer is limited to mean, median ...
Advanced Missing Values Substitution
● sklearn offers a set of packages to perform advanced
missing-value substitution
from sklearn.experimental import enable_iterative_imputer
# now we can use IterativeImputer
from sklearn.impute import IterativeImputer
# better than using only mean, median, ...
from sklearn.linear_model import BayesianRidge, LinearRegression, \
    HuberRegressor, RidgeCV, ARDRegression, \
    PassiveAggressiveRegressor, TheilSenRegressor
# all of these algorithms can be used to estimate values for the missing ones
Advanced Missing Values Substitution
# useful for quantitative features
from sklearn.experimental import enable_iterative_imputer
# now we can use IterativeImputer
from sklearn.impute import IterativeImputer
# better than using only mean, median, ...
from sklearn.linear_model import BayesianRidge, LinearRegression, \
    HuberRegressor, RidgeCV, ARDRegression, \
    PassiveAggressiveRegressor, TheilSenRegressor
# all of these algorithms can be used to estimate values for the missing ones

it_imp = IterativeImputer(estimator=RidgeCV(), max_iter=100, random_state=0)
it_imp.fit(hdf[att_quant])
hdf_quant = it_imp.transform(hdf[att_quant])
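
A quick sanity check after the imputation (a sketch; hdf_quant is the numpy array produced above):

import numpy as np

# should print 0: no missing values are left after the iterative imputation
print(np.isnan(hdf_quant).sum())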
Missing Values
● Imputing - wrap up

# Building a new dataset without missing values

import numpy as np

# append the qualitative columns after the quantitative ones
hdf_womv = np.insert(hdf_quant, [hdf_quant.shape[1]], hdf_quali, axis=1)
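
An equivalent wrap-up that keeps the original column labels, using pandas instead of np.insert (a sketch; the column order is quantitative first, then qualitative, as above):

import pandas as pd

hdf_womv_df = pd.concat(
    [pd.DataFrame(hdf_quant, columns=att_quant, index=hdf.index),
     pd.DataFrame(hdf_quali, columns=att_quali, index=hdf.index)],
    axis=1)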
Exercises
● Using the post-processed dataset
– Build a model using the random forest algorithm
– Split the dataset into 2 subsets: training and test
– Build the model using the training set
– Evaluate the model using the test set
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.model_selection import train_test_split as tts
# train_test_split returns X_train, X_test, y_train, y_test (in that order);
# y is the target column of the horse-colic dataset
X_tr, X_te, y_tr, y_te = tts(hdf_womv, y, test_size=0.3, random_state=42)
myrf = rfc()  # create an object
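
One way to finish the exercise (a sketch; it assumes y holds the target labels of the horse-colic dataset and hdf_womv the imputed features):

from sklearn.metrics import accuracy_score

myrf.fit(X_tr, y_tr)                  # build the model on the training set
y_pred = myrf.predict(X_te)           # predict on the test set
print(accuracy_score(y_te, y_pred))   # evaluate the model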
