Slides on DataII
Data Processing II
Source: http://enroute.pl/skalowanie-danych/
Scaling
● Some ML algorithms suffer when attributes have very different
ranges of values
○ [1,000, 10,000], [0, 1], [-50, 50], …
● Scaling transforms all values of the dataset into a common
range (see the sketch after this slide)
○ Min-max scaling
○ Standardization
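A minimal sketch of the two approaches on a hypothetical 1-D feature (NumPy only; the array values are illustrative, not from the slides):
import numpy as np

x = np.array([1000.0, 2500.0, 4000.0, 10000.0])   # illustrative feature values

# Min-max scaling: maps every value into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1
x_standard = (x - x.mean()) / x.std()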
Scaling
● pandas’ DataFrame has a method to describe the values
of the dataset
○ We can check the range of values of the attributes (example below)
■ data.describe()
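A short sketch of this check, assuming the tips dataset is loaded via seaborn (an assumption; the original slides may load it from a CSV instead):
import seaborn as sns

# Assumption: the tips dataset is loaded via seaborn
data = sns.load_dataset('tips')

# count, mean, std, min, quartiles and max for every numeric attribute:
# the ranges of total_bill, tip and size are clearly different
data.describe()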
Scaling
● Scaling data helps algorithms perform better
○ Computationally more efficient
○ Better models
Scaling
● Sklearn:
from sklearn.preprocessing import MinMaxScaler
# scaling values between 0 and 1
mmax = MinMaxScaler().fit(X)
X_mm = mmax.transform(X)
# or X_mm = pd.DataFrame(mmax.transform(X))
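When the data is split into training and test sets, the scaler should be fitted on the training data only and then applied to both sets; a minimal sketch, assuming hypothetical X_train and X_test splits:
from sklearn.preprocessing import MinMaxScaler

# Assumption: X_train and X_test are hypothetical, already-split feature sets
mmax = MinMaxScaler().fit(X_train)    # learn min/max from the training data only
X_train_mm = mmax.transform(X_train)
X_test_mm = mmax.transform(X_test)    # reuse the same min/max on the test data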
Tips Min Max scaled
[figure: distributions of the tips attributes before vs. after min-max scaling]
Tips Min Max scaled
[figure: before vs. after min-max scaling]
Note: qualitative attributes should not be scaled
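A minimal sketch of scaling only the quantitative columns and leaving the qualitative ones untouched, assuming the tips DataFrame `data` loaded earlier (the select_dtypes selection is illustrative):
from sklearn.preprocessing import MinMaxScaler

# Assumption: `data` is the tips DataFrame loaded earlier
num_cols = data.select_dtypes(include='number').columns   # quantitative attributes

data_mm = data.copy()
data_mm[num_cols] = MinMaxScaler().fit_transform(data[num_cols])
# qualitative columns (sex, smoker, day, time) are left untouched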
Standardization
● Sklearn:
from sklearn.preprocessing import StandardScaler
# after standardization, feature values have mean close to 0 and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_stand = scaler.transform(X)
# or X_stand = pd.DataFrame(scaler.transform(X))
Standardization
[figure: distributions before vs. after standardization]
Standardization
[figure: before vs. after standardization]
Note: qualitative attributes should not be standardized
Exercises
● Build two new datasets from the previous dataset (tips)
○ Min-max scaled values (MinMaxScaler)
○ Standardized values (StandardScaler)
● Using random forests, create two models: regression and
classification (a starting-point sketch follows this slide)
○ from sklearn.ensemble import RandomForestClassifier as rfc,
RandomForestRegressor as rfr
○ myrfc = rfc()
○ myrfr = rfr()
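A starting-point sketch for the exercise, assuming tips is loaded from seaborn and that `tip` is the regression target and `smoker` the classification target (these target choices are an assumption, not fixed by the slide):
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.ensemble import RandomForestClassifier as rfc, RandomForestRegressor as rfr

# Assumptions: tips loaded from seaborn; `tip` as regression target,
# `smoker` as classification target (not specified on the slide)
tips = sns.load_dataset('tips')
X = tips[['total_bill', 'size']]            # quantitative attributes only
y_reg = tips['tip']
y_clf = tips['smoker']

X_mm = MinMaxScaler().fit_transform(X)      # min-max scaled dataset
X_st = StandardScaler().fit_transform(X)    # standardized dataset

myrfr = rfr().fit(X_mm, y_reg)              # regression model
myrfc = rfc().fit(X_st, y_clf)              # classification model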
Time to hunt missing values
● What if some features have no valid values (missing values)?
○ Delete the entire example (tuple/row)
■ The set of deleted examples must be small
○ Replace (impute) the missing value (see the sketch after this slide)
■ Mean
■ Median
■ Most frequent
■ Constant
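A minimal pandas sketch of these replacement strategies, on a hypothetical DataFrame with a numeric column `a` and a qualitative column `b` (names and values are illustrative):
import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing values
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                   'b': ['x', 'y', None, 'y']})

df['a'] = df['a'].fillna(df['a'].mean())       # mean
# df['a'] = df['a'].fillna(df['a'].median())   # median
df['b'] = df['b'].fillna(df['b'].mode()[0])    # most frequent value
# df['b'] = df['b'].fillna('unknown')          # constant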
Missing values
● sklearn implements SimpleImputer in the impute package
import pandas as pd
from sklearn.impute import SimpleImputer

hdf = pd.read_csv('horse-colic.csv', header=None, na_values='?')
hdf.describe()

# total number of null values by column
hdf.isnull().sum()
hdf.iloc[row, :].isnull().sum()   # number of null values in row `row`
hdf.iloc[:, col].isnull().sum()   # number of null values in column `col`
Missing Values
● Consider the following
– When a missing value must be deleted
– When a missing value must be imputed
● In both cases, the dataset must be carefully analyzed
● The imputation strategy must match the attribute’s type
– Qualitative (discrete values – classes)
– Quantitative (continuous values)
Missing Values
● Deleting (dropping)
– Row
● # of missing attributes in a given row (propose a threshold)
idx_del = []
for i in range(hdf.shape[0]):   # visit all rows of hdf
    if hdf.iloc[i, :].isnull().sum() > threshold:
        idx_del.append(i)
new_hdf = hdf.drop(idx_del, axis=0)
# or
hdf.drop(idx_del, axis=0, inplace=True)
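Pandas can do the same in one call; a minimal equivalent sketch, where thresh is the minimum number of non-null values a row must have to be kept:
# keep only the rows with at least hdf.shape[1] - threshold non-null values
new_hdf = hdf.dropna(axis=0, thresh=hdf.shape[1] - threshold)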
Missing Values
● Deleting (dropping)
– Columns
● # of missing values in a given column (propose a threshold)
col_del = []
for i in range(hdf.shape[1]):   # visit all columns of hdf
    if hdf.iloc[:, i].isnull().sum() > threshold:
        col_del.append(i)
new_hdf = hdf.drop(columns=col_del)   # or hdf.drop(columns=col_del, inplace=True)
# or
new_hdf = hdf.drop(col_del, axis=1)   # or hdf.drop(col_del, axis=1, inplace=True)
Missing Values
● Imputing
– Check which attributes are qualitative or quantitative
● Approach: counting # of unique values for each attribute
att_quali = []
att_quant = []
for c in hdf.columns:
    n_unique = len(hdf[c].unique())   # number of distinct values in column c
    if n_unique > threshold:
        att_quant.append(c)           # many distinct values -> quantitative
    else:
        att_quali.append(c)           # few distinct values -> qualitative
Missing Values
● Imputing
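The imputation code itself is not shown on this slide; a hedged sketch of how SimpleImputer might be applied with the att_quali / att_quant lists from the previous slide (using the mean for quantitative attributes and the most frequent value for qualitative ones is an assumption):
from sklearn.impute import SimpleImputer

# Assumption: mean for quantitative attributes, most frequent value for
# qualitative ones (the strategies are not fixed by the slides)
imp_quant = SimpleImputer(strategy='mean')
imp_quali = SimpleImputer(strategy='most_frequent')

hdf[att_quant] = imp_quant.fit_transform(hdf[att_quant])
hdf[att_quali] = imp_quali.fit_transform(hdf[att_quali])

hdf.isnull().sum()   # no missing values should remain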