
Intermediate Machine Learning

The document discusses the importance of handling missing values in machine learning datasets, outlining three approaches: dropping columns, imputation, and imputation with an extension. It provides example code for each method and emphasizes the need to evaluate the effectiveness of these approaches using Mean Absolute Error (MAE). Additionally, it covers methods for handling categorical variables, including dropping, label encoding, and one-hot encoding, with corresponding example code and evaluation metrics.

Uploaded by

bikid25585
Copyright
© © All Rights Reserved

Intermediate Machine Learning

Step 2: Missing Values

1. Introduction:

• Importance of Handling Missing Values:
  o Many datasets have missing values, which can cause issues with machine learning models.
  o Ignoring missing values can lead to errors or biases in predictions.

2. Three Approaches to Handling Missing Values:

• Approach 1: Drop Columns with Missing Values
• Approach 2: Imputation
• Approach 3: Imputation with an Extension (Add a Missing Indicator)

3. Investigating Missing Values:

• Check for Missing Values: Use pandas functions to identify missing values in the dataset.
• Example Code:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
data = pd.read_csv('train.csv')

# Select target and features
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1)

# Keep numeric columns only: the median imputation and random forest
# used in the following steps cannot handle text columns
X = X.select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

# Shape of training data (num_rows, num_columns)
print(X_train.shape)
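
The pandas check mentioned in this step can be sketched on a small made-up frame. The column names and values below are illustrative assumptions, not taken from train.csv; the same two calls (isnull and sum) would be applied to X_train.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train (names and values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Number of missing entries per column, keeping only columns that have any
missing_count = toy.isnull().sum()
missing_count = missing_count[missing_count > 0]
print(missing_count)

# Total number of missing cells in the whole frame
total_missing = int(toy.isnull().sum().sum())
print(total_missing)
```

Looking at these counts first helps decide whether a column is worth keeping before choosing one of the three approaches.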

4. Approach 1: Drop Columns with Missing Values:

• When to Use:
  o When a column has many missing values.
  o When the column is not critical for analysis.
• Example Code:

# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

# Check the shape of reduced data
print(reduced_X_train.shape)
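
The code above drops every column that has at least one missing value. A gentler variant, sketched here on made-up data, drops only columns where most values are missing. The 0.5 cutoff and the column names are assumptions chosen for illustration, not a recommendation from the notes.

```python
import numpy as np
import pandas as pd

# Toy data: one column is mostly empty, one has a single gap, one is complete
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "few_missing": [1.0, 2.0, np.nan, 4.0],
    "complete": [10, 20, 30, 40],
})

# Fraction of missing entries in each column
missing_fraction = toy.isnull().mean()

# Hypothetical cutoff: drop columns where more than half the values are missing
threshold = 0.5
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(reduced.columns.tolist())
```

Columns that survive the cutoff but still contain gaps would then need imputation, as described in the next approach.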

5. Approach 2: Imputation:

• Definition:
  o Imputation is the process of filling in missing values with substituted values.
• Common Strategies:
  o Mean Imputation: Replace missing values with the mean of the column.
  o Median Imputation: Replace missing values with the median of the column.
  o Most Frequent Imputation: Replace missing values with the most frequent value in the column.
• Example Code:

from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer(strategy='median')

# Imputation on training and validation data
# (fit on the training data only, then reuse the fitted values for validation)
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
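
To see how the three strategies listed above differ, the following sketch fills the same gap with each of them. The single-column toy data is made up; SimpleImputer and its strategy values are the scikit-learn ones used in the notes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# One column with a skewed distribution and one missing value (last row)
toy = pd.DataFrame({"x": [1.0, 2.0, 2.0, 10.0, np.nan]})

filled = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform returns a NumPy array; read back the value that replaced NaN
    filled[strategy] = float(imputer.fit_transform(toy)[-1, 0])

print(filled)
# {'mean': 3.75, 'median': 2.0, 'most_frequent': 2.0}
```

The outlier (10.0) pulls the mean upward while the median and the most frequent value stay at 2.0, which is one reason the choice of strategy can change model performance.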

6. Approach 3: Imputation with an Extension (Add a Missing Indicator):

• Extension of Imputation:
  o Combine imputation with an additional indicator column that shows where the missing values were.
• Why Use It:
  o It allows the model to account for the fact that certain values were missing, which might be informative.
• Example Code:

from sklearn.impute import SimpleImputer

# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer(strategy='median')
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
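
As an alternative to building the _was_missing columns by hand, SimpleImputer has an add_indicator option that appends indicator columns itself. This is a minimal sketch on made-up data, offered as a different route to the same idea rather than the method used in these notes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data: column "a" has one gap, column "b" is complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# add_indicator=True appends one 0/1 column for each feature
# that had missing values when the imputer was fitted
imputer = SimpleImputer(strategy="median", add_indicator=True)
result = imputer.fit_transform(toy)

print(result)
# Columns: imputed "a", "b", indicator for "a"
```

The output is a NumPy array without column names, so the names would still need to be restored if a DataFrame is required.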

7. Scoring the Approaches:

• Scoring Approach: Use Mean Absolute Error (MAE) to compare the different approaches.
• Example Code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function to compare MAE with different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Score for Approach 1 (Drop Columns with Missing Values)
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print("MAE (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

# Score for Approach 2 (Imputation)
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

# Score for Approach 3 (Imputation with Extension)
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
print("MAE (Imputation with Extension):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus,
                    y_train, y_valid))
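
MAE is the average of the absolute differences between predictions and true values, so a lower MAE means predictions are closer on average. The sketch below computes it by hand on three made-up prices and checks the result against scikit-learn's mean_absolute_error; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true sale prices and predictions
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# MAE by hand: mean of |true - predicted|
mae_manual = np.mean(np.abs(y_true - y_pred))

# Same quantity from scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Because MAE is in the same unit as the target, the scores printed for the three approaches can be read directly as an average error in sale price.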

8. Conclusion:

• Key Takeaways:
  o Approach 1 (Drop Columns with Missing Values): Simple but may lose important information.
  o Approach 2 (Imputation): Retains data, but the choice of imputation strategy can affect model performance.
  o Approach 3 (Imputation with Extension): Combines the benefits of imputation with added indicators for missing values, which can provide additional information to the model.
• Final Thoughts: Handling missing values effectively is crucial for building accurate and robust machine learning models. Choose the appropriate method based on the nature of your data and the specific requirements of your analysis.

Exercise (full code)

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load data
data = pd.read_csv('train.csv')

# Select target and features
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1)

# Keep numeric columns only: median imputation and the random forest
# below cannot handle text columns
X = X.select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Define function to measure quality of each approach
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Approach 1: Drop columns with missing values

# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

# Score dataset
print("MAE (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

# Approach 2: Imputation
my_imputer = SimpleImputer(strategy='median')

# Imputation on training and validation data
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

# Score dataset
print("MAE (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

# Approach 3: Imputation with an Extension

# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer(strategy='median')
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

# Score dataset
print("MAE (Imputation with Extension):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus,
                    y_train, y_valid))

Explanation:
1. Loading Data: Load the dataset from a CSV file.
2. Selecting Target and Features: Define the target variable y and the feature variables
X.
3. Splitting Data: Split the data into training and validation sets using
train_test_split.
4. Defining the Scoring Function: Define a function to measure the mean absolute
error (MAE) for each approach.
5. Approach 1 - Drop Columns with Missing Values: Identify columns with missing
values, drop them, and score the dataset.
6. Approach 2 - Imputation: Use SimpleImputer to impute missing values with the
median and score the dataset.
7. Approach 3 - Imputation with an Extension: Add indicators for missing values,
impute missing values, and score the dataset.

Step 3: Categorical Variables

1. Introduction:

• Definition: Categorical variables are variables that contain label values rather than numeric values.
• Importance: Many machine learning models require all input features to be numeric, so categorical variables need to be converted to a suitable numeric format.

2. Methods to Handle Categorical Variables:

• Method 1: Drop Categorical Variables
• Method 2: Label Encoding
• Method 3: One-Hot Encoding

3. Investigating Categorical Variables:

• Check for Categorical Variables: Use pandas functions to identify categorical variables in the dataset.
• Example Code:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
data = pd.read_csv('train.csv')

# Select target and features
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1)

# To keep things simple, drop columns with missing values
# (the encoders and model used in the following methods expect complete data)
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X = X.drop(cols_with_missing, axis=1)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
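
The dtype check above can be tried on a small made-up frame. In this sketch the text columns are created with dtype=object explicitly so that the check matches the 'object' test used in the notes; the column names and values are illustrative assumptions.

```python
import pandas as pd

# Toy frame: two text columns (stored as object) and one numeric column
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
    "LotArea": [8450, 9600, 11250],
    "CentralAir": pd.Series(["Y", "N", "Y"], dtype=object),
})

# Boolean mask of object-typed columns, as in the notes
s = (toy.dtypes == "object")
categorical_cols = list(s[s].index)

# Everything else is treated as numeric here
numeric_cols = toy.select_dtypes(exclude=["object"]).columns.tolist()

print(categorical_cols)
print(numeric_cols)
```

A column of numeric codes (for example a zone number) would not be caught by this check even if it is categorical in meaning, so the list is a starting point rather than a final answer.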

4. Method 1: Drop Categorical Variables:

• When to Use:
  o When categorical variables are not critical for the analysis.
• Example Code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Drop categorical variables
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

# Define function to measure quality of each approach
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

print("MAE (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

5. Method 2: Label Encoding:

• Definition:
  o Label Encoding assigns each unique value in a categorical column an integer value.
• When to Use:
  o When the categorical variable has an ordinal relationship (e.g., 'low', 'medium', 'high').
• Example Code:

from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data.
# LabelEncoder works on one column at a time, so loop over the columns.
# Note: transform raises an error if a validation column contains a value
# that never appears in the corresponding training column.
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

print("MAE (Label Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

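
By default the encoders assign integers in alphabetical order, which would rank 'high' below 'low'. When the order matters, it can be passed explicitly. This is a minimal sketch using scikit-learn's OrdinalEncoder with its categories argument; the quality column is a made-up example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up ordinal column
toy = pd.DataFrame({"quality": ["low", "high", "medium", "low"]}, dtype=object)

# Explicit order: low < medium < high
# (alphabetical order would give high < low < medium instead)
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
encoded = encoder.fit_transform(toy[["quality"]])

codes = encoded.ravel().tolist()
print(codes)
# [0.0, 2.0, 1.0, 0.0]
```

Supplying the order keeps the integer codes consistent with the real ranking, which is the property that makes this encoding suitable for ordinal variables.
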
6. Method 3: One-Hot Encoding:

• Definition:
  o One-Hot Encoding creates new binary columns indicating the presence of each possible value in the original column.
• When to Use:
  o When the categorical variable does not have an ordinal relationship and has a relatively low number of unique values.
• Example Code:

• We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data.
• Setting sparse_output=False (the argument was named sparse before scikit-learn 1.2) ensures that the encoded columns are returned as a NumPy array (instead of a sparse matrix).

from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# The new columns have integer names; make all column names strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print("MAE (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
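
The effect of handle_unknown='ignore' is easiest to see on a tiny made-up example. The sketch below leaves the encoder's output in its default sparse form and converts it with toarray(), which avoids depending on the name of the dense-output argument; the color values are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data has two colors; validation data contains an unseen one
train = pd.DataFrame({"color": ["red", "green", "red"]}, dtype=object)
valid = pd.DataFrame({"color": ["green", "blue"]}, dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Categories learned from the training data, in sorted order
categories = encoder.categories_[0].tolist()

# Default output is a sparse matrix; toarray() makes it a dense NumPy array
valid_encoded = encoder.transform(valid).toarray()

print(categories)
print(valid_encoded)
```

The unseen value "blue" becomes a row of zeros instead of raising an error, so the model simply receives no signal for that category.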

7. Conclusion:

• Key Takeaways:
  o Dropping Categorical Variables: Simple but may lose important information.
  o Label Encoding: Suitable for ordinal categorical variables.
  o One-Hot Encoding: Suitable for nominal categorical variables with relatively few unique values.
• Final Thoughts: Choose the appropriate method for handling categorical variables based on the nature of your data and the specific requirements of your analysis.

Exercise and code with notes of this step:

Dropping Categorical Columns

Objective: Remove columns with categorical data and evaluate model performance.

# Import necessary libraries and load data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

# Function to score the dataset using Random Forest Regressor
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Drop categorical columns in training and validation sets
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

# Check MAE from dropping categorical columns
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

Result: MAE from Approach 1 (Drop categorical variables): 17837.83

Ordinal Encoding

Objective: Use ordinal encoding for categorical variables and evaluate model performance.

from sklearn.preprocessing import OrdinalEncoder

# Identify categorical columns
object_cols = [col for col in X_train.columns
               if X_train[col].dtype == "object"]

# Identify categorical columns that can be safely ordinal encoded:
# every value in the validation column also appears in the training column
good_label_cols = [col for col in object_cols
                   if set(X_valid[col]).issubset(set(X_train[col]))]

# Identify problematic categorical columns that will be dropped
bad_label_cols = list(set(object_cols) - set(good_label_cols))

# Print categorical columns for ordinal encoding and columns to be dropped
print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:',
      bad_label_cols)

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply ordinal encoder
ordinal_encoder = OrdinalEncoder()
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(
    X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(
    X_valid[good_label_cols])

# Check MAE from ordinal encoding approach
print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

Result: MAE from Approach 2 (Ordinal Encoding): 17098.02

Investigating Cardinality

Objective: Understand the cardinality of categorical variables.

# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
print(sorted(d.items(), key=lambda x: x[1]))

Output:

[('Street', 2), ('Utilities', 2), ('CentralAir', 2), ('LandSlope', 3),
('PavedDrive', 3), ('LotShape', 4), ('LandContour', 4), ('ExterQual', 4),
('KitchenQual', 4), ('MSZoning', 5), ('LotConfig', 5), ('BldgType', 5),
('ExterCond', 5), ('HeatingQC', 5), ('Condition2', 6), ('RoofStyle', 6),
('Foundation', 6), ('Heating', 6), ('Functional', 6), ('SaleCondition', 6),
('RoofMatl', 7), ('HouseStyle', 8), ('Condition1', 9), ('SaleType', 9),
('Exterior1st', 15), ('Exterior2nd', 16), ('Neighborhood', 25)]

Observations:

• Categorical variables have varying numbers of unique entries (cardinality).
• Some variables have high cardinality (10 or more unique entries, such as Neighborhood with 25), which may increase the dataset size and affect model performance if one-hot encoded.
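
One-hot encoding produces one new column per unique value, so the cost of encoding a column grows with its cardinality. The sketch below uses the counts from the output above to estimate how many columns are created, assuming the same cutoff of fewer than 10 unique values that the next section uses.

```python
# Unique-value counts copied from the output above
cardinality = {
    'Street': 2, 'Utilities': 2, 'CentralAir': 2, 'LandSlope': 3,
    'PavedDrive': 3, 'LotShape': 4, 'LandContour': 4, 'ExterQual': 4,
    'KitchenQual': 4, 'MSZoning': 5, 'LotConfig': 5, 'BldgType': 5,
    'ExterCond': 5, 'HeatingQC': 5, 'Condition2': 6, 'RoofStyle': 6,
    'Foundation': 6, 'Heating': 6, 'Functional': 6, 'SaleCondition': 6,
    'RoofMatl': 7, 'HouseStyle': 8, 'Condition1': 9, 'SaleType': 9,
    'Exterior1st': 15, 'Exterior2nd': 16, 'Neighborhood': 25,
}

# Split columns by the cutoff (fewer than 10 unique values counts as low)
low = {col: n for col, n in cardinality.items() if n < 10}
high = {col: n for col, n in cardinality.items() if n >= 10}

# One new column per unique value
cols_if_low_only = sum(low.values())
cols_if_all = sum(cardinality.values())

print(len(low), cols_if_low_only)   # 24 122
print(len(high), cols_if_all)       # 3 178
```

Encoding only the 24 low-cardinality columns yields 122 new columns, while the 3 high-cardinality columns alone would add another 56.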

One-Hot Encoding

Objective: Apply one-hot encoding to categorical variables with low cardinality and evaluate
model performance.

from sklearn.preprocessing import OneHotEncoder

# Identify columns for one-hot encoding (low cardinality)
low_cardinality_cols = [col for col in object_cols
                        if X_train[col].nunique() < 10]

# Identify columns to be dropped (high cardinality)
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))

# Print columns for one-hot encoding and columns to be dropped
print('Categorical columns that will be one-hot encoded:',
      low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:',
      high_cardinality_cols)

# Initialize one-hot encoder and apply to low cardinality columns
# (sparse_output was named sparse before scikit-learn 1.2)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(
    OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(
    OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed the index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Drop all categorical columns and concatenate with one-hot encoded columns
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all column names have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

# Check MAE from one-hot encoding approach
print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

Result: MAE from Approach 3 (One-Hot Encoding): 17525.35
