Intermediate Machine Learning
Missing Values
1. Introduction:
Check for Missing Values: Use pandas functions such as isnull() to identify missing values in the dataset.
Example Code:
import pandas as pd
# Load data
data = pd.read_csv('train.csv')
# Count the missing values in each column
print(data.isnull().sum())
4. Approach 1: Drop Columns with Missing Values:
When to Use:
o When a column has many missing values.
o When the column is not critical for the analysis.
Example Code:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
5. Approach 2: Imputation:
Definition:
o Imputation is the process of filling in missing values with substituted values.
Common Strategies:
o Mean Imputation: Replace missing values with the mean of the column.
o Median Imputation: Replace missing values with the median of the column.
o Most Frequent Imputation: Replace missing values with the most frequent
value in the column.
Example Code:
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer(strategy='median')
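For reference, a minimal sketch of how the three strategies above are selected in SimpleImputer (the variable names are illustrative):
mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')
freq_imputer = SimpleImputer(strategy='most_frequent')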
Extension of Imputation:
o Combine imputation with additional indicator columns that show where the missing values were.
Why Use It:
o It allows the model to account for the fact that certain values were missing,
which might be informative.
Example Code:
from sklearn.impute import SimpleImputer
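# A sketch of the extension (assumes cols_with_missing from Approach 1
# and the X_train/X_valid split used throughout)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
# Add a boolean column marking where each value was missing
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
# Then impute as before
my_imputer = SimpleImputer(strategy='median')
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))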
Scoring Approach: Use Mean Absolute Error (MAE) to compare the different
approaches.
Example Code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
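# A minimal sketch of the score_dataset helper used below
# (the settings n_estimators=10 and random_state=0 are assumptions)
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)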
Example Code (continued):
# Score for Approach 2 (Imputation)
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
8. Conclusion:
Key Takeaways:
o Approach 1 (Drop Columns with Missing Values): Simple but may lose
important information.
o Approach 2 (Imputation): Retains data, but the choice of imputation strategy
can affect model performance.
o Approach 3 (Imputation with Extension): Combines the benefits of
imputation with added indicators for missing values, which can provide
additional information to the model.
Final Thoughts: Handling missing values effectively is crucial for building accurate
and robust machine learning models. Choose the appropriate method based on the
nature of your data and the specific requirements of your analysis.
Exercise (full code):
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load data and select the target and numeric features
# (assumes the Kaggle housing data, where the target column is SalePrice)
data = pd.read_csv('train.csv')
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])

# Split data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# Function to measure the MAE of each approach
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Approach 1: Drop columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print("MAE (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

# Approach 2: Imputation
my_imputer = SimpleImputer(strategy='median')
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

# Approach 3: Imputation with an extension (indicator columns)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
print("MAE (Imputation with Extension):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
Explanation:
1. Loading Data: Load the dataset from a CSV file.
2. Selecting Target and Features: Define the target variable y and the feature variables
X.
3. Splitting Data: Split the data into training and validation sets using
train_test_split.
4. Defining the Scoring Function: Define a function to measure the mean absolute
error (MAE) for each approach.
5. Approach 1 - Drop Columns with Missing Values: Identify columns with missing
values, drop them, and score the dataset.
6. Approach 2 - Imputation: Use SimpleImputer to impute missing values with the
median and score the dataset.
7. Approach 3 - Imputation with an Extension: Add indicators for missing values, impute the missing values, and score the dataset.
Categorical Variables
1. Introduction:
Definition: Categorical variables are variables that contain label values rather than
numeric values.
Importance: Many machine learning models require all input features to be numeric,
so categorical variables need to be converted to a suitable numeric format.
Example Code:
import pandas as pd
# Load data
data = pd.read_csv('train.csv')
# List the categorical (object-dtype) columns
print(data.select_dtypes(include=['object']).columns.tolist())
Approach 1: Drop Categorical Variables:
When to Use:
o When the categorical variables are not critical for the analysis.
Example Code:
# Drop categorical variables
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
Approach 2: Label Encoding:
Definition:
o Label Encoding assigns each unique value in a categorical column an integer
value.
When to Use:
o When the categorical variable has an ordinal relationship (e.g., 'low', 'medium',
'high').
Example Code:
from sklearn.preprocessing import LabelEncoder
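# A sketch of label encoding (assumes object_cols lists the categorical columns and
# that the validation data contains no categories unseen in the training data)
label_encoder = LabelEncoder()
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])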
Approach 3: One-Hot Encoding:
Definition:
o One-Hot Encoding creates new binary columns indicating the presence of each
possible value in the original column.
When to Use:
o When the categorical variable does not have an ordinal relationship and has a
relatively low number of unique values.
Example Code:
from sklearn.preprocessing import OneHotEncoder
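# A sketch of one-hot encoding (assumes object_cols lists the categorical columns;
# sparse_output requires scikit-learn >= 1.2, on older versions use sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]), index=X_train.index)
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]), index=X_valid.index)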
7. Conclusion:
Key Takeaways:
o Dropping Categorical Variables: Simple but may lose important
information.
o Label Encoding: Suitable for ordinal categorical variables.
o One-Hot Encoding: Suitable for nominal categorical variables with relatively
few unique values.
Final Thoughts: Choose the appropriate method for handling categorical variables
based on the nature of your data and the specific requirements of your analysis.
Exercise (code with notes):
Dropping Categorical Variables
Objective: Remove columns with categorical data and evaluate model performance.
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
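# A sketch of this step (assumes the same train.csv setup, train/validation split,
# and score_dataset helper as in the Missing Values lesson)
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))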
Ordinal Encoding
Objective: Use ordinal encoding for categorical variables and evaluate model performance.
from sklearn.preprocessing import OrdinalEncoder
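# A sketch of safe ordinal encoding (assumes the earlier X_train/X_valid split):
# encode only the columns whose validation categories all appear in the
# training data, and drop the rest
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
good_label_cols = [col for col in object_cols if set(X_valid[col]).issubset(set(X_train[col]))]
bad_label_cols = list(set(object_cols) - set(good_label_cols))
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)
ordinal_encoder = OrdinalEncoder()
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(X_valid[good_label_cols])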
Investigating Cardinality
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))
# Print number of unique entries by column, in ascending order
print(sorted(d.items(), key=lambda x: x[1]))
Output:
[('Street', 2), ('Utilities', 2), ('CentralAir', 2), ('LandSlope', 3),
('PavedDrive', 3), ('LotShape', 4), ('LandContour', 4), ('ExterQual', 4),
('KitchenQual', 4), ('MSZoning', 5), ('LotConfig', 5), ('BldgType', 5),
('ExterCond', 5), ('HeatingQC', 5), ('Condition2', 6), ('RoofStyle', 6),
('Foundation', 6), ('Heating', 6), ('Functional', 6), ('SaleCondition', 6),
('RoofMatl', 7), ('HouseStyle', 8), ('Condition1', 9), ('SaleType', 9),
('Exterior1st', 15), ('Exterior2nd', 16), ('Neighborhood', 25)]
Observations:
o A few columns ('Street', 'Utilities', 'CentralAir') have only 2 or 3 unique values, while 'Exterior1st', 'Exterior2nd', and 'Neighborhood' have 15 or more.
o One-hot encoding high-cardinality columns would add many new columns to the dataset, so the next step applies it only to the low-cardinality columns.
One-Hot Encoding
Objective: Apply one-hot encoding to categorical variables with low cardinality and evaluate
model performance.
from sklearn.preprocessing import OneHotEncoder
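# A sketch of this step (the < 10 cardinality threshold is an assumption,
# consistent with the low-cardinality split discussed above)
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]), index=X_train.index)
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]), index=X_valid.index)
# Drop all original categorical columns, then append the encoded ones
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
# Newer scikit-learn requires string column names when fitting a model
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
print("MAE (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))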