handling missing values
Several approaches to handling missing values are covered here: complete case analysis (CCA);
simple imputation (mean/median, random, and end of distribution for numerical features; mode
and a "missing" category for categorical features); random imputation for both numerical and
categorical features; missing indicators; and multivariate imputation with the KNN imputer and
the iterative imputer.
1. Load and Explore Data
Load your chosen dataset into a pandas DataFrame.
Use .info() and .describe() to understand the structure and summary statistics of the dataset.
The non-null counts reported by .info() show which columns contain missing values.
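The exploration step can be sketched as follows. This is a minimal illustration that assumes a small hand-built DataFrame in place of a real file; the column names (age, income, city) and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for "your chosen dataset".
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51, np.nan, 38],
    "income": [52000, 61000, np.nan, 83000, 45000, 58000],
    "city": ["Pune", None, "Delhi", "Pune", "Delhi", None],
})

df.info()             # dtypes and non-null counts per column
print(df.describe())  # summary statistics for the numeric columns

# Count and share of missing values per column.
missing_count = df.isnull().sum()
missing_share = df.isnull().mean()
print(missing_count)
print(missing_share)
```

The per-column share of missing values is often a useful first input when deciding between dropping rows and imputing.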
2. Simple Imputation
Mean Imputation: Replaces missing values with the mean of the available values.
Median Imputation: Replaces missing values with the median of the available values.
Random Imputation: Replaces missing values with randomly selected values from the
available data.
End of Distribution: Replaces missing values with a value from the far tail of the variable's
distribution (e.g., mean + 3 standard deviations), so imputed entries stand apart from observed ones.
Mode Imputation: Replaces missing values with the most frequent value (mode).
Missing Category: Introduces a new category to indicate missing values.
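The simple strategies listed above might look like this in pandas. This is a sketch on hypothetical data; the mean + 3 * std rule for end of distribution is one common convention rather than the only choice, and the "Missing" label is an arbitrary placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numerical and one categorical column.
df = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0, 90.0],
    "city": ["Pune", "Delhi", None, "Pune", "Pune"],
})

# Numerical: mean and median imputation.
age_mean = df["age"].fillna(df["age"].mean())
age_median = df["age"].fillna(df["age"].median())

# Numerical: end of distribution, here mean + 3 * std (an assumed convention).
end_value = df["age"].mean() + 3 * df["age"].std()
age_end = df["age"].fillna(end_value)

# Categorical: most frequent value, or an explicit "Missing" category.
city_mode = df["city"].fillna(df["city"].mode()[0])
city_missing = df["city"].fillna("Missing")
```

In a modelling workflow the fill values would normally be computed on the training split only and then reused on the test split.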
3. Random Imputation
Description: Missing values are replaced by randomly selected values from the observed
values.
Advantages: Approximately preserves the distribution and variance of the original variable.
Disadvantages: Results change from run to run unless a random seed is fixed, and relationships
with other features can be weakened.
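A minimal sketch of random imputation, assuming hypothetical data. The helper name random_impute and its seed parameter are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0, np.nan, 58.0],
    "city": ["Pune", "Delhi", None, "Pune", None, "Delhi"],
})

def random_impute(series: pd.Series, seed: int = 0) -> pd.Series:
    """Fill missing entries with values sampled from the observed ones."""
    result = series.copy()
    missing = result.isnull()
    n_missing = int(missing.sum())
    # Sample with replacement from the observed values; fixed seed for repeatability.
    sampled = series.dropna().sample(n=n_missing, replace=True, random_state=seed)
    # Use the raw values so they are placed by position, not by index label.
    result[missing] = sampled.to_numpy()
    return result

age_filled = random_impute(df["age"])
city_filled = random_impute(df["city"])
```

The same helper works for numerical and categorical columns because it only draws from values that were actually observed.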
4. Missing Indicator
Description: Adds a binary indicator variable for each feature with missing values to denote
the presence or absence of missing data.
Advantages: Allows the model to learn patterns of missingness.
Disadvantages: Increases the dimensionality of the dataset.
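The dimensionality cost can be seen in a short scikit-learn sketch. It assumes a hypothetical two-column array and uses SimpleImputer with add_indicator=True, which appends one indicator column per feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: both columns contain one missing value.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 30.0],
])

# Mean-impute and append binary missingness indicators in one step.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X.shape, "->", X_out.shape)  # two original columns plus two indicators
```

Here the feature count doubles, which is the worst case: every original feature has missing values.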
5. Multivariate Imputation
KNN Imputer
Description: Uses k-nearest neighbors to impute missing values. Each missing value is filled
with the mean (optionally distance-weighted) of that feature among the k most similar rows.
Advantages: Captures relationships between features and is generally more accurate than
simple imputation.
Disadvantages: Computationally expensive and can be influenced by irrelevant features.
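A small sketch of KNN imputation with scikit-learn's KNNImputer, assuming hypothetical data in which the second column is twice the first. The choice n_neighbors=2 is illustrative only.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: second column is 2 * first column, with one gap.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [5.0, 10.0],
])

# The two rows closest to 3.0 in the first column are those with 2.0 and 4.0,
# so the gap is filled with the mean of their second-column values.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[2])
```

Because neighbors are chosen by distance, features on very different scales are usually standardized before this step.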
Iterative Imputer
Description: Imputes missing values by modeling each feature with missing values as a
function of other features iteratively.
Advantages: Often provides more accurate imputation by considering the multivariate
relationships between features.
Disadvantages: Computationally intensive and requires careful handling to avoid overfitting.
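A minimal sketch using scikit-learn's IterativeImputer on hypothetical data. The estimator is still marked experimental in scikit-learn, so it has to be enabled with an explicit import; max_iter=10 and random_state=0 are illustrative settings.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data with one gap in each column.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

# Each feature with gaps is regressed on the other features, round after round.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

To limit overfitting and leakage, the imputer would normally be fitted on training data only and then applied to held-out data with transform.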
Summary
CCA (Complete Case Analysis): Suitable when few rows have missing values; otherwise, dropping
incomplete rows can discard a large share of the data.
Simple Imputation: Easy to implement; may not capture the underlying data distribution.
Random Imputation: Maintains data distribution but adds randomness.
Missing Indicator: Useful for models that can handle increased dimensionality.
Multivariate Imputation (KNN, Iterative): Usually more accurate but computationally intensive.
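The data-loss trade-off noted for CCA can be checked directly. The sketch below assumes hypothetical data and uses pandas dropna() to keep only fully observed rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of five rows have a missing value.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, 38.0],
    "income": [52000.0, 61000.0, np.nan, 83000.0, 58000.0],
})

# Complete case analysis: drop every row with at least one missing value.
complete_cases = df.dropna()
rows_kept = len(complete_cases) / len(df)

print(f"Rows kept: {len(complete_cases)} of {len(df)} ({rows_kept:.0%})")
```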
Missing Indicator: Implementation Steps
1. Identify Missing Values: Determine which columns or features in your dataset have
missing values.
2. Create Indicator Variable: For each feature with missing values, create a new
binary column that indicates whether the original value was missing (1 for missing, 0
for not missing).
3. Include in Dataset: Add these indicator variables as additional features in your
dataset alongside the original features.
4. Apply to Different Types of Features: Missing indicators can be applied to both
numerical and categorical features, because the indicator depends only on whether a
value is present, not on its type.
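The steps above can be sketched in pandas as follows. The data and the "_missing" column suffix are hypothetical choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column have gaps.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, np.nan],
    "city": ["Pune", None, "Delhi", "Pune"],
    "score": [0.5, 0.7, 0.9, 0.4],
})

# Step 1: identify the columns that contain missing values.
cols_with_missing = [col for col in df.columns if df[col].isnull().any()]

# Steps 2 and 3: add one 0/1 indicator column per affected feature.
for col in cols_with_missing:
    df[f"{col}_missing"] = df[col].isnull().astype(int)

print(df)
```

The indicators are created before any imputation so that the information about which values were missing is not lost once the gaps are filled.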