
handling missing values

The document discusses various approaches for handling missing values in datasets, including Complete Case Analysis (CCA), Simple Imputation, Random Imputation, Missing Indicators, and multivariate imputation methods such as the KNN imputer and the iterative imputer. Each method has its advantages and disadvantages: CCA is simple but can lead to data loss, while imputation methods aim to preserve the data distribution and improve model performance. Missing indicators are highlighted as a useful technique to capture information about missingness without altering the dataset's distribution.

Uploaded by

mriga jain
Copyright
© All Rights Reserved

Handling Missing Values

Several approaches are used to handle missing values: Complete Case Analysis (CCA); simple
imputation (mean/median, random, and end of distribution for numerical data; mode and a missing
category for categorical data); random imputation for both numerical and categorical data;
missing indicators; and multivariate imputation with the KNN imputer and the iterative imputer.

Load and Explore Data:
- Load your chosen dataset into a pandas DataFrame.
- Use .info() and .describe() to understand the structure and summary statistics of the
dataset, including missing values.
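
As a minimal sketch of this step, the snippet below builds a small toy DataFrame (the column names and values are made up for illustration; in practice you would load your own file, for example with pd.read_csv) and inspects how many values are missing per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset used only for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31, np.nan, 52],
    "income": [40000, 52000, np.nan, 61000, 58000, 75000],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Pune", None],
})

df.info()             # dtypes and non-null counts per column
print(df.describe())  # summary statistics for the numerical columns

missing_count = df.isnull().sum()   # number of missing values per column
missing_share = df.isnull().mean()  # fraction of missing values per column
print(missing_count)
print(missing_share)
```

The per-column share of missing values is often the first thing to look at when choosing between dropping rows and imputing.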

1. Complete Case Analysis (CCA)

• Description: Excludes any record that has a missing value in any of the variables used.
• Advantages: Simple, and ensures that only complete data is used.
• Disadvantages: Can lead to significant data loss if many records have missing values, and
can bias the analysis when values are not missing completely at random.
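
A minimal sketch of CCA with pandas, on a hypothetical two-column DataFrame. It also reports how many rows survive, since that number shows how costly the approach is for a given dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three of the six rows have a missing value.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31, np.nan, 52],
    "income": [40000, 52000, np.nan, 61000, 58000, 75000],
})

df_cca = df.dropna()  # keep only rows with no missing values
retained = len(df_cca) / len(df)
print(f"Rows kept: {len(df_cca)} of {len(df)} ({retained:.0%})")
```

If only a small share of rows is dropped, CCA may be acceptable; otherwise one of the imputation methods is usually a better fit.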

2. Simple Imputation

For Numerical Data:

• Mean Imputation: Replaces missing values with the mean of the available values.
• Median Imputation: Replaces missing values with the median of the available values, which
is less sensitive to outliers than the mean.
• Random Imputation: Replaces missing values with randomly selected values from the
available data.
• End of Distribution: Replaces missing values with a value from the tail of the distribution
(e.g., a very high or very low value), so that imputed entries stand out from the observed ones.
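
The sketch below illustrates mean, median, and end-of-distribution imputation on a hypothetical "age" column. It uses scikit-learn's SimpleImputer for mean and median. For the end-of-distribution value it assumes the convention mean + 3 × standard deviation, which is one common choice and not something the notes above prescribe.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with two missing values.
df = pd.DataFrame({"age": [25.0, np.nan, 47.0, 31.0, np.nan, 52.0]})

# Mean and median imputation with scikit-learn.
mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")
df["age_mean"] = mean_imputer.fit_transform(df[["age"]]).ravel()
df["age_median"] = median_imputer.fit_transform(df[["age"]]).ravel()

# End of distribution: assumed convention of mean + 3 * std of the observed values.
end_value = df["age"].mean() + 3 * df["age"].std()
df["age_end"] = df["age"].fillna(end_value)

print(df)
```

In a modelling workflow the imputers would be fitted on the training split only and then applied to the test split, so that test data does not influence the fill values.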

For Categorical Data:

• Mode Imputation: Replaces missing values with the most frequent value (mode).
• Missing Category: Introduces a new category to indicate missing values.
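
Both categorical strategies can be sketched with plain pandas. The "city" column and the label "Missing" are illustrative assumptions; any label that does not clash with an existing category would work.

```python
import pandas as pd

# Hypothetical categorical column with two missing values.
df = pd.DataFrame({"city": ["Delhi", "Mumbai", None, "Delhi", "Pune", None]})

# Mode imputation: fill with the most frequent observed category.
mode_value = df["city"].mode()[0]
df["city_mode"] = df["city"].fillna(mode_value)

# Missing category: treat missingness as its own category.
df["city_missing"] = df["city"].fillna("Missing")

print(df)
```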

3. Random Imputation

• Description: Missing values are replaced by randomly selected values from the observed
values of the same feature.
• Advantages: Maintains the distribution of the data.
• Disadvantages: Introduces randomness, so results vary between runs unless a random seed is
fixed, which might not be appropriate for all datasets.
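
One possible way to implement random sample imputation with pandas is sketched below. The helper name random_sample_impute and its seed parameter are hypothetical, introduced here only for illustration. The same function works for numerical and categorical columns.

```python
import numpy as np
import pandas as pd

def random_sample_impute(series: pd.Series, seed: int = 0) -> pd.Series:
    """Fill missing entries with values drawn at random from the observed ones."""
    result = series.copy()
    missing_mask = result.isnull()
    observed = result.dropna()
    # Sample with replacement so this also works when more values are
    # missing than observed.
    draws = observed.sample(n=int(missing_mask.sum()), replace=True, random_state=seed)
    result.loc[missing_mask] = draws.to_numpy()
    return result

# Hypothetical toy data with one numerical and one categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0, np.nan, 52.0],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Pune", None],
})
df["age_random"] = random_sample_impute(df["age"], seed=42)
df["city_random"] = random_sample_impute(df["city"], seed=42)
print(df)
```

Fixing the seed makes the imputation reproducible, which addresses part of the randomness concern noted above.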

4. Missing Indicator

• Description: Adds a binary indicator variable for each feature with missing values to denote
the presence or absence of missing data.
• Advantages: Allows the model to learn patterns of missingness.
• Disadvantages: Increases the dimensionality of the dataset.

5. Multivariate Imputation

KNN Imputer

• Description: Uses k-nearest neighbors to impute missing values. The missing value is
predicted from the mean (or another statistic) of that feature in the k most similar rows,
where similarity is measured on the other features.
• Advantages: Captures relationships between features and is often more accurate than
simple imputation.
• Disadvantages: Computationally expensive on large datasets, sensitive to feature scaling,
and can be influenced by irrelevant features.
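
A minimal sketch using scikit-learn's KNNImputer on a small made-up matrix. The choice of n_neighbors=2 is an assumption for this toy example, not a recommended default.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix: rows are samples, columns are features.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows, with distances computed on the features both rows share.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_knn = imputer.fit_transform(X)
print(X_knn)
```

Because the neighbors are found by distance, features on very different scales are usually standardized before KNN imputation.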

Iterative Imputer

• Description: Imputes missing values by modeling each feature with missing values as a
function of other features iteratively.
• Advantages: Often provides more accurate imputation by considering the multivariate
relationships between features.
• Disadvantages: Computationally intensive and requires careful handling to avoid overfitting.
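
The sketch below uses scikit-learn's IterativeImputer, which is still marked experimental and must be enabled with an explicit import. The synthetic data (a second feature that is roughly twice the first) is an assumption chosen so that the relationship between features is easy to see.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Synthetic data: x2 is approximately 2 * x1 plus a little noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

# Remove 40 values of x2 to simulate missing data.
X_missing = X.copy()
missing_rows = rng.choice(200, size=40, replace=False)
X_missing[missing_rows, 1] = np.nan

# Each feature with missing values is regressed on the other features,
# and the predictions fill the gaps; the process repeats up to max_iter times.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_iter = imputer.fit_transform(X_missing)

# Compare imputed values with the true values that were removed.
error = np.abs(X_iter[missing_rows, 1] - X[missing_rows, 1]).mean()
print(f"Mean absolute imputation error: {error:.3f}")
```

On real data the true values are not available, so imputation quality is typically judged indirectly, for example through downstream model performance.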

Summary

• CCA: Good for datasets with few missing values; otherwise, may result in data loss.
• Simple Imputation: Easy to implement; can distort the underlying data distribution (for
example, by reducing its variance).
• Random Imputation: Maintains data distribution but adds randomness.
• Missing Indicator: Useful for models that can handle increased dimensionality.
• Multivariate Imputation (KNN, Iterative): More accurate but computationally intensive.

Missing Indicator

Creating a missing indicator involves the following steps:

1. Identify Missing Values: Determine which columns or features in your dataset have
missing values.
2. Create Indicator Variable: For each feature with missing values, create a new
binary column that indicates whether the original value was missing (1 for missing, 0
for not missing).
3. Include in Dataset: Add these indicator variables as additional features in your
dataset alongside the original features.
4. Apply to Different Types of Features: Missing indicators can be applied to both
numerical and categorical features to capture missing values effectively.
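
The steps above can be sketched with pandas as follows. The toy data and the "_missing" suffix for the new columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "income" is complete, the other columns are not.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0, np.nan, 52.0],
    "income": [40000, 52000, 48000, 61000, 58000, 75000],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Pune", None],
})

# Step 1: identify the columns that contain missing values.
cols_with_missing = [col for col in df.columns if df[col].isnull().any()]

# Steps 2 and 3: create one binary indicator per such column
# (1 = value was missing, 0 = value was present) and add it to the dataset.
for col in cols_with_missing:
    df[col + "_missing"] = df[col].isnull().astype(int)

print(cols_with_missing)
print(df)
```

The indicators are created before any imputation, because afterwards the information about which values were originally missing is gone.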

Purpose of Missing Indicators:

1. Preservation of Information: Missing indicators ensure that information about the
presence of missing values is not lost during data pre-processing. Instead of simply
imputing missing values, which might alter the distribution of the data, missing
indicators provide an additional feature that explicitly flags missingness.
2. Improved Model Performance: Including missing indicators as features in your
dataset allows machine learning algorithms to learn patterns associated with missing
values. This can sometimes lead to improved model performance, as the presence or
absence of missing values might be informative for predicting the target variable.
3. Flexibility in Modelling: Some implementations of tree-based models, such as
decision trees and random forests, can handle missing values natively. Including
missing indicators allows these models to leverage information about missingness
without requiring imputation strategies that might introduce bias.
4. Decision Making in Imputation: When deciding how to handle missing values (e.g.,
impute with mean, median, or a specific value), analysts can use the missing indicator
to guide their decisions. For example, they might choose to impute missing values
differently depending on whether a missing indicator is present or not.
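
As a rough illustration of how a model can use missingness as a feature, the sketch below combines imputation and indicator creation in one step with scikit-learn's SimpleImputer(add_indicator=True) and feeds the result to a logistic regression. The synthetic data is an assumption made for this example: values are deliberately made missing more often when the target is 1, so the indicator carries signal by construction.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data: the feature itself is unrelated to the target,
# but it is missing more often when y == 1 (an assumption for this sketch).
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=(n, 1))
y = rng.integers(0, 2, size=n)
missing_mask = (y == 1) & (rng.random(n) < 0.6)
x[missing_mask, 0] = np.nan

# add_indicator=True appends one binary column per feature that had
# missing values during fit: column 0 is the imputed feature, column 1 the flag.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_aug = imputer.fit_transform(x)
print(X_aug[:5])

# The same imputer inside a pipeline, so the model sees the indicator as a feature.
model = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LogisticRegression(),
)
model.fit(x, y)
print(f"Training accuracy: {model.score(x, y):.2f}")
```

Whether the indicator helps on real data depends on whether missingness is actually related to the target, which this synthetic setup simply assumes.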
cols_with_missing)
# Create missing indicators
