
S3 Missing Value Analysis Imputation

The document discusses missing value analysis in data. It describes determining the type, extent, and randomness of missing data through classification as MCAR or MAR. The extent is analyzed by calculating proportions of missing values across cases and variables. Randomness is diagnosed using Little's MCAR test. For MCAR data, imputation methods include mean, median, and random substitution. For MAR data, modeling-based methods like multivariate feature imputation and nearest neighbors imputation are recommended.

Missing Value Analysis

Missing Value Analysis
§ Missing values – Implications
§ Missing values – Analysis
   • Determine the type of Missing Data
   • Determine the extent of Missing Data
   • Diagnose the randomness of Missing Data [MAR or MCAR]
§ Missing values – Imputation Methods
   • Imputation Methods for MCAR Data
   • Imputation Methods for MAR Data

Missing Value Analysis
• The researcher evaluates the impact of missing data, identifies outliers, and tests the assumptions underlying most multivariate techniques.
• Missing data are a nuisance to researchers and result primarily from errors in data collection or data entry, or from respondents omitting answers.
• Classifying missing data and the reasons underlying their presence proceeds through a series of steps that not only identify the impacts of the missing data but also provide remedies for dealing with it in the analysis.

Multivariate Data Analysis Joseph F. Hair Jr. William C. Black Barry J. Babin Rolph E. Anderson Seventh Edition 3
Missing Values - Impact
Missing values have two kinds of impact:
• Practical Impact
• Substantive Impact

For example, what if we found that the individuals who did not provide their household income were almost exclusively those in the higher income brackets? Would your data still be unbiased?

Missing Values - Impact
The practical impact of missing data is the reduction of the sample size available for analysis.
For example, if remedies for missing data are not applied, any observation with missing data
on any of the variables will be excluded from the analysis.

In many multivariate analyses, particularly survey research applications, missing data may eliminate so
many observations that what was an adequate sample is reduced to an inadequate sample.

For example, if 10 percent of the values are randomly and independently missing on each of five variables, on average about 41 percent of the cases will have at least one missing value (1 − 0.9^5 ≈ 0.41). Thus, when complete data are required, the sample is reduced to about 59 percent of the original size.
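
A quick way to check figures like these is to compute the expected share of incomplete cases directly. The sketch below assumes, purely for illustration, that every variable is missing independently with the same probability; the function name is hypothetical and not taken from the source.

```python
def incomplete_case_fraction(p_missing: float, n_variables: int) -> float:
    """Expected share of cases with at least one missing value.

    Assumption (illustrative only): each variable is missing independently
    with the same probability p_missing.
    """
    # A case is complete only if every one of its variables is observed.
    complete_share = (1.0 - p_missing) ** n_variables
    return 1.0 - complete_share


share_lost = incomplete_case_fraction(0.10, 5)
print(f"incomplete cases: {share_lost:.1%}, complete cases: {1 - share_lost:.1%}")
```

Real missingness is rarely independent across variables, so the actual loss under the complete case approach can be smaller or larger than this estimate.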

From a substantive perspective, any statistical results based on data with a nonrandom missing
data process could be biased.

Missing Values - Analysis
1. Determine the type of Missing Data
   • Ignorable Missing Data: it is part of the research design
   • Non-Ignorable Missing Data: errors in data entry or non-response
2. Determine the Extent of Missing Data
3. Determine the Randomness of Missing Data
   • MCAR Data
   • MAR Data

Missing Values - Analysis
Determine the Extent of Missing Data
1. Understand the dimensions of the data
2. Find the proportion of missing values in the entire data set
3. Find the proportion of cases with missing values
4. Find the proportion of missing values in each case
5. Find the proportion of missing values in each variable

Determine the Randomness of Missing Data

MCAR Data
MCAR stands for Missing Completely At Random. It is the rarest type of missingness and arises when there is no systematic cause for the missing values. In other words, the missing values are unrelated to any feature, just as the name suggests. For example, suppose household income has missing values. If they are truly random, the missingness is not associated with any other variable.

MAR Data
MAR stands for Missing At Random and implies that the missingness can be explained by the data we already have. For example, in household income data, the proportion of missing income values may be higher among male respondents than among female respondents.
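
The five steps for gauging the extent of missing data can be sketched with pandas. The data frame and its column names below are made up for illustration; any data set in which missing values are stored as NaN would work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; NaN marks a missing value.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 28, 45],
    "income": [52000, np.nan, np.nan, 61000, 48000],
    "household_size": [3, 2, 4, np.nan, 1],
})

missing = df.isna()  # True wherever a value is missing

n_cases, n_variables = df.shape                         # 1. dimensions of the data
share_overall = missing.to_numpy().mean()               # 2. share of missing cells overall
share_cases_with_missing = missing.any(axis=1).mean()   # 3. share of cases with any gap
share_per_case = missing.mean(axis=1)                   # 4. share missing within each case
share_per_variable = missing.mean(axis=0)               # 5. share missing within each variable

print(n_cases, n_variables)
print(round(share_overall, 3), share_cases_with_missing)
print(share_per_variable)
```

Cases or variables with a very high share of missing values are the natural candidates to inspect first before choosing a remedy.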

Diagnostic Test
Little's Missing Completely at Random (MCAR) Test

Null Hypothesis: Missing data are completely at random (MCAR)

Alternative Hypothesis: Missing data are not completely at random (for example, MAR)
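
As a rough illustration of how the test statistic is built, the sketch below compares the observed means within each missing-data pattern against the overall means. It is a simplified approximation, not a reference implementation: it plugs in available-case means and pairwise covariances where the formal test uses maximum-likelihood (EM) estimates, and the function name and simulated data are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def little_mcar_test(df: pd.DataFrame):
    """Approximate Little's MCAR test; returns (statistic, dof, p_value)."""
    data = df.to_numpy(dtype=float)
    observed = ~np.isnan(data)
    # Simplification: available-case means and pairwise covariances stand in
    # for the maximum-likelihood (EM) estimates used by the formal test.
    mean_all = np.nanmean(data, axis=0)
    cov_all = df.cov().to_numpy()

    statistic, dof = 0.0, -data.shape[1]
    for pattern in np.unique(observed, axis=0):
        if not pattern.any():
            continue  # rows with nothing observed carry no information
        rows = (observed == pattern).all(axis=1)
        # Distance between this pattern's observed means and the overall means.
        diff = data[rows][:, pattern].mean(axis=0) - mean_all[pattern]
        cov_sub = cov_all[np.ix_(pattern, pattern)]
        statistic += rows.sum() * (diff @ np.linalg.solve(cov_sub, diff))
        dof += int(pattern.sum())
    return statistic, dof, chi2.sf(statistic, dof)


# Simulated data in which x1 is deleted completely at random.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
demo.loc[rng.random(200) < 0.2, "x1"] = np.nan

statistic, dof, p_value = little_mcar_test(demo)
print(f"chi-square = {statistic:.2f}, dof = {dof}, p = {p_value:.3f}")
```

A small p-value leads to rejecting the null hypothesis of MCAR. The sketch needs at least two missing-data patterns to yield positive degrees of freedom; for real analyses a tested implementation is preferable.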

Imputation Methods
MCAR Data

Non-Imputation Methods
• Complete cases approach
• All Available approach
• Case-Substitution approach

Imputation Methods
• Constant Substitution
• Mean substitution
• Median substitution
• Mode substitution
• Random replacement

MAR Data

Modelling based Imputation Methods
• Multivariate feature imputation
• Nearest neighbors' imputation
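
The simple substitution methods listed for MCAR data can be sketched with scikit-learn's SimpleImputer. The small array is made up, and the random replacement step is one possible reading of that method (drawing from the observed values of the same variable), which is an assumption rather than something the source specifies.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: two variables, NaN marks a missing value.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [2.0, 50.0]])

# Constant, mean, median, and mode substitution, applied column by column.
constant_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Random replacement (assumed reading): draw from the observed values
# of the same variable.
rng = np.random.default_rng(0)
random_filled = X.copy()
for j in range(X.shape[1]):
    column = random_filled[:, j]          # view into random_filled
    holes = np.isnan(column)
    column[holes] = rng.choice(column[~holes], size=int(holes.sum()))

print(mean_filled)
```

Mean, median, and mode substitution all shrink the variance of the imputed variable, while random replacement preserves its observed distribution at the cost of added noise.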

Imputation Methods – MCAR Data
Complete Case Approach: The simplest and most direct approach for dealing with missing data is to include only those observations with complete data, also known as the complete case approach. This approach also results in the greatest reduction in sample size, because missing data on any variable eliminates the entire case. It has been shown that with only 2 percent randomly missing data, more than 18 percent of the cases will have some missing data.

All Available Approach: All available valid values are used for each variable.
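
The difference between the complete case and all available approaches can be illustrated with pandas. The data frame below is hypothetical; the point is only how many observations each approach ends up using.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each of three different rows.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
    "z": [1.0, 1.0, 2.0, np.nan, 3.0],
})

# Complete case approach: any row with a missing value is dropped.
complete_cases = df.dropna()
n_complete = len(complete_cases)
means_complete = complete_cases.mean()

# All available approach: each statistic uses every valid value it can.
means_available = df.mean()        # NaN values are skipped per column
corr_available = df.corr()         # each pair uses rows valid on both variables

# Number of rows actually used for each pair of variables.
valid = df.notna().astype(int)
pairwise_n = valid.T @ valid

print(n_complete)
print(pairwise_n)
```

Because each correlation may rest on a different subset of cases, the all available approach keeps more data but can produce correlation matrices that are internally inconsistent.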

Hot or Cold Deck Imputation. In this approach, the researcher substitutes a value from another source for the missing values. In the “hot
deck” method, the value comes from another observation in the sample that is deemed similar. Each observation with missing data is
paired with another case that is similar on a variable(s) specified by the researcher. Then, missing data are replaced with valid values from
the similar observation. “Cold deck” imputation derives the replacement value from an external source (e.g., prior studies, other samples,
etc.).
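
A minimal sketch of hot and cold deck imputation follows. It assumes that "similar" means sharing the same value on one matching variable (here a hypothetical region column) and that the donor is picked at random within that group; the cold deck value is an invented stand-in for a figure taken from an external source.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing for two respondents.
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "income": [40.0, np.nan, 44.0, 70.0, 75.0, np.nan],
})

rng = np.random.default_rng(0)


def hot_deck(group: pd.Series) -> pd.Series:
    """Fill gaps with values donated by observed cases in the same group."""
    donors = group.dropna().to_numpy()
    filled = group.copy()
    holes = filled.isna()
    if len(donors) and holes.any():
        filled[holes] = rng.choice(donors, size=int(holes.sum()))
    return filled


# Hot deck: donors come from the same sample, matched on region.
df["income_hot_deck"] = df.groupby("region")["income"].transform(hot_deck)

# Cold deck: the replacement comes from outside the sample.
COLD_DECK_VALUE = 55.0  # hypothetical figure, e.g. from a prior study
df["income_cold_deck"] = df["income"].fillna(COLD_DECK_VALUE)

print(df)
```

In practice the matching would usually rest on several variables, and the choice of donor rule is left to the researcher.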

Case Substitution. In this method, entire observations with missing data are replaced by choosing another nonsampled observation. A
common example is to replace a sampled household that cannot be contacted or that has extensive missing data with another household
not in the sample, preferably similar to the original observation.

Imputation Methods – MAR Data
Multivariate Feature Imputation:
A more sophisticated approach is to use the IterativeImputer class, which models each feature with missing values as a function of
other features and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is
designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor
is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter
imputation rounds. The results of the final imputation round are returned.
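
A minimal usage sketch of IterativeImputer, on a made-up data set in which the second feature is roughly twice the first. The class is still marked experimental in scikit-learn, so it has to be enabled explicitly before it can be imported.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the class)
from sklearn.impute import IterativeImputer

# Hypothetical training data: the second feature is about twice the first.
X_train = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
X_new = np.array([[np.nan, 2], [6, np.nan], [np.nan, 6]])

# Each feature with gaps is regressed on the others, round-robin,
# for up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X_train)
X_imputed = imputer.transform(X_new)

print(np.round(X_imputed, 1))
```

The imputed values should roughly follow the relationship between the two features; observed values pass through unchanged.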

Nearest neighbors imputation:


The KNNImputer class provides imputation for filling in missing values using the k-Nearest Neighbors approach. Each missing feature is imputed using values from the n_neighbors nearest neighbors that have a value for the feature. The neighbors' values for that feature are averaged uniformly or weighted by distance to each neighbor. If a sample has more than one feature missing, then the neighbors for that sample can be different depending on the particular feature being imputed.
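
A minimal usage sketch of KNNImputer on a small made-up array, with two neighbors and uniform weighting.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: NaN marks a missing value.
X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]])

# Each gap is filled with the mean of that feature over the 2 nearest
# rows that have a value for it.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Setting weights="distance" would give closer neighbors more influence on the imputed value.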

https://scikit-learn.org/stable/modules/impute.html#multivariate-feature-imputation
Thank You
