Missing Values
Introduction
There are many ways data can end up with missing values. For example, a 2-bedroom house won't include a value for the size of a third bedroom, and a survey respondent may decline to share their income.
Most machine learning libraries (including scikit-learn) give an error if you try to build a model using data with missing values. So you'll need to choose one of the strategies below.
Three Approaches
1) A Simple Option: Drop Columns with Missing Values
The simplest option is to drop columns with missing values. Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach. As an extreme example, consider a dataset with 10,000 rows, where one important column is missing a single entry. This approach would drop the column entirely!
2) A Better Option: Imputation
Imputation fills in the missing values with some number. For instance, we can fill in the mean value along each column. The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.
3) An Extension To Imputation
Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be
unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.
In this approach, we impute the missing values, as before. And, additionally, for each column with missing entries in the original dataset, we add a new column that shows the location of the imputed entries.
In some cases, this will meaningfully improve results. In other cases, it doesn't help at all.
Example
In the example, we will work with the Melbourne Housing dataset (https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home). Our model will use information such as the number of rooms and land size to predict
home price.
We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in X_train, X_valid, y_train, and y_valid.
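The hidden cell is not shown here, but it presumably loads the Melbourne data, selects a numeric feature matrix and the price target, and splits them into training and validation sets. A minimal sketch, in which the CSV path and the 80/20 split parameters are assumptions:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data (assumed file path; adjust to wherever the CSV lives)
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Select target
y = data.Price

# To keep things simple, use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide data into training and validation subsets (assumed 80/20 split)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8,
                                                      test_size=0.2, random_state=0)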
Define Function to Measure Quality of Each Approach
We define a function score_dataset() to compare different approaches to dealing with missing values. This function reports the mean absolute error (https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random
forest model.
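The hidden cell defines score_dataset(). A minimal sketch, assuming a small random forest (the choice of n_estimators=10 is an assumption):
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Fit a random forest on the training data and report MAE on the validation data
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)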
In [3]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
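To finish Approach 1, those columns can be dropped from both the training and validation data and the result scored with score_dataset(); a sketch:
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))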
Next, we use SimpleImputer (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to replace missing values with the mean value along each column.
Although it's simple, filling in the mean value generally performs quite well (though this varies by dataset). While statisticians have experimented with more complex ways to determine imputed values (such as regression imputation), the complex strategies typically give no additional benefit once you plug the results into sophisticated machine learning models.
In [4]:
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
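Because fit_transform and transform return NumPy arrays, the wrapped DataFrames lose their column names. A short continuation (a sketch) puts them back and scores this approach:
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))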
We see that Approach 2 has lower MAE than Approach 1, so Approach 2 performed better on this dataset.
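For Approach 3, we first add indicator columns recording which entries were missing, then impute as before. The cell below builds X_train_plus and X_valid_plus, copies of the training and validation data with a boolean _was_missing column for each feature that had missing entries (a sketch following the approach described above):
In [5]:
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()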
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
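# Continuation sketch, mirroring Approach 2: restore the column names lost in imputation
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))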
So, why did imputation perform better than dropping the columns?
The training data has 10864 rows and 12 columns, and three of those columns contain missing data. For each of these columns, less than half of the entries are missing. Thus, dropping the columns removes a lot of useful information, so it makes sense that imputation performs better.
In [6]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
(10864, 12)
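The per-column counts of missing values shown below can be obtained by summing the nulls in each column (a sketch, using the same X_train as above):
# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])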
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64
Conclusion
As is common, imputing missing values (Approach 2 and Approach 3) yielded better results than simply dropping columns with missing values (Approach 1).
Your Turn
Compare these approaches to dealing with missing values yourself in this exercise (https://www.kaggle.com/kernels/fork/3370280)!
Have questions or comments? Visit the course discussion forum (https://www.kaggle.com/learn/intermediate-machine-learning/discussion) to chat with other learners.