Missing Values
Introduction
There are many ways data can end up with missing values. For example, a 2-bedroom house won't include a value for the size of a third bedroom, and a survey respondent may decline to share their income.
Most machine learning libraries (including scikit-learn) give an error if you try to build a model using data with missing values. So you'll need to choose one of the strategies below.
Three Approaches
1) A Simple Option: Drop Columns with Missing Values
The simplest option is to drop columns with missing values. Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach. As an extreme example, consider a dataset with 10,000 rows, where one important column is missing a single entry. This approach would drop the column entirely!
2) A Better Option: Imputation
Imputation fills in the missing values with some number. For instance, we can fill in the mean value along each column. The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.
3) An Extension To Imputation
Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be
unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.
In this approach, we impute the missing values, as before. And, additionally, for each column with missing entries in the original dataset, we add a new column that shows the location of the imputed entries.
In some cases, this will meaningfully improve results. In other cases, it doesn't help at all.
Example
In the example, we will work with the Melbourne Housing dataset (https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home). Our model will use information such as the number of rooms and land size to predict
home price.
We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in X_train, X_valid, y_train, and y_valid.
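The hidden cell is not shown here, but it presumably loads the Melbourne data, selects a numeric feature matrix and the price target, and splits them into training and validation sets. A minimal sketch, in which the CSV path and the 80/20 split parameters are assumptions:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data (assumed file path; adjust to wherever the CSV lives)
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Select target
y = data.Price

# To keep things simple, use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide data into training and validation subsets (assumed 80/20 split)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8,
                                                      test_size=0.2, random_state=0)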
Define Function to Measure Quality of Each Approach
We define a function score_dataset() to compare different approaches to dealing with missing values. This function reports the mean absolute error (https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random
forest model.
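The hidden cell defines score_dataset(). A minimal sketch, assuming a small random forest (the choice of n_estimators=10 is an assumption):
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Fit a random forest on the training data and report MAE on the validation data
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)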
In [3]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
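To finish Approach 1, those columns can be dropped from both the training and validation data and the result scored with score_dataset(); a sketch:
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))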
Next, we use SimpleImputer (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to replace missing values with the mean value along each column.
Although it's simple, filling in the mean value generally performs quite well (though this varies by dataset). While statisticians have experimented with more complex ways to determine imputed values (such as regression imputation), the complex strategies typically give no additional benefit once you plug the results into sophisticated machine learning models.
In [4]:
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
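Because fit_transform and transform return NumPy arrays, the wrapped DataFrames lose their column names. A short continuation (a sketch) puts them back and scores this approach:
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))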
We see that Approach 2 has lower MAE than Approach 1, so Approach 2 performed better on this dataset.
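For Approach 3, we first add indicator columns recording which entries were missing, then impute as before. The cell below builds X_train_plus and X_valid_plus, copies of the training and validation data with a boolean _was_missing column for each feature that had missing entries (a sketch following the approach described above):
In [5]:
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()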
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
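# Continuation sketch, mirroring Approach 2: restore the column names lost in imputation
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))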
So, why did imputation perform better than dropping the columns?
The training data has 10864 rows and 12 columns, and three of those columns contain missing data. For each of these columns, less than half of the entries are missing. Thus, dropping the columns removes a lot of useful information, so it makes sense that imputation performs better.
In [6]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
(10864, 12)
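The per-column counts of missing values shown below can be obtained by summing the nulls in each column (a sketch, using the same X_train as above):
# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])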
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64
Conclusion
As is common, imputing missing values (Approach 2 and Approach 3) yielded better results than simply dropping columns with missing values (Approach 1).
Your Turn
Compare these approaches to dealing with missing values yourself in this exercise (https://www.kaggle.com/kernels/fork/3370280)!
Have questions or comments? Visit the course discussion forum (https://www.kaggle.com/learn/intermediate-machine-learning/discussion) to chat with other learners.