ML | Handling Missing Values
Last Updated: 14 Aug, 2024
Missing values are a common issue in machine learning. They occur when a variable lacks data points, leaving incomplete information that can harm the accuracy and reliability of your models. Addressing missing values effectively is essential for strong, unbiased results. In this article, we will see how to handle missing values in datasets in machine learning.
What is a Missing Value?
Missing values are data points that are absent for a specific variable in a dataset. They can be represented in various ways, such as blank cells, null values, or special symbols like "NA" or "unknown." These missing data points pose a significant challenge in data analysis and can lead to inaccurate or biased results.
Missing values can pose a significant challenge in data analysis, as they can:
- Reduce the sample size: This can decrease the accuracy and reliability of your analysis.
- Introduce bias: If the missing data is not handled properly, it can bias the results of your analysis.
- Make it difficult to perform certain analyses: Some statistical techniques require complete data for all variables, making them inapplicable when missing values are present.
Why Is Data Missing From the Dataset?
Data can be missing for many reasons: technical issues, human error, privacy concerns, data-processing problems, or the nature of the variable itself. Understanding the cause of missing data helps you choose appropriate handling strategies and ensures the quality of your analysis.
It's important to understand the reasons behind missing data:
- Identifying the type of missing data: Is it Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR)?
- Evaluating the impact of missing data: Is the missingness causing bias or affecting the analysis?
- Choosing appropriate handling strategies: Different techniques are suitable for different types of missing data.
Types of Missing Values
There are three main types of missing values:
- Missing Completely at Random (MCAR): MCAR is a specific type of missing data in which the probability of a data point being missing is entirely random and independent of any other variable in the dataset. In simpler terms, whether a value is missing or not has nothing to do with the values of other variables or the characteristics of the data point itself.
- Missing at Random (MAR): MAR is a type of missing data where the probability of a data point being missing depends on the values of other variables in the dataset, but not on the missing variable itself. The missingness mechanism is therefore not entirely random, but it can be predicted from the available information.
- Missing Not at Random (MNAR): MNAR is the most challenging type of missing data to deal with. It occurs when the probability of a data point being missing is related to the missing value itself. The reason for the missing data is therefore informative and directly associated with the variable that is missing. The short simulation below contrasts all three mechanisms.
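To make these mechanisms concrete, here is a small hypothetical simulation; the variables, probabilities, and thresholds are invented purely for illustration, not drawn from any real dataset.
Python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
sim = pd.DataFrame({
    'age': rng.integers(18, 70, n),
    'income': rng.normal(50_000, 10_000, n)
})

# MCAR: every income value has the same 10% chance of being missing,
# independent of age or of the income itself.
mcar = sim['income'].mask(rng.random(n) < 0.10)

# MAR: the chance of a missing income depends on another observed
# variable (age): younger respondents skip the question more often.
mar = sim['income'].mask(rng.random(n) < np.where(sim['age'] < 30, 0.30, 0.05))

# MNAR: the chance of a missing income depends on the income itself:
# higher earners decline to report, so the missingness is informative.
mnar = sim['income'].mask(rng.random(n) < np.where(sim['income'] > 60_000, 0.40, 0.05))

print(mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean())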
Methods for Identifying Missing Data
Locating and understanding patterns of missingness in the dataset is an important step in addressing their impact on analysis. Pandas provides several useful functions for detecting, removing, and replacing null values in a DataFrame (a quick demonstration follows the table):
| Function | Description |
|---|---|
| isnull() | Identifies missing values in a Series or DataFrame, returning True where a value is missing. |
| notnull() | The inverse of isnull(): returns a boolean Series or DataFrame where True indicates non-missing values and False indicates missing values. |
| info() | Displays information about the DataFrame, including data types, memory usage, and non-null counts, which reveal missing values. |
| isna() | An alias of isnull(): returns True for missing values and False for non-missing values. |
| dropna() | Drops rows or columns containing missing values based on custom criteria. |
| fillna() | Fills missing values with specified values, such as means, medians, or other calculated values. |
| replace() | Replaces specific values with other values, facilitating data correction and standardization. |
| drop_duplicates() | Removes duplicate rows based on specified columns. |
| unique() | Finds the unique values in a Series. |
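As a quick, self-contained demonstration of the detection functions above (using a tiny made-up Series):
Python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

print(s.isnull())        # True where a value is missing
print(s.notnull())       # True where a value is present
print(s.isnull().sum())  # total missing values: 2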
How Is a Missing Value Represented in a Dataset?
Missing values can be represented by blank cells, specific values like "NA", or codes. It's important to use a consistent, documented representation to ensure transparency and facilitate data handling.
Common Representations
- Blank cells: Empty cells in spreadsheets or databases often signify missing data.
- Specific values: Special values like "NULL", "NA", or "-999" are used to represent missing data explicitly.
- Codes or flags: Non-numeric codes or flags can be used to indicate different types of missing values; the loading sketch below shows how such sentinels can be normalized.
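Suppose a CSV (made up here for illustration) mixes blank cells, the string "NA", and the sentinel code -999. Passing the sentinel tokens to read_csv's na_values parameter converts them all to NaN on load:
Python
import io
import pandas as pd

# Hypothetical CSV mixing a blank cell, the string "NA",
# and the sentinel code -999 to mark missing scores.
csv_text = """name,score
Alice,85
Bob,NA
Charlie,
David,-999
"""

df_raw = pd.read_csv(io.StringIO(csv_text), na_values=["NA", -999])
print(df_raw)
print(df_raw['score'].isna().sum())  # 3 -- all three sentinels became NaN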
Effective Strategies for Handling Missing Values in Data Analysis
Missing values are a common challenge in data analysis, and there are several strategies for handling them. Here's an overview of some common approaches:
Creating a Sample DataFrame
Python
import pandas as pd
import numpy as np
# Creating a sample DataFrame with missing values
data = {
'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan, '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan, 'Houston', 'New York'],
'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
Output:
Sample DataFrame:
School ID Name Address City Subject Marks Rank Grade
0 101.0 Alice 123 Main St Los Angeles Math 85.0 2 B
1 102.0 Bob 456 Oak Ave New York English 92.0 1 A
2 103.0 Charlie 789 Pine Ln Houston Science 78.0 4 C
3 NaN David 101 Elm St Los Angeles Math 89.0 3 B
4 105.0 Eva NaN Miami History NaN 8 D
5 106.0 Frank 222 Maple Rd NaN Math 95.0 1 A
6 107.0 Grace 444 Cedar Blvd Houston Science 80.0 5 C
7 108.0 Henry 555 Birch Dr New York English 88.0 3 B
Removing Rows with Missing Values
- Simple and efficient: Removes data points with missing values altogether.
- Reduces sample size: Can lead to biased results if the missingness is not random.
- Can discard valuable information: Not recommended when many rows contain missing values, since each dropped row loses the valid data it carried.
In this example, we remove rows with missing values from the original DataFrame (df) using the dropna() method and then display the cleaned DataFrame (df_cleaned).
Python
# Removing rows with missing values
df_cleaned = df.dropna()
# Displaying the DataFrame after removing missing values
print("\nDataFrame after removing rows with missing values:")
print(df_cleaned)
Output:
DataFrame after removing rows with missing values:
School ID Name Address City Subject Marks Rank Grade
0 101.0 Alice 123 Main St Los Angeles Math 85.0 2 B
1 102.0 Bob 456 Oak Ave New York English 92.0 1 A
2 103.0 Charlie 789 Pine Ln Houston Science 78.0 4 C
6 107.0 Grace 444 Cedar Blvd Houston Science 80.0 5 C
7 108.0 Henry 555 Birch Dr New York English 88.0 3 B
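dropna() also takes parameters that give finer control than the all-or-nothing default; the sketch below shows a few common variants on the same DataFrame (the parameter values are illustrative):
Python
# Drop columns (instead of rows) that contain any missing value
cols_dropped = df.dropna(axis=1)
print(cols_dropped.columns.tolist())  # ['Name', 'Subject', 'Rank', 'Grade']

# Keep only rows with at least 7 non-missing values;
# here that drops just Eva's row, which has two NaNs
thresh_kept = df.dropna(thresh=7)

# Drop rows only when 'Marks' is missing, ignoring NaNs elsewhere
subset_dropped = df.dropna(subset=['Marks'])
print(subset_dropped.shape)  # (7, 8)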
Imputation Methods
- Replacing missing values with estimated values.
- Preserves sample size: Doesn't reduce data points.
- Can introduce bias: Estimated values might not be accurate.
Here are some common imputation methods:
1. Mean, Median, and Mode Imputation
- Replace missing values with the mean, median, or mode of the relevant variable.
- Simple and efficient: Easy to implement.
- Can be inaccurate: Doesn't consider the relationships between variables.
In this example, we demonstrate these imputation techniques on the 'Marks' column of the DataFrame (df). We calculate the mean, median, and mode of the existing values in that column, fill the missing values with each, and print the results:
- Mean Imputation: df['Marks'].mean() computes the column mean, df['Marks'].fillna(...) fills the missing values with it, and the result is stored in mean_imputation.
- Median Imputation: df['Marks'].median() computes the column median, the missing values are filled with it, and the result is stored in median_imputation.
- Mode Imputation: df['Marks'].mode() returns a Series of the most frequent values; .iloc[0] selects its first element, which fillna(...) uses to fill the missing values, stored in mode_imputation.
Python
# Mean, Median, and Mode Imputation
mean_imputation = df['Marks'].fillna(df['Marks'].mean())
median_imputation = df['Marks'].fillna(df['Marks'].median())
mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])
print("\nImputation using Mean:")
print(mean_imputation)
print("\nImputation using Median:")
print(median_imputation)
print("\nImputation using Mode:")
print(mode_imputation)
Output:
Imputation using Mean:
0 85.000000
1 92.000000
2 78.000000
3 89.000000
4 86.714286
5 95.000000
6 80.000000
7 88.000000
Name: Marks, dtype: float64
Imputation using Median:
0 85.0
1 92.0
2 78.0
3 89.0
4 88.0
5 95.0
6 80.0
7 88.0
Name: Marks, dtype: float64
Imputation using Mode:
0 85.0
1 92.0
2 78.0
3 89.0
4 78.0
5 95.0
6 80.0
7 88.0
Name: Marks, dtype: float64
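The same statistical imputations can also be expressed with scikit-learn's SimpleImputer, which slots directly into ML pipelines; this is a minimal sketch assuming scikit-learn is installed:
Python
from sklearn.impute import SimpleImputer

# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='mean')

# fit_transform expects a 2D input, hence the double brackets
marks_imputed = imputer.fit_transform(df[['Marks']])
print(marks_imputed.ravel())  # matches the mean imputation above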
2. Forward and Backward Fill
- Replace missing values with the previous or next non-missing value in the same variable.
- Simple and intuitive: Preserves temporal order.
- Can be inaccurate: Assumes missing values are close to observed values.
These fill methods are particularly useful when there is a logical sequence or order in the data, so that missing values can reasonably be assumed to follow the neighboring observations. Older code passes method='ffill' or method='bfill' to fillna(); since that argument is deprecated in recent pandas versions, the dedicated ffill() and bfill() methods are used here instead.
- Forward Fill (forward_fill): df['Marks'].ffill() fills missing values in the 'Marks' column of the DataFrame (df) using a forward fill strategy, replacing each missing value with the last observed non-missing value in the column. The result is stored in forward_fill.
- Backward Fill (backward_fill): df['Marks'].bfill() fills missing values in the 'Marks' column using a backward fill strategy, replacing each missing value with the next observed non-missing value in the column. The result is stored in backward_fill.
Python
# Forward and Backward Fill
# (ffill()/bfill() replace the deprecated fillna(method=...) calls)
forward_fill = df['Marks'].ffill()
backward_fill = df['Marks'].bfill()
print("\nForward Fill:")
print(forward_fill)
print("\nBackward Fill:")
print(backward_fill)
Output:
Forward Fill:
0 85.0
1 92.0
2 78.0
3 89.0
4 89.0
5 95.0
6 80.0
7 88.0
Name: Marks, dtype: float64
Backward Fill:
0 85.0
1 92.0
2 78.0
3 89.0
4 95.0
5 95.0
6 80.0
7 88.0
Name: Marks, dtype: float64
Note:
- Forward fill uses the last valid observation to fill missing values.
- Backward fill uses the next valid observation to fill missing values.
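Both fills also accept a limit parameter that caps how many consecutive missing values get filled, which guards against propagating a stale value through a long gap; a short sketch:
Python
# limit=1 fills at most one consecutive NaN per gap; with the single
# missing mark here the result matches the full fill, but in a longer
# run of NaNs the remaining gaps would be left untouched
limited_ffill = df['Marks'].ffill(limit=1)
limited_bfill = df['Marks'].bfill(limit=1)
print(limited_ffill)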
3. Interpolation Techniques
- Estimate missing values based on surrounding data points using techniques like linear interpolation or spline interpolation.
- More sophisticated than mean/median imputation: Captures local trends in the data instead of a single global statistic.
- Requires additional resources: Quadratic interpolation, for instance, relies on SciPy under the hood.
These interpolation techniques are useful when the relationship between data points can reasonably be assumed to follow a linear or quadratic pattern. The method parameter of interpolate() lets you specify the interpolation strategy.
- Linear Interpolation: df['Marks'].interpolate(method='linear') performs linear interpolation on the 'Marks' column of the DataFrame (df). It estimates each missing value along a straight line between the two adjacent non-missing values, and the result is stored in linear_interpolation.
- Quadratic Interpolation: df['Marks'].interpolate(method='quadratic') performs quadratic interpolation on the 'Marks' column. It estimates each missing value from a quadratic curve passing through adjacent non-missing values, and the result is stored in quadratic_interpolation.
Python
# Interpolation Techniques
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')
print("\nLinear Interpolation:")
print(linear_interpolation)
print("\nQuadratic Interpolation:")
print(quadratic_interpolation)
Output:
Linear Interpolation:
0 85.0
1 92.0
2 78.0
3 89.0
4 92.0
5 95.0
6 80.0
7 88.0
Name: Marks, dtype: float64
Quadratic Interpolation:
0 85.00000
1 92.00000
2 78.00000
3 89.00000
4 98.28024
5 95.00000
6 80.00000
7 88.00000
Name: Marks, dtype: float64
Note:
- Linear interpolation assumes a straight line between two adjacent non-missing values.
- Quadratic interpolation assumes a quadratic curve that passes through three adjacent non-missing values.
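When the index holds real timestamps, interpolate(method='time') weights estimates by elapsed time rather than by position, which matters for unevenly spaced observations; here is a sketch on a made-up daily series:
Python
import numpy as np
import pandas as pd

dates = pd.date_range('2024-01-01', periods=5, freq='D')
ts = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0], index=dates)

# With a datetime index, 'time' interpolation spaces the estimates
# by the actual gaps between observations.
print(ts.interpolate(method='time'))  # fills 12.0 and 14.0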
Choosing the right strategy depends on several factors (a quick diagnostic sketch follows this list):
- Type of missing data: MCAR, MAR, or MNAR.
- Proportion of missing values.
- Data type and distribution.
- Analytical goals and assumptions.
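A quick way to ground that choice is to measure the proportion of missing values per column before committing to a method; the threshold below is a common rule of thumb, not a fixed standard:
Python
# Fraction of missing values in each column, sorted descending
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# Illustrative rule of thumb: consider dropping columns that are
# mostly empty and imputing the rest (the list is empty here, since
# each column of the sample is at most one-eighth missing)
mostly_empty = missing_ratio[missing_ratio > 0.5].index.tolist()
print("Candidates for dropping:", mostly_empty)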
Impact of Handling Missing Values
Missing values are a common occurrence in real-world data and can negatively impact data analysis and modeling if not addressed properly. Handling missing values effectively is crucial to ensure the accuracy and reliability of your findings.
Here are some key impacts of handling missing values:
- Improved data quality: Addressing missing values enhances the overall quality of the dataset. A cleaner dataset with fewer missing values is more reliable for analysis and model training.
- Enhanced model performance: Machine learning algorithms often struggle with missing data, leading to biased and unreliable results. By appropriately handling missing values, models can be trained on a more complete dataset, leading to improved performance and accuracy.
- Preservation of Data Integrity: Handling missing values helps maintain the integrity of the dataset. Imputing or removing missing values ensures that the dataset remains consistent and suitable for analysis.
- Reduced bias: Ignoring missing values may introduce bias into the analysis or modeling process. Handling missing data allows for a more unbiased representation of the underlying patterns in the data.
- More accurate descriptive statistics: Means, medians, and standard deviations are more accurate when missing values are appropriately handled, ensuring a more reliable summary of the dataset.
- Increased efficiency: Efficiently handling missing values can save you time and effort during data analysis and modeling.
Conclusion
Handling missing values requires careful consideration and a tailored approach based on the specific characteristics of your data. By understanding the different types and causes of missing values, exploring various imputation techniques and best practices, and evaluating the impact of your chosen strategy, you can confidently address this challenge and optimize your machine learning pipeline for success.