
Data Cleaning Using Pandas

BEGINNER | DATA CLEANING | PROGRAMMING | PYTHON | STRUCTURED DATA

This article was published as a part of the Data Science Blogathon

Introduction

Data Science is the discipline of extracting insights from huge amounts of data through the use of various scientific methods, algorithms, and processes. To extract useful knowledge, Data Scientists start with raw data: a collection of information gathered from various outside sources and the essential raw material of Data Science. It is also known as primary or source data, and it often contains garbage, irregular, and inconsistent values that lead to many difficulties. The insights and analysis we extract are only as good as the data we are using; essentially, when garbage data goes in, garbage analysis comes out. This is where data cleaning comes into the picture. Data cleaning is an essential part of data science: it is the process of removing incorrect, corrupted, garbage, incorrectly formatted, duplicate, or incomplete data from a dataset.

What is data cleaning?

When working with multiple data sources, there are many chances for data to be incorrect, duplicated, or mislabeled. If the data is wrong, outcomes and algorithms become unreliable, even though they may look correct. Data cleaning is the process of changing or eliminating garbage, incorrect, duplicate, corrupted, or incomplete data in a dataset. There is no single, absolute way to describe the precise steps in the data cleaning process, because the steps vary from dataset to dataset. Data cleaning, also called data cleansing or data scrubbing, is the first step in the overall data preparation process. It plays an important part in developing reliable answers within the analytical process and is considered a basic part of data science fundamentals. The goal of data cleaning is to construct uniform and standardized datasets that allow data analytics and business intelligence tools to easily access accurate data for each problem.

Why is data cleaning essential?

Data cleaning is one of the most important tasks of a data science professional. Wrong or poor-quality data can be detrimental to downstream processes and analysis. Clean data ultimately increases overall productivity and allows the highest quality information to drive decision-making. Following are some reasons why data cleaning is essential:

1. Error-Free Data: When multiple sources of data are combined, there are many chances for errors to creep in. Through data cleaning, these errors can be removed. Clean data, free from wrong and garbage values, helps us perform analysis faster and more efficiently, saving a considerable amount of time. If we use data containing garbage values, the results won't be accurate, and decisions based on inaccurate data will inevitably lead to mistakes. Monitoring errors and good reporting also help us find where errors are coming from, making it easier to fix incorrect or corrupt data in future applications.

2. Data Quality: The quality of data is the degree to which it follows the rules of a particular requirement. For example, suppose we have imported customers' phone numbers and, in some places, email addresses were entered instead. Because the requirement was simply phone numbers, those email addresses are invalid data. Some pieces of data must follow a specific format, some numbers have to fall within a specific range, and some data cells might require a specific type of data, such as numeric or Boolean. In every scenario there are mandatory constraints the data should follow, certain conditions that affect multiple fields of a record, and unique restrictions for particular types of data. If the data is not in the required format, it is invalid. Data cleaning helps us simplify this process and avoid useless data values.

3. Accurate and Efficient: Accuracy means ensuring the data is close to the correct values. Even if most of the data in a dataset is valid, we still need to establish its accuracy: data can be authentic and correctly formatted yet still be inaccurate. Determining accuracy helps us figure out whether the data entered is correct. For example, a customer's address may be stored in the specified format but still not be the right address, an email may contain an extra character that makes it invalid, or a phone number may simply be wrong. This means we have to rely on other data sources to cross-check the data and figure out whether it is accurate. Depending on the kind of data we are using, we may be able to find various resources that help with this kind of verification.

4. Complete Data: Completeness is the degree to which all required values are known. It is a little more challenging to achieve than accuracy or quality, because it is nearly impossible to have all the information we need; only known facts can be entered. We can try to complete data by redoing data gathering activities, such as approaching the clients again or re-interviewing people. For example, we might need every customer's contact information, but some of them might not have email addresses. In that case, we have to leave those columns empty. If we have a system that requires all columns to be filled, we can enter "missing" or "unknown" there, but entering such values does not make the data complete; it is still considered incomplete.

5. Maintains Data Consistency: Consistency means the data agrees within the same dataset and across multiple datasets. We can measure consistency by comparing two similar systems, or by checking whether values within the same dataset agree with each other. Consistency can be relational. For example, a customer's age might be recorded as 25, which is a valid and accurate value, while the same system also flags the customer as a senior citizen. In such cases we have to cross-check the data, similar to measuring accuracy, and see which value is true. Is the client 25 years old, or a senior citizen? Only one of these values can be true. There are multiple ways to keep your data consistent:

By checking in different systems.
By checking the source.
By checking the latest data.

Data Cleaning Cycle

Data cleaning is the method of analyzing, distinguishing, and correcting untidy, raw data. It involves filling in missing values and identifying and fixing errors present in the dataset. While the techniques used for data cleaning may vary with different types of datasets, the following are standard steps to map out data cleaning:

Data cleaning with Pandas

Data scientists spend a huge amount of time cleaning datasets and getting them into a form they can work with. Being able to handle messy data, missing values, and inconsistent, noisy, or nonsensical data is an essential skill for Data Scientists. To make this work smooth, Python provides the Pandas library. Pandas is a popular Python library that is mainly used for data processing tasks such as cleaning, manipulation, and analysis; its name is derived from "panel data", and it is often described as the "Python Data Analysis Library". It provides classes to read, process, and write CSV data files. There are numerous data cleaning tools available, but the Pandas library provides a really fast and efficient way to manage and explore data. It does that by providing Series and DataFrames, which help us not only represent data efficiently but also manipulate it in various ways.

In this article, we will use the Pandas module to clean our dataset.

We are using a simple dataset for data cleaning, the iris species dataset. You can download this dataset from kaggle.com.

Let’s get started with data cleaning step by step.

To start working with Pandas we need to import it. We are using Google Colab as the IDE, so we will import Pandas there.

# importing the module
import pandas as pd

Import Dataset

To import the dataset we use the read_csv() function of Pandas and store the result in a DataFrame named data. Since the dataset is in tabular format, Pandas automatically loads it into a DataFrame. A DataFrame is a two-dimensional, mutable data structure in Python; it is a combination of rows and columns, like an Excel sheet.
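
As a quick illustration of this structure, here is a tiny hand-built DataFrame; the column names and values below are made up to mirror the iris data, not read from the file:

# a small illustrative DataFrame with two rows and two columns
example = pd.DataFrame({
    'SepalLengthCm': [5.1, 4.9],
    'Species': ['Iris-setosa', 'Iris-setosa']
})
print(example)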

# importing the dataset by reading the csv file
data = pd.read_csv('/content/Iris.csv')

# displaying the first five rows of the dataset
data.head()

The head() function is a built-in DataFrame method in Pandas used to display rows of the dataset. We can specify the number of rows by passing a number within the parentheses; by default, it displays the first five rows. If we want to see the last five rows of the dataset, we use the tail() function of the DataFrame like this:

# displaying the last five rows of the dataset
data.tail()
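
As mentioned above, we can also pass an explicit row count to head() or tail(); the number 10 here is just an arbitrary choice:

# display the first ten rows instead of the default five
data.head(10)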

Merge Dataset

Merging datasets is the process of combining two datasets into one, lining up rows based on some particular or common property, for data analysis. We can do this using the merge() function of the DataFrame. Following is the syntax of the merge function:

DataFrame_name.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

[source]

But in this case, we don’t need to merge two datasets. So, we will skip this step.
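
Still, as a minimal sketch of how a merge might look, suppose we had a second lookup table keyed on the Species column; the species_info DataFrame and its Habitat values below are purely illustrative:

# hypothetical lookup table with one row per species (values are illustrative only)
species_info = pd.DataFrame({
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
    'Habitat': ['wetlands', 'meadows', 'swamps']
})

# inner join on the common 'Species' column
merged = data.merge(species_info, how='inner', on='Species')
merged.head()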

Rebuild Missing Data


To find and fill missing data in the dataset we will use a few more functions. There are five ways to check whether null values are present in the dataset. Let's see them one by one:

Using isnull() function:

data.isnull()

This function returns a boolean value for every cell of the dataset, indicating whether that value is null or not.

Using isna() function:

data.isna()

This is the same as the isnull() function and provides the same output.

Using isna().any()

data.isna().any()
This function also reports whether any null values are present, but it gives one boolean result per column rather than the full tabular output.

Using isna().sum()

data.isna().sum()

This function gives the column-wise count of null values present in the dataset.

Using isna().any().sum()

data.isna().any().sum()

This returns a single number, the count of columns that contain at least one null value, so 0 means the dataset has no nulls at all.

There are no null values present in our dataset. But if there were any null values, we could fill those places with another value using the fillna() function of the DataFrame. Following is the syntax of the fillna() function:

DataFrame_name.fillna (value=None, method=None, axis=None, inplace=False, limit=None, downcast=None )

[source]

This function replaces NA/NaN entries with the value we specify, such as 0.
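
For instance, here is a minimal sketch of two common fills, assuming the SepalLengthCm column from the iris dataset as an example numeric column:

# fill missing values in one numeric column with that column's mean
data['SepalLengthCm'] = data['SepalLengthCm'].fillna(data['SepalLengthCm'].mean())

# or fill every remaining null in the DataFrame with 0
data = data.fillna(0)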

Standardization and Normalization


Data standardization and normalization are common practices in machine learning.

Standardization is another scaling technique where the values are centered around the mean with a unit
standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution
has a unit standard deviation.

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging
between 0 and 1. It is also known as Min-Max scaling.

To know more about this click here.

This step is not needed for the dataset we are using. So, we will skip this step.
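
Even so, here is a minimal sketch of both techniques using plain Pandas operations; the list of numeric columns is an assumption about the Kaggle iris CSV layout:

# numeric feature columns assumed for the Kaggle iris CSV
num_cols = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

# standardization: subtract the mean and divide by the standard deviation
standardized = (data[num_cols] - data[num_cols].mean()) / data[num_cols].std()

# normalization (min-max scaling): rescale values to the range 0 to 1
normalized = (data[num_cols] - data[num_cols].min()) / (data[num_cols].max() - data[num_cols].min())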

De-Duplicate

De-duplication means removing all duplicate values. There is no need for duplicate values in data analysis; they only hurt the accuracy and efficiency of the results. To find duplicate values in the dataset we use a simple DataFrame function, duplicated(). Let's see the example:

data.duplicated()

This function returns a boolean value for each row, indicating whether it is a duplicate of an earlier row. As we can see, our dataset doesn't contain any duplicate values.

If a dataset contains duplicate values, they can be removed using the drop_duplicates() function. Following is the syntax of this function:

DataFrame_name.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

[source]
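
If duplicates did exist, removing them might look like the following minimal sketch, which keeps the first occurrence of each duplicated row:

# drop exact duplicate rows, keeping the first occurrence and resetting the index
data = data.drop_duplicates(keep='first', ignore_index=True)

# confirm that no duplicated rows remain
print(data.duplicated().sum())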

Verify and Enrich

After removing null, duplicate, and incorrect values, we should verify the dataset and validate its accuracy. In this step we check that the data cleaned so far actually makes sense. If the data is incomplete, we have to enrich it by repeating data gathering activities such as approaching the clients again or re-interviewing people. Completeness is a little more challenging to achieve than accuracy or quality in a dataset.

Export Dataset

This is the last step of the data cleaning process. After performing all the above operations, the data is transformed into a clean dataset and is ready to be exported for the next stage of Data Science or Data Analysis.
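
A minimal sketch of exporting the cleaned DataFrame back to a CSV file; the output path here is an arbitrary choice:

# write the cleaned data to a new CSV file, without the DataFrame index
data.to_csv('/content/Iris_cleaned.csv', index=False)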
This brings us to the end of this article. I hope you enjoyed it and learned more about the data cleaning process.

Thanks for Reading. Do let me know your comments and feedback in the comment section.

For more articles click here.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Article Url - https://www.analyticsvidhya.com/blog/2021/06/data-cleaning-using-pandas/

neelutiwari
