
Data Cleaning Using Pandas

BEGINNER | DATA CLEANING | PROGRAMMING | PYTHON | STRUCTURED DATA

This article was published as a part of the Data Science Blogathon

Introduction

Data Science is the discipline of extracting insights from huge amounts of data through the use of various scientific methods, algorithms, and processes. To extract useful knowledge, Data Scientists start with raw data: a collection of information gathered from various outside sources and the essential raw material of Data Science. It is also known as primary or source data, and it often contains garbage, irregular, and inconsistent values that lead to many difficulties. The insights and analysis we extract are only as good as the data we are using; essentially, when garbage data goes in, garbage analysis comes out. This is where data cleaning comes into the picture. Data cleaning is an essential part of data science: it is the process of removing incorrect, corrupted, garbage, incorrectly formatted, duplicate, or incomplete data from a dataset.

What is data cleaning?

When working with multiple data sources, there are many chances for data to be incorrect, duplicated, or mislabeled. If the data is wrong, outcomes and algorithms become unreliable, even though they may look correct. Data cleaning is the process of changing or eliminating garbage, incorrect, duplicate, corrupted, or incomplete data in a dataset. There is no single, absolute way to describe the precise steps in the data cleaning process, because the steps vary from dataset to dataset. Data cleaning, also called data cleansing or data scrubbing, is the first step in the overall data preparation process. It plays an important part in developing reliable answers within the analytical process and is considered a basic part of data science fundamentals. The goal of data cleaning is to construct uniform and standardized datasets that allow data analytics and business intelligence tools to easily access accurate data for each problem.

Why is data cleaning essential?

Data cleaning is one of the most important tasks of a data science professional. Wrong or poor-quality data can be detrimental to downstream processes and analysis. Clean data ultimately increases overall productivity and allows the highest quality information to drive decision-making. Following are some reasons why data cleaning is essential:

1. Error-Free Data: When multiple sources of data are combined, there are many chances for errors to creep in. Through data cleaning, these errors can be removed. Clean data, free from wrong and garbage values, helps us perform analysis faster and more efficiently, saving a considerable amount of time. If we use data containing garbage values, the results won't be accurate, and decisions based on inaccurate data will inevitably lead to mistakes. Monitoring errors and good reporting also help us find where errors are coming from, making it easier to fix incorrect or corrupt data in future applications.

2. Data Quality: The quality of data is the degree to which it follows the rules of a particular requirement. For example, suppose we have imported customers' phone numbers and, in some places, email addresses were entered instead. Because the requirement was simply phone numbers, those email addresses are invalid data. Some pieces of data must follow a specific format, some numbers have to fall within a specific range, and some data cells might require a specific type of data, such as numeric or Boolean. In every scenario there are mandatory constraints the data should follow, certain conditions that affect multiple fields of a record, and unique restrictions for particular types of data. If the data is not in the required format, it is invalid. Data cleaning helps us simplify this process and avoid useless data values.

3. Accurate and Efficient: Accuracy means ensuring the data is close to the correct values. Even if most of the data in a dataset is valid, we still need to establish its accuracy: data can be authentic and correctly formatted yet still be inaccurate. Determining accuracy helps us figure out whether the data entered is correct. For example, a customer's address may be stored in the specified format but still not be the right address, an email may contain an extra character that makes it invalid, or a phone number may simply be wrong. This means we have to rely on other data sources to cross-check the data and figure out whether it is accurate. Depending on the kind of data we are using, we may be able to find various resources that help with this kind of verification.

4. Complete Data: Completeness is the degree to which all required values are known. It is a little more challenging to achieve than accuracy or quality, because it is nearly impossible to have all the information we need; only known facts can be entered. We can try to complete data by redoing data gathering activities, such as approaching the clients again or re-interviewing people. For example, we might need every customer's contact information, but some of them might not have email addresses. In that case, we have to leave those columns empty. If we have a system that requires all columns to be filled, we can enter "missing" or "unknown" there, but entering such values does not make the data complete; it is still considered incomplete.

5. Maintains Data Consistency: Consistency means the data agrees within the same dataset and across multiple datasets. We can measure consistency by comparing two similar systems, or by checking whether values within the same dataset agree with each other. Consistency can be relational. For example, a customer's age might be recorded as 25, which is a valid and accurate value, while the same system also flags the customer as a senior citizen. In such cases we have to cross-check the data, similar to measuring accuracy, and see which value is true. Is the client 25 years old, or a senior citizen? Only one of these values can be true. There are multiple ways to keep your data consistent:

By checking in different systems.
By checking the source.
By checking the latest data.

Data Cleaning Cycle

Data cleaning is the method of analyzing, distinguishing, and correcting untidy, raw data. It involves filling in missing values and identifying and fixing errors present in the dataset. While the techniques used for data cleaning may vary with different types of datasets, the following are standard steps to map out data cleaning:

Data cleaning with Pandas

Data scientists spend a huge amount of time cleaning datasets and getting them into a form they can work with. Being able to handle messy data, missing values, and inconsistent, noisy, or nonsensical data is an essential skill for Data Scientists. To make this work smooth, Python provides the Pandas library. Pandas is a popular Python library that is mainly used for data processing tasks such as cleaning, manipulation, and analysis; its name is derived from "panel data", and it is often described as the "Python Data Analysis Library". It provides classes to read, process, and write CSV data files. There are numerous data cleaning tools available, but the Pandas library provides a really fast and efficient way to manage and explore data. It does that by providing Series and DataFrames, which help us not only represent data efficiently but also manipulate it in various ways.

In this article, we will use the Pandas module to clean our dataset.

We are using a simple dataset for data cleaning, the iris species dataset. You can download this dataset from kaggle.com.

Let’s get started with data cleaning step by step.

To start working with Pandas we need to import it. We are using Google Colab as the IDE, so we will import Pandas there.

# importing the module
import pandas as pd

Import Dataset

To import the dataset we use the read_csv() function of Pandas and store the result in a DataFrame named data. Since the dataset is in tabular format, Pandas automatically loads it into a DataFrame. A DataFrame is a two-dimensional, mutable data structure in Python; it is a combination of rows and columns, like an Excel sheet.
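
As a quick illustration of this structure, here is a tiny hand-built DataFrame; the column names and values below are made up to mirror the iris data, not read from the file:

# a small illustrative DataFrame with two rows and two columns
example = pd.DataFrame({
    'SepalLengthCm': [5.1, 4.9],
    'Species': ['Iris-setosa', 'Iris-setosa']
})
print(example)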

# importing the dataset by reading the csv file
data = pd.read_csv('/content/Iris.csv')

# displaying the first five rows of the dataset
data.head()

The head() function is a built-in DataFrame method in Pandas used to display rows of the dataset. We can specify the number of rows by passing a number within the parentheses; by default, it displays the first five rows. If we want to see the last five rows of the dataset, we use the tail() function of the DataFrame like this:

# displaying the last five rows of the dataset
data.tail()
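
As mentioned above, we can also pass an explicit row count to head() or tail(); the number 10 here is just an arbitrary choice:

# display the first ten rows instead of the default five
data.head(10)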

Merge Dataset

Merging datasets is the process of combining two datasets into one, lining up rows based on some particular or common property, for data analysis. We can do this using the merge() function of the DataFrame. Following is the syntax of the merge function:

DataFrame_name.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

[source]

But in this case, we don’t need to merge two datasets. So, we will skip this step.
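
Still, as a minimal sketch of how a merge might look, suppose we had a second lookup table keyed on the Species column; the species_info DataFrame and its Habitat values below are purely illustrative:

# hypothetical lookup table with one row per species (values are illustrative only)
species_info = pd.DataFrame({
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
    'Habitat': ['wetlands', 'meadows', 'swamps']
})

# inner join on the common 'Species' column
merged = data.merge(species_info, how='inner', on='Species')
merged.head()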

Rebuild Missing Data


To find and fill missing data in the dataset we will use a few more functions. There are five ways to check whether null values are present in the dataset. Let's see them one by one:

Using isnull() function:

data.isnull()

This function returns a boolean value for every cell of the dataset, indicating whether that value is null or not.

Using isna() function:

data.isna()

This is the same as the isnull() function and provides the same output.

Using isna().any()

data.isna().any()
This function also reports whether any null values are present, but it gives one boolean result per column rather than the full tabular output.

Using isna().sum()

data.isna().sum()

This function gives the column-wise count of null values present in the dataset.

Using isna().any().sum()

data.isna().any().sum()

This returns a single number, the count of columns that contain at least one null value, so 0 means the dataset has no nulls at all.

There are no null values present in our dataset. But if there were any null values, we could fill those places with another value using the fillna() function of the DataFrame. Following is the syntax of the fillna() function:

DataFrame_name.fillna (value=None, method=None, axis=None, inplace=False, limit=None, downcast=None )

[source]

This function replaces NA/NaN entries with the value we specify, such as 0.
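
For instance, here is a minimal sketch of two common fills, assuming the SepalLengthCm column from the iris dataset as an example numeric column:

# fill missing values in one numeric column with that column's mean
data['SepalLengthCm'] = data['SepalLengthCm'].fillna(data['SepalLengthCm'].mean())

# or fill every remaining null in the DataFrame with 0
data = data.fillna(0)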

Standardization and Normalization


Data standardization and normalization are common practices in machine learning.

Standardization is another scaling technique where the values are centered around the mean with a unit
standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution
has a unit standard deviation.

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging
between 0 and 1. It is also known as Min-Max scaling.

To know more about this click here.

This step is not needed for the dataset we are using. So, we will skip this step.
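
Even so, here is a minimal sketch of both techniques using plain Pandas operations; the list of numeric columns is an assumption about the Kaggle iris CSV layout:

# numeric feature columns assumed for the Kaggle iris CSV
num_cols = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

# standardization: subtract the mean and divide by the standard deviation
standardized = (data[num_cols] - data[num_cols].mean()) / data[num_cols].std()

# normalization (min-max scaling): rescale values to the range 0 to 1
normalized = (data[num_cols] - data[num_cols].min()) / (data[num_cols].max() - data[num_cols].min())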

De-Duplicate

De-duplication means removing all duplicate values. There is no need for duplicate values in data analysis; they only hurt the accuracy and efficiency of the results. To find duplicate values in the dataset we use a simple DataFrame function, duplicated(). Let's see the example:

data.duplicated()

This function returns a boolean value for each row, indicating whether it is a duplicate of an earlier row. As we can see, our dataset doesn't contain any duplicate values.

If a dataset contains duplicate values, they can be removed using the drop_duplicates() function. Following is the syntax of this function:

DataFrame_name.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

[source]
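
If duplicates did exist, removing them might look like the following minimal sketch, which keeps the first occurrence of each duplicated row:

# drop exact duplicate rows, keeping the first occurrence and resetting the index
data = data.drop_duplicates(keep='first', ignore_index=True)

# confirm that no duplicated rows remain
print(data.duplicated().sum())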

Verify and Enrich

After removing null, duplicate, and incorrect values, we should verify the dataset and validate its accuracy. In this step we check that the data cleaned so far actually makes sense. If the data is incomplete, we have to enrich it by repeating data gathering activities such as approaching the clients again or re-interviewing people. Completeness is a little more challenging to achieve than accuracy or quality in a dataset.

Export Dataset

This is the last step of the data cleaning process. After performing all the above operations, the data is transformed into a clean dataset and is ready to be exported for the next stage of Data Science or Data Analysis.
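
A minimal sketch of exporting the cleaned DataFrame back to a CSV file; the output path here is an arbitrary choice:

# write the cleaned data to a new CSV file, without the DataFrame index
data.to_csv('/content/Iris_cleaned.csv', index=False)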
This brings us to the end of this article. I hope you enjoyed it and learned more about the data cleaning process.

Thanks for Reading. Do let me know your comments and feedback in the comment section.

For more articles click here.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Article Url - https://www.analyticsvidhya.com/blog/2021/06/data-cleaning-using-pandas/

neelutiwari
