
Data Preprocessing and Cleaning

Data preprocessing involves transforming a raw dataset into an understandable format. Preprocessing is a fundamental stage in data mining that improves data quality and efficiency, and the preprocessing methods used directly affect the outcome of any analytic algorithm.

1. DATA CLEANING
The first step of data preprocessing is data cleaning. Most of the data we work with today is not clean and requires a substantial amount of cleaning: some datasets have missing values and some contain junk data. If these missing values and inconsistencies are not handled properly, our model will not give accurate results.

So, before getting into the nitty-gritty details of data cleaning, let's get a high-level understanding of the possible problems we face in real-world data.

MISSING VALUES:

Handling missing values is crucial when building a model. If they are not treated properly, missing values can break the model by causing it to predict inaccurately. Let's check the example below to understand this better.

Consider a dataset used to predict the graduate admission of students. It has some missing values that are critical for predicting admissions: some of the records are missing the GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA and Research fields, which are important features for predicting whether a student will be admitted.

HANDLING MISSING DATA:

To build a robust model that handles complex tasks, we need to handle missing data efficiently. There are many ways of handling missing data; some of them are as follows:

METHOD 1. REMOVING THE DATA

The first thing we should do is check whether the dataset has any missing values, because a model cannot accept them. One common and easy way to handle missing values is to delete an entire row if it contains any missing value, or to delete an entire column if roughly 70 to 75% of its data is missing. This percentage limit is not fixed and mostly depends on the kind of data and the features in the dataset (a sketch of this threshold-based column drop appears after the walkthrough below).

The advantage of this method is that it is a quick and easy way of fixing the missing-values issue. But it is not always the go-to method, as you might sometimes end up losing critical information by deleting rows or features.
1. Load the dataset
In [1]:

import pandas as pd
import numpy as np
In [2]:

df = pd.read_csv("Banking_Marketing.csv")
In [3]:

df.head()

2. Check the datatype for each column

df.dtypes

age float64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration float64
campaign int64
pdays int64
previous int64
poutcome object
emp_var_rate float64
cons_price_idx float64
cons_conf_idx float64
euribor3m float64
nr_employed float64
y int64
dtype: object

3. Finding the missing values in each column

df.isna().sum()
age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

4. Dropping/Deleting the rows containing the missing values.

datadrop = df.dropna()

5. Checking to see the NA's after deletion of the rows.


datadrop.isna().sum()
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64
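For the column-level rule mentioned earlier (dropping a column when most of its values are missing), a minimal sketch is shown below; the 0.70 threshold is an assumption you would tune for your own data.

import pandas as pd

df = pd.read_csv("Banking_Marketing.csv")

# Fraction of missing values in each column
missing_ratio = df.isna().mean()

# Drop every column whose missing fraction exceeds the chosen threshold
threshold = 0.70  # assumed cut-off; adjust for your dataset
cols_to_drop = missing_ratio[missing_ratio > threshold].index
datadrop_cols = df.drop(columns=cols_to_drop)

print("Dropped columns:", list(cols_to_drop))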

METHOD 2: MEAN/MEDIAN/MODE IMPUTATION

In this method, we replace missing values with the mean, median, or mode of the column.

1. For numerical data, we can compute the mean or median of the column and use the result to replace the missing values.
2. For categorical (non-numerical) data, we can compute the mode of the column and use it to replace the missing values.

This process is known as mean/median/mode imputation.

The advantage of this method is that no rows are removed, which prevents data loss.

The drawback is that you cannot be sure how accurate the mean, median, or mode replacement will be in a given situation.

Method 2: Imputing Missing Data


1. Loading the data
In [2]:

import pandas as pd
import numpy as np
In [3]:

df = pd.read_csv("Banking_Marketing.csv")
In [4]:

df.head()

df.isna().sum()
age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

2. Calculate the mean of the age column


In [6]:

age_mean = df.age.mean()

In [7]:
print("Mean of age column: ",age_mean)
Mean of age column: 40.023812413525256

3. Impute the missing data in the age column with the mean age value
In [8]:
df.age.fillna(age_mean,inplace=True)

df.isna().sum()

age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

4. Checking all the records in the dataset for which the 'duration' column is
NA
df[df['duration'].isnull()]

5. Sort the values of duration in descending order to check the largest values


In [11]:

df['duration'].sort_values(ascending=False).head()
Out[11]:
7802 4918.0
18610 4199.0
32880 3785.0
1974 3643.0
10633 3631.0
Name: duration, dtype: float64

6. Calculate the median value of the duration column

duration_med= df.duration.median()

print("The median of duration is: ",duration_med)

The median of duration is: 180.0

7. Imputing the median value of duration to all the NA fields in duration


df.duration.fillna(duration_med,inplace=True)
df.isna().sum()

age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

8. In the steps above, the 'duration' and 'age' columns were numerical, so we used the mean and median to impute their missing values. The 'contact' column, however, is categorical, so we will use the mode to impute its missing values.

df['contact'].unique()

out:
array(['cellular', 'telephone', nan], dtype=object)

contact_mode = df.contact.mode()[0]
print("Mode for contact: ",contact_mode)

Mode for contact: cellular

9. Imputing the mode value of 'contact' to all the NA fields in 'contact'


df.contact.fillna(contact_mode,inplace=True)
In [20]:
df.isna().sum()
Out[20]:
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64
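As an alternative sketch (not part of the original walkthrough), scikit-learn's SimpleImputer performs the same mean/median/mode imputation in fewer steps and can be reused on new data; the column choices below mirror the steps above.

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("Banking_Marketing.csv")

# Mean imputation for the numeric 'age' column
df[['age']] = SimpleImputer(strategy='mean').fit_transform(df[['age']])

# Median imputation for the numeric 'duration' column
df[['duration']] = SimpleImputer(strategy='median').fit_transform(df[['duration']])

# Mode (most frequent) imputation for the categorical 'contact' column
df[['contact']] = SimpleImputer(strategy='most_frequent').fit_transform(df[['contact']])

print(df.isna().sum())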

Understanding Data Processing

Data processing is the task of converting data from a given form into a much more usable and desired form, i.e. making it more meaningful and informative. Using machine learning algorithms, mathematical modeling, and statistical knowledge, this entire process can be automated. The output of this process can take any desired form, such as graphs, videos, charts, tables, or images, depending on the task we are performing and the requirements of the machine. This might seem simple, but for massive organizations like Twitter and Facebook, administrative bodies like Parliament and UNESCO, and health-sector organizations, the entire process needs to be performed in a very structured manner.
Data processing is a crucial step in the machine learning (ML) pipeline, as it prepares the data for use in building and training ML models. The goal of data processing is to clean, transform, and prepare the data in a format that is suitable for modeling.
The main steps involved in data processing typically include:
1. Data collection: This is the process of gathering data from various sources, such as sensors, databases, or other systems. The data may be structured or unstructured, and may come in various formats such as text, images, or audio.
2. Data preprocessing: This step involves cleaning, filtering, and transforming the data to make it suitable for further analysis. This may include removing missing values, scaling or normalizing the data, or converting it to a different format.
3. Data analysis: In this step, the data is analyzed using various techniques such as statistical analysis, machine learning algorithms, or data visualization. The goal of this step is to derive insights or knowledge from the data.
4. Data interpretation: This step involves interpreting the results of the data analysis and drawing conclusions based on the insights gained. It may also involve presenting the findings in a clear and concise manner, such as through reports, dashboards, or other visualizations.
5. Data storage and management: Once the data has been processed and analyzed, it must be stored and managed in a way that is secure and easily accessible. This may involve storing the data in a database, cloud storage, or other systems, and implementing backup and recovery strategies to protect against data loss.
6. Data visualization and reporting: Finally, the results of the data analysis are presented to stakeholders in a format that is easily understandable and actionable. This may involve creating visualizations, reports, or dashboards that highlight key findings and trends in the data.

 Collection :
The most crucial step when starting with ML is to have data of good
quality and accuracy. Data can be collected from any authenticated source
like data.gov.in, Kaggle or UCI dataset repository. For example, while
preparing for a competitive exam, students study from the best study
material that they can access so that they learn the best to obtain the best
results. In the same way, high-quality and accurate data will make the
learning process of the model easier and better and at the time of testing,
the model would yield state-of-the-art results.
A huge amount of capital, time and resources are consumed in collecting
data. Organizations or researchers have to decide what kind of data they
need to execute their tasks or research.
Example: building a facial expression recognizer needs numerous images covering a variety of human expressions. Good data ensures that the results of the model are valid and can be trusted.

 Preparation :
The collected data can be in a raw form which can’t be directly fed to the
machine. So, this is a process of collecting datasets from different sources,
analyzing these datasets and then constructing a new dataset for further
processing and exploration. This preparation can be performed either manually or automatically. Data can also be prepared in numeric form, which speeds up the model's learning.
Example: an image can be converted to a matrix of N x N dimensions, where the value of each cell indicates the intensity of an image pixel.
 Input :
The prepared data may still be in a form that is not machine-readable, so conversion algorithms are needed to convert it into a readable form. Executing this task requires high computation power and accuracy. Example: data can be collected from sources like the MNIST digit dataset (images), Twitter comments, audio files, and video clips.
 Processing :
This is the stage where algorithms and ML techniques are required to
perform the instructions provided over a large volume of data with
accuracy and optimal computation.
 Output :
In this stage, results are produced by the machine in a meaningful form that can be easily interpreted by the user. Output can be in the form of reports, graphs, videos, etc.
 Storage :
This is the final step, in which the obtained output, the data model, and all other useful information are saved for future use.

Advantages of data processing in Machine Learning:

1. Improved model performance: Data processing helps improve the performance of the ML model by cleaning and transforming the data into a format that is suitable for modeling.
2. Better representation of the data: Data processing allows the data to be transformed into a format that better represents the underlying relationships and patterns in the data, making it easier for the ML model to learn from the data.
3. Increased accuracy: Data processing helps ensure that the data is accurate, consistent, and free of errors, which can help improve the accuracy of the ML model.
Disadvantages of data processing in Machine Learning:

1. Time-consuming: Data processing can be a time-consuming task, especially for large and complex datasets.
2. Error-prone: Data processing can be error-prone, as it involves transforming and cleaning the data, which can result in the loss of important information or the introduction of new errors.
3. Limited understanding of the data: Data processing can lead to a limited understanding of the data, as the transformed data may not be representative of the underlying relationships and patterns in the data.

Overview of Data Cleaning


Data cleaning is one of the important parts of machine learning. It plays a
significant part in building a model. It surely isn’t the fanciest part of machine
learning and at the same time, there aren’t any hidden tricks or secrets to uncover.
However, the success or failure of a project relies on proper data cleaning.
Professional data scientists usually invest a very large portion of their time in this
step because of the belief that “Better data beats fancier algorithms”.
If we have a well-cleaned dataset, we can often achieve good results even with simple algorithms, which can be very beneficial at times, especially in terms of computation when the dataset is large. Obviously, different types of data will require different types of cleaning. However, the systematic approach laid out here can always serve as a good starting point.
Steps Involved in Data Cleaning
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data. The goal of data
cleaning is to ensure that the data is accurate, consistent, and free of errors, as
incorrect or inconsistent data can negatively impact the performance of the ML
model.
Data cleaning, also known as data cleansing or data preprocessing, is a crucial
step in the data science pipeline that involves identifying and correcting or removing
errors, inconsistencies, and inaccuracies in the data to improve its quality and
usability. Data cleaning is essential because raw data is often noisy, incomplete, and
inconsistent, which can negatively impact the accuracy and reliability of the insights
derived from it.

The following are the most common steps involved in data cleaning:
 Import the necessary libraries
 Load the dataset
 Check the data information using df.info()

Processing CSV Data


import pandas as pd
import numpy as np

# Load the dataset


df = pd.read_csv('input.csv')
df.head()

id name salary start_date dept


0 1 Rick 623.30 2012-01-01 IT
1 2 Dan 515.20 2013-09-23 Operations
2 3 Tusar 611.00 2014-11-15 IT
3 4 Ryan 729.00 2014-05-11 HR
4 5 Gary 843.25 2015-03-27 Finance
5 6 Rasmi 578.00 2013-05-21 IT
6 7 Pranab 632.80 2013-07-30 Operations
7 8 Guru 722.50 2014-06-17 Finance

Reading Specific Rows


The read_csv function of the pandas library can also be used to read specific rows for a given column. We slice the result returned by read_csv, as shown below, to get the first 5 rows of the column named salary.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Slice the result for the first 5 rows

print (data[0:5]['salary'])

When we execute the above code, it produces the following


result.

0 623.30
1 515.20
2 611.00
3 729.00
4 843.25
Name: salary, dtype: float64

Reading Specific Columns


The read_csv function of the pandas library can also be used to read specific columns. We use the multi-axes indexer .loc[] for this purpose. Here we choose to display the salary and name columns for all the rows.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function

print (data.loc[:,['salary','name']])

When we execute the above code, it produces the following


result.

salary name
0 623.30 Rick
1 515.20 Dan
2 611.00 Tusar
3 729.00 Ryan
4 843.25 Gary
5 578.00 Rasmi
6 632.80 Pranab
7 722.50 Guru

Reading Specific Columns and Rows


The read_csv function of the pandas library can also be used to read specific columns and specific rows. We use the multi-axes indexer .loc[] for this purpose. Here we choose to display the salary and name columns for some of the rows.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function

print (data.loc[[1,3,5],['salary','name']])

When we execute the above code, it produces the following


result.

salary name
1 515.2 Dan
3 729.0 Ryan
5 578.0 Rasmi

Reading Specific Columns for a Range of Rows


The read_csv function of the pandas library can also be used to read specific columns for a range of rows. We use the multi-axes indexer .loc[] for this purpose. Here we choose to display the salary and name columns for a range of rows.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function

print (data.loc[2:6,['salary','name']])

When we execute the above code, it produces the following


result.
salary name
2 611.00 Tusar
3 729.00 Ryan
4 843.25 Gary
5 578.00 Rasmi
6 632.80 Pranab

Detect and Remove the Outliers using Python

An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis used for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and removing them works the same way as removing any other data item from a pandas DataFrame.

Dataset Used For Outlier Detection

The dataset used in this article is the Diabetes dataset and it is preloaded in the
sklearn library.
Importing
import sklearn
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset


diabetics = load_diabetes()

# Create the dataframe


column_name = diabetics.feature_names
df_diabetics = pd.DataFrame(diabetics.data)
df_diabetics.columns = column_name
df_diabetics.head()
Outliers can be detected using visualization, implementing mathematical formulas
on the dataset, or using the statistical approach. All of these are discussed below.

Outliers Visualization

Visualizing Outliers Using Box Plot

A box plot (also known as a box-and-whisker plot) displays a summary of a set of data values through five properties: the minimum, first quartile, median, third quartile, and maximum. A box is drawn from the first quartile to the third quartile, and a line through the box marks the median.
A box plot is primarily used to indicate whether a distribution is skewed and whether there are potential unusual observations (outliers) present in the data set. Box plots are also very beneficial when large numbers of data sets are involved or compared.

Parts of Box Plots


The main parts of a box plot are the minimum, maximum, first quartile, third quartile, median, and the outliers.
Minimum: The minimum value in the given dataset

First Quartile (Q1): The first quartile is the median of the lower half of the data set.

Median: The median is the middle value of the dataset, which divides the given dataset into
two equal parts. The median is considered as the second quartile.

Third Quartile (Q3): The third quartile is the median of the upper half of the data.

Maximum: The maximum value in the given dataset.

Apart from these five terms, the other terms used in the box plot are:

Interquartile Range (IQR): The difference between the third quartile and the first quartile is known as the interquartile range, i.e. IQR = Q3 - Q1.

Outlier: Data that falls on the far left or right side of the ordered data is tested as an outlier. Generally, outliers fall more than a specified distance from the first and third quartiles,

i.e. outliers are greater than Q3 + (1.5 * IQR) or less than Q1 - (1.5 * IQR)

import seaborn as sns


sns.boxplot(df_diabetics['bmi'])
Outliers present in the bmi column

In the above graph, we can clearly see that values above roughly 0.12 are acting as outliers.
# Position of the Outlier
import numpy as np
print(np.where(df_diabetics['bmi']>0.12))

output:

(array([ 32, 145, 256, 262, 366, 367, 405]),)

Visualizing Outliers Using ScatterPlot.


A scatter plot is used when you have paired numerical data, when the dependent variable has multiple values for each value of the independent variable, or when you are trying to determine the relationship between two variables. In the process of using a scatter plot, one can also use it for outlier detection.

# Scatter plot
fig, ax = plt.subplots(figsize = (6,4))
ax.scatter(df_diabetics['bmi'],df_diabetics['bp'])

# x-axis label
ax.set_xlabel('(body mass index of people)')
# y-axis label
ax.set_ylabel('(bp of the people )')
plt.show()

Looking at the graph, we can see that most of the data points lie in the bottom-left region of the plot, while a few points lie in the opposite, top-right region; those points in the top-right corner can be regarded as outliers. As an approximation, the data points with a bmi value greater than about 0.12 can be treated as outliers.
Outliers in BMI and BP Column Combined
Python3
# Position of the Outlier
print(np.where((df_diabetics['bmi']>0.12) & (df_diabetics['bp']<0.8)))

Output:
(array([ 32, 145, 256, 262, 366, 367, 405]),)

IQR (Inter Quartile Range)

The interquartile range (IQR) approach to finding outliers is the most commonly used and most trusted approach in the research field.
IQR = Quartile3 - Quartile1

Python3
# IQR

Q1 = np.percentile(df_diabetics['bmi'], 25, method='midpoint')

Q3 = np.percentile(df_diabetics['bmi'], 75, method='midpoint')

IQR = Q3 - Q1

print(IQR)

Output:
0.06520763046978838
Syntax: numpy.percentile(arr, n, axis=None, out=None)
Parameters:
arr: input array.
n: percentile value.
To flag outliers, an upper and a lower bound are defined above and below the dataset's normal range, at a distance of 1.5*IQR from the quartiles:
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
For a Gaussian distribution, the 1.5*IQR rule corresponds roughly to keeping data within about 2.7 standard deviations of the mean.

Python3
# Above Upper bound

upper=Q3+1.5*IQR

upper_array=np.array(df_diabetics['bmi']>=upper)

print("Upper Bound:",upper)

print(upper_array.sum())

#Below Lower bound

lower=Q1-1.5*IQR

lower_array=np.array(df_diabetics['bmi']<=lower)

print("Lower Bound:",lower)

print(lower_array.sum())

Output:
Upper Bound: 0.12879000811776306
3
Lower Bound: -0.13204051376139045
0

Removing the outliers


To remove an outlier, we drop the corresponding entry from the dataset using its exact position (index), because all of the detection methods above ultimately return the list of positions of the data items that satisfy the outlier definition for the chosen method.
dataframe.drop(row_index, inplace=True)
The above call drops a row from the dataset given the row index (or indexes) to be dropped. inplace=True tells pandas to make the change in the original DataFrame. row_index can be a single value, a list of values, or a NumPy array, but it must be one-dimensional.

Example:
# drop the rows whose 'bmi' value was flagged as an outlier earlier
outlier_rows = np.where(df_diabetics['bmi'] > 0.12)[0]
df_diabetics.drop(outlier_rows, inplace=True)
Full Code: Detecting the outliers using IQR and removing them.
Python3
# Importing
import sklearn
from sklearn.datasets import load_diabetes
import pandas as pd
import numpy as np

# Load the dataset
diabetes = load_diabetes()

# Create the dataframe
column_name = diabetes.feature_names
df_diabetes = pd.DataFrame(diabetes.data)
df_diabetes.columns = column_name
df_diabetes.head()

print("Old Shape: ", df_diabetes.shape)


''' Detection '''

# IQR

# Calculate the upper and lower limits

Q1 = df_diabetes['bmi'].quantile(0.25)

Q3 = df_diabetes['bmi'].quantile(0.75)

IQR = Q3 - Q1

lower = Q1 - 1.5*IQR

upper = Q3 + 1.5*IQR

# Create arrays of Boolean values indicating the outlier rows

upper_array = np.where(df_diabetes['bmi']>=upper)[0]

lower_array = np.where(df_diabetes['bmi']<=lower)[0]

# Removing the outliers

df_diabetes.drop(index=upper_array, inplace=True)

df_diabetes.drop(index=lower_array, inplace=True)

# Print the new shape of the DataFrame

print("New Shape: ", df_diabetes.shape)

Output:
Old Shape: (442, 10)
New Shape: (439, 10)

Sampling distribution Using Python

There are different types of distributions that we study in statistics, like the normal/Gaussian distribution, the exponential distribution, the binomial distribution, and many others. Here we will study one such distribution: the sampling distribution.
Suppose we have some data. If we repeatedly sample a finite number of data points from it, calculate some statistical measure on each sample, and do this n times, then the distribution curve of those sample statistics is known as the sampling distribution.
A special case related to the sampling distribution is the Central Limit Theorem, which says that if we take many samples from a distribution of data (no matter how it is distributed) and draw the distribution curve of the sample means, the result will be approximately a normal distribution.
Let's understand it using an example.
Let's take the numbers 0 to 9 and use them as our primary data.

import numpy as np
num = np.arange(10)
num

Output:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Now, let's form every possible pair (with repetition) from a small subset of the data, take the average of each pair, and maintain a dictionary with the sample means and the number of times they appear.
sample_freq = {}

# Select every pair possible (with repetition) from the values 1 to 4
for i in range(1, 5):
    for j in range(1, 5):
        mean_of_two = (num[i] + num[j]) / 2

        if mean_of_two in sample_freq:
            # Update the count for a mean value that already exists
            sample_freq[mean_of_two] += 1
        else:
            # Add a new key to the dictionary if it is not there yet
            sample_freq[mean_of_two] = 1

sample_freq
Output:
{1.0: 1, 1.5: 2, 2.0: 3, 2.5: 4, 3.0: 3, 3.5: 2, 4.0: 1}
Now, let’s plot the sample statistics to visualize its distribution.

 Python3
import matplotlib.pyplot as plt
plt.scatter(sample_freq.keys(), sample_freq.values())
plt.show()

From the above graph, we can observe that the distribution of the sample statistic is symmetric; if we took many more such random samples, we would observe that the distribution formed approaches a normal/Gaussian distribution.
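To make the Central Limit Theorem more visible than the tiny pair-averaging example above, here is a minimal sketch (with assumed sample sizes) that draws many samples from a non-normal uniform distribution and plots the histogram of their means.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Draw 1000 samples of size 30 from a uniform (non-normal) distribution
# and record the mean of each sample
sample_means = [rng.uniform(0, 10, size=30).mean() for _ in range(1000)]

# The histogram of the sample means is approximately bell-shaped
plt.hist(sample_means, bins=30)
plt.xlabel('Sample mean')
plt.ylabel('Frequency')
plt.show()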

Data Normalization with Pandas


 Pandas: Pandas is an open-source library built on top of the NumPy library. It is a Python package that provides various data structures and operations for manipulating numerical data and statistics. It is popular mainly because it makes importing and analyzing data much easier, and it is fast, high-performance, and productive for users.
 Data Normalization: Data normalization is a typical practice in machine learning that consists of transforming numeric columns to a common scale. In machine learning, some feature values differ from others by multiple orders of magnitude, and the features with higher values will dominate the learning process.

Steps Needed

Here, we will apply some techniques to normalize the data and discuss these with the
help of examples. For this, let’s understand the steps needed for data normalization
with Pandas.
1. Import Library (Pandas)
2. Import / Load / Create data.
3. Use the technique to normalize the data.
Examples
Here, we create data by some random values and apply some normalization
techniques to it.

Python3
# importing packages

import pandas as pd

# create data

df = pd.DataFrame([

[180000, 110, 18.9, 1400],

[360000, 905, 23.4, 1800],

[230000, 230, 14.0, 1300],

[60000, 450, 13.5, 1500]],

columns=['Col A', 'Col B',

'Col C', 'Col D'])

# view data

display(df)

Output:

import matplotlib.pyplot as plt


df.plot(kind = 'bar')
Let’s apply normalization techniques one by one.

Using The maximum absolute scaling

The maximum absolute scaling rescales each feature between -1 and 1 by dividing
every observation by its maximum absolute value. We can apply the maximum
absolute scaling in Pandas using the .max() and .abs() methods, as shown below.

Python3
# copy the data
df_max_scaled = df.copy()

# apply normalization techniques


for column in df_max_scaled.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()

# view normalized data


display(df_max_scaled)

Output:
See the plot of this dataframe:

Python3
import matplotlib.pyplot as plt
df_max_scaled.plot(kind = 'bar')

Output:

Using The min-max feature scaling

The min-max approach (often called normalization) rescales each feature to a fixed range of [0, 1] by subtracting the minimum value of the feature and then dividing by the range. We can apply min-max scaling in Pandas using the .min() and .max() methods.

Python3
# copy the data
df_min_max_scaled = df.copy()

# apply normalization techniques


for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min())

# view normalized data


print(df_min_max_scaled)

Output :

Let’s draw a plot with this dataframe:

Python3
import matplotlib.pyplot as plt
df_min_max_scaled.plot(kind = 'bar')
Using The z-score method

The z-score method (often called standardization) transforms the data into a distribution with a mean of 0 and a standard deviation of 1. Each standardized value is computed by subtracting the mean of the corresponding feature and then dividing by the standard deviation.

Python3
# copy the data
df_z_scaled = df.copy()

# apply normalization techniques


for column in df_z_scaled.columns:
    df_z_scaled[column] = (df_z_scaled[column] - df_z_scaled[column].mean()) / df_z_scaled[column].std()

# view normalized data


display(df_z_scaled)

Output :

Python3
import matplotlib.pyplot as plt
df_z_scaled.plot(kind='bar')
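The same three techniques are also available as scikit-learn transformers; a minimal sketch on the same dataframe is shown below. Note that StandardScaler divides by the population standard deviation, so its values differ slightly from the pandas .std() version above.

import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

df = pd.DataFrame([[180000, 110, 18.9, 1400],
                   [360000, 905, 23.4, 1800],
                   [230000, 230, 14.0, 1300],
                   [60000, 450, 13.5, 1500]],
                  columns=['Col A', 'Col B', 'Col C', 'Col D'])

# Maximum absolute scaling, min-max scaling and z-score standardization
df_maxabs = pd.DataFrame(MaxAbsScaler().fit_transform(df), columns=df.columns)
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
df_zscore = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(df_minmax)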
Data Manipulation with Python
Data manipulation in Python is the process of organizing data so that reading or interpreting insights from it becomes more structured and better designed. For example, arranging employees' names in alphabetical order enables quicker searching of a particular employee by name. The key benefits of data manipulation are faster business operations and better optimization of processes. Through properly manipulated data one can analyze trends, interpret insights from financial data, and analyze consumer behaviour or patterns. It also enables users to discard unnecessary data so that only important and necessary data occupies the limited space. In this article, we will look into the different methods of data manipulation in Python along with examples.

Pandas

Pandas is an open-source data analysis and data manipulation library written in Python. Pandas provides data structures and functions to work on structured data seamlessly. The name Pandas refers to "Panel Data", which means a structured dataset. Pandas has two main classes to work with, DataFrame and Series; let us explore these later in this article.

Key Features of Pandas

 Perform Group by operation seamlessly


 Datasets are mutable using pandas which means we can add new rows and
columns to them.
 Easy to handle missing data
 Merge and join datasets
 Indexing and subsetting data

Installation
Install via pip using the following command,

pip install pandas

Install via anaconda using the following command,

conda install pandas


DataFrame in Pandas

A DataFrame is a two-dimensional table in pandas. Each column can have different


data types like int, float, or string. Each column is of class Series in pandas; we'll discuss this later in this article.

Creating a DataFrame in Pandas

# import the library as pd


import pandas as pd
df = pd.DataFrame(
{
'Name': ['Srivignesh', 'Hari'],
'Age': [22, 11],
'Country': ['India', 'India']
}
)
print(df)
# output
# Name Age Country
# 0 Srivignesh 22 India
# 1 Hari 11 India

pd.DataFrame is a class available in pandas. Here we provide a dictionary whose


keys are the column names (‘Name’, ‘Age’, ‘Country’) and the values are the values
in those columns. Here each column is of class pandas.Series. Series is a one-
dimensional data used in pandas.

# accessing the column 'Name' in df


print(df['Name'])
# Output
# 0 Srivignesh
# 1 Hari
# Name: Name, dtype: object
print(type(df['Name']))
# Output
# <class 'pandas.core.series.Series'>

Let’s get started with Data Manipulation using Pandas!

For this purpose, we are going to use Titanic Dataset which is available on Kaggle.

import pandas as pd
path_to_data = 'path/to/titanic_dataset'
# read the csv data using pd.read_csv function
data = pd.read_csv(path_to_data)
data.head()
Dropping columns in the data
df_dropped = data.drop('Survived', axis=1)
df_dropped.head()

The 'Survived' column is dropped from the data. axis=1 denotes that 'Survived' is a column, so pandas searches for 'Survived' among the columns and drops it.

Drop multiple columns using the following code,

df_dropped_multiple = data.drop(['Survived', 'Name'], axis=1)


df_dropped_multiple.head()

The columns ‘Survived’ and ‘Name’ are dropped in the data.

Dropping rows in the data


df_row_dropped = data.drop(2, axis=0)
df_row_dropped.head()
The row with index 2 is dropped from the data. axis=0 denotes that index 2 refers to a row, so pandas searches for index 2 among the row labels and drops it.

Drop multiple rows using the following code,

df_row_dropped_multiple = data.drop([2, 3], axis=0)


df_row_dropped_multiple.head()

The rows with indexes 2 and 3 are dropped in the data.

Renaming a column in the dataset


data.columns
# Output
# Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
#        'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
#       dtype='object')
df_renamed = data.rename(columns={'PassengerId': 'Id'})
df_renamed.head()

The column ‘PassengerId’ is renamed to ‘Id’ in the data. Do not forget to mention
the dictionary inside the columns parameter.

Rename multiple columns using the following code,


df_renamed_multiple = data.rename(
columns={
'PassengerId': 'Id',
'Sex': 'Gender',
}
)
df_renamed_multiple.head()

The columns ‘PassengerId’ and ‘Sex’ are renamed to ‘Id’ and ‘Gender’
respectively.

Select columns with specific data types


integer_data = data.select_dtypes('int')
integer_data.head()

The above code selects all columns with integer data types.

float_data = data.select_dtypes('float')
float_data.head()
The above code selects all columns with float data types.

Slicing the dataset


data.iloc[:5, 0]

The above code returns the first five rows of the first column. The ‘:5’ in the iloc
denotes the first five rows and the number 0 after the comma denotes the first
column, iloc is used to locate the data using numbers or integers.

data.loc[:5, 'PassengerId']
The above code does the same but we can use the column names directly using loc
in pandas. Here the index 5 is inclusive.

Handle Duplicates in Dataset

Since there are no duplicate data in the titanic dataset, let us first add a duplicated
row into the data and handle it.

df_dup = data.copy()
# duplicate the first row and append it to the data
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
row = df_dup.iloc[:1]
df_dup = pd.concat([df_dup, row], ignore_index=True)
df_dup

df_dup[df_dup.duplicated()]

The above code returns the duplicated rows in the data.

df_dup.drop_duplicates()
The above code drops the duplicated rows in the data.
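As a usage note, drop_duplicates also accepts subset and keep parameters, so you can restrict which columns define a duplicate and choose which occurrence to keep; the column names below are from the Titanic dataset.

# keep the last occurrence instead of the first, comparing only selected columns
df_dup.drop_duplicates(subset=['Name', 'Ticket'], keep='last')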

Select specific values in the column


data[data['Pclass'] == 1]

The above code returns the values which are equal to one in the column ‘Pclass’ in
the data.

Select multiple values in the column using the following code,

data[data['Pclass'].isin([1, 0])]
The above code returns the values which are equal to one and zero in the column
‘Pclass’ in the data.

Group by in DataFrame
data.groupby('Sex').agg({'PassengerId': 'count'})

The above code groups the values of the column ‘Sex’ and aggregates the column
‘PassengerId’ by the count of that column.

data.groupby('Sex').agg({'Age':'mean'})

The above code groups the values of the column ‘Sex’ and aggregates the column
‘Age’ by mean of that column.

Group multiple columns using the following code,

data.groupby(['Pclass', 'Sex']).agg({'PassengerId': 'count'})


For example, let’s say we have a Pandas DataFrame that contains
information about the sales of different products in different
regions. We can group the DataFrame by region to calculate the total
sales in each region as follows:

import pandas as pd

# create a sample DataFrame


data = {'Product': ['Product A', 'Product B', 'Product C', 'Product D',
'Product E', 'Product F'],
'Region': ['North', 'South', 'East', 'West', 'North', 'West'],
'Sales': [10000, 5000, 7000, 9000, 6000, 8000]}

df = pd.DataFrame(data)

# group the DataFrame by region and calculate the total sales in each region
grouped_df = df.groupby('Region')['Sales'].sum()
print(grouped_df)

The output of this code will be:

Region
East 7000
North 16000
South 5000
West 17000
Name: Sales, dtype: int64

As you can see, the DataFrame has been grouped by region, and
the sum() function has been applied to the Sales column to calculate
the total sales in each region.

Replacing values in a DataFrame


data['Sex'].replace(['male', 'female'], ["M", "F"])

The above code replaces ‘male’ as ‘M’ and ‘female’ as ‘F’.
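Note that replace() returns a new Series rather than modifying the DataFrame in place; a minimal sketch of persisting the change is:

# assign the result back to keep the replacement in the DataFrame
data['Sex'] = data['Sex'].replace(['male', 'female'], ["M", "F"])
data['Sex'].head()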

Save the DataFrame as a CSV file


data.to_csv('/path/to/save/the/data.csv', index=False)

The index=False argument does not save the index as a separate column in the
CSV.

Getting Shape and information of the data


Let's extract information about each column, i.e. what type of values it stores and how many of them are non-null. There are three support functions, .shape, .info() and .corr(), which output the shape of the table, information about the rows and columns, and the correlation between numerical columns.
Code:

import pandas as pd

# creating a dataframe object


student_register = pd.DataFrame()

# assigning values to the


# rows and columns of the dataframe
student_register['Name'] = ['Abhijit','Smriti',
'Akash', 'Roshni']
student_register['Age'] = [20, 19, 20, 14]
student_register['Student'] = [False, True,
True, False]

print(student_register)

# creating a new pandas series object
new_person = pd.Series(['Mansi', 19, True],
                       index=['Name', 'Age', 'Student'])

# One important concept is that the "dataframe" object consists of rows which
# are "series" objects stacked together to form a table. Hence adding a new
# row means creating a new series object and appending it to the dataframe.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
# The concatenation returns a new dataframe; student_register itself stays
# unchanged, so the shape/info/corr output below reflects the original 4 rows.
student_register_new = pd.concat([student_register, new_person.to_frame().T],
                                 ignore_index=True)
print(student_register)

# dimension of the dataframe


print('Shape: ')
print(student_register.shape)
print('--------------------------------------')
# showing info about the data
print('Info: ')
print(student_register.info())
print('--------------------------------------')
# correlation between numerical columns (numeric_only=True skips the text column)
print('Correlation: ')
print(student_register.corr(numeric_only=True))

Output:
Shape:
(4, 3)
--------------------------------------
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 4 non-null object
1 Age 4 non-null int64
2 Student 4 non-null bool
dtypes: bool(1), int64(1), object(1)
memory usage: 196.0+ bytes
None
--------------------------------------
Correlation:
Age Student
Age 1.000000 0.502519
Student 0.502519 1.000000
In the above example, the .shape attribute gives the output (4, 3), as that is the size of the created dataframe.
The description of the output given by the .info() method is as follows:
1. RangeIndex describes the index column, i.e. [0, 1, 2, 3] in our dataframe, which corresponds to the number of rows.
2. As the name suggests, Data columns gives the total number of columns as output.
3. Name, Age and Student are the names of the columns in our data; non-null tells us that the corresponding column contains no NA/NaN/None values; object, int64 and bool are the datatypes of each column.
4. dtypes gives an overview of how many data types are present in the dataframe, which in turn simplifies the data cleaning process.
Also, in high-end machine learning models, memory usage is an important term that we can't neglect.
Sorting data in Pandas
Sorting data is a crucial step in data manipulation as it helps to
organize the data and identify patterns quickly. Pandas provides a
powerful set of functions to sort data based on one or more columns.
The sort_values() function is used to sort data in Pandas. It takes
the column name(s) to sort by as the input and sorts the data in
ascending or descending order based on the user's preference.

For example, let’s say we have a Pandas DataFrame that contains


information about the sales of different products in different
regions. We can sort the DataFrame by the total sales to identify the
top-selling products as follows:
import pandas as pd

# create a sample DataFrame


data = {'Product': ['Product A', 'Product B', 'Product C', 'Product D'],
        'Region': ['North', 'South', 'East', 'West'],
        'Sales': [10000, 5000, 7000, 9000]}

df = pd.DataFrame(data)

# sort the DataFrame by sales in descending order


sorted_df = df.sort_values(by='Sales', ascending= False)

print(sorted_df)

The output of this code will be:

Product Region Sales


0 Product A North 10000
3 Product D West 9000
2 Product C East 7000
1 Product B South 5000
As you can see, the DataFrame has been sorted based on
the Sales column in descending order. The sort_values() function
also allows us to sort the data by multiple columns. For example,
let's sort the DataFrame first by Region in ascending order, and then
by Sales in descending order:

sorted_df = df.sort_values(by=['Region', 'Sales'], ascending=[True, False])

print(sorted_df)

The output of this code will be:

Product Region Sales


2 Product C East 7000
0 Product A North 10000
1 Product B South 5000
3 Product D West 9000

As you can see, the DataFrame has been sorted first by Region in
ascending order, and then by Sales in descending order. This allows
us to identify the top-selling products in each region easily.

Filtering data in Pandas


Filtering data is another essential step in data manipulation that
allows us to extract a subset of data based on certain criteria. Pandas
provides a powerful set of functions to filter data based on one or
more conditions. The loc[] function is used to filter data in Pandas.
It takes a Boolean expression as the input and returns a subset of the
DataFrame that satisfies the condition.
For example, let’s say we have a Pandas DataFrame that contains
information about the sales of different products in different
regions. We can filter the DataFrame to extract the sales of products
that exceed a certain threshold as follows:

import pandas as pd

# create a sample DataFrame


data = {'Product': ['Product A', 'Product B', 'Product C', 'Product D'],
'Region': ['North', 'South', 'East', 'West'],
'Sales': [10000, 5000, 7000, 9000]}

df = pd.DataFrame(data)

# filter the DataFrame to extract the sales of products that exceed 8000
filtered_df = df.loc[df['Sales'] > 8000]

print(filtered_df)

The output of this code will be:

Product Region Sales


0 Product A North 10000
3 Product D West 9000

As you can see, the DataFrame has been filtered to extract the sales
of products that exceed 8000. The loc[] indexer also allows us to
filter the data based on multiple conditions. For example, let's filter
the DataFrame to extract the sales of products that exceed 8000 and
are sold in the North region:

# filter the DataFrame to extract the sales of products that exceed 8000
# and are sold in the North region
filtered_df = df.loc[(df['Sales'] > 8000) & (df['Region'] == 'North')]

print(filtered_df)
The output of this code will be:

Product Region Sales


0 Product A North 10000

As you can see, the DataFrame has been filtered to extract the sales
of products that exceed 8000 and are sold in the North region.
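The same filter can also be written with DataFrame.query, which some readers find easier to scan; this is an equivalent sketch, not part of the original example.

# equivalent filter expressed with DataFrame.query
filtered_df = df.query("Sales > 8000 and Region == 'North'")
print(filtered_df)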

How to encode categorical features in Python?

There are two types of categorical data: nominal and ordinal.

Nominal data
Nominal data is categorical data that may be divided into groups, but these groups
lack any intrinsic hierarchy or order. Examples of nominal data include brand names
(Coca-Cola, Pepsi, Sprite), varieties of pizza toppings(pepperoni, mushrooms,
onions), and hair color (blonde, brown, black, etc.).

Ordinal data
Ordinal data, on the other hand, describes information that can be categorized and
has a distinct order or ranking. Levels of education (high school, bachelor's,
master's), levels of work satisfaction (extremely satisfied, satisfied, neutral,
unsatisfied, very unsatisfied), and star ratings (1-star, 2-star, 3-star, 4-star, 5-star)
are a few examples of ordinal data.

By giving each category a numerical value that reflects its order or ranking, ordinal
data can be transformed into numerical data and used in machine learning. For
algorithms that are sensitive to the size of the input data, this may be helpful.

Understanding data types in pandas


The widely used open-source Python library pandas is used for data analysis and manipulation. It has strong capabilities for dealing with structured data, including DataFrames and Series, which handle tabular data with labeled rows and columns.

pandas also provides several functions to read and write different file types (csv,
parquet, database, etc.). When you read a file using pandas, each column is
assigned a data type based on the inference. Here are all the data types pandas can
possibly assign:
1. Numeric: This includes integers and floating-point numbers. Numeric
data is typically used for quantitative analysis and mathematical
operations.
2. String: This data type is used to represent textual data such as
names, addresses, and descriptions.
3. Boolean: This data type can only have two possible values: True or
False. Boolean data is often used for logical operations and filtering.
4. Datetime: This data type is used to represent dates and times.
pandas has powerful tools for manipulating datetime data.
5. Categorical: This data type represents data that takes on a limited
number of values. Categorical data is often used for grouping and
aggregating data.
6. Object: This data type is a catch-all for data that does not fit into the
other categories. It can include a variety of different data types, such
as lists, dictionaries, and other objects.
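To see how these assigned types look in practice, here is a small hypothetical frame (names borrowed from the earlier CSV example) that inspects the inferred dtypes and converts a low-cardinality text column to the categorical dtype.

import pandas as pd

df = pd.DataFrame({
    'name': ['Rick', 'Dan', 'Tusar'],      # text -> inferred as object
    'salary': [623.30, 515.20, 611.00],    # numeric -> float64
    'dept': ['IT', 'Operations', 'IT'],    # few distinct values -> candidate for category
})

print(df.dtypes)

# convert the low-cardinality column to the memory-efficient category dtype
df['dept'] = df['dept'].astype('category')
print(df.dtypes)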

Analyzing Categorical Features in Python


There are a few functions in pandas, a popular data analysis library in Python, that
allow you to quickly analyze categorical data types in your dataset. Let us examine
them one by one:

Value Counts
`value_counts()` is a function in the pandas library that returns the frequency of each
unique value in a categorical data column. This function is useful when you want to
get a quick understanding of the distribution of a categorical variable, such as the
most common categories and their frequency.
# read csv using pandas
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')

# check value counts of Cut column


data['Cut'].value_counts()

Output:

Cross tab
`crosstab()` is a function in pandas that creates a cross-tabulation table, which shows the
frequency distribution of two or more categorical variables. This function is useful when you
want to see the relationship between two or more categorical variables, such as how the
frequency of one variable is related to another variable.

# read csv using pandas

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')

# cross tab of Cut and Color

pd.crosstab(index=data['Cut'], columns=data['Color'])

output:
The output from the crosstab function in pandas is a table that shows the frequency
distribution of two or more categorical variables. Each row of the table represents a unique
category in one of the variables, and each column represents a unique category in the other
variable. The entries in the table are the frequency counts of the combinations of categories in
the two variables.

Pivot Table
`pivot_table()` is a function in Pandas that creates pivot tables, which are similar to cross-
tabulation tables but with more flexibility. This function is useful when you want to analyze
multiple categorical variables and their relationship to one or more numeric variables. Pivot
tables allow you to aggregate data in multiple ways and display the results in a compact form.

# read csv using pandas

import pandas as pd
import numpy as np

data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')

# create pivot table
pd.pivot_table(data, values='Price', index='Cut', columns='Color', aggfunc=np.mean)

Output:
This table shows the average price of each diamond cut for each color. The rows represent
the different diamond cut, the columns represent the different diamond colors, and the entries
in the table are the average price of the diamond.

The pivot_table function is useful when you want to summarize and compare the numerical
data across multiple variables in a table format. The function allows you to aggregate the data
using various functions (such as mean, sum, count, etc.) and organize it into a format that is
easy to read and analyze.

Encoding ordinal data

Step 1 - Import the library


import pandas as pd

Step 2 - Setting up the Data

We have created a dataframe with one feature "score" with categorical variables "Low",
"Medium" and "High".
df = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High",
"Low", "Medium","High", "Low"]})
print(df)

Step 3 - Encoding variable

We create a dictionary, scale_mapper, that maps each categorical value to a numerical value, and then use it to build a new feature "Scale" containing the numerically encoded values.

scale_mapper = {"Low":1, "Medium":2, "High":3}


df["Scale"] = df["Score"].replace(scale_mapper)
print(df)

So the output comes as:

    Score  Scale
0     Low      1
1     Low      1
2  Medium      2
3  Medium      2
4    High      3
5     Low      1
6  Medium      2
7    High      3
8     Low      1
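Equivalently, scikit-learn's OrdinalEncoder can produce the same kind of encoding and, unlike the dictionary mapping, can be reused on new data; the category order passed below is our assumption about the ranking. Note that OrdinalEncoder assigns codes starting at 0 (Low=0, Medium=1, High=2), whereas the dictionary above started at 1.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High",
                             "Low", "Medium", "High", "Low"]})

# pass the categories in their intended order so the codes respect the ranking
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df["Scale"] = encoder.fit_transform(df[["Score"]]).ravel()
print(df)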

Encoding Categorical Features in Python


Categorical data cannot typically be directly handled by machine learning algorithms, as most
algorithms are primarily designed to operate with numerical data only. Therefore, before
categorical features can be used as inputs to machine learning algorithms, they must be
encoded as numerical values.

There are several techniques for encoding categorical features, including one-hot encoding,
ordinal encoding, and target encoding. The choice of encoding technique depends on the
specific characteristics of the data and the requirements of the machine learning algorithm
being used.
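Target encoding, mentioned above, is not demonstrated elsewhere in this document; the following is a minimal sketch on made-up data, where each category is replaced by the mean of the target for that category. In practice the means should be computed on training folds only to avoid target leakage.

import pandas as pd

# hypothetical toy data: one categorical feature and a binary target
df = pd.DataFrame({
    "City":   ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai", "Delhi"],
    "Target": [1, 0, 0, 1, 1, 0],
})

# target (mean) encoding: replace each category with the mean target value of that category
city_means = df.groupby("City")["Target"].mean()
df["City_encoded"] = df["City"].map(city_means)
print(df)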

One-hot encoding
One hot encoding is a process of representing categorical data as a set of binary values, where
each category is mapped to a unique binary value. In this representation, only one bit is set to
1, and the rest are set to 0, hence the name "one hot." This is commonly used in machine
learning to convert categorical data into a format that algorithms can process.

pandas categorical to numeric

One way to achieve this in pandas is by using the `pd.get_dummies()` method. It is a function
in the Pandas library that can be used to perform one-hot encoding on categorical variables in
a DataFrame. It takes a DataFrame and returns a new DataFrame with binary columns for
each category. Here's an example of how to use it:

Suppose we have a data frame with a column "fruit" containing categorical data:

import pandas as pd

# generate df with 1 col and 4 rows
data = {
    "fruit": ["apple", "banana", "orange", "apple"]
}
df = pd.DataFrame(data)

# show head
df.head()


Output:

# apply get_dummies function
df_encoded = pd.get_dummies(df["fruit"])
df_encoded.head()


Output:
Even though `pandas.get_dummies` is straightforward to use, a more common approach is to
use `OneHotEncoder` from the sklearn library, especially when you are doing machine
learning tasks. The primary difference is `pandas.get_dummies` cannot learn encodings; it
can only perform one-hot-encoding on the dataset you pass as an input. On the other hand,
`sklearn.OneHotEncoder` is a class that can be saved and used to transform other incoming
datasets in the future.

import pandas as pd

# generate df with 1 col and 4 rows
data = {
    "fruit": ["apple", "banana", "orange", "apple"]
}
df = pd.DataFrame(data)

# one-hot-encode using sklearn
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
encoded_results = encoder.fit_transform(df).toarray()


Output:
