Data Pre Processing and Cleaning
1. DATA CLEANING
The first step of Data Preprocessing is Data Cleaning. Most of the data that we work with today is not clean and requires a substantial amount of cleaning. Some datasets have missing values and some have junk data in them. If these missing values and inconsistencies are not handled properly, our model will not give accurate results. So, before getting into the nitty-gritty details of Data Cleaning, let's get a high-level understanding of the problems we face in real-world data.
MISSING VALUES :
One simple way to handle missing values is to delete the rows or features that contain them. The advantage of this method is that it is a quick and dirty way of fixing the missing-values issue. But it is not always the go-to method, as you might sometimes end up losing critical information by deleting rows or features.
1. Load the dataset
In [1]:
import pandas as pd
import numpy as np
In [2]:
df = pd.read_csv("Banking_Marketing.csv")
In [3]:
df.head()
df.dtypes
age float64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration float64
campaign int64
pdays int64
previous int64
poutcome object
emp_var_rate float64
cons_price_idx float64
cons_conf_idx float64
euribor3m float64
nr_employed float64
y int64
dtype: object
df.isna().sum()
age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64
datadrop = df.dropna()
Another option is to impute the missing values with the mean, median, or mode of the column. The drawback is that you don't know how accurate using the mean, median, or mode is going to be in a given situation.
import pandas as pd
import numpy as np
In [3]:
df = pd.read_csv("Banking_Marketing.csv")
In [4]:
df.head()
df.isna().sum()
age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64
age_mean = df.age.mean()
In [7]:
print("Mean of age column: ",age_mean)
Mean of age column: 40.023812413525256
3. Impute the missing data in the age column with the mean age value
In [8]:
df.age.fillna(age_mean,inplace=True)
df.isna().sum()
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64
4. Checking all the records in the dataset for which the 'duration' column is NA
df[df['duration'].isnull()]
df['duration'].sort_values(ascending=False).head()
Out[11]:
7802 4918.0
18610 4199.0
32880 3785.0
1974 3643.0
10633 3631.0
Name: duration, dtype: float64
duration_med = df.duration.median()
df.duration.fillna(duration_med, inplace=True)
df.isna().sum()
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64
8. In the above steps, both the 'duration' and 'age' columns were numerical, so we used the mean and median to impute the missing values. However, in this case 'contact' is a categorical column, so we will use the mode here to impute the missing values.
df['contact'].unique()
out:
array(['cellular', 'telephone', nan], dtype=object)
contact_mode = df.contact.mode()[0]
print("Mode for contact: ", contact_mode)
df.contact.fillna(contact_mode, inplace=True)
Data Processing is the task of converting data from a given form to a much more
usable and desired form i.e. making it more meaningful and informative. Using
Machine Learning algorithms, mathematical modeling, and statistical knowledge,
this entire process can be automated. The output of this complete process can be in
any desired form like graphs, videos, charts, tables, images, and many more,
depending on the task we are performing and the requirements of the machine. This
might seem to be simple but when it comes to massive organizations like Twitter,
Facebook, Administrative bodies like Parliament, UNESCO, and health sector
organizations, this entire process needs to be performed in a very structured manner.
So, the steps to perform are as follows:
Data processing is a crucial step in the machine learning (ML) pipeline, as it
prepares the data for use in building and training ML models. The goal of data
processing is to clean, transform, and prepare the data in a format that is suitable for
modeling.
The main steps involved in data processing typically include:
1.Data collection: This is the process of gathering data from various sources, such
as sensors, databases, or other systems. The data may be structured or unstructured,
and may come in various formats such as text, images, or audio.
2.Data preprocessing: This step involves cleaning, filtering, and transforming the
data to make it suitable for further analysis. This may include removing missing
values, scaling or normalizing the data, or converting it to a different format.
3.Data analysis: In this step, the data is analyzed using various techniques such as
statistical analysis, machine learning algorithms, or data visualization. The goal of
this step is to derive insights or knowledge from the data.
4.Data interpretation: This step involves interpreting the results of the data analysis
and drawing conclusions based on the insights gained. It may also involve presenting
the findings in a clear and concise manner, such as through reports, dashboards, or
other visualizations.
5.Data storage and management: Once the data has been processed and analyzed,
it must be stored and managed in a way that is secure and easily accessible. This
may involve storing the data in a database, cloud storage, or other systems, and
implementing backup and recovery strategies to protect against data loss.
6.Data visualization and reporting: Finally, the results of the data analysis are
presented to stakeholders in a format that is easily understandable and actionable.
This may involve creating visualizations, reports, or dashboards that highlight key
findings and trends in the data.
Collection :
The most crucial step when starting with ML is to have data of good
quality and accuracy. Data can be collected from any authenticated source
like data.gov.in, Kaggle or UCI dataset repository. For example, while
preparing for a competitive exam, students study from the best study
material that they can access so that they learn the best to obtain the best
results. In the same way, high-quality and accurate data will make the
learning process of the model easier and better and at the time of testing,
the model would yield state-of-the-art results.
A huge amount of capital, time and resources are consumed in collecting
data. Organizations or researchers have to decide what kind of data they
need to execute their tasks or research.
Example: working on a facial expression recognizer requires numerous images with a variety of human expressions. Good data ensures that the results of the model are valid and can be trusted.
Preparation :
The collected data can be in a raw form which can’t be directly fed to the
machine. So, this is a process of collecting datasets from different sources,
analyzing these datasets and then constructing a new dataset for further
processing and exploration. This preparation can be performed either manually or automatically. Data can also be converted into numeric form, which speeds up the model's learning.
Example: An image can be converted to a matrix of N X N dimensions,
the value of each cell will indicate the image pixel.
Input :
The prepared data may still be in a form that is not machine-readable, so conversion algorithms are needed to convert it into a readable form. Executing this task requires high computation and accuracy. Example: data can be collected through sources like the MNIST Digit data (images), Twitter comments, audio files, and video clips.
Processing :
This is the stage where algorithms and ML techniques are required to
perform the instructions provided over a large volume of data with
accuracy and optimal computation.
Output :
In this stage, results are procured by the machine in a meaningful manner
which can be inferred easily by the user. Output can be in the form of
reports, graphs, videos, etc
Storage :
This is the final step, in which the obtained output, the data model, and all other useful information are saved for future use.
The following are the most common steps involved in data cleaning:
Import the necessary libraries
Load the dataset
Check the data information using df.info()
For example, given a small DataFrame with salary and name columns, the first few salaries (presumably the output of df['salary'].head()) look like this:
0 623.30
1 515.20
2 611.00
3 729.00
4 843.25
Name: salary, dtype: float64
The full DataFrame looks like this:
salary name
0 623.30 Rick
1 515.20 Dan
2 611.00 Tusar
3 729.00 Ryan
4 843.25 Gary
5 578.00 Rasmi
6 632.80 Pranab
7 722.50 Guru
After rows containing missing values (in columns not shown here) are dropped, only a subset remains:
salary name
1 515.2 Dan
3 729.0 Ryan
5 578.0 Rasmi
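The code that produced these tables is not included in the text. A minimal sketch that would yield similar output (the 'bonus' column with missing values is an assumption, since only the salary and name columns appear above):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "salary": [623.30, 515.20, 611.00, 729.00, 843.25, 578.00, 632.80, 722.50],
    "name": ["Rick", "Dan", "Tusar", "Ryan", "Gary", "Rasmi", "Pranab", "Guru"],
    # hypothetical column with missing values, so that dropna() has something to remove
    "bonus": [np.nan, 40.0, np.nan, 55.0, np.nan, 30.0, np.nan, np.nan],
})
print(df["salary"].head())              # first five salaries
print(df[["salary", "name"]])           # the full table
print(df.dropna()[["salary", "name"]])  # only the rows without missing values remain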
An Outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and the removal process is the same as removing any other data item from a pandas DataFrame.
The dataset used in this article is the Diabetes dataset and it is preloaded in the
sklearn library.
Importing
import sklearn
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib.pyplot as plt
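The snippets below work on a DataFrame called df_diabetics, but its creation is not shown in the text. A minimal sketch, assuming the feature names from load_diabetes are used as column names:
diabetes = load_diabetes()
df_diabetics = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df_diabetics.head()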
Outliers Visualization
A Box Plot, also known as a whisker plot, displays a summary of a set of data values: the minimum, first quartile, median, third quartile, and maximum. In the box plot, a box is drawn from the first quartile to the third quartile, and a vertical line goes through the box at the median. Here the x-axis denotes the data to be plotted, while the y-axis shows the frequency distribution.
A box plot is primarily used to indicate whether a distribution is skewed and whether there are potential unusual observations (also called outliers) present in the data set. Boxplots are also very beneficial when large numbers of data sets are involved or compared.
First Quartile (Q1): The first quartile is the median of the lower half of the data set.
Median: The median is the middle value of the dataset, which divides the given dataset into
two equal parts. The median is considered as the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and first quartile is
known as the interquartile range. (i.e.) IQR = Q3-Q1
Outlier: data that falls on the far left or right side of the ordered data is tested to be an outlier. Generally, outliers fall more than a specified distance from the first and third quartiles:
(i.e.) outliers are greater than Q3 + (1.5 * IQR) or less than Q1 - (1.5 * IQR)
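The boxplot referred to below is not reproduced in the text. A sketch of how such a plot could be drawn for a single column (the choice of the 'bmi' column is an assumption):
# draw a box plot of one feature to inspect its spread and potential outliers
fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(df_diabetics['bmi'])
ax.set_ylabel('bmi')
plt.show()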
In the graph the text refers to (not reproduced here), we can clearly see that values above 10 are acting as outliers.
# Position of the Outlier
import numpy as np
print(np.where(df_diabetics['bmi']>0.12))
output:
# Scatter plot
fig, ax = plt.subplots(figsize = (6,4))
ax.scatter(df_diabetics['bmi'],df_diabetics['bp'])
# x-axis label
ax.set_xlabel('(body mass index of people)')
# y-axis label
ax.set_ylabel('(bp of the people )')
plt.show()
Looking at the graph, we can see that most of the data points lie in the bottom-left corner, but there are a few points in the exact opposite, top-right corner of the graph. Those points in the top-right corner can be regarded as outliers.
As an approximation, we can say that all the data points with x > 20 and y > 600 are outliers.
Outliers in BMI and BP Column Combined
Python3
# Position of the Outlier
print(np.where((df_diabetics['bmi']>0.12) & (df_diabetics['bp']<0.8)))
Output:
(array([ 32, 145, 256, 262, 366, 367, 405]),)
The Inter Quartile Range (IQR) approach to finding outliers is the most commonly used and most trusted approach in the research field.
IQR = Quartile3 - Quartile1
Python3
# IQR (Q1 and Q3 are the 25th and 75th percentiles of the 'bmi' column)
Q1 = np.percentile(df_diabetics['bmi'], 25)
Q3 = np.percentile(df_diabetics['bmi'], 75)
IQR = Q3 - Q1
print(IQR)
Output:
0.06520763046978838
Syntax: numpy.percentile(arr, n, axis=None, out=None)
Parameters :
arr :input array.
n : percentile value.
To detect the outliers, base values are defined above and below the dataset's normal range, namely the upper and lower bounds, using a distance of 1.5*IQR:
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
Scaling the IQR up by 0.5 in this way (new_IQR = IQR + 0.5*IQR) covers roughly all of the data lying within about 2.7 standard deviations of the mean in a Gaussian distribution.
Python3
# Above Upper bound
upper=Q3+1.5*IQR
upper_array=np.array(df_diabetics['bmi']>=upper)
print("Upper Bound:",upper)
print(upper_array.sum())
lower=Q1-1.5*IQR
lower_array=np.array(df_diabetics['bmi']<=lower)
print("Lower Bound:",lower)
print(lower_array.sum())
Output:
Upper Bound: 0.12879000811776306
3
Lower Bound: -0.13204051376139045
0
Example:
lists = np.where(df_diabetics['bmi'] > 0.12)  # positions of the outliers found above
df_diabetics.drop(lists[0], inplace=True)
Full Code: Detecting the outliers using IQR and removing them.
Python3
# Importing
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the diabetes data into a DataFrame
diabetes = load_diabetes()
column_name = diabetes.feature_names
df_diabetes = pd.DataFrame(diabetes.data, columns=column_name)
df_diabetes.head()
print("Old Shape:", df_diabetes.shape)

# IQR
Q1 = df_diabetes['bmi'].quantile(0.25)
Q3 = df_diabetes['bmi'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR

# Drop the rows whose 'bmi' value lies outside the bounds
upper_array = np.where(df_diabetes['bmi'] >= upper)[0]
lower_array = np.where(df_diabetes['bmi'] <= lower)[0]
df_diabetes.drop(index=upper_array, inplace=True)
df_diabetes.drop(index=lower_array, inplace=True)
print("New Shape:", df_diabetes.shape)
Output:
Old Shape: (442, 10)
New Shape: (439, 10)
import numpy as np
num = np.arange(10)
num
Output:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Now, let’s sample two points from the data and take the average of these two. Also,
let’s maintain a dictionary with the sample means and the number of times they
appear.
sample_freq = {}
# Selecting each pair possible (with repetition) from the values 1 to 4,
# a small subset of the data, so the table of sample means below stays short
for i in range(1, 5):
    for j in range(1, 5):
        mean_of_two = (i + j) / 2
        if mean_of_two in sample_freq:
            # Updating the count for a mean value
            # if it already exists
            sample_freq[mean_of_two] += 1
        else:
            # Adding a new key to the dictionary
            # if it is not there
            sample_freq[mean_of_two] = 1
sample_freq
Output:
{1.0: 1, 1.5: 2, 2.0: 3, 2.5: 4, 3.0: 3, 3.5: 2, 4.0: 1}
Now, let’s plot the sample statistics to visualize its distribution.
Python3
import matplotlib.pyplot as plt
plt.scatter(sample_freq.keys(), sample_freq.values())
plt.show()
From the above graph, we can observe that the distribution of the sample statistic is symmetric, and if we take infinitely many such random samples, the distribution formed will be a normal/Gaussian distribution.
Steps Needed
Here, we will apply some techniques to normalize the data and discuss these with the
help of examples. For this, let’s understand the steps needed for data normalization
with Pandas.
1. Import Library (Pandas)
2. Import / Load / Create data.
3. Use the technique to normalize the data.
Examples
Here, we create data by some random values and apply some normalization
techniques to it.
Python3
# importing packages
import pandas as pd
# create data (illustrative values; the original numbers are not shown in the text)
df = pd.DataFrame([[180000, 110], [360000, 905], [230000, 230],
                   [60000, 450], [200000, 610]],
                  columns=['Column 1', 'Column 2'])
# view data
display(df)
Output:
The maximum absolute scaling rescales each feature between -1 and 1 by dividing
every observation by its maximum absolute value. We can apply the maximum
absolute scaling in Pandas using the .max() and .abs() methods, as shown below.
Python3
# copy the data
df_max_scaled = df.copy()
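The scaling step itself does not appear in the text; a minimal sketch of the maximum absolute scaling described above, applied column by column:
# divide every column by its maximum absolute value so the values fall within [-1, 1]
for column in df_max_scaled.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
display(df_max_scaled)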
Output:
See the plot of this dataframe:
Python3
import matplotlib.pyplot as plt
df_max_scaled.plot(kind = 'bar')
Output:
The min-max approach (often called normalization) rescales each feature to a fixed range of [0, 1] by subtracting the minimum value of the feature and then dividing by the range. We can apply min-max scaling in Pandas using the .min() and .max() methods.
Python3
# copy the data
df_min_max_scaled = df.copy()
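As above, the scaling code is not shown in the text; a minimal sketch of min-max scaling applied column by column:
# subtract the minimum and divide by the range so the values fall within [0, 1]
for column in df_min_max_scaled.columns:
    col_min = df_min_max_scaled[column].min()
    col_max = df_min_max_scaled[column].max()
    df_min_max_scaled[column] = (df_min_max_scaled[column] - col_min) / (col_max - col_min)
display(df_min_max_scaled)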
Output :
Python3
import matplotlib.pyplot as plt
df_min_max_scaled.plot(kind = 'bar')
Using the z-score method
The z-score method (often called standardization) transforms the data into a distribution with a mean of 0 and a standard deviation of 1. Each standardized value is computed by subtracting the mean of the corresponding feature and then dividing by the standard deviation.
Python3
# copy the data
df_z_scaled = df.copy()
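Again, the transformation itself is not shown; a minimal sketch of z-score standardization applied column by column:
# subtract the mean and divide by the standard deviation of each column
for column in df_z_scaled.columns:
    df_z_scaled[column] = (df_z_scaled[column] - df_z_scaled[column].mean()) / df_z_scaled[column].std()
display(df_z_scaled)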
Output :
Python3
import matplotlib.pyplot as plt
df_z_scaled.plot(kind='bar')
Data Manipulation with Python
Data manipulation with Python is the process of organizing data, using the Python programming language, so that reading and interpreting insights from the data becomes more structured and better designed. For example, arranging employee names in alphabetical order enables quicker searching for a particular employee by name. The key feature of data manipulation is enabling faster business operations, and it also emphasizes optimization in the process. Through properly manipulated data one can analyze trends, interpret insights from financial data, analyze consumer behaviour or patterns, and so on. Beyond analysis, it also enables users to discard any unnecessary data in the set, so that one can save space and keep only important and necessary data. In this article, we will look into the different methods of data manipulation in Python and also look at examples.
DataFrame in Pandas
Pandas Installation
Install pandas via pip using the following command:
pip install pandas
For this purpose, we are going to use Titanic Dataset which is available on Kaggle.
import pandas as pd
path_to_data = 'path/to/titanic_dataset'
# read the csv data using pd.read_csv function
data = pd.read_csv(path_to_data)
data.head()
Dropping columns in the data
df_dropped = data.drop('Survived', axis=1)
df_dropped.head()
The ‘Survived’ column is dropped from the data. The axis=1 argument denotes that ‘Survived’ is a column label, so pandas searches for ‘Survived’ along the columns to drop it.
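The rename calls described next are not shown in the text. A minimal sketch using DataFrame.rename with a columns dictionary (the df_renamed variable name is illustrative):
# rename a single column
df_renamed = data.rename(columns={'PassengerId': 'Id'})
df_renamed.head()
# rename two columns at once
df_renamed = data.rename(columns={'PassengerId': 'Id', 'Sex': 'Gender'})
df_renamed.head()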
The column ‘PassengerId’ is renamed to ‘Id’ in the data. Do not forget to mention
the dictionary inside the columns parameter.
The columns ‘PassengerId’ and ‘Sex’ are renamed to ‘Id’ and ‘Gender’
respectively.
int_data = data.select_dtypes('int')
int_data.head()
The above code selects all columns with integer data types.
float_data = data.select_dtypes('float')
float_data.head()
The above code selects all columns with float data types.
data.iloc[:5, 0]
The above code returns the first five rows of the first column. The ‘:5’ in the iloc denotes the first five rows, and the number 0 after the comma denotes the first column; iloc is used to locate the data using numbers or integers.
data.loc[:5, 'PassengerId']
The above code does the same but we can use the column names directly using loc
in pandas. Here the index 5 is inclusive.
Since there are no duplicate data in the titanic dataset, let us first add a duplicated
row into the data and handle it.
df_dup = data.copy()
# duplicate the first row and append it to the data
# (DataFrame.append was removed in recent pandas, so pd.concat is used here)
row = df_dup.iloc[:1]
df_dup = pd.concat([df_dup, row], ignore_index=True)
df_dup
df_dup[df_dup.duplicated()]
df_dup.drop_duplicates()
The above code drops the duplicated rows in the data.
data[data['Pclass'] == 1]
The above code returns the rows whose value in the column ‘Pclass’ equals one.
data[data['Pclass'].isin([1, 0])]
The above code returns the rows whose ‘Pclass’ value equals one or zero.
Group by in DataFrame
data.groupby('Sex').agg({'PassengerId': 'count'})
The above code groups the values of the column ‘Sex’ and aggregates the column
‘PassengerId’ by the count of that column.
data.groupby('Sex').agg({'Age':'mean'})
The above code groups the values of the column ‘Sex’ and aggregates the column
‘Age’ by mean of that column.
import pandas as pd
# 'data' is assumed to hold region/sales records (the original values are not shown in the text)
df = pd.DataFrame(data)
# group the DataFrame by region and calculate the total sales in each region
grouped_df = df.groupby('Region')['Sales'].sum()
print(grouped_df)
Region
East 7000
North 16000
South 5000
West 17000
Name: Sales, dtype: int64
As you can see, the DataFrame has been grouped by region, and
the sum() function has been applied to the Sales column to calculate
the total sales in each region.
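The step of saving the result to a CSV file, which the next sentence refers to, is not shown in the text. A minimal sketch with a hypothetical file name:
# reset_index turns 'Region' back into a regular column before saving
grouped_df.reset_index().to_csv('sales_by_region.csv', index=False)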
The index=False argument does not save the index as a separate column in the
CSV.
import pandas as pd
# The code that builds 'student_register' (a DataFrame with four rows and the
# columns Name, Age and Student) and prints its shape, info() and corr() is not
# shown in the text; only its output is reproduced below.
print(student_register)
Output:
Shape:
(4, 3)
--------------------------------------
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 4 non-null object
1 Age 4 non-null int64
2 Student 4 non-null bool
dtypes: bool(1), int64(1), object(1)
memory usage: 196.0+ bytes
None
--------------------------------------
Correlation:
Age Student
Age 1.000000 0.502519
Student 0.502519 1.000000
In the above example, the .shape attribute gives the output (4, 3), which is the size of the created dataframe.
The description of the output given by the .info() method is as follows:
1. RangeIndex describes the index column, i.e. [0, 1, 2, 3] in our dataframe; this is the number of rows in the dataframe.
2. As the name suggests, Data columns gives the total number of columns as output.
3. Name, Age, and Student are the names of the columns in our data; non-null tells us that the corresponding column contains no NA/NaN/None values, and object, int64, and bool are the datatypes of each column.
4. dtypes gives an overview of how many data types are present in the dataframe, which in turn simplifies the data cleaning process.
Also, in high-end machine learning models, memory usage is an important consideration that we cannot neglect.
Sorting data in Pandas
Sorting data is a crucial step in data manipulation as it helps to
organize the data and identify patterns quickly. Pandas provides a
powerful set of functions to sort data based on one or more columns.
The sort_values() function is used to sort data in Pandas. It takes
the column name(s) to sort by as the input and sorts the data in
ascending or descending order based on the user's preference.
# 'data' is again assumed to hold the region/sales records used above
df = pd.DataFrame(data)
# sort first by Region in ascending order, then by Sales in descending order
sorted_df = df.sort_values(by=['Region', 'Sales'], ascending=[True, False])
print(sorted_df)
As you can see, the DataFrame has been sorted first by Region in
ascending order, and then by Sales in descending order. This allows
us to identify the top-selling products in each region easily.
import pandas as pd
df = pd.DataFrame(data)
# filter the DataFrame to extract the sales of products that exceed 8000
filtered_df = df.loc[df['Sales'] > 8000]
print(filtered_df)
As you can see, the DataFrame has been filtered to extract the sales
of products that exceed 8000. The loc[] function also allows us to
filter the data based on multiple conditions. For example, let's filter
the DataFrame to extract the sales of products that exceed 8000 and
are sold in the North or West region:
# filter the DataFrame to extract the sales of products that exceed 8000
# and are sold in the North or West region
filtered_df = df.loc[(df['Sales'] > 8000) & ((df['Region'] == 'North') | (df['Region'] == 'West'))]
print(filtered_df)
The output of this code will be:
As you can see, the DataFrame has been filtered to extract the sales of products that exceed 8000 and are sold in the North or West region.
Nominal data
Nominal data is categorical data that may be divided into groups, but these groups
lack any intrinsic hierarchy or order. Examples of nominal data include brand names
(Coca-Cola, Pepsi, Sprite), varieties of pizza toppings(pepperoni, mushrooms,
onions), and hair color (blonde, brown, black, etc.).
Ordinal data
Ordinal data, on the other hand, describes information that can be categorized and
has a distinct order or ranking. Levels of education (high school, bachelor's,
master's), levels of work satisfaction (extremely satisfied, satisfied, neutral,
unsatisfied, very unsatisfied), and star ratings (1-star, 2-star, 3-star, 4-star, 5-star)
are a few examples of ordinal data.
By giving each category a numerical value that reflects its order or ranking, ordinal
data can be transformed into numerical data and used in machine learning. For
algorithms that are sensitive to the size of the input data, this may be helpful.
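As a minimal sketch of this idea (the category names and the mapping below are illustrative, not taken from the text):
import pandas as pd
edu = pd.DataFrame({'education': ['high school', 'bachelor', 'master', 'bachelor']})
# assign each level a number that reflects its rank
order = {'high school': 0, 'bachelor': 1, 'master': 2}
edu['education_encoded'] = edu['education'].map(order)
print(edu)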
pandas also provides several functions to read and write different file types (csv,
parquet, database, etc.). When you read a file using pandas, each column is
assigned a data type based on the inference. Here are all the data types pandas can
possibly assign:
1. Numeric: This includes integers and floating-point numbers. Numeric
data is typically used for quantitative analysis and mathematical
operations.
2. String: This data type is used to represent textual data such as
names, addresses, and descriptions.
3. Boolean: This data type can only have two possible values: True or
False. Boolean data is often used for logical operations and filtering.
4. Datetime: This data type is used to represent dates and times.
pandas has powerful tools for manipulating datetime data.
5. Categorical: This data type represents data that takes on a limited
number of values. Categorical data is often used for grouping and
aggregating data.
6. Object: This data type is a catch-all for data that does not fit into the
other categories. It can include a variety of different data types, such
as lists, dictionaries, and other objects.
Value Counts
`value_counts()` is a function in the pandas library that returns the frequency of each
unique value in a categorical data column. This function is useful when you want to
get a quick understanding of the distribution of a categorical variable, such as the
most common categories and their frequency.
# read csv using pandas
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')
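The call that produces the frequency counts below is not included in the text; presumably something like the following, using the 'Cut' column (any categorical column would work the same way):
# frequency of each unique value in the 'Cut' column
data['Cut'].value_counts()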
Dataframe:
Output:
Cross tab
`crosstab()` is a function in pandas that creates a cross-tabulation table, which shows the
frequency distribution of two or more categorical variables. This function is useful when you
want to see the relationship between two or more categorical variables, such as how the
frequency of one variable is related to another variable.
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')
pd.crosstab(index=data['Cut'], columns=data['Color'])
output:
The output from the crosstab function in pandas is a table that shows the frequency
distribution of two or more categorical variables. Each row of the table represents a unique
category in one of the variables, and each column represents a unique category in the other
variable. The entries in the table are the frequency counts of the combinations of categories in
the two variables.
Pivot Table
`pivot_table()` is a function in Pandas that creates pivot tables, which are similar to cross-
tabulation tables but with more flexibility. This function is useful when you want to analyze
multiple categorical variables and their relationship to one or more numeric variables. Pivot
tables allow you to aggregate data in multiple ways and display the results in a compact form.
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')
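The pivot_table call itself is missing from the text. A sketch consistent with the description below, assuming the price column in the dataset is named 'Price':
# average price for every combination of Cut and Color
pd.pivot_table(data, index='Cut', columns='Color', values='Price', aggfunc='mean')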
Output:
This table shows the average price of each diamond cut for each color. The rows represent
the different diamond cut, the columns represent the different diamond colors, and the entries
in the table are the average price of the diamond.
The pivot_table function is useful when you want to summarize and compare the numerical
data across multiple variables in a table format. The function allows you to aggregate the data
using various functions (such as mean, sum, count, etc.) and organize it into a format that is
easy to read and analyze.
We have created a dataframe with one feature "score" with categorical variables "Low",
"Medium" and "High".
df = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High",
"Low", "Medium","High", "Low"]})
print(df)
Score
0 Low
1 Low
2 Medium
3 Medium
4 High
5 Low
6 Medium
7 High
8 Low
There are several techniques for encoding categorical features, including one-hot encoding,
ordinal encoding, and target encoding. The choice of encoding technique depends on the
specific characteristics of the data and the requirements of the machine learning algorithm
being used.
One-hot encoding
One hot encoding is a process of representing categorical data as a set of binary values, where
each category is mapped to a unique binary value. In this representation, only one bit is set to
1, and the rest are set to 0, hence the name "one hot." This is commonly used in machine
learning to convert categorical data into a format that algorithms can process.
pandas categorical to numeric
One way to achieve this in pandas is by using the `pd.get_dummies()` method. It is a function
in the Pandas library that can be used to perform one-hot encoding on categorical variables in
a DataFrame. It takes a DataFrame and returns a new DataFrame with binary columns for
each category. Here's an example of how to use it:
Suppose we have a data frame with a column "fruit" containing categorical data:
import pandas as pd
# generate df with 1 col and 4 rows
# (illustrative fruit values; the original values are not shown in the text)
data = {
    "fruit": ["apple", "banana", "orange", "apple"]
}
df = pd.DataFrame(data)
# show head
df.head()
Output:
df_encoded = pd.get_dummies(df["fruit"])
df_encoded.head()
Output:
Even though `pandas.get_dummies` is straightforward to use, a more common approach is to
use `OneHotEncoder` from the sklearn library, especially when you are doing machine
learning tasks. The primary difference is `pandas.get_dummies` cannot learn encodings; it
can only perform one-hot-encoding on the dataset you pass as an input. On the other hand,
`sklearn.OneHotEncoder` is a class that can be saved and used to transform other incoming
datasets in the future.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# the same illustrative fruit data as above
data = {"fruit": ["apple", "banana", "orange", "apple"]}
df = pd.DataFrame(data)
encoder = OneHotEncoder()
encoded_results = encoder.fit_transform(df).toarray()
Output: