3. Exploring Categorical Data_students
STRUCTURE OF DATA
• Numerical (quantifiable) data: continuous data, discrete data
• Categorical (qualitative) data: nominal data, ordinal data
Types of Categorical Attributes
Handling categorical data
• It is common for real-world datasets to contain one or
more categorical features. Among categorical features,
we further distinguish between ordinal and nominal
features. Ordinal features are categorical values that
can be sorted or ordered. For example, t-shirt size is an
ordinal feature, because we can define the order
XL > L > M > S > XS.
• In contrast, nominal features don’t imply any order. For
example, t-shirt color is a nominal feature, since it
typically doesn’t make sense to say that red is larger
than blue.
Exploring Categorical Attributes in Machine
Learning
Encoding categorical (ordinal)
variables
• To make sure that the learning algorithm interprets the
ordinal features correctly, we need to convert the
categorical string values into integers. Unfortunately,
there is no convenient function that can automatically
derive the correct order of the labels of our size
feature, so we have to define the mapping manually. In
the following simple example, let’s assume that we know
the numerical difference between the feature values,
e.g. XL = L + 1 = M + 2:
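A minimal sketch of such a manual mapping with pandas is given here. The toy DataFrame, the column names and the integer codes are assumptions chosen for illustration; only the relation XL = L + 1 = M + 2 comes from the slide.

```python
import pandas as pd

# Hypothetical toy data; the column name "size" is an assumption.
df = pd.DataFrame({"size": ["M", "L", "XL", "M"]})

# Manually defined order that encodes XL = L + 1 = M + 2.
size_mapping = {"M": 1, "L": 2, "XL": 3}

# Replace each string label with its integer code.
df["size_encoded"] = df["size"].map(size_mapping)

# Inverse mapping, to recover the original labels later if needed.
inv_size_mapping = {code: label for label, code in size_mapping.items()}
df["size_decoded"] = df["size_encoded"].map(inv_size_mapping)

print(df)
```

Any label missing from the dictionary would become NaN after `map`, so the mapping should cover every size that occurs in the data.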
Categorical data encoding with
pandas
• Before we explore different techniques for handling categorical
data, let’s create a new DataFrame to illustrate the problem.
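One possible DataFrame for this purpose is sketched here. All values and column names are hypothetical; they are chosen so that the frame contains a nominal feature (color), an ordinal feature (size) and a numerical feature (price).

```python
import pandas as pd

# Hypothetical example rows: color (nominal), size (ordinal),
# price (numerical) and a class label.
df = pd.DataFrame(
    [
        ["green", "M", 10.1, "class1"],
        ["red", "L", 13.5, "class2"],
        ["blue", "XL", 15.3, "class1"],
    ],
    columns=["color", "size", "price", "classlabel"],
)

print(df)
print(df.dtypes)  # the string columns are stored with dtype "object"
```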
Color is a nominal feature.
• Can you spot a problem in the previous slide?
• The encoding assumes that green is larger than blue,
and red is larger than green.
• Although this assumption is incorrect, a classifier could
still produce useful results. However, those results
would not be optimal.
• A common workaround for this problem is to use
one-hot encoding.
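The problem can be reproduced with a small sketch. It assumes the colors were turned into integer codes in alphabetical order, here with `pd.factorize(..., sort=True)`; the slide may have used a different encoder, and the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a nominal feature.
df = pd.DataFrame({"color": ["green", "red", "blue"]})

# With sort=True the codes follow alphabetical order:
# blue -> 0, green -> 1, red -> 2.
codes, uniques = pd.factorize(df["color"], sort=True)
df["color_code"] = codes

print(df)
# The integers suggest blue < green < red, an order that
# has no meaning for colors.
```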
For one hot encoding use get_dummies
function
• The get_dummies function in pandas is used to perform one-hot encoding, a
process that converts categorical variables into a format suitable for machine
learning models. Specifically, it transforms each unique category in a column
into a separate binary column, where 1 indicates the presence of that
category for a specific row and 0 indicates its absence.
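A short sketch of get_dummies on a hypothetical toy DataFrame follows. The data and column names are assumptions; `dtype=int` is passed so that the new columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data: one nominal column and one numerical column.
df = pd.DataFrame(
    {"color": ["green", "red", "blue"], "price": [10.1, 13.5, 15.3]}
)

# One-hot encode only the "color" column; "price" is left unchanged.
dummies = pd.get_dummies(df, columns=["color"], dtype=int)

print(dummies)
```

Each new column is named after the original column and the category, e.g. color_blue, color_green and color_red.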
When using one-hot encoding, we have to keep in mind that it introduces
multicollinearity, which can be an issue. To reduce the correlation among the
new columns, we can simply remove one feature column from the one-hot
encoded array. Note that we don’t lose any important information by removing
a feature column: for example, if we remove color_blue, the feature information
is still preserved, since observing color_green = 0 and color_red = 0 implies
that the observation must be blue.
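In pandas, one way to drop a column is the `drop_first` parameter of get_dummies, sketched here on hypothetical toy data. It removes the first category in sorted order, which happens to be blue in this example.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame(
    {"color": ["green", "red", "blue"], "price": [10.1, 13.5, 15.3]}
)

# drop_first=True omits the column of the first category (color_blue).
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(reduced)
# The blue row is still identifiable: color_green == 0 and color_red == 0.
```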
Transforming numeric (continuous)
features to categorical features
• Sometimes we need to transform a continuous
numerical variable into a categorical variable. For
example, we may want to treat the real estate price
prediction problem, which is a regression problem, as a
real estate price category prediction, which is a
classification problem. In that case, we can ‘bin’ the
numerical data into multiple categories based on the
data range. In the context of the real estate price
prediction example, the original data set has a
numerical feature apartment_price, as shown in Figure
4.5a. It can be transformed to a categorical variable
price-grade, either as shown in Figure 4.5b or as shown
in Figure 4.5c.
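Binning of this kind can be sketched with `pd.cut`. The prices, the bin edges and the grade labels here are assumptions for illustration and are not taken from Figure 4.5.

```python
import pandas as pd

# Hypothetical apartment prices.
prices = pd.DataFrame(
    {"apartment_price": [120_000, 250_000, 410_000, 780_000]}
)

# Assumed bin edges; each interval is open on the left and closed on
# the right: (0, 200000], (200000, 500000], (500000, inf].
bins = [0, 200_000, 500_000, float("inf")]
labels = ["Low", "Medium", "High"]

prices["price_grade"] = pd.cut(
    prices["apartment_price"], bins=bins, labels=labels
)

print(prices)
```

Using more bin edges and labels gives a finer grading, which is one way two different categorical versions of the same numerical feature can arise.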
Consider the MPG dataset
Categorical data
• We may also be interested in the proportion (or
percentage) of data elements belonging to a category.
For example, for the attribute ‘cylinders’, the
proportion of data elements belonging to category 4
is 204 ÷ 398 ≈ 0.513, i.e. 51.3%, as shown in the
previous slide.
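Counts and proportions per category can be obtained with `value_counts`, as in this sketch. The small Series is hypothetical toy data, not the MPG dataset; only the figures 204 and 398 come from the slide.

```python
import pandas as pd

# Hypothetical toy values for a 'cylinders' attribute.
cylinders = pd.Series([4, 4, 6, 8, 4, 6, 4, 8], name="cylinders")

counts = cylinders.value_counts()                     # count per category
proportions = cylinders.value_counts(normalize=True)  # share per category

print(counts)
print((proportions * 100).round(1))  # percentages

# The figure quoted on the slide, recomputed from its two counts.
slide_proportion = round(204 / 398, 3)
print(slide_proportion)
```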
Visualization of categorical data
• Bar chart
• Pie chart
Bar Chart
• Bar chart: Displays categories as bars with lengths
proportional to the values (e.g., counts or percentages).
A bar chart is a graphical representation used to
compare categorical data. It uses rectangular bars to
show the frequency, count, or other metrics associated
with each category.
• Use Case: Comparing counts or frequencies across
categories. Showing categorical data with a few distinct
groups.
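A bar chart of category counts could be drawn as sketched here with matplotlib. The data is the same kind of hypothetical toy Series, and the non-interactive "Agg" backend is an assumption so that the script also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumed: render off-screen, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy values for a 'cylinders' attribute.
cylinders = pd.Series([4, 4, 6, 8, 4, 6, 4, 8], name="cylinders")

# One bar per category, ordered by category value.
counts = cylinders.value_counts().sort_index()

fig, ax = plt.subplots()
bars = ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel("cylinders")
ax.set_ylabel("count")
ax.set_title("Number of cars per cylinder category")

# In an interactive session, plt.show() would display the figure.
```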
Pie Chart