
3. Exploring Categorical Data_students

The document explores the structure of data, focusing on categorical and numerical data types. It discusses the classification of numerical data into continuous and discrete categories, and the treatment of discrete data as categorical data depending on context. Additionally, it covers encoding techniques for categorical data, visualization methods, and operations applicable to categorical data.

Uploaded by

Rohith Saindla
Copyright: © All Rights Reserved

EXPLORING STRUCTURE OF DATA

Exploring categorical data


Data
• Numerical (quantifiable data)
    • Continuous data
    • Discrete data
• Categorical (qualitative data)
    • Nominal
    • Ordinal

• Continuous data: Numeric data that can take on any value within a
range. It is measurable and can include fractional or decimal values.
E.g. weight of an ...
• Discrete data: Discrete data is countable and has a finite or
countably infinite number of values. E.g. number of cylinders in a car.
• Note: Numerical data can also be classified based on its level of
measurement: interval and ratio.
Numerical data
• Numerical attributes in machine learning are features
or variables in a dataset that represent numeric,
quantifiable data. These attributes can take on numeric
values and are used to perform mathematical
operations, which is essential for building and training
machine learning models.
Types of numerical data
Can discrete data be treated as categorical data?
• Yes, discrete data can sometimes be treated as
categorical data, but it depends on the context and
the nature of the data. Here's an explanation:
• Key Definitions
• Discrete Data: Data that consists of distinct, separate
values (often integers) that can be counted. Examples:
number of children, cylinders in a car, or the year of
manufacture.
• Categorical Data: Data that represents labels or
categories, which may or may not have an inherent
order. Examples: car brands, fuel types, or levels of
education.
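The distinction above can be sketched in pandas. This is a minimal illustration, assuming a small made-up `cylinders` column; the sample values and the column name `cylinders_cat` are hypothetical and not taken from the slides.

```python
import pandas as pd

# Hypothetical sample: number of cylinders per car (discrete, integer-valued).
cars = pd.DataFrame({"cylinders": [4, 6, 8, 4, 4, 6]})

# Kept as a numeric column, arithmetic such as the mean is available.
mean_cylinders = cars["cylinders"].mean()

# Treated as categorical, each distinct value becomes a label that we count.
cars["cylinders_cat"] = cars["cylinders"].astype("category")
counts = cars["cylinders_cat"].value_counts()

print(mean_cylinders)
print(counts)
```

Whether the numeric or the categorical view is more useful depends on the context, for example on whether the gap between 4 and 6 cylinders is meaningful for the model.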
Categorical attribute
• Categorical attributes in machine learning are features
or variables in a dataset that represent categories or
labels.
Data
• Numerical
    • Continuous data
    • Discrete data
• Categorical
    • Nominal
    • Ordinal
Types of Categorical Attributes
Handling categorical data
• It is common for real-world datasets to contain one or
more categorical features. Categorical data can be
further divided into ordinal and nominal features.
Ordinal features are categorical values that can be
sorted or ordered. For example, t-shirt size is an
ordinal feature, because we can define
XL > L > M > S > XS.
• In contrast, nominal features don't imply any order. For
example, t-shirt color is a nominal feature, since it
typically doesn't make sense to say that red is
larger than blue.
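One way to express the ordinal/nominal difference in code is with pandas categoricals. This is a sketch under the assumption that pandas is used; the sample values are made up for illustration.

```python
import pandas as pd

# Ordinal feature: the category order XS < S < M < L < XL is declared explicitly.
sizes = pd.Series(pd.Categorical(
    ["M", "XL", "S", "L"],
    categories=["XS", "S", "M", "L", "XL"],
    ordered=True,
))

# Nominal feature: the categories carry no order.
colors = pd.Series(pd.Categorical(["red", "blue", "green", "red"]))

larger_than_m = sizes > "M"  # comparison follows the declared order
largest = sizes.max()        # largest size according to that order

print(larger_than_m.tolist())
print(largest)
print(colors.cat.ordered)    # False: order comparisons are not defined
```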
Exploring Categorical Attributes in Machine Learning
Encoding categorical (ordinal) variables
• To make sure that the learning algorithm interprets the
ordinal features correctly, we need to convert the
categorical string values into integers. Unfortunately,
there is no convenient function that can automatically
derive the correct order of the labels of our size
feature, so we have to define the mapping manually. In
the following simple example, let's assume that we know
the numerical difference between the feature values, for
example XL = L + 1 = M + 2.
Categorical data encoding with pandas
• Before we explore different techniques for handling categorical
data, let's create a new DataFrame to illustrate the problem.
• The DataFrame contains a nominal feature (color), an ordinal
feature (size), and a numerical feature (price). The class labels
are stored in the last column.
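The DataFrame from the original slide is not preserved in this text. A minimal sketch that matches the description might look like the following; the three rows and the column name `classlabel` are assumptions made for illustration.

```python
import pandas as pd

# Illustrative rows (assumed values): color, size, price, class label.
df = pd.DataFrame(
    [
        ["green", "M", 10.1, "class1"],
        ["red", "L", 13.5, "class2"],
        ["blue", "XL", 15.3, "class1"],
    ],
    columns=["color", "size", "price", "classlabel"],
)

print(df)
```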
Mapping/ encoding ordinal features
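A manual mapping of the size feature could be sketched as follows. The integers are chosen so that XL = L + 1 = M + 2; the names `size_mapping`, `size_encoded`, and `size_decoded`, as well as the sample rows, are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"size": ["M", "L", "XL"]})  # hypothetical sample values

# Manual mapping that encodes XL = L + 1 = M + 2.
size_mapping = {"M": 1, "L": 2, "XL": 3}
df["size_encoded"] = df["size"].map(size_mapping)

# Inverse mapping, useful for turning the integers back into labels later.
inv_size_mapping = {value: label for label, value in size_mapping.items()}
df["size_decoded"] = df["size_encoded"].map(inv_size_mapping)

print(df)
```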
Encoding class labels
• To encode the class labels, we can use an approach
similar to the mapping of ordinal features discussed
previously. We need to remember that class labels are
not ordinal, so it doesn't matter which integer we
assign to a particular string label. Thus, we can simply
enumerate the class labels.
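Enumerating the class labels could look like the sketch below. The column name `classlabel` and the sample values are assumptions; the specific integers assigned are arbitrary, since class labels are not ordinal.

```python
import pandas as pd

df = pd.DataFrame({"classlabel": ["class1", "class2", "class1"]})  # hypothetical

# Enumerate the sorted unique labels; any one-to-one assignment would do.
unique_labels = sorted(df["classlabel"].unique())
class_mapping = {label: idx for idx, label in enumerate(unique_labels)}
df["classlabel_encoded"] = df["classlabel"].map(class_mapping)

print(class_mapping)
print(df)
```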
Another way of encoding class labels
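The code from this slide is not preserved, so the method the author used is unknown. One possible alternative, assumed here purely for illustration, is pandas' `factorize`, which assigns integer codes in order of first appearance and returns the unique labels so the encoding can be reversed.

```python
import pandas as pd

labels = pd.Series(["class1", "class2", "class1"])  # hypothetical class labels

# codes: one integer per row; uniques: the label that each integer stands for.
codes, uniques = pd.factorize(labels)

# Indexing the unique labels with the codes recovers the original strings.
decoded = uniques[codes]

print(codes)
print(list(uniques))
```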
Encoding nominal features

Color is a nominal feature
• Can you spot a problem in the previous slide?
• The encoding assumes that green is larger than blue,
and red is larger than green.
• Although this assumption is incorrect, a classifier could
still produce useful results. However, those results
would not be optimal.
• A common workaround for this problem is to use
one-hot encoding.
For one-hot encoding, use the get_dummies function
• The get_dummies function in pandas performs one-hot encoding, a
process that converts categorical variables into a format suitable for machine
learning models. Specifically, it transforms each unique category in a column
into a separate binary column, where:
• 1 indicates the presence of that category for a specific row.
• 0 indicates its absence.
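A minimal sketch of one-hot encoding with `pd.get_dummies` follows. The sample rows are assumptions; `dtype=int` is passed so the new columns hold 0/1 rather than booleans.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["green", "red", "blue"],  # hypothetical nominal feature
    "price": [10.1, 13.5, 15.3],        # hypothetical numerical feature
})

# One binary column is created per distinct color; price is left untouched.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded)
```

Each row has exactly one 1 among the `color_*` columns, which marks the color of that observation.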
• When using one-hot encoding, we have to keep in mind that it introduces
multicollinearity, which can be an issue for some models. To reduce the
correlation among the variables, we can simply remove one feature column from
the one-hot encoded array. Note that we don't lose any important information
by removing a feature column. For example, if we remove color_blue, the
feature information is still preserved: if we observe color_green = 0 and
color_red = 0, the observation must be blue.
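Dropping one of the one-hot columns can be sketched with the `drop_first` parameter of `pd.get_dummies`. The sample rows are assumptions; with these values the first category in sorted order, blue, is the one that gets dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["green", "red", "blue"],  # hypothetical nominal feature
    "price": [10.1, 13.5, 15.3],        # hypothetical numerical feature
})

# drop_first=True removes the column of the first category (color_blue here).
encoded = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

# Blue is still recoverable: it is the row where both remaining columns are 0.
is_blue = (encoded["color_green"] == 0) & (encoded["color_red"] == 0)

print(encoded)
print(is_blue.tolist())
```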
Transforming numeric (continuous) features to categorical features
• Sometimes a continuous numerical variable needs to be
transformed into a categorical variable. For
example, we may want to treat the real estate price
prediction problem, which is a regression problem, as a
real estate price category prediction, which is a
classification problem. In that case, we can 'bin' the
numerical data into multiple categories based on the
data range. In the real estate price prediction example,
the original data set has a numerical feature
apartment_price, as shown in Figure 4.5a. It can be
transformed into a categorical variable price-grade,
either as shown in Figure 4.5b or as shown in
Figure 4.5c.
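Binning can be sketched with `pd.cut`. The figures referenced above are not available here, so the prices, the bin edges, and the grade labels below are all assumptions; the column is named `price_grade` in code as a stand-in for price-grade.

```python
import pandas as pd

# Hypothetical apartment prices.
apartments = pd.DataFrame({"apartment_price": [45_000, 82_000, 120_000, 260_000]})

# Assumed bin edges and labels: (0, 75k] -> Low, (75k, 150k] -> Medium, above -> High.
bins = [0, 75_000, 150_000, float("inf")]
labels = ["Low", "Medium", "High"]

apartments["price_grade"] = pd.cut(
    apartments["apartment_price"], bins=bins, labels=labels
)

print(apartments)
```

Using fewer or more bins gives a coarser or finer set of categories from the same numerical feature.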
Consider MPG dataset
Categorical data
• We may also be interested to know the proportion (or
percentage) of data elements belonging to a
category. For example, for the attribute 'cylinders', the
proportion of data elements belonging to the category 4
is 204 ÷ 398 = 0.513, i.e. 51.3%, as shown in the previous
slide.
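The proportion calculation can be reproduced with `value_counts`. Only the figures 204 and 398 come from the text above; the split of the remaining 194 rows across the other cylinder values is an assumption made so that the example runs without the dataset file.

```python
import pandas as pd

# 4 -> 204 is from the text; the other counts are assumed and sum to 194.
cylinder_counts = {4: 204, 8: 103, 6: 84, 3: 4, 5: 3}
cylinders = pd.Series(
    [value for value, n in cylinder_counts.items() for _ in range(n)],
    name="cylinders",
)

counts = cylinders.value_counts()                      # frequency per category
proportions = cylinders.value_counts(normalize=True)   # share per category

print(counts.loc[4], len(cylinders))
print(round(proportions.loc[4], 3))  # 204 / 398 rounded to three decimals
```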
Visualization of categorical data
• Bar chart
• Pie chart
Bar Chart
• Bar chart: Displays categories as bars with lengths
proportional to the values (e.g., counts or percentages).
A bar chart is a graphical representation used to
compare categorical data. It uses rectangular bars to
show the frequency, count, or other metrics associated
with each category.
• Use Case: Comparing counts or frequencies across
categories. Showing categorical data with a few distinct
groups.
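A bar chart of category counts might be drawn as in the sketch below, assuming matplotlib is available. The counts are illustrative assumptions, and the `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts of cars per number of cylinders.
counts = pd.Series({"3": 4, "4": 204, "5": 3, "6": 84, "8": 103})

fig, ax = plt.subplots()
bars = ax.bar(list(counts.index), counts.values)  # one bar per category
ax.set_xlabel("cylinders")
ax.set_ylabel("count")
ax.set_title("Cars by number of cylinders")

# Call fig.savefig("cylinders_bar.png") or plt.show() to view the chart.
```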
Pie Chart

• What it shows: Proportions of each category as parts
of a whole.
• When to use:
• For a small number of categories (e.g., <6).
• When highlighting proportions or percentages.
• Example: Proportions of cars by fuel type.
Proportions of cars by number of cylinders.
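A pie chart of proportions might be drawn as in the sketch below, assuming matplotlib is available. The fuel types and their counts are made up for illustration, and the `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts of cars per fuel type.
fuel_counts = pd.Series({"petrol": 55, "diesel": 30, "electric": 15})

fig, ax = plt.subplots()
# autopct prints each slice's share of the whole as a percentage.
ax.pie(fuel_counts.values, labels=list(fuel_counts.index), autopct="%1.1f%%")
ax.set_title("Cars by fuel type")

slice_texts = [t.get_text() for t in ax.texts]  # category labels and percentages

# Call fig.savefig("fuel_pie.png") or plt.show() to view the chart.
```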
Operations that can be applied to categorical data
• Counting Frequency
• Sorting
• Mode (both nominal and ordinal)
• Median ( if ordinal)
• Proportions
• Mode of a dataset is the data value that appears most
often. For a categorical attribute, it is the
category with the highest number of data values.
• An attribute may have one or more modes. A frequency
distribution with a single mode is called
'unimodal', one with two modes 'bimodal', and one with
more than two modes 'multimodal'.
• The mean cannot be computed for categorical
attributes, and the median is defined only when the
categories are ordered (ordinal). For nominal
attributes, the mode is therefore the sole measure of
central tendency.
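The operations listed above can be sketched on an ordered categorical in pandas. The sample sizes are made up, and the median is taken here as the middle element of the sorted values, which is a simple convention for an odd number of observations (an even count would need a rule for choosing between the two middle values).

```python
import pandas as pd

# Hypothetical ordinal attribute with a declared category order.
sizes = pd.Series(pd.Categorical(
    ["S", "M", "M", "L", "XL"],
    categories=["XS", "S", "M", "L", "XL"],
    ordered=True,
))

frequency = sizes.value_counts()   # counting frequency per category
modes = sizes.mode().tolist()      # most frequent category (or categories)

# Sorting follows the declared order; the middle element is the median.
sorted_sizes = sizes.sort_values().reset_index(drop=True)
median_size = sorted_sizes.iloc[len(sorted_sizes) // 2]

print(frequency.loc["M"], modes, median_size)
```

For a nominal attribute such as color, only the counting and mode steps apply, because sorting by category order is not meaningful.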
