3. Exploring Categorical Data_students

The document explores the structure of data, focusing on categorical and numerical data types. It discusses the classification of numerical data into continuous and discrete categories, and the treatment of discrete data as categorical data depending on context. Additionally, it covers encoding techniques for categorical data, visualization methods, and operations applicable to categorical data.

Uploaded by

Rohith Saindla

EXPLORING

STRUCTURE OF DATA

Exploring categorical data

Data
• Quantifiable data (Numerical)
  • Continuous data: Numeric data that can take on any value within a
    range. It is measurable and can include fractional or decimal
    values. E.g. weight of an ...
  • Discrete data: Discrete data is countable and has a finite or
    countably infinite number of values. E.g. number of cylinders in
    a car.
• Qualitative data (Categorical)
  • Nominal
  • Ordinal
Note: Numerical data can also be classified based on its level of
measurement: interval and ratio.
Numerical data
• Numerical attributes in machine learning are features
or variables in a dataset that represent numeric,
quantifiable data. These attributes can take on numeric
values and are used to perform mathematical
operations, which is essential for building and training
machine learning models.
Types of numerical data
Can discrete data be treated as
categorical data?
• Yes, discrete data can sometimes be treated as
categorical data, but it depends on the context and
the nature of the data. Here's an explanation:
• Key Definitions
• Discrete Data: Data that consists of distinct, separate
values (often integers) that can be counted. Examples:
number of children, cylinders in a car, or the year of
manufacture.
• Categorical Data: Data that represents labels or
categories, which may or may not have an inherent
order. Examples: car brands, fuel types, or levels of
education.
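A minimal sketch of this idea, assuming pandas is available. The `cars` DataFrame and its values are hypothetical; the point is only that the same discrete column can be summarized numerically or treated as a set of labels.

```python
import pandas as pd

# Hypothetical sample: number of cylinders for six cars (illustrative values).
cars = pd.DataFrame({"cylinders": [4, 6, 4, 8, 4, 6]})

# Treated as discrete numerical data, arithmetic such as the mean is meaningful.
mean_cylinders = cars["cylinders"].mean()

# Treated as categorical data, each distinct value becomes a label,
# and we count how often each label occurs instead.
cars["cylinders_cat"] = cars["cylinders"].astype("category")
counts = cars["cylinders_cat"].value_counts().sort_index()

print(mean_cylinders)
print(counts)
```

Whether the categorical view is appropriate depends on the context, for example on whether the distance between 4 and 8 cylinders matters to the model.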
Categorical attribute
• Categorical attributes in machine learning are features
or variables in a dataset that represent categories or
labels.
Data
• Numerical: continuous data, discrete data
• Categorical: nominal, ordinal
Types of Categorical Attributes
Handling categorical data
• It is common for real-world datasets to contain one or
more categorical features. Among categorical features,
we further distinguish between ordinal and nominal
features. Ordinal features are categorical values that
can be sorted or ordered. For example, t-shirt size is an
ordinal feature, because we can define the order
XL > L > M > S > XS.
• In contrast, nominal features don't imply any order. For
example, t-shirt color is a nominal feature, since it
typically doesn't make sense to say that red is larger
than blue.
Exploring Categorical Attributes in Machine
Learning
Encoding categorical (ordinal)
variables
• To make sure that the learning algorithm interprets the
ordinal features correctly, we need to convert the
categorical string values into integers. Unfortunately,
there is no convenient function that can automatically
derive the correct order of the labels of our size
feature, so we have to define the mapping manually. As a
simple example, let's assume that we know the numerical
difference between the sizes, e.g. XL = L + 1 = M + 2.
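The manual mapping could be sketched as follows, assuming pandas. The sample values and the names `size_mapping` and `inv_size_mapping` are assumptions made for illustration; the slide's own code is not preserved in this text.

```python
import pandas as pd

# Hypothetical sample data with an ordinal "size" column.
df = pd.DataFrame({"size": ["M", "L", "XL"]})

# Manually defined order, following XL = L + 1 = M + 2.
size_mapping = {"M": 1, "L": 2, "XL": 3}
df["size"] = df["size"].map(size_mapping)

# The reverse mapping recovers the original string labels when needed.
inv_size_mapping = {value: label for label, value in size_mapping.items()}
original_sizes = df["size"].map(inv_size_mapping)

print(df)
print(original_sizes)
```

The integers only encode the order. The equal spacing between them is the assumption stated above, not something the data itself guarantees.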
Categorical data encoding with
pandas
• Before we explore different techniques for handling categorical
data, let's create a new DataFrame to illustrate the problem.

The DataFrame contains a nominal feature (color), an ordinal
feature (size), and a numerical feature (price). The class labels
are stored in the last column.
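The DataFrame itself is not preserved in this text. A minimal sketch with the columns described above might look like this; all cell values and the column name `classlabel` are assumptions.

```python
import pandas as pd

# Illustrative rows: color (nominal), size (ordinal), price (numerical),
# and the class label in the last column. All values are assumed.
df = pd.DataFrame(
    [
        ["green", "M", 10.1, "class1"],
        ["red", "L", 13.5, "class2"],
        ["blue", "XL", 15.3, "class1"],
    ],
    columns=["color", "size", "price", "classlabel"],
)

print(df)
```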
Mapping/ encoding ordinal features
Encoding class labels
• To encode the class labels, we can use an approach
similar to the mapping of ordinal features discussed
previously. We need to remember that class labels are
not ordinal, so it doesn't matter which integer we
assign to a particular string label. Thus, we can simply
enumerate the class labels.
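A short sketch of enumerating class labels, assuming pandas and a hypothetical `classlabel` column:

```python
import pandas as pd

# Hypothetical class labels.
df = pd.DataFrame({"classlabel": ["class1", "class2", "class1"]})

# Enumerate the unique labels; sorting only makes the mapping reproducible,
# since the particular integers carry no meaning.
class_mapping = {
    label: idx for idx, label in enumerate(sorted(df["classlabel"].unique()))
}
df["classlabel"] = df["classlabel"].map(class_mapping)

print(class_mapping)
print(df)
```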
Another way of encoding class
labels
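The code from this slide is not preserved. One possible alternative, shown here as an assumption rather than as the slide's method, is `pandas.factorize`, which returns integer codes together with the unique labels. scikit-learn's `LabelEncoder` is another common choice.

```python
import pandas as pd

# Hypothetical class labels.
labels = pd.Series(["class1", "class2", "class1"])

# codes: one integer per row; uniques: the label each integer stands for.
codes, uniques = pd.factorize(labels, sort=True)

# Indexing the unique labels with the codes recovers the original strings.
decoded = uniques[codes]

print(codes)
print(list(decoded))
```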
Encoding nominal features

Color is a nominal feature.
• Can you spot a problem in the previous slide?
• With this encoding, a learning algorithm will assume that
green is larger than blue, and red is larger than green.
• Although this assumption is incorrect, a classifier could
still produce useful results. However, those results
would not be optimal.
• A common workaround for this problem is to use
one-hot encoding.
For one-hot encoding, use the get_dummies function
• The get_dummies function in pandas performs one-hot encoding, a process
that converts categorical variables into a format suitable for machine
learning models. Specifically, it transforms each unique category in a column
into a separate binary column, where:
  • 1 indicates the presence of that category for a specific row.
  • 0 indicates its absence.
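A minimal sketch of `get_dummies` on a hypothetical DataFrame. The sample values are assumptions, and `dtype=int` is passed so the result shows 1/0 rather than True/False on recent pandas versions.

```python
import pandas as pd

# Hypothetical data with one nominal column and one numerical column.
df = pd.DataFrame(
    {"color": ["green", "red", "blue"], "price": [10.1, 13.5, 15.3]}
)

# Only the listed column is expanded; numerical columns are left unchanged.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)

print(one_hot)
```

Each row has exactly one 1 among the `color_*` columns, so no artificial order between colors is introduced.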
One caveat of one-hot encoding is that it introduces multicollinearity, which
can be an issue. To reduce the correlation among the columns, we can simply
remove one feature column from the one-hot encoded array. Note that we don't
lose any important information by removing a feature column. For example, if
we remove color_blue, the feature information is still preserved: if we observe
color_green = 0 and color_red = 0, the observation must be blue.
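Dropping one column can be sketched with the `drop_first` parameter of `get_dummies`. The sample values are assumptions. With these labels, the first category in sorted order is blue, so `color_blue` is the column that is removed.

```python
import pandas as pd

# Hypothetical data; the third row is the blue observation.
df = pd.DataFrame(
    {"color": ["green", "red", "blue"], "price": [10.1, 13.5, 15.3]}
)

# drop_first=True removes the column of the first category (here color_blue).
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(reduced)
```

The blue row is still identifiable because both remaining color columns are 0 for it.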
Transforming numeric (continuous)
features to categorical features
• Sometimes we need to transform a continuous
numerical variable into a categorical variable. For
example, we may want to treat the real estate price
prediction problem, which is a regression problem, as a
real estate price category prediction, which is a
classification problem. In that case, we can 'bin' the
numerical data into multiple categories based on the
data range. In the context of the real estate price
prediction example, the original data set has a
numerical feature apartment_price as shown in Figure
4.5a. It can be transformed to a categorical variable
price-grade either as shown in Figure 4.5b or as shown
in Figure 4.5c.
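Binning can be sketched with `pandas.cut`. The prices, bin edges, and grade names below are assumptions chosen for illustration; they are not the values from Figure 4.5.

```python
import pandas as pd

# Hypothetical apartment prices.
prices = pd.Series([95_000, 180_000, 260_000, 410_000], name="apartment_price")

# Assumed bin edges; each interval is open on the left and closed on the right.
bins = [0, 150_000, 300_000, float("inf")]
labels = ["Low", "Medium", "High"]

# The result is an ordered categorical variable, i.e. an ordinal feature.
price_grade = pd.cut(prices, bins=bins, labels=labels)

print(price_grade)
```

Changing the number of bins or their edges gives a coarser or finer grading of the same numerical feature.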
Consider MPG dataset
Categorical data
• We may also be interested in the proportion (or
percentage) of data elements belonging to a
category. For example, for the attribute 'cylinders', the
proportion of data elements belonging to the category 4
is 204 ÷ 398 = 0.513, i.e. 51.3%, as shown in the previous
slide.
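The proportion can be reproduced with `value_counts(normalize=True)`. The count of 204 out of 398 for category 4 comes from the text; the counts for the other categories are assumptions chosen so that the total is 398.

```python
import pandas as pd

# Count per cylinder category; only the entry for 4 is taken from the text.
cylinder_counts = {3: 4, 4: 204, 5: 3, 6: 84, 8: 103}

# Expand the counts into one value per car.
cylinders = pd.Series(
    [value for value, n in cylinder_counts.items() for _ in range(n)],
    name="cylinders",
)

counts = cylinders.value_counts().sort_index()
proportions = cylinders.value_counts(normalize=True).sort_index()

print(counts)
print(proportions.round(3))
```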
Visualization of categorical data
• Bar chart
• Pie chart
Bar Chart
• Bar chart: Displays categories as bars with lengths
proportional to the values (e.g., counts or percentages).
A bar chart is a graphical representation used to
compare categorical data. It uses rectangular bars to
show the frequency, count, or other metrics associated
with each category.
• Use Case: Comparing counts or frequencies across
categories. Showing categorical data with a few distinct
groups.
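A bar chart of category counts could be drawn as follows, assuming pandas and matplotlib are installed. The data values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cylinder values for six cars.
cylinders = pd.Series([4, 4, 4, 6, 6, 8], name="cylinders")
counts = cylinders.value_counts().sort_index()

fig, ax = plt.subplots()
# Convert the categories to strings so they are drawn as labels, not numbers.
bars = ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel("cylinders")
ax.set_ylabel("count")
ax.set_title("Number of cars per cylinder category")
```

Call `plt.show()` or `fig.savefig(...)` afterwards to display or save the chart.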
Pie Chart

• What it shows: Proportions of each category as parts
of a whole.
• When to use:
  • For a small number of categories (e.g., <6).
  • When highlighting proportions or percentages.
• Example: Proportions of cars by fuel type, or
proportions of cars by number of cylinders.
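A pie chart of proportions by fuel type might be sketched like this, assuming pandas and matplotlib. The fuel types and their frequencies are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical fuel types for six cars.
fuel = pd.Series(
    ["petrol", "petrol", "diesel", "petrol", "electric", "diesel"],
    name="fuel_type",
)

# Share of each category; the shares sum to 1.
shares = fuel.value_counts(normalize=True)

fig, ax = plt.subplots()
# autopct prints each wedge's percentage inside the wedge.
wedges, texts, autotexts = ax.pie(
    shares.values, labels=list(shares.index), autopct="%1.1f%%"
)
ax.set_title("Proportion of cars by fuel type")
```

Call `plt.show()` or `fig.savefig(...)` afterwards to display or save the chart.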
Operations that can be applied to
categorical data
• Counting Frequency
• Sorting
• Mode (nominal and ordinal)
• Median (ordinal only)
• Proportions
Operations that can be applied to
categorical data
• An attribute may have one or more modes. A frequency
distribution with a single mode is called ‘unimodal’,
one with two modes ‘bimodal’, and one with more than
two modes ‘multimodal’.
• The mode of a dataset is the value that appears most
often. For a categorical attribute, it is the
category with the highest number of data values.
Since the mean cannot be computed for categorical
variables and the median is defined only for ordinal
ones, the mode is the only measure of central tendency
that applies to every categorical attribute.
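These operations can be sketched on an ordered pandas categorical. The t-shirt sizes below are hypothetical, and the median is taken as the middle element after sorting by category order, which is unambiguous here because the number of values is odd.

```python
import pandas as pd

# Hypothetical ordinal data with an explicit category order.
sizes = pd.Series(
    pd.Categorical(
        ["M", "S", "L", "M", "XL", "M", "S"],
        categories=["XS", "S", "M", "L", "XL"],
        ordered=True,
    )
)

# Counting frequency (categories that never occur get a count of 0).
frequency = sizes.value_counts()

# Mode: works for nominal and ordinal data; there may be more than one.
modes = sizes.mode().tolist()

# Median: only meaningful for ordinal data. Sort by category order
# and take the middle element.
ordered = sizes.sort_values().reset_index(drop=True)
median_size = ordered.iloc[len(ordered) // 2]

print(frequency)
print(modes, median_size)
```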
