
DATA MINING - I

[FOR LIMITED CIRCULATION]

Editorial Board

Ms. Asha Yadav


Dr. Reema Thareja
Content Writers

Ms. Aishwarya Anand Arora, Dr. Charu Gupta


Academic Coordinator

Deekshant Awasthi

© Department of Distance and Continuing Education


E-mail: [email protected]
[email protected]

Published by:
Department of Distance and Continuing Education
Campus of Open Learning, School of Open Learning,
University of Delhi, Delhi-110007

Printed by:
School of Open Learning, University of Delhi
DATA MINING - I

External Reviewer

Dr. Bharti
Assistant Professor,
Department of Computer Science, University of Delhi

Disclaimer

Corrections/Modifications/Suggestions proposed by Statutory Body, DU/Stakeholder/s
in the Self Learning Material (SLM) will be incorporated in the next edition.
However, these corrections/modifications/suggestions will be uploaded on the
website https://sol.du.ac.in. Any feedback or suggestions may be sent at the
email: [email protected]

Printed at: Taxmann Publications Pvt. Ltd., 21/35, West Punjabi Bagh,
New Delhi - 110026 (600 Copies, 2024)

Syllabus
Data Mining - I

Unit - I: Introduction to Data Mining
Motivation and challenges for data mining, types of data mining tasks, applications of data mining, data measurements, data quality, supervised vs. unsupervised techniques.
Syllabus Mapping: Lesson 1: Introduction to Data Mining (Pages 1–20)

Unit - II: Data Pre-Processing
Data aggregation, sampling, dimensionality reduction, feature subset selection, feature creation, variable transformation.
Syllabus Mapping: Lesson 2: Data Pre-processing: Transforming Raw Data into Processed Data (Pages 21–37)

Unit - III: Cluster Analysis
Basic concepts of clustering, measure of similarity, types of clusters and clustering methods, K-means algorithm, measures for cluster validation, determining the optimal number of clusters.
Syllabus Mapping: Lesson 3: The Art of Grouping: Exploring Cluster Analysis (Pages 38–62)

Unit - IV: Association Rule Mining
Transaction data-set, frequent itemset, support measure, rule generation, confidence of association rule, Apriori algorithm, Apriori principle.
Syllabus Mapping: Lesson 4: Data Connections: The Essentials of Association Rule Mining (Pages 63–83)

Unit - V: Classification
Naive Bayes classifier, Nearest Neighbour classifier, decision tree, overfitting, confusion matrix, evaluation metrics and model evaluation.
Syllabus Mapping: Lesson 5: Building Blocks of Classification Systems (Pages 84–111)

Contents

PAGE
Lesson 1: Introduction to Data Mining 1–20

Lesson 2: Data Pre-processing: Transforming Raw Data into Processed Data 21–37

Lesson 3: The Art of Grouping: Exploring Cluster Analysis 38–62

Lesson 4: Data Connections: The Essentials of Association Rule Mining 63–83

Lesson 5: Building Blocks of Classification Systems 84–111

Glossary 113–118

LESSON 1

Introduction to Data Mining
Aishwarya Anand Arora
Assistant Professor
School of Open Learning
University of Delhi
Email-Id: [email protected]

STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 What is Data Mining?
1.4 Applications of Data Mining
1.5 Data Mining Task
1.6 Motivation and Challenges
1.7 Types of Data Attributes and Measurements
1.8 Data Quality
1.9 Supervised vs. Unsupervised
1.10 Summary
1.11 Answers to In-Text Questions
1.12 Self-Assessment Questions
1.13 References
1.14 Suggested Readings

1.1 Learning Objectives
• To understand the concepts of data mining.
• To know the applications of data mining.
• To apply techniques of data mining in the real world.
• To extract valuable information from unstructured data using data mining techniques.

1.2 Introduction
Data mining is like being a detective: instead of solving crimes, you
uncover secret patterns hidden in a sea of data. Suppose you have a big
treasure chest filled with information in the form of text, images, and
other multimedia. At the surface it looks like jumbled facts, but with the
right tools you can see through this data and find valuable patterns that
are useful to your business. Think of it as mining gold. Data mining is
the process of turning huge amounts of data into useful information, which
is then transformed into knowledge. It is the process of extracting patterns,
trends, and insights from large databases through various computational
methods and algorithms. It involves extracting valuable information and
knowledge from raw data,
which in turn helps an organization make data-driven decisions and develop
a fuller understanding of their operations, customers, and markets. Some
of the techniques used in data mining include statistical analysis, machine
learning, pattern identification, and visualization. It makes predictive
modelling, anomaly detection, clustering, and association rule mining more
accessible by revealing hidden patterns and correlations within data. Data
mining is crucial in many fields today, including business intelligence,
marketing, health care, finance, and cybersecurity, since data in the digital
world is growing exponentially. Data mining gives an organization a
competitive advantage and consolidates valuable insights.
Following is a figure that illustrates how the discovery of knowledge is
done through the process of Data Mining.


Figure 1.1: Knowledge Discovery by Data Mining


It also works within the big vista of data life cycle, which includes
stages such as collecting, preprocessing, model development or learning,
evaluation, and deployment. Typically, data is collected and pre-processed
to ensure its quality and relevance, following problem definition and
domain understanding. To arrive at patterns and valuable knowledge,
appropriate data mining techniques are applied on the pre-processed data.
Further evaluation of the models is performed for accuracy, reliability, and
generalization capability using several measures. The learned knowledge
is finally integrated into operational systems or applied to support deci-
sion-making processes to create maximum value for the company. Data
mining techniques have correspondingly evolved to handle scale, and hence
large data sets, as data increases in volume and complexity, enabling
enterprises to unlock the true value of their data assets.

1.3 What is Data Mining?
Now, visualize a gigantic library with thousands of books. Can you read
every single one of those to find out which one is the best? Well! I’m
sure this is impossible since you could spend years doing this. Here
comes Data Mining into the picture. Finding patterns, connections, and
insights in big databases to extract useful information is known as Data
Mining. It analyses data and discovers hidden patterns in several ways by
utilizing the techniques taken from database systems, machine learning,
and statistics for predictive modelling and decision making. Data mining
provides organizations with the ability to find patterns, anomalies, and
relationships through association rule mining, regression analysis, clustering,
and classification. The mined knowledge can be used in a wide variety
of marketing, finance, healthcare, and business applications to enhance
operations, gain competitive advantage, and make informed decisions.
Major steps of data mining include data collection, pre-processing, explor-
atory data analysis, feature engineering and selection, model development,
evaluation, deployment, and monitoring.
Data mining’s primary characteristics are:
• Automatic Pattern Recognition: Pattern recognition is the categorization
of data based on prior knowledge or on statistical information derived
from patterns and/or their representation, e.g., license plate recognition,
fingerprint analysis, face detection/verification, and voice-based
authentication.
• Forecasting Probable Results: Forecasting involves using past data to
generate predictions about what will happen in the future or under
what circumstances, e.g., a company might forecast an increase in
demand for its products during the Diwali season.
• Production of useful knowledge.
• A focus on large databases and datasets.


Figure 1.2: A Typical Data Mining System


The data in the figure above is first taken from a variety of data sources
and then cleaned to eliminate anomalies. This information is selected
according to the task requirements and integrated with the system. It is
then possible to identify and analyse useful patterns to support subsequent
decisions.

1.4 Applications of Data Mining


Knowledge discovery in databases, or KDD, is the process of extracting
significant patterns, correlations, and insights from large datasets. Appli-
cations of data mining go far and wide, revolutionizing decision-making
processes across a host of industries. Key areas of application include
but are not limited to the following:
1. Business and Market Analysis: Business intelligence involves the use
of data mining to analyse market trends, consumer behaviour, and
competitive environments. It aids in making better-informed decisions
on product development, pricing strategy, and marketing strategy.

2. Healthcare and Medicine: Data mining is used in drug development
through biological data analysis. It enables clinical decision support
by identifying trends within patient data in order to predict diseases,
and it is useful for generating personalized treatment recommendations.
3. Finance: Data mining has its application in customer segmentation,
fraud detection, and risk analysis within the financial sector. It helps
financial companies optimize investment portfolios and predict market
trends.
4. Telecommunications: Data mining is applied to analyse call detail
records (CDRs), predict network failures, and improve customer
satisfaction by understanding usage trends. It optimizes resource
utilization and network infrastructure.
5. Retail: Data mining is applied by retailers in recommendation
engines, consumer segmentation, and control of inventories. This will
help pinpoint buying trends, which in turn helps them in focused
marketing to improve client loyalty.
6. Education: Academic institutions utilize data mining to predict student
performance, identify at-risk students, and enhance the usage of course
material. Personalized learning methods result in better academic
outcomes.
7. Manufacturing and Quality Control: Data mining improves supply
chain management, monitors and improves manufacturing processes,
and predicts equipment failures. It ensures product uniformity and
supports fault detection, which helps in quality control.
8. E-commerce: Online merchants utilize data mining to study the
browsing and buying behaviour of their customers and, hence, make
personalized recommendations. It improves the user experience by
optimizing marketing strategies and website design.
9. Government and Security: Data mining is applied in government
organizations to detect criminal activities and fraud and to support
national security. It enables the processing of large volumes of data
to discover trends that may point to potential security threats.

CASE STUDY

How Spotify Creates User Playlist


Big Data has long been utilized by businesses to help them expand
by enabling them to make decisions that are highly likely to increase
sales and growth. One excellent example of this is Spotify, an internet
music-streaming service.
Spotify uses the total of your click streams to build a personal por-
trait that captures your interests, dislikes, aspirations, and goals in
addition to creating playlists that reflect who you are. This gives
the business the ability to decide how best to choose respectable
content to license that they know their target audience would enjoy.
The majority of data are user-centric, enabling them to do a variety
of tasks including making music suggestions and selecting the tune
you’ll hear on the radio next.
Even though there are a lot of knowledgeable, opinionated, and am-
bitious people working for the organization, data is used to make
judgments wherever feasible. Data-driven decisions are carefully
monitored and incorporated back into the system for use in making
future decisions.

1.5 Data Mining Task


A variety of activities are included in data mining to obtain useful infor-
mation from data. Among the basic data mining tasks are:
1. Classification: Classification is a process of naming data with some
predefined names or categories, based on its attributes. A simple
example could be spam/not-spam email classification.
2. Clustering: There are no previously established categories in clustering.
It groups similar data based on intrinsic patterns within the data.
Example: Segment customers with similar buying habits.
3. Regression: Regression looks into previous data to estimate a continuous
numerical value. A certain example could be the price of a house,
which could be forecasted by its location, size, and facilities.
4. Association Rule Mining: In huge datasets, association rule mining
reveals some interesting correlations or relationships between
PAGE 7
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi

Data Mining.indd 7 26-Sep-24 7:39:18 PM


DATA MINING - I

Notes variables. For example, consider items you usually buy at a grocery
store together.
5. Anomaly Detection: The technique of finding patterns that deviate
markedly from the usual ones. An example would be finding fraud
cases in financial transactions.
6. Sequential Pattern Mining: It’s a technique that finds patterns
occurring within sequential data, including time series and event
sequences. Example: clickstream analysis to understand customer
behaviours on a website.
7. Text Mining: Using unstructured text data, text-mining aims to extract
useful information. Sentiment analysis of customer feedback is one
example.
8. Image Mining: The goal of image mining is to identify patterns
and information within image data, for example recognizing objects
in satellite photos in order to classify land use.
9. Spatial Data Mining: This technique focuses on identifying patterns
in spatial data.
The relative geographic information about the earth and its features
is included in spatial data. A particular place on Earth is defined by
a pair of latitude and longitude coordinates. For example, consider
analysing geographic data to find trends in the spread of illness.
10. Studying Time-Series Data: Time-series analysis looks for patterns
or trends in data points gathered over an extended period of time.
For example, consider forecasting stock values using past data.

Figure 1.3: The Data Mining Process



IN-TEXT QUESTIONS

1. Which of the following best describes data mining?


(a) Storing data in a database
(b) Analysing data to find hidden patterns and relationships
(c) Collecting data from various sources
(d) Displaying data in graphical format
2. Which data mining task involves grouping similar objects
together based on their characteristics?
(a) Classification
(b) Regression
(c) Clustering
(d) Association
3. Which of the following is NOT a potential application of data
mining?
(a) Customer segmentation
(b) Weather prediction
(c) Product recommendation
(d) Inventory management

1.6 Motivation and Challenges


Motivation
The major motivations behind data mining are:
1. Knowledge Discovery: The main goal of data mining is to uncover
hidden information from large datasets so that users may make
better decisions.
2. Pattern Recognition: Data mining makes the identification of trends,
correlations, and patterns easier, which may be hard or impossible
with conventional analysis.
3. Business Competitiveness: The utilization of data mining by
organizations enables them to capture information on consumer
preference and market trends, thus giving them a competitive edge.

4. Better Decision-Making: Data mining helps organizations acquire
useful information that aids them in making strategic and better
decisions.
5. Scientific Discovery: Data mining aids scientific discovery in domains
such as environmental science and healthcare by uncovering patterns
and relationships.

Challenges
The various challenges with data mining are:
1. Data Quality: Poor quality of data, such as incomplete and noisy
information or inconsistency in data, threatens the accuracy and
reliability of the output produced in data mining.
2. Scalability: Massive data sets cannot be handled effectively with
regard to issues of memory, processing speed, and computing
efficiency.
3. Complexity and Interpretability: Patterns found by some data mining
models, especially complex machine learning models, are hard for
users to understand and may not be easily interpretable.
4. Privacy Issues: The use of personal information raises a host of
privacy issues, and ethical considerations must be taken into account
while doing data mining.
5. Choice of Algorithm: Choosing the right algorithm for any particular
task is tough, since different algorithms work better depending on
the type of data they are employed upon.
6. Dynamic Nature of Data: Because data is dynamic, the patterns in it
might change over time, and it becomes challenging to keep models
relevant as they adjust to these changes.
7. Domain Knowledge: Proper interpretation of data mining results often
requires domain knowledge, which may not be available without
domain expertise.
8. Data Integration: Pre-processing is an important task, as integrating
data from various sources, in different formats and structures, can
be difficult.

1.7 Types of Data Attributes and Measurements
An attribute is a data field representing the characteristics of a data
object. Data attributes are characteristics or properties of data used for
modelling and analysis and in data mining are variously known as fea-
tures or variables. It is these qualities that provide information about the
objects or subjects being studied and form the basis for the development
of predictive models, the identification of trends, and drawing inferences.
Following is the list of data attributes used in data mining:

Data Attribute Types:


1. Numeric Attributes: This is an attribute that takes numerical values
to represent quantitative data. Examples include age, income, height,
and temperature.
2. Nominal (Categorical) Attributes: These attributes depict qualitative
data whose discrete values fall into one or more categories. The term
relates to names, and such data is also called categorical data. A few
examples are gender, colour, and product type.
3. Ordinal Attributes: Ordinal attributes are categorical attributes with a
natural ranking or order. Examples include star ratings (1 star, 2 stars,
or 3 stars) and educational attainment (high school, bachelor's, or
master's).
4. Binary Attributes: Basically, binary attributes take the numerals 0
and 1, although there are other representations. Binary attributes
are a special case of nominal attributes in which there are only two
possible values. Examples include true/false, yes/no, and presence/
absence.
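
The four attribute types can be illustrated with a small, purely hypothetical example in Python using pandas; the column names and values below are invented for illustration and do not come from the text:

    import pandas as pd

    # Hypothetical records showing the four attribute types discussed above.
    df = pd.DataFrame({
        "age": [25, 34, 47],                                   # numeric
        "colour": ["red", "blue", "red"],                      # nominal (categorical)
        "education": ["High School", "Bachelors", "Masters"],  # ordinal
        "is_member": [1, 0, 1],                                # binary (yes/no)
    })

    # An explicit order can be attached to the ordinal attribute.
    df["education"] = pd.Categorical(
        df["education"],
        categories=["High School", "Bachelors", "Masters"],
        ordered=True,
    )
    print(df.dtypes)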

Features of the Data Attributes:


Data Attributes have the following features:
1. Scale: Numerical attributes may fall under different scales, such as
interval or ratio scales, which determine how the attribute is measured
and interpreted.


2. Granularity: A value of an attribute can represent a different degree
of detail or abstraction. For instance, a time attribute may be represented
in terms of seconds, minutes, hours, or days.
3. Sparsity: Some attributes may have many missing or null values,
which will be disadvantageous for modelling and processing.
4. Correlation: One attribute may be interdependent with other attributes,
revealing dependencies or relationships that can influence how data
mining algorithms perform.

Taking Care of Data Properties:


1. Feature Selection: Determining which features are relevant to the
analysis and modelling, in order to increase the effectiveness and
efficiency of a data mining algorithm.
2. Feature Engineering: A process of expanding or modifying already
existing features to enhance a model's predictive power.
3. Normalization and Standardization: Pre-processing steps that bring
numerical attributes onto a comparable range and a similar scale.
4. One-hot Encoding: A method that transforms categorical attributes
into a numeric representation that machine learning algorithms can read.
5. Dealing with Missing Values: Imputation or deletion techniques are
used to handle missing or incomplete attribute data; a small sketch of
these steps follows this list.
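
As a rough, illustrative sketch of the steps listed above, the snippet below imputes a missing value, one-hot encodes a categorical attribute, and standardizes a numeric attribute using pandas and scikit-learn. The column names and values are assumptions made for the example only:

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # Hypothetical dataset with one missing numeric value and a categorical column.
    df = pd.DataFrame({
        "income": [52000, 61000, None, 47000],
        "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
    })

    # Dealing with missing values: replace the missing income with the mean.
    df[["income"]] = SimpleImputer(strategy="mean").fit_transform(df[["income"]])

    # One-hot encoding: turn the categorical attribute into numeric indicator columns.
    df = pd.get_dummies(df, columns=["city"])

    # Standardization: rescale the numeric attribute to zero mean and unit variance.
    df[["income"]] = StandardScaler().fit_transform(df[["income"]])
    print(df)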

1.8 Data Quality


Data quality is an important aspect of data mining since the accuracy,
consistency, completeness, and reliability of the data directly affect the
outcomes and effectiveness of the analysis. The following provides an
overview of the data mining factors associated with data quality:
1. Data Accuracy: Accuracy is the closeness with which data describes
the real-world phenomenon it is intended to represent. Inaccurate data
may lead to wrong inferences and defective models. Common causes
include measurement mistakes, system failures, and human error during
data entry.


2. Completeness: It is the degree to which a dataset contains all the data
from the real world. Several elements may cause information to be
missing, including errors made while entering data, malfunctioning
machinery, and deliberate omissions. Missing records can affect
the validity of statistical surveys and predictive models and may
cause biases as well. Missing values can be approximated through
techniques such as data imputation, taking into consideration the
available information.
3. Consistency: Data can be said to be consistent when there are no
major inconsistencies or contradictions in the dataset. Duplicate entries,
contradictory information, or a lack of integration may result in
inconsistent data. For example, data validation standards should be
enforced both at the time of entry and during integration, to resolve
conflicting entries and discrepancies.
4. Timeliness: It’s about the relevance of the data based on the current
needs of the analysis and its freshness. Sometimes, very outdated
data results in incorrect insight and conclusions since it might not
correctly represent the status of the underlying phenomenon. Data
timeliness can be preserved with the help of frequent updates and
real-time data integration techniques.
5. Relevance: The ability of data to achieve the specific objectives and
requirements of the data mining project is called the relevance of data.
Irrelevant data introduces noise in the study. Quality evaluation must
therefore address whether the available data supports the objectives and
parameters of the analysis and whether it offers insightful information
that can drive decisions.
6. Integrity: Data integrity ensures the reliability, accuracy, and
consistency of the information in its whole life cycle, from its
collection to analysis and distribution. Audit trails, access limits, and
data quality controls should be put into practice to guarantee data
integrity by preventing illegal changes, corruption, and tampering.
7. Usability: Data usability is the ease of accessing, understanding, and
applying data for analyses and decision-making. Data should be
well-structured, well-documented, and presented in a form that makes it
easy for

data analysts and other stakeholders to understand and explore
the data.
In practice, data quality issues are extremely critical, and proper utiliza-
tion of data governance frameworks, quality assurance protocols, and data
management techniques is needed to ensure that data used for mining is
accurate, reliable, and fit for use.

1.9 Supervised vs. Unsupervised


Supervised Learning
In supervised machine learning, the model life cycle requires labelled input
and output data in the training phase. A data scientist is required to label
this training data before the model is employed for training and testing.
The model would,
after learning the relationship between input and output data, classify
new and unexplored datasets and predict the outcome.
Because at least some of this process requires human involvement, it is
called supervised machine learning. The greater part of the data available
today is in the form of raw, unlabelled data. Human involvement is usually
necessary for data to be correctly labelled and prepared for supervised
learning. Understandably enough, this is a time-consuming process since
large amounts of precisely labelled training data are needed.
Supervised machine learning categorizes unseen data into predetermined
categories and, at the same time, serves as a predictive model for detecting
patterns and future changes. A supervised machine learning model can
recognize objects and the features that classify them. Predictive models are
normally trained using supervised machine learning techniques. A supervised
model can make predictions on data it has never seen before, provided the
patterns identified in the input and output data still hold, for example when
predicting shifts in home values or consumer buying trends. Frequently,
supervised machine learning is employed in:
• Categorizing many file formats, including text, documents, and photographs.
• Making predictions about future developments and trends by identifying
patterns in training data.
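
A minimal sketch of this supervised workflow, assuming scikit-learn and its bundled Iris dataset (chosen here only for illustration), is shown below: the model is trained on labelled examples and then evaluated on unseen data.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Labelled data: each flower measurement (input) has a species label (output).
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # The model learns the input-to-label relationship on the training split...
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)

    # ...and is then used to predict labels for unseen test data.
    print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))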

Figure 1.4: Supervised Learning


Source: Gladys Kiruba, “Types of machine learning for Data science”,
created on 16 Jan 2023 accessed on 17 April 2024

Unsupervised Learning
In unsupervised machine learning, the models are trained on raw, unlabeled
training data. It is most often used for segmenting similar data into a set
number of groups or to find patterns and trends within raw information.
It is also one of the common strategies used in the early stages to get a
better view of the datasets.
As the name suggests, unsupervised machine learning takes a different
approach from supervised machine learning. Human beings may tune model
hyperparameters, such as the number of clusters, but the model handles
enormous amounts of data efficiently and autonomously. Because of this,
unsupervised machine learning is particularly apt at revealing hidden
patterns and connections within the data itself. However, with less human
control, much more consideration must be given to explaining the results
of unsupervised machine learning. Most data available currently is raw and
unlabelled. Unsupervised learning is therefore a potent way to make sense
of such data: it can group data into clusters with similar characteristics or
analyse datasets for hidden patterns. Labelled data, on the other hand, may
require a great deal of resources when using supervised machine learning.

The major applications of unsupervised machine learning are:
• Grouping datasets based on feature similarity, or segmenting data.
• Recognizing connections between various data points, such as those
found in automatic music suggestions.
• Carrying out preliminary data analysis.
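
A minimal sketch of unsupervised learning, here using K-Means clustering from scikit-learn on synthetic, unlabelled data (the data and the choice of two clusters are assumptions for illustration only):

    import numpy as np
    from sklearn.cluster import KMeans

    # Unlabelled data: only observations, no target column.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    # A human chooses the number of clusters (a hyperparameter);
    # the algorithm then groups similar points on its own.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.cluster_centers_)   # discovered group centres
    print(kmeans.labels_[:10])       # cluster assignments for the first 10 points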

Figure 1.5: Unsupervised Learning


Source: Gladys Kiruba, “Types of machine learning for Data science”,
created on 16 Jan 2023 accessed on 17 April 2024

Supervised and Unsupervised Learning

Figure 1.6: Different Machine Learning Algorithms [Adapted]

IN-TEXT QUESTIONS

4. What is the main difference between supervised and unsupervised


learning?
(a) Supervised learning uses labelled data, while unsupervised
learning uses unlabelled data
(b) Supervised learning does not require a training phase, while
unsupervised learning does
(c) Supervised learning is only used for regression tasks, while
unsupervised learning is only used for classification tasks
(d) Supervised learning is always more accurate than unsupervised
learning
5. Which of the following techniques is an example of unsupervised
learning?
(a) Linear Regression
(b) K-Means Clustering
(c) Support Vector Machines (SVM)
(d) Decision Trees
6. What is a characteristic of overfitting in supervised learning?
(a) The model generalizes well to new, unseen data
(b) The model captures noise or random fluctuations in the
training data
(c) The model is too simple to capture the underlying structure
of the data
(d) The model cannot make accurate predictions even on the
training data

1.10 Summary
Data mining is the process of finding trends, relations, and insights from
large databases to glean information that would be of importance or use-
ful to the business for making decisions. Its application extends across
multiple domains due to obvious reasons: marketing, banking, health care,


and telecommunications are just a few examples. Some examples of the
tasks of data mining are classification, regression, clustering, association
rule mining, and anomaly detection. In this lesson, you have learnt that,
among others, finding trends, enhancing decision-making, and gaining a
competitive advantage are some of the driving forces behind data mining.
However, there are problems that make data mining less successful. A few
obstacles are poor data quality, scalability, and interpretability. Data quality
covers completeness, accuracy, consistency, and timeliness of information.
Data attributes may be categorical, numerical, ordinal, or interval.
In more detail, supervised learning is a process in which a machine
learning model is trained on data for which each input is associated with
a target output label. It learns to generalize from these examples by
minimizing the discrepancy between its own predictions and the true labels,
and then makes the best possible predictions on new, unseen data. Regression
and classification problems are among the
most common tasks in supervised learning. In contrast, unsupervised
learning deals with training on unlabelled data; the algorithm hence needs
to find structures or patterns in the data on its own. Most often, the goal
is to detect hidden patterns or clusters in the data, through tasks such as
dimensionality reduction or the clustering of similar data points. Possible
applications of unsupervised learning range from recommendation systems
and data compression to anomaly detection.

1.11 Answers to In-Text Questions


1. (b) Analysing data to find hidden patterns and relationships
2. (c) Clustering
3. (b) Weather prediction
4. (a) Supervised learning uses labelled data, while unsupervised
learning uses unlabelled data
5. (b) K-Means Clustering
6. (b) The model captures noise or random fluctuations in the training
data

1.12 Self-Assessment Questions
1. What are the main objectives of data mining, and how do they
differ from traditional statistical analysis?
2. Explain the differences between supervised and unsupervised learning
in the context of data mining. Provide examples of each.
3. Describe the steps involved in the data mining process, highlighting
the importance of each step.
4. Discuss the challenges associated with handling missing data in a
dataset and explain some common techniques used to address this
issue.
5. Describe the key differences between supervised and unsupervised
learning. Provide examples of each.
6. Explain the concept of overfitting in the context of supervised
learning. How can overfitting be detected and prevented?
7. What are the main challenges in unsupervised learning, and how
are they addressed?
8. Describe two popular algorithms used in supervised learning and
two popular algorithms used in unsupervised learning. Explain how
each algorithm works and provide examples of their applications.

1.13 References
• Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and
Techniques. 3rd edition. Morgan Kaufmann.
• Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data
Mining. 1st edition. Pearson Education.
• Gladys Kiruba. "Types of machine learning for Data science". Created
on 16 Jan 2023, accessed on 17 April 2024.

1.14 Suggested Readings


• Gupta, G. K. (2006). Introduction to Data Mining with Case Studies.
Prentice-Hall of India.

• Hand, D., Mannila, H., & Smyth, P. (2006). Principles of Data Mining.
Prentice-Hall of India.
• Pujari, A. (2008). Data Mining Techniques. 2nd edition. Universities
Press.
• Ding, H., Wu, J., Zhao, W., Matinlinna, J. P., Burrow, M. F., & Tsoi,
J. K. (2023). Artificial intelligence in dentistry—A review. Frontiers in
Dental Medicine, 4, 1085251.



LESSON 2

Data Pre-processing: Transforming Raw Data into Processed Data
Aishwarya Anand Arora
Assistant Professor
School of Open Learning
University of Delhi
Email-Id: [email protected]

STRUCTURE
2.1 Learning Objectives
2.2 Introduction
2.3 Data Pre-Processing - Aggregation
2.4 Sampling
2.5 Dimensionality Reduction
2.6 Feature Subset Selection
2.7 Discretization and Binarization
2.8 Variable Transformation
2.9 Summary
2.10 Answers to In-Text Questions
2.11 Self-Assessment Questions
2.12 References
2.13 Suggested Readings

2.1 Learning Objectives


• To understand techniques of data pre-processing.
• To create a clean, informative, and well-structured dataset.

• To apply techniques to reduce the dimensions of large datasets.
• To enhance model performance using various techniques of data processing.

2.2 Introduction
Data preparation involves the conversion, cleaning, and preparation of
raw data for further analysis or modelling. It applies different techniques
that ensure data relevance, quality, and structure so that the data is fit
for the planned analytical activities. Problems such as missing values,
noisy data, outliers, and inconsistencies can affect the accuracy and
efficiency of subsequent analysis. Accordingly, properly processed data
can improve the performance of machine learning models, reduce bias,
and increase the reliability of the findings produced by other data-driven
techniques.

Figure 2.1: Data Pre-Processing in Data Mining


Source: Premananda Suna, “Data Preprocessing in Data Mining”
created on 06 Dec 2023 accessed on 29 Apr 2024

As the figure above shows, the preprocessing of data typically involves
several procedures, including data integration, data transformation,
and sometimes data cleansing. Data cleaning is the process of finding
and correcting errors, inconsistencies, and missing values in a dataset to
render it accurate and complete. Data transformation will include data for-
matting, scaling of numerical features, encoding categorical variables, and

dimensionality reduction, all aiming to improve computational efficiency and
enhance model performance. Data integration refers to bringing together
diverse sources or formats of information into one dataset for comprehen-
sive analysis and decision-making. Feature extraction and selection methods
can be used to determine which characteristics are the most relevant and
informative to the study. In any case, data preparation thus provides the
bedrock for meaningful data analysis and, in turn, enables practitioners and
academics alike to draw enlightened inferences and defensible decisions.

2.3 Data Pre-Processing - Aggregation


Data preparation in analysis involves the transformation, cleaning, and
preparation of raw data for further analysis or modelling. It includes
various techniques for enhancing relevance, quality, and organization to
ensure the data is prepared for the planned analytical activities. Raw data
can contain a variety of problems that degrade the precision of later analysis:
missing values, noisy data, outliers, and inconsistencies. Appropriately
prepared data can help researchers and analysts to improve performance for
machine learning models and other data-driven techniques, reduce biases,
and enhance the reliability of their conclusions. The various steps involved
in data preprocessing are shown in Figure 2.2 below.

Figure 2.2 Data Preprocessing Steps


Source: Agarwal, Vivek. (2015). Research on Data Preprocessing and
Categorization Technique for Smartphone Review Analysis.
International Journal of Computer Applications. 131. 30-36.
10.5120/ijca2015907309

Data pre-processing refers to a set of methods used to clean, transform,
and prepare raw data into a form that is ready for the mining process.
One of the important steps in the pre-processing chain is aggregation,
which summarizes multiple instances of data into a single representation.
Related instances can be consolidated by grouping similar instances based
on certain properties and then computing summary statistics such as counts,
averages, sums, or other aggregate functions. Summarizing the data in this
way preserves the important information while reducing the scale and
intricacy of the dataset, bringing it closer to a form suitable for analysis.
For instance, aggregation in sales data can involve categorizing transactions
by customer segment or product category and computing the average
purchase value or total sales revenue. Aggregation reduces data into compact
summaries that enhance interpretability, accelerate processing, and enable
higher-order analysis and decision-making in data mining tasks.
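
A small, hypothetical sketch of such aggregation with pandas is given below; the column names and values are invented for illustration only:

    import pandas as pd

    # Hypothetical transaction-level sales records.
    sales = pd.DataFrame({
        "product_category": ["Books", "Books", "Electronics", "Electronics", "Books"],
        "purchase_value":   [250, 400, 15000, 22000, 150],
    })

    # Aggregate individual transactions into one summary row per category:
    # total revenue, average purchase value, and transaction count.
    summary = sales.groupby("product_category")["purchase_value"].agg(
        total_sales="sum", average_purchase="mean", transactions="count"
    )
    print(summary)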
Apart from summarization, another use of aggregation within data pre-pro-
cessing is the extraction of higher-level insights and patterns, which are
not easily visible at the level of individual instances. Data aggregation
allows for the identification of broad patterns, anomalies, and trends
that inform decision-making with useful insights. Aggregation helps not
only to indicate outliers or extreme values that might distort the study
but also to reduce noise and redundancy
within the dataset. Aggregate data may be useful to reporting and visual-
ization in that it can provide rapid overviews of key metrics and trends
to stakeholders. All things considered, aggregation is a critical compo-
nent of data pre-processing since it streamlines intricate datasets while
maintaining the critical information required for insightful analysis and
decision support, as shown in Figure 2.3 below.


Figure 2.3: Data Aggregation


Source: Educba “Aggregation in data mining”, Updated on 7 Nov
2023, accessed on 29 Apr 2024

2.4 Sampling
In data mining, sampling is the process of choosing a portion of a big-
ger dataset for examination. It’s an essential method for increasing the
effectiveness of data analysis, particularly when handling big data sets.
Below is an explanation of the significance of sampling as well as a list
of typical sampling techniques in data mining:
• Efficiency: Handling big datasets requires a lot of computing power
and time. Sampling lets data miners work with manageable portions of
data, which speeds up analysis and improves its usefulness.
• Cost-cutting: Gathering, storing, and analyzing huge datasets can be
costly. Organizations can cut expenses on data processing and storage
by using sampling.
• Representativeness: A carefully thought-out sample ought to faithfully
capture the traits of the broader population it is taken from. This
guarantees that the patterns and insights gleaned from the sample can
be applied to the whole population.

Sampling methods can be broadly categorized into probability sampling
and non-probability sampling techniques. Here is an overview of the
different types within each category:

Probability Sampling Methods:


1. Simple Random Sampling:
• Every individual in the population has an equal chance of being selected.
• Achieved by randomly selecting samples without any specific pattern
or criteria.
• Below, Figure 2.4 pictorially depicts how samples are selected from an
entire population.

Figure 2.4: Simple Random Sampling


Source: Khan, G. F. (2018). Creating value with social media
analytics: Managing, aligning, and mining social media text,
networks, actions, location, apps, hyperlinks, multimedia,
& search engines data. CreateSpace
2. Stratified Sampling:
• The population is divided into homogeneous subgroups (strata) based
on certain characteristics.
• Samples are then randomly selected from each stratum in proportion
to the population size of the stratum (strata are smaller population groups).
• Ensures representation from each subgroup in the sample.
• Below, Figure 2.5 pictorially depicts how samples are selected from strata.


Figure 2.5: Stratified Sampling


Source: Internet

3. Systematic Sampling:
• Involves selecting every nth item from a list or sequence after a
random start.
• Useful when the population is ordered in some meaningful way
(e.g., alphabetical order, time sequence).
• Below, Figure 2.6 pictorially depicts how samples are selected.

Figure 2.6: Systematic Sampling


Source: Kultar Singh, “Systematic Random Sampling: Overview,
Advantages, and Disadvantages”, created on 19 Oct 2022,
accessed on 29 Apr 2024

4. Cluster Sampling:
• The population is divided into clusters (e.g., geographical areas,
classrooms).
• Random clusters are selected, and all individuals within the selected
clusters are included in the sample.
• Particularly useful when it is impractical or costly to sample individuals
independently.
• Below, Figure 2.7 pictorially depicts how samples are selected from
sample clusters.

Figure 2.7: Cluster Sampling


Source: Internet

5. Multi-stage Sampling:
• Combines multiple sampling methods.
• Involves selecting samples in stages, often starting with large-scale
clusters and then progressively sampling smaller units within those
clusters.
• Below, Figure 2.8 pictorially depicts how samples are selected at
multiple stages.

Figure 2.8: Multi-Stage Sampling


Source: Internet

Non-Probability Sampling Methods:
1. Convenience Sampling:
• Involves selecting individuals who are readily available and accessible.
• Convenient, but may lead to biased samples since individuals are not
randomly selected.
2. Purposive (Judgmental) Sampling:
• Samples are chosen based on the researcher's judgment or expertise.
• Useful for selecting specific individuals or cases that are deemed most
relevant to the research objectives.
3. Snowball Sampling:
• Starts with a small set of individuals who meet the inclusion criteria.
• These individuals then refer other potential participants, creating a
snowball effect.
• Often used in studies where the population of interest is difficult to
reach or identify.
4. Quota Sampling:
• The population is divided into subgroups based on predetermined
criteria (e.g., age, gender).
• Samples are then selected based on quotas set for each subgroup to
ensure representation.
• Similar to stratified sampling, but without random selection within
subgroups.
5. Volunteer Sampling:
• Individuals self-select to participate in the study.
• Commonly used in online surveys, but the results may not be
representative of the broader population due to self-selection bias.
Each sampling method has its advantages and limitations, and the choice
of method depends on factors such as the research objectives, resources,
and the nature of the population being studied.
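
The two most common probability sampling methods above can be sketched with pandas as follows; the population, strata, and sample sizes are assumptions made purely for illustration:

    import pandas as pd

    # Hypothetical population of 1,000 customers across three regions.
    population = pd.DataFrame({
        "customer_id": range(1000),
        "region": ["North"] * 500 + ["South"] * 300 + ["East"] * 200,
    })

    # Simple random sampling: every record has an equal chance of selection.
    simple_sample = population.sample(n=100, random_state=1)

    # Stratified sampling: draw 10% from each region so every stratum
    # is represented in proportion to its size.
    stratified_sample = population.groupby("region", group_keys=False).apply(
        lambda stratum: stratum.sample(frac=0.10, random_state=1)
    )
    print(simple_sample["region"].value_counts())
    print(stratified_sample["region"].value_counts())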

IN-TEXT QUESTIONS
1. Which of the following statements regarding aggregation in
data analysis is true?
(a) Aggregation involves breaking down data into its smallest
components
(b) Aggregation refers to combining multiple data values into
a single summary value
(c) Aggregation is primarily used to identify outliers in the
dataset
(d) Aggregation is only applicable to qualitative data
2. Which sampling method involves dividing the population into
homogeneous subgroups and then selecting samples from each
subgroup proportionally to its population size?
(a) Simple Random Sampling
(b) Convenience Sampling
(c) Stratified Sampling
(d) Snowball Sampling

2.5 Dimensionality Reduction


A key data mining approach is dimensionality reduction, which lowers
the number of variables or features in a dataset while retaining as much
information as possible. This procedure is crucial for several reasons, such
as increasing machine learning algorithm performance, mitigating the curse
of dimensionality, and increasing computational efficiency. The main aspects
of dimensionality reduction in data mining are as follows:
1. The volume of the feature space rises exponentially as the number of
features or dimensions increases; a phenomenon known as the “curse
of dimensionality.” This may result in several problems, including
overfitting (when a model performs very well on training data but poorly
on new, unseen test data) in machine learning models, data sparsity,
and increased computing complexity. Dimensionality
reduction reduces the number of characteristics while keeping
pertinent data, which helps to mitigate these problems.

2. Feature Selection versus Feature Extraction:
• Feature selection is the process of choosing a subset of the original
features according to predetermined criteria, such as significance,
relevance, or association with the target variable.
• Feature extraction is the process of mapping the original features into
a lower-dimensional space, using methods such as manifold learning
or linear projections, while preserving the most significant information.
3. Commonly used techniques also include the following:
• t-SNE (t-distributed Stochastic Neighbour Embedding): a nonlinear
method for visualizing high-dimensional data by maintaining local
similarities between data points in a lower-dimensional space.
• Autoencoders: neural-network models trained to compress and
reconstruct high-dimensional data, thereby learning an efficient feature
representation.
• Factor Analysis: a statistical technique that models the relationships
between observable variables using a smaller set of unobserved
variables called factors.

Typical Methods for Reducing Dimensionality:


• PCA: Principal Component Analysis is probably the most widely used
linear technique. It finds orthogonal axes, or principal components,
along which the data varies most, and transforms the original features
into a new set of uncorrelated variables called the principal components.
• LDA: Linear Discriminant Analysis is a supervised method for finding
the linear combinations of features that best explain the separation
between the different classes of objects under consideration.
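
A minimal PCA sketch with scikit-learn follows; the Iris dataset and the choice of two components are assumptions made for illustration only:

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    # Four-dimensional measurements reduced to two principal components.
    X, _ = load_iris(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)

    print(X_reduced.shape)                 # (150, 2)
    print(pca.explained_variance_ratio_)   # variance captured by each component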

The Advantages of Dimension Reduction


• Enhanced Model Performance: By lowering the feature count, machine
learning models are better able to generalize and are less prone to
overfitting.
• Efficient Computation: Fewer computing resources are needed for
training and inference when the dataset's dimensionality is decreased.

Notes ‹ ‹Visualization: To support exploratory data analysis and pattern


discovery, dimensionality reduction techniques frequently make
it easier to visualize high-dimensional data in lower-dimensional
environments.
Dimensionality reduction is thus an essential task in data mining for
transforming high-dimensional data into a more useful and manageable
form, enabling more effective and efficient modelling and analysis.

2.6 Feature Subset Selection


Feature Subset Selection (FSS), used in data mining and machine learning,
is the process of choosing a subset of pertinent features from the original
set of features in a dataset. Feature subset selection
aims to decrease the dimensionality of the data while retaining the most
informative features to improve the performance of machine learning
algorithms. The following is an overview of feature subset selection.

Significance of Feature Subset Selection


Feature Subset Selection is one method to reduce the impact of the so-
called “curse of dimensionality”, problems that can emerge with high-di-
mensional data, such as overfitting and increased computational complexity.
The result of this process will be a more interpretable model, less data
noise, and improved generalization performance by eliminating superflu-
ous or irrelevant features.

Feature Subset Selection Types:


Filter Methods: These methods assess feature importance independently of
the learning algorithm. Some very popular techniques include chi-square
testing, information gain, and correlation analysis. The filter approach ranks
features according to a chosen criterion and selects the top-ranked ones.
Wrapper Methods: The objective of wrapper techniques is to evaluate the
performance of a subset of features by training and assessing a particular
machine learning algorithm. Examples include recursive feature elimination,
forward selection, and backward elimination. The wrapper technique selects
the subset that provides the maximum

performance of the learning algorithm by searching through the space
of possible feature subsets.
Embedded Methods: These methods embed feature selection into the actual
training process of the learning algorithm. Popular examples include decision
trees and LASSO. The model training itself decides on the most relevant
features, for example through pruning, as part of fitting the model.
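
The filter and wrapper approaches can be sketched with scikit-learn as follows; the dataset and the choice of keeping five features are illustrative assumptions only:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, chi2, RFE
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # Filter method: score each feature independently of any model
    # (chi-square) and keep the five highest-ranked features.
    X_filter = SelectKBest(score_func=chi2, k=5).fit_transform(X, y)

    # Wrapper method: repeatedly train a model and drop the weakest
    # feature until only five remain (recursive feature elimination).
    estimator = LogisticRegression(max_iter=5000)
    X_wrapper = RFE(estimator, n_features_to_select=5).fit_transform(X, y)

    print(X_filter.shape, X_wrapper.shape)   # (569, 5) (569, 5)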

Issues and Difficulties


• The size of the dataset, the difficulty of the learning task, and the
available computing power all influence the choice of feature subset
selection technique.
• Feature subset selection can be computationally expensive, especially
for wrapper methods, since they need to train and evaluate many models.
• It is very important in the feature subset selection process to strike a
balance between the dimensionality reduction obtained and the loss of
potentially useful information.

2.7 Discretization and Binarization


Discretization and binarization are preprocessing methods commonly used in data mining and machine learning to change the representation of continuous or categorical variables. A description of each follows:
Discretization is the process of dividing a continuous variable into distinct categories or intervals. This can make the data easier to manipulate, give a clearer interpretation of the results, or make the data suitable for algorithms that require discrete input values. Discretization can be performed by various methods:
‹ ‹Equal-width Discretization: The range of continuous values is divided into a pre-specified number of equal-width intervals. For instance, dividing our example range of 0-100 into five intervals gives a span of 20 units each: 0-20, 21-40, and so on.
‹ ‹Equal frequency Discretization: This technique divides the data
into intervals, each containing approximately the same number of


data points. This ensures that the number of occurrences in each category is roughly equal.
‹ ‹Custom Discretization: This involves the creation of specific
thresholds or intervals based on requirements or domain knowledge.
Among the common preprocessing steps, discretization also features when
dealing with algorithms that require discrete inputs, such as association
rule mining or decision trees.
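As a brief illustration, the sketch below applies equal-width and equal-frequency discretization to a small, made-up set of values using the pandas cut and qcut functions; the data and the choice of five bins are assumptions for this example only.

import pandas as pd

# Hypothetical continuous values in the range 0-100
scores = pd.Series([5, 12, 27, 33, 41, 48, 55, 63, 71, 88, 94, 99])

# Equal-width discretization: five intervals of equal span (about 20 units each)
equal_width = pd.cut(scores, bins=5)

# Equal-frequency discretization: five intervals with roughly the same count of points
equal_freq = pd.qcut(scores, q=5)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())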
Binarization: Binarization (sometimes called binarizing) is the process of transforming data into a binary format. This is achieved by thresholding continuous variables into binary features: a threshold is chosen, and every value above it is mapped to one category, usually represented by 1, while every value below it is mapped to the other category, usually represented by 0.
The most common reasons people do binarization include but are not
limited to:
‹ ‹Binarization can help reduce the complexity of the data and emphasize
a pattern or relationship through feature engineering. In sentiment
analysis, words may be binarized depending on whether they appear
or do not appear in a document.
‹ ‹Imbalanced Data: Another area of application for binarization is in
fixing class imbalance by transforming a multi-class problem into
a binary classification problem.
‹ ‹Sparse Data: In particular, text mining and image processing may use binarization to transform representations of sparse data into a more compact, efficient format.
Continuous and Categorical Data: For both continuous and categorical data, binarization is a simple yet useful technique that can be called upon depending on the needs of the analysis or modelling task at hand.
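A minimal sketch of binarization with a threshold is shown below; the threshold of 50 and the sample values are assumptions chosen only to illustrate the idea.

import numpy as np

# Hypothetical continuous values (e.g., exam scores out of 100)
values = np.array([12, 47, 50, 51, 68, 93])

# Binarize with a threshold of 50: values above the threshold become 1, the rest 0
threshold = 50
binary_values = (values > threshold).astype(int)

print(binary_values)   # [0 0 0 1 1 1]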
Preprocessing methods such as discretization and binarization take the
raw data into forms more amenable to modelling or analytics; the former
converts continuous variables into discrete categories, while the latter
represents the application of a threshold on data to convert it into binary


form. Both discretization and binarization play major roles in feature engineering and data pretreatment within data mining and machine learning workflows.

2.8 Variable Transformation


Variable transformation is a data preprocessing technique that changes the scale, distribution, or form of one or more variables in a dataset. It is used to address nonlinearity, heteroscedasticity, and non-normality so that the data become more suitable for analysis or modelling. A summary of variable transformation is given below:

Types of Variable Transformation:


1. Normalization is the process of bringing numerical variables onto a common scale, often within the range from 0 to 1 or from -1 to 1. In this way, variables measured on different scales or in different units contribute equally to the analysis.
2. Standardization is a process through which numerical variables are rescaled to have a mean of zero and a standard deviation of one. This works particularly well when the variables have different units or widely varying distributions. A brief sketch of both transformations follows.
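The following is a minimal sketch of min-max normalization and z-score standardization using scikit-learn; the toy column of values is an assumption made for this example.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A hypothetical numerical variable (one column, six observations)
x = np.array([[10.0], [20.0], [30.0], [40.0], [50.0], [100.0]])

# Normalization: rescale the values into the range 0 to 1
x_norm = MinMaxScaler().fit_transform(x)

# Standardization: rescale the values to mean 0 and standard deviation 1
x_std = StandardScaler().fit_transform(x)

print(x_norm.ravel())  # values between 0 and 1
print(x_std.ravel())   # values centred on 0 with unit variance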

2.9 Summary
In this lesson, we discussed how data mining finds trends, relationships, and insights in large databases to retrieve information for better decision-making. Applications are numerous, from marketing, banking, and healthcare to telecommunications. Data mining tasks include classification, regression, clustering, association rule mining, and anomaly detection, among others. Its driving motivations include finding trends, enhancing decision-making processes, and gaining a competitive advantage, although obstacles related to data quality, scalability, and interpretability can limit its success. Data quality dimensions include completeness, accuracy, consistency, and timeliness, while data attributes may be categorical, numerical, ordinal, or interval.


Data pre-processing: the process by which raw data is prepared to make it ready for analysis. It involves techniques such as sampling, where a subset of the data is selected for analysis, and aggregation, which combines data values into summary statistics. In dimensionality
reduction, one needs to reduce the number of features to reduce overfit-
ting and computational complexity. Feature creation involves creating new
features from the existing ones, while feature subset selection involves
selecting relevant features from existing ones. While discretization chang-
es a continuous variable into a discrete one, the process of binarization
transforms a continuous variable into a binary representation. Variable
transformation can also be done to change the variables for better analy-
sis in terms of magnitude or distribution or nature. These skills are very
important in enhancing data quality and preparing them effectively for
mining.

2.10 Answers to In-Text Questions


1. (b) Aggregation refers to combining multiple data values into a
single summary value
2. (c) Stratified Sampling

2.11 Self-Assessment Questions


1. Compare and contrast the various data preprocessing techniques,
including normalization, standardization, and feature scaling. When
would you use each technique?
2. Explain the concept of dimensionality reduction and discuss its
importance in data mining. Provide examples of dimensionality
reduction techniques.
3. What is feature selection, and why is it important in the context
of machine learning? Describe different feature selection methods
and their advantages.
4. Discuss the challenges associated with handling categorical variables
in a dataset and explain techniques for encoding categorical data
for use in machine learning models.


5. Describe the role of data quality in the success of data mining
projects. What are some common sources of data quality issues,
and how can they be addressed during data preprocessing?

2.12 References
‹ ‹Han,J., Kamber, M., & Jian, P. (2011). Data Mining: Concepts and
Techniques. 3rd edition. Morgan Kaufmann.
‹ ‹Tan,
P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data
Mining. 1st Edition. Pearson Education.

2.13 Suggested Readings


‹ ‹Gupta,G. K. (2006). Introduction to Data Mining with Case Studies.
Prentice-Hall of India.
‹ ‹Hand,D., & Mannila, H. & Smyth, P. (2006). Principles of Data
Mining. Prentice-Hall of India.
‹ ‹Pujari, A. (2008). Data Mining Techniques. 2nd edition. Universities
Press.



L E S S O N

3
The Art of Grouping:
Exploring Cluster Analysis
Aishwarya Anand Arora
Assistant Professor
School of Open Learning
University of Delhi
Email-Id: [email protected]

STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Basic Concepts of Clustering
3.4 Measure of Similarity
3.5 Types of Clusters and Clustering Methods
3.6 K-Means Algorithm
3.7 Measures for Cluster Validation
3.8 Determine Optimal Number of Clusters
3.9 Summary
3.10 Answers to In-Text Questions
3.11 Self-Assessment Questions
3.12 References
3.13 Suggested Readings

3.1 Learning Objectives


‹ ‹To understand the principles of clustering.
‹ ‹To apply the technique of K-means algorithm.
‹ ‹To evaluate the measures for cluster validation.
‹ ‹To apply techniques to determine the optimal number of clusters.

3.2 Introduction
A key method in data mining and machine learning is cluster analysis, which involves grouping a collection of items according to how similar they are. It is extensively employed in many different domains, including anomaly detection, image analysis, pattern recognition, and consumer segmentation. Fundamentally, the goal of cluster analysis is to find hidden structures and patterns in data so that complicated datasets can be better understood and interpreted.
Measures of similarity, which express how similar two things are, and
types of clusters, which can differ in size, density, and form, are im-
portant ideas in cluster analysis. Figure 3.1 shows how clustering groups
similar data together. There are many kinds of clustering algorithms and
methodologies, and each has advantages and disadvantages of its own.
K-means, which divides the data into a predefined number of groups by
iteratively reducing the within-cluster variation, is one of the most often
used clustering methods.

Figure 3.1: Clustering


Similarly, clustering techniques go along with cluster validation metrics
that provide necessary measures of the quality and validity of a generated
cluster. These metrics therefore quantify the compactness and separation
of a cluster, thus helping to choose the best number of clusters that can
suit a particular dataset. This choice has a direct relation to the interpret-
ability and usefulness of the result obtained from clustering; therefore,
the selection of optimum number of clusters is one of the most import-
ant steps in cluster analysis. By the end of this tutorial, you will have
learned how cluster analysis provides a robust structure for organizing and

summarizing large and complex datasets, driving insightful conclusions and informed decisions for practitioners and academics alike.

3.3 Basic Concepts of Clustering


Imagine you are at a party with a few people, and your goal is to find a group of people that has some interest in common with yours. Since you know nothing about them beforehand, you cannot simply ask everyone directly, so you start observing. As you move around you notice that some people are talking about movies, some about sports, some about technology, and so on. Without realizing it, you have started sorting these people into various groups, and based on your interest you add yourself to one of them; this is exactly what clustering does.
Clustering, or cluster analysis, groups an unlabeled dataset. One definition is "a method of grouping the data points into different clusters, each consisting of similar data points; the items that may be related to one another stay in a group that has little or no similarity with another group."
It accomplishes this by identifying some common patterns—such as size,
shape, color, activity, and so on—in the unlabeled dataset and classifying
them according to whether or not those patterns are present.
Since it’s an unsupervised learning approach, the algorithm works with
the unlabeled dataset without any supervision.
Each cluster or group that results from the application of this clustering
technique is given a cluster ID, which the ML system can use to make
handling big, complicated datasets easier.
For example, let us examine the clustering technique using a shopping mall as a real-world example. In any shopping centre, items with similar uses are grouped together: t-shirts are arranged in one section and trousers in another; likewise, in the fruit section, apples, bananas, mangoes, and so on are arranged separately to make it easier for us to find what we are looking for. The clustering strategy works in the same way. Other instances of clustering include organizing documents by subject matter.
The clustering method is extensively applicable to a wide range of jobs.
Typical applications for this method include:


‹ ‹Market Segmentation


‹ ‹Statistical data analysis
‹ ‹Social network analysis
‹ ‹Image segmentation
‹ ‹Anomaly detection, etc.
In addition to these common uses, Amazon uses it in its recommendation
engine to deliver recommendations based on previous product searches.
Netflix employs this method as well to suggest movies and web series
to its viewers based on their viewing preferences. Figure 3.2 shows the
clustering process.

Figure 3.2: Clustering Process


Applications of Clustering
There are various applications of clustering, including:
‹ ‹Identificationof Malignant Cells: The identification of malignant
cells is a common use of clustering techniques. It creates distinct
groups based on the malignant and non-cancerous data sets.
‹ ‹In Search Engines: Search engines also utilize the clustering approach. Results are displayed according to which group of objects lies closest to the search query; related data items are grouped together while dissimilar objects are kept far apart. The effectiveness of the clustering algorithm being used determines how accurate the results for a query will be.


‹ ‹Customer Segmentation: Based on their choices and preferences, customers are divided into groups in market research.
‹ ‹Biology: The image recognition approach is utilized in the biology
stream to classify various plant and animal species.

3.4 Measure of Similarity


Clustering puts comparable objects together, so it needs a way to determine whether the properties of two objects are similar or different. In data mining terms, a similarity measure is typically expressed as a distance computed over the attributes that describe an entity: a small distance between two data points indicates a high degree of similarity, and vice versa. Similarity is subjective in nature and highly dependent on the application and circumstances; vegetables, for example, can be judged similar by their taste, size, colour, and other characteristics.
To evaluate the similarities or differences between a pair of items, most
clustering algorithms employ distance measurements. The most often
used distance metrics are:
Euclidean Distance: This is regarded as the conventional measure for geometric problems and is easily understood as the ordinary straight-line distance between two points. It is among the most popular measures in cluster analysis; K-means is one algorithm that makes use of it. Mathematically, it is the square root of the sum of the squared coordinate differences between two objects: for points (x1, y1) and (x2, y2) the Euclidean distance is sqrt((x1 - x2)^2 + (y1 - y2)^2). As shown in Figure 3.3 below.

Figure 3.3: Euclidean Distance


Source: Internet


Manhattan Distance: It is the sum of the absolute differences between the coordinates of two points. Assume we have two points, P and Q, in a plane, where P is located at (x1, y1) and Q at (x2, y2); to find the distance between them we simply add the absolute differences along each axis. The Manhattan distance between P and Q is |x1 - x2| + |y1 - y2|. As shown in Figure 3.4 below.

Figure 3.4: Manhattan Distance


The red line depicts the Manhattan Distance.
Jaccard Index: The size of the intersection of two sets of items divided by the size of their union yields the Jaccard index, which quantifies how similar the two sets are (the corresponding Jaccard distance is one minus this value). As shown in Figure 3.5 below.

Figure 3.5: Jaccard Index
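To make these measures concrete, the following is a minimal sketch that computes the Euclidean distance, the Manhattan distance, and the Jaccard index in Python; the two sample points and the two sample item sets are assumptions for illustration only.

import math

# Two hypothetical points in a two-dimensional plane
p = (1.0, 2.0)
q = (4.0, 6.0)

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Manhattan distance: sum of absolute coordinate differences
manhattan = abs(p[0] - q[0]) + abs(p[1] - q[1])

# Jaccard index: size of the intersection divided by size of the union
a = {"bread", "butter", "milk"}
b = {"bread", "milk", "eggs"}
jaccard = len(a & b) / len(a | b)

print(euclidean)  # 5.0
print(manhattan)  # 7.0
print(jaccard)    # 0.5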


IN-TEXT QUESTIONS


1. What is the primary objective of clustering in machine learning?
(a) Classification
(b) Dimensionality Reduction
(c) Data Preprocessing
(d) Grouping similar data points together
2. In which field is clustering often utilized for market segmentation
and customer profiling?
(a) Healthcare
(b) Retail
(c) Transportation
(d) Agriculture
3. What is one of the primary applications of clustering in
recommendation systems?
(a) Identifying fraudulent transactions
(b) Predicting stock market trends
(c) Grouping similar products for recommendation
(d) Classifying email spam

3.5 Types of Clusters and Clustering Methods


Various kinds of Clusters
Several definitions of a cluster prove useful in real-world scenarios. We
use two-dimensional points to visually represent the distinctions between
these types of clusters, as the image illustrates. The types of clusters
discussed here are equally applicable to various forms of data.
1. Well-separated Cluster: A cluster is a collection of items in which every item is more similar, or closer, to every other item in the cluster than to any item outside the cluster. Occasionally, a threshold is employed to signify that every


object within a cluster needs to be sufficiently near or similar to the others. This notion of a cluster is satisfied only when the data comprises naturally occurring clusters that are quite far from one another. An example of well-separated clusters of two-dimensional points is shown in Figure 3.6 below. Well-separated clusters can have any shape; they need not be spherical.

Figure 3.6: Well Separated Clusters


2. Prototype-Based Cluster: A cluster is a collection of items that are more similar to the prototype of their own cluster than to the prototype of any other cluster. For data with continuous features, the prototype of a cluster is typically a centroid, i.e. the mean (average) of all the points in the cluster. When a centroid is not meaningful, for instance when the data has categorical attributes, the prototype is typically a medoid, the most representative point of the cluster. Since the prototype can be thought of as the most central point, prototype-based clusters are often called centre-based clusters. These kinds of clusters tend to be spherical, as one could anticipate. As shown in Figure 3.7 below.

Figure 3.7: Prototype-Based Clusters


3. Graph-based Cluster: If the data is represented as a graph, where the nodes are the objects, then a cluster is a connected component of the graph: the objects within the group are connected to one another but not to anything outside the group. A notable example is contiguity-based clusters, in which two objects are connected when they lie within a specified distance of each other. This implies that every object in a contiguity-based cluster is close to at least one other object in the cluster.

Various Clustering Methods


Hard clustering, in which a data point belongs to just one group, and soft
clustering, in which a data point might also belong to another group, are
the two main categories into which clustering techniques fall. However,
there are other different clustering algorithms as well. The primary clus-
tering techniques used in machine learning are listed below:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering
This kind of clustering separates the data into non-hierarchical groups. It is also known as the centroid-based approach, and the K-Means clustering technique is its most widely used example. Here the number of pre-defined groups is denoted by K, and the dataset is split into K groups. Each cluster centre is positioned so that the distance between the data points of that cluster and its own centroid is minimised relative to the other cluster centroids. As shown in Figure 3.8 below.


Figure 3.8: Partitioning Clustering


Source: Internet

Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, forming arbitrarily shaped groups as long as the dense regions can be connected. The algorithm detects clusters as regions of high density in the data space, separated from one another by sparser regions. These techniques may have trouble clustering the data points when the dataset has high dimensionality or varying densities. As shown in Figure 3.9 below.


Figure 3.9: Density-Based Clustering


Source: Internet

Distribution Model-Based Clustering


The probability that a dataset will belong to a specific distribution deter-
mines how the data is partitioned in the distribution model-based cluster-
ing approach. Assuming certain distributions, most notably the Gaussian
Distribution, the grouping is completed. The Expectation-Maximization
Clustering algorithm, which makes use of Gaussian Mixture Models
(GMM), is an example of this kind. Refer to Figure 3.10 below.

Figure 3.10: Distributional Model Clustering


Source: Internet


Hierarchical Clustering


Hierarchical clustering can be used as an alternative to partitioning clustering because the number of clusters to be produced need not be specified in advance. Using this method, the dataset is split into clusters to produce a dendrogram, a tree-like structure. Any desired number of clusters can then be obtained by cutting the tree at the appropriate level. As shown in Figure 3.11 below.

Figure 3.11: Hierarchical Clustering


Source: Internet

Fuzzy Clustering
Fuzzy clustering is a soft approach in which a data object can belong to more than one group or cluster. Every data point has a set of membership coefficients that reflect its degree of belonging to each cluster. An example of this kind of clustering is the fuzzy C-means algorithm, which is sometimes referred to as the fuzzy k-means algorithm.

Various Clustering Algorithms


Clustering techniques can be categorized by the cluster models they use, as previously discussed. While many different kinds of clustering algorithms have been described, only a select few are frequently employed. The

type of data we use determines the clustering algorithms mentioned in Figure 3.12 below. For example, certain algorithms must determine the minimum distance between dataset observations, while others must estimate the number of clusters in the provided dataset.

Figure 3.12: Different Clustering Algorithms


Source: Internet

3.6 K-Means Algorithm


K-means is a widely used unsupervised learning technique for grouping unlabeled data into k clusters. Finding the right number of clusters, k, is one of the more difficult clustering tasks.
As a form of unsupervised learning, clustering divides data points into different sets according to how similar they are.
The different kinds of grouping include:
‹ ‹Hierarchical clustering
‹ ‹Partitioning clustering
There are further categories within hierarchical clustering:
‹ ‹Agglomerative clustering
‹ ‹Divisive clustering


Moreover, partitioning clustering is separated into:


‹ ‹K-Means clustering
‹ ‹Fuzzy C-Means clustering

K-Means Overview
Before we examine the dataset, let’s quickly go over how k-means operates:
‹ ‹K centroids are initialized at random at the start of the operation.
‹ ‹The closest cluster is assigned points based on these centroids.
‹ ‹The centroids’ positions are then updated using the mean of all the
locations within the cluster.
‹ ‹Untilthe centroids’ values stabilize, the previously mentioned steps
are repeated.
K-means clustering is an unsupervised learning algorithm: unlike supervised learning, it does not use labelled data. K-Means divides the objects into groups based on the similarities and differences between the objects in each cluster.
K stands for the number of clusters that must be created, which has to be specified to the system; K = 2, for instance, designates two clusters. Methods for determining the optimal or best value of K for a particular set of data are discussed later in this lesson.
To get a better understanding of k-means, let’s look at a cricket example.
Consider that you have access to data on a large number of international
cricket players, including details on their runs scored and wickets claimed
over the course of the previous ten matches. We must divide the data into
two clusters—batsmen and bowlers—based on this information.
Let’s examine the procedures involved in forming these clusters.

Solution:
Our data set is shown here using the “x” and “y” coordinates. The y-axis
displays the number of runs scored, while the x-axis displays the number
of wickets the players have taken.
This is how the information would seem if it were plotted:


Figure 3.13: Data Set Plot


Source: Internet

The clusters must be established, as indicated below:

Figure 3.14: Cluster 1


Source: Internet


Figure 3.15: Cluster 2


Source: Internet

Using the same set of data, let’s apply K-Means clustering to solve the
problem (with K = 2).
The random assignment of two centroids (as K = 2) is the initial stage
in the k-means clustering process. Centroids are allocated to two points.
Keep in mind that because the points are random, they could be any-
where. Even though they are originally not the centre of a specific data
set, they go by the name centroids.


The next step is to calculate the distance between each data point and the randomly assigned centroids. Every point has its distance measured from both centroids, and the point is assigned to the centroid to which it is closer. The data points are shown here in blue and yellow, attached to their centroids.

Finding these two clusters’ true centroid is the next stage. It is necessary
to move the initial centroid that was chosen at random to the clusters’
actual centroid.

Up until we reach our final cluster, we keep doing this centroid relocation
and distance calculation process. Subsequently, the centroid realignment
ceases.

As can be seen above, when the centroid no longer requires repositioning,


the algorithm has reached convergence and both of the clusters have a
centroid.


Algorithm of K-Means Clustering


Algorithm 1 k-means algorithm
1: Specify the number k of clusters to assign.
2: Randomly initialize k centroids.
3: repeat
4: expectation: Assign each point to its closest centroid.
5: maximization: Compute the new centroid (mean) of each cluster.
6: until The centroid positions do not change.

Sample Python Code for K-means Clustering
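A minimal sketch of K-means clustering in Python with scikit-learn is given below; the toy data (runs and wickets for six hypothetical players), the choice of k = 2, and the random_state value are assumptions made only for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: runs scored and wickets taken for a few players
X = np.array([[110, 1], [95, 0], [120, 2],     # batsman-like profiles
              [15, 18], [20, 22], [10, 25]])   # bowler-like profiles

# Specify k = 2 clusters and fit the model
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                    # cluster ID assigned to each player
print(kmeans.cluster_centers_)   # final centroid of each cluster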

Applications of K-Means Clustering


In real life, K-Means clustering is applied in many scenarios or business
cases, such as:
‹ ‹Academic performance
‹ ‹Diagnostic systems
‹ ‹Search engines
‹ ‹Wireless sensor networks


IN-TEXT QUESTIONS


4. What is the primary objective of the K-means clustering
algorithm?
(a) Maximize inter-cluster similarity
(b) Minimize intra-cluster variance
(c) Maximize silhouette score
(d) Minimize the number of clusters
5. In the K-means algorithm, how are initial cluster centroids
typically selected?
(a) Randomly
(b) Based on class labels
(c) By maximizing silhouette score
(d) By minimizing Euclidean distance
6. How does the K-means algorithm update cluster centroids in
each iteration?
(a) By randomly reassigning data points to clusters
(b) By computing the mean of data points in each cluster
(c) By removing outliers from clusters
(d) By merging clusters with similar centroids

3.7 Measures for Cluster Validation


Finding clusters of related data in a data set can be facilitated by clustering.
However, a lot of clustering techniques—K-means among them—don’t
specify the “ideal” number of clusters. Therefore, it is up to us to figure
out how many clusters are “optimal” and how good they are. If not, our
grouping could result in incorrect judgments.
A low inter-cluster (between-cluster) similarity and a high intra-cluster
(within-cluster) similarity are our goals. Stated differently, we aim for
dense clusters separated by a large distance. When data points inside a
cluster exhibit high intra-cluster similarity, it indicates that their traits and


features are similar. Conversely, a low inter-cluster similarity indicates that the characteristics of data points in different clusters are distinct. Consider Figure 3.16 below.

Figure 3.16: Low and High Inter-cluster Similarity


Source: Internet

Therefore, we must measure the cluster’s quality, i.e., its compactness,


connectivity, and separation, in order to establish the “optimal” number
of clusters. We have three options for how to go about this:
‹ ‹Internal Cluster Validation: It makes use of internal data from the
clustering procedure, such as the sum of squares within a cluster.
‹ ‹ExternalCluster Validation: Results are compared to externally
known results, such as given labels, in an external cluster validation.
‹ ‹RelativeCluster Validation: It modifies the clustering method’s
parameters, such as the number of clusters.
The “optimal” number of clusters can be ascertained using internal and
relative cluster validation. These are frequently combined. On the other
hand, the appropriate clustering technique can be ascertained through
external cluster validation.
There are two categories of methodologies that we might use to quantify
and validate the quality of the clustering.
On the one hand, we have direct approaches that quantify similarity within and/or between clusters; examples of direct techniques are the Elbow Curve and the Silhouette Coefficient.
Alternatively, we can compare the clusters against a null hypothesis using statistical techniques such as the Gap Statistic approach.

3.8 Determine Optimal Number of Clusters
The number of clusters that are appropriate for our dataset must be
determined for clustering techniques such as K-Means clustering. This
guarantees an accurate and effective division of the data. Maintaining a
suitable balance between the compressibility and accuracy of clusters, as
well as guaranteeing proper granularity, are made easier with an appro-
priate amount of “k,” or the number of clusters.
Let us look at two extreme scenarios:
Case 1: Treat the entire dataset as a single cluster.
Case 2: Treat every data point as its own cluster.
Case 2 results in the most "accurate" clustering, because there is no gap between each data point and its cluster centre; however, it is not useful for predicting new inputs and does not allow any kind of data summarization.
Thus, figuring out the "right" number of clusters for a given dataset is crucial. Although this is a difficult task, it is quite manageable if we rely on the shape and scale of the data distribution.

Direct Method
Elbow Curve: The number of clusters and the within-cluster variance
determine the Elbow curve. The Within-cluster Sum-of-Squared Distance
(WSSD) between each data point and its cluster center is represented by
inertia, which determines the within-cluster variance. Denser clusters are
those with lower inertia. Refer Figure 3.17.
As the number of clusters increases, the inertia usually decreases. The initial decline tends to be steep and becomes much flatter once we surpass the "optimal" number of clusters. Therefore, we use the location of the bend (the "elbow") between the steep and flat portions of the curve to determine the "optimal" number of clusters.


Figure 3.17: Elbow Curve


Source: Internet

The Elbow Curve is a visual method; thus it can be unclear at times.


Since the bend is frequently difficult to locate, selecting the “optimal”
number of clusters is frequently arbitrary.
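The sketch below shows one common way to produce an Elbow Curve: fit K-means for several values of k, record the inertia (within-cluster sum of squared distances), and plot it; the synthetic data and the range of k values are assumptions for this example.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a few natural groups, used only for illustration
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-means for k = 1..9 and record the inertia for each k
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

# Plot inertia against k; the bend (the "elbow") suggests the optimal number of clusters
plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.show()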

Statistical Methods
Gap Statistic Method: The within-cluster variation is used by the Gap
Statistic approach to gauge how well the clustering is done. To do this,
the entire within-cluster variance is compared to how a random data set
would cluster, or, in other words, to what would be predicted from a null
reference distribution.
For a given number of clusters (k), we build a certain number B of null
reference (uniform) distributions to derive the Gap Statistic G. We calcu-
late the inertia W for each sample and average the results. Next, we can
determine the Gap Statistic using the cluster’s inertia of our real data set:
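In the commonly used formulation, this can be written as Gap(k) = (1/B) * sum over b of log(W*_kb) - log(W_k), where W_k is the within-cluster inertia of the real data for k clusters and W*_kb is the inertia obtained on the b-th null reference sample; the notation here is supplied as an assumption for readability.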

Finding the least number of clusters (k) that fulfils the following criteria
yields the “optimal” number of clusters:
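A commonly used form of this criterion, given here as an assumed notation, is: choose the smallest k such that Gap(k) >= Gap(k + 1) - s_{k+1}, where s is the standard-error term defined next.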


where s is ascertained from the standard deviation of the cluster inertia of the reference samples:
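In the commonly used formulation, s_k = sd_k * sqrt(1 + 1/B), where sd_k is the standard deviation of log(W*_kb) across the B reference samples; this notation is supplied as an assumption for readability.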

3.9 Summary
In this lesson, you have learned that clustering is the unsupervised learn-
ing approach par excellence to uncover hidden structures and patterns
of data. Basically, clustering implies putting data into meaningful groups
so that maximization of the similarity in a cluster, called intra-cluster
similarity, is achieved while minimization of inter-cluster similarity takes
place. More precisely, at the very core of this concept lie measures of
similarity quantifying closeness or proximity between points in data.
Depending on the type of data, these metrics would differ. They could
be some form of distance metric for numerical data, such as Euclidean
distance, and some form of similarity metric in the case of text or cate-
gorical data, such as cosine similarity. These actual clusters themselves
can be disjoint, overlapping, hierarchical, fuzzy, or any kind that best
suits the properties of the data at hand and the goals of the analysis.
Clustering can be done in several ways, one of the most popular being the K-means algorithm.
K-means optimizes the cluster centroids to minimize the within-cluster sum of squares by iteratively assigning data points to clusters based on their closeness to a centroid.
clustering analysis are the quality assessment of the clusters and determi-
nation of the best number of clusters. Examples of such cluster validation
measures include the silhouette coefficient and the elbow method, applied
to determine the ideal number of clusters for a dataset and evaluate var-
ious clustering techniques. These basic ideas make clustering a powerful
methodology in the discovery of hidden structure within data that fosters
insight and knowledgeable decision-making in many disciplines.

3.10 Answers to In-Text Questions
1. (d) Grouping similar data points together
2. (b) Retail
3. (c) Grouping similar products for recommendation
4. (b) Minimize intra-cluster variance
5. (a) Randomly
6. (b) By computing the mean of data points in each cluster

3.11 Self-Assessment Questions


1. What is the primary objective of clustering in machine learning?
2. Explain the difference between similarity measures and distance
metrics in clustering.
3. Describe the concept of centroids in the context of clustering
algorithms.
4. How does the K-means algorithm partition data into clusters?
5. Provide an example of a situation where hierarchical clustering might
be preferable over K-means clustering.
6. How does the elbow method help determine the optimal number of
clusters in K-means clustering?
7. What is the main difference between partitioning clustering and
density-based clustering algorithms?
8. Discuss the concept of overlap in clustering and provide an example.
9. Why is cluster validation important, and what are some common
techniques used for cluster validation?

3.12 References
‹ ‹Han, J., Kamber, M., & Jian, P. (2011). Data Mining: Concepts and
Techniques. 3rd edition. Morgan Kaufmann.
‹ ‹Tan,P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data
Mining. 1st Edition. Pearson Education.
3.13 Suggested Readings
‹ ‹Gupta, G. K. (2006). Introduction to Data Mining with Case Studies.
Prentice-Hall of India.
‹ ‹Hand, D., & Mannila, H. & Smyth, P. (2006). Principles of Data
Mining. Prentice-Hall of India.
‹ ‹Pujari, A. (2008). Data Mining Techniques. 2nd edition. Universities
Press.



L E S S O N

4
Data Connections: The
Essentials of Association
Rule Mining
Dr. Charu Gupta
Assistant Professor
School of Open Learning
University of Delhi
Email-Id: [email protected]

STRUCTURE
4.1 Learning Objectives
4.2 Introduction: Association Rule Mining
4.3 Transaction Data Set and Frequent Itemset, Support Measure
4.4 Rule Generation
4.5 Confidence of Association Rule
4.6 Apriori Principle
4.7 Apriori Algorithm
4.8 Summary
4.9 Answers to In-Text Questions
4.10 Self-Assessment Questions
4.11 References
4.12 Suggested Readings

4.1 Learning Objectives


‹ ‹To understand the basics of Association Rule Learning.
‹ ‹To develop the rules generated by association rule mining.
‹ ‹To apply association rule mining to real-world data sets.
‹ ‹To analyse the rules generated by association rule mining.
4.2 Introduction: Association Rule Mining
Consider a Software Analyst who wishes to analyse the download history
of various programming languages, their related libraries and plugins.
Using association rule mining, the data analyst applies an algorithm to
the transaction dataset to identify common patterns among programming
languages. One of the resulting rules might be {python} → {numpy},
indicating that programmers who download the python compiler also
download its numpy library as well. Suppose the dataset consists of
10,000 transactions, and the rule {python} → {numpy} has a support of 40% (indicating that python and numpy appear together in 40% of all transactions) and a confidence of 90% (meaning that in 90% of the transactions where python is downloaded, numpy is downloaded as well). This information can be used by the software repository to place python and numpy together on the web page or to create combined promotions, ultimately enhancing the user experience.
Association rule mining is a data mining technique used to identify in-
teresting relationships, patterns, and associations within large datasets. It
is particularly useful in the context of market basket analysis, where the
goal is to identify sets of products that frequently co-occur in transactions.
Consider the example of a grocery retail store. Association rule mining may show that customers who buy bread have a high probability of also buying butter and milk. This relation between the products bread, butter, and milk is represented through an association rule such as {bread} -> {butter, milk}: in a transaction instance, i.e. a row of the dataset, if bread is present, then butter and milk are also likely to be present with high probability.
Association rule mining is a data mining technique that aims at discovering hidden relationships and patterns among different items in a large dataset. It is an unsupervised learning technique, and each rule it produces consists of two parts: an antecedent (left-hand side) and a consequent (right-hand side). This means that in a transaction data set, if the item(s) on the left-hand side (antecedent) are present, then the item(s) on the right-hand side (consequent) will also be present with a certain probability. Basically, association rules bring out how the presence of one itemset can influence the occurrence of another


itemset in transactional data. It is a rule-based approach used to find relationships between database variables. We can represent association rules in the form {X} → {Y}, where X (the antecedent) and Y (the consequent) are itemsets.
For example, if a student learning Python programming downloads python from the GitHub repository, then he or she is likely to download libraries like numpy, matplotlib, and pandas with high probability. Popular association rule algorithms include the Apriori algorithm, FP-Growth, and relational association rule mining.
The rules generated through association rule mining technique are eval-
uated using metrics like support, confidence, and lift.
‹ ‹Support: This metric measures the frequency of the itemset in the
dataset.
‹ ‹Confidence: This metric quantifies the likelihood or the probability
of the occurrence of the consequent given the antecedent has taken
place.
‹ ‹Lift:The lift metric assesses the rule’s effectiveness when compared
to a random chance.
Data Analysts use association rule mining techniques such as Apriori
algorithm or FP-Growth, to analyse customer purchasing behaviour, op-
timise inventory management, and develop targeted marketing strategies.
In this lesson, we will learn the core principles and methodologies of
association rule mining, that will focus mainly on the Apriori algorithm.
Before that, let us first read the benefits of association rule mining in
real life applications.

Benefits of Association Rule Mining


Association rule mining is an essential tool in data mining and business
intelligence that provide the benefits as given below:
1. Identification of Hidden Patterns: Association Rule Mining helps
in finding those hidden patterns and relationships in large data
sets that would have not been detected at first glance. With the
finding of frequent itemsets, we can create much better insights
into customers’ behaviours and preferences with strong association
rules.


2. Improved Decision Making: The rules generated by association rule mining help us make decisions based on the data collected. For example, retailers can optimize product placements, create an effective cross-selling strategy, or market their products based on the co-occurrence of products in customer transactions.
3. Better Customer Experience: For a business, association rule mining helps in knowing which products are usually sold together. In this way, we can enhance the customer experience by suggesting related products, grouping items, and creating far more personalized recommendations.
4. Optimize Inventory Management: With the knowledge of itemsets
that are frequently bought together, we can manage its inventory
more effectively. It helps to maintain adequate stock and, at the
same time, avoids the chances of stock-out and overstock situations
for frequently associated products.
5. Increased Sales and Revenue: One can achieve higher sales by
offering promotions and discounts with better use of association
rules. For example, consider that a rule indicates that customers
who purchase coffee will also purchase sugar. Therefore, one could
offer discounts on sugar for customers who buy coffee in order to
increase the sales of both the items.
6. Detection of Fraud and Risk Management: In financial sectors and
insurance, association rule mining is able to detect deviating and
unusual patterns suggesting fraud cases. Analysis of transaction data
may also disclose suspicious behaviour, thus helping a company to
take preventive measures to mitigate risks.
7. Market Basket Analysis: Market Basket Analysis is one of the
most popular example and application for studying association rule
mining. In the market basket analysis, retailer use association rule
mining to detect patterns among the products being purchased by
the customers. This analysis helps the retail stores to organise the
products together or nearest to each other, because customers are
more likely to purchase them together. The resultant analysis has
helped retailers to optimize the layout of the store, offer promotions,
improves sales, reducing costs and enhancing profits.


8. Identify Medical Conditions that Occur Together: Medical researchers and analysts perform association rule mining on datasets of medical
and health records. The analysis is used to find diseases and
symptoms that occur together in datasets of patients. This in turn
helps medical practitioners to create treatment plans and improve
the health care services.

4.3 Transaction Data Set and Frequent Itemset, Support


Measure
Let us learn basic concepts of Transaction Data Set, Frequent Itemset
and Support measure.

4.3.1 Transaction Data Set


A transaction data set is a collection of data records called transactions, where each record represents a transaction containing a set of items or events occurring together. These transactions are normally maintained in a tabular format, where each row shows a different transaction and each column shows an item. Such datasets are used in association rule mining and market basket analysis to reveal the relationships and patterns among items or events.
As an example, consider an online software repository scenario where
each transaction represents a library downloaded.

Transaction ID Python Numpy Matplotlib Pandas Java


1 1 1 0 0 0
2 0 0 0 0 1
3 1 1 1 1 0
4 0 0 0 1 0
5 1 0 1 0 0

The transaction ID 1 shows that a person has downloaded python and


numpy together. The presence of 1 means the software or library has
been downloaded from a software repository. The presence of number


zero shows that the software or the library has not been downloaded from the software repository.
The transaction Data sets are used in many scenarios and real life appli-
cations. Few of them are given below:

Applications of Transaction Data Sets


1. Market Basket Analysis: Transaction data helps retailers to understand
which products are frequently bought together. The retailers then
place such frequently bought products together in the store or
categorise them together in an online store. The retailers also create
promotions, and improve cross-selling strategies thereby maximising
profits.
2. Recommendation Systems: E-commerce platforms use transaction
data to recommend products to customers based on their purchase
history and the buying patterns of other customers. For example,
brooms, mops and other home cleaning solutions that customers
have purchased together, are also recommended to other naïve
customers.
3. Fraud Detection: Financial institutions analyse transaction data to
detect anomalous and unusual patterns that may indicate fraudulent
activities.
4. Inventory Management: By understanding which items are frequently
purchased together, we can better manage the inventory levels and
ensure that complementary products are stocked adequately.
Let us learn how we can implement association rule mining in Python using the mlxtend library. Here, we import the pandas and mlxtend libraries, create a sample transaction data set in the variable 'transactions', encode and transform it, and load it into a DataFrame. Frequent itemsets are then mined, and the method association_rules() is called and evaluated using the confidence metric with the minimum threshold parameter set to 0.6.
Example Code for implementing Transaction Data set in python using
‘mlxtend’ library:
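Below is a minimal sketch consistent with the description above, using mlxtend's TransactionEncoder, apriori, and association_rules; the sample transactions and the minimum support of 0.3 are assumptions chosen for illustration, while the confidence threshold of 0.6 matches the description.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# A small, hypothetical transaction data set (each inner list is one transaction)
transactions = [
    ['python', 'numpy'],
    ['java'],
    ['python', 'numpy', 'matplotlib', 'pandas'],
    ['pandas'],
    ['python', 'matplotlib'],
]

# Encode the transactions into a one-hot (True/False) DataFrame
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_array, columns=te.columns_)

# Mine frequent itemsets with a minimum support of 30%
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)

# Generate association rules with a minimum confidence of 0.6
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.6)

print(frequent_itemsets)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])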


In this section, we learnt that transaction data sets are foundation for
various data mining tasks, especially association rule mining and market
basket analysis. These data sets are analysed to identify and detect valuable
insights, hidden patterns, anomalous behaviour in customer purchasing
patterns, optimise product placement, improve inventory management,
detect fraudulent transactions, minimize risk, take preventive measures
and enhance overall operational efficiency.

4.3.2 Frequent Itemset and Support Measure


A frequent itemset is a set of items that appears together in transactions at least a minimum number of times. The fundamental step in association rule mining is to identify the frequent itemsets, as these itemsets provide the basis for generating association rules. For example, {python, numpy} appear to-


gether in the download history of a software repository. Similarly, {milk, bread} appear together in the carts of customers who shop during morning hours. A few more examples of frequent itemsets are given below:
‹ ‹{trousers, shirts}
‹ ‹{coffee, sugar}
‹ ‹{coffee, milk}
‹ ‹{bread, butter}
‹ ‹{french fries, coke}
‹ ‹{burger, cold drink}
‹ ‹{tea, samosa}
‹ ‹{tea, biscuit}
‹ ‹{shoes, socks}
‹ ‹{keyboard, mouse}
In other words, a frequent data itemset is a set of items that appear to-
gether in a transactional dataset with a frequency that meets or exceeds
a specified minimum support threshold. These itemsets are fundamental
to association rule mining, as they represent combinations of items that
occur commonly enough to be considered significant.
A lattice structure, as shown in Figure 4.1, can also be used to enumerate the list of all possible itemsets. In general, a data set that contains k items can generate up to 2^k - 1 itemsets, excluding the null set. A few examples are shown in Table 4.1.

Table 4.1
S. No | List of Item Sets | Number of Items (k) | List of Frequent Item Sets | Number of Frequent Item Sets (2^k – 1)
1. {Milk, Bread, 03 1. {Milk}, = 2^3 – 1
Butter} 2. {Bread}, = 8-1
3. {Butter}, = 7
4. {Milk, Bread} {null set is not counted,
5. {Milk, Butter} therefore, one has been
subtracted}
6. {Bread, Butter}
7. {Milk, Bread, Butter}


2. {python, 02 1. {python}, = 2^2 – 1
numpy} 2. {numpy}, =4-1
3. {python, numpy} =3
{null set is not counted,
therefore, one has been
subtracted}
3. {python, 04 1. {python}, =24-1
numpy, 2. {numpy}, =16-1
pandas,
3. {pandas}, =15
matplotlib}
4. {matplotlib}, {null set is not counted,
5. {python, numpy}, therefore, one has been
subtracted}
6. {python, pandas},
7. {python, matplotlib},
8. {numpy, pandas},
9. {numpy, matplotlib},
10. {pandas, matplotlib},
11. { python, numpy,
pandas},
12. { python, numpy,
matplotlib},
13. { python, pandas,
matplotlib},
14. { numpy, pandas,
matplotlib}
15. { python, numpy,
pandas, matplotlib}


Figure 4.1: Frequent Item Set [Ref 1]

Key Concepts
Itemset: An itemset is a collection of one or more items. For example, in an online software repository scenario, an itemset might be X = {python, numpy, pandas}.
Support: Support is a metric that measures how frequently an itemset
appears in the dataset. It is defined as the proportion of transactions in
which the itemset occurs.
Mathematically, the support of an itemset X is defined as:
Support(X) = (number of transactions containing X) / (total number of transactions)
For example, if {python, numpy} appears in 50 out of 200 transactions,
the support is 0.25 or 25%.
Minimum Support Threshold: It is a user-defined value that specifies
the minimum frequency for an itemset to be considered frequent. Itemsets
with support above this threshold are termed frequent itemsets.

Transaction width is the number of items present in a transaction.
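As a quick illustration of these terms, the short sketch below computes the support of an itemset and the width of each transaction; the sample transactions are an assumption made for the example.

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset
    matches = sum(1 for t in transactions if set(itemset).issubset(t))
    return matches / len(transactions)

transactions = [
    {'python', 'numpy', 'pandas'},
    {'python', 'numpy'},
    {'numpy', 'matplotlib'},
    {'python', 'matplotlib'},
]

print(support({'python', 'numpy'}, transactions))  # 0.5 (2 of 4 transactions)
print([len(t) for t in transactions])              # transaction widths: [3, 2, 2, 2]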


Now that we are familiar with the basic key terms related to frequent itemsets, let us understand their importance in association rule mining. Frequent itemsets are important because they are the basis for generating strong and meaningful association rules. For instance, from the frequent itemset {python, numpy}, we can derive rules like:
‹ ‹{python}→{numpy}

‹ ‹{numpy}→{python}

These rules are evaluated using metrics like confidence and lift to deter-
mine their effectiveness and usefulness.
Methods to find Frequent Itemsets: The basic approach to finding frequent itemsets is to compute the support metric for every itemset in the lattice structure. For this, we compare each candidate itemset against every transaction in the dataset. If the candidate itemset is contained in a transaction, its support count is increased by 1. However, when the number of items is large, the number of comparisons grows exponentially. The number of comparisons is O(NMw), where N is the number of transactions, M is the number of candidate itemsets and w is the maximum transaction width. There are three main approaches to reduce the computational complexity of frequent itemset generation, as detailed below:
1. Reduce the Number of Candidate Itemsets (M): The Apriori principle is used to eliminate some of the M = 2^k − 1 candidate itemsets without counting their support values.
2. Reduce the Number of Comparisons (w): Instead of matching
each candidate itemset against every transaction, we can reduce the
number of comparisons by using more advanced data structures,
either to store the candidate itemsets or to compress the data set.
3. Reduce the Number of Transactions (N): As the size of candidate
itemsets increases, fewer transactions will be supported by the
itemsets.
In this section, we learnt that frequent itemsets are important for association rule mining. Frequent itemsets provide the foundation for identifying and detecting hidden patterns and relationships within large

data sets. Data analysts focus on itemsets that meet a minimum support threshold to ensure that the rules generated are significant and reflect reliable patterns. These rules and detected patterns lead to actionable insights in various applications such as market basket analysis, recommendation systems, fraud detection, anomaly detection and inventory management, to name a few.

4.4 Rule Generation


The process of generating rules in association rule mining involves the extraction of meaningful and useful patterns from large datasets. Such rules help in finding different kinds of relationships between items in transactional data, including sets of items customers commonly buy together at a retail store, software downloaded together from a software repository, or movies watched on an online movie platform. The rule generation process can therefore be divided into three steps: (a) determining frequent itemsets, (b) generating possible rules, and (c) evaluating their strength based on various metrics.

Steps in Rule Generation


(a) Identify Frequent Itemsets: Rule generation begins by collecting all itemsets that have at least the minimum support. This can be done through the Apriori or FP-Growth algorithms. The support metric is the frequency of occurrence of an itemset in a dataset. An itemset is said to be frequent if its support is greater than or equal to a user-specified threshold.
(b) Generate Possible Rules: Once frequent itemsets are identified, the next step is to generate all possible rules from these itemsets. A rule is of the form {X→Y}, where X (antecedent) and Y (consequent) are disjoint (mutually exclusive; no item in common) subsets of an itemset. For example, if {numpy, python, matplotlib} is a frequent itemset, possible rules include:
(i) {python, matplotlib}→{numpy}
(ii) {python, numpy}→{matplotlib}
(iii) {numpy, matplotlib}→{python}


(c) Evaluate Rule Strength: In this step, the strength and accuracy of the rules generated are evaluated using the confidence, lift, and conviction metrics. These metrics are defined below along with their respective formulae:
‹ ‹Confidence: Confidence is the measure of the likelihood that the consequent Y is present in transactions that contain the antecedent X. It is calculated as:
Confidence(X→Y) = Support(X ∪ Y) / Support(X)
‹ ‹Lift: Lift compares the observed support of the rule to the expected support if X and Y were independent. A lift value greater than 1 indicates a positive correlation between X and Y. It is defined as:
Lift(X→Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
‹ ‹Conviction: Conviction is the measure of the strength of the implication of the rule. The formula of conviction is:
Conviction(X→Y) = (1 − Support(Y)) / (1 − Confidence(X→Y))
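The following sketch computes these three metrics for one rule directly from raw transactions; the transaction list and the chosen rule {python} → {numpy} are assumptions made for illustration.

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

transactions = [
    {'python', 'numpy', 'pandas'},
    {'python', 'numpy'},
    {'numpy', 'matplotlib'},
    {'python', 'matplotlib'},
    {'python', 'numpy', 'matplotlib', 'pandas'},
    {'python', 'matplotlib', 'pandas'},
]

X, Y = {'python'}, {'numpy'}
s_xy = support(X | Y, transactions)   # support of X and Y together = 0.5
s_x = support(X, transactions)        # support of the antecedent   = 0.833...
s_y = support(Y, transactions)        # support of the consequent   = 0.666...

confidence = s_xy / s_x                      # 0.6
lift = s_xy / (s_x * s_y)                    # 0.9 (below 1: slight negative correlation)
conviction = (1 - s_y) / (1 - confidence)    # 0.833...
print(confidence, lift, conviction)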

In this section, we learnt that the process of rule generation in association rule mining involves several steps, namely (a) identification of frequent itemsets, (b) generation of all possible rules from those itemsets, and (c) evaluation of the rules with metrics such as support, confidence, and lift. This process supports pattern detection and relationship analysis in large datasets. Such patterns enable businesses and academia to make data-driven decisions, optimise product placements, detect anomalies, identify fraud, minimise risk, refine marketing strategies, and enhance overall efficiency.

4.5 Confidence of Association Rule


Confidence is the most important metric in association rule mining, as it measures how reliable or strong an association rule is. In other words, it measures how often the consequent (the right-hand side of the rule) occurs in the transactions that contain the antecedent (the left-hand side).

Confidence measures the conditional probability that a transaction contains the consequent given that it contains the antecedent. The formula of confidence is given by:
Confidence(X→Y) = Support(X ∪ Y) / Support(X)

Importance of Confidence
‹ ‹Rule Evaluation: Confidence helps to filter out the strong rules which are more likely to be useful for prediction or recommendation.
‹ ‹Business Insights: High-confidence rules provide decision-making insights about cross-selling, product placement, and inventory management in retail store management.
‹ ‹Filtering Rules: Association rule mining can discard any rule during the building process that falls below a certain confidence threshold, so that the most reliable associations are kept while the less reliable rules are discarded.
The main metric in association rule mining is the confidence metric. It expresses the conditional probability of finding the consequent, given that the antecedent is already present in a transaction. By calculating and analysing the confidence of different rules, we can identify patterns within the transactional data, make intelligent decisions and optimise operations.

4.6 Apriori Principle


It is observed that frequent itemset generation is computationally expensive, because the number of candidate itemsets grows exponentially with the number of items. The support measure is used to reduce the number of candidate itemsets through the following principle:
“If an itemset is frequent, then all of its subsets must also be frequent.”
For example, if {c, d, e} is a frequent itemset, every transaction that contains {c, d, e} must also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} are also frequent. Conversely, if an itemset is found to be infrequent, all of its supersets must be infrequent as well; this is the property that allows the algorithm to prune candidate itemsets without counting their support.
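A minimal sketch of how this property is used to prune candidates is given below; the helper function name and the example frequent 2-itemsets are assumptions made for illustration.

from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    # A candidate of size k can only be frequent if every (k-1)-subset
    # of it is already known to be frequent; otherwise it can be pruned.
    k = len(candidate)
    return any(frozenset(subset) not in frequent_prev
               for subset in combinations(candidate, k - 1))

# Frequent 2-itemsets found in an earlier pass (illustrative)
frequent_2 = {frozenset(s) for s in [{'c', 'd'}, {'c', 'e'}, {'d', 'e'}, {'c', 'f'}]}

print(has_infrequent_subset({'c', 'd', 'e'}, frequent_2))  # False -> keep this candidate
print(has_infrequent_subset({'c', 'd', 'f'}, frequent_2))  # True  -> prune ({'d', 'f'} is infrequent)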

4.7 Apriori Algorithm
In 1994, computer scientists Rakesh Agrawal and Ramakrishnan Srikant developed the Apriori algorithm. The Apriori algorithm is a very important method for association rule mining that uses a support-based pruning technique to control the exponential growth of candidate itemsets. The algorithm first generates the frequent itemsets and then derives association rules from these itemsets. Apriori relies on the basic property that every non-empty subset of a frequent itemset must itself be frequent. This makes the search much easier, because many candidate itemsets that fail to meet the minimum support threshold are pruned during the early phases. As a result, the Apriori algorithm can find the complete set of frequent itemsets with far less computational effort. The algorithm for Apriori, as given in Ref [1], is given below:

Figure 4.2: Apriori Algorithm


Source: Ref [1]

Consider generating frequent itemsets with the Apriori algorithm using a small example with transactions involving the items Python, Numpy, Matplotlib, and Pandas, assuming a minimum support threshold of 50% (0.5) for simplicity.

Transactions Dataset
Consider the following transactions:
1. {Python, Numpy, Pandas}
2. {Python, Numpy}
3. {Numpy, Matplotlib}

4. {Python, Matplotlib}


5. {Python, Numpy, Matplotlib, Pandas}
6. {Python, Matplotlib, Pandas}

Step 1: Initialize and Calculate Support for Individual Items


‹ ‹Support for Python: 5/6 ≈ 0.83
‹ ‹Support for Numpy: 4/6 ≈ 0.67
‹ ‹Support for Matplotlib: 4/6 ≈ 0.67
‹ ‹Support for Pandas: 3/6 = 0.5
All items have support above the minimum threshold (0.5), so all are
frequent.

Step 2: Generate Candidate Itemsets of Size 2 and calculate their


support.
‹ ‹{Python, Numpy}: Appears in transactions 1, 2, and 5 → Support:
3/6 = 0.5
‹ ‹{Python, Matplotlib}: Appears in transactions 4, 5, and 6 → Support:
3/6 = 0.5
‹ ‹{Python, Pandas}: Appears in transactions 1, 5, and 6 → Support:
3/6 = 0.5
‹ ‹{Numpy, Matplotlib}: Appears in transactions 3 and 5 → Support:
2/6 ≈ 0.33
‹ ‹{Numpy, Pandas}: Appears in transactions 1 and 5 → Support:
2/6 ≈ 0.33
‹ ‹{Matplotlib, Pandas}: Appears in transactions 5 and 6 → Support:
2/6 ≈ 0.33
Only {Python, Numpy}, {Python, Matplotlib}, and {Python, Pandas}
have support ≥ 0.5.

Step 3: Generate Candidate Itemsets of Size 3


‹ ‹{Python, Numpy, Matplotlib}: Appears in transaction 5 → Support:
1/6 ≈ 0.17
‹ ‹{Python, Numpy, Pandas}: Appears in transactions 1 and 5 → Support: 2/6 ≈ 0.33


‹ ‹{Python, Matplotlib, Pandas}: Appears in transactions 5 and 6 → Support: 2/6 ≈ 0.33
None of these itemsets have support ≥ 0.5, so there are no frequent
itemsets of size 3.

Step 4: Generate Association Rules


From the frequent itemsets found (of sizes 1 and 2), generate association
rules and calculate their confidence.

Rules from Itemsets of Size 2


‹ ‹From {Python, Numpy}:
‹ ‹Python → Numpy: Support ({Python, Numpy})/Support(Python)
= 3/5 = 0.6
‹ ‹Numpy → Python: Support ({Python, Numpy})/Support(Numpy)
= 3/4 = 0.75
‹ ‹From {Python, Matplotlib}:
‹ ‹Python → Matplotlib: Support ({Python, Matplotlib})/ Support(Python)
= 3/5 = 0.6
‹ ‹Matplotlib → Python: Support ({Python, Matplotlib})/Support(Matplotlib)
= 3/4 = 0.75
‹ ‹From {Python, Pandas}:
‹ ‹Python → Pandas: Support ({Python, Pandas})/Support(Python) =
3/5 = 0.6
‹ ‹Pandas → Python: Support ({Python, Pandas})/Support(Pandas) =
3/3 = 1.0

Step 5: Select high-confidence rules, i.e., rules with confidence at or above the minimum threshold (assume 0.6).
‹ ‹Python → Numpy (0.6)
‹ ‹Numpy → Python (0.75)
‹ ‹Python → Matplotlib (0.6)
‹ ‹Matplotlib → Python (0.75)
‹ ‹Python → Pandas (0.6)
‹ ‹Pandas → Python (1.0)


In summary, the Apriori algorithm has identified several frequent itemsets and high-confidence association rules from the given transactions, following these steps:
1. Calculate the support of individual items and identify frequent items.
2. Generate and prune candidate itemsets of size 2.
3. Generate candidate itemsets of size 3, of which none turn out to be frequent.
4. Generate association rules from frequent itemsets and calculate their
confidence.
5. Select rules with confidence above the specified threshold.
We have seen that the Apriori algorithm works in the following four steps:
(i) Generate candidate itemsets of length k. This step is called candidate generation.
(ii) Scan the dataset to count the support of each candidate itemset.
(iii) Remove the candidate itemsets with support smaller than the threshold.
(iv) Retain only the frequent ones.
This process is iterative and, in each step, the length of the candidate itemsets is increased by one until no more frequent itemsets are obtained. The Apriori algorithm is known for its scalability and efficiency and is easy to implement. However, with very large datasets and minimum support thresholds set too low, it becomes computationally expensive. Therefore, hashing techniques are sometimes used to prune candidate itemsets, and partitioning methods are used to reduce the number of dataset scans.
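The short sketch below reproduces the worked example above in plain Python. It is a brute-force, level-wise search over all item combinations rather than an optimised Apriori implementation, and the six transactions are the ones used in the example.

from itertools import combinations

transactions = [
    {'Python', 'Numpy', 'Pandas'},
    {'Python', 'Numpy'},
    {'Numpy', 'Matplotlib'},
    {'Python', 'Matplotlib'},
    {'Python', 'Numpy', 'Matplotlib', 'Pandas'},
    {'Python', 'Matplotlib', 'Pandas'},
]
items = sorted(set().union(*transactions))
min_support = 0.5

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

# Level-wise search: report itemsets of size 1, 2, 3, ... whose support >= 0.5
for k in range(1, len(items) + 1):
    frequent_k = [c for c in combinations(items, k) if support(c) >= min_support]
    if not frequent_k:
        break   # no frequent itemsets of this size, so none of any larger size either
    for itemset in frequent_k:
        print(set(itemset), round(support(itemset), 2))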
There are many areas where association rule mining and the Apriori
algorithm can be applied. Market basket analysis in retail is one area
in which association rule mining and the Apriori algorithm are used to identify product associations with a view to devising optimal store layouts, improving cross-selling strategies, and designing promotions. In web usage
mining, these rules provide understanding of user behaviour patterns
that improve website navigation and personalization. The algorithms of
mining association rules can also be applied to the healthcare sector in
an effort to identify correlations between symptoms and diseases, aiding
in diagnostics and personalised treatment plans.

4.8 Summary
This lesson comprehensively covers association rule mining and the Apriori algorithm, enabling students and practitioners to apply these techniques to real datasets. Mastery of these concepts will empower learners to derive useful insights from big data, driving effective decision-making and strategic planning across academia, research and industry. Emphasis has been placed on the metrics of support, confidence and lift, on iterative algorithms, and on pruning techniques. The lesson also covered practical implementation aspects, such as data preparation, the application of the Apriori algorithm using popular libraries like mlxtend in Python, and how to interpret the results. This provides hands-on experience in mining association rules from transaction data.
In short, this lesson offers the learner the basic knowledge and skills of association rule mining using the Apriori algorithm. These methods will be useful to analysts and data scientists in finding valuable patterns in large datasets, which in turn improves decision-making and leads to better business results.
IN-TEXT QUESTIONS
1. A collection of records where each record represents a transaction
containing a set of items is called:
(a) Frequent itemset
(b) Transaction item set
(c) Association item set
(d) All of the above
2. What is association rule mining?
(a) Same as frequent itemset mining
(b) Finding of strong association rules using frequent itemsets
(c) Using association to analyse correlation rules
(d) Finding Itemsets for future trends
3. A collection of one or more items is called __________.
(a) Itemset (b) Support
(c) Confidence (d) Support Count


4. The frequency of occurrence of an itemset is called __________.


(a) Support
(b) Confidence
(c) Support Count
(d) Rules

4.9 Answers to In-Text Questions


1. (b) Transaction item set
2. (b) Finding of strong association rules using frequent itemsets
3. (a) Itemset
4. (c) Support count

4.10 Self-Assessment Questions


1. Explain the difference between support and confidence in the context
of association rule mining. Why are both metrics important when
generating association rules?
2. Describe the “Apriori principle” and explain how it helps in reducing
the computational complexity of the Apriori algorithm.
3. Using the Apriori algorithm with a minimum support threshold of
60%, identify the frequent itemsets of size 1 and 2. Consider the
following transactions:
(i) T1: {Milk, Bread, Butter}
(ii) T2: {Bread, Butter}
(iii) T3: {Milk, Bread}
(iv) T4: {Milk, Butter}
(v) T5: {Bread, Butter}
4. Discuss some real-world applications of association rule mining. How
can businesses leverage the insights obtained from association rule
mining to improve their operations and decision-making processes?

4.11 References
‹ ‹Tan P. N., Steinbach M., Karpatne A. and Kumar V. Introduction to Data Mining, 2nd edition, Pearson, 2021.
‹ ‹Han J., Kamber M. and Pei J. Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann Publishers, 2011.
‹ ‹Zaki M. J. and Meira W. Jr. Data Mining and Machine Learning: Fundamental Concepts and Algorithms, 2nd edition, Cambridge University Press, 2020.

4.12 Suggested Readings


‹ ‹Mudumba, B., & Kabir, M. F. (2024). Mine-first association rule
mining: An integration of independent frequent patterns in distributed
environments. Decision Analytics Journal, 100434.
‹ ‹Dol, S. M., & Jawandhiya, P. M. (2023). Classification technique
and its combination with clustering and association rule mining in
educational data mining—A survey. Engineering Applications of
Artificial Intelligence, 122, 106071.
‹ ‹Kumbhare, T. A., & Chobe, S. V. (2014). An overview of association
rule mining algorithms. International Journal of Computer Science
and Information Technologies, 5(1), 927-930.
‹ ‹Zhang, C., & Zhang, S. (Eds.). (2002). Association rule mining:
models and algorithms. Berlin, Heidelberg: Springer Berlin Heidelberg.



L E S S O N

5
Building Blocks of
Classification Systems
Dr. Charu Gupta
Assistant Professor
School of Open Learning
University of Delhi
Email-Id: [email protected]

STRUCTURE
5.1 Learning Objectives
5.2 Introduction: About Classification
5.3 Naive Bayes Classifier
5.4 Nearest Neighbour Classifier
5.5 Decision Tree
5.6 Overfitting
5.7 Confusion Matrix, Evaluation Metrics and Model Evaluation
5.8 Summary
5.9 Answers to In-Text Questions
5.10 Self-Assessment Questions
5.11 References
5.12 Suggested Readings

5.1 Learning Objectives


‹ ‹To understand the classification techniques for datasets.
‹ ‹To understand the working principles of various classification algorithms.
‹ ‹To apply techniques of classification on real-world data sets.
‹ ‹To evaluate the different classification techniques to make informed decisions about
their use in real-world applications.

5.2 Introduction: About Classification
Consider opening an email box. Various emails have been categorised
as inbox, sent, spam, and drafts. The inbox emails are further catego-
rised into primary, promotion, social, and updates. The emails can also
be categorised further as starred or important. Categorising the emails
according to the subjects helps identify the required emails without con-
suming much time and effort.

Figure 5.1: Email Classification


Source: Domitilla Brandoni, CINECA created on June 2023,
accessed on 15 Apr 2024

Let us consider another scenario. While going grocery shopping, we


prepare a list of items that can be categorised as snacks, cereals, puls-
es, spices, fruits, vegetables, and dairy products. Sorting the items into
categories helps us in shopping the items from their respective shops or
counters. Similarly, a student in class 10th may create labels or folders to
segregate the emails according to different subjects like English, Hindi,
Maths, Science, Social Science, and Computer Science.
There are many other real-life day-to-day activities in which items are
categorized and segregated according to their features. The process of
classification helps us to organize, sort, and understand a lot of infor-
mation quickly and easily. It also lets us pay attention to what matters,
make good choices, and improve our daily lives. Classification is an

important job in machine learning, data science and data mining. It is used in many applications in day-to-day activities and industries. Classification helps in sorting and grouping large data, making it easier to understand and analyse.

5.2.1 The Need for Classification


Classification is one of the fundamental tasks of human life and machine learning, in which information is categorised into well-defined groups or classes based on its central theme. Classification is therefore a common activity in our day-to-day life: we sort our emails into folders, products are organised into categories in a grocery store, and medical conditions are diagnosed by matching symptoms to known diseases.
In the present scenario, large volumes of data are being collected at a rapid speed. The requirement is to organise this data and to analyse it to extract useful insights. The sheer volume of data makes manual categorisation infeasible. Classification algorithms therefore automate the process, so that analysis becomes easy and computers can predict the category of new incoming data based on patterns learned from labelled training data. This capability is very important across a wide array of applications that include spam filtering in emails, recommending movies on streaming platforms, detecting fraud in financial transactions, and offering personalised user experiences online. Classification aids in the organisation of data into meaningful groups, thereby enhancing the understanding of information, which promotes better decision-making and operational efficiency.
Let us study a few reasons why classification is needed:
(a) Decision-making and Automation: Classification models automate the process of segmentation, segregation, sorting and categorising, such as filtering out spam emails or detecting fraudulent financial transactions, without human interference.
(b) Predictive Analytics: The classification models facilitate organizations
to make better decisions through identification of various activities
like customer churn prediction, market assessment, sales prediction,

weather forecast, and calculating credit risk by classifying or labeling


the new data with the help of predictive analytics.
(c) Data Organization, Categorization, and Understanding: Classification algorithms and methods help in summarising large volumes of data by grouping them together based on their similar characteristics. Organising data makes it easier to understand and analyse. For example, it is easy to categorise news articles into different topics like sports, nation, entertainment, environment, advertisement, politics, etc.
(d) Pattern Recognition: It also allows one to recognize patterns and
connections from data that may be useful in gaining some insight
and driving strategic decisions.
(e) Enhancing User Experience, Personalisation and Content Moderation: Classification can also be utilised to personalise user experiences by recommending relevant content or products based on preference or behaviour, as in streaming services or online shopping.
It is also helpful in content moderation, such as classifying and filtering inappropriate or harmful material on social media platforms and online communities.
(f) Resource Optimisation: Efficiency and Productivity: The classification of activities, issues or resources into their types assists an organisation in enhancing its performance. Examples include classifying customer tickets according to their severity in order to prioritise them, and classifying maintenance issues in order to target resources where they are needed.
(g) Enhancement of Marketing Strategies: Classifications help a
business segment customers and address those in targeted marketing
for more effective and overall efficient marketing campaigns.
(h) Safety and Security: Anomaly detection is very important in
assuring security, considering unusual patterns which may indicate
a cybersecurity threat or fraud, for example.
(i) Medical Diagnosis: In healthcare, the classification models diagnose
diseases by categorizing medical images or patient data into one of
several classes to help in early detection and treatment.

(j) Scientific and Research Applications: Classification is also used to organise data obtained from research. In the field of biology, it helps in the classification of species; in astronomy, it classifies types of galaxies; other examples include document classification and resumé recommendation.
(k) Environmental Monitoring: It also aids in environmental monitoring
where satellite imagery or sensor data classification monitors the
changes in land use, vegetation, and climate pattern.
(l) Simplification of Complex Data: Apparently complex data are
reduced to predefined classes through classificatory analysis, which
makes it simpler to analyse and reach a conclusion.
(m) Interpretable: It offers a more organized method of interpretation
and understanding the data in many areas where interpretability is
so crucial, such as health and finance.
Overall, classification is a powerful tool in Data mining and machine
learning that drives efficiency, enhances decision-making, and enables a
wide range of applications across different domains. It helps transform raw data into meaningful categories, enabling organisations and individuals to make better, more informed decisions.

5.2.2 Examples of Classification


A few examples of classification are:
1. Students’ performances are classified into grades A, B, C, D, or F
based on their scores in Assessment.
2. People classify their clothes based on usability, daily wear, office
wear, party wear, and wedding wear.
3. Classifying vehicles based on a number of wheels: two-wheeler,
three-wheeler, four-wheeler.
4. Classifying vehicles based on combustion engines, petrol vehicles,
diesel vehicles, electric vehicles, and hybrid vehicles.
5. Classifying sports as indoor games, outdoor games.
6. Household waste is typically classified into recyclables (paper, plastics,
glass), organic waste (food scraps, yard trimmings), hazardous waste

(batteries, chemicals), and electronic waste (broken CDs, pen drives,
malfunctioning mice, keyboards).
7. Classifying movies into genres such as action, comedy, drama,
romance and documentaries.
8. Songs can be classified into duet songs, sad songs, party songs, disco
songs, love songs, and religious songs.
9. A technical support centre may classify incoming queries as software
problems, hardware issues, network problems, or billing queries.
10. Animals are classified based on living habitat: terrestrial, aquatic,
amphibian, arboreal, and aerial.

5.2.3 Classification as a Model


Classification is a fundamental technique in data mining that plays an essential role in organising and analysing large datasets. When we categorise data into meaningful classes using classification algorithms, we can make predictions, detect hidden patterns, and extract valuable knowledge from complex and large datasets. The principles, techniques and algorithms of classification enable data scientists and analysts to build accurate and reliable models for a large number of applications, from customer segmentation and fraud detection to medical diagnosis and text classification. As data mining algorithms continue to grow, evolve and become more efficient and accurate, classification techniques have become significant for decision-making and inference from data across industry, academia, research, medicine, finance, banking, management and various other domains.

Figure 5.2: Classification Model [4]

“A classification model is an abstract representation of the relationship


between the attribute set and the class label.”
Let us see how the classification model can be visualised. The classification model can be represented as a tree, a probability table, or a vector of real-valued parameters. Mathematically, the target function f takes an input attribute set x and outputs a class label y; an instance (x, y) is correctly classified by the model when f(x) = y.
A classification model consists of the following key components:
(a) Training Set: Data consisting of a set of instances with their
respective values of features and their corresponding correct classes.
For example, the training data in a dataset of emails will contain
emails labelled or classified as either “spam” or “not spam”.
(b) Features: Features or attributes are the identifying values that define
an instance marked for classification. For example, a specific list
or frequency of words in an email that defines it as spam or not.
(c) Labels: Labels or classes are the output classes to which an instance
is classified using the classification algorithm. In supervised
classification, labels are known for the training data and unknown
for the new data or testing data.
(d) Learning Algorithm: The systematic steps for learning a classification
model on a training set is known as a learning algorithm. A few
popular classification algorithms are Naïve Bayes, Decision Tree,
and K-nearest neighbour.
(e) Induction: The process of using a learning algorithm to build a classification model from the training dataset is called induction, also referred to as “learning a model” or “building a model.”
(f) Testing Data Set: The test data which is used to evaluate the
performance and generalisation of the ability of a trained model or
classification algorithm using different evaluation metrics.
(g) Deduction: The process of applying a classification model or an
algorithm on new and unseen instances of test datasets to identify
or predict their respective class or category is known as deduction.

(h) Evaluation Metrics: Evaluation metrics measure whether the classes predicted for the instances of the test dataset fall in the correct category. These metrics provide a quantitative, numeric value to evaluate the correctness of a classification algorithm. The performance of a classification algorithm is evaluated using various metrics such as accuracy, precision, recall, F1 score, and ROC-AUC.

5.2.4 Types of Classification


Classification of an instance in a data set can be divided into several types
based on the nature of the target variable and the structure of the data.
(a) Binary Classification: The model predicts one of two possible
classes. Examples include spam detection (spam or not spam) and
disease diagnosis (disease or no disease).
(b) Multi-class Classification: The model predicts one of three or more
possible classes. Examples include classifying types of animals in
images (cat, dog, bird) or sorting student assignments into subject
categories (English, Hindi, Science, Mathematics).

(c) Multi-label Classification: The model can predict multiple classes


for a single instance. For example, a text can be classified into
multiple categories like events and sports.

Figure 5.3: Types of Classification


Source: Brian Mutea, “Logistic regression in Python
with Scikit-learn”, accessed on 15 Apr 2024


Figure 5.4: Classification Types


Source: Prof Purshottam Kar, IIT Delhi

The goal of classification algorithms is to generate more certain, precise and accurate results. The learning algorithms of most classification techniques are designed to learn models that attain the highest accuracy or, equivalently, the lowest error rate when applied to the test set. In the forthcoming sections, we will study basic classification algorithms.

5.3 Naive Bayes Classifier


Naive Bayes is a family of probabilistic algorithms based on Bayes’
Theorem, with the “naive” assumption that features are independent of
each other in a class. Naive Bayes classifiers perform well in many re-
al-world situations, particularly for text classification problems like spam
detection, sentiment analysis, and news classification.
Bayes’ Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. The formula is:
P(A|B) = [P(B|A) × P(A)] / P(B)

Where
‹ ‹P(A|B) is the posterior probability of class A given features B.
‹ ‹P(B|A) is the likelihood of features B given class A.
‹ ‹P(A) is the prior probability of class A.
‹ ‹P(B) is the prior probability of features B (the evidence).

Types of Naive Bayes Classifiers


‹ ‹Gaussian Naive Bayes: It assumes that the features follow a normal
distribution curve.
‹ ‹Multinomial Naive Bayes: It is used when the dataset contains discrete counts (e.g., word counts in text classification).
‹ ‹Bernoulli Naive Bayes: It is used when a dataset contains binary/Boolean features (e.g., word presence/absence in text classification).

Scenario: Text Classification with Multinomial Naive Bayes


Consider a data set of code snippets classified into three programming
languages: Java, python and C++. The following steps are followed to
load, train, and evaluate the dataset using Python programming:
(i) Load the Dataset: Load the data set code_snippets.csv.
(ii) Preprocess the Data: Convert text data to feature vectors.
(iii) Train the Model: Train the data set using the Multinomial Naive
Bayes classifier.
(iv) Evaluate the Model: Evaluate the performance using the splitting
method.
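A minimal sketch of these four steps is given below; the file name code_snippets.csv comes from the scenario, but its column names ('code' and 'language'), the 70:30 split and the random seed are assumptions made for illustration.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# (i) Load the dataset (assumed columns: 'code' and 'language')
data = pd.read_csv('code_snippets.csv')

# (ii) Preprocess the data: convert the text into word-count feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['code'])
y = data['language']

# Split into training and test sets (the splitting method used for evaluation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# (iii) Train the Multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# (iv) Evaluate the model on the held-out test set
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))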
Naive Bayes classifiers, especially Multinomial Naive Bayes, are partic-
ularly effective for text classification tasks due to their simplicity and
efficiency. Despite the strong independence assumption, Naïve Bayes
algorithms give good performance and are a good starting point for many
classification problems.


5.4 Nearest Neighbour Classifier


Nearest Neighbour Classification, often referred to as k-nearest Neigh-
bours (k-NN), is a simple, non-parametric, and lazy learning algorithm
used for both classification and regression tasks. The fundamental idea is
to classify a data point based on the majority class among its k nearest
neighbours in the feature space.
Working of K-NN algorithm:
1. Training Phase: k-NN is a lazy learner, which means it does not
explicitly train a model. Instead, it stores the training data and uses
it during the prediction phase.
2. Prediction Phase: Given a new data point, k-NN calculates the
distance between this point and all the points in the training dataset.
The distance can be calculated using various metrics, with Euclidean
distance being the most common. It then selects the k closest data

points (neighbours) to the new data point. For classification, it


assigns the class that is most common among these k neighbours.

Choosing Distance Metrics in the K-NN algorithm
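Commonly used distance measures include the Euclidean, Manhattan and Minkowski distances, with the Euclidean distance being the most common (as noted above). A minimal sketch computing the Euclidean and Manhattan distances between two numeric feature vectors is given below; the sample points are assumptions made for illustration.

import math

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0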

Algorithm for K-NN Classifier


The algorithm calculates the distance (or similarity) between each instance of the test data set and all the instances of the training data set to find its list of nearest neighbours, Dz.

Figure 5.5: K-Nearest Neighbour Algorithm [Ref 1]

Consider the Iris dataset from scikit-learn. We will perform Step-by-Step


Implementation using Python.
(i) Load the Dataset: Load the Iris dataset from sklearn.
(ii) Split the Data: Split the dataset into training and testing sets.
(iii) Standardise the Data: Feature scaling to improve distance calculation.
(iv) Train the Model: Use the k-NN algorithm from sklearn.
(v) Evaluate the Model: Measure the accuracy and visualise the results.
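A minimal sketch of these five steps is given below; the choice of k = 3, the 70:30 split and the random seed are assumptions made for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# (i) Load the Iris dataset
X, y = load_iris(return_X_y=True)

# (ii) Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# (iii) Standardise the features to improve the distance calculation
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# (iv) Train the k-NN model with k = 3 neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# (v) Evaluate the model (plotting of the results is omitted in this sketch)
print('Accuracy:', accuracy_score(y_test, knn.predict(X_test)))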

Advantages and Disadvantages of K-NN


Advantages
‹ ‹Simple and Intuitive: Easy to understand and implement.
‹ ‹No Training Phase: Makes it quick to set up and start using.
‹ ‹Versatile: Can be used for both classification and regression tasks.

Disadvantages
‹ ‹Computationally Intensive: Requires calculating distances to all
points in the training set for each prediction, which can be slow
for large datasets.
‹ ‹Sensitive to Irrelevant Features: All features are treated equally,
so irrelevant features can negatively impact performance.
‹ ‹Storage Requirements: Needs to store all training data, which can
be impractical for very large datasets.
In this section, we studied the k-nearest Neighbors classification algorithm.
This algorithm is a powerful and versatile algorithm for classification
and regression, particularly useful for small to medium-sized datasets and
problems where interpretability and simplicity are important.

5.5 Decision Tree
The Decision Tree is one of the most popular and powerful supervised learning algorithms for classification and regression problems. It breaks down a dataset into smaller and smaller subsets while an associated decision tree is incrementally developed. More precisely, it divides the data into subsets based on the values of the input features and then makes decisions using a tree-like model. The tree thus formed is a set of decision nodes and leaf nodes. The leaves, or terminal nodes, represent the classes that are the output of the decision tree classification algorithm.

Working of Decision Tree Classification


A decision tree is built using the earliest method known as Hunt’s Algorithm. The decision tree is grown recursively by splitting the nodes into one or more subsets using a set of criteria.
1. Root Node: The top node of the tree is called the root node. It represents the entire dataset, which is further split into two or more homogeneous and mutually exclusive sets.
2. Splitting: The process of dividing the data sub-sets of a node into
two or more sub-nodes based on certain pre-defined conditions.
3. Decision Node: A node of data sets that can be divided further into
sub-nodes. These are the nodes where the data sets are divided
based on certain conditions.
4. Leaf/Terminal Node: The final output node that does not split
further. The leaf nodes represent the final class of the instance
being considered.
5. Pruning: The process of removing nodes from a decision tree to avoid overfitting and improve the model’s classification ability on new, unseen data sets.

6. Algorithm for Decision Tree:

Figure 5.6: Decision Tree Algorithm [Ref 1]

7. Design Issues for a Decision Tree Algorithm


(1) Splitting Criteria: The decision tree splits or divides the
data at each node based on a criterion that maximises the
separation of classes in mutually exclusive data sets. At each
recursive step, a feature is selected to divide the instances of
a training set into smaller subsets associated with its child
nodes. Commonly used criteria of splitting the data sets in
decision tree classification include:
(a) Entropy: Entropy is a measure of the randomness or impurity in the dataset, computed as Entropy = −Σ p_i log2(p_i), where p_i is the fraction of instances belonging to class i.
(b) Information Gain: Information gain measures the reduction in entropy or impurity after the dataset is split on an attribute. The attribute with the highest information gain is chosen for the split.
(c) Gini Index: The Gini Index measures the impurity of a node, computed as Gini = 1 − Σ p_i². This criterion is used by the CART (Classification and Regression Tree) algorithm.
(d) Chi-square: The value of chi-square is used as splitting
criteria for categorical data.
(2) Stopping Criteria: Stopping criteria is used to stop expanding
or splitting or dividing all the instances of the training datasets
into sub-sets. This stops when all instances have the same
class or similar value of the feature.
8. Example of Decision Tree Algorithm: We will build a decision
tree classifier using the Iris dataset from scikit-learn. Step-by-step
implementation is given below:
(i) Load the Dataset: Load the Iris dataset from sklearn.
(ii) Split the Data: Split the dataset into training and testing sets.
(iii) Train the Model: Use the DecisionTreeClassifier from sklearn.
(iv) Evaluate the Model: Measure the accuracy and visualise the
tree.
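A minimal sketch of these four steps is given below; the 70:30 split, the random seed and the optional max_depth value are assumptions made for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# (i) Load the Iris dataset
X, y = load_iris(return_X_y=True)

# (ii) Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# (iii) Train the DecisionTreeClassifier (max_depth limits tree growth to reduce overfitting)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# (iv) Evaluate the accuracy and visualise the tree
print('Accuracy:', accuracy_score(y_test, tree.predict(X_test)))
plot_tree(tree, filled=True)
plt.show()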

9. Advantages and Disadvantages of Decision Trees: Decision trees


are a popular machine learning model used for classification and
regression tasks. The decision tree algorithm provides a clear and intuitive decision-making process through a tree-like structure of rules. The decision tree algorithm has both advantages and disadvantages, as given below:

(a) Advantages


‹ ‹It is simple to understand and interpret: The visual
representation of decision tree algorithm makes it easy to
understand the decision-making process.
‹ ‹It requires little data preparation: There is minimal need for feature scaling or normalisation of the training dataset in decision tree algorithms.
‹ ‹The Decision Tree algorithm handles both numerical and
categorical data. The decision tree algorithm is versatile
and can be used on different types of data.
‹ ‹The decision tree algorithm is non-parametric in nature. It
means that no assumptions are made about the underlying
data distribution.
(b) Disadvantages
‹ ‹The decision tree algorithm is prone to Overfitting. This
means that it may create over-complex trees that do not
classify new testing dataset with greater accuracy.
‹ ‹The decision tree algorithm is unstable, meaning small changes in the data can result in completely different trees.
‹ ‹The decision tree algorithm is biased towards dominant
classes.
Decision Trees are among the most robust, powerful, versatile, and intuitive algorithms for classification and regression tasks. We have seen a step-by-step implementation to build, evaluate and visualise a decision tree classifier. The algorithm provides insights into the decision-making process and can work on a wide range of data types. However, the splitting and stopping criteria must be chosen carefully so that overfitting is avoided and the model generalises well to new testing datasets.

5.6 Overfitting
Overfitting is the problem in classification where a model learns a training dataset too well, including its noise and outliers. This means that the fitted model will perform very well on the training data but poorly on any

new, unseen test data. This happens when the model is so complex that it captures random fluctuations and noise instead of the true underlying patterns in the training data.
1. Characteristics of Overfitting: The following are characteristics of a classification model that has overfitted the training dataset:
‹ ‹High Accuracy on Training Data: The model exhibits very high
performance on the training set, often close to 100%.
‹ ‹Low Accuracy on Test Data: The model performs significantly
worse on the test set, indicating poor applicability to new testing
datasets.
‹ ‹Complex Model: The model has too many parameters or a very
complex structure relative to the amount of training data available.
2. Causes of Overfitting: The following are the reasons that lead a classification model to overfit the dataset:
‹ ‹Complex Models: Using models with too many parameters (e.g.,
deep decision trees, high-degree polynomials).
‹ ‹Insufficient Training Data: Having too little data for the model
to learn the true underlying patterns.
‹ ‹Noisy Data: Training data that contains a lot of noise or irrelevant
features.
‹ ‹Too Many Features: Including a large number of features,
especially if many of them are irrelevant.
3. Preventing Overfitting: The following are the methods through
which we can prevent overfitting of the classification models:
‹ ‹Simpler Models: We can look for simpler models that involve
fewer parameters, such as pruning decision trees and reducing
the degree of polynomials.
‹ ‹More Training Data: We will gather more training data in order
to capture the underlying distribution better.
‹ ‹Cross-validation: We can make use of cross-validation techniques so that our model performs well on different subsets of the data.
‹ ‹Regularization: We should apply regularization techniques to
penalize overly complex models, including L1 and L2 regularisation.

‹ ‹Feature Selection: Select only the most relevant features on which to train the model. This reduces the risk of capturing noise.
‹ ‹Early Stopping: Stop training in case of iterative algorithms, for
example, gradient descent, when the performance on some validation
set starts to degrade.
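As an illustration of these ideas, the sketch below compares a fully grown decision tree with a depth-limited (pruned) one on the Iris dataset; the split ratio, the random seed and the max_depth value are assumptions made for the example, and the exact accuracies will vary from run to run.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# A fully grown tree: very high training accuracy, possibly lower test accuracy (overfitting)
full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# A simpler, depth-limited tree: slightly lower training accuracy but often better generalisation
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

for name, model in [('full tree', full_tree), ('pruned tree', pruned_tree)]:
    print(name,
          'train accuracy:', round(model.score(X_train, y_train), 3),
          'test accuracy:', round(model.score(X_test, y_test), 3))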

5.7 Confusion Matrix, Evaluation Metrics and Model Evaluation
In this section, we will learn the process of calculating various evalua-
tion metrics from a given confusion matrix. Evaluation metrics include
accuracy, precision, recall, and F1-score. Let us start with the definitions
and then provide an example.
Confusion Matrix: Confusion Matrix is a table used to describe the
performance of a classification model on a set of test data for which the
true values are known. For a binary classification problem, the confusion
matrix is given below:
Predicted Positive Predicted Negative
Actual Positive True Positive (TP) False Negative (FN)
Actual Negative False Positive (FP) True Negative (TN)

5.7.1 Evaluation Metrics


(a) Accuracy: The ratio of correctly predicted instances (both true positives and true negatives) to the total instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
(b) Precision: The ratio of correctly predicted positive observations to the total predicted positives.
Precision = TP / (TP + FP)
(c) Recall (Sensitivity or True Positive Rate): The ratio of correctly predicted positive observations to all observations in the actual positive class.
Recall = TP / (TP + FN)
(d) F1 Score: The harmonic mean of Precision and Recall.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Example Confusion Matrix


Let us consider the following confusion matrix for an email spam clas-
sification problem:
Predicted Spam Predicted Not Spam
Actual Spam 50 10
Actual Not Spam 5 35

‹ ‹True Positives (TP) = 50
‹ ‹False Negatives (FN) = 10
‹ ‹False Positives (FP) = 5
‹ ‹True Negatives (TN) = 35

Calculating the Metrics


1. Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (50 + 35) / (50 + 35 + 5 + 10) = 85/100 = 0.85
So, the accuracy is 85%.
2. Precision:
Precision = TP / (TP + FP) = 50 / (50 + 5) = 50/55 ≈ 0.909
So, the precision is approximately 90.9%.
3. Recall:
Recall = TP / (TP + FN) = 50 / (50 + 10) = 50/60 ≈ 0.833
So, the recall is approximately 83.3%.
4. F1 Score:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.909 × 0.833) / (0.909 + 0.833) ≈ 0.870
So, the F1 score is approximately 87.0%.


Therefore, with the given confusion matrix:
Predicted Positive Predicted Negative
Actual Positive 50 10
Actual Negative 5 35

The following evaluation metrics are calculated:


‹ ‹Accuracy: 85%
‹ ‹Precision: 90.9%
‹ ‹Recall: 83.3%
‹ ‹F1 Score: 87.0%
These metrics provide a comprehensive view of the classifier’s perfor-
mance, balancing the trade-offs between different types of prediction errors.
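The sketch below reproduces these numbers with scikit-learn by constructing label arrays that match the counts in the confusion matrix above; the way y_true and y_pred are built is an illustrative assumption.

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# 60 actual positives: 50 predicted positive, 10 predicted negative
# 40 actual negatives: 5 predicted positive, 35 predicted negative
y_true = [1] * 60 + [0] * 40
y_pred = [1] * 50 + [0] * 10 + [1] * 5 + [0] * 35

print(confusion_matrix(y_true, y_pred))               # [[35  5], [10 50]]
print('Accuracy :', accuracy_score(y_true, y_pred))   # 0.85
print('Precision:', precision_score(y_true, y_pred))  # ~0.909
print('Recall   :', recall_score(y_true, y_pred))     # ~0.833
print('F1 score :', f1_score(y_true, y_pred))         # ~0.870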

5.7.2 Model Evaluation


Model evaluation is the process of determining the performance and efficiency of a machine learning model or algorithm. It uses techniques that measure performance on unseen, new or testing data in order to know where the algorithm needs improvement. Some of the common methods used for model evaluation are the Holdout Method, the Random Subsampling Method, and the K-Fold Cross-Validation Method.
1. Holdout Method: The Holdout Method involves splitting the entire dataset into two subsets, a training set and a testing set, in ratios such as 70:30, 60:40, or 80:20. The model is trained on the training set and tested on the test set.

‹ ‹Procedure: The Hold-out method follows the steps as given below:
1. Firstly, split the dataset into a training and a test set.
2. Next, train the model on the training set.
3. Then, find the performance of the model on test dataset using
different evaluation metrics like accuracy, precision, recall,
ROC, and F1 score.
‹ ‹Advantages: The advantages of Hold-out method are:
1. The Holdout method is very simple and easy to use.
2. The Holdout method gives a quick estimation of the model’s
performance.
‹ ‹Disadvantages: The disadvantages of Hold-out method are:
1. Using the Holdout method, results can be different with a
random partitioning.
2. The Holdout method gives a reliable estimate only when the dataset is large and balanced; it may be misleading for small or imbalanced data.
2. Random Subsampling Method: The Random Subsampling Method
is similar to the Holdout Method but involves repeated random
partitioning of the dataset into training and test sets.
‹ ‹Procedure: The Random Sub-Sampling follows the steps as
given below:
1. Firstly, split the dataset into training and test sets randomly.
2. Next, train the model using the training set.
3. Then, evaluate the model’s performance on the test set.
4. Repeat steps 1-3 a number of times, such as 10 or 100, and
then calculate an average over the evaluation metrics.
‹ ‹Advantages: The advantages of Random Sub-Sampling Method
are:
1. It reduces variance in model performance as opposed to a single
holdout split.
2. It then gives a better estimate of model performance.
‹ ‹Disadvantages: The disadvantages of Random Sub-Sampling
Method are:

1. The random sub-sampling method is more computationally


expensive compared to the Holdout Method.
2. The random sub-sampling may suffer from variability due to
random partitioning.
3. K-Fold Cross-Validation Method: The K-fold cross-validation
resamples, splitting the dataset into k subsets called folds, each of
which it uses once as a test set while the remaining k-1 folds are
for training. That process repeats k times in such a way that every
fold acts as a test set exactly once.
• Procedure: The K-fold cross-validation method follows the steps
given below:
1. Firstly, split the dataset into K equal-sized folds.
2. Next, for each fold:
• Use the fold as a test set.
• Train the model on the remaining K-1 folds.
• Evaluate the model's performance on the test set.
3. Then, calculate the average of the evaluation metrics over K
folds.
• Advantages: The advantages of the K-Fold Cross-Validation Method
are:
1. The k-fold method gives a more realistic model performance
estimate compared to the random subsampling method.
2. The k-fold method reduces the variability of performance estimates.
• Disadvantages: The disadvantages of the K-Fold Cross-Validation
Method are:
1. The computational cost increases because k-fold can be quite
expensive for big datasets and complex models.
2. The k-fold must be implemented carefully to ensure that folds
are representative of the whole dataset.
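A minimal sketch of k-fold cross-validation with scikit-learn (k = 5, the Iris dataset, and the k-NN classifier are illustrative choices); cross_val_score trains and tests the model once per fold and returns the per-fold scores, whose average is the overall performance estimate:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Split the data into k = 5 folds; each fold serves exactly once as the test set
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=kfold)

    print("Per-fold accuracy:", scores)
    print("Mean accuracy over 5 folds:", scores.mean())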
Overall, each method of model evaluation has its strengths and weak-
nesses, and the choice of evaluation method depends on factors such as
dataset size, model complexity, computational resources, and the desired
level of accuracy and stability in performance estimates.

5.8 Summary
In data mining, classification is the process of assigning data to classes
based on given input features. There are several types of classification
algorithms, each with its own strengths and typical fields of application.
The Naive Bayes algorithm is a probabilistic classifier that applies Bayes'
theorem to estimate class probabilities from feature probabilities; because
it is fast and straightforward, it is well suited to text classification.
Decision trees make decisions easy to interpret through feature splits
arranged in a hierarchical, tree-like structure. The k-Nearest Neighbours
algorithm classifies a data point using the majority label among its nearest
neighbours in the feature space, which makes it intuitive and relatively
simple, though computationally intensive for large datasets. A major
problem in classification is overfitting: an overly complex model performs
very well on the training data but poorly on new, unseen data. Model
evaluation is an important part of model building; it checks how well a
classifier performs using metrics such as accuracy, precision, recall, and
F1 score, and techniques such as confusion matrices and ROC curves.
Proper evaluation supports model selection and tuning, helping to balance
bias and variance so that the model performs well on real-world data.
In this lesson, we learned the basic concept of classification, which assigns
data to categories or classes based on their characteristics. We began with
what is essential in any classification task: a dataset with labelled examples,
selection of the right features, and choice of an appropriate algorithm. We
then studied several algorithms used for classification, such as decision
trees, k-nearest neighbours, and naive Bayes; each has its own approach
and suits different types of data and problem domains.
We also discussed the importance of model evaluation in classification:
metrics such as accuracy, precision, recall, and F1 score indicate how well
a classifier performs and where it may be improved.

Through practical examples and discussions, we saw how classification
applies in real settings, for instance spam detection, sentiment analysis,
medical diagnosis, and document categorization. We also saw how
classification algorithms automate decision-making, extracting insight
from data and helping to solve complex problems across different domains.

IN-TEXT QUESTIONS
1. K-fold method becomes __________, especially for large datasets
and complex models.
(a) Computationally more expensive
(b) Computationally less expensive
(c) Exponentially more expensive
(d) Exponentially less expensive
2. To prevent overfitting, we can:
(a) Collect more testing data
(b) Collect more training data
(c) Remove Training data
(d) Remove Testing Data
3. Which of the following is not an evaluation metric for Classification
Algorithms?
(a) Recall
(b) Precision
(c) F-1 Score
(d) Mean
4. Naive Bayes is a family of __________ algorithms:
(a) Logarithmic
(b) Algebraic
(c) Probabilistic
(d) Exponential

5.9 Answers to In-Text Questions
1. (a) Computationally more expensive
2. (b) Collect more training data
3. (d) Mean
4. (c) Probabilistic

5.10 Self-Assessment Questions


1. Explain how classification as a technique or algorithm helps in mining
useful insights from huge volumes of data.
2. Identify the features and categories that can be used to classify the
following data:
(a) A data set containing the programming code of different software.
(b) A data set containing audio of different songs.
(c) An e-commerce website that is designed to sell different types
of clothing
(d) An e-commerce website that is designed to sell different types
of healthcare products.
(e) A data set containing resumes of candidates applying for
various job positions in a large MNC.
3. Consider a dataset of one million emails. Elaborate the steps to
classify the emails as spam and non-spam. Which features are most
useful in categorising an email as spam? A conference organiser
sends 5000 emails to students, scholars, and faculty of different
colleges and universities inviting paper submissions to the conference.
Will such emails be considered spam or not? Explain with reasons.
4. How is the Decision Tree algorithm applied to classify a corpus of
news articles as Sports, Politics, or Technology? Explain using the
following dataset:

Document ID   Text Content
1             "The game was exciting, and the team played well."
2             "The government passed a new law today."
3             "AI imagines city after 1000 years."
4             "The election results announced."
5             "School A and School B cricket match ended in a draw."
6             "Apple released a smartphone with 108MPX camera."
5. Implement the Naïve Bayes, Decision Tree, and K-NN classification
algorithms in Python on the following datasets. Calculate the
evaluation metrics and visualise the data using Python programming.
(a) Dataset of IRIS.
(b) Dataset of Abalone
(c) Text Categorisation Data
(d) Dataset of different programming code samples.
6. Use the Naive Bayes, K-Nearest Neighbour, and Decision Tree
classification algorithms to build classifiers on any two datasets.
Divide each dataset into a training set and a test set. Compare the
accuracy of the different classifiers under the following situations:
I. (a) Training set = 75%, Test set = 25%; (b) Training set = 66.6%
(2/3 of total), Test set = 33.3%.
II. The training set is chosen by (i) the hold-out method, (ii)
random subsampling, and (iii) cross-validation. Compare the
accuracy of the classifiers obtained.

5.11 References
• Tan, P. N., Steinbach, M., Karpatne, A., & Kumar, V. Introduction to
Data Mining, 2nd edition, Pearson, 2021.
• Han, J., Kamber, M., & Pei, J. Data Mining: Concepts and Techniques,
3rd edition, Morgan Kaufmann Publishers, 2011.

• Zaki, M. J., & Meira, J. Jr. Data Mining and Machine Learning:
Fundamental Concepts and Algorithms, 2nd edition, Cambridge
University Press, 2020.
• Alnuaimi, A. F., & Albaldawi, T. H. (2024). Concepts of statistical
learning and classification in machine learning: An overview. In
BIO Web of Conferences (Vol. 97, p. 00129). EDP Sciences.

5.12 Suggested Readings


• Barberá, P., Boydstun, A. E., Linn, S., McMahon, R., & Nagler, J.
(2021). Automated text classification of news articles: A practical
guide. Political Analysis, 29(1), 19–42.
• Mansoor, R. A. Z. A., Jayasinghe, N. D., & Muslam, M. M. A. (2021,
January). A comprehensive review of email spam classification using
machine learning algorithms. In 2021 International Conference on
Information Networking (ICOIN) (pp. 327–332). IEEE.
• Kesavaraj, G., & Sukumaran, S. (2013, July). A study on classification
techniques in data mining. In 2013 Fourth International Conference
on Computing, Communications and Networking Technologies
(ICCCNT) (pp. 1–7). IEEE.
• Huang, Y., & Li, L. (2011, September). Naive Bayes classification
algorithm based on a small sample set. In 2011 IEEE International
Conference on Cloud Computing and Intelligence Systems (pp. 34–39).
IEEE.

Glossary
Aggregation: The process of summarizing variable values into summary statistics to sim-
plify the analysis by reducing the volume of data.
Algorithm: A step-by-step procedure or formula for solving a problem, often used for
data analysis and prediction.
Anomaly Detection: An unsupervised (and sometimes semi-supervised) learning task in
which the algorithm identifies data points that deviate from the general pattern, flagging
potential outliers or abnormalities.
Antecedent (LHS): The left-hand side of an association rule. It is the itemset that appears
in the condition part of the rule. For example, in the rule {python} → {numpy}, {python}
is the antecedent.
Apriori Algorithm: The classical algorithm for mining frequent itemsets and generating
association rules. The algorithm works on the principle that any subset of a frequent
itemset has to be frequent.
Association Rule: A rule that implies a relationship between two sets of items in a dataset.
Association Rules Mining: A data mining technique that detects patterns and co-occurrence
relationships among items in large datasets. The strength of association rules is measured
using the support, confidence, and lift metrics.
AUC (Area under the Curve): A single scalar value to summarise the performance of
the ROC curve, representing the likelihood that the classifier will rank a randomly chosen
positive instance higher than a randomly chosen negative instance.
Binarization: The process of transforming data into binary form, typically by applying a
threshold, in order to simplify analysis or to represent categorical variables as binary features.
Classification: A type of supervised learning wherein an algorithm learns to assign input
data to pre-defined classes or categories.
Classifier: An algorithm that maps input data to a specific category.
Closed Itemset: An itemset is closed if none of its immediate supersets have the same
support count as the itemset itself.
Cluster: A cluster is a group of objects such that the data points in that group are much
more like each other than they are to objects in any other cluster.


Clustering: A machine learning technique that groups data points into clusters according
to their features or attributes. It is a type of unsupervised learning in which similar data
points are placed in the same group based on their features, without any predefined categories.
Confidence: The measure of how often items in the consequent appear in transactions that
contain the antecedent. It is calculated as: Confidence(X → Y) = Support(X ∪ Y) / Support(X).
Confidence Interval: A range of values derived from the confidence measure that is used
to assess the reliability of an association rule.
Confusion Matrix: A table used to evaluate the performance of a clas-
sification model, showing the true vs. predicted classifications.
Consequent (RHS): The right-hand side of an association rule. It is the itemset that appears
in the outcome part of the rule. For example, in the rule {milk} → {bread}, {bread} is
the consequent.
Data Mining Tasks: Specific activities performed in the course of data
mining and include classification, regression, clustering, association rule
mining, and anomaly detection among others.
Data Mining: The process of discovering patterns, relationships, and
insights from large data sets in view of extracting valuable knowledge
for decision-making.
Data Preprocessing: This is the process of preparing the raw data to
be analysed, which could involve aggregation, sampling, dimensionality
reduction, feature subset selection, feature creation, discretization, bina-
rization, and variable transformation.
Data Quality: Properties of data such as completeness, accuracy, consistency, and timeliness
that determine the reliability and usefulness of analyses and models.
Density-Based Clustering: A clustering technique that identifies clusters
as dense regions of data points separated by sparse regions, commonly
used in spatial data analysis.
Dimensionality Reduction: Dimensionality reduction is a technique used
to decrease the number of features or variables that describe a dataset
without necessarily losing useful information in data analysis. Examples
include PCA and LDA.


Discretization: The process of converting continuous values into discrete intervals or
categories, either to simplify the analysis or because certain algorithms require it.
F1 Score: A metric that combines precision and recall, calculated as the
harmonic mean of precision and recall.
Feature Subset Selection: Choosing a relevant subset of the original features in a dataset
to improve model performance and interpretability.
Feature: A quantity or attribute measured about the object being observed. In machine
learning, features are the inputs that describe data points and are used to make predictions.
FP-Growth Algorithm: An efficient algorithm for mining frequent itemsets without candidate
generation. It operates on a compressed representation of the dataset called an FP-tree
(Frequent Pattern Tree).
Frequent Itemset: An itemset with support metric greater than or equal
to the minimum support threshold.
Fuzzy Clustering: A clustering approach where data points can belong to
multiple clusters with varying degrees of membership, unlike traditional
“hard” clustering where each point belongs to exactly one cluster.
Hierarchical Clustering: A clustering method that organises clusters into a hierarchy,
either by merging smaller clusters into larger ones (agglomerative) or by dividing larger
clusters into smaller ones (divisive).
Itemset: A collection of one or more items. For example, {milk, bread,
butter} is an itemset.
K-Means Algorithm: This is one of the most popular algorithms in
unsupervised learning; its purpose is to partition a set of observation
points into K clusters.
Label: The output or class associated with a particular input in super-
vised learning.
Lift: The ratio of the observed support of the rule to the expected support if the antecedent
and consequent were independent. It is calculated as:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y)).
A lift value greater than 1 indicates a positive association between the antecedent and
the consequent.
Machine Learning: A subcategory of AI in which systems learn from data to make predictions
or decisions, improving with experience without being explicitly programmed.
Market Basket Analysis: A common application of association rule
mining where sets of products that frequently co-occur in transactions
are identified to understand customer purchasing behaviour.
Maximal Itemset: An itemset is maximal if none of its immediate su-
persets is frequent.
Metric Distance: A function that gives the distance between two data points in a
multi-dimensional space; it is one of the key functions used in clustering algorithms.
Centroid: A point characterizing the center of a cluster; it is usually the mean of all the
data points in that cluster.
Minimum Confidence Threshold: A user-specified threshold on the minimum confidence
a rule must have to be considered interesting.
Optimal Number of Clusters: The number of clusters that best rep-
resents the true structure underlying any data, as determined by a cluster
validation technique or domain knowledge.
Overfitting: A modelling error that occurs when a model learns the de-
tails and noise in the training data to the extent that it performs poorly
on new data.
Overlap: Occurs when a data point simultaneously belongs to more than one cluster; this
is common in fuzzy clustering methods.
Partitioning Clustering: A class of clustering algorithms that partition
data into non-overlapping clusters, with each data point assigned to ex-
actly one cluster.
Precision: A metric that measures the accuracy of positive predictions,
defined as the number of true positives divided by the number of true
positives plus false positives.
Recall: A metric that measures the ability of a model to identify all rel-
evant instances, defined as the number of true positives divided by the
number of true positives plus false negatives.

Redundant Rule: An association rule is considered redundant if it conveys no additional
information beyond what is already present in a more general rule.
Regression: It’s a kind of supervised learning where the algorithm learns
on how to predict continuous output values from input features.
Resampling: Methods that select subsets of data items from a larger dataset for analysis,
either because computation on the full dataset is infeasible or for reasons of representativeness.
ROC Curve (Receiver Operating Characteristic Curve): A graphical
plot that illustrates the diagnostic ability of a binary classifier system as
its discrimination threshold is varied.
Rule Pruning: The process of removing redundant or uninteresting rules
from the set of discovered association rules to focus on the most relevant
ones.
Similarity: It is a measure of the resemblance or proximity between two
data points. This can typically be quantified using distance metrics or
similarity measures.
Supervised Learning: A class of machine learning where the algorithm
is trained on a labeled dataset comprising matched input data points and
their corresponding output labels. Typical applications include classifi-
cation and regression.
Support: The proportion of transactions in the dataset that contain a particular itemset. It
is calculated as: Support(X) = (number of transactions containing X) / (total number of
transactions).
Testing Data: Data that the model has not seen during training, used to test the performance
of an already trained machine learning model. It is never part of the training dataset, and
it helps estimate the model's ability to generalize.
Testing Set: A dataset used to evaluate the performance of a trained
machine learning model.
Threshold Minimum Support: A user-specified threshold on the minimum support an
itemset must have to be counted as frequent.


Training Data: A labelled dataset on which a machine learning model is trained. It consists
of sets of inputs with corresponding outputs used in fine-tuning the model parameters
during training or learning.
Training Set: A dataset used to train a machine learning model consisting
of input-output pairs.
Transaction Data Set: A collection of transactions where each transaction
contains a set of items. This is the elementary form of data on which to
search for item relations.
Underfitting: A modelling error that occurs when a model is too simple
to capture the underlying structure of the data.
Unsupervised Learning: This is a type of machine learning where an
algorithm attempts to find patterns or structure in data without explic-
it guidance from an already labeled set of data. Common applications
include but are not limited to clustering, dimensionality reduction, and
anomaly detection.
Validation Set: A dataset used to tune model parameters and prevent
overfitting during training.
Variable Transformation: Change of scale, distribution, and/or nature
of variables within a dataset to prepare these variables to best fit the
analysis or modeling to be performed.
