Data Mining - I
Editorial Board
Deekshant Awasthi
Published by:
Department of Distance and Continuing Education
Campus of Open Learning, School of Open Learning,
University of Delhi, Delhi-110007
Printed by:
School of Open Learning, University of Delhi
DATA MINING - I
External Reviewer
Dr. Bharti
Assistant Professor,
Department of Computer Science, University of Delhi
Printed at: Taxmann Publications Pvt. Ltd., 21/35, West Punjabi Bagh,
New Delhi - 110026 (600 Copies, 2024)
Syllabus Mapping

Unit I: Introduction to Data Mining — Motivation and challenges for data mining, types of data mining tasks, applications of data mining, data measurements, data quality, supervised vs. unsupervised techniques.
Mapped to Lesson 1: Introduction to Data Mining (Pages 1–20)

Unit II: Data Pre-Processing — Data aggregation, sampling, dimensionality reduction, feature subset selection, feature creation, variable transformation.
Mapped to Lesson 2: Data Pre-processing: Transforming Raw Data into Processed Data (Pages 21–37)

Unit III: Cluster Analysis — Basic concepts of clustering, measure of similarity, types of clusters and clustering methods, K-means algorithm, measures for cluster validation, determining the optimal number of clusters.
Mapped to Lesson 3: The Art of Grouping: Exploring Cluster Analysis (Pages 38–62)

Unit IV: Association Rule Mining — Transaction data set, frequent itemset, support measure, rule generation, confidence of association rule, Apriori algorithm, Apriori principle.
Mapped to Lesson 4: Data Connections: The Essentials of Association Rule Mining (Pages 63–83)

Unit V: Classification — Naive Bayes classifier, Nearest Neighbour classifier, decision tree, overfitting, confusion matrix, evaluation metrics and model evaluation.
Mapped to Lesson 5: Building Blocks of Classification Systems (Pages 84–111)
Contents

Lesson 1: Introduction to Data Mining 1–20
Lesson 2: Data Pre-processing: Transforming Raw Data into Processed Data 21–37
Lesson 3: The Art of Grouping: Exploring Cluster Analysis 38–62
Lesson 4: Data Connections: The Essentials of Association Rule Mining 63–83
Lesson 5: Building Blocks of Classification Systems 84–111
Glossary 113–118
Lesson 1: Introduction to Data Mining
Aishwarya Anand Arora
Assistant Professor
School of Open Learning
University of Delhi
Email-Id: [email protected]
STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 What is Data Mining?
1.4 Applications of Data Mining
1.5 Data Mining Task
1.6 Motivation and Challenges
1.7 Types of Data Attributes and Measurements
1.8 Data Quality
1.9 Supervised vs. Unsupervised
1.10 Summary
1.11 Answers to In-Text Questions
1.12 Self-Assessment Questions
1.13 References
1.14 Suggested Readings
1.1 Learning Objectives
- To understand the concepts of data mining.
- To know applications of data mining.
- To apply techniques of data mining in the real world.
- To extract valuable information from unstructured data using data mining techniques.
1.2 Introduction
Data mining is like being a detective: instead of solving crimes, you uncover secret patterns hidden in a sea of data. Suppose that you have a big treasure chest filled with information in the form of text, images, and other forms of multimedia. It looks like jumbled facts at the surface, but with the right kind of tools you can see through this data and find valuable patterns that can be useful in your business. Think of it as mining gold. Data mining is the process of turning huge amounts of data into useful information, which is then transformed into knowledge. It is the process of extracting patterns, trends, and insights from large databases through various computational methods and algorithms. It involves extracting valuable information and knowledge from raw data, which in turn helps an organization make data-driven decisions and develop a fuller understanding of its operations, customers, and markets. Data mining techniques draw on strategies including statistical analysis, machine learning, pattern identification, and visualization. By revealing hidden patterns and correlations within data, it makes predictive modelling, anomaly detection, clustering, and association rule mining more accessible. Data mining is crucial in many fields today, including business intelligence, marketing, health care, finance, and cybersecurity, since data in the digital world is growing exponentially. Data mining gives an organization a competitive advantage and consolidates valuable insights.
Following is a figure that illustrates how the discovery of knowledge is
done through the process of Data Mining.
1.3 What is Data Mining?
Now, visualize a gigantic library with thousands of books. Could you read every single one of them to find out which is the best? Surely not; you could spend years doing it. Here
comes Data Mining into the picture. Finding patterns, connections, and
insights in big databases to extract useful information is known as Data
Mining. It analyses data and discovers hidden patterns in several ways by
utilizing the techniques taken from database systems, machine learning,
and statistics for predictive modelling and decision making. Data mining
provides organizations with the ability to find patterns, anomalies, and
relationships through association rule mining, regression analysis, clustering,
and classification. The mined knowledge can be used in a wide variety
of marketing, finance, healthcare, and business applications to enhance
operations, gain competitive advantage, and make informed decisions.
Major steps of data mining include data collection, pre-processing, explor-
atory data analysis, feature engineering and selection, model development,
evaluation, deployment, and monitoring.
Data mining’s primary characteristics are:
Automatic Pattern Recognition
The categorization of data based on prior knowledge or statistical
data derived from patterns and/or their representation is known
as pattern recognition. e.g., license plate recognition, fingerprint
analysis, face detection/verification, and voice-based authentication.
Forecasting Probable Results
The process of forecasting involves using past data to generate
predictions about what will happen in the future or under what
circumstances. e.g., a company might forecast an increase in
demand for its products during the Diwali season.
Production of useful knowledge.
Focus on large databases and datasets.
variables. For example, consider items you usually buy at a grocery store together.
5. Anomaly Detection: The technique to find the pattern, which is
much farther away from usual patterns, would involve finding fraud
cases in financial transactions.
6. Sequential Pattern Mining: It’s a technique that finds patterns
occurring within sequential data, including time series and event
sequences. Example: clickstream analysis to understand customer
behaviours on a website.
7. Text Mining: Using unstructured text data, text-mining aims to extract
useful information. Sentiment analysis of customer feedback is one
example.
8. Image Mining: The goal of image mining is to identify patterns and information within image data, for example, recognizing objects in satellite photos in order to classify land use.
9. Spatial Data Mining: This technique focuses on identifying patterns
in spatial data.
The relative geographic information about the earth and its features
is included in spatial data. A particular place on Earth is defined by
a pair of latitude and longitude coordinates. For example, consider
analysing geographic data to find trends in the spread of illness.
10. Studying Time-Series Data: Time-series analysis looks for patterns
or trends in data points gathered over an extended period of time.
For example, consider forecasting stock values using past data.
Challenges
The various challenges with data mining are:
1. Data Quality: Poor quality of data, such as incomplete and noisy
information or inconsistency in data, threatens the accuracy and
reliability of the output produced in data mining.
2. Scalability: Massive data sets cannot be handled effectively with
regard to issues of memory, processing speed, and computing
efficiency.
3. Complexity and Interpretability: Some data mining models, especially in machine learning, are complex, and the patterns they find can be hard for users to understand and interpret.
4. Privacy Issues: The use of personal information raises a host of privacy concerns, so ethical issues must be considered when doing data mining.
5. Choice of Algorithm: Choosing the right algorithm for any particular
task is tough since different algorithms work better depending on
the type of data they are employed upon.
6. Dynamic Nature of Data: Because data is dynamic, the patterns in it may change over time; adjusting to these changes makes it challenging to keep models relevant.
7. Domain Knowledge: Interpreting the results of data mining often requires knowledge of the domain, and proper interpretation may not be possible without domain expertise.
8. Data Integration: Pre-processing is an important task because integrating data from various sources, in different formats and structures, can be difficult.
1.7 Types of Data Attributes and Measurements
An attribute is a data field representing a characteristic of a data object. Data attributes are characteristics or properties of data used for modelling and analysis; in data mining they are also known as features or variables. These qualities provide information about the objects or subjects being studied and form the basis for developing predictive models, identifying trends, and drawing inferences. Following is the list of data attributes used in data mining:
Unsupervised Learning
In unsupervised machine learning, the models are trained on raw, unlabeled
training data. It is most often used for segmenting similar data into a set
number of groups or to find patterns and trends within raw information.
It is also one of the common strategies used in the early stages to get a
better view of the datasets.
As the name suggests, unsupervised machine learning takes a different approach from supervised machine learning. Humans still tune model hyperparameters, such as the number of clusters, but the model processes enormous amounts of data efficiently and autonomously. This makes unsupervised machine learning particularly apt at revealing hidden patterns and connections within the data itself. However, with less human control, much more care must be given to interpreting the results of unsupervised machine learning. Most data available currently is raw and unlabelled. Unsupervised learning is therefore a potent way to make sense of such data: it can group data with similar characteristics into clusters or analyse datasets for hidden patterns. On the other hand, producing the labelled data required for supervised machine learning can demand a great deal of resources.
1.10 Summary
Data mining is the process of finding trends, relationships, and insights in large databases to glean information that is useful to a business for making decisions. Its application extends across multiple domains: marketing, banking, health care, and telecommunications are just a few examples. Some examples of the tasks of data mining are classification, regression, clustering, association
rule mining, and anomaly detection. In this lesson, you have learnt that finding trends, enhancing decision-making, and gaining a competitive advantage are among the driving forces behind data mining. However, there are obstacles that make data mining less successful, including poor data quality, scalability, and interpretability. Data quality covers the completeness, accuracy, consistency, and timeliness of information. Data attributes may be categorical, numerical, ordinal, or interval.
In more detail, supervised learning is a process in which a machine learning model is trained on data for which each input is associated with a target output label. The model learns to generalize from these examples by minimizing the discrepancy between its own predictions and the true labels, and can then make the best possible predictions on new, unseen data. Regression and classification are among the most common tasks in supervised learning. In contrast, unsupervised learning involves training on unlabelled data; the algorithm must therefore find structures or patterns in the data on its own. Most often, the goal is to detect hidden patterns or clusters in the data, as in dimensionality reduction or the clustering of similar data points. Applications of unsupervised learning range from recommendation systems and data compression to anomaly detection.
1.12 Self-Assessment Questions
1. What are the main objectives of data mining, and how do they
differ from traditional statistical analysis?
2. Explain the differences between supervised and unsupervised learning
in the context of data mining. Provide examples of each.
3. Describe the steps involved in the data mining process, highlighting
the importance of each step.
4. Discuss the challenges associated with handling missing data in a
dataset and explain some common techniques used to address this
issue.
5. Describe the key differences between supervised and unsupervised
learning. Provide examples of each.
6. Explain the concept of overfitting in the context of supervised
learning. How can overfitting be detected and prevented?
7. What are the main challenges in unsupervised learning, and how
are they addressed?
8. Describe two popular algorithms used in supervised learning and
two popular algorithms used in unsupervised learning. Explain how
each algorithm works and provide examples of their applications.
1.13 References
Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. 3rd edition. Morgan Kaufmann.
Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining. 1st edition. Pearson Education.
Gladys Kiruba. "Types of machine learning for Data science". 16 Jan 2023, accessed on 17 April 2024.
1.14 Suggested Readings

Hand, D., Mannila, H., & Smyth, P. (2006). Principles of Data Mining. Prentice-Hall of India.
Pujari, A. (2008). Data Mining Techniques. 2nd edition. Universities Press.
Ding, H., Wu, J., Zhao, W., Matinlinna, J. P., Burrow, M. F., & Tsoi, J. K. (2023). Artificial intelligence in dentistry—A review. Frontiers in Dental Medicine, 4, 1085251.
Lesson 2: Data Pre-processing: Transforming Raw Data into Processed Data
Aishwarya Anand Arora
Assistant Professor
School of Open Learning
University of Delhi
Email-Id: [email protected]
STRUCTURE
2.1 Learning Objectives
2.2 Introduction
2.3 Data Pre-Processing - Aggregation
2.4 Sampling
2.5 Dimensionality Reduction
2.6 Feature Subset Selection
2.7 Discretization and Binarization
2.8 Variable Transformation
2.9 Summary
2.10 Answers to In-Text Questions
2.11 Self-Assessment Questions
2.12 References
2.13 Suggested Readings
2.2 Introduction
Data preparation involves the conversion, cleaning, and preparation of raw data for further analysis or modelling. It applies different techniques to ensure data relevance, quality, and structure in a manner that makes the data fit for the planned analytical activities. Several problems can affect the accuracy and efficiency of subsequent analysis, including missing values, noisy data, outliers, and inconsistencies. Accordingly, careful preprocessing of data can improve the performance of machine learning models, reduce bias, and improve the reliability of findings produced by other data-driven techniques.
2.4 Sampling
In data mining, sampling is the process of choosing a portion of a big-
ger dataset for examination. It’s an essential method for increasing the
effectiveness of data analysis, particularly when handling big data sets.
Below is an explanation of the significance of sampling as well as a list
of typical sampling techniques in data mining:
Efficiency: Handling big datasets requires a lot of computing power
and time. Data miners may work with manageable portions of data
thanks to sampling, which speeds up and improves the usefulness
of analysis.
Cost-cutting: Gathering, storing, and analyzing huge datasets can be costly. Organizations can cut expenses on data processing and storage by using sampling.
Representativeness: A carefully thought-out sample ought to
faithfully capture the traits of the broader population it is taken
from. This guarantees that the patterns and insights gleaned from
the sample can be applied to the total population.
3. Systematic Sampling:
Involves selecting every nth item from a list or sequence after a random start.
Useful when the population is ordered in some meaningful way (e.g., alphabetical order, time sequence).
Below Figure 2.6 pictorially depicts how samples are selected.
5. Multi-stage Sampling:
Combines multiple sampling methods.
Involves selecting samples in stages, often starting with large-scale
clusters and then progressively sampling smaller units within those
clusters.
Below Figure 2.8 pictorially depicts how samples are selected at
multiple stages.
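As an illustration of the simpler of these methods, here is a minimal sketch of simple random and systematic sampling, assuming pandas and NumPy and a hypothetical 1,000-record dataset (not data from this lesson):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset of 1,000 records (illustrative only).
data = pd.DataFrame({"value": np.arange(1000)})

# Simple random sampling: every record has an equal chance of selection.
random_sample = data.sample(n=50, random_state=42)

# Systematic sampling: a random start, then every n-th record.
step = len(data) // 50                                  # sampling interval n
start = int(np.random.default_rng(42).integers(step))   # random start in [0, step)
systematic_sample = data.iloc[start::step]

print(len(random_sample), len(systematic_sample))       # 50 50
```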
Equal-Frequency Discretization: This divides the data into intervals containing approximately equal numbers of data points. This process provides the best assurance that the number of occurrences in each category is roughly equal.
Custom Discretization: This involves the creation of specific
thresholds or intervals based on requirements or domain knowledge.
Among the common preprocessing steps, discretization also features when
dealing with algorithms that require discrete inputs, such as association
rule mining or decision trees.
Binarization: Sometimes referred to as binning or binarizing, binarization is the process of transforming data into binary format. This is achieved by thresholding continuous variables into binary features: a threshold is chosen, and every value above it is mapped to one category, usually represented by the digit 1, while every value below it is mapped to another category, usually represented by the digit 0.
The most common reasons for binarization include but are not limited to:
Feature Engineering: Binarization can help reduce the complexity of the data and emphasize a pattern or relationship. In sentiment analysis, words may be binarized depending on whether or not they appear in a document.
Imbalanced Data: Another application of binarization is fixing class imbalance by transforming a multi-class problem into a binary classification problem.
Sparse Data: In text mining and image processing in particular, binarization can transform representations of sparse data into a more compact, efficient format.
Continuous and Categorical Data: For both continuous and categorical data, binarization is a simple yet useful technique that may be called upon depending on the needs of the analysis or modelling at hand.
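As an illustration, a minimal sketch of binarization and equal-frequency discretization, assuming scikit-learn and a hypothetical age feature (not data from this lesson), might look like this:

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

# Hypothetical continuous feature, e.g. customer ages (illustrative only).
ages = np.array([[18], [22], [35], [41], [47], [52], [66], [70]])

# Binarization: values above the threshold map to 1, the rest to 0.
binarizer = Binarizer(threshold=40)
print(binarizer.fit_transform(ages).ravel())  # 0 for ages <= 40, 1 above

# Equal-frequency discretization: each of the 4 bins holds roughly
# the same number of data points (quantile strategy).
discretizer = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
print(discretizer.fit_transform(ages).ravel())  # bin index (0-3) per value
```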
Preprocessing methods such as discretization and binarization take the
raw data into forms more amenable to modelling or analytics; the former
converts continuous variables into discrete categories, while the latter
represents the application of a threshold on data to convert it into binary
form. Both discretization and binarization play major roles in feature engineering and data pretreatment within data mining and machine learning workflows.
2.9 Summary
In this lesson, we discussed how data mining finds trends, relationships,
and insights in large databases to retrieve information for better deci-
sion-making. Applications are numerous, from marketing, banking, and
healthcare to telecommunications. Tasks dealing with data mining include
classification, regression, clustering, association rule mining, and anom-
aly detection, among others. Its driving forces include finding trends, enhancing decision-making processes, and gaining a competitive advantage. However, some obstacles make it less successful, including those related to data quality, scalability, and interpretability. Data quality covers completeness, accuracy, consistency, and timeliness. Data attributes may be categorical, numerical, ordinal, or interval.
5. Describe the role of data quality in the success of data mining projects. What are some common sources of data quality issues, and how can they be addressed during data preprocessing?
2.12 References

Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. 3rd edition. Morgan Kaufmann.
Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining. 1st edition. Pearson Education.
Lesson 3: The Art of Grouping: Exploring Cluster Analysis
Aishwarya Anand Arora
Assistant Professor
School of Open Learning
University of Delhi
Email-Id: [email protected]
STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Basic Concepts of Clustering
3.4 Measure of Similarity
3.5 Types of Clusters and Clustering Methods
3.6 K-Means Algorithm
3.7 Measures for Cluster Validation
3.8 Determine Optimal Number of Clusters
3.9 Summary
3.10 Answers to In-Text Questions
3.11 Self-Assessment Questions
3.12 References
3.13 Suggested Readings
3.2 Introduction
A key method in data mining and machine learning is cluster analysis, which groups a collection of items according to how similar they are. It is extensively employed in many different domains, including anomaly detection, image analysis, pattern recognition, and consumer segmentation. Fundamentally, the goal of cluster analysis is to find hidden structures and patterns in data so that complicated datasets can be better understood and interpreted.
Measures of similarity, which express how similar two things are, and
types of clusters, which can differ in size, density, and form, are im-
portant ideas in cluster analysis. Figure 3.1 shows how clustering groups
similar data together. There are many kinds of clustering algorithms and
methodologies, and each has advantages and disadvantages of its own.
K-means, which divides the data into a predefined number of groups by
iteratively reducing the within-cluster variation, is one of the most often
used clustering methods.
Partitioning Clustering
This kind of clustering separates the data into groups that are not hierar-
chical. Another name for it is the centroid-based approach. The K-Means
Clustering technique is the most widely used illustration of partitioning
clustering.
With this kind, K denotes the number of pre-defined groups, and the dataset is split up into a collection of K groups. Each cluster centre is positioned so that the distance between a cluster's data points and its own centroid is minimal compared with their distance to other cluster centroids, as shown in Figure 3.8 below.
Density-Based Clustering
The density-based clustering method forms arbitrarily shaped clusters by connecting highly dense regions, as long as the dense regions can be connected. The technique detects clusters within the dataset by joining areas of high density, with sparser regions of the data space separating the dense sections from one another. These techniques may have trouble clustering the data points when the dataset has high dimensionality or varying densities. This is shown in Figure 3.9 below.
Fuzzy Clustering
Fuzzy clustering is a soft approach in which a data object may belong to multiple groups or clusters. Every data point has a set of membership coefficients that reflect its degree of membership in each cluster. An example of this kind of clustering is the fuzzy C-means algorithm, which is sometimes referred to as the fuzzy k-means algorithm.
K-Means Overview
Before we examine the dataset, let’s quickly go over how k-means operates:
K centroids are initialized at random at the start of the operation.
Points are assigned to the closest cluster based on these centroids.
The centroids' positions are then updated using the mean of all the points within the cluster.
These steps are repeated until the centroids' values stabilize.
K-means clustering is an unsupervised learning algorithm. Unlike supervised learning, this clustering does not use labelled data. K-Means divides the objects into groups based on the similarities and differences between the objects in each cluster.
K stands for the number of clusters, which must be specified to the system. K = 2, for instance, designates two clusters. The optimal or best value of K for a particular set of data can be determined using methods discussed later in this lesson.
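As a quick illustration before the worked cricket example below, here is a minimal sketch of this procedure, assuming scikit-learn and an illustrative (wickets, runs) array that is not the lesson's actual data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical player statistics (illustrative, not the lesson's data):
# column 0 = wickets taken, column 1 = runs scored.
players = np.array([
    [27, 60], [30, 45], [25, 55],   # bowler-like profiles
    [2, 410], [4, 380], [1, 450],   # batsman-like profiles
])

# K = 2: ask for two clusters (e.g., batsmen vs. bowlers).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(players)

print(kmeans.labels_)            # cluster assignment of each player
print(kmeans.cluster_centers_)   # final centroid positions
```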
To get a better understanding of k-means, let’s look at a cricket example.
Consider that you have access to data on a large number of international
cricket players, including details on their runs scored and wickets claimed
over the course of the previous ten matches. We must divide the data into
two clusters—batsmen and bowlers—based on this information.
Let’s examine the procedures involved in forming these clusters.
Solution:
Our data set is shown here using the “x” and “y” coordinates. The y-axis
displays the number of runs scored, while the x-axis displays the number
of wickets the players have taken.
This is how the information would seem if it were plotted:
Using the same set of data, let’s apply K-Means clustering to solve the
problem (with K = 2).
The random assignment of two centroids (as K = 2) is the initial stage
in the k-means clustering process. Centroids are allocated to two points.
Keep in mind that because the points are random, they could be any-
where. Even though they are originally not the centre of a specific data
set, they go by the name centroids.
The next step is to calculate the distance between each data point and the randomly allocated centroids. Every point's distance is measured from both centroids, and the point is assigned to the centroid whose distance is shorter. The data points, shown here in blue and yellow, are attached to their centroids.
The next stage is to find these two clusters' true centroids: the initial centroids that were chosen at random must be moved to the clusters' actual centroids. We keep repeating this process of distance calculation and centroid relocation until we reach our final clusters; subsequently, the centroid realignment ceases.
3.8 Determine Optimal Number of Clusters
For clustering techniques such as K-Means, the number of clusters appropriate for our dataset must be determined. This guarantees an accurate and effective division of the data. An appropriate value of "k", the number of clusters, helps maintain a suitable balance between the compressibility and accuracy of the clusters and guarantees proper granularity.
Let's look at two scenarios:
Case 1: Treat the entire dataset as a single cluster.
Case 2: Treat every data point as its own cluster.
Case 2 yields the most accurate clustering, because there is no gap between any data point and its cluster centre. However, it is not useful for forecasting fresh inputs, and it does not allow any kind of data summarization.
Thus, figuring out the "right" number of clusters for any given dataset is crucial. Although this is a difficult undertaking, it is quite manageable if we rely on the shape and scale of the distribution.
Direct Method
Elbow Curve: The elbow curve plots the within-cluster variance against the number of clusters. The within-cluster variance is measured by inertia, the Within-cluster Sum-of-Squared Distances (WSSD) between each data point and its cluster centre; clusters with lower inertia are denser. Refer to Figure 3.17.
As the number of clusters increases, the inertia usually decreases. Nonetheless, the initial slope of the decline is steep and becomes less steep once we surpass the "optimal" number of clusters. Therefore, we use the location of the bend between the high and low slopes to determine the "optimal" number of clusters.
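A minimal sketch of the elbow method, assuming scikit-learn and synthetic data with four natural groups (not data from this lesson), might look like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 natural groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit K-Means for k = 1..9 and record the inertia (WSSD) of each fit.
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is the k after which inertia stops dropping steeply;
# for this data the bend should appear near k = 4.
for k, w in zip(range(1, 10), inertias):
    print(k, round(w, 1))
```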
Statistical Methods
Gap Statistic Method: The within-cluster variation is used by the Gap
Statistic approach to gauge how well the clustering is done. To do this,
the entire within-cluster variance is compared to how a random data set
would cluster, or, in other words, to what would be predicted from a null
reference distribution.
For a given number of clusters (k), we build a certain number B of null reference (uniform) distributions to derive the Gap Statistic G. We calculate the inertia W for each reference sample and average the results, and then determine the Gap Statistic from the inertia of our real data set. Finding the least number of clusters (k) that fulfils a stopping criterion then yields the "optimal" number of clusters, as formalized below.
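In the standard formulation of the Gap Statistic (due to Tibshirani, Walther, and Hastie), which the description above follows, let $W_k$ be the inertia of the real data partitioned into $k$ clusters and $W_{kb}$ the inertia of the $b$-th of the $B$ null reference samples. Then

$$\operatorname{Gap}(k) \;=\; \frac{1}{B}\sum_{b=1}^{B}\log W_{kb} \;-\; \log W_k,$$

and the "optimal" number of clusters is the smallest $k$ satisfying

$$\operatorname{Gap}(k) \;\ge\; \operatorname{Gap}(k+1) - s_{k+1},$$

where $s_{k+1}$ accounts for the simulation error of the $B$ reference samples.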
3.9 Summary
In this lesson, you have learned that clustering is the quintessential unsupervised learning approach for uncovering hidden structures and patterns in data. Basically, clustering means putting data into meaningful groups so that similarity within a cluster (intra-cluster similarity) is maximized while similarity between clusters (inter-cluster similarity) is minimized. At the very core of this concept lie measures of similarity, which quantify the closeness or proximity between data points. These metrics differ depending on the type of data: they could be a distance metric for numerical data, such as Euclidean distance, or a similarity metric for text or categorical data, such as cosine similarity. The clusters themselves can be disjoint, overlapping, hierarchical, fuzzy, or of any kind that best suits the properties of the data at hand and the goals of the analysis. Clustering can be done in several ways, one of the most popular being the K-means algorithm.
K-means optimizes the cluster centroids to minimize the within-cluster sum of squares by iteratively assigning data points to clusters based on their closeness to a centroid. The major phases of every clustering analysis are assessing the quality of the clusters and determining the best number of clusters. Cluster validation measures such as the silhouette coefficient and the elbow method are applied to determine the ideal number of clusters for a dataset and to evaluate various clustering techniques. These basic ideas make clustering a powerful methodology for discovering hidden structure within data, fostering insight and informed decision-making in many disciplines.
3.10 Answers to In-Text Questions
1. (d) Grouping similar data points together
2. (b) Retail
3. (c) Grouping similar products for recommendation
4. (b) Minimize intra-cluster variance
5. (a) Randomly
6. (b) By computing the mean of data points in each cluster
3.12 References
Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. 3rd edition. Morgan Kaufmann.
Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining. 1st edition. Pearson Education.
3.13 Suggested Readings
Gupta, G. K. (2006). Introduction to Data Mining with Case Studies. Prentice-Hall of India.
Hand, D., Mannila, H., & Smyth, P. (2006). Principles of Data Mining. Prentice-Hall of India.
Pujari, A. (2008). Data Mining Techniques. 2nd edition. Universities Press.
Lesson 4: Data Connections: The Essentials of Association Rule Mining
Dr. Charu Gupta
Assistant Professor
School of Open Learning
University of Delhi
Email-Id: [email protected]
STRUCTURE
4.1 Learning Objectives
4.2 Introduction: Association Rule Mining
4.3 Transaction Data Set and Frequent Itemset, Support Measure
4.4 Rule Generation
4.5 Confidence of Association Rule
4.6 Apriori Principle
4.7 Apriori Algorithm
4.8 Summary
4.9 Answers to In-Text Questions
4.10 Self-Assessment Questions
4.11 References
4.12 Suggested Readings
4.2 Introduction: Association Rule Mining
Consider a Software Analyst who wishes to analyse the download history
of various programming languages, their related libraries and plugins.
Using association rule mining, the data analyst applies an algorithm to
the transaction dataset to identify common patterns among programming
languages. One of the resulting rules might be {python} → {numpy},
indicating that programmers who download the python compiler also
download its numpy library. Suppose the dataset consists of 10,000 transactions, and the rule {python} → {numpy} has a support of 40% (indicating that python and numpy appear together in 40% of all transactions) and a confidence of 90% (meaning that numpy is downloaded in 90% of the transactions in which python is downloaded). This information can be used by the software repository to place python and numpy together on the web page and to create combined promotions, ultimately enhancing and improving the user experience.
Association rule mining is a data mining technique used to identify in-
teresting relationships, patterns, and associations within large datasets. It
is particularly useful in the context of market basket analysis, where the
goal is to identify sets of products that frequently co-occur in transactions.
Consider the example of a grocery retail store. Association rule mining may show that customers who buy bread have a high probability of also buying butter and milk. This relation between the products bread, butter, and milk is represented through an association rule such as {bread} → {butter, milk}: if bread is present in a transaction instance (i.e., in a row of the dataset), then butter and milk are also likely to be present with high probability.
Association rule mining is a data mining technique that aims at discovering hidden relationships and patterns among different items in a large dataset. In other words, association rule mining is an unsupervised learning technique whose rules consist of two parts: an antecedent (left-hand side) and a consequent (right-hand side). In a transaction data set, if the item(s) on the left-hand side (antecedent) are present, then the item(s) on the right-hand side (consequent) will also be present with a certain probability. Basically, association rules bring out how the presence of one itemset can influence the occurrence of another
2. Improved Decision Making: The rules generated by association rule mining help us make decisions based on collected data. For example, retailers can optimize product placements, create effective cross-selling strategies, or market their products based on products that co-occur in customer transactions.
3. Better Customer Experience: Association rule mining helps a business know which products are usually sold together. In this way, it can enhance the customer experience by suggesting related products, grouping items, and creating far more personalized recommendations.
4. Optimized Inventory Management: With knowledge of the itemsets that are frequently bought together, a business can manage its inventory more effectively. This helps maintain adequate stock while avoiding stock-out and overstock situations for frequently associated products.
5. Increased Sales and Revenue: One can achieve higher sales by offering promotions and discounts guided by association rules. For example, if a rule indicates that customers who purchase coffee also purchase sugar, one could offer discounts on sugar to customers who buy coffee in order to increase sales of both items.
6. Detection of Fraud and Risk Management: In the financial and insurance sectors, association rule mining is able to detect deviating and unusual patterns that suggest fraud. Analysis of transaction data may also disclose suspicious behaviour, helping a company take preventive measures to mitigate risks.
7. Market Basket Analysis: Market basket analysis is the most popular example and application of association rule mining. In market basket analysis, retailers use association rule mining to detect patterns among the products purchased by customers. This analysis helps retail stores organise products together or near each other when customers are likely to purchase them together. The resulting analysis has helped retailers optimize store layouts, offer promotions, improve sales, reduce costs, and enhance profits.
zero shows that the software or library has not been downloaded from the software repository.
Transaction data sets are used in many scenarios and real-life applications. A few of them are given below:
In this section, we learnt that transaction data sets are the foundation for various data mining tasks, especially association rule mining and market basket analysis. These data sets are analysed to identify valuable insights, hidden patterns, and anomalous behaviour in customer purchasing patterns, and to optimise product placement, improve inventory management, detect fraudulent transactions, minimize risk, take preventive measures, and enhance overall operational efficiency.
Table 4.1
List of items: {Milk, Bread, Butter}; number of items k = 3.
Number of frequent (non-empty) itemsets: 2^k − 1 = 2^3 − 1 = 8 − 1 = 7 (the null set is not counted, so one has been subtracted). The seven itemsets are:
1. {Milk}
2. {Bread}
3. {Butter}
4. {Milk, Bread}
5. {Milk, Butter}
6. {Bread, Butter}
7. {Milk, Bread, Butter}
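To see where the count 2^k − 1 comes from, a minimal Python sketch enumerating the non-empty subsets of the three-item set might look like this:

```python
from itertools import combinations

items = ["Milk", "Bread", "Butter"]

# All non-empty subsets of a k-item set: there are 2**k - 1 of them,
# since the empty set is excluded (matching Table 4.1).
itemsets = [set(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

print(len(itemsets))  # 7 (= 2**3 - 1)
for s in itemsets:
    print(s)
```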
Key Concepts
Itemset: An itemset is a collection of one or more items. For example, in an online software repository scenario, an itemset might be X = {python, numpy, pandas}.
Support: Support is a metric that measures how frequently an itemset appears in the dataset. It is defined as the proportion of transactions in which the itemset occurs. Mathematically, the support for an itemset X is:
support(X) = (number of transactions containing X) / (total number of transactions)
For example, if {python, numpy} appears in 50 out of 200 transactions, the support is 50/200 = 0.25, or 25%.
Minimum Support Threshold: It is a user-defined value that specifies
the minimum frequency for an itemset to be considered frequent. Itemsets
with support above this threshold are termed frequent itemsets.
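A minimal sketch of this support computation, over a small hypothetical transaction set (the transactions below are illustrative, not the lesson's data), might look like this:

```python
# Hypothetical transaction data set (illustrative names).
transactions = [
    {"python", "numpy", "pandas"},
    {"python", "numpy"},
    {"numpy", "matplotlib"},
    {"python", "pandas"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)  # subset test
    return hits / len(transactions)

print(support({"python", "numpy"}, transactions))  # 2/4 = 0.5
```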
Rule Generation: From a frequent itemset such as {python, numpy}, candidate rules like {python} → {numpy} and {numpy} → {python} are generated. These rules are evaluated using metrics like confidence and lift to determine their effectiveness and usefulness.
Methods to Find Frequent Itemsets: The basic approach to finding frequent itemsets is to compute the value of the support metric for every itemset in the lattice structure. For this, we compare each candidate itemset against every transaction in the dataset; if the candidate itemset is present in a transaction, its support count is increased by 1. However, since the number of items is large, the number of comparisons grows exponentially. The number of comparisons is O(NMw), where N is the number of transactions, M is the number of candidate itemsets, and w is the maximum transaction width. There are three main approaches to reducing the computational complexity of frequent itemset generation, as detailed below:
1. Reduce the Number of Candidate Itemsets (M): The Apriori principle is used to eliminate some of the M = 2^k − 1 candidate itemsets without counting their support values.
2. Reduce the Number of Comparisons (w): Instead of matching
each candidate itemset against every transaction, we can reduce the
number of comparisons by using more advanced data structures,
either to store the candidate itemsets or to compress the data set.
3. Reduce the Number of Transactions (N): As the size of the candidate itemsets increases, fewer transactions will contain the itemsets.
In this section, we learnt that frequent itemsets are central to association rule mining. They provide the foundation for identifying and detecting hidden patterns and relationships within large data sets. Data analysts focus on itemsets that meet a minimum support threshold, to ensure that the rules generated are significant and represent reliable patterns. These rules and detected patterns lead to actionable insights in various applications such as market basket analysis, recommendation systems, fraud detection, anomaly detection, and inventory management, to name a few.
(c) Evaluate Rule Strength: In this step, the strength and accuracy of the generated rules are evaluated using the confidence, lift, and conviction metrics. These metrics are defined below along with their respective formulae:
Confidence: Confidence is the measure of the likelihood that the consequent Y is present in transactions that contain the antecedent X. It is calculated as:
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Importance of Confidence
Rule Evaluation: Confidence helps filter out the strong rules that are most likely to be useful for prediction or recommendation.
Business Insights: High-confidence rules provide decision-making insights about cross-selling, product placement, and inventory management in retail store management.
Filtering Rules: During rule generation, association rule mining can discard any rule that falls below a certain confidence threshold, so that rules capturing the most reliable associations are kept while less reliable rules are discarded.
Confidence is one of the main metrics in association rule mining. It expresses the conditional probability of the consequent being present, given that the antecedent is already present in a transaction. By calculating and analyzing the confidence of different rules, we can identify patterns within the transactional data, make intelligent decisions, and optimize operations.
4.7 Apriori Algorithm
In 1994, computer scientists Rakesh Agrawal and Ramakrishnan Srikant developed the Apriori algorithm. The Apriori algorithm is a very important method for association rule mining that provides a pruning technique based on the support measure in order to control the exponential growth of candidate itemsets. It first generates the frequent itemsets and then derives association rules from these itemsets. Apriori relies on the basic property that any non-empty subset of a frequent itemset must itself be frequent. Processing becomes much easier because the algorithm prunes, in its early phases, the many candidate itemsets that fail to meet the minimum support threshold. Therefore, the Apriori algorithm can generate frequent itemsets with far less computational complexity. The algorithm for Apriori, as given in Ref [1], is given below:
Transactions Dataset
Consider the following transactions:
1. {Python, Numpy, Pandas}
2. {Python, Numpy}
3. {Numpy, Matplotlib}
Step 5: Select High Confidence Rules with confidence above the min-
imum threshold (assume 0.6).
Python → Numpy (0.6)
Numpy → Python (0.75)
Python → Matplotlib (0.6)
Matplotlib → Python (0.75)
Python → Pandas (0.6)
Pandas → Python (1.0)
In summary, the Apriori algorithm has identified several frequent itemsets and high-confidence association rules from the given transactions, following these steps:
1. Calculate the support of individual items and identify frequent items.
2. Generate and prune candidate itemsets of size 2.
3. Attempt to generate candidate itemsets of size 3, finding none frequent.
4. Generate association rules from frequent itemsets and calculate their confidence.
5. Select rules with confidence above the specified threshold.
We have seen that the Apriori algorithm works in the following four steps:
(i) Generate candidate itemsets of length k. This step is called candidate generation.
(ii) Count the support of each candidate by scanning the dataset.
(iii) Remove the candidate itemsets with support smaller than the threshold.
(iv) Retain only the frequent ones.
This process is iterative: in each step, the itemset length is increased by one, until no more frequent itemsets are obtained. The Apriori algorithm is known for its scalability and efficiency and is easy to implement. However, with very large datasets and minimum support thresholds set too low, it becomes computationally expensive. Therefore, hashing techniques are sometimes used to prune candidate itemsets, and partitioning methods are used to reduce the number of dataset scans.
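The lesson's summary mentions the mlxtend library for Python; a minimal sketch of that workflow, using only the three transactions listed above (so the support and confidence values will differ from the worked example's figures), might look like this:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Python", "Numpy", "Pandas"],
    ["Python", "Numpy"],
    ["Numpy", "Matplotlib"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Frequent itemsets with support >= 0.5, then rules with confidence >= 0.6.
frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```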
There are many areas where association rule mining and the Apriori
algorithm can be applied. Market basket analysis in retail is one area
in which association rule mining and Apriori algorithm is used to iden-
tify product associations with a view to optimally devise store layouts,
improve cross-selling strategies, and design promotions. In web usage
mining, these rules provide understanding of user behaviour patterns
that improve website navigation and personalization. The algorithms of
mining association rules can also be applied to the healthcare sector in
an effort to identify correlations between symptoms and diseases, aiding
in diagnostics and personalised treatment plans.
4.8 Summary
This lesson comprehensively covers association rule mining and the Apriori algorithm, enabling students and practitioners to apply these techniques to datasets. Mastery of these concepts will empower learners to derive useful insights from big data, driving effective decision-making and strategic planning across academia, research, and industry. Emphasis is placed on the metrics of support, confidence, and lift, and on iterative and pruning algorithms. The lesson also covered practical implementation aspects, such as data preparation, the application of the Apriori algorithm using popular libraries like mlxtend in Python, and how to interpret the results. This provides hands-on experience in mining association rules from transaction data.
In short, this lesson offers the learner basic knowledge and skills in association rule mining using the Apriori algorithm. These methods will be useful to analysts and data scientists in finding valuable patterns in large datasets, improving their decision-making to achieve better business results.
IN-TEXT QUESTIONS
1. A collection of records where each record represents a transaction
containing a set of items is called:
(a) Frequent itemset
(b) Transaction item set
(c) Association item set
(d) All of the above
2. What is association rule mining?
(a) Same as frequent itemset mining
(b) Finding of strong association rules using frequent itemsets
(c) Using association to analyse correlation rules
(d) Finding Itemsets for future trends
3. A collection of one or more items is called as __________.
(a) Itemset (b) Support
(c) Confidence (d) Support Count
4.11 References

Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2021). Introduction to Data Mining. 2nd edition. Pearson.
Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. 3rd edition. Morgan Kaufmann.
Zaki, M. J., & Meira, W., Jr. (2020). Data Mining and Machine Learning: Fundamental Concepts and Algorithms. 2nd edition. Cambridge University Press.
Lesson 5: Building Blocks of Classification Systems
Dr. Charu Gupta
Assistant Professor
School of Open Learning
University of Delhi
Email-Id: [email protected]
STRUCTURE
5.1 Learning Objectives
5.2 Introduction: About Classification
5.3 Naive Bayes Classifier
5.4 Nearest Neighbour Classifier
5.5 Decision Tree
5.6 Overfitting
5.7 Confusion Matrix, Evaluation Metrics and Model Evaluation
5.8 Summary
5.9 Answers to In-Text Questions
5.10 Self-Assessment Questions
5.11 References
5.12 Suggested Readings
5.2 Introduction: About Classification
Consider opening an email account. The emails are categorised into inbox, sent, spam, and drafts. Inbox emails are further categorised into primary, promotions, social, and updates, and can also be marked as starred or important. Categorising emails by subject helps in identifying the required emails without consuming much time and effort.
Classification is an important task in machine learning, data science, and data mining. It is used in many applications in day-to-day activities and industries. Classification helps in sorting and grouping large datasets, making them easier to understand and analyse.
(batteries, chemicals), and electronic waste (broken CDs, pen drives, malfunctioning mice, keyboards).
7. Classifying movies into genres such as action, comedy, drama,
romance and documentaries.
8. Songs can be classified into duet songs, sad songs, party songs, disco
songs, love songs, and religious songs.
9. A technical support centre may classify incoming queries as software
problems, hardware issues, network problems, or billing queries.
10. Animals are classified based on living habitat: terrestrial, aquatic,
amphibian, arboreal, and aerial.
P(A|B) = P(B|A) × P(A) / P(B)
where:
P(A|B) is the posterior probability of class A given features B.
P(B|A) is the likelihood of features B given class A.
P(A) is the prior probability of class A.
P(B) is the prior probability of features B (the evidence).
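To make this concrete, here is a minimal, hedged sketch of a Naive Bayes classifier using scikit-learn's GaussianNB; the Iris dataset, the 70/30 split, and random_state are illustrative assumptions, not prescribed by the lesson.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Illustrative data and split (assumptions, not from the lesson text)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# GaussianNB estimates the likelihood P(B|A) per feature under a normal
# assumption, then applies Bayes' theorem to compute posterior class probabilities
model = GaussianNB().fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))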
Disadvantages
Computationally Intensive: Requires calculating distances to all
points in the training set for each prediction, which can be slow
for large datasets.
Sensitive to Irrelevant Features: All features are treated equally,
so irrelevant features can negatively impact performance.
Storage Requirements: Needs to store all training data, which can
be impractical for very large datasets.
In this section, we studied the k-Nearest Neighbours classification algorithm. It is a powerful and versatile algorithm for classification and regression, particularly useful for small to medium-sized datasets and for problems where interpretability and simplicity are important.
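A minimal sketch of kNN classification with scikit-learn's KNeighborsClassifier follows; the choice of k=5, the Iris dataset, and the split are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Classify each test point by majority vote among its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print('Test accuracy:', knn.score(X_test, y_test))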
5.5 Decision Tree
The Decision Tree is one of the most popular and powerful supervised learning algorithms; it handles both classification and regression problems. It breaks a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. More precisely, it divides the data into subsets based on the values of the input features and then makes decisions using a tree-like model. The resulting tree consists of decision nodes and leaf nodes. The leaves, or terminal nodes, represent the classes that are the output of the decision tree classification algorithm.
into subsets. This stops when all instances have the same class or a similar value of the feature.
8. Example of Decision Tree Algorithm: We will build a decision
tree classifier using the Iris dataset from scikit-learn. Step-by-step
implementation is given below:
(i) Load the Dataset: Load the Iris dataset from sklearn.
(ii) Split the Data: Split the dataset into training and testing sets.
(iii) Train the Model: Use the DecisionTreeClassifier from sklearn.
(iv) Evaluate the Model: Measure the accuracy and visualise the
tree.
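A minimal sketch of these four steps is given below; the split ratio and random_state values are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# (i) Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# (ii) Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# (iii) Train the DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# (iv) Evaluate the accuracy and visualise the tree
print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)))
plot_tree(clf, feature_names=iris.feature_names, class_names=list(iris.target_names))
plt.show()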
5.6 Overfitting
Overfitting is a problem in classification where a model learns the training dataset too well, including its noise and outliers. The fitted model then performs very well on the training data but poorly on any
new, unseen test data. This happens when the model is complex enough to capture random fluctuations and noise instead of the true underlying patterns in the training data.
1. Characteristics of Overfitting: The following are characteristics of a classification model that is overfitting:
High Accuracy on Training Data: The model exhibits very high
performance on the training set, often close to 100%.
Low Accuracy on Test Data: The model performs significantly worse on the test set, indicating poor generalisation to new, unseen data.
Complex Model: The model has too many parameters or a very
complex structure relative to the amount of training data available.
2. Causes of Overfitting: The following are common reasons why a classification model overfits:
Complex Models: Using models with too many parameters (e.g.,
deep decision trees, high-degree polynomials).
Insufficient Training Data: Having too little data for the model
to learn the true underlying patterns.
Noisy Data: Training data that contains a lot of noise or irrelevant
features.
Too Many Features: Including a large number of features,
especially if many of them are irrelevant.
3. Preventing Overfitting: The following are methods through which we can prevent overfitting of classification models (a short sketch in code follows this list):
Simpler Models: We can look for simpler models that involve
fewer parameters, such as pruning decision trees and reducing
the degree of polynomials.
More Training Data: We will gather more training data in order
to capture the underlying distribution better.
Cross-validation: We can use cross-validation techniques to check that our model performs consistently on different subsets of the data.
Regularization: We can apply regularization techniques, such as L1 and L2 regularisation, to penalise overly complex models.
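As a hedged sketch of the first and third ideas above, the snippet below compares an unconstrained decision tree with a depth-limited (pruned) one under 5-fold cross-validation; the dataset and the depth value of 3 are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree can memorise noise; limiting depth is a simple pruning step
for depth in [None, 3]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f'max_depth={depth}: mean CV accuracy = {scores.mean():.3f}')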
3. Recall: Recall = TP / (TP + FN), i.e., the proportion of actual positive instances that the model correctly identifies.
Procedure: The Hold-out method follows the steps given below (a code sketch follows this method):
1. Firstly, split the dataset into a training and a test set.
2. Next, train the model on the training set.
3. Then, evaluate the model's performance on the test set using different evaluation metrics such as accuracy, precision, recall, ROC, and F1 score.
Advantages: The advantages of Hold-out method are:
1. The Holdout method is very simple and easy to use.
2. The Holdout method gives a quick estimation of the model’s
performance.
Disadvantages: The disadvantages of the Hold-out method are:
1. Results can differ considerably depending on the random partitioning.
2. The holdout estimate may not generalise well when the dataset is small or imbalanced.
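A minimal sketch of the hold-out procedure, assuming scikit-learn and the Iris dataset (the 70/30 split and the choice of classifier are illustrative):

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Step 1: split the dataset into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2: train the model on the training set
model = GaussianNB().fit(X_train, y_train)

# Step 3: evaluate on the held-out test set with several metrics
y_pred = model.predict(X_test)
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, average='macro'))
print('Recall   :', recall_score(y_test, y_pred, average='macro'))
print('F1 score :', f1_score(y_test, y_pred, average='macro'))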
2. Random Subsampling Method: The Random Subsampling Method
is similar to the Holdout Method but involves repeated random
partitioning of the dataset into training and test sets.
Procedure: The Random Sub-Sampling method follows the steps given below (see the sketch after this list):
1. Firstly, split the dataset into training and test sets randomly.
2. Next, train the model using the training set.
3. Then, evaluate the model’s performance on the test set.
4. Repeat steps 1-3 a number of times, such as 10 or 100, and
then calculate an average over the evaluation metrics.
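A minimal sketch of repeated random subsampling, assuming scikit-learn's ShuffleSplit (the 10 repetitions, 70/30 split, and classifier are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Repeat a random 70/30 split 10 times and average the accuracy
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=splitter)
print(f'Mean accuracy over 10 random splits: {scores.mean():.3f} (+/- {scores.std():.3f})')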
Advantages: The advantages of Random Sub-Sampling Method
are:
1. It reduces variance in the performance estimate compared to a single holdout split.
2. It gives a more stable estimate of model performance.
Disadvantages: The disadvantages of the Random Sub-Sampling Method are:
1. The random test sets can overlap across repetitions, so some instances may be tested several times while others are never tested.
2. Repeating the train-and-evaluate cycle many times is computationally more expensive than a single split.
5.8 Summary
In data mining, classification is the process of assigning data to classes based on given input features. There are several types of classification algorithms, each with its own strengths and fields of application. The Naive Bayes algorithm is a probabilistic classifier that employs Bayes' theorem, combining feature probabilities to estimate class probabilities. It is well suited to text classification because it is fast and straightforward. Decision trees make decisions easy to interpret through feature splits, owing to their hierarchical, tree-like structure. The k-Nearest Neighbours algorithm classifies data points using the majority label among the nearest neighbours in feature space, which makes it intuitive and relatively simple, though computationally intensive for large datasets. A major problem in classification is overfitting: an overly complex model presents very good results on the training data but poor performance on new, unseen data. Model evaluation is an important part of model building; it checks how well a classifier performs using metrics like accuracy, precision, recall, and F1 score, and techniques such as confusion matrices and ROC curves. Proper evaluation helps in model selection and tuning, striking a better balance between bias and variance and providing assurance of good performance on real-world data.
In this lesson, we learned the basic concept of classification, which assigns data to categories or classes based on their characteristics. We started with what is essential in any classification task: a dataset with labelled examples, selection of the right features, and choice of appropriate algorithms. We then studied several of the algorithms used for classification, such as decision trees, k-nearest neighbours, and naive Bayes. Each has its own approach and is suited to different types of data and problem domains.
We also discussed the importance of model evaluation in classification: metrics such as accuracy, precision, recall, and F1 score give an idea of a classifier's performance and of the points at which it may be improved.
IN-TEXT QUESTIONS
1. K-fold method becomes __________, especially for large datasets
and complex models.
(a) Computationally more expensive
(b) Computationally less expensive
(c) Exponentially more expensive
(d) Exponentially less expensive
2. To prevent overfitting, we can:
(a) Collect more testing data
(b) Collect more training data
(c) Remove Training data
(d) Remove Testing Data
3. Which of the following is not an evaluation metric for Classification
Algorithms?
(a) Recall
(b) Precision
(c) F-1 Score
(d) Mean
4. Naive Bayes is a family of __________ algorithms:
(a) Logarithmic
(b) Algebraic
(c) Probabilistic
(d) Exponential
5.9 Answers to In-Text Questions
1. (a) Computationally more expensive
2. (b) Collect more training data
3. (d) Mean
4. (c) Probabilistic
5.11 References
Tan P. N., Steinbach M., Karpatne A. and Kumar V. Introduction to Data Mining, 2nd edition, Pearson, 2021.
Han J., Kamber M. and Pei J. Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann Publishers, 2011.
Zaki M. J. and Meira J. Jr. Data Mining and Machine Learning: Fundamental Concepts and Algorithms, 2nd edition, Cambridge University Press, 2020.
Alnuaimi A. F. and Albaldawi T. H. Concepts of Statistical Learning and Classification in Machine Learning: An Overview, BIO Web of Conferences, Vol. 97, p. 00129, EDP Sciences, 2024.
Glossary
Lift: A measure of how much more often the antecedent and consequent of a rule occur together than would be expected if they were statistically independent. A lift value greater than 1 indicates a positive association between the antecedent and the consequent.
Machine Learning: A subcategory of AI in which systems learn from data to make predictions or decisions without being explicitly programmed, improving with experience.
Market Basket Analysis: A common application of association rule
mining where sets of products that frequently co-occur in transactions
are identified to understand customer purchasing behaviour.
Maximal Itemset: An itemset is maximal if none of its immediate su-
persets is frequent.
Metric Distance: A function that gives the distance between two data points in a multi-dimensional space; it is one of the major functions involved in clustering algorithms.
Centroid: A point characterising the centre of a cluster, usually the mean of all the data points in that cluster.
Minimum Confidence Threshold: The user-specified minimum level of confidence a rule must have to be considered interesting.
Optimal Number of Clusters: The number of clusters that best represents the true structure underlying the data, as determined by a cluster validation technique or domain knowledge.
Overfitting: A modelling error that occurs when a model learns the de-
tails and noise in the training data to the extent that it performs poorly
on new data.
Overlap: Overlap occurs when a data point simultaneously belongs to more than one cluster; it is a common occurrence in fuzzy clustering methods.
Partitioning Clustering: A class of clustering algorithms that partition
data into non-overlapping clusters, with each data point assigned to ex-
actly one cluster.
Precision: A metric that measures the accuracy of positive predictions,
defined as the number of true positives divided by the number of true
positives plus false positives.
Recall: A metric that measures the ability of a model to identify all rel-
evant instances, defined as the number of true positives divided by the
number of true positives plus false negatives.
Testing Data: Data not seen during training that is used to test the performance of a trained machine learning model. It is never part of the training set, and it helps estimate a model's ability to generalise.
Testing Set: A dataset used to evaluate the performance of a trained
machine learning model.
Threshold Minimum Support: The user-specified minimum support an itemset must meet to be counted as frequent.