
Machine Learning

Classification and Clustering


Classification Example
 The class label attribute, buys computer, has two
distinct values, namely, {yes, no}; therefore, there
are two distinct classes (i.e., m = 2).
 Let class C1 correspond to yes and class C2
correspond to no.
 There are nine tuples of class yes and five tuples of
class no.
 A (root) node N is created for the tuples in D.
 To find the splitting criterion for these tuples, we must
compute the information gain of each attribute.
 We first compute the expected information needed to classify a tuple in D, as worked out below.
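Using the class counts just stated (nine yes and five no tuples, 14 in total), the expected information is the standard entropy expression; the arithmetic below follows directly from those counts:

\[
\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i
= -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940 \text{ bits}
\]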

Classification Example
 Next, we need to compute the expected information
requirement for each attribute.
 Let’s start with the attribute age.
 We need to look at the distribution of yes and no
tuples for each category of age.
 For the age category “youth,” there are two yes tuples and three no tuples.
 For the age category “middle aged,” there are four yes tuples and zero no tuples.
 For the age category “senior,” there are three yes tuples and two no tuples.
 Compute the expected information needed to classify a tuple in D if the tuples are partitioned according to age, as worked out below.
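Carrying out that computation with the per-category counts listed above, and the value of Info(D) computed earlier, gives:

\[
\mathrm{Info}_{\mathit{age}}(D) = \frac{5}{14}\,\mathrm{Info}(2,3) + \frac{4}{14}\,\mathrm{Info}(4,0) + \frac{5}{14}\,\mathrm{Info}(3,2) \approx 0.694 \text{ bits}
\]
\[
\mathrm{Gain}(\mathit{age}) = \mathrm{Info}(D) - \mathrm{Info}_{\mathit{age}}(D) \approx 0.940 - 0.694 = 0.246 \text{ bits}
\]

where Info(2,3) = Info(3,2) ≈ 0.971 bits and Info(4,0) = 0 bits.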
Quiz Time !!!
What is the complexity of a decision tree?

Because the underlying data structure is a tree, the cost of classifying a tuple is on the order of the tree's depth (logarithmic in the number of nodes for a reasonably balanced tree), but the cost of designing (inducing) the tree from the training data must also be included.

Basic Classification Concepts
• Bayesian classifiers are statistical classifiers.
• They can predict class membership probabilities such as
the probability that a given tuple belongs to a particular
class.

• Bayesian classification is based on Bayes’ theorem.


• Naive Bayesian classifiers assume that the effect of an
attribute value on a given class is independent of the
values of the other attributes.
• This assumption is called class conditional
independence. It is made to simplify the computations
involved and, in this sense, is considered “naive.”

Classification: Bayes’ Theorem
 Let X be a data tuple.
 In Bayesian terms, X is considered “evidence.”
 As usual, it is described by measurements made on a
set of n attributes.
 Let H be some hypothesis such as that the data tuple
X belongs to a specified class C.
 For classification problems, we want to determine
P(H|X), the probability that the hypothesis H holds
given the “evidence” or observed data tuple X.
 In other words, we are looking for the probability that
tuple X belongs to class C, given that we know the
attribute description of X.

Classification: Bayes’ Theorem
• P(H|X) is the posterior probability, or a posteriori
probability, of H conditioned on X.
• For example, suppose our world of data tuples is
confined to customers described by the attributes
age and income, respectively, and that X is a 35-
year-old customer with an income of $40,000.
• Suppose that H is the hypothesis that our customer
will buy a computer.
• Then P(H|X) reflects the probability that customer X
will buy a computer given that we know the
customer’s age and income.

Classification: Bayes’ Theorem
 In contrast, P(H) is the prior probability, or a priori
probability, of H.
 For our example, this is the probability that any given
customer will buy a computer, regardless of age,
income, or any other information, for that matter.
 The posterior probability, P(H|X), is based on more
information (e.g., customer information) than the prior
probability, P(H), which is independent of X.
 Similarly, P(X|H) is the posterior probability of X
conditioned on H.
 That is, it is the probability that a customer, X, is 35
years old and earns $40,000, given that we know the
customer will buy a computer.
 P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.
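Bayes’ theorem itself relates these four quantities, providing a way to compute the posterior P(H|X) from P(H), P(X|H), and P(X):

\[
P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}
\]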
Classification: Naïve Bayesian Classification
 The naïve Bayesian classifier, or simple Bayesian
classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = {x1, x2, ..., xn}, depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
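The condition is the standard maximum a posteriori rule:

\[
P(C_i|X) > P(C_j|X) \quad \text{for } 1 \le j \le m,\ j \ne i .
\]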
Classification: Naïve Bayesian Classification
2. Thus, we maximize P(Ci|X). The class Ci for which
P(Ci|X) is maximized is called the maximum
posteriori hypothesis. By Bayes’ theorem
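That is, the posterior being maximized can be rewritten as:

\[
P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}
\]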

3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples of class Ci in D.

Classification: Naïve Bayesian Classification
4. Given data sets with many attributes, it would be
extremely computationally expensive to compute
P(X|Ci). To reduce computation in evaluating P(X|Ci),
the naïve assumption of class-conditional
independence is made. This presumes that the
attributes’ values are conditionally independent of
one another, given the class label of the tuple (i.e.,
that there are no dependence relationships among
the attributes). Thus,
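The factorization implied by this independence assumption is:

\[
P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)
\]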

Classification: Naïve Bayesian Classification
5. To predict the class label of X, P(X|Ci)P(Ci) is
evaluated for each class Ci . The classifier predicts
that the class label of tuple X is the class Ci if and
only if
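The condition compares the products across classes:

\[
P(X|C_i)\,P(C_i) > P(X|C_j)\,P(C_j) \quad \text{for } 1 \le j \le m,\ j \ne i .
\]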

In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum.
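To make steps 1–5 concrete, here is a minimal Python sketch of a categorical naïve Bayesian classifier. It is an illustration only, not code from the slides; the names train_naive_bayes and predict are invented for this example, and no smoothing of zero counts is applied.

from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    """Estimate P(Ci) and P(xk|Ci) by relative frequency from the training data."""
    class_counts = Counter(labels)
    # cond_counts[ci][attr][value] = number of class-ci tuples with attr == value
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for x, ci in zip(tuples, labels):
        for attr, value in x.items():
            cond_counts[ci][attr][value] += 1
    return class_counts, cond_counts

def predict(x, class_counts, cond_counts):
    """Return the class Ci that maximizes P(X|Ci) * P(Ci)."""
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for ci, ci_count in class_counts.items():
        score = ci_count / n                                   # P(Ci)
        for attr, value in x.items():                          # class-conditional independence
            score *= cond_counts[ci][attr][value] / ci_count   # P(xk|Ci), zero if unseen
        if score > best_score:
            best_class, best_score = ci, score
    return best_class

Each training tuple is a dict of attribute–value pairs. Note that a zero count for any attribute value drives the whole product to zero, which is why a Laplacian (add-one) correction is often used in practice.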

Classification Example
 The data tuples are described by the attributes age,
income, student, and credit rating.
 The class label attribute, buys computer, has two distinct values, namely, {yes, no}.
 Let C1 correspond to the class buys computer = yes and C2 correspond to buys computer = no.
 The tuple we wish to classify, X, is given by its values for age, income, student, and credit rating.
Classification Example
 We need to maximize P(X|Ci)P(Ci), for i=1, 2.
 P(Ci), the prior probability of each class, can be
computed based on the training tuples:
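From the class counts given earlier (nine yes and five no tuples out of 14), the priors are:

\[
P(C_1) = P(\textit{buys computer} = \textit{yes}) = 9/14 \approx 0.643,\qquad
P(C_2) = P(\textit{buys computer} = \textit{no}) = 5/14 \approx 0.357 .
\]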

Classification Example
 Using these probabilities, we obtain P(X|C1), the product of the per-attribute conditional probabilities for class C1 (yes); similarly, P(X|C2) for class C2 (no).
 To find the class, Ci, that maximizes P(X|Ci)P(Ci), we compute P(X|C1)P(C1) and P(X|C2)P(C2) and predict the class with the larger value.
Naïve Bayes
Figure 1: The Train Dataset

Tuple to classify: weekday, winter, high, heavy, class = ?
For the Train dataset, the same procedure applies: estimate the class prior probabilities and the per-attribute conditional probabilities from the table, then choose the class that maximizes P(X|Ci)P(Ci).
Features             | Classification                                | Clustering
Type of learning     | Supervised                                    | Unsupervised
Algorithms available | Naïve Bayesian, SVM                           | K-means, K-medoids
Type of dataset      | Labeled dataset                               | Unlabeled dataset
Application          | Weather prediction, COVID detection           | Customer segmentation
Basic criteria       | Information gain, gain ratio, Gini index      | Distance measures
Process              | Classify an unknown sample                    | Group similar samples into clusters
Type of data         | Discrete-valued data                          | Numeric data
Accuracy measures    | Confusion matrix, F1 score, precision, recall | Mean square error
Clustering

What Is a Good Clustering?

A good clustering method will produce clusters with:
 High intra-class similarity
 Low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

Requirements for Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal domain knowledge required to
determine input parameters
 Ability to deal with noise and outliers
 Insensitivity to order of input records
 Robustness with respect to high dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability

Major Clustering Approaches
 Partitioning approach:
   Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
   Typical methods: k-means, k-medoids
 Hierarchical approach:
   Create a hierarchical decomposition of the set of data (or objects) using some criterion
   Typical methods: Agglomerative, DIANA, BIRCH
 Density-based approach:
   Based on connectivity and density functions
   Typical methods: DBSCAN, OPTICS

Partitioning Algorithms

 Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
 Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
   k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster
   k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster

K-Means Clustering

 Given k, the k-means algorithm is implemented in four steps:
   Partition objects into k nonempty subsets
   Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
   Assign each object to the cluster with the nearest seed point
   Go back to Step 2; stop when there are no more new assignments

The criterion being minimized is the sum of squared errors:

\[
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2
\]

where m_i is the mean of cluster C_i.

K-Means Clustering (contd.)

 Example (figures): with K = 2, arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar (nearest) center, update the cluster means, and repeat the assign/update steps until no assignment changes.
K-means Example

For simplicity, 1-dimensional data and k=2.


Data: 1, 2, 5, 6, 7
K-means:
Randomly select 5 and 6 as initial centroids;
=> Two clusters {1, 2, 5} and {6,7};
 mean C1 = 8/3 ≈ 2.67 & mean C2 = 6.5
=> {1,2}, {5,6,7};
 mean C1=1.5 & mean C2=6
=> No Change
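The same procedure can be sketched in a few lines of Python. This is illustrative only (not the slides' code): kmeans_1d is an invented name, the data and the initial centroids 5 and 6 come from the example above, and the loop stops when an update leaves the centroids unchanged.

def kmeans_1d(data, centroids, max_iters=100):
    """Lloyd's k-means for 1-D points; returns (clusters, centroids)."""
    for _ in range(max_iters):
        # Assignment step: each point goes to the nearest current centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else m for c, m in zip(clusters, centroids)]
        if new_centroids == centroids:   # no change => converged
            break
        centroids = new_centroids
    return clusters, centroids

clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], centroids=[5, 6])
print(clusters, centroids)   # [[1, 2], [5, 6, 7]] [1.5, 6.0]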

Comments on the K-Means Method

 Strengths
Relatively efficient: O(nkt), where n is # objects, k
is # clusters, and t is # iterations. Normally, k, t
<< n.

 Weaknesses
   Applicable only when the mean is defined (what about categorical data?)
   Need to specify k, the number of clusters, in advance
   Trouble with noisy data and outliers
   Not suitable for discovering clusters with non-convex shapes

Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
Example: K-Means

Figure: Initial Choice of Centroids
Figure: Objects for Clustering
Figure: Centroids After the First Iteration
Figure: Centroids After the First Two Iterations
Summary
 Classification and prediction are two forms of
data analysis that can be used to extract models
describing important data classes or to predict future
data trends.
 Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, Bayesian belief networks, rule-based classifiers, etc.
 No single method has been found to be superior over all others for all data sets.


Books
Text Books:
1. Han, Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 3rd Edition
2. P. N. Tan, M. Steinbach, Vipin Kumar, “Introduction to Data Mining”, Pearson Education
3. M. H. Dunham, “Data Mining Techniques and Algorithms”, Prentice Hall, 2000
