Classification-Clustering
Classification: Example
Next, we need to compute the expected information
requirement for each attribute.
Let’s start with the attribute age.
We need to look at the distribution of yes and no tuples for each category of age:
• For the category “youth,” there are two yes tuples and three no tuples.
• For the category “middle aged,” there are four yes tuples and zero no tuples.
• For the category “senior,” there are three yes tuples and two no tuples.
Compute the expected information needed to classify a tuple in D if the tuples are partitioned according to age:
Info_age(D) = 5/14 × I(2,3) + 4/14 × I(4,0) + 5/14 × I(3,2) ≈ 0.694 bits,
where I(p, n) denotes the expected information of a partition containing p yes tuples and n no tuples.
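As a minimal sketch of this computation (assuming the 14-tuple training set implied by the counts above, with 9 yes and 5 no tuples overall; the function names are illustrative only):

from math import log2

def info(counts):
    # Entropy Info(D) of a class distribution given as a list of counts.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class distribution (yes, no) per age category, as listed above.
partitions = {"youth": (2, 3), "middle aged": (4, 0), "senior": (3, 2)}
n = sum(sum(p) for p in partitions.values())                       # 14 tuples in D

info_D = info([9, 5])                                              # Info(D)      ~ 0.940 bits
info_age = sum(sum(p) / n * info(p) for p in partitions.values())  # Info_age(D)  ~ 0.694 bits
gain_age = info_D - info_age                                       # Gain(age)    ~ 0.246 bits
print(info_D, info_age, gain_age)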
Quiz Time !!!
What is the complexity of a decision tree?
Basic Classification Concepts
• Bayesian classifiers are statistical classifiers.
• They can predict class membership probabilities such as
the probability that a given tuple belongs to a particular
class.
Classification: Bayes’ Theorem
Let X be a data tuple.
In Bayesian terms, X is considered “evidence.”
As usual, it is described by measurements made on a
set of n attributes.
Let H be some hypothesis such as that the data tuple
X belongs to a specified class C.
For classification problems, we want to determine
P(H|X), the probability that the hypothesis H holds
given the “evidence” or observed data tuple X.
In other words, we are looking for the probability that
tuple X belongs to class C, given that we know the
attribute description of X.
Classification: Bayes’ Theorem
• P(H|X) is the posterior probability, or a posteriori
probability, of H conditioned on X.
• For example, suppose our world of data tuples is
confined to customers described by the attributes
age and income, respectively, and that X is a 35-
year-old customer with an income of $40,000.
• Suppose that H is the hypothesis that our customer
will buy a computer.
• Then P(H|X) reflects the probability that customer X
will buy a computer given that we know the
customer’s age and income.
Classification: Bayes’ Theorem
In contrast, P(H) is the prior probability, or a priori
probability, of H.
For our example, this is the probability that any given
customer will buy a computer, regardless of age,
income, or any other information, for that matter.
The posterior probability, P(H|X), is based on more
information (e.g., customer information) than the prior
probability, P(H), which is independent of X.
Similarly, P(X|H) is the posterior probability of X
conditioned on H.
That is, it is the probability that a customer, X, is 35
years old and earns $40,000, given that we know the
customer will buy a computer.
P(X) is the prior probability of X.
Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.
Classification: Naïve Bayesian Classification
The naïve Bayesian classifier, or simple Bayesian
classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Classification: Naïve Bayesian Classification
3. Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis. By Bayes’ theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
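As a small numeric sketch of the theorem (all probabilities below are hypothetical placeholders, not values taken from the example data):

p_h = 0.6          # hypothetical prior P(H): a customer buys a computer
p_x_given_h = 0.2  # hypothetical likelihood P(X|H): a buyer is 35 and earns $40,000
p_x = 0.15         # hypothetical evidence P(X): any customer is 35 and earns $40,000

p_h_given_x = p_x_given_h * p_h / p_x   # posterior P(H|X) via Bayes' theorem
print(p_h_given_x)                      # 0.8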
Classification: Naïve Bayesian Classification
4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). To reduce computation in evaluating P(X|Ci), the naïve assumption of class-conditional independence is made. This presumes that the attributes’ values are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci).
Classification: Naïve Bayesian Classification
5. To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
Classification: Example
The data tuples are described by the attributes age,
income, student, and credit rating.
The class label attribute, buys_computer, has two distinct values, namely {yes, no}.
Let C1 correspond to the class buys_computer = yes and C2 correspond to buys_computer = no. The tuple we wish to classify is X.
Classification: Example
We need to maximize P(X|Ci)P(Ci), for i=1, 2.
P(Ci), the prior probability of each class, can be computed based on the training tuples as the fraction of training tuples belonging to each class.
Classification: Example
Similarly, the class-conditional probabilities P(xk|Ci) are computed for each attribute value of X, and P(X|Ci)P(Ci) is then evaluated for each class; X is assigned to the class giving the largest value.
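Putting steps 1 through 5 together, here is a compact sketch of a naïve Bayesian classifier. The tiny training set below is hypothetical (it is not the slide’s dataset); the attribute names simply echo the example:

from collections import Counter, defaultdict

# Hypothetical training tuples: (attribute-value dict, class label).
train = [
    ({"age": "youth",       "student": "yes"}, "yes"),
    ({"age": "youth",       "student": "no"},  "no"),
    ({"age": "senior",      "student": "yes"}, "yes"),
    ({"age": "senior",      "student": "no"},  "no"),
    ({"age": "middle aged", "student": "yes"}, "yes"),
]

# Estimate priors P(Ci) and conditionals P(xk|Ci) by relative frequency.
class_counts = Counter(label for _, label in train)
cond_counts = defaultdict(Counter)            # (class, attribute) -> value counts
for attrs, label in train:
    for a, v in attrs.items():
        cond_counts[(label, a)][v] += 1

def predict(x):
    # Return the class maximizing P(X|Ci) * P(Ci) under class-conditional independence.
    best, best_score = None, -1.0
    n = len(train)
    for c, cc in class_counts.items():
        score = cc / n                                # prior P(Ci)
        for a, v in x.items():
            score *= cond_counts[(c, a)][v] / cc      # P(xk|Ci)
        if score > best_score:
            best, best_score = c, score
    return best

print(predict({"age": "youth", "student": "yes"}))    # -> "yes"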
Naïve Bayes
Figure 1: The Train Dataset
Features: Classification vs. Clustering
What Is a Good Clustering?
Requirements for Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal domain knowledge required to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to order of input records
• Robustness with respect to high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Major Clustering Approaches
Partitioning approach:
Construct various partitions and then evaluate them by some criterion (e.g., minimizing the sum of squared errors)
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based approach:
Based on connectivity and density functions
K-Means Clustering
K-Means Clustering (contd.): Example
Figure: K-means with K = 2. Arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, then update the cluster means; the assignment and update steps repeat until the clusters stabilize.
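A minimal sketch of the procedure the figure illustrates (the 2-D points and the helper name kmeans are placeholders, not the slide’s data):

import random

def kmeans(points, k, iters=100, seed=0):
    # Plain k-means on 2-D points: choose k initial centers arbitrarily,
    # assign each point to the nearest center, then recompute the means.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2
                                            + (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:        # stop once the means no longer change
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(pts, k=2)[0])

Each iteration scans all n points against k centers, which is where the O(nkt) running time quoted in the comments below comes from.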
K-means Example
Comments on the K-Means Method
Strengths
• Relatively efficient: O(nkt), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
Weaknesses
• Applicable only when the mean is defined (what about categorical data?)
• Need to specify k, the number of clusters, in advance
• Trouble with noisy data and outliers
• Not suitable for discovering clusters with non-convex shapes
Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster.
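As a sketch of the medoid idea just mentioned (the cluster below is hypothetical):

def medoid(points):
    # The medoid is the most centrally located object: the point whose total
    # distance to all other points in the cluster is smallest.
    def total_dist(p):
        return sum(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 for q in points)
    return min(points, key=total_dist)

cluster = [(1, 1), (2, 1), (2, 2), (8, 9)]   # hypothetical cluster with one outlier
print(medoid(cluster))                       # (2, 2): the outlier does not drag it away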
Example: K-Means
Figure: Initial Choice of Centroids
Summary
Classification and prediction are two forms of
data analysis that can be used to extract models
describing important data classes or to predict future
data trends.
Effective and scalable methods have been developed for decision tree induction, naïve Bayesian classification, Bayesian belief networks, rule-based classifiers, etc.
No single method has been found to be superior over all others for all data sets.