
Unit 7 Clustering

The document provides an overview of clustering, defining it as the process of organizing similar objects into groups while distinguishing them from others. It discusses various types of data used in clustering analysis, similarity and dissimilarity measures, and details several clustering algorithms including K-Means, K-Means++, Mini-Batch K-Means, and K-Medoids. Each algorithm's methodology, advantages, and limitations are also highlighted, emphasizing the importance of choosing appropriate methods based on data characteristics.


Data Warehousing and Data Mining
BSc. CSIT 7th Semester

UNIT 7: CLUSTERING

Clustering
 A cluster is a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
 Clustering is "the process of organizing objects into groups whose members are similar in some way".
 "Cluster Analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual."
 Clustering is an unsupervised learning technique.
 Hence, clustering is concerned with grouping things together.
Types of data in clustering analysis

 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types


Interval-valued variables
 Standardize data
 Calculate the mean absolute deviation:

   s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

 where m_f = (1/n)(x_1f + x_2f + ... + x_nf)

 Calculate the standardized measurement (z-score):

   z_if = (x_if - m_f) / s_f
 Using mean absolute deviation is more robust than using
standard deviation
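As a sketch (in Python, with illustrative variable names of my own), the standardization above can be computed as:

```python
def standardize(values):
    """Standardize one variable f: z_if = (x_if - m_f) / s_f,
    where s_f is the mean absolute deviation (more robust than
    the standard deviation, as noted above)."""
    n = len(values)
    m_f = sum(values) / n                          # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]

print(standardize([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]
```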
Similarity and Dissimilarity between Objects
 The similarity measure is a way of measuring how data samples are related or close to each other.
 On the other hand, the dissimilarity measure tells how distinct the data objects are.
 In clustering, similar data samples are grouped into one cluster, and all other data samples are grouped into different ones.
 The similarity measure is usually expressed as a numerical value: it gets higher when the data samples are more alike.
 It is often expressed as a number between zero and one by conversion: zero means low similarity (the data objects are dissimilar), one means high similarity (the data objects are very similar).
Similarity and Dissimilarity Between Objects
 Distances are normally used to measure the similarity or dissimilarity between two data objects.
 A popular one is the Minkowski distance:

   d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)

 where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer.
 If q = 1, d is the Manhattan distance:

   d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
Similarity and Dissimilarity Between Objects (Cont.)
 If q = 2, d is the Euclidean distance:

   d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)

 Properties:
   d(i, j) >= 0
   d(i, i) = 0
   d(i, j) = d(j, i)
   d(i, j) <= d(i, k) + d(k, j)
 One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
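The Minkowski family above can be sketched as follows (a minimal illustration; the function name is my own):

```python
def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional points i and j."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

# q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
print(minkowski((1, 2), (4, 6), 1))  # 7.0
print(minkowski((1, 2), (4, 6), 2))  # 5.0
```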
Binary Variables
 A contingency table for binary data:

                    Object j
                    1        0        sum
   Object i   1     a        b        a+b
              0     c        d        c+d
             sum    a+c      b+d      p

 Distance measure for symmetric binary variables:

   d(i, j) = (b + c) / (a + b + c + d)

 Distance measure for asymmetric binary variables:

   d(i, j) = (b + c) / (a + b + c)

 Jaccard coefficient (similarity measure for asymmetric binary variables):

   sim_Jaccard(i, j) = a / (a + b + c)
Dissimilarity between Binary Variables

 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0

   d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
   d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
   d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
Nominal Variables
 A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
 Method 1: Simple matching
   m: # of matches, p: total # of variables

   d(i, j) = (p - m) / p

 Method 2: Use a large number of binary variables
   Create a new binary variable for each of the M nominal states.
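Simple matching can be sketched as:

```python
def simple_matching_dissim(i, j):
    """d(i, j) = (p - m) / p, where m = # of matching variables."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

# 2 of 3 variables match, so d = 1/3
print(simple_matching_dissim(["red", "small", "round"],
                             ["red", "large", "round"]))
```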
Ordinal Variables
 An ordinal variable can be discrete or continuous.
 Order is important, e.g., rank.
 Can be treated like interval-scaled:
   replace x_if by its rank r_if in {1, ..., M_f}
   map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

     z_if = (r_if - 1) / (M_f - 1)

   compute the dissimilarity using methods for interval-scaled variables.
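The rank mapping can be sketched as follows (the explicit `order` argument is my own addition, since alphabetical sorting would not respect the ordinal order):

```python
def ordinal_to_interval(values, order):
    """Map ordinal values onto [0, 1] via z_if = (r_if - 1) / (M_f - 1),
    where `order` lists the states from lowest to highest."""
    rank = {v: r + 1 for r, v in enumerate(order)}  # r_if in {1, ..., M_f}
    m_f = len(order)                                # M_f ordered states
    return [(rank[v] - 1) / (m_f - 1) for v in values]

print(ordinal_to_interval(["low", "high", "medium"],
                          order=["low", "medium", "high"]))  # [0.0, 1.0, 0.5]
```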
Ratio-Scaled Variables
 Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at an exponential scale, such as Ae^(Bt) or Ae^(-Bt).
 Methods:
   treat them like interval-scaled variables (not a good choice! the scale can be distorted)
   apply a logarithmic transformation: y_if = log(x_if)
   treat them as continuous ordinal data and treat their ranks as interval-scaled.
Variables of Mixed Types
 A database may contain all six types of variables:
   symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
 One may use a weighted formula to combine their effects:

   d(i, j) = ( sum_{f=1..p} delta_ij(f) * d_ij(f) ) / ( sum_{f=1..p} delta_ij(f) )

 If f is binary or nominal:
   d_ij(f) = 0 if x_if = x_jf, or d_ij(f) = 1 otherwise
 If f is interval-based: use the normalized distance
 If f is ordinal or ratio-scaled:
   compute the rank r_if, set z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled
Vector Objects
 Vector objects: keywords in documents, gene features in micro-arrays, etc.
 Broad applications: information retrieval, biological taxonomy, etc.
 Cosine measure

 A variant: Tanimoto coefficient
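Both measures can be sketched as follows (a minimal illustration for dense vectors):

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def tanimoto(x, y):
    """Tanimoto coefficient: (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

print(cosine_similarity([1, 1, 0], [1, 1, 1]))  # 2/sqrt(6), about 0.816
print(tanimoto([1, 1, 0], [1, 1, 1]))           # 2/3
```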


Clustering Algorithms
1) K-Means Algorithm

 The k-means algorithm is a partitioning method of clustering where each cluster's center is represented by the mean value of the objects in the cluster.
 Algorithm
1. Choose k, the number of clusters to be determined.
2. Choose k objects randomly as the initial cluster centers.
3. Repeat
   3.1 Assign each object to its closest cluster.
   3.2 Compute the new clusters, i.e., recalculate the mean points.
4. Until
   4.1 No change in cluster entities, OR
   4.2 No object changes its cluster.
K-Means Clustering: Details

 Initial centroids are often chosen randomly.
 Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the cluster.
 'Closeness' is measured mostly by Euclidean distance, cosine similarity, correlation, etc.
 K-means will converge for the common similarity measures mentioned above.
 Most of the convergence happens in the first few iterations.
 Often the stopping condition is changed to 'until relatively few points change clusters'.
 Complexity is O(n * K * I * d), where
   n = number of points, K = number of clusters,
   I = number of iterations, d = number of attributes.
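The algorithm above can be sketched as follows (a teaching illustration, not a production implementation; names are my own):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: random initial centers, assign each point to its
    nearest center, recompute centers as cluster means, stop on no change."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)       # step 2: random initial centers
    clusters = []
    for _ in range(max_iter):
        # step 3.1: assign each object to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        # step 3.2: recompute each center as the mean of its cluster
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])   # keep an empty cluster's center
        if new_centers == centers:               # step 4: stop when nothing changes
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(pts, 2)
```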
K-Means Algorithm Example
(the worked example was presented as figures in the original slides)
Pros and Cons of K-Means Clustering Algorithm
Pros
 Simple.
 Fast for low-dimensional data.
 It can find pure sub-clusters if a large number of clusters is specified.

Cons
 K-means cannot handle non-globular data or clusters of different sizes and densities.
 Unable to handle noisy data and outliers.
 It is sensitive to the initialization of the centroids (mean points). If a centroid is initialized to a far-away point, it might end up with no points associated with it.
 More than one centroid might be initialized into the same cluster, resulting in poor clustering.
2) K- means ++
 To overcome the initialization drawback of k-means, we use k-means++.
 This algorithm ensures a smarter initialization of the centroids and improves the quality of the clustering.
 After initialization, the rest of the algorithm is the same as the k-means algorithm.
K-means ++ algorithm
1. Choose k, the number of clusters.
2. Choose an initial centroid uniformly at random from the input data points.
3. For each data point, compute its squared distance from the nearest previously chosen centroid.
4. Select the next centroid from the data points such that the probability of choosing a point as a centroid is directly proportional to its squared distance from the nearest previously chosen centroid, i.e., the point having the maximum distance from the nearest centroid is most likely to be selected as the next centroid.
5. Repeat steps 3 and 4 until k centroids have been sampled.
6. Proceed as with the standard k-means algorithm.
K-Means++ Example:

 Consider the dataset {(7,4), (8,3), (5,9), (3,3), (1,3), (10,1)} and cluster it with the help of k-means++ (take k = 3).
Solution: The number of clusters to be created is k = 3. Initially choose a point randomly as the first cluster center; let it be the data point C1 = (7,4).
 First, find the squared distance from C1 to every point in the dataset (X).
Step 1:
 d(C1, X1)^2 = d^2{(7,4), (7,4)} = 0
 d(C1, X2)^2 = d^2{(7,4), (8,3)} = 2
 d(C1, X3)^2 = d^2{(7,4), (5,9)} = 29
 d(C1, X4)^2 = d^2{(7,4), (3,3)} = 17
 d(C1, X5)^2 = d^2{(7,4), (1,3)} = 37
 d(C1, X6)^2 = d^2{(7,4), (10,1)} = 18
Step 2:
 Now compute the probability of each data point being chosen as the next centroid (proportional to its squared distance from C1; the total is 103):

Data point (X)   Probability
(7,4)            0
(8,3)            2/103 = 0.0194
(5,9)            29/103 = 0.2815
(3,3)            17/103 = 0.1650
(1,3)            37/103 = 0.3592
(10,1)           18/103 = 0.1747

 The probability is highest for (1,3), so the next centroid is C2 = (1,3).
Step 3:
 Now find, for each data point, the squared distance to the nearest of C1 and C2. (The step-3 distance table was shown as a figure in the original slides; d(C1, X1) = 0, so X1 remains with C1.)
 The squared distance to the nearest centroid is now highest for (5,9), so the next centroid is C3 = (5,9).
 Now we have the centroid for the first cluster C1 = (7,4), the centroid for the second cluster C2 = (1,3), and the centroid for the third cluster C3 = (5,9).
 All data points are then assigned to these three centroids using the standard k-means algorithm.
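The seeding in this worked example always picks the point with the maximum squared distance, so it can be reproduced deterministically (standard k-means++ instead samples with probability proportional to D^2):

```python
def kmeanspp_seeds(points, k, first=0):
    """Seed k centroids as in the worked example above: repeatedly take
    the point with the maximum squared distance D^2 to its nearest chosen
    centroid.  `first` is the index of the initial centroid."""
    centroids = [points[first]]
    while len(centroids) < k:
        # D^2(p): squared distance to the nearest already-chosen centroid
        def d2(p):
            return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        centroids.append(max(points, key=d2))
    return centroids

data = [(7, 4), (8, 3), (5, 9), (3, 3), (1, 3), (10, 1)]
print(kmeanspp_seeds(data, 3))  # [(7, 4), (1, 3), (5, 9)]
```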
3) Mini-Batch K-Means

 The mini-batch k-means clustering algorithm is a variant of the standard k-means algorithm.
 The primary idea behind the mini-batch k-means method is to create small random batches of data of a predetermined size so that they can be kept in memory.
 Each iteration obtains a fresh random sample from the dataset, uses it to update the clusters, and the process continues until convergence.
Mini-Batch K-Means Algorithm
(the algorithm pseudocode was shown as a figure in the original slides)

Why is mini-batch k-means better than normal k-means?
 Mini-batch k-means is considered better than the normal k-means algorithm because it uses mini-batches of samples rather than the entire dataset, which can make it faster for large datasets, and perhaps more robust to statistical noise.
 Mini-batch k-means is faster but gives slightly different results than normal k-means.
 As the number of clusters and the amount of data increase, the relative saving in computational time also increases. The saving in computational time is more noticeable only when the number of clusters is very large.
 The effect of the batch size on the computational time is also more evident when the number of clusters is larger. Increasing the number of clusters decreases the similarity of the mini-batch k-means solution to the k-means solution.
 Although the agreement between the partitions decreases as the number of clusters increases, the objective function does not degrade at the same rate. This means that the final partitions are different, but close in quality.
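A minimal sketch of the mini-batch update, assuming the common streaming rule where each center moves toward its assigned batch points with a per-center learning rate of 1/count (details vary between implementations):

```python
import random

def mini_batch_kmeans(points, k, batch_size=4, n_iter=100, seed=1):
    """Mini-batch k-means sketch: each iteration draws a random batch and
    nudges each point's nearest center toward it; the 1/count learning
    rate makes a center the running mean of all points ever assigned to it."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            counts[j] += 1
            eta = 1.0 / counts[j]                  # per-center learning rate
            centers[j] = [(1 - eta) * cc + eta * pc
                          for cc, pc in zip(centers[j], p)]
    return [tuple(c) for c in centers]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = mini_batch_kmeans(pts, 2)
```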
4) K-Medoids

 K-medoids clustering is an alternative to k-means that is more robust to noise and outliers.
 Instead of using the mean point as the center of a cluster, k-medoids uses an actual point in the cluster to represent it.
 A medoid is the most centrally located object of the cluster.
Algorithm
 K-medoids PAM (Partitioning Around Medoids) is a k-medoids algorithm for partitioning based on medoid or central objects.
Input:
 k: the number of clusters.
 D: a data set containing n objects.
Output: A set of k clusters.
Method:
 Arbitrarily choose k objects in D as the initial representative objects.
 Repeat
   Assign each remaining object to the cluster with the nearest representative object.
   For each representative object Oj:
     Randomly select a non-representative object, Orandom.
     Compute the total cost, S, of swapping representative object Oj with Orandom.
     If S < 0, swap Oj with Orandom to form the new set of k representative objects.
 Until no change.
For a given k = 2, cluster the following data set.

Point   X   Y
1       7   6
2       2   6
3       3   8
4       8   5
5       7   4
6       4   7
7       6   2
8       7   3
9       6   4
10      3   4

Initialize two cluster representatives from the given data set. Let us choose data point 5 as the first medoid, K1 = (7, 4), and data point 10 as the second medoid, K2 = (3, 4). Suppose we use the Manhattan distance metric as the distance measure.
Now, calculate the distance of each data point from each cluster representative.

J    Data obj   D1 = |x-7|+|y-4|   D2 = |x-3|+|y-4|   Cost
1    (7, 6)     0+2 = 2            4+2 = 6            C1 = 2
2    (2, 6)     7                  3                  C2 = 3
3    (3, 8)     8                  4                  C2 = 4
4    (8, 5)     2                  6                  C1 = 2
5    (7, 4)     0                  4+0 = 4            C1 = 0
6    (4, 7)     6                  4                  C2 = 4
7    (6, 2)     3                  5                  C1 = 3
8    (7, 3)     1                  5                  C1 = 1
9    (6, 4)     1                  3                  C1 = 1
10   (3, 4)     4+0 = 4            0                  C2 = 0

Based on the distances shown in the table above, the initial clusters are
C1 = {(7,6), (8,5), (7,4), (6,2), (7,3), (6,4)}
C2 = {(2,6), (3,8), (4,7), (3,4)}
Now calculate the total cost: sum of d(j, nearest medoid) = 2+3+4+2+0+4+3+1+1+0 = 20.
34
Now, choose some other point to be a mediod lets take data point 8 as new mediod
K3=(7, 3) and repeat the same steps as earlier we obtain the following table
J Data obj C1=(7,3) C2=(3,4) cost
D1=|x-7| +|y-3| D2=|x-3| +|y-4|

1 7, 6 3 6 C1=3
2 2, 6 8 3 C2=3
3 3, 8 9 4 C2=4
4 8, 5 3 6 C1=3
5 7, 4 1 4+0=4 C1=1
6 4, 7 7 4 C2=4
7 6, 2 2 5 C1=2
8 7, 3 0 5 C1=0
9 6, 4 2 3 C1=2
10 3, 4 5 0 C2=0
Based on the distance shown in table above our initial cluster are
C1={ (7,6), (8, 5), (7, 4), (6, 2), (7, 3), (6, 4)}
C2={(2, 6), (3, 8), (4,7), (3,4)}
Now calculating the distance Σd(j, previous mediods) = 3+3+4+3+1+4+2+0+2+0=22
 Now compute the total cost, S, of swapping the medoid (7, 4) with Orandom = (7, 3):
   Total cost S = cost with new medoid - cost with old medoid = 22 - 20 = 2.
 Here, the total cost of swapping K1 with Orandom is 2, which is greater than zero, i.e., S > 0, so K1 should not be exchanged with Orandom: the total cost when (7,3) is the medoid is greater than the total cost when (7,4) was the medoid.
 Hence (7, 4) should be kept instead of (7, 3) as the medoid. The clusters finally obtained are:
 C1 = {(7,6), (8,5), (7,4), (6,2), (7,3), (6,4)}
 C2 = {(2,6), (3,8), (4,7), (3,4)}
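The two cost totals in this example can be checked with a short sketch:

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    """Sum over all points of the Manhattan distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

data = [(7, 6), (2, 6), (3, 8), (8, 5), (7, 4),
        (4, 7), (6, 2), (7, 3), (6, 4), (3, 4)]

old = total_cost(data, [(7, 4), (3, 4)])   # 20, medoids K1 and K2
new = total_cost(data, [(7, 3), (3, 4)])   # 22, after the trial swap
s = new - old                              # swap cost S = 2 > 0: reject the swap
print(old, new, s)
```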
Hierarchical clustering
 Partitioning clustering algorithms always try to make clusters of similar size, and we have to decide the number of clusters at the beginning of the algorithm.
 Hierarchical clustering creates a hierarchy of clusters and therefore does not require a predefined number of clusters.
 A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then it repeatedly executes the following steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most comparable clusters. These steps continue until all the clusters are merged together.
 Depending on whether it works bottom-up or top-down, it is classified into two categories:
 Agglomerative hierarchical clustering
 Divisive hierarchical clustering
Agglomerative hierarchical clustering
 Agglomerative clustering is one of the most common types of hierarchical clustering used to group similar objects into clusters.
 Agglomerative clustering is also known as AGNES (Agglomerative Nesting).
 In agglomerative clustering, each data point acts as an individual cluster, and at each step data objects are grouped in a bottom-up manner.
 Initially, each data object is in its own cluster. At each iteration, the clusters are combined with other clusters until one cluster is formed.
Agglomerative Hierarchical Clustering Algorithm
(the algorithm and worked example were shown as figures in the original slides)
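An agglomerative (AGNES-style) merge loop can be sketched as follows (single linkage by default; a naive illustration, not an efficient implementation):

```python
def agglomerative(points, k, linkage=min):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the two closest ones until k clusters remain.
    `linkage=min` gives single linkage; pass `max` for complete linkage."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        # find the pair of clusters with the smallest linkage distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist2(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

out = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], 3)
print(out)
```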
Divisive hierarchical clustering Algorithm
 Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering.
 In divisive hierarchical clustering, all the data points are initially considered one single cluster, and in every iteration the data points that are not similar are separated from the cluster.
 The separated data points are treated as individual clusters. Finally, we are left with N clusters.
Divisive Hierarchical Clustering Algorithm
(shown as a figure in the original slides)
Advantages and Disadvantages of Hierarchical Clustering
(shown as a figure in the original slides)
Density-Based Clustering (DBSCAN)

 Partitioning and hierarchical methods are suitable for finding spherical-shaped clusters. They are also affected by the presence of outliers.
 DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
 It depends on a density-based notion of a cluster.
 It can also identify clusters of arbitrary shape in a spatial database with outliers.
DBSCAN
The DBSCAN algorithm requires two parameters:
eps: It defines the neighborhood around a data point, i.e., if the distance between two points is less than or equal to 'eps' then they are considered neighbors. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, the clusters will merge and the majority of the data points will end up in the same cluster.
MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1. The minimum value of MinPts should be at least 3.
DBSCAN
In this algorithm, there are three types of data points:
1. Core point: A point is a core point if it has at least MinPts points within its eps-neighborhood.
2. Border point: A point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
3. Noise or outlier: A point that is neither a core point nor a border point.
DBSCAN Basic terminology:
Neighborhood: The neighborhood of a data point P is the set of all data points within distance eps of P.
Directly density-reachable: A point p is directly density-reachable from a point q with respect to eps and MinPts if p belongs to the neighborhood of q and q satisfies the core-point condition.
Density-reachable: A data point p is density-reachable from a data point q with respect to eps and MinPts if there is a chain of points p1, p2, ..., pn, where p1 = q and pn = p, such that p(i+1) is directly density-reachable from pi.
 For example, a data point p is density-reachable from the data point q if there is a chain of points q, p1, p, where p1 is directly density-reachable from q and p is directly density-reachable from p1.
Density-connected: A data point p is density-connected to a point q with respect to eps and MinPts if there is a data point o such that both p and q are density-reachable from o with respect to eps and MinPts.
 For example, points p and q are density-connected if both are density-reachable from a common data point o.
DBSCAN: The Algorithm
 Arbitrarily select a point p.
 Retrieve all points density-reachable from p w.r.t. eps and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
 Continue the process until all of the points have been processed.
DBSCAN Example:

 If eps = 2 and MinPts = 2, what are the clusters that DBSCAN would discover in the following data set?

Data point   X   Y
A1           2   10
A2           2   5
A3           8   4
A4           5   8
A5           7   5
A6           6   4
A7           1   2
A8           4   9

Solution: To find the clusters by DBSCAN, we first calculate the distance between every pair of data points. Let us use the Euclidean distance measure.

Data   A1     A2     A3     A4     A5     A6     A7     A8
A1     0      5      6      3.6    7.07   7.21   8.06   2.23
A2     5      0      6.08   4.24   5      4.12   3.16   4.47
A3     6      6.08   0      5      1.41   1.41   7.28   6.4
A4     3.6    4.24   5      0      3.6    4.12   7.21   1.41
A5     7.07   5      1.41   3.6    0      1.41   6.7    5
A6     7.21   4.12   1.41   4.12   1.41   0      5.38   5.38
A7     8.06   3.16   7.28   7.21   6.7    5.38   0      7.61
A8     2.23   4.47   6.4    1.41   5      5.38   7.61   0

Now we find the clusters. Select point A1 first and find all data points that lie in the eps-neighborhood of A1. Since no other point has distance <= 2 from A1, the eps-neighborhood of A1 is empty.
Similarly, A2 has no other point within distance <= 2, so the eps-neighborhood of A2 is also empty.
A3 has two data points within distance <= 2, namely A5 and A6, so N(A3) = {A5, A6}.
A4 has one neighbor: N(A4) = {A8}.
A5 has two neighbors: N(A5) = {A3, A6}.
A6 has two neighbors: N(A6) = {A3, A5}.
A7 has no neighbors.
A8 has one neighbor: N(A8) = {A4}.
 Now, clusters are constructed on the basis of MinPts. Since MinPts = 2, we get points A4 and A8 within cluster C1, and A3, A5, and A6 within cluster C2.
 Since the data points A1, A2, and A7 have no neighbors, i.e., they neither belong to the neighborhood of any core point nor are core points themselves, these points are outliers.

# Practice Question
If epsilon = 2 and MinPts = 2, what are the core points, border points, and outliers that DBSCAN would find in the data set A(3,10), B(2,3), C(3,4), D(6,7), and E(7,6)?
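The worked example above can be reproduced with a DBSCAN sketch (here a point's eps-neighborhood is counted as including the point itself, which matches the clusters found above):

```python
def dbscan(points, eps, min_pts):
    """DBSCAN sketch: a point is a core point when its eps-neighborhood
    (including itself) holds at least min_pts points.
    Returns (clusters, outliers)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def neighbors(p):
        return [q for q in points if dist(p, q) <= eps]  # includes p itself

    labels = {}                      # point -> cluster id
    cid = 0
    for p in points:
        if p in labels or len(neighbors(p)) < min_pts:
            continue                 # skip labeled and non-core points
        cid += 1                     # grow a new cluster from core point p
        frontier = [p]
        while frontier:
            q = frontier.pop()
            if q in labels:
                continue
            labels[q] = cid
            if len(neighbors(q)) >= min_pts:   # expand only through core points
                frontier.extend(n for n in neighbors(q) if n not in labels)
    clusters = {}
    for p, c in labels.items():
        clusters.setdefault(c, set()).add(p)
    outliers = {p for p in points if p not in labels}
    return list(clusters.values()), outliers

data = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
clusters, outliers = dbscan(data, eps=2, min_pts=2)
print(clusters, outliers)
```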
Advantages and Dis-advantages of DBSCAN
Advantages
1) Does not require a priori specification of the number of clusters.
2) Able to identify noise data while clustering.
3) DBSCAN is able to find clusters of arbitrary size and shape.

Disadvantages
1) DBSCAN fails in the case of varying-density clusters.
2) Fails in the case of "neck"-type datasets.
3) Does not work well with high-dimensional data.
Outlier Analysis
 Outlier analysis is the process of identifying outliers, or abnormal observations, in a dataset.
 An outlier is an element of a data set that distinctly stands out from the rest of the data. In other words, outliers are data points that lie outside the overall pattern of the distribution.
 Outlier analysis is also known as outlier detection.
 Outlier analysis is an important step in data analysis, as it removes erroneous or inaccurate observations which might otherwise skew conclusions.
 Outlier detection can be used in credit card fraud detection, telecom fraud detection, customer segmentation, and medical analysis.
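As one simple illustration (my own, not from the slides), outliers can be flagged with a z-score test:

```python
def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds `threshold`,
    i.e. values that stand out from the overall distribution."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return [x for x in values if abs((x - mean) / std) > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 95]
print(zscore_outliers(data, threshold=2.0))  # [95]
```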