04 LEC Data Science Kmeans


Data Science (CO314)

By:
Nidhi S. Periwal,
SVNIT, Surat.
Outline
• K-means clustering
• Reference book: "Data Mining: Concepts and Techniques", 3rd Edition, by Jiawei Han and Micheline Kamber
• K-means based on MapReduce
• Reference material: Weizhong Zhao et al., "Parallel K-Means Clustering Based on MapReduce", LNCS, Springer, 2009
Clustering
• The process of grouping a set of data objects into multiple groups, or clusters, so that objects within a cluster have high similarity but are very dissimilar to objects in other clusters.
• Clustering is also called data segmentation in some applications, because clustering partitions large data sets into groups according to their similarity.
• Also used for outlier detection.
Clustering
• Techniques of Clustering
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
Clustering
• Techniques of Clustering
• Partitioning methods
• K-means clustering
K-means Algorithm
• K-means clustering first selects k objects from the whole data set to serve as the initial cluster centers.
• Each remaining object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster center.
• The new mean of each cluster is then calculated.
• The last two steps are repeated until the criterion function converges (a short code sketch is given below).
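A minimal sketch of this procedure in plain Python, assuming one-dimensional numeric data and absolute (Euclidean) distance as in the worked example that follows; the function name kmeans_1d and the stopping test on unchanged means are illustrative choices, not part of the lecture material.

# Minimal 1-D K-means sketch (illustrative).
def kmeans_1d(data, initial_centers, max_iter=100):
    centers = list(initial_centers)
    for _ in range(max_iter):
        # Assignment step: each sample goes to the cluster of its closest center.
        clusters = [[] for _ in centers]
        for x in data:
            idx = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[idx].append(x)
        # Update step: recompute each center as the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:   # criterion: the means did not change
            break
        centers = new_centers
    return centers, clusters

# The slide example: Data = {2,3,4,10,11,12,20,25,30}, k = 2, m1 = 4, m2 = 12.
print(kmeans_1d([2, 3, 4, 10, 11, 12, 20, 25, 30], [4, 12]))
# -> ([7.0, 25.0], [[2, 3, 4, 10, 11, 12], [20, 25, 30]])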
K-means Algorithm
• K-means clustering first selects k objects from the whole data set to serve as the initial cluster centers.
• E.g.:
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}
• Let k = 2
• Randomly select 4 and 12 as the initial cluster centers
• So, let m1 = 4 and m2 = 12
K-means Algorithm
• K-means clustering first selects k objects from the whole data set to serve as the initial cluster centers.
• E.g.: ATTRIBUTE: 2, 3, 4, 10, 11, 12, 20, 25, 30
K-means Algorithm
• K-means clustering first selects k objects from the whole data set to serve as the initial cluster centers.
• Each remaining object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster center.
K-means Algorithm
• Data = {2}
• Dist1(2, 4) = sqrt((2 - 4)^2) = 2
• Dist2(2, 12) = sqrt((2 - 12)^2) = 10
• Compare Dist1 with Dist2:
• If Dist1 < Dist2, put sample i in Cluster1
• Else, put sample i in Cluster2

Cluster1 with m1 = 4: 2
Cluster2 with m2 = 12: (empty)
K-means Algorithm
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

Cluster1 with m1 = 4: 2, 3, 4
Cluster2 with m2 = 12: 10, 11, 12, 20, 25, 30
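The same assignment can be reproduced with a few lines of plain Python (an illustrative sketch, reusing the slide's data and the initial centers m1 = 4 and m2 = 12):

# Assignment step for the initial centers (illustrative sketch).
data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
m1, m2 = 4, 12
cluster1 = [x for x in data if abs(x - m1) < abs(x - m2)]
cluster2 = [x for x in data if abs(x - m1) >= abs(x - m2)]
print(cluster1)   # [2, 3, 4]
print(cluster2)   # [10, 11, 12, 20, 25, 30]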
K-means Algorithm
• K-means clustering first selects k objects from the whole data set to serve as the initial cluster centers.
• Each remaining object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster center.
• The new mean of each cluster is then calculated.
K-means Algorithm
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

Cluster1 with m1 = 4: 2, 3, 4
Cluster2 with m2 = 12: 10, 11, 12, 20, 25, 30

New means:
m1 = (2 + 3 + 4)/3 = 3
m2 = (10 + 11 + 12 + 20 + 25 + 30)/6 = 18
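The corresponding update step in plain Python (an illustrative sketch of the mean computation only):

# Update step: recompute each center as the mean of its cluster (illustrative).
cluster1 = [2, 3, 4]
cluster2 = [10, 11, 12, 20, 25, 30]
m1 = sum(cluster1) / len(cluster1)   # (2 + 3 + 4) / 3 = 3.0
m2 = sum(cluster2) / len(cluster2)   # (10 + 11 + 12 + 20 + 25 + 30) / 6 = 18.0
print(m1, m2)                        # 3.0 18.0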
K-means Algorithm
• ITERATION 2:
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

Cluster1 with m1 = 3: 2, 3, 4, 10
Cluster2 with m2 = 18: 11, 12, 20, 25, 30

New means:
m1 = (2 + 3 + 4 + 10)/4 = 4.75 ≈ 5
m2 = (11 + 12 + 20 + 25 + 30)/5 = 19.6 ≈ 20
K-means Algorithm
• ITERATION 3:
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

Cluster1 with m1 = 5: 2, 3, 4, 10, 11, 12
Cluster2 with m2 = 20: 20, 25, 30

New means:
m1 = (2 + 3 + 4 + 10 + 11 + 12)/6 = 7
m2 = (20 + 25 + 30)/3 = 25
K-means Algorithm
• ITERATION 4:
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

Cluster1 with m1 = 7: 2, 3, 4, 10, 11, 12
Cluster2 with m2 = 25: 20, 25, 30

New means:
m1 = 7
m2 = 25
• As the mean values do not change after the fourth iteration, the final clusters are formed: Cluster1 = {2, 3, 4, 10, 11, 12} and Cluster2 = {20, 25, 30}.
K-means with MapReduce
• The map function performs the procedure of assigning each sample to its closest center.
• The reduce function performs the procedure of updating the new centers.
• To decrease the cost of network communication, a combiner function is developed to perform a partial combination of the intermediate values with the same key within the same map task.
K-means with MapReduce
• Algo: Mapper
• Input: global variable centers, the offset key, the sample value
• Output: <key', value'> pair, where key' is the index of the closest center point and value' is a string comprised of the sample information
K-means with MapReduce
Algo: Mapper
1. Construct the sample instance from value;
2. minDis = Double.MAX_VALUE;
3. index = -1;
4. For i = 0 to centers.length do
       dis = ComputeDist(instance, centers[i]);
       If dis < minDis {
           minDis = dis;
           index = i; }
5. End For
6. Take index as key';
7. Construct value' as a string comprised of the values of the different dimensions;
8. Output <key', value'> pair;
9. End
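A plain-Python sketch of this map step for the one-dimensional slide data (illustrative only, not Hadoop code; the name kmeans_map and the use of a generator are assumptions):

# Map step sketch: emit (index of the closest center, sample) for each sample.
def kmeans_map(samples, centers):
    for x in samples:
        index = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        yield index, x

# One map task over the first part of the slide data, centers m1 = 4, m2 = 12.
print(list(kmeans_map([2, 3, 4, 10], [4, 12])))
# -> [(0, 2), (0, 3), (0, 4), (1, 10)]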
K-means with MapReduce
Algo: Combiner
• Input: key is the index of the cluster, V is the list of the samples assigned to the same cluster
• Output: <key', value'> pair,
• where key' is the index of the cluster,
• value' is a string comprised of the sum of the samples in the same cluster and the sample number
K-means with MapReduce
Algo: Combiner
1. Initialize one array to record the sum of the values of each dimension of the samples contained in the same cluster, i.e. the samples in the list V;
2. Initialize a counter num as 0 to record the number of samples in the same cluster;
3. while (V.hasNext()) {
       Construct the sample instance from V.next();
       Add the values of the different dimensions of instance to the array;
       num++; }
4. Take key as key';
5. Construct value' as a string comprised of the sum values of the different dimensions and num;
6. Output <key', value'> pair;
7. End
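A matching plain-Python sketch of the combiner (illustrative; kmeans_combine is an assumed name): for one map task it collapses the map output into a partial sum and a sample count per cluster.

# Combiner sketch: per map task, reduce each cluster's samples to (sum, count).
from collections import defaultdict

def kmeans_combine(mapped_pairs):
    partial = defaultdict(lambda: [0.0, 0])
    for index, x in mapped_pairs:
        partial[index][0] += x    # running sum of the sample values
        partial[index][1] += 1    # running count of the samples
    return {index: (s, n) for index, (s, n) in partial.items()}

# Map output of the first task (samples 2, 3, 4, 10 with centers 4 and 12).
print(kmeans_combine([(0, 2), (0, 3), (0, 4), (1, 10)]))
# -> {0: (9.0, 3), 1: (10.0, 1)}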
K-means with MapReduce
Algo: Reducer
• Input: key is the index of the cluster, V is the list of the partial sums from different hosts
• Output: <key', value'> pair,
• where key' is the index of the cluster,
• value' is a string representing the new center
K-means with MapReduce
Algo: Reducer
1. Initialize one array to record the sum of the values of each dimension of the samples contained in the same cluster, i.e. the partial sums in the list V;
2. Initialize a counter NUM as 0 to record the total number of samples in the same cluster;
3. while (V.hasNext()) {
       Construct the sample instance from V.next();
       Add the values of the different dimensions of instance to the array;
       NUM += num; }
4. Divide the entries of the array by NUM to get the new center's coordinates;
5. Take key as key';
6. Construct value' as a string comprised of the new center's coordinates;
7. Output <key', value'> pair;
8. End
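Finally, a plain-Python sketch of the reducer (illustrative; kmeans_reduce is an assumed name): it merges the partial sums and counts coming from the combiners of different map tasks and emits the new centers. With the slide data split across two map tasks and the initial centers 4 and 12, the result matches iteration 1 of the worked example (new means 3 and 18).

# Reducer sketch: merge partial (sum, count) pairs and compute the new centers.
from collections import defaultdict

def kmeans_reduce(partials):
    totals = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for index, (s, n) in part.items():
            totals[index][0] += s
            totals[index][1] += n
    return {index: s / n for index, (s, n) in totals.items()}

# Combiner outputs of two map tasks over the slide data, centers 4 and 12:
partials = [{0: (9.0, 3), 1: (10.0, 1)},   # from samples [2, 3, 4, 10]
            {1: (98.0, 5)}]                # from samples [11, 12, 20, 25, 30]
print(kmeans_reduce(partials))
# -> {0: 3.0, 1: 18.0}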
THANK YOU!!!
