04 LEC Data Science Kmeans

This document provides an outline for a lecture on K-means clustering and K-means based on MapReduce. It begins with an introduction to clustering techniques including partitioning, hierarchical, density-based, and grid-based methods. It then describes the standard K-means algorithm and provides a step-by-step example. Finally, it outlines how K-means can be implemented using MapReduce, including the map, combiner, and reduce functions.


Data Science (CO314)

By:
Nidhi S. Periwal,
SVNIT, Surat.
Outline
• K-means clustering
  Reference book: Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd edition.
• K-means based on MapReduce
  Reference material: Weizhong Zhao et al., "Parallel K-Means Clustering Based on MapReduce", LNCS, Springer, 2009.
Clustering
• The process of grouping a set of data objects into multiple groups, or
clusters, so that objects within a cluster have high similarity but are
very dissimilar to objects in other clusters.
• Clustering is also called data segmentation in some applications,
because it partitions large data sets into groups according to
their similarity.
• Also used for outlier detection.
Clustering
• Techniques of Clustering
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid- based methods
Clustering
• Techniques of Clustering
• Partitioning methods
• K-means clustering
K-means Algorithm
• K-means clustering first selects k objects from the data set to
serve as the initial cluster centers.
• Each remaining object is assigned to the cluster to which it is
most similar, based on the distance between the object and the
cluster center.
• The new mean for each cluster is then calculated.
• The last two steps iterate until the criterion function converges.
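The steps above can be sketched as a short Python function (an illustrative sketch, not code from the lecture; the name `kmeans_1d` and the restriction to one-dimensional data are assumptions made for clarity):

```python
def kmeans_1d(data, centers, max_iter=100):
    """Run K-means on 1-D data from the given initial centers."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each object goes to the nearest center.
        clusters = [[] for _ in centers]
        for x in data:
            idx = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[idx].append(x)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:   # criterion function has converged
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans_1d([2, 3, 4, 10, 11, 12, 20, 25, 30], [4, 12])
```

On the lecture's example data with initial centers 4 and 12, this converges to means 7 and 25, matching the worked example on the following slides.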
K-means Algorithm
• K-means clustering first selects k objects from the data set to
serve as the initial cluster centers.
• Eg:
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}
• Let k = 2
• Randomly select 4 and 12 as the initial cluster centers
• So, let m1 = 4 and m2 = 12
K-means Algorithm
• K-means clustering first selects k objects from the data set to
serve as the initial cluster centers.
• Eg: nine objects, each with a single attribute:
  2, 3, 4, 10, 11, 12, 20, 25, 30
K-means Algorithm
• Each remaining object is assigned to the cluster to which it is
most similar, based on the distance between the object and the
cluster center.
• Eg: for the sample 2:
  dist1(2, 4) = sqrt((2 - 4)^2) = 2
  dist2(2, 12) = sqrt((2 - 12)^2) = 10
• Compare dist1 with dist2: if dist1 < dist2, put the sample in
Cluster1; else put it in Cluster2.
• Since 2 < 10, sample 2 goes to Cluster1 (m1 = 4).
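The comparison on this slide can be checked directly (a minimal sketch; in one dimension the Euclidean distance reduces to the absolute difference):

```python
from math import sqrt

x, m1, m2 = 2, 4, 12          # the sample and the two current centers
dist1 = sqrt((x - m1) ** 2)   # distance to Cluster1's center -> 2.0
dist2 = sqrt((x - m2) ** 2)   # distance to Cluster2's center -> 10.0
cluster = 1 if dist1 < dist2 else 2
```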
K-means Algorithm
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

  Cluster1 (m1 = 4): 2, 3, 4
  Cluster2 (m2 = 12): 10, 11, 12, 20, 25, 30
K-means Algorithm
• K-means clustering first selects k objects from the data set to
serve as the initial cluster centers.
• Each remaining object is assigned to the cluster to which it is
most similar, based on the distance between the object and the
cluster center.
• The new mean for each cluster is then calculated.
K-means Algorithm
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

  Cluster1: 2, 3, 4 → m1 = (2 + 3 + 4)/3 = 3
  Cluster2: 10, 11, 12, 20, 25, 30 → m2 = (10 + 11 + 12 + 20 + 25 + 30)/6 = 18
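The mean update on this slide can be verified in a couple of lines (a sketch assuming plain Python lists for the two clusters):

```python
cluster1 = [2, 3, 4]
cluster2 = [10, 11, 12, 20, 25, 30]
m1 = sum(cluster1) / len(cluster1)   # new mean of Cluster1 -> 3.0
m2 = sum(cluster2) / len(cluster2)   # new mean of Cluster2 -> 18.0
```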
K-means Algorithm
• ITERATION 2:
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

  Cluster1 (m1 = 3): 2, 3, 4, 10 → m1 = (2 + 3 + 4 + 10)/4 ≈ 5
  Cluster2 (m2 = 18): 11, 12, 20, 25, 30 → m2 = (11 + 12 + 20 + 25 + 30)/5 ≈ 20
K-means Algorithm
• ITERATION 3:
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

  Cluster1 (m1 = 5): 2, 3, 4, 10, 11, 12 → m1 = 7
  Cluster2 (m2 = 20): 20, 25, 30 → m2 = 25
K-means Algorithm
• ITERATION 4:
• Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

  Cluster1 (m1 = 7): 2, 3, 4, 10, 11, 12 → m1 = 7
  Cluster2 (m2 = 25): 20, 25, 30 → m2 = 25
• As the mean values are unchanged after the fourth iteration, the algorithm has converged and the final clusters are formed.
K-means with MapReduce
• The map function assigns each sample to the closest center.
• The reduce function updates the new centers.
• To decrease the cost of network communication, a combiner
function performs a partial combination of the intermediate values
with the same key within the same map task.
K-means with MapReduce
• Algo: Mapper
• Input: the global variable centers, the offset key, the sample value
• Output: <key', value'> pair, where key' is the index of the closest
center and value' is a string comprising the sample information
K-means with MapReduce
Algo: Mapper
1. Construct the sample instance from value;
2. minDis = Double.MAX_VALUE;
3. index = -1;
4. For i = 0 to centers.length - 1 do
     dis = ComputeDist(instance, centers[i]);
     If dis < minDis then {
       minDis = dis;
       index = i; }
   End For
5. Take index as key';
6. Construct value' as a string comprising the values of the different dimensions;
7. Output <key', value'> pair;
8. End
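The mapper pseudocode translates to plain Python roughly as follows (a sketch using an ordinary function rather than the Hadoop API; the name `kmeans_map` and the comma-separated sample format are assumptions):

```python
def kmeans_map(centers, value):
    """Emit (index of closest center, sample string) for one input record."""
    instance = [float(v) for v in value.split(",")]  # construct sample from value
    min_dis, index = float("inf"), -1
    for i, center in enumerate(centers):
        # Euclidean distance between the sample and this center
        dis = sum((a - b) ** 2 for a, b in zip(instance, center)) ** 0.5
        if dis < min_dis:
            min_dis, index = dis, i
    # key' = index of the closest center, value' = the sample as a string
    return index, ",".join(str(v) for v in instance)

key, val = kmeans_map([[4.0], [12.0]], "2")   # sample 2 -> closest to center 4
```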
K-means with MapReduce
Algo: Combiner
• Input: key is the index of the cluster, V is the list of the samples
assigned to that cluster
• Output: <key', value'> pair,
  • where key' is the index of the cluster,
  • value' is a string comprising the per-dimension sum of the samples in the
cluster and the number of samples
K-means with MapReduce
Algo: Combiner
1. Initialize an array to record the sum of the values of each dimension of the
samples in the same cluster, i.e. the samples in the list V;
2. Initialize a counter num as 0 to record the number of samples in the
cluster;
3. while (V.hasNext()) {
     Construct the sample instance from V.next();
     Add the values of the different dimensions of instance to the array;
     num++; }
4. Take key as key';
5. Construct value' as a string comprising the per-dimension sums and num;
6. Output <key', value'> pair;
7. End
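A runnable sketch of the combiner logic (plain Python, not the Hadoop API; the "sums;count" output format is an assumption made so the partial results stay strings):

```python
def kmeans_combine(key, values):
    """Partially combine samples with the same key inside one map task."""
    sums, num = None, 0
    for value in values:
        instance = [float(v) for v in value.split(",")]
        if sums is None:
            sums = [0.0] * len(instance)   # one slot per dimension
        for d, v in enumerate(instance):
            sums[d] += v                   # per-dimension running sum
        num += 1
    # value' = per-dimension sums plus the number of samples combined
    return key, ",".join(str(s) for s in sums) + ";" + str(num)

out = kmeans_combine(0, ["2.0", "3.0", "4.0"])   # three 1-D samples
```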
K-means with MapReduce
Algo: Reducer
• Input: key is the index of the cluster, V is the list of the partial sums from
different hosts
• Output: <key', value'> pair,
  • where key' is the index of the cluster,
  • value' is a string representing the new center
K-means with MapReduce
Algo: Reducer
1. Initialize an array to record the sum of the values of each dimension of the
samples in the same cluster, i.e. the partial sums in the list V;
2. Initialize a counter NUM as 0 to record the total number of samples in the
cluster;
3. while (V.hasNext()) {
     Construct the partial sum instance from V.next();
     Add the values of the different dimensions of the partial sum to the array;
     NUM += num; }
4. Divide each entry of the array by NUM to get the new center's coordinates;
5. Take key as key';
6. Construct value' as a string comprising the center's coordinates;
7. Output <key', value'> pair;
8. End
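A sketch of the reducer in the same style (plain Python; it assumes each value in V is a "sums;count" string, a format chosen here for illustration):

```python
def kmeans_reduce(key, values):
    """Merge partial (sums, count) pairs and emit the new cluster center."""
    sums, total = None, 0
    for value in values:
        part, num = value.split(";")
        partial = [float(v) for v in part.split(",")]
        if sums is None:
            sums = [0.0] * len(partial)
        for d, v in enumerate(partial):
            sums[d] += v                   # merge per-dimension partial sums
        total += int(num)                  # NUM += num
    center = [s / total for s in sums]     # new center's coordinates
    return key, ",".join(str(c) for c in center)

# Two partial results for cluster 0: sums 9.0 (3 samples) and 33.0 (3 samples)
out = kmeans_reduce(0, ["9.0;3", "33.0;3"])   # (9 + 33) / 6 = 7.0
```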
THANK YOU!!!
