Lecture 25 K Means Clustering

The document discusses the k-means clustering algorithm with MapReduce for big data analytics. It begins by introducing cluster analysis and its goal of organizing similar data items into groups, then covers applications of cluster analysis such as customer segmentation, the k-means algorithm itself, similarity measures, and key points of cluster analysis (for example, that interpretation is required to make sense of the results). Finally, it discusses uses of cluster results such as data segmentation, classifying new data, providing labeled data for classification models, and detecting anomalies.


Machine Learning Algorithm

Parallel K-means using MapReduce for Big Data Analytics

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Big Data Computing Vu Pham Machine Learning Classification Algorithm
Preface
Content of this Lecture:
In this lecture, we will discuss the k-means clustering
algorithm, implemented with MapReduce, for big data
analytics.



Cluster Analysis Overview
Goal: Organize similar items into groups.

In cluster analysis, the goal is to organize similar items in a given
data set into groups, or clusters. By segmenting the given data into
clusters, we can analyze each cluster more carefully.

Note that cluster analysis is also referred to as clustering.



Applications: Cluster Analysis
Segment customer base into groups: A very common application of
cluster analysis is to divide your customer base into segments based on their
purchasing histories. For example, you can segment customers into those
who have purchased science fiction books and videos, versus those who tend
to buy nonfiction books, versus those who have bought many children's
books. This way, you can provide more targeted suggestions to each different
group.
Characterize different weather patterns for a region: Another
example of cluster analysis is characterizing the different weather patterns
of a region.
Group news articles into topics: Grouping the latest news articles into
topics to identify the trending topics of the day.
Discover crime hot spots: Discovering hot spots for different types of
crime from police reports in order to provide sufficient police presence for
problem areas.



Cluster Analysis
Divides data into clusters: Cluster analysis divides all the samples in a data set
into groups. In this diagram, we see that the red, green, and purple data points
are clustered together. Which group a sample is placed in is based on some
measure of similarity.
Similar items are placed in same cluster: The goal of cluster analysis is to
segment data so that differences between samples in the same cluster are
minimized, as shown by the yellow arrow, and differences between samples of
different clusters are maximized, as shown by the orange arrow. Visually, we
can think of this as getting samples in each cluster to be as close together as
possible, and the samples from different clusters to be as far apart as possible.



Similarity Measures
Cluster analysis requires some sort of metric to measure similarity
between two samples. Some common similarity measures are:

Euclidean distance: the distance along a straight line between two
points, A and B, as shown in this plot.

Manhattan distance: calculated on a strictly horizontal and vertical
path, as shown in the right plot. To go from point A to point B, you can
only step along either the x-axis or the y-axis in a two-dimensional
case. So the path used to calculate the Manhattan distance consists of
segments along the axes instead of along a diagonal path, as with
Euclidean distance.

Cosine similarity: measures the cosine of the angle between points A
and B, as shown in the bottom plot.

Since distance measures such as Euclidean distance are often used to
measure similarity between samples in clustering algorithms, note that
it may be necessary to normalize the input variables so that no one
variable dominates the similarity calculation.
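As a minimal sketch, the three similarity measures above (plus a min-max normalization of one input variable) can be written in plain Python. The function names and the min-max scheme are illustrative choices, not taken from the slides:

```python
import math

def euclidean(a, b):
    # Straight-line distance between points a and b.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Distance along strictly horizontal and vertical segments.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def minmax_normalize(column):
    # Rescale one input variable to [0, 1] so that no single
    # variable dominates the distance calculation.
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]
```

For example, the points A = (0, 0) and B = (3, 4) have Euclidean distance 5 but Manhattan distance 7, since the Manhattan path must follow the axes.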
Another natural inner product measure

(slide figures omitted) Cosine similarity is another natural inner-product
measure: the inner product of the two vectors divided by the product of
their norms.


Cosine similarity - normalize

(slide figures omitted)

Not a proper distance metric

Efficient to compute for sparse vectors


Cosine Normalize

In general, -1 ≤ similarity ≤ 1.

For positive features (like tf-idf): 0 ≤ similarity ≤ 1.

Define distance = 1 - similarity.
Cluster Analysis Key Points

Unsupervised: Cluster analysis is an unsupervised task. This means that
there is no target label for any sample in the data set.

There is no 'correct' clustering: In general, there are no correct
clustering results. The best set of clusters is highly dependent on how
the resulting clusters will be used.

Clusters don't come with labels: You may end up with five different
clusters at the end of a cluster analysis process, but you don't know
what each cluster represents. Only by analyzing the samples in each
cluster can you come up with reasonable labels for your clusters.

Given all this, it is important to keep in mind that interpretation and
analysis of the clusters are required to make sense of, and make use of,
the results of cluster analysis.
Uses of Cluster Results
Data segmentation
Analysis of each segment can provide insights: There are several ways
that the results of cluster analysis can be used. The most obvious is data
segmentation and the benefits that come from it. If we segment a customer base
into different types of readers, the resulting insights can be used to provide more
effective marketing to the different customer groups based on their preferences.
For example, analyzing each segment separately can provide valuable insights into
each group's likes, dislikes, and purchasing behavior, just as we see science
fiction, non-fiction, and children's book preferences here.

(figure: science fiction, non-fiction, and children's book segments)


Uses of Cluster Results
Categories for classifying new data
New sample assigned to closest cluster: Clusters can also be used to
classify new data samples. When a new sample is received, like the orange sample
here, compute the similarity measure between it and the centers of all clusters, and
assign the new sample to the closest cluster. The label of that cluster, manually
determined through analysis, is then used to classify the new sample. In our book
buyers' preferences example, a new customer can be classified as a science fiction,
non-fiction, or children's books customer depending on which cluster the new
customer is most similar to.

(figure: label of closest cluster used to classify new sample)
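This assignment step can be sketched as a small helper. The centroids and cluster labels below are hypothetical values for the book-buyers example, not taken from the slides:

```python
import math

def nearest_cluster(sample, centroids, labels):
    # Compute the Euclidean distance from the new sample to every
    # cluster center and return the label of the closest cluster.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best = min(range(len(centroids)), key=lambda i: dist(sample, centroids[i]))
    return labels[best]

# Hypothetical centroids and manually determined labels for illustration.
centroids = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]
labels = ["science fiction", "non-fiction", "children's"]
```

A new customer at (4.5, 4.0) would fall closest to the second centroid and be classified as a non-fiction reader.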


Uses of Cluster Results
Labeled data for classification
Cluster samples used as labeled data: Once cluster labels have been
determined, samples in each cluster can be used as labeled data for a classification
task. The samples would be the input to the classification model. And the cluster label
would be the target class for each sample. This process can be used to provide much
needed labeled data for classification.

(figure: labeled samples for science fiction customers)
Uses of Cluster Results
Basis for anomaly detection
Cluster outliers are anomalies: Yet another use of cluster results is as a basis
for anomaly detection. If a sample is very far away from, or very different from, any of
the cluster centers, like the yellow sample here, then that sample is a cluster outlier and
can be flagged as an anomaly. However, these anomalies require further analysis.
Depending on the application, such anomalies may be considered noise that should be
removed from the data set; an example would be a sample with a value of 150 for age.

(figure: anomalies that require further analysis)
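A minimal sketch of this outlier check, assuming a fixed distance threshold (the threshold value, centroids, and sample data below are made up for illustration):

```python
import math

def flag_anomalies(samples, centroids, threshold):
    # A sample whose distance to its *nearest* centroid exceeds the
    # threshold is a cluster outlier and is flagged for further analysis.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    flagged = []
    for s in samples:
        nearest = min(dist(s, c) for c in centroids)
        if nearest > threshold:
            flagged.append(s)
    return flagged

# Hypothetical cluster centers and samples; the last sample is far from both.
centroids = [(0.0, 0.0), (10.0, 10.0)]
samples = [(0.5, 0.5), (9.0, 10.0), (50.0, 50.0)]
```

Whether a flagged sample is noise to be removed or a genuinely interesting case still requires the further analysis described above.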


Cluster Analysis Summary

Organize similar items into groups

Analyzing clusters often leads to useful insights about data

Clusters require analysis and interpretation


k-Means Algorithm
Select k initial centroids (cluster centers)

Repeat
Assign each sample to closest centroid
Calculate mean of cluster to determine new centroid

Until some stopping criterion is reached
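The loop above (Lloyd's algorithm) can be sketched in plain Python. The function signature and the random-initialization scheme are illustrative choices; a real workload would use an optimized library implementation:

```python
import random

def kmeans(samples, k, max_iters=100, seed=0):
    # Minimal k-means sketch; samples are tuples of numbers.
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)      # select k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each sample goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for s in samples:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(s, centroids[j])))
            clusters[i].append(s)
        # Update step: new centroid = mean of the samples in the cluster.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                           for d in range(dim)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Stopping criterion: centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups of points, the loop converges to the two group means after a few iterations.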



k-Means Algorithm: Example

(figure: original samples ➔ initial centroids ➔ assign samples ➔ re-calculate centroids ➔ assign samples ➔ re-calculate centroids)
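The lecture's title topic, parallel k-means with MapReduce, maps one iteration of this algorithm onto a map phase and a reduce phase. The sketch below simulates the two phases in plain Python (it is not Hadoop or Spark API code): each mapper sees one point plus the current centroids (broadcast to all nodes) and emits (nearest-centroid id, (point, 1)); each reducer averages the points assigned to one centroid:

```python
from collections import defaultdict

def mapper(point, centroids):
    # Map phase: emit (nearest_centroid_id, (point, 1)) for one point.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    cid = min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))
    return cid, (point, 1)

def reducer(cid, values):
    # Reduce phase: sum the points and counts for one cluster id and
    # emit the new centroid (the mean of the assigned points).
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, c in values:
        for d in range(dim):
            total[d] += point[d]
        count += c
    return cid, tuple(t / count for t in total)

def kmeans_iteration(points, centroids):
    # Driver: simulate the shuffle/sort by grouping mapper output on the key.
    groups = defaultdict(list)
    for p in points:
        cid, value = mapper(p, centroids)
        groups[cid].append(value)
    new = list(centroids)
    for cid, values in groups.items():
        _, centroid = reducer(cid, values)
        new[cid] = centroid
    return new
```

The driver repeats this map-reduce round, rebroadcasting the updated centroids, until a stopping criterion is reached. In a real deployment a combiner would pre-sum the (point, count) pairs on each mapper node to cut shuffle traffic.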


Choosing Initial Centroids
Issue:
Final clusters are sensitive to initial centroids

Solution:
Run k-means multiple times with different random
initial centroids, and choose best results



Evaluating Cluster Results

error = distance between a sample and its centroid

Squared error = error²

Sum of squared errors between all samples in a cluster and its centroid

Sum over all clusters ➔ WSSE (Within-Cluster Sum of Squared Error)
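The WSSE can be computed directly from this definition. The sketch below assumes the samples, their integer cluster assignments, and the centroids are given as plain Python sequences:

```python
def wsse(samples, assignments, centroids):
    # Sum, over all clusters, of the squared error between each sample
    # and the centroid of the cluster it is assigned to.
    total = 0.0
    for s, cid in zip(samples, assignments):
        c = centroids[cid]
        total += sum((a - b) ** 2 for a, b in zip(s, c))
    return total
```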
Using WSSE

WSSE1 < WSSE2 ➔ WSSE1 is better numerically

Caveats:
Does not mean that cluster set 1 is more ‘correct’ than
cluster set 2

Larger values for k will always reduce WSSE



Choosing the Value for k

Approaches: Choosing the optimal value for k is always a big question
when using k-means. There are several methods to determine the value of k.

• Visualization: Visualization techniques can be applied to the dataset to
see if there are natural groupings of the samples. Scatter plots and
dimensionality reduction are useful here to visualize the data.

• Application-Dependent: A good value for k is application-dependent,
so domain knowledge of the application can drive the selection of the value of k.
For example, if you want to cluster the types of products customers are
purchasing, a natural choice for k might be the number of product categories that
you offer. Or k might be selected to represent the geographical locations of
respondents to a survey, in which case a good value for k would be the number of
regions you're interested in analyzing.

• Data-Driven: There are also data-driven methods for determining the value
of k. These methods calculate a metric for different values of k to determine the
best selection of k. One such method is the elbow method.
Elbow Method for Choosing k

“Elbow” suggests
value for k should be 3

The elbow method for determining the value of k is shown on this plot. As we saw on the
previous slide, WSSE, or within-cluster sum of squared error, measures how much the data
samples deviate from their respective centroids in a set of clustering results. If we plot WSSE
for different values of k, we can see how this error measure changes as the value of k changes,
as seen in the plot. The bend in this error curve indicates a drop in gain from adding more
clusters, so this elbow in the curve provides a suggestion for a good value of k.

Note that the elbow cannot always be unambiguously determined, especially for complex
data. In many cases the error curve will not clearly suggest one value but several; this can
be used as a guideline for the range of values to try for k.
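One simple data-driven way to locate the bend is to pick the k after which the marginal reduction in WSSE drops off most sharply (the largest second difference of the error curve). This is only a heuristic, consistent with the caveat that the elbow can be ambiguous; the WSSE numbers below are hypothetical values chosen to illustrate a bend at k = 3:

```python
def elbow_k(ks, wsse_values):
    # Drop in WSSE gained by each increment of k.
    drops = [wsse_values[i] - wsse_values[i + 1] for i in range(len(wsse_values) - 1)]
    # Second difference: improvement before k minus improvement after k.
    # The largest value marks the sharpest drop-off in gain (the "bend").
    bends = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
    return ks[1 + bends.index(max(bends))]

# Hypothetical WSSE values for k = 1..6 (for illustration only).
ks = [1, 2, 3, 4, 5, 6]
wsse_values = [1000.0, 700.0, 150.0, 130.0, 115.0, 105.0]
```

When the curve has no single sharp bend, the heuristic's answer should be treated as one candidate in a range of values to try, not a definitive choice.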
Stopping Criteria
When to stop iterating ?
• No changes to centroids: How do you know when to stop
iterating when using k-means? One obvious stopping criterion is when
there are no changes to the centroids. This means that no samples would
change cluster assignments, and recalculating the centroids would not
result in any changes, so additional iterations will not bring about any
more changes to the cluster results.
• Number of samples changing clusters is below
threshold: The stopping criterion can be relaxed to a second
criterion: stop when the number of samples changing clusters is below a
certain threshold, say 1% for example. At this point the clusters are
changing by only a few samples, resulting in only minimal changes to the
final cluster results, so the algorithm can be stopped here.
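The relaxed criterion can be sketched as a small helper that compares the cluster assignments of two consecutive iterations (the 1% default below is the example threshold from the slide):

```python
def fraction_changed(old_assignments, new_assignments):
    # Fraction of samples whose cluster assignment changed this iteration.
    changed = sum(1 for a, b in zip(old_assignments, new_assignments) if a != b)
    return changed / len(old_assignments)

def should_stop(old_assignments, new_assignments, threshold=0.01):
    # Relaxed stopping criterion: stop when at most `threshold`
    # (e.g. 1%) of the samples change clusters between iterations.
    return fraction_changed(old_assignments, new_assignments) <= threshold
```

With threshold = 0, this reduces to the first criterion: no sample changes cluster, so the centroids no longer move.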



Interpreting Results
Examine cluster centroids
• How are clusters different ? At the end of k-means we have a set of
clusters, each with a centroid. Each centroid is the mean of the samples assigned
to that cluster. You can think of the centroid as a representative sample for that
cluster. So to interpret the cluster analysis results, we can examine the cluster
centroids. Comparing the values of the variables between the centroids will reveal
how different or alike clusters are and provide insights into what each cluster
represents. For example, if the value for age is different for different customer
clusters, this indicates that the clusters are encoding different customer segments
by age, among other variables.



K-Means Summary

Classic algorithm for cluster analysis

Simple to understand and implement, and efficient

Value of k must be specified

Final clusters are sensitive to initial centroids

