Lecture 25 K Means Clustering

The document discusses the k-means clustering algorithm with MapReduce for big data analytics. It begins by introducing cluster analysis and its goal of organizing similar data items into groups, then covers applications of cluster analysis such as customer segmentation, the k-means algorithm itself, similarity measures, and key points of cluster analysis (for example, that interpretation is required to make sense of the results). Finally, it discusses uses of cluster results such as data segmentation, classifying new data, providing labeled data for classification models, and detecting anomalies.


Machine Learning Algorithm

Parallel K-means using MapReduce for Big Data Analytics

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Big Data Computing Vu Pham Machine Learning Classification Algorithm
Preface
Content of this Lecture:
In this lecture, we will discuss the k-means clustering
algorithm, implemented with MapReduce, for big data
analytics.



Cluster Analysis Overview
Goal: Organize similar items into groups.

In cluster analysis, the goal is to organize similar items in a given
data set into groups, or clusters. By segmenting the given data into
clusters, we can analyze each cluster more carefully.

Note that cluster analysis is also referred to as clustering.



Applications: Cluster Analysis
Segment customer base into groups: A very common application of
cluster analysis is to divide your customer base into segments based on their
purchasing histories. For example, you can segment customers into those
who have purchased science fiction books and videos, versus those who tend
to buy nonfiction books, versus those who have bought many children's
books. This way, you can provide more targeted suggestions to each different
group.
Characterize different weather patterns for a region: Another
example of cluster analysis is characterizing the different weather patterns
of a region.
Group news articles into topics: Grouping the latest news articles into
topics to identify the trending topics of the day.
Discover crime hot spots: Discovering hot spots for different types of
crime from police reports in order to provide sufficient police presence for
problem areas.



Cluster Analysis
Divides data into clusters: Cluster analysis divides all the samples in a data set
into groups. In this diagram, we see that the red, green, and purple data points
are clustered together. Which group a sample is placed in is based on some
measure of similarity.
Similar items are placed in same cluster: The goal of cluster analysis is to
segment data so that differences between samples in the same cluster are
minimized, as shown by the yellow arrow, and differences between samples of
different clusters are maximized, as shown by the orange arrow. Visually, we
can think of this as getting samples in each cluster to be as close together as
possible, and the samples from different clusters to be as far apart as possible.



Similarity Measures
Cluster analysis requires some sort of metric to measure similarity
between two samples. Some common similarity measures are:

Euclidean distance: the distance along a straight line between two
points, A and B, as shown in this plot.

Manhattan distance: calculated on a strictly horizontal and vertical
path, as shown in the right plot. To go from point A to point B, you can
only step along either the x-axis or the y-axis in a two-dimensional
case. So the path used to calculate the Manhattan distance consists of
segments along the axes instead of along a diagonal path, as with
Euclidean distance.

Cosine similarity: measures the cosine of the angle between points A
and B, as shown in the bottom plot.

Since distance measures such as Euclidean distance are often used to
measure similarity between samples in clustering algorithms, note that
it may be necessary to normalize the input variables so that no one
variable dominates the similarity calculation.
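As a minimal sketch, the three similarity measures above (plus a min-max normalization of one input variable) can be written in plain Python. The function names and the min-max scheme are illustrative choices, not taken from the slides:

```python
import math

def euclidean(a, b):
    # Straight-line distance between points a and b.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Distance along strictly horizontal and vertical segments.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def minmax_normalize(column):
    # Rescale one input variable to [0, 1] so that no single
    # variable dominates the distance calculation.
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]
```

For example, the points A = (0, 0) and B = (3, 4) have Euclidean distance 5 but Manhattan distance 7, since the Manhattan path must follow the axes.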
Another natural inner product measure

(slide figures omitted) Cosine similarity is another natural inner-product
measure: the inner product of the two vectors divided by the product of
their norms.


Cosine similarity - normalize

(slide figures omitted)

Not a proper distance metric

Efficient to compute for sparse vectors


Cosine Normalize

In general, -1 ≤ similarity ≤ 1.

For positive features (like tf-idf): 0 ≤ similarity ≤ 1.

Define distance = 1 - similarity.
Cluster Analysis Key Points

Unsupervised: Cluster analysis is an unsupervised task. This means that
there is no target label for any sample in the data set.

There is no 'correct' clustering: In general, there are no correct
clustering results. The best set of clusters is highly dependent on how
the resulting clusters will be used.

Clusters don't come with labels: You may end up with five different
clusters at the end of a cluster analysis process, but you don't know
what each cluster represents. Only by analyzing the samples in each
cluster can you come up with reasonable labels for your clusters.

Given all this, it is important to keep in mind that interpretation and
analysis of the clusters are required to make sense of, and make use of,
the results of cluster analysis.
Uses of Cluster Results
Data segmentation
Analysis of each segment can provide insights: There are several ways
that the results of cluster analysis can be used. The most obvious is data
segmentation and the benefits that come from it. If we segment a customer base
into different types of readers, the resulting insights can be used to provide more
effective marketing to the different customer groups based on their preferences.
For example, analyzing each segment separately can provide valuable insights into
each group's likes, dislikes, and purchasing behavior, just as we see science
fiction, non-fiction, and children's book preferences here.

(figure: science fiction, non-fiction, and children's book segments)


Uses of Cluster Results
Categories for classifying new data
New sample assigned to closest cluster: Clusters can also be used to
classify new data samples. When a new sample is received, like the orange sample
here, compute the similarity measure between it and the centers of all clusters, and
assign the new sample to the closest cluster. The label of that cluster, manually
determined through analysis, is then used to classify the new sample. In our book
buyers' preferences example, a new customer can be classified as a science fiction,
non-fiction, or children's books customer depending on which cluster the new
customer is most similar to.

(figure: label of closest cluster used to classify new sample)
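This assignment step can be sketched as a small helper. The centroids and cluster labels below are hypothetical values for the book-buyers example, not taken from the slides:

```python
import math

def nearest_cluster(sample, centroids, labels):
    # Compute the Euclidean distance from the new sample to every
    # cluster center and return the label of the closest cluster.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best = min(range(len(centroids)), key=lambda i: dist(sample, centroids[i]))
    return labels[best]

# Hypothetical centroids and manually determined labels for illustration.
centroids = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]
labels = ["science fiction", "non-fiction", "children's"]
```

A new customer at (4.5, 4.0) would fall closest to the second centroid and be classified as a non-fiction reader.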


Uses of Cluster Results
Labeled data for classification
Cluster samples used as labeled data: Once cluster labels have been
determined, samples in each cluster can be used as labeled data for a classification
task. The samples would be the input to the classification model. And the cluster label
would be the target class for each sample. This process can be used to provide much
needed labeled data for classification.

(figure: labeled samples for science fiction customers)
Uses of Cluster Results
Basis for anomaly detection
Cluster outliers are anomalies: Yet another use of cluster results is as a basis
for anomaly detection. If a sample is very far away from, or very different from, any of
the cluster centers, like the yellow sample here, then that sample is a cluster outlier and
can be flagged as an anomaly. However, these anomalies require further analysis.
Depending on the application, such anomalies may be considered noise that should be
removed from the data set; an example would be a sample with a value of 150 for age.

(figure: anomalies that require further analysis)
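A minimal sketch of this outlier check, assuming a fixed distance threshold (the threshold value, centroids, and sample data below are made up for illustration):

```python
import math

def flag_anomalies(samples, centroids, threshold):
    # A sample whose distance to its *nearest* centroid exceeds the
    # threshold is a cluster outlier and is flagged for further analysis.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    flagged = []
    for s in samples:
        nearest = min(dist(s, c) for c in centroids)
        if nearest > threshold:
            flagged.append(s)
    return flagged

# Hypothetical cluster centers and samples; the last sample is far from both.
centroids = [(0.0, 0.0), (10.0, 10.0)]
samples = [(0.5, 0.5), (9.0, 10.0), (50.0, 50.0)]
```

Whether a flagged sample is noise to be removed or a genuinely interesting case still requires the further analysis described above.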


Cluster Analysis Summary

Organize similar items into groups

Analyzing clusters often leads to useful insights about data

Clusters require analysis and interpretation


k-Means Algorithm
Select k initial centroids (cluster centers)

Repeat
Assign each sample to closest centroid
Calculate mean of cluster to determine new centroid

Until some stopping criterion is reached
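The loop above (Lloyd's algorithm) can be sketched in plain Python. The function signature and the random-initialization scheme are illustrative choices; a real workload would use an optimized library implementation:

```python
import random

def kmeans(samples, k, max_iters=100, seed=0):
    # Minimal k-means sketch; samples are tuples of numbers.
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)      # select k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each sample goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for s in samples:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(s, centroids[j])))
            clusters[i].append(s)
        # Update step: new centroid = mean of the samples in the cluster.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                           for d in range(dim)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Stopping criterion: centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups of points, the loop converges to the two group means after a few iterations.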



k-Means Algorithm: Example

(figure: original samples ➔ initial centroids ➔ assign samples ➔ re-calculate centroids ➔ assign samples ➔ re-calculate centroids)
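The lecture's title topic, parallel k-means with MapReduce, maps one iteration of this algorithm onto a map phase and a reduce phase. The sketch below simulates the two phases in plain Python (it is not Hadoop or Spark API code): each mapper sees one point plus the current centroids (broadcast to all nodes) and emits (nearest-centroid id, (point, 1)); each reducer averages the points assigned to one centroid:

```python
from collections import defaultdict

def mapper(point, centroids):
    # Map phase: emit (nearest_centroid_id, (point, 1)) for one point.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    cid = min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))
    return cid, (point, 1)

def reducer(cid, values):
    # Reduce phase: sum the points and counts for one cluster id and
    # emit the new centroid (the mean of the assigned points).
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, c in values:
        for d in range(dim):
            total[d] += point[d]
        count += c
    return cid, tuple(t / count for t in total)

def kmeans_iteration(points, centroids):
    # Driver: simulate the shuffle/sort by grouping mapper output on the key.
    groups = defaultdict(list)
    for p in points:
        cid, value = mapper(p, centroids)
        groups[cid].append(value)
    new = list(centroids)
    for cid, values in groups.items():
        _, centroid = reducer(cid, values)
        new[cid] = centroid
    return new
```

The driver repeats this map-reduce round, rebroadcasting the updated centroids, until a stopping criterion is reached. In a real deployment a combiner would pre-sum the (point, count) pairs on each mapper node to cut shuffle traffic.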


Choosing Initial Centroids
Issue:
Final clusters are sensitive to initial centroids

Solution:
Run k-means multiple times with different random
initial centroids, and choose best results



Evaluating Cluster Results

error = distance between a sample and its centroid

Squared error = error²

Sum of squared errors between all samples in a cluster and its centroid

Sum over all clusters ➔ WSSE (Within-Cluster Sum of Squared Error)
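The WSSE can be computed directly from this definition. The sketch below assumes the samples, their integer cluster assignments, and the centroids are given as plain Python sequences:

```python
def wsse(samples, assignments, centroids):
    # Sum, over all clusters, of the squared error between each sample
    # and the centroid of the cluster it is assigned to.
    total = 0.0
    for s, cid in zip(samples, assignments):
        c = centroids[cid]
        total += sum((a - b) ** 2 for a, b in zip(s, c))
    return total
```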
Using WSSE

WSSE1 < WSSE2 ➔ WSSE1 is better numerically

Caveats:
Does not mean that cluster set 1 is more ‘correct’ than
cluster set 2

Larger values for k will always reduce WSSE



Choosing the Value for k

Approaches: Choosing the optimal value for k is always a big question
when using k-means. There are several methods to determine the value of k.

• Visualization: Visualization techniques can be applied to the dataset to
see if there are natural groupings of the samples. Scatter plots and
dimensionality reduction are useful here to visualize the data.

• Application-Dependent: A good value for k is application-dependent,
so domain knowledge of the application can drive the selection of the value of k.
For example, if you want to cluster the types of products customers are
purchasing, a natural choice for k might be the number of product categories that
you offer. Or k might be selected to represent the geographical locations of
respondents to a survey, in which case a good value for k would be the number of
regions you're interested in analyzing.

• Data-Driven: There are also data-driven methods for determining the value
of k. These methods calculate a metric for different values of k to determine the
best selection of k. One such method is the elbow method.
Elbow Method for Choosing k

“Elbow” suggests
value for k should be 3

The elbow method for determining the value of k is shown on this plot. As we saw on the
previous slide, WSSE, or within-cluster sum of squared error, measures how much the data
samples deviate from their respective centroids in a set of clustering results. If we plot WSSE
for different values of k, we can see how this error measure changes as the value of k changes,
as seen in the plot. The bend in this error curve indicates a drop in gain from adding more
clusters, so this elbow in the curve provides a suggestion for a good value of k.

Note that the elbow cannot always be unambiguously determined, especially for complex
data. In many cases the error curve will not clearly suggest one value but several; this can
be used as a guideline for the range of values to try for k.
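One simple data-driven way to locate the bend is to pick the k after which the marginal reduction in WSSE drops off most sharply (the largest second difference of the error curve). This is only a heuristic, consistent with the caveat that the elbow can be ambiguous; the WSSE numbers below are hypothetical values chosen to illustrate a bend at k = 3:

```python
def elbow_k(ks, wsse_values):
    # Drop in WSSE gained by each increment of k.
    drops = [wsse_values[i] - wsse_values[i + 1] for i in range(len(wsse_values) - 1)]
    # Second difference: improvement before k minus improvement after k.
    # The largest value marks the sharpest drop-off in gain (the "bend").
    bends = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
    return ks[1 + bends.index(max(bends))]

# Hypothetical WSSE values for k = 1..6 (for illustration only).
ks = [1, 2, 3, 4, 5, 6]
wsse_values = [1000.0, 700.0, 150.0, 130.0, 115.0, 105.0]
```

When the curve has no single sharp bend, the heuristic's answer should be treated as one candidate in a range of values to try, not a definitive choice.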
Stopping Criteria
When to stop iterating ?
• No changes to centroids: How do you know when to stop
iterating when using k-means? One obvious stopping criterion is when
there are no changes to the centroids. This means that no samples would
change cluster assignments, and recalculating the centroids would not
result in any changes, so additional iterations will not bring about any
more changes to the cluster results.
• Number of samples changing clusters is below
threshold: The stopping criterion can be relaxed to a second
criterion: stop when the number of samples changing clusters is below a
certain threshold, say 1% for example. At this point the clusters are
changing by only a few samples, resulting in only minimal changes to the
final cluster results, so the algorithm can be stopped here.
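The relaxed criterion can be sketched as a small helper that compares the cluster assignments of two consecutive iterations (the 1% default below is the example threshold from the slide):

```python
def fraction_changed(old_assignments, new_assignments):
    # Fraction of samples whose cluster assignment changed this iteration.
    changed = sum(1 for a, b in zip(old_assignments, new_assignments) if a != b)
    return changed / len(old_assignments)

def should_stop(old_assignments, new_assignments, threshold=0.01):
    # Relaxed stopping criterion: stop when at most `threshold`
    # (e.g. 1%) of the samples change clusters between iterations.
    return fraction_changed(old_assignments, new_assignments) <= threshold
```

With threshold = 0, this reduces to the first criterion: no sample changes cluster, so the centroids no longer move.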



Interpreting Results
Examine cluster centroids
• How are clusters different ? At the end of k-means we have a set of
clusters, each with a centroid. Each centroid is the mean of the samples assigned
to that cluster. You can think of the centroid as a representative sample for that
cluster. So to interpret the cluster analysis results, we can examine the cluster
centroids. Comparing the values of the variables between the centroids will reveal
how different or alike clusters are and provide insights into what each cluster
represents. For example, if the value for age is different for different customer
clusters, this indicates that the clusters are encoding different customer segments
by age, among other variables.



K-Means Summary

Classic algorithm for cluster analysis

Simple to understand and implement, and efficient

Value of k must be specified

Final clusters are sensitive to initial centroids

