Cluster Analysis
Classification is the process of assigning data to predefined classes with the help of class labels.
Wednesday, December 09, 2020 3
The Wine Quality Dataset
The Wine Quality Dataset involves predicting the quality of white wines (Good/Bad) from the given chemical measures of each wine:
1. Fixed acidity.
2. Volatile acidity.
3. Citric acid.
4. Residual sugar.
5. Chlorides.
6. Free sulfur dioxide.
7. Total sulfur dioxide.
8. Density.
9. pH.
10. Sulphates.
11. Alcohol.
12. Quality (Good/Bad).
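The twelve attributes above can be sketched as a feature matrix plus a label column. This is a minimal illustration, not the real dataset: the column names and the two sample rows below are assumptions made up for the example.

```python
import pandas as pd

# The 11 chemical measures plus the Good/Bad quality label.
# Column names are assumptions; match them to the actual data file.
columns = [
    "fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
    "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide",
    "density", "pH", "sulphates", "alcohol", "quality",
]

# Two made-up rows purely to illustrate the layout.
df = pd.DataFrame(
    [[7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8, "Good"],
     [6.3, 0.30, 0.34, 1.6, 0.049, 14.0, 132.0, 0.994, 3.3, 0.49, 9.5, "Bad"]],
    columns=columns,
)

X = df.drop(columns="quality")   # the 11 chemical measures (features)
y = df["quality"]                # the Good/Bad label (prediction target)
```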
Wednesday, December 09, 2020 7
Example
The table below lists 20 wines sold in the market along with their alcohol and alkalinity-of-ash content.
[Table: Wine | Alcohol | Alkalinity of Ash, for the 20 wines]
Regression:
• It predicts continuous output values. Regression analysis is a statistical model used to predict numeric data instead of labels.
• It can also identify distribution trends based on available or historical data.
• With clustering, by contrast, when a new data point arrives we can easily identify which group or cluster it belongs to.
Similarity of characteristics
• Finding groups of objects such that the objects in a group will be similar (or related) to one
another and different from (or unrelated to) the objects in other groups
Intra-cluster distances are minimized; inter-cluster distances are maximized.
• You don’t know who or what belongs to which group. Not even the
number of groups.
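The goal above can be made concrete with a small numeric sketch: two made-up clusters of 2-D points, where we check that distances within each cluster stay small while the distance between clusters is large. The data values are assumptions chosen for illustration.

```python
import numpy as np

def mean_pairwise_distance(points):
    """Average Euclidean distance between all pairs of points in one cluster."""
    d = [np.linalg.norm(p - q) for i, p in enumerate(points) for q in points[i + 1:]]
    return float(np.mean(d))

# Two made-up clusters of 2-D points.
cluster_a = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 0.9]])
cluster_b = np.array([[8.0, 8.0], [8.5, 7.8], [7.9, 8.3]])

intra_a = mean_pairwise_distance(cluster_a)   # small: points are similar
intra_b = mean_pairwise_distance(cluster_b)
# One common inter-cluster measure: distance between the two centroids.
inter = float(np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0)))

# A good clustering keeps intra-cluster distances small and
# inter-cluster distances large.
```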
Applications of Clustering
Example:
A bank wants to give credit card offers to its customers. Currently, they look
at the details of each customer and based on this information, decide which
offer should be given to which customer.
Now, the bank can potentially have millions of customers. Does it make
sense to look at the details of each customer separately and then make a
decision? Certainly not! It is a manual process and will take a huge amount
of time.
So what can the bank do? One option is to segment its customers into
different groups. For instance, the bank can group the customers based on
their income:
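The income-based segmentation described above can be sketched with k-means. The income figures below are hypothetical, and the choice of three segments (low/medium/high) is an assumption for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical annual incomes (in thousands) for ten customers.
incomes = np.array([[18], [22], [25], [48], [52], [55], [95], [100], [105], [110]])

# Segment the customers into three income groups (low / medium / high).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(incomes)
labels = km.labels_   # the segment each customer falls into
```

The bank can then design one credit card offer per segment instead of examining millions of customers one by one.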
• Cluster analysis in market segmentation: helps marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
Similarity
The degree of correspondence among objects across all of the characteristics. Two broad families of similarity measures are used:
• Correlational measures
• Distance measures
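The two families can give different answers for the same pair of objects. This sketch, with made-up feature values, compares a distance measure (Euclidean) against a correlational measure (Pearson) for two objects described by the same five characteristics.

```python
import numpy as np

# Two objects described by the same five characteristics (made-up values).
obj1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
obj2 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Distance measure: Euclidean distance (smaller = more similar).
euclidean = float(np.linalg.norm(obj1 - obj2))

# Correlational measure: Pearson correlation (closer to 1 = more similar pattern).
correlation = float(np.corrcoef(obj1, obj2)[0, 1])

# The two views can disagree: obj2 is far from obj1 in distance
# terms, yet perfectly correlated with it (same rising pattern).
```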
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
Common methods for choosing the number of clusters:
• Elbow Method.
• Silhouette Method.
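Both methods can be sketched in a few lines: run k-means for several candidate values of k, record the inertia (for the elbow plot) and the silhouette score (higher is better). The synthetic blob data and its three centres are assumptions made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (made-up centres).
centers = np.array([[0.0, 0.0], [8.0, 0.0], [4.0, 7.0]])
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.7, random_state=42)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                          # elbow method: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)   # silhouette: higher is better

best_k = max(silhouettes, key=silhouettes.get)
```

For the elbow method, inertia always decreases as k grows; the "elbow" is the k after which the decrease flattens out.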
• You’ll define a target number k, which refers to the number of
centroids you need in the dataset. A centroid is the imaginary or real
location representing the center of the cluster.
Note that there is no ordering of the clusters, so the cluster coloring is arbitrary.
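A minimal sketch of defining k and recovering the centroids, using six made-up 2-D points that form two obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two obvious groups (made-up data).
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

k = 2  # the target number of centroids
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

centroids = km.cluster_centers_  # one (x, y) centre per cluster
labels = km.labels_              # cluster index of each point; the
                                 # index ordering itself is arbitrary
```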
Advantages of K-means
• It is very simple to implement.
• It scales to huge datasets and is fast on large data.
• It adapts easily to new examples.
• It generalizes to clusters of different shapes and sizes.
Disadvantages of K-means
• It is sensitive to outliers.
• Choosing the value of k manually is difficult.
• As the number of dimensions increases, its scalability decreases.
Cluster the following eight points (with (x, y) representing locations) into two
clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
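One way to check a worked answer for this exercise is to run k-means directly on the eight points. This is a sketch of one possible solution path, not the intended by-hand procedure (the exercise may specify initial centroids, which would fix the iteration steps).

```python
import numpy as np
from sklearn.cluster import KMeans

# The eight points A1..A8 from the exercise.
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_
# The lowest-inertia 2-cluster solution groups the high-y points
# A1, A4 and A8 together, and the remaining five points together.
```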
1. It starts by putting every point in its own cluster, so each cluster is a singleton.
2. It then merges the two points that are closest to each other based on the distances from the distance matrix. The consequence is that there is one less cluster.
3. It then recalculates the distances between the new and old clusters and saves them in a new distance matrix, which will be used in the next step.
4. Finally, steps 2 and 3 are repeated until all clusters are merged into one single cluster including all points.
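The steps above are exactly what `scipy`'s agglomerative clustering performs. A minimal sketch on five made-up 2-D points, using single linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five made-up 2-D points: two tight pairs plus one outlier.
points = np.array([[1.0, 1.0], [1.2, 1.1],
                   [5.0, 5.0], [5.1, 4.9],
                   [9.0, 1.0]])

# Single-linkage agglomerative clustering: each row of Z records one
# merge of the two closest clusters, as in steps 1-4 above.
Z = linkage(points, method="single")   # (n - 1) merges in total

# Cut the tree to recover, e.g., three flat clusters.
flat = fcluster(Z, t=3, criterion="maxclust")
```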
1. Single linkage: computes the minimum distance between clusters before merging them.
2. Complete linkage: computes the maximum distance between clusters before merging them.
3. Average linkage: computes the average distance between clusters before merging them.
4. Centroid linkage: calculates centroids for both clusters, then computes the distance between the two before merging them.
5. Ward's (minimum variance) criterion: minimizes the total within-cluster variance and finds the pair of clusters that leads to the minimum increase in total within-cluster variance after merging.
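The first four criteria differ only in how they reduce the pairwise distances between two clusters to a single number. A tiny sketch on two made-up clusters (Ward's criterion is omitted here, since it is defined via variance rather than pairwise distances):

```python
import numpy as np

a = np.array([[0.0, 0.0], [0.0, 1.0]])   # cluster A (made-up points)
b = np.array([[3.0, 0.0], [4.0, 0.0]])   # cluster B (made-up points)

# All pairwise distances between a point in A and a point in B.
pair = [float(np.linalg.norm(p - q)) for p in a for q in b]

single   = min(pair)              # 1. single linkage: minimum pairwise distance
complete = max(pair)              # 2. complete linkage: maximum pairwise distance
average  = sum(pair) / len(pair)  # 3. average linkage: mean pairwise distance
centroid = float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))  # 4. centroid linkage
```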
Dendrogram:
• A dendrogram is a diagram that shows the hierarchical relationship between objects.
• The main use of a dendrogram is to work out the best way to allocate objects to
clusters.
• In the example above, we can see that E and F are most similar, as the height of the link
that joins them together is the smallest. The next two most similar objects are A and B.
• In the dendrogram above, the height of the dendrogram indicates the order in which the
clusters were joined.
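A dendrogram like the one described can be built from a linkage matrix. The six labelled points below are assumptions chosen so that E and F are closest and A and B are the next closest, mirroring the example in the text; `no_plot=True` returns the layout instead of drawing it.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Six made-up points: E and F are closest, then A and B.
labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([[0.0, 0.0], [0.4, 0.0], [3.0, 3.0],
                   [5.0, 5.0], [9.0, 9.0], [9.1, 9.0]])

Z = linkage(points, method="single")

# no_plot=True returns the dendrogram layout instead of drawing it.
tree = dendrogram(Z, labels=labels, no_plot=True)
leaf_order = tree["ivl"]   # leaf labels in display order; objects merged
                           # early (E-F, A-B) end up as adjacent leaves
```

The height of each link in `Z` (its third column) is the merge distance, so lower links join more similar objects, as the text describes for E and F.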