The document discusses the concept of clustering, which is an unsupervised machine learning technique used to group unlabeled data points that are similar. It describes how clustering algorithms aim to identify natural groups within data based on some measure of similarity, without any labels provided. The key types of clustering are partition-based (like k-means), hierarchical, density-based, and model-based. Applications include marketing, earth science, insurance, and more. Quality measures for clustering include intra-cluster similarity and inter-cluster dissimilarity.
This document discusses cluster analysis and clustering algorithms. It defines a cluster as a collection of similar data objects that are dissimilar from objects in other clusters. Unsupervised learning is used with no predefined classes. Popular clustering algorithms include k-means, hierarchical, density-based, and model-based approaches. Quality clustering produces high intra-class similarity and low inter-class similarity. Outlier detection finds dissimilar objects to identify anomalies.
- Hierarchical clustering produces nested clusters organized as a hierarchical tree called a dendrogram. It can be either agglomerative, where each point starts in its own cluster and clusters are merged, or divisive, where all points start in one cluster which is recursively split.
- Common hierarchical clustering algorithms include single linkage (minimum distance), complete linkage (maximum distance), group average, and Ward's method. They differ in how they calculate distance between clusters during merging.
- K-means is a partitional clustering algorithm that divides data into k non-overlapping clusters based on minimizing distance between points and cluster centroids. It is fast but sensitive to initialization and assumes spherical clusters of similar size and density.
This document provides an introduction to cluster analysis, including definitions of key concepts, types of data and clusters, and clustering techniques. It defines cluster analysis as finding groups of similar objects that are different from objects in other groups. There are various types of clusters that can be formed, such as well-separated, center-based, contiguous, and density-based clusters. Common clustering techniques include partitional clustering (e.g., k-means), hierarchical clustering, and density-based clustering. The document discusses considerations for different data types and provides details on partitioning clustering algorithms.
The document discusses different clustering algorithms, including k-means and EM clustering. K-means aims to partition items into k clusters such that each item belongs to the cluster with the nearest mean. It works iteratively to assign items to centroids and recompute centroids until the clusters no longer change. EM clustering generalizes k-means by computing probabilities of cluster membership based on probability distributions, with the goal of maximizing the overall probability of items given the clusters. Both algorithms are used to group similar items in applications like market segmentation.
This document provides an overview of clustering and classification techniques in data mining. It defines clustering and classification as unsupervised and supervised learning respectively. The document discusses how classification works by building a model from training data and then using the model to classify new data. For clustering, it explains that clusters are formed by grouping similar data objects without predefined labels. The document also describes different types of clustering techniques like hierarchical, partitioning, and probabilistic clustering. Finally, it provides a step-by-step explanation of the k-means clustering algorithm.
This document discusses different types of clustering analysis techniques in data mining. It describes clustering as the task of grouping similar objects together. The document outlines several key clustering algorithms including k-means clustering and hierarchical clustering. It provides an example to illustrate how k-means clustering works by randomly selecting initial cluster centers and iteratively assigning data points to clusters and recomputing cluster centers until convergence. The document also discusses limitations of k-means and how hierarchical clustering builds nested clusters through sequential merging of clusters based on a similarity measure.
The document provides an overview of clustering methods and algorithms. It defines clustering as the process of grouping objects that are similar to each other and dissimilar to objects in other groups. It discusses existing clustering methods like K-means, hierarchical clustering, and density-based clustering. For each method, it outlines the basic steps and provides an example application of K-means clustering to demonstrate how the algorithm works. The document also discusses evaluating clustering results and different measures used to assess cluster validity.
Unsupervised learning Algorithms and Assumptions
Topics :
Introduction to unsupervised learning
Unsupervised learning Algorithms and Assumptions
K-Means algorithm – introduction
Implementation of K-means algorithm
Hierarchical Clustering – need and importance of hierarchical clustering
Agglomerative Hierarchical Clustering
Working of dendrogram
Steps for implementation of AHC using Python
Gaussian Mixture Models – Introduction, importance and need of the model
Normal (Gaussian) distribution
Implementation of Gaussian mixture model
Understand the different distance metrics used in clustering
Euclidean, Manhattan, Cosine, Mahalanobis
Features of a Cluster – Labels, Centroids, Inertia, Eigenvectors and Eigenvalues
Principal component analysis
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Types of Hierarchical Clustering
There are mainly two types of hierarchical clustering:
Agglomerative hierarchical clustering
Divisive Hierarchical clustering
A distribution in statistics is a function that shows the possible values for a variable and how often they occur.
In probability theory and statistics, the normal distribution, also called the Gaussian distribution, is the most important continuous probability distribution.
It is sometimes also called a bell curve.
This slide deck is about data mining techniques.
The document discusses clustering analysis for data mining. It begins by outlining the importance and purposes of cluster analysis, including grouping related data and reducing large datasets. It then describes different types of clustering like hierarchical, partitional, density-based, and grid-based clustering. Specific clustering algorithms like k-means, hierarchical clustering, and DBSCAN are also covered. Finally, applications of clustering are mentioned, such as for machine translation, online shopping recommendations, and spatial databases.
K-means clustering is an unsupervised machine learning algorithm that groups unlabeled data points into a specified number of clusters (k) based on their similarity. It works by randomly assigning data points to k clusters and then iteratively updating cluster centroids and reassigning points until cluster membership stabilizes. K-means clustering aims to minimize intra-cluster variation while maximizing inter-cluster variation. There are various applications and variants of the basic k-means algorithm.
The document discusses various unsupervised learning techniques including clustering algorithms like k-means, k-medoids, hierarchical clustering and density-based clustering. It explains how k-means clustering works by selecting initial random centroids and iteratively reassigning data points to the closest centroid. The elbow method is described as a way to determine the optimal number of clusters k. The document also discusses how k-medoids clustering is more robust to outliers than k-means because it uses actual data points as cluster representatives rather than centroids.
K-means clustering groups data points into k clusters by minimizing the distance between points and cluster centroids. It works by randomly assigning points to initial centroids and then iteratively reassigning points to centroids until clusters are stable. Hierarchical clustering builds a dendrogram showing the relationship between clusters by either recursively merging or splitting clusters. Both are unsupervised learning techniques that group similar data points together without labels.
Pattern recognition: k-means clustering
This document discusses clustering and the k-means clustering algorithm. It defines clustering as grouping a set of data objects into clusters so that objects within the same cluster are similar to each other but dissimilar to objects in other clusters. The k-means algorithm is described as an iterative process that assigns each object to one of k predefined clusters based on the object's distance from the cluster's centroid, then recalculates the centroid, repeating until cluster assignments no longer change. A worked example demonstrates how k-means partitions 7 objects into 2 clusters over 3 iterations. The k-means algorithm is noted to be efficient but requires specifying k and can be impacted by outliers, noise, and non-convex cluster shapes.
This document provides an overview of unsupervised learning and clustering algorithms. It discusses the motivation for clustering as grouping similar data points without labels. It introduces common clustering algorithms like K-means, hierarchical clustering, and fuzzy C-means. It covers clustering criteria such as similarity functions, stopping criteria, and cluster quality. It also discusses techniques like data normalization and challenges in evaluating clusters without ground truths. The document aims to explain the concepts and applications of unsupervised learning for clustering unlabeled data.
International Journal of Engineering and Science Invention (IJESI)
This document discusses multidimensional clustering methods for data mining and their industrial applications. It begins with an introduction to clustering, including definitions and goals. Popular clustering algorithms are described, such as K-means, fuzzy C-means, hierarchical clustering, and mixture of Gaussians. Distance measures and their importance in clustering are covered. The K-means and fuzzy C-means algorithms are explained in detail. Examples are provided to illustrate fuzzy C-means clustering. Finally, applications of clustering algorithms in fields such as marketing, biology, and earth sciences are mentioned.
Clustering algorithms group data points together such that there is high similarity between points within a cluster and low similarity between points in different clusters. K-means clustering is a partitional clustering algorithm that partitions data into K mutually exclusive clusters by minimizing the within-cluster sum of squares. It works by iteratively assigning data points to the closest cluster centroid and recalculating centroids based on newly assigned points until cluster assignments stabilize. K-means requires specifying the number of clusters K in advance and is sensitive to initialization but is simple, efficient and intuitive for optimizing intra-cluster similarity.
Unsupervised learning techniques like clustering are used to explore intrinsic structures in unlabeled data and group similar data instances together. Clustering algorithms like k-means partition data into k clusters where each cluster has a centroid, and data points are assigned to the closest centroid. Hierarchical clustering creates nested clusters by iteratively merging or splitting clusters based on distance metrics. Choosing the right distance metric and clustering algorithm depends on factors like attribute ranges and presence of outliers.
Very useful for cluster analysis and supportive for engineering as well as IT students. It also provides an example for every topic, which helps with numerical problems. Good material for reading.
2. Definitions
Clustering is the task of dividing a population or set of data points into a number of groups such that data points in the same group are more similar to one another than to data points in other groups.
Clustering is a technique for grouping objects based on distance or similarity.
Data points in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features.
Clustering-based learning is an unsupervised learning task: learning starts with no specific target attribute in mind, and the data is explored with the goal of finding intrinsic structures in it.
3. The primary goal of the clustering technique is finding similar or homogeneous groups in the data, which are called clusters.
The way this is done is that data instances which are similar (in short, near to each other) are grouped into one cluster, while instances that are different are grouped into a different cluster.
Clustering refers to the grouping of records, observations, or cases into classes of similar objects.
A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.
4. Clustering differs from classification in that there is no target variable for clustering. The clustering task does not try to classify, estimate, or predict the value of a target variable.
Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of records within a cluster is maximized and the similarity to records outside the cluster is minimized.
6. Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
Land use: Identification of areas of similar land use in an earth observation database.
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.
City planning: Identifying groups of houses according to their house type, value, and geographical location.
7. Main issues in clustering:
• how to measure similarity
• how to measure distance for categorical variables
• how to standardize or normalize numerical variables
• how many clusters to use
8. How to measure similarity
A distance metric is used for measuring similarity. The most common distance metric is the Euclidean distance; other distances can also be used:
d_Euclidean(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xm - ym)^2 )
where x = x1, x2, ..., xm and y = y1, y2, ..., ym represent the m attribute values of two records.
9. How to measure distance for categorical variables
For categorical variables, we may again define a "different from" function for comparing the ith attribute values of a pair of records:
different(xi, yi) = 0 if xi = yi, and 1 otherwise
where xi and yi are categorical values. We may then substitute different(xi, yi) for the ith term in the Euclidean distance metric above.
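As an illustration (not part of the original slides), here is a minimal Python sketch of these distance calculations, assuming each record is a list of numeric and/or categorical attribute values:

import math

def different(xi, yi):
    # "Different from" indicator for categorical values: 0 if equal, 1 otherwise.
    return 0 if xi == yi else 1

def mixed_distance(x, y, categorical):
    # Euclidean-style distance where categorical attributes contribute 0/1 terms.
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        term = different(xi, yi) if i in categorical else (xi - yi)
        total += term ** 2
    return math.sqrt(total)

# Example: the third attribute (index 2) is categorical.
print(mixed_distance([1.0, 2.0, 'red'], [3.0, 2.0, 'blue'], categorical={2}))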
10. How to standardize or normalize numerical variables
For optimal performance, clustering algorithms, just like algorithms for classification, require the data to be normalized so that no particular variable or subset of variables dominates the analysis. Analysts may use either min-max normalization or Z-score standardization:
Min-max normalization: X* = (X - Min(X)) / Range(X), where Range(X) = Max(X) - Min(X)
Z-score standardization: Z = (X - Mean(X)) / SD(X), where SD(X) is the standard deviation of X
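For illustration, a small Python sketch of both normalization options (hypothetical helper functions, not from the original slides):

def min_max_normalize(values):
    # Min-max normalization: rescales values into the range [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_standardize(values):
    # Z-score standardization: subtract the mean and divide by the standard deviation.
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

print(min_max_normalize([2, 5, 7, 12, 26, 30, 40, 50]))
print(z_score_standardize([2, 5, 7, 12, 26, 30, 40, 50]))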
11. All clustering methods have as their goal the identification of groups of records such that similarity within a group is very high while the similarity to records in other groups is very low.
In other words, clustering algorithms seek to construct clusters of records such that the between-cluster variation is large compared to the within-cluster variation.
12. Requirements of Clustering Algorithms
Scalability − We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be restricted to distance measures that tend to find only small, spherical clusters.
High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
13. Ability to deal with noisy data − Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and usable.
14. Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
15. Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of the data. Each partition represents a cluster, and k ≤ n.
This means that the data is classified into k groups which satisfy the following requirements:
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method creates an initial partitioning.
• It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
16. Algorithms in the Partitioning Method:
K-means clustering − each cluster is represented by the center (mean) of the cluster.
K-medoids or PAM (Partitioning Around Medoids) − each cluster is represented by one of the objects in the cluster.
17. K-means Clustering Algorithm
K-means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean.
This method produces exactly k different clusters of greatest possible distinction.
The best number of clusters k, leading to the greatest separation (distance), is not known a priori and must be computed from the data.
18. K-means Algorithm
Step 1: Select the number of clusters k the data set should be partitioned into.
Step 2: Randomly assign k records to be the initial clusters (usually the first k records are assigned to the k clusters).
Step 3: Calculate the centroid of each cluster.
Step 4: For each record, find the nearest cluster center and add the record to that cluster.
Step 5: For each of the k clusters, find the cluster centroid and update the location of each cluster center to the new value of the centroid.
Step 6: Repeat Steps 4–5 until convergence or termination (the centroids do not change).
The centroid of a cluster is the mean value of the elements in that cluster.
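A minimal one-dimensional Python sketch of these steps (an illustration, not taken from the slides); it initializes the centroids from the first k records as in Step 2 and breaks distance ties in favour of the earlier cluster:

def kmeans_1d(data, k, max_iter=100):
    # Step 2: use the first k records as the initial cluster centroids.
    centroids = [float(x) for x in data[:k]]
    assignments = None
    for _ in range(max_iter):
        # Step 4: assign each record to the nearest centroid (ties go to the first cluster).
        new_assignments = [min(range(k), key=lambda j: abs(x - centroids[j])) for x in data]
        if new_assignments == assignments:   # Step 6: stop when assignments no longer change.
            break
        assignments = new_assignments
        # Step 5: recompute each centroid as the mean of its cluster members.
        for j in range(k):
            members = [x for x, a in zip(data, assignments) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, assignments

centroids, assignments = kmeans_1d([2, 5, 7, 12, 26, 30, 40, 50], k=3)
print(centroids, assignments)
# Reproduces the worked example that follows: clusters {2, 5}, {7, 12}, {26, 30, 40, 50}
# with final centroids 3.5, 9.5 and 36.5.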
19. K-means Clustering Example:
Dataset = { 2, 5, 7, 12, 26, 30, 40, 50 }
K = 3
1. Initially create three empty clusters:
Cluster C1 | Cluster C2 | Cluster C3
2. Add the first three elements to the clusters:
Cluster C1 | Cluster C2 | Cluster C3
2 | 5 | 7
3. Find the centroid of each cluster:
Cluster C1 (centroid = 2) | Cluster C2 (centroid = 5) | Cluster C3 (centroid = 7)
2 | 5 | 7
20. Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Element | Distance to C1 (centroid = 2) | Distance to C2 (centroid = 5) | Distance to C3 (centroid = 7)
2 | 0 | 3 | 5
5 | 3 | 0 | 2
7 | 5 | 2 | 0
12 | 10 | 7 | 5
26 | 24 | 21 | 19
30 | 28 | 25 | 23
40 | 38 | 35 | 33
50 | 48 | 45 | 43
Resulting clusters:
Cluster C1 (centroid = 2): 2
Cluster C2 (centroid = 5): 5
Cluster C3 (centroid = 7): 7, 12, 26, 30, 40, 50
21. 3. Find the centroid of each cluster:
Cluster C1 (centroid = 2): 2
Cluster C2 (centroid = 5): 5
Cluster C3 (centroid = 27.5): 7, 12, 26, 30, 40, 50
22. Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Element | Distance to C1 (centroid = 2) | Distance to C2 (centroid = 5) | Distance to C3 (centroid = 27.5)
2 | 0 | 3 | 25.5
5 | 3 | 0 | 22.5
7 | 5 | 2 | 20.5
12 | 10 | 7 | 15.5
26 | 24 | 21 | 1.5
30 | 28 | 25 | 2.5
40 | 38 | 35 | 12.5
50 | 48 | 45 | 22.5
Resulting clusters:
Cluster C1 (centroid = 2): 2
Cluster C2 (centroid = 5): 5, 7, 12
Cluster C3 (centroid = 27.5): 26, 30, 40, 50
23. 3. Find the centroid of each cluster:
Cluster C1 (centroid = 2): 2
Cluster C2 (centroid = 8): 5, 7, 12
Cluster C3 (centroid = 36.5): 26, 30, 40, 50
24. Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Element | Distance to C1 (centroid = 2) | Distance to C2 (centroid = 8) | Distance to C3 (centroid = 36.5)
2 | 0 | 6 | 34.5
5 | 3 | 3 | 31.5
7 | 5 | 1 | 29.5
12 | 10 | 4 | 24.5
26 | 24 | 18 | 10.5
30 | 28 | 22 | 6.5
40 | 38 | 32 | 3.5
50 | 48 | 42 | 13.5
Resulting clusters (ties go to the first cluster):
Cluster C1 (centroid = 2): 2, 5
Cluster C2 (centroid = 8): 7, 12
Cluster C3 (centroid = 36.5): 26, 30, 40, 50
25. 3. Find the centroid of each cluster:
Cluster C1 (centroid = 3.5): 2, 5
Cluster C2 (centroid = 9.5): 7, 12
Cluster C3 (centroid = 36.5): 26, 30, 40, 50
26. Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Element | Distance to C1 (centroid = 3.5) | Distance to C2 (centroid = 9.5) | Distance to C3 (centroid = 36.5)
2 | 1.5 | 7.5 | 34.5
5 | 1.5 | 4.5 | 31.5
7 | 3.5 | 2.5 | 29.5
12 | 8.5 | 2.5 | 24.5
26 | 22.5 | 16.5 | 10.5
30 | 26.5 | 20.5 | 6.5
40 | 36.5 | 30.5 | 3.5
50 | 46.5 | 40.5 | 13.5
Resulting clusters:
Cluster C1 (centroid = 3.5): 2, 5
Cluster C2 (centroid = 9.5): 7, 12
Cluster C3 (centroid = 36.5): 26, 30, 40, 50
The cluster assignments are unchanged from the previous iteration, so the algorithm has converged.
28. When to stop?
The clustering algorithm may terminate when some convergence criterion is met, such as no significant shrinkage in the mean squared error (MSE):
SSE = sum over clusters i of sum over records x in cluster i of d(x, mi)^2
MSE = SSE / (N - k)
where mi is the centroid of cluster i, N is the total number of records, and k is the number of clusters.
32. Cluster Quality
Clustering algorithms seek to construct clusters of records such that the between-cluster variation is large compared to the within-cluster variation. Because this concept is analogous to the analysis of variance, we define a pseudo-F statistic as follows:
pseudo-F = MSB / MSE = [ SSB / (k - 1) ] / [ SSE / (N - k) ]
where SSE is defined as above, MSB is the mean square between, and SSB is the sum of squares between clusters, defined as:
SSB = sum over i = 1 to k of ni * d(mi, M)^2
where ni is the number of records in cluster i, mi is the centroid (cluster center) of cluster i, and M is the grand mean of all the data.
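A small Python sketch (an illustration of the formulas above, not part of the original slides) that computes SSE, SSB, and the pseudo-F statistic for one-dimensional clusters:

def pseudo_f(clusters):
    # clusters: list of lists of numeric values, one inner list per cluster.
    all_points = [x for c in clusters for x in c]
    grand_mean = sum(all_points) / len(all_points)
    k, n = len(clusters), len(all_points)

    sse = 0.0   # within-cluster sum of squared distances to each cluster centroid
    ssb = 0.0   # between-cluster sum of squares, weighted by cluster size
    for c in clusters:
        centroid = sum(c) / len(c)
        sse += sum((x - centroid) ** 2 for x in c)
        ssb += len(c) * (centroid - grand_mean) ** 2

    msb = ssb / (k - 1)   # mean square between clusters
    mse = sse / (n - k)   # mean square error (within clusters)
    return msb / mse

# Clusters from the worked k-means example above.
print(pseudo_f([[2, 5], [7, 12], [26, 30, 40, 50]]))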
33. MSB represents the between-cluster variation and MSE represents the within-cluster variation.
Thus, a "good" clustering would have a large value of the pseudo-F statistic, representing a situation where the between-cluster variation is large compared to the within-cluster variation.
Hence, as the k-means algorithm proceeds and the quality of the clusters increases, we would expect MSB to increase, MSE to decrease, and F to increase.
34. K-means Clustering Summary
Advantages:
• Simple and understandable
• Items are automatically assigned to clusters
Disadvantages:
• The number of clusters must be picked beforehand
• Often terminates at a local optimum
• All items are forced into a cluster
• Too sensitive to outliers
35. K-medoids Algorithm
Medoids are representative objects of a data set (or of a cluster within a data set) whose average dissimilarity to all the objects in the cluster is minimal.
Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set.
Medoids are most commonly used on data where a mean or centroid cannot be defined, such as graphs.
A medoid of a finite data set is a data point from that set whose average dissimilarity to all the data points is minimal, i.e., it is the most centrally located point in the set.
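As a small illustration (not from the original slides), the medoid of a set can be found directly from this definition:

def medoid(points, dist):
    # The medoid is the member of the set whose total (equivalently, average)
    # dissimilarity to all other members is minimal.
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Example with one-dimensional points and absolute difference as the dissimilarity.
print(medoid([2, 5, 7, 12, 26, 30, 40, 50], dist=lambda a, b: abs(a - b)))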
36. Mathematical Formulation for K-means
D = {x1, x2, ..., xi, ..., xm}: a data set of m records
xi = (xi1, xi2, ..., xin): each record is an n-dimensional vector
37. Finding Cluster Centres that Minimize Distortion:
Distortion is the total squared distance between each record and its assigned cluster centre:
Distortion = sum over clusters j of sum over records xi in cluster j of || xi - cj ||^2
The solution can be found by setting the partial derivative of the distortion with respect to each cluster centre to zero; this yields each cluster centre cj as the mean of the records assigned to it.
38. For any k clusters, the value of k should be chosen such that increasing k further yields little or no reduction in the distortion, i.e., the distortion remains roughly constant. The point at which this happens is called the "elbow".
This is the ideal value of k for the clusters created.
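A sketch of the elbow method using scikit-learn (an assumed library choice; the slides do not name one). KMeans exposes the final distortion as inertia_; plotting it against k and looking for the bend gives the elbow:

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = np.array([[2], [5], [7], [12], [26], [30], [40], [50]])  # the example data set

distortions = []
ks = range(1, 7)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_)   # within-cluster sum of squared distances

plt.plot(list(ks), distortions, marker='o')
plt.xlabel('k')
plt.ylabel('Distortion (inertia)')
plt.show()   # the bend ("elbow") in this curve suggests the ideal k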
39. Hierarchical Methods:
This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches here −
• Agglomerative Approach
• Divisive Approach
In hierarchical clustering, a treelike cluster structure (dendrogram) is created through recursive partitioning (divisive methods) or combining (agglomerative methods) of existing clusters.
40. In hierarchical clustering, we organize the objects into a hierarchy represented by a tree-like diagram called a dendrogram.
Dendrogram:
The standard output of hierarchical clustering is a dendrogram.
A dendrogram is a cluster tree diagram in which the distance of each split or merge is recorded.
A dendrogram is a visualization of hierarchical clustering.
41. Using a dendrogram, we can easily specify the cutting point to determine the number of clusters. For example, in the left dendrogram below, we set the cutting distance at 2 and obtain two clusters out of 6 objects. The first cluster consists of 4 objects (numbers 4, 6, 5 and 3) and the second cluster consists of two objects (numbers 1 and 2). Similarly, in the right dendrogram, setting the cutting distance at 1.2 produces 3 clusters.
42. • Agglomerative Approach
This approach is also known as the bottom-up approach. We start with each object forming a separate cluster. The method keeps merging the objects or clusters that are closest to one another, and it keeps doing so until all of the clusters are merged into one or until the termination condition holds.
• Divisive Approach
This approach is also known as the top-down approach. We start with all of the objects in the same cluster. In each iteration, a cluster is split up into smaller clusters. This is done until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
43. Steps for Hierarchical Clustering – Agglomerative Approach
1. Compute the distance matrix from the object features.
2. Set each object as an independent cluster (if there are 5 objects, there will be 5 clusters).
3. Iterate until the number of clusters equals 1:
A. Merge the two closest clusters.
B. Update the distance matrix.
A minimal Python sketch of these steps is shown below.
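The sketch uses SciPy (an assumed library choice; the topics above only mention "implementation of AHC using Python") and the six objects from the example that follows. Single linkage reproduces the merge distances summarized later on slide 61.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Objects A-F with attributes X1, X2 (from the example below).
X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])
labels = ['A', 'B', 'C', 'D', 'E', 'F']

print(squareform(pdist(X, metric='euclidean')))       # Step 1: the distance matrix

Z = linkage(X, method='single', metric='euclidean')   # Steps 2-3: single-linkage merges
dendrogram(Z, labels=labels)                          # visualize the merge hierarchy
plt.show()

print(fcluster(Z, t=2, criterion='distance'))         # cut the dendrogram at distance 2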
44. Example:
Assume we have six objects A, B, C, D, E, F, each having two attributes, X1 and X2.
The distance between two objects is calculated using the Euclidean distance formula applied to their attributes X1 and X2.
For example, the distance between A and B can be calculated as:
d(A, B) = sqrt( (X1A - X1B)^2 + (X2A - X2B)^2 ) = sqrt( (1 - 1.5)^2 + (1 - 1.5)^2 ) ≈ 0.71
45. Object | X1 | X2
A | 1 | 1
B | 1.5 | 1.5
C | 5 | 5
D | 3 | 4
E | 4 | 4
F | 3 | 3.5
48. How do we calculate the distance between the new cluster (D, F) and the other clusters A, B, C, and E?
49. Linkages between Objects
The rules of hierarchical clustering lie in how objects should be grouped into clusters. Given a distance matrix, linkages between objects can be computed through a criterion for computing the distance between groups.
The most common and basic criteria are:
1. Single linkage: minimum distance criterion
61. We summarize the results of the computation as follows:
1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge clusters D and F into cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B into (A, B) at distance 0.71
4. We merge cluster E and (D, F) into ((D, F), E) at distance 1.00
5. We merge cluster ((D, F), E) and C into (((D, F), E), C) at distance 1.41
6. We merge cluster (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50
7. The last cluster contains all the objects, which concludes the computation
63. How do we determine the distance between clusters of records?
There are several criteria for determining the distance between arbitrary clusters A and B:
Single linkage:
Single linkage, sometimes termed the nearest-neighbour approach, is based on the minimum distance between any record in cluster A and any record in cluster B.
In other words, cluster similarity is based on the similarity of the most similar members from each cluster.
Single linkage tends to form long, slender clusters, which may sometimes lead to heterogeneous records being clustered together.
64. Complete linkage:
Complete linkage, sometimes termed the farthest-neighbor approach, is based on the maximum distance between any record in cluster A and any record in cluster B.
In other words, cluster similarity is based on the similarity of the most dissimilar members from each cluster.
Complete linkage tends to form more compact, sphere-like clusters.
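To make the two criteria concrete, here is a small Python sketch (a hypothetical helper, not from the original slides) that computes the distance between two clusters of records under either linkage:

def cluster_distance(cluster_a, cluster_b, dist, method='single'):
    # Distances between every pair of records drawn from the two clusters.
    pairwise = [dist(a, b) for a in cluster_a for b in cluster_b]
    # Single linkage: nearest pair; complete linkage: farthest pair.
    return min(pairwise) if method == 'single' else max(pairwise)

# Example with one-dimensional records and absolute difference as the distance.
print(cluster_distance([2, 5], [7, 12], dist=lambda a, b: abs(a - b), method='single'))    # 2
print(cluster_distance([2, 5], [7, 12], dist=lambda a, b: abs(a - b), method='complete'))  # 10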