UNIT 4 Clustering and Applications
Clustering:
The process of grouping abstract objects into classes of similar objects is known as
clustering.
Cluster analysis in data mining means finding groups of objects that are similar to each
other within a group but different from the objects in other groups. In the process of
clustering in data analytics, the data sets are divided into groups or classes based on
data similarity, and each class is then labelled according to its data type.
Clustering is a type of unsupervised machine learning method. In unsupervised learning,
inferences are drawn from data sets that do not contain a labelled output variable. It is
an exploratory data analysis technique that allows us to analyze multivariate data sets.
Clustering is a task of dividing the data sets into a certain number of clusters in such a
manner that the data points belonging to a cluster have similar characteristics. Clusters are
nothing but the grouping of data points such that the distance between the data points within
the clusters is minimal. Clustering is done to segregate the groups with similar traits.
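As a toy illustration of that goal, the hypothetical points below fall into two groups whose within-group distances are much smaller than the distance between the groups. This is a minimal sketch; the data and names are invented for illustration:

```python
from math import dist

# Two hypothetical clusters of 2-D points (toy data, not from the text).
cluster_a = [(1.0, 1.0), (1.5, 1.2), (0.8, 0.9)]
cluster_b = [(8.0, 8.0), (8.5, 7.8), (7.9, 8.3)]

def mean_pairwise_distance(points_x, points_y):
    """Average Euclidean distance between every distinct pair drawn from the two lists."""
    pairs = [(p, q) for p in points_x for q in points_y if p != q]
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

within_a = mean_pairwise_distance(cluster_a, cluster_a)
between = mean_pairwise_distance(cluster_a, cluster_b)
# A good clustering keeps within-cluster distances small relative to
# between-cluster distances.
print(within_a < between)  # → True
```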
Applications of cluster analysis:
It is widely used in many applications such as image processing, data analysis,
and pattern recognition.
It helps marketers find distinct groups in their customer base and characterize
those groups by their purchasing patterns.
It can be used in the field of biology to derive plant and animal taxonomies
and to identify genes with similar capabilities.
It also helps in information discovery by classifying documents on the web.
Requirements of clustering in data mining:
The following are the main requirements of clustering in data mining.
1. Scalability – We require highly scalable clustering algorithms to work with large
databases.
2. Ability to deal with different kinds of attributes – Algorithms should be able to
work with different types of data, such as categorical, numerical, and binary data.
3. Discovery of clusters with arbitrary shape – The algorithm should be able to detect
clusters of arbitrary shape and should not be bound to distance measures that favour
spherical clusters.
4. Interpretability – The results of clustering should be usable, understandable, and
interpretable. The main aim of clustering in data analytics is to ensure that haphazard
data is grouped based on the similarity of its characteristics.
5. Helps in dealing with messy data – Usually, real-world data is messy and
unstructured. It cannot be analyzed quickly, which is why clustering is so significant
in data mining. Grouping gives structure to the data by organizing it into groups of
similar data objects. This makes it easier for the data expert to process the data and
to discover new things. Analyzing data that has already been classified and labelled
through clustering is much easier than analyzing unstructured data, and it leaves
less room for error.
6. High dimensionality – Clustering algorithms should be able to handle
high-dimensional data as well as small data sets; they need to cope with data of any
dimensionality.
7. Discovery of arbitrarily shaped clusters – Clustering algorithms in data mining
should be able to detect arbitrarily shaped clusters and should not be limited to
finding only small, spherical clusters.
Importance of Clustering Methods:
1. Clustering methods help restart the local search procedure and remove its
inefficiency. In addition, clustering helps to determine the internal structure of the data.
2. Clustering methods have been used for model analysis and for identifying regions of
attraction.
3. Clustering helps in understanding the natural grouping in a dataset. It aims to make
sense of the data by partitioning it into logical groupings.
4. Clustering quality depends on the methods used and on the identification of hidden patterns.
5. Clustering plays a wide role in applications such as marketing, economic research,
web-log analysis, image processing, and spatial research, where similarity measures
must be identified.
6. It is used in outlier detection, for example to detect credit card fraud.
Characteristics of Cluster Analysis:
It helps to visualize high-dimensional data
It further enables data scientists to deal with different types of data like discrete,
categorical, and binary data
It gives structure to unstructured data sets by organizing them into groups
Advantages of Cluster Analysis:
Helps to identify obscure patterns and relationships within a data set
It helps to carry out exploratory data analysis
It can also be used for market segmentation, customer profiling, and more
Disadvantages of Cluster Analysis:
It can be difficult to interpret the results of an ambiguous or ill-defined cluster
The result of the analysis is affected by the choice of the clustering algorithm
Furthermore, the success of cluster analysis depends on the data, the goal of the
analysis, and the data scientist’s capability to interpret the result
Types of Data in Cluster Analysis:
A database may contain all six of the following types of variables:
1. symmetric binary
2. asymmetric binary
3. nominal
4. ordinal
5. interval
6. ratio.
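As a sketch of how a dissimilarity measure can combine such variable types, the hypothetical function below uses a simple Gower-style rule: exact matching for binary and nominal attributes, and a range-normalised difference for interval/ratio attributes. Asymmetric-binary and ordinal handling are omitted for brevity, and all names and data are invented:

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Gower-style dissimilarity over mixed attribute types (sketch).

    kinds: per-attribute type: 'binary', 'nominal', or 'numeric' (interval/ratio).
    ranges: per-attribute value range, used to scale numeric differences to [0, 1].
    """
    total = 0.0
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind in ("binary", "nominal"):
            # Simple (symmetric) matching: 0 if the values agree, 1 otherwise.
            total += 0.0 if xi == yi else 1.0
        else:
            # Interval or ratio attribute: normalised absolute difference.
            total += abs(xi - yi) / rng
    return total / len(x)

# One record mixing a binary, a nominal, and a ratio-scaled attribute.
a = (1, "red", 30.0)
b = (0, "red", 50.0)
print(mixed_dissimilarity(a, b, ("binary", "nominal", "numeric"), (1, 1, 100.0)))  # ≈ 0.4
```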
Categorization of Major Clustering Methods:
Clustering methods can be classified into the following categories.
1. Model-Based Method:
In this method, a model is hypothesized for each cluster to find the best fit of the data to
a given model. This method locates the clusters by constructing a density function that
reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outlier or noise into account. It therefore yields robust clustering
methods.
2. Hierarchical Method:
Hierarchical clustering investigates data clusters with a variety of scales and distances. In this
approach, you create a cluster tree with a multilevel hierarchy consisting of small clusters.
Then, neighboring clusters with similar features from every hierarchy are grouped together.
This continues until only one cluster is left in the hierarchy. This, therefore, allows the data
scientist to identify the hierarchical cluster appropriate to them.
There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps merging the objects or groups that are close to one
another, and it keeps doing so until all of the groups are merged into one or until the
termination condition holds.
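The bottom-up merging described above can be sketched as follows, using single linkage (the distance between two clusters is taken as the distance between their closest members). This is a toy illustration on invented data, not an optimised implementation:

```python
from math import dist

def agglomerate(points, k):
    """Bottom-up (agglomerative) clustering: start with singleton groups and
    repeatedly merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]            # each object starts as its own group
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))     # merge the closest pair of clusters
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerate(points, 2))  # the two nearby pairs end up in separate clusters
```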
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. With each iteration, a cluster is split into smaller clusters.
This is done until each object is in its own cluster or the termination condition holds.
This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering:
Here are the two approaches that are used to improve the quality of hierarchical clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing
macro-clustering on the micro-clusters.
3. Constraint-Based Method:
In this method, the clustering is performed by the incorporation of user or application-oriented
constraints. A constraint refers to the user expectation or the properties of desired clustering
results. Constraints provide us with an interactive way of communication with the clustering
process. Constraints can be specified by the user or the application requirement.
4. Grid-Based Method:
In this method, the objects together form a grid. The object space is quantized into a
finite number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the quantized
space.
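A minimal sketch of the quantization step, assuming 2-D points and a uniform cell size (all names and data invented), shows why the processing cost depends on the number of occupied cells rather than on the number of points:

```python
def grid_cells(points, cell_size):
    """Quantise the object space into cells; clustering then operates on the
    (bounded) set of occupied cells rather than on individual points."""
    cells = {}
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))   # integer cell index
        cells.setdefault(key, []).append((x, y))
    return cells

points = [(0.2, 0.3), (0.4, 0.1), (9.5, 9.7)]
cells = grid_cells(points, cell_size=1.0)
# Only 2 cells are occupied, however many points fall into each of them.
print(len(cells))  # → 2
```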
5. Partitioning Method:
Partitioning clustering considers every data point in a cluster as objects that have a
specific location and distance from each other. It partitions the objects in a way that
objects with the same features are close to each other, and far away from objects in
other clusters.
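A classic example of a partitioning method is k-means. The sketch below implements Lloyd's algorithm on toy data; the initial centers are chosen by hand here for illustration, whereas real implementations pick them automatically:

```python
from math import dist

def k_means(points, centers, iters=10):
    """Lloyd's algorithm: assign every point to its nearest center, then move
    each center to the mean of its assigned points; repeat."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

points = [(0, 0), (1, 0), (10, 10), (11, 10)]
centers, groups = k_means(points, centers=[(0, 0), (10, 10)])
print(centers)  # → [(0.5, 0.0), (10.5, 10.0)]
```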
6. Density-Based Method:
This method is based on the notion of density. The basic idea is to continue growing a
given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for
each data point within a given cluster, the neighborhood of a given radius has to contain
at least a minimum number of points.
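This density-driven growing can be sketched in the style of DBSCAN, a well-known density-based algorithm. The toy implementation below omits border-point subtleties, leaves sparse points unlabelled as noise, and uses invented data:

```python
from math import dist

def density_cluster(points, eps, min_pts):
    """DBSCAN-flavoured sketch: grow a cluster from a dense point as long as
    each point's eps-neighbourhood contains at least min_pts points."""
    labels = {}
    cluster_id = 0
    for seed in points:
        if seed in labels:
            continue
        neighbours = [q for q in points if dist(seed, q) <= eps]
        if len(neighbours) < min_pts:
            continue                        # not dense enough to start a cluster
        cluster_id += 1
        frontier = [seed]
        while frontier:
            p = frontier.pop()
            if p in labels:
                continue
            labels[p] = cluster_id
            reachable = [q for q in points if dist(p, q) <= eps]
            if len(reachable) >= min_pts:   # p is dense, so keep growing from it
                frontier.extend(reachable)
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10)]
labels = density_cluster(points, eps=1.5, min_pts=3)
print(labels)  # the three close points share a cluster; (10, 10) stays unlabelled (noise)
```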
Outlier Analysis:
Outlier analysis in data mining involves identifying and analyzing data points significantly
different or deviating from the rest of the dataset. Outliers can be caused by various factors,
such as data entry errors, unexpected events, etc., and their detection can lead to valuable
insights and improve the accuracy of models. A wide range of techniques can be used for outlier
analysis in data mining, such as statistical methods, clustering algorithms, and machine
learning models.
An outlier can be defined as a data point that deviates significantly from the normal
pattern or behavior of the data. Measurement errors, unexpected events, data processing
errors, and similar factors can cause these outliers. Outliers are also often referred to
as anomalies, aberrations, or irregularities.
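As a sketch of one simple statistical technique for this, the code below flags values whose z-score (distance from the mean, measured in standard deviations) exceeds a threshold. The data and threshold are invented for illustration:

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=3.0):
    """Statistical outlier detection (sketch): flag values whose z-score
    exceeds the threshold, i.e. values far from the mean in stdev units."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

data = [10, 11, 9, 10, 12, 11, 10, 9, 95]  # toy data; 95 is the aberration
print(z_score_outliers(data, threshold=2.0))  # → [95]
```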