ML Module IV
In agglomerative hierarchical clustering (AHC), we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work: unlike the K-Means algorithm, there is no requirement to predetermine the number of clusters.
The working of the AHC algorithm can be explained using the below steps:
Step-1: Treat each data point as a single cluster. Let's say there are N data points, so the
number of clusters will also be N.
Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there
will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together to form one cluster.
There will be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left.
Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to
divide the clusters as per the problem.
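For illustration, the following minimal Python sketch performs agglomerative clustering with SciPy on a small made-up dataset; the data values, the Ward linkage choice, and the cut into two clusters are assumptions for demonstration only.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Illustrative 2-D data: N points, so we start with N singleton clusters
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Repeatedly merge the two closest clusters (Ward linkage) until one cluster remains
Z = linkage(X, method="ward")

# The linkage matrix records every merge; cutting the tree yields the desired clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# The dendrogram structure can be computed (and drawn, if a plotting backend is available)
tree = dendrogram(Z, no_plot=True)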
Divisive Hierarchical Clustering
Divisive hierarchical clustering is a method used in machine learning and data analysis to
create a hierarchical decomposition of a dataset. Unlike agglomerative clustering, which
starts with each sample as its own cluster and merges them, divisive clustering begins with a
single cluster containing all the samples and recursively divides it into smaller clusters.
Step-1: Start with a single cluster: All the data points begin in one cluster that contains the entire dataset.
Step-2: Divide the cluster: The algorithm identifies subgroups within the cluster. This division can be based on various distance metrics or similarity measures.
Step-3: Recursive division: The clusters are divided further into smaller clusters, iteratively
creating a hierarchical structure.
Step-4: Stop condition: The algorithm continues dividing until a stopping condition is met.
This could be a predefined number of clusters, reaching a certain threshold of similarity, or
any other criterion specific to the problem.
One challenge with divisive clustering is determining the optimal number of clusters or
deciding when to stop the recursive division. This aspect often involves domain knowledge or
using techniques like examining dendrograms or evaluating cluster quality metrics to make
an informed decision.
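Divisive clustering is less commonly available as a ready-made library routine, so the sketch below only illustrates the top-down idea by recursively splitting a cluster with 2-means; the splitting rule, the depth-based stop condition, and the data are illustrative assumptions rather than a standard implementation.

import numpy as np
from sklearn.cluster import KMeans

def divisive_cluster(X, indices=None, depth=0, max_depth=2):
    """Recursively split a cluster in two until max_depth (the stop condition)."""
    if indices is None:
        indices = np.arange(len(X))           # Step 1: one cluster holding all points
    if depth == max_depth or len(indices) < 2:
        return [indices]                      # Step 4: stop condition reached
    # Step 2: divide the cluster into two subgroups (here: 2-means)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    # Step 3: recursive division of each subgroup
    return (divisive_cluster(X, left, depth + 1, max_depth) +
            divisive_cluster(X, right, depth + 1, max_depth))

X = np.random.RandomState(0).rand(40, 2)      # illustrative data
for cluster in divisive_cluster(X):
    print(cluster)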
BIRCH
Basic clustering algorithms like K-means and agglomerative clustering are the most commonly used clustering algorithms. But when clustering very large datasets, advanced algorithms such as BIRCH and DBSCAN are useful for performing precise clustering efficiently. Moreover, BIRCH is attractive because of its easy implementation. BIRCH first condenses the dataset into small summaries and then clusters those summaries; it does not cluster the dataset directly. That is why BIRCH is often used with other clustering algorithms: after the summary is built, it can be clustered by another clustering algorithm.
Stages of BIRCH
BIRCH is often used to complement other clustering algorithms by creating a summary of the
dataset that the other clustering algorithm can now use. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is an attribute whose values
can be represented in Euclidean space, i.e., no categorical attributes should be present. The
BIRCH clustering algorithm consists of two stages:
1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions
called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined
as an ordered triple (N, LS, SS) where 'N' is the number of data points in the cluster,
'LS' is the linear sum of the data points, and 'SS' is the squared sum of the data points
in the cluster. A CF entry can be composed of other CF entries. Optionally, we can condense this initial CF tree into a smaller CF tree.
2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF
tree. A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a
CF tree contains a pointer to a child node, and a CF entry made up of the sum of CF
entries in the child nodes. Optionally, we can refine these clusters.
Algorithm
The BIRCH algorithm builds a tree structure over the given data called the Clustering Feature tree (CF tree). The algorithm is based on this CF (clustering feature) tree: it uses a tree-structured summary of the data to create clusters.
In the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes that contain several sub-clusters are called CF subclusters; these CF subclusters are situated in the non-terminal (non-leaf) CF nodes.
The CF tree is a height-balanced tree that gathers and manages clustering features and holds the necessary information about the given data for further hierarchical clustering. This avoids the need to work with the whole input data. In the tree, each cluster of data points is represented by a CF, a triple of three numbers (N, LS, SS).
The BIRCH algorithm proceeds in four main phases:
o Scanning data (building the CF tree).
o Condensing data (resizing the CF tree, optional).
o Global clustering.
o Refining clusters (optional).
Two of these four phases (condensing the data and refining the clusters) are optional; they come into play when more accuracy is required. Scanning data is essentially loading the data into the model: the algorithm scans the whole dataset and fits it into the CF tree.
In condensing, it resets and resizes the data for better fitting into the CF tree. In global
clustering, it sends CF trees for clustering using existing clustering algorithms. Finally,
refining fixes the problem of CF trees where the same valued points are assigned to different
leaf nodes.
BIRCH clustering achieves its high efficiency by clever use of a small set of summary
statistics to represent a larger set of data points. These summary statistics constitute a CF and
represent a sufficient substitute for the actual data for clustering purposes.
A CF is a set of three summary statistics representing a set of data points in a single cluster.
These statistics are as follows:
o Count [The number of data points in the cluster]
o Linear Sum [The sum of the individual coordinates. This is a measure of the location of the cluster]
o Squared Sum [The sum of the squared coordinates. This is a measure of the spread of the cluster]
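The following short sketch shows how a CF triple can be built, merged, and used to recover a sub-cluster's centroid and radius without revisiting the raw points; the function names and sample points are illustrative.

import numpy as np

def make_cf(points):
    """Build a clustering feature (N, LS, SS) from an array of points."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), (points ** 2).sum()

def merge_cf(cf_a, cf_b):
    """CFs are additive: merging two sub-clusters just adds their statistics."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

def centroid_and_radius(cf):
    """Location and spread of a sub-cluster, recovered from (N, LS, SS) alone."""
    n, ls, ss = cf
    centroid = ls / n
    radius = np.sqrt(max(ss / n - np.dot(centroid, centroid), 0.0))
    return centroid, radius

cf1 = make_cf([[1.0, 1.0], [2.0, 1.0]])
cf2 = make_cf([[1.5, 2.0]])
print(centroid_and_radius(merge_cf(cf1, cf2)))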
CF Tree
The building process of the CF Tree can be summarized in the following steps, such as:
Step 1: For each given record, BIRCH compares the location of that record with the location
of each CF in the root node, using either the linear sum or the mean of the CF. BIRCH passes
the incoming record to the root node CF closest to the incoming record.
Step 2: The record then descends down to the non-leaf child nodes of the root node CF
selected in step 1. BIRCH compares the location of the record with the location of each non-
leaf CF. BIRCH passes the incoming record to the non-leaf node CF closest to the incoming
record.
Step 3: The record then descends down to the leaf child nodes of the non-leaf node CF
selected in step 2. BIRCH compares the location of the record with the location of each leaf.
BIRCH tentatively passes the incoming record to the leaf closest to the incoming record.
Step 4: One of two cases then applies:
(i) If the radius of the chosen leaf, including the new record, does not exceed the threshold T, then the incoming record is assigned to that leaf. The leaf and its parent CFs are updated to account for the new data point.
(ii) If the radius of the chosen leaf, including the new record, exceeds the threshold T, then a new leaf is formed, consisting of the incoming record only. The parent CFs are updated to account for the new data point.
If step 4(ii) is executed and the maximum number of leaf entries, L, is already present in the leaf node, the leaf node is split into two leaf nodes. If the parent node is full, split the parent node, and so on. The most distant leaf node CFs are used as leaf node seeds, with the remaining CFs being assigned to whichever leaf node is closer. Note that the radius of a cluster may be calculated even without knowing the data points, as long as we have the count N, the linear sum LS, and the squared sum SS. This allows BIRCH to evaluate whether a given data point belongs to a particular sub-cluster without scanning the original data set.
Once the CF tree is built, any existing clustering algorithm may be applied to the sub-clusters
(the CF leaf nodes) to combine these sub-clusters into clusters. The task of clustering
becomes much easier as the number of sub-clusters is much less than the number of data
points. When a new data value is added, these statistics may be easily updated, thus making
the computation more efficient.
Parameters of BIRCH
There are three parameters in this algorithm that need to be tuned. Unlike K-means, the optimal number of clusters (k) need not be supplied by the user, as the algorithm can determine it.
o Threshold: The maximum radius that a sub-cluster in a leaf node of the CF tree may have; a new point is absorbed into a leaf sub-cluster only if the radius stays below this threshold.
o Branching factor: The maximum number of CF entries (sub-clusters) that each node of the CF tree can hold; exceeding it causes the node to be split.
o Number of clusters: The number of clusters produced in the final global clustering step applied to the leaf sub-clusters.
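As a usage sketch, scikit-learn's Birch estimator exposes these three parameters directly; the data and parameter values below are chosen only for illustration.

import numpy as np
from sklearn.cluster import Birch

X = np.random.RandomState(0).rand(500, 2)   # illustrative data

# threshold: maximum radius of a leaf sub-cluster
# branching_factor: maximum number of CF entries per node
# n_clusters: global clustering applied to the leaf sub-clusters
model = Birch(threshold=0.1, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(len(model.subcluster_centers_), "leaf sub-clusters ->", labels[:10])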
Advantages of BIRCH
It is local in that each clustering decision is made without scanning all data points and
existing clusters. It exploits the observation that the data space is not usually uniformly
occupied, and not every data point is equally important.
It uses available memory to derive the finest possible sub-clusters while minimizing I/O
costs. It is also an incremental method that does not require the whole data set in advance.
Density-based Clustering
Density-based clustering refers to methods that build clusters from a local cluster criterion, such as density-connected points. In this tutorial, we will discuss density-based clustering with examples.
Eps: Eps (ε) is the radius that defines the neighborhood of a point.
MinPts: MinPts refers to the minimum number of points required in the Eps neighborhood of a point for it to be a core point.
Directly density reachable:
A point i is considered directly density reachable from a point k with respect to Eps and MinPts if i belongs to NEps(k) and k is a core point, i.e., its Eps neighborhood contains at least MinPts points.
Density reachable:
A point i is density reachable from a point j with respect to Eps and MinPts if there is a chain of points i1, ..., in with i1 = j and in = i such that each point i(k+1) is directly density reachable from i(k).
Density connected:
A point i is said to be density connected to a point j with respect to Eps and MinPts if there is a point o such that both i and j are density reachable from o with respect to Eps and MinPts.
Suppose a set of objects is denoted by D'. We can say that an object i is directly density reachable from the object j only if it is located within the ε neighborhood of j, and j is a core object.
An object i is density reachable from the object j with respect to ε and MinPts in a given set of objects D' only if there is a chain of objects i1, ..., in with i1 = j and in = i such that each i(k+1) is directly density reachable from i(k) with respect to ε and MinPts.
An object i is density connected to an object j with respect to ε and MinPts in a given set of objects D' only if there is an object o belonging to D' such that both i and j are density reachable from o with respect to ε and MinPts.
o It is a one-scan method: the database needs to be examined only once.
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on a density-based notion of a cluster and can identify clusters of arbitrary shape in a spatial database that contains noise and outliers.
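A brief usage sketch with scikit-learn's DBSCAN, where eps plays the role of Eps and min_samples the role of MinPts; the two-moons data and parameter values are illustrative, and points labelled -1 are treated as noise.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters of arbitrary shape with some noise
X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))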
OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an ordering of the database with respect to its density-based clustering structure. This cluster ordering contains information equivalent to the density-based clusterings obtained over a wide range of parameter settings. OPTICS is beneficial for both automatic and interactive cluster analysis, including determining an intrinsic clustering structure.
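As an illustrative sketch, scikit-learn's OPTICS estimator exposes the cluster ordering and reachability distances described above; the data and parameter values are assumptions for demonstration.

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Blobs of different densities: a wide range of eps values would be needed by DBSCAN
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 2.0], random_state=0)

optics = OPTICS(min_samples=10).fit(X)

# The ordering and reachability distances encode density structure over many eps values
print(optics.ordering_[:10])
print(optics.reachability_[optics.ordering_][:10])
print(set(optics.labels_))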
DENCLUE
DENCLUE stands for DENsity-based CLUstEring. It models the overall density of the data as the sum of influence (kernel) functions of the individual points and defines clusters through the local maxima (density attractors) of this density function.
Undirected Graphical Models
Nodes (Vertices): Each node in the graph represents a random variable, and the entire set of nodes represents the set of random variables in the model.
Edges (Links): The edges between nodes encode relationships or dependencies. Absence of
an edge between nodes implies conditional independence given the rest of the variables.
Global Normalization: Unlike directed models, undirected models often require global
normalization due to the lack of a clear flow of probability as in directed models.
Applications:
Computer Vision: Object recognition, scene understanding, and spatial modeling of visual
data.
Undirected graphical models offer a flexible framework for representing complex
dependencies among variables, enabling various applications in different domains while
capturing probabilistic relationships among them.
Variable Elimination
Key Concepts:
1. Initialization: Express the joint distribution as a product of factors (one per node of the graph, or per clique in an undirected model), restricted by any observed evidence.
2. Variable Ordering: Choose an elimination order for the variables. This order significantly impacts the efficiency of the algorithm.
3. Elimination of Variables: For each variable in the chosen order, multiply all factors that mention it and sum the variable out, producing a new, smaller factor.
4. Final Computation: Once all variables are eliminated, the remaining factors represent the desired probabilities (up to normalization).
In a worked example, each variable (for instance S, then D) is eliminated in turn following step 3.
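To make the multiply-and-sum-out step concrete, here is a compact sketch of variable elimination over discrete factors; the tiny two-variable model (A and B) and its probability tables are hypothetical and used purely for illustration.

import itertools
from collections import defaultdict

# A factor is (variables, table), where table maps assignments (tuples) to values.
def multiply(f1, f2):
    v1, t1 = f1
    v2, t2 = f2
    variables = v1 + [v for v in v2 if v not in v1]
    table = {}
    for assign in itertools.product([0, 1], repeat=len(variables)):  # binary variables assumed
        a = dict(zip(variables, assign))
        table[assign] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return variables, table

def sum_out(factor, var):
    variables, table = factor
    idx = variables.index(var)
    new_vars = [v for v in variables if v != var]
    new_table = defaultdict(float)
    for assign, value in table.items():
        new_table[tuple(x for i, x in enumerate(assign) if i != idx)] += value
    return new_vars, dict(new_table)

# Hypothetical binary model: P(A) and P(B | A); eliminating A yields P(B).
p_a = (["A"], {(0,): 0.6, (1,): 0.4})
p_b_given_a = (["A", "B"], {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7})

p_b = sum_out(multiply(p_a, p_b_given_a), "A")
print(p_b)   # approximately P(B=0)=0.66, P(B=1)=0.34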
In the realm of data analysis and machine learning, accurate grouping of similar entities is crucial for efficient decision-making processes. While traditional clustering algorithms have certain limitations, CURE (Clustering Using Representatives) offers a unique approach with a creative methodology. In this article, we will dive into a detailed exploration of the CURE algorithm, providing a clear understanding along with an illustrative example. As technology advances and big data proliferates across industries, harnessing the power of algorithms like CURE is essential for extracting valuable knowledge from complex datasets, improving decision-making, and discovering hidden patterns within vast, information-rich environments.
CURE Algorithm
The CURE algorithm provides an effective means for discovering hidden structures and
patterns in large datasets by adopting a systematic approach to clustering. Employing random
sampling, hierarchical clustering, distance measures, merging representative points along
with subsequent refinement and splitting stages all culminate in accurate final membership
assignments. Armed with its efficient execution time and utilization of partial aggregations,
CURE plays a crucial role in diverse applications where dataset exploration is paramount.
The CURE algorithm utilizes both single-level and hierarchical methods to overcome common challenges faced by other clustering algorithms. Its core principle centers around defining cluster representatives (points within a given cluster that best represent its overall characteristics) rather than merely relying on centroids or medoids.
To initiate the CURE algorithm, an initial subset of data points needs to be chosen from the
dataset being analyzed. These randomly selected points will act as potential representatives
for producing robust clusters.
Hierarchical Clustering
Next, these representative points are clustered hierarchically using either agglomerative or
divisive techniques. Agglomerative clustering gradually merges similar representatives until
reaching one central representative per cluster while divisive clustering splits them based on
dissimilarities.
Cluster Shrinkage
Once all clusters are obtained through hierarchical clustering, each cluster's size is reduced by
reducing the outlier’s weights in relation to their distance from their respective representative
points. This process helps eliminate irrelevant noise and focuses on more relevant patterns in
each individual cluster.
After shrinking the initial clusters down to their core components, all remaining
nonrepresentative points are assigned to their nearest existing representative based on
Euclidean distance or other suitable measures consistent with specific applications.
A detailed explanation of the basic steps involved in the CURE algorithm is listed below,
Step 1: Random Sampling
The first step in the CURE algorithm entails randomly selecting a subset of data points from the given dataset. This random sampling ensures that representative samples are obtained across different regions of the data space rather than being biased toward particular areas or clusters.
Step 2: Hierarchical Clustering of the Sample
Next comes hierarchical clustering on the sampled points. Employing techniques such as Single Linkage or Complete Linkage hierarchical clustering helps create initial compact clusters based on their proximity to each other within this smaller dataset.
Step 3: Distance Measures
CURE leverages distance measures to compute distances between clusters during merging operations while maintaining an efficient runtime. Euclidean distance is commonly used due to its simplicity; however, other distance metrics like Manhattan distance can be employed depending on domain-specific requirements.
Step 4: Merging Representative Points
With cluster centroids determined through hierarchical clustering, CURE focuses on merging representative points from various sub-clusters into a unified set by employing partial aggregations and pruning appropriately. This consolidation facilitates a significant reduction in computation time by making subsequent operations more concise.
Step 5: Cluster Refinement and Splitting
After merging representatives, refinement takes place through exchanging outliers among
aggregated sets for better alignment with true target structures within each merged group.
Subsequently, splitting occurs, when necessary, by forming new individual agglomerative
groups representing modified substructures unaccounted for during earlier hierarchies.
Step 6: Final Cluster Assignment
Lastly, the remaining objects outside the formed aggregates, specifically those not captured effectively via either mergers or refinements, are assigned. These yet-to-be-clustered points are linked with the cluster identifiers of their nearest representative points, finalizing the overall clustering process.
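The sketch below illustrates only the representative-selection and shrinkage step at the heart of CURE, which a full implementation would embed inside the hierarchical merging loop; the number of representatives and the shrink factor are illustrative choices.

import numpy as np

def cure_representatives(cluster_points, n_rep=4, shrink=0.3):
    """Pick well-scattered points of a cluster and shrink them toward its centroid."""
    pts = np.asarray(cluster_points, dtype=float)
    centroid = pts.mean(axis=0)
    # First representative: the point farthest from the centroid
    reps = [pts[np.argmax(np.linalg.norm(pts - centroid, axis=1))]]
    while len(reps) < min(n_rep, len(pts)):
        # Next representative: the point farthest from those already chosen
        d = np.min([np.linalg.norm(pts - r, axis=1) for r in reps], axis=0)
        reps.append(pts[np.argmax(d)])
    reps = np.array(reps)
    # Shrinking toward the centroid damps the influence of outliers
    return reps + shrink * (centroid - reps)

cluster = np.random.RandomState(0).randn(50, 2)   # illustrative cluster
print(cure_representatives(cluster))

Remaining points would then be assigned to the cluster whose shrunken representative is nearest, as described in the final step above.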
1. Fundamentals of Clustering
Clustering Objective: Identify inherent structures or patterns within the data without
labeled information.
Grouping Similar Data: Clustering groups data points with similar characteristics
into clusters while keeping dissimilar data points in different clusters.
2. K-Means Clustering
Objective Function: Minimize the sum of squared distances between data points and their respective cluster centroids.
Steps (a short sketch follows after this outline):
Initialization: Select initial cluster centroids.
Assignment: Assign data points to the nearest centroid.
Update Centroids: Recalculate centroids based on the new assignments.
Convergence: Iterate until convergence or a stopping criterion is met.
Euclidean Distance: Commonly used metric for measuring distances between data
points in Euclidean space.
Other Distance Metrics: Manhattan distance, cosine similarity, Mahalanobis
distance, etc., can be used based on data characteristics.
Comparison with Ground Truth: If available, external metrics like purity, F-score,
or adjusted Rand index measure clustering performance against known labels.
Elbow Method, Silhouette Score: Techniques to find the optimal number of clusters.
Domain Knowledge: Utilizing domain knowledge to validate clustering results.
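The K-means procedure outlined under Steps above can be written down directly; this bare-bones sketch (random initialization, simple convergence check) is illustrative rather than a production implementation.

import numpy as np

def kmeans(X, k=3, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # Initialization
    for _ in range(n_iter):
        # Assignment: each point goes to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update centroids: mean of the points assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                    # Convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).random((200, 2))   # illustrative data
labels, centroids = kmeans(X, k=3)
print(centroids)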
Partitional clustering techniques play a crucial role in data analysis and pattern recognition
across various domains. Understanding these algorithms, their strengths, limitations, and
practical applications contributes significantly to effective data exploration and knowledge
extraction from datasets.
Hidden Markov Models (HMMs) are a type of probabilistic model that are commonly used in
machine learning for tasks such as speech recognition, natural language processing, and
bioinformatics. They are a popular choice for modelling sequences of data because they can
effectively capture the underlying structure of the data, even when the data is noisy or
incomplete. In this article, we will give a comprehensive overview of Hidden Markov
Models, including their mathematical foundations, applications, and limitations.
The basic idea behind an HMM is that the hidden states generate the observations, and the observed data is used to infer the hidden state sequence. The posterior probabilities of the hidden states given the observations are computed with the forward-backward algorithm, while the single most likely state sequence is found with the Viterbi algorithm.
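To make this concrete, the following sketch implements the forward pass for a small discrete HMM; the two hidden states, two observation symbols, and all probability values are invented for illustration.

import numpy as np

# Hypothetical 2-state HMM with 2 possible observation symbols
start = np.array([0.6, 0.4])                 # P(initial hidden state)
trans = np.array([[0.7, 0.3],                # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],                 # P(observation | hidden state)
                 [0.2, 0.8]])

def forward(observations):
    """Return P(observation sequence) by summing over all hidden state paths."""
    alpha = start * emit[:, observations[0]]          # initialization
    for obs in observations[1:]:
        alpha = (alpha @ trans) * emit[:, obs]        # recursion: propagate, then emit
    return alpha.sum()                                # termination

print(forward([0, 1, 0]))   # likelihood of the observed sequence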
Now, we will explore some of the key applications of HMMs, including speech recognition,
natural language processing, bioinformatics, and finance.
o Speech Recognition
One of the most widely known applications of HMMs is speech recognition, where the hidden states correspond to phonemes or words, while the observations are the acoustic signals. The goal is to estimate the most likely sequence of words given the observed audio.
o Natural Language Processing
HMMs are also used in natural language processing, for example in part-of-speech tagging, where the hidden states are the grammatical tags and the observations are the words of a sentence. The goal is to recover the most likely tag sequence for an observed sentence.
o Bioinformatics
HMMs are also widely used in bioinformatics, where they are used to model
sequences of DNA, RNA, and proteins. The hidden states, in this case, correspond to
the different types of residues, while the observations are the sequences of residues.
The goal is to estimate the hidden state sequence, which corresponds to the underlying
structure of the molecule, based on the observed sequences of residues. HMMs are
useful in bioinformatics because they can effectively capture the underlying structure
of the molecule, even when the data is noisy or incomplete. In bioinformatics systems,
the HMMs are usually trained on large datasets of molecular sequences, and the
estimated parameters of the HMMs are used to predict the structure or function of
new molecular sequences.
o Finance
Finally, HMMs have also been used in finance, where they are used to model stock
prices, interest rates, and currency exchange rates. In these applications, the hidden
states correspond to different economic states, such as bull and bear markets, while
the observations are the stock prices, interest rates, or exchange rates. The goal is to
estimate the hidden state sequence, which corresponds to the underlying economic
state, based on the observed prices, rates, or exchange rates. HMMs are useful in
finance because they can effectively capture the underlying economic state, even
when the data is noisy or incomplete. In finance systems, the HMMs are usually
trained on large datasets of financial data, and the estimated parameters of the HMMs
are used to make predictions about future market trends or to develop investment
strategies.
Now, we will explore some of the key limitations of HMMs and discuss how they can impact
the accuracy and performance of HMM-based systems.
o Limited Modelling Capabilities
One of the key limitations of HMMs is that they are relatively limited in their
modelling capabilities. HMMs are designed to model sequences of data, where the
underlying structure of the data is represented by a set of hidden states. However, the
structure of the data can be quite complex, and the simple structure of HMMs may not
be enough to accurately capture all the details. For example, in speech recognition, the
complex relationship between the speech sounds and the corresponding acoustic
signals may not be fully captured by the simple structure of an HMM.
o Overfitting
Another limitation of HMMs is that they can be prone to overfitting, especially when
the number of hidden states is large or the amount of training data is limited.
Overfitting occurs when the model fits the training data too well and is unable to
generalize to new data. This can lead to poor performance when the model is applied
to real-world data and can result in high error rates. To avoid overfitting, it is
important to carefully choose the number of hidden states and to use appropriate
regularization techniques.
o Lack of Robustness
HMMs are also limited in their robustness to noise and variability in the data. For
example, in speech recognition, the acoustic signals generated by speech can be
subjected to a variety of distortions and noise, which can make it difficult for the
HMM to accurately estimate the underlying structure of the data. In some cases, these
distortions and noise can cause the HMM to make incorrect decisions, which can
result in poor performance. To address these limitations, it is often necessary to use
additional processing and filtering techniques, such as noise reduction and
normalization, to pre-process the data before it is fed into the HMM.
o Computational Complexity
Finally, HMMs can be computationally demanding. The forward-backward and Viterbi algorithms scale with the square of the number of hidden states times the sequence length, and training with the Baum-Welch algorithm requires many such passes over the data, which can become expensive for large state spaces or long sequences.