
ARTIFICIAL INTELLIGENCE (ADVANCED)

A Course under the Centre of Excellence, an Initiative of the Department of Science and Technology, Government of Bihar

GOVERNMENT POLYTECHNIC SAHARSA

Presenter:
Prof. Shubham
HoD (Computer Science and Engineering)
Today's Class
➢ Introduction to Unsupervised Learning
➢ Introduction to Clustering
➢ Classification vs Clustering
➢ Types of Clustering
Unsupervised Machine Learning:
Unsupervised learning is a type of machine learning where the algorithm learns
to recognize patterns in data without being explicitly trained using labeled
examples. The goal of unsupervised learning is to discover the underlying
structure or distribution in the data.
There are two main types of unsupervised learning (a short code sketch follows the list):
• Clustering: Clustering algorithms group similar data points together based on their characteristics. The goal is to identify groups, or clusters, of data points that are similar to each other while being distinct from other groups. Some popular clustering algorithms include K-means, Hierarchical clustering, and DBSCAN.
• Dimensionality reduction: Dimensionality reduction algorithms reduce the number of input variables in a dataset while preserving as much of the original information as possible. This is useful for reducing the complexity of a dataset and making it easier to visualize and analyze. Some popular dimensionality reduction algorithms include Principal Component Analysis (PCA), t-SNE, and Autoencoders.
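To make the two types concrete, here is a minimal sketch using scikit-learn (not part of the original slides; the synthetic dataset and all parameter values are illustrative assumptions):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Generate 300 unlabeled points in 5 dimensions (labels are discarded).
X, _ = make_blobs(n_samples=300, n_features=5, centers=4, random_state=42)

# Clustering: group similar points together without any labels.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Dimensionality reduction: compress the 5 features down to 2.
X_2d = PCA(n_components=2).fit_transform(X)

print(labels[:10])   # cluster assignments of the first 10 points
print(X_2d.shape)    # (300, 2)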
Clustering
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to data points in other groups. In essence, it is a way of collecting objects on the basis of the similarity and dissimilarity between them.
"A way of grouping the data points into different clusters consisting of similar data points. Objects with possible similarities remain in a group that has few or no similarities with another group."
Clustering is a technique for organising data into classes and clusters, where the objects residing inside a cluster have high similarity and the objects of two different clusters are dissimilar to each other. Here the two clusters can be considered disjoint. The main target of clustering is to divide the whole dataset into multiple clusters. Unlike the classification process, the class labels of the objects are not known beforehand; clustering pertains to unsupervised learning.
Classification vs Clustering
1. Classification is the process of classifying data with the help of class labels. Clustering, on the other hand, is similar to classification but has no predefined class labels.
2. Classification belongs to supervised learning, whereas clustering is a form of unsupervised learning.
3. A labeled training sample is provided in the classification method, while in clustering no labeled training data is provided (illustrated in the sketch after this list).
4. Classification is more complex than clustering, as there are many stages in the classification phase, whereas only grouping is done in clustering.
5. The output in classification is known in advance, but the output in clustering is not.
6. Classification examples are Logistic Regression, the Naive Bayes classifier, Support Vector Machines, etc., whereas clustering examples are the k-means clustering algorithm, the Fuzzy c-means clustering algorithm, the Gaussian (EM) clustering algorithm, etc.
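As a brief illustration of points 2, 3, and 5 (a hedged sketch with scikit-learn on synthetic data, not from the original slides): a classifier is trained on features and labels, while a clusterer sees only the features.

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data; y plays the role of the known class labels.
X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Classification: training requires both the features X and the labels y.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Clustering: only the features X are used; group labels are discovered.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(clf.predict(X[:5]))   # predictions against known classes
print(km.labels_[:5])       # discovered cluster ids (numbering is arbitrary)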
Applications of Clustering in different fields:
1. Marketing: It can be used to characterize and discover customer segments for marketing purposes.
2. Biology: It can be used for classification among different species of plants and animals.
3. Libraries: It is used to cluster different books on the basis of topics and information.
4. Insurance: It is used to understand customers and their policies and to identify fraud.
5. City Planning: It is used to group houses and to study their values based on their geographical locations and other factors.
6. Earthquake studies: By clustering earthquake-affected areas, we can determine dangerous zones.
7. Image Processing: Clustering can be used to group similar images together, classify images based on content, and identify patterns in image data.
8. Genetics: Clustering is used to group genes that have similar expression patterns and to identify gene networks that work together in biological processes.
9. Finance: Clustering is used to identify market segments based on customer behavior, identify patterns in stock market data, and analyze risk in investment portfolios.
10. Customer Service: Clustering is used to group customer inquiries and complaints into categories, identify common issues, and develop targeted solutions.
Types of Clustering
Partitioning Clustering
This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data.
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K defines the number of pre-defined groups. The cluster centers are created in such a way that each data point is closer to its own cluster's centroid than to the centroid of any other cluster.
Input: K, the number of clusters into which the dataset has to be divided; D, a dataset containing N objects.
Output: A set of K clusters.
K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm, which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process; if K=2, there will be two clusters, for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The k-means clustering algorithm mainly performs two tasks (see the sketch after this list):
• Determines the best positions for the K center points, or centroids, by an iterative process.
• Assigns each data point to its closest k-center. The data points near a particular k-center form a cluster.
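A short scikit-learn sketch of this objective (illustrative only, not from the slides): the fitted model exposes the centroids, the per-point assignments, and inertia_, the within-cluster sum of squared distances that K-Means tries to minimize.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

print(km.cluster_centers_)   # the K centroids found by the iterative process
print(km.labels_[:10])       # each point is assigned to its closest centroid
print(km.inertia_)           # sum of squared point-to-centroid distances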
Steps for K-Means Clustering
The working of the K-Means algorithm is explained in the steps below; a code sketch follows the list.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Compute the mean of each cluster and place the new centroid of that cluster there.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go to Step-4; else go to FINISH.
Step-7: The model is ready.
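A minimal NumPy sketch of these steps (my own illustrative implementation, not code from the slides; it initializes centroids from random data points and does not handle the edge case of an empty cluster):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: select K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step-3/Step-5: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step-6: if no reassignment occurred, we are done.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step-4: move each centroid to the mean of its cluster.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# Usage on three well-separated synthetic blobs:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
centroids, labels = kmeans(X, k=3)
print(centroids)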
How to choose the value of K (the number of clusters) in K-Means Clustering?
The performance of the K-means clustering algorithm depends on how compact and well-separated the clusters it forms are. Choosing the optimal number of clusters is, however, a big task. There are several different ways to find the optimal number of clusters; here we discuss the most widely used method for finding the number of clusters, i.e., the value of K.
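The slides do not name the method at this point; a widely used choice is the elbow method, sketched below (illustrative scikit-learn/matplotlib code, not from the slides): fit K-Means for a range of K values, plot the inertia (within-cluster sum of squares) against K, and pick the K at which the curve bends.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=400, centers=4, random_state=2)

# Record the within-cluster sum of squares (inertia) for K = 1..9.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=2).fit(X).inertia_
            for k in range(1, 10)]

plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.title("Elbow method: pick K at the bend of the curve")
plt.show()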
Hierarchical Clustering
The clusters formed in this method form a tree-type structure based on the hierarchy. New clusters are formed using the previously formed ones. It is divided into two categories:
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created.
In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram.
The observations, or any number of clusters, can be selected by cutting the tree at the correct level.
The hierarchical clustering technique has two approaches:
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and merges them until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.
Agglomerative Clustering
• It follows the bottom-up approach. This means the algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pairs of clusters. It does this until all the clusters are merged into a single cluster that contains all the data points. A code sketch follows the steps below.
• Step-1: Treat each data point as a single cluster. If there are N data points, the number of clusters will also be N.
• Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.
• Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2 clusters.
• Step-4: Repeat Step 3 until only one cluster is left.
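A compact sketch of these merging steps using SciPy (illustrative code, not from the slides): linkage performs the bottom-up merges, dendrogram visualizes the tree, and fcluster "cuts the tree" to obtain a chosen number of flat clusters.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=3)

# Bottom-up merging: each row of Z records one merge of the two closest
# clusters, going from N singleton clusters down to a single cluster.
Z = linkage(X, method="ward")

# Cut the tree to obtain 3 flat clusters from the hierarchy.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# Draw the dendrogram (the tree-type structure of the merges).
dendrogram(Z)
plt.show()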
Divisive Clustering
• It is also known as the top-down approach. This algorithm also does not require pre-specifying the number of clusters.
• Top-down clustering requires a method for splitting a cluster. It starts with a cluster that contains the whole dataset and proceeds by splitting clusters recursively until each individual data point has been split into a singleton cluster.
• Divisive clustering is more complex than agglomerative clustering, as in the case of divisive clustering we need a flat clustering method as a "subroutine" to split each cluster until every data point has its own singleton cluster. A sketch of this idea follows.
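The slides describe the idea only at a high level; one common concrete realization is bisecting K-Means, sketched below (my own illustrative code, with flat K-Means as the splitting subroutine; it always splits the largest remaining cluster and assumes that cluster has at least two points):

import numpy as np
from sklearn.cluster import KMeans

def divisive_clusters(X, num_clusters, seed=0):
    # Start with a single cluster containing every data point.
    clusters = [np.arange(len(X))]
    while len(clusters) < num_clusters:
        # Pick the largest current cluster ...
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        # ... and split it in two with the flat K-Means "subroutine".
        halves = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        clusters.append(idx[halves == 0])
        clusters.append(idx[halves == 1])
    return clusters

# Usage: split 100 random 2-D points into 4 clusters, top down.
X = np.random.default_rng(4).normal(size=(100, 2))
for c in divisive_clusters(X, num_clusters=4):
    print(len(c), "points")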
Hierarchical Agglomerative vs Divisive Clustering
• Divisive clustering is more complex than agglomerative clustering, as in the case of divisive clustering we need a flat clustering method as a "subroutine" to split each cluster until every data point has its own singleton cluster.
• Divisive clustering is more efficient if we do not generate a complete hierarchy all the way down to individual data leaves. The time complexity of naive agglomerative clustering is O(n³), because we exhaustively scan the N x N matrix dist_mat for the lowest distance in each of the N-1 iterations. Using a priority queue data structure we can reduce this complexity to O(n² log n). With some further optimizations it can be brought down to O(n²). For divisive clustering, by contrast, given a fixed number of top levels and using an efficient flat algorithm like K-Means, divisive algorithms are linear in the number of patterns and clusters.
• A divisive algorithm is also more accurate. Agglomerative clustering makes decisions by considering local patterns or neighboring points, without initially taking into account the global distribution of the data, and these early decisions cannot be undone. Divisive clustering, in contrast, takes the global distribution of the data into consideration when making top-level partitioning decisions.
