ARTIFICIAL INTELLIGENCE LEC 5
ARTIFICIAL INTELLIGENCE (ADVANCED)
A course under the Centre of Excellence, an initiative of the Department of Science and Technology, Government of Bihar
GOVERNMENT POLYTECHNIC SAHARSA
Presenter: Prof. Shubham, HoD (Computer Science and Engineering)

Today's Class
➢ Introduction to Unsupervised Learning
➢ Introduction to Clustering
➢ Classification vs Clustering
➢ Types of Clustering

Unsupervised Machine Learning
Unsupervised learning is a type of machine learning in which the algorithm learns to recognize patterns in data without being explicitly trained on labeled examples. The goal of unsupervised learning is to discover the underlying structure or distribution of the data. There are two main types of unsupervised learning:
• Clustering: Clustering algorithms group similar data points together based on their characteristics. The goal is to identify groups, or clusters, of data points that are similar to each other while being distinct from other groups. Popular clustering algorithms include K-Means, hierarchical clustering, and DBSCAN.
• Dimensionality reduction: Dimensionality reduction algorithms reduce the number of input variables in a dataset while preserving as much of the original information as possible. This is useful for reducing the complexity of a dataset and making it easier to visualize and analyze. Popular dimensionality reduction algorithms include Principal Component Analysis (PCA), t-SNE, and autoencoders.

Clustering
Clustering is the task of dividing a set of data points into groups such that data points in the same group are more similar to each other than to data points in other groups. It is essentially a grouping of objects on the basis of the similarity and dissimilarity between them. "A way of grouping the data points into different clusters, each consisting of similar data points. Objects with possible similarities remain in a group that has little or no similarity with another group."
Clustering organizes data into classes and clusters such that the objects inside a cluster have high similarity to one another, while objects from two different clusters are dissimilar; the two clusters can be considered disjoint. The main goal of clustering is to divide the whole dataset into multiple clusters. Unlike in classification, the class labels of the objects are not known beforehand, so clustering belongs to unsupervised learning.

Classification vs Clustering
1. Classification is the process of assigning data to predefined class labels. Clustering is similar, but there are no predefined class labels.
2. Classification belongs to supervised learning, whereas clustering is known as unsupervised learning.
3. A training sample is provided in classification, while no training data is provided in clustering.
4. Classification is more complex than clustering, as there are many stages in the classification process, whereas clustering only performs grouping.
5. The output of classification is known in advance; the output of clustering is not.
6. Examples of classification methods are logistic regression, the Naive Bayes classifier, support vector machines, etc., whereas examples of clustering methods are the k-means algorithm, the fuzzy c-means algorithm, Gaussian mixture (EM) clustering, etc. (a small code sketch contrasting the two settings follows this list).
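The key practical difference (labels available versus not) can be seen in a minimal sketch. scikit-learn, the Iris dataset, LogisticRegression, and KMeans are illustrative choices here, not something the slides prescribe:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)  # X: features, y: class labels

# Classification (supervised): the model is trained on labeled examples.
clf = LogisticRegression(max_iter=200).fit(X, y)
print("Predicted labels:", clf.predict(X[:5]))

# Clustering (unsupervised): the algorithm sees only X, never y.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_[:5])
```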
Applications of Clustering in Different Fields
1. Marketing: It can be used to characterize and discover customer segments for marketing purposes.
2. Biology: It can be used for classification among different species of plants and animals.
3. Libraries: It is used to cluster books on the basis of topics and information.
4. Insurance: It is used to understand customers and their policies and to identify fraud.
5. City planning: It is used to group houses and to study their values based on their geographical locations and other factors.
6. Earthquake studies: By learning which areas are affected by earthquakes, we can determine the dangerous zones.
7. Image processing: Clustering can be used to group similar images together, classify images based on content, and identify patterns in image data.
8. Genetics: Clustering is used to group genes that have similar expression patterns and to identify gene networks that work together in biological processes.
9. Finance: Clustering is used to identify market segments based on customer behavior, identify patterns in stock market data, and analyze risk in investment portfolios.
10. Customer service: Clustering is used to group customer inquiries and complaints into categories, identify common issues, and develop targeted solutions.

Types of Clustering

Partitioning Clustering
This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It divides the data into non-hierarchical groups and is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm. In this type, the dataset is divided into a set of K groups, where K defines the number of pre-defined groups. The cluster centers are placed in such a way that each data point is closer to its own cluster centroid than to any other cluster centroid.
Input:
K: the number of clusters into which the dataset is to be divided
D: a dataset containing N objects
Output: a set of K clusters
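As a rough illustration of this Input/Output contract, the sketch below partitions a small synthetic dataset into K clusters. scikit-learn's KMeans and the make_blobs helper are assumed for illustration only; the slides do not prescribe a library:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Input: D, a dataset of N = 300 objects, and K, the number of clusters.
D, _ = make_blobs(n_samples=300, centers=3, random_state=42)
K = 3

km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(D)

# Output: K clusters, given here as one centroid per cluster plus a
# cluster index for each of the N objects.
print("Centroids:\n", km.cluster_centers_)
print("First 10 assignments:", km.labels_[:10])
```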
K-Means Clustering
K-Means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group of points with similar properties. It allows us to cluster the data into different groups and offers a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, in which each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids. The k-means clustering algorithm mainly performs two tasks:
• Determine the best K center points (centroids) through an iterative process.
• Assign each data point to its closest center; the data points near a particular center form a cluster.

Steps for K-Means Clustering
The working of the K-Means algorithm is explained in the following steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as initial centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the K predefined clusters.
Step-4: Compute the variance and place a new centroid for each cluster.
Step-5: Repeat Step 3, i.e., reassign each data point to the new closest centroid.
Step-6: If any reassignment occurred, go to Step 4; otherwise go to FINISH.
Step-7: The model is ready.

How to choose the value of "K number of clusters" in K-Means clustering?
The performance of the K-Means clustering algorithm depends on the highly efficient clusters it forms, but choosing the optimal number of clusters is a big task. There are several ways to find the optimal number of clusters; here we discuss the most appropriate method to find the number of clusters, i.e., the value of K.
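The slides do not name the method at this point; a standard choice is the elbow method, which runs K-Means for a range of K values, plots the within-cluster sum of squares (WCSS), and picks the K at the bend of the curve. A minimal sketch, assuming scikit-learn, matplotlib, and synthetic data:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# WCSS (inertia) for K = 1..10: it always decreases as K grows, but the
# rate of improvement drops sharply at the "elbow", which suggests a good K.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.show()
```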
Hierarchical Clustering
• The clusters formed in this method have a tree-type structure based on the hierarchy, and new clusters are formed using previously formed ones. It is divided into two categories:
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, also called a dendrogram. The observations, or any number of clusters, can be selected by cutting the tree at the appropriate level. The hierarchical clustering technique has two approaches:
1. Agglomerative: a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and merges them until one cluster is left.
2. Divisive: the reverse of the agglomerative algorithm, i.e., a top-down approach.

Agglomerative Clustering
• It follows the bottom-up approach: the algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pairs of clusters. It does this until all clusters are merged into a single cluster that contains the entire dataset.
• Step-1: Treat each data point as a single cluster. If there are N data points, the number of clusters is also N.
• Step-2: Take the two closest data points or clusters and merge them into one cluster, leaving N-1 clusters.
• Step-3: Again take the two closest clusters and merge them into one cluster, leaving N-2 clusters.
• Step-4: Repeat Step 3 until only one cluster is left.

Divisive Clustering
• It is also known as the top-down approach. This algorithm likewise does not require the number of clusters to be pre-specified.
• Top-down clustering requires a method for splitting a cluster. It starts from a cluster containing the whole dataset and proceeds by splitting clusters recursively until each individual data point has been placed in its own singleton cluster.
• Divisive clustering is more complex than agglomerative clustering: it needs a flat clustering method as a "subroutine" to split each cluster until every data point is in its own singleton cluster.

Hierarchical Agglomerative vs Divisive Clustering
• Divisive clustering is more efficient if we do not generate a complete hierarchy all the way down to individual data leaves. The time complexity of naive agglomerative clustering is O(n³), because we exhaustively scan the N x N distance matrix dist_mat for the lowest distance in each of the N-1 iterations. Using a priority queue data structure, this complexity can be reduced to O(n² log n); with some further optimizations it can be brought down to O(n²). For divisive clustering, given a fixed number of top levels and an efficient flat algorithm such as K-Means as the subroutine, the running time is linear in the number of patterns and clusters.
• A divisive algorithm can also be more accurate. Agglomerative clustering makes decisions by considering local patterns or neighboring points without initially taking the global distribution of the data into account, and these early decisions cannot be undone, whereas divisive clustering takes the global distribution of the data into consideration when making top-level partitioning decisions.
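To make the bottom-up procedure concrete, here is a minimal sketch that builds the merge hierarchy and cuts it into a chosen number of clusters. SciPy's linkage/fcluster/dendrogram functions and Ward linkage are illustrative choices; the slides do not specify an implementation:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# Bottom-up: start with 30 singleton clusters and repeatedly merge the
# two closest clusters (Ward linkage) until one cluster remains.
Z = linkage(X, method="ward")

# Cutting the tree where it has 3 branches yields 3 cluster labels.
labels = fcluster(Z, t=3, criterion="maxclust")
print("Cluster labels:", labels)

# The dendrogram visualizes the full merge hierarchy.
dendrogram(Z)
plt.show()
```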