UNIT 4 Clustering and Applications
Clustering:
The process of grouping abstract objects into classes of similar objects is known as
clustering.
Cluster analysis in data mining means finding groups of objects that are similar to each
other within a group but different from the objects in other groups. In the process of
clustering in data analytics, the data sets are divided into groups or classes based on
data similarity, and each class is then labelled according to its data type.
Clustering is a type of unsupervised machine learning method. In unsupervised learning,
inferences are drawn from data sets that do not contain a labelled output variable. It is
an exploratory data analysis technique that allows us to analyze multivariate data sets.
Clustering is a task of dividing the data sets into a certain number of clusters in such a
manner that the data points belonging to a cluster have similar characteristics. Clusters are
nothing but the grouping of data points such that the distance between the data points within
the clusters is minimal. Clustering is done to segregate the groups with similar traits.
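As a toy illustration of that goal, the hypothetical points below fall into two groups whose within-group distances are much smaller than the distance between the groups. This is a minimal sketch; the data and names are invented for illustration:

```python
from math import dist

# Two hypothetical clusters of 2-D points (toy data, not from the text).
cluster_a = [(1.0, 1.0), (1.5, 1.2), (0.8, 0.9)]
cluster_b = [(8.0, 8.0), (8.5, 7.8), (7.9, 8.3)]

def mean_pairwise_distance(points_x, points_y):
    """Average Euclidean distance between every distinct pair drawn from the two lists."""
    pairs = [(p, q) for p in points_x for q in points_y if p != q]
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

within_a = mean_pairwise_distance(cluster_a, cluster_a)
between = mean_pairwise_distance(cluster_a, cluster_b)
# A good clustering keeps within-cluster distances small relative to
# between-cluster distances.
print(within_a < between)  # → True
```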
Applications of cluster analysis:
It is widely used in many applications such as image processing, data analysis,
and pattern recognition.
It helps marketers find distinct groups in their customer base and characterize
those groups by their purchasing patterns.
It can be used in the field of biology to derive plant and animal taxonomies
and to identify genes with similar capabilities.
It also helps in information discovery by classifying documents on the web.
Requirements of clustering in data mining:
The following are the main requirements of clustering in data mining.
1. Scalability – We require highly scalable clustering algorithms to work with large
databases.
2. Ability to deal with different kinds of attributes – Algorithms should be able to
work with different types of data, such as categorical, numerical, and binary data.
3. Discovery of clusters with arbitrary shape – The algorithm should be able to detect
clusters of arbitrary shape and should not be bound to distance measures that favour
spherical clusters.
4. Interpretability – The results of clustering should be usable, understandable, and
interpretable. The main aim of clustering in data analytics is to ensure that haphazard
data is grouped based on the similarity of its characteristics.
5. Helps in dealing with messy data – Usually, real-world data is messy and
unstructured. It cannot be analyzed quickly, which is why clustering is so significant
in data mining. Grouping gives structure to the data by organizing it into groups of
similar data objects. This makes it easier for the data expert to process the data and
to discover new things. Analyzing data that has already been classified and labelled
through clustering is much easier than analyzing unstructured data, and it leaves
less room for error.
6. High dimensionality – Clustering algorithms should be able to handle
high-dimensional data as well as small data sets; they need to cope with data of any
dimensionality.
7. Discovery of arbitrarily shaped clusters – Clustering algorithms in data mining
should be able to detect arbitrarily shaped clusters and should not be limited to
finding only small, spherical clusters.
Importance of Clustering Methods:
1. Clustering methods help restart the local search procedure and remove its
inefficiency. In addition, clustering helps to determine the internal structure of the data.
2. Clustering methods have been used for model analysis and for identifying regions of
attraction.
3. Clustering helps in understanding the natural grouping in a dataset. It aims to make
sense of the data by partitioning it into logical groupings.
4. Clustering quality depends on the methods used and on the identification of hidden patterns.
5. Clustering plays a wide role in applications such as marketing, economic research,
web-log analysis, image processing, and spatial research, where similarity measures
must be identified.
6. It is used in outlier detection, for example to detect credit card fraud.
Characteristics of Cluster Analysis:
It helps to visualize high-dimensional data
It further enables data scientists to deal with different types of data like discrete,
categorical, and binary data
It gives structure to unstructured data sets by organizing them into groups
Advantages of Cluster Analysis:
Helps to identify obscure patterns and relationships within a data set
It helps to carry out exploratory data analysis
It can also be used for market segmentation, customer profiling, and more
Disadvantages of Cluster Analysis:
It can be difficult to interpret the results of an ambiguous or ill-defined cluster
The result of the analysis is affected by the choice of the clustering algorithm
Furthermore, the success of cluster analysis depends on the data, the goal of the
analysis, and the data scientist’s capability to interpret the result
Types of Data in Cluster Analysis:
A database may contain all six of the following types of variables:
1. symmetric binary
2. asymmetric binary
3. nominal
4. ordinal
5. interval
6. ratio.
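As a sketch of how a dissimilarity measure can combine such variable types, the hypothetical function below uses a simple Gower-style rule: exact matching for binary and nominal attributes, and a range-normalised difference for interval/ratio attributes. Asymmetric-binary and ordinal handling are omitted for brevity, and all names and data are invented:

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Gower-style dissimilarity over mixed attribute types (sketch).

    kinds: per-attribute type: 'binary', 'nominal', or 'numeric' (interval/ratio).
    ranges: per-attribute value range, used to scale numeric differences to [0, 1].
    """
    total = 0.0
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind in ("binary", "nominal"):
            # Simple (symmetric) matching: 0 if the values agree, 1 otherwise.
            total += 0.0 if xi == yi else 1.0
        else:
            # Interval or ratio attribute: normalised absolute difference.
            total += abs(xi - yi) / rng
    return total / len(x)

# One record mixing a binary, a nominal, and a ratio-scaled attribute.
a = (1, "red", 30.0)
b = (0, "red", 50.0)
print(mixed_dissimilarity(a, b, ("binary", "nominal", "numeric"), (1, 1, 100.0)))  # ≈ 0.4
```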
Categorization of Major Clustering Methods:
Clustering methods can be classified into the following categories.
1. Model-Based Method:
In this method, a model is hypothesized for each cluster to find the best fit of the data to
a given model. This method locates the clusters by constructing a density function that
reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outlier or noise into account. It therefore yields robust clustering
methods.
2. Hierarchical Method:
Hierarchical clustering investigates data clusters with a variety of scales and distances. In this
approach, you create a cluster tree with a multilevel hierarchy consisting of small clusters.
Then, neighboring clusters with similar features from every hierarchy are grouped together.
This continues until only one cluster is left in the hierarchy. This, therefore, allows the data
scientist to identify the hierarchical cluster appropriate to them.
There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps merging the objects or groups that are close to one
another, and it keeps doing so until all of the groups are merged into one or until the
termination condition holds.
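The bottom-up merging described above can be sketched as follows, using single linkage (the distance between two clusters is taken as the distance between their closest members). This is a toy illustration on invented data, not an optimised implementation:

```python
from math import dist

def agglomerate(points, k):
    """Bottom-up (agglomerative) clustering: start with singleton groups and
    repeatedly merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]            # each object starts as its own group
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))     # merge the closest pair of clusters
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerate(points, 2))  # the two nearby pairs end up in separate clusters
```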
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. With each iteration, a cluster is split into smaller clusters.
This is done until each object is in its own cluster or the termination condition holds.
This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering:
Here are the two approaches that are used to improve the quality of hierarchical clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing
macro-clustering on the micro-clusters.
3. Constraint-Based Method:
In this method, the clustering is performed by the incorporation of user or application-oriented
constraints. A constraint refers to the user expectation or the properties of desired clustering
results. Constraints provide us with an interactive way of communication with the clustering
process. Constraints can be specified by the user or the application requirement.
4. Grid-Based Method:
In this method, the objects together form a grid. The object space is quantized into a
finite number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the quantized
space.
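A minimal sketch of the quantization step, assuming 2-D points and a uniform cell size (all names and data invented), shows why the processing cost depends on the number of occupied cells rather than on the number of points:

```python
def grid_cells(points, cell_size):
    """Quantise the object space into cells; clustering then operates on the
    (bounded) set of occupied cells rather than on individual points."""
    cells = {}
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))   # integer cell index
        cells.setdefault(key, []).append((x, y))
    return cells

points = [(0.2, 0.3), (0.4, 0.1), (9.5, 9.7)]
cells = grid_cells(points, cell_size=1.0)
# Only 2 cells are occupied, however many points fall into each of them.
print(len(cells))  # → 2
```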
5. Partitioning Method:
Partitioning clustering considers every data point in a cluster as objects that have a
specific location and distance from each other. It partitions the objects in a way that
objects with the same features are close to each other, and far away from objects in
other clusters.
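A classic example of a partitioning method is k-means. The sketch below implements Lloyd's algorithm on toy data; the initial centers are chosen by hand here for illustration, whereas real implementations pick them automatically:

```python
from math import dist

def k_means(points, centers, iters=10):
    """Lloyd's algorithm: assign every point to its nearest center, then move
    each center to the mean of its assigned points; repeat."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

points = [(0, 0), (1, 0), (10, 10), (11, 10)]
centers, groups = k_means(points, centers=[(0, 0), (10, 10)])
print(centers)  # → [(0.5, 0.0), (10.5, 10.0)]
```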
6. Density-Based Method:
This method is based on the notion of density. The basic idea is to continue growing a
given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for
each data point within a given cluster, the neighborhood of a given radius has to contain
at least a minimum number of points.
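This density-driven growing can be sketched in the style of DBSCAN, a well-known density-based algorithm. The toy implementation below omits border-point subtleties, leaves sparse points unlabelled as noise, and uses invented data:

```python
from math import dist

def density_cluster(points, eps, min_pts):
    """DBSCAN-flavoured sketch: grow a cluster from a dense point as long as
    each point's eps-neighbourhood contains at least min_pts points."""
    labels = {}
    cluster_id = 0
    for seed in points:
        if seed in labels:
            continue
        neighbours = [q for q in points if dist(seed, q) <= eps]
        if len(neighbours) < min_pts:
            continue                        # not dense enough to start a cluster
        cluster_id += 1
        frontier = [seed]
        while frontier:
            p = frontier.pop()
            if p in labels:
                continue
            labels[p] = cluster_id
            reachable = [q for q in points if dist(p, q) <= eps]
            if len(reachable) >= min_pts:   # p is dense, so keep growing from it
                frontier.extend(reachable)
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10)]
labels = density_cluster(points, eps=1.5, min_pts=3)
print(labels)  # the three close points share a cluster; (10, 10) stays unlabelled (noise)
```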
Outlier Analysis:
Outlier analysis in data mining involves identifying and analyzing data points significantly
different or deviating from the rest of the dataset. Outliers can be caused by various factors,
such as data entry errors, unexpected events, etc., and their detection can lead to valuable
insights and improve the accuracy of models. A wide range of techniques can be used for outlier
analysis in data mining, such as statistical methods, clustering algorithms, and machine
learning models.
An outlier can be defined as a data point that deviates significantly from the normal
pattern or behavior of the data. Measurement errors, unexpected events, data processing
errors, and similar factors can cause these outliers. Outliers are also often referred to
as anomalies, aberrations, or irregularities.
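As a sketch of one simple statistical technique for this, the code below flags values whose z-score (distance from the mean, measured in standard deviations) exceeds a threshold. The data and threshold are invented for illustration:

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=3.0):
    """Statistical outlier detection (sketch): flag values whose z-score
    exceeds the threshold, i.e. values far from the mean in stdev units."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

data = [10, 11, 9, 10, 12, 11, 10, 9, 95]  # toy data; 95 is the aberration
print(z_score_outliers(data, threshold=2.0))  # → [95]
```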