
UNIT – IV

Cluster detection, K-means Algorithm, Outlier Analysis, memory-based reasoning, link analysis, Mining Association Rules in Large Databases: Association Rule Mining, genetic algorithms, neural networks. Data mining tools.

Clustering

Clustering is the task of dividing a population or set of data points into groups such that points in the same group are more similar to each other than to points in other groups. It is used to segregate data with similar traits into clusters and is applied to data that does not include pre-labeled classes. The grouping works by maximizing intra-class similarity and minimizing the similarity between differing classes.

Applications

pattern recognition

image analysis

information retrieval

bioinformatics

data compression

computer graphics

machine learning

Types of Clustering

1. Centroid-based Clustering

Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to the hierarchical clustering described below. K-means is the most widely used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers. This unit focuses on k-means because it is an efficient, effective, and simple clustering algorithm.

K-Means
K-Means clustering is an unsupervised, iterative clustering technique.
It partitions the given data set into k predefined distinct clusters.
A cluster is defined as a collection of data points exhibiting certain similarities.

It partitions the data set such that-

Each data point belongs to the cluster with the nearest mean.

Data points belonging to one cluster have a high degree of similarity.

Data points belonging to different clusters have a high degree of dissimilarity.

It is relatively efficient with time complexity O(nkt) where-

n = number of instances

k = number of clusters

t = number of iterations

Advantages-

It is simple to implement and relatively efficient.

Although it often terminates at a local optimum, the global optimum may be found using techniques such as Simulated Annealing or Genetic Algorithms.

Disadvantages-

K-Means Clustering Algorithm has the following disadvantages-

It requires the number of clusters (k) to be specified in advance.

It cannot handle noisy data and outliers.

It is not suitable for identifying clusters with non-convex shapes.

K-Means Clustering Algorithm-

K-Means Clustering Algorithm involves the following steps-

Step-01: Choose the number of clusters K.

Step-02: Randomly select any K data points as cluster centers.

Select cluster centers in such a way that they are as far apart from each other as possible.

Step-03: Calculate the distance between each data point and each cluster center.

The distance may be calculated using either a given distance function or the Euclidean distance formula.

Step-04: Assign each data point to a cluster.

A data point is assigned to the cluster whose center is nearest to it.

Step-05: Re-compute the centers of the newly formed clusters.

The center of a cluster is computed by taking the mean of all the data points contained in that cluster.

Step-06: Keep repeating Step-03 to Step-05 until any of the following stopping criteria is met-

Centers of newly formed clusters do not change

Data points remain in the same clusters

The maximum number of iterations is reached
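As a minimal sketch of the steps above (assuming a small 2-D dataset, Euclidean distance, and NumPy; the function and variable names are illustrative only, and empty clusters are not handled):

```python
import numpy as np

def k_means(points, k, max_iterations=100, seed=0):
    """Minimal K-Means sketch: returns final centers and cluster labels."""
    rng = np.random.default_rng(seed)
    # Step-02: pick K data points at random as initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iterations):  # Step-06: repeat until a stopping criterion
        # Step-03: Euclidean distance from every point to every center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to the nearest center.
        labels = distances.argmin(axis=1)
        # Step-05: re-compute each center as the mean of its assigned points
        # (this simple sketch assumes no cluster becomes empty).
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # centers no longer change
            break
        centers = new_centers
    return centers, labels

# Example usage on a toy 2-D dataset.
data = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                 [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centers, labels = k_means(data, k=2)
print(centers, labels)
```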

2. Hierarchical Clustering

2.1. Agglomerative Clustering: Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). It produces a structure that is more informative than the unstructured set of clusters returned by flat clustering, and it does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively merge pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
2.2. Divisive Clustering: Also known as the top-down approach. This algorithm also does not require us to prespecify the number of clusters. Top-down clustering requires a method for splitting a cluster; it starts with a single cluster that contains all the data and splits clusters recursively until each data point is in its own singleton cluster.
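As an illustrative sketch of bottom-up (agglomerative) clustering, one possible implementation uses SciPy's hierarchical clustering routines (the data and the cut level below are assumptions for this example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data; each point starts as its own singleton cluster.
data = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# Bottom-up merging; 'ward' merges the pair of clusters that gives the
# smallest increase in total within-cluster variance at each step.
merge_tree = linkage(data, method="ward")

# Cut the resulting tree to obtain a flat clustering, here at most 3 clusters.
labels = fcluster(merge_tree, t=3, criterion="maxclust")
print(labels)
```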
3. Density-based Clustering

Density-based clustering connects areas of high example density into clusters. This allows for
arbitrary-shaped distributions as long as dense areas can be connected. These algorithms have
difficulty with data of varying densities and high dimensions. Further, by design, these
algorithms do not assign outliers to clusters.

Two parameters:
Eps: Maximum radius of the neighbourhood.
MinPts: Minimum number of points in an Eps-neighbourhood of that point.
3.1 OPTICS

OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an ordering of the database with respect to its density-based clustering structure. This cluster ordering contains information equivalent to density-based clusterings obtained over a wide range of parameter settings. OPTICS is useful for both automatic and interactive cluster analysis, including determining an intrinsic clustering structure.
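If scikit-learn is available, its OPTICS implementation can illustrate this density-based ordering; the generated data and parameter values below are assumptions made for the example:

```python
import numpy as np
from sklearn.cluster import OPTICS

# Two dense groups plus some scattered noise points.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (50, 2)),
                  rng.normal(5, 0.3, (50, 2)),
                  rng.uniform(-2, 7, (10, 2))])

# min_samples plays the role of MinPts; OPTICS explores a range of radii
# instead of fixing a single Eps value.
model = OPTICS(min_samples=5).fit(data)
print(model.labels_[:10])                          # cluster labels (-1 marks noise)
print(model.reachability_[model.ordering_][:10])   # density-based cluster ordering
```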

3.2 DENCLUE

Density-based clustering by Hinnebirg and Kiem. It enables a compact mathematical


description of arbitrarily shaped clusters in high dimension state of data, and it is good for
data sets with a huge amount of noise.

Major Features

It has got a solid mathematical foundation.

It is definitely good for data sets with large amounts of noise.

It allows a compact mathematical description of arbitrarily shaped clusters in high-


dimensional data sets.

3.3 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

It relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points. It discovers clusters of arbitrary shape in spatial databases with noise.
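A short sketch using scikit-learn's DBSCAN shows how the Eps and MinPts parameters from above map onto code (the toy data and the eps/min_samples values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus one isolated point (noise).
data = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
                 [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],
                 [9.0, 9.0]])

# eps corresponds to Eps (neighbourhood radius) and
# min_samples to MinPts from the definition above.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(data)
print(labels)   # points labelled -1 are treated as noise/outliers
```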

4. Distribution-based Clustering
This clustering approach assumes the data is composed of distributions, such as Gaussian distributions. As the distance from a distribution's center increases, the probability that a point belongs to that distribution decreases. When you do not know the type of distribution in your data, you should use a different algorithm.
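One common realisation of distribution-based clustering is a Gaussian mixture model; the sketch below uses scikit-learn's GaussianMixture (the synthetic data and number of components are assumptions for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Data drawn from two Gaussian distributions with different centres.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# Fit a mixture of two Gaussians; each point gets a probability of
# belonging to each component, which drops with distance from the centre.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_)                      # estimated distribution centres
print(gmm.predict_proba(data[:3]))     # soft (probabilistic) assignments
```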

Outlier Analysis

Outlier analysis is the process of identifying outliers, or abnormal observations, in a dataset. Also known as outlier detection, it is an important step in data analysis, as it removes erroneous or inaccurate observations that might otherwise skew conclusions.
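One simple, commonly used detection rule (not named in the text above, shown here only as an illustration) is to flag observations whose z-score exceeds a chosen threshold:

```python
import numpy as np

def z_score_outliers(values, threshold=3.0):
    """Flag observations whose z-score magnitude exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 42.0]   # 42.0 is abnormal
print(z_score_outliers(readings, threshold=2.0))
# -> [False False False False False False  True]
```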

Memory-based reasoning

Memory-based reasoning (MBR) is a process that identifies similar cases and applies the information obtained from those cases to a new record. In Enterprise Miner, the Memory-Based Reasoning (MBR) node is a modeling tool that uses a k-nearest neighbor algorithm to categorize or predict observations.

There are various applications of memory-based reasoning, which are as follows −
Fraud detection − New cases of fraud are likely to be similar to known cases. MBR can discover and flag them for further investigation.
Customer response prediction − The next customers likely to respond to an offer are probably similar to prior customers who have responded. MBR can easily identify the next likely customers.
Medical treatments − The most effective treatment for a given patient is likely the treatment that produced the best results for similar patients. MBR can discover the treatment that produces the best results.
Classifying responses − Free-text responses, including those on the U.S. Census form for occupation and industry or complaints coming from users, need to be classified into a fixed set of codes. MBR can process the free text and assign the codes.
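Since MBR is built on the k-nearest neighbor idea, a scikit-learn k-NN classifier can serve as a stand-in illustration of the fraud-detection use case (the feature choice, data values and parameter k below are hypothetical, not taken from the text):

```python
from sklearn.neighbors import KNeighborsClassifier

# Historical ("memorised") cases: [amount, hour-of-day] and a fraud flag.
past_cases = [[20, 14], [35, 10], [15, 16], [900, 3], [750, 2], [820, 4]]
fraud_flag = [0, 0, 0, 1, 1, 1]

# k-nearest-neighbour model: a new record is classified by looking at
# the most similar past cases, as in memory-based reasoning.
knn = KNeighborsClassifier(n_neighbors=3).fit(past_cases, fraud_flag)
print(knn.predict([[880, 3]]))   # resembles the known fraud cases -> [1]
print(knn.predict([[25, 12]]))   # resembles normal transactions   -> [0]
```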

Link Analysis
Link analysis is a data analysis technique used in network theory to evaluate the relationships or connections between network nodes. These relationships can be between various types of objects (nodes), including people, organizations and even transactions.

Link analysis is essentially a kind of knowledge discovery that can be used to visualize data
to allow for better analysis, especially in the context of links, whether Web links or
relationship links between people or between different entities. Link analysis is often used in
search engine optimization as well as in intelligence, in security analysis and in market and
medical research.
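As a small illustrative sketch (assuming the NetworkX library; the people and relationships are invented for the example), link analysis can be as simple as building a graph of relationships and computing connectivity or importance measures:

```python
import networkx as nx

# A small network of relationships between people/entities.
graph = nx.Graph()
graph.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"),
                      ("Carol", "Dave"), ("Bob", "Dave"),
                      ("Eve", "Dave")])

# Simple link-analysis measures: who is most connected, and who is most
# "important" according to a PageRank-style score.
print(sorted(graph.degree, key=lambda pair: -pair[1]))
print(nx.pagerank(graph))
```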
