
UNIT – IV

Cluster detection, K-means Algorithm, Outlier Analysis, memory-based reasoning, link analysis, Mining Association Rules in Large Databases: Association Rule Mining, genetic algorithms, neural networks. Data mining tools.

Clustering

Clustering is the task of dividing a population or set of data points into groups such that points in the same group are more similar to each other than to points in other groups. It is used to segregate data with similar traits into clusters and is applied to data that does not include pre-labeled classes. The grouping works by maximizing intra-class similarity and minimizing the similarity between differing classes.

Applications

pattern recognition

image analysis

information retrieval

bioinformatics

data compression

computer graphics

machine learning

Types of Clustering

1. Centroid-based Clustering

Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to the hierarchical clustering described below. K-means is the most widely used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers. This unit focuses on k-means because it is an efficient, effective, and simple clustering algorithm.

K-Means
K-Means clustering is an unsupervised, iterative clustering technique.
It partitions the given data set into k predefined distinct clusters.
A cluster is defined as a collection of data points exhibiting certain similarities.

It partitions the data set such that-

Each data point belongs to the cluster with the nearest mean.

Data points belonging to one cluster have a high degree of similarity.

Data points belonging to different clusters have a high degree of dissimilarity.

It is relatively efficient with time complexity O(nkt) where-

n = number of instances

k = number of clusters

t = number of iterations

Advantages-

It is simple to implement and relatively efficient.

Although it often terminates at a local optimum, the global optimum may be found using techniques such as Simulated Annealing or Genetic Algorithms.

Disadvantages-

K-Means Clustering Algorithm has the following disadvantages-

It requires the number of clusters (k) to be specified in advance.

It cannot handle noisy data and outliers.

It is not suitable for identifying clusters with non-convex shapes.

K-Means Clustering Algorithm-

K-Means Clustering Algorithm involves the following steps-

Step-01: Choose the number of clusters K.

Step-02: Randomly select any K data points as cluster centers.

Select cluster centers in such a way that they are as far apart from each other as possible.

Step-03: Calculate the distance between each data point and each cluster center.

The distance may be calculated using either a given distance function or the Euclidean distance formula.

Step-04: Assign each data point to a cluster.

A data point is assigned to the cluster whose center is nearest to it.

Step-05: Re-compute the centers of the newly formed clusters.

The center of a cluster is computed by taking the mean of all the data points contained in that cluster.

Step-06: Keep repeating Step-03 to Step-05 until any of the following stopping criteria is met-

Centers of newly formed clusters do not change

Data points remain in the same clusters

The maximum number of iterations is reached
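As a minimal sketch of the steps above (assuming a small 2-D dataset, Euclidean distance, and NumPy; the function and variable names are illustrative only, and empty clusters are not handled):

```python
import numpy as np

def k_means(points, k, max_iterations=100, seed=0):
    """Minimal K-Means sketch: returns final centers and cluster labels."""
    rng = np.random.default_rng(seed)
    # Step-02: pick K data points at random as initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iterations):  # Step-06: repeat until a stopping criterion
        # Step-03: Euclidean distance from every point to every center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to the nearest center.
        labels = distances.argmin(axis=1)
        # Step-05: re-compute each center as the mean of its assigned points
        # (this simple sketch assumes no cluster becomes empty).
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # centers no longer change
            break
        centers = new_centers
    return centers, labels

# Example usage on a toy 2-D dataset.
data = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                 [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centers, labels = k_means(data, k=2)
print(centers, labels)
```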

2. Hierarchical Clustering

2.1. Agglomerative Clustering: Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). It produces a structure that is more informative than the unstructured set of clusters returned by flat clustering, and it does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively merge pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
2.2. Divisive Clustering: Also known as the top-down approach. This algorithm also does not require us to prespecify the number of clusters. Top-down clustering requires a method for splitting a cluster; it starts with a single cluster that contains all the data and splits clusters recursively until each data point is in its own singleton cluster.
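As an illustrative sketch of bottom-up (agglomerative) clustering, one possible implementation uses SciPy's hierarchical clustering routines (the data and the cut level below are assumptions for this example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data; each point starts as its own singleton cluster.
data = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# Bottom-up merging; 'ward' merges the pair of clusters that gives the
# smallest increase in total within-cluster variance at each step.
merge_tree = linkage(data, method="ward")

# Cut the resulting tree to obtain a flat clustering, here at most 3 clusters.
labels = fcluster(merge_tree, t=3, criterion="maxclust")
print(labels)
```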
3. Density-based Clustering

Density-based clustering connects areas of high example density into clusters. This allows for
arbitrary-shaped distributions as long as dense areas can be connected. These algorithms have
difficulty with data of varying densities and high dimensions. Further, by design, these
algorithms do not assign outliers to clusters.

Two parameters:
Eps: Maximum radius of the neighbourhood.
MinPts: Minimum number of points in an Eps-neighbourhood of that point.
3.1 OPTICS

OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an ordering of the database with respect to its density-based clustering structure. This cluster ordering contains information equivalent to density-based clusterings obtained over a wide range of parameter settings. OPTICS is useful for both automatic and interactive cluster analysis, including determining an intrinsic clustering structure.
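If scikit-learn is available, its OPTICS implementation can illustrate this density-based ordering; the generated data and parameter values below are assumptions made for the example:

```python
import numpy as np
from sklearn.cluster import OPTICS

# Two dense groups plus some scattered noise points.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (50, 2)),
                  rng.normal(5, 0.3, (50, 2)),
                  rng.uniform(-2, 7, (10, 2))])

# min_samples plays the role of MinPts; OPTICS explores a range of radii
# instead of fixing a single Eps value.
model = OPTICS(min_samples=5).fit(data)
print(model.labels_[:10])                          # cluster labels (-1 marks noise)
print(model.reachability_[model.ordering_][:10])   # density-based cluster ordering
```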

3.2 DENCLUE

Density-based clustering by Hinnebirg and Kiem. It enables a compact mathematical


description of arbitrarily shaped clusters in high dimension state of data, and it is good for
data sets with a huge amount of noise.

Major Features

It has got a solid mathematical foundation.

It is definitely good for data sets with large amounts of noise.

It allows a compact mathematical description of arbitrarily shaped clusters in high-


dimensional data sets.

3.3 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

It relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points. It discovers clusters of arbitrary shape in spatial databases with noise.
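A short sketch using scikit-learn's DBSCAN shows how the Eps and MinPts parameters from above map onto code (the toy data and the eps/min_samples values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus one isolated point (noise).
data = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
                 [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],
                 [9.0, 9.0]])

# eps corresponds to Eps (neighbourhood radius) and
# min_samples to MinPts from the definition above.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(data)
print(labels)   # points labelled -1 are treated as noise/outliers
```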

4. Distribution-based Clustering
This clustering approach assumes the data is composed of distributions, such as Gaussian distributions. As the distance from a distribution's center increases, the probability that a point belongs to that distribution decreases. When you do not know the type of distribution in your data, you should use a different algorithm.
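One common realisation of distribution-based clustering is a Gaussian mixture model; the sketch below uses scikit-learn's GaussianMixture (the synthetic data and number of components are assumptions for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Data drawn from two Gaussian distributions with different centres.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# Fit a mixture of two Gaussians; each point gets a probability of
# belonging to each component, which drops with distance from the centre.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_)                      # estimated distribution centres
print(gmm.predict_proba(data[:3]))     # soft (probabilistic) assignments
```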

Outlier Analysis

Outlier analysis is the process of identifying outliers, or abnormal observations, in a dataset. Also known as outlier detection, it is an important step in data analysis, as it removes erroneous or inaccurate observations that might otherwise skew conclusions.
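One simple, commonly used detection rule (not named in the text above, shown here only as an illustration) is to flag observations whose z-score exceeds a chosen threshold:

```python
import numpy as np

def z_score_outliers(values, threshold=3.0):
    """Flag observations whose z-score magnitude exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 42.0]   # 42.0 is abnormal
print(z_score_outliers(readings, threshold=2.0))
# -> [False False False False False False  True]
```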

Memory-based reasoning

Memory-based reasoning (MBR) is a process that identifies similar cases and applies the information obtained from those cases to a new record. In Enterprise Miner, the Memory-Based Reasoning (MBR) node is a modeling tool that uses a k-nearest neighbor algorithm to categorize or predict observations.

There are various applications of memory-based reasoning, which are as follows −
Fraud detection − New cases of fraud are likely to be similar to known cases. MBR can discover and flag them for further investigation.
Customer response prediction − The next customers likely to respond to an offer are probably similar to prior customers who have responded. MBR can easily identify the next likely customers.
Medical treatments − The most effective treatment for a given patient is likely the treatment that produced the best results for similar patients. MBR can discover the treatment that produces the best results.
Classifying responses − Free-text responses, including those on the U.S. Census form for occupation and industry or complaints coming from users, need to be classified into a fixed set of codes. MBR can process the free text and assign the codes.
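Since MBR is built on the k-nearest neighbor idea, a scikit-learn k-NN classifier can serve as a stand-in illustration of the fraud-detection use case (the feature choice, data values and parameter k below are hypothetical, not taken from the text):

```python
from sklearn.neighbors import KNeighborsClassifier

# Historical ("memorised") cases: [amount, hour-of-day] and a fraud flag.
past_cases = [[20, 14], [35, 10], [15, 16], [900, 3], [750, 2], [820, 4]]
fraud_flag = [0, 0, 0, 1, 1, 1]

# k-nearest-neighbour model: a new record is classified by looking at
# the most similar past cases, as in memory-based reasoning.
knn = KNeighborsClassifier(n_neighbors=3).fit(past_cases, fraud_flag)
print(knn.predict([[880, 3]]))   # resembles the known fraud cases -> [1]
print(knn.predict([[25, 12]]))   # resembles normal transactions   -> [0]
```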

Link Analysis
Link analysis is a data analysis technique used in network theory to evaluate the relationships or connections between network nodes. These relationships can be between various types of objects (nodes), including people, organizations and even transactions.

Link analysis is essentially a kind of knowledge discovery that can be used to visualize data
to allow for better analysis, especially in the context of links, whether Web links or
relationship links between people or between different entities. Link analysis is often used in
search engine optimization as well as in intelligence, in security analysis and in market and
medical research.
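As a small illustrative sketch (assuming the NetworkX library; the people and relationships are invented for the example), link analysis can be as simple as building a graph of relationships and computing connectivity or importance measures:

```python
import networkx as nx

# A small network of relationships between people/entities.
graph = nx.Graph()
graph.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"),
                      ("Carol", "Dave"), ("Bob", "Dave"),
                      ("Eve", "Dave")])

# Simple link-analysis measures: who is most connected, and who is most
# "important" according to a PageRank-style score.
print(sorted(graph.degree, key=lambda pair: -pair[1]))
print(nx.pagerank(graph))
```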
