Unit 4
Clustering
Clustering is the task of dividing a population or set of data points into groups such that
data points in the same group are more similar to each other than to those in other groups.
The aim is to segregate items with similar traits and assign them to clusters.
It is used for analyzing data that does not include pre-labeled classes.
Data points are grouped using the principle of maximizing intra-class similarity and
minimizing the similarity between different classes.
Applications
pattern recognition
image analysis
information retrieval
bioinformatics
data compression
computer graphics
machine learning
Types of Clustering
1. Centroid-based Clustering
K means
K-Means clustering is an unsupervised iterative clustering technique.
It partitions the given data set into k predefined distinct clusters.
A cluster is defined as a collection of data points exhibiting certain similarities.
It partitions the data set such that-
n = number of instances
k = number of clusters
t = number of iterations
Advantages-
Disadvantages-
Select cluster centers in such a way that they are as far as possible from each other.
Step-03: Calculate the distance between each data point and each cluster center.
The distance may be calculated either by using a given distance function or by using the
Euclidean distance formula.
Step-04: Assign each data point to some cluster.
A data point is assigned to that cluster whose center is nearest to that data point.
Step-05: Recompute the center of each cluster.
The center of a cluster is computed by taking the mean of all the data points contained in
that cluster.
Step-06: Keep repeating the procedure from Step-03 to Step-05 until any of the following
stopping criteria is met-
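The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names (`k_means`, `euclidean`, `mean_point`), the random choice of initial centers, and the two stopping criteria used here (the centers stop changing, or a maximum number of iterations is reached) are assumptions made for the example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Return (centers, clusters) for a list of points given as tuples."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct data points picked at random.
    # The notes prefer centers that are far apart; that refinement is omitted.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):  # assumed criterion: iteration limit
        # Step-03 and Step-04: assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Step-05: recompute each center as the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        # Assumed criterion: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    # Made-up toy data: two visually separate groups.
    data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    final_centers, final_clusters = k_means(data, k=2)
    print(final_centers)
    print(final_clusters)
```

Each pass of the loop does work proportional to the number of points times the number of clusters, which is why the cost of the whole run grows with n, k and the number of iterations t.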
3. Density-based Clustering
Density-based clustering connects areas of high example density into clusters. This allows for
arbitrary-shaped distributions as long as dense areas can be connected. These algorithms have
difficulty with data of varying densities and high dimensions. Further, by design, these
algorithms do not assign outliers to clusters.
Two parameters:
Eps: Maximum radius of the neighborhood.
MinPts: Minimum number of points required in the Eps-neighborhood of a point.
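To show how the two parameters interact, here is a minimal sketch of a DBSCAN-style procedure. The notes do not name the algorithm these parameters belong to, so treating them as DBSCAN's `eps` and `min_pts` is an assumption, as are the function names `region_query` and `dbscan`. A point with at least `min_pts` points in its Eps-neighborhood (counting itself) starts or extends a cluster; points that never fall in such a neighborhood are labelled as outliers.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for outliers."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise  # may be relabelled later as a border point
            continue
        # points[i] is a dense (core) point: start a new cluster from it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also dense: keep expanding
    return labels


if __name__ == "__main__":
    # Made-up toy data: two dense squares and one isolated point.
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (50, 50)]
    print(dbscan(data, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point receives the label -1, which matches the statement that density-based algorithms do not assign outliers to clusters.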
3.1 OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an
ordering of the data set that represents its density-based clustering structure. This cluster
ordering contains information equivalent to the density-based clusterings obtained over a
broad range of parameter settings. OPTICS is useful for both automatic and interactive
cluster analysis, including determining the intrinsic clustering structure.
3.2 DENCLUE
Major Features
4. Distribution-based Clustering
This clustering approach assumes the data is composed of distributions, such as Gaussian
distributions. As distance from a distribution's center increases, the probability that a point
belongs to the distribution decreases. The bands show that decrease in probability. When you
do not know the type of distribution in your data, you should use a different algorithm.
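The falling probability can be illustrated with a short sketch, assuming one-dimensional Gaussian components. The function names `gaussian_pdf` and `membership_probabilities` and the `(weight, mean, std)` representation of a component are choices made for this example, not something the notes prescribe.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def membership_probabilities(x, components):
    """Probability that x belongs to each component.

    components is a list of (weight, mean, std) tuples (assumed format).
    """
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]


if __name__ == "__main__":
    # The density shrinks as x moves away from the center (mean = 0).
    for x in (0.0, 1.0, 2.0, 3.0):
        print(x, gaussian_pdf(x, mean=0.0, std=1.0))

    # Made-up mixture of two equally weighted components centered at 0 and 10.
    mixture = [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)]
    print(membership_probabilities(1.0, mixture))
```

A point near one center is assigned almost entirely to that component, while a point midway between two equal components is shared evenly between them.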
Outliers analysis
Memory-based reasoning
Memory-based reasoning (MBR) is a process that identifies similar cases and applies the
information obtained from these cases to a new record. In Enterprise Miner, the
Memory-Based Reasoning (MBR) node is a modeling tool that uses a k-nearest neighbor
algorithm to categorize or predict observations.
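The k-nearest neighbor idea behind MBR can be sketched as follows. This is a generic illustration, not the Enterprise Miner node: the function name `knn_classify`, the use of Euclidean distance, the majority vote and the toy records are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(known_cases, new_record, k=3):
    """Label a new record by majority vote among its k most similar known cases.

    known_cases is a list of (features, label) pairs (assumed format).
    """
    nearest = sorted(
        known_cases, key=lambda case: math.dist(case[0], new_record)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up toy cases: two numeric features per record.
    cases = [
        ((1.0, 1.0), "no response"),
        ((1.2, 0.8), "no response"),
        ((0.9, 1.1), "no response"),
        ((5.0, 5.0), "response"),
        ((5.2, 4.9), "response"),
        ((4.8, 5.1), "response"),
    ]
    print(knn_classify(cases, (5.1, 5.0)))  # response
    print(knn_classify(cases, (1.0, 0.9)))  # no response
```

The known cases act as the "memory": nothing is trained in advance, and each new record is compared against the stored cases when it arrives.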
There are various applications of memory-based reasoning, including the following −
Fraud detection − New cases of fraud are likely to be similar to known cases. MBR can
discover and flag them for further investigation.
Customer response prediction − The next customers likely to respond to an offer are
probably similar to prior customers who have responded. MBR can readily identify the next
likely customers.
Medical treatments − The most effective treatment for a given patient is likely to be the
treatment that produced the best results for similar patients. MBR can find the treatment
that produces the best results.
Classifying responses − Free-text responses, such as those on the U.S. Census form for
occupation and market or complaints from users, need to be classified into a fixed set of
codes. MBR can process the free text and assign the codes.
Link Analysis
Link analysis is a data analysis technique from network theory that evaluates the
relationships or connections between network nodes. These relationships can be between
various types of objects (nodes), including people, organizations and even transactions.
Link analysis is essentially a kind of knowledge discovery that can be used to visualize data
to allow for better analysis, especially in the context of links, whether Web links or
relationship links between people or between different entities. Link analysis is often used in
search engine optimization as well as in intelligence, in security analysis and in market and
medical research.
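As a small illustration of evaluating connections between nodes, the sketch below builds an undirected graph from a list of links and counts how many connections each node has. Counting connections (node degree) is only one simple measure chosen for this example; the function names `build_graph` and `degree` and the sample nodes are hypothetical.

```python
from collections import defaultdict


def build_graph(links):
    """Build an undirected adjacency map from (node_a, node_b) pairs."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree(graph):
    """Number of distinct connections for every node in the graph."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


if __name__ == "__main__":
    # Made-up links between people, an organization and a transaction.
    links = [
        ("Alice", "Bob"),
        ("Alice", "Acme"),
        ("Bob", "Acme"),
        ("Acme", "Txn-1"),
    ]
    degrees = degree(build_graph(links))
    print(degrees)
    print(max(degrees, key=degrees.get))  # Acme
```

Nodes with unusually many connections are often a natural starting point when inspecting a network of this kind.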