UNIT V DWM Notes
Cluster Analysis
A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
Cluster Analysis
• Finding similarities between data according to the characteristics found in the data and grouping
similar data objects into clusters
• The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering
Typical requirements of clustering in data mining include the following.
Scalability: Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects. Clustering on a sample
of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based
data. However, applications may require clustering other types of data, such as binary, categorical,
and ordinal data, or mixtures of these data types.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on
Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density.
Minimal requirements for domain knowledge to determine input parameters: Many clustering
algorithms require users to input certain parameters in cluster analysis. The clustering results can be quite
sensitive to input parameters.
Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or
erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.
Incremental clustering and insensitivity to the order of input records: Some clustering algorithms
cannot incorporate newly inserted data into existing clustering structures and instead must determine a
new clustering from scratch.
High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many
clustering algorithms are good at handling low dimensional data, involving only two to three dimensions.
Constraint-based clustering: Real-world applications may need to perform clustering under various
kinds of constraints. For example, to choose locations for a given number of new ATMs in a city, you may
cluster households while considering constraints such as the city's rivers and highway networks.
Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and
usable. That is, clustering may need to be tied to specific semantic interpretations and applications.
Data matrix: This represents n objects, such as persons, with p variables, such as age, height, weight,
gender, and so on. The structure is in the form of a relational table, or n-by-p matrix.
Dissimilarity matrix: This stores a collection of proximities that are available for all pairs of n objects. It
is often represented by an n-by-n table.
The rows and columns of the data matrix represent different entities, while those of the dissimilarity
matrix represent the same entity. Thus, the data matrix is often called a two mode matrix, whereas the
dissimilarity matrix is called a one mode matrix. Many clustering algorithms operate on a dissimilarity
matrix. If the data are presented in the form of a data matrix, it can first be transformed into a
dissimilarity matrix before applying such clustering algorithms
Interval-scaled variables are continuous measurements on a roughly linear scale. Distance measures
commonly used for computing the dissimilarity of objects described by such variables include the
Euclidean, Manhattan, and Minkowski distances.
Standardize data: first compute the mean absolute deviation of variable f, whose mean is mf:
sf = (1/n)(|x1f - mf| + |x2f - mf| + ... + |xnf - mf|)
then compute the standardized measurement, or z-score:
zif = (xif - mf) / sf
Using the mean absolute deviation is more robust than using the standard deviation, since the
deviations of outliers are not squared. The median absolute deviation is even more robust, but with it
the z-scores of outliers shrink so much that the outliers disappear completely.
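As an illustration, a minimal Python sketch of this standardization (the data values are made up):

```python
import numpy as np

def standardize(x):
    """Standardize one variable f over n objects using the mean
    absolute deviation s_f rather than the standard deviation."""
    x = np.asarray(x, dtype=float)
    m_f = x.mean()                    # mean value of variable f
    s_f = np.abs(x - m_f).mean()      # mean absolute deviation
    return (x - m_f) / s_f            # z-scores z_if = (x_if - m_f) / s_f

print(standardize([10.0, 12.0, 11.0, 50.0]))  # the outlier keeps a visible z-score
```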
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Popular distances are special cases of the Minkowski distance:
Manhattan distance: d(i,k) = |xi1 - xk1| + |xi2 - xk2| + ... + |xin - xkn|
Euclidean distance: d(i,k) = [|xi1 - xk1|^2 + |xi2 - xk2|^2 + ... + |xin - xkn|^2]^(1/2)
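Both can be computed with one generic function, since Manhattan and Euclidean are the Minkowski distance of order h = 1 and h = 2; a minimal sketch (names and data are illustrative):

```python
import numpy as np

def minkowski(xi, xk, h):
    """Minkowski distance of order h between two data objects."""
    xi, xk = np.asarray(xi, float), np.asarray(xk, float)
    return float((np.abs(xi - xk) ** h).sum() ** (1.0 / h))

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(a, b, 1))  # Manhattan distance: 7.0
print(minkowski(a, b, 2))  # Euclidean distance: 5.0
```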
• A binary variable has only two states: 0 or 1, where 0 means that the variable is absent,
and 1 means that it is present.
• A binary variable is symmetric if both of its states are equally valuable; it is asymmetric if the
outcomes of the states are not equally important, such as the positive and negative outcomes of a
disease test.
For a pair of binary objects i and j, let q be the number of variables that equal 1 for both objects,
r the number that equal 1 for i but 0 for j, s the number that equal 0 for i but 1 for j, and t the
number that equal 0 for both.
Symmetric binary dissimilarity:
d(i,j) = (r + s) / (q + r + s + t)
Asymmetric binary dissimilarity (negative matches t are ignored):
d(i,j) = (r + s) / (q + r + s)
Asymmetric binary similarity (the Jaccard coefficient):
sim(i,j) = q / (q + r + s) = 1 - d(i,j)
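A sketch of these contingency-table measures, assuming each object is given as a 0/1 vector (the two example objects are made up):

```python
def binary_dissimilarity(i, j, symmetric=True):
    """Dissimilarity between two binary objects via the counts q, r, s, t."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1 in both
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1 in i, 0 in j
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 0 in i, 1 in j
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # 0 in both
    if symmetric:
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)  # asymmetric: negative matches t are ignored

i, j = [1, 0, 1, 0, 0, 0], [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(i, j, symmetric=False))  # 1/3
```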
Categorical variables
• A categorical variable is a generalization of the binary variable in that it can take on
more than two states
• Let the number of states of a categorical variable be M. The states can be denoted by
letter symbols or a set of integers; a simple matching dissimilarity sketch follows below.
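Dissimilarity between two objects described by categorical variables is usually computed by simple matching, d(i,j) = (p - m)/p, where m is the number of matching variables out of p in total; a minimal sketch with made-up states:

```python
def categorical_dissimilarity(i, j):
    """Simple matching: d(i, j) = (p - m) / p, where m is the number of
    variables on which objects i and j are in the same state, out of p."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

print(categorical_dissimilarity(["red", "round", "small"],
                                ["red", "oval",  "small"]))  # 1/3
```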
Ordinal variables
• A discrete ordinal variable resembles a categorical variable, except that the M states of
the ordinal value are ordered in a meaningful sequence.
• Ordinal variables are very useful for registering subjective assessments of qualities that
cannot be measured objectively.
1. The value of f for the ith object is xif, and f has Mf ordered states, representing the ranking
1, ..., Mf. Replace each xif by its corresponding rank rif in {1, ..., Mf}.
2. Since each ordinal variable can have a different number of states, it is often necessary to
map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This is
done by replacing the rank rif by zif = (rif - 1) / (Mf - 1); see the sketch below.
3. Dissimilarity can then be computed using any of the distance measures described for
interval-scaled variables
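A sketch of step 2's rank normalization (the states and ranks here are illustrative):

```python
def normalize_ordinal(ranks, M):
    """Map ranks r_if in {1, ..., M_f} onto [0.0, 1.0]
    via z_if = (r_if - 1) / (M_f - 1)."""
    return [(r - 1) / (M - 1) for r in ranks]

# e.g. an ordinal variable with states fair < good < excellent (M_f = 3)
print(normalize_ordinal([3, 1, 2, 3], 3))  # [1.0, 0.0, 0.5, 1.0]
```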
Ratio-scaled variables
• Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a
good choice, since it is likely that the scale may be distorted.
• Apply a logarithmic transformation to a ratio-scaled variable f having value xif for object i
by using the formula yif = log(xif). The yif values can be treated as interval-valued.
• Treat xif as continuous ordinal data and treat the ranks as interval-valued.
A more preferable approach is to process all variable types together, performing a single cluster
analysis. One such technique combines the different variables into a single dissimilarity matrix,
bringing all of the meaningful variables onto a common scale of the interval [0.0, 1.0].
Vector Objects
In some applications, such as information retrieval, text document clustering, and biological
taxonomy, we need to compare and cluster complex objects containing a large number of
symbolic entities. To measure the distance between complex objects, it is often desirable to
abandon traditional metric distance computation and introduce a nonmetric similarity function.
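A widely used nonmetric similarity function for such vector objects is the cosine measure; a minimal sketch (the vectors stand in for, say, term-frequency vectors of two documents):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: s(x, y) = x.y / (|x||y|)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cosine_similarity([5, 0, 3, 0, 2], [3, 0, 2, 0, 1]))  # close to 1.0
```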
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k<=n
It classifies the data into k groups, which together satisfy the following requirements: (1) each
group must contain at least one object, and (2) each object must belong to exactly one group.
The general criterion of a good partitioning is that objects in the same cluster are close or related
to each other, whereas objects of different clusters are far apart or very different
A hierarchical method creates a hierarchical decomposition of the given set of data objects, i.e.,
the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions
takes place, which may run from a single cluster containing all objects to n clusters each
containing a single object.
• Grid-based methods quantize the object space into a finite number of cells that form a grid
structure
• All of the clustering operations are performed on the grid structure
• Model based methods hypothesize a model for each of the clusters and find the best fit of
the data to the given model
• A model based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points
Partitioning Methods
• The most commonly used partitional clustering strategy is based on the square error
criterion. The general objective is to obtain the partition that for a fixed number of
clusters, minimizes the total square error.
The most well known and commonly used partitioning methods are k-means, k-medoids, and
their variations
K-means is one of the simplest unsupervised learning algorithms that solve the well-known
clustering problem. The k-means algorithm takes the input parameter k and partitions a set
of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster
similarity is low.
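A minimal k-means sketch in Python (random initialization; empty-cluster handling omitted; all names and data are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Partition the n x p array X into k clusters by repeatedly assigning
    each object to the nearest mean and recomputing the cluster means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # k random objects as initial means
    for _ in range(n_iter):
        # assign each object to the cluster with the closest center
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):  # means stopped moving: converged
            break
        centers = new
    return labels, centers

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
print(kmeans(X, k=2))
```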
Advantages
K-means is relatively scalable and efficient in processing large data sets, since its complexity is
O(nkt), where t is the number of iterations.
PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced. It
attempts to determine k partitions for n objects. After an initial random selection of k
representative objects, it repeatedly tries to make a better choice of medoids by swapping a
representative object with a non-representative object, as long as the quality of the clustering improves.
CLARANS (Clustering Large Applications based upon RANdomized Search) searches a graph in
which each node is a potential solution, that is, a set of k medoids.
• If a better neighbor is found, CLARANS moves to the neighbor's node and the process
starts again; otherwise, the current node produces a local minimum.
• Once a user-specified number of local minima has been found, the algorithm outputs, as a
solution, the best local minimum, that is, the local minimum having the lowest cost.
Advantages
More effective than both PAM and CLARA
Disadvantages
May not find a real local minimum due to the trimming of its search
It assumes that all objects fit into the main memory, and the result is sensitive to
input order
Hierarchical Methods
Basically, hierarchical methods group data into a tree of clusters. There are two basic varieties of
hierarchical algorithms: agglomerative and divisive. A tree structure called a dendrogram is
commonly used to represent the process of hierarchical clustering.
Agglomerative: begin with each object in its own cluster; the closest clusters are successively
merged until all objects are in a single cluster or a termination condition holds.
Divisive: begin with all objects in one cluster; groups are continually divided until there are as many
clusters as objects.
BIRCH applies a multiphase clustering technique: a single scan of the data set yields a basic,
good clustering, and one or more additional scans can be used to further improve the quality.
Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can be viewed
as a multilevel compression of the data that tries to preserve the inherent clustering structure of
the data
Phase 2: BIRCH applies a clustering algorithm to cluster the leaf nodes of the CF tree, which
removes sparse clusters as outliers and groups dense clusters into larger ones
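scikit-learn ships a BIRCH implementation; a minimal usage sketch, assuming scikit-learn is installed (the parameter values and data are illustrative):

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data
# threshold bounds the radius of a CF subcluster; n_clusters drives the
# phase-2 global clustering of the leaf entries
labels = Birch(threshold=0.5, n_clusters=3).fit_predict(X)
print(labels[:10])
```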
CURE steps:
Partition sample S into a set of partitions and form a cluster for each partition
Representative points are found by selecting a constant number of points from a cluster and then
shrinking them toward the center of the cluster
Cluster similarity is the similarity of the closest pair of representative points from different
clusters
Shrinking representative points toward the center helps avoid problems with noise and outliers
ROCK steps:
Compute the link value for each set of points i.e., transform the original similarities into
similarities that reflect the number of shared neighbors between points
Assign the remaining points to the clusters that have been found
DBSCAN: A density based clustering method based on connected regions with sufficiently high
density
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object
If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the
object is called a core object
Given a set of objects D, we say that an object p is directly density-reachable from object q if p is
within the ε-neighborhood of q, and q is a core object.
An object p is density-reachable from object q with respect to ε and MinPts in a set of objects D if
there is a chain of objects p1, ..., pn, with p1 = q and pn = p, such that each pi+1 is directly
density-reachable from pi
An object p is density-connected to object q with respect to ε and MinPts in a set of objects D if
there is an object o in D such that both p and q are density-reachable from o with respect to ε and
MinPts
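A minimal usage sketch with scikit-learn's DBSCAN, assuming scikit-learn is installed (eps plays the role of the radius ε and min_samples the role of MinPts; the data points are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # one dense region
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],   # another dense region
              [4.0, 20.0]])                          # an isolated object
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # noise objects are labeled -1
```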
In OPTICS, the core-distance of an object p is the smallest ε' value that makes p a core object. If p
is not a core object, the core-distance of p is undefined
The reachability-distance of an object q with respect to another object p is the greater of the
core-distance of p and the Euclidean distance between p and q. If p is not a core object, the
reachability-distance between p and q is undefined.
Typical grid-based methods include STING, WaveCluster, and CLIQUE.
STING (STatistical INformation Grid)
Each cell at a high level is partitioned into a number of smaller cells at the next lower
level
Statistical info of each cell is calculated and stored beforehand and is used to answer
queries
Wave cluster
A multiresolution clustering approach which applies wavelet transforms to the feature space
Wavelet transform: a signal processing technique that decomposes a signal into different
frequency sub-bands
Data are transformed to preserve relative distance between objects at different levels of
resolution
Input parameters: the number of grid cells for each dimension, the wavelet to use, and the
number of applications of the wavelet transform
Major features: detects clusters of arbitrary shape at different scales; not sensitive to noise or
to the order of input; applicable only to low-dimensional data
Complexity: O(n)
Model based clustering methods attempt to optimize the fit between the given data and
some mathematical model.
Expectation Maximization
Make an initial guess for the parameter vector: this involves randomly selecting k objects to
represent the cluster means or centers, as well as making guesses for the additional parameters.
The EM algorithm is simple and easy to implement. In practice, it converges fast but may not
reach the global optimum. Convergence is guaranteed for certain forms of optimization functions.
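A compact EM sketch for a mixture of two 1-D Gaussians (all names, data, and parameter choices are illustrative; as noted above, it converges to a local optimum):

```python
import numpy as np

def em_gmm(x, k=2, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture: the E-step computes membership
    probabilities, the M-step re-estimates means, variances, and weights."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k, replace=False)  # initial guess: k random objects as means
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each object
        dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update the parameter vector from the weighted objects
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return mu, var, w

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                    np.random.default_rng(2).normal(6, 1, 200)])
print(em_gmm(x))  # means near 0 and 6
```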
Conceptual Clustering
Intraclass similarity is the probability P(Ai = vij | Ck). The larger this value is, the greater the
proportion of class members that share this attribute-value pair and the more predictable the pair
is of class members
Interclass similarity is the probability P(Ck | Ai = vij). The larger this value is, the fewer the
objects in contrasting classes that share this attribute-value pair and the more predictive the pair
is of the class
The neural network approach is motivated by biological neural networks. Neural networks
have several properties that make them popular for clustering
Self-organizing feature maps are one of the most popular neural network methods for cluster
analysis. They are sometimes referred to as Kohonen self-organizing feature maps, after their
creator.
Clustering high dimensional Data
Most clustering methods are designed for clustering low dimensional data and encounter
challenges when the dimensionality of the data grows really high. This is because when the
dimensionality increases, usually only a small number of dimensions are relevant to certain
clusters but data in the irrelevant dimensions may produce much noise and mask the real clusters
to be discovered.
Feature transformation methods such as principal component analysis and singular value
decomposition transform the data onto a smaller space while generally preserving the original
relative distance between objects.
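A short sketch of such a transformation via the singular value decomposition, using only NumPy (the data and the choice of two components are arbitrary):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 10))  # 100 objects, 10 attributes
Xc = X - X.mean(axis=0)                              # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T                                   # project onto top-2 principal components
print(X2.shape)                                      # (100, 2): same objects, smaller space
```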
Given a large set of multidimensional data points, the data space is usually not uniformly
occupied by the data points. CLIQUE's clustering identifies the sparse and the crowded areas in
space.
A unit is dense if the fraction of total data points contained in it exceeds an input model
parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.
Constraints on the selection of clustering parameters: A user may like to set a desired range
for each clustering parameter. Clustering parameters are usually quite specific to the given
clustering algorithm.
User specified constraints on the properties of individual clusters: A user may like to specify
desired characteristics of the resulting clusters, which may strongly influence the clustering
process.
Outlier Analysis
Data objects that show significantly different characteristics from the remaining data are
declared outliers. The detection and analysis of outliers is called outlier mining.
Applications: medical analysis, for example detecting unusual patient responses to a treatment.
Statistical approach: assumes a distribution or probability model for the given data set and
identifies outliers with respect to the model using a discordancy test.
Distance-based approach: the distance-based outlier mining concept assigns numeric distances to
data objects and computes outliers as data objects with relatively larger distances.
Index based algorithms: Given a data set the index based algorithm uses multidimensional
indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within
radius dmin around the object.
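A brute-force sketch of the distance-based notion without any index structure; an object is flagged when at least a fraction frac of the remaining objects lie farther than radius away (frac and radius are illustrative names for the usual parameters p and dmin):

```python
import numpy as np

def distance_based_outliers(X, radius, frac):
    """Flag objects for which at least `frac` of the other objects lie
    farther than `radius` away (an O(n^2) scan; no index structure)."""
    X = np.asarray(X, float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return [(dists[i] > radius).sum() >= frac * (n - 1) for i in range(n)]

X = [[0.0, 0.0], [0.5, 0.2], [0.1, 0.6], [10.0, 10.0]]
print(distance_based_outliers(X, radius=2.0, frac=0.9))  # last object flagged
```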
Cell-based algorithm: to avoid the object-by-object comparisons above, the data space is
partitioned into cells, and each cell is surrounded by two layers of neighboring cells. The first
layer is one cell thick, while the second is about 2√k cells thick, rounded up to the closest integer,
where k is the dimensionality of the data. The algorithm counts outliers on a cell-by-cell rather than
an object-by-object basis.
Cell + 1 layer count = number of objects in the cell + number of objects in the first layer
Cell + 2 layers count = number of objects in the cell + number of objects in both layers
Depth-based techniques represent every data object in a k-d space and assign a depth to
each object; outliers are expected to lie in shallow layers. In density-based local outlier detection,
the degree of outlierness is computed as the local outlier factor (LOF) of an object. It is local
in the sense that the degree depends on how isolated the object is with respect to the surrounding
neighborhood.
Deviation based outlier detection identifies outliers by examining the main characteristics
of objects in a group. Objects that deviate from this description are considered outliers.
Sequential approach
The sequential exception technique simulates the way in which humans can distinguish
unusual objects from among a series of supposedly like objects
Exception set: This is the set of deviations or outliers. It is defined as the smallest subset of
objects whose removal results in the greatest reduction of dissimilarity in the residual set
Cardinality function: This is typically the count of the number of objects in a given set
Smoothing factor: This assesses how much the dissimilarity can be reduced by removing the
subset from the original set of objects. The value is scaled by the cardinality of the set.
Data Mining Applications
Design and construction of data warehouses for multidimensional data analysis and data
mining: Like many other applications, data warehouses need to be constructed for banking and
financial data.
Loan payment prediction and customer credit policy analysis: Loan payment prediction
and customer credit analysis are critical to the business of a bank. Many factors can strongly or
weakly influence loan payment performance.
Detection of money laundering and other financial crimes: To detect money laundering
and other financial crimes, it is important to integrate information from multiple databases.
Design and construction of data warehouses based on the benefits of data mining: Because
retail data cover a wide spectrum there can be many ways to design a data warehouse for this
industry.
Multidimensional analysis of sales, customers, products, time, and region: The retail industry
requires timely information regarding customer needs, product sales, trends, and fashions, as
well as quality and cost.
Analysis of the effectiveness of sales campaigns: The retail industry conducts sales
campaigns using advertisements, coupons, and various kinds of discounts and bonuses to
promote products and attract customers.
Product recommendation and cross referencing of items: By mining associations from sales
records, one may discover that a customer who buys a digital camera is likely to buy another set
of items
Pattern analysis and the identification of unusual patterns: Fraudulent activity costs the
telecommunication industry millions of dollars per year.
Discovery of structural patterns and analysis of genetic networks and protein pathways: In
biology, protein sequences are folded into three dimensional structures, and such structures
interact with each other based on their relative positions and the distance between them.
Association and path analysis: Identifying co-occurring gene sequences and linking genes to
different stages of disease development.
Mining complex data types: Scientific data sets are heterogeneous in nature, typically
involving semi-structured and unstructured data, such as multimedia data and georeferenced
stream data. Robust methods are needed for handling spatiotemporal data.
Visualization tools and domain-specific knowledge: High-level graphical user interfaces
and visualization tools are required for scientific data mining systems.
Development of data mining algorithms for intrusion detection: Data mining algorithms
can be used for misuse detection and anomaly detection. In misuse detection training data are
labeled either normal or intrusion.
Association and correlation analysis, and aggregation to help select and build
discriminating attributes: Association and correlation mining can be applied to find relationships
between system attributes describing the network data.
Analysis of stream data: Due to the transient and dynamic nature of intrusions and
malicious attacks, it is crucial to perform intrusion detection in the data stream environment.
Distributed data mining: Intrusions can be launched from several different locations and
targeted to many different destinations. Distributed data mining methods may be used to analyze
network data from several network locations in order to detect these distributed attacks
Visualization and querying tools: Visualization tools should be available for viewing any
anomalous patterns detected. Such tools may include features for viewing associations, clusters,
and outliers.