
KCA012: Data Warehousing & Data Mining

UNIT-4

Classification: Definition, Data Generalization, Analytical Characterization,


Analysis of attribute relevance, Mining Class comparisons, Statistical measures in
large Databases, Statistical-Based Algorithms, Distance-Based Algorithms,
Decision Tree-Based Algorithms.
Clustering: Introduction, Similarity and Distance Measures, Hierarchical and
Partitional Algorithms. Hierarchical Clustering- CURE and Chameleon. Density
Based Methods DBSCAN, OPTICS. Grid Based Methods- STING, CLIQUE. Model
Based Method – Statistical Approach, Association rules: Introduction, Large Item
sets, Basic Algorithms, Parallel and Distributed Algorithms, Neural Network
approach.
.....................................................................................................................................

Basic Concept of Classification (Data Mining)


Data Mining: Data mining, in general terms, means digging deep into data that
exists in different forms to discover patterns and to gain knowledge from those
patterns. In the process of data mining, large data sets are first sorted, then patterns
are identified and relationships are established in order to analyze the data and solve
problems.

Classification: It is a data analysis task, i.e. the process of finding a model that
describes and distinguishes data classes and concepts. Classification is the
problem of identifying to which of a set of categories (subpopulations) a new
observation belongs, on the basis of a training set of data containing
observations whose category membership is known.

Example: Before starting any project, we need to check its feasibility. In this
case, a classifier is required to predict class labels such as ‘Safe’ and ‘Risky’ for
adopting the project and to further approve it. Classification is a two-step process:
1. Learning Step (Training Phase): Construction of the classification model.
Different algorithms are used to build a classifier by making the model learn
from the available training set. The model has to be trained for the prediction
of accurate results.
2. Classification Step: The model is used to predict class labels; the
constructed model is tested on test data to estimate the accuracy of the
classification rules.

Test data are used to estimate the accuracy of the classification rule

Classification:
Classification is the process of finding a good model that describes the data
classes or concepts, and its purpose is to predict the class of objects whose class
label is unknown. In simple terms, we can think of classification as categorizing
incoming new data based on the assumptions we have made and the data that we
already have with us.
Prediction:
We can think of prediction as estimating something that may happen in the future.
In prediction, we identify or predict the missing or unavailable value for a new
observation based on the previous data that we have and on assumptions about the
future. In prediction, the output is a continuous value.

Difference between Prediction and Classification:
 Prediction is about predicting a missing/unknown element (a continuous value)
of a dataset, whereas classification is about determining a (categorical) class or
label for an element in a dataset.
 Example: predicting the correct treatment for a particular disease for an
individual person is prediction, whereas grouping patients based on their
medical records can be considered classification.
 The model used to predict the unknown value is called a predictor, while the
model used to classify the unknown value is called a classifier.
 The predictor is constructed from a training set, and its accuracy refers to how
well it can estimate the value of new data; a classifier is also constructed from a
training set, composed of database records and their corresponding class names.

Data Generalization:
It is the process of summarizing data by replacing relatively low-level values
with higher-level concepts. It is a form of descriptive data mining.
There are two basic approaches to data generalization:
1. Data cube approach :
 It is also known as the OLAP approach.
 It is an efficient approach, as it is helpful for tasks such as building a graph of
past sales.
 In this approach, computations are performed and results are stored in the data cube.
 It uses roll-up and drill-down operations on a data cube.
 These operations typically involve aggregate functions, such as count(), sum(),
average(), and max().
 It performs off-line aggregation before an OLAP or data mining query is
submitted for processing.
2. Attribute oriented induction :
 It is an online, query-oriented, generalization-based data analysis approach.
 In this approach, we perform generalization on the basis of the distinct values of
each attribute within the relevant data set. After that, identical generalized tuples
are merged and their respective counts are accumulated in order to perform
aggregation.
 The attribute-oriented induction approach uses two methods:
(i). Attribute removal.
(ii). Attribute generalization.
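As a small illustration of attribute-oriented induction, the hypothetical sketch below generalizes a low-level "age" attribute to age ranges, merges identical generalized tuples, and accumulates their counts. The column names, ranges, and data are made up for illustration only.

```python
import pandas as pd

# Hypothetical relevant data set with low-level attribute values
data = pd.DataFrame({
    "age": [21, 23, 35, 37, 52, 54],
    "degree": ["BSc", "BSc", "MSc", "MSc", "PhD", "PhD"],
})

# Attribute generalization: replace low-level ages with higher-level ranges
def age_range(age):
    if age < 30:
        return "20-29"
    elif age < 50:
        return "30-49"
    return "50+"

data["age"] = data["age"].apply(age_range)

# Merge identical generalized tuples and accumulate their counts
generalized = data.groupby(["age", "degree"]).size().reset_index(name="count")
print(generalized)
```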

Analytical Characterization in Data Mining
Analytical characterization is used to help identify weakly relevant or
irrelevant attributes. We can exclude these unwanted attributes when we are
preparing our data for mining.
"Analytical characterization in data mining is the attribute relevance measure
used in analysis to identify irrelevant attributes."
Why Analytical Characterization?
Analytical characterization is a very important activity in data mining due to the
following reasons:
Due to the limitations of OLAP tools in handling complex objects.
Due to the lack of automated generalization, we must explicitly tell the system
which attributes are irrelevant and must be removed, and similarly, we must
explicitly tell the system which attributes are relevant and must be included in the
class characterization.

Analysis of attribute relevance


The basic concept behind attribute relevance analysis is to evaluate some
measure that can compute the relevance of an attribute with respect to a given class
or concept. Such measures include information gain, the Gini index, uncertainty,
and correlation coefficients.
Attribute relevance analysis for concept description is implemented as follows:
1. Data Collection – Collect data for both the target class and the
contrasting class by query processing. For class comparison, the user of
the data mining query provides both the target class and the
contrasting class.
2. Preliminary relevance analysis using conservative AOI (attribute-
oriented induction) – AOI can be used to perform some preliminary
relevance analysis on the data by removing or generalizing
attributes that have a very large number of distinct values (for
example, name and phone number).
3. Remove irrelevant and weakly relevant attributes using the selected relevance
analysis measure – We evaluate each attribute in the candidate relation using
the selected relevance analysis measure. The attributes are then
sorted (i.e., ranked) according to their computed relevance to the data
mining task.
4. Generate the concept description using AOI –
Perform AOI using a less conservative set of attribute
generalization thresholds.

Relevance Measure Components :


1. Information Gain (ID3)
2. Gain Ratio (C4.5)
3. Gini Index
4. χ² (chi-square) contingency table statistics
5. Uncertainty Coefficient

Reasons for attribute relevance analysis


There are several reasons for performing attribute relevance analysis, as follows −
 It can decide which dimensions must be included.
 It can produce a high level of generalization.
 It can reduce the number of attributes that support us to read patterns easily.

Mining Class comparisons


Class discrimination or comparison mines descriptions that distinguish a target
class from its contrasting classes. The target and contrasting classes must be
comparable, meaning they share the same dimensions and attributes. For instance,
the three classes person, address, and item are not comparable. But the sales of the
last three years are comparable classes, and so are computer science students
versus physics students.
1. Data Collection: The set of relevant data in the database and data
warehouse is collected by query Processing and partitioned into a target
class and one or a set of contrasting classes.
2. Dimension relevance analysis: If there are many dimensions and analytical
comparisons are desired, then dimension relevance analysis should be
performed. Only the highly relevant dimensions are included in the further
analysis.
3. Synchronous Generalization: Generalization is performed on the target
class to the level controlled by a user- or expert-specified dimension
threshold, which results in a prime target class relation or cuboid.
The concepts in the contrasting class or classes are generalized to the same
level as those in the prime target class relation or cuboid, forming the prime
contrasting class relation or cuboid.
4. Presentation of the derived comparison: The resulting class comparison
description can be visualized in the form of tables, charts, and rules.

Statistical measures in large Databases


A descriptive statistic is a statistical summary that quantitatively describes or
summarizes features of a collection of data, while descriptive statistics (as a
discipline) is the process of using and analyzing those statistics. Descriptive
statistics is distinguished from inferential statistics (or inductive statistics) by its
aim to summarize a sample.

There are several descriptive statistical measures to mine in large databases in data
mining, i.e., measures used for knowledge discovery in large databases.

These measures are listed down below.


 Measuring Central Tendency.
 Measuring the Dispersion of Data.
 Histogram Analysis.

A) Measures of central tendency − Measures of central tendency such as mean,


median, mode, and mid-range.
1. Mean − The arithmetic average is computed simply by adding together all
values and dividing by the number of values. It uses every single value. Let
x1, x2, ..., xN be a set of N values or observations, such as salaries. The mean of
this set of values is
mean = (x1 + x2 + ... + xN) / N
2. Median:
Median − There are two methods for computing the median, depending on
whether the number of values n is odd or even.
If x1, x2, ..., xn are arranged in ascending order and n is odd, the median is the
((n+1)/2)th value.
For example, 1, 4, 6, 7, 12, 14, 18
Median = 7
When n is even, the median is the average of the (n/2)th and ((n/2)+1)th values.
For example, 1, 4, 6, 7, 8, 12, 14, 16.
Median = (7 + 8) / 2 = 7.5

Mode:
 It is nothing but the value that occurs most frequently in the data.
For Example,
 In {6, 9, 3, 6, 6, 5, 2, 3}, the Mode is 6 as it occurs most often.

B) Measuring The Dispersion Of Data


Quartiles: Quartiles are the values that divide the ordered data into quarters: Q1
(the 25th percentile) and Q3 (the 75th percentile). IQR = Q3 − Q1.
Quartiles Formula
Q3, the upper quartile, is the median of the upper half of the data set, whereas Q1,
the lower quartile, is the median of the lower half of the data set. Q2 is the overall
median. If we have n items in a data set, then the quartiles are given by:
Q1 = [(n+1)/4]th item
Q2 = [(n+1)/2]th item
Q3 = [3(n+1)/4]th item
Find the quartiles of the following data: 4, 6, 7, 8, 10, 23, 34.
Solution: Here the numbers are arranged in ascending order and the number of
items is n = 7.
Lower quartile, Q1 = [(n+1)/4]th item
Q1 = (7+1)/4 = 2nd item = 6
Median, Q2 = [(n+1)/2]th item
Q2 = (7+1)/2 = 4th item = 8
Upper quartile, Q3 = [3(n+1)/4]th item
Q3 = 3(7+1)/4 = 6th item = 23

 Inter-quartile range: It is the difference between the 75th and 25th percentiles (IQR
= Q3 – Q1).
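A minimal Python sketch (standard library only) that reproduces the worked examples above for mean, median, mode, and quartiles; the quartile helper follows the [(n+1)k/4]-th item rule used in this section.

```python
import statistics

data = [4, 6, 7, 8, 10, 23, 34]           # quartile example from above
values = [1, 4, 6, 7, 12, 14, 18]         # odd-n median example

print(statistics.mean(values))             # arithmetic average
print(statistics.median(values))           # -> 7
print(statistics.mode([6, 9, 3, 6, 6, 5, 2, 3]))  # -> 6

def quartile(sorted_data, k):
    """k-th quartile using the ((n+1)*k/4)-th item rule from this section."""
    n = len(sorted_data)
    pos = (n + 1) * k // 4                 # 1-based position
    return sorted_data[pos - 1]

q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
print(q1, q2, q3, "IQR =", q3 - q1)        # -> 6 8 23 IQR = 17
```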

Quantile Plot

 It displays all of the data (allowing the user to assess both the overall behavior and
unusual occurrences).
 It plots quantile information

Scatter Plot

 It provides a first look at bivariate data to see clusters of points, outliers, etc.
 Each pair of values is treated as a pair of coordinates and plotted as points in the
plane.

C) Histogram Analysis

 It is a graph that displays basic statistical class descriptions.


Frequency histograms
 It is a univariate graphical method.
 It consists of a set of rectangles that reflect the counts or frequencies of the classes
present in the given data.

Statistical-based algorithms
There are two types of statistical-based algorithms which are as follows −
 Regression − Regression problems deal with the estimation of an output value
based on input values. When used for classification, the input values are
values from the database and the output values define the classes. Regression
can be used to solve classification problems, but it is also used for other
applications, including forecasting. The elementary form of regression is
simple linear regression, which involves only one predictor and one predicted value.
Regression can be used to implement classification using two different
methods, which are as follows −
o Division − The data are divided into regions based on class.
o Prediction − Formulas are created to predict the output class’s value.
 Bayesian Classification − Bayesian classifiers are statistical classifiers.
Bayesian classification is based on Bayes’ theorem. Bayesian classifiers
exhibit high accuracy and speed when applied to large databases.
Bayes’ Theorem − Let X be a data tuple. In Bayesian terms, X is
treated as “evidence.” Let H be some hypothesis, such as that the data
tuple X belongs to a specified class C. For classification, we want to
determine P(H|X), the probability that hypothesis H holds given the
“evidence,” i.e., the observed data tuple X.
P(H|X) is the posterior probability of H conditioned on X. For instance,
suppose the data tuples are limited to customers described by the attributes
age and income, and that X is a 30-year-old customer with an income of
Rs. 20,000. Assume that H is the hypothesis that the customer will purchase a
computer. Then P(H|X) reflects the probability that customer X will purchase a
computer given that the customer’s age and income are known.
P(H) is the prior probability of H. For instance, this is the probability that
any given customer will purchase a computer, regardless of age, income, or any
other data. The posterior probability P(H|X) is based on more information than the
prior probability P(H), which is independent of X.
Likewise, P(X|H) is the posterior probability of X conditioned on H. It is the
probability that a customer X is 30 years old and earns Rs. 20,000, given that we
know the customer will purchase a computer.
P(H), P(X|H), and P(X) can be estimated from the given data.
Bayes’ theorem provides a way of computing the posterior probability
P(H|X) from P(H), P(X|H), and P(X). It is given by
P(H|X) = P(X|H) P(H) / P(X)
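A small numeric sketch of Bayes' theorem for the computer-purchase example above. The probability values 0.6, 0.3, and 0.25 are purely illustrative assumptions, not figures from the text.

```python
# Hypothetical estimates from training data (illustrative numbers only)
p_h = 0.6          # P(H): prior probability that a customer buys a computer
p_x_given_h = 0.3  # P(X|H): probability a buyer is 30 years old with income 20,000
p_x = 0.25         # P(X): probability of a 30-year-old customer with income 20,000

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(H|X) = {p_h_given_x:.2f}")  # posterior probability that this customer buys
```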

Distance-based algorithms
Distance-based algorithms are nonparametric methods that can be used for
classification. These algorithms classify objects according to the dissimilarity
between them, as measured by distance functions (several candidate distance
functions are reviewed under Similarity and Distance Measures below).
Distance measures play an important role in machine learning: they provide
the foundation for many popular and effective machine learning algorithms such as
KNN (K-Nearest Neighbours) for supervised learning and K-Means clustering for
unsupervised learning.

Decision Tree :-

A decision tree is similar to a tree in computer science: it has a hierarchical structure.

It has nodes, and these nodes are connected by edges. A decision tree classifies data
by asking a question at each node. (In a typical situation, if the answer is yes, go to
the right child; if not, go to the left child.)

Fig 1.1: an example decision tree

Fig 1.1 represents a simple decision tree used for the classification task of deciding
whether a customer gets a loan or not. The input features are the salary of the person,
the number of children, and the age of the person. The decision tree uses these
attributes or features and asks the right question at each node so as to
classify whether the loan can be granted to the person or not.

Terminologies
Node: The blue coloured rectangles that are shown above are what we call the
nodes of the tree. In a decision tree, a question is asked at each node and based on
the answer, certain selected outcome is given.
Root Node or Root : In a decision tree, the topmost node is called the root
node. In the above tree, the node that asks “age over 30 ?” is the root node.
Leaf node : Nodes that do not have any children are called leaf nodes. ( Get Loan,
Don’t get Loan ). Leaf nodes hold the output labels.

Some algorithms used in Decision Trees:

ID3 → (Iterative Dichotomiser 3)


C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square automatic interaction detection Performs multi-level splits
when computing classification trees)
MARS → (multivariate adaptive regression splines)
The ID3 algorithm builds decision trees using a top-down greedy search approach
through the space of possible branches with no backtracking. A greedy algorithm,
as the name suggests, always makes the choice that seems to be the best at that
moment.
Steps in ID3 algorithm: [AKTU]
1. It begins with the original set S as the root node.
2. On each iteration of the algorithm, it iterates through every unused attribute of
the set S and calculates the Entropy (H) and Information Gain (IG) of that attribute.
3. It then selects the attribute which has the smallest Entropy or largest
Information Gain.
4. The set S is then split by the selected attribute to produce subsets of the data.
5. The algorithm continues to recur on each subset, considering only attributes
never selected before.
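A minimal sketch of the entropy and information-gain computation that ID3 uses to pick the splitting attribute. The tiny weather-style dataset and attribute names are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """IG = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    total_entropy = entropy(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(sub) / len(labels) * entropy(sub) for sub in subsets.values())
    return total_entropy - remainder

# Illustrative data: each row is (outlook, windy), the label says whether to play
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"), ("overcast", "no")]
labels = ["no", "no", "yes", "no", "yes"]

# ID3 would split on the attribute with the largest information gain
for i, name in enumerate(["outlook", "windy"]):
    print(name, round(information_gain(rows, labels, i), 3))
```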

Clustering
The process of grouping a set of physical objects into classes of similar objects is
called clustering.
Cluster – similar objects are grouped within a cluster, and dissimilar objects are
grouped in other clusters.
Cluster applications – pattern recognition, image processing and market research.
Clustering or cluster analysis is a machine learning technique which groups an
unlabelled dataset.

Typical requirements of clustering in data mining


1. Scalability – Clustering algorithms should work for huge databases
2. Ability to deal with different types of attributes – Clustering algorithms
should work not only for numeric data, but also for other data types.
3. Discovery of clusters with arbitrary shape – Clustering algorithms (based on
distance measures) should work for clusters of any shape.
4. Minimal requirements for domain knowledge to determine input
parameters – Clustering results are sensitive to input parameters to a clustering
algorithm (example – number of desired clusters). Determining the value of these
parameters is difficult and requires some domain knowledge.
5. High dimensionality
6. Ability to deal with noisy data
7. Interpretability and usability

Difference between Classification and Clustering

 Classification is a supervised learning approach in which a specific label is
provided to the machine to classify new observations, whereas clustering is an
unsupervised learning approach in which grouping is done on the basis of
similarities.
 Classification uses a training dataset; clustering does not use a training dataset.
 Classification uses algorithms to categorize new data as per the observations of
the training set; clustering uses statistical concepts in which the data set is
divided into subsets with the same features.
 In classification, there are labels for the training data; in clustering, there are no
labels for the training data.
 Classification is more complex as compared to clustering, whereas clustering is
less complex as compared to classification.

Clustering Methods
Clustering methods can be classified into the following categories −
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects; the partitioning method
constructs ‘k’ partitions of the data. Each partition will represent a cluster, and k ≤ n. It
means that it will classify the data into k groups, which satisfy the following
requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.
Points to remember −
 For a given number of partitions (say k), the partitioning method will create
an initial partitioning.
 Then it uses an iterative relocation technique to improve the partitioning by
moving objects from one group to another.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects.
We can classify hierarchical methods on the basis of how the hierarchical
decomposition is formed. There are two approaches here −
 Agglomerative Approach
 Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each
object forming a separate group. It keeps on merging the objects or groups that are
close to one another. It keeps doing so until all of the groups are merged into one
or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of
the objects in the same cluster. In each successive iteration, a cluster is split up into
smaller clusters. This is done until each object is in its own cluster or the termination
condition holds. This method is rigid, i.e., once a merging or splitting is done, it
can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
 Perform careful analysis of object linkages at each hierarchical partitioning.
 Integrate hierarchical agglomeration by first using a hierarchical
agglomerative algorithm to group objects into micro-clusters, and then
performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue
growing a given cluster as long as the density in the neighborhood exceeds some
threshold, i.e., for each data point within a given cluster, the neighborhood of a
given radius has to contain at least a minimum number of points.
Grid-based Method
In this method, the objects together form a grid. The object space is quantized into a
finite number of cells that form a grid structure.
Advantages
 The major advantage of this method is fast processing time.
 It is dependent only on the number of cells in each dimension in the
quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data
for a given model. This method locates the clusters by clustering the density
function. It reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or
application-oriented constraints. A constraint refers to the user expectation or the
properties of desired clustering results. Constraints provide us with an interactive

way of communication with the clustering process. Constraints can be specified by
the user or the application requirement.

Similarity and Distance Measures in Data Mining


Clustering consists of grouping objects that are similar to each other; a
similarity measure can be used to decide whether two items are similar or
dissimilar in their properties.
In a data mining sense, the similarity measure is a distance whose dimensions
describe object features. This means that if the distance between two data points
is small, then there is a high degree of similarity between the objects, and vice
versa. Similarity is subjective and depends heavily on the context and
application. For example, similarity among vegetables can be determined from
their taste, size, colour, etc.
Most clustering approaches use distance measures to assess the similarity or
difference between a pair of objects. The most popular distance measures used
are:
1. Euclidean Distance:
Euclidean distance is considered the traditional metric for problems with
geometry. It can be simply explained as the ordinary distance between two
points. It is one of the most used measures in cluster analysis; one of the
algorithms that uses this formula is K-means. Mathematically, it computes
the square root of the sum of squared differences between the coordinates of two objects.

Figure – Euclidean Distance


2. Manhattan Distance:
This determines the sum of the absolute differences between the pairs of coordinates.
Suppose we have two points P and Q; to determine the distance between these
points we simply add the absolute differences of their coordinates along the
X-axis and Y-axis.
In a plane with P at coordinate (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|

Here the total distance of the Red line gives the Manhattan distance between both
the points.
3. Jaccard Index:
The Jaccard index measures the similarity of two data sets as the size of their
intersection divided by the size of their union; the Jaccard distance is one minus
the Jaccard index.

Figure – Jaccard Index


4. Minkowski distance:
It is the generalized form of the Euclidean and Manhattan Distance Measure. In
an N-dimensional space, a point is represented as,
(x1, x2, ..., xN)

Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Minkowski distance of order p between P1 and P2 is given as:
D(P1, P2) = (|X1 − Y1|^p + |X2 − Y2|^p + ... + |XN − YN|^p)^(1/p)

 When p = 2, the Minkowski distance is the same as the Euclidean distance.
 When p = 1, the Minkowski distance is the same as the Manhattan distance.
5. Cosine Index:
The cosine similarity measure for clustering determines the cosine of the angle
between two vectors, given by the following formula:
cos(θ) = (A · B) / (||A|| ||B||)
Here θ is the angle between the two vectors, and A, B are n-dimensional
vectors.

Figure – Cosine Distance
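For illustration, a small sketch implementing the distance and similarity measures above (Euclidean, Manhattan, Minkowski, Jaccard, cosine); the sample points and sets are made up.

```python
import math

def minkowski(p1, p2, p):
    """Minkowski distance of order p; p=2 gives Euclidean, p=1 gives Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

def euclidean(p1, p2):
    return minkowski(p1, p2, 2)

def manhattan(p1, p2):
    return minkowski(p1, p2, 1)

def jaccard_index(set_a, set_b):
    """|A intersection B| / |A union B|; the Jaccard distance is 1 minus this value."""
    return len(set_a & set_b) / len(set_a | set_b)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

p, q = (1, 2), (4, 6)
print(euclidean(p, q))                      # 5.0
print(manhattan(p, q))                      # 7.0
print(jaccard_index({1, 2, 3}, {2, 3, 4}))  # 0.5
print(cosine_similarity((1, 0), (1, 1)))    # ~0.707
```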


Hierarchical and Partitional Algorithms.

Hierarchical clustering- “Child” and “parent” clusters


Partitional clustering – partitions objects into disjoint clusters; there is no
hierarchical relationship.

A Hierarchical clustering method works via grouping data into a tree of


clusters. Hierarchical clustering begins by treating every data point as a separate
cluster. Then, it repeatedly executes the subsequent steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most similar clusters. These steps are repeated
until all the clusters are merged together.

In hierarchical clustering, the aim is to produce a hierarchical series of nested
clusters. A diagram called a dendrogram (a tree-like diagram that records the
sequences of merges or splits) graphically represents this hierarchy; it is an
inverted tree that describes the order in which points are merged (bottom-up
view) or clusters are broken up (top-down view).

Methods in Hierarchical clustering


The two basic methods to generate a hierarchical clustering are:
1. Agglomerative: Initially consider every data point as an individual cluster
and, at every step, merge the nearest pair of clusters. (It is a bottom-up
method.)
2. Divisive:
Divisive hierarchical clustering is precisely the opposite of
agglomerative hierarchical clustering (a top-down approach).
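As an illustration, a minimal sketch of agglomerative (bottom-up) clustering using SciPy's hierarchical clustering utilities; the sample points are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points: two visually separate groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Agglomerative clustering; 'single' linkage merges the closest pair of clusters first
Z = linkage(X, method="single")

# Cut the dendrogram to obtain two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```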

Partitioning Method:
This clustering method classifies the information into multiple groups based on
the characteristics and similarity of the data. It requires the data analyst to specify
the number of clusters that have to be generated for the clustering method.
In the partitioning method, given a database D that contains N objects, the
partitioning method constructs a user-specified number K of partitions of the data, in
which each partition represents a cluster and a particular region. There are many
algorithms that come under the partitioning method; some of the popular ones are
K-Means, PAM (K-Medoids), the CLARA algorithm (Clustering Large Applications), etc.
Algorithm: K-Means
Input:
K: the number of clusters into which the dataset has to be divided
D: a dataset containing N objects

Output:
A set of K clusters
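A compact from-scratch sketch of the K-Means partitioning idea described above (random initialization, assign each point to its nearest centroid, recompute the means). The sample data, K, and iteration count are illustrative assumptions.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))    # squared Euclidean distance

def kmeans(points, k, iterations=100):
    """Very small K-Means: returns (centroids, cluster label for each point)."""
    centroids = random.sample(points, k)              # initial centroids
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: recompute each centroid as the mean of its points
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```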

CURE(Clustering Using Representatives)


 It is a hierarchy-based clustering technique that adopts a middle ground
between the centroid-based and all-points extremes.
 It is used for identifying both spherical and non-spherical clusters.
 It is useful for discovering groups and identifying interesting distributions in
the underlying data.
 Instead of using a single centroid point, as in many data mining algorithms,
CURE uses a set of well-scattered representative points for efficiently handling
the clusters and eliminating the outliers.

Representation of Clusters and Outliers

Six steps in CURE algorithm:

CURE Architecture

Chameleon Clustering
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to
decide the similarity among pairs of clusters.
Chameleon uses a graph partitioning algorithm to partition the k-nearest-neighbor
graph into a large number of relatively small subclusters.
In Chameleon, cluster similarity is assessed depending on how well-connected
objects are inside a cluster and on the proximity of clusters. Especially, two
clusters are combined if their interconnectivity is high and they are close together.

Density-Based Clustering

The density-based clustering method is one of the clustering methods based on
density (a local cluster criterion), such as density-connected points. The basic ideas
of density-based clustering involve a number of new definitions. We intuitively
present these definitions and then follow up with an example.
The neighborhood within a radius ε of a given object is called the ε-neighborhood
of the object. If the ε-neighborhood of an object contains at least a minimum
number, MinPts, of objects, then the object is called a core object.

Density-reachable:

 A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of


points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from
pi

Density-connected:

 A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o


such that both, p and q are density-reachable from o wrt. Eps and MinPts.

Working Of Density-Based Clustering

Given a set of objects D, we say that an object p is directly density-reachable from
object q if p is within the ε-neighborhood of q and q is a core object.
An object p is density-reachable from object q with respect to ε and MinPts in a set
of objects D if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such
that pi+1 is directly density-reachable from pi with respect to ε and MinPts, for
1 ≤ i < n, pi ∈ D.

An object p is density-connected to object q with respect to ε and MinPts in a set of
objects D if there is an object o ∈ D such that both p and q are density-reachable
from o with respect to ε and MinPts.
Major features:
 It is used to discover clusters of arbitrary shape.
 It is also used to handle noise in the data clusters.
 It is a one scan method.

 It needs density parameters as a termination condition.

DBSCAN(Density-Based Spatial Clustering of Applications with Noise)

It relies on a density-based notion of cluster: A cluster is defined as a maximal set


of density-connected points.
It discovers clusters of arbitrary shape in spatial databases with noise.

DBSCAN Algorithm

1. Arbitrarily select a point p.


2. Retrieve all points density-reachable from p wrt Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, no points are density-reachable from p and DBSCAN
visits the next point of the database.
5. Continue the process until all of the points have been processed.
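For illustration, a minimal sketch of running DBSCAN with scikit-learn; the parameter values and the small 2-D dataset are made up, and points labelled -1 are treated as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2-D points: two dense groups plus one outlier
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [25.0, 25.0]])

# eps is the neighborhood radius; min_samples corresponds to MinPts
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; -1 marks a noise point
```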

Example (with reference to a figure of labeled sample points m, p, o, q, r, s, omitted here): let MinPts = 3.

Of the labeled points, m, p, o, and r are core objects because each is in an ε-


neighborhood containing at least three points.
q is directly density-reachable from m. m is directly density-reachable from p and
vice versa.
q is (indirectly) density-reachable from p because q is directly density-reachable
from m and m is directly density-reachable from p.
However, p is not density-reachable from q because q is not a core object.
Similarly, r and s are density-reachable from o, and o is density-reachable from r;
thus, o, r, and s are all density-connected.

OPTICS - A Cluster-Ordering Method


OPTICS: Ordering Points To Identify the Clustering Structure.
 It produces a special order of the database with respect to its density-based
clustering structure.
 This cluster-ordering contains info equivalent to the density-based clusterings
corresponding to a broad range of parameter settings.

 It is good for both automatic and interactive cluster analysis, including finding an
intrinsic clustering structure.
 It can be represented graphically or using visualization techniques.

Core-distance and reachability-distance: A figure (omitted here) illustrates the
concepts of core-distance and reachability-distance.

Suppose that ε = 6 mm and MinPts = 5.

The core-distance of p is the distance, ε′, between p and the fourth-closest data
object.

The reachability-distance of q1 with respect to p is the core-distance of p (i.e.,
ε′ = 3 mm), because this is greater than the Euclidean distance from p to q1.

Grid-Based Clustering
Grid-Based Clustering method uses a multi-resolution grid data structure.

STING - A Statistical Information Grid Approach


STING was proposed by Wang, Yang, and Muntz (VLDB’97).
In this method, the spatial area is divided into rectangular cells.
There are several levels of cells corresponding to different levels of resolution.

Each cell at a high level is partitioned into several smaller cells at the next
lower level.
The statistical information of each cell is calculated and stored beforehand and is
used to answer queries.
The parameters of higher-level cells can be easily calculated from the parameters of
lower-level cells:
 Count, mean, standard deviation (s), min, max
 Type of distribution—normal, uniform, etc.
A top-down approach is then used to answer spatial data queries:
Start from a pre-selected layer—typically one with a small number of cells.
For each cell in the current level, compute the confidence interval.
Remove the irrelevant cells from further consideration.
When the current layer has been examined, proceed to the next lower level.
Repeat this process until the bottom layer is reached.

Advantages:
 It is query-independent, easy to parallelize, and supports incremental updates.
 Its complexity is O(K), where K is the number of grid cells at the lowest level.

Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected.

CLIQUE - Clustering In QUEst
It was proposed by Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
It is based on automatically identifying the subspaces of high dimensional data
space that allow better clustering than original space.
CLIQUE can be considered as both density-based and grid-based:
 It partitions each dimension into the same number of equal-length intervals.
 It partitions an m-dimensional data space into non-overlapping rectangular
units.
 A unit is dense if the fraction of the total data points contained in the unit
exceeds the input model parameter.
 A cluster is a maximal set of connected dense units within a subspace.

Partition the data space and find the number of points that lie inside each cell
of the partition.
Identify the subspaces that contain clusters using the Apriori principle.

Identify clusters:
 Determine dense units in all subspaces of interest.
 Determine connected dense units in all subspaces of interest.

Generate minimal description for the clusters:


 Determine maximal regions that cover a cluster of connected dense units for
each cluster.
 Determine the minimal cover for each cluster.

Advantages
 It automatically finds subspaces of the highest dimensionality such that
high-density clusters exist in those subspaces.
 It is insensitive to the order of records in input and does not presume some
canonical data distribution.
 It scales linearly with the size of input and has good scalability as the
number of dimensions in the data increases.

Disadvantages
 The accuracy of the clustering result may be degraded in exchange for the
simplicity of the method.

Model-based clustering
Model-based clustering methods attempt to optimize the fit between the data
and some mathematical model. They follow statistical and AI approaches.
Each cluster corresponds to a different distribution, and these distributions are
often assumed to be Gaussian.
Model-based clustering is a statistical approach to data clustering. The observed
(multivariate) data is considered to have been created from a finite combination of
component models. Each component model is a probability distribution, generally
a parametric multivariate distribution.

Model-based clustering thus attempts to optimize the fit between the given data and
a mathematical model, based on the assumption that the data are generated by a
mixture of underlying probability distributions.
There are the following types of model-based clustering are as follows −
1. Statistical approach − Expectation maximization (EM) is a popular iterative
refinement algorithm. It is an extension of k-means:
 It assigns each object to a cluster according to a weight (probability
distribution).
 New means are computed based on weighted measures.
The basic idea is as follows (a small sketch is given at the end of this subsection):
 Start with an initial estimate of the parameter vector.
 Iteratively rescore the patterns against the mixture density produced by the
parameter vector.
 The rescored patterns are then used to update the parameter estimates.
 Patterns are considered to belong to the same cluster if they are placed by
their scores in a particular component.
2. Machine learning approach − Machine learning is an approach that uses
complex algorithms for large-scale data processing and delivers results to its users. It
uses complex programs that can learn through experience and make
predictions.
The algorithms improve themselves through frequent input of training
data. The main objective of machine learning is to learn from data and build
models from data that can be understood and used by humans.
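As an illustration of the statistical (EM) approach mentioned above, a minimal sketch fitting a Gaussian mixture model with scikit-learn; the data and the number of components are assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up 1-D data drawn around two centres
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)]).reshape(-1, 1)

# EM iteratively re-estimates the mixture parameters (means, covariances, weights)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.ravel())   # estimated component means, roughly 0 and 6
print(gmm.predict(X[:5]))   # soft cluster assignments hardened to component labels
```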

Association rules: Introduction, Large Item sets, Apriori Algorithms and


applications
Association rule mining finds interesting associations and relationships among
large sets of data items. An association rule shows how frequently an itemset occurs
in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to
show associations between items. It allows retailers to identify relationships
between the items that people frequently buy together.

Given a set of transactions, we can find rules that will predict the occurrence of
an item based on the occurrences of other items in the transaction.
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Support Count (σ) – the frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2.
Frequent Itemset – an itemset whose support is greater than or equal to a minsup
threshold.
Association Rule – an implication expression of the form X -> Y, where X and
Y are any two itemsets.
Example: {Milk, Diaper} -> {Beer}

Applications of Association Rule Learning


 Market Basket Analysis : It is one of the popular examples and
applications of association rule mining. This technique is commonly used by
big retailers to determine the association between items.
 Medical Diagnosis : With the help of association rules, patients can be
cured easily, as it helps in identifying the probability of illness for a
particular disease.
 Protein Sequence: The association rules help in determining the synthesis
of artificial Proteins.
 It is also used for the Catalog Design and Loss-leader Analysis and many
more other applications.
 Banking: services used by retail customers (money market accounts, CDs,
investment services, car loans, etc.) help recognize customers who are likely to
need other services.
 Items purchased on a credit card, such as rental cars and hotel rooms, provide
insight into the next products that customers are likely to buy.

Working of Association Rule Learning

Association rules have the form “if X then Y”. Here the “if” element (X) is called
the antecedent, and the “then” element (Y) is called the consequent.

Support: Support is how frequently an itemset A appears in the dataset:
Support(A) = (number of transactions containing A) / (total number of transactions)

Confidence: It is the ratio of the number of transactions that contain both X and Y
to the number of transactions that contain X:
Confidence(X -> Y) = Support(X ∪ Y) / Support(X)

Lift: It measures the strength of a rule and can be defined by the formula:
Lift(X -> Y) = Confidence(X -> Y) / Support(Y)
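A small sketch that computes support, confidence, and lift for the rule {Milk, Diaper} -> {Beer} on the five example transactions shown earlier.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"Milk", "Diaper"}, {"Beer"}
conf = support(antecedent | consequent) / support(antecedent)
lift = conf / support(consequent)
print(support(antecedent | consequent), conf, lift)  # 0.4, ~0.67, ~1.11
```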

Types of Association Rule Learning


Association rule learning can be divided into three algorithms:
 Apriori Algorithm
This algorithm uses frequent datasets to generate association rules. It is
designed to work on the databases that contain transactions. This algorithm
uses a breadth-first search and Hash Tree to calculate the itemset efficiently.
It is mainly used for market basket analysis and helps to understand the
products that can be bought together. It can also be used in the healthcare
field to find drug reactions for patients (a minimal sketch of the Apriori
idea is given after this list).
 Eclat Algorithm
Eclat algorithm stands for Equivalence Class Transformation. This
algorithm uses a depth-first search technique to find frequent itemsets in a
transaction database. It performs faster execution than Apriori Algorithm.
 F-P Growth Algorithm
The FP-Growth algorithm stands for Frequent Pattern Growth, and it is an
improved version of the Apriori algorithm. It represents the database in the
form of a tree structure known as a frequent pattern tree (FP-tree). The
purpose of this frequent pattern tree is to extract the most frequent patterns.
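A compact, from-scratch sketch of the Apriori idea referenced above: repeatedly extend frequent itemsets by one item and keep only those whose support meets a minimum threshold. The data reuse the example transactions, and the threshold is an illustrative assumption; this is not an optimized implementation.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
min_support = 0.6  # an itemset is frequent if it appears in >= 60% of transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    frequent, k = [], 1
    # Level-wise search: frequent 1-itemsets first
    current = [frozenset(c) for c in combinations(items, 1)
               if support(frozenset(c)) >= min_support]
    while current:
        frequent.extend(current)
        k += 1
        # Candidate generation: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
    return frequent

for itemset in apriori(transactions, min_support):
    print(set(itemset), support(itemset))
```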

Parallel Algorithm
An algorithm is a sequence of steps that take inputs from the user and after some
computation, produces an output. A parallel algorithm is an algorithm that can
execute several instructions simultaneously on different processing devices and
then combine all the individual outputs to produce the final result.

In parallel computing, multiple processors perform the multiple tasks assigned to
them simultaneously. Memory in parallel systems can either be shared or
distributed. Parallel computing provides concurrency and saves time and money.
What is Parallelism?
Parallelism is the process of executing several sets of instructions simultaneously.
It reduces the total computational time. Parallelism can be implemented by
using parallel computers, i.e. computers with many processors. Parallel
computers require parallel algorithms, programming languages, compilers and
operating systems that support multitasking.

Distributed algorithm
A distributed algorithm is an algorithm designed to run on computer
hardware constructed from interconnected processors. Distributed algorithms are
used in different application areas of distributed computing, such
as telecommunications, scientific computing, distributed information processing,
and real-time process control.
Distributed algorithms are a sub-type of parallel algorithm, typically
executed concurrently, with separate parts of the algorithm being run
simultaneously on independent processors, and having limited information about
what the other parts of the algorithm are doing.
In distributed computing, we have multiple autonomous computers which appear
to the user as a single system. In distributed systems there is no shared memory, and
computers communicate with each other through message passing. In distributed
computing, a single task is divided among different computers.

Parallel computing, also known as parallel processing, speeds up a computational
task by dividing it into smaller jobs across multiple processors inside one
computer. Distributed computing, on the other hand, uses a distributed system,
such as the internet, to increase the available computing power and enable larger,
more complex tasks to be executed across multiple machines.
Parallel Computing vs. Distributed Computing
1. Parallel: many operations are performed simultaneously. Distributed: system
components are located at different locations.
2. Parallel: a single computer is required. Distributed: multiple computers are used.
3. Parallel: multiple processors perform multiple operations. Distributed: multiple
computers perform multiple operations.
4. Parallel: it may have shared or distributed memory. Distributed: it has only
distributed memory.
5. Parallel: processors communicate with each other through a bus. Distributed:
computers communicate with each other through message passing.
6. Parallel: improves system performance. Distributed: improves system
scalability, fault tolerance and resource-sharing capabilities.

Neural Network:
In information technology (IT), an artificial neural network (ANN) is a system of
hardware and/or software patterned after the operation of neurons in the human
brain. ANNs -- also called, simply, neural networks -- are a variety of deep
learning technology, which also falls under the umbrella of artificial intelligence,
or AI.
Commercial applications of these technologies generally focus on solving
complex signal processing or pattern recognition problems. Examples of
significant commercial applications since 2000 include handwriting recognition for
check processing, speech-to-text transcription, oil-exploration data analysis,
weather prediction and facial recognition.

Neural Network Architecture:
While there are numerous different neural network architectures that have been
created by researchers, the most successful applications of neural networks in data
mining have been multilayer feed-forward networks. These are networks in
which there is an input layer consisting of nodes that simply accept the input
values, followed by successive layers of nodes that are artificial neurons. The
outputs of neurons in a layer are inputs to neurons in the next layer. The last
layer is called the output layer. Layers between the input and output layers are
known as hidden layers.

Multilayer Feedforward Neural Networks
A multilayer feedforward neural network is an interconnection of perceptrons in
which data and calculations flow in a single direction, from the input data to the
outputs. The number of layers in a neural network is the number of layers of
perceptrons. The simplest neural network is one with a single input layer and an
output layer of perceptrons. The network in Figure illustrates this type of network.
Technically, this is referred to as a one-layer feedforward network with two
outputs because the output layer is the only layer with an activation calculation.

A Single-Layer Feedforward Neural Net


In this single-layer feedforward neural network, the network’s inputs are directly
connected to the output layer perceptrons, Z1 and Z2.
The output perceptrons use activation functions, g1 and g2, to produce the
outputs Y1 and Y2.
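A tiny numeric sketch of the single-layer feed-forward computation described above: two inputs connected directly to two output perceptrons Z1 and Z2, with a sigmoid assumed for the activation functions g1 and g2. The weights, biases, and inputs are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # assumed form of the activations g1 and g2

x = np.array([0.5, -1.0])            # network inputs
W = np.array([[0.8, -0.2],           # weights into output perceptron Z1
              [0.4,  0.9]])          # weights into output perceptron Z2
b = np.array([0.1, -0.3])            # biases

z = W @ x + b                        # weighted sums at Z1 and Z2
y = sigmoid(z)                       # outputs Y1 and Y2 after activation
print(y)
```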

Application of Multilayer Feed-Forward Neural Network:


1. Medical field
2. Speech regeneration
3. Data processing and compression
4. Image processing

Genetic Algorithms
Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to
the larger class of evolutionary algorithms. Genetic algorithms are based on the
ideas of natural selection and genetics. They are an intelligent exploitation of
random search, provided with historical data to direct the search into regions of
better performance in the solution space. They are commonly used to generate
high-quality solutions for optimization problems and search problems.
Genetic algorithms simulate the process of natural selection, which means that
species that can adapt to changes in their environment are able to survive,
reproduce, and go on to the next generation. In simple words, they simulate
“survival of the fittest” among individuals of consecutive generations to solve a
problem. Each generation consists of a population of individuals, and each
individual represents a point in the search space and a possible solution. Each
individual is represented as a string of characters/integers/floats/bits. This string is
analogous to a chromosome.

Foundation of Genetic Algorithms


Genetic algorithms are based on an analogy with the genetic structure and behaviour
of the chromosomes of a population. Following is the foundation of GAs based on
this analogy –
1. Individuals in a population compete for resources and mate.
2. Those individuals who are successful (fittest) mate to create more
offspring than others.
3. Genes from the “fittest” parents propagate throughout the generation; that is,
sometimes parents create offspring which are better than either parent.
4. Thus each successive generation is more suited to its environment.

Search space
The population of individuals is maintained within the search space. Each
individual represents a solution, in the search space, to the given problem. Each
individual is coded as a finite-length vector (analogous to a chromosome) of
components. These variable components are analogous to genes. Thus a
chromosome (individual) is composed of several genes (variable components).

Fitness Score
A fitness score is given to each individual, which shows the ability of an
individual to “compete”. Individuals having an optimal (or near-optimal)
fitness score are sought.

Operators of Genetic Algorithms


Once the initial generation is created, the algorithm evolves the generation using
following operators –
1) Selection Operator: The idea is to give preference to the individuals with
good fitness scores and allow them to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals. Two
individuals are selected using selection operator and crossover sites are chosen
randomly. Then the genes at these crossover sites are exchanged thus creating a
completely new individual (offspring). For example –

3) Mutation Operator: The key idea is to insert random genes in offspring to


maintain the diversity in the population to avoid premature convergence. For
example –

The whole algorithm can be summarized as –


1) Randomly initialize populations p
2) Determine fitness of population
3) Until convergence repeat:
a) Select parents from population

b) Crossover and generate new population
c) Perform mutation on new population
d) Calculate fitness for new population
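A minimal sketch of the loop summarized above (selection, crossover, mutation), applied to a toy problem of maximizing the number of 1-bits in a binary chromosome. All parameter values are illustrative assumptions.

```python
import random

CHROMOSOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.01

def fitness(chromosome):
    return sum(chromosome)                       # toy objective: count of 1-bits

def select(population):
    # Tournament selection: prefer the fitter of two random individuals
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(parent1, parent2):
    point = random.randrange(1, CHROMOSOME_LEN)  # single random crossover site
    return parent1[:point] + parent2[point:]

def mutate(chromosome):
    # Flip each bit with a small probability to maintain diversity
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in chromosome]

population = [[random.randint(0, 1) for _ in range(CHROMOSOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print(fitness(best), best)
```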
K-Nearest Neighbours Classification
K-Nearest Neighbours is one of the most basic yet essential classification
algorithms in Machine Learning. It belongs to the supervised learning domain
and finds intense application in pattern recognition, data mining and intrusion
detection.
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.
o K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets
new data, then it classifies that data into a category that is much similar to
the new data.
o Example: Suppose we have an image of a creature that looks similar to both a
cat and a dog, but we want to know whether it is a cat or a dog. For this
identification, we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will find the features of the new image that are
similar to the cat and dog images, and based on the most similar features it
will put the image in either the cat or the dog category.
Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a
new data point x1; we need to determine which of these categories this data point
belongs to. To solve this type of problem, we need a K-NN algorithm. With the
help of K-NN, we can easily identify the category or class of a particular data
point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these k neighbors, count the number of the data points in
each category.
o Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
o Step-6: Our model is ready.
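A from-scratch sketch of the K-NN steps listed above (Euclidean distance, take the k nearest, majority vote). The small 2-D training set and k = 3 are illustrative assumptions.

```python
import math
from collections import Counter

# Illustrative labelled training points: (x, y) -> category
training = [((1, 1), "A"), ((2, 1), "A"), ((1, 2), "A"),
            ((6, 5), "B"), ((7, 6), "B"), ((6, 6), "B")]

def knn_classify(query, training, k=3):
    # Step 2: compute the Euclidean distance to every training point
    distances = [(math.dist(query, point), label) for point, label in training]
    # Step 3: take the k nearest neighbours
    nearest = sorted(distances)[:k]
    # Steps 4-5: count labels among the neighbours and pick the majority category
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((2, 2), training))    # -> "A"
print(knn_classify((6, 5.5), training))  # -> "B"
```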
