MEDI-CAPS UNIVERSITY
Faculty of Engineering
Mr. Sagar Pandya
Information Technology Department
sagar.pandya@medicaps.ac.in
IT3ED02 Data Mining and Warehousing 3-0-0
 Unit 1. Introduction
 Unit 2. Data Mining
 Unit 3. Association and Classification
 Unit 4. Clustering
 Unit 5. Business Analysis
Text Books
 Han, Kamber and Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann,
India, 2012.
 Mohammed Zaki and Wagner Meira Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, Cambridge University Press.
 Z. Markov and Daniel T. Larose, Data Mining the Web, John Wiley & Sons, USA.
Reference Books
 Sam Anahory and Dennis Murray, Data Warehousing in the Real World,
Pearson Education Asia.
 W. H. Inmon, Building the Data Warehouse, 4th Ed., Wiley India.
and many others
Unit-4 Clustering
 Clustering: Introduction, Types of clustering;
 Partition-based clustering: K-Means, K-Medoids;
 Density based clustering: DBSCAN, Clustering evaluation.
 Mining Data Stream, Mining Time-Series Data, Mining Sequence
Patterns in Transactional Database,
 Social Network analysis and Multirelational Data Mining.
Clustering
 In clustering, data objects are grouped so that objects within the same group are
similar to one another.
 Each such group is called a cluster.
 In cluster analysis, a data set is divided into different groups based on the
similarity of the data.
 After the data have been divided into groups, a label can be assigned to each
group.
 Working with these group labels makes it easier to adapt as the data changes.
 In other words, similar objects are grouped in one cluster and
dissimilar objects are grouped in another cluster.
 Clustering is an unsupervised machine learning technique that partitions data
points into clusters so that objects in the same cluster are similar to each other.
• The quality of a cluster depends on the method used.
• Clustering is also called data segmentation, because it partitions
large data sets into groups according to their similarity.
• A clustering algorithm typically proceeds in three basic stages.
 For example, once the data from our customer base is divided into clusters, we can
make an informed decision about which customers are best suited for a given product.
 What is a Cluster?
 A cluster is a subset of similar objects.
• A subset of objects such that the distance between any of the two objects in
the cluster is less than the distance between any object in the cluster and any
object that is not located inside it.
 What is clustering in Data Mining?
• Clustering is the method of converting a group of abstract objects into classes
of similar objects.
• Clustering is a method of partitioning a set of data or objects into a set of
significant subclasses called clusters.
• It helps users understand the natural grouping or structure in a data set, and is
used either as a stand-alone tool to get better insight into the data
distribution or as a pre-processing step for other algorithms.
• Clustering analysis is broadly used in many applications such as
market research, pattern recognition, data analysis, and image
processing.
• Clustering can also help marketers discover distinct groups in their
customer base and characterize those groups based on their purchasing patterns.
• In the field of biology, it can be used to derive plant and animal
taxonomies, categorize genes with similar functionalities and gain
insight into structures inherent to populations.
• Clustering also helps in identification of areas of similar land use in
an earth observation database. It also helps in the identification of
groups of houses in a city according to house type, value, and
geographic location.
• Clustering also helps in classifying documents on the web for
information discovery.
• Clustering is also used in outlier detection applications such as
detection of credit card fraud.
 Clustering Methods
 Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
• The requirements of a good clustering method are:
• The ability to discover some or all of the hidden clusters.
• Within-cluster similarity and between-cluster dissimilarity.
• Ability to deal with various types of attributes.
• Can deal with noise and outliers.
• Can handle high dimensionality.
• Scalable, Interpretable and usable.
• An important issue in clustering is how to determine the similarity
between two objects, so that clusters can be formed from objects
with high similarity within clusters and low similarity between
clusters.
• Commonly, to measure similarity or dissimilarity between objects, a
distance measure such as Euclidean, Manhattan and Minkowski is
used.
• A distance function returns a lower value for pairs of objects that are
more similar to one another.
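 As a small illustration (not part of the original slides), the three distance measures can be written in a few lines of Python; the sample points below are made up.

import numpy as np

def euclidean(x, y):
    # square root of the sum of squared differences
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    # sum of absolute differences
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def minkowski(x, y, p=3):
    # generalises both: p = 2 gives Euclidean, p = 1 gives Manhattan
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

print(euclidean([1, 2], [4, 6]))        # 5.0
print(manhattan([1, 2], [4, 6]))        # 7
print(minkowski([1, 2], [4, 6], p=2))   # same as Euclidean when p = 2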
K-Means Algorithm
 The Concept
 Imagine you’re opening a small book store.
 You have a stack of different books, and 3 bookshelves.
 Your goal is to place similar books on one shelf.
 What you would do is pick up 3 books, one for each shelf, in order to
set a theme for every shelf.
 These books will now dictate which of the remaining books will go
in which shelf.
• Every time you pick a new book up from the stack, you would
compare it with those first 3 books, and place this new book on the
shelf that has similar books.
• You would repeat this process until all the books have been placed.
 Once you’re done, you might notice that changing the number of
bookshelves, or picking different initial books for those shelves
(changing the theme for each shelf), could improve how well you’ve
grouped the books.
 So, you repeat the process in hopes of a better outcome.
 The Algorithm
 K-means clustering is a good place to start exploring an unlabeled
dataset. The K in K-Means denotes the number of clusters.
 This algorithm is bound to converge to a solution after some
iterations.
 It has 4 basic steps:
1. Initialize Cluster Centroids (Choose those 3 books to start with)
2. Assign data points to clusters (Place the remaining books one by one)
3. Update Cluster centroids (Start over with 3 different books)
4. Repeat step 2–3 until the stopping condition is met.
 You don’t have to start with 3 clusters initially, but 2–3 is generally a
good place to start, and update later on.
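 A minimal sketch of these four steps, assuming scikit-learn is available (an illustrative example, not part of the original slides); the data points are made up.

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points
X = np.array([[1, 1], [2, 4], [3, 4], [5, 8], [6, 2], [7, 8]])

# Steps 1-4: initialise k centroids, assign points, update centroids, repeat until convergence
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroid coordinates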
K-Medoids Algorithm
 The k-means method is based on the centroid techniques to represent
the cluster and it is sensitive to outliers.
 This means, a data object with an extremely large value may disrupt
the distribution of data.
 To overcome this problem, the k-medoids method is used, which is
based on representative-object techniques.
 The centroid is replaced with a medoid to represent the cluster.
 A medoid is the most centrally located data object in a cluster.
 Here, k data objects are selected randomly as medoids to represent k
clusters, and all remaining data objects are placed in the cluster whose
medoid is nearest (or most similar) to them.
 After processing all data objects, new medoid is determined which
can represent cluster in a better way and the entire process is
repeated.
 Again all data objects are bound to the clusters based on the new
medoids.
 In each iteration, medoids change their location step by step.
 This process continues until no medoid moves.
 As a result, k clusters are found representing a set of n data objects.
 The most common k-medoids clustering method is
the PAM algorithm (Partitioning Around Medoids).
 The K-Medoids (also called Partitioning Around Medoids) algorithm was
proposed in 1987 by Kaufman and Rousseeuw.
 A medoid can be defined as the point in the cluster whose total dissimilarity
to all the other points in the cluster is minimum.
 1. Initialize: select k random points out of the n data points as the
medoids.
 2. Associate each data point to the closest medoid by using any common
distance metric methods.
 3. While the cost decreases: for each medoid m and for each data point o
which is not a medoid:
(a) Swap m and o, associate each data point to the closest medoid, and
recompute the cost.
(b) If the total cost is more than that in the previous step, undo the swap.
 PAM concept: The use of means implies that k-means clustering is
highly sensitive to outliers.
 This can severely affect the assignment of observations to clusters.
 A more robust algorithm is provided by the PAM algorithm.
 PAM algorithm: The PAM algorithm is based on the search for k
representative objects or medoids among the observations of the data set.
 After finding a set of k medoids, clusters are constructed by assigning
each observation to the nearest medoid.
 Next, each selected medoid m and each non-medoid data point are
swapped and the objective function is computed.
 The objective function corresponds to the sum of the dissimilarities of all
objects to their nearest medoid.
 In summary, the PAM algorithm proceeds in two phases as follows:
 Build phase:
1. Select k objects to become the medoids, or in case these objects were
provided use them as the medoids;
2. Calculate the dissimilarity matrix if it was not provided;
3. Assign every object to its closest medoid;
 Swap phase:
 4. For each cluster, check whether any object of the cluster decreases
the average dissimilarity coefficient; if it does, select the object that
decreases this coefficient the most as the new medoid for that cluster;
 5. If at least one medoid has changed go to (3), else end the
algorithm.
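 The build/swap idea can be sketched in plain NumPy as below; this is a rough illustrative implementation (not the official PAM code), and the data points are made up.

import numpy as np

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Build phase: pick k random objects as the initial medoids.
    medoids = rng.choice(n, size=k, replace=False)

    def total_cost(medoid_idx):
        # Sum of distances of every object to its nearest medoid.
        d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
        return d.min(axis=1).sum(), d.argmin(axis=1)

    cost, labels = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        # Swap phase: try replacing each medoid with each non-medoid object.
        for mi in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = o
                new_cost, new_labels = total_cost(candidate)
                if new_cost < cost:            # keep the swap only if the cost drops
                    medoids, cost, labels = candidate, new_cost, new_labels
                    improved = True
        if not improved:                       # no medoid moved: stop
            break
    return medoids, labels

X = np.array([[1, 1], [2, 4], [3, 4], [5, 8], [6, 2], [7, 8]], dtype=float)
medoids, labels = pam(X, k=2)
print(X[medoids], labels)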
 Advantages:
1. It is simple to understand and easy to implement.
2. K-Medoid Algorithm is fast and converges in a fixed number of steps.
3. PAM is less sensitive to outliers than other partitioning algorithms.
 Disadvantages:
1. The main disadvantage of the K-Medoid algorithm is that it is not suitable
for clustering non-spherical (arbitrarily shaped) groups of objects. This
is because it relies on minimizing the distances between the non-
medoid objects and the medoid (the cluster centre) – briefly, it uses
compactness as the clustering criterion instead of connectivity.
2. It may obtain different results for different runs on the same dataset
because the first k medoids are chosen randomly.
Hierarchical Clustering Algorithm
 The hierarchical clustering algorithm, also called hierarchical
cluster analysis or HCA, is an unsupervised clustering algorithm
that creates clusters with a predetermined ordering from top to bottom.
 For example, all files and folders on our hard disk are organized in a
hierarchy.
 The algorithm groups similar objects into groups called clusters. The
endpoint is a set of clusters or groups, where each cluster is distinct
from each other cluster, and the objects within each cluster are
broadly similar to each other.
 This clustering technique is divided into two types:
1. Agglomerative Hierarchical Clustering
2. Divisive Hierarchical Clustering
 Agglomerative Hierarchical Clustering:
The Agglomerative Hierarchical Clustering is the most common type
of hierarchical clustering used to group objects in clusters based on
their similarity.
 It’s also known as AGNES (Agglomerative Nesting).
 It's a “bottom-up” approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the
hierarchy.
 How does it work?
1. Make each data point a single-point cluster → forms N clusters
2. Take the two closest data points and make them one cluster → forms
N-1 clusters
3. Take the two closest clusters and make them one cluster → Forms N-2 clusters.
4. Repeat step-3 until you are left with only one cluster.
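 A minimal sketch of this bottom-up procedure, assuming scikit-learn is available; the points are made up.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Bottom-up: each point starts in its own cluster; the closest clusters are merged
agg = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)
print(agg.labels_)   # e.g. [0 0 0 1 1 1] -- cluster index per point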
 What is a Dendrogram?
 A Dendrogram is a type of tree diagram showing hierarchical relationships
between different sets of data.
 As already said, a dendrogram contains the memory of the hierarchical clustering
algorithm, so just by looking at the dendrogram you can tell how each cluster was
formed.
 The point of all this is that hierarchical clustering maintains a memory of how
we went through the merging process, and that memory is stored in the dendrogram.
 Note:-
1. Distance between data points represents dissimilarities.
2. Height of the blocks represents the distance between clusters.
 In the dendrogram example, initially P5 and P6, which are closer to each other
than to any other point, are combined into one cluster, followed by P4 getting
merged into the same cluster (C2).
 Then P1 and P2 get combined into one cluster, followed by P0
getting merged into the same cluster (C4).
 Now P3 gets merged into cluster C2 and finally, both clusters get
merged into one.
 There are several ways to measure the distance between clusters in order to
decide the rules for clustering, and they are often called Linkage Methods.
Some of the common linkage methods are:
• Single-linkage: the distance between two clusters is defined as
the shortest distance between two points in each cluster. This linkage may be
used to detect high values in your dataset which may be outliers as they will be
merged at the end.
• Complete-linkage: the distance between two clusters is defined as
the longest distance between two points in each cluster. For example, the
distance between clusters “r” and “s” is the distance between their two
furthest points.
• Average-linkage: the distance between two clusters is defined as the average
distance between each point in one cluster to every point in the other cluster.
• Centroid-linkage: finds the centroid of cluster 1 and centroid of cluster 2, and
then calculates the distance between the two before merging.
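 These linkage methods can be compared with SciPy, as in the rough sketch below (assuming scipy and matplotlib are installed; the points are made up).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                          # merge tree for this linkage
    print(method, fcluster(Z, t=2, criterion="maxclust"))  # cut the tree into 2 clusters

dendrogram(linkage(X, method="average"))                   # tree diagram of the merges
plt.show()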
 Parts of a Dendrogram:
 A dendrogram can be a column graph or a row graph.
 Some dendrograms are circular or have a fluid-shape, but the
software will usually produce a row or column graph.
 No matter what the shape, the basic graph comprises the same parts:
• The clades are the branches, and they are arranged according to how similar
(or dissimilar) they are.
• Clades that are close to the same height are similar to each other;
clades with different heights are dissimilar — the greater the
difference in height, the more dissimilarity.
• Each clade has one or more leaves.
• Leaves A, B, and C are more similar to each other than they are to leaves D, E,
or F.
• Leaves D and E are more similar to each other than they are to leaves A, B, C,
or F.
• Leaf F is substantially different from all of the other leaves.
 A clade can theoretically have an infinite number of leaves. However, the more
leaves you have, the harder the graph will be to read with the naked eye.
 One question that might have intrigued you by now is how do you decide
when to stop merging the clusters?
 You cut the dendrogram tree with a horizontal line at a height where the line
can traverse the maximum distance up and down without intersecting the
merging point.
 For example, if a horizontal line (say L3) can traverse the maximum distance up and
down without intersecting the merging points, we draw the cut there, and the number
of vertical lines it intersects is the optimal number of clusters.
 Number of clusters in this case = 3.
 Let’s see the graphical representation of this algorithm using a dendrogram.
 Note:
This is just a demonstration of how the actual algorithm works; no calculations
have been performed below, and all the proximities among the clusters are assumed.
 Let’s say we have six data points A, B, C, D, E, F.
• Step-1: Consider each alphabet as a single cluster and calculate the distance of one
cluster from all the other clusters.
• Step-2: In the second step comparable clusters are merged together to form a single
cluster. Let’s say cluster (B) and cluster (C) are very similar to each other therefore we
merge them in the second step similarly with cluster (D) and (E) and at last, we get the
clusters
[(A), (BC), (DE), (F)]
• Step-3: We recalculate the proximity according to the algorithm and merge the two
nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
• Step-4: Repeating the same process, the clusters DEF and BC are comparable and are
merged together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
• Step-5: At last the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
 Divisive Hierarchical Clustering
 Divisive clustering, or DIANA (DIvisive ANAlysis Clustering), is a top-down clustering
method where we assign all of the observations to a single cluster and then
partition that cluster into the two least similar clusters.
 Finally, we proceed recursively on each cluster until there is one cluster for
each observation. So this clustering approach is exactly opposite to
Agglomerative clustering.
 There is evidence that divisive algorithms produce more accurate hierarchies
than agglomerative algorithms in some circumstances, but they are conceptually more
complex.
 In both agglomerative and divisive hierarchical clustering, users need to
specify the desired number of clusters as a termination condition (when to stop
merging or splitting).
DBSCAN Clustering Algorithm
 Clustering analysis is an unsupervised learning method that separates
the data points into several specific bunches or groups, such that the
data points in the same groups have similar properties and data
points in different groups have different properties in some sense.
 Essentially, all clustering methods use the same approach: first we
calculate similarities, and then we use them to cluster the data points into
groups or batches.
 DBSCAN stands for Density-Based Spatial Clustering of
Applications with Noise.
 It was proposed by Martin Ester et al. in 1996. DBSCAN is a
density-based clustering algorithm that works on the assumption that
clusters are dense regions in space separated by regions of lower
density.
 It can discover clusters of different shapes and sizes from a large
amount of data, which is containing noise and outliers.
 K-Means and Hierarchical Clustering both fail in creating clusters of
arbitrary shapes. They are not able to form clusters based on varying
densities. That’s why we need DBSCAN clustering.
• minPts: The minimum number of points (a threshold) clustered
together for a region to be considered dense.
• eps (ε): A distance measure that will be used to locate the points in
the neighborhood of any point.
• Core — a point that has at least minPts points within
distance eps from itself.
• Border — a point that has at least one core point within
distance eps.
• Noise — a point that is neither a core nor a border point; it
has fewer than minPts points within distance eps from itself.
 These parameters can be understood if we explore two concepts
called Density Reachability and Density Connectivity.
 Reachability in terms of density establishes a point to be reachable
from another if it lies within a particular distance (eps) from it.
 Connectivity, on the other hand, involves a transitivity based
chaining-approach to determine whether points are located in a
particular cluster.
 For example, p and q points could be connected if p->r->s->t->q,
where a->b means b is in the neighborhood of a.
 In 2014, the algorithm was awarded the ‘Test of Time’ award at the
leading Data Mining conference, KDD.
 A point X is directly density-reachable from point Y w.r.t epsilon,
minPoints if,
1. X belongs to the neighborhood of Y, i.e, dist(X, Y) <= epsilon
2. Y is a core point
• Here, X is directly density-reachable from Y, but vice versa is not
valid.
 Here, X is density-reachable from Y with X being directly density-
reachable from P2, P2 from P3, and P3 from Y. But, the inverse of
this is not valid.
 DBSCAN algorithm can be abstracted in the following steps –
1. Find all the neighbor points within eps of every point, and identify the core points
as those with more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density connected points and assign them to the same
cluster as the core point.
Points a and b are said to be density connected if there exists a point c which
has a sufficient number of points in its neighborhood and both a and b are
within eps distance of it. This is a chaining process: if b is a neighbor of c, c is
a neighbor of d, d is a neighbor of e, and e in turn is a neighbor of a, this implies
that b is connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that
do not belong to any cluster are noise.
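 A minimal DBSCAN sketch, assuming scikit-learn is available; the eps and min_samples values and the points are made up for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)   # e.g. [0 0 0 1 1 -1]; label -1 marks noise points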
Grid-Based Clustering Algorithms
 In this method, the objects together form a grid.
 The object space is quantized into a finite number of cells that form a
grid structure.
Basic Grid-based Algorithm
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and compute the density
of each cell.
3. Eliminate cells, whose density is below a certain threshold t.
4. Form clusters from contiguous (adjacent) groups of dense cells
(usually minimizing a given objective function)
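 A rough sketch of these four steps on a fixed 2-D grid, using NumPy and SciPy; the grid size and density threshold are made-up illustrative values.

import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (100, 2)), rng.normal(8, 0.5, (100, 2))])

# 1. Define grid cells; 2. count the objects falling in each cell (its density).
density, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=10)

# 3. Eliminate cells whose density is below threshold t; 4. connect adjacent dense cells.
t = 3
dense = density >= t
clusters, n_clusters = label(dense)   # contiguous dense cells form the clusters
print(n_clusters)                     # number of grid-based clusters found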
 Advantages
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension in the
quantized space.
 Several interesting methods (in addition to the basic grid-based
algorithm)
 STING (a STatistical INformation Grid approach) by Wang, Yang
and Muntz (1997)
 CLIQUE: Agrawal, et al. (SIGMOD’98)
Model-Based Clustering Algorithms
 In this method, a model is hypothesized for each cluster to find the best fit of data for a
given model. This method locates the clusters by clustering the density function. It
reflects spatial distribution of the data points.
 This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
 In model-based clustering, the data are considered as coming from a mixture of densities.
 Each component (i.e. cluster) k is modeled by the normal or Gaussian distribution
which is characterized by the parameters:
 μk: the mean vector,
 Σk: the covariance matrix,
 an associated mixing probability. Each point has a probability of belonging to
each cluster.
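 A minimal model-based clustering sketch with a Gaussian mixture, assuming scikit-learn is available; the synthetic data below is made up.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.means_)                # estimated mean vector (mu_k) of each component
print(gmm.predict_proba(X[:3]))  # probability of each point belonging to each cluster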
MINING DATA STREAM
 Large amounts of data stream in every day.
 Efficient knowledge discovery of such data streams is an emerging
active research area in data mining with broad applications.
 Data Stream Mining (also known as stream learning) is the process
of extracting knowledge structures from continuous, rapid data
records.
 Data streams typically arrive continuously at high speed, in huge volumes, and
with changing data distributions.
 Data mining techniques that require multiple scans of the entire
data set cannot be applied directly to mine stream data, which
usually allows only one scan and demands a fast response time.
 Imagine a factory with 500 sensors each capturing 10 KB of information
every second: about 18 GB of information is captured in one hour and
roughly 432 GB daily.
 This massive amount of information needs to be analyzed in real time (or in the
shortest time possible) to detect irregularities or deviations in the system
and react quickly.
 Stream Mining enables to analyze large amounts of data in real-time.
 Data Stream Mining is the process of extracting knowledge from
continuous rapid data records which comes to the system in a stream.
 A Data Stream is an ordered sequence of instances in time.
 Data stream mining is a process of mining continuous incoming real
time streaming data with acceptable performance.
 Data stream mining must handle the following characteristics:
 Continuous stream of data: a high volume of data arrives in an infinite stream;
we never see the entire dataset.
 Concept Drifting: The data change or evolves over time
 Volatility of data: The system does not store the data received (Limited
resources). When data is analyzed it’s discarded or summarized.
 Data stream is a high-speed continuous flow of data from diverse
resources.
 The sources might include remote sensors, scientific processes, stock
markets, online transactions, tweets, internet traffic, video
surveillance systems etc.
 Generally these streams come in high-speed with a huge volume of
data generated by real-time applications.
 Data streams have unique characteristics when compared with
traditional datasets.
 They are potentially infinite, massive, continuous, temporally
ordered and fast changing.
 Storing such streams and processing them later is not viable, as that needs a lot
of storage and processing power.
 For this reason they are to be processed in real-time in order to
discover knowledge from them instead of storing and processing like
traditional data mining.
 The data stream mining procedure includes selecting a part of stream
data, preprocessing, incremental learning and extraction of
knowledge in a single pass.
 The result of data stream mining is the knowledge that can help in
taking intelligent decisions.
 Thus the processing of data streams poses challenges in terms of the
memory and processing power of systems.
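 As a toy illustration of the single-pass constraint (not from the slides), running statistics can be maintained incrementally without ever storing the stream:

def process_stream(stream):
    # Welford's running mean/variance: constant-size summary, one pass over the data
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:                 # each element is seen exactly once
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance

# Simulated sensor stream (made-up values); in practice this would be a socket or queue.
readings = (float(i % 7) for i in range(1_000_000))
print(process_stream(readings))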
What are the Applications?
 Telecommunication calling records
 Business credit card transaction flows
 Network monitoring and traffic engineering
 Financial market: stock exchange
 Engineering & industrial processes: power supply & manufacturing
 Sensor, monitoring & surveillance: video streams, RFIDs
 Security monitoring
 Web logs and Web page click streams
 Massive data sets (even saved but random access is too expensive)
Software for data stream mining:
 MOA (Massive Online Analysis): free open-source software specific
for mining data streams with concept drift. It has several machine
learning algorithms (classification, regression, clustering, outlier
detection and recommender systems).
 RapidMiner: commercial software for knowledge discovery, data
mining, and machine learning also featuring data stream mining,
learning time-varying concepts, and tracking drifting concept (if used
in combination with its data stream mining plugin (formerly:
Concept Drift plugin)).
 StreamDM: an open-source framework for big data
stream mining that uses Spark Streaming.
 When the underlying data has very large volume and arrives as a high-speed, continuous
flow, it leads to a number of computational and mining challenges, listed below.
 (1) Data contained in data streams is fast changing, high-speed and real-time.
 (2) Multiple or random access to data streams is expensive, in fact almost
impossible.
 (3) A huge volume of data has to be processed in limited memory.
 (4) A data stream mining system must process high-speed and gigantic data within
time limitations.
 (5) The data arrive in multidimensional, low-level form, so techniques to mine
such data need to be very sophisticated.
 (6) Data stream elements change rapidly over time. Thus, data from the past may
become irrelevant for mining.
MINING TIME SERIES DATA
 A time series is a sequence of data points recorded at specific time
points - most often in regular time intervals (seconds, hours, days,
months etc.).
 Every organization generates a high volume of data every single day –
be it sales figure, revenue, traffic, or operating cost.
 Time series data mining can generate valuable information for long-term
business decisions, yet it is underutilized in most organizations.
 Stock market analysis, economic and sales forecasting, scientific and
engineering experiments, medical treatments, etc., all produce such data; more
generally, a sequence database consists of sequences of ordered events (where
time is optional).
 Web page traversal sequences are another example. Time-series data can be
analyzed to identify correlations, similar or regular patterns, trends, and outliers.
 Industries in all sectors generate and use time series data to make
important business decisions.
 Using past data, a grocery chain wants to know at which time of the
year market demand peaks for a particular product; call centers need
to forecast future call volumes so they can maintain adequate
staffing; credit card companies look out for fraudulent transactions —
all these business decisions benefit from the use of time series data.
 Time series data points are snapshots of the past.
 Understanding historical events, patterns and trends are some basic
indicators that all businesses track.
 They want to understand how well they performed in the past
and where they are headed in the future.
 A basic understanding of historical events through time series data
doesn’t require fancy modeling; just plotting data against time can
generate very powerful insights.
 In the old days, spreadsheets were good enough to create powerful
visual stories and insights.
 Nowadays most statistical and data analysis tools (e.g. Python,
Tableau, PowerBI) can handle time-series data pretty well for
creating time series charts, dashboards etc.
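 For instance, a simple pandas plot of made-up monthly sales data (an illustrative sketch, assuming pandas and matplotlib are installed) is often enough for a first look:

import pandas as pd
import matplotlib.pyplot as plt

# Made-up monthly sales figures indexed by month
sales = pd.Series(
    [120, 135, 150, 160, 155, 170, 180, 210, 205, 190, 220, 260],
    index=pd.date_range("2020-01-01", periods=12, freq="MS"),
)

sales.plot(title="Monthly sales (illustrative data)")  # plot the value against time
plt.show()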
 Time series data provides a wealth of analytics and application
possibilities in all domains of applications.
 Historical analysis, forecasting, anomaly detection, and predictive
analytics are just a few of those possibilities.
 New analytical frontiers are also emerging with the development of
new tools and techniques.
 Artificial neural networks (e.g. LSTM) and econometrics are such
cutting-edge frontiers in time series data analytics.
 Experienced and aspiring data scientists alike can make tremendous
contributions to their domains by taking advantage of these tools, or
maybe by developing new ones.
Mining Sequence Patterns in Transactional Database
 A sequence database consists of sequences of ordered elements or
events, recorded with or without a concrete notion of time.
 There are many applications involving sequence data. Typical
examples include customer shopping sequences, Web clickstreams,
biological sequences, sequences of events in science and engineering,
and in natural and social developments.
 “What is sequential pattern mining?” Sequential pattern mining is the
mining of frequently occurring ordered events or subsequences as
patterns.
 An example of a sequential pattern is “Customers who buy a Canon
digital camera are likely to buy an HP color printer within a month.”
 Other areas in which sequential patterns can be applied include Web
access pattern analysis, weather prediction, production processes, and
network intrusion detection.
 The sequential pattern mining problem was first introduced by
Agrawal and Srikant in 1995 [AS95] based on their study of
customer purchase sequences, as follows:
 “Given a set of sequences, where each sequence consists of a list of
events (or elements) and each event consists of a set of items, and
given a user-specified minimum support threshold of min sup,
sequential pattern mining finds all frequent subsequences, that is, the
subsequences whose occurrence frequency in the set of sequences is
no less than min sup.”
 Consider a sequence database S and let min_sup = 2.
 The set of items in the database is {a, b, c, d, e, f, g}. The database
contains four sequences.
 Let’s look at sequence 1, which is {a(abc)(ac)d(cf)}.
 It has five events, namely (a),(abc), (ac), (d), and (cf), which occur in
the order listed.
 Items a and c each appear more than once in different events of the
sequence.
 There are nine instances of items in sequence 1; therefore, it has a
length of nine and is called a 9-sequence.
 Item a occurs three times in sequence 1 and so contributes three to
the length of the sequence.
 However, the entire sequence contributes only one to the support of
{a}.
 Sequence {a(bc)df} is a subsequence of sequence 1 since the events
of the former are each subsets of events in sequence 1, and the order
of events is preserved.
 Consider subsequence s = {(ab)c}.
 Looking at the sequence database, S, we see that sequences 1 and 3
are the only ones that contain the subsequence s.
 The support of s is thus 2, which satisfies minimum support.
 Therefore, s is frequent, and so we call it a sequential pattern.
 It is a 3-pattern since it is a sequential pattern of length three.
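 The support computation in this example can be sketched in a few lines of Python. Sequence 1 below is taken from the example above; the other three sequences are illustrative placeholders chosen so that the support of {(ab)c} comes out as 2, as in the text.

def contains(sequence, subsequence):
    """True if `subsequence` occurs in `sequence`: each of its events is a subset
    of a later event of the sequence, and the order of events is preserved."""
    pos = 0
    for event in sequence:
        if pos < len(subsequence) and subsequence[pos] <= event:
            pos += 1
    return pos == len(subsequence)

S = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],   # sequence 1 from the text
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],               # placeholder sequences
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]

s = [{"a", "b"}, {"c"}]                       # the subsequence {(ab)c}
support = sum(contains(seq, s) for seq in S)
print(support)                                # 2 -> frequent, since min_sup = 2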
Sequence Database and Transaction Database
 A sequence database is a set of sequences where each sequence is a
list of itemsets.
 An itemset is an unordered set of items.
 For example, the table shown below contains four sequences.
 The first sequence, named S1, contains 5 itemsets.
 It means that item 1 was followed by items 1 2 and 3 at the same
time, which were followed by 1 and 3, followed by 4, and followed
by 3 and 6.
 Note that it is assumed that no items appear twice in the same
itemset and that items in an itemset are lexically ordered.
Sequence Database to a Transaction Database
ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)
Transaction id Items
t1 {1, 2, 3, 4, 6}
t2 {1, 2, 3, 4, 5}
t3 {1, 2, 3, 4, 5, 6}
t4 {1, 2, 3, 5, 6, 7}
Transaction Database and Sequence Database
 A transaction database is a set of transactions.
Each transaction is a set of items.
 For example, consider the following transaction database. It
contains four transactions (t1, ..., t4) and five items (1, 2, 3, 4, 5).
 For example, the first transaction represents the set of items 1,
3 and 4.
Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
Transaction Database to a Sequence Database
 A sequence database is a set of sequences. Each sequence is an
ordered list of itemsets. Each itemset is an unordered set of items
(symbols) represented by positive integers.
 The output for this example is the following sequence database. It
contains four sequences. The first sequence indicates that item 1
is followed by item 3, which is followed by item 4.
Sequence id Itemsets
s1 {1},{3}, {4}
s2 {2},{3},{5}
s3 {1}, {2}, {3}, {5}
s4 {2}, {5}
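 A hedged sketch of this conversion, assuming the ordering within each output sequence is simply the ascending order of the items in the transaction, as in the tables above:

# Each transaction's items, taken in ascending order, become single-item
# itemsets of a sequence.
transactions = {
    "t1": {1, 3, 4},
    "t2": {2, 3, 5},
    "t3": {1, 2, 3, 5},
    "t4": {2, 5},
}

sequences = {f"s{i}": [{item} for item in sorted(items)]
             for i, items in enumerate(transactions.values(), start=1)}

for sid, seq in sequences.items():
    print(sid, seq)   # s1 [{1}, {3}, {4}], s2 [{2}, {3}, {5}], ...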
Social Network Analysis
 A social network is defined as a social structure of individuals, who are
related (directly or indirectly to each other) based on a common
relation of interest, e.g. friendship, trust, etc.
 Social network analysis is the study of social networks to understand
their structure and behavior.
 Social network analysis has gained prominence due to its use in
different applications - from product marketing (e.g. viral marketing)
to search engines and organizational dynamics (e.g. management).
 Recently there has been a rapid increase in interest regarding social
network analysis in the data mining community.
 The basic motivation is the demand to exploit knowledge from copious
amounts of data collected, pertaining to social behavior of users in
online environments.
 A social network is a heterogeneous and multi-relational dataset
represented by a graph. Vertices represent the objects (entities), edges
represent the links (relationships or interactions), and both objects and
links may have attributes.
 Social networks research emerged from sociology, psychology,
statistics and graph theory.
 Based on theoretical graph concepts, a social network interprets the
social relationships of individuals as points and their relationships as
the lines connecting them.
 Data mining techniques in social media:
 Graph mining
 Text mining
Graph Mining
 Graphs (or networks) constitute a prominent data structure and appear
in essentially all forms of information.
 Examples include the web graph and social networks.
 Typically, communities correspond to groups of nodes, where nodes
within the same community (or cluster) tend to be highly similar and
share common features, while nodes of different
communities show low similarity.
 Graph mining is about extracting useful knowledge (patterns, outliers, etc.) from
structured data that can be represented as a graph.
 Graph mining is used for understanding relationships as well as content.
• A phone provider, for example, can analyze phone call records using graph mining.
Text Mining
 Text mining, also known as text analysis, is the process of transforming
unstructured text data into meaningful and actionable information.
 Text mining utilizes different AI technologies to automatically process data
and generate valuable insights, enabling companies to make data-driven
decisions.
 For businesses, the large amount of data generated every day represents both
an opportunity and a challenge.
 On the one side, data helps companies get smart insights on people’s opinions
about a product or service.
 Think about all the potential ideas that you could get from analyzing emails,
product reviews, social media posts, customer feedback, support tickets, etc.
 On the other side, there’s the dilemma of how to process all this data. And
that’s where text mining plays a major role.
 The fundamental steps involved in text mining are:
 Gathering unstructured data from multiple data sources like plain
text, web pages, pdf files, emails, and blogs, to name a few.
 Detect and remove anomalies from data by conducting pre-
processing and cleansing operations.
 Data cleansing allows you to extract and retain the valuable
information hidden within the data and to help identify the roots of
specific words.
 For this, you get a number of text mining tools and text mining
applications.
 Convert all the relevant information extracted from unstructured data
into structured formats.
 Analyze the patterns within the data via the Management
Information System (MIS).
 Store all the valuable information into a secure database to drive
trend analysis and enhance the decision-making process of the
organization.
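 A minimal sketch of converting unstructured text into a structured, numeric format, assuming scikit-learn is available; the example documents are made up.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the product quality is great",
    "terrible support, product arrived broken",
    "great support and great quality",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)           # documents -> TF-IDF feature matrix
print(vec.get_feature_names_out())    # the extracted vocabulary
print(X.shape)                        # (3 documents, number of features)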
S.No.  Data Mining  |  Text Mining
1. Data mining is the statistical technique of processing raw data in a structured form.  |  Text mining is the part of data mining which involves processing of text from documents.
2. Pre-existing databases and spreadsheets are used to gather information.  |  The text is used to gather high-quality information.
3. In data mining, data is stored in a structured format.  |  In text mining, data is stored in an unstructured format.
4. Data is homogeneous and is easy to retrieve.  |  Data is heterogeneous and is not so easy to retrieve.
5. It supports mining of mixed data.  |  In text mining, only mining of text is done.
6. It combines artificial intelligence, machine learning and statistics and applies them to data.  |  It applies pattern recognition and natural language processing to unstructured data.
7. It is used in fields like marketing, medicine, and healthcare.  |  It is used in fields like bioscience and customer profile analysis.
8. It works on structured data from large datasets found in systems such as databases, spreadsheets, ERP, CRM and accounting applications.  |  It works on unstructured textual data found in emails, documents, presentations, videos, file shares, social media and the Internet.
Web Mining
 Web mining is an application of data mining techniques to find
information patterns from the web data.
 Web mining helps to improve the power of web search engine by
identifying the web pages and classifying the web documents.
 The main purpose of web mining is discovering useful information
from the World-Wide Web and its usage patterns.
 Web mining is very useful to e-commerce websites and e-services.
 Applications of Web Mining:
 Web mining helps to improve the power of web search engine by
classifying the web documents and identifying the web pages.
 It is used for Web Searching e.g., Google, Yahoo etc.
 Web mining is used to predict user behavior.
 Web mining is very useful for a particular website and e-service, e.g.,
landing page optimization.
S.No.  Data Mining  |  Web Mining
1. Data Mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system.  |  Web Mining is the process of applying data mining techniques to automatically discover and extract information from web documents.
2. Data Mining is very useful for web page analysis.  |  Web Mining is very useful for a particular website and e-services.
3. Carried out by data scientists and data engineers.  |  Carried out by data scientists along with data analysts.
4. Data Mining accesses data privately.  |  Web Mining accesses data publicly.
5. Techniques include clustering, classification, regression, prediction, optimization and control.  |  Techniques include web content mining and web structure mining.
6. It includes tools like machine learning algorithms.  |  Special tools for web mining are Scrapy, PageRank and Apache logs.
Multirelational Data Mining
 The multi-relational data mining (MRDM) approach has developed as an
alternative way of handling structured data such as that stored in an RDBMS.
 It provides mining over multiple tables directly.
 In MRDM the patterns span multiple tables (relations) of
a relational database.
 Because the data are spread over many tables, this causes
many problems in the practice of data mining.
 To deal with this problem, one either constructs a single table by
Propositionalisation, or uses a Multi-Relational Data Mining
algorithm.
 RDM approaches have been successfully applied in the area of
bioinformatics.
 Three popular pattern finding techniques classification, clustering and
association are frequently used in MRDM.
 The multi-relational approach has developed as an alternative for analyzing
structured data such as relational databases.
 MRDM allows applying data mining directly to multiple tables.
The MRDM technique is used to avoid expensive join operations and
semantic losses.
 An important aspect of data mining algorithms and systems is that they
should scale well to large databases.
 A consequence of this is that most data mining tools are based on
machine learning algorithms that work on data in attribute-value format.
 Experience has proven that such 'single-table' mining algorithms indeed
scale well.
Unit – 4
Assignment Questions (attempt any 5), Marks: 20
 Q.1 What is Cluster analysis? Discuss k-means Algorithm with suitable examples?
 Q.2 Write a short note on following:
a) Unsupervised Learning b) Web Mining
c) Text Mining d) Social Network Analysis
 Q.3 List the differences between clustering and classification. Briefly describe
the hierarchical clustering method.
 Q.4 Suppose we have the following points: (1,1), (2,4), (3,4), (5,8), (6,2), (7,8).
Use k - means algorithm (k = 2) to find two cluster. The distance function is
Euclidean distance.
 Q.5 Define clustering. What are the requirements for cluster analysis?
 Q.6 Explain DBSCAN Algorithm with suitable example.
 Q.7 Describe data stream mining. How is time-series data mined?
Thank You
Great God, Medi-Caps, All the attendees
Mr. Sagar Pandya
sagar.pandya@medicaps.ac.in
www.sagarpandya.tk
LinkedIn: /in/seapandya
Twitter: @seapandya
Facebook: /seapandya
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SETTWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
IJDKP
 
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SETTWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
IJDKP
 
Ad

More from Medicaps University (14)

data mining and warehousing computer science
data mining and warehousing computer sciencedata mining and warehousing computer science
data mining and warehousing computer science
Medicaps University
 
Unit - 5 Pipelining.pptx
Unit - 5 Pipelining.pptxUnit - 5 Pipelining.pptx
Unit - 5 Pipelining.pptx
Medicaps University
 
Unit-4 (IO Interface).pptx
Unit-4 (IO Interface).pptxUnit-4 (IO Interface).pptx
Unit-4 (IO Interface).pptx
Medicaps University
 
UNIT-3 Complete PPT.pptx
UNIT-3 Complete PPT.pptxUNIT-3 Complete PPT.pptx
UNIT-3 Complete PPT.pptx
Medicaps University
 
UNIT-2.pptx
UNIT-2.pptxUNIT-2.pptx
UNIT-2.pptx
Medicaps University
 
UNIT-1 CSA.pptx
UNIT-1 CSA.pptxUNIT-1 CSA.pptx
UNIT-1 CSA.pptx
Medicaps University
 
Scheduling
SchedulingScheduling
Scheduling
Medicaps University
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File Systems
Medicaps University
 
Clock synchronization
Clock synchronizationClock synchronization
Clock synchronization
Medicaps University
 
Distributed Objects and Remote Invocation
Distributed Objects and Remote InvocationDistributed Objects and Remote Invocation
Distributed Objects and Remote Invocation
Medicaps University
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
Medicaps University
 
Association and Classification Algorithm
Association and Classification AlgorithmAssociation and Classification Algorithm
Association and Classification Algorithm
Medicaps University
 
Data Mining
Data MiningData Mining
Data Mining
Medicaps University
 
Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema,...
Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema,...Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema,...
Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema,...
Medicaps University
 
Ad

Recently uploaded (20)

Gas Power Plant for Power Generation System
Gas Power Plant for Power Generation SystemGas Power Plant for Power Generation System
Gas Power Plant for Power Generation System
JourneyWithMe1
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
Elevate Your Workflow
Elevate Your WorkflowElevate Your Workflow
Elevate Your Workflow
NickHuld
 
Crack the Domain with Event Storming By Vivek
Crack the Domain with Event Storming By VivekCrack the Domain with Event Storming By Vivek
Crack the Domain with Event Storming By Vivek
Vivek Srivastava
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
BTech_CSE_LPU_Presentation.pptx.........
BTech_CSE_LPU_Presentation.pptx.........BTech_CSE_LPU_Presentation.pptx.........
BTech_CSE_LPU_Presentation.pptx.........
jinny kaur
 
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution ControlDust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Janapriya Roy
 
Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.
anuragmk56
 
comparison of motors.pptx 1. Motor Terminology.ppt
comparison of motors.pptx 1. Motor Terminology.pptcomparison of motors.pptx 1. Motor Terminology.ppt
comparison of motors.pptx 1. Motor Terminology.ppt
yadavmrr7
 
vlsi digital circuits full power point presentation
vlsi digital circuits full power point presentationvlsi digital circuits full power point presentation
vlsi digital circuits full power point presentation
DrSunitaPatilUgaleKK
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Basic Principles for Electronics Students
Basic Principles for Electronics StudentsBasic Principles for Electronics Students
Basic Principles for Electronics Students
cbdbizdev04
 
Mirada a 12 proyectos desarrollados con BIM.pdf
Mirada a 12 proyectos desarrollados con BIM.pdfMirada a 12 proyectos desarrollados con BIM.pdf
Mirada a 12 proyectos desarrollados con BIM.pdf
topitodosmasdos
 
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Journal of Soft Computing in Civil Engineering
 
Engineering Chemistry First Year Fullerenes
Engineering Chemistry First Year FullerenesEngineering Chemistry First Year Fullerenes
Engineering Chemistry First Year Fullerenes
5g2jpd9sp4
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
Unit III.pptx IT3401 web essentials presentatio
Unit III.pptx IT3401 web essentials presentatioUnit III.pptx IT3401 web essentials presentatio
Unit III.pptx IT3401 web essentials presentatio
lakshitakumar291
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
Gas Power Plant for Power Generation System
Gas Power Plant for Power Generation SystemGas Power Plant for Power Generation System
Gas Power Plant for Power Generation System
JourneyWithMe1
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
Elevate Your Workflow
Elevate Your WorkflowElevate Your Workflow
Elevate Your Workflow
NickHuld
 
Crack the Domain with Event Storming By Vivek
Crack the Domain with Event Storming By VivekCrack the Domain with Event Storming By Vivek
Crack the Domain with Event Storming By Vivek
Vivek Srivastava
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
BTech_CSE_LPU_Presentation.pptx.........
BTech_CSE_LPU_Presentation.pptx.........BTech_CSE_LPU_Presentation.pptx.........
BTech_CSE_LPU_Presentation.pptx.........
jinny kaur
 
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution ControlDust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Janapriya Roy
 
Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.
anuragmk56
 
comparison of motors.pptx 1. Motor Terminology.ppt
comparison of motors.pptx 1. Motor Terminology.pptcomparison of motors.pptx 1. Motor Terminology.ppt
comparison of motors.pptx 1. Motor Terminology.ppt
yadavmrr7
 
vlsi digital circuits full power point presentation
vlsi digital circuits full power point presentationvlsi digital circuits full power point presentation
vlsi digital circuits full power point presentation
DrSunitaPatilUgaleKK
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Basic Principles for Electronics Students
Basic Principles for Electronics StudentsBasic Principles for Electronics Students
Basic Principles for Electronics Students
cbdbizdev04
 
Mirada a 12 proyectos desarrollados con BIM.pdf
Mirada a 12 proyectos desarrollados con BIM.pdfMirada a 12 proyectos desarrollados con BIM.pdf
Mirada a 12 proyectos desarrollados con BIM.pdf
topitodosmasdos
 
Engineering Chemistry First Year Fullerenes
Engineering Chemistry First Year FullerenesEngineering Chemistry First Year Fullerenes
Engineering Chemistry First Year Fullerenes
5g2jpd9sp4
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
Unit III.pptx IT3401 web essentials presentatio
Unit III.pptx IT3401 web essentials presentatioUnit III.pptx IT3401 web essentials presentatio
Unit III.pptx IT3401 web essentials presentatio
lakshitakumar291
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 

Clustering - K-Means, DBSCAN

  • 9. Clustering • Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing. • Clustering can also help marketers discover distinct groups in their customer base. And they can characterize their customer groups based on the purchasing patterns. • In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities and gain insight into structures inherent to populations. • Clustering also helps in identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location. Mr. Sagar Pandya [email protected]
  • 10. Clustering • Clustering also helps in classifying documents on the web for information discovery. • Clustering is also used in outlier detection applications such as detection of credit card fraud.  Clustering Methods  Clustering methods can be classified into the following categories − • Partitioning Method • Hierarchical Method • Density-based Method • Grid-Based Method • Model-Based Method • Constraint-based Method Mr. Sagar Pandya [email protected]
  • 11. Clustering • A good clustering method should meet the following requirements: • The ability to discover some or all of the hidden clusters. • High within-cluster similarity and high between-cluster dissimilarity. • The ability to deal with various types of attributes. • The ability to deal with noise and outliers. • The ability to handle high dimensionality. • Scalability, interpretability and usability. • An important issue in clustering is how to determine the similarity between two objects, so that clusters can be formed from objects with high similarity within clusters and low similarity between clusters.
  • 12. Clustering • Commonly, to measure similarity or dissimilarity between objects, a distance measure such as Euclidean, Manhattan and Minkowski is used. • A distance function returns a lower value for pairs of objects that are more similar to one another. Mr. Sagar Pandya [email protected]
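To make these distance measures concrete, here is a minimal sketch in Python (assuming numpy is available); the Minkowski distance generalizes the other two, with p = 1 giving Manhattan and p = 2 giving Euclidean. The data values are illustrative only.

import numpy as np

def minkowski(x, y, p):
    # Minkowski distance; p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 1))   # Manhattan distance: 5.0
print(minkowski(x, y, 2))   # Euclidean distance: sqrt(13) ~ 3.606
print(minkowski(x, y, 3))   # Minkowski distance with p = 3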
  • 13. Methods of Clustering in Data Mining Mr. Sagar Pandya [email protected]
  • 14. K-Means Algorithm  The Concept  Imagine you’re opening a small book store.  You have a stack of different books, and 3 bookshelves.  Your goal is place similar books in one shelf.  What you would do, is pick up 3 books, one for each shelf in order to set a theme for every shelf.  These books will now dictate which of the remaining books will go in which shelf. Mr. Sagar Pandya [email protected]
  • 15. K-Means Algorithm • Every time you pick a new book up from the stack, you would compare it with those first 3 books, and place this new book on the shelf that has similar books. • You would repeat this process until all the books have been placed.  Once you’re done, you might notice that changing the number of bookshelves, and picking up different initial books for those shelves (changing the theme for each shelf) would increase how well you’ve grouped the books.  So, you repeat the process in hopes of a better outcome. Mr. Sagar Pandya [email protected]
  • 16. K-Means Algorithm  The Algorithm  K-means clustering is a good place to start exploring an unlabeled dataset. The K in K-Means denotes the number of clusters.  This algorithm is bound to converge to a solution after some iterations.  It has 4 basic steps: 1. Initialize cluster centroids (choose those 3 books to start with) 2. Assign data points to clusters (place the remaining books one by one) 3. Update cluster centroids (start over with 3 different books) 4. Repeat steps 2–3 until the stopping condition is met.  You don’t have to start with 3 clusters initially, but 2–3 is generally a good place to start, and you can update the choice later on.
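The four steps above can be sketched in a few lines of Python with numpy. This is an illustrative implementation only (the sample data, the choice of k and the stopping test are assumptions, and empty clusters are not handled):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid as the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)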
  • 36. K Medoids Algorithm  The k-means method is based on centroid techniques to represent a cluster, and it is sensitive to outliers.  This means a data object with an extremely large value may disrupt the distribution of the data.  To overcome this problem, the k-medoids method is used, which is based on representative object techniques.  The centroid is replaced with a medoid to represent the cluster.  A medoid is the most centrally located data object in a cluster.  Here, k data objects are selected randomly as medoids to represent k clusters, and all remaining data objects are placed in the cluster whose medoid is nearest (or most similar) to that data object.
  • 37. K Medoids Algorithm  After processing all data objects, a new medoid is determined for each cluster which can represent the cluster in a better way, and the entire process is repeated.  Again, all data objects are bound to the clusters based on the new medoids.  In each iteration, the medoids change their location step by step.  This process continues until no medoid moves.  As a result, k clusters are found representing a set of n data objects.  The most common k-medoids clustering method is the PAM algorithm (Partitioning Around Medoids).
  • 38. K Medoids Algorithm  The K-Medoids algorithm (also called Partitioning Around Medoids) was proposed in 1987 by Kaufman and Rousseeuw.  A medoid can be defined as the point in the cluster whose total dissimilarity with all the other points in the cluster is minimal.  1. Initialize: select k random points out of the n data points as the medoids.  2. Associate each data point with the closest medoid using any common distance metric.  3. While the cost decreases: for each medoid m and for each data point o which is not a medoid: (a) swap m and o, associate each data point with the closest medoid, and recompute the cost; (b) if the total cost is more than that of the previous step, undo the swap.
  • 39. K Medoids Algorithm  PAM concept: The use of means implies that k-means clustering is highly sensitive to outliers.  This can severely affect the assignment of observations to clusters.  A more robust algorithm is provided by the PAM algorithm.  PAM algorithm: The PAM algorithm is based on the search for k representative objects or medoids among the observations of the data set.  After finding a set of k medoids, clusters are constructed by assigning each observation to the nearest medoid.  Next, each selected medoid m and each non-medoid data point are swapped and the objective function is computed.  The objective function corresponds to the sum of the dissimilarities of all objects to their nearest medoid.
  • 40. K Medoids Algorithm  In summary, the PAM algorithm proceeds in two phases as follows:  Build phase: 1. Select k objects to become the medoids, or, in case these objects were provided, use them as the medoids; 2. Calculate the dissimilarity matrix if it was not provided; 3. Assign every object to its closest medoid;  Swap phase:  4. For each cluster, search whether any object of the cluster decreases the average dissimilarity coefficient; if it does, select the object that decreases this coefficient the most as the medoid for this cluster;  5. If at least one medoid has changed, go to (3); else end the algorithm.
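A simplified sketch of this build-and-swap idea in Python with numpy is shown below. It is not the exact PAM procedure from the slides; it precomputes a full distance matrix and greedily accepts any cost-reducing swap, which is enough to illustrate the principle:

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise distance matrix between all objects
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), k, replace=False)

    def total_cost(meds):
        # Every object contributes its distance to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o                    # try swapping medoid i with non-medoid o
                new_cost = total_cost(candidate)
                if new_cost < cost:                 # keep the swap only if the total cost decreases
                    medoids, cost, improved = candidate, new_cost, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)           # final assignment to the nearest medoid
    return medoids, labels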
  • 48. K Medoids Algorithm  Advantages: 1. It is simple to understand and easy to implement. 2. The K-Medoids algorithm is fast and converges in a fixed number of steps. 3. PAM is less sensitive to outliers than other partitioning algorithms.  Disadvantages: 1. The main disadvantage of the K-Medoids algorithm is that it is not suitable for clustering non-spherical (arbitrarily shaped) groups of objects. This is because it relies on minimizing the distances between the non-medoid objects and the medoid (the cluster centre) – briefly, it uses compactness as the clustering criterion instead of connectivity. 2. It may obtain different results for different runs on the same dataset because the first k medoids are chosen randomly.
  • 49. Hierarchical Clustering Algorithm  Hierarchical Clustering Algorithm also called Hierarchical cluster analysis or HCA is an unsupervised clustering algorithm which involves creating clusters that have predominant ordering from top to bottom.  For e.g: All files and folders on our hard disk are organized in a hierarchy.  The algorithm groups similar objects into groups called clusters. The endpoint is a set of clusters or groups, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.  This clustering technique is divided into two types: 1. Agglomerative Hierarchical Clustering 2. Divisive Hierarchical Clustering , Medi-Caps University, Indore
  • 50. Hierarchical Clustering Algorithm  Agglomerative Hierarchical Clustering: The Agglomerative Hierarchical Clustering is the most common type of hierarchical clustering used to group objects in clusters based on their similarity.  It’s also known as AGNES (Agglomerative Nesting).  It's a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.  How does it work? 1. Make each data point a single-point cluster → forms N clusters 2. Take the two closest data points and make them one cluster → forms N-1 clusters , Medi-Caps University, Indore
  • 51. Hierarchical Clustering Algorithm 3. Take the two closest clusters and make them one cluster → Forms N-2 clusters. 4. Repeat step-3 until you are left with only one cluster.  What is a Dendrogram?  A Dendrogram is a type of tree diagram showing hierarchical relationships between different sets of data.  As already said a Dendrogram contains the memory of hierarchical clustering algorithm, so just by looking at the Dendrogram you can tell how the cluster is formed.  Have a look at the visual representation of Agglomerative Hierarchical Clustering for better understanding:  The point of doing all this is to demonstrate the way hierarchical clustering works, it maintains a memory of how we went through this process and that memory is stored in Dendrogram. , Medi-Caps University, Indore
  • 54. Hierarchical Clustering Algorithm  Note: 1. The distance between data points represents their dissimilarity. 2. The height of the blocks represents the distance between clusters.  So you can observe from the above figure that initially P5 and P6, which are closer to each other than to any other point, are combined into one cluster, followed by P4 getting merged into the same cluster (C2).  Then P1 and P2 get combined into one cluster, followed by P0 getting merged into the same cluster (C4).  Finally, P3 gets merged into cluster C2 and both clusters get merged into one.
  • 55. Hierarchical Clustering Algorithm , Medi-Caps University, Indore  There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. Some of the common linkage methods are: • Single-linkage: the distance between two clusters is defined as the shortest distance between two points in each cluster. This linkage may be used to detect high values in your dataset which may be outliers as they will be merged at the end.
  • 56. Hierarchical Clustering Algorithm • Complete-linkage: the distance between two clusters is defined as the longest distance between two points in each cluster. For example, the distance between clusters “r” and “s” in the figure is equal to the length of the arrow between their two furthest points.
  • 57. Hierarchical Clustering Algorithm , Medi-Caps University, Indore • Average-linkage: the distance between two clusters is defined as the average distance between each point in one cluster to every point in the other cluster. • Centroid-linkage: finds the centroid of cluster 1 and centroid of cluster 2, and then calculates the distance between the two before merging.
  • 58. Hierarchical Clustering Algorithm  Parts of a Dendrogram:  A dendrogram can be a column graph or a row graph.  Some dendrograms are circular or have a fluid shape, but the software will usually produce a row or column graph.  No matter what the shape, the basic graph comprises the same parts: • The clades are the branches and are arranged according to how similar (or dissimilar) they are. • Clades that are close to the same height are similar to each other; clades with different heights are dissimilar — the greater the difference in height, the more dissimilarity.
  • 60. Hierarchical Clustering Algorithm , Medi-Caps University, Indore • Each clade has one or more leaves. • Leaves A, B, and C are more similar to each other than they are to leaves D, E, or F. • Leaves D and E are more similar to each other than they are to leaves A, B, C, or F. • Leaf F is substantially different from all of the other leaves.  A clade can theoretically have an infinite amount of leaves. However, the more leaves you have, the harder the graph will be to read with the naked eye.  One question that might have intrigued you by now is how do you decide when to stop merging the clusters?  You cut the dendrogram tree with a horizontal line at a height where the line can traverse the maximum distance up and down without intersecting the merging point.
  • 61. Hierarchical Clustering Algorithm , Medi-Caps University, Indore  For example in the below figure L3 can traverse maximum distance up and down without intersecting the merging points. So we draw a horizontal line and the number of vertical lines it intersects is the optimal number of clusters.  Number of Clusters in this case = 3.
  • 62. Hierarchical Clustering Algorithm  Let’s see the graphical representation of this algorithm using a dendrogram.  Note: This is just a demonstration of how the actual algorithm works; no calculations have been performed below, and all the proximities among the clusters are assumed.  Let’s say we have six data points A, B, C, D, E, F.
  • 63. Hierarchical Clustering Algorithm , Medi-Caps University, Indore • Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all the other clusters. • Step-2: In the second step comparable clusters are merged together to form a single cluster. Let’s say cluster (B) and cluster (C) are very similar to each other therefore we merge them in the second step similarly with cluster (D) and (E) and at last, we get the clusters [(A), (BC), (DE), (F)] • Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)] • Step-4: Repeating the same process; The clusters DEF and BC are comparable and merged together to form a new cluster. We’re now left with clusters [(A), (BCDEF)]. • Step-5: At last the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
  • 64. Hierarchical Clustering Algorithm  Divisive Hierarchical Clustering  Divisive clustering, or DIANA (DIvisive ANAlysis Clustering), is a top-down clustering method where we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters.  Finally, we proceed recursively on each cluster until there is one cluster for each observation. So this clustering approach is exactly the opposite of agglomerative clustering.  There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but they are conceptually more complex.  In both agglomerative and divisive hierarchical clustering, users need to specify the desired number of clusters as a termination condition (when to stop merging or splitting).
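As a usage illustration of agglomerative clustering, dendrograms and cluster extraction, here is a small example with SciPy's hierarchical-clustering routines (scipy.cluster.hierarchy). The data and the choice of complete linkage are purely for demonstration:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# Build the merge tree bottom-up using complete linkage
Z = linkage(X, method="complete")

# Cut the dendrogram so that 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)          # e.g. [1 1 1 2 2 2]

# dendrogram(Z) draws the tree (requires matplotlib)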
  • 66. DBSCAN Clustering Algorithm , Medi-Caps University, Indore  Clustering analysis is an unsupervised learning method that separates the data points into several specific bunches or groups, such that the data points in the same groups have similar properties and data points in different groups have different properties in some sense.  Centrally, all clustering methods use the same approach i.e. first we calculate similarities and then we use it to cluster the data points into groups or batches.  DBSCAN is well known as Density-based spatial clustering of applications with noise clustering method.  It was proposed by Martin Ester et al. in 1996. DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density.
  • 67. DBSCAN Clustering Algorithm  It can discover clusters of different shapes and sizes from a large amount of data that contains noise and outliers.  K-Means and hierarchical clustering both fail at creating clusters of arbitrary shapes; they are not able to form clusters based on varying densities. That’s why we need DBSCAN clustering. • minPts: the minimum number of points (a threshold) clustered together for a region to be considered dense. • eps (ε): a distance measure that is used to locate the points in the neighborhood of any point. • Core — a point that has at least minPts points within distance eps from itself. • Border — a point that has at least one core point within distance eps.
  • 69. DBSCAN Clustering Algorithm • Noise — a point that is neither a core point nor a border point, i.e. it has fewer than minPts points within distance eps from itself.  These parameters can be understood if we explore two concepts called density reachability and density connectivity.  Reachability in terms of density establishes a point to be reachable from another if it lies within a particular distance (eps) from it.  Connectivity, on the other hand, involves a transitivity-based chaining approach to determine whether points are located in a particular cluster.  For example, points p and q could be connected if p->r->s->t->q, where a->b means b is in the neighborhood of a.  In 2014, the algorithm was awarded the ‘Test of Time’ award at the leading data mining conference, KDD.
  • 70. DBSCAN Clustering Algorithm , Medi-Caps University, Indore  A point X is directly density-reachable from point Y w.r.t epsilon, minPoints if, 1. X belongs to the neighborhood of Y, i.e, dist(X, Y) <= epsilon 2. Y is a core point • Here, X is directly density-reachable from Y, but vice versa is not valid.
  • 71. DBSCAN Clustering Algorithm , Medi-Caps University, Indore  Here, X is density-reachable from Y with X being directly density- reachable from P2, P2 from P3, and P3 from Y. But, the inverse of this is not valid.
  • 72. DBSCAN Clustering Algorithm  The DBSCAN algorithm can be abstracted in the following steps – 1. Find all the neighboring points within eps of every point and identify the core points, i.e. the points with at least MinPts neighbors. 2. For each core point, if it is not already assigned to a cluster, create a new cluster. 3. Recursively find all its density-connected points and assign them to the same cluster as the core point. Two points a and b are said to be density connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density connected to a. 4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.
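For a quick usage sketch, scikit-learn provides a DBSCAN implementation (sklearn.cluster.DBSCAN). The eps and min_samples values below are illustrative and would normally be tuned to the data:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([np.random.randn(100, 2),                # one dense blob
               np.random.randn(100, 2) + 8,            # a second dense blob
               np.random.uniform(-10, 18, (20, 2))])   # scattered noise points

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_              # cluster index per point, -1 means noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", list(labels).count(-1))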
  • 74. Grid-Based Clustering Algorithms  In this, the objects together form a grid.  The object space is quantized into finite number of cells that form a grid structure. Basic Grid-based Algorithm 1. Define a set of grid-cells 2. Assign objects to the appropriate grid cell and compute the density of each cell. 3. Eliminate cells, whose density is below a certain threshold t. 4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function) , Medi-Caps University, Indore
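A minimal sketch of this basic grid-based procedure for 2-D data, using numpy and scipy.ndimage, is given below; the grid resolution and the density threshold are illustrative assumptions:

import numpy as np
from scipy import ndimage

def grid_cluster(X, n_cells=20, threshold=3):
    # 1. Quantize the 2-D object space into a grid and count points per cell
    hist, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=n_cells)
    # 2.-3. Keep only the dense cells (density at or above the threshold)
    dense = hist >= threshold
    # 4. Form clusters from contiguous groups of dense cells
    cell_labels, n_clusters = ndimage.label(dense)
    # Map every point back to the label of its grid cell (0 means no cluster)
    ix = np.clip(np.digitize(X[:, 0], xedges) - 1, 0, n_cells - 1)
    iy = np.clip(np.digitize(X[:, 1], yedges) - 1, 0, n_cells - 1)
    return cell_labels[ix, iy], n_clusters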
  • 75. Grid-Based Clustering Algorithms  Advantages • The major advantage of this method is fast processing time. • It is dependent only on the number of cells in each dimension in the quantized space.  Several interesting methods (in addition to the basic grid-based algorithm)  STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)  CLIQUE: Agrawal, et al. (SIGMOD’98) , Medi-Caps University, Indore
  • 76. Model-Based Clustering Algorithms  In this method, a model is hypothesized for each cluster to find the best fit of data for a given model. This method locates the clusters by clustering the density function. It reflects spatial distribution of the data points.  This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outlier or noise into account. It therefore yields robust clustering methods.  In model-based clustering, the data is considered as coming from a mixture of density.  Each component (i.e. cluster) k is modeled by the normal or Gaussian distribution which is characterized by the parameters:  μk: mean vector,  ∑k: covariance matrix,  An associated probability in the mixture. Each point has a probability of belonging to each cluster. , Medi-Caps University, Indore
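A short usage example of model-based clustering with a Gaussian mixture, using scikit-learn's GaussianMixture (the data and the number of components are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 6])

gm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
labels = gm.predict(X)        # hard assignment to the most likely component
probs = gm.predict_proba(X)   # soft assignment: probability of belonging to each cluster
print(gm.means_)              # mu_k, the mean vector of each component
print(gm.covariances_.shape)  # Sigma_k, one covariance matrix per component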
  • 77. MINING DATA STREAM  Large amount of data streams every day.  Efficient knowledge discovery of such data streams is an emerging active research area in data mining with broad applications.  Data Stream Mining (also known as stream learning) is the process of extracting knowledge structures from continuous, rapid data records.  Data streams typically arrive continuously in high speed with huge amount and changing data distribution.  Data mining techniques which require multiple scans of the entire data sets can not be applied directly to mine stream data, which usually allows only one scan and demands fast response time , Medi-Caps University, Indore
  • 78. MINING DATA STREAM  Imagine a factory with 500 sensors capturing 10 KB of information every second; in one hour nearly 18 GB of information is captured, and about 432 GB daily.  This massive amount of information needs to be analyzed in real time (or in the shortest time possible) to detect irregularities or deviations in the system and react quickly.  Stream mining makes it possible to analyze large amounts of data in real time.  Data stream mining is the process of extracting knowledge from continuous, rapid data records which come to the system in a stream.  A data stream is an ordered sequence of instances in time.  Data stream mining is a process of mining continuously incoming real-time streaming data with acceptable performance.
  • 79. MINING DATA STREAM  Data Stream Mining fulfil the following characteristics:  Continuous Stream of Data: High amount of data in an infinite stream. we do not know the entire dataset  Concept Drifting: The data change or evolves over time  Volatility of data: The system does not store the data received (Limited resources). When data is analyzed it’s discarded or summarized. , Medi-Caps University, Indore
  • 80. MINING DATA STREAM  A data stream is a high-speed, continuous flow of data from diverse sources.  The sources might include remote sensors, scientific processes, stock markets, online transactions, tweets, internet traffic, video surveillance systems, etc.  Generally these streams arrive at high speed with a huge volume of data generated by real-time applications.  Data streams have unique characteristics when compared with traditional datasets: they are potentially infinite, massive, continuous, temporally ordered and fast changing.
  • 81. MINING DATA STREAM  Storing such streams and processing them later is not viable, as that needs a lot of storage and processing power.  For this reason they have to be processed in real time in order to discover knowledge from them, instead of storing and then processing them as in traditional data mining.  The data stream mining procedure includes selecting a part of the stream data, preprocessing, incremental learning and extraction of knowledge in a single pass.  The result of data stream mining is knowledge that can help in taking intelligent decisions.  Thus the processing of data streams poses challenges in terms of the memory and processing power of systems. The general procedure for processing streaming data is presented in the figure.
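To illustrate the single-pass, limited-memory constraint, the following toy sketch (plain Python) keeps only constant-size summary statistics per stream and flags readings that deviate strongly from recent behaviour; the smoothing factor and alert threshold are arbitrary illustrative choices, not values from the slides:

def stream_monitor(stream, alpha=0.05, z_alert=4.0):
    """Single pass over a numeric stream: exponentially weighted mean and variance,
    flagging readings that deviate strongly from recent behaviour."""
    mean, var, alerts = None, 0.0, []
    for t, x in enumerate(stream):
        if mean is None:
            mean = x
            continue
        if var > 0 and abs(x - mean) > z_alert * var ** 0.5:
            alerts.append((t, x))                 # possible irregularity in the stream
        diff = x - mean
        mean += alpha * diff                      # incremental update, O(1) memory
        var = (1 - alpha) * (var + alpha * diff * diff)
    return alerts

readings = [10, 11, 9, 10, 12, 10, 55, 11, 10]    # 55 simulates a sensor spike
print(stream_monitor(readings))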
  • 83. MINING DATA STREAM , Medi-Caps University, Indore What are the Applications?  Telecommunication calling records  Business credit card transaction flows  Network monitoring and traffic engineering  Financial market: stock exchange  Engineering & industrial processes: power supply & manufacturing  Sensor, monitoring & surveillance: video streams, RFIDs  Security monitoring  Web logs and Web page click streams  Massive data sets (even saved but random access is too expensive)
  • 84. MINING DATA STREAM , Medi-Caps University, Indore Software for data stream mining:  MOA (Massive Online Analysis): free open-source software specific for mining data streams with concept drift. It has several machine learning algorithms (classification, regression, clustering, outlier detection and recommender systems).  RapidMiner: commercial software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time-varying concepts, and tracking drifting concept (if used in combination with its data stream mining plugin (formerly: Concept Drift plugin)).  StreamDM: StreamDM is an open source framework for big data stream mining that uses the Spark Streaming.
  • 85. MINING DATA STREAM  When the underlying data is very large in volume, high-speed and continuously flowing, it leads to a number of computational and mining challenges, listed below.  (1) Data contained in data streams is fast changing, high-speed and real-time.  (2) Multiple or random access to data streams is expensive, and in most cases practically impossible.  (3) A huge volume of data has to be processed in limited memory.  (4) A data stream mining system must process high-speed and gigantic data within time limitations.  (5) The data arrive in multidimensional and low-level form, so techniques to mine such data need to be very sophisticated.  (6) Data stream elements change rapidly over time. Thus, data from the past may become irrelevant for the mining.
  • 87. MINING TIME SERIES DATA  A time series is a sequence of data points recorded at specific time points – most often at regular time intervals (seconds, hours, days, months, etc.).  Every organization generates a high volume of data every single day – be it sales figures, revenue, traffic, or operating cost.  Time series data mining can generate valuable information for long-term business decisions, yet it is underutilized in most organizations.  Stock market analysis, economic and sales forecasting, scientific and engineering experiments, medical treatments, etc. can also be treated as sequence data: a sequence database consists of sequences of ordered events (with or without an explicit notion of time), such as web page traversal sequences.  Time-series data can be analyzed to identify correlations, similar or regular patterns, trends and outliers.
  • 88. MINING TIME SERIES DATA , Medi-Caps University, Indore  Industries in all sectors generate and use time series data to make important business decisions.  Using the past data, grocery chain wants to know which time of the year peaks market demands for a particular product; call centers need to forecast future call volumes so they can maintain adequate staffing; credit card companies lookout for fraudulent transactions — all these business decisions benefit from the use of times series data.  Time series data points are snapshots of the past.  Understanding historical events, patterns and trends are some basic indicators that all businesses track.  They want to understand how good they had performed in the past and where they are headed into the future.
  • 89. MINING TIME SERIES DATA , Medi-Caps University, Indore  Basic understanding of historical events through time series data doesn’t require fancy modeling, just plotting data against time can generate very powerful insights.  In the old days, spreadsheets were good enough to create powerful visual stories and insights.  Nowadays most statistical and data analysis tools (e.g. Python, Tableau, PowerBI) can handle time-series data pretty well for creating time series charts, dashboards etc.  Time series data provides a wealth of analytics and application possibilities in all domains of applications.  Historical analysis, forecasting, anomaly detection, and predictive analytics are just a few of those possibilities.
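A small example of this kind of basic exploration with pandas is shown below; the sales numbers, the rolling-window size and the z-score cutoff are made up for illustration:

import pandas as pd

# Daily sales indexed by date (illustrative values)
idx = pd.date_range("2023-01-01", periods=10, freq="D")
sales = pd.Series([100, 102, 98, 105, 300, 104, 99, 101, 97, 103], index=idx)

trend = sales.rolling(window=3).mean()       # simple moving average (trend)
zscore = (sales - sales.mean()) / sales.std()
outliers = sales[zscore.abs() > 2]           # crude outlier flag

print(trend.tail())
print(outliers)                              # the spike of 300 stands out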
  • 90. MINING TIME SERIES DATA , Medi-Caps University, Indore  New analytical frontiers are also emerging with the development of new tools and techniques.  Artificial neural networks (e.g. LSTM) and econometrics are such cutting-edge frontiers in time series data analytics.  Experienced and aspiring data scientists alike can make tremendous contributions to their domains by taking advantage of these tools, or maybe by developing new ones.
  • 91. Mining Sequence Patterns in Transactional Database , Medi-Caps University, Indore  A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time.  There are many applications involving sequence data. Typical examples include customer shopping sequences, Web clickstreams, biological sequences, sequences of events in science and engineering, and in natural and social developments.  “What is sequential pattern mining?” Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns.  An example of a sequential pattern is “Customers who buy a Canon digital camera are likely to buy an HP color printer within a month.”
  • 92. Mining Sequence Patterns in Transactional Database , Medi-Caps University, Indore  Other areas in which sequential patterns can be applied include Web access pattern analysis, weather prediction, production processes, and network intrusion detection.  The sequential pattern mining problem was first introduced by Agrawal and Srikant in 1995 [AS95] based on their study of customer purchase sequences, as follows:  “Given a set of sequences, where each sequence consists of a list of events (or elements) and each event consists of a set of items, and given a user-specified minimum support threshold of min sup, sequential pattern mining finds all frequent subsequences, that is, the subsequences whose occurrence frequency in the set of sequences is no less than min sup.”
  • 93. Mining Sequence Patterns in Transactional Database  Consider the sequence database S given in the table, and let min_sup = 2.  The set of items in the database is {a, b, c, d, e, f, g}. The database contains four sequences.  Let’s look at sequence 1, which is {a(abc)(ac)d(cf)}.  It has five events, namely (a), (abc), (ac), (d), and (cf), which occur in the order listed.  Items a and c each appear more than once in different events of the sequence.  There are nine instances of items in sequence 1; therefore, it has a length of nine and is called a 9-sequence.  Item a occurs three times in sequence 1 and so contributes three to the length of the sequence.
  • 94. Mining Sequence Patterns in Transactional Database , Medi-Caps University, Indore  However, the entire sequence contributes only one to the support of {a}.  Sequence {a(bc)df} is a subsequence of sequence 1 since the events of the former are each subsets of events in sequence 1, and the order of events is preserved.  Consider subsequence s = {(ab)c}.  Looking at the sequence database, S, we see that sequences 1 and 3 are the only ones that contain the subsequence s.  The support of s is thus 2, which satisfies minimum support.
  • 95. Mining Sequence Patterns in Transactional Database , Medi-Caps University, Indore  Therefore, s is frequent, and so we call it a sequential pattern.  It is a 3-pattern since it is a sequential pattern of length three.
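The support computation in this example can be expressed directly in code. The sketch below (plain Python) tests whether a candidate is a subsequence of a data sequence (each event is an itemset, and event order must be preserved) and counts its support. Note that only sequence 1 is spelled out in the slide text; sequences 2–4 below follow the standard textbook example from Han, Kamber and Pei and should be read as an assumption:

def is_subsequence(candidate, sequence):
    """True if every event of the candidate is a subset of some later event of the
    sequence, preserving order."""
    pos = 0
    for event in candidate:
        while pos < len(sequence) and not event <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(candidate, database):
    return sum(is_subsequence(candidate, seq) for seq in database)

# Sequence database S (sequence 1 matches the slide; 2-4 assumed from the textbook example)
S = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],     # sequence 1
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],                 # sequence 2
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],          # sequence 3
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],             # sequence 4
]
s = [{'a', 'b'}, {'c'}]           # the candidate <(ab) c>
print(support(s, S))              # 2  -> frequent, since min_sup = 2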
  • 96. Sequence Database and Transaction Database , Medi-Caps University, Indore  A sequence database is a set of sequences where each sequence is a list of itemsets.  An itemset is an unordered set of items.  For example, the table shown below contains four sequences.  The first sequence, named S1, contains 5 itemsets.  It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6.  Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered.
  • 97. Sequence Database to a Transaction Database
  Sequence database:
  ID   Sequence
  S1   (1), (1 2 3), (1 3), (4), (3 6)
  S2   (1 4), (3), (2 3), (1 5)
  S3   (5 6), (1 2), (4 6), (3), (2)
  S4   (5), (7), (1 6), (3), (2), (3)
  Resulting transaction database:
  Transaction id   Items
  t1               {1, 2, 3, 4, 6}
  t2               {1, 2, 3, 4, 5}
  t3               {1, 2, 3, 4, 5, 6}
  t4               {1, 2, 3, 5, 6, 7}
  • 98. Transaction Database and Sequence Database  A transaction database is a set of transactions. Each transaction is a set of items.  For example, consider the following transaction database. It contains four transactions (t1, ..., t4) and five items (1, 2, 3, 4, 5).  The first transaction represents the set of items 1, 3 and 4.
  Transaction id   Items
  t1               {1, 3, 4}
  t2               {2, 3, 5}
  t3               {1, 2, 3, 5}
  t4               {2, 5}
  • 99. Transaction Database to a Sequence Database  A sequence database is a set of sequences. Each sequence is an ordered list of itemsets. Each itemset is an unordered set of items (symbols) represented by positive integers.  The output for this example is the following sequence database. It contains four sequences. The first sequence indicates that item 1 is followed by item 3, which is followed by item 4.
  Sequence id   Itemsets
  s1            {1}, {3}, {4}
  s2            {2}, {3}, {5}
  s3            {1}, {2}, {3}, {5}
  s4            {2}, {5}
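Both conversions shown above are mechanical and can be sketched in a few lines of Python (the helper names are illustrative):

def sequence_to_transaction(sequence):
    # Union of all itemsets in the sequence, losing the ordering information
    items = set()
    for itemset in sequence:
        items |= set(itemset)
    return sorted(items)

def transaction_to_sequence(transaction):
    # Each item becomes its own single-item itemset, taken in sorted order
    return [{item} for item in sorted(transaction)]

s1 = [(1,), (1, 2, 3), (1, 3), (4,), (3, 6)]
print(sequence_to_transaction(s1))           # [1, 2, 3, 4, 6]  -> transaction t1
print(transaction_to_sequence({1, 3, 4}))    # [{1}, {3}, {4}]  -> sequence s1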
  • 100. Social Network Analysis , Medi-Caps University, Indore  A social network is defined as a social structure of individuals, who are related (directly or indirectly to each other) based on a common relation of interest, e.g. friendship, trust, etc.  Social network analysis is the study of social networks to understand their structure and behavior.  Social network analysis has gained prominence due to its use in different applications - from product marketing (e.g. viral marketing) to search engines and organizational dynamics (e.g. management).  Recently there has been a rapid increase in interest regarding social network analysis in the data mining community.  The basic motivation is the demand to exploit knowledge from copious amounts of data collected, pertaining to social behavior of users in online environments.
  • 101. Social Network Analysis , Medi-Caps University, Indore  A social network is a heterogeneous and multi relational dataset represented by a graph. Vertexes represent the objects (entities), edges represent the links (relationships or interaction) and both objects and links may have attributes.  Social networks research emerged from sociology, psychology, statistics and graph theory.  Based on theoretical graph concepts, a social network interprets the social relationships of individuals as points and their relationships as the lines connecting them.  Data mining technique in social media  GRAPH MINING  TEXT MINING
  • 102. Graph Mining  Graphs (or networks) constitute a prominent data structure and appear in essentially all forms of information.  Examples include the web graph and social networks.  Typically, communities correspond to groups of nodes, where nodes within the same community (or cluster) tend to be highly similar and share common features, while nodes of different communities show low similarity.  Graph mining is about extracting useful knowledge (patterns, outliers, etc.) from structured data that can be represented as a graph.  Graph mining is used for understanding relationships as well as content. • For example, a phone provider can analyze phone call records using graph mining.
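As a toy illustration of graph mining on a small call graph, using the networkx library (the edge list is invented, and connected components are used here as a very coarse stand-in for community detection):

import networkx as nx

# A tiny call graph: an edge means "these two numbers called each other"
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"),
                  ("C", "D"), ("E", "F"), ("F", "G"), ("E", "G")])

print(nx.degree_centrality(G))               # who is the most connected node
components = nx.connected_components(G)      # coarse "communities": connected groups
print([sorted(c) for c in components])       # [['A', 'B', 'C', 'D'], ['E', 'F', 'G']]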
  • 105. Text Mining , Medi-Caps University, Indore  Text mining, also known as text analysis, is the process of transforming unstructured text data into meaningful and actionable information.  Text mining utilizes different AI technologies to automatically process data and generate valuable insights, enabling companies to make data-driven decisions.  For businesses, the large amount of data generated every day represents both an opportunity and a challenge.  On the one side, data helps companies get smart insights on people’s opinions about a product or service.  Think about all the potential ideas that you could get from analyzing emails, product reviews, social media posts, customer feedback, support tickets, etc.  On the other side, there’s the dilemma of how to process all this data. And that’s where text mining plays a major role.
  • 106. Text Mining , Medi-Caps University, Indore  The fundamental steps involved in text mining are:  Gathering unstructured data from multiple data sources like plain text, web pages, pdf files, emails, and blogs, to name a few.  Detect and remove anomalies from data by conducting pre- processing and cleansing operations.  Data cleansing allows you to extract and retain the valuable information hidden within the data and to help identify the roots of specific words.  For this, you get a number of text mining tools and text mining applications.  Convert all the relevant information extracted from unstructured data into structured formats.
  • 107. Text Mining, Medi-Caps University, Indore  Analyze the patterns within the data via the Management Information System (MIS).  Store all the valuable information in a secure database to drive trend analysis and enhance the organization's decision-making process.
  • 108. Text Mining, Medi-Caps University, Indore  Data mining vs. text mining (rows 1-4):  1. Data mining: the statistical technique of processing raw data in a structured form. Text mining: the part of data mining which involves processing of text from documents.  2. Data mining: pre-existing databases and spreadsheets are used to gather information. Text mining: text is used to gather high-quality information.  3. Data mining: data is stored in a structured format. Text mining: data is stored in an unstructured format.  4. Data mining: data is homogeneous and easy to retrieve. Text mining: data is heterogeneous and not so easy to retrieve.
  • 109. Text Mining, Medi-Caps University, Indore  Data mining vs. text mining (rows 5-8):  5. Data mining: supports mining of mixed data. Text mining: only text is mined.  6. Data mining: combines artificial intelligence, machine learning and statistics and applies them to data. Text mining: applies pattern recognition and natural language processing to unstructured data.  7. Data mining: used in fields like marketing, medicine and healthcare. Text mining: used in fields like bioscience and customer profile analysis.  8. Data mining: structured data from large datasets found in systems such as databases, spreadsheets, ERP, CRM and accounting applications. Text mining: unstructured textual data found in emails, documents, presentations, videos, file shares, social media and the Internet.
  • 110. Web Mining, Medi-Caps University, Indore  Web mining is the application of data mining techniques to find information patterns in web data.  Web mining helps improve the power of web search engines by identifying web pages and classifying web documents.  The main purpose of web mining is discovering useful information from the World Wide Web and its usage patterns.  Web mining is very useful to e-commerce websites and e-services.  Applications of Web Mining:  Web mining helps improve the power of web search engines by classifying web documents and identifying web pages.  It is used for web search, e.g. Google, Yahoo, etc.
  • 111. Web Mining, Medi-Caps University, Indore  Web mining is used to predict user behavior.  Web mining is very useful for a particular website and e-service, e.g. landing page optimization.
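As a small, hypothetical example of web usage mining, the sketch below counts page visits from simplified web-server log lines. The log format shown is a reduced version of typical access logs and the entries are made up; real analyses would parse full logs and also reconstruct user sessions and click paths.

```python
from collections import Counter

log_lines = [
    '10.0.0.1 - - [01/Mar/2021:10:00:01] "GET /home HTTP/1.1" 200',
    '10.0.0.2 - - [01/Mar/2021:10:00:05] "GET /products HTTP/1.1" 200',
    '10.0.0.1 - - [01/Mar/2021:10:00:09] "GET /products HTTP/1.1" 200',
    '10.0.0.3 - - [01/Mar/2021:10:00:12] "GET /home HTTP/1.1" 200',
]

def requested_page(line):
    """Extract the requested URL path from a simplified access-log line."""
    return line.split('"')[1].split()[1]

# Count how often each page was requested, most popular first.
page_hits = Counter(requested_page(line) for line in log_lines)
for page, hits in page_hits.most_common():
    print(f"{page}: {hits} hits")
```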
  • 112. Web Mining, Medi-Caps University, Indore  Data mining vs. web mining:  1. Data mining: the process that attempts to discover patterns and hidden knowledge in large data sets in any system. Web mining: the application of data mining techniques to automatically discover and extract information from web documents.  2. Data mining: very useful for web page analysis. Web mining: very useful for a particular website and e-service.  3. Data mining: carried out by data scientists and data engineers. Web mining: carried out by data scientists along with data analysts.  4. Data mining: data is accessed privately. Web mining: data is accessed publicly.  5. Data mining: clustering, classification, regression, prediction, optimization and control. Web mining: web content mining, web structure mining.  6. Data mining: includes tools such as machine learning algorithms. Web mining: special tools such as Scrapy, PageRank and Apache logs.
  • 113. Multirelational Data Mining, Medi-Caps University, Indore  Multi-relational data mining (MRDM) has developed as an alternative way of handling structured data such as that stored in an RDBMS.  It allows mining multiple tables directly.  In MRDM, the patterns are found across multiple tables (relations) of a relational database.  Because the data is spread over many tables, several problems arise in the practice of data mining.  To deal with this, one either constructs a single table by propositionalisation or uses a multi-relational data mining algorithm.  MRDM approaches have been successfully applied in the area of bioinformatics.
  • 114. Multirelational Data Mining, Medi-Caps University, Indore  Three popular pattern-finding techniques, classification, clustering and association, are frequently used in MRDM.  The multi-relational approach has developed as an alternative for analyzing structured data such as relational databases.  MRDM allows data mining to be applied directly to multiple tables, and the technique is used to avoid expensive join operations and semantic loss.  An important aspect of data mining algorithms and systems is that they should scale well to large databases.  A consequence of this is that most data mining tools are based on machine learning algorithms that work on data in attribute-value format.  Experience has shown that such 'single-table' mining algorithms indeed scale well.
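The following sketch illustrates propositionalisation as mentioned on the previous slide: a one-to-many relation (customers and their orders) is summarised into a single attribute-value table that ordinary single-table mining algorithms can consume. The tables and column names are hypothetical, and the pandas library is assumed to be installed.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Indore", "Bhopal", "Indore"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 3],
    "amount": [250.0, 100.0, 400.0, 80.0],
})

# Aggregate the "many" side into per-customer features, then join once,
# instead of repeatedly joining tables inside the mining algorithm.
order_features = orders.groupby("customer_id")["amount"].agg(
    n_orders="count", total_spent="sum"
).reset_index()

single_table = customers.merge(order_features, on="customer_id", how="left")
print(single_table)
```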
  • 115. Unit – 4 Any 5 Assignment Questions Marks: 20, Medi-Caps University, Indore  Q.1 What is cluster analysis? Discuss the k-means algorithm with suitable examples.  Q.2 Write a short note on the following: a) Unsupervised Learning b) Web Mining c) Text Mining d) Social Network Analysis  Q.3 List the differences between clustering and classification. Briefly describe the hierarchical clustering method.  Q.4 Suppose we have the following points: (1,1), (2,4), (3,4), (5,8), (6,2), (7,8). Use the k-means algorithm (k = 2) to find two clusters. The distance function is Euclidean distance.  Q.5 Define clustering. What are the requirements for cluster analysis?  Q.6 Explain the DBSCAN algorithm with a suitable example.  Q.7 Describe data stream mining. How is time-series data mined?
  • 117. Thank You Great God, Medi-Caps, All the attendees Mr. Sagar Pandya [email protected] www.sagarpandya.tk LinkedIn: /in/seapandya Twitter: @seapandya Facebook: /seapandya
