MEDI-CAPS UNIVERSITY
Faculty of Engineering
Mr. Sagar Pandya
Information Technology Department
sagar.pandya@medicaps.ac.in
IT3ED02 Data Mining and Warehousing 3-0-0
 Unit 1. Introduction
 Unit 2. Data Mining
 Unit 3. Association and Classification
 Unit 4. Clustering
 Unit 5. Business Analysis
Text Books
 Han, Kamber and Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann,
India, 2012.
 Mohammed Zaki and Wagner Meira Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, Cambridge University Press.
 Z. Markov and Daniel T. Larose, Data Mining the Web, John Wiley & Sons, USA.
Reference Books
 Sam Anahory and Dennis Murray, Data Warehousing in the Real World,
Pearson Education Asia.
 W. H. Inmon, Building the Data Warehouse, 4th Ed., Wiley India.
and many others
Unit-4 Clustering
 Clustering: Introduction, Types of clustering;
 Partition-based clustering: K-Means, K-Medoids;
 Density based clustering: DBSCAN, Clustering evaluation.
 Mining Data Stream, Mining Time-Series Data, Mining Sequence
Patterns in Transactional Database,
 Social Network analysis and Multirelational Data Mining.
Clustering
 In clustering, data objects are grouped so that objects within the same group are
similar to one another.
 Each such group is called a cluster.
 In cluster analysis, a data set is divided into different groups based on the
similarity of the data.
 After the data have been divided into groups, a label can be assigned to each
group.
 Working with these group labels makes it easier to adapt as the data changes.
 In other words, similar objects are grouped in one cluster and
dissimilar objects are grouped in another cluster.
 Clustering is an unsupervised machine learning technique that partitions data
points into clusters so that objects in the same cluster are similar to each other.
• The quality of a cluster depends on the method used.
• Clustering is also called data segmentation, because it partitions
large data sets into groups according to their similarity.
• A clustering algorithm typically proceeds in three basic stages.
 For example, once the data from our customer base is divided into clusters, we can
make an informed decision about which customers are best suited for a given product.
 What is a Cluster?
 A cluster is a subset of similar objects.
• A subset of objects such that the distance between any of the two objects in
the cluster is less than the distance between any object in the cluster and any
object that is not located inside it.
 What is clustering in Data Mining?
• Clustering is the method of converting a group of abstract objects into classes
of similar objects.
• Clustering is a method of partitioning a set of data or objects into a set of
significant subclasses called clusters.
• It helps users understand the natural grouping or structure in a data set, and is
used either as a stand-alone tool to get better insight into the data
distribution or as a pre-processing step for other algorithms.
• Clustering analysis is broadly used in many applications such as
market research, pattern recognition, data analysis, and image
processing.
• Clustering can also help marketers discover distinct groups in their
customer base and characterize those groups based on their purchasing patterns.
• In the field of biology, it can be used to derive plant and animal
taxonomies, categorize genes with similar functionalities and gain
insight into structures inherent to populations.
• Clustering also helps in identification of areas of similar land use in
an earth observation database. It also helps in the identification of
groups of houses in a city according to house type, value, and
geographic location.
• Clustering also helps in classifying documents on the web for
information discovery.
• Clustering is also used in outlier detection applications such as
detection of credit card fraud.
 Clustering Methods
 Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
• The requirements of a good clustering method are:
• The ability to discover some or all of the hidden clusters.
• Within-cluster similarity and between-cluster dissimilarity.
• Ability to deal with various types of attributes.
• Can deal with noise and outliers.
• Can handle high dimensionality.
• Scalable, Interpretable and usable.
• An important issue in clustering is how to determine the similarity
between two objects, so that clusters can be formed from objects
with high similarity within clusters and low similarity between
clusters.
• Commonly, to measure similarity or dissimilarity between objects, a
distance measure such as Euclidean, Manhattan and Minkowski is
used.
• A distance function returns a lower value for pairs of objects that are
more similar to one another.
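 As a small illustration (not part of the original slides), the three distance measures can be written in a few lines of Python; the sample points below are made up.

import numpy as np

def euclidean(x, y):
    # square root of the sum of squared differences
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    # sum of absolute differences
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def minkowski(x, y, p=3):
    # generalises both: p = 2 gives Euclidean, p = 1 gives Manhattan
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

print(euclidean([1, 2], [4, 6]))        # 5.0
print(manhattan([1, 2], [4, 6]))        # 7
print(minkowski([1, 2], [4, 6], p=2))   # same as Euclidean when p = 2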
K-Means Algorithm
 The Concept
 Imagine you’re opening a small book store.
 You have a stack of different books, and 3 bookshelves.
 Your goal is to place similar books on one shelf.
 What you would do is pick up 3 books, one for each shelf, in order to
set a theme for every shelf.
 These books will now dictate which of the remaining books will go
in which shelf.
• Every time you pick a new book up from the stack, you would
compare it with those first 3 books, and place this new book on the
shelf that has similar books.
• You would repeat this process until all the books have been placed.
 Once you’re done, you might notice that changing the number of
bookshelves, or picking different initial books for those shelves
(changing the theme for each shelf), could improve how well you’ve
grouped the books.
 So, you repeat the process in hopes of a better outcome.
 The Algorithm
 K-means clustering is a good place to start exploring an unlabeled
dataset. The K in K-Means denotes the number of clusters.
 This algorithm is bound to converge to a solution after some
iterations.
 It has 4 basic steps:
1. Initialize Cluster Centroids (Choose those 3 books to start with)
2. Assign data points to clusters (Place the remaining books one by one)
3. Update Cluster centroids (Start over with 3 different books)
4. Repeat step 2–3 until the stopping condition is met.
 You don’t have to start with 3 clusters initially, but 2–3 is generally a
good place to start, and update later on.
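 A minimal sketch of these four steps, assuming scikit-learn is available (an illustrative example, not part of the original slides); the data points are made up.

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points
X = np.array([[1, 1], [2, 4], [3, 4], [5, 8], [6, 2], [7, 8]])

# Steps 1-4: initialise k centroids, assign points, update centroids, repeat until convergence
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroid coordinates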
K-Medoids Algorithm
 The k-means method is based on the centroid techniques to represent
the cluster and it is sensitive to outliers.
 This means, a data object with an extremely large value may disrupt
the distribution of data.
 To overcome this problem, the k-medoids method is used, which is
based on representative-object techniques.
 The centroid is replaced with a medoid to represent the cluster.
 A medoid is the most centrally located data object in a cluster.
 Here, k data objects are selected randomly as medoids to represent k
clusters, and all remaining data objects are placed in the cluster whose
medoid is nearest (or most similar) to them.
 After processing all data objects, new medoid is determined which
can represent cluster in a better way and the entire process is
repeated.
 Again all data objects are bound to the clusters based on the new
medoids.
 In each iteration, medoids change their location step by step.
 This process continues until no medoid moves.
 As a result, k clusters are found representing a set of n data objects.
 The most common k-medoids clustering method is
the PAM algorithm (Partitioning Around Medoids).
 The K-Medoids (also called Partitioning Around Medoids) algorithm was
proposed in 1987 by Kaufman and Rousseeuw.
 A medoid can be defined as the point in the cluster whose total dissimilarity
to all the other points in the cluster is minimum.
 1. Initialize: select k random points out of the n data points as the
medoids.
 2. Associate each data point to the closest medoid by using any common
distance metric methods.
 3. While the cost decreases: for each medoid m and for each data point o
which is not a medoid:
(a) Swap m and o, associate each data point to the closest medoid, and
recompute the cost.
(b) If the total cost is more than that in the previous step, undo the swap.
 PAM concept: The use of means implies that k-means clustering is
highly sensitive to outliers.
 This can severely affect the assignment of observations to clusters.
 A more robust algorithm is provided by the PAM algorithm.
 PAM algorithm: The PAM algorithm is based on the search for k
representative objects or medoids among the observations of the data set.
 After finding a set of k medoids, clusters are constructed by assigning
each observation to the nearest medoid.
 Next, each selected medoid m and each non-medoid data point are
swapped and the objective function is computed.
 The objective function corresponds to the sum of the dissimilarities of all
objects to their nearest medoid.
 In summary, the PAM algorithm proceeds in two phases as follows:
 Build phase:
1. Select k objects to become the medoids, or in case these objects were
provided use them as the medoids;
2. Calculate the dissimilarity matrix if it was not provided;
3. Assign every object to its closest medoid;
 Swap phase:
 4. For each cluster, check whether any object of the cluster decreases
the average dissimilarity coefficient; if it does, select the object that
decreases this coefficient the most as the new medoid for that cluster;
 5. If at least one medoid has changed go to (3), else end the
algorithm.
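 The build/swap idea can be sketched in plain NumPy as below; this is a rough illustrative implementation (not the official PAM code), and the data points are made up.

import numpy as np

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Build phase: pick k random objects as the initial medoids.
    medoids = rng.choice(n, size=k, replace=False)

    def total_cost(medoid_idx):
        # Sum of distances of every object to its nearest medoid.
        d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
        return d.min(axis=1).sum(), d.argmin(axis=1)

    cost, labels = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        # Swap phase: try replacing each medoid with each non-medoid object.
        for mi in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = o
                new_cost, new_labels = total_cost(candidate)
                if new_cost < cost:            # keep the swap only if the cost drops
                    medoids, cost, labels = candidate, new_cost, new_labels
                    improved = True
        if not improved:                       # no medoid moved: stop
            break
    return medoids, labels

X = np.array([[1, 1], [2, 4], [3, 4], [5, 8], [6, 2], [7, 8]], dtype=float)
medoids, labels = pam(X, k=2)
print(X[medoids], labels)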
 Advantages:
1. It is simple to understand and easy to implement.
2. K-Medoid Algorithm is fast and converges in a fixed number of steps.
3. PAM is less sensitive to outliers than other partitioning algorithms.
 Disadvantages:
1. The main disadvantage of the K-Medoid algorithm is that it is not suitable
for clustering non-spherical (arbitrarily shaped) groups of objects. This
is because it relies on minimizing the distances between the non-
medoid objects and the medoid (the cluster centre) – briefly, it uses
compactness as the clustering criterion instead of connectivity.
2. It may obtain different results for different runs on the same dataset
because the first k medoids are chosen randomly.
Hierarchical Clustering Algorithm
 The hierarchical clustering algorithm, also called hierarchical
cluster analysis or HCA, is an unsupervised clustering algorithm
that creates clusters with a predetermined ordering from top to bottom.
 For example, all files and folders on our hard disk are organized in a
hierarchy.
 The algorithm groups similar objects into groups called clusters. The
endpoint is a set of clusters or groups, where each cluster is distinct
from each other cluster, and the objects within each cluster are
broadly similar to each other.
 This clustering technique is divided into two types:
1. Agglomerative Hierarchical Clustering
2. Divisive Hierarchical Clustering
 Agglomerative Hierarchical Clustering:
The Agglomerative Hierarchical Clustering is the most common type
of hierarchical clustering used to group objects in clusters based on
their similarity.
 It’s also known as AGNES (Agglomerative Nesting).
 It's a “bottom-up” approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the
hierarchy.
 How does it work?
1. Make each data point a single-point cluster → forms N clusters
2. Take the two closest data points and make them one cluster → forms
N-1 clusters
3. Take the two closest clusters and make them one cluster → Forms N-2 clusters.
4. Repeat step-3 until you are left with only one cluster.
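 A minimal sketch of this bottom-up procedure, assuming scikit-learn is available; the points are made up.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Bottom-up: each point starts in its own cluster; the closest clusters are merged
agg = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)
print(agg.labels_)   # e.g. [0 0 0 1 1 1] -- cluster index per point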
 What is a Dendrogram?
 A Dendrogram is a type of tree diagram showing hierarchical relationships
between different sets of data.
 As already said, a dendrogram contains the memory of the hierarchical clustering
algorithm, so just by looking at the dendrogram you can tell how each cluster was
formed.
 The point of all this is that hierarchical clustering maintains a memory of how
we went through the merging process, and that memory is stored in the dendrogram.
 Note:-
1. Distance between data points represents dissimilarities.
2. Height of the blocks represents the distance between clusters.
 In the dendrogram example, initially P5 and P6, which are closer to each other
than to any other point, are combined into one cluster, followed by P4 getting
merged into the same cluster (C2).
 Then P1 and P2 get combined into one cluster, followed by P0
getting merged into the same cluster (C4).
 Now P3 gets merged into cluster C2 and finally, both clusters get
merged into one.
 There are several ways to measure the distance between clusters in order to
decide the rules for clustering, and they are often called Linkage Methods.
Some of the common linkage methods are:
• Single-linkage: the distance between two clusters is defined as
the shortest distance between two points in each cluster. This linkage may be
used to detect high values in your dataset which may be outliers as they will be
merged at the end.
• Complete-linkage: the distance between two clusters is defined as
the longest distance between two points in each cluster. For example, the
distance between clusters “r” and “s” is the distance between their two
furthest points.
• Average-linkage: the distance between two clusters is defined as the average
distance between each point in one cluster to every point in the other cluster.
• Centroid-linkage: finds the centroid of cluster 1 and centroid of cluster 2, and
then calculates the distance between the two before merging.
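 These linkage methods can be compared with SciPy, as in the rough sketch below (assuming scipy and matplotlib are installed; the points are made up).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                          # merge tree for this linkage
    print(method, fcluster(Z, t=2, criterion="maxclust"))  # cut the tree into 2 clusters

dendrogram(linkage(X, method="average"))                   # tree diagram of the merges
plt.show()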
 Parts of a Dendrogram:
 A dendrogram can be a column graph or a row graph.
 Some dendrograms are circular or have a fluid-shape, but the
software will usually produce a row or column graph.
 No matter what the shape, the basic graph comprises the same parts:
• The clades are the branches, and they are arranged according to how similar
(or dissimilar) they are.
• Clades that are close to the same height are similar to each other;
clades with different heights are dissimilar — the greater the
difference in height, the more dissimilarity.
• Each clade has one or more leaves.
• Leaves A, B, and C are more similar to each other than they are to leaves D, E,
or F.
• Leaves D and E are more similar to each other than they are to leaves A, B, C,
or F.
• Leaf F is substantially different from all of the other leaves.
 A clade can theoretically have an infinite number of leaves. However, the more
leaves you have, the harder the graph will be to read with the naked eye.
 One question that might have intrigued you by now is how do you decide
when to stop merging the clusters?
 You cut the dendrogram tree with a horizontal line at a height where the line
can traverse the maximum distance up and down without intersecting the
merging point.
 For example, if a horizontal line (say L3) can traverse the maximum distance up and
down without intersecting the merging points, we draw the cut there, and the number
of vertical lines it intersects is the optimal number of clusters.
 Number of clusters in this case = 3.
 Let’s see the graphical representation of this algorithm using a dendrogram.
 Note:
This is just a demonstration of how the actual algorithm works; no calculations
have been performed below, and all the proximities among the clusters are assumed.
 Let’s say we have six data points A, B, C, D, E, F.
• Step-1: Consider each alphabet as a single cluster and calculate the distance of one
cluster from all the other clusters.
• Step-2: In the second step comparable clusters are merged together to form a single
cluster. Let’s say cluster (B) and cluster (C) are very similar to each other therefore we
merge them in the second step similarly with cluster (D) and (E) and at last, we get the
clusters
[(A), (BC), (DE), (F)]
• Step-3: We recalculate the proximity according to the algorithm and merge the two
nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
• Step-4: Repeating the same process, the clusters DEF and BC are comparable and are
merged together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
• Step-5: At last the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
 Divisive Hierarchical Clustering
 Divisive clustering, or DIANA (DIvisive ANAlysis Clustering), is a top-down clustering
method where we assign all of the observations to a single cluster and then
partition that cluster into the two least similar clusters.
 Finally, we proceed recursively on each cluster until there is one cluster for
each observation. So this clustering approach is exactly opposite to
Agglomerative clustering.
 There is evidence that divisive algorithms produce more accurate hierarchies
than agglomerative algorithms in some circumstances, but they are conceptually more
complex.
 In both agglomerative and divisive hierarchical clustering, users need to
specify the desired number of clusters as a termination condition (when to stop
merging or splitting).
DBSCAN Clustering Algorithm
 Clustering analysis is an unsupervised learning method that separates
the data points into several specific bunches or groups, such that the
data points in the same groups have similar properties and data
points in different groups have different properties in some sense.
 Essentially, all clustering methods use the same approach: first we
calculate similarities, and then we use them to cluster the data points into
groups or batches.
 DBSCAN stands for Density-Based Spatial Clustering of
Applications with Noise.
 It was proposed by Martin Ester et al. in 1996. DBSCAN is a
density-based clustering algorithm that works on the assumption that
clusters are dense regions in space separated by regions of lower
density.
 It can discover clusters of different shapes and sizes from a large
amount of data, which is containing noise and outliers.
 K-Means and Hierarchical Clustering both fail in creating clusters of
arbitrary shapes. They are not able to form clusters based on varying
densities. That’s why we need DBSCAN clustering.
• minPts: The minimum number of points (a threshold) clustered
together for a region to be considered dense.
• eps (ε): A distance measure that will be used to locate the points in
the neighborhood of any point.
• Core — a point that has at least minPts points within
distance eps from itself.
• Border — a point that has at least one core point within
distance eps.
• Noise — a point that is neither a core nor a border point; it
has fewer than minPts points within distance eps from itself.
 These parameters can be understood if we explore two concepts
called Density Reachability and Density Connectivity.
 Reachability in terms of density establishes a point to be reachable
from another if it lies within a particular distance (eps) from it.
 Connectivity, on the other hand, involves a transitivity based
chaining-approach to determine whether points are located in a
particular cluster.
 For example, p and q points could be connected if p->r->s->t->q,
where a->b means b is in the neighborhood of a.
 In 2014, the algorithm was awarded the ‘Test of Time’ award at the
leading Data Mining conference, KDD.
 A point X is directly density-reachable from point Y w.r.t epsilon,
minPoints if,
1. X belongs to the neighborhood of Y, i.e, dist(X, Y) <= epsilon
2. Y is a core point
• Here, X is directly density-reachable from Y, but vice versa is not
valid.
 Here, X is density-reachable from Y with X being directly density-
reachable from P2, P2 from P3, and P3 from Y. But, the inverse of
this is not valid.
 DBSCAN algorithm can be abstracted in the following steps –
1. Find all the neighbor points within eps of every point, and identify the core points
as those with more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density connected points and assign them to the same
cluster as the core point.
Points a and b are said to be density connected if there exists a point c which
has a sufficient number of points in its neighborhood and both a and b are
within eps distance of it. This is a chaining process: if b is a neighbor of c, c is
a neighbor of d, d is a neighbor of e, and e in turn is a neighbor of a, this implies
that b is connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that
do not belong to any cluster are noise.
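 A minimal DBSCAN sketch, assuming scikit-learn is available; the eps and min_samples values and the points are made up for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)   # e.g. [0 0 0 1 1 -1]; label -1 marks noise points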
Grid-Based Clustering Algorithms
 In this method, the objects together form a grid.
 The object space is quantized into a finite number of cells that form a
grid structure.
Basic Grid-based Algorithm
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and compute the density
of each cell.
3. Eliminate cells, whose density is below a certain threshold t.
4. Form clusters from contiguous (adjacent) groups of dense cells
(usually minimizing a given objective function)
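 A rough sketch of these four steps on a fixed 2-D grid, using NumPy and SciPy; the grid size and density threshold are made-up illustrative values.

import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (100, 2)), rng.normal(8, 0.5, (100, 2))])

# 1. Define grid cells; 2. count the objects falling in each cell (its density).
density, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=10)

# 3. Eliminate cells whose density is below threshold t; 4. connect adjacent dense cells.
t = 3
dense = density >= t
clusters, n_clusters = label(dense)   # contiguous dense cells form the clusters
print(n_clusters)                     # number of grid-based clusters found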
 Advantages
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension in the
quantized space.
 Several interesting methods (in addition to the basic grid-based
algorithm)
 STING (a STatistical INformation Grid approach) by Wang, Yang
and Muntz (1997)
 CLIQUE: Agrawal, et al. (SIGMOD’98)
Model-Based Clustering Algorithms
 In this method, a model is hypothesized for each cluster to find the best fit of data for a
given model. This method locates the clusters by clustering the density function. It
reflects spatial distribution of the data points.
 This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
 In model-based clustering, the data are considered as coming from a mixture of densities.
 Each component (i.e. cluster) k is modeled by the normal or Gaussian distribution
which is characterized by the parameters:
 μk: the mean vector,
 Σk: the covariance matrix,
 an associated mixing probability. Each point has a probability of belonging to
each cluster.
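 A minimal model-based clustering sketch with a Gaussian mixture, assuming scikit-learn is available; the synthetic data below is made up.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.means_)                # estimated mean vector (mu_k) of each component
print(gmm.predict_proba(X[:3]))  # probability of each point belonging to each cluster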
MINING DATA STREAM
 Large amounts of data stream in every day.
 Efficient knowledge discovery of such data streams is an emerging
active research area in data mining with broad applications.
 Data Stream Mining (also known as stream learning) is the process
of extracting knowledge structures from continuous, rapid data
records.
 Data streams typically arrive continuously at high speed, in huge volumes, and
with changing data distributions.
 Data mining techniques that require multiple scans of the entire
data set cannot be applied directly to mine stream data, which
usually allows only one scan and demands a fast response time.
 Imagine a factory with 500 sensors each capturing 10 KB of information
every second: about 18 GB of information is captured in one hour and
roughly 432 GB daily.
 This massive amount of information needs to be analyzed in real time (or in the
shortest time possible) to detect irregularities or deviations in the system
and react quickly.
 Stream Mining enables to analyze large amounts of data in real-time.
 Data Stream Mining is the process of extracting knowledge from
continuous rapid data records which comes to the system in a stream.
 A Data Stream is an ordered sequence of instances in time.
 Data stream mining is a process of mining continuous incoming real
time streaming data with acceptable performance.
 Data stream mining must handle the following characteristics:
 Continuous stream of data: a high volume of data arrives in an infinite stream;
we never see the entire dataset.
 Concept Drifting: The data change or evolves over time
 Volatility of data: The system does not store the data received (Limited
resources). When data is analyzed it’s discarded or summarized.
 Data stream is a high-speed continuous flow of data from diverse
resources.
 The sources might include remote sensors, scientific processes, stock
markets, online transactions, tweets, internet traffic, video
surveillance systems etc.
 Generally these streams come in high-speed with a huge volume of
data generated by real-time applications.
 Data streams have unique characteristics when compared with
traditional datasets.
 They are potentially infinite, massive, continuous, temporally
ordered and fast changing.
 Storing such streams and processing them later is not viable, as that needs a lot
of storage and processing power.
 For this reason they are to be processed in real-time in order to
discover knowledge from them instead of storing and processing like
traditional data mining.
 The data stream mining procedure includes selecting a part of stream
data, preprocessing, incremental learning and extraction of
knowledge in a single pass.
 The result of data stream mining is the knowledge that can help in
taking intelligent decisions.
 Thus the processing of data streams poses challenges in terms of the
memory and processing power of systems.
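 As a toy illustration of the single-pass constraint (not from the slides), running statistics can be maintained incrementally without ever storing the stream:

def process_stream(stream):
    # Welford's running mean/variance: constant-size summary, one pass over the data
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:                 # each element is seen exactly once
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance

# Simulated sensor stream (made-up values); in practice this would be a socket or queue.
readings = (float(i % 7) for i in range(1_000_000))
print(process_stream(readings))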
What are the Applications?
 Telecommunication calling records
 Business credit card transaction flows
 Network monitoring and traffic engineering
 Financial market: stock exchange
 Engineering & industrial processes: power supply & manufacturing
 Sensor, monitoring & surveillance: video streams, RFIDs
 Security monitoring
 Web logs and Web page click streams
 Massive data sets (even saved but random access is too expensive)
Software for data stream mining:
 MOA (Massive Online Analysis): free open-source software specific
for mining data streams with concept drift. It has several machine
learning algorithms (classification, regression, clustering, outlier
detection and recommender systems).
 RapidMiner: commercial software for knowledge discovery, data
mining, and machine learning also featuring data stream mining,
learning time-varying concepts, and tracking drifting concept (if used
in combination with its data stream mining plugin (formerly:
Concept Drift plugin)).
 StreamDM: an open-source framework for big data
stream mining that uses Spark Streaming.
 When the underlying data has very large volume and arrives as a high-speed, continuous
flow, it leads to a number of computational and mining challenges, listed below.
 (1) Data contained in data streams is fast changing, high-speed and real-time.
 (2) Multiple or random access to data streams is expensive, in fact almost
impossible.
 (3) A huge volume of data has to be processed in limited memory.
 (4) A data stream mining system must process high-speed and gigantic data within
time limitations.
 (5) The data arrive in multidimensional, low-level form, so techniques to mine
such data need to be very sophisticated.
 (6) Data stream elements change rapidly over time. Thus, data from the past may
become irrelevant for mining.
MINING TIME SERIES DATA
 A time series is a sequence of data points recorded at specific time
points - most often in regular time intervals (seconds, hours, days,
months etc.).
 Every organization generates a high volume of data every single day –
be it sales figure, revenue, traffic, or operating cost.
 Time series data mining can generate valuable information for long-term
business decisions, yet it is underutilized in most organizations.
 Stock market analysis, economic and sales forecasting, scientific and
engineering experiments, medical treatments, etc., all produce such data; more
generally, a sequence database consists of sequences of ordered events (where
time is optional).
 Web page traversal sequences are another example. Time-series data can be
analyzed to identify correlations, similar or regular patterns, trends, and outliers.
 Industries in all sectors generate and use time series data to make
important business decisions.
 Using past data, a grocery chain wants to know at which time of the
year market demand peaks for a particular product; call centers need
to forecast future call volumes so they can maintain adequate
staffing; credit card companies look out for fraudulent transactions —
all these business decisions benefit from the use of time series data.
 Time series data points are snapshots of the past.
 Understanding historical events, patterns and trends are some basic
indicators that all businesses track.
 They want to understand how well they performed in the past
and where they are headed in the future.
 A basic understanding of historical events through time series data
doesn’t require fancy modeling; just plotting data against time can
generate very powerful insights.
 In the old days, spreadsheets were good enough to create powerful
visual stories and insights.
 Nowadays most statistical and data analysis tools (e.g. Python,
Tableau, PowerBI) can handle time-series data pretty well for
creating time series charts, dashboards etc.
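 For instance, a simple pandas plot of made-up monthly sales data (an illustrative sketch, assuming pandas and matplotlib are installed) is often enough for a first look:

import pandas as pd
import matplotlib.pyplot as plt

# Made-up monthly sales figures indexed by month
sales = pd.Series(
    [120, 135, 150, 160, 155, 170, 180, 210, 205, 190, 220, 260],
    index=pd.date_range("2020-01-01", periods=12, freq="MS"),
)

sales.plot(title="Monthly sales (illustrative data)")  # plot the value against time
plt.show()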
 Time series data provides a wealth of analytics and application
possibilities in all domains of applications.
 Historical analysis, forecasting, anomaly detection, and predictive
analytics are just a few of those possibilities.
 New analytical frontiers are also emerging with the development of
new tools and techniques.
 Artificial neural networks (e.g. LSTM) and econometrics are such
cutting-edge frontiers in time series data analytics.
 Experienced and aspiring data scientists alike can make tremendous
contributions to their domains by taking advantage of these tools, or
maybe by developing new ones.
Mining Sequence Patterns in Transactional Database
 A sequence database consists of sequences of ordered elements or
events, recorded with or without a concrete notion of time.
 There are many applications involving sequence data. Typical
examples include customer shopping sequences, Web clickstreams,
biological sequences, sequences of events in science and engineering,
and in natural and social developments.
 “What is sequential pattern mining?” Sequential pattern mining is the
mining of frequently occurring ordered events or subsequences as
patterns.
 An example of a sequential pattern is “Customers who buy a Canon
digital camera are likely to buy an HP color printer within a month.”
 Other areas in which sequential patterns can be applied include Web
access pattern analysis, weather prediction, production processes, and
network intrusion detection.
 The sequential pattern mining problem was first introduced by
Agrawal and Srikant in 1995 [AS95] based on their study of
customer purchase sequences, as follows:
 “Given a set of sequences, where each sequence consists of a list of
events (or elements) and each event consists of a set of items, and
given a user-specified minimum support threshold of min sup,
sequential pattern mining finds all frequent subsequences, that is, the
subsequences whose occurrence frequency in the set of sequences is
no less than min sup.”
 Consider a sequence database S and let min_sup = 2.
 The set of items in the database is {a, b, c, d, e, f, g}. The database
contains four sequences.
 Let’s look at sequence 1, which is {a(abc)(ac)d(cf)}.
 It has five events, namely (a),(abc), (ac), (d), and (cf), which occur in
the order listed.
 Items a and c each appear more than once in different events of the
sequence.
 There are nine instances of items in sequence 1; therefore, it has a
length of nine and is called a 9-sequence.
 Item a occurs three times in sequence 1 and so contributes three to
the length of the sequence.
 However, the entire sequence contributes only one to the support of
{a}.
 Sequence {a(bc)df} is a subsequence of sequence 1 since the events
of the former are each subsets of events in sequence 1, and the order
of events is preserved.
 Consider subsequence s = {(ab)c}.
 Looking at the sequence database, S, we see that sequences 1 and 3
are the only ones that contain the subsequence s.
 The support of s is thus 2, which satisfies minimum support.
 Therefore, s is frequent, and so we call it a sequential pattern.
 It is a 3-pattern since it is a sequential pattern of length three.
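 The support computation in this example can be sketched in a few lines of Python. Sequence 1 below is taken from the example above; the other three sequences are illustrative placeholders chosen so that the support of {(ab)c} comes out as 2, as in the text.

def contains(sequence, subsequence):
    """True if `subsequence` occurs in `sequence`: each of its events is a subset
    of a later event of the sequence, and the order of events is preserved."""
    pos = 0
    for event in sequence:
        if pos < len(subsequence) and subsequence[pos] <= event:
            pos += 1
    return pos == len(subsequence)

S = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],   # sequence 1 from the text
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],               # placeholder sequences
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]

s = [{"a", "b"}, {"c"}]                       # the subsequence {(ab)c}
support = sum(contains(seq, s) for seq in S)
print(support)                                # 2 -> frequent, since min_sup = 2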
Sequence Database and Transaction Database
 A sequence database is a set of sequences where each sequence is a
list of itemsets.
 An itemset is an unordered set of items.
 For example, the table shown below contains four sequences.
 The first sequence, named S1, contains 5 itemsets.
 It means that item 1 was followed by items 1 2 and 3 at the same
time, which were followed by 1 and 3, followed by 4, and followed
by 3 and 6.
 Note that it is assumed that no items appear twice in the same
itemset and that items in an itemset are lexically ordered.
Sequence Database to a Transaction Database
ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)
Transaction id Items
t1 {1, 2, 3, 4, 6}
t2 {1, 2, 3, 4, 5}
t3 {1, 2, 3, 4, 5, 6}
t4 {1, 2, 3, 5, 6, 7}
Transaction Database and Sequence Database
 A transaction database is a set of transactions.
Each transaction is a set of items.
 For example, consider the following transaction database. It
contains four transactions (t1, ..., t4) and five items (1, 2, 3, 4, 5).
 For example, the first transaction represents the set of items 1,
3 and 4.
Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
Transaction Database to a Sequence Database
 A sequence database is a set of sequences. Each sequence is an
ordered list of itemsets. Each itemset is an unordered set of items
(symbols) represented by positive integers.
 The output for this example is the following sequence database. It
contains four sequences. The first sequence indicates that item 1
is followed by item 3, which is followed by item 4.
Sequence id Itemsets
s1 {1},{3}, {4}
s2 {2},{3},{5}
s3 {1}, {2}, {3}, {5}
s4 {2}, {5}
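 A hedged sketch of this conversion, assuming the ordering within each output sequence is simply the ascending order of the items in the transaction, as in the tables above:

# Each transaction's items, taken in ascending order, become single-item
# itemsets of a sequence.
transactions = {
    "t1": {1, 3, 4},
    "t2": {2, 3, 5},
    "t3": {1, 2, 3, 5},
    "t4": {2, 5},
}

sequences = {f"s{i}": [{item} for item in sorted(items)]
             for i, items in enumerate(transactions.values(), start=1)}

for sid, seq in sequences.items():
    print(sid, seq)   # s1 [{1}, {3}, {4}], s2 [{2}, {3}, {5}], ...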
Social Network Analysis
 A social network is defined as a social structure of individuals, who are
related (directly or indirectly to each other) based on a common
relation of interest, e.g. friendship, trust, etc.
 Social network analysis is the study of social networks to understand
their structure and behavior.
 Social network analysis has gained prominence due to its use in
different applications - from product marketing (e.g. viral marketing)
to search engines and organizational dynamics (e.g. management).
 Recently there has been a rapid increase in interest regarding social
network analysis in the data mining community.
 The basic motivation is the demand to exploit knowledge from copious
amounts of data collected, pertaining to social behavior of users in
online environments.
 A social network is a heterogeneous and multi-relational dataset
represented by a graph. Vertices represent the objects (entities), edges
represent the links (relationships or interactions), and both objects and
links may have attributes.
 Social networks research emerged from sociology, psychology,
statistics and graph theory.
 Based on theoretical graph concepts, a social network interprets the
social relationships of individuals as points and their relationships as
the lines connecting them.
 Data mining techniques in social media:
 Graph mining
 Text mining
Graph Mining
 Graphs (or networks) constitute a prominent data structure and appear
in essentially all forms of information.
 Examples include the web graph and social networks.
 Typically, communities correspond to groups of nodes, where nodes
within the same community (or cluster) tend to be highly similar and
share common features, while nodes of different
communities show low similarity.
 Graph mining is about extracting useful knowledge (patterns, outliers, etc.) from
structured data that can be represented as a graph.
 Graph mining is used for understanding relationships as well as content.
• A phone provider, for example, can analyze phone call records using graph mining.
Text Mining
 Text mining, also known as text analysis, is the process of transforming
unstructured text data into meaningful and actionable information.
 Text mining utilizes different AI technologies to automatically process data
and generate valuable insights, enabling companies to make data-driven
decisions.
 For businesses, the large amount of data generated every day represents both
an opportunity and a challenge.
 On the one side, data helps companies get smart insights on people’s opinions
about a product or service.
 Think about all the potential ideas that you could get from analyzing emails,
product reviews, social media posts, customer feedback, support tickets, etc.
 On the other side, there’s the dilemma of how to process all this data. And
that’s where text mining plays a major role.
 The fundamental steps involved in text mining are:
 Gathering unstructured data from multiple data sources like plain
text, web pages, pdf files, emails, and blogs, to name a few.
 Detect and remove anomalies from data by conducting pre-
processing and cleansing operations.
 Data cleansing allows you to extract and retain the valuable
information hidden within the data and to help identify the roots of
specific words.
 For this, you get a number of text mining tools and text mining
applications.
 Convert all the relevant information extracted from unstructured data
into structured formats.
 Analyze the patterns within the data via the Management
Information System (MIS).
 Store all the valuable information into a secure database to drive
trend analysis and enhance the decision-making process of the
organization.
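 A minimal sketch of converting unstructured text into a structured, numeric format, assuming scikit-learn is available; the example documents are made up.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the product quality is great",
    "terrible support, product arrived broken",
    "great support and great quality",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)           # documents -> TF-IDF feature matrix
print(vec.get_feature_names_out())    # the extracted vocabulary
print(X.shape)                        # (3 documents, number of features)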
S.No.  Data Mining  |  Text Mining
1. Data mining is the statistical technique of processing raw data in a structured form.  |  Text mining is the part of data mining which involves processing of text from documents.
2. Pre-existing databases and spreadsheets are used to gather information.  |  The text is used to gather high-quality information.
3. In data mining, data is stored in a structured format.  |  In text mining, data is stored in an unstructured format.
4. Data is homogeneous and is easy to retrieve.  |  Data is heterogeneous and is not so easy to retrieve.
5. It supports mining of mixed data.  |  In text mining, only mining of text is done.
6. It combines artificial intelligence, machine learning and statistics and applies them to data.  |  It applies pattern recognition and natural language processing to unstructured data.
7. It is used in fields like marketing, medicine, and healthcare.  |  It is used in fields like bioscience and customer profile analysis.
8. It works on structured data from large datasets found in systems such as databases, spreadsheets, ERP, CRM and accounting applications.  |  It works on unstructured textual data found in emails, documents, presentations, videos, file shares, social media and the Internet.
Web Mining
 Web mining is an application of data mining techniques to find
information patterns from the web data.
 Web mining helps to improve the power of web search engine by
identifying the web pages and classifying the web documents.
 The main purpose of web mining is discovering useful information
from the World-Wide Web and its usage patterns.
 Web mining is very useful to e-commerce websites and e-services.
 Applications of Web Mining:
 Web mining helps to improve the power of web search engine by
classifying the web documents and identifying the web pages.
 It is used for Web Searching e.g., Google, Yahoo etc.
 Web mining is used to predict user behavior.
 Web mining is very useful for a particular website and e-service, e.g.,
landing page optimization.
S.No.  Data Mining  |  Web Mining
1. Data Mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system.  |  Web Mining is the process of applying data mining techniques to automatically discover and extract information from web documents.
2. Data Mining is very useful for web page analysis.  |  Web Mining is very useful for a particular website and e-services.
3. Carried out by data scientists and data engineers.  |  Carried out by data scientists along with data analysts.
4. Data Mining accesses data privately.  |  Web Mining accesses data publicly.
5. Techniques include clustering, classification, regression, prediction, optimization and control.  |  Techniques include web content mining and web structure mining.
6. It includes tools like machine learning algorithms.  |  Special tools for web mining are Scrapy, PageRank and Apache logs.
Multirelational Data Mining
 The multi-relational data mining (MRDM) approach has developed as an
alternative way of handling structured data such as that stored in an RDBMS.
 It provides mining over multiple tables directly.
 In MRDM the patterns span multiple tables (relations) of
a relational database.
 Because the data are spread over many tables, this causes
many problems in the practice of data mining.
 To deal with this problem, one either constructs a single table by
Propositionalisation, or uses a Multi-Relational Data Mining
algorithm.
 RDM approaches have been successfully applied in the area of
bioinformatics.
 Three popular pattern finding techniques classification, clustering and
association are frequently used in MRDM.
 The multi-relational approach has developed as an alternative for analyzing
structured data such as relational databases.
 MRDM allows applying data mining directly to multiple tables.
The MRDM technique is used to avoid expensive join operations and
semantic losses.
 An important aspect of data mining algorithms and systems is that they
should scale well to large databases.
 A consequence of this is that most data mining tools are based on
machine learning algorithms that work on data in attribute-value format.
 Experience has proven that such 'single-table' mining algorithms indeed
scale well.
Unit – 4
Assignment Questions (attempt any 5), Marks: 20
 Q.1 What is Cluster analysis? Discuss k-means Algorithm with suitable examples?
 Q.2 Write a short note on following:
a) Unsupervised Learning b) Web Mining
c) Text Mining d) Social Network Analysis
 Q.3 List the differences between clustering and classification. Briefly describe
the hierarchical clustering method.
 Q.4 Suppose we have the following points: (1,1), (2,4), (3,4), (5,8), (6,2), (7,8).
Use k - means algorithm (k = 2) to find two cluster. The distance function is
Euclidean distance.
 Q.5 Define clustering. What are the requirements for cluster analysis?
 Q.6 Explain DBSCAN Algorithm with suitable example.
 Q.7 Describe data stream mining. How is time-series data mined?
Thank You
Great God, Medi-Caps, All the attendees
Mr. Sagar Pandya
sagar.pandya@medicaps.ac.in
www.sagarpandya.tk
LinkedIn: /in/seapandya
Twitter: @seapandya
Facebook: /seapandya
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SETTWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
IJDKP
 
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SETTWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
IJDKP
 
Ad

More from Medicaps University (14)

data mining and warehousing computer science
data mining and warehousing computer sciencedata mining and warehousing computer science
data mining and warehousing computer science
Medicaps University
 
Unit - 5 Pipelining.pptx
Unit - 5 Pipelining.pptxUnit - 5 Pipelining.pptx
Unit - 5 Pipelining.pptx
Medicaps University
 
Unit-4 (IO Interface).pptx
Unit-4 (IO Interface).pptxUnit-4 (IO Interface).pptx
Unit-4 (IO Interface).pptx
Medicaps University
 
UNIT-3 Complete PPT.pptx
UNIT-3 Complete PPT.pptxUNIT-3 Complete PPT.pptx
UNIT-3 Complete PPT.pptx
Medicaps University
 
UNIT-2.pptx
UNIT-2.pptxUNIT-2.pptx
UNIT-2.pptx
Medicaps University
 
UNIT-1 CSA.pptx
UNIT-1 CSA.pptxUNIT-1 CSA.pptx
UNIT-1 CSA.pptx
Medicaps University
 
Scheduling
SchedulingScheduling
Scheduling
Medicaps University
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File Systems
Medicaps University
 
Clock synchronization
Clock synchronizationClock synchronization
Clock synchronization
Medicaps University
 
Distributed Objects and Remote Invocation
Distributed Objects and Remote InvocationDistributed Objects and Remote Invocation
Distributed Objects and Remote Invocation
Medicaps University
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
Medicaps University
 
Association and Classification Algorithm
Association and Classification AlgorithmAssociation and Classification Algorithm
Association and Classification Algorithm
Medicaps University
 
Data Mining
Data MiningData Mining
Data Mining
Medicaps University
 
Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema,...
Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema,...Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema,...
Data Warehousing (Need,Application,Architecture,Benefits), Data Mart, Schema,...
Medicaps University
 
Ad

Recently uploaded (20)

Gas Power Plant for Power Generation System
Gas Power Plant for Power Generation SystemGas Power Plant for Power Generation System
Gas Power Plant for Power Generation System
JourneyWithMe1
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
Elevate Your Workflow
Elevate Your WorkflowElevate Your Workflow
Elevate Your Workflow
NickHuld
 
Crack the Domain with Event Storming By Vivek
Crack the Domain with Event Storming By VivekCrack the Domain with Event Storming By Vivek
Crack the Domain with Event Storming By Vivek
Vivek Srivastava
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
BTech_CSE_LPU_Presentation.pptx.........
BTech_CSE_LPU_Presentation.pptx.........BTech_CSE_LPU_Presentation.pptx.........
BTech_CSE_LPU_Presentation.pptx.........
jinny kaur
 
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution ControlDust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Janapriya Roy
 
Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.
anuragmk56
 
comparison of motors.pptx 1. Motor Terminology.ppt
comparison of motors.pptx 1. Motor Terminology.pptcomparison of motors.pptx 1. Motor Terminology.ppt
comparison of motors.pptx 1. Motor Terminology.ppt
yadavmrr7
 
vlsi digital circuits full power point presentation
vlsi digital circuits full power point presentationvlsi digital circuits full power point presentation
vlsi digital circuits full power point presentation
DrSunitaPatilUgaleKK
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Basic Principles for Electronics Students
Basic Principles for Electronics StudentsBasic Principles for Electronics Students
Basic Principles for Electronics Students
cbdbizdev04
 
Mirada a 12 proyectos desarrollados con BIM.pdf
Mirada a 12 proyectos desarrollados con BIM.pdfMirada a 12 proyectos desarrollados con BIM.pdf
Mirada a 12 proyectos desarrollados con BIM.pdf
topitodosmasdos
 
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Journal of Soft Computing in Civil Engineering
 
Engineering Chemistry First Year Fullerenes
Engineering Chemistry First Year FullerenesEngineering Chemistry First Year Fullerenes
Engineering Chemistry First Year Fullerenes
5g2jpd9sp4
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
Unit III.pptx IT3401 web essentials presentatio
Unit III.pptx IT3401 web essentials presentatioUnit III.pptx IT3401 web essentials presentatio
Unit III.pptx IT3401 web essentials presentatio
lakshitakumar291
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
Gas Power Plant for Power Generation System
Gas Power Plant for Power Generation SystemGas Power Plant for Power Generation System
Gas Power Plant for Power Generation System
JourneyWithMe1
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
Elevate Your Workflow
Elevate Your WorkflowElevate Your Workflow
Elevate Your Workflow
NickHuld
 
Crack the Domain with Event Storming By Vivek
Crack the Domain with Event Storming By VivekCrack the Domain with Event Storming By Vivek
Crack the Domain with Event Storming By Vivek
Vivek Srivastava
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
BTech_CSE_LPU_Presentation.pptx.........
BTech_CSE_LPU_Presentation.pptx.........BTech_CSE_LPU_Presentation.pptx.........
BTech_CSE_LPU_Presentation.pptx.........
jinny kaur
 
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution ControlDust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Janapriya Roy
 
Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.
anuragmk56
 
comparison of motors.pptx 1. Motor Terminology.ppt
comparison of motors.pptx 1. Motor Terminology.pptcomparison of motors.pptx 1. Motor Terminology.ppt
comparison of motors.pptx 1. Motor Terminology.ppt
yadavmrr7
 
vlsi digital circuits full power point presentation
vlsi digital circuits full power point presentationvlsi digital circuits full power point presentation
vlsi digital circuits full power point presentation
DrSunitaPatilUgaleKK
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Basic Principles for Electronics Students
Basic Principles for Electronics StudentsBasic Principles for Electronics Students
Basic Principles for Electronics Students
cbdbizdev04
 
Mirada a 12 proyectos desarrollados con BIM.pdf
Mirada a 12 proyectos desarrollados con BIM.pdfMirada a 12 proyectos desarrollados con BIM.pdf
Mirada a 12 proyectos desarrollados con BIM.pdf
topitodosmasdos
 
Engineering Chemistry First Year Fullerenes
Engineering Chemistry First Year FullerenesEngineering Chemistry First Year Fullerenes
Engineering Chemistry First Year Fullerenes
5g2jpd9sp4
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
Unit III.pptx IT3401 web essentials presentatio
Unit III.pptx IT3401 web essentials presentatioUnit III.pptx IT3401 web essentials presentatio
Unit III.pptx IT3401 web essentials presentatio
lakshitakumar291
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 

Clustering - K-Means, DBSCAN

  • 9. Clustering • Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing. • Clustering can also help marketers discover distinct groups in their customer base. And they can characterize their customer groups based on the purchasing patterns. • In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities and gain insight into structures inherent to populations. • Clustering also helps in identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location. Mr. Sagar Pandya [email protected]
  • 10. Clustering • Clustering also helps in classifying documents on the web for information discovery. • Clustering is also used in outlier detection applications such as detection of credit card fraud.  Clustering Methods  Clustering methods can be classified into the following categories − • Partitioning Method • Hierarchical Method • Density-based Method • Grid-Based Method • Model-Based Method • Constraint-based Method Mr. Sagar Pandya [email protected]
  • 11. Clustering • A good clustering method should meet the following requirements: • The ability to discover some or all of the hidden clusters. • High within-cluster similarity and high between-cluster dissimilarity. • The ability to deal with various types of attributes. • The ability to deal with noise and outliers. • The ability to handle high dimensionality. • Scalability, interpretability and usability. • An important issue in clustering is how to determine the similarity between two objects, so that clusters can be formed from objects with high similarity within clusters and low similarity between clusters.
  • 12. Clustering • Commonly, to measure similarity or dissimilarity between objects, a distance measure such as Euclidean, Manhattan and Minkowski is used. • A distance function returns a lower value for pairs of objects that are more similar to one another. Mr. Sagar Pandya [email protected]
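To make these distance measures concrete, here is a minimal sketch in Python (assuming numpy is available); the Minkowski distance generalizes the other two, with p = 1 giving Manhattan and p = 2 giving Euclidean. The data values are illustrative only.

import numpy as np

def minkowski(x, y, p):
    # Minkowski distance; p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 1))   # Manhattan distance: 5.0
print(minkowski(x, y, 2))   # Euclidean distance: sqrt(13) ~ 3.606
print(minkowski(x, y, 3))   # Minkowski distance with p = 3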
  • 13. Methods of Clustering in Data Mining Mr. Sagar Pandya [email protected]
  • 14. K-Means Algorithm  The Concept  Imagine you’re opening a small book store.  You have a stack of different books, and 3 bookshelves.  Your goal is place similar books in one shelf.  What you would do, is pick up 3 books, one for each shelf in order to set a theme for every shelf.  These books will now dictate which of the remaining books will go in which shelf. Mr. Sagar Pandya [email protected]
  • 15. K-Means Algorithm • Every time you pick a new book up from the stack, you would compare it with those first 3 books, and place this new book on the shelf that has similar books. • You would repeat this process until all the books have been placed.  Once you’re done, you might notice that changing the number of bookshelves, and picking up different initial books for those shelves (changing the theme for each shelf) would increase how well you’ve grouped the books.  So, you repeat the process in hopes of a better outcome. Mr. Sagar Pandya [email protected]
  • 16. K-Means Algorithm  The Algorithm  K-means clustering is a good place to start exploring an unlabeled dataset. The K in K-Means denotes the number of clusters.  This algorithm is bound to converge to a solution after some iterations.  It has 4 basic steps: 1. Initialize cluster centroids (choose those 3 books to start with) 2. Assign data points to clusters (place the remaining books one by one) 3. Update cluster centroids (start over with 3 different books) 4. Repeat steps 2–3 until the stopping condition is met.  You don’t have to start with 3 clusters initially, but 2–3 is generally a good place to start, and you can update the choice later on.
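The four steps above can be sketched in a few lines of Python with numpy. This is an illustrative implementation only (the sample data, the choice of k and the stopping test are assumptions, and empty clusters are not handled):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid as the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)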
  • 36. K Medoids Algorithm  The k-means method is based on centroid techniques to represent a cluster, and it is sensitive to outliers.  This means a data object with an extremely large value may disrupt the distribution of the data.  To overcome this problem, the k-medoids method is used, which is based on representative object techniques.  The centroid is replaced with a medoid to represent the cluster.  A medoid is the most centrally located data object in a cluster.  Here, k data objects are selected randomly as medoids to represent k clusters, and all remaining data objects are placed in the cluster whose medoid is nearest (or most similar) to that data object.
  • 37. K Medoids Algorithm  After processing all data objects, a new medoid is determined for each cluster which can represent the cluster in a better way, and the entire process is repeated.  Again, all data objects are bound to the clusters based on the new medoids.  In each iteration, the medoids change their location step by step.  This process continues until no medoid moves.  As a result, k clusters are found representing a set of n data objects.  The most common k-medoids clustering method is the PAM algorithm (Partitioning Around Medoids).
  • 38. K Medoids Algorithm  The K-Medoids algorithm (also called Partitioning Around Medoids) was proposed in 1987 by Kaufman and Rousseeuw.  A medoid can be defined as the point in the cluster whose total dissimilarity with all the other points in the cluster is minimal.  1. Initialize: select k random points out of the n data points as the medoids.  2. Associate each data point with the closest medoid using any common distance metric.  3. While the cost decreases: for each medoid m and for each data point o which is not a medoid: (a) swap m and o, associate each data point with the closest medoid, and recompute the cost; (b) if the total cost is more than that of the previous step, undo the swap.
  • 39. K Medoids Algorithm  PAM concept: The use of means implies that k-means clustering is highly sensitive to outliers.  This can severely affect the assignment of observations to clusters.  A more robust algorithm is provided by the PAM algorithm.  PAM algorithm: The PAM algorithm is based on the search for k representative objects or medoids among the observations of the data set.  After finding a set of k medoids, clusters are constructed by assigning each observation to the nearest medoid.  Next, each selected medoid m and each non-medoid data point are swapped and the objective function is computed.  The objective function corresponds to the sum of the dissimilarities of all objects to their nearest medoid.
  • 40. K Medoids Algorithm  In summary, the PAM algorithm proceeds in two phases as follows:  Build phase: 1. Select k objects to become the medoids, or, in case these objects were provided, use them as the medoids; 2. Calculate the dissimilarity matrix if it was not provided; 3. Assign every object to its closest medoid;  Swap phase:  4. For each cluster, search whether any object of the cluster decreases the average dissimilarity coefficient; if it does, select the object that decreases this coefficient the most as the medoid for this cluster;  5. If at least one medoid has changed, go to (3); else end the algorithm.
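A simplified sketch of this build-and-swap idea in Python with numpy is shown below. It is not the exact PAM procedure from the slides; it precomputes a full distance matrix and greedily accepts any cost-reducing swap, which is enough to illustrate the principle:

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise distance matrix between all objects
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), k, replace=False)

    def total_cost(meds):
        # Every object contributes its distance to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o                    # try swapping medoid i with non-medoid o
                new_cost = total_cost(candidate)
                if new_cost < cost:                 # keep the swap only if the total cost decreases
                    medoids, cost, improved = candidate, new_cost, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)           # final assignment to the nearest medoid
    return medoids, labels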
  • 48. K Medoids Algorithm  Advantages: 1. It is simple to understand and easy to implement. 2. The K-Medoids algorithm is fast and converges in a fixed number of steps. 3. PAM is less sensitive to outliers than other partitioning algorithms.  Disadvantages: 1. The main disadvantage of the K-Medoids algorithm is that it is not suitable for clustering non-spherical (arbitrarily shaped) groups of objects. This is because it relies on minimizing the distances between the non-medoid objects and the medoid (the cluster centre) – briefly, it uses compactness as the clustering criterion instead of connectivity. 2. It may obtain different results for different runs on the same dataset because the first k medoids are chosen randomly.
  • 49. Hierarchical Clustering Algorithm  Hierarchical Clustering Algorithm also called Hierarchical cluster analysis or HCA is an unsupervised clustering algorithm which involves creating clusters that have predominant ordering from top to bottom.  For e.g: All files and folders on our hard disk are organized in a hierarchy.  The algorithm groups similar objects into groups called clusters. The endpoint is a set of clusters or groups, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.  This clustering technique is divided into two types: 1. Agglomerative Hierarchical Clustering 2. Divisive Hierarchical Clustering , Medi-Caps University, Indore
  • 50. Hierarchical Clustering Algorithm  Agglomerative Hierarchical Clustering: The Agglomerative Hierarchical Clustering is the most common type of hierarchical clustering used to group objects in clusters based on their similarity.  It’s also known as AGNES (Agglomerative Nesting).  It's a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.  How does it work? 1. Make each data point a single-point cluster → forms N clusters 2. Take the two closest data points and make them one cluster → forms N-1 clusters , Medi-Caps University, Indore
  • 51. Hierarchical Clustering Algorithm 3. Take the two closest clusters and make them one cluster → Forms N-2 clusters. 4. Repeat step-3 until you are left with only one cluster.  What is a Dendrogram?  A Dendrogram is a type of tree diagram showing hierarchical relationships between different sets of data.  As already said a Dendrogram contains the memory of hierarchical clustering algorithm, so just by looking at the Dendrogram you can tell how the cluster is formed.  Have a look at the visual representation of Agglomerative Hierarchical Clustering for better understanding:  The point of doing all this is to demonstrate the way hierarchical clustering works, it maintains a memory of how we went through this process and that memory is stored in Dendrogram. , Medi-Caps University, Indore
  • 54. Hierarchical Clustering Algorithm  Note: 1. The distance between data points represents their dissimilarity. 2. The height of the blocks represents the distance between clusters.  So you can observe from the above figure that initially P5 and P6, which are closer to each other than to any other point, are combined into one cluster, followed by P4 getting merged into the same cluster (C2).  Then P1 and P2 get combined into one cluster, followed by P0 getting merged into the same cluster (C4).  Finally, P3 gets merged into cluster C2 and both clusters get merged into one.
  • 55. Hierarchical Clustering Algorithm , Medi-Caps University, Indore  There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. Some of the common linkage methods are: • Single-linkage: the distance between two clusters is defined as the shortest distance between two points in each cluster. This linkage may be used to detect high values in your dataset which may be outliers as they will be merged at the end.
  • 56. Hierarchical Clustering Algorithm • Complete-linkage: the distance between two clusters is defined as the longest distance between two points in each cluster. For example, the distance between clusters “r” and “s” in the figure is equal to the length of the arrow between their two furthest points.
  • 57. Hierarchical Clustering Algorithm , Medi-Caps University, Indore • Average-linkage: the distance between two clusters is defined as the average distance between each point in one cluster to every point in the other cluster. • Centroid-linkage: finds the centroid of cluster 1 and centroid of cluster 2, and then calculates the distance between the two before merging.
  • 58. Hierarchical Clustering Algorithm  Parts of a Dendrogram:  A dendrogram can be a column graph or a row graph.  Some dendrograms are circular or have a fluid shape, but the software will usually produce a row or column graph.  No matter what the shape, the basic graph comprises the same parts: • The clades are the branches and are arranged according to how similar (or dissimilar) they are. • Clades that are close to the same height are similar to each other; clades with different heights are dissimilar — the greater the difference in height, the more dissimilarity.
  • 60. Hierarchical Clustering Algorithm , Medi-Caps University, Indore • Each clade has one or more leaves. • Leaves A, B, and C are more similar to each other than they are to leaves D, E, or F. • Leaves D and E are more similar to each other than they are to leaves A, B, C, or F. • Leaf F is substantially different from all of the other leaves.  A clade can theoretically have an infinite amount of leaves. However, the more leaves you have, the harder the graph will be to read with the naked eye.  One question that might have intrigued you by now is how do you decide when to stop merging the clusters?  You cut the dendrogram tree with a horizontal line at a height where the line can traverse the maximum distance up and down without intersecting the merging point.
  • 61. Hierarchical Clustering Algorithm , Medi-Caps University, Indore  For example in the below figure L3 can traverse maximum distance up and down without intersecting the merging points. So we draw a horizontal line and the number of vertical lines it intersects is the optimal number of clusters.  Number of Clusters in this case = 3.
  • 62. Hierarchical Clustering Algorithm  Let’s see the graphical representation of this algorithm using a dendrogram.  Note: This is just a demonstration of how the actual algorithm works; no calculations have been performed below, and all the proximities among the clusters are assumed.  Let’s say we have six data points A, B, C, D, E, F.
  • 63. Hierarchical Clustering Algorithm , Medi-Caps University, Indore • Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all the other clusters. • Step-2: In the second step comparable clusters are merged together to form a single cluster. Let’s say cluster (B) and cluster (C) are very similar to each other therefore we merge them in the second step similarly with cluster (D) and (E) and at last, we get the clusters [(A), (BC), (DE), (F)] • Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)] • Step-4: Repeating the same process; The clusters DEF and BC are comparable and merged together to form a new cluster. We’re now left with clusters [(A), (BCDEF)]. • Step-5: At last the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
  • 64. Hierarchical Clustering Algorithm  Divisive Hierarchical Clustering  Divisive clustering, or DIANA (DIvisive ANAlysis Clustering), is a top-down clustering method where we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters.  Finally, we proceed recursively on each cluster until there is one cluster for each observation. So this clustering approach is exactly the opposite of agglomerative clustering.  There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but they are conceptually more complex.  In both agglomerative and divisive hierarchical clustering, users need to specify the desired number of clusters as a termination condition (when to stop merging or splitting).
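As a usage illustration of agglomerative clustering, dendrograms and cluster extraction, here is a small example with SciPy's hierarchical-clustering routines (scipy.cluster.hierarchy). The data and the choice of complete linkage are purely for demonstration:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# Build the merge tree bottom-up using complete linkage
Z = linkage(X, method="complete")

# Cut the dendrogram so that 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)          # e.g. [1 1 1 2 2 2]

# dendrogram(Z) draws the tree (requires matplotlib)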
  • 66. DBSCAN Clustering Algorithm , Medi-Caps University, Indore  Clustering analysis is an unsupervised learning method that separates the data points into several specific bunches or groups, such that the data points in the same groups have similar properties and data points in different groups have different properties in some sense.  Centrally, all clustering methods use the same approach i.e. first we calculate similarities and then we use it to cluster the data points into groups or batches.  DBSCAN is well known as Density-based spatial clustering of applications with noise clustering method.  It was proposed by Martin Ester et al. in 1996. DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density.
  • 67. DBSCAN Clustering Algorithm  It can discover clusters of different shapes and sizes from a large amount of data that contains noise and outliers.  K-Means and hierarchical clustering both fail at creating clusters of arbitrary shapes; they are not able to form clusters based on varying densities. That’s why we need DBSCAN clustering. • minPts: the minimum number of points (a threshold) clustered together for a region to be considered dense. • eps (ε): a distance measure that is used to locate the points in the neighborhood of any point. • Core — a point that has at least minPts points within distance eps from itself. • Border — a point that has at least one core point within distance eps.
  • 69. DBSCAN Clustering Algorithm • Noise — a point that is neither a core point nor a border point, i.e. it has fewer than minPts points within distance eps from itself.  These parameters can be understood if we explore two concepts called density reachability and density connectivity.  Reachability in terms of density establishes a point to be reachable from another if it lies within a particular distance (eps) from it.  Connectivity, on the other hand, involves a transitivity-based chaining approach to determine whether points are located in a particular cluster.  For example, points p and q could be connected if p->r->s->t->q, where a->b means b is in the neighborhood of a.  In 2014, the algorithm was awarded the ‘Test of Time’ award at the leading data mining conference, KDD.
  • 70. DBSCAN Clustering Algorithm , Medi-Caps University, Indore  A point X is directly density-reachable from point Y w.r.t epsilon, minPoints if, 1. X belongs to the neighborhood of Y, i.e, dist(X, Y) <= epsilon 2. Y is a core point • Here, X is directly density-reachable from Y, but vice versa is not valid.
  • 71. DBSCAN Clustering Algorithm , Medi-Caps University, Indore  Here, X is density-reachable from Y with X being directly density- reachable from P2, P2 from P3, and P3 from Y. But, the inverse of this is not valid.
  • 72. DBSCAN Clustering Algorithm  The DBSCAN algorithm can be abstracted in the following steps – 1. Find all the neighboring points within eps of every point and identify the core points, i.e. the points with at least MinPts neighbors. 2. For each core point, if it is not already assigned to a cluster, create a new cluster. 3. Recursively find all its density-connected points and assign them to the same cluster as the core point. Two points a and b are said to be density connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density connected to a. 4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.
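For a quick usage sketch, scikit-learn provides a DBSCAN implementation (sklearn.cluster.DBSCAN). The eps and min_samples values below are illustrative and would normally be tuned to the data:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([np.random.randn(100, 2),                # one dense blob
               np.random.randn(100, 2) + 8,            # a second dense blob
               np.random.uniform(-10, 18, (20, 2))])   # scattered noise points

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_              # cluster index per point, -1 means noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", list(labels).count(-1))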
  • 74. Grid-Based Clustering Algorithms  In this, the objects together form a grid.  The object space is quantized into finite number of cells that form a grid structure. Basic Grid-based Algorithm 1. Define a set of grid-cells 2. Assign objects to the appropriate grid cell and compute the density of each cell. 3. Eliminate cells, whose density is below a certain threshold t. 4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function) , Medi-Caps University, Indore
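A minimal sketch of this basic grid-based procedure for 2-D data, using numpy and scipy.ndimage, is given below; the grid resolution and the density threshold are illustrative assumptions:

import numpy as np
from scipy import ndimage

def grid_cluster(X, n_cells=20, threshold=3):
    # 1. Quantize the 2-D object space into a grid and count points per cell
    hist, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=n_cells)
    # 2.-3. Keep only the dense cells (density at or above the threshold)
    dense = hist >= threshold
    # 4. Form clusters from contiguous groups of dense cells
    cell_labels, n_clusters = ndimage.label(dense)
    # Map every point back to the label of its grid cell (0 means no cluster)
    ix = np.clip(np.digitize(X[:, 0], xedges) - 1, 0, n_cells - 1)
    iy = np.clip(np.digitize(X[:, 1], yedges) - 1, 0, n_cells - 1)
    return cell_labels[ix, iy], n_clusters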
  • 75. Grid-Based Clustering Algorithms  Advantages • The major advantage of this method is fast processing time. • It is dependent only on the number of cells in each dimension in the quantized space.  Several interesting methods (in addition to the basic grid-based algorithm)  STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)  CLIQUE: Agrawal, et al. (SIGMOD’98) , Medi-Caps University, Indore
  • 76. Model-Based Clustering Algorithms  In this method, a model is hypothesized for each cluster to find the best fit of data for a given model. This method locates the clusters by clustering the density function. It reflects spatial distribution of the data points.  This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outlier or noise into account. It therefore yields robust clustering methods.  In model-based clustering, the data is considered as coming from a mixture of density.  Each component (i.e. cluster) k is modeled by the normal or Gaussian distribution which is characterized by the parameters:  μk: mean vector,  ∑k: covariance matrix,  An associated probability in the mixture. Each point has a probability of belonging to each cluster. , Medi-Caps University, Indore
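A short usage example of model-based clustering with a Gaussian mixture, using scikit-learn's GaussianMixture (the data and the number of components are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 6])

gm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
labels = gm.predict(X)        # hard assignment to the most likely component
probs = gm.predict_proba(X)   # soft assignment: probability of belonging to each cluster
print(gm.means_)              # mu_k, the mean vector of each component
print(gm.covariances_.shape)  # Sigma_k, one covariance matrix per component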
  • 77. MINING DATA STREAM  Large amount of data streams every day.  Efficient knowledge discovery of such data streams is an emerging active research area in data mining with broad applications.  Data Stream Mining (also known as stream learning) is the process of extracting knowledge structures from continuous, rapid data records.  Data streams typically arrive continuously in high speed with huge amount and changing data distribution.  Data mining techniques which require multiple scans of the entire data sets can not be applied directly to mine stream data, which usually allows only one scan and demands fast response time , Medi-Caps University, Indore
  • 78. MINING DATA STREAM  Imagine a factory with 500 sensors capturing 10 KB of information every second; in one hour nearly 18 GB of information is captured, and about 432 GB daily.  This massive amount of information needs to be analyzed in real time (or in the shortest time possible) to detect irregularities or deviations in the system and react quickly.  Stream mining makes it possible to analyze large amounts of data in real time.  Data stream mining is the process of extracting knowledge from continuous, rapid data records which come to the system in a stream.  A data stream is an ordered sequence of instances in time.  Data stream mining is a process of mining continuously incoming real-time streaming data with acceptable performance.
  • 79. MINING DATA STREAM  Data Stream Mining fulfil the following characteristics:  Continuous Stream of Data: High amount of data in an infinite stream. we do not know the entire dataset  Concept Drifting: The data change or evolves over time  Volatility of data: The system does not store the data received (Limited resources). When data is analyzed it’s discarded or summarized. , Medi-Caps University, Indore
  • 80. MINING DATA STREAM  A data stream is a high-speed, continuous flow of data from diverse sources.  The sources might include remote sensors, scientific processes, stock markets, online transactions, tweets, internet traffic, video surveillance systems, etc.  Generally these streams arrive at high speed with a huge volume of data generated by real-time applications.  Data streams have unique characteristics when compared with traditional datasets: they are potentially infinite, massive, continuous, temporally ordered and fast changing.
  • 81. MINING DATA STREAM  Storing such streams and processing them later is not viable, as that needs a lot of storage and processing power.  For this reason they have to be processed in real time in order to discover knowledge from them, instead of storing and then processing them as in traditional data mining.  The data stream mining procedure includes selecting a part of the stream data, preprocessing, incremental learning and extraction of knowledge in a single pass.  The result of data stream mining is knowledge that can help in taking intelligent decisions.  Thus the processing of data streams poses challenges in terms of the memory and processing power of systems. The general procedure for processing streaming data is presented in the figure.
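To illustrate the single-pass, limited-memory constraint, the following toy sketch (plain Python) keeps only constant-size summary statistics per stream and flags readings that deviate strongly from recent behaviour; the smoothing factor and alert threshold are arbitrary illustrative choices, not values from the slides:

def stream_monitor(stream, alpha=0.05, z_alert=4.0):
    """Single pass over a numeric stream: exponentially weighted mean and variance,
    flagging readings that deviate strongly from recent behaviour."""
    mean, var, alerts = None, 0.0, []
    for t, x in enumerate(stream):
        if mean is None:
            mean = x
            continue
        if var > 0 and abs(x - mean) > z_alert * var ** 0.5:
            alerts.append((t, x))                 # possible irregularity in the stream
        diff = x - mean
        mean += alpha * diff                      # incremental update, O(1) memory
        var = (1 - alpha) * (var + alpha * diff * diff)
    return alerts

readings = [10, 11, 9, 10, 12, 10, 55, 11, 10]    # 55 simulates a sensor spike
print(stream_monitor(readings))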
  • 83. MINING DATA STREAM , Medi-Caps University, Indore What are the Applications?  Telecommunication calling records  Business credit card transaction flows  Network monitoring and traffic engineering  Financial market: stock exchange  Engineering & industrial processes: power supply & manufacturing  Sensor, monitoring & surveillance: video streams, RFIDs  Security monitoring  Web logs and Web page click streams  Massive data sets (even saved but random access is too expensive)
  • 84. MINING DATA STREAM , Medi-Caps University, Indore Software for data stream mining:  MOA (Massive Online Analysis): free open-source software specific for mining data streams with concept drift. It has several machine learning algorithms (classification, regression, clustering, outlier detection and recommender systems).  RapidMiner: commercial software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time-varying concepts, and tracking drifting concept (if used in combination with its data stream mining plugin (formerly: Concept Drift plugin)).  StreamDM: StreamDM is an open source framework for big data stream mining that uses the Spark Streaming.
  • 85. MINING DATA STREAM  When the underlying data is very large in volume, high-speed and continuously flowing, it leads to a number of computational and mining challenges, listed below.  (1) Data contained in data streams is fast changing, high-speed and real-time.  (2) Multiple or random access to data streams is expensive, and in most cases practically impossible.  (3) A huge volume of data has to be processed in limited memory.  (4) A data stream mining system must process high-speed and gigantic data within time limitations.  (5) The data arrive in multidimensional and low-level form, so techniques to mine such data need to be very sophisticated.  (6) Data stream elements change rapidly over time. Thus, data from the past may become irrelevant for the mining.
  • 87. MINING TIME SERIES DATA  A time series is a sequence of data points recorded at specific time points – most often at regular time intervals (seconds, hours, days, months, etc.).  Every organization generates a high volume of data every single day – be it sales figures, revenue, traffic, or operating cost.  Time series data mining can generate valuable information for long-term business decisions, yet it is underutilized in most organizations.  Stock market analysis, economic and sales forecasting, scientific and engineering experiments, medical treatments, etc. can also be treated as sequence data: a sequence database consists of sequences of ordered events (with or without an explicit notion of time), such as web page traversal sequences.  Time-series data can be analyzed to identify correlations, similar or regular patterns, trends and outliers.
  • 88. MINING TIME SERIES DATA , Medi-Caps University, Indore  Industries in all sectors generate and use time series data to make important business decisions.  Using the past data, grocery chain wants to know which time of the year peaks market demands for a particular product; call centers need to forecast future call volumes so they can maintain adequate staffing; credit card companies lookout for fraudulent transactions — all these business decisions benefit from the use of times series data.  Time series data points are snapshots of the past.  Understanding historical events, patterns and trends are some basic indicators that all businesses track.  They want to understand how good they had performed in the past and where they are headed into the future.
  • 89. MINING TIME SERIES DATA , Medi-Caps University, Indore  Basic understanding of historical events through time series data doesn’t require fancy modeling, just plotting data against time can generate very powerful insights.  In the old days, spreadsheets were good enough to create powerful visual stories and insights.  Nowadays most statistical and data analysis tools (e.g. Python, Tableau, PowerBI) can handle time-series data pretty well for creating time series charts, dashboards etc.  Time series data provides a wealth of analytics and application possibilities in all domains of applications.  Historical analysis, forecasting, anomaly detection, and predictive analytics are just a few of those possibilities.
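A small example of this kind of basic exploration with pandas is shown below; the sales numbers, the rolling-window size and the z-score cutoff are made up for illustration:

import pandas as pd

# Daily sales indexed by date (illustrative values)
idx = pd.date_range("2023-01-01", periods=10, freq="D")
sales = pd.Series([100, 102, 98, 105, 300, 104, 99, 101, 97, 103], index=idx)

trend = sales.rolling(window=3).mean()       # simple moving average (trend)
zscore = (sales - sales.mean()) / sales.std()
outliers = sales[zscore.abs() > 2]           # crude outlier flag

print(trend.tail())
print(outliers)                              # the spike of 300 stands out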
  • 90. MINING TIME SERIES DATA , Medi-Caps University, Indore  New analytical frontiers are also emerging with the development of new tools and techniques.  Artificial neural networks (e.g. LSTM) and econometrics are such cutting-edge frontiers in time series data analytics.  Experienced and aspiring data scientists alike can make tremendous contributions to their domains by taking advantage of these tools, or maybe by developing new ones.
  • 91. Mining Sequence Patterns in Transactional Database , Medi-Caps University, Indore  A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time.  There are many applications involving sequence data. Typical examples include customer shopping sequences, Web clickstreams, biological sequences, sequences of events in science and engineering, and in natural and social developments.  “What is sequential pattern mining?” Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns.  An example of a sequential pattern is “Customers who buy a Canon digital camera are likely to buy an HP color printer within a month.”
  • 92. Mining Sequence Patterns in Transactional Database , Medi-Caps University, Indore  Other areas in which sequential patterns can be applied include Web access pattern analysis, weather prediction, production processes, and network intrusion detection.  The sequential pattern mining problem was first introduced by Agrawal and Srikant in 1995 [AS95] based on their study of customer purchase sequences, as follows:  “Given a set of sequences, where each sequence consists of a list of events (or elements) and each event consists of a set of items, and given a user-specified minimum support threshold of min sup, sequential pattern mining finds all frequent subsequences, that is, the subsequences whose occurrence frequency in the set of sequences is no less than min sup.”
  • 93. Mining Sequence Patterns in Transactional Database  Consider the sequence database S given in the table, and let min_sup = 2.  The set of items in the database is {a, b, c, d, e, f, g}. The database contains four sequences.  Let’s look at sequence 1, which is {a(abc)(ac)d(cf)}.  It has five events, namely (a), (abc), (ac), (d), and (cf), which occur in the order listed.  Items a and c each appear more than once in different events of the sequence.  There are nine instances of items in sequence 1; therefore, it has a length of nine and is called a 9-sequence.  Item a occurs three times in sequence 1 and so contributes three to the length of the sequence.
  • 94. Mining Sequence Patterns in Transactional Database , Medi-Caps University, Indore  However, the entire sequence contributes only one to the support of {a}.  Sequence {a(bc)df} is a subsequence of sequence 1 since the events of the former are each subsets of events in sequence 1, and the order of events is preserved.  Consider subsequence s = {(ab)c}.  Looking at the sequence database, S, we see that sequences 1 and 3 are the only ones that contain the subsequence s.  The support of s is thus 2, which satisfies minimum support.
  • 95. Mining Sequence Patterns in Transactional Database , Medi-Caps University, Indore  Therefore, s is frequent, and so we call it a sequential pattern.  It is a 3-pattern since it is a sequential pattern of length three.
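The support computation in this example can be expressed directly in code. The sketch below (plain Python) tests whether a candidate is a subsequence of a data sequence (each event is an itemset, and event order must be preserved) and counts its support. Note that only sequence 1 is spelled out in the slide text; sequences 2–4 below follow the standard textbook example from Han, Kamber and Pei and should be read as an assumption:

def is_subsequence(candidate, sequence):
    """True if every event of the candidate is a subset of some later event of the
    sequence, preserving order."""
    pos = 0
    for event in candidate:
        while pos < len(sequence) and not event <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(candidate, database):
    return sum(is_subsequence(candidate, seq) for seq in database)

# Sequence database S (sequence 1 matches the slide; 2-4 assumed from the textbook example)
S = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],     # sequence 1
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],                 # sequence 2
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],          # sequence 3
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],             # sequence 4
]
s = [{'a', 'b'}, {'c'}]           # the candidate <(ab) c>
print(support(s, S))              # 2  -> frequent, since min_sup = 2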
  • 96. Sequence Database and Transaction Database , Medi-Caps University, Indore  A sequence database is a set of sequences where each sequence is a list of itemsets.  An itemset is an unordered set of items.  For example, the table shown below contains four sequences.  The first sequence, named S1, contains 5 itemsets.  It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6.  Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered.
  • 97. Sequence Database to a Transaction Database
  Sequence database:
  ID   Sequence
  S1   (1), (1 2 3), (1 3), (4), (3 6)
  S2   (1 4), (3), (2 3), (1 5)
  S3   (5 6), (1 2), (4 6), (3), (2)
  S4   (5), (7), (1 6), (3), (2), (3)
  Resulting transaction database:
  Transaction id   Items
  t1               {1, 2, 3, 4, 6}
  t2               {1, 2, 3, 4, 5}
  t3               {1, 2, 3, 4, 5, 6}
  t4               {1, 2, 3, 5, 6, 7}
  • 98. Transaction Database and Sequence Database  A transaction database is a set of transactions. Each transaction is a set of items.  For example, consider the following transaction database. It contains four transactions (t1, ..., t4) and five items (1, 2, 3, 4, 5).  The first transaction represents the set of items 1, 3 and 4.
  Transaction id   Items
  t1               {1, 3, 4}
  t2               {2, 3, 5}
  t3               {1, 2, 3, 5}
  t4               {2, 5}
  • 99. Transaction Database to a Sequence Database  A sequence database is a set of sequences. Each sequence is an ordered list of itemsets. Each itemset is an unordered set of items (symbols) represented by positive integers.  The output for this example is the following sequence database. It contains four sequences. The first sequence indicates that item 1 is followed by item 3, which is followed by item 4.
  Sequence id   Itemsets
  s1            {1}, {3}, {4}
  s2            {2}, {3}, {5}
  s3            {1}, {2}, {3}, {5}
  s4            {2}, {5}
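Both conversions shown above are mechanical and can be sketched in a few lines of Python (the helper names are illustrative):

def sequence_to_transaction(sequence):
    # Union of all itemsets in the sequence, losing the ordering information
    items = set()
    for itemset in sequence:
        items |= set(itemset)
    return sorted(items)

def transaction_to_sequence(transaction):
    # Each item becomes its own single-item itemset, taken in sorted order
    return [{item} for item in sorted(transaction)]

s1 = [(1,), (1, 2, 3), (1, 3), (4,), (3, 6)]
print(sequence_to_transaction(s1))           # [1, 2, 3, 4, 6]  -> transaction t1
print(transaction_to_sequence({1, 3, 4}))    # [{1}, {3}, {4}]  -> sequence s1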
  • 100. Social Network Analysis , Medi-Caps University, Indore  A social network is defined as a social structure of individuals, who are related (directly or indirectly to each other) based on a common relation of interest, e.g. friendship, trust, etc.  Social network analysis is the study of social networks to understand their structure and behavior.  Social network analysis has gained prominence due to its use in different applications - from product marketing (e.g. viral marketing) to search engines and organizational dynamics (e.g. management).  Recently there has been a rapid increase in interest regarding social network analysis in the data mining community.  The basic motivation is the demand to exploit knowledge from copious amounts of data collected, pertaining to social behavior of users in online environments.
  • 101. Social Network Analysis , Medi-Caps University, Indore  A social network is a heterogeneous and multi relational dataset represented by a graph. Vertexes represent the objects (entities), edges represent the links (relationships or interaction) and both objects and links may have attributes.  Social networks research emerged from sociology, psychology, statistics and graph theory.  Based on theoretical graph concepts, a social network interprets the social relationships of individuals as points and their relationships as the lines connecting them.  Data mining technique in social media  GRAPH MINING  TEXT MINING
  • 102. Graph Mining  Graphs (or networks) constitute a prominent data structure and appear in essentially all forms of information.  Examples include the web graph and social networks.  Typically, communities correspond to groups of nodes, where nodes within the same community (or cluster) tend to be highly similar and share common features, while nodes of different communities show low similarity.  Graph mining is about extracting useful knowledge (patterns, outliers, etc.) from structured data that can be represented as a graph.  Graph mining is used for understanding relationships as well as content. • For example, a phone provider can analyze phone call records using graph mining.
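As a toy illustration of graph mining on a small call graph, using the networkx library (the edge list is invented, and connected components are used here as a very coarse stand-in for community detection):

import networkx as nx

# A tiny call graph: an edge means "these two numbers called each other"
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"),
                  ("C", "D"), ("E", "F"), ("F", "G"), ("E", "G")])

print(nx.degree_centrality(G))               # who is the most connected node
components = nx.connected_components(G)      # coarse "communities": connected groups
print([sorted(c) for c in components])       # [['A', 'B', 'C', 'D'], ['E', 'F', 'G']]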
  • 105. Text Mining , Medi-Caps University, Indore  Text mining, also known as text analysis, is the process of transforming unstructured text data into meaningful and actionable information.  Text mining utilizes different AI technologies to automatically process data and generate valuable insights, enabling companies to make data-driven decisions.  For businesses, the large amount of data generated every day represents both an opportunity and a challenge.  On the one side, data helps companies get smart insights on people’s opinions about a product or service.  Think about all the potential ideas that you could get from analyzing emails, product reviews, social media posts, customer feedback, support tickets, etc.  On the other side, there’s the dilemma of how to process all this data. And that’s where text mining plays a major role.
  • 106. Text Mining , Medi-Caps University, Indore  The fundamental steps involved in text mining are:  Gathering unstructured data from multiple data sources like plain text, web pages, pdf files, emails, and blogs, to name a few.  Detect and remove anomalies from data by conducting pre- processing and cleansing operations.  Data cleansing allows you to extract and retain the valuable information hidden within the data and to help identify the roots of specific words.  For this, you get a number of text mining tools and text mining applications.  Convert all the relevant information extracted from unstructured data into structured formats.
  • 107. Text Mining, Medi-Caps University, Indore  Analyze the patterns within the data via the Management Information System (MIS).  Store all the valuable information in a secure database to drive trend analysis and enhance the organization's decision-making process.
  • 108. Text Mining, Medi-Caps University, Indore  Data mining vs. text mining (rows 1-4):  1. Data mining: the statistical technique of processing raw data in a structured form. Text mining: the part of data mining which involves processing of text from documents.  2. Data mining: pre-existing databases and spreadsheets are used to gather information. Text mining: text is used to gather high-quality information.  3. Data mining: data is stored in a structured format. Text mining: data is stored in an unstructured format.  4. Data mining: data is homogeneous and easy to retrieve. Text mining: data is heterogeneous and not so easy to retrieve.
  • 109. Text Mining, Medi-Caps University, Indore  Data mining vs. text mining (rows 5-8):  5. Data mining: supports mining of mixed data. Text mining: only text is mined.  6. Data mining: combines artificial intelligence, machine learning and statistics and applies them to data. Text mining: applies pattern recognition and natural language processing to unstructured data.  7. Data mining: used in fields like marketing, medicine and healthcare. Text mining: used in fields like bioscience and customer profile analysis.  8. Data mining: structured data from large datasets found in systems such as databases, spreadsheets, ERP, CRM and accounting applications. Text mining: unstructured textual data found in emails, documents, presentations, videos, file shares, social media and the Internet.
  • 110. Web Mining, Medi-Caps University, Indore  Web mining is the application of data mining techniques to find information patterns in web data.  Web mining helps improve the power of web search engines by identifying web pages and classifying web documents.  The main purpose of web mining is discovering useful information from the World Wide Web and its usage patterns.  Web mining is very useful to e-commerce websites and e-services.  Applications of Web Mining:  Web mining helps improve the power of web search engines by classifying web documents and identifying web pages.  It is used for web search, e.g. Google, Yahoo, etc.
  • 111. Web Mining, Medi-Caps University, Indore  Web mining is used to predict user behavior.  Web mining is very useful for a particular website and e-service, e.g. landing page optimization.
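As a small, hypothetical example of web usage mining, the sketch below counts page visits from simplified web-server log lines. The log format shown is a reduced version of typical access logs and the entries are made up; real analyses would parse full logs and also reconstruct user sessions and click paths.

```python
from collections import Counter

log_lines = [
    '10.0.0.1 - - [01/Mar/2021:10:00:01] "GET /home HTTP/1.1" 200',
    '10.0.0.2 - - [01/Mar/2021:10:00:05] "GET /products HTTP/1.1" 200',
    '10.0.0.1 - - [01/Mar/2021:10:00:09] "GET /products HTTP/1.1" 200',
    '10.0.0.3 - - [01/Mar/2021:10:00:12] "GET /home HTTP/1.1" 200',
]

def requested_page(line):
    """Extract the requested URL path from a simplified access-log line."""
    return line.split('"')[1].split()[1]

# Count how often each page was requested, most popular first.
page_hits = Counter(requested_page(line) for line in log_lines)
for page, hits in page_hits.most_common():
    print(f"{page}: {hits} hits")
```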
  • 112. Web Mining, Medi-Caps University, Indore  Data mining vs. web mining:  1. Data mining: the process that attempts to discover patterns and hidden knowledge in large data sets in any system. Web mining: the application of data mining techniques to automatically discover and extract information from web documents.  2. Data mining: very useful for web page analysis. Web mining: very useful for a particular website and e-service.  3. Data mining: carried out by data scientists and data engineers. Web mining: carried out by data scientists along with data analysts.  4. Data mining: data is accessed privately. Web mining: data is accessed publicly.  5. Data mining: clustering, classification, regression, prediction, optimization and control. Web mining: web content mining, web structure mining.  6. Data mining: includes tools such as machine learning algorithms. Web mining: special tools such as Scrapy, PageRank and Apache logs.
  • 113. Multirelational Data Mining, Medi-Caps University, Indore  Multi-relational data mining (MRDM) has developed as an alternative way of handling structured data such as that stored in an RDBMS.  It allows mining multiple tables directly.  In MRDM, the patterns are found across multiple tables (relations) of a relational database.  Because the data is spread over many tables, several problems arise in the practice of data mining.  To deal with this, one either constructs a single table by propositionalisation or uses a multi-relational data mining algorithm.  MRDM approaches have been successfully applied in the area of bioinformatics.
  • 114. Multirelational Data Mining, Medi-Caps University, Indore  Three popular pattern-finding techniques, classification, clustering and association, are frequently used in MRDM.  The multi-relational approach has developed as an alternative for analyzing structured data such as relational databases.  MRDM allows data mining to be applied directly to multiple tables, and the technique is used to avoid expensive join operations and semantic loss.  An important aspect of data mining algorithms and systems is that they should scale well to large databases.  A consequence of this is that most data mining tools are based on machine learning algorithms that work on data in attribute-value format.  Experience has shown that such 'single-table' mining algorithms indeed scale well.
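The following sketch illustrates propositionalisation as mentioned on the previous slide: a one-to-many relation (customers and their orders) is summarised into a single attribute-value table that ordinary single-table mining algorithms can consume. The tables and column names are hypothetical, and the pandas library is assumed to be installed.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Indore", "Bhopal", "Indore"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 3],
    "amount": [250.0, 100.0, 400.0, 80.0],
})

# Aggregate the "many" side into per-customer features, then join once,
# instead of repeatedly joining tables inside the mining algorithm.
order_features = orders.groupby("customer_id")["amount"].agg(
    n_orders="count", total_spent="sum"
).reset_index()

single_table = customers.merge(order_features, on="customer_id", how="left")
print(single_table)
```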
  • 115. Unit – 4 Any 5 Assignment Questions Marks: 20, Medi-Caps University, Indore  Q.1 What is cluster analysis? Discuss the k-means algorithm with suitable examples.  Q.2 Write a short note on the following: a) Unsupervised Learning b) Web Mining c) Text Mining d) Social Network Analysis  Q.3 List the differences between clustering and classification. Briefly describe the hierarchical clustering method.  Q.4 Suppose we have the following points: (1,1), (2,4), (3,4), (5,8), (6,2), (7,8). Use the k-means algorithm (k = 2) to find two clusters. The distance function is Euclidean distance.  Q.5 Define clustering. What are the requirements for cluster analysis?  Q.6 Explain the DBSCAN algorithm with a suitable example.  Q.7 Describe data stream mining. How is time-series data mined?
  • 117. Thank You Great God, Medi-Caps, All the attendees Mr. Sagar Pandya [email protected] www.sagarpandya.tk LinkedIn: /in/seapandya Twitter: @seapandya Facebook: /seapandya
