
UNIT IV CLASSIFICATION AND CLUSTERING

DECISION TREE INDUCTION

A decision tree is a supervised learning method used in data mining for classification and
regression tasks. It is a tree structure that supports decision-making: the decision tree
represents a classification or regression model that is built by repeatedly separating a data
set into smaller subsets while the tree is developed incrementally. The final tree consists of
decision nodes and leaf nodes. A decision node has at least two branches, while a leaf node
represents a classification or decision and cannot be split further. The uppermost decision
node in a tree, which corresponds to the best predictor, is called the root node. Decision
trees can deal with both categorical and numerical data.

Key factors:

Entropy:

Entropy is a common measure of impurity. In a decision tree, it measures the randomness or
impurity in a data set: for a partition in which the classes occur with proportions p1, ..., pm,
the entropy is -(p1 log2 p1 + ... + pm log2 pm).


Information Gain

Information gain is the reduction in entropy after the dataset is split on an attribute; it is
also called entropy reduction. Building a decision tree is essentially a matter of discovering
the attributes that return the highest information gain.


In short, a decision tree is like a flowchart in which the terminal nodes represent decisions.
Starting with the dataset, we measure the entropy to find a way to segment the set, and we
keep splitting until the data in each subset belongs to the same class.
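
To make the two measures concrete, here is a minimal Python sketch (an illustration added for this unit, using a hypothetical toy data set) that computes entropy and the information gain of splitting on one attribute:

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2 p)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy reduction obtained by splitting records on one attribute."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in set(r[attribute] for r in records):
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += (len(subset) / len(records)) * entropy(subset)
    return base - remainder

# Toy data set: does a person buy a computer?
data = [
    {"age": "youth", "income": "high", "buys": "no"},
    {"age": "youth", "income": "low", "buys": "no"},
    {"age": "senior", "income": "high", "buys": "yes"},
    {"age": "senior", "income": "low", "buys": "yes"},
]
print(information_gain(data, "age", "buys"))     # 1.0 (a perfect split)
print(information_gain(data, "income", "buys"))  # 0.0 (no information)

Here "age" gives the highest information gain, so a decision tree would split on it first.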

Decision Tree Algorithm:

The decision tree induction algorithm may appear long, but the basic technique is quite
simple and proceeds as follows:

The algorithm is based on three parameters: D, attribute_list, and Attribute_selection_method.

Generally, we refer to D as a data partition. Initially, D is the entire set of training tuples and
their associated class labels (the input training data).

The parameter attribute_list is the set of attributes describing the tuples.

Attribute_selection_method specifies a heuristic procedure for choosing the attribute that
"best" discriminates the given tuples according to class; it applies an attribute selection
measure such as information gain.
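
The overall procedure can be sketched as a short recursive function. The following is a simplified ID3-style illustration (names such as build_tree are hypothetical; it reuses the information_gain helper and toy data from the earlier sketch):

from collections import Counter

def build_tree(D, attribute_list, target="buys"):
    """Simplified recursive decision tree induction (ID3-style sketch)."""
    labels = [r[target] for r in D]
    # Stop if all tuples in the partition belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # If no attributes remain, return a majority-class leaf.
    if not attribute_list:
        return Counter(labels).most_common(1)[0][0]
    # Attribute_selection_method: pick the attribute with the highest information gain.
    best = max(attribute_list, key=lambda a: information_gain(D, a, target))
    node = {best: {}}
    remaining = [a for a in attribute_list if a != best]
    for value in set(r[best] for r in D):
        partition = [r for r in D if r[best] == value]
        node[best][value] = build_tree(partition, remaining, target)
    return node

print(build_tree(data, ["age", "income"]))
# {'age': {'youth': 'no', 'senior': 'yes'}}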

BAYESIAN CLASSIFICATION

In numerous applications, the connection between the attribute set and the class variable
is non-deterministic. In other words, the class label of a test record cannot be predicted
with certainty even when its attribute set is identical to that of some training examples.
These circumstances may arise because of noisy data or because of confounding factors
that influence classification but are not included in the analysis. For example, consider the
task of predicting whether an individual is at risk of liver disease based on eating and
exercise habits. Although most people who eat healthily and exercise consistently have a
lower probability of developing liver disease, some may still do so because of other factors,
such as consumption of high-calorie street food or alcohol abuse. Determining whether an
individual's eating routine is healthy or their exercise is sufficient is also a matter of
interpretation, which in turn may introduce uncertainty into the learning problem.

Bayesian classification uses Bayes' theorem to predict the probability of an event. Bayesian
classifiers are statistical classifiers built on the Bayesian understanding of probability: the
theorem expresses how a degree of belief, expressed as a probability, should be updated in
the light of evidence.

Bayes' theorem is named after Thomas Bayes, who first used conditional probability to
provide an algorithm that uses evidence to calculate limits on an unknown parameter.

Bayes' theorem is expressed mathematically by the following equation:

P(X/Y) = P(Y/X) · P(X) / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X/Y) is the conditional probability of event X occurring given that Y is true.

P(Y/X) is the conditional probability of event Y occurring given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other. This is
known as the marginal probability.

Bayesian interpretation:

In the Bayesian interpretation, probability expresses a "degree of belief." Bayes' theorem
connects the degree of belief in a hypothesis before and after accounting for evidence. For
example, consider a coin. If we toss a fair coin, we get either heads or tails, each with a
probability of 50%. If the coin is flipped a number of times and the outcomes are observed,
the degree of belief may rise, fall, or remain the same depending on those outcomes.

For proposition X and evidence Y,

P(X), the prior, is the initial degree of belief in X.

P(X/Y), the posterior, is the degree of belief after accounting for Y.

The quotient P(Y/X) / P(Y) represents the support Y provides for X.

Bayes' theorem can be derived from the definition of conditional probability:

P(X/Y) = P(X⋂Y) / P(Y) and P(Y/X) = P(X⋂Y) / P(X),

where P(X⋂Y) is the joint probability of both X and Y being true. Equating the two
expressions for P(X⋂Y) gives P(X/Y) = P(Y/X) · P(X) / P(Y).
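
As a small numeric illustration (the numbers below are hypothetical, chosen only for the example), Bayes' theorem can be applied directly in Python:

# Hypothetical screening example: X = "person has the disease", Y = "test is positive".
p_x = 0.01              # prior P(X): 1% of the population has the disease
p_y_given_x = 0.95      # likelihood P(Y/X): test sensitivity
p_y_given_not_x = 0.05  # false-positive rate P(Y/not X)

# Marginal probability of a positive test, P(Y).
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Posterior P(X/Y) via Bayes' theorem.
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))  # about 0.161

Even with a sensitive test, the posterior stays modest because the prior P(X) is small, which is exactly the belief-updating behaviour described above.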

Bayesian network:

A Bayesian network falls under the class of Probabilistic Graphical Models (PGMs), which
are used to compute uncertainties using the concept of probability. Also known as belief
networks, Bayesian networks represent uncertainty using Directed Acyclic Graphs (DAGs).

A directed acyclic graph is used to represent a Bayesian network; like any other statistical
graph, a DAG consists of a set of nodes and links, where the links signify the relationships
between the nodes.

The nodes here represent random variables, and the edges define the relationship between
these variables.

A DAG models the uncertainty of an event occurring based on the Conditional Probability
Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to
represent the CPD of each variable in the network.

What is Rule-based Classification in Data Mining?

Rule-based classification in data mining is a technique in which class decisions are made
using a set of "if ... then" rules. Thus, we define it as a classification type governed by a set
of IF-THEN rules. We write an IF-THEN rule as:

“IF condition THEN conclusion.”

IF-THEN Rule

To define the IF-THEN rule, we can split it into two parts:


Rule Antecedent: This is the "if" (condition) part of the rule, written on the left-hand side
(LHS). The antecedent may contain one or more attribute conditions combined with the
logical AND operator.

Rule Consequent: This is the right-hand side (RHS) of the rule and consists of the class
prediction.

Example:

R1: IF tutor = codingNinja AND student = yes

THEN happyLearning = true

Assessment of Rule

In rule-based classification in data mining, there are two factors based on which we can
assess the rules. These are:

Coverage of Rule: The fraction of the records which satisfy the antecedent conditions of a
particular rule is called the coverage of that rule.

We can calculate this by dividing the number of records satisfying the rule(n1) by the total
number of records(n).

Coverage(R) = n1/n

Accuracy of a rule: The fraction of the records that satisfy the antecedent conditions and
meet the consequent values of a rule is called the accuracy of that rule.

We can calculate this by dividing the number of records satisfying the consequent
values(n2) by the number of records satisfying the rule(n1).

Accuracy(R) = n2/n1
Generally, we convert these values into percentages by multiplying them by 100, which
makes them easier for a layperson to understand.
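
The two measures can be computed directly. Below is a small sketch with hypothetical records for the rule R1 above (the record layout is assumed for illustration):

# Hypothetical records: each tuple is (tutor, student, happyLearning).
records = [
    ("codingNinja", "yes", True),
    ("codingNinja", "yes", True),
    ("codingNinja", "no",  False),
    ("other",       "yes", False),
    ("codingNinja", "yes", False),
]

# Rule R1: IF tutor = codingNinja AND student = yes THEN happyLearning = true
antecedent = lambda r: r[0] == "codingNinja" and r[1] == "yes"
consequent = lambda r: r[2] is True

n  = len(records)
n1 = sum(1 for r in records if antecedent(r))                    # records satisfying the antecedent
n2 = sum(1 for r in records if antecedent(r) and consequent(r))  # ... and also the consequent

print("Coverage(R1) =", n1 / n)    # 3/5 = 0.6
print("Accuracy(R1) =", n2 / n1)   # 2/3 ≈ 0.67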

Properties of Rule-Based Classifiers

There are two significant properties of rule-based classification in data mining. They are:

Rules may not be mutually exclusive

Rules may not be exhaustive

Rules may not be mutually exclusive in nature

Many different rules are generated from the dataset, so it is possible, and even likely, that
several of them are satisfied by the same data record. This makes the rules not mutually
exclusive.

Since the rules are not mutually exclusive, a single record may be covered by rules that
predict different classes, even though our objective is to assign each record to exactly one
class. To solve this problem, we have two ways:

The first way is to use an ordered set of rules. By ordering the rules, we set priority among
them; such an ordered rule set is called a decision list, and the class of the highest-priority
rule that fires is taken as the final class.

The second solution is to assign votes to each class according to the weights of the rules
that fire; in this case, the rule set remains unordered.

Rules may not be exhaustive in nature

There is no guarantee that the rules will cover all the data entries; some records may not be
matched by any rule, which makes the rule set non-exhaustive. To solve this problem, we
can use a default class: every data entry not covered by any rule is assigned to the default
class, which resolves the problem of non-exhaustiveness.
CLASSIFICATION BY BACK PROPAGATION

Backpropagation is a supervised learning algorithm commonly used to train neural networks
for classification tasks in data mining. It involves propagating errors backward through the
network to adjust the weights and biases, iteratively improving the model's performance.

Input Layer:

Represents the features of the dataset.

Hidden Layers:

Neurons in these layers process information, applying weights and biases.

Output Layer:

Produces the final classification result.

Forward Pass:

Input data is fed forward through the network to produce an initial output.

Calculate Error:

The difference between predicted and actual outputs is computed.

Backward Pass (Backpropagation):

Errors are propagated backward, updating weights and biases using gradient descent.

Iterative Process:

The forward pass, error calculation, and backward pass are repeated for multiple epochs to refine the model.

Activation Function:

Non-linear activation functions introduce non-linearity to the model, allowing it to learn
complex patterns.

Loss Function:

Measures the difference between predicted and actual outputs, guiding the weight
adjustments.

Learning Rate:

Controls the size of weight adjustments during each iteration, preventing overshooting or
slow convergence.
Backpropagation is an effective method for training neural networks on classification tasks,
adapting the model to the underlying patterns in the data.
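
The steps above can be condensed into a short numerical sketch. The network size, learning rate, and data below are illustrative choices, not part of the original text:

import numpy as np

# Tiny fully connected network (2 inputs -> 4 hidden -> 1 output) trained with
# backpropagation on the XOR problem.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5   # learning rate

for epoch in range(10000):
    # Forward pass: input -> hidden -> output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Error at the output and its backward propagation through the layers.
    grad_out = (out - y) * out * (1 - out)       # gradient at the output layer
    grad_h = (grad_out @ W2.T) * h * (1 - h)     # error propagated to the hidden layer

    # Gradient-descent updates of weights and biases.
    W2 -= lr * h.T @ grad_out;  b2 -= lr * grad_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ grad_h;    b1 -= lr * grad_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())  # typically close to [0, 1, 1, 0] after training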

SUPPORT VECTOR MACHINES IN DATA MINING

Support Vector Machines (SVMs) are powerful tools in data mining for classification and
regression tasks. They work by finding the hyperplane that best separates different classes
in the feature space. SVMs aim to maximize the margin between classes, which is the
distance between the hyperplane and the nearest data points from each class.

Key points about SVMs in data mining:

Linear Separation: SVMs are effective when classes can be separated by a linear boundary.
However, they can be extended to handle non-linear boundaries through techniques like
the kernel trick.

Kernel Trick: SVMs can use a kernel function to implicitly map the input data into a higher-
dimensional space, allowing for the handling of non-linear relationships between features.

Margin Maximization: SVMs focus on maximizing the margin between classes, which
enhances generalization and helps avoid overfitting.

Support Vectors: The data points that lie closest to the hyperplane are known as support
vectors. They play a crucial role in defining the decision boundary.

C Parameter: SVMs have a regularization parameter (C) that influences the trade-off
between achieving a smooth decision boundary and classifying training points correctly.

Multi-class Classification: SVMs can be extended to handle multi-class classification
problems through strategies like one-vs-one or one-vs-all.

Sensitivity to Outliers: SVMs can be sensitive to outliers since they heavily depend on
support vectors, which are the closest points to the decision boundary.
Applications: SVMs find applications in various fields, including image classification, text
categorization, bioinformatics, and more.

When using SVMs in data mining, it’s essential to consider the nature of the data, choose
appropriate kernel functions, and fine-tune parameters for optimal performance.
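
A minimal usage sketch, assuming scikit-learn is available (the synthetic data and parameter values are purely illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# An RBF kernel handles non-linear boundaries; C controls the margin/accuracy trade-off.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("test accuracy:", clf.score(X_test, y_test))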

ASSOCIATIVE CLASSIFICATION

Associative Classification in Data Mining

Data mining is an effective process that includes drawing insightful conclusions and
patterns from vast amounts of data. Its importance rests in the capacity to unearth buried
information, spot trends, and make wise judgments based on the information recovered.

A crucial data mining approach called associative classification focuses on identifying
connections and interactions between different variables in a dataset. Its goal is to find
relationships and patterns among attributes so that future events can be predicted or new
instances can be classified. Associative classification can be used to uncover useful
patterns that help businesses and organizations better understand their data, make
data-driven choices, and improve their operations.

This method offers a thorough framework to identify intricate linkages in data, resulting in
insightful information and prospective advancements in a range of industries, including
marketing, finance, healthcare, and more. We’ll be talking about associative classification
in data mining in this post. Let’s begin.

Understanding Associative Classification

Understanding associative classification is essential to realizing its full potential in data
mining. It entails identifying correlations and links between the attributes of a dataset,
which in turn makes prediction and classification tasks easier. The fundamental goal of
associative classification is to identify patterns connecting different variables by using
association rule mining techniques.
Rule creation, rule assessment, and rule selection are generally the three main steps in the
process. Rules are first generated from the dataset and are then evaluated for quality and
importance; rule selection then weeds out unimportant or inapplicable rules in order to
improve the accuracy and relevance of the classification process. A few benefits of
associative classification are its capacity to handle complicated data relationships and
high-dimensional datasets and to produce comprehensible rules.

The computational complexity on big datasets, sensitivity to noise and irrelevant features,
and a possible trade-off between accuracy and interpretability are some of its drawbacks.
Nevertheless, being aware of these factors enables data analysts to employ associative
classification efficiently and to base decisions on the discovered patterns.

Techniques and Algorithms

Apriori Algorithm and Its Role in Associative Classification

In associative classification, the Apriori algorithm is a key method for identifying frequent
itemsets. The algorithm iteratively finds itemsets that meet a minimum support threshold,
revealing strong associations between attributes. Its main function in associative
classification is to produce the set of frequent itemsets from which association rules may
be derived.

The method prunes the search space efficiently by exploiting the "apriori property": every
subset of a frequent itemset must itself be frequent, so any candidate itemset containing
an infrequent subset can be discarded.
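
A minimal sketch of this idea is shown below (a simplified illustration, not a production implementation; transactions are plain Python sets and min_support is an absolute count):

from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets and their support counts (minimal Apriori sketch)."""
    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    level = count({frozenset([i]) for t in transactions for i in t})
    frequent, k = {}, 2
    while level:
        frequent.update(level)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (apriori property): every (k-1)-subset must already be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = count(candidates)
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "butter"}]
print(apriori(baskets, min_support=2))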

Fuzzy Association Rule Mining and Its Applications

Fuzzy association rule mining is an extension of conventional association rule mining that
addresses ambiguity and imprecision in data. In datasets where attributes have degrees of
membership rather than binary values, it enables the discovery of relationships.

In fields like medical diagnosis or consumer behavior research, where ambiguity and
vagueness are common, fuzzy association rule mining is very helpful. This method uses
fuzzy logic to generate rules and identify correlations, allowing for more informed
decision−making and the identification of patterns in large datasets.
Evaluation and Validation

Metrics of association rules

To assess the value and importance of the association rules produced by associative
classification, several metrics are used; support, confidence, and lift are the most common.
These measures quantify the strength of the associations, the accuracy of the predictions,
and the interestingness of the patterns found.
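
Continuing the toy basket example above, these metrics can be computed directly from their standard definitions (the rule and data are hypothetical):

# Support, confidence and lift for the rule A -> B.
baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "butter"}]
A, B = {"milk"}, {"bread"}

n = len(baskets)
support_A  = sum(1 for t in baskets if A <= t) / n
support_B  = sum(1 for t in baskets if B <= t) / n
support_AB = sum(1 for t in baskets if (A | B) <= t) / n

confidence = support_AB / support_A   # estimate of P(B | A)
lift = confidence / support_B         # > 1 means A and B co-occur more often than by chance

print(support_AB, round(confidence, 2), round(lift, 2))  # 0.5, 0.67, 0.89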

Cross−Validation and Holdout Methods for Model Evaluation

Cross−validation and holdout approaches are frequently used to confirm the efficacy of
the associative classification model. By splitting the dataset into several subsets,
cross−validation enables repeated training and testing on various partitions.

Holdout approaches, in contrast, divide the data into training and testing sets, using the
former to build the model and the latter to assess how well it performs.

Techniques for Handling Imbalanced Datasets

Associative classification can be complicated by datasets with an unbalanced class
distribution. Undersampling, oversampling, and ensemble procedures, among others, can
be used to balance the dataset and lessen the effect of class imbalance on model
performance.

LAZY LEARNERS IN DATA MINING

Lazy learners in data mining, such as k-nearest neighbors (KNN), make predictions based
on existing data rather than constructing an explicit model. They are computationally
efficient during training but might require more time for predictions. KNN, for instance,
relies on similarity measures to classify or predict based on the nearest neighbors in the
training set.
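
A minimal KNN sketch (toy data and values are hypothetical) illustrating this "store now, compute at prediction time" behaviour:

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                     # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy 2-D data: class 0 near the origin, class 1 near (5, 5).
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.5, 0.5])))  # 0
print(knn_predict(X, y, np.array([5.5, 5.0])))  # 1

No model is built up front; all the work happens when a query point arrives.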

Types of lazy learners in data mining

Common types of lazy learners in data mining include:


K-Nearest Neighbors (KNN): Classifies instances based on the majority class among their
k-nearest neighbors.

Locally Weighted Learning (LWL): Assigns weights to training instances based on their
proximity to the query instance for better prediction.

Case-Based Reasoning (CBR): Makes predictions by finding and adapting solutions from
similar past cases.

Instance-Based Learning (IBL): Similar to KNN, it relies on instances in the training set to
make predictions without explicitly building a model.

Memory-Based Reasoning (MBR): Uses memory or instances from the training set to make
decisions.

These lazy learners are characterized by their reliance on stored instances at prediction
time rather than learning a model during the training phase.

Classification Methods in data mining

In data mining, besides popular methods like decision trees and k-nearest neighbors, other
classification methods include support vector machines (SVM), naive Bayes, neural
networks, and ensemble methods like random forests or gradient boosting. Each method
has its strengths and weaknesses, and their suitability depends on the specific
characteristics of the dataset and the problem at hand.

Here is a brief overview of some classification methods in data mining:

Decision Trees:
Hierarchical tree-like structures for decision-making based on features.

Support Vector Machines (SVM):

Classifies data points by finding the hyperplane that maximizes the margin between
classes.

K-Nearest Neighbors (KNN):

Assigns a class based on the majority class of its k-nearest neighbors in the feature space.

Naive Bayes:

Utilizes Bayes’ theorem to calculate the probability of each class given the input features.

Neural Networks:

Mimics the structure and functioning of the human brain, composed of layers of
interconnected nodes.

Random Forests:

Ensemble method combining multiple decision trees to improve accuracy and reduce
overfitting.

Gradient Boosting:

Builds a series of weak learners sequentially, each correcting the errors of its predecessor.

These methods vary in complexity, interpretability, and performance depending on the
nature of the data and the problem being addressed.

Clustering techniques
Clustering techniques in data mining group similar data points together based on certain
criteria. Common methods include K-means, hierarchical clustering, and DBSCAN. K-
means partitions data into K clusters, hierarchical clustering forms a tree-like structure,
and DBSCAN identifies dense regions. Each technique has strengths and limitations
depending on the nature of the data and the desired outcome.

Here is a brief overview of clustering techniques in data mining:

K-means Clustering:

Divides data into K clusters based on similarity.

Minimizes the sum of squared distances within each cluster.

Hierarchical Clustering:

Builds a tree-like structure of clusters.

Two main types: Agglomerative (bottom-up) and Divisive (top-down).

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

Identifies clusters based on density.

Groups data points into dense regions separated by sparser areas.

Mean-Shift Clustering:

Shifts data points towards areas of higher density.

Converges to local modes, defining clusters.

Fuzzy Clustering (Fuzzy C-means):

Assigns each data point a degree of membership to multiple clusters.

Allows for soft boundaries between clusters.


These techniques aid in discovering patterns, similarities, and structures within datasets
for various applications.

Partitioning methods

In data mining, partitioning methods involve dividing a dataset into subsets for analysis.
Two common approaches are:

Training-Testing Split:

Purpose: Used in supervised learning to assess model performance.

Process: The dataset is divided into training and testing sets. The model is trained on the
former and tested on the latter.

Cross-Validation:

Purpose: Provides a more robust evaluation by repeatedly partitioning data into subsets.

Process: Divides data into k folds, where the model is trained on k-1 folds and validated on
the remaining one. This process is repeated k times, with each fold serving as the validation
set once.

These methods help evaluate models effectively and mitigate issues like overfitting by
assessing performance on independent subsets.
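
Both partitioning approaches are illustrated in the short sketch below, assuming scikit-learn is available (the dataset and classifier are arbitrary illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Training-testing split: hold out 30% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation: each fold serves once as the validation set.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("cross-validation accuracy:", scores.mean())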

K-means and Hierarchical Methods in data mining

K-means and hierarchical methods are clustering techniques in data mining. K-means
partitions data into k clusters based on similarity, while hierarchical methods organize data
into a tree-like structure, representing nested clusters. K-means is efficient but sensitive to
initial centroids, while hierarchical methods offer a more detailed cluster hierarchy but can
be computationally intensive. Choose based on data characteristics and analysis goals.

K-means is a partitioning clustering method in data mining that assigns data points to k
clusters based on similarity, aiming to minimize intra-cluster variance. Hierarchical
methods, on the other hand, organize data into a tree-like structure, creating a hierarchy of
clusters. K-means is efficient but sensitive to initial conditions, while hierarchical methods
offer a detailed cluster hierarchy at the cost of increased computational complexity. The
choice between them depends on the nature of the data and the desired level of cluster
granularity.
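
The core k-means loop is short enough to sketch directly (a simplified illustration with toy data; real implementations add smarter initialization and empty-cluster handling):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs.
X = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g. [0 0 0 1 1 1] (cluster numbering may differ)
print(centroids)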

Difference between K-means and Hierarchical Methods in data mining

The primary differences between K-means and hierarchical methods in data mining lie in
their approaches to clustering:

Clustering Mechanism:

K-means: Divides data into a predefined number (k) of non-overlapping clusters based on
minimizing intra-cluster variance.

Hierarchical Methods: Creates a tree-like structure of nested clusters, forming a hierarchy
based on the similarity of data points.

Cluster Formation:

K-means: Forms distinct, non-overlapping clusters; each data point belongs to exactly one
cluster.

Hierarchical Methods: Forms a nested structure where clusters can be subclusters of
larger clusters, allowing for overlapping relationships.

Flexibility:

K-means: Requires specifying the number of clusters (k) beforehand, which can impact
results.

Hierarchical Methods: Do not require specifying the number of clusters in advance,
providing more flexibility, but they may be computationally more expensive.

Initialization Sensitivity:
K-means: Sensitive to initial centroid placement, leading to potential variations in results
with different starting points.

Hierarchical Methods: Less sensitive to initialization, as they consider all data points in the
hierarchy.

Result Interpretation:

K-means: Outputs a flat clustering result, suitable for scenarios where distinct, non-
overlapping groups are desired.

Hierarchical Methods: Provide a hierarchical structure, offering insights into both finer and
coarser levels of clustering.

Computational Complexity:

K-means: Generally computationally efficient, but complexity increases with larger datasets.

Hierarchical Methods: Can be computationally more intensive, especially for large datasets,
due to the hierarchical tree structure.

Choosing between K-means and hierarchical methods depends on the specific
characteristics of the data and the objectives of the analysis. K-means is often preferred for
simplicity and efficiency, while hierarchical methods offer a more nuanced view of data
relationships.

Distance-based Agglomerative and Divisive Clustering in data mining

Distance-based agglomerative clustering involves merging data points or clusters based on
their proximity or similarity, forming a hierarchical structure. Divisive clustering, on the
other hand, starts with a single cluster and recursively divides it into smaller clusters until a
desired granularity is reached.

In distance-based agglomerative clustering, the process begins with each data point as a
separate cluster and then merges the closest ones iteratively. This continues until all points
belong to a single cluster. The hierarchy is represented by a dendrogram.
Divisive clustering starts with all data points in one cluster and then partitions it into
smaller clusters. This process continues recursively until each data point is its own cluster.
The result is a tree structure, similar to the dendrogram in agglomerative clustering but
created in the opposite direction.

In summary, the key difference lies in the approach: agglomerative clustering starts with
individual points and merges them, while divisive clustering starts with all points in one
cluster and splits them.

Distance-based agglomerative clustering merges data points or clusters based on their
proximity, forming a hierarchy. It starts with each point as a separate cluster and iteratively
combines the closest ones.

Divisive clustering, on the other hand, begins with all data points in one cluster and
recursively divides it into smaller clusters until a desired granularity is achieved. This
results in a hierarchical tree structure, with clusters representing different levels of
granularity.

Types of distance-based agglomerative and divisive clustering in data mining

In distance-based agglomerative clustering, common types include:

Single Linkage: Merges clusters based on the proximity of their closest members, resulting
in elongated clusters.

Complete Linkage: Combines clusters based on the distance between their farthest
members, leading to more compact clusters.

Average Linkage: Merges clusters based on the average distance between all pairs of their
members, balancing sensitivity to outliers.
Ward’s Method: Minimizes the variance within clusters, often used for minimizing the
overall within-cluster variance.

As for divisive clustering, there isn't a standardized set of types, but some methods
include:

Top-Down (Binary Splitting): Divides the dataset into two clusters in each step, recursively
splitting until desired granularity is reached.

K-Means-Based Divisive Clustering: Adapts the K-Means algorithm for divisive clustering,
iteratively splitting clusters.

CURE (Clustering Using Representatives): A hierarchical clustering method that uses a
tree structure and multiple representative points per cluster.

DIANA (Divisive Analysis): A divisive clustering method that starts with one cluster and
recursively splits it into smaller clusters.

These methods provide different approaches to hierarchical clustering based on the
distance between data points or clusters.
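
The agglomerative linkage types above can be tried out with SciPy, assuming it is available (the data is a toy illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points forming two natural groups.
X = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)

# Agglomerative clustering under different linkage criteria.
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                      # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, labels)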

What is Density-based clustering?

Density-based clustering is one of the most popular unsupervised learning methodologies
used in model building and machine learning. Data points lying in the low-density region
that separates two clusters are treated as noise. The surroundings within a radius ε of a
given object are known as the ε-neighborhood of the object. If the ε-neighborhood of an
object contains at least a minimum number of objects, MinPts, the object is called a core
object.

Density-Based Clustering – Background

There are two parameters used in density-based clustering:

Eps: the maximum radius of the neighborhood.

MinPts: the minimum number of points required in the Eps-neighborhood of a point.

The Eps-neighborhood of a point i is defined as NEps(i) = { k ∈ D | dist(i, k) <= Eps }.

Directly density reachable:

A point i is directly density-reachable from a point k with respect to Eps and MinPts if

i belongs to NEps(k), and

k satisfies the core point condition: |NEps(k)| >= MinPts.

Density reachable:

A point i is density-reachable from a point j with respect to Eps and MinPts if there is a
chain of points p1, ..., pn with p1 = j and pn = i such that each point p(t+1) is directly
density-reachable from p(t).

Density connected:

A point i is density-connected to a point j with respect to Eps and MinPts if there is a point
o such that both i and j are density-reachable from o with respect to Eps and MinPts.

Working of Density-Based Clustering


Suppose a set of objects is denoted by D’, we can say that an object I is directly density
reachable form the object j only if it is located within the ε neighborhood of j, and j is a core
object.

An object i is density reachable form the object j with respect to ε and MinPts in a given set
of objects, D’ only if there is a sequence of object chains point i1,…., in, i1 = j, pn = i such
that ii + 1 is directly density reachable from ii with respect to ε and MinPts.

An object i is density connected object j with respect to ε and MinPts in a given set of
objects, D’ only if there is an object o belongs to D such that both point i and j are density
reachable from o with respect to ε and MinPts.

Major Features of Density-Based Clustering

The primary features of density-based clustering are given below.

It requires only a single scan of the data.

It requires density parameters as a termination condition.

It is used to manage noise in data clusters.

Density-based clustering can identify clusters of arbitrary shape.

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on
a density-based notion of clusters and can identify clusters of arbitrary shape in a spatial
database that contains noise and outliers.
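
A minimal DBSCAN usage sketch, assuming scikit-learn (eps and min_samples correspond to ε and MinPts above; the data is illustrative):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense toy groups plus one isolated point that should be labelled as noise.
X = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9], [50, 50]], dtype=float)

db = DBSCAN(eps=2.0, min_samples=3).fit(X)   # eps = ε radius, min_samples = MinPts
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]; the label -1 marks noise points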

OPTICS

OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an
ordering of the database with respect to its density-based clustering structure. The cluster
ordering contains information equivalent to the density-based clusterings obtained over a
wide range of parameter settings. OPTICS is useful for both automatic and interactive
cluster analysis, including determining an intrinsic clustering structure.
DENCLUE

DENCLUE is a density-based clustering method proposed by Hinneburg and Keim. It enables
a compact mathematical description of arbitrarily shaped clusters in high-dimensional data,
and it works well for data sets with a large amount of noise.

Expectation-Maximization (EM) in data mining

Expectation-Maximization (EM) is an iterative statistical technique often used in data
mining for clustering and density estimation. In the context of data mining, EM is employed
to find the parameters of a probabilistic model when there are latent (unobservable)
variables. It alternates between the E-step, where it estimates the expected values of the
latent variables given the observed data and current parameter estimates, and the M-step,
where it maximizes the likelihood function with respect to the model parameters. This
process iterates until convergence, refining the model and improving parameter estimates
over time. EM is particularly useful in situations with incomplete or missing data.

Working and Types of Expectation-Maximization in data mining

Expectation-Maximization (EM) in data mining involves two main steps: the E-step
(Expectation step) and the M-step (Maximization step). Here’s a brief overview:

Expectation Step (E-step):

Objective: Estimate the values of latent (unobservable) variables given the observed data
and current parameter estimates.

Process: Calculate the expected values of the latent variables based on the current model
parameters. This step involves computing probabilities or membership weights for each
data point belonging to different clusters or categories.
Maximization Step (M-step):

Objective: Maximize the likelihood function with respect to the model parameters.

Process: Adjust the model parameters to maximize the likelihood of the observed data,
incorporating the information gained from the E-step. This step involves updating the
parameters of the model to better fit the observed data.

Types of Expectation-Maximization:

Hard EM:

Assumes that the latent variables are fixed or assigned to the most likely category after
each E-step. It results in a deterministic assignment of data points to clusters.

Soft EM:

Allows for probabilistic assignment of data points to clusters in the E-step. Instead of
assigning a data point to a single cluster, it assigns probabilities of belonging to each
cluster. This results in a more flexible and probabilistic clustering.

Gaussian Mixture Model (GMM):

A specific application of EM in clustering, where it is assumed that the data is generated
from a mixture of multiple Gaussian distributions. Each cluster is modeled by a Gaussian
distribution, and EM is used to estimate the parameters of these distributions.

Hidden Markov Model (HMM):

Applied in sequential data analysis, where the latent variables represent unobservable
states. EM is used to estimate the transition probabilities and emission probabilities of the
hidden states in the model.

Understanding and applying EM in these various forms enables data miners to handle
complex data structures and uncover underlying patterns in the data.
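
The E-step and M-step can be written out explicitly for a simple case. Below is a sketch of soft EM for a two-component one-dimensional Gaussian mixture (the synthetic data and starting values are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])  # synthetic data

# Initial guesses for the means, variances and mixing weights.
mu, var, w = np.array([1.0, 5.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibility of each component for each data point.
    resp = w[None, :] * gauss(x[:, None], mu[None, :], var[None, :])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibility-weighted data.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk
    w = nk / len(x)

print(mu.round(2), var.round(2), w.round(2))  # means near [0, 6], weights near [0.5, 0.5]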
Grid-Based Methods, working and types in data mining

Grid-based methods in data mining involve dividing the data space into a grid and then
analyzing the data within each grid cell. This approach is useful for tasks like clustering,
density estimation, and outlier detection. Types of grid-based methods include:

Grid Clustering: Divides the data into a grid and assigns each point to the grid cell it falls
into. Cells with more points may indicate clusters.

Density-Based Grids: Focus on regions with high data density, helping identify areas with
significant data concentrations.

Grid-based Outlier Detection: Identifies cells with lower data density, suggesting potential
outliers or anomalies.

Wavelet-based Methods: Combine grid structures with wavelet transforms to analyze data
at multiple resolutions, capturing both global and local patterns.

Grid-based methods provide a scalable and efficient way to process large datasets,
especially in spatial data analysis or applications where data distribution varies across
space.

Grid-based methods in data mining typically follow these steps:

Grid Creation: Divide the data space into a grid by creating a set of cells or buckets. This
grid structure can be one-dimensional, two-dimensional, or even higher-dimensional,
depending on the nature of the data.

Data Mapping: Assign each data point to the appropriate grid cell based on its attributes or
coordinates. This mapping helps organize and structure the data spatially within the grid.
Analysis within Cells: Perform data analysis within each grid cell independently. This can
include tasks such as counting points, calculating statistics, or applying specific
algorithms to understand the characteristics of the data within each cell.

Clustering or Density Estimation: Identify clusters or regions of high data density by
examining the distribution of points within the grid. Cells with a higher concentration of
data points may indicate clusters, while those with fewer points may suggest outliers.

Visualization: Visualize the results to interpret patterns or anomalies in the data. Heatmaps
or other graphical representations of the grid can help in understanding the distribution and
relationships within the dataset.

Adjustment and Refinement: Depending on the analysis results, the grid parameters or
algorithms may be adjusted to refine the process. This iterative refinement helps in
achieving meaningful insights from the data.

Grid-based methods are particularly useful in spatial data mining, where the relationships
between data points are influenced by their spatial proximity. They provide a structured
approach to handling large datasets, allowing for efficient processing and analysis.
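
The basic workflow (grid creation, data mapping, analysis within cells) can be sketched with a simple histogram grid; the thresholds and data below are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
# Two dense blobs of points plus a little uniform background noise.
points = np.vstack([rng.normal([2.5, 2.5], 0.3, size=(100, 2)),
                    rng.normal([7.5, 7.5], 0.3, size=(100, 2)),
                    rng.uniform(0, 10, size=(10, 2))])

# Grid creation and data mapping: a 10 x 10 grid over the data space.
counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1],
                                        bins=10, range=[[0, 10], [0, 10]])

# Analysis within cells: heavily populated cells suggest clusters,
# sparsely populated (but non-empty) cells suggest potential outliers.
dense_cells = np.argwhere(counts >= 20)
sparse_cells = np.argwhere((counts > 0) & (counts < 3))
print("dense cells:", dense_cells.tolist())
print("number of sparse cells:", len(sparse_cells))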

Model-Based Clustering Methods In data mining

Model-based clustering methods in data mining involve creating statistical models to
identify natural groupings or clusters within a dataset. These methods assume that the
data can be generated from a mixture of probability distributions, with each distribution
representing a different cluster.

Common techniques include Gaussian Mixture Models (GMM), which assume that the
data within each cluster follows a Gaussian distribution, and Finite Mixture Models (FMM),
where a finite number of components are used to model the underlying distributions.
The Expectation-Maximization (EM) algorithm is often employed to estimate the
parameters of these models iteratively. It involves an “expectation” step, where cluster
assignments are updated based on current parameter estimates, and a “maximization”
step, where parameters are recalculated based on the updated cluster assignments.

These methods can be powerful for discovering hidden patterns and structures in complex
datasets, but they do require assumptions about the underlying distribution of the data.
Careful consideration of model selection and validation is crucial for the effectiveness of
model-based clustering in data mining.

Model-based clustering methods in data mining encompass various techniques, each with
its own characteristics and working principles. Here are some types and a brief overview of
how they work:

Gaussian Mixture Models (GMM):

Working: Assumes that the data is generated from a mixture of Gaussian distributions. The
algorithm iteratively estimates the parameters (means, covariances, and weights) of these
distributions using the Expectation-Maximization (EM) algorithm.

Finite Mixture Models (FMM):

Working: Similar to GMM but more general, allowing the use of different distribution types
for each cluster. The EM algorithm is also commonly used to estimate parameters
iteratively.

Dirichlet Process Mixture Models (DPMM):

Working: Extends mixture models by employing a Dirichlet Process to model an infinite
number of potential clusters. This allows for flexibility in the number of clusters without the
need to pre-specify it.

Hierarchical Mixture Models:


Working: Hierarchical approaches build a tree-like structure of clusters, capturing
relationships at different scales. Agglomerative or divisive algorithms are often used to
construct this hierarchy.

Categorical Data Clustering:

Working: Tailored for categorical data, these methods model the probability distribution of
categorical variables within each cluster. Examples include Latent Class Analysis (LCA)
and Categorical Latent Variable Models.

Bayesian Model-Based Clustering:

Working: Incorporates Bayesian principles into the clustering process, allowing for the
incorporation of prior knowledge and uncertainty. Markov Chain Monte Carlo (MCMC)
methods are often employed for inference.

Density-Based Clustering:

Working: Identifies clusters based on regions of higher data density. DBSCAN (Density-
Based Spatial Clustering of Applications with Noise) is a popular algorithm in this category.

In general, model-based clustering involves the following steps:

Initialization: Start with initial guesses for model parameters.

E-step (Expectation): Assign data points to clusters probabilistically based on the current
model.

M-step (Maximization): Update the model parameters based on the assigned clusters.

Iterate: Repeat the E-step and M-step until convergence or a stopping criterion is met.

Choosing the appropriate model depends on the characteristics of the data and the
assumptions that can be reasonably made. Evaluation metrics and validation techniques
are crucial for assessing the quality of the clustering results.
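
As a concrete illustration of these steps, the sketch below fits Gaussian mixtures with different numbers of components and uses BIC for model selection, assuming scikit-learn is available (the data and parameter range are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(150, 2)), rng.normal(5, 1, size=(150, 2))])

# Fit mixtures with 1-4 components and compare them with BIC.
models = {k: GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 5)}
best_k = min(models, key=lambda k: models[k].bic(X))

labels = models[best_k].predict(X)        # hard cluster assignments
probs = models[best_k].predict_proba(X)   # soft (probabilistic) assignments
print("chosen number of components:", best_k)   # usually 2 for this data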

CONSTRAINT BASED CLUSTERING METHOD


Constraint-based clustering finds clusters that satisfy user-specified preferences or
constraints. Depending on the nature of the constraints, constraint-based clustering can
adopt different approaches. There are several categories of constraints, which are as
follows –

Constraints on individual objects – We can specify constraints on the objects to be
clustered. In a real estate application, for instance, one may wish to spatially cluster only
those luxury mansions worth over a million dollars. This constraint confines the set of
objects to be clustered. It can easily be handled by preprocessing (e.g., performing
selection using an SQL query), after which the problem reduces to an instance of
unconstrained clustering.

Constraints on the selection of clustering parameters – A user may want to set a desired
range for each clustering parameter. Clustering parameters are usually quite specific to the
given clustering algorithm. Examples of parameters include k, the desired number of
clusters in a k-means algorithm, or ε (the radius) and MinPts (the minimum number of
points) in the DBSCAN algorithm.

Although such user-specified parameters can strongly influence the clustering results, they
are usually confined to the algorithm itself. Therefore, their fine-tuning and processing are
usually not treated as a form of constraint-based clustering.

Constraints on distance or similarity functions – We can specify different distance or
similarity functions for particular attributes of the objects to be clustered, or different
distance measures for particular pairs of objects. When clustering sportsmen, for instance,
we may use different weighting schemes for height, body weight, age, and skill level.

User-specified constraints on the properties of individual clusters – A user may want to
specify desired characteristics of the resulting clusters, which can strongly influence the
clustering process.
Consider a package delivery company that would like to determine the locations of k
service stations in a city. The company has a database of customers that records each
customer's name, location, length of time since the customer started using the company's
services, and average monthly charge. We could formulate this location selection problem
as an instance of unconstrained clustering using a distance function computed based on
customer location.

A smarter approach is to partition the customers into two classes: high-value customers
(who require frequent, regular service) and ordinary customers (who require occasional
service). To save costs and still provide good service, the manager adds the following
constraints –

Each station must serve a minimum of 100 high-value customers.

Each station must serve a minimum of 5,000 ordinary customers.

Constraint-based clustering will consider such constraints during the clustering procedure.

Semi-supervised clustering based on “partial” supervision – The quality of unsupervised
clustering can be significantly improved using some weak form of supervision. This can be
in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same
or different clusters). Such a constrained clustering process is known as semi-supervised
clustering.

What is Outlier Analysis in Data Mining

Outlier analysis in data mining is the process of identifying and examining data points that
significantly differ from the rest of the dataset. An outlier is a data point that deviates
significantly from the normal pattern or behavior of the data. Various factors, such as
measurement errors, unexpected events, or data processing errors, can cause these
outliers. Outliers are also often referred to as anomalies, aberrations, or irregularities.

Benefits of Outlier Analysis in Data Mining

Outlier analysis in data mining can provide several benefits, as mentioned below –
Improved accuracy of data analysis – Outliers can skew the results of statistical analyses
or predictive models, leading to inaccurate or misleading conclusions. Detecting and
removing outliers can improve the accuracy and reliability of data analysis.

Identification of data quality issues – Outliers can be caused by data collection, processing,
or measurement errors, which can indicate data quality issues. Outlier analysis in data
mining can help identify and correct these issues to improve data quality.

Detection of unusual events or patterns – Outliers can represent unusual events or patterns
in the data that may be of interest to the business. Studying these outliers can provide
valuable insights and lead to discoveries.

Better decision-making – Outlier analysis in data mining can help decision-makers identify
and understand the factors affecting their data, leading to better-informed decisions.

Improved model performance – Outliers can negatively affect the performance of predictive
models. Removing outliers, or developing models that can handle them appropriately, can
improve model performance.

Types of Outliers in Data Mining

Let’s understand various types of outliers in the data mining process –

Global (Point) Outliers

These are data points that are significantly different from the rest of the dataset in a global
sense. Global outliers are typically detected using statistical methods that focus on the
extreme values of the entire dataset. For example, if we have a dataset of heights for a
group of people, and one person is 7 feet tall while the rest of the heights range between 5
and 6 feet, the height of 7 feet is a global outlier.
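
A minimal sketch of flagging such a global outlier with a simple z-score rule (the heights are hypothetical values matching the example above):

import numpy as np

# Hypothetical heights in feet; 7.0 is the unusually tall person.
heights = np.array([5.2, 5.5, 5.8, 5.4, 6.0, 5.7, 5.9, 5.3, 5.6, 7.0])

# Flag points whose z-score (distance from the mean in standard deviations) exceeds 2.
z = (heights - heights.mean()) / heights.std()
print(heights[np.abs(z) > 2])   # [7.]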
Collective Outliers

These are groups of data points that are significantly different from the rest of the dataset
when considered together. Collective outliers are typically detected using clustering
algorithms or other methods that group similar data points. For example, suppose we have
a dataset of customer transactions, and a group of customers consistently makes
purchases that are significantly larger than the rest of the customers. In that case, this
group of customers could be considered a collective outlier. Similarly, in an intrusion
detection system, the transmission of a DOS packet from one PC to another PC can be
considered normal behavior, but if DOS packets are transmitted to many PCs at the same
time, that would be considered a collective outlier.
Contextual (Conditional) Outliers

These are data points that significantly differ from the rest of the dataset in a specific
context. Contextual outliers are typically detected using domain knowledge or contextual
information relevant to the dataset. For example, a temperature of 40 degrees Celsius in a
city may be considered normal in the summer but a contextual outlier in the winter.
