UNIT-IV
A decision tree is a supervised learning method used in data mining for classification and
regression tasks. It is a tree structure that supports decision-making. The decision tree
builds classification or regression models by repeatedly separating a data set into smaller
subsets while the tree is incrementally developed. The final result is a tree with decision
nodes and leaf nodes. A decision node has two or more branches, while a leaf node
represents a classification or decision; leaf nodes cannot be split further. The uppermost
decision node in a tree, corresponding to the best predictor, is called the root node.
Decision trees can deal with both categorical and numerical data.
Key factors:
Entropy:
Entropy is a common way to measure impurity. In a decision tree, it measures the
randomness or impurity in a data set.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split on an attribute. It
is also called entropy reduction. Building a decision tree is all about discovering the
attributes that return the highest information gain.
In short, a decision tree is like a flow chart, with the terminal nodes representing decisions.
Starting with the whole dataset, we measure the entropy to find a way to segment the set
until all the data in each subset belongs to the same class.
The decision tree algorithm may appear involved, but the basic technique is simple:
repeatedly choose the attribute with the highest information gain, split the data on it, and
continue until the resulting subsets are pure.
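For illustration, here is a minimal Python sketch (with a made-up list of class labels) that computes entropy and the information gain of a candidate split; the helper names are only for this example:

import math
from collections import Counter

def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the class proportions
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    # Gain = entropy(parent) - weighted average entropy of the child subsets
    total = len(parent_labels)
    weighted = sum(len(g) / total * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

parent = ["yes", "yes", "yes", "no", "no"]
split = [["yes", "yes", "yes"], ["no", "no"]]   # a candidate split on some attribute
print(entropy(parent), information_gain(parent, split))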
BAYESIAN CLASSIFICATION
In numerous applications, the connection between the attribute set and the class variable
is non-deterministic. In other words, the class label of a test record cannot be predicted
with certainty even though its attribute set is the same as that of some training examples.
These circumstances may arise due to noisy data or the presence of certain confounding
factors that influence classification but are not included in the analysis.
For example, consider the task of predicting whether an individual is at risk of liver illness
based on the individual's eating habits and working efficiency. Although most people who
eat healthily and exercise consistently have a lower probability of liver disease, they may
still develop it due to other factors, for example, consumption of high-calorie street food or
alcohol abuse. Determining whether an individual's eating routine is healthy or their
working efficiency is sufficient is also subject to interpretation, which in turn may introduce
uncertainty into the learning problem.
Bayesian classification uses Bayes' theorem to predict the probability of an event.
Bayesian classifiers are statistical classifiers based on Bayesian probability. Bayes'
theorem expresses how a degree of belief, expressed as a probability, should be updated in
the light of evidence. The theorem is named after Thomas Bayes, who first used conditional
probability to provide an algorithm that uses evidence to calculate limits on an unknown
parameter.
Bayes' theorem is expressed mathematically by the following equation:
P(X|Y) = P(Y|X) * P(X) / P(Y)
where
P(X|Y) is the conditional probability of event X occurring given that Y is true.
P(Y|X) is the conditional probability of event Y occurring given that X is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other; these
are known as the marginal probabilities.
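As a small numeric illustration of the formula, the following Python sketch uses made-up probabilities for a hypothetical disease-test scenario:

# Hypothetical values: X = "has disease", Y = "test is positive"
p_x = 0.01          # P(X): prior probability of the disease
p_y_given_x = 0.95  # P(Y|X): probability of a positive test given the disease
p_y = 0.06          # P(Y): overall probability of a positive test

# Bayes' theorem: P(X|Y) = P(Y|X) * P(X) / P(Y)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))   # about 0.158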
Bayesian interpretation:
In the Bayesian interpretation, probability measures a degree of belief. The quotient
P(Y|X)/P(Y) represents the support that the evidence Y provides for event X: the posterior
belief P(X|Y) equals the prior belief P(X) multiplied by this support.
Bayesian network:
A Bayesian network is represented by a Directed Acyclic Graph (DAG). Like any other
statistical graph, a DAG consists of a set of nodes and links, where the links signify the
relationships between the nodes.
The nodes here represent random variables, and the edges define the relationship between
these variables.
A DAG models the uncertainty of an event taking place based on the Conditional
Probability Distribution (CDP) of each random variable. A Conditional Probability Table
(CPT) is used to represent the CPD of each variable in a network.
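As a rough illustration (with hypothetical variables and numbers), a CPT can be represented as a simple table mapping each parent value to a probability distribution over the child variable:

# Hypothetical Bayesian-network fragment: Rain -> WetGrass
# CPT for WetGrass given its parent Rain
cpt_wet_grass = {
    # P(WetGrass | Rain)
    True:  {"wet": 0.90, "dry": 0.10},
    False: {"wet": 0.20, "dry": 0.80},
}
print(cpt_wet_grass[True]["wet"])   # P(WetGrass = wet | Rain = True) = 0.9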
IF-THEN Rule
Rule Antecedent: This is present in the rule's LHS (Left Hand Side). The antecedent is the
condition part of the rule and consists of one or more attribute tests joined with AND.
Rule Consequent: This is present in the rule's RHS (Right Hand Side). The rule consequent
consists of the class prediction.
Example:
IF age = youth AND student = yes THEN buys_computer = yes
Assessment of Rules
In rule-based classification in data mining, there are two factors based on which we can
assess the rules. These are:
Coverage of Rule: The fraction of the records which satisfy the antecedent conditions of a
particular rule is called the coverage of that rule.
We can calculate this by dividing the number of records satisfying the rule(n1) by the total
number of records(n).
Coverage(R) = n1/n
Accuracy of a rule: The fraction of the records that satisfy the antecedent conditions and
meet the consequent values of a rule is called the accuracy of that rule.
We can calculate this by dividing the number of records satisfying the consequent
values(n2) by the number of records satisfying the rule(n1).
Accuracy(R) = n2/n1
Generally, we convert these values into percentages by multiplying them by 100, which
makes them easier for a layperson to understand.
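A minimal Python sketch of these two measures, assuming toy records stored as dictionaries and a hypothetical rule:

records = [
    {"age": "youth", "student": "yes", "buys": "yes"},
    {"age": "youth", "student": "no",  "buys": "no"},
    {"age": "senior", "student": "yes", "buys": "yes"},
    {"age": "youth", "student": "yes", "buys": "no"},
]

# Hypothetical rule R: IF age = youth AND student = yes THEN buys = yes
def antecedent(r):
    return r["age"] == "youth" and r["student"] == "yes"

def consequent(r):
    return r["buys"] == "yes"

covered = [r for r in records if antecedent(r)]    # n1 records satisfy the antecedent
correct = [r for r in covered if consequent(r)]    # n2 of those also satisfy the consequent
print("coverage =", len(covered) / len(records))   # n1 / n  = 0.5
print("accuracy =", len(correct) / len(covered))   # n2 / n1 = 0.5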
There are two significant properties of rule-based classification in data mining: the rules
may not be mutually exclusive, and they may not be exhaustive.
Many different rules are generated for the dataset, so it is possible and likely that several
of them satisfy the same data record. This condition makes the rules not mutually exclusive.
Since the rules are not mutually exclusive, different rules covering the same record may
predict different classes, and we cannot directly decide on a single class, which was our
main objective. To solve this problem, we have two ways:
The first way is using an ordered set of rules. By ordering the rules, we set priority orders.
Thus, this ordered rule set is called a decision list. So the class with the highest priority rule
is taken as the final class.
The second solution can be assigning votes for each class depending on their weights. So,
in this, the set of rules remains unordered.
There is no guarantee that the rules will cover all the data entries; some records may not be
covered by any rule. This makes the rule set non-exhaustive, so we have to solve this
problem too. To do so, we can make use of a default class: all the data entries not covered
by any rule are assigned to the default class. Using a default class thus solves the problem
of non-exhaustiveness.
CLASSIFICATION BY BACK PROPAGATION
Input Layer: Receives the feature values of each training example.
Hidden Layers: Intermediate layers that transform the inputs through weighted connections
and activation functions.
Output Layer: Produces the predicted class scores or probabilities.
Forward Pass: Input data is fed forward through the network to produce an initial output.
Calculate Error: The difference between the predicted output and the actual (target)
output is computed using a loss function.
Backward Pass: Errors are propagated backward through the network, updating weights
and biases using gradient descent.
Iterative Process: The forward pass, error calculation, and backward pass are repeated for
multiple epochs to refine the model.
Activation Function: Introduces non-linearity (for example, sigmoid or ReLU) so that the
network can model complex relationships.
Loss Function: Measures the difference between predicted and actual outputs, guiding the
weight adjustments.
Learning Rate: Controls the size of weight adjustments during each iteration, preventing
overshooting or slow convergence.
Backpropagation is an effective method for training neural networks in classification tasks,
adapting the model to the underlying patterns in the data
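The following is a minimal, illustrative NumPy sketch of backpropagation for a one-hidden-layer network on toy XOR-style data; the architecture, learning rate, and epoch count are arbitrary choices for the example:

import numpy as np

# Toy data and a one-hidden-layer network trained with backpropagation
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5                                             # learning rate

for epoch in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: gradients of the squared error through the sigmoids
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates of weights and biases
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2))   # predictions should move toward [0, 1, 1, 0]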
SUPPORT VECTOR MACHINES (SVM)
Support Vector Machines (SVMs) are powerful tools in data mining for classification and
regression tasks. They work by finding the hyperplane that best separates different classes
in the feature space. SVMs aim to maximize the margin between classes, which is the
distance between the hyperplane and the nearest data points from each class.
Linear Separation: SVMs are effective when classes can be separated by a linear boundary.
However, they can be extended to handle non-linear boundaries through techniques like
the kernel trick.
Kernel Trick: SVMs can use a kernel function to implicitly map the input data into a higher-
dimensional space, allowing for the handling of non-linear relationships between features.
Margin Maximization: SVMs focus on maximizing the margin between classes, which
enhances generalization and helps avoid overfitting.
Support Vectors: The data points that lie closest to the hyperplane are known as support
vectors. They play a crucial role in defining the decision boundary.
C Parameter: SVMs have a regularization parameter (C) that influences the trade-off
between achieving a smooth decision boundary and classifying training points correctly.
Sensitivity to Outliers: SVMs can be sensitive to outliers since they heavily depend on
support vectors, which are the closest points to the decision boundary.
Applications: SVMs find applications in various fields, including image classification, text
categorization, bioinformatics, and more.
When using SVMs in data mining, it’s essential to consider the nature of the data, choose
appropriate kernel functions, and fine-tune parameters for optimal performance.
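As a brief, hedged sketch (assuming scikit-learn is available), an RBF-kernel SVM can be fitted on a toy non-linear dataset roughly as follows; the parameter values are illustrative only:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data handled via the kernel trick
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The RBF kernel maps the data implicitly to a higher-dimensional space;
# C controls the trade-off between a smooth boundary and fitting the training points
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("number of support vectors:", clf.n_support_.sum())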
ASSOCIATIVE CLASSIFICATION
Data mining is an effective process that includes drawing insightful conclusions and
patterns from vast amounts of data. Its importance rests in the capacity to unearth buried
information, spot trends, and make wise judgments based on the information recovered.
This method offers a thorough framework to identify intricate linkages in data, resulting in
insightful information and prospective advancements in a range of industries, including
marketing, finance, healthcare, and more. This section discusses associative classification
in data mining.
Understanding associative classification is essential for realizing its full potential in data
mining. It makes prediction and classification tasks easier by identifying correlations and
links between the attributes in a dataset. The fundamental goal of associative classification
is to identify patterns connecting different variables by using association rule mining
techniques.
Rule creation, rule assessment, and rule selection are generally the three main steps in the
process. Rules are first generated from the dataset and then evaluated for quality and
importance. To improve the accuracy and relevance of the classification process, rule
selection seeks to weed out unimportant or inapplicable rules. A few benefits of associative
classification are its capacity to handle complicated relationships in the data, manage
high-dimensional datasets, and give comprehensible rules.
The computational complexity on big datasets, sensitivity to noise and irrelevant features,
and a possible trade-off between accuracy and interpretability are some of its drawbacks.
Nevertheless, being aware of these factors enables data analysts to employ associative
classification efficiently and to base decisions on the discovered patterns.
In associative classification, the Apriori algorithm is a key method for identifying frequent
itemsets. Through an iterative technique, the method finds itemsets that meet a minimum
support criterion, establishing strong correlations between attributes. Its main function in
associative classification is to produce a set of frequent itemsets from which association
rules may be derived.
The method prunes the search space efficiently by using the "apriori property", which
states that every subset of a frequent itemset must itself be frequent (equivalently, no
superset of an infrequent itemset can be frequent).
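A rough pure-Python sketch of this idea, not a full Apriori implementation, using toy transactions and an illustrative minimum support of 0.5:

from itertools import combinations

# Toy transactions; min_support is the minimum fraction of transactions an itemset must appear in
transactions = [{"milk", "bread"}, {"milk", "diaper", "beer"},
                {"milk", "bread", "diaper"}, {"bread", "diaper"}]
min_support = 0.5
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Level 1: frequent single items
frequent = [{frozenset([i]) for i in set().union(*transactions)
             if support(frozenset([i])) >= min_support}]

# Level k: join frequent (k-1)-itemsets, keep only candidates whose subsets
# are all frequent (the apriori property), then check support
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
    frequent.append({c for c in candidates if support(c) >= min_support})

for level in frequent:
    for itemset in level:
        print(sorted(itemset), support(itemset))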
In fields like medical diagnosis or consumer behavior research, where ambiguity and
vagueness are common, fuzzy association rule mining is very helpful. This method uses
fuzzy logic to generate rules and identify correlations, allowing for more informed
decision-making and the identification of patterns in large datasets.
Evaluation and Validation
Cross-validation and holdout approaches are frequently used to confirm the efficacy of
an associative classification model. Cross-validation splits the dataset into several
subsets, enabling repeated training and testing on different partitions.
Holdout approaches, in contrast, divide the data into separate training and testing sets,
using the former to build the model and the latter to assess how well it performs.
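A short holdout-evaluation sketch in Python (assuming scikit-learn; the dataset, split ratio, and classifier are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Holdout: 70% of the data builds the model, the remaining 30% assesses it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))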
LAZY LEARNERS
Lazy learners in data mining, such as k-nearest neighbors (KNN), make predictions based
on existing data rather than constructing an explicit model. They are computationally
efficient during training but might require more time for predictions. KNN, for instance,
relies on similarity measures to classify or predict based on the nearest neighbors in the
training set.
Locally Weighted Learning (LWL): Assigns weights to training instances based on their
proximity to the query instance for better prediction.
Case-Based Reasoning (CBR): Makes predictions by finding and adapting solutions from
similar past cases.
Instance-Based Learning (IBL): Similar to KNN, it relies on instances in the training set to
make predictions without explicitly building a model.
Memory-Based Reasoning (MBR): Uses memory or instances from the training set to make
decisions.
These lazy learners are characterized by their reliance on stored instances during
prediction rather than learning a model during the training phase
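A minimal scikit-learn sketch of a lazy learner in action (the dataset and the value of k are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# "Training" only stores the instances; the work happens at prediction time,
# when each query is compared with its k nearest stored neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))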
In data mining, besides popular methods like decision trees and k-nearest neighbors, other
classification methods include support vector machines (SVM), naive Bayes, neural
networks, and ensemble methods like random forests or gradient boosting. Each method
has its strengths and weaknesses, and their suitability depends on the specific
characteristics of the dataset and the problem at hand.
Decision Trees:
Hierarchical tree-like structures for decision-making based on features.
Support Vector Machines (SVM):
Classify data points by finding the hyperplane that maximizes the margin between
classes.
K-Nearest Neighbors (KNN):
Assigns a class based on the majority class of its k-nearest neighbors in the feature space.
Naive Bayes:
Utilizes Bayes’ theorem to calculate the probability of each class given the input features.
Neural Networks:
Mimics the structure and functioning of the human brain, composed of layers of
interconnected nodes.
Random Forests:
Ensemble method combining multiple decision trees to improve accuracy and reduce
overfitting.
Gradient Boosting:
Builds a series of weak learners sequentially, each correcting the errors of its predecessor.
Clustering techniques
Clustering techniques in data mining group similar data points together based on certain
criteria. Common methods include K-means, hierarchical clustering, and DBSCAN. K-
means partitions data into K clusters, hierarchical clustering forms a tree-like structure,
and DBSCAN identifies dense regions. Each technique has strengths and limitations
depending on the nature of the data and the desired outcome
K-means Clustering: Partitions the data into k clusters by repeatedly assigning each point
to the nearest centroid and recomputing the centroids.
Hierarchical Clustering: Builds a tree-like hierarchy of clusters by successively merging
smaller clusters (agglomerative) or splitting larger ones (divisive).
Mean-Shift Clustering: Shifts candidate cluster centers toward regions of higher data
density until they converge on local density peaks.
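A brief K-means sketch with scikit-learn on synthetic data (the number of clusters and other settings are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-means needs k in advance; it alternates between assigning points to the
# nearest centroid and recomputing the centroids
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.labels_[:10])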
Partitioning methods
In data mining, partitioning methods involve dividing a dataset into subsets for analysis.
Two common approaches are:
Training-Testing Split:
Process: Dataset is divided into training and testing sets. The model is trained on the
former and tested on the latter.
Cross-Validation:
Purpose: Provides a more robust evaluation by repeatedly partitioning data into subsets.
Process: Divides data into k folds, where the model is trained on k-1 folds and validated on
the remaining one. This process is repeated k times, with each fold serving as the validation
set once.
These methods help evaluate models effectively and mitigate issues like overfitting by
assessing performance on independent subsets.
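A small k-fold cross-validation sketch in Python (assuming scikit-learn; the dataset, model, and k = 5 are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# 5-fold cross-validation: train on 4 folds, validate on the remaining fold, 5 times
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())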
K-means and hierarchical methods are two common clustering techniques in data mining.
K-means is a partitioning method that assigns data points to k clusters based on similarity,
aiming to minimize intra-cluster variance. Hierarchical methods, on the other hand,
organize data into a tree-like structure that represents nested clusters, creating a hierarchy
of clusters. K-means is efficient but sensitive to the initial centroids, while hierarchical
methods offer a more detailed cluster hierarchy at the cost of increased computational
complexity. The choice between them depends on the nature of the data, the analysis
goals, and the desired level of cluster granularity.
The primary differences between K-means and hierarchical methods in data mining lie in
their approaches to clustering:
Clustering Mechanism:
K-means: Divides data into a predefined number (k) of non-overlapping clusters based on
minimizing intra-cluster variance.
Hierarchical Methods: Build a nested hierarchy of clusters by successively merging smaller
clusters (agglomerative) or splitting larger ones (divisive).
Cluster Formation:
K-means: Forms distinct, non-overlapping clusters; each data point belongs to exactly one
cluster.
Hierarchical Methods: Produce nested clusters that can be represented as a dendrogram
and cut at any level to obtain a flat clustering.
Flexibility:
K-means: Requires specifying the number of clusters (k) beforehand, which can impact
results.
Hierarchical Methods: Do not require the number of clusters in advance; it can be chosen
after inspecting the hierarchy.
Initialization Sensitivity:
K-means: Sensitive to initial centroid placement, leading to potential variations in results
with different starting points.
Hierarchical Methods: Less sensitive to initialization, as they consider all data points in the
hierarchy.
Result Interpretation:
K-means: Outputs a flat clustering result, suitable for scenarios where distinct, non-
overlapping groups are desired.
Hierarchical Methods: Provide a hierarchical structure, offering insights into both finer and
coarser levels of clustering.
Computational Complexity:
K-means: Relatively efficient, scaling roughly linearly with the number of data points per
iteration.
Hierarchical Methods: More computationally intensive, typically quadratic or worse in the
number of data points.
In distance-based agglomerative clustering, the process begins with each data point as a
separate cluster and then merges the closest ones iteratively. This continues until all points
belong to a single cluster. The hierarchy is represented by a dendrogram.
Divisive clustering starts with all data points in one cluster and then partitions it into
smaller clusters. This process continues recursively until each data point is its own cluster.
The result is a tree structure, similar to the dendrogram in agglomerative clustering but
created in the opposite direction.
In summary, the key difference lies in the approach: agglomerative clustering starts with
individual points and merges them, while divisive clustering starts with all points in one
cluster and splits them.
Divisive clustering, on the other hand, begins with all data points in one cluster and
recursively divides it into smaller clusters until a desired granularity is achieved. This
results in a hierarchical tree structure, with clusters representing different levels of
granularity.
Single Linkage: Merges clusters based on the proximity of their closest members, resulting
in elongated clusters.
Complete Linkage: Combines clusters based on the distance between their farthest
members, leading to more compact clusters.
Average Linkage: Merges clusters based on the average distance between all pairs of their
members, balancing sensitivity to outliers.
Ward's Method: Merges the pair of clusters whose union gives the smallest increase in the
overall within-cluster variance.
As for divisive clustering, there isn't a standardized set of types, but some methods
include:
Top-Down (Binary Splitting): Divides the dataset into two clusters in each step, recursively
splitting until desired granularity is reached.
K-Means-Based Divisive Clustering: Adapts the K-Means algorithm for divisive clustering,
iteratively splitting clusters.
DIANA (Divisive Analysis): A divisive clustering method that starts with one cluster and
recursively splits it into smaller clusters.
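The linkage criteria above can be compared with a short SciPy sketch on toy points (the data and the choice of three clusters are illustrative):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy 2-D points
X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1]])

# Agglomerative clustering under different linkage criteria;
# fcluster cuts the resulting dendrogram into a chosen number of clusters
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                      # merge history (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
    print(method, labels)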
MinPts: MinPts refers to the minimum number of points required in the Eps neighborhood
of a point for that point to be a core point.
Eps neighborhood: The Eps neighborhood of a point i is defined as
NEps(i) = { k belongs to D and dist(i, k) <= Eps }.
Directly density reachable: A point i is directly density reachable from a point k with
respect to Eps and MinPts if i belongs to NEps(k) and k is a core point, that is,
|NEps(k)| >= MinPts.
Density reachable:
A point i is density reachable from a point j with respect to Eps and MinPts if there is a
chain of points p1, ..., pn with p1 = j and pn = i such that each pi+1 is directly density
reachable from pi.
Density connected:
A point i is density connected to a point j with respect to Eps and MinPts if there is a point
o such that both i and j are density reachable from o with respect to Eps and MinPts.
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based
scanning method: it grows clusters from core points whose Eps neighborhoods contain at
least MinPts points, and it labels points that are not density reachable from any core point
as noise.
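A minimal scikit-learn sketch of DBSCAN on toy data (the eps and min_samples values are illustrative):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps plays the role of Eps and min_samples the role of MinPts;
# points labeled -1 are noise (not density reachable from any core point)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print("cluster labels found:", set(db.labels_))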
OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. Rather than
producing a clustering explicitly, it computes an ordering of the database with respect to
its density-based clustering structure. This cluster ordering contains information
equivalent to density-based clusterings obtained over a wide range of parameter settings.
OPTICS is useful for both automatic and interactive cluster analysis, including determining
an intrinsic clustering structure.
DENCLUE
DENCLUE (DENsity-based CLUstEring) models the overall density of the data as the sum of
influence functions of the individual data points; clusters are determined by the local
maxima (density attractors) of this overall density function.
Expectation-Maximization (EM) in data mining involves two main steps: the E-step
(Expectation step) and the M-step (Maximization step). Here’s a brief overview:
Expectation Step (E-step):
Objective: Estimate the values of latent (unobservable) variables given the observed data
and current parameter estimates.
Process: Calculate the expected values of the latent variables based on the current model
parameters. This step involves computing probabilities or membership weights for each
data point belonging to different clusters or categories.
Maximization Step (M-step):
Objective: Maximize the likelihood function with respect to the model parameters.
Process: Adjust the model parameters to maximize the likelihood of the observed data,
incorporating the information gained from the E-step. This step involves updating the
parameters of the model to better fit the observed data.
Types of Expectation-Maximization:
Hard EM:
Assumes that the latent variables are fixed or assigned to the most likely category after
each E-step. It results in a deterministic assignment of data points to clusters.
Soft EM:
Allows for probabilistic assignment of data points to clusters in the E-step. Instead of
assigning a data point to a single cluster, it assigns probabilities of belonging to each
cluster. This results in a more flexible and probabilistic clustering.
EM for Hidden Markov Models (Baum-Welch):
Applied in sequential data analysis, where the latent variables represent unobservable
hidden states. EM is used to estimate the transition probabilities and emission
probabilities of those hidden states in the model.
Understanding and applying EM in these various forms enables data miners to handle
complex data structures and uncover underlying patterns in the data.
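As one concrete, hedged example, scikit-learn's GaussianMixture is fitted with EM; the sketch below uses synthetic data and an illustrative number of components:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# GaussianMixture is fitted with EM: the E-step computes soft membership
# probabilities, the M-step re-estimates means, covariances, and weights
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("converged:", gmm.converged_)
print("soft memberships of first point:", gmm.predict_proba(X[:1]).round(3))
print("hard cluster of first point:", gmm.predict(X[:1]))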
Grid-based methods: working and types in data mining
Grid-based methods in data mining involve dividing the data space into a grid and then
analyzing the data within each grid cell. This approach is useful for tasks like clustering,
density estimation, and outlier detection. Types of grid-based methods include:
Grid Clustering: Divides the data into a grid and assigns each point to the grid cell it falls
into. Cells with more points may indicate clusters.
Density-Based Grids: Focus on regions with high data density, helping identify areas with
significant data concentrations.
Grid-based Outlier Detection: Identifies cells with lower data density, suggesting potential
outliers or anomalies.
Wavelet-based Methods: Combine grid structures with wavelet transforms to analyze data
at multiple resolutions, capturing both global and local patterns.
Grid-based methods provide a scalable and efficient way to process large datasets,
especially in spatial data analysis or applications where data distribution varies across
space.
Grid Creation: Divide the data space into a grid by creating a set of cells or buckets. This
grid structure can be one-dimensional, two-dimensional, or even higher-dimensional,
depending on the nature of the data.
Data Mapping: Assign each data point to the appropriate grid cell based on its attributes or
coordinates. This mapping helps organize and structure the data spatially within the grid.
Analysis within Cells: Perform data analysis within each grid cell independently. This can
include tasks such as counting points, calculating statistics, or applying specific
algorithms to understand the characteristics of the data within each cell.
Visualization: Visualize the results to interpret patterns or anomalies in the data. Heatmaps
or other graphical representations of the grid can help in understanding the distribution and
relationships within the dataset.
Adjustment and Refinement: Depending on the analysis results, the grid parameters or
algorithms may be adjusted to refine the process. This iterative refinement helps in
achieving meaningful insights from the data.
Grid-based methods are particularly useful in spatial data mining, where the relationships
between data points are influenced by their spatial proximity. They provide a structured
approach to handling large datasets, allowing for efficient processing and analysis.
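A small NumPy sketch of the grid idea, binning random 2-D points into a 10 x 10 grid and inspecting the cell counts (all values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1000, 2))          # 1000 points in the unit square

# Grid creation + data mapping: divide the 2-D space into a 10 x 10 grid
# and count how many points fall in each cell
counts, x_edges, y_edges = np.histogram2d(points[:, 0], points[:, 1], bins=10)

# Analysis within cells: dense cells hint at clusters, sparse cells at potential outliers
print("densest cell count:", counts.max())
print("empty or sparse cells:", int((counts <= 2).sum()))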
Model-based clustering methods
Common techniques include Gaussian Mixture Models (GMM), which assume that the
data within each cluster follows a Gaussian distribution, and Finite Mixture Models (FMM),
where a finite number of components are used to model the underlying distributions.
The Expectation-Maximization (EM) algorithm is often employed to estimate the
parameters of these models iteratively. It involves an “expectation” step, where cluster
assignments are updated based on current parameter estimates, and a “maximization”
step, where parameters are recalculated based on the updated cluster assignments.
These methods can be powerful for discovering hidden patterns and structures in complex
datasets, but they do require assumptions about the underlying distribution of the data.
Careful consideration of model selection and validation is crucial for the effectiveness of
model-based clustering in data mining
Model-based clustering methods in data mining encompass various techniques, each with
its own characteristics and working principles. Here are some types and a brief overview of
how they work:
Gaussian Mixture Models (GMM):
Working: Assumes that the data is generated from a mixture of Gaussian distributions. The
algorithm iteratively estimates the parameters (means, covariances, and weights) of these
distributions using the Expectation-Maximization (EM) algorithm.
Finite Mixture Models (FMM):
Working: Similar to GMM but more general, allowing the use of different distribution types
for each cluster. The EM algorithm is also commonly used to estimate parameters
iteratively.
Categorical / Latent Class Models:
Working: Tailored for categorical data, these methods model the probability distribution of
categorical variables within each cluster. Examples include Latent Class Analysis (LCA)
and Categorical Latent Variable Models.
Bayesian Clustering:
Working: Incorporates Bayesian principles into the clustering process, allowing for the
incorporation of prior knowledge and uncertainty. Markov Chain Monte Carlo (MCMC)
methods are often employed for inference.
Density-Based Clustering:
Working: Identifies clusters based on regions of higher data density. DBSCAN (Density-
Based Spatial Clustering of Applications with Noise) is a popular algorithm in this category.
For the mixture-model approaches above, the parameters are typically estimated with the
EM algorithm:
E-step (Expectation): Assign data points to clusters probabilistically based on the current
model.
M-step (Maximization): Update the model parameters based on the assigned clusters.
Iterate: Repeat the E-step and M-step until convergence or a stopping criterion is met.
Choosing the appropriate model depends on the characteristics of the data and the
assumptions that can be reasonably made. Evaluation metrics and validation techniques
are crucial for assessing the quality of the clustering results.
Constraint-based clustering
Constraints on the selection of clustering parameters: a user may wish to set a desired
range for each clustering parameter. Clustering parameters are generally quite specific to
the given clustering algorithm. Examples of parameters include k, the desired number of
clusters in the k-means algorithm, or ε (the radius) and MinPts (the minimum number of
points) in the DBSCAN algorithm.
Although such user-specified parameters can strongly influence the clustering results, they
are generally confined to the algorithm itself. Therefore, their fine-tuning and processing
are usually not treated as a form of constraint-based clustering.
A smarter method is to partition the customers into two classes: high-value customers
(who require frequent, regular service) and ordinary customers (who require occasional
service). To save costs while still providing good service, the manager can add constraints
on the resulting clusters.
Outlier analysis in data mining is the process of identifying and examining data points that
significantly differ from the rest of the dataset. An outlier can be defined as a data point
that deviates significantly from the normal pattern or behavior of the data. Various factors,
such as measurement errors, unexpected events, data processing errors, etc., can cause
these outliers. When plotted, outliers appear as isolated points that deviate significantly
from the rest of the data points. Outliers are also often referred to as anomalies,
aberrations, or irregularities.
Outlier analysis in data mining can provide several benefits, as mentioned below –
Improved accuracy of data analysis – Outliers can skew the results of statistical analyses
or predictive models, leading to inaccurate or misleading conclusions. Detecting and
removing outliers can improve the accuracy and reliability of data analysis.
Better decision-making – Outlier analysis in data mining can help decision-makers identify
and understand the factors affecting their data, leading to better-informed decisions.
Global Outliers
These are data points that are significantly different from the rest of the dataset in a global
sense. Global outliers are typically detected using statistical methods that focus on the
extreme values of the entire dataset. For example, if we have a dataset of heights for a
group of people in which one person is 7 feet tall while the rest of the heights range
between 5 and 6 feet, the height of 7 feet is a global outlier.
Collective Outliers
These are groups of data points that are significantly different from the rest of the dataset
when considered together. Collective outliers are typically detected using clustering
algorithms or other methods that group similar data points. For example, suppose we have
a dataset of customer transactions, and a group of customers consistently makes
purchases that are significantly larger than the rest of the customers. In that case, this
group of customers could be considered a collective outlier. Similarly, in an intrusion
detection system, the transmission of a DOS packet from one PC to another PC can be
considered normal behavior, but if DOS packets are transmitted to many PCs at the same
time, the transmissions taken together are considered a collective outlier.
Contextual (Conditional) Outliers
These data points significantly differ from the rest of the dataset in a specific context.
Contextual outliers are typically detected using domain knowledge or contextual
information relevant to the dataset. For example, a temperature of 40 degrees Celsius in a
city may be considered normal in the summer but a contextual outlier in the winter.
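As a closing illustration, here is a simple statistical (z-score) sketch in Python that flags the 7-foot height from the earlier example as a global outlier; the 2-standard-deviation threshold is an arbitrary choice:

import numpy as np

# Heights in feet; the 7.0 value is a global outlier relative to the rest
heights = np.array([5.4, 5.6, 5.8, 5.5, 6.0, 5.7, 5.9, 7.0])

# A simple statistical rule: flag points more than 2 standard deviations from the mean
z_scores = (heights - heights.mean()) / heights.std()
outliers = heights[np.abs(z_scores) > 2]
print("z-scores:", z_scores.round(2))
print("detected outliers:", outliers)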