UNIT 3
In data mining, identifying patterns within large datasets is crucial for extracting meaningful insights.
These patterns can be categorized into two primary types:
1. Descriptive Patterns:
Frequent Patterns: These are items or events that occur together frequently within a dataset. For
example, in market basket analysis, discovering that customers often purchase bread and butter
together.
Sequential Patterns: These involve identifying regular sequences of events or items. For instance,
understanding that customers who buy a smartphone often purchase a phone case shortly after.
Clustering: This technique groups similar data points based on specific characteristics, aiding in
understanding the inherent structure of the data.
2. Predictive Patterns:
Classification: This involves assigning data points to predefined categories based on learned patterns
from historical data. For example, categorizing emails as 'spam' or 'not spam' based on their content.
Regression: This technique predicts a continuous value based on input variables. For instance,
forecasting sales figures based on advertising spend and market conditions.
The process of pattern discovery in data mining typically involves several steps: defining the analytical
objective, collecting and preprocessing the relevant data, applying a suitable mining technique (such as
frequent pattern mining, clustering, or classification), evaluating the discovered patterns against
interestingness criteria, and interpreting and acting on the results.
By systematically following these steps, organizations can uncover valuable patterns that inform
strategic decisions and drive innovation.
Pattern Evaluation Methods
In data mining, pattern evaluation is the process of rating the usefulness and importance of the patterns
that have been found. It is a crucial phase in the data mining workflow and is essential for drawing
insightful conclusions from enormous volumes of data: by systematically assessing each identified
pattern for its utility, importance, and quality, it acts as a filter that distinguishes useful patterns from
noise and unimportant connections.
Association rules: Association rule mining is an unsupervised learning technique used to discover
interesting relationships or associations among variables in large datasets. It is widely used in various
fields such as market basket analysis, web usage mining, and continuous production. Example: "If a
customer buys a laptop, there is a 70% chance they will buy a mouse."
Sequential Patterns: These involve identifying regular sequences of events or items. For instance,
understanding that customers who buy a smartphone often purchase a phone case shortly after.
Support-Confidence Framework
Support measures how frequently a rule holds by describing how often an itemset occurs in a dataset.
It is calculated as the number of transactions containing the itemset divided by the total number of
transactions. Confidence represents the conditional likelihood of the consequent item given the
antecedent item. It is calculated as the proportion of transactions containing both the antecedent and
the consequent among the transactions that contain the antecedent.
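The following is a minimal Python sketch of these two measures; the transactions, the rule
{bread} -> {butter}, and the resulting numbers are illustrative examples, not data from the text.

# Minimal sketch: computing support and confidence for one rule.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

antecedent, consequent = {"bread"}, {"butter"}
sup_rule = support(antecedent | consequent, transactions)   # support of the whole rule
conf_rule = sup_rule / support(antecedent, transactions)    # confidence = P(consequent | antecedent)
print(f"support = {sup_rule:.2f}, confidence = {conf_rule:.2f}")   # support = 0.50, confidence = 0.67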
Additional assessment metrics used to rate the strength and interestingness of association rules include
lift and conviction. Lift quantifies how dependent the antecedent and consequent of a rule are on each
other. It is calculated as the ratio of the rule's observed support to the support expected if the
antecedent and consequent were independent (equivalently, the rule's confidence divided by the
consequent's support). A lift value above 1 indicates a positive correlation between the components, a
value of 1 indicates independence, and a value below 1 indicates a negative correlation.
Conviction, in contrast, indicates how much more often the rule would make an incorrect prediction if
the antecedent and consequent were independent. It is calculated as the ratio of the complement of the
consequent's support to the complement of the rule's confidence, i.e.
(1 - support(consequent)) / (1 - confidence). Conviction values well above 1 imply strong links between
the items, while values closer to 1 suggest weaker relationships.
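Continuing the sketch above, lift and conviction can be computed directly from support and confidence
values; the numbers below are illustrative.

# Minimal sketch: lift and conviction from support and confidence values.
sup_consequent = 0.75    # support of the consequent on its own (illustrative)
confidence = 0.67        # confidence of the rule antecedent -> consequent (illustrative)

lift = confidence / sup_consequent                     # > 1: positive correlation, < 1: negative
conviction = (1 - sup_consequent) / (1 - confidence)   # well above 1: strong association
print(f"lift = {lift:.2f}, conviction = {conviction:.2f}")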
Evaluation of sequential patterns entails determining the importance and applicability of patterns found
in sequential data. The Sequential Pattern Growth algorithm is one often employed technique for
assessing sequential patterns.
It finds sequential patterns by gradually expanding them from shorter to longer sequences, making sure
that each extension is still common in the dataset. This technique allows analysts to quickly find and
assess sequential patterns of various durations and complexity.
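A minimal Python sketch of this pattern-growth idea is shown below; it simplifies matters by treating
each sequence as a list of single items and by taking min_sup as an absolute count, so it is an
illustration of the approach rather than a full implementation.

# Minimal pattern-growth sketch for sequential patterns.
def grow(db, min_sup, prefix=()):
    patterns = []
    counts = {}
    for seq in db:                         # count each item once per sequence
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        pattern = prefix + (item,)
        patterns.append((pattern, sup))
        # Project the database: keep the suffix after the first occurrence of the item.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        patterns.extend(grow([s for s in projected if s], min_sup, pattern))
    return patterns

sequences = [["phone", "case", "charger"], ["phone", "charger"], ["case", "phone", "case"]]
print(grow(sequences, min_sup=2))   # e.g. (('phone',), 3), (('phone', 'case'), 2), (('phone', 'charger'), 2)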
Episode Evaluation
Another assessment technique utilized in the study of sequential patterns is episode evaluation. The
term "episode" refers to a group of related events that take place in a predetermined time frame or
sequence. In medical research, for instance, episodes could stand in for groups of symptoms that
frequently coexist in a given condition.
Measurement of the importance and recurrence of certain event combinations is the main goal of
episode assessment. By examining episodes, analysts can obtain insight into the patterns of how events
occur together and can find significant temporal or associational correlations in the sequential data.
Pattern Mining
Pattern mining is a data mining technique focused on discovering patterns or regularities in large
datasets. These patterns can reveal useful insights and relationships within the data, which are helpful
for decision-making and predictive analysis.
Example (frequent itemsets): Identifying frequent itemsets in market basket analysis (e.g., customers
buying bread and milk together).
Example (sequential patterns): Finding patterns in customer transactions over time (e.g., buying a
phone, then a phone cover).
Example (association rules): "If a customer buys a laptop, there is a 70% chance they will buy a mouse."
Subgraph Mining: Discovering frequently occurring subgraphs in graph-structured data, such as
recurring substructures in chemical compounds or social networks.
Multilevel Pattern Mining
Concept hierarchies organize data into multiple levels of abstraction, facilitating analysis at different
granularities. For example:
Geographical Hierarchy:
Level 1: Country
Level 2: State/Province
Level 3: City
Product Hierarchy:
Level 1: Electronics
Level 2: Computers
Level 3: Laptops
By analyzing data across these levels, organizations can uncover patterns that may not be evident when
considering a single level of abstraction.
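As an illustration, the sketch below counts support for the same transactions at two levels of a made-up
product hierarchy; the items, categories, and numbers are assumptions for the example.

# Minimal sketch: support counting at two abstraction levels of a concept hierarchy.
hierarchy = {"laptop": "computers", "desktop": "computers",
             "mouse": "accessories", "keyboard": "accessories"}
transactions = [{"laptop", "mouse"}, {"desktop", "keyboard"}, {"laptop", "keyboard"}]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

rolled_up = [{hierarchy[item] for item in t} for t in transactions]   # roll items up to categories
print(support({"laptop"}, transactions))     # ~0.67 at the item level
print(support({"computers"}, rolled_up))     # 1.00 at the category level
# Lower levels are sparser, which is why reduced support thresholds are often used further down.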
Approaches to Multilevel Pattern Mining:
Top-Down Approach: Mining starts at the highest (most general) level of the hierarchy, and only the
patterns found frequent there are explored further at progressively lower, more specific levels.
Bottom-Up Approach: Mining starts at the lowest (most specific) level, and frequent patterns are
progressively generalized to higher levels of the hierarchy.
Challenges in Multilevel Pattern Mining:
Determining minimum support levels for different hierarchy levels can be complex.
Uniform support thresholds may not be suitable across all levels.
Dynamic adjustment of support thresholds is often necessary to capture meaningful patterns at
each level.
Striking the right balance between detailed, specific patterns and broader, generalized patterns
is crucial.
Overly specific patterns may lack general applicability, while overly general patterns may miss
important nuances.
Applications of Multilevel Pattern Mining:
Fraud Detection:
Identifying fraudulent behaviors that manifest differently across various levels of transaction
data.
For example, detecting anomalies in transaction amounts at both the account level and the
regional level.
By employing multilevel pattern mining, organizations can gain a more comprehensive understanding of
their data, leading to more informed decision-making and strategic planning.
Multidimensional Pattern Mining
Multidimensional Association Rules:
Definition: These rules identify relationships among items across different dimensions. For example,
analyzing sales data might reveal that "customers aged 30-40 (age dimension) who live in urban areas
(location dimension) tend to purchase electronic gadgets (product dimension)."
Types: Inter-dimensional rules involve no repeated dimensions, while hybrid-dimensional rules allow a
dimension (such as the purchased item) to appear more than once.
Techniques: Algorithms like Apriori can be extended to handle multiple dimensions by treating each
dimension as a separate attribute (see the sketch below).
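A minimal sketch of this idea: each record is turned into a set of (dimension, value) pairs so that an
ordinary itemset-support count applies; the records and the pattern are illustrative.

# Minimal sketch: multidimensional patterns as (dimension, value) pairs.
records = [
    {"age": "30-40", "location": "urban", "product": "electronics"},
    {"age": "30-40", "location": "urban", "product": "electronics"},
    {"age": "20-30", "location": "rural", "product": "groceries"},
]
itemized = [set(r.items()) for r in records]          # each record becomes a set of pairs

pattern = {("age", "30-40"), ("location", "urban"), ("product", "electronics")}
support = sum(1 for t in itemized if pattern <= t) / len(itemized)
print(f"support = {support:.2f}")                     # 0.67: the pattern holds in 2 of 3 records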
Multidimensional Sequential Patterns:
Definition: Focuses on finding sequences of events or items that occur in a specific order across multiple
dimensions.
Applications: Useful in analyzing customer behaviors over time, considering factors like time of
purchase, location, and product categories.
Techniques: Algorithms such as PrefixSpan can be adapted to incorporate multiple dimensions by
considering each dimension's sequential impact.
Challenges:
Data Sparsity: As dimensions increase, the data can become sparse, making it challenging to find
significant patterns.
Interpretability: Ensuring that the discovered patterns are understandable and actionable.
By leveraging multidimensional pattern mining, organizations can gain deeper insights into their data,
leading to more informed decision-making and strategic planning.
Constraint-Based Frequent Pattern Mining
Constraint-based frequent pattern mining is an advanced approach in data mining that focuses on
discovering frequent patterns within datasets while adhering to specific user-defined constraints. By
incorporating constraints, this method enhances the efficiency of the mining process and ensures that
the extracted patterns are both relevant and actionable.
Key Concepts:
User-Defined Constraints: Conditions or rules specified by users to filter and guide the pattern discovery
process.
Types of Constraints:
Anti-Monotonic Constraints: If a pattern violates the constraint, all its supersets will also violate it. For
example, a constraint specifying that the sum of items in a pattern should not exceed a certain value.
Monotonic Constraints: If a pattern satisfies the constraint, all its supersets will also satisfy it. For
instance, a constraint requiring a minimum number of items in a pattern.
Succinct Constraints: Constraints that can be directly applied during the pattern generation phase, such
as specifying that a particular item must be included in the pattern.
Benefits:
Efficiency: By applying constraints early in the mining process, the search space is significantly reduced,
leading to faster computations.
Relevance: Ensures that the discovered patterns meet specific criteria, making them more meaningful
and actionable for users.
Constraint Pushing: Integrating constraints directly into the mining algorithms allows for the pruning of
candidate patterns that do not meet the specified criteria, enhancing efficiency.
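A minimal sketch of constraint pushing with an anti-monotonic constraint (the total price of an itemset
must not exceed a budget) is given below; the prices, transactions, and thresholds are made-up examples.

# Minimal sketch: pushing an anti-monotonic price constraint into level-wise mining.
from itertools import combinations

prices = {"laptop": 900, "mouse": 25, "keyboard": 45, "monitor": 200}
transactions = [{"laptop", "mouse"}, {"laptop", "mouse", "keyboard"},
                {"mouse", "keyboard"}, {"monitor", "keyboard"}]
MIN_SUP, BUDGET = 2, 300

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

def within_budget(itemset):
    return sum(prices[i] for i in itemset) <= BUDGET   # anti-monotonic constraint

# Level 1: frequent single items that already satisfy the constraint.
level = [frozenset([i]) for i in prices if support(frozenset([i])) >= MIN_SUP and within_budget([i])]
frequent = list(level)
while level:
    # Prune candidates with the constraint immediately: any superset of a
    # violating itemset would also violate it.
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
    level = [c for c in candidates if within_budget(c) and support(c) >= MIN_SUP]
    frequent.extend(level)
print(frequent)   # e.g. mouse, keyboard, and {mouse, keyboard}; laptop is pruned by the budget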
Applications:
Market Basket Analysis: Identifying product combinations that meet specific profitability or inventory
constraints.
Bioinformatics: Discovering gene sequences that satisfy certain biological constraints, aiding in
understanding genetic relationships.
Fraud Detection: Detecting transaction patterns that adhere to predefined suspicious activity rules,
helping in identifying fraudulent behavior.
By integrating user-defined constraints into the pattern mining process, constraint-based frequent
pattern mining offers a focused and efficient approach to uncovering valuable insights within large
datasets.
Mining High-Dimensional Data
Mining high-dimensional data presents unique challenges due to the "curse of dimensionality," where
the sheer number of features can hinder traditional analysis methods. To effectively extract meaningful
patterns from such data, specialized techniques have been developed:
1. Dimensionality Reduction:
Principal Component Analysis (PCA): Transforms the original features into a set of linearly
uncorrelated variables called principal components, ordered by the amount of variance they
capture from the data.
t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique that is particularly
well-suited for embedding high-dimensional data into a low-dimensional space for visualization
purposes (see the sketch at the end of this section).
2. Subspace Clustering:
Approach: Identifies clusters within different subspaces of the data, acknowledging that clusters may
exist only in specific combinations of dimensions.
Techniques: Algorithms like CLIQUE and SUBCLU search for dense regions in various subspaces to find
meaningful clusters.
3. High-Dimensional Frequent Pattern Mining:
Challenges: The high dimensionality can lead to a vast number of potential patterns, making the mining
process computationally intensive.
Solutions: Incorporating constraints can help focus the search on the most relevant patterns, improving
efficiency.
4. Manifold Learning:
Concept: Assumes that high-dimensional data lie on low-dimensional manifolds within the
higher-dimensional space.
Techniques: Methods like Isomap and Locally Linear Embedding (LLE) aim to uncover these manifolds,
facilitating the analysis of the data's intrinsic structure.
5. Visualization Techniques:
Methods: Tools like parallel coordinates and heatmaps can help identify patterns, clusters, and outliers
within the data.
By employing these specialized techniques, analysts can effectively mine high-dimensional data,
uncovering valuable insights that might be obscured in lower-dimensional analyses.
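For instance, the reduction and manifold-learning techniques listed above can be sketched with
scikit-learn (assuming it is installed); the random matrix simply stands in for a real high-dimensional
dataset.

# Minimal sketch: PCA, t-SNE, and Isomap on a stand-in high-dimensional dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

X = np.random.rand(200, 50)                              # 200 samples, 50 dimensions

X_pca = PCA(n_components=2).fit_transform(X)             # linear projection
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)   # non-linear, for visualization
X_iso = Isomap(n_components=2).fit_transform(X)          # manifold learning
print(X_pca.shape, X_tsne.shape, X_iso.shape)            # (200, 2) for each embedding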
Data Classification
Data classification in data mining is the process of categorizing data into predefined groups or classes.
The goal of classification is to predict the category or class of an object based on its attributes or
features. It's a supervised learning technique, meaning the model is trained on labeled data, where each
data point already has a known class.
The classification process typically involves the following steps:
Data Preprocessing: This involves cleaning the data, handling missing values, and normalizing or
standardizing data.
Feature Selection/Extraction: Selecting the most relevant features or extracting new features from raw
data to improve model performance.
Training the Model: Using a set of labeled data (training set) to teach the classification algorithm how to
distinguish between classes.
Model Evaluation: Testing the trained model on unseen data (test set) to assess its accuracy, precision,
recall, and other evaluation metrics.
Prediction: After the model is trained and evaluated, it can be used to predict the class labels for new,
unseen data.
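A minimal end-to-end sketch of these steps, assuming scikit-learn and its bundled Iris dataset are
available:

# Minimal sketch of the classification workflow with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)                    # preprocessing: scale the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # training
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))      # evaluation
print("predicted class:", model.predict(X_test[:1]))                   # prediction on unseen data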
Common Classification Algorithms:
Decision Trees: These models split data into branches based on feature values, creating a tree-like
structure to classify data.
Naive Bayes: A probabilistic model that applies Bayes' Theorem to predict the class of data based on
prior probabilities.
Support Vector Machines (SVM): SVM finds the hyperplane that best separates different classes in the
feature space.
k-Nearest Neighbors (k-NN): This method classifies a data point based on the majority class of its
nearest neighbors.
Logistic Regression: A regression model used for binary classification, predicting probabilities of class
membership.
Applications of Classification:
Medical Diagnosis: Classifying patients based on symptoms, medical history, or test results into
categories like disease/no disease or risk levels.
Credit Scoring: Classifying individuals into "good" or "bad" credit risk categories based on financial data.
Image Recognition: Identifying objects in images and classifying them (e.g., distinguishing between
different animals or vehicles).
Evaluation Metrics:
Accuracy: The proportion of correctly classified instances among all instances.
Precision: The ratio of true positives to the sum of true positives and false positives.
Recall (Sensitivity): The ratio of true positives to the sum of true positives and false negatives.
F1 Score: The harmonic mean of precision and recall, used to balance both metrics in cases of
imbalanced classes.
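A small sketch computing these metrics from illustrative confusion-matrix counts:

# Minimal sketch: precision, recall, and F1 from confusion-matrix counts (illustrative numbers).
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)                              # 0.80
recall = tp / (tp + fn)                                 # ~0.67
f1 = 2 * precision * recall / (precision + recall)      # harmonic mean, ~0.73
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")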
Challenges in Classification:
Imbalanced Data: When certain classes are underrepresented, leading to biased models.
Overfitting: When a model is too complex and fits the training data too closely, making it less
generalizable to new data.
High Dimensionality: When there are too many features, leading to the "curse of dimensionality."
Data classification plays a vital role in data mining, helping in decision-making processes across various
industries.
Decision Tree Induction
The decision tree induction algorithm works by recursively splitting the dataset into subsets based on
certain conditions. The process stops when:
A node reaches a predefined threshold (e.g., a certain depth or minimum number of samples).
All data points at the node belong to the same class.
Several criteria can be used to determine the best feature for splitting the data:
Information Gain (ID3):
It measures how much "information" a feature gives us about the class. The feature that
provides the most reduction in entropy (uncertainty) is chosen.
Entropy: A measure of the uncertainty or impurity in a dataset.
Information Gain: The difference between the entropy of the original set and the weighted sum
of the entropy of each subset.
Formula:
Entropy(S) = - Σ p_i log2(p_i)
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) · Entropy(S_v)
where p_i is the proportion of samples in S belonging to class i, and S_v is the subset of S for which
attribute A takes the value v.
Gini Index (CART):
It measures the "impurity" of a dataset, with a value between 0 (perfectly pure) and 1 (completely
impure). The feature that results in the lowest Gini index is chosen.
Formula:
Gini(S) = 1 - Σ p_i^2
where p_i is the proportion of samples in S belonging to class i.
Chi-Square (CHAID):
It is a statistical test to measure the independence between the feature and the target variable. A higher
chi-square statistic indicates a better split.
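A small Python sketch of entropy, information gain, and the Gini index for an illustrative binary split:

# Minimal sketch: entropy, information gain, and Gini index (illustrative labels).
from math import log2

def entropy(labels):
    probs = [labels.count(c) / len(labels) for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(labels):
    probs = [labels.count(c) / len(labels) for c in set(labels)]
    return 1 - sum(p * p for p in probs)

parent = ["yes"] * 5 + ["no"] * 5
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4          # one candidate split

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted
print(f"entropy(parent)={entropy(parent):.2f}, gain={info_gain:.2f}, gini(left)={gini(left):.2f}")
# entropy(parent)=1.00, gain=0.28, gini(left)=0.32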
Advantages of Decision Trees:
Simple and Easy to Understand: The decision tree model is visual and easy to interpret.
Handles both numerical and categorical data: It can work with different types of data.
No Need for Data Normalization: Decision trees don’t require normalization of features.
Handles Missing Values: Decision trees can handle missing values through techniques like
surrogate splits.
Disadvantages of Decision Trees:
Overfitting: Decision trees are prone to overfitting, especially with deep trees. This can result in
a model that works well on training data but performs poorly on unseen data.
Instability: Small changes in the data can lead to a completely different tree.
Bias toward Features with More Categories: Features with many distinct values might dominate
the splits, leading to biased results.
Poor Performance with Continuous Data: Trees tend to perform worse with continuous data
compared to other methods like regression.
Pruning is a technique used to reduce the size of the decision tree to avoid overfitting. It involves
removing nodes that provide little additional predictive power. There are two types of pruning:
1. Pre-pruning: Stopping the tree-building process early when the tree reaches a certain depth or
when further splits do not significantly improve the model.
2. Post-pruning: Building the tree fully and then removing branches that have little importance
(using techniques like cost-complexity pruning).
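A minimal sketch of both pruning styles with scikit-learn (assuming it is available); the alpha value is
illustrative and would normally be chosen by validation.

# Minimal sketch: pre-pruning (growth limits) and post-pruning (cost-complexity) in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)   # stop growing early
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)                   # grow, then prune
print(pre_pruned.get_depth(), post_pruned.get_depth())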
Common Decision Tree Algorithms:
ID3 (Iterative Dichotomiser 3): Uses information gain to select the best feature to split at each
node.
C4.5: An extension of ID3, C4.5 handles continuous features and pruning, using information gain
ratio instead of simple information gain.
CART (Classification and Regression Trees): Can handle both classification and regression
problems and uses the Gini index to make splits.
Example:
Given a dataset with features such as Age, Income, and Education, and a target class like Purchase
Decision (yes/no), a decision tree might look like this:
Age <= 30?
  Yes -> Income < 50K?
    Yes -> Purchase: Yes
    No  -> Purchase: No
  No  -> (further splits, e.g., on Education)
This tree suggests that if a person is aged 30 or younger, the decision to purchase depends on their
income. If the income is less than 50K, they are likely to make a purchase, otherwise not.
Decision tree induction is a powerful and interpretable technique, and with proper handling of
overfitting and data quality, it can deliver great results in a variety of real-world applications.
Bayesian Classification
1. Naive Bayes Classifier
Overview:
Assumes that the features are conditionally independent given the class label.
Despite the "naive" assumption of independence, it performs surprisingly well in many real-world
scenarios.
Formula:
Bayes' theorem: P(C | X) = P(X | C) · P(C) / P(X)
Under the independence assumption, the classifier predicts the class C that maximizes
P(C) · Π P(x_i | C) over the features x_i of the instance X.
Applications: Spam filtering, text and document classification, and sentiment analysis.
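A minimal Gaussian Naive Bayes sketch with scikit-learn (assuming it is available):

# Minimal sketch: Gaussian Naive Bayes classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)             # estimates P(C) and P(x_i | C)
print("test accuracy:", model.score(X_test, y_test))   # applies Bayes' theorem to each test sample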
2. Bayesian Networks
Overview: A Bayesian network is a directed acyclic graph in which nodes represent variables and edges
represent conditional dependencies; each node stores a conditional probability table given its parents.
Key Features: Models dependencies among features explicitly (unlike Naive Bayes) and supports
probabilistic inference under uncertainty.
Applications: Medical diagnosis, risk assessment, and decision support systems.
Advantages:
Bayes classifiers are computationally efficient, work well with relatively small training sets, and handle
uncertain or probabilistic evidence in a principled way.
Limitations:
Naive Bayes assumes independence among features, which is rarely true in practice.
Bayesian Networks require expert knowledge for structure design.
Sensitive to the quality of prior probabilities.
Comparison with Other Classifiers:
Versus Decision Trees: Bayes classifiers generally require less training data and are less prone to
overfitting.
Versus SVMs and Neural Networks: Naive Bayes is faster but usually less accurate on complex
tasks.
Rule-Based Classification
How It Works:
1. Rule Form: Classification knowledge is expressed as IF-THEN rules of the form
IF (condition) THEN (class label).
2. Rule Components:
Antecedent (Condition): Combination of attribute tests.
Consequent (Class Label): Target class assigned if the condition is true.
3. Classification Process:
An instance is classified by finding the first rule whose condition is satisfied.
If no rule matches, a default class is assigned.
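A minimal sketch of this process as an ordered rule list with a default class; the rules and records are
made-up examples.

# Minimal sketch: classifying with an ordered rule list and a default class.
rules = [
    (lambda r: r["income"] == "high" and r["credit"] == "good", "approve"),
    (lambda r: r["income"] == "low", "reject"),
]
DEFAULT = "review"

def classify(record):
    for condition, label in rules:       # use the first rule whose condition is satisfied
        if condition(record):
            return label
    return DEFAULT                       # no rule matched

print(classify({"income": "high", "credit": "good"}))   # approve
print(classify({"income": "medium", "credit": "bad"}))  # review (default)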
Rule Extraction Methods:
1. Direct Methods:
Extract rules directly from the data.
Example Algorithms:
o RIPPER (Repeated Incremental Pruning to Produce Error Reduction): Efficient for
large datasets.
o CN2: Handles noisy data using statistical significance tests.
o OneR: Creates simple rules using a single attribute.
2. Indirect Methods:
Extract rules from other models (e.g., Decision Trees or Neural Networks).
Example:
C4.5 / J48 Decision Trees: Rules are derived from paths from the root to the leaf nodes.
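A small sketch of the indirect approach with scikit-learn: fit a decision tree and read each root-to-leaf
path of the printed structure as one IF-THEN rule.

# Minimal sketch: deriving rules indirectly from a fitted decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each root-to-leaf path in the printed tree corresponds to one IF-THEN classification rule.
print(export_text(tree, feature_names=list(iris.feature_names)))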
Advantages: Rules are easy for humans to interpret, fast to apply to new instances, and often achieve
accuracy comparable to decision trees.
Applications:
Customer segmentation
Medical diagnosis
Fraud detection
Intrusion detection systems
Tools and Libraries:
Scikit-learn (Python): Implements Decision Tree classifiers that can be converted to rules.
Weka (Java): Provides RIPPER and PART rule-based classifiers.
Orange (Python): Visual programming tool supporting rule-based classifiers.