
UNIT – IV

DATA MINING
INTRODUCTION TO DATA MINING: -
What is Data Mining?
Data Mining is the process of discovering patterns, relationships, and useful
insights from large datasets using various techniques such as machine learning,
statistics, and artificial intelligence. It helps in decision-making by extracting
meaningful information from raw data.
Key Features of Data Mining: -
• Extracts useful patterns and relationships from large datasets.
• Uses statistical, machine learning, and AI techniques.
• Helps in decision-making, trend prediction, and knowledge discovery.
• Applied in various domains like business, healthcare, finance, and e-commerce.
Steps in Data Mining Process: -
Data mining follows a systematic approach to uncover patterns:
1. Data Collection – Gathering data from different sources.
2. Data Preprocessing – Cleaning, transforming, and handling missing values.
3. Data Exploration – Understanding data using visualization and statistical
techniques.
4. Pattern Discovery – Applying algorithms to identify trends and
relationships.
5. Evaluation & Interpretation – Validating and analyzing results.
6. Deployment – Using the findings in real-world applications.
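The steps above can be sketched on a toy dataset. This is a minimal pure-Python illustration; all values are hypothetical:

```python
from statistics import mean

# 1. Data collection: toy transaction amounts (hypothetical values)
raw = [120.0, None, 95.5, 430.0, None, 88.0]

# 2. Preprocessing: replace missing values with the mean of the known values
known = [x for x in raw if x is not None]
filled = [x if x is not None else mean(known) for x in raw]

# 3. Exploration: simple summary statistics
print("mean:", round(mean(filled), 2))

# 4. Pattern discovery: flag unusually large amounts (a simple threshold rule)
outliers = [x for x in filled if x > 2 * mean(filled)]
print("possible outliers:", outliers)  # → [430.0]
```

Evaluation and deployment would then judge whether such flagged values are meaningful and feed the rule into a real application.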
Techniques of Data Mining: -
Data mining uses various methods to extract patterns:
1. Classification – Categorizing data into predefined classes (e.g., spam vs.
non-spam emails).
2. Clustering – Grouping similar data points (e.g., customer segmentation in
marketing).
3. Association Rule Mining – Finding relationships between variables (e.g.,
"If a customer buys bread, they are likely to buy butter").
4. Regression Analysis – Predicting numerical values (e.g., stock price
prediction).
5. Anomaly Detection – Identifying unusual data points (e.g., fraud detection
in banking).
Applications of Data Mining: -
• Business & Marketing – Customer segmentation, recommendation systems (Amazon, Netflix).
• Finance & Banking – Fraud detection, credit risk analysis.
• Healthcare – Disease prediction, medical diagnosis.
• E-commerce – Personalized recommendations, sentiment analysis.
• Education – Student performance prediction, adaptive learning.

KNOWLEDGE DISCOVERY PROCESS IN DATA MINING: -


The Knowledge Discovery in Databases (KDD) process is a systematic approach
used to extract meaningful insights from large datasets. It involves multiple steps,
from data collection to pattern evaluation, ensuring that valuable knowledge is
derived from raw data.
Steps in the Knowledge Discovery Process: -
1. Data Selection
• Identifying and gathering relevant data from multiple sources.
Example: Collecting customer transaction data from a retail database.
2. Data Preprocessing (Cleaning & Transformation)
• Handling missing values, duplicate records, and noisy data.
• Normalizing and transforming data into a suitable format.
Example: Removing duplicate customer records from a database.
3. Data Integration
• Combining data from multiple sources into a unified format.
Example: Merging customer purchase history with demographic details.
4. Data Reduction
• Reducing data size while preserving essential patterns.
• Techniques: Feature selection and dimensionality reduction (e.g., Principal Component Analysis, PCA).
Example: Removing irrelevant attributes such as timestamps in sales data.
5. Data Mining (Pattern Discovery)
• Applying machine learning and statistical algorithms to uncover patterns.
• Techniques: Classification, clustering, association rule mining.
Example: Identifying frequent itemsets in a supermarket (e.g., "Customers who buy bread often buy butter").
6. Pattern Evaluation & Interpretation
• Assessing the validity and usefulness of discovered patterns.
Example: Checking whether an identified customer trend aligns with business goals.
7. Knowledge Representation & Deployment
• Visualizing findings using graphs, charts, or reports.
• Implementing insights in real-world applications (e.g., recommender systems).
Example: Using predictive models for targeted marketing campaigns.

COUNTING CO-OCCURRENCE IN DATA MINING: -


What is Co-Occurrence?
Co-occurrence refers to the frequency with which two or more items, events, or
entities appear together in a dataset. It is widely used in market basket analysis,
text mining, and recommendation systems.
Examples:
• In a supermarket, if bread and butter are frequently bought together, they have a high co-occurrence.
• In text mining, the words "machine" and "learning" often appear together in documents.
• In social networks, if two users frequently like the same posts, they have a strong co-occurrence.
Methods for Counting Co-Occurrence
1. Co-Occurrence Matrix
A co-occurrence matrix is a table showing the frequency of two items appearing
together.
Example:
Market Basket Analysis
Item     Bread   Butter   Milk   Eggs
Bread      -       50      30     20
Butter    50        -      40     15
Milk      30       40       -     25
Eggs      20       15      25      -
Here, Butter & Bread = 50, meaning they were bought together 50 times.
2. Association Rule Mining
• Used in market basket analysis to find frequent patterns.
• Example rule: "If a customer buys bread, they are 70% likely to buy butter."
• Algorithms: Apriori, FP-Growth.
3. Pointwise Mutual Information (PMI) (for Text Analysis)
• Measures how often two words appear together compared to their individual frequencies.
• Used in natural language processing (NLP) to find word associations.
4. Graph-Based Approaches
• Represents items as nodes and co-occurrences as edges.
• Used in social network analysis and recommendation systems.
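Building a co-occurrence count like the matrix above can be sketched in a few lines of Python; the baskets here are hypothetical:

```python
from collections import Counter
from itertools import combinations

# Toy market baskets (hypothetical data)
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

# Count how often each unordered pair of items appears in the same basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts[("bread", "butter")])  # bread & butter co-occur in 2 baskets
```

Sorting each basket before pairing guarantees that (bread, butter) and (butter, bread) map to the same key, so the matrix stays symmetric.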
Applications of Co-Occurrence Counting: -
 Market Basket Analysis – Helps retailers suggest complementary products.
 Text Mining – Finds relationships between words in documents.
 Recommendation Systems – Suggests movies, books, or songs based on
user behavior.
 Social Network Analysis – Detects strong connections between users.
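The PMI measure described above can be computed over a toy corpus. This sketch uses document-level co-occurrence probabilities; the documents are hypothetical:

```python
import math
from collections import Counter
from itertools import combinations

# Toy "documents" as token lists (hypothetical corpus)
docs = [
    ["machine", "learning", "model"],
    ["machine", "learning", "data"],
    ["deep", "learning", "data"],
    ["machine", "vision"],
]

n = len(docs)
word_docs = Counter()   # in how many documents each word appears
pair_docs = Counter()   # in how many documents each word pair co-occurs
for doc in docs:
    words = set(doc)
    word_docs.update(words)
    pair_docs.update(combinations(sorted(words), 2))

def pmi(w1, w2):
    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), probabilities over documents
    key = tuple(sorted((w1, w2)))
    p_xy = pair_docs[key] / n
    return math.log2(p_xy / ((word_docs[w1] / n) * (word_docs[w2] / n)))

print(round(pmi("machine", "learning"), 3))
```

A PMI above zero means the pair co-occurs more often than its individual frequencies would predict; below zero, less often.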
ICEBERG QUERIES IN DATA MINING: -
What are Iceberg Queries?
Iceberg queries are a special type of SQL query that retrieve only frequent (large)
aggregate values from a dataset, filtering out the less significant ones. They are
useful in data mining, market basket analysis, and business intelligence where we
are only interested in frequently occurring patterns.
Why the name "Iceberg"?
Like an iceberg, which has most of its mass hidden underwater, iceberg
queries return only the "top" (frequent) results while ignoring the rest.
Key Features of Iceberg Queries: -
• Focuses on highly frequent occurrences in large datasets.
• Uses HAVING clauses in SQL to filter results based on aggregate values.
• Efficient for association rule mining, frequent itemset mining, and pattern discovery.
Use Cases of Iceberg Queries: -
• Market Basket Analysis – Finding frequently purchased product combinations.
• Text Mining – Identifying frequently used words in large documents.
• Recommendation Systems – Suggesting popular movies/books based on user preferences.
• Fraud Detection – Detecting unusual but frequent transaction patterns.
Optimizing Iceberg Queries: -
To improve efficiency on large datasets, the following techniques are used:
• Indexing – Speeds up data retrieval.
• Partitioning – Divides data into smaller, manageable parts.
• Bitmap Indexing – Efficiently stores categorical data for faster processing.
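A minimal runnable sketch of an iceberg query, using Python's built-in sqlite3 module; the sales rows and the threshold are hypothetical:

```python
import sqlite3

# In-memory table of (transaction, item) rows (hypothetical data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (txn INTEGER, item TEXT)")
rows = [(1, "bread"), (1, "butter"), (2, "bread"), (3, "bread"),
        (3, "milk"), (4, "eggs"), (5, "bread")]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Iceberg query: the HAVING clause keeps only items whose aggregate count
# clears the threshold, discarding the "underwater" infrequent rest
threshold = 2
cur = conn.execute(
    "SELECT item, COUNT(*) AS cnt FROM sales "
    "GROUP BY item HAVING COUNT(*) >= ? ORDER BY cnt DESC", (threshold,))
print(cur.fetchall())  # only "bread" (4 occurrences) clears the threshold
```

The same GROUP BY ... HAVING shape applies to any aggregate (SUM, AVG, etc.); the efficiency techniques above matter once the table no longer fits in memory.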

MINING FOR RULES IN DATA MINING: -


Rule mining is a key technique in data mining used to extract useful patterns,
relationships, and associations from large datasets. It helps businesses and
researchers uncover hidden patterns and make informed decisions.
1. What is Rule Mining?
Rule mining identifies relationships between variables in a dataset. The most
common type is Association Rule Mining, which finds patterns like:
Note: - "If a customer buys bread, they are likely to buy butter."
This is useful in market basket analysis, recommendation systems, and fraud
detection.
2. Types of Rule Mining
(a) Association Rule Mining
• Finds patterns in transaction data.
• Example: {Milk} → {Cookies} (customers who buy milk often buy cookies).
• Common algorithms: Apriori, FP-Growth.
(b) Classification Rule Mining
• Extracts rules to classify data into different categories.
• Example: If (income > 50K) AND (age < 30), then High Credit Risk.
• Algorithms: Decision Trees (C4.5, ID3), Rule-Based Classifiers.
(c) Sequential Rule Mining
• Finds patterns in time-ordered data.
• Example: "If a customer buys a phone today, they will buy a phone case within a week."
• Algorithms: GSP (Generalized Sequential Pattern), SPADE, PrefixSpan.
(d) Correlation Rule Mining
• Finds rules where items have a strong correlation.
• Example: Strong correlation between "high-income" customers and "luxury car purchases".
3. Metrics for Rule Evaluation
• Support – How often a rule appears in the dataset.
• Confidence – Probability of the consequent occurring given the antecedent.
• Lift – Measures the strength of a rule compared to random occurrence.
Example:
• Support: 30% (30% of transactions contain {Milk, Cookies}).
• Confidence: 80% (80% of people who buy milk also buy cookies).
• Lift: 2.5 (Buying milk increases the likelihood of buying cookies by 2.5 times).
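All three metrics can be computed directly from transaction data. A sketch with hypothetical baskets (so the resulting numbers differ from the illustrative percentages above):

```python
# Transactions as sets of items (hypothetical data)
transactions = [
    {"milk", "cookies", "bread"},
    {"milk", "cookies"},
    {"milk", "bread"},
    {"cookies", "bread"},
    {"milk", "cookies", "eggs"},
]
n = len(transactions)

def support(items):
    # Fraction of transactions containing all the given items
    return sum(1 for t in transactions if items <= t) / n

# Rule: {milk} -> {cookies}
sup = support({"milk", "cookies"})   # P(milk and cookies together)
conf = sup / support({"milk"})       # P(cookies | milk)
lift = conf / support({"cookies"})   # confidence vs. baseline popularity of cookies

print(sup, conf, lift)
```

A lift below 1 (as here) means buying milk makes cookies slightly *less* likely than their overall popularity suggests; above 1 indicates a genuinely positive association.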
4. Applications of Rule Mining
• Retail & Market Basket Analysis – Finding frequently bought items together.
• Healthcare – Identifying disease patterns and risk factors.
• Finance & Banking – Fraud detection, credit risk assessment.
• Web & E-commerce – Personalized recommendations (Amazon, Netflix).

ASSOCIATION RULE: -
Association Rule is a fundamental concept in data mining and machine learning,
primarily used for discovering relationships between variables in large datasets. It
is widely used in market basket analysis, recommendation systems, and various
other domains.
Definition: -
An association rule is an implication of the form X → Y, indicating that transactions containing itemset X tend to also contain itemset Y. Such rules identify patterns and relationships between items in a dataset.
Algorithms for Association Rule Mining
1. Apriori Algorithm:
o Generates frequent itemsets using a level-wise search.
o Uses a candidate generation-and-pruning approach.
2. FP-Growth (Frequent Pattern Growth):
o Uses a tree-based structure to generate frequent itemsets efficiently.
o Faster than Apriori in many cases.
3. Eclat (Equivalence Class Transformation):
o Uses a depth-first search approach.
o More efficient when working with dense datasets.
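The level-wise search of Apriori can be sketched in a few lines. This is an illustration only, on hypothetical baskets; the full algorithm also prunes candidates whose subsets are infrequent, which is omitted here:

```python
from itertools import combinations

def apriori(transactions, min_support):
    # Minimal level-wise Apriori sketch: returns frequent itemsets with counts
    n = len(transactions)
    frequent = {}
    # Level 1 candidates: every individual item
    candidates = [frozenset([i]) for i in {x for t in transactions for x in t}]
    k = 1
    while candidates:
        level = {}
        for c in candidates:
            count = sum(1 for t in transactions if c <= t)
            if count / n >= min_support:
                level[c] = count
        frequent.update(level)
        k += 1
        # Join step: combine frequent k-itemsets into (k+1)-item candidates
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == k})
    return frequent

baskets = [{"bread", "butter"}, {"bread", "milk"},
           {"bread", "butter", "milk"}, {"butter", "milk"}]
freq = apriori(baskets, min_support=0.5)
print(freq[frozenset({"bread", "butter"})])  # appears in 2 of 4 baskets
```

The key Apriori property is that no superset of an infrequent itemset can be frequent, which is why the search can stop as soon as a level yields no frequent sets.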
Applications of Association Rule Mining
• Market Basket Analysis (Retail): Identifying products frequently bought together.
• Recommendation Systems (E-commerce, Streaming Platforms): Suggesting items based on user behavior.
• Medical Diagnosis: Identifying co-occurring diseases or symptoms.
• Web Usage Mining: Understanding user behavior on websites.
• Fraud Detection: Detecting unusual transaction patterns in financial data.

TYPES OF CLUSTERING ALGORITHMS: -


There are several clustering algorithms, each with different approaches:
1. Partition-Based Clustering
• Divides the data into K clusters.
• Example: K-Means Clustering
o Each cluster is represented by a centroid.
o The algorithm iterates to minimize the distance between data points and centroids.
2. Hierarchical Clustering
• Creates a tree-like structure of clusters (dendrogram).
• Two approaches:
o Agglomerative Clustering (Bottom-Up) → Merges small clusters into larger ones.
o Divisive Clustering (Top-Down) → Splits large clusters into smaller ones.
3. Density-Based Clustering
• Groups points that are densely packed together.
• Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
o Works well for arbitrary-shaped clusters.
o Identifies noise/outliers separately.
4. Model-Based Clustering
• Assumes that data is generated by a mixture of probability distributions.
• Example: Gaussian Mixture Model (GMM)
o Uses multiple Gaussian distributions to model clusters.
5. Grid-Based Clustering
• Divides the data space into grid cells and clusters them.
• Example: STING (Statistical Information Grid-based Clustering)
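As an illustration of the partition-based approach, a minimal one-dimensional K-Means sketch; the data points and initial centroids are hypothetical:

```python
from statistics import mean

def kmeans_1d(points, centroids, iters=10):
    # Minimal 1-D K-Means sketch: assign each point to its nearest centroid,
    # then recompute each centroid as the mean of its assigned points.
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ended up empty
        centroids = [mean(pts) if pts else c for c, pts in clusters.items()]
    return sorted(centroids)

data = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
print(kmeans_1d(data, centroids=[1.0, 9.0]))  # converges to [1.5, 8.5]
```

Real implementations pick initial centroids carefully (e.g., k-means++ seeding) and stop when assignments no longer change rather than after a fixed iteration count.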
Key Metrics for Clustering Evaluation: -
Since clustering is unsupervised, evaluating the results can be tricky. Common
metrics include:
1. Within-Cluster Sum of Squares (WCSS) – Measures compactness of
clusters (used in K-Means).
2. Silhouette Score – Measures how similar a point is to its cluster vs. other
clusters.
3. Davies-Bouldin Index – Evaluates cluster separation and compactness.
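WCSS and the silhouette score can be illustrated on two toy one-dimensional clusters; the data are hypothetical, and the silhouette is shown for a single point:

```python
from statistics import mean

# Two toy 1-D clusters (hypothetical data)
cluster_a = [1.0, 2.0]
cluster_b = [8.0, 9.0]

def wcss(clusters):
    # Sum of squared distances from each point to its cluster's centroid
    total = 0.0
    for pts in clusters:
        c = mean(pts)
        total += sum((p - c) ** 2 for p in pts)
    return total

def silhouette(p, own, other):
    # s(p) = (b - a) / max(a, b), where a is the mean distance to p's own
    # cluster and b is the mean distance to the nearest other cluster
    a = mean(abs(p - q) for q in own if q != p)
    b = mean(abs(p - q) for q in other)
    return (b - a) / max(a, b)

print(wcss([cluster_a, cluster_b]))                     # 0.5 + 0.5 = 1.0
print(round(silhouette(1.0, cluster_a, cluster_b), 3))  # close to 1: well placed
```

Silhouette values range from -1 to 1; values near 1 mean a point sits firmly inside its own cluster, values near 0 sit on a boundary, and negative values suggest misassignment.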
Applications of Clustering: -
• Customer Segmentation (E-commerce, Marketing)
• Anomaly Detection (Fraud Detection, Cybersecurity)
• Image Segmentation (Medical Imaging, Object Detection)
• Document Clustering (News Categorization, Text Mining)
• Social Network Analysis (Community Detection)

CLASSIFICATION RULES: -
Definition:
Classification rules are used when the target variable is categorical. The goal is to
classify data into predefined categories based on input features.
Algorithms for Classification Rule Mining:
1. Decision Trees (C4.5, CART, ID3) – Generates rules from tree structures.
2. Rule-Based Classification (RIPPER, CN2) – Creates rules directly from
data.
3. Association Rule-Based Classification (Apriori, FP-Growth) – Uses
frequent patterns for classification.
4. Naïve Bayes – A probabilistic approach to rule-based classification.
Evaluation Metrics for Classification:
• Accuracy – Percentage of correctly classified instances.
• Precision, Recall, F1-score – Measure model performance on imbalanced datasets.
• ROC-AUC – Evaluates the model's discrimination capability.
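The first group of metrics can be computed directly from confusion-matrix counts. A sketch with hypothetical binary labels (1 = spam):

```python
# True labels vs. model predictions (hypothetical binary example: 1 = spam)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix counts
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)                       # of predicted spam, how much was spam
recall = tp / (tp + fn)                          # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

On imbalanced data, accuracy alone is misleading (predicting "not spam" everywhere can still score high), which is why precision, recall, and F1 are reported alongside it.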
REGRESSION RULES: -
Definition:
Regression rules are used when the target variable is continuous. The goal is to
predict numerical values based on input features.
Algorithms for Regression Rule Mining:
1. Decision Trees for Regression (CART – Classification and Regression Trees, M5P model trees) – Splits data into rules for numeric prediction.
2. Rule-Based Regression (M5 Rules) – Extracts rules from tree-based
regression.
3. Linear Regression Models – Generates equations instead of explicit rules.
4. Random Forest Regression – Uses an ensemble of decision trees for better
accuracy.
Evaluation Metrics for Regression:
• Mean Squared Error (MSE) – Measures the average squared error.
• Root Mean Squared Error (RMSE) – Square root of MSE, interpretable in the same unit as the target variable.
• R² Score (Coefficient of Determination) – Indicates how well the model explains variance in the data.
Applications of Classification and Regression Rules: -
Classification Applications: -
• Spam Detection – Classify emails as spam or not.
• Medical Diagnosis – Predict disease presence.
• Fraud Detection – Identify fraudulent transactions.
Regression Applications:
• Stock Market Prediction – Forecast stock prices.
• Weather Prediction – Estimate temperature, rainfall, etc.
• House Price Estimation – Predict real estate values.
