
UNIT – IV

DATA MINING
INTRODUCTION TO DATA MINING: -
What is Data Mining?
Data Mining is the process of discovering patterns, relationships, and useful
insights from large datasets using various techniques such as machine learning,
statistics, and artificial intelligence. It helps in decision-making by extracting
meaningful information from raw data.
Key Features of Data Mining: -
• Extracts useful patterns and relationships from large datasets.
• Uses statistical, machine learning, and AI techniques.
• Helps in decision-making, trend prediction, and knowledge discovery.
• Applied in various domains like business, healthcare, finance, and e-commerce.
Steps in Data Mining Process: -
Data mining follows a systematic approach to uncover patterns:
1. Data Collection – Gathering data from different sources.
2. Data Preprocessing – Cleaning, transforming, and handling missing values.
3. Data Exploration – Understanding data using visualization and statistical
techniques.
4. Pattern Discovery – Applying algorithms to identify trends and
relationships.
5. Evaluation & Interpretation – Validating and analyzing results.
6. Deployment – Using the findings in real-world applications.
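The steps above can be sketched on a toy dataset. This is a minimal pure-Python illustration; all values are hypothetical:

```python
from statistics import mean

# 1. Data collection: toy transaction amounts (hypothetical values)
raw = [120.0, None, 95.5, 430.0, None, 88.0]

# 2. Preprocessing: replace missing values with the mean of the known values
known = [x for x in raw if x is not None]
filled = [x if x is not None else mean(known) for x in raw]

# 3. Exploration: simple summary statistics
print("mean:", round(mean(filled), 2))

# 4. Pattern discovery: flag unusually large amounts (a simple threshold rule)
outliers = [x for x in filled if x > 2 * mean(filled)]
print("possible outliers:", outliers)  # → [430.0]
```

Evaluation and deployment would then judge whether such flagged values are meaningful and feed the rule into a real application.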
Techniques of Data Mining: -
Data mining uses various methods to extract patterns:
1. Classification – Categorizing data into predefined classes (e.g., spam vs.
non-spam emails).
2. Clustering – Grouping similar data points (e.g., customer segmentation in
marketing).
3. Association Rule Mining – Finding relationships between variables (e.g.,
"If a customer buys bread, they are likely to buy butter").
4. Regression Analysis – Predicting numerical values (e.g., stock price
prediction).
5. Anomaly Detection – Identifying unusual data points (e.g., fraud detection
in banking).
Applications of Data Mining: -
• Business & Marketing – Customer segmentation, recommendation systems (Amazon, Netflix).
• Finance & Banking – Fraud detection, credit risk analysis.
• Healthcare – Disease prediction, medical diagnosis.
• E-commerce – Personalized recommendations, sentiment analysis.
• Education – Student performance prediction, adaptive learning.

KNOWLEDGE DISCOVERY PROCESS IN DATA MINING: -


The Knowledge Discovery in Databases (KDD) process is a systematic approach
used to extract meaningful insights from large datasets. It involves multiple steps,
from data collection to pattern evaluation, ensuring that valuable knowledge is
derived from raw data.
Steps in the Knowledge Discovery Process: -
1. Data Selection
• Identifying and gathering relevant data from multiple sources.
Example: Collecting customer transaction data from a retail database.
2. Data Preprocessing (Cleaning & Transformation)
• Handling missing values, duplicate records, and noisy data.
• Normalizing and transforming data into a suitable format.
Example: Removing duplicate customer records from a database.
3. Data Integration
• Combining data from multiple sources into a unified format.
Example: Merging customer purchase history with demographic details.
4. Data Reduction
• Reducing data size while preserving essential patterns.
• Techniques: Feature selection and dimensionality reduction (e.g., Principal Component Analysis, PCA).
Example: Removing irrelevant attributes such as timestamps in sales data.
5. Data Mining (Pattern Discovery)
• Applying machine learning and statistical algorithms to uncover patterns.
• Techniques: Classification, clustering, association rule mining.
Example: Identifying frequent itemsets in a supermarket (e.g., "Customers who buy bread often buy butter").
6. Pattern Evaluation & Interpretation
• Assessing the validity and usefulness of discovered patterns.
Example: Checking whether an identified customer trend aligns with business goals.
7. Knowledge Representation & Deployment
• Visualizing findings using graphs, charts, or reports.
• Implementing insights in real-world applications (e.g., recommender systems).
Example: Using predictive models for targeted marketing campaigns.

COUNTING CO-OCCURRENCE IN DATA MINING: -


What is Co-Occurrence?
Co-occurrence refers to the frequency with which two or more items, events, or
entities appear together in a dataset. It is widely used in market basket analysis,
text mining, and recommendation systems.
Examples:
• In a supermarket, if bread and butter are frequently bought together, they have a high co-occurrence.
• In text mining, the words "machine" and "learning" often appear together in documents.
• In social networks, if two users frequently like the same posts, they have a strong co-occurrence.
Methods for Counting Co-Occurrence
1. Co-Occurrence Matrix
A co-occurrence matrix is a table showing the frequency of two items appearing
together.
Example:
Market Basket Analysis
Item     Bread   Butter   Milk   Eggs
Bread      -       50      30     20
Butter    50        -      40     15
Milk      30       40       -     25
Eggs      20       15      25      -
Here, Butter & Bread = 50, meaning they were bought together 50 times.
2. Association Rule Mining
• Used in market basket analysis to find frequent patterns.
• Example rule: "If a customer buys bread, they are 70% likely to buy butter."
• Algorithms: Apriori, FP-Growth.
3. Pointwise Mutual Information (PMI) (for Text Analysis)
• Measures how often two words appear together compared to their individual frequencies.
• Used in natural language processing (NLP) to find word associations.
4. Graph-Based Approaches
• Represents items as nodes and co-occurrences as edges.
• Used in social network analysis and recommendation systems.
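Building a co-occurrence count like the matrix above can be sketched in a few lines of Python; the baskets here are hypothetical:

```python
from collections import Counter
from itertools import combinations

# Toy market baskets (hypothetical data)
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

# Count how often each unordered pair of items appears in the same basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts[("bread", "butter")])  # bread & butter co-occur in 2 baskets
```

Sorting each basket before pairing guarantees that (bread, butter) and (butter, bread) map to the same key, so the matrix stays symmetric.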
Applications of Co-Occurrence Counting: -
 Market Basket Analysis – Helps retailers suggest complementary products.
 Text Mining – Finds relationships between words in documents.
 Recommendation Systems – Suggests movies, books, or songs based on
user behavior.
 Social Network Analysis – Detects strong connections between users.
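The PMI measure described above can be computed over a toy corpus. This sketch uses document-level co-occurrence probabilities; the documents are hypothetical:

```python
import math
from collections import Counter
from itertools import combinations

# Toy "documents" as token lists (hypothetical corpus)
docs = [
    ["machine", "learning", "model"],
    ["machine", "learning", "data"],
    ["deep", "learning", "data"],
    ["machine", "vision"],
]

n = len(docs)
word_docs = Counter()   # in how many documents each word appears
pair_docs = Counter()   # in how many documents each word pair co-occurs
for doc in docs:
    words = set(doc)
    word_docs.update(words)
    pair_docs.update(combinations(sorted(words), 2))

def pmi(w1, w2):
    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), probabilities over documents
    key = tuple(sorted((w1, w2)))
    p_xy = pair_docs[key] / n
    return math.log2(p_xy / ((word_docs[w1] / n) * (word_docs[w2] / n)))

print(round(pmi("machine", "learning"), 3))
```

A PMI above zero means the pair co-occurs more often than its individual frequencies would predict; below zero, less often.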
ICEBERG QUERIES IN DATA MINING: -
What are Iceberg Queries?
Iceberg queries are a special type of SQL query that retrieve only frequent (large)
aggregate values from a dataset, filtering out the less significant ones. They are
useful in data mining, market basket analysis, and business intelligence where we
are only interested in frequently occurring patterns.
Why the name "Iceberg"?
Like an iceberg, which has most of its mass hidden underwater, iceberg
queries return only the "top" (frequent) results while ignoring the rest.
Key Features of Iceberg Queries: -
• Focuses on highly frequent occurrences in large datasets.
• Uses HAVING clauses in SQL to filter results based on aggregate values.
• Efficient for association rule mining, frequent itemset mining, and pattern discovery.
Use Cases of Iceberg Queries: -
• Market Basket Analysis – Finding frequently purchased product combinations.
• Text Mining – Identifying frequently used words in large documents.
• Recommendation Systems – Suggesting popular movies/books based on user preferences.
• Fraud Detection – Detecting unusual but frequent transaction patterns.
Optimizing Iceberg Queries: -
To improve efficiency on large datasets, the following techniques are used:
• Indexing – Speeds up data retrieval.
• Partitioning – Divides data into smaller, manageable parts.
• Bitmap Indexing – Efficiently stores categorical data for faster processing.
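A minimal runnable sketch of an iceberg query, using Python's built-in sqlite3 module; the sales rows and the threshold are hypothetical:

```python
import sqlite3

# In-memory table of (transaction, item) rows (hypothetical data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (txn INTEGER, item TEXT)")
rows = [(1, "bread"), (1, "butter"), (2, "bread"), (3, "bread"),
        (3, "milk"), (4, "eggs"), (5, "bread")]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Iceberg query: the HAVING clause keeps only items whose aggregate count
# clears the threshold, discarding the "underwater" infrequent rest
threshold = 2
cur = conn.execute(
    "SELECT item, COUNT(*) AS cnt FROM sales "
    "GROUP BY item HAVING COUNT(*) >= ? ORDER BY cnt DESC", (threshold,))
print(cur.fetchall())  # only "bread" (4 occurrences) clears the threshold
```

The same GROUP BY ... HAVING shape applies to any aggregate (SUM, AVG, etc.); the efficiency techniques above matter once the table no longer fits in memory.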

MINING FOR RULES IN DATA MINING: -


Rule mining is a key technique in data mining used to extract useful patterns,
relationships, and associations from large datasets. It helps businesses and
researchers uncover hidden patterns and make informed decisions.
1. What is Rule Mining?
Rule mining identifies relationships between variables in a dataset. The most
common type is Association Rule Mining, which finds patterns like:
Note: - "If a customer buys bread, they are likely to buy butter."
This is useful in market basket analysis, recommendation systems, and fraud
detection.
2. Types of Rule Mining
(a) Association Rule Mining
• Finds patterns in transaction data.
• Example: {Milk} → {Cookies} (customers who buy milk often buy cookies).
• Common algorithms: Apriori, FP-Growth.
(b) Classification Rule Mining
• Extracts rules to classify data into different categories.
• Example: If (income > 50K) AND (age < 30), then High Credit Risk.
• Algorithms: Decision Trees (C4.5, ID3), Rule-Based Classifiers.
(c) Sequential Rule Mining
• Finds patterns in time-ordered data.
• Example: "If a customer buys a phone today, they will buy a phone case within a week."
• Algorithms: GSP (Generalized Sequential Pattern), SPADE, PrefixSpan.
(d) Correlation Rule Mining
• Finds rules where items have a strong correlation.
• Example: Strong correlation between "high-income" customers and "luxury car purchases".
3. Metrics for Rule Evaluation
• Support – How often a rule appears in the dataset.
• Confidence – Probability of the consequent occurring given the antecedent.
• Lift – Measures the strength of a rule compared to random occurrence.
Example:
• Support: 30% (30% of transactions contain {Milk, Cookies}).
• Confidence: 80% (80% of people who buy milk also buy cookies).
• Lift: 2.5 (Buying milk increases the likelihood of buying cookies by 2.5 times).
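All three metrics can be computed directly from transaction data. A sketch with hypothetical baskets (so the resulting numbers differ from the illustrative percentages above):

```python
# Transactions as sets of items (hypothetical data)
transactions = [
    {"milk", "cookies", "bread"},
    {"milk", "cookies"},
    {"milk", "bread"},
    {"cookies", "bread"},
    {"milk", "cookies", "eggs"},
]
n = len(transactions)

def support(items):
    # Fraction of transactions containing all the given items
    return sum(1 for t in transactions if items <= t) / n

# Rule: {milk} -> {cookies}
sup = support({"milk", "cookies"})   # P(milk and cookies together)
conf = sup / support({"milk"})       # P(cookies | milk)
lift = conf / support({"cookies"})   # confidence vs. baseline popularity of cookies

print(sup, conf, lift)
```

A lift below 1 (as here) means buying milk makes cookies slightly *less* likely than their overall popularity suggests; above 1 indicates a genuinely positive association.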
4. Applications of Rule Mining
• Retail & Market Basket Analysis – Finding frequently bought items together.
• Healthcare – Identifying disease patterns and risk factors.
• Finance & Banking – Fraud detection, credit risk assessment.
• Web & E-commerce – Personalized recommendations (Amazon, Netflix).

ASSOCIATION RULE: -
Association Rule is a fundamental concept in data mining and machine learning,
primarily used for discovering relationships between variables in large datasets. It
is widely used in market basket analysis, recommendation systems, and various
other domains.
Definition: -
An association rule is an implication of the form X → Y, indicating that transactions containing itemset X tend to also contain itemset Y. Such rules identify patterns and relationships between items in a dataset.
Algorithms for Association Rule Mining
1. Apriori Algorithm:
o Generates frequent itemsets using a level-wise search.
o Uses a candidate generation-and-pruning approach.
2. FP-Growth (Frequent Pattern Growth):
o Uses a tree-based structure to generate frequent itemsets efficiently.
o Faster than Apriori in many cases.
3. Eclat (Equivalence Class Transformation):
o Uses a depth-first search approach.
o More efficient when working with dense datasets.
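The level-wise search of Apriori can be sketched in a few lines. This is an illustration only, on hypothetical baskets; the full algorithm also prunes candidates whose subsets are infrequent, which is omitted here:

```python
from itertools import combinations

def apriori(transactions, min_support):
    # Minimal level-wise Apriori sketch: returns frequent itemsets with counts
    n = len(transactions)
    frequent = {}
    # Level 1 candidates: every individual item
    candidates = [frozenset([i]) for i in {x for t in transactions for x in t}]
    k = 1
    while candidates:
        level = {}
        for c in candidates:
            count = sum(1 for t in transactions if c <= t)
            if count / n >= min_support:
                level[c] = count
        frequent.update(level)
        k += 1
        # Join step: combine frequent k-itemsets into (k+1)-item candidates
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == k})
    return frequent

baskets = [{"bread", "butter"}, {"bread", "milk"},
           {"bread", "butter", "milk"}, {"butter", "milk"}]
freq = apriori(baskets, min_support=0.5)
print(freq[frozenset({"bread", "butter"})])  # appears in 2 of 4 baskets
```

The key Apriori property is that no superset of an infrequent itemset can be frequent, which is why the search can stop as soon as a level yields no frequent sets.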
Applications of Association Rule Mining
• Market Basket Analysis (Retail): Identifying products frequently bought together.
• Recommendation Systems (E-commerce, Streaming Platforms): Suggesting items based on user behavior.
• Medical Diagnosis: Identifying co-occurring diseases or symptoms.
• Web Usage Mining: Understanding user behavior on websites.
• Fraud Detection: Detecting unusual transaction patterns in financial data.

TYPES OF CLUSTERING ALGORITHMS: -


There are several clustering algorithms, each with different approaches:
1. Partition-Based Clustering
• Divides the data into K clusters.
• Example: K-Means Clustering
o Each cluster is represented by a centroid.
o The algorithm iterates to minimize the distance between data points and centroids.
2. Hierarchical Clustering
• Creates a tree-like structure of clusters (dendrogram).
• Two approaches:
o Agglomerative Clustering (Bottom-Up) → Merges small clusters into larger ones.
o Divisive Clustering (Top-Down) → Splits large clusters into smaller ones.
3. Density-Based Clustering
• Groups points that are densely packed together.
• Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
o Works well for arbitrary-shaped clusters.
o Identifies noise/outliers separately.
4. Model-Based Clustering
• Assumes that data is generated by a mixture of probability distributions.
• Example: Gaussian Mixture Model (GMM)
o Uses multiple Gaussian distributions to model clusters.
5. Grid-Based Clustering
• Divides the data space into grid cells and clusters them.
• Example: STING (Statistical Information Grid-based Clustering)
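As an illustration of the partition-based approach, a minimal one-dimensional K-Means sketch; the data points and initial centroids are hypothetical:

```python
from statistics import mean

def kmeans_1d(points, centroids, iters=10):
    # Minimal 1-D K-Means sketch: assign each point to its nearest centroid,
    # then recompute each centroid as the mean of its assigned points.
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ended up empty
        centroids = [mean(pts) if pts else c for c, pts in clusters.items()]
    return sorted(centroids)

data = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
print(kmeans_1d(data, centroids=[1.0, 9.0]))  # converges to [1.5, 8.5]
```

Real implementations pick initial centroids carefully (e.g., k-means++ seeding) and stop when assignments no longer change rather than after a fixed iteration count.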
Key Metrics for Clustering Evaluation: -
Since clustering is unsupervised, evaluating the results can be tricky. Common
metrics include:
1. Within-Cluster Sum of Squares (WCSS) – Measures compactness of
clusters (used in K-Means).
2. Silhouette Score – Measures how similar a point is to its cluster vs. other
clusters.
3. Davies-Bouldin Index – Evaluates cluster separation and compactness.
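WCSS and the silhouette score can be illustrated on two toy one-dimensional clusters; the data are hypothetical, and the silhouette is shown for a single point:

```python
from statistics import mean

# Two toy 1-D clusters (hypothetical data)
cluster_a = [1.0, 2.0]
cluster_b = [8.0, 9.0]

def wcss(clusters):
    # Sum of squared distances from each point to its cluster's centroid
    total = 0.0
    for pts in clusters:
        c = mean(pts)
        total += sum((p - c) ** 2 for p in pts)
    return total

def silhouette(p, own, other):
    # s(p) = (b - a) / max(a, b), where a is the mean distance to p's own
    # cluster and b is the mean distance to the nearest other cluster
    a = mean(abs(p - q) for q in own if q != p)
    b = mean(abs(p - q) for q in other)
    return (b - a) / max(a, b)

print(wcss([cluster_a, cluster_b]))                     # 0.5 + 0.5 = 1.0
print(round(silhouette(1.0, cluster_a, cluster_b), 3))  # close to 1: well placed
```

Silhouette values range from -1 to 1; values near 1 mean a point sits firmly inside its own cluster, values near 0 sit on a boundary, and negative values suggest misassignment.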
Applications of Clustering: -
• Customer Segmentation (E-commerce, Marketing)
• Anomaly Detection (Fraud Detection, Cybersecurity)
• Image Segmentation (Medical Imaging, Object Detection)
• Document Clustering (News Categorization, Text Mining)
• Social Network Analysis (Community Detection)

CLASSIFICATION RULES: -
Definition:
Classification rules are used when the target variable is categorical. The goal is to
classify data into predefined categories based on input features.
Algorithms for Classification Rule Mining:
1. Decision Trees (C4.5, CART, ID3) – Generates rules from tree structures.
2. Rule-Based Classification (RIPPER, CN2) – Creates rules directly from
data.
3. Association Rule-Based Classification (Apriori, FP-Growth) – Uses
frequent patterns for classification.
4. Naïve Bayes – A probabilistic approach to rule-based classification.
Evaluation Metrics for Classification:
• Accuracy – Percentage of correctly classified instances.
• Precision, Recall, F1-score – Measure model performance on imbalanced datasets.
• ROC-AUC – Evaluates the model's discrimination capability.
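The first group of metrics can be computed directly from confusion-matrix counts. A sketch with hypothetical binary labels (1 = spam):

```python
# True labels vs. model predictions (hypothetical binary example: 1 = spam)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix counts
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)                       # of predicted spam, how much was spam
recall = tp / (tp + fn)                          # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

On imbalanced data, accuracy alone is misleading (predicting "not spam" everywhere can still score high), which is why precision, recall, and F1 are reported alongside it.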
REGRESSION RULES: -
Definition:
Regression rules are used when the target variable is continuous. The goal is to
predict numerical values based on input features.
Algorithms for Regression Rule Mining:
1. Decision Trees for Regression (CART – Classification and Regression Trees, M5P model trees) – Splits data into rules for numeric prediction.
2. Rule-Based Regression (M5 Rules) – Extracts rules from tree-based
regression.
3. Linear Regression Models – Generates equations instead of explicit rules.
4. Random Forest Regression – Uses an ensemble of decision trees for better
accuracy.
Evaluation Metrics for Regression:
• Mean Squared Error (MSE) – Measures the average squared error.
• Root Mean Squared Error (RMSE) – Square root of MSE, interpretable in the same unit as the target variable.
• R² Score (Coefficient of Determination) – Indicates how well the model explains variance in the data.
Applications of Classification and Regression Rules: -
Classification Applications: -
• Spam Detection – Classify emails as spam or not.
• Medical Diagnosis – Predict disease presence.
• Fraud Detection – Identify fraudulent transactions.
Regression Applications:
• Stock Market Prediction – Forecast stock prices.
• Weather Prediction – Estimate temperature, rainfall, etc.
• House Price Estimation – Predict real estate values.
