DataMining_Unit-3

The document discusses key techniques in data mining, focusing on classification and prediction, which are used to derive insights from datasets. It explains classification as the process of assigning categorical labels to data based on features, while prediction estimates continuous values, often using regression analysis. Additionally, it covers decision trees, Bayesian classification, K-Nearest Neighbors, and rule-based classification, highlighting their structures, algorithms, and applications.


UNIT – 3

1) Classification and prediction are two fundamental techniques in data mining used to extract
meaningful insights and models from datasets.

1. Classification in Data Mining:


Classification is the process of identifying the category or class of an object based on a given set of features or
attributes. The goal of classification is to predict the discrete class label (categorical variable) of a data instance.
In classification, you have a dataset with a target variable (also called the class label), which is categorical. You
use this dataset to build a model that can predict the class label for new, unseen data.
Example of Classification:
Suppose you are working with a dataset of customer information, and the goal is to classify whether each
customer will buy a product or not. The dataset might have features like age, income, and browsing history, and
the class label could be "Buy" or "Don't Buy."
Age | Income | Browsing Time (minutes) | Class
25  | 40,000 | 20                      | Don't Buy
34  | 60,000 | 40                      | Buy
29  | 50,000 | 15                      | Don't Buy
42  | 80,000 | 50                      | Buy
Using this dataset, a classification algorithm (such as Decision Trees, k-NN, or SVM) can learn patterns from
the data, such as:
• Customers who are older and have a higher income are more likely to buy.
• Customers who spend more time browsing the product are more likely to buy.
Once the model is trained, it can predict the class label (e.g., "Buy" or "Don't Buy") for new customers.
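To make this concrete, here is a minimal sketch of that workflow, assuming scikit-learn is available; the training rows come from the customer table above, and the new-customer values are purely illustrative:

# A minimal classification sketch, assuming scikit-learn is installed.
# The four training rows come from the customer table above.
from sklearn.tree import DecisionTreeClassifier

# Features: [age, income, browsing_time_minutes]
X = [[25, 40000, 20],
     [34, 60000, 40],
     [29, 50000, 15],
     [42, 80000, 50]]
y = ["Don't Buy", "Buy", "Don't Buy", "Buy"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict the class label for a new, unseen customer (illustrative values).
print(model.predict([[30, 55000, 35]]))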

2. Prediction in Data Mining:


Prediction is the task of estimating or forecasting a continuous value based on input data. The goal of prediction
is to predict numeric values or quantities that are continuous, rather than class labels. In other words, the output
is a continuous variable (e.g., a price, temperature, or stock value).
Prediction can be seen as a type of regression analysis, where you model the relationship between input features
(independent variables) and the continuous target variable (dependent variable).
Example of Prediction:
Imagine you're trying to predict the price of a house based on its features (e.g., size, number of bedrooms,
location). You have a dataset like this:
Size (sq ft) | Bedrooms | Location | Price ($)
1,200        | 3        | Suburb   | 300,000
1,800        | 4        | City     | 500,000
1,500        | 3        | Suburb   | 350,000
2,000        | 4        | City     | 600,000
Here, the goal is to predict the price of a house (a continuous value) given its size, number of bedrooms, and
location. A regression algorithm (like linear regression, decision trees for regression, or neural networks) would
learn the relationship between the input features and the price, and could predict the price for new houses.
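As a hedged illustration, the sketch below fits a linear regression to the four-row house table above, assuming scikit-learn; the numeric encoding of Location (Suburb=0, City=1) is an assumption made for the example:

# A minimal prediction (regression) sketch, assuming scikit-learn.
from sklearn.linear_model import LinearRegression

# Features: [size_sqft, bedrooms, location], with Suburb=0, City=1 (assumed encoding)
X = [[1200, 3, 0],
     [1800, 4, 1],
     [1500, 3, 0],
     [2000, 4, 1]]
y = [300000, 500000, 350000, 600000]

model = LinearRegression().fit(X, y)

# Estimate the price of a new house (illustrative values).
print(model.predict([[1600, 3, 1]]))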

Comparison of Classification and Prediction:


Aspect          | Classification                                         | Prediction (Regression)
Goal            | Assign an object to a class/category                   | Predict a continuous numeric value
Output Variable | Categorical (discrete classes)                         | Continuous (real values)
Algorithms Used | Decision Trees, k-NN, Naive Bayes, SVM, Random Forest  | Linear Regression, Decision Trees (for regression), Neural Networks
Examples        | Spam detection, Medical diagnosis, Sentiment analysis  | House price prediction, Stock price forecasting, Weather prediction

2) What is a Decision Tree?

A Decision Tree is a popular and powerful model used for both classification and regression tasks in data
mining. It is a supervised learning algorithm that splits data into subsets based on feature values, creating a
tree-like structure to make decisions.

Structure of a Decision Tree:

A decision tree involves the following components and related concepts:

1. Root Node: The topmost node, representing the entire dataset. It is split into two or more branches
based on a feature's value.
2. Internal Nodes: These represent the features (attributes) used for decision-making and splitting the data.
3. Edges/Branches: These represent the outcomes of splitting the data on the attribute.
4. Leaf Nodes: The terminal nodes, which represent the final class label (for classification) or predicted
value (for regression).
5. Pruning: The process of removing or cutting down specific nodes in a tree to prevent overfitting and
simplify the model.

Steps of Building a Decision Tree:

1. Select the Best Feature:
   o At each node, choose the feature that best splits the data into different classes. This is done using impurity measures such as the Gini Index or Entropy (for classification).
2. Split the Data:
   o The chosen feature is used to split the dataset into subsets (branches), which are then used for the next decision-making step.
3. Repeat:
   o The process is repeated for each subset, choosing the best feature to split the data further.
4. Stop:
   o The tree grows until certain stopping conditions are met, such as:
     ▪ All data points in a node belong to the same class.
     ▪ The maximum tree depth is reached.

Common Splitting Criteria:

1. Gini Index (for classification): The Gini index measures the "impurity" of a node, computed as Gini = 1 − Σ (p_i)², where p_i is the proportion of class i in the node. A lower Gini value indicates a better split (more homogeneous groups).
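To make the criterion concrete, here is a minimal sketch (plain Python, purely illustrative) that computes the Gini index from the class labels falling into a node:

# Minimal sketch: Gini impurity of a node from its class labels.
from collections import Counter

def gini(labels):
    # Gini = 1 - sum of squared class proportions.
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

print(gini(["Buy", "Buy", "Don't Buy", "Buy"]))  # 0.375 (mixed node)
print(gini(["Buy", "Buy", "Buy"]))               # 0.0 (pure node)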
Example of Decision Tree

Decision trees are drawn upside down, meaning the root is at the top and is then split into several nodes. In layman's terms, a decision tree is just a bunch of if-else statements: it checks whether a condition is true and, if so, moves on to the next node attached to that decision.

In this example, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy? Depending on the answer, it moves on to the next feature, humidity or wind. It then checks whether the wind is strong or weak; if the wind is weak and the weather is rainy, the person may go out and play.

Rules:
1) If weather = cloudy, then play = YES
2) If weather = sunny, humidity = high, then play = NO
3) If weather = sunny, humidity = normal, then play = YES
4) If weather = rainy, wind = strong, then play = NO
5) If weather = rainy, wind = weak, then play = YES
Test:
Day 11: If weather = cloudy, humidity = high, wind = weak, then play = ?
Answer: YES (rule 1 fires: weather = cloudy, so humidity and wind are not consulted)
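Since the tree is just nested if-else statements, the five rules above can be written directly as code (a minimal sketch; the function name and string values are illustrative):

# Minimal sketch: the decision tree above as if-else statements.
def play(weather, humidity, wind):
    if weather == "cloudy":
        return "YES"                                   # rule 1
    if weather == "sunny":
        return "NO" if humidity == "high" else "YES"   # rules 2 and 3
    if weather == "rainy":
        return "NO" if wind == "strong" else "YES"     # rules 4 and 5

# Day 11: weather = cloudy, humidity = high, wind = weak
print(play("cloudy", "high", "weak"))  # YES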
3) Bayesian Classification in Data Mining
❖ Bayesian classification is a statistical method based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions related to the event.
❖ In data mining, Bayesian classification is used to predict the probability that a given data point belongs to a particular class.

Bayes' Theorem in Data Mining


This fundamental theorem forms the basis of Bayesian classification. It is expressed as:
P(A|B) = [P(B|A) · P(A)] / P(B)
where:
• P(A∣B) is the posterior probability of event A occurring given that B is true.
• P(B∣A) is the likelihood of event B given that A is true.
• P(A) is the prior probability of event A.
• P(B) is the probability of event B.

Example Problem
You observe a basket containing fruits, and you want to classify whether a randomly chosen fruit is an Apple or
an Orange based on the observed color.

Given Data
The basket contains:
• 60 fruits in total:
o 30 Apples: 20 Red and 10 Green.
o 30 Oranges: 15 Orange and 15 Green.

Goal: If a fruit is Red, determine whether it is more likely an Apple or an Orange.
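Worked solution (derived from the counts above):
• Priors: P(Apple) = 30/60 = 0.5 and P(Orange) = 30/60 = 0.5.
• Likelihoods: P(Red | Apple) = 20/30 ≈ 0.67, while P(Red | Orange) = 0/30 = 0, since none of the oranges are red.
• Evidence: P(Red) = 20/60 ≈ 0.33 (only apples contribute red fruits).
• Posteriors, by Bayes' theorem: P(Apple | Red) = (0.67 × 0.5) / 0.33 = 1 and P(Orange | Red) = (0 × 0.5) / 0.33 = 0.
So a red fruit drawn from this basket is certain to be an Apple.

The same computation can be checked with a minimal sketch in plain Python (the dictionary layout is purely illustrative):

# Minimal sketch: Bayes' theorem applied to the fruit basket above.
counts = {"Apple": {"Red": 20, "Green": 10},
          "Orange": {"Orange": 15, "Green": 15}}
total = sum(sum(c.values()) for c in counts.values())  # 60 fruits

def posterior(fruit, color):
    # P(fruit | color) = P(color | fruit) * P(fruit) / P(color)
    n_fruit = sum(counts[fruit].values())
    p_color_given_fruit = counts[fruit].get(color, 0) / n_fruit
    p_fruit = n_fruit / total
    p_color = sum(c.get(color, 0) for c in counts.values()) / total
    return p_color_given_fruit * p_fruit / p_color

print(posterior("Apple", "Red"))   # 1.0 -> a red fruit must be an Apple
print(posterior("Orange", "Red"))  # 0.0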


Applications of Bayes’ Theorem

Bayes' Theorem is widely used in various fields for different applications. Here are some key applications:

• Medical Diagnosis: Estimating the probability of a disease given the presence of certain symptoms and
test results.
• Spam Filtering: Classifying emails as spam or not spam based on their content and features.
• Machine Learning: Training classifiers and models in supervised learning, especially in probabilistic
algorithms.
• Risk Assessment: Evaluating risks in finance, insurance, and other industries by updating probabilities
based on new evidence.
• Recommender Systems: Improving recommendations by updating user preferences based on new
interactions or feedback.
• Fault Diagnosis: Identifying the probability of different faults in complex systems like machinery or
electronics based on observed symptoms.
• Decision Making: Supporting decision-making processes by providing probabilistic estimates and
updating them as new information becomes available.
4) K-Nearest Neighbors (KNN) / LAZY LEARNER

K-Nearest Neighbors (KNN) is a simple algorithm used in data mining and machine learning.

❖ It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it only at classification time.
❖ At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
❖ K-NN can be used for regression as well as for classification, but it is mostly used for classification problems.

How does K-NN work?

How K-NN works can be explained by the following algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance between the new data point and each point in the training set.
o Step-3: Take the K nearest neighbors according to the calculated distances.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category with the largest number of neighbors.
o Step-6: Our model is ready.

Key Concepts of KNN

• Instance-based Learning: KNN does not explicitly learn a model; instead, it makes predictions based on the similarity of new data to the training data.
• Lazy Learning: It delays the learning process until a prediction is required, meaning the training phase is minimal or non-existent.
• Similarity Measure: The algorithm uses distance metrics (e.g., Euclidean, Manhattan, Minkowski) to find the "nearest" neighbors of a given data point.
Applications

KNN is widely used in various domains for its simplicity and effectiveness:

• Customer Segmentation: Classifying customers based on purchasing patterns or behavior.
• Fraud Detection: Identifying anomalies by comparing transactions with similar historical data.
• Recommender Systems: Finding similar users/items to recommend content or products.
• Medical Diagnosis: Classifying diseases based on patient symptoms and historical data.

Example of KNN in Classification:


Suppose you have a small training dataset with two features, X1 and X2, and a target class in which each point is labeled A or B.

Let's say you want to predict the class of a new data point with the coordinates (3, 4).

1. Choose k = 3 (three nearest neighbors).

2. Calculate the distance (using Euclidean distance): for example, the distance from (3, 4) to (3, 3) is √((3−3)² + (4−3)²) = 1, the distance to (2, 3) is √((3−2)² + (4−3)²) ≈ 1.41, and likewise for (4, 5).

3. Identify the three nearest neighbors: The nearest neighbors are the points (3, 3), (2, 3), and (4, 5), with the smallest distances (1, 1.41, and 1.41, respectively).

4. Make the prediction: Among these three nearest neighbors, two belong to class B and one belongs to class A. Since class B is the majority, the new point (3, 4) is classified as B.
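A minimal sketch of this procedure in plain Python follows. Note that the training set here is an assumed reconstruction: only the three nearest points and the majority vote are stated above, so the individual class labels and the extra far-away points are illustrative.

# Minimal KNN sketch (plain Python). Training labels beyond "two of the
# three nearest are B, one is A" are assumptions for illustration.
import math
from collections import Counter

train = [((3, 3), "B"), ((2, 3), "A"), ((4, 5), "B"),
         ((7, 7), "A"), ((8, 6), "A")]  # assumed extra points

def knn_predict(query, k=3):
    # Step 2: Euclidean distance from the query to every training point.
    dists = [(math.dist(query, point), label) for point, label in train]
    # Step 3: keep the k nearest neighbors.
    nearest = sorted(dists)[:k]
    # Steps 4-5: majority vote among the k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((3, 4)))  # B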
5) Rule-based classification in data mining

Rule-based classification in data mining is a technique in which class decisions are made using various "if...then...else" rules. The rules are written as "IF-THEN" statements.

IF-THEN Rule

An IF-THEN rule is an expression of the form IF condition THEN conclusion. The "IF" part is called the rule antecedent or precondition, and the "THEN" part is the rule consequent. For example (taken from the decision tree above): IF weather = sunny AND humidity = high THEN play = NO.

Rule Extraction from a Decision Tree

❖ Rules are easier to understand than large trees.
❖ One rule is created for each path from the root to a leaf.
❖ Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction.
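To show rule extraction in practice, here is a minimal sketch assuming scikit-learn; the numeric encoding of the weather features and the ten training rows (built to match the five rules above) are assumptions for illustration. scikit-learn's export_text prints one root-to-leaf path per rule:

# Minimal sketch: extracting IF-THEN rules from a fitted decision tree.
# Assumed encoding -- weather: sunny=0, cloudy=1, rainy=2;
# humidity: normal=0, high=1; wind: weak=0, strong=1.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 0, 0], [1, 1, 0],   # cloudy                 -> YES
     [0, 1, 0], [0, 1, 1],   # sunny, high humidity   -> NO
     [0, 0, 0], [0, 0, 1],   # sunny, normal humidity -> YES
     [2, 0, 1], [2, 1, 1],   # rainy, strong wind     -> NO
     [2, 0, 0], [2, 1, 0]]   # rainy, weak wind       -> YES
y = ["YES", "YES", "NO", "NO", "YES", "YES", "NO", "NO", "YES", "YES"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each printed path from the root to a leaf corresponds to one IF-THEN rule.
print(export_text(tree, feature_names=["weather", "humidity", "wind"]))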

You might also like