DataMining_Unit-3

The document discusses key techniques in data mining, focusing on classification and prediction, which are used to derive insights from datasets. It explains classification as the process of assigning categorical labels to data based on features, while prediction estimates continuous values, often using regression analysis. Additionally, it covers decision trees, Bayesian classification, K-Nearest Neighbors, and rule-based classification, highlighting their structures, algorithms, and applications.


UNIT – 3

1) Classification and prediction are two fundamental techniques in data mining used to extract
meaningful insights and models from datasets.

1. Classification in Data Mining:


Classification is the process of identifying the category or class of an object based on a given set of features or
attributes. The goal of classification is to predict the discrete class label (categorical variable) of a data instance.
In classification, you have a dataset with a target variable (also called the class label), which is categorical. You
use this dataset to build a model that can predict the class label for new, unseen data.
Example of Classification:
Suppose you are working with a dataset of customer information, and the goal is to classify whether each
customer will buy a product or not. The dataset might have features like age, income, and browsing history, and
the class label could be "Buy" or "Don't Buy."
Age | Income | Browsing Time (minutes) | Class
25  | 40,000 | 20                      | Don't Buy
34  | 60,000 | 40                      | Buy
29  | 50,000 | 15                      | Don't Buy
42  | 80,000 | 50                      | Buy
Using this dataset, a classification algorithm (such as Decision Trees, k-NN, or SVM) can learn patterns from
the data, such as:
• Customers who are older and have a higher income are more likely to buy.
• Customers who spend more time browsing the product are more likely to buy.
Once the model is trained, it can predict the class label (e.g., "Buy" or "Don't Buy") for new customers.
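To make this concrete, here is a minimal sketch of that workflow, assuming scikit-learn is available; the training rows come from the customer table above, and the new-customer values are purely illustrative:

# A minimal classification sketch, assuming scikit-learn is installed.
# The four training rows come from the customer table above.
from sklearn.tree import DecisionTreeClassifier

# Features: [age, income, browsing_time_minutes]
X = [[25, 40000, 20],
     [34, 60000, 40],
     [29, 50000, 15],
     [42, 80000, 50]]
y = ["Don't Buy", "Buy", "Don't Buy", "Buy"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict the class label for a new, unseen customer (illustrative values).
print(model.predict([[30, 55000, 35]]))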

2. Prediction in Data Mining:


Prediction is the task of estimating or forecasting a continuous value based on input data. The goal of prediction
is to predict numeric values or quantities that are continuous, rather than class labels. In other words, the output
is a continuous variable (e.g., a price, temperature, or stock value).
Prediction can be seen as a type of regression analysis, where you model the relationship between input features
(independent variables) and the continuous target variable (dependent variable).
Example of Prediction:
Imagine you're trying to predict the price of a house based on its features (e.g., size, number of bedrooms,
location). You have a dataset like this:
Size (sq ft) | Bedrooms | Location | Price ($)
1,200        | 3        | Suburb   | 300,000
1,800        | 4        | City     | 500,000
1,500        | 3        | Suburb   | 350,000
2,000        | 4        | City     | 600,000
Here, the goal is to predict the price of a house (a continuous value) given its size, number of bedrooms, and
location. A regression algorithm (like linear regression, decision trees for regression, or neural networks) would
learn the relationship between the input features and the price, and could predict the price for new houses.
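As a hedged illustration, the sketch below fits a linear regression to the four-row house table above, assuming scikit-learn; the numeric encoding of Location (Suburb=0, City=1) is an assumption made for the example:

# A minimal prediction (regression) sketch, assuming scikit-learn.
from sklearn.linear_model import LinearRegression

# Features: [size_sqft, bedrooms, location], with Suburb=0, City=1 (assumed encoding)
X = [[1200, 3, 0],
     [1800, 4, 1],
     [1500, 3, 0],
     [2000, 4, 1]]
y = [300000, 500000, 350000, 600000]

model = LinearRegression().fit(X, y)

# Estimate the price of a new house (illustrative values).
print(model.predict([[1600, 3, 1]]))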

Comparison of Classification and Prediction:


Aspect          | Classification                                         | Prediction (Regression)
Goal            | Assign an object to a class/category                   | Predict a continuous numeric value
Output Variable | Categorical (discrete classes)                         | Continuous (real values)
Algorithms Used | Decision Trees, k-NN, Naive Bayes, SVM, Random Forest  | Linear Regression, Decision Trees (for regression), Neural Networks
Examples        | Spam detection, Medical diagnosis, Sentiment analysis  | House price prediction, Stock price forecasting, Weather prediction

2) What is a Decision Tree?

A Decision Tree is a popular and powerful model used for both classification and regression tasks in data
mining. It is a supervised learning algorithm that splits data into subsets based on feature values, creating a
tree-like structure to make decisions.

Structure of a Decision Tree:

A decision tree involves the following components and related concepts:

1. Root Node: The topmost node, representing the entire dataset. It is split into two or more branches
based on a feature's value.
2. Internal Nodes: These represent the features (attributes) used for decision-making and splitting the data.
3. Edges/Branches: These represent the outcomes of splitting the data on the attribute.
4. Leaf Nodes: The terminal nodes, which represent the final class label (for classification) or predicted
value (for regression).
5. Pruning: The process of removing or cutting down specific nodes in a tree to prevent overfitting and
simplify the model.

Steps of Building a Decision Tree:

1. Select the Best Feature:
   o At each node, choose the feature that best splits the data into different classes. This is done using impurity measures such as the Gini Index or Entropy (for classification).
2. Split the Data:
   o The chosen feature is used to split the dataset into subsets (branches), which are then used for the next decision-making step.
3. Repeat:
   o The process is repeated for each subset, choosing the best feature to split the data further.
4. Stop:
   o The tree grows until certain stopping conditions are met, such as:
     ▪ All data points in a node belong to the same class.
     ▪ The maximum tree depth is reached.

Common Splitting Criteria:

1. Gini Index (for classification): The Gini index measures the "impurity" of a node, computed as Gini = 1 − Σ (p_i)², where p_i is the proportion of class i in the node. A lower Gini value indicates a better split (more homogeneous groups).
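To make the criterion concrete, here is a minimal sketch (plain Python, purely illustrative) that computes the Gini index from the class labels falling into a node:

# Minimal sketch: Gini impurity of a node from its class labels.
from collections import Counter

def gini(labels):
    # Gini = 1 - sum of squared class proportions.
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

print(gini(["Buy", "Buy", "Don't Buy", "Buy"]))  # 0.375 (mixed node)
print(gini(["Buy", "Buy", "Buy"]))               # 0.0 (pure node)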
Example of Decision Tree

Decision trees are drawn upside down, meaning the root is at the top and is then split into several nodes. In layman's terms, a decision tree is just a bunch of if-else statements: it checks whether a condition is true and, if so, moves on to the next node attached to that decision.

In this example, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy? Depending on the answer, it moves on to the next feature, humidity or wind. It then checks whether the wind is strong or weak; if the wind is weak and the weather is rainy, the person may go out and play.

Rules:
1) If weather = cloudy, then play = YES
2) If weather = sunny, humidity = high, then play = NO
3) If weather = sunny, humidity = normal, then play = YES
4) If weather = rainy, wind = strong, then play = NO
5) If weather = rainy, wind = weak, then play = YES
Test:
Day 11: If weather = cloudy, humidity = high, wind = weak, then play = ?
Answer: YES (rule 1 fires: weather = cloudy, so humidity and wind are not consulted)
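Since the tree is just nested if-else statements, the five rules above can be written directly as code (a minimal sketch; the function name and string values are illustrative):

# Minimal sketch: the decision tree above as if-else statements.
def play(weather, humidity, wind):
    if weather == "cloudy":
        return "YES"                                   # rule 1
    if weather == "sunny":
        return "NO" if humidity == "high" else "YES"   # rules 2 and 3
    if weather == "rainy":
        return "NO" if wind == "strong" else "YES"     # rules 4 and 5

# Day 11: weather = cloudy, humidity = high, wind = weak
print(play("cloudy", "high", "weak"))  # YES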
3) Bayesian Classification in Data Mining
❖ Bayesian classification is a statistical method based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions related to the event.
❖ In data mining, Bayesian classification is used to predict the probability that a given data point belongs to a particular class.

Bayes' Theorem in Data Mining


This fundamental theorem forms the basis of Bayesian classification. It is expressed as:
P(A|B) = [P(B|A) · P(A)] / P(B)
where:
• P(A∣B) is the posterior probability of event A occurring given that B is true.
• P(B∣A) is the likelihood of event B given that A is true.
• P(A) is the prior probability of event A.
• P(B) is the probability of event B.

Example Problem
You observe a basket containing fruits, and you want to classify whether a randomly chosen fruit is an Apple or
an Orange based on the observed color.

Given Data
The basket contains:
• 60 fruits in total:
o 30 Apples: 20 Red and 10 Green.
o 30 Oranges: 15 Orange and 15 Green.

Goal: If a fruit is Red, determine whether it is more likely an Apple or an Orange.
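Worked solution (derived from the counts above):
• Priors: P(Apple) = 30/60 = 0.5 and P(Orange) = 30/60 = 0.5.
• Likelihoods: P(Red | Apple) = 20/30 ≈ 0.67, while P(Red | Orange) = 0/30 = 0, since none of the oranges are red.
• Evidence: P(Red) = 20/60 ≈ 0.33 (only apples contribute red fruits).
• Posteriors, by Bayes' theorem: P(Apple | Red) = (0.67 × 0.5) / 0.33 = 1 and P(Orange | Red) = (0 × 0.5) / 0.33 = 0.
So a red fruit drawn from this basket is certain to be an Apple.

The same computation can be checked with a minimal sketch in plain Python (the dictionary layout is purely illustrative):

# Minimal sketch: Bayes' theorem applied to the fruit basket above.
counts = {"Apple": {"Red": 20, "Green": 10},
          "Orange": {"Orange": 15, "Green": 15}}
total = sum(sum(c.values()) for c in counts.values())  # 60 fruits

def posterior(fruit, color):
    # P(fruit | color) = P(color | fruit) * P(fruit) / P(color)
    n_fruit = sum(counts[fruit].values())
    p_color_given_fruit = counts[fruit].get(color, 0) / n_fruit
    p_fruit = n_fruit / total
    p_color = sum(c.get(color, 0) for c in counts.values()) / total
    return p_color_given_fruit * p_fruit / p_color

print(posterior("Apple", "Red"))   # 1.0 -> a red fruit must be an Apple
print(posterior("Orange", "Red"))  # 0.0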


Applications of Bayes’ Theorem

Bayes' Theorem is widely used in various fields for different applications. Here are some key applications:

• Medical Diagnosis: Estimating the probability of a disease given the presence of certain symptoms and
test results.
• Spam Filtering: Classifying emails as spam or not spam based on their content and features.
• Machine Learning: Training classifiers and models in supervised learning, especially in probabilistic
algorithms.
• Risk Assessment: Evaluating risks in finance, insurance, and other industries by updating probabilities
based on new evidence.
• Recommender Systems: Improving recommendations by updating user preferences based on new
interactions or feedback.
• Fault Diagnosis: Identifying the probability of different faults in complex systems like machinery or
electronics based on observed symptoms.
• Decision Making: Supporting decision-making processes by providing probabilistic estimates and
updating them as new information becomes available.
4) K-Nearest Neighbors (KNN) / LAZY LEARNER

K-Nearest Neighbors (KNN) is a simple algorithm used in data mining and machine learning.

❖ It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it only at classification time.
❖ At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
❖ K-NN can be used for regression as well as for classification, but it is mostly used for classification problems.

How does K-NN work?

How K-NN works can be explained by the following algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance between the new data point and each point in the training set.
o Step-3: Take the K nearest neighbors according to the calculated distances.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category with the largest number of neighbors.
o Step-6: Our model is ready.

Key Concepts of KNN

• Instance-based Learning: KNN does not explicitly learn a model; instead, it makes predictions based on the similarity of new data to the training data.
• Lazy Learning: It delays the learning process until a prediction is required, meaning the training phase is minimal or non-existent.
• Similarity Measure: The algorithm uses distance metrics (e.g., Euclidean, Manhattan, Minkowski) to find the "nearest" neighbors of a given data point.
Applications

KNN is widely used in various domains for its simplicity and effectiveness:

• Customer Segmentation: Classifying customers based on purchasing patterns or behavior.
• Fraud Detection: Identifying anomalies by comparing transactions with similar historical data.
• Recommender Systems: Finding similar users/items to recommend content or products.
• Medical Diagnosis: Classifying diseases based on patient symptoms and historical data.

Example of KNN in Classification:


Suppose you have a small training dataset with two features, X1 and X2, and a target class in which each point is labeled A or B.

Let's say you want to predict the class of a new data point with the coordinates (3, 4).

1. Choose k = 3 (three nearest neighbors).

2. Calculate the distance (using Euclidean distance): for example, the distance from (3, 4) to (3, 3) is √((3−3)² + (4−3)²) = 1, the distance to (2, 3) is √((3−2)² + (4−3)²) ≈ 1.41, and likewise for (4, 5).

3. Identify the three nearest neighbors: The nearest neighbors are the points (3, 3), (2, 3), and (4, 5), with the smallest distances (1, 1.41, and 1.41, respectively).

4. Make the prediction: Among these three nearest neighbors, two belong to class B and one belongs to class A. Since class B is the majority, the new point (3, 4) is classified as B.
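A minimal sketch of this procedure in plain Python follows. Note that the training set here is an assumed reconstruction: only the three nearest points and the majority vote are stated above, so the individual class labels and the extra far-away points are illustrative.

# Minimal KNN sketch (plain Python). Training labels beyond "two of the
# three nearest are B, one is A" are assumptions for illustration.
import math
from collections import Counter

train = [((3, 3), "B"), ((2, 3), "A"), ((4, 5), "B"),
         ((7, 7), "A"), ((8, 6), "A")]  # assumed extra points

def knn_predict(query, k=3):
    # Step 2: Euclidean distance from the query to every training point.
    dists = [(math.dist(query, point), label) for point, label in train]
    # Step 3: keep the k nearest neighbors.
    nearest = sorted(dists)[:k]
    # Steps 4-5: majority vote among the k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((3, 4)))  # B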
5) Rule-based classification in data mining

Rule-based classification in data mining is a technique in which class decisions are made using various "if...then...else" rules. The rules are written as "IF-THEN" statements.

IF-THEN Rule

An IF-THEN rule is an expression of the form IF condition THEN conclusion. The "IF" part is called the rule antecedent or precondition, and the "THEN" part is the rule consequent. For example (taken from the decision tree above): IF weather = sunny AND humidity = high THEN play = NO.

Rule Extraction from a Decision Tree

❖ Rules are easier to understand than large trees.
❖ One rule is created for each path from the root to a leaf.
❖ Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction.
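To show rule extraction in practice, here is a minimal sketch assuming scikit-learn; the numeric encoding of the weather features and the ten training rows (built to match the five rules above) are assumptions for illustration. scikit-learn's export_text prints one root-to-leaf path per rule:

# Minimal sketch: extracting IF-THEN rules from a fitted decision tree.
# Assumed encoding -- weather: sunny=0, cloudy=1, rainy=2;
# humidity: normal=0, high=1; wind: weak=0, strong=1.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 0, 0], [1, 1, 0],   # cloudy                 -> YES
     [0, 1, 0], [0, 1, 1],   # sunny, high humidity   -> NO
     [0, 0, 0], [0, 0, 1],   # sunny, normal humidity -> YES
     [2, 0, 1], [2, 1, 1],   # rainy, strong wind     -> NO
     [2, 0, 0], [2, 1, 0]]   # rainy, weak wind       -> YES
y = ["YES", "YES", "NO", "NO", "YES", "YES", "NO", "NO", "YES", "YES"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each printed path from the root to a leaf corresponds to one IF-THEN rule.
print(export_text(tree, feature_names=["weather", "humidity", "wind"]))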

You might also like