
MARRI LAXMAN REDDY
Institute of Technology and Management
(AN AUTONOMOUS INSTITUTION)

YANNAM APPARAO
Associate Professor, CSE/IT
Computer Science and Engineering / Information Technology @ MLRITM

DATA MINING
UNIT-III: Classification
Index

1. Classification: problem definition.
2. General approach to solving a classification problem.
3. Evaluation of classifiers, classification techniques.
4. Decision trees: decision tree construction.
5. Methods of expressing attribute test conditions.
6. Methods for selecting the best split.
7. Algorithm for decision tree induction.
8. Naïve Bayes classifier.
9. Bayesian Belief Networks.
10. K-Nearest neighbour classification: algorithm and characteristics.
Classification

Classification is a data mining function that assigns items in a collection to target categories or classes.

The goal of classification is to accurately predict the target class for each case in the data.

For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.
Classification: Problem Definition

There are two forms of data analysis that can be used for extracting models describing important classes or to predict future data trends. These two forms are as follows:

i. Classification
ii. Prediction
Classification

Following are examples of cases where the data analysis task is classification:

a) A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.

b) A marketing manager at a company needs to predict whether a customer with a given profile will buy a new computer.
Prediction

Following is an example of a case where the data analysis task is prediction:

Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we are asked to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or ordered value.
How Does Classification Work?

With the help of the bank loan application that we discussed above, let us understand the working of classification. The data classification process includes two steps:

1) Building the classifier or model
2) Using the classifier for classification
Building the Classifier or Model

1. This step is the learning step or the learning phase.
2. In this step the classification algorithms build the classifier.
3. The classifier is built from the training set, made up of database tuples and their associated class labels.
4. Each tuple that constitutes the training set is assumed to belong to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.
General Approach to Solving a Classification Problem
A large database has a huge amount of raw data, which is analyzed to retrieve useful information and to make decisions. Classification is one of the methods used for data analysis: we analyze the data and classify it based on our requirements.
For example, if we want to know the performance of a university, we classify the students database based on their performance into above average, average, and below average students. If the classification shows that the number of students in the "below average" category is large, then the university needs to improve.
Data Classification Process

Let us consider a data classification task where a decision is to be made about increasing the pay scale of employees in an organization, based on their performance level and current pay scale.
General approach to classification

• The training set consists of records with known class labels.
• The training set is used to build a classification model.
• A labeled test set of previously unseen data records is used to evaluate the quality of the model.
• The classification model is applied to new records with unknown class labels.
Illustrating Classification Task

A learning algorithm is applied to the training set to learn a model (induction); the model is then applied to the test set (deduction).

Training Set:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Test Set:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?
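
Below is a minimal sketch of this train-then-apply workflow in Python, assuming pandas and scikit-learn are available (the slides do not prescribe a library); it builds a decision tree model from the training set above and applies it to the unlabeled test records.

    # A minimal sketch, assuming pandas and scikit-learn.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Training set from the illustration above (Tid 1-10).
    train = pd.DataFrame({
        "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
        "Attrib2": ["Large", "Medium", "Small", "Medium", "Large",
                    "Medium", "Large", "Small", "Medium", "Small"],
        "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
        "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    })

    # One-hot encode the categorical attributes so the learner can use them.
    X_train = pd.get_dummies(train[["Attrib1", "Attrib2", "Attrib3"]])
    y_train = train["Class"]

    # Induction: the learning algorithm builds the model from the training set.
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Deduction: apply the model to previously unseen records (Tid 11-15).
    test = pd.DataFrame({
        "Attrib1": ["No", "Yes", "Yes", "No", "No"],
        "Attrib2": ["Small", "Medium", "Large", "Small", "Large"],
        "Attrib3": [55, 80, 110, 95, 67],
    })
    X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)
    print(model.predict(X_test))  # predicted class labels for the test records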
Evaluation of Classifiers, Classification Techniques

A classification technique (or classifier) is a systematic approach to building classification models from an input data set. Examples include:

• Decision tree classifiers,
• Rule-based classifiers,
• Neural networks,
• Support vector machines, and
• Naïve Bayes classifiers.
Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label of the input data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before.
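
As a small illustration of how such a model is evaluated, the sketch below computes accuracy, the fraction of labeled test records the model classifies correctly (the labels shown are hypothetical, not from the slides).

    # A minimal sketch of classifier evaluation on a labeled test set.
    def accuracy(y_true, y_pred):
        # Fraction of test records whose predicted label matches the true label.
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return correct / len(y_true)

    y_true = ["No", "Yes", "No", "Yes", "No"]   # true labels of held-out records
    y_pred = ["No", "Yes", "Yes", "Yes", "No"]  # labels predicted by the model
    print(accuracy(y_true, y_pred))             # 0.8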
Classification by Decision Tree Induction

Decision Tree Induction

A decision tree is a tree structure where each non-leaf node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label.

Figure: Decision Tree

The decision tree shown in the figure above enables the organization to identify the number of students who are going to join a software company. Some decision trees are binary and some are non-binary.

Decision trees are mostly used to derive classification rules for tuples which don't have a class label. The class predictions are made by traversing from the root node to a leaf node.
Advantages of Decision Trees

i. They don't require domain knowledge.
ii. They are easy to understand.
iii. They handle high-dimensional data.
iv. Classification and learning become simpler when decision trees are used.
v. They are very accurate.
Decision Tree Induction Algorithm with an Example

Decision_Tree(DP, attribute_list)

Input
 Data partition, DP
 attribute_list
 Attribute_selection_procedure
Output
 A decision tree
Method

1. Create a node A.

2. If all tuples in the data partition DP are from the same class C, then make A a leaf node with the class label C.

3. If attribute_list is empty, then make A a leaf node with the majority class in DP as its class label.

4. Call Attribute_selection_procedure to identify the splitting criterion; it returns the splitting attribute and its outcomes, say i.

5. If the splitting attribute is discrete-valued and multiway splits are allowed, remove it from the candidate list:
   attribute_list = attribute_list − splitting_attribute

6. For each outcome i, identify the tuples satisfying outcome i; let those tuples be DPi.

7. If DPi is empty
       attach a leaf node labeled with the majority class in DP
   else
       attach the node returned by Decision_Tree(DPi, attribute_list)
   end for

8. Return A.
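
The sketch below is one way to render this algorithm in Python; information gain is used as a hypothetical Attribute_selection_procedure, and all names and structures are illustrative rather than prescribed by the slides.

    # A minimal sketch of the recursive induction algorithm above.
    import math
    from collections import Counter

    def entropy(labels):
        # Entropy of a class-label list; the basis of information gain.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def select_attribute(rows, labels, attribute_list):
        # Hypothetical Attribute_selection_procedure: pick the attribute
        # with the highest information gain.
        def gain(attr):
            total = entropy(labels)
            for value in set(r[attr] for r in rows):
                subset = [l for r, l in zip(rows, labels) if r[attr] == value]
                total -= len(subset) / len(labels) * entropy(subset)
            return total
        return max(attribute_list, key=gain)

    def decision_tree(rows, labels, attribute_list):
        # Step 2: all tuples in the same class -> leaf labeled with that class.
        if len(set(labels)) == 1:
            return labels[0]
        # Step 3: no candidate attributes left -> leaf with the majority class.
        if not attribute_list:
            return Counter(labels).most_common(1)[0][0]
        # Step 4: choose the splitting attribute.
        attr = select_attribute(rows, labels, attribute_list)
        # Step 5: remove the splitting attribute from the candidate list.
        remaining = [a for a in attribute_list if a != attr]
        tree = {attr: {}}
        # Steps 6-7: partition the tuples on each outcome and recurse.
        # (Outcomes are taken from the data, so no partition is empty here.)
        for value in set(r[attr] for r in rows):
            sub = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
            sub_rows, sub_labels = zip(*sub)
            tree[attr][value] = decision_tree(list(sub_rows),
                                              list(sub_labels), remaining)
        return tree  # step 8

Each row here is a dict mapping attribute names to values; the returned tree is a nested dict with class labels at the leaves.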
Description of the Algorithm

Input parameters:

1. Data partition, DP: a set of training tuples and their associated class labels.

2. attribute_list: a list of candidate attributes.

3. Attribute_selection_procedure: this is used to choose the attribute that best separates the tuples according to their associated classes.

Attribute selection measures used are:

a) Gini index
b) Information gain

These measures also tell us whether the tree is strictly binary or not.
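
For concreteness, here is a small Python sketch of the two measures, computed for an illustrative label distribution of 10 "Yes" and 4 "No" tuples (the same distribution that appears in the Naïve Bayes example later):

    # A small sketch of the two attribute selection measures.
    import math
    from collections import Counter

    def gini(labels):
        # Gini index: 1 minus the sum of squared class proportions.
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def info(labels):
        # Entropy: the expected information, the basis of information gain.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    labels = ["Yes"] * 10 + ["No"] * 4
    print(round(gini(labels), 3))  # 0.408
    print(round(info(labels), 3))  # 0.863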
Step 1
Suppose we are given data tuples of students along with their averages. Then we will have a single node:

Average

Step 2
If all the data tuples belong to a single class, then the node becomes a leaf node, and we label it with that class.

Suppose the average of every student is above 80%; then all data tuples belong to a single class. Hence the node Average becomes the leaf node, we label it "Avg > 80%", and the algorithm ends here.

Avg > 80%

But if all the students do not have an average above 80%, then the algorithm proceeds further.
Step 3
The algorithm calls Attribute_selection_procedure to identify the splitting criterion, which determines the branches to be split. In our example, the method would determine three splitting criteria:

I. Above 80%
II. Above 65% and less than 80%
III. Above 40% and less than 65%

Here there are more than two outcomes, so we are allowed to have a non-binary tree. We consider a partition pure if all the tuples within that partition belong to the same class.

Average
├─ above 80%
├─ above 65% and less than 80%
└─ above 40% and less than 65%
The splitting attribute can be:

i. Discrete-valued
ii. Continuous-valued
iii. Discrete-valued and binary

If the splitting attribute is discrete-valued, then the node is split into all the possible values.

Example: Fruits → apple, banana, grapes, pineapple

If it is a continuous-valued attribute, as in our Average example, then we divide the set of tuples based on ranges of values.

If it is discrete-valued and a binary tree is required, then the test has exactly two outcomes, either yes or no.

Example: Student ∈ CSIT? → yes / no

This algorithm is applied recursively to every node at each level, until all nodes become leaf nodes.
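
A tiny Python sketch of the three kinds of test condition (function names and values are illustrative, not from the slides):

    # Discrete-valued, multiway: one branch per attribute value.
    def multiway_test(fruit):
        return fruit  # "apple" | "banana" | "grapes" | "pineapple"

    # Continuous-valued: branch on ranges of the value.
    def range_test(average):
        if average > 80:
            return "above 80%"
        elif average > 65:
            return "above 65% and less than 80%"
        else:
            return "above 40% and less than 65%"

    # Discrete-valued binary: exactly two outcomes.
    def binary_test(department):
        return "yes" if department == "CSIT" else "no"

    print(range_test(72))  # "above 65% and less than 80%"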
Naïve Bayes Classifier

• The Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.

• It is mainly used in text classification, which involves high-dimensional training datasets.

• The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, helping to build fast machine learning models that can make quick predictions.

• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
Why is it called Naïve Bayes?

• The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:

• Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.

• Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:

• Bayes' theorem is also known as Bayes' rule or Bayes' law. It is used to determine the probability of a hypothesis with prior knowledge, and it depends on conditional probability.

• The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) × P(A) / P(B)

Where,

P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.

P(A) is the prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the marginal probability: the probability of the evidence.


Working of the Naïve Bayes Classifier:

• The working of the Naïve Bayes classifier can be understood with the help of the example below.

• Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset we need to decide whether we should play or not on a particular day, according to the weather conditions. To solve this problem, we need to follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate the likelihood table by finding the probabilities of the given features.
3. Now, use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the dataset below:

     | Outlook  | Play
0    | Rainy    | Yes
1    | Sunny    | Yes
2    | Overcast | Yes
3    | Overcast | Yes
4    | Sunny    | No
5    | Rainy    | Yes
6    | Sunny    | Yes
7    | Overcast | Yes
8    | Rainy    | No
9    | Sunny    | No
10   | Sunny    | Yes
11   | Rainy    | No
12   | Overcast | Yes
13   | Overcast | Yes
Frequency table for the weather conditions:

Weather  | Yes | No
Overcast | 5   | 0
Rainy    | 2   | 2
Sunny    | 3   | 2
Total    | 10  | 4
Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)

• P(Sunny|Yes) = 3/10 = 0.3
• P(Sunny) = 5/14 ≈ 0.35
• P(Yes) = 10/14 ≈ 0.71
• So P(Yes|Sunny) = 0.3 × 0.71 / 0.35 ≈ 0.60

P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny)

• P(Sunny|No) = 2/4 = 0.5
• P(No) = 4/14 ≈ 0.29
• P(Sunny) = 5/14 ≈ 0.35
• So P(No|Sunny) = 0.5 × 0.29 / 0.35 ≈ 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny). Hence on a sunny day, the player can play the game.
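
The hand calculation above can be checked with a short Python sketch over the same Outlook/Play dataset; with exact fractions the posteriors come out as 0.6 and 0.4, matching the rounded 0.60 and 0.41 above.

    # A minimal sketch reproducing the Naïve Bayes hand calculation.
    outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy",
               "Sunny", "Overcast", "Rainy", "Sunny", "Sunny", "Rainy",
               "Overcast", "Overcast"]
    play    = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
               "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

    n = len(play)
    p_sunny = outlook.count("Sunny") / n            # P(Sunny) = 5/14
    for label in ("Yes", "No"):
        p_label = play.count(label) / n             # P(Yes) or P(No)
        p_sunny_given = sum(o == "Sunny" and p == label
                            for o, p in zip(outlook, play)) / play.count(label)
        posterior = p_sunny_given * p_label / p_sunny   # Bayes' theorem
        print(label, round(posterior, 2))
    # Prints: Yes 0.6, No 0.4 -> play on a sunny day.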
Advantages of the Naïve Bayes Classifier:

• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.

• It can be used for binary as well as multi-class classification.

• It performs well in multi-class predictions compared to the other algorithms.

• It is the most popular choice for text classification problems.
Disadvantages of the Naïve Bayes Classifier:

• Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features.
Applications of the Naïve Bayes Classifier:

• It is used for credit scoring.

• It is used in medical data classification.

• It can be used for real-time predictions because the Naïve Bayes classifier is an eager learner.

• It is used in text classification, such as spam filtering and sentiment analysis.
