
UNIT V: Classification

Decision trees
Naïve Bayes

Ref: Han and Kamber, Data Mining: Concepts and Techniques (book and slides)

Decision trees: overview, general algorithm, decision tree algorithms, evaluating a decision tree.
Naïve Bayes: Bayes' theorem and algorithm, the Naïve Bayes classifier, smoothing, diagnostics.
Diagnostics of classifiers, additional classification methods.

Classification and Prediction

 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian classification
 Rule-based classification
 Classification by back propagation


Classification vs. Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
 Prediction
 models continuous-valued functions, i.e., predicts unknown or missing values
 Ex. A manager wants to predict how much a given customer will spend during a sale at the shop.
 This is numeric prediction, so regression is normally used.


Typical applications

 Loan approval: e.g., which loan applications are safe or risky for the bank.

 Target marketing: e.g., whether a customer with a given profile will buy a new computer or not.

 Medical diagnosis: e.g., a medical researcher may want to analyze cancer data to predict which one of three treatments a patient should receive.

 Fraud detection: e.g., detecting whether fraud has occurred or not.


Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees,
or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set, otherwise over-fitting
will occur
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Process (1): Model Construction

The training data are fed to a classification algorithm, which produces the classifier (model).

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned model (example rule): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction

The classifier is applied first to test data (to estimate accuracy) and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Supervised vs. Unsupervised Learning

 Supervised learning (classification)
 Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
 New data are classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Issues: Data Preparation

 Data cleaning
 Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data



Issues: Evaluating Classification Methods
 Accuracy
 classifier accuracy: predicting class label
 predictor accuracy: how accurately the predictor guesses the value of the predicted attribute (e.g., when filling in missing values)
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules


Decision Tree Induction: Training Dataset

age income student credit_rating buys_computer


<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no



Output: A Decision Tree for "buys_computer"

age?
  <=30:    student?
             no  -> buys_computer = no
             yes -> buys_computer = yes
  31..40:  buys_computer = yes
  >40:     credit rating?
             excellent -> buys_computer = no
             fair      -> buys_computer = yes


Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:

    Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)


 Information needed (after using attribute A to split D into v partitions, where v is the number of distinct values of A) to classify D:

    Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

    where |D_j| is the number of tuples in D that have outcome a_j of A.

 Information gained by branching on attribute A:

    Gain(A) = Info(D) − Info_A(D)


Attribute Selection: Information Gain

 Class P: buys_computer = "yes" (9 tuples)
 Class N: buys_computer = "no" (5 tuples)

    Info(D) = I(9,5) = − (9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

 For age, the class counts in each partition of the training table above are:

    age       p_i   n_i   I(p_i, n_i)
    <=30      2     3     0.971
    31...40   4     0     0
    >40       3     2     0.971

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

    Here (5/14) I(2,3) means that "age <= 30" covers 5 of the 14 samples, with 2 yes's and 3 no's. Hence

    Gain(age) = Info(D) − Info_age(D) = 0.246

 Similarly, find Gain(income), Gain(student) and Gain(credit_rating):

    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
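The slides give no code for this computation; the following is a minimal Python sketch (an addition, with the training table hard-coded from the dataset slide above) that reproduces these gain values.

import math
import pandas as pd

# buys_computer training set copied from the slide above.
data = pd.DataFrame({
    "age": ["<=30", "<=30", "31...40", ">40", ">40", ">40", "31...40",
            "<=30", "<=30", ">40", "<=30", "31...40", "31...40", ">40"],
    "income": ["high", "high", "high", "medium", "low", "low", "low",
               "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "student": ["no", "no", "no", "no", "yes", "yes", "yes",
                "no", "yes", "yes", "yes", "no", "yes", "no"],
    "credit_rating": ["fair", "excellent", "fair", "fair", "fair", "excellent",
                      "excellent", "fair", "fair", "fair", "excellent",
                      "excellent", "fair", "excellent"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes",
                      "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

def entropy(labels):
    # Info(D) = -sum p_i log2(p_i) over the class proportions.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in labels.value_counts() if c > 0)

def info_gain(df, attr, target="buys_computer"):
    # Gain(A) = Info(D) - Info_A(D), with Info_A(D) the size-weighted entropy.
    info_a = sum(len(part) / len(df) * entropy(part[target]) for _, part in df.groupby(attr))
    return entropy(df[target]) - info_a

for attr in ["age", "income", "student", "credit_rating"]:
    print(attr, round(info_gain(data, attr), 3))
# Prints approximately 0.247, 0.029, 0.152 and 0.048; the slide's 0.246 and 0.151
# come from rounding the Info values to three decimals before subtracting.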

How to handle a continuous target value

Take an example where age is the target variable. Compute the variance of age at the node, and say it comes out to be x. The decision tree then looks at the candidate splits and calculates the total weighted variance of the partitions produced by each split. It chooses the split that gives the minimum weighted variance.
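A minimal sketch of this variance-reduction criterion, using hypothetical numbers (none of these values come from the slides):

import numpy as np

target = np.array([23, 25, 31, 35, 40, 47, 52, 60], dtype=float)   # continuous target (age)
feature = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)          # candidate split attribute

def weighted_variance(y, mask):
    # Total weighted variance of the two partitions induced by the split.
    left, right = y[mask], y[~mask]
    return (len(left) / len(y)) * left.var() + (len(right) / len(y)) * right.var()

# Candidate split points: midpoints between adjacent distinct feature values.
vals = np.unique(feature)
midpoints = (vals[:-1] + vals[1:]) / 2
best = min(midpoints, key=lambda s: weighted_variance(target, feature <= s))
print("chosen split point:", best)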

Techniques for encoding categorical variables
• Replacing values
• Encoding labels
• One-hot encoding
• Binary encoding
• Backward difference encoding

• replace(): map each category to a number, e.g.

replace_map = {'carrier': {'AA': 1, 'AS': 2, 'B6': 3, 'DL': 4, 'F9': 5,
                           'HA': 6, 'OO': 7, 'UA': 8, 'US': 9, 'VX': 10, 'WN': 11}}
replace_map = {'College': {'VIT': 1, 'VIIT': 2, 'MIT': 3}}

Suitable when there are few categories.
labels = cat_df_flights['carrier'].astype('category').cat.categories.tolist()
replace_map_comp = {'carrier': {k: v for k, v in zip(labels, range(1, len(labels) + 1))}}
print(replace_map_comp)

# Assumes cat_df_flights is the flights DataFrame used in the examples above.
cat_df_flights_replace = cat_df_flights.copy()
cat_df_flights_replace.replace(replace_map_comp, inplace=True)
print(cat_df_flights_replace.head())
Label encoding: converts each value in a column to a number. Numerical labels are always between 0 and n_categories - 1.

# The column must have the pandas 'category' dtype before .cat.codes can be used.
cat_df_flights_lc = cat_df_flights.copy()
cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].astype('category')
cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].cat.codes
cat_df_flights_lc.head()   # alphabetically labeled from 0 to 10
One-hot encoding: convert each category value into a new column and assign a 1 or 0 (True/False) value to the column. This has the benefit of not weighting a value improperly.

pandas.get_dummies()

cat_df_flights_onehot = cat_df_flights.copy()
cat_df_flights_onehot = pd.get_dummies(cat_df_flights_onehot, columns=['carrier'], prefix=['carrier'])
print(cat_df_flights_onehot.head())
Binary encoding

The categories are first encoded as ordinal integers, those integers are then converted into binary code, and the digits of the binary string are split into separate columns. This encodes the data in fewer dimensions than one-hot encoding.
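As an illustration only (the slides do not name a library), binary encoding is available in the third-party category_encoders package; the carrier column follows the earlier flight examples:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'carrier': ['AA', 'AS', 'B6', 'DL', 'AA', 'B6']})
encoder = ce.BinaryEncoder(cols=['carrier'])    # ordinal codes -> binary digits -> columns
print(encoder.fit_transform(df).head())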

Backward Difference Encoding

A feature with K categories (levels) usually enters a regression as a sequence of K - 1 dummy variables. In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
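Again as an illustration only, the third-party category_encoders package provides this contrast coding:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'carrier': ['AA', 'AS', 'B6', 'DL', 'AA', 'B6']})
encoder = ce.BackwardDifferenceEncoder(cols=['carrier'])   # K categories -> K-1 contrast columns
print(encoder.fit_transform(df).head())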

Example: building and visualizing a decision tree with scikit-learn

from sklearn import tree
from sklearn.datasets import load_iris
import graphviz

# Load dataset:
iris = load_iris()

# Build and train classifier:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# Visualize model (the original snippet is truncated here; filled/rounded styling
# and graphviz rendering are a reasonable completion):
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True)
graph = graphviz.Source(dot_data)

Computing Information-Gain for
Continuous-Value Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent values is considered as a possible split point
 (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
 The point with the minimum expected information requirement for A is selected as the split-point for A
 Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
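A minimal sketch of this procedure on hypothetical (age, class) pairs, with a small entropy helper:

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def expected_info(values, labels, split):
    # Info_A(D) for the binary split A <= split vs. A > split.
    left = [y for v, y in zip(values, labels) if v <= split]
    right = [y for v, y in zip(values, labels) if v > split]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

ages = [22, 25, 28, 33, 38, 41, 45, 52]                       # hypothetical attribute values
buys = ["no", "no", "yes", "yes", "yes", "yes", "no", "no"]   # hypothetical class labels
vals = sorted(set(ages))
midpoints = [(a + b) / 2 for a, b in zip(vals, vals[1:])]
best = min(midpoints, key=lambda s: expected_info(ages, buys, s))
print("split point:", best)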
Gain Ratio for Attribute Selection (C4.5)

 The information gain measure is biased towards attributes with a large number of values
 Ex. Product_ID: splitting on it may produce a large number of partitions, each containing only one tuple
 The information gain of such an attribute is maximal
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain using a split information value):
    SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

 Ex. income splits the 14 tuples into partitions of 4, 6 and 4 tuples:

    SplitInfo_income(D) = − (4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557

 GainRatio(A) = Gain(A) / SplitInfo_A(D)
 Ex. GainRatio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the splitting attribute
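A short check of this arithmetic (the partition sizes 4, 6 and 4 are read off the income column of the training table):

import math

split_info_income = -sum(n / 14 * math.log2(n / 14) for n in (4, 6, 4))
print(round(split_info_income, 3))             # 1.557
print(round(0.029 / split_info_income, 3))     # GainRatio(income) ≈ 0.019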

Gini index (CART, IBM IntelligentMiner)

 The Gini index measures the impurity of D, a data partition or set of training tuples

 If a data set D contains examples from n classes, the Gini index gini(D) is defined as

    gini(D) = 1 − Σ_{j=1}^{n} p_j²

    where p_j is the relative frequency of class j in D, i.e., |C_j,D| / |D|

 Ex. income has three possible values, {low, medium, high}, so it has 2³ = 8 subsets; excluding {low, medium, high} and {} leaves 2³ − 2 = 6 possible ways to form two partitions of the data D based on a binary split on A


 If a data set D is split on A into two subsets D1 and D2, the Gini index gini_A(D) is defined as

    gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

    i.e., gini_A(D) is a weighted sum of the impurity of each resulting partition.

 For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as its splitting subset.

 Reduction in impurity:

    Δgini(A) = gini(D) − gini_A(D)


 The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node.
Gini index (CART, IBM IntelligentMiner)

 Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":

    gini(D) = 1 − (9/14)² − (5/14)² = 0.459

 Suppose the attribute income partitions D into D1 = {low, medium} with 10 tuples and D2 = {high} with 4 tuples:

    gini_income ∈ {low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2)

 Find the values for gini_{medium,high} and gini_{low,high}.


The results are gini_{medium,high} = 0.450 and gini_{low,high} = 0.458, while gini_{low,medium} = 0.443, so the best binary split for income is on {low, medium} (and {high}).

Now apply the same method to the other attributes: age, student and credit_rating.

For age, the best splitting subset is {youth, senior} (versus {middle_aged}), with a Gini index of 0.357; student and credit_rating are binary attributes with Gini index values of 0.367 and 0.429, respectively.

Finally, the attribute age with splitting subset {youth, senior} gives the minimum Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102.
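A minimal sketch that reproduces these Gini numbers from class counts read off the training table (9 yes / 5 no overall; the {low, medium} partition holds 7 yes / 3 no and {high} holds 2 yes / 2 no):

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(round(gini([9, 5]), 3))                                    # gini(D) = 0.459
g_low_medium = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])    # binary split on income
print(round(g_low_medium, 3))                                    # 0.443, the minimum for income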

Comparing Attribute Selection Measures

 The three measures, in general, return good results but


 Information gain:
biased towards multivalued attributes
 Gain ratio:
tends to prefer unbalanced splits in which one
partition is much smaller than the others
 Gini index:
biased towards multivalued attributes
has difficulty when number of classes is large
tends to favor tests that result in equal-sized
partitions and purity in both partitions
Overfitting and Tree Pruning(Repetition and Replication)

 Overfitting: An induced tree may overfit the training data


 Too many branches, some may reflect irregularities due to noise
or outliers
 Poor accuracy for unseen samples
 Two approaches of Tree pruning:
 Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold.
 High thresholds could result in oversimplified trees
 Low thresholds could result in very little simplification.



 Postpruning: Remove branches from a “fully grown” tree.
 A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
 The leaf is labeled with the most frequent class among the subtree being replaced.
 Use a set of data different from the training data to decide which is the “best pruned tree”, as illustrated below.
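The slides do not tie post-pruning to a tool; as one hedged illustration, scikit-learn's cost-complexity pruning can play the role of pruning a fully grown tree and picking the best pruned tree on separate data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths (alphas) derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Keep the alpha whose pruned tree does best on the held-out validation split.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train).score(X_val, y_val),
)
print("chosen ccp_alpha:", best_alpha)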

Bayesian Classification: Why?

 A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities, such as the probability that a given tuple belongs to a particular class.

 Foundation: based on Bayes' theorem.

 Performance: a simple Bayesian classifier, known as the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.


Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.

Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
Bayesian Theorem: Basics
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that
the hypothesis holds given the observed data sample X
 Suppose X is described by the attributes age and income, e.g., a 35-year-old customer with an income of 40,000, and suppose the hypothesis H is that the customer will buy a computer. Then
 P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.



 P(H) (prior probability): the initial probability
 E.g., the probability that X (or any given customer) will buy a computer, regardless of age, income, ...
 P(X): the probability that the sample data is observed
 E.g., the probability that a person from our set of customers is 35 years old with an income of 40,000.
 P(X|H) (the likelihood, i.e., the posterior probability of X conditioned on H): the probability of observing the sample X given that the hypothesis holds
 E.g., the probability that customer X is 35 years old with an income of 40,000, given that we know the customer will buy a computer.
Bayesian Theorem

 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) P(H) / P(X)

 Informally, this can be written as
    posterior = likelihood × prior / evidence
 Predict that X belongs to Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
 Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost


Towards Naïve Bayesian Classifier

 Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-attribute vector X = (x1, x2, ..., xn)
 Suppose there are m classes C1, C2, ..., Cm
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
 This can be derived from Bayes' theorem:

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)

 Since P(X) is constant for all classes, only

    P(X|Ci) P(Ci)

    needs to be maximized
Naïve Bayesian Classifier: Training Dataset

Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31...40  high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31...40  low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31...40  medium   no       excellent      yes
31...40  high     yes      fair           yes
>40      medium   no       excellent      no
Naïve Bayesian Classifier: An Example
 X = (age <= 30 , income = medium, student = yes,
credit_rating = fair)

 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643


P(buys_computer = “no”) = 5/14= 0.357

 Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4



P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

 X = (age <= 30 , income = medium, student = yes, credit_rating


= fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028


P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
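A minimal sketch that simply re-runs the hand computation above (the probabilities are the counts from the training table):

p_yes = (9 / 14) * (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # P(Ci) * P(X|Ci) for "yes"
p_no = (5 / 14) * (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)    # P(Ci) * P(X|Ci) for "no"
print(round(p_yes, 3), round(p_no, 3))                      # 0.028 0.007
print("buys_computer =", "yes" if p_yes > p_no else "no")   # yes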



 Exercise: find the class label for X = (age > 40, income = medium, student = yes, credit_rating = excellent)

 Exercise: find the class label for X = (age = 31...40, income = medium, student = yes, credit_rating = excellent)

Outlook = "Overcast", Temp = "Mild", Humidity = "High", Windy = "False"

Play Golf = Yes (9/14): 4/9 × 4/9 × 3/9 × 6/9 = 0.04
Play Golf = No (5/14): 0/5 × 2/5 × 4/5 × 2/5 = 0

This zero-frequency case is a limitation of Naïve Bayes.
Solution: use a smoothing technique such as Laplace estimation, sketched below.
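A minimal sketch of Laplace (add-one) smoothing for this zero-frequency case, assuming Outlook has three categories (Sunny, Overcast, Rainy):

def laplace(count, class_total, n_categories, alpha=1):
    # Add alpha to each count and alpha * n_categories to the denominator.
    return (count + alpha) / (class_total + alpha * n_categories)

# P(Outlook = "Overcast" | Play Golf = No): the raw estimate 0/5 becomes (0 + 1) / (5 + 3).
print(laplace(0, 5, 3))   # 0.125 instead of 0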



Naïve Bayesian Classifier: Comments
 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence, therefore
loss of accuracy
 Practically, dependencies exist among variables
 E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
 Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
 How to deal with these dependencies?
 Bayesian Belief Networks
Other limitations of Naive Bayes:
1. The assumption of independent predictors.
   In real life, it is almost impossible to get a set of predictors that are completely independent.
2. Zero probability.

Types of Naïve Bayes classifier
Gaussian: assumes that the features follow a normal distribution.
  Used when the data are continuous, i.e., features take continuous values.
Multinomial: used for discrete counts; works with occurrence counts.
  Used in text classification, where the data are typically represented as word-count vectors.
  Example: count of each word, movie rating.
Bernoulli: useful when the feature vectors are binary (zeros and ones).
  There may be multiple features, but each one is assumed to be binary-valued.
  E.g., a symptom present or not, a word present or not.
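A hedged scikit-learn sketch of the three variants named above, on tiny made-up arrays:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 1, 0, 1])

# Gaussian NB: continuous features.
X_cont = np.array([[5.1, 3.5], [6.2, 2.9], [4.8, 3.0], [6.7, 3.1]])
print(GaussianNB().fit(X_cont, y).predict([[5.0, 3.2]]))

# Multinomial NB: discrete counts (e.g., word counts).
X_counts = np.array([[3, 0, 1], [0, 2, 4], [2, 1, 0], [0, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]]))

# Bernoulli NB: binary present/absent features.
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))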

Using IF-THEN Rules for Classification

 Represent the knowledge in the form of IF-THEN rules


R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
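A minimal sketch computing coverage and accuracy for the rule R above against the 14-tuple buys_computer table (taking age "<=30" as youth):

rows = [  # (age, student, buys_computer), copied from the training table
    ("<=30", "no", "no"), ("<=30", "no", "no"), ("31...40", "no", "yes"),
    (">40", "no", "yes"), (">40", "yes", "yes"), (">40", "yes", "no"),
    ("31...40", "yes", "yes"), ("<=30", "no", "no"), ("<=30", "yes", "yes"),
    (">40", "yes", "yes"), ("<=30", "yes", "yes"), ("31...40", "no", "yes"),
    ("31...40", "yes", "yes"), (">40", "no", "no"),
]
covered = [r for r in rows if r[0] == "<=30" and r[1] == "yes"]
correct = [r for r in covered if r[2] == "yes"]
print("coverage(R) =", len(covered), "/", len(rows))      # 2/14 ≈ 0.143
print("accuracy(R) =", len(correct), "/", len(covered))   # 2/2 = 1.0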



If more than one rule is triggered, conflict resolution is needed:
Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
Class-based ordering: decreasing order of prevalence or misclassification cost per class
Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
Rule Extraction from a Decision Tree

 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive
 Example: rule extraction from our buys_computer decision tree (root age?, with branches <=30, 31..40, >40):

    IF age = young AND student = no            THEN buys_computer = no
    IF age = young AND student = yes           THEN buys_computer = yes
    IF age = mid-age                           THEN buys_computer = yes
    IF age = old AND credit_rating = excellent THEN buys_computer = yes
    IF age = old AND credit_rating = fair      THEN buys_computer = no


Rule Extraction from the Training Data

 Sequential covering algorithm: Extracts rules directly from training data


 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover many
tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 The process repeats on the remaining tuples unless termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold
 Compare with decision-tree induction, which learns a set of rules simultaneously
Classification: A Mathematical Mapping

 Classification: predicts categorical class labels
 E.g., personal homepage classification
   x_i = (x1, x2, x3, ...), y_i = +1 or −1
   x1: # of occurrences of the word "homepage"
   x2: # of occurrences of the word "welcome"
 Mathematically:
   x ∈ X = ℝ^n, y ∈ Y = {+1, −1}
   We want a function f: X → Y
Linear Classification

 Binary classification problem
 [Figure: scatter of 'x' and 'o' points separated by a red line]
 The data above the red line belong to class 'x'
 The data below the red line belong to class 'o'
 Examples: SVM, Perceptron, probabilistic classifiers

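A hedged sketch of one such linear classifier (scikit-learn's Perceptron) on synthetic two-class data resembling the picture above:

import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(20, 2)),     # class 'x' -> +1
               rng.normal(loc=[-2.0, -2.0], size=(20, 2))])  # class 'o' -> -1
y = np.array([1] * 20 + [-1] * 20)

clf = Perceptron().fit(X, y)   # learns a separating hyperplane (a line in 2-D)
print(clf.predict([[1.5, 2.0], [-2.5, -1.0]]))   # points on either side of the line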


Discriminative Classifiers
 Advantages
 prediction accuracy is generally high
 As compared to Bayesian methods – in general
 robust, works when training examples contain errors
 fast evaluation of the learned target function
 Bayesian networks are normally slow
 Criticism
 long training time
 difficult to understand the learned function (weights)
 Bayesian networks can be used easily for pattern discovery
 not easy to incorporate domain knowledge
 Easy in the form of priors on the data or distributions

