DmUnit 3
UNIT-III
Association
◼ Suppose we want to know which items are frequently purchased together; let
us consider an example:
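As an illustration (not part of the original slide), the following minimal Python sketch counts how often pairs of items co-occur across transactions and keeps the pairs whose support reaches a threshold; the transactions and the threshold are made up for this sketch.

```python
from itertools import combinations
from collections import Counter

# Illustrative transactions; item names and the threshold are made up.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
min_support = 3  # absolute support threshold

# Count how often each unordered pair of items appears together.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)   # e.g. ('beer', 'diaper'): 3, ('bread', 'milk'): 3, ...
```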
Classification: Basic Concepts
Supervised vs. Unsupervised Learning
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
◼ The model is represented as classification rules, decision trees, or mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model on a test set that is independent of the training set
◼ Note: If the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction
[Figure: training data are fed to a classification algorithm, which produces a classifier (the model).]
Process (2): Using the Model in Prediction
[Figure: the classifier is applied to testing data and then to unseen data, e.g. (Jeff, Professor, 4) → Tenured?]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
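As a sketch only (scikit-learn is assumed, and the four tuples from the table above are treated as a toy training set purely for illustration, whereas in the full process they would be split into training and test data), the two steps look like this in code:

```python
# A minimal sketch of the two-step process: construction, then usage.
from sklearn.tree import DecisionTreeClassifier

rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

# Step 1: model construction from labeled tuples (rank, years) -> tenured.
train = [
    ("Assistant Prof", 2, "no"),
    ("Associate Prof", 7, "no"),
    ("Professor",      5, "yes"),
    ("Assistant Prof", 7, "yes"),
]
X_train = [[rank_code[r], y] for r, y, _ in train]
y_train = [t for _, _, t in train]

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage on an unseen tuple, e.g. (Jeff, Professor, 4).
print(clf.predict([[rank_code["Professor"], 4]]))  # classify the unseen tuple
```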
Classification: Basic Concepts
Decision Tree Induction: An Example
❑ Training data set: Buys_computer
❑ The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

❑ Resulting tree:
    age?
    ├─ <=30   → student?        (no → no, yes → yes)
    ├─ 31…40  → yes
    └─ >40    → credit_rating?  (excellent → no, fair → yes)
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer manner
◼ At start, all the training examples are at the root
◼ Attributes are categorical (if continuous-valued, they are discretized in advance)
◼ Examples are partitioned recursively based on selected attributes
◼ Test attributes are selected on the basis of a heuristic or statistical measure, e.g., information gain (see the sketch below)
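One way the greedy top-down procedure can be written, as a sketch only (the data representation, function names, and use of information gain as the heuristic are assumptions for illustration):

```python
# Tuples are dicts of categorical attributes; 'target' is the class label key.
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                  # all examples in one class -> leaf
        return classes.pop()
    if not attributes:                     # no attributes left -> majority vote
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:  # partition on the chosen attribute
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, rest, target)
    return tree
```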
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Select the attribute with the highest information gain
◼ Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
◼ Expected information (entropy) needed to classify a tuple in D:
Info(D) = − Σ_{i=1}^{m} pi log2(pi)
◼ Information needed (after using A to split D into v partitions) to
classify D:
InfoA(D) = Σ_{j=1}^{v} ( |Dj| / |D| ) × Info(Dj)
◼ Information gained by branching on attribute A:
Gain(A) = Info(D) − InfoA(D)
◼ Example (for the buys_computer training data above):
Gain(age) = 0.246, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
◼ age has the highest information gain, so it is selected as the splitting attribute
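A short sketch (helper names are my own) that recomputes these gains from the 14 training tuples above; the printed values differ from the slide only in the last digit because the slide rounds the intermediate entropies first:

```python
import math
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),      ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),      (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),     (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),   (">40", "medium", "no", "excellent", "no"),
]
attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(rows):
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, idx):
    n = len(rows)
    expected = 0.0
    for v in {r[idx] for r in rows}:
        subset = [r for r in rows if r[idx] == v]
        expected += len(subset) / n * info(subset)
    return info(rows) - expected

for name, idx in attrs.items():
    print(name, round(gain(data, idx), 3))
# prints: age 0.247, income 0.029, student 0.152, credit_rating 0.048
```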
Computing Information-Gain for
Continuous-Valued Attributes
◼ Let attribute A be a continuous-valued attribute
◼ Must determine the best split point for A
◼ Sort the values of A in increasing order
◼ Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
◼ (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
◼ The point with the minimum expected information
requirement for A is selected as the split-point for A
◼ Split:
◼ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
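A small sketch of this search (function names and the toy data are made up): sort the values, try each midpoint of adjacent distinct values, and keep the split with the minimum expected information requirement.

```python
import math
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        if xs[i] == xs[i + 1]:
            continue                       # identical values give no new midpoint
        split = (xs[i] + xs[i + 1]) / 2    # midpoint between adjacent values
        left, right = ys[: i + 1], ys[i + 1:]
        expected = len(left) / n * info(left) + len(right) / n * info(right)
        best = min(best, (expected, split))
    return best[1]

# Illustrative toy data (ages vs. a yes/no class); not from the slides.
print(best_split_point([21, 25, 30, 38, 45, 52], ["no", "no", "no", "yes", "yes", "yes"]))
# -> 34.0
```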
Gain Ratio for Attribute Selection (C4.5)
◼ Information gain measure is biased towards attributes with a
large number of values
◼ C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfoA(D) = − Σ_{j=1}^{v} ( |Dj| / |D| ) × log2( |Dj| / |D| )
◼ GainRatio(A) = Gain(A)/SplitInfo(A)
◼ Ex.: income partitions the 14 tuples into high = 4, medium = 6, low = 4, so SplitInfo_income(D) = 1.557 and GainRatio(income) = 0.029 / 1.557 ≈ 0.019
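A few lines to reproduce that example (the partition sizes come from the buys_computer table above; the Gain(income) value of 0.029 is taken from the earlier slide):

```python
import math

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return -sum(s / n * math.log2(s / n) for s in partition_sizes)

# income partitions the 14 tuples into high=4, medium=6, low=4.
si = split_info([4, 6, 4])
print(round(si, 3))          # -> 1.557
print(round(0.029 / si, 3))  # GainRatio(income) -> 0.019
```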
Overfitting and Tree Pruning
◼ Overfitting: an induced tree may overfit the training data
◼ Too many branches, some of which may reflect anomalies due to noise or outliers
◼ Poor accuracy for unseen samples
Classification: Basic Concepts
Bayesian Classification: Why?
◼ A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
◼ Foundation: Based on Bayes’ Theorem.
◼ Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
Bayes’ Theorem: Basics
◼ Total probability theorem: P(B) = Σ_{i=1}^{M} P(B | Ai) P(Ai)
Prediction Based on Bayes’ Theorem
◼ Given training data X, the posteriori probability of a hypothesis H,
P(H|X), follows Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)
Classification Is to Derive the Maximum Posteriori
◼ Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm.
◼ Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
◼ Since P(X) is constant for all classes, only P(X|Ci) P(Ci)
needs to be maximized
Naïve Bayes Classifier
◼ A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
P(X|Ci) = Π_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
◼ This greatly reduces the computation cost: Only counts the
class distribution
◼ If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
◼ If Ak is continuous-valued, P(xk|Ci) is usually computed based on
Gaussian distribution with a mean μ and standard deviation σ
g(x, μ, σ) = ( 1 / (√(2π) σ) ) · e^( −(x − μ)² / (2σ²) )
and P(xk|Ci) = g(xk, μCi, σCi)
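A minimal sketch of that Gaussian estimate for a continuous attribute (the attribute values below are illustrative, not from the slides):

```python
import math

def gaussian(x, mu, sigma):
    # g(x, mu, sigma): the normal density used as P(xk | Ci)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g. ages of the training tuples that belong to class Ci
ages_in_class = [25, 32, 38, 40, 41, 45, 48, 52, 55]
mu = sum(ages_in_class) / len(ages_in_class)
sigma = math.sqrt(sum((a - mu) ** 2 for a in ages_in_class) / len(ages_in_class))

print(gaussian(35, mu, sigma))   # density value used as P(age = 35 | Ci)
```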
Naïve Bayes Classifier: Training Dataset
◼ Training data: the 14-tuple buys_computer data set shown earlier (age, income, student, credit_rating, buys_computer)
◼ Class: C1: buys_computer = ‘yes’, C2: buys_computer = ‘no’
◼ Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Naïve Bayes Classifier: An Example
◼ Using the training data above, compute the prior P(Ci) for each class and the conditional probabilities P(xk|Ci) for each attribute value of X, then compare P(X|Ci) P(Ci) for C1 and C2 (a worked sketch follows below)
◼ Zero-probability problem: if an attribute value never occurs with a class, a single zero count makes the whole product zero; the Laplacian correction (adding 1 to each count) avoids this, and the “corrected” probability estimates remain close to their
“uncorrected” counterparts
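The sketch below carries out that computation directly from the counts in the training table (the data and X are from the slides; the helper code is mine):

```python
from collections import Counter

data = [  # (age, income, student, credit_rating) -> buys_computer
    (("<=30", "high", "no", "fair"), "no"),      (("<=30", "high", "no", "excellent"), "no"),
    (("31…40", "high", "no", "fair"), "yes"),    ((">40", "medium", "no", "fair"), "yes"),
    ((">40", "low", "yes", "fair"), "yes"),      ((">40", "low", "yes", "excellent"), "no"),
    (("31…40", "low", "yes", "excellent"), "yes"), (("<=30", "medium", "no", "fair"), "no"),
    (("<=30", "low", "yes", "fair"), "yes"),     ((">40", "medium", "yes", "fair"), "yes"),
    (("<=30", "medium", "yes", "excellent"), "yes"), (("31…40", "medium", "no", "excellent"), "yes"),
    (("31…40", "high", "yes", "fair"), "yes"),   ((">40", "medium", "no", "excellent"), "no"),
]
X = ("<=30", "medium", "yes", "fair")

class_counts = Counter(label for _, label in data)
scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(data)                        # prior P(Ci)
    for k, x_k in enumerate(X):                    # conditional independence:
        n_match = sum(1 for feats, label in data   # P(X|Ci) = prod_k P(xk|Ci)
                      if label == c and feats[k] == x_k)
        score *= n_match / n_c
    scores[c] = score

print(scores)                        # P(X|Ci) P(Ci): 'yes' ≈ 0.028, 'no' ≈ 0.007
print(max(scores, key=scores.get))   # -> yes
```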
Naïve Bayes Classifier: Comments
◼ Advantages
◼ Easy to implement
◼ Disadvantages
◼ Assumption: class conditional independence, therefore loss
of accuracy
◼ Practically, dependencies exist among variables; such dependencies cannot be modeled by a naïve
Bayes classifier
Classification: Basic Concepts
A Multi-Layer Feed-Forward Neural Network
[Figure: the input vector X enters the input layer; weighted connections wij feed a hidden layer, whose outputs feed the output layer, which produces the output vector.]
Weight update shown with the figure: wj^(k+1) = wj^(k) + λ (yi − ŷi^(k)) xij, where λ is the learning rate
How A Multi-Layer Neural Network Works
◼ The inputs to the network correspond to the attributes measured for each
training tuple
◼ Inputs are fed simultaneously into the units making up the input layer
◼ They are then weighted and fed simultaneously to a hidden layer
◼ The number of hidden layers is arbitrary, although usually only one
◼ The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network's prediction
◼ The network is feed-forward: None of the weights cycles back to an input
unit or to an output unit of a previous layer
◼ From a statistical point of view, networks perform nonlinear regression:
Given enough hidden units and enough training samples, they can closely
approximate any function
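As a sketch of the forward computation just described (the layer sizes, weights, and sigmoid activation below are arbitrary choices for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    # layers: list of (weight_matrix, bias_vector); each layer's output feeds the next
    activation = x
    for weights, biases in layers:
        activation = [sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
                      for row, b in zip(weights, biases)]
    return activation

hidden = ([[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]], [0.1, 0.2, -0.4])  # 3 units, 2 inputs each
output = ([[-0.3, 0.2, 0.6]], [0.1])                                 # 1 unit, 3 inputs

print(forward([1.0, 0.0], [hidden, output]))   # the network's prediction for one tuple
```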
Defining a Network Topology
◼ Decide the network topology: Specify # of units in the input
layer, # of hidden layers (if > 1), # of units in each hidden layer,
and # of units in the output layer
◼ Normalize the input values for each attribute measured in the
training tuples to [0.0, 1.0]
◼ For a discrete-valued attribute, one input unit per domain value may be used, each initialized to 0
◼ Output: for classification with more than two classes, one
output unit per class is used
◼ If a trained network’s accuracy is unacceptable, repeat the training
process with a different network topology or a different set of initial
weights
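A small sketch of the input preparation described above (min-max scaling to [0.0, 1.0] and one input unit per domain value); the values are illustrative:

```python
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, domain):
    return [1.0 if value == d else 0.0 for d in domain]

ages = [21, 35, 48, 62]
print(min_max(ages))                                 # -> [0.0, 0.3414..., 0.6585..., 1.0]
print(one_hot("medium", ["low", "medium", "high"]))  # -> [0.0, 1.0, 0.0]
```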
Backpropagation
◼ Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value
◼ For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target value
◼ Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation.”
◼ Steps
◼ Initialize the weights and biases to small random numbers
◼ Propagate the inputs forward (by applying the activation function)
◼ Backpropagate the error (by updating weights and biases)
◼ Terminating condition (when the error is very small, etc.)
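The loop below is a minimal sketch of these steps for a 2-3-1 sigmoid network trained on a tiny made-up dataset (the logical OR of two inputs); the architecture, learning rate, and epoch count are assumptions, and convergence depends on the random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [1.]])   # target: logical OR

# Initialize weights and biases to small random numbers
W1, b1 = rng.normal(0, 0.5, (2, 3)), np.zeros(3)
W2, b2 = rng.normal(0, 0.5, (3, 1)), np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    # Propagate the inputs forward
    h = sigmoid(X @ W1 + b1)          # hidden layer activations
    out = sigmoid(h @ W2 + b2)        # network prediction

    # Backpropagate the error (gradients of the squared error)
    err_out = (out - y) * out * (1 - out)
    err_hid = (err_out @ W2.T) * h * (1 - h)

    # Update weights and biases in the backwards direction
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hid;  b1 -= lr * err_hid.sum(axis=0)

print(out.round(2))   # predictions approach the targets 0, 1, 1, 1 as training proceeds
```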
Neuron: A Hidden/Output Layer Unit
[Figure: a single unit with inputs x0 … xn, weights w0 … wn, a bias μk, a weighted sum, and an activation function f producing the output y.]
For example: y = sign( Σ_{i=0}^{n} wi xi − μk )
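A few lines illustrating that unit with a sign activation (the weights, inputs, and bias below are made up):

```python
weights = [0.3, -0.2, 0.5]
inputs  = [1.0, 2.0, 3.0]
bias_mu = 0.4                          # the bias / threshold mu_k

weighted_sum = sum(w * x for w, x in zip(weights, inputs)) - bias_mu
y = 1 if weighted_sum >= 0 else -1     # sign activation function
print(weighted_sum, y)                 # -> 1.0 1
```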