Pattern Recognition
Temperature
          Yes   No   P(value|Yes)   P(value|No)   P(value)
Hot        2     2       2/9            2/5         4/14
Mild       4     2       4/9            2/5         6/14
Cool       3     1       3/9            1/5         4/14
Total      9     5      9/14           5/14
Humidity
          Yes   No   P(value|Yes)   P(value|No)   P(value)
High       3     4       3/9            4/5         7/14
Normal     6     1       6/9            1/5         7/14
Total      9     5      9/14           5/14
Windy
          Yes   No   P(value|Yes)   P(value|No)   P(value)
False      6     2       6/9            2/5         8/14
True       3     3       3/9            3/5         6/14
Total      9     5      9/14           5/14
Play     P(Play)
Yes       9/14
No        5/14
For today = (Outlook = Sunny, Temperature = Hot, Humidity = Normal, Windy = False), using P(Sunny|Yes) = 2/9 and P(Sunny|No) = 3/5 for the outlook attribute:

P(Yes|today) = (2/9 × 2/9 × 6/9 × 6/9 × 9/14) / P(today); the numerator ≈ 0.0141,
so P(Yes|today) = 0.0141 / (0.0141 + 0.0068) ≈ 0.67
P(No|today) = (3/5 × 2/5 × 1/5 × 2/5 × 5/14) / P(today); the numerator ≈ 0.0068,
so P(No|today) = 0.0068 / (0.0141 + 0.0068) ≈ 0.33
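A minimal Python sketch of this hand calculation, with the conditional probabilities read from the tables above:

# Naive Bayes for today = (Sunny, Hot, Normal humidity, not windy)
p_yes = 9 / 14
p_no = 5 / 14

likelihood_yes = (2/9) * (2/9) * (6/9) * (6/9)   # sunny, hot, normal, not windy | yes
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)   # sunny, hot, normal, not windy | no

joint_yes = likelihood_yes * p_yes               # ~0.0141
joint_no  = likelihood_no * p_no                 # ~0.0068

evidence = joint_yes + joint_no                  # P(today)
print(joint_yes, joint_no)
print(joint_yes / evidence, joint_no / evidence) # ~0.67 and ~0.33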
Name     Give Birth   Can Fly   Live in Water   Have Legs   Class
human       yes          no          no            yes       mammal
python      no           no          no            no        non-mammal
salmon      no           no          yes           no        non-mammal
whale       yes          no          yes           no        mammal
frog        no           no          sometimes     yes       non-mammal
komodo      no           no          no            yes       non-mammal
(excerpt of the 20-instance data set: 7 mammals, 13 non-mammals)
Can Fly
        Mammal   Non-mammal   P(value|M)   P(value|NM)   P(value)
Yes        1          3           1/7          3/13        4/20
No         6         10           6/7         10/13       16/20
Total      7         13          7/20         13/20
Live in Water
            Mammal   Non-mammal   P(value|M)   P(value|NM)   P(value)
Yes            2          3           2/7          3/13        5/20
No             5          6           5/7          6/13       11/20
Sometimes      0          4           0/7          4/13        4/20
Total          7         13          7/20         13/20
Have Legs
        Mammal   Non-mammal   P(value|M)   P(value|NM)   P(value)
Yes        5          9           5/7          9/13       14/20
No         2          4           2/7          4/13        6/20
Total      7         13          7/20         13/20
• P(M) = 7/20
• P(NM) = 13/20

Test instance B:
Give Birth   Can Fly   Live in Water   Have Legs   Class
   yes          no          yes            no         ?

• P(M|B) ∝ P(B|M)P(M) = 6/7 × 6/7 × 2/7 × 2/7 × 7/20 ≈ 0.021
• P(NM|B) ∝ P(B|NM)P(NM) = 1/13 × 10/13 × 3/13 × 4/13 × 13/20 ≈ 0.0027
• Since 0.021 > 0.0027, B is classified as a mammal.
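The same computation for the mammal test instance, as a minimal Python check (values read from the tables above):

# Test instance: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no
p_m, p_nm = 7/20, 13/20

joint_m  = (6/7) * (6/7) * (2/7) * (2/7) * p_m        # ~0.021
joint_nm = (1/13) * (10/13) * (3/13) * (4/13) * p_nm  # ~0.0027

print(joint_m, joint_nm)   # the larger value wins, so B is labelled a mammal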
• Rearranging these leads us to the answer to our
question, which is called Bayes’ formula:
– P(ωj|x) = p(x|ωj)P(ωj) / p(x)
• where in this case of two categories
– p(x) = Σ (j = 1 to 2) p(x|ωj)P(ωj)
Bayes’ formula can be expressed informally in English by
saying that
posterior = (likelihood × prior) / evidence.
• If P(ω1|x) is greater than P(ω2|x), we would naturally be inclined to decide that the true state of nature is ω1, and vice versa.
• There is a probability of error whenever we make a decision. Whenever we observe a particular x,
– P(error|x) = P(ω1|x) if we decide ω2
– P(error|x) = P(ω2|x) if we decide ω1
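A minimal sketch of this minimum-error decision rule; the two posterior values are hypothetical placeholders:

# Minimum-error-rate decision: pick the class with the larger posterior.
posteriors = {"w1": 0.7, "w2": 0.3}            # hypothetical P(w1|x), P(w2|x)

decision = max(posteriors, key=posteriors.get)  # decide w1 here
p_error = min(posteriors.values())              # P(error|x) = posterior of the rejected class
print(decision, p_error)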
• We shall now formalize the ideas just considered, and generalize them in four ways:
• by allowing the use of more than one feature
– x is now a feature vector
• by allowing more than two states of nature
– this provides us with a useful generalization for a small notational expense
• by allowing actions other than merely deciding the state of nature
– this primarily allows the possibility of rejection, i.e., of refusing to make a decision in close cases; this is a useful option if being indecisive is not too costly
• by introducing a loss function more general than the probability of error
– the loss function states exactly how costly each action is, and is used to convert a probability determination into a decision
• Let ω1,...,ωc be the finite set of c states of
nature (“categories”)
• α1,...,αa be the finite set of a possible actions.
• The loss function λ(αi|ωj) describes the loss
incurred for taking action αi when the state of
nature is ωj
• Let the feature vector x be a d-component vector-valued
random variable, and let p(x|ωj) be the state conditional
probability density function for x — the probability density
function for x conditioned on ωj being the true state of
nature. As before, P(ωj) describes the prior probability that
nature is in state ωj. Then the posterior probability P(ωj|x)
can be computed from p(x|ωj) by Bayes’ formula:
• P(ωj|x) = p(x|ωj)P(ωj) / p(x), where p(x) = Σ (j = 1 to c) p(x|ωj)P(ωj)
• The expected loss of taking action αi when the observation is x (averaging over the possible true classes ωj) is
– R(αi|x) = Σ (j = 1 to c) λ(αi|ωj) P(ωj|x)
In decision-theoretic terminology, an expected loss is called a risk, and R(αi|x) is called the conditional risk.
The problem is to find the decision rule that, for each x, selects the action αi minimizing the conditional risk.
Two-Category Classification
• λij = λ(αi|ωj) is the loss incurred for deciding ωi when the true class is ωj. The conditional risks are
– R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
– R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)
The fundamental rule is to decide ω1 if R(α1|x) < R(α2|x).
In terms of posterior probabilities: decide ω1 if (λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x).
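A short sketch of choosing the action with minimum conditional risk; the loss matrix and posteriors below are illustrative assumptions, not values from the slides:

# Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | w_j) * P(w_j | x)
loss = [[0.0, 2.0],     # lambda(alpha1|w1), lambda(alpha1|w2)  (assumed values)
        [1.0, 0.0]]     # lambda(alpha2|w1), lambda(alpha2|w2)
posterior = [0.6, 0.4]   # hypothetical P(w1|x), P(w2|x)

risks = [sum(l_ij * p_j for l_ij, p_j in zip(row, posterior)) for row in loss]
best_action = risks.index(min(risks))   # pick the action with the smallest conditional risk
print(risks, "-> take action alpha%d" % (best_action + 1))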
Outlook:      Sunny 2 yes / 3 no   Overcast 4 yes / 0 no   Rainy 3 yes / 2 no
              P(Sunny|Yes)=2/9, P(Sunny|No)=3/5; P(Overcast|Yes)=4/9, P(Overcast|No)=0/5; P(Rainy|Yes)=3/9, P(Rainy|No)=2/5
Temperature:  Yes values 83, 70, 68, 64, 69, 75, 75, 72, 81 (mean 73, std dev 6.2)
              No values 85, 80, 65, 72, 71 (mean 74.6, std dev 7.9)
Humidity:     Yes values 86, 96, 80, 65, 70, 80, 70, 90, 75 (mean 79.1, std dev 10.2)
              No values 85, 90, 70, 95, 91 (mean 86.2, std dev 9.7)
Windy:        False 6 yes / 2 no   True 3 yes / 3 no
              P(False|Yes)=6/9, P(False|No)=2/5; P(True|Yes)=3/9, P(True|No)=3/5
Play:         9 yes / 5 no; P(Yes)=9/14, P(No)=5/14
• For example, for a new day with outlook = sunny, temperature = 66, humidity = 90 and windy = true:

f(temperature = 66 | Yes) = 1/(√(2π) · 6.2) · e^(−(66 − 73)² / (2 · 6.2²)) ≈ 0.0340
f(temperature = 66 | No) ≈ 0.0279;  f(humidity = 90 | Yes) ≈ 0.0221;  f(humidity = 90 | No) ≈ 0.0381

• Likelihood of Yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 ≈ 0.000036
• Likelihood of No = 3/5 × 0.0279 × 0.0381 × 3/5 × 5/14 ≈ 0.000136
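A sketch of this numeric-attribute Naive Bayes computation in Python, assuming the new day above (sunny, temperature 66, humidity 90, windy) and the means and standard deviations from the table:

import math

def gaussian(x, mean, std):
    # Normal density f(x) = 1/(sqrt(2*pi)*std) * exp(-(x-mean)^2 / (2*std^2))
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

f_temp_yes = gaussian(66, 73.0, 6.2)    # ~0.0340
f_temp_no  = gaussian(66, 74.6, 7.9)    # ~0.0279
f_hum_yes  = gaussian(90, 79.1, 10.2)   # ~0.0221
f_hum_no   = gaussian(90, 86.2, 9.7)    # ~0.0381

like_yes = (2/9) * f_temp_yes * f_hum_yes * (3/9) * (9/14)   # ~0.000036
like_no  = (3/5) * f_temp_no  * f_hum_no  * (3/5) * (5/14)   # ~0.000136
print(like_yes, like_no, like_yes / (like_yes + like_no))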
Decision Tree
• Decision trees are used for both classification and regression problems.
• The decision tree is among the most popular tools for classification and prediction.
• A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
• Decision trees are simple to understand and make it easy to interpret the data.
A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome (a categorical or continuous value).
E(Humidity = High) = −3/7·log(3/7) − 4/7·log(4/7) = x
E(Humidity = Normal) = −6/7·log(6/7) − 1/7·log(1/7) = y
Average entropy of Humidity: z = 7/14·x + 7/14·y
Gain(Humidity) = 0.94 − z = p
E(Temperature = Hot) = −2/4·log(2/4) − 2/4·log(2/4) = s
E(Temperature = Mild) = −4/6·log(4/6) − 2/6·log(2/6) = t
E(Temperature = Cool) = −3/4·log(3/4) − 1/4·log(1/4) = u
Average entropy of Temperature: Q = 4/14·s + 6/14·t + 4/14·u
Gain(Temperature) = 0.94 − Q
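These entropies and gains can be checked with a short Python sketch (class counts per attribute value are taken from the play-tennis tables used throughout this section):

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

dataset_entropy = entropy([9, 5])                    # ~0.94

# Humidity: High has 3 yes / 4 no, Normal has 6 yes / 1 no
z = 7/14 * entropy([3, 4]) + 7/14 * entropy([6, 1])
gain_humidity = dataset_entropy - z                  # ~0.15

# Temperature: Hot 2/2, Mild 4/2, Cool 3/1
q = 4/14 * entropy([2, 2]) + 6/14 * entropy([4, 2]) + 4/14 * entropy([3, 1])
gain_temp = dataset_entropy - q                      # ~0.03
print(gain_humidity, gain_temp)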
[Figure: decision tree for the play-tennis data. The root (9 yes / 5 no) splits on Outlook. Sunny (2 yes / 3 no) splits on Humidity: High → 0 yes / 3 no, Normal → 2 yes / 0 no. Overcast → 4 yes / 0 no. Rainy (3 yes / 2 no) splits on Wind: True → 0 yes / 2 no, False → 3 yes / 0 no.]
ID3
• ID3 (Quinlan, 1986) is a basic algorithm for learning decision trees.
• Given a training set of examples, the algorithm builds a decision tree by searching the space of possible decision trees.
3) Choose Debt Level as the root of the subtree. 3a) All examples in this branch are in the same class, MODERATE. Return a leaf node.
[Figure: Debt Level subtree with branches Low and High; branch 3a (example 3) becomes a MODERATE leaf, branch 3b (examples 2, 12, 14) is expanded further.]
ID3 Example (II)
4) Choose Credit History as the root of the subtree. 4a-4c) All examples are in the same class. Return leaf nodes.
[Figure: Credit History subtree with branches Bad, Unknown, Good; branch 4a (examples 5, 6) → LOW leaf, branch 4b (example 8) → MODERATE leaf, branch 4c (examples 9, 10, 13) → LOW leaf.]
ID3 Example (III)
Attach the subtrees at the appropriate places.
[Figure: final tree. Root: Income, with branches Low, Medium, High. Low → HIGH leaf; Medium → Debt Level subtree (branches Low, High); High → Credit History subtree (branches Bad, Unknown, Good).]
[Figure: a decision tree over attributes A1 (values M, F) and A2 (values R, G, B), and the corresponding partition of the feature space.]
Decision surfaces are axis-aligned hyper-rectangles.
Non-Uniqueness

Tid   Refund   Marital Status   Taxable Income   Class
 1     Yes        Single            125K           No
 2     No         Married           100K           No
 3     No         Single             70K           No
 4     Yes        Married           120K           No
 5     No         Divorced           95K           Yes
 6     No         Married            60K           No
 7     Yes        Divorced          220K           No
 8     No         Single             85K           Yes
 9     No         Married            75K           No
10     No         Single             90K           Yes

[Figure: an alternative tree consistent with the same data. Root: MarSt. Married → NO. Single, Divorced → Refund: Yes → NO; No → TaxInc: < 80K → NO, > 80K → YES.]
ID3’s Question
More precisely:
Given a training set, which of all of the decision
trees consistent with that training set has the
greatest likelihood of correctly classifying
unseen instances of the population?
ID3’s (Approximate) Bias
• ID3 (and family) prefers simpler decision trees
• Occam’s Razor Principle:
– “It is vain to do with more what can be done with less... Entities should not be multiplied beyond necessity.”
• Intuitively:
– Always accept the simplest answer that fits
the data, avoid unnecessary constraints
– Simpler trees are more general
ID3’s Question Revisited
Practically:
Given a training set, how do we select attributes
so that the resulting tree is as small as possible,
i.e. contains as few attributes as possible?
ID3
• Start with a training data set, which we will call S. It should contain attributes and a class label for each example.
• Determine the best attribute in the data set (the definition of "best attribute" is given below).
• Split S into subsets, one for each possible value of the best attribute.
• Make a decision tree node that contains the best attribute.
• Recursively generate new decision trees from the subsets of data created in step 3, until a stage is reached where the data cannot be split further. Represent the class as a leaf node.
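A compact, illustrative ID3 sketch in Python; the data representation (a list of dicts with a "class" key) and the function names are our own assumptions:

import math
from collections import Counter

def entropy(rows):
    counts = Counter(r["class"] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr):
    # Entropy of the parent minus the weighted entropy of the children.
    total = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attributes):
    classes = set(r["class"] for r in rows)
    if len(classes) == 1:                 # pure node -> leaf
        return classes.pop()
    if not attributes:                    # no attributes left -> majority-class leaf
        return Counter(r["class"] for r in rows).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining)
    return tree

On the 14 play-tennis rows with attributes Outlook, Temp, Humidity and Windy (and a "class" key of yes/no), this sketch should reproduce the tree shown earlier, with Outlook at the root.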
The left split has less information gain because the data is split into two subsets that each contain almost equal numbers of '+' and '−' examples, while the split on the right puts more '+' examples in one subset and more '−' examples in the other. To calculate the best attribute we use information gain.
Problem with Continuous Data
• Continuous attributes must first be discretized: for example, values from 10–20 can form one class (bin), values from 20–30 the next, and so on, so that each particular range is mapped to a particular discrete value.
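A minimal sketch of this kind of binning; the bin boundaries are arbitrary assumptions:

def discretize(value, boundaries=(10, 20, 30, 40)):
    # Map a continuous value to a bin label such as "10-20" or "20-30".
    for low, high in zip(boundaries, boundaries[1:]):
        if low <= value < high:
            return f"{low}-{high}"
    return "other"

print(discretize(14), discretize(27))   # "10-20" "20-30"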
CART
• Classification and Regression Tree (CART)
• The Gini index is the splitting metric CART uses for classification tasks.
• It is based on the sum of squared class probabilities:
Gini = 1 − Σ (Pi)², for i = 1 to the number of classes
Gini(Outlook=Sunny) = 1 − (2/5)² − (3/5)² = 1 − 0.16 − 0.36 = 0.48
Gini(Outlook=Overcast) = 1 − (4/4)² − (0/4)² = 0
Gini(Outlook=Rain) = 1 − (3/5)² − (2/5)² = 1 − 0.36 − 0.16 = 0.48
Then, we calculate the weighted sum of Gini indexes for the Outlook feature:
Gini(Outlook) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48 = 0.171 + 0 + 0.171 = 0.342
Outlook      Yes   No   Number of instances
Sunny         2     3          5
Overcast      4     0          4
Rain          3     2          5

Temperature  Yes   No   Number of instances
Hot           2     2          4
Cool          3     1          4
Mild          4     2          6
Gini(Temp=Hot) = 1 − (2/4)² − (2/4)² = 0.5
Gini(Temp=Cool) = 1 − (3/4)² − (1/4)² = 1 − 0.5625 − 0.0625 = 0.375
Gini(Temp=Mild) = 1 − (4/6)² − (2/6)² = 1 − 0.444 − 0.111 = 0.445
We calculate the weighted sum of Gini indexes for the Temperature feature:
Gini(Temp) = (4/14) × 0.5 + (4/14) × 0.375 + (6/14) × 0.445 = 0.142 + 0.107 + 0.190 = 0.439
Humidity   Yes   No   Number of instances
High        3     4          7
Normal      6     1          7

Wind       Yes   No   Number of instances
Weak        6     2          8
Strong      3     3          6

Humidity
Humidity is a binary feature; it can be High or Normal.
Gini(Humidity=High) = 1 − (3/7)² − (4/7)² = 1 − 0.183 − 0.326 = 0.489
Gini(Humidity=Normal) = 1 − (6/7)² − (1/7)² = 1 − 0.734 − 0.020 = 0.244
The weighted sum for the Humidity feature:
Gini(Humidity) = (7/14) × 0.489 + (7/14) × 0.244 = 0.367

Wind
Gini(Wind=Weak) = 1 − (6/8)² − (2/8)² = 1 − 0.5625 − 0.0625 = 0.375
Gini(Wind=Strong) = 1 − (3/6)² − (3/6)² = 1 − 0.25 − 0.25 = 0.5
Gini(Wind) = (8/14) × 0.375 + (6/14) × 0.5 = 0.428
Feature Gini index
Outlook 0.342
Temperature 0.439
Humidity 0.367
Wind 0.428
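The Gini indexes in the table above can be reproduced with a short Python sketch (the per-value yes/no counts come from the frequency tables; the function names are our own):

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(value_counts):
    # value_counts: list of (yes, no) pairs, one pair per attribute value
    total = sum(sum(pair) for pair in value_counts)
    return sum(sum(pair) / total * gini(pair) for pair in value_counts)

features = {
    "Outlook":     [(2, 3), (4, 0), (3, 2)],
    "Temperature": [(2, 2), (3, 1), (4, 2)],
    "Humidity":    [(3, 4), (6, 1)],
    "Wind":        [(6, 2), (3, 3)],
}
for name, counts in features.items():
    print(name, round(weighted_gini(counts), 3))
# Outlook has the smallest weighted Gini (~0.343; 0.342 above differs only by rounding),
# so it becomes the root of the tree.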
Sub-data set for Outlook = Sunny (columns: Day, Outlook, Temp., Humidity, Wind, Decision):

Temp.     Yes   No   Number of instances
Hot        0     2          2
Cool       1     0          1
Mild       1     1          2

Humidity  Yes   No   Number of instances
High       0     3          3
Normal     2     0          2

Wind      Yes   No   Number of instances
Weak       1     2          3
Strong     1     1          2
Gini of Humidity for sunny outlook:
Gini(Outlook=Sunny and Humidity=High) = 1 − (0/3)² − (3/3)² = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 − (2/2)² − (0/2)² = 0
Gini(Outlook=Sunny and Humidity) = (3/5) × 0 + (2/5) × 0 = 0

Gini of Wind for sunny outlook:
Gini(Outlook=Sunny and Wind=Weak) = 1 − (1/3)² − (2/3)² = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 − (1/2)² − (1/2)² = 0.5
Gini(Outlook=Sunny and Wind) = (3/5) × 0.444 + (2/5) × 0.5 = 0.266 + 0.2 = 0.466
Decision for sunny outlook
We have calculated the Gini index scores for each feature when the outlook is sunny. The winner is Humidity because it has the lowest value. We put a Humidity check at the extension of the sunny-outlook branch.

Feature        Gini index
Temperature      0.2
Humidity         0
Wind             0.466
Decision for rain outlook
The winner for the rain outlook is the Wind feature because it has the minimum Gini index score among the features. Put the Wind feature on the rain-outlook branch and examine the new sub-data sets.

Feature        Gini index
Temperature      0.466
Humidity         0.466
Wind             0
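As a usage example, the weighted_gini helper from the earlier sketch can be applied to the rain-outlook sub-data set; the per-value counts below are assumptions read from the standard play-tennis data:

# Rain-outlook subset (3 yes / 2 no)
rain_features = {
    "Temperature": [(2, 1), (1, 1)],   # Mild, Cool
    "Humidity":    [(1, 1), (2, 1)],   # High, Normal
    "Wind":        [(3, 0), (0, 2)],   # Weak, Strong
}
for name, counts in rain_features.items():
    print(name, round(weighted_gini(counts), 3))
# Wind gives Gini 0, so the rain branch splits on Wind.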
Advantages of Decision Trees
• Easy to use and understand.
• Can handle both categorical and numerical data.
• Resistant to outliers, and hence require little data preprocessing.
• New features can be easily added.
• Can be used to build larger classifiers with ensemble methods.
Disadvantages
• Prone to overfitting.
• Require some kind of measurement as to how
well they are doing.
• Need to be careful with parameter tuning.
• Can create biased learned trees if some
classes dominate.