ML-Lec3
Quantification of ambiguous and imprecise terms on a time-frequency scale

Ray Simpson (1944)                      Milton Hakel (1968)
Term                     Mean value     Term                     Mean value
Always                       99         Always                      100
Very often                   88         Very often                   87
Usually                      85         Usually                      79
Often                        78         Often                        74
Generally                    78         Rather often                 74
Frequently                   73         Frequently                   72
Rather often                 65         Generally                    72
About as often as not        50         About as often as not        50
Now and then                 20         Now and then                 34
Sometimes                    20         Sometimes                    29
Occasionally                 20         Occasionally                 28
Once in a while              15         Once in a while              22
Not often                    13         Not often                    16
Usually not                  10         Usually not                  16
Seldom                       10         Seldom                        9
Hardly ever                   7         Hardly ever                   8
Very seldom                   6         Very seldom                   7
Rarely                        5         Rarely                        5
Almost never                  3         Almost never                  2
Never                         0         Never                         0

• Unknown data. When the data is incomplete or missing, the only solution is to accept the value "unknown" and proceed to an approximate reasoning with this value.

• Combining the views of different experts. Large expert systems usually combine the knowledge and expertise of a number of experts. Unfortunately, experts often have contradictory opinions and produce conflicting rules. To resolve the conflict, the knowledge engineer has to attach a weight to each expert and then calculate the composite conclusion. But no systematic method exists to obtain these weights.
In a single throw, s = f = 1, and therefore the probability of getting a head (or a tail) is 0.5.

The conditional probability of event A occurring given that event B has occurred:

$$p(A \mid B) = \frac{\text{the number of times } A \text{ and } B \text{ can occur}}{\text{the number of times } B \text{ can occur}}$$

Similarly, the conditional probability of event B occurring given that event A has occurred:

$$p(B \mid A) = \frac{p(B \cap A)}{p(A)}$$
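As a small illustration of the counting definition, the Python sketch below estimates p(A|B) and p(B|A) from a hypothetical list of joint observations; the observation data is made up purely for illustration.

```python
# Estimate conditional probabilities by counting, per the definitions above.
# The joint observations of events A and B below are illustrative placeholders.
observations = [
    {"A": True,  "B": True},
    {"A": False, "B": True},
    {"A": True,  "B": False},
    {"A": True,  "B": False},
    {"A": True,  "B": True},
]

n_A  = sum(o["A"] for o in observations)              # times A occurs
n_B  = sum(o["B"] for o in observations)              # times B occurs
n_AB = sum(o["A"] and o["B"] for o in observations)   # times A and B occur together

print(n_AB / n_B)   # p(A|B) = 2/3
print(n_AB / n_A)   # p(B|A) = 2/4 = 0.5
```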
Bayesian rule

$$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$$

where:
p(A|B) is the conditional probability that event A occurs given that event B has occurred;
p(B|A) is the conditional probability that event B occurs given that event A has occurred;
p(A) is the probability of event A occurring;
p(B) is the probability of event B occurring.

• The concept of conditional probability considered that event A was dependent upon event B.

• This principle can be extended to event A being dependent on a number of mutually exclusive events B1, B2, ..., Bn. We know:

$$p(A \mid B) = \frac{p(A \cap B)}{p(B)}$$

hence

$$p(A \cap B_1) = p(A \mid B_1)\, p(B_1)$$
$$p(A \cap B_2) = p(A \mid B_2)\, p(B_2)$$
$$\vdots$$
$$p(A \cap B_n) = p(A \mid B_n)\, p(B_n)$$
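A minimal sketch of the Bayesian rule itself; the three input probabilities are made-up values chosen only to show the arithmetic.

```python
# Bayesian rule: p(A|B) = p(B|A) * p(A) / p(B)
# Illustrative numbers, not taken from the lecture.
p_A = 0.2            # probability of A
p_B = 0.3            # probability of B
p_B_given_A = 0.6    # conditional probability of B given A

p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)   # 0.6 * 0.2 / 0.3 = 0.4
```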
[Figure: event A intersecting the mutually exclusive events B1, B2, B3 and B4]

$$\sum_{i=1}^{n} p(A \cap B_i) = p(A)$$

Therefore

$$p(A) = \sum_{i=1}^{n} p(A \mid B_i)\, p(B_i)$$
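A brief sketch of this total-probability sum, assuming illustrative values for p(Bi) and p(A|Bi).

```python
# p(A) = sum over i of p(A|Bi) * p(Bi), for mutually exclusive, exhaustive Bi.
# The values below are illustrative placeholders.
p_B = [0.5, 0.3, 0.2]            # p(B1), p(B2), p(B3); they sum to 1
p_A_given_B = [0.1, 0.4, 0.8]    # p(A|B1), p(A|B2), p(A|B3)

p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))
print(p_A)   # 0.1*0.5 + 0.4*0.3 + 0.8*0.2 = 0.33
```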
If the occurrence of event A depends on only two mutually exclusive events, B and NOT B, we obtain:

$$p(A) = p(A \mid B)\, p(B) + p(A \mid \neg B)\, p(\neg B)$$

Similarly,

$$p(B) = p(B \mid A)\, p(A) + p(B \mid \neg A)\, p(\neg A)$$

Substituting this equation into the Bayesian rule yields:

$$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B \mid A)\, p(A) + p(B \mid \neg A)\, p(\neg A)}$$

3. Bayesian reasoning

Suppose all rules in the knowledge base are represented in the following form:

IF E is true
THEN H is true {with probability p}

This rule implies that if event E occurs, then the probability that event H will occur is p.

In expert systems, H usually represents a hypothesis and E denotes evidence to support this hypothesis.
The Bayesian rule expressed in terms of hypotheses and evidence looks like this:

$$p(H \mid E) = \frac{p(E \mid H)\, p(H)}{p(E \mid H)\, p(H) + p(E \mid \neg H)\, p(\neg H)}$$

where:
p(H) is the prior probability of hypothesis H being true;
p(E|H) is the probability that hypothesis H being true will result in evidence E;
p(¬H) is the prior probability of hypothesis H being false;
p(E|¬H) is the probability of finding evidence E even when hypothesis H is false.

• In expert systems, the probabilities required to solve a problem are provided by experts. An expert determines the prior probabilities for possible hypotheses p(H) and p(¬H), and also the conditional probabilities for observing evidence E if hypothesis H is true, p(E|H), and if hypothesis H is false, p(E|¬H).

• Users provide information about the evidence observed and the expert system computes p(H|E) for hypothesis H in light of the user-supplied evidence E. Probability p(H|E) is called the posterior probability of hypothesis H upon observing evidence E.
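A minimal sketch of this posterior computation, assuming the expert has already supplied the prior and the two likelihoods; the numbers are illustrative, not from the lecture.

```python
# Posterior p(H|E) from an expert-supplied prior and likelihoods.
# Illustrative values.
p_H = 0.6                 # prior probability that hypothesis H is true
p_not_H = 1.0 - p_H       # prior probability that H is false
p_E_given_H = 0.7         # probability of observing evidence E if H is true
p_E_given_not_H = 0.2     # probability of observing E if H is false

p_H_given_E = (p_E_given_H * p_H) / (p_E_given_H * p_H + p_E_given_not_H * p_not_H)
print(round(p_H_given_E, 3))   # 0.42 / (0.42 + 0.08) = 0.84
```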
• We can take into account both multiple hypotheses H1, H2, ..., Hm and multiple evidences E1, E2, ..., En. The hypotheses as well as the evidences must be mutually exclusive and exhaustive.

• Single evidence E and multiple hypotheses follow:

$$p(H_i \mid E) = \frac{p(E \mid H_i)\, p(H_i)}{\sum_{k=1}^{m} p(E \mid H_k)\, p(H_k)}$$

• Multiple evidences and multiple hypotheses follow:

$$p(H_i \mid E_1 E_2 \ldots E_n) = \frac{p(E_1 E_2 \ldots E_n \mid H_i)\, p(H_i)}{\sum_{k=1}^{m} p(E_1 E_2 \ldots E_n \mid H_k)\, p(H_k)}$$

• This requires obtaining the conditional probabilities of all possible combinations of evidences for all hypotheses, and thus places an enormous burden on the expert.

• Therefore, in expert systems, conditional independence among different evidences is assumed. Thus, instead of the unworkable equation, we attain:

$$p(H_i \mid E_1 E_2 \ldots E_n) = \frac{p(E_1 \mid H_i)\, p(E_2 \mid H_i) \ldots p(E_n \mid H_i)\, p(H_i)}{\sum_{k=1}^{m} p(E_1 \mid H_k)\, p(E_2 \mid H_k) \ldots p(E_n \mid H_k)\, p(H_k)}$$
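This last formula can be written as a short function. The sketch below assumes the priors and per-evidence likelihoods are given as plain dictionaries; the function name and argument layout are just one possible choice.

```python
# Posterior over hypotheses given conditionally independent evidences.
def posterior(priors, likelihoods, observed):
    """priors: {hypothesis: p(Hi)}
    likelihoods: {hypothesis: {evidence: p(E|Hi)}}
    observed: iterable of observed evidences."""
    numerators = {}
    for h, prior in priors.items():
        score = prior
        for e in observed:
            score *= likelihoods[h][e]        # multiply in p(E|Hi) for each observed E
        numerators[h] = score
    total = sum(numerators.values())          # denominator: sum over all hypotheses
    return {h: s / total for h, s in numerators.items()}
```

The worked example on the following slides applies exactly this computation with three hypotheses and the evidences E3 and E1.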
Suppose an expert, given three conditionally independent evidences E1, E2 and E3, creates three mutually exclusive and exhaustive hypotheses H1, H2 and H3, and provides prior probabilities for these hypotheses – p(H1), p(H2) and p(H3), respectively. The expert also determines the conditional probabilities of observing each evidence for all possible hypotheses.

Hypothesis     i = 1    i = 2    i = 3
p(Hi)          0.40     0.35     0.25
p(E1|Hi)       0.3      0.8      0.5
p(E2|Hi)       0.9      0.0      0.7
p(E3|Hi)       0.6      0.7      0.9

Assume that we first observe evidence E3. The expert system computes the posterior probabilities for all hypotheses as
$$p(H_i \mid E_3) = \frac{p(E_3 \mid H_i)\, p(H_i)}{\sum_{k=1}^{3} p(E_3 \mid H_k)\, p(H_k)}, \quad i = 1, 2, 3$$

Thus,

$$p(H_1 \mid E_3) = \frac{0.6 \times 0.40}{0.6 \times 0.40 + 0.7 \times 0.35 + 0.9 \times 0.25} = 0.34$$

$$p(H_2 \mid E_3) = \frac{0.7 \times 0.35}{0.6 \times 0.40 + 0.7 \times 0.35 + 0.9 \times 0.25} = 0.34$$

$$p(H_3 \mid E_3) = \frac{0.9 \times 0.25}{0.6 \times 0.40 + 0.7 \times 0.35 + 0.9 \times 0.25} = 0.32$$

After evidence E3 is observed, belief in hypothesis H1 decreases and becomes equal to belief in hypothesis H2. Belief in hypothesis H3 also increases and nearly reaches the beliefs in hypotheses H1 and H2.

Suppose now that we observe evidence E1. The posterior probabilities are calculated as

$$p(H_i \mid E_1 E_3) = \frac{p(E_1 \mid H_i)\, p(E_3 \mid H_i)\, p(H_i)}{\sum_{k=1}^{3} p(E_1 \mid H_k)\, p(E_3 \mid H_k)\, p(H_k)}, \quad i = 1, 2, 3$$

Hence,

$$p(H_1 \mid E_1 E_3) = \frac{0.3 \times 0.6 \times 0.40}{0.3 \times 0.6 \times 0.40 + 0.8 \times 0.7 \times 0.35 + 0.5 \times 0.9 \times 0.25} = 0.19$$

$$p(H_2 \mid E_1 E_3) = \frac{0.8 \times 0.7 \times 0.35}{0.3 \times 0.6 \times 0.40 + 0.8 \times 0.7 \times 0.35 + 0.5 \times 0.9 \times 0.25} = 0.52$$

$$p(H_3 \mid E_1 E_3) = \frac{0.5 \times 0.9 \times 0.25}{0.3 \times 0.6 \times 0.40 + 0.8 \times 0.7 \times 0.35 + 0.5 \times 0.9 \times 0.25} = 0.29$$

Hypothesis H2 has now become the most likely one.
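The same calculation can be checked in a few lines of Python, using the probabilities from the expert's table above; the printed values agree with the slide up to rounding.

```python
# Reproduce the worked example from the expert's table.
priors = [0.40, 0.35, 0.25]    # p(H1), p(H2), p(H3)
p_E1 = [0.3, 0.8, 0.5]         # p(E1|Hi)
p_E3 = [0.6, 0.7, 0.9]         # p(E3|Hi)

# After observing E3 alone.
num_E3 = [l * p for l, p in zip(p_E3, priors)]
post_E3 = [n / sum(num_E3) for n in num_E3]
print([round(p, 2) for p in post_E3])      # [0.34, 0.35, 0.32]

# After additionally observing E1 (conditionally independent evidences).
num_E1E3 = [l1 * l3 * p for l1, l3, p in zip(p_E1, p_E3, priors)]
post_E1E3 = [n / sum(num_E1E3) for n in num_E1E3]
print([round(p, 2) for p in post_E1E3])    # [0.19, 0.52, 0.3]
```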
• For a set of all events E, we compute the maximum a posteriori hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid E)$$
$$h_{MAP} = \arg\max_{h \in H} \frac{P(E \mid h)\, P(h)}{P(E)}$$
$$h_{MAP} = \arg\max_{h \in H} P(E \mid h)\, P(h)$$

• We can omit P(E) since it is constant and independent of the hypothesis.
• This is the only difference between MAP and Bayesian inference.

• Suppose that a set of training examples includes records with conjunctive attribute values (a1, a2, .., an) and the target function is based on a finite set of classes V.

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, .., a_n)$$
$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, .., a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, .., a_n)}$$
$$v_{MAP} = \arg\max_{v_j \in V} P(a_1, a_2, .., a_n \mid v_j)\, P(v_j)$$

• So, we assume that attribute values are conditionally independent given the target value:

$$P(a_1, a_2, .., a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

which gives the Naïve Bayes classifier

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
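A minimal sketch of the MAP rule, reusing the priors p(Hi) and the p(E3|Hi) column from the earlier example table as the likelihoods.

```python
# MAP hypothesis: pick h maximizing P(E|h) * P(h); P(E) can be ignored.
priors = {"H1": 0.40, "H2": 0.35, "H3": 0.25}       # P(h), from the example table
likelihood_E3 = {"H1": 0.6, "H2": 0.7, "H3": 0.9}   # P(E3|h), from the example table

h_map = max(priors, key=lambda h: likelihood_E3[h] * priors[h])
print(h_map)   # "H2": 0.7*0.35 = 0.245 beats 0.24 and 0.225
```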
Estimation

• Estimating P(ai|vj) instead of P(a1, a2, .., an|vj) greatly reduces the number of parameters.
• The learning step in Naïve Bayes consists of estimating P(ai|vj) and P(vj) based on the frequencies in the training data. (The priors are embedded in the training data.)
• There is no explicit search during training.

Example

Day   Outlook    Temp   Humidity   Wind     Play Tennis
D1    Sunny      Hot    High       Weak     No
D2    Sunny      Hot    High       Strong   No
D3    Overcast   Hot    High       Weak     Yes
D4    Rain       Mild   High       Weak     Yes
D5    Rain       Cool   Normal     Weak     Yes
D6    Rain       Cool   Normal     Strong   No
D7    Overcast   Cool   Normal     Strong   Yes

$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j)\, P(outlook = sunny \mid v_j)\, P(temp = cool \mid v_j)\, P(humidity = high \mid v_j)\, P(wind = strong \mid v_j)$$
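A sketch of this Naïve Bayes computation using only the training rows shown above (D1–D7); with so few rows the frequency estimates are crude, so the printed result is purely illustrative.

```python
# Naive Bayes estimate of PlayTennis for (sunny, cool, high, strong),
# using frequency counts over the training rows listed above (D1-D7).
rows = [
    # (Outlook, Temp, Humidity, Wind, PlayTennis)
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
]
query = ("Sunny", "Cool", "High", "Strong")   # attribute values a1..a4

scores = {}
for v in ("Yes", "No"):
    v_rows = [r for r in rows if r[4] == v]
    score = len(v_rows) / len(rows)           # P(vj) from class frequency
    for i, a in enumerate(query):             # multiply P(ai|vj) estimated by frequency
        score *= sum(r[i] == a for r in v_rows) / len(v_rows)
    scores[v] = score

# With these seven rows the answer is "No"; P(sunny|Yes) = 0 zeroes out the "Yes" score.
print(scores, max(scores, key=scores.get))
```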