ML-Lec3

The document discusses the concept of uncertainty, defining it as the lack of exact knowledge necessary for reliable conclusions. It covers basic probability theory, Bayesian reasoning, and the challenges of quantifying uncertain knowledge, including ambiguous language and unknown data. The document also explains how expert systems utilize probabilities to manage uncertainty and make informed decisions based on evidence.


Lecture 3
From Uncertainty Management to Naïve Bayes

1. Introduction, or what is uncertainty?
2. Basic probability theory
3. Bayesian reasoning
4. Naïve Bayes Classifiers

1. Introduction, or what is uncertainty?

• Information can be incomplete, inconsistent, uncertain, or all three. In other words, information is often unsuitable for solving a problem.

• Uncertainty is defined as the lack of the exact knowledge that would enable us to reach a perfectly reliable conclusion. Classical logic permits only exact reasoning. It assumes that perfect knowledge always exists and the law of the excluded middle can always be applied:

      IF A is true                IF A is false
      THEN A is not false         THEN A is not true

Sources of uncertain knowledge

• Weak implications. Domain experts and knowledge engineers have the painful task of establishing concrete correlations between the IF (condition) and THEN (action) parts of the rules. Therefore, expert systems need to have the ability to handle vague associations, for example by accepting the degree of correlations as numerical certainty factors.

• Imprecise language. Our natural language is ambiguous and imprecise. We describe facts with such terms as often and sometimes, frequently and hardly ever. As a result, it can be difficult to express knowledge in the precise IF-THEN form of production rules. However, if the meaning of the facts is quantified, it can be used in expert systems. In 1944, Ray Simpson asked 355 high school and college students to place 20 terms like often on a scale between 1 and 100. In 1968, Milton Hakel repeated this experiment.
Quantification of ambiguous and imprecise terms on a time-frequency scale

Term                      Ray Simpson (1944)   Milton Hakel (1968)
                          Mean value           Mean value
Always                    99                   100
Very often                88                   87
Usually                   85                   79
Often                     78                   74
Generally                 78                   72
Frequently                73                   72
Rather often              65                   74
About as often as not     50                   50
Now and then              20                   34
Sometimes                 20                   29
Occasionally              20                   28
Once in a while           15                   22
Not often                 13                   16
Usually not               10                   16
Seldom                    10                   9
Hardly ever               7                    8
Very seldom               6                    7
Rarely                    5                    5
Almost never              3                    2
Never                     0                    0

• Unknown data. When the data is incomplete or missing, the only solution is to accept the value "unknown" and proceed to an approximate reasoning with this value.

• Combining the views of different experts. Large expert systems usually combine the knowledge and expertise of a number of experts. Unfortunately, experts often have contradictory opinions and produce conflicting rules. To resolve the conflict, the knowledge engineer has to attach a weight to each expert and then calculate the composite conclusion. But no systematic method exists to obtain these weights.

2. Basic probability theory

• The concept of probability has a long history that goes back thousands of years, when words like "probably", "likely", "maybe", "perhaps" and "possibly" were introduced into spoken languages. However, the mathematical theory of probability was formulated only in the 17th century.

• The probability of an event is the proportion of cases in which the event occurs. Probability can also be defined as a scientific measure of chance.

• Probability can be expressed mathematically as a numerical index ranging from zero (an absolute impossibility) to unity (an absolute certainty).

• Most events have a probability index strictly between 0 and 1, which means that each event has at least two possible outcomes: a favourable outcome or success, and an unfavourable outcome or failure.

      P(success) = the number of successes / the number of possible outcomes

      P(failure) = the number of failures / the number of possible outcomes
• If s is the number of times success can occur, and f is the number of times failure can occur, then

      P(success) = p = s / (s + f)

      P(failure) = q = f / (s + f)

  and p + q = 1.

• If we throw a coin, the probability of getting a head will be equal to the probability of getting a tail. In a single throw, s = f = 1, and therefore the probability of getting a head (or a tail) is 0.5.

Conditional probability

• Let A be an event in the world and B be another event. Suppose that events A and B are not mutually exclusive, but occur conditionally on the occurrence of the other. The probability that event A will occur if event B occurs is called the conditional probability. Conditional probability is denoted mathematically as p(A|B), in which the vertical bar represents GIVEN and the complete probability expression is interpreted as "Conditional probability of event A occurring given that event B has occurred":

      p(A|B) = the number of times A and B can occur / the number of times B can occur
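To make the counting definitions concrete, here is a small Python sketch (not part of the original lecture; the die example is hypothetical). It computes p(A), p(B) and p(A|B) by counting outcomes of a fair six-sided die, with A = "the roll is even" and B = "the roll is greater than 3".

```python
# Illustrative sketch: probabilities as ratios of counts, using a fair
# six-sided die (hypothetical example, not from the lecture).
outcomes = [1, 2, 3, 4, 5, 6]

A = {o for o in outcomes if o % 2 == 0}   # event A: roll is even   -> {2, 4, 6}
B = {o for o in outcomes if o > 3}        # event B: roll is above 3 -> {4, 5, 6}

p_A = len(A) / len(outcomes)              # 3/6 = 0.5
p_B = len(B) / len(outcomes)              # 3/6 = 0.5

# p(A|B): the number of times A and B occur together / the number of times B occurs
p_A_given_B = len(A & B) / len(B)         # |{4, 6}| / 3 = 0.67

print(p_A, p_B, p_A_given_B)
```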

• The number of times A and B can occur, or the probability that both A and B will occur, is called the joint probability of A and B. It is represented mathematically as p(A ∩ B). The number of ways B can occur is the probability of B, p(B), and thus

      p(A|B) = p(A ∩ B) / p(B)

• Similarly, the conditional probability of event B occurring given that event A has occurred equals

      p(B|A) = p(B ∩ A) / p(A)

Hence,

      p(A ∩ B) = p(A|B) × p(B)   and   p(B ∩ A) = p(B|A) × p(A)

Since p(A ∩ B) = p(B ∩ A), substituting the second equation into the equation

      p(A|B) = p(A ∩ B) / p(B)

yields the Bayesian rule:
Bayesian rule

      p(A|B) = p(B|A) × p(A) / p(B)

where:

p(A|B) is the conditional probability that event A occurs given that event B has occurred;
p(B|A) is the conditional probability of event B occurring given that event A has occurred;
p(A) is the probability of event A occurring;
p(B) is the probability of event B occurring.

• The concept of conditional probability considered so far assumed that event A was dependent upon event B.

• This principle can be extended to event A being dependent on a number of mutually exclusive events B1, B2, …, Bn. We know that

      p(A|B) = p(A ∩ B) / p(B)

  and therefore

      p(A ∩ B1) = p(A|B1) × p(B1)
      p(A ∩ B2) = p(A|B2) × p(B2)
      ...
      p(A ∩ Bn) = p(A|Bn) × p(Bn)

  Summing these equations gives

      Σ_{i=1..n} p(A ∩ Bi) = Σ_{i=1..n} p(A|Bi) × p(Bi)

The joint probability

      Σ_{i=1..n} p(A ∩ Bi) = Σ_{i=1..n} p(A|Bi) × p(Bi)

[Venn diagram: event A intersecting the mutually exclusive events B1, B2, B3 and B4]

For all events Bi,

      Σ_{i=1..n} p(A ∩ Bi) = Σ_{i=1..n} p(A|Bi) × p(Bi)

and, because the events Bi are mutually exclusive and exhaustive,

      Σ_{i=1..n} p(A ∩ Bi) = p(A)

Therefore,

      p(A) = Σ_{i=1..n} p(A|Bi) × p(Bi)
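These relationships are easy to check numerically. The following sketch (an added illustration with made-up numbers, not from the slides) takes three mutually exclusive and exhaustive events B1, B2, B3, forms the joint probabilities p(A ∩ Bi) = p(A|Bi) × p(Bi), and confirms that they sum to p(A).

```python
# Hypothetical values, for illustration only.
p_B = {"B1": 0.5, "B2": 0.3, "B3": 0.2}          # mutually exclusive and exhaustive
p_A_given_B = {"B1": 0.4, "B2": 0.1, "B3": 0.7}  # p(A|Bi)

# Joint probabilities: p(A ∩ Bi) = p(A|Bi) * p(Bi)
p_A_and_B = {b: p_A_given_B[b] * p_B[b] for b in p_B}   # 0.20, 0.03, 0.14

# Total probability: p(A) = sum over i of p(A|Bi) * p(Bi)
p_A = sum(p_A_and_B.values())                           # 0.37

# Recovering the conditionals from the joints: p(A|Bi) = p(A ∩ Bi) / p(Bi)
assert all(abs(p_A_and_B[b] / p_B[b] - p_A_given_B[b]) < 1e-12 for b in p_B)

print(p_A)
```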
If the occurrence of event A depends on only two mutually exclusive events, B and NOT B, we obtain

      p(A) = p(A|B) × p(B) + p(A|¬B) × p(¬B)

where ¬ is the logical function NOT. Similarly,

      p(B) = p(B|A) × p(A) + p(B|¬A) × p(¬A)

Substituting this equation into the Bayesian rule yields

      p(A|B) = p(B|A) × p(A) / [p(B|A) × p(A) + p(B|¬A) × p(¬A)]

3. Bayesian reasoning

Suppose all rules in the knowledge base are represented in the following form:

      IF E is true
      THEN H is true {with probability p}

This rule implies that if event E occurs, then the probability that event H will occur is p.

In expert systems, H usually represents a hypothesis and E denotes evidence to support this hypothesis.

The Bayesian rule expressed in terms of hypotheses and evidence looks like this:

      p(H|E) = p(E|H) × p(H) / [p(E|H) × p(H) + p(E|¬H) × p(¬H)]

where:

p(H) is the prior probability of hypothesis H being true;
p(E|H) is the probability that hypothesis H being true will result in evidence E;
p(¬H) is the prior probability of hypothesis H being false;
p(E|¬H) is the probability of finding evidence E even when hypothesis H is false.

• In expert systems, the probabilities required to solve a problem are provided by experts. An expert determines the prior probabilities for possible hypotheses p(H) and p(¬H), and also the conditional probabilities for observing evidence E if hypothesis H is true, p(E|H), and if hypothesis H is false, p(E|¬H).

• Users provide information about the evidence observed and the expert system computes p(H|E) for hypothesis H in light of the user-supplied evidence E. Probability p(H|E) is called the posterior probability of hypothesis H upon observing evidence E.
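As a small added illustration (the numbers below are invented, not from the slides), the expert-supplied quantities p(H), p(E|H) and p(E|¬H) are all that is needed to compute the posterior p(H|E):

```python
def posterior(p_H, p_E_given_H, p_E_given_not_H):
    """p(H|E) = p(E|H) p(H) / [p(E|H) p(H) + p(E|not H) p(not H)]."""
    p_not_H = 1.0 - p_H
    numerator = p_E_given_H * p_H
    return numerator / (numerator + p_E_given_not_H * p_not_H)

# Hypothetical expert estimates: prior p(H) = 0.3, p(E|H) = 0.8, p(E|¬H) = 0.1.
print(posterior(0.3, 0.8, 0.1))   # ≈ 0.77: observing E raises belief in H
```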
• We can take into account both multiple hypotheses H1, H2, ..., Hm and multiple evidences E1, E2, ..., En. The hypotheses as well as the evidences must be mutually exclusive and exhaustive.

• Single evidence E and multiple hypotheses follow

      p(Hi|E) = p(E|Hi) × p(Hi) / Σ_{k=1..m} p(E|Hk) × p(Hk)

• Multiple evidences and multiple hypotheses follow

      p(Hi|E1 E2 ... En) = p(E1 E2 ... En|Hi) × p(Hi) / Σ_{k=1..m} p(E1 E2 ... En|Hk) × p(Hk)

• This requires us to obtain the conditional probabilities of all possible combinations of evidences for all hypotheses, and thus places an enormous burden on the expert.

• Therefore, in expert systems, conditional independence among different evidences is assumed. Thus, instead of the unworkable equation, we obtain

      p(Hi|E1 E2 ... En) = p(E1|Hi) × p(E2|Hi) × ... × p(En|Hi) × p(Hi) / Σ_{k=1..m} p(E1|Hk) × p(E2|Hk) × ... × p(En|Hk) × p(Hk)

Ranking potentially true hypotheses

Let us consider a simple example.

Suppose an expert, given three conditionally independent evidences E1, E2 and E3, creates three mutually exclusive and exhaustive hypotheses H1, H2 and H3, and provides prior probabilities for these hypotheses – p(H1), p(H2) and p(H3), respectively. The expert also determines the conditional probabilities of observing each evidence for all possible hypotheses.

The prior and conditional probabilities

Probability     i = 1    i = 2    i = 3
p(Hi)           0.40     0.35     0.25
p(E1|Hi)        0.3      0.8      0.5
p(E2|Hi)        0.9      0.0      0.7
p(E3|Hi)        0.6      0.7      0.9

Assume that we first observe evidence E3. The expert system computes the posterior probabilities for all hypotheses as
      p(Hi|E3) = p(E3|Hi) × p(Hi) / Σ_{k=1..3} p(E3|Hk) × p(Hk),   i = 1, 2, 3

Thus,

      p(H1|E3) = 0.6 × 0.40 / (0.6 × 0.40 + 0.7 × 0.35 + 0.9 × 0.25) = 0.34

      p(H2|E3) = 0.7 × 0.35 / (0.6 × 0.40 + 0.7 × 0.35 + 0.9 × 0.25) = 0.34

      p(H3|E3) = 0.9 × 0.25 / (0.6 × 0.40 + 0.7 × 0.35 + 0.9 × 0.25) = 0.32

After evidence E3 is observed, belief in hypothesis H1 decreases and becomes equal to belief in hypothesis H2. Belief in hypothesis H3 also increases and even nearly reaches the beliefs in hypotheses H1 and H2.

Suppose now that we observe evidence E1. The posterior probabilities are calculated as

      p(Hi|E1E3) = p(E1|Hi) × p(E3|Hi) × p(Hi) / Σ_{k=1..3} p(E1|Hk) × p(E3|Hk) × p(Hk),   i = 1, 2, 3

Hence,

      p(H1|E1E3) = 0.3 × 0.6 × 0.40 / (0.3 × 0.6 × 0.40 + 0.8 × 0.7 × 0.35 + 0.5 × 0.9 × 0.25) = 0.19

      p(H2|E1E3) = 0.8 × 0.7 × 0.35 / (0.3 × 0.6 × 0.40 + 0.8 × 0.7 × 0.35 + 0.5 × 0.9 × 0.25) = 0.52

      p(H3|E1E3) = 0.5 × 0.9 × 0.25 / (0.3 × 0.6 × 0.40 + 0.8 × 0.7 × 0.35 + 0.5 × 0.9 × 0.25) = 0.29

Hypothesis H2 has now become the most likely one.
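The updating in this example is mechanical enough to script. The sketch below (added here; it simply re-implements the formulas above) uses the priors and conditional probabilities from the table and reproduces the posteriors after observing E3 and then E1.

```python
# Priors and likelihoods taken from the example table (hypotheses H1, H2, H3).
prior = [0.40, 0.35, 0.25]
likelihood = {                     # p(E | Hi), i = 1, 2, 3
    "E1": [0.3, 0.8, 0.5],
    "E2": [0.9, 0.0, 0.7],
    "E3": [0.6, 0.7, 0.9],
}

def update(posterior, evidence):
    """Multiply each hypothesis by p(evidence | Hi) and renormalise."""
    scores = [p * likelihood[evidence][i] for i, p in enumerate(posterior)]
    total = sum(scores)
    return [s / total for s in scores]

p = update(prior, "E3")
print(p)             # ≈ [0.338, 0.345, 0.317] -> the 0.34, 0.34, 0.32 above

p = update(p, "E1")
print(p)             # ≈ [0.189, 0.515, 0.296] -> the 0.19, 0.52, 0.29 above

p = update(p, "E2")  # observing E2 as well gives the final posteriors
print(p)             # ≈ [0.45, 0.0, 0.55], as computed on the following slide
```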

After observing evidence E2, the final posterior probabilities for all hypotheses are calculated:

      p(Hi|E1E2E3) = p(E1|Hi) × p(E2|Hi) × p(E3|Hi) × p(Hi) / Σ_{k=1..3} p(E1|Hk) × p(E2|Hk) × p(E3|Hk) × p(Hk),   i = 1, 2, 3

      p(H1|E1E2E3) = 0.3 × 0.9 × 0.6 × 0.40 / (0.3 × 0.9 × 0.6 × 0.40 + 0.8 × 0.0 × 0.7 × 0.35 + 0.5 × 0.7 × 0.9 × 0.25) = 0.45

      p(H2|E1E2E3) = 0.8 × 0.0 × 0.7 × 0.35 / (0.3 × 0.9 × 0.6 × 0.40 + 0.8 × 0.0 × 0.7 × 0.35 + 0.5 × 0.7 × 0.9 × 0.25) = 0

      p(H3|E1E2E3) = 0.5 × 0.7 × 0.9 × 0.25 / (0.3 × 0.9 × 0.6 × 0.40 + 0.8 × 0.0 × 0.7 × 0.35 + 0.5 × 0.7 × 0.9 × 0.25) = 0.55

Although the initial ranking was H1, H2 and H3, only hypotheses H1 and H3 remain under consideration after all evidences (E1, E2 and E3) were observed.

4. Naïve Bayes Classifiers

• Bayesian inference calculates the posterior distribution based on the priors:

      p(Hi|E1 E2 ... En) = p(E1|Hi) × p(E2|Hi) × ... × p(En|Hi) × p(Hi) / Σ_{k=1..m} p(E1|Hk) × p(E2|Hk) × ... × p(En|Hk) × p(Hk)

• But what if we don't know the priors and we only have some historical data? (Learning from the data.)

• We can use the naïve Bayes classifier to select the hypothesis or classes.
Naïve Bayes Classifiers

• Bayes rule:

      p(A|B) = p(B|A) × p(A) / p(B)

where:

p(A|B) is the conditional probability that event A occurs given that event B has occurred;
p(B|A) is the conditional probability of event B occurring given that event A has occurred;
p(A) is the probability of event A occurring;
p(B) is the probability of event B occurring.

• Bayes rule can be represented in terms of hypothesis and evidence as

      p(H|E) = p(E|H) × p(H) / p(E)

Maximum A Posteriori (MAP)

• For a set of all events E, we compute the maximum a posteriori hypothesis:

      h_MAP = argmax_{h ∈ H} P(h|E)

      h_MAP = argmax_{h ∈ H} P(E|h) × P(h) / P(E)

      h_MAP = argmax_{h ∈ H} P(E|h) × P(h)

• We can omit P(E) since it is constant and independent of the hypothesis.

• This is the only difference between MAP and Bayesian inference.

Naïve Bayes Classifiers

• Suppose that a set of training examples includes records with conjunctive attribute values (a1, a2, ..., an) and the target function is based on a finite set of classes V.

      v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, ..., an)

      v_MAP = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) × P(vj) / P(a1, a2, ..., an)

      v_MAP = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) × P(vj)
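A brief added sketch (hypothetical numbers, not from the slides) of why the evidence term P(E), or P(a1, a2, ..., an), can be dropped from the argmax: it is the same constant for every hypothesis, so it does not change which hypothesis attains the maximum.

```python
# Hypothetical priors P(h) and likelihoods P(E|h) for three hypotheses.
P_h = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
P_E_given_h = {"h1": 0.2, "h2": 0.6, "h3": 0.4}

P_E = sum(P_E_given_h[h] * P_h[h] for h in P_h)              # constant w.r.t. h

posterior = {h: P_E_given_h[h] * P_h[h] / P_E for h in P_h}  # P(h|E)
score = {h: P_E_given_h[h] * P_h[h] for h in P_h}            # unnormalised score

h_map = max(score, key=score.get)
assert h_map == max(posterior, key=posterior.get)            # same argmax
print(h_map)   # 'h2'
```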
Estimation

• We can easily estimate P(vj) by computing the relative frequency of each target class in the training set.

• But estimating P(a1, a2, ..., an | vj) is difficult.

• So, we assume that attribute values are conditionally independent given the target value:

      P(a1, a2, ..., an | vj) = Π_i P(ai | vj)

• Starting from what we know,

      v_MAP = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) × P(vj)

  this assumption therefore gives the naïve Bayes classifier:

      v_NB = argmax_{vj ∈ V} P(vj) × Π_i P(ai | vj)

• Estimating P(ai|vj) instead of P(a1, a2, ..., an | vj) greatly reduces the number of parameters.

• The learning step in naïve Bayes consists of estimating P(ai|vj) and P(vj) based on the frequencies in the training data. (The priors are embedded in the training data.)

• There is no explicit search during training.

• An unseen instance is classified by computing the class that maximizes the posterior.

Example

Day   Outlook    Temp   Humidity   Wind     Play Tennis
D1    Sunny      Hot    High       Weak     No
D2    Sunny      Hot    High       Strong   No
D3    Overcast   Hot    High       Weak     Yes
D4    Rain       Mild   High       Weak     Yes
D5    Rain       Cool   Normal     Weak     Yes
D6    Rain       Cool   Normal     Strong   No
D7    Overcast   Cool   Normal     Strong   Yes
D8    Sunny      Mild   High       Weak     No
D9    Sunny      Cool   Normal     Weak     Yes
D10   Rain       Mild   Normal     Weak     Yes
D11   Sunny      Mild   Normal     Strong   Yes
D12   Overcast   Mild   High       Strong   Yes
D13   Overcast   Hot    Normal     Weak     Yes
D14   Rain       Mild   High       Strong   No
Counts and relative frequencies from the training data:

                 Count (yes)   Count (no)   P(·|yes)   P(·|no)
Outlook
  Sunny          2             3            2/9        3/5
  Overcast       4             0            4/9        0/5
  Rainy          3             2            3/9        2/5
Temperature
  Hot            2             2            2/9        2/5
  Mild           4             2            4/9        2/5
  Cool           3             1            3/9        1/5
Humidity
  High           3             4            3/9        4/5
  Normal         6             1            6/9        1/5
Wind
  Weak           6             2            6/9        2/5
  Strong         3             3            3/9        3/5
Play             9             5            9/14       5/14

• Suppose that we want to classify the following new instance:
  Outlook = sunny, temp = cool, humidity = high, wind = strong.

      v_NB = argmax_{vj ∈ {yes, no}} P(vj) × Π_i P(ai | vj)

      v_NB = argmax_{vj ∈ {yes, no}} P(vj) × P(outlook = sunny | vj) × P(temp = cool | vj) × P(humidity = high | vj) × P(wind = strong | vj)
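The whole procedure can be sketched in a few lines of Python (an added illustration, not part of the original slides): it re-derives the frequency tables above from the 14 training examples and then scores the new instance, exactly as the hand calculation on the next slide does.

```python
# Play-tennis training data from the example slide:
# (Outlook, Temp, Humidity, Wind, Play Tennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def naive_bayes_scores(instance):
    """Return P(v) * product of P(ai|v) for each class v in {Yes, No}."""
    scores = {}
    for v in ("Yes", "No"):
        rows = [r for r in data if r[-1] == v]
        score = len(rows) / len(data)                 # P(v): relative class frequency
        for i, a in enumerate(instance):
            # P(ai|v): frequency of attribute value ai within class v
            score *= sum(1 for r in rows if r[i] == a) / len(rows)
        scores[v] = score
    return scores

scores = naive_bayes_scores(("Sunny", "Cool", "High", "Strong"))
print(scores)                        # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))   # 'No' -> do not play tennis
```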

• First, we calculate the prior probabilities:

      P(play = yes) = 9/14
      P(play = no)  = 5/14

• Then we multiply each prior by the conditional probabilities of the observed attribute values:

      P(yes) × P(sunny|yes) × P(cool|yes) × P(high|yes) × P(strong|yes) = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 = 0.0053

      P(no) × P(sunny|no) × P(cool|no) × P(high|no) × P(strong|no) = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 = 0.0206

      v_NB = argmax_{vj ∈ {yes, no}} P(vj) × P(sunny|vj) × P(cool|vj) × P(high|vj) × P(strong|vj) = no

Want to know more?

• Rish, Irina (2001). "An empirical study of the naive Bayes classifier". https://faculty.cc.gatech.edu/~isbell/reading/papers/Rish.pdf
• Gelman, Andrew, et al. (2003). Bayesian Data Analysis.
