L08-Uncertainty & Probabilistic Reasoning
▪ Systems that can reason about the effects of uncertainty should do better than those
that cannot
▪ If I go to the dentist and the probe catches during the examination, this indicates that
the cause may be a cavity rather than something else.
The likelihood of a hypothesised cause will change as additional pieces of evidence
arrive.
▪ Bob lives in San Francisco. He has a burglar alarm on his house, which can be triggered
by burglars and earthquakes. He has two neighbours, John and Mary, who will call him if
the alarm goes off while he is at work, but each is unreliable in their own way. All these
sources of uncertainty can be quantified. Mary calls, how likely is it that there has been
a burglary?
Using probabilistic reasoning we can calculate how likely a hypothesised cause is.
▪ The set of possible outcomes for a random variable is called its domain.
▪ e.g. A coin. The domain of Throw is {head, tail}
▪ So, conditional probabilities reflect the fact that some events make other events more
(or less) likely.
▪ If one event doesn’t affect the likelihood of another event, they are said to be
independent and therefore:
P(a | b) = P(a)
▪ e.g. if you roll a die, it doesn’t make it more or less likely that you will roll a 6 on the next
throw. The rolls are independent.
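Independence is easy to verify by brute-force enumeration. A small sketch over the 36 equally likely ordered outcomes of two fair die rolls (the variable names are illustrative):

```python
from fractions import Fraction

# All 36 equally likely ordered outcomes of two fair die rolls.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

# P(second roll is a 6)
p_b6 = Fraction(sum(1 for a, b in outcomes if b == 6), len(outcomes))

# P(second roll is a 6 | first roll was a 6)
given_a6 = [(a, b) for a, b in outcomes if a == 6]
p_b6_given_a6 = Fraction(sum(1 for a, b in given_a6 if b == 6), len(given_a6))

# Independence: conditioning on the first roll changes nothing.
assert p_b6 == p_b6_given_a6 == Fraction(1, 6)
```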
▪ If we model how likely observable effects are given hidden causes (how likely
toothache is given a cavity), then Bayes’ rule allows us to use that model to infer
the likelihood of the hidden cause (and thus answer our question)
P(cause | effect) = P(effect | cause) P(cause) / P(effect)
▪ In fact, good models of P(effect | cause) are often available to us in real domains (e.g.
medical diagnosis)
10S3001-Certan | Gasal 20/21 | Institut Teknologi Del 13
▪ Suppose a doctor knows that meningitis causes a stiff neck in 50% of cases
P(s | m) = 0.5
▪ She also knows that the probability in the general population of someone having a
stiff neck at any time is 1/20
P(s) = 0.05
▪ She also has to know the incidence of meningitis in the population (1/50,000)
P(m) = 0.00002
▪ Using Bayes’ rule she can calculate the probability the patient has meningitis:
P(m | s) = P(s | m) P(m) / P(s) = (0.5 × 0.00002) / 0.05 = 0.0002 = 1/5000
P(cause | effect) = P(effect | cause) P(cause) / P(effect)
▪ Why wouldn’t the doctor be better off if she just knew the likelihood of meningitis
given a stiff neck? i.e. information in the diagnostic direction from symptoms to
causes?
▪ Because diagnostic knowledge is often more fragile than causal knowledge
▪ Suppose there was a meningitis epidemic, and the rate of meningitis went up 20 times
within a group:
P(m | s) = P(s | m) P(m) / P(s) = (0.5 × 0.0004) / 0.05 = 0.004 = 1/250
▪ We simply calculate the numerator (the top line) for each hypothesis and then normalise:
divide by the sum of the numerators over all hypotheses.
▪ But sometimes it’s harder to find out 𝑃(effect|cause) for all causes independently than it is simply to find
out 𝑃(effect).
▪ Note that Bayes’ rule here relies on the fact the effect must have arisen because of one of the
hypothesised causes. You can’t reason directly about causes you haven’t imagined.
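The normalisation trick can be sketched as follows; the causes and numbers here are purely illustrative, and the set of causes is assumed exhaustive (per the caveat above):

```python
def normalise_posteriors(priors, likelihoods):
    """P(cause | effect) for each cause, computed by normalising the
    numerators P(effect | cause) * P(cause) over all hypothesised causes."""
    numerators = {c: likelihoods[c] * priors[c] for c in priors}
    total = sum(numerators.values())      # plays the role of P(effect)
    return {c: n / total for c, n in numerators.items()}

# Illustrative (made-up) numbers: two exhaustive causes of a toothache.
priors = {"cavity": 0.1, "no_cavity": 0.9}
likelihoods = {"cavity": 0.6, "no_cavity": 0.05}  # P(toothache | cause)

post = normalise_posteriors(priors, likelihoods)
assert abs(sum(post.values()) - 1.0) < 1e-9       # posteriors sum to 1
```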
▪ How do we do this?
P(cavity | toothache ∧ catch) ∝ P(toothache ∧ catch | cavity) P(cavity)
▪ As we have more effects our causal model becomes very complicated (for N binary
effects there are 2^N different combinations of evidence that we need to model
given a cause), e.g.:
P(toothache ∧ catch | cavity), P(toothache ∧ ¬catch | cavity), P(¬toothache ∧ catch | cavity), …
▪ In many practical applications there are not a few evidence variables but hundreds
▪ This nearly led everyone to give up and rely on approximate or qualitative methods for reasoning about
uncertainty
▪ Toothache and catch are not independent, but they are independent given the presence or absence of a
cavity.
▪ In other words we can use the knowledge that cavities cause toothache and they cause the catch, but
the catch and the toothache do not cause each other (they have a single common cause).
▪ Or in a new equation:
P(toothache ∧ catch | cavity) = P(toothache | cavity) P(catch | cavity)
▪ Using conditional independence the causal model is much more compact: rather than the number
of parameters being O(2^N), it is O(N), where N is the number of effects (or evidence variables)
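A sketch of how the factorised (naive-Bayes-style) model is used; all numerical values below are assumptions for illustration, not figures from the slides:

```python
def posterior_given_effects(prior, p_effect_given_cause, observed):
    """P(cause | e_1, ..., e_N) proportional to P(cause) * prod_i P(e_i | cause).
    Conditional independence lets us store one small table per effect
    (O(N) parameters) instead of one entry per combination (O(2^N))."""
    scores = {}
    for cause, p in prior.items():
        for effect in observed:
            p *= p_effect_given_cause[effect][cause]
        scores[cause] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Assumed numbers for the dental example.
prior = {"cavity": 0.2, "no_cavity": 0.8}
tables = {
    "toothache": {"cavity": 0.6, "no_cavity": 0.1},
    "catch":     {"cavity": 0.9, "no_cavity": 0.2},
}
post = posterior_given_effects(prior, tables, ["toothache", "catch"])
```

With these assumed numbers, observing both effects pushes the posterior strongly towards the cavity hypothesis even though its prior is only 0.2.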
▪ The human has to move their (hidden) hand to a target. Half way through the movement
they are given an uncertain clue as to where the hand is. The position their hand moves
to by the end is consistent with the brain using Bayes’ rule to combine the information it
receives
https://mitpress.mit.edu/books/bayesian-brain
• Markov Chain
• Hidden Markov Model
[Figure: a Markov chain over words — visible states include "What", "is", "the", "next", "word",
"of", "this", "paragraph", "line", "end", "at", "message" — with transition probabilities such as
0.6, 0.3 and 0.05 on the arcs.]
▪ Design a Markov Chain to predict the weather of
tomorrow using previous information of the past days.
▪ P(w_n | w_{n-1}, w_{n-2}, …, w_1)
▪ Problem: the larger n is, the more statistics we must collect. Suppose n = 5; then with
three weather states we must collect statistics for 3^5 = 243 past histories. Therefore, we
make the Markov assumption:
P(w_n | w_{n-1}, …, w_1) = P(w_n | w_{n-1})
▪ In a sequence w_1, w_2, …, w_n:
P(w_1, …, w_n) = ∏_{i=1}^{n} P(w_i | w_{i-1})
(taking P(w_1 | w_0) to mean the initial probability P(w_1))
▪ For first-order Markov models, we can use these probabilities to draw a probabilistic
finite state automaton.
P(Sunny | Sunny) = 0.8
P(Rainy | Sunny) = 0.05
P(Cloudy | Sunny) = 0.15
P(Sunny | Rainy) = 0.2
P(Rainy | Rainy) = 0.6
P(Cloudy | Rainy) = 0.2
P(Sunny | Cloudy) = 0.2
P(Rainy | Cloudy) = 0.3
P(Cloudy | Cloudy) = 0.5
(the outgoing probabilities from each state sum to 1)
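These transition probabilities form a row-stochastic matrix. A quick sketch encoding them; the Sunny row uses P(Sunny|Sunny) = 0.8 and P(Rainy|Sunny) = 0.05 from the worked exercise, with P(Cloudy|Sunny) = 0.15 assumed as the remainder:

```python
# Transition matrix P(tomorrow | today).  P(Sunny|Sunny) and P(Rainy|Sunny)
# come from the worked exercise; P(Cloudy|Sunny) = 0.15 is assumed so that
# the row sums to 1.
T = {
    "Sunny":  {"Sunny": 0.8, "Rainy": 0.05, "Cloudy": 0.15},
    "Rainy":  {"Sunny": 0.2, "Rainy": 0.6,  "Cloudy": 0.2},
    "Cloudy": {"Sunny": 0.2, "Rainy": 0.3,  "Cloudy": 0.5},
}

# Sanity check: every row of a transition matrix sums to 1.
for today, row in T.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, today
```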
▪ Exercise 1
▪ Given that today is Sunny, what’s the probability that tomorrow is Sunny and the day after
is Rainy?
P(w_2, w_3 | w_1) = P(w_3 | w_2, w_1) · P(w_2 | w_1)
= P(w_3 | w_2) · P(w_2 | w_1)          (Markov assumption)
= P(Rainy | Sunny) · P(Sunny | Sunny)
= 0.05 × 0.8
= 0.04
▪ Exercise 2
▪ Given that today is Cloudy, what's the probability that the day after tomorrow is Rainy?
P(w_3 = R | w_1 = C) = P(w_2 = C, w_3 = R | w_1 = C) +
P(w_2 = R, w_3 = R | w_1 = C) +
P(w_2 = S, w_3 = R | w_1 = C)
= P(w_3 = R | w_2 = C) · P(w_2 = C | w_1 = C) +
P(w_3 = R | w_2 = R) · P(w_2 = R | w_1 = C) +
P(w_3 = R | w_2 = S) · P(w_2 = S | w_1 = C)
= (0.3)(0.5) + (0.6)(0.3) + (0.05)(0.2)
= 0.34
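Both exercise answers can be checked mechanically from the transition table (a sketch; 0.8 and 0.05 in the Sunny row are given in Exercise 1, while P(Cloudy|Sunny) = 0.15 is the assumed remainder):

```python
T = {
    "Sunny":  {"Sunny": 0.8, "Rainy": 0.05, "Cloudy": 0.15},
    "Rainy":  {"Sunny": 0.2, "Rainy": 0.6,  "Cloudy": 0.2},
    "Cloudy": {"Sunny": 0.2, "Rainy": 0.3,  "Cloudy": 0.5},
}

# Exercise 1: P(w2 = Sunny, w3 = Rainy | w1 = Sunny)
ex1 = T["Sunny"]["Sunny"] * T["Sunny"]["Rainy"]

# Exercise 2: P(w3 = Rainy | w1 = Cloudy), marginalising over tomorrow's state
ex2 = sum(T["Cloudy"][w2] * T[w2]["Rainy"] for w2 in T)

print(round(ex1, 4), round(ex2, 4))  # 0.04 0.34
```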
▪ A Markov Model is a stochastic model which models temporal or sequential data, i.e.,
data that are ordered.
▪ It provides a way to model the dependencies of current information (e.g. weather) with
previous information.
▪ Imagine: You were locked in a room for several days, and you were asked about the
weather outside. The only piece of evidence you have is whether the person who
comes into the room carrying your daily meal is carrying an umbrella or not.
▪ What is hidden? Sunny, Rainy, Cloudy
▪ What can you observe? Umbrella or Not
U = Umbrella
NU = Not Umbrella
▪ Each observation comes from an unknown state. Therefore, we will also have an
unknown sequence W = w_1, …, w_t, where w_i ∈ {Sunny, Rainy, Cloudy}.
P(w_i | o_i) = P(o_i | w_i) P(w_i) / P(o_i)

P(w_1, …, w_t | o_1, …, o_t) = P(o_1, …, o_t | w_1, …, w_t) P(w_1, …, w_t) / P(o_1, …, o_t)
P(w_1, …, w_t) = ∏_{i=1}^{t} P(w_i | w_{i-1})

P(o_1, …, o_t | w_1, …, w_t) = ∏_{i=1}^{t} P(o_i | w_i)
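Multiplying the two factorisations (together with the prior on the first state) gives the joint probability of a particular weather sequence and observation sequence. A sketch for the umbrella example; the emission table and the uniform initial distribution are illustrative assumptions, and the Sunny row of the transition table is partly assumed:

```python
def joint_prob(states_seq, obs_seq, init, trans, emit):
    """P(w1..wt, o1..ot) = P(w1) P(o1|w1) * prod over i>1 of P(wi|wi-1) P(oi|wi)."""
    p = init[states_seq[0]] * emit[states_seq[0]][obs_seq[0]]
    for prev, cur, o in zip(states_seq, states_seq[1:], obs_seq[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p

trans = {
    "Sunny":  {"Sunny": 0.8, "Rainy": 0.05, "Cloudy": 0.15},
    "Rainy":  {"Sunny": 0.2, "Rainy": 0.6,  "Cloudy": 0.2},
    "Cloudy": {"Sunny": 0.2, "Rainy": 0.3,  "Cloudy": 0.5},
}
emit = {  # assumed P(observation | weather); U = umbrella, NU = no umbrella
    "Sunny":  {"U": 0.1, "NU": 0.9},
    "Rainy":  {"U": 0.8, "NU": 0.2},
    "Cloudy": {"U": 0.3, "NU": 0.7},
}
init = {w: 1 / 3 for w in trans}  # assumed uniform initial distribution

p = joint_prob(["Sunny", "Rainy"], ["NU", "U"], init, trans, emit)
```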
HMM Parameters:
• Transition probabilities 𝑃 𝑤𝑖 |𝑤𝑖−1
• Emission probabilities 𝑃 𝑜𝑖 |𝑤𝑖
• Initial state probabilities P(w_1)
▪ How do we compute the probability that the observed sequence was produced by the
model?
▪ Scoring how well a given model matches a given observation sequence.
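This evaluation problem is solved efficiently by the forward algorithm, which sums over all hidden sequences in O(t·N²) time instead of enumerating all N^t of them. A sketch (the model parameters are the assumed umbrella-example values, not figures from the slides), with a brute-force check:

```python
from itertools import product

def forward(obs, states, init, trans, emit):
    """P(o1..ot): alpha_i(w) = P(o1..oi, wi = w), updated left to right."""
    alpha = {w: init[w] * emit[w][obs[0]] for w in states}
    for o in obs[1:]:
        alpha = {w: sum(alpha[v] * trans[v][w] for v in states) * emit[w][o]
                 for w in states}
    return sum(alpha.values())

states = ["Sunny", "Rainy", "Cloudy"]
trans = {
    "Sunny":  {"Sunny": 0.8, "Rainy": 0.05, "Cloudy": 0.15},
    "Rainy":  {"Sunny": 0.2, "Rainy": 0.6,  "Cloudy": 0.2},
    "Cloudy": {"Sunny": 0.2, "Rainy": 0.3,  "Cloudy": 0.5},
}
emit = {  # assumed emission probabilities
    "Sunny":  {"U": 0.1, "NU": 0.9},
    "Rainy":  {"U": 0.8, "NU": 0.2},
    "Cloudy": {"U": 0.3, "NU": 0.7},
}
init = {w: 1 / 3 for w in states}  # assumed uniform

obs = ["U", "U", "NU"]
p_forward = forward(obs, states, init, trans, emit)

# Brute force over all 3^3 hidden sequences must give the same answer.
def seq_joint(seq):
    p = init[seq[0]] * emit[seq[0]][obs[0]]
    for i in range(1, len(seq)):
        p *= trans[seq[i - 1]][seq[i]] * emit[seq[i]][obs[i]]
    return p

p_brute = sum(seq_joint(seq) for seq in product(states, repeat=len(obs)))
assert abs(p_forward - p_brute) < 1e-12
```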
▪ Attempt to uncover the hidden part of the model, that is, to find the "correct" state
sequence.
▪ For practical situations, we usually use an optimality criterion to solve this problem as best
as possible.
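The most common optimality criterion here is the single most likely state sequence, computed by the Viterbi algorithm: the forward recursion with max in place of sum, remembering the argmax. A sketch, again with the assumed umbrella parameters:

```python
def viterbi(obs, states, init, trans, emit):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[w] = (probability of the best path ending in state w, that path)
    best = {w: (init[w] * emit[w][obs[0]], [w]) for w in states}
    for o in obs[1:]:
        step = {}
        for w in states:
            p, path = max((best[v][0] * trans[v][w], best[v][1]) for v in states)
            step[w] = (p * emit[w][o], path + [w])
        best = step
    return max(best.values())

states = ["Sunny", "Rainy", "Cloudy"]
trans = {
    "Sunny":  {"Sunny": 0.8, "Rainy": 0.05, "Cloudy": 0.15},
    "Rainy":  {"Sunny": 0.2, "Rainy": 0.6,  "Cloudy": 0.2},
    "Cloudy": {"Sunny": 0.2, "Rainy": 0.3,  "Cloudy": 0.5},
}
emit = {  # assumed emission probabilities
    "Sunny":  {"U": 0.1, "NU": 0.9},
    "Rainy":  {"U": 0.8, "NU": 0.2},
    "Cloudy": {"U": 0.3, "NU": 0.7},
}
init = {w: 1 / 3 for w in states}  # assumed uniform

p, path = viterbi(["U", "U", "NU"], states, init, trans, emit)
print(path)  # ['Rainy', 'Rainy', 'Sunny'] with these numbers
```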
▪ Attempt to optimize the model parameters to best describe how a given observation
sequence comes about.
▪ The observation sequence used to adjust the model parameters is called a training
sequence because it is used to “train” the HMM.
▪ Left-right
▪ Edge descriptor
▪ Player height
P. Chang, M. Han, and Y. Gong. “Extract Highlights From Baseball Game Video With Hidden Markov Models,” In Proc. of ICIP, vol. 1, pp. 609-612, 2002.
▪ Reasoning under uncertainty is an important area of AI
▪ It is not the case that statistical methods are the only way
▪ Logics can also cope with uncertainty in a qualitative way
▪ But statistical methods, and particularly Bayesian reasoning, have become a
cornerstone of modern AI (rightly or wrongly)
▪ A Markov Model is a stochastic model which models temporal or sequential data,
i.e., data that are ordered.
▪ Hidden Markov Model (HMM) is a statistical Markov model in which the system being
modeled is assumed to be a Markov process with unobservable (i.e. hidden) states.
▪ E. Fosler-Lussier, "Markov Models and Hidden Markov Models: A Brief Tutorial," International Computer Science Institute, 1998.
▪ L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE,
vol. 77, no. 2, pp. 257-286, Feb 1989.
▪ John R. Deller, John G. Proakis, and John H. L. Hansen, "Discrete-Time Processing of Speech Signals," Prentice Hall, New Jersey, 1987.
▪ Barbara Resch (modified by Erhard Rank and Mathew Magimai-Doss), "Hidden Markov Models: A Tutorial for the Course
Computational Intelligence."
▪ Henry Stark and John W. Woods, "Probability and Random Processes with Applications to Signal Processing," 3rd edition,
Prentice Hall, August 2001.