
Bayesian Inference

and Decision
Theory
Unit 1: A Brief Tour of Bayesian
Inference and Decision Theory

©Kathryn Blackmond Laskey Spring 2023 Unit 1 v2- 1 -


What this Course is About

• You will learn a way of thinking about problems of inference and decision-making under uncertainty
• You will learn to construct mathematical models for inference and
decision problems
• You will learn how to apply these models to draw inferences from
data and to make decisions
• These methods are based on Bayesian Decision Theory, a formal
theory for rational inference and decision making



Logistics
• Web site
• http://seor.vse.gmu.edu/~klaskey/SYST664/SYST664.html
• Blackboard site: http://mymason.gmu.edu
• Textbook and Software
• Hoff, A First Course in Bayesian Statistical Methods, Springer, 2009 (Free softcopy from Mason library)
• Other recommended texts on course web site
• We will use R, a free open-source statistical computing environment: http://www.r-project.org/. R code for many textbook examples is on the authorʼs web site
• Later in the semester we will use JAGS, an open-source package for Markov Chain Monte Carlo simulation (interfaces with R): http://mcmc-jags.sourceforge.net/
• Requirements
• Regular assignments (30%); take-home midterm (35%); take-home final (35%)
• Office hours
• Official office hours are 3:00-4:00 PM Mondays (by appointment only), 4:00-5:00 PM Wednesdays (via Zoom)
• I respond to questions by email and am available by appointment
• Course delivery
• 4:30-7:10 Mondays, in person ENT 276 and online via Zoom; all classes recorded
• Policies and Resources
• Academic integrity policy
• Read the policies and resources section of the syllabus



Course Outline

• Unit 1: A Brief Tour of Bayesian Inference and Decision Theory


• Unit 2: Random Variables, Parametric Models, and Inference from Observation
• Unit 3: Statistical Models with a Single Parameter
• Unit 4: Monte Carlo Approximation
• Unit 5: The Normal Model
• Unit 6: Markov Chain Monte Carlo
• Unit 7: Hierarchical Bayesian Models
• Unit 8: Bayesian Regression and Analysis of Variance
• Unit 9: Multinomial Distribution and Latent Groups
• Unit 10: Hypothesis Tests, Bayes Factors, and Bayesian Model Averaging

(There may be changes, especially in the second half; we may not make it to the last unit)



Learning Objectives for Unit 1

• Describe the elements of a decision model


• Refresh knowledge of probability
• Apply Bayes rule for simple inference problems and interpret the results
• Explain why Bayesians believe inference cannot be separated from
decision-making
• Compare Bayesian and frequentist philosophies of statistical inference
• Compute and interpret the expected value of information (VOI) for a
decision problem with an option to collect information
• Download, install and use R statistical software



Bayesian Inference

• Bayesians use probability to quantify rational degrees of belief
• Bayesians view inference as belief dynamics
• Use evidence to update prior beliefs to posterior
beliefs
• Posterior beliefs become prior beliefs for future
evidence
• Inference problems are usually embedded
in decision problems
• We will learn to build models of inference
and decision problems

“All models are wrong but some are useful” - George Box
Decision Theory

• Decision theory is a formal theory of decision making under uncertainty


• A decision problem consists of:
• Possible actions: {a}, a ∈ A
• States of the world (usually uncertain): {s}, s ∈ S
• Possible consequences: {c}, c ∈ C (depends on action and state)

• Question: What is the best action?


• Answer (according to decision theory):
• Measure “goodness” of consequences with a utility function u(c)
• Measure likelihood of states with probability distribution p(s) (more generally, p(s|a))
• Best action with respect to model maximizes expected utility:

a* = argmax_a E_{p(s)}[u(c(s, a)) | a]        For brevity, we may write E[u(a)] for E_{p(s)}[u(c(s, a)) | a]
• Caveat emptor:
• How good it is for you depends on fidelity of model to your beliefs and preferences
Illustrative Example:
Highly Oversimplified Decision Problem

• Decision problem: Should patient be treated for disease?


• We suspect she may have disease but do not know
• Without treatment the disease will lead to long illness
• Treatment has unpleasant side effects
• Decision model:
• Actions: aT (treat) and aN (don’t treat)
• States of world: sD (disease now) and sW (well now)
• Consequences: c(sD, aT)= c(sW, aT)=cWS (well shortly, side effects), c(sW, aN)=cWN (well shortly, no
side effects), c(sD, aN)=cDN (disease for long time, no side effects)
• Probabilities and Utilities:
• P(sD) = 0.3
• u(cWN) = 100, u(cWS) = 90, u(cDN) = 0
• Expected utility:
• Treat: 0.3 × 90 + 0.7 × 90 = 90
• Don't treat: 0.3 × 0 + 0.7 × 100 = 70
• Best action is aT (treat patient)
[Figure: utility scale showing u(cWN) = 100, u(cWS) = 90 = EU(aT), EU(aN) = 70, u(cDN) = 0, with P(sD) = 0.3 and P(sW) = 0.7]
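The arithmetic above can be sketched in code (a Python sketch for illustration; the course itself uses R, and all names here are illustrative):

```python
# Expected utility of each action under the slide's numbers.

def expected_utility(p_disease, u):
    p_well = 1 - p_disease
    return {
        "treat": p_disease * u["WS"] + p_well * u["WS"],     # side effects either way
        "no_treat": p_disease * u["DN"] + p_well * u["WN"],  # long illness if diseased
    }

u = {"WN": 100, "WS": 90, "DN": 0}   # utilities u(cWN), u(cWS), u(cDN)
eu = expected_utility(0.3, u)        # P(sD) = 0.3
best = max(eu, key=eu.get)           # "treat": expected utility 90 beats 70
```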



Decision Model: Summary

• To model a decision problem we specify:


• Possible actions [aT, aN in medical example]
• Consequences [cWN, cWS, cDN in medical example]
• States [sD, sW in medical example]
• Probabilities of states [P(sD)=0.3, P(sW)=0.7 in medical example]
• Utilities for consequences [u(cWN) = 100, u(cWS) = 90; u(cDN) = 0 in medical example]
• To find the best decision we calculate the expected utility for each action
and choose the best
• E[u(aT)] = 90 ; E[u(aN)] = 70 in medical example; best decision is to treat
• Notation: for brevity we write E[u(a)] for E_{P(s)}[u(c(s, a)) | a]
• Sometimes we minimize expected loss (negative utility) instead of
maximizing expected utility
Sensitivity Analysis:
How Optimal Decision Varies with Sickness Probability

• Expected utility of not treating depends on the probability p = P(sD) of having the disease
• E[U | aT] = 90 (no dependence on p)
• E[U | aN] = 0·p + 100(1 − p) = 100(1 − p) (decreases as p increases)
• We should treat if p > 0.1, not treat if p < 0.1
• When we are unsure about the value of p, we may want to explore how the optimal decision changes as we vary p
• If our estimate is near the crossover point, we may want to gather information to refine our estimate of p
• We will use Bayesian inference to update our estimate of the probability
[Figure: Expected Utility for Disease Problem. E[U(Treat)] is flat at 90; E[U(NoTreat)] falls from 100 to 0 as P(Sick) goes from 0 to 1, crossing at p = 0.1; the gap at p = 0.3 is the expected gain from treatment]
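The crossover calculation can be sketched as follows (Python for illustration; the course uses R, and the names are illustrative):

```python
# Crossover probability where the two expected-utility lines meet:
# solving 90 = 100*(1 - p) gives p = 0.1.

EU_TREAT = 90.0

def eu_no_treat(p):
    return 100 * (1 - p)

crossover = 1 - EU_TREAT / 100   # p = 0.1

def best_action(p):
    return "treat" if p > crossover else "no_treat"
```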
Interlude: Review of Probability Basics

• Probability is a mathematical representation for uncertainty


• We assign probability to events
• An event 𝐴 is a subset of the sample space Ω
• A probability distribution is a function on events that satisfies:
• P(A) ≥ 0 for all events A        (Kolmogorov's axioms)
• P(Ω) = 1
• If Ai ∩ Aj = ∅ for i ≠ j, then P(A1 ∪ A2 ∪ ⋯) = P(A1) + P(A2) + ⋯
• From these properties we can derive others, e.g.:
• P(A) ≤ 1 for all events A
• P(∅) = 0
• If A ⊂ B then P(A) ≤ P(B)
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for events A and B



Conditional Probability

• The conditional probability P(A|B) satisfies:
• P(A|B) P(B) = P(A ∩ B)
• If P(B) > 0 then P(A|B) = P(A ∩ B) / P(B)
• A and B are independent if P(A|B) = P(A)
• This implies P(A ∩ B) = P(A) P(B)
• The law of total probability is:
• If Bi ∩ Bj = ∅ for i ≠ j and Ω = B1 ∪ B2 ∪ ⋯ then
P(A) = Σi P(A ∩ Bi) = Σi P(A|Bi) P(Bi)
[Figure: Venn diagram illustrating conditioning on B]



Bayes Rule: The Law of Belief Dynamics

• Objective: use evidence to update beliefs
• H1, …, Hn: exclusive and exhaustive hypotheses (Hi ∩ Hj = ∅, Ω = H1 ∪ H2 ∪ …)
• E: evidence (with positive probability, P(E) > 0)
• Procedure: apply Bayes Rule:
P(Hi | E) = P(Hi ∩ E) / P(E) = P(E|Hi) P(Hi) / P(E) = P(E|Hi) P(Hi) / Σj P(E|Hj) P(Hj)
• Bayes Rule (odds likelihood form):
P(Hi | E) / P(Hj | E) = [P(E|Hi) / P(E|Hj)] × [P(Hi) / P(Hj)]        (P(E) > 0, P(Hj) > 0)
[Figure: Venn diagram of hypotheses H1, H2 and evidence E]

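The update rule can be sketched generically (Python for illustration; the course uses R, and the hypothesis names are illustrative):

```python
# Bayes rule over an exhaustive set of hypotheses:
# posterior is proportional to likelihood times prior, normalized by P(E).

def bayes(priors, likelihoods):
    unnorm = {h: likelihoods[h] * priors[h] for h in priors}
    p_e = sum(unnorm.values())                   # predictive probability P(E)
    return {h: v / p_e for h, v in unnorm.items()}

# evidence 4 times as likely under H1 as under H2, equal priors:
post = bayes({"H1": 0.5, "H2": 0.5}, {"H1": 0.8, "H2": 0.2})
# posterior odds H1:H2 = (0.8/0.2) x (0.5/0.5) = 4, so P(H1|E) = 0.8
```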


Interpreting Bayes Rule

• Bayes Rule (odds likelihood form):
P(Hi | E) / P(Hj | E) = [P(E|Hi) / P(E|Hj)] × [P(Hi) / P(Hj)]
• Terminology:
• P(H) - the prior probability of H
• P(E|H) - the likelihood for E given H
• P(E) - the predictive probability of E
• P(H|E) - the posterior probability of H given E
• P(E|Hi) / P(E|Hj) - the likelihood ratio for Hi versus Hj
• P(Hi) / P(Hj) - the prior odds ratio for Hi versus Hj

• The posterior probability of Hi increases relative to Hj if the evidence is more likely given Hi than given Hj



Probability Review: Summary

• Events: subsets of sample space


• Probability:
• Maps event to number between 0 and 1
• Measures how likely event is to occur
• Satisfies basic rules
• Conditional probabilities measure how likely an event is given that
another event has occurred
• Bayes rule tells us how probabilities change when we get new
evidence



Extending the Disease Example:
Gathering Information
• We can perform a test before deciding whether to treat the patient
• Test has two outcomes: tP (positive) and tN (negative)
• Quality of test is characterized by two numbers:
• Sensitivity: Probability that test is positive if patient has disease
• Specificity: Probability that test is negative if patient does not have disease

• Test characteristics:
• Sensitivity: P(tP | sD) = 0.95
• Specificity: P(tN | sW) = 0.85
• How does the model change if test results are available?
• Take test, observe outcome t
• Revise prior beliefs P(sD) to obtain posterior beliefs P(sD|t)
• Re-compute optimal decision using P(sD|t)
Disease Example with Test
• Review of problem ingredients:
• P(sD) = 0.3 (prior probability of disease)
• P(tP | sD) = 0.95; P(tN | sW) = 0.85 (sensitivity & specificity of test)
• u(cWN) = 100, u(cWS) = 90, u(cDN) = 0 (utilities)
• If negative test: P(sD | tN) = P(tN|sD) P(sD) / [P(tN|sD) P(sD) + P(tN|sW) P(sW)]
= (0.3 × 0.05)/(0.3 × 0.05 + 0.7 × 0.85) = 0.025
• EU(aN | tN) = 0.025 × 0 + (1 − 0.025) × 100 = 97.5
• EU(aT | tN) = 0.025 × 90 + (1 − 0.025) × 90 = 90
• Best action is not to treat
• If positive test: P(sD | tP) = P(tP|sD) P(sD) / [P(tP|sD) P(sD) + P(tP|sW) P(sW)]
= (0.3 × 0.95)/(0.3 × 0.95 + 0.7 × 0.15) = 0.731
• EU(aN | tP) = 0.731 × 0 + (1 − 0.731) × 100 = 26.9
• EU(aT | tP) = 0.731 × 90 + (1 − 0.731) × 90 = 90
• Best action is to treat
• Optimal policy is to treat if positive; don't treat if negative
• We will call this strategy aF (FollowTest)
[Figure: utility scale showing EU(aN | tN) = 97.5, EU(aT) = 90, EU(aN) = 70, EU(aN | tP) = 26.9]
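The per-outcome posteriors and decisions can be sketched in code (Python for illustration; the course uses R, and all names are illustrative):

```python
# Posterior P(disease | test result) by Bayes rule, and the resulting decision.

def posterior_disease(prior, sens, spec, result):
    """Binary test: result is 'pos' or 'neg'; sens = P(tP|sD), spec = P(tN|sW)."""
    if result == "pos":
        num, alt = sens * prior, (1 - spec) * (1 - prior)
    else:
        num, alt = (1 - sens) * prior, spec * (1 - prior)
    return num / (num + alt)

def decide(p_sick):
    """Treat iff EU of treating (90) beats not treating (100*(1 - p))."""
    return "treat" if 90 > 100 * (1 - p_sick) else "no_treat"

p_pos = posterior_disease(0.3, 0.95, 0.85, "pos")  # about 0.731
p_neg = posterior_disease(0.3, 0.95, 0.85, "neg")  # about 0.025
# decide(p_pos) is "treat"; decide(p_neg) is "no_treat" -- the FollowTest policy
```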
Decision Model with Test: Summary
• To model a decision problem we specify:
• Possible actions [aT, aN in medical example]
• Consequences [cWN, cWS, cDN in medical example]
• States [sD, sW in medical example]
• Probabilities of states depend on test outcome
• P(sD | tN) = 0.025, P(sW| tN) = 0.975 with negative test
• P(sD | tP) = 0.731, P(sW| tP) = 0.269 with positive test
• Utilities for consequences [u(cWN) = 100, u(cWS) = 90; u(cDN) = 0 in medical example]
• To find the best decision we calculate the expected utility for each action
given the test result and choose the best
• E[u(aT | tN)] = 90 ; E[u(aN | tN)] = 97.5; best decision is not to treat if test is negative
• E[u(aT | tP)] = 90 ; E[u(aN | tP)] = 26.9; best decision is to treat if test is positive
• We always make our decision based on the information we have at the time
of the decision, so if a test result is available, we use the probability given
the test result
Should We Gather Information?
• Reminder of problem ingredients:
• P(sD) = 0.3 (prior probability of disease)
• u(cWN) = 100, u(cWS) = 90, u(cDN) = 0 (utilities)
• P(tP | sD) = 0.95; P(tN | sW) = 0.85 (sensitivity & specificity of test)
• Expected utility after doing test:
• If test is positive we should treat, with EU(aT) = EU(aT | tP) = 90
• If test is negative we should not treat, with EU(aN | tN) = 97.54098
• Probability test will be positive (use law of total probability):
• P(tP) = P(tP | sD) P(sD) + P(tP | sW) P(sW) = 0.95 × 0.3 + 0.15 × 0.7 = 0.39
• Expected utility of FollowTest strategy (treat if test is positive, otherwise not):
• EU(aF) = P(tP) EU(aT | tP) + P(tN) EU(aN | tN) = 0.39 × 90 + (1 − 0.39) × 97.54098 = 94.6
• EU(aF) is larger than EU(aT) = 90, so we should do the test
[Figure: utility scale showing EU(aN | tN) = 97.5, EU(aF) = 94.6, EU(aT) = EU(aT|tP) = EU(aT|tN) = 90, EU(aN) = 70, EU(aN | tP) = 26.9]
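The FollowTest calculation can be sketched as follows (Python for illustration; the course uses R, and the names are illustrative):

```python
# Expected utility of the FollowTest strategy via the law of total probability.
prior, sens, spec = 0.3, 0.95, 0.85
p_pos = sens * prior + (1 - spec) * (1 - prior)   # P(tP) = 0.39
p_sick_neg = (1 - sens) * prior / (1 - p_pos)     # P(sD | tN), about 0.0246
eu_follow = p_pos * 90 + (1 - p_pos) * 100 * (1 - p_sick_neg)  # about 94.6
do_test = eu_follow > 90       # True: following the test beats treating outright
```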
Decision Tree for Disease Model with Test
[Figure: decision tree. The test branch has expected utility 94.6: on a positive result, treat (EU 90); on a negative result, don't treat (EU 97.5). Without the test, treating yields 90 and not treating 70]


Expected Value of Information
• Expected Value of Sample Information (EVSI) is gain in expected utility
from doing a test
• EVSI for our medical example is 94.6 – 90 = 4.6
• Expected Value of Perfect Information (EVPI) is gain in expected utility
from perfect knowledge of an uncertain variable
• For medical example:
• Suppose an oracle will tell us whether patient is sick (sensitivity = specificity = 1)
• 30% chance we discover she is sick and treat - utility 90
• 70% chance we discover she is well and don’t treat - utility 100
• Expected utility if we ask the oracle is 0.3 × 90 + 0.7 × 100 = 97
• Therefore EVPI = 97 - 90 = 7
• EVPI ≥ EVSI ≥ 0
• EVSI = 0 if information won’t change your decision
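Both quantities can be computed in a few lines (a Python sketch of the slide's numbers; the course uses R):

```python
# EVPI and EVSI for the medical example.
p = 0.3
eu_no_info = max(90.0, 100 * (1 - p))   # best action without any test: treat, EU 90
eu_oracle = p * 90 + (1 - p) * 100      # perfect information: 97
evpi = eu_oracle - eu_no_info           # 7
eu_follow_test = 94.6                   # EU of FollowTest from the previous slide
evsi = eu_follow_test - eu_no_info      # 4.6
```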
Should We Collect Information?

• General Principle: Free information can never hurt


• Whether we should do the test depends on whether utility gain
EVSI=4.6 is greater than cost of information
• To analyze decision of whether to collect information:
• Find maximum expected utility option if we don't collect information
• Compute its expected utility U0
• Find EVPI
• Compare EVPI with cost of information
• If EVPI is too small in relation to cost then stop; otherwise, compute EVSI
• Compare EVSI with cost of information
• Collect information if expected utility gain is greater than cost of information



Strategy Regions for Medical Decision
(Without Test)
• Expected utility of not treating depends on the probability p = P(sD) of having the disease
• E[U | aT] = 90
• E[U | aN] = 0·p + 100(1 − p) = 100(1 − p)
• The strategy regions for the decision (without test):
• aT if p > 0.1
• aN if p < 0.1
• What are the strategy regions if we do a test?
[Figure: Expected Utility for Disease Problem. E[U(Treat)] flat at 90, E[U(NoTreat)] = 100(1 − p), crossing at p = 0.1; the gap at p = 0.3 is the expected gain from treatment]
Expected Utility of FollowTest Policy
as Function of Prior Probability p = P(sD)
• FollowTest strategy treats if test is positive and otherwise not
• Before doing the test, there are four possibilities for disease status and test result, with probabilities P(s, t) = P(t|s) P(s) = P(s|t) P(t):

World state        Probability P(t|s)P(s)   Action    Utility
Sick, Positive     0.95p                    Treat     90
Sick, Negative     0.05p                    NoTreat   0
Well, Positive     0.15(1−p)                Treat     90
Well, Negative     0.85(1−p)                NoTreat   100

• We treat if the test is positive and don't treat if it is negative, with utilities shown in the last column
• Multiplying probability by utility for each world state and summing gives the expected utility of FollowTest:

E[U | aF] = 0.95p × 90 + 0.05p × 0 + 0.15(1−p) × 90 + 0.85(1−p) × 100 = 98.5 − 13p

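The table's probability-times-utility sum can be checked in code (Python for illustration; the course uses R):

```python
# EU of FollowTest as a function of the prior p, summing over the four
# (state, test) combinations; it should reduce to 98.5 - 13p.

def eu_follow(p, sens=0.95, spec=0.85):
    return (sens * p * 90                 # sick, positive -> treat (90)
            + (1 - sens) * p * 0          # sick, negative -> no treat (0)
            + (1 - spec) * (1 - p) * 90   # well, positive -> treat (90)
            + spec * (1 - p) * 100)       # well, negative -> no treat (100)
```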


Strategy Regions for Medical Decision
(With Test)
• FollowTest: EU(aF) = 98.5 − 13p
• AlwaysTreat: EU(aT) = 90
• FollowTest is better when 98.5 − 13p > 90, i.e. p < 8.5/13 = 0.654
• NeverTreat: EU(aN) = 100(1 − p)
• FollowTest is better when 98.5 − 13p > 100(1 − p), i.e. p > 1.5/87 = 0.017
• EVSI is positive for 0.017 < p < 0.654; otherwise EVSI = 0
[Figure: strategy regions and sensitivity analysis for a costless test: expected utility of each strategy (E[U(Treat)], E[U(NoTreat)], E[U(FollowTest)]) versus P(Sick), with EVSI shown as the gap above the best no-test strategy]

Region                 Optimal Strategy
p < 0.017              NeverTreat
0.017 < p < 0.654      FollowTest
p > 0.654              AlwaysTreat



Decision Model for Whether to Test: Summary

• To model a decision problem we specify:


• Possible policies* [aT, aN, aF in medical example with test]
• Consequences [cWN, cWS, cDN as previously]
• States [(sD, tP), (sD, tN), (sW, tP), (sW, tN), in medical example with test]
• Probabilities of states
• P(sD, tP)=0.95p, P(sD, tN)=0.05p, P(sW, tP)=0.15(1-p), P(sW, tN)=0.85(1-p)
• For p=0.3, P(sD, tP)=0.285, P(sD, tN)=0.015, P(sW, tP)=0.105, P(sW, tN)=0.595
• Utilities for consequences [u(cWN) = 100, u(cWS) = 90; u(cDN) = 0 as previously]
• To find the best decision we calculate the expected utility for each
policy
• E[u(aT)] = 90 ; E[u(aN)] = 100(1 – p); E[u(aF)] = 98.5 – 13p in medical example with test
• Best policy is aN if p ≤ 0.017; aF if 0.017 < p < 0.654; aT if p ≥ 0.654
• For p=0.3, best policy is aF
*We use the word policy for a sequence of actions taken over time that can depend on information we acquire over time
Strategy Regions for Costly Test:

• FollowTest: EU(aF) = 98.5 − 13p − c (c is cost of test)
• AlwaysTreat: EU(aT) = 90
• FollowTest is better when 98.5 − 13p − c > 90, i.e. p < (8.5 − c)/13
• NeverTreat: EU(aN) = 100(1 − p)
• FollowTest is better when 98.5 − 13p − c > 100(1 − p), i.e. p > (1.5 + c)/87
• Test is worth doing if gain is larger than cost
• Range of values for which test is worth doing: (1.5 + c)/87 < p < (8.5 − c)/13
[Figure: expected utility of the optimal strategy versus P(Sick) for test costs c = 0, 1, 4, and c ≥ 7.2; the gain from doing a test with c = 1 at p = 0.3 is marked]
EVSI and Costly Test
• Information collection is optimal when EVSI is greater than cost of test
• Probability range where testing is optimal depends on cost of test
• In our example, for a test with cost c:
• Testing is optimal if (1.5 + c)/87 < p < (8.5 − c)/13
[Figure: expected utility of the optimal strategy with costly test (c = 0, 1, 4, ≥ 7.2) versus P(Sick)]
[Figure: EVSI as a function of prior probability, with the range of optimality of a test with c = 1 marked]
Summary : Value of Information and Strategy
Regions
• Collecting information may have value if it might change your decision
• Expected value of perfect information (EVPI) is utility gain from knowing true value of
uncertain variable
• Expected value of sample information (EVSI) is utility gain from available information
• In our example, EVSI is positive for 0.017 < p < 0.654
• If 0.017 ≤ p ≤ 0.1 EVSI is 87p - 1.5
• If 0.1 ≤ p ≤ 0.654 EVSI is 8.5 – 13p
• If p = 0.3 EVSI is 8.5 – 13p = 4.6 (testing is optimal)
• Costly information has value when EVSI is greater than cost of information
• In our example:
• If 0.017 ≤ p ≤ 0.1 Test if 87p - 1.5 > c (where c is cost of test)
• If 0.1 ≤ p ≤ 0.654 Test if 8.5 – 13p > c
• If p = 0.3 Test if 4.6 > c (test if c is less than 4.6)
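The piecewise EVSI formulas above can be expressed directly (Python for illustration; the course uses R):

```python
# EVSI as a function of the prior p: EU of the best strategy with a free test
# minus EU of the best strategy without it, floored at zero.

def evsi(p):
    eu_follow = 98.5 - 13 * p
    eu_best_no_test = max(90, 100 * (1 - p))
    return max(eu_follow - eu_best_no_test, 0.0)

# evsi(0.05) = 87*0.05 - 1.5 = 2.85; evsi(0.3) = 8.5 - 13*0.3 = 4.6
```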
What if a Probability is Unknown?
• The model for our medical example depends on several parameters
• Prior probability of disease
• Sensitivity of test
• Specificity of test
• Usually these probabilities are estimated from data and/or expert
judgment
• “Randomized clinical trials have established that Test T has sensitivity 0.95 and
specificity 0.85 for Disease D”
• “Given the presenting symptoms and my clinical judgment, I estimate a 30%
probability that the patient has Disease D.”
• How does a Bayesian combine data and expert judgment?
• Use clinical judgment to quantify uncertainty about the unknown probability as a probability distribution
• Gather data
• Use Bayes rule to obtain posterior distribution for the unknown probability
• If appropriate, use clinical judgment to adjust results of studies to apply to a
particular patient
Example: Bayesian Inference about a
Probability (with a very small sample)
• Assign prior distribution to possible values of disease probability p
• Although p can be any real number between zero and 1, we pretend there are only 20 equally spaced possible values
• (The unknown probability actually has a continuous range of values; we will treat continuous distributions later. For now we approximate with a finite set of values.)
• Our prior distribution is consistent with our estimate p = 0.3
• Observe 10 independent and identically distributed (iid) cases
• (X1, X2, X3, X4, X5, X6, X7, X8, X9, X10) = (0, 1, 0, 0, 0, 0, 1, 0, 0, 1)
• Cases 2, 7, and 10 have disease; the rest do not
• How do we find the posterior distribution of the unknown probability?
Posterior Distribution of Disease Parameter

• Applying Bayes Rule
• We observed 3 cases of disease in 10 trials
• Likelihood of data is p³(1 − p)⁷
• Multiply prior g(p) times likelihood p³(1 − p)⁷ and divide by sum:

g(p | x) = g(p) p³(1 − p)⁷ / Σ_{p′} g(p′) p′³(1 − p′)⁷

• Notice that the posterior distribution depends only on the number of cases with and without the disease
• The counts of cases with and without the disease are sufficient statistics for inference about p
Underscore indicates a vector: x = (x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
Bayesian Inference Example:
R Code

[Figure: R code and the resulting plot. Horizontal axis is p = P(Sick); height of bar is probability that P(Sick) = p]
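The slide's R code is not preserved in this transcript; an equivalent grid-approximation sketch (in Python for illustration, with a uniform prior assumed for simplicity, whereas the slides use a prior centered near 0.3):

```python
# Grid posterior for 3 diseased cases in 10 trials on 20 equally spaced values.
grid = [(i + 0.5) / 20 for i in range(20)]     # 0.025, 0.075, ..., 0.975
prior = [1 / 20] * 20                          # uniform prior (assumption)
like = [p**3 * (1 - p)**7 for p in grid]       # likelihood of the data
unnorm = [g * l for g, l in zip(prior, like)]
posterior = [u / sum(unnorm) for u in unnorm]  # normalize so bars sum to 1
# the posterior peaks near p = 0.3
```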
Bayesian Learning and Sample Size
• When the sample size is very large:
• The posterior distribution will be concentrated around the maximum likelihood estimate and is relatively
insensitive to the prior distribution
• We wonʼt go too far wrong if we act as if the parameter is equal to the maximum likelihood estimate
• When the sample size is very small:
• The posterior distribution is highly dependent on the prior distribution
• Reasonable people may disagree on the value of the parameter
• When the sample size is moderate, Bayesian learning can be a big improvement on either expert
judgment alone or data alone
• Achieving the benefit requires careful modeling
• This course will teach methods for constructing Bayesian models
• A powerful characteristic of the Bayesian approach is the flexibility to tailor results to moderate-
sized sub-populations
• Bayesian estimate “shrinks” estimates of sub-population parameters toward population average
• Amount of shrinkage depends on sample size and similarity of sub-population to overall population
• Shrinkage improves estimates for small to moderate sized sub-populations



Effect of Sample Size on Posterior Distribution
• These plots show the posterior distribution for Θ when:
• Prior distribution is uniform
• 20% of patients in sample have the disease
• Posterior distribution becomes more concentrated around 1/5 as sample size gets larger
[Figure: three histograms: sample size 5 (1 with, 4 without), sample size 20 (4 with, 16 without), sample size 80 (16 with, 64 without). Horizontal axis is θ = P(sD); height of bar is the probability that Θ = θ]


Sample Size and Impact of the Prior
Distribution
• Prior distribution favors low probabilities:
[Figure: prior distribution; posterior distribution for 1 case in 5 samples; posterior distribution for 10 cases in 50 samples]
• Prior distribution favors high probabilities:
[Figure: prior distribution; posterior distribution for 1 case in 5 samples; posterior distribution for 10 cases in 50 samples. Horizontal axis is θ = P(sD); height of bar is the probability that Θ = θ]
• Bayesian inference “shrinks” posterior distribution toward prior expectations
• Posterior distribution for smaller sample is more sensitive to prior distribution
• Posterior distribution for larger sample is less sensitive to prior distribution


Some Concepts of Probability
• Classical - Probability is a ratio of favorable cases to total (equipossible) cases
• Frequency - Probability is the limiting value as the number of trials becomes infinite of the
frequency of occurrence of a type of event
• Logical - Probability is a logical property of one’s state of information about a phenomenon
• Propensity - Probability is a propensity for certain kinds of physical event to occur
• Subjective - Probability is an ideal rational agent’s degree of belief about an uncertain event
• Algorithmic - The algorithmic probability of a finite sequence is the probability that a universal
computer fed a random input will give the sequence as output (related to Kolmogorov complexity)
• Game Theoretic - Probability is an agent’s optimal “announced certainty” for an event in a multi-
agent game in which agents receive rewards that depend on both forecasts and outcomes
Probability really is none of these things.
Probability can represent all of these things.
The Frequentist
• A frequentist believes:
• Probability can be legitimately applied only to repeatable problems
• Probability is an objective property in the real world
• Probability applies only to random processes
• Probabilities are associated only with collectives not individual events

• Frequentist Inference
• Data are drawn from a distribution of known form but with an unknown parameter (this includes
“nonparametric” statistics in which the unknown parameter is the distribution itself)
• Distribution may arise from explicit randomization or may be considered “close enough” to random
• Inference treats data as random and parameter as fixed
• For example: A sample X1,…XN is drawn from a normal distribution with mean Θ . A 95% confidence
interval is constructed. The interpretation is:
If an experiment like this were performed many times we would expect in 95% of the cases that an interval calculated
by the procedure we applied would include the true value of Θ .
• A frequentist can say nothing about θ in any individual experiment!
The Subjectivist

• A subjectivist believes:
• Probability is an expression of a rational agent’s degrees of belief about uncertain propositions.
• Rational agents may disagree. There is no “one correct probability.”
• If the agent receives feedback her assessed probabilities will in the limit converge to
observed frequencies
• Subjectivist Inference:
• Probability distributions are assigned to unknowns (parameters and observations).
• Condition on knowns; use probability to express uncertainty about unknowns
• For example: A sample X1,…,XN is drawn from a normal distribution with mean θ having prior distribution g(θ). A 95% posterior credible interval is constructed, and the result is the interval (3.7, 4.9). The interpretation is:
Given the prior distribution for θ and the observed data, the probability that θ lies between 3.7 and 4.9 is 95%.
• A subjectivist can draw conclusions about what we should believe about 𝜃 and
about what we should expect on the next trial



The Bayesian Resurgence
• Bayesian inference is as old as probability
• Subjective view fell into disfavor in 19th and early 20th centuries
• Positivism, empiricism, and quest for objectivity in science
• “Paradoxes” and systematic deviation of human judgment from Bayesian
“norm”
• There has been a recent resurgence
• Computational advances make calculation possible for complex models
• Bayesian models can coherently integrate many different kinds of information
• Physical cause and effect
• Logical implication
• Informed expert judgment
• Empirical observation
• Unified theory and methods for data-rich and data-poor problems
• Clear connection to decision making
Comparison: Understandability,
Subjectivity and Honest Reporting
• Often the Bayesian answer is what the decision maker really wants to hear.
• Untrained people often interpret results in the Bayesian way.
• Frequentists are disturbed by dependence of the posterior interval on “subjective” prior
distribution.
It is more important that stochastics provides a means of communication among researchers
whose personal beliefs about the phenomena under study may differ. If these beliefs are
allowed to contaminate the reporting of results, … how are the results of different researchers
to be compared?
- H. Dinges
• Bayesians say the prior distribution is not the only subjective element in an analysis.
• Bayesian probability statements are always subjective, but statistical analyses are often
done for public consumption. Whose probability distribution should be reported?
• For large samples, a good Bayesian analysis and a good frequentist analysis are usually
similar
• If results are sensitive to the prior distribution, a Bayesian analyst should report this sensitivity
and present a range of results obtained from a range of prior distributions
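Both points above, large-sample agreement and small-sample prior sensitivity, can be seen in a two-line conjugate example. A hypothetical Python sketch (the Beta priors and binomial data are assumptions chosen for illustration):

```python
def posterior_mean(a, b, successes, n):
    """Beta(a, b) prior + n binomial trials with s successes
    -> Beta(a+s, b+n-s) posterior; return its mean."""
    return (a + successes) / (a + b + n)

priors = [(1, 1), (5, 5), (20, 2)]      # three quite different prior beliefs
small = [posterior_mean(a, b, successes=3, n=5) for a, b in priors]
large = [posterior_mean(a, b, successes=600, n=1000) for a, b in priors]
print("n=5:   ", [round(m, 3) for m in small])    # priors matter a lot
print("n=1000:", [round(m, 3) for m in large])    # priors barely matter
```

Reporting the spread of the `small` results across priors is one concrete way to present the sensitivity analysis recommended above.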



Comparison: Generality

• Subjectivists can handle problems the frequentist approach cannot (in particular, problems with not enough data for sound frequentist inference).
• Frequentist statisticians say this comes at a price -- when there are
not enough data the result will be highly dependent on the prior
distribution.
• Subjectivists often apply frequentist techniques but with a Bayesian
interpretation
• Frequentists often apply Bayesian methods if they have good
frequency properties



Coherence and Rationality

• In the mid 20th century, several authors proposed systems of axioms intended to characterize rational behavior
• Proofs that decision-makers satisfying these axioms must be expected utility
maximizers
• Proofs that decision-makers not satisfying these axioms are vulnerable to
exploitation (“Dutch book”)
• Well-documented systematic departures of human decision-making from
expected utility maximization
• A decision-maker is called coherent if she behaves as a maximizer of
expected utility
• Should coherence be equated with rationality?
Axioms for Probability
De Groot, 1970

There is a qualitative relation of relative likelihood ≽ ("at least as likely as", with ≻ its strict part and ~ its equivalence part) that operates on pairs of events and satisfies the following conditions:

SP1 (Comparability). For any two uncertain events A and B, exactly one of the following relations holds: A ≻ B, A ≺ B, or A ~ B.
SP2 (Union of disjoint events). If A1, A2, B1, and B2 are four events such that A1∩A2 = ∅, B1∩B2 = ∅, and Ai ≽ Bi for i = 1, 2, then A1∪A2 ≽ B1∪B2. If in addition Ai ≻ Bi for either i = 1 or i = 2, then A1∪A2 ≻ B1∪B2.
SP3 (Null event). If A is any event, then ∅ ≼ A. Furthermore, there is some event A0 for which ∅ ≺ A0.
SP4 (Decreasing sequences). If A1 ⊃ A2 ⊃ … is a decreasing sequence of events, and B is some event such that Ai ≽ B for i = 1, 2, …, then ⋂i Ai ≽ B.
SP5 (Existence of a uniform distribution). There is an experiment with a numerical outcome x between the values of 0 and 1 such that, if Ai is the event that the outcome lies in the interval ai ≤ x ≤ bi for i = 1, 2, then A1 ≼ A2 if and only if (b1-a1) ≤ (b2-a2).
Axioms for Utility
Watson and Buede, 1987

A reward is a prize the decision maker cares about. A lottery is a situation in which the decision maker will receive one of the possible rewards, where the reward to be received is governed by a probability distribution. There is a qualitative relation of relative preference ≽* (with ≻* its strict part and ~* its equivalence part) that operates on lotteries and satisfies the following conditions:

SU1 (Comparability and transitivity). For any two lotteries L1 and L2, either L1 ≻* L2, L1 ≺* L2, or L1 ~* L2. Furthermore, if L1, L2, and L3 are any lotteries such that L1 ≽* L2 and L2 ≽* L3, then L1 ≽* L3.
SU2 (Lottery equivalence). If r1, r2, and r3 are rewards such that r1 ≽* r2 ≽* r3, then there exists a probability p such that [r1: p; r3: (1-p)] ~* r2, where [r1: p; r3: (1-p)] is a lottery that pays r1 with probability p and r3 with probability (1-p).
SU3 (Substitutability of equivalent rewards). If r1 ~* r2 are rewards, then for any probability p and any reward r3, [r1: p; r3: (1-p)] ~* [r2: p; r3: (1-p)].
SU4 (Higher chance of better reward). If r1 ≻* r2 are rewards, then [r1: p; r2: (1-p)] ≻* [r1: q; r2: (1-q)] if and only if p > q.
SU5 (Compound lottery). Consider three lotteries Li = [r1: pi; r2: (1-pi)], i = 1, 2, 3, giving different probabilities of the two rewards r1 and r2. Suppose lottery M gives entry to lottery L2 with probability q and to lottery L3 with probability 1-q. Then L1 ~* M if and only if p1 = qp2 + (1-q)p3.
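SU5's reduction of a compound lottery to a simple one can be checked numerically. A small Python sketch (the probabilities are made up, and assigning utilities 1 and 0 to r1 and r2 is just a convenient normalization):

```python
def lottery_eu(p, u1, u2):
    """Expected utility of the two-outcome lottery [r1: p; r2: (1-p)]."""
    return p * u1 + (1 - p) * u2

u1, u2 = 1.0, 0.0                       # normalized utilities for r1, r2
q, p2, p3 = 0.4, 0.9, 0.25              # hypothetical probabilities
p1 = q * p2 + (1 - q) * p3              # SU5's condition for L1 ~* M

eu_L1 = lottery_eu(p1, u1, u2)
# M enters lottery L2 with probability q and L3 with probability 1-q
eu_M = q * lottery_eu(p2, u1, u2) + (1 - q) * lottery_eu(p3, u1, u2)
print(p1, eu_L1, eu_M)
```

With p1 chosen by SU5's formula, the simple lottery L1 and the compound lottery M have identical expected utility, which is exactly why a coherent agent is indifferent between them.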
Probabilities and Utilities
• If your beliefs satisfy SP1-SP5, then there is a probability distribution Pr(⋅) over events such that for any two events A1 and A2, Pr(A1) ≥ Pr(A2) if and only if A1 ≽ A2.
• If your preferences satisfy SU1-SU5, then there is a utility function u(⋅) defined on rewards such that for any two lotteries L1 and L2, L1 ≽* L2 if and only if E[u(L1)] ≥ E[u(L2)], where E[⋅] denotes the expected value with respect to the probability distribution Pr(⋅).
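These representation theorems say coherent choice reduces to ranking lotteries by expected utility. A Python sketch of that ranking (the logarithmic utility and the two lotteries are illustrative assumptions, not anything from the slides):

```python
import math

def expected_utility(lottery, u):
    """lottery: list of (reward, probability) pairs; probabilities sum to 1."""
    return sum(p * u(r) for r, p in lottery)

u = lambda x: math.log1p(x)             # an assumed concave (risk-averse) utility
L1 = [(100, 0.5), (0, 0.5)]             # risky lottery: expected value 50
L2 = [(45, 1.0)]                        # sure thing: 45
best = max([("L1", L1), ("L2", L2)], key=lambda t: expected_utility(t[1], u))
print(best[0])
```

Note that under a risk-neutral (linear) utility L1 would win on expected value alone; the concave u encodes risk aversion, so the sure thing is preferred.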



Why be a Bayesian?
• Arguments from theory
• A coherent decision maker uses probability to represent uncertainty, uses
utility to represent value, and maximizes expected utility
• If you are not coherent then someone can make "Dutch book" on you (turn
you into a "money pump")
• Pragmatic arguments
• Useful and principled methodology for modeling inference, decision and
learning
• Analyze engineering tradeoffs between accuracy, complexity and cost
• Represent and incorporate both empirical data and informed engineering
judgment
• Handle small, moderate and large sample sizes and parameter sets
• Interpretability of results and understandability of model
• Arguments from experience
• Successful applications attributed to decision theory
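A Dutch book in miniature: if someone's prices for a bet on A and a bet on not-A sum to more than 1, a bookie can sell both bets and lock in a profit no matter which event occurs. The numbers below are hypothetical:

```python
# Incoherent prices: bets on A and on not-A each cost 0.6 per unit stake,
# so the implied Pr(A) + Pr(not A) is 1.2 > 1.
price_A, price_notA, stake = 0.6, 0.6, 1.0

collected = (price_A + price_notA) * stake   # premiums from selling both bets
payout = stake                               # exactly one of the two bets pays off
guaranteed_profit = collected - payout       # positive whatever happens
print(guaranteed_profit)
```

Repeating the trade turns the incoherent bettor into a "money pump," which is the exploitation the Dutch book argument refers to.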
What do you think?



Unit 1: Summary and Synthesis

• Bayesian statistics is a theory of rational belief dynamics


• We took a broad-brush tour of Bayesian methodology
• We applied Bayesian thinking to a simplified medical example that illustrates
many of the concepts we will be learning this semester
• Bayesian decision theory provides a methodology for rational choice under
uncertainty
• The twentieth century saw a resurgence of interest in subjective probability and
an increased understanding of the appropriate role of subjectivity in science
• Most statistics texts and courses take a frequentist approach but this is changing
• The inventors of probability theory thought of it as a logic of enlightened rational reasoning.
In the nineteenth century this was replaced by a view of probability as measuring “objective”
propensities of “intrinsically random” phenomena
• Bayesian methods often require more computational power than traditional frequentist
methods
• The computer revolution has enabled the Bayesian resurgence



References for Unit 1
• Bashir, S.A., Getting Started in R, http://www.luchsinger-mathematics.ch/Bashir.pdf
• Bayes, T., An Essay towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53:370–418, 1763.
• Dawid, A.P. and Vovk, V.G. (1999), Prequential Probability: Principles and Properties, Bernoulli, 5:125–162.
• de Finetti, B., Theory of Probability: A Critical Introductory Treatment. New York: Wiley, 1974.
• DeGroot, M.H., Optimal Statistical Decisions, McGraw-Hill, 1970.
• Gelman, A., Carlin, J., Stern, H. and Rubin, D., Bayesian Data Analysis (2nd edition), Chapman & Hall, 2004. Chapter 1.
• Hájek, A., "Interpretations of Probability", The Stanford Encyclopedia of Philosophy (Summer 2003 Edition), Edward N. Zalta (ed.), URL = <http://plato.stanford.edu/archives/sum2003/entries/probability-interpret/>.
• Jaynes, E., Probability Theory: The Logic of Science, Cambridge University Press, 2003.
• Lee, P., Bayesian Statistics: An Introduction, 4th ed., Springer, 2012. Chapter 1.
• Li, M. and Vitányi, P., An Introduction to Kolmogorov Complexity and Its Applications (2nd ed.), Springer-Verlag, 2005.
• Nau, R.F. (1999), Arbitrage, Incomplete Models, and Interactive Rationality, working paper, Fuqua School of Business, Duke University.
• Neapolitan, R., Learning Bayesian Networks, Prentice Hall, 2003.
• Savage, L.J., The Foundations of Statistics. Dover, 1972.
• Shafer, G., Probability and Finance: It's Only a Game, Wiley, 2001.
• von Mises, R., Probability, Statistics and Truth, revised English edition, New York: Macmillan, 1957.
• Watson, S.R. and Buede, D.M., Decision Synthesis: The Principles and Practice of Decision Analysis, Cambridge University Press, 1987.

