Chapter 12 discusses the quantification of uncertainty in decision-making, highlighting issues such as partial observability, noisy sensors, and the complexity of modeling outcomes. It introduces decision theory, which combines probability and utility theory to maximize expected utility, and outlines methods for handling uncertainty, including probabilistic assertions and Bayesian probability. The chapter also covers fundamental concepts of probability, including random variables, probability distributions, and the laws governing them, emphasizing the importance of probabilistic inference in evaluating decisions under uncertainty.


Chapter 12

Quantifying Uncertainty

Uncertainty
- The real world is rife with uncertainty!
- E.g., if I leave for SFO 60 minutes before my flight, will I be there in time?
- Problems:
  - partial observability (road state, other drivers' plans, etc.)
  - noisy sensors (radio traffic reports, Google Maps)
  - immense complexity of modelling and predicting traffic, security lines, etc.
  - lack of knowledge of world dynamics (will a tire burst? need a COVID test?)
- Probabilistic assertions summarize the effects of ignorance and laziness
- Combine probability theory + utility theory -> decision theory
- Maximize expected utility: a* = argmax_a Σ_s P(s | a) U(s)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Outline
- Uncertainty
- Probability
- Syntax and Semantics
- Inference
- Independence and Bayes' Rule

Uncertainty
Let action At = leave for airport t minutes before flight

Will At get me there on time?

Problems:
1) partial observability (road state, other drivers' plans, etc.)
2) noisy sensors (traffic reports)
3) uncertainty in action outcomes (flat tire, etc.)
4) immense complexity of modelling and predicting traffic

Hence a purely logical approach either
1) risks falsehood: "A25 will get me there on time"
or 2) leads to conclusions that are too weak for decision making:
"A25 will get me there on time if there's no accident on the bridge and it doesn't rain and my tires remain intact, etc., etc."
(A1440 might reasonably be said to get me there on time, but I'd have to stay overnight in the airport...)
Methods for handling uncertainty
Default or nonmonotonic logic:
Assume my car does not have a flat tire
Assume A25 works unless contradicted by evidence
Issues: What assumptions are reasonable? How to handle contradiction?

Rules with fudge factors:
A25 →0.3 AtAirportOnTime
Sprinkler →0.99 WetGrass
WetGrass →0.7 Rain
Issues: Problems with combination, e.g., does Sprinkler cause Rain?

Probability
Given the available evidence,
A25 will get me there on time with probability 0.04
Mahaviracarya (9th c.), Cardano (1565): theory of gambling
(Fuzzy logic handles degree of truth, NOT uncertainty; e.g.,
WetGrass is true to degree 0.2)
Probability
Probabilistic assertions summarize effects of
laziness: failure to enumerate exceptions, qualifications, etc.
ignorance: lack of relevant facts, initial conditions, etc.
Subjective or Bayesian probability:
Probabilities relate propositions to one's own state of knowledge,
e.g., P(A25 | no reported accidents) = 0.06
These are not claims of a "probabilistic tendency" in the current situation (but might be learned from past experience of similar situations)
Probabilities of propositions change with new evidence: e.g.,
P(A25 | no reported accidents, 5 a.m.) = 0.15
Analogous to logical entailment status KB ⊨ α, not truth.

Making decisions under uncertainty
Suppose I believe the following:
P(A25 gets me there on time | ...) = 0.04
P(A90 gets me there on time | ...) = 0.70
P(A120 gets me there on time | ...) = 0.95
P(A1440 gets me there on time | ...) = 0.9999

Which action to choose?

Depends on my preferences for missing the flight vs. airport cuisine, etc.
Utility theory is used to represent and infer preferences

Decision theory = utility theory + probability theory
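The choice among the At actions can be sketched as a small maximum-expected-utility computation. The probabilities below are the slide's values; the utility numbers (u_make, u_miss, u_wait_per_min) are hypothetical, chosen only to illustrate the trade-off between missing the flight and waiting at the airport:

```python
# P(on time | A_t) for each departure time t, from the slide.
P_on_time = {25: 0.04, 90: 0.70, 120: 0.95, 1440: 0.9999}

def expected_utility(t, u_make=1000, u_miss=0, u_wait_per_min=-1):
    # EU(A_t) = P(on time | A_t) U(make flight)
    #         + (1 - P) U(miss flight) + waiting cost for t minutes
    p = P_on_time[t]
    return p * u_make + (1 - p) * u_miss + u_wait_per_min * t

# a* = argmax_t EU(A_t)
best = max(P_on_time, key=expected_utility)
```

With these (made-up) utilities, leaving 120 minutes early wins: A1440 is almost certain to succeed but the waiting cost swamps it.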

Basic laws of probability
- Begin with a set Ω of possible worlds
  - E.g., the 6 possible rolls of a die: {1, 2, 3, 4, 5, 6}
- A probability model assigns a number P(ω) to each world ω
  - E.g., P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- These numbers must satisfy
  - 0 ≤ P(ω) ≤ 1
  - Σ_ω∈Ω P(ω) = 1

Basic laws contd.
- An event is any subset of Ω
  - E.g., "roll < 4" is the set {1, 2, 3}
  - E.g., "roll is odd" is the set {1, 3, 5}
- The probability of an event is the sum of the probabilities of its worlds
  - P(A) = Σ_ω∈A P(ω)
  - E.g., P(roll < 4) = P(1) + P(2) + P(3) = 1/2
- De Finetti (1931): anyone who bets according to probabilities that violate these laws can be forced to lose money on every set of bets
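These axioms can be sketched directly for the die example: a probability model is a map from worlds to numbers, and an event probability is a sum over the worlds in the event (exact fractions avoid floating-point noise):

```python
from fractions import Fraction

# Probability model: each of the 6 worlds gets P(w) = 1/6.
P = {w: Fraction(1, 6) for w in range(1, 7)}

def prob(event):
    # P(A) = sum of P(w) over the worlds w in event A
    return sum(P[w] for w in event)

roll_lt_4 = {1, 2, 3}   # the event "roll < 4"
roll_odd = {1, 3, 5}    # the event "roll is odd"
```

Here prob(roll_lt_4) comes out to 1/2, matching the slide.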
Random variables
- A random variable is a function from sample points to some range, e.g., the reals or Booleans
  - e.g., Odd(1) = true
- P induces a probability distribution for any r.v. X:
  - P(X = xi) = Σ_{ω : X(ω) = xi} P(ω)
  - e.g., P(Odd = true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2

Propositions
Think of a proposition as the event (set of sample points) where the proposition is true
Given Boolean random variables A and B:
event a = set of sample points where A(ω) = true
event ¬a = set of sample points where A(ω) = false
event a ∧ b = points where A(ω) = true and B(ω) = true
Often in AI applications, the sample points are defined by the values of a set of random variables, i.e.,
the sample space is the Cartesian product of the ranges of the variables
With Boolean variables, sample point = propositional logic model, e.g.,
A = true, B = false, or a ∧ ¬b

Proposition = disjunction of atomic events in which it is true, e.g.,
(a ∨ b) ≡ (¬a ∧ b) ∨ (a ∧ ¬b) ∨ (a ∧ b)
P(a ∨ b) = P(¬a ∧ b) + P(a ∧ ¬b) + P(a ∧ b)
Why use probability?
The definitions imply that certain logically related events must have related probabilities
- E.g., P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

de Finetti (1931): an agent who bets according to probabilities that violate these axioms can be forced to bet so as to lose money regardless of outcome.

Probability Distributions
- Associate a probability with each value; the probabilities sum to 1
- Temperature, Weather, and their joint distribution:

  P(T)          P(W)             P(T, W)
  T    P        W      P                 T=hot  T=cold
  hot  0.5      sun    0.6       sun     0.45   0.15
  cold 0.5      rain   0.1       rain    0.02   0.08
                fog    0.3       fog     0.03   0.27
                meteor 0.0       meteor  0.00   0.00
Syntax for propositions
Propositional or Boolean random variables e.g., Cavity (do I have a cavity?)
Cavity = true is a proposition, also written cavity
Discrete random variables (finite or infinite)
e.g., Weather is one of (sunny, rain, cloudy, snow)
Weather = rain is a proposition
Values must be exhaustive and mutually exclusive
Continuous random variables (bounded or unbounded) e.g.,
Temp = 21.6; also allow, e.g., Temp < 22.0.
Arbitrary Boolean combinations of basic propositions

Making possible worlds
- In many cases we
  - begin with random variables and their domains
  - construct possible worlds as assignments of values to all variables
- E.g., two dice rolls Roll1 and Roll2
  - How many possible worlds?
  - What are their probabilities?
- Size of distribution for n variables with range size d? d^n
- For all but the smallest distributions, cannot write out by hand!

Probabilities of events
- Recall that the probability of an event is the sum of the probabilities of its worlds:
  - P(A) = Σ_ω∈A P(ω)
- So, given a joint distribution over all variables, we can compute any event probability!

  Joint distribution P(T, W):
          T=hot  T=cold
  sun     0.45   0.15
  rain    0.02   0.08
  fog     0.03   0.27
  meteor  0.00   0.00

- Probability that it's hot AND sunny?
- Probability that it's hot?
- Probability that it's hot OR not foggy?
Marginal Distributions
- Marginal distributions are sub-tables which eliminate variables
- Marginalization (summing out): collapse a dimension by adding
  - P(X=x) = Σ_y P(X=x, Y=y)

  P(T, W):                      P(W):
          T=hot  T=cold
  sun     0.45   0.15     →     sun    0.60
  rain    0.02   0.08           rain   0.10
  fog     0.03   0.27           fog    0.30
  meteor  0.00   0.00           meteor 0.00

  P(T):  hot 0.50, cold 0.50
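Summing out a variable can be sketched with the joint table stored as a dict keyed by (temperature, weather) pairs:

```python
# Joint distribution P(T, W) from the slide.
joint = {
    ('hot', 'sun'): 0.45, ('cold', 'sun'): 0.15,
    ('hot', 'rain'): 0.02, ('cold', 'rain'): 0.08,
    ('hot', 'fog'): 0.03, ('cold', 'fog'): 0.27,
    ('hot', 'meteor'): 0.00, ('cold', 'meteor'): 0.00,
}

def marginal(joint, axis):
    # Sum out the other variable: P(X=x) = sum_y P(X=x, Y=y)
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

P_T = marginal(joint, 0)   # hot 0.50, cold 0.50
P_W = marginal(joint, 1)   # sun 0.60, rain 0.10, fog 0.30, meteor 0.00
```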
Conditional Probabilities
- A simple relation between joint and conditional probabilities
- In fact, this is taken as the definition of a conditional probability:

  P(a | b) = P(a, b) / P(b)

  Joint distribution P(T, W):
          T=hot  T=cold
  sun     0.45   0.15
  rain    0.02   0.08
  fog     0.03   0.27
  meteor  0.00   0.00

  P(W=s | T=c) = P(W=s, T=c) / P(T=c) = 0.15 / 0.50 = 0.3

  where P(T=c) = P(W=s, T=c) + P(W=r, T=c) + P(W=f, T=c) + P(W=m, T=c)
               = 0.15 + 0.08 + 0.27 + 0.00 = 0.50
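The same computation as a sketch: the denominator P(T=t) is itself a marginal, obtained by summing the joint over all weather values:

```python
# Joint distribution P(T, W) from the slide.
joint = {
    ('hot', 'sun'): 0.45, ('cold', 'sun'): 0.15,
    ('hot', 'rain'): 0.02, ('cold', 'rain'): 0.08,
    ('hot', 'fog'): 0.03, ('cold', 'fog'): 0.27,
    ('hot', 'meteor'): 0.00, ('cold', 'meteor'): 0.00,
}

def conditional(w, t):
    # P(W=w | T=t) = P(W=w, T=t) / P(T=t), with P(T=t) computed
    # by summing the joint over all weather values
    p_t = sum(p for (ti, _), p in joint.items() if ti == t)
    return joint[(t, w)] / p_t

# conditional('sun', 'cold') reproduces 0.15 / 0.50 = 0.3
```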
Conditional Distributions
- Distributions for one set of variables given another set

  P(W | T=hot):        P(W | T=cold):
  sun    0.90          sun    0.30
  rain   0.04          rain   0.16
  fog    0.06          fog    0.54
  meteor 0.00          meteor 0.00
Normalizing a distribution
- (Dictionary) To bring or restore to a normal condition
- Procedure: make all entries sum to ONE
- Multiply each entry by α = 1 / (sum over all entries)

  P(W, T=c):                      P(W | T=c) = P(W, T=c) / P(T=c) = α P(W, T=c):
  sun    0.15        Normalize    sun    0.30
  rain   0.08           →         rain   0.16
  fog    0.27                     fog    0.54
  meteor 0.00                     meteor 0.00

  α = 1 / 0.50 = 2
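The normalize step is a one-liner over any unnormalized table; here it is sketched on the T=cold column of P(W, T):

```python
def normalize(dist):
    # Multiply each entry by alpha = 1 / (sum over all entries)
    alpha = 1.0 / sum(dist.values())
    return {k: alpha * v for k, v in dist.items()}

# The T=cold column of P(W, T), selected from the joint table:
col = {'sun': 0.15, 'rain': 0.08, 'fog': 0.27, 'meteor': 0.00}
P_W_given_cold = normalize(col)   # alpha = 1/0.50 = 2
```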
The Product Rule
- Sometimes we have conditional distributions but want the joint:

  P(a | b) P(b) = P(a, b),  i.e.,  P(a | b) = P(a, b) / P(b)
The Product Rule: Example

  P(W | T) P(T) = P(W, T)

  P(W | T):             P(T):        P(W, T):
          hot   cold    hot  0.5             T=hot  T=cold
  sun     0.90  0.30    cold 0.5     sun     0.45   0.15
  rain    0.04  0.16                 rain    0.02   0.08
  fog     0.06  0.54                 fog     0.03   0.27
  meteor  0.00  0.00                 meteor  0.00   0.00
The Chain Rule
- A joint distribution can be written as a product of conditional distributions by repeated application of the product rule:
  - P(x1, x2, x3) = P(x3 | x1, x2) P(x1, x2) = P(x3 | x1, x2) P(x2 | x1) P(x1)
  - P(x1, x2, …, xn) = Π_i P(xi | x1, …, xi−1)

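The chain-rule identity can be checked numerically. The joint below is a hypothetical distribution over three binary variables, chosen only so every conditional in the factorization is well defined:

```python
import itertools

# Hypothetical joint over three binary variables; the weights
# 1..8 sum to 36, so dividing by 36 makes it a distribution.
joint = {}
for i, v in enumerate(itertools.product([0, 1], repeat=3)):
    joint[v] = (i + 1) / 36.0

def marg(**fixed):
    # Marginal probability of a partial assignment, e.g. marg(x1=0)
    idx = {'x1': 0, 'x2': 1, 'x3': 2}
    return sum(p for v, p in joint.items()
               if all(v[idx[k]] == val for k, val in fixed.items()))

def chain(a, b, c):
    # P(x1) P(x2 | x1) P(x3 | x1, x2), each conditional as a ratio
    return (marg(x1=a)
            * marg(x1=a, x2=b) / marg(x1=a)
            * marg(x1=a, x2=b, x3=c) / marg(x1=a, x2=b))
```

For every assignment, chain(...) agrees with the joint entry, as the rule promises.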
Probabilistic Inference
- Probabilistic inference: compute a desired probability from a probability model
- Typically for a query variable given evidence
  - E.g., P(airport on time | no accidents) = 0.90
  - These represent the agent's beliefs given the evidence
- Probabilities change with new evidence:
  - P(airport on time | no accidents, 5 a.m.) = 0.95
  - P(airport on time | no accidents, 5 a.m., raining) = 0.80
  - Observing new evidence causes beliefs to be updated
Inference by Enumeration
- General case:
  - Evidence variables: E1, …, Ek = e1, …, ek
  - Query* variable: Q
  - Hidden variables: H1, …, Hr
  (together these are all the variables X1, …, Xn)
  (* works fine with multiple query variables, too)
- We want: P(Q | e1, …, ek)
- The probability model P(X1, …, Xn) is given
- Step 1: Select the entries consistent with the evidence
- Step 2: Sum out H from the model to get the joint of the query and evidence:
  P(Q, e1, …, ek) = Σ_{h1,…,hr} P(Q, h1, …, hr, e1, …, ek)
- Step 3: Normalize:
  P(Q | e1, …, ek) = α P(Q, e1, …, ek)
Inference by Enumeration
- P(W)?

  Season  Temp  Weather  P
  summer  hot   sun      0.35
  summer  hot   rain     0.01
  summer  hot   fog      0.01
  summer  hot   meteor   0.00
  summer  cold  sun      0.10
  summer  cold  rain     0.05
  summer  cold  fog      0.09
  summer  cold  meteor   0.00
  winter  hot   sun      0.10
  winter  hot   rain     0.01
  winter  hot   fog      0.02
  winter  hot   meteor   0.00
  winter  cold  sun      0.15
  winter  cold  rain     0.20
  winter  cold  fog      0.18
  winter  cold  meteor   0.00
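The three steps (select, sum out, normalize) can be sketched as one function over this table. The entries are transcribed as printed on the slide; since `infer` normalizes at the end, the answers are proper distributions either way:

```python
# The slide's full joint P(Season, Temp, Weather), transcribed verbatim.
table = {
    ('summer', 'hot', 'sun'): 0.35, ('summer', 'hot', 'rain'): 0.01,
    ('summer', 'hot', 'fog'): 0.01, ('summer', 'hot', 'meteor'): 0.00,
    ('summer', 'cold', 'sun'): 0.10, ('summer', 'cold', 'rain'): 0.05,
    ('summer', 'cold', 'fog'): 0.09, ('summer', 'cold', 'meteor'): 0.00,
    ('winter', 'hot', 'sun'): 0.10, ('winter', 'hot', 'rain'): 0.01,
    ('winter', 'hot', 'fog'): 0.02, ('winter', 'hot', 'meteor'): 0.00,
    ('winter', 'cold', 'sun'): 0.15, ('winter', 'cold', 'rain'): 0.20,
    ('winter', 'cold', 'fog'): 0.18, ('winter', 'cold', 'meteor'): 0.00,
}

def infer(query_index, evidence):
    # Step 1: keep rows consistent with the evidence.
    # Step 2: sum out everything except the query variable.
    # Step 3: normalize.
    unnorm = {}
    for row, p in table.items():
        if all(row[i] == v for i, v in evidence.items()):
            q = row[query_index]
            unnorm[q] = unnorm.get(q, 0.0) + p
    alpha = 1.0 / sum(unnorm.values())
    return {q: alpha * p for q, p in unnorm.items()}

P_W = infer(2, {})                    # P(W)
P_W_winter = infer(2, {0: 'winter'})  # P(W | winter)
```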
Inference by Enumeration
- P(W)?
- P(W | winter)?
  (same joint table as above)
Inference by Enumeration
- P(W)?
- P(W | winter)?
- P(W | winter, hot)?
  (same joint table as above)
Inference by enumeration
Start with the joint distribution:

                toothache           ¬toothache
             catch   ¬catch      catch   ¬catch
  cavity     .108    .012        .072    .008
  ¬cavity    .016    .064        .144    .576

For any proposition φ, sum the atomic events where it is true:

  P(φ) = Σ_{ω : ω ⊨ φ} P(ω)

(Catch: the dentist's probe catches in the tooth)
Inference by enumeration
Using the same joint distribution, sum the atomic events where the proposition is true:

  P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

Inference by enumeration
Similarly:

  P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28

Inference by enumeration
Can also compute conditional probabilities:

  P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
                         = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064)
                         = 0.4

Normalization

                toothache           ¬toothache
             catch   ¬catch      catch   ¬catch
  cavity     .108    .012        .072    .008
  ¬cavity    .016    .064        .144    .576

The denominator can be viewed as a normalization constant α:

  P(Cavity | toothache) = α P(Cavity, toothache)
    = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
    = α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩]
    = α ⟨0.12, 0.08⟩
    = ⟨0.6, 0.4⟩

General idea: compute the distribution on the query variable by fixing evidence variables and summing over hidden variables

Inference by enumeration, contd.
- Let X be the query variable (e.g., Cavity)
- We want the posterior distribution of X given specific values e for the evidence variables E (e.g., Toothache)
- The hidden variables Y are the remaining, unobserved ones (e.g., Catch)
- Then the required summation of joint entries is done by summing out the hidden variables:

  P(X | E = e) = α P(X, E = e) = α Σ_y P(X, E = e, Y = y)

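The formula P(X | E = e) = α Σ_y P(X, E = e, Y = y) can be sketched on the dental joint: sum out Catch, then normalize over Cavity:

```python
# Toothache/Catch/Cavity joint from the slide,
# keyed as (cavity, toothache, catch).
joint = {
    (True,  True,  True): 0.108, (True,  True,  False): 0.012,
    (True,  False, True): 0.072, (True,  False, False): 0.008,
    (False, True,  True): 0.016, (False, True,  False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def query_cavity(toothache):
    # P(Cavity | toothache) = alpha * sum_catch P(Cavity, toothache, catch)
    unnorm = {c: sum(joint[(c, toothache, catch)] for catch in (True, False))
              for c in (True, False)}
    alpha = 1.0 / sum(unnorm.values())
    return {c: alpha * p for c, p in unnorm.items()}

dist = query_cavity(True)   # reproduces the slide's <0.6, 0.4>
```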
Inference by Enumeration
- Obvious problems:
  - Worst-case time complexity O(d^n)
  - Space complexity O(d^n) to store the joint distribution
  - O(d^n) data points needed to estimate the entries in the joint distribution
Bayes' Rule
- Write the product rule both ways:

  P(a | b) P(b) = P(a, b) = P(b | a) P(a)

- Dividing the left and right expressions, we get:

  P(a | b) = P(b | a) P(a) / P(b)

- Why is this at all helpful?
  - Lets us build one conditional from its reverse
  - Often one conditional is tricky but the other one is simple
  - Describes an "update" step from prior P(a) to posterior P(a | b)
  - Foundation of many systems (e.g., decision support systems)
- In the running for most important AI equation!


Inference with Bayes' Rule
- Example: diagnostic probability from causal probability:

  P(cause | effect) = P(effect | cause) P(cause) / P(effect)

- Example:
  - M: meningitis, S: stiff neck
  - Example givens: P(s | m) = 0.8, P(m) = 0.0001, P(s) = 0.01

  P(m | s) = P(s | m) P(m) / P(s) = (0.8 × 0.0001) / 0.01 = 0.008

- Why do we have P(s | m) but not P(m | s)?
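The meningitis computation as a sketch of the diagnostic-from-causal direction:

```python
def bayes(p_effect_given_cause, p_cause, p_effect):
    # P(cause | effect) = P(effect | cause) P(cause) / P(effect)
    return p_effect_given_cause * p_cause / p_effect

# Slide's example givens: P(s|m)=0.8, P(m)=0.0001, P(s)=0.01
p_m_given_s = bayes(0.8, 0.0001, 0.01)   # = 0.008
```

Even though a stiff neck is very likely given meningitis, the posterior is tiny because the prior P(m) is tiny.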
Independence
- Two variables X and Y are (absolutely) independent if
  - ∀x,y  P(x, y) = P(x) P(y)
- i.e., the joint distribution factors into a product of two simpler distributions
- Equivalently, via the product rule P(x, y) = P(x | y) P(y):
  - P(x | y) = P(x)  or  P(y | x) = P(y)
- Example: two dice rolls Roll1 and Roll2
  - P(Roll1=5, Roll2=3) = P(Roll1=5) P(Roll2=3) = 1/6 × 1/6 = 1/36
  - P(Roll2=3 | Roll1=5) = P(Roll2=3)

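The dice example can be checked directly: build the joint as a product and confirm that conditioning on Roll1 leaves the distribution of Roll2 unchanged:

```python
from fractions import Fraction

sixth = Fraction(1, 6)
# Independent rolls: P(r1, r2) = P(r1) P(r2) = 1/36 for every pair.
joint = {(r1, r2): sixth * sixth
         for r1 in range(1, 7) for r2 in range(1, 7)}

p_roll1_5 = sum(p for (r1, _), p in joint.items() if r1 == 5)
p_roll2_3_given_roll1_5 = joint[(5, 3)] / p_roll1_5   # = P(Roll2=3)
```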
Independence
A and B are independent iff
P(A | B) = P(A)  or  P(B | A) = P(B)  or  P(A, B) = P(A) P(B)

With this independence, the joint P(Toothache, Catch, Cavity, Weather) decomposes into P(Toothache, Catch, Cavity) P(Weather)

32 entries reduced to 12;
for n independent biased coins, 2^n → n
Absolute independence is powerful but rare
Dentistry is a large field with hundreds of variables, none of which are independent.
What to do?
Example: Independence
- n fair, independent coin flips:

  P(X1):     P(X2):    …    P(Xn):
  H 0.5      H 0.5          H 0.5
  T 0.5      T 0.5          T 0.5

- The full joint P(X1, X2, …, Xn) has 2^n entries; with independence, n numbers suffice
Conditional Independence
- Conditional independence is our most basic and robust form of knowledge about uncertain environments.
- X is conditionally independent of Y given Z if and only if:
  - ∀x,y,z  P(x | y, z) = P(x | z)
- or, equivalently, if and only if
  - ∀x,y,z  P(x, y | z) = P(x | z) P(y | z)
Conditional independence
P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Conditional independence contd.
- Write out the full joint distribution using the chain rule:
  P(Toothache, Catch, Cavity)
    = P(Toothache | Catch, Cavity) P(Catch, Cavity)
    = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
    = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
- i.e., 2 + 2 + 1 = 5 independent numbers
- In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n.
- Conditional independence is our most basic and robust form of knowledge about uncertain environments.
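The five-number representation can be sketched concretely. The parameter values below are read off the dental joint table earlier in the chapter (e.g., P(cavity) = .108+.012+.072+.008 = 0.2), and for that table the factorization reproduces the joint exactly:

```python
# Five numbers: P(cavity), P(toothache=true | Cavity),
# P(catch=true | Cavity), derived from the slide's joint table.
p_cavity = {True: 0.2, False: 0.8}
p_tooth = {True: 0.6, False: 0.1}   # P(toothache=true | cavity / no cavity)
p_catch = {True: 0.9, False: 0.2}   # P(catch=true | cavity / no cavity)

def joint(toothache, catch, cavity):
    # P(T, C, Cav) = P(T | Cav) P(C | Cav) P(Cav)
    pt = p_tooth[cavity] if toothache else 1 - p_tooth[cavity]
    pc = p_catch[cavity] if catch else 1 - p_catch[cavity]
    return pt * pc * p_cavity[cavity]
```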
Conditional Independence?
 What about this domain:
 Traffic
 Umbrella
 Raining

Conditional Independence?
 What about this domain:
 Fire
 Smoke
 Alarm

Bayes Nets: Big Picture

 Bayes nets: a technique for describing


complex joint distributions (models) using
simple, conditional distributions
 A subset of the general class of graphical models
 Use local causality/conditional independence:
 the world is composed of many variables,
 each interacting locally with a few others

Graphical Model Notation
- Nodes: variables (with domains)
  - Can be assigned (observed) or unassigned (unobserved), e.g., Weather
- Arcs: interactions
  - Indicate "direct influence" between variables
  - Formally: encode conditional independence (more later)

[Figure: example nodes Weather, a grid variable G, and cells C1,1, C1,2, …, C3,3]
Example: Car diagnosis
Initial evidence: car won't start
Testable variables (green), "broken, so fix it" variables (orange), and hidden variables (gray) ensure sparse structure and reduce parameters

[Figure: network over battery age, alternator broken, fanbelt broken, battery dead, no charging, battery flat, no oil, no gas, fuel line blocked, starter broken, battery meter, lights, oil light, gas gauge, dipstick, car won't start]
Example Bayes' Net: Car Insurance

[Figure: network over Age, SocioEcon, GoodStudent, ExtraCar, RiskAversion, MakeModel, VehicleYear, YearsLicensed, Mileage, DrivingSkill, AntiTheft, SafetyFeatures, CarValue, Garaged, Airbag, DrivingRecord, Ruggedness, DrivingBehavior, Cushioning, Theft, Accident, OwnCarDamage, OwnCarCost, OtherCost, MedicalCost, LiabilityCost, PropertyCost]
Wumpus World
- Pi,j = true iff [i, j] contains a pit
- Bi,j = true iff [i, j] is breezy
- Include only B1,1, B1,2, B2,1 in the probability model

Specifying the probability model
- The full joint distribution is P(P1,1, …, P4,4, B1,1, B1,2, B2,1)
- Apply the product rule: P(B1,1, B1,2, B2,1 | P1,1, …, P4,4) P(P1,1, …, P4,4)
- (Do it this way to get P(Effect | Cause).)
- First term: 1 if pits are adjacent to breezes, 0 otherwise
- Second term: pits are placed randomly, probability 0.2 per square:
  P(P1,1, …, P4,4) = Π_{i,j=1,1}^{4,4} P(Pi,j) = 0.2^n × 0.8^(16−n)
  for n pits.

Observations and query
- We know the following facts:
  - b = ¬b1,1 ∧ b1,2 ∧ b2,1
  - known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
- Query is P(P1,3 | known, b)
- Define Unknown = the Pi,j's other than P1,3 and Known
- For inference by enumeration, we have
  P(P1,3 | known, b) = α Σ_unknown P(P1,3, unknown, known, b)
- Grows exponentially with the number of squares!
  - 12 unknown squares => 2^12 terms
Using conditional independence
Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares

b = ¬b1,1 ∧ b1,2 ∧ b2,1
known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Query is P(P1,3 | known, b)

Define Frontier = the unknown squares adjacent to the observed ones, and Other = the remaining unknown squares, so that Unknown = Frontier ∪ Other (e.g., [4,4] is irrelevant)

P(b | P1,3, Known, Unknown) = P(b | P1,3, Known, Frontier)

Manipulate the query into a form where we can use this!
Using conditional independence contd.

b = ¬b1,1 ∧ b1,2 ∧ b2,1
known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Unknown = Pi,j's other than P1,3 and Known
Query is P(P1,3 | known, b)

P(P1,3 | known, b)
  = α Σ_{frontier, other} P(b | P1,3, known, frontier, other) P(P1,3, known, frontier, other)
  = α Σ_{frontier, other} P(b | P1,3, known, frontier) P(P1,3, known, frontier, other)
  = α Σ_frontier P(b | P1,3, known, frontier) Σ_other P(P1,3, known, frontier, other)
Using conditional independence contd.

b = ¬b1,1 ∧ b1,2 ∧ b2,1
known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Unknown = Pi,j's other than P1,3 and Known

  = α Σ_frontier P(b | P1,3, known, frontier) Σ_other P(P1,3, known, frontier, other)
  = α Σ_frontier P(b | P1,3, known, frontier) Σ_other P(P1,3) P(known) P(frontier) P(other)
  = α P(P1,3) P(known) Σ_frontier P(b | P1,3, known, frontier) P(frontier) Σ_other P(other)
  = α' P(P1,3) Σ_frontier P(b | P1,3, known, frontier) P(frontier)
Using conditional independence contd.
Pits are placed randomly with probability 0.2 per square:

b = ¬b1,1 ∧ b1,2 ∧ b2,1
known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1

P(P1,3 | known, b) = α' P(P1,3) Σ_frontier P(b | P1,3, known, frontier) P(frontier)
                   = α' ⟨0.2 (0.04 + 0.16 + 0.16), 0.8 (0.04 + 0.16)⟩
                   ≈ ⟨0.31, 0.69⟩
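The final arithmetic can be checked by evaluating the slide's expression and normalizing:

```python
# Unnormalized values from the slide's expression:
# pit case: 0.2 * (0.04 + 0.16 + 0.16), no-pit case: 0.8 * (0.04 + 0.16)
unnorm = (0.2 * (0.04 + 0.16 + 0.16),
          0.8 * (0.04 + 0.16))
alpha = 1.0 / sum(unnorm)
dist = tuple(alpha * p for p in unnorm)   # roughly (0.31, 0.69)
```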
Walmart: "People think we got big by putting big stores in small towns. Really, we got big by replacing inventory with information." (Sam Walton, founder of Walmart)

- The day after Thanksgiving, a Wal-Mart manager creates a special offer: PC + printer
- The resulting spike in sales triggers an alert in Analytics
- All other stores are informed
- At the end of the day, all stores experienced the same spike
Summary
- Independence and conditional independence are important forms of probabilistic knowledge
- Bayes nets encode joint distributions efficiently by taking advantage of conditional independence
- Global joint probability = product of local conditionals
- Next: how to answer queries, i.e., compute conditional probabilities of queries given evidence

Chapter 13

Bayes Net Syntax and Semantics:


Probabilistic Reasoning

Bayesian networks
A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions

Syntax:
• a set of nodes, one per variable
• a directed, acyclic graph (link ≈ "directly influences")
• a conditional distribution for each node given its parents: P(Xi | Parents(Xi))
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values

Example
Topology of network encodes conditional independence
assertions:

Weather Cavity

Toothache Catch

Weather is independent of the other variables


Toothache and Catch are conditionally independent given
Cavity

Bayes net global semantics
- Bayes nets encode joint distributions as a product of conditional distributions, one per variable:
  P(X1, …, Xn) = Π_i P(Xi | Parents(Xi))

Example
I’m at work, neighbor John calls to say my alarm is ringing, but neighbor
Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a
burglar?

Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects “causal” knowledge:


– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call

Example: Alarm Network

  P(B):                   P(E):
  true   false            true   false
  0.001  0.999            0.002  0.998

  P(A | B, E):
  B      E      true   false
  true   true   0.95   0.05
  true   false  0.94   0.06
  false  true   0.29   0.71
  false  false  0.001  0.999

  P(J | A):               P(M | A):
  A      true  false      A      true  false
  true   0.9   0.1        true   0.7   0.3
  false  0.05  0.95       false  0.01  0.99

Number of free parameters in each CPT:
- parent range sizes d1, …, dk; child range size d
- each table row must sum to 1
- free parameters: (d − 1) Π_i di
  (here: 1 for B, 1 for E, 4 for A, 2 for J, 2 for M)
General formula for sparse BNs
- Suppose
  - n variables
  - maximum range size is d
  - maximum number of parents is k
- Full joint distribution has size O(d^n)
- Bayes net has size O(n · d^k)
  - Linear scaling with n as long as the causal structure is local
Example
Using the alarm network's CPTs above:

  P(b, ¬e, a, ¬j, ¬m) = P(b) P(¬e) P(a | b, ¬e) P(¬j | a) P(¬m | a)
                      = 0.001 × 0.998 × 0.94 × 0.1 × 0.3 ≈ 0.000028
Compactness
- A CPT for a Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values
- Each row requires one number p for Xi = true (the number for Xi = false is just 1 − p)
- If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers
- i.e., grows linearly with n, vs. O(2^n) for the full joint distribution
- For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
Global semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:

  P(X1, …, Xn) = Π_i P(Xi | Parents(Xi))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = ?

Compare the chain rule:
  P(X1, …, Xn) = P(Xn | Xn−1, …, X1) P(Xn−1, …, X1)
               = P(Xn | Xn−1, …, X1) P(Xn−1 | Xn−2, …, X1) P(Xn−2, …, X1)
               = …
               = Π_i P(Xi | Xi−1, …, X1)
Global semantics
"Global" semantics defines the full joint distribution as the product of the local conditional distributions:

  P(X1, …, Xn) = Π_i P(Xi | Parents(Xi))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
  = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
  = 0.9 × 0.7 × 0.001 × 0.999 × 0.998
  ≈ 0.00063

(CPTs as in the alarm network above.)
https://inst.eecs.berkeley.edu/~cs188/sp21/
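The product above is easy to verify directly, using the CPT values from the slide:

```python
# P(j, m, a, ¬b, ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p = 0.9 * 0.7 * 0.001 * 0.999 * 0.998
print(round(p, 5))   # 0.00063
```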
Local semantics
Local semantics: each node is conditionally independent of its
nondescendants given its parents

U1 Um
. ..

X
Z 1j Z nj

Y1 Yn
. ..

Theorem: Local semantics ⇔ global semantics

Markov blanket
Each node is conditionally independent of all others given its
Markov blanket: parents + children + children’s parents

U1 Um
. ..

X
Z 1j Z nj

Y1 Yn
. ..

Constructing Bayesian networks
Need a method such that a series of locally testable assertions of conditional independence guarantees
the required global semantics

1. Choose an ordering of variables X_1, …, X_n

2. For i = 1 to n
   add X_i to the network
   select parents from X_1, …, X_{i−1} such that
   P(X_i | Parents(X_i)) = P(X_i | X_1, …, X_{i−1})

Must ensure that Parents(X_i) ⊆ {X_1, …, X_{i−1}}

This choice of parents guarantees the global semantics:

P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | X_1, …, X_{i−1}) (chain rule)
             = ∏_{i=1}^{n} P(X_i | Parents(X_i)) (by construction)
Example
Suppose we choose the ordering M, J, A, B, E
Must ensure that Parents(X_i) ⊆ {X_1, …, X_{i−1}}

P(J|M ) = P(J)? No
P(A|J, M ) = P(A|J ) ? No MaryCalls
P(A|J, M ) = P(A)? No
P(B|A, J, M ) = P(B|A)? Yes JohnCalls
P(B|A, J, M ) = P(B)? No
P(E |B , A, J, M ) = P(E |A)? No Alarm
P(E|B, A, J, M ) = P(E|A, B)? Yes Given the network

Burglary

Earthquake

P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | X_1, …, X_{i−1}) (chain rule)
             = ∏_{i=1}^{n} P(X_i | Parents(X_i)) (by construction)
Example contd.
MaryCalls B E
JohnCalls A

Order matters: 10 numbers versus 13


Alarm

Burglary

Earthquake

Deciding conditional independence is hard in noncausal directions


(Causal models and conditional independence seem hardwired for humans!)
Assessing conditional probabilities is hard in noncausal directions

Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed

Example: Car diagnosis
Initial evidence: car won’t start
Testable variables (green),
“broken, so fix it” variables (orange) ,
Hidden variables (gray) ensure sparse structure, reduce parameters
battery age alternator fanbelt
broken broken

battery no charging
dead

battery battery fuel line starter


flat no oil no gas blocked broken
meter

car won’t
lights oil light gas gauge start dipstick

Example Bayes’ Net: Car Insurance
Age SocioEcon
GoodStudent ExtraCar

RiskAversion
MakeModel VehicleYear
YearsLicensed
Mileage

DrivingSkill
AntiTheft SafetyFeatures CarValue
Garaged Airbag
DrivingRecord Ruggedness

DrivingBehavior
Cushioning Theft
Accident

OwnCarDamage

OwnCarCost
OtherCost

MedicalCost LiabilityCost PropertyCost


Compact conditional distributions
CPT grows exponentially with number of parents
CPT becomes infinite with continuous-valued parent or child
Solution: canonical distributions that are defined compactly
Deterministic nodes are the simplest case:
X = f(Parents(X)) for some function f
E.g., Boolean functions
NorthAmerican ⇔ Canadian ∨ US ∨ Mexican

E.g., numerical relationships among continuous variables


∂Level/∂t = inflow + precipitation − outflow − evaporation

Compact conditional distributions contd.

Noisy-OR distributions model multiple noninteracting causes


1) Parents U_1 … U_k include all causes
   (can add leak nodes if some causes are missing)
2) Independent failure (inhibition) probability q_i for each cause alone

P(x | u_1, …, u_j, ¬u_{j+1}, …, ¬u_k) = 1 − ∏_{i=1}^{j} q_i
Compact conditional distributions contd.
𝑞 =P( fever | cold,  flu,  malaria) = 0.6
𝑞 =P( fever |  cold, flu,  malaria) = 0.2
𝑞 =P( fever |  cold,  flu, malaria) = 0.1

 Fever is false iff all its true parents are inhibited


 This probability = product of the inhibition probability for each parent

Cold  Flu  Malaria  P(Fever)  P(¬Fever)


F F F 0.0 1.0
F F T 0.9 0.1
F T F 0.8 0.2
F T T 0.98 0.02 = 0.2 × 0.1
T F F 0.4 0.6
T F T 0.94 0.06 = 0.6 × 0.1
T T F 0.88 0.12 = 0.6 × 0.2
T T T 0.988 0.012 = 0.6 × 0.2 × 0.1
Number of parameters linear in number of parents
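The whole fever table above can be regenerated from just the three inhibition probabilities, which is the point of the noisy-OR model (a minimal sketch; the function name is illustrative):

```python
# Noisy-OR: P(¬fever) is the product of the inhibition probabilities q_i
# of the parents that are true (q values from the slide).
q = [("cold", 0.6), ("flu", 0.2), ("malaria", 0.1)]

def p_fever(cold, flu, malaria):
    p_not = 1.0
    for (_, qi), on in zip(q, (cold, flu, malaria)):
        if on:
            p_not *= qi
    return 1.0 - p_not

print(round(p_fever(True, True, True), 3))   # matches the 0.988 table entry
print(round(p_fever(True, False, True), 2))  # matches the 0.94 table entry
```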
Hybrid (discrete+continuous) networks
- Discrete (Subsidy? and Buys?);
- continuous (Harvest and Cost)

- Option 1: discretization—possibly large errors, large CPTs


- Option 2: finitely parameterized canonical families (Gaussian
or Normal distribution)
Subsidy? Harvest

1) Continuous variable, discrete+continuous parents (e.g., Cost)


2) Discrete variable, continuous parents (e.g., Buys?) Cost

Buys?

Continuous child variables
 Need one conditional density function for child variable given continuous parents, for each possible assignment to discrete parents

 Most common is the linear Gaussian model, e.g.,:


Subsidy? Harvest
_ P (Cost =c|Harvest =h, Subsidy=true)

_ Mean Cost varies linearly with Harvest, variance is fixed


Cost
_ Discrete parent (Subsidy) is handled by specifying both

 P(Cost = c | Harvest = h, Subsidy = true):
   P(c | h, subsidy) = N(a_t h + b_t, σ_t²)(c) = (1/(σ_t√(2π))) e^{−½((c − (a_t h + b_t))/σ_t)²}

 P(Cost = c | Harvest = h, Subsidy = false):
   P(c | h, ¬subsidy) = N(a_f h + b_f, σ_f²)(c) = (1/(σ_f√(2π))) e^{−½((c − (a_f h + b_f))/σ_f)²}
Normal Distribution

- E[X] = μ = mean = expected value = Σ_x x P(X = x) (discrete case)

- Var(X) = σ² = variance = E[(X − μ)²] = E[X²] − μ² (σ is the standard deviation)

- Normal = Gaussian distribution: P(x) = N(μ, σ²)(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}

- Central Limit Theorem: the distribution formed by sampling n independent random variables and
taking their mean tends to a normal distribution as n tends to infinity

Continuous child variables
(plot: the density P(Cost | Harvest, Subsidy? = true) as a surface over Cost and Harvest, 0–10 on each axis)

Properties of Linear Gaussian conditional distribution


- All-continuous (parents) network with Linear Gaussian distributions
⇒ full joint distribution is a multivariate Gaussian

- Discrete (as parents) +continuous (parents) Linear Gaussian network is a conditional Gaussian network
 a multivariate Gaussian over all continuous variables for each combination of discrete variable values

Discrete variable w/ continuous parents

Subsidy? Harvest

Cost

Buys?

N(μ, σ²)(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}

Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt

P(Buys = false | Cost = c) = Φ((c − µ)/σ)
P(Buys = true | Cost = c) = 1 − Φ((c − µ)/σ)
Discrete variable w/ continuous parents
Probability of Buys given Cost should be a “soft” threshold:

(plot: P(Buys = false | Cost = c) rising smoothly from 0 to 1 as Cost goes from 0 to 12)

Probit distribution uses integral of Gaussian:

Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt,  where N(μ, σ²)(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}

P(Buys = false | Cost = c) = Φ((c − µ)/σ)
P(Buys = true | Cost = c) = 1 − Φ((c − µ)/σ)


Why the probit?
1. It’s sort of the right shape
2. Can view as hard threshold whose location is subject to noise

Cost Cost Noise

Buys?

Discrete variable contd.
Sigmoid (or logit) distribution also used in neural networks:

1
𝑃 𝐵𝑢𝑦𝑠 = 𝑡𝑟𝑢𝑒 𝐶𝑜𝑠𝑡 = 𝑐) = −𝑐 + 𝜇
1 + exp(−2 )
𝜎

Sigmoid has similar shape to probit but much longer tails:


1
0.9
0.8
P(Buys?=false|Cost=c)

0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0 2 4 6 8 10 12
Cost c

https://inst.eecs.berkeley.edu/~cs188/sp21/
Inference
 Inference: calculating some useful quantity from a probability model (joint probability distribution)

 Examples:
  Posterior marginal probability P(Q | e_1, …, e_k)
   E.g., what disease might I have?
  Most likely explanation: argmax_{q,r,s} P(Q=q, R=r, S=s | e_1, …, e_k)
   E.g., what did he say?
Inference tasks - Examples
Simple queries: compute posterior marginal P(X i |E = e)

e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)

Conjunctive queries: P(X i , X j |E = e) = P(X i |E = e)P(X j |X i , E = e)

Optimal decisions: decision networks include utility information;

probabilistic inference required for P (outcome|action, evidence)

Value of information: which evidence to seek next?

Sensitivity analysis: which probability values are most critical?

Explanation: why do I need a new starter motor?

Example Bayes’ Net: Car Insurance
Age SocioEcon
GoodStudent ExtraCar

RiskAversion
MakeModel VehicleYear
YearsLicensed
Mileage

DrivingSkill
AntiTheft SafetyFeatures CarValue
Garaged Airbag
DrivingRecord Ruggedness

DrivingBehavior
Cushioning Theft
Accident

OwnCarDamage

OwnCarCost
OtherCost

MedicalCost LiabilityCost PropertyCost


Inference by enumeration
 Slightly intelligent way to sum out variables from the joint probability without actually
constructing its explicit representation
 Simple query on the burglary network: B E

 P(B|j, m) A

 = P(B, j, m)/P (j, m) J M


 = α P(B, j, m)
 = α Σe Σa P(B, e, a, j, m) [e a hidden variable]
 Rewrite full joint entries using product of CPT entries:

 P(b|j, m)
 = α Σe Σa P(b)P (e)P(a|b, e)P(j|a)P (m|a)
 = α P(b) Σe P(e) Σa P(a|b, e)P(j|a)P (m|a)
 Recursive depth-first enumeration: O(n) space, O(d^n) time
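The nested-sum expression above can be written out directly for the burglary network (a sketch; CPT values from the slides, helper names illustrative):

```python
# Inference by enumeration for P(B | j, m): sum the full-joint products
# over the hidden variables E and A, then normalize over B.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(j | A)
P_M = {True: 0.70, False: 0.01}                      # P(m | A)

def unnormalized(b):
    total = 0.0
    for e in (True, False):
        for a in (True, False):
            total += ((P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
                      * (P_A[b, e] if a else 1 - P_A[b, e])
                      * P_J[a] * P_M[a])
    return total

vals = [unnormalized(True), unnormalized(False)]
posterior = vals[0] / sum(vals)
print(round(posterior, 3))   # ≈ 0.284, the standard result for this query
```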
Inference by Enumeration in Bayes Net
 Inference by enumeration: B E
 Any probability of interest can be computed by summing
entries from the joint distribution: P(Q | e) = α Σ_h P(Q, h, e)
 Entries from the joint distribution can be obtained from a BN
by multiplying the corresponding conditional probabilities
A
 P(B | j, m) = α Σ_{e,a} P(B, e, a, j, m)
 = α Σ_{e,a} P(B) P(e) P(a|B,e) P(j|a) P(m|a)
J M
 So inference in Bayes nets means computing sums of
products of numbers: sounds easy!!
 Problem: sums of exponentially many products!

Evaluation tree
 P(b|j, m) = α P(b) Σe P(e) Σa P(a|b, e)P(j|a)P(m|a)

B E

J M

Enumeration is inefficient: repeated computation


e.g., computes P (j|a)P (m|a) for each value of e

Can be improved with variable elimination, choosing the right order of variables, detecting irrelevant variables
Order matters
(network: Z is the common parent of A, B, C, D)
 Order the terms Z, A, B, C, D
  P(D) = α Σ_{z,a,b,c} P(z) P(a|z) P(b|z) P(c|z) P(D|z)
       = α Σ_z P(z) Σ_a P(a|z) Σ_b P(b|z) Σ_c P(c|z) P(D|z)
  Largest factor has 2 variables (D, Z)
 Order the terms A, B, C, D, Z
  P(D) = α Σ_{a,b,c,z} P(a|z) P(b|z) P(c|z) P(D|z) P(z)
       = α Σ_a Σ_b Σ_c Σ_z P(a|z) P(b|z) P(c|z) P(D|z) P(z)
  Largest factor has 4 variables (A, B, C, D)
 In general, with n leaves, factor of size 2^n

Enumeration algorithm
P(b|j, m) = α P(b) Σe P(e) Σa P(a|b, e)P(j|a)P(m|a)

B E

J M

Sampling
 Basic idea
  Draw N samples from a sampling distribution S
  Compute an approximate posterior probability
  Show this converges to the true probability P
 Why sample?
  Often very fast to get a decent approximate answer
  The algorithms are very simple and general (easy to apply to fancy models)
  They require very little memory (O(n))
  They can be applied to large models, whereas exact algorithms blow up
Example
 Suppose you have two agent programs A and B for Monopoly
 What is the probability that A wins?
 Method 1:
 Let s be a sequence of dice rolls and Chance and Community Chest cards
 Given s, the outcome V(s) is determined (1 for a win, 0 for a loss)
 Probability that A wins is Σ_s P(s) V(s)
 Problem: infinitely many sequences s !
 Method 2:
 Sample N sequences from P(s) , play N games (maybe 100)
 Probability that A wins is roughly (1/N) Σ_i V(s_i), i.e., the fraction of wins in the sample
Sampling basics: discrete (categorical) distribution
 To simulate a biased d-sided coin:
  Step 1: get a sample u from the uniform distribution over [0, 1)
   E.g. random() in python
  Step 2: convert this sample u into an outcome for the given distribution by
   associating each outcome x with a P(x)-sized sub-interval of [0, 1)
 Example
  C      P(C)
  red    0.6     0.0 ≤ u < 0.6 → C = red
  green  0.1     0.6 ≤ u < 0.7 → C = green
  blue   0.3     0.7 ≤ u < 1.0 → C = blue
  If random() returns u = 0.83, then the sample is C = blue
 E.g., after sampling many times, the empirical fractions approach 0.6 / 0.1 / 0.3
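The two steps above are a few lines of code (a sketch; the function name is illustrative):

```python
import random

# Inverse-CDF sampling from a categorical distribution:
# walk the cumulative probabilities until they pass u.
def sample_categorical(dist, u=None):
    u = random.random() if u is None else u
    cum = 0.0
    for outcome, p in dist:
        cum += p
        if u < cum:
            return outcome
    return dist[-1][0]   # guard against floating-point round-off

dist = [("red", 0.6), ("green", 0.1), ("blue", 0.3)]
print(sample_categorical(dist, u=0.83))   # blue, as in the slide's example
```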
Sampling in Bayes Nets
1 Prior Sampling: sampling from an empty network
2 Rejection Sampling : reject samples disagreeing with evidence

3 Likelihood Weighting : use evidence to weight samples

4  Gibbs Sampling / Markov chain Monte Carlo (MCMC): sample from a stochastic
process whose stationary distribution is the true posterior

Prior Sampling
1

(network: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass)

P(C)
c 0.5   ¬c 0.5

P(S | C)                P(R | C)
c:  s 0.1, ¬s 0.9       c:  r 0.8, ¬r 0.2
¬c: s 0.5, ¬s 0.5       ¬c: r 0.2, ¬r 0.8

P(W | S, R)
s, r:   w 0.99, ¬w 0.01
s, ¬r:  w 0.90, ¬w 0.10
¬s, r:  w 0.90, ¬w 0.10
¬s, ¬r: w 0.01, ¬w 0.99

Samples: e.g., (c, ¬s, r, w)
Example: P[Cloudy = true, Sprinkler = false, Rain = true, WetGrass = true]

P(C) = .50

C  P(S = true | C)      C  P(R = true | C)
T  .10                  T  .80
F  .50                  F  .20

S  R  P(W = true | S, R)
T  T  .99
T  F  .90
F  T  .90
F  F  .01

E.g., P(S = true | C = true) = 0.1, P(S = false | C = true) = 0.9
     P(R = true | C = true) = 0.8, P(R = false | C = true) = 0.2
     P(W = true | S = true, R = true) = 0.99, P(W = false | S = true, R = true) = 0.01
P[Cloudy = true, Sprinkler = false, Rain = true, WetGrass = true]
= P(Cloudy = true)
× P(Sprinkler = false | Cloudy = true)
× P(Rain = true | Cloudy = true)
× P(WetGrass = true | Sprinkler = false, Rain = true)
= 0.5 × 0.9 × 0.8 × 0.9 = 0.324
Prior Sampling
1

 For i=1, 2, …, n (in topological order)


 Sample Xi from P(Xi | parents(Xi))

 Return (x1, x2, …, xn)

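The loop above is short enough to write out for the sprinkler network (a sketch; CPT values from the slides, names illustrative):

```python
import random

# Prior sampling: sample each variable in topological order
# from P(X_i | parents(X_i)).
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def prior_sample():
    c = random.random() < 0.5
    s = random.random() < (0.1 if c else 0.5)
    r = random.random() < (0.8 if c else 0.2)
    w = random.random() < P_W[s, r]
    return c, s, r, w

random.seed(0)
samples = [prior_sample() for _ in range(20000)]
# P(Rain = true) = 0.5*0.8 + 0.5*0.2 = 0.5, so the estimate should be near 0.5
frac = sum(r for _, _, r, _ in samples) / len(samples)
print(round(frac, 2))
```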
Sampling from an empty network
1

Sample each variable in turn, in topological order


The probability distribution from which the value is
sampled is conditioned on the values already
assigned to the variable’s parents

Example
1

 We’ll get a bunch of samples from the BN:


c, s, r, w C

c, s, r, w S R
c, s, r, w
W
c, s, r, w
c, s, r, w
 If we want to know P(W)
 We have counts <w:4, ¬w:1>
 Normalize to get P(W) = <w:0.8, ¬w:0.2>
 This will get closer to the true distribution with more samples
 Can estimate anything else, too
 E.g., for query P(C| r, w) use P(C| r, w) = α P(C, r, w)

Prior Sampling
1

 This process generates samples with probability:

  S_PS(x_1, …, x_n) = ∏_i P(x_i | parents(X_i)) = P(x_1, …, x_n)
  …i.e. the BN’s joint probability

 Let the number of samples of an event be N_PS(x_1, …, x_n)

 Estimate from N samples is Q_N(x_1, …, x_n) = N_PS(x_1, …, x_n)/N
 Then lim_{N→∞} Q_N(x_1, …, x_n) = lim_{N→∞} N_PS(x_1, …, x_n)/N
  = S_PS(x_1, …, x_n)
  = P(x_1, …, x_n)
 i.e., the sampling procedure is consistent
Rejection Sampling
2

 A simple modification of prior sampling


for conditional probabilities
 Let’s say we want P(C| r, w) C

 Count the C outcomes, but ignore (reject) S R


samples that don’t have R=true, W=true
W
 This is called rejection sampling
 It is also consistent for conditional c, s, r, w
probabilities (i.e., correct in the limit) c, s, r
c, s, r, w
c, s, r
c, s, r, w
Rejection sampling 2

P̂(X | e) is estimated from samples agreeing with e

E.g., estimate P(Rain | Sprinkler = true) using 100 samples
- 27 samples have Sprinkler = true
- Of these, 8 have Rain = true and 19 have Rain = false

P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure
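With more samples the estimate above settles near the exact posterior, which is 0.3 for these CPTs (a sketch; values from the slides):

```python
import random

# Rejection sampling for P(Rain | Sprinkler = true): draw prior samples
# and keep only those agreeing with the evidence.
def prior_sample():
    c = random.random() < 0.5
    s = random.random() < (0.1 if c else 0.5)
    r = random.random() < (0.8 if c else 0.2)
    return c, s, r

random.seed(0)
kept = [r for c, s, r in (prior_sample() for _ in range(100000)) if s]
p_rain = sum(kept) / len(kept)
print(round(p_rain, 2))   # near 0.3; note that ~70% of samples are rejected
```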


Analysis of rejection sampling
2

P̂(X | e) = α N_PS(X, e)          (algorithm defn.)
        = N_PS(X, e)/N_PS(e)      (normalized by N_PS(e))
        ≈ P(X, e)/P(e)            (property of PriorSample)
        = P(X | e)                (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates

Problem: hopelessly expensive if P(e) is small

P(e) drops off exponentially with the number of evidence variables!

E.g., P(Rain | Sky = red)
Likelihood Weighting 3

 Problem with rejection sampling:
  If evidence is unlikely, rejects lots of samples
  Evidence not exploited as you sample
  Consider P(Shape | Color = blue)
 Idea: fix evidence variables, sample the rest
  Problem: sample distribution not consistent!
  Solution: weight each sample by probability of evidence variables given parents

(example: with Color fixed to blue, every sample — pyramid, sphere, cube — has Color = blue)
Likelihood Weighting 3

P(C)
c 0.5   ¬c 0.5

P(S | C)                P(R | C)
c:  s 0.1, ¬s 0.9       c:  r 0.8, ¬r 0.2
¬c: s 0.5, ¬s 0.5       ¬c: r 0.2, ¬r 0.8

P(W | S, R)
s, r:   w 0.99, ¬w 0.01
s, ¬r:  w 0.90, ¬w 0.10
¬s, r:  w 0.90, ¬w 0.10
¬s, ¬r: w 0.01, ¬w 0.99

Sample: c, s, r, w with weight w = 1.0 × 0.1 × 0.99
Likelihood Weighting 3

 Input: evidence e1,..,ek


 w = 1.0
 for i=1, 2, …, n
 if Xi is an evidence variable
 xi = observed value for Xi
 Set w = w * P(xi | parents(Xi))
 else
 Sample xi from P(Xi | parents(Xi))
 return (x1, x2, …, xn), w
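The pseudocode above, specialized to the query P(Rain | Cloudy = true, WetGrass = true) on the sprinkler network, looks like this (a sketch; CPT values from the slides, names illustrative):

```python
import random

# Likelihood weighting: evidence variables are fixed and contribute to the
# weight; non-evidence variables are sampled from P(X_i | parents(X_i)).
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def weighted_sample():
    w = 1.0
    w *= 0.5                      # evidence Cloudy = true: w *= P(c)
    s = random.random() < 0.1     # sample Sprinkler ~ P(S | c)
    r = random.random() < 0.8     # sample Rain ~ P(R | c)
    w *= P_W[s, r]                # evidence WetGrass = true: w *= P(w | s, r)
    return r, w

random.seed(0)
tally = {True: 0.0, False: 0.0}
for _ in range(100000):
    r, w = weighted_sample()
    tally[r] += w
est = tally[True] / (tally[True] + tally[False])
print(round(est, 2))              # close to the exact posterior ≈ 0.97
```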

Likelihood weighting 3
Idea: fix evidence variables, sample only nonevidence variables, and weight each sample by the likelihood it accords the evidence

Avoids the inefficiency of rejection sampling by generating only events that are consistent with evidence e
Likelihood Weighting 3

 Sampling distribution if z sampled and e fixed evidence:

  S_WS(z, e) = ∏_j P(z_j | parents(Z_j))

 Now, samples have weights:

  w(z, e) = ∏_k P(e_k | parents(E_k))

 Together, the weighted sampling distribution is consistent:

  S_WS(z, e) · w(z, e) = ∏_j P(z_j | parents(Z_j)) ∏_k P(e_k | parents(E_k)) = P(z, e)

 Likelihood weighting is an example of importance sampling
  Would like to estimate some quantity based on samples from P
  P is hard to sample from, so use Q instead
  Weight each sample x by P(x)/Q(x)
Car Insurance: P(PropertyCost | e)

(plot: estimation error vs. number of samples, 0 to 1×10^6, for rejection sampling and likelihood weighting on the car-insurance network)
Likelihood weighting example (3)

Query: P(Rain | Cloudy = true, WetGrass = true)
Ordering: Cloudy, Sprinkler, Rain, WetGrass

w is set to w = 1.0
w ← w × P(Cloudy = true) = 0.5

• Fixes evidence variable values
• Only samples non-evidence variables (including the query)
• Cloudy is an evidence variable
• Sprinkler is not an evidence variable
• WetGrass is an evidence variable

P(C) = .50
C P(S|C): T .10, F .90
C P(R|C): T .80, F .20
S R P(W|S,R): T T .99, T F .90, F T .90, F F .01
Likelihood weighting example (3), continued

Query: P(Rain | Cloudy = true, WetGrass = true)
Ordering: Cloudy, Sprinkler, Rain, WetGrass

w is set to w = 1.0
w ← w × P(Cloudy = true) = 0.5

Sample from P(Sprinkler | Cloudy = true) = ⟨0.1, 0.9⟩ — suppose this returns false
Sample from P(Rain | Cloudy = true) = ⟨0.8, 0.2⟩ — suppose this returns true

The weighted sample:
w ← w × P(WetGrass = true | Sprinkler = false, Rain = true) = 0.5 × 0.9 = 0.45
The event [true, false, true, true] has weight 0.45 and is tallied under Rain = true
Likelihood weighting analysis (3)

Sampling probability for WeightedSample is

 S_WS(z, e) = ∏_i P(z_i | Parents(Z_i))

 (Z: non-evidence variables, including the query variable;
  Parents(Z_i) can include both evidence and non-evidence variables)

Weight for a given sample z, e is

 w(z, e) = ∏_i P(e_i | Parents(E_i))

Weighted sampling probability is

 S_WS(z, e) × w(z, e) = ∏_i P(z_i | Parents(Z_i)) × ∏_i P(e_i | Parents(E_i)) = P(z, e)

 (by standard global semantics of network)

Query: P(Rain | Cloudy = true, WetGrass = true)
Ordering: Cloudy, Sprinkler, Rain, WetGrass
Likelihood weighting analysis 3

+ likelihood weighting returns consistent estimates


+ More efficient than rejection sampling because it uses all samples
- Performance degrades if there are many evidence variables (a few samples have nearly all the total weight)

Markov Chain Monte Carlo 4

 MCMC (Markov chain Monte Carlo) is a family of randomized


algorithms for approximating some quantity of interest over a
very large state space
 Markov chain = a sequence of randomly chosen states (“random walk”),
where each state is chosen conditioned on the previous state
 Monte Carlo = a very expensive city in Monaco with a famous casino
 Monte Carlo = an algorithm (usually based on sampling) that has some
probability of producing an incorrect answer
 MCMC = wander around for a bit, average what you see
Gibbs sampling
 A particular kind of MCMC
 States are complete assignments to all variables
 (Cf local search: closely related to simulated annealing!)
 Evidence variables remain fixed, other variables change
 To generate the next state, pick a variable and sample a value for it
conditioned on all the other variables: Xi’ ~ P(Xi | x1,..,xi-1,xi+1,..,xn)
 Will tend to move towards states of higher probability, but can go down too
 In a Bayes net, P(Xi | x1,..,xi-1,xi+1,..,xn) = P(Xi | markov_blanket(Xi))
 Theorem: Gibbs sampling is consistent*
 Provided all Gibbs distributions are bounded away from 0 and 1 and variable selection is fair

Why would anyone do this?
Samples soon begin to
reflect all the evidence
in the network

Eventually they are


being drawn from the
true posterior!

How would anyone do this?
 Repeat many times
 Sample a non-evidence variable Xi from
  P(Xi | x1,..,xi-1,xi+1,..,xn) = P(Xi | markov_blanket(Xi))
  = α P(Xi | parents(Xi)) ∏_j P(yj | parents(Yj))


X
Z 1j Z nj

Y1 ... Yn

Gibbs Sampling Example: P( S | r)
 Step 1: Fix evidence C
 Step 2: Initialize other variables C
 R = true  Randomly
S r S r

W W

 Step 3: Repeat
 Choose a non-evidence variable X
 Resample X from P(X | markov_blanket(X))
C C C C C C
S r S r S r S r S r S r
W W W W W W

Sample S ~ P(S | c, r, w) Sample C ~ P(C | s, r) Sample W ~ P(W | s, r)


Markov chain given s, w

(diagram: the four states (c, r), (c, ¬r), (¬c, r), (¬c, ¬r) and the transition probabilities between them)
Car Insurance: P(Age | mc, lc, pc)

(plot: estimation error vs. number of samples, 0 to 1×10^6, for likelihood weighting and Gibbs sampling on the car-insurance network)
Approximate inference using MCMC (4)
“State” of network = current assignment to all variables.
Generate the next state by sampling one variable given its Markov blanket.
Sample each variable in turn, keeping evidence fixed.

Markov blanket: parents + children + children’s parents

Query : P(Rain| Sprinkler = True, WetGrass = True)
Ordering : Cloudy, Sprinkler, Rain, Wet-Grass
The Markov chain 4

With Sprinkler = true, WetGrass = true, there are four states:

Cloudy Cloudy

Sprinkler Rain Sprinkler Rain

Wet Wet
Grass Grass

Cloudy Cloudy

Sprinkler Rain Sprinkler Rain

Wet Wet
Grass Grass

Wander about for a while, average what you see

Markov blanket: parents + children + children’s parents


Markov blanket sampling 4
Markov blanket of Cloudy is Sprinkler and Rain
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass

Probability given the Markov blanket is calculated as follows:

P(x_i | mb(X_i)) = α P(x_i | Parents(X_i)) ∏_{Z_j ∈ Children(X_i)} P(z_j | Parents(Z_j))

Easily to implement in message-passing parallel systems, brains

Main computational problems:


1) Difficult to tell if convergence has been achieved
2) Can be wasteful if Markov blanket is large:
P (Xi|mb(Xi)) won’t change much (law of large numbers)

Markov blanket: parents + children + children’s parents

M C M C example contd. 4

Estimate P(Rain | Sprinkler = true, WetGrass = true)

Sample Cloudy or Rain given its Markov blanket, repeat.
Count the number of times Rain is true and false in the samples.

E.g., visit 100 states
31 have Rain = true, 69 have Rain = false
P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Why does it work


 Theorem: chain approaches stationary distribution:
long-run fraction of time spent in each state is exactly proportional
to its posterior probability

Bayes Net Sampling Summary
 Prior Sampling P  Rejection Sampling P( Q | e )

 Likelihood Weighting P( Q | e)  Gibbs Sampling P( Q | e )

Summary : Conditional independence in BNs
 Compare the Bayes net global semantics
  P(X1,..,Xn) = ∏_i P(Xi | Parents(Xi))
 with the chain rule identity
  P(X1,..,Xn) = ∏_i P(Xi | X1,…,Xi-1)

 Assume (without loss of generality) that X1,..,Xn sorted in topological order according to
the graph (i.e., parents before children), so Parents(Xi)  X1,…,Xi-1

 So the Bayes net asserts conditional independences P(Xi | X1,…,Xi-1) = P(Xi | Parents(Xi))
 To ensure these are valid, choose parents for node Xi that “shield” it from other predecessors

Summary
 Independence and conditional independence are
important forms of probabilistic knowledge

 Bayes net encode joint distributions efficiently by


taking advantage of conditional independence
 Global joint probability = product of local conditionals
 Local causality => exponential reduction in total size

Chapter 16
Making Simple Decisions
Utilities

Maximum Expected Utility

 Principle of maximum expected utility:


 A rational agent should choose the action that maximizes its
expected utility, given its knowledge

 Questions:
 Where do utilities come from?
 How do we know such utilities even exist?
 How do we know that averaging even makes sense?
 What if our behavior (preferences) can’t be described by utilities?

The need for numbers

(game tree: leaf values 0, 40, 20, 30 become 0, 1600, 400, 900 after the monotonic transformation x → x²)

 For worst-case minimax reasoning, terminal value scale doesn’t matter


 We just want better states to have higher evaluations (get the ordering right)
 The optimal decision is invariant under any monotonic transformation

 For average-case expectimax reasoning, we need magnitudes to be meaningful


Utilities

 Utilities are functions from outcomes (states of the world) to real numbers
that describe an agent’s preferences

 Where do utilities come from?


 In a game, may be simple (+1/-1)
 Utilities summarize the agent’s goals
 Theorem: any “rational” preferences can
be summarized as a utility function

 We hard-wire utilities and let behaviors emerge
 Why don’t we let agents pick utilities?
 Why don’t we prescribe behaviors?

Utilities: Uncertain Outcomes
Example: getting ice cream — choose Get Single or Get Double; Get Double has
uncertain outcomes (“Oops” or “Whew!”)
Preferences

 An agent must have preferences among:
 Prizes: A, B, etc.
 Lotteries: situations with uncertain prizes,
L = [p, A; (1-p), B]
 Notation:
 Preference: A > B
 Indifference: A ~ B

Rational Preferences
The Axioms of Rationality:
preferences of a rational agent must obey constraints.
Orderability:
exactly one of (A > B), (B > A), (A ~ B) holds
Transitivity:
(A > B) ∧ (B > C) ⇒ (A > C)
Continuity:
(A > B > C) ⇒ ∃p [p, A; 1-p, C] ~ B
Substitutability:
(A ~ B) ⇒ [p, A; 1-p, C] ~ [p, B; 1-p, C]
Monotonicity:
(A > B) ⇒
((p ≥ q) ⇔ [p, A; 1-p, B] ⪰ [q, A; 1-q, B])

Theorem: Rational preferences imply behavior describable as maximization of expected utility


Rational Preferences
 We want some constraints on preferences before we call them rational, such as:

Axiom of Transitivity: (A > B) ∧ (B > C) ⇒ (A > C)

 For example: an agent with intransitive preferences can


be induced to give away all of its money
 If B > C, then an agent with C would pay (say) 1 cent to get B
 If A > B, then an agent with B would pay (say) 1 cent to get A
 If C > A, then an agent with A would pay (say) 1 cent to get C

Maximizing Expected Utility : MEU Principle

 Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]


 Given any preferences satisfying these constraints, there exists a real-valued
function U such that:

U(A)  U(B)  A  B
U([p1,S1; … ; pn,Sn]) = p1U(S1) + … + pnU(Sn)
 i.e. values assigned by U preserve preferences of both prizes and lotteries!
 Optimal policy invariant under positive affine transformation U’ = aU+b, a>0

 Maximum expected utility (MEU) principle:


 Choose the action that maximizes expected utility
 Note: rationality does not require representing or manipulating utilities and probabilities
 E.g., a lookup table for perfect tic-tac-toe
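The MEU principle and the affine-invariance claim can be sketched as follows; the states, utilities, and lotteries are made-up illustrations:

```python
# MEU: pick the action whose lottery has the highest expected utility.
# States, utilities, and lotteries here are illustrative.
def expected_utility(lottery, U):
    return sum(p * U[s] for p, s in lottery)

U = {'win': 100, 'lose': 0, 'draw': 40}
actions = {
    'gamble': [(0.5, 'win'), (0.5, 'lose')],   # EU = 50
    'settle': [(1.0, 'draw')],                 # EU = 40
}
best = max(actions, key=lambda a: expected_utility(actions[a], U))

# Positive affine transform U' = aU + b (a > 0): the optimal action is unchanged.
U2 = {s: 3 * u + 7 for s, u in U.items()}
best2 = max(actions, key=lambda a: expected_utility(actions[a], U2))
```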
Human Utilities
 Utilities map states to real numbers. Which numbers?
 Standard approach to assessment (elicitation) of human utilities:
 Compare a prize A to a standard lottery Lp between
 “best possible prize” u⊤ with probability p
 “worst possible catastrophe” u⊥ with probability 1-p
 Adjust lottery probability p until indifference: A ~ Lp
 Resulting p is a utility in [0,1]

E.g., “pay $50” ~ [0.999999, no change; 0.000001, instant death]


Money
 Money does not behave as a utility function: monetary amounts alone give no information
about preferences between lotteries involving money
 Given a lottery L with expected monetary value EMV(L), usually U (L) < U(EMV(L)),
i.e., people are risk-averse
 Utility curve: for what probability p am I indifferent between a prize x and a lottery [p, $M ; (1 − p), $0]
for large M ?
 Typical empirical data (over roughly −$150,000 to $800,000), extrapolated:
the curve is concave (risk-averse) over gains — people may prefer a sure thing —
and risk-seeking over losses, where people may be willing to take more risk
Money

 Flip of a coin
 Take the 1M prize that you have and walk away, or
 Flip the coin one more time for a chance to win 2.5M
 Expected Monetary Value (EMV) : ½ * 0 + ½ * 2.5M = 1.25M
 1.25M > 1M ?
 Most people will not take the gamble
 Sk : denotes the state of possessing wealth $k
 Expected Utility :
 EU(accept) = ½ * U(Sk) + ½ * U(Sk+2.5M)
 EU(decline) = U(Sk+1M)
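A concave utility makes the numbers concrete. Assuming (purely for illustration) U(wealth) = log(wealth) and a current wealth of $100,000, declining maximizes expected utility even though the EMV favors accepting:

```python
import math

# Why decline the double-or-nothing flip despite EMV 1.25M > 1M:
# assume (illustratively) U(wealth) = log(wealth) and current wealth k = $100,000.
k = 100_000
EU_accept = 0.5 * math.log(k) + 0.5 * math.log(k + 2_500_000)   # ½U(Sk) + ½U(Sk+2.5M)
EU_decline = math.log(k + 1_000_000)                            # U(Sk+1M)

EMV_accept = 0.5 * 0 + 0.5 * 2_500_000   # 1.25M, more than the sure 1M
```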
Post-decision Disappointment: the Optimizer’s Curse
 Usually we don’t have direct access to exact utilities, only estimates
 E.g., you could make one of k investments
 An unbiased expert assesses their expected net profit V1,…,Vk
 You choose the best one V*
 With high probability, its actual value is considerably less than V*
 Error in each utility estimate is independent and has a unit normal distribution N(0,1)
 Only if the whole process is repeated many times will we get, on average, the utility we expect
 As the number of choices increases, the potential for disappointment also increases
 This is a serious problem in many areas:
 Future performance of mutual funds
 Efficacy of drugs measured by trials
 Statistical significance in scientific papers
 Winning an auction
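A small Monte Carlo simulation illustrates the curse; the choice of k = 10 and the trial count are arbitrary:

```python
import random

# Optimizer's curse: k options whose true value is 0, each estimated with
# independent unit-normal error N(0,1); we always pick the largest estimate V*.
random.seed(0)
k, trials = 10, 2000
avg_chosen_estimate = 0.0
for _ in range(trials):
    estimates = [random.gauss(0, 1) for _ in range(k)]
    avg_chosen_estimate += max(estimates) / trials

# The winner's estimate averages well above the true value 0, so the decision
# maker is systematically disappointed; the bias grows with k.
```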
Humans are predictably irrational
 Most people will choose
 B over A : sure thing
 C over D : higher monetary value
 There is no utility function that is consistent
with both of these choices
 People are strongly attracted to gains that
are certain

Multi-attribute utility
 How can we handle utility functions of many variables X1, …, Xn?
 Example: choosing the site of an airport, what is U(Deaths, Noise, Cost)?

 How can complex utility functions be assessed from preference behavior?
 Idea 1: identify conditions under which decisions can be made without
complete identification of U(X1, …, Xn)
 Idea 2: identify various types of independence in preferences and
derive consequent canonical forms for U(X1, …, Xn)

Strict dominance
Typically define attributes such that U is monotonic in each.
Strict dominance: choice B strictly dominates choice A iff
∀i Xi(B) ≥ Xi(A) (and hence U(B) ≥ U(A))

 Dominance can be checked for deterministic attributes (B lies in the region
that dominates A) and, region-wise, for uncertain attributes
 Strict dominance seldom holds in practice
 Example: suppose airport site S1 costs less, generates less noise pollution,
and is safer than site S2. One would not hesitate to reject S2; we then say
there is strict dominance of S1 over S2.
Stochastic dominance

Suppose airport site S1 costs less, generates less noise pollution, and is safer
than site S2: reject S2 — strict dominance of S1 over S2.

Stochastic dominance:
 The cumulative distribution measures the probability that the cost is ≤ a given amount
 If the cumulative distribution of S1 is always to the right of that of S2,
then S1 is cheaper than S2 in expectation
 If U is monotonic in x, then A1 with outcome distribution p1 stochastically
dominates A2 with outcome distribution p2 iff
∀x ∫_{−∞}^{x} p1(x′) dx′ ≤ ∫_{−∞}^{x} p2(x′) dx′
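For discrete outcome distributions the dominance check is a simple CDF comparison. The two cost distributions below are hypothetical:

```python
# First-order stochastic dominance via cumulative distributions.
# Hypothetical discrete cost distributions for sites S1 and S2 (lower cost = better).
costs = [2, 3, 4, 5]          # cost levels, $ billions
p1 = [0.5, 0.3, 0.2, 0.0]     # site S1
p2 = [0.1, 0.2, 0.3, 0.4]     # site S2

def cdf(p):
    out, total = [], 0.0
    for prob in p:
        total += prob
        out.append(total)
    return out

# For costs, S1 dominates S2 if S1's CDF is everywhere at least S2's:
# S1 is at least as likely to stay under every budget level.
s1_dominates_s2 = all(a >= b for a, b in zip(cdf(p1), cdf(p2)))
```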
Preference structure: Deterministic
X1 and X2 are preferentially independent of X3 iff
preference between (x1, x2, x3) and (x′1, x′2, x3) does not depend on x3

E.g., (Noise, Cost, Safety):
(20,000 suffer, $4.6 billion, 0.06 deaths/mpm) vs.
(70,000 suffer, $4.2 billion, 0.06 deaths/mpm)

Theorem (Leontief, 1947): if every pair of attributes is P.I. of its complement,
then every subset of attributes is P.I. of its complement: mutual P.I.

Theorem (Debreu, 1960): mutual P.I. ⇒ ∃ additive value function
V(x1, …, xn) = Σi Vi(xi)
Hence assess n single-attribute functions; often a good approximation, e.g.
V(Noise, Cost, Safety) = −noise × 10^4 − cost − deaths × 10^12

Preference structure: Stochastic
Need to consider preferences over lotteries:
X is utility-independent of Y iff
preferences over lotteries in X do not depend on y

Mutual U.I.: each subset is U.I of its complement


⇒ ∃ multiplicative utility function:
U = k1U1 + k2U2 + k3U3
+ k1k2U1U2 + k2k3U2U3 + k3k1U3U1
+ k1k2k3U1U2U3

Decision Networks

In its most general form, a decision network represents information about:
• The agent’s current state
• Its possible actions
• The state that will result from its actions
• The utility of that state

Decision network = Bayes net + Actions + Utilities

• Action nodes (rectangles, cannot have parents,


will have value fixed by algorithm)
• Utility nodes (diamond, depends on action and
chance nodes)
• Chance nodes (ovals): random variables, as in Bayesian networks.
Example: Take an umbrella?
Nodes: Umbrella (action), Weather (chance), Forecast (observed evidence), U (utility).

A      W     U(A,W)
leave  sun   100
leave  rain  0
take   sun   20
take   rain  70

W     P(W)
sun   0.7
rain  0.3

W     P(F=bad|W)
sun   0.17
rain  0.77
Decision Networks
 Decision network = Bayes net + Actions + Utilities
 Action nodes (rectangles; cannot have parents; value fixed by the algorithm)
 Utility nodes (diamonds; depend on action and chance nodes)
 Decision algorithm: Bayes net inference!
 Fix evidence e
 For each possible action a
 Fix action node to a
 Compute posterior P(W|e,a) for parents W of U
 Compute expected utility Σw P(w|e,a) U(a,w)
 Return action with highest expected utility
Example: Take an umbrella?
 Decision algorithm (Bayes net inference):
 Fix evidence F = bad
 For each action a, fix the action node to a, compute the posterior P(W|F=bad),
then the expected utility Σw P(w|F=bad) U(a,w)

Posterior: P(sun|F=bad) = 0.34, P(rain|F=bad) = 0.66

Umbrella = leave:
EU(leave|F=bad) = Σw P(w|F=bad) U(leave,w) = 0.34×100 + 0.66×0 = 34
Umbrella = take:
EU(take|F=bad) = Σw P(w|F=bad) U(take,w) = 0.34×20 + 0.66×70 = 53

 Optimal decision = take!
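The same computation in a few lines of Python, using the numbers above:

```python
# Expected-utility computation for the umbrella network, using the slide's numbers.
U = {('leave', 'sun'): 100, ('leave', 'rain'): 0,
     ('take', 'sun'): 20, ('take', 'rain'): 70}
P_w_given_bad = {'sun': 0.34, 'rain': 0.66}   # posterior P(W | F=bad)

def EU(a):
    return sum(P_w_given_bad[w] * U[(a, w)] for w in ('sun', 'rain'))

best = max(('leave', 'take'), key=EU)   # EU(leave) = 34, EU(take) = 53
```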
Value of information
 General idea: value of information = expected improvement in decision quality
from observing value of a variable
 E.g., oil company deciding on seismic exploration and test drilling
 E.g., doctor deciding whether to order a blood test
 E.g., person deciding on whether to look before crossing the road
 Key point: decision network contains everything needed to compute it!
 Value of perfect information
 VPI(Ei | e) = [ Σei P(ei | e) maxa EU(a|ei,e) ] − maxa EU(a|e)

Value of information : is all information relevant ?
Idea: compute value of acquiring each possible piece of evidence

Can be done directly from decision network

Example: buying oil drilling rights


• Two blocks A and B, exactly one has oil, worth k
• Prior probabilities 0.5 each, mutually exclusive
• Current price of each block is k/2
• “Consultant” offers accurate survey of A.
• What is a Fair price for the survey?
Solution: compute expected value of information
= expected value of best action given the information minus expected value of
best action without information

Survey may say “oil in A” or “no oil in A”, prob. 0.5 each (given!)
Expected value of best action given the information
= 0.5 × (value of “buy A” given “oil in A”) + 0.5 × (value of “buy B” given “no oil in A”)
= (0.5 × k/2) + (0.5 × k/2) = k/2
Expected value of best action without the information = 0 (each block is priced fairly),
so the value of the information — the fair price of the survey — is k/2.
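The oil-rights computation can be checked directly (k is symbolic; any positive value gives VPI = k/2):

```python
# Value of the consultant's survey on block A (k = value of the block with oil).
k = 1.0                      # symbolic scale; any k > 0 gives VPI = k/2
price = k / 2                # current price of each block

# Without the survey: each purchase has expected profit 0.5*k - k/2 = 0.
EU_no_info = max(0.5 * k - price, 0.0)

# With the survey ("oil in A" / "no oil in A", probability 0.5 each):
# buy whichever block the survey indicates, for profit k - k/2 in either case.
EU_with_info = 0.5 * (k - price) + 0.5 * (k - price)

VPI = EU_with_info - EU_no_info    # = k/2, the fair price of the survey
```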
VPI Properties
 maxa EU(a|e): value of the current best action α
 maxa EU(a|ej,e): revised value if evidence Ej = ej is obtained
 Ej is the RV whose value is currently unknown, so average across all values ej

 VPI is non-negative: VPI(Ei | e) ≥ 0
 VPI is not (usually) additive: VPI(Ei , Ej | e) ≠ VPI(Ei | e) + VPI(Ej | e)
 VPI is order-independent: VPI(Ei , Ej | e) = VPI(Ej , Ei | e)


Decisions with unknown preferences
 In reality the assumption that we can write down our exact
preferences for the machine to optimize is false
 A machine optimizing the wrong preferences causes problems
 A machine that is explicitly uncertain about the human’s
preferences will defer to the human

Chapter 19
Learning from Examples

Deep Learning Powering Everyday Products

Learning
 Learning: a process for improving the performance of an agent
through experience
 Learning :
 The general idea: generalization from experience
 Supervised learning: classification and regression
 Learning is useful as a system construction method, i.e., expose
the system to reality rather than trying to write it down

Key questions when building a learning agent
 What is the agent design that will implement the desired
performance?
 Improve the performance of what piece of the agent system and
how is that piece represented?
 What data are available relevant to that piece? (In particular, do
we know the right answers?)
 What knowledge is already available?

Forms of learning
 Supervised learning: correct answers for each training instance
 Reinforcement learning: reward sequence, no correct answers
 Unsupervised learning: “just make sense of the data”

Supervised learning
 To learn an unknown target function f
 Input: a training set of labeled examples (xj,yj) where yj = f(xj)
 E.g., xj is an image, f(xj) is the label “giraffe”
 E.g., xj is a seismic signal, f(xj) is the label “explosion”
 Output: hypothesis h that is “close” to f, i.e., predicts well on unseen
examples (“test set”)
 Many possible hypothesis families for h
 Linear models, logistic regression, neural networks, decision trees, instance-based
methods (nearest-neighbor), grammars, kernelized separators, etc.
 Classification = learning f with discrete output value
 Regression = learning f with real-valued output value
Inductive Learning (Science)
 Simplest form: learn a function from examples
 A target function: g
 Examples: input-output pairs (x, g(x))
 E.g. x is an email and g(x) is spam / ham
 E.g. x is a house and g(x) is its selling price

 Problem:
 Given a hypothesis space H
 Given a training set of examples xi
 Find a hypothesis h(x) such that h ~ g

 Includes:
 Classification (outputs = class labels)
 Regression (outputs = real numbers)

Regression example: Curve fitting

Consistency vs. Simplicity
 Fundamental tradeoff: bias vs. variance

 Usually algorithms prefer consistency by default (why?)

 Several ways to operationalize “simplicity”


 Reduce the hypothesis space
 Assume more: e.g. independence assumptions, as in naïve Bayes
 Have fewer, better features / attributes: feature selection
 Other structural limitations (decision lists vs trees)
 Regularization : penalize complex hypothesis
 Smoothing: cautious use of small counts
 Many other generalization parameters (pruning cutoffs today)
 Hypothesis space stays big, but harder to get to the outskirts

Basic questions
 Which hypothesis space H to choose?
 How to measure degree of fit?
 How to trade off degree of fit vs. complexity?
 “Ockham’s razor”
 How do we find a good h?
 How do we know if a good h will predict well?

Classification - Examples

Classification example: Object recognition
(Training examples: images x labeled f(x) = giraffe (×3) and f(x) = llama (×3);
test image X with f(X) = ?)
Example: Spam Filter
 Input: an email
 Output: spam/ham
 Setup:
 Get a large collection of example emails, each labeled “spam” or “ham” (by hand)
 Learn to predict labels of new incoming emails
 Classifiers reject 200 billion spam emails per day
 Features: the attributes used to make the ham / spam decision
 Words: FREE!
 Text patterns: $dd, CAPS
 Non-text: SenderInContacts, AnchorLinkMismatch
 …

Example spam: “Dear Sir. First, I must solicit your confidence in this transaction,
this is by virture of its nature as being utterly confidencial and top secret. …”
“TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT
"REMOVE" IN THE SUBJECT.”
“99 MILLION EMAIL ADDRESSES FOR ONLY $99”
Example ham: “Ok, I know this is blatantly OT but I'm beginning to go insane. Had
an old Dell Dimension XPS sitting in the corner and decided to put it to use, I
know it was working pre being stuck in the corner, but when I plugged it in, hit
the power nothing happened.”
Example: Digit Recognition

 Input: images / pixel grids
 Output: a digit 0–9
 Setup:
 MNIST data set of 60K hand-labeled images
 Note: someone has to hand label all this data!
 Want to learn to predict labels of new digit images
 Features: the attributes used to make the digit decision
 Pixels: (6,8)=ON
 Shape patterns: NumComponents, AspectRatio, NumLoops
 …
Other Classification Tasks
 Medical diagnosis
 input: symptoms
 output: disease
 Automatic essay grading
 input: document
 output: grades
 Fraud detection
 input: account activity
 output: fraud / no fraud
 Email routing
 input: customer complaint email
 output: which department needs to ignore this email
 Fruit and vegetable inspection
 input: image (or gas analysis)
 output: moldy or OK

 … many more

Decision Trees

Decision tree learning
 Decision tree models
 Tree construction
 Measuring learning performance

Restaurant Example

Restaurant Example

Decision trees
 Popular representation
for classifiers
 Even among humans!
 I’ve just arrived at a
restaurant: should I stay
(and wait for a table) or
go elsewhere?

Decision trees
 It’s Friday night and you’re hungry
 You arrive at your favorite cheap but
really cool happening burger place
 It’s full up and you have no
reservation but there is a bar
 The host estimates a 45 minute wait
 There are alternatives nearby but it’s
raining outside

 Decision tree partitions the input space and assigns a label to each partition
Expressiveness
 Discrete decision trees can express any function of the input in propositional
logic
 E.g., for Boolean functions, build a path from root to leaf for each row of the
truth table:

 True/false: there is a consistent decision tree that fits any training set exactly
 But a tree that simply records the examples is essentially a lookup table
 To get generalization to new examples, need a compact tree
Quiz
 How many distinct functions of n Boolean attributes?
 = number of truth tables with n inputs
 = number of ways of filling in the output column with 2^n entries
 = 2^(2^n)
 For n = 6 attributes, there are 18,446,744,073,709,551,616 functions
 and many more trees than that!

Hypothesis spaces in general
 Increasing the expressiveness of the hypothesis language
 Increases the chance that the true function can be expressed
 Increases the number of hypotheses that are consistent with training set
 => many consistent hypotheses have large test error
 => may reduce prediction accuracy!
 With 2^(2^n) hypotheses, all but an exponentially small fraction will
require O(2^n) bits to express in any representation (even brains!)
 I.e., any given representation can represent only an exponentially
small fraction of hypotheses concisely; no universal compression
 E.g., decision trees are bad at “k-out-of-n” functions
Learning a Decision Tree
 Goal
 Given training data of a list of attributes & label
 Hypothesis family H: decision trees
 Find (approximately) the smallest decision tree that fits the training data
 Iterative process to choose the next attribute to split on

Choosing an attribute: Information gain
 Idea: measure contribution of attribute to increasing “purity” of
labels in each subset of examples; find most distinguishing feature

 Patrons is a better choice: it gives more information about the classification
(it is the more distinguishing feature)
Information
 Information answers questions
 The more clueless I am about the answer initially, the more
information is contained in the answer
 Scale: 1 bit = answer to Boolean question with prior 0.5, 0.5
 For a coin with probability p of heads, entropy
H(p, 1-p) = −p log2 p − (1-p) log2 (1-p)
 Convenient notation: B(p) = H(p, 1-p)

Information
 Information in an answer when the prior is p1,…,pn is
H(p1, …, pn) = Σi −pi log2 pi
 This is the entropy of the prior
 Entropy was initially proposed in physics (thermodynamics)
 Shannon developed information theory, using entropy to
measure information

Information gain from splitting on an attribute
 Suppose we have p positive and n negative examples at the root
 => B(p/(p+n)) bits needed to classify a new example
 E.g., for 12 restaurant examples, p = n = 6 so we need 1 bit
 An attribute splits the examples E into subsets Ek, each of which (we
hope) needs less information to complete the classification
 For an example in Ek we expect to need B(pk/(pk+nk)) more bits
 Probability a new example goes into Ek is (pk+nk)/(p+n)
 Expected number of bits needed after the split is Σk (pk+nk)/(p+n) B(pk/(pk+nk))
 Information gain = B(p/(p+n)) − Σk (pk+nk)/(p+n) B(pk/(pk+nk))
Example

Gain(Patrons) = 1 – [(2/12)B(0) + (4/12)B(1) + (6/12)B(2/6)] ≈ 0.541 bits
Gain(Type) = 1 – [(2/12)B(1/2) + (2/12)B(1/2) + (4/12)B(2/4) + (4/12)B(2/4)] = 0 bits
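These gains can be verified numerically; the (pk, nk) counts per child come from the restaurant example above:

```python
import math

# B(q) = H(q, 1-q); gain(split) per the information-gain formula. Counts
# (pk, nk) per child are from the restaurant example (root: p = n = 6).
def B(q):
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def gain(splits, p=6, n=6):
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)
    return B(p / (p + n)) - remainder

gain_patrons = gain([(0, 2), (4, 0), (2, 4)])          # None, Some, Full
gain_type = gain([(1, 1), (1, 1), (2, 2), (2, 2)])     # French, Italian, Thai, Burger
```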

Results for restaurant data

Decision tree learned from


the 12 examples:
 Simpler than “true” tree!

Decision tree learning
function Decision-Tree-Learning(examples, attributes, parent_examples) returns a tree
  if examples is empty then return Plurality-Value(parent_examples)
  else if all examples have the same classification then return the classification
  else if attributes is empty then return Plurality-Value(examples)
  else
    A ← argmax_{a ∈ attributes} Importance(a, examples)
    tree ← a new decision tree with root test A
    for each value vk of A do
      exs ← the subset of examples with value vk for attribute A
      subtree ← Decision-Tree-Learning(exs, attributes − A, examples)
      add a branch to tree with label (A = vk) and subtree subtree
    return tree
Plurality-Value selects the most common output value among a set of examples.
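A minimal runnable Python version of the pseudocode, with examples as dicts and Importance as a plug-in (the trivial stand-in below always prefers Patrons rather than computing information gain; the data set is illustrative):

```python
from collections import Counter

# Minimal sketch of Decision-Tree-Learning; each example is a dict of
# attribute values plus a 'label' key.
def plurality_value(examples):
    return Counter(e['label'] for e in examples).most_common(1)[0][0]

def dtl(examples, attributes, parent_examples, importance):
    if not examples:
        return plurality_value(parent_examples)
    labels = {e['label'] for e in examples}
    if len(labels) == 1:
        return labels.pop()
    if not attributes:
        return plurality_value(examples)
    A = max(attributes, key=lambda a: importance(a, examples))
    tree = {A: {}}
    for v in {e[A] for e in examples}:
        exs = [e for e in examples if e[A] == v]
        tree[A][v] = dtl(exs, [a for a in attributes if a != A], examples, importance)
    return tree

# Tiny illustrative data set: wait iff Patrons == 'Some'.
data = [{'Patrons': 'Some', 'Rain': True,  'label': 'wait'},
        {'Patrons': 'Some', 'Rain': False, 'label': 'wait'},
        {'Patrons': 'Full', 'Rain': True,  'label': 'leave'},
        {'Patrons': 'None', 'Rain': False, 'label': 'leave'}]
tree = dtl(data, ['Patrons', 'Rain'], data, importance=lambda a, ex: a == 'Patrons')
```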
Choosing the Right Model

A few important points about learning
 Data: labeled instances, e.g. emails marked spam/ham
 Training set
 Held-out set (validation set)
 Test set
 Features: attribute–value pairs which characterize each x
 Experimentation cycle
 Learn parameters (e.g. model probabilities) on the training set
 Tune hyperparameters on the held-out set
 Compute accuracy on the test set
 Very important: never “peek” at the test set!
 Evaluation
 Accuracy: fraction of instances predicted correctly
 Overfitting and generalization
 Want a classifier which does well on test data
 Overfitting: fitting the training data very closely, but not generalizing well
 Underfitting: fits the training set poorly
A few important points about learning
Our goal in machine learning is to select a hypothesis
that will optimally fit future examples.

 Future examples?
 Same prior distribution as the past examples (otherwise all bets are off!) : P(Ej) =
P(Ej+1) = P(Ej+2)
 Each example is independent of previous examples : P(Ej) = P(Ej | Ej-1, Ej-2, …)
 i.i.d (independent, identically distributed)

 Optimally fit?
 error rate: the proportion of times that h(x)≠y for an (x,y) example

A few important points about learning
Our goal in machine learning is to select a hypothesis that will optimally fit future examples.

 What should we learn where?


 Learn parameters from training data
 Tune hyperparameters on different data
 Why?
 For each value of the hyperparameters, train
and test on the held-out data
 Choose the best value and do a final test on
the test data

 What are examples of hyperparameters?


Results for restaurant data

K-fold cross validation

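A sketch of the k-fold protocol: split the data into k folds, hold out each fold once, and average a per-fold score. The fold assignment below is a simple round-robin:

```python
# k-fold cross-validation skeleton: every example is held out exactly once.
def k_fold_splits(data, k):
    folds = [data[i::k] for i in range(k)]      # round-robin fold assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

data = list(range(10))
fold_sizes = []
for train, test in k_fold_splits(data, k=5):
    assert set(train) | set(test) == set(data)  # folds cover the data
    assert not set(train) & set(test)           # train and test never overlap
    fold_sizes.append(len(test))
avg_fold_size = sum(fold_sizes) / len(fold_sizes)
```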
Error rates/Loss Functions

Because p(x,y) is not known

Regularization
An alternative to cross-validation

• λ is a hyperparameter
• Balances empirical loss versus complexity
• Penalizing complexity ⇒ regularization

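A sketch of the idea: pick the hypothesis minimizing empirical loss + λ × complexity. The hypotheses and their (loss, complexity) numbers are invented for illustration:

```python
# Regularized model selection: total cost = empirical loss + lambda * complexity.
# Hypothetical hypotheses whose training loss falls as complexity grows.
hypotheses = {          # name: (training_loss, complexity)
    'linear':   (0.40, 1),
    'cubic':    (0.10, 3),
    'degree-9': (0.01, 9),
}

def total_cost(h, lam):
    loss, complexity = hypotheses[h]
    return loss + lam * complexity

# A small lambda tolerates complexity; a larger lambda favors simpler models.
best_small = min(hypotheses, key=lambda h: total_cost(h, lam=0.001))
best_large = min(hypotheses, key=lambda h: total_cost(h, lam=0.1))
```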