Quantifying Uncertainty
1
Uncertainty
The real world is rife with uncertainty!
E.g., if I leave for SFO 60 minutes before my flight, will I be there in time?
Problems:
partial observability (road state, other drivers’ plans, etc.)
noisy sensors (radio traffic reports, Google maps)
immense complexity of modelling and predicting traffic, security line, etc.
lack of knowledge of world dynamics (will tire burst? need COVID test?)
Probabilistic assertions summarize effects of ignorance and laziness
Combine probability theory + utility theory -> decision theory
Maximize expected utility: a* = argmaxa Σs P(s | a) U(s)
2
https://inst.eecs.berkeley.edu/~cs188/sp21/
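A minimal sketch of this computation in Python, with made-up probabilities and utilities for the airport example (none of these numbers are from the slides):

# Maximize expected utility: a* = argmax_a sum_s P(s | a) U(s)
# Toy numbers for illustration only.
P = {  # P(s | a): outcome probabilities for each departure action
    "leave_60":  {"on_time": 0.70, "late": 0.30},
    "leave_120": {"on_time": 0.95, "late": 0.05},
}
U = {"on_time": 100, "late": -500}  # utilities of outcomes

def expected_utility(a):
    return sum(P[a][s] * U[s] for s in P[a])

best = max(P, key=expected_utility)
print(best, expected_utility(best))  # picks the action with the highest expected utility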
Outline
− Uncertainty
− Probability
− Syntax and Semantics
− Inference
− Independence and Bayes’ Rule
https://inst.eecs.berkeley.edu/~cs188/sp21/
Uncertainty
Let action At = leave for airport t minutes before flight
https://inst.eecs.berkeley.edu/~cs188/sp21/
Making decisions under uncertainty
Suppose I believe the following:
P(A25 gets me there on time | …) = 0.04
P(A90 gets me there on time | …) = 0.70
P(A120 gets me there on time | …) = 0.95
P(A1440 gets me there on time | …) = 0.9999
https://inst.eecs.berkeley.edu/~cs188/sp21/
Basic laws of probability
Begin with a set of possible worlds Ω
E.g., the 6 possible rolls of a die: Ω = {1, 2, 3, 4, 5, 6}
Each possible world ω has a probability: 0 ≤ P(ω) ≤ 1 (here P(ω) = 1/6 for each roll)
The probabilities sum to 1: Σω P(ω) = 1
8
https://inst.eecs.berkeley.edu/~cs188/sp21/
Basic laws contd.
An event is any subset of Ω
E.g., “roll < 4” is the set {1, 2, 3}
E.g., “roll is odd” is the set {1, 3, 5}
The probability of an event is the sum of probabilities over its worlds:
P(A) = Σω∈A P(ω)
E.g., P(roll < 4) = P(1) + P(2) + P(3) = 1/2
https://inst.eecs.berkeley.edu/~cs188/sp21/
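A tiny sketch of the sum-over-worlds definition for the fair die:

from fractions import Fraction

# Uniform distribution over the six possible worlds
P = {w: Fraction(1, 6) for w in range(1, 7)}

def prob(event):
    """Probability of an event = sum of probabilities of the worlds in it."""
    return sum(P[w] for w in event)

print(prob({1, 2, 3}))                 # roll < 4   -> 1/2
print(prob({w for w in P if w % 2}))   # roll is odd -> 1/2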
Propositions
Think of a proposition as the event (set of sample points) where the proposition is true
Given Boolean random variables A and B:
event a = set of sample points where A(ω) = true
event ¬a = set of sample points where A(ω) = false
event a ∧ b = set of sample points where A(ω) = true and B(ω) = true
Often in AI applications, the sample points are defined by the values of a set of random variables, i.e.,
The sample space is the Cartesian product of the ranges of the variables
With Boolean variables, sample point = propositional logic model e.g.,
A = true, B = false, or a ∧ ¬b.
de Finetti (1931): an agent who bets according to probabilities that violate these axioms can be
forced to bet so as to lose money regardless of outcome.
https://inst.eecs.berkeley.edu/~cs188/sp21/
Probability Distributions
Associate a probability with each value; sums to 1
P(T):                P(W):
  T     P              W       P
  hot   0.5            sun     0.6
  cold  0.5            rain    0.1
                       fog     0.3
                       meteor  0.0

Joint distribution P(T, W):
               T = hot   T = cold
  W = sun        0.45      0.15
  W = rain       0.02      0.08
  W = fog        0.03      0.27
  W = meteor     0.00      0.00
13
https://inst.eecs.berkeley.edu/~cs188/sp21/
Syntax for propositions
Propositional or Boolean random variables e.g., Cavity (do I have a cavity?)
Cavity = true is a proposition, also written cavity
Discrete random variables (finite or infinite)
e.g., Weather is one of (sunny, rain, cloudy, snow)
Weather = rain is a proposition
Values must be exhaustive and mutually exclusive
Continuous random variables (bounded or unbounded) e.g.,
Temp = 21.6; also allow, e.g., Temp < 22.0.
Arbitrary Boolean combinations of basic propositions
https://inst.eecs.berkeley.edu/~cs188/sp21/
Making possible worlds
In many cases we
begin with random variables and their domains
construct possible worlds as assignments of values to all variables
E.g., two dice rolls Roll1 and Roll2
How many possible worlds?
What are their probabilities?
Size of distribution for n variables with range size d? d^n
For all but the smallest distributions, cannot write out by hand!
15
https://inst.eecs.berkeley.edu/~cs188/sp21/
Probabilities of events
Recall that the probability of an event is the sum of probabilities of its worlds:
P(A) = Σω∈A P(ω)
So, given a joint distribution over all variables, we can compute any event probability!

Joint distribution P(T, W):
               T = hot   T = cold
  W = sun        0.45      0.15
  W = rain       0.02      0.08
  W = fog        0.03      0.27
  W = meteor     0.00      0.00

Probability that it’s hot AND sunny?
Probability that it’s hot?
Probability that it’s hot OR not foggy?
16
https://inst.eecs.berkeley.edu/~cs188/sp21/
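A small sketch of reading event probabilities off this joint table (the printed values follow directly from the numbers above):

# Joint distribution P(T, W) from the table above, keyed by (T, W)
P = {
    ("hot", "sun"): 0.45, ("cold", "sun"): 0.15,
    ("hot", "rain"): 0.02, ("cold", "rain"): 0.08,
    ("hot", "fog"): 0.03, ("cold", "fog"): 0.27,
    ("hot", "meteor"): 0.00, ("cold", "meteor"): 0.00,
}

def prob(event):
    """Sum the joint entries of all worlds satisfying the event."""
    return sum(p for (t, w), p in P.items() if event(t, w))

print(prob(lambda t, w: t == "hot" and w == "sun"))   # 0.45
print(prob(lambda t, w: t == "hot"))                  # 0.45 + 0.02 + 0.03 + 0.00 = 0.5
print(prob(lambda t, w: t == "hot" or w != "fog"))    # 1 - P(cold, fog) = 0.73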
Marginal Distributions
Marginal distributions are sub-tables which eliminate variables
Marginalization (summing out): collapse a dimension by adding, e.g. P(a) = Σb P(a, b)
Conditional probability: P(a | b) = P(a, b) / P(b)
From the joint P(T, W) above:
P(W=sun | T=cold) = P(W=sun, T=cold) / P(T=cold) = 0.15 / 0.50 = 0.3
19
https://inst.eecs.berkeley.edu/~cs188/sp21/
Normalizing a distribution
(Dictionary) To bring or restore to a normal condition
P(W | T=cold) = P(W, T=cold) / P(T=cold)
Select the T=cold column of the joint P(W, T) and normalize it so its entries sum to 1,
e.g., the sun entry: 0.15 → 0.15 / 0.50 = 0.30
21
https://inst.eecs.berkeley.edu/~cs188/sp21/
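A short sketch of the select-and-normalize step on the same joint P(T, W):

# Joint P(T, W) from the table above, keyed by (T, W)
P = {
    ("hot", "sun"): 0.45, ("cold", "sun"): 0.15,
    ("hot", "rain"): 0.02, ("cold", "rain"): 0.08,
    ("hot", "fog"): 0.03, ("cold", "fog"): 0.27,
    ("hot", "meteor"): 0.00, ("cold", "meteor"): 0.00,
}

# Condition on T = cold: select the matching entries, then renormalize
selected = {w: p for (t, w), p in P.items() if t == "cold"}
z = sum(selected.values())                       # P(T = cold) = 0.50
P_W_given_cold = {w: p / z for w, p in selected.items()}
print(P_W_given_cold)   # {'sun': 0.3, 'rain': 0.16, 'fog': 0.54, 'meteor': 0.0}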
The Product Rule: Example
Product rule: P(a, b) = P(a | b) P(b)
[Slide shows P(W | T) multiplied by P(T) to recover the joint P(T, W) from earlier, e.g. P(fog | hot) × P(hot) = 0.06 × 0.5 = 0.03]
22
https://inst.eecs.berkeley.edu/~cs188/sp21/
The Chain Rule
23
https://inst.eecs.berkeley.edu/~cs188/sp21/
Probabilistic Inference
Probabilistic inference: compute a desired probability
from a probability model
Typically for a query variable given evidence
E.g., P(airport on time | no accidents) = 0.90
These represent the agent’s beliefs given the evidence
24
https://inst.eecs.berkeley.edu/~cs188/sp21/
Inference by Enumeration
General case:
  Evidence variables: E1, …, Ek = e1, …, ek
  Query* variable: Q
  Hidden variables: H1, …, Hr
  (together these are all the variables X1, …, Xn)
We want: P(Q | e1, …, ek)    (* works fine with multiple query variables, too)
The probability model P(X1, …, Xn) is given
Step 1: Select the entries consistent with the evidence
Step 2: Sum out H from the model to get the joint of the query and evidence
Step 3: Normalize
https://inst.eecs.berkeley.edu/~cs188/sp21/
Inference by enumeration
Start with the joint distribution:
                 toothache               ¬toothache
               catch    ¬catch         catch    ¬catch
  cavity       .108      .012           .072      .008
  ¬cavity      .016      .064           .144      .576
https://inst.eecs.berkeley.edu/~cs188/sp21/
Normalization
Start with the same joint distribution and normalize over the selected entries, e.g.:
P(Cavity | toothache) = α P(Cavity, toothache) = α ⟨0.108 + 0.012, 0.016 + 0.064⟩ = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
https://inst.eecs.berkeley.edu/~cs188/sp21/
Inference by enumeration, contd.
Let
• X be the query variable (e.g. Cavity)
• E be the evidence variables with observed values e (e.g. Toothache)
• Y be the remaining hidden (unobserved) variables (e.g. Catch)
Then the required summation of joint entries is done by summing out the hidden variables Y:
P(X | E = e) = α P(X, E = e) = α Σy P(X, e, y)
https://inst.eecs.berkeley.edu/~cs188/sp21/
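A small sketch of this α-normalized summation over the toothache/catch/cavity joint shown above (worlds are ordered (Toothache, Catch, Cavity)):

# Joint P(Toothache, Catch, Cavity) from the table above
joint = {
    (True, True, True): 0.108,  (True, False, True): 0.012,
    (False, True, True): 0.072, (False, False, True): 0.008,
    (True, True, False): 0.016, (True, False, False): 0.064,
    (False, True, False): 0.144, (False, False, False): 0.576,
}

def query(var_index, evidence):
    """P(X | e) = alpha * sum over the hidden variables of the joint entries."""
    dist = {True: 0.0, False: 0.0}
    for world, p in joint.items():
        if all(world[i] == v for i, v in evidence.items()):
            dist[world[var_index]] += p
    alpha = 1.0 / sum(dist.values())
    return {v: alpha * p for v, p in dist.items()}

print(query(2, {0: True}))   # P(Cavity | toothache) = {True: 0.6, False: 0.4}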
Inference by Enumeration
Obvious problems:
Worst-case time complexity O(d^n)
Space complexity O(d^n) to store the joint distribution
35
https://inst.eecs.berkeley.edu/~cs188/sp21/
Bayes Rule
36
https://inst.eecs.berkeley.edu/~cs188/sp21/
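For reference, Bayes' rule follows from the product rule P(a, b) = P(a | b) P(b) = P(b | a) P(a), giving P(a | b) = P(b | a) P(a) / P(b).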
Bayes’ Rule
38
https://inst.eecs.berkeley.edu/~cs188/sp21/
Independence
Two variables X and Y are (absolutely) independent if
∀x, y: P(x, y) = P(x) P(y)
i.e., the joint distribution factors into a product of two simpler distributions
39
https://inst.eecs.berkeley.edu/~cs188/sp21/
Independence
A and B are independent iff
P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A)P(B)
E.g., the joint over Toothache, Catch, Cavity, and Weather decomposes into
P(Toothache, Catch, Cavity) P(Weather)
For n independent Boolean variables, the full joint P(X1, X2, ..., Xn) would need 2^n entries,
but it factors into n single-variable distributions.
41
https://inst.eecs.berkeley.edu/~cs188/sp21/
Conditional Independence
Conditional independence is our most basic and robust form of
knowledge about uncertain environments.
42
https://inst.eecs.berkeley.edu/~cs188/sp21/
Conditional independence
P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries
If I have a cavity, the probability that the probe catches in it doesn’t depend on whether I
have a toothache:
(1)P(catch|toothache,cavity) = P(catch|cavity)
The same independence holds if I haven’t got a cavity:
(2)P(catch|toothache,¬cavity) = P(catch|¬cavity)
Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Conditional independence contd.
Write out full joint distribution using chain rule:
P(Toothache, Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
i.e., 2 + 2 + 1 = 5 independent numbers
In most cases, the use of conditional independence reduces the size of the
representation of the joint distribution from exponential in n to linear in n.
Conditional independence is our most basic and robust form
of knowledge about uncertain environments.
https://inst.eecs.berkeley.edu/~cs188/sp21/
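A quick numeric check of these assertions against the same toothache/catch/cavity joint:

joint = {
    (True, True, True): 0.108,  (True, False, True): 0.012,
    (False, True, True): 0.072, (False, False, True): 0.008,
    (True, True, False): 0.016, (True, False, False): 0.064,
    (False, True, False): 0.144, (False, False, False): 0.576,
}

def p(pred):
    """Sum the joint entries of the worlds (toothache, catch, cavity) satisfying pred."""
    return sum(v for world, v in joint.items() if pred(*world))

for cav in (True, False):
    p_cav = p(lambda tooth, catch, cavity: cavity == cav)
    p_catch_given_cav = p(lambda tooth, catch, cavity: catch and cavity == cav) / p_cav
    p_catch_given_tooth_cav = (p(lambda tooth, catch, cavity: tooth and catch and cavity == cav)
                               / p(lambda tooth, catch, cavity: tooth and cavity == cav))
    # Both conditionals agree: Catch is independent of Toothache given Cavity
    print(cav, round(p_catch_given_cav, 3), round(p_catch_given_tooth_cav, 3))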
Conditional Independence?
What about this domain:
Traffic
Umbrella
Raining
45
https://inst.eecs.berkeley.edu/~cs188/sp21/
Conditional Independence?
What about this domain:
Fire
Smoke
Alarm
46
https://inst.eecs.berkeley.edu/~cs188/sp21/
Bayes Nets: Big Picture
47
https://inst.eecs.berkeley.edu/~cs188/sp21/
Bayes Nets: Big Picture
48
https://inst.eecs.berkeley.edu/~cs188/sp21/
Graphical Model Notation
Arcs: interactions
Indicate “direct influence” between variables
Formally: encode conditional independence (more later)
49
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example: Car diagnosis
Initial evidence: car won’t start
Testable variables (green),
“broken, so fix it” variables (orange) ,
Hidden variables (gray) ensure sparse structure, reduce parameters
[Figure: diagnosis network with nodes battery age, alternator broken, fanbelt broken, battery dead, no charging, lights, oil light, gas gauge, car won’t start, dipstick]
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example Bayes’ Net: Car Insurance
Age SocioEcon
GoodStudent ExtraCar
RiskAversion
MakeModel VehicleYear
YearsLicensed
Mileage
DrivingSkill
AntiTheft SafetyFeatures CarValue
Garaged Airbag
DrivingRecord Ruggedness
DrivingBehavior
Cushioning Theft
Accident
OwnCarDamage
OwnCarCost
OtherCost
https://inst.eecs.berkeley.edu/~cs188/sp21/
Specifying the probability model
The full joint distribution is P(P1,1, . . . , P4,4, B1,1, B1,2, B2,1)
Apply product rule: P(B1,1, B1,2, B2,1 |P1,1, . . . , P4,4)P(P1,1, . . . , P4,4)
(Do it this way to get P (Effect|Cause).)
First term: 1 if pits are adjacent to breezes, 0 otherwise
Second term: pits are placed randomly, probability 0.2 per square:
P(P1,1, . . . , P4,4) = Π(i,j) P(Pi,j) = 0.2^n × 0.8^(16−n) for n pits
https://inst.eecs.berkeley.edu/~cs188/sp21/
Observations and query
We know the following facts:
b = ¬b1,1 ∧ b1,2 ∧ b2,1
known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Query is P(P1,3|known, b)
Define Unknown = Pijs other than P1,3 and Known
For inference by enumeration, we have
P(P1,3 | known, b) = α Σunknown P(P1,3, unknown, known, b)
Grows exponentially with the number of squares!
12 unknown squares ⇒ 2^12 terms
54
https://inst.eecs.berkeley.edu/~cs188/sp21/
Using conditional independence
Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares
(e.g., square [4,4] is irrelevant to the query)
Split the unknown squares into Frontier (the hidden squares adjacent to the breeze observations) and Other (the rest, which are irrelevant to the query variable)
Query is P(P1,3 | known, b)
56
https://inst.eecs.berkeley.edu/~cs188/sp21/
Using conditional independence contd.
b = ¬b1,1 ∧ b1,2 ∧ b2,1
known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Unknown = Pij's other than P1,3 and Known, split into Frontier and Other
P(P1,3 | known, b) = α Σfrontier Σother P(b | P1,3, known, frontier) P(P1,3, known, frontier, other)
https://inst.eecs.berkeley.edu/~cs188/sp21/
57
Using conditional independence contd.
Pits are placed randomly, probability 0.2 per square, so summing out Other and factoring gives
P(P1,3 | known, b) = α P(P1,3) Σfrontier P(b | P1,3, known, frontier) P(frontier)
60
https://inst.eecs.berkeley.edu/~cs188/sp21/
Chapter 13
https://inst.eecs.berkeley.edu/~cs188/sp21/
Bayesian networks
A simple, graphical notation for conditional independence assertions and hence for compact
specification of full joint distributions
Syntax:
• a set of nodes, one per variable
• a directed, acyclic graph (link ≈ “directly influences”)
• a conditional distribution for each node given its parents: P(Xi|P arents(Xi))
In the simplest case, conditional distribution represented as a conditional probability table
(CPT) giving the distribution over Xi for each combination of parent values
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example
Topology of network encodes conditional independence
assertions:
Weather Cavity
Toothache Catch
https://inst.eecs.berkeley.edu/~cs188/sp21/
Bayes net global semantics
Bayes nets encode joint distributions as product of
conditional distributions on each variable:
P(X1, .., Xn) = Πi P(Xi | Parents(Xi))
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example
I’m at work, neighbor John calls to say my alarm is ringing, but neighbor
Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a
burglar?
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example: Alarm Network
[Slide: alarm network with CPT tables for P(B), P(E), and P(A | B, E)]
67
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example
Burglary and Earthquake are the parents of Alarm; Alarm is the parent of JohnCalls (J) and MaryCalls (M)
P(B): true 0.001, false 0.999        P(E): true 0.002, false 0.998
P(b, ¬e, a, ¬j, ¬m) = P(b) P(¬e) P(a | b, ¬e) P(¬j | a) P(¬m | a)
                    = .001 × .998 × .94 × .1 × .3 ≈ .000028
https://inst.eecs.berkeley.edu/~cs188/sp21/
Global semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) =
Chain Rule
P(X1, . . . , Xn) = P(Xn | Xn−1, . . . , X1) P(Xn−1, . . . , X1)
                 = P(Xn | Xn−1, . . . , X1) P(Xn−1 | Xn−2, . . . , X1) P(Xn−2, . . . , X1)
                 = ...
                 = Πi P(Xi | Xi−1, . . . , X1)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Global semantics
“Global” semantics defines the full joint distribution as the product of the local conditional distributions:
P(X1, . . . , Xn) = Πi P(Xi | Parents(Xi))
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
                             = 0.9 × 0.7 × 0.001 × 0.999 × 0.998
                             ≈ 0.00063
https://inst.eecs.berkeley.edu/~cs188/sp21/
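A minimal sketch of this product-of-CPTs computation for the alarm network; CPT entries not quoted on these slides are the usual textbook values and should be read as assumptions:

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A = true | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(J = true | A)
P_M = {True: 0.70, False: 0.01}                       # P(M = true | A)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) = product of each variable's CPT entry given its parents."""
    return (bern(P_B[True], b) * bern(P_E[True], e) *
            bern(P_A[(b, e)], a) * bern(P_J[a], j) * bern(P_M[a], m))

print(joint(False, False, True, True, True))   # P(¬b, ¬e, a, j, m) ≈ 0.00063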
Markov blanket
Each node is conditionally independent of all others given its
Markov blanket: parents + children + children’s parents
[Figure: node X with parents U1, …, Um, children Y1, …, Yn, and the children’s other parents Z1j, …, Znj]
https://inst.eecs.berkeley.edu/~cs188/sp21/
Constructing Bayesian networks
Need a method such that a series of locally testable assertions of conditional independence guarantees
the required global semantics
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example
Must ensure that Parents(Xi) ⊆ {X1, …, Xi−1}
Suppose we choose the ordering M, J, A, B, E (MaryCalls, JohnCalls, Alarm, Burglary, Earthquake)
P(J | M) = P(J)?  No
P(A | J, M) = P(A | J)?  No
P(A | J, M) = P(A)?  No
P(B | A, J, M) = P(B | A)?  Yes
P(B | A, J, M) = P(B)?  No
P(E | B, A, J, M) = P(E | A)?  No
P(E | B, A, J, M) = P(E | A, B)?  Yes (given the network)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example contd.
[Figure: the Bayes net obtained with node ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example: Car diagnosis
Initial evidence: car won’t start
Testable variables (green),
“broken, so fix it” variables (orange) ,
Hidden variables (gray) ensure sparse structure, reduce parameters
[Figure: diagnosis network with nodes battery age, alternator broken, fanbelt broken, battery dead, no charging, lights, oil light, gas gauge, car won’t start, dipstick]
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example Bayes’ Net: Car Insurance
Age SocioEcon
GoodStudent ExtraCar
RiskAversion
MakeModel VehicleYear
YearsLicensed
Mileage
DrivingSkill
AntiTheft SafetyFeatures CarValue
Garaged Airbag
DrivingRecord Ruggedness
DrivingBehavior
Cushioning Theft
Accident
OwnCarDamage
OwnCarCost
OtherCost
https://inst.eecs.berkeley.edu/~cs188/sp21/
Compact conditional distributions contd.
Noisy-OR: P(x | parents) = 1 − Π{j : parent j = true} qj, where qj = P(¬x | only parent j true)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Compact conditional distributions contd.
q_cold = P(¬fever | cold, ¬flu, ¬malaria) = 0.6
q_flu = P(¬fever | ¬cold, flu, ¬malaria) = 0.2
q_malaria = P(¬fever | ¬cold, ¬flu, malaria) = 0.1
https://inst.eecs.berkeley.edu/~cs188/sp21/
Continuous child variables
Need one conditional density function for child variable given continuous parents, for each possible assignment to discrete parents
P(c | h, subsidy) = N(at h + bt, σt²)(c) = (1 / (σt √(2π))) exp(−(c − (at h + bt))² / (2σt²))
https://inst.eecs.berkeley.edu/~cs188/sp21/
Normal Distribution
- Normal = Gaussian distribution: P(x) = N(µ, σ²)(x) = (1 / (σ√(2π))) e^(−(x−µ)² / (2σ²))
- Central Limit Theorem: the distribution formed by sampling n independent random variables and
taking their mean tends to a normal distribution as n tends to infinity
https://inst.eecs.berkeley.edu/~cs188/sp21/
Continuous child variables
[Figure: the density P(Cost | Harvest, Subsidy? = true) plotted over Cost (0–10) and Harvest (0–10)]
- Discrete variables may appear as parents of continuous ones; a linear-Gaussian network with such discrete parents is a conditional Gaussian network: it defines a multivariate Gaussian over all continuous variables for each combination of discrete variable values
https://inst.eecs.berkeley.edu/~cs188/sp21/
Discrete variable w/ continuous parents
Subsidy? Harvest
Cost
Buys?
N(µ, σ²)(x) = (1 / (σ√(2π))) e^(−(x−µ)² / (2σ²)),   Φ(x) = ∫−∞..x N(0, 1)(t) dt
P(Buys = false | Cost = c) = Φ((c − µ)/σ)
P(Buys = true | Cost = c) = 1 − Φ((c − µ)/σ)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Discrete variable w/ continuous parents
Probability of Buys given Cost should be a “soft” threshold:
[Figure: P(Buys = false | Cost = c) plotted against Cost c (0–12) as a soft threshold curve, for the network Subsidy?, Harvest → Cost → Buys?]
https://inst.eecs.berkeley.edu/~cs188/sp21/
Discrete variable contd.
Sigmoid (or logit) distribution also used in neural networks:
P(Buys = true | Cost = c) = 1 / (1 + exp(−2(−c + µ)/σ))
[Figure: the sigmoid curve of P(Buys = true | Cost = c) against Cost c (0–12)]
https://inst.eecs.berkeley.edu/~cs188/sp21/
Inference
Inference: calculating some useful quantity from a probability model (joint probability distribution)
Examples:
  Posterior marginal probability: P(Q | e1, .., ek)
    E.g., what disease might I have?
  Most likely explanation: argmax(q,r,s) P(Q=q, R=r, S=s | e1, .., ek)
    E.g., what did he say?
https://inst.eecs.berkeley.edu/~cs188/sp21/
Inference tasks - Examples
Simple queries: compute posterior marginal P(X i |E = e)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example Bayes’ Net: Car Insurance
Age SocioEcon
GoodStudent ExtraCar
RiskAversion
MakeModel VehicleYear
YearsLicensed
Mileage
DrivingSkill
AntiTheft SafetyFeatures CarValue
Garaged Airbag
DrivingRecord Ruggedness
DrivingBehavior
Cushioning Theft
Accident
OwnCarDamage
OwnCarCost
OtherCost
P(B | j, m):
P(b | j, m) = α Σe Σa P(b) P(e) P(a|b, e) P(j|a) P(m|a)
            = α P(b) Σe P(e) Σa P(a|b, e) P(j|a) P(m|a)
Recursive depth-first enumeration: O(n) space, O(d^n) time
https://inst.eecs.berkeley.edu/~cs188/sp21/
Inference by Enumeration in Bayes Net
Inference by enumeration:
Any probability of interest can be computed by summing entries from the joint distribution: P(Q | e) = α Σh P(Q, h, e)
Entries from the joint distribution can be obtained from a BN by multiplying the corresponding conditional probabilities
P(B | j, m) = α Σe,a P(B, e, a, j, m)
            = α Σe,a P(B) P(e) P(a|B,e) P(j|a) P(m|a)
So inference in Bayes nets means computing sums of products of numbers: sounds easy!!
Problem: sums of exponentially many products!
https://inst.eecs.berkeley.edu/~cs188/sp21/
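A compact sketch of this sum of products for P(B | j, m), reusing the same CPT convention (entries not quoted on the slides are again assumed textbook values):

from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def bern(p_true, v):
    return p_true if v else 1.0 - p_true

def posterior_B(j, m):
    """P(B | j, m) = alpha * sum over hidden e, a of the product of CPT entries."""
    unnormalized = {}
    for b in (True, False):
        total = 0.0
        for e, a in product((True, False), repeat=2):   # sum out the hidden variables
            total += (bern(P_B[True], b) * bern(P_E[True], e) *
                      bern(P_A[(b, e)], a) * bern(P_J[a], j) * bern(P_M[a], m))
        unnormalized[b] = total
    alpha = 1.0 / sum(unnormalized.values())
    return {b: alpha * p for b, p in unnormalized.items()}

print(posterior_B(True, True))   # ≈ {True: 0.284, False: 0.716}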
Evaluation tree
P(b|j, m) = α P(b) Σe P(e) Σa P(a|b, e)P(j|a)P(m|a)
Can be improved with variable elimination, choosing the right order of variables, detecting irrelevant variables
https://inst.eecs.berkeley.edu/~cs188/sp21/
Order matters
Order the terms Z, A, B, C, D:
P(D) = α Σz,a,b,c P(z) P(a|z) P(b|z) P(c|z) P(D|z)
     = α Σz P(z) Σa P(a|z) Σb P(b|z) Σc P(c|z) P(D|z)
Largest factor has 2 variables (D, Z)
Order the terms A, B, C, D, Z:
P(D) = α Σa,b,c,z P(a|z) P(b|z) P(c|z) P(D|z) P(z)
     = α Σa Σb Σc Σz P(a|z) P(b|z) P(c|z) P(D|z) P(z)
Largest factor has 4 variables (A, B, C, D)
In general, with n leaves, factor of size 2^n
https://inst.eecs.berkeley.edu/~cs188/sp21/
Enumeration algorithm
P(b|j, m) = α P(b) Σe P(e) Σa P(a|b, e)P(j|a)P(m|a)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Sampling
Basic idea:
  Draw N samples from a sampling distribution S
  Compute an approximate posterior probability
  Show this converges to the true probability P
Why sample?
  Often very fast to get a decent approximate answer
  The algorithms are very simple and general (easy to apply to fancy models)
  They require very little memory (O(n))
  They can be applied to large models, whereas exact algorithms blow up
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example
Suppose you have two agent programs A and B for Monopoly
What is the probability that A wins?
Method 1:
Let s be a sequence of dice rolls and Chance and Community Chest cards
Given s, the outcome V(s) is determined (1 for a win, 0 for a loss)
Probability that A wins is Σs P(s) V(s)
Problem: infinitely many sequences s !
Method 2:
Sample N sequences from P(s) , play N games (maybe 100)
Probability that A wins is roughly (1/N) Σi V(si), i.e., the fraction of wins in the sample
99
https://inst.eecs.berkeley.edu/~cs188/sp21/
Sampling basics: discrete (categorical) distribution
To simulate a biased d-sided coin:
Step 1: Get a sample u from the uniform distribution over [0, 1)
  E.g. random() in Python
Step 2: Convert this sample u into an outcome for the given distribution by associating each outcome x with a P(x)-sized sub-interval of [0, 1)

Example: C with P(red) = 0.6, P(green) = 0.1, P(blue) = 0.3
  0.0 ≤ u < 0.6  →  C = red
  0.6 ≤ u < 0.7  →  C = green
  0.7 ≤ u < 1.0  →  C = blue
If random() returns u = 0.83, then the sample is C = blue

2. Rejection Sampling: reject samples disagreeing with evidence
4. Gibbs Sampling / Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior
https://inst.eecs.berkeley.edu/~cs188/sp21/
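A small sketch of the sub-interval trick:

import random

def sample_categorical(dist):
    """Sample a value from a discrete distribution given as {outcome: probability}."""
    u = random.random()              # uniform sample in [0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():
        cumulative += p
        if u < cumulative:           # u fell into this outcome's sub-interval
            return outcome
    return outcome                   # guard against floating-point round-off

P_C = {"red": 0.6, "green": 0.1, "blue": 0.3}
print([sample_categorical(P_C) for _ in range(8)])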
Prior Sampling
1
Example: the sprinkler network — Cloudy is the parent of Sprinkler and Rain; Sprinkler and Rain are the parents of WetGrass

P(C):
  +c  0.50
  -c  0.50

P(S | C):                          P(R | C):
  +c:  +s 0.10,  -s 0.90             +c:  +r 0.80,  -r 0.20
  -c:  +s 0.50,  -s 0.50             -c:  +r 0.20,  -r 0.80

P(W | S, R):
  +s, +r:  +w 0.99,  -w 0.01
  +s, -r:  +w 0.90,  -w 0.10
  -s, +r:  +w 0.90,  -w 0.10
  -s, -r:  +w 0.01,  -w 0.99
https://inst.eecs.berkeley.edu/~cs188/sp21/
P[Cloudy = true, Sprinkler = false, Rain = true, WetGrass = true]
= P(Cloudy = true)
  × P(Sprinkler = false | Cloudy = true)
  × P(Rain = true | Cloudy = true)
  × P(WetGrass = true | Sprinkler = false, Rain = true)
= 0.5 × 0.9 × 0.8 × 0.9 = 0.324
https://inst.eecs.berkeley.edu/~cs188/sp21/
Prior Sampling
1
https://inst.eecs.berkeley.edu/~cs188/sp21/
Sampling from an empty network
1
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example
1
Suppose we draw five samples of ⟨C, S, R, W⟩ from the network by prior sampling.
If we want to know P(W):
We have counts ⟨+w: 4, −w: 1⟩
Normalize to get P(W) = ⟨+w: 0.8, −w: 0.2⟩
This will get closer to the true distribution with more samples
Can estimate anything else, too
E.g., for query P(C| r, w) use P(C| r, w) = α P(C, r, w)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Prior Sampling
1
https://inst.eecs.berkeley.edu/~cs188/sp21/
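A minimal sketch of prior sampling for the sprinkler network, using the CPTs tabulated above:

import random

P_C = 0.5
P_S = {True: 0.1, False: 0.5}   # P(+s | C)
P_R = {True: 0.8, False: 0.2}   # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}   # P(+w | S, R)

def prior_sample():
    """Sample each variable in topological order, conditioning on its sampled parents."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return c, s, r, w

samples = [prior_sample() for _ in range(10000)]
print(sum(w for _, _, _, w in samples) / len(samples))   # estimate of P(+w), ≈ 0.65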
Rejection Sampling
2
https://inst.eecs.berkeley.edu/~cs188/sp21/
Likelihood Weighting 3
Problem with rejection sampling:
  If evidence is unlikely, it rejects lots of samples
  Evidence is not exploited as you sample
  Consider P(Shape | Color = blue)
Idea: fix evidence variables, sample the rest
  Problem: the sample distribution is not consistent!
  Solution: weight each sample by the probability of the evidence variables given their parents
Likelihood Weighting 3
P(C):
  +c  0.5
  -c  0.5

P(S | C):                          P(R | C):
  +c:  +s 0.1,  -s 0.9               +c:  +r 0.8,  -r 0.2
  -c:  +s 0.5,  -s 0.5               -c:  +r 0.2,  -r 0.8

P(W | S, R):
  +s, +r:  +w 0.99,  -w 0.01
  +s, -r:  +w 0.90,  -w 0.10
  -s, +r:  +w 0.90,  -w 0.10
  -s, -r:  +w 0.01,  -w 0.99

Samples (with Sprinkler and WetGrass as the evidence variables):
  ⟨+c, +s, +r, +w⟩ with weight w = 1.0 × 0.1 × 0.99 = P(+s | +c) × P(+w | +s, +r)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Likelihood Weighting 3
https://inst.eecs.berkeley.edu/~cs188/sp21/
Likelihood weighting 3
Idea: fix evidence variables, sample only nonevidence variables, and weight each sample by the likelihood it accords the evidence
Avoids the inefficiency of rejection sampling by generating only events that are consistent with evidence e
https://inst.eecs.berkeley.edu/~cs188/sp21/
Likelihood Weighting 3
Now, samples have weights:
w(z, e) = Πk P(ek | parents(Ek))
Together, the weighted sampling distribution is consistent:
S_WS(z, e) · w(z, e) = Πj P(zj | parents(Zj)) · Πk P(ek | parents(Ek)) = P(z, e)
Likelihood weighting is an example of importance sampling:
  We would like to estimate some quantity based on samples from P
  P is hard to sample from, so use Q instead
  Weight each sample x by P(x)/Q(x)
https://inst.eecs.berkeley.edu/~cs188/sp21/
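A short sketch of likelihood weighting on the same network, estimating P(Rain | Sprinkler = true, WetGrass = true) with the CPTs above:

import random

P_C = 0.5
P_S = {True: 0.1, False: 0.5}   # P(+s | C)
P_R = {True: 0.8, False: 0.2}   # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}   # P(+w | S, R)

def weighted_sample():
    """Evidence S = true, W = true stays fixed; C and R are sampled; weight = P(e | parents)."""
    weight = 1.0
    c = random.random() < P_C                 # sample Cloudy from its prior
    s = True                                  # evidence: Sprinkler = true
    weight *= P_S[c]                          # w *= P(+s | c)
    r = random.random() < P_R[c]              # sample Rain given Cloudy
    weight *= P_W[(s, r)]                     # w *= P(+w | s, r) for evidence WetGrass = true
    return r, weight

totals = {True: 0.0, False: 0.0}
for _ in range(100000):
    r, wt = weighted_sample()
    totals[r] += wt
z = sum(totals.values())
print({value: wt / z for value, wt in totals.items()})   # ≈ P(Rain | +s, +w) ≈ {True: 0.32, False: 0.68}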
Car Insurance: P(PropertyCost | e)
Age SocioEcon
GoodStudent ExtraCar 0.1
RiskAversion
Rejection sampling
YearsLicensed
MakeModel VehicleYear Likelihood weighting
Mileage 0.08
DrivingSkill
AntiTheft SafetyFeatures CarValue
DrivingRecord
Garaged Airbag 0.06
Ruggedness
Error
DrivingBehavior 0.04
Cushioning Theft
Accident
OwnCarDamage 0.02
OwnCarCost
OtherCost
0
6
MedicalCost LiabilityCost PropertyCost 0 200000 400000 600000 800000 1x10
Number of samples
https://inst.eecs.berkeley.edu/~cs188/sp21/
3
Likelihood weighting example
Query: P(Rain | Cloudy = true, WetGrass = true)
Ordering: Cloudy, Sprinkler, Rain, WetGrass
The weight w is initialized to w = 1.0
Likelihood weighting fixes each evidence variable at its given value and only samples the non-evidence variables (including the query):
• Cloudy is an evidence variable, so w ← w × P(Cloudy = true) = 0.5
• Sprinkler is not an evidence variable, so it will be sampled
• WetGrass is an evidence variable
https://inst.eecs.berkeley.edu/~cs188/sp21/
Query: P(Rain | Cloudy = true, WetGrass = true)
Ordering: Cloudy, Sprinkler, Rain, WetGrass
3
Likelihood weighting example (contd.)
w is set to w = 1.0, then w ← w × P(Cloudy = true) = 0.5
Sample Sprinkler from P(Sprinkler | Cloudy = true) = (0.1, 0.9): suppose this returns false
Sample Rain from P(Rain | Cloudy = true) = (0.8, 0.2): suppose this returns true
WetGrass is an evidence variable, so w ← w × P(WetGrass = true | Sprinkler = false, Rain = true) = 0.5 × 0.9 = 0.45
The sampling probability for WeightedSample is S_WS(z, e) = Πj P(zj | Parents(Zj)), where Z are the non-evidence variables (including the query)
https://inst.eecs.berkeley.edu/~cs188/sp21/
Markov Chain Monte Carlo 4
127
https://inst.eecs.berkeley.edu/~cs188/sp21/
Why would anyone do this?
Samples soon begin to
reflect all the evidence
in the network
128
https://inst.eecs.berkeley.edu/~cs188/sp21/
How would anyone do this?
Repeat many times
Sample a non-evidence variable Xi from
P(Xi | x1, .., xi−1, xi+1, .., xn) = P(Xi | markov_blanket(Xi))
129
https://inst.eecs.berkeley.edu/~cs188/sp21/
Gibbs Sampling Example: P( S | r)
Step 1: Fix the evidence: R = true
Step 2: Initialize the other variables (C, S, W) randomly
Step 3: Repeat:
  Choose a non-evidence variable X
  Resample X from P(X | markov_blanket(X))
[Figure: a sequence of network states with R fixed to true while C, S, and W are resampled in turn]
131
https://inst.eecs.berkeley.edu/~cs188/sp21/
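A rough sketch of Gibbs sampling for P(S | +r) on the sprinkler network; the resampling distribution P(X | markov_blanket(X)) is computed here by brute force from the CPT product, which is equivalent for this small network:

import random

P_C = 0.5
P_S = {True: 0.1, False: 0.5}   # P(+s | C)
P_R = {True: 0.8, False: 0.2}   # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}   # P(+w | S, R)

def bern(p_true, v):
    return p_true if v else 1.0 - p_true

def joint(c, s, r, w):
    # Full joint as a product of the CPT entries above
    return bern(P_C, c) * bern(P_S[c], s) * bern(P_R[c], r) * bern(P_W[(s, r)], w)

def gibbs_estimate_S_given_r(steps=50000):
    r = True                                     # evidence R = +r stays fixed
    state = {"c": True, "s": True, "w": True}    # arbitrary initialization
    count_s = 0
    for _ in range(steps):
        for var in ("c", "s", "w"):              # resample each non-evidence variable in turn
            p = {}
            for v in (True, False):
                state[var] = v
                p[v] = joint(state["c"], state["s"], r, state["w"])
            p_true = p[True] / (p[True] + p[False])
            state[var] = random.random() < p_true
        count_s += state["s"]
    return count_s / steps

print(gibbs_estimate_S_given_r())   # ≈ 0.18 = P(+s | +r), up to sampling noise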
Car Insurance: P(Age | mc,lc,pc)
Age SocioEcon
GoodStudent ExtraCar
0.02
YearsLicensed
RiskAversion
MakeModel VehicleYear Likelihood weighting
Mileage Gibbs sampling
DrivingSkill
AntiTheft SafetyFeatures CarValue
0.015
Garaged Airbag
DrivingRecord Ruggedness
Error
0.01
DrivingBehavior
Cushioning Theft
Accident
OwnCarDamage 0.005
OwnCarCost
OtherCost
0
MedicalCost LiabilityCost PropertyCost
0 200000 400000 600000 800000 1x106
Number of samples
https://inst.eecs.berkeley.edu/~cs188/sp21/
Approximate inference using MCMC
4
“State” of network = current assignment to all variables.
Generate the next state by sampling one variable given its Markov blanket; sample each variable in turn, keeping the evidence fixed.
https://inst.eecs.berkeley.edu/~cs188/sp21/
Query : P(Rain| Sprinkler = True, WetGrass = True)
Ordering : Cloudy, Sprinkler, Rain, Wet-Grass
The Markov chain
4
[Figure: four successive states of the network (Cloudy, Sprinkler, Rain, WetGrass) with the evidence Sprinkler = true and WetGrass = true fixed, while Cloudy and Rain change]
The probability given the Markov blanket is calculated as follows:
P(x′i | mb(Xi)) = α P(x′i | parents(Xi)) × ΠYj ∈ Children(Xi) P(yj | parents(Yj))
https://inst.eecs.berkeley.edu/~cs188/sp21/
MCMC example contd. 4
https://inst.eecs.berkeley.edu/~cs188/sp21/
Bayes Net Sampling Summary
Prior Sampling P Rejection Sampling P( Q | e )
https://inst.eecs.berkeley.edu/~cs188/sp21/
Summary : Conditional independence in BNs
Compare the Bayes net global semantics
P(X1, .., Xn) = Πi P(Xi | Parents(Xi))
with the chain rule identity
P(X1, .., Xn) = Πi P(Xi | X1, …, Xi−1)
Assume (without loss of generality) that X1, .., Xn are sorted in topological order according to
the graph (i.e., parents before children), so Parents(Xi) ⊆ {X1, …, Xi−1}
So the Bayes net asserts conditional independences P(Xi | X1,…,Xi-1) = P(Xi | Parents(Xi))
To ensure these are valid, choose parents for node Xi that “shield” it from other predecessors
https://inst.eecs.berkeley.edu/~cs188/sp21/
Summary
Independence and conditional independence are
important forms of probabilistic knowledge
https://inst.eecs.berkeley.edu/~cs188/sp21/
Chapter 16
Making Simple Decisions
Utilities
https://inst.eecs.berkeley.edu/~cs188/sp21/
Maximum Expected Utility
Questions:
Where do utilities come from?
How do we know such utilities even exist?
How do we know that averaging even makes sense?
What if our behavior (preferences) can’t be described by utilities?
https://inst.eecs.berkeley.edu/~cs188/sp21/
The need for numbers
https://inst.eecs.berkeley.edu/~cs188/sp21/
Utilities: Uncertain Outcomes
Getting ice cream
Oops Whew!
https://inst.eecs.berkeley.edu/~cs188/sp21/
Preferences
Notation:
Preference: A > B
Indifference: A ~ B
https://inst.eecs.berkeley.edu/~cs188/sp21/
Rational Preferences
The Axioms of Rationality:
preferences of a rational agent must obey constraints.
Orderability:
(A > B) ∨ (B > A) ∨ (A ~ B)
Transitivity:
(A > B) ∧ (B > C) ⇒ (A > C)
Continuity:
(A > B > C) ⇒ ∃p [p, A; 1−p, C] ~ B
Substitutability:
(A ~ B) ⇒ [p, A; 1−p, C] ~ [p, B; 1−p, C]
Monotonicity:
(A > B) ⇒
(p ≥ q ⇔ [p, A; 1−p, B] ≳ [q, A; 1−q, B])
https://inst.eecs.berkeley.edu/~cs188/sp21/
Maximizing Expected Utility : MEU Principle
U(A) ≥ U(B)  ⇔  A ≳ B
U([p1,S1; … ; pn,Sn]) = p1U(S1) + … + pnU(Sn)
i.e. values assigned by U preserve preferences of both prizes and lotteries!
Optimal policy invariant under positive affine transformation U’ = aU+b, a>0
151
https://inst.eecs.berkeley.edu/~cs188/sp21/
Money
Flip of a coin
Take the 1M prize that you have and walk away, or
Flip the coin one more time for a chance to win 2.5M
Expected Monetary Value (EMV) : ½ * 0 + ½ * 2.5M = 1.25M
1.25M > 1M ?
Most people will not take the gamble
Sn : denotes the state of the player’s wealth
Expected Utility :
EU(accept) = ½ * U(Sk) + ½ * U(Sk+2.5M)
EU(decline) = U(Sk+1M)
152
https://inst.eecs.berkeley.edu/~cs188/sp21/
Post-decision Disappointment: the Optimizer’s Curse
Usually we don’t have direct access to exact utilities, only estimates
E.g., you could make one of k investments
An unbiased expert assesses their expected net profit V1, …, Vk
You choose the best one, V*
With high probability, its actual value is considerably less than V*
Error in each utility estimate is independent and has a unit normal distribution: N(0, σ)
(Unbiased means we will get the utility we expect if the whole process is repeated many times; as the number of choices increases, the potential for disappointment also increases)
This is a serious problem in many areas:
Future performance of mutual funds
Efficacy of drugs measured by trials
Statistical significance in scientific papers
Winning an auction
https://inst.eecs.berkeley.edu/~cs188/sp21/
Humans are predictably irrational
Most people will choose
B over A : sure thing
C over D : higher monetary value
There is no utility function that is consistent
with both of these choices
People are strongly attracted to gains that
are certain
https://inst.eecs.berkeley.edu/~cs188/sp21/
Multi-attribute utility
How can we handle utility functions of many variables X 1, . . . ,X n ?
Example: Choosing the site of an airport, what is U(Deaths, Noise, Cost)?
https://inst.eecs.berkeley.edu/~cs188/sp21/
Strict dominance
Typically define attributes such that U is monotonic in each
Strict dominance: choice B strictly dominates choice A iff
∀i Xi(B) ≥ Xi(A) (and hence U(B) ≥ U(A))
[Figure: attribute plots over (X1, X2) with choices A, B, C, D — deterministic attributes on the left, uncertain attributes on the right — shading the region that dominates A]
Stochastic dominance
https://inst.eecs.berkeley.edu/~cs188/sp21/
Preference structure: Deterministic
X 1 and X 2 preferentially independent of X 3 iff
preference between (x 1, x2, x 3 ) and (x’1, x’2, x 3 ) does not depend on x3
https://inst.eecs.berkeley.edu/~cs188/sp21/
Preference structure: Stochastic
Need to consider preferences over lotteries:
X is utility-independent of Y iff
preferences over lotteries in X do not depend on y
https://inst.eecs.berkeley.edu/~cs188/sp21/
Decision Networks
P(W): sun 0.7, rain 0.3
P(F = bad | W): sun 0.17, rain 0.77
[Figure: decision network with chance node Weather, observed node Forecast = bad, a decision node, and utility node U]
https://inst.eecs.berkeley.edu/~cs188/sp21/
Decision Networks
Decision network = Bayes net + Actions + Utilities
https://inst.eecs.berkeley.edu/~cs188/sp21/
Value of information : is all information relevant ?
Idea: compute the value of acquiring each possible piece of evidence
Example: two blocks A and B, exactly one of which contains oil worth k; each block can be bought for k/2
Survey may say “oil in A” or “no oil in A”, prob. 0.5 each (given!)
Expected profit given the survey information
= [0.5 × value of “buy A” given “oil in A”] + [0.5 × value of “buy B” given “no oil in A”]
= (0.5 × k/2) + (0.5 × k/2) = k/2
https://inst.eecs.berkeley.edu/~cs188/sp21/
VPI Properties
Value of current best action α
167
https://inst.eecs.berkeley.edu/~cs188/sp21/
Chapter 19
Learning from Examples
168
Deep Learning Powering Everyday Products
pcmag.com theverge.com
169
https://inst.eecs.berkeley.edu/~cs188/sp21/
Learning
Learning: a process for improving the performance of an agent
through experience
Learning :
The general idea: generalization from experience
Supervised learning: classification and regression
Learning is useful as a system construction method, i.e., expose
the system to reality rather than trying to write it down
170
https://inst.eecs.berkeley.edu/~cs188/sp21/
Key questions when building a learning agent
What is the agent design that will implement the desired
performance?
Improve the performance of what piece of the agent system and
how is that piece represented?
What data are available relevant to that piece? (In particular, do
we know the right answers?)
What knowledge is already available?
171
https://inst.eecs.berkeley.edu/~cs188/sp21/
Forms of learning
172
https://inst.eecs.berkeley.edu/~cs188/sp21/
Supervised learning
To learn an unknown target function f
Input: a training set of labeled examples (xj,yj) where yj = f(xj)
E.g., xj is an image, f(xj) is the label “giraffe”
E.g., xj is a seismic signal, f(xj) is the label “explosion”
Output: hypothesis h that is “close” to f, i.e., predicts well on unseen
examples (“test set”)
Many possible hypothesis families for h
Linear models, logistic regression, neural networks, decision trees, examples
(nearest-neighbor), grammars, kernelized separators, etc etc
Classification = learning f with discrete output value
Regression = learning f with real-valued output value
173
https://inst.eecs.berkeley.edu/~cs188/sp21/
Inductive Learning (Science)
Simplest form: learn a function from examples
A target function: g
Examples: input-output pairs (x, g(x))
E.g. x is an email and g(x) is spam / ham
E.g. x is a house and g(x) is its selling price
Problem:
Given a hypothesis space H
Given a training set of examples xi
Find a hypothesis h(x) such that h ~ g
Includes:
Classification (outputs = class labels)
Regression (outputs = real numbers)
174
https://inst.eecs.berkeley.edu/~cs188/sp21/
Regression example: Curve fitting
175
https://inst.eecs.berkeley.edu/~cs188/sp21/
Regression example: Curve fitting
176
https://inst.eecs.berkeley.edu/~cs188/sp21/
Regression example: Curve fitting
177
https://inst.eecs.berkeley.edu/~cs188/sp21/
Regression example: Curve fitting
178
https://inst.eecs.berkeley.edu/~cs188/sp21/
Regression example: Curve fitting
179
https://inst.eecs.berkeley.edu/~cs188/sp21/
Consistency vs. Simplicity
Fundamental tradeoff: bias vs. variance
180
https://inst.eecs.berkeley.edu/~cs188/sp21/
Basic questions
Which hypothesis space H to choose?
How to measure degree of fit?
How to trade off degree of fit vs. complexity?
“Ockham’s razor”
How do we find a good h?
How do we know if a good h will predict well?
181
https://inst.eecs.berkeley.edu/~cs188/sp21/
Classification - Examples
182
https://inst.eecs.berkeley.edu/~cs188/sp21/
Classification example: Object recognition
x
X= f(x)=?
183
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example: Spam Filter
Input: an email
Output: spam/ham
Setup:
  Get a large collection of example emails, each labeled “spam” or “ham” (by hand)
  Learn to predict labels of new incoming emails
  Classifiers reject 200 billion spam emails per day
Features: the attributes used to make the ham/spam decision
  Words: FREE!
  Text patterns: $dd, CAPS
  Non-text: SenderInContacts, AnchorLinkMismatch
  …
[Slide shows two example emails: an obvious scam (spam) message and a legitimate (ham) message about a broken Dell PC]
https://inst.eecs.berkeley.edu/~cs188/sp21/
Example: Digit Recognition
1
Features: The attributes used to make the digit decision
Pixels: (6,8)=ON
Shape Patterns: NumComponents, AspectRatio, NumLoops
… ??
185
https://inst.eecs.berkeley.edu/~cs188/sp21/
Other Classification Tasks
Medical diagnosis
input: symptoms
output: disease
Automatic essay grading
input: document
output: grades
Fraud detection
input: account activity
output: fraud / no fraud
Email routing
input: customer complaint email
output: which department needs to ignore this email
Fruit and vegetable inspection
input: image (or gas analysis)
output: moldy or OK
… many more
186
https://inst.eecs.berkeley.edu/~cs188/sp21/
Decision Trees
187
Decision tree learning
Decision tree models
Tree construction
Measuring learning performance
188
https://inst.eecs.berkeley.edu/~cs188/sp21/
Restaurant Example
189
https://inst.eecs.berkeley.edu/~cs188/sp21/
Restaurant Example
190
https://inst.eecs.berkeley.edu/~cs188/sp21/
Decision trees
Popular representation
for classifiers
Even among humans!
I’ve just arrived at a
restaurant: should I stay
(and wait for a table) or
go elsewhere?
191
https://inst.eecs.berkeley.edu/~cs188/sp21/
Decision trees
It’s Friday night and you’re hungry
You arrive at your favorite cheap but
really cool happening burger place
It’s full up and you have no
reservation but there is a bar
The host estimates a 45 minute wait
There are alternatives nearby but it’s
raining outside
True/false: there is a consistent decision tree that fits any training set exactly
But a tree that simply records the examples is essentially a lookup table
To get generalization to new examples, need a compact tree
193
https://inst.eecs.berkeley.edu/~cs188/sp21/
Quiz
How many distinct functions of n Boolean attributes?
= number of truth tables with n inputs
= number of ways of filling in the output column with 2^n entries
= 2^(2^n)
For n=6 attributes, there are 18,446,744,073,709,551,616 functions
and many more trees than that!
194
https://inst.eecs.berkeley.edu/~cs188/sp21/
Hypothesis spaces in general
Increasing the expressiveness of the hypothesis language
Increases the chance that the true function can be expressed
Increases the number of hypotheses that are consistent with training set
=> many consistent hypotheses have large test error
=> may reduce prediction accuracy!
With 2^(2^n) hypotheses, all but an exponentially small fraction will
require O(2^n) bits to express in any representation (even brains!)
I.e., any given representation can represent only an exponentially
small fraction of hypotheses concisely; no universal compression
E.g., decision trees are bad at “k-out-of-n” functions
195
https://inst.eecs.berkeley.edu/~cs188/sp21/
Learning a Decision Tree
Goal
Given training data of a list of attributes & label
Hypothesis family H: decision trees
Find (approximately) the smallest decision tree that fits the training data
Iterative process to choose the next attribute to split on
196
https://inst.eecs.berkeley.edu/~cs188/sp21/
Choosing an attribute: Information gain
Idea: measure contribution of attribute to increasing “purity” of
labels in each subset of examples; find most distinguishing feature
198
https://inst.eecs.berkeley.edu/~cs188/sp21/
Information
Information in an answer when prior is p1,…,pn is
H(p1, …, pn) = Σi −pi log pi
This is the entropy of the prior
Entropy was initially proposed in physics (thermodynamics)
Shannon developed information theory, using entropy to
measure information
199
https://inst.eecs.berkeley.edu/~cs188/sp21/
Information gain from splitting on an attribute
Suppose we have p positive and n negative examples at the root
=> B(p/(p+n)) bits needed to classify a new example
E.g., for 12 restaurant examples, p = n = 6 so we need 1 bit
An attribute splits the examples E into subsets Ek, each of which (we
hope) needs less information to complete the classification
For an example in Ek we expect to need B(pk/(pk+nk)) more bits
Probability a new example goes into Ek is (pk+nk)/(p+n)
Expected number of bits needed after the split is Σk (pk+nk)/(p+n) B(pk/(pk+nk))
Information gain = B(p/(p+n)) − Σk (pk+nk)/(p+n) B(pk/(pk+nk))
200
https://inst.eecs.berkeley.edu/~cs188/sp21/
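A small sketch of these formulas; B(q) is the entropy of a Boolean variable that is true with probability q, and the split counts below are illustrative, not taken from the slides:

import math

def B(q):
    """Entropy (in bits) of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def information_gain(p, n, subsets):
    """subsets: list of (p_k, n_k) positive/negative counts in each subset after the split."""
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in subsets)
    return B(p / (p + n)) - remainder

# Illustrative split of 6 positive and 6 negative examples into three subsets
print(information_gain(6, 6, [(0, 2), (4, 0), (2, 4)]))   # ≈ 0.541 bits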
Example
201
https://inst.eecs.berkeley.edu/~cs188/sp21/
Results for restaurant data
202
https://inst.eecs.berkeley.edu/~cs188/sp21/
Decision tree learning
function Decision-Tree-Learning(examples,attributes,parent_examples) returns a tree
if examples is empty then return Plurality-Value(parent_examples)
else if all examples have the same classification then return the classification
else if attributes is empty then return Plurality-Value(examples)
else A ← argmax(a ∈ attributes) Importance(a, examples)
tree ← a new decision tree with root test A
for each value v of A do
exs ← the subset of examples with value v for attribute A
subtree ← Decision-Tree-Learning(exs, attributes − A, examples)
add a branch to tree with label (A = v) and subtree subtree
return tree
203
Plurality-Value selects the most common output value among a set of examples https://inst.eecs.berkeley.edu/~cs188/sp21/
Choosing the Right Model
204
https://inst.eecs.berkeley.edu/~cs188/sp21/
A few important points about learning
Data: labeled instances, e.g. emails marked spam/ham
Training set
Held out set
Test set
Features: attribute-value pairs which characterize each x
Experimentation cycle
  Learn parameters (e.g. model probabilities) on the training set
  Tune hyperparameters on the held-out (validation) set
  Compute accuracy on the test set
  Very important: never “peek” at the test set!
Evaluation
  Accuracy: fraction of instances predicted correctly
Overfitting and generalization
  Want a classifier which does well on test data
  Overfitting: fitting the training data very closely, but not generalizing well
  Underfitting: fits the training set poorly
205
https://inst.eecs.berkeley.edu/~cs188/sp21/
A few important points about learning
Our goal in machine learning is to select a hypothesis
that will optimally fit future examples.
Future examples?
Same prior distribution as the past examples (otherwise all bets are off!) : P(Ej) =
P(Ej+1) = P(Ej+2)
Each example is independent of previous examples : P(Ej) = P(Ej | Ej-1, Ej-2, …)
i.i.d (independent, identically distributed)
Optimally fit?
error rate: the proportion of times that h(x)≠y for an (x,y) example
206
https://inst.eecs.berkeley.edu/~cs188/sp21/
A few important points about learning
Our goal in machine learning is to select a hypothesis that will optimally fit future examples.
208
https://inst.eecs.berkeley.edu/~cs188/sp21/
K-fold cross validation
209
https://inst.eecs.berkeley.edu/~cs188/sp21/
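A minimal sketch of the usual k-fold procedure: split the data into k folds, train on k−1 folds, validate on the remaining one, and average the k scores (train_fn and error_fn stand in for a learner and an error measure; both names are placeholders, not from the slides):

def k_fold_cross_validation(data, k, train_fn, error_fn):
    """Average validation error over k train/validate splits."""
    folds = [data[i::k] for i in range(k)]            # k roughly equal folds
    errors = []
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h = train_fn(training)                        # learn a hypothesis
        errors.append(error_fn(h, validation))        # measure held-out error
    return sum(errors) / k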
Error rates/Loss Functions
210
https://inst.eecs.berkeley.edu/~cs188/sp21/
Regularization
An alternative to cross-validation
• λ is a hyperparameter
• Balances empirical loss versus complexity
• Penalizing complexity in this way is called regularization
211
https://inst.eecs.berkeley.edu/~cs188/sp21/
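A common way to write this, assuming the standard formulation: choose h to minimize Cost(h) = EmpLoss(h) + λ Complexity(h), so the hyperparameter λ trades empirical fit against hypothesis complexity.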