Sample Solutions
5/16/01
1 Bayes’ Nets
I am a professor trying to predict the performance of my class on an exam.
After much thought, it is apparent that the students who do well are those
that studied and do not have a headache when they take the exam. My vast
medical knowledge leads me to believe that headaches can only be caused by
being tired or by having the flu. Studying, the flu, and being tired are pairwise
independent.
[Diagrams: three candidate network structures over F (flu), T (tired), S (studied), H (headache), and E (exam performance).]
b) The first model makes flu, tiredness, and studying only conditionally independent (the first two conditioned on H and the third conditioned on E). The second model has the right relations, but many unnecessary dependencies. If the situation is as described in the problem, it will produce the same joint distribution as the third model, because the extra dependencies will have no effect (for example, P(S|F, T) will be the same as P(S)).
c) If we assume that we need to hold only one value, P(X = true), to represent the a priori probability of a single variable, then we need 2^n entries in the conditional probability table for a node with n parents. So, the original network has a complexity of 1 + 1 + 1 + 4 + 4 = 11 and the new network has a complexity of 1 + 1 + 4 + 4 + 4 = 14. The size of the information necessary to store the network has therefore increased by about 27%, but we have gained only a little more accuracy.
[Network diagram over F, T, S, and H.]
d)
P(¬E|F) = Σ_{H,S} P(¬E|H, S, F) P(H, S|F)
        = Σ_{H,S} P(¬E|H, S) P(H, S|F)
        = Σ_{H,S} P(¬E|H, S) P(H|F, S) P(S|F)
        = Σ_{H,S} P(¬E|H, S) P(H|F, S) P(S)
        = Σ_{H,S} P(¬E|H, S) P(S) Σ_T P(H|F, T) P(T)
e)
P(S|E) = P(E|S) P(S) / P(E)

P(E|S) = Σ_H P(E|S, H) P(H)
       = Σ_H P(E|S, H) Σ_{F,T} P(H|F, T) P(F) P(T)

P(E) = Σ_{H,S} P(E|S, H) P(S) Σ_{F,T} P(H|F, T) P(F) P(T)
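As a sanity check on the factorization in (d), here is a small Python sketch that evaluates P(¬E|F) both by the final summation and by brute-force enumeration of the joint. The CPT numbers are made up for illustration; only the structure (F, T, S as roots; H depending on F and T; E depending on H and S) comes from the problem.

    from itertools import product

    # Made-up CPTs (placeholders): P(F), P(T), P(S), P(H|F,T), P(E|H,S)
    pF, pT, pS = 0.1, 0.3, 0.6
    pH = {(True, True): 0.9, (True, False): 0.7, (False, True): 0.6, (False, False): 0.05}
    pE = {(True, True): 0.3, (True, False): 0.1, (False, True): 0.9, (False, False): 0.5}

    def p(prob, value):               # P(X = value) from P(X = true)
        return prob if value else 1.0 - prob

    # Final line of the derivation: sum over H and S, with P(H|F) expanded over T.
    def p_notE_given_F(f=True):
        total = 0.0
        for h, s in product([True, False], repeat=2):
            pH_given_F = sum(p(pH[(f, t)], h) * p(pT, t) for t in [True, False])
            total += p(pE[(h, s)], False) * p(pS, s) * pH_given_F
        return total

    # Brute-force check: enumerate the full joint and condition on F.
    def brute_force(f=True):
        num = den = 0.0
        for t, s, h, e in product([True, False], repeat=4):
            joint = p(pF, f) * p(pT, t) * p(pS, s) * p(pH[(f, t)], h) * p(pE[(h, s)], e)
            den += joint
            if not e:
                num += joint
        return num / den

    print(p_notE_given_F(), brute_force())   # the two values should agree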
5. False.
6. False.
7. False.
8. True.
9. True.
[Decision tree: the talk branch has expected value -50; the quiet branch yields -100 if the partner talks (probability 0.40) and -10 if the partner keeps quiet (probability 0.60), for an expected value of -46.]
The decision tree shows that the best choice is to keep quiet: the expected value is higher than that for talking. If we want to find out how much trust in our partner is necessary to make keeping quiet the better choice, we need to find the value of P(p-quiet) for which the expected value of keeping quiet equals -50.
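Writing q = P(p-quiet) and using the payoffs from the tree (-10 if both keep quiet, -100 if we keep quiet and our partner talks):

-10q - 100(1 - q) = -50
90q = 50
q = 5/9 ≈ 0.56

So keeping quiet remains the better choice as long as we trust our partner to stay quiet with probability greater than about 0.56.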
4 Searching
a) A*-search would be a good strategy. Straight-line distance is known to be an admissible heuristic and should be a good approximation of the distance remaining to our goal from any point. And we know that A*-search will return the shortest path, allowing us to beat out our trading competitors. (A minimal sketch of the search appears after part (b).)
b) I can still use the straight-line distances. Heuristics are admissible so long
as they do not overestimate the distance to a goal.
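A minimal sketch of the A*-search from part (a), in Python, with a made-up city graph and straight-line-distance table; all names and numbers here are illustrative, not from the problem.

    import heapq

    def a_star(graph, h, start, goal):
        """graph: {city: [(neighbor, road_distance), ...]}, h: {city: straight_line_to_goal}."""
        frontier = [(h[start], 0, start, [start])]   # entries are (f = g + h, g, city, path)
        best_g = {start: 0}
        while frontier:
            f, g, city, path = heapq.heappop(frontier)
            if city == goal:
                return g, path
            for nxt, cost in graph.get(city, []):
                g2 = g + cost
                if g2 < best_g.get(nxt, float("inf")):
                    best_g[nxt] = g2
                    heapq.heappush(frontier, (g2 + h[nxt], g2, nxt, path + [nxt]))
        return None

    # Tiny made-up example: straight-line distances never exceed road distances,
    # so the heuristic is admissible and the returned path is shortest.
    graph = {"A": [("B", 4), ("C", 3)], "B": [("D", 5)], "C": [("D", 8)], "D": []}
    h = {"A": 7, "B": 5, "C": 7, "D": 0}
    print(a_star(graph, h, "A", "D"))   # (9, ['A', 'B', 'D'])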
5 Overfitting
In regression and in crafting generative models, we can make our model fit the training points too closely, which often means that we are modeling noise. Or we may start with a model that is more complex than necessary to fit the points. In either case, future data are likely to fit our model less well than they would have fit a simpler model of the training data. The standard approach is to add a penalty to the error criterion based on the complexity (irregularity) of the model. This makes the best-fit calculation balance the complexity of the model against the error it produces on the training data, and reduces the likelihood of overfitting.
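As a concrete illustration of the penalty idea, here is a small numpy sketch that fits a high-degree polynomial to noisy points with and without a ridge (squared-weight) penalty; all of the data and parameter values are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(10)   # noisy training data

    degree, lam = 9, 1e-3                      # 9th-degree polynomial, small penalty weight
    X = np.vander(x, degree + 1)               # design matrix of polynomial features

    # The unpenalized fit minimizes ||Xw - y||^2 ; ridge adds lam * ||w||^2 to the criterion.
    w_plain = np.linalg.lstsq(X, y, rcond=None)[0]
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

    print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))   # the ridge weights are much smaller

The penalized fit trades a little training error for much smaller (less wiggly) coefficients, which is exactly the balance described above.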
In classification models, we can encounter a similar problem. The classifier
will separate the training data precisely, but the boundary may not generalize
well. We can resist overfitting by using simpler models, either with fewer features
or (in multi-layer nets) with fewer hidden nodes. Other possibilities are to use
support vector machines to find the maximal-margin separator, which should
be more resistant to error, or to add some noise to the training data to make
the learned classifier more robust.
6 Planning
There are no constraints on the order in which the actions must be executed. They are all in the same layer, which indicates that they can be performed in parallel.
8 More GraphPlan
Yes, this is an admissible heuristic because it never overestimates the distance to a solution (no solution can be nearer than the first layer in which all of the solution propositions are pairwise non-mutex).
9 Bayesian Networks
Which of the following conditional independence assumptions are true?
1. False.
2. True.
3. False.
4. True.
5. False.
6. False.
7. False.
8. False.
Neither network is equivalent to the original one. The second network can encode the same joint probability because its conditional independence relations are a subset of the original network's relations.
10 Automated Inference
The algorithm attempts to avoid calculating unnecessary information. It is easier
to understand what is going on if we look at the description and derivation,
rather than the pseudocode. α and β indicate normalizing factors that can be
computed with information from the known probability tables and the return
values of previously made recursive calls.
Now, we know every value in this equation except for P (L|C), so we make
a recursive call to calculate it.
Again, all of these values can be looked up in the network’s conditional
probability tables, except for P (L|D).
Finally, we have arrived at a point where every element of the equation can
be directly looked up in a known conditional probability table. Notice that we
avoided calculating information for the F-G-H chain, or for I. These help to make
our calculation more efficient than the brute-force method of reconstructing the
entire joint probability table.
11 Sampling
You would use sampling-based inference in a very large Bayesian network or one
with undirected cycles that make exact inference procedures infeasible. How-
ever, in cases where you wish to evaluate the probability of events with very
small probability, sampling is not an effective strategy.
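To make the idea concrete, here is a minimal rejection-sampling sketch on a made-up two-node network A → B; none of these numbers come from the exam.

    import random

    P_A = 0.3                       # P(A = true)
    P_B = {True: 0.9, False: 0.2}   # P(B = true | A)

    def estimate_P_A_given_B(n=100_000):
        """Estimate P(A | B = true) by sampling the network and rejecting samples with B = false."""
        accepted = hits = 0
        for _ in range(n):
            a = random.random() < P_A
            b = random.random() < P_B[a]
            if b:                    # keep only samples consistent with the evidence
                accepted += 1
                hits += a
        return hits / accepted

    exact = (0.9 * 0.3) / (0.9 * 0.3 + 0.2 * 0.7)    # Bayes' rule, about 0.66
    print(estimate_P_A_given_B(), exact)

The weakness mentioned above shows up directly: if the evidence B = true were itself very unlikely, almost every sample would be rejected and the estimate would rest on very few accepted samples.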
12 Medical Decisions
[Decision tree without the test: surgery costs -10 whether or not there is a bone chip (expected value -10), while doing nothing costs -100 with a chip (probability 0.30) and 0 without (probability 0.70), for an expected value of -30.]

[Decision tree with the test (which costs 2): after either test result, surgery costs -12 regardless of the chip, while doing nothing costs -102 with a chip and -2 without.]
It’s too hard to fit the math on the diagram, so here are the relevant prob-
abilities and expected values. Abbreviations are: p - positive test, n - negative
test, bc - bone chip, s - surgery.
P(p|bc) = x          P(n|bc) = 1 − x
P(p|¬bc) = y         P(n|¬bc) = 1 − y

P(p) = P(p|bc) P(bc) + P(p|¬bc) P(¬bc)
     = 0.3x + 0.7y

P(n) = P(n|bc) P(bc) + P(n|¬bc) P(¬bc)
     = 0.3(1 − x) + 0.7(1 − y)
     = 1 − 0.3x − 0.7y

P(bc|p) = P(p|bc) P(bc) / P(p)
        = 0.3x / (0.3x + 0.7y)

P(¬bc|p) = P(p|¬bc) P(¬bc) / P(p)
         = 0.7y / (0.3x + 0.7y)

P(bc|n) = P(n|bc) P(bc) / P(n)
        = (0.3 − 0.3x) / (1 − 0.3x − 0.7y)

P(¬bc|n) = P(n|¬bc) P(¬bc) / P(n)
         = (0.7 − 0.7y) / (1 − 0.3x − 0.7y)

E[s|p] = −12 P(bc|p) − 12 P(¬bc|p) = −12

E[¬s|p] = −102 P(bc|p) − 2 P(¬bc|p)
        = (−30.6x − 1.4y) / (0.3x + 0.7y)

E[s|n] = −12

E[¬s|n] = −102 P(bc|n) − 2 P(¬bc|n)
        = (−32 + 30.6x + 1.4y) / (1 − 0.3x − 0.7y)
Now it gets ugly. Let’s refer to our decision to take the test as D.
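If we follow the natural policy of operating on a positive result and skipping surgery on a negative one, then

E[D] = E[s|p] P(p) + E[¬s|n] P(n)
     = −12(0.3x + 0.7y) + (−32 + 30.6x + 1.4y)
     = 27x − 7y − 32,

and taking the test beats immediate surgery when 27x − 7y − 32 > −10, that is, when 27x − 7y > 22.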
Any values of x and y which satisfy this relationship (and, of course, which
are less than or equal to 1, so they are valid probabilities) will give us a better
expected value than just deciding on surgery without having the examination.
[Decision trees for the one-step strategies: from s1, action a1 has expected value 1.1 and a2 has 1.5; from s2, a1 has expected value 0.6 and a2 has 0.9. With a 0.5/0.5 prior over the starting state, always taking a2 has expected value 1.2.]
The decision tree above gives the optimal one-step strategy, assuming no knowledge of the starting state: clearly, we should take a2.
[Tree of one-step outcomes from an unknown starting state (0.5/0.5): taking a1 ends in s1 with probability 0.35, while taking a2 ends in s1 with probability 0.7.]
And now, the value functions. Because this is a simple system, we can
calculate them directly and avoid value-iteration.
We have two unknowns and two linear equations, so we can easily solve them and get V(s1) ≈ 6.69 and V(s2) ≈ 5.96.
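For concreteness, here is one way to set up and solve such a pair of Bellman equations with numpy. The transition probabilities, rewards, and discount below are placeholders rather than the exam's numbers, so the printed values will not be 6.69 and 5.96.

    import numpy as np

    gamma = 0.9                          # placeholder discount factor
    P = np.array([[0.7, 0.3],            # placeholder transition probabilities under the
                  [0.4, 0.6]])           # action chosen in s1 and s2 respectively
    R = np.array([1.0, 0.0])             # placeholder expected immediate rewards

    # The fixed-policy Bellman equations V = R + gamma * P V are linear in V,
    # so we can solve (I - gamma * P) V = R directly instead of iterating.
    V = np.linalg.solve(np.eye(2) - gamma * P, R)
    print(V)                             # [V(s1), V(s2)]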
3. The four-element network has a complexity of 15, while the five-element network requires only 9 conditional probability values. This illustrates the use of a hidden variable to simplify a network.
[Diagrams: the four-node network over A, B, C, and D, and the five-node network that adds a hidden variable.]
So

dE/dz = Σ_i 2(z − y_i) z(1 − z).

Setting this to zero and dropping the common factor z(1 − z) leaves Σ_i (z − y_i) = 0; with 80 target values of 1 and 20 of 0, this gives

80(z − 1) + 20(z − 0) = 0
100z = 80
z = 0.8
P(A|¬B) = P(A, ¬B) / P(¬B)
        = P(A, ¬B, ¬C) / (P(A, ¬B, ¬C) + P(C, ¬A, ¬B))
        = (1/3) / (1/3 + 1/3)
        = 0.5
So, now that you know that B is being pardoned, the probability that you will be executed is 50%. Before, your probability of execution was only 1/3.
Weird, huh? This is similar to another famous probability scenario known
as the “Monty Hall” problem.
6. If f (X) = 0 or f (X) = 1, meaning that X · W was greater than b or
less than −b, there is no gradient. In the other cases, we can derive the
following gradient equation.
s = ((1/2b)(X · W + b) − d)^2

ds/dW_i = ((1/2b)(X · W + b) − d)(X_i / b)
E = (f − d)^2
E = (g2(w2 · v) − d)^2

dE/dw_2n = 2(g2 − d) · g2(1 − g2) · v_n

dE/dw_1n = 2(g2 − d) · g2(1 − g2) · w_20 · g1(1 − g1) · x_n
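A quick finite-difference check of these two formulas, for a network with a single sigmoid hidden unit feeding a sigmoid output unit; all numbers are made up, and the scalar w2 below plays the role of w_20.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Tiny network: one sigmoid hidden unit feeding one sigmoid output unit.
    x = np.array([0.5, -1.2, 2.0])       # made-up input
    w1 = np.array([0.1, 0.4, -0.3])      # input-to-hidden weights
    w2 = 0.7                             # hidden-to-output weight (the w_20 above)
    d = 1.0                              # target

    def error(w1, w2):
        g1 = sigmoid(w1 @ x)
        g2 = sigmoid(w2 * g1)
        return (g2 - d) ** 2

    # Analytic gradients from the formulas above.
    g1 = sigmoid(w1 @ x)
    g2 = sigmoid(w2 * g1)
    dE_dw2 = 2 * (g2 - d) * g2 * (1 - g2) * g1
    dE_dw1 = 2 * (g2 - d) * g2 * (1 - g2) * w2 * g1 * (1 - g1) * x

    # Numerical check by central finite differences.
    eps = 1e-6
    num_dw2 = (error(w1, w2 + eps) - error(w1, w2 - eps)) / (2 * eps)
    num_dw1 = np.array([(error(w1 + eps * np.eye(3)[i], w2) -
                         error(w1 - eps * np.eye(3)[i], w2)) / (2 * eps) for i in range(3)])

    print(dE_dw2, num_dw2)    # should agree to several decimal places
    print(dE_dw1, num_dw1)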
((¬∃x.P(x)) ∨ Q(A)) ∧ ∃y.(P(y) ∧ ¬Q(A))
(∀x.¬P(x) ∨ Q(A)) ∧ ∃y.(P(y) ∧ ¬Q(A))

1. ¬P(x) ∨ Q(A)
2. P(fred)
3. ¬Q(A)
4. ¬P(x) (3,1)
5. False (4,2, fred/x)

1. ¬P(fred) ∨ Q(A)
2. P(x)
3. ¬Q(A)
4. ¬P(fred) (3,1)
5. False (4,2)
Third example: Hmmm, I'm a bit suspicious. Maybe we can find a counterexample.

Q(a) = false
P(fred) = true
P(ned) = false

∃x.(P(x) → Q(a)) → ∀x.(P(x) → Q(a))

We need a situation in which the left-hand side is true and the right-hand side is false. Consider satisfying the left-hand side with x = ned. Because P(ned) = false, the implication P(ned) → Q(a) is true even though Q(a) is always false, so the left-hand side holds. The right-hand side, however, is universally quantified and fails: P(fred) is true, which makes P(fred) → Q(a) false. So the entire expression is false, because we have a true left-hand side and a false right-hand side.
9. ¬On(x, z) ∨ ¬Above(z, y) ∨ Above(x, y)
10. The first-order logic description is:
1. P(F) ∨ P(y) ∨ G(y)
2. ¬B(F) ∨ P(y) ∨ G(y)
3. B(x) ∨ G(x)
4. ¬B(x) ∨ ¬G(x)
5. P(x) ∨ ¬P(y) ∨ B(y)
6. P(O1)
7. ¬P(O2)
And then we can prove that there is a green object:
8. ¬G(x)
9. ¬B(F) ∨ G(O2) (2,7)
10. G(F) ∨ G(O2) (3,9)
11. G(O2) (8,10)
12. False (8,11)