Midterm Review
• Transition model: Forward moves the ant 1 square in the direction it’s facing, unless there is a wall in front. The two
turning actions rotate the ant by 90 degrees to face a different direction.
• Action cost: Each action costs 1.
(a) (i) What’s the minimum state space size 𝑆 for this task?
𝑆=
(ii) Now suppose there are 𝐾 ants, where each ant 𝑖 must reach a distinct goal location 𝐺𝑖 ; any number of ants can
occupy the same square; and action costs are a sum of the individual ants’ step costs. What’s the minimum state
space size for this task, expressed in terms of 𝐾 and 𝑆?
(iii) Now suppose that each ant 𝑖 can exit at any of the goal locations 𝐺𝑗 , but no two ants can occupy the same square
if they are facing the same direction. What’s the minimum state space size for this task, expressed in terms of 𝐾
and 𝑆?
(iv) Now suppose, once again, that each ant 𝑖 must reach its own exit at 𝐺𝑖, and no two ants can occupy the same square
if they are facing the same direction. Let 𝐻 = ∑𝑖 ℎ∗𝑖 , where ℎ∗𝑖 is the optimal cost for ant 𝑖 to reach goal 𝐺𝑖 when
it is the only ant in the maze. Is 𝐻 admissible for the 𝐾-ant problem? Select all appropriate answers.
□ Yes, because for any multiagent problem the sum of individual agent costs, with each agent solving a
subproblem separately, is always a lower bound on the joint cost.
□ Yes, because 𝐻 is the exact cost for a relaxed version of the 𝐾-ant problem.
□ Yes, because the "no two ants..." condition can only make the true cost larger than 𝐻, not smaller.
□ No, because some ants can exit earlier than others so the sum may overestimate the total cost.
□ No, it should be max𝑖 rather than ∑𝑖 .
# None of the above
(b) The ant is alone again in the maze. Now, the spider will return in 𝑇 timesteps, so the ant must reach an exit in 𝑇 or fewer
actions. Any sequence with more than 𝑇 actions doesn’t count as a solution.
In this part, we’ll address this by solving the original problem and checking the resulting solution. That is, suppose 𝑝 is
a problem and 𝐴 is a search algorithm; 𝐴(𝑝) returns a solution 𝑠, and 𝓁(𝑠) is the length (number of actions) of 𝑠, where
𝓁(failure) = ∞. Let 𝑝𝑇 be 𝑝 with the added time limit 𝑇 . Then, given 𝐴, we can define a new algorithm 𝐴′ (𝑝𝑇 ) as follows:
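One way to realize this, as a minimal Python sketch (illustrative only; it assumes 𝐴′ simply runs 𝐴 on the original problem 𝑝 and rejects any solution longer than 𝑇, with FAILURE standing in for the worksheet's "failure" value):

import math

FAILURE = None                                   # stands in for "failure"

def length(s):
    # l(s): number of actions in s, with l(failure) = infinity
    return math.inf if s is FAILURE else len(s)

def A_prime(A, p, T):
    s = A(p)                                     # solve the original, un-time-limited problem
    return s if length(s) <= T else FAILURE      # reject solutions with more than T actions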
(c) Now we attempt to solve the time-limited problem by modifying the problem definition (specifically, the states, legal
actions in each state, and/or goal test) appropriately so that regular, unmodified search algorithms will automatically
avoid returning solutions with more than 𝑇 actions.
(i) Is this possible in general, for any problem where actions costs are all 1? Mark all correct answers.
□ Yes, by augmenting the state space only.
□ Yes, by augmenting the state space and modifying the goal test.
□ Yes, by modifying the goal test only.
□ Yes, by augmenting the state space and modifying the legal actions.
□ Yes, by modifying the legal actions only.
# No, it’s not possible in general.
(ii) Subpart removed for time.
Q2. Informed Search
[Figure: search problem graph with states 𝑆, 𝐴, 𝐵, 𝐶, 𝐷, 𝐺 connected by directed edges with the costs shown; ℎ(𝑆) = 6, ℎ(𝐴) = 4, ℎ(𝐶) = 2, ℎ(𝐷) = 2, ℎ(𝐺) = 0.]
Search problem graph: S is the start state and G is the goal state.
Tie-break in alphabetical order.
ℎ(𝐵) is unknown and will be determined in the subquestions.
(a) In this question, refer to the graph above, where the optimal path is 𝑆 → 𝐵 → 𝐷 → 𝐺. For each of the following subparts,
you will be asked to write ranges of ℎ(𝐵). You should represent ranges as ________ ≤ ℎ(𝐵) ≤ ________. Heuristic values can be
any number, including ±∞; for responses of ±∞, you may treat the provided inequalities as strict. If you
believe that there is no possible range, write "None" in the left-hand box and leave the right box empty.
(i) What is the range for the heuristic to be admissible?
≤ ℎ(𝐵) ≤
(ii) What range of heuristic values for B would allow A* tree search to still return the optimal path (𝑆 → 𝐵 → 𝐷 → 𝐺)?
≤ ℎ(𝐵) ≤
(iii) Now assume that the edges in the graph are undirected (equivalent to having two directed edges pointing
in opposite directions with the same cost as before). Regardless of whether the heuristic is consistent, what range of
heuristic values for B would allow A* tree search to still return the optimal path (𝑆 → 𝐵 → 𝐷 → 𝐺)?
≤ ℎ(𝐵) ≤
(b) Part (b) removed from this worksheet for being too hard; try it on your own time for extra practice.
Q3. Games
(a) Minimax and Alpha-Beta Pruning We have a two-player, zero-sum game with 𝑘 rounds. In each round, the maximizer
acts first and chooses from 𝑛 possible actions, then the minimizer acts next and chooses from 𝑚 possible actions. After
the minimizer’s 𝑘-th turn, the game finishes and we arrive at a utility value (leaf node). Both players behave optimally.
Explore nodes from left to right.
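As a refresher (not part of the original exam), here is a small Python sketch of minimax with alpha-beta pruning on a tree of this shape; the function name, the nested-list tree representation, and the left-to-right child order are illustrative assumptions, not the course's reference implementation:

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    # A leaf is a number; an internal node is a list of children.
    if not isinstance(node, list):
        return node                              # leaf utility
    if maximizing:
        v = float("-inf")
        for child in node:
            v = max(v, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, v)
            if alpha >= beta:                    # remaining children are pruned
                break
        return v
    else:
        v = float("inf")
        for child in node:
            v = min(v, alphabeta(child, True, alpha, beta))
            beta = min(beta, v)
            if alpha >= beta:
                break
        return v

# Example: the k = 1, n = 3, m = 4 tree from part (ii) below
tree = [[5, 7, 3, 6], [8, 2, 10, 7], [4, 9, 1, 0]]
print(alphabeta(tree, maximizing=True))          # minimax value at the root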
(i) What is the total number of leaf nodes in the game tree, in terms of 𝑚, 𝑛, 𝑘?
(ii) In the minimax tree below, 𝑘 = 1, 𝑛 = 3, 𝑚 = 4.
[Game tree figure: a maximizer root with three minimizer children, each with four leaf children. Leaves, left to right: A = 5, B = 7, C = 3, D = 6, E = 8, F = 2, G = 10, H = 7, I = 4, J = 9, K = 1, L = 0.]
(2) Which leaf nodes are pruned by alpha-beta? Mark the corresponding letters below.
□ A □ B □ C □ D □ E □ F □ G □ H □ I □ J □ K □ L # None
(iii) When 𝑘 = 1, in the best case (pruning the most nodes possible), which nodes can be pruned in the tree below?
[Game tree figure: same shape as above (a maximizer root with three minimizer children, four leaves each); leaves labeled A–L, with values left unspecified.]
□ A □ B □ C □ D □ E □ F □ G □ H □ I □ J □ K □ L # None
(iv) Now consider the same 𝑘 = 1, but with general numbers of actions: 𝑛 for the maximizer and 𝑚 for the minimizer.
How many leaf nodes would be pruned in the best case? Express your answer in terms of 𝑚 and 𝑛.
(v) When 𝑘 = 2, 𝑛 = 2, 𝑚 = 2, in the best case, which of the leaves labeled A, B, C, D will be pruned?
[Game tree figure: the 𝑘 = 2, 𝑛 = 2, 𝑚 = 2 game tree; four of its leaves are labeled A, B, C, D.]
□ A □ B □ C □ D # None
(b) Chance Nodes Our maximizer agent is now playing against a non-optimal opponent. In each round, the maximizer acts
first, then the opponent acts next and chooses uniformly at random from 𝑚 possible actions.
(i) Subpart removed for time (simple expectimax)
(ii) Consider the game tree below where we now know that the opponent always has 𝑚 = 4 possible moves and chooses
uniformly at random. We also know that all leaf node utility values are less than or equal to 𝑐 = 10.
[Expectimax tree figure: a maximizer root with three chance-node children, each with four leaf children. Leaves, left to right: A = 5, B = 3, C = 8, D = 4, E = 10, F = 6, G = 7, H = 9, I = 1, J = 0, K = 9, L = 2.]
(c) Now, let’s generalize this idea for pruning on expectimax. We consider expectimax game trees where the opponent always
chooses uniformly at random from 𝑚 possible moves, and all leaf nodes have values no more than 𝑐. These facts are known
by the maximizer player.
(i) Let’s say that our depth-first traversal of this game tree is currently at a chance node and has seen 𝑘 of this
node’s children so far. The sum of the values of the children seen so far is 𝑆. What is the largest possible value that this
chance node can take on? (Answer in terms of 𝑚, 𝑐, 𝑘, and 𝑆.)
(ii) [OPTIONAL]
Now, let’s write an algorithm for computing the root value. Fill in the pseudocode below.
Note that 𝑚 and 𝑐 are constants that you should use in your pseudocode. To find the value at the root, we will start
with a call to MAX-VALUE(root, −∞).
1:  function MAX-VALUE(state, 𝛼)
2:      if state has no successors then return eval(state)
3:      v ← ________
4:      for each successor n of state do
5:          v ← ________
6:          ________
7:      return v
8:
9:  function EXP-VALUE(state, 𝛼)
10:     if state has no successors then return eval(state)
11:     S ← 0
12:     k ← 1
13:     for each successor n of state do
14:         S ← S + ________
15:         ci ← "expression from (c)(i) using m, c, k, and S"
16:         if ________ then
17:             return S/m
18:         k ← k + 1
19:     return S/m
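For self-checking only (not the official solution), a possible completed version in Python follows. It assumes the (c)(i) bound is (𝑆 + (𝑚 − 𝑘)·𝑐)∕𝑚, i.e. every unseen child is given its maximum possible value 𝑐, and it uses an illustrative nested-list tree format:

# A sketch of the completed algorithm (illustrative, not necessarily the intended answer).
# Tree format: a leaf is a number; an internal node is a list of children.
M = 4            # opponent moves per chance node (the worksheet's m)
C = 10           # known upper bound on leaf values (the worksheet's c)

def max_value(state, alpha):
    if not isinstance(state, list):              # no successors: evaluate
        return state
    v = float("-inf")
    for n in state:
        v = max(v, exp_value(n, alpha))
        alpha = max(alpha, v)                    # best value found so far at this MAX node
    return v

def exp_value(state, alpha):
    if not isinstance(state, list):
        return state
    S, k = 0.0, 1
    for n in state:
        S += max_value(n, alpha)
        ci = (S + (M - k) * C) / M               # assumed (c)(i) upper bound on this chance node
        if ci <= alpha:                          # this node can no longer beat alpha: stop early
            return S / M
        k += 1
    return S / M

# Example: the expectimax tree from (b)(ii), with c = 10
tree = [[5, 3, 8, 4], [10, 6, 7, 9], [1, 0, 9, 2]]
print(max_value(tree, float("-inf")))            # value at the root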
Q4. CSPs: The Zookeeper
You are a newly appointed zookeeper, and your first task is to find rooms for all of the animals.
The zoo has three animals: the Iguana (𝐼), Jaguar (𝐽 ), and Koala (𝐾).
Each animal needs to be assigned to one of four rooms: the North room (N), East room (E), South room (S), or West room (W),
subject to the following constraints:
(a) Consider the first constraint: “The jaguar cannot share a room with any other animal.”
Can this constraint be expressed using only binary constraints on the three variables 𝐼, 𝐽 , 𝐾?
(b) Suppose we enforce unary constraints, and then assign the jaguar to the South room. The remaining values in each domain
would be:
In the table below, mark each value that would be removed by running forward-checking after this assignment.
The constraints, repeated here for your convenience:
(d) Regardless of your answer to the previous subpart, suppose we start over and just enforce the third constraint. Then the
remaining values in each domain are:
What does the minimum remaining values (MRV) heuristic suggest doing next?
(e) Again, consider the CSP after just enforcing the third constraint:
Which assignment would the least constraining value (LCV) heuristic prefer?
# Assign North to Jaguar.
# Assign East to Jaguar.
# LCV is indifferent between these two assignments.
Q5. MDPs: Flying Pacman
Pacman is in a 1-dimensional grid with squares labeled 0 through 𝑛, inclusive, as shown below:
0 1 2 3 4 5 … n−1 n
Pacman’s goal is to reach square 𝑛 as cheaply as possible. From state 𝑛, there are no more actions or rewards available.
At any given state, if Pacman is not in 𝑛, Pacman has two actions to choose from:
• Run: Pacman deterministically advances to the next state (i.e. from state 𝑖 to state 𝑖 + 1). This action costs Pacman $1.
• Fly: With probability 𝑝, Pacman directly reaches state 𝑛. With probability 1 − 𝑝, Pacman is stuck in the same state. This
action costs Pacman $2.
(a) Fill in the blank boxes below to define the MDP. 𝑖 represents an arbitrary state in the range {0, … , 𝑛 − 1}.
𝑠    𝑎      𝑠′        𝑇 (𝑠, 𝑎, 𝑠′)    𝑅(𝑠, 𝑎, 𝑠′)
𝑖    Run    𝑖 + 1     ________        ________
𝑖    Fly    𝑖         ________        ________
𝑖    Fly    ______    ________        ________
Let 𝜋𝑅 denote the policy of always selecting Run, and 𝜋𝐹 denote the policy of always selecting Fly.
Compute the values 𝑉 𝜋𝑅 (𝑖) and 𝑉 𝜋𝐹 (𝑖) of these two policies. Your answer should be an expression, possibly in terms of 𝑛, 𝑝, and/or 𝑖.
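As a numerical sanity check (not part of the original problem), the two policy values can be approximated by iterative policy evaluation. The sketch below writes costs as negative rewards with 𝛾 = 1; the function name, sweep count, and example arguments are illustrative:

def evaluate(n, p, policy, sweeps=10000):
    # V[n] = 0: no actions or rewards are available from state n.
    V = [0.0] * (n + 1)
    for _ in range(sweeps):
        for i in range(n):
            if policy == "run":
                V[i] = -1 + V[i + 1]                     # deterministic move to i+1, cost $1
            else:                                        # "fly": cost $2, reach n w.p. p, else stay
                V[i] = -2 + p * V[n] + (1 - p) * V[i]
    return V

print(evaluate(n=5, p=0.5, policy="run")[0])             # compare with your expression for V^{pi_R}(0)
print(evaluate(n=5, p=0.5, policy="fly")[0])             # compare with your expression for V^{pi_F}(0)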
(d) Given the results of the two previous subparts, we can now find the optimal policy for the MDP.
Which of the following are true? Select all that apply. (Hint: consider what value of 𝑖 makes 𝑉 𝜋𝑅 (𝑖) and 𝑉 𝜋𝐹 (𝑖) equal.)
Note: ⌈𝑥⌉ is the smallest integer greater than or equal to 𝑥.
□ If 𝑝 < 2∕𝑛, Fly is optimal for all states.
□ If 𝑝 < 2∕𝑛, Run is optimal for all states.
□ If 𝑝 ≥ 2∕𝑛, Fly is optimal for all 𝑖 ≥ ⌈𝑛 − 2∕𝑝⌉ and Run is optimal for all 𝑖 < ⌈𝑛 − 2∕𝑝⌉.
□ If 𝑝 ≥ 2∕𝑛, Run is optimal for all 𝑖 ≥ ⌈𝑛 − 2∕𝑝⌉ and Fly is optimal for all 𝑖 < ⌈𝑛 − 2∕𝑝⌉.
# None of the above.
Regardless of your answers to the previous parts, consider the following modified transition and reward functions (which may
not correspond to the original problem). As before, once Pacman reaches state 𝑛, no further actions or rewards are available.
For each modified MDP and discount factor below, select whether value iteration will converge to finite values.
(e) 𝛾 = 1
𝑠 𝑎 𝑠′ 𝑇 (𝑠, 𝑎, 𝑠′ ) 𝑅(𝑠, 𝑎, 𝑠′ )
𝑖 Run 𝑖+1 1.0 +5
𝑖 Fly 𝑖+1 1.0 +5
(f) 𝛾 = 1
𝑠 𝑎 𝑠′ 𝑇 (𝑠, 𝑎, 𝑠′ ) 𝑅(𝑠, 𝑎, 𝑠′ )
𝑖 Run 𝑖+1 1.0 +5
𝑖 Fly 𝑖−1 1.0 +5
(g) 𝛾 < 1
𝑠 𝑎 𝑠′ 𝑇 (𝑠, 𝑎, 𝑠′ ) 𝑅(𝑠, 𝑎, 𝑠′ )
𝑖 Run 𝑖+1 1.0 +5
𝑖 Fly 𝑖−1 1.0 +5
Q6. RL: Rest and ReLaxation
Consider the grid world MDP below, with unknown transition and reward functions:

A B C
D E F
G H I

The agent observes the following samples in this grid world:

𝑠   𝑎       𝑠′   𝑅(𝑠, 𝑎, 𝑠′)
E   East    F    −1
E   East    H    −1
E   South   H    −1
E   South   H    −1
E   South   D    −1
Reminder: In grid world, each non-exit action succeeds with probability 𝑝. If an action (e.g. North) fails, the agent moves
in one of the cardinally adjacent directions (e.g. East or West) with equal probability, but does not move in the opposite direction
(e.g. South).
In this question, we will consider 3 strategies for estimating the transition function in this MDP.
Strategy 1: The agent does not know the rules of grid world, and runs model-based learning to directly estimate the transition
function.
Strategy 2: The agent knows the rules of grid world, and runs model-based learning to estimate 𝑝. Then, the agent uses the
estimated 𝑝̂ to estimate the transition function.
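To make the two strategies concrete, here is a small sketch (not part of the original question) of how each estimate might be computed from the five samples above. The count-based estimators are the usual maximum-likelihood ones; names such as T_hat_direct and intended are illustrative:

from collections import Counter

# The observed (s, a, s') samples from the table above (rewards omitted).
samples = [("E", "East", "F"), ("E", "East", "H"),
           ("E", "South", "H"), ("E", "South", "H"), ("E", "South", "D")]

# Strategy 1: estimate T(s, a, s') directly from counts, without using the rules of grid world.
sas_counts = Counter(samples)
sa_counts = Counter((s, a) for s, a, _ in samples)
def T_hat_direct(s, a, s_next):
    return sas_counts[(s, a, s_next)] / sa_counts[(s, a)] if sa_counts[(s, a)] else 0.0

# Strategy 2: use the rules of grid world. A sample is a "success" when the agent moved in the
# intended direction; p is estimated as the fraction of successes, and T is then rebuilt from
# p_hat plus the known failure model (equal chance of each cardinally adjacent direction).
intended = {("E", "East"): "F", ("E", "South"): "H"}     # intended next squares from E
successes = sum(1 for s, a, s2 in samples if intended[(s, a)] == s2)
p_hat = successes / len(samples)

print(T_hat_direct("E", "South", "H"), p_hat)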
(d) Based on 𝑝̂, what is 𝑇̂ (E, West, D)?
(e) Select all true statements about comparing Strategy 1 and Strategy 2.
□ Strategy 1 will usually require fewer samples to estimate the transition function to the same accuracy threshold.
□ There are fewer unknown parameters to learn in Strategy 1.
□ Strategy 1 is more prone to overfitting on samples.
# None of the above
The grid world and samples, repeated for your convenience:

A B C
D E F
G H I

𝑠   𝑎       𝑠′   𝑅(𝑠, 𝑎, 𝑠′)
E   East    F    −1
E   East    H    −1
E   South   H    −1
E   South   H    −1
E   South   D    −1
Strategy 3: The agent knows the rules of grid world, and uses an exponential moving average to estimate 𝑝. Then, the agent
uses the estimated 𝑝̂ to estimate the transition function.
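A minimal sketch of what Strategy 3's update could look like (the step size beta, the initial guess, and the streaming loop are illustrative assumptions):

# Exponential moving average estimate of p from samples arriving one at a time.
samples = [("E", "East", "F"), ("E", "East", "H"),
           ("E", "South", "H"), ("E", "South", "H"), ("E", "South", "D")]
intended = {("E", "East"): "F", ("E", "South"): "H"}     # intended outcome of each observed action

beta = 0.1        # EMA step size: larger beta weights recent samples more heavily
p_hat = 0.5       # arbitrary initial estimate
for s, a, s_next in samples:
    success = 1.0 if intended[(s, a)] == s_next else 0.0
    p_hat = (1 - beta) * p_hat + beta * success
print(p_hat)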
(g) Select all true statements about comparing Strategy 2 and Strategy 3.
□ Strategy 2 gives a more accurate estimate, because it is the maximum likelihood estimate.
□ Strategy 3 gives a more accurate estimate, because it gives more weight to more recent samples.
□ Strategy 3 can be run with samples streaming in one at a time.
# None of the above
Suppose the agent runs Q-learning in this grid world, with learning rate 0 < 𝛼 < 1, and discount factor 𝛾 = 1.
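For reference, the tabular Q-learning update used in these subparts, as a short sketch (Q-values initialized to zero; the sample ordering, the specific alpha, and the variable names are assumptions):

from collections import defaultdict

# One pass of tabular Q-learning over the five samples, with gamma = 1.
alpha, gamma = 0.5, 1.0
Q = defaultdict(float)                                    # Q-values start at 0
actions = ["North", "South", "East", "West"]
samples = [("E", "East", "F", -1), ("E", "East", "H", -1),
           ("E", "South", "H", -1), ("E", "South", "H", -1), ("E", "South", "D", -1)]

for s, a, s_next, r in samples:
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target  # standard Q-learning update

print({k: v for k, v in Q.items() if v != 0})             # Q-values updated away from zero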
(h) After iterating through the samples once, how many learned Q-values will be nonzero?
# 0 # 1 # 2 # 3 # 4 # >4
(i) After iterating through the samples repeatedly until convergence, how many learned Q-values will be nonzero?
# 0 # 1 # 2 # 3 # 4 # >4
Q7. Potpourri [OPTIONAL]
(a) Below is a list of task environments. For each of the subparts, choose all the environments in the list that fall into the
specified type.
A: The competitive rock-paper-scissors game
B: The classical Pacman game (with ghosts following a fixed path)
C: Solving a crossword puzzle
D: A robot that removes defective cookies from a cookie conveyor belt
(i) Which of the environments can be formulated as single-agent? □ A □ B □ C □ D
(ii) Which of the environments are static? □ A □ B □ C □ D
(iii) Which of the environments are discrete? □ A □ B □ C □ D
(b) (i) # T # F Reflex agents cannot be rational.
(ii) # T # F There exist task environments in which no pure reflex agent can behave rationally.
(c) (i) # T # F If the costs can be arbitrarily large negative numbers in a search problem, then any optimal
search algorithm in this problem will need to explore the entire state space.
(ii) # T # F Depth-first search always expands at least as many nodes as A* search with an admissible
heuristic.
(d) (i) # T # F Local beam search with a beam size of 1 reduces to Hill climbing.
(ii) # T # F Local beam search with one initial state and no limit on the number of states retained reduces
to depth-first search.