Lec 04
[These slides adapted from Dan Klein, Pieter Abbeel, Anca Dragan, Stuart Russell, and many others]
Behavior from Computation
Types of Games
o Axes:
o Deterministic or stochastic?
o One, two, or more players?
o Zero sum?
o Perfect information (can you see the state)?
o Chess
o (1997): Deep Blue defeats human champion Garry Kasparov in a six-game match. Current programs are even better, if less historic.
o Go
o (2016): AlphaGo defeats human champion Lee Sedol. Uses Monte Carlo Tree Search, learned evaluation function.
Deterministic Games with Terminal Utilities
o Many possible formalizations, one is:
o States: S (start at s0)
o Players: P = {1...N} (usually take turns)
o Actions: A (may depend on player / state)
o Transition Function: S × A → S
o Terminal Test: S → {t, f}
o Terminal Utilities: S × P → R
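As a concrete rendering of this formalization, here is a minimal Python interface sketch. The class and method names are illustrative choices, not a standard API from the slides:

import abc

class Game(abc.ABC):
    """Deterministic game with terminal utilities (names are illustrative)."""
    initial_state = None                       # s0
    players = ()                               # P = {1, ..., N}

    @abc.abstractmethod
    def to_move(self, state): ...              # which player acts in this state

    @abc.abstractmethod
    def actions(self, state): ...              # A (may depend on player / state)

    @abc.abstractmethod
    def result(self, state, action): ...       # transition function: S x A -> S

    @abc.abstractmethod
    def is_terminal(self, state): ...          # terminal test: S -> {t, f}

    @abc.abstractmethod
    def utility(self, state, player): ...      # terminal utilities: S x P -> R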
Value of a State
o Value of a state: the best achievable outcome (utility) from that state
[Figure: game tree; non-terminal states take their value from their best successor, terminal states have fixed utilities 2 0 … 2 6 … 4 6]
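Written out as the standard recursion (notation mine, not from the deck):

V(s) =
  \begin{cases}
    \text{utility}(s) & \text{if } s \text{ is terminal} \\
    \max_{s' \in \text{successors}(s)} V(s') & \text{otherwise (the agent's turn)}
  \end{cases}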
Adversarial Game Trees
[Figure: adversarial game tree with terminal state values -8, -5, -10, +8]
Tic-Tac-Toe Game Tree
Adversarial Search (Minimax)
o Deterministic, zero-sum games:
o Tic-tac-toe, chess, checkers
o One player maximizes result
o The other minimizes result
o Minimax search:
o A state-space search tree
o Players alternate turns
o Compute each node's minimax value: the best achievable utility against a rational (optimal) adversary
[Figure: minimax values computed recursively: terminal values 8 2 5 6 are part of the game; min values 2 and 5; max value 5]
Minimax Implementation (Dispatch)

def value(state):
    if the state is terminal: return the state's utility
    if the next agent is MAX: return max-value(state)
    if the next agent is MIN: return min-value(state)

[Figure: minimax tree with min-node values 3 2 2 over terminal values 3 12 8 2 4 6 14 5 2]
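The dispatch relies on max-value and min-value. A runnable sketch in Python, assuming the illustrative Game interface above with player labels "MAX" and "MIN" and utilities stated from MAX's perspective:

import math

def minimax_value(game, state):
    """Minimax value of a state against an optimal adversary."""
    if game.is_terminal(state):
        return game.utility(state, "MAX")
    if game.to_move(state) == "MAX":
        return max_value(game, state)
    return min_value(game, state)

def max_value(game, state):
    v = -math.inf
    for a in game.actions(state):
        v = max(v, minimax_value(game, game.result(state, a)))
    return v

def min_value(game, state):
    v = math.inf
    for a in game.actions(state):
        v = min(v, minimax_value(game, game.result(state, a)))
    return v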
Minimax Properties
[Figure: minimax tree with terminal values 10 10 9 100]
[Figure: alpha-beta example: root ≥3; left min node 3; right min node ≤2, remaining children pruned; terminal values 3 12 8 2 14 5 2]
Alpha-Beta Implementation
[Figure: alternating MAX/MIN layers, with an ancestor MIN node a above the current MIN node n]
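A hedged sketch of the standard alpha-beta recursion (α is MAX's best option on the path to the root, β is MIN's), again assuming the illustrative Game interface:

import math

def alphabeta_value(game, state, alpha=-math.inf, beta=math.inf):
    # alpha: best value MAX can already guarantee on the path to the root
    # beta:  best value MIN can already guarantee on the path to the root
    if game.is_terminal(state):
        return game.utility(state, "MAX")
    if game.to_move(state) == "MAX":
        v = -math.inf
        for a in game.actions(state):
            v = max(v, alphabeta_value(game, game.result(state, a), alpha, beta))
            if v >= beta:        # the MIN ancestor will never allow this branch
                return v         # prune the remaining children
            alpha = max(alpha, v)
        return v
    else:
        v = math.inf
        for a in game.actions(state):
            v = min(v, alphabeta_value(game, game.result(state, a), alpha, beta))
            if v <= alpha:       # the MAX ancestor will never allow this branch
                return v
            beta = min(beta, v)
        return v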
Alpha-Beta Pruning Properties
o This pruning has no effect on the minimax value computed for the root!
o Good child ordering improves the effectiveness of pruning
Alpha-Beta Quiz 2
[Figure: quiz tree with node values 10, ≤2, ≥100, 2, 10]
Resource Limits
o Problem: In realistic games, cannot search to leaves!
o Solution: Depth-limited search
o Instead, search only to a limited depth in the tree
o Replace terminal utilities with an evaluation function for non-terminal positions
o Example:
o Suppose we have 100 seconds, can explore 10K nodes/sec
o So can check 1M nodes per move
o α-β reaches about depth 8 – decent chess program
[Figure: depth-limited tree: max 4; min -2 and 4; leaves -1 -2 4 9]
[Demo: thrashing d=2, thrashing d=2 (fixed evaluation function), smart ghosts coordinate (L6D6,7,8,10)]
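A sketch of the depth-limited variant: the terminal test is supplemented by a depth cutoff, and eval_fn is an assumed stand-in for a real evaluation function (e.g. a weighted linear sum of position features):

def depth_limited_value(game, state, depth, eval_fn):
    # eval_fn(state) estimates the state's value; it replaces true
    # terminal utilities once the depth limit is reached.
    if game.is_terminal(state):
        return game.utility(state, "MAX")
    if depth == 0:
        return eval_fn(state)
    values = (depth_limited_value(game, game.result(state, a), depth - 1, eval_fn)
              for a in game.actions(state))
    if game.to_move(state) == "MAX":
        return max(values)
    return min(values)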
Video of Demo Thrashing (d=2)
Why Pacman Starves
[Figure: max/min tree with terminal values 10 10 9 100]
Expectimax Pseudocode

def value(state):
    if the state is a terminal state: return the state's utility
    if the next agent is MAX: return max-value(state)
    if the next agent is EXP: return exp-value(state)

def max-value(state):
    initialize v = -∞
    for each successor of state:
        v = max(v, value(successor))
    return v

def exp-value(state):
    initialize v = 0
    for each successor of state:
        p = probability(successor)
        v += p * value(successor)
    return v
def exp-value(state):
    initialize v = 0
    for each successor of state:
        p = probability(successor)
        v += p * value(successor)
    return v

[Figure: chance node with outcome probabilities 1/2, 1/3, 1/6 over child values 8, 24, -12, giving v = (1/2)(8) + (1/3)(24) + (1/6)(-12) = 10; expectimax example tree with terminal values 3 12 9 2 4 6 15 6 0]
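The pseudocode made concrete in Python; game.probability is an assumed method exposing the chance model for each successor, not part of the slides' interface:

def expectimax_value(game, state):
    if game.is_terminal(state):
        return game.utility(state, "MAX")
    if game.to_move(state) == "MAX":
        return max(expectimax_value(game, game.result(state, a))
                   for a in game.actions(state))
    # chance node: probability-weighted average over outcomes
    v = 0.0
    for a in game.actions(state):
        p = game.probability(state, a)   # assumed model of opponent/environment
        v += p * expectimax_value(game, game.result(state, a))
    return v

On the figure's chance node this computes (1/2)(8) + (1/3)(24) + (1/6)(-12) = 4 + 8 - 2 = 10.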
Expectimax Pruning?
o Unlike minimax, chance nodes cannot be pruned in general: without bounds on the leaf values, any unseen outcome could still change the expected value.
[Figure: expectimax tree with terminal values 3 12 9 2]
Depth-Limited Expectimax
[Figure: depth-limited tree; evaluation values 400 and 300 at the cutoff are estimates of the true expectimax values (492, 362, …), which would require a lot of work to compute]
What Probabilities to Use?
o In expectimax search, we have a probabilistic model of how the opponent (or environment) will behave in any state
o Model could be a simple uniform distribution (roll a die)
o Model could be sophisticated and require a great deal of computation
o We have a chance node for any outcome out of our control: opponent or environment
o The model might say that adversarial actions are likely!
o For now, assume each chance node magically comes along with probabilities that specify the distribution over its outcomes

Having a probabilistic belief about another agent's action does not mean that the agent is flipping any coins!
Quiz: Informed Probabilities
o Let's say you know that your opponent is actually running a depth 2 minimax, using the result 80% of the time, and moving randomly otherwise
o Question: What tree search should you use?
o Answer: Expectimax!
o To figure out EACH chance node's probabilities, you have to run a simulation of your opponent
o This kind of thing gets very slow very quickly
o Even worse if you have to simulate your opponent simulating you…
o … except for minimax and maximax, which have the nice property that it all collapses into one game tree
o This is basically how you would model a human, except for their utility: their utility might be the same as yours (i.e. you try to help them, but they are depth 2 and noisy), or they might have a slightly different utility (like another person navigating in the office)
Modeling Assumptions
The Dangers of Optimism and Pessimism
o Dangerous Optimism: assuming chance when the world is adversarial
o Dangerous Pessimism: assuming the worst case when it's not likely
Assumptions vs. Reality
Pacman used depth 4 search with an eval function that avoids trouble
Ghost used depth 2 search with an eval function that seeks Pacman
[Demos: world assumptions (L7D3,4,5,6)]
Video of Demo World Assumptions
Random Ghost – Expectimax Pacman
Video of Demo World Assumptions
Adversarial Ghost – Minimax Pacman
Video of Demo World Assumptions
Adversarial Ghost – Expectimax Pacman
Video of Demo World Assumptions
Random Ghost – Minimax Pacman
Mixed Layer Types
o E.g. Backgammon
o Expectiminimax
o Environment is an extra "random agent" player that moves after each min/max agent
o Each node computes the appropriate combination of its children
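Expectiminimax just adds a chance case to the dispatch; a sketch, assuming game.to_move can return a CHANCE marker and game.probability gives outcome probabilities (both assumptions of this sketch):

def expectiminimax_value(game, state):
    if game.is_terminal(state):
        return game.utility(state, "MAX")
    mover = game.to_move(state)                 # "MAX", "MIN", or "CHANCE"
    values = [(a, expectiminimax_value(game, game.result(state, a)))
              for a in game.actions(state)]
    if mover == "MAX":
        return max(v for _, v in values)
    if mover == "MIN":
        return min(v for _, v in values)
    # chance layer, e.g. a dice roll: probability-weighted average
    return sum(game.probability(state, a) * v for a, v in values)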
Example: Backgammon
o Dice rolls increase b: 21 possible rolls with 2 dice
o Backgammon ≈ 20 legal moves
o Depth 2 = 20 × (21 × 20)³ = 1.2 × 10⁹
MCTS Version 2: UCT
o Repeat until out of time:
o Given the current search tree, recursively apply UCB to choose a path down to a leaf (not fully expanded) node n
o Add a new child c to n and run a rollout from c
o Update the win counts from c back up to the root
o Choose the action leading to the child with highest N
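A compact sketch of that loop. The Node fields, the UCB1 constant C, and the win-counting convention are illustrative choices, not from the slides; rollouts here are uniformly random, and wins are counted from MAX's perspective everywhere (a full two-player UCT flips the perspective at each level):

import math, random

C = 1.4  # UCB1 exploration constant; a common default, not from the slides

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.wins, self.visits = [], 0, 0

def ucb1(node):
    # wins/visits exploits; the sqrt term favors rarely visited children
    return (node.wins / node.visits
            + C * math.sqrt(math.log(node.parent.visits) / node.visits))

def uct_search(game, root_state, n_iters=1000):
    root = Node(root_state)
    for _ in range(n_iters):
        # 1. Selection: apply UCB down to a not-fully-expanded node
        node = root
        while node.children and len(node.children) == len(game.actions(node.state)):
            node = max(node.children, key=ucb1)
        # 2. Expansion: add one new child c
        if not game.is_terminal(node.state):
            tried = {c.action for c in node.children}
            a = random.choice([x for x in game.actions(node.state) if x not in tried])
            child = Node(game.result(node.state, a), parent=node, action=a)
            node.children.append(child)
            node = child
        # 3. Rollout: play randomly from c to a terminal state
        state = node.state
        while not game.is_terminal(state):
            state = game.result(state, random.choice(game.actions(state)))
        win = game.utility(state, "MAX") > 0   # assumes win/loss utilities
        # 4. Backup: update win counts from c back up to the root
        while node is not None:
            node.visits += 1
            node.wins += win
            node = node.parent
    # Choose the action leading to the child with highest N
    return max(root.children, key=lambda c: c.visits).action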
UCT Example
[Figure: UCT search tree with win/visit counts such as 5/10 and 4/9]
Why is there no min or max?????
o "Value" of a node, U(n)/N(n), is a weighted sum of child values!
o Idea: as N → ∞, the vast majority of rollouts are concentrated in the best children, so the weighted average → max/min
o Theorem: as N → ∞, UCT selects the minimax move
o (but N never approaches infinity!)
Summary
o Games require decisions when optimality is impossible
o Bounded-depth search and approximate evaluation functions
o Games force efficient use of computation
o Alpha-beta pruning, MCTS
o Game playing has produced important research ideas
o Reinforcement learning (checkers)
o Iterative deepening (chess)
o Rational metareasoning (Othello)
o Monte Carlo tree search (chess, Go)
o Solution methods for partial-information games in economics (poker)
o Video games present much greater challenges – lots to do!
o b = 10^500, |S| = 10^4000, m = 10,000, partially observable, often > 2 players