Mitchell Machine Learning
Peter Danenberg
24 October 2011
Contents
1 TODO An empty module that gathers the exercises’ dependencies
2 Exercises
  2.1 DONE 1.1
  2.2 DONE 1.2
  2.3 DONE 1.3
  2.4 DONE 1.4
  2.5 TODO 1.5
3 Notes
  3.1 Chapters
    3.1.1 1
  3.2 Exercises
    3.2.1 1.3
    3.2.2 1.4
    3.2.3 1.5
2 Exercises
2.1 DONE 1.1
CLOSED: 2011-10-12 Wed 04:21
2.3 DONE 1.3
CLOSED: 2011-10-12 Wed 12:46
Here’s a solution for the trivial case in which |⟨b, Vtrain (b)⟩| = 1 and the
target function V̂ consists of a single feature x and a single weight w:
∂E/∂w = ∂/∂w (Vtrain(b) − V̂(b))²                                   (1)
      = 2(Vtrain(b) − V̂(b)) · ∂/∂w (Vtrain(b) − V̂(b))              (2)
      = 2(Vtrain(b) − V̂(b))(0 − x)                                  (3)
      = −2(Vtrain(b) − V̂(b))x                                       (4)
which gives:
wn+1 = wn − ∂E/∂w                                                   (5)
     ∝ wn + η(Vtrain(b) − V̂(b))x        (by 4)                      (6)
It should be trivial to extend this to the case where w and X are vectors;
the LMS training rule, furthermore, covers the summation.
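Here’s a minimal Scheme sketch of update (6), just to pin down the mechanics;
the names (update-weight, eta, vtrain) are placeholders, not anything from the
exercise:
;; One gradient step for the single-feature, single-weight case:
;; w <- w + eta * (Vtrain(b) - Vhat(b)) * x, where Vhat(b) = w * x.
(define (update-weight w eta x vtrain)
  (let ((vhat (* w x)))
    (+ w (* eta (- vtrain vhat) x))))
;; E.g. with w = 0.5, x = 3, Vtrain(b) = 100 and eta = 0.1:
;; Vhat(b) = 1.5, the error is 98.5, and the new w is 30.05.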
2.5 TODO 1.5
3 Notes
3.1 Chapters
3.1.1 1
• a computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in
T, as measured by P, improves with experience E.
• neural networks, hidden Markov models, decision trees
• artificial intelligence :: symbolic representations of concepts
• Bayesian :: estimating values of unobserved variables
• statistics :: characterization of errors, confidence intervals
• attributes of training experience:
– type of training experience from which our system will learn
∗ direct or indirect feedback
direct :: individual checkers board states and the correct move for each
indirect :: move sequences, final outcomes
· credit assignment: game can be lost even when early moves are optimal
– degree to which learner controls sequence of training examples
– how well it represents the distribution of examples over which the
final system performance P must be measured
∗ mastery of one distribution of examples will not necessary (sic)
lead to strong performance over some other distribution
• task T: playing checkers; performance measure P: percent of games won;
training experience E: games played against itself.
• 1. the exact type of knowledge to be learned; 2. a representation for
this target knowledge; 3. a learning mechanism.
• program: generate legal moves: needs to learn how to choose the best
move; some large search space
• class for which the legal moves that define some large search space are
known a priori, but for which the best search strategy is not known
• target function :: ChooseMove : B → M (from the set B of legal board states
to the set M of legal moves)
– very difficult to learn given the kind of indirect training experience
available
– alternative target function: assigns a numerical score to any given
board state
• alternative target function :: V : B -> R (V maps legal board state B to
some real value)
– higher scores to better board states
• the more expressive the representation, the more training data the program
will require to choose among alternative hypotheses
T :: playing checkers
P :: percent of games won
E :: games played against self
target function :: V : Board → R
target function representation :: V̂ = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
(the first three are the specification of the learning task; the last two are design choices)
• require set of training examples, describing board state b and training
value Vtrain (b) for b: ordered pair ⟨b, Vtrain (b)⟩: ⟨⟨x1 = 3, x2 = 0, x3 =
1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩.
• less obvious how to assign training values to the more numerous intermediate
board states
• Vtrain (b) ← V̂ (Successor(b))
• Successor(b) denotes the next board state following b for which it is again
the program’s turn to move
– train separately red and black
• V̂ tends to be more accurate for board states closer to game’s end
• best fit: define the best hypothesis, or set of weights, as that which min-
imizes the squared error E between the training values and the values
predicted by the hypothesis V̂
E ≡ Σ⟨b,Vtrain(b)⟩ ∈ training examples (Vtrain(b) − V̂(b))²
http://en.wikipedia.org/wiki/Minimum_mean_square_error
is a risk function, corresponding to the expected value of the
squared error loss or quadratic loss. . . the difference occurs
because of randomness or because the estimator doesn’t account
for information that could produce a more accurate estimate.
http://en.wikipedia.org/wiki/Mean_squared_error
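To make E concrete, here’s a minimal sketch; it assumes the training examples
are an association list of (b . Vtrain(b)) pairs and that vhat is the current
hypothesis as a procedure from board states to reals (both shapes are
assumptions, and fold is srfi-1’s):
;; E = the sum over the training examples of (Vtrain(b) - Vhat(b))^2.
(define (squared-error vhat training-examples)
  (fold (lambda (example e)
          (+ e (expt (- (cdr example) (vhat (car example))) 2)))
        0
        training-examples))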
• thus we seek the weights, or equivalently the V̂ , that minimize E for the
observed training examples
– damn, statistics would make this all intuitive and clear
• several algorithms are known for finding weights of a linear function that
minimize E; we require an algorithm that will incrementally refine the
weights as new training examples become available and that will be robust
to errors in these estimated training values.
• one such algorithm is called the least mean squares, or LMS training rule.
http://en.wikipedia.org/wiki/Least_mean_squares_filter
http://en.wikipedia.org/wiki/Expected_value
• LMS weight update rule: for each training example ⟨b, Vtrain (b)⟩:
– use the current weights to calculate V̂ (b)
– for each weight wi , update it as: wi ← wi + η(Vtrain (b) − V̂ (b))xi
• here η is a small constant (e.g., 0.1) that moderates the size of the weight
update.
• notice that when the error Vtrain (b)− V̂ (b) is zero, no weights are changed.
when Vtrain (b) − V̂ (b) is positive (i.e., when V̂ (b) is too low), then each
weight is increased in proportion to the value of its corresponding feature.
this will raise the value of V̂ (b), reducing the error. notice that if the value
of some feature xi is zero, then its weight is not altered regardless of the
error, so that the only weights updated are those whose features actually
occur on the training example board.
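Here’s a minimal sketch of the rule for the six-feature V̂ above, keeping the
weights and the features as plain lists; the layout and the names (lms-update,
eta) are assumptions, not the book’s code:
;; Vhat(b) = w0 + w1*x1 + ... + w6*x6; prepending 1 to the features
;; lets w0 share the same update as the other weights.
(define (v-hat weights features)
  (apply + (map * weights (cons 1 features))))
;; One LMS pass over a single training example <b, Vtrain(b)>:
;; wi <- wi + eta * (Vtrain(b) - Vhat(b)) * xi.
(define (lms-update weights features vtrain eta)
  (let ((delta (- vtrain (v-hat weights features))))
    (map (lambda (w x) (+ w (* eta delta x)))
         weights
         (cons 1 features))))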
• performance system :: solve the given performance task (e.g. playing
checkers) by using the learned target function(s). it takes an instance of a
new problem (game) as input and produces a trace of its solution (game
history) as output (e.g. select its next move at each step by the learned
V̂ evaluation function). we expect its performance to improve as this
evaluation function becomes increasingly accurate.
• critic :: takes the history or trace of the game and produces as output a set of
training examples of the target function: {⟨b1, Vtrain(b1)⟩, . . . , ⟨bn, Vtrain(bn)⟩}.
digraph design {
generator [label="Experiment Generator"]
performer [label="Performance System"]
critic [label=Critic]
generalizer [label=Generalizer]
performer -> critic [label="Solution trace"]
critic -> generalizer [label="Training examples"]
generalizer -> generator [label=Hypothesis]
generator -> performer [label="New problem"]
}
• linear function representation for V̂ too simple to capture well the nuances
of the game.
– program represents the learned eval function using an artificial neural
network that considers the complete description of the board state
rather than a subset of board features.
• prior knowledge?
• choosing useful next training experience?
3.2 Exercises
3.2.1 1.3
From page 11: “The LMS training rule can be viewed as performing a stochastic
gradient-descent search through the space of possible hypotheses (weight values)
to minimize the squared error E.”
3.2.2 1.4
strategies could involve creating board positions designed to explore par-
ticular regions of the state space.
3.2.3 1.5
“Non-operational” definition of V (b):
V(b) =
  100     if b is a final winning board state
  −100    if b is a final losing board state
  0       if b is a final drawing board state
  V(b′)   otherwise, where b′ is an optimal final board state        (7)
x9 X empty corner?
x10 O empty corner?
x11 X empty side?
x12 O empty side?
Page 8: “In general, this choice of representation involves a crucial tradeoff.
On one hand, we wish to pick a very expressive representation to allow repre-
senting as close an approximation as possible to the ideal target function V . On
the other hand, the more expressive the representation, the more training data
the program will require in order to choose among the alternative hypotheses it
can represent.”
Here’s a crazy thought: since the state-space complexity of tic-tac-toe is
utterly tractable, let’s have nine features: one corresponding to each of the
squares.
How do we deal with training the opposite direction, by the way: invert the
outcome of the training data?
I have no idea how much training data nine variables need: we’ll have to
plot it; interesting to compare a strategy containing e.g. forks and wins.
Is it interesting that each variable is binary?
Let’s start with the generalizer and a catalog of games; in order to map the
number of training-examples . . . Ah, I see: the second player has a fixed
evaluation function. Can we abstract xkcd? Problem is, the space for O is
much more complicated. Maybe we can abstract the Wikipedia strategy: #
wikipedia-strategy
1. Win
2. Block
3. Fork
4. Block a fork
5. Center
6. Opposite corner
7. Empty side
(It looks like the Wikipedia strategy was abstracted from here, by the way;
damn: it looks like there are separate X- and O-heuristics.)
Represent the board as a vector of nine values; can we set up abstractions for
< x, y > as well as {map,reduce,for-each}-{row,column,diagonal,triplet}?
Meh; maybe we can implement the X/O-agnostic heuristics.
(use debug
vector-lib
srfi-1
srfi-11
srfi-26)
(define (n-by-n)
(* (n) (n)))
(define (a) 0)
(define (ac-diagonal)
(iota (n) (a) (+ (n) 1)))
(define (bd-diagonal)
(iota (n) (b) (- (n) 1)))
(define (rows)
(map row (iota (n) (a) (n))))
(define (columns)
(map column (iota (n))))
(define (diagonals)
(list (ac-diagonal)
(bd-diagonal)))
(define (tuplets)
(append (rows)
(columns)
(diagonals)))
(define (corners)
(list (a) (b) (c) (d)))
(define empty -1) ;; the mark for an empty space (identifier name assumed)
(define X 0)
(define O 1)
(define X?
(case-lambda
((mark) (= X mark))
((board space) (X? (board-ref board space)))))
(define O?
(case-lambda
((mark) (= O mark))
((board space) (O? (board-ref board space)))))
(define empty?
(case-lambda
((mark) (= empty mark))
((board space) (empty? (board-ref board space)))))
(vector-map
(lambda (i mark)
(cond ((X? mark) "X")
((O? mark) "O")
(else " ")))
board))))
(define (make-empty-board)
  ;; second argument assumed to be the fill value, i.e. the empty-space mark
  (make-board (n-by-n) empty))
(define shuffle
(case-lambda
((deck) (shuffle '() deck))
((shuffled-deck deck)
(if (null? deck)
shuffled-deck
(let ((pivot (random (length deck))))
(let ((left-partition (take deck pivot))
(right-partition (drop deck pivot)))
(shuffle (cons (car right-partition) shuffled-deck)
(append left-partition (cdr right-partition)))))))))
(define (make-random-board)
(let ((board (make-empty-board)))
(let iter ((moves (random (n-by-n)))
(indices (shuffle (iota (n-by-n)))))
(if (zero? moves)
board
(let ((mark (random (length indices))))
;; You may end up with a board where there are more Os
;; than Xs.
(vector-set! board
(car indices)
(if (even? moves) X O))
(iter (- moves 1) (cdr indices)))))))
(tuplets)))
;;; Putting tuplet first would allow you to use many boards.
(define (first-empty-space board tuplet)
(find (cute empty? board <>) tuplet))
(empty? board (center-space board)))
(lambda (board)
(random-empty-space board)))
;;; http://www.buzzle.com/articles/tic-tac-toe-strategy-guide.html
(define (make-heuristic-player player? player opponent? opponent)
(lambda (board)
(let ((my-winning-spaces (winning-spaces player? board)))
(if (null? my-winning-spaces)
(let ((losing-spaces (winning-spaces opponent? board)))
(if (null? losing-spaces)
(let ((my-forking-spaces (forking-spaces player? player board)))
(if (null? my-forking-spaces)
(let ((opponent-forking-spaces
(forking-spaces opponent? opponent board)))
(if (null? opponent-forking-spaces)
(if (center-empty? board)
(center-space board)
(let ((opposite-corners (opposite-corners player? board)))
(if (null? opposite-corners)
(let ((empty-corners (empty-corners board)))
(if (null? empty-corners)
(random-empty-space board)
(random-ref empty-corners)))
(random-ref opposite-corners))))
(random-ref opponent-forking-spaces)))
(random-ref my-forking-spaces)))
(random-ref losing-spaces)))
(random-ref my-winning-spaces)))))
(define (make-heuristic-X-player)
(make-heuristic-player X? X O? O))
(define (make-heuristic-O-player)
(make-heuristic-player O? O X? X))
(define make-default-X-player
(make-parameter make-heuristic-X-player))
(define make-default-O-player
(make-parameter make-heuristic-O-player))
(play (make-empty-board)))
((board)
(play ((make-default-X-player))
((make-default-O-player))
board))
((X-player O-player board)
(play 0 X-player O-player board))
((move X-player O-player board)
(debug move)
(display-board board)
(or (outcome board)
(let-values (((token player)
(if (even? move)
(values X X-player)
(values O O-player))))
(let ((next-move (player board)))
(board-set! board next-move token)
(play (+ move 1)
X-player
O-player
board)))))))
(use test
debug)
(include "tic-tac-toe.scm")
O)))
(test "forking-spaces"
'(6 2 1)
(forking-spaces X? X board)))
X )))
(test "opposite-corners"
'(8 2)
(opposite-corners X? board)))
(debug (play))
Instead of this heuristics-based approach, we could map out the entire game
tree from any arbitrary position; either in real time, or before: greedily picking
whichever position yields the most wins.
Maybe that’s better; should we plot it as an alternative to the heuristics-
based player: the deterministic player?
Each node has three values: wins for X, wins for O, draws; it may be possible
to do it functionally through memoization: let’s just ascend up the tree when
we’ve explored a child, however, and update the parent accordingly.
W.r.t. the heuristic player, by the way, can’t we get rid of space-marks
and rely on board-ref instead? Triplets (tuplets) are just lists of indices, then.
Good call.
We should have two graphs, by the way: training games vs. games won
against 1) the heuristic and 2) the deterministic player.
The heuristic player has the interesting property that it can continue from
illegal positions; the deterministic player would be confined to legal positions,
wouldn’t it?
Should we create directories for modules? Should the players be stateful or
stateless? If the e.g. deterministic player is stateless, we should probably analyse
the game off-line. We can instantiate the player with a token and predicate (i.e.
X and X? or O and O?); the player receives a game state and returns a space.
Some kind of intermediary updates the game space and, in theory, arbitrates
for legality, etc. Initially, though, we can trust the agents.
The deterministic player is going to have to turn the state into a tree if
we’re using something precomputed: but we don’t necessarily have the benefit
of history; we’d have to flatten the history by using an e.g. hash-table. Even
then, how do we distinguish among parents and look-alikes when calculating
win-draw-loss?
Let’s compute the tree in real time to see what sort of complexity we’re
dealing with; then we can think about optimizing. Since we have a stateful
player, let’s memoize by history, even though there are equivalent branches.
We might be able to use rotation to prune the terminal by a factor of ten.
(use
debug
srfi-9
srfi-69
)
(include "tic-tac-toe.scm")
(define-record move
parent
space
X-wins
O-wins
draws)
#;
(define-record-printer move
(lambda (move out)
(format out
"#<move space: ~a X-wins: ~a O-wins: ~a draws: ~a>"
(move-space move)
(move-X-wins move)
(move-O-wins move)
(move-draws move))))
(define-record-printer move
(lambda (move out)
(format out
"#,(move ~a ~a ~a ~a ~a)"
#f
(move-space move)
(move-X-wins move)
(move-O-wins move)
(move-draws move))))
(let ((board (board-copy board)))
(let ascend ((move move))
(if move
(begin
;; (display-board board)
(cond ((X-win? outcome)
(move-X-wins-set! move (+ 1 (move-X-wins move)))
(hash-table-update!/default
hash
(hash-board board)
(lambda (move)
(move-X-wins-set! move (+ 1 (move-X-wins move)))
move)
(make-move #f 0 0 0 0)))
((O-win? outcome)
(move-O-wins-set! move (+ 1 (move-O-wins move)))
(hash-table-update!/default
hash
(hash-board board)
(lambda (move)
(move-O-wins-set! move (+ 1 (move-O-wins move)))
move)
(make-move #f 0 0 0 0)))
(else
(move-draws-set! move (+ 1 (move-draws move)))
(hash-table-update!/default
hash
(hash-board board)
(lambda (move)
(move-draws-set! move (+ 1 (move-draws move)))
move)
(make-move #f 0 0 0 0))))
(board-set! board (move-space move) empty) ;; third argument assumed: clear the space while ascending
(ascend (move-parent move)))))))
;; Could just cap the tree here with a ’(), too.
;; outcome
'())
(let ((empty-spaces (empty-spaces board)))
(let-values (((player player?)
(if (odd? (length empty-spaces))
(values X X?)
(values O O?))))
(map (lambda (empty-space)
(let ((new-move (make-move move empty-space 0 0 0)))
(cons new-move
(let ((board (board-copy board)))
(board-set! board empty-space player)
(plumb hash new-move board)))))
empty-spaces))))))
#;
(let ((board (board X
O
O X))
(hash (make-hash-table)))
(time (plumb hash #f (make-empty-board)))
(debug (hash-table->alist hash)))
#;
(let ((board (make-empty-board)))
(with-output-to-file
"tic-tac-toe-game-tree.scm"
(lambda () (write (plumb #f board)))))
#;
(let ((board (make-empty-board)))
(with-output-to-file
"tic-tac-toe-hash-table.scm"
(lambda ()
(let ((hash (make-hash-table)))
(plumb hash #f board)
(write (hash-table->alist hash))))))
#;
(time (with-input-from-file
"tic-tac-toe-game-tree.scm"
read))
#;
(debug (time (with-input-from-file
"tic-tac-toe-hash-table.scm"
read)))
(cons -1 (make-move #f 0 +Inf +Inf 0))
possible-outcomes))
(max-win-ratio (fold (lambda (win-ratio max-win-ratio)
(if (> (cdr win-ratio)
(cdr max-win-ratio))
win-ratio
max-win-ratio))
(cons -1 0)
(map (lambda (possible-outcome)
(cons (car possible-outcome)
(let ((denominator
       (+ (move-opponent-wins (cdr possible-outcome))
          (move-player-wins (cdr possible-outcome))
          (move-draws (cdr possible-outcome)))))
  (if (zero? denominator)
      +Inf
      (/ (move-player-wins (cdr possible-outcome))
         denominator)))))
possible-outcomes))))
(debug possible-outcomes
;; This is an alternative, not taking draws into
;; consideration.
(map (lambda (possible-outcome)
(cons* (car possible-outcome)
(let ((denominator
(+ (move-opponent-wins (cdr possible-outcome))
(move-player-wins (cdr possible-outcome))
(move-draws (cdr possible-outcome)))))
(if (zero? denominator)
+Inf
(/ (move-player-wins (cdr possible-outcome))
denominator)))
(let ((denominator
(+ (move-opponent-wins (cdr possible-outcome))
(move-player-wins (cdr possible-outcome))
(move-draws (cdr possible-outcome)))))
(if (zero? denominator)
+Inf
(/ (move-opponent-wins (cdr possible-outcome))
denominator)))
(let ((denominator
(+ (move-opponent-wins (cdr possible-outcome))
(move-player-wins (cdr possible-outcome)))))
(if (zero? denominator)
+Inf
(/ (move-player-wins (cdr possible-outcome))
denominator)))))
possible-outcomes)
max-win-ratio
max-wins
max-draws
min-losses)
(if (zero? (move-player-wins (cdr max-wins)))
(if (zero? (move-draws (cdr max-draws)))
(car min-losses)
(car max-draws))
#;(car max-wins)
(car max-win-ratio))
(car min-losses)))))
(define game-hash
(make-parameter (alist->hash-table
(with-input-from-file
"tic-tac-toe-hash-table.scm"
read))))
(define (make-deterministic-X-player)
(make-deterministic-player (game-hash)
X?
move-X-wins
X
O?
move-O-wins
O))
(define (make-deterministic-O-player)
(make-deterministic-player (game-hash)
O?
move-O-wins
O
X?
move-X-wins
X))
(debug
(play
(make-deterministic-X-player)
(make-heuristic-O-player)
(make-empty-board))
(play
(make-heuristic-X-player)
(make-deterministic-O-player)
(make-empty-board)))
Deterministic TTT will look like a tree with complete games as leaves; the
tree is not complete. Eventually: every node should have summary statistics:
wins, draws, losses. Moving algorithm: maximize wins; otherwise, maximize
draws; otherwise, minimize losses.
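Here’s a minimal sketch of that moving algorithm; each candidate is assumed to
be a pair (space . (wins draws losses)), which is just a shape for the sketch,
not the move record defined above:
;; Maximize wins; on a tie, maximize draws; on a further tie, minimize
;; losses.  a and b are (wins draws losses) lists.
(define (better-outcome? a b)
  (cond ((not (= (car a) (car b))) (> (car a) (car b)))
        ((not (= (cadr a) (cadr b))) (> (cadr a) (cadr b)))
        (else (< (caddr a) (caddr b)))))
(define (best-move candidates)
  (fold (lambda (candidate best)
          (if (better-outcome? (cdr candidate) (cdr best)) candidate best))
        (car candidates)
        (cdr candidates)))
;; e.g. (best-move '((4 . (10 3 2)) (0 . (10 5 1)) (8 . (7 9 0)))) => (0 10 5 1)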
Each node will contain: square, wins, draws, losses. The player moving is
implicit: it depends upon the root and alternates. Root is always X, though.
Can we implement the game tree as some kind of priority queue? It will,
nevertheless, be a tree of queues.
This is an interesting question, actually: is wins, draws, losses the correct
order?
The deterministic player, by the way, is going to have to keep track of the
game history; or plumb from the current board. We can also prune an initial
game tree with the current move.
Beware: it doesn’t look like opposite corner is working, by the way; nor fork,
for that matter.
Here’s an interesting solution to the rotation/reflection problem, by the way:
reduce before hashing!
State of the union: heuristic doesn’t seem to recognize e.g. forks or opposite
corners; deterministic doesn’t play well.
For deterministic, let’s try minimax; we may need to revise our position
evaluation function, even going back to trees. Interestingly, we can do a basic
win/loss calculus with +1 and −1 (how to deal with draws: 0? 0.5?);
See minimax with alternate moves and alpha-beta pruning; also, principal
variation. Negamax is for two-player, zero-sum games, by the way.
This is interesting:
The nodes that belong to the MAX player receive the maximun
value of its [sic] children. The nodes for the MIN player will select
the minimun value of its [sic] children.
For a win-or-lose game like chess or tic-tac-toe, there are only three
possible values-win, lose, or draw (often assigned numeric values of
1, -1, and 0 respectively).
As veteran tic-tac-toe players know, there is no opening move which
guarantees victory; so, a computer running a minimax algorithm
without any sort of enhancements will discover that, if both it and
its opponent play optimally, the game will end in a draw no matter
where it starts, and thus have no clue as to which opening play is
the “best.”
Yeah, naïve minimax is worthless for tic-tac-toe; you have to rely on heuris-
tics. Maybe we can repurpose our win-fork-center-corner-side heuristics for min-
imax.
On using the number of winning or losing boards:
First, you can prune the tree: any strategy that doesn’t include a
potential path for you to win, you can just eliminate from considera-
tion: it has a utility of 0. For other moves, you look at the potential
paths: if move A leads into a subtree where 72% of the paths lead to
a leaf in which you win, then the utility of A is 0.72. Each turn, the
best strategy is the one with the highest utility; moves with equal
utility are equivalent.
(Gradient descent, by the way, is one of the “aha!” things that seems obvious
after the fact.)
What about this comment?
“If move A leads into a subtree where 72% of the paths lead to a
leaf in which you win, then the utility of A is 0.72.”
No . . . if any of the moves available to the next player are a win for
him, then that move is a loss for you. If all of the moves available
to the next player are a loss for that player, then it’s a win for you.
TicTackToe is uninteresting because there’s no percentage utility
about it.
We did implement it as a game tree, then as a hash; but we’re not taking
symmetries into account and maybe we should. It could just be a matter of
normalizing the board before we hash it or perform lookups.
The rotations of the board (012 345 678), by the way:
It seems like if you imagine a tenth square exists, and take mod 3; you get
the rotations. The (n² + 1)th square also becomes a sentinel to stop rotating.
The symmetries seem to happen by reversing triplets and swapping rows,
respectively:
Naïvely, you could test the rotations and symmetries against the hash tree
until you found a match; otherwise, add the hash. Lookup is similarly expensive,
though.
We could hash each board as an 18-bit integer (it fits comfortably in 32 bits)
instead of a string, by the way: bits 0 through 8 are the X positions; bits 9
through 17, the O positions.
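Here’s a minimal sketch of that packing, reusing n-by-n and the X?/O?
predicates from above; pack-board is a new name for the sketch, not the
string-based hash-board:
;; Bit i is set when space i holds an X; bit (i + 9) when it holds an O;
;; empty spaces contribute nothing.  The board is a vector of marks.
(define (pack-board board)
  (let loop ((i 0) (hash 0))
    (if (= i (n-by-n))
        hash
        (loop (+ i 1)
              (cond ((X? (vector-ref board i)) (+ hash (expt 2 i)))
                    ((O? (vector-ref board i)) (+ hash (expt 2 (+ i (n-by-n)))))
                    (else hash))))))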
The win ratio does yield different results than the max analysis; give it a
shot?
Just to confirm our intuition:
The real problem with this algorithm (and where you may be getting
results that you aren’t expecting) is when all sub-trees result in
possible losses. At that point, you will need to use a heuristic to
get any better information on which move you should take. You will
need something better than simply {-1, 0, 1}, because some moves
could allow you to win, but you’d block them out because you could
also lose.
Also:
This behavior is quite easy to implement in min/max using a decay
for each recursion. I.e. whenever you return something from a recur-
sive call multiply the result by 0.9 or something like this. This will
lead to higher scores for longer negative paths and smaller scores for
longer positive paths.
If we do ratio of wins to wins, draws, losses; and wins are zero, we can’t
distinguish states anymore. At that point: ratio of draws to draws, losses; after
which: minimize losses?
This guy successfully applied a heuristic to minimax, by the way:
• For each row, if there are both X and O, then the score for the
row is 0.
• If the whole row is empty, then the score is 1.
• If there is only one X, then the score is 10.
• If there are two Xs, then the score is 100.
• If there are 3 Xs, then the score is 1000, and the winner is
Player X.
• For Player O, the score is negative.
• Player X tries to maximize the score.
• Player O tries to minimize the score.
• If the current turn is for Player X, then the score of Player X
has more advantage. I gave the advantage rate as 3.
Doesn’t mention anything about forks, etc.
Let’s abandon minimax, since we have a heuristic player: we’d have to inte-
grate the heuristic with minimax, anyway; at which point minimax is superflu-
ous.
Let’s package tic-tac-toe as a module with tests, too; that way the dependencies
can be pulled with chicken-install.
with-debug-level in debug would be nice, too.
I’d like to take the opportunity to have the tic-tac-toe module conform to
Riastradh’s style rules, too.
One simple approach has been found to be surprisingly successful:
assign the training value of Vtrain (b) for any intermediate board state
b to be V̂ (Successor(b)), where Successor(b) denotes the next board
state following b for which it is again the program’s turn to move:
Not utterly different from minimax, then, in the sense that we’re going to
have to descend to the leaves and propagate up; aren’t we?
(If features qua board-position is insufficient, by the way, we might have to
come up with features based on our heuristics.)
Problem even says, by the way, to “use a fixed evaluation function you create
by hand;” which doesn’t rule out minimax, of course.
We’ll rate a given board on the basis of: wins, losses, forks, opponent forks,
center, corners, sides.
Assign the training value of Vtrain (b) for any intermediate board
state b to be V̂ (Successor(b)), where V̂ is the learner’s current ap-
proximation to V and where Successor(b) denotes the next board
state following b for which it is again the program’s turn to move:
Performance system takes V̂ , evaluates the possible boards, selects the best
one; repeat. Produces a game history. The critic takes the trace, produces
⟨b, Vtrain (b)⟩. The generalizer takes the training examples and the hypothesis;
updates the hypothesis accordingly. The experiment generator takes the current
hypothesis and outputs a new problem for the performance system to explore.
Let’s say: blank board vs. heuristic player.
Vtrain from V̂ (Successor(b)) is still a little fuzzy: do we trace to an endpoint,
and propagate up? And what’s the initial state of the algorithm: some guess;
or ones for all the weights?
(use srfi-95)
(include "tic-tac-toe.scm")
;;; Let’s output our solutions. Can we have a =fold-game=, by the way,
;;; so I don’t have to rely on mutation to keep track of state?
;;; =fold-game= is a =play= with an explicit accumulator. It might
;;; need to return both the accumulation and the outcome.
;;; Scratch =fold-game= for the time being; we have to keep track of
;;; state and move for each player: messy and unnecessary. Let’s
;;; mutate.
(define play
(case-lambda
(()
(play (make-empty-board)))
((board)
(play ((make-default-X-player))
((make-default-O-player))
board))
((X-player O-player board)
(play 0 (list board) X-player O-player board))
((move history X-player O-player board)
;; (debug move)
;; (display-board board)
(let ((outcome (outcome board)))
(if outcome
(values history outcome)
(let-values (((token player)
(if (even? move)
(values X X-player)
(values O O-player))))
(let ((next-move (player board))
(board (board-copy board)))
(board-set! board next-move token)
(play (+ move 1)
(cons board history)
X-player
O-player
board))))))))
(let-values
(((history outcome)
(perform (lambda (board) 1)
(make-heuristic-O-player)
(make-empty-board))))
(display-board (car history))
(debug history outcome))
The performer will evaluate according to: wins, losses, my forks, their forks,
center, corners, sides.
In addition to “subjunctive” features like winning-spaces, forking-spaces,
etc.; we need indicative (i.e. enumerative) ones like: wins, forks, center?,
corners, sides. They can employ slightly simplified logic, since we don’t care
about the actual spaces and we don’t need to enumerate possible moves. It
would be nice to unify them, however, with the subjunctive features; in which
case, they merely employ enumeration of moves and indication of features.
Or, fuck it: let’s just length on the space-enumerative features. Do we need
to distinguish between I-have-the-center vs. other-has-the-center? Do we need
to distinguish between corners and opposite corners?
Let’s implement the learning system with one feature: wins; once we have
the mechanics down, we can implement other features.
It seems like we recurse indeed to the leaves and propagate up with one
quantum of data: win or loss.
What’s the point of having the game trace: what do we do with the
⟨b, Vtrain(b)⟩ pairs?
Ah: when it comes to adjusting the weights, the board-state is finally rel-
evant; unexpressed features will not weigh in. The Critic, furthermore, is re-
sponsible for the win → +100, loss → −100, draw → 0 heuristics.
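Here’s a minimal sketch of that Critic; it assumes the learner plays X and
moves first, that v-hat is the current hypothesis as a procedure from boards
to reals, and that the history is a list of boards from newest to oldest (as
play accumulates it above); X-win? and O-win? are the predicates used in the
game-tree code:
;; Turn a game trace into <b, Vtrain(b)> pairs: the final board gets
;; +100/-100/0; every earlier board b on which it was the learner’s turn
;; gets Vhat(Successor(b)), i.e. Vhat of the board two plies later.
(define (critique v-hat history outcome)
  (let ((boards (reverse history))             ; oldest board first
        (final-value (cond ((X-win? outcome) 100)
                           ((O-win? outcome) -100)
                           (else 0))))
    (let loop ((boards boards)
               (examples (list (cons (car history) final-value))))
      (if (or (null? boards) (null? (cdr boards)) (null? (cddr boards)))
          (reverse examples)
          (loop (cddr boards)
                (cons (cons (car boards) (v-hat (caddr boards)))
                      examples))))))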