Lecture 13
9.66

Class announcements

• Recitations Th, F 4 PM – 46-3189
  – This week: Review of Basic Bayes
• PSet 1 out today, due Oct 3. Other psets due approximately every two weeks thereafter.
• Classes next week are virtual: We will have a guest lecture from Vikash Mansinghka on Thursday that you can watch asynchronously, and I may give one virtual lecture (depending on where we end up today).
Plan for today

• Basic Bayesian cognition
  – The number game

The number game

60:           Diffuse similarity
60 80 10 30:  Rule: “multiples of 10”
60 52 57 55:  Focused similarity: numbers near 50-60
Main phenomena to explain:
– Generalization can appear either similarity-based (graded) or rule-based (all-or-none).
– Learning from just a few positive examples.
A single unifying account of (number) concept learning?

• We’re going to use this to introduce Bayesian approaches, but first consider ...
  – The “naïve programmer” approach?
  – The “modern neural network” approach?
Traditional (algorithmic level) cognitive models

• Multiple representational systems: rules and similarity
  – Categorization, language (past tense), reasoning
• Questions this leaves open:
  – How does each system work? How far, and in what ways, does it generalize as a function of the examples observed?
• Which rule to choose?
– E.g., X = {60, 80, 10, 30}: multiples of 10 vs. even numbers?
• Which similarity metric?
– E.g., X = {60, 53} vs. {60, 20}?
– Why these two systems?
– When and why does a learner switch between them?
Reverse-engineering a cognitive system:
Marr’s three levels

• Level 1: Computational theory


– What are the inputs and outputs to the computation,
what is its goal, and what is the logic by which it is
carried out?
• Level 2: Representation and algorithm
– How is information represented and processed to
achieve the computational goal?
• Level 3: Hardware implementation
– How is the computation realized in physical or
biological hardware?
Bayesian model
• H: Hypothesis space of possible concepts:
– h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”)
– h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”)
– h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”)
– h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”)
– ...

• Representational interpretations for H:
  – Candidate rules
  – Features for similarity
  – “Consequential subsets” (Shepard, 1987)
Three hypothesis subspaces for number
concepts
• Mathematical properties (24 hypotheses):
– Odd, even, square, cube, prime numbers
– Multiples of small integers
– Powers of small integers
• Raw magnitude (5050 hypotheses):
– All intervals of integers with endpoints between 1 and
100.
• Approximate magnitude (10 hypotheses):
– Decades (1-10, 10-20, 20-30, …)
Bayesian model

• H: Hypothesis space of possible concepts:
  – Mathematical properties: even, odd, square, prime, ...
  – Approximate magnitude: {1-10}, {10-20}, {20-30}, ...
  – Raw magnitude: all intervals between 1 and 100.
• X = {x1, ..., xn}: n examples of a concept C.
• Evaluate hypotheses given data:

  p(h|X) = p(X|h) p(h) / Σ_{h′ ∈ H} p(X|h′) p(h′)

  – p(h) [“prior”]: domain knowledge, pre-existing biases
  – p(X|h) [“likelihood”]: statistical information in examples.
  – p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases.

  p(X|h) = [1 / size(h)]^n   if x1, ..., xn ∈ h
         = 0                 if any xi ∉ h

• Captures the intuition of a “representative” sample, versus a “suspicious coincidence”.
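The size principle can be sketched in a few lines of code. This is a minimal illustration (not code from the lecture), using the slide's two example hypotheses:

```python
# Minimal sketch of the size-principle likelihood:
# p(X|h) = (1/size(h))^n if every example falls in h, else 0.

def likelihood(X, h):
    """Size-principle likelihood of example set X under hypothesis h (a set)."""
    if all(x in h for x in X):
        return (1.0 / len(h)) ** len(X)
    return 0.0

evens = set(range(2, 101, 2))           # "even numbers": 50 members
mults_of_10 = set(range(10, 101, 10))   # "multiples of 10": 10 members

X = [60, 80, 10, 30]
# The smaller hypothesis earns exponentially more likelihood as n grows:
print(likelihood(X, mults_of_10))        # (1/10)^4 ≈ 1e-4
print(likelihood(X, evens))              # (1/50)^4 ≈ 1.6e-7
print(likelihood([60, 55], mults_of_10)) # 0.0: 55 is not a multiple of 10
```

With four examples, “multiples of 10” is already preferred over “even numbers” by a factor of 625.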
Illustrating the size principle

[Figure: the even numbers 2–100 arranged in a grid, with h1 (even numbers) and h2 (multiples of 10) marked as nested regions.]
Illustrating the size principle

[Figure: the same grid, with a few observed examples marked.]

Data slightly more of a coincidence under h1


Illustrating the size principle

[Figure: the same grid, with more observed examples marked.]

Data much more of a coincidence under h1


Likelihood: p(X|h)

• Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases.

  p(X|h) = [1 / size(h)]^n   if x1, ..., xn ∈ h
         = 0                 if any xi ∉ h

• Captures the intuition of a “representative” sample, versus a “suspicious coincidence”.
• A special case of the law of “conservation of belief”:

  Σ_x p(X = x | Y = y) = 1
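This normalization can be checked numerically for the size-principle likelihood; a quick sketch (not code from the lecture):

```python
# Sanity check of "conservation of belief": summed over every possible
# n-example ordered sample drawn from h, the probabilities add to 1.
from itertools import product

def likelihood(X, h):
    if all(x in h for x in X):
        return (1.0 / len(h)) ** len(X)
    return 0.0

h = set(range(10, 101, 10))  # "multiples of 10", size 10
n = 3                        # 10^3 equally likely ordered samples
total = sum(likelihood(X, h) for X in product(h, repeat=n))
print(total)  # 1.0, up to floating-point rounding
```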
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Do we need this? Why not allow all logically possible
hypotheses, with uniform priors, and let the data sort
them out (via the likelihood)?
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.

e.g., X = {60, 80, 10, 30}:

  p(X | multiples of 10) = [1/10]^4 = 0.0001
  p(X | multiples of 10 except 50, 70) = [1/8]^4 ≈ 0.00024
Posterior:

  p(h|X) = p(X|h) p(h) / Σ_{h′ ∈ H} p(X|h′) p(h′)

• X = {60, 80, 10, 30}


• Why prefer “multiples of 10” over “even
numbers”? p(X|h).
• Why prefer “multiples of 10” over “multiples of
10 except 50 and 70”? p(h).
• Why does a good generalization need both high prior and high likelihood? p(h|X) ∝ p(X|h) p(h)
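A toy posterior computation puts these pieces together. This is a sketch: the prior values below are made up for illustration, not the lecture's numbers.

```python
# Toy posterior over three hypotheses for X = {60, 80, 10, 30}.

def likelihood(X, h):
    return (1.0 / len(h)) ** len(X) if all(x in h for x in X) else 0.0

hypotheses = {
    "even numbers": set(range(2, 101, 2)),
    "multiples of 10": set(range(10, 101, 10)),
    "multiples of 10 except 50, 70": set(range(10, 101, 10)) - {50, 70},
}
# Conceptually unnatural hypotheses get a very small prior (illustrative values):
prior = {"even numbers": 0.495,
         "multiples of 10": 0.495,
         "multiples of 10 except 50, 70": 0.01}

X = [60, 80, 10, 30]
unnorm = {name: likelihood(X, h) * prior[name] for name, h in hypotheses.items()}
Z = sum(unnorm.values())
posterior = {name: v / Z for name, v in unnorm.items()}

# "multiples of 10" wins: higher likelihood than "even numbers",
# higher prior than the exception-laden variant.
best = max(posterior, key=posterior.get)
print(best)  # multiples of 10
```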
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.
• p(h) encodes relative weights of alternative theories:

  H: Total hypothesis space
    p(H1) = 1/5    p(H2) = 3/5    p(H3) = 1/5

  H1: Math properties (24)      H2: Raw magnitude (5050)   H3: Approx. magnitude (10)
    • even numbers                • 10-15                    • 10-20
    • powers of two               • 20-32                    • 20-30
    • multiples of three          • 37-54                    • 30-40
    ...                           ...                        ...
    p(h) = p(H1) / 24             p(h) = p(H2) / 5050        p(h) = p(H3) / 10
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.
• p(h) encodes relative plausibility of alternative theories:
– Mathematical properties: p(h) ~ 1/120
– Approximate magnitude: p(h) ~ 1/50
– Raw magnitude: p(h) ~ 1/8500 (on average)

• Also degrees of plausibility within a theory, e.g., for magnitude intervals of size s:

  [Figure: prior p(s) as a function of interval size s.]
Generalizing to new objects

From hypotheses to predictions: How do we compute the probability that C applies to some new object y, given the posterior p(h|X)?
Hypothesis averaging

In general, we have the law of total probability:

  p(A = a) = Σ_z p(A = a | Z = z) p(Z = z)

  p(A = a | B = b) = Σ_z p(A = a | Z = z, B = b) p(Z = z | B = b)

…especially useful if A and B are independent conditioned on Z:

  p(A = a | B = b) = Σ_z p(A = a | Z = z) p(Z = z | B = b)
Another example: what is the probability that the Republican will win the election, given that the weatherman predicts rain?

  p(Republican wins | Weather report: “Rain storm”)
    = Σ_{w ∈ weather conditions} p(Republican wins | W = w) p(W = w | Weather report: “Rain storm”)
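Numerically, the averaging looks like this; all probabilities below are invented purely for illustration:

```python
# Law of total probability with conditional independence: given the
# actual weather W, the election outcome is assumed independent of the
# forecast. All numbers are made up.
p_win_given_w = {"rain": 0.7, "no rain": 0.4}     # p(win | W = w)
p_w_given_report = {"rain": 0.8, "no rain": 0.2}  # p(W = w | report "rain storm")

p_win = sum(p_win_given_w[w] * p_w_given_report[w] for w in p_win_given_w)
print(p_win)  # 0.7*0.8 + 0.4*0.2 = 0.64
```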
Generalizing to new objects

Hypothesis averaging: Compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h|X):

  p(y ∈ C | X) = Σ_{h ∈ H} p(y ∈ C | h) p(h|X)

where p(y ∈ C | h) = 1 if y ∈ h, and 0 if y ∉ h, so

  p(y ∈ C | X) = Σ_{h ⊇ {y, X}} p(h|X)
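Hypothesis averaging can be sketched directly from the formula. This toy version uses a tiny hypothesis space with a uniform prior, not the lecture's full space:

```python
# p(y in C | X): average each hypothesis's 0/1 prediction, weighted by
# its posterior (uniform prior, size-principle likelihood). Toy sketch.

def likelihood(X, h):
    return (1.0 / len(h)) ** len(X) if all(x in h for x in X) else 0.0

def p_in_concept(y, X, hypotheses):
    weights = {name: likelihood(X, h) for name, h in hypotheses.items()}
    Z = sum(weights.values())
    return sum(w / Z for name, w in weights.items() if y in hypotheses[name])

hypotheses = {
    "even numbers": set(range(2, 101, 2)),
    "multiples of 5": set(range(5, 101, 5)),
    "multiples of 10": set(range(10, 101, 10)),
}
X = [60, 80, 10, 30]
print(p_in_concept(20, X, hypotheses))  # ~1.0: every consistent h contains 20
print(p_in_concept(84, X, hypotheses))  # small: only "even numbers" contains 84
```

The graded output for 84 versus the near-certain output for 20 previews how averaging produces similarity-like gradients.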
Examples: {16}

Examples: {16, 8, 2, 64}

Examples: {16, 23, 19, 20}
[Figure: human generalization vs. Bayesian model predictions for the example sets {60}, {60, 80, 10, 30}, {60, 52, 57, 55}, {16}, {16, 8, 2, 64}, and {16, 23, 19, 20}.]
Summary of the Bayesian model

• How do the statistics of the examples interact with prior knowledge to guide generalization?

  posterior ∝ likelihood × prior

• Why does generalization appear rule-based or similarity-based?

  hypothesis averaging + size principle
  broad p(h|X): similarity gradient
  narrow p(h|X): all-or-none rule
Summary of the Bayesian model

• How do the statistics of the examples interact with prior knowledge to guide generalization?

  posterior ∝ likelihood × prior

• Why does generalization appear rule-based or similarity-based?

  hypothesis averaging + size principle
  broad p(h|X): many h of similar size, or very few examples (i.e., 1)
  narrow p(h|X): one h much smaller
Model variants

1. Bayes with weak sampling (drops the size principle, keeps hypothesis averaging):

  “Weak sampling”:  p(X|h) ∝ 1   if x1, ..., xn ∈ h
                            = 0   if any xi ∉ h

2. Maximum a posteriori (MAP) / maximum likelihood / subset principle (keeps the size principle, drops hypothesis averaging):

  p(y ∈ C | X) = 1 if y ∈ h*,  where h* = argmax_{h ∈ H} p(h|X)
               = 0 if y ∉ h*
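Each lesioned variant differs from full Bayes in one line. A sketch on a toy two-hypothesis space with a uniform prior (not the lecture's implementation):

```python
# Full Bayes vs. the two lesioned variants.
# map_only=True drops hypothesis averaging (subset principle);
# use_size_principle=False drops the size principle (weak sampling).

def generalize(y, X, hypotheses, use_size_principle=True, map_only=False):
    def lik(h):
        if any(x not in h for x in X):
            return 0.0
        return (1.0 / len(h)) ** len(X) if use_size_principle else 1.0
    weights = {name: lik(h) for name, h in hypotheses.items()}
    if map_only:
        best = max(weights, key=weights.get)  # h* = argmax p(h|X)
        return 1.0 if y in hypotheses[best] else 0.0
    Z = sum(weights.values())
    return sum(w / Z for name, w in weights.items() if y in hypotheses[name])

hypotheses = {"even numbers": set(range(2, 101, 2)),
              "multiples of 10": set(range(10, 101, 10))}
X = [60, 80, 10, 30]
print(generalize(84, X, hypotheses))                            # graded, small
print(generalize(84, X, hypotheses, map_only=True))             # all-or-none: 0.0
print(generalize(84, X, hypotheses, use_size_principle=False))  # 0.5: no size preference
```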
[Figure: human generalization compared with the full Bayesian model, Bayes with weak sampling (no size principle), and maximum a posteriori (MAP) / subset principle (no hypothesis averaging).]
Taking stock

• A model of high-level, knowledge-driven inductive reasoning that makes strong quantitative predictions with minimal free parameters.
  (r² > 0.9 for mean judgments on 180 generalization stimuli, with 3 free numerical parameters)
• Explains qualitatively different patterns of generalization
(rules, similarity) as the output of a single general-purpose
rational inference engine.
– Marr level 1 (Computational theory) explanation of phenomena that
have traditionally been treated only at Marr level 2 (Representation
and algorithm).
Looking forward
• Can we see these ideas at work in more natural cognitive
function, not just toy problems and games?
– What differently structured hypothesis spaces, likelihood functions, or priors might be needed?
• Can we move from ‘weak rational analysis’ to ‘strong
rational analysis’ in the priors, as with the likelihood?
– “Weak”: behavior consistent with some reasonable prior.
– “Strong”: behavior consistent with the “correct” prior given the
structure of the world.
• Can we work with more flexible priors, not just restricted to
a small subset of all logically possible concepts?
– Would like to be able to learn any concept, even very complex ones,
given enough data (a non-dogmatic prior).
• Can we describe formally how these hypothesis spaces and
priors are generated by abstract knowledge or theories?
• Can we explain how people learn these rich priors?
Learning more natural concepts
[Figure: three example images labeled “horse” and three labeled “tufa”, illustrating word learning from a few positive examples.]
Learning rectangle concepts

Weighting different rectangle hypotheses based on the size principle:

  p(X|h) = [1 / size(h)]^n   if x1, ..., xn ∈ h
         = 0                 if any xi ∉ h
Generalization gradients

[Figure: gradients under full Bayes, the subset principle (MAP Bayes), and Bayes without the size principle (0/1 likelihoods).]
Modeling word learning (Xu & Tenenbaum, 2007)
[Figure: children’s generalizations compared with Bayesian concept learning over a tree-structured hypothesis space.]
Exploring different models

• Priors and likelihoods here were derived from simple assumptions. What about more complex cases?
• Different likelihoods?
– Suppose the examples are sampled by a different process,
such as active learning, or active pedagogy.

• Different priors?
– More complex language-like hypothesis spaces, allowing
exceptions, compound concepts, and much more…
