Bayesian Networks (Part I) : 10-601 Introduction To Machine Learning
Bayesian Networks
(Part I)
Matt Gormley
Lecture 22
April 10, 2017

Graphical Model Readings:
Murphy 10 – 10.2.1
Bishop 8.1, 8.2.2
HTF --
Mitchell 6.11
1
Reminders
• Peer Tutoring
• Homework 7: Deep Learning
– Release: Wed, Apr. 05
– Part I due Wed, Apr. 12 (Start Early)
– Part II due Mon, Apr. 17
2
CONVOLUTIONAL NEURAL NETS
3
Deep Learning Outline
• Background: Computer Vision
– Image Classification
– ILSVRC 2010–2016
– Traditional Feature Extraction Methods
– Convolution as Feature Extraction
• Convolutional Neural Networks (CNNs)
– Learning Feature Abstractions
– Common CNN Layers:
• Convolutional Layer
• Max-Pooling Layer
• Fully-connected Layer (w/ tensor input)
• Softmax Layer
• ReLU Layer
– Background: Subgradient
– Architecture: LeNet
– Architecture: AlexNet
• Training a CNN
– SGD for CNNs
– Backpropagation for CNNs
4
Convolutional Neural Network (CNN)
• Typical layers include:
– Convolutional layer
– Max-pooling layer
– Fully-connected (Linear) layer
– ReLU layer (or some other nonlinear activation function)
– Softmax
• These can be arranged into arbitrarily deep topologies
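For concreteness, here is a minimal sketch (not from the lecture, and assuming PyTorch, which the slides do not name) of how these layer types can be stacked; the channel counts, kernel sizes, and the 10-class output are illustrative choices.

```python
import torch
import torch.nn as nn

# Illustrative LeNet-style stack over 1x28x28 inputs; all sizes are arbitrary choices.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5),  # convolutional layer
    nn.ReLU(),                                                # nonlinear activation
    nn.MaxPool2d(kernel_size=2),                              # max-pooling layer
    nn.Conv2d(6, 16, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),                                             # tensor input -> vector
    nn.Linear(16 * 4 * 4, 10),                                # fully-connected layer
)

x = torch.randn(8, 1, 28, 28)           # a batch of 8 grayscale 28x28 images
scores = model(x)                       # unnormalized class scores, shape (8, 10)
probs = torch.softmax(scores, dim=1)    # softmax layer over the 10 classes
```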
5
Convolutional Layer
CNN key idea:
Treat convolution matrix as
parameters and learn them!
Input Image (7x7):
0 0 0 0 0 0 0
0 1 1 1 1 1 0
0 1 0 0 1 0 0
0 1 0 1 0 0 0
0 1 1 0 0 0 0
0 1 0 0 0 0 0
0 0 0 0 0 0 0

Learned Convolution (3x3):
θ11 θ12 θ13
θ21 θ22 θ23
θ31 θ32 θ33

Convolved Image (5x5):
.4 .5 .5 .5 .4
.4 .2 .3 .6 .3
.5 .4 .4 .2 .1
.5 .6 .2 .1 0
.4 .3 .1 0 0
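A minimal NumPy sketch of this operation (an illustration, not the lecture's code); the θ values below are placeholders rather than learned weights, so the output values will not match the convolved image above, only its 5x5 shape.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution as used in CNNs (i.e. cross-correlation):
    slide the kernel over the image and take a weighted sum at each position."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.array([[0,0,0,0,0,0,0],
                  [0,1,1,1,1,1,0],
                  [0,1,0,0,1,0,0],
                  [0,1,0,1,0,0,0],
                  [0,1,1,0,0,0,0],
                  [0,1,0,0,0,0,0],
                  [0,0,0,0,0,0,0]], dtype=float)

theta = np.full((3, 3), 1.0 / 9.0)   # placeholder weights; in a CNN these are learned
print(conv2d(image, theta).shape)    # (5, 5), matching the convolved image size above
```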
6
Downsampling by Averaging
• Downsampling by averaging used to be a common approach
• This is a special case of convolution where the weights are fixed to a
uniform distribution
• The example below uses a stride of 2
Input Image (6x6):
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0

Convolution (fixed uniform 2x2 weights):
1/4 1/4
1/4 1/4

Convolved Image (3x3):
3/4 3/4 1/4
3/4 1/4 0
1/4 0 0
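A small NumPy sketch of the same idea, assuming a 2x2 window with stride 2 as in the example; on the 6x6 input above it reproduces the 3x3 convolved image shown.

```python
import numpy as np

def downsample_by_averaging(image, size=2, stride=2):
    """Average pooling: a convolution whose weights are fixed to a uniform
    distribution (1/4 per cell of the 2x2 window here), applied with stride 2."""
    out = np.zeros((image.shape[0] // stride, image.shape[1] // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = image[i*stride:i*stride + size, j*stride:j*stride + size].mean()
    return out

image = np.array([[1,1,1,1,1,0],
                  [1,0,0,1,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,0,0,0],
                  [1,0,0,0,0,0],
                  [0,0,0,0,0,0]], dtype=float)

print(downsample_by_averaging(image))
# [[0.75 0.75 0.25]
#  [0.75 0.25 0.  ]
#  [0.25 0.   0.  ]]
```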
7
Max-Pooling
• Max-pooling is another (common) form of downsampling
• Instead of averaging, we take the max value within the same range as the equivalently-sized convolution
• The example below uses a stride of 2
Input Image (6x6):
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0

Max-pooling (take the max over each 2x2 window):
xi,j    xi,j+1
xi+1,j  xi+1,j+1

Max-Pooled Image (3x3):
1 1 1
1 1 0
1 0 0
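The same sketch with max in place of the mean (again assuming a 2x2 window and stride 2) reproduces the max-pooled image above.

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    """Max-pooling: same windows as the averaging example, but keep the
    maximum value in each window instead of the mean."""
    out = np.zeros((image.shape[0] // stride, image.shape[1] // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = image[i*stride:i*stride + size, j*stride:j*stride + size].max()
    return out

image = np.array([[1,1,1,1,1,0],
                  [1,0,0,1,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,0,0,0],
                  [1,0,0,0,0,0],
                  [0,0,0,0,0,0]], dtype=float)

print(max_pool(image))
# [[1. 1. 1.]
#  [1. 1. 0.]
#  [1. 0. 0.]]
```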
8
Multi-Class Output

[Figure: feed-forward network with an input layer, one hidden layer, and a multi-class output layer]
10
Multi-Class Output

(F) Loss: J = Σ_{k=1}^K y*_k log(y_k), where y* is the true (one-hot) label

Softmax Layer: y_k = exp(b_k) / Σ_{l=1}^K exp(b_l)

(D) Output (linear): b_k = Σ_{j=0}^D β_{kj} z_j, ∀k

(C) Hidden (nonlinear): z_j = σ(a_j), ∀j

(B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j

(A) Input: Given x_i, ∀i
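A minimal NumPy sketch of the forward computation (A)-(F) above, assuming a sigmoid for σ, one-hot true labels y*, and random placeholder weights α and β.

```python
import numpy as np

def forward(x, alpha, beta, y_star):
    """Forward pass for layers (A)-(F). x: length-M input; alpha: (J, M+1) hidden
    weights; beta: (K, J+1) output weights; y_star: one-hot true label.
    A constant 1 is prepended to handle the i=0 / j=0 bias terms."""
    a = alpha @ np.concatenate(([1.0], x))      # (B) hidden, linear
    z = 1.0 / (1.0 + np.exp(-a))                # (C) hidden, nonlinear (sigmoid)
    b = beta @ np.concatenate(([1.0], z))       # (D) output, linear
    y = np.exp(b) / np.sum(np.exp(b))           # softmax layer
    J = np.sum(y_star * np.log(y))              # (F) loss as written above;
    return y, J                                 # its negative is the cross-entropy

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # M = 4 input features
alpha = rng.normal(size=(3, 5))                 # 3 hidden units (plus bias column)
beta = rng.normal(size=(2, 4))                  # K = 2 output classes
y_star = np.array([1.0, 0.0])
y, J = forward(x, alpha, beta, y_star)
```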
11
Training a CNN
Whiteboard
– SGD for CNNs
– Backpropagation for CNNs
12
Common CNN Layers
Whiteboard
– ReLU Layer
– Background: Subgradient
– Fully-connected Layer (w/ tensor input)
– Softmax Layer
– Convolutional Layer
– Max-Pooling Layer
13
Convolutional Layer
14
Convolutional Layer
15
Max-Pooling Layer
16
Max-Pooling Layer
17
Convolutional Neural Network (CNN)
• Typical layers include:
– Convolutional layer
– Max-pooling layer
– Fully-connected (Linear) layer
– ReLU layer (or some other nonlinear activation function)
– Softmax
• These can be arranged into arbitrarily deep topologies
18
Architecture #2: AlexNet
CNN for Image Classification
(Krizhevsky, Sutskever & Hinton, 2012)
15.3% error on ImageNet LSVRC-2012 contest
Input image (pixels) → five convolutional layers (w/ max-pooling) → three fully connected layers → 1000-way softmax
19
[Figure 2 from Krizhevsky et al. (2012): illustration of the CNN architecture]
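As an aside (not part of the slides), torchvision ships a slightly modified AlexNet; printing it shows the same overall shape described above: a conv/max-pool feature stack followed by three fully connected layers with a 1000-way output.

```python
import torchvision

model = torchvision.models.alexnet()   # untrained weights by default
print(model.features)                  # convolutional + max-pooling layers
print(model.classifier)                # the three fully connected layers
```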
CNNs for Image Recognition
21
Mini-Batch SGD
22
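A minimal sketch of a generic mini-batch SGD loop (not the whiteboard derivation from lecture): shuffle the data, take a small batch, average its per-example gradients, and step.

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, data, lr=0.01, batch_size=32, epochs=10):
    """Generic mini-batch SGD: shuffle each epoch, average the per-example
    gradients over each batch, and take a gradient step."""
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            g = np.mean([grad_fn(theta, ex) for ex in batch], axis=0)
            theta = theta - lr * g
    return theta

# Toy usage: minimize mean squared distance to some 1-D points (optimum = their mean).
data = [np.array([2.0]), np.array([4.0]), np.array([6.0])]
grad_fn = lambda theta, x: 2.0 * (theta - x)
print(minibatch_sgd(grad_fn, np.array([0.0]), data, lr=0.1, batch_size=2, epochs=100))
# approximately [4.]
```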
CNN VISUALIZATIONS
23
3D Visualization of CNN
http://scs.ryerson.ca/~aharley/vis/conv/
Convolution of a Color Image
• Color images consist of 3 floats per pixel for RGB (red, green, blue) color values
• Convolution must also be 3-dimensional

A closer look at spatial dimensions:
[Figure: convolving a 32x32x3 image with a 5x5x3 filter produces a 28x28x1 activation map]
25
Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N)
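A minimal NumPy sketch of this 3-dimensional convolution: a single 5x5x3 filter applied to a 32x32x3 image yields a 28x28 activation map.

```python
import numpy as np

def conv_color(image, filt):
    """Convolve an HxWx3 color image with a kxkx3 filter: at each spatial
    position, take a weighted sum over the window and all 3 color channels,
    giving a single 2-D activation map."""
    H, W, _ = image.shape
    k = filt.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k, :] * filt)
    return out

image = np.random.rand(32, 32, 3)        # 32x32x3 image (RGB floats)
filt = np.random.rand(5, 5, 3)           # 5x5x3 filter
print(conv_color(image, filt).shape)     # (28, 28) activation map
```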
Animation of 3D Convolution
http://cs231n.github.io/convolutional-‐networks/
26
Figure from Fei-‐Fei Li & Andrej Karpathy & Justin Johnson (CS231N)
MNIST Digit Recognition with CNNs
(in your browser)
https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html
27
Figure from Andrej Karpathy
CNN Summary
CNNs
– Are used for all aspects of computer vision, and
have won numerous pattern recognition
competitions
– Able to learn interpretable features at different levels of abstraction
– Typically consist of convolutional layers, pooling layers, nonlinearities, and fully connected layers
Other Resources:
– Readings on course website
– Andrej Karpathy, CS231n Notes
http://cs231n.github.io/convolutional-‐networks/
28
BAYESIAN NETWORKS
29
Bayes Nets Outline
• Motivation
– Structured Prediction
• Background
– Conditional Independence
– Chain Rule of Probability
• Directed Graphical Models
– Writing Joint Distributions
– Definition: Bayesian Network
– Qualitative Specification
– Quantitative Specification
– Familiar Models as Bayes Nets
• Conditional Independence in Bayes Nets
– Three case studies
– D-separation
– Markov blanket
• Learning
– Fully Observed Bayes Net
– (Partially Observed Bayes Net)
• Inference
– Sampling directly from the joint distribution
– Gibbs Sampling
31
MOTIVATION: STRUCTURED PREDICTION
32
Structured Prediction
• Most of the models we’ve seen so far were
for classification
– Given observations: x = (x1, x2, …, xK)
– Predict a (binary) label: y
• Many real-‐world problems require
structured prediction
– Given observations: x = (x1, x2, …, xK)
– Predict a structure: y = (y1, y2, …, yJ)
• Some classification problems benefit from
latent structure
33
Structured Prediction Examples
• Examples of structured prediction
– Part-of-speech (POS) tagging
– Handwriting recognition
– Speech recognition
– Word alignment
– Congressional voting
• Examples of latent structure
– Object recognition
34
Dataset for Supervised Part-of-Speech (POS) Tagging

Data: D = {x^(n), y^(n)}_{n=1}^N
Sample 1:
  y(1): n v p d n
  x(1): time flies like an arrow

Sample 2:
  y(2): n n v d n
  x(2): time flies like an arrow

Sample 3:
  y(3): n v p n n
  x(3): flies fly with their wings

Sample 4:
  y(4): p n n v v
  x(4): with time you will see
35
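One plausible in-memory representation of this dataset (an illustration, not the course's data format): parallel word and tag sequences of equal length.

```python
# Each example pairs a word sequence x with a tag sequence y of the same length.
D = [
    (["time", "flies", "like", "an", "arrow"],   ["n", "v", "p", "d", "n"]),
    (["time", "flies", "like", "an", "arrow"],   ["n", "n", "v", "d", "n"]),
    (["flies", "fly", "with", "their", "wings"], ["n", "v", "p", "n", "n"]),
    (["with", "time", "you", "will", "see"],     ["p", "n", "n", "v", "v"]),
]
for x, y in D:
    assert len(x) == len(y)   # structured output: one tag per input token
```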
Dataset for Supervised Handwriting Recognition

Data: D = {x^(n), y^(n)}_{n=1}^N
Sample 1:
  y(1): u n e x p e c t e d
  x(1): [image of the handwritten word]

Sample 2:
  x(2): [image of the handwritten word]

Sample 3:
  y(3): e m b r a c e s
  x(3): [image of the handwritten word]
Sample 1:
  y(1): h# dh ih s w uh z iy z iy
  x(1): [spectrogram of the utterance]

[Figure: extrinsic (top) and intrinsic (bottom) spectral representations for the utterance "This was easy for us." (TIMIT)]
37
Figures from (Jansen & Niyogi, 2013)
Application:
Word Alignment / Phrase Extraction
• Variables (boolean):
– For each (Chinese phrase,
English phrase) pair,
are they linked?
• Interactions:
– Word fertilities
– Few “jumps” (discontinuities)
– Syntactic reorderings
– "ITG constraint" on alignment
– Phrases are disjoint (?)
40
Case Study: Object Recognition
Data consists of images x and labels y.
[Figure: example images x(1), x(2), x(3), x(4) with a label such as "leopard"]

• Define graphical model with these latent variables in mind
• z is not observed at train or test time
42
Case Study: Object Recognition
Data consists of images x and labels y.
• Preprocess data into "patches"
• Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)
• Define graphical model with these latent variables in mind

[Figure: patches X2, X7 with latent part labels Z2, Z7]
43
Case Study: Object Recognition
Data consists of images x and labels y.
• Preprocess data into "patches"
• Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)
• Define graphical model with these latent variables in mind

[Figure: patches X2, X7 with latent part labels Z2, Z7, connected by factors ψ2, ψ3, ψ4]
44
Structured Prediction
Preview of challenges to come…
• Consider the task of finding the most probable
assignment to the output
45
Machine Learning

• The data inspires the structures we want to predict (Domain Knowledge)
• Our model defines a score for each structure, and it also tells us what to optimize (Mathematical Modeling)
• Inference finds the best structure (Combinatorial Optimization)

[Figure: ML at the overlap of Domain Knowledge, Mathematical Modeling, and Combinatorial Optimization, with example structures for "Alice saw Bob on a hill with a telescope" and "time flies like an arrow"]

Machine Learning

• Data (e.g. sentences such as "time flies like an arrow")
• Model (defined over variables X1, ..., X5)
• Objective
• Inference
• Learning (inference is usually called as a subroutine in learning)
47
BACKGROUND
48
Background
Whiteboard
– Chain Rule of Probability
– Conditional Independence
49
Background: Chain Rule
of Probability
50
Background: Conditional Independence

Random variables A and B are conditionally independent given C if:

  P(A, B | C) = P(A | C) P(B | C)    (1)

or equivalently:

  P(A | B, C) = P(A | C)    (2)

We write this as:

  A ⊥ B | C    (3)

Later we will also see other notation for conditional independence.
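A small numeric sketch of this definition: build a toy joint distribution that factorizes through C (all numbers made up) and check that P(A, B | C) = P(A | C) P(B | C) for every assignment.

```python
import itertools

# Toy joint P(A, B, C) over binary variables, constructed to factorize through C,
# so A and B are conditionally independent given C.
P_C = {0: 0.5, 1: 0.5}
P_A_given_C = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}   # P_A_given_C[c][a]
P_B_given_C = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}   # P_B_given_C[c][b]
P = {(a, b, c): P_C[c] * P_A_given_C[c][a] * P_B_given_C[c][b]
     for a, b, c in itertools.product([0, 1], repeat=3)}

def marginal(keep):
    """Sum out every variable not named in `keep` (a subset of 'abc')."""
    out = {}
    for (a, b, c), p in P.items():
        key = tuple(v for v, name in zip((a, b, c), "abc") if name in keep)
        out[key] = out.get(key, 0.0) + p
    return out

P_AC, P_BC, P_Cm = marginal("ac"), marginal("bc"), marginal("c")
for a, b, c in itertools.product([0, 1], repeat=3):
    lhs = P[(a, b, c)] / P_Cm[(c,)]                              # P(A, B | C)
    rhs = (P_AC[(a, c)] / P_Cm[(c,)]) * (P_BC[(b, c)] / P_Cm[(c,)])
    assert abs(lhs - rhs) < 1e-12                                # A ⊥ B | C holds
```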
52
Example: Tornado Alarms
1. Imagine that
you work at the
911 call center
in Dallas
2. You receive six
calls informing
you that the
Emergency
Weather Sirens
are going off
3. What do you
conclude?
53
Figure from https://www.nytimes.com/2017/04/08/us/dallas-emergency-sirens-hacking.html
Directed Graphical Models
(Bayes Nets)
Whiteboard
– Example: Tornado Alarms
– Writing Joint Distributions
• Idea #1: Giant Table
• Idea #2: Rewrite using chain rule
• Idea #3: Assume full independence
• Idea #4: Drop variables from RHS of conditionals
– Definition: Bayesian Network
– Observed Variables in Graphical Models
55
Bayesian Network
[Graph: X1 → X2, X2 → X4, X3 → X4, X3 → X5]

p(X1, X2, X3, X4, X5) = p(X5|X3) p(X4|X2, X3) p(X3) p(X2|X1) p(X1)
56
Bayesian Network
Definition:

  P(X1, ..., Xn) = ∏_{i=1}^n P(Xi | parents(Xi))

[Graph: X1 → X2, X2 → X4, X3 → X4, X3 → X5]
57
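A minimal sketch (not from the lecture) of this factorization applied to the example graph above; the CPT numbers are made up, and the joint probability is just the product of the local conditionals.

```python
def joint(x, cpts, parents):
    """P(X1,...,Xn) = product over i of P(Xi | parents(Xi)).
    x: dict of variable values; cpts[name](value, parent_values) returns the
    local conditional probability; parents[name] lists each variable's parents."""
    p = 1.0
    for name, value in x.items():
        parent_vals = tuple(x[pa] for pa in parents[name])
        p *= cpts[name](value, parent_vals)
    return p

# Structure from the example graph: X1 -> X2, X2 -> X4, X3 -> X4, X3 -> X5.
parents = {"X1": [], "X2": ["X1"], "X3": [], "X4": ["X2", "X3"], "X5": ["X3"]}

# Made-up CPTs over binary variables.
cpts = {
    "X1": lambda v, pa: 0.6 if v == 1 else 0.4,
    "X2": lambda v, pa: (0.7 if v == 1 else 0.3) if pa[0] == 1 else (0.2 if v == 1 else 0.8),
    "X3": lambda v, pa: 0.5,
    "X4": lambda v, pa: 0.9 if v == (pa[0] or pa[1]) else 0.1,
    "X5": lambda v, pa: 0.8 if v == pa[0] else 0.2,
}

print(joint({"X1": 1, "X2": 1, "X3": 0, "X4": 1, "X5": 0}, cpts, parents))
# = P(X1=1) P(X2=1|X1=1) P(X3=0) P(X4=1|X2=1,X3=0) P(X5=0|X3=0)
# = 0.6 * 0.7 * 0.5 * 0.9 * 0.8 = 0.1512
```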