learningtheory-bns
How many points can a linear boundary classify exactly? (1-D)
How many points can a linear boundary classify exactly? (d-D)
Shattering a set of points
A hypothesis space H shatters a set of points if, for every possible labeling of the points, some h ∈ H classifies all of them correctly.

VC dimension
PAC bound using VC dimension
The VC dimension is the largest number of training points that can be classified exactly (shattered) under every possible labeling!
Measures the effective size of the hypothesis space, as the number of leaves k did for decision trees
Bound for infinite hypothesis spaces (a consistent learner's true error is ≤ ε with probability ≥ 1−δ when):
m ≥ (1/ε) (4 log2(2/δ) + 8 VC(H) log2(13/ε))
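This bound is easy to evaluate; a minimal sketch in Python (the helper name and example numbers are mine, assuming the form of the bound above):

```python
from math import ceil, log2

# Sample complexity from the VC-based PAC bound:
#   m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))
def vc_sample_complexity(eps, delta, vc_dim):
    """Training-set size sufficient for true error <= eps with prob >= 1 - delta."""
    return ceil((4 * log2(2 / delta) + 8 * vc_dim * log2(13 / eps)) / eps)

# e.g., a linear classifier on 10 features, so VC(H) = d + 1 = 11:
print(vc_sample_complexity(eps=0.1, delta=0.05, vc_dim=11))  # 6393
```

Note that m grows linearly in VC(H) but only logarithmically in 1/δ.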
Examples of VC dimension
Linear classifiers:
VC(H) = d+1, for d features plus constant term b
Neural networks:
VC(H) = #parameters
Local minima mean NNs will probably not find the best parameters
1-Nearest neighbor?
VC(H) = ∞: 1-NN can shatter any set of distinct points, so the VC bound is vacuous for it
Another VC dim. example – What can we shatter?
What's the VC dim. of decision stumps in 2D?
Answer: 3. Three points in general position can be shattered, but stumps on two axes can realize at most 14 of the 16 labelings of any four points.
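This is easy to verify by brute force; a small sketch (my own code, not from the lecture) that enumerates every labeling an axis-aligned stump can produce:

```python
import itertools

def stump_labelings(points):
    """All labelings achievable by thresholding one coordinate (either sign)."""
    out = set()
    for axis in range(len(points[0])):
        vals = sorted({p[axis] for p in points})
        # one cut below all values, plus one between each adjacent pair
        cuts = [vals[0] - 1] + [(a + b) / 2 for a, b in zip(vals, vals[1:])]
        for t in cuts:
            for sign in (+1, -1):
                out.add(tuple(sign if p[axis] > t else -sign for p in points))
    return out

def shattered(points):
    return len(stump_labelings(points)) == 2 ** len(points)

print(shattered([(0, 0), (1, 2), (2, 1)]))  # True: these 3 points are shattered
grid = [(x, y) for x in range(4) for y in range(4)]
print(any(shattered(q) for q in itertools.combinations(grid, 4)))  # False: no 4 points
```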
What you need to know
Finite hypothesis space:
Derive the results
Counting the number of hypotheses
Mistakes on training data
Bayesian Networks – Representation
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
October 29th, 2007
Handwriting recognition [figure slide]
Webpage classification [figure slide]
Handwriting recognition 2 [figure slide]
Webpage classification 2 [figure slide]
Today – Bayesian networks
One of the most exciting advances in statistical AI in the last 10–15 years
Generalizes naïve Bayes and logistic regression
classifiers
Compact representation for exponentially-large
probability distributions
Exploit conditional independencies
Causal structure
Suppose we know the following:
The flu causes sinus inflammation
Allergies cause sinus inflammation
Sinus inflammation causes a runny nose
Sinus inflammation causes headaches
How are these connected?
Possible queries
[Figure: the Flu/Allergy → Sinus → Headache/Nose network]
Inference
Most probable explanation

Car starts BN
[Figure: a BN over 18 binary attributes describing whether a car starts]
Inference: P(BatteryAge | Starts = f)
Naïvely, answering this sums the joint over the other 16 variables: 2^16 terms
Factored joint distribution – Preview
[Figure: Flu, Allergy → Sinus → Headache, Nose]
P(F, A, S, H, N) = P(F) · P(A) · P(S | F, A) · P(H | S) · P(N | S)
Number of parameters:
Full joint over 5 binary variables: 2^5 − 1 = 31
Factored form: 1 + 1 + 4 + 2 + 2 = 10
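The parameter count falls straight out of the structure; a quick sketch (my own code) counting free parameters for binary variables:

```python
# One free parameter per parent configuration for each binary variable.
parents = {"Flu": [], "Allergy": [], "Sinus": ["Flu", "Allergy"],
           "Headache": ["Sinus"], "Nose": ["Sinus"]}

full_joint = 2 ** len(parents) - 1                        # 31
factored = sum(2 ** len(pa) for pa in parents.values())   # 1 + 1 + 4 + 2 + 2 = 10
print(full_joint, factored)
```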
Key: Independence assumptions
[Figure: Flu, Allergy → Sinus → Headache, Nose]
(Marginal) Independence
Flu and Allergy are (marginally) independent: the joint is the product of the marginals

P(Flu, Allergy):
                Flu = t            Flu = f
Allergy = t     P(F=t)·P(A=t)      P(F=f)·P(A=t)
Allergy = f     P(F=t)·P(A=f)      P(F=f)·P(A=f)
Marginally independent random variables
Sets of variables X, Y
X is independent of Y if
P ⊨ (X = x ⊥ Y = y), ∀x ∈ Val(X), ∀y ∈ Val(Y)
Shorthand:
Marginal independence: P ⊨ (X ⊥ Y)
Conditional independence
Flu and Headache are not (marginally) independent
But Flu and Headache are independent given Sinus infection
More generally:
Conditionally independent random variables
Sets of variables X, Y, Z
X is independent of Y given Z if
P ⊨ (X = x ⊥ Y = y | Z = z), ∀x ∈ Val(X), ∀y ∈ Val(Y), ∀z ∈ Val(Z)
Shorthand:
Conditional independence: P ⊨ (X ⊥ Y | Z)
For P ⊨ (X ⊥ Y | ∅), write P ⊨ (X ⊥ Y)
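The definition can be checked mechanically against any joint distribution table; a sketch (my own helpers, not lecture code) testing P ⊨ (X ⊥ Y | Z) via P(x,y,z)·P(z) = P(x,z)·P(y,z):

```python
def marginal(joint, order, vars_):
    """Marginalize a joint {full-assignment tuple: prob} onto `vars_`."""
    idx = [order.index(v) for v in vars_]
    out = {}
    for a, p in joint.items():
        key = tuple(a[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_indep(joint, order, X, Y, Z, tol=1e-9):
    """True iff P(x,y,z) * P(z) == P(x,z) * P(y,z) for all assignments."""
    pxyz = marginal(joint, order, X + Y + Z)
    pz, pxz, pyz = (marginal(joint, order, V) for V in (Z, X + Z, Y + Z))
    for a, p in pxyz.items():
        x, y, z = a[:len(X)], a[len(X):len(X) + len(Y)], a[len(X) + len(Y):]
        if abs(p * pz[z] - pxz[x + z] * pyz[y + z]) > tol:
            return False
    return True

# Toy joint over (Z, X, Y): X and Y are independent noisy copies of Z.
joint = {(z, x, y): 0.5 * (0.8 if x == z else 0.2) * (0.8 if y == z else 0.2)
         for z in (0, 1) for x in (0, 1) for y in (0, 1)}
print(cond_indep(joint, ["Z", "X", "Y"], ["X"], ["Y"], ["Z"]))  # True
print(cond_indep(joint, ["Z", "X", "Y"], ["X"], ["Y"], []))     # False
```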
Properties of independence
Symmetry:
(X ⊥ Y | Z) ⇒ (Y ⊥ X | Z)
Decomposition:
(X ⊥ Y,W | Z) ⇒ (X ⊥ Y | Z)
Weak union:
(X ⊥ Y,W | Z) ⇒ (X ⊥ Y | Z,W)
Contraction:
(X ⊥ W | Y,Z) & (X ⊥ Y | Z) ⇒ (X ⊥ Y,W | Z)
Intersection:
(X ⊥ Y | W,Z) & (X ⊥ W | Y,Z) ⇒ (X ⊥ Y,W | Z)
Only for positive distributions!
P(α) > 0, ∀α ≠ ∅
The independence assumption
[Figure: Flu, Allergy → Sinus → Headache, Nose]
Local Markov Assumption: a variable X is independent of its non-descendants given its parents
Naïve Bayes revisited
[Figure: naïve Bayes drawn as a Bayes net]
Naïve Bayes is a special case: the class is the only parent of each feature, so
P(C, X1, …, Xn) = P(C) · P(X1 | C) · … · P(Xn | C)
Joint distribution
[Figure: Flu, Allergy → Sinus → Headache, Nose]
By the chain rule (no assumptions needed), e.g.:
P(F, A, S, H, N) = P(F) · P(A | F) · P(S | F, A) · P(H | F, A, S) · P(N | F, A, S, H)
More generally:
P(X1,…,Xn) = P(X1) · P(X2|X1) · … · P(Xn|X1,…,Xn-1)
Chain rule & Joint distribution
Local Markov Assumption:
A variable X is independent of its non-descendants given its parents
[Figure: Flu, Allergy → Sinus → Headache, Nose]
Applying the assumption to each chain-rule term (e.g., P(H | F, A, S) = P(H | S)) yields the factored form:
P(F, A, S, H, N) = P(F) · P(A) · P(S | F, A) · P(H | S) · P(N | S)
The Representation Theorem – Joint Distribution to BN
A general Bayes net
Set of random variables X1,…,Xn, arranged in a directed acyclic graph
CPTs: P(Xi | PaXi) for each variable Xi
Joint distribution:
P(X1,…,Xn) = ∏i P(Xi | PaXi)
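As a concrete sketch (the CPT numbers below are made up for illustration, not from the lecture), the joint is just a product of one CPT lookup per variable:

```python
# Bayes net over binary variables: each CPT maps a parent assignment to P(var = 1).
parents = {"Flu": [], "Allergy": [], "Sinus": ["Flu", "Allergy"],
           "Headache": ["Sinus"], "Nose": ["Sinus"]}
cpts = {
    "Flu":      {(): 0.1},
    "Allergy":  {(): 0.2},
    "Sinus":    {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9},
    "Headache": {(0,): 0.1, (1,): 0.7},
    "Nose":     {(0,): 0.05, (1,): 0.8},
}

def joint_prob(assign):
    """P(assignment) = product over i of P(X_i | Pa_{X_i})."""
    p = 1.0
    for var, pa in parents.items():
        p1 = cpts[var][tuple(assign[q] for q in pa)]
        p *= p1 if assign[var] == 1 else 1 - p1
    return p

print(joint_prob({"Flu": 1, "Allergy": 0, "Sinus": 1,
                  "Headache": 1, "Nose": 0}))  # 0.00672
```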
Another example
Variables:
B – Burglar
E – Earthquake
A – Burglar alarm
N – Neighbor calls
R – Radio report
Independencies encoded in BN
We said: all you need is the local Markov assumption
(Xi ⊥ NonDescendantsXi | PaXi)
But then we talked about other (in)dependencies
e.g., explaining away
Common cause: X ← Z → Y
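Explaining away is easy to see numerically; a toy sketch (my own numbers) on the v-structure B → A ← E:

```python
# Observing the alarm raises P(burglar); additionally observing an earthquake
# "explains away" the alarm and pulls P(burglar) back down.
pB, pE = 0.01, 0.02
pA = {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95}  # P(A=1 | B, E)

def p(b, e, a):
    return ((pB if b else 1 - pB) * (pE if e else 1 - pE)
            * (pA[(b, e)] if a else 1 - pA[(b, e)]))

p_b_given_a = sum(p(1, e, 1) for e in (0, 1)) / sum(
    p(b, e, 1) for b in (0, 1) for e in (0, 1))
p_b_given_ae = p(1, 1, 1) / sum(p(b, 1, 1) for b in (0, 1))
print(p_b_given_a, p_b_given_ae)  # ~0.57 vs. ~0.03
```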
Understanding independencies in BNs – Some examples
[Figures: two example DAGs, one over nodes A–K and one over nodes A–H plus F′ and F″, used in class to trace active trails]
Active trails formalized
A path X1 – X2 – · · · – Xk is an active trail when variables O ⊆ {X1,…,Xn} are observed if for each consecutive triplet in the trail:
Xi-1→Xi→Xi+1, and Xi is not observed (Xi∉O)
Xi-1←Xi←Xi+1, and Xi is not observed (Xi∉O)
Xi-1←Xi→Xi+1, and Xi is not observed (Xi∉O)
Xi-1→Xi←Xi+1, and Xi or one of its descendants is observed
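These four cases translate directly into code; a sketch (my own toy DAG and helper, not lecture code) that tests whether a given path is active:

```python
dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}  # toy parent map

def descendants(x):
    kids = [v for v, pa in dag.items() if x in pa]
    return set(kids).union(*[descendants(k) for k in kids], set())

def is_active(path, observed):
    """Check each consecutive triplet on the path against the four cases above."""
    for a, b, c in zip(path, path[1:], path[2:]):
        collider = a in dag[b] and c in dag[b]   # X_{i-1} -> X_i <- X_{i+1}
        if collider:
            if b not in observed and not (descendants(b) & observed):
                return False   # v-structure is active only given evidence on b or below
        elif b in observed:
            return False       # chain / common-cause triplet blocked by observing b
    return True

print(is_active(["A", "C", "B"], set()))   # False: collider C unobserved
print(is_active(["A", "C", "B"], {"D"}))   # True: a descendant of C is observed
print(is_active(["A", "C", "D"], {"C"}))   # False: chain blocked by observing C
```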
The BN Representation Theorem
If the conditional independencies in the BN are a subset of the conditional independencies in P,
obtain the joint probability distribution:
P(X1,…,Xn) = ∏i P(Xi | PaXi)
Important because:
Every P has at least one BN structure G

If the joint probability distribution factors as
P(X1,…,Xn) = ∏i P(Xi | PaXi),
then the conditional independencies in the BN are a subset of the conditional independencies in P
Important because:
Read independencies of P from BN structure G
“Simpler” BNs
A distribution can be represented by many BNs:
Learning Bayes nets

                        Known structure    Unknown structure
Fully observable data
Missing data

Data x(1), …, x(m)  →  CPTs P(Xi | PaXi)
(what we learn: the parameters, and the structure when it is unknown)
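With known structure and fully observable data, parameter learning is just counting; a small sketch (my own example data, not from the lecture):

```python
from collections import Counter

# MLE for a binary BN: P(X=1 | pa) = count(X=1, pa) / count(pa).
structure = {"Flu": [], "Sinus": ["Flu"]}
data = [{"Flu": 1, "Sinus": 1}, {"Flu": 0, "Sinus": 0},
        {"Flu": 1, "Sinus": 0}, {"Flu": 0, "Sinus": 0}]

def mle_cpts(structure, data):
    cpts = {}
    for var, pa in structure.items():
        hits, totals = Counter(), Counter()
        for row in data:
            key = tuple(row[q] for q in pa)
            totals[key] += 1
            hits[key] += row[var]
        cpts[var] = {k: hits[k] / totals[k] for k in totals}
    return cpts

print(mle_cpts(structure, data))
# {'Flu': {(): 0.5}, 'Sinus': {(1,): 0.5, (0,): 0.0}}
```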
Queries in Bayes nets
Given a BN, find:
The probability of X given some evidence: P(X | e)
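For small networks this query can be answered by brute force; a sketch (my own code, reusing `parents`, `cpts`, and `joint_prob` from the earlier sketch) that sums the joint over all completions of the evidence:

```python
import itertools

def query(X, evidence):
    """P(X = 1 | evidence) by enumerating all assignments to hidden variables."""
    weights = {0: 0.0, 1: 0.0}
    hidden = [v for v in parents if v != X and v not in evidence]
    for xval in (0, 1):
        for vals in itertools.product((0, 1), repeat=len(hidden)):
            assign = {**evidence, X: xval, **dict(zip(hidden, vals))}
            weights[xval] += joint_prob(assign)
    return weights[1] / (weights[0] + weights[1])

print(query("Flu", {"Nose": 1}))  # P(Flu = t | runny nose)
```

Enumeration is exponential in the number of hidden variables — exactly the 2^16-term blowup from the car-starts example.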
Acknowledgements
JavaBayes applet
http://www.pmr.poli.usp.br/ltd/Software/javabayes/Home/index.html