learningtheory-bns

The document discusses the concept of VC dimension in machine learning, explaining how it measures the capacity of hypothesis spaces to classify points accurately. It also introduces Bayesian networks, highlighting their ability to represent complex probability distributions and exploit conditional independencies. The document covers various examples and applications of these concepts in statistical AI, including decision trees, neural networks, and real-world applications like diagnosis and anomaly detection.

VC Dimension

Machine Learning – 10701/15781


Carlos Guestrin
Carnegie Mellon University

October 29th, 2007


©2005-2007 Carlos Guestrin

What about continuous hypothesis spaces?

• Continuous hypothesis space:
  • |H| = ∞
  • Infinite variance???
• As with decision trees, only care about the maximum number of points that can be classified exactly!

How many points can a linear boundary classify exactly? (1-D)

How many points can a linear boundary classify exactly? (2-D)
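
The worked answers to these shattering questions are drawn on the board and did not survive extraction. As a rough companion, here is a small brute-force check (not from the lecture; the function names and point placements are mine). It samples random linear boundaries and tests whether every labeling of a point set can be realized. A printed True is conclusive; a False is only conclusive when some labeling is provably unrealizable, as with the XOR labeling of four points below.

```python
import itertools
import numpy as np

def can_shatter(points, predict, hypotheses):
    """Brute-force shattering check: can the hypothesis class realize every
    +/-1 labeling of `points`?  `predict(h, x)` is the label h assigns to x."""
    points = np.asarray(points, dtype=float)
    achievable = {tuple(predict(h, x) for x in points) for h in hypotheses}
    return all(lab in achievable
               for lab in itertools.product((-1, +1), repeat=len(points)))

# Linear boundaries in 2-D: h = (w, b), label = sign(w.x + b).
rng = np.random.default_rng(0)
linear_hyps = [(rng.normal(size=2), rng.normal()) for _ in range(20000)]
lin_predict = lambda h, x: int(np.sign(h[0] @ x + h[1]) or 1)   # treat 0 as +1

print(can_shatter([(0, 0), (1, 0), (0, 1)], lin_predict, linear_hyps))          # True: 3 points shattered
print(can_shatter([(0, 0), (1, 0), (0, 1), (1, 1)], lin_predict, linear_hyps))  # False: XOR labeling is impossible
```

This matches the VC(H) = d+1 = 3 result for 2-D linear classifiers stated a few slides later.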

How many points can a linear boundary classify exactly? (d-D)

PAC bound using VC dimension

• The maximum number of training points that can be classified exactly is the VC dimension!!!
• Measures relevant size of hypothesis space, as with decision trees with k leaves

Shattering a set of points
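
The definition itself is handwritten on the original slide; the standard statement is that a hypothesis space H shatters a finite set of points if it can realize every possible labeling of them:

$$\mathcal{H} \text{ shatters } \{x_1,\dots,x_m\} \;\iff\; \forall\,(y_1,\dots,y_m) \in \{-1,+1\}^m \;\;\exists\, h \in \mathcal{H}: \; h(x_i) = y_i \text{ for all } i.$$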

VC dimension

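As above, the slide's working is on the board; the standard definition this title refers to is

$$VC(\mathcal{H}) \;=\; \max\bigl\{\, m : \text{some set of } m \text{ points is shattered by } \mathcal{H} \,\bigr\},$$

with $VC(\mathcal{H}) = \infty$ if arbitrarily large sets can be shattered.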

PAC bound using VC dimension

• The maximum number of training points that can be classified exactly is the VC dimension!!!
• Measures relevant size of hypothesis space, as with decision trees with k leaves
• Bound for infinite-dimensional hypothesis spaces:
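
The bound itself appears as an equation image on the slide and is missing from the extracted text; a commonly stated form (the exact constants vary by source) is that, with probability at least $1-\delta$ over $m$ training examples,

$$\mathrm{error}_{\mathrm{true}}(h) \;\le\; \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{VC(\mathcal{H})\left(\ln\frac{2m}{VC(\mathcal{H})}+1\right) + \ln\frac{4}{\delta}}{m}}.$$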

Examples of VC dimension

• Linear classifiers:
  • VC(H) = d+1, for d features plus constant term b
• Neural networks:
  • VC(H) = #parameters
  • Local minima mean NNs will probably not find the best parameters
• 1-Nearest neighbor?

Another VC dim. example – What can we shatter?

• What’s the VC dim. of decision stumps in 2d?

Another VC dim. example – What can’t we shatter?

• What’s the VC dim. of decision stumps in 2d? (A brute-force check follows below.)
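
To explore this question numerically, the brute-force checker from the 2-D linear-boundary sketch above can be reused with a grid of decision stumps (again, the names and the specific point placements are mine, not the lecture's):

```python
import numpy as np  # plus can_shatter from the earlier sketch

# Decision stumps in 2-D: pick a coordinate, a threshold, and a sign.
stump_hyps = [(axis, thr, sign)
              for axis in (0, 1)
              for thr in np.linspace(-0.5, 2.5, 61)
              for sign in (-1, +1)]
stump_predict = lambda h, x: h[2] * (1 if x[h[0]] > h[1] else -1)

three = [(0, 0), (1, 1), (2, 0)]   # the point in the middle along x is extreme along y
print(can_shatter(three, stump_predict, stump_hyps))                 # True
print(can_shatter(three + [(1.5, 0.5)], stump_predict, stump_hyps))  # False for this 4th point
```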

What you need to know

• Finite hypothesis space
  • Derive results
    • Counting the number of hypotheses
    • Mistakes on training data
• Complexity of the classifier depends on the number of points that can be classified exactly
  • Finite case – decision trees
  • Infinite case – VC dimension
• Bias-variance tradeoff in learning theory
• Remember: will your algorithm find the best classifier?

Bayesian Networks – Representation

Machine Learning – 10701/15781

Carlos Guestrin
Carnegie Mellon University

October 29th, 2007

©2005-2007 Carlos Guestrin

Handwriting recognition

Character recognition, e.g., kernel SVMs

[Figure: examples of handwritten characters to be classified]

Webpage classification

Company home page
vs
Personal home page
vs
University home page
vs

Handwriting recognition 2

Webpage classification 2


Today – Bayesian networks

• One of the most exciting advancements in statistical AI in the last 10-15 years
• Generalizes naïve Bayes and logistic regression classifiers
• Compact representation for exponentially-large probability distributions
• Exploit conditional independencies

Causal structure

• Suppose we know the following:
  • The flu causes sinus inflammation
  • Allergies cause sinus inflammation
  • Sinus inflammation causes a runny nose
  • Sinus inflammation causes headaches
• How are these connected?

Possible queries

[Figure: BN with Flu, Allergy → Sinus → Headache, Nose]

• Inference
• Most probable explanation
• Active data collection

Car starts BN

• 18 binary attributes
• Inference
  • P(BatteryAge | Starts=f)
  • 2^16 terms (the naive sum is over the 16 remaining variables), why so fast?
• Not impressed?
  • HailFinder BN – more than 3^54 = 58149737003040059690390169 terms

Factored joint distribution – Preview

[Figure: BN with Flu, Allergy → Sinus → Headache, Nose]

Number of parameters

[Figure: BN with Flu, Allergy → Sinus → Headache, Nose]

Key: Independence assumptions

[Figure: BN with Flu, Allergy → Sinus → Headache, Nose]

Knowing sinus separates the variables from each other

(Marginal) Independence

• Flu and Allergy are (marginally) independent

[Table to fill in: marginals over Flu = t/f and Allergy = t/f]

• More generally:

[Table to fill in: joint over Flu ∈ {t, f} × Allergy ∈ {t, f}]

Marginally independent random variables

• Sets of variables X, Y
• X is independent of Y if
  P ⊨ (X = x ⊥ Y = y), ∀ x ∈ Val(X), y ∈ Val(Y)
• Shorthand:
  • Marginal independence: P ⊨ (X ⊥ Y)
• Proposition: P satisfies (X ⊥ Y) if and only if
  P(X,Y) = P(X) P(Y) (a numeric illustration follows below)
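
As a concrete illustration of the proposition (the numbers below are made up for the example, not taken from the lecture):

```python
import numpy as np

# A hypothetical joint P(Flu, Allergy); rows: Flu = t/f, columns: Allergy = t/f.
P = np.array([[0.02, 0.08],
              [0.18, 0.72]])
P_flu = P.sum(axis=1)        # marginal P(Flu)
P_allergy = P.sum(axis=0)    # marginal P(Allergy)

# Independence holds exactly when the joint equals the outer product of the marginals.
print(np.allclose(P, np.outer(P_flu, P_allergy)))   # True
```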

Conditional independence

• Flu and Headache are not (marginally) independent
• Flu and Headache are independent given Sinus infection
• More generally:

Conditionally independent random variables

• Sets of variables X, Y, Z
• X is independent of Y given Z if
  P ⊨ (X = x ⊥ Y = y | Z = z), ∀ x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)
• Shorthand:
  • Conditional independence: P ⊨ (X ⊥ Y | Z)
  • For P ⊨ (X ⊥ Y | ∅), write P ⊨ (X ⊥ Y)
• Proposition: P satisfies (X ⊥ Y | Z) if and only if
  P(X,Y|Z) = P(X|Z) P(Y|Z)
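
An equivalent reading of the proposition (assuming $P(Y=y \mid Z=z) > 0$) is that, once Z is known, also observing Y does not change the distribution of X:

$$P(X \mid Y, Z) \;=\; \frac{P(X, Y \mid Z)}{P(Y \mid Z)} \;=\; \frac{P(X \mid Z)\,P(Y \mid Z)}{P(Y \mid Z)} \;=\; P(X \mid Z).$$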

Properties of independence

• Symmetry:
  • (X ⊥ Y | Z) ⇒ (Y ⊥ X | Z)
• Decomposition:
  • (X ⊥ Y,W | Z) ⇒ (X ⊥ Y | Z)
• Weak union:
  • (X ⊥ Y,W | Z) ⇒ (X ⊥ Y | Z,W)
• Contraction:
  • (X ⊥ W | Y,Z) & (X ⊥ Y | Z) ⇒ (X ⊥ Y,W | Z)
• Intersection:
  • (X ⊥ Y | W,Z) & (X ⊥ W | Y,Z) ⇒ (X ⊥ Y,W | Z)
  • Only for positive distributions!
  • P(α) > 0, ∀α, α ≠ ∅

The independence assumption

[Figure: BN with Flu, Allergy → Sinus → Headache, Nose]

Local Markov Assumption: A variable X is independent of its non-descendants given its parents

Explaining away

Local Markov Assumption: A variable X is independent of its non-descendants given its parents

[Figure: BN with Flu, Allergy → Sinus → Headache, Nose]

Naïve Bayes revisited

Local Markov Assumption: A variable X is independent of its non-descendants given its parents
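
The figure on this slide is missing from the extracted text; it presumably shows the naïve Bayes network, in which the class variable is the single parent of every feature. Under the local Markov assumption the features are conditionally independent given the class, so the joint factors as

$$P(C, X_1, \dots, X_n) \;=\; P(C) \prod_{i=1}^{n} P(X_i \mid C).$$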

What about probabilities?

Conditional probability tables (CPTs)

[Figure: BN with Flu, Allergy → Sinus → Headache, Nose, annotated with one CPT per node]

Joint distribution

[Figure: BN with Flu, Allergy → Sinus → Headache, Nose]

Why can we decompose? Markov Assumption!

The chain rule of probabilities

• P(A,B) = P(A) P(B|A)   [e.g., the two-node network Flu → Sinus]
• More generally:
  • P(X1,…,Xn) = P(X1) · P(X2|X1) · … · P(Xn|X1,…,Xn-1)

Chain rule & Joint distribution

Local Markov Assumption: A variable X is independent of its non-descendants given its parents

[Figure: BN with Flu, Allergy → Sinus → Headache, Nose]
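
The worked factorization is handwritten on the slide; under the variable ordering Flu (F), Allergy (A), Sinus (S), Headache (H), Nose (N) it proceeds as

$$
\begin{aligned}
P(F,A,S,H,N) &= P(F)\,P(A\mid F)\,P(S\mid F,A)\,P(H\mid F,A,S)\,P(N\mid F,A,S,H) && \text{(chain rule)}\\
             &= P(F)\,P(A)\,P(S\mid F,A)\,P(H\mid S)\,P(N\mid S) && \text{(local Markov assumption)}
\end{aligned}
$$

using $A \perp F$, $H \perp \{F,A\} \mid S$, and $N \perp \{F,A,H\} \mid S$.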

Two (trivial) special cases

• Edgeless graph
• Fully-connected graph
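
(A note added here, since the slide works these out on the board.) With no edges, every variable is parentless, so the joint factors as $P(X_1,\dots,X_n) = \prod_i P(X_i)$, i.e., full independence. With a fully-connected DAG, each variable's parents are all of its predecessors, so the factorization is just the chain rule and nothing is saved ($2^n - 1$ parameters for $n$ binary variables).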

The Representation Theorem – Joint Distribution to BN

BN: Encodes independence assumptions

• If the conditional independencies in the BN are a subset of the conditional independencies in P, then we obtain the joint probability distribution (it factorizes over the BN).

Real Bayesian networks applications

• Diagnosis of lymph node disease
• Speech recognition
• Microsoft Office and Windows
  • http://www.research.microsoft.com/research/dtg/
• Study of the human genome
• Robot mapping
• Robots to identify meteorites to study
• Modeling fMRI data
• Anomaly detection
• Fault diagnosis
• Modeling sensor network data

A general Bayes net

• Set of random variables
• Directed acyclic graph
  • Encodes independence assumptions
• CPTs
• Joint distribution:
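
The joint distribution line is an equation image on the slide; the standard BN factorization it refers to is

$$P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\!\left(X_i \mid \mathbf{Pa}_{X_i}\right).$$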

How many parameters in a BN?

• Discrete variables X1, …, Xn
• Graph
  • Defines parents of Xi, PaXi
• CPTs – P(Xi | PaXi)
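
The count is worked on the board; the standard tally is

$$\#\text{parameters} \;=\; \sum_{i=1}^{n} \bigl(|\mathrm{Val}(X_i)| - 1\bigr) \prod_{X_j \in \mathrm{Pa}_{X_i}} |\mathrm{Val}(X_j)|,$$

e.g., if the five variables of the sinus network are binary this gives $1+1+4+2+2 = 10$ parameters, versus $2^5 - 1 = 31$ for an unstructured joint.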

Another example

• Variables:
  • B – Burglar
  • E – Earthquake
  • A – Burglar alarm
  • N – Neighbor calls
  • R – Radio report
• Both burglars and earthquakes can set off the alarm
• If the alarm sounds, a neighbor may call
• An earthquake may be announced on the radio

Another example – Building the BN

• B – Burglar
• E – Earthquake
• A – Burglar alarm
• N – Neighbor calls
• R – Radio report
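
The graph itself is drawn on the board; the structure suggested by the statements on the previous slide is $B \to A \leftarrow E$, $A \to N$, $E \to R$, giving

$$P(B,E,A,N,R) \;=\; P(B)\,P(E)\,P(A \mid B, E)\,P(N \mid A)\,P(R \mid E)$$

(10 parameters if all five variables are binary, versus 31 for the full joint).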

Independencies encoded in BN

• We said: all you need is the local Markov assumption
  • (Xi ⊥ NonDescendantsXi | PaXi)
• But then we talked about other (in)dependencies
  • e.g., explaining away
• What are the independencies encoded by a BN?
  • Only assumption is local Markov
  • But many others can be derived using the algebra of conditional independencies!!!

Understanding independencies in BNs – BNs with 3 nodes

Local Markov Assumption: A variable X is independent of its non-descendants given its parents

• Indirect causal effect: X → Z → Y
• Indirect evidential effect: X ← Z ← Y
• Common cause: X ← Z → Y
• Common effect: X → Z ← Y

Understanding independencies in BNs – Some examples

[Figure: example DAG over variables A, B, C, D, E, F, G, H, I, J, K]

An active trail – Example

[Figure: DAG over A, B, C, D, E, F, F’, F’’, G, H]

When are A and H independent?

Active trails formalized

• A path X1 – X2 – ··· – Xk is an active trail when variables O ⊆ {X1,…,Xn} are observed if, for each consecutive triplet in the trail (see the sketch below):
  • Xi-1 → Xi → Xi+1, and Xi is not observed (Xi ∉ O)
  • Xi-1 ← Xi ← Xi+1, and Xi is not observed (Xi ∉ O)
  • Xi-1 ← Xi → Xi+1, and Xi is not observed (Xi ∉ O)
  • Xi-1 → Xi ← Xi+1, and Xi is observed (Xi ∈ O) or one of its descendants is observed
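
A small Python sketch of the triplet rules above (written for this transcript; the helper names and the example network are mine): it walks a given path, classifies each consecutive triplet as a v-structure or not, and applies the corresponding rule.

```python
def is_active_trail(path, parents, observed):
    """Return True if the path X1 - ... - Xk is an active trail when the
    variables in `observed` are observed.  `parents` maps node -> set of parents."""
    def descendants(node):
        children = {}
        for v, ps in parents.items():
            for p in ps:
                children.setdefault(p, set()).add(v)
        stack, seen = [node], set()
        while stack:
            for c in children.get(stack.pop(), ()):
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        return seen

    for a, b, c in zip(path, path[1:], path[2:]):
        if a in parents[b] and c in parents[b]:
            # v-structure a -> b <- c: active only if b or one of its descendants is observed
            if b not in observed and not (descendants(b) & observed):
                return False
        else:
            # causal chain or common cause: active only if b is NOT observed
            if b in observed:
                return False
    return True

# The five-variable sinus network from earlier slides:
parents = {'Flu': set(), 'Allergy': set(), 'Sinus': {'Flu', 'Allergy'},
           'Headache': {'Sinus'}, 'Nose': {'Sinus'}}
print(is_active_trail(['Flu', 'Sinus', 'Headache'], parents, set()))      # True  (causal chain)
print(is_active_trail(['Flu', 'Sinus', 'Headache'], parents, {'Sinus'}))  # False (blocked by Sinus)
print(is_active_trail(['Flu', 'Sinus', 'Allergy'], parents, set()))       # False (inactive v-structure)
print(is_active_trail(['Flu', 'Sinus', 'Allergy'], parents, {'Nose'}))    # True  (descendant of Sinus observed)
```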

Active trails and independence?

• Theorem: Variables Xi and Xj are independent given Z ⊆ {X1,…,Xn} if there is no active trail between Xi and Xj when the variables Z ⊆ {X1,…,Xn} are observed

[Figure: example DAG over variables A, B, C, D, E, F, G, H, I, J, K]

The BN Representation Theorem

• If the conditional independencies in the BN are a subset of the conditional independencies in P, then we obtain the joint probability distribution (it factorizes over the BN).
  • Important because: every P has at least one BN structure G
• If the joint probability distribution factorizes over the BN, then the conditional independencies in the BN are a subset of the conditional independencies in P.
  • Important because: read independencies of P from the BN structure G

“Simpler” BNs

• A distribution can be represented by many BNs:
• Simpler BN requires fewer parameters

Learning Bayes nets

[Table filled in on the board – rows: fully observable data, missing data; columns: known structure, unknown structure]

Data {x(1), …, x(m)}  →  structure and parameters (CPTs – P(Xi | PaXi))

Learning the CPTs

For each discrete variable Xi (see the counting sketch below)

Data: x(1), …, x(m)
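
The estimator the slide sets up on the board is simple counting (maximum likelihood); a minimal sketch, assuming fully observed records stored as dicts (the function and variable names are mine):

```python
from collections import Counter

def mle_cpt(records, child, parent_list):
    """Estimate P(child | parents) by counting, from fully observed data.
    `records` is a list of dicts mapping variable name -> value."""
    joint = Counter((tuple(r[p] for p in parent_list), r[child]) for r in records)
    parent_counts = Counter(tuple(r[p] for p in parent_list) for r in records)
    return {(pa, val): n / parent_counts[pa] for (pa, val), n in joint.items()}

# e.g., P(Sinus | Flu, Allergy) from records over the five sinus-network variables:
# cpt = mle_cpt(records, 'Sinus', ['Flu', 'Allergy'])
```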

Queries in Bayes nets

• Given BN, find:
  • Probability of X given some evidence, P(X | e)
  • Most probable explanation, max_{x1,…,xn} P(x1,…,xn | e)
  • Most informative query
• Learn more about these next class

What you need to know

• Bayesian networks
  • A compact representation for large probability distributions
  • Not an algorithm
• Semantics of a BN
  • Conditional independence assumptions
• Representation
  • Variables
  • Graph
  • CPTs
• Why BNs are useful
• Learning CPTs from fully observable data
• Play with applet!!! ☺

Acknowledgements

• JavaBayes applet
  • http://www.pmr.poli.usp.br/ltd/Software/javabayes/Home/index.html
