Machine Learning
• Supervised learning
• Unsupervised learning
• Graphical models
• Learning relational models
Supervised learning
[Figure: a training set with input columns X1, X2, X3 and yes/no target column T; a learned hypothesis; and testing data (e.g., "B A S → ?", "Y C S → ?") whose labels the hypothesis must predict.]
Key issue: generalization
[Figure: labeled yes/no points together with two unlabeled "?" test points to classify.]
Can’t just memorize the training set (overfitting)
Hypothesis spaces
• Decision trees
• Neural networks
• K-nearest neighbors
• Naïve Bayes classifier
• Support vector machines (SVMs)
• Boosted decision stumps
• …
Perceptron
(neural net with no hidden layers)
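As a concrete illustration (not from the slides, which only show the model), here is a minimal sketch of the perceptron learning rule in Python; the function name and data shapes are my own:

    import numpy as np

    def perceptron_train(X, y, epochs=100):
        # X: (n, d) inputs; y: labels in {-1, +1}
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (xi @ w + b) <= 0:      # misclassified (or on the boundary)
                    w, b = w + yi * xi, b + yi  # nudge the separator toward xi
        return w, b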
The linear separator with the largest margin is the best one to pick.
[Figure: several separating lines; the max-margin separator and its margin are highlighted.]
What if the data is not linearly separable?
Kernel trick
[Figure: 2-D inputs (x1, x2) mapped by a kernel feature map to 3-D features (z1, z2, z3), where the data becomes linearly separable.]
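A tiny sketch of the idea behind the figure (my own example): a quadratic kernel on 2-D inputs equals an ordinary inner product in the 3-D feature space (z1, z2, z3), so we never need to compute the mapping explicitly.

    import numpy as np

    def phi(x):
        # explicit quadratic feature map: (x1, x2) -> (z1, z2, z3)
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

    def quad_kernel(u, v):
        # K(u, v) = (u . v)^2, computed directly in input space
        return float(np.dot(u, v)) ** 2

    u, v = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    assert np.isclose(quad_kernel(u, v), phi(u) @ phi(v))  # same inner product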
Boosting
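The slide's figure is not recoverable, so here is a minimal AdaBoost sketch over decision stumps (cf. the "boosted decision stumps" hypothesis space above); the brute-force stump search and all names are illustrative:

    import numpy as np

    def stump(X, feat, thresh, sign):
        # a decision stump: a one-split "tree" predicting in {-1, +1}
        return sign * np.where(X[:, feat] > thresh, 1, -1)

    def adaboost(X, y, rounds=10):
        n = len(y)
        w = np.ones(n) / n                      # example weights
        ensemble = []
        for _ in range(rounds):
            # pick the stump with the lowest weighted error (brute force)
            err, f, t, s = min(((w[stump(X, f, t, s) != y].sum(), f, t, s)
                                for f in range(X.shape[1])
                                for t in np.unique(X[:, f])
                                for s in (1, -1)), key=lambda z: z[0])
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
            w *= np.exp(-alpha * y * stump(X, f, t, s))   # upweight mistakes
            w /= w.sum()
            ensemble.append((alpha, f, t, s))
        return ensemble

    def boosted_predict(ensemble, X):
        return np.sign(sum(a * stump(X, f, t, s) for a, f, t, s in ensemble))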
Unsupervised learning
• What if there are no output labels?
K-means clustering
1. Guess the number of clusters, K
2. Guess initial cluster centers, μ1, μ2, …
Iterate:
3. Assign each data point xi to the nearest cluster center
4. Re-compute the cluster centers based on the assignments
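A minimal sketch of these four steps in Python (names and the convergence test are my own; empty clusters are left unhandled):

    import numpy as np

    def kmeans(X, K, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=K, replace=False)]  # step 2
        for _ in range(iters):
            # step 3: assign each point to its nearest center
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = d.argmin(axis=1)
            # step 4: recompute centers as the mean of assigned points
            new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, assign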
AutoClass (Cheeseman et al., 1986)
• EM algorithm for mixtures of Gaussians
• “Soft” version of K-means
• Uses a Bayesian criterion to select K
• Discovered new types of stars from spectral data
• Discovered new classes of proteins and introns from DNA/protein sequence databases
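To show what "soft K-means" means, here is a sketch of one EM iteration for a mixture of Gaussians (my own illustration, not AutoClass itself; AutoClass additionally selects K with a Bayesian criterion):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, pis, mus, covs):
        # E-step: responsibilities R[i, k] = P(cluster k | x_i), "soft" assignments
        R = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=cov)
                             for pi, mu, cov in zip(pis, mus, covs)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances from weighted data
        Nk = R.sum(axis=0)
        pis = Nk / len(X)
        mus = (R.T @ X) / Nk[:, None]
        covs = [(R[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
                for k in range(len(Nk))]
        return pis, mus, covs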
Hierarchical clustering
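A sketch of agglomerative clustering using SciPy (the data, linkage method, and cut level are arbitrary choices for illustration):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 3)                        # 20 points in 3-D
    Z = linkage(X, method="average")                 # bottom-up agglomerative merges
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters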
Principal Component Analysis (PCA)
PCA seeks a projection that best represents the data in a least-squares sense.
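A minimal PCA sketch via the SVD of the centered data (names are my own):

    import numpy as np

    def pca(X, k):
        Xc = X - X.mean(axis=0)                        # center the data
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        W = Vt[:k].T                                   # top-k principal directions
        return Xc @ W, W                               # projected data, basis

The top-k projection is exactly the rank-k linear map minimizing squared reconstruction error, which is the least-squares sense the slide refers to.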
Combining supervised and unsupervised learning
Discovering rules (data mining)

  Occup.   Income  Educ.  Sex  Married  Age
  Student  10k     MA     M    S        22
  Student  20k     PhD    F    S        24
  Doctor   80k     MD     M    M        30
  Retired  30k     HS     F    M        60

Find the most frequent patterns (association rules).
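As a toy sketch of frequent-pattern counting on this table (brute-force enumeration, not a real miner like Apriori; the attribute=value encoding is my own):

    from collections import Counter
    from itertools import combinations

    # each row of the table as a set of attribute=value items
    rows = [
        {"Occup=Student", "Educ=MA",  "Sex=M", "Married=S"},
        {"Occup=Student", "Educ=PhD", "Sex=F", "Married=S"},
        {"Occup=Doctor",  "Educ=MD",  "Sex=M", "Married=M"},
        {"Occup=Retired", "Educ=HS",  "Sex=F", "Married=M"},
    ]

    counts = Counter()
    for row in rows:
        for k in (1, 2):                       # itemsets of size 1 and 2
            counts.update(frozenset(c) for c in combinations(sorted(row), k))

    # "frequent" = support of at least 50% of the rows; e.g. the pair
    # {Occup=Student, Married=S} suggests the rule Occup=Student -> Married=S
    frequent = {items: n for items, n in counts.items() if n >= len(rows) // 2}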
Unsupervised learning: summary
• Clustering
• Hierarchical clustering
• Linear dimensionality reduction (PCA)
• Non-linear dimensionality reduction
• Learning rules
Discovering networks
[Figure: a gene network with nodes g1 … g5 and weighted edges such as w12, w23, w55.]
Qualitative level: Boolean networks
Probabilistic graphical models
• Supports graph-based modeling at various levels of detail
• Models can be learned from noisy, partial data
• Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations…
• But can also model deterministic, causal processes.

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities." -- James Clerk Maxwell

"Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace
Graphical models: outline
• What are graphical models?
• Inference
• Structure learning
Simple probabilistic model: linear regression

  Y = μ + βX + noise,  noise ~ N(0, σ²)

A deterministic (functional) relationship plus noise.
[Figure: scatter of (x, y) pairs with the fitted regression line.]

“Learning” = estimating the parameters μ, β, σ from (x, y) pairs:
• μ is the empirical mean
• β can be estimated by least squares
• σ² is the residual variance
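A least-squares sketch of that estimation in plain NumPy (function name is illustrative):

    import numpy as np

    def fit_linear(x, y):
        # least-squares slope and intercept for y = mu + beta * x + noise
        beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)
        mu = y.mean() - beta * x.mean()       # equals the empirical mean of y
                                              # when x is centered
        sigma2 = np.var(y - (mu + beta * x))  # residual variance
        return mu, beta, sigma2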
Piecewise linear regression
[Figure: Y vs. X with two fitted linear segments.]
• A hidden variable Q chooses which set of parameters to use for predicting Y.
Bayesian Networks
Compact representation of probability distributions via conditional independence.

Qualitative part: a directed acyclic graph (DAG)
• Nodes - random variables
• Edges - direct influence

[Figure: the "Family of Alarm" network: Earthquake → Alarm ← Burglary, Earthquake → Radio, Alarm → Call.]

Quantitative part: conditional probability tables (CPTs), e.g., P(A | E, B):

  E   B    P(a|E,B)  P(¬a|E,B)
  e   b    0.9       0.1
  e   ¬b   0.2       0.8
  ¬e  b    0.9       0.1
  ¬e  ¬b   0.01      0.99

[Figure: the ALARM network for patient monitoring, a 37-node DAG over variables such as PULMEMBOLUS, INTUBATION, VENTLUNG, FIO2, HYPOVOLEMIA, LVFAILURE, CATECHOL, HR, CO, BP, …]
…instead of 2^54 entries for the full joint.
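To make the factorization concrete, a sketch for the five-node family-of-alarm network; the P(A | E, B) numbers come from the table above, while every other CPT value is a made-up placeholder:

    # CPT for P(A=1 | E, B) from the slide; remaining CPTs are placeholders
    # just to make the factorization runnable.
    P_E1 = 0.01                      # placeholder P(Earthquake=1)
    P_B1 = 0.01                      # placeholder P(Burglary=1)
    P_A1 = {(1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.9, (0, 0): 0.01}
    P_R1 = {1: 0.95, 0: 0.01}        # placeholder P(Radio=1 | E)
    P_C1 = {1: 0.70, 0: 0.05}        # placeholder P(Call=1 | A)

    def bern(v, p1):
        return p1 if v == 1 else 1.0 - p1

    def joint(e, b, a, r, c):
        # P(E,B,A,R,C) = P(E) P(B) P(A|E,B) P(R|E) P(C|A): 10 numbers
        # instead of 2^5 - 1 = 31 for an explicit joint table.
        return (bern(e, P_E1) * bern(b, P_B1) * bern(a, P_A1[(e, b)])
                * bern(r, P_R1[e]) * bern(c, P_C1[a]))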
Success stories for graphical models
• Multiple sequence alignment
• Forensic analysis
• Medical and fault diagnosis
• Speech recognition
• Visual tracking
• Channel coding at Shannon limit
• Genetic pedigree analysis
• …
Graphical models: outline
• What are graphical models?
• Inference
• Structure learning
Probabilistic Inference
• Posterior probabilities
  – Probability of any event given any evidence: P(X | E)
[Figure: the Earthquake/Burglary network with Radio, Alarm, and Call.]
Viterbi decoding
Compute the most probable explanation (MPE) of the observed data.
[Figure: an HMM with hidden states X1 → X2 → X3 and observations Y1, Y2, Y3, e.g., the acoustic frames of the spoken word "Tomato".]
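A log-space Viterbi sketch for a discrete HMM (array layout and names are my own):

    import numpy as np

    def viterbi(obs, pi, A, B):
        # obs: observation indices; pi[i]: initial state probs;
        # A[i, j]: transition i -> j; B[i, k]: emission of symbol k in state i
        T, S = len(obs), len(pi)
        logd = np.log(pi) + np.log(B[:, obs[0]])  # best log-prob per end state
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = logd[:, None] + np.log(A)    # scores[i, j]: from i, to j
            back[t] = scores.argmax(axis=0)
            logd = scores.max(axis=0) + np.log(B[:, obs[t]])
        path = [int(logd.argmax())]               # trace back the best path
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]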
Inference: computational issues
Easy → Hard: chains, trees, grids, dense loopy graphs.
[Figure: example graphs of each type, from a simple chain to the densely connected ALARM network.]
Inference: computational issues (cont.)
[Figure: the same easy-to-hard spectrum, highlighting the chain case: hidden states X1 … Xn with observations Y1 … Yn, where exact inference is efficient.]
Bayesian inference
• + Elegant: no distinction between parameters and other hidden variables
• + Can use priors to learn from small data sets (cf. one-shot learning by humans)
• - Math can get hairy
• - Often computationally intractable
Graphical models: outline
• What are graphical models?
• Inference
• Structure learning
Why struggle for accurate structure?
[Figure: the true network (Earthquake and Burglary cause Alarm Set, which causes Sound) next to a version missing an arc and a version with an extra arc.]
Missing an arc:
• Cannot be compensated for by fitting parameters
• Wrong assumptions about domain structure
Adding an arc:
• Increases the number of parameters to be estimated
• Wrong assumptions about domain structure
Score-based learning
Define a scoring function that evaluates how well a structure matches the data.

  E, B, A
  <Y,N,N>
  <Y,Y,Y>
  <N,N,Y>
  <N,Y,Y>
    .
    .
  <N,Y,Y>

[Figure: candidate structures over E, B, A (e.g., E → A ← B and alternatives) scored against the data; search for the highest-scoring structure.]
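As a sketch of what such a score might look like, here is a BIC-style scorer for networks of binary variables; the structure representation (dict of child → parents) and all names are my own:

    import numpy as np
    from itertools import product

    def family_loglik(data, child, parents):
        # maximum-likelihood log-likelihood of one node given its parents,
        # from empirical counts; data: dict column -> array of 0/1
        n = len(data[child])
        ll = 0.0
        for pa_vals in product([0, 1], repeat=len(parents)):
            mask = np.ones(n, dtype=bool)
            for p, v in zip(parents, pa_vals):
                mask &= (data[p] == v)
            for v in (0, 1):
                c = np.sum(mask & (data[child] == v))
                if c:
                    ll += c * np.log(c / mask.sum())
        return ll

    def bic(data, structure):
        # BIC = log-likelihood - (log N / 2) * number of free parameters
        n = len(next(iter(data.values())))
        ll = sum(family_loglik(data, c, ps) for c, ps in structure.items())
        n_params = sum(2 ** len(ps) for ps in structure.values())
        return ll - 0.5 * np.log(n) * n_params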
Problems with local search
Easy to get stuck in local optima.
[Figure: the score landscape S(G | D); the search ("you") is stuck at a local optimum far from the "truth".]
Problems with local search II
Picking a single best model can be misleading.
[Figure: the posterior P(G | D) over structures; a single network over E, B, R, A is just one of many with similar posterior mass.]
Problems with local search II (cont.)
Since picking a single best model can be misleading, average over structures: the posterior probability of a feature f of G (an indicator function on graphs, e.g., f(G) = 1 iff G contains the edge X → Y) is

  P(f | D) = Σ_G f(G) P(G | D)

[Figure: five networks over E, B, R, A, C, each carrying part of the posterior mass P(G | D).]
Bayesian approach: computational issues
• The posterior distribution over structures, P(G | D) ∝ P(D | G) P(G), involves a sum over a super-exponential number of graphs
Structure learning: other issues
• Discovering latent variables
• Learning causal models
• Learning from interventional data
• Active learning
Discovering latent variables
[Figure: (a) a network with a latent variable over the observed variables X, Y, Z, 17 parameters; (b) the same dependencies modeled with no latent variable, 59 parameters.]
Learning from interventional data
• The only way to distinguish between Markov-equivalent networks is to perform interventions, e.g., gene knockouts.
• We need to (slightly) modify our learning algorithms.
[Figure: two Markov-equivalent graphs involving "smoking" that only an intervention can tell apart.]
Learning from relational data: approaches
• Probabilistic relational models (PRMs)
  – Reify a relationship (arc) between nodes (objects) by making it into a node (hypergraph)
ILP for learning protein folding: input
[Figure: positive ("yes") and negative ("no") example protein structures.]
The future of machine learning for bioinformatics?
[Figure: an oracle.]
The future of machine learning for bioinformatics
[Figure: a closed loop: a Learner takes in prior knowledge, the biological literature, and replicated experiments; it produces hypotheses and experiment designs that are tested against the real world, and the results feed back into learning.]
• “Computer assisted pathway refinement”

The end
Decision trees
[Figure: a decision tree with "blue?" at the root, "oval?" and "big?" tests below, and yes/no leaves.]
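A decision tree is just nested tests; here is a sketch mirroring the figure's structure (the leaf labels are illustrative guesses, since they are not recoverable from the slide):

    def classify(x):
        # x: dict of boolean attributes, e.g. {"blue": True, "oval": False, "big": True}
        if x["blue"]:
            return "yes" if x["oval"] else "no"
        return "yes" if x["big"] else "no"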
Feedforward neural network
[Figure: input layer, hidden layer, and output layer, fully connected left to right.]
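A sketch of the forward pass for one hidden layer (activation choices and names are my own):

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        h = np.tanh(W1 @ x + b1)                     # hidden layer activations
        return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))  # sigmoid output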
Nearest Neighbor
– Remember all your data
– When someone asks a question:
  • find the nearest old data point
  • return the answer associated with it
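Those two bullets are almost the whole algorithm; a 1-nearest-neighbor sketch (names are my own):

    import numpy as np

    def nearest_neighbor(X_train, y_train, x_query):
        # "remember all your data"; answer with the nearest point's label
        i = int(np.argmin(np.linalg.norm(X_train - x_query, axis=1)))
        return y_train[i]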
SVM: mathematical details
▪ Training data: pairs (x_i, y_i), each x_i an l-dimensional vector with a true/false flag y_i ∈ {+1, −1}
▪ Separating hyperplane: w·x + b = 0
▪ Margin: 2 / ||w||
▪ Inequalities: y_i (w·x_i + b) ≥ 1 for all i
▪ Support vector expansion: w = Σ_i α_i y_i x_i
▪ Support vectors: the x_i with α_i > 0; they lie on the margin boundary
▪ Decision: f(x) = sign(w·x + b)
[Figure: the max-margin separator with the support vectors on the margin.]
Replace all inner products with kernels
Kernel function: K(x, x′) = φ(x)·φ(x′)
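Substituting the kernel into the support vector expansion gives the kernelized decision rule; a sketch (names are my own, and the support vectors, multipliers, and bias are assumed to come from a trained SVM):

    import numpy as np

    def svm_decision(x, sv_x, sv_y, alphas, b, kernel):
        # f(x) = sign( sum_i alpha_i y_i K(x_i, x) + b )
        s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, sv_y, sv_x))
        return np.sign(s + b)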
SVMs: summary
(− marks a weakness, + a strength)
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
• The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information.
Supervised learning: summary
• Learn a mapping F from inputs to outputs using a training set of (x, t) pairs
• F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
• Algorithms offer a variety of tradeoffs
• Many good books, e.g.,
  – “The Elements of Statistical Learning”, Hastie, Tibshirani, Friedman, 2001
  – “Pattern Classification”, Duda, Hart, Stork, 2001
Inference
• Posterior probabilities
  – Probability of any event given any evidence
• Most likely explanation
  – Scenario that explains the evidence
• Rational decision making
  – Maximize expected utility
  – Value of information
• Effect of intervention
[Figure: the Earthquake/Burglary network with Radio, Alarm, and Call.]
Assumption needed to make learning work
• We need to assume “future futures will resemble past futures” (B. Russell)
• Unlearnable hypothesis: “All emeralds are grue”, where “grue” means: green if observed before time t, blue afterwards.
Structure learning success stories: gene regulation network (Friedman et al.)
Yeast data [Hughes et al., 2000]
• 600 genes
• 300 experiments
Structure learning success stories II: phylogenetic tree reconstruction (Friedman et al.)
Input: biological sequences
  Human  CGTTGC…
  Chimp  CCTAGG…
  Orang  CGAACG…
  …
Output: a phylogeny
Uses structural EM, with a max-spanning-tree computation in the inner loop.
[Figure: the reconstructed tree ("10 billion years" of evolution), with the observed species at the leaves.]
Instances of graphical models
• Probabilistic models ⊃ graphical models
• Directed graphical models: naïve Bayes classifier, mixtures of experts, Kalman filter model, hidden Markov model (HMM), DBNs
• Undirected graphical models: Ising model
ML enabling technologies
• Faster computers
• More data
– The web
– Parallel corpora (machine translation)
– Multiple sequenced genomes
– Gene expression arrays
• New ideas
– Kernel trick
– Large margins
– Boosting
– Graphical models
– …