
An introduction to machine learning and probabilistic graphical models
Overview

• Supervised learning
• Unsupervised learning
• Graphical models
• Learning relational models

2
Supervised learning

Color   Shape   Size   Output
Blue    Torus   Big    Y
Blue    Square  Small  Y
Blue    Star    Small  Y
Red     Arrow   Small  N

Learn to approximate the function F(x1, x2, x3) -> t
from a training set of (x, t) pairs.
3
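As a minimal sketch (not from the original slides) of what "learning F from (x, t) pairs" looks like in practice, the snippet below fits a decision tree to the toy table above. It assumes scikit-learn is available; the encoding step and variable names are my own choices.

```python
# Fit a hypothesis to the toy colour/shape/size table (sketch, assumes scikit-learn).
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# One row per training example: [color, shape, size], with target label t.
X = [["Blue", "Torus", "Big"],
     ["Blue", "Square", "Small"],
     ["Blue", "Star", "Small"],
     ["Red", "Arrow", "Small"]]
t = ["Y", "Y", "Y", "N"]

# Encode the categorical attributes as integers before fitting.
enc = OrdinalEncoder()
Xe = enc.fit_transform(X)

clf = DecisionTreeClassifier().fit(Xe, t)

# The learned hypothesis predicts t for a previously unseen input.
print(clf.predict(enc.transform([["Blue", "Arrow", "Small"]])))
```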
Supervised learning

Training data:
X1 X2 X3 T
B  T  B  Y
B  S  S  Y
B  S  S  Y
R  A  S  N

Testing data:
X1 X2 X3 T
B  A  S  ?
Y  C  S  ?

(Diagram: the learner fits a hypothesis to the training data; the hypothesis then predicts T for the test inputs.)
4
Key issue: generalization

Can’t just memorize the training set (overfitting).
5
Hypothesis spaces
• Decision trees
• Neural networks
• K-nearest neighbors
• Naïve Bayes classifier
• Support vector machines (SVMs)
• Boosted decision stumps
• …

6
Perceptron
(neural net with no hidden layers)

Linearly separable data


7
Which separating hyperplane?

8
The linear separator with the largest margin is the best one to pick.

9
What if the data is not linearly separable?

10
Kernel trick

A kernel implicitly maps the data from 2D (x1, x2) to 3D (z1, z2, z3),
making the problem linearly separable.
11
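A minimal sketch of the idea, not from the slides: an explicit quadratic feature map from 2D to 3D, and the kernel that computes the same inner product without ever forming the 3D vectors. The function names phi and quad_kernel are my own.

```python
import numpy as np

def phi(x1, x2):
    """Explicit 2D -> 3D feature map corresponding to a quadratic kernel."""
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def quad_kernel(a, b):
    """The kernel computes phi(a).phi(b) implicitly, as (a.b)^2."""
    return (a @ b) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(*a) @ phi(*b), quad_kernel(a, b))  # both print 1.0
```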
Support Vector Machines
(SVMs)
• Two key ideas:
– Large margins
– Kernel trick

12
Boosting

Simple classifiers (weak learners) can have their performance boosted
by taking weighted combinations.

Boosting maximizes the margin.
13
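A minimal sketch (assuming scikit-learn, not part of the original slides) of boosting weak learners: AdaBoost's default weak learner is a depth-1 decision tree, i.e. a decision stump, and the booster takes a weighted combination of many of them.

```python
# Boost decision stumps with AdaBoost (sketch, assumes scikit-learn).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The default weak learner is a depth-1 tree ("decision stump");
# 100 boosting rounds combine them with weights.
booster = AdaBoostClassifier(n_estimators=100, random_state=0)
print(booster.fit(X, y).score(X, y))
```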


Supervised learning success stories
– Face detection
– Steering an autonomous car across the US
– Detecting credit card fraud
– Medical diagnosis
– …

14
Unsupervised learning
• What if there are no output labels?

15
K-means clustering
1. Guess number of clusters, K
2. Guess initial cluster centers, μ1, μ2

Iterate (repeat steps 3–4):
3. Assign each data point xi to the nearest cluster center
4. Re-compute the cluster centers based on the assignments
(a minimal code sketch follows below)

16
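A minimal NumPy sketch of the K-means loop described above. The function name kmeans, the data X, and K are placeholders; empty clusters are not handled.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]  # step 2: initial centers
    for _ in range(n_iters):
        # step 3: assign each point to its nearest center
        assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # step 4: recompute each center as the mean of its assigned points
        centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
    return centers, assign
```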
AutoClass (Cheeseman et al., 1986)
• EM algorithm for mixtures of Gaussians
• “Soft” version of K-means
• Uses Bayesian criterion to select K
• Discovered new types of stars from
spectral data
• Discovered new classes of proteins and
introns from DNA/protein sequence
databases
17
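AutoClass itself is not shown here; as a rough modern analogue (an assumption of mine, assuming scikit-learn), the sketch below fits a "soft" K-means, i.e. a mixture of Gaussians trained with EM, and selects K with a Bayesian-style criterion (BIC).

```python
# Mixture of Gaussians via EM, with K chosen by BIC (sketch, assumes scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data
best = min((GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)),
           key=lambda m: m.bic(X))
print(best.n_components)
print(best.predict_proba(X)[:3])   # "soft" cluster responsibilities
```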
Hierarchical clustering

18
Principal Component Analysis (PCA)

PCA seeks a projection that best represents the data in a least-squares sense.

PCA reduces the dimensionality of feature space by restricting attention to
those directions along which the scatter of the cloud is greatest.
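A minimal sketch (not from the slides) of PCA as a least-squares projection via the SVD of the centred data matrix; the function name pca, the array X, and the target dimension d are placeholders.

```python
import numpy as np

def pca(X, d):
    Xc = X - X.mean(axis=0)                      # centre the cloud
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:d]                          # directions of greatest scatter
    return Xc @ components.T                     # d-dimensional projection
```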
Discovering nonlinear manifolds

20
Combining supervised and
unsupervised learning

21
Discovering rules (data mining)

Occup.   Income  Educ.  Sex  Married  Age
Student  10k     MA     M    S        22
Student  20k     PhD    F    S        24
Doctor   80k     MD     M    M        30
Retired  30k     HS     F    M        60

Find the most frequent patterns (association rules):

Num in household = 1 ^ num children = 0 => language = English

Language = English ^ Income < $40k ^ Married = false ^ num children = 0
  => education ∈ {college, grad school}

22
Unsupervised learning:
summary
• Clustering
• Hierarchical clustering
• Linear dimensionality reduction (PCA)
• Non-linear dim. reduction
• Learning rules

23
Discovering networks

From data visualization to causal discovery 24


Networks in biology
• Most processes in the cell are controlled
by networks of interacting molecules:
– Metabolic Network
– Signal Transduction Networks
– Regulatory Networks
• Networks can be modeled at multiple levels of detail/realism
  (in order of decreasing detail):
– Molecular level
– Concentration level
– Qualitative level
25
Molecular level: Lysis-Lysogeny circuit in Lambda phage

Arkin et al. (1998), Genetics 149(4):1633-48

5 genes, 67 parameters based on 50 years of research.
Stochastic simulation required a supercomputer.
26
Concentration level: metabolic pathways
• Usually modeled with differential equations

(Figure: a network of genes g1–g5 with interaction weights wij)
27
Qualitative level: Boolean
Networks

28
Probabilistic graphical models
• Supports graph-based modeling at various levels
of detail
• Models can be learned from noisy, partial data
• Can model “inherently” stochastic phenomena,
e.g., molecular-level fluctuations…
• But can also model deterministic, causal
processes.
"The actual science of logic is conversant at present only with
things either certain, impossible, or entirely doubtful. Therefore
the true logic for this world is the calculus of probabilities."
-- James Clerk Maxwell
"Probability theory is nothing but common sense reduced to
calculation." -- Pierre Simon Laplace 29
Graphical models: outline
• What are graphical models?
• Inference
• Structure learning

30
Simple probabilistic model: linear regression

Y = μ + βX + noise

(The μ + βX term is the deterministic, functional part of the relationship.)
31
Simple probabilistic model: linear regression

Y = μ + βX + noise

“Learning” = estimating the parameters μ, β, σ from (x, y) pairs:
• μ is the empirical mean
• β can be estimated by least squares
• σ is the residual variance
32
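A minimal sketch (mine, not the slide's) of those estimates; fit_linear is a hypothetical helper and x, y are placeholder arrays.

```python
import numpy as np

def fit_linear(x, y):
    beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # least-squares slope
    mu = y.mean() - beta * x.mean()                    # intercept (the empirical mean when x is centred)
    sigma2 = np.mean((y - (mu + beta * x)) ** 2)       # residual variance
    return mu, beta, np.sqrt(sigma2)
```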
Piecewise linear regression

Latent “switch” variable – hidden process at work 33


Probabilistic graphical model for piecewise linear regression

(Graph: input X → Q and X → Y; Q → Y, the output)

• Hidden variable Q chooses which set of parameters to use for predicting Y.
• The value of Q depends on the value of the input X.
• This is an example of “mixtures of experts”.

Learning is harder because Q is hidden, so we don’t know which data points to
assign to each line; this can be solved with EM (cf. K-means).
34
Classes of graphical models

Probabilistic models
  Graphical models
    Directed: Bayes nets (including DBNs)
    Undirected: MRFs

35
Bayesian Networks

Compact representation of probability distributions via conditional independence.

Qualitative part: a directed acyclic graph (DAG)
• Nodes - random variables
• Edges - direct influence
(Family-of-Alarm example: Earthquake → Radio, Earthquake → Alarm,
Burglary → Alarm, Alarm → Call)

Quantitative part: a set of conditional probability distributions, e.g. for Alarm:

 E   B    P(A | E,B)   P(¬A | E,B)
 e   b       0.9          0.1
 e   ¬b      0.2          0.8
 ¬e  b       0.9          0.1
 ¬e  ¬b      0.01         0.99

Together they define a unique distribution in a factored form.
36
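To illustrate the “factored form”, here is a minimal sketch (not from the slides) that computes the joint over the five Alarm-network variables as a product of local tables. Only P(A | E,B) comes from the table above; the other numbers are made-up placeholders.

```python
# Joint distribution of the toy Alarm network as a product of local factors (sketch).
P_E = {True: 0.01, False: 0.99}                       # P(E), assumed prior
P_B = {True: 0.01, False: 0.99}                       # P(B), assumed prior
P_A = {(True, True): 0.9, (True, False): 0.2,         # P(A=1 | E, B), from the slide's table
       (False, True): 0.9, (False, False): 0.01}
P_R = {True: 0.5, False: 0.0}                         # P(R=1 | E), assumed
P_C = {True: 0.8, False: 0.05}                        # P(C=1 | A), assumed

def joint(e, b, a, r, c):
    """P(E=e, B=b, A=a, R=r, C=c) = P(e) P(b) P(a|e,b) P(r|e) P(c|a)."""
    pa = P_A[(e, b)] if a else 1 - P_A[(e, b)]
    pr = P_R[e] if r else 1 - P_R[e]
    pc = P_C[a] if c else 1 - P_C[a]
    return P_E[e] * P_B[b] * pa * pr * pc

print(joint(True, False, True, True, True))
```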
Example: “ICU Alarm” network

Domain: monitoring intensive-care patients
• 37 variables
• 509 parameters …instead of 2^54

(Figure: the full ICU Alarm network, with nodes such as MINVOLSET, PULMEMBOLUS,
INTUBATION, KINKEDTUBE, VENTMACH, DISCONNECT, VENTLUNG, ANAPHYLAXIS,
HYPOVOLEMIA, LVFAILURE, CATECHOL, HR, CO, BP, …)
37
Success stories for graphical
models
• Multiple sequence alignment
• Forensic analysis
• Medical and fault diagnosis
• Speech recognition
• Visual tracking
• Channel coding at Shannon limit
• Genetic pedigree analysis
• …
38
Graphical models: outline
• What are graphical models?
• Inference
• Structure learning

39
Probabilistic Inference
• Posterior probabilities
– Probability of any event given any evidence
• P(X|E)

(Graph: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call)
40
Viterbi decoding
Compute the most probable explanation (MPE) of the observed data.

Hidden Markov Model (HMM):
X1 → X2 → X3 (hidden states)
Y1, Y2, Y3 (observations, e.g. the acoustic signal for “Tomato”)
41
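A minimal sketch (mine, not the slide's) of Viterbi decoding for a discrete HMM; the parameter arrays are placeholders, and a real implementation would work in log space to avoid underflow.

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """Most probable hidden state path.
    obs: observation indices; start[i], trans[i, j], emit[i, o]: probabilities."""
    delta = start * emit[:, obs[0]]                      # best score ending in each state
    back = []
    for o in obs[1:]:
        scores = delta[:, None] * trans * emit[None, :, o]
        back.append(scores.argmax(axis=0))               # best predecessor of each state
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for bp in reversed(back):                            # backtrack
        path.append(int(bp[path[-1]]))
    return path[::-1]
```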
Inference: computational issues

Easy: chains, trees
Hard: grids; dense, loopy graphs

(Figure: example graphs of each kind, including the full ICU Alarm network)
42
Many different inference algorithms exist, both exact and approximate.
43
Bayesian inference
• Bayesian probability treats parameters as random variables
• Learning/parameter estimation is replaced by probabilistic inference P(θ|D)
• Example: Bayesian linear regression; the parameters are θ = (μ, β, σ)

(Graph: θ is a parent of every Yi, with Xi → Yi for i = 1,…,n;
the parameters are tied/shared across repetitions of the data)
44
Bayesian inference
• + Elegant – no distinction between
parameters and other hidden variables
• + Can use priors to learn from small data
sets (c.f., one-shot learning by humans)
• - Math can get hairy
• - Often computationally intractable

45
Graphical models: outline
• What are graphical models?
• Inference
• Structure learning

46
Why Struggle for Accurate Structure?

(True structure: Earthquake → Alarm Set ← Burglary, Alarm Set → Sound)

Missing an arc:
• Cannot be compensated for by fitting parameters
• Wrong assumptions about domain structure

Adding an arc:
• Increases the number of parameters to be estimated
• Wrong assumptions about domain structure
47
Score-based Learning

Define a scoring function that evaluates how well a structure matches the data:

Data over E, B, A:
<Y,N,N>
<Y,Y,Y>
<N,N,Y>
<N,Y,Y>
...
<N,Y,Y>

(Figure: several candidate structures over E, B, A are scored against the data)

Search for a structure that maximizes the score.
48
Learning Trees

• Can find the optimal tree structure in O(n² log n) time: just find the
max-weight spanning tree (a minimal sketch follows below)
• If some of the variables are hidden, the problem becomes hard again,
but EM can be used to fit mixtures of trees
49
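A minimal sketch of the spanning-tree step, in the Chow-Liu spirit: the choice of empirical mutual information as the edge weight is an assumption beyond the slide, and networkx plus scikit-learn are assumed to be available.

```python
# Tree structure learning via a maximum-weight spanning tree (sketch).
import networkx as nx
from sklearn.metrics import mutual_info_score

def learn_tree(data):
    """data: 2-D integer array (n samples x d discrete variables)."""
    n, d = data.shape
    G = nx.Graph()
    for i in range(d):
        for j in range(i + 1, d):
            # Edge weight = empirical mutual information between variables i and j.
            G.add_edge(i, j, weight=mutual_info_score(data[:, i], data[:, j]))
    return nx.maximum_spanning_tree(G)   # the max-weight spanning-tree step
```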
Heuristic Search
• Learning arbitrary graph structure is NP-hard,
so it is common to resort to heuristic search
• Define a search space:
– search states are possible structures
– operators make small changes to structure
• Traverse the space looking for high-scoring structures
• Search techniques (a hill-climbing sketch follows below):
– Greedy hill-climbing
– Best-first search
– Simulated annealing
– ...
50
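A minimal sketch (mine) of greedy hill-climbing over structures. The score(structure, data) function and the neighbours(structure) generator, which would apply the add/delete/reverse-edge operators, are placeholders.

```python
def hill_climb(initial, data, score, neighbours):
    """Greedy hill-climbing: repeatedly move to the best-scoring neighbour."""
    current, current_score = initial, score(initial, data)
    while True:
        best, best_score = None, current_score
        for candidate in neighbours(current):      # add / delete / reverse an edge
            s = score(candidate, data)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:                           # no neighbour improves: local optimum
            return current
        current, current_score = best, best_score
```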
Local Search Operations

• Typical operations on a structure over S, C, E, D:
– Add an edge (e.g., add C→D)
– Delete an edge (e.g., delete C→E)
– Reverse an edge (e.g., reverse C→E)

• Each operation changes the score locally, e.g. for adding C→D:
Δscore = S({C,E} → D) − S({E} → D)
51
Problems with local search

Easy to get stuck in local optima.

(Figure: the score surface S(G|D), with the search stuck at a local optimum far from the “truth”)
52
Problems with local search II

Picking a single best model can be misleading.

(Figure: several distinct structures over E, B, R, A, C with nearly equal posterior P(G|D))

– Small sample size ⇒ many high-scoring models
– An answer based on one model is often useless
– Want features common to many models
54
Bayesian Approach to Structure Learning

• Posterior distribution over structures
• Estimate the probability of features:
– Edge X→Y
– Path X→ … →Y
– …

P(f | D) = Σ_G f(G) P(G | D)

where f(G) is the indicator function for feature f (a feature of G, e.g. the edge X→Y)
and P(G | D) is the Bayesian score for G.
55
Bayesian approach: computational issues

• Posterior distribution over structures:
how to compute the sum over a super-exponential number of graphs?
• MCMC over networks
• MCMC over node-orderings (Rao-Blackwellisation)
56
Structure learning: other issues
• Discovering latent variables
• Learning causal models
• Learning from interventional data
• Active learning

57
Discovering latent variables

(Figure: two networks, (a) with 17 parameters and (b) with 59 parameters)

There are some techniques for automatically detecting the possible presence of
latent variables.
58
Learning causal models
• So far, we have only assumed that X -> Y
-> Z means that Z is independent of X
given Y.
• However, we often want to interpret
directed arrows causally.
• This is uncontroversial for the arrow of
time.
• But can we infer causality from static
observational data?
59
Learning causal models
• We can infer causality from static observational
data if we have at least four measured variables
and certain “tetrad” conditions hold.
• See books by Pearl and Spirtes et al.
• However, we can only learn up to Markov equivalence, no matter how much
data we have. For example, X→Y→Z, X←Y←Z and X←Y→Z are Markov equivalent
(and so indistinguishable from data), whereas X→Y←Z is not.
60
Learning from interventional data

• The only way to distinguish between Markov-equivalent networks is to
perform interventions, e.g., gene knockouts.
• We need to (slightly) modify our learning algorithms: cut the arcs coming
into nodes which were set by intervention.

(Figure: smoking → yellow fingers; under intervention, the arc into “yellow fingers” is cut)

P(smoker | observe(yellow)) >> prior,   but   P(smoker | do(paint yellow)) = prior
61
Active learning
• Which experiments (interventions) should
we perform to learn structure as efficiently
as possible?
• This problem can be modeled using
decision theory.
• Exact solutions are wildly computationally
intractable.
• Can we come up with good approximate
decision making techniques?
• Can we implement hardware to automatically perform the experiments?
• “AB: Automated Biologist”
62
Learning from relational data
Can we learn concepts from a set of relations between objects,
instead of/ in addition to just their attributes?

63
Learning from relational data:
approaches
• Probabilistic relational models (PRMs)
– Reify a relationship (arcs) between nodes (objects) by making it into a node
(hypergraph)
• Inductive Logic Programming (ILP)
– Top-down, e.g., FOIL (a generalization of C4.5)
– Bottom-up, e.g., PROGOL (inverse deduction)

64
ILP for learning protein folding: input

(Figure: positive and negative example protein structures)

TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …

100 conjuncts describing the structure of each pos/neg example
65
ILP for learning protein folding:
results
• PROGOL learned the following rule to
predict if a protein will form a “four-helical
up-and-down bundle”:

• In English: “The protein P folds if it


contains a long helix h1 at a secondary
structure position between 1 and 3 and h1
is next to a second helix”
66
ILP: Pros and Cons
• + Can discover new predicates (concepts)
automatically
• + Can learn relational models from
relational (or flat) data
• - Computationally intractable
• - Poor handling of noise

67
The future of machine learning for bioinformatics?

(Figure: the learner treated as an “Oracle”)
68

The future of machine learning for bioinformatics

(Figure: a learner that combines prior knowledge, the biological literature and
replicated experiments to produce hypotheses, which drive experiment design in
the real world)
69
• “Computer-assisted pathway refinement”
The end

70
Decision trees

(Example: a tree that tests “blue?”, then “oval?”, then “big?”, with yes/no leaf labels)
71
Decision trees

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power
72
Feedforward neural network

(Figure: input layer → hidden layer → output; weights on each arc, a sigmoid function at each node)
73
Feedforward neural network

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
74
Nearest Neighbor
– Remember all your data
– When someone asks a question,
• find the nearest old data point
• return the answer associated with it

75
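A minimal sketch (mine) of the nearest-neighbour rule just described; the helper name, the stored arrays train_X and train_y, and the Euclidean metric are all placeholder choices.

```python
import numpy as np

def nearest_neighbour(query, train_X, train_y):
    """Return the label of the stored point closest to the query."""
    dists = np.linalg.norm(train_X - query, axis=1)   # distance to every remembered point
    return train_y[int(dists.argmin())]               # answer associated with the nearest one
```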
Nearest Neighbor

- Handles mixed variables


- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
76
Support Vector Machines
(SVMs)
• Two key ideas:
– Large margins are good
– Kernel trick

77
SVM: mathematical details
▪ Training data: l-dimensional vectors, each with a true/false flag
▪ Separating hyperplane
▪ Margin
▪ Inequalities
▪ Support vector expansion
▪ Support vectors
▪ Decision rule
(the formulas these bullets refer to are sketched below)
78
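The slide's own formulas were images and are not reproduced here; the following is a hedged reconstruction of the standard linear-SVM quantities the bullets refer to, in my notation rather than necessarily the slide's.

```latex
% Standard linear-SVM quantities (reconstruction, not the slide's exact notation).
\begin{align*}
\text{Training data:}\quad & \{(\mathbf{x}_i, y_i)\}_{i=1}^{N},\quad
    \mathbf{x}_i \in \mathbb{R}^{l},\; y_i \in \{+1,-1\}\\
\text{Separating hyperplane:}\quad & \mathbf{w}^{\top}\mathbf{x} + b = 0\\
\text{Margin:}\quad & 2 / \lVert \mathbf{w} \rVert\\
\text{Inequalities:}\quad & y_i\bigl(\mathbf{w}^{\top}\mathbf{x}_i + b\bigr) \ge 1 \quad \forall i\\
\text{Support vector expansion:}\quad & \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i
    \quad (\alpha_i > 0 \text{ only for the support vectors})\\
\text{Decision:}\quad & f(\mathbf{x}) = \operatorname{sign}\bigl(\mathbf{w}^{\top}\mathbf{x} + b\bigr)
\end{align*}
```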
Replace all inner products with
kernels

Kernel function

79
SVMs: summary
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

General lessons from SVM success:

• The kernel trick can be used to make many linear methods non-linear, e.g.,
kernel PCA, kernelized mutual information
• Large-margin classifiers are good
80
Boosting: summary
• Can boost any weak learner
• Most commonly: boosted decision
“stumps”
+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power

81
Supervised learning: summary
• Learn mapping F from inputs to outputs using a
training set of (x,t) pairs
• F can be drawn from different hypothesis
spaces, e.g., decision trees, linear separators,
linear in high dimensions, mixtures of linear
• Algorithms offer a variety of tradeoffs
• Many good books, e.g.,
– “The elements of statistical learning”,
Hastie, Tibshirani, Friedman, 2001
– “Pattern classification”, Duda, Hart, Stork, 2001

82
Inference
• Posterior probabilities
– Probability of any event given any evidence
• Most likely explanation
– Scenario that explains evidence
• Rational decision making
– Maximize expected utility
– Value of information
• Effect of intervention

(Graph: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call)
83
Assumption needed to make
learning work
• We need to assume “Future futures will
resemble past futures” (B. Russell)
• Unlearnable hypothesis: “All emeralds are
grue”, where “grue” means:
green if observed before time t, blue
afterwards.

84
Structure learning success stories: gene regulation network (Friedman et al.)

Yeast data [Hughes et al., 2000]
• 600 genes
• 300 experiments
85
Structure learning success stories II: phylogenetic tree reconstruction (Friedman et al.)

Input: biological sequences
Human CGTTGC…
Chimp CCTAGG…
Orang CGAACG…
…

Output: a phylogeny (figure: a tree spanning ~10 billion years, with the observed sequences at the leaves)

Uses structural EM, with max-spanning-tree in the inner loop.
86
Instances of graphical models

Probabilistic models
  Graphical models
    Directed: Bayes nets
      – Naïve Bayes classifier
      – Mixtures of experts
      – DBNs: Kalman filter model, Hidden Markov Model (HMM)
    Undirected: MRFs
      – Ising model
87
ML enabling technologies
• Faster computers
• More data
– The web
– Parallel corpora (machine translation)
– Multiple sequenced genomes
– Gene expression arrays
• New ideas
– Kernel trick
– Large margins
– Boosting
– Graphical models
– … 88
