The Expectation Maximization (EM) Algorithm
General Idea
▪ Start by devising a noisy channel
▪ Any model that predicts the corpus observations via some hidden structure (tags, parses, …)
▪ Initially guess the parameters of the model!
▪ Educated guess is best, but random can work
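A minimal sketch of this recipe on a toy problem (a mixture of two biased coins, not the corpus/parsing setting of these slides); the data, initial guesses, and update rule are all illustrative assumptions:

```python
# Toy EM sketch: each observed sequence of 10 flips was generated by one of
# two hidden coins with unknown biases. Guess the parameters, then alternate
# E step (fractional assignment to coins) and M step (reestimate biases).
import random

random.seed(0)
true_biases = [0.8, 0.3]                      # hidden truth, used only to simulate data
data = [sum(random.random() < true_biases[random.choice([0, 1])] for _ in range(10))
        for _ in range(50)]                   # number of heads out of 10 flips

theta = [0.6, 0.5]                            # initial guess (educated or random)
for _ in range(30):
    heads_w = [0.0, 0.0]
    flips_w = [0.0, 0.0]
    for h in data:
        # E step: posterior responsibility of each hidden coin for this sequence
        lik = [theta[c] ** h * (1 - theta[c]) ** (10 - h) for c in (0, 1)]
        z = lik[0] + lik[1]
        for c in (0, 1):
            r = lik[c] / z                    # fractional count for coin c
            heads_w[c] += r * h
            flips_w[c] += r * 10
    # M step: reestimate biases from the fractional counts
    theta = [heads_w[c] / flips_w[c] for c in (0, 1)]

print(theta)   # should drift toward the true biases (up to label swap)
```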
Grammar Reestimation
[Figure: the reestimation loop: test sentences go into a PARSER, which produces test trees (E step); the grammar is reestimated from those trees (M step). A scorer compares the output with correct test trees to measure accuracy (annotation: expensive and/or wrong sublanguage).]
▪ Real EM
▪ Expectation: find all parses of each sentence
▪ Maximization: retrain on all parses in proportion to their probability (as if we observed fractional counts; a small sketch follows below)
▪ Advantage: p(training corpus) guaranteed to increase
▪ Exponentially many parses, so don’t extract them from chart – need some kind of clever counting
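A hedged sketch of that M step on hypothetical data: the parses, rule lists, and probabilities below are made up and deliberately abbreviated, and a real system would read these fractional counts off the chart (inside-outside) rather than enumerating parses:

```python
# M step "retrain on all parses in proportion to their probability":
# weight each parse's rules by the parse's normalized probability,
# then renormalize per left-hand side to get new rule probabilities.
from collections import defaultdict

# hypothetical parses of one sentence: (rules used, parse probability)
parses = [
    ([("S", "NP VP"), ("NP", "time"), ("VP", "V NP"), ("V", "flies"), ("NP", "Det N")], 3e-7),
    ([("S", "NP VP"), ("NP", "time"), ("VP", "V PP"), ("V", "flies"), ("PP", "P NP")], 9e-7),
]

total = sum(p for _, p in parses)
frac_counts = defaultdict(float)              # fractional count of each rule
lhs_counts = defaultdict(float)               # fractional count of each left-hand side
for rules, p in parses:
    posterior = p / total                     # this parse's share of the sentence
    for lhs, rhs in rules:
        frac_counts[(lhs, rhs)] += posterior
        lhs_counts[lhs] += posterior

# reestimated conditional rule probabilities p(lhs -> rhs | lhs)
new_probs = {(lhs, rhs): c / lhs_counts[lhs] for (lhs, rhs), c in frac_counts.items()}
for rule, prob in sorted(new_probs.items()):
    print(rule, round(prob, 3))
```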
Examples of EM
▪ Finite-State case: Hidden Markov Models
▪ “forward-backward” or “Baum-Welch” algorithm
▪ Applications:
▪ explain ice cream in terms of underlying weather sequence (see the forward-backward sketch after this list)
▪ explain words in terms of underlying tag sequence
▪ explain phoneme sequence in terms of underlying word sequence
▪ explain sound sequence in terms of underlying phoneme sequence
(compose these?)
▪ Context-Free case: Probabilistic CFGs
▪ “inside-outside” algorithm: unsupervised grammar learning!
▪ Explain raw text in terms of underlying context-free parses
▪ In practice, the local-maximum problem gets in the way
▪ But can improve a good starting grammar via raw text
▪ Clustering case: explain points via clusters
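A rough sketch of the forward-backward E step on the ice-cream/weather example; the transition and emission probabilities are illustrative guesses rather than the lecture's numbers, and only the per-day state posteriors (the fractional counts an M step would use) are printed:

```python
# Forward-backward on a tiny HMM: hidden Hot/Cold weather, observed
# daily ice-cream counts. alpha = forward probabilities, beta = backward.
states = ["H", "C"]
start = {"H": 0.5, "C": 0.5}
trans = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.2, "C": 0.8}}
emit = {"H": {1: 0.1, 2: 0.2, 3: 0.7}, "C": {1: 0.7, 2: 0.2, 3: 0.1}}

obs = [2, 3, 3, 2, 1, 1]                      # ice creams eaten each day
n = len(obs)

# forward: alpha[t][s] = p(obs[0..t], state_t = s)
alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
for t in range(1, n):
    alpha.append({s: emit[s][obs[t]] * sum(alpha[t - 1][r] * trans[r][s] for r in states)
                  for s in states})

# backward: beta[t][s] = p(obs[t+1..] | state_t = s)
beta = [None] * n
beta[n - 1] = {s: 1.0 for s in states}
for t in range(n - 2, -1, -1):
    beta[t] = {s: sum(trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r] for r in states)
               for s in states}

evidence = sum(alpha[n - 1][s] for s in states)   # p(obs)
for t in range(n):
    # posterior weather distribution for day t: the fractional counts
    posterior = {s: alpha[t][s] * beta[t][s] / evidence for s in states}
    print(t, {s: round(p, 3) for s, p in posterior.items()})
```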
Our old friend PCFG
[Figure: parse tree for “time flies like an arrow”: S → NP VP, NP → time, VP → V PP, V → flies, PP → P NP, P → like, NP → Det N, Det → an, N → arrow]

p(time flies like an arrow | S) = p(S → NP VP | S) * p(NP → time | NP) * p(VP → V PP | VP) * p(V → flies | V) * …
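A small sketch of the product on this slide, with made-up rule probabilities (hypothetical values, not estimates from any corpus):

```python
# The probability of one parse of "time flies like an arrow" is the product
# of the conditional rule probabilities p(rhs | lhs) used in the tree.
rule_prob = {
    ("S", ("NP", "VP")): 0.5,
    ("NP", ("time",)): 0.01,
    ("VP", ("V", "PP")): 0.2,
    ("V", ("flies",)): 0.02,
    ("PP", ("P", "NP")): 0.9,
    ("P", ("like",)): 0.3,
    ("NP", ("Det", "N")): 0.4,
    ("Det", ("an",)): 0.1,
    ("N", ("arrow",)): 0.005,
}

# the rules used by the parse on this slide, in top-down order
tree_rules = [
    ("S", ("NP", "VP")), ("NP", ("time",)), ("VP", ("V", "PP")),
    ("V", ("flies",)), ("PP", ("P", "NP")), ("P", ("like",)),
    ("NP", ("Det", "N")), ("Det", ("an",)), ("N", ("arrow",)),
]

p_tree = 1.0
for rule in tree_rules:
    p_tree *= rule_prob[rule]                 # multiply p(rhs | lhs) for each rule
print(p_tree)
```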
Viterbi reestimation for parsing
[Figure: example parse with preterminal sequence NP NP V PRT]
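The slide is cut off here; as a rough sketch of what Viterbi reestimation means in general (not necessarily this lecture's example), keep only the single best parse and count its rules with whole counts rather than fractional ones, reusing the hypothetical parse format from the earlier M-step sketch:

```python
# Viterbi reestimation: instead of fractional counts over all parses,
# take the single highest-probability parse and count its rules once.
from collections import defaultdict

parses = [
    ([("S", "NP VP"), ("NP", "time"), ("VP", "V NP"), ("V", "flies"), ("NP", "Det N")], 3e-7),
    ([("S", "NP VP"), ("NP", "time"), ("VP", "V PP"), ("V", "flies"), ("PP", "P NP")], 9e-7),
]

best_rules, _ = max(parses, key=lambda item: item[1])   # keep only the best parse

counts = defaultdict(float)
lhs_totals = defaultdict(float)
for lhs, rhs in best_rules:
    counts[(lhs, rhs)] += 1.0                 # whole counts, not fractional ones
    lhs_totals[lhs] += 1.0

new_probs = {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}
print(new_probs)
```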