UCAS-AI Pattern Recognition, Lecture 4: Parameter Estimation
Chapter 3: Parameter Estimation (continued)
刘成林([email protected])
October 11, 2017
Teaching assistants: 何文浩 ([email protected]), 杨红明 ([email protected])
Review of the Previous Lecture
• Bayesian decision for discrete variables
• Compound pattern classification
• Maximum-likelihood parameter estimation
  – the Gaussian case
• Bayesian estimation
  – parameter space vs. feature space: their differences and connections
Outline
• Chapter 3 (continued)
  – The problem of feature dimensionality
  – Expectation-Maximization (EM)
  – Hidden Markov Models
The Problem of Feature Dimensionality
• Statistical pattern classification
  – Partitioning of the feature space
  – Bayesian decision: minimum-risk rule, MAP
• What do additional features buy us?
  – Discriminability: features that differ between classes help classification
• What problems do they bring?
  – Computation
  – Storage
  – Generalization performance, overfitting
Classification Error Rate vs. Features
• Two-class Gaussian case
  – $p(\mathbf{x}\,|\,\omega_j) \sim N(\boldsymbol{\mu}_j, \Sigma)$, $j = 1, 2$, equal covariance matrices
  – Bayes error rate (closed form given below)
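For this equal-covariance two-class case the Bayes error has a standard closed form (assuming equal priors; stated in the usual Duda, Hart and Stork notation):

$$P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad r^2 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2),$$

where $r$ is the Mahalanobis distance between the class means. The error decreases monotonically as $r$ grows, so adding a feature that increases $r$ cannot increase the theoretical Bayes error.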
• An example where feature dimensionality determines separability
  – fully separable in the 3D space
  – the 2D and 1D projections overlap
However, adding features can also make classification performance worse, because of model estimation error (wrong model).
Computational Complexity
• Maximum-likelihood estimation
  – Gaussian distribution, d-dimensional features, n samples
  – The complexity of parameter estimation is dominated by Σ
• Parameter storage complexity: $c\,(d + d(d+1)/2)$ (see the worked example below)
• Classification complexity?
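As a quick worked example (the numbers c = 10 classes and d = 50 features are illustrative, not from the slides): the means take $c \cdot d = 500$ values and the symmetric covariance matrices take $c \cdot d(d+1)/2 = 10 \times 1275 = 12{,}750$ values, for $c\,(d + d(d+1)/2) = 13{,}250$ parameters in total, which shows why the covariance term dominates as $d$ grows.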
Overfitting
• Overfitting
  – High feature dimensionality with few training samples makes the parameter estimates inaccurate
    • e.g., estimating a covariance matrix requires more than d samples
• Remedies
  – Dimensionality reduction: feature extraction (transformation), feature selection
  – Parameter sharing/smoothing
    • Shared covariance matrix Σ0
    • Shrinkage (a.k.a. Regularized Discriminant Analysis); one common form is given below
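One common form of shrinkage (the version in Duda, Hart and Stork; whether the course uses exactly this variant is an assumption), with $n_i$ the number of samples in class $i$, $n$ the total sample count, and $0 < \alpha, \beta < 1$:

$$\Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i \Sigma_i + \alpha\, n\, \Sigma}{(1-\alpha)\, n_i + \alpha\, n}, \qquad \Sigma(\beta) = (1-\beta)\, \Sigma + \beta\, I,$$

which shrinks each class covariance toward the common covariance $\Sigma$, and the common covariance toward the identity matrix.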
• An example of overfitting
Expectation-Maximization (EM)
• Parameter estimation with missing data
  – Good features, missing/bad features
  – Given the current parameters θi, estimate new parameters θ
    • take the expectation over (marginalize out) the missing data, then maximize (see below)
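In the Duda, Hart and Stork formulation (the notation $D_g$ for the good/observed data and $D_b$ for the bad/missing data is assumed here), each EM iteration computes and maximizes

$$Q(\theta;\, \theta^i) = \mathbb{E}_{D_b}\!\left[\ln p(D_g, D_b;\, \theta) \,\middle|\, D_g,\, \theta^i\right], \qquad \theta^{i+1} = \arg\max_{\theta} Q(\theta;\, \theta^i),$$

and the likelihood of the observed data is guaranteed not to decrease from one iteration to the next.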
• Expectation-Maximization (EM)
• Example: EM for a 2D Gaussian with one feature value (x41) missing; starting from the initial parameter values, the estimates converge after 3 iterations.
• EM for Gaussian mixture
  – A parametric probability density function that can represent complex distributions:
    $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}\,|\,k)$, subject to $\sum_{k=1}^{K} \pi_k = 1$
  • Gaussian components: $p(\mathbf{x}\,|\,k) = N(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \Sigma_k)$
  – Parameter estimation: Maximum Likelihood (ML)
    $\max\ LL = \sum_{n=1}^{N} \log p(\mathbf{x}_n) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}_n\,|\,k)$
    $\nabla_{\pi_k} LL = 0,\ \nabla_{\boldsymbol{\mu}_k} LL = 0,\ \nabla_{\Sigma_k} LL = 0$ cannot be solved analytically
• EM Algorithm for Gaussian mixture
  – Incomplete data X, complete data {X, Z}, with indicators $z_{nk} \in \{0,1\},\ k = 1, \dots, K$ (1-of-K coding)
  – Expectation of the complete-data log-likelihood:
    $Q(\theta, \theta^{old}) = \sum_{Z} [\log p(X, Z\,|\,\theta)]\, p(Z\,|\,X, \theta^{old})$
  – E-step: with $p(X, Z\,|\,\theta) = \prod_{n=1}^{N}\prod_{k=1}^{K} [\pi_k\, N(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \Sigma_k)]^{z_{nk}}$,
    $Q(\theta, \theta^{old}) = E_Z[\log p(X, Z\,|\,\theta)] = \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\{\log \pi_k + \log N(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \Sigma_k)\}$
    $\gamma(z_{nk}) = P(z_{nk} = 1\,|\,\mathbf{x}_n) = \dfrac{\pi_k\, N(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, N(\mathbf{x}_n\,|\,\boldsymbol{\mu}_j, \Sigma_j)}$
  – M-step: $\nabla_{\boldsymbol{\mu}_k} Q = 0,\ \nabla_{\Sigma_k} Q = 0,\ \nabla_{\pi_k} Q = 0$ give
    $\boldsymbol{\mu}_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n$
    $\Sigma_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})^T$
    $N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \pi_k^{new} = \dfrac{N_k}{N}$
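A minimal NumPy sketch of this EM loop for a full-covariance Gaussian mixture (the function and variable names such as gmm_em, gamma, mus, Sigmas and the toy data are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture: E-step computes gamma(z_nk), M-step updates pi, mu, Sigma."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]           # initialize means at random samples
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk)
        dens = np.stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)   # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates
        Nk = gamma.sum(axis=0)                          # effective counts N_k
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pis = Nk / N
    return pis, mus, Sigmas

# toy usage: two well-separated clusters
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
pis, mus, Sigmas = gmm_em(X, K=2)
```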
An example from C.M. Bishop, Pattern Recognition and Machine Learning, 2006, Figure 9.8.
Break
Hidden Markov Models
• Sequential (Temporal) Pattern
– Variable length
– Distortion
– Ambiguous boundary between primitives (symbols)
• Bayesian Classification
  – Sequence of patterns (observations): $O = O_1 O_2 \cdots O_T$
  – Sequence of classes (states): $\mathbf{q} = q_1 q_2 \cdots q_T$
  – Posterior probability: $P(\mathbf{q}\,|\,O) = \dfrac{p(O\,|\,\mathbf{q})\, P(\mathbf{q})}{p(O)}$
• Hidden Markov Model (HMM)
– Model p(O|q), p(O,q)
Markov Chain
• Sequence of States (classes)
  $P(q_1 q_2 \cdots q_T) = P(q_1)\, P(q_2\,|\,q_1)\, P(q_3\,|\,q_1 q_2) \cdots P(q_T\,|\,q_1 \cdots q_{T-1}), \qquad q_t \in \{S_1, \dots, S_N\}$
  – First-Order Markov
    $P(q_t = S_j\,|\,q_{t-1} = S_i,\ q_{t-2} = S_k,\ \dots) = P(q_t = S_j\,|\,q_{t-1} = S_i) = a_{ij}$
    $P(q_1 q_2 \cdots q_T) = P(q_1)\, P(q_2\,|\,q_1)\, P(q_3\,|\,q_2) \cdots P(q_T\,|\,q_{T-1})$
    $\sum_{j=1}^{N} a_{ij} = 1$
• Elements of an HMM $\lambda = (A, B, \pi)$
  – N: number of states in the model, $S = \{S_1, S_2, \dots, S_N\}$
  – M: number of observation symbols, $V = \{v_1, v_2, \dots, v_M\}$
  – State transition probability distribution $A = \{a_{ij}\}$:
    $a_{ij} = P(q_{t+1} = S_j\,|\,q_t = S_i), \quad 1 \le i, j \le N$
  – Observation symbol (emission) probability distribution $B = \{b_j(k)\}$:
    $b_j(k) = P(v_k \text{ at } t\,|\,q_t = S_j), \quad 1 \le j \le N,\ 1 \le k \le M$
  – Initial state distribution $\pi = \{\pi_i\}$:
    $\pi_i = P(q_1 = S_i), \quad 1 \le i \le N$
• Three Basic Problems of HMM
  – Problem 1 (Evaluation):
    How to efficiently compute the probability of an observation sequence, $P(O\,|\,\lambda)$
  – Problem 2 (Decoding):
    How to choose the best state sequence corresponding to an observation sequence
  – Problem 3 (Training):
    How to estimate the model parameters $\lambda = (A, B, \pi)$
Evaluation Problem
• Given the model $\lambda = (A, B, \pi)$ and an observation sequence $O = O_1 O_2 \cdots O_T$, compute $P(O\,|\,\lambda)$
  – Direct computation:
    $P(O\,|\,\lambda) = \sum_{\text{all } Q} P(O\,|\,Q, \lambda)\, P(Q\,|\,\lambda) = \sum_{q_1, q_2, \dots, q_T} \pi_{q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2) \cdots a_{q_{T-1} q_T} b_{q_T}(O_T)$
    Conditional independence: $P(O\,|\,Q, \lambda) = \prod_{t=1}^{T} P(O_t\,|\,q_t, \lambda) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T)$
    Markov chain of states: $P(Q\,|\,\lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$
  – Complexity: $O(2T \cdot N^T)$!
• Evaluation: Forward Procedure
  – Define the forward variable $\alpha_t(i) = P(O_1 O_2 \cdots O_t,\ q_t = S_i\,|\,\lambda)$
  – Initialization: $\alpha_1(i) = \pi_i\, b_i(O_1), \quad 1 \le i \le N$
  – Induction: $\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big] b_j(O_{t+1}), \quad 1 \le t \le T-1,\ 1 \le j \le N$
  – Termination: $P(O\,|\,\lambda) = \sum_{i=1}^{N} \alpha_T(i)$
  – Complexity: $O(TN^2)$
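A minimal NumPy sketch of the forward procedure for a discrete-observation HMM (the array names A, B, pi and the toy numbers are illustrative assumptions):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: returns P(O | lambda) and the alpha table."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # initialization: alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, T):                              # induction over t
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(O_{t+1})
    return alpha[-1].sum(), alpha                      # termination: P(O|lambda) = sum_i alpha_T(i)

# toy 2-state, 2-symbol model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
prob, alpha = forward(A, B, pi, obs=[0, 1, 0])
```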
• Evaluation: Backward Procedure
  – Define the backward variable $\beta_t(i) = P(O_{t+1}, \dots, O_T\,|\,q_t = S_i, \lambda)$
  – Initialization: $\beta_T(i) = 1, \quad 1 \le i \le N$
  – Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j), \quad 1 \le t \le T-1,\ 1 \le i \le N$
  – Termination: $P(O\,|\,\lambda) = \sum_{i=1}^{N} \pi_i\, b_i(O_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_1(i)\, \beta_1(i)$
  – Complexity?
Decoding Problem
• This is Pattern Recognition
• Optimal Sequence of States
  $\max_{q_1 q_2 \cdots q_T} P(q_1 q_2 \cdots q_T\,|\,O, \lambda) \;\Leftrightarrow\; \max_{q_1 q_2 \cdots q_T} P(q_1 q_2 \cdots q_T,\ O\,|\,\lambda)$
• Viterbi Algorithm
  – Define $\delta_t(i) = \max_{q_1, q_2, \dots, q_{t-1}} P(q_1 q_2 \cdots q_{t-1},\ q_t = S_i,\ O_1 O_2 \cdots O_t\,|\,\lambda)$
  – Initialization: $\delta_1(i) = \pi_i\, b_i(O_1), \quad \psi_1(i) = 0$
  – Induction: $\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\, b_j(O_t), \quad \psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}], \quad 2 \le t \le T,\ 1 \le j \le N$
  – Termination: $P^* = \max_{1 \le i \le N} \delta_T(i), \quad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
  – Backtracking: $q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \dots, 1$
  – Complexity: $O(TN^2)$
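A NumPy sketch of Viterbi decoding, using the same toy discrete-HMM conventions as the forward example above (all names are illustrative):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi decoding: returns the best state path and its joint probability P*."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                        # delta_1(i) = pi_i b_i(O_1)
    for t in range(1, T):                               # induction
        scores = delta[t - 1][:, None] * A              # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)                  # psi_t(j)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]    # delta_t(j)
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                       # q_T* = argmax_i delta_T(i)
    for t in range(T - 2, -1, -1):                      # backtracking
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()

# toy model as before
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
path, p_star = viterbi(A, B, pi, obs=[0, 1, 0])
```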
Training Problem
• Maximum Likelihood (ML): $\max_{A, B, \pi} P(O\,|\,\lambda)$
• Baum-Welch Algorithm (EM)
  $\max_{\bar{\lambda}}\; Q(\lambda, \bar{\lambda}) = \sum_{Q} P(Q, O\,|\,\lambda)\, \log P(Q, O\,|\,\bar{\lambda})$
  – Define $\xi_t(i, j) = P(q_t = S_i,\ q_{t+1} = S_j\,|\,O, \lambda) = \dfrac{P(O,\ q_t = S_i,\ q_{t+1} = S_j\,|\,\lambda)}{P(O\,|\,\lambda)} = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}$
  – Define $\gamma_t(i) = P(q_t = S_i\,|\,O, \lambda) = \dfrac{\alpha_t(i)\, \beta_t(i)}{P(O\,|\,\lambda)} = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)} = \sum_{j=1}^{N} \xi_t(i, j)$
• Baum-Welch Algorithm (Cont.)
  – Reestimation formulas:
    $\bar{\pi}_i = $ expected frequency (number of times) in state $S_i$ at time $t = 1$ $= \gamma_1(i)$
    $\bar{a}_{ij} = \dfrac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} = \dfrac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}$
    $\bar{b}_j(k) = \dfrac{\text{expected number of times in state } S_j \text{ observing symbol } v_k}{\text{expected number of times in state } S_j} = \dfrac{\sum_{t=1,\ O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} = \dfrac{\sum_{t:\, O_t = v_k} \alpha_t(j)\, \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}$
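A sketch of a single Baum-Welch reestimation step in NumPy, built on the forward/backward recursions above for one observation sequence (function and variable names are illustrative assumptions):

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Compute the alpha and beta tables for a discrete-observation HMM."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch_step(A, B, pi, obs):
    """One EM reestimation of (A, B, pi) from a single observation sequence."""
    N, M, T = A.shape[0], B.shape[1], len(obs)
    alpha, beta = forward_backward(A, B, pi, obs)
    p_obs = alpha[-1].sum()                                   # P(O | lambda)
    gamma = alpha * beta / p_obs                              # gamma_t(i)
    xi = np.zeros((T - 1, N, N))                              # xi_t(i, j)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
    xi /= p_obs
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(M):                                        # b_j(k): time steps where O_t = v_k
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi

# toy usage with the 2-state, 2-symbol model used earlier
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
A, B, pi = baum_welch_step(A, B, pi, obs=[0, 1, 0, 0, 1])
```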
Continuous Density HMM
• Handling Continuous Observations
  – Continuous features: observation vectors $O_t$
  – Discretization: vector quantization (VQ)
    • Each vector is replaced with its closest codevector, which is treated as a symbol
    • Small codebook: large quantization distortion
    • Large codebook: a large amount of data is required to estimate the emission probabilities
  – Continuous emission density: Gaussian mixture (GM)
    $b_j(\mathbf{O}) = \sum_{m=1}^{M} c_{jm}\, N(\mathbf{O};\ \boldsymbol{\mu}_{jm}, U_{jm}), \quad 1 \le j \le N$
Application to Speech Recognition
• Isolated Word Recognition
  – Given an HMM $\lambda_v$ for each word $v$ in the vocabulary
  – For an input observation sequence O, make the Bayes decision (assuming equal prior probabilities):
    $v^* = \arg\max_{1 \le v \le V} P(\lambda_v\,|\,O) = \arg\max_{1 \le v \le V} P(O\,|\,\lambda_v)$
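Recognition then reduces to scoring O under each word model and taking the argmax. A short sketch, assuming the forward() function from the earlier evaluation sketch and a hypothetical list of per-word parameter tuples:

```python
import numpy as np

def recognize_word(word_models, obs):
    """Pick the word whose HMM gives the highest P(O | lambda_v) (equal priors assumed)."""
    # word_models is a hypothetical list of (A, B, pi) tuples, one per vocabulary word;
    # forward() is the forward-procedure sketch defined earlier in these notes.
    scores = [forward(A, B, pi, obs)[0] for (A, B, pi) in word_models]
    return int(np.argmax(scores))
```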
Extensions of HMM
• Hybrid HMM/Neural
  – HMM: parametric $b_j(O_t) = p(O_t\,|\,q_t = S_j)$, conditional independence assumption
  – Neural network: discriminative emission probability $p(q_t = S_j\,|\,O_t)$
    • Neural network outputs approximate posterior probabilities
  – Replace $p(\mathbf{x}_t\,|\,q_t)$ with the scaled likelihood $p(q_t\,|\,\mathbf{x}_t)/P(q_t)$:
    $p(\mathbf{x}_t\,|\,q_t) = \dfrac{p(q_t\,|\,\mathbf{x}_t)\, p(\mathbf{x}_t)}{P(q_t)} \propto \dfrac{p(q_t\,|\,\mathbf{x}_t)}{P(q_t)}$, since $p(\mathbf{x}_t)$ does not depend on the state
  – The ANN may take multiple frames as input to learn temporal correlations
Discussion
• Feature dimensionality and overfitting
  – Methods to overcome overfitting?
• Expectation-Maximization (EM)
  – Expectation of the log-likelihood over the missing data
  – EM for Gaussian mixture
• Hidden Markov Models (HMM)
  – Three basic problems
  – Viterbi Algorithm
  – Extensions
Next Lecture
• Chapter 4: Nonparametric Methods
  – Density estimation
  – Parzen window method
  – K-nearest-neighbor estimation
  – Nearest-neighbor rule
  – Distance metrics
  – Reduced Coulomb Energy Network
  – Approximation by Series Expansion