
Master's course "Pattern Recognition", School of Artificial Intelligence Technology, University of Chinese Academy of Sciences

Chapter 3: Parameter Estimation (continued)

刘成林 ([email protected])
October 11, 2017
Teaching assistants: 何文浩 ([email protected]), 杨红明 ([email protected])
Review of the Previous Lecture
• Bayesian decision for discrete variables
• Compound pattern classification
• Maximum likelihood parameter estimation
  – The Gaussian case
• Bayesian estimation
  – Parameter space vs. feature space: their differences and connections
Outline

• Chapter 3
  – The feature dimensionality problem
  – Expectation-Maximization (EM)
  – Hidden Markov Models (HMM)
The Feature Dimensionality Problem
• Statistical pattern classification
  – Partitioning of the feature space
  – Bayesian decision: minimum-risk rule, MAP
• What do additional features buy us?
  – Discriminability: features that differ across classes help classification
• What problems do they bring?
  – Computation
  – Storage
  – Generalization performance, overfitting
Classification Error Rate vs. Features
• Two-class Gaussian case
  – p(x | \omega_j) \sim N(\mu_j, \Sigma), j = 1, 2, with equal covariance matrices
  – Bayes error rate:
    P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2} \, du, \quad r^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)
  – Conditionally independent case (\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)):
    r^2 = \sum_{i=1}^{d} \left( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \right)^2
• In each dimension, the distance between the two class means reflects discriminability and determines the error rate
• Adding features helps reduce the error rate (r^2 increases)
• An example where feature dimensionality determines separability
  – Fully separable in the 3D space
  – Overlap in the 2D and 1D projection spaces

However, adding features may also lead to worse classification performance, because of model estimation error (wrong model).
Computational Complexity
• Maximum likelihood estimation
  – Gaussian distribution, d-dimensional features, n samples
  – The complexity of parameter estimation is mainly determined by the covariance matrix \Sigma
• Parameter storage complexity
  c \, (d + d(d+1)/2) for c classes
• Classification complexity?
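As a concrete illustration of the estimation step (not from the slides), here is a minimal numpy sketch of the ML estimates for a single Gaussian; the covariance accumulation is the dominant cost referred to above, and per class one stores d mean values plus d(d+1)/2 entries of the symmetric covariance.

import numpy as np

def ml_gaussian(X):
    """ML estimates of a Gaussian from n samples of dimension d (X has shape (n, d))."""
    mu = X.mean(axis=0)             # sample mean, O(n d)
    diff = X - mu
    Sigma = diff.T @ diff / len(X)  # sample covariance, O(n d^2), dominates the cost
    return mu, Sigma
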
Overfitting
• Overfitting
  – High feature dimensionality with few training samples leads to inaccurate parameter estimates
    • For example, estimating a covariance matrix requires more than d samples
• Remedies
  – Dimensionality reduction: feature extraction (transformation), feature selection
  – Parameter sharing / smoothing
    • Shared covariance matrix \Sigma_0
    • Shrinkage (a.k.a. Regularized Discriminant Analysis); see the sketch below
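A minimal sketch of the shrinkage idea listed above. The slide only names the technique; the convex-combination form and the parameter name alpha below are assumptions, with Sigma_0 playing the role of the shared covariance matrix.

import numpy as np

def shrinkage_covariance(sigma_k, sigma_0, alpha):
    """Shrink a class covariance toward a shared covariance.

    One common shrinkage form (a variant of regularized discriminant analysis):
    Sigma_k(alpha) = (1 - alpha) * Sigma_k + alpha * Sigma_0, with 0 <= alpha <= 1.
    Larger alpha means stronger smoothing toward the shared matrix.
    """
    return (1.0 - alpha) * np.asarray(sigma_k) + alpha * np.asarray(sigma_0)
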
• An example of overfitting
  – [Figure: data fitted by a 10th-order polynomial, which follows the training points closely but generalizes poorly]
Expectation-Maximization (EM)
• Parameter estimation with missing data
  – Good features D_g, missing/bad features D_b
  – Given the current parameters \theta^i, estimate new parameters \theta
    • Take the expectation of the log-likelihood over the missing data (marginalize):
      Q(\theta; \theta^i) = E_{D_b}[\ln p(D_g, D_b; \theta) \mid D_g, \theta^i]
    • Maximize: \theta^{i+1} = \arg\max_\theta Q(\theta; \theta^i)
• Expectation-Maximization (EM)

  The EM algorithm guarantees that the log-likelihood of the good (observed) data increases monotonically.
• Example: EM for a 2D Gaussian with one missing value
  – The first coordinate of the fourth sample, x_{41}, is unknown
  – [Figure: the estimated mean and covariance, initially and after 3 EM iterations]
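A small numpy sketch of EM for this setting, assuming a single sample whose first coordinate is missing. The data values, initialization, and function names below are hypothetical illustrations, not taken from the slide's figure.

import numpy as np

def em_gaussian_missing_x1(X, miss, n_iter=10):
    """EM for a 2D Gaussian when sample `miss` has an unknown first coordinate x1
    (its second coordinate x2 is observed; X[miss, 0] is ignored)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    mu, Sigma = np.zeros(2), np.eye(2)            # simple initial guess (assumption)
    for _ in range(n_iter):
        # E-step: expected sufficient statistics of the missing coordinate
        x2 = X[miss, 1]
        cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
        cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

        Ex = X.copy()
        Ex[miss, 0] = cond_mean                   # E[x1 | x2, theta_old]

        S = Ex.T @ Ex                             # sum of E[x x^T] over samples
        S[0, 0] += cond_var                       # E[x1^2] = cond_mean^2 + cond_var

        # M-step: re-estimate mean and covariance
        mu = Ex.mean(axis=0)
        Sigma = S / n - np.outer(mu, mu)
    return mu, Sigma

# Example usage with hypothetical data; the fourth sample's first coordinate is unknown
X = np.array([[0., 2.], [1., 0.], [2., 2.], [0., 4.]])   # X[3, 0] is just a placeholder
mu, Sigma = em_gaussian_missing_x1(X, miss=3)
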
• EM for Gaussian mixtures
  – A parametric probability density that can represent complex distributions:
    p(x) = \sum_{k=1}^{K} \pi_k \, p(x | \theta_k), \quad \text{subject to} \; \sum_{k=1}^{K} \pi_k = 1
  – Gaussian components:
    p(x | \theta_k) = N(x | \mu_k, \Sigma_k)
  – Parameter estimation: Maximum Likelihood (ML)
    \max LL = \log \prod_{n=1}^{N} p(x_n) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k N(x_n | \mu_k, \Sigma_k)
    \nabla_{\pi_k} LL = 0, \quad \nabla_{\mu_k} LL = 0, \quad \nabla_{\Sigma_k} LL = 0
  – These equations cannot be solved analytically
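A short sketch (function and variable names are my own) that simply evaluates this log-likelihood; the log of a sum over components is exactly what blocks a closed-form solution of the gradient equations above.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """LL = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k) for data X of shape (N, d)."""
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(len(pi))
    ])                                   # shape (N, K): pi_k * N(x_n | mu_k, Sigma_k)
    return np.log(dens.sum(axis=1)).sum()
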
• EM Algorithm for Gaussian mixtures
  – Incomplete data X, complete data \{X, Z\}, with z_{nk} \in \{0, 1\}, k = 1, \ldots, K
  – Expectation of the complete-data log-likelihood:
    Q(\Theta, \Theta^{old}) = \sum_Z [\log p(X, Z | \Theta)] \, p(Z | X, \Theta^{old})

  1. Choose an initial set of parameters \Theta^{old}
  2. Repeat:
     E-step: Evaluate p(Z | X, \Theta^{old})
     M-step: Update the parameters, \Theta^{new} = \arg\max_\Theta Q(\Theta, \Theta^{old})
     If the convergence condition is not satisfied, set \Theta^{old} \leftarrow \Theta^{new}
  3. End

(C.M. Bishop, Pattern Recognition and Machine Learning, 2006)
• EM Algorithm for Gaussian mixtures
  – E-step:
    p(X, Z | \Theta) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \, N(x_n | \mu_k, \Sigma_k)^{z_{nk}}

    Q(\Theta, \Theta^{old}) = E_Z[\log p(X, Z | \Theta)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \{ \log \pi_k + \log N(x_n | \mu_k, \Sigma_k) \}

    \gamma(z_{nk}) = P(z_{nk} = 1 | x_n) = \frac{\pi_k N(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n | \mu_j, \Sigma_j)}

  – M-step: \nabla_{\pi_k} Q = 0, \quad \nabla_{\mu_k} Q = 0, \quad \nabla_{\Sigma_k} Q = 0

    \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n

    \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T

    \pi_k^{new} = \frac{N_k}{N}, \quad \text{where} \; N_k = \sum_{n=1}^{N} \gamma(z_{nk})
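The E-step and M-step above translate directly into code. Below is a minimal numpy/scipy sketch; the random-sample initialization and the fixed iteration count are simplifying assumptions (a practical implementation would monitor the log-likelihood for convergence and guard against degenerate components).

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture on data X of shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: random data points as means, shared covariance, uniform weights
    mu = X[rng.choice(N, K, replace=False)].copy()
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])                                                       # shape (N, K)
        gamma = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)

        # M-step: update pi_k, mu_k, Sigma_k
        Nk = gamma.sum(axis=0)                                   # effective counts N_k
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma
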
An example is shown in C.M. Bishop, Pattern Recognition and Machine Learning, 2006, Figure 9.8.
Hidden Markov Models
• Sequential (temporal) patterns
  – Variable length
  – Distortion
  – Ambiguous boundaries between primitives (symbols)
• Bayesian classification
  – Sequence of patterns (observations): O = O_1 O_2 \cdots O_T
  – Sequence of classes (states): q = q_1 q_2 \cdots q_T
  – Posterior probability:
    P(q | O) = \frac{p(O | q) P(q)}{p(O)}
• Hidden Markov Model (HMM)
  – Models p(O | q) and p(O, q)
Markov Chain
• Sequence of states (classes)
  P(q_1 q_2 \cdots q_T) = P(q_1) P(q_2 | q_1) P(q_3 | q_1 q_2) \cdots P(q_T | q_1 \cdots q_{T-1}), \quad q_t \in \{S_1, \ldots, S_N\}
  – First-order Markov assumption:
    P(q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, \ldots) = P(q_t = S_j | q_{t-1} = S_i)
    P(q_1 q_2 \cdots q_T) = P(q_1) P(q_2 | q_1) P(q_3 | q_2) \cdots P(q_T | q_{T-1})
  – State transition probabilities:
    a_{ij} = P(q_t = S_j | q_{t-1} = S_i), \quad 1 \le i, j \le N, \qquad \sum_{j=1}^{N} a_{ij} = 1

L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, 77(2): 257-286, 1989.

  – State duration (self-transition)
    Consider a sequence that stays in state S_i for d steps and then leaves:
    O = \{S_i, S_i, S_i, \ldots, S_i, S_j \ne S_i\} (d occurrences of S_i, then a different state at step d+1)
    P(O | \text{Model}, q_1 = S_i) = (a_{ii})^{d-1} (1 - a_{ii}) = p_i(d)
  – Expected duration of a specific state:
    \bar{d}_i = \sum_{d=1}^{\infty} d \, p_i(d) = \sum_{d=1}^{\infty} d \, (a_{ii})^{d-1} (1 - a_{ii}) = \frac{1}{1 - a_{ii}}

• Example: weather transitions
  State 1: rain (or snow)
  State 2: cloudy
  State 3: sunny

  A = \{a_{ij}\} = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}

  Expected number of consecutive sunny days and cloudy days? (See the sketch below.)
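A quick check of the question above using the expected-duration formula \bar{d}_i = 1/(1 - a_{ii}):

import numpy as np

# Transition matrix from the weather example (rows/columns: rain, cloudy, sunny)
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# Expected duration of state i is 1 / (1 - a_ii)
expected_days = 1.0 / (1.0 - np.diag(A))
print(expected_days)   # rain ~1.67, cloudy 2.5, sunny 5.0 days
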
Hidden Markov Model (HMM)
• Markov chain: the states are observable
• Hidden states: an example
  – Imagine you are in a windowless room and cannot see the weather outside. Instead, you guess the weather from the temperature and humidity inside the room.
    • Observations: temperature, humidity
    • Hidden states: weather
  – Hidden Markov Model (HMM): a doubly embedded stochastic process
    • Observation sequence O_1 O_2 \cdots O_T with distribution P(O_1, O_2, \ldots, O_T)
    • Hidden state sequence q_1 q_2 \cdots q_T with distribution P(q_1, q_2, \ldots, q_T), where q_t \in \{S_1, S_2, \ldots, S_N\}
    • The states are inferred from the observations
• Elements of an HMM: \lambda = (A, B, \pi)
  – N: number of states in the model, S = \{S_1, S_2, \ldots, S_N\}
  – M: number of observation symbols, V = \{v_1, v_2, \ldots, v_M\}
  – State transition probability distribution A = \{a_{ij}\}:
    a_{ij} = P(q_{t+1} = S_j | q_t = S_i), \quad 1 \le i, j \le N
  – Observation symbol (emission) probability distribution B = \{b_j(k)\}:
    b_j(k) = P(v_k \text{ at } t \mid q_t = S_j), \quad 1 \le j \le N, \; 1 \le k \le M
  – Initial state distribution \pi = \{\pi_i\}:
    \pi_i = P(q_1 = S_i), \quad 1 \le i \le N
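For the sketches that follow, the discrete-HMM parameters can be held as plain numpy arrays. The 2-state, 3-symbol numbers below are hypothetical and only fix the indexing conventions used later (A[i, j] = a_ij, B[j, k] = b_j(k), pi[i] = pi_i).

import numpy as np

A = np.array([[0.7, 0.3],          # A[i, j] = P(q_{t+1} = S_j | q_t = S_i); rows sum to 1
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],     # B[j, k] = b_j(k) = P(O_t = v_k | q_t = S_j)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])          # pi[i] = P(q_1 = S_i)
obs = [0, 2, 1, 2]                 # an observation sequence as symbol indices
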
• Three Basic Problems of HMMs
  – Problem 1 (Evaluation):
    How to efficiently compute the probability of an observation sequence, P(O | \lambda)
  – Problem 2 (Decoding):
    How to choose the best state sequence corresponding to an observation sequence
  – Problem 3 (Training):
    How to estimate the model parameters
Evaluation Problem
• Given the model \lambda = (A, B, \pi) and an observation sequence O = O_1 O_2 \cdots O_T, compute P(O | \lambda)
  – Direct computation:
    P(O | \lambda) = \sum_{\text{all } Q} P(O | Q, \lambda) P(Q | \lambda)
                   = \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(O_1) \, a_{q_1 q_2} b_{q_2}(O_2) \cdots a_{q_{T-1} q_T} b_{q_T}(O_T)
    where, by conditional independence of the observations,
    P(O | Q, \lambda) = \prod_{t=1}^{T} P(O_t | q_t, \lambda) = b_{q_1}(O_1) b_{q_2}(O_2) \cdots b_{q_T}(O_T)
    and, by the Markov chain of states,
    P(Q | \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T}
  – Complexity: O(2T N^T)!
• Evaluation: Forward Procedure
  – Define the forward variable \alpha_t(i) = P(O_1 O_2 \cdots O_t, q_t = S_i | \lambda)
  – Initialization: \alpha_1(i) = \pi_i b_i(O_1), \quad 1 \le i \le N
  – Induction:
    \alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(O_{t+1}), \quad 1 \le t \le T-1, \; 1 \le j \le N
  – Termination:
    P(O | \lambda) = \sum_{i=1}^{N} \alpha_T(i)
  – Complexity: O(T N^2); see the sketch below
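A minimal sketch of the forward procedure under the array conventions introduced above; it omits the scaling (or log-space arithmetic) that a practical implementation needs to avoid underflow on long sequences.

import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: alpha[t, i] = P(O_1..O_{t+1}, q_{t+1} = S_i | lambda).
    `obs` is a sequence of observation symbol indices."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction, O(N^2) per step
    return alpha[-1].sum(), alpha                     # P(O | lambda), all forward variables
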
• Evaluation: Backward Procedure
  – Define the backward variable \beta_t(i) = P(O_{t+1}, \ldots, O_T | q_t = S_i, \lambda)
  – Initialization: \beta_T(i) = 1, \quad 1 \le i \le N
  – Induction:
    \beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(O_{t+1}) \beta_{t+1}(j), \quad t = T-1, \ldots, 1, \; 1 \le i \le N
  – Termination:
    P(O | \lambda) = \sum_{i=1}^{N} \pi_i b_i(O_1) \beta_1(i) = \sum_{i=1}^{N} \alpha_1(i) \beta_1(i)
  – Complexity?
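The corresponding backward sketch, with the same conventions and the same caveat about scaling; the termination line reproduces P(O | lambda) = sum_i pi_i b_i(O_1) beta_1(i).

import numpy as np

def backward(A, B, pi, obs):
    """Backward procedure: beta[t, i] = P(O_{t+2}..O_T | q_{t+1} = S_i, lambda)."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))                               # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])   # induction (runs backward in t)
    # Termination: P(O | lambda) = sum_i pi_i * b_i(O_1) * beta_1(i)
    return (pi * B[:, obs[0]] * beta[0]).sum(), beta
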
Decoding Problem
• This is pattern recognition
• Optimal sequence of states:
  \max_{q_1 q_2 \cdots q_T} P(q_1 q_2 \cdots q_T | O, \lambda), \; equivalently \; \max_{q_1 q_2 \cdots q_T} P(q_1 q_2 \cdots q_T, O | \lambda)
• Viterbi Algorithm (dynamic programming)
  – Define the variable
    \delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \cdots q_{t-1}, q_t = S_i, O_1 O_2 \cdots O_t | \lambda)
  – DP recursion:
    \delta_{t+1}(j) = \left[ \max_i \delta_t(i) a_{ij} \right] b_j(O_{t+1})
  – Initialization:
    \delta_1(i) = \pi_i b_i(O_1), \quad 1 \le i \le N; \qquad \psi_1(i) = 0
• Appendix: the Dynamic Programming (DP) principle (Bellman's principle of optimality)
  – The best path through a particular intermediate place consists of the best way from the start to that place, followed by the best way from that place to the goal.
  – Implication: among the multiple ways of reaching an intermediate place, only the best one needs to be retained.
  – Often used in sequence matching and in HMMs

Recall \delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \cdots q_{t-1}, q_t = S_i, O_1 O_2 \cdots O_t | \lambda).
• Viterbi Algorithm (cont.)
  – Recursion:
    \delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i) a_{ij} \right] b_j(O_t), \qquad \psi_t(j) = \arg\max_{1 \le i \le N} \delta_{t-1}(i) a_{ij}, \qquad 2 \le t \le T, \; 1 \le j \le N
  – Termination:
    P^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)
  – Backtracking:
    q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \ldots, 1
  – Complexity: O(T N^2); see the sketch below
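A sketch of Viterbi decoding. Unlike the slides, it works in log-space (sums of logs instead of products of probabilities), which changes nothing in the argmax but avoids underflow; zero-probability entries become -inf.

import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely state sequence and its log-probability."""
    T, N = len(obs), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = logpi + logB[:, obs[0]]                    # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA             # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)                    # best predecessor of each state j
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]   # recursion
    # Termination and backtracking
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, delta[-1].max()                          # best state sequence, log P*
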
Training Problem
• Maximum Likelihood (ML): \max_{A, B, \pi} P(O | \lambda)
• Baum-Welch Algorithm (EM): \max_{\lambda} Q(\lambda, \lambda^{old}), where
  Q(\lambda, \lambda^{old}) = \sum_{Q} [\log P(Q, O | \lambda)] \, P(Q, O | \lambda^{old})
  – Define the variable
    \xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, \lambda)
                = \frac{P(O, q_t = S_i, q_{t+1} = S_j | \lambda)}{P(O | \lambda)}
                = \frac{\alpha_t(i) \, a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) \, a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}
  – Define the probability
    \gamma_t(i) = P(q_t = S_i | O, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{P(O | \lambda)} = \frac{\alpha_t(i) \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)} = \sum_{j=1}^{N} \xi_t(i, j)
• Baum-Welch Algorithm (cont.)
  – Reestimation formulas:
    \bar{\pi}_i = expected frequency (number of times) in state S_i at time t = 1 = \gamma_1(i)

    \bar{a}_{ij} = \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i}
                 = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
                 = \frac{\sum_{t=1}^{T-1} \alpha_t(i) \, a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i) \beta_t(i)}

    \bar{b}_j(k) = \frac{\text{expected number of times in state } S_j \text{ and observing symbol } v_k}{\text{expected number of times in state } S_j}
                 = \frac{\sum_{t=1, \, O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
                 = \frac{\sum_{t=1, \, O_t = v_k}^{T} \alpha_t(j) \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j) \beta_t(j)}
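A sketch of a single reestimation step for one observation sequence. It recomputes the forward and backward variables internally so it stands alone, and, like the other sketches, it omits the scaling that a practical implementation needs; multiple training sequences would simply accumulate the numerators and denominators.

import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch reestimation of (A, B, pi) from a single symbol sequence `obs`."""
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)

    # Forward and backward variables (as in the evaluation sketches)
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[-1].sum()                               # P(O | lambda)

    # xi_t(i, j) and gamma_t(i)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
        xi[t] /= xi[t].sum()
    gamma = alpha * beta / prob                          # shape (T, N)

    # Reestimation formulas
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi
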
Continuous Density HMM
• Handling continuous observations
  – Continuous features: vectors O_t
  – Discretization: vector quantization (VQ)
    • Each vector is replaced with its closest codevector, which is treated as a symbol
    • Small codebook: quantization distortion
    • Large codebook: a large amount of data is required to estimate the emission probabilities
  – Continuous emission density: Gaussian mixture (GM)
    b_j(O) = \sum_{m=1}^{M} c_{jm} N(O; \mu_{jm}, U_{jm}), \quad 1 \le j \le N
• Parameter estimation for continuous HMMs (omitted)
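A sketch of the VQ assignment step only (codebook design, e.g. by k-means, is omitted); each feature vector is mapped to the index of its nearest codevector, and that index then serves as the discrete observation symbol.

import numpy as np

def vector_quantize(X, codebook):
    """Map each row of X (n, d) to the index of its nearest codevector in codebook (M, d)."""
    # Squared Euclidean distances between all feature vectors and all codevectors
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)   # shape (n, M)
    return d2.argmin(axis=1)                                          # discrete symbols
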
Application to Speech Recognition
• Isolated word recognition
  – Given an HMM \lambda_v for each word v in the vocabulary
  – For an input observation sequence O, the Bayes decision (assuming equal prior probabilities) is
    v^* = \arg\max_{1 \le v \le V} P(\lambda_v | O) = \arg\max_{1 \le v \le V} P(O | \lambda_v)
  – Acoustic features O_t (details omitted)
  – Vector quantization (discrete observation symbols)
  – Choice of model parameters
    • Number of states: empirical (cross-validation); can be equal for all word models
    • Number of components in the GM
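A sketch of this decision rule, assuming each word model is stored as an (A, B, pi) tuple of arrays (a hypothetical structure) and scored with the forward procedure; the unscaled forward pass here would need log-space or scaling for realistic sequence lengths.

import numpy as np

def log_likelihood(A, B, pi, obs):
    """log P(O | lambda) via a compact, unscaled forward pass."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return np.log(alpha.sum())

def recognize(word_models, obs):
    """Isolated word recognition: pick the word whose model maximizes P(O | lambda_v).
    `word_models` maps each word to its (A, B, pi) tuple."""
    scores = {w: log_likelihood(A, B, pi, obs) for w, (A, B, pi) in word_models.items()}
    return max(scores, key=scores.get)
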
Extensions of HMM
• Hybrid HMM/Neural models
  – HMM: parametric emission b_j(O_t) = p(O_t | q_t = S_j), with the conditional independence assumption
  – Neural network: discriminative emission probability p(q_t = S_j | O_t)
    • The network outputs approximate the posterior probabilities
  – Replace p(x_t | q_t) with the scaled likelihood p(q_t | x_t) / P(q_t):
    \frac{p(x_t | q_t)}{p(x_t)} = \frac{p(q_t | x_t)}{P(q_t)}
  – The ANN may take multiple frames as input to learn temporal correlations
• More recently: deep neural networks
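A sketch of the scaled-likelihood substitution in log-space: network posteriors divided by the state priors, used as HMM emission scores. The array names and shapes are assumptions.

import numpy as np

def scaled_log_likelihoods(nn_posteriors, state_priors):
    """Convert network posteriors p(q_t = S_j | x_t), a (T, N) array of softmax outputs,
    into log scaled likelihoods log [ p(q_t | x_t) / P(q_t) ] for HMM decoding."""
    return np.log(nn_posteriors) - np.log(state_priors)[None, :]
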
Discussion
• Feature dimensionality and overfitting
  – How can overfitting be overcome?
• Expectation-Maximization (EM)
  – Expectation of the log-likelihood over the missing data
  – EM for Gaussian mixtures
• Hidden Markov Models (HMM)
  – Three basic problems
  – The Viterbi algorithm
  – Extensions
Next Lecture

• Chapter 4: Nonparametric Methods
  – Density estimation
  – Parzen window method
  – k-nearest-neighbor estimation
  – Nearest neighbor rule
  – Distance metrics
  – Reduced Coulomb Energy Network
  – Approximation by series expansion
