Recursive Least Squares Estimation∗
Yan-Bin Jia
Dec 8, 2015
1 Estimation of a Constant
We start with estimation of a constant based on several noisy measurements. Suppose we have a
resistor but do not know its resistance. So we measure it several times using a cheap (and noisy)
multimeter. How do we come up with a good estimate of the resistance based on these noisy
measurements?
More formally, suppose x = (x1 , x2 , . . . , xn )T is a constant but unknown vector, and y =
(y1 , y2 , . . . , yl )T is an l-element noisy measurement vector. Our task is to find the “best” estimate
x̃ of x. Here we look at perhaps the simplest case where each yi is a linear combination of xj ,
1 ≤ j ≤ n, with addition of some measurement noise νi . Thus, we are working with the following
linear system,
y = Hx + ν,
where ν = (ν1 , ν2 , . . . , νl )T , and H is an l × n matrix; or with all terms listed,
\[
\begin{pmatrix} y_1 \\ \vdots \\ y_l \end{pmatrix}
=
\begin{pmatrix} H_{11} & \cdots & H_{1n} \\ \vdots & \ddots & \vdots \\ H_{l1} & \cdots & H_{ln} \end{pmatrix}
\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}
+
\begin{pmatrix} \nu_1 \\ \vdots \\ \nu_l \end{pmatrix}.
\]
Given an estimate x̃, we consider the difference between the noisy measurements and the pro-
jected values H x̃:
ǫ = y − H x̃.
Under the least squares principle, we will try to find the value of x̃ that minimizes the cost function
\[
\begin{aligned}
J(\tilde{x}) &= \epsilon^T \epsilon \\
&= (y - H\tilde{x})^T (y - H\tilde{x}) \\
&= y^T y - \tilde{x}^T H^T y - y^T H \tilde{x} + \tilde{x}^T H^T H \tilde{x}.
\end{aligned}
\]
The necessary condition for the minimum is the vanishing of the partial derivative of J with
respect to x̃, that is,
\[
\frac{\partial J}{\partial \tilde{x}} = -2 y^T H + 2 \tilde{x}^T H^T H = 0.
\]
∗The material is adapted from Sections 3.1–3.3 in Dan Simon’s book Optimal State Estimation [1].
We solve the equation, obtaining
\[
\tilde{x} = (H^T H)^{-1} H^T y. \qquad (1)
\]
The inverse (H^T H)^{-1} exists if l ≥ n and H has full column rank. In other words, the number
of measurements must be no fewer than the number of variables, and these measurements must be linearly
independent.
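As a quick illustration (a minimal sketch; the matrix H, the true x, and the noise level below are made-up example values), the estimate (1) can be computed with a few lines of Python/NumPy. In practice a least squares solver is preferable to forming the explicit inverse.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, l = 2, 20                                     # unknowns and measurements (example sizes)
x_true = np.array([10.0, 5.0])                   # hypothetical true parameter vector
H = rng.standard_normal((l, n))                  # full column rank with probability 1
y = H @ x_true + 0.1 * rng.standard_normal(l)    # noisy measurements y = Hx + nu

x_tilde = np.linalg.inv(H.T @ H) @ H.T @ y       # equation (1)
x_lstsq, *_ = np.linalg.lstsq(H, y, rcond=None)  # numerically preferable equivalent
print(x_tilde, x_lstsq)
\end{verbatim}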
Example 1. Suppose we are trying to estimate the resistance x of an unmarked resistor based on l noisy
measurements using a multimeter. In this case,
y = Hx + ν, (2)
where
\[
H = (1, \cdots, 1)^T. \qquad (3)
\]
Substitution of the above into equation (1) gives us the optimal estimate of x as
\[
\begin{aligned}
\tilde{x} &= (H^T H)^{-1} H^T y \\
&= \frac{1}{l} H^T y \\
&= \frac{y_1 + \cdots + y_l}{l}.
\end{aligned}
\]
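A two-line numerical check of this conclusion (the readings below are made up):

\begin{verbatim}
import numpy as np

y = np.array([99.8, 100.4, 100.1, 99.7, 100.0])   # hypothetical resistance readings
H = np.ones((y.size, 1))                          # H = (1, ..., 1)^T
x_tilde = np.linalg.inv(H.T @ H) @ H.T @ y        # equation (1)
assert np.isclose(x_tilde.item(), y.mean())       # equals the sample average
\end{verbatim}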
2 Weighted Least Squares Estimation

Assume that the noise for each measurement has zero mean and is independent, but that the
measurements may have different variances,
\[
E(\nu_i^2) = \sigma_i^2, \qquad 1 \le i \le l.
\]
The covariance matrix for all measurement noise is
\[
R = E(\nu \nu^T) =
\begin{pmatrix}
\sigma_1^2 & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \sigma_l^2
\end{pmatrix}.
\]
If a measurement yi is noisy, we care less about the discrepancy between it and the ith element
of H x̃ because we do not have much confidence in this measurement. The cost function J is therefore
generalized to weight each squared residual by the inverse of the corresponding noise variance:
\[
J(\tilde{x}) = \epsilon^T R^{-1} \epsilon = \frac{\epsilon_1^2}{\sigma_1^2} + \cdots + \frac{\epsilon_l^2}{\sigma_l^2}.
\]
Minimizing this weighted cost function in the same way as before yields the weighted least squares estimate
\[
\tilde{x} = (H^T R^{-1} H)^{-1} H^T R^{-1} y. \qquad (4)
\]
Note that the measurement noise matrix R must be non-singular for a solution to exist. In other
words, each measurement yi must be corrupted by some noise for the estimation method to work.
Example 2. We return to the problem in Example 1 of resistance estimation, for which the equations are
given in (2) and (3). Suppose each of the l noisy measurements has variance
\[
E(\nu_i^2) = \sigma_i^2,
\]
so the noise covariance matrix is R = diag(σ_1^2, …, σ_l^2). Substituting H and R into (4), the weighted
estimate becomes
\[
\tilde{x} = \left( \sum_{i=1}^{l} \frac{1}{\sigma_i^2} \right)^{-1} \sum_{i=1}^{l} \frac{y_i}{\sigma_i^2},
\]
a weighted average in which the more reliable measurements (those with smaller σ_i^2) receive larger weights.
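A minimal sketch of this weighted estimate in Python/NumPy (the readings and variances are made up for illustration):

\begin{verbatim}
import numpy as np

y = np.array([99.8, 100.4, 100.1, 99.7, 100.0])   # hypothetical readings
sigma2 = np.array([0.1, 0.5, 0.2, 0.1, 1.0])      # assumed noise variances
H = np.ones((y.size, 1))
R_inv = np.diag(1.0 / sigma2)

# Weighted least squares estimate (4): low-variance measurements count more.
x_tilde = np.linalg.solve(H.T @ R_inv @ H, H.T @ R_inv @ y)

w = 1.0 / sigma2                                  # for H = (1,...,1)^T this is the
assert np.isclose(x_tilde.item(), np.sum(w * y) / np.sum(w))  # weighted average above
\end{verbatim}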
3 Recursive Least Squares Estimation
A linear recursive estimator can be written in the following form. At time k, the measurement is
\[
y_k = H_k x + \nu_k,
\]
and the estimate is updated as
\[
\tilde{x}_k = \tilde{x}_{k-1} + K_k (y_k - H_k \tilde{x}_{k-1}), \qquad (5)
\]
where K_k is a gain matrix to be determined.
The estimation error after the kth update is
\[
\begin{aligned}
\epsilon_k &= x - \tilde{x}_k \\
&= x - \tilde{x}_{k-1} - K_k (y_k - H_k \tilde{x}_{k-1}) \\
&= \epsilon_{k-1} - K_k (H_k x + \nu_k - H_k \tilde{x}_{k-1}) \\
&= \epsilon_{k-1} - K_k H_k (x - \tilde{x}_{k-1}) - K_k \nu_k \\
&= (I - K_k H_k)\epsilon_{k-1} - K_k \nu_k. \qquad (6)
\end{aligned}
\]
If E(ν_k) = 0 and E(ǫ_{k-1}) = 0, then E(ǫ_k) = 0. So if the measurement noise ν_k has zero mean for
all k, and the initial estimate of x is set equal to its expected value (x̃_0 = E(x)), then E(x̃_k) = E(x) for
all k. With this property, the estimator (5) is called unbiased. The property holds regardless of the value of
the gain vector K_k. It says that on average the estimate x̃_k will be equal to the true value x.
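The unbiasedness property is easy to check by simulation. The sketch below (scalar case with H_k = 1; the prior, gains, and noise level are arbitrary made-up values) averages the recursive estimate (5) over many trials and compares it with E(x).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
trials = 20000
mean_x, var_x = 3.0, 4.0                 # assumed prior mean and variance of x
gains = [0.9, 0.5, 0.3, 0.2, 0.1]        # arbitrary gains; unbiasedness holds for any K_k

x = mean_x + np.sqrt(var_x) * rng.standard_normal(trials)  # one random constant per trial
x_hat = np.full(trials, mean_x)                            # initial estimate = E(x)
for K in gains:
    y = x + rng.standard_normal(trials)                    # zero-mean measurement noise
    x_hat = x_hat + K * (y - x_hat)                        # update (5) with H_k = 1

print(x_hat.mean(), mean_x)              # agree up to Monte Carlo error
\end{verbatim}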
The key is to determine the optimal value of the gain vector K_k. The optimality criterion we use
is to minimize the aggregate variance of the estimation errors at time k:
\[
\begin{aligned}
J_k &= E(\| x - \tilde{x}_k \|^2) \\
&= E(\epsilon_k^T \epsilon_k) \\
&= E\bigl( \mathrm{Tr}(\epsilon_k \epsilon_k^T) \bigr) \\
&= \mathrm{Tr}(P_k), \qquad (7)
\end{aligned}
\]
where Tr is the trace operator and the n × n matrix P_k = E(ǫ_k ǫ_k^T) is the estimation-error
covariance. Next, we obtain P_k with a substitution of (6):
\[
P_k = E\Bigl[ \bigl( (I - K_k H_k)\epsilon_{k-1} - K_k \nu_k \bigr)
              \bigl( (I - K_k H_k)\epsilon_{k-1} - K_k \nu_k \bigr)^T \Bigr].
\]
The estimation error ǫ_{k-1} at time k − 1 is independent of the measurement noise ν_k at time k,
which implies that
\[
E(\nu_k \epsilon_{k-1}^T) = E(\nu_k)\, E(\epsilon_{k-1}^T) = 0,
\]
so the cross terms in the expectation above vanish.
Given the definition of the m × m matrix R_k = E(ν_k ν_k^T) as the covariance of ν_k, the expression for P_k
becomes
\[
P_k = (I - K_k H_k) P_{k-1} (I - K_k H_k)^T + K_k R_k K_k^T. \qquad (8)
\]
Equation (8) is the recurrence for the covariance of the least squares estimation error. It is
consistent with the intuition that as the measurement noise (Rk ) increases, the uncertainty (Pk )
increases. Note that Pk as a covariance matrix is positive definite.
What remains is to find the value of the gain vector K_k that minimizes the cost function given
by (7). The mean of the estimation error is already zero independent of the value of K_k. Thus
the minimizing value of K_k will make the estimation error consistently close to zero, that is, of
small variance. We need to differentiate J_k with respect to K_k.
The derivative of a function f with respect to a matrix A = (a_{ij}) is the matrix
\[
\frac{\partial f}{\partial A} = \left( \frac{\partial f}{\partial a_{ij}} \right).
\]
Theorem 1 Let X be an r × s matrix, and let C be a matrix that does not depend on X, of
dimension r × s in (9) and s × s in (10). Then the following hold:
\[
\frac{\partial\, \mathrm{Tr}(CX^T)}{\partial X} = C, \qquad (9)
\]
\[
\frac{\partial\, \mathrm{Tr}(XCX^T)}{\partial X} = XC + XC^T. \qquad (10)
\]
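These identities can be checked numerically with central differences (random C and X of compatible sizes, chosen arbitrarily):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
r, s, h = 3, 4, 1e-6
X = rng.standard_normal((r, s))
C9 = rng.standard_normal((r, s))      # C for identity (9): same shape as X
C10 = rng.standard_normal((s, s))     # C for identity (10): square, s x s

def num_grad(f, X):
    """Entrywise central-difference approximation of the matrix derivative df/dX."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

assert np.allclose(num_grad(lambda Z: np.trace(C9 @ Z.T), X), C9, atol=1e-5)
assert np.allclose(num_grad(lambda Z: np.trace(Z @ C10 @ Z.T), X),
                   X @ C10 + X @ C10.T, atol=1e-5)
\end{verbatim}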
A proof of the theorem is given in Appendix A. In the case that C is symmetric, ∂Tr(XCX^T)/∂X =
2XC. With these facts in mind, we first substitute (8) into (7) and differentiate the resulting
expression with respect to Kk :
\[
\begin{aligned}
\frac{\partial J_k}{\partial K_k}
&= \frac{\partial}{\partial K_k} \mathrm{Tr}\bigl( P_{k-1} - K_k H_k P_{k-1} - P_{k-1} H_k^T K_k^T
   + K_k (H_k P_{k-1} H_k^T) K_k^T \bigr) + \frac{\partial}{\partial K_k} \mathrm{Tr}(K_k R_k K_k^T) \\
&= -2 \frac{\partial}{\partial K_k} \mathrm{Tr}(P_{k-1} H_k^T K_k^T)
   + 2 K_k (H_k P_{k-1} H_k^T) + 2 K_k R_k \qquad \text{(by (10))} \\
&= -2 P_{k-1} H_k^T + 2 K_k H_k P_{k-1} H_k^T + 2 K_k R_k \qquad \text{(by (9))} \\
&= -2 P_{k-1} H_k^T + 2 K_k (H_k P_{k-1} H_k^T + R_k).
\end{aligned}
\]
In the second equation above, we also used that Pk−1 is independent of Kk and that Kk Hk Pk−1 and
Pk−1 HkT KkT are transposes of each other (since Pk−1 is symmetric) so they have the same trace.
Setting the partial derivative to zero, we solve for K_k:
\[
\begin{aligned}
K_k &= P_{k-1} H_k^T (H_k P_{k-1} H_k^T + R_k)^{-1} \qquad (11) \\
    &= P_{k-1} H_k^T S_k^{-1}, \qquad (12)
\end{aligned}
\]
where S_k = H_k P_{k-1} H_k^T + R_k. Substituting this expression for K_k into equation (8) for P_k and
expanding leads to a few steps of manipulation as follows:
\[
\begin{aligned}
P_k &= (I - P_{k-1} H_k^T S_k^{-1} H_k) P_{k-1} (I - P_{k-1} H_k^T S_k^{-1} H_k)^T
       + P_{k-1} H_k^T S_k^{-1} R_k S_k^{-1} H_k P_{k-1} \\
&= P_{k-1} - P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} - P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} \\
&\qquad + P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} H_k^T S_k^{-1} H_k P_{k-1}
        + P_{k-1} H_k^T S_k^{-1} R_k S_k^{-1} H_k P_{k-1} \\
&= P_{k-1} - P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} - P_{k-1} H_k^T S_k^{-1} H_k P_{k-1}
        + P_{k-1} H_k^T S_k^{-1} S_k S_k^{-1} H_k P_{k-1} \\
&\qquad \text{(merging $H_k P_{k-1} H_k^T + R_k$ in the last two terms into $S_k$)} \\
&= P_{k-1} - 2 P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} + P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} \\
&= P_{k-1} - P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} \qquad (13) \\
&= P_{k-1} - K_k H_k P_{k-1} \qquad \text{(by (12))} \\
&= (I - K_k H_k) P_{k-1}. \qquad (14)
\end{aligned}
\]
Note that in the above Pk is symmetric as a covariance matrix, and so is Sk .
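As a sanity check, for randomly generated P_{k−1}, H_k, and R_k (made-up data), the general form (8) evaluated at the optimal gain agrees with the short form (14):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 2
A = rng.standard_normal((n, n)); P_prev = A @ A.T + n * np.eye(n)  # SPD covariance
B = rng.standard_normal((m, m)); R = B @ B.T + m * np.eye(m)       # SPD noise covariance
H = rng.standard_normal((m, n))

S = H @ P_prev @ H.T + R
K = P_prev @ H.T @ np.linalg.inv(S)                              # optimal gain, (11)-(12)

I = np.eye(n)
P_joseph = (I - K @ H) @ P_prev @ (I - K @ H).T + K @ R @ K.T    # equation (8)
P_short = (I - K @ H) @ P_prev                                   # equation (14)
assert np.allclose(P_joseph, P_short)
\end{verbatim}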
We take the inverse of both sides of equation (13), plug in the expression for S_k, and expand and
merge terms to obtain
\[
P_k^{-1} = P_{k-1}^{-1} + H_k^T R_k^{-1} H_k, \qquad (15)
\]
from which we obtain an alternative expression for the covariance matrix:
\[
P_k = \bigl( P_{k-1}^{-1} + H_k^T R_k^{-1} H_k \bigr)^{-1}. \qquad (16)
\]
This expression is more complicated than (14) since it requires three matrix inversions. Neverthe-
less, it has computational advantages in certain situations in practice [1, pp.156–158].
We can also derive an alternative form for the gain K_k as follows. Start by multiplying the right-hand
side of (11) on the left with P_k P_k^{-1}. Then, substitute (15) for P_k^{-1} in the resulting expression.
Multiply the P_{k-1} H_k^T factor into the parenthesized factor on its left, and extract H_k^T R_k^{-1} out of
the parentheses. The last two parenthesized factors then cancel each other, yielding
\[
K_k = P_k H_k^T R_k^{-1}. \qquad (17)
\]
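Likewise, the two gain expressions (11) and (17), and the covariance forms (14) and (16), can be confirmed to coincide on random data (made-up values again):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
n, m = 3, 2
A = rng.standard_normal((n, n)); P_prev = A @ A.T + n * np.eye(n)
B = rng.standard_normal((m, m)); R = B @ B.T + m * np.eye(m)
H = rng.standard_normal((m, n))

K_a = P_prev @ H.T @ np.linalg.inv(H @ P_prev @ H.T + R)                 # equation (11)
P_k = np.linalg.inv(np.linalg.inv(P_prev) + H.T @ np.linalg.inv(R) @ H)  # equation (16)
K_b = P_k @ H.T @ np.linalg.inv(R)                                       # equation (17)

assert np.allclose(K_a, K_b)
assert np.allclose(P_k, (np.eye(n) - K_a @ H) @ P_prev)                  # matches (14)
\end{verbatim}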
The recursive least squares estimator is summarized as follows.

1. Initialize the estimator with the prior knowledge of x:
\[
\tilde{x}_0 = E(x), \qquad P_0 = E\bigl( (x - \tilde{x}_0)(x - \tilde{x}_0)^T \bigr).
\]
In the case of no prior knowledge about x, simply let P_0 = ∞I. In the case of perfect prior
knowledge, let P_0 = 0.

2. Iterate the following two steps.
(a) Obtain a new measurement y k , assuming that it is given by the equation
y k = Hk x + ν k ,
where the noise ν_k has zero mean and covariance R_k. The measurement noise at different
time steps is independent, so
\[
E(\nu_i \nu_j^T) =
\begin{cases}
0, & \text{if } i \ne j, \\
R_i, & \text{if } i = j.
\end{cases}
\]
Essentially, we assume white measurement noise.
(b) Update the estimate x̃ and the covariance of the estimation error sequentially according
to (11), (5), and (14), which are re-listed below (with (17) and (16) as alternative forms for
the gain and the covariance):
\[
\begin{aligned}
K_k &= P_{k-1} H_k^T (H_k P_{k-1} H_k^T + R_k)^{-1} = P_k H_k^T R_k^{-1}, \qquad (18) \\
\tilde{x}_k &= \tilde{x}_{k-1} + K_k (y_k - H_k \tilde{x}_{k-1}), \qquad (19) \\
P_k &= (I - K_k H_k) P_{k-1} = \bigl( P_{k-1}^{-1} + H_k^T R_k^{-1} H_k \bigr)^{-1}. \qquad (20)
\end{aligned}
\]
Note that (20) and (19) can switch their order in one round of update.
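The whole procedure fits in a few lines of Python/NumPy. The sketch below is one possible implementation (the function name and the demo values are made up):

\begin{verbatim}
import numpy as np

def rls_update(x_hat, P, y, H, R):
    """One round of the update step (b), equations (18)-(20)."""
    S = H @ P @ H.T + R                       # innovation covariance H_k P_{k-1} H_k^T + R_k
    K = P @ H.T @ np.linalg.inv(S)            # gain, (18)
    x_hat = x_hat + K @ (y - H @ x_hat)       # estimate update, (19)
    P = (np.eye(P.shape[0]) - K @ H) @ P      # covariance update, (20)
    return x_hat, P

# Hypothetical demo: repeated noisy readings of a constant resistance.
rng = np.random.default_rng(5)
x_true = np.array([100.0])
x_hat, P = np.array([90.0]), np.array([[25.0]])   # made-up prior estimate and covariance
H, R = np.array([[1.0]]), np.array([[4.0]])       # scalar measurement model
for _ in range(30):
    y = H @ x_true + np.sqrt(R[0, 0]) * rng.standard_normal(1)
    x_hat, P = rls_update(x_hat, P, y, H, R)
print(x_hat, P)     # estimate near 100, covariance shrunk well below the prior
\end{verbatim}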
Example 3. We revisit the resistance estimation problem presented in Examples 1 and 2. Now, we want
to iteratively improve our estimate of the resistance x. At the kth sampling, our measurement is
\[
y_k = H_k x + \nu_k = x + \nu_k,
\]
with noise variance R_k = E(\nu_k^2). Here, the measurement matrix H_k reduces to the scalar 1. Furthermore,
we suppose that each measurement has the same variance, so R_k is a constant, written R.
Before the first measurement, we have some idea about the resistance x. This becomes our initial
estimate. Also, we have some uncertainty about this initial estimate, which becomes our initial covariance.
Together we have
x̃0 = E(x),
P0 = E((x − x̃0 )2 ).
If we have no idea about the resistance, set P0 = ∞. If we are certain about the resistance value, set P0 = 0.
(Of course, then there would be no need to take measurements.)
After the first measurement (k = 1), we update the estimate and the error covariance according to equa-
tions (18)–(20) as follows:
\[
\begin{aligned}
K_1 &= \frac{P_0}{P_0 + R}, \\
\tilde{x}_1 &= \tilde{x}_0 + \frac{P_0}{P_0 + R}(y_1 - \tilde{x}_0), \\
P_1 &= \left( 1 - \frac{P_0}{P_0 + R} \right) P_0 = \frac{P_0 R}{P_0 + R}.
\end{aligned}
\]
After the second measurement, the estimates become
\[
\begin{aligned}
K_2 &= \frac{P_1}{P_1 + R} = \frac{P_0}{2P_0 + R}, \\
\tilde{x}_2 &= \tilde{x}_1 + \frac{P_1}{P_1 + R}(y_2 - \tilde{x}_1) \\
&= \frac{P_0 + R}{2P_0 + R}\,\tilde{x}_1 + \frac{P_0}{2P_0 + R}\,y_2, \\
P_2 &= \frac{P_1 R}{P_1 + R} = \frac{P_0 R}{2P_0 + R}.
\end{aligned}
\]
By induction, we can show that
\[
\begin{aligned}
K_k &= \frac{P_0}{kP_0 + R}, \\
\tilde{x}_k &= \frac{(k-1)P_0 + R}{kP_0 + R}\,\tilde{x}_{k-1} + \frac{P_0}{kP_0 + R}\,y_k, \\
P_k &= \frac{P_0 R}{kP_0 + R}.
\end{aligned}
\]
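These closed forms are easily verified against the scalar recursion (arbitrary values of P_0 and R):

\begin{verbatim}
import numpy as np

P0, R = 2.0, 0.5          # arbitrary prior variance and measurement noise variance
P = P0
for k in range(1, 11):
    K = P / (P + R)       # scalar gain K_k
    P = (1 - K) * P       # covariance update P_k = (1 - K_k) P_{k-1}
    assert np.isclose(K, P0 / (k * P0 + R))
    assert np.isclose(P, P0 * R / (k * P0 + R))
\end{verbatim}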
Note that if x is known perfectly a priori, then P0 = 0, which implies that Kk = 0 and x̃k = x̃0 , for all
k. The optimal estimate of x is independent of any measurements that are obtained. At the opposite end of
the spectrum, if x is completely unknown a priori, then P0 = ∞. The above equation for x̃k becomes,
\[
\begin{aligned}
\tilde{x}_k &= \lim_{P_0 \to \infty} \left[ \frac{(k-1)P_0 + R}{kP_0 + R}\,\tilde{x}_{k-1}
             + \frac{P_0}{kP_0 + R}\,y_k \right] \\
&= \frac{k-1}{k}\,\tilde{x}_{k-1} + \frac{1}{k}\,y_k \\
&= \frac{1}{k}\bigl( (k-1)\tilde{x}_{k-1} + y_k \bigr).
\end{aligned}
\]
The right-hand side of the last equation above is just the running average \bar{y}_k = \frac{1}{k}\sum_{j=1}^{k} y_j of the measure-
ments. To see this, we first have
\[
\begin{aligned}
\sum_{j=1}^{k} y_j &= \sum_{j=1}^{k-1} y_j + y_k \\
&= (k-1) \cdot \frac{1}{k-1}\sum_{j=1}^{k-1} y_j + y_k \\
&= (k-1)\bar{y}_{k-1} + y_k.
\end{aligned}
\]
Dividing both sides by k gives \bar{y}_k = \frac{1}{k}\bigl( (k-1)\bar{y}_{k-1} + y_k \bigr).
Since x̃1 = ȳ1 , the recurrences for x̃k and ȳk are the same. Hence x̃k = ȳk for all k.
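Numerically, choosing P_0 very large (here 10^9 as a stand-in for ∞; all other values are made up) indeed makes the recursive estimate track the running average of the measurements:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
x_true, R = 10.0, 0.01
P, x_hat = 1e9, 0.0        # huge P_0 approximates "no prior knowledge"
ys = []
for k in range(1, 101):
    y = x_true + np.sqrt(R) * rng.standard_normal()
    ys.append(y)
    K = P / (P + R)
    x_hat = x_hat + K * (y - x_hat)
    P = (1 - K) * P
    assert np.isclose(x_hat, np.mean(ys), atol=1e-4)
\end{verbatim}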
Example 4. As a second example, suppose we estimate a two-dimensional constant vector x = (x_1, x_2)^T from
the scalar measurements
\[
y_k = x_1 + 0.99^{k-1} x_2 + \nu_k,
\]
where H_k = (1, 0.99^{k-1}) and ν_k is a random variable with zero mean and variance R = 0.01.
Let the true values be x = (x_1, x_2)^T = (10, 5)^T. Suppose the initial estimate is x̃_0 = (8, 7)^T, with
P_0 equal to the identity matrix. We apply the recursive least squares algorithm. The next figure² shows the
evolution of the two components of the estimate x̃, along with that of the variances of the estimation errors. It can be
seen that after a couple dozen measurements, the estimates are getting very close to the true values 10 and 5.
The variances of the estimation errors asymptotically approach zero. This means that we have increasingly
more confidence in the estimates with more measurements obtained.
²Figure 3.1, p. 92 of [1].
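The behavior described above can be reproduced with a short script (a sketch; the random seed and number of iterations are arbitrary):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(7)
x_true = np.array([10.0, 5.0])
x_hat = np.array([8.0, 7.0])       # initial estimates of x1 and x2
P = np.eye(2)                      # P_0 = identity
R = 0.01

for k in range(1, 101):
    H = np.array([[1.0, 0.99 ** (k - 1)]])
    y = H @ x_true + np.sqrt(R) * rng.standard_normal(1)
    S = (H @ P @ H.T + R).item()                 # scalar innovation covariance
    K = (P @ H.T) / S                            # 2x1 gain
    x_hat = x_hat + (K * (y - H @ x_hat)).ravel()
    P = (np.eye(2) - K @ H) @ P

print(x_hat)       # close to the true values (10, 5)
print(np.diag(P))  # error variances decrease as measurements accumulate
\end{verbatim}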
A Proof of Theorem 1
Proof
Denote C = (c_{ij}), X = (x_{ij}), and CX^T = (d_{ij}). The trace of CX^T is
\[
\begin{aligned}
\mathrm{Tr}(CX^T) &= \sum_{t=1}^{r} d_{tt} \\
&= \sum_{t=1}^{r} \sum_{k=1}^{s} c_{tk} x_{tk}.
\end{aligned}
\]
From the above, we easily obtain its partial derivatives with respect to the entries of X:
\[
\frac{\partial}{\partial x_{ij}} \mathrm{Tr}(CX^T) = c_{ij}.
\]
This proves (9). To show (10), we regard the two occurrences of X in Tr(XCX^T) as independent
and apply the product rule:
\[
\begin{aligned}
\frac{\partial}{\partial X} \mathrm{Tr}(XCX^T)
&= \left. \frac{\partial}{\partial X} \mathrm{Tr}(XCY^T) \right|_{Y=X}
 + \left. \frac{\partial}{\partial X} \mathrm{Tr}(YCX^T) \right|_{Y=X} \\
&= \left. \frac{\partial}{\partial X} \mathrm{Tr}(YC^TX^T) \right|_{Y=X}
 + \left. YC \right|_{Y=X} \qquad \text{(by (9))} \\
&= \left. YC^T \right|_{Y=X} + XC \\
&= XC^T + XC,
\end{aligned}
\]
which proves (10).
References
[1] D. Simon. Optimal State Estimation. John Wiley & Sons, Inc., Hoboken, New Jersey, 2006.