
Least squares

Linear least squares


Regression: Consider a data set (x_i, y_i), i = 1, . . . , N. We want to find the best straight-line fit through the points (x_i, y_i). Let y = θ_1 x + θ_0. We consider the following optimization problem

    minimize_{θ_0, θ_1}  Σ_{i=1}^{N} ‖y_i − θ_1 x_i − θ_0‖^2.                          (1)

This can be identified with the linear least squares problem minimize_θ ‖Φθ − y‖^2 as follows. Let y = (y_1, y_2, . . . , y_N)^T, θ = (θ_1, θ_0)^T, and let Φ ∈ R^{N×2} be the matrix whose i-th row is (x_i, 1), so that the model reads Φθ = y. Then, we want to find the solution to this overdetermined system of equations. The familiar methods to solve Φθ = y are

1. SVD: Let Φ = U Σ V^T (thin SVD). Then

       U Σ V^T θ = y  ⇒  Σ V^T θ = U^T y.

   Solving Σ z = U^T y for z and setting θ = V z gives the solution.

2. QR decomposition: Let Φ = QR. Then

       QRθ = y  ⇒  Rθ = Q^T y

   and θ can be found by back substitution.

3. Normal equations: Φ^T Φ θ = Φ^T y  ⇒  θ = (Φ^T Φ)^{-1} Φ^T y.

The method of normal equations can be derived from the necessary and sufficient conditions for optimality. Consider ‖Φθ − y‖_2^2 = (Φθ − y)^T (Φθ − y), where Φ has full column rank. Taking the derivative w.r.t. θ and setting it to zero gives

    Φ^T Φ θ = Φ^T y  ⇒  θ = (Φ^T Φ)^{-1} Φ^T y.

The second derivative w.r.t. θ is (proportional to) Φ^T Φ > 0, which shows that this stationary point is a minimum.
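As a quick illustration, the three methods can be compared numerically. The following is a minimal sketch in Python/NumPy (the data and variable names are illustrative and not part of the notes):

    import numpy as np

    # Sketch: fit y = theta1*x + theta0 to noisy data and compare the three methods above.
    rng = np.random.default_rng(0)
    N = 50
    x = np.linspace(0.0, 10.0, N)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=N)

    Phi = np.column_stack([x, np.ones(N)])      # i-th row is (x_i, 1)

    # 1. SVD: theta = V z with Sigma z = U^T y
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    theta_svd = Vt.T @ ((U.T @ y) / s)

    # 2. QR: solve R theta = Q^T y by back substitution
    Q, R = np.linalg.qr(Phi)
    theta_qr = np.linalg.solve(R, Q.T @ y)

    # 3. Normal equations: (Phi^T Phi) theta = Phi^T y
    theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

    print(theta_svd, theta_qr, theta_ne)        # all close to (2, 1)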
The method of least squares also extends to polynomial fits/nonlinear regression. Given a set of data points (x_i, y_i), i = 1, . . . , N, the best n-th degree polynomial fit is

    minimize_{θ_0, . . . , θ_n}  Σ_{i=1}^{N} ‖y_i − (θ_n x_i^n + · · · + θ_0)‖^2.          (2)

In matrix form, y = Φθ with y = (y_1, . . . , y_N)^T, θ = (θ_n, θ_{n−1}, . . . , θ_1, θ_0)^T, and the i-th row of Φ equal to (x_i^n, x_i^{n−1}, . . . , x_i, 1), i.e., Φ is a Vandermonde matrix. Rows of the matrix Φ are called regressors and the vector of unknowns is the parameter vector. One can consider other regression fits as well.
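A short sketch of a polynomial fit using the Vandermonde regressor matrix described above (the degree and data are illustrative assumptions):

    import numpy as np

    # Sketch: cubic fit via the Vandermonde design matrix.
    rng = np.random.default_rng(1)
    x = np.linspace(-1.0, 1.0, 40)
    y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.1, size=x.size)

    n = 3
    # Columns x^n, x^(n-1), ..., x, 1, matching theta = (theta_n, ..., theta_0)
    Phi = np.vander(x, n + 1)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(theta)   # approximately (0.5, 0, -2, 1)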

Least squares estimate and mean squared error: Let y(t) be the observed output with known input and let θ be the parameter (scalar/vector/matrix) associated with the model. Let ŷ(t|θ) be the output of the model under a given input. Then, the mean squared error (MSE) is

    V_N(θ) = (1/N) Σ_{t=1}^{N} (ŷ(t|θ) − y(t))^2.                          (3)

The least squares estimate is

    θ* = argmin_θ V_N(θ).                                                  (4)
Weighted LS: Let ‖Φθ − y‖_W^2 = (Φθ − y)^T W (Φθ − y), W > 0. Then, it can easily be shown using second order optimality conditions that

    θ̂_wls = (Φ_N^T W Φ_N)^{-1} Φ_N^T W y.                                  (5)

For

    y(t) = φ^T(t) θ + ε(t),                                                (6)

choosing W = diag(σ_ε^{-2}(i)), i = 1, . . . , N, where σ_ε^2(i) is the variance of ε(i), forms a WLS estimator in the presence of noisy measurements with unequal noise variances.
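A minimal sketch of the WLS estimate (5) with W = diag(σ_ε^{-2}(i)) for heteroscedastic noise (the model and noise levels are illustrative assumptions):

    import numpy as np

    # Sketch: weighted LS with per-sample noise variances, eq. (5).
    rng = np.random.default_rng(2)
    N = 100
    x = np.linspace(0.0, 5.0, N)
    Phi = np.column_stack([x, np.ones(N)])
    theta_true = np.array([1.5, -0.5])

    sigma = 0.1 + 0.4 * rng.random(N)                 # heteroscedastic noise std devs
    y = Phi @ theta_true + sigma * rng.normal(size=N)

    W = np.diag(sigma**-2)                            # W = diag(sigma_eps^{-2}(i))
    theta_wls = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y)
    theta_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
    print(theta_wls, theta_ls)                        # both near theta_true; WLS has lower variance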
MVU estimators using LS for Gaussian noise: Consider

    y = Φθ + v                                                             (7)

where y = (y(0), . . . , y(N−1))^T, v = (v(0), . . . , v(N−1))^T and v ∼ N(0, σ²I). Note that ŷ|θ = Φθ and the LS solution is the minimizer of ‖y − Φθ‖_2^2, which is

    θ̂ = (Φ^T Φ)^{-1} Φ^T y.                                               (8)

Clearly, E[θ̂] = θ as the noise is zero mean. It turns out that the MVUE is also given by the LS solution. If the noise is not zero mean, then the LS solution is biased. To find the covariance matrix of θ̂, observe that

    θ̂ − θ = (Φ^T Φ)^{-1} Φ^T (Φθ + v) − (Φ^T Φ)^{-1} (Φ^T Φ)θ = (Φ^T Φ)^{-1} Φ^T v

    ⇒  Σ_θ̂ = E[(θ̂ − θ)(θ̂ − θ)^T] = (Φ^T Φ)^{-1} Φ^T E[vv^T] Φ (Φ^T Φ)^{-1} = σ² (Φ^T Φ)^{-1}.   (9)

For linear models with zero mean Gaussian noise, LS estimates are also MVUE ([4]).
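A quick Monte Carlo sketch that checks (9) numerically (the design matrix and noise level are illustrative assumptions):

    import numpy as np

    # Sketch: empirical covariance of the LS estimate vs. sigma^2 (Phi^T Phi)^{-1}.
    rng = np.random.default_rng(3)
    N, sigma = 200, 0.3
    x = np.linspace(0.0, 1.0, N)
    Phi = np.column_stack([x, np.ones(N)])
    theta_true = np.array([2.0, 1.0])

    estimates = []
    for _ in range(5000):
        y = Phi @ theta_true + sigma * rng.normal(size=N)
        estimates.append(np.linalg.solve(Phi.T @ Phi, Phi.T @ y))
    emp_cov = np.cov(np.array(estimates).T)

    print(emp_cov)
    print(sigma**2 * np.linalg.inv(Phi.T @ Phi))   # should match the empirical covariance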
Statistical weighted LS: We consider the following assumptions.

• The N measurements y(t) are generated by a model y(t) = φ(t)^T θ_0 + ε(t) (φ(t) is a regressor vector), where θ_0 is the true but unknown value.

• The N regressor vectors φ(t) are deterministic and not corrupted by noise.

• The N noise terms ε(t) have zero mean. Often, they are also independent or iid; this determines the optimal weighting matrix. Let ε_N := (ε(1), . . . , ε(N))^T.

• The WLS estimate is θ̂_wls = (Φ_N^T W Φ_N)^{-1} Φ_N^T W y_N.

Notice that

    E[θ̂_wls] = (Φ_N^T W Φ_N)^{-1} Φ_N^T W E[y_N] = (Φ_N^T W Φ_N)^{-1} Φ_N^T W E[Φ_N θ_0 + ε_N] = θ_0,

which implies unbiased estimation.


Notice that

    θ̂_wls − θ_0 = (Φ_N^T W Φ_N)^{-1} Φ_N^T W (Φ_N θ_0 + ε_N) − (Φ_N^T W Φ_N)^{-1} (Φ_N^T W Φ_N) θ_0
                = (Φ_N^T W Φ_N)^{-1} Φ_N^T W ε_N.

Therefore, the covariance matrix of θ̂_wls is computed as

    cov(θ̂_wls) = E[(θ̂_wls − θ_0)(θ̂_wls − θ_0)^T] = (Φ_N^T W Φ_N)^{-1} Φ_N^T W E[ε_N ε_N^T] W Φ_N (Φ_N^T W Φ_N)^{-1}
               = (Φ_N^T W Φ_N)^{-1} Φ_N^T W Σ_N W Φ_N (Φ_N^T W Φ_N)^{-1}                        (10)

where E[ε_N ε_N^T] = Σ_N = cov(ε_N). Thus, cov(θ̂_wls) depends on the choice of W. If Σ_N is known, choosing W = Σ_N^{-1} gives

    cov(θ̂_wls) = (Φ_N^T Σ_N^{-1} Φ_N)^{-1}.                                                   (11)

This is the optimal choice of W, i.e., cov(θ̂_wls) ≥ (Φ_N^T Σ_N^{-1} Φ_N)^{-1} for all other W. We refer the reader to [4] for Bayesian linear models and LS.
MLE and LS for Gaussian noise: It turns out that for linear models y = Φθ + v with v Gaussian, the MLE is given by the weighted LS solution, which is also the MVUE ([4]). Consider (7) with v ∼ N(0, Σ). Then

    p(y; θ) = 1/((2π)^{N/2} det(Σ)^{1/2}) exp(−(1/2)(y − Φθ)^T Σ^{-1} (y − Φθ))

and the MLE can be found by minimizing

    (y − Φθ)^T Σ^{-1} (y − Φθ).                                            (12)

Setting the gradient of the log-likelihood to zero,

    0 = ∂ log p(y; θ)/∂θ = Φ^T Σ^{-1} (y − Φθ)  ⇒  θ̂ = (Φ^T Σ^{-1} Φ)^{-1} Φ^T Σ^{-1} y,

and the covariance is Σ_θ̂ = (Φ^T Σ^{-1} Φ)^{-1} (which is the optimal estimator, i.e., the one with minimum covariance). (Notice that E[θ̂] = θ.) Thus, the MLE and the MMSE (minimum mean squared error) estimator or the (weighted) MVUE coincide when the noise is white Gaussian with zero mean. However, for Gaussian noise with nonzero mean, although the LS or weighted LS solution still coincides with the MLE, the LS solution is biased.
Ill-posed LS problems and the Moore–Penrose pseudoinverse: In ill-posed problems, Φ^T Φ is not invertible and there is no unique LS solution. In such cases, with Φ = U Σ V^T, the Moore–Penrose pseudoinverse is given by Φ^+ = V Σ^+ U^T where Σ^+ = diag(σ_1^{-1}, . . . , σ_r^{-1}, 0, . . . , 0) and r is the rank of Φ. The minimum-norm LS solution is then θ̂ = Φ^+ y.
Regularized least squares: To find a unique LS solution, one can pose a regularized optimization problem (ridge regression)

    minimize_θ  (1/2)‖y − Φθ‖_2^2 + (α/2)‖θ‖_2^2                           (13)

where α > 0. This problem has the unique solution θ* = (Φ^T Φ + αI)^{-1} Φ^T y. Note that for y = Φθ + v with v being WGN N(0, σ²I),

    E[θ*] = (Φ^T Φ + αI)^{-1} Φ^T Φ θ,                                      (14)

i.e., the estimator is biased, and becomes unbiased only for α = 0. The covariance of θ* is

    E[(θ* − E[θ*])(θ* − E[θ*])^T] = (Φ^T Φ + αI)^{-1} Φ^T E[vv^T] Φ (Φ^T Φ + αI)^{-1}
                                  = σ² (Φ^T Φ + αI)^{-1} Φ^T Φ (Φ^T Φ + αI)^{-1}.               (15)

It turns out that regularized LS estimators have lower covariance (i.e., they are less sensitive to variations in the data) than LS estimators. On the flip side, regularized LS estimators are biased. Suppose θ is a scalar. Then var(θ*) = σ² Φ^T Φ / (Φ^T Φ + α)^2 < var(θ̂_ls) = σ² (Φ^T Φ)^{-1}, which verifies the lower variance in the scalar case. In cases with low signal to noise ratio (SNR), one prefers regularized LS over LS. There are other regularizers as well, such as ‖θ‖_W with W > 0, or ‖θ‖_1 which promotes sparse solutions, and so on. Notice that in (13), the first term corresponds to the variance and the second term corresponds to the (bias)².

• At α = 0, there is no bias but high variance, and we recover the LS solution.

• As α increases, bias increases and the variance goes down.

• As α → ∞, θ* → 0 (the trivial estimator) independent of the data, so there is no variance and only bias.

• The bias variance trade-off depends on α as well as the choice of the norm in the second term of (13).

• The regularizer is added to penalize undesired solutions.
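The bias–variance behaviour described in the bullets above can be checked numerically. A minimal sketch of the ridge estimator θ* = (Φ^T Φ + αI)^{-1} Φ^T y for increasing α (illustrative data, not from the notes):

    import numpy as np

    # Sketch: bias and variance of the ridge estimator as alpha grows.
    rng = np.random.default_rng(4)
    N, sigma = 30, 1.0
    x = np.linspace(0.0, 1.0, N)
    Phi = np.column_stack([x, np.ones(N)])
    theta_true = np.array([1.0, 0.5])

    for alpha in [0.0, 1.0, 10.0, 100.0]:
        A = np.linalg.inv(Phi.T @ Phi + alpha * np.eye(2)) @ Phi.T
        ests = np.array([A @ (Phi @ theta_true + sigma * rng.normal(size=N))
                         for _ in range(2000)])
        bias = ests.mean(axis=0) - theta_true
        var = ests.var(axis=0).sum()
        print(f"alpha={alpha:6.1f}  bias={bias}  total variance={var:.4f}")
    # As alpha increases the variance shrinks while the bias grows, as in the bullets above.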


MAP and regularized LS for Gaussian noise: [2] Recall that

    p(θ|y) = p(y, θ)/p(y) = p(y|θ) p(θ)/p(y)

and

    θ̂_MAP = argmin_θ { − log p(y|θ) − log p(θ) }.

Consider the linear model y = Φθ + w with w ∼ N(0, Σ_w) and prior θ ∼ N(θ_0, Σ_θ). Note that

    p(y|θ) = 1/((2π)^{N/2} det(Σ_w)^{1/2}) exp(−(1/2)(y − Φθ)^T Σ_w^{-1} (y − Φθ)),
    p(θ)   = 1/((2π)^{n/2} det(Σ_θ)^{1/2}) exp(−(1/2)(θ − θ_0)^T Σ_θ^{-1} (θ − θ_0)).

Therefore, ignoring constants,

    − log p(y|θ) − log p(θ) = (1/2)[(y − Φθ)^T Σ_w^{-1} (y − Φθ) + (θ − θ_0)^T Σ_θ^{-1} (θ − θ_0)].

Thus,

    θ̂_MAP = argmin_θ (1/2)(‖y − Φθ‖²_{Σ_w^{-1}} + ‖θ − θ_0‖²_{Σ_θ^{-1}}),                      (16)

which is a regularized LS problem. Since this is a quadratic program, the optimal solution can be found by setting the derivative w.r.t. θ to zero, i.e.,

    Φ^T Σ_w^{-1} (y − Φθ) − Σ_θ^{-1} (θ − θ_0) = 0  ⇒  θ̂_MAP = (Φ^T Σ_w^{-1} Φ + Σ_θ^{-1})^{-1} (Φ^T Σ_w^{-1} y + Σ_θ^{-1} θ_0).   (17)

(It turns out that the MMSE estimator E[θ|y] also coincides with the MAP estimator for linear Gaussian models; note that E[θ|y] is linear in y for such models.)
In the above setup, suppose Σ_w = I, θ_0 = 0, Σ_θ = λ^{-1} I. Then, (16) becomes

    θ̂_MAP = argmin_θ (1/2)(‖y − Φθ‖^2 + λ‖θ‖^2),

which exactly matches the ridge regularization (a larger λ, i.e., a tighter prior, means heavier regularization).
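A minimal sketch of the closed-form MAP estimate (17) for a linear Gaussian model (the dimensions, prior, and noise covariance are illustrative assumptions):

    import numpy as np

    # Sketch: MAP estimate (17) for a linear Gaussian model with Gaussian prior.
    rng = np.random.default_rng(5)
    N, n = 50, 3
    Phi = rng.normal(size=(N, n))
    theta_true = np.array([1.0, -2.0, 0.5])

    Sigma_w = 0.25 * np.eye(N)                    # measurement noise covariance (assumed)
    Sigma_th = 4.0 * np.eye(n)                    # prior covariance (assumed)
    theta0 = np.zeros(n)                          # prior mean (assumed)

    y = Phi @ theta_true + rng.multivariate_normal(np.zeros(N), Sigma_w)

    Sw_inv, Sth_inv = np.linalg.inv(Sigma_w), np.linalg.inv(Sigma_th)
    theta_map = np.linalg.solve(Phi.T @ Sw_inv @ Phi + Sth_inv,
                                Phi.T @ Sw_inv @ y + Sth_inv @ theta0)
    print(theta_map)                               # shrunk slightly toward theta0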
L1-estimator: It is given by argmin_θ ‖y − M(θ)‖_1. For linear models, it is argmin_θ ‖y − Φθ‖_1. Suppose y_N = Φθ + ε_N where the ε(k) are iid and Laplace distributed, p(ε) = (1/2σ) e^{−|ε|/σ} (this noise model is used when the observations contain many outliers). Then, the MLE is the same as the optimal L1 estimator. When ε is Gaussian, the MLE is the same as the optimal L2 or LS estimator.
Lasso: The least absolute shrinkage and selection operator (lasso) is another type of regularizer which gives sparse solutions to LS problems. It is formulated as

    minimize_θ  (1/2)‖y − Φθ‖_2^2 + λ‖θ‖_1                                  (18)

where λ > 0. A higher λ results in sparser solutions. In the Bayesian setting, if we use a Laplace prior rather than a Gaussian one, we end up with lasso regularization rather than ridge regularization. For example, let y = Φθ + w, w ∼ N(0, Σ_w), and p(θ) = (1/2σ) e^{−‖θ‖_1/σ}. Then, proceeding as above in the MAP case and ignoring constants,

    − log p(y|θ) − log p(θ) = (1/2)(y − Φθ)^T Σ_w^{-1} (y − Φθ) + ‖θ‖_1/σ.

Thus, θ̂_MAP is given by the solution of a lasso problem.
Elastic net regularizers are of the form

    minimize_θ  (1/2)‖y − Φθ‖_2^2 + λ_1‖θ‖_1 + λ_2‖θ‖_2^2                    (19)

where λ_1, λ_2 > 0, which combines features of ridge regression and lasso.
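A minimal sketch of solving the lasso problem (18); the solver here is plain proximal gradient descent (ISTA), which is one possible choice and is not prescribed by the notes, and the data are illustrative:

    import numpy as np

    def lasso_ista(Phi, y, lam, n_iter=500):
        """Proximal gradient (ISTA) for (1/2)||y - Phi theta||_2^2 + lam*||theta||_1."""
        L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
        theta = np.zeros(Phi.shape[1])
        for _ in range(n_iter):
            grad = Phi.T @ (Phi @ theta - y)     # gradient of the smooth part
            z = theta - grad / L
            theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
        return theta

    rng = np.random.default_rng(6)
    Phi = rng.normal(size=(60, 20))
    theta_true = np.zeros(20)
    theta_true[[2, 7, 15]] = [3.0, -2.0, 1.5]         # sparse ground truth (assumed)
    y = Phi @ theta_true + 0.1 * rng.normal(size=60)

    print(np.round(lasso_ista(Phi, y, lam=0.5), 2))   # most entries driven to zero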
Recursive LS: Let Φ_N ∈ R^{N×d}. We want to find

    θ̂(N) = argmin_θ (1/2)‖y_N − Φ_N θ‖_2^2                                  (20)

for increasing values of N. We now demonstrate how to find θ̂(N+1) from θ̂(N) and the new measurement y(N+1).
Linear algebraic way: Let Q_N = Φ_N^T Φ_N. Therefore,

    Q_{N+1} = Φ_{N+1}^T Φ_{N+1} = Φ_N^T Φ_N + φ_{N+1} φ_{N+1}^T              (21)

where φ_{N+1} is the (N+1)-th regressor. Note that

    θ̂(N+1) = (Φ_{N+1}^T Φ_{N+1})^{-1} Φ_{N+1}^T y_{N+1}.

Therefore,

    θ̂(N+1) = Q_{N+1}^{-1} Φ_{N+1}^T y_{N+1} = Q_{N+1}^{-1} (Φ_N^T y_N + φ_{N+1} y(N+1))
            = Q_{N+1}^{-1} (Φ_N^T Φ_N θ̂(N) + φ_{N+1} y(N+1))
            = Q_{N+1}^{-1} [(Φ_N^T Φ_N + φ_{N+1} φ_{N+1}^T) θ̂(N) − φ_{N+1} φ_{N+1}^T θ̂(N) + φ_{N+1} y(N+1)]
            = Q_{N+1}^{-1} Q_{N+1} θ̂(N) + Q_{N+1}^{-1} φ_{N+1} (y(N+1) − φ_{N+1}^T θ̂(N))
            = θ̂(N) + Q_{N+1}^{-1} φ_{N+1} (y(N+1) − φ_{N+1}^T θ̂(N)).        (22)

This is the recursive LS solution. Notice that the recursion starts once Q_N becomes invertible. Note further that one can use the Sherman–Morrison–Woodbury formula

    (A + vv^T)^{-1} = A^{-1} − (A^{-1} v v^T A^{-1})/(1 + v^T A^{-1} v)

to calculate Q_{N+1}^{-1} from Q_N^{-1} and the new regressor vector φ_{N+1}.
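A minimal sketch of the recursive update (22), propagating P_N = Q_N^{-1} with the Sherman–Morrison formula; the diffuse initialization and the data are illustrative assumptions:

    import numpy as np

    # Sketch: recursive LS (22) with Sherman-Morrison update of P = Q^{-1}.
    rng = np.random.default_rng(7)
    d, theta_true = 2, np.array([0.8, -1.2])

    theta = np.zeros(d)
    P = 1e6 * np.eye(d)                 # large initial P = Q_0^{-1} (diffuse prior, assumed)
    for _ in range(500):
        phi = rng.normal(size=d)                      # new regressor phi_{N+1}
        y = phi @ theta_true + 0.1 * rng.normal()     # new measurement y(N+1)
        Pphi = P @ phi
        P = P - np.outer(Pphi, Pphi) / (1.0 + phi @ Pphi)   # Sherman-Morrison update of Q^{-1}
        theta = theta + P @ phi * (y - phi @ theta)         # correction term from (22)
    print(theta)                        # close to theta_true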
Optimization approach: (Solving convex quadratic programs online) Note that the RLS algorithm solves at each step the following minimization problem [1]:

    θ̂(N+1) = argmin_θ (1/2)‖y_{N+1} − Φ_{N+1} θ‖_2^2                                             (23)
            = argmin_θ (1/2)[ ‖y_N − Φ_N θ‖_2^2 + ‖y(N+1) − φ_{N+1}^T θ‖_2^2 ]
            = argmin_θ (1/2)[ ‖y_N‖_2^2 + ‖Φ_N θ‖_2^2 − 2θ^T Φ_N^T y_N + ‖y(N+1) − φ_{N+1}^T θ‖_2^2 ]
            = argmin_θ (1/2)[ ‖y_N‖_2^2 + ‖Φ_N θ‖_2^2 − 2θ^T Φ_N^T Φ_N θ̂(N) + ‖y(N+1) − φ_{N+1}^T θ‖_2^2 ]
            = argmin_θ (1/2)[ ‖y_N‖_2^2 − ‖Φ_N θ̂(N)‖_2^2 + ‖Φ_N (θ − θ̂(N))‖_2^2 + ‖y(N+1) − φ_{N+1}^T θ‖_2^2 ]
            = argmin_θ [ (1/2)‖θ − θ̂(N)‖²_{Q_N} + (1/2)‖y(N+1) − φ_{N+1}^T θ‖_2^2 ],              (24)

where we have ignored the terms ‖y_N‖_2^2 − ‖Φ_N θ̂(N)‖_2^2 as they are constants independent of θ.
Initialization of RLS: Notice that, in order for Q_N to be invertible, one can collect enough measurements and regressor vectors so that Q_N becomes invertible and then start the RLS algorithm. Alternatively, one can assume a prior θ ∼ N(θ̂(0), Q_0^{-1}) with Q_0 = βI > 0, 0 < β ≪ 1. Then Q_1 = Q_0 + φ_1 φ_1^T is already invertible and RLS can be applied from the first measurement.
Growing Q_N: One way to avoid Q_N → ∞ as N increases is to downweight past information using a forgetting factor α ∈ (0, 1). Equation (24) can be rewritten as

    θ̂(N+1) = argmin_θ [ (α/2)‖θ − θ̂(N)‖²_{Q_N} + (1/2)‖y(N+1) − φ_{N+1}^T θ‖_2^2 ].   (25)

The recursive update becomes

    Q_{N+1} = α Q_N + φ_{N+1} φ_{N+1}^T,                                    (26)
    θ̂(N+1) = θ̂(N) + Q_{N+1}^{-1} φ_{N+1} (y(N+1) − φ_{N+1}^T θ̂(N)).        (27)

Note that Q_N = Σ_{i=1}^{N} φ_i φ_i^T is called the regressor matrix. This matrix is obtained from data and needs to be invertible; if it is, the data is said to be persistently exciting. Since this matrix is symmetric and positive (semi)definite, one can associate with it an ellipsoid formed by its eigenvectors. Geometrically, given N (normalized) regressor vectors, one can fit an ellipsoid containing these vectors. As a new regressor φ(N+1) arrives, Q_N is updated to Q_{N+1} and the corresponding ellipsoid is updated as well. If the new regressor is along the dominant eigenvectors of Q_N, then, due to the presence of Q_{N+1}^{-1} in the RLS update, the correction term is small. On the other hand, if the regressor is along the least dominant eigenvectors of Q_N, the correction is significant.
Goodness of fit: Apart from the covariance matrix, another measure of the goodness of fit of an estimator is the R² value, defined as

    R² = 1 − ‖y_N − Φ_N θ̂‖_2^2/‖y_N‖_2^2 = (‖y_N‖_2^2 − ‖ε_N‖_2^2)/‖y_N‖_2^2 = ‖ŷ_N‖_2^2/‖y_N‖_2^2.          (28)

Recall that y_N = ŷ_N + ε_N, where ŷ_N = Φ_N θ̂ and θ̂ is the LS minimizer of ‖y_N − Φ_N θ‖_2^2. From the first order optimality conditions,

    Φ_N^T Φ_N θ̂ − Φ_N^T y_N = 0 ⇒ Φ_N^T (Φ_N θ̂ − y_N) = 0 ⇒ Φ_N^T ε_N = 0 ⇒ θ̂^T Φ_N^T ε_N = 0 ⇒ ŷ_N^T ε_N = 0.

Therefore, ‖y_N‖_2^2 = ‖ŷ_N‖_2^2 + ‖ε_N‖_2^2.
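A short sketch computing the R² value exactly as defined in (28) (illustrative data):

    import numpy as np

    # Sketch: R^2 of an LS fit, using the (uncentered) definition (28).
    rng = np.random.default_rng(8)
    x = np.linspace(0.0, 1.0, 80)
    Phi = np.column_stack([x, np.ones_like(x)])
    y = Phi @ np.array([3.0, 0.2]) + 0.3 * rng.normal(size=x.size)

    theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    residual = y - Phi @ theta_hat
    r2 = 1.0 - np.dot(residual, residual) / np.dot(y, y)
    print(r2)   # close to 1 for a good fit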

Applications
Fourier analysis: ([4], Example 4.2) Consider sinusoidal data with Gaussian noise (without any dc term)

    x(n) = Σ_{k=1}^{M} a_k cos(2πkn/N) + Σ_{k=1}^{M} b_k sin(2πkn/N) + w(n),   n = 0, . . . , N − 1,

where w(n) is WGN. Let θ = (a_1, . . . , a_M, b_1, . . . , b_M)^T. Then the above equation can be written as

    x = Hθ + w

where H is the N × 2M matrix whose n-th row (n = 0, . . . , N − 1) is

    (cos(2πn/N), . . . , cos(2πMn/N), sin(2πn/N), . . . , sin(2πMn/N)),

so that the first row (n = 0) is (1, . . . , 1, 0, . . . , 0).

Then,

    θ̂ = (H^T H)^{-1} H^T x,   H^T H = (N/2) I  ⇒  θ̂ = (2/N) H^T x.

Notice that E[θ̂] = (H^T H)^{-1} H^T E[x] = (H^T H)^{-1} H^T H θ = θ. The covariance matrix is given by

    E[(θ̂ − θ)(θ̂ − θ)^T] = (H^T H)^{-1} H^T Σ_w H (H^T H)^{-1} = σ² (H^T H)^{-1}.

One can use weighted LS to minimize ‖x − Hθ‖_{Σ_w^{-1}} where Σ_w = σ² I. In this case,

    θ̂_wls = (H^T Σ_w^{-1} H)^{-1} H^T Σ_w^{-1} x  ⇒  E[θ̂_wls] = (H^T Σ_w^{-1} H)^{-1} H^T Σ_w^{-1} H θ = θ

and the covariance matrix of this estimator is

    E[(θ̂_wls − θ)(θ̂_wls − θ)^T] = (H^T Σ_w^{-1} H)^{-1} = σ² (H^T H)^{-1}.

Thus both estimators have the same covariance. One can also find a regularized LS estimator in this case.
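A minimal sketch of the Fourier-coefficient estimate above, using θ̂ = (2/N) H^T x (the number of harmonics and the true coefficients are illustrative assumptions):

    import numpy as np

    # Sketch: build H and estimate the Fourier coefficients by LS.
    rng = np.random.default_rng(9)
    N, M, sigma = 128, 3, 0.5
    n = np.arange(N)
    a_true = np.array([1.0, 0.0, -0.7])
    b_true = np.array([0.3, 2.0, 0.0])

    H = np.hstack([np.cos(2 * np.pi * np.outer(n, np.arange(1, M + 1)) / N),
                   np.sin(2 * np.pi * np.outer(n, np.arange(1, M + 1)) / N)])
    x = H @ np.concatenate([a_true, b_true]) + sigma * rng.normal(size=N)

    theta_hat = (2.0 / N) * H.T @ x          # uses H^T H = (N/2) I
    print(theta_hat)                         # approximately (a_1..a_M, b_1..b_M)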
Exercise: Trading bias for variance: Add noise of larger magnitude and compare the LS fit and the regularized LS fit for this example on a computer. Check which estimate has more variance. Visually check the graphs to see which is the better fit.
Add noise which is Laplace distributed in the above example and find the Fourier coefficients using ‖·‖_1 minimization in Matlab.
Consider the Bayesian estimation of the Fourier coefficients. Let θ ∼ N(0, σ_θ² I). Then, from (17),

    θ̂_MAP = (σ^{-2} H^T H + σ_θ^{-2} I)^{-1} (σ^{-2} H^T x).

Consider a regularized version of the above problem with ridge regularization and lasso regularization with λ = 1. Compare the Fourier coefficients obtained with ordinary LS and with ‖·‖_1 minimization.
Impulse response identification: Let

    y(n) = Σ_{k=0}^{l−1} h(k) u(n − k) + w(n),   n = 0, 1, . . . , N − 1,

where w(n) is WGN and u(n) = 0 for n < 0. Then, y = Hθ + w where θ = (h(0), . . . , h(l−1))^T and H is the N × l Toeplitz matrix

    H = [ u(0)      0        · · ·   0      ]
        [ u(1)      u(0)     · · ·   0      ]
        [   .         .      · · ·   .      ]
        [ u(N−1)    u(N−2)   · · ·   u(N−l) ].

The mean and the variance of the estimator can be computed as done in the previous example. Repeat the same exercise given in the previous case.
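A minimal sketch of impulse response identification by LS, building the Toeplitz matrix H from an assumed input and a hypothetical true impulse response:

    import numpy as np
    from scipy.linalg import toeplitz

    # Sketch: identify an FIR impulse response h(0..l-1) from input/output data by LS.
    rng = np.random.default_rng(10)
    N, l = 400, 5
    h_true = np.array([1.0, 0.6, 0.3, -0.2, 0.1])            # assumed true response

    u = rng.normal(size=N)                                   # persistently exciting input
    H = toeplitz(u, np.r_[u[0], np.zeros(l - 1)])            # H[n, k] = u(n - k), 0 for n < k
    y = H @ h_true + 0.05 * rng.normal(size=N)               # noisy output

    h_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
    print(np.round(h_hat, 3))                                # close to h_true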
ARX model: Consider an ARX model

    A(q^{-1}) y(t) = B(q^{-1}) u(t) + v_0(t)                                (29)

where the unknown parameters are θ := (a_1, . . . , a_n, b_1, . . . , b_m)^T and the noise variance σ², with

    A(q^{-1}) = 1 + a_1 q^{-1} + · · · + a_n q^{-n},   B(q^{-1}) = b_1 q^{-1} + · · · + b_m q^{-m}.

Let H(q^{-1}) = 1/A(q^{-1}). The prediction error is given by

    ε(t) := A(q^{-1}, θ) y(t) − B(q^{-1}, θ) u(t) = y(t) − φ^T(t) θ         (30)

where

    φ(t) := (−y(t−1), · · · , −y(t−n), u(t−1), · · · , u(t−m))^T ∈ R^d.     (31)

Therefore,

    V_N(θ) = (1/N) Σ_{t=t*}^{N−1} (y(t) − φ^T(t) θ)²,   t* = max(m, n),

and

    θ̂_ls(N) = [ (1/N) Σ_{t=t*}^{N−1} φ(t) φ^T(t) ]^{-1} (1/N) Σ_{t=t*}^{N−1} φ(t) y(t).
Suppose the actual observation is expressed as

    y(t) = φ^T(t) θ_0 + v_0(t)

where v_0 is noise and θ_0 is the true parameter. Then,

    θ̂_ls(N) = θ_0 + [ (1/N) Σ_{t=t*}^{N−1} φ(t) φ^T(t) ]^{-1} (1/N) Σ_{t=t*}^{N−1} φ(t) v_0(t).

Assuming that

• lim_{N→∞} (1/N) Σ_{t=t*}^{N−1} φ(t) φ^T(t) = E[φ(t) φ^T(t)] is nonsingular, and

• lim_{N→∞} (1/N) Σ_{t=t*}^{N−1} φ(t) v_0(t) = E[φ(t) v_0(t)] = 0,

we obtain lim_{N→∞} θ̂_ls(N) = θ_0.
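A minimal simulation sketch of the ARX least squares estimate above; the ARX coefficients, orders, and noise level are illustrative assumptions:

    import numpy as np

    # Sketch: LS estimation of ARX parameters theta = (a1, a2, b1) from simulated data,
    # using the regressor phi(t) = (-y(t-1), -y(t-2), u(t-1))^T as in (31).
    rng = np.random.default_rng(11)
    a1, a2, b1 = -1.5, 0.7, 1.0                     # assumed true ARX coefficients
    N = 2000
    u = rng.normal(size=N)
    v = 0.1 * rng.normal(size=N)

    y = np.zeros(N)
    for t in range(2, N):
        y[t] = -a1 * y[t-1] - a2 * y[t-2] + b1 * u[t-1] + v[t]   # A(q^-1) y = B(q^-1) u + v0

    Phi = np.column_stack([-y[1:N-1], -y[0:N-2], u[1:N-1]])      # rows phi(t)^T, t = 2..N-1
    Y = y[2:N]
    theta_hat = np.linalg.lstsq(Phi, Y, rcond=None)[0]
    print(theta_hat)                                # approximately (a1, a2, b1)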
We now find the covariance of the estimate θ̂_ls(N). Let

    R(N) = (1/N) Σ_{t=t*}^{N−1} φ(t) φ^T(t),   θ̃ := θ̂_N − θ_0.

Notice that

    θ̃ = θ̂_N − θ_0 = R(N)^{-1} (1/N) Σ_{t=t*}^{N−1} φ(t) v_0(t),

so E[θ̃] = E[ R(N)^{-1} (1/N) Σ_{t=t*}^{N−1} φ(t) v_0(t) ] → 0 as N → ∞.
In matrix-vector form,

    y_N = Φ_N θ_0 + v_N

and

    θ̂_N = (Φ_N^T Φ_N)^{-1} Φ_N^T y_N = (Φ_N^T Φ_N)^{-1} (Φ_N^T Φ_N θ_0 + Φ_N^T v_N) = θ_0 + (Φ_N^T Φ_N)^{-1} Φ_N^T v_N.   (32)

Thus, θ̃ = (Φ_N^T Φ_N)^{-1} Φ_N^T v_N. Therefore,

    P_N := E[θ̃ θ̃^T] = E[(Φ_N^T Φ_N)^{-1} Φ_N^T v_N v_N^T Φ_N (Φ_N^T Φ_N)^{-1}].

Note that as N → ∞, v_N v_N^T → λI (under some weak conditions). Therefore,

    P_N = E[λ (Φ_N^T Φ_N)^{-1} Φ_N^T Φ_N (Φ_N^T Φ_N)^{-1}]
        = λ (Φ_N^T Φ_N)^{-1} = (λ/N) R(N)^{-1}.                             (33)


Thus, the covariance decays at the rate 1/N and the parameter estimates approach the limiting value at the rate 1/√N. Furthermore, the covariance is proportional to the inverse of the signal to noise ratio, where λ is the noise variance and R(N) is proportional to the input power.
More applications similar to the ones discussed above are curve fitting, surface/manifold fitting and so on.
Noisy regressors [1]: Consider a resistor estimation problem (with true value R_0) using noisy measurements v(k) and i(k) of the true voltage v_0 and the true current i_0, respectively. Let v(k) = v_0 + n_v(k), i(k) = i_0 + n_i(k), where n_v and n_i denote the voltage and current measurement noise at time k. Let n_v, n_i be iid, zero mean, with finite variances σ_v², σ_i². The LS problem is

    min_R (1/N) Σ_{k=1}^{N} (v(k) − R i(k))².

Let v_N, i_N be the vectors of voltages and currents (Φ_N = i_N in this case). Notice that the regressor i_N is noisy in this case, unlike the previous cases where the regressor matrix Φ was assumed to be noise free. The LS solution is

    R̂(N) = ((1/N) i_N^T v_N)/((1/N) i_N^T i_N) = [ (1/N) Σ_{k=1}^{N} (v_0 + n_v(k))(i_0 + n_i(k)) ] / [ (1/N) Σ_{k=1}^{N} (i_0 + n_i(k))² ].

As N → ∞, the numerator converges to

    lim_{N→∞} ( v_0 i_0 + (1/N) Σ_{k=1}^{N} n_v(k) n_i(k) ) = v_0 i_0

and the denominator converges to

    lim_{N→∞} ( i_0² + (1/N) Σ_{k=1}^{N} n_i(k)² ) = i_0² + σ_i²,

where we have used that the sample mean and the sample variance converge to the true mean and the true variance asymptotically. Thus,

    lim_{N→∞} R̂ = R_0 / (1 + σ_i²/i_0²).

The estimate is biased as a result of the noisy regressors, and the bias depends on the variance of the noise in the regressors.
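A short sketch reproducing the asymptotic bias above for assumed values of R_0, i_0 and the noise variances:

    import numpy as np

    # Sketch: bias of the LS resistance estimate when the current regressor is noisy.
    rng = np.random.default_rng(12)
    R0, i0, v0 = 10.0, 1.0, 10.0          # v0 = R0 * i0 (assumed values)
    sigma_i, sigma_v = 0.5, 0.5
    N = 200_000

    i_meas = i0 + sigma_i * rng.normal(size=N)
    v_meas = v0 + sigma_v * rng.normal(size=N)

    R_hat = (i_meas @ v_meas) / (i_meas @ i_meas)
    print(R_hat, R0 / (1 + sigma_i**2 / i0**2))   # both near 8: biased, matching the limit above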

Instrumental variables
Note that in LS solutions, with θ_0 being the true value and θ̂ being the LS estimate, from (8),

    θ̂ − θ_0 = (Φ^T Φ)^{-1} Φ^T y − (Φ^T Φ)^{-1} Φ^T Φ θ_0 = (Φ^T Φ)^{-1} Φ^T v.      (34)

As seen in the ARX model, for an unbiased estimator we need lim_{N→∞} (1/N) Σ_{t=t*}^{N−1} φ(t) v(t) = E[φ(t) v(t)] = 0, i.e., the regressors must be uncorrelated with the noise. If this condition is not satisfied, then the estimate is biased. In such cases, one uses the method of instrumental variables.
Suppose y(t) is an n_y-dimensional output at time t and θ is an n_θ-dimensional parameter vector, with φ^T(t) ∈ R^{n_y × n_θ}. Let Z(t) ∈ R^{n_z × n_y} be such that

    (1/N) Σ_{t=t*}^{N} Z(t)(y(t) − φ^T(t) θ) = 0.                            (35)

If n_z = n_θ, then

    θ̂ = [ Σ_{t=t*}^{N} Z(t) φ^T(t) ]^{-1} Σ_{t=t*}^{N} Z(t) y(t).            (36)

For Z(t) = φ(t), we recover the LS estimator. If the Z(t) are uncorrelated with the noise, then we obtain a consistent estimator without bias. The instrumental variables method works even when the noise is not white Gaussian. In the case of ARX models, delayed inputs are chosen as instrumental variables [5].
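A minimal sketch of the instrumental variable estimate (36) in a scalar errors-in-variables setting; the instrument used here (a second, independently noisy measurement of the regressor) is an illustrative choice, whereas the notes suggest delayed inputs for ARX models:

    import numpy as np

    # Sketch: IV estimate (36) when the regressor is measured with noise, so plain LS is biased.
    rng = np.random.default_rng(13)
    theta0, N = 2.0, 100_000
    x = rng.normal(size=N)                     # true (unobserved) regressor
    y = theta0 * x + 0.1 * rng.normal(size=N)  # outputs

    phi = x + 0.5 * rng.normal(size=N)         # noisy regressor used in LS
    z = x + 0.5 * rng.normal(size=N)           # instrument: correlated with x, not with phi's noise

    theta_ls = (phi @ y) / (phi @ phi)         # biased toward zero (attenuation)
    theta_iv = (z @ y) / (z @ phi)             # consistent, close to theta0
    print(theta_ls, theta_iv)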

Nonlinear least squares
Let g(θ) be a nonlinear function of θ, let x(n) = g(θ)(n) be the model output, and let x(n) denote the observed signal. Let

    J(θ) = (x − g(θ))^T (x − g(θ)).                                         (37)

Then,

    θ̂ = argmin_θ J(θ)                                                       (38)

gives the solution to the nonlinear LS problem. The minimum can be found by the gradient descent method or the Gauss–Newton method.
Let x(n) = g(θ)(n) + v(n) and let Σ_v be the noise covariance matrix. Then,

    J(θ) = (x − g(θ))^T Σ_v^{-1} (x − g(θ))                                 (39)

and

    θ̂ = argmin_θ J(θ)                                                       (40)

gives the corresponding (weighted) nonlinear LS solution. One can consider regularized nonlinear LS as well. A particular application involves nonlinear system identification and is discussed in parametric identification.
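A minimal sketch of a nonlinear LS fit using scipy.optimize.least_squares (a trust-region solver for nonlinear least squares); the exponential model and data are illustrative assumptions:

    import numpy as np
    from scipy.optimize import least_squares

    # Sketch: nonlinear LS fit of x(n) = theta1 * exp(-theta2 * n) + v(n).
    rng = np.random.default_rng(14)
    n = np.arange(50, dtype=float)
    theta_true = np.array([2.0, 0.1])
    x = theta_true[0] * np.exp(-theta_true[1] * n) + 0.05 * rng.normal(size=n.size)

    def residuals(theta):
        return x - theta[0] * np.exp(-theta[1] * n)    # x - g(theta)

    sol = least_squares(residuals, x0=np.array([1.0, 0.5]))
    print(sol.x)                                        # approximately (2.0, 0.1)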

Nonlinear optimization and state trajectory estimation

[1] Consider the following discrete time nonlinear state space model

    x(k+1) = f(x(k), u(k), θ) + w(k),   y(k) = g(x(k), u(k), θ) + v(k)      (41)

where w(k) ∼ N(0, Σ_w) and v(k) ∼ N(0, Σ_v). Given a control trajectory (u(0), . . . , u(N)) and measurements (y(0), . . . , y(N)), we want to compute the most likely value of θ and the most likely state trajectory (x(0), . . . , x(N)). The problem of estimating both θ and (x(0), . . . , x(N)) is called parameter and state trajectory estimation. The unknown vector to be estimated is

    θ̄ = (θ, x(0), . . . , x(N)).                                            (42)
Define the unweighted residual function by stacking the process and measurement residuals,

    R̄(θ̄) = ( w(0), . . . , w(N−1), v(0), . . . , v(N) )
          with  w(k) = x(k+1) − f(x(k), u(k), θ),  k = 0, . . . , N−1,
          and   v(k) = y(k) − g(x(k), u(k), θ),    k = 0, . . . , N.        (43)
Let Σ = cov(w(0), . . . , w(N−1), v(0), . . . , v(N)) be the covariance matrix, i.e., the block diagonal matrix

    Σ = blkdiag(Σ_w, . . . , Σ_w, Σ_v, . . . , Σ_v)

with N copies of Σ_w followed by N+1 copies of Σ_v.

Let

    R(θ̄) = Σ^{-1/2} R̄(θ̄).                                                  (44)

Then, the parameter and state trajectory estimate can be found by solving the nonlinear LS problem

    θ̄* = argmin_{θ̄} (1/2)‖R(θ̄)‖_2^2,                                        (45)

which can be solved using the Gauss–Newton algorithm or using lsqnonlin in MATLAB. One may not obtain a global minimum in this case.

Aside: Similar optimization problems


We now consider optimization problems where Ax = b may have multiple solutions (A is a full row rank, wide matrix). Some of these problems include ℓ1 norm minimization subject to linear equality constraints, ℓ2 norm minimization subject to equality constraints, nuclear norm minimization and low rank problems.
For example, in the open loop minimum energy control problem, one solves the following quadratic optimization problem

    minimize ‖u‖_2^2 subject to C_T(A, B) u = x_f,  x(0) = 0,                (46)

where C_T(A, B) := [A^{T−1}B · · · AB  B] is the controllability matrix.
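Problem (46) admits the standard least-norm closed form u* = C_T^T (C_T C_T^T)^{-1} x_f (not derived in these notes). A minimal sketch for an assumed discrete-time system and horizon:

    import numpy as np

    # Sketch: minimum-energy open-loop control (46) via the least-norm solution.
    A = np.array([[1.0, 0.1],
                  [0.0, 1.0]])
    B = np.array([[0.0],
                  [0.1]])
    T = 10
    xf = np.array([1.0, 0.0])

    # Controllability-type matrix C_T = [A^{T-1}B ... AB B]
    C = np.hstack([np.linalg.matrix_power(A, T - 1 - k) @ B for k in range(T)])

    u = C.T @ np.linalg.solve(C @ C.T, xf)     # least-norm u with C u = xf
    print(np.allclose(C @ u, xf), np.linalg.norm(u))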
Another variant of the above problem is

    minimize ‖u‖_2^2 + λ‖Hu − b‖_2^2,   λ > 0.                              (47)

In compressed sensing, one solves the following ℓ1-norm minimization problem

    minimize ‖x‖_1 subject to Ax = b.                                       (48)

A similar problem also shows up in maximum hands-off control problems in the control literature. An alternate form is to augment the constraints into the cost function:

    minimize ‖x‖_1 + λ‖Ax − b‖_2^2,   λ > 0.                                (49)

If the unknown happens to be a matrix variable, then we have

    minimize ‖X‖_* subject to AX = B,                                       (50)

where ‖X‖_* is the nuclear norm of X, i.e., the sum of its nonzero singular values. This is the well known matrix completion problem. In matrix completion, we want to minimize the rank of the unknown matrix subject to some equality constraints. However, this is a combinatorial NP-hard problem. Its convex relaxation is given by minimizing the nuclear norm of the unknown matrix subject to equality constraints.
An augmented form of (50) is

    minimize ‖X‖_* + λ‖AX − B‖_2^2.                                         (51)

For sparse noise, we formulate the objective function as

    minimize ‖X‖_* + λ‖AX − B‖_1.                                           (52)
References
[1] M. Diehl, Lecture notes on Modeling and System Identification, lecture notes and video lectures, 2020.

[2] G. Pillonetto, T. Chen, A. Chiuso, G. De Nicolao, L. Ljung, Regularized System Identification: Learning Dynamic Models from Data, Springer, 2022.

[3] L. Ljung, System Identification: Theory for the User, 2nd Edition, PHI, 1999.

[4] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, 1993.

[5] T. Söderström, P. Stoica, System Identification, Prentice Hall, 1989.

[6] G. Bottegal, Lecture slides on Nonparametric Kernel-based System Identification, DISC course, 2020.
