Least Squares
This can be identified with the linear least squares problem $\text{minimize}_\theta \ \|\Phi\theta - y\|_2$ as follows. Let
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}
=
\begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{pmatrix}
\begin{pmatrix} \theta_1 \\ \theta_0 \end{pmatrix}.
\]
Then, we want to find the least squares solution to this overdetermined system of equations. The familiar methods to solve $\Phi\theta = y$ are the QR factorization and the normal equations. Writing $\Phi = QR$ with Q having orthonormal columns and R upper triangular,
\[ QR\theta = y \;\Rightarrow\; R\theta = Q^T y. \]
The method of normal equations can be derived from the necessary and sufficient conditions for optimality. Consider $\|\Phi\theta - y\|_2^2 = (\Phi\theta - y)^T(\Phi\theta - y)$ where Φ has full column rank. Taking the derivative w.r.t. θ and setting it to zero gives
\[ \Phi^T\Phi\,\theta = \Phi^T y \;\Rightarrow\; \theta = (\Phi^T\Phi)^{-1}\Phi^T y. \]
Taking the second derivative w.r.t. θ gives $\Phi^T\Phi > 0$, which shows that this stationary point is a minimum.
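As a quick illustration, the sketch below (Python/NumPy, with a synthetic line-fit data set chosen only for this example) solves the same LS problem via the normal equations, via QR, and with numpy.linalg.lstsq; all three should agree to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1, 1, N)
y = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(N)   # true theta = (2.0, 0.5)

# Regressor matrix Phi = [x 1]
Phi = np.column_stack([x, np.ones(N)])

# 1) Normal equations: theta = (Phi^T Phi)^{-1} Phi^T y
theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# 2) QR factorization: R theta = Q^T y
Q, R = np.linalg.qr(Phi)
theta_qr = np.linalg.solve(R, Q.T @ y)

# 3) Library routine
theta_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(theta_ne, theta_qr, theta_lstsq)
```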
The method of least squares also extends to polynomial fits/nonlinear regression. Given a set of data points $(x_i, y_i)$, $i = 1, \dots, N$, the best $n$th degree polynomial fit is
\[ \text{minimize}_{\theta_0,\dots,\theta_n} \ \sum_{i=1}^{N}\big(y_i - (\theta_n x_i^n + \cdots + \theta_0)\big)^2. \tag{2} \]
In matrix form,
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}
=
\begin{pmatrix}
x_1^n & x_1^{n-1} & \cdots & x_1 & 1 \\
x_2^n & x_2^{n-1} & \cdots & x_2 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
x_N^n & x_N^{n-1} & \cdots & x_N & 1
\end{pmatrix}
\begin{pmatrix} \theta_n \\ \theta_{n-1} \\ \vdots \\ \theta_1 \\ \theta_0 \end{pmatrix}.
\]
Rows of matrix Φ are called regressors and the vector of unknowns is the parameter vector. One can consider
other regression fits as well.
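A minimal NumPy sketch of the polynomial fit (the cubic example and its data are made up for illustration): build the Vandermonde-type regressor matrix above and solve the resulting linear LS problem.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 100, 3                                   # N data points, degree-n polynomial
x = np.linspace(-1, 1, N)
true_theta = np.array([1.0, -2.0, 0.5, 0.3])    # theta_n, ..., theta_0
y = np.polyval(true_theta, x) + 0.05 * rng.standard_normal(N)

# Regressor matrix with columns x^n, x^{n-1}, ..., x, 1
Phi = np.vander(x, n + 1)
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta_hat)                                # should be close to true_theta
```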
Least squares estimate and mean squared error: Let y(t) be the observed output with known input and θ be the parameter (scalar/vector/matrix) associated with the model. Let ŷ(t|θ) be the output of the model under a given input. Then, the mean squared error (MSE) is
\[ V_N(\theta) = \frac{1}{N}\sum_{t=1}^{N}\big(\hat y(t|\theta) - y(t)\big)^2. \tag{3} \]
For
\[ y(t) = \varphi^T(t)\theta + \varepsilon(t), \tag{6} \]
choosing $W = \mathrm{diag}(\sigma^{-2}(i))$, $i = 1,\dots,N$, where $\sigma^2(i)$ is the variance of $\varepsilon(i)$, forms a weighted LS (WLS) problem in the presence of noisy measurements.
MVU estimators using LS for Gaussian noise: Consider
\[ y = \Phi\theta + v \tag{7} \]
where $y = (y(0),\dots,y(N-1))^T$, $v = (v(0),\dots,v(N-1))^T$ and $v \sim \mathcal N(0,\sigma^2 I)$. Note that $\hat y|\theta = \Phi\theta$ and the LS solution is the minimizer of $\|y - \Phi\theta\|_2^2$, which is
\[ \hat\theta = (\Phi^T\Phi)^{-1}\Phi^T y. \tag{8} \]
Clearly, $E[\hat\theta] = \theta$ as the noise is zero mean. It turns out that the MVUE is also given by the LS solution. If the noise is not zero mean, then the LS solution is biased. To find the covariance matrix of $\hat\theta$, observe that $\hat\theta - \theta = (\Phi^T\Phi)^{-1}\Phi^T v$, so
\[ \mathrm{cov}(\hat\theta) = E\big[(\Phi^T\Phi)^{-1}\Phi^T vv^T\Phi(\Phi^T\Phi)^{-1}\big] = \sigma^2(\Phi^T\Phi)^{-1}. \]
For linear models with zero mean Gaussian noise, LS estimates are also MVUE ([4]).
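A small Monte Carlo sketch (NumPy, with an arbitrary Φ and σ chosen for illustration) checking that the LS estimate is unbiased and that its sample covariance matches $\sigma^2(\Phi^T\Phi)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma = 200, 3, 0.5
Phi = rng.standard_normal((N, d))
theta_true = np.array([1.0, -0.5, 2.0])

# Repeat the experiment many times with fresh noise realizations
trials = 5000
estimates = np.empty((trials, d))
for k in range(trials):
    y = Phi @ theta_true + sigma * rng.standard_normal(N)
    estimates[k] = np.linalg.lstsq(Phi, y, rcond=None)[0]

print("mean of estimates:", estimates.mean(axis=0))        # ~ theta_true
print("sample covariance:\n", np.cov(estimates.T))
print("theory sigma^2 (Phi^T Phi)^{-1}:\n",
      sigma**2 * np.linalg.inv(Phi.T @ Phi))
```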
Statistical weighted LS: We consider the following assumptions:
• The N measurements y(t) are generated by a model $y(t) = \varphi(t)^T\theta_0 + \varepsilon(t)$ ($\varphi(t)$ is a regressor vector), where $\theta_0$ is the true but unknown value.
• The N noise terms $\varepsilon(t)$ have zero mean. Often, they are also independent or i.i.d.; this determines the optimal weighting matrix. Let $\varepsilon_N := (\varepsilon(1),\dots,\varepsilon(N))^T$.
Notice that
\[ E[\hat\theta_{wls}] = (\Phi_N^T W\Phi_N)^{-1}\Phi_N^T W\,E[y_N] = (\Phi_N^T W\Phi_N)^{-1}\Phi_N^T W\,E[\Phi_N\theta_0 + \varepsilon_N] = \theta_0, \]
i.e., WLS is unbiased, and the estimation error is
\[ \hat\theta_{wls} - \theta_0 = (\Phi_N^T W\Phi_N)^{-1}\Phi_N^T W\varepsilon_N, \]
where $E[\varepsilon_N\varepsilon_N^T] = \Sigma_N = \mathrm{cov}(\varepsilon_N)$. Hence
\[ \mathrm{cov}(\hat\theta_{wls}) = (\Phi_N^T W\Phi_N)^{-1}\Phi_N^T W\Sigma_N W\Phi_N(\Phi_N^T W\Phi_N)^{-1}. \]
Thus, $\mathrm{cov}(\hat\theta_{wls})$ depends on the choice of W. If $\Sigma_N$ is known, choosing $W = \Sigma_N^{-1}$ gives
\[ \mathrm{cov}(\hat\theta_{wls}) = (\Phi_N^T\Sigma_N^{-1}\Phi_N)^{-1}. \tag{11} \]
This is the optimal choice of W, i.e., $\mathrm{cov}(\hat\theta_{wls}) \geq (\Phi_N^T\Sigma_N^{-1}\Phi_N)^{-1}$ for all other W. We refer the reader to [4] for Bayesian linear models and LS.
MLE and LS for Gaussian noise: It turns out that for linear models $y = \Phi\theta + v$ with v Gaussian, the MLE is given by the weighted LS solution, which is also the MVUE ([4]). Consider (7) with $v \sim \mathcal N(0,\Sigma)$. Then
\[ p(y;\theta) = \frac{1}{(2\pi)^{N/2}\det(\Sigma)^{1/2}}\,e^{-\frac{1}{2}(y-\Phi\theta)^T\Sigma^{-1}(y-\Phi\theta)} \]
and the MLE can be found by setting the derivative of the log-likelihood to zero:
\[ 0 = \frac{\partial\log p(y;\theta)}{\partial\theta} = \Phi^T\Sigma^{-1}(y-\Phi\theta) \;\Rightarrow\; \hat\theta = (\Phi^T\Sigma^{-1}\Phi)^{-1}\Phi^T\Sigma^{-1}y, \]
with covariance $\Sigma_{\hat\theta} = (\Phi^T\Sigma^{-1}\Phi)^{-1}$ (which is the optimal estimator, i.e., the one with minimum covariance). (Notice that $E[\hat\theta] = \theta$.) Thus, the MLE and the MMSE (minimum mean squared error) estimator or the (weighted) MVUE coincide when the noise is zero-mean Gaussian. However, for Gaussian noise with nonzero mean, even though the LS or weighted LS solution coincides with the MLE, the LS solution is biased.
Ill-posed LS problems and the Moore-Penrose pseudoinverse: In ill-posed problems, $\Phi^T\Phi$ is not invertible and there is no unique LS solution. In such cases, with the SVD $\Phi = U\Sigma V^T$, the Moore-Penrose pseudoinverse is given by $\Phi^+ = V\Sigma^+U^T$ where $\Sigma^+ = \mathrm{diag}(\sigma_1^{-1},\dots,\sigma_r^{-1},0,\dots,0)$ and r is the rank of Φ. The minimum-norm LS solution is then $\hat\theta = \Phi^+ y$.
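A short NumPy illustration (the rank-deficient Φ is made up for the example): numpy.linalg.pinv computes Φ⁺ via the SVD and returns the minimum-norm LS solution.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 30, 4
A = rng.standard_normal((N, d - 1))
Phi = np.column_stack([A, A[:, 0] + A[:, 1]])      # last column is a linear combination -> rank deficient
y = rng.standard_normal(N)

print(np.linalg.matrix_rank(Phi))                  # d - 1
theta_pinv = np.linalg.pinv(Phi, rcond=1e-10) @ y  # minimum-norm LS solution

# Same thing via the SVD explicitly
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
s_inv = np.where(s > 1e-10 * s.max(), 1.0 / s, 0.0)
theta_svd = Vt.T @ (s_inv * (U.T @ y))
print(np.allclose(theta_pinv, theta_svd))
```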
Regularized least squares: To find a unique LS solution, one can pose a regularized optimization problem (ridge regression)
\[ \text{minimize}_\theta \ \tfrac{1}{2}\|y - \Phi\theta\|_2^2 + \tfrac{\alpha}{2}\|\theta\|_2^2 \tag{13} \]
where α > 0. This problem has a unique solution $\theta^* = (\Phi^T\Phi + \alpha I)^{-1}\Phi^T y$. Note that for $y = \Phi\theta + v$ with v being WGN $\mathcal N(0,\sigma^2 I)$,
\[ E[\theta^*] = (\Phi^T\Phi + \alpha I)^{-1}\Phi^T\Phi\,\theta, \tag{14} \]
i.e., the estimator is biased and becomes unbiased only for α = 0. The covariance of $\theta^*$ is
\[ E[(\theta^* - E[\theta^*])(\theta^* - E[\theta^*])^T] = (\Phi^T\Phi + \alpha I)^{-1}\Phi^T E[vv^T]\Phi(\Phi^T\Phi + \alpha I)^{-1} = \sigma^2(\Phi^T\Phi + \alpha I)^{-1}\Phi^T\Phi(\Phi^T\Phi + \alpha I)^{-1}. \tag{15} \]
It turns out that regularized LS estimators have lower covariance (i.e., they are less sensitive to data variation) than LS estimators. On the flip side, regularized LS estimators are biased. Suppose θ is a scalar. Then,
\[ \mathrm{var}(\theta^*) = \frac{\sigma^2\,\Phi^T\Phi}{(\Phi^T\Phi + \alpha)^2} < \mathrm{var}(\hat\theta_{ls}) = \sigma^2(\Phi^T\Phi)^{-1}, \]
which verifies the lower variance in the scalar case. In cases with low signal to noise ratio (SNR), one prefers regularized LS over LS. There are other regularizers as well, such as $\|\theta\|_W$, W > 0, and $\|\theta\|_1$, which promotes sparse solutions. Notice that in (13), the first term corresponds to the variance and the second term corresponds to the (bias)².
• As α increases, bias increases and the variance goes down.
• At α → ∞, θ∗ = 0 (trivial estimator) independent of the data so there is no variance and only bias.
• The bias variance trade-off depends on α as well as the choice of the norm in the second term of (13).
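A hedged NumPy sketch of this bias-variance effect for ridge regression (the synthetic data, α values, and noise level are arbitrary illustrative choices): for each α, repeated noise realizations give an empirical bias and variance of θ*.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, sigma = 30, 5, 1.0
Phi = rng.standard_normal((N, d))
theta_true = rng.standard_normal(d)

for alpha in [0.0, 0.1, 1.0, 10.0]:
    M = np.linalg.inv(Phi.T @ Phi + alpha * np.eye(d)) @ Phi.T   # ridge estimator matrix
    ests = []
    for _ in range(2000):
        y = Phi @ theta_true + sigma * rng.standard_normal(N)
        ests.append(M @ y)
    ests = np.array(ests)
    bias = np.linalg.norm(ests.mean(axis=0) - theta_true)
    var = ests.var(axis=0).sum()
    print(f"alpha={alpha:5.1f}  bias={bias:.3f}  total variance={var:.3f}")
```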
MAP estimation with a Gaussian prior: Consider $y = \Phi\theta + w$ with $w \sim \mathcal N(0,\Sigma_w)$ and a Gaussian prior $\theta \sim \mathcal N(\theta_0,\Sigma_\theta)$. Maximizing the posterior is equivalent to minimizing $-\log p(y|\theta) - \log p(\theta)$, i.e.,
\[ \hat\theta_{MAP} = \mathrm{argmin}_\theta \ \tfrac{1}{2}(y-\Phi\theta)^T\Sigma_w^{-1}(y-\Phi\theta) + \tfrac{1}{2}(\theta-\theta_0)^T\Sigma_\theta^{-1}(\theta-\theta_0), \tag{16} \]
which is a regularized LS problem. Since this is a quadratic program, the optimal solution can be found by setting the derivative w.r.t. θ to zero, i.e.,
\[ \Phi^T\Sigma_w^{-1}(y-\Phi\theta) + \Sigma_\theta^{-1}(\theta_0-\theta) = 0 \;\Rightarrow\; \hat\theta_{MAP} = (\Phi^T\Sigma_w^{-1}\Phi + \Sigma_\theta^{-1})^{-1}(\Phi^T\Sigma_w^{-1}y + \Sigma_\theta^{-1}\theta_0). \tag{17} \]
(It turns out that the MMSE estimator, given by E[θ|y], also coincides with the MAP estimator for linear Gaussian models. Note that E[θ|y] turns out to be linear due to the linear Gaussian model.)
In the above setup, suppose $\Sigma_w = I$, $\theta_0 = 0$, $\Sigma_\theta = \lambda^{-1}I$. Then, (16) becomes
\[ \hat\theta_{MAP} = \mathrm{argmin}_\theta \ \tfrac{1}{2}\big(\|y - \Phi\theta\|_2^2 + \lambda\|\theta\|_2^2\big), \]
which exactly matches ridge regularization.
$L_1$-estimator: It is given by $\mathrm{argmin}_\theta\|y - M(\theta)\|_1$. For linear models, it is $\mathrm{argmin}_\theta\|y - \Phi\theta\|_1$. Suppose $y_N = \Phi\theta + \varepsilon_N$ where the $\varepsilon(k)$ are i.i.d. and Laplace distributed, $p(\varepsilon) = \frac{1}{2\sigma}e^{-|\varepsilon|/\sigma}$ (this noise model is used when the observations have many outliers). Then, the MLE is the same as the optimal $L_1$ estimator. When ε is Gaussian, the MLE is the same as the optimal $L_2$ or LS estimator.
Lasso: The least absolute shrinkage and selection operator (lasso) is another type of regularizer, which gives sparse solutions to LS problems. It is formulated as
\[ \text{minimize}_\theta \ \tfrac{1}{2}\|y - \Phi\theta\|_2^2 + \lambda\|\theta\|_1 \tag{18} \]
where λ > 0. Higher λ results in sparser solutions. In the Bayesian setting, if we use a Laplace prior rather than a Gaussian one, we end up with lasso regularization rather than ridge regularization. For example, let $y = \Phi\theta + w$, $w \sim \mathcal N(0,\Sigma_w)$, $p(\theta) = \frac{1}{2\sigma}e^{-\|\theta\|_1/\sigma}$. Then, proceeding as above in the MAP case and ignoring constants,
\[ -\log p(y|\theta) - \log p(\theta) = \tfrac{1}{2}(y-\Phi\theta)^T\Sigma_w^{-1}(y-\Phi\theta) + \frac{\|\theta\|_1}{\sigma}. \]
Thus, $\hat\theta_{MAP}$ is given by the solution of a lasso problem.
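A minimal sketch of solving (18) with proximal gradient descent (ISTA) in NumPy; the synthetic sparse problem and the step size are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 50, 20
Phi = rng.standard_normal((N, d))
theta_true = np.zeros(d)
theta_true[[2, 7, 11]] = [1.5, -2.0, 0.8]        # sparse ground truth
y = Phi @ theta_true + 0.05 * rng.standard_normal(N)

lam = 0.5
step = 1.0 / np.linalg.norm(Phi, 2) ** 2          # 1/L, L = largest eigenvalue of Phi^T Phi

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

theta = np.zeros(d)
for _ in range(2000):
    grad = Phi.T @ (Phi @ theta - y)              # gradient of the smooth part
    theta = soft_threshold(theta - step * grad, step * lam)

print(np.round(theta, 3))                         # most entries shrink to exactly zero
```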
Elastic net regularizers are of the form
\[ \text{minimize}_\theta \ \tfrac{1}{2}\|y - \Phi\theta\|_2^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2 \tag{19} \]
where $\lambda_1,\lambda_2 > 0$, which combines features of ridge regression and lasso.
Recursive LS: Let $\Phi_N \in \mathbb R^{N\times d}$. We want to find
\[ \hat\theta(N) = \mathrm{argmin}_\theta \ \tfrac{1}{2}\|y_N - \Phi_N\theta\|_2^2 \tag{20} \]
for increasing values of N. We now demonstrate how to find $\hat\theta(N+1)$ from $\hat\theta(N)$ and the new measurement y(N+1).
Linear algebraic way: Let $Q_N = \Phi_N^T\Phi_N$. Therefore,
\[ Q_{N+1} = \Phi_{N+1}^T\Phi_{N+1} = \Phi_N^T\Phi_N + \varphi_{N+1}\varphi_{N+1}^T. \tag{21} \]
Therefore,
\begin{align*}
\hat\theta(N+1) &= Q_{N+1}^{-1}\Phi_{N+1}^T y_{N+1} = Q_{N+1}^{-1}\big(\Phi_N^T y_N + \varphi_{N+1}y(N+1)\big) \\
&= Q_{N+1}^{-1}\big(\Phi_N^T\Phi_N\hat\theta(N) + \varphi_{N+1}y(N+1)\big) \\
&= Q_{N+1}^{-1}\big(\Phi_N^T\Phi_N + \varphi_{N+1}\varphi_{N+1}^T\big)\hat\theta(N) - Q_{N+1}^{-1}\varphi_{N+1}\varphi_{N+1}^T\hat\theta(N) + Q_{N+1}^{-1}\varphi_{N+1}y(N+1) \\
&= Q_{N+1}^{-1}Q_{N+1}\hat\theta(N) + Q_{N+1}^{-1}\varphi_{N+1}\big(y(N+1) - \varphi_{N+1}^T\hat\theta(N)\big) \\
&= \hat\theta(N) + Q_{N+1}^{-1}\varphi_{N+1}\big(y(N+1) - \varphi_{N+1}^T\hat\theta(N)\big). \tag{22}
\end{align*}
This is the recursive LS solution. Notice that the recursion starts once $Q_N$ becomes invertible. Note further that one can use the Sherman-Morrison-Woodbury formula
\[ (A + vv^T)^{-1} = A^{-1} - \frac{A^{-1}vv^TA^{-1}}{1 + v^TA^{-1}v} \]
to calculate $Q_{N+1}^{-1}$ from $Q_N^{-1}$ and the new regressor vector $\varphi_{N+1}$.
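A compact NumPy sketch of the recursion (22), propagating $P = Q^{-1}$ with the Sherman-Morrison update rather than refactorizing; the data-generating model and the initialization $P_0 = \beta^{-1}I$ are illustrative choices (see the initialization discussion below).

```python
import numpy as np

rng = np.random.default_rng(6)
d = 3
theta_true = np.array([0.5, -1.0, 2.0])

beta = 1e-3
theta_hat = np.zeros(d)
P = np.eye(d) / beta            # P = Q^{-1}, initialized with Q_0 = beta * I

for t in range(500):
    phi = rng.standard_normal(d)                        # new regressor phi_{N+1}
    y = phi @ theta_true + 0.1 * rng.standard_normal()  # new measurement y(N+1)

    # Sherman-Morrison update of P = (Q + phi phi^T)^{-1}
    Pphi = P @ phi
    P = P - np.outer(Pphi, Pphi) / (1.0 + phi @ Pphi)

    # Recursive LS correction (22)
    theta_hat = theta_hat + P @ phi * (y - phi @ theta_hat)

print(theta_hat)                # close to theta_true
```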
Optimization approach (solving convex quadratic programs online): Note that the RLS algorithm solves at each step the following minimization problem [1]:
\begin{align*}
\hat\theta(N+1) &= \mathrm{argmin}_\theta \ \tfrac{1}{2}\|y_{N+1} - \Phi_{N+1}\theta\|_2^2 \tag{23} \\
&= \mathrm{argmin}_\theta \ \tfrac{1}{2}\Big(\|y_N - \Phi_N\theta\|_2^2 + \|y(N+1) - \varphi_{N+1}^T\theta\|_2^2\Big) \\
&= \mathrm{argmin}_\theta \ \tfrac{1}{2}\Big(\|y_N\|_2^2 + \|\Phi_N\theta\|_2^2 - 2\theta^T\Phi_N^Ty_N + \|y(N+1) - \varphi_{N+1}^T\theta\|_2^2\Big) \\
&= \mathrm{argmin}_\theta \ \tfrac{1}{2}\Big(\|y_N\|_2^2 + \|\Phi_N\theta\|_2^2 - 2\theta^T\Phi_N^T\Phi_N\hat\theta(N) + \|y(N+1) - \varphi_{N+1}^T\theta\|_2^2\Big) \\
&= \mathrm{argmin}_\theta \ \tfrac{1}{2}\Big(\|y_N\|_2^2 - \|\Phi_N\hat\theta(N)\|_2^2 + \|\Phi_N(\theta - \hat\theta(N))\|_2^2 + \|y(N+1) - \varphi_{N+1}^T\theta\|_2^2\Big) \\
&= \mathrm{argmin}_\theta \ \tfrac{1}{2}\|\theta - \hat\theta(N)\|_{Q_N}^2 + \tfrac{1}{2}\|y(N+1) - \varphi_{N+1}^T\theta\|_2^2, \tag{24}
\end{align*}
where we have used the normal equations $\Phi_N^Ty_N = \Phi_N^T\Phi_N\hat\theta(N)$ and ignored the terms $\|y_N\|_2^2 - \|\Phi_N\hat\theta(N)\|_2^2$ as they are constants independent of θ.
Initialization of RLS: Notice that, for $Q_N$ to be invertible, one can collect enough measurements and regressor vectors so that $Q_N$ becomes invertible and then start the RLS algorithm. Otherwise, one can assume $\theta \sim \mathcal N(\hat\theta(0), Q_0^{-1})$ where $Q_0 = \beta I > 0$, $0 < \beta \ll 1$. Then $Q_1$ is invertible and we can apply RLS from the first measurement.
Growing $Q_N$: One way to avoid $Q_N \to \infty$ as N increases is to downweight past information using a forgetting factor α. Equation (24) can be rewritten as
\[ \hat\theta(N+1) = \mathrm{argmin}_\theta \ \tfrac{\alpha}{2}\|\theta - \hat\theta(N)\|_{Q_N}^2 + \tfrac{1}{2}\|y(N+1) - \varphi_{N+1}^T\theta\|_2^2. \tag{25} \]
The recursive update becomes
\begin{align*}
Q_{N+1} &= \alpha Q_N + \varphi_{N+1}\varphi_{N+1}^T, \tag{26} \\
\hat\theta(N+1) &= \hat\theta(N) + Q_{N+1}^{-1}\varphi_{N+1}\big(y(N+1) - \varphi_{N+1}^T\hat\theta(N)\big). \tag{27}
\end{align*}
Note that $Q_N = \sum_{i=1}^{N}\varphi_i\varphi_i^T$ is called the regressor matrix. This regressor matrix is obtained from data and needs to be invertible. If the regressor matrix is invertible, the data is said to be persistently exciting. Since this matrix is symmetric and positive (semi)definite, one can associate with it an ellipsoid formed by the eigenvectors of $Q_N$. Geometrically, given N (normalized) regressor vectors, one can fit an ellipsoid containing these vectors. As the new regressor $\varphi_{N+1}$ arrives, $Q_N$ gets updated to $Q_{N+1}$ and the corresponding ellipsoid gets updated as well. If the new regressor is along the dominant eigenvectors of $Q_N$, then, due to the presence of $Q_{N+1}^{-1}$ in the RLS update, the correction term is small. On the other hand, if the regressor is along the least dominant eigenvectors of $Q_N$, the correction is significant.
Goodness of fit: Apart from the covariance matrix, another measure of the goodness of fit of an estimator is the $R^2$ value, defined as
\[ R^2 = 1 - \frac{\|y_N - \Phi_N\hat\theta\|_2^2}{\|y_N\|_2^2} = \frac{\|y_N\|_2^2 - \|\varepsilon_N\|_2^2}{\|y_N\|_2^2} = \frac{\|\hat y_N\|_2^2}{\|y_N\|_2^2}. \tag{28} \]
Recall that $y_N = \hat y_N + \varepsilon_N$ where $\hat y_N = \Phi_N\hat\theta$ and $\hat\theta$ is the LS minimizer of $\|y_N - \Phi_N\theta\|$. From the first order optimality conditions,
\[ \Phi_N^T\Phi_N\hat\theta - \Phi_N^Ty_N = 0 \;\Rightarrow\; \Phi_N^T(\Phi_N\hat\theta - y_N) = 0 \;\Rightarrow\; \Phi_N^T\varepsilon_N = 0 \;\Rightarrow\; \hat\theta_{LS}^T\Phi_N^T\varepsilon_N = 0 \;\Rightarrow\; \hat y_N^T\varepsilon_N = 0, \]
so $\|y_N\|_2^2 = \|\hat y_N\|_2^2 + \|\varepsilon_N\|_2^2$, which justifies the last equality in (28).
Applications
Fourier analysis ([4], Example 4.2): Consider sinusoidal data with Gaussian noise (without any dc term)
\[ x(n) = \sum_{k=1}^{M} a_k\cos\Big(\frac{2\pi kn}{N}\Big) + \sum_{k=1}^{M} b_k\sin\Big(\frac{2\pi kn}{N}\Big) + w(n), \qquad n = 0,\dots,N-1, \]
where w(n) is WGN. Let $\theta = (a_1,\dots,a_M,b_1,\dots,b_M)^T$. Then the above equation can be written as
\[ x = H\theta + w \]
where
\[ H = \begin{pmatrix}
1 & \cdots & 1 & 0 & \cdots & 0 \\
\cos(\tfrac{2\pi}{N}) & \cdots & \cos(\tfrac{2\pi M}{N}) & \sin(\tfrac{2\pi}{N}) & \cdots & \sin(\tfrac{2\pi M}{N}) \\
\vdots & & \vdots & \vdots & & \vdots \\
\cos(\tfrac{2\pi(N-1)}{N}) & \cdots & \cos(\tfrac{2\pi M(N-1)}{N}) & \sin(\tfrac{2\pi(N-1)}{N}) & \cdots & \sin(\tfrac{2\pi M(N-1)}{N})
\end{pmatrix}. \]
Then,
\[ \hat\theta = (H^TH)^{-1}H^Tx, \qquad H^TH = \frac{N}{2}I \;\Rightarrow\; \hat\theta = \frac{2}{N}H^Tx. \]
Notice that $E[\hat\theta] = (H^TH)^{-1}H^TE[x] = (H^TH)^{-1}H^TH\theta = \theta$. The covariance matrix is given by $\mathrm{cov}(\hat\theta) = \sigma^2(H^TH)^{-1} = \frac{2\sigma^2}{N}I$. The weighted LS estimator is
\[ \hat\theta_{wls} = (H^T\Sigma_w^{-1}H)^{-1}H^T\Sigma_w^{-1}x \;\Rightarrow\; E[\hat\theta_{wls}] = (H^T\Sigma_w^{-1}H)^{-1}H^T\Sigma_w^{-1}H\theta = \theta, \]
and with white noise $\Sigma_w = \sigma^2 I$ it reduces to the LS estimator. Thus both estimators have the same covariance. One can also find a regularized LS estimator in this case.
Exercise (trading bias for variance): Add noise with larger magnitude and compare the LS fit and the regularized LS fit for this example on a computer. Check which estimate has more variance. Visually check the graphs to see which is the better fit.
Add noise which is Laplace distributed in the above example and find the Fourier coefficients using $\|\cdot\|_1$ minimization in Matlab.
Consider the Bayesian estimation of the Fourier coefficients. Let $\theta \sim \mathcal N(0,\sigma_\theta^2 I)$. Then, from (17),
\[ \hat\theta_{MAP} = (H^T\Sigma_w^{-1}H + \sigma_\theta^{-2}I)^{-1}H^T\Sigma_w^{-1}x. \]
Consider a regularized version of the above problem with ridge regularization and lasso regularization with λ = 1. Compare the Fourier coefficients obtained with ordinary LS and with $\|\cdot\|_1$ minimization.
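Since the exercise asks for a computer experiment, here is a hedged NumPy sketch of the Fourier-coefficient LS fit (M, N, the true coefficients, and the noise level are arbitrary illustrative choices); the ridge and lasso variants can be added as in the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 128, 3
n = np.arange(N)
k = np.arange(1, M + 1)

# Columns: cos(2*pi*k*n/N) for k=1..M, then sin(2*pi*k*n/N) for k=1..M
H = np.hstack([np.cos(2 * np.pi * np.outer(n, k) / N),
               np.sin(2 * np.pi * np.outer(n, k) / N)])

theta_true = np.array([1.0, 0.0, -0.5, 0.3, 0.8, 0.0])   # (a_1..a_M, b_1..b_M)
x = H @ theta_true + 0.2 * rng.standard_normal(N)

# LS estimate; since H^T H = (N/2) I, it equals (2/N) H^T x
theta_ls = np.linalg.lstsq(H, x, rcond=None)[0]
print(np.allclose(theta_ls, (2.0 / N) * H.T @ x))
print(np.round(theta_ls, 3))
```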
Impulse response identification: Let
\[ y(n) = \sum_{k=0}^{l-1} h(k)u(n-k) + w(n), \qquad n = 0,1,\dots,N-1, \]
where w(n) is WGN and u(n) = 0 for n < 0. Then, $y = H\theta + w$ where $\theta = (h(0),\dots,h(l-1))^T$ and
\[ H = \begin{pmatrix}
u(0) & 0 & \cdots & 0 \\
u(1) & u(0) & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
u(N-1) & u(N-2) & \cdots & u(N-l)
\end{pmatrix}. \]
The mean and the variance of the estimator can be computed as done in the previous example. Repeat the
same exercise given in the previous case.
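A brief NumPy/SciPy sketch of this identification (the true impulse response, input, and noise level are invented for the example); scipy.linalg.toeplitz builds the H matrix above.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(8)
N, l = 200, 4
h_true = np.array([1.0, 0.5, -0.3, 0.1])

u = rng.standard_normal(N)                       # persistently exciting input
# H has first column u(0..N-1) and first row (u(0), 0, ..., 0)
H = toeplitz(u, np.r_[u[0], np.zeros(l - 1)])
y = H @ h_true + 0.05 * rng.standard_normal(N)

h_hat = np.linalg.lstsq(H, y, rcond=None)[0]
print(np.round(h_hat, 3))                        # close to h_true
```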
ARX model: Consider an ARX model
\[ A(q^{-1})y(t) = B(q^{-1})u(t) + e(t), \qquad A(q^{-1}) = 1 + a_1q^{-1} + \cdots + a_nq^{-n}, \quad B(q^{-1}) = b_1q^{-1} + \cdots + b_mq^{-m}, \]
where the unknown parameters are $\theta := (a_1,\dots,a_n,b_1,\dots,b_m)^T$ and the noise variance $\sigma^2$. The model can be written in linear regression form $y(t) = \varphi^T(t)\theta + e(t)$, where
\[ \varphi(t) := \big({-y(t-1)}\ \cdots\ {-y(t-n)}\ \ u(t-1)\ \cdots\ u(t-m)\big)^T \in \mathbb R^d. \tag{31} \]
Therefore,
\[ V_N(\theta) = \frac{1}{N}\sum_{t=t^*}^{N-1}\big(y(t) - \varphi^T(t)\theta\big)^2, \qquad t^* = \max(m,n). \]
Therefore,
\[ \hat\theta_{ls}(N) = \Big(\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)\varphi^T(t)\Big)^{-1}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)y(t). \]
Suppose the actual observation is expressed as $y(t) = \varphi^T(t)\theta_0 + v_0(t)$, where $\theta_0$ is the true parameter. Assuming that
• $\lim_{N\to\infty}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)\varphi^T(t) = E[\varphi(t)\varphi^T(t)]$ is nonsingular, and
• $\lim_{N\to\infty}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)v_0(t) = E[\varphi(t)v_0(t)] = 0$,
we have $\lim_{N\to\infty}\hat\theta_{ls}(N) = \theta_0$.
We now find the covariance of the estimate $\hat\theta_{ls}(N)$. Let
\[ R(N) = \frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)\varphi^T(t), \qquad \tilde\theta := \hat\theta_N - \theta_0. \]
Notice that
\[ \tilde\theta = \hat\theta_N - \theta_0 = R(N)^{-1}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)v_0(t), \]
so that
\[ E[\tilde\theta] = E\Big[R(N)^{-1}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)v_0(t)\Big] = 0 \quad \text{as } N \to \infty. \]
In matrix-vector form,
\[ y_N = \Phi_N\theta_0 + v_N \]
and
\[ \hat\theta_N = (\Phi_N^T\Phi_N)^{-1}\Phi_N^Ty_N = (\Phi_N^T\Phi_N)^{-1}(\Phi_N^T\Phi_N\theta_0 + \Phi_N^Tv_N) = \theta_0 + (\Phi_N^T\Phi_N)^{-1}\Phi_N^Tv_N. \tag{32} \]
Thus, $\tilde\theta = (\Phi_N^T\Phi_N)^{-1}\Phi_N^Tv_N$. Therefore, with $E[v_Nv_N^T] = \lambda I$,
\begin{align*}
P_N := E[\tilde\theta\tilde\theta^T] &= E\big[(\Phi_N^T\Phi_N)^{-1}\Phi_N^Tv_Nv_N^T\Phi_N(\Phi_N^T\Phi_N)^{-1}\big] \\
&= \lambda(\Phi_N^T\Phi_N)^{-1}\Phi_N^T\Phi_N(\Phi_N^T\Phi_N)^{-1} \\
&= \lambda(\Phi_N^T\Phi_N)^{-1} = \frac{\lambda}{N}R(N)^{-1}. \tag{33}
\end{align*}
Thus, the covariance decays at the rate 1/N and the parameters approach the limiting value at rate $1/\sqrt{N}$. Furthermore, the covariance is proportional to the inverse of the signal to noise ratio, where λ is the noise variance and R(N) is proportional to the input power.
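A hedged NumPy sketch of ARX identification by LS (the first-order system, input, and noise level are invented for illustration): simulate the ARX model, build the regressors (31), and check that $\hat\theta$ approaches $\theta_0$ with parameter covariance roughly $\lambda(\Phi_N^T\Phi_N)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(9)
a1, b1, b2 = -0.7, 1.0, 0.5          # true ARX parameters, theta_0 = (a1, b1, b2)
lam = 0.1                            # noise variance
N = 2000

u = rng.standard_normal(N)
e = np.sqrt(lam) * rng.standard_normal(N)
y = np.zeros(N)
for t in range(2, N):
    # A(q^-1) y(t) = B(q^-1) u(t) + e(t)  with  n = 1, m = 2
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + b2 * u[t - 2] + e[t]

# Regressors phi(t) = (-y(t-1), u(t-1), u(t-2)), t = t*, ..., N-1 with t* = 2
Phi = np.column_stack([-y[1:N-1], u[1:N-1], u[0:N-2]])
Y = y[2:N]
theta_hat = np.linalg.lstsq(Phi, Y, rcond=None)[0]
print(theta_hat)                     # close to (a1, b1, b2)

# Parameter covariance estimate lambda * (Phi^T Phi)^{-1}
print(lam * np.linalg.inv(Phi.T @ Phi))
```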
More applications similar to the ones discussed above are curve fitting, surface/manifold fitting and so
on.
Noisy regressors [1]: Consider a resistor estimation problem (with true value $R_0$) using (noisy) measurements v(k) and i(k) of the true voltage $v_0$ and the true current $i_0$, respectively. Let $v(k) = v_0 + n_v(k)$, $i(k) = i_0 + n_i(k)$, where $n_v$ and $n_i$ denote the voltage and current measurement noise at time k. Let $n_v$, $n_i$ be i.i.d., zero mean, with finite variances $\sigma_v^2$, $\sigma_i^2$. The LS problem is
\[ \min_R \ \frac{1}{N}\sum_{k=1}^{N}\big(v(k) - R\,i(k)\big)^2. \]
Let $v_N$, $i_N$ be the vectors of voltages and currents ($\Phi_N = i_N$ in this case). Notice that the regressor $i_N$ is noisy in this case, unlike the previous cases where the regressor matrix Φ was assumed to be noise free. The LS solution is
\[ \hat R(N) = \frac{\frac{1}{N}i_N^Tv_N}{\frac{1}{N}i_N^Ti_N} = \frac{\frac{1}{N}\sum_{k=1}^{N}(v_0 + n_v(k))(i_0 + n_i(k))}{\frac{1}{N}\sum_{k=1}^{N}(i_0 + n_i(k))^2}. \]
As $N\to\infty$, the numerator evaluates to
\[ \lim_{N\to\infty} v_0i_0 + \frac{1}{N}\sum_{k=1}^{N}n_v(k)n_i(k) = v_0i_0 \]
and the denominator evaluates to
\[ \lim_{N\to\infty} i_0^2 + \frac{1}{N}\sum_{k=1}^{N}n_i(k)^2 = i_0^2 + \sigma_i^2, \]
where we have used that the sample mean and the sample variance converge to the true mean and the true variance asymptotically. Thus,
\[ \lim_{N\to\infty}\hat R = \frac{R_0}{1 + \frac{\sigma_i^2}{i_0^2}}. \]
The estimate is biased as a result of the noisy regressors, and the bias depends on the variance of the noise in the regressors.
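A short NumPy check of this asymptotic bias (R₀, i₀, and the noise variances are arbitrary illustrative values): the LS estimate converges to $R_0/(1 + \sigma_i^2/i_0^2)$ rather than to $R_0$.

```python
import numpy as np

rng = np.random.default_rng(10)
R0, v_sigma, i_sigma = 5.0, 0.2, 0.3
i0 = 1.0
v0 = R0 * i0
N = 200_000

i_meas = i0 + i_sigma * rng.standard_normal(N)   # noisy current (the regressor)
v_meas = v0 + v_sigma * rng.standard_normal(N)   # noisy voltage

R_hat = (i_meas @ v_meas) / (i_meas @ i_meas)    # scalar LS solution
print(R_hat)                                     # ~ R0 / (1 + i_sigma**2 / i0**2), not R0
print(R0 / (1 + i_sigma**2 / i0**2))
```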
Instrumental variables
Note that in LS solutions, with $\theta_0$ being the true value and $\hat\theta$ being the LS estimate, from (8),
\[ \hat\theta - \theta_0 = (\Phi^T\Phi)^{-1}\Phi^Ty - (\Phi^T\Phi)^{-1}\Phi^T\Phi\theta_0 = (\Phi^T\Phi)^{-1}\Phi^Tv. \tag{34} \]
As seen in the ARX model, for a consistent (asymptotically unbiased) estimator we need $\lim_{N\to\infty}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)v(t) = E[\varphi(t)v(t)] = 0$, i.e., the regressors must be uncorrelated with the noise. If this condition is not satisfied, then the estimate is biased. In such cases, one uses the method of instrumental variables.
Suppose y(t) is an $n_y$-dimensional output at time t and θ is an $n_\theta$-dimensional parameter vector. Let $\varphi^T(t) \in \mathbb R^{n_y\times n_\theta}$ and let $Z(t) \in \mathbb R^{n_z\times n_y}$ be such that
\[ \frac{1}{N}\sum_{t=t^*}^{N} Z(t)\big(y(t) - \varphi^T(t)\theta\big) = 0. \tag{35} \]
If $n_z = n_\theta$, then
\[ \hat\theta = \Big(\sum_{t=t^*}^{N} Z(t)\varphi^T(t)\Big)^{-1}\sum_{t=t^*}^{N} Z(t)y(t). \tag{36} \]
For Z(t) = φ(t), we get the LS estimator. If Z(t) are uncorrelated with the noise, then we obtain a consistent
estimator without bias. The instrumental variables method works when the noise is not necessarily white
Gaussian. In the case of ARX models, delayed inputs are chosen as instrumental variables [5].
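A hedged NumPy sketch of the instrumental-variable idea on a first-order ARX-type system with colored (non-white) noise (the system, the noise filter, and the choice of delayed inputs as instruments are illustrative): plain LS is biased here, while the IV estimate (36) is not.

```python
import numpy as np

rng = np.random.default_rng(11)
a1, b1, c = -0.8, 1.0, 0.9
N = 50_000

u = rng.standard_normal(N)
e = 0.3 * rng.standard_normal(N)
v = e.copy()
v[1:] += c * e[:-1]                   # colored noise v(t) = e(t) + c e(t-1)

y = np.zeros(N)
for t in range(1, N):
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + v[t]

# phi(t) = (-y(t-1), u(t-1)), instruments z(t) = (u(t-1), u(t-2)), t = 2..N-1
Phi = np.column_stack([-y[1:N-1], u[1:N-1]])
Z = np.column_stack([u[1:N-1], u[0:N-2]])
Y = y[2:N]

theta_ls = np.linalg.lstsq(Phi, Y, rcond=None)[0]   # biased: v(t) is correlated with y(t-1)
theta_iv = np.linalg.solve(Z.T @ Phi, Z.T @ Y)      # instrumental variable estimate (36)
print("true:", (a1, b1), "LS:", theta_ls, "IV:", theta_iv)
```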
Nonlinear least squares
Let g(θ) be a nonlinear function of θ, with model output x(n) = g(θ)(n), where x(n) is the observed signal. Let
\[ J(\theta) = (x - g(\theta))^T(x - g(\theta)). \tag{37} \]
Then,
\[ \hat\theta = \mathrm{argmin}_\theta J(\theta) \tag{38} \]
gives the solution to the nonlinear LS problem. The minimum can be found by the gradient descent method or the Gauss-Newton method.
Now let $x(n) = g(\theta)(n) + v(n)$ and let $\Sigma_v$ be the noise covariance matrix. Then,
\[ J(\theta) = (x - g(\theta))^T\Sigma_v^{-1}(x - g(\theta)) \tag{39} \]
and
\[ \hat\theta = \mathrm{argmin}_\theta J(\theta) \tag{40} \]
gives the solution to the weighted nonlinear LS problem. One can consider regularized nonlinear LS as well. A particular application involves nonlinear system identification and is discussed in parametric identification.
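A brief sketch of a nonlinear LS fit using scipy.optimize.least_squares (trust-region/Levenberg-Marquardt variants of Gauss-Newton); the exponential-decay model $g(\theta)(n) = \theta_1 e^{-\theta_2 n}$ and the data are invented for illustration, and only a local minimum is guaranteed.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(12)
n = np.arange(0, 50)
theta_true = np.array([2.0, 0.1])
x = theta_true[0] * np.exp(-theta_true[1] * n) + 0.05 * rng.standard_normal(n.size)

def residuals(theta):
    # residual vector x - g(theta); least_squares minimizes its squared norm
    return x - theta[0] * np.exp(-theta[1] * n)

sol = least_squares(residuals, x0=np.array([1.0, 1.0]))   # the starting guess matters: only a local minimum
print(sol.x)                                              # close to theta_true
```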
Let
\[ R(\bar\theta) = \Sigma^{-1/2}\bar R(\bar\theta). \tag{44} \]
Then, the parameter and state trajectory estimates can be found by solving the following nonlinear LS problem
\[ \bar\theta^* = \mathrm{argmin}_{\bar\theta} \ \tfrac{1}{2}\|R(\bar\theta)\|_2^2, \tag{45} \]
which can be solved using the Gauss-Newton algorithm or using lsqnonlin in MATLAB. One may not get a global minimum in this case.
A similar problem also shows up in maximum hands-off control problems in the control literature. An alternate form is to augment the constraints into the cost function.
Here $\|X\|_*$ is the nuclear norm of X, i.e., the sum of its nonzero singular values. This is the well known matrix completion problem. In matrix completion, we want to minimize the rank of the unknown matrix subject to some equality constraints. However, this is a combinatorial NP-hard problem. Its convex relaxation is given by minimizing the nuclear norm of the unknown matrix subject to the equality constraints.
An augmented form of (50) will be
References
[1] M. Diehl, Lecture Notes on Modeling and System Identification, lecture notes and video lectures, 2020.
[2] G. Pillonetto, T. Chen, A. Chiuso, G. De Nicolao, L. Ljung, Regularized System Identification: Learning Dynamic Models from Data, Springer, 2022.
[3] L. Ljung, System Identification: Theory for the User, PHI, 2nd edition, 1999.
[6] G. Bottegal, Lecture slides on Nonparametric kernel-based System Identification, DISC course, 2020.