Least Squares
This can be identified with the linear least squares problem $\text{minimize}_\theta \ \|\Phi\theta - y\|_2$ as follows. Let
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}
=
\begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{pmatrix}
\begin{pmatrix} \theta_1 \\ \theta_0 \end{pmatrix}.
\]
Then, we want to find the least squares solution to this overdetermined system of equations. The familiar methods to solve $\Phi\theta = y$ are the QR factorization and the normal equations. Writing $\Phi = QR$ with Q having orthonormal columns and R upper triangular,
\[ QR\theta = y \;\Rightarrow\; R\theta = Q^T y. \]
The method of normal equations can be derived from the necessary and sufficient conditions for optimality. Consider $\|\Phi\theta - y\|_2^2 = (\Phi\theta - y)^T(\Phi\theta - y)$ where Φ has full column rank. Taking the derivative w.r.t. θ and setting it to zero gives
\[ \Phi^T\Phi\,\theta = \Phi^T y \;\Rightarrow\; \theta = (\Phi^T\Phi)^{-1}\Phi^T y. \]
Taking the second derivative w.r.t. θ gives $\Phi^T\Phi > 0$, which shows that this stationary point is a minimum.
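As a quick illustration, the sketch below (Python/NumPy, with a synthetic line-fit data set chosen only for this example) solves the same LS problem via the normal equations, via QR, and with numpy.linalg.lstsq; all three should agree to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1, 1, N)
y = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(N)   # true theta = (2.0, 0.5)

# Regressor matrix Phi = [x 1]
Phi = np.column_stack([x, np.ones(N)])

# 1) Normal equations: theta = (Phi^T Phi)^{-1} Phi^T y
theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# 2) QR factorization: R theta = Q^T y
Q, R = np.linalg.qr(Phi)
theta_qr = np.linalg.solve(R, Q.T @ y)

# 3) Library routine
theta_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(theta_ne, theta_qr, theta_lstsq)
```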
The method of least squares also extends to polynomial fits/nonlinear regression. Given a set of data points $(x_i, y_i)$, $i = 1, \dots, N$, the best $n$th degree polynomial fit is
\[ \text{minimize}_{\theta_0,\dots,\theta_n} \ \sum_{i=1}^{N}\big(y_i - (\theta_n x_i^n + \cdots + \theta_0)\big)^2. \tag{2} \]
In matrix form,
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}
=
\begin{pmatrix}
x_1^n & x_1^{n-1} & \cdots & x_1 & 1 \\
x_2^n & x_2^{n-1} & \cdots & x_2 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
x_N^n & x_N^{n-1} & \cdots & x_N & 1
\end{pmatrix}
\begin{pmatrix} \theta_n \\ \theta_{n-1} \\ \vdots \\ \theta_1 \\ \theta_0 \end{pmatrix}.
\]
Rows of matrix Φ are called regressors and the vector of unknowns is the parameter vector. One can consider
other regression fits as well.
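A minimal NumPy sketch of the polynomial fit (the cubic example and its data are made up for illustration): build the Vandermonde-type regressor matrix above and solve the resulting linear LS problem.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 100, 3                                   # N data points, degree-n polynomial
x = np.linspace(-1, 1, N)
true_theta = np.array([1.0, -2.0, 0.5, 0.3])    # theta_n, ..., theta_0
y = np.polyval(true_theta, x) + 0.05 * rng.standard_normal(N)

# Regressor matrix with columns x^n, x^{n-1}, ..., x, 1
Phi = np.vander(x, n + 1)
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta_hat)                                # should be close to true_theta
```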
Least squares estimate and mean squared error: Let y(t) be the observed output with known input and θ be the parameter (scalar/vector/matrix) associated with the model. Let ŷ(t|θ) be the output of the model under a given input. Then, the mean squared error (MSE) is
\[ V_N(\theta) = \frac{1}{N}\sum_{t=1}^{N}\big(\hat y(t|\theta) - y(t)\big)^2. \tag{3} \]
For
\[ y(t) = \varphi^T(t)\theta + \varepsilon(t), \tag{6} \]
choosing $W = \mathrm{diag}(\sigma^{-2}(i))$, $i = 1,\dots,N$, where $\sigma^2(i)$ is the variance of $\varepsilon(i)$, forms a weighted LS (WLS) problem in the presence of noisy measurements.
MVU estimators using LS for Gaussian noise: Consider
\[ y = \Phi\theta + v \tag{7} \]
where $y = (y(0),\dots,y(N-1))^T$, $v = (v(0),\dots,v(N-1))^T$ and $v \sim \mathcal N(0,\sigma^2 I)$. Note that $\hat y|\theta = \Phi\theta$ and the LS solution is the minimizer of $\|y - \Phi\theta\|_2^2$, which is
\[ \hat\theta = (\Phi^T\Phi)^{-1}\Phi^T y. \tag{8} \]
Clearly, $E[\hat\theta] = \theta$ as the noise is zero mean. It turns out that the MVUE is also given by the LS solution. If the noise is not zero mean, then the LS solution is biased. To find the covariance matrix of $\hat\theta$, observe that $\hat\theta - \theta = (\Phi^T\Phi)^{-1}\Phi^T v$, so
\[ \mathrm{cov}(\hat\theta) = E\big[(\Phi^T\Phi)^{-1}\Phi^T vv^T\Phi(\Phi^T\Phi)^{-1}\big] = \sigma^2(\Phi^T\Phi)^{-1}. \]
For linear models with zero mean Gaussian noise, LS estimates are also MVUE ([4]).
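A small Monte Carlo sketch (NumPy, with an arbitrary Φ and σ chosen for illustration) checking that the LS estimate is unbiased and that its sample covariance matches $\sigma^2(\Phi^T\Phi)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma = 200, 3, 0.5
Phi = rng.standard_normal((N, d))
theta_true = np.array([1.0, -0.5, 2.0])

# Repeat the experiment many times with fresh noise realizations
trials = 5000
estimates = np.empty((trials, d))
for k in range(trials):
    y = Phi @ theta_true + sigma * rng.standard_normal(N)
    estimates[k] = np.linalg.lstsq(Phi, y, rcond=None)[0]

print("mean of estimates:", estimates.mean(axis=0))        # ~ theta_true
print("sample covariance:\n", np.cov(estimates.T))
print("theory sigma^2 (Phi^T Phi)^{-1}:\n",
      sigma**2 * np.linalg.inv(Phi.T @ Phi))
```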
Statistical weighted LS: We consider the following assumptions:
• The N measurements y(t) are generated by a model $y(t) = \varphi(t)^T\theta_0 + \varepsilon(t)$ ($\varphi(t)$ is a regressor vector), where $\theta_0$ is the true but unknown value.
• The N noise terms $\varepsilon(t)$ have zero mean. Often, they are also independent or i.i.d.; this determines the optimal weighting matrix. Let $\varepsilon_N := (\varepsilon(1),\dots,\varepsilon(N))^T$.
Notice that
\[ E[\hat\theta_{wls}] = (\Phi_N^T W\Phi_N)^{-1}\Phi_N^T W\,E[y_N] = (\Phi_N^T W\Phi_N)^{-1}\Phi_N^T W\,E[\Phi_N\theta_0 + \varepsilon_N] = \theta_0, \]
i.e., WLS is unbiased, and the estimation error is
\[ \hat\theta_{wls} - \theta_0 = (\Phi_N^T W\Phi_N)^{-1}\Phi_N^T W\varepsilon_N, \]
where $E[\varepsilon_N\varepsilon_N^T] = \Sigma_N = \mathrm{cov}(\varepsilon_N)$. Hence
\[ \mathrm{cov}(\hat\theta_{wls}) = (\Phi_N^T W\Phi_N)^{-1}\Phi_N^T W\Sigma_N W\Phi_N(\Phi_N^T W\Phi_N)^{-1}. \]
Thus, $\mathrm{cov}(\hat\theta_{wls})$ depends on the choice of W. If $\Sigma_N$ is known, choosing $W = \Sigma_N^{-1}$ gives
\[ \mathrm{cov}(\hat\theta_{wls}) = (\Phi_N^T\Sigma_N^{-1}\Phi_N)^{-1}. \tag{11} \]
This is the optimal choice of W, i.e., $\mathrm{cov}(\hat\theta_{wls}) \geq (\Phi_N^T\Sigma_N^{-1}\Phi_N)^{-1}$ for all other W. We refer the reader to [4] for Bayesian linear models and LS.
MLE and LS for Gaussian noise: It turns out that for linear models $y = \Phi\theta + v$ with v Gaussian, the MLE is given by the weighted LS solution, which is also the MVUE ([4]). Consider (7) with $v \sim \mathcal N(0,\Sigma)$. Then
\[ p(y;\theta) = \frac{1}{(2\pi)^{N/2}\det(\Sigma)^{1/2}}\,e^{-\frac{1}{2}(y-\Phi\theta)^T\Sigma^{-1}(y-\Phi\theta)} \]
and the MLE can be found by setting the derivative of the log-likelihood to zero:
\[ 0 = \frac{\partial\log p(y;\theta)}{\partial\theta} = \Phi^T\Sigma^{-1}(y-\Phi\theta) \;\Rightarrow\; \hat\theta = (\Phi^T\Sigma^{-1}\Phi)^{-1}\Phi^T\Sigma^{-1}y, \]
with covariance $\Sigma_{\hat\theta} = (\Phi^T\Sigma^{-1}\Phi)^{-1}$ (which is the optimal estimator, i.e., the one with minimum covariance). (Notice that $E[\hat\theta] = \theta$.) Thus, the MLE and the MMSE (minimum mean squared error) estimator or the (weighted) MVUE coincide when the noise is zero-mean Gaussian. However, for Gaussian noise with nonzero mean, even though the LS or weighted LS solution coincides with the MLE, the LS solution is biased.
Ill-posed LS problems and the Moore-Penrose pseudoinverse: In ill-posed problems, $\Phi^T\Phi$ is not invertible and there is no unique LS solution. In such cases, with the SVD $\Phi = U\Sigma V^T$, the Moore-Penrose pseudoinverse is given by $\Phi^+ = V\Sigma^+U^T$ where $\Sigma^+ = \mathrm{diag}(\sigma_1^{-1},\dots,\sigma_r^{-1},0,\dots,0)$ and r is the rank of Φ. The minimum-norm LS solution is then $\hat\theta = \Phi^+ y$.
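A short NumPy illustration (the rank-deficient Φ is made up for the example): numpy.linalg.pinv computes Φ⁺ via the SVD and returns the minimum-norm LS solution.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 30, 4
A = rng.standard_normal((N, d - 1))
Phi = np.column_stack([A, A[:, 0] + A[:, 1]])      # last column is a linear combination -> rank deficient
y = rng.standard_normal(N)

print(np.linalg.matrix_rank(Phi))                  # d - 1
theta_pinv = np.linalg.pinv(Phi, rcond=1e-10) @ y  # minimum-norm LS solution

# Same thing via the SVD explicitly
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
s_inv = np.where(s > 1e-10 * s.max(), 1.0 / s, 0.0)
theta_svd = Vt.T @ (s_inv * (U.T @ y))
print(np.allclose(theta_pinv, theta_svd))
```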
Regularized least squares: To find a unique LS solution, one can pose a regularized optimization problem (ridge regression)
\[ \text{minimize}_\theta \ \tfrac{1}{2}\|y - \Phi\theta\|_2^2 + \tfrac{\alpha}{2}\|\theta\|_2^2 \tag{13} \]
where α > 0. This problem has a unique solution $\theta^* = (\Phi^T\Phi + \alpha I)^{-1}\Phi^T y$. Note that for $y = \Phi\theta + v$ with v being WGN $\mathcal N(0,\sigma^2 I)$,
\[ E[\theta^*] = (\Phi^T\Phi + \alpha I)^{-1}\Phi^T\Phi\,\theta, \tag{14} \]
i.e., the estimator is biased and becomes unbiased only for α = 0. The covariance of $\theta^*$ is
\[ E[(\theta^* - E[\theta^*])(\theta^* - E[\theta^*])^T] = (\Phi^T\Phi + \alpha I)^{-1}\Phi^T E[vv^T]\Phi(\Phi^T\Phi + \alpha I)^{-1} = \sigma^2(\Phi^T\Phi + \alpha I)^{-1}\Phi^T\Phi(\Phi^T\Phi + \alpha I)^{-1}. \tag{15} \]
It turns out that regularized LS estimators have lower covariance (i.e., they are less sensitive to data variation) than LS estimators. On the flip side, regularized LS estimators are biased. Suppose θ is a scalar. Then,
\[ \mathrm{var}(\theta^*) = \frac{\sigma^2\,\Phi^T\Phi}{(\Phi^T\Phi + \alpha)^2} < \mathrm{var}(\hat\theta_{ls}) = \sigma^2(\Phi^T\Phi)^{-1}, \]
which verifies the lower variance in the scalar case. In cases with low signal to noise ratio (SNR), one prefers regularized LS over LS. There are other regularizers as well, such as $\|\theta\|_W$, W > 0, and $\|\theta\|_1$, which promotes sparse solutions. Notice that in (13), the first term corresponds to the variance and the second term corresponds to the (bias)².
• As α increases, bias increases and the variance goes down.
• At α → ∞, θ∗ = 0 (trivial estimator) independent of the data so there is no variance and only bias.
• The bias variance trade-off depends on α as well as the choice of the norm in the second term of (13).
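A hedged NumPy sketch of this bias-variance effect for ridge regression (the synthetic data, α values, and noise level are arbitrary illustrative choices): for each α, repeated noise realizations give an empirical bias and variance of θ*.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, sigma = 30, 5, 1.0
Phi = rng.standard_normal((N, d))
theta_true = rng.standard_normal(d)

for alpha in [0.0, 0.1, 1.0, 10.0]:
    M = np.linalg.inv(Phi.T @ Phi + alpha * np.eye(d)) @ Phi.T   # ridge estimator matrix
    ests = []
    for _ in range(2000):
        y = Phi @ theta_true + sigma * rng.standard_normal(N)
        ests.append(M @ y)
    ests = np.array(ests)
    bias = np.linalg.norm(ests.mean(axis=0) - theta_true)
    var = ests.var(axis=0).sum()
    print(f"alpha={alpha:5.1f}  bias={bias:.3f}  total variance={var:.3f}")
```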
MAP estimation with a Gaussian prior: Consider $y = \Phi\theta + w$ with $w \sim \mathcal N(0,\Sigma_w)$ and a Gaussian prior $\theta \sim \mathcal N(\theta_0,\Sigma_\theta)$. Maximizing the posterior is equivalent to minimizing $-\log p(y|\theta) - \log p(\theta)$, i.e.,
\[ \hat\theta_{MAP} = \mathrm{argmin}_\theta \ \tfrac{1}{2}(y-\Phi\theta)^T\Sigma_w^{-1}(y-\Phi\theta) + \tfrac{1}{2}(\theta-\theta_0)^T\Sigma_\theta^{-1}(\theta-\theta_0), \tag{16} \]
which is a regularized LS problem. Since this is a quadratic program, the optimal solution can be found by setting the derivative w.r.t. θ to zero, i.e.,
\[ \Phi^T\Sigma_w^{-1}(y-\Phi\theta) + \Sigma_\theta^{-1}(\theta_0-\theta) = 0 \;\Rightarrow\; \hat\theta_{MAP} = (\Phi^T\Sigma_w^{-1}\Phi + \Sigma_\theta^{-1})^{-1}(\Phi^T\Sigma_w^{-1}y + \Sigma_\theta^{-1}\theta_0). \tag{17} \]
(It turns out that the MMSE estimator, given by E[θ|y], also coincides with the MAP estimator for linear Gaussian models. Note that E[θ|y] turns out to be linear due to the linear Gaussian model.)
In the above setup, suppose $\Sigma_w = I$, $\theta_0 = 0$, $\Sigma_\theta = \lambda^{-1}I$. Then, (16) becomes
\[ \hat\theta_{MAP} = \mathrm{argmin}_\theta \ \tfrac{1}{2}\big(\|y - \Phi\theta\|_2^2 + \lambda\|\theta\|_2^2\big), \]
which exactly matches ridge regularization.
$L_1$-estimator: It is given by $\mathrm{argmin}_\theta\|y - M(\theta)\|_1$. For linear models, it is $\mathrm{argmin}_\theta\|y - \Phi\theta\|_1$. Suppose $y_N = \Phi\theta + \varepsilon_N$ where the $\varepsilon(k)$ are i.i.d. and Laplace distributed, $p(\varepsilon) = \frac{1}{2\sigma}e^{-|\varepsilon|/\sigma}$ (this noise model is used when the observations have many outliers). Then, the MLE is the same as the optimal $L_1$ estimator. When ε is Gaussian, the MLE is the same as the optimal $L_2$ or LS estimator.
Lasso: The least absolute shrinkage and selection operator (lasso) is another type of regularizer, which gives sparse solutions to LS problems. It is formulated as
\[ \text{minimize}_\theta \ \tfrac{1}{2}\|y - \Phi\theta\|_2^2 + \lambda\|\theta\|_1 \tag{18} \]
where λ > 0. Higher λ results in sparser solutions. In the Bayesian setting, if we use a Laplace prior rather than a Gaussian one, we end up with lasso regularization rather than ridge regularization. For example, let $y = \Phi\theta + w$, $w \sim \mathcal N(0,\Sigma_w)$, $p(\theta) = \frac{1}{2\sigma}e^{-\|\theta\|_1/\sigma}$. Then, proceeding as above in the MAP case and ignoring constants,
\[ -\log p(y|\theta) - \log p(\theta) = \tfrac{1}{2}(y-\Phi\theta)^T\Sigma_w^{-1}(y-\Phi\theta) + \frac{\|\theta\|_1}{\sigma}. \]
Thus, $\hat\theta_{MAP}$ is given by the solution of a lasso problem.
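A minimal sketch of solving (18) with proximal gradient descent (ISTA) in NumPy; the synthetic sparse problem and the step size are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 50, 20
Phi = rng.standard_normal((N, d))
theta_true = np.zeros(d)
theta_true[[2, 7, 11]] = [1.5, -2.0, 0.8]        # sparse ground truth
y = Phi @ theta_true + 0.05 * rng.standard_normal(N)

lam = 0.5
step = 1.0 / np.linalg.norm(Phi, 2) ** 2          # 1/L, L = largest eigenvalue of Phi^T Phi

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

theta = np.zeros(d)
for _ in range(2000):
    grad = Phi.T @ (Phi @ theta - y)              # gradient of the smooth part
    theta = soft_threshold(theta - step * grad, step * lam)

print(np.round(theta, 3))                         # most entries shrink to exactly zero
```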
Elastic net regularizers are of the form
\[ \text{minimize}_\theta \ \tfrac{1}{2}\|y - \Phi\theta\|_2^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2 \tag{19} \]
where $\lambda_1,\lambda_2 > 0$, which combines features of ridge regression and lasso.
Recursive LS: Let $\Phi_N \in \mathbb R^{N\times d}$. We want to find
\[ \hat\theta(N) = \mathrm{argmin}_\theta \ \tfrac{1}{2}\|y_N - \Phi_N\theta\|_2^2 \tag{20} \]
for increasing values of N. We now demonstrate how to find $\hat\theta(N+1)$ from $\hat\theta(N)$ and the new measurement y(N+1).
Linear algebraic way: Let $Q_N = \Phi_N^T\Phi_N$. Therefore,
\[ Q_{N+1} = \Phi_{N+1}^T\Phi_{N+1} = \Phi_N^T\Phi_N + \varphi_{N+1}\varphi_{N+1}^T. \tag{21} \]
Therefore,
\begin{align*}
\hat\theta(N+1) &= Q_{N+1}^{-1}\Phi_{N+1}^T y_{N+1} = Q_{N+1}^{-1}\big(\Phi_N^T y_N + \varphi_{N+1}y(N+1)\big) \\
&= Q_{N+1}^{-1}\big(\Phi_N^T\Phi_N\hat\theta(N) + \varphi_{N+1}y(N+1)\big) \\
&= Q_{N+1}^{-1}\big(\Phi_N^T\Phi_N + \varphi_{N+1}\varphi_{N+1}^T\big)\hat\theta(N) - Q_{N+1}^{-1}\varphi_{N+1}\varphi_{N+1}^T\hat\theta(N) + Q_{N+1}^{-1}\varphi_{N+1}y(N+1) \\
&= Q_{N+1}^{-1}Q_{N+1}\hat\theta(N) + Q_{N+1}^{-1}\varphi_{N+1}\big(y(N+1) - \varphi_{N+1}^T\hat\theta(N)\big) \\
&= \hat\theta(N) + Q_{N+1}^{-1}\varphi_{N+1}\big(y(N+1) - \varphi_{N+1}^T\hat\theta(N)\big). \tag{22}
\end{align*}
This is the recursive LS solution. Notice that the recursion starts once $Q_N$ becomes invertible. Note further that one can use the Sherman-Morrison-Woodbury formula
\[ (A + vv^T)^{-1} = A^{-1} - \frac{A^{-1}vv^TA^{-1}}{1 + v^TA^{-1}v} \]
to calculate $Q_{N+1}^{-1}$ from $Q_N^{-1}$ and the new regressor vector $\varphi_{N+1}$.
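A compact NumPy sketch of the recursion (22), propagating $P = Q^{-1}$ with the Sherman-Morrison update rather than refactorizing; the data-generating model and the initialization $P_0 = \beta^{-1}I$ are illustrative choices (see the initialization discussion below).

```python
import numpy as np

rng = np.random.default_rng(6)
d = 3
theta_true = np.array([0.5, -1.0, 2.0])

beta = 1e-3
theta_hat = np.zeros(d)
P = np.eye(d) / beta            # P = Q^{-1}, initialized with Q_0 = beta * I

for t in range(500):
    phi = rng.standard_normal(d)                        # new regressor phi_{N+1}
    y = phi @ theta_true + 0.1 * rng.standard_normal()  # new measurement y(N+1)

    # Sherman-Morrison update of P = (Q + phi phi^T)^{-1}
    Pphi = P @ phi
    P = P - np.outer(Pphi, Pphi) / (1.0 + phi @ Pphi)

    # Recursive LS correction (22)
    theta_hat = theta_hat + P @ phi * (y - phi @ theta_hat)

print(theta_hat)                # close to theta_true
```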
Optimization approach (solving convex quadratic programs online): Note that the RLS algorithm solves at each step the following minimization problem [1]:
\begin{align*}
\hat\theta(N+1) &= \mathrm{argmin}_\theta \ \tfrac{1}{2}\|y_{N+1} - \Phi_{N+1}\theta\|_2^2 \tag{23} \\
&= \mathrm{argmin}_\theta \ \tfrac{1}{2}\Big(\|y_N - \Phi_N\theta\|_2^2 + \|y(N+1) - \varphi_{N+1}^T\theta\|_2^2\Big) \\
&= \mathrm{argmin}_\theta \ \tfrac{1}{2}\Big(\|y_N\|_2^2 + \|\Phi_N\theta\|_2^2 - 2\theta^T\Phi_N^Ty_N + \|y(N+1) - \varphi_{N+1}^T\theta\|_2^2\Big) \\
&= \mathrm{argmin}_\theta \ \tfrac{1}{2}\Big(\|y_N\|_2^2 + \|\Phi_N\theta\|_2^2 - 2\theta^T\Phi_N^T\Phi_N\hat\theta(N) + \|y(N+1) - \varphi_{N+1}^T\theta\|_2^2\Big) \\
&= \mathrm{argmin}_\theta \ \tfrac{1}{2}\Big(\|y_N\|_2^2 - \|\Phi_N\hat\theta(N)\|_2^2 + \|\Phi_N(\theta - \hat\theta(N))\|_2^2 + \|y(N+1) - \varphi_{N+1}^T\theta\|_2^2\Big) \\
&= \mathrm{argmin}_\theta \ \tfrac{1}{2}\|\theta - \hat\theta(N)\|_{Q_N}^2 + \tfrac{1}{2}\|y(N+1) - \varphi_{N+1}^T\theta\|_2^2, \tag{24}
\end{align*}
where we have used the normal equations $\Phi_N^Ty_N = \Phi_N^T\Phi_N\hat\theta(N)$ and ignored the terms $\|y_N\|_2^2 - \|\Phi_N\hat\theta(N)\|_2^2$ as they are constants independent of θ.
Initialization of RLS: Notice that, for $Q_N$ to be invertible, one can collect enough measurements and regressor vectors so that $Q_N$ becomes invertible and then start the RLS algorithm. Otherwise, one can assume $\theta \sim \mathcal N(\hat\theta(0), Q_0^{-1})$ where $Q_0 = \beta I > 0$, $0 < \beta \ll 1$. Then $Q_1$ is invertible and we can apply RLS from the first measurement.
Growing $Q_N$: One way to avoid $Q_N \to \infty$ as N increases is to downweight past information using a forgetting factor α. Equation (24) can be rewritten as
\[ \hat\theta(N+1) = \mathrm{argmin}_\theta \ \tfrac{\alpha}{2}\|\theta - \hat\theta(N)\|_{Q_N}^2 + \tfrac{1}{2}\|y(N+1) - \varphi_{N+1}^T\theta\|_2^2. \tag{25} \]
The recursive update becomes
\begin{align*}
Q_{N+1} &= \alpha Q_N + \varphi_{N+1}\varphi_{N+1}^T, \tag{26} \\
\hat\theta(N+1) &= \hat\theta(N) + Q_{N+1}^{-1}\varphi_{N+1}\big(y(N+1) - \varphi_{N+1}^T\hat\theta(N)\big). \tag{27}
\end{align*}
Note that $Q_N = \sum_{i=1}^{N}\varphi_i\varphi_i^T$ is called the regressor matrix. This regressor matrix is obtained from data and needs to be invertible. If the regressor matrix is invertible, the data is said to be persistently exciting. Since this matrix is symmetric and positive (semi)definite, one can associate with it an ellipsoid formed by the eigenvectors of $Q_N$. Geometrically, given N (normalized) regressor vectors, one can fit an ellipsoid containing these vectors. As the new regressor $\varphi_{N+1}$ arrives, $Q_N$ gets updated to $Q_{N+1}$ and the corresponding ellipsoid gets updated as well. If the new regressor is along the dominant eigenvectors of $Q_N$, then, due to the presence of $Q_{N+1}^{-1}$ in the RLS update, the correction term is small. On the other hand, if the regressor is along the least dominant eigenvectors of $Q_N$, the correction is significant.
Goodness of fit: Apart from the covariance matrix, another measure of the goodness of fit of an estimator is the $R^2$ value, defined as
\[ R^2 = 1 - \frac{\|y_N - \Phi_N\hat\theta\|_2^2}{\|y_N\|_2^2} = \frac{\|y_N\|_2^2 - \|\varepsilon_N\|_2^2}{\|y_N\|_2^2} = \frac{\|\hat y_N\|_2^2}{\|y_N\|_2^2}. \tag{28} \]
Recall that $y_N = \hat y_N + \varepsilon_N$ where $\hat y_N = \Phi_N\hat\theta$ and $\hat\theta$ is the LS minimizer of $\|y_N - \Phi_N\theta\|$. From the first order optimality conditions,
\[ \Phi_N^T\Phi_N\hat\theta - \Phi_N^Ty_N = 0 \;\Rightarrow\; \Phi_N^T(\Phi_N\hat\theta - y_N) = 0 \;\Rightarrow\; \Phi_N^T\varepsilon_N = 0 \;\Rightarrow\; \hat\theta_{LS}^T\Phi_N^T\varepsilon_N = 0 \;\Rightarrow\; \hat y_N^T\varepsilon_N = 0, \]
so $\|y_N\|_2^2 = \|\hat y_N\|_2^2 + \|\varepsilon_N\|_2^2$, which justifies the last equality in (28).
Applications
Fourier analysis ([4], Example 4.2): Consider sinusoidal data with Gaussian noise (without any dc term)
\[ x(n) = \sum_{k=1}^{M} a_k\cos\Big(\frac{2\pi kn}{N}\Big) + \sum_{k=1}^{M} b_k\sin\Big(\frac{2\pi kn}{N}\Big) + w(n), \qquad n = 0,\dots,N-1, \]
where w(n) is WGN. Let $\theta = (a_1,\dots,a_M,b_1,\dots,b_M)^T$. Then the above equation can be written as
\[ x = H\theta + w \]
where
\[ H = \begin{pmatrix}
1 & \cdots & 1 & 0 & \cdots & 0 \\
\cos(\tfrac{2\pi}{N}) & \cdots & \cos(\tfrac{2\pi M}{N}) & \sin(\tfrac{2\pi}{N}) & \cdots & \sin(\tfrac{2\pi M}{N}) \\
\vdots & & \vdots & \vdots & & \vdots \\
\cos(\tfrac{2\pi(N-1)}{N}) & \cdots & \cos(\tfrac{2\pi M(N-1)}{N}) & \sin(\tfrac{2\pi(N-1)}{N}) & \cdots & \sin(\tfrac{2\pi M(N-1)}{N})
\end{pmatrix}. \]
Then,
\[ \hat\theta = (H^TH)^{-1}H^Tx, \qquad H^TH = \frac{N}{2}I \;\Rightarrow\; \hat\theta = \frac{2}{N}H^Tx. \]
Notice that $E[\hat\theta] = (H^TH)^{-1}H^TE[x] = (H^TH)^{-1}H^TH\theta = \theta$. The covariance matrix is given by $\mathrm{cov}(\hat\theta) = \sigma^2(H^TH)^{-1} = \frac{2\sigma^2}{N}I$. The weighted LS estimator is
\[ \hat\theta_{wls} = (H^T\Sigma_w^{-1}H)^{-1}H^T\Sigma_w^{-1}x \;\Rightarrow\; E[\hat\theta_{wls}] = (H^T\Sigma_w^{-1}H)^{-1}H^T\Sigma_w^{-1}H\theta = \theta, \]
and with white noise $\Sigma_w = \sigma^2 I$ it reduces to the LS estimator. Thus both estimators have the same covariance. One can also find a regularized LS estimator in this case.
Exercise (trading bias for variance): Add noise with larger magnitude and compare the LS fit and the regularized LS fit for this example on a computer. Check which estimate has more variance. Visually check the graphs to see which is the better fit.
Add noise which is Laplace distributed in the above example and find the Fourier coefficients using $\|\cdot\|_1$ minimization in Matlab.
Consider the Bayesian estimation of the Fourier coefficients. Let $\theta \sim \mathcal N(0,\sigma_\theta^2 I)$. Then, from (17),
\[ \hat\theta_{MAP} = (H^T\Sigma_w^{-1}H + \sigma_\theta^{-2}I)^{-1}H^T\Sigma_w^{-1}x. \]
Consider a regularized version of the above problem with ridge regularization and lasso regularization with λ = 1. Compare the Fourier coefficients obtained with ordinary LS and with $\|\cdot\|_1$ minimization.
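Since the exercise asks for a computer experiment, here is a hedged NumPy sketch of the Fourier-coefficient LS fit (M, N, the true coefficients, and the noise level are arbitrary illustrative choices); the ridge and lasso variants can be added as in the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 128, 3
n = np.arange(N)
k = np.arange(1, M + 1)

# Columns: cos(2*pi*k*n/N) for k=1..M, then sin(2*pi*k*n/N) for k=1..M
H = np.hstack([np.cos(2 * np.pi * np.outer(n, k) / N),
               np.sin(2 * np.pi * np.outer(n, k) / N)])

theta_true = np.array([1.0, 0.0, -0.5, 0.3, 0.8, 0.0])   # (a_1..a_M, b_1..b_M)
x = H @ theta_true + 0.2 * rng.standard_normal(N)

# LS estimate; since H^T H = (N/2) I, it equals (2/N) H^T x
theta_ls = np.linalg.lstsq(H, x, rcond=None)[0]
print(np.allclose(theta_ls, (2.0 / N) * H.T @ x))
print(np.round(theta_ls, 3))
```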
Impulse response identification: Let
\[ y(n) = \sum_{k=0}^{l-1} h(k)u(n-k) + w(n), \qquad n = 0,1,\dots,N-1, \]
where w(n) is WGN and u(n) = 0 for n < 0. Then, $y = H\theta + w$ where $\theta = (h(0),\dots,h(l-1))^T$ and
\[ H = \begin{pmatrix}
u(0) & 0 & \cdots & 0 \\
u(1) & u(0) & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
u(N-1) & u(N-2) & \cdots & u(N-l)
\end{pmatrix}. \]
The mean and the variance of the estimator can be computed as done in the previous example. Repeat the
same exercise given in the previous case.
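A brief NumPy/SciPy sketch of this identification (the true impulse response, input, and noise level are invented for the example); scipy.linalg.toeplitz builds the H matrix above.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(8)
N, l = 200, 4
h_true = np.array([1.0, 0.5, -0.3, 0.1])

u = rng.standard_normal(N)                       # persistently exciting input
# H has first column u(0..N-1) and first row (u(0), 0, ..., 0)
H = toeplitz(u, np.r_[u[0], np.zeros(l - 1)])
y = H @ h_true + 0.05 * rng.standard_normal(N)

h_hat = np.linalg.lstsq(H, y, rcond=None)[0]
print(np.round(h_hat, 3))                        # close to h_true
```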
ARX model: Consider an ARX model
\[ A(q^{-1})y(t) = B(q^{-1})u(t) + e(t), \qquad A(q^{-1}) = 1 + a_1q^{-1} + \cdots + a_nq^{-n}, \quad B(q^{-1}) = b_1q^{-1} + \cdots + b_mq^{-m}, \]
where the unknown parameters are $\theta := (a_1,\dots,a_n,b_1,\dots,b_m)^T$ and the noise variance $\sigma^2$. The model can be written in linear regression form $y(t) = \varphi^T(t)\theta + e(t)$, where
\[ \varphi(t) := \big({-y(t-1)}\ \cdots\ {-y(t-n)}\ \ u(t-1)\ \cdots\ u(t-m)\big)^T \in \mathbb R^d. \tag{31} \]
Therefore,
\[ V_N(\theta) = \frac{1}{N}\sum_{t=t^*}^{N-1}\big(y(t) - \varphi^T(t)\theta\big)^2, \qquad t^* = \max(m,n). \]
Therefore,
\[ \hat\theta_{ls}(N) = \Big(\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)\varphi^T(t)\Big)^{-1}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)y(t). \]
Suppose the actual observation is expressed as $y(t) = \varphi^T(t)\theta_0 + v_0(t)$, where $\theta_0$ is the true parameter. Assuming that
• $\lim_{N\to\infty}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)\varphi^T(t) = E[\varphi(t)\varphi^T(t)]$ is nonsingular, and
• $\lim_{N\to\infty}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)v_0(t) = E[\varphi(t)v_0(t)] = 0$,
we have $\lim_{N\to\infty}\hat\theta_{ls}(N) = \theta_0$.
We now find the covariance of the estimate $\hat\theta_{ls}(N)$. Let
\[ R(N) = \frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)\varphi^T(t), \qquad \tilde\theta := \hat\theta_N - \theta_0. \]
Notice that
\[ \tilde\theta = \hat\theta_N - \theta_0 = R(N)^{-1}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)v_0(t), \]
so that
\[ E[\tilde\theta] = E\Big[R(N)^{-1}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)v_0(t)\Big] = 0 \quad \text{as } N \to \infty. \]
In matrix-vector form,
\[ y_N = \Phi_N\theta_0 + v_N \]
and
\[ \hat\theta_N = (\Phi_N^T\Phi_N)^{-1}\Phi_N^Ty_N = (\Phi_N^T\Phi_N)^{-1}(\Phi_N^T\Phi_N\theta_0 + \Phi_N^Tv_N) = \theta_0 + (\Phi_N^T\Phi_N)^{-1}\Phi_N^Tv_N. \tag{32} \]
Thus, $\tilde\theta = (\Phi_N^T\Phi_N)^{-1}\Phi_N^Tv_N$. Therefore, with $E[v_Nv_N^T] = \lambda I$,
\begin{align*}
P_N := E[\tilde\theta\tilde\theta^T] &= E\big[(\Phi_N^T\Phi_N)^{-1}\Phi_N^Tv_Nv_N^T\Phi_N(\Phi_N^T\Phi_N)^{-1}\big] \\
&= \lambda(\Phi_N^T\Phi_N)^{-1}\Phi_N^T\Phi_N(\Phi_N^T\Phi_N)^{-1} \\
&= \lambda(\Phi_N^T\Phi_N)^{-1} = \frac{\lambda}{N}R(N)^{-1}. \tag{33}
\end{align*}
Thus, the covariance decays at the rate 1/N and the parameters approach the limiting value at rate $1/\sqrt{N}$. Furthermore, the covariance is proportional to the inverse of the signal to noise ratio, where λ is the noise variance and R(N) is proportional to the input power.
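A hedged NumPy sketch of ARX identification by LS (the first-order system, input, and noise level are invented for illustration): simulate the ARX model, build the regressors (31), and check that $\hat\theta$ approaches $\theta_0$ with parameter covariance roughly $\lambda(\Phi_N^T\Phi_N)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(9)
a1, b1, b2 = -0.7, 1.0, 0.5          # true ARX parameters, theta_0 = (a1, b1, b2)
lam = 0.1                            # noise variance
N = 2000

u = rng.standard_normal(N)
e = np.sqrt(lam) * rng.standard_normal(N)
y = np.zeros(N)
for t in range(2, N):
    # A(q^-1) y(t) = B(q^-1) u(t) + e(t)  with  n = 1, m = 2
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + b2 * u[t - 2] + e[t]

# Regressors phi(t) = (-y(t-1), u(t-1), u(t-2)), t = t*, ..., N-1 with t* = 2
Phi = np.column_stack([-y[1:N-1], u[1:N-1], u[0:N-2]])
Y = y[2:N]
theta_hat = np.linalg.lstsq(Phi, Y, rcond=None)[0]
print(theta_hat)                     # close to (a1, b1, b2)

# Parameter covariance estimate lambda * (Phi^T Phi)^{-1}
print(lam * np.linalg.inv(Phi.T @ Phi))
```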
More applications similar to the ones discussed above are curve fitting, surface/manifold fitting and so
on.
Noisy regressors [1]: Consider a resistor estimation problem (with true value $R_0$) using (noisy) measurements v(k) and i(k) of the true voltage $v_0$ and the true current $i_0$, respectively. Let $v(k) = v_0 + n_v(k)$, $i(k) = i_0 + n_i(k)$, where $n_v$ and $n_i$ denote the voltage and current measurement noise at time k. Let $n_v$, $n_i$ be i.i.d., zero mean, with finite variances $\sigma_v^2$, $\sigma_i^2$. The LS problem is
\[ \min_R \ \frac{1}{N}\sum_{k=1}^{N}\big(v(k) - R\,i(k)\big)^2. \]
Let $v_N$, $i_N$ be the vectors of voltages and currents ($\Phi_N = i_N$ in this case). Notice that the regressor $i_N$ is noisy in this case, unlike the previous cases where the regressor matrix Φ was assumed to be noise free. The LS solution is
\[ \hat R(N) = \frac{\frac{1}{N}i_N^Tv_N}{\frac{1}{N}i_N^Ti_N} = \frac{\frac{1}{N}\sum_{k=1}^{N}(v_0 + n_v(k))(i_0 + n_i(k))}{\frac{1}{N}\sum_{k=1}^{N}(i_0 + n_i(k))^2}. \]
As $N\to\infty$, the numerator evaluates to
\[ \lim_{N\to\infty} v_0i_0 + \frac{1}{N}\sum_{k=1}^{N}n_v(k)n_i(k) = v_0i_0 \]
and the denominator evaluates to
\[ \lim_{N\to\infty} i_0^2 + \frac{1}{N}\sum_{k=1}^{N}n_i(k)^2 = i_0^2 + \sigma_i^2, \]
where we have used that the sample mean and the sample variance converge to the true mean and the true variance asymptotically. Thus,
\[ \lim_{N\to\infty}\hat R = \frac{R_0}{1 + \frac{\sigma_i^2}{i_0^2}}. \]
The estimate is biased as a result of the noisy regressors, and the bias depends on the variance of the noise in the regressors.
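A short NumPy check of this asymptotic bias (R₀, i₀, and the noise variances are arbitrary illustrative values): the LS estimate converges to $R_0/(1 + \sigma_i^2/i_0^2)$ rather than to $R_0$.

```python
import numpy as np

rng = np.random.default_rng(10)
R0, v_sigma, i_sigma = 5.0, 0.2, 0.3
i0 = 1.0
v0 = R0 * i0
N = 200_000

i_meas = i0 + i_sigma * rng.standard_normal(N)   # noisy current (the regressor)
v_meas = v0 + v_sigma * rng.standard_normal(N)   # noisy voltage

R_hat = (i_meas @ v_meas) / (i_meas @ i_meas)    # scalar LS solution
print(R_hat)                                     # ~ R0 / (1 + i_sigma**2 / i0**2), not R0
print(R0 / (1 + i_sigma**2 / i0**2))
```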
Instrumental variables
Note that in LS solutions, with $\theta_0$ being the true value and $\hat\theta$ being the LS estimate, from (8),
\[ \hat\theta - \theta_0 = (\Phi^T\Phi)^{-1}\Phi^Ty - (\Phi^T\Phi)^{-1}\Phi^T\Phi\theta_0 = (\Phi^T\Phi)^{-1}\Phi^Tv. \tag{34} \]
As seen in the ARX model, for a consistent (asymptotically unbiased) estimator we need $\lim_{N\to\infty}\frac{1}{N}\sum_{t=t^*}^{N-1}\varphi(t)v(t) = E[\varphi(t)v(t)] = 0$, i.e., the regressors must be uncorrelated with the noise. If this condition is not satisfied, then the estimate is biased. In such cases, one uses the method of instrumental variables.
Suppose y(t) is an $n_y$-dimensional output at time t and θ is an $n_\theta$-dimensional parameter vector. Let $\varphi^T(t) \in \mathbb R^{n_y\times n_\theta}$ and let $Z(t) \in \mathbb R^{n_z\times n_y}$ be such that
\[ \frac{1}{N}\sum_{t=t^*}^{N} Z(t)\big(y(t) - \varphi^T(t)\theta\big) = 0. \tag{35} \]
If $n_z = n_\theta$, then
\[ \hat\theta = \Big(\sum_{t=t^*}^{N} Z(t)\varphi^T(t)\Big)^{-1}\sum_{t=t^*}^{N} Z(t)y(t). \tag{36} \]
For Z(t) = φ(t), we get the LS estimator. If Z(t) are uncorrelated with the noise, then we obtain a consistent
estimator without bias. The instrumental variables method works when the noise is not necessarily white
Gaussian. In the case of ARX models, delayed inputs are chosen as instrumental variables [5].
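A hedged NumPy sketch of the instrumental-variable idea on a first-order ARX-type system with colored (non-white) noise (the system, the noise filter, and the choice of delayed inputs as instruments are illustrative): plain LS is biased here, while the IV estimate (36) is not.

```python
import numpy as np

rng = np.random.default_rng(11)
a1, b1, c = -0.8, 1.0, 0.9
N = 50_000

u = rng.standard_normal(N)
e = 0.3 * rng.standard_normal(N)
v = e.copy()
v[1:] += c * e[:-1]                   # colored noise v(t) = e(t) + c e(t-1)

y = np.zeros(N)
for t in range(1, N):
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + v[t]

# phi(t) = (-y(t-1), u(t-1)), instruments z(t) = (u(t-1), u(t-2)), t = 2..N-1
Phi = np.column_stack([-y[1:N-1], u[1:N-1]])
Z = np.column_stack([u[1:N-1], u[0:N-2]])
Y = y[2:N]

theta_ls = np.linalg.lstsq(Phi, Y, rcond=None)[0]   # biased: v(t) is correlated with y(t-1)
theta_iv = np.linalg.solve(Z.T @ Phi, Z.T @ Y)      # instrumental variable estimate (36)
print("true:", (a1, b1), "LS:", theta_ls, "IV:", theta_iv)
```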
Nonlinear least squares
Let g(θ) be a nonlinear function of θ, with model output x(n) = g(θ)(n), where x(n) is the observed signal. Let
\[ J(\theta) = (x - g(\theta))^T(x - g(\theta)). \tag{37} \]
Then,
\[ \hat\theta = \mathrm{argmin}_\theta J(\theta) \tag{38} \]
gives the solution to the nonlinear LS problem. The minimum can be found by the gradient descent method or the Gauss-Newton method.
Now let $x(n) = g(\theta)(n) + v(n)$ and let $\Sigma_v$ be the noise covariance matrix. Then,
\[ J(\theta) = (x - g(\theta))^T\Sigma_v^{-1}(x - g(\theta)) \tag{39} \]
and
\[ \hat\theta = \mathrm{argmin}_\theta J(\theta) \tag{40} \]
gives the solution to the weighted nonlinear LS problem. One can consider regularized nonlinear LS as well. A particular application involves nonlinear system identification and is discussed in parametric identification.
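A brief sketch of a nonlinear LS fit using scipy.optimize.least_squares (trust-region/Levenberg-Marquardt variants of Gauss-Newton); the exponential-decay model $g(\theta)(n) = \theta_1 e^{-\theta_2 n}$ and the data are invented for illustration, and only a local minimum is guaranteed.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(12)
n = np.arange(0, 50)
theta_true = np.array([2.0, 0.1])
x = theta_true[0] * np.exp(-theta_true[1] * n) + 0.05 * rng.standard_normal(n.size)

def residuals(theta):
    # residual vector x - g(theta); least_squares minimizes its squared norm
    return x - theta[0] * np.exp(-theta[1] * n)

sol = least_squares(residuals, x0=np.array([1.0, 1.0]))   # the starting guess matters: only a local minimum
print(sol.x)                                              # close to theta_true
```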
Let
\[ R(\bar\theta) = \Sigma^{-1/2}\bar R(\bar\theta). \tag{44} \]
Then, the parameter and state trajectory estimates can be found by solving the following nonlinear LS problem
\[ \bar\theta^* = \mathrm{argmin}_{\bar\theta} \ \tfrac{1}{2}\|R(\bar\theta)\|_2^2, \tag{45} \]
which can be solved using the Gauss-Newton algorithm or using lsqnonlin in MATLAB. One may not get a global minimum in this case.
A similar problem also shows up in maximum hands-off control problems in the control literature. An alternate form is to augment the constraints into the cost function.
Here $\|X\|_*$ is the nuclear norm of X, i.e., the sum of its nonzero singular values. This is the well known matrix completion problem. In matrix completion, we want to minimize the rank of the unknown matrix subject to some equality constraints. However, this is a combinatorial NP-hard problem. Its convex relaxation is given by minimizing the nuclear norm of the unknown matrix subject to the equality constraints.
An augmented form of (50) will be
References
[1] M. Diehl, Lecture Notes on Modeling and System Identification, lecture notes and video lectures, 2020.
[2] G. Pillonetto, T. Chen, A. Chiuso, G. De Nicolao, L. Ljung, Regularized System Identification: Learning Dynamic Models from Data, Springer, 2022.
[3] L. Ljung, System Identification: Theory for the User, PHI, 2nd edition, 1999.
[6] G. Bottegal, Lecture slides on Nonparametric kernel-based System Identification, DISC course, 2020.