Lecture 24
1 Outline
1. Lasso
2. ℓ1–ℓ0 equivalence
2 Model selection criteria
Classical model selection criteria minimize the ℓ0-penalized objective ‖y − Xβ̂‖² + λ²σ²‖β̂‖0 with different choices of the multiplier λ²:
• AIC: λ² = 2
• BIC: λ² = log n
• RIC: λ² ≈ 2 log p
The critical problem shared by these methods is that finding the minimum requires an exhaustive
search over all possible models. This search is completely intractable for even moderate values of
p.
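For concreteness, here is a minimal NumPy sketch of what the exhaustive search entails (best_subset is an illustrative name, and the penalized score ‖y − XS b‖² + λ²σ²|S| is the criterion assumed above): the loop visits all 2^p subsets, which already exceeds a billion models for p = 30.

```python
import itertools
import numpy as np

def best_subset(X, y, lam2, sigma2):
    """Brute-force l0-penalized model selection: minimize
    ||y - X_S b||^2 + lam2 * sigma2 * |S| over all subsets S of the p columns."""
    n, p = X.shape
    best_score, best_support = np.sum(y ** 2), ()         # score of the empty model
    for size in range(1, p + 1):
        for S in itertools.combinations(range(p), size):  # loops over all nonempty subsets
            cols = list(S)
            b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ b) ** 2)
            score = rss + lam2 * sigma2 * size
            if score < best_score:
                best_score, best_support = score, S
    return best_support, best_score
```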
3 Lasso
The Lasso replaces the combinatorial ℓ0 penalty with an ℓ1 penalty:
β̂ = argminβ ½ ‖y − Xβ‖² + λσ ‖β‖1 .
This is a quadratic program which can be solved efficiently.
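For illustration, here is a minimal proximal-gradient (ISTA) sketch of this optimization, assuming NumPy; lasso_ista and soft_threshold are illustrative names, and lam plays the role of λσ above. In practice one would use a dedicated solver (e.g., coordinate descent).

```python
import numpy as np

def soft_threshold(x, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by proximal gradient."""
    L = np.linalg.norm(X, 2) ** 2              # Lipschitz constant of the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)            # gradient of 0.5 * ||y - X beta||^2
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```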
ℓ1 regularization has a long history in statistics and applied science. The idea appeared in early work on reflection seismology, where one tries to image the earth using sound waves. From the reflected waves, one receives a time series of signals which is noisy and low-frequency. In order to detect where the “jumps” in the earth's structure are, people came up with the idea of using the ℓ1 norm.
• Logan (1956)
• Tibshirani (1996)
3.1 Why ℓ1?
Consider the noiseless setting in which we observe
y = Xβ,
an underdetermined system of equations in the case where we have more variables than observations, i.e., p > n.
In this setting, there is no unique solution to the least squares problem. However, by assuming that
β is sparse, we can reduce our search to sparse solutions, i.e., we wish to find the sparsest solution
to y = Xβ by solving
min ‖β̂‖0 subject to y = Xβ̂ .
Under some conditions, the minimizer is unique, i.e., β̂ = β.
Even so, this doesn't address the issue of computational tractability: the optimization problem as stated remains intractable. This motivates the convex relaxation
min ‖β̂‖1 subject to y = Xβ̂ .
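This relaxed problem can be recast as a linear program by splitting β into its positive and negative parts. A minimal sketch assuming SciPy (basis_pursuit is an illustrative name, not from the notes):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||b||_1 subject to X b = y. Writing b = u - v with u, v >= 0,
    the objective becomes 1'(u + v) and the constraint [X, -X][u; v] = y."""
    n, p = X.shape
    res = linprog(c=np.ones(2 * p),
                  A_eq=np.hstack([X, -X]), b_eq=y,
                  bounds=[(0, None)] * (2 * p), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v
```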
Figure 1: Geometry of the ℓ1 optimization problem in the noiseless setting. To understand the picture, suppose β̂ is a feasible point (Xβ̂ = Xβ = y). Consider ‘descent’ vectors u such that ‖β̂ + u‖1 < ‖β̂‖1 . It is easy to see that β̂ is the solution to the ℓ1 optimization problem if Xu ≠ 0 for all descent vectors u, i.e., if all vectors with smaller ℓ1 norm are non-feasible. (a) The solution here is the true β, because all other feasible vectors have larger ℓ1 norm. (b) Here, β̂ ≠ β. Starting at β, we can consider the descent direction u as shown. The point β + u remains feasible and has ℓ1 norm strictly smaller than that of β.
Rigorous results pertaining to the formal equivalence between the ℓ0 and ℓ1 problems can be found in
• Donoho (2004)
[Illustration: Why an ℓ1 objective? The ℓ1 norm is the closest convex approximation to the ℓ0 quasi-norm. Panels: (a) the ℓ0 quasi-norm, (b) the ℓ1 norm, (c) the ℓ2 norm.]
ℓ1 is the tightest convex relaxation of ℓ0 . One way of seeing this is to look at the model
µ = β1 X1 + β2 X2 + · · · + βp Xp ,
where β is sparse. The ‘building blocks’ of sparse solutions are vectors of the form
(0, . . . , 0, ±1, 0, . . . , 0),
and the ℓ1 ball is the smallest convex body that contains these building blocks. It is the convex hull of these points.
Another interpretation is this: the ℓ1 norm is the best convex lower bound to the cardinality functional over the unit cube. That is to say, if we search for a convex function f such that
f(β) ≤ ‖β‖0 for all β ∈ [−1, 1]^p ,
then the tightest lower bound is the ℓ1 norm; see Figure 2.
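To check that ℓ1 is indeed such a lower bound, note that for any β ∈ [−1, 1]^p with support S = {i1, . . . , ik}, each |βi| ≤ 1, so
‖β‖1 = |βi1| + · · · + |βik| ≤ k = ‖β‖0 ,
with equality when every nonzero entry of β equals ±1, i.e., at the ‘building blocks’ above.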
As we will illustrate, we cannot hope to achieve comparable optimality results for the Lasso.
Case study 1: Consider the design X = [In x], where the extra column is x = (ε, . . . , ε, 1) for a small ε > 0; the last row of X is thus (0, 0, . . . , 1, 1). In this case, p = n + 1 so the system is underdetermined. Suppose that µ is constructed with
β = (1/ε) · (0, . . . , 0, −1, 1) so that µ = Xβ = (1, . . . , 1, 0).
• The ℓ1 solution is
β̂ = (1, . . . , 1, 0, 0) ≠ β
in the case where ε < 2/(n − 1). So ℓ1 does not recover the right solution, as the numerical sketch below confirms.
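A quick numerical check, assuming NumPy and reusing the basis_pursuit sketch from earlier (the values of n and ε are chosen purely for illustration):

```python
import numpy as np

n, eps = 20, 0.05                         # eps < 2/(n - 1): l1 prefers the dense fit
x_extra = np.full(n, eps)
x_extra[-1] = 1.0                         # extra column (eps, ..., eps, 1)
X = np.hstack([np.eye(n), x_extra[:, None]])

beta = np.zeros(n + 1)
beta[n - 1], beta[n] = -1 / eps, 1 / eps  # true beta = (1/eps) * (0, ..., 0, -1, 1)
y = X @ beta                              # equals (1, ..., 1, 0)

beta_hat = basis_pursuit(X, y)            # from the LP sketch above
print(np.round(beta_hat, 3))              # ~ (1, ..., 1, 0, 0): not the true beta
```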
Now consider the noisy setting, where we observe
yi = 1(i ≠ n) + 0.1 zi , i = 1, . . . , n,
with the zi i.i.d. standard normal (so σ = 0.1).
It can be shown that for all reasonable values of λ, the Lasso solution is given by soft-thresholding y, i.e.,
β̂i = (yi − λσ)+ , i = 1, . . . , n − 1, and β̂i = 0, i = n, n + 1,
and that
µ̂i = (yi − λσ)+ , i = 1, . . . , n − 1, and µ̂n = 0.
Consequently, the lasso has a squared bias of roughly (n − 1)λ²σ² and a variance of (n − 1)σ². We would be better off just plugging in µ̂ = y and paying only the variance.
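A small simulation sketch of this calculation, assuming NumPy; the particular value λ = √(2 log p) is an illustrative RIC-style choice, not fixed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1000, 0.1
lam = np.sqrt(2 * np.log(n + 1))             # illustrative RIC-style lambda

mu = np.ones(n)
mu[-1] = 0.0                                 # true mean: mu_i = 1(i != n)
y = mu + sigma * rng.standard_normal(n)

mu_hat = np.maximum(y - lam * sigma, 0.0)    # soft-thresholding formula above
mu_hat[-1] = 0.0

mse = np.sum((mu_hat - mu) ** 2)
approx = (n - 1) * (lam * sigma) ** 2 + (n - 1) * sigma ** 2   # bias^2 + variance
print(mse, approx, 2 * sigma ** 2)           # lasso MSE vs. the ideal ~ 2 sigma^2
```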
The empirical results in the noisy setting are shown in Figure 3.
[Figure 3: two panels, “Lasso estimated mean response” (left) and “Ideal estimated mean response” (right), each plotting the true mean and the estimated mean against the coordinate index 1, . . . , 1000.]
Figure 3: Graphical representation of the example above with n = 1,000, p = 1,001 and σ = 0.1. The lasso estimate fits Xβ̂ = µ̂ = (y − 0.1λ)+ , which results in a large variance and a large bias. The ideal estimate uses only two predictors. The ratio between the lasso MSE and the ideal MSE is over 6,500 ≈ 6.5n ≈ 6.5p.
Summary: This is an instance where the ℓ1 solution is irreparably worse than the ℓ0 solution. Indeed, with ℓ0 we achieve
E‖Xβ̂ − Xβ‖² ≈ 2σ²,
while for the Lasso (no matter the value of λ),
E‖Xβ̂ − Xβ‖² ≳ nσ².
Take-away message: Even in the noiseless case, correlation between the predictors causes severe problems for the Lasso.
Next we will show another case where the Lasso does not work well even when the correlation between the columns of X is very low.
Case study 2: X = [In F2:n ], where F2:n is the discrete cosine transform matrix excluding the first column (the DC component). Specifically, F2:n = [ϕ1 ϕ2 · · · ϕn−1 ], i.e., its (t, k) entry is ϕk(t), where
ϕ2k−1(t) = √(2/n) cos(2πkt/n), k = 1, 2, . . . , n/2 − 1,
ϕ2k(t) = √(2/n) sin(2πkt/n), k = 1, 2, . . . , n/2 − 1,
ϕn−1(t) = (−1)^t /√n.
In this case, p = 2n − 1. The maximum correlation between columns of X is µ(X) = √(2/n), which shows that the columns of X are extremely incoherent. Moreover, ‖x‖2 ≤ 2.
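A sketch of this construction together with a numerical check of the coherence claim, assuming NumPy (dct_design is an illustrative name):

```python
import numpy as np

def dct_design(n):
    """Build X = [I_n  F_{2:n}] from the cosine/sine system defined above (n even)."""
    t = np.arange(1, n + 1)
    cols = []
    for k in range(1, n // 2):                                         # k = 1, ..., n/2 - 1
        cols.append(np.sqrt(2.0 / n) * np.cos(2 * np.pi * k * t / n))  # phi_{2k-1}
        cols.append(np.sqrt(2.0 / n) * np.sin(2 * np.pi * k * t / n))  # phi_{2k}
    cols.append((-1.0) ** t / np.sqrt(n))                              # phi_{n-1}
    return np.hstack([np.eye(n), np.column_stack(cols)])

X = dct_design(256)
gram = np.abs(X.T @ X) - np.eye(X.shape[1])   # off-diagonal inner products
print(X.shape, gram.max(), np.sqrt(2 / 256))  # p = 511 and coherence ~ sqrt(2/n)
```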
Then we can construct f = Xβ, each coordinate of which equals 1, from a sparse β (shown in Figure 4). In the noisy setting, the noisy observation Xβ + z, the Lasso estimate β̂, and the fit Xβ̂ are shown in Figure 5.
Figure 5: Graphical representation of the results in case study 2 with n = 256, p = 511. The lasso
estimate β̂ has far more non-zero entries than β and results in a large MSE.
The Lasso MSE, ‖f̂ − f‖² ∼ nσ²(1 + λ²), is horrible. The ℓ0 norm of β̂, |supp(β̂)| = n, is much larger than that of β, |supp(β)| = (3/2)√n.