Robust Regression and Lasso
Huan Xu
Department of Electrical and Computer Engineering
McGill University
Montreal, QC Canada
[email protected]
Constantine Caramanis
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, Texas
[email protected]
Shie Mannor
Department of Electrical and Computer Engineering
McGill University
Montreal, QC Canada
[email protected]
Abstract
1 Introduction
In this paper we consider linear regression problems with least-square error. The problem is to find
a vector $x$ so that the $\ell_2$ norm of the residual $b - Ax$ is minimized, for a given matrix $A \in \mathbb{R}^{n\times m}$
and vector $b \in \mathbb{R}^n$. From a learning/regression perspective, each row of $A$ can be regarded as a
training sample, and the corresponding element of b as the target value of this observed sample.
Each column of A corresponds to a feature, and the objective is to find a set of weights so that the
weighted sum of the feature values approximates the target value.
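As a concrete illustration of this setup (a minimal sketch with synthetic data, not taken from the paper), the unregularized problem is ordinary least squares:

```python
import numpy as np

# Toy instance: rows of A are training samples, columns are features,
# and b holds the corresponding target values.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))   # 10 observations, 3 features
b = rng.standard_normal(10)

# Ordinary least squares: find x minimizing the l2 norm of the residual b - Ax.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(b - A @ x_ls))
```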
It is well known that minimizing the least squared error can lead to sensitive solutions [1, 2]. Many
regularization methods have been proposed to decrease this sensitivity. Among them, Tikhonov
regularization [3] and Lasso [4, 5] are two widely known and cited algorithms. These methods
minimize a weighted sum of the residual norm and a certain regularization term, $\|x\|_2$ for Tikhonov
regularization and $\|x\|_1$ for Lasso. In addition to providing regularity, Lasso is also known for
the tendency to select sparse solutions. Recently this has attracted much attention for its ability
to reconstruct sparse solutions when sampling occurs far below the Nyquist rate, and also for its
ability to recover the sparsity pattern exactly with probability one, asymptotically as the number of
observations increases (there is an extensive literature on this subject, and we refer the reader to
[6, 7, 8, 9, 10] and references therein). In many of these approaches, the choice of regularization
parameters often has no fundamental connection to an underlying noise model [2].
In [11], the authors propose an alternative approach to reducing sensitivity of linear regression, by
considering a robust version of the regression problem: they minimize the worst-case residual for
the observations under some unknown but bounded disturbances. They show that their robust least
squares formulation is equivalent to $\ell_2$-regularized least squares, and they explore computational
aspects of the problem. In that paper, and in most of the subsequent research in this area and the
more general area of Robust Optimization (see [12, 13] and references therein) the disturbance is
taken to be either row-wise and uncorrelated [14], or given by bounding the Frobenius norm of the
disturbance matrix [11].
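For comparison with the feature-wise case studied below, the equivalence of [11] referred to above can be restated (in a simplified form with disturbance on $A$ only, and with $\rho$ denoting the Frobenius-norm bound) as
$$\min_{x\in\mathbb{R}^m}\;\max_{\|\Delta A\|_F\le\rho}\;\|b-(A+\Delta A)x\|_2=\min_{x\in\mathbb{R}^m}\Big\{\|b-Ax\|_2+\rho\|x\|_2\Big\}.$$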
In this paper we investigate the robust regression problem under more general uncertainty sets,
focusing in particular on the case where the uncertainty set is defined by feature-wise constraints,
and also the case where features are meaningfully correlated. This is of interest when values of
features are obtained with some noisy pre-processing steps, and the magnitudes of such noises are
known or bounded. We prove that all our formulations are computationally tractable. Unlike much
of the previous literature, we provide a focus on structural properties of the robust solution. In
addition to giving new formulations, and new properties of the solutions to these robust problems,
we focus on the inherent importance of robustness, and its ability to prove from scratch important
properties such as sparseness, and asymptotic consistency of Lasso in the statistical learning context.
In particular, our main contributions in this paper are as follows.
• We formulate the robust regression problem with feature-wise independent disturbances,
and show that this formulation is equivalent to a least-square problem with a weighted $\ell_1$
norm regularization term. Hence, we provide an interpretation for Lasso from a robustness
perspective. This can be helpful in choosing the regularization parameter. We generalize
the robust regression formulation to loss functions given by an arbitrary norm, and uncer-
tainty sets that allow correlation between disturbances of different features.
• We investigate the sparsity properties for the robust regression problem with feature-wise
independent disturbances, showing that such formulations encourage sparsity. We thus eas-
ily recover standard sparsity results for Lasso using a robustness argument. This also im-
plies a fundamental connection between the feature-wise independence of the disturbance
and the sparsity.
• Next, we relate Lasso to kernel density estimation. This allows us to re-prove consistency
in a statistical learning setup, using the new robustness tools and formulation we introduce.
Notation. We use capital letters to represent matrices, and boldface letters to represent column vectors. For a vector $\mathbf{z}$, we let $z_i$ denote the $i$th element. Throughout the paper, $\mathbf{a}_i$ and $\mathbf{r}_j^\top$ denote the $i$th column and the $j$th row of the observation matrix $A$, respectively; $a_{ij}$ is the $ij$th element of $A$, hence it is the $j$th element of $\mathbf{r}_i$, and the $i$th element of $\mathbf{a}_j$. For a convex function $f(\cdot)$, $\partial f(\mathbf{z})$ represents any of its sub-gradients evaluated at $\mathbf{z}$.
2.1 Formulation
Robust linear regression considers the case that the observed matrix A is corrupted by some distur-
bance. We seek the optimal weight for the uncorrupted (yet unknown) sample matrix. We consider
the following min-max formulation:
$$\text{Robust Linear Regression:}\qquad \min_{x\in\mathbb{R}^m}\;\max_{\Delta A\in U}\;\|b-(A+\Delta A)x\|_2. \qquad (1)$$
Here, U is the set of admissible disturbances of the matrix A. In this section, we consider the specific
setup where the disturbance is feature-wise uncorrelated, and norm-bounded for each feature:
$$U \triangleq \Big\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\;\Big|\;\|\boldsymbol{\delta}_i\|_2\le c_i,\;\; i=1,\cdots,m\Big\}. \qquad (2)$$
Theorem 1. The robust regression problem (1) with uncertainty set (2) is equivalent to the regularized regression problem
$$\min_{x\in\mathbb{R}^m}\Big\{\|b-Ax\|_2+\sum_{i=1}^m c_i|x_i|\Big\}. \qquad (3)$$
Proof. We defer the full details to [15], and give only an outline of the proof here. Showing that the
robust regression is a lower bound for the regularized regression follows from the standard triangle
inequality. Conversely, one can take the worst-case noise to be $\boldsymbol{\delta}_i^* \triangleq -c_i\,\mathrm{sgn}(x_i^*)\,\mathbf{u}$, where $\mathbf{u}$ is given by
$$\mathbf{u} \triangleq \begin{cases}\dfrac{b-Ax^*}{\|b-Ax^*\|_2}, & \text{if } Ax^*\ne b,\\[4pt] \text{any vector with unit } \ell_2 \text{ norm}, & \text{otherwise;}\end{cases}$$
from which the result follows after some algebra.
If we take $c_i = c$ for all $i$ and normalize the columns $\mathbf{a}_i$, Problem (3) is the well-known Lasso [4, 5].
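To make the equivalence of Theorem 1 concrete, the following sketch (illustrative only; random data, not from the paper) evaluates the worst-case residual using the disturbance constructed in the proof and compares it with the regularized objective of Problem (3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 5
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)          # an arbitrary candidate solution
c = 0.3 * np.ones(m)                # feature-wise disturbance budgets c_i

# Regularized objective of Problem (3): ||b - Ax||_2 + sum_i c_i |x_i|.
reg_obj = np.linalg.norm(b - A @ x) + c @ np.abs(x)

# Worst-case disturbance from the proof: delta_i = -c_i * sgn(x_i) * u,
# where u is the unit vector along the residual b - Ax.
u = (b - A @ x) / np.linalg.norm(b - A @ x)
Delta = -np.outer(u, c * np.sign(x))
worst_case = np.linalg.norm(b - (A + Delta) @ x)

print(reg_obj, worst_case)          # the two values agree up to rounding
```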
It is possible to generalize this result to the case where the $\ell_2$-norm is replaced by an arbitrary norm,
and where the uncertainty is correlated from feature to feature. For space considerations, we refer
to the full version ([15]), and simply state the main results here.
Theorem 2. Let $\|\cdot\|_a$ denote an arbitrary norm. Then the robust regression problem
$$\min_{x\in\mathbb{R}^m}\;\max_{\Delta A\in U_a}\;\|b-(A+\Delta A)x\|_a;\qquad U_a \triangleq \Big\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\;\Big|\;\|\boldsymbol{\delta}_i\|_a\le c_i,\;\; i=1,\cdots,m\Big\};$$
is equivalent to the regularized regression problem $\min_{x\in\mathbb{R}^m}\big\{\|b-Ax\|_a+\sum_{i=1}^m c_i|x_i|\big\}$.
Using feature-wise uncorrelated disturbance may lead to overly conservative results. We relax this,
allowing the disturbances of different features to be correlated. Consider the following uncertainty
set:
$$U' \triangleq \Big\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\;\Big|\;f_j\big(\|\boldsymbol{\delta}_1\|_a,\cdots,\|\boldsymbol{\delta}_m\|_a\big)\le 0;\;\; j=1,\cdots,k\Big\},$$
where fj (·) are convex functions. Notice that both k and fj can be arbitrary, hence this is a very
general formulation and provides us with significant flexibility in designing uncertainty sets and
equivalently new regression algorithms. The following theorem converts this formulation to a con-
vex and tractable optimization problem.
Theorem 3. Assume that the set $Z \triangleq \{z\in\mathbb{R}^m \mid f_j(z)\le 0,\; j=1,\cdots,k;\; z\ge 0\}$ has non-empty relative interior. The robust regression problem
$$\min_{x\in\mathbb{R}^m}\;\max_{\Delta A\in U'}\;\|b-(A+\Delta A)x\|_a$$
is equivalent to the following regularized regression problem:
$$\min_{\lambda\in\mathbb{R}^k_+,\,\kappa\in\mathbb{R}^m_+,\,x\in\mathbb{R}^m}\Big\{\|b-Ax\|_a+v(\lambda,\kappa,x)\Big\};\qquad\text{where } v(\lambda,\kappa,x)\triangleq\max_{c\in\mathbb{R}^m}\Big[(\kappa+|x|)^\top c-\sum_{j=1}^k\lambda_j f_j(c)\Big]. \qquad (4)$$
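As a sanity check on formulation (4) (a worked instance we add for illustration, not an example from the paper), take $k=1$ and the single budget constraint $f_1(c)=\sum_{i=1}^m c_i-l$. Then
$$v(\lambda,\kappa,x)=\max_{c\in\mathbb{R}^m}\Big[\sum_{i=1}^m\big(\kappa_i+|x_i|-\lambda\big)c_i+\lambda l\Big]=\begin{cases}\lambda l,&\text{if }\kappa_i+|x_i|=\lambda\;\;\forall i,\\ +\infty,&\text{otherwise,}\end{cases}$$
and minimizing over $\lambda\ge 0$, $\kappa\ge 0$ forces $\lambda\ge\max_i|x_i|$, so the regularizer reduces to $l\|x\|_\infty$. This agrees with Example 1 below with $\|\cdot\|_s$ the $\ell_1$ norm, whose dual is the $\ell_\infty$ norm.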
Example 1. Suppose
$$U' = \Big\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\;\Big|\;\big\|\big(\|\boldsymbol{\delta}_1\|_a,\cdots,\|\boldsymbol{\delta}_m\|_a\big)\big\|_s\le l\Big\}$$
for a symmetric norm $\|\cdot\|_s$; then the resulting regularized regression problem is
$$\min_{x\in\mathbb{R}^m}\Big\{\|b-Ax\|_a+l\|x\|_s^*\Big\};\qquad\text{where } \|\cdot\|_s^* \text{ is the dual norm of } \|\cdot\|_s.$$
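Two further specializations (illustrative; they follow directly from standard norm duality): taking $\|\cdot\|_s$ to be the $\ell_\infty$ norm, so that every column disturbance is bounded by $l$, gives the uniform-weight Lasso objective
$$\min_{x\in\mathbb{R}^m}\Big\{\|b-Ax\|_a+l\|x\|_1\Big\},$$
while taking $\|\cdot\|_s$ to be the $\ell_2$ norm gives the $\ell_2$-regularized problem
$$\min_{x\in\mathbb{R}^m}\Big\{\|b-Ax\|_a+l\|x\|_2\Big\}.$$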
The robust regression formulation (1) considers disturbances that are bounded in a set, while in
practice, often the disturbance is a random variable with unbounded support. In such cases, it is not
possible to simply use an uncertainty set that includes all admissible disturbances, and we need to
construct a meaningful U based on probabilistic information. In the full version [15] we consider
computationally efficient ways to use chance constraints to construct uncertainty sets.
3 Sparsity
In this section, we investigate the sparsity properties of robust regression (1), and equivalently Lasso.
Lasso's ability to recover sparse solutions has been extensively discussed (cf. [6, 7, 8, 9]), and the existing analyses take
one of two approaches. The first approach investigates the problem from a statistical perspective.
That is, it assumes that the observations are generated by a (sparse) linear combination of the fea-
tures, and investigates the asymptotic or probabilistic conditions required for Lasso to correctly
recover the generative model. The second approach treats the problem from an optimization per-
spective, and studies under what conditions a pair (A, b) defines a problem with sparse solutions
(e.g., [16]).
We follow the second approach and do not assume a generative model. Instead, we consider the
conditions that lead to a feature receiving zero weight. In particular, we show that (i) as a direct
result of the feature-wise independence of the uncertainty set, a feature that was originally assigned zero weight still receives zero weight after a slight change of its value (Theorem 4); (ii) using Theorem 4, we show
that “nearly” orthogonal features get zero weight (Corollary 1); and (iii) “nearly” linearly dependent
features get zero weight (Theorem 5). Substantial research regarding sparsity properties of Lasso
can be found in the literature (cf [6, 7, 8, 9, 17, 18, 19, 20] and many others). In particular, similar
results as in point (ii), that rely on an incoherence property, have been established in, e.g., [16], and
are used as standard tools in investigating sparsity of Lasso from a statistical perspective. However,
a proof exploiting robustness and properties of the uncertainty is novel. Indeed, such a proof shows
a fundamental connection between robustness and sparsity, and implies that robustifying w.r.t. a
feature-wise independent uncertainty set might be a plausible way to achieve sparsity for other
problems.
Theorem 4. Given $(\tilde A, b)$, let $x^*$ be an optimal solution of the robust regression problem
$$\min_{x\in\mathbb{R}^m}\;\max_{\Delta A\in U}\;\|b-(\tilde A+\Delta A)x\|_2.$$
Let $I\subseteq\{1,\cdots,m\}$ be such that $x_i^*=0$ for all $i\in I$, and let
$$\tilde U \triangleq \Big\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\;\Big|\;\|\boldsymbol{\delta}_i\|_2\le c_i+\ell_i,\; i\in I;\;\;\|\boldsymbol{\delta}_j\|_2\le c_j,\; j\notin I\Big\}.$$
Then $x^*$ is also an optimal solution of
$$\min_{x\in\mathbb{R}^m}\;\max_{\Delta A\in\tilde U}\;\|b-(A+\Delta A)x\|_2$$
for any $A$ that satisfies $\|\mathbf{a}_i-\tilde{\mathbf{a}}_i\|\le\ell_i$ for $i\in I$, and $\mathbf{a}_j=\tilde{\mathbf{a}}_j$ for $j\notin I$.
Proof. Notice that for $i\in I$, $x_i^*=0$, hence the $i$th column of both $A$ and $\Delta A$ has no effect on the residual. We have
$$\max_{\Delta A\in\tilde U}\big\|b-(A+\Delta A)x^*\big\|_2=\max_{\Delta A\in U}\big\|b-(A+\Delta A)x^*\big\|_2=\max_{\Delta A\in U}\big\|b-(\tilde A+\Delta A)x^*\big\|_2.$$
For $i\in I$, $\|\mathbf{a}_i-\tilde{\mathbf{a}}_i\|\le\ell_i$, and $\mathbf{a}_j=\tilde{\mathbf{a}}_j$ for $j\notin I$. Thus $\big\{\tilde A+\Delta A\,\big|\,\Delta A\in U\big\}\subseteq\big\{A+\Delta A\,\big|\,\Delta A\in\tilde U\big\}$. Therefore, for any fixed $x'$, the following holds:
$$\max_{\Delta A\in U}\big\|b-(\tilde A+\Delta A)x'\big\|_2\le\max_{\Delta A\in\tilde U}\big\|b-(A+\Delta A)x'\big\|_2.$$
By definition of $x^*$,
$$\max_{\Delta A\in U}\big\|b-(\tilde A+\Delta A)x^*\big\|_2\le\max_{\Delta A\in U}\big\|b-(\tilde A+\Delta A)x'\big\|_2.$$
Therefore we have
$$\max_{\Delta A\in\tilde U}\big\|b-(A+\Delta A)x^*\big\|_2\le\max_{\Delta A\in\tilde U}\big\|b-(A+\Delta A)x'\big\|_2,$$
which establishes that $x^*$ is optimal for the problem with uncertainty set $\tilde U$.
Theorem 4 is established using the robustness argument, and is a direct result of the feature-wise independence of the uncertainty set. It explains why Lasso tends to assign zero weight to irrelevant features. Consider a generative model $b=\sum_{i\in I}w_i\mathbf{a}_i+\tilde\xi$, where $I\subseteq\{1,\cdots,m\}$ and $\tilde\xi$ is a random variable, i.e., $b$ is generated by features belonging to $I$. In this case, for a feature $i'\notin I$, Lasso would assign zero weight as long as there exists a perturbed value of this feature such that the optimal regression assigns it zero weight. This is also shown in the next corollary, in which we apply Theorem 4 to show that the problem has a sparse solution as long as an incoherence-type property is satisfied (this result is more in line with the traditional sparsity results).
Corollary 1. Suppose that $c_i=c$ for all $i$. If there exists $I\subset\{1,\cdots,m\}$ such that for all $v\in\mathrm{span}\big(\{\mathbf{a}_i,\;i\in I\}\cup\{b\}\big)$ with $\|v\|=1$, we have $v^\top\mathbf{a}_j\le c$ for all $j\notin I$, then any optimal solution $x^*$ satisfies $x_j^*=0$, $\forall j\notin I$.
Proof. Let $\hat{\mathbf{a}}_i\triangleq\mathbf{a}_i$ for $i\in I$ and, for $j\notin I$, let $\hat{\mathbf{a}}_j$ denote the component of $\mathbf{a}_j$ orthogonal to $\mathrm{span}\big(\{\mathbf{a}_i,\;i\in I\}\cup\{b\}\big)$; write $\hat A\triangleq(\hat{\mathbf{a}}_1,\cdots,\hat{\mathbf{a}}_m)$. The robust regression problem with observation matrix $\hat A$ admits an optimal solution $\hat x^*$ such that $\hat x_j^*=0$ for all $j\notin I$. This is because the $\hat{\mathbf{a}}_j$ are orthogonal to the span of $\{\hat{\mathbf{a}}_i,\;i\in I\}\cup\{b\}$; hence, for any given $\hat x$, by changing $\hat x_j$ to zero for all $j\notin I$, the minimized objective does not increase. Since $\|\hat{\mathbf{a}}_j-\mathbf{a}_j\|\le c$ for all $j\notin I$ (and recall that $U=\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,|\,\|\boldsymbol{\delta}_i\|_2\le c,\;\forall i\}$), applying Theorem 4 we establish the corollary.
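As a quick numerical illustration of Corollary 1 (our own sketch, not part of the paper; it assumes the cvxpy package to solve Problem (3) directly), a feature orthogonal to the span of the relevant features and the target receives exactly zero weight:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, c = 50, 0.5

# Two relevant features and a target generated from them.
a1, a2 = rng.standard_normal(n), rng.standard_normal(n)
b = 1.0 * a1 - 2.0 * a2 + 0.1 * rng.standard_normal(n)

# A third feature orthogonal to span{a1, a2, b}: project noise onto the
# orthogonal complement of that span, so v'a3 = 0 <= c for unit v in the span.
Q = np.linalg.qr(np.column_stack([a1, a2, b]))[0]
g = rng.standard_normal(n)
a3 = g - Q @ (Q.T @ g)
a3 /= np.linalg.norm(a3)

# Solve Problem (3) with uniform weights c_i = c.
A = np.column_stack([a1, a2, a3])
x = cp.Variable(3)
cp.Problem(cp.Minimize(cp.norm(b - A @ x, 2) + c * cp.norm1(x))).solve()
print(x.value)   # the weight on a3 is zero (up to solver tolerance)
```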
The next theorem shows that sparsity is achieved when a set of features are “almost” linearly depen-
dent. Again we refer to [15] for the proof.
Theorem 5. Given $I\subseteq\{1,\cdots,m\}$ such that there exists a non-zero vector $(w_i)_{i\in I}$ satisfying
$$\Big\|\sum_{i\in I}w_i\mathbf{a}_i\Big\|_2\le\min_{\sigma_i\in\{-1,+1\}}\Big|\sum_{i\in I}\sigma_i c_i w_i\Big|,$$
there exists an optimal solution $x^*$ such that $x_i^*=0$ for at least one $i\in I$.
Notice that for linearly dependent features, there exists a non-zero $(w_i)_{i\in I}$ such that $\big\|\sum_{i\in I}w_i\mathbf{a}_i\big\|_2=0$, which leads to the following corollary.
Corollary 3. Given $I\subseteq\{1,\cdots,m\}$, let $A_I\triangleq(\mathbf{a}_i)_{i\in I}$ and $t\triangleq\mathrm{rank}(A_I)$. There exists an optimal solution $x^*$ such that $x_I^*\triangleq(x_i^*)_{i\in I}$ has at most $t$ non-zero coefficients.
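For a concrete instance (added for illustration): if two columns coincide, say $\mathbf{a}_1=\mathbf{a}_2$ with $c_1=c_2=c$, then $w=(1,-1)$ gives
$$\Big\|\sum_{i\in\{1,2\}}w_i\mathbf{a}_i\Big\|_2=0\le\min_{\sigma_i\in\{-1,+1\}}\big|\sigma_1 c-\sigma_2 c\big|=0,$$
so Theorem 5 applies and, consistently with Corollary 3 (here $t=1$), there is an optimal solution that places all of the weight shared by the duplicated pair on a single column.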
almost surely.
The full proof and results we develop along the way are deferred to [15], but we provide the main
ideas and outline here. The key to the proof is establishing a connection between robustness and
kernel density estimation.
Step 1: For a given x, we show that the robust regression loss over the training data is equal to the
worst-case expected generalization error. To show this we establish a more general result:
Proposition 1. Given a function $h:\mathbb{R}^{m+1}\to\mathbb{R}$ and Borel sets $Z_1,\cdots,Z_n\subseteq\mathbb{R}^{m+1}$, let
$$\mathcal{P}_n\triangleq\Big\{\mu\in\mathcal{P}\;\Big|\;\forall S\subseteq\{1,\cdots,n\}:\;\mu\Big(\bigcup_{i\in S}Z_i\Big)\ge|S|/n\Big\}.$$
The following holds:
$$\frac{1}{n}\sum_{i=1}^n\;\sup_{(\mathbf{r}_i,b_i)\in Z_i}h(\mathbf{r}_i,b_i)=\sup_{\mu\in\mathcal{P}_n}\int_{\mathbb{R}^{m+1}}h(\mathbf{r},b)\,d\mu(\mathbf{r},b).$$
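One direction of this equality is immediate (our sketch, assuming for simplicity that each supremum is attained; the reverse inequality is the substantive part): let $\mu^*$ place mass $1/n$ at a point $(\mathbf{r}_i^*,b_i^*)\in\arg\sup_{(\mathbf{r},b)\in Z_i}h(\mathbf{r},b)$ for each $i$. Then $\mu^*\big(\bigcup_{i\in S}Z_i\big)\ge|S|/n$ for every $S$, so $\mu^*\in\mathcal{P}_n$, and
$$\int_{\mathbb{R}^{m+1}}h\,d\mu^*=\frac{1}{n}\sum_{i=1}^n\sup_{(\mathbf{r}_i,b_i)\in Z_i}h(\mathbf{r}_i,b_i),$$
so the left-hand side is no larger than the right-hand side.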
Step 2: Next we show that robust regression has a form like that in the left hand side above. Also,
the set of distributions we supremize over, in the right hand side above, includes a kernel density
estimator for the true (unknown) distribution. Indeed, consider the following kernel estimator: given
samples $(b_i,\mathbf{r}_i)_{i=1}^n$,
$$h_n(b,\mathbf{r})\triangleq\big(nc^{m+1}\big)^{-1}\sum_{i=1}^n K\Big(\frac{b-b_i}{c},\,\frac{\mathbf{r}-\mathbf{r}_i}{c}\Big),\qquad\text{where: } K(\mathbf{x})\triangleq\mathbb{I}_{[-1,+1]^{m+1}}(\mathbf{x})/2^{m+1}. \qquad (5)$$
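As a concrete illustration (a minimal sketch with arbitrary sample data and bandwidth, not from the paper), the estimator (5) is simply a box-kernel density estimate:

```python
import numpy as np

def box_kernel_density(samples, query, c):
    """Evaluate the box-kernel estimator of Equation (5).

    samples: (n, m+1) array whose rows are the points (b_i, r_i).
    query:   (m+1,) point (b, r) at which to evaluate h_n.
    c:       bandwidth of the uniform kernel on [-1, 1]^{m+1}.
    """
    n, d = samples.shape                      # d = m + 1
    inside = np.all(np.abs((query - samples) / c) <= 1.0, axis=1)
    # K(x) = I_{[-1,1]^{m+1}}(x) / 2^{m+1}
    return inside.sum() / (n * c**d * 2**d)

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 3))          # n = 500 samples of (b, r) with m = 2
print(box_kernel_density(data, np.zeros(3), c=0.5))
```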
Observe that the estimated distribution given by Equation (5) belongs to the set of distributions
$$\mathcal{P}_n(A,\Delta,b,c)\triangleq\Big\{\mu\in\mathcal{P}\;\Big|\;Z_i=[b_i-c,\,b_i+c]\times\prod_{j=1}^m[a_{ij}-\delta_{ij},\,a_{ij}+\delta_{ij}];\;\;\forall S\subseteq\{1,\cdots,n\}:\;\mu\Big(\bigcup_{i\in S}Z_i\Big)\ge|S|/n\Big\},$$
and hence belongs to $\hat{\mathcal{P}}(n)\triangleq\bigcup_{\Delta:\,\forall j,\;\sum_i\delta_{ij}^2=nc_j^2}\mathcal{P}_n(A,\Delta,b,c)$, which is precisely the set of distributions used in the representation from Proposition 1.
Step 3: Combining the last two steps, and using the fact that $\int_{b,\mathbf{r}}|h_n(b,\mathbf{r})-h(b,\mathbf{r})|\,d(b,\mathbf{r})$ goes to zero almost surely when $c_n\downarrow 0$ and $nc_n^{m+1}\uparrow\infty$, since $h_n(\cdot)$ is a kernel density estimate of $h(\cdot)$ (see, e.g., Theorem 3.1 of [21]), we prove consistency of robust regression.
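For concreteness (an illustrative choice; the argument only requires the stated rates), a bandwidth sequence satisfying both conditions is $c_n=n^{-1/(m+2)}$, since then $c_n\downarrow 0$ and $nc_n^{m+1}=n^{1/(m+2)}\uparrow\infty$.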
We can remove the assumption that $\|x(c_n,S_n)\|_2\le H$, and as in Theorem 6, the proof technique
rather than the result itself is of interest. We postpone the proof to [15].
Theorem 7. Let $\{c_n\}$ converge to zero sufficiently slowly. Then
$$\lim_{n\to\infty}\sqrt{\int_{b,\mathbf{r}}\big(b-\mathbf{r}^\top x(c_n,S_n)\big)^2\,d\mathbb{P}(b,\mathbf{r})}=\sqrt{\int_{b,\mathbf{r}}\big(b-\mathbf{r}^\top x(\mathbb{P})\big)^2\,d\mathbb{P}(b,\mathbf{r})},$$
almost surely.
5 Conclusion
In this paper, we consider robust regression with a least-square-error loss, and extend the results of
[11] (i.e., Tikhonov regularization is equivalent to a robust formulation for Frobenius norm-bounded
disturbance set) to a broader range of disturbance sets and hence regularization schemes. A special
case of our formulation recovers the well-known Lasso algorithm, and we obtain an interpretation
of Lasso from a robustness perspective. We consider more general robust regression formulations,
allowing correlation between the feature-wise noise, and we show that this too leads to tractable
convex optimization problems.
We exploit the new robustness formulation to give direct proofs of sparseness and consistency for
Lasso. Because these results follow from robustness properties alone, they suggest that such conclusions may hold far beyond Lasso, and that, in particular, consistency and sparseness may be properties one can obtain more generally from robustified algorithms.
References
[1] L. Elden. Perturbation theory for the least-square problem with linear equality constraints. BIT, 24:472–
476, 1985.
[2] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
[3] A. Tikhonov and V. Arsenin. Solutions of Ill-Posed Problems. Wiley, New York, 1977.
[4] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society,
Series B, 58(1):267–288, 1996.
[5] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics,
32(2):407–499, 2004.
[6] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on
Scientific Computing, 20(1):33–61, 1998.
[7] A. Feuer and A. Nemirovski. On sparse representation in pairs of bases. IEEE Transactions on Informa-
tion Theory, 49(6):1579–1581, 2003.
[8] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly
incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
[9] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information
Theory, 50(10):2231–2242, 2004.
[10] M. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of spar-
sity using `1 -constrained quadratic programming. Technical Report Available from:
http://www.stat.berkeley.edu/tech-reports/709.pdf, Department of Statistics,
UC Berkeley, 2006.
[11] L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal
on Matrix Analysis and Applications, 18:1035–1064, 1997.
[12] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations Research
Letters, 25(1):1–13, August 1999.
[13] D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52(1):35–53, January 2004.
[14] P. Shivaswamy, C. Bhattacharyya, and A. Smola. Second order cone programming approaches for han-
dling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, July 2006.
[15] H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. Submitted, available from
http://arxiv.org/abs/0811.1790v1, 2008.
[16] J. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE Transactions on
Information Theory, 51(3):1030–1051, 2006.
[17] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computa-
tion, 10(6):1445–1480, 1998.
[18] R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best-basis selection. IEEE Trans-
actions on Information Theory, 38(2):713–718, 1992.
[19] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal
Processing, 41(12):3397–3415, 1993.
[20] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[21] L. Devroye and L. Györfi. Nonparametric Density Estimation: the l1 View. John Wiley & Sons, 1985.