Robust Regression and Lasso
Huan Xu
Department of Electrical and Computer Engineering
McGill University
Montreal, QC Canada
[email protected]
Constantine Caramanis
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, Texas
[email protected]
Shie Mannor
Department of Electrical and Computer Engineering
McGill University
Montreal, QC Canada
[email protected]
Abstract
1 Introduction
In this paper we consider linear regression problems with least-square error. The problem is to find
a vector $x$ so that the $\ell_2$ norm of the residual $b - Ax$ is minimized, for a given matrix $A \in \mathbb{R}^{n\times m}$
and vector $b \in \mathbb{R}^n$. From a learning/regression perspective, each row of $A$ can be regarded as a
training sample, and the corresponding element of b as the target value of this observed sample.
Each column of A corresponds to a feature, and the objective is to find a set of weights so that the
weighted sum of the feature values approximates the target value.
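As a concrete illustration of this setup (a minimal sketch with synthetic data, not taken from the paper), the unregularized problem is ordinary least squares:

```python
import numpy as np

# Toy instance: rows of A are training samples, columns are features,
# and b holds the corresponding target values.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))   # 10 observations, 3 features
b = rng.standard_normal(10)

# Ordinary least squares: find x minimizing the l2 norm of the residual b - Ax.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(b - A @ x_ls))
```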
It is well known that minimizing the least squared error can lead to sensitive solutions [1, 2]. Many
regularization methods have been proposed to decrease this sensitivity. Among them, Tikhonov
regularization [3] and Lasso [4, 5] are two widely known and cited algorithms. These methods
minimize a weighted sum of the residual norm and a certain regularization term, $\|x\|_2$ for Tikhonov
regularization and $\|x\|_1$ for Lasso. In addition to providing regularity, Lasso is also known for
the tendency to select sparse solutions. Recently this has attracted much attention for its ability
to reconstruct sparse solutions when sampling occurs far below the Nyquist rate, and also for its
ability to recover the sparsity pattern exactly with probability one, asymptotically as the number of
observations increases (there is an extensive literature on this subject, and we refer the reader to
[6, 7, 8, 9, 10] and references therein). In many of these approaches, the choice of regularization
parameters often has no fundamental connection to an underlying noise model [2].
In [11], the authors propose an alternative approach to reducing sensitivity of linear regression, by
considering a robust version of the regression problem: they minimize the worst-case residual for
the observations under some unknown but bounded disturbances. They show that their robust least
squares formulation is equivalent to $\ell_2$-regularized least squares, and they explore computational
aspects of the problem. In that paper, and in most of the subsequent research in this area and the
more general area of Robust Optimization (see [12, 13] and references therein) the disturbance is
taken to be either row-wise and uncorrelated [14], or given by bounding the Frobenius norm of the
disturbance matrix [11].
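For comparison with the feature-wise case studied below, the equivalence of [11] referred to above can be restated (in a simplified form with disturbance on $A$ only, and with $\rho$ denoting the Frobenius-norm bound) as
$$\min_{x\in\mathbb{R}^m}\;\max_{\|\Delta A\|_F\le\rho}\;\|b-(A+\Delta A)x\|_2=\min_{x\in\mathbb{R}^m}\Big\{\|b-Ax\|_2+\rho\|x\|_2\Big\}.$$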
In this paper we investigate the robust regression problem under more general uncertainty sets,
focusing in particular on the case where the uncertainty set is defined by feature-wise constraints,
and also the case where features are meaningfully correlated. This is of interest when values of
features are obtained with some noisy pre-processing steps, and the magnitudes of such noises are
known or bounded. We prove that all our formulations are computationally tractable. Unlike much
of the previous literature, we provide a focus on structural properties of the robust solution. In
addition to giving new formulations, and new properties of the solutions to these robust problems,
we focus on the inherent importance of robustness, and its ability to prove from scratch important
properties such as sparseness, and asymptotic consistency of Lasso in the statistical learning context.
In particular, our main contributions in this paper are as follows.
• We formulate the robust regression problem with feature-wise independent disturbances,
and show that this formulation is equivalent to a least-square problem with a weighted $\ell_1$
norm regularization term. Hence, we provide an interpretation for Lasso from a robustness
perspective. This can be helpful in choosing the regularization parameter. We generalize
the robust regression formulation to loss functions given by an arbitrary norm, and uncer-
tainty sets that allow correlation between disturbances of different features.
• We investigate the sparsity properties for the robust regression problem with feature-wise
independent disturbances, showing that such formulations encourage sparsity. We thus eas-
ily recover standard sparsity results for Lasso using a robustness argument. This also im-
plies a fundamental connection between the feature-wise independence of the disturbance
and the sparsity.
• Next, we relate Lasso to kernel density estimation. This allows us to re-prove consistency
in a statistical learning setup, using the new robustness tools and formulation we introduce.
Notation. We use capital letters to represent matrices, and boldface letters to represent column vectors. For a vector $\mathbf{z}$, we let $z_i$ denote the $i$th element. Throughout the paper, $\mathbf{a}_i$ and $\mathbf{r}_j^\top$ denote the $i$th column and the $j$th row of the observation matrix $A$, respectively; $a_{ij}$ is the $ij$th element of $A$, hence it is the $j$th element of $\mathbf{r}_i$, and the $i$th element of $\mathbf{a}_j$. For a convex function $f(\cdot)$, $\partial f(\mathbf{z})$ represents any of its sub-gradients evaluated at $\mathbf{z}$.
2.1 Formulation
Robust linear regression considers the case that the observed matrix A is corrupted by some distur-
bance. We seek the optimal weight for the uncorrupted (yet unknown) sample matrix. We consider
the following min-max formulation:
$$\text{Robust Linear Regression:}\qquad \min_{x\in\mathbb{R}^m}\;\max_{\Delta A\in U}\;\|b-(A+\Delta A)x\|_2. \qquad (1)$$
Here, U is the set of admissible disturbances of the matrix A. In this section, we consider the specific
setup where the disturbance is feature-wise uncorrelated, and norm-bounded for each feature:
$$U \triangleq \Big\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\;\Big|\;\|\boldsymbol{\delta}_i\|_2\le c_i,\;\; i=1,\cdots,m\Big\}. \qquad (2)$$
Theorem 1. The robust regression problem (1) with uncertainty set (2) is equivalent to the regularized regression problem
$$\min_{x\in\mathbb{R}^m}\Big\{\|b-Ax\|_2+\sum_{i=1}^m c_i|x_i|\Big\}. \qquad (3)$$
Proof. We defer the full details to [15], and give only an outline of the proof here. Showing that the
robust regression is a lower bound for the regularized regression follows from the standard triangle
inequality. Conversely, one can take the worst-case noise to be $\boldsymbol{\delta}_i^* \triangleq -c_i\,\mathrm{sgn}(x_i^*)\,\mathbf{u}$, where $\mathbf{u}$ is given by
$$\mathbf{u} \triangleq \begin{cases}\dfrac{b-Ax^*}{\|b-Ax^*\|_2}, & \text{if } Ax^*\ne b,\\[4pt] \text{any vector with unit } \ell_2 \text{ norm}, & \text{otherwise;}\end{cases}$$
from which the result follows after some algebra.
If we take $c_i = c$ for all $i$ and normalize the columns $\mathbf{a}_i$, Problem (3) is the well-known Lasso [4, 5].
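To make the equivalence of Theorem 1 concrete, the following sketch (illustrative only; random data, not from the paper) evaluates the worst-case residual using the disturbance constructed in the proof and compares it with the regularized objective of Problem (3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 5
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)          # an arbitrary candidate solution
c = 0.3 * np.ones(m)                # feature-wise disturbance budgets c_i

# Regularized objective of Problem (3): ||b - Ax||_2 + sum_i c_i |x_i|.
reg_obj = np.linalg.norm(b - A @ x) + c @ np.abs(x)

# Worst-case disturbance from the proof: delta_i = -c_i * sgn(x_i) * u,
# where u is the unit vector along the residual b - Ax.
u = (b - A @ x) / np.linalg.norm(b - A @ x)
Delta = -np.outer(u, c * np.sign(x))
worst_case = np.linalg.norm(b - (A + Delta) @ x)

print(reg_obj, worst_case)          # the two values agree up to rounding
```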
It is possible to generalize this result to the case where the $\ell_2$-norm is replaced by an arbitrary norm,
and where the uncertainty is correlated from feature to feature. For space considerations, we refer
to the full version ([15]), and simply state the main results here.
Theorem 2. Let $\|\cdot\|_a$ denote an arbitrary norm. Then the robust regression problem
$$\min_{x\in\mathbb{R}^m}\;\max_{\Delta A\in U_a}\;\|b-(A+\Delta A)x\|_a;\qquad U_a \triangleq \Big\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\;\Big|\;\|\boldsymbol{\delta}_i\|_a\le c_i,\;\; i=1,\cdots,m\Big\};$$
is equivalent to the regularized regression problem $\min_{x\in\mathbb{R}^m}\big\{\|b-Ax\|_a+\sum_{i=1}^m c_i|x_i|\big\}$.
Using feature-wise uncorrelated disturbance may lead to overly conservative results. We relax this,
allowing the disturbances of different features to be correlated. Consider the following uncertainty
set:
$$U' \triangleq \Big\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\;\Big|\;f_j\big(\|\boldsymbol{\delta}_1\|_a,\cdots,\|\boldsymbol{\delta}_m\|_a\big)\le 0;\;\; j=1,\cdots,k\Big\},$$
where fj (·) are convex functions. Notice that both k and fj can be arbitrary, hence this is a very
general formulation and provides us with significant flexibility in designing uncertainty sets and
equivalently new regression algorithms. The following theorem converts this formulation to a con-
vex and tractable optimization problem.
Theorem 3. Assume that the set $Z \triangleq \{z\in\mathbb{R}^m \mid f_j(z)\le 0,\; j=1,\cdots,k;\; z\ge 0\}$ has non-empty relative interior. The robust regression problem
$$\min_{x\in\mathbb{R}^m}\;\max_{\Delta A\in U'}\;\|b-(A+\Delta A)x\|_a$$
is equivalent to the following regularized regression problem:
$$\min_{\lambda\in\mathbb{R}^k_+,\,\kappa\in\mathbb{R}^m_+,\,x\in\mathbb{R}^m}\Big\{\|b-Ax\|_a+v(\lambda,\kappa,x)\Big\};\qquad\text{where } v(\lambda,\kappa,x)\triangleq\max_{c\in\mathbb{R}^m}\Big[(\kappa+|x|)^\top c-\sum_{j=1}^k\lambda_j f_j(c)\Big]. \qquad (4)$$
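As a sanity check on formulation (4) (a worked instance we add for illustration, not an example from the paper), take $k=1$ and the single budget constraint $f_1(c)=\sum_{i=1}^m c_i-l$. Then
$$v(\lambda,\kappa,x)=\max_{c\in\mathbb{R}^m}\Big[\sum_{i=1}^m\big(\kappa_i+|x_i|-\lambda\big)c_i+\lambda l\Big]=\begin{cases}\lambda l,&\text{if }\kappa_i+|x_i|=\lambda\;\;\forall i,\\ +\infty,&\text{otherwise,}\end{cases}$$
and minimizing over $\lambda\ge 0$, $\kappa\ge 0$ forces $\lambda\ge\max_i|x_i|$, so the regularizer reduces to $l\|x\|_\infty$. This agrees with Example 1 below with $\|\cdot\|_s$ the $\ell_1$ norm, whose dual is the $\ell_\infty$ norm.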
Example 1. Suppose
$$U' = \Big\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\;\Big|\;\big\|\big(\|\boldsymbol{\delta}_1\|_a,\cdots,\|\boldsymbol{\delta}_m\|_a\big)\big\|_s\le l\Big\}$$
for a symmetric norm $\|\cdot\|_s$; then the resulting regularized regression problem is
$$\min_{x\in\mathbb{R}^m}\Big\{\|b-Ax\|_a+l\|x\|_s^*\Big\};\qquad\text{where } \|\cdot\|_s^* \text{ is the dual norm of } \|\cdot\|_s.$$
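Two further specializations (illustrative; they follow directly from standard norm duality): taking $\|\cdot\|_s$ to be the $\ell_\infty$ norm, so that every column disturbance is bounded by $l$, gives the uniform-weight Lasso objective
$$\min_{x\in\mathbb{R}^m}\Big\{\|b-Ax\|_a+l\|x\|_1\Big\},$$
while taking $\|\cdot\|_s$ to be the $\ell_2$ norm gives the $\ell_2$-regularized problem
$$\min_{x\in\mathbb{R}^m}\Big\{\|b-Ax\|_a+l\|x\|_2\Big\}.$$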
The robust regression formulation (1) considers disturbances that are bounded in a set, while in
practice, often the disturbance is a random variable with unbounded support. In such cases, it is not
possible to simply use an uncertainty set that includes all admissible disturbances, and we need to
construct a meaningful U based on probabilistic information. In the full version [15] we consider
computationally efficient ways to use chance constraints to construct uncertainty sets.
3 Sparsity
In this section, we investigate the sparsity properties of robust regression (1), and equivalently Lasso.
Lasso's ability to recover sparse solutions has been extensively discussed (cf. [6, 7, 8, 9]), and the existing analyses take
one of two approaches. The first approach investigates the problem from a statistical perspective.
That is, it assumes that the observations are generated by a (sparse) linear combination of the fea-
tures, and investigates the asymptotic or probabilistic conditions required for Lasso to correctly
recover the generative model. The second approach treats the problem from an optimization per-
spective, and studies under what conditions a pair (A, b) defines a problem with sparse solutions
(e.g., [16]).
We follow the second approach and do not assume a generative model. Instead, we consider the
conditions that lead to a feature receiving zero weight. In particular, we show that (i) as a direct
result of the feature-wise independence of the uncertainty set, a feature that was originally assigned zero weight still receives zero weight after a slight change of its value (Theorem 4); (ii) using Theorem 4, we show
that “nearly” orthogonal features get zero weight (Corollary 1); and (iii) “nearly” linearly dependent
features get zero weight (Theorem 5). Substantial research regarding sparsity properties of Lasso
can be found in the literature (cf [6, 7, 8, 9, 17, 18, 19, 20] and many others). In particular, similar
results as in point (ii), that rely on an incoherence property, have been established in, e.g., [16], and
are used as standard tools in investigating sparsity of Lasso from a statistical perspective. However,
a proof exploiting robustness and properties of the uncertainty is novel. Indeed, such a proof shows
a fundamental connection between robustness and sparsity, and implies that robustifying w.r.t. a
feature-wise independent uncertainty set might be a plausible way to achieve sparsity for other
problems.
Theorem 4. Given $(\tilde A, b)$, let $x^*$ be an optimal solution of the robust regression problem
$$\min_{x\in\mathbb{R}^m}\;\max_{\Delta A\in U}\;\|b-(\tilde A+\Delta A)x\|_2.$$
Let $I\subseteq\{1,\cdots,m\}$ be such that $x_i^*=0$ for all $i\in I$, and let
$$\tilde U \triangleq \Big\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\;\Big|\;\|\boldsymbol{\delta}_i\|_2\le c_i+\ell_i,\; i\in I;\;\;\|\boldsymbol{\delta}_j\|_2\le c_j,\; j\notin I\Big\}.$$
Then $x^*$ is also an optimal solution of
$$\min_{x\in\mathbb{R}^m}\;\max_{\Delta A\in\tilde U}\;\|b-(A+\Delta A)x\|_2$$
for any $A$ that satisfies $\|\mathbf{a}_i-\tilde{\mathbf{a}}_i\|\le\ell_i$ for $i\in I$, and $\mathbf{a}_j=\tilde{\mathbf{a}}_j$ for $j\notin I$.
Proof. Notice that for $i\in I$, $x_i^*=0$, hence the $i$th column of both $A$ and $\Delta A$ has no effect on the residual. We have
$$\max_{\Delta A\in\tilde U}\big\|b-(A+\Delta A)x^*\big\|_2=\max_{\Delta A\in U}\big\|b-(A+\Delta A)x^*\big\|_2=\max_{\Delta A\in U}\big\|b-(\tilde A+\Delta A)x^*\big\|_2.$$
For $i\in I$, $\|\mathbf{a}_i-\tilde{\mathbf{a}}_i\|\le\ell_i$, and $\mathbf{a}_j=\tilde{\mathbf{a}}_j$ for $j\notin I$. Thus $\big\{\tilde A+\Delta A\,\big|\,\Delta A\in U\big\}\subseteq\big\{A+\Delta A\,\big|\,\Delta A\in\tilde U\big\}$. Therefore, for any fixed $x'$, the following holds:
$$\max_{\Delta A\in U}\big\|b-(\tilde A+\Delta A)x'\big\|_2\le\max_{\Delta A\in\tilde U}\big\|b-(A+\Delta A)x'\big\|_2.$$
By definition of $x^*$,
$$\max_{\Delta A\in U}\big\|b-(\tilde A+\Delta A)x^*\big\|_2\le\max_{\Delta A\in U}\big\|b-(\tilde A+\Delta A)x'\big\|_2.$$
Therefore we have
$$\max_{\Delta A\in\tilde U}\big\|b-(A+\Delta A)x^*\big\|_2\le\max_{\Delta A\in\tilde U}\big\|b-(A+\Delta A)x'\big\|_2,$$
which establishes that $x^*$ is optimal for the problem with uncertainty set $\tilde U$.
Theorem 4 is established using the robustness argument, and is a direct result of the feature-wise independence of the uncertainty set. It explains why Lasso tends to assign zero weight to irrelevant features. Consider a generative model $b=\sum_{i\in I}w_i\mathbf{a}_i+\tilde\xi$, where $I\subseteq\{1,\cdots,m\}$ and $\tilde\xi$ is a random variable, i.e., $b$ is generated by features belonging to $I$. In this case, for a feature $i'\notin I$, Lasso would assign zero weight as long as there exists a perturbed value of this feature such that the optimal regression assigns it zero weight. This is also shown in the next corollary, in which we apply Theorem 4 to show that the problem has a sparse solution as long as an incoherence-type property is satisfied (this result is more in line with the traditional sparsity results).
Corollary 1. Suppose that $c_i=c$ for all $i$. If there exists $I\subset\{1,\cdots,m\}$ such that for all $v\in\mathrm{span}\big(\{\mathbf{a}_i,\;i\in I\}\cup\{b\}\big)$ with $\|v\|=1$, we have $v^\top\mathbf{a}_j\le c$ for all $j\notin I$, then any optimal solution $x^*$ satisfies $x_j^*=0$, $\forall j\notin I$.
Proof. Let $\hat{\mathbf{a}}_i\triangleq\mathbf{a}_i$ for $i\in I$ and, for $j\notin I$, let $\hat{\mathbf{a}}_j$ denote the component of $\mathbf{a}_j$ orthogonal to $\mathrm{span}\big(\{\mathbf{a}_i,\;i\in I\}\cup\{b\}\big)$; write $\hat A\triangleq(\hat{\mathbf{a}}_1,\cdots,\hat{\mathbf{a}}_m)$. The robust regression problem with observation matrix $\hat A$ admits an optimal solution $\hat x^*$ such that $\hat x_j^*=0$ for all $j\notin I$. This is because the $\hat{\mathbf{a}}_j$ are orthogonal to the span of $\{\hat{\mathbf{a}}_i,\;i\in I\}\cup\{b\}$; hence, for any given $\hat x$, by changing $\hat x_j$ to zero for all $j\notin I$, the minimized objective does not increase. Since $\|\hat{\mathbf{a}}_j-\mathbf{a}_j\|\le c$ for all $j\notin I$ (and recall that $U=\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,|\,\|\boldsymbol{\delta}_i\|_2\le c,\;\forall i\}$), applying Theorem 4 we establish the corollary.
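As a quick numerical illustration of Corollary 1 (our own sketch, not part of the paper; it assumes the cvxpy package to solve Problem (3) directly), a feature orthogonal to the span of the relevant features and the target receives exactly zero weight:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, c = 50, 0.5

# Two relevant features and a target generated from them.
a1, a2 = rng.standard_normal(n), rng.standard_normal(n)
b = 1.0 * a1 - 2.0 * a2 + 0.1 * rng.standard_normal(n)

# A third feature orthogonal to span{a1, a2, b}: project noise onto the
# orthogonal complement of that span, so v'a3 = 0 <= c for unit v in the span.
Q = np.linalg.qr(np.column_stack([a1, a2, b]))[0]
g = rng.standard_normal(n)
a3 = g - Q @ (Q.T @ g)
a3 /= np.linalg.norm(a3)

# Solve Problem (3) with uniform weights c_i = c.
A = np.column_stack([a1, a2, a3])
x = cp.Variable(3)
cp.Problem(cp.Minimize(cp.norm(b - A @ x, 2) + c * cp.norm1(x))).solve()
print(x.value)   # the weight on a3 is zero (up to solver tolerance)
```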
The next theorem shows that sparsity is achieved when a set of features are “almost” linearly depen-
dent. Again we refer to [15] for the proof.
Theorem 5. Given $I\subseteq\{1,\cdots,m\}$ such that there exists a non-zero vector $(w_i)_{i\in I}$ satisfying
$$\Big\|\sum_{i\in I}w_i\mathbf{a}_i\Big\|_2\le\min_{\sigma_i\in\{-1,+1\}}\Big|\sum_{i\in I}\sigma_i c_i w_i\Big|,$$
there exists an optimal solution $x^*$ such that $x_i^*=0$ for at least one $i\in I$.
Notice that for linearly dependent features, there exists a non-zero $(w_i)_{i\in I}$ such that $\big\|\sum_{i\in I}w_i\mathbf{a}_i\big\|_2=0$, which leads to the following corollary.
Corollary 3. Given $I\subseteq\{1,\cdots,m\}$, let $A_I\triangleq(\mathbf{a}_i)_{i\in I}$ and $t\triangleq\mathrm{rank}(A_I)$. There exists an optimal solution $x^*$ such that $x_I^*\triangleq(x_i^*)_{i\in I}$ has at most $t$ non-zero coefficients.
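For a concrete instance (added for illustration): if two columns coincide, say $\mathbf{a}_1=\mathbf{a}_2$ with $c_1=c_2=c$, then $w=(1,-1)$ gives
$$\Big\|\sum_{i\in\{1,2\}}w_i\mathbf{a}_i\Big\|_2=0\le\min_{\sigma_i\in\{-1,+1\}}\big|\sigma_1 c-\sigma_2 c\big|=0,$$
so Theorem 5 applies and, consistently with Corollary 3 (here $t=1$), there is an optimal solution that places all of the weight shared by the duplicated pair on a single column.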
almost surely.
The full proof and results we develop along the way are deferred to [15], but we provide the main
ideas and outline here. The key to the proof is establishing a connection between robustness and
kernel density estimation.
Step 1: For a given x, we show that the robust regression loss over the training data is equal to the
worst-case expected generalization error. To show this we establish a more general result:
Proposition 1. Given a function $h:\mathbb{R}^{m+1}\to\mathbb{R}$ and Borel sets $Z_1,\cdots,Z_n\subseteq\mathbb{R}^{m+1}$, let
$$\mathcal{P}_n\triangleq\Big\{\mu\in\mathcal{P}\;\Big|\;\forall S\subseteq\{1,\cdots,n\}:\;\mu\Big(\bigcup_{i\in S}Z_i\Big)\ge|S|/n\Big\}.$$
The following holds:
$$\frac{1}{n}\sum_{i=1}^n\;\sup_{(\mathbf{r}_i,b_i)\in Z_i}h(\mathbf{r}_i,b_i)=\sup_{\mu\in\mathcal{P}_n}\int_{\mathbb{R}^{m+1}}h(\mathbf{r},b)\,d\mu(\mathbf{r},b).$$
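One direction of this equality is immediate (our sketch, assuming for simplicity that each supremum is attained; the reverse inequality is the substantive part): let $\mu^*$ place mass $1/n$ at a point $(\mathbf{r}_i^*,b_i^*)\in\arg\sup_{(\mathbf{r},b)\in Z_i}h(\mathbf{r},b)$ for each $i$. Then $\mu^*\big(\bigcup_{i\in S}Z_i\big)\ge|S|/n$ for every $S$, so $\mu^*\in\mathcal{P}_n$, and
$$\int_{\mathbb{R}^{m+1}}h\,d\mu^*=\frac{1}{n}\sum_{i=1}^n\sup_{(\mathbf{r}_i,b_i)\in Z_i}h(\mathbf{r}_i,b_i),$$
so the left-hand side is no larger than the right-hand side.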
Step 2: Next we show that robust regression has a form like that in the left hand side above. Also,
the set of distributions we supremize over, in the right hand side above, includes a kernel density
estimator for the true (unknown) distribution. Indeed, consider the following kernel estimator: given
samples $(b_i,\mathbf{r}_i)_{i=1}^n$,
$$h_n(b,\mathbf{r})\triangleq\big(nc^{m+1}\big)^{-1}\sum_{i=1}^n K\Big(\frac{b-b_i}{c},\,\frac{\mathbf{r}-\mathbf{r}_i}{c}\Big),\qquad\text{where: } K(\mathbf{x})\triangleq\mathbb{I}_{[-1,+1]^{m+1}}(\mathbf{x})/2^{m+1}. \qquad (5)$$
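As a concrete illustration (a minimal sketch with arbitrary sample data and bandwidth, not from the paper), the estimator (5) is simply a box-kernel density estimate:

```python
import numpy as np

def box_kernel_density(samples, query, c):
    """Evaluate the box-kernel estimator of Equation (5).

    samples: (n, m+1) array whose rows are the points (b_i, r_i).
    query:   (m+1,) point (b, r) at which to evaluate h_n.
    c:       bandwidth of the uniform kernel on [-1, 1]^{m+1}.
    """
    n, d = samples.shape                      # d = m + 1
    inside = np.all(np.abs((query - samples) / c) <= 1.0, axis=1)
    # K(x) = I_{[-1,1]^{m+1}}(x) / 2^{m+1}
    return inside.sum() / (n * c**d * 2**d)

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 3))          # n = 500 samples of (b, r) with m = 2
print(box_kernel_density(data, np.zeros(3), c=0.5))
```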
Observe that the estimated distribution given by Equation (5) belongs to the set of distributions
$$\mathcal{P}_n(A,\Delta,b,c)\triangleq\Big\{\mu\in\mathcal{P}\;\Big|\;Z_i=[b_i-c,\,b_i+c]\times\prod_{j=1}^m[a_{ij}-\delta_{ij},\,a_{ij}+\delta_{ij}];\;\;\forall S\subseteq\{1,\cdots,n\}:\;\mu\Big(\bigcup_{i\in S}Z_i\Big)\ge|S|/n\Big\},$$
and hence belongs to $\hat{\mathcal{P}}(n)\triangleq\bigcup_{\Delta:\,\forall j,\;\sum_i\delta_{ij}^2=nc_j^2}\mathcal{P}_n(A,\Delta,b,c)$, which is precisely the set of distributions used in the representation from Proposition 1.
Step 3: Combining the last two steps, and using the fact that $\int_{b,\mathbf{r}}|h_n(b,\mathbf{r})-h(b,\mathbf{r})|\,d(b,\mathbf{r})$ goes to zero almost surely when $c_n\downarrow 0$ and $nc_n^{m+1}\uparrow\infty$, since $h_n(\cdot)$ is a kernel density estimate of $h(\cdot)$ (see, e.g., Theorem 3.1 of [21]), we prove consistency of robust regression.
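For concreteness (an illustrative choice; the argument only requires the stated rates), a bandwidth sequence satisfying both conditions is $c_n=n^{-1/(m+2)}$, since then $c_n\downarrow 0$ and $nc_n^{m+1}=n^{1/(m+2)}\uparrow\infty$.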
We can remove the assumption that $\|x(c_n,S_n)\|_2\le H$, and as in Theorem 6, the proof technique
rather than the result itself is of interest. We postpone the proof to [15].
Theorem 7. Let $\{c_n\}$ converge to zero sufficiently slowly. Then
$$\lim_{n\to\infty}\sqrt{\int_{b,\mathbf{r}}\big(b-\mathbf{r}^\top x(c_n,S_n)\big)^2\,d\mathbb{P}(b,\mathbf{r})}=\sqrt{\int_{b,\mathbf{r}}\big(b-\mathbf{r}^\top x(\mathbb{P})\big)^2\,d\mathbb{P}(b,\mathbf{r})},$$
almost surely.
5 Conclusion
In this paper, we consider robust regression with a least-square-error loss, and extend the results of
[11] (i.e., Tikhonov regularization is equivalent to a robust formulation for Frobenius norm-bounded
disturbance set) to a broader range of disturbance sets and hence regularization schemes. A special
case of our formulation recovers the well-known Lasso algorithm, and we obtain an interpretation
of Lasso from a robustness perspective. We consider more general robust regression formulations,
allowing correlation between the feature-wise noise, and we show that this too leads to tractable
convex optimization problems.
We exploit the new robustness formulation to give direct proofs of sparseness and consistency for
Lasso. Because these results follow from robustness properties alone, they suggest that such conclusions may hold far beyond Lasso, and that, in particular, consistency and sparseness may be properties one can obtain more generally from robustified algorithms.
References
[1] L. Elden. Perturbation theory for the least-square problem with linear equality constraints. BIT, 24:472–
476, 1985.
[2] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
[3] A. Tikhonov and V. Arsenin. Solutions of Ill-Posed Problems. Wiley, New York, 1977.
[4] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society,
Series B, 58(1):267–288, 1996.
[5] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics,
32(2):407–499, 2004.
[6] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on
Scientific Computing, 20(1):33–61, 1998.
[7] A. Feuer and A. Nemirovski. On sparse representation in pairs of bases. IEEE Transactions on Informa-
tion Theory, 49(6):1579–1581, 2003.
[8] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly
incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
[9] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information
Theory, 50(10):2231–2242, 2004.
[10] M. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of spar-
sity using `1 -constrained quadratic programming. Technical Report Available from:
http://www.stat.berkeley.edu/tech-reports/709.pdf, Department of Statistics,
UC Berkeley, 2006.
[11] L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal
on Matrix Analysis and Applications, 18:1035–1064, 1997.
[12] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations Research
Letters, 25(1):1–13, August 1999.
[13] D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52(1):35–53, January 2004.
[14] P. Shivaswamy, C. Bhattacharyya, and A. Smola. Second order cone programming approaches for han-
dling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, July 2006.
[15] H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. Submitted, available from
http://arxiv.org/abs/0811.1790v1, 2008.
[16] J. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE Transactions on
Information Theory, 51(3):1030–1051, 2006.
[17] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computa-
tion, 10(6):1445–1480, 1998.
[18] R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best-basis selection. IEEE Trans-
actions on Information Theory, 38(2):713–718, 1992.
[19] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal
Processing, 41(12):3397–3415, 1993.
[20] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[21] L. Devroye and L. Györfi. Nonparametric Density Estimation: the l1 View. John Wiley & Sons, 1985.