
The Annals of Applied Statistics
2013, Vol. 7, No. 1, 226–248
DOI: 10.1214/12-AOAS575
© Institute of Mathematical Statistics, 2013

SPARSE LEAST TRIMMED SQUARES REGRESSION FOR
ANALYZING HIGH-DIMENSIONAL LARGE DATA SETS

BY ANDREAS ALFONS, CHRISTOPHE CROUX AND SARAH GELPER

KU Leuven, KU Leuven and Erasmus University Rotterdam

Sparse model estimation is a topic of high importance in modern data
analysis due to the increasing availability of data sets with a large number of
variables. Another common problem in applied statistics is the presence of
outliers in the data. This paper combines robust regression and sparse model
estimation. A robust and sparse estimator is introduced by adding an L1
penalty on the coefficient estimates to the well-known least trimmed squares
(LTS) estimator. The breakdown point of this sparse LTS estimator is de-
rived, and a fast algorithm for its computation is proposed. In addition, the
sparse LTS is applied to protein and gene expression data of the NCI-60 can-
cer cell panel. Both a simulation study and the real data application show that
the sparse LTS has better prediction performance than its competitors in the
presence of leverage points.

1. Introduction. In applied data analysis, there is an increasing availability of


data sets containing a large number of variables. Linear models that include the full
set of explanatory variables often have poor prediction performance as they tend to
have large variance. Furthermore, large models are in general difficult to interpret.
In many cases, the number of variables is even larger than the number of observa-
tions. Traditional methods such as least squares can then no longer be applied due
to the rank deficiency of the design matrix. For instance, gene expression or fMRI
studies typically contain tens of thousands of variables for only a small number
of observations. In this paper, we present an application to the cancer cell panel
of the National Cancer Institute, in which the data consists of 59 observations and
22,283 predictors.
To improve prediction accuracy and as a remedy to computational problems
with high-dimensional data, a penalty term on the regression coefficients can be
added to the objective function. This approach shrinks the coefficients and reduces
variance at the price of increased bias. Tibshirani (1996) introduced the least abso-
lute shrinkage and selection operator (lasso), in which the penalty function is the
L1 norm. Let y = (y1 , . . . , yn ) be the response and X = (xij )1≤i≤n,1≤j ≤p the ma-
trix of predictor variables, where n denotes the number of observations and p the
number of variables. In addition, let x1 , . . . , xn be the p-dimensional observations,

Received July 2011; revised May 2012.


Key words and phrases. Breakdown point, outliers, penalized regression, robust regression, trimming.

that is, the rows of X. We assume a standard regression model

(1.1)  y_i = x_i'\beta + \varepsilon_i,

where the regression parameter is β = (β_1, . . . , β_p)′, and the error terms ε_i have
zero expected value. With a penalty parameter λ, the lasso estimate of β is

(1.2)  \hat{\beta}_{lasso} = \operatorname{argmin}_{\beta} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + n\lambda \sum_{j=1}^{p} |\beta_j|.

The lasso is frequently used in practice since the L1 penalty can shrink some
coefficients to exactly zero, that is, produce sparse model estimates that are
highly interpretable. In addition, a fast algorithm for computing the lasso is avail-
able through the framework of least angle regression [LARS; Efron et al. (2004)].
Other algorithms are available as well [e.g., Wu and Lange (2008)]. Due to the pop-
ularity of the lasso, its theoretical properties are well studied in the literature [e.g.,
Knight and Fu (2000), Zhao and Yu (2006), Zou, Hastie and Tibshirani (2007)] and
several modifications have been proposed [e.g., Zou (2006), Yuan and Lin (2006),
Gertheiss and Tutz (2010), Radchenko and James (2011), Wang et al. (2011)].
However, the lasso is not robust to outliers. In this paper we formally show that the
breakdown point of the lasso is 1/n, that is, a single outlier can make the
lasso estimate completely unreliable. Therefore, robust alternatives are needed.
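
To make the estimator in (1.2) concrete, the following minimal R sketch (not part of the original paper) fits a lasso with the glmnet package rather than the LARS implementation mentioned above. glmnet minimizes (1/(2n)) times the residual sum of squares plus its own penalty parameter times the L1 norm, so passing λ/2 to glmnet roughly matches the scaling of (1.2) once standardization and the intercept are switched off.

# Illustrative sketch only: a lasso fit in the spirit of (1.2) using glmnet.
# glmnet's objective is (1/(2n)) * RSS + lambda_g * ||beta||_1, so
# lambda_g = lambda / 2 corresponds to the penalty nλΣ|β_j| in (1.2).
library(glmnet)

set.seed(1)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)

lambda <- 0.2                                   # penalty parameter as in (1.2)
fit <- glmnet(X, y, alpha = 1, lambda = lambda / 2,
              standardize = FALSE, intercept = FALSE)
beta_hat <- as.numeric(coef(fit))[-1]           # drop the intercept row
round(beta_hat, 3)                              # several coefficients are exactly zero
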
Outliers are observations that deviate from the model assumptions and are a
common problem in the practice of data analysis. For example, for many of the
22,283 predictors in the NCI data set used in Section 7, (log-transformed) re-
sponses on the 59 cell lines showed outliers. Robust alternatives to the least squares
regression estimator are well known and studied; see Maronna, Martin and Yohai
(2006) for an overview. In this paper, we focus on the least trimmed squares (LTS)
estimator introduced by Rousseeuw (1984). This estimator has a simple defini-
tion, is quite fast to compute, and is probably the most popular robust regression
estimator. Denote the vector of squared residuals by r²(β) = (r²_1, . . . , r²_n)′ with
r²_i = (y_i − x_i′β)², i = 1, . . . , n. Then the LTS estimator is defined as


(1.3)  \hat{\beta}_{LTS} = \operatorname{argmin}_{\beta} \sum_{i=1}^{h} (r^2(\beta))_{i:n},

where (r²(β))_{1:n} ≤ · · · ≤ (r²(β))_{n:n} are the order statistics of the squared residuals
and h ≤ n. Thus, LTS regression corresponds to finding the subset of h observa-
tions whose least squares fit produces the smallest sum of squared residuals. The
subset size h can be seen as an initial guess of the number of good observations
in the data. While the LTS is highly robust, it clearly does not produce sparse
model estimates. Furthermore, if h < p, the LTS estimator cannot be computed.

A sparse and regularized version of the LTS is obtained by adding an L1 penalty


with penalty parameter λ to (1.3), leading to the sparse LTS estimator

(1.4)  \hat{\beta}_{sparseLTS} = \operatorname{argmin}_{\beta} \sum_{i=1}^{h} (r^2(\beta))_{i:n} + h\lambda \sum_{j=1}^{p} |\beta_j|.

We prove in this paper that sparse LTS has a high breakdown point. It is resistant
to multiple regression outliers, including leverage points. Besides being highly
robust, and similar to the lasso estimate, sparse LTS (i) improves the prediction
performance through variance reduction if the sample size is small relative to the
dimension, (ii) ensures higher interpretability due to simultaneous model selection,
and (iii) avoids computational problems of traditional robust regression methods
in the case of high-dimensional data. For the NCI data, sparse LTS was less in-
fluenced by the outliers than competitor methods and showed better prediction
performance, while the resulting model is small enough to be easily interpreted
(see Section 7).
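
As a small illustration (not from the paper), the objective function in (1.4) can be evaluated in R for any candidate coefficient vector by summing the h smallest squared residuals and adding the penalty term:

# Sketch: value of the sparse LTS objective (1.4) for a given beta.
sparse_lts_objective <- function(beta, X, y, lambda, h) {
  r2 <- as.numeric((y - X %*% beta)^2)   # squared residuals
  sum(sort(r2)[1:h]) + h * lambda * sum(abs(beta))
}
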
The sparse LTS (1.4) can also be interpreted as a trimmed version of the lasso,
since the limit case h = n yields the lasso solution. Other robust versions of
the lasso have been considered in the literature. Most of them are penalized M-
estimators, as in van de Geer (2008) and Li, Peng and Zhu (2011). Rosset and
Zhu (2004) proposed a Huber-type loss function, which requires knowledge of the
residual scale. A least absolute deviations (LAD) type of estimator called LAD-
lasso is proposed by Wang, Li and Jiang (2007),

(1.5)  \hat{\beta}_{LAD\text{-}lasso} = \operatorname{argmin}_{\beta} \sum_{i=1}^{n} |y_i - x_i'\beta| + n\lambda \sum_{j=1}^{p} |\beta_j|.

However, none of these methods is robust with respect to leverage points, that
is, outliers in the predictor space; they can handle outliers only in the response
variable. The main competitor of the sparse LTS is robust least angle regression,
called RLARS, and proposed in Khan, Van Aelst and Zamar (2007). They develop
a robust version of the LARS algorithm, essentially replacing correlations by a
robust type of correlation, to sequence and select the most important predictor
variables. Then a nonsparse robust regression estimator is applied to the selected
predictor variables. RLARS, as will be confirmed by our simulation study, is robust
with respect to leverage points. A main drawback of the RLARS algorithm of
Khan, Van Aelst and Zamar (2007) is its lack of a natural definition, since it does not
optimize a clearly defined objective function.
An entirely different approach is taken by She and Owen (2011), who propose
an iterative procedure for outlier detection. Their method is based on imposing a
sparsity criterion on the estimator of the mean-shift parameter γ in the extended
regression model
(1.6) y = Xβ + γ + ε.

They stress that this method requires a nonconvex sparsity criterion. An extension
of the method to high-dimensional data is obtained by also assuming sparsity of
the coefficients β. Nevertheless, their paper mainly focuses on outlier detection
and much less on sparse robust estimation. Note that another procedure for simul-
taneous outlier identification and variable selection based on the mean-shift model
is proposed by Menjoge and Welsch (2010).
The rest of the paper is organized as follows. In Section 2 the breakdown point
of the sparse LTS estimator is obtained. Further, we also show that the lasso and
the LAD-lasso have a breakdown point of only 1/n. A detailed description of the
proposed algorithm to compute the sparse LTS regression estimator is provided in
Section 3. Section 4 introduces a reweighted version of the estimator in order to
increase statistical efficiency. The choice of the penalty parameter λ is discussed
in Section 5. Simulation studies are performed in Section 6. In addition, Section 7
presents an application to protein and gene expression data of the well-known can-
cer cell panel of the National Cancer Institute. The results indicate that these data
contain outliers such that robust methods are necessary for analysis. Moreover,
sparse LTS yields a model that is easy to interpret and has excellent prediction
performance. Finally, Section 8 presents some computation times and Section 9
concludes.

2. Breakdown point. The most popular measure for the robustness of an es-
timator is the replacement finite-sample breakdown point (FBP) [e.g., Maronna,
Martin and Yohai (2006)]. Let Z = (X, y) denote the sample. For a regression es-
timator β̂, the breakdown point is defined as
 
(2.1)  \varepsilon^*(\hat{\beta}; Z) = \min\left\{ \frac{m}{n} : \sup_{\tilde{Z}} \|\hat{\beta}(\tilde{Z})\|_2 = \infty \right\},

where Z̃ are corrupted data obtained from Z by replacing m of the original n


data points by arbitrary values. We obtained the following result, from which the
breakdown point of the sparse LTS estimator immediately follows. The proof is in
the Appendix.

THEOREM 1. Let ρ(x) be a convex and symmetric loss function with ρ(0) = 0
and ρ(x) > 0 for x ≠ 0, and define ρ(x) := (ρ(x_1), . . . , ρ(x_n))′. With subset size
h ≤ n, consider the regression estimator

(2.2)  \hat{\beta} = \operatorname{argmin}_{\beta} \sum_{i=1}^{h} \big(\rho(y - X\beta)\big)_{i:n} + h\lambda \sum_{j=1}^{p} |\beta_j|,

where (ρ(y − Xβ))_{1:n} ≤ · · · ≤ (ρ(y − Xβ))_{n:n} are the order statistics of the re-
gression loss. Then the breakdown point of the estimator β̂ is given by

\varepsilon^*(\hat{\beta}; Z) = \frac{n - h + 1}{n}.

The breakdown point is the same for any loss function ρ fulfilling the assump-
tions. In particular, the breakdown point for the sparse LTS estimator β̂_sparseLTS
with subset size h ≤ n, in which ρ(x) = x², is still (n − h + 1)/n. The smaller
the value of h, the higher the breakdown point. By taking h small enough, it is
even possible to have a breakdown point larger than 50%. However, while this is
mathematically possible, we do not advise using h < n/2, since robust statistics
aim for models that fit the majority of the data. Thus, we do not aim for such
large breakdown points. Instead, we suggest taking a value of h equal to a
fraction α of the sample size, with α = 0.75, such that the final estimate is based
on a sufficiently large number of observations. This guarantees a sufficiently high
statistical efficiency, as will be shown in the simulations in Section 6. The resulting
breakdown point is then about 1 − α = 25%. Notice that the breakdown point does
not depend on the dimension p. Even if the number of predictor variables is larger
than the sample size, a high breakdown point is guaranteed. For the nonsparse LTS,
the breakdown point does depend on p [see Rousseeuw and Leroy (2003)].
Applying Theorem 1 to the lasso [corresponding to ρ(x) = x² and h = n] yields
a finite-sample breakdown point of

\varepsilon^*(\hat{\beta}_{lasso}; Z) = \frac{1}{n}.
Hence, only one outlier can already send the lasso solution to infinity, despite the
fact that large values of the regression estimate are penalized in the objective func-
tion of the lasso. The nonrobustness of the lasso comes from the use of the squared
residuals in the objective function (1.2). Using other convex loss functions, as done
in the LAD-lasso or penalized M-estimators, does not solve the problem and re-
sults in a breakdown point of 1/n as well. The theoretical results on robustness are
also reflected in the application to the NCI data in Section 7, where the lasso is
much more influenced by the outliers than the sparse LTS.

3. Algorithm. We first present an equivalent formulation of the sparse LTS


estimator (1.4). For a fixed penalty parameter λ, define the objective function
(3.1)  Q(H, \beta) = \sum_{i \in H} (y_i - x_i'\beta)^2 + h\lambda \sum_{j=1}^{p} |\beta_j|,

which is the L1 penalized residual sum of squares based on a subsample H ⊆


{1, . . . , n} with |H| = h. With

(3.2)  \hat{\beta}_H = \operatorname{argmin}_{\beta} Q(H, \beta),

the sparse LTS estimator is given by \hat{\beta}_{H_{opt}}, where

(3.3)  H_{opt} = \operatorname{argmin}_{H \subseteq \{1,\ldots,n\}: |H| = h} Q(H, \hat{\beta}_H).

Hence, the sparse LTS corresponds to finding the subset of h ≤ n observations


whose lasso fit produces the smallest penalized residual sum of squares. To find
this optimal subset, we use an analogue of the FAST-LTS algorithm developed by
Rousseeuw and Van Driessen (2006).
The algorithm is based on concentration steps or C-steps. The C-step at itera-
tion k consists of computing the lasso solution based on the current subset Hk , with
|Hk | = h, and constructing the next subset Hk+1 from the observations correspond-
ing to the h smallest squared residuals. Let Hk denote a certain subsample derived
at iteration k and let β̂ Hk be the coefficients of the corresponding lasso fit. After
computing the squared residuals r²_k = (r²_{k,1}, . . . , r²_{k,n})′ with r²_{k,i} = (y_i − x_i′β̂_{H_k})²,
the subsample Hk+1 for iteration k + 1 is defined as the set of indices correspond-
ing to the h smallest squared residuals. In mathematical terms, this can be written
as
 

H_{k+1} = \{ i \in \{1, \ldots, n\} : r^2_{k,i} \in \{ (r^2_k)_{j:n} : j = 1, \ldots, h \} \},

where (r²_k)_{1:n} ≤ · · · ≤ (r²_k)_{n:n} denote the order statistics of the squared residuals.
Let β̂_{H_{k+1}} denote the coefficients of the lasso fit based on H_{k+1}. Then

(3.4)  Q(H_{k+1}, β̂_{H_{k+1}}) ≤ Q(H_{k+1}, β̂_{H_k}) ≤ Q(H_k, β̂_{H_k}),

where the first inequality follows from the definition of β̂ Hk+1 , and the second
inequality from the definition of Hk . From (3.4) it follows that a C-step results in
a decrease of the sparse LTS objective function, and that a sequence of C-steps
yields convergence to a local minimum in a finite number of steps.
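
A single C-step can be sketched in R as follows. This is only an illustration, not the authors' implementation in the robustHD package: glmnet is used here as the lasso solver (on glmnet's own penalty scale), and the intercept is handled by glmnet's default.

# Sketch of one C-step: lasso fit on the current subset H, then the indices of
# the h observations with the smallest squared residuals form the next subset.
c_step <- function(H, X, y, lambda, h) {
  fit <- glmnet::glmnet(X[H, , drop = FALSE], y[H], alpha = 1, lambda = lambda)
  coefs <- as.numeric(coef(fit))                      # intercept, then the slopes
  r2 <- as.numeric((y - coefs[1] - X %*% coefs[-1])^2)  # residuals on all n points
  sort(order(r2)[1:h])                                # next subset H_{k+1}
}

# Iterating c_step() from an initial subset until the subset no longer changes
# yields a local minimum of the objective (3.1).
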
To increase the chances of arriving at the global minimum, a sufficiently large
number s of initial subsamples H0 should be used, each serving as the starting
point for a sequence of C-steps. Rather than randomly selecting h data
points, any initial subset H0 of size h is constructed from an elemental subset of
size 3 as follows. Draw three observations from the data at random, say, xi1 , xi2
and xi3 . The lasso fit for this elemental subset is then
 
(3.5)  \hat{\beta}_{\{i_1, i_2, i_3\}} = \operatorname{argmin}_{\beta} Q(\{i_1, i_2, i_3\}, \beta),

and the initial subset H0 is then given by the indices of the h observations with the
smallest squared residuals with respect to the fit in (3.5). The nonsparse FAST-LTS
algorithm uses elemental subsets of size p, since any OLS regression requires at
least as many observations as the dimension p. This would make the algorithm
inapplicable if p > n. Fortunately, the lasso is already properly defined for samples
of size 3, even for large values of p. Moreover, from a robustness point of view,
using only three observations is optimal, as it ensures the highest probability of
not including outliers in the elemental set. It is important to note that the elemental
subsets of size 3 are only used to construct the initial subsets of size h for the
C-step algorithms. All C-steps are performed on subsets of size h.

In this paper, we used s = 500 initial subsets. Using a larger number of subsets
did not lead to better prediction performance in the case of the NCI data. Following
the strategy advised in Rousseeuw and Van Driessen (2006), we perform only two
C-steps for all s subsets and retain the s1 = 10 subsamples with the lowest values of
the objective function (3.1). For the reduced number of subsets s1 , further C-steps
are performed until convergence. This is a standard strategy for C-step algorithms
to decrease computation time.
Estimation of an intercept: the regression model in (1.1) does not contain an
intercept. It is indeed common to assume that the variables are mean-centered and
the predictor variables are standardized before applying the lasso. However, com-
puting the means and standard deviations over all observations does not result in
a robust method, so we take a different approach. Each time the sparse LTS algo-
rithm computes a lasso fit on a subsample of size h, the variables are first centered
and the predictors are standardized using the means and standard deviations com-
puted from the respective subsample. The resulting procedure then minimizes (1.4)
with squared residuals r²_i = (y_i − β_0 − x_i′β)², where β_0 stands for the intercept.
We verified that adding an intercept to the model has no impact on the breakdown
point of the sparse LTS estimator of β.

4. Reweighted sparse LTS estimator. Let α denote the proportion of ob-


servations from the full sample to be retained in each subsample, that is, h =
⌊(n + 1)α⌋. In this paper we take α = 0.75. Then (1 − α) may be interpreted
as an initial guess of the proportion of outliers in the data. This initial guess is typ-
ically rather conservative to ensure that outliers do not impact the results, and may
therefore result in a loss of statistical efficiency. To increase efficiency, a reweight-
ing step that downweights outliers detected by the sparse LTS estimator can be
performed.
Under the normal error model, observations with standardized residuals larger
than a certain quantile of the standard normal distribution may be declared as out-
liers. Since the sparse LTS estimator—like the lasso—is biased, we need to center
the residuals. A natural estimate for the center of the residuals is
(4.1)  \hat{\mu}_{raw} = \frac{1}{h} \sum_{i \in H_{opt}} r_i,

where r_i = y_i − x_i′β̂_sparseLTS and H_opt is the optimal subset from (3.3). Then the
residual scale estimate associated with the raw sparse LTS estimator is given by

(4.2)  \hat{\sigma}_{raw} = k_{\alpha} \sqrt{\frac{1}{h} \sum_{i=1}^{h} (r^2_c)_{i:n}},

with squared centered residuals r²_c = ((r_1 − μ̂_raw)², . . . , (r_n − μ̂_raw)²)′, and

(4.3)  k_{\alpha} = \left( \frac{1}{\alpha} \int_{-\Phi^{-1}((\alpha+1)/2)}^{\Phi^{-1}((\alpha+1)/2)} u^2 \, d\Phi(u) \right)^{-1/2},

a factor to ensure that σ̂_raw is a consistent estimate of the standard deviation at the
normal model. This formulation allows us to define the binary weights

(4.4)  w_i = \begin{cases} 1, & \text{if } |(r_i - \hat{\mu}_{raw})/\hat{\sigma}_{raw}| \le \Phi^{-1}(1 - \delta), \\ 0, & \text{if } |(r_i - \hat{\mu}_{raw})/\hat{\sigma}_{raw}| > \Phi^{-1}(1 - \delta), \end{cases} \qquad i = 1, \ldots, n.
In this paper δ = 0.0125 is used such that 2.5% of the observations are expected to
be flagged as outliers in the normal model, which is a typical choice.
The reweighted sparse LTS estimator is given by the weighted lasso fit

(4.5)  \hat{\beta}_{reweighted} = \operatorname{argmin}_{\beta} \sum_{i=1}^{n} w_i (y_i - x_i'\beta)^2 + \lambda n_w \sum_{j=1}^{p} |\beta_j|,

with n_w = \sum_{i=1}^{n} w_i the sum of the weights. With the choice of weights given in (4.4),
the reweighted sparse LTS is the lasso fit based on the observations not flagged
as outliers. Of course, other weighting schemes could be considered. Using the
residual center estimate
(4.6)  \hat{\mu}_{reweighted} = \frac{1}{n_w} \sum_{i=1}^{n} w_i (y_i - x_i'\hat{\beta}_{reweighted}),
the residual scale estimate of the reweighted sparse LTS estimator is given by


(4.7)  \hat{\sigma}_{reweighted} = k_{\alpha_w} \sqrt{\frac{1}{n_w} \sum_{i=1}^{n} w_i (y_i - x_i'\hat{\beta}_{reweighted} - \hat{\mu}_{reweighted})^2},

where k_{α_w} is the consistency factor from (4.3) with α_w = n_w/n.


Note that this reweighting step is conceptually different from the adaptive lasso
by Zou (2006). While the adaptive lasso derives individual penalties on the predic-
tors from initial coefficient estimates, the reweighted sparse LTS aims to include
all nonoutlying observations into fitting the model.
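
For concreteness, the reweighting step can be sketched in R as below. This is an illustration under the assumption that the raw coefficients beta_raw, the optimal subset H_opt, the predictor matrix X, the response y and the penalty lambda are already available from a raw fit; the intercept is omitted for brevity, and glmnet (on its own penalty scale) stands in for the weighted lasso fit in (4.5).

# Sketch of the reweighting step (4.1)-(4.5); beta_raw, H_opt, X, y and lambda
# are assumed to come from the raw sparse LTS fit, without an intercept.
alpha <- 0.75; delta <- 0.0125
r <- as.numeric(y - X %*% beta_raw)                  # raw residuals
mu_raw <- mean(r[H_opt])                             # residual center, (4.1)
q <- qnorm((alpha + 1) / 2)                          # quantile appearing in (4.3)
# closed form of (4.3): integral of u^2 dPhi(u) over [-q, q] equals alpha - 2*q*dnorm(q)
k_alpha <- sqrt(alpha / (alpha - 2 * q * dnorm(q)))
h <- length(H_opt)
rc2 <- (r - mu_raw)^2                                # squared centered residuals
sigma_raw <- k_alpha * sqrt(mean(sort(rc2)[1:h]))    # residual scale, (4.2)
w <- as.numeric(abs((r - mu_raw) / sigma_raw) <= qnorm(1 - delta))  # weights (4.4)
# Reweighted sparse LTS: lasso on the observations not flagged as outliers, cf. (4.5)
keep <- which(w == 1)
fit_rw <- glmnet::glmnet(X[keep, ], y[keep], alpha = 1, lambda = lambda)
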

5. Choice of the penalty parameter. In practical data analysis, a suitable


value of the penalty parameter λ is not known in advance. We propose to select λ
by optimizing the Bayes Information Criterion (BIC), or the estimated prediction
performance via cross-validation. In this paper we use the BIC since it requires
less computational effort. The BIC of a given model estimated with shrinkage
parameter λ is given by
(5.1)  \mathrm{BIC}(\lambda) = \log(\hat{\sigma}) + df(\lambda)\,\frac{\log(n)}{n},
where σ̂ denotes the corresponding residual scale estimate, (4.2) or (4.7), and
df (λ) are the degrees of freedom of the model. The degrees of freedom are given
by the number of nonzero estimated parameters in β̂ [see Zou, Hastie and Tibshi-
rani (2007)].
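
As a short illustration, the criterion in (5.1) is straightforward to compute once the residual scale estimate and the coefficient vector of a fit are available; the sketch below assumes these are passed in directly.

# Sketch of the BIC in (5.1): sigma_hat is the residual scale estimate from
# (4.2) or (4.7), beta_hat the estimated coefficients, n the sample size.
bic_sparse <- function(sigma_hat, beta_hat, n) {
  df <- sum(beta_hat != 0)   # degrees of freedom: number of nonzero coefficients
  log(sigma_hat) + df * log(n) / n
}
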

As an alternative to the BIC, cross-validation can be used. To prevent outliers


from affecting the choice of λ, a robust prediction loss function should be used.
A natural choice is the root trimmed mean squared prediction error (RTMSPE)
with the same trimming proportion as for computing the sparse LTS. In k-fold
cross-validation, the data are split randomly into k blocks of approximately equal
size. Each block is left out once, the model is fit on the remaining blocks, and the
left-out block is used as test data. In this manner, for a given value of λ, a prediction is obtained
for each observation in the sample. Denote the vector of squared prediction errors
e² = (e²_1, . . . , e²_n)′. Then

(5.2)  \mathrm{RTMSPE}(\lambda) = \sqrt{\frac{1}{h} \sum_{i=1}^{h} (e^2)_{i:n}}.
To reduce variability, the RTMSPE may be averaged over a number of different
random splits of the data.
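
A minimal sketch of (5.2): given the vector of cross-validated squared prediction errors, the RTMSPE is the square root of the mean of its h smallest entries.

# Sketch of the RTMSPE in (5.2) from squared prediction errors e2.
rtmspe <- function(e2, h) sqrt(mean(sort(e2)[1:h]))
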
The selected λ then minimizes BIC(λ) or RTMSPE(λ) over a grid of values
in the interval [0, λ̂0 ]. We take a grid with steps of size 0.025λ̂0 , where λ̂0 is an
estimate of the shrinkage parameter λ0 that would shrink all parameters to zero. If
p > n, 0 is of course excluded from the grid. For the lasso solution we take
(5.3)  \hat{\lambda}_0 = \frac{2}{n} \max_{j \in \{1, \ldots, p\}} |\mathrm{Cor}(y, x_j)|,
exactly the same as given and motivated in Efron et al. (2004). In (5.3), Cor(y, xj )
stands for the Pearson correlation between y and the j th column of the design ma-
trix X. For sparse LTS, we need a robust estimate λ̂0 . We propose to replace the
Pearson correlation in (5.3) by the robust correlation based on bivariate winsoriza-
tion of the data [see Khan, Van Aelst and Zamar (2007)].
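
A sketch of the resulting grid construction is given below; it takes the largest correlation in absolute value, and the correlation function is left as an argument so that the Pearson correlation of (5.3) can be swapped for a robust correlation such as the one based on bivariate winsorization (corFun is a placeholder of ours, not a function from the paper).

# Sketch: estimate lambda_0 as in (5.3) and build the grid of candidate values.
# corFun defaults to the Pearson correlation; a robust correlation estimate can
# be plugged in for sparse LTS.
lambda0 <- function(X, y, corFun = cor) {
  (2 / nrow(X)) * max(abs(apply(X, 2, corFun, y)))
}
lambda_grid <- seq(0.025, 1, by = 0.025) * lambda0(X, y)   # 0 excluded when p > n
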

6. Simulation study. This section presents a simulation study for comparing


the performance of various sparse estimators. The simulations are performed in R
[R Development Core Team (2011)] with package simFrame [Alfons, Templ and
Filzmoser (2010), Alfons (2012a)], which is a general framework for simulation
studies in statistics. Sparse LTS is evaluated for the subset size h = ⌊(n + 1)0.75⌋.
Both the raw and the reweighted version (see Section 4) are considered. We prefer
to take a relatively large trimming proportion to guarantee a breakdown point of
25%. Adding the reweighting step will then increase the statistical efficiency of
sparse LTS. We make a comparison with the lasso, the LAD-lasso and robust least
angle regression (RLARS), discussed in the introduction. We selected the LAD-
lasso estimator as a representative of the class of penalized M-estimators, since it
does not need an initial residual scale estimator.
For every generated sample, an optimal value of the shrinkage parameter λ is
selected. The penalty parameters for sparse LTS and the lasso are chosen using
the BIC, as described in Section 5. For the LAD-lasso, we estimate the shrinkage

parameter in the same way as in Wang, Li and Jiang (2007). However, if p > n, we
cannot use their approach and use the BIC as in (5.1), with the mean absolute value
of residuals (multiplied by a consistency factor) as scale estimate. For RLARS, we
add the sequenced variables to the model in a stepwise fashion and fit robust MM-
regressions [Yohai (1987)], as advocated in Khan, Van Aelst and Zamar (2007).
The optimal model when using RLARS is then again selected via BIC, now using
the robust scale estimate resulting from the MM-regression.

6.1. Sampling schemes. The first configuration is a latent factor model taken
from Khan, Van Aelst and Zamar (2007) and covers the case of n > p. From k = 6
latent independent standard normal variables l1 , . . . , lk and an independent normal
error variable e with standard deviation σ , the response variable y is constructed
as
y := l1 + · · · + lk + e,

where σ is chosen so that the signal-to-noise ratio is 3, that is, σ = √k/3. With
independent standard normal variables e1 , . . . , ep , a set of p = 50 candidate pre-
dictors is then constructed as
x_j := l_j + τ e_j,            j = 1, . . . , k,
x_{k+1} := l_1 + δ e_{k+1},
x_{k+2} := l_1 + δ e_{k+2},
⋮
x_{3k−1} := l_k + δ e_{3k−1},
x_{3k} := l_k + δ e_{3k},
x_j := e_j,                    j = 3k + 1, . . . , p,
where τ = 0.3 and δ = 5 so that x1 , . . . , xk are low-noise perturbations of the la-
tent variables, xk+1 , . . . , x3k are noise covariates that are correlated with the latent
variables, and x3k+1 , . . . , xp are independent noise covariates. The number of ob-
servations is set to n = 150.
The second configuration covers the case of moderate high-dimensional data.
We generate n = 100 observations from a p-dimensional normal distribution
N(0, Σ), with p = 1000. The covariance matrix Σ = (Σ_ij)_{1≤i,j≤p} is given by
Σ_ij = 0.5^{|i−j|}, creating correlated predictor variables. Using the coefficient vec-
tor β = (βj )1≤j ≤p with β1 = β7 = 1.5, β2 = 0.5, β4 = β11 = 1, and βj = 0 for
j ∈ {1, . . . , p} \ {1, 2, 4, 7, 11}, the response variable is generated according to the
regression model (1.1), where the error terms follow a normal distribution with
σ = 0.5.
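
As an illustration, the second configuration can be generated in R along the following lines (using MASS::mvrnorm for the multivariate normal draws; this is a sketch, not the simFrame-based code used by the authors).

# Sketch of the second sampling scheme: n = 100, p = 1000, Sigma_ij = 0.5^|i-j|.
library(MASS)
set.seed(1)
n <- 100; p <- 1000
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
beta <- numeric(p)
beta[c(1, 7)] <- 1.5; beta[2] <- 0.5; beta[c(4, 11)] <- 1
y <- as.numeric(X %*% beta + rnorm(n, sd = 0.5))
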
Finally, the third configuration represents a more extreme case of high-
dimensional data with n = 100 observations and p = 20,000 variables. The first

1000 predictor variables are generated from a multivariate normal distribution


N(0, Σ) with Σ_ij = 0.6^{|i−j|}. Furthermore, the remaining 19,000 covariates are
standard normal variables. Then the response variable is generated according
to (1.1), where the coefficient vector β = (βj )1≤j ≤p is given by βj = 1 for
1 ≤ j ≤ 10 and βj = 0 for 11 ≤ j ≤ p, and the error terms follow a standard
normal distribution.
For each of the three simulation settings, we apply contamination schemes taken
from Khan, Van Aelst and Zamar (2007). To be more precise, we consider the
following:
(1) No contamination.
(2) Vertical outliers: 10% of the error terms in the regression model follow a
normal N(20, σ ) instead of a N(0, σ ).
(3) Leverage points: Same as in 2, but the 10% contaminated observations con-
tain high-leverage values by drawing the predictor variables from independent
N(50, 1) distributions.
In addition, we investigate a fourth and more stressful outlier scenario. Keeping
the contamination level at 10%, outliers in the predictor variables are drawn from
independent N(10, 0.01) distributions. Note the small standard deviation such that
the outliers form a dense cluster. Let x̃i denote such a leverage point. Then the
values of the response variable of the contaminated observations are generated by
ỹ_i = η x̃_i′γ with γ = (−1/p)_{1≤j≤p}. The direction of γ is very different from
that of the true regression parameter β in the following ways. First, γ is not sparse.
Second, all predictors have a negative effect on the response in the contaminated
observations, whereas the variables with nonzero coefficients have a positive effect
on the response in the good data points. Furthermore, the parameter η controls the
magnitude of the leverage effect and is varied from 1 to 25 in five equidistant steps.
This results in a total of 12 different simulation schemes, which we consider
representative of the many different simulation designs we tried out. The first
scheme has n > p, the second setting has p > n, and the third setting has p ≫ n.
The choices for the contamination schemes are standard, inducing both vertical
outliers and leverage points in the samples.
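
Continuing the data-generation sketch above, the vertical outliers and leverage points of contamination settings 2 and 3 could be added as follows; this reflects our reading of the schemes, not the authors' code.

# Sketch: contaminate 10% of the observations generated for the second scheme.
m <- ceiling(0.1 * n)
out <- seq_len(m)                                  # contaminated observations
# (2) vertical outliers: error terms from N(20, sigma) instead of N(0, sigma)
y[out] <- as.numeric(X[out, ] %*% beta) + rnorm(m, mean = 20, sd = 0.5)
# (3) leverage points: additionally draw the predictors from independent N(50, 1)
X[out, ] <- matrix(rnorm(m * p, mean = 50, sd = 1), nrow = m)
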

6.2. Performance measures. Since one of the aims of sparse model estimation
is to improve prediction performance, the different estimators are evaluated by the
root mean squared prediction error (RMSPE). For this purpose, n additional ob-
servations from the respective sampling schemes (without outliers) are generated
as test data in each simulation run. Then the RMSPE is given by

\mathrm{RMSPE}(\hat{\beta}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i^* - x_i^{*\prime}\hat{\beta})^2},
where yi∗ and x∗i , i = 1, . . . , n, denote the observations of the response and pre-
dictor variables in the test data, respectively. The RMSPE of the oracle estimator,

which uses the true coefficient values β, is computed as a benchmark for the eval-
uated methods. We report average RMSPE over all simulation runs.
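
A one-line sketch of this evaluation in R:

# Sketch: root mean squared prediction error on test data (X_test, y_test).
rmspe <- function(beta_hat, X_test, y_test) {
  sqrt(mean((y_test - X_test %*% beta_hat)^2))
}
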
Concerning sparsity, the estimated models are evaluated by the false positive
rate (FPR) and the false negative rate (FNR). A false positive is a coefficient that
is zero in the true model, but is estimated as nonzero. Analogously, a false nega-
tive is a coefficient that is nonzero in the true model, but is estimated as zero. In
mathematical terms, the FPR and FNR are defined as
\mathrm{FPR}(\hat{\beta}) = \frac{|\{j \in \{1,\ldots,p\} : \hat{\beta}_j \ne 0 \wedge \beta_j = 0\}|}{|\{j \in \{1,\ldots,p\} : \beta_j = 0\}|},

\mathrm{FNR}(\hat{\beta}) = \frac{|\{j \in \{1,\ldots,p\} : \hat{\beta}_j = 0 \wedge \beta_j \ne 0\}|}{|\{j \in \{1,\ldots,p\} : \beta_j \ne 0\}|}.
Both FPR and FNR should be as small as possible for a sparse estimator and are av-
eraged over all simulation runs. Note that false negatives in general have a stronger
effect on the RMSPE than false positives. A false negative means that important
information is not used for prediction, whereas a false positive merely adds a bit
of variance.
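
Both rates are easy to compute from the estimated and true coefficient vectors; a minimal sketch:

# Sketch: false positive and false negative rates of beta_hat relative to beta.
fpr <- function(beta_hat, beta) sum(beta_hat != 0 & beta == 0) / sum(beta == 0)
fnr <- function(beta_hat, beta) sum(beta_hat == 0 & beta != 0) / sum(beta != 0)
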

6.3. Simulation results. In this subsection the simulation results for the differ-
ent data configurations are presented and discussed.

6.3.1. Results for the first sampling scheme. The simulation results for the first
data configuration are displayed in Table 1. Keep in mind that this configuration is
exactly the same as in Khan, Van Aelst and Zamar (2007), and that the contamina-
tion settings are a subset of the ones applied in their paper. In the scenario without
contamination, LAD-lasso, RLARS and lasso show excellent performance with
low RMSPE and FPR. The prediction performance of sparse LTS is good, but it

TABLE 1
Results for the first simulation scheme, with n = 150 and p = 50. Root mean squared prediction
error (RMSPE), the false positive rate (FPR) and the false negative rate (FNR), averaged over 500
simulation runs, are reported for every method

                     No contamination       Vertical outliers       Leverage points
Method              RMSPE   FPR   FNR      RMSPE   FPR   FNR      RMSPE   FPR   FNR

Lasso                1.18   0.10  0.00      2.44   0.54  0.09      2.20   0.00  0.16
LAD-lasso            1.13   0.05  0.00      1.15   0.07  0.00      1.27   0.18  0.00
RLARS                1.14   0.07  0.00      1.12   0.03  0.00      1.22   0.09  0.00
Raw sparse LTS       1.29   0.34  0.00      1.26   0.32  0.00      1.26   0.26  0.00
Sparse LTS           1.24   0.22  0.00      1.22   0.25  0.00      1.22   0.18  0.00
Oracle               0.82                   0.82                   0.82

FIG. 1. Root mean squared prediction error (RMSPE) for the first simulation scheme, with n = 150
and p = 50, and for the fourth contamination setting, averaged over 500 simulation runs. Lines for
raw and reweighted sparse LTS almost coincide.

has a larger FPR than the other three methods. The reweighting step clearly im-
proves the estimates, which is reflected in the lower values for RMSPE and FPR.
Furthermore, none of the methods suffer from false negatives.
In the case of vertical outliers, the nonrobust lasso is clearly influenced by the
outliers, reflected in the much higher RMSPE and FPR. RLARS, LAD-lasso and
sparse LTS, on the other hand, keep their excellent behavior. Sparse LTS still has a
considerable tendency toward false positives, but the reweighting step is a signifi-
cant improvement over the raw estimator.
When leverage points are introduced in addition to the vertical outliers, the
performance of RLARS, sparse LTS and LAD-lasso is comparable. The FPR of
RLARS and LAD-lasso slightly increased, whereas the FPR of sparse LTS slightly
decreased. The LAD-lasso still performs well, and even the lasso performs better
than in the case of only vertical outliers. This suggests that the leverage points in
this example do not have a bad leverage effect.
In Figure 1 the results for the fourth contamination setting are shown. The
RMSPE is thereby plotted as a function of the parameter η. With increasing η,
the RMSPE of the lasso and the LAD-lasso increases. RLARS has a considerably
higher RMSPE than sparse LTS for lower values of η, but the RMSPE gradually
decreases with increasing η. However, the RMSPE of sparse LTS remains the low-
est, thus, it has the best overall performance.

6.3.2. Results for the second sampling scheme. Table 2 contains the simula-
tion results for the moderate high-dimensional data configuration. In the scenario
without contamination, RLARS and the lasso perform best with very low RMSPE
and almost perfect FPR and FNR. Also, the LAD-lasso has excellent prediction

TABLE 2
Results for the second simulation scheme, with n = 100 and p = 1000. Root mean squared
prediction error (RMSPE), the false positive rate (FPR) and the false negative rate (FNR), averaged
over 500 simulation runs, are reported for every method

                     No contamination       Vertical outliers       Leverage points
Method              RMSPE   FPR   FNR      RMSPE   FPR   FNR      RMSPE   FPR   FNR

Lasso                0.62   0.00  0.00      2.56   0.08  0.16      2.53   0.00  0.71
LAD-lasso            0.66   0.08  0.00      0.82   0.00  0.01      1.17   0.08  0.01
RLARS                0.60   0.01  0.00      0.73   0.00  0.10      0.92   0.02  0.09
Raw sparse LTS       0.81   0.02  0.00      0.73   0.02  0.00      0.73   0.02  0.00
Sparse LTS           0.74   0.01  0.00      0.69   0.01  0.00      0.71   0.02  0.00
Oracle               0.50                   0.50                   0.50

performance, followed by sparse LTS. The LAD-lasso leads to a slightly higher


FPR than the other methods, though. When vertical outliers are added, RLARS
still has excellent prediction performance despite some false negatives. We see
that the sparse LTS performs best here. In addition, the prediction performance of
the nonrobust lasso already suffers greatly from the vertical outliers. In the sce-
nario with additional leverage points, sparse LTS remains stable and is still the
best. For RLARS, sparsity behavior according to FPR and FNR does not change
significantly either, but there is a small increase in the RMSPE. On the other hand,
LAD-lasso already has a considerably larger RMSPE than sparse LTS, and again
a slightly higher FPR than the other methods. Furthermore, the lasso is still highly
influenced by the outliers, which is reflected in a very high FNR and poor predic-
tion performance.
The results for the fourth contamination setting are presented in Figure 2. As
for the previous simulation scheme, the RMSPE for the lasso and the LAD-lasso
is increasing with increasing parameter η. The RMSPE for RLARS, however, is
gradually decreasing. Sparse LTS shows particularly interesting behavior: the RM-
SPE is close to the oracle at first, then there is a kink in the curve (with the value of
the RMSPE being in between those for the LAD-lasso and the lasso), after which
the RMSPE returns to low values close to the oracle. In any case, for most of the
investigated values of η, sparse LTS has the best performance.

6.3.3. Results for the third sampling scheme. Table 3 contains the simulation
results for the more extreme high-dimensional data configuration. Note that the
LAD-lasso was no longer computationally feasible with such a large number of
variables. In addition, the number of simulation runs was reduced from 500 to 100
to lower the computational effort.
In the case without contamination, the sparse LTS suffers from an efficiency
problem, which is reflected in larger values for RMSPE and FNR than for the

FIG. 2. Root mean squared prediction error (RMSPE) for the second simulation scheme, with
n = 100 and p = 1000, and for the fourth contamination setting, averaged over 500 simulation
runs. Lines for raw and reweighted sparse LTS almost coincide.

other methods. The lasso and RLARS have considerably better performance in
this case. With vertical outliers, the RMSPE for the lasso increases greatly due to
many false negatives. Also, RLARS has a larger FNR than sparse LTS, resulting
in a slightly lower RMSPE for the reweighted version of the latter. When leverage
points are introduced, sparse LTS clearly exhibits the lowest RMSPE and FNR.
Furthermore, the lasso results in a very large FNR.
Figure 3 shows the results for the fourth contamination setting. Most interest-
ingly, the RMSPE of RLARS in this case keeps increasing in the beginning and
even goes above that of the lasso, before dropping continuously in
the remaining steps. Sparse LTS again shows a kink in the curve for the RMSPE,
but clearly performs best.

TABLE 3
Results for the third simulation scheme, with n = 100 and p = 20,000. Root mean squared
prediction error (RMSPE), the false positive rate (FPR) and the false negative rate (FNR), averaged
over 100 simulation runs, are reported for every method

                     No contamination        Vertical outliers        Leverage points
Method              RMSPE   FPR    FNR      RMSPE   FPR    FNR      RMSPE   FPR    FNR

Lasso                1.43   0.000  0.00      5.19   0.004  0.49      5.57   0.000  0.83
RLARS                1.54   0.001  0.00      2.53   0.000  0.38      3.34   0.001  0.45
Raw sparse LTS       3.00   0.001  0.19      2.59   0.002  0.11      2.59   0.002  0.10
Sparse LTS           2.88   0.001  0.16      2.49   0.002  0.10      2.57   0.002  0.09
Oracle               1.00                    1.00                    1.00

FIG. 3. Root mean squared prediction error (RMSPE) for the third simulation scheme, with
n = 100 and p = 20,000, and for the fourth contamination setting, averaged over 100 simulation
runs. Lines for raw and reweighted sparse LTS almost coincide.

6.3.4. Summary of the simulation results. Sparse LTS shows the best overall
performance in this simulation study, if the reweighted version is taken. Concern-
ing the other investigated methods, RLARS also performs well, but suffers some-
times from an increased percentage of false negatives under contamination. It is
also confirmed that the lasso is not robust to outliers. The LAD-lasso still sustains
vertical outliers, but is not robust against bad leverage points.

7. NCI-60 cancer cell panel. In this section the sparse LTS estimator is com-
pared to the competing methods in an application to the cancer cell panel of the
National Cancer Institute. It consists of data on 60 human cancer cell lines and
can be downloaded via the web application CellMiner (http://discover.nci.nih.gov/
cellminer/). We regress protein expression on gene expression data. The gene ex-
pression data were obtained with an Affymetrix HG-U133A chip and normalized
with the GCRMA method, resulting in a set of p = 22,283 predictors. The protein
expressions based on 162 antibodies were acquired via reverse-phase protein lysate
arrays and log2 transformed. One observation had to be removed since all values
were missing in the gene expression data, reducing the number of observations to
n = 59. More details on how the data were obtained can be found in Shankavaram
et al. (2007). Furthermore, Lee et al. (2011) also use this data for regression anal-
ysis, but consider only nonrobust methods. They obtain models that still consist of
several hundred to several thousand predictors and are thus difficult to interpret.
Similar to Lee et al. (2011), we first order the protein expression variables ac-
cording to their scale, but use the MAD (median absolute deviation from the me-
dian, multiplied by the consistency factor 1.4826) as a scale estimator instead of
the standard deviation. We show the results for the protein expressions based on

the KRT18 antibody, which constitutes the variable with the largest MAD, serving
as one dependent variable. Hence, our response variable measures the expression
levels of the protein keratin 18, which is known to be persistently expressed in car-
cinomas [Oshima, Baribault and Caulín (1996)]. We compare raw and reweighted
sparse LTS with 25% trimming, lasso and RLARS. As in the simulation study,
the LAD-lasso could not be computed for such a large p. The optimal models are
selected via BIC as discussed in Section 5. The raw sparse LTS estimator thereby
results in a model with 32 genes. In the reweighting step, one more observation
is added to the best subset found by the raw estimator, yielding a model with 33
genes for reweighted sparse LTS (thus also one more gene is selected compared
to the raw estimator). The lasso model is somewhat larger with 52 genes, whereas
the RLARS model is somewhat smaller with 18 genes.
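
The scale-based ordering of the candidate response variables can be sketched as follows; proteins stands for a hypothetical matrix of log2 protein expressions, and base R's mad() already includes the consistency factor 1.4826 by default.

# Sketch: order candidate response variables by their MAD (largest first).
# `proteins` is a placeholder for the matrix of log2 protein expressions.
scales <- apply(proteins, 2, mad)        # mad() uses the constant 1.4826 by default
proteins <- proteins[, order(scales, decreasing = TRUE)]
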
Sparse LTS and the lasso have three selected genes in common, one of which
is KRT8. The product of this gene, the protein keratin 8, typically forms an in-
termediate filament with keratin 18 such that their expression levels are closely
linked [e.g., Owens and Lane (2003)]. However, the larger model of the lasso is
much more difficult to interpret. Two of the genes selected by the lasso are not
even recorded in the Gene database [Maglott et al. (2005)] of the National Cen-
ter for Biotechnology Information (NCBI). The sparse LTS model is considerably
smaller and easier to interpret. For instance, the gene expression level of MSLN,
whose product mesothelin is overexpressed in various forms of cancer [Hassan,
Bera and Pastan (2004)], has a positive effect on the protein expression level of
keratin 18.
Concerning prediction performance, the root trimmed mean squared predic-
tion error (RTMSPE) is computed as in (5.2) via leave-one-out cross-validation
(so k = n). Table 4 reports the RTMSPE for the considered methods. Sparse LTS
clearly shows the smallest RTMSPE, followed by RLARS and the lasso. In addi-
tion, sparse LTS detects 13 observations as outliers, showing the need for a robust
procedure. Further analysis revealed that including those 13 observations changes
the correlation structure of the predictor variables with the response. Consequently,

TABLE 4
Root trimmed mean squared prediction error (RTMSPE) for protein expressions based on the
KRT18 antibody (NCI-60 cancer cell panel data), computed from leave-one-out cross-validation

Method               RTMSPE

Lasso                 1.058
RLARS                 0.936
Raw sparse LTS        0.727
Sparse LTS            0.721

the order in which the genes are added to the model by the lasso algorithm on the
full sample is completely different from the order on the best subset found by
sparse LTS. Leaving out those 13 observations therefore yields more reliable re-
sults for the majority of the cancer cell lines.
It is also worth noting that the models still contain a rather large number of vari-
ables given the small number of observations. For the lasso, it is well known that
it tends to select many noise variables in high dimensions since the same penalty
is applied on all variables. Meinshausen (2007) therefore proposed a relaxation of
the penalty for the selected variables of an initial lasso fit. Adding such a relax-
ation step to the sparse LTS procedure may thus be beneficial for large p and is
considered for future work.

8. Computational details and CPU times. All computations are carried out
in R version 2.14.0 [R Development Core Team (2011)] using the packages ro-
bustHD [Alfons (2012b)] for sparse LTS and RLARS, quantreg [Koenker (2011)]
for the LAD-lasso and lars [Hastie and Efron (2011)] for the lasso. Most of sparse
LTS is thereby implemented in C++, while RLARS is an optimized version of
the R code by Khan, Van Aelst and Zamar (2007). Optimization of the RLARS
code was necessary since the original code builds a p × p matrix of robust cor-
relations, which is not computationally feasible for very large p. The optimized
version only stores a q × p matrix, where q is the number of sequenced vari-
ables. Furthermore, the robust correlations are computed with C++ rather than R.
Since computation time is an important practical consideration, Figure 4 dis-
plays computation times of lasso, LAD-lasso, RLARS and sparse LTS in sec-
onds. Note that those are average times over 10 runs based on simulated data with
n = 100 and varying dimension p, obtained on an Intel Xeon X5670 machine. For
sparse LTS and the LAD-lasso, the reported CPU times are averages over a grid

FIG. 4. CPU times (in seconds) for n = 100 and varying p, averaged over 10 runs.

of five values for λ. RLARS is a hybrid procedure, thus, we only report the CPU
times for obtaining the sequence of predictors, but not for fitting the models along
the sequence.
As expected, the computation time of the nonrobust lasso remains very low for
increasing p. Sparse LTS is still reasonably fast up to p ≈ 10,000, but computation
time is a considerable factor if p is much larger than that. However, sparse LTS
remains faster than obtaining the RLARS sequence. A further advantage of the
subsampling algorithm of sparse LTS is that it can easily be parallelized to reduce
computation time on modern multicore computers, which is future work.

9. Conclusions and discussion. Least trimmed squares (LTS) is a robust re-


gression method frequently used in practice. Nevertheless, it does not allow for
sparse model estimates and cannot be applied to high-dimensional data with p > n.
This paper introduced the sparse LTS estimator, which overcomes these two issues
simultaneously by adding an L1 penalty to the LTS objective function. Simula-
tion results and a real data application to protein and gene expression data of the
NCI-60 cancer cell panel illustrated the excellent performance of sparse LTS and
showed that it performs as well as or better than robust variable selection methods
such as RLARS. In addition, an advantage of sparse LTS over algorithmic proce-
dures such as RLARS is that the objective function allows for theoretical investi-
gation of its statistical properties. As such, we could derive the breakdown point
of the sparse LTS estimator. However, it should be noted that efficiency is an issue
with sparse LTS. A reweighting step can thereby lead to a substantial improvement
in efficiency, as shown in the simulation study.
In the paper, an L1 penalization was imposed on the regression parameter, as for
the lasso. Other choices for the penalty are possible. For example, an L2 penalty
leads to ridge regression. A robust version of ridge regression was recently pro-
posed by Maronna (2011), using L2 penalized MM-estimators. Even though the
resulting estimates are not sparse, prediction accuracy is improved by shrinking
the coefficients, and the computational issues with high-dimensional robust esti-
mators are overcome due to the regularization. Another possible choice for the
penalty function is the smoothly clipped absolute deviation penalty (SCAD) pro-
posed by Fan and Li (2001). It satisfies the mathematical conditions for sparsity
but results in a more difficult optimization problem than the lasso. Still, a robust
version of SCAD can be obtained by optimizing the associated objective function
over trimmed samples instead of over the full sample.
There are several other open questions that we leave for future research. For
instance, we did not provide any asymptotics for sparse LTS, as was, for example,
done for penalized M-estimators in Germain and Roueff (2010). Potentially, sparse
LTS could be used as an initial estimator for computing penalized M-estimators.
All in all, the results presented in this paper suggest that sparse LTS is a valuable
addition to the statistics researcher’s toolbox. The sparse LTS estimator has an
intuitively appealing definition and is related to the popular least trimmed squares

estimator of robust regression. It performs model selection, outlier detection and


robust estimation simultaneously, and is applicable if the dimension is larger than
the sample size.

APPENDIX: PROOF OF BREAKDOWN POINT

PROOF OF THEOREM 1. In this proof the L1 norm of a vector β is denoted as
‖β‖_1 and the Euclidean norm as ‖β‖_2. Since these norms are topologically equiv-
alent, there exists a constant c_1 > 0 such that ‖β‖_1 ≥ c_1‖β‖_2 for all vectors β. The
proof is split into two parts.
First, we prove that ε*(β̂; Z) ≥ (n − h + 1)/n. Replace the last m ≤ n − h observations,
resulting in the contaminated sample Z̃. Then there are still n − m ≥ h good ob-
servations in Z̃. Let M_y = max_{1≤i≤n} |y_i| and M_{x1} = max_{1≤i≤n} |x_{i1}|. For the case
β_j = 0, j = 1, . . . , p, the value of the objective function is given by

Q(0) = \sum_{i=1}^{h} (\rho(\tilde{y}))_{i:n} \le \sum_{i=1}^{h} (\rho(y))_{i:n} \le h\rho(M_y).
Now consider any β with ‖β‖_2 ≥ M := (hρ(M_y) + 1)/(λc_1). For the value of the
objective function, it holds that

Q(β) ≥ λ‖β‖_1 ≥ λc_1‖β‖_2 ≥ hρ(M_y) + 1 > Q(0).

Since Q(β̂) ≤ Q(0), we conclude that ‖β̂(Z̃)‖_2 ≤ M, where M does not depend
on the outliers. This concludes the first part of the proof.
Second, we prove that ε*(β̂; Z) ≤ (n − h + 1)/n. Move the last m = n − h + 1 obser-
vations of Z to the position z(γ, τ) = (x(τ)′, y(γ, τ)) = ((τ, 0, . . . , 0), γτ) with
γ, τ > 0, and denote Z_{γ,τ} the resulting contaminated sample. Assume that there
exists a constant M such that

(A.1)  \sup_{\tau, \gamma} \|\hat{\beta}(Z_{\gamma,\tau})\|_2 \le M,

that is, there is no breakdown. We will show that this leads to a contradiction.
Let β_γ = (γ, 0, . . . , 0)′ ∈ R^p with γ = M + 2 and define τ > 0 such that ρ(τ) ≥
max(h − m, 0)ρ(M_y + γM_{x1}) + hλγ + 1. Note that τ is always well defined due to
the assumptions on ρ, in particular, since ρ(∞) = ∞. Then the objective function
is given by

Q(\beta_\gamma) =
\begin{cases}
\sum_{i=1}^{h-m} \big(\rho(y - X\beta_\gamma)\big)_{i:(n-m)} + h\lambda|\gamma|, & \text{if } h > m, \\
h\lambda|\gamma|, & \text{else},
\end{cases}

since the residuals with respect to the outliers are all zero. Hence,

(A.2)  Q(\beta_\gamma) \le \max(h - m, 0)\rho(M_y + \gamma M_{x1}) + h\lambda\gamma \le \rho(\tau) - 1.

Furthermore, for β = (β_1, . . . , β_p)′ with ‖β‖_2 ≤ γ − 1 we have

Q(β) ≥ ρ(γτ − τβ_1),

since at least one outlier will be in the set of the smallest h residuals. Now β_1 ≤
‖β‖_2 ≤ γ − 1, so that

(A.3)  Q(β) ≥ ρ(τ(γ − β_1)) ≥ ρ(τ),

since ρ is nondecreasing.
Combining (A.2) and (A.3) leads to

‖β̂(Z_{γ,τ})‖_2 ≥ γ − 1 = M + 1,

which contradicts the assumption (A.1). Hence, there is breakdown. □

Acknowledgments. We would like to thank the Editor and two anonymous


referees for their constructive remarks that led to an improvement of the paper.

REFERENCES
ALFONS, A. (2012a). simFrame: Simulation framework. R package version 0.5.0.
ALFONS, A. (2012b). robustHD: Robust methods for high-dimensional data. R package version 0.1.0.
ALFONS, A., TEMPL, M. and FILZMOSER, P. (2010). An object-oriented framework for statistical simulation: The R package simFrame. Journal of Statistical Software 37 1–36.
EFRON, B., HASTIE, T., JOHNSTONE, I. and TIBSHIRANI, R. (2004). Least angle regression. Ann. Statist. 32 407–499. MR2060166
FAN, J. and LI, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360. MR1946581
GERMAIN, J.-F. and ROUEFF, F. (2010). Weak convergence of the regularization path in penalized M-estimation. Scand. J. Stat. 37 477–495. MR2724509
GERTHEISS, J. and TUTZ, G. (2010). Sparse modeling of categorial explanatory variables. Ann. Appl. Stat. 4 2150–2180. MR2829951
HASSAN, R., BERA, T. and PASTAN, I. (2004). Mesothelin: A new target for immunotherapy. Clin. Cancer Res. 10 3937–3942.
HASTIE, T. and EFRON, B. (2011). lars: Least angle regression, lasso and forward stagewise. R package version 0.9-8.
KHAN, J. A., VAN AELST, S. and ZAMAR, R. H. (2007). Robust linear model selection based on least angle regression. J. Amer. Statist. Assoc. 102 1289–1299. MR2412550
KNIGHT, K. and FU, W. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28 1356–1378. MR1805787
KOENKER, R. (2011). quantreg: Quantile regression. R package version 4.67.
LEE, D., LEE, W., LEE, Y. and PAWITAN, Y. (2011). Sparse partial least-squares regression and its applications to high-throughput data analysis. Chemometrics and Intelligent Laboratory Systems 109 1–8.
LI, G., PENG, H. and ZHU, L. (2011). Nonconcave penalized M-estimation with a diverging number of parameters. Statist. Sinica 21 391–419. MR2796868
MAGLOTT, D., OSTELL, J., PRUITT, K. D. and TATUSOVA, T. (2005). Entrez gene: Gene-centered information at NCBI. Nucleic Acids Res. 33 D54–D58.
MARONNA, R. A. (2011). Robust ridge regression for high-dimensional data. Technometrics 53 44–53. MR2791951
MARONNA, R. A., MARTIN, R. D. and YOHAI, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, Chichester. MR2238141
MEINSHAUSEN, N. (2007). Relaxed lasso. Comput. Statist. Data Anal. 52 374–393. MR2409990
MENJOGE, R. S. and WELSCH, R. E. (2010). A diagnostic method for simultaneous feature selection and outlier identification in linear regression. Comput. Statist. Data Anal. 54 3181–3193. MR2727745
OSHIMA, R. G., BARIBAULT, H. and CAULÍN, C. (1996). Oncogenic regulation and function of keratins 8 and 18. Cancer and Metastasis Reviews 15 445–471.
OWENS, D. W. and LANE, E. B. (2003). The quest for the function of simple epithelial keratins. Bioessays 25 748–758.
R DEVELOPMENT CORE TEAM (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
RADCHENKO, P. and JAMES, G. M. (2011). Improved variable selection with forward-lasso adaptive shrinkage. Ann. Appl. Stat. 5 427–448. MR2810404
ROSSET, S. and ZHU, J. (2004). Discussion of "Least angle regression," by B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Ann. Statist. 32 469–475.
ROUSSEEUW, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc. 79 871–880. MR0770281
ROUSSEEUW, P. J. and LEROY, A. M. (2003). Robust Regression and Outlier Detection, 2nd ed. Wiley, Hoboken.
ROUSSEEUW, P. J. and VAN DRIESSEN, K. (2006). Computing LTS regression for large data sets. Data Min. Knowl. Discov. 12 29–45. MR2225526
SHANKAVARAM, U. T., REINHOLD, W. C., NISHIZUKA, S., MAJOR, S., MORITA, D., CHARY, K. K., REIMERS, M. A., SCHERF, U., KAHN, A., DOLGINOW, D., COSSMAN, J., KALDJIAN, E. P., SCUDIERO, D. A., PETRICOIN, E., LIOTTA, L., LEE, J. K. and WEINSTEIN, J. N. (2007). Transcript and protein expression profiles of the NCI-60 cancer cell panel: An integromic microarray study. Molecular Cancer Therapeutics 6 820–832.
SHE, Y. and OWEN, A. B. (2011). Outlier detection using nonconvex penalized regression. J. Amer. Statist. Assoc. 106 626–639. MR2847975
TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
VAN DE GEER, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645. MR2396809
WANG, H., LI, G. and JIANG, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-lasso. J. Bus. Econom. Statist. 25 347–355. MR2380753
WANG, S., NAN, B., ROSSET, S. and ZHU, J. (2011). Random lasso. Ann. Appl. Stat. 5 468–485. MR2810406
WU, T. T. and LANGE, K. (2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2 224–244. MR2415601
YOHAI, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. Ann. Statist. 15 642–656. MR0888431
YUAN, M. and LIN, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67. MR2212574
ZHAO, P. and YU, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7 2541–2563. MR2274449
ZOU, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469
ZOU, H., HASTIE, T. and TIBSHIRANI, R. (2007). On the "degrees of freedom" of the lasso. Ann. Statist. 35 2173–2192. MR2363967

A. ALFONS
C. CROUX
ORSTAT RESEARCH CENTER
FACULTY OF BUSINESS AND ECONOMICS
KU LEUVEN
NAAMSESTRAAT 69
3000 LEUVEN
BELGIUM
E-MAIL: [email protected]
        [email protected]

S. GELPER
ROTTERDAM SCHOOL OF MANAGEMENT
ERASMUS UNIVERSITY ROTTERDAM
BURGEMEESTER OUDLAAN 50
3000 ROTTERDAM
THE NETHERLANDS
E-MAIL: [email protected]
