(1.1) $y_i = x_i'\beta + \varepsilon_i$,

where the regression parameter is $\beta = (\beta_1, \ldots, \beta_p)'$, and the error terms $\varepsilon_i$ have zero expected value. With a penalty parameter $\lambda$, the lasso estimate of $\beta$ is

(1.2) $\hat{\beta}_{\text{lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + n\lambda \sum_{j=1}^{p} |\beta_j|.$
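For concreteness, the objective in (1.2) can be evaluated with a small R function; the function name and arguments below are our own, a sketch rather than any particular implementation.

# Value of the lasso objective (1.2) for a coefficient vector beta,
# an n x p predictor matrix X, a response vector y and penalty lambda.
lasso_objective <- function(beta, X, y, lambda) {
  n <- nrow(X)
  sum((y - X %*% beta)^2) + n * lambda * sum(abs(beta))
}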
The lasso is frequently used in practice since the L1 penalty can shrink some
coefficients to exactly zero, that is, it can produce sparse model estimates that are
highly interpretable. In addition, a fast algorithm for computing the lasso is avail-
able through the framework of least angle regression [LARS; Efron et al. (2004)].
Other algorithms are available as well [e.g., Wu and Lange (2008)]. Due to the pop-
ularity of the lasso, its theoretical properties are well studied in the literature [e.g.,
Knight and Fu (2000), Zhao and Yu (2006), Zou, Hastie and Tibshirani (2007)] and
several modifications have been proposed [e.g., Zou (2006), Yuan and Lin (2006),
Gertheiss and Tutz (2010), Radchenko and James (2011), Wang et al. (2011)].
However, the lasso is not robust to outliers. In this paper we formally show that the
breakdown point of the lasso is 1/n, that is, a single outlier can make the
lasso estimate completely unreliable. Therefore, robust alternatives are needed.
Outliers are observations that deviate from the model assumptions and are a
common problem in the practice of data analysis. For example, for many of the
22,283 predictors in the NCI data set used in Section 7, (log-transformed) re-
sponses on the 59 cell lines showed outliers. Robust alternatives to the least squares
regression estimator are well known and studied; see Maronna, Martin and Yohai
(2006) for an overview. In this paper, we focus on the least trimmed squares (LTS)
estimator introduced by Rousseeuw (1984). This estimator has a simple defini-
tion, is quite fast to compute, and is probably the most popular robust regression
estimator. Denote the vector of squared residuals by $\mathbf{r}^2(\beta) = (r_1^2, \ldots, r_n^2)'$ with $r_i^2 = (y_i - x_i'\beta)^2$, $i = 1, \ldots, n$. Then the LTS estimator is defined as

(1.3) $\hat{\beta}_{\text{LTS}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{h} \bigl(\mathbf{r}^2(\beta)\bigr)_{i:n},$

where $(\mathbf{r}^2(\beta))_{1:n} \le \cdots \le (\mathbf{r}^2(\beta))_{n:n}$ are the order statistics of the squared residuals
and h ≤ n. Thus, LTS regression corresponds to finding the subset of h observa-
tions whose least squares fit produces the smallest sum of squared residuals. The
subset size h can be seen as an initial guess of the number of good observations
in the data. While the LTS is highly robust, it clearly does not produce sparse
model estimates. Furthermore, if h < p, the LTS estimator cannot be computed.
We prove in this paper that sparse LTS has a high breakdown point. It is resistant
to multiple regression outliers, including leverage points. Besides being highly
robust, and similar to the lasso estimate, sparse LTS (i) improves the prediction
performance through variance reduction if the sample size is small relative to the
dimension, (ii) ensures higher interpretability due to simultaneous model selection,
and (iii) avoids computational problems of traditional robust regression methods
in the case of high-dimensional data. For the NCI data, sparse LTS was less in-
fluenced by the outliers than competitor methods and showed better prediction
performance, while the resulting model is small enough to be easily interpreted
(see Section 7).
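The sparse LTS objective function (1.4) is not reproduced above; consistent with the general trimmed objective (2.2) in Theorem 1 with ρ(x) = x², it combines the sum of the h smallest squared residuals with an L1 penalty scaled by the subset size h. A minimal R sketch under that assumption (the function name is our own):

# Sparse LTS objective: sum of the h smallest squared residuals plus an
# L1 penalty scaled by h (cf. (2.2) in Theorem 1 with rho(x) = x^2).
sparse_lts_objective <- function(beta, X, y, h, lambda) {
  r2 <- as.vector(y - X %*% beta)^2
  sum(sort(r2)[1:h]) + h * lambda * sum(abs(beta))
}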
The sparse LTS (1.4) can also be interpreted as a trimmed version of the lasso,
since the limit case h = n yields the lasso solution. Other robust versions of
the lasso have been considered in the literature. Most of them are penalized M-
estimators, as in van de Geer (2008) and Li, Peng and Zhu (2011). Rosset and
Zhu (2004) proposed a Huber-type loss function, which requires knowledge of the
residual scale. A least absolute deviations (LAD) type of estimator, called the LAD-lasso, was proposed by Wang, Li and Jiang (2007),

(1.5) $\hat{\beta}_{\text{LAD-lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} |y_i - x_i'\beta| + n\lambda \sum_{j=1}^{p} |\beta_j|.$
However, none of these methods is robust with respect to leverage points, that
is, outliers in the predictor space; they can handle outliers only in the response
variable. The main competitor of the sparse LTS is robust least angle regression,
called RLARS, and proposed in Khan, Van Aelst and Zamar (2007). They develop
a robust version of the LARS algorithm, essentially replacing correlations by a
robust type of correlation, to sequence and select the most important predictor
variables. Then a nonsparse robust regression estimator is applied to the selected
predictor variables. RLARS, as will be confirmed by our simulation study, is robust
with respect to leverage points. A main drawback of the RLARS algorithm of
Khan, Van Aelst and Zamar (2007) is the lack of a natural definition, since it does not
optimize a clearly defined objective function.
An entirely different approach is taken by She and Owen (2011), who propose
an iterative procedure for outlier detection. Their method is based on imposing a
sparsity criterion on the estimator of the mean-shift parameter γ in the extended
regression model
(1.6) y = Xβ + γ + ε.
They stress that this method requires a nonconvex sparsity criterion. An extension
of the method to high-dimensional data is obtained by also assuming sparsity of
the coefficients β. Nevertheless, their paper mainly focuses on outlier detection
and much less on sparse robust estimation. Note that another procedure for simul-
taneous outlier identification and variable selection based on the mean-shift model
is proposed by Menjoge and Welsch (2010).
The rest of the paper is organized as follows. In Section 2 the breakdown point
of the sparse LTS estimator is obtained. Further, we also show that the lasso and
the LAD-lasso have a breakdown point of only 1/n. A detailed description of the
proposed algorithm to compute the sparse LTS regression estimator is provided in
Section 3. Section 4 introduces a reweighted version of the estimator in order to
increase statistical efficiency. The choice of the penalty parameter λ is discussed
in Section 5. Simulation studies are performed in Section 6. In addition, Section 7
presents an application to protein and gene expression data of the well-known can-
cer cell panel of the National Cancer Institute. The results indicate that these data
contain outliers such that robust methods are necessary for analysis. Moreover,
sparse LTS yields a model that is easy to interpret and has excellent prediction
performance. Finally, Section 8 presents some computation times and Section 9
concludes.
2. Breakdown point. The most popular measure for the robustness of an es-
timator is the replacement finite-sample breakdown point (FBP) [e.g., Maronna,
Martin and Yohai (2006)]. Let Z = (X, y) denote the sample. For a regression es-
timator β̂, the breakdown point is defined as
(2.1) $\varepsilon^*(\hat{\beta}; Z) = \min\left\{\frac{m}{n} : \sup_{\tilde{Z}} \|\hat{\beta}(\tilde{Z})\|_2 = \infty\right\},$

where the supremum is taken over all samples $\tilde{Z}$ obtained from Z by replacing m of the n observations by arbitrary values.
THEOREM 1. Let ρ(x) be a convex and symmetric loss function with ρ(0) = 0 and ρ(x) > 0 for x ≠ 0, and define $\boldsymbol\rho(\mathbf{x}) := (\rho(x_1), \ldots, \rho(x_n))'$. With subset size h ≤ n, consider the regression estimator

(2.2) $\hat{\beta} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{h} \bigl(\boldsymbol\rho(\mathbf{y} - \mathbf{X}\beta)\bigr)_{i:n} + h\lambda \sum_{j=1}^{p} |\beta_j|,$

where $(\boldsymbol\rho(\mathbf{y} - \mathbf{X}\beta))_{1:n} \le \cdots \le (\boldsymbol\rho(\mathbf{y} - \mathbf{X}\beta))_{n:n}$ are the order statistics of the regression loss. Then the breakdown point of the estimator $\hat{\beta}$ is given by

$\varepsilon^*(\hat{\beta}; Z) = \frac{n - h + 1}{n}.$
The breakdown point is the same for any loss function ρ fulfilling the assumptions. In particular, the breakdown point of the sparse LTS estimator $\hat{\beta}_{\text{sparseLTS}}$ with subset size h ≤ n, for which ρ(x) = x², is (n − h + 1)/n. The smaller the value of h, the higher the breakdown point. By taking h small enough, it is even possible to have a breakdown point larger than 50%. However, while this is mathematically possible, we do not advise using h < n/2, since robust statistics aim for models that fit the majority of the data; we therefore do not pursue such large breakdown points. Instead, we suggest taking h equal to a fraction α of the sample size, with α = 0.75, such that the final estimate is based
on a sufficiently large number of observations. This guarantees a sufficiently high
statistical efficiency, as will be shown in the simulations in Section 6. The resulting
breakdown point is then about 1 − α = 25%. Notice that the breakdown point does
not depend on the dimension p. Even if the number of predictor variables is larger
than the sample size, a high breakdown point is guaranteed. For the nonsparse LTS,
the breakdown point does depend on p [see Rousseeuw and Leroy (2003)].
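As a quick numerical illustration of the breakdown point formula (rounding h to an integer via ceiling() is our own choice here):

# Breakdown point (n - h + 1)/n when h is a fraction alpha of the sample size.
n <- 100
alpha <- 0.75
h <- ceiling(alpha * n)    # h = 75 observations
(n - h + 1) / n            # 0.26, i.e., roughly 1 - alpha = 25%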
Applying Theorem 1 to the lasso [corresponding to ρ(x) = x² and h = n] yields a finite-sample breakdown point of

$\varepsilon^*(\hat{\beta}_{\text{lasso}}; Z) = \frac{1}{n}.$
Hence, only one outlier can already send the lasso solution to infinity, despite the
fact that large values of the regression estimate are penalized in the objective func-
tion of the lasso. The nonrobustness of the lasso comes from the use of the squared
residuals in the objective function (1.2). Using other convex loss functions, as done
in the LAD-lasso or penalized M-estimators, does not solve the problem and re-
sults in a breakdown point of 1/n as well. The theoretical results on robustness are
also reflected in the application to the NCI data in Section 7, where the lasso is
much more influenced by the outliers than the sparse LTS.
where $(\mathbf{r}_k^2)_{1:n} \le \cdots \le (\mathbf{r}_k^2)_{n:n}$ denote the order statistics of the squared residuals.
Let $\hat{\beta}_{H_{k+1}}$ denote the coefficients of the lasso fit based on $H_{k+1}$. Then
where the first inequality follows from the definition of $\hat{\beta}_{H_{k+1}}$, and the second
inequality from the definition of $H_k$. From (3.4) it follows that a C-step results in
a decrease of the sparse LTS objective function, and that a sequence of C-steps
yields convergence to a local minimum in a finite number of steps.
To increase the chances of arriving at the global minimum, a sufficiently large
number s of initial subsamples H0 should be used, each of them being used as
starting point for a sequence of C-steps. Rather than randomly selecting h data points, each initial subset $H_0$ of size h is constructed from an elemental subset of size 3 as follows. Draw three observations from the data at random, say, $x_{i_1}$, $x_{i_2}$ and $x_{i_3}$. The lasso fit for this elemental subset is then

(3.5) $\hat{\beta}_{\{i_1, i_2, i_3\}} = \operatorname*{argmin}_{\beta} Q\bigl(\{i_1, i_2, i_3\}, \beta\bigr),$
and the initial subset H0 is then given by the indices of the h observations with the
smallest squared residuals with respect to the fit in (3.5). The nonsparse FAST-LTS
algorithm uses elemental subsets of size p, since any OLS regression requires at
least as many observations as the dimension p. This would make the algorithm inapplicable if p > n. Fortunately, the lasso is well defined for samples of size 3, even for large values of p. Moreover, from a robustness point of view,
using only three observations is optimal, as it ensures the highest probability of
not including outliers in the elemental set. It is important to note that the elemental
subsets of size 3 are only used to construct the initial subsets of size h for the
C-step algorithms. All C-steps are performed on subsets of size h.
In this paper, we used s = 500 initial subsets. Using a larger number of subsets
did not lead to better prediction performance in the case of the NCI data. Following
the strategy advised in Rousseeuw and Van Driessen (2006), we perform only two
C-steps for all s subsets and retain the s1 = 10 subsamples with the lowest values of
the objective function (3.1). For the reduced number of subsets s1 , further C-steps
are performed until convergence. This is a standard strategy for C-step algorithms
to decrease computation time.
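The subsampling strategy described above can be summarized in the following R sketch. It is a simplified outline, not the authors' implementation: lasso_cd() is our own minimal coordinate-descent lasso standing in for a proper solver such as lars, and intercept estimation and the subset-wise standardization discussed below are omitted for brevity.

# Soft-thresholding operator used by the coordinate-descent lasso below.
soft_threshold <- function(z, gamma) sign(z) * max(abs(z) - gamma, 0)

# Minimal coordinate-descent lasso minimizing
#   sum((y - X %*% beta)^2) + nrow(X) * lambda * sum(abs(beta)).
lasso_cd <- function(X, y, lambda, iter = 50) {
  n <- nrow(X); p <- ncol(X); beta <- numeric(p)
  for (it in seq_len(iter)) {
    for (j in seq_len(p)) {
      r <- y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residuals
      den <- sum(X[, j]^2)
      beta[j] <- if (den > 0)
        soft_threshold(sum(X[, j] * r), n * lambda / 2) / den else 0
    }
  }
  beta
}

# One C-step: lasso fit on the current h-subset, then keep the h
# observations with the smallest squared residuals among all n.
c_step <- function(X, y, subset, h, lambda) {
  beta <- lasso_cd(X[subset, , drop = FALSE], y[subset], lambda)
  r2 <- as.vector(y - X %*% beta)^2
  list(subset = order(r2)[1:h], beta = beta)
}

# Raw sparse LTS via subsampling: s elemental starts of size 3, two C-steps
# each, keep the s1 best candidates and iterate those until convergence.
sparse_lts_raw <- function(X, y, h, lambda, s = 500, s1 = 10) {
  n <- nrow(X)
  obj <- function(beta) {
    r2 <- as.vector(y - X %*% beta)^2
    sum(sort(r2)[1:h]) + h * lambda * sum(abs(beta))
  }
  starts <- lapply(seq_len(s), function(k) {
    elem <- sample(n, 3)                               # elemental subset
    beta <- lasso_cd(X[elem, , drop = FALSE], y[elem], lambda)
    subset <- order(as.vector(y - X %*% beta)^2)[1:h]  # initial subset H_0
    for (step in 1:2) {                                # two C-steps per start
      cs <- c_step(X, y, subset, h, lambda)
      subset <- cs$subset
      beta <- cs$beta
    }
    list(subset = subset, beta = beta, value = obj(beta))
  })
  best <- starts[order(sapply(starts, `[[`, "value"))[seq_len(s1)]]
  fits <- lapply(best, function(cand) {                # C-steps to convergence
    subset <- cand$subset
    repeat {
      cs <- c_step(X, y, subset, h, lambda)
      if (setequal(cs$subset, subset)) break
      subset <- cs$subset
    }
    list(beta = cs$beta, subset = subset, value = obj(cs$beta))
  })
  fits[[which.min(sapply(fits, `[[`, "value"))]]
}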
Estimation of an intercept: the regression model in (1.1) does not contain an
intercept. It is indeed common to assume that the variables are mean-centered and
the predictor variables are standardized before applying the lasso. However, com-
puting the means and standard deviations over all observations does not result in
a robust method, so we take a different approach. Each time the sparse LTS algo-
rithm computes a lasso fit on a subsample of size h, the variables are first centered
and the predictors are standardized using the means and standard deviations com-
puted from the respective subsample. The resulting procedure then minimizes (1.4)
with squared residuals $r_i^2 = (y_i - \beta_0 - x_i'\beta)^2$, where $\beta_0$ stands for the intercept.
We verified that adding an intercept to the model has no impact on the breakdown
point of the sparse LTS estimator of β.
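A minimal sketch of this subset-wise standardization (the helper name and return structure are ours); the stored centers and scales would be used to transform the lasso coefficients back to the original scale and to recover the intercept.

# Center y and standardize the columns of X using means and standard
# deviations computed on the current h-subset only, as described above.
standardize_on_subset <- function(X, y, subset) {
  x_center <- colMeans(X[subset, , drop = FALSE])
  x_scale  <- apply(X[subset, , drop = FALSE], 2, sd)
  y_center <- mean(y[subset])
  list(X = scale(X, center = x_center, scale = x_scale),
       y = y - y_center,
       x_center = x_center, x_scale = x_scale, y_center = y_center)
}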
where $r_i = y_i - x_i'\hat{\beta}_{\text{sparseLTS}}$ and $H_{\text{opt}}$ is the optimal subset from (3.3). Then the residual scale estimate associated with the raw sparse LTS estimator is given by

(4.2) $\hat{\sigma}_{\text{raw}} = k_\alpha \sqrt{\frac{1}{h} \sum_{i=1}^{h} (\mathbf{r}_c^2)_{i:n}},$

with squared centered residuals $\mathbf{r}_c^2 = ((r_1 - \hat{\mu}_{\text{raw}})^2, \ldots, (r_n - \hat{\mu}_{\text{raw}})^2)'$, and

(4.3) $k_\alpha = \left(\frac{1}{\alpha} \int_{-\Phi^{-1}((\alpha+1)/2)}^{\Phi^{-1}((\alpha+1)/2)} u^2 \, d\Phi(u)\right)^{-1/2},$

a factor to ensure that $\hat{\sigma}_{\text{raw}}$ is a consistent estimate of the standard deviation at the normal model. This formulation allows us to define binary weights

(4.4) $w_i = \begin{cases} 1, & \text{if } |(r_i - \hat{\mu}_{\text{raw}})/\hat{\sigma}_{\text{raw}}| \le \Phi^{-1}(1-\delta), \\ 0, & \text{if } |(r_i - \hat{\mu}_{\text{raw}})/\hat{\sigma}_{\text{raw}}| > \Phi^{-1}(1-\delta), \end{cases} \qquad i = 1, \ldots, n.$
In this paper δ = 0.0125 is used such that 2.5% of the observations are expected to
be flagged as outliers in the normal model, which is a typical choice.
The reweighted sparse LTS estimator is given by the weighted lasso fit

(4.5) $\hat{\beta}_{\text{reweighted}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} w_i (y_i - x_i'\beta)^2 + \lambda n_w \sum_{j=1}^{p} |\beta_j|,$

with $n_w = \sum_{i=1}^{n} w_i$ the sum of the weights. With the choice of weights given in (4.4), the reweighted sparse LTS is the lasso fit based on the observations not flagged as outliers. Of course, other weighting schemes could be considered. Using the residual center estimate

(4.6) $\hat{\mu}_{\text{reweighted}} = \frac{1}{n_w} \sum_{i=1}^{n} w_i \bigl(y_i - x_i'\hat{\beta}_{\text{reweighted}}\bigr),$

the residual scale estimate of the reweighted sparse LTS estimator is given by

(4.7) $\hat{\sigma}_{\text{reweighted}} = k_{\alpha_w} \sqrt{\frac{1}{n_w} \sum_{i=1}^{n} w_i \bigl(y_i - x_i'\hat{\beta}_{\text{reweighted}} - \hat{\mu}_{\text{reweighted}}\bigr)^2},$
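To make the reweighting step concrete, here is a hedged R sketch. It assumes that μ̂_raw is the mean of the residuals over the optimal subset (its definition (4.1) is not reproduced above) and that the cutoff in (4.4) is applied to the absolute standardized residuals; the consistency factor k_α has a closed form at the normal model since the integral in (4.3) equals α − 2qφ(q) with q = Φ⁻¹((α+1)/2).

# Consistency factor k_alpha from (4.3): the integral of u^2 dPhi(u) over
# [-q, q] with q = qnorm((alpha + 1) / 2) equals alpha - 2 * q * dnorm(q).
k_alpha <- function(alpha) {
  q <- qnorm((alpha + 1) / 2)
  1 / sqrt((alpha - 2 * q * dnorm(q)) / alpha)
}

# Reweighting step: 'residuals' are y_i - x_i' beta_raw for all n observations,
# 'h_subset' is the optimal subset H_opt, alpha = h / n, delta as in the text.
# Taking mu_raw as the subset mean of the residuals is an assumption,
# since (4.1) is not reproduced in this section.
reweight_sparse_lts <- function(residuals, h_subset, alpha, delta = 0.0125) {
  h <- length(h_subset)
  mu_raw <- mean(residuals[h_subset])
  rc2 <- (residuals - mu_raw)^2                               # centered squared residuals
  sigma_raw <- k_alpha(alpha) * sqrt(sum(sort(rc2)[1:h]) / h) # (4.2)
  w <- as.numeric(abs(residuals - mu_raw) / sigma_raw <= qnorm(1 - delta))  # (4.4)
  # The reweighted estimate (4.5) is then the lasso fit on the observations
  # with weight one; any lasso solver can be used for that final fit.
  list(weights = w, mu_raw = mu_raw, sigma_raw = sigma_raw)
}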
parameter in the same way as in Wang, Li and Jiang (2007). However, if p > n, we
cannot use their approach and instead use the BIC as in (5.1), with the mean absolute value
of residuals (multiplied by a consistency factor) as scale estimate. For RLARS, we
add the sequenced variables to the model in a stepwise fashion and fit robust MM-
regressions [Yohai (1987)], as advocated in Khan, Van Aelst and Zamar (2007).
The optimal model when using RLARS is then again selected via BIC, now using
the robust scale estimate resulting from the MM-regression.
6.1. Sampling schemes. The first configuration is a latent factor model taken
from Khan, Van Aelst and Zamar (2007) and covers the case of n > p. From k = 6
latent independent standard normal variables l1 , . . . , lk and an independent normal
error variable e with standard deviation σ , the response variable y is constructed
as

$y := l_1 + \cdots + l_k + e,$

where $\sigma$ is chosen so that the signal-to-noise ratio is 3, that is, $\sigma = \sqrt{k}/3$. With
independent standard normal variables e1 , . . . , ep , a set of p = 50 candidate pre-
dictors is then constructed as
$x_j := l_j + \tau e_j, \quad j = 1, \ldots, k,$
$x_{k+1} := l_1 + \delta e_{k+1},$
$x_{k+2} := l_1 + \delta e_{k+2},$
$\quad \vdots$
$x_{3k-1} := l_k + \delta e_{3k-1},$
$x_{3k} := l_k + \delta e_{3k},$
$x_j := e_j, \quad j = 3k+1, \ldots, p,$
where τ = 0.3 and δ = 5 so that x1 , . . . , xk are low-noise perturbations of the la-
tent variables, xk+1 , . . . , x3k are noise covariates that are correlated with the latent
variables, and x3k+1 , . . . , xp are independent noise covariates. The number of ob-
servations is set to n = 150.
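For reference, data from this first sampling scheme can be generated in R as follows; the seed and object names are our own choices.

set.seed(1)                        # seed is our own choice
n <- 150; p <- 50; k <- 6
tau <- 0.3; delta <- 5
sigma <- sqrt(k) / 3               # signal-to-noise ratio of 3
L <- matrix(rnorm(n * k), n, k)    # latent variables l_1, ..., l_k
y <- rowSums(L) + rnorm(n, sd = sigma)
E <- matrix(rnorm(n * p), n, p)    # independent standard normal e_1, ..., e_p
X <- E                             # x_j = e_j for j = 3k + 1, ..., p
X[, 1:k] <- L + tau * E[, 1:k]     # low-noise perturbations of the latent variables
X[, (k + 1):(3 * k)] <-            # noise covariates correlated with l_1, ..., l_k
  L[, rep(1:k, each = 2)] + delta * E[, (k + 1):(3 * k)]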
The second configuration covers the case of moderate high-dimensional data.
We generate n = 100 observations from a p-dimensional normal distribution
$N(\mathbf{0}, \Sigma)$, with p = 1000. The covariance matrix $\Sigma = (\Sigma_{ij})_{1 \le i,j \le p}$ is given by
$\Sigma_{ij} = 0.5^{|i-j|}$, creating correlated predictor variables. Using the coefficient vector $\beta = (\beta_j)_{1 \le j \le p}$ with $\beta_1 = \beta_7 = 1.5$, $\beta_2 = 0.5$, $\beta_4 = \beta_{11} = 1$, and $\beta_j = 0$ for
j ∈ {1, . . . , p} \ {1, 2, 4, 7, 11}, the response variable is generated according to the
regression model (1.1), where the error terms follow a normal distribution with
σ = 0.5.
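Similarly, a sketch of the second configuration, generating the correlated predictors via the Cholesky factor of Σ (seed and object names are our own):

set.seed(1)
n <- 100; p <- 1000; sigma <- 0.5
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))           # Sigma_ij = 0.5^|i - j|
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)  # rows are N(0, Sigma)
beta <- numeric(p)
beta[c(1, 7)] <- 1.5; beta[2] <- 0.5; beta[c(4, 11)] <- 1
y <- as.vector(X %*% beta + rnorm(n, sd = sigma))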
Finally, the third configuration represents a more extreme case of high-
dimensional data with n = 100 observations and p = 20,000 variables. The first
6.2. Performance measures. Since one of the aims of sparse model estimation
is to improve prediction performance, the different estimators are evaluated by the
root mean squared prediction error (RMSPE). For this purpose, n additional ob-
servations from the respective sampling schemes (without outliers) are generated
as test data in each simulation run. Then the RMSPE is given by

$\text{RMSPE}(\hat{\beta}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i^* - \mathbf{x}_i^{*\prime}\hat{\beta}\bigr)^2},$
where yi∗ and x∗i , i = 1, . . . , n, denote the observations of the response and pre-
dictor variables in the test data, respectively. The RMSPE of the oracle estimator,
which uses the true coefficient values β, is computed as a benchmark for the eval-
uated methods. We report average RMSPE over all simulation runs.
Concerning sparsity, the estimated models are evaluated by the false positive
rate (FPR) and the false negative rate (FNR). A false positive is a coefficient that
is zero in the true model, but is estimated as nonzero. Analogously, a false nega-
tive is a coefficient that is nonzero in the true model, but is estimated as zero. In
mathematical terms, the FPR and FNR are defined as
$\text{FPR}(\hat{\beta}) = \frac{|\{j \in \{1, \ldots, p\} : \hat{\beta}_j \neq 0 \wedge \beta_j = 0\}|}{|\{j \in \{1, \ldots, p\} : \beta_j = 0\}|},$

$\text{FNR}(\hat{\beta}) = \frac{|\{j \in \{1, \ldots, p\} : \hat{\beta}_j = 0 \wedge \beta_j \neq 0\}|}{|\{j \in \{1, \ldots, p\} : \beta_j \neq 0\}|}.$
Both FPR and FNR should be as small as possible for a sparse estimator and are av-
eraged over all simulation runs. Note that false negatives in general have a stronger
effect on the RMSPE than false positives. A false negative means that important
information is not used for prediction, whereas a false positive merely adds a bit
of variance.
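These performance measures translate directly into R (the function names are ours):

# RMSPE on test data, false positive rate and false negative rate.
rmspe <- function(beta_hat, X_test, y_test) {
  sqrt(mean((y_test - X_test %*% beta_hat)^2))
}
fpr <- function(beta_hat, beta_true) {
  sum(beta_hat != 0 & beta_true == 0) / sum(beta_true == 0)
}
fnr <- function(beta_hat, beta_true) {
  sum(beta_hat == 0 & beta_true != 0) / sum(beta_true != 0)
}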
6.3. Simulation results. In this subsection the simulation results for the differ-
ent data configurations are presented and discussed.
6.3.1. Results for the first sampling scheme. The simulation results for the first
data configuration are displayed in Table 1. Keep in mind that this configuration is
exactly the same as in Khan, Van Aelst and Zamar (2007), and that the contamina-
tion settings are a subset of the ones applied in their paper. In the scenario without
contamination, LAD-lasso, RLARS and lasso show excellent performance with
low RMSPE and FPR. The prediction performance of sparse LTS is good, but it
TABLE 1
Results for the first simulation scheme, with n = 150 and p = 50. Root mean squared prediction
error (RMSPE), the false positive rate (FPR) and the false negative rate (FNR), averaged over 500
simulation runs, are reported for every method

                     No contamination        Vertical outliers       Vertical outliers and
                                                                     leverage points
Method               RMSPE   FPR    FNR      RMSPE   FPR    FNR      RMSPE   FPR    FNR
Lasso                1.18    0.10   0.00     2.44    0.54   0.09     2.20    0.00   0.16
LAD-lasso            1.13    0.05   0.00     1.15    0.07   0.00     1.27    0.18   0.00
RLARS                1.14    0.07   0.00     1.12    0.03   0.00     1.22    0.09   0.00
Raw sparse LTS       1.29    0.34   0.00     1.26    0.32   0.00     1.26    0.26   0.00
Sparse LTS           1.24    0.22   0.00     1.22    0.25   0.00     1.22    0.18   0.00
Oracle               0.82                    0.82                    0.82
FIG. 1. Root mean squared prediction error (RMSPE) for the first simulation scheme, with n = 150 and p = 50, and for the fourth contamination setting, averaged over 500 simulation runs. Lines for raw and reweighted sparse LTS almost coincide.
has a larger FPR than the other three methods. The reweighting step clearly im-
proves the estimates, which is reflected in the lower values for RMSPE and FPR.
Furthermore, none of the methods suffers from false negatives.
In the case of vertical outliers, the nonrobust lasso is clearly influenced by the
outliers, reflected in the much higher RMSPE and FPR. RLARS, LAD-lasso and
sparse LTS, on the other hand, keep their excellent behavior. Sparse LTS still has a
considerable tendency toward false positives, but the reweighting step is a signifi-
cant improvement over the raw estimator.
When leverage points are introduced in addition to the vertical outliers, the
performance of RLARS, sparse LTS and LAD-lasso is comparable. The FPR of
RLARS and LAD-lasso slightly increased, whereas the FPR of sparse LTS slightly
decreased. The LAD-lasso still performs well, and even the lasso performs better
than in the case of only vertical outliers. This suggests that the leverage points in
this example do not have a bad leverage effect.
In Figure 1 the results for the fourth contamination setting are shown. The
RMSPE is thereby plotted as a function of the parameter η. With increasing η,
the RMSPE of the lasso and the LAD-lasso increases. RLARS has a considerably
higher RMSPE than sparse LTS for lower values of η, but the RMSPE gradually
decreases with increasing η. However, the RMSPE of sparse LTS remains the lowest; thus, it has the best overall performance.
6.3.2. Results for the second sampling scheme. Table 2 contains the simula-
tion results for the moderate high-dimensional data configuration. In the scenario
without contamination, RLARS and the lasso perform best with very low RMSPE
and almost perfect FPR and FNR. Also, the LAD-lasso has excellent prediction
TABLE 2
Results for the second simulation scheme, with n = 100 and p = 1000. Root mean squared
prediction error (RMSPE), the false positive rate (FPR) and the false negative rate (FNR), averaged
over 500 simulation runs, are reported for every method

                     No contamination        Vertical outliers       Vertical outliers and
                                                                     leverage points
Method               RMSPE   FPR    FNR      RMSPE   FPR    FNR      RMSPE   FPR    FNR
Lasso                0.62    0.00   0.00     2.56    0.08   0.16     2.53    0.00   0.71
LAD-lasso            0.66    0.08   0.00     0.82    0.00   0.01     1.17    0.08   0.01
RLARS                0.60    0.01   0.00     0.73    0.00   0.10     0.92    0.02   0.09
Raw sparse LTS       0.81    0.02   0.00     0.73    0.02   0.00     0.73    0.02   0.00
Sparse LTS           0.74    0.01   0.00     0.69    0.01   0.00     0.71    0.02   0.00
Oracle               0.50                    0.50                    0.50
FIG. 2. Root mean squared prediction error (RMSPE) for the second simulation scheme, with n = 100 and p = 1000, and for the fourth contamination setting, averaged over 500 simulation runs. Lines for raw and reweighted sparse LTS almost coincide.

6.3.3. Results for the third sampling scheme. Table 3 contains the simulation results for the more extreme high-dimensional data configuration. Note that the LAD-lasso was no longer computationally feasible with such a large number of variables. In addition, the number of simulation runs was reduced from 500 to 100 to lower the computational effort.

In the case without contamination, the sparse LTS suffers from an efficiency problem, which is reflected in larger values for RMSPE and FNR than for the other methods. The lasso and RLARS have considerably better performance in
this case. With vertical outliers, the RMSPE for the lasso increases greatly due to
many false negatives. Also, RLARS has a larger FNR than sparse LTS, resulting
in a slightly lower RMSPE for the reweighted version of the latter. When leverage
points are introduced, sparse LTS clearly exhibits the lowest RMSPE and FNR.
Furthermore, the lasso results in a very large FNR.
Figure 3 shows the results for the fourth contamination setting. Most interest-
ingly, the RMSPE of RLARS in this case keeps increasing in the beginning and
even goes above that of the lasso, before dropping continuously in
the remaining steps. Sparse LTS again shows a kink in the curve for the RMSPE,
but clearly performs best.
TABLE 3
Results for the third simulation scheme, with n = 100 and p = 20,000. Root mean squared
prediction error (RMSPE), the false positive rate (FPR) and the false negative rate (FNR), averaged
over 100 simulation runs, are reported for every method

                     No contamination         Vertical outliers        Vertical outliers and
                                                                       leverage points
Method               RMSPE   FPR     FNR      RMSPE   FPR     FNR      RMSPE   FPR     FNR
Lasso                1.43    0.000   0.00     5.19    0.004   0.49     5.57    0.000   0.83
RLARS                1.54    0.001   0.00     2.53    0.000   0.38     3.34    0.001   0.45
Raw sparse LTS       3.00    0.001   0.19     2.59    0.002   0.11     2.59    0.002   0.10
Sparse LTS           2.88    0.001   0.16     2.49    0.002   0.10     2.57    0.002   0.09
Oracle               1.00                     1.00                     1.00
FIG. 3. Root mean squared prediction error (RMSPE) for the third simulation scheme, with n = 100 and p = 20,000, and for the fourth contamination setting, averaged over 100 simulation runs. Lines for raw and reweighted sparse LTS almost coincide.
6.3.4. Summary of the simulation results. Sparse LTS, in its reweighted version, shows the best overall performance in this simulation study. Concerning the other investigated methods, RLARS also performs well, but sometimes suffers from an increased percentage of false negatives under contamination. It is also confirmed that the lasso is not robust to outliers. The LAD-lasso withstands vertical outliers, but is not robust against bad leverage points.
7. NCI-60 cancer cell panel. In this section the sparse LTS estimator is com-
pared to the competing methods in an application to the cancer cell panel of the
National Cancer Institute. It consists of data on 60 human cancer cell lines and
can be downloaded via the web application CellMiner (http://discover.nci.nih.gov/
cellminer/). We regress protein expression on gene expression data. The gene ex-
pression data were obtained with an Affymetrix HG-U133A chip and normalized
with the GCRMA method, resulting in a set of p = 22,283 predictors. The protein
expressions based on 162 antibodies were acquired via reverse-phase protein lysate
arrays and log2 transformed. One observation had to be removed since all values
were missing in the gene expression data, reducing the number of observations to
n = 59. More details on how the data were obtained can be found in Shankavaram
et al. (2007). Furthermore, Lee et al. (2011) also use these data for regression anal-
ysis, but consider only nonrobust methods. They obtain models that still consist of
several hundred to several thousand predictors and are thus difficult to interpret.
Similar to Lee et al. (2011), we first order the protein expression variables ac-
cording to their scale, but use the MAD (median absolute deviation from the median, multiplied by the consistency factor 1.4826) as a scale estimator instead of
the standard deviation. We show the results for the protein expressions based on
the KRT18 antibody, which constitutes the variable with the largest MAD, serving
as one dependent variable. Hence, our response variable measures the expression
levels of the protein keratin 18, which is known to be persistently expressed in car-
cinomas [Oshima, Baribault and Caulín (1996)]. We compare raw and reweighted
sparse LTS with 25% trimming, lasso and RLARS. As in the simulation study,
the LAD-lasso could not be computed for such a large p. The optimal models are
selected via BIC as discussed in Section 5. The raw sparse LTS estimator thereby
results in a model with 32 genes. In the reweighting step, one more observation
is added to the best subset found by the raw estimator, yielding a model with 33
genes for reweighted sparse LTS (thus also one more gene is selected compared
to the raw estimator). The lasso model is somewhat larger with 52 genes, whereas
the RLARS model is somewhat smaller with 18 genes.
Sparse LTS and the lasso have three selected genes in common, one of which
is KRT8. The product of this gene, the protein keratin 8, typically forms an in-
termediate filament with keratin 18 such that their expression levels are closely
linked [e.g., Owens and Lane (2003)]. However, the larger model of the lasso is
much more difficult to interpret. Two of the genes selected by the lasso are not
even recorded in the Gene database [Maglott et al. (2005)] of the National Cen-
ter for Biotechnology Information (NCBI). The sparse LTS model is considerably
smaller and easier to interpret. For instance, the gene expression level of MSLN,
whose product mesothelin is overexpressed in various forms of cancer [Hassan,
Bera and Pastan (2004)], has a positive effect on the protein expression level of
keratin 18.
Concerning prediction performance, the root trimmed mean squared predic-
tion error (RTMSPE) is computed as in (5.2) via leave-one-out cross-validation
(so k = n). Table 4 reports the RTMSPE for the considered methods. Sparse LTS
clearly shows the smallest RTMSPE, followed by RLARS and the lasso. In addi-
tion, sparse LTS detects 13 observations as outliers, showing the need for a robust
procedure. Further analysis revealed that including those 13 observations changes
the correlation structure of the predictor variables with the response.
TABLE 4
Root trimmed mean squared prediction error
(RTMSPE) for protein expressions based on the KRT18
antibody (NCI-60 cancer cell panel data), computed
from leave-one-out cross-validation
Method RTMSPE
Lasso 1.058
RLARS 0.936
Raw sparse LTS 0.727
Sparse LTS 0.721
Consequently, the order in which the genes are added to the model by the lasso algorithm on the
full sample is completely different from the order on the best subset found by
sparse LTS. Leaving out those 13 observations therefore yields more reliable re-
sults for the majority of the cancer cell lines.
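Since the definition (5.2) of the RTMSPE is not reproduced in this section, the sketch below assumes it is the square root of the trimmed mean of the squared leave-one-out prediction errors, keeping the proportion 1 − α of smallest errors; fit_fn() is a stand-in for whichever estimator is being evaluated.

# Leave-one-out root trimmed mean squared prediction error (a sketch of our
# reading of (5.2)); fit_fn(X, y) returns a coefficient vector.
rtmspe_loo <- function(X, y, fit_fn, trim = 0.25) {
  n <- nrow(X)
  err2 <- sapply(seq_len(n), function(i) {
    beta_i <- fit_fn(X[-i, , drop = FALSE], y[-i])
    as.vector(y[i] - X[i, , drop = FALSE] %*% beta_i)^2
  })
  h <- ceiling((1 - trim) * n)
  sqrt(mean(sort(err2)[1:h]))    # mean of the h smallest squared errors
}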
It is also worth noting that the models still contain a rather large number of vari-
ables given the small number of observations. For the lasso, it is well known that
it tends to select many noise variables in high dimensions since the same penalty
is applied to all variables. Meinshausen (2007) therefore proposed a relaxation of
the penalty for the selected variables of an initial lasso fit. Adding such a relax-
ation step to the sparse LTS procedure may thus be beneficial for large p and is
considered for future work.
8. Computational details and CPU times. All computations are carried out
in R version 2.14.0 [R Development Core Team (2011)] using the packages ro-
bustHD [Alfons (2012b)] for sparse LTS and RLARS, quantreg [Koenker (2011)]
for the LAD-lasso and lars [Hastie and Efron (2011)] for the lasso. Most of sparse
LTS is thereby implemented in C++, while RLARS is an optimized version of
the R code by Khan, Van Aelst and Zamar (2007). Optimization of the RLARS
code was necessary since the original code builds a p × p matrix of robust cor-
relations, which is not computationally feasible for very large p. The optimized
version only stores a q × p matrix, where q is the number of sequenced vari-
ables. Furthermore, the robust correlations are computed with C++ rather than R.
Since computation time is an important practical consideration, Figure 4 dis-
plays computation times of lasso, LAD-lasso, RLARS and sparse LTS in sec-
onds. Note that those are average times over 10 runs based on simulated data with
n = 100 and varying dimension p, obtained on an Intel Xeon X5670 machine. For
sparse LTS and the LAD-lasso, the reported CPU times are averages over a grid
FIG. 4. CPU times (in seconds) for n = 100 and varying p, averaged over 10 runs.
of five values for λ. RLARS is a hybrid procedure; thus, we only report the CPU
times for obtaining the sequence of predictors, but not for fitting the models along
the sequence.
As expected, the computation time of the nonrobust lasso remains very low for
increasing p. Sparse LTS is still reasonably fast up to p ≈ 10,000, but computation
time is a considerable factor if p is much larger than that. However, sparse LTS
remains faster than obtaining the RLARS sequence. A further advantage of the
subsampling algorithm of sparse LTS is that it can easily be parallelized to reduce
computation time on modern multicore computers, which is future work.
that is, there is no breakdown. We will show that this leads to a contradiction.
Let $\beta_\gamma = (\gamma, 0, \ldots, 0)' \in \mathbb{R}^p$ with $\gamma = M + 2$ and define $\tau > 0$ such that $\rho(\tau) \ge \max(h - m, 0)\rho(M_y + \gamma M_{x_1}) + h\lambda\gamma + 1$. Note that $\tau$ is always well defined due to the assumptions on $\rho$, in particular, since $\rho(\infty) = \infty$. Then the objective function is given by

$Q(\beta_\gamma) = \begin{cases} \displaystyle\sum_{i=1}^{h-m} \bigl(\boldsymbol\rho(\mathbf{y} - \mathbf{X}\beta_\gamma)\bigr)_{i:(n-m)} + h\lambda|\gamma|, & \text{if } h > m, \\ h\lambda|\gamma|, & \text{else}, \end{cases}$

since the residuals with respect to the outliers are all zero. Hence,

(A.2) $Q(\beta_\gamma) \le \max(h - m, 0)\rho(M_y + \gamma M_{x_1}) + h\lambda\gamma \le \rho(\tau) - 1.$
REFERENCES

ALFONS, A. (2012a). simFrame: Simulation framework. R package version 0.5.0.
ALFONS, A. (2012b). robustHD: Robust methods for high-dimensional data. R package version 0.1.0.
ALFONS, A., TEMPL, M. and FILZMOSER, P. (2010). An object-oriented framework for statistical simulation: The R package simFrame. Journal of Statistical Software 37 1–36.
EFRON, B., HASTIE, T., JOHNSTONE, I. and TIBSHIRANI, R. (2004). Least angle regression. Ann. Statist. 32 407–499. MR2060166
FAN, J. and LI, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360. MR1946581
GERMAIN, J.-F. and ROUEFF, F. (2010). Weak convergence of the regularization path in penalized M-estimation. Scand. J. Stat. 37 477–495. MR2724509
GERTHEISS, J. and TUTZ, G. (2010). Sparse modeling of categorial explanatory variables. Ann. Appl. Stat. 4 2150–2180. MR2829951
HASSAN, R., BERA, T. and PASTAN, I. (2004). Mesothelin: A new target for immunotherapy. Clin. Cancer Res. 10 3937–3942.
HASTIE, T. and EFRON, B. (2011). lars: Least angle regression, lasso and forward stagewise. R package version 0.9-8.
KHAN, J. A., VAN AELST, S. and ZAMAR, R. H. (2007). Robust linear model selection based on least angle regression. J. Amer. Statist. Assoc. 102 1289–1299. MR2412550
KNIGHT, K. and FU, W. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28 1356–1378. MR1805787
KOENKER, R. (2011). quantreg: Quantile regression. R package version 4.67.
LEE, D., LEE, W., LEE, Y. and PAWITAN, Y. (2011). Sparse partial least-squares regression and its applications to high-throughput data analysis. Chemometrics and Intelligent Laboratory Systems 109 1–8.
LI, G., PENG, H. and ZHU, L. (2011). Nonconcave penalized M-estimation with a diverging number of parameters. Statist. Sinica 21 391–419. MR2796868
MAGLOTT, D., OSTELL, J., PRUITT, K. D. and TATUSOVA, T. (2005). Entrez gene: Gene-centered information at NCBI. Nucleic Acids Res. 33 D54–D58.
MARONNA, R. A. (2011). Robust ridge regression for high-dimensional data. Technometrics 53 44–53. MR2791951
MARONNA, R. A., MARTIN, R. D. and YOHAI, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, Chichester. MR2238141
MEINSHAUSEN, N. (2007). Relaxed lasso. Comput. Statist. Data Anal. 52 374–393. MR2409990
MENJOGE, R. S. and WELSCH, R. E. (2010). A diagnostic method for simultaneous feature selection and outlier identification in linear regression. Comput. Statist. Data Anal. 54 3181–3193. MR2727745
OSHIMA, R. G., BARIBAULT, H. and CAULÍN, C. (1996). Oncogenic regulation and function of keratins 8 and 18. Cancer and Metastasis Reviews 15 445–471.
OWENS, D. W. and LANE, E. B. (2003). The quest for the function of simple epithelial keratins. Bioessays 25 748–758.
R DEVELOPMENT CORE TEAM (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
RADCHENKO, P. and JAMES, G. M. (2011). Improved variable selection with forward-lasso adaptive shrinkage. Ann. Appl. Stat. 5 427–448. MR2810404
ROSSET, S. and ZHU, J. (2004). Discussion of "Least angle regression," by B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Ann. Statist. 32 469–475.
ROUSSEEUW, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc. 79 871–880. MR0770281
ROUSSEEUW, P. J. and LEROY, A. M. (2003). Robust Regression and Outlier Detection, 2nd ed. Wiley, Hoboken.
ROUSSEEUW, P. J. and VAN DRIESSEN, K. (2006). Computing LTS regression for large data sets. Data Min. Knowl. Discov. 12 29–45. MR2225526
SHANKAVARAM, U. T., REINHOLD, W. C., NISHIZUKA, S., MAJOR, S., MORITA, D., CHARY, K. K., REIMERS, M. A., SCHERF, U., KAHN, A., DOLGINOW, D., COSSMAN, J., KALDJIAN, E. P., SCUDIERO, D. A., PETRICOIN, E., LIOTTA, L., LEE, J. K. and WEINSTEIN, J. N. (2007). Transcript and protein expression profiles of the NCI-60 cancer cell panel: An integromic microarray study. Molecular Cancer Therapeutics 6 820–832.
SHE, Y. and OWEN, A. B. (2011). Outlier detection using nonconvex penalized regression. J. Amer. Statist. Assoc. 106 626–639. MR2847975
TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
VAN DE GEER, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645. MR2396809
WANG, H., LI, G. and JIANG, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-lasso. J. Bus. Econom. Statist. 25 347–355. MR2380753
WANG, S., NAN, B., ROSSET, S. and ZHU, J. (2011). Random lasso. Ann. Appl. Stat. 5 468–485. MR2810406
WU, T. T. and LANGE, K. (2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2 224–244. MR2415601
YOHAI, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. Ann. Statist. 15 642–656. MR0888431
YUAN, M. and LIN, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67. MR2212574
ZHAO, P. and YU, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7 2541–2563. MR2274449
ZOU, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469
ZOU, H., HASTIE, T. and TIBSHIRANI, R. (2007). On the "degrees of freedom" of the lasso. Ann. Statist. 35 2173–2192. MR2363967
A. ALFONS
C. CROUX
ORSTAT RESEARCH CENTER
FACULTY OF BUSINESS AND ECONOMICS
KU LEUVEN
NAAMSESTRAAT 69
3000 LEUVEN
BELGIUM
E-MAIL: [email protected]
        [email protected]

S. GELPER
ROTTERDAM SCHOOL OF MANAGEMENT
ERASMUS UNIVERSITY ROTTERDAM
BURGEMEESTER OUDLAAN 50
3000 ROTTERDAM
THE NETHERLANDS
E-MAIL: [email protected]