A Comparison of Optimization Methods and Software for Large-Scale L1-Regularized Linear Classification
Abstract
Large-scale linear classification is widely used in many areas. The L1-regularized form can
be applied for feature selection; however, its non-differentiability causes more difficulties in
training. Although various optimization methods have been proposed in recent years, these
have not yet been compared suitably. In this paper, we first broadly review existing meth-
ods. Then, we discuss state-of-the-art software packages in detail and propose two efficient
implementations. Extensive comparisons indicate that carefully implemented coordinate
descent methods are very suitable for training large document data.
Keywords: L1 regularization, linear classification, optimization methods, logistic regres-
sion, support vector machines, document classification
1. Introduction
Recently, L1-regularized classifiers have attracted considerable attention because they can
be used to obtain a sparse model. Given a set of instance-label pairs (xi , yi ), i = 1, . . . , l, xi ∈
Rn , yi ∈ {−1, +1}, training an L1-regularized linear classifier involves the following uncon-
strained optimization problem:
    \min_{w} \; f(w) \equiv \|w\|_1 + C \sum_{i=1}^{l} \xi(w; x_i, y_i),    (1)
where k·k1 denotes the 1-norm and ξ(w; xi , yi ) is a non-negative (convex) loss function. The
regularization term kwk1 is used to avoid overfitting the training data. The user-defined
parameter C > 0 is used to balance the regularization and loss terms.
If kwk2 is used in (1), we obtain an L2-regularized classifier. Although L2 regulariza-
tion is used more commonly, an L1-regularized formula often produces a sparse w. Nonzero
elements help to select important features; in addition, the time required to produce pre-
dictions may be reduced. Considerable literature has been published on the advantages of
using L1 regularization; see, for example, Zhao and Yu (2006). However, an L1-regularized
form (1) is not differentiable regardless of its loss function. This drawback leads to greater
difficulty in solving the optimization problem. Therefore, certain considerations are required
to handle the non-differentiability.
Many loss functions can be used for linear classification. A commonly used one is logistic
loss:
    \xi_{\log}(w; x, y) = \log\bigl(1 + e^{-y w^T x}\bigr).    (2)
This loss function is twice differentiable. Note that minimizing logistic loss is equivalent to
maximizing the likelihood, whereas minimizing the regularized loss in (1) is equivalent to
maximizing the posterior with independent Laplace priors on the parameters. Two other
frequently used functions are the L1- and the L2-loss functions:
    \xi_{L1}(w; x, y) = \max\bigl(1 - y w^T x, \; 0\bigr)    (3)
and
    \xi_{L2}(w; x, y) = \max\bigl(1 - y w^T x, \; 0\bigr)^2.    (4)
Because of the max(·) operation, the L1-loss function is not differentiable. On the other
hand, L2 loss is differentiable, but not twice differentiable (Mangasarian, 2002). We refer
to (1) with logistic loss as L1-regularized logistic regression and (1) with L1/L2 loss as
L1-regularized L1-/L2-loss support vector machines (SVMs).
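As a concrete illustration of (1)-(4), the following Python sketch evaluates the three losses and the regularized objective. It is our own illustrative code, not part of any package discussed in this paper.

```python
import numpy as np

def logistic_loss(w, x, y):
    # xi_log(w; x, y) = log(1 + exp(-y w^T x)), Equation (2)
    return np.log1p(np.exp(-y * np.dot(w, x)))

def l1_loss(w, x, y):
    # xi_L1(w; x, y) = max(1 - y w^T x, 0), Equation (3)
    return max(1.0 - y * np.dot(w, x), 0.0)

def l2_loss(w, x, y):
    # xi_L2(w; x, y) = max(1 - y w^T x, 0)^2, Equation (4)
    return max(1.0 - y * np.dot(w, x), 0.0) ** 2

def f_objective(w, X, y, C, loss=logistic_loss):
    # f(w) = ||w||_1 + C sum_i loss(w; x_i, y_i), Equation (1)
    return np.abs(w).sum() + C * sum(loss(w, X[i], y[i]) for i in range(len(y)))
```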
In some applications, we require a bias term b (also called an intercept) in the loss
function; therefore, wT x in (2)–(4) is replaced with wT x + b. For example, the logistic loss
function becomes
    \xi_{\log}(w, b; x, y) = \log\bigl(1 + e^{-y(w^T x + b)}\bigr).    (5)
Because b does not appear in the regularization term, most optimization methods used to
solve (1) can solve (5) as well. In this paper, except where b is explicitly required, we mainly
consider the formulation (1).
Many papers have proposed optimization methods for large-scale L1-regularized logistic
regression (i.e., using ξlog as the loss function). These studies did not consider L1- or L2-loss
functions because logistic loss has a probability interpretation and is twice differentiable.
These methods differ in various aspects such as the convergence speed, ease of implemen-
tation, and practical applicability. With so many available methods, it is important to
conduct a comprehensive comparison. Schmidt et al. (2007, 2009) compared optimization
methods for L1-regularized classifiers; however, they did not include some state-of-the-art
solvers. Moreover, their comparison is based on the number of function evaluations instead
of the actual running time. In this paper, we categorize and compare existing methods
for logistic regression. We also extend some methods to solve L2-loss SVMs. We exclude
L1 loss from the discussion because most methods for logistic regression or L2-loss SVMs
assume the differentiability of the loss functions. Readers interested in using L1-loss SVMs
can refer to Bradley and Mangasarian (1998), Zhu et al. (2004), Fung and Mangasarian
(2004), Mangasarian (2006), and references therein.
The following notation is used throughout this paper:
l: number of instances
n: number of features
i: index for a data instance
j: index for a data feature
k: iteration index for an optimization algorithm
For any vector v and index set I = {i_1, . . . , i_|I|}, v_I denotes the sub-vector containing the components indexed by I. For the data matrix X, we use two sub-matrix representations: X_{:,J} contains the columns indexed by J, and X_{I,:} contains the rows indexed by I, that is,
    X_{I,:} \equiv \begin{bmatrix} x_{i_1}^T \\ \vdots \\ x_{i_{|I|}}^T \end{bmatrix}.    (8)
The function τ (s) gives the first derivative of the logistic loss function log(1 + es ):
    \tau(s) \equiv \frac{1}{1 + e^{-s}}.    (9)
An indicator vector for the jth component is
    e_j \equiv [0, \ldots, 0, 1, 0, \ldots, 0]^T,    (10)
where the single 1 is in the jth position. The one-variable function g_j(z) \equiv f(w + z e_j) - f(w), minimized in the coordinate descent sub-problem (11) below, has only one non-differentiable point at
z = −wj . In optimization, the cyclic method for choosing working variables is often called
the Gauss-Seidel rule (Tseng and Yun, 2009).
Several works have applied coordinate descent methods to solve (1) with logistic loss.
Here, a difficulty arises in that sub-problem (11) does not have a closed-form solution.
Goodman (2004) assumed nonnegative feature values (i.e., xij ≥ 0, ∀i, j) and then approx-
imated gj (z) by a function Aj (z) at each iteration. Aj (z) satisfies Aj (z) ≥ gj (z), ∀z, and
Aj (0) = gj (0) = 0; therefore, minimizing Aj (z) may reduce the function value. Moreover,
there is a closed-form solution for minimizing Aj (z). Goodman (2004) actually studied a
problem with additional constraints wj ≥ 0; however, his approach can be extended to solve
the original L1-regularized logistic regression by taking the sign of wj into consideration.
Genkin et al. (2007) implemented a cyclic coordinate descent method called BBR to
solve L1-regularized logistic regression. BBR approximately minimizes sub-problem (11)
in a trust region and applies one-dimensional Newton steps. Balakrishnan and Madigan
(2005) reported an extension of BBR for online settings.
In Section 4.1.2, we propose a coordinate descent method by extending Chang et al.’s
(2008) approach for L2-regularized classifiers. Chang et al. (2008) approximately solved
the sub-problem (11) by a one-dimensional Newton direction with line search. Experiments
show that their approach outperforms a BBR variant for L2-regularized classification. There-
fore, for L1 regularization, this approach might be faster than BBR. Hereafter, we refer to
this efficient coordinate descent method as CDN (coordinate descent using one-dimensional
Newton steps).
Tseng and Yun (2009) broadly discussed decomposition methods for L1-regularized
problems. One of their approaches is a cyclic coordinate descent method. They consid-
ered a general cyclic setting so that several working variables are updated at each iteration.
We show that CDN is a special case of their general algorithms.
If we randomly select the working variable, then the procedure becomes a stochastic co-
ordinate descent method. Shalev-Shwartz and Tewari (2009, Algorithm 2) recently studied
this issue. Duchi and Singer (2009) proposed a similar coordinate descent method for the
maximum entropy model, which is a generalization of logistic regression.
Yun and Toh (2011) studied this type of method and conducted experiments with document classification data. We refer to their method as CGD-GS because
the method described in Tseng and Yun (2009) is called “coordinate gradient descent” and
a Gauss-Southwell rule is used.
Active set methods are a popular optimization approach for linear-constrained problems.
For problem (1), an active set method becomes a special decomposition method because at
each iteration, a sub-problem over a working set of variables is solved. The main difference
from decomposition methods described earlier is that the working set contains all non-zero
variables. Therefore, an active set method iteratively predicts what a correct split of zero
and non-zero elements in w is. If the split is correct, then solving the sub-problem gives
the optimal values of non-zero elements.
Perkins et al. (2003) proposed an active set method for (1) with logistic loss. This
implementation uses gradient information to predict w’s zero and non-zero elements.
We can replace w in (1) with w+ − w− , where w+ and w− are both nonnegative vectors.
Then, problem (1) becomes equivalent to the following bound-constrained optimization
problem:
    \min_{w^+, w^-} \;\; \sum_{j=1}^{n} w_j^+ + \sum_{j=1}^{n} w_j^- + C \sum_{i=1}^{l} \xi(w^+ - w^-; x_i, y_i)    (12)
    subject to  w_j^+ \geq 0, \; w_j^- \geq 0, \; j = 1, \ldots, n.
The objective function and constraints of (12) are smooth, and therefore, the problem can
be solved by standard bound-constrained optimization techniques. Schmidt et al. (2009)
proposed ProjectionL1 to solve (12). This is an extension of the “two-metric projection”
method (Gafni and Bertsekas, 1984). Limited-memory quasi Newton implementations, for
example, LBFGS-B by Byrd et al. (1995) and BLMVM by Benson and Moré (2001), require
function/gradient evaluations and use a limited amount of memory to approximate the Hes-
sian matrix. Kazama and Tsujii (2003) presented an example of using BLMVM for (12).
Lin and Moré’s (1999) trust region Newton method (TRON) can also be applied to solve
(12). In addition to function/gradient evaluations, TRON needs to calculate Hessian-vector
products for faster convergence. Lin et al. (2008) applied TRON to solve L2-regularized
logistic regression and showed that TRON outperforms LBFGS for document data. No pre-
vious work has applied TRON to solve L1-regularized problems, and therefore, we describe
this in Section 5.1.
Koh et al. (2007) proposed an interior point method to solve L1-regularized logistic
regression. They transformed (1) to the following constrained problem:
    \min_{w, u} \;\; \sum_{j=1}^{n} u_j + C \sum_{i=1}^{l} \xi(w; x_i, y_i)    (13)
    subject to  -u_j \leq w_j \leq u_j, \; j = 1, \ldots, n.
Problem (13) is equivalent to (12) through the relation
    w_j^+ = \frac{u_j + w_j}{2} \quad and \quad w_j^- = \frac{u_j - w_j}{2}.
To ensure that w and u are in the interior of the feasible region set, Koh et al. (2007) added
a log barrier to the objective function of (13) and applied a Newton method.
An alternative to (1) is the following constrained formulation, which for a suitable K is equivalent to (1):
    \min_{w} \;\; \sum_{i=1}^{l} \xi(w; x_i, y_i)    (14)
    subject to  \|w\|_1 \leq K.
See the explanation in, for example, Donoho and Tsaig (2008, Section 1.2). Notice that in
(14), the constraint is not smooth at {w | wj = 0 for some j}. However, (14) contains fewer
variables and constraints than (12). Lee et al. (2006) applied the LARS algorithm described
in Efron et al. (2004) to find a Newton direction at each step and then used a backtracking
line search to minimize the objective value. Kivinen and Warmuth’s (1997) concept of
exponentiated gradient (EG) can solve (14) with additional constraints wj ≥ 0, ∀j. In
a manner similar to the technique for constructing (12), we can remove the nonnegative
constraints by replacing w with w+ − w− . If k is the iteration index, EG updates w by the
following rule:
    w_j^{k+1} = \frac{w_j^k \exp\bigl(-\eta_k \nabla_j \sum_{i=1}^{l} \xi(w^k; x_i, y_i)\bigr)}{Z_k},
where Zk is a normalization term for maintaining kwk k1 = K, ∀k and ηk is a user-defined
learning rate. Duchi et al. (2008) applied a gradient projection method to solve (14). The
update rule is
    w^{k+1} = \arg\min_{w} \Bigl\{ \bigl\| \bigl(w^k - \eta_k \nabla \sum_{i=1}^{l} \xi(w^k; x_i, y_i)\bigr) - w \bigr\| \;\Big|\; \|w\|_1 \leq K \Bigr\}.    (15)
They developed a fast algorithm to project a solution to the closest point satisfying the con-
straint. They also considered replacing the gradient in (15) with a stochastic sub-gradient.
In a manner similar to (15), Liu et al. (2009) proposed a gradient projection method called
Lassplore and carefully addressed the selection of the learning rate ηk . However, they eval-
uated their method only on data with no more than 2,000 instances.
Kim and Kim (2004) discussed a coordinate descent method to solve (14). They used
the gradient information to select an element wj for update. However, because of the
constraint kwk1 ≤ K, the whole vector w is normalized at each step. Thus, the setting
is very different from the unconstrained situations described in Section 2.1.1. Kim et al.
(2008) made further improvements to realize faster convergence.
Active set methods have also been applied to solve (14). However, in contrast to the
active set methods described in Section 2.1.3, here, the sub-problem at each iteration is
a constrained optimization problem. Roth (2004) studied general loss functions including
logistic loss.
Shi et al. (2010) proposed a hybrid algorithm for (1) with logistic loss. They used a fixed-
point method to identify the set {j | wj∗ = 0}, where w∗ is an optimal solution, and then
applied an interior point method to achieve fast local convergence.
Krishnapuram et al. (2005) and Friedman et al. (2010) replaced the loss term with a second-
order approximation at the beginning of each iteration and then applied a cyclic coordinate
descent method to minimize the quadratic function. We will show that an implementation
in Friedman et al. (2010) is efficient.
Teo et al. (2010) implemented a bundle (cutting plane) method BMRM to handle non-
smooth loss functions. It includes an extension for L1 regularization.
Several works have attempted to find the “solution path” of (1). The optimal solution of (1)
varies according to parameter C. It is occasionally useful to find all solutions as a function
of C; see, for example, Rosset (2005), Zhao and Yu (2007), Park and Hastie (2007), and
Keerthi and Shevade (2007). We do not provide details of these works because this paper
focuses on the case in which a single C is used.
Although this paper focuses on comparing state-of-the-art software, we first discuss some known strengths
and weaknesses of existing methods.
Optimization methods using higher-order information (e.g., quasi Newton or Newton meth-
ods) usually enjoy fast local convergence. However, each of their iterations is expensive.
For example, Newton methods such as TRON or IPM need to solve a large linear system
related to the Hessian matrix. In contrast, methods using or partially using the gradient
information (e.g., stochastic gradient descent) have slow local convergence although they
can more quickly decrease the function value in the early stage.
Third, in many situations, x_i are not directly available. Instead, the product between the
data matrix X and a vector can be efficiently obtained through certain operators.
We briefly review some of the many optimization methods for least-square regression, where the loss is
    \xi(w; x, y) = (y - w^T x)^2.    (16)
With this loss, formulation (14) becomes LASSO, proposed by Tibshirani (1996); some early optimization studies include, for example, Fu (1998)
and Osborne et al. (2000a,b). Fu (1998) applied a cyclic coordinate descent method. For
least-square loss, the minimization of the one-variable sub-problem (11) has a closed-form
solution. Sardy et al. (2000) also considered coordinate descent methods, although they
allowed a block of variables at each iteration. Wu and Lange (2008) considered a coordinate
descent method, but used the gradient information for selecting the working variable at each
iteration. Osborne et al. (2000a) reported one of the earliest active set methods for L1-
regularized problems. Roth’s (2004) method for general losses (see Section 2.2.2) reduces to
this method if the least-square loss is used. Friedman et al. (2007) extended Fu’s coordinate
descent method to find a solution path. Donoho and Tsaig (2008) also obtained a solution
path. Their procedure requires solving a linear system with the matrix X_{:,J}^T X_{:,J}, where J is
a subset of {1, . . . , n}. Figueiredo et al. (2007) transformed the regression problem to a
bound-constrained formula in (12) and applied a projected gradient method. Wright et al.
(2009) proposed the iterative minimization of the sum of the L1 regularization term and
a quadratic approximation of the loss term. In the quadratic approximation, they used a
diagonal matrix instead of the Hessian of the loss term, and therefore, the minimization can
be easily carried out. Hale et al. (2008) proposed a fixed point method to solve (1) with
the least-square loss (16). Their update rule is generated from a fixed-point view; however,
it is very similar to a gradient projection algorithm.
The dual of LASSO and the dual of (1) have been discussed in Osborne et al. (2000b)
and Kim et al. (2007), respectively. However, thus far, there have been few optimization
methods for the dual problem. Tomioka and Sugiyama (2009) proposed a dual augmented
Lagrangian method for L1-regularized least-square regression that theoretically converges
super-linearly.
Most optimization methods for classification discussed in the earlier sections can handle
general smooth loss functions, and therefore, they can be applied to the regression problem.
However, data for classification applications may be very different from those for regression
applications. For example, in text classification, the numbers of instances and features are
both large and data are very sparse. However, in certain signal processing applications, the
number of instances may be much smaller than the number of features (i.e., l ≪ n) and the
data may be dense. Moreover, in classification, the parameter C is often selected by cross
validation; however, in signal processing, the selection may be application dependent. Few
optimization studies have investigated both types of applications. The interior point method
for logistic regression by solving (13) has been applied to the least-square regression problem
(Kim et al., 2007). Duchi et al. (2008) compared their gradient projection implementation
(see Section 2.2.2) with interior point methods using both logistic and least-square losses.
In Section 2.1.2, we mentioned a decomposition method CGD-GS by Yun and Toh (2011).
In the same paper, Yun and Toh have also investigated the performance of CGD-GS on
regression problems.
In this paper, we focus on data classification, and therefore, our conclusions may not
apply to regression applications. In particular, the efficient calculation between X and a
vector in some signal processing applications may afford advantages to some optimization
methods.
In the rest of this paper, we compare some state-of-the-art software BBR, SCD, CGD-GS,
IPM, BMRM, OWL-QN, Lassplore and GLMNET. We further develop two efficient implemen-
tations. One is a coordinate descent method (CDN) and the other is a trust region Newton
method (TRON). These packages are selected for two reasons. First, they are pub-
licly available. Second, they are able to handle large and sparse data sets. We categorize
these packages into three groups:
• decomposition methods,
• methods by solving constrained optimization problems,
• other methods,
and describe their algorithms in Sections 4–6. The comparison results are described in
Section 8. Note that classification (logistic regression and L2-loss SVMs) is our focus, and
therefore, software for regression problems using the least-square loss (16) is not considered.
4. Decomposition Methods
This section discusses three methods that sequentially or randomly choose variables for update,
and one method using gradient information for the working set selection.
At each inner iteration, a coordinate descent method minimizes the one-variable sub-problem
    \min_{z} \; g_j(z) \equiv |w_j^{k,j} + z| - |w_j^{k,j}| + L_j(z; w^{k,j}) - L_j(0; w^{k,j}),    (17)
where
    L_j(z; w^{k,j}) \equiv L(w^{k,j} + z e_j).
For simplicity, hereafter, we use Lj (z) in most places. If the solution of (17) is d, then we
update the jth element by
    w_j^{k,j+1} = w_j^{k,j} + d.
Typically, a coordinate descent method sequentially goes through all variables and then
repeats the same process; see the framework in Algorithm 1. We refer to the process of
updating every n elements (i.e., from wk to wk+1 ) as an outer iteration and the step of
updating one element (i.e., from wk,j to wk,j+1 ) as an inner iteration. Practically, we only
approximately solve (17), where several approaches are discussed below.
Using the same way to obtain the optimality condition in (6), for the one-variable
function gj (z), we have that z = 0 is optimal for (17) if and only if
    L'_j(0) + 1 = 0           if w_j^{k,j} > 0,
    L'_j(0) - 1 = 0           if w_j^{k,j} < 0,    (18)
    -1 \leq L'_j(0) \leq 1    if w_j^{k,j} = 0.
The optimality condition at z = 0 indicates whether modifying w_j^{k,j} can decrease the function value.
4.1.1 BBR
Genkin et al. (2007) propose a coordinate descent method BBR for (1) and (5) with logistic
loss. This method is extended from Zhang and Oles (2001) for L2-regularized logistic
regression. At each inner iteration, BBR approximately solves the sub-problem (17) by a
trust region Newton method. With the trust region ∆j , it requires the step z to satisfy
    |z| \leq \Delta_j  \quad and \quad  w_j^{k,j} + z \begin{cases} \geq 0 & \text{if } w_j^{k,j} > 0, \\ \leq 0 & \text{if } w_j^{k,j} < 0. \end{cases}    (19)
The first constraint confines the step size to be in the trust region and ∆j is updated at the
end of each inner iteration. Due to the non-differentiable point at z = −wjk,j , the second
constraint ensures that the function is differentiable in the search space.
To approximately solve (17), BBR minimizes a quadratic function upper-bounding the
function gj (z) in the trust region. Though gj (z) is not differentiable, by considering both
cases of w_j^{k,j} > 0 and w_j^{k,j} < 0, we obtain the following form:
    g_j(z) = g_j(0) + g'_j(0) z + \frac{1}{2} g''_j(\eta z) z^2,    (20)
where 0 < \eta < 1,
    g'_j(0) \equiv \begin{cases} L'_j(0) + 1 & \text{if } w_j^{k,j} > 0, \\ L'_j(0) - 1 & \text{if } w_j^{k,j} < 0, \end{cases}  \quad and \quad  g''_j(\eta z) \equiv L''_j(\eta z).    (21)
Notice that when w_j^{k,j} = 0, g_j(z) is non-differentiable at z = 0 and g'_j(0) is not well-defined.
We will discuss this situation later. BBR finds an upper bound U_j of g''_j(z) in the trust region and considers the quadratic model
    \hat{g}_j(z) \equiv g_j(0) + g'_j(0) z + \frac{1}{2} U_j z^2.
Any step z satisfying ĝj (z) < ĝj (0) leads to
gj (z) − gj (0) = gj (z) − ĝj (0) ≤ ĝj (z) − ĝj (0) < 0,
so the function value is decreased. If logistic loss (2) is used, BBR suggests setting Uj as
    U_j \equiv C \sum_{i=1}^{l} x_{ij}^2 \, F\bigl(y_i (w^{k,j})^T x_i, \; \Delta_j |x_{ij}|\bigr),    (22)
where
    F(r, \delta) = \begin{cases} 0.25 & \text{if } |r| \leq \delta, \\ \dfrac{1}{2 + e^{|r| - \delta} + e^{\delta - |r|}} & \text{otherwise.} \end{cases}
If w_j^{k,j} \neq 0, BBR minimizes \hat{g}_j(z) under the constraints (19) to obtain
    d = \min\Bigl(\max\Bigl(P\bigl(-\tfrac{g'_j(0)}{U_j}, \, w_j^{k,j}\bigr), \, -\Delta_j\Bigr), \, \Delta_j\Bigr),    (23)
where
    P(z, w) \equiv \begin{cases} z & \text{if } sgn(w + z) = sgn(w), \\ -w & \text{otherwise.} \end{cases}
Now consider the case of w_j^{k,j} = 0, where g'_j(0) is not well-defined at this point. If L'_j(0) + 1 < 0,
by defining g'_j(0) \equiv L'_j(0) + 1, any 0 < z \leq -g'_j(0)/U_j gives a smaller \hat{g}_j(z) than \hat{g}_j(0). We
thus obtain the new point by mapping -g'_j(0)/U_j back to the trust region. The situation
for L'_j(0) - 1 > 0 is similar. We do not need to handle the situation -1 \leq L'_j(0) \leq 1, as (18)
and w_j^{k,j} = 0 indicate that z = 0 is the minimum of g_j(z).
The procedure of BBR to approximately solve (17) is described in Algorithm 2. The
major cost is on calculating g'_j(0) and U_j. For logistic loss, the term L'_j(0) needed for calculating
g'_j(0) is
    L'_j(0) = C \sum_{i=1}^{l} y_i x_{ij} \bigl(\tau(y_i (w^{k,j})^T x_i) - 1\bigr),    (24)
where \tau(\cdot) is defined in (9). From (22) and (24), the most expensive operation is obtaining
w^T x_i, \forall i. A common trick for saving running time is to store w^T x_i, \forall i and update them
according to
    w^T x_i \leftarrow w^T x_i + d \cdot x_{ij}.    (25)
If w^T x_i, \forall i are available, both (22) and (24) need O(l) operations. Because maintaining
w^T x_i via (25) also takes O(l), the cost per inner iteration is O(l).
Unfortunately, there is no convergence proof yet for the method BBR.
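To make the O(l) cost per inner iteration concrete, the following sketch carries out one BBR-style inner iteration on coordinate j for logistic loss, maintaining the stored values w^T x_i as in (25). It is a simplified illustration with dense numpy arrays; the update of the trust region Δ_j and other details of the actual BBR software are omitted.

```python
import numpy as np

def tau(s):
    # first derivative of log(1 + e^s), Equation (9)
    return 1.0 / (1.0 + np.exp(-s))

def F(r, delta):
    # bound used in (22)
    return np.where(np.abs(r) <= delta, 0.25,
                    1.0 / (2.0 + np.exp(np.abs(r) - delta) + np.exp(delta - np.abs(r))))

def bbr_inner_step(j, w, wTx, X, y, C, Delta):
    """One inner iteration of BBR on coordinate j; updates w and wTx in place."""
    xj = X[:, j]
    ywTx = y * wTx
    Lp = C * np.sum(y * xj * (tau(ywTx) - 1.0))                 # L'_j(0), Equation (24)
    Uj = max(C * np.sum(xj ** 2 * F(ywTx, Delta[j] * np.abs(xj))), 1e-12)  # Equation (22)
    if w[j] > 0:
        gp = Lp + 1.0
    elif w[j] < 0:
        gp = Lp - 1.0
    else:                                 # w_j = 0: move only if a descent side exists
        if Lp + 1.0 < 0:
            gp = Lp + 1.0
        elif Lp - 1.0 > 0:
            gp = Lp - 1.0
        else:
            return                        # z = 0 is already optimal for this coordinate
    d = -gp / Uj
    if w[j] != 0 and np.sign(w[j] + d) != np.sign(w[j]):
        d = -w[j]                         # do not cross the non-differentiable point
    d = float(np.clip(d, -Delta[j], Delta[j]))   # stay in the trust region
    w[j] += d
    wTx += d * xj                         # maintain w^T x_i, Equation (25)
```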
Because (26) is only a quadratic approximation of f (wk,j + zej ) − f (wk,j ), the direction
d may not ensure the decrease of the function value. For the convergence, Chang et al.
(2008) consider a line search procedure to find λ ∈ (0, 1] such that the step λd satisfies a
sufficient decrease (Armijo-type) condition on the function value, where σ is any constant in (0, 1). However, as pointed out by Tseng and Yun (2009), such a
condition may not be suitable here due to the non-smooth regularization term ‖w‖_1. We
follow Tseng and Yun (2009) to use a modified condition:
    g_j(\lambda d) - g_j(0) \leq \sigma \lambda \bigl( L'_j(0) d + |w_j^{k,j} + d| - |w_j^{k,j}| \bigr).    (28)
Algorithm 3 CDN: Approximately solving the sub-problem by Newton directions with line
search
1. Given wk,j . Choose β ∈ (0, 1).
2. Calculate the Newton direction d by (27).
3. Find the largest λ ∈ {1, β, β^2, . . . } such that λd satisfies (28).
We discuss the computational cost. To obtain the sub-problem (26), we must calculate
L'_j(0) and L''_j(0). For logistic loss, L'_j(0) is shown in (24) and
    L''_j(0) = C \sum_{i=1}^{l} x_{ij}^2 \, \tau\bigl(y_i (w^{k,j})^T x_i\bigr)\bigl(1 - \tau(y_i (w^{k,j})^T x_i)\bigr).    (29)
Similar to the situation in BBR, calculating w^T x_i, \forall i is the major cost here. We can apply
the same trick in (25) to maintain w^T x_i, \forall i. In our implementation, we maintain e^{w^T x_i}
instead:
    e^{w^T x_i} \leftarrow e^{w^T x_i} \cdot e^{\lambda d x_{ij}}.    (30)
The line search procedure needs to calculate g_j(\lambda d). From (2), the main cost is still on
obtaining (w + \lambda d e_j)^T x_i, \forall i, so the trick in (30) is again useful. If e^{w^T x_i}, \forall i are available,
from (24), (29) and (2), the cost per inner iteration is (1 + number of line search steps) \times O(l).
To reduce the cost for line search, Chang et al. (2008) obtain a function ĝj (·) satisfying
ĝj (λd) > gj (λd) and check ĝj (λd) − gj (0) in (28). Calculating ĝj (λd) is much easier than
gj (λd), so we may avoid calculating the last gj (λd) in the line search procedure. In some
iterations, λ = 1 already satisfies (28), and therefore, this trick makes the cost of the line
search procedure negligible. We do not show details here; however, derivations can be found
in Fan et al. (2008, Appendix G).
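The following sketch puts one CDN inner iteration together for logistic loss: the one-dimensional Newton direction (the standard closed-form minimizer of the quadratic-plus-L1 model (26), corresponding to (27)), the line search condition (28), and the maintenance of e^{w^T x_i} in (30). The upper-bound function ĝ_j used to skip the last g_j(λd) evaluation is omitted. This is our own illustrative code, not the LIBLINEAR implementation.

```python
import numpy as np

def _loss_terms(expwTx, y):
    # log(1 + exp(-y_i w^T x_i)) computed from stored e_i = exp(w^T x_i)
    return np.where(y > 0, np.log1p(1.0 / expwTx), np.log1p(expwTx))

def cdn_inner_step(j, w, expwTx, X, y, C, sigma=0.01, beta=0.5, max_ls=20):
    """One CDN inner iteration on coordinate j; expwTx stores exp(w^T x_i), Equation (30)."""
    xj = X[:, j]
    t = np.where(y > 0, expwTx / (1.0 + expwTx), 1.0 / (1.0 + expwTx))   # tau(y_i w^T x_i)
    Lp = C * np.sum(y * xj * (t - 1.0))                       # L'_j(0), Equation (24)
    Lpp = max(C * np.sum(xj ** 2 * t * (1.0 - t)), 1e-12)     # L''_j(0), Equation (29)
    # one-dimensional Newton direction of the quadratic + L1 model (soft-threshold form)
    if Lp + 1.0 <= Lpp * w[j]:
        d = -(Lp + 1.0) / Lpp
    elif Lp - 1.0 >= Lpp * w[j]:
        d = -(Lp - 1.0) / Lpp
    else:
        d = -w[j]
    # backtracking line search: largest lambda in {1, beta, beta^2, ...} satisfying (28)
    rhs = Lp * d + abs(w[j] + d) - abs(w[j])
    L0 = C * _loss_terms(expwTx, y).sum()
    lam = 1.0
    for _ in range(max_ls):
        new_e = expwTx * np.exp(lam * d * xj)
        g = abs(w[j] + lam * d) - abs(w[j]) + C * _loss_terms(new_e, y).sum() - L0
        if g <= sigma * lam * rhs:
            break
        lam *= beta
    w[j] += lam * d
    expwTx *= np.exp(lam * d * xj)        # maintain exp(w^T x_i), Equation (30)
```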
Next we describe two implementation techniques to improve the convergence speed. The
first one is to use a random permutation of sub-problems. In Algorithm 1, we cyclically
consider variables to form one-variable sub-problems. Chang et al. (2008) show that solving
sub-problems in a random order may lead to faster convergence. That is, at the kth iteration,
we randomly permute {1, 2, . . . , n} to {πk (1), πk (2), . . . , πk (n)} and update the elements of
w in the order of {wπk (1) , wπk (2) , . . . , wπk (n) }.
The second implementation technique is shrinking. Past works such as Hsieh et al.
(2008), Joachims (1998), and Krishnapuram et al. (2005, Section 3.5) heuristically remove
some variables to obtain a smaller optimization problem. If wj has remained zero at many
iterations, it is very possible that the optimal wj∗ is also zero. Therefore, we can remove
such variables. Our shrinking implementation is related to that in the software LIBSVM
(Chang and Lin, 2011). From the optimality condition (6), we have the following result.
Theorem 1 Assume {wk } globally converges to w∗ . If −1 < ∇j L(w∗ ) < 1, then there is
Kj such that for all k ≥ Kj, the jth component remains zero (i.e., w_j^{k,j} = 0).
Using this theorem, we design the following shrinking strategy. Before updating w_j^{k,j} via
approximately solving the sub-problem (17), if
    w_j^{k,j} = 0  \quad and \quad  -1 + M^{k-1} \leq \nabla_j L(w^{k,j}) \leq 1 - M^{k-1},    (32)
we conjecture that w_j^{k,j} may remain zero and hence remove it from the optimization. We choose
    M^{k-1} \equiv \frac{\max_j v_j}{l},
where
    v_j \equiv \begin{cases} |\nabla_j L(w^{k-1,j}) + 1| & \text{if } w_j^{k-1,j} > 0, \\ |\nabla_j L(w^{k-1,j}) - 1| & \text{if } w_j^{k-1,j} < 0, \\ \max\bigl(\nabla_j L(w^{k-1,j}) - 1, \; -1 - \nabla_j L(w^{k-1,j}), \; 0\bigr) & \text{if } w_j^{k-1,j} = 0. \end{cases}
From (6), vj , j = 1, . . . , n measure the violation of the optimality condition at the (k − 1)st
iteration. The value M^{k-1} reflects how aggressively we remove variables. It is large in the
beginning, but approaches zero in the end.
The shrinking implementation introduces little extra cost. When updating the jth
component at the kth iteration, we calculate \nabla_j L(w^{k,j}) for the direction d in (27), regardless of
whether shrinking is implemented. Thus, \nabla_j L(w^{k,j}) needed for checking (32) is already available.
Moreover, it can be used to calculate v_j and M^k, which are needed for the next iteration.
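A minimal sketch of one outer CDN iteration with the random permutation and shrinking techniques is given below. It assumes condition (32) takes the form stated above and leaves the inner update and gradient computation as callbacks; it is illustrative only.

```python
import numpy as np

def violation(g, wj):
    # v_j from Section 4.1.2: how much the optimality condition (6) is violated at coordinate j
    if wj > 0:
        return abs(g + 1.0)
    if wj < 0:
        return abs(g - 1.0)
    return max(g - 1.0, -1.0 - g, 0.0)

def cdn_outer_iteration(w, active, grad_L_j, inner_update, l, M_prev, rng):
    """One outer iteration: random permutation over the active set plus shrinking.

    grad_L_j(j) returns nabla_j L(w^{k,j}); inner_update(j) performs the CDN inner step;
    l is the number of instances; M_prev is M^{k-1}; 'active' is a Python set of indices.
    """
    v_max = 0.0
    for j in rng.permutation(sorted(active)):       # random permutation of sub-problems
        g = grad_L_j(j)
        if w[j] == 0 and -1.0 + M_prev <= g <= 1.0 - M_prev:
            active.discard(int(j))                  # conjecture w_j stays zero; shrink it
            continue
        v_max = max(v_max, violation(g, w[j]))
        inner_update(j)
    return v_max / l                                # M^k for the next outer iteration
```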
4.1.3 SCD
Shalev-Shwartz and Tewari (2009) propose a stochastic coordinate descent method (SCD) that solves the bound-constrained problem (12) and randomly selects the working variable. Because of the non-negativity constraints, the step z must satisfy
    w_j^{k,+} + z \geq 0  \quad or \quad  w_j^{k,-} + z \geq 0,
according to whether w_j^+ or w_j^- is the working variable. SCD considers a second-order
approximation similar to BBR:
    \hat{g}_j(z) = g_j(0) + g'_j(0) z + \frac{1}{2} U_j z^2,    (33)
where
    g'_j(0) = \begin{cases} 1 + L'_j(0) & \text{for } w_j^+, \\ 1 - L'_j(0) & \text{for } w_j^-, \end{cases}  \quad and \quad  U_j \geq g''_j(z), \; \forall z.
BBR considers U_j to be an upper bound of g''_j(z) only in the trust region, while SCD finds
a global upper bound of g''_j(z). For logistic regression, following (9) and (29), we have
\tau(\cdot)(1 - \tau(\cdot)) \leq 0.25 and
    U_j = 0.25 \, C \sum_{i=1}^{l} x_{ij}^2 \geq g''_j(z), \; \forall z.    (34)
Shalev-Shwartz and Tewari (2009) assume −1 ≤ xij ≤ 1, ∀i, j, so a simpler upper bound is
Uj = 0.25Cl. (35)
Using the direction obtained by minimizing (33) and taking the non-negativity into
consideration, SCD updates w as follows:
    If w_j^+ is selected:  w_j^+ \leftarrow w_j^+ + \max\Bigl(-w_j^+, \; -\frac{1 + L'_j(0)}{U_j}\Bigr),    (36)
    otherwise:             w_j^- \leftarrow w_j^- + \max\Bigl(-w_j^-, \; -\frac{1 - L'_j(0)}{U_j}\Bigr).    (37)
4.2 CGD-GS
At each iteration, this method selects a working set J and minimizes f over the variables in J.
Because it is still difficult to solve this sub-problem, CGD-GS considers a quadratic approximation of the loss term:
    \min_{d} \; q_k(d) \equiv \nabla L(w^k)^T d + \frac{1}{2} d^T H d + \|w^k + d\|_1 - \|w^k\|_1    (38)
    subject to  d_j = 0, \; \forall j \notin J,
where H is a symmetric approximation of ∇²L(w^k). After computing a direction d by solving (38), a backtracking line search finds a step size satisfying a sufficient decrease condition (39) that involves two constants 0 < σ < 1 and 0 ≤ γ < 1. This condition is the same as (28) if γ = 0 and J
contains only one element. Tseng and Yun (2009) used (39) for both the cyclic selection
(Gauss-Seidel) or the selection using gradient information (Gauss-Southwell).
For selecting J using gradients, Tseng and Yun (2009) proposed two possible ways.3
The first one, referred to as the Gauss-Southwell-r rule, requires J to satisfy
    \|d(J)\|_\infty \geq v \, \|d(N)\|_\infty,    (40)
where v ∈ (0, 1) is a constant and d(J) and d(N ) are the solution of (38) by considering
J and N as the working set, respectively. This condition connects the directions of solving
sub-problems using a subset and a full set. The other condition for selecting J is
    q_k(d(J)) \leq v \, q_k(d(N)),    (41)
where v ∈ (0, 1) and qk (d) is defined in (38). This condition is referred to as the Gauss-
Southwell-q rule. Algorithm 5 summarizes the procedure.
The remaining issues are how to solve the sub-problem (38) and how to obtain J satis-
fying (40) or (41). The CGD-GS implementation considers a diagonal matrix with positive
entries as H. For example, H_jj = max(∇²_jj L(w^k), ε), where ε is a small positive value. Then,
the sub-problem becomes |J| separable one-variable problems like (26). Each one-variable
problem has a simple closed-form solution. Further, it is easy to find indices satisfying (40)
or (41). For example, the rule (40) amounts to selecting the larger elements of d(N). Tseng and
Yun (2009) proved that any limit point of {wk } is an optimum of (1).
We discuss the computational cost for logistic regression. If H is diagonal, then solving
(38) takes only O(|J|) operations. Constructing (38) is more expensive because we need to
calculate
    \nabla L(w) = C \sum_{i=1}^{l} \bigl(\tau(y_i w^T x_i) - 1\bigr) y_i x_i.    (42)
Yun and Toh (2011) apply a trick similar to (30) and maintain e^{w^T x_i}, \forall i, but the gradient
calculation still needs O(ln). This high cost may not pay off, and therefore, Hsieh et al.
(2008) favor decomposition methods without maintaining the gradient. Finding the working
set J is another potentially expensive step, but it is cheap for a diagonal H.
3. Note that (40) indirectly uses gradient information. If the 1-norm term is not considered and H is
diagonal, then d(N ) is a “scaled” gradient with dj (N ) = −∇j L(wk )/Hjj . Thus, (40) selects indices
with larger scaled gradient values.
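To illustrate how cheap the working set selection is with a diagonal H, the sketch below computes d(N) coordinate-wise (each one-variable problem has the same soft-threshold closed form used in CDN) and then keeps the coordinates with the largest |d_j(N)|; the fixed fraction used here is only a stand-in for the exact condition (40). Illustrative code only.

```python
import numpy as np

def cgd_gs_r_select(gradL, H_diag, w, frac=0.1):
    """Select a working set J in the spirit of the Gauss-Southwell-r rule (40).

    gradL = nabla L(w^k); H_diag = positive diagonal of H.
    """
    # d(N): coordinate-wise minimizer of (38) with the full working set N
    d = np.empty_like(w)
    for j in range(len(w)):
        g, h = gradL[j], H_diag[j]
        if g + 1.0 <= h * w[j]:
            d[j] = -(g + 1.0) / h
        elif g - 1.0 >= h * w[j]:
            d[j] = -(g - 1.0) / h
        else:
            d[j] = -w[j]
    # keep the coordinates with the largest steps (a proxy for satisfying (40))
    k = max(1, int(frac * len(w)))
    J = np.argsort(-np.abs(d))[:k]
    return J, d
```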
    \min_{w} \; \bar{f}(w)
    subject to  w \in \Omega \equiv \{w \mid l_j \leq w_j \leq u_j, \; \forall j\},
where f¯(w) denotes the objective function of (12), and l and u are lower and upper bounds,
respectively.
At the kth iteration of the trust region Newton method, we have an iterate wk , a size
∆k of the trust region, and a quadratic model
    q_k(d) \equiv \frac{1}{2} d^T \nabla^2 \bar{f}(w^k) d + \nabla \bar{f}(w^k)^T d
to approximate the value f¯(wk + d) − f¯(wk ). Next, we find a step dk by approximately
solving the following trust region problem
    \min_{d} \; q_k(d)    (44)
    subject to  \|d\| \leq \Delta_k, \; w^k + d \in \Omega.
Then, we decide whether to accept the step by checking the ratio
    \rho_k = \frac{\bar{f}(w^k + d^k) - \bar{f}(w^k)}{q_k(d^k)}    (45)
of the actual reduction in the function to the predicted reduction in the quadratic model.
The direction dk is accepted if ρk is large enough:
    w^{k+1} = \begin{cases} w^k + d^k & \text{if } \rho_k > \eta_0, \\ w^k & \text{if } \rho_k \leq \eta_0, \end{cases}    (46)
where η0 > 0 is a pre-specified value. The size ∆k of the trust region is then updated
according to the reduction of the function value. If the reduction is significant, then ∆k is
enlarged. Otherwise, we reduce ∆k . More details can be found in Lin and Moré (1999).
The framework of our trust region method is given in Algorithm 6. Earlier works applying
trust region methods for L2-regularized problems include, for example, Lin et al. (2008).
The algorithm here is more complicated due to the bound constraints.
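The acceptance test (45)-(46) and the adjustment of Δ_k can be sketched as follows; the specific shrink/expand thresholds and factors are illustrative assumptions, not the exact rules of TRON.

```python
def trust_region_update(f_old, f_new, q_pred, w, d, Delta,
                        eta0=1e-4, shrink=0.25, expand=2.5):
    """Acceptance test (45)-(46) plus a simple trust-region size update.

    q_pred = q_k(d^k) < 0 is the model's predicted reduction.
    """
    rho = (f_new - f_old) / q_pred       # Equation (45)
    if rho > eta0:                       # Equation (46): accept the step
        w = w + d
    if rho < 0.25:                       # poor model agreement: shrink the region
        Delta *= shrink
    elif rho > 0.75:                     # good agreement: enlarge the region
        Delta *= expand
    return w, Delta
```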
where
    d^{k,C} = P[w^k - \lambda \nabla \bar{f}(w^k)] - w^k    (48)
is called the Cauchy step in bound-constrained optimization and σ ∈ (0, 1/2) is a constant.
The projection operator P[\cdot] maps w^k - \lambda \nabla \bar{f}(w^k) back to the feasible region \Omega:
    P[x]_j = \min\bigl(\max(x_j, l_j), \, u_j\bigr),
so some components become bounded. Although the negative gradient direction is projected
in (48), the resulting direction is still a descending one (i.e., ∇qk (0)T dk,C < 0). Hence, one
can always find a small λ > 0 such that (47) is satisfied. The point wk,C ≡ wk + dk,C is
referred to as the Cauchy point.
Let B and F denote the sets of bounded and free (not bounded) components at the Cauchy point, respectively. Lin and Moré (1999) then approximately minimize q_k over the free variables:
    \min_{v_F} \; q_k(d^{k,C} + v)    (51)
    subject to  \|d^{k,C} + v\| \leq \Delta_k, \; v_B = 0.
If the free set at the Cauchy point is close to that at an optimum, using a Newton direction
on this sub-space leads to fast convergence.
Because (51) does not enforce the feasibility of wk,C + v, one needs a projected line
search procedure similar to (47)–(48). Details are shown later in (55). The resulting point
may contain more bounded components than the Cauchy point. In this situation, Lin and
Moré (1999) continue to minimize a quadratic approximation on the new sub-space. They
thus generate inner iterates w^{k,1} = w^{k,C}, w^{k,2}, w^{k,3}, . . ., until the free/bounded sets do
not change. If m inner iterations are taken, then the direction dk for the kth trust region
iteration in Algorithm 6 is
    d^k = w^{k,m+1} - w^k.
Details of our procedure are described in Algorithm 7. Because each inner iteration enlarges
the bounded set, the number of inner iterations is bounded by 2n, the number of variables.
In practice, very few inner iterations (one or two) are taken. Another reason to take inner
iterations is for the quadratic convergence proof in Lin and Moré (1999).
Let t be the inner-iteration index and B_t, F_t be the bounded/free sets at w^{k,t}. With v_{B_t} = 0,
    q_k(d^{k,t} + v) = \frac{1}{2} v_{F_t}^T \nabla^2 \bar{f}(w^k)_{F_t,F_t} v_{F_t} + \nabla q_k(d^{k,t})_{F_t}^T v_{F_t} + q_k(d^{k,t}),    (52)
so minimizing q_k(d^{k,t} + v) is equivalent to solving the linear system
    \nabla^2 \bar{f}(w^k)_{F_t,F_t} v_{F_t} = -\nabla q_k(d^{k,t})_{F_t}.    (53)
We apply conjugate gradient (CG) methods and stop once
    \|\nabla^2 \bar{f}(w^k)_{F_t,F_t} v_{F_t} + \nabla q_k(d^{k,t})_{F_t}\| = \|\nabla q_k(d^{k,t+1})_{F_t}\| \leq \epsilon \, \|\nabla \bar{f}(w^k)_{F_t}\|    (54)
is satisfied, where \epsilon is a given positive constant. See Section 5.1.3 for reasons to choose CG.
CG may stop before reaching (54) if either the iterate causes our search direction to exceed
the trust region boundary or the singularity of the matrix ∇2 f¯(wk )Ft ,Ft is detected.
Once a direction v^{k,t} is identified, we conduct a projected line search to ensure the
feasibility and the sufficient decrease of q_k(d). This procedure is similar to (47)-(48) for
the Cauchy step: we find a suitable step size λ, for example, by a backtracking line search.
The matrix ∇2 f¯(w)Ft ,Ft may be too large to be stored. If using CG, the Hessian-vector
product can be conducted by a sequence of matrix-vector products:
    \nabla^2 \bar{f}(w^{k,t})_{F_t,F_t} v_{F_t} = C \begin{bmatrix} X^T \\ -X^T \end{bmatrix}_{F_t,:} D \left( \begin{bmatrix} X & -X \end{bmatrix}_{:,F_t} v_{F_t} \right),    (58)
where D is the diagonal matrix arising from the logistic loss, with D_{ii} = \tau(y_i w^T x_i)\bigl(1 - \tau(y_i w^T x_i)\bigr) and w = w^+ - w^-.
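For the original problem (1) with logistic loss, the analogous Hessian-vector product is ∇²L(w)v = C X^T(D(Xv)); the sketch below computes it without forming the Hessian. Restricting rows and columns to the free set and using the [X, -X] structure gives (58). This is our own illustrative code; X may be a dense array or a scipy.sparse matrix.

```python
import numpy as np

def logistic_hessian_vector_product(X, y, w, v, C):
    """Compute nabla^2 L(w) v = C * X^T (D (X v)) without forming the Hessian.

    D is diagonal with D_ii = tau(y_i w^T x_i) (1 - tau(y_i w^T x_i)),
    consistent with (29).
    """
    z = X.dot(w) * y
    t = 1.0 / (1.0 + np.exp(-z))          # tau(y_i w^T x_i)
    d = t * (1.0 - t)
    Xv = X.dot(v)
    return C * X.T.dot(d * Xv)
```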
5.1.4 Convergence
From Theorem 2.1 of Lin and Moré (1999), any limit point of {wk } is an optimum of (12).
For the local convergence rate, in Appendix E, we indicate that in general TRON can achieve
quadratic convergence.
Koh et al. (2007) add a log barrier to the objective of (13) (with the bias term b) and minimize
    \phi_t(b, w, u) \equiv t \Bigl( \sum_{j=1}^{n} u_j + C \sum_{i=1}^{l} \xi(w, b; x_i, y_i) \Bigr) - \sum_{j=1}^{n} \log(u_j^2 - w_j^2),
where t > 0 is a parameter. The unique minimizer (b^*(t), w^*(t), u^*(t)) under any given t
forms a curve called the “central path,” which approaches an optimal solution of (13) as
t → ∞. An interior point method thus alternatively minimizes φt (b, w, u) and adjusts t.
From a set of (w, b) on the search path, we can construct a feasible solution to the dual
problem of (13) and evaluate the duality gap. The duality gap is guaranteed to converge
to 0 as we walk along the central path when t → ∞. Thus, we can check the duality gap
for the stopping condition.
At the kth iteration, using the current tk , interior point methods approximately minimize
φtk (b, w, u) by finding a Newton direction. The following linear system is solved:
    \nabla^2 \phi_{t_k}(b^k, w^k, u^k) \begin{bmatrix} \Delta b \\ \Delta w \\ \Delta u \end{bmatrix} = -\nabla \phi_{t_k}(b^k, w^k, u^k).    (59)
For logistic loss,
    \nabla \phi_t(b, w, u) = \begin{bmatrix}
        tC \sum_{i=1}^{l} y_i \bigl(\tau(y_i(w^T x_i + b)) - 1\bigr) \\[4pt]
        tC \sum_{i=1}^{l} \bigl(\tau(y_i(w^T x_i + b)) - 1\bigr) y_i x_i + \begin{bmatrix} 2w_1/(u_1^2 - w_1^2) \\ \vdots \\ 2w_n/(u_n^2 - w_n^2) \end{bmatrix} \\[4pt]
        t e_n - \begin{bmatrix} 2u_1/(u_1^2 - w_1^2) \\ \vdots \\ 2u_n/(u_n^2 - w_n^2) \end{bmatrix}
    \end{bmatrix}.
The Hessian ∇²φ_t(b, w, u) has the first block row [\, tC\, y^T D y, \;\; tC\, e_l^T D X, \;\; 0^T \,], where D is diagonal with D_{ii} = \tau(y_i(w^T x_i + b))\bigl(1 - \tau(y_i(w^T x_i + b))\bigr); the remaining blocks involve tC\, X^T D X together with diagonal terms from the log barrier (see Koh et al., 2007, for the complete expression).
Koh et al. (2007) apply preconditioned conjugate gradient methods (PCG) to solve (59)
with diagonal preconditioning.
For the convergence, a backtrack line search procedure is needed to ensure the sufficient
decrease of φtk (·). Koh et al. (2007) did not discuss details of their method’s convergence.
However, because interior point methods are a type of Newton methods, they often enjoy
fast local convergence.
At each iteration, Lassplore first constructs a search point
    s^k = w^k + \beta_k (w^k - w^{k-1}),
for some positive parameter βk . From sk , we obtain wk+1 by taking the negative gradient
direction:
wk+1 = sk − λk ∇L̄(sk ),
where
    \bar{L}(w) \equiv \sum_{i=1}^{l} \xi(w; x_i, y_i)
is the objective function of (14) and λk is the step size. Note that L̄(w) is different from
L(w) defined in (7) because the penalty parameter C is not needed in (14). Liu et al. (2009)
suggest estimating \beta_k by
    \beta_k = \frac{\gamma_k (1 - \alpha_{k-1})}{\alpha_{k-1} (\gamma_k + \alpha_k / \lambda_k)},    (60)
where \alpha_k \in (0, 1) is the root of the quadratic function
    \eta_k(\alpha) \equiv \frac{\alpha^2}{\lambda_k} + \gamma_k \alpha - \gamma_k
and \gamma_k \geq 0 satisfies
    \gamma_{k+1} = (1 - \alpha_k) \gamma_k \;\; \text{if } k \geq 1, \;\; \text{and } \gamma_1 \geq 0.    (61)
We have αk ∈ (0, 1) because ηk (0) = −γk < 0 and ηk (1) = 1/λk > 0.
Lassplore applies an adaptive line search scheme that adjusts λk so that wk+1 satisfies
    \bar{L}(w^{k+1}) \leq \bar{L}(s^k) + \nabla \bar{L}(s^k)^T (w^{k+1} - s^k) + \frac{1}{2\lambda_k} \|w^{k+1} - s^k\|_2^2.    (62)
The new solution w^{k+1} generated from the above process may not satisfy the constraint in
(14), so Lassplore projects it onto the feasible region:
    w^{k+1} \leftarrow \arg\min_{w} \bigl\{ \|w - w^{k+1}\| \;\big|\; \|w\|_1 \leq K \bigr\}.    (63)
This projection is related to (15) in Section 2.2.2, but Lassplore applies the method in Liu
and Ye (2009) to efficiently compute (63).
A summary of Lassplore is given in Algorithm 9.
4. See Liu et al. (2009, Algorithm 3) for details.
At each iteration, OWL-QN works in the orthant determined by the sign vector s^k, where
    s_j^k \equiv \begin{cases} sgn(w_j^k) & \text{if } w_j^k \neq 0, \\ sgn(-\bar{\nabla}_j f(w^k)) & \text{otherwise,} \end{cases}    (64)
and
    \bar{\nabla}_j f(w) \equiv \begin{cases} \nabla_j L(w) + 1 & \text{if } w_j > 0 \text{ or } (w_j = 0 \text{ and } \nabla_j L(w) + 1 < 0), \\ \nabla_j L(w) - 1 & \text{if } w_j < 0 \text{ or } (w_j = 0 \text{ and } \nabla_j L(w) - 1 > 0), \\ 0 & \text{otherwise} \end{cases}    (65)
is defined as the pseudo gradient of f (w). In (64), if wjk = 0, we consider the space where
wj can be moved by taking the negative gradient direction. OWL-QN then approximately
minimizes a quadratic approximation of (1) in the search space Ωk :
    \min_{d} \; f(w^k) + \bar{\nabla} f(w^k)^T d + \frac{1}{2} d^T H_k d    (66)
    subject to  w^k + d \in \Omega_k,
where Hk approximates the Hessian of f (wk ) by the first-order information gathered from
previous iterations. Details for getting Hk can be found in Liu and Nocedal (1989). To
approximately solve (66), OWL-QN finds the minimum of the quadratic objective function:
    d^k = -H_k^{-1} \bar{\nabla} f(w^k),    (67)
obtains a direction \bar{d}^k by confining d^k to the same orthant as -\bar{\nabla} f(w^k):
    \bar{d}_j^k = \begin{cases} d_j^k & \text{if } sgn(d_j^k) = sgn(-\bar{\nabla}_j f(w^k)), \\ 0 & \text{otherwise,} \end{cases}    (68)
and then conducts a backtracking line search to find λ such that the sufficient decrease of
the function value is satisfied. Note that w^{k+1} must be in \Omega_k, so following (64),
    w_j^{k+1} = \begin{cases} w_j^k + \lambda \bar{d}_j^k & \text{if } sgn(w_j^k + \lambda \bar{d}_j^k) = s_j^k, \\ 0 & \text{otherwise.} \end{cases}
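A sketch of OWL-QN's orthant-related operations, (64)-(65) and (68) together with the final point projection, is given below; it is illustrative only, and the L-BFGS construction of H_k is not shown.

```python
import numpy as np

def pseudo_gradient(w, gradL):
    """Pseudo-gradient of f(w) = ||w||_1 + L(w), Equation (65)."""
    pos, neg = gradL + 1.0, gradL - 1.0
    pg = np.zeros_like(w)
    pg[w > 0] = pos[w > 0]
    pg[w < 0] = neg[w < 0]
    zero = (w == 0)
    pg[zero & (pos < 0)] = pos[zero & (pos < 0)]
    pg[zero & (neg > 0)] = neg[zero & (neg > 0)]
    return pg

def owlqn_project(w, d, pg, lam):
    """Confine the direction to the orthant of -pg (68) and keep the new point
    in the orthant given by the sign vector s^k of (64)."""
    s = np.where(w != 0, np.sign(w), np.sign(-pg))         # Equation (64)
    d_bar = np.where(np.sign(d) == np.sign(-pg), d, 0.0)   # Equation (68)
    w_new = w + lam * d_bar
    w_new[np.sign(w_new) != s] = 0.0                       # stay in the chosen orthant
    return w_new
```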
6.2 GLMNET
GLMNET (Friedman et al., 2010) replaces the loss term with a second-order approximation at the beginning of each iteration and applies a cyclic coordinate descent method to minimize the resulting quadratic model q_k(d) in (71). For the jth coordinate, the resulting one-variable function g_j(z) has the same form as (26), with
    B \equiv \nabla_j L(w^k) + \sum_{t} \nabla^2_{jt} L(w^k) d_t  \quad and \quad  A \equiv \nabla^2_{jj} L(w^k),
which can be calculated using (42) and (69). It is easy to minimize g_j(z) by (27).
Because calculating the matrix D involves many exponential operations, GLMNET also
considers using an approximation of ∇2 L(w) and minimizes
    q_k(d) \equiv \nabla L(w^k)^T d + \frac{1}{2} d^T H d + \|w^k + d\|_1 - \|w^k\|_1,
where H \equiv 0.25 C X^T I X and I is an identity matrix. That is, we use a cheaper but less
accurate approximation of f (wk + d) − f (wk ). A sketch of GLMNET is in Algorithm 11.
We briefly describe some GLMNET’s implementation details. GLMNET applies a shrink-
ing technique to solve a smaller optimization problem than (71); see similar techniques for
decomposition methods in Section 4.1.2. Using a sparse representation of w and maintain-
ing an index set Ω to indicate the non-zero elements of d, GLMNET solves a smaller problem
by a coordinate descent method:
    \min_{d_{\Omega}} \; \nabla L(w^k)^T d + \frac{1}{2} d^T H d + \|w^k + d\|_1 - \|w^k\|_1.
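The inner loop can be sketched as cyclic coordinate descent on q_k(d), where each coordinate update uses the coefficients B and A above and the soft-threshold closed form (27). For readability the sketch below takes H as an explicit matrix and loops over all coordinates (the restriction to the index set Ω is omitted); GLMNET instead works with H implicitly through X. Illustrative code only.

```python
import numpy as np

def glmnet_inner_cd(gradL, H, w, n_sweeps=10):
    """Cyclic coordinate descent on
       q_k(d) = gradL^T d + 0.5 d^T H d + ||w + d||_1 - ||w||_1.
    Returns the direction d."""
    n = len(w)
    d = np.zeros(n)
    Hd = np.zeros(n)                       # maintains H d incrementally
    for _ in range(n_sweeps):
        for j in range(n):
            B = gradL[j] + Hd[j]           # first derivative of the smooth part
            A = H[j, j]                    # second derivative
            wj = w[j] + d[j]
            if B + 1.0 <= A * wj:
                z = -(B + 1.0) / A
            elif B - 1.0 >= A * wj:
                z = -(B - 1.0) / A
            else:
                z = -wj
            if z != 0.0:
                d[j] += z
                Hd += z * H[:, j]
    return d
```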
GLMNET conducts feature-wise normalization before solving the optimization problem.
That is, it solves (1) by replacing xi with x̃i , where
    \tilde{x}_{ij} \equiv \frac{x_{ij} - \bar{x}_j}{\sigma_j}, \; \forall i, j, \qquad \bar{x}_j = \frac{\sum_{i=1}^{l} x_{ij}}{l}, \; \forall j, \qquad \sigma_j = \sqrt{\sum_{i=1}^{l} x_{ij}^2 - \bar{x}_j^2}, \; \forall j.
6.3 BMRM
BMRM (Teo et al., 2010) is a bundle method. At the kth iteration, it uses the first-order approximation (cutting plane) of L(w) at w^k,
    L(w) \geq \nabla L(w^k)^T (w - w^k) + L(w^k) = a_k^T w + b_k, \quad \forall w,    (72)
where
    a_k \equiv \nabla L(w^k)  \quad and \quad  b_k \equiv L(w^k) - a_k^T w^k.    (73)
If L(w) is non-differentiable, in (72), BMRM substitutes ∇L(w) with the sub-gradient of
L(w).
BMRM maintains all cutting planes from the earlier iterations to form a lower-bound
function for L(w):
    L(w) \geq L^{CP}_k(w) \equiv \max_{1 \leq t \leq k} \; a_t^T w + b_t, \quad \forall w.    (74)
Using (74), at the kth iteration BMRM solves
    \min_{w} \; \|w\|_1 + L^{CP}_k(w).    (75)
With the splitting of w in (12) by w = w^+ - w^-, (75) can be reformulated to the following
linear programming problem:
    \min_{w^+, w^-, \zeta} \;\; \sum_{j=1}^{n} w_j^+ + \sum_{j=1}^{n} w_j^- + \zeta    (76)
    subject to  a_t^T (w^+ - w^-) + b_t \leq \zeta, \; t = 1, \ldots, k,
                w_j^+ \geq 0, \; w_j^- \geq 0, \; j = 1, \ldots, n.
A summary of the BMRM approach is given in Algorithm 12. Teo et al. (2010, Appendix C)
indicated that because of the L1 regularization, the convergence result has not been fully
established yet.
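Each BMRM iteration appends one cutting plane and re-solves (76). A hedged sketch using scipy.optimize.linprog is shown below; it assumes the reconstruction of (75)-(76) above and is not the BMRM implementation, which solves the LP through GLPK.

```python
import numpy as np
from scipy.optimize import linprog

def bmrm_lp(A_cuts, b_cuts, n):
    """Solve the linear program (76) built from the cutting planes (a_t, b_t).

    A_cuts: k x n array whose rows are a_t = grad L(w^t) (L as in (7), so C is
    already included); b_cuts: length-k array of b_t. Variables are [w+, w-, zeta].
    """
    k = len(b_cuts)
    c = np.concatenate([np.ones(2 * n), [1.0]])
    # a_t^T (w+ - w-) + b_t <= zeta  rewritten as  a_t^T w+ - a_t^T w- - zeta <= -b_t
    A_ub = np.hstack([A_cuts, -A_cuts, -np.ones((k, 1))])
    b_ub = -np.asarray(b_cuts)
    bounds = [(0, None)] * (2 * n) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    wp, wm, zeta = res.x[:n], res.x[n:2 * n], res.x[-1]
    return wp - wm, zeta
```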
The L1-regularized L2-loss SVM problem (77) is (1) with the loss \xi_{L2} in (4). Define
    b_i(w) \equiv 1 - y_i w^T x_i  \quad and \quad  I(w) \equiv \{i \mid b_i(w) > 0\}.    (78)
Then the sum of losses is
    L(w) = C \sum_{i \in I(w)} b_i(w)^2.    (79)
In contrast to logistic loss, the L2-loss function is differentiable but not twice differentiable
(Mangasarian, 2002). Thus, some methods discussed in Sections 4–6 may not be directly
applicable because they use second-order information. However, as shown in Mangasarian
(2002, Section 3), (79) is twice differentiable at all but {w | bi (w) = 0 for some i}. More-
over, ∇L(w) is globally Lipschitz continuous, so a generalized Hessian exists everywhere.
Using a generalized Hessian, we may modify algorithms using the second-order information
for L1-regularized L2-loss SVMs. In the following two sections, we extend CDN and TRON
to solve (77).
For a second-order approximation similar to (26), we need L'_j(0) and L''_j(0):
    L'_j(0) = -2C \sum_{i \in I(w^{k,j})} y_i x_{ij} \, b_i(w^{k,j}).
Unfortunately, L''_j(0) is not well-defined if there exists some i such that b_i(w^{k,j}) = 0.
Following Chang et al. (2008), we consider the generalized second derivative:
    2C \sum_{i \in I(w^{k,j})} x_{ij}^2.    (80)
By replacing L''_j(0) in (26) with the above quantity, we can easily obtain a direction d.
However, the value in (80) may be zero if x_{ij} = 0, \forall i \in I(w^{k,j}). To apply the convergence
result in Tseng and Yun (2009), we ensure strict positivity by taking
    \max\Bigl( 2C \sum_{i \in I(w^{k,j})} x_{ij}^2, \; \epsilon \Bigr),    (81)
where ε is a small positive value. Following the explanation in Appendix F, we have the finite
termination of the line search procedure and the asymptotic convergence of the function
value. If the loss function L(w) is strictly convex, the algorithm converges at least linearly.
The proof is in the supplementary file.
Like the situation for logistic regression, the major cost for finding the Newton direction
d and for the line search procedure is to calculate wT xi , ∀i ∈ I. We maintain bi (w), ∀i
using the same trick in (25). All other implementation techniques discussed in Section 4.1.2
for logistic regression can be applied here.
For the bound-constrained form (12), let \bar{X} \equiv [X, -X] and let \bar{X}_{I,:} denote a sub-matrix
including \bar{X}'s rows corresponding to I; see (8). Note that b_i(w) defined in (78) is now calculated by
    1 - y_i (w_{1:n} - w_{(n+1):2n})^T x_i.
Following Mangasarian (2002) and Fan et al. (2008, Appendix D), we consider the
generalized Hessian matrix:
    2C \bar{X}^T D \bar{X},
where D \in R^{l \times l} is a diagonal matrix with
    D_{ii} = \begin{cases} 1 & \text{if } b_i(w) > 0, \\ 0 & \text{if } b_i(w) \leq 0. \end{cases}
We then apply Algorithm 7 to approximately minimize qk (d). For the Hessian-vector prod-
uct, only instances in the index set I and variables in the free set F are considered. Thus,
the Hessian-vector product in (58) becomes
T
2C X̄I,F DI,I X̄I,F v F ,
where v is a vector in R2n .
Regarding the convergence, from Theorem 2.1 of Lin and Moré (1999), any limit point
of {wk } is an optimal solution. However, without the twice differentiability, it is unclear if
the local quadratic convergence holds.
8. Numerical Experiments
In Sections 4–7, we have described details of several large-scale optimization methods for
L1-regularized linear classification. In this section, we conduct experiments to investigate
their performances. We describe data sets and experimental settings first. Then, we com-
prehensively compare methods for logistic regression and L2-loss SVMs. Programs used in
this paper are available at http://www.csie.ntu.edu.tw/~cjlin/liblinear/exp.html.
Table 1: Statistics of data sets: l and n denote the numbers of instances and features in a
data set, respectively. The column #nz indicates the number of non-zero entries.
Table 2: The best parameter C, the model density (%), and the testing accuracy (%). We
conduct five-fold cross validation on the training set to select C in {2^k | k = -4, -3, . . . , 6}.
• CDN: our coordinate descent method using one-dimensional Newton steps, described in
Sections 4.1.2 and 7.1. The implementation is the same as that included in version 1.6 of our software LIBLINEAR
(http://www.csie.ntu.edu.tw/~cjlin/liblinear/).
In Section 4.1.2, we discuss the shrinking technique for CDN. We defer the investigation
of its effectiveness to Section 8.4. In all other places of the comparison, we run CDN with
the shrinking strategy.
• SCD: the stochastic coordinate descent method for logistic regression is described in
Section 4.1.3. The source code is available at http://ttic.uchicago.edu/~tewari/
code/scd/.
• CGD-GS: the block coordinate gradient descent method for logistic regression is described
in Section 4.2. Following Yun and Toh’s (2011) settings, we choose
    H = diag\Bigl[ \min\bigl( \max(\nabla^2_{jj} L(w^k), \; 10^{-10}), \; 10^{10} \bigr) \Bigr]_{j=1,\ldots,n}
and apply the Gauss-Southwell-r rule to choose the working set J; see Equation (40).
Note that from Table 9 of Yun and Toh (2011), implementations using Gauss-Southwell-r
and Gauss-Southwell-q rules perform similarly on document data. We use default values
for all parameters; see Section 5.1 in Yun and Toh (2011). In particular, σ = 0.1, β = 1/2,
and γ = 0. The source code is available at http://www.math.nus.edu.sg/~matys/.
• TRON: the trust region Newton methods for logistic regression and L2-loss SVMs are
respectively described in Sections 5.1 and 7.2. In the projected line search (47) and
(55), we use σ = 0.01 and β = 0.1. For the stopping condition of the CG procedure in
Algorithm 7, we use = 0.1. The parameter η0 in (46) is 0.0001.
• IPM: the interior point method for logistic regression is described in Section 5.2. For small
and dense data sets, IPM solves a linear system in Algorithm 8 by Cholesky factorization.
For large and sparse data sets, it applies a preconditioned conjugate gradient method to
approximately solve the linear system. In the experiment, we only consider the large and
sparse setting. We use default parameters: σ = 0.01 and β = 1/2 for the line search
procedure. To update tk , we use smin = 1/2 and µ = 2. The source code (version 0.8.2)
is downloaded from http://www.stanford.edu/~boyd/l1_logreg/.
As w lies in the interior of the feasible region, every wj is non-zero after IPM stops. To
gain the sparsity, following the condition (31), Koh et al. (2007) assign those wj satisfying
|∇j L(w)| ≤ 0.9999 to zero. However, if we have not run enough iterations to obtain an
accurate solution, this modification of w may result in an erroneous model. We thus add
another condition |wj | < 1 in deciding if wj should be assigned to zero. We will address
this issue again in Section 8.3.
We find that because of a problem of not initializing an element in an array, the pre-
vious version (0.8.1) of IPM is two or three times slower than the latest version (0.8.2).
This observation indicates the difficulty in comparing software. Some minor issues may
significantly affect the conclusions.
• OWL-QN: the quasi Newton method is described in Section 6.1. The source code (version
1.1.2) is available at http://research.microsoft.com/en-us/um/people/jfgao/.
• GLMNET: the method is described in Section 6.2 for logistic regression. Following the
default setting, we use the full Hessian in (71) instead of an approximation. We gradually
reduce the stopping tolerance to obtain different models. The coordinate descent method
for minimizing the quadratic approximation qk (d) stops according to a tolerance the
same as the overall stopping tolerance. The source code (version 1.5) is available at
http://cran.r-project.org/web/packages/glmnet/index.html.
• Lassplore: this method is described in Section 5.3. We check the 1-norm of the optimal
solution (obtained by other solvers) to calculate the value K in (14). The source code (ver-
sion 1.0) is available at http://www.public.asu.edu/~jye02/Software/lassplore.
• BMRM: this bundle method is described in Section 6.3. We apply it to both logistic regres-
sion and L2-loss SVMs. The linear programming problem (76) is solved via GNU Linear
Programming Kit (GLPK). The source code (version 2.2) and GLPK are available at
http://users.rsise.anu.edu.au/~chteo/BMRM.html and http://www.gnu.org/software/glpk/, respectively.
GLMNET is implemented in Fortran along with an R interface. CGD-GS and Lassplore
are primarily implemented in MATLAB, but expensive operations are coded in C/C++.
All other solvers are implemented in C/C++ with double precision.
Some implementations (SCD, TRON, OWL-QN, and BMRM) solve (1), while some (CGD-
GS, IPM, GLMNET, and Lassplore) consider the bias term and solve (5). BBR6 and our
CDN implementation can handle both (1) and (5). According to whether the bias term is
considered, we categorize methods into two groups for comparison.7 Moreover, for certain
solvers their formulations are scaled by a constant. We ensure that equivalent optimization
problems are solved.
Software such as IPM and GLMNET supports finding a solution path of various C values.
However, in our experiments, we focus on the performance of an algorithm using a fixed C.
For all methods, we set the initial solution w1 = 0.
The parameter C is chosen by five-fold cross validation (CV) on the training set. Using
models trained under the best C, we predict the testing set to obtain the testing accuracy.
The best C, the model density (the number of non-zero coefficients in w divided by n), and
the corresponding testing accuracy are recorded in Table 2.
In the rest of this section, we compare the training speed of solvers for logistic regression
and L2-loss SVMs by using the best parameter C of each data set. We run all experiments
on a 64-bit machine with Intel Xeon 2.0GHz CPU (E5504), 4MB cache, and 32GB main
memory. We use GNU C/C++/Fortran compilers (version 4.4.1) and ensure that for each
package the “-O3” optimization flag is set.
6. BBR considers bias term as a feature and allows users to use an option “-I” to specify weights to individual
features.
7. For simplicity, we apply BBR only to solve (1).
In Figure 1, we check the relative difference of the function value to the optimum,
    \frac{f(w) - f^*}{f^*},    (82)
where f^* is obtained by running TRON with a strict stopping condition. Both x-axis and
y-axis are log-scaled. We draw a dotted reference line in Figure 1 to indicate the relative
error 0.1. From Figure 1, BBR and CDN can more quickly give a good solution than SCD,
TRON, OWL-QN, and BMRM. However, quasi Newton and Newton methods may have
faster local convergence. In Figures 1(c) and 1(d), TRON’s curve is almost vertical in the
end. This fast local convergence is mainly useful for problems such as a9a, for which BBR
and CDN are less competitive. We note that a9a is not a document data set and its number
of features is much smaller than the number of instances. Earlier studies for L2-regularized
classifiers have shown that coordinate descent methods are less competitive for this type of
data (Hsieh et al., 2008). The same situation seems to occur here for L1 regularization.
It is surprising that SCD is much slower than CDN and BBR because they are all coordi-
nate descent methods. We modify CDN to randomly select working variables, a stochastic
method similar to SCD; the result was still much faster than SCD, suggesting that the
stochastic selection is not the culprit. It turns out that a too large upper bound of the
second derivative causes the slow convergence of SCD. When x_{ij} \in [-1, 1], SCD replaces
\sum_{i=1}^{l} x_{ij}^2 in (34) with l. Then its U_j in (35) is much larger than the U_j in (22) used by BBR.
Therefore, SCD's step size -g'_j(0)/U_j is often too small.
Equation (82) cannot be used for a practical stopping condition because f ∗ is not easily
available. Instead, we often rely on the gradient information. Now (1) is not differentiable, so we consider the minimum-norm sub-gradient ∇^S f(w), whose jth component measures the violation of the optimality condition (6) in the same way as v_j in Section 4.1.2.
Since ∇S f (w) = 0 if and only if w is optimal, we can check k∇S f (w)k for the stopping
condition. Figure 2 shows the scaled 2-norm of the minimum-norm sub-gradient,
    \frac{l \, \|\nabla^S f(w)\|}{\min\bigl(\sum_{i: y_i = 1} 1, \; \sum_{i: y_i = -1} 1\bigr) \, \|\nabla^S f(w^1)\|},    (83)
along the training time. From Figure 2, CDN and BBR are the best in the early stage
of the procedure, while Newton and quasi Newton methods such as TRON and OWL-QN
have faster local convergence. This observation is consistent with Figure 1. Some methods
(e.g., decomposition methods) do not calculate ∇S f (w) in their procedures, so a gradient-
based stopping condition may introduce extra cost. However, these methods may be able to
use approximate gradient values obtained during the calculation. For example, coordinate
descent methods calculate only ∇j f (wk,j ), j = 1, . . . , n instead of ∇f (wk+1 ), but these
values can be directly used in a stopping condition; see details in Appendix F of Fan et al.
(2008).
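For completeness, a small sketch of computing the measure (83) from ∇L(w) is given below; it assumes the component-wise form of the minimum-norm sub-gradient described above and is illustrative only.

```python
import numpy as np

def min_norm_subgradient(w, gradL):
    # assumed component-wise form of grad^S f(w) for f(w) = ||w||_1 + L(w)
    return np.where(w > 0, gradL + 1.0,
           np.where(w < 0, gradL - 1.0,
                    np.sign(gradL) * np.maximum(np.abs(gradL) - 1.0, 0.0)))

def scaled_subgradient_norm(w, gradL, g1_norm, n_pos, n_neg, l):
    # Equation (83): l ||grad^S f(w)|| / (min(#pos, #neg) ||grad^S f(w^1)||)
    return l * np.linalg.norm(min_norm_subgradient(w, gradL)) / (min(n_pos, n_neg) * g1_norm)
```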
In Figure 3, we investigate how the testing accuracy is improved along the training time.
The testing accuracy is the percentage of correct predictions on the testing set. Results show
that BBR and CDN achieve the final accuracy more quickly. In addition, we are interested
in the progress of these methods on gaining the sparsity. Figure 4 shows the number of
non-zero coefficients of w and the training time. For most data sets, BBR and CDN more
efficiently obtain a sparse solution.
Next, we compare methods solving (5) with the bias term. These solvers include CDN,
CGD-GS, IPM, Lassplore, and GLMNET. Although BBR can solve (5), we omit it in the
comparison as its performance is similar to that of solving (1). Following the comparison
for methods solving (1), we begin with checking the running time to reduce the relative
error of the function value and the scaled norm of the minimum-norm sub-gradient; see
(82) and (83), respectively. The results are given in Figures 5 and 6. The reference value f ∗
is obtained using IPM with a strict stopping condition. From Figure 5, CDN is the fastest,
IPM comes the second, and CGD-GS is the third. However, in Figure 6 for reducing the
gradient norm, IPM often surpasses CDN in the final stage.
GLMNET gives good performances in Figures 5(b)–5(d). In particular, it has fast local
convergence; see curves that are close to vertical in Figures 5(c) and 5(d). This fast local
convergence is due to using the quadratic approximation qk (d), which generates a Newton-
like direction. We find that the approximation of replacing D in (69) with a constant
diagonal matrix is effective for micro-array data used in Friedman et al. (2010). However,
for large document sets, using the exact Hessian is better. Unfortunately, GLMNET fails
to generate results for yahoo-japan and yahoo-korea even after a long running time. We find
that setting an appropriate stopping tolerance for minimizing the quadratic approximation
in (71) is sometimes difficult. This issue might cause the problem in training yahoo-japan
and yahoo-korea. For IPM, the performance is in general competitive. Because IPM is a
type of Newton method, it has fast local convergence.
The result that CDN is faster than CGD-GS is worth further discussion. Tseng and Yun (2009) show that for some least-squares regression problems, an implementation similar to CDN (i.e., a Gauss-Seidel rule for selecting working variables) is slower than CGD-GS, which selects variables using the gradient information. This result is opposite to ours. Two issues might cause the different results. First, the problems are different: we aim at classifying large and sparse document data, whereas they solve regression problems. Second, we implement two techniques (permutation of sub-problems and shrinking) to improve the convergence.
Yun and Toh (2011) show that for some document data sets used here, their CGD-
GS is faster than IPM. However, our results indicate that IPM is better. This difference is apparently because Yun and Toh (2011) ran an earlier version (0.8.1) of IPM. We indicate in Section 8.2 that a minor problem causes this earlier version to be two or three times slower than the later version (0.8.2) used here.
Figure 7 shows the testing accuracy versus the training time. We can see in Figures
7(e) and 7(f) that IPM’s accuracy may not improve as the training time increases. As we
mentioned in Section 8.2, this result is because we modify certain w elements to zero. In the
middle of the procedure, w is not close to an optimum yet, but many elements have satisfied
|∇j L(w)| ≤ 0.9999 and are trimmed to zero. Hence, the resulting accuracy may be worse
than that in the early stage of the procedure. In fact, because IPM maintains a dense w throughout its iterations, we must solve the optimization problem accurately to obtain the final sparsity and testing accuracy.
Figure 8 presents the number of w’s non-zero coefficients. Similar to methods solving (1), all methods here except CGD-GS start with solutions containing many non-zero coefficients and gradually gain sparsity. In contrast, throughout the optimization process, CGD-GS’s number of non-zero coefficients is never much larger than that of the final solution. We find that this nice property comes from using the gradient information to select working variables. Without the gradient information, CDN wrongly updates some elements of w to non-zero values in the early stage of the procedure.
In summary, for large document data, coordinate descent methods such as CDN perform well in the early stage of the optimization procedure. As a result, they achieve the final accuracy more quickly. However, for some applications, a correct sparsity pattern of the model is important and a more accurate solution is sought. Then GLMNET, which combines Newton-type and coordinate descent approaches, is useful.
Following the experiment for logistic regression, we plot the relative difference to the
optimal function value in Figure 9 and the scaled norm of the minimum-norm sub-gradient
in Figure 10. The reference f ∗ is obtained by running TRON with a strict stopping condition.
One can see that CDN’s training time is the shortest among the three solvers and TRON is
better than BMRM. BMRM does not properly decrease the function value and the norm of
gradient on large data sets.
In Figures 9 and 10, we also present results of CDN without implementing the shrinking
strategy (denoted as CDN-NS). In most cases, the implementation with shrinking is only
slightly better. However, shrinking is effective if the sparsity is high (e.g., news20 and
yahoo-japan). In such a situation, most w components are zero. We can safely remove some
zero components and more efficiently solve smaller optimization problems.
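The following sketch (our own simplification, not the exact shrinking rule of CDN) conveys the idea: a variable currently at zero whose loss gradient is comfortably inside the interval where the optimality condition of the 1-norm term already holds is unlikely to move, so it can be temporarily removed from the working set.

```python
import numpy as np

def shrink_active_set(w, grad_L, eps=1e-2):
    """Keep a variable if it is non-zero, or if it is zero but |grad_j L(w)|
    is close to (or beyond) 1, i.e., the optimality condition for the 1-norm
    term is violated or nearly violated.  Variables dropped here would be
    re-checked before the final convergence test."""
    keep = (w != 0) | (np.abs(grad_L) >= 1.0 - eps)
    return np.flatnonzero(keep)

# Outline of an outer loop:
#   active = np.arange(n)
#   for sweep in range(max_sweeps):
#       for j in np.random.permutation(active):   # permuted sub-problems
#           ...                                    # one-variable update of w[j]
#       active = shrink_active_set(w, grad_L)
```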
In Section 2.5, we briefly discuss optimization methods for L1-regularized least-squares regression. Some comparisons can be found in, for example, the experimental sections of Wright et al. (2009) and Yun and Toh (2011). Note that an optimization method may perform differently on classification and regression problems. For instance, Yun and Toh (2011) show that CGD-GS is faster than CDN for regression problems, but here we observe the opposite for document classification.
Figure 11: A comparison of logistic regression solvers with regularization parameters 10C ∗
and 0.1C ∗ . Both x-axis (relative difference between the objective function and the optimum
value) and y-axis (training time) are log-scaled.
Figures 1–8 indicate that CDN solves (1) faster than (5). Our past work (Huang et al., 2010, Section 5) has indicated that the bias term may affect the running time of the same optimization method. Because the accuracy does not differ much, it seems that in general the bias term should not be considered.
Among quasi Newton and Newton approaches, IPM and OWL-QN are faster than TRON. This result seems to indicate that TRON has difficulty identifying w’s final non-zero coefficients. However, IPM’s w elements are all non-zero, so we must accurately solve the optimization problem before trimming some elements to zero.
Figures 9 and 10 indicate that CDN/TRON for L2-loss SVMs is faster than CDN/TRON
for logistic regression. Each iteration of CDN/TRON for L2-loss SVMs is cheaper because
no exp/log operations are involved. Because the accuracy is similar, in general we prefer
L2-loss SVMs over logistic regression.
The choice of the regularization parameter C affects the performance of solvers. In
Figure 11, we present results on the rcv1 data set with regularization parameters 10C ∗ and
0.1C ∗ , respectively, where C ∗ = 4 is the best parameter obtained from cross validation; see
Table 2. Results show that all solvers take a longer running time when C is large. Therefore, one should avoid values much larger than C ∗ by starting the search from a small C. Moreover, methods using low-order (e.g., gradient only) information seem to be more sensitive to changes of C. For example, with C ∗ and 0.1C ∗ we respectively observe in Figures 1(d) and 11(b) that
CDN is faster than OWL-QN, but with 10C ∗ , OWL-QN surpasses CDN in the final stage.
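A minimal sketch of the "try from a small C" practice follows; the solver call `train_l1` and the warm-start reuse of the previous solution are assumptions of this sketch rather than details prescribed above.

```python
import numpy as np

def search_C(X, y, C_grid, train_l1):
    """Try regularization parameters from small to large, assuming a
    hypothetical solver train_l1(X, y, C, w_init) that returns a trained w.
    Starting from small C keeps most of the work on cheap problems, and
    (optionally) the previous solution can serve as a warm start."""
    w = np.zeros(X.shape[1])
    results = []
    for C in sorted(C_grid):
        w = train_l1(X, y, C, w_init=w)
        results.append((C, w.copy()))
    return results

# Example grid around a cross-validated value such as C* = 4 for rcv1:
# results = search_C(X, y, C_grid=[0.04, 0.4, 4.0, 40.0], train_l1=my_solver)
```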
GLMNET iteratively builds quadratic approximations and applies a coordinate descent method to minimize each of them. The discussion in Section 8.3 indicates that GLMNET’s speed
may be further improved if an adaptive stopping condition is properly designed for mini-
mizing each quadratic approximation.
It is challenging to efficiently solve L1-regularized linear classification problems. The 1-
norm term causes the non-differentiability of the objective function. In this paper, we review
many existing methods for logistic regression and L2-loss SVMs. We discuss some state-
of-the-art methods in detail. Our extensive comparison shows that carefully implemented
coordinate descent methods are effective for L1-regularized classification with large-scale
document data.
Acknowledgments
This work was supported in part by the National Science Council of Taiwan via the grant 98-
2221-E-002-136-MY3. The authors thank the associate editor, the reviewers, Trevor Hastie, Robert Tibshirani, and Jun Liu for valuable comments.
Consider the level set
\[
S \equiv \{w \mid f(w) \le f(0)\}.
\]
We prove that S is compact (closed and bounded). Clearly, S is closed due to the continuity
of f (w). If S is not bounded, then there is a sequence $\{w^k\} \subset S$ such that $\|w^k\|_1 \to \infty$. Because we assume that $\xi(w; x_i, y_i) \ge 0$, we have $f(w^k) \ge \|w^k\|_1$. Then $f(w^k) \to \infty$ contradicts $f(w^k) \le f(0)$. Thus, S is a compact set. From Weierstrass’ Theorem, f (w) has at least
one minimum in S.
Consider the one-variable problem
\[
\min_z \; |w_j + z| + Bz + \frac{1}{2} A z^2. \tag{84}
\]
Clearly,
\[
|w_j + z| + Bz + \frac{1}{2} A z^2 =
\begin{cases}
g_1(z) & \text{if } z \ge -w_j, \\
g_2(z) & \text{if } z \le -w_j,
\end{cases}
\]
where
\[
g_1(z) \equiv w_j + z + Bz + \frac{1}{2} A z^2
\quad\text{and}\quad
g_2(z) \equiv -w_j - z + Bz + \frac{1}{2} A z^2.
\]
By checking whether the minimum of each quadratic function occurs on the right or the left side of $-w_j$, we know
\[
\arg\min_{z \ge -w_j} g_1(z) =
\begin{cases}
-\frac{B+1}{A} & \text{if } B + 1 \le A w_j, \\
-w_j & \text{otherwise},
\end{cases}
\]
and
\[
\arg\min_{z \le -w_j} g_2(z) =
\begin{cases}
-\frac{B-1}{A} & \text{if } B - 1 \ge A w_j, \\
-w_j & \text{otherwise}.
\end{cases}
\]
Because $B + 1 \le A w_j$ and $B - 1 \ge A w_j$ cannot hold at the same time, and $g_1(-w_j) = g_2(-w_j)$, we can easily conclude that the solution of (84) is
\[
\begin{cases}
-\frac{B+1}{A} & \text{if } B + 1 \le A w_j, \\
-\frac{B-1}{A} & \text{if } B - 1 \ge A w_j, \\
-w_j & \text{otherwise}.
\end{cases}
\]
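This case analysis translates directly into a few lines of code. The sketch below is an illustration assuming $A > 0$, with hypothetical function and variable names; it returns the minimizer of (84).

```python
def solve_one_variable(w_j, A, B):
    """Closed-form minimizer of |w_j + z| + B*z + 0.5*A*z**2 over z,
    following the three cases derived for (84); assumes A > 0."""
    if B + 1.0 <= A * w_j:
        return -(B + 1.0) / A
    elif B - 1.0 >= A * w_j:
        return -(B - 1.0) / A
    else:
        return -w_j

# For w_j = 0 the rule reduces to soft-thresholding of -B/A at level 1/A.
```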
We prove that $w_j^{k,j} = 0$, $\forall k \ge K_j$. Recall that in the decomposition method we use (27) to obtain a direction $d$. From (86), we see that at $k = K_j$, the third case in (27) is taken. That is, $d = -w_j^{k,j}$. If $\sigma$ in (86) is chosen to be the constant used in the sufficient decrease condition (28), we claim that $d = -w_j^{k,j}$ satisfies (28) with $\lambda = 1$: if $w_j^{k,j} \ge 0$,
\[
\begin{aligned}
& g_j(-w_j^{k,j}) - g_j(0) - \sigma\left(-w_j^{k,j} - L_j'(0; w^{k,j})\, w_j^{k,j}\right) \\
&= (1-\sigma)\left(-w_j^{k,j} - L_j'(0; w^{k,j})\, w_j^{k,j}\right) + \frac{1}{2}\, L_j''(\tilde{z}; w^{k,j})\,(w_j^{k,j})^2 \qquad (88) \\
&\le (1-\sigma)\left(-w_j^{k,j} - L_j'(0; w^{k,j})\, w_j^{k,j}\right) + \frac{1}{2}\, M\,(w_j^{k,j})^2 \qquad (89) \\
&\le -\frac{1}{2}\, M\,(w_j^{k,j})^2 \le 0, \qquad (90)
\end{aligned}
\]
where $\tilde{z}$ is between $0$ and $-w_j^{k,j}$. Equation (88) follows from (20)–(21), Equation (89) is from (87) and (85),9 and Equation (90) is from the first inequality of (86). The above result
8. We need f (w∗ ) < f (0). This property generally holds. If not, we can slightly enlarge S so that (87) and
all subsequent derivations still follow.
9. More precisely, $\|w^{k,j} + \tilde{z} e_j - w^{k,j}\| \le |w_j^{k,j}| = |w_j^{k,j} - w_j^*| \le \|w^{k,j} - w^*\|$.
is precisely the sufficient decrease condition (28) with $\lambda = 1$. The situation for $w_j^{k,j} < 0$ is similar. By taking the step $d = -w_j^{k,j}$, we have $w_j^{k+1,j} = 0$. At the $(k+1)$st iteration, $w_j^{k+1,j} = 0$ and (86) imply that the optimality condition (18) for the sub-problem has been satisfied. Thus, the jth element of w remains zero in all subsequent iterations.
We further assume that there is a constant $A$ such that
\[
\|\nabla L(w_1) - \nabla L(w_2)\| \le A \|w_1 - w_2\|, \quad \forall\, w_1, w_2, \tag{92}
\]
and
\[
L_j''(0; w) > 0, \tag{93}
\]
where $w$ is the current solution used to generate the direction $d$ in (27). Equation (92) means that $\nabla L(\cdot)$ is globally Lipschitz continuous. For logistic regression, we have
\[
\|\nabla L(w_1) - \nabla L(w_2)\| = \|C X^T D(\tilde{w}) X (w_1 - w_2)\| \le C \|X^T\| \|X\| \|w_1 - w_2\|, \tag{94}
\]
where $D(\tilde{w}) \in R^{l \times l}$ is a diagonal matrix similar to (70). Because any diagonal element of $D(\tilde{w})$ is less than one, we have the inequality in (94). Thus, we can use $C\|X^T\|\|X\|$ as the constant $A$ in (92).
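A quick numerical check of this bound follows. It is our own illustration: we interpret $\|X^T\|$ and $\|X\|$ as spectral norms, which is an assumption about the norm intended here, and draw the diagonal of $D$ uniformly from (0, 1) rather than from the logistic loss.

```python
import numpy as np

rng = np.random.default_rng(1)
C, l, n = 2.0, 200, 50
X = rng.standard_normal((l, n))
d = rng.uniform(0.0, 1.0, size=l)                  # diagonal entries in (0, 1)

hessian_norm = np.linalg.norm(C * X.T @ (d[:, None] * X), 2)   # ||C X^T D X||
lipschitz_bound = C * np.linalg.norm(X.T, 2) * np.linalg.norm(X, 2)
print(hessian_norm, "<=", lipschitz_bound)         # the bound always holds
```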
For (93), from (29), $L_j''(0; w)$ in the level set $S$ is lower bounded by a positive value. The only exception is that $L_j''(0; w) = 0$ when $x_{ij} = 0$, $\forall i = 1, \ldots, l$. In this situation, $L_j'(0; w) = 0$ and the optimal $w_j^* = 0$. The one-variable function $g_j(z)$ reduces to $|w_j + z|$ plus a constant, so $d = -w_j$ from (27) and $d$ satisfies the sufficient decrease condition (28). Therefore, $w_j$ becomes zero after the first iteration and is not changed afterward. It is as if the jth feature had been removed from the beginning.
Next, we discuss the asymptotic convergence of function values. Tseng and Yun (2009) impose certain conditions on choosing the working set; see Equation (12) in their paper. If one variable is updated at a time, their condition reduces to the requirement that between $w^k$ and $w^{k+1}$, one must go through $n$ sub-problems covering all $w_1, \ldots, w_n$. This setting is exactly what we do, regardless of whether we use the sequential order $1, \ldots, n$ or a permutation $\pi(1), \ldots, \pi(n)$.
Tseng and Yun (2009) need an additional assumption for the asymptotic convergence
(see Assumption 1 in their paper):
\[
0 < \lambda_{\min} \le H^{k,j} \le \lambda_{\max}, \quad \forall j = 1, \ldots, n,\; k = 1, \ldots, \tag{95}
\]
where $\lambda_{\min}$ and $\lambda_{\max}$ are positive constants, and $H^{k,j} = \nabla^2_{jj} L(w^{k,j})$ is the coefficient $H$ in (91) at $w^{k,j}$. From (29) and the boundedness of the level set $S$ (see Appendix A), this assumption holds except when, for some $j$, $x_{ij} = 0$, $\forall i$. For such $j$, $\nabla_j L(w) = 0$, $\forall w$. Hence, $w_j$ becomes zero by (27) at the first iteration and is not changed subsequently. Because $w_j = 0$ is optimal in this situation, it is as if the jth variable had been removed from the optimization. Therefore, (95) holds without problems.
Following Tseng and Yun (2009, Theorem 1(e)), any limit point of {wk } is an optimum
of (1) with logistic loss.
(i.e., the optimization problem is non-degenerate), then |J| is small. From (97), $\nabla^2 \bar{f}(\bar{w})_{J,J}$ tends to be positive definite if |J| is small. Then, we can have quadratic convergence.
References
Galen Andrew and Jianfeng Gao. Scalable training of L1-regularized log-linear models.
In Proceedings of the Twenty Fourth International Conference on Machine Learning
(ICML), 2007.
Suhrid Balakrishnan and David Madigan. Algorithms for sparse linear classifiers in the
massive data setting. 2005. URL http://www.stat.rutgers.edu/~madigan/PAPERS/sm.pdf.
Steven Benson and Jorge J. Moré. A limited memory variable metric method for bound
constrained minimization. Preprint MCS-P909-0901, Mathematics and Computer Science
Division, Argonne National Laboratory, Argonne, Illinois, 2001.
Paul S. Bradley and Olvi L. Mangasarian. Feature selection via concave minimization and
support vector machines. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), 1998.
Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm
for bound constrained optimization. SIAM Journal on Scientific Computing, 16:1190–
1208, 1995.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. Coordinate descent method for large-
scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008.
URL http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.
David L. Donoho and Yaakov Tsaig. Fast solution of l1 minimization problems when the
solution may be sparse. IEEE Transactions on Information Theory, 54:4789–4812, 2008.
John Duchi and Yoram Singer. Boosting with structural sparsity. In Proceedings of the
Twenty Sixth International Conference on Machine Learning (ICML), 2009.
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections
onto the L1-ball for learning in high dimensions. In Proceedings of the Twenty Fifth
International Conference on Machine Learning (ICML), 2008.
Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression.
Annals of Statistics, 32:407–499, 2004.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:
1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.
Mário A. T. Figueiredo, Robert Nowak, and Stephen Wright. Gradient projection for sparse
reconstruction: Applications to compressed sensing and other inverse problems. IEEE
Journal on Selected Topics in Signal Processing, 1:586–598, 2007.
Jerome H. Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise
coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
Jerome H. Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for gen-
eralized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22,
2010.
Wenjiang J. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computa-
tional and Graphical Statistics, 7(3):397–416, 1998.
Glenn M. Fung and Olvi L. Mangasarian. A feature selection Newton method for support
vector machine classification. Computational Optimization and Applications, 28:185–202,
2004.
Eli M. Gafni and Dimitri P. Bertsekas. Two-metric projection methods for constrained
optimization. SIAM Journal on Control and Optimization, 22:936–964, 1984.
Alexander Genkin, David D. Lewis, and David Madigan. Large-scale Bayesian logistic
regression for text categorization. Technometrics, 49(3):291–304, 2007.
Joshua Goodman. Exponential priors for maximum entropy models. In Proceedings of the
Annual Meeting of the Association for Computational Linguistics, 2004.
Elaine T. Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for l1-minimization:
Methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008.
Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sun-
dararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings
of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.
Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin. Iterative scaling
and coordinate descent methods for maximum entropy. Journal of Machine Learning Re-
search, 11:815–848, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_
journal.pdf.
Jun’ichi Kazama and Jun’ichi Tsujii. Evaluation and extension of maximum entropy models
with inequality constraints. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, pages 137–144, 2003.
S. Sathiya Keerthi and Shrish Shevade. A fast tracking algorithm for generalized
LARS/LASSO. IEEE Transactions on Neural Networks, 18(6):1826–1830, 2007.
Jinseog Kim, Yuwon Kim, and Yongdai Kim. A gradient-based optimization algorithm for
LASSO. Journal of Computational and Graphical Statistics, 17(4):994–1009, 2008.
Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky.
An interior point method for large-scale l1-regularized least squares. IEEE Journal on
Selected Topics in Signal Processing, 1:606–617, 2007.
Yongdai Kim and Jinseog Kim. Gradient LASSO for feature selection. In Proceedings of
the Twenty First International Conference on Machine Learning (ICML), 2004.
Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent
for linear predictors. Information and Computation, 132:1–63, 1997.
Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd. An interior-point method for large-
scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–
1555, 2007. URL http://www.stanford.edu/~boyd/l1_logistic_reg.html.
Jussi Kujala, Timo Aho, and Tapio Elomaa. A walk from 2-norm SVM to 1-norm
SVM. Preprint available from http://www.cs.tut.fi/~kujala2/papers/1norm2norm.pdf, 2009.
John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient.
Journal of Machine Learning Research, 10:771–801, 2009.
Volker Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):
16–28, 2004.
Sylvain Sardy, Andrew G. Bruce, and Paul Tseng. Block coordinate relaxation methods for
nonparametric signal denoising with wavelet dictionaries. Journal of Computational and
Graphical Statistics, 9:361–379, 2000.
Mark Schmidt, Glenn Fung, and Romer Rosales. Fast optimization methods for l1 regu-
larization: A comparative study and two new approaches. In Proceedings of European
Conference on Machine Learning, pages 286–297, 2007.
Mark Schmidt, Glenn Fung, and Romer Rosales. Optimization methods for l1-
regularization. Technical Report TR-2009-19, University of British Columbia, 2009.
Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1 regularized loss mini-
mization. In Proceedings of the Twenty Sixth International Conference on Machine Learn-
ing (ICML), 2009.
Shirish Krishnaj Shevade and S. Sathiya Keerthi. A simple and efficient algorithm for gene
selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
Jianing Shi, Wotao Yin, Stanley Osher, and Paul Sajda. A fast hybrid algorithm for large
scale l1-regularized logistic regression. Journal of Machine Learning Research, 11:713–
741, 2010.
Choon Hui Teo, S.V.N. Vishwanathan, Alex Smola, and Quoc V. Le. Bundle methods for
regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society Series B, 58:267–288, 1996.
Ryota Tomioka and Masashi Sugiyama. Dual augmented Lagrangian method for efficient
sparse reconstruction. IEEE Signal Processing Letters, 16(12):1067–1070, 2009.
Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth
separable minimization. Mathematical Programming, 117:387–423, 2009.
Stephen J. Wright, Robert D. Nowak, and Mário A.T. Figueiredo. Sparse reconstruction by
separable approximation. IEEE Transactions on Signal Processing, 57:2479–2493, 2009.
Tong Tong Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized
regression. The Annals of Applied Statistics, 2(1):224–244, 2008.
Jin Yu, S.V.N. Vishwanathan, Simon Gunter, and Nicol N. Schraudolph. A quasi-Newton
approach to nonsmooth convex optimization problems in machine learning. Journal of
Machine Learning Research, 11:1–57, 2010.
Sangwoon Yun and Kim-Chuan Toh. A coordinate gradient descent method for l1-
regularized convex minimization. Computational Optimizations and Applications, 48(2):
273–307, 2011.
Tong Zhang and Frank J. Oles. Text categorization based on regularized linear classification
methods. Information Retrieval, 4(1):5–31, 2001.
Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine
Learning Research, 7:2541–2563, 2006.
Peng Zhao and Bin Yu. Stagewise lasso. Journal of Machine Learning Research, 8:2701–
2726, 2007.
Ji Zhu, Saharon Rosset, Trevor Hastie, and Rob Tibshirani. 1-norm support vector ma-
chines. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances
in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.