Path Following Algorithms for ℓ2-Regularized M-Estimation with Approximation Guarantee (NeurIPS 2023)
Abstract
Many modern machine learning algorithms are formulated as regularized M-
estimation problems, in which a regularization (tuning) parameter controls a trade-
off between model fit to the training data and model complexity. To select the “best”
tuning parameter value that achieves a good trade-off, an approximated solution
path needs to be computed. In practice, this is often done through selecting a grid
of tuning parameter values and solving the regularized problem at the selected grid
points. However, given any desired level of accuracy, it is often not clear how to
choose the grid points and also how accurately one should solve the regularized
problems at the selected grid points, both of which can greatly impact the overall
amount of computation. In the context of ℓ2 -regularized M-estimation problem,
we propose a novel grid point selection scheme and an adaptive stopping criterion
for any given optimization algorithm that produces an approximated solution path
with approximation error guarantee. Theoretically, we prove that the proposed
solution path can approximate the exact solution path to arbitrary level of accuracy,
while saving the overall computation as much as possible. Numerical results also
corroborate our theoretical analysis.
1 Introduction
Modern machine learning algorithms are often formulated as regularized M-minimization problems,
θ(λ) = arg min_θ { L_n(θ) + λ p(θ) } ,   (1)
where Ln (θ) denotes an empirical loss function, p(θ) denotes a regularization function, and λ is a
tuning parameter that controls the trade-off between model fit and model complexity. Varying the
tuning parameter leads to a collection of models, among which one may be chosen as the final model. This requires solving a collection of related optimization problems, whose solutions are often referred to as the solution path (we refer readers to Figures S4 and S5 in the supplementary material for some examples of solution paths).
Very often the exact solution path θ(λ) cannot be computed, and path-following algorithms are usually used to obtain a sequence of solutions at some selected grid points to produce an approximated solution path. Existing path-following algorithms [see, e.g., Hastie et al., 2004, 2007] typically choose a set of equally-spaced grid points (often on a log-scale), and choose a certain algorithm to solve the minimization problem at the selected grid points. A warm-start strategy is often used to leverage the fact that the minimizers at two consecutive grid points may be close. However, there is a paucity of literature on how to choose these grid points and how accurately one should solve the optimization problems at the selected grid points.
In view of these limitations, Ndiaye et al. [2019] considered regularized convex optimization problems
and proposed general strategies to choose grid points based on primal-dual methods. Again, a
piecewise constant path is constructed to approximate the solution path. However, their approach can only be applied to a special class of loss functions that are compositions of a “simple” function (whose conjugate can be analytically derived) and affine functions. Moreover, the choice of the grid points depends on some convexity and smoothness parameters associated with the loss, which either are not available except for some special classes of functions or require careful tuning due to their great impact on grid point selection. Ndiaye et al. [2019] also showed that to achieve ϵ-suboptimality, the number of grid points required is O(1/ϵ^{1/d}) for uniformly convex losses of order d [Bauschke et al., 2011] and O(1/√ϵ) for generalized self-concordant functions [Sun and Tran-Dinh, 2017]. In our paper, we show that the number of grid points needed is O(1/√ϵ) for general differentiable convex losses (see Theorem 4). Compared to Ndiaye et al. [2019], we also demonstrate through simulations that our
method produces better approximated solution paths under a fixed computational budget (see Section
4). Moreover, our method is also applicable to loss functions that are possibly nonconvex (see
Section 3).
We also note that Proposition 38 of Ndiaye [2018] shows that under some regularity conditions ϵ-suboptimality along the entire solution path can be achieved by using a piecewise linear interpolation. However, it is not clear how these conditions can be used to construct an algorithm with an approximation guarantee. Our proposed algorithm, on the other hand, provides an easily implementable grid point selection scheme and an associated stopping criterion at each grid point, which together ensure that ϵ-suboptimality can be attained through a piecewise linear interpolation.
More recently, in the context of ℓ2-regularized convex optimization problems, Zhu and Liu [2021] proposed two path-following algorithms by using Newton method and gradient descent method as the basis algorithm separately, and discussed the grid point selection scheme for each algorithm based on the piecewise linear interpolation. Our paper differs from Zhu and Liu [2021] in two ways. First, the grid-point selection scheme in Zhu and Liu [2021] depends on the basis algorithm. By contrast, in our paper any basis algorithm can be applied at any chosen grid point as long as the stopping criterion is satisfied by the basis algorithm. Second, the path-following algorithms by Zhu and Liu [2021] can only be applied to convex optimization problems, while in this article we also consider loss functions that are possibly nonconvex (see Section 3).
The rest of the paper is organized as follows. Section 2 introduces the proposed path-following algorithm and establishes its global approximation-error bound. Section 3 considers an extension to nonconvex empirical loss. In Section 4, we compare the proposed scheme to a standard path-following scheme through a simulation study using ridge regression and ℓ2-regularized logistic regression. We close the article with some remarks in Section 5. All proofs are included in the supplementary material.
Note that as the tuning parameter t varies from 0 to ∞, the solution θ(t) varies from 0 to a minimizer of L_n(θ). We also remark that the choice of e^t − 1 is not essential. In fact, it can be replaced by any function C(t) that is strictly increasing and differentiable, with C(0) = 0 and lim_{t→∞} C(t) = ∞. The solution path produced should be independent of C(t). Moreover, we can show that the ℓ2 norm of the solutions is an increasing function of t, and the solution θ(t) converges to the minimum ℓ2 norm minimizer of L_n(θ) (if it is finite) as t goes to infinity (see Corollary S1 in the supplementary material).
Suppose that the goal is to approximate the solution path θ(t) over a given interval [0, tmax ) for some
t_max ∈ (0, ∞], where we allow t_max = ∞. Given a set of grid points 0 < t_1 < · · · < t_N < ∞, and approximated solutions {θ_k}_{k=1}^N at these grid points, we construct an approximated solution path over [0, t_max) through linear interpolation. More specifically, we define a piecewise linear solution path θ̃(t) as follows:
θ̃(t) = ((t_{k+1} − t)/(t_{k+1} − t_k)) θ_k + ((t − t_k)/(t_{k+1} − t_k)) θ_{k+1}   for any t ∈ [t_k, t_{k+1}], k = 0, 1, . . . , N − 1 ,
θ̃(t) = θ_N   for any t_N < t ≤ t_max if t_N < t_max ,   (3)
where t0 = 0 and θ0 = 0. This defines an approximated solution path θ̃(t) for any t ∈ [0, tmax ],
which interpolates the approximated solutions at the selected grid points. In view of this definition, we
may also assume that t_{N−1} ≤ t_max, because we do not need θ̃(t) over t ∈ [t_{N−1}, t_N] if t_{N−1} > t_max.
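As an aside, the interpolation rule (3) takes only a few lines to implement; the helper below is our own sketch (function and argument names are not from the paper).

```python
import numpy as np

def piecewise_linear_path(ts, thetas, t_max):
    """Return a callable evaluating the piecewise linear path (3).

    ts:     increasing grid points t_1 < ... < t_N
    thetas: approximate solutions theta_1, ..., theta_N (one row per grid point)
    """
    ts = np.concatenate([[0.0], np.asarray(ts, dtype=float)])   # prepend t_0 = 0
    thetas = np.vstack([np.zeros_like(thetas[0]), thetas])      # prepend theta_0 = 0

    def theta_tilde(t):
        assert 0.0 <= t <= t_max
        if t >= ts[-1]:                               # constant extension on (t_N, t_max]
            return thetas[-1]
        k = np.searchsorted(ts, t, side="right") - 1  # locate k with t in [t_k, t_{k+1})
        w = (t - ts[k]) / (ts[k + 1] - ts[k])
        return (1.0 - w) * thetas[k] + w * thetas[k + 1]

    return theta_tilde
```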
To assess how well θ̃(t) approximates θ(t), we use the function-value suboptimality of the solution paths defined by sup_{0≤t≤t_max} {f_t(θ̃(t)) − f_t(θ(t))}, where f_t(θ) := (1 − e^{−t})L_n(θ) + e^{−t}∥θ∥_2^2/2 is a scaled version of the objective function in (2). In what follows, we call sup_{0≤t≤t_max} {f_t(θ̃(t)) − f_t(θ(t))} the global approximation error of θ̃(t), and sup_{t_k≤t≤t_{k+1}} {f_t(θ̃(t)) − f_t(θ(t))} the local approximation error of θ̃(t) over [t_k, t_{k+1}].
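When (near-)exact solutions are available on a fine grid of t values, this metric can be estimated directly; the sketch below is ours, and exact_solver is a hypothetical callable returning θ(t) (for ridge regression it can be computed in closed form, as noted in Section 4).

```python
import numpy as np

def f_t(t, theta, L_n):
    """Scaled objective f_t(theta) = (1 - e^{-t}) L_n(theta) + e^{-t} ||theta||_2^2 / 2."""
    return (1.0 - np.exp(-t)) * L_n(theta) + 0.5 * np.exp(-t) * np.dot(theta, theta)

def global_approx_error(theta_tilde, exact_solver, L_n, t_grid):
    """Estimate sup_t {f_t(theta_tilde(t)) - f_t(theta(t))} over a finite grid of t values."""
    return max(f_t(t, theta_tilde(t), L_n) - f_t(t, exact_solver(t), L_n) for t in t_grid)
```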
Intuitively, how well θ̃(t) approximates θ(t) depends on how finely spaced the grid points are and
the quality of the solutions at the selected grid points (i.e., θk ’s). Our first main result relates the
local approximation error of θ̃ over [tk , tk+1 ] to the choice of the grid points and the accuracy of the
approximated solutions at the selected grid points measured by the size of the gradients ∥∇ftk (θk )∥2 ,
where we assume that Ln (θ) is differentiable.
Theorem 1. Assume that L_n(θ) is a differentiable convex function. Let g_k := ∇f_{t_k}(θ_k) = (1 − e^{−t_k})∇L_n(θ_k) + e^{−t_k}θ_k denote the scaled gradient at θ_k. For any 0 < t_1 < t_2 < · · · < t_N < ∞, we have that
sup_{t∈[0,t_1]} { f_t(θ̃(t)) − f_t(θ(t)) } ≤ max{ e^{t_1}∥g_1∥_2^2 , (e^{t_1}(1 − e^{−t_1})^2/2)(∥θ_1∥_2^2 + ∥∇L_n(0)∥_2^2) } ,   (4)
sup_{t∈[t_k,t_{k+1}]} { f_t(θ̃(t)) − f_t(θ(t)) } ≤ e^{t_{k+1}} max{ ((1 − e^{−t_{k+1}})/(1 − e^{−t_k}))^2 ∥g_k∥_2^2 , ∥g_{k+1}∥_2^2 }
    + (e^{−t_k} − e^{−t_{k+1}})^2 max{ e^{t_k}∥θ_k∥_2^2/(1 − e^{−t_k})^2 , e^{t_{k+1}}∥θ_{k+1}∥_2^2/(1 − e^{−t_{k+1}})^2 } ,   (5)
for any k = 1, . . . , N − 1. If we further assume that ∥θ(t_max)∥_2 < ∞, then we have that
sup_{t_N<t≤t_max} { f_t(θ̃(t)) − f_t(θ(t)) } ≤ (e^{t_N}/(2(1 − e^{−t_N}))) ∥g_N∥_2^2 + (1/(2(e^{t_N} − 1))) ∥θ(t_max)∥_2^2   (6)
when t_N < t_max.
The proof of Theorem 1 starts with relating the local approximation error to the suboptimality of θ_k and θ_{k+1} at some t ∈ [t_k, t_{k+1}]. The suboptimality can then be further bounded by the squared norm of the gradient, leveraging the fact that the objective function is e^{−t}-strongly convex. Finally, a triangle inequality is used to bound the norm of the gradient by quantities that depend on ∥g_k∥_2 and ∥θ_k∥_2.
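The calculation behind the last step is short and worth recording here (it is our own unpacking of the definitions of f_t and g_k): for any t ≥ 0,
∇f_t(θ_k) = ((1 − e^{−t})/(1 − e^{−t_k})) g_k + ((e^{−t} − e^{−t_k})/(1 − e^{−t_k})) θ_k ,
so for t ∈ [t_k, t_{k+1}] the first term is controlled by ∥g_k∥_2 (up to the factor (1 − e^{−t_{k+1}})/(1 − e^{−t_k})) and the second by (e^{−t_k} − e^{−t_{k+1}})∥θ_k∥_2/(1 − e^{−t_k}).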
We can see from Theorem 1 that the upper bounds for the local approximation error in (4)–(6) consist
of two terms, with the first one (depending on gk ) being algorithm dependent and the other term
stemming from interpolation. We call them optimization error and interpolation error, respectively.
Note that the optimization error is roughly of order etk ∥gk ∥22 , which again depends on how accurate
the solutions θk are at the selected grid points. And the interpolation error is essentially independent
of the quality of the solutions θk ’s, as the interpolation error only depends on the spacing of the
grid points and the norm of the solutions along the solution path (typically ∥θk ∥2 = O(∥θ(tk )∥2 ),
c.f., Lemma S2 in the supplementary material). In other words, the interpolation error is irreducible
once the grid points are chosen, while the optimization error does depend on the algorithm and it
can be pushed to be arbitrarily small if we run the algorithm long enough at each grid point. In
this sense, if the goal is to minimize the local approximation error, then both the grid points and
optimization algorithm should be designed carefully to strike a balance between these two errors
to minimize the overall computations. For instance, it would be wasteful to have the optimization
error smaller than the interpolation error, because the additional computation would not improve the
overall approximation error (at least in order).
Motivated by this observation and the local approximation-error bounds in Theorem 1, we propose a
novel stopping criterion scheme at the selected grid points for any general optimization algorithm. In
particular, for any given algorithm that takes an initializer and solves the problem at grid point tk+1 ,
we can run the algorithm initialized at θk until the optimization error is smaller than the interpolation
error for θk+1 , that is
e^{t_{k+1}} max{ ((1 − e^{−t_{k+1}})/(1 − e^{−t_k}))^2 ∥∇f_{t_k}(θ_k)∥_2^2 , ∥∇f_{t_{k+1}}(θ_{k+1})∥_2^2 }
    ≤ (e^{−t_k} − e^{−t_{k+1}})^2 max{ e^{t_k}∥θ_k∥_2^2/(1 − e^{−t_k})^2 , e^{t_{k+1}}∥θ_{k+1}∥_2^2/(1 − e^{−t_{k+1}})^2 } ,   (7)
where the LHS is the optimization error and the RHS is the interpolation error in the bounds in
Theorem 1. This condition can be further simplified, under which the upper bounds in Theorem 1 can
also be further simplified and combined so that we can bound the global approximation error of θ̃(t).
This is established in the following theorem.
Theorem 2. Assume that L_n(θ) is differentiable and convex. Suppose that 2^{−1}α_k ≤ α_{k+1} ≤ 2α_k, α_k ≤ α_max for some α_max ≤ ln(2), and
∥∇f_{t_k}(θ_k)∥_2 ≤ C_0 ((e^{α_k} − 1)/(e^{t_k} − 1)) ∥θ_k∥_2 ,   1 ≤ k ≤ N ,   (8)
for some C_0 ≤ 1/4. Denote
A = 4(1 + C_0^2 (e^{α_max} + e^{α_max/2} + 1)^2) ,   B = (e^{α_1} − 1)^2 ∥∇L_n(0)∥_2^2 ,
C = max_{1≤k≤N} e^{−t_k}∥θ_k∥_2^2 (e^{α_{k+1}} − 1)^2/(1 − e^{−t_k})^2 ,   D = ((1 − e^{−max(α_N, t_max−t_N)})/(e^{t_N} − 1)) max{ ∥θ_N∥_2^2 , ∥θ(t_max)∥_2^2 } .
Then
sup_{0≤t≤t_max} { f_t(θ̃(t)) − f_t(θ(t)) } ≤ A max{B, C, D} .   (9)
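Condition (8) is cheap to monitor inside any iterative solver; the minimal helper below is our own sketch (the gradient and parameter norms are assumed to be supplied by the caller).

```python
import math

def condition_8_met(grad_norm, theta_norm, alpha_k, t_k, C0=0.25):
    """Adaptive stopping rule (8): ||grad f_{t_k}(theta_k)||_2 <= C0 (e^{alpha_k} - 1)/(e^{t_k} - 1) ||theta_k||_2."""
    return grad_norm <= C0 * math.expm1(alpha_k) / math.expm1(t_k) * theta_norm
```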
Algorithm 1 A general path-following algorithm.
Input: ϵ > 0, C_0 ≤ 1/4, c_1 ≥ 1, c_2 > 0, 0 < α_max ≤ 1/5 and t_max ∈ (0, ∞].
Output: grid points {t_k}_{k=1}^N and an approximated solution path θ̃(t).
1: Initialize: k = 1.
2: Compute α_1 using (12). Starting from 0, iteratively calculate θ_1 by minimizing f_{t_1}(θ) until (8) is satisfied for k = 1.
3: while (14) is not satisfied, do
4: Compute α_{k+1} using (13), update t_{k+1} = t_k + α_{k+1}.
5: Starting from θ_k, iteratively compute θ_{k+1} by minimizing f_{t_{k+1}}(θ) until (8) is satisfied.
6: Update k = k + 1.
7: Interpolation: construct a solution path θ̃(t) through linear interpolation of {θ_k}_{k=1}^N using (3).
where c1 ≥ 1, c2 > 0, and 0 < αmax ≤ 1/5 are absolute constants. Note that the above step sizes
and termination criterion are chosen to ensure that the upper bounds in (9) and (10) are at most O(ϵ).
Formally, we show that, given any ϵ > 0, the above scheme achieves ϵ-suboptimality (11) for θ̃(t)
(up to a multiplicative constant).
Theorem 3. Suppose that L_n(θ) is differentiable and convex. For any ϵ > 0 and t_max ∈ (0, ∞], assume that either t_max < ∞ or ∥θ(t_max)∥_2 < ∞. Then, using the step sizes and the termination criterion specified above in (12), (13), and (14), any algorithm that produces iterates satisfying (8) for any k ≥ 1 terminates after a finite number of iterations, and when terminated, the solution path θ̃(t) satisfies
sup_{0≤t≤t_max} { f_t(θ̃(t)) − f_t(θ(t)) } ≲ ϵ .   (15)
Note that when ϵ is small enough, we have that α_1 = ln(1 + √ϵ/∥∇L_n(0)∥_2), so that (e^{α_1} − 1)^2 ∥∇L_n(0)∥_2^2 = ϵ.
As a result, the approximation error for the solution path is essentially controlled by the initial step
size α1 . Smaller α1 leads to better approximation. However, smaller α1 also leads to a slower
exploration of the solution path and a more stringent stopping criterion at individual grid points (c.f.
(8)). As such, the initial step size controls the trade-off between computational cost and accuracy.
Moreover, given α1 , we can see from (13) that how fast the algorithm can explore the solution path
depends largely on ∥θk ∥2 . In particular, if ∥θk ∥2 is bounded (e.g., when θ⋆ is finite), then the first
term in the min function of (13) increases as t_k increases. Therefore, aggressive step sizes can be taken in this case until the step size reaches O(1), which will likely lead to a fast exploration of the solution path. On the other hand, if ∥θ_k∥_2 grows quickly to infinity as k increases, then the first term in the min function may go to zero as k → ∞ (say, when ∥θ_k∥_2/e^{t_k/2} → ∞). This means that the step sizes need to decrease to zero eventually, leading to a slower exploration of the solution path.
We summarize the proposed scheme in Algorithm 1. Again, we emphasize that the above scheme
can be applied to any optimization algorithm. Later, we will empirically investigate its performance
using Newton method and gradient descent method.
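To make the overall loop concrete, the sketch below is our own Python rendering of Algorithm 1, not code released with the paper: solver_step stands for one iteration of whichever basis algorithm is chosen (Newton, gradient descent, or the mixed method), while alpha_1, next_alpha, and terminated stand in for the rules (12), (13), and (14), whose formulas are not reproduced here.

```python
import numpy as np

def path_following(p, grad_f, solver_step, alpha_1, next_alpha, terminated, C0=0.25):
    """Skeleton of Algorithm 1 (illustrative only).

    p                     : dimension of theta
    grad_f(t, theta)      : gradient of the scaled objective f_t at theta
    solver_step(t, theta) : one iteration of the basis algorithm on f_t, started at theta
    alpha_1               : initial step size, playing the role of rule (12)
    next_alpha(k, alpha, t, theta) : step size alpha_{k+1}, playing the role of rule (13)
    terminated(t)         : termination rule, playing the role of (14)
    """
    ts, thetas = [], []
    k, alpha, t = 1, alpha_1, alpha_1        # t_1 = alpha_1 since t_0 = 0
    theta = np.zeros(p)                      # warm start from theta_0 = 0
    while True:
        # Run the basis algorithm at the current grid point until the adaptive
        # stopping rule (8) is met.
        while np.linalg.norm(grad_f(t, theta)) > C0 * np.expm1(alpha) / np.expm1(t) * np.linalg.norm(theta):
            theta = solver_step(t, theta)
        ts.append(t)
        thetas.append(theta)
        if terminated(t):
            break
        alpha = next_alpha(k, alpha, t, theta)   # step size alpha_{k+1}
        t, k = t + alpha, k + 1                  # t_{k+1} = t_k + alpha_{k+1}; warm start at theta_k
    return np.array(ts), np.array(thetas)
```

Feeding the returned grid points and solutions into a routine like the piecewise_linear_path sketch above then gives the interpolated path θ̃(t) of step 7.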
Lastly, we establish an upper bound on the total number of grid points required for Algorithm 1 to
achieve ϵ-suboptimality.
Theorem 4. Under the assumptions in Theorem 3, the total number of grid points required for
Algorithm 1 to achieve an ϵ-suboptimality (15) is at most O(ϵ−1/2 ).
3 Extensions to nonconvex loss functions
So far we have assumed that the empirical loss is convex. Next, we consider a generalization, where we do not assume convexity for L_n(θ). Instead, we assume that there exists λ(t) > 0 such that
f_t(θ) − f_t(θ(t)) ≤ (1/(2λ(t))) ∥∇f_t(θ)∥_2^2 ,   (16)
for any t ≥ 0 and θ ∈ dom(L_n). In the literature, the condition (16) is often referred to as the Polyak-Łojasiewicz (PL) inequality. Polyak [1964] proved a linear rate of convergence for the gradient descent method under this condition. See also Karimi et al. [2016] for a discussion of more recent developments. Note that if L_n(θ) is convex, then (16) holds with λ(t) = e^{−t} since f_t(θ) is e^{−t}-strongly convex. In general, however, L_n(θ) does not need to be convex for (16) to hold.
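For completeness, the convex case can be verified with the standard strong-convexity argument: for any θ and θ′,
f_t(θ′) ≥ f_t(θ) + ⟨∇f_t(θ), θ′ − θ⟩ + (e^{−t}/2)∥θ′ − θ∥_2^2 ≥ f_t(θ) − (e^t/2)∥∇f_t(θ)∥_2^2 ,
where the second inequality minimizes the quadratic lower bound over θ′; taking θ′ = θ(t) and rearranging gives (16) with λ(t) = e^{−t}.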
For some technical reasons, we need to consider a slightly different interpolation scheme:
θ̃(t) = θ_k   for any t ∈ [t_k, t_{k+1}], 0 ≤ k ≤ N − 1 ;
θ̃(t) = θ_N   for any t_N < t ≤ t_max if t_N < t_max ,   (17)
where t_0 = 0 and θ_0 = 0. As we can see, the approximated solution path is now piecewise constant.
We use the following termination criterion at each t_k:
∥g_1∥_2 ≤ min( C_0 (e^{α_1} − 1)∥∇L_n(0)∥_2 , 1 ) ;
∥g_{k+1}∥_2 ≤ min( C_0 ((e^{α_{k+1}} − 1)/(e^{t_k} − 1)) ∥θ_k∥_2 , 1 ) ,   k ≥ 1 .   (18)
We also define a slightly new step size scheme as follows:
α_1 = min{ α_max , ln(1 + √(ϵ min(λ_0, λ_1)) / ∥∇L_n(0)∥_2) }   (19)
and
α_{k+1} = min{ α_max , 2α_k , ln(1 + c_1(e^{α_1} − 1)∥∇L_n(0)∥_2 (e^{t_k} − 1) √(min(λ_k, λ_{k+1})) / (∥θ_k∥_2 √(min(λ_0, λ_1)))) } ,   (20)
where λ_k = inf_{t_k ≤ t ≤ t_{k+1}} λ(t), k ≥ 0. Finally, the stopping criterion for the path-following algorithm is set to be
c_3/(e^{t_N} − 1) ≤ ϵ   or   t_N ≥ t_max ,   (21)
where c3 > 0 is an absolute constant.
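A direct code transcription of (19)–(21) might look as follows; this is our own sketch, and the quantities λ_k = inf_{t_k ≤ t ≤ t_{k+1}} λ(t) are assumed to be supplied (or lower-bounded, e.g. via the function g in Assumption (A1) below).

```python
import math

def alpha_first(eps, lam0, lam1, grad0_norm, alpha_max):
    """Initial step size alpha_1 as in (19); grad0_norm = ||grad L_n(0)||_2."""
    return min(alpha_max, math.log1p(math.sqrt(eps * min(lam0, lam1)) / grad0_norm))

def alpha_next(alpha_k, t_k, theta_k_norm, lam_k, lam_k1, lam0, lam1,
               grad0_norm, alpha1, alpha_max, c1=1.0):
    """Step size alpha_{k+1} as in (20)."""
    num = c1 * math.expm1(alpha1) * grad0_norm * math.expm1(t_k) * math.sqrt(min(lam_k, lam_k1))
    den = theta_k_norm * math.sqrt(min(lam0, lam1))
    return min(alpha_max, 2.0 * alpha_k, math.log1p(num / den))

def should_stop(t_N, t_max, eps, c3):
    """Stopping criterion (21) for the path-following algorithm."""
    return c3 / math.expm1(t_N) <= eps or t_N >= t_max
```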
Moreover, we need to make some additional assumptions on the loss function.
Assumption (A1). Assume that Ln (θ) is differentiable and inf θ Ln (θ) > −∞. Moreover, assume
that there exists a positive decreasing function g(·) such that λ(t) ≥ g(t) > 0 for any 0 ≤ t < ∞.
Theorem 5. Suppose that Assumption (A1) holds. For any ϵ > 0 and t_max ∈ (0, ∞], assume that either t_max < ∞ or sup_{0≤t≤t_max} ∥θ(t)∥_2 < ∞. Then, using the step sizes and termination criterion specified in (19), (20), and (21), any algorithm that produces gradients satisfying (18) terminates after a finite number of iterations, and when terminated, the solution path θ̃(t) defined in (17) satisfies
sup_{0≤t≤t_max} { f_t(θ̃(t)) − f_t(θ(t)) } ≲ ϵ .   (22)
The above result can be viewed as a generalization of Theorem 3, because when Ln (θ) is convex, we
could set λ(t) = exp(−t) so that both condition (16) and Assumption (A1) are satisfied.
4 Numerical studies
In this section, we study the operating characteristics of Newton method, gradient descent method and a mixed method in ridge regression and ℓ2-regularized logistic regression, using both simulated datasets and a real data example. Here the mixed method applies gradient descent method to minimize f_{t_k}(θ) at the beginning and then switches to Newton method when the number of gradient steps n_k
exceeds min{n, p}. To illustrate the advantages of the proposed grid point selection scheme and
stopping criterion, we use a standard implementation of a typical path-following algorithm as a
baseline. In particular, the grid points are equally spaced (on a log-scale) for both examples. For
ℓ2-regularized logistic regression, Newton method is used as the basis algorithm with a fixed stopping criterion
∥∇ftk (θk )∥2 < 10−5 , while for ridge regression the exact solution can be computed. We refer to
this method as the baseline method.
We also compare the proposed methods against the method of Ndiaye et al. [2019]. In particular,
we compare our grid point selection scheme (13) with the so-called adaptive unilateral scheme proposed in Ndiaye et al. [2019], where Newton method is used as the basis algorithm at each grid point. Different from the aforementioned methods, the adaptive unilateral scheme constructs a piecewise constant solution path θ̃(λ) over chosen grid points to approximate the solution path θ(λ) = arg min_{θ∈R^p} P_λ(θ), where P_λ(θ) = L_n(θ) + (λ/2)∥θ∥_2^2, such that P_λ(θ̃(λ)) − P_λ(θ(λ)) ≤ ϵ for all λ ∈ (λ_min, λ_max).
Our grid point selection scheme (c.f. (13)) is more efficient than both the standard equally-spaced grid point scheme and the adaptive unilateral scheme proposed by Ndiaye et al. [2019]. Moreover, gradient descent method performs similarly to the mixed method in both setups of Example 1, which suggests that switching
does not happen much in those cases. However, in both problem dimensions of Example 2, we do
see a difference between the mixed method and gradient descent, which is due to the fact that the
switching to the Newton update does happen and it can speed up the computation by design. Lastly,
as one would expect, Newton method is more efficient than both the gradient descent method and the
mixed method when the problem dimension is small, but less so when problem dimension is large.
Figure 2 shows the runtime versus global approximation error tradeoff for our real data example, as
we vary ϵ for our proposed methods and Ndiaye’s method, and change grid point spacing for the
baseline method. Again, the Newton method is more efficient than both the baseline method and Ndiaye's method, which demonstrates the advantage of our proposed grid point selection scheme
over the standard equally-spaced grid point scheme and the scheme by Ndiaye et al. [2019].
We also plot the number of iterations at each grid point for Newton method and gradient descent method in Figures S1–S3 of the supplementary material. Interestingly, for ridge regression (c.f., Figure S1), the number of iterations by the gradient descent method first increases and then stays flat as t_k
grows. Newton method, however, only takes one iteration at each grid point. Moreover, the level of
approximation (i.e., ϵ) seems to have no impact on the number of iterations at each grid point, which
is highly desirable. For the ℓ2-regularized logistic regression (c.f., Figures S2 and S3), the number of
iterations needed increases as tk increases for gradient descent method, whereas the Newton method
always requires just a constant number of iterations at each tk . Again, the level of approximation
(i.e., ϵ) does not seem to influence the number of iterations much.
[Figure 1 appears about here; panels correspond to problem dimensions p = 500 and p = 10000, with suboptimality (log scale) on the vertical axis and runtime on the horizontal axis.]
Figure 1: Runtime vs. suboptimality for the Newton method, the gradient descent method, the mixed
method, the baseline method and Ndiaye’s method under two problem dimensions when applied to
ridge regression (upper panels) and ℓ2 -regularized logistic regression (lower panels). The vertical and
horizontal lines at each point represent the standard errors of suboptimality and runtime.
5 Discussion
In this article, we proposed a novel grid point selection scheme and stopping criterion for any general path-following algorithm. A simple approximated solution path was constructed through linear interpolation of the solutions at the selected grid points.
[Figure 2 appears about here; suboptimality (log scale) on the vertical axis and runtime (in seconds) on the horizontal axis.]
Figure 2: Runtime vs. suboptimality for the Newton method, the gradient descent method, the mixed
method, the baseline method and Ndiaye’s method when applied to ℓ2 -regularized logistic regression
over the a9a real dataset from LIBSVM [Chang and Lin, 2011].
References
T. B. Arnold and R. J. Tibshirani. Efficient implementations of the generalized lasso dual path algorithm. Journal
of Computational and Graphical Statistics, 25(1):1–27, 2016. doi: 10.1080/10618600.2015.1008638.
R. Bao, B. Gu, and H. Huang. Efficient approximate solution path algorithm for order weight l_1-norm with
accuracy guarantee. In 2019 IEEE International Conference on Data Mining (ICDM), pages 958–963. IEEE,
2019.
H. H. Bauschke, P. L. Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces,
volume 408. Springer, 2011.
C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
D. Eddelbuettel, R. François, J. Allaire, K. Ushey, Q. Kou, N. Russel, J. Chambers, and D. Bates. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1–18, 2011.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied
Statistics, 1(2):302–332, 2007.
P. Garrigues and L. Ghaoui. An homotopy algorithm for the lasso with online observations. Advances in neural
information processing systems, 21, 2008.
J. Giesen, M. Jaggi, and S. Laue. Approximating parameterized convex optimization problems. In European
Symposium on Algorithms, pages 524–535. Springer, 2010.
J. Giesen, J. Müller, S. Laue, and S. Swiercy. Approximating concavely parameterized optimization problems.
In Advances in neural information processing systems, pages 2105–2113, 2012.
J. Giesen, S. Laue, and P. Wieschollek. Robust and efficient kernel hyperparameter paths with guarantees. In
International Conference on Machine Learning, pages 1296–1304, 2014.
M. Grant and S. Boyd. CVX: MATLAB software for disciplined convex programming, version 2.1, 2014.
M. C. Grant and S. P. Boyd. Graph implementations for nonsmooth convex programs. In Recent advances in
learning and control, pages 95–110. Springer, 2008.
T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, 2004.
T. Hastie, J. Taylor, R. Tibshirani, G. Walther, et al. Forward stagewise regression and the monotone lasso.
Electronic Journal of Statistics, 1:1–29, 2007.
H. Hoefling. A path algorithm for the fused lasso signal approximator. Journal of Computational and Graphical
Statistics, 19(4):984–1006, 2010. doi: 10.1198/jcgs.2010.09208.
H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
E. Ndiaye. Safe optimization algorithms for variable selection and hyperparameter tuning. PhD thesis, Université
Paris-Saclay (ComUE), 2018.
E. Ndiaye, T. Le, O. Fercoq, J. Salmon, and I. Takeuchi. Safe grid search with optimal complexity. In
International Conference on Machine Learning, pages 4771–4780, 2019.
Y. E. Nesterov and A. S. Nemirovskii. Interior Point Polynomial Methods in Convex Programming: Theory and
Algorithms. SIAM Publications, 1993.
M. Osborne. An effective method for computing regression quantiles. IMA Journal of Numerical Analysis, 12:
151 – 166, 1992.
M. Osborne, B. Presnell, and B. Turlach. A new approach to variable selection in least squares problems. IMA
Journal of Numerical Analysis, 20(3):389 – 403, 2000.
B. T. Polyak. Gradient methods for solving equations and inequalities. USSR Computational Mathematics and
Mathematical Physics, 4(6):17–32, 1964.
X. Qu, D. Li, X. Zhao, and B. Gu. Gaga: Deciphering age-path of generalized self-paced regularizer. Advances
in Neural Information Processing Systems, 35:32025–32038, 2022.
S. Rosset. Following curved regularized optimization solution paths. In Advances in Neural Information
Processing Systems, 2004.
S. Rosset and J. Zhu. Piecewise linear regularized solution paths. Ann. Statist., 35(3):1012–1030, 2007. doi:
10.1214/009053606000001370.
T. Sun and Q. Tran-Dinh. Generalized self-concordant functions: a recipe for Newton-type methods. Mathematical Programming, pages 1–69, 2017.
R. J. Tibshirani and J. Taylor. The solution path of the generalized lasso. The Annals of Statistics, pages
1335–1371, 2011.
Y. Zhu and R. Liu. An algorithmic view of ℓ2 -regularization and some path-following algorithms. The Journal
of Machine Learning Research, 22(1):6123–6184, 2021.