apl232
apl232
Yuqia Wu [email protected]
Department of Applied Mathematics
The Hong Kong Polytechnic University
Kowloon, Hong Kong
Shaohua Pan [email protected]
School of Mathematics
South China University of Technology
Guangzhou, China
Xiaoqi Yang∗ [email protected]
Department of Applied Mathematics
The Hong Kong Polytechnic University
Kowloon, Hong Kong
Abstract
This paper concerns structured `0 -norms regularization problems, with a twice continu-
ously differentiable loss function and a box constraint. This class of problems have a wide
range of applications in statistics, machine learning and image processing. To the best
of our knowledge, there is no efficient algorithm in the literature for solving them. In
this paper, we first provide a polynomial-time algorithm to find a point in the proximal
mapping of the fused `0 -norms with a box constraint based on dynamic programming
principle. We then propose a hybrid algorithm of proximal gradient method and inexact
projected regularized Newton method to solve structured `0 -norms regularization problems.
The iterate sequence generated by the algorithm is shown to be convergent by virtue of a
non-degeneracy condition, a curvature condition and a Kurdyka-Lojasiewicz property. A
superlinear convergence rate of the iterates is established under a locally Hölderian error
bound condition on a second-order stationary point set, without requiring the local opti-
mality of the limit point. Finally, numerical experiments are conducted to highlight the
features of our considered model, and the superiority of our proposed algorithm.
Keywords: fused `0 -norms regularization problems; inexact projected regularized New-
ton algorithm; global convergence; superlinear convergence; KL property.
1. Introduction
Given a matrix B ∈ Rp×n , parameters λ1 > 0 and λ2 > 0, and vectors l ∈ Rn− and u ∈ Rn+ ,
we are interested in the structured `0 -norms regularization problem with a box constraint:
∗. Corresponding author.
1.1 Motivation
Given a data matrix A ∈ Rm×n and its response b ∈ Rm , the common regression model is
to minimize f (x) := h(Ax − b), where h : Rm → R is continuously differentiable on A(O) − b
with its minimum attained at the origin. When h(·) = 21 k · k2 , f is the least-squares loss
function of the linear regression. It is known that one of the popular models for seeking a
sparse vector while minimizing f is the following `0 -norm regularization problem
where the `0 -norm term is used to identify a set of influential components by shrinking
some small coefficients to 0. However, the `0 -norm regularizer only takes the sparsity of x
into consideration, but ignores its spatial nature, which sometimes needs to be considered in
real-world applications. For example, in the context of image processing, the variables often
represent the pixels of images, which are correlated with their neighboring ones. To recover
the blurred images, Rudin et al. (1992) took into account the differences between adjacent
variables and used the total variation regularization, which penalizes the changes of the
neighboring pixels and hence encourages smoothness in the solution. In addition, Land and
Friedman (1997) studied the phoneme classification on TIMIT database, for which there is
a high chance that every sampled point is close or identical to its neighboring ones because
each phoneme is composed of a series of consecutively sampled points. Land and Friedman
(1997) considered imposing a fused penalty on the coefficients vector x, and proposed the
following models with zero-order variable fusion and first-order variable fusion respectively
to train the classifier:
1
min kAx − bk2 + λ1 kBxk
b 0, (3)
x∈Rn 2
1
min kAx − bk2 + λ1 kBxkb 1, (4)
x∈Rn 2
where A ∈ Rm×n represents the phoneme data, b ∈ Rm is the label vector, B b ∈ R(n−1)×n
with Bbii = 1 and Bbi,i+1 = −1 for all i ∈ {1, . . . , n−1} and B
bij = 0 otherwise. In the sequel,
1 2
we call (1) with f (·) = 2 kA · −bk and B = B a fused `0 -norms regularization problem with
b
a box constraint.
Additionally taking the sparsity of x into consideration, Tibshirani et al. (2005) proposed
the fused Lasso, given by
1
minn kAx − bk2 + λ1 kBxk
b 1 + λ2 kxk1 , (5)
x∈R 2
and presented its nice statistical properties. Friedman et al. (2007) demonstrated that the
proximal mapping of the function λ1 kB b · k1 + λ2 k · k1 can be obtained through a process,
2
PGiPN for Fused Zero-norms Regularization Problems
3
Wu, Pan, and Yang
The PG method is able to effectively cope with model (1) if the proximal mapping of g
can be exactly computed. The PG method belongs to first-order methods, which have a low
computation cost and require weak global convergence conditions, but they achieve at most
a linear convergence rate. On the other hand, the Newton method has a faster convergence
rate, but it can only be applied to minimize sufficiently smooth objective functions. In recent
years, there have been active investigations into the Newton-type methods for nonsmooth
composite optimization problems of the form
where proxµ−1 g (·) is the proximal mapping of g, µk > 0 is a constant such that the objective
k
function F of (1) gains a sufficient decrease from xk to xk , and then judges whether the
iterate xk enters Newton step or not in terms of some switch condition, which takes the
following forms of structured stable supports:
If this switch condition does not hold, we set xk+1 = xk and return to the PG step.
Otherwise, by the nature of `0 -norm, the restriction of the function x 7→ λ1 kBxk0 + λ2 kxk0
4
PGiPN for Fused Zero-norms Regularization Problems
It is noted that the set Π(xk ) containing all the points whose supports are a subset of the
support of xk as well as the supports of their linear transformation is a subset of the support
of the linear transformation of xk . It is worth pointing out that the multifunction Π is not
closed but closed-valued.
We will show that every stationary point of (10) is one for problem (1). Thus, instead
of a subspace regularized Newton step in Wu et al. (2023), following the projected Newton
method in Bertsekas (1982) and the proximal Newton method in Lee et al. (2014); Yue
et al. (2019); Mordukhovich et al. (2023) and Liu et al. (2024), our projected regularized
Newton step minimizes the following second-order approximation of (10) on Πk :
1
arg min Θk (x) := f (xk ) + h∇f (xk ), x −xk i + hx− xk , Gk (x−xk )i + δΠk (x). (11)
x∈Rn 2
where b1 > 0, σ ∈ (0, 12 ) and µk is the same as in (7). To cater for the practical computation,
our Newton step seeks an inexact solution y k of (11) satisfying
k
Θk (y) − Θk (x ) ≤ 0, (13)
−1
dist(0, ∂Θk (y)) ≤ min{µk , 1} min kµ (xk −xk )k, kµ (xk − xk )k1+ς
n o
k k (14)
2
with ς ∈ (σ, 1]. Set the direction dk := y k − xk . A step size αk ∈ (0, 1] is found in the
direction dk via backtracking, and let xk+1 := xk + αk dk . To ensure the global convergence,
the next iteration still returns to the PG step. The details of the algorithm are given in
Section 3.
5
Wu, Pan, and Yang
proxµ−1 g (·) with µ from a closed interval on a compact set. This plays a crucial role in the
convergence analysis of the proposed algorithm, as well as generalizes the corresponding
results in Lu (2014) for `0 -norm and in Wu et al. (2023) for `q -norm with 0 < q < 1,
respectively.
• We design a hybrid algorithm (PGiPN) of PG and inexact projected regularized New-
ton methods to solve the structured `0 -norms regularization problem (1), which includes the
fused `0 -norms regularization problem with a box constraint as a special case. We obtain
the global convergence of the algorithm by showing that the structured stable supports (8)
hold when the iteration number is sufficiently large. Moreover, we establish a superlinear
convergence rate under a Hölderian error bound on a second-order stationary point set,
without requiring the local optimality of the limit point.
• The numerical experiments show that our PGiPN is more effective than some existing
algorithms in the literature in terms of solution quality and running time.
The rest of the paper is organized as follows. In Section 2 we recall some preliminary
knowledge and characterize the stationary point condition of model (1). In Section 3, we
prove the prox-regularity of g, characterize a uniform lower bound of the proximal mapping
of g, and provide an algorithm for finding a point in the proximal mapping of g with
B = B.b In Section 4, we introduce our algorithm and show that it is well defined. Section
5 is devoted to the convergence analysis of the proposed algorithm. The implementation
details of our algorithm and the numerical experiments are included in Section 6.
1.4 Notation
Throughout this paper, B(x, ) := {z | kz − xk ≤ } denotes the ball centered at x with
radius > 0, and B := B(0, 1). Let I and 1 be an identity matrix and a vector of all ones,
respectively, whose dimension is known from the context. For any two integers 0 ≤ j < k,
define [j : k] := {j, j + 1, . . . , k} and [k] := [1 : k]. For a closed and convex set Ξ ⊂ Rn ,
ri(Ξ) denotes the relative interior of Ξ, projΞ (·) represents the projection operator onto
Ξ, and for a given x ∈ Ξ, NΞ (x) and TΞ (x) denote the normal cone and tangent cone of
Ξ at x, respectively. For a closed set Ξ0 ⊂ Rn , dist(z, Ξ0 ) := minx∈Ξ0 kx − zk. For an
index set T ⊂ [n], |T | means the number of the elements of T and write T c := [n]\T . For
t ∈ R, sign(t) denotes the sign of t, i.e., sign(0) = 0 and sign(t) = t/|t| for t 6= 0, and
t+ := max{t, 0}. For a given x ∈ Rn , supp(x) := {i ∈ [n] | xi 6= 0}, sign(x) denotes the
vector with [sign(x)]i = sign(xi ), |x|min := mini∈supp(x) |xi |. For a vector x ∈ Rn and an
index set T ⊂ [n], xT ∈ R|T | is the vector obtained by removing those xj ’s with j ∈ / T,
and xj:k means x[j:k] . Given a real symmetric matrix H, λmin (H) denotes the smallest
eigenvalue of H, and kHk2 is the spectral norm of H. For a matrix A ∈ Rm×n and S ⊂ [m]
(resp. T ⊂ [n]), AS· (resp. A·T ) denotes the matrix obtained by removing those rows (resp.
columns) of A whose indices are not in S (resp. T ). For a proper lower semicontinuous
function h : Rn → R, its domain is denoted by dom h := {x ∈ Rn | h(x) < ∞}, and its
proximal mapping of h associated with a parameter µ > 0 is defined as
n 1 o
proxµh (z) := arg min kx − zk2 + h(x) ∀z ∈ Rn . (15)
x∈Rn 2µ
6
PGiPN for Fused Zero-norms Regularization Problems
For a nonnegative real number sequence {an }, O(an ) represents a sequence such that
O(an ) ≤ c1 an for some c1 > 0. The symbol F : Rm ⇒ Rn means that F is a set-valued
mapping (or multifunction), i.e., its image at every point is a set.
2. Preliminaries
Note that the structured `0 -norms function is lower semicontinuous and problem (1) involves
a compact box constraint, so its set of global optimal solutions is nonempty and compact.
Moreover, the continuity of ∇2f on an open set containing Ω and the compactness of Ω
implies that ∇f is Lipschitz continuous on Ω, i.e., there exists L∇f > 0 such that
The above basic facts are often used in the subsequent sections.
This means that x is a stationary point of problem (6) if and only if x is an L-stationary
point. To extend this equivalence to the class of prox-regular functions, we need to recall
the definition of prox-regularity, which acts as a surrogate of local convexity.
The following proposition reveals that under the prox-regularity of ϕ, the set of station-
ary points of problem (6) coincides with that of its L-stationary points. Since the proof is
similar to that in (Wu et al., 2023, Remark 2.5), the details are omitted here.
7
Wu, Pan, and Yang
Next we provide the stationary point conditions of problem (1) by characterizing the
subdifferential of function F . The closed-valuedness of the multifunction Π in (9) is used.
Lemma 4 Consider any z ∈ Ω. The following statements are true.
(i) z ∈ Π(z), and ∂g(z)
b = ∂g(z) = NΠ(z) (z).
(ii) ∂F (z) = ∇f (z) + ∂g(z) = ∇f (z) + NΠ(z) (z).
(iii) for any x ∈ Ω, 0 ∈ ∇f (x) + NΠ(z) (x) implies that 0 ∈ ∂F (x).
NΠ1 (z) (z) + NΠ2 (z) (z) + NΩ (z) = ∂h1 (z) + ∂h2 (z) ⊂ ∂g(z)
b ⊂ ∂g(z) ⊂ NΠ(z) (z).
Since Π(z) = Ω ∩ Π1 (z) ∩ Π2 (z) and z ∈ Π(z), by (Rockafellar, 1970, Theorem 23.8),
NΠ(z) (z) = NΩ (z) + NΠ1 (z) (z) + NΠ2 (z) (z). Thus, the desired conclusion holds.
(ii)-(iii) The first equality of part (ii) follows by (Rockafellar and Wets, 2009, Exercise
8.8), and the second one is implied by part (i). Next we consider part (iii). Suppose that
0 ∈ ∇f (x) + NΠ(z) (x). Obviously, x ∈ Π(z). From the definition of Π(·), we have Π(x) ⊂
Π(z), which along with their convexity and x ∈ Π(x) implies that NΠ(z) (x) ⊂ NΠ(x) (x).
Combining with part (ii) leads to the desired result.
8
PGiPN for Fused Zero-norms Regularization Problems
Remark 5 Lemma 4 (ii) provides a way to seek a stationary point of F . Indeed, for
any given z ∈ Ω, if x is a stationary point of problem miny∈Rn {f (y) | y ∈ Π(z)}, i.e.,
0 ∈ ∇f (x) + NΠ(z) (x), then by Lemma 4 (ii) it necessarily satisfies 0 ∈ ∂F (x). This
implication will be utilized in the design of our algorithm, that is, when obtaining a good
estimate of the stationary point, say xk , we run a Newton step to minimize f over the
polyhedral set Π(xk ) so as to enhance the speed of the algorithm.
9
Wu, Pan, and Yang
By the expressions of ∂ ∞ f1 (x) and ∂ ∞ f2 (x), it is immediate to check that the constraint
qualification in (17) does not hold. Next, we give our proof toward the prox-regularity of g.
Lemma 7 The function g is prox-bounded, and is prox-regular on its domain Ω, so the set
of stationary points of model (1) coincides with its set of L-stationary points.
g(x0 ) ≥ g(x) + v > (x0 − x), for all kx0 − xk ≤ ε, v ∈ ∂g(x), kv − vk < ε and x ∈ Ξ (18)
with Ξ := {x | kx − xk < ε, g(x) < g(x) + ε}, so the function g is prox-regular at x for v.
We first claim that for each x ∈ Ξ, supp(Cx) = supp(Cx) and x ∈ Ω. In fact, by the
definition of ε, supp(Cx) ⊃ supp(Cx). If supp(Cx) 6= supp(Cx), we have g(x) ≥ g(x) + λ >
g(x) + ε, which yields that x ∈/ Ξ. Therefore, supp(Cx) = supp(Cx). The fact that x ∈ Ξ
implies x ∈ Ω is clear. Hence the claimed facts are true.
Fix any x ∈ Ξ. Consider any x0 ∈ B(x, ε). If x0 ∈ / Ω, since g(x0 ) = ∞, it is immediate
to see that (18) holds, so it suffices to consider x0 ∈ B(x, ε) ∩ Ω. Note that supp(Cx0 ) ⊃
supp(Cx) = supp(Cx). If supp(Cx0 ) 6= supp(Cx), then g(x0 ) ≥ g(x) + λ. For any v ∈ ∂g(x)
with v ∈ B(v, ε), kvk ≤ kvk+ε ≤ kvk+λ, which along with kx0 −xk ≤ kx0 −xk+kx−xk ≤ 2ε
implies that kvkkx0 − xk ≤ (kvk + λ) 3(kvk+λ)
2λ
≤ 2λ
3 , and hence
Equation (18) holds. Next we consider the case supp(Cx0 ) = supp(Cx). Define
Clearly, Π(x) = Π1 (x) ∩ Π2 (x) ∩ Ω and Π1 (x), Π2 (x) and Ω are all polyhedral sets. By
(Rockafellar, 1970, Theorem 23.8), for any v ∈ NΠ(x) (x) = ∂g(x), there exist v1 ∈ NΠ1 (x) (x),
v2 ∈ NΠ2 (x) (x) and v3 ∈ NΩ (x) such that v = v1 + v2 + v3 . Then,
where the inequality follows from λ1 kBx0 k0 − λ1 kBxk0 = 0, v1> (x0 − x) = 0, λ2 kx0 k0 −
λ2 kxk0 = 0, v2> (x0 − x) = 0 and v3> (x0 − x) ≤ 0. Equation (18) is true. Thus, by the
arbitrariness of x ∈ Ω and v ∈ ∂g(x), we conclude that g is prox-regular on set Ω.
10
PGiPN for Fused Zero-norms Regularization Problems
0 < q < 1 and played a crucial role in the convergence analysis of the algorithms involving
subspace Newton method (see Wu et al. (2023)). Next, we show that such a uniform lower
bound exists for the proximal mapping of g.
Lemma 8 For any given compact set Ξ ⊂ Rn and constants 0 < µ < µ, define
S
Z := z∈Ξ,µ∈[µ,µ] proxµ−1 g (z).
Then, there exists ν > 0 (depending on Ξ, µ and µ) such that inf u∈Z\{0} |[B; I]u|min ≥ ν.
Proof Let C := [B; I]. By invoking (Bauschke et al., 1999, Corollary 3) and the compact-
ness of Ω, there exists κ > 0 such that for all index set J ⊂ [n+p],
Since the index sets J ⊂ [n + p] are finite, there exists σ > 0 such that for any index set
J ⊂ [n+p] with CJ· having full row rank,
>
λmin (CJ· CJ· ) ≥ σ. (20)
For any z ∈ Ξ and µ ∈ [µ, µ], define hz,µ (x) := µ2 kx − zk2 for x ∈ Rn . By the compactness
of Ω, [µ, µ] and Ξ, there exists δ0 ∈ (0, 1) such that for all z ∈ Ξ, µ ∈ [µ, µ] and x, y ∈ Ω
with kx − yk < δ0 , µ(kxk + kyk + 2kzk)kx − yk < λ := min{λ1 , λ2 }, and consequently,
µ µ λ
|hz,µ (x) − hz,µ (y)| = |hx − y, x + y − 2zi| ≤ (kxk + kyk + 2kzk)kx − yk < . (21)
2 2 2
Now suppose on the contrary that the conclusion does not hold. Then there is a sequence
1
{z k } k
k∈N ⊂ Z\{0} such that |Cz |min ≤ k for all k ∈ N. Note that C has a full column
k
rank. We also have |Cz |min > 0 for each k ∈ N. By the definition of Z, for each k ∈
k k k k 1
N,
there exist z ∈ Ξ and µk ∈ [µ, µ] such that z ∈ proxµ−1 g (z ). Since |Cz |min ∈ 0, k for
k
all k ∈ N, there exists an infinite index set K ⊂ N and an index i ∈ [n+p] such that
δ0 σ
0 < |(Cz k )i | = |Cz k |min < for each k ∈ K, (22)
κkCk2
where κ and σ are the ones appearing in (19) and (20), respectively. Fix any k ∈ K.
Write Qk := [n + p]\supp(Cz k ) and choose Jk ⊂ Qk such that the rows of CJk · form a
basis of those of CQk · . Let Jbk := Jk ∪ {i}. Obviously, kCJbk · z k k = |(Cz k )i |. We claim that
CJbk · also has a full row rank. Indeed, if Jk = ∅, then CJbk · has a full row rank because
CJbk · 6= 0 by (22); if Jk 6= ∅, then CJk · z k = 0, which implies that CJbk · also has a full row
rank (if not, Ci· is a linear combination of the rows of CJk · , which along with CJk · z k = 0
implies that Ci· z k = 0, contradicting to |(Cz k )i | = |Cz k |min > 0). The claimed fact holds.
Let zek := projNull(C b ) (z k ). Then, CJbk · zek = 0, and by the optimality condition of the
Jk ·
11
Wu, Pan, and Yang
where the inequality is due to (20). Combining (23) with (22) yields kξ k k < κ−1 kCk−1
2 δ0 .
Therefore,
kz k − zek k = kCJ> k k k −1
b · ξ k ≤ kCJbk · k2 kξ k ≤ kCk2 kξ k < κ δ0 . (24)
k
λ
z k ) − hz k ,µk (z k )| <
|hz k ,µk (b . (26)
2
Next we claim that supp(C zbk ) ∪ {i} ⊂ supp(Cz k ). Indeed, since the rows of CJbk · form a
basis of those of C[Qk ∪{i}]· and CJbk · zbk = 0, C[Qk ∪{i}]· zbk = 0. Then, supp(C[Qk ∪{i}]· zbk ) ∪
{i} = supp(C[Qk ∪{i}]· z k ). Since all the entries of C[Qk ∪{i}]c · z k are nonzero, it holds that
supp(C[Qk ∪{i}]c · zbk ) ⊂ supp(C[Qk ∪{i}]c · z k ), which implies that supp(C zbk ) ∪ {i} ⊂ supp(Cz k ).
Thus, the claimed inclusion follows, which implies that g(z k ) − g(b z k ) ≥ λ. This together
k k
with (26) yields hz k ,µk (z ) + g(z ) − (hz k ,µk (b k
z ) + g(bz )) ≥ λ − 2 = λ2 , contradicting to
k λ
The result of Lemma 8 will be utilized in Proposition 14 to justify the fact that the
sequences {|Bxk |min }k∈N and {|xk |min }k∈N are uniformly lower bounded, where xk is ob-
tained in (7) (or (38) below). This is a crucial aspect in proving the stability of supp(xk )
and supp(Bxk ) when k is sufficiently large.
To simplify the deduction,P for each i ∈ [n], define ωi (α) := λ2 |α|0 + δ[li ,ui ] (α) for α ∈ R.
Clearly, λ2 kxk0 + δΩ (x) = ni=1 ωi (xi ) for x ∈ Rn . Let H(0) := −λ1 , and for each s ∈ [n],
12
PGiPN for Fused Zero-norms Regularization Problems
define
s
1 X
H(s) := mins hs (y; z1:s ) := ky − z1:s k2 + λ1 kB
b[s−1][s] yk0 + ωj (yj ) (28)
y∈R 2
j=1
with Bb[0][1] := 0. It is immediate to see that H(n) is the optimal value to (27). For each
s ∈ [n], define function Ps : [0 : s−1] × R → R by
s
1 X
Ps (i, α) := H(i) + kα1 − zi+1:s k2 + ωj (α) + λ1 . (29)
2
j=i+1
For each s ∈ [n], there is a close relation between Ps and hs . Indeed, for any given y ∈ Rs
with ys = α, let i be the smallestPinteger in [0 : s−1] such that yi+1 = · · · = ys = α. When
i = 0, Ps (i, α) = 12 ky1:s −z1:s k2 + sj=1 ωj (yj ) = hs (y; z1:s ). When i 6= 0, if y1:i is optimal to
miny0 ∈Ri hi (y 0 ; z1:i ), then by noting that y = (y1:i ; α1) and
i
1 X
hs (y; z1:s ) = ky1:i −z1:i k2 +λ1 kB
b[i−1][i] y1:i k0 + ωj (yj )
2
j=1
s
1 X
+ kyi+1:s −zi+1:s k2 + ωj (yj ) + λ1 (30)
2
j=i+1
s
1 X
= hi (y1:i ; z1:i ) + kα1 − zi+1:s k2 + ωj (α) + λ1 ,
2
j=i+1
we get H(i) = hi (y1:i ; z1:i ). Along with the above equality and (29), hs (y; z1:s ) = Ps (i, α).
In the following lemma, we prove that the optimal value of mini∈[0:s−1],α∈R Ps (i, α) is
equal to H(s), and apply this result to characterize a global minimizer of hs (·; z1:s ).
Lemma 9 Fix any s ∈ [n]. The following statements are true.
(i) H(s) = mini∈[0:s−1],α∈R Ps (i, α).
s
1 X
min Ps (i, α) ≤ H(i∗s ) + kαs∗ 1 − zi∗s +1:s k2 + ωj (αs∗ ) + λ1
i∈[0:s−1],α∈R 2 ∗ j=is +1
s
∗ 1 ∗ X
≤ hi∗s (y1:i ∗ ; z1:i∗ )
s
2
+ kyi∗s +1:s − zi∗s +1:s k + ωj (yj∗ ) + λ1
s 2 ∗ j=is +1
∗
= hs (y ; z1:s ) = H(s),
13
Wu, Pan, and Yang
where the first equality is due to yi∗∗s +1 6= yi∗∗s and the expression of hs (y ∗ ; z1:s ) by (30). If
i∗s = 0,
s
1 X
min Ps (i, α) ≤ H(0) + ky ∗ − z1:s k2 + ωj (yj∗ ) + λ1 = H(s).
i∈[0:s−1],α∈R 2
j=1
Therefore, mini∈[0:s−1],α∈R Ps (i, α) ≤ H(s) holds. On the other hand, let (i∗s , αs∗ ) be an
optimal solution to mini∈[0:s−1],α∈R Ps (i, α). If i∗s 6= 0, let y ∗ ∈ Rs be such that y1:i
∗
∗ ∈
s
∗ ∗
arg minv∈Ri∗s hi∗s (v; z1:i∗s ) and yi∗s +1:s = αs 1. Then, it is clear that
s
∗ ∗ 1 ∗ X
H(s) ≤ hs (y ; z1:s ) ≤ hi∗s (y1:i ∗ ; z1:i∗ )
s
2
+ kyi∗s +1:s − zi∗s +1:s k + ωj (yj∗ ) + λ1
s 2 ∗ j=is +1
s
1 X
= H(i∗s ) + kαs∗ 1 − zi∗s +1:s k2 + ωj (αs∗ ) + λ1 = min Ps (i, α).
2 i∈[0:s−1],α∈R
j=i∗s +1
s
1 X
H(s) ≤ hs (y ∗ ; z1:s ) = H(0) + ky ∗ − z1:s k2 + ωj (αs∗ ) + λ1 = min Ps (i, α).
2 i∈[0:s−1],α∈R
j=1
Therefore, H(s) ≤ mini∈[0:s−1],α∈R Ps (i, α). The above two inequalities imply the result.
(ii) If i∗s 6= 0, by part (i) and the definitions of αs∗ and i∗s , it holds that
s
1 ∗ X
H(s) = min Ps (i, α) = H(i∗s ) 2
+ kαs 1 − zi∗s +1:s k + ωj (αs∗ ) + λ1
i∈[0:s−1],α∈R 2 ∗ j=is +1
s
∗ 1 ∗ X
= hi∗s (y1:i ∗ ; z1:i∗ ) + ky ∗ − zi∗s +1:s k2 + ωj (yj∗ ) + λ1 ≥ hs (y ∗ ; z1:s ),
s s
2 is +1:s
j=i∗s +1
Therefore, H(s) ≥ hs (y ∗ ; z1:s ). Along with the definition of H(s), H(s) = hs (y ∗ ; z1:s ).
Lemma 9 (i) implies that the nonconvex and nonsmooth problem (27) can be recast as
a mixed-integer programming with objective function given in (29). Lemma 9 (ii) suggests
a recursive method to obtain an optimal solution to (27). In fact, by setting s = n,
there exists an optimal solution to (27), says x∗ , such that x∗i∗n +1:n = αn∗ 1, and x∗1:i∗n ∈
arg minv∈Ri∗n hi∗n (v; z1:i∗n ). Next, by setting s = i∗n , we are able to obtain the expression of
x∗i∗s +1:i∗n . Repeating this loop backward until s = 0, we can obtain the full expression of an
14
PGiPN for Fused Zero-norms Regularization Problems
Set s = n.
While s > 0 do
Find (i∗s , αs∗ ) ∈ arg min Ps (i, α).
(31)
i∈[0:s−1],α∈R
Let xi∗s +1:s = αs 1 and s ← i∗s .
∗ ∗
End
To obtain an optimal solution to (27), the remaining issue is how to execute the first line
in while loop of (31), or in other words, for any given s ∈ [n], how to find (i∗s , αs∗ ) ∈ N × R
appearing in Lemma 9 (ii). The following proposition provides some preparations.
Proposition 10 For each s ∈ [n], let Ps∗ (α) := mini∈[0:s−1] Ps (i, α).
(ii) Let R01 := R. For each s ∈ [2 : n] and i ∈ [0 : s−2], let Ris := Ris−1 ∩ (Rs−1 c
s ) with
∗ ∗ 0
Rs−1
s := α ∈ R | Ps−1 (α) ≥ min
0
P s−1 (α ) + λ1 . (32)
α ∈R
(a) For each s ∈ [2 : n], i∈[0:s−1] Ris = R and Ris ∩ Rjs = ∅ for any i 6= j ∈ [0 : s−1].
S
(b) For each s ∈ [n] and i ∈ [0 : s−1], Ps∗ (α) = Ps (i, α) when α ∈ Ris .
Proof (i) Fix any α ∈ R. Note that P1∗ (α) = P1 (0, α) = H(0) + 21 (α − z1 )2 + ω1 (α) + λ1 =
1 2 ∗
2 (α − z1 ) + ω1 (α). Now pick any s ∈ [2 : n]. By the definition of Ps , we have
n o
Ps∗ (α) = min Ps (i, α) = min min Ps (i, α), Ps (s−1, α) . (33)
i∈[0:s−1] i∈[0:s−2]
15
Wu, Pan, and Yang
desired result.
(ii) We first prove (a) by induction. When s = 2, since R01 = R and R02 = R01 ∩ (R12 )c , we
have R02 ∪ R12 = R and R02 ∩ R12 = ∅. Assume that the result holds with s = j for some
j ∈ [2 : n−1]. We prove that the result holds for s = j +1. Since Rij+1 = Rij ∩ (Rjj+1 )c for
all i ∈ [0 : j −1] and i∈[0:j−1] Rij = R, it holds that
S
The first part holds. For the second part, by definition, Rij+1 ∩ Rjj+1 = ∅ for all i ∈ [0 : j−1],
so it suffices to prove that Rij+1 ∩ Rkj+1 = ∅ for any i 6= k ∈ [0 : j −1]. By definition,
where the second equality is due to Rij ∩ Rkj = ∅. Thus, the second part follows.
Next we prove (b). When s = 1, since for any α ∈ R = R01 , P1∗ (α) = P1 (0, α), the result
holds. For s ∈ [2 : n] and i = s−1, by the definition of Rs−1
s and part (i), for all α ∈ Rs−1
s ,
1
Ps∗ (α) = min ∗
Ps−1 (α0 ) + λ1 + (α − zs )2 + ωs (α) = Ps (s − 1, α),
α0 ∈R 2
where the second equality is obtained by using H(s) = minα0 ∈R Ps−1 ∗ (α0 ) and (29). Next we
consider s ∈ [2 : n] and i ∈ [0 : s−2]. We argue by induction that Ps∗ (α) = Ps (i, α) when
α ∈ Ris . Indeed, when s = 2, since R02 = R01 ∩ (R12 )c = (R12 )c , for any α ∈ R02 , from (32) we
have P1∗ (α) < minα0 ∈R P1∗ (α0 ) + λ1 , which by part (i) implies that
1 1
P2∗ (α) = P1∗ (α) + (α − z2 )2 + ω2 (α) = P1 (0, α) + (α − z2 )2 + ω2 (α) = P2 (0, α).
2 2
Assume that the result holds when s = j for some j ∈ [2 : n−1]. We consider the case for
s = j +1. For any i ∈ [0 : j −1], from Rij+1 = Rij ∩ (Rjj+1 )c and (34), for any α ∈ Rij+1 ,
∗ 1 1
Pj+1 (α) = Pj∗ (α) + (α − zj+1 )2 + ωj+1 (α) = Pj (i, α) + (α − zj+1 )2 + ωj+1 (α)
2 2
j
1 X 1
= H(i) + kα1 − zi+1:j k2 + wk (α) + λ1 + (α−zj+1 )2 + ωj+1 (α)
2 2
k=i+1
j+1
1 X
= H(i) + kα1 − zi+1:j+1 k2 + wk (α) + λ1 = Pj+1 (i, α),
2
k=i+1
16
PGiPN for Fused Zero-norms Regularization Problems
where the second equality is using Pj∗ (α) = Pj (i, α) implied by induction. Hence, the con-
clusion holds for s = j + 1 and any i ∈ [0 : s−2]. The proof is completed.
Now we take a closer look at Proposition 10. Part (i) provides a recursive method to
compute Ps∗ (α) for all s ∈ [n]. For each s ∈ [n], by the expression of ωs , Ps (i, ·) is a piecewise
lower semicontinuous linear-quadratic function whose domain is a closed interval, relative
to which Ps (i, ·) has an expression of the form H(i) + 21 kα1 − zi+1:s k2 + (s − i)|α|0 + λ1 .
While Ps∗ (·) = min{Ps (0, ·), Ps (1, ·), . . . , Ps (s − 1, ·)}, and for each i ∈ [0 : s−1], the optimal
solution to minα∈R Ps (i,
P α) is easily obtained (in fact, all the possible candidates of the
s
zj
global solutions are 0, j=i+1 s−i , maxj∈[i+1:s] {lj }, minj∈[i+1:s] {uj }), so is arg minα0 ∈R Ps∗ (α0 ).
Part (ii) suggests a way to search for i∗s such that Ps∗ (αs∗ ) = Ps (i∗s , αs∗ ) for each s ∈ [n].
Obviously, Ps (i∗s , αs∗ ) = mini∈[0:s−1],α∈R Ps (i, α). This inspires us to propose Algorithm 1 for
solving proxλ1 kB·k
b 0 +ω(·) (z), whose iteration steps are described as follows.
1. Initialize: Compute P1∗ (α) = 21 (z1 − α)2 + ω1 (α) and set R01 = R.
2. For s = 2, . . . , n do
3. Ps∗ (α) := min{Ps−1 ∗ (α), min 0 ∗ 0 1 2
α ∈R Ps−1 (α ) + λ1 } + 2 (α − zs ) + ωs (α).
s−1
4. Compute Rs by (32).
5. For i = 0, . . . , s − 2 do
6. Ris = Ris−1 ∩ (Rs−1 c
s ) .
7. End
8. End
9. Set s = n.
10. While s > 0 do
11. Find αs∗ ∈ arg minα∈R Ps∗ (α), and i∗s = i | αs∗ ∈ Ris .
Lemma 11 Fix any s ∈ [2 : n]. The function Ps∗ in line 3 of Algorithm 1 has at most
O(s1+ ) linear-quadratic pieces, where is any small positive constant.
Proof For each i ∈ [0 : s − 2], let hi (α) := H(i) + 21 kα1 − zi+1:s k2 + λ1 + (s − i)λ2 |α|0 +
P s
j=i+1 δ[lj ,uj ] (α) for α ∈ R. Obviously, every hi is a piecewise lower semicontinuous linear-
quadratic function whose domain is a closed interval, and every piece is continuous on the
closed interval except α = 0. Therefore, for each i ∈ [0 : s − 2], hi = min hi,1 , hi,2 , hi,3
with hi,1 (α) := hi (α) − (s − i)λ2 |α|0 + (s − i)λ2 + δ(−∞,0] (α), hi,2 (α) := hi (α) + δ{0} (α)
and hi,3 (α) := hi (α) − (s − i)λ2 |α|0 + (s − i)λ2 + δ[0,∞) (α). Obviously, hi1 , hi,2 and hi,3 are
17
Wu, Pan, and Yang
piecewise linear-quadratic functions with domain being a closed interval. In addition, write
∗ (α0 ) + 1 (α − z )2 + λ|α| + λ + δ
hs−1 (α) := minα0 ∈R Ps−1 2 s 0 1 [ls ,us ] (α) for α ∈ R. Obviously,
hs−1 is a piecewise lower semicontinuous linear-quadratic function whose domain is a closed
interval. Similarly, hs−1 = min{hs−1,1 , hs−1,2 , hs−1,3 } where each hs−1,j for j = 1, 2, 3 is a
piecewise linear function whose domain is a closed interval. Combining the above discussion
with line 3 of Algorithm 1 and the definition of Ps−1 ∗ , for any α ∈ R,
n 1 1 o
Ps∗ (α) = min Ps−1 (i, α)+ (α − zs )2 +ωs (α), min P ∗
s−1 (α 0
)+ (α − z s )2
+ω s (α)+λ1
i∈[0:s−2] 2 α0 ∈R 2
n o
= h0 (α), h1 (α), . . . , hs−2 (α), hs−1 (α) = min hi,j (α) .
i∈[0:s−1],j∈[3]
Notice that any hi,j and hi0 ,j 0 with i 6= i0 ∈ [0 : s − 1] or j 6= j 0 ∈ [3] crosses at most 2
times. From (Sharir, 1995, Theorem 2.5) the maximal number of linear-quadratic pieces
involved in Ps∗ is bounded by the maximal length of a (3s, 4) Davenport-Schinzel
√ sequence,
which by (Davenport and Schinzel, 1965, Theorem 3) is 3c1 s exp(c2 log 3s). Here, c1 , c2
are positive constants independent of s. Thus, we conclude that the maximal number of
linear-quadratic pieces involved in Ps∗ is O(s1+ ) for any > 0. The proof is finished.
By invoking Lemma 11, we are able to provide a worst-case estimation for the complexity
of Algorithm 1. Indeed, the main cost of Algorithm 1 consists in lines 3 and 5-7. The
computation cost involved in line 3 depends on the number of pieces of Ps−1 ∗ , which by
Lemma 11 requires O(s1+ ) operations with any small > 0. From part (b) of Proposition
(ii), for each i ∈ [0 : s − 1], Ris consists of at most O(s1+ ) intervals, which means that
line 6 requires at most O(s1+ ) operations and then the computation complexity of lines
5-7 is O(s2+ ) with any small > 0. Thus, the worst-case complexity of Algorithm 1 is
P n 2+ ) = O(n3+ ) with any small > 0.
s=2 O(s
G1k := ∇2 f (xk )+ b2 [−λmin (∇2h(Axk − b))]+ A> A + b1 kµk (xk −xk )kσ I with b2 ≥ 1. (35)
18
PGiPN for Fused Zero-norms Regularization Problems
However, for highly nonconvex h, [−λmin (∇2h(Axk − b))]+ is large, for which G1k is a poor
approximation to ∇2 f (xk ). To avoid this drawback, we consider the following
G3k := ∇2 f (xk )+ b2 [−λmin (∇2 f (xk ))]+ + b1 kµk (xk −xk )kσ I.
(37)
It is not hard to check that for i = 1, 2, 3, Gik meets the requirement in (12). We remark
here that the subsequent convergence analysis holds for the above three Gik , and we write
them by Gk for simplicity. The iterates of PGiPN are described as follows.
Remark 12 (i) Our PGiPN benefits from the PG step in two aspects. First, the incor-
poration of the PG step can guarantee that the sequence generated by PGiPN remains in
a right position for convergence. Second, the PG step helps to identify adaptively the sub-
space used in the Newton step, and as will be shown in Proposition 16, when k is sufficiently
large, switch condition (8) always holds and the supports of {Bxk }k∈N and {xk }k∈N keep un-
changed, so that Algorithm 2 will reduce to an inexact projected regularized Newton method
for solving (10) with Πk ≡ Π∗ , where Π∗ ⊂ Rn is a polytope defined in (49). In this sense,
the PG step plays a crucial role in transforming the original challenging problem (1) into a
problem that can be efficiently solved by the inexact projected regularized Newton method.
19
Wu, Pan, and Yang
(ii) When xk enters the Newton step, from the inexact criterion (13) and the expression of
Θk , 0 ≥ Θk (xk +dk ) − Θk (xk ) = h∇f (xk ), dk i + 21 hdk , Gk dk i, and then
1 b1
h∇f (xk ), dk i ≤ − hdk , Gk dk i ≤ − kµk (xk − xk )kσ kdk k2 < 0, (40)
2 2
where the second inequality is due to (12). In addition, the inexact criterion (13) implies
that y k ∈ Πk , which along with xk ∈ Πk and the convexity of Πk yields that xk + αdk ∈
Πk for any α ∈ (0, 1]. By the definition of Πk , supp(B(xk + αdk )) ⊂ supp(Bxk ) and
supp(xk + αdk ) ⊂ supp(xk ), so g(xk + αdk ) ≤ g(xk ) for any α ∈ (0, 1]. This together with
(40) shows that the iterate along the direction dk will reduce the value of F at xk .
(iii) When = 0, by Definition 1 the output xk of Algorithm 2 is an L-stationary point of
(1), which is also a stationary point of problem (10) from Proposition 3 and Lemma 4 (i).
Let rk : Rn → Rn be the KKT residual mapping of (10) defined by
It is not difficult to verify that when xk satisfies condition (8), the following relation holds
By Remark 12 (iv), to show that Algorithm 2 is well defined, we only need to argue that
the Newton steps in Algorithm 2 are well defined, which is implied by the following lemma.
Lemma 13 For each k ∈ N, define the KKT residual mapping Rk : Rn → Rn of (11) by
Then, for those xk ’s satisfying (8), the following statements are true.
(i) For any y close enough to the optimal solution of (11), y −µ−1 k Rk (y) satisfies (13)-(14).
(ii) The line search step in (39) is well defined, and αk ≥ min 1, (1−%)b 1β
kµk (xk − xk )kσ .
L1
(iii) The inexact criterion (14) implies that kRk (y k )k ≤ 21 min krk (xk )k, krk (xk )k1+ς .
20
PGiPN for Fused Zero-norms Regularization Problems
Proof Pick any xk satisfying (8). We proceed the proof of parts (i)-(iii) as follows.
(i) Let ybk be the unique optimal solution to (11). Then ybk 6= xk (if not, xk is the optimal
solution of (11) and 0 = Rk (xk ) = rk (xk ), which by (42) means that xk = xk and Algorithm
2 stops at xk ). By the optimality condition of (11), −∇f (xk )−Gk (b y k −xk ) ∈ NΠk (b y k ), which
by the convexity of Πk and xk ∈ Πk implies that h∇f (xk ) + Gk (b y k − xk ), ybk − xk i ≤ 0. Along
k k 1 k
with the expression of Θk , we have Θk (b y ) − Θk (x ) ≤ − 2 hby − xk , Gk (b y k − xk )i < 0. Since
Θk is continuous relative to Πk , for any z ∈ Πk sufficiently close to yb , Θk (z) − Θk (xk ) ≤ 0.
k
From Rk (b y ) = 0 and the continuity of Rk , when y sufficiently close to yb, y − µ−1 k Rk (y) is
close to yb, which together with y − µ−1k R k (y) ∈ Π k implies that y − µ −1
k R k (y) satisfies the
criterion (13). In addition, from the expression of Rk , for any y ∈ Rn ,
L1 α2 k 2
f (xk +αdk ) − f (xk ) − %αh∇f (xk ), dk i ≤ (1−%)αh∇f (xk ), dk i + kd k
2
(1−%)αb1 L1 α2 k 2
≤− kµk (xk −xk )kσ kdk k2 + kd k
2 2
(1−%)b L1 α
1
= − kµk (xk −xk )kσ + αkdk k2 ,
2 2
where the second inequality uses (40). Therefore, when the nonnegative integer t is such
that β t ≤ min 1, (1−%)b k k σ , the line search in (39) holds, which implies that the
1
L1 kµk (x −x )k
smallest nonnegative integer tk should satisfy αk = β tk ≥ min 1, (1−%)b 1β
kµk (xk −xk )kσ .
L1
(iii) Let ζ k ∈ ∂Θk (y k ) be such that kζ k k = dist(0, ∂Θk (y k )). From ζ k ∈ ∂Θk (y k ) and
the expression of Θk , we have y k = projΠk (y k + ζ k − (Gk (y k − xk ) + ∇f (xk ))). Along
with y k = projΠk (y k ) and the nonexpansiveness of projΠk , ky k −projΠk (y k −(Gk (y k −xk ) +
∇f (xk )))k ≤ kζ k k. Consequently,
where the second inequality follows Lemma 4 of Sra (2012) and the expression of Rk . Com-
bining the last inequality with (14) and (42) leads to the desired inequality.
When µk = 1, the condition that kRk (y k )k ≤ 12 min krk (xk )k, krk (xk )k1+ς is a special
case of the inexact condition in (Yue et al., 2019, Equa (6a)) or the inexact condition
in (Mordukhovich et al., 2023, Equa (14)), which along with Lemma 13 (iii) shows that
criterion (14) with µk = 1 is stronger than the ones adopted in these literature.
To analyze the convergence of Algorithm 2 with = 0, henceforth we assume xk 6= xk
for all k (if not, Algorithm 2 will produce an L-stationary point within finite number of
21
Wu, Pan, and Yang
steps, and its convergence holds automatically). From the iteration steps of Algorithm 2,
we see that the sequence {xk }k∈N consists of two parts, {xk }k∈K1 and {xk }k∈K2 , where
Obviously, K1 consists of those k’s with xk+1 from the PG step, while K2 consists of those
k’s with xk+1 from the Newton step.
To close this section, we provide some properties of the sequences {xk }k∈N and {xk }k∈N .
(ii) There exists ν > 0 such that |Bxk |min ≥ ν and |xk |min ≥ ν for all k ∈ N.
(iii) There exist c1 , c2 > 0 such that c1 krk (xk )k ≤ kdk k ≤ c2 krk (xk )k1−σ for all k ∈ K2 .
Proof (i) For each k ∈ N, when k ∈ K1 , by the line search in step (1a), F (xk+1 ) < F (xk ),
and when k ∈ K2 , from (39) and (40), it follows that f (xk+1 ) < f (xk ), which along with
g(xk+1 ) ≤ g(xk ) by Remark 12 (ii) implies that F (xk+1 ) < F (xk ). Hence, {F (xk )}k∈N is a
descent sequence. Recall that F is lower bounded on Ω, so {F (xk )}k∈N is convergent.
(ii) By the definition of µk and Remark 12 (iv), µk ∈ [µmin , µ e] for all k ∈ N. Note that
{xk }k∈N ⊂ Ω, so the sequence {xk − µ−1 k ∇f (x k )}
k∈N is bounded and is contained in a
compact set, says, Ξ. By invoking Lemma 8 with such Ξ and µ = µmin , µ = µ e, there exists
ν > 0 (depending on Ξ, µmin and µ k
e) such that |[B; I]x |min > ν. The desired result then
follows by noting that |Bxk |min ≥ |[B; I]xk |min and |xk |min ≥ |[B; I]xk |min .
(iii) From the definition of Gk , the continuity of ∇2f , {xk }k∈N ⊂ Ω, {xk }k∈N ⊂ Ω and
Remark 12 (iv), there exists c > 0 such that
Fix any k ∈ K2 . By Lemma 13 (iii), kRk (y k )k ≤ 21 krk (xk )k. Then, it holds that
1
krk (xk )k ≤ krk (xk )k − kRk (y k )k ≤ krk (xk ) − Rk (y k )k
2
= µk kxk − projΠk (xk − µ−1 k k k −1 k k k
k ∇f (x )) − y + projΠk (y − µk (Gk (y − x ) + ∇f (x )))k
≤ (2µk + kGk k2 )ky k − xk k ≤ (2e
µ + c)kdk k,
where the third inequality is using the nonexpansiveness of projΠk , and the last one is due to
(43) and dk = y k − xk . Therefore, c1 krk (xk )k ≤ kdk k with c1 := 1/(4eµ+ 2c). For the second
inequality, it follows from the definitions of rk (·) and Rk (·) that Rk (y k ) − ∇f (xk ) − Gk dk ∈
NΠk (y k − µ−1 k k k k −1 k
k Rk (y )) and rk (x ) − ∇f (x ) ∈ NΠk (x − µk rk (x )), which together with
the monotonicity of the set-valued mapping NΠk (·) implies that
22
PGiPN for Fused Zero-norms Regularization Problems
Combining this inequality with equations (12), (42) and Lemma 13 (iii) leads to
b1 krk (xk )kσ kdk k2 ≤ (1 + µ−1 k k k
k kGk k2 )(kRk (y )k + krk (x )k)kd k (44)
≤ (3/2)(1 + µ−1 k k
k kGk k2 )krk (x )kkd k,
which along with (43) and µk ≥ µmin implies that kdk k ≤ 23 (1 + µ−1 −1 k 1−σ .
min c)b1 krk (x )k
k k 1−σ 3 −1 −1
Then, kd k ≤ c2 krk (x )k holds with c2 := 2 (1 + µmin c)b1 . The proof is completed.
5. Convergence Analysis
Before analyzing the convergence of Algorithm 2, we show that it finally reduces to an
inexact projected regularized Newton method for seeking a stationary point of a problem
to minimize a smooth function over a polyhedral set. This requires the following lemma.
Lemma 15 For the sequences {xk }k∈N and {xk }k∈N generated by Algorithm 2, the following
assertions are true.
(i) There exists a constant γ > 0 such that for each k ∈ N,
−γkxk − xk k2
if k ∈ K1 ,
k+1 k k k 2+σ
F (x ) − F (x ) ≤ −γkx − x k if k ∈ K2 , αk = 1, (45)
−γkxk − xk k2+2σ if k ∈ K2 , αk = 6 1.
23
Wu, Pan, and Yang
Thus, we obtain limk→∞ kxk −xk k = 0. Together with (42), Proposition 14 (iii) and Remark
12 (iv), it follows that limK2 3k→∞ kdk k = 0.
(iii) Recall that {xk }k∈N ⊂ Ω, so its accumulation point set Γ(x0 ) is nonempty. Pick
any x∗ ∈ Γ(x0 ). Then, there exists an index set K ⊂ N such that limK3k→∞ xk = x∗ .
From part (ii), limK3k→∞ xk = x∗ . By step (1a) and Remark 12 (iv), for each k ∈ K,
xk ∈ proxµ−1 g xk − µ−1 k
k ∇f (x ) with µk ∈ [µmin , µ
e], and consequently,
k
while lim inf K3k→∞ g(xk ) ≥ g(x∗ ) follows from the lower semicontinuity of g. Thus, the
claimed limit limK3k→∞ g(xk ) = g(x∗ ) holds. Now from the above inclusion (47), it follows
that 0 ∈ ∇f (x∗ ) + ∂g(x∗ ). By Lemma 7, we know that x∗ is an L-stationary point of (1).
Next we apply Lemma 15 (ii) to show that, after a finite number of iterations, the switch
condition in (8) always holds and the Newton step is executed. To this end, define
Proposition 16 For the index sets defined in (48), there exist index sets T ⊂ [p], S ⊂ [n]
and k ∈ N such that for all k > k, Tk = T k = T and Sk = S k = S, which means that
k ∈ K2 for all k > k. Moreover, for each x∗ ∈ Γ(x0 ), supp(Bx∗ ) = T, supp(x∗ ) = S and
F (x∗ ) = limk→∞ F (xk ) := F ∗ , where Γ(x0 ) is defined in Lemma 15 (iii).
Proof We complete the proof of the conclusion via the following three claims:
Claim 1: There exists k ∈ N such that for k > k, |Bxk |min ≥ ν2 , where ν is the same as
the one in Proposition 14 (ii). Indeed, for each k − 1 ∈ K1 , xk = xk−1 , and |Bxk |min =
|Bxk−1 |min ≥ ν > ν2 follows by Proposition 14 (ii). Hence, it suffices to consider that k−1 ∈
ν
K2 . By Lemma 15 (ii), there exists k ∈ N such that for all k ≥ k, kxk−1 − xk−1 k < 4kBk 2
,
ν
and for all K2 3 k − 1 > k − 1, kdk−1 k < 4kBk2 , which implies that for K2 3 k − 1 > k − 1,
24
PGiPN for Fused Zero-norms Regularization Problems
kBxk−1−Bxk−1 k < ν4 and kBdk−1 k < ν4 . For each K2 3 k−1 > k−1, let ik ∈ [p] be such that
|(Bxk−1 )ik | = |Bxk−1 |min . Since condition (8) implies that supp(Bxk−1 ) = supp(Bxk−1 )
for each k − 1 ∈ K2 , we have |(Bxk−1 )ik | ≥ |Bxk−1 |min . Thus, for each K2 3 k − 1 > k − 1,
kBxk−1 − Bxk−1 k ≥ |(Bxk−1 )ik − (Bxk−1 )ik | ≥ |(Bxk−1 )ik | − |(Bxk−1 )ik |
≥ |Bxk−1 |min − |Bxk−1 |min .
Recall that |Bxk−1 |min ≥ ν for all k ∈ N by Proposition 14 (ii). Together with the last
inequality and kBxk−1 −Bxk−1 k < ν4 , for each K2 3 k −1 > k −1, we have |Bxk−1 |min ≥ 3ν
4 .
k k
For each K2 3 k − 1 > k − 1, let jk ∈ [p] be such that |(Bx )jk | = |Bx |min . By Remark
12 (ii), supp(Bxk ) ⊂ supp(Bxk−1 ) for each k − 1 ∈ K2 , which along with jk ∈ supp(Bxk )
implies that |(Bxk−1 )jk | ≥ |Bxk−1 |min . Thus, for each K2 3 k − 1 > k − 1,
1
kBdk−1 k = kBxk − Bxk−1 k ≥ kBxk − Bxk−1 k ≥ |(Bxk−1 )jk − (Bxk )jk |
αk
≥ |(Bxk−1 )jk | − |(Bxk )jk | ≥ |Bxk−1 |min − |Bxk |min ,
On the other hand, by the lower semicontinuity of F , we have F ∗ ≥ F (x∗ ). The two sides
imply that F (x∗ ) = F ∗ . The proof is completed.
25
Wu, Pan, and Yang
By Proposition 16, all k > k belong to K2 , i.e., the sequence {xk+1 }k>k is generated
by the Newton step. This means that {xk+1 }k>k is identical to the one generated by the
inexact projected regularized Newton method starting from xk+1 . Also, since Πk = Πk+1 for
all k > k, Algorithm 2 finally reduces to the inexact projected regularized Newton method
for solving
minn φ(x) := f (x) + δΠ∗ (x) with Π∗ := Πk+1 , (49)
x∈R
which is a minimization problem of function f over the polytope Π∗ , much simpler than
the original problem (1). Consequently, the global convergence and local convergence rate
analysis of PGiPN for model (1) boils down to analyzing those of the inexact projected
regularized Newton method for (49). The rest of this section is devoted to this.
Unless otherwise stated, the notation k in the sequel is always that of Proposition 16
plus one. In addition, we require the assumption that ∇2f is locally Lipschitz continuous
on Γ(x0 ), where Γ(x0 ) is defined in Lemma 15 (iii).
Π∗ = x ∈ R n | B T c
·x = 0, xS c = 0, x ≥ l, −x ≥ −u . (50)
k+1 k+1
A(x) := {i | xi = li } ∪ {i + n | xi = ui }.
Clearly, for x ∈ Π∗ , A(x) is the index set of those active constraints involved in Π∗ at x.
To prove the global convergence for PGiPN, we first show that A(xk ) keeps unchanged for
sufficiently large k under the following non-degeneracy assumption.
It follows from Proposition 3 and Lemma 15 (iii) that for each x∗ ∈ Γ(x0 ), x∗ is a
stationary point of F , which together with Proposition 16 and Lemma 4 (i) yields that
0 ∈ ∇f (x∗ ) + NΠ∗ (x∗ ), so that Assumption 2 substantially requires that −∇f (x∗ ) does
not belong to the relative boundary2 of NΠ∗ (x∗ ). In the next lemma, we prove that under
Assumptions 1-2, A(xk ) = A(xk+1 ) for sufficiently large k.
2. For convex set Ξ ⊂ Rn , the set difference cl(Ξ)\ri(Ξ) is called the relative boundary of Ξ, see (Rockafellar,
1970, p. 44).
26
PGiPN for Fused Zero-norms Regularization Problems
Lemma 18 Let {xk }k∈N be the sequence generated by Algorithm 2. Suppose that Assump-
tions 1-2 hold. Then, there exist A∗ ⊂ [2n] and a closed and convex cone N ∗ ⊂ Rn such
that A(xk ) = A∗ and NΠ∗ (xk ) = N ∗ for sufficiently large k.
kprojTΠ∗ (xk ) (−∇f (xk ))k = k −∇f (xk )−projNΠ∗ (xk ) (−∇f (xk ))k
= dist(0, ∂φ(xk )) = dist(0, ∂φ(xk−1 + dk−1 )),
where the third equality is due to Lemma 17. Thus, it suffices to prove that
For each k ∈ K2 , by equation (14), there exists ζk ∈ ∂Θk (y k ) = ∂Θk (xk +dk ) or equivalently
0 ∈ ∇f (xk )+Gk dk −ζk +NΠk (xk +dk ) such that kζk k is not more than the right hand side of
(14). Invoking Remark 12 (iv) and Lemma 15 (ii) yields that limk→∞ kζk k = 0. Moreover,
from Proposition 16, for k > k, the inclusion 0 ∈ ∇f (xk ) + Gk dk − ζk + NΠk (xk + dk ) is
equivalent to 0 ∈ ∇f (xk ) + Gk dk − ζk + NΠ∗ (xk + dk ). Note that ∂φ(xk+dk ) = ∇f (xk+dk ) +
NΠ∗ (xk +dk ) for each k > k. Then, ∇f (xk +dk )−∇f (xk ) − Gk dk +ζk ∈ ∂φ(xk +dk ) for each
k > k. This, by the continuity of ∇f , equation (43), Lemma 15 (ii), and limk→∞ kζk k = 0,
implies the desired limit limk→∞ dist(0, ∂φ(xk + dk )) = 0.
Claim 2: A(xk ) ⊂ A(xk+1 ) for sufficiently large k. If not, there exists an infi-
nite index set K ⊂ N such that A(xk ) 6⊂ A(xk+1 ) for all k ∈ K. If necessary taking a
subsequence, we assume that {xk }k∈K converges to x∗ . By Lemma 15 (ii), {xk+1 }k∈K con-
verges to x∗ . In addition, from Claim 1, limk→∞ kprojTΠ∗ (xk+1 ) (−∇f (xk+1 ))k = 0. The
two sides along with Assumption 2 and (Burke and Moré, 1988, Corollary 3.6) yields that
A(xk+1 ) = A(x∗ ) for all sufficiently large k ∈ K, contradicting to A(xk ) 6⊂ A(xk+1 ) for all
k ∈ K. The claimed inclusion holds for sufficiently large k.
From A(xk ) ⊂ A(xk+1 ) for sufficiently large k, {A(xk )}k∈N converges to for some
A∗ ⊂ [2n] in the sense of Painlevé-Kuratowski3 . From the discreteness of A∗ , we con-
clude that A(xk ) = A∗ for sufficiently large k. From the expression of Π∗ in (50) and
A(xk ) = A∗ for sufficiently large k, we have NΠ∗ (xk ) = N ∗ for sufficiently large k.
Assumption 3 For every sufficiently large k, there exists ξk ∈ NΠ∗ (xk ) such that
−h∇f (xk ) + ξk , dk i
lim inf > 0.
k→∞ k∇f (xk ) + ξk kkdk k
3. A sequence of sets {C }k∈N with C k ⊂ Rn is said to converge in the sense of Painlevé-Kuratowski if
k
its outer limit set lim supk→∞ C k coincides with its inner limit set lim inf k→∞ C k . On the definition of
lim supk→∞ C k and lim inf k→∞ C k , see (Rockafellar and Wets, 2009, Definition 4.1).
27
Wu, Pan, and Yang
This assumption essentially requires for every sufficiently large k the existence of one
element ξk ∈ NΠ∗ (xk ) such that the angle between ∇f (xk ) + ξk and dk is uniformly larger
than π/2. For sufficiently large k, since xk +αdk ∈ Π∗ for all α ∈ [0, 1], we have dk ∈ TΠ∗ (xk ),
which implies that hξ k , dk i ≤ 0. Together with (40), for sufficiently large k, the angle
between ∇f (xk ) + ξk and dk is larger than π/2. This means that it is highly possible for
Assumption 3 to hold. When n = 1, it automatically holds.
Next, we show that if φ is a KL function and Assumptions 1-3 hold, the sequence
generated by PGiPN is Cauchy and converges to an L-stationary point.
Theorem 19 Let {xk }k∈N be the sequence generated Pby Algorithm 2. Suppose that Assump-
tions 1-3 hold, and that φ is a KL function. Then, ∞k=1 kx k+1−xk k < ∞, and consequently
{xk }k∈N converges to an L-stationary point of (1) within a finite number of steps. Thus, we
only need to consider the case that φ(xk ) > φ(xk+1 ) for all k > k. By Proposition 16, for any
x ∈ Γ(x0 ), F ∗ = F (x) = φ(x)+λ1 |T |+λ2 |S| or equivalently φ(x) = φ∗ := F ∗ −λ1 |T |−λ2 |S|.
By (Bolte et al., 2014, Lemma 6), there exist ε > 0, η > 0 and a continuous concave function
ϕ ∈ Υη such that for all x ∈ Γ(x0 ) and x ∈ {z ∈ Rn | dist(z, Γ(x0 )) < ε} ∩ [φ∗ < φ < φ∗ + η],
ϕ0 (φ(x) − φ∗ )dist(0, ∂φ(x)) ≥ 1 where Υη is defined in Definition 6. Then, for k > k (if
necessary by increasing k), xk ∈ {z ∈ Rn | dist(z, Γ(x0 )) < ε} ∩ [φ∗ < φ < φ∗ + η], so
ϕ0 (φ(xk ) − φ∗ )dist(0, ∂φ(xk )) ≥ 1. (51)
By Assumption 3, there exist c > 0 and ξk ∈ NΠ∗ (xk ) such that for suffciently large k,
−h∇f (xk ) + ξk , dk i > ck∇f (xk ) + ξk kkdk k. (52)
From Lemma 18, NΠ∗ (xk ) = NΠ∗ (xk+1 ) for all k > k (by possibly enlarging k), which
implies that ξk ∈ NΠ∗ (xk+1 ). Together with (39), (52) and Lemma 17, for all k > k (if
necessary enlarging k), it holds that
φ(xk ) − φ(xk+1 ) −%h∇f (xk ) + ξk , dk i %ck∇f (xk ) + ξk kkdk k
≥ ≥ = %ckxk+1 −xk k, (53)
dist(0, ∂φ(xk )) dist(0, ∂φ(xk )) k∇f (xk ) + ξk k
where the second inequality follows by ∇f (xk ) + ξk ∈ ∂φ(xk ) and (52). For each k > k, let
∆k := ϕ(φ(xk )−φ∗ ). From (51), (53) and the concavity of ϕ on [0, η), for all k > k,
∆k − ∆k+1 = φ(xk ) − φ(xk+1 ) ≥ ϕ0 (φ(xk )−φ∗ )(φ(xk )−φ(xk+1 ))
φ(xk ) − φ(xk+1 )
≥ ≥ %ckxk+1 − xk k.
dist(0, ∂φ(xk ))
Summing this inequality from k to any k > k and using ∆k ≥ 0 yields that
k k
X 1 X 1 1
kxj+1 −xj k ≤ (∆j −∆j+1 ) = (∆k −∆k+1 ) ≤ ∆k .
%c %c %c
j=k j=k
28
PGiPN for Fused Zero-norms Regularization Problems
P∞
Passing the limit k → ∞ leads to j=k
kxj+1 − xj k < ∞. Thus, {xk }k∈N is a Cauchy
sequence and converges to x∗ . It follows from Lemma 15 (iii) that x∗ is an L-stationary
point of model (1). The proof is completed.
Next we focus on the superlinear rate analysis of PGiPN. For this purpose, define
which is called the set of second-order stationary points of (49). By Lemma 15 (iii) and
Proposition 3, the set X ∗ is generally smaller than the set of stationary points of (1). We
assume that a local Hölderian error bound condition holds with respect to (w.r.t.) X ∗ in
Assumption 4. For more introduction on the Hölderian error bound condition, we refer the
interested readers to Mordukhovich et al. (2023) and Liu et al. (2024).
Assumption 4 The mapping Rn 3 x 7→ r(x) := x−projΠ∗ (x−∇f (x)) has the q-subregularity
with q ∈ (0, 1] at any x ∈ Γ(x0 ) for the origin w.r.t. the set X ∗ , i.e., for every x ∈ Γ(x0 ),
there exist ε > 0 and κ > 0 such that for all x ∈ B(x, ε), dist(x, X ∗ ) ≤ κkr(x)kq .
Recently, Liu et al. (2024) proposed an inexact regularized proximal Newton method
(IRPNM) for solving the composite problem, the minimization of the sum of a twice contin-
uously differentiable function and an extended real-valued convex function, which includes
(49) as a special case. They established the superlinear convergence rate of IRPNM under
Assumption 1, and Assumption 4 with projΠ∗ replaced by the proximal mapping of the
convex function. By (Sra, 2012, Lemma 4) and µk ∈ [µmin , µ e], kr(xk )k = O(krk (xk )k) for
sufficiently large k. This together with Assumption 4 implies that for every x ∈ Γ(x0 ), there
exist ε > 0 and κ b > 0 such that for sufficiently large k with xk ∈ B(x, ε),
dist(xk , X ∗ ) ≤ κ
bkrk (xk )kq . (54)
Recall that PGiPN finally reduces to an inexact projected regularized Newton method for
solving (49). From Lemma 13 (iii) and Lemma 17, for sufficiently large k,
1
Θk (xk+1 ) − Θk (xk ) ≤ 0 and kRk (xk+1 )k ≤ min{krk (xk )k, krk (xk )k1+ς }. (55)
2
Let Λik := Gik−∇2f (xk ) − b1 kµk (xk − xk )kσ I with Gik given by (35)-(37). Under Assumption
4, from (Wu et al., 2023, Lemma 4.8), (Liu et al., 2024, Lemma 4.4), and the fact that
G1k − G2k 0, it holds that for sufficiently large k,
29
Wu, Pan, and Yang
In the rest of this section, for completeness, we provide the proof of the superlinear con-
vergence of PGiPN under Assumptions 1 and 4 though it is implied by that of Liu et al.
ek , x
(2024). To this end, for each k ∈ K2 , define x bk and fk as follows.
1
fk (x) := f (xk ) + ∇f (xk )> (x − xk ) + (x − xk )> Gk (x − xk );
2
xek : the exact solution to problem (11); x bk ∈ projX ∗ (xk ).
Rk (y k ) + ∇fk (y k − µ−1 k k k −1 k
k Rk (y )) − ∇fk (y ) ∈ ∂Θk (y − µk Rk (y )).
Note that ∇fk (x) = ∇f (xk ) + Gk (x − xk ). The above inclusion can be simplified as
(I − µ−1 k k −1 k
k Gk )Rk (y ) ∈ ∂Θk (y − µk Rk (y )).
On the other hand, from the definition of x ek , we have 0 ∈ ∂Θk (e xk ). Together with the
above inclusion and the strong monotoncity of ∂Θk with model b1 krk (xk )kσ , it follows that
D E
(I − µ−1
k G k )Rk (y k
), y k
− µ−1
k Rk (y k
) − x
e k
≥ b1 krk (xk )kσ ky k − µ−1 k
ek k2 .
k Rk (y ) − x
ky k − µ−1 k
ek k ≤ (b−1
k Rk (y ) − x
k −σ
1 krk (x )k )k(I − µ−1 k
k Gk )Rk (y )k
1 −1 k 1+ς (1 + µ−1
min c)
≤ (1 + µ min kG k
k 2 )krk (x )k ≤ krk (xk )k1+ς−σ ,
2b1 krk (xk )kσ 2b1
where the second inequality is due to (55) and µk ≥ µmin , and the third is by (43). Note
that y k = xk+1 by Lemma 17. From the above inequality and the second one of (55),
1 (1 + µ−1
min c)
ky k − x
ek k ≤ krk (xk )k1+ς + krk (xk )k1+ς−σ ,
2µmin 2b1
1 (1+µ−1
min c)
and the desired result holds with γ1 := 2µmin and γ2 := 2b1 .
L2 dist(xk , X ∗ )
kΛk k2
k
kx − x k
e k≤ + + 2 dist(xk , X ∗ ).
2b1 krk (xk )kσ b1 krk (xk )kσ
30
PGiPN for Fused Zero-norms Regularization Problems
Proof From Assumption 1, there exist 0 > 0 and L2 > 0 such that for any x, x0 ∈ B(x, 0 ),
ek , 0 ∈ ∇f (xk ) + Gk (e
By the definition of x xk − xk ) + NΠ∗ (e
xk ); while by the definition of x
bk ,
0 ∈ ∇f (bk
x ) + NΠ∗ (b k
x ). Using the monotoncity of NΠ∗ results in
0 ≤ h∇f (xk ) + Gk (e
xk − xk ) − ∇f (b
xk ), x
bk − x
ek i
= h∇f (xk ) + Gk (b
xk − xk ) − ∇f (b
xk ), x
bk − x
ek i − hGk (b
xk − x
ek ), x
bk − x
ek i.
Now we are ready to establish the supelinear convergence rate of the sequence. It is
noted that the proof is similar to that of (Liu et al., 2024, Theorem 6).
Theorem 23 Fix any x ∈ Γ(x0 ). Suppose that Assumption 1 holds, and Assumption 4
1
holds with q ∈ ( 1+σ , 1]. Then, the sequence {xk }k∈N converges to x with the Q-superlinear
convergence rate of order q(1+σ).
Proof If necessary enlarging k, we assume that xk ∈ B(x, 1 ) for k > k, where 1 is the
xk ) = 0 for k > k. This together
one in Lemma 22. From the definition of rk , we have rk (b
with the nonexpansive property of projΠ∗ yields that
In view of equation (56), if necessary enlarging k, there exists γ3 > 0 such that for k > k,
31
Wu, Pan, and Yang
From kdk k = ky k − xk k ≤ ky k − x ek k + ke
xk − xk k, Lemmas 21- 22, Assumption 4 and
equations (59)-(60), if necessary enlarging k, there exists γ4 > 0 such that for k > k,
where the third inequality is due to (59). Next we bound the term |krk (xk+1 )k−kRk (xk+1 )k|.
If necessary enlarging k, we have for k > k,
where the first inequality is by the definitions of rk and Rk and the nonexpansive property
of projΠk = projΠ∗ , the second one follows Assumption 1 and similar arguments for (58),
the third one follows equations (59)-(60), and the fourth is by (61). By combining the
L γ2
above inequality and (62) and letting γ5 := 22 4 + γ3 γ4 , γ6 := b1 γ4 (2e µ + L1 )σ and γ7 :=
1 1+ς
2 (2e
µ + L1 ) , it holds that for k > k (if necessary enlarging k),
h iq
dist(xk+1 , X ∗ ) ≤ κ
b γ5 dist(xk , X ∗ )2 + γ6 dist(xk , X ∗ )1+σ + γ7 dist(xk , X ∗ )1+ς
(63)
≤κb(γ5 + γ6 + γ7 )q dist(xk , X ∗ )q(1+σ) ,
where the last inequality follows by limk→∞ dist(xk , X ∗ ) = 0 and σ ≤ ς ≤ 1. The proof for
the result that {xk }k∈N converges to x at a superlinear convergence rate is similar to the
proof of (Liu et al., 2024, Theorem 6), and the details are omitted here.
32
PGiPN for Fused Zero-norms Regularization Problems
1 Pm
2). Thus, when h(u) = 12 kuk2 or h(u) = m m
i=1 log(1 + exp(−bi ui )) for u ∈ R , i.e., f
is the popular least-squares function or logistic regression function, Assumption 4 holds
automatically. In addition, when f is a piece-wise linear quadratic function, since ∂φ is a
polyhedral multifunction, the error bound condition automatically holds by (Robinson, 1981,
Proposition 1). Such loss functions, covering the Huber loss, the `1 -norm loss, the MCP
and SCAD loss, are often used to deal with outliers or heavy-tailed noise.
6. Numerical experiments
This section focuses on the numerical experiments of several variants of PGiPN for solving
a fused `0 -norms regularization problem with a box constraint. We first describe the imple-
mentation of Algorithm 2 in Section 6.1. In Section 6.2, we make comparison between model
(1) with the least-squares loss function f and the fused Lasso model (5) by using PGiPN to
solve the former and SSNAL (Li et al. (2018)) to solve the latter, to highlight the advantages
and disadvantages of our proposed fused `0 -norms regularization. Among others, the code
of SSNAL is available at (https://github.com/MatOpt/SuiteLasso). Finally, in Section 6.3,
we present some numerical results toward the comparison among several variants of PGiPN
and ZeroFPR and PG method for (1) in terms of efficiency and the quality of the output.
The MATLAB code of PGiPN is available at (https://github.com/yuqiawu/PGiPN).
Define $H_k := (G_k)_{S_k S_k}$, $v^k := x^k_{S_k}$ and $\nabla f_{S_k}(v^k) = [\nabla f(x^k)]_{S_k}$, and let
$$
\widehat{\Pi}_k := \big\{v \in \mathbb{R}^{|S_k|} \;\big|\; \widetilde{B}_k v = 0,\ l_{S_k} \le v \le u_{S_k}\big\},
$$
where $\widetilde{B}_k$ is the matrix obtained by removing from $B_{T_k^c S_k}$ those rows whose elements are all zero. We turn to consider the following strongly convex optimization problem,
$$
\hat{v}^k \approx \mathop{\arg\min}_{v \in \mathbb{R}^{|S_k|}}\Big\{\theta_k(v) := f(I_{\cdot S_k} v^k) + \langle \nabla f_{S_k}(v^k),\, v - v^k\rangle + \tfrac{1}{2}(v - v^k)^{\top} H_k (v - v^k) + \delta_{\widehat{\Pi}_k}(v)\Big\}. \qquad (64)
$$
The following lemma gives a way to find $y^k$ satisfying (13)-(14) by inexactly solving problem (64), whose dimension is much smaller than that of (11) if $|S_k| \ll n$, provided that $\hat{v}^k$ satisfies
$$
\theta_k(\hat{v}^k) - \theta_k(v^k) \le 0, \qquad \mathrm{dist}\big(0, \partial\theta_k(\hat{v}^k)\big) \le \frac{\min\{\mu_k^{-1}, 1\}}{2}\,\min\big\{\|\mu_k(\overline{x}^k - x^k)\|,\ \|\mu_k(\overline{x}^k - x^k)\|^{1+\varsigma}\big\}.
$$
Proof The first part is straightforward, so we consider the second part. By the definition of $\Theta_k$, $\mathrm{dist}(0, \partial\Theta_k(y^k)) = \mathrm{dist}\big(0, \nabla f(x^k) + G_k(y^k - x^k) + \mathcal{N}_{\Pi_k}(y^k)\big)$. Recall that $\Pi_k = \{x \in \Omega \mid B_{T_k^c\cdot}\,x = 0,\ x_{S_k^c} = 0\}$. Then $\mathcal{N}_{\Pi_k}(y^k) = \mathrm{Range}(B_{T_k^c\cdot}^{\top}) + \mathrm{Range}(I_{S_k^c\cdot}^{\top}) + \mathcal{N}_{\Omega}(y^k)$, and
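For illustration, here is a minimal Python sketch (with hypothetical names; our implementation is in MATLAB) of how the data of the reduced subproblem (64) can be assembled from the full-space quantities, which is the essence of the subspace strategy:

```python
import numpy as np

def reduced_subproblem_data(G, grad, x, S, Tc, B, l, u):
    """Assemble the data of the reduced subproblem (64) on the support S_k.
    G: regularized Hessian G_k (n x n); grad: gradient of f at x^k; x: x^k;
    S: indices in S_k; Tc: indices in T_k^c; B: fusion matrix; l, u: box bounds.
    Returns H_k, the restricted gradient, v^k, the reduced fusion matrix
    (all-zero rows of B_{T_k^c S_k} removed) and the reduced box bounds."""
    H = G[np.ix_(S, S)]                        # H_k = (G_k)_{S_k S_k}
    g = grad[S]                                # [grad f(x^k)]_{S_k}
    v0 = x[S]                                  # v^k = x^k_{S_k}
    Bred = B[np.ix_(Tc, S)]                    # B_{T_k^c S_k}
    Bred = Bred[np.abs(Bred).sum(axis=1) > 0]  # drop all-zero rows
    return H, g, v0, Bred, l[S], u[S]
```

The reduced quadratic $\theta_k$ can then be minimized inexactly over $\widehat{\Pi}_k$ by any convex QP routine (or a projected gradient loop) until the inexactness test above is satisfied.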
From the above discussions, we see that the computation of subproblem (11) involves the projection onto $\Pi_k$. Next we provide a method for computing it. Fix any $k \in K_2$. Given $z \in \mathbb{R}^n$, we consider the minimization problem defining the projection onto $\Pi_k$:
$$
\min_{x\in\mathbb{R}^n}\ \frac{1}{2}\|x - z\|^2 \quad \mathrm{s.t.}\quad \widehat{B}_{T_k^c\cdot}\,x = 0,\ x_{S_k^c} = 0,\ l \le x \le u. \qquad (65)
$$
We provide a toy example to illustrate how to solve (65). Let $x^k = (1, 1, 2, 3, 3, 0, 0, 0)^{\top} \in \mathbb{R}^8$. Since $T_k^c = \{1, 4, 6, 7\}$ and $S_k^c = \{6, 7, 8\}$, problem (65) can be written as
$$
\min_{x\in\mathbb{R}^8}\ \frac{1}{2}\|x - z\|^2 \quad \mathrm{s.t.}\quad x_1 = x_2,\ x_4 = x_5,\ x_6 = x_7 = x_8 = 0,\ l \le x \le u,
$$
which can be separated into the following four lower dimensional problems:
$$
\min_{x_{1:2}}\Big\{\tfrac{1}{2}\|x_{1:2} - z_{1:2}\|^2 : x_1 = x_2,\ l_{1:2} \le x_{1:2} \le u_{1:2}\Big\},\qquad \min_{x_3}\Big\{\tfrac{1}{2}(x_3 - z_3)^2 : l_3 \le x_3 \le u_3\Big\},
$$
$$
\min_{x_{4:5}}\Big\{\tfrac{1}{2}\|x_{4:5} - z_{4:5}\|^2 : x_4 = x_5,\ l_{4:5} \le x_{4:5} \le u_{4:5}\Big\},\qquad \min_{x_{6:8}}\Big\{\tfrac{1}{2}\|x_{6:8} - z_{6:8}\|^2 : x_6 = x_7 = x_8 = 0\Big\}.
$$
which can be separated into the following four lower dimensional problems:
Inspired by this toy example, there exists a smallest b j ∈ N such that the index set Tkc can
be partitioned into Tkc = i∈[bj] [i1 : i2 ]. Without loss of generality, we assume that these sets
S
are listed in an increasing order according to their left endpoints. Then, problem (65) can
be represented as
X1 X 1
minn kxi :i +1 − zi1 :i2 +1 k2 + (xi − zi )2
x∈R 2 1 2 2
(66)
S
i∈[b
j] i∈Tk \( i∈[bj] {i2 +1})
Proposition 26 For each $i \in [\hat{j}]$, if $[i_1 : i_2 + 1] \cap S_k^c \neq \emptyset$, let $x^*_{i_1:i_2+1} = 0$; otherwise, let $x^*_{i_1:i_2+1}$ be the unique optimal solution to
$$
\mathop{\arg\min}_{v\in\mathbb{R}^{i_2+2-i_1}}\ \frac{1}{2}\|v - z_{i_1:i_2+1}\|^2 \quad \mathrm{s.t.}\quad l_{i_1:i_2+1} \le v \le u_{i_1:i_2+1},\ v_1 = \cdots = v_{i_2+2-i_1}. \qquad (67)
$$
For each $i \in T_k\setminus(\bigcup_{i\in[\hat{j}]}\{i_2 + 1\})$, if $i \in S_k^c$, let $x^*_i = 0$; otherwise, let $x^*_i$ be the unique optimal solution to
$$
\min_{\alpha\in\mathbb{R}}\ \frac{1}{2}(\alpha - z_i)^2 \quad \mathrm{s.t.}\quad l_i \le \alpha \le u_i. \qquad (68)
$$
Then the vector $x^*$ constructed in this way is the unique optimal solution to problem (65).
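For illustration, the construction in Proposition 26 can be sketched in Python as follows (hypothetical names, 0-based indexing; the block partition of $T_k^c$, the remaining single indices and the set $S_k^c$ are assumed given). Each fused block receives the average of the corresponding entries of $z$ clipped to the intersection of their box intervals, which is exactly the solution of (67), and each remaining entry is clipped componentwise as in (68):

```python
import numpy as np

def project_onto_Pi_k(z, blocks, singles, Sc, l, u):
    """Projection onto Pi_k following Proposition 26 (a sketch).
    blocks  : list of (i1, i2) with the convention of (66): the entries
              z[i1], ..., z[i2+1] are forced to share a common value.
    singles : the remaining indices (not covered by any block),
              projected componentwise.
    Sc      : indices in S_k^c, whose entries must be zero."""
    x = np.zeros(len(z))
    Sc = set(Sc)
    for (i1, i2) in blocks:
        idx = range(i1, i2 + 2)                  # i1, ..., i2+1 inclusive
        if any(i in Sc for i in idx):
            continue                             # whole block forced to zero
        lo = max(l[i] for i in idx)              # intersection of box intervals
        hi = min(u[i] for i in idx)
        alpha = min(max(np.mean([z[i] for i in idx]), lo), hi)  # solution of (67)
        for i in idx:
            x[i] = alpha
    for i in singles:
        if i in Sc:
            continue
        x[i] = min(max(z[i], l[i]), u[i])        # solution of (68)
    return x
```

Since $l \in \mathbb{R}^n_-$ and $u \in \mathbb{R}^n_+$, the interval intersection used for each block is always nonempty.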
Our second numerical study is to evaluate the classification ability of these two models
with the TIMIT database (Acoustic-Phonetic Continuous Speech Corpus, NTIS, US Dept of Commerce), which consists of 4509 speech frames of 32 ms, each represented by 512 samples at a 16 kHz sampling rate. The TIMIT database is collected from 437 male speakers.
Every speaker provided approximately two speech frames of each of five phonemes, where
the phonemes are “sh” as in “she”, “dcl” as in “dark”, “iy” as the vowel in “she”, “aa”
as the vowel in “dark”, and “ao” as the first vowel in “water”. This database is a widely
used resource for research in speech recognition. Following the approach described in Land
and Friedman (1997), we compute a log-periodogram from each speech frame, which is one
of the several widely used methods to generate speech data in a form suitable for speech
recognition. Consequently, the dataset comprises 4509 log-periodograms of length 256 (fre-
quency). It was highlighted in Land and Friedman (1997) that distinguishing between “aa”
and “ao” is particularly challenging. Our aim is to classify these sounds using FZNS and
the fused Lasso with λ2 = 0, l = −1 and u = 1, or in other words, the zero-order variable
fusion (3) plus a box constraint and the first-order variable fusion (4).
In TIMIT, the numbers of phonemes labeled “aa” and “ao” are 695 and 1022, re-
spectively. As in Land and Friedman (1997), we use the first 150 frequencies of the
log-periodograms because the remaining 106 frequencies do not appear to contain any in-
formation. We randomly select m1 samples labeled “aa” and m2 samples labeled “ao”
as the training set, which together with their labels form $A \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^{m}$, with $m = m_1 + m_2$, $n = 150$, where $b_i = 1$ if $A_{i\cdot}$ is labeled as “aa”, and $b_i = 2$ otherwise. The rest of the dataset is left as the testing set, which forms $\bar{A} \in \mathbb{R}^{(1717-m)\times n}$ and $\bar{b} \in \mathbb{R}^{1717-m}$, with $\bar{b}_i = 1$ if $\bar{A}_{i\cdot}$ is labeled as “aa” and $\bar{b}_i = 2$ otherwise. For each $(A, b)$, we select 10 values of $\lambda_1$ at random from $[2\times 10^{-5}, 300]$ so that the sparsity $\|\widehat{B}x^*\|_0$ of the outputs spans a wide range. If $\bar{A}_{i\cdot}x^* \le 1.5$, the phoneme is classified as “aa” and hence we set $\hat{b}_i = 1$; otherwise, $\hat{b}_i = 2$. If $\hat{b}_i \neq \bar{b}_i$, $\bar{A}_{i\cdot}$ is regarded as a failure in classification. The error rate of classification is then given by $\|\bar{b}-\hat{b}\|_1/(1717-m)$. We record both $\|\widehat{B}x^*\|_0$ and the error rate of classification.
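For concreteness, the classification rule and error rate just described can be sketched as follows (hypothetical names; since the labels 1 and 2 differ by one, the $\ell_1$ distance counts the misclassified frames):

```python
import numpy as np

def phoneme_error_rate(A_test, b_test, x_star):
    """Predict label 1 ("aa") if A_i. x* <= 1.5 and label 2 ("ao") otherwise,
    then return ||b_bar - b_hat||_1 / (1717 - m), the fraction of mismatches."""
    b_hat = np.where(A_test @ x_star <= 1.5, 1, 2)
    return float(np.abs(b_test - b_hat).sum()) / len(b_test)
```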
The above procedure is repeated for 30 groups of randomly generated $(A, b)$, resulting in 300 outputs for each solver. The four subfigures in Figure 1 present $\|\widehat{B}x^*\|_0$ and the error rate of each output, for 4 different choices of $(m_1, m_2)$. We see that, in each subfigure, the output with the smallest error rate is always achieved by the fused $\ell_0$-norms regularization model. It is apparent that FZNS generally performs better than the fused Lasso when $\|\widehat{B}x^*\|_0 \le 30$, while the average error rate of the fused Lasso is lower than that of FZNS when $\|\widehat{B}x^*\|_0 \ge 60$. This phenomenon is especially evident when $m_1$ and $m_2$ are small.
Figure 1: $\|\widehat{B}x^*\|_0$ and the classification error rate for the outputs from FZNS and the fused Lasso under different $m_1, m_2$ (each of the four panels plots the error rate against the sparsity of $\widehat{B}x^*$).
The numerical results for these two empirical studies show that, for the prostate database, our model outperforms the fused Lasso when the output is sufficiently sparse, that is, when $\|\widehat{B}x^*\|_0 \le 3$ (see the first two lines in Table 1), and for the phoneme database, our model performs better when $\|\widehat{B}x^*\|_0 \le 30$. We also observe that the numerical performance of the fused $\ell_0$-norms regularization is not stable if the output is not sparse, especially when the number of observations is small, so when using the fused $\ell_0$-norms regularization model, careful consideration should be given to selecting an appropriate penalty parameter. Moreover, for
Figure 2: The average CPU time and error rate of 30 examples for four solvers (PGiPN, ZeroFPR, PGls, PGilbfgs): (a) $\lambda_c$-log(time (seconds)) plot; (b) $\lambda_c$-error rate plot; (c) $\lambda_c$-objective value plot.
We see from Figure 2(a) that, in terms of CPU time, PGiPN is always the best, more than ten times faster than the other three solvers. The reason is that the other three solvers depend
Table 2: Standard deviation of CPU time, error rate and objective value in Figure 2.

λ              0.01   0.04   0.07   0.1    0.4    0.7    1      4      7      10
Time
  PGiPN        0.12   0.10   0.11   0.10   0.12   0.14   0.11   0.05   0.05   0.04
  ZeroFPR      4.06   5.02   5.12   4.77   3.83   3.41   3.47   2.24   1.91   1.32
  PGls         1.75   2.27   3.03   2.13   3.01   3.51   3.11   3.18   2.75   2.82
  PGilbfgs     6.18   5.77   5.74   5.14   3.73   10.73  9.46   20.92  18.31  16.38
Error rate
  PGiPN        0.009  0.008  0.008  0.007  0.008  0.008  0.009  0.009  0.009  0.008
  ZeroFPR      0.010  0.008  0.007  0.007  0.008  0.010  0.010  0.008  0.009  0.010
  PGls         0.006  0.006  0.007  0.008  0.008  0.008  0.008  0.008  0.008  0.009
  PGilbfgs     0.009  0.008  0.007  0.006  0.007  0.008  0.008  0.010  0.008  0.008
Obj
  PGiPN        2.67   2.82   3.05   3.17   3.15   2.84   3.13   3.00   3.28   3.38
  ZeroFPR      2.59   2.81   2.86   2.80   3.04   3.00   3.07   3.68   2.88   2.69
  PGls         2.73   2.88   3.04   3.02   2.94   2.97   2.96   3.17   3.25   3.25
  PGilbfgs     2.51   2.74   2.94   2.99   3.00   2.89   3.21   3.00   3.27   3.36
Figure 3: Scatter plot for all tested examples, recording the relationship between the sparsity $\|\widehat{B}x^*\|_0$ and the classification error rate for PGiPN, ZeroFPR, PGilbfgs and PGls.
heavily on the proximal mapping of $g$, whose computation is somewhat time-consuming. The fact that PGiPN always requires the least CPU time reflects the advantage of the projected regularized Newton steps in PGiPN. From Figure 2(b), when $\lambda_c = 1$, PGiPN attains the smallest average error rate among the four solvers for the 10 $\lambda_c$'s. When $\lambda_c$ is larger, say, $\lambda_c > 0.4$, PGiPN and PGilbfgs tend to outperform ZeroFPR and PG in terms of the average error rate and objective value; when $\lambda_c$ is smaller, say, $\lambda_c < 0.1$, the solutions returned by PG have
these experiments, we do not find that the Newton steps are always performed only toward the end of the algorithm for PGiPN, PGiPN(r) and PGilbfgs; that is, some Newton steps are executed interleaved with the PG steps.
As one reviewer mentioned, due to the high nonconvexity of model (1), it is not easy to remove Assumption 3 from our global convergence result (see Theorem 19). In this part,
we make a numerical study on it. To this end, we introduce a specific choice of $\xi_k$. Let
$$
\xi_k := -\nabla f(x^k) - \mathrm{proj}_{\mathrm{Null}(C_k)}\big(-\nabla f(x^k)\big) \quad \text{with } C_k = [B_{T_k^c\cdot}; I_{S_k^c\cdot}] \text{ for } k \in K_2.
$$
Obviously, for each $k \in K_2$, $\xi_k \perp \mathrm{Null}(C_k)$, which implies that $\xi_k \in \mathcal{N}_{\mathrm{Null}(C_k)}(x^k) \subset \mathcal{N}_{\Pi_k}(x^k)$. The second inclusion is due to $\Pi_k \subset \mathrm{Null}(C_k)$ and the convexity of $\Pi_k$ and $\mathrm{Null}(C_k)$. We are ready to solve the problem in Section 6.3.1 with the termination condition $\mu_k\|\overline{x}^k - x^k\|_\infty \le 10^{-8}$. Each test generates a sequence $\{a_k\}_{k\in K_2}$ with
$$
a_k := \frac{-\langle\nabla f(x^k)+\xi_k,\, d^k\rangle}{\|\nabla f(x^k)+\xi_k\|\,\|d^k\|}.
$$
Since $\{a_k\}_{k\in K_2}$ is a finite sequence, its limit inferior is not defined. Recall that for a real-valued infinite sequence $\{b_k\}$, $\liminf_{k\to\infty} b_k = \sup_{l\in\mathbb{N}}\inf_{k\ge l} b_k$. Write the number of elements of $\{a_k\}_{k\in K_2}$ as $t$. For each test, we record the following $a$ as an approximation to the limit inferior:
$$
a := \sup_{l\in[t]}\ \inf_{k\ge l}\ a_k.
$$
It is not hard to check that $a = a_{k'}$, where $k'$ is the maximum element of $K_2$. We solve the problem for 10 different $\lambda_c$'s and 10 different groups of $(A, b)$, yielding 100 values of $a$ over 100 experiments. We store these 100 $a$'s in a MATLAB variable cosinelist, and find that min(cosinelist) = 0.0025, mean(cosinelist) = 0.0761 and std(cosinelist) = 0.0650. This indicates that it is highly likely for Assumption 3 to hold.
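For concreteness, the quantities involved in this check can be sketched as follows (our experiments use MATLAB; the Python names below are hypothetical):

```python
import numpy as np

def xi_k(grad_f, C):
    """xi_k = -grad f(x^k) - proj_{Null(C_k)}(-grad f(x^k)),
    where C_k stacks B_{T_k^c .} and I_{S_k^c .} row-wise."""
    v = -grad_f
    w, *_ = np.linalg.lstsq(C.T, v, rcond=None)
    proj_null = v - C.T @ w          # projection of v onto Null(C)
    return v - proj_null

def cosine_a(grad_f, xi, d):
    """a_k = -<grad f(x^k)+xi_k, d^k> / (||grad f(x^k)+xi_k|| ||d^k||)."""
    w = grad_f + xi
    return -float(np.dot(w, d)) / (np.linalg.norm(w) * np.linalg.norm(d))

def approx_liminf(a_list):
    """a = sup_{l in [t]} inf_{k >= l} a_k; for a finite sequence this
    equals the last element, as noted above."""
    return max(min(a_list[l:]) for l in range(len(a_list)))
```

Collecting $a_k$ for $k \in K_2$ and applying approx_liminf then gives the recorded value $a$, which equals the last element $a_{k'}$.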
7. Conclusions
In this paper, we proposed a hybrid of the PG method and an inexact projected regularized Newton method for solving the fused $\ell_0$-norms regularization problem (1). This hybrid framework fully exploits the advantages of the PG method and the Newton method while avoiding their disadvantages. We employed the KL property to prove the full convergence of the generated iterate sequence under a curvature condition (Assumption 3) on $f$, without assuming the uniform positive definiteness of the regularized Hessian matrices, and we also obtained a superlinear convergence rate under a Hölderian local error bound on the set of second-order stationary points, without assuming the local optimality of the limit point.
PGiPN, ZeroFPR and PG all employ the polynomial-time algorithm developed in Section 3.3 of this paper to compute a point in the proximal mapping of $g$ with $B = \widehat{B}$. Numerical tests indicate that our PGiPN not only produces solutions of better quality, but also requires 2-3 times less running time than PG and ZeroFPR, where the latter is mainly attributed to our subspace strategy when applying the projected regularized Newton method to solve the problems. It would be an interesting topic to extend the polynomial-time algorithm in Section 3.3 to the case where $B$ has other special structures.
Acknowledgments
The authors would like to thank the editor and the two anonymous referees for their valuable
suggestions, which allowed them to improve the quality of the paper.
The second author’s work was supported by the National Natural Science Foundation of
China under project No.12371299, and the third author’s research was partially supported
by Research Grants Council of Hong Kong SAR, P.R. China (PolyU15209921).
References
Masoud Ahookhosh, Andreas Themelis, and Panagiotis Patrinos. A Bregman forward-
backward linesearch algorithm for nonconvex composite optimization: superlinear con-
vergence to nonisolated local minima. SIAM Journal on Optimization, 31(1):653–685,
2021.
Aleksandr Aravkin, Michael P Friedlander, Felix J Herrmann, and Tristan Van Leeuwen.
Robust inversion, dimensionality reduction, and randomized sampling. Mathematical
Programming, 134:101–125, 2012.
Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating
minimization and projection methods for nonconvex problems: An approach based on
the Kurdyka-Lojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457,
2010.
Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of descent methods for
semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and
regularized Gauss-Seidel methods. Mathematical Programming, 137(1):91–129, 2013.
Gilles Bareilles, Franck Iutzeler, and Jérôme Malick. Newton acceleration on manifolds
identified by proximal gradient methods. Mathematical Programming, 200:37–70, 2023.
Heinz H Bauschke, Jonathan M Borwein, and Wu Li. Strong conical hull intersection
property, bounded linear regularity, Jameson’s property (g), and error bounds in convex
optimization. Mathematical Programming, 86:135–160, 1999.
Dimitri P Bertsekas. Projected Newton methods for optimization problems with simple
constraints. SIAM Journal on Control and Optimization, 20(2):221–246, 1982.
Wei Bian and Xiaojun Chen. A smoothing proximal gradient algorithm for nonsmooth
convex regression with cardinality penalty. SIAM Journal on Numerical Analysis, 58(1):
858–883, 2020.
Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations.
Journal of Fourier Analysis and Applications, 14(5):629–654, 2008.
Thomas Blumensath and Mike E Davies. Normalized iterative hard thresholding: Guaranteed stability and performance. IEEE Journal of Selected Topics in Signal Processing, 4(2):298–309, 2010.
Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized mini-
mization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):
459–494, 2014.
James V Burke and Jorge J Moré. On the identification of active constraints. SIAM Journal
on Numerical Analysis, 25(5):1197–1211, 1988.
Harold Davenport and Andrzej Schinzel. A combinatorial problem connected with differ-
ential equations. American Journal of Mathematics, 87(3):684–694, 1965.
Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordi-
nate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
Felix Friedrich, Angela Kempe, Volkmar Liebscher, and Gerhard Winkler. Complexity
penalized m-estimation: Fast computation. Journal of Computational and Graphical
Statistics, 17(1):201–224, 2008.
Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2024. URL https:
//www.gurobi.com.
Kyle K Herrity, Anna C Gilbert, and Joel A Tropp. Sparse approximation via iterative
thresholding. In 2006 IEEE International Conference on Acoustics Speech and Signal
Processing Proceedings, volume 3, pages III–III. IEEE, 2006.
Brad Jackson, Jeffrey D Scargle, David Barnes, Sundararajan Arabhi, Alina Alt, Peter
Gioumousis, Elyus Gwin, Paungkaew Sangtrakulcharoen, Linda Tan, and Tun Tao Tsai.
An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing
Letters, 12(2):105–108, 2005.
Sean Jewell and Daniela Witten. Exact spike train inference via `0 optimization. The
Annals of Applied Statistics, 12(4):2457–2482, 2018.
Sean W Jewell, Toby Dylan Hocking, Paul Fearnhead, and Daniela M Witten. Fast non-
convex deconvolution of calcium imaging data. Biostatistics, 21(4):709–726, 2020.
He Jiang, Shihua Luo, and Yao Dong. Simultaneous feature selection and clustering based
on square root optimization. European Journal of Operational Research, 289(1):214–231,
2021.
Rebecca Killick, Paul Fearnhead, and Idris A Eckley. Optimal detection of changepoints
with a linear computational cost. Journal of the American Statistical Association, 107
(500):1590–1598, 2012.
Stephanie R Land and Jerome H Friedman. Variable fusion: A new adaptive signal regression method. Technical Report 656, Department of Statistics, Carnegie Mellon University, 1997.
Jason D Lee, Yuekai Sun, and Michael A Saunders. Proximal Newton-type methods for
minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443, 2014.
Xudong Li, Defeng Sun, and Kim-Chuan Toh. On efficiently solving the subproblems of
a level-set method for fused Lasso problems. SIAM Journal on Optimization, 28(2):
1842–1866, 2018.
Jun Liu, Shuiwang Ji, and Jieping Ye. SLEP: Sparse learning with efficient projections.
Arizona State University, 6(491):7, 2009.
Jun Liu, Lei Yuan, and Jieping Ye. An efficient algorithm for a class of fused Lasso prob-
lems. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 323–332, 2010.
Ruyu Liu, Shaohua Pan, Yuqia Wu, and Xiaoqi Yang. An inexact regularized proximal
Newton method for nonconvex and nonsmooth optimization. Computational Optimization
and Applications, 88:603–641, 2024.
Zhaosong Lu. Iterative hard thresholding methods for `0 regularized convex cone program-
ming. Mathematical Programming, 147(1):125–154, 2014.
Zhaosong Lu and Yong Zhang. Sparse approximation via penalty decomposition methods.
SIAM Journal on Optimization, 23(4):2448–2478, 2013.
Robert Maidstone, Toby Hocking, Guillem Rigaill, and Paul Fearnhead. On optimal multiple changepoint algorithms for large data. Statistics and Computing, 27:519–533, 2017.
Cesare Molinari, Jingwei Liang, and Jalal Fadili. Convergence rates of Forward–Douglas–
Rachford splitting method. Journal of Optimization Theory and Applications, 182:606–
639, 2019.
Boris S Mordukhovich, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. A globally con-
vergent proximal Newton-type method in nonsmooth convex optimization. Mathematical
Programming, 198(1):899–936, 2023.
Shaohua Pan, Ling Liang, and Yulan Liu. Local optimality for stationary points of group
zero-norm regularized problems and equivalent surrogates. Optimization, 72(9):2311–
2343, 2023.
Guillem Rigaill. A pruned dynamic programming algorithm to recover the best segmentations with 1 to $k_{\max}$ change-points. Journal de la Société Française de Statistique, 156(4):180–205, 2015.
R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer,
2009.
Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise
removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
Suvrit Sra. Scalable nonconvex inexact proximal splitting. Advances in Neural Information
Processing Systems, 25, 2012.
Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity
and smoothness via the fused Lasso. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 67(1):91–108, 2005.
Kenji Ueda and Nobuo Yamashita. Convergence properties of the regularized Newton
method for the unconstrained nonconvex optimization. Applied Mathematics and Opti-
mization, 62(1):27–46, 2010.
Lou Van den Dries and Chris Miller. Geometric categories and o-minimal structures. Duke
Mathematical Journal, 84(2), 1996.
Andreas Weinmann, Martin Storath, and Laurent Demaret. The $L^1$-Potts functional for robust jump-sparse reconstruction. SIAM Journal on Numerical Analysis, 53(1):644–673, 2015.
Fan Wu and Wei Bian. Accelerated iterative hard thresholding algorithm for l0 regularized
regression problem. Journal of Global Optimization, 76(4):819–840, 2020.
Yuqia Wu, Shaohua Pan, and Xiaoqi Yang. A regularized Newton method for `q -norm com-
posite optimization problems. SIAM Journal on Optimization, 33(3):1676–1706, 2023.
Man-Chung Yue, Zirui Zhou, and Anthony Man-Cho So. A family of inexact SQA methods
for non-smooth convex minimization with provable convergence guarantees based on the
Luo-Tseng error bound property. Mathematical Programming, 174(1):327–358, 2019.
Rui Zhou and Daniel P Palomar. Solving high-order portfolios via successive convex ap-
proximation algorithms. IEEE Transactions on Signal Processing, 69:892–904, 2021.
Shenglong Zhou, Lili Pan, and Naihua Xiu. Newton method for `0 -regularized optimization.
Numerical Algorithms, 88(4):1541–1570, 2021.
Zirui Zhou and Anthony Man-Cho So. A unified approach to error bounds for structured
convex optimization problems. Mathematical Programming, 165:689–728, 2017.