
Journal of Machine Learning Research 25 (2024) 1-48 Submitted 12/23; Revised 9/24; Published 11/24

An Inexact Projected Regularized Newton Method for Fused Zero-norms Regularization Problems

Yuqia Wu [email protected]
Department of Applied Mathematics
The Hong Kong Polytechnic University
Kowloon, Hong Kong
Shaohua Pan [email protected]
School of Mathematics
South China University of Technology
Guangzhou, China
Xiaoqi Yang∗ [email protected]
Department of Applied Mathematics
The Hong Kong Polytechnic University
Kowloon, Hong Kong

Editor: Silvia Villa

Abstract
This paper concerns structured $\ell_0$-norms regularization problems with a twice continuously differentiable loss function and a box constraint. This class of problems has a wide range of applications in statistics, machine learning and image processing. To the best of our knowledge, there is no efficient algorithm in the literature for solving them. In this paper, we first provide a polynomial-time algorithm, based on the dynamic programming principle, for finding a point in the proximal mapping of the fused $\ell_0$-norms with a box constraint. We then propose a hybrid algorithm of the proximal gradient method and an inexact projected regularized Newton method to solve structured $\ell_0$-norms regularization problems. The iterate sequence generated by the algorithm is shown to be convergent by virtue of a non-degeneracy condition, a curvature condition and a Kurdyka-Łojasiewicz property. A superlinear convergence rate of the iterates is established under a locally Hölderian error bound condition on a second-order stationary point set, without requiring the local optimality of the limit point. Finally, numerical experiments are conducted to highlight the features of the considered model and the superiority of the proposed algorithm.
Keywords: fused $\ell_0$-norms regularization problems; inexact projected regularized Newton algorithm; global convergence; superlinear convergence; KL property.

1. Introduction
Given a matrix $B \in \mathbb{R}^{p\times n}$, parameters $\lambda_1 > 0$ and $\lambda_2 > 0$, and vectors $l \in \mathbb{R}^n_-$ and $u \in \mathbb{R}^n_+$, we are interested in the structured $\ell_0$-norms regularization problem with a box constraint:
$$\min_{x\in\mathbb{R}^n} F(x) := f(x) + \lambda_1\|Bx\|_0 + \lambda_2\|x\|_0 \quad \mathrm{s.t.}\ \ l \le x \le u, \qquad (1)$$

∗. Corresponding author.

©2024 Yuqia Wu, Shaohua Pan and Xiaoqi Yang.


License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v25/23-1700.html.

where $f: \mathbb{R}^n \to \overline{\mathbb{R}} := (-\infty, \infty]$ is twice continuously differentiable on an open set $\mathcal{O}$ containing the box set $\Omega := \{x \in \mathbb{R}^n \mid l \le x \le u\}$, and $\|\cdot\|_0$ denotes the $\ell_0$-norm (or cardinality) function. This model encourages sparsity of both the variable $x$ and its linear transformation $Bx$. Throughout this paper, we write $g(\cdot) := \lambda_1\|B\cdot\|_0 + \lambda_2\|\cdot\|_0 + \delta_\Omega(\cdot)$, where $\delta_\Omega(\cdot)$ denotes the indicator function of $\Omega$.

1.1 Motivation
Given a data matrix $A \in \mathbb{R}^{m\times n}$ and its response $b \in \mathbb{R}^m$, the common regression model is to minimize $f(x) := h(Ax - b)$, where $h: \mathbb{R}^m \to \mathbb{R}$ is continuously differentiable on $A(\mathcal{O}) - b$ with its minimum attained at the origin. When $h(\cdot) = \frac{1}{2}\|\cdot\|^2$, $f$ is the least-squares loss function of linear regression. It is known that one of the popular models for seeking a sparse vector while minimizing $f$ is the following $\ell_0$-norm regularization problem
$$\min_{x\in\mathbb{R}^n} f(x) + \lambda_2\|x\|_0, \qquad (2)$$

where the $\ell_0$-norm term is used to identify a set of influential components by shrinking some small coefficients to 0. However, the $\ell_0$-norm regularizer only takes the sparsity of $x$
into consideration, but ignores its spatial nature, which sometimes needs to be considered in
real-world applications. For example, in the context of image processing, the variables often
represent the pixels of images, which are correlated with their neighboring ones. To recover
the blurred images, Rudin et al. (1992) took into account the differences between adjacent
variables and used the total variation regularization, which penalizes the changes of the
neighboring pixels and hence encourages smoothness in the solution. In addition, Land and
Friedman (1997) studied the phoneme classification on TIMIT database, for which there is
a high chance that every sampled point is close or identical to its neighboring ones because
each phoneme is composed of a series of consecutively sampled points. Land and Friedman
(1997) considered imposing a fused penalty on the coefficients vector x, and proposed the
following models with zero-order variable fusion and first-order variable fusion respectively
to train the classifier:
$$\min_{x\in\mathbb{R}^n} \tfrac{1}{2}\|Ax - b\|^2 + \lambda_1\|\widehat{B}x\|_0, \qquad (3)$$
$$\min_{x\in\mathbb{R}^n} \tfrac{1}{2}\|Ax - b\|^2 + \lambda_1\|\widehat{B}x\|_1, \qquad (4)$$
where $A \in \mathbb{R}^{m\times n}$ represents the phoneme data, $b \in \mathbb{R}^m$ is the label vector, and $\widehat{B} \in \mathbb{R}^{(n-1)\times n}$ with $\widehat{B}_{ii} = 1$, $\widehat{B}_{i,i+1} = -1$ for all $i \in \{1,\dots,n-1\}$, and $\widehat{B}_{ij} = 0$ otherwise. In the sequel, we call (1) with $f(\cdot) = \frac{1}{2}\|A\cdot{-}b\|^2$ and $B = \widehat{B}$ a fused $\ell_0$-norms regularization problem with a box constraint.
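For concreteness, a minimal NumPy sketch (the function names here are illustrative, not from the paper) that builds the first-order difference matrix $\widehat{B}$ and evaluates the fused $\ell_0$-norms objective with the least-squares loss:

```python
import numpy as np

def build_Bhat(n):
    """First-order difference matrix B_hat in R^{(n-1) x n}: (B_hat x)_i = x_i - x_{i+1}."""
    Bhat = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    Bhat[idx, idx] = 1.0
    Bhat[idx, idx + 1] = -1.0
    return Bhat

def fused_l0_objective(x, A, b, Bhat, lam1, lam2, l, u, tol=1e-12):
    """F(x) = 0.5*||Ax-b||^2 + lam1*||B_hat x||_0 + lam2*||x||_0 + indicator of [l,u]."""
    if np.any(x < l - tol) or np.any(x > u + tol):
        return np.inf                              # outside the box constraint
    loss = 0.5 * np.sum((A @ x - b) ** 2)          # least-squares loss
    n_jumps = np.sum(np.abs(Bhat @ x) > tol)       # ||B_hat x||_0
    n_nnz = np.sum(np.abs(x) > tol)                # ||x||_0
    return loss + lam1 * n_jumps + lam2 * n_nnz
```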
Additionally taking the sparsity of $x$ into consideration, Tibshirani et al. (2005) proposed the fused Lasso, given by
$$\min_{x\in\mathbb{R}^n} \tfrac{1}{2}\|Ax - b\|^2 + \lambda_1\|\widehat{B}x\|_1 + \lambda_2\|x\|_1, \qquad (5)$$
and presented its nice statistical properties. Friedman et al. (2007) demonstrated that the proximal mapping of the function $\lambda_1\|\widehat{B}\cdot\|_1 + \lambda_2\|\cdot\|_1$ can be obtained through a process which later became known as "prox-decomposition". Based on the accessibility of this proximal mapping, various efficient algorithms were proposed to address model (5); see Liu et al. (2009, 2010); Li et al. (2018); Molinari et al. (2019). In particular, Li et al. (2018) proposed a semismooth Newton augmented Lagrangian method (SSNAL) to solve the dual of (5). The numerical results reported in their paper indicate that SSNAL is highly efficient.
It was claimed in Land and Friedman (1997) that both (3) and (4) perform well in signal regression, but the zero-order fusion model (3) produces simpler estimated coefficient vectors. This observation suggests that model (1) with $f = \frac{1}{2}\|A\cdot{-}b\|^2$ and $B = \widehat{B}$ may be able to find a simpler solution while performing as well as the fused Lasso does. Compared with regularization problems involving the $\ell_0$-norm, those using the $\|Bx\|_0$ regularization remain less explored in terms of algorithm development. According to Land and Friedman (1997), the global optimal solution of (3) is unavailable; however, one of its stationary points can be obtained. In fact, Jewell et al. (2020) revealed, by virtue of the dynamic programming principle, that a point in the proximal mapping of $\lambda_1\|\widehat{B}\cdot\|_0$ can be exactly determined within polynomial time, which allows one to use the well-known proximal gradient (PG) method to find a stationary point of problem (3). However, the highly nonconvex and nonsmooth nature of model (1) poses significant challenges in computing the proximal mapping of $g$ when $B = \widehat{B}$ and in developing effective optimization algorithms to solve it. As far as we know, no specific algorithms have yet been designed to solve these challenging problems.
Another motivation for this work comes from our previous research (Wu et al. (2023)). In that work, we considered model (2) with the $\ell_0$-norm replaced by the $\ell_q$ quasi-norm $\|x\|_q^q$, where $q \in (0,1)$ and $\|x\|_q := \big(\sum_{i=1}^n |x_i|^q\big)^{1/q}$. For this class of nonconvex and nonsmooth problems, we proposed a hybrid of PG and subspace regularized Newton methods (HpgSRN), which restricts the subproblems of the Newton steps to a subspace on which their objective functions are smooth, so that a regularized Newton method can be applied. It is worth noting that the subspace is induced by the support of the current iterate $x^k$. The PG step is executed in every iteration, but a Newton step is not necessarily run unless a switch condition is satisfied. The full convergence of the iterate sequence was established under a curvature condition and the Kurdyka-Łojasiewicz (KL) property (Attouch et al. (2010)) of the objective function, and a superlinear convergence rate was achieved under an additional local error bound condition on a second-order stationary point set. Due to the desirable convergence results and numerical performance of HpgSRN, we aim to adopt a similar subspace regularized Newton algorithm to solve (1), in which the subspace is induced by the combined supports of $Bx^k$ and $x^k$.

1.2 Related work


In recent years, many optimization algorithms have been developed to solve $\ell_0$-norm regularization problems of the form (2), including iterative hard thresholding (Herrity et al. (2006); Blumensath and Davies (2008, 2010); Lu (2014)), the penalty decomposition method (Lu and Zhang (2013)), the smoothing proximal gradient method (Bian and Chen (2020)), the accelerated iterative hard thresholding method (Wu and Bian (2020)) and NL0R (Zhou et al. (2021)). Among all these algorithms, NL0R is the only Newton-type method; it employs the Newton method to solve a series of stationary point equations confined to the subspaces identified by the support of the solution obtained by the proximal mapping of $\lambda_2\|\cdot\|_0$.


The PG method is able to cope effectively with model (1) if the proximal mapping of $g$ can be exactly computed. The PG method belongs to the class of first-order methods, which have a low computational cost and require weak global convergence conditions, but achieve at most a linear convergence rate. On the other hand, the Newton method has a faster convergence rate, but it can only be applied to minimize sufficiently smooth objective functions. In recent years, there have been active investigations into Newton-type methods for nonsmooth composite optimization problems of the form
$$\min_{x\in\mathbb{R}^n} \Psi(x) := \psi(x) + \varphi(x), \qquad (6)$$
where $\varphi: \mathbb{R}^n \to \overline{\mathbb{R}}$ is proper lower semicontinuous, and $\psi$ is twice continuously differentiable on an open subset of $\mathbb{R}^n$ containing the domain of $\varphi$. The proximal Newton-type method
is able to address (6) with a convex or weakly convex $\varphi$ and a convex $\psi$ (see Bertsekas (1982); Lee et al. (2014); Yue et al. (2019); Mordukhovich et al. (2023)) or a nonconvex $\psi$ (Liu et al. (2024)). Another popular second-order approach for solving (6) is to minimize the forward-backward envelope (FBE) of $\Psi$; see Stella et al. (2017); Themelis et al. (2018, 2019); Ahookhosh et al. (2021). In particular, for those $\Psi$ with the proximal mapping of $\varphi$ being available, Themelis et al. (2018) proposed an algorithm called ZeroFPR, based on the quasi-Newton method, for minimizing the FBE of $\Psi$. They achieved global convergence of the iterate sequence by means of the KL property of the FBE, and its local superlinear rate under the Dennis-Moré condition and the strong local minimality of the limit point. An algorithm similar to ZeroFPR but minimizing the Bregman FBE of $\Psi$ was proposed in Ahookhosh et al. (2021), which achieves a superlinear convergence rate without requiring the strong local minimality of the limit point. For the case where $\psi$ is smooth and $\varphi$ admits a computable proximal mapping, Bareilles et al. (2023) proposed an algorithm, alternating between a PG step and a Riemannian Newton method, which was proved to have a quadratic convergence rate under a positive definiteness assumption on the Riemannian Hessian at the limit point.

1.3 Main contributions


This work aims to design a hybrid of PG and inexact projected regularized Newton methods (PGiPN) to solve the structured $\ell_0$-norms regularization problem (1). Let $x^k \in \Omega$ be the current iterate. Our method first runs a PG step with line search at $x^k$ to produce $\overline{x}^k$ via
$$\overline{x}^k \in \mathrm{prox}_{\mu_k^{-1}g}\big(x^k - \mu_k^{-1}\nabla f(x^k)\big), \qquad (7)$$
where $\mathrm{prox}_{\mu_k^{-1}g}(\cdot)$ is the proximal mapping of $g$, and $\mu_k > 0$ is a constant such that the objective function $F$ of (1) gains a sufficient decrease from $x^k$ to $\overline{x}^k$. The method then judges whether the iterate $\overline{x}^k$ enters the Newton step or not in terms of a switch condition, which takes the following form of structured stable supports:
$$\mathrm{supp}(\overline{x}^k) = \mathrm{supp}(x^k) \quad \text{and} \quad \mathrm{supp}(B\overline{x}^k) = \mathrm{supp}(Bx^k). \qquad (8)$$
If this switch condition does not hold, we set $x^{k+1} = \overline{x}^k$ and return to the PG step. Otherwise, by the nature of the $\ell_0$-norm, the restriction of the function $x \mapsto \lambda_1\|Bx\|_0 + \lambda_2\|x\|_0$


on the supports $\mathrm{supp}(B\overline{x}^k)$ and $\mathrm{supp}(\overline{x}^k)$, i.e., $\lambda_1\|(Bx)_{\mathrm{supp}(B\overline{x}^k)}\|_0 + \lambda_2\|x_{\mathrm{supp}(\overline{x}^k)}\|_0$, is a constant near $\overline{x}^k$ and does not provide any useful information. In this case, unlike when dealing with the $\ell_q$-norm regularization problem in Wu et al. (2023), we introduce the following multifunction $\Pi: \mathbb{R}^n \rightrightarrows \mathbb{R}^n$:
$$\Pi(z) := \{x \in \Omega \mid \mathrm{supp}(x) \subset \mathrm{supp}(z),\ \mathrm{supp}(Bx) \subset \mathrm{supp}(Bz)\}
         = \{x \in \Omega \mid x_{[\mathrm{supp}(z)]^c} = 0,\ (Bx)_{[\mathrm{supp}(Bz)]^c} = 0\}, \qquad (9)$$
and consider the associated subproblem
$$\min_{x\in\mathbb{R}^n} f(x) + \delta_{\Pi_k}(x) \quad \text{with } \Pi_k = \Pi(\overline{x}^k). \qquad (10)$$
It is noted that the set $\Pi(\overline{x}^k)$ consists of all points in $\Omega$ whose support is a subset of the support of $\overline{x}^k$ and whose linear transformation has support contained in that of $B\overline{x}^k$. It is worth pointing out that the multifunction $\Pi$ is not closed but is closed-valued.
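Since $\Pi(z)$ is a polyhedron, it can be handed to a generic solver as a box plus linear equality constraints. A minimal sketch, assuming NumPy and with purely illustrative names:

```python
import numpy as np

def support(v, tol=1e-12):
    """Index set of (numerically) nonzero entries."""
    return np.flatnonzero(np.abs(v) > tol)

def pi_constraints(z, B, l, u, tol=1e-12):
    """Return (Aeq, beq, l, u) describing
    Pi(z) = {x in [l,u] : x_i = 0 for i not in supp(z),
                          (Bx)_j = 0 for j not in supp(Bz)}."""
    n = z.size
    zero_x = np.setdiff1d(np.arange(n), support(z, tol))                 # [supp(z)]^c
    zero_Bx = np.setdiff1d(np.arange(B.shape[0]), support(B @ z, tol))   # [supp(Bz)]^c
    rows = []
    for i in zero_x:
        e = np.zeros(n); e[i] = 1.0
        rows.append(e)                 # enforce x_i = 0
    for j in zero_Bx:
        rows.append(B[j].copy())       # enforce (Bx)_j = 0
    Aeq = np.vstack(rows) if rows else np.zeros((0, n))
    beq = np.zeros(Aeq.shape[0])
    return Aeq, beq, l, u
```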
We will show that every stationary point of (10) is also one of problem (1). Thus, instead of the subspace regularized Newton step in Wu et al. (2023), following the projected Newton method in Bertsekas (1982) and the proximal Newton methods in Lee et al. (2014); Yue et al. (2019); Mordukhovich et al. (2023) and Liu et al. (2024), our projected regularized Newton step minimizes the following second-order approximation of (10) on $\Pi_k$:
$$\mathop{\arg\min}_{x\in\mathbb{R}^n}\ \Theta_k(x) := f(x^k) + \langle\nabla f(x^k), x - x^k\rangle + \tfrac{1}{2}\langle x - x^k, G_k(x - x^k)\rangle + \delta_{\Pi_k}(x). \qquad (11)$$
Among others, $G_k$ in (11) is an approximation to the Hessian $\nabla^2 f(x^k)$ satisfying
$$G_k \succeq b_1\|\mu_k(x^k - \overline{x}^k)\|^{\sigma} I, \qquad (12)$$
where $b_1 > 0$, $\sigma \in (0, \frac{1}{2})$ and $\mu_k$ is the same as in (7). To cater for practical computation, our Newton step seeks an inexact solution $y^k$ of (11) satisfying
$$\Theta_k(y) - \Theta_k(x^k) \le 0, \qquad (13)$$
$$\mathrm{dist}(0, \partial\Theta_k(y)) \le \frac{\min\{\mu_k^{-1}, 1\}}{2}\min\big\{\|\mu_k(x^k - \overline{x}^k)\|,\ \|\mu_k(x^k - \overline{x}^k)\|^{1+\varsigma}\big\} \qquad (14)$$
with $\varsigma \in (\sigma, 1]$. Set the direction $d^k := y^k - x^k$. A step size $\alpha_k \in (0,1]$ is found along the direction $d^k$ via backtracking, and we let $x^{k+1} := x^k + \alpha_k d^k$. To ensure global convergence, the next iteration still returns to the PG step. The details of the algorithm are given in Section 4.

The main contributions of the paper are as follows:

• Based on the dynamic programming principle, we develop a polynomial-time algorithm of complexity $O(n^{3+\epsilon})$, for any $\epsilon > 0$, for seeking a point $\overline{x}^k$ in the proximal mapping (7) of $g$ with $B = \widehat{B}$. This generalizes the corresponding result in Jewell et al. (2020) for finding $\overline{x}^k$ in (7) from $g(\cdot) = \lambda_1\|\widehat{B}\cdot\|_0$ to $g(\cdot) = \lambda_1\|\widehat{B}\cdot\|_0 + \lambda_2\|\cdot\|_0 + \delta_\Omega(\cdot)$ with $B = \widehat{B}$, and also provides the core of PG algorithms for solving (1). We also establish a uniform lower bound for $\mathrm{prox}_{\mu^{-1}g}(\cdot)$ with $\mu$ from a closed interval on a compact set. This plays a crucial role in the convergence analysis of the proposed algorithm, and generalizes the corresponding results in Lu (2014) for the $\ell_0$-norm and in Wu et al. (2023) for the $\ell_q$-norm with $0 < q < 1$, respectively.

• We design a hybrid algorithm (PGiPN) of PG and inexact projected regularized Newton methods to solve the structured $\ell_0$-norms regularization problem (1), which includes the fused $\ell_0$-norms regularization problem with a box constraint as a special case. We obtain the global convergence of the algorithm by showing that the structured stable supports (8) hold when the iteration number is sufficiently large. Moreover, we establish a superlinear convergence rate under a Hölderian error bound on a second-order stationary point set, without requiring the local optimality of the limit point.

• The numerical experiments show that our PGiPN is more effective than some existing algorithms in the literature in terms of solution quality and running time.

The rest of the paper is organized as follows. In Section 2 we recall some preliminary knowledge and characterize the stationary point condition of model (1). In Section 3, we prove the prox-regularity of $g$, characterize a uniform lower bound of the proximal mapping of $g$, and provide an algorithm for finding a point in the proximal mapping of $g$ with $B = \widehat{B}$. In Section 4, we introduce our algorithm and show that it is well defined. Section 5 is devoted to the convergence analysis of the proposed algorithm. The implementation details of our algorithm and the numerical experiments are included in Section 6.

1.4 Notation
Throughout this paper, $\mathbb{B}(x,\epsilon) := \{z \mid \|z - x\| \le \epsilon\}$ denotes the ball centered at $x$ with radius $\epsilon > 0$, and $\mathbb{B} := \mathbb{B}(0,1)$. Let $I$ and $\mathbf{1}$ be an identity matrix and a vector of all ones, respectively, whose dimensions are known from the context. For any two integers $0 \le j < k$, define $[j{:}k] := \{j, j+1, \dots, k\}$ and $[k] := [1{:}k]$. For a closed and convex set $\Xi \subset \mathbb{R}^n$, $\mathrm{ri}(\Xi)$ denotes the relative interior of $\Xi$, $\mathrm{proj}_\Xi(\cdot)$ represents the projection operator onto $\Xi$, and for a given $x \in \Xi$, $\mathcal{N}_\Xi(x)$ and $\mathcal{T}_\Xi(x)$ denote the normal cone and tangent cone of $\Xi$ at $x$, respectively. For a closed set $\Xi' \subset \mathbb{R}^n$, $\mathrm{dist}(z, \Xi') := \min_{x\in\Xi'}\|x - z\|$. For an index set $T \subset [n]$, $|T|$ means the number of elements of $T$, and we write $T^c := [n]\backslash T$. For $t \in \mathbb{R}$, $\mathrm{sign}(t)$ denotes the sign of $t$, i.e., $\mathrm{sign}(0) = 0$ and $\mathrm{sign}(t) = t/|t|$ for $t \ne 0$, and $t_+ := \max\{t, 0\}$. For a given $x \in \mathbb{R}^n$, $\mathrm{supp}(x) := \{i \in [n] \mid x_i \ne 0\}$, $\mathrm{sign}(x)$ denotes the vector with $[\mathrm{sign}(x)]_i = \mathrm{sign}(x_i)$, and $|x|_{\min} := \min_{i\in\mathrm{supp}(x)}|x_i|$. For a vector $x \in \mathbb{R}^n$ and an index set $T \subset [n]$, $x_T \in \mathbb{R}^{|T|}$ is the vector obtained by removing those $x_j$'s with $j \notin T$, and $x_{j:k}$ means $x_{[j:k]}$. Given a real symmetric matrix $H$, $\lambda_{\min}(H)$ denotes the smallest eigenvalue of $H$, and $\|H\|_2$ is the spectral norm of $H$. For a matrix $A \in \mathbb{R}^{m\times n}$ and $S \subset [m]$ (resp. $T \subset [n]$), $A_{S\cdot}$ (resp. $A_{\cdot T}$) denotes the matrix obtained by removing those rows (resp. columns) of $A$ whose indices are not in $S$ (resp. $T$). For a proper lower semicontinuous function $h: \mathbb{R}^n \to \overline{\mathbb{R}}$, its domain is denoted by $\mathrm{dom}\,h := \{x \in \mathbb{R}^n \mid h(x) < \infty\}$, and the proximal mapping of $h$ associated with a parameter $\mu > 0$ is defined as
$$\mathrm{prox}_{\mu h}(z) := \mathop{\arg\min}_{x\in\mathbb{R}^n}\Big\{\frac{1}{2\mu}\|x - z\|^2 + h(x)\Big\} \quad \forall z \in \mathbb{R}^n. \qquad (15)$$


For a nonnegative real number sequence $\{a_n\}$, $O(a_n)$ represents a sequence such that $O(a_n) \le c_1 a_n$ for some $c_1 > 0$. The symbol $\mathcal{F}: \mathbb{R}^m \rightrightarrows \mathbb{R}^n$ means that $\mathcal{F}$ is a set-valued mapping (or multifunction), i.e., its image at every point is a set.

2. Preliminaries
Note that the structured $\ell_0$-norms function is lower semicontinuous and problem (1) involves a compact box constraint, so its set of global optimal solutions is nonempty and compact. Moreover, the continuity of $\nabla^2 f$ on an open set containing $\Omega$ and the compactness of $\Omega$ imply that $\nabla f$ is Lipschitz continuous on $\Omega$, i.e., there exists $L_1 > 0$ such that
$$\|\nabla f(x) - \nabla f(y)\| \le L_1\|x - y\| \quad \text{for all } x, y \in \Omega. \qquad (16)$$
These basic facts are often used in the subsequent sections.

2.1 Stationary conditions


For an extended real-valued $h: \mathbb{R}^n \to \overline{\mathbb{R}}$ and a point $x \in \mathrm{dom}\,h$, we denote the regular subdifferential of $h$ at $x$ by $\widehat{\partial}h(x)$, and the (general) subdifferential of $h$ at $x$ by $\partial h(x)$ (Rockafellar and Wets, 2009, Definition 8.3). Now we introduce two classes of stationary points for the general composite problem (6), which includes (1) as a special case.

Definition 1 A vector $x \in \mathbb{R}^n$ is called a stationary point of problem (6) if $0 \in \partial\Psi(x)$. A vector $x \in \mathbb{R}^n$ is called an L-stationary point of problem (6) if there exists a constant $\mu > 0$ such that $x \in \mathrm{prox}_{\mu^{-1}\varphi}(x - \mu^{-1}\nabla\psi(x))$.
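A numerical check of L-stationarity in Definition 1 only needs a proximal-mapping oracle for $\varphi$; a minimal sketch, with hypothetical helper names, is:

```python
import numpy as np

def is_L_stationary(x, grad_psi, prox_phi, mu=1.0, tol=1e-8):
    """Check x ∈ prox_{mu^{-1} phi}(x - mu^{-1} grad_psi(x)) up to a tolerance.

    grad_psi : callable returning the gradient of psi at x.
    prox_phi : callable (z, t) -> a point in prox_{t*phi}(z); for nonconvex phi the
               mapping may be set-valued, and the oracle returns one of its elements.
    """
    z = x - grad_psi(x) / mu
    p = prox_phi(z, 1.0 / mu)
    return np.linalg.norm(x - p) <= tol * max(1.0, np.linalg.norm(x))
```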

Recall that $\Psi = \psi + \varphi$, where $\psi$ is twice continuously differentiable and $\varphi$ is proper and lower semicontinuous. If in addition $\varphi$ is assumed to be convex, then
$$0 \in \partial\Psi(x) \iff 0 \in \mu\big(x - (x - \mu^{-1}\nabla\psi(x))\big) + \partial\varphi(x) \iff x = \mathrm{prox}_{\mu^{-1}\varphi}\big(x - \mu^{-1}\nabla\psi(x)\big).$$
This means that $x$ is a stationary point of problem (6) if and only if $x$ is an L-stationary point. To extend this equivalence to the class of prox-regular functions, we need to recall the definition of prox-regularity, which acts as a surrogate for local convexity.

Definition 2 (Rockafellar and Wets, 2009, Definition 13.27) A function $h: \mathbb{R}^n \to \overline{\mathbb{R}}$ is prox-regular at a point $\overline{x} \in \mathrm{dom}\,h$ for $\overline{v} \in \partial h(\overline{x})$ if $h$ is locally lower semicontinuous at $\overline{x}$, and there exist $r \ge 0$ and $\varepsilon > 0$ such that $h(x') \ge h(x) + v^\top(x' - x) - \frac{r}{2}\|x' - x\|^2$ for all $\|x' - \overline{x}\| \le \varepsilon$, whenever $v \in \partial h(x)$, $\|v - \overline{v}\| < \varepsilon$, $\|x - \overline{x}\| < \varepsilon$ and $h(x) < h(\overline{x}) + \varepsilon$. If $h$ is prox-regular at $\overline{x}$ for all $\overline{v} \in \partial h(\overline{x})$, we say that $h$ is prox-regular at $\overline{x}$.

The following proposition reveals that under the prox-regularity of $\varphi$, the set of stationary points of problem (6) coincides with the set of its L-stationary points. Since the proof is similar to that in (Wu et al., 2023, Remark 2.5), the details are omitted here.

Proposition 3 If $x$ is an L-stationary point of problem (6), then $0 \in \partial\Psi(x)$. If $\varphi$ is prox-regular at $x$ for $-\nabla\psi(x)$ and prox-bounded¹, the converse is also true.

1. For the definition of prox-boundedness, see (Rockafellar and Wets, 2009, Definition 1.23).


Next we provide the stationary point conditions of problem (1) by characterizing the subdifferential of the function $F$. The closed-valuedness of the multifunction $\Pi$ in (9) is used.

Lemma 4 Consider any $z \in \Omega$. The following statements are true.
(i) $z \in \Pi(z)$, and $\widehat{\partial}g(z) = \partial g(z) = \mathcal{N}_{\Pi(z)}(z)$.
(ii) $\partial F(z) = \nabla f(z) + \partial g(z) = \nabla f(z) + \mathcal{N}_{\Pi(z)}(z)$.
(iii) For any $x \in \Omega$, $0 \in \nabla f(x) + \mathcal{N}_{\Pi(z)}(x)$ implies that $0 \in \partial F(x)$.

Proof (i) Clearly, $z \in \Pi(z)$. We first argue that $\widehat{\partial}g(z) \subset \mathcal{N}_{\Pi(z)}(z)$. Let $h(x) := \lambda_1\|Bx\|_0 + \lambda_2\|x\|_0$. Pick any $v \in \widehat{\partial}g(z)$. By invoking (Rockafellar and Wets, 2009, Definition 8.3),
$$0 \le \liminf_{z\ne y\to z}\frac{h(y) + \delta_\Omega(y) - h(z) - \delta_\Omega(z) - \langle v, y - z\rangle}{\|y - z\|}
\le \liminf_{z\ne y\to z,\ y\in\Pi(z)}\frac{h(y) + \delta_\Omega(y) - h(z) - \delta_\Omega(z) - \langle v, y - z\rangle}{\|y - z\|}
= \liminf_{z\ne y\to z,\ y\in\Pi(z)}\frac{-\langle v, y - z\rangle}{\|y - z\|}
= -\limsup_{z\ne y\to z,\ y\in\Pi(z)}\frac{\langle v, y - z\rangle}{\|y - z\|},$$

which by (Rockafellar and Wets, 2009, Definition 6.3) implies that v ∈ N


bΠ(z) (z). From the
arbitrariness of v ∈ ∂g(z)
b and the convexity of Π(z), we conclude that ∂g(z)
b ⊂ NΠ(z) (z).
g
Next we prove that ∂g(z) ⊂ NΠ(z) (z). Pick any v ∈ ∂g(z), there exists z k → − z and
k k k k
v ∈ ∂g(z ) with v → v as k → ∞. As supp(Bz ) ⊃ supp(Bz) and supp(z ) ⊃ supp(z)
b k
g
as k → ∞, we deduce from z k → − z that Π(z k ) = Π(z). Therefore, v k ∈ NΠ(z k ) (z k ) and
v ∈ NΠ(z) (z), which yields the desired inclusion.
Let h1 (x) := λ1 kBxk0 + δΩ (x) and h2 (x) := λ2 kxk0 for x ∈ Rn . From (Pan et al., 2023,
b 1 (z) = Range(B >
Lemma 2.2 (iii)), ∂h1 (z) = ∂h [supp(z)]c · ) + NΩ (z) and ∂h2 (z) = ∂h2 (z) =
b
n c

v ∈ R | supp(v) ⊂ [supp(z)] . As g = h1 + h2 , by the definition of regular subdifferential,

∂h1 (z) + ∂h2 (z) = ∂h b 2 (z) ⊂ ∂g(z)


b 1 (z) + ∂h b ⊂ ∂g(z) ⊂ NΠ(z) (z).

Let Π1 (z) := x ∈ Rn | supp(Bx) ⊂ supp(Bz) and Π2 (z) := x ∈ Rn | supp(x) ⊂ supp(z) .


 
>
Observe that Π1 (z) and Π2 (z) are the subspaces with NΠ1 (z) (z) = Range(B[supp(Bz)]c · ) and
n c

NΠ2 (z) (z) = v ∈ R | supp(v) ⊂ [supp(z)] . Along with the above arguments, we have

NΠ1 (z) (z) + NΠ2 (z) (z) + NΩ (z) = ∂h1 (z) + ∂h2 (z) ⊂ ∂g(z)
b ⊂ ∂g(z) ⊂ NΠ(z) (z).

Since Π(z) = Ω ∩ Π1 (z) ∩ Π2 (z) and z ∈ Π(z), by (Rockafellar, 1970, Theorem 23.8),
NΠ(z) (z) = NΩ (z) + NΠ1 (z) (z) + NΠ2 (z) (z). Thus, the desired conclusion holds.
(ii)-(iii) The first equality of part (ii) follows by (Rockafellar and Wets, 2009, Exercise
8.8), and the second one is implied by part (i). Next we consider part (iii). Suppose that
0 ∈ ∇f (x) + NΠ(z) (x). Obviously, x ∈ Π(z). From the definition of Π(·), we have Π(x) ⊂
Π(z), which along with their convexity and x ∈ Π(x) implies that NΠ(z) (x) ⊂ NΠ(x) (x).
Combining with part (ii) leads to the desired result.


Remark 5 Lemma 4 (ii) provides a way to seek a stationary point of $F$. Indeed, for any given $z \in \Omega$, if $x$ is a stationary point of the problem $\min_{y\in\mathbb{R}^n}\{f(y) \mid y \in \Pi(z)\}$, i.e., $0 \in \nabla f(x) + \mathcal{N}_{\Pi(z)}(x)$, then by Lemma 4 (iii) it necessarily satisfies $0 \in \partial F(x)$. This implication will be utilized in the design of our algorithm: when a good estimate of a stationary point, say $\overline{x}^k$, is obtained, we run a Newton step to minimize $f$ over the polyhedral set $\Pi(\overline{x}^k)$ so as to enhance the speed of the algorithm.

2.2 Kurdyka-Łojasiewicz property

Next we introduce the Kurdyka-Łojasiewicz (KL) property of an extended real-valued function, which plays an important role in the convergence analysis of first-order algorithms for nonconvex and nonsmooth optimization problems (see, e.g., Attouch et al. (2010, 2013)). In this work, we will use it to establish the global convergence property of our algorithm.

Definition 6 For any given $\eta > 0$, we denote by $\Upsilon_\eta$ the set consisting of all continuous concave functions $\varphi: [0,\eta) \to \mathbb{R}_+$ that are continuously differentiable on $(0,\eta)$ with $\varphi(0) = 0$ and $\varphi'(s) > 0$ for all $s \in (0,\eta)$. A proper function $h: \mathbb{R}^n \to \overline{\mathbb{R}}$ is said to have the KL property at $\overline{x} \in \mathrm{dom}\,\partial h$ if there exist $\eta \in (0,\infty]$, a neighborhood $\mathcal{U}$ of $\overline{x}$ and a function $\varphi \in \Upsilon_\eta$ such that for all $x \in \mathcal{U} \cap \{x' \mid h(\overline{x}) < h(x') < h(\overline{x}) + \eta\}$, $\varphi'(h(x) - h(\overline{x}))\,\mathrm{dist}(0, \partial h(x)) \ge 1$. If $h$ has the KL property at each point of $\mathrm{dom}\,\partial h$, then $h$ is called a KL function.

The KL property is ubiquitous, and the functions definable in an o-minimal structure over the real field admit this property; see (Attouch et al., 2010, Theorem 4.1). The functions definable in an o-minimal structure cover a wide range of functions, such as semialgebraic functions and globally subanalytic functions; see (Van den Dries and Miller, 1996, Example 2.5). Moreover, from (Attouch et al., 2010, Section 4), we know that definable sets and functions are closed under some common calculus rules in optimization; for example, finite unions or finite intersections of definable sets are definable, compositions of definable mappings are definable, and subdifferentials of definable functions are definable.

3. Prox-regularity and proximal mapping of g


3.1 Prox-regularity of g
In this subsection, we aim at proving the prox-regularity of $g$, which together with Proposition 3 and the prox-boundedness of $g$ indicates that the set of stationary points of problem (1) coincides with the set of its L-stationary points.
We remark here that the prox-regularity of $g$ cannot be obtained from the existing chain calculus of prox-regularity. It was revealed in (Poliquin and Rockafellar, 2010, Theorem 3.2) that, for proper $f_i$, $i = 1, 2$, with $f_i$ being prox-regular at $x$ for $v_i \in \partial f_i(x)$, by letting $v := v_1 + v_2$ and $f_0 := f_1 + f_2$, a sufficient condition for $f_0$ to be prox-regular at $x$ for $v$ is
$$w_1 + w_2 = 0 \ \text{ with } w_i \in \partial^\infty f_i(x) \ \Longrightarrow\ w_i = 0,\ i = 1, 2, \qquad (17)$$
where $\partial^\infty$ denotes the horizon subdifferential (Rockafellar and Wets, 2009, Definition 8.3). We give a counterexample to illustrate that the above constraint qualification does not hold for $f_i: \mathbb{R}^4 \to \mathbb{R}$ with $f_1 = \|\widehat{B}\cdot\|_0$ and $f_2 = \|\cdot\|_0$. Let $x = (0, 0, 0, 1)^\top$. Then,
$$\partial^\infty f_1(x) = \partial f_1(x) = \mathrm{Range}\big((\widehat{B}_{[2]\cdot})^\top\big), \qquad \partial^\infty f_2(x) = \partial f_2(x) = \mathrm{Range}\big((I_{[3]\cdot})^\top\big).$$


By the expressions of $\partial^\infty f_1(x)$ and $\partial^\infty f_2(x)$, it is immediate to check that the constraint qualification in (17) does not hold. Next, we give our proof of the prox-regularity of $g$.

Lemma 7 The function $g$ is prox-bounded and prox-regular on its domain $\Omega$, so the set of stationary points of model (1) coincides with its set of L-stationary points.

Proof The prox-boundedness of g is immediate by (Rockafellar and Wets, 2009, Definition


1.23). It suffices to prove that g is prox-regular on Ω. Fix any x ∈ Ω and pick any v ∈ ∂g(x).
λ
Let λ := min{λ1 , λ2 } and C := [B; I]. Pick any ε ∈ (0, min{λ, 3(kvk+λ) }) such that for all
x ∈ B(x, ε), supp(Cx) ⊃ supp(Cx). We will prove that

g(x0 ) ≥ g(x) + v > (x0 − x), for all kx0 − xk ≤ ε, v ∈ ∂g(x), kv − vk < ε and x ∈ Ξ (18)

with Ξ := {x | kx − xk < ε, g(x) < g(x) + ε}, so the function g is prox-regular at x for v.
We first claim that for each x ∈ Ξ, supp(Cx) = supp(Cx) and x ∈ Ω. In fact, by the
definition of ε, supp(Cx) ⊃ supp(Cx). If supp(Cx) 6= supp(Cx), we have g(x) ≥ g(x) + λ >
g(x) + ε, which yields that x ∈/ Ξ. Therefore, supp(Cx) = supp(Cx). The fact that x ∈ Ξ
implies x ∈ Ω is clear. Hence the claimed facts are true.
Fix any x ∈ Ξ. Consider any x0 ∈ B(x, ε). If x0 ∈ / Ω, since g(x0 ) = ∞, it is immediate
to see that (18) holds, so it suffices to consider x0 ∈ B(x, ε) ∩ Ω. Note that supp(Cx0 ) ⊃
supp(Cx) = supp(Cx). If supp(Cx0 ) 6= supp(Cx), then g(x0 ) ≥ g(x) + λ. For any v ∈ ∂g(x)
with v ∈ B(v, ε), kvk ≤ kvk+ε ≤ kvk+λ, which along with kx0 −xk ≤ kx0 −xk+kx−xk ≤ 2ε
implies that kvkkx0 − xk ≤ (kvk + λ) 3(kvk+λ)

≤ 2λ
3 , and hence

g(x0 ) − g(x) − v > (x0 − x) ≥ λ − kvkkx0 − xk > 0.

Equation (18) holds. Next we consider the case supp(Cx0 ) = supp(Cx). Define

Π1 (x) := z ∈ Rn | (Bz)[supp(Bx)]c = 0 , Π2 (x) := z ∈ Rn | z[supp(x)]c = 0 .


 

Clearly, Π(x) = Π1 (x) ∩ Π2 (x) ∩ Ω and Π1 (x), Π2 (x) and Ω are all polyhedral sets. By
(Rockafellar, 1970, Theorem 23.8), for any v ∈ NΠ(x) (x) = ∂g(x), there exist v1 ∈ NΠ1 (x) (x),
v2 ∈ NΠ2 (x) (x) and v3 ∈ NΩ (x) such that v = v1 + v2 + v3 . Then,

g(x0 ) − g(x)−v > (x0 − x) = λ1 kBx0 k0 − λ1 kBxk0 − v1> (x0 − x)


+ λ2 kx0 k0 − λ2 kxk0 − v2> (x0 − x) − v3> (x0 − x) ≥ 0,

where the inequality follows from λ1 kBx0 k0 − λ1 kBxk0 = 0, v1> (x0 − x) = 0, λ2 kx0 k0 −
λ2 kxk0 = 0, v2> (x0 − x) = 0 and v3> (x0 − x) ≤ 0. Equation (18) is true. Thus, by the
arbitrariness of x ∈ Ω and v ∈ ∂g(x), we conclude that g is prox-regular on set Ω.

3.2 Lower bound of the proximal mapping of g



Given $\lambda > 0$ and $x \in \mathbb{R}^n$, any $z \in \mathrm{prox}_{\lambda\|\cdot\|_0}(x)$ satisfies $|z_i| \ge \sqrt{2\lambda}$ for $i \in \mathrm{supp}(z)$ (Lu, 2014, Lemma 3.3). This indicates that $|z|_{\min}$ with $z \in \mathrm{prox}_{\lambda\|\cdot\|_0}(x)$ has a uniform lower bound. Such a uniform lower bound was shown to hold for the proximal mapping of the $\ell_q$-norm with $0 < q < 1$ and played a crucial role in the convergence analysis of algorithms involving a subspace Newton method (see Wu et al. (2023)). Next, we show that such a uniform lower bound exists for the proximal mapping of $g$.
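For intuition, the proximal mapping of $\lambda\|\cdot\|_0$ is componentwise hard thresholding at $\sqrt{2\lambda}$, which is exactly why its nonzero outputs are bounded below in magnitude; a minimal NumPy sketch:

```python
import numpy as np

def prox_l0(x, lam):
    """One element of prox_{lam*||.||_0}(x): keep x_i if 0.5*x_i^2 > lam, else set it to 0.
    Nonzero outputs therefore satisfy |z_i| >= sqrt(2*lam)."""
    z = x.copy()
    z[np.abs(x) < np.sqrt(2.0 * lam)] = 0.0
    return z

x = np.array([0.05, -0.3, 1.2, -0.19])
z = prox_l0(x, lam=0.02)                      # threshold sqrt(0.04) = 0.2
assert np.all((z == 0) | (np.abs(z) >= np.sqrt(2 * 0.02)))
```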

Lemma 8 For any given compact set $\Xi \subset \mathbb{R}^n$ and constants $0 < \underline{\mu} < \overline{\mu}$, define
$$\mathcal{Z} := \bigcup_{z\in\Xi,\ \mu\in[\underline{\mu},\overline{\mu}]} \mathrm{prox}_{\mu^{-1}g}(z).$$
Then, there exists $\nu > 0$ (depending on $\Xi$, $\underline{\mu}$ and $\overline{\mu}$) such that $\inf_{u\in\mathcal{Z}\backslash\{0\}}|[B; I]u|_{\min} \ge \nu$.

Proof Let C := [B; I]. By invoking (Bauschke et al., 1999, Corollary 3) and the compact-
ness of Ω, there exists κ > 0 such that for all index set J ⊂ [n+p],

dist(x, Null(CJ· ) ∩ Ω) ≤ κdist(x, Null(CJ· )) for any x ∈ Ω. (19)

Since the index sets J ⊂ [n + p] are finite, there exists σ > 0 such that for any index set
J ⊂ [n+p] with CJ· having full row rank,
>
λmin (CJ· CJ· ) ≥ σ. (20)

For any z ∈ Ξ and µ ∈ [µ, µ], define hz,µ (x) := µ2 kx − zk2 for x ∈ Rn . By the compactness
of Ω, [µ, µ] and Ξ, there exists δ0 ∈ (0, 1) such that for all z ∈ Ξ, µ ∈ [µ, µ] and x, y ∈ Ω
with kx − yk < δ0 , µ(kxk + kyk + 2kzk)kx − yk < λ := min{λ1 , λ2 }, and consequently,
µ µ λ
|hz,µ (x) − hz,µ (y)| = |hx − y, x + y − 2zi| ≤ (kxk + kyk + 2kzk)kx − yk < . (21)
2 2 2
Now suppose on the contrary that the conclusion does not hold. Then there is a sequence
1
{z k } k
k∈N ⊂ Z\{0} such that |Cz |min ≤ k for all k ∈ N. Note that C has a full column
k
rank. We also have |Cz |min > 0 for each k ∈ N. By the definition of Z, for each k ∈
k k k k 1
 N,
there exist z ∈ Ξ and µk ∈ [µ, µ] such that z ∈ proxµ−1 g (z ). Since |Cz |min ∈ 0, k for
k
all k ∈ N, there exists an infinite index set K ⊂ N and an index i ∈ [n+p] such that
δ0 σ
0 < |(Cz k )i | = |Cz k |min < for each k ∈ K, (22)
κkCk2
where κ and σ are the ones appearing in (19) and (20), respectively. Fix any k ∈ K.
Write Qk := [n + p]\supp(Cz k ) and choose Jk ⊂ Qk such that the rows of CJk · form a
basis of those of CQk · . Let Jbk := Jk ∪ {i}. Obviously, kCJbk · z k k = |(Cz k )i |. We claim that
CJbk · also has a full row rank. Indeed, if Jk = ∅, then CJbk · has a full row rank because
CJbk · 6= 0 by (22); if Jk 6= ∅, then CJk · z k = 0, which implies that CJbk · also has a full row
rank (if not, Ci· is a linear combination of the rows of CJk · , which along with CJk · z k = 0
implies that Ci· z k = 0, contradicting to |(Cz k )i | = |Cz k |min > 0). The claimed fact holds.
Let zek := projNull(C b ) (z k ). Then, CJbk · zek = 0, and by the optimality condition of the
Jk ·

projection problem, there exists ξ k ∈ R|Jk | such that z k − zek = C > k


b ξ . Since CJbk · has a full
b
Jk ·
row rank and kCJbk · z k k = |(Cz k )i |, we have

|(Cz k )i | = kCJbk · z k − CJbk · zek k = kCJbk · CJ> k k


b · ξ k ≥ σkξ k, (23)
k


where the inequality is due to (20). Combining (23) with (22) yields kξ k k < κ−1 kCk−1
2 δ0 .
Therefore,
kz k − zek k = kCJ> k k k −1
b · ξ k ≤ kCJbk · k2 kξ k ≤ kCk2 kξ k < κ δ0 . (24)
k

Let zbk := projNull(C b )∩Ω (z


k ). From (19) and (24), it follows that
Jk ·

kz k − zbk k = dist(z k , Null(CJbk · ) ∩ Ω) ≤ κdist(z k , Null(CJbk · )) = κkz k − zek k < δ0 . (25)

Note that zbk , z k ∈ Ω. From (25) and (21), it follows that

λ
z k ) − hz k ,µk (z k )| <
|hz k ,µk (b . (26)
2

Next we claim that supp(C zbk ) ∪ {i} ⊂ supp(Cz k ). Indeed, since the rows of CJbk · form a
basis of those of C[Qk ∪{i}]· and CJbk · zbk = 0, C[Qk ∪{i}]· zbk = 0. Then, supp(C[Qk ∪{i}]· zbk ) ∪
{i} = supp(C[Qk ∪{i}]· z k ). Since all the entries of C[Qk ∪{i}]c · z k are nonzero, it holds that
supp(C[Qk ∪{i}]c · zbk ) ⊂ supp(C[Qk ∪{i}]c · z k ), which implies that supp(C zbk ) ∪ {i} ⊂ supp(Cz k ).
Thus, the claimed inclusion follows, which implies that g(z k ) − g(b z k ) ≥ λ. This together
k k
with (26) yields hz k ,µk (z ) + g(z ) − (hz k ,µk (b k
z ) + g(bz )) ≥ λ − 2 = λ2 , contradicting to
k λ

z k ∈ proxµ−1 g (z k ). The proof is completed.


k

The result of Lemma 8 will be utilized in Proposition 14 to justify the fact that the sequences $\{|B\overline{x}^k|_{\min}\}_{k\in\mathbb{N}}$ and $\{|\overline{x}^k|_{\min}\}_{k\in\mathbb{N}}$ are uniformly lower bounded, where $\overline{x}^k$ is obtained in (7) (or (38) below). This is a crucial aspect in proving the stability of $\mathrm{supp}(\overline{x}^k)$ and $\mathrm{supp}(B\overline{x}^k)$ when $k$ is sufficiently large.

3.3 Proximal mapping of a fused $\ell_0$-norms function with a box constraint

The characterization of the proximal mapping of the fused $\ell_0$-norm $\lambda_1\|\widehat{B}\cdot\|_0$ can be traced back to Liebscher and Winkler (1999), where the problem is addressed by using the technique of optimal partitioning of changepoints. For later developments of this technique, please refer to Jackson et al. (2005); Friedrich et al. (2008); Killick et al. (2012); Weinmann et al. (2015); Jewell and Witten (2018). Recently, by using the functional pruning technique introduced in Rigaill (2015) and Maidstone et al. (2017), Jewell et al. (2020) presented a polynomial-time algorithm for computing the proximal mapping of $\lambda_1\|\widehat{B}\cdot\|_0$. Numerical experiments show that this method is more efficient than the one proposed in Jewell and Witten (2018); see also the arguments in (Jewell et al., 2020, Section 2.2). In this subsection, we extend the functional pruning technique to compute the proximal mapping of the fused $\ell_0$-norms function $\lambda_1\|\widehat{B}\cdot\|_0 + \lambda_2\|\cdot\|_0 + \delta_\Omega(\cdot)$, i.e., for any given $z \in \mathbb{R}^n$ ($n \ge 2$), to seek a global optimal solution of the problem
$$\min_{x\in\mathbb{R}^n} h(x; z) := \frac{1}{2}\|x - z\|^2 + \lambda_1\|\widehat{B}x\|_0 + \lambda_2\|x\|_0 + \delta_\Omega(x). \qquad (27)$$
To simplify the deduction, for each $i \in [n]$, define $\omega_i(\alpha) := \lambda_2|\alpha|_0 + \delta_{[l_i,u_i]}(\alpha)$ for $\alpha \in \mathbb{R}$. Clearly, $\lambda_2\|x\|_0 + \delta_\Omega(x) = \sum_{i=1}^n \omega_i(x_i)$ for $x \in \mathbb{R}^n$. Let $H(0) := -\lambda_1$, and for each $s \in [n]$,


define
$$H(s) := \min_{y\in\mathbb{R}^s} h_s(y; z_{1:s}) := \frac{1}{2}\|y - z_{1:s}\|^2 + \lambda_1\|\widehat{B}_{[s-1][s]}\,y\|_0 + \sum_{j=1}^s \omega_j(y_j) \qquad (28)$$
with $\widehat{B}_{[0][1]} := 0$. It is immediate to see that $H(n)$ is the optimal value of (27). For each $s \in [n]$, define the function $P_s: [0{:}s{-}1] \times \mathbb{R} \to \overline{\mathbb{R}}$ by
$$P_s(i,\alpha) := H(i) + \frac{1}{2}\|\alpha\mathbf{1} - z_{i+1:s}\|^2 + \sum_{j=i+1}^s \omega_j(\alpha) + \lambda_1. \qquad (29)$$

For each $s \in [n]$, there is a close relation between $P_s$ and $h_s$. Indeed, for any given $y \in \mathbb{R}^s$ with $y_s = \alpha$, let $i$ be the smallest integer in $[0{:}s{-}1]$ such that $y_{i+1} = \cdots = y_s = \alpha$. When $i = 0$, $P_s(i,\alpha) = \frac{1}{2}\|y_{1:s} - z_{1:s}\|^2 + \sum_{j=1}^s \omega_j(y_j) = h_s(y; z_{1:s})$. When $i \ne 0$, if $y_{1:i}$ is optimal to $\min_{y'\in\mathbb{R}^i} h_i(y'; z_{1:i})$, then by noting that $y = (y_{1:i}; \alpha\mathbf{1})$ and
$$\begin{aligned}
h_s(y; z_{1:s}) &= \frac{1}{2}\|y_{1:i} - z_{1:i}\|^2 + \lambda_1\|\widehat{B}_{[i-1][i]}\,y_{1:i}\|_0 + \sum_{j=1}^i \omega_j(y_j)\\
&\quad + \frac{1}{2}\|y_{i+1:s} - z_{i+1:s}\|^2 + \sum_{j=i+1}^s \omega_j(y_j) + \lambda_1 \qquad (30)\\
&= h_i(y_{1:i}; z_{1:i}) + \frac{1}{2}\|\alpha\mathbf{1} - z_{i+1:s}\|^2 + \sum_{j=i+1}^s \omega_j(\alpha) + \lambda_1,
\end{aligned}$$
we get $H(i) = h_i(y_{1:i}; z_{1:i})$. Along with the above equality and (29), $h_s(y; z_{1:s}) = P_s(i,\alpha)$.
In the following lemma, we prove that the optimal value of $\min_{i\in[0:s-1],\,\alpha\in\mathbb{R}} P_s(i,\alpha)$ is equal to $H(s)$, and apply this result to characterize a global minimizer of $h_s(\cdot; z_{1:s})$.

Lemma 9 Fix any $s \in [n]$. The following statements are true.
(i) $H(s) = \min_{i\in[0:s-1],\,\alpha\in\mathbb{R}} P_s(i,\alpha)$.
(ii) If $(i_s^*, \alpha_s^*) \in \mathop{\arg\min}_{i\in[0:s-1],\,\alpha\in\mathbb{R}} P_s(i,\alpha)$, then $y^* = (y^*_{1:i_s^*}; \alpha_s^*\mathbf{1})$ with $y^*_{1:i_s^*} \in \mathop{\arg\min}_{v\in\mathbb{R}^{i_s^*}} h_{i_s^*}(v; z_{1:i_s^*})$ is a global optimal solution of the minimization problem $\min_{y\in\mathbb{R}^s} h_s(y; z_{1:s})$.
Proof (i) Let y ∗ be an optimal solution of problem (28). If yi∗ = yj∗ for any i, j ∈ [s], let
i∗s = 0; otherwise, let i∗s ∈ [s−1] be the largest integer such that yi∗∗s 6= yi∗∗s +1 . Set αs∗ = yi∗∗s +1 .
If i∗s 6= 0, from the definition of H(·), hi∗s (y1:i
∗ ; z ∗ ) ≥ H(i∗ ), which implies that

s
1:is s

s
1 X
min Ps (i, α) ≤ H(i∗s ) + kαs∗ 1 − zi∗s +1:s k2 + ωj (αs∗ ) + λ1
i∈[0:s−1],α∈R 2 ∗ j=is +1
s
∗ 1 ∗ X
≤ hi∗s (y1:i ∗ ; z1:i∗ )
s
2
+ kyi∗s +1:s − zi∗s +1:s k + ωj (yj∗ ) + λ1
s 2 ∗ j=is +1

= hs (y ; z1:s ) = H(s),


where the first equality is due to yi∗∗s +1 6= yi∗∗s and the expression of hs (y ∗ ; z1:s ) by (30). If
i∗s = 0,

s
1 X
min Ps (i, α) ≤ H(0) + ky ∗ − z1:s k2 + ωj (yj∗ ) + λ1 = H(s).
i∈[0:s−1],α∈R 2
j=1

Therefore, mini∈[0:s−1],α∈R Ps (i, α) ≤ H(s) holds. On the other hand, let (i∗s , αs∗ ) be an
optimal solution to mini∈[0:s−1],α∈R Ps (i, α). If i∗s 6= 0, let y ∗ ∈ Rs be such that y1:i

∗ ∈
s
∗ ∗
arg minv∈Ri∗s hi∗s (v; z1:i∗s ) and yi∗s +1:s = αs 1. Then, it is clear that

s
∗ ∗ 1 ∗ X
H(s) ≤ hs (y ; z1:s ) ≤ hi∗s (y1:i ∗ ; z1:i∗ )
s
2
+ kyi∗s +1:s − zi∗s +1:s k + ωj (yj∗ ) + λ1
s 2 ∗ j=is +1
s
1 X
= H(i∗s ) + kαs∗ 1 − zi∗s +1:s k2 + ωj (αs∗ ) + λ1 = min Ps (i, α).
2 i∈[0:s−1],α∈R
j=i∗s +1

If i∗s = 0, let y ∗ = αs∗ 1. We have

s
1 X
H(s) ≤ hs (y ∗ ; z1:s ) = H(0) + ky ∗ − z1:s k2 + ωj (αs∗ ) + λ1 = min Ps (i, α).
2 i∈[0:s−1],α∈R
j=1

Therefore, H(s) ≤ mini∈[0:s−1],α∈R Ps (i, α). The above two inequalities imply the result.
(ii) If i∗s 6= 0, by part (i) and the definitions of αs∗ and i∗s , it holds that

s
1 ∗ X
H(s) = min Ps (i, α) = H(i∗s ) 2
+ kαs 1 − zi∗s +1:s k + ωj (αs∗ ) + λ1
i∈[0:s−1],α∈R 2 ∗ j=is +1
s
∗ 1 ∗ X
= hi∗s (y1:i ∗ ; z1:i∗ ) + ky ∗ − zi∗s +1:s k2 + ωj (yj∗ ) + λ1 ≥ hs (y ∗ ; z1:s ),
s s
2 is +1:s
j=i∗s +1

where the last inequality follows by (30). If i∗s = 0,

H(s) = min Ps (i, α) = Ps (0, α) = hs (y ∗ ; z1:s ).


i∈[0:s−1],α∈R

Therefore, H(s) ≥ hs (y ∗ ; z1:s ). Along with the definition of H(s), H(s) = hs (y ∗ ; z1:s ).

Lemma 9 (i) implies that the nonconvex and nonsmooth problem (27) can be recast as a mixed-integer program with the objective function given in (29). Lemma 9 (ii) suggests a recursive method to obtain an optimal solution to (27). In fact, by setting $s = n$, there exists an optimal solution to (27), say $x^*$, such that $x^*_{i_n^*+1:n} = \alpha_n^*\mathbf{1}$ and $x^*_{1:i_n^*} \in \mathop{\arg\min}_{v\in\mathbb{R}^{i_n^*}} h_{i_n^*}(v; z_{1:i_n^*})$. Next, by setting $s = i_n^*$, we are able to obtain the expression of $x^*_{i_s^*+1:i_n^*}$. Repeating this loop backward until $s = 0$, we obtain the full expression of an optimal solution to (27). The outline of computing $\mathrm{prox}_{\lambda_1\|\widehat{B}\cdot\|_0+\omega(\cdot)}(z)$ is as follows:

    Set $s = n$.
    While $s > 0$ do
        Find $(i_s^*, \alpha_s^*) \in \arg\min_{i\in[0:s-1],\,\alpha\in\mathbb{R}} P_s(i,\alpha)$.     (31)
        Let $x^*_{i_s^*+1:s} = \alpha_s^*\mathbf{1}$ and $s \leftarrow i_s^*$.
    End

To obtain an optimal solution to (27), the remaining issue is how to execute the first line in the while loop of (31), or in other words, for any given $s \in [n]$, how to find the pair $(i_s^*, \alpha_s^*) \in \mathbb{N} \times \mathbb{R}$ appearing in Lemma 9 (ii). The following proposition provides some preparations.

Proposition 10 For each $s \in [n]$, let $P_s^*(\alpha) := \min_{i\in[0:s-1]} P_s(i,\alpha)$.

(i) For any $\alpha \in \mathbb{R}$, it holds that
$$P_s^*(\alpha) = \begin{cases} \frac{1}{2}(\alpha - z_1)^2 + \omega_1(\alpha) & \text{if } s = 1,\\[4pt] \min\big\{P_{s-1}^*(\alpha),\ \min_{\alpha'\in\mathbb{R}} P_{s-1}^*(\alpha') + \lambda_1\big\} + \frac{1}{2}(\alpha - z_s)^2 + \omega_s(\alpha) & \text{if } s \in [2:n]. \end{cases}$$

(ii) Let $R_1^0 := \mathbb{R}$. For each $s \in [2:n]$ and $i \in [0:s-2]$, let $R_s^i := R_{s-1}^i \cap (R_s^{s-1})^c$ with
$$R_s^{s-1} := \Big\{\alpha \in \mathbb{R}\ \Big|\ P_{s-1}^*(\alpha) \ge \min_{\alpha'\in\mathbb{R}} P_{s-1}^*(\alpha') + \lambda_1\Big\}. \qquad (32)$$
Then, the following assertions hold true.

(a) For each $s \in [2:n]$, $\bigcup_{i\in[0:s-1]} R_s^i = \mathbb{R}$ and $R_s^i \cap R_s^j = \emptyset$ for any $i \ne j \in [0:s-1]$.

(b) For each $s \in [n]$ and $i \in [0:s-1]$, $P_s^*(\alpha) = P_s(i,\alpha)$ when $\alpha \in R_s^i$.

Proof (i) Fix any α ∈ R. Note that P1∗ (α) = P1 (0, α) = H(0) + 21 (α − z1 )2 + ω1 (α) + λ1 =
1 2 ∗
2 (α − z1 ) + ω1 (α). Now pick any s ∈ [2 : n]. By the definition of Ps , we have
n o
Ps∗ (α) = min Ps (i, α) = min min Ps (i, α), Ps (s−1, α) . (33)
i∈[0:s−1] i∈[0:s−2]

From the definition of Ps in (29), for each i ∈ [0 : s−2], it holds that


s
1 X
Ps (i, α) = H(i) + kα1 − zi+1:s k2 + ωj (α) + λ1
2
j=i+1
s−1
1 2
X 1
= H(i) + kα1−zi+1:s−1 k + ωj (α) + λ1 + (α−zs )2 + ωs (α)
2 2
j=i+1
1
= Ps−1 (i, α) + (α − zs )2 + ωs (α),
2


while Ps (s−1, α) = H(s−1) + 21 (α − zs )2 + ωs (α) + λ1 . Together with the above equality


and (33), we immediately obtain that
n o 1
Ps∗ (α) = min min Ps−1 (i, α), H(s−1) + λ1 + (α − zs )2 + ωs (α)
i∈[0:s−2] 2
n o 1 (34)
∗ ∗ 0 2
= min Ps−1 (α), min Ps−1 (α ) + λ1 + (α − zs ) + ωs (α),
α0 ∈R 2
where the second equality is by Lemma 9 (i) and the definition of Ps−1∗ . Thus, we get the

desired result.
(ii) We first prove (a) by induction. When s = 2, since R01 = R and R02 = R01 ∩ (R12 )c , we
have R02 ∪ R12 = R and R02 ∩ R12 = ∅. Assume that the result holds with s = j for some
j ∈ [2 : n−1]. We prove that the result holds for s = j +1. Since Rij+1 = Rij ∩ (Rjj+1 )c for
all i ∈ [0 : j −1] and i∈[0:j−1] Rij = R, it holds that
S

i i ∩ (Rjj+1 )c ) ∪ Rjj+1 = R ∩ (Rjj+1 )c ∪ Rjj+1 = R.


S S  
i∈[0:j] Rj+1 = i∈[0:j−1] (Rj

The first part holds. For the second part, by definition, Rij+1 ∩ Rjj+1 = ∅ for all i ∈ [0 : j−1],
so it suffices to prove that Rij+1 ∩ Rkj+1 = ∅ for any i 6= k ∈ [0 : j −1]. By definition,

Rij+1 ∩ Rkj+1 = Rij ∩ (Rjj+1 )c ∩ Rkj ∩ (Rjj+1 )c = ∅,


   

where the second equality is due to Rij ∩ Rkj = ∅. Thus, the second part follows.
Next we prove (b). When s = 1, since for any α ∈ R = R01 , P1∗ (α) = P1 (0, α), the result
holds. For s ∈ [2 : n] and i = s−1, by the definition of Rs−1
s and part (i), for all α ∈ Rs−1
s ,

1
Ps∗ (α) = min ∗
Ps−1 (α0 ) + λ1 + (α − zs )2 + ωs (α) = Ps (s − 1, α),
α0 ∈R 2
where the second equality is obtained by using H(s) = minα0 ∈R Ps−1 ∗ (α0 ) and (29). Next we

consider s ∈ [2 : n] and i ∈ [0 : s−2]. We argue by induction that Ps∗ (α) = Ps (i, α) when
α ∈ Ris . Indeed, when s = 2, since R02 = R01 ∩ (R12 )c = (R12 )c , for any α ∈ R02 , from (32) we
have P1∗ (α) < minα0 ∈R P1∗ (α0 ) + λ1 , which by part (i) implies that
1 1
P2∗ (α) = P1∗ (α) + (α − z2 )2 + ω2 (α) = P1 (0, α) + (α − z2 )2 + ω2 (α) = P2 (0, α).
2 2
Assume that the result holds when s = j for some j ∈ [2 : n−1]. We consider the case for
s = j +1. For any i ∈ [0 : j −1], from Rij+1 = Rij ∩ (Rjj+1 )c and (34), for any α ∈ Rij+1 ,

∗ 1 1
Pj+1 (α) = Pj∗ (α) + (α − zj+1 )2 + ωj+1 (α) = Pj (i, α) + (α − zj+1 )2 + ωj+1 (α)
2 2
j
1 X 1
= H(i) + kα1 − zi+1:j k2 + wk (α) + λ1 + (α−zj+1 )2 + ωj+1 (α)
2 2
k=i+1
j+1
1 X
= H(i) + kα1 − zi+1:j+1 k2 + wk (α) + λ1 = Pj+1 (i, α),
2
k=i+1


where the second equality is using Pj∗ (α) = Pj (i, α) implied by induction. Hence, the con-
clusion holds for s = j + 1 and any i ∈ [0 : s−2]. The proof is completed.

Now we take a closer look at Proposition 10. Part (i) provides a recursive method to compute $P_s^*(\alpha)$ for all $s \in [n]$. For each $s \in [n]$, by the expression of $\omega_s$, $P_s(i,\cdot)$ is a piecewise lower semicontinuous linear-quadratic function whose domain is a closed interval, relative to which $P_s(i,\cdot)$ has an expression of the form $H(i) + \frac{1}{2}\|\alpha\mathbf{1} - z_{i+1:s}\|^2 + (s-i)\lambda_2|\alpha|_0 + \lambda_1$. Moreover, $P_s^*(\cdot) = \min\{P_s(0,\cdot), P_s(1,\cdot), \ldots, P_s(s-1,\cdot)\}$, and for each $i \in [0:s-1]$, the optimal solution to $\min_{\alpha\in\mathbb{R}} P_s(i,\alpha)$ is easily obtained (in fact, all the possible candidates for the global solutions are $0$, $\frac{\sum_{j=i+1}^s z_j}{s-i}$, $\max_{j\in[i+1:s]}\{l_j\}$ and $\min_{j\in[i+1:s]}\{u_j\}$), and so is $\mathop{\arg\min}_{\alpha'\in\mathbb{R}} P_s^*(\alpha')$. Part (ii) suggests a way to search for $i_s^*$ such that $P_s^*(\alpha_s^*) = P_s(i_s^*, \alpha_s^*)$ for each $s \in [n]$. Obviously, $P_s(i_s^*, \alpha_s^*) = \min_{i\in[0:s-1],\,\alpha\in\mathbb{R}} P_s(i,\alpha)$. This inspires us to propose Algorithm 1 for computing $\mathrm{prox}_{\lambda_1\|\widehat{B}\cdot\|_0+\omega(\cdot)}(z)$, whose iteration steps are described as follows.

Algorithm 1 (Computing $\mathrm{prox}_{\lambda_1\|\widehat{B}\cdot\|_0+\omega(\cdot)}(z)$)

1. Initialize: Compute $P_1^*(\alpha) = \frac{1}{2}(z_1 - \alpha)^2 + \omega_1(\alpha)$ and set $R_1^0 = \mathbb{R}$.
2. For $s = 2, \ldots, n$ do
3.   $P_s^*(\alpha) := \min\{P_{s-1}^*(\alpha),\ \min_{\alpha'\in\mathbb{R}} P_{s-1}^*(\alpha') + \lambda_1\} + \frac{1}{2}(\alpha - z_s)^2 + \omega_s(\alpha)$.
4.   Compute $R_s^{s-1}$ by (32).
5.   For $i = 0, \ldots, s-2$ do
6.     $R_s^i = R_{s-1}^i \cap (R_s^{s-1})^c$.
7.   End
8. End
9. Set $s = n$.
10. While $s > 0$ do
11.   Find $\alpha_s^* \in \mathop{\arg\min}_{\alpha\in\mathbb{R}} P_s^*(\alpha)$, and $i_s^* = \{i \mid \alpha_s^* \in R_s^i\}$.
12.   $x^*_{i_s^*+1:s} = \alpha_s^*\mathbf{1}$ and $s \leftarrow i_s^*$.
13. End

For every $s \in [n]$, as $P_s^*$ is a piecewise lower semicontinuous linear-quadratic function, in the implementation of Algorithm 1 we store the parameters identifying this function in a matrix, each row of which records the parameters of one piece of $P_s^*$ and its domain. Similarly, each $R_s^i$ is stored via a vector which records its endpoints. The main computation cost of Algorithm 1 comes from lines 3 and 6, in which the number of pieces of the linear-quadratic functions involved in $P_s^*$ plays a key role. The following lemma gives a worst-case estimate of this number.

Lemma 11 Fix any $s \in [2:n]$. The function $P_s^*$ in line 3 of Algorithm 1 has at most $O(s^{1+\epsilon})$ linear-quadratic pieces, where $\epsilon$ is any small positive constant.

Proof For each i ∈ [0 : s − 2], let hi (α) := H(i) + 21 kα1 − zi+1:s k2 + λ1 + (s − i)λ2 |α|0 +
P s
j=i+1 δ[lj ,uj ] (α) for α ∈ R. Obviously, every hi is a piecewise lower semicontinuous linear-
quadratic function whose domain is a closed interval, and every piece is continuous  on the
closed interval except α = 0. Therefore, for each i ∈ [0 : s − 2], hi = min hi,1 , hi,2 , hi,3
with hi,1 (α) := hi (α) − (s − i)λ2 |α|0 + (s − i)λ2 + δ(−∞,0] (α), hi,2 (α) := hi (α) + δ{0} (α)
and hi,3 (α) := hi (α) − (s − i)λ2 |α|0 + (s − i)λ2 + δ[0,∞) (α). Obviously, hi1 , hi,2 and hi,3 are


piecewise linear-quadratic functions with domain being a closed interval. In addition, write
∗ (α0 ) + 1 (α − z )2 + λ|α| + λ + δ
hs−1 (α) := minα0 ∈R Ps−1 2 s 0 1 [ls ,us ] (α) for α ∈ R. Obviously,
hs−1 is a piecewise lower semicontinuous linear-quadratic function whose domain is a closed
interval. Similarly, hs−1 = min{hs−1,1 , hs−1,2 , hs−1,3 } where each hs−1,j for j = 1, 2, 3 is a
piecewise linear function whose domain is a closed interval. Combining the above discussion
with line 3 of Algorithm 1 and the definition of Ps−1 ∗ , for any α ∈ R,

n 1 1 o
Ps∗ (α) = min Ps−1 (i, α)+ (α − zs )2 +ωs (α), min P ∗
s−1 (α 0
)+ (α − z s )2
+ω s (α)+λ1
i∈[0:s−2] 2 α0 ∈R 2
n o 
= h0 (α), h1 (α), . . . , hs−2 (α), hs−1 (α) = min hi,j (α) .
i∈[0:s−1],j∈[3]

Notice that any hi,j and hi0 ,j 0 with i 6= i0 ∈ [0 : s − 1] or j 6= j 0 ∈ [3] crosses at most 2
times. From (Sharir, 1995, Theorem 2.5) the maximal number of linear-quadratic pieces
involved in Ps∗ is bounded by the maximal length of a (3s, 4) Davenport-Schinzel
√ sequence,
which by (Davenport and Schinzel, 1965, Theorem 3) is 3c1 s exp(c2 log 3s). Here, c1 , c2
are positive constants independent of s. Thus, we conclude that the maximal number of
linear-quadratic pieces involved in Ps∗ is O(s1+ ) for any  > 0. The proof is finished.

By invoking Lemma 11, we are able to provide a worst-case estimate of the complexity of Algorithm 1. Indeed, the main cost of Algorithm 1 lies in lines 3 and 5-7. The computational cost of line 3 depends on the number of pieces of $P_{s-1}^*$, which by Lemma 11 requires $O(s^{1+\epsilon})$ operations for any small $\epsilon > 0$. From part (b) of Proposition 10 (ii), for each $i \in [0:s-1]$, $R_s^i$ consists of at most $O(s^{1+\epsilon})$ intervals, which means that line 6 requires at most $O(s^{1+\epsilon})$ operations, and then the computational complexity of lines 5-7 is $O(s^{2+\epsilon})$ for any small $\epsilon > 0$. Thus, the worst-case complexity of Algorithm 1 is $\sum_{s=2}^n O(s^{2+\epsilon}) = O(n^{3+\epsilon})$ for any small $\epsilon > 0$.
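For small instances, the recursion of Lemma 9 can also be implemented directly, without the functional pruning of Algorithm 1, by enumerating the last changepoint $i$ and the candidate values of $\alpha$ listed after Proposition 10. The following $O(n^2)$ NumPy sketch is only a brute-force reference implementation (not the paper's Algorithm 1), useful for checking Algorithm 1 on small examples:

```python
import numpy as np

def prox_fused_l0_box_dp(z, lam1, lam2, l, u):
    """A point in prox of lam1*||B_hat x||_0 + lam2*||x||_0 + indicator of [l,u] at z,
    via the recursion H(s) = min_{i,alpha} P_s(i,alpha) of Lemma 9 (brute-force DP)."""
    n = z.size
    H = np.empty(n + 1); H[0] = -lam1
    best = [None] * (n + 1)                      # best[s] = (i_s^*, alpha_s^*)

    def seg_cost(i, s):
        """min over alpha of 0.5*||alpha*1 - z_{i+1:s}||^2 + sum_{j=i+1}^s omega_j(alpha)."""
        zs, L, U = z[i:s], np.max(l[i:s]), np.min(u[i:s])
        if L > U:
            return np.inf, None
        cands = [0.0, float(np.clip(np.mean(zs), L, U))]   # 0 is feasible since l <= 0 <= u
        vals = [0.5 * np.sum((a - zs) ** 2) + (s - i) * lam2 * (a != 0.0) for a in cands]
        j = int(np.argmin(vals))
        return vals[j], cands[j]

    for s in range(1, n + 1):                    # forward pass: H(s) = min_i H(i)+lam1+cost
        Hs, arg = np.inf, None
        for i in range(s):
            c, a = seg_cost(i, s)
            if H[i] + lam1 + c < Hs:
                Hs, arg = H[i] + lam1 + c, (i, a)
        H[s], best[s] = Hs, arg

    x = np.empty(n); s = n                       # backward recovery as in (31)
    while s > 0:
        i, a = best[s]
        x[i:s] = a
        s = i
    return x
```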

4. A hybrid of PG and inexact projected regularized Newton methods


In the hybrid frameworks of Themelis et al. (2018) and Bareilles et al. (2023), the PG and Newton steps alternate. Considering that the PG step is more cost-effective than the Newton step when the iterates are far from a stationary point, we introduce the switch condition (8) into our algorithm, a hybrid of PG and inexact projected regularized Newton methods (PGiPN) for problem (1), to control when the Newton steps are executed.
Now we describe the details of our algorithm. Let $x^k \in \Omega$ be the current iterate. It is noted that the PG step is always executed, and when condition (8) is met, we need to solve (11), which involves constructing $G_k$ satisfying (12). Such a $G_k$ can be easily obtained in the following two cases. One is the case where $f$ can be expressed as $f(x) = h(Ax - b)$ for some $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$ and a separable twice continuously differentiable $h$. Then $\nabla^2 f(x) = A^\top\nabla^2 h(Ax - b)A$ with $\nabla^2 h(Ax - b)$ being a diagonal matrix. Since $\nabla^2 f(x^k)$ is not necessarily positive definite, following the same way as in Liu et al. (2024), we construct
$$G_k^1 := \nabla^2 f(x^k) + b_2\big[-\lambda_{\min}(\nabla^2 h(Ax^k - b))\big]_+ A^\top A + b_1\|\mu_k(x^k - \overline{x}^k)\|^\sigma I \quad \text{with } b_2 \ge 1. \qquad (35)$$
However, for highly nonconvex $h$, $[-\lambda_{\min}(\nabla^2 h(Ax^k - b))]_+$ is large, in which case $G_k^1$ is a poor approximation to $\nabla^2 f(x^k)$. To avoid this drawback, we consider the following
$$G_k^2 := A^\top\big[\nabla^2 h(Ax^k - b)\big]_+ A + b_1\|\mu_k(x^k - \overline{x}^k)\|^\sigma I. \qquad (36)$$
When $\nabla^2 h(Ax^k - b) \succeq 0$, $G_k^1 = G_k^2$. If $\nabla^2 h(Ax^k - b) \not\succeq 0$, it is immediate to see that $\|G_k^1 - \nabla^2 f(x^k)\|_2 \ge \|G_k^2 - \nabla^2 f(x^k)\|_2$, which means that $G_k^2$ is a better approximation to $\nabla^2 f(x^k)$ than $G_k^1$. The other is the case where $f$ has no special structure, and in this case we form $G_k := G_k^3$ as in Ueda and Yamashita (2010) and Wu et al. (2023), where
$$G_k^3 := \nabla^2 f(x^k) + \big(b_2[-\lambda_{\min}(\nabla^2 f(x^k))]_+ + b_1\|\mu_k(x^k - \overline{x}^k)\|^\sigma\big) I. \qquad (37)$$
It is not hard to check that for $i = 1, 2, 3$, $G_k^i$ meets the requirement in (12). We remark here that the subsequent convergence analysis holds for the above three choices of $G_k^i$, and we write them as $G_k$ for simplicity. The iterates of PGiPN are described as follows.

Algorithm 2 (a hybrid of PG and inexact projected regularized Newton methods)

Initialization: Choose $\epsilon \ge 0$ and parameters $\mu_{\max} > \mu_{\min} > 0$, $\tau > 1$, $\alpha > 0$, $b_1 > 0$, $b_2 \ge 1$, $\varrho \in (0, \frac{1}{2})$, $\sigma \in (0, \frac{1}{2})$, $\varsigma \in (\sigma, 1]$ and $\beta \in (0,1)$. Choose an initial $x^0 \in \Omega$ and let $k := 0$.
PG step:
(1a) Select $\overline{\mu}_k \in [\mu_{\min}, \mu_{\max}]$. Let $m_k$ be the smallest nonnegative integer $m$ such that
$$F(\overline{x}^k) \le F(x^k) - \frac{\alpha}{2}\|\overline{x}^k - x^k\|^2 \quad \text{with } \overline{x}^k \in \mathrm{prox}_{(\overline{\mu}_k\tau^m)^{-1}g}\big(x^k - (\overline{\mu}_k\tau^m)^{-1}\nabla f(x^k)\big). \qquad (38)$$
(1b) Let $\mu_k = \overline{\mu}_k\tau^{m_k}$. If $\mu_k\|x^k - \overline{x}^k\| \le \epsilon$, stop and output $\overline{x}^k$; otherwise, go to step (1c).
(1c) If condition (8) holds, go to the Newton step; otherwise, let $x^{k+1} = \overline{x}^k$, set $k \leftarrow k+1$ and return to step (1a).
Newton step:
(2a) Seek an inexact solution $y^k$ of (11) with $G_k$ from (36) or (37) such that (13)-(14) hold.
(2b) Set $d^k := y^k - x^k$. Let $t_k$ be the smallest nonnegative integer $t$ such that
$$f(x^k + \beta^t d^k) \le f(x^k) + \varrho\beta^t\langle\nabla f(x^k), d^k\rangle. \qquad (39)$$
(2c) Let $\alpha_k = \beta^{t_k}$ and $x^{k+1} = x^k + \alpha_k d^k$. Set $k \leftarrow k+1$ and return to the PG step.
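To make the flow of Algorithm 2 concrete, the following Python sketch (illustrative only, not the authors' implementation) shows one round of the PG step (1a)-(1b) with backtracking on $\mu$, the switch condition (8), and a Newton step using $G_k^3$ from (37); the prox oracle for $g$ (e.g., Algorithm 1) and the solver for subproblem (11) are assumed to be supplied:

```python
import numpy as np

def supports_match(x_bar, x, B, tol=1e-12):
    """Switch condition (8): supp(x_bar) = supp(x) and supp(B x_bar) = supp(B x)."""
    return (np.array_equal(np.abs(x_bar) > tol, np.abs(x) > tol) and
            np.array_equal(np.abs(B @ x_bar) > tol, np.abs(B @ x) > tol))

def pg_step(x, F, grad_f, prox_g, mu_bar, tau, alpha_par, max_bt=50):
    """Steps (1a)-(1b): smallest m giving the sufficient decrease (38); returns (x_bar, mu_k)."""
    mu = mu_bar
    for _ in range(max_bt):
        x_bar = prox_g(x - grad_f(x) / mu, 1.0 / mu)
        if F(x_bar) <= F(x) - 0.5 * alpha_par * np.linalg.norm(x_bar - x) ** 2:
            return x_bar, mu
        mu *= tau
    return x_bar, mu

def newton_step(x, x_bar, mu, f, grad_f, hess_f, solve_subproblem,
                b1=1.0, b2=1.0, sigma=0.25, rho=0.25, beta=0.5):
    """Steps (2a)-(2c) with G_k^3 from (37); subproblem (11) is delegated to a solver."""
    reg = mu * np.linalg.norm(x - x_bar)
    H = hess_f(x)
    lam_min = np.min(np.linalg.eigvalsh(H))
    G = H + (b2 * max(-lam_min, 0.0) + b1 * reg ** sigma) * np.eye(x.size)
    y = solve_subproblem(x, grad_f(x), G, x_bar)     # inexact solution of (11) over Pi_k
    d = y - x
    t = 1.0
    for _ in range(50):                              # backtracking line search (39)
        if f(x + t * d) <= f(x) + rho * t * grad_f(x) @ d:
            break
        t *= beta
    return x + t * d
```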

Remark 12 (i) Our PGiPN benefits from the PG step in two aspects. First, the incorporation of the PG step guarantees that the sequence generated by PGiPN remains in a right position for convergence. Second, the PG step helps to identify adaptively the subspace used in the Newton step: as will be shown in Proposition 16, when $k$ is sufficiently large, the switch condition (8) always holds and the supports of $\{B\overline{x}^k\}_{k\in\mathbb{N}}$ and $\{\overline{x}^k\}_{k\in\mathbb{N}}$ remain unchanged, so that Algorithm 2 reduces to an inexact projected regularized Newton method for solving (10) with $\Pi_k \equiv \Pi^*$, where $\Pi^* \subset \mathbb{R}^n$ is a polytope defined in (49). In this sense, the PG step plays a crucial role in transforming the original challenging problem (1) into a problem that can be efficiently solved by the inexact projected regularized Newton method.
(ii) When $\overline{x}^k$ enters the Newton step, from the inexact criterion (13) and the expression of $\Theta_k$, $0 \ge \Theta_k(x^k + d^k) - \Theta_k(x^k) = \langle\nabla f(x^k), d^k\rangle + \frac{1}{2}\langle d^k, G_k d^k\rangle$, and then
$$\langle\nabla f(x^k), d^k\rangle \le -\frac{1}{2}\langle d^k, G_k d^k\rangle \le -\frac{b_1}{2}\|\mu_k(x^k - \overline{x}^k)\|^\sigma\|d^k\|^2 < 0, \qquad (40)$$
where the second inequality is due to (12). In addition, the inexact criterion (13) implies that $y^k \in \Pi_k$, which along with $x^k \in \Pi_k$ and the convexity of $\Pi_k$ yields that $x^k + \alpha d^k \in \Pi_k$ for any $\alpha \in (0,1]$. By the definition of $\Pi_k$, $\mathrm{supp}(B(x^k + \alpha d^k)) \subset \mathrm{supp}(Bx^k)$ and $\mathrm{supp}(x^k + \alpha d^k) \subset \mathrm{supp}(x^k)$, so $g(x^k + \alpha d^k) \le g(x^k)$ for any $\alpha \in (0,1]$. This together with (40) shows that moving along the direction $d^k$ reduces the value of $F$ at $x^k$.
(iii) When $\epsilon = 0$, by Definition 1 the output $\overline{x}^k$ of Algorithm 2 is an L-stationary point of (1), which is also a stationary point of problem (10) by Proposition 3 and Lemma 4 (i). Let $r_k: \mathbb{R}^n \to \mathbb{R}^n$ be the KKT residual mapping of (10) defined by
$$r_k(x) := \mu_k\big[x - \mathrm{proj}_{\Pi_k}\big(x - \mu_k^{-1}\nabla f(x)\big)\big]. \qquad (41)$$
It is not difficult to verify that when $\overline{x}^k$ satisfies condition (8), the following relation holds:
$$r_k(x^k) = \mu_k(x^k - \overline{x}^k). \qquad (42)$$
Indeed, we only need to argue that $\overline{x}^k = \mathrm{proj}_{\Pi_k}(x^k - \mu_k^{-1}\nabla f(x^k))$. Suppose that this does not hold. Then there exists $z^k \in \Pi_k$ such that $\widetilde{h}_k(z^k) < \widetilde{h}_k(\overline{x}^k)$, where $\widetilde{h}_k(x) := \frac{\mu_k}{2}\|x - (x^k - \mu_k^{-1}\nabla f(x^k))\|^2$. Since $z^k \in \Pi_k$, we have $\mathrm{supp}(Bz^k) \subset \mathrm{supp}(B\overline{x}^k)$ and $\mathrm{supp}(z^k) \subset \mathrm{supp}(\overline{x}^k)$, which implies that $g(z^k) \le g(\overline{x}^k)$ and then $\widetilde{h}_k(z^k) + g(z^k) < \widetilde{h}_k(\overline{x}^k) + g(\overline{x}^k)$, a contradiction to $\overline{x}^k \in \mathrm{prox}_{\mu_k^{-1}g}(x^k - \mu_k^{-1}\nabla f(x^k))$.
(iv) By using (16) and the descent lemma (Bertsekas, 1997, Proposition A.24), the line search in step (1a) must stop after a finite number of backtrackings. In fact, the line search in step (1a) is satisfied whenever the nonnegative integer $m$ is such that $\overline{\mu}_k\tau^m \ge L_1 + \alpha$, and consequently, for each $k \in \mathbb{N}$, $\mu_k = \overline{\mu}_k\tau^{m_k} \le \widetilde{\mu} := \tau(L_1 + \alpha)$.
(v) Note that problem (11) is a strongly convex quadratic program over a polyhedral set, for which many successful algorithms have been developed, such as interior point algorithms. In our numerical experiments, we call the commercial software GUROBI (Gurobi Optimization, LLC (2024)) to solve it, which uses an interior point method as the solver.
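For illustration, subproblem (11) can be posed as a box- and equality-constrained convex QP; a minimal sketch using scipy.optimize (the paper itself calls GUROBI), with the equality constraints of $\Pi_k$ assembled as in (9):

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint, Bounds

def solve_subproblem(xk, grad_k, Gk, Aeq, l, u):
    """Approximately minimize Theta_k(x) = <grad_k, x - xk> + 0.5 (x-xk)' Gk (x-xk)
    over Pi_k = {x in [l,u] : Aeq x = 0}  (the constant f(xk) is dropped)."""
    def obj(x):
        d = x - xk
        return grad_k @ d + 0.5 * d @ (Gk @ d)
    def grad(x):
        return grad_k + Gk @ (x - xk)
    cons = [LinearConstraint(Aeq, 0.0, 0.0)] if Aeq.shape[0] > 0 else []
    res = minimize(obj, xk, jac=grad, method="trust-constr",
                   bounds=Bounds(l, u), constraints=cons)
    return res.x
```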

By Remark 12 (iv), to show that Algorithm 2 is well defined, we only need to argue that the Newton steps in Algorithm 2 are well defined, which is implied by the following lemma.

Lemma 13 For each $k \in \mathbb{N}$, define the KKT residual mapping $R_k: \mathbb{R}^n \to \mathbb{R}^n$ of (11) by
$$R_k(y) := \mu_k\big[y - \mathrm{proj}_{\Pi_k}\big(y - \mu_k^{-1}(G_k(y - x^k) + \nabla f(x^k))\big)\big].$$
Then, for those $\overline{x}^k$'s satisfying (8), the following statements are true.
(i) For any $y$ close enough to the optimal solution of (11), $y - \mu_k^{-1}R_k(y)$ satisfies (13)-(14).
(ii) The line search step in (39) is well defined, and $\alpha_k \ge \min\big\{1, \frac{(1-\varrho)b_1\beta}{L_1}\|\mu_k(x^k - \overline{x}^k)\|^\sigma\big\}$.
(iii) The inexact criterion (14) implies that $\|R_k(y^k)\| \le \frac{1}{2}\min\{\|r_k(x^k)\|, \|r_k(x^k)\|^{1+\varsigma}\}$.



Proof Pick any xk satisfying (8). We proceed the proof of parts (i)-(iii) as follows.
(i) Let ybk be the unique optimal solution to (11). Then ybk 6= xk (if not, xk is the optimal
solution of (11) and 0 = Rk (xk ) = rk (xk ), which by (42) means that xk = xk and Algorithm
2 stops at xk ). By the optimality condition of (11), −∇f (xk )−Gk (b y k −xk ) ∈ NΠk (b y k ), which
by the convexity of Πk and xk ∈ Πk implies that h∇f (xk ) + Gk (b y k − xk ), ybk − xk i ≤ 0. Along
k k 1 k
with the expression of Θk , we have Θk (b y ) − Θk (x ) ≤ − 2 hby − xk , Gk (b y k − xk )i < 0. Since
Θk is continuous relative to Πk , for any z ∈ Πk sufficiently close to yb , Θk (z) − Θk (xk ) ≤ 0.
k

From Rk (b y ) = 0 and the continuity of Rk , when y sufficiently close to yb, y − µ−1 k Rk (y) is
close to yb, which together with y − µ−1k R k (y) ∈ Π k implies that y − µ −1
k R k (y) satisfies the
criterion (13). In addition, from the expression of Rk , for any y ∈ Rn ,

0 ∈ Gk (y − xk ) + ∇f (xk ) − Rk (y) + NΠk (y − µ−1


k Rk (y)),

which by the expression of Θk implies that µ−1 −1


k Gk Rk (y) + Rk (y) ∈ ∂Θk (y − µk Rk (y)).
−1 −1
Hence, dist(0, ∂Θk (y − µk Rk (y))) ≤ kµk Gk Rk (y) + Rk (y)k. Noting that Rk (b y k ) = 0,
min{µ−1
k ,1}
we have kµ−1 y k ) + Rk (b
y k )k = 0 < min kµk (xk −xk )k, kµk (xk − xk )k1+ς .

k Gk Rk (b 2
From the continuity of the function y 7→ kµ−1 k Gk Rk (y) + Rk (y)k, we conclude that for any
k −1
y sufficiently close to yb , y − µk Rk (y) satisfies the inexact criterion (14).
(ii) By (16) and the descent lemma (Bertsekas, 1997, Proposition A.24), for any α ∈ (0, 1],

L1 α2 k 2
f (xk +αdk ) − f (xk ) − %αh∇f (xk ), dk i ≤ (1−%)αh∇f (xk ), dk i + kd k
2
(1−%)αb1 L1 α2 k 2
≤− kµk (xk −xk )kσ kdk k2 + kd k
2 2
 (1−%)b L1 α 
1
= − kµk (xk −xk )kσ + αkdk k2 ,
2 2
where the second inequality uses (40). Therefore, when the nonnegative integer t is such
that β t ≤ min 1, (1−%)b k k σ , the line search in (39) holds, which implies that the
 1
L1 kµk (x −x )k
smallest nonnegative integer tk should satisfy αk = β tk ≥ min 1, (1−%)b 1β
kµk (xk −xk )kσ .

L1
(iii) Let ζ k ∈ ∂Θk (y k ) be such that kζ k k = dist(0, ∂Θk (y k )). From ζ k ∈ ∂Θk (y k ) and
the expression of Θk , we have y k = projΠk (y k + ζ k − (Gk (y k − xk ) + ∇f (xk ))). Along
with y k = projΠk (y k ) and the nonexpansiveness of projΠk , ky k −projΠk (y k −(Gk (y k −xk ) +
∇f (xk )))k ≤ kζ k k. Consequently,

dist(0, ∂Θk (y k )) ≥ ky k −projΠk (y k −(Gk (y k −xk ) + ∇f (xk )))k ≥ min{µ−1 k


k , 1}kRk (y )k,

where the second inequality follows Lemma 4 of Sra (2012) and the expression of Rk . Com-
bining the last inequality with (14) and (42) leads to the desired inequality.

When µk = 1, the condition that kRk (y k )k ≤ 12 min krk (xk )k, krk (xk )k1+ς is a special


case of the inexact condition in (Yue et al., 2019, Equa (6a)) or the inexact condition
in (Mordukhovich et al., 2023, Equa (14)), which along with Lemma 13 (iii) shows that
criterion (14) with µk = 1 is stronger than the ones adopted in these literature.
To analyze the convergence of Algorithm 2 with  = 0, henceforth we assume xk 6= xk
for all k (if not, Algorithm 2 will produce an L-stationary point within finite number of

21
Wu, Pan, and Yang

steps, and its convergence holds automatically). From the iteration steps of Algorithm 2,
we see that the sequence {xk }k∈N consists of two parts, {xk }k∈K1 and {xk }k∈K2 , where

K1 := N\K2 with K2 := k ∈ N | supp(Bxk ) = supp(Bxk ), supp(xk ) = supp(xk ) .




Obviously, K1 consists of those k’s with xk+1 from the PG step, while K2 consists of those
k’s with xk+1 from the Newton step.
To close this section, we provide some properties of the sequences {xk }k∈N and {xk }k∈N .

Proposition 14 The following assertions are true.

(i) The sequence {F (xk )}k∈N is descent and convergent.

(ii) There exists ν > 0 such that |Bxk |min ≥ ν and |xk |min ≥ ν for all k ∈ N.

(iii) There exist c1 , c2 > 0 such that c1 krk (xk )k ≤ kdk k ≤ c2 krk (xk )k1−σ for all k ∈ K2 .

Proof (i) For each k ∈ N, when k ∈ K1 , by the line search in step (1a), F (xk+1 ) < F (xk ),
and when k ∈ K2 , from (39) and (40), it follows that f (xk+1 ) < f (xk ), which along with
g(xk+1 ) ≤ g(xk ) by Remark 12 (ii) implies that F (xk+1 ) < F (xk ). Hence, {F (xk )}k∈N is a
descent sequence. Recall that F is lower bounded on Ω, so {F (xk )}k∈N is convergent.
(ii) By the definition of µk and Remark 12 (iv), µk ∈ [µmin , µ e] for all k ∈ N. Note that
{xk }k∈N ⊂ Ω, so the sequence {xk − µ−1 k ∇f (x k )}
k∈N is bounded and is contained in a
compact set, says, Ξ. By invoking Lemma 8 with such Ξ and µ = µmin , µ = µ e, there exists
ν > 0 (depending on Ξ, µmin and µ k
e) such that |[B; I]x |min > ν. The desired result then
follows by noting that |Bxk |min ≥ |[B; I]xk |min and |xk |min ≥ |[B; I]xk |min .
(iii) From the definition of Gk , the continuity of ∇2f , {xk }k∈N ⊂ Ω, {xk }k∈N ⊂ Ω and
Remark 12 (iv), there exists c > 0 such that

kGk k2 ≤ c for all k ∈ K2 . (43)

Fix any k ∈ K2 . By Lemma 13 (iii), kRk (y k )k ≤ 21 krk (xk )k. Then, it holds that

1
krk (xk )k ≤ krk (xk )k − kRk (y k )k ≤ krk (xk ) − Rk (y k )k
2
= µk kxk − projΠk (xk − µ−1 k k k −1 k k k
k ∇f (x )) − y + projΠk (y − µk (Gk (y − x ) + ∇f (x )))k
≤ (2µk + kGk k2 )ky k − xk k ≤ (2e
µ + c)kdk k,

where the third inequality is using the nonexpansiveness of projΠk , and the last one is due to
(43) and dk = y k − xk . Therefore, c1 krk (xk )k ≤ kdk k with c1 := 1/(4eµ+ 2c). For the second
inequality, it follows from the definitions of rk (·) and Rk (·) that Rk (y k ) − ∇f (xk ) − Gk dk ∈
NΠk (y k − µ−1 k k k k −1 k
k Rk (y )) and rk (x ) − ∇f (x ) ∈ NΠk (x − µk rk (x )), which together with
the monotonicity of the set-valued mapping NΠk (·) implies that

hdk , Gk dk i ≤ hRk (y k )−rk (xk ), dk i − µ−1 k k 2 −1 k k k


k kRk (y )−rk (x )k − µk hGk d , −Rk (y ) + rk (x )i
≤ h(I + µ−1 k k k
k Gk )d , Rk (y ) − rk (x )i.

22
PGiPN for Fused Zero-norms Regularization Problems

Combining this inequality with equations (12), (42) and Lemma 13 (iii) leads to
b1 krk (xk )kσ kdk k2 ≤ (1 + µ−1 k k k
k kGk k2 )(kRk (y )k + krk (x )k)kd k (44)
≤ (3/2)(1 + µ−1 k k
k kGk k2 )krk (x )kkd k,

which along with (43) and µk ≥ µmin implies that kdk k ≤ 23 (1 + µ−1 −1 k 1−σ .
min c)b1 krk (x )k
k k 1−σ 3 −1 −1
Then, kd k ≤ c2 krk (x )k holds with c2 := 2 (1 + µmin c)b1 . The proof is completed.

5. Convergence Analysis
Before analyzing the convergence of Algorithm 2, we show that it finally reduces to an
inexact projected regularized Newton method for seeking a stationary point of a problem
to minimize a smooth function over a polyhedral set. This requires the following lemma.
Lemma 15 For the sequences {xk }k∈N and {xk }k∈N generated by Algorithm 2, the following
assertions are true.
(i) There exists a constant γ > 0 such that for each k ∈ N,

 −γkxk − xk k2

if k ∈ K1 ,
k+1 k k k 2+σ
F (x ) − F (x ) ≤ −γkx − x k if k ∈ K2 , αk = 1, (45)
−γkxk − xk k2+2σ if k ∈ K2 , αk = 6 1.

(ii) limk→∞ kxk − xk k = 0 and limK2 3k→∞ kdk k = 0.


(iii) The accumulation point set of {xk }k∈N , denoted by Γ(x0 ), is nonempty and compact,
and every element of Γ(x0 ) is an L-stationary point of problem (1).
Proof (i) Fix any k ∈ K2 . From inequalities (39)-(40), Proposition 14 (iii) and (42),
%b1 αk %c2 b1 αk
f (xk+1 ) − f (xk ) ≤ − kµk (xk −xk )kσ kdk k2 ≤ − 1 kµk (xk −xk )k2+σ
2 2 (46)
%c2 b1 αk µ2+σ
≤− 1 min
kxk −xk k2+σ
2
which, along with g(xk+1 ) ≤ g(xk ) by Remark 12 (ii), implies that F (xk+1 ) − F (xk ) ≤
%c2 b α µ2+σ
f (xk+1 ) − f (xk ). Together with (46), we have F (xk+1 )− F (xk ) ≤ − 1 1 2k min kxk − xk k2+σ .
Recall that F (xk+1 )−F (xk ) ≤ α2 kxk −xk k2 for k ∈ K1 . By using Lemma 13 (ii), the desired
 %c2 b µ2+σ β(1−%)%c21 b21 µ2+2σ
result then follows with γ := min α2 , 1 12 min , 2L1
min
.
(ii) Let K2 := {k ∈ K2 | αk = 1}. Doing summation for inequality (45) from i = 1 to any
e
j ∈ N yields that
X X X
γkxi − xi k2 + γkxi − xi k2+σ + γkxi − xi k2+2σ
i∈K1 ∩[j] i∈K
e 2 ∩[j] i∈(K2 \K
e 2 )∩[j]
j
X
F (xi ) − F (xi+1 ) = F (x1 ) − F (xj+1 ),
 

i=1

23
Wu, Pan, and Yang

which by the lower boundedness of F on the set Ω implies that


X X X
kxi − xi k2 + γkxi − xi k2+σ + γkxi − xi k2+2σ < ∞.
i∈K1 i∈K
e2 i∈K2 \K
e2

Thus, we obtain limk→∞ kxk −xk k = 0. Together with (42), Proposition 14 (iii) and Remark
12 (iv), it follows that limK2 3k→∞ kdk k = 0.
(iii) Recall that {xk }k∈N ⊂ Ω, so its accumulation point set Γ(x0 ) is nonempty. Pick
any x∗ ∈ Γ(x0 ). Then, there exists an index set K ⊂ N such that limK3k→∞ xk = x∗ .
From part (ii), limK3k→∞ xk = x∗ . By step (1a) and Remark 12 (iv), for each k ∈ K,
xk ∈ proxµ−1 g xk − µ−1 k
k ∇f (x ) with µk ∈ [µmin , µ
e], and consequently,
k

0 ∈ µk (xk − (xk − µ−1 k k


k ∇f (x ))) + ∂g(x ). (47)

We claim that g(xk ) → g(x∗ ) as K 3 k → ∞. Indeed, by the definition of xk , we have


µk k 2 µk ∗ 2
x − (xk −µ−1 k
k ∇f (x )) + g(xk ) ≤ x − (xk −µ−1 k
k ∇f (x )) + g(x∗ ) ∀k ∈ K.
2 2
Recall that µk ∈ [µmin , µ
e] for each k. If necessary by taking a subsequence, we assume that
µk → µ ∈ [µmin , µ
e] as K 3 k → ∞. Passing K 3 k → ∞ to the above inequality leads to
hµ i
2
lim sup g(xk ) ≤ lim sup k x∗ − (xk −µ−1 k ∇f (x k
)) + g(x ∗
)
K3k→∞ K3k→∞ 2
h µ i
−1 2
+ lim sup − k k k
x − (x −µk ∇f (x )) k
= g(x∗ ),
K3k→∞ 2

while lim inf K3k→∞ g(xk ) ≥ g(x∗ ) follows from the lower semicontinuity of g. Thus, the
claimed limit limK3k→∞ g(xk ) = g(x∗ ) holds. Now from the above inclusion (47), it follows
that 0 ∈ ∇f (x∗ ) + ∂g(x∗ ). By Lemma 7, we know that x∗ is an L-stationary point of (1).

Next we apply Lemma 15 (ii) to show that, after a finite number of iterations, the switch
condition in (8) always holds and the Newton step is executed. To this end, define

Tk := supp(Bxk ), T k := supp(Bxk ), Sk := supp(xk ) and S k := supp(xk ). (48)

Proposition 16 For the index sets defined in (48), there exist index sets T ⊂ [p], S ⊂ [n]
and k ∈ N such that for all k > k, Tk = T k = T and Sk = S k = S, which means that
k ∈ K2 for all k > k. Moreover, for each x∗ ∈ Γ(x0 ), supp(Bx∗ ) = T, supp(x∗ ) = S and
F (x∗ ) = limk→∞ F (xk ) := F ∗ , where Γ(x0 ) is defined in Lemma 15 (iii).

Proof We complete the proof of the conclusion via the following three claims:
Claim 1: There exists k ∈ N such that for k > k, |Bxk |min ≥ ν2 , where ν is the same as
the one in Proposition 14 (ii). Indeed, for each k − 1 ∈ K1 , xk = xk−1 , and |Bxk |min =
|Bxk−1 |min ≥ ν > ν2 follows by Proposition 14 (ii). Hence, it suffices to consider that k−1 ∈
ν
K2 . By Lemma 15 (ii), there exists k ∈ N such that for all k ≥ k, kxk−1 − xk−1 k < 4kBk 2
,
ν
and for all K2 3 k − 1 > k − 1, kdk−1 k < 4kBk2 , which implies that for K2 3 k − 1 > k − 1,

24
PGiPN for Fused Zero-norms Regularization Problems

kBxk−1−Bxk−1 k < ν4 and kBdk−1 k < ν4 . For each K2 3 k−1 > k−1, let ik ∈ [p] be such that
|(Bxk−1 )ik | = |Bxk−1 |min . Since condition (8) implies that supp(Bxk−1 ) = supp(Bxk−1 )
for each k − 1 ∈ K2 , we have |(Bxk−1 )ik | ≥ |Bxk−1 |min . Thus, for each K2 3 k − 1 > k − 1,

kBxk−1 − Bxk−1 k ≥ |(Bxk−1 )ik − (Bxk−1 )ik | ≥ |(Bxk−1 )ik | − |(Bxk−1 )ik |
≥ |Bxk−1 |min − |Bxk−1 |min .

Recall that |Bxk−1 |min ≥ ν for all k ∈ N by Proposition 14 (ii). Together with the last
inequality and kBxk−1 −Bxk−1 k < ν4 , for each K2 3 k −1 > k −1, we have |Bxk−1 |min ≥ 3ν
4 .
k k
For each K2 3 k − 1 > k − 1, let jk ∈ [p] be such that |(Bx )jk | = |Bx |min . By Remark
12 (ii), supp(Bxk ) ⊂ supp(Bxk−1 ) for each k − 1 ∈ K2 , which along with jk ∈ supp(Bxk )
implies that |(Bxk−1 )jk | ≥ |Bxk−1 |min . Thus, for each K2 3 k − 1 > k − 1,

1
kBdk−1 k = kBxk − Bxk−1 k ≥ kBxk − Bxk−1 k ≥ |(Bxk−1 )jk − (Bxk )jk |
αk
≥ |(Bxk−1 )jk | − |(Bxk )jk | ≥ |Bxk−1 |min − |Bxk |min ,

which together with kBdk−1 k ≤ ν4 and |Bxk−1 |min ≥ 3ν k ν


4 implies that |Bx |min ≥ 2 .
Claim 2: Tk = T k for k > k. From the above arguments, kBxk − Bxk k ≤ ν4 for k > k. If
i ∈ Tk , then |(Bxk )i | ≥ |(Bxk )i | − ν4 ≥ ν4 , where the second inequality is using |Bxk |min > ν2
by Claim 1. This means that i ∈ T k , so Tk ⊂ T k . Conversely, if i ∈ T k , then |(Bxk )i | ≥
|(Bxk )i | − ν4 ≥ 3ν
4 , so i ∈ Tk and T k ⊂ Tk . Thus, Tk = T k for k > k.
Claim 3: Tk = Tk+1 for k > k. If k ∈ K1 , the result follows directly by the result in Claim
2 because T k = supp(Bxk ) = supp(Bxk+1 ) = Tk+1 . If k ∈ K2 , from the proof of Claim 1,
kBxk − Bxk+1 k ≤ kBdk k ≤ ν4 for all k > k. Then, if i ∈ Tk , |(Bxk+1 )i | ≥ |(Bxk )i | − ν4 ≥ ν4 ,
where the second inequality is using |Bxk |min > ν2 by Claim 1. This implies that i ∈ Tk+1
and Tk ⊂ Tk+1 . Conversely, if i ∈ Tk+1 , then |(Bxk )i | ≥ |(Bxk+1 )i | − ν4 ≥ ν4 . Hence, i ∈ Tk
and Tk+1 ⊂ Tk .
From Claim 2 and Claim 3, there exists T ⊂ [p] such that Tk = T k = T for k > k.
Using the similar arguments can prove the existence of S ⊂ [n] such that Sk = S k = S for
all k > k (if necessary increasing k).
Pick any x∗ ∈ Γ(x0 ). Let {xk }k∈K be a subsequence such that limK3k→∞ xk = x∗ . By
the above proof, for all sufficiently large k ∈ K, |Bxk |min ≥ ν2 and |xk |min ≥ ν2 , which implies
that |Bx∗ |min ≥ ν2 and |x∗ |min ≥ ν2 . The results supp(Bx∗ ) = T and supp(x∗ ) = S can be
obtained by a proof similar to Claim 3. From x∗ ∈ Γ(x0 ), there exists an index set K ⊂ N
such that limK3k→∞ xk = x∗ . From the above arguments, g(xk ) = g(xk ) = λ1 |T | + λ2 |S|
for all K 3 k ≥ k. By the proof of Lemma 15 (iii), lim supK3k→∞ g(xk ) ≤ g(x∗ ), so that

F ∗ = lim sup F (xk ) = lim sup [f (xk ) + g(xk )]


K3k→∞ K3k→∞
≤ f (x ) + lim sup g(x ) = f (x∗ ) + lim sup g(xk ) ≤ F (x∗ ).
∗ k
K3k→∞ K3k→∞

On the other hand, by the lower semicontinuity of F , we have F ∗ ≥ F (x∗ ). The two sides
imply that F (x∗ ) = F ∗ . The proof is completed.

25
Wu, Pan, and Yang

By Proposition 16, all k > k belong to K2 , i.e., the sequence {xk+1 }k>k is generated
by the Newton step. This means that {xk+1 }k>k is identical to the one generated by the
inexact projected regularized Newton method starting from xk+1 . Also, since Πk = Πk+1 for
all k > k, Algorithm 2 finally reduces to the inexact projected regularized Newton method
for solving
minn φ(x) := f (x) + δΠ∗ (x) with Π∗ := Πk+1 , (49)
x∈R

which is a minimization problem of function f over the polytope Π∗ , much simpler than
the original problem (1). Consequently, the global convergence and local convergence rate
analysis of PGiPN for model (1) boils down to analyzing those of the inexact projected
regularized Newton method for (49). The rest of this section is devoted to this.
Unless otherwise stated, the notation k in the sequel is always that of Proposition 16
plus one. In addition, we require the assumption that ∇2f is locally Lipschitz continuous
on Γ(x0 ), where Γ(x0 ) is defined in Lemma 15 (iii).

Assumption 1 ∇2f is locally Lipschitz continuous on an open set containing Γ(x0 ).

Assumption 1 is very standard when analyzing the convergence behavior of Newton-


type method. The following lemma reveals that under this assumption, the step size αk in
Newton step takes 1 when k is sufficiently large. Since the proof is similar to that of Lemma
B.1 of the arxiv version of Liu et al. (2024), the details are omitted here.

Lemma 17 Suppose that Assumption 1 holds. Then αk = 1 for sufficiently large k.

Notice that Π∗ is a polytope, which can be expressed as

Π∗ = x ∈ R n | B T c

·x = 0, xS c = 0, x ≥ l, −x ≥ −u . (50)
k+1 k+1

For any x ∈ Rn , we define multifunction A : Rn ⇒ [2n] as

A(x) := {i | xi = li } ∪ {i + n | xi = ui }.

Clearly, for x ∈ Π∗ , A(x) is the index set of those active constraints involved in Π∗ at x.
To prove the global convergence for PGiPN, we first show that A(xk ) keeps unchanged for
sufficiently large k under the following non-degeneracy assumption.

Assumption 2 For all x∗ ∈ Γ(x0 ), 0 ∈ ∇f (x∗ ) + ri(NΠ∗ (x∗ )).

It follows from Proposition 3 and Lemma 15 (iii) that for each x∗ ∈ Γ(x0 ), x∗ is a
stationary point of F , which together with Proposition 16 and Lemma 4 (i) yields that
0 ∈ ∇f (x∗ ) + NΠ∗ (x∗ ), so that Assumption 2 substantially requires that −∇f (x∗ ) does
not belong to the relative boundary2 of NΠ∗ (x∗ ). In the next lemma, we prove that under
Assumptions 1-2, A(xk ) = A(xk+1 ) for sufficiently large k.

2. For convex set Ξ ⊂ Rn , the set difference cl(Ξ)\ri(Ξ) is called the relative boundary of Ξ, see (Rockafellar,
1970, p. 44).

26
PGiPN for Fused Zero-norms Regularization Problems

Lemma 18 Let {xk }k∈N be the sequence generated by Algorithm 2. Suppose that Assump-
tions 1-2 hold. Then, there exist A∗ ⊂ [2n] and a closed and convex cone N ∗ ⊂ Rn such
that A(xk ) = A∗ and NΠ∗ (xk ) = N ∗ for sufficiently large k.

Proof We complete the proof via the following two claims.


Claim 1: limk→∞ kprojTΠ∗ (xk ) (−∇f (xk ))k = 0. Since Π∗ is polyhedral, for any x ∈ Π∗ ,
TΠ∗ (x) and NΠ∗ (x) are closed and convex cones, and TΠ∗ (x) is polar to NΠ∗ (x), which
implies that when k is sufficiently large, z = projTΠ∗ (xk ) (z) + projNΠ∗ (xk ) (z) holds for any
z ∈ Rn . Then, for all sufficiently large k,

kprojTΠ∗ (xk ) (−∇f (xk ))k = k −∇f (xk )−projNΠ∗ (xk ) (−∇f (xk ))k
= dist(0, ∂φ(xk )) = dist(0, ∂φ(xk−1 + dk−1 )),

where the third equality is due to Lemma 17. Thus, it suffices to prove that

lim dist(0, ∂φ(xk + dk )) = 0.


k→∞

For each k ∈ K2 , by equation (14), there exists ζk ∈ ∂Θk (y k ) = ∂Θk (xk +dk ) or equivalently
0 ∈ ∇f (xk )+Gk dk −ζk +NΠk (xk +dk ) such that kζk k is not more than the right hand side of
(14). Invoking Remark 12 (iv) and Lemma 15 (ii) yields that limk→∞ kζk k = 0. Moreover,
from Proposition 16, for k > k, the inclusion 0 ∈ ∇f (xk ) + Gk dk − ζk + NΠk (xk + dk ) is
equivalent to 0 ∈ ∇f (xk ) + Gk dk − ζk + NΠ∗ (xk + dk ). Note that ∂φ(xk+dk ) = ∇f (xk+dk ) +
NΠ∗ (xk +dk ) for each k > k. Then, ∇f (xk +dk )−∇f (xk ) − Gk dk +ζk ∈ ∂φ(xk +dk ) for each
k > k. This, by the continuity of ∇f , equation (43), Lemma 15 (ii), and limk→∞ kζk k = 0,
implies the desired limit limk→∞ dist(0, ∂φ(xk + dk )) = 0.
Claim 2: A(xk ) ⊂ A(xk+1 ) for sufficiently large k. If not, there exists an infi-
nite index set K ⊂ N such that A(xk ) 6⊂ A(xk+1 ) for all k ∈ K. If necessary taking a
subsequence, we assume that {xk }k∈K converges to x∗ . By Lemma 15 (ii), {xk+1 }k∈K con-
verges to x∗ . In addition, from Claim 1, limk→∞ kprojTΠ∗ (xk+1 ) (−∇f (xk+1 ))k = 0. The
two sides along with Assumption 2 and (Burke and Moré, 1988, Corollary 3.6) yields that
A(xk+1 ) = A(x∗ ) for all sufficiently large k ∈ K, contradicting to A(xk ) 6⊂ A(xk+1 ) for all
k ∈ K. The claimed inclusion holds for sufficiently large k.
From A(xk ) ⊂ A(xk+1 ) for sufficiently large k, {A(xk )}k∈N converges to for some
A∗ ⊂ [2n] in the sense of Painlevé-Kuratowski3 . From the discreteness of A∗ , we con-
clude that A(xk ) = A∗ for sufficiently large k. From the expression of Π∗ in (50) and
A(xk ) = A∗ for sufficiently large k, we have NΠ∗ (xk ) = N ∗ for sufficiently large k.

The global convergence of PGiPN additionally requires the following assumption.

Assumption 3 For every sufficiently large k, there exists ξk ∈ NΠ∗ (xk ) such that

−h∇f (xk ) + ξk , dk i
lim inf > 0.
k→∞ k∇f (xk ) + ξk kkdk k
3. A sequence of sets {C }k∈N with C k ⊂ Rn is said to converge in the sense of Painlevé-Kuratowski if
k

its outer limit set lim supk→∞ C k coincides with its inner limit set lim inf k→∞ C k . On the definition of
lim supk→∞ C k and lim inf k→∞ C k , see (Rockafellar and Wets, 2009, Definition 4.1).

27
Wu, Pan, and Yang

This assumption essentially requires for every sufficiently large k the existence of one
element ξk ∈ NΠ∗ (xk ) such that the angle between ∇f (xk ) + ξk and dk is uniformly larger
than π/2. For sufficiently large k, since xk +αdk ∈ Π∗ for all α ∈ [0, 1], we have dk ∈ TΠ∗ (xk ),
which implies that hξ k , dk i ≤ 0. Together with (40), for sufficiently large k, the angle
between ∇f (xk ) + ξk and dk is larger than π/2. This means that it is highly possible for
Assumption 3 to hold. When n = 1, it automatically holds.
Next, we show that if φ is a KL function and Assumptions 1-3 hold, the sequence
generated by PGiPN is Cauchy and converges to an L-stationary point.
Theorem 19 Let {xk }k∈N be the sequence generated Pby Algorithm 2. Suppose that Assump-
tions 1-3 hold, and that φ is a KL function. Then, ∞k=1 kx k+1−xk k < ∞, and consequently

{xk }k∈N converges to an L-stationary point of (1).


Proof By Proposition 16 and the expressions of F and φ, we have F (xk ) = φ(xk ) + λ1 |T | +
λ2 |S| for all k > k. Along with Lemma 15 (i), the sequence {φ(xk )}k>k is nonincreasing.
If there exists ek > k such that φ(xk ) = φ(xk+1 ), then F (xk ) = F (xk+1 ), which along with
e e e e

Lemma 15 (i) leads to xk = xk . Then, xk meets the termination condition of Algorithm 2, so


e e e

{xk }k∈N converges to an L-stationary point of (1) within a finite number of steps. Thus, we
only need to consider the case that φ(xk ) > φ(xk+1 ) for all k > k. By Proposition 16, for any
x ∈ Γ(x0 ), F ∗ = F (x) = φ(x)+λ1 |T |+λ2 |S| or equivalently φ(x) = φ∗ := F ∗ −λ1 |T |−λ2 |S|.
By (Bolte et al., 2014, Lemma 6), there exist ε > 0, η > 0 and a continuous concave function
ϕ ∈ Υη such that for all x ∈ Γ(x0 ) and x ∈ {z ∈ Rn | dist(z, Γ(x0 )) < ε} ∩ [φ∗ < φ < φ∗ + η],
ϕ0 (φ(x) − φ∗ )dist(0, ∂φ(x)) ≥ 1 where Υη is defined in Definition 6. Then, for k > k (if
necessary by increasing k), xk ∈ {z ∈ Rn | dist(z, Γ(x0 )) < ε} ∩ [φ∗ < φ < φ∗ + η], so
ϕ0 (φ(xk ) − φ∗ )dist(0, ∂φ(xk )) ≥ 1. (51)
By Assumption 3, there exist c > 0 and ξk ∈ NΠ∗ (xk ) such that for suffciently large k,
−h∇f (xk ) + ξk , dk i > ck∇f (xk ) + ξk kkdk k. (52)
From Lemma 18, NΠ∗ (xk ) = NΠ∗ (xk+1 ) for all k > k (by possibly enlarging k), which
implies that ξk ∈ NΠ∗ (xk+1 ). Together with (39), (52) and Lemma 17, for all k > k (if
necessary enlarging k), it holds that
φ(xk ) − φ(xk+1 ) −%h∇f (xk ) + ξk , dk i %ck∇f (xk ) + ξk kkdk k
≥ ≥ = %ckxk+1 −xk k, (53)
dist(0, ∂φ(xk )) dist(0, ∂φ(xk )) k∇f (xk ) + ξk k
where the second inequality follows by ∇f (xk ) + ξk ∈ ∂φ(xk ) and (52). For each k > k, let
∆k := ϕ(φ(xk )−φ∗ ). From (51), (53) and the concavity of ϕ on [0, η), for all k > k,
∆k − ∆k+1 = φ(xk ) − φ(xk+1 ) ≥ ϕ0 (φ(xk )−φ∗ )(φ(xk )−φ(xk+1 ))
φ(xk ) − φ(xk+1 )
≥ ≥ %ckxk+1 − xk k.
dist(0, ∂φ(xk ))
Summing this inequality from k to any k > k and using ∆k ≥ 0 yields that
k k
X 1 X 1 1
kxj+1 −xj k ≤ (∆j −∆j+1 ) = (∆k −∆k+1 ) ≤ ∆k .
%c %c %c
j=k j=k

28
PGiPN for Fused Zero-norms Regularization Problems

P∞
Passing the limit k → ∞ leads to j=k
kxj+1 − xj k < ∞. Thus, {xk }k∈N is a Cauchy
sequence and converges to x∗ . It follows from Lemma 15 (iii) that x∗ is an L-stationary
point of model (1). The proof is completed.

Remark 20 Since Π∗ is a semi-algebraic set, the function δΠ∗ is semi-algebraic. According


to the comments in Section 2.2, the function φ is necessarily a KL function whenever f
is definable in an o-minimal structure over the real field; for example, the least-squares
loss function f in Section 6.2, the logarithmic loss function f in Section 6.3, the logistic
regression loss, and the high order portfolio loss function (Zhou and Palomar (2021)) are
all definable in an o-minimal structure over the real field.

Next we focus on the superlinear rate analysis of PGiPN. For this purpose, define

X ∗ := x ∈ Γ(x0 ) | 0 ∈ ∇f (x) + NΠ∗ (x), ∇2 f (x)  0 ,




which is called the set of second-order stationary points of (49). By Lemma 15 (iii) and
Proposition 3, the set X ∗ is generally smaller than the set of stationary points of (1). We
assume that a local Hölderian error bound condition holds with respect to (w.r.t.) X ∗ in
Assumption 4. For more introduction on the Hölderian error bound condition, we refer the
interested readers to Mordukhovich et al. (2023) and Liu et al. (2024).

Assumption 4 The mapping Rn 3 x 7→ r(x) := x−projΠ∗ (x−∇f (x)) has the q-subregularity
with q ∈ (0, 1] at any x ∈ Γ(x0 ) for the origin w.r.t. the set X ∗ , i.e., for every x ∈ Γ(x0 ),
there exist ε > 0 and κ > 0 such that for all x ∈ B(x, ε), dist(x, X ∗ ) ≤ κkr(x)kq .

Recently, Liu et al. (2024) proposed an inexact regularized proximal Newton method
(IRPNM) for solving the composite problem, the minimization of the sum of a twice contin-
uously differentiable function and an extended real-valued convex function, which includes
(49) as a special case. They established the superlinear convergence rate of IRPNM under
Assumption 1, and Assumption 4 with projΠ∗ replaced by the proximal mapping of the
convex function. By (Sra, 2012, Lemma 4) and µk ∈ [µmin , µ e], kr(xk )k = O(krk (xk )k) for
sufficiently large k. This together with Assumption 4 implies that for every x ∈ Γ(x0 ), there
exist ε > 0 and κ b > 0 such that for sufficiently large k with xk ∈ B(x, ε),

dist(xk , X ∗ ) ≤ κ
bkrk (xk )kq . (54)

Recall that PGiPN finally reduces to an inexact projected regularized Newton method for
solving (49). From Lemma 13 (iii) and Lemma 17, for sufficiently large k,
1
Θk (xk+1 ) − Θk (xk ) ≤ 0 and kRk (xk+1 )k ≤ min{krk (xk )k, krk (xk )k1+ς }. (55)
2
Let Λik := Gik−∇2f (xk ) − b1 kµk (xk − xk )kσ I with Gik given by (35)-(37). Under Assumption
4, from (Wu et al., 2023, Lemma 4.8), (Liu et al., 2024, Lemma 4.4), and the fact that
G1k − G2k  0, it holds that for sufficiently large k,

max kΛ1k k2 , kΛ2k k2 , kΛ3k k2 = O(dist(xk , X ∗ )).



(56)

29
Wu, Pan, and Yang

In the rest of this section, for completeness, we provide the proof of the superlinear con-
vergence of PGiPN under Assumptions 1 and 4 though it is implied by that of Liu et al.
ek , x
(2024). To this end, for each k ∈ K2 , define x bk and fk as follows.
1
fk (x) := f (xk ) + ∇f (xk )> (x − xk ) + (x − xk )> Gk (x − xk );
2
xek : the exact solution to problem (11); x bk ∈ projX ∗ (xk ).

We first bound the gap between y k and x


ek from above in terms of krk (xk )k.
Lemma 21 There exist γ1 > 0 and γ2 > 0 such that for every k ∈ K2 , ky k − x
ek k ≤
k
γ1 krk (x )k1+ς k
+ γ2 krk (x )k 1+ς−σ .
Proof Fix any k ∈ K2 . Recall that Rk (y k ) = µk [y k −projΠ∗ (y k −µ−1 k
k ∇fk (y ))]. Invoking
the relation projΠ∗ = (I + NΠ∗ )−1 by the convexity of Π∗ , where I is the identity mapping,
we have Rk (y k ) − ∇fk (y k ) ∈ NΠ∗ (y k − µ−1 k
k Rk (y )). Along with Θk = fk + δΠ∗ , it holds

Rk (y k ) + ∇fk (y k − µ−1 k k k −1 k
k Rk (y )) − ∇fk (y ) ∈ ∂Θk (y − µk Rk (y )).

Note that ∇fk (x) = ∇f (xk ) + Gk (x − xk ). The above inclusion can be simplified as

(I − µ−1 k k −1 k
k Gk )Rk (y ) ∈ ∂Θk (y − µk Rk (y )).

On the other hand, from the definition of x ek , we have 0 ∈ ∂Θk (e xk ). Together with the
above inclusion and the strong monotoncity of ∂Θk with model b1 krk (xk )kσ , it follows that
D E
(I − µ−1
k G k )Rk (y k
), y k
− µ−1
k Rk (y k
) − x
e k
≥ b1 krk (xk )kσ ky k − µ−1 k
ek k2 .
k Rk (y ) − x

Using the Cauchy-Schwarz inequality leads to

ky k − µ−1 k
ek k ≤ (b−1
k Rk (y ) − x
k −σ
1 krk (x )k )k(I − µ−1 k
k Gk )Rk (y )k
1 −1 k 1+ς (1 + µ−1
min c)
≤ (1 + µ min kG k
k 2 )krk (x )k ≤ krk (xk )k1+ς−σ ,
2b1 krk (xk )kσ 2b1
where the second inequality is due to (55) and µk ≥ µmin , and the third is by (43). Note
that y k = xk+1 by Lemma 17. From the above inequality and the second one of (55),

1 (1 + µ−1
min c)
ky k − x
ek k ≤ krk (xk )k1+ς + krk (xk )k1+ς−σ ,
2µmin 2b1
1 (1+µ−1
min c)
and the desired result holds with γ1 := 2µmin and γ2 := 2b1 .

The following lemma bounds the gap between xk and x


ek by following the similar line of
(Liu et al., 2024, Lemma 6).
Lemma 22 Consider any x ∈ Γ(x0 ). Under Assumptions 1 and 4, there exist 1 > 0 and
L2 > 0 such that for all xk ∈ B(x, 1 ),

L2 dist(xk , X ∗ )
 
kΛk k2
k
kx − x k
e k≤ + + 2 dist(xk , X ∗ ).
2b1 krk (xk )kσ b1 krk (xk )kσ

30
PGiPN for Fused Zero-norms Regularization Problems

Proof From Assumption 1, there exist 0 > 0 and L2 > 0 such that for any x, x0 ∈ B(x, 0 ),

k∇2 f (x) − ∇2 f (x0 )k ≤ L2 kx − x0 k. (57)


From Assumption 4, x ∈ X ∗ . Recall that x bk ∈ projX ∗ (xk ). By taking 1 = 0 /2, for
k
x ∈ B(x, 1 ), it holds kbk k
x − xk ≤ kx − xb k + kxk − xk ≤ 2kxk − xk ≤ 0 . Therefore, for
k
k
x ∈ B(x, 1 ), we deduce from (57) that

xk ) − ∇f (xk ) − ∇2 f (xk )(b


k∇f (b xk − xk )k
Z 1 (58)
L2 k
= [∇2 f (xk + t(b
xk − xk )) − ∇2 f (xk )](b
xk − xk )dt ≤ x − xk k2 .
kb
0 2

ek , 0 ∈ ∇f (xk ) + Gk (e
By the definition of x xk − xk ) + NΠ∗ (e
xk ); while by the definition of x
bk ,
0 ∈ ∇f (bk
x ) + NΠ∗ (b k
x ). Using the monotoncity of NΠ∗ results in

0 ≤ h∇f (xk ) + Gk (e
xk − xk ) − ∇f (b
xk ), x
bk − x
ek i
= h∇f (xk ) + Gk (b
xk − xk ) − ∇f (b
xk ), x
bk − x
ek i − hGk (b
xk − x
ek ), x
bk − x
ek i.

This implies that

b1 krk (xk )kσ kb


xk − x
ek k ≤ λmin (Gk )kb
xk − x
ek k ≤ k∇f (b
xk ) − ∇f (xk ) − Gk (b
xk − xk )k
xk ) − ∇f (xk ) − ∇2 f (xk )(b
≤ k∇f (b xk − xk )k + kΛk k2 kb
xk − xk k + b1 krk (xk )kσ kb
xk − xk k
L2 k
≤ x − xk k2 + kΛk k2 kb
kb xk − xk k + b1 krk (xk )kσ kb
xk − xk k,
2
where the first inequality is by the expression of Gk and (42), and the fourth follows (58). Re-
xk −xk k2
L2 kb xk −xk k
arranging the above inequality, we obtain kb xk −exk k ≤ 2b kr (xk )kσ
+ kΛb kkr
k2 kb
(xk )kσ
xk −xk k.
+kb
1 k 1 k
bk k = dist(xk , X ∗ ).
Then, the desired result holds by the triangle inequality and kxk − x

Now we are ready to establish the supelinear convergence rate of the sequence. It is
noted that the proof is similar to that of (Liu et al., 2024, Theorem 6).

Theorem 23 Fix any x ∈ Γ(x0 ). Suppose that Assumption 1 holds, and Assumption 4
1
holds with q ∈ ( 1+σ , 1]. Then, the sequence {xk }k∈N converges to x with the Q-superlinear
convergence rate of order q(1+σ).

Proof If necessary enlarging k, we assume that xk ∈ B(x, 1 ) for k > k, where 1 is the
xk ) = 0 for k > k. This together
one in Lemma 22. From the definition of rk , we have rk (b
with the nonexpansive property of projΠ∗ yields that

krk (xk )k = µk kxk − projΠ∗ (xk − µ−1 k


xk − µ−1
bk + projΠ∗ (b
k ∇f (x )) − x xk ))k
k ∇f (b
(59)
≤ (2µk + L1 )dist(xk , X ∗ ) ≤ (2e
µ + L1 )dist(xk , X ∗ ).

In view of equation (56), if necessary enlarging k, there exists γ3 > 0 such that for k > k,

kΛk k2 ≤ γ3 dist(xk , X ∗ ). (60)

31
Wu, Pan, and Yang

From kdk k = ky k − xk k ≤ ky k − x ek k + ke
xk − xk k, Lemmas 21- 22, Assumption 4 and
equations (59)-(60), if necessary enlarging k, there exists γ4 > 0 such that for k > k,

kdk k ≤ γ4 dist(xk , X ∗ ). (61)

In addition, by virtue of equation (54) and Lemma 17, we obtain


h iq
dist(xk+1 , X ∗ ) ≤ κ
bkrk (xk+1 )kq = κ
b krk (xk+1 )k − kRk (xk+1 )k + kRk (y k )k
 q
k+1 k+1 1 k 1+ς
≤κb krk (x )k − kRk (x )k + krk (x )k (62)
2
 q
k+1 k+1 1 1+ς k ∗ 1+ς
≤κb krk (x )k − kRk (x )k + (2e µ + L1 ) dist(x , X ) ,
2

where the third inequality is due to (59). Next we bound the term |krk (xk+1 )k−kRk (xk+1 )k|.
If necessary enlarging k, we have for k > k,

krk (xk+1 )k − kRk (xk+1 )k ≤ k∇f (xk+1 ) − ∇f (xk ) − Gk (xk+1 − xk )k


L2 k+1
≤ kx − xk k2 + kΛk k2 kxk+1 − xk k + b1 krk (xk )kσ kxk+1 − xk k
2
L2 k 2
≤ kd k + γ3 kdk kdist(xk , X ∗ ) + b1 (2e µ + L1 )σ kdk kdist(xk , X ∗ )σ
2
L2 γ42
 
≤ + γ3 γ4 dist(xk , X ∗ )2 + b1 γ4 (2e
µ + L1 )σ dist(xk , X ∗ )1+σ ,
2

where the first inequality is by the definitions of rk and Rk and the nonexpansive property
of projΠk = projΠ∗ , the second one follows Assumption 1 and similar arguments for (58),
the third one follows equations (59)-(60), and the fourth is by (61). By combining the
L γ2
above inequality and (62) and letting γ5 := 22 4 + γ3 γ4 , γ6 := b1 γ4 (2e µ + L1 )σ and γ7 :=
1 1+ς
2 (2e
µ + L1 ) , it holds that for k > k (if necessary enlarging k),
h iq
dist(xk+1 , X ∗ ) ≤ κ
b γ5 dist(xk , X ∗ )2 + γ6 dist(xk , X ∗ )1+σ + γ7 dist(xk , X ∗ )1+ς
(63)
≤κb(γ5 + γ6 + γ7 )q dist(xk , X ∗ )q(1+σ) ,

where the last inequality follows by limk→∞ dist(xk , X ∗ ) = 0 and σ ≤ ς ≤ 1. The proof for
the result that {xk }k∈N converges to x at a superlinear convergence rate is similar to the
proof of (Liu et al., 2024, Theorem 6), and the details are omitted here.

Remark 24 When f is convex, X ∗ reduces to the set of L-stationary points of (49). In


this case, by Lemma 7, the local Hölderian error bound with q = 1 in Assumption 4 is
precisely the metric subregularity of the residual mapping r at x∗ for 0, which is equivalent
to that of ∂φ at x∗ for 0 by (Liu et al., 2024, Lemma 1). Due to the polyhedrality of Π∗ ,
the latter holds when f (·) = h(A ·) for some A ∈ Rm×n and a continuously differentiable
strictly convex h by following the same arguments as those for (Zhou and So, 2017, Theorem

32
PGiPN for Fused Zero-norms Regularization Problems

1 Pm
2). Thus, when h(u) = 12 kuk2 or h(u) = m m
i=1 log(1 + exp(−bi ui )) for u ∈ R , i.e., f
is the popular least-squares function or logistic regression function, Assumption 4 holds
automatically. In addition, when f is a piece-wise linear quadratic function, since ∂φ is a
polyhedral multifunction, the error bound condition automatically holds by (Robinson, 1981,
Proposition 1). Such loss functions, covering the Huber loss, the `1 -norm loss, the MCP
and SCAD loss, are often used to deal with outliers or heavy-tailed noise.

6. Numerical experiments
This section focuses on the numerical experiments of several variants of PGiPN for solving
a fused `0 -norms regularization problem with a box constraint. We first describe the imple-
mentation of Algorithm 2 in Section 6.1. In Section 6.2, we make comparison between model
(1) with the least-squares loss function f and the fused Lasso model (5) by using PGiPN to
solve the former and SSNAL (Li et al. (2018)) to solve the latter, to highlight the advantages
and disadvantages of our proposed fused `0 -norms regularization. Among others, the code
of SSNAL is available at (https://github.com/MatOpt/SuiteLasso). Finally, in Section 6.3,
we present some numerical results toward the comparison among several variants of PGiPN
and ZeroFPR and PG method for (1) in terms of efficiency and the quality of the output.
The MATLAB code of PGiPN is available at (https://github.com/yuqiawu/PGiPN).

6.1 Implementation of Algorithm 2


6.1.1 Computation of subproblem (11)
Suppose that ∅ 6= Skc := [n]\Sk . Based on the fact that every x ∈ Πk satisfies xSkc = 0,
we can obtain an approximate solution to (11) by solving a problem in a lower dimension.
Specifically, for each k ∈ K2 , write

b k := {v ∈ R|Sk | | B
Hk := (Gk )Sk Sk , v k := xkSk , ∇fSk (v k ) = [∇f (xk )]Sk , Π ek v = 0, lS ≤ v ≤ uS },
k k

where B
ek is the matrix obtained by removing the rows of BT c S whose elements are all zero.
k k
We turn to consider the following strongly convex optimization problem,
n 1 o
vbk ≈ arg min θk (v) := f (I·Sk v k )+h∇fSk (v k ), v −v k i+ (v−v k )> Hk (v−v k )+δΠ (v) . (64)
2
bk
v∈R|Sk |

The following lemma gives a way to find y k satisfying (13)-(14) by inexactly solving problem
(64), whose dimension is much smaller than that of (11) if |Sk |  n.

Lemma 25 Let ySk k = vbk and ySk c = 0. Then, Θk (y k ) = θk (b


v k ) and dist(0, ∂Θk (y k )) =
k
v k )). Consequently, the vector vbk satisfies
dist(0, ∂θk (b

min{µ−1
k , 1}
n o
v k )−θk (v k ) ≤ 0, dist(0, ∂θk (b
θk (b v k )) ≤ min kµk (xk −xk )k, kµk (xk −xk )k1+ς ,
2

if and only if the vector y k satisfies the inexact conditions in (13)-(14).

33
Wu, Pan, and Yang

Proof The first part is straightforward. We consider the second part. By the definition
of Θk , dist(0, ∂Θk (y k )) = dist(0, ∇f (xk ) + Gk (y k − xk ) + NΠk (y k )). Recall that Πk = {x ∈
Ω | BTkc · x = 0, xSkc = 0}. Then, NΠk (y k ) = Range(BT>c · ) + Range(IS>c · ) + NΩ (y k ), and
k k

dist(0, ∂Θk (y k )) = dist 0, ∇f (xk ) + Gk (y k − xk ) + Range(BT>c · ) + Range(IS>c · ) + NΩ (y k )



k k
k k k >
vk )

= dist 0, ∇fSk (v ) + Hk (b v − v ) + Range(BT c Sk ) + N[lS ,uS ] (b
k k k
k k k k k
v − v ) + NΠ
= dist(0, ∇fSk (v ) + Hk (b b k (b
v )) = dist(0, θk (b
v )),

where the second equality is using Range(IS>c · ) = {z ∈ Rn | zSk = 0}.


k

From the above discussions, we see that the computation of subproblem (11) involves
the projection onto Πk . Next we provide a method for computing it. Fix any k ∈ K2 . Given
z ∈ Rn , we consider the minimization problem on the projection onto Πk :
1
min kx − zk2 s.t. B
bT c · x = 0, xS c = 0, l ≤ x ≤ u. (65)
x∈Rn 2 k k

We provide a toy example to illustrate how to solve (65). Let xk = (1, 1, 2, 3, 3, 0, 0, 0)> ∈ R8 .
Since Tkc = {1, 4, 6, 7} and Skc = {6, 7, 8}, problem (65) can be written as

1
min kx − zk2 s.t. x1 = x2 , x4 = x5 , x6 = x7 = x8 = 0, l ≤ x ≤ u,
x∈R 2
8

which can be separated into the following four lower dimensional problems:

min (1/2)kx1:2 − z1:2 k2 s.t. x1 = x2 , l1:2 ≤ x1:3 ≤ u1:2 ;


x1:2 ∈R2
min (1/2)kx3 − z3 k2 s.t. l3 ≤ x3 ≤ u3 ;
x3 ∈R
min (1/2)kx4:5 − z4:5 k2 s.t. x4 = x5 , l4:5 ≤ x4:5 ≤ u4:5 ;
x4:5 ∈R2
min (1/2)kx6:8 − z6:8 k2 s.t. x6 = x7 = x8 = 0.
x6:8 ∈R3

Inspired by this toy example, there exists a smallest b j ∈ N such that the index set Tkc can
be partitioned into Tkc = i∈[bj] [i1 : i2 ]. Without loss of generality, we assume that these sets
S

are listed in an increasing order according to their left endpoints. Then, problem (65) can
be represented as
X1 X 1
minn kxi :i +1 − zi1 :i2 +1 k2 + (xi − zi )2
x∈R 2 1 2 2
(66)
S
i∈[b
j] i∈Tk \( i∈[bj] {i2 +1})

s.t. xSkc = 0; l ≤ x ≤ u; xk1 = xk2 for k1 , k2 ∈ [i1 : i2 + 1], ∀ i ∈ [b


j].
S
j + |Tk \( i∈[bj] {i2 + 1})|
From this equivalent expression, problem (65) can be separated into b
blocks. The following proposition shows that the unique global solution of (65) can be
characterized by those of every small block problems.

34
PGiPN for Fused Zero-norms Regularization Problems

Proposition 26 For each i ∈ [b j], if [i1 : i2 + 1] ∩ Skc 6= ∅, let x∗i1 :i2 +1 = 0; otherwise, let

xi1 :i2 +1 be the unique optimal solution to
1
arg min kv − zi1 :i2 +1 k2 s.t. li1 :i2 +1 ≤ v ≤ ui1 :i2 +1 , v1 = · · · = vi2 +2−i1 . (67)
2
v∈Ri2 +2−i1

For each i ∈ Tk \( i∈[bj] {i2 + 1}), if i ∈ Skc , let x∗i = 0; otherwise, let x∗i be the unique
S
optimal solution to
1
min (α − zi )2 s.t. li ≤ α ≤ ui . (68)
α∈R 2

Then, x∗ is the unique optimal solution to (65).


An elementary calculation yields the unique solution of (67) as v ∗ = αi∗ 1i2 +2−i1 with
n n Pz o o
∗ i1 :i2 +1
αi = min max , max{li1 :i2 +1 } , min{ui1 :i2 +1 } ,
i2 + 2 − i1
and the unique optimal solution to (68) is min{max{zi , li }}, ui }}. Together with Proposition
26, we conclude that the unique optimal solution to (65) is accessible.

6.1.2 Acceleration of Algorithm 2


Generally, when kBxk k0 or kxk k0 is large, it is difficult for the switch condition in (8) to
be satisfied, which will make PGiPN continuously execute PG steps. This phenomenon is
evident in the numerical experiment of the restoration of blurred images in Section 6.3.2.
To accelerate the iterations of Algorithm 2 or make its iterations enter in Newton steps
earlier, we introduce the following relaxed switch condition:
η1 n η2 n
k|sign(Bxk )| − |sign(Bxk )|k1 ≤ and k|sign(xk )| − |sign(xk )|k1 ≤ , (69)
k k
where η1 ≥ 0 and η2 ≥ 0 are two given constants. By following the arguments similar
to those for Lemma 13, Algorithm 2 equipped with (69) is also well defined. Obviously,
when ηki n ≥ 1, condition (69) allows the supports of Bxk and Bxk and xk and xk have
some difference; when ηki n < 1 (i = 1, 2), condition (69) is identical to (8). This means that
as k grows, Algorithm 2 with relaxed switch condition (69) will finally reduce to the one
with (8). Since our convergence analysis does not specify the initial point, the asymptotic
convergence results also hold for Algorithm 2 with condition (69).

6.1.3 Choice of parameters in Algorithm 2


We will test the performance of PGiPN with Gk = G2k given by (36), and PGiPN(r) that is
PGiPN with the relaxed switch condition (69). We apply Gurobi to solve subproblem (11)
with such Gk under inexact conditions (13) and (14) controlled by options params.Cutoff
and params.OptimalityTol, respectively. Also, we test PGilbfgs that is the same as PGiPN
except that the limited-memory BFGS (lbfgs) is used to construct Gk , i.e., to form Gk =
Bk + b1 kµk (xk − xk )kσ with Bk given by lbfgs. For solving (11) with such Gk , we use the
method introduced in Kanzow and Lechner (2022). The parameters of all the variants of
PGiPN are chosen as α = 10−8 , σ = 21 , % = 10−4 , β = 12 , ς = 32 , and b1 = 10−3 is used for
PGiPN and PGiPN(r), and b1 = 10−8 for PGilbfgs.

35
Wu, Pan, and Yang

We compare the numerical performance of PGiPNs with that of ZeroFPR (Themelis


et al. (2018)) and the PG method (Wright et al. (2009)). Among others, ZeroFPR uses the
lbfgs to minimize the forward-backward envelope of the objective, and its code package can
be downloaded from (http://github.com/kul-forbes/ForBES). We run it with the default
setting. In addition, the iteration steps of PG are the same as those of PGiPN without the
Newton steps, so that we can check the effect of the additional second-order step on PGiPN.
For this reason, the parameters of PG are chosen to be the same as those involved in PG
Step of PGiPN. We also observe that the sparsity of the output is very sensitive to µk in
Algorithm 2. To be fair, as the default setting in ZeroFPR, in all variants of PGiPN and
PG, we set µk = 0.95−1 kAk22 for all k ∈ N with kAk2 computed by the MATLAB sentences:
opt.issym = 1; opt.tol = 0.001; ATAmap = @(x) A’*A*x; L = eigs(ATAmap,n,1,‘LM’,opt)
For each solver, we set x0 = 0 and terminate at the iterate xk whenever k ≥ 5000 or
µk kxk − proxµ−1 g (xk − µ−1 k −4
k ∇f (x ))k∞ < 10 . All the numerical tests in this section are
k
conducted on a desktop running on 64-bit Windows System with an Intel(R) Core(TM)
i7-10700 CPU 2.90GHz and 32.0 GB RAM.

6.2 Model comparison with the fused Lasso


This subsection is devoted to examining the superiority and shortcoming of model (1) with
f (·) = 12 kA · −bk2 and B = B, b i.e. the fused `0 -norms regularization problem with a box
constraint (FZNS), compared with the fused Lasso (5). We apply PGiPN to solve FZNS,
and SSNAL to solve (5). Considering that the models to solve are different, we only compare
the quality of solutions returned by PGiPN and SSNAL, but do not do their running time.
Our first empirical study focuses on the ability of regression via a commonly used
dataset, prostate data. There are 97 observations and 9 features included in this dataset.
This data was used in Jiang et al. (2021) to check the performance of square root fused
Lasso. We randomly select 50 observations to form the training set, and obtain the training
data matrix A ∈ R50×8 . The corresponding responses are represented by b ∈ R50 . The rest
47 observations are left for testing set, which forms (Ā, b̄) with Ā ∈ R47×8 and b̄ ∈ R47 .
We employ PGiPN to solve FZNS, and SSNAL (Li et al. (2018)) to solve the fused Lasso
(5), with (A, b) given above, and [l, u] = 1000[−1, 1]. For each solver, we select 10 groups
of (λ1 , λ2 ) ∈ [0.003, 400] × [0.0003, 40], ensuring that the outputs exhibit different sparsity
levels. We record the sparsity and the testing error, where the latter is defined as kĀx∗ −b̄k
with x∗ being the output. The above procedure is repeated for 100 randomly constructed
(A, b) and a random (A, b) is tested with 10 groups of (λ1 , λ2 ), resulting in a total of 1000
recorded outputs for each model. All the sparsity pairs (kBx b ∗ k0 , kx∗ k0 ) from PGiPN and
SSNAL are recorded in lines 1, 4 and 7 of Table 1. For every sparsity pair, the average
testing errors of kĀx∗−b̄k for PGiPN and SSNAL corresponding to the given pair is recorded
in lines 2, 5 and 8 of Table 1, while the standard deviation of the results is recorded in its
lines 3, 6, and 9. Considering that the fused Lasso may produce solutions with components
Pk ↓
being very small but not equal to 0, we define kyk0 := min{k | i=1 |y|i ≥ 0.999kyk1 } as
in Li et al. (2018) for the outputs of the fused Lasso, where |y|↓ is the vector obtained by
sorting |y| in a nonincreasing order. As shown in Table 1, when (kBx b ∗ k0 , kx∗ k0 ) = (6, 6),
the average testing error for FZNS is the smallest among all the testing examples. Among
the total 20 experiment results, FZNS outperforms the fused Lasso for 13 cases. For the

36
PGiPN for Fused Zero-norms Regularization Problems

b ∗ k0 ≥ 4. This indicates that our model performs


rest 7 cases, there are 6 cases with kBx
better when kBx ∗
b k0 ≤ 3.

Table 1: Average testing error (FZNS|Fused Lasso) of the outputs.


b ∗ k0 , kx∗ k0 )
(kBx (2,1) (3,2) (3,3) (3,4) (3,5) (3,6) (3,8)
Average testing error 8.35|8.54 7.34|7.36 5.44|5.15 5.12|5.74 5.21|6.32 5.08|5.27 5.11|5.70
Standard deviation 0.48|0.37 0.76|0.74 0.27|0.30 1.06|1.10 0.35|0.26 0.30|0.22 0.32|0.24
∗ ∗
b k0 , kx k0 )
(kBx (4,5) (4,6) (4,7) (4,8) (5,5) (5,6) (5,7)
Average testing error 5.24|5.86 5.49|4.99 5.25|4.97 5.33|4.78 4.60|5.48 5.60|5.38 5.46|5.58
Standard deviation 1.02|1.15 0.48|0.46 0.28|0.21 0.38|0.29 0.41|1.61 0.61|0.71 0.29|0.40
b ∗ k0 , kx∗ k0 )
(kBx (5,8) (6,6) (6,7) (6,8) (7,7) (7,8)
Average testing error 5.35|5.19 4.41|5.26 5.34|4.95 5.20|5.34 5.13|5.22 5.27|5.22
Standard deviation 0.58|0.43 0.42|1.10 0.69|0.79 0.94|0.73 0.87|1.45 1.52|1.17

Our second numerical study is to evaluate the classification ability of these two models
with the TIMIT database (Acpistoc-Phonetic Continuous Speech Corpus, NTIS, US Dept of
Commerce), which consists of 4509 32ms speech frames and each speech frame is represented
by 512 samples of 16 KHz rate. The TIMIT database is collected from 437 male speakers.
Every speaker provided approximately two speech frames of each of five phonemes, where
the phonemes are “sh” as in “she”, “dcl” as in “dark”, “iy” as the vowel in “she”, “aa”
as the vowel in “dark”, and “ao” as the first vowel in “water”. This database is a widely
used resource for research in speech recognition. Following the approach described in Land
and Friedman (1997), we compute a log-periodogram from each speech frame, which is one
of the several widely used methods to generate speech data in a form suitable for speech
recognition. Consequently, the dataset comprises 4509 log-periodograms of length 256 (fre-
quency). It was highlighted in Land and Friedman (1997) that distinguishing between “aa”
and “ao” is particularly challenging. Our aim is to classify these sounds using FZNS and
the fused Lasso with λ2 = 0, l = −1 and u = 1, or in other words, the zero-order variable
fusion (3) plus a box constraint and the first-order variable fusion (4).
In TIMIT, the numbers of phonemes labeled “aa” and “ao” are 695 and 1022, re-
spectively. As in Land and Friedman (1997), we use the first 150 frequencies of the
log-periodograms because the remaining 106 frequencies do not appear to contain any in-
formation. We randomly select m1 samples labeled “aa” and m2 samples labeled “ao”
as training set, which together with their labels form A ∈ Rm×n and b ∈ Rm , with
m = m1 + m2 , n = 150, where bi = 1 if Ai· is labeled as “aa”, and bi = 2 otherwise.
The rest of dataset is left as the testing set, which forms Ā ∈ R(1717−m)×n , b̄1717−m , with
b̄i = 1 if Āi· is labeled as “aa” and b̄i = 2 otherwise. For (A, b), given 10 λ1 ’s randomly
selected within [2 × 10−5 , 300] such that the sparsity of the outputs kBx b ∗ k0 spans a wide
range. If Āi· x∗ ≤ 1.5, this phoneme is classified as “aa” and hence we set b̂i = 1; other-
wise, b̂i = 2. If b̂i 6= b̄i , Ai· is regarded as failure in classification. Then the error rate of
kb̄−b̂k1 b ∗ k0 and the error rate of classification.
classification is given by 1717−m . We record both kBx

37
Wu, Pan, and Yang

The above procedure is repeated for 30 groups of randomly generated (A, b), resulting
in 300 outputs for each solver. The four subfigures in Figure 1 present kBx b ∗ k0 and the error
rate of each output, with 4 different choices of (m1 , m2 ). We see that, for each subfigure the
output with the smallest error rate is always achieved by the fused `0 -norms regularization
model. It is apparent that FZNS generally performs better than the fused Lasso when
b ∗ k0 ≤ 30, while the average error rate of the fused Lasso is lower than that of FZNS
kBx
when kBx b ∗ k0 ≥ 60. This phenomenon is especially evident when m1 and m2 are small.

0.55 0.45
FZNS FZNS
Fused Lasso Fused Lasso
0.5
0.4

0.45

0.35
0.4
Error rate

Error rate
0.35 0.3

0.3
0.25

0.25

0.2
0.2

0.15 0.15
0 10 20 30 40 50 60 70 80 90 100 0 20 40 60 80 100 120
Sparsity of Bx * Sparsity of Bx *

(a) m1 = 25, m2 = 50 (b) m1 = 50, m2 = 100

0.45 0.45
FZNS FZNS
Fused Lasso Fused Lasso
0.4 0.4

0.35 0.35
Error rate

Error rate

0.3 0.3

0.25 0.25

0.2 0.2

0.15 0.15
0 20 40 60 80 100 120 0 20 40 60 80 100 120
Sparsity of Bx * Sparsity of Bx *

(c) m1 = 100, m2 = 200 (d) m1 = 200, m2 = 400

b ∗ k0 and the classification error rate for the outputs from FZNS and the fused
Figure 1: kBx
Lasso under different m1 , m2 .

The numerical results for these two empirical studies show that for prostate database,
our model outperforms the fused Lasso when the output is sufficiently sparse, that is,
b ∗ k0 ≤ 3, see the first two lines in Table 1, and for phoneme database, our model performs
kBx
better when kBx b ∗ k0 ≤ 30. We also observe that the numerical performance of the fused
`0 -norms regularization is not stable if the output is not sparse, especially when the number
of observations is small, so when using the fused `0 -norms regularization model, a careful
consideration should be given to selecting an appropriate penalty parameter. Moreover, for

38
PGiPN for Fused Zero-norms Regularization Problems

some optimal solution x∗ of the fused Lasso regularization problem, |Bx


b ∗ |min and |x∗ |min
may be very small but not equal to zero, which leads to a difficulty in interpreting what
the outputs mean in the real world application. This also well matches the statements in
Land and Friedman (1997) that the `0 -norm variable fusion produces simpler estimated
coefficient vectors.

6.3 Comparison with ZeroFPR and PG


This subsection focuses on the numerical comparisons among several variants of PGiPN,
ZeroFPR and PG, in terms of the number of iterations, the required CPU time, and the
quality of the outputs.

6.3.1 Classification of TIMIT


The experimental data used in this part is the TIMIT dataset, the one in Section 6.2. To
test the performance of the algorithms on (1) with nonconvex f , we consider solving model
(A ·−b)2i 
(1) with f (·) = m b l = −1 and u = 1, where A ∈ Rm×n
P
i=1 log 1 + ν , B = B,
m
represents the training data and b ∈ R is the vector of corresponding labels. It is worth
noting that the loss function is nonconvex, and as commented in Aravkin et al. (2012), this
loss function is effective to process data denoised by heavy-tailed Student’s t-noise.
Following the approach in Section 6.2, we use the first 150 frequencies of the log-
periodograms. For the training set, we arbitrarily select 200 samples labeled as “aa”
and 400 samples labeled as “ao”. These samples, along with their corresponding labels,
form the matrices A ∈ Rm×n and b ∈ Rm , with dimensions m = 600 and n = 150.
The remaining samples are designated as the testing set. Given a group of λc > 0, we
set λ1 = λc × 10−7 kA> bk∞ and λ2 = 0.1λ1 . We employ four solvers, including PGiPN,
PGilbfgs, PG and ZeroFPR, to solve model (1) with the above f , and then record the CPU
time and the error rate of classification on the testing set. This experimental procedure is
repeated for a total of 30 groups of (A, b). Figure 2 plots the average CPU time, error rate
and objective value associated with each λc , and their standard deviations are reported in
Table 2. Motivated by the experiment in Section 6.2, we also plot Figure 3 to show the
average kBxb ∗ k0 and error rate for all the tested cases produced by four solvers.

5 0.23 4.25
PGiPN
ZeroFPR
4 PGls 4.2
0.225
PGilbfgs

3 4.15
0.22
Obj value
error rate

2 4.1
Time

0.215
1 4.05
PGiPN
ZeroFPR 0.21
0 PGls 4
PGilbfgs PGiPN
ZeroFPR
0.205 PGls
-1 3.95
PGilbfgs

-2 0.2 3.9
0.01 0.04 0.07 0.1 0.4 0.7 1 4 7 10 0.01 0.04 0.07 0.1 0.4 0.7 1 4 7 10 0.01 0.04 0.07 0.1 0.4 0.7 1 4 7 10

c c c

(a) λc -log(time(seconds)) plot (b) λc -error rate plot (c) λc -obj value plot

Figure 2: The average CPU time and error rate of 30 examples for four solvers.

We see from Figure 2(a) that in terms of CPU time, PGiPN is always the best one, more
than ten times faster than other three solvers. The reason is that other three solvers depend

39
Wu, Pan, and Yang

Table 2: Standard deviation of CPU time, error rate and objective value in Figure 2.
λ 0.01 0.04 0.07 0.1 0.4 0.7 1 4 7 10
PGiPN 0.12 0.10 0.11 0.10 0.12 0.14 0.11 0.05 0.05 0.04
ZeroFPR 4.06 5.02 5.12 4.77 3.83 3.41 3.47 2.24 1.91 1.32
Time
PGls 1.75 2.27 3.03 2.13 3.01 3.51 3.11 3.18 2.75 2.82
PGilbfgs 6.18 5.77 5.74 5.14 3.73 10.73 9.46 20.92 18.31 16.38
PGiPN 0.009 0.008 0.008 0.007 0.008 0.008 0.009 0.009 0.009 0.008
ZeroFPR 0.010 0.008 0.007 0.007 0.008 0.010 0.010 0.008 0.009 0.010
Error rate
PGls 0.006 0.006 0.007 0.008 0.008 0.008 0.008 0.008 0.008 0.009
PGilbfgs 0.009 0.008 0.007 0.006 0.007 0.008 0.008 0.010 0.008 0.008
PGiPN 2.67 2.82 3.05 3.17 3.15 2.84 3.13 3.00 3.28 3.38
ZeroFPR 2.59 2.81 2.86 2.80 3.04 3.00 3.07 3.68 2.88 2.69
Obj
PGls 2.73 2.88 3.04 3.02 2.94 2.97 2.96 3.17 3.25 3.25
PGilbfgs 2.51 2.74 2.94 2.99 3.00 2.89 3.21 3.00 3.27 3.36

0.25

0.24

0.23
Error rate

0.22

0.21

0.2
PGiPN
ZeroFPR
0.19 PGilbfgs
PGls

0.18
0 20 40 60 80 100 120 140
Sparsity of Bx *

Figure 3: Scatter figure for all tested examples, recording the relationship of sparsity
b ∗ k0 ) and the error rate of classification.
(kBx

heavily on the proximal mapping of g, and its computation is a little time-consuming. The
fact that PGiPN always requires the least CPU time reflects the advantage of the projected
regularized Newton steps in PGiPN. From Figure 2(b), when λc = 1, PGiPN attains the
smallest average error rate among four solvers for 10 λc ’s. When λc is larger, say, λc > 0.4,
PGiPN and PGilbfgs tend to outperform ZeroFPR and PG in terms of the average error rate
and objective value; when λc is smaller, say, λc < 0.1, the solutions returned by PG have

40
PGiPN for Fused Zero-norms Regularization Problems

b ∗ returned by PG is sparser than


the best error rate among four solvers. This is because Bx
those returned by other three solvers under the same λc (see Figure 3), and the solutions
given by other three solvers with small λc are not sparse, which leads to high error rate.

6.3.2 Recovery of blurred images


Let x ∈ Rn with n = 2562 be a vector obtained by vectorizing a 256×256 image “camera-
man.tif” in MATLAB and then by scaling all the entries to be in [0, 1]. Let A ∈ Rn×n be a
matrix representing a Gaussian blur operator with standard deviation 4 and a filter size of
9, and let b ∈ Rm be the vector to represent a blurred image obtained by adding Gauss noise
e ∼ N (0, ) with  > 0 to Ax, i.e., b = Ax + e. We restore the blurred image by using model
b l = 0, u = 1 and λ1 = λ2 = 0.0005 × kA> bk∞ . We test
(1) with f (·) = 12 kA · −bk2 , B = B,
five solvers including PGiPN, PGiPN(r), PGilbfgs, ZeroFPR and PG. For PGiPN(r), the
constants η1 and η2 in (69) are set to be η1 = 0.01, η2 = 0.01. For these five solvers, we com-
pare their performance under different ’s in terms of the number of iterations (Iter), cpu
time (Time), F (x∗ ) (Fval), kx∗ k0 (xNnz), kBxb ∗ k0 (BxNnz) and the highest peak signal-to-
 
n
noise ratio (PSNR), where PSNR := 10 log10 kx−x ∗ k2 . In particular, to check the effect
of the Newton step for PGiPN, PGiPN(r) and PGilbfgs, we record the iterations in the
form M (Nf , Nt , Ne ), where M means the total iterations, Nf means the ordinal number of
iterations in which the first Newton step appears, Nt denotes the total number of Newton
steps, and Ne denotes the total number of Newton steps in the last 10 iterations of solves.
We record the cpu time for these three solvers by M (N ), where M is the total time and
N represents the time for the Newton steps. PSNR measures the quality of the restored
images, and the higher PSNR, the better the quality of restoration. Table 3 reports the
numerical results of five solvers, where the number marked in blue means the best one in
the same line, whereas the number marked in red means the worst one in the same line.
From Table 3, PGiPN(r) always performs the best in terms of time, which verifies
the effectiveness of the acceleration scheme proposed in Section 6.1.2. PGiPN is faster
than PGilbfgs, and PGilbfgs is faster than PG, supporting the effective acceleration of
the Newton steps. ZeroFPR is the most time-consuming, even worse than PG, a pure first-
order method. The reason is that ZeroFPR requires more line-searches, and each line-search
involves computing the proximal mapping of g once, which is expensive (2-5 seconds). We
observe that PGiPN requires fewer Newton steps than PGiPN(r). Almost all the Newton
steps of PGiPN appear at the end of the iterations, while more Newton steps of PGiPN(r)
appear along the PG steps. This implies that PGiPN(r), i.e., PGiPN with the relaxed
switching condition in (69), lacks stability.
Despite its superiority in running time, the solutions yielded by PGiPN(r) are of lower
quality. We also observe that kB̂x∗ k0 of PGiPN(r) is a little higher than that of PGiPN,
PGilbfgs and PG, because PGiPN(r) runs only a few PG steps, so that the structured
sparsity is not sufficiently promoted. Moreover, as the PSNR is closely related to kB̂x∗ k0 ,
this leads to the weakest performance of PGiPN(r) in terms of PSNR. Although ZeroFPR
always outputs solutions with the smallest objective values, its PSNR is not as good as its
objective values would suggest. The objective values of the outputs of PGiPN are a little
worse than those of the outputs of PGilbfgs and PG. However, making a trade-off between
speed and the quality of the outputs, we conclude that PGiPN is a good solver for this test.


Table 3: Numerical comparison of five solvers on the recovery of a blurred image with λ1 = λ2 = 0.0005kA> bk∞ .

Noise      Metric   PGiPN           PGiPN(r)        PGilbfgs          PG        ZeroFPR

ε = 0.01   Iter     119(106,5,4)    66(43,19,9)     444(106,120,7)    796       361
           Time     5.40e2(16.9)    3.49e2(49.4)    2.03e3(27.2)      3.46e3    2.28e4
           Fval     37.95           38.06           37.88             37.88     37.77
           xNnz     63805           63637           63858             63858     63717
           BxNnz    5995            6467            5778              5779      5834
           PSNR     25.77           25.47           25.90             25.90     25.91

ε = 0.02   Iter     153(144,4,4)    64(50,10,8)     324(144,86,7)     853       286
           Time     6.61e2(8.6)     3.07e2(28.1)    1.41e3(22.2)      3.61e3    1.81e4
           Fval     46.01           46.10           45.98             45.98     45.83
           xNnz     63485           63318           63495             63495     63350
           BxNnz    6176            6638            6098              6099      6143
           PSNR     25.36           24.81           25.42             25.42     25.33

ε = 0.03   Iter     140(135,3,3)    54(42,9,8)      320(135,99,8)     717       332
           Time     5.93e2(6.2)     2.47e2(19.8)    1.37e3(22.0)      2.99e3    1.91e4
           Fval     60.29           60.37           60.25             60.26     60.02
           xNnz     62998           62778           63006             63006     62800
           BxNnz    6665            7227            6572              6592      6710
           PSNR     24.86           24.12           24.90             24.90     24.76

ε = 0.04   Iter     161(153,3,3)    65(41,15,4)     306(155,56,4)     526       230
           Time     6.59e2(9.4)     3.04e2(41.3)    1.13e3(11.2)      2.10e3    1.12e4
           Fval     77.83           77.87           77.81             77.82     77.44
           xNnz     62098           61908           62104             62104     61853
           BxNnz    7294            7776            7264              7271      7427
           PSNR     24.17           23.47           24.20             24.20     24.00

ε = 0.05   Iter     108(101,3,3)    62(46,12,8)     353(101,100,6)    688       168
           Time     4.60e2(6.3)     2.81e2(28.6)    1.49e3(31.1)      2.73e3    5.93e3
           Fval     99.69           99.73           99.65             99.65     98.93
           xNnz     61362           61252           61377             61381     60963
           BxNnz    8056            8381            7951              7956      8240
           PSNR     23.30           22.85           23.37             23.37     22.87

Finally, we remark that in these experiments, we do not find that the Newton steps are
performed only toward the end of the algorithms for PGiPN, PGiPN(r) and PGilbfgs; that
is, some Newton steps are executed interleaved with the PG steps.

6.3.3 Numerical validation of Assumption 3

As one reviewer mentioned, due to the high degree of nonconvexity of model (1), it is not easy to
remove Assumption 3 from our global convergence result (see Theorem 19). In this part,


we conduct a numerical study of it. To this end, we introduce a specific choice of ξk . Let
ξk := −∇f (xk )−projNull(Ck ) (−∇f (xk )) with Ck = [BTkc ; ISkc ] for k ∈ K2 .

Obviously, for each k ∈ K2 , ξk ⊥Null(Ck ), which implies that ξk ∈ NNull(Ck ) (xk ) ⊂ NΠk (xk ).
The second inclusion is due to Πk ⊂ Null(Ck ) and the convexity of Πk and Null(Ck ).
We are ready to solve the problem in Section 6.3.1 with the termination condition
µk kx̄k − xk k∞ ≤ 10−8 . Each test generates a sequence {ak }k∈K2 with

ak := −h∇f (xk ) + ξk , dk i / (k∇f (xk ) + ξk k kdk k).
Since {ak }k∈K2 is a finite sequence, its infimum limit does not exist. Recall that for a
real-valued infinite sequence {bk }, lim inf k→∞ bk = supl∈N inf k≥l bk . Write the number of
elements of {ak }k∈K2 as t. For each test, we record the following quantity a as an
approximation to the lower limit:

a := supl∈[t] inf k≥l ak .

It is not hard to check that a = ak′ , where k′ is the maximum element of K2 . We solve
the problem for 10 different λc ’s and 10 different groups of (A, b), which yields 100 values
of a from 100 experiments. We store these 100 values in a MATLAB variable cosinelist, and
find that min(cosinelist) = 0.0025, mean(cosinelist) = 0.0761 and std(cosinelist) = 0.0650.
This indicates that Assumption 3 is very likely to hold.
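For reference, the quantities above can be formed as in the following minimal MATLAB sketch for a single k ∈ K2 ; the dimension, the random data and the index sets are purely illustrative stand-ins for an actual iterate, and the variable names are ours.

    % Illustrative computation of xi^k and a_k (random stand-in data).
    n  = 100;
    g  = randn(n,1);                              % stands in for grad f(x^k)
    d  = randn(n,1);                              % stands in for the Newton direction d^k
    Bh = spdiags([ones(n-1,1) -ones(n-1,1)], [0 1], n-1, n);  % first-order difference matrix
    Tc = 1:2:(n-1);  Sc = 1:3:n;                  % stand-ins for the index sets T_k^c and S_k^c
    In = speye(n);
    C  = full([Bh(Tc,:); In(Sc,:)]);              % C_k = [B_{T_k^c}; I_{S_k^c}]
    projNull = @(v) v - pinv(C)*(C*v);            % projection onto Null(C_k)
    xi = -g - projNull(-g);                       % xi^k, orthogonal to Null(C_k) by construction
    ak = -((g + xi)'*d)/(norm(g + xi)*norm(d));   % the cosine a_k recorded for each k in K2
    % Over a whole run, a = sup_{l in [t]} inf_{k>=l} a_k reduces to the last recorded a_k.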

7. Conclusions
In this paper, we proposed a hybrid of the PG method and an inexact projected regularized
Newton method for solving the fused `0 -norms regularization problem (1). This hybrid
framework fully exploits the advantages of the PG method and the Newton method, while
avoiding their disadvantages. We employed the KL property to prove the full convergence
of the generated iterate sequence under a curvature condition (Assumption 3) on f , without
assuming the uniform positive definiteness of the regularized Hessian matrices, and also
obtained a superlinear convergence rate under a Hölderian local error bound on the set of
second-order stationary points, without assuming the local optimality of the limit point.
PGiPN, ZeroFPR and PG all employ the polynomial-time algorithm developed in Section 3.3
of this paper to compute a point in the proximal mapping of g with B = B̂. Numerical
tests indicate that our PGiPN not only produces solutions of better quality, but also requires
2-3 times less running time than PG and ZeroFPR; the latter is mainly attributable to our
subspace strategy when applying the projected regularized Newton method to solve the
subproblems. It would be an interesting topic to extend the polynomial-time algorithm in
Section 3.3 to the case where B has other special structures.

Acknowledgments

The authors would like to thank the editor and the two anonymous referees for their valuable
suggestions, which allowed them to improve the quality of the paper.
The second author’s work was supported by the National Natural Science Foundation of
China under project No.12371299, and the third author’s research was partially supported
by Research Grants Council of Hong Kong SAR, P.R. China (PolyU15209921).


