Nonsmooth Low-rank Matrix Recovery: Methodology, Theory and Algorithm
Wei Tu¹, Peng Liu², Yi Liu³, Guodong Li⁴, Bei Jiang³, Linglong Kong*³, Hengshuai Yao⁵, and Shangling Jiu⁶

¹ Department of Public Health Sciences and Canadian Cancer Trials Group, Queen's University, Canada
² School of Mathematics, Statistics and Actuarial Science, University of Kent, United Kingdom
³ Department of Mathematical and Statistical Sciences, University of Alberta, Canada
⁴ Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong
⁵ Huawei Hisilicon, Edmonton, Canada
⁶ Huawei Hisilicon, Shanghai, China
* Corresponding author: [email protected]
1 Introduction
Many problems in statistics and machine learning can be formulated in the following form:
$$\min_{x} F(x) = f(x) + g(x),$$
where x is the parameter, f is the loss and g is the regularizer. Examples include penalized regression in high-dimensional feature selection [37] and low-rank matrix/tensor
recovery. Typically, both f(x) and g(x) are proper convex functions; for example, the L2 loss is a common choice for f(x). However, in some problems, a nonsmooth or even nonconvex loss function or regularizer is needed because of sparsity, robustness or other structural requirements on the parameter space. In this paper we propose a general framework for problems with a nonsmooth loss or regularizer; specifically, we use low-rank matrix recovery as an example to illustrate the main idea.
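As a simple illustration of this composite form (our example; it need not be the exact formulation in [37]), penalized regression with an L1 penalty takes
$$f(x) = \frac{1}{2}\lVert y - Xx\rVert_2^2, \qquad g(x) = \lambda \lVert x \rVert_1,$$
where the penalty $g$ is convex but nonsmooth at the origin, which is exactly the kind of nonsmoothness this framework targets.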
In practice, many high-dimensional matrices essentially have a low-rank structure; see, for example, recommender systems [32], stochastic system realization [30] in systems and control, computer algebra [4], psychometrics [23] and even quantum state tomography [16] in physics. Meanwhile, these matrices are usually only partially observed, with many entries missing for a variety of reasons. For example, we can only observe a few ratings from any particular recommender system, and quantum states have an exponentially large size, so it is not possible to obtain observations at a comparable scale. For these partially observed matrices with a high missing rate, it is of interest to ask "How can the matrix with low-rank structure be estimated?" or "How can the low-rank matrix be recovered effectively?". This leads to the important problem of low-rank matrix recovery.
Since the Netflix prize competition in 2009, matrix factorization has been shown to outperform traditional nearest-neighbor based techniques, in the sense that it allows the incorporation of additional information such as temporal effects and confidence levels [22]. The basic idea of matrix factorization is to decompose the target matrix $M^* \in \mathbb{R}^{m\times n}$ into a bilinear form:
$$M^* = U^\top V,$$
where $U \in \mathbb{R}^{r\times m}$ and $V \in \mathbb{R}^{r\times n}$ for some small rank $r$.
[...]tion in many areas, such as complementarity problems, optimal control and eigenvalue optimization, and it has been shown to be efficient even in the case of nonsmooth constraints; see, for example, [2], [10] and [12].
This paper considers the problem of low-rank matrix recovery from linear measure-
ments. A general nonsmooth loss function is considered here, and Nesterov’s smooth-
ing method [26] is then applied to obtain an optimal smooth approximation. In practice,
according to the specific nature of the problem and data, one can choose a suitable non-
smooth loss function, satisfying Nesterov’s assumptions in [26], such that an efficient
algorithm can be obtained. Due to the bilinear structure of matrix factorization, the al-
ternating minimization method is thus employed to search for the solutions and, at each
step, we compare the performance of various algorithms, which are based on gradient
descent and momentum. Compared with previous work, this paper is more general in
the following ways: 1) the transformation matrices we consider are more general; 2) a strong convergence guarantee is established for the proposed algorithm; 3) different state-of-the-art gradient-based algorithms, namely vanilla gradient descent, Nesterov's momentum method [26], Adam [21] and the YellowFin algorithm [42], are used and compared. All of these algorithms substantially improve on the performance of the original nonsmooth formulation.
The rest of the paper is organized as follows. In section 2, we introduce the notation and formulate the problem mathematically. The proposed algorithms are presented in detail in section 3, and the theoretical convergence analysis can be found in section 4. In section 5, we illustrate the effectiveness of the proposed framework using the widely used L1 loss as a special example, and different gradient- and momentum-based algorithms are compared. All proofs for the theorems presented in the main paper can be found in the appendix.
2 Methodology Framework
This paper considers the model
$$b_i = \langle A_i, M^* \rangle + \epsilon_i, \quad i = 1, \ldots, p, \qquad (1)$$
where $M^*$ is the true value, $\{A_i, b_i\}$, $i = 1, \ldots, p$, are observed, and $\epsilon_i$ is the error term. Here $A_i \in \mathbb{R}^{m\times n}$, $1 \le i \le p$, are given transformation matrices. Suppose that the rank of the matrix $M^* \in \mathbb{R}^{m\times n}$ is no more than $r$ with $r \ll \min(m, n, p)$. We then have $M^* = U^{*\top} V^*$, where $U^* \in \mathbb{R}^{r\times m}$ and $V^* \in \mathbb{R}^{r\times n}$, and the following optimization problem can be used to recover $M^*$:
$$\min_{U,V}\ \frac{1}{p}\sum_{i=1}^{p} f\big(b_i - \langle A_i, U^\top V \rangle\big). \qquad (2)$$
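To fix ideas, here is a minimal numerical sketch of model (1) and objective (2) in Python (our illustration, not code from the paper): the L1 loss used later in Section 5 stands in for f, the dimensions follow the simulation settings of Section 5, and a small Gaussian error term is added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, p = 32, 32, 10, 512          # illustrative sizes, as in Section 5

# Ground truth M* = U*^T V* and random Gaussian transformation matrices A_i
U_star = rng.standard_normal((r, m))
V_star = rng.standard_normal((r, n))
M_star = U_star.T @ V_star
A = rng.standard_normal((p, m, n))

# Linear measurements b_i = <A_i, M*> + eps_i  (model (1))
eps = 0.1 * rng.standard_normal(p)
b = np.einsum('pij,ij->p', A, M_star) + eps

def objective(U, V, loss=np.abs):
    """Empirical objective (2): (1/p) * sum_i f(b_i - <A_i, U^T V>), here with f = |.|."""
    residuals = b - np.einsum('pij,ij->p', A, U.T @ V)
    return loss(residuals).mean()

print(objective(U_star, V_star))      # small value at the truth (noise level only)
```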
Assume that f(·) is differentiable almost everywhere and has the following structure:
$$f\big(b - \mathcal{A}(U^\top V)\big) = \hat f\big(b - \mathcal{A}(U^\top V)\big) + \max_{u}\Big\{ \big\langle B\big(b - \mathcal{A}(U^\top V)\big), u \big\rangle - \hat\phi(u) \Big\}, \qquad (4)$$
where $\hat f$ is a continuous and convex function; see [26]. We can then obtain the following optimal smooth approximation:

[...]

where the ad hoc choice of $\lambda$ is $1/16$; the resulting objective function is smooth with respect to $U$ and $V$ and is denoted by $f_\mu^\lambda(U, V)$. The original nonsmooth optimization problem corresponds to
$$\min_{U,V}\ f\big(b - \mathcal{A}(U^\top V)\big) + \lambda\,\lVert U U^\top - V V^\top \rVert_F^2. \qquad (7)$$
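To make the smoothing step concrete, here is a special case worked out by us (it is not a display from the paper): take the L1 loss $f(a) = |a|$, which fits structure (4) with $\hat f \equiv 0$, $B$ the identity and the max taken over $u \in [-1, 1]$. Nesterov's smoothing with prox-function $d(u) = u^2/2$ then gives
$$f_\mu(a) = \max_{|u|\le 1}\Big\{ a\,u - \frac{\mu}{2}\,u^2 \Big\} = \begin{cases} \dfrac{a^2}{2\mu}, & |a| \le \mu, \\[4pt] |a| - \dfrac{\mu}{2}, & \text{otherwise}, \end{cases}$$
which is exactly the Huber loss (10) used in Section 5. Writing the penalized smoothed objective in the notation of Section 4 (our reading, consistent with (2) and (7)) gives
$$f_\mu^\lambda(U, V) = \frac{1}{p}\sum_{i=1}^{p} f_\mu\big(b_i - \langle A_i, U^\top V\rangle\big) + \lambda\,\lVert U U^\top - V V^\top \rVert_F^2 .$$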
3 Algorithm
There are three steps in our algorithm, and the first one is the initialization; see Algo-
rithm 1. When applying the alternating minimization at the second step, if the initial
values of U and V are orthogonal to the true ones, then our algorithm may never con-
verge. To avoid this situation, we adopt the singular value projection (SVP) to provide
some starting values of U and V for the alternating minimization, and the SVP was
proposed by [19] and later used by [15], [1], [38] and so on.
Algorithm 1 can be written as
$$M^{t+1} \leftarrow P_r\big( M^t - \xi_t\, \nabla_M f_\mu\big(b - \mathcal{A}(M^t)\big) \big),$$
where $P_r$ denotes the projection onto the space of rank-$r$ matrices. Actually, Algorithm 1 can be directly used to recover the matrix $M^*$ if it is iterated sufficiently many times; see [19]. However, the singular value calculation here is time-consuming when the dimension of the matrix $M^*$ is large. Furthermore, as an initialization, we do not need a very small tolerance $\epsilon$; a rough output is sufficient.
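Below is a minimal sketch of one such iteration (our illustration, not the paper's implementation): the rank-r projection P_r is computed by a truncated SVD, and the gradient of the smoothed loss is assumed to be supplied as a function of M.

```python
import numpy as np

def project_rank_r(M, r):
    """P_r: project onto the set of matrices of rank at most r via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def svp_step(M, grad_fmu, step, r):
    """One SVP iteration: M <- P_r(M - xi_t * grad_M f_mu(b - A(M)))."""
    return project_rank_r(M - step * grad_fmu(M), r)
```

For the Huber/L1 case of Section 5, grad_fmu(M) would return $-\frac{1}{p}\sum_i \mathrm{clip}\big((b_i - \langle A_i, M\rangle)/\mu,\,-1,\,1\big)\,A_i$ (our derivation from (10)).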
The second step is the alternating minimization; see Algorithm 2. Specifically, in
each iteration, we will keep one of U and V fixed, optimize over the other, and then
switch in the next iteration until it converges. We then take the final values of U and V ,
denoted by Û and V̂ , as solutions.
The third step is to update the values of U and V, respectively, at each iteration of the alternating minimization in steps 1.1 and 1.2. There are various methods for this in the literature. [20] and [38] used the vanilla gradient descent method, which may be slow in our nonsmooth matrix factorization setting. Algorithm 3 introduces Nesterov's momentum method to update the value of U while that of V is fixed; similarly, we can give the algorithm to update the value of V.
In Algorithm 3, $\nu_{(i)}^t$ stands for the momentum term, $\gamma$ is the momentum parameter, and $\eta$ is the learning rate. Usually $\gamma$ is chosen to be around 0.9 [36, 42]. For the sake of comparison, we also consider two other momentum-based algorithms: Adam [21] and YellowFin [42].
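A minimal sketch of the Nesterov-momentum update for U with V held fixed (our illustration; grad_U is assumed to return the gradient of the smoothed objective with respect to U, and gamma, eta correspond to γ and η above):

```python
import numpy as np

def nag_update_U(U, velocity, grad_U, gamma=0.9, eta=1e-3):
    """One Nesterov momentum step on U while V is held fixed.

    grad_U: callable returning the gradient of the smoothed objective with
            respect to U at a given point (the fixed V is absorbed into it).
    """
    lookahead = U - gamma * velocity              # gradient evaluated at the look-ahead point
    velocity = gamma * velocity + eta * grad_U(lookahead)
    return U - velocity, velocity
```

The update for V with U fixed is symmetric.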
4 Convergence analysis
Denote $\{\hat U^\mu, \hat V^\mu\} = \arg\min_{U,V} f_\mu^\lambda(U, V)$ and $\{\hat U, \hat V\} = \arg\min_{U,V} f^\lambda(U, V)$. The following theorem guarantees that the optimal solution of the smoothed objective function converges to that of the nonsmooth one.
[...] Here
$$\mathrm{dist}(U, U^\dagger) = \min_{R \in \mathbb{R}^{r\times r}:\ R^\top R = I_r} \lVert U - R\,U^\dagger \rVert_F,$$
and the initial solution satisfies
$$\mathrm{dist}\left(\begin{bmatrix} U^0 \\ V^0 \end{bmatrix}, \begin{bmatrix} U \\ V \end{bmatrix}\right) \le \frac{1}{4}\,\sigma_r(U). \qquad (8)$$
Furthermore, starting from any initial solution obeying (8), the t-th iterate of Algorithm 2 satisfies
$$\mathrm{dist}\left(\begin{bmatrix} U^t \\ V^t \end{bmatrix}, \begin{bmatrix} U \\ V \end{bmatrix}\right) \le \frac{1}{4}\,(1 - \tilde\tau_1)^t\, \frac{\tilde\mu}{\tilde\xi}\, \frac{1 + \delta_r}{1 - \delta_r}\; \sigma_r(U). \qquad (9)$$
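The distance dist(U, U†) above can be computed in closed form via the orthogonal Procrustes problem; a small sketch (our illustration):

```python
import numpy as np

def dist(U, U_dag):
    """min over orthogonal R of ||U - R @ U_dag||_F (orthogonal Procrustes solution)."""
    P, _, Qt = np.linalg.svd(U_dag @ U.T)   # SVD of the r x r cross matrix
    R = Qt.T @ P.T                          # optimal rotation
    return np.linalg.norm(U - R @ U_dag)
```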
We make several contributions in Theorem 2. First, we extend the SVP algorithm of [19] from least-squares matrix factorization to nonsmooth matrix factorization. Second, we provide theoretical convergence guarantees for alternating minimization with Nesterov's momentum method for a general objective function, which generalizes the linear convergence guarantee of [38] for alternating minimization with gradient descent under a least-squares objective. We also note that, although [41] also proposed a smoothing approximation using Nesterov's smoothing technique, they considered the nonnegative matrix factorization case, which is much simpler than ours, and they did not provide a rigorous theoretical analysis.
5 Numerical Studies
In the simulation study, we use the L1 loss as an example to illustrate the practical applicability of our theoretical results and to provide insights into the choice of algorithm. The corresponding smoothed approximation of the L1 loss is the popular Huber loss function (Huber 1981) in robust statistics. The Huber loss function is defined as
$$L_\mu(a) = \begin{cases} \dfrac{a^2}{2\mu}, & |a| \le \mu, \\[4pt] |a| - \dfrac{\mu}{2}, & \text{otherwise}. \end{cases} \qquad (10)$$
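A direct, vectorized implementation of (10) (a minimal sketch; the default value 1.35 mirrors the "Huber 1.35" setting in Figure 1):

```python
import numpy as np

def huber_loss(a, mu=1.35):
    """Huber loss L_mu(a) from (10): quadratic for |a| <= mu, linear otherwise."""
    a = np.asarray(a, dtype=float)
    return np.where(np.abs(a) <= mu, a**2 / (2.0 * mu), np.abs(a) - mu / 2.0)
```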
The dataset is generated in the following manner: all entries of $U \in \mathbb{R}^{r\times m}$, $V \in \mathbb{R}^{r\times n}$ and $A_i \in \mathbb{R}^{m\times n}$ are independently sampled from a Gaussian distribution $N(0, 1)$. The ground truth $M^*$ to recover is then calculated as $M^* = U^\top V$, and the rank of $M^*$ is $r$. The observations $b_i$ are generated following the data generating process in (1).
To evaluate the performance of the algorithms, the following two metrics are used: 1) the value of the loss function; 2) the relative recovery error $\lVert \hat U^\top \hat V - M^* \rVert_F / \lVert M^* \rVert_F$, where $\hat U$ and $\hat V$ are the estimates of $U$ and $V$ obtained with the proposed smoothing method. For both metrics, a smaller value is desired.
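The second metric is straightforward to compute (a sketch, with M_star, U_hat and V_hat as in the earlier snippet):

```python
import numpy as np

def relative_recovery_error(U_hat, V_hat, M_star):
    """Relative recovery error ||U_hat^T V_hat - M*||_F / ||M*||_F."""
    return np.linalg.norm(U_hat.T @ V_hat - M_star) / np.linalg.norm(M_star)
```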
Fig. 1: The comparison of different loss functions. Here Huber 1.35 in the legend denotes Huber
loss with µ = 1.35.
Smoothing parameter µ: The smoothing parameter µ in (10) controls the tradeoff between smoothness and precision. In this simulation we aim to investigate the relationship between the choice of µ and the relative recovery error. Without loss of generality, m = n = 32, p = 512 and r = 10 are used here. To enhance the difference between the L1 loss and the L2 loss, all observations b_i have been contaminated by Cauchy noise e_i, and Nesterov's momentum method is used for this simulation.
Figure 2 shows the relationship between the smoothing parameter and the relative recovery error; the x axis is the natural-logarithm-transformed µ. We observe three interesting patterns from this plot: 1) the performance of the algorithm is not sensitive to the choice of µ, since any reasonably small choice of µ results in a small recovery error; 2) as µ approaches 0, the Huber loss becomes very close to the L1 loss, and the recovery error increases slightly, which might be due to the nonsmoothness of the L1 loss; 3) as µ approaches +∞, the Huber loss becomes very close to the L2 loss, and the L2 loss does not work well here because of the Cauchy noise added to the data. The empirical results also further verify our theoretical result in Theorem 1: as the smoothing parameter approaches 0, the optimal solution of the smoothed objective function approaches that of the nonsmooth one (here, the L1 loss).
Fig. 2: Relative recovery error under different choices of the smoothing parameter µ. The horizontal axis is µ on the logarithm scale, and the vertical axis is the relative recovery error.
Table 1: Number of iterations needed to reach a relative recovery error smaller than 20%, 10%, 5% and 1% for each algorithm under the no-contamination setting

Algorithm    20%     10%     5%       1%
Adam         775     962     1105     1251
GD           3799    4788    > 5000   > 5000
NAG          709     881     1103     1148
YellowFin    1341    1509    1668     1764
No contamination: Figure 3 and Figure 4 show the behavior of the loss and the relative recovery error of the 4 algorithms when the observed b_i contain no error. We observe 4 interesting findings: 1) all algorithms converge eventually, and the final relative recovery errors are all very close to 0; 2) as expected, vanilla gradient descent converges the slowest in terms of both the loss and the relative recovery error; 3) Adam and Nesterov's momentum method behave very similarly, with Adam slightly outperforming Nesterov's momentum method at the later stage of optimization; both are considerably more desirable than vanilla gradient descent and do not differ significantly in practical use; 4) the YellowFin algorithm behaves differently: it takes extra iterations for the algorithm to find a suitable learning rate before the loss function starts to decrease monotonically.
Table 1 presents the number of iterations needed to reach a relative recovery error smaller than a given threshold for each algorithm. We can see that all algorithms except gradient descent reach a 20% error relatively quickly. Towards the end, from 5% to 1%, NAG needs fewer than 50 steps, while Adam takes about 150 steps.
Table 2: Number of iterations needed to reach a relative recovery error smaller than 50%, 30%, 25% and 20% for each algorithm under the chi-squared noise setting; here NaN means the algorithm cannot reach an error smaller than that value
Chi-squared noise: Figure 5 and Figure 6 show the behavior of the loss and the relative recovery error of the 4 algorithms when each observed b_i is contaminated by chi-squared noise. Specifically, each b_i is replaced by b_i + 10 e_i, where e_i follows a chi-squared distribution with 3 degrees of freedom. Compared with the no-contamination setting, we notice that even though all algorithms converge eventually, the final loss and relative recovery error are not as close to 0 as before. This is expected, as the contamination brings non-negligible noise to the observations. For the comparisons between the different algorithms, the patterns are similar to those in the no-contamination setting.
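For reference, the contamination step described above corresponds to the following (a sketch; b, p and rng as in the earlier snippet):

```python
e = rng.chisquare(df=3, size=p)    # chi-squared noise with 3 degrees of freedom
b_contaminated = b + 10 * e        # each b_i replaced by b_i + 10 * e_i
```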
Table 2 presents the number of iterations needed to reach a relative recovery error smaller than a given threshold for each algorithm. We can see that no algorithm can reach an error smaller than 20%, which shows that contaminated observations can seriously hinder the recovery of the original matrix. Adam and NAG have similar performance here, with Adam needing slightly fewer iterations to reach a 25% error.
Fig. 3: Loss function curves under the no-contamination setting. The horizontal axis is training steps, and the vertical axis is the value of the loss function. Each curve represents the median over 100 runs, and the area between the 0.25 and 0.75 quantiles is shaded.
Fig. 4: Relative matrix recovery error curves under the no-contamination setting. The horizontal axis is training steps, and the vertical axis is the recovery error. Each curve represents the median over 100 runs, and the area between the 0.25 and 0.75 quantiles is shaded.
Fig. 5: Loss function curves under the Cauchy noise setting. The horizontal axis is training steps, and the vertical axis is the value of the loss function. Each curve represents the median over 100 runs, and the area between the 0.25 and 0.75 quantiles is shaded.
Fig. 6: Relative matrix recovery error curves under the Cauchy noise setting. The horizontal axis is training steps, and the vertical axis is the recovery error. Each curve represents the median over 100 runs, and the area between the 0.25 and 0.75 quantiles is shaded.
6 Conclusion
This paper considers matrix factorization for low-rank matrix recovery with a general nonsmooth loss function. The framework includes the commonly used L1 and quantile loss functions as special cases, which gives us considerable flexibility in choosing a suitable form according to our knowledge and observations.
In the proof of Theorem 2, we only provide the road map. The key idea is first to prove the convergence of the SVP initialization, and then to show that the alternating minimization has a linear convergence rate when combined with Nesterov's momentum algorithm. Previously, [38] proved that alternating minimization combined with gradient descent has a linear convergence rate in the least-squares matrix factorization setting by using results from [9]. Here we prove that alternating minimization combined with Nesterov's momentum algorithm also has a linear convergence rate under nonsmooth matrix factorization, given that the smoothed function is L-smooth and strongly convex; to this end we employ a result from [33], which analyzes the linear convergence rate of Nesterov's momentum algorithm using a Lyapunov function.
Below we only provide results for $f_\mu$; in Theorem 1 we provide a separate proof that the solution of $f_\mu^\lambda$ converges to that of $f^\lambda$. The assumptions on the function $f_\mu$ are as follows:
Remark: Assumptions 1-2 define the condition number of the function $f_\mu(\cdot)$; similar assumptions also appear in [34]. Usually $\kappa = \eta/\xi$ is called the condition number of a function [5].
Lemma 1 (Restricted Isometry Property (RIP)). A linear map $\mathcal{A}$ satisfies the $r$-RIP with constant $\delta_r$ if
$$(1 - \delta_r)\,\lVert M \rVert_F^2 \le \lVert \mathcal{A}(M) \rVert_2^2 \le (1 + \delta_r)\,\lVert M \rVert_F^2 \quad \text{for all } M \text{ with } \mathrm{rank}(M) \le r.$$
Lemma 2 ([8]). Let $\mathcal{A}$ satisfy the $2r$-RIP with constant $\delta_{2r}$. Then for all matrices $M$, $N$ of rank at most $r$, we have
$$\big|\langle \mathcal{A}(M), \mathcal{A}(N) \rangle - \langle M, N \rangle\big| \le \delta_{2r}\, \lVert M \rVert_F\, \lVert N \rVert_F.$$
Consider the SVP iteration
$$M^{t+1} \leftarrow P_r\big(M^t - \xi_t\, \nabla_M f_\mu\big(\mathcal{A}(M^t) - b\big)\big),$$
for which
$$\lVert M^t - M \rVert_F \le \psi(\mathcal{A})^t\, \lVert M^0 - M \rVert_F, \quad \text{where } \psi(\mathcal{A}) = 2 \sup_{\substack{\lVert M\rVert_F = \lVert N\rVert_F = 1 \\ \mathrm{rank}(M)\le 2r,\ \mathrm{rank}(N)\le 2r}} \big|\langle \mathcal{A}(M), \mathcal{A}(N)\rangle - \langle M, N\rangle\big|.$$
Then
$$f_t(M) = \xi(1+\delta_{2k})\Big( \lVert M - N^{t+1} \rVert_F^2 - \frac{1}{4\xi^2(1+\delta_{2k})^2}\, \lVert \nabla \tilde f_\mu(M^t) \rVert_F^2 \Big),$$
$$f_\mu(M^{t+1}) \le \frac{\lVert e\rVert^2}{2} + \Big( \xi\,\frac{1+\delta_{2k}}{1-\delta_{2k}} - \eta \Big) \lVert b - \mathcal{A}(M^t) - e \rVert^2$$
$$\le \frac{\lVert e\rVert^2}{2} + \Big( \xi\,\frac{1+\delta_{2k}}{1-\delta_{2k}} - \eta \Big) \Big( \lVert b - \mathcal{A}(M^t)\rVert^2 - 2 e^\top \big(b - \mathcal{A}(M^t)\big) + \lVert e\rVert^2 \Big)$$
$$\le \frac{f_\mu(M^t)}{C^2} + \Big( \xi\,\frac{1+\delta_{2k}}{1-\delta_{2k}} - \eta \Big) \Big( \frac{2 f_\mu(M^t)}{C^2} + \frac{f_\mu(M^t)}{c_1} + \frac{2 f_\mu(M^t)}{C\sqrt{c_1}} \Big) = D\, f_\mu(M^t).$$
Since $D < 1$, combined with the fact that $f_\mu(M^0) \le c_2 \lVert b \rVert^2$, taking $t = \Big\lceil \frac{1}{\log D} \log \frac{(C^2+\varepsilon)\lVert e\rVert^2}{2 c_2 \lVert b\rVert_2^2} \Big\rceil$ completes the proof.
Lemma 5.
$$\mathrm{dist}^2\left(\begin{bmatrix} M_2 \\ N_2 \end{bmatrix}, \begin{bmatrix} M_1 \\ N_1 \end{bmatrix}\right) \le \frac{2}{\sqrt{2}-1}\, \frac{\lVert M_2 - M_1 \rVert_F^2}{\sigma_r(X_1)}.$$
Combining Lemmas 1, 2 and 5 and following a route similar to [38], we can show that, after more than $3\log(\sqrt{r}\,\kappa) + 5$ iterations of the SVP initialization algorithm, we obtain
$$\mathrm{dist}\left(\begin{bmatrix} U^0 \\ V^0 \end{bmatrix}, \begin{bmatrix} U \\ V \end{bmatrix}\right) \le \frac{1}{4}\,\sigma_r(U). \qquad (11)$$
This completes the proof of convergence for the initialization procedure. Next, we study the linear decay rate at each iteration for the penalized objective function.
Before studying the theoretical properties, we first rearrange the objective function (7) using the uplifting technique, so that $f_\mu$ and the regularization term can be handled simultaneously. To this end, consider a rank-$r$ matrix $M \in \mathbb{R}^{m\times n}$ with SVD $M = U^\top \Sigma V$, and define $\mathrm{Sym} : \mathbb{R}^{m\times n} \to \mathbb{R}^{(m+n)\times(m+n)}$ as
$$\mathrm{Sym}(X) = \begin{bmatrix} 0_{m\times m} & X \\ X^\top & 0_{n\times n} \end{bmatrix}.$$
Given a block matrix $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$ with $A_{11}\in\mathbb{R}^{m\times m}$, $A_{12}\in\mathbb{R}^{m\times n}$, $A_{21}\in\mathbb{R}^{n\times m}$ and $A_{22}\in\mathbb{R}^{n\times n}$, define
$$P_{\mathrm{diag}}(A) = \begin{bmatrix} A_{11} & 0_{m\times n} \\ 0_{n\times m} & A_{22} \end{bmatrix}, \qquad P_{\mathrm{off}}(A) = \begin{bmatrix} 0_{m\times m} & A_{12} \\ A_{21} & 0_{n\times n} \end{bmatrix},$$
and define $\mathcal{B} : \mathbb{R}^{(m+n)\times(m+n)} \to \mathbb{R}^{(m+n)\times(m+n)}$ as the uplifted version of the operator $\mathcal{A}$:
[...]
Define $W = [U^\top;\, V^\top]$. As a result, we can rewrite the objective function (7) as
$$g(W) := g(U, V) = f_\mu\big(b - \mathcal{A}(U^\top V)\big) + \lambda\,\lVert U U^\top - V V^\top \rVert_F^2 = f_\mu\Big(\tfrac{1}{2}\,\mathcal{B}\big(\mathrm{Sym}(U^\top V)\big) - \tfrac{1}{2}\,\mathrm{Sym}(X)\Big) + \lambda\,\big\lVert \mathrm{Sym}(U^\top V) - \mathrm{Sym}(X) \big\rVert_F^2. \qquad (12)$$
From equation (12) we can see that the non-penalized part and the penalized part have a similar structure.
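A small sketch of the uplifting operators defined above (our illustration; m is the number of rows of the original m x n matrix):

```python
import numpy as np

def sym(X):
    """Sym(X): embed an m x n matrix X into a symmetric (m+n) x (m+n) block matrix."""
    m, n = X.shape
    return np.block([[np.zeros((m, m)), X],
                     [X.T, np.zeros((n, n))]])

def p_diag(A, m):
    """P_diag(A): keep only the diagonal blocks A11 and A22."""
    out = np.zeros_like(A)
    out[:m, :m] = A[:m, :m]
    out[m:, m:] = A[m:, m:]
    return out

def p_off(A, m):
    """P_off(A): keep only the off-diagonal blocks A12 and A21."""
    return A - p_diag(A, m)
```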
As a result, although Assumptions 1, 2 and 4 are made on the function $f_\mu$, we can see from equation (12) that, after adding the penalization term, the penalized objective function still retains similar properties. As for Assumption 3, we can use a location-transformation technique to make the penalized objective function satisfy it as well. Consequently, it makes little difference whether we work with the penalized or the unpenalized objective function.
Then, similarly to [38], the alternating minimization combined with Nesterov's momentum algorithm with respect to U and V (sub-blocks of W), respectively, can actually be written as the NAG algorithm applied to g(W) with respect to W.
For the convergence analysis of Nesterov's momentum algorithm, we employ the following lemma, which is a theorem in [33].

Lemma 6. For the minimization problem $\min_{x\in X} f(x)$, where $x$ is a vector, using the Lyapunov function
[...]
it can be shown that
[...]
Substituting $x_k$ with $W_{k+1} = [U^{k+1}, V^{k+1}]^\top$ and $f$ with $g$, in order to carry out the convergence analysis in terms of $W_t - W^*$ and $W_0 - W^*$ we need to handle two parts. The first part is the error term involving $\tilde\varepsilon$; this can be handled by choosing the initial estimate close to the true value, so that $\lVert \nabla f(x_1) \rVert$ can be made arbitrarily close to 0. For notational simplicity, assume that $\frac{\tilde\varepsilon - \tilde\varepsilon(1-\tilde\tau)^k}{\tilde\tau} \le \varepsilon^{\dagger}$. Since $\tilde V_k$ still satisfies Assumptions 1 and 2, without loss of generality assume that the corresponding parameters are $\tilde\xi$ and $\tilde\eta$.
Next, we seek the relation between $W_t - W^*$ and $\tilde V_t$. This involves Assumptions 1 and 2 as well as Lemma 1. With a rough handling of the gradient part in Assumptions 1 and 2, we obtain
$$\tilde\xi\, \lVert \mathcal{A}(W_t) - \mathcal{A}(W^*) \rVert_2^2 \le (1 - \tilde\tau)^t\, \tilde\mu\, \lVert \mathcal{A}(W_0) - \mathcal{A}(W^*) \rVert_2^2 + \tilde\varepsilon,$$
where $1 - \tilde\tau_1$ is still strictly between 0 and 1. Employing the Restricted Isometry Property,
$$\tilde\xi (1 - \delta_r)\, \lVert W_t - W^* \rVert_2^2 \le (1 - \tilde\tau_1)^t\, \tilde\mu\, (1 + \delta_r)\, \lVert W_0 - W^* \rVert_2^2 .$$
Thus
$$\mathrm{dist}\left(\begin{bmatrix} U^t \\ V^t \end{bmatrix}, \begin{bmatrix} U \\ V \end{bmatrix}\right) \le (1 - \tilde\tau_1)^t\, \frac{\tilde\mu}{\tilde\xi}\, \frac{1+\delta_r}{1-\delta_r}\; \mathrm{dist}\left(\begin{bmatrix} U^0 \\ V^0 \end{bmatrix}, \begin{bmatrix} U \\ V \end{bmatrix}\right) \le \frac{1}{4}\, (1 - \tilde\tau_1)^t\, \frac{\tilde\mu}{\tilde\xi}\, \frac{1+\delta_r}{1-\delta_r}\; \sigma_r(U),$$
and Theorem 2 is proved.
Remark on equation (12): By combining (12) with Assumption 4, we obtain a guideline for the selection of λ, as compared with [38].