
Kent Academic Repository

Tu, Wei, Liu, Peng, Liu, Yi, Li, Guodong, Jiang, Bei, Kong, Linglong, Yao,
Hengshuai and Jiu, Shangling (2022) Nonsmooth low-rank matrix recovery:
methodology, theory and algorithm. In: Lecture Notes in Networks and
Systems. Proceedings of the Future Technologies Conference (FTC) 2021,
Volume 1. 358. Springer ISBN 978-3-030-89906-6.

Downloaded from
https://kar.kent.ac.uk/78761/ The University of Kent's Academic Repository KAR

The version of record is available from


https://doi.org/10.1007/978-3-030-89906-6_54

This document version


Author's Accepted Manuscript

DOI for this version

Licence for this version


UNSPECIFIED

Additional information

Versions of research works

Versions of Record
If this version is the version of record, it is the same as the published version available on the publisher's web site.
Cite as the published version.

Author Accepted Manuscripts


If this document is identified as the Author Accepted Manuscript it is the version after peer review but before
typesetting, copy editing or publisher branding. Cite as Surname, Initial. (Year) 'Title of article'. To be published in Title
of Journal, Volume and issue numbers [peer-reviewed accepted version]. Available at: DOI or URL (Accessed: date).

Enquiries
If you have questions about this document contact [email protected]. Please include the URL of the record
in KAR. If you believe that your or a third party's rights have been compromised through this document please see
our Take Down policy (available from https://www.kent.ac.uk/guides/kar-the-kent-academic-repository#policies).
Nonsmooth Low-rank Matrix Recovery: Methodology,
Theory and Algorithm

Wei Tu^1, Peng Liu^2, Yi Liu^3, Guodong Li^4, Bei Jiang^3, Linglong Kong^{*3}, Hengshuai Yao^5, and Shangling Jiu^6

^1 Department of Public Health Sciences and Canadian Cancer Trials Group, Queen's University, Canada
^2 School of Mathematics, Statistics and Actuarial Science, University of Kent, United Kingdom
^3 Department of Mathematical and Statistical Sciences, University of Alberta, Canada
^4 Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong
^5 Huawei Hisilicon, Edmonton, Canada
^6 Huawei Hisilicon, Shanghai, China

* Corresponding author: [email protected]

Abstract. Many interesting problems in statistics and machine learning can be written as min_x F(x) = f(x) + g(x), where x is the model parameter, f is the loss and g is the regularizer. Examples include regularized regression in high-dimensional feature selection and low-rank matrix/tensor factorization. Sometimes the loss function and/or the regularizer is nonsmooth due to the nature of the problem; for example, f(x) could be the quantile loss, used to induce robustness or to put more focus on parts of the distribution other than the mean. In this paper we propose a general framework for problems with a nonsmooth loss or regularizer. Specifically, we use low-rank matrix recovery as an example to demonstrate the main idea. The framework involves two main steps: an optimal smoothing of the loss function or regularizer, followed by a gradient-based algorithm applied to the smoothed loss. The proposed smoothing pipeline is highly flexible, computationally efficient, easy to implement and well suited for problems with high-dimensional data. A strong theoretical convergence guarantee is also established. In the numerical studies, we use the L1 loss as an example to illustrate the practicability of the proposed pipeline. Various state-of-the-art algorithms such as Adam, NAG and YellowFin all show promising results for the smoothed loss.

Keywords: Matrix factorization, Nonsmooth, Low-rank matrix, Nesterov's smoothing, Optimization

1 Introduction

Many problems in statistics and machine learning can be formulated in the following form:

    min_x F(x) = f(x) + g(x),

where x is the parameter, f is the loss and g is the regularizer. Examples include penalized regression in high-dimensional feature selection [37] and low-rank matrix/tensor recovery. Typically both f(x) and g(x) are proper convex functions, such as the L2 loss for f(x). However, in some problems, due to the need for sparsity, robustness or other structural requirements on the parameter space, a nonsmooth or even nonconvex loss function or regularizer is often needed. In this paper we propose a general framework to deal with situations in which the loss or the regularizer is nonsmooth. Specifically, we use low-rank matrix recovery as an example to illustrate the main idea.
In practice, many high-dimensional matrices essentially have a low-rank structure; see, for example, recommender systems [32], stochastic system realization [30] in systems and control, computer algebra [4], psychometrics [23] and even quantum state tomography [16] in physics. Meanwhile, these matrices are usually only partially observed, and many entries are left unobserved for a variety of reasons. For example, we can only observe a few ratings from any particular recommender system, and quantum states have an exponentially large size, so it is not possible to obtain observations on the same scale. For these partially observed matrices with a high missing rate, it is of interest to ask "How can the matrix be estimated under a low-rank structure?" or "How can the low-rank matrix be recovered effectively?". This leads to the important problem of low-rank matrix recovery.
Since the Netflix prize competition in 2009, matrix factorization has been shown to outperform traditional nearest-neighbor based techniques in the sense that it allows the incorporation of additional information such as temporal effects, confidence levels and so on [22]. The basic idea of matrix factorization is to decompose the target matrix M* ∈ R^{m×n} into a bilinear form:

    M* = U^⊤ V,

where U ∈ R^{r×m}, V ∈ R^{r×n}, and the rank of M* is no more than r with r ≤ min(m, n).
Due to the rapid improvement of computation power, matrix factorization has re-
ceived more and more attention in various fields; see [20, 31, 6, 43] and among others.
Most currently used methods are based on the L2 loss, which is the optimal choice when
the noise is Gaussian distributed. However, it is sensitive to outliers, and one possible
solution is to consider a loss function other than the L2 loss; see [5, 29, 20].
Meanwhile, as shown in [11], nonsmooth optimization problems play an important role in many areas such as image restoration, signal reconstruction, optimal control, and so on. In statistics, the least absolute deviation is well known to be robust to highly skewed and/or heavy-tailed data. The Manhattan distance in machine learning is also based on the L1 loss. Quantile regression, which corresponds to the quantile loss, is another important estimation method in statistics and is commonly used to handle highly skewed data. It is noteworthy that both the L1 and quantile loss functions are nonsmooth. Other useful nonsmooth functions in statistics and machine learning include the indicator function, step function, max function and so on [39, 40, 7].
Thus it is important to consider matrix factorization with a nonsmooth loss function for matrix recovery. Various algorithms, including the simplex, subgradient and quasi-monotone methods, have been proposed to tackle nonsmooth optimization, but few of them are efficient for recovering low-rank matrices in high dimensions. Smooth approximation has recently been studied for nonsmooth optimization in many areas, such as complementarity problems, optimal control, eigenvalue optimization, etc., and it has been shown to be efficient even for the case with nonsmooth constraints; see, for example, [2], [10] and [12].
This paper considers the problem of low-rank matrix recovery from linear measure-
ments. A general nonsmooth loss function is considered here, and Nesterov’s smooth-
ing method [26] is then applied to obtain an optimal smooth approximation. In practice,
according to the specific nature of the problem and data, one can choose a suitable non-
smooth loss function, satisfying Nesterov’s assumptions in [26], such that an efficient
algorithm can be obtained. Due to the bilinear structure of matrix factorization, the al-
ternating minimization method is thus employed to search for the solutions and, at each
step, we compare the performance of various algorithms, which are based on gradient
descent and momentum. Compared with previous work, this paper is more general in the following ways: 1) the transformation matrices we consider are more general; 2) a strong convergence guarantee is established for the proposed algorithm; 3) different state-of-the-art gradient-based algorithms are used and compared, namely vanilla gradient descent, Nesterov's momentum method [26], Adam [21] and the YellowFin algorithm [42]. All of these algorithms substantially improve the performance over the original nonsmooth problem.
The rest of the paper is organized as follows. In Section 2, we introduce the notation and formulate the problem mathematically. The proposed algorithms are presented in detail in Section 3, and the theoretical convergence analysis can be found in Section 4. In Section 5, we illustrate the effectiveness of the proposed framework using the widely used L1 loss as a special example, and different gradient and momentum based algorithms are compared. All the proofs for the theorems presented in the main paper can be found in the appendix.

2 Methodology Framework
This paper considers the model

    b_i = ⟨A_i, M*⟩ + ε_i,   i = 1, . . . , p,   (1)

where M* is the true value, {A_i, b_i}, i = 1, · · · , p, are observed, and ε_i is the error term. Here A_i ∈ R^{m×n}, 1 ≤ i ≤ p, are given transformation matrices. Suppose that the rank of the matrix M* ∈ R^{m×n} is no more than r with r ≪ min(m, n, p). We then have M* = U*^⊤ V*, where U* ∈ R^{r×m} and V* ∈ R^{r×n}, and the following optimization can be used to recover M*:

    min_{U,V} (1/p) Σ_{i=1}^{p} f(b_i − ⟨A_i, U^⊤ V⟩),   (2)

where f(·) is a nonsmooth objective function. Let A : R^{m×n} → R^p be an affine transformation with the ith entry of A(M*) being ⟨A_i, M*⟩, and let f(x) = p^{−1} Σ_{i=1}^{p} f(x_i) for a vector x = (x_1, ..., x_p)^⊤. We can then rewrite (2) in the compact form

    min_{U,V} f(b − A(U^⊤ V)).   (3)
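
To make the measurement model concrete, the following minimal sketch (our own NumPy illustration with Gaussian sensing matrices; the function names are ours, not the paper's) evaluates the affine map A and the averaged L1 loss in (3), one nonsmooth choice of f.

```python
import numpy as np

def affine_map(A_list, M):
    """Apply the affine transformation A: R^{m x n} -> R^p,
    whose i-th entry is the inner product <A_i, M>."""
    return np.array([np.sum(A_i * M) for A_i in A_list])

def l1_loss(residual):
    """Averaged L1 loss f(x) = p^{-1} sum_i |x_i|, a nonsmooth choice of f."""
    return np.mean(np.abs(residual))

# toy instance of model (1): b_i = <A_i, M*> + eps_i
rng = np.random.default_rng(0)
m, n, r, p = 8, 6, 2, 40
U_star = rng.standard_normal((r, m))
V_star = rng.standard_normal((r, n))
M_star = U_star.T @ V_star                       # rank-r ground truth
A_list = [rng.standard_normal((m, n)) for _ in range(p)]
b = affine_map(A_list, M_star) + 0.1 * rng.standard_normal(p)

# objective (3) evaluated at a candidate factorization (U, V)
U, V = rng.standard_normal((r, m)), rng.standard_normal((r, n))
print(l1_loss(b - affine_map(A_list, U.T @ V)))
```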

Assume that f(·) is differentiable almost everywhere and has the following structure:

    f(b − A(U^⊤ V)) = f̂(b − A(U^⊤ V)) + max_u { ⟨B(b − A(U^⊤ V)), u⟩ − φ̂(u) },   (4)

where f̂ is a continuous and convex function; see [26]. We can then obtain the following optimal smooth approximation:

    f_µ(b − A(U^⊤ V)) = f̂(b − A(U^⊤ V)) + max_u { ⟨B(b − A(U^⊤ V)), u⟩ − φ̂(u) − µ d_2(u) },   (5)

where µ is a positive smoothness parameter and d_2(·) is a prox-function as in [26].


The above smooth approximation is made to function f (·) with respect to the vector
b − A(U > V ), rather than with respect to unknown matrix parameters U and V , since
it can be handled more conveniently. Moreover, under the restricted isometry property
assumption on A, our theoretical results show that this does not affect the convergence
rate.
Meanwhile, due to the scale problem, the direct solutions to (5) may not have a proper structure, and one commonly used correction is to introduce a penalty of λ(‖U‖_F^2 + ‖V‖_F^2); see, for example, [35] and [44], among others. However, such penalization cannot preserve the intrinsic structure of U and V. This paper uses the Procrustes flow penalty [38, 24] instead, and our optimization problem becomes

    min_{U,V} f_µ(b − A(U^⊤ V)) + λ‖U U^⊤ − V V^⊤‖_F^2,   (6)

where the ad hoc choice of λ is 1/16. The objective function in (6) is smooth with respect to U and V and is denoted by f_µ^λ(U, V). The original nonsmooth optimization problem corresponds to

    min_{U,V} f(b − A(U^⊤ V)) + λ‖U U^⊤ − V V^⊤‖_F^2,   (7)

where the objective function is denoted by f^λ(U, V).


Due to the bilinear structure of f_µ^λ(U, V) with respect to U and V in (6), we adopt the alternating minimization method to search for the solutions. Specifically, in each iteration, we keep one of U and V fixed, optimize over the other, and then switch in the next iteration until convergence. In the matrix factorization literature, [20] showed the convergence of this scheme for the L2 loss, and this paper further establishes the convergence for a general nonsmooth loss function.
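
As a concrete instance of the penalized smoothed problem (6), the sketch below is our own illustration (not the authors' code), taking f_µ to be the Huber smoothing of the L1 loss used later in Section 5; it evaluates f_µ^λ(U, V) and its gradients with respect to U and V, which is all that the alternating minimization in Section 3 requires.

```python
import numpy as np

def huber(a, mu):
    """Huber smoothing of |a| with smoothness parameter mu (see (10) in Section 5)."""
    return np.where(np.abs(a) <= mu, a**2 / (2 * mu), np.abs(a) - mu / 2)

def huber_grad(a, mu):
    """Derivative of the Huber function."""
    return np.where(np.abs(a) <= mu, a / mu, np.sign(a))

def smoothed_objective(U, V, A_list, b, mu, lam=1.0 / 16):
    """f_mu^lambda(U, V): Huber-smoothed L1 loss plus the Procrustes flow penalty in (6)."""
    res = b - np.array([np.sum(A_i * (U.T @ V)) for A_i in A_list])
    S = U @ U.T - V @ V.T
    return np.mean(huber(res, mu)) + lam * np.sum(S**2)

def smoothed_gradients(U, V, A_list, b, mu, lam=1.0 / 16):
    """Gradients of f_mu^lambda with respect to U and V."""
    p = len(A_list)
    res = b - np.array([np.sum(A_i * (U.T @ V)) for A_i in A_list])
    w = huber_grad(res, mu) / p                      # per-measurement weights
    # d<A_i, U^T V>/dU = V A_i^T  and  d<A_i, U^T V>/dV = U A_i
    grad_U = -sum(w_i * (V @ A_i.T) for w_i, A_i in zip(w, A_list))
    grad_V = -sum(w_i * (U @ A_i) for w_i, A_i in zip(w, A_list))
    S = U @ U.T - V @ V.T                            # gradient of the penalty term
    grad_U += 4 * lam * (S @ U)
    grad_V -= 4 * lam * (S @ V)
    return grad_U, grad_V
```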

3 Algorithm

There are three steps in our algorithm, and the first one is the initialization; see Algo-
rithm 1. When applying the alternating minimization at the second step, if the initial
values of U and V are orthogonal to the true ones, then our algorithm may never con-
verge. To avoid this situation, we adopt the singular value projection (SVP) to provide

Algorithm 1: Initialization by SVP algorithm

Input: A, b, tolerance ε, step sizes ξ_t for t = 0, 1, · · · ; M^0 = 0_{m×n}
Output: M^{t+1}
1  Repeat
2      Y^{t+1} ← M^t − ξ_t ∇_M f_µ(b − A(M^t))
3      Compute the top r singular vectors of Y^{t+1}: U_r, Σ_r, V_r
4      M^{t+1} ← U_r Σ_r V_r^⊤
5      t ← t + 1
6  Until ‖M^{t+1} − M^t‖_F ≤ ε

some starting values of U and V for the alternating minimization, and the SVP was
proposed by [19] and later used by [15], [1], [38] and so on.
Algorithm 1 can be written as

    M^{t+1} ← P_r( M^t − ξ_t ∇_M f_µ(b − A(M^t)) ),

where P_r denotes the projection onto the space of rank-r matrices. In fact, Algorithm 1 can be used directly to recover the matrix M* if it is iterated sufficiently many times; see [19]. However, the singular value calculation is time-consuming when the dimension of M* is large. Furthermore, as an initialization, we do not need a very small tolerance ε; a rough output is sufficient.
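
A minimal NumPy sketch of this projected-gradient form of Algorithm 1, written against the Huber-smoothed L1 loss; huber_grad is the hypothetical helper from the previous sketch, and the step size, tolerance and iteration cap are illustrative defaults rather than the paper's choices.

```python
import numpy as np

def rank_r_projection(Y, r):
    """P_r: project Y onto the set of rank-r matrices via a truncated SVD."""
    Ur, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return Ur[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

def svp_initialization(A_list, b, r, mu, step=None, tol=1e-3, max_iter=200):
    """SVP iteration M^{t+1} <- P_r(M^t - xi_t * grad_M f_mu(b - A(M^t)))."""
    p = len(A_list)
    m, n = A_list[0].shape
    step = 1.0 / p if step is None else step      # xi_t = 1/p, as in Theorem 2
    M = np.zeros((m, n))
    for _ in range(max_iter):
        res = b - np.array([np.sum(A_i * M) for A_i in A_list])
        # gradient of (1/p) sum_i huber(res_i, mu) with respect to M
        grad = -sum(g_i * A_i for g_i, A_i in zip(huber_grad(res, mu), A_list)) / p
        M_new = rank_r_projection(M - step * grad, r)
        if np.linalg.norm(M_new - M) <= tol:
            return M_new
        M = M_new
    return M
```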
The second step is the alternating minimization; see Algorithm 2. Specifically, in
each iteration, we will keep one of U and V fixed, optimize over the other, and then
switch in the next iteration until it converges. We then take the final values of U and V ,
denoted by Û and V̂ , as solutions.
The third step is to update the values of U and V, respectively, at each iteration of the alternating minimization in steps 1.1 and 1.2. There are various methods for this in the literature. [20] and [38] used the vanilla gradient descent method, but it may be slow in our nonsmooth matrix factorization setting. Algorithm 3 uses Nesterov's momentum method to update the value of U while that of V is fixed; the algorithm to update the value of V is analogous.
In Algorithm 3, ν^t_{(i)} stands for the momentum term, γ is the momentum parameter, and η is the learning rate; γ is usually chosen to be around 0.9 [36, 42]. For the sake of comparison, we also consider two other momentum-based algorithms: Adam [21] and YellowFin [42].

Algorithm 2: Alternating Minimization

Input: U^0, V^0
Output: U^{n_max}, V^{n_max}
1  Repeat
2      1.1. Update U^t with U^{t+1} = NAG(U^t, V^t)
3      1.2. Update V^t with V^{t+1} = NAG(U^{t+1}, V^t)
4  Until convergence

Algorithm 3: Nesterov’s accelerate gradient (NAG) method for U t+1


Input: U t , V t
Output: U t+1
1 Repeat
t t
2 ν(i) = γν(i−1) + η∇U fµλ (U(i−1)
t t
− γν(i−1) , V t)
t t t
3 U(i) = U(i−1) + ν(i)
4 Until convergence
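
The U-update of Algorithm 3 and the outer loop of Algorithm 2 can be sketched as follows (again our own illustration, reusing the hypothetical smoothed_gradients helper from the Section 2 sketch; γ = 0.9 and the learning rate are illustrative values). The gradient is evaluated at the look-ahead point U − γν before the momentum step, and the V-block is symmetric.

```python
import numpy as np

def nag_update_U(U, V, A_list, b, mu, lam=1.0 / 16,
                 gamma=0.9, eta=1e-3, n_inner=50):
    """One block of Algorithm 2: update U by Nesterov's momentum (Algorithm 3)
    while V is held fixed; smoothed_gradients is the helper sketched in Section 2."""
    nu = np.zeros_like(U)
    for _ in range(n_inner):
        # gradient of f_mu^lambda at the look-ahead point U - gamma * nu
        grad_U, _ = smoothed_gradients(U - gamma * nu, V, A_list, b, mu, lam)
        nu = gamma * nu + eta * grad_U
        U = U - nu
    return U

def nag_update_V(U, V, A_list, b, mu, lam=1.0 / 16,
                 gamma=0.9, eta=1e-3, n_inner=50):
    """The analogous V-block, with U held fixed."""
    nu = np.zeros_like(V)
    for _ in range(n_inner):
        _, grad_V = smoothed_gradients(U, V - gamma * nu, A_list, b, mu, lam)
        nu = gamma * nu + eta * grad_V
        V = V - nu
    return V

def alternating_minimization(U, V, A_list, b, mu, n_outer=20, **kwargs):
    """Algorithm 2: alternate the two momentum blocks for a fixed budget."""
    for _ in range(n_outer):
        U = nag_update_U(U, V, A_list, b, mu, **kwargs)
        V = nag_update_V(U, V, A_list, b, mu, **kwargs)
    return U, V
```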

4 Convergence analysis

Denote {Û^µ, V̂^µ} = arg min_{U,V} f_µ^λ(U, V) and {Û, V̂} = arg min_{U,V} f^λ(U, V). The following theorem guarantees that the optimal solution of the smoothed objective function converges to that of the nonsmooth one.

Theorem 1. (Convergence of the optimal solution of the smoothed objective function) As µ → 0^+, we have Û^{µ⊤} V̂^µ → Û^⊤ V̂.

Much of the existing literature on smoothing techniques focuses only on analyzing the theoretical properties of the smoothed problem while ignoring the relationship between the smooth objective function and the original nonsmooth one; see, for example, [3]. Theorem 1 shows that if we only optimize the smooth objective function, we can still obtain the optimal solution of the nonsmooth objective function. The benefit is that we then have many choices of algorithms for the smooth objective function and can simply choose a fast one to obtain the solution.
Suppose that U and V are a pair of solutions, i.e. M = U^⊤ V. It then holds that, for any orthonormal matrix R satisfying R^⊤ R = I_r, U† = RU and V† = RV are another pair of solutions. To evaluate the performance of the proposed algorithm, we first define a distance between two matrices,

    dist(U, U†) = min_{R ∈ R^{r×r}: R^⊤ R = I_r} ‖U − R U†‖_F,

where U, U† ∈ R^{r×m} with m ≥ r; see [38].
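
The minimization over R has a closed-form orthogonal Procrustes solution, so this distance can be computed directly from one SVD; the sketch below (ours, in NumPy) assumes U and U† are r × m arrays as in the definition above.

```python
import numpy as np

def factor_distance(U, U_dagger):
    """dist(U, U^dagger) = min_{R orthonormal} ||U - R U^dagger||_F.
    The optimal R = P Q^T comes from the SVD of U U_dagger^T (orthogonal Procrustes)."""
    P, _, Qt = np.linalg.svd(U @ U_dagger.T)
    R = P @ Qt
    return np.linalg.norm(U - R @ U_dagger)

# sanity check: rotating a factor leaves the distance at (numerically) zero
rng = np.random.default_rng(1)
r, m = 3, 10
U = rng.standard_normal((r, m))
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # a random orthonormal rotation
print(factor_distance(U, Q @ U))                   # ~1e-15
```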

Theorem 2. Let M ∈ R^{m×n} be a rank-r matrix with singular values σ_1(M) ≥ σ_2(M) ≥ · · · ≥ σ_r(M) > 0 and condition number κ = σ_1(M)/σ_r(M), and let M = A^⊤ Σ B be its SVD. Define U = A^⊤ Σ^{1/2} ∈ R^{m×r}, V = B^⊤ Σ^{1/2} ∈ R^{n×r}. Assume A satisfies a rank-6r RIP condition with RIP constant δ_6r < 1/25 and ξ_t = 1/p. Then using T_0 ≥ 3 log(√r κ) + 5 iterations of the SVP initialization yields a solution (U_0, V_0) obeying

    dist( [U_0; V_0], [U; V] ) ≤ (1/4) σ_r(U).   (8)

Furthermore, starting from any initial solution obeying (8), the t-th iterate of Algorithm 2 satisfies

    dist( [U_t; V_t], [U; V] ) ≤ (1/4) (1 − τ̃_1)^t (µ̃/ξ̃) ( (1 + δ_r)/(1 − δ_r) ) σ_r(U)   (9)

under Nesterov's momentum method.

Theorem 2 makes several contributions. First, we extend the SVP algorithm of [19] from least-squares matrix factorization to nonsmooth matrix factorization. In addition, we provide theoretical convergence guarantees for alternating minimization with Nesterov's momentum method for a general objective function, which generalizes the linear convergence guarantee of [38] for alternating minimization with gradient descent under the least-squares objective. We also note that although [41] also provides a smoothing approximation based on Nesterov's smoothing technique, it deals with the nonnegative matrix factorization case, which is much simpler than ours, and it does not provide a rigorous theoretical analysis.

5 Numerical Studies

In the simulation study, we use the L1 loss as an example to illustrate the practical applicability of our theoretical results and to provide insight into the choice of algorithm. The corresponding smoothed approximation of the L1 loss is the popular Huber loss function [18] from robust statistics. The Huber loss function is defined as

    L_µ(a) = a^2/(2µ)    for |a| ≤ µ,
    L_µ(a) = |a| − µ/2   otherwise,   (10)

where µ is the predetermined smoothness parameter controlling the tradeoff between smoothness and precision. As µ → 0^+, the Huber function converges to the absolute loss uniformly. On the other hand, as µ → +∞, the Huber function resembles the L2 loss. Figure 1 further illustrates the differences between these loss functions.
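
This uniform convergence is easy to check numerically: |L_µ(a) − |a|| never exceeds µ/2, so the gap vanishes as µ → 0. A tiny self-contained check (ours):

```python
import numpy as np

def huber(a, mu):
    """Huber loss L_mu(a) as defined in (10)."""
    return np.where(np.abs(a) <= mu, a**2 / (2 * mu), np.abs(a) - mu / 2)

a = np.linspace(-5, 5, 2001)
for mu in [1.35, 0.1, 0.01]:
    print(mu, np.max(np.abs(huber(a, mu) - np.abs(a))))   # the gap equals mu/2
```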

5.1 Synthetic data

The dataset is generated in the following manner: all entries of U ∈ R^{r×m}, V ∈ R^{r×n} and A_i ∈ R^{m×n} are independently sampled from the Gaussian distribution N(0, 1). The ground truth M* to recover is then calculated as M* = U^⊤ V, and the rank of M* is r. The observations b_i are generated following the data generating process in (1).
To evaluate the performance of the algorithms, the following two metrics are used: 1) the value of the loss function; 2) the relative recovery error ‖Û^⊤ V̂ − M*‖_F / ‖M*‖_F, where Û and V̂ are the estimates of U and V obtained with the proposed smoothing method. For both metrics, a smaller value is desired.
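
A sketch of this data-generating process and of the relative recovery error (our own NumPy illustration; the default dimensions are the ones used for the µ experiment below, and est_U, est_V stand for whichever estimates an algorithm returns):

```python
import numpy as np

def make_synthetic(m=32, n=32, r=10, p=512, noise=None, seed=0):
    """Generate {A_i, b_i} from model (1) with a rank-r ground truth M*."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((r, m))
    V = rng.standard_normal((r, n))
    M_star = U.T @ V
    A_list = [rng.standard_normal((m, n)) for _ in range(p)]
    b = np.array([np.sum(A_i * M_star) for A_i in A_list])
    if noise is not None:
        b = b + noise(rng, p)          # e.g. lambda rng, p: rng.standard_cauchy(p)
    return A_list, b, M_star

def relative_recovery_error(est_U, est_V, M_star):
    """||U^T V - M*||_F / ||M*||_F, the second evaluation metric."""
    return np.linalg.norm(est_U.T @ est_V - M_star) / np.linalg.norm(M_star)
```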

Fig. 1: The comparison of different loss functions. Here Huber 1.35 in the legend denotes Huber
loss with µ = 1.35.

Smooth parameter µ: The smoothing parameter µ in (10) controls the tradeoff between smoothness and precision. In this simulation we investigate the relationship between the choice of µ and the relative recovery error. Without loss of generality, m = n = 32, p = 512 and r = 10 are used here. To enhance the difference between the L1 loss and the L2 loss, all observations b_i have been contaminated by a Cauchy noise e_i, and Nesterov's momentum method is used for this simulation.
Figure 2 shows the trend between the smoothing parameter and the relative recovery error; the x axis is the natural-logarithm-transformed µ. We observe three interesting patterns in this plot: 1) the performance of the algorithm is not sensitive to the choice of µ, as any reasonably small choice of µ results in a small recovery error; 2) as µ approaches 0, the Huber loss becomes very close to the L1 loss and the recovery error increases slightly, which might be due to the nonsmoothness of the L1 loss; 3) as µ approaches +∞, the Huber loss becomes very close to the L2 loss, and the L2 loss does not work well here due to the Cauchy noise added to the data. The empirical results also further verify our theoretical result in Theorem 1: as the smoothing parameter approaches 0, the optimal solution of the smoothed objective function approaches that of the nonsmooth one (the L1 loss here).

Algorithm comparisons: The third step of the algorithm, updating U or V, can be carried out by many different algorithms. Four methods are implemented here: the vanilla gradient descent method (GD), Nesterov's momentum method (NAG) [25], the Adam method [21] and the YellowFin algorithm [42], which features step-size auto-tuning.
For each algorithm, we start with an unreasonably large step size, η = 1. Under such a large step size, most algorithms are expected to diverge or suffer from numerical instability. We then repeatedly decrease the step size by multiplying η by 1/√10 until the algorithm starts to converge properly.
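
This tuning loop can be sketched as below (our paraphrase of the procedure, not the authors' code; run_algorithm is a hypothetical callable that runs one optimization with the given step size and returns its final loss, and the finiteness test is a crude stand-in for "converges properly"):

```python
import numpy as np

def tune_step_size(run_algorithm, eta0=1.0, shrink=1.0 / np.sqrt(10), max_trials=12):
    """Start from a deliberately large eta and shrink it by 1/sqrt(10) per trial
    until the run no longer blows up."""
    eta = eta0
    final_loss = np.inf
    for _ in range(max_trials):
        final_loss = run_algorithm(eta)
        if np.isfinite(final_loss):          # crude convergence check
            return eta, final_loss
        eta *= shrink
    return eta, final_loss
```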

Fig. 2: Relative recovery error under different choices of the smoothing parameter µ. The horizontal axis is µ on a logarithm scale, and the vertical axis is the relative recovery error.

To keep the simulation time reasonable, m, n, r and p are chosen as 64, 64, 8 and 2048, respectively. We choose µ = 1.35 as suggested by [18], and later used by [27] and [13], among others. For each algorithm, the experiments are repeated 100 times. In each experiment, the step size tuning procedure yielded identical results for GD, NAG and Adam: the step size was set to 10^{−2.5} in all cases. The exception is the YellowFin algorithm: even extreme choices such as 10 or 10^{−9} do not significantly alter the outcome of the optimization, because, unlike the other algorithms, YellowFin's internal auto-tuning overrides the preassigned step size before the end of the first iteration.
One motivation for using a nonsmooth objective function such as the L1 loss is its robustness. We experimented with different choices of error distribution for the observations b_i. Due to the limited space, we select two scenarios to present the findings: no contamination, and adding a chi-squared noise to the observations b_i. The findings for each scenario are summarized below.

Table 1: Number of iterations needed to reach a relative recovery error smaller than 20%, 10%, 5%, 1% for each algorithm under the no-contamination setting

            20%    10%    5%      1%
Adam        775    962    1105    1251
GD          3799   4788   >5000   >5000
NAG         709    881    1103    1148
YellowFin   1341   1509   1668    1764

No contamination: Figure 3 and Figure 4 show the behavior of the loss and the relative recovery error for the 4 algorithms when the observed b_i contain no error. We observe 4 interesting findings: 1) all algorithms converge eventually, and the final relative recovery errors are all very close to 0; 2) as expected, vanilla gradient descent converges the slowest in terms of both loss and relative recovery error; 3) Adam and Nesterov's momentum method behave very similarly, with Adam slightly outperforming Nesterov's momentum method at the later stage of optimization; both are considerably more desirable than vanilla gradient descent and do not differ significantly in practical use; 4) the YellowFin algorithm behaves differently: it takes extra iterations to figure out the optimal learning rate before the loss function starts to decrease monotonically.

Table 1 presents the number of iterations needed to reach a relative recovery error smaller than a given threshold for each algorithm. All algorithms except gradient descent reach the 20% error relatively quickly. Towards the end, from 5% to 1%, NAG needs fewer than 50 steps, while Adam takes about 150 steps.

Table 2: Number of iterations needed to reach a relative recovery error smaller than 50%, 30%, 25%, 20% for each algorithm under the chi-squared noise setting; NaN means the algorithm cannot reach an error smaller than this value

            50%    30%    25%    20%
Adam        351    813    1108   NaN
GD          1289   4464   >5000  NaN
NAG         257    852    1229   NaN
YellowFin   444    1334   1606   NaN

Chi-squared noise: Figure 5 and Figure 6 show the behavior of the loss and the relative recovery error for the 4 algorithms when each observed b_i is contaminated by a chi-squared noise. Specifically, each b_i is replaced by b_i + 10 e_i, where e_i follows a chi-squared distribution with 3 degrees of freedom. Compared to the no-contamination setting, we notice that even though all algorithms still converge eventually, the final loss and relative recovery error are no longer as close to 0 as before. This is expected, since the contamination brings non-negligible noise to the observations. For the comparisons between different algorithms, the patterns are similar to the no-contamination setting.

Table 2 presents the number of iterations needed to reach a relative recovery error smaller than a given threshold for each algorithm. No algorithm reaches an error smaller than 20%, which shows that contaminated observations can cause serious trouble in the recovery of the original matrix. Adam and NAG perform similarly, with Adam needing slightly fewer iterations to reach the 25% error.

Fig. 3: Loss function curves under no contamination setting. The horizontal axis is training steps,
and the vertical axis is the value of loss function. Each curve represents median over 100 runs,
and the area between 0.25 and 0.75 quantile are plotted as shadow.

Fig. 4: Relative matrix recovery error curves under no contamination setting. The horizontal axis
is training steps, and the vertical axis is the recovery error. Each curve represents median over
100 runs, and the area between 0.25 and 0.75 quantile are plotted as shadow.

Fig. 5: Loss function curves under Cauchy noise setting. The horizontal axis is training steps,
and the vertical axis is the value of loss function. Each curve represents median over 100 runs,
and the area between 0.25 and 0.75 quantile are plotted as shadow.

Fig. 6: Relative matrix recovery error curves under Cauchy noise setting. The horizontal axis is
training steps, and the vertical axis is the recovery error. Each curve represents median over 100
runs, and the area between 0.25 and 0.75 quantile are plotted as shadow.

5.2 Real world data experiment

In this section, we demonstrate the efficiency of our method on a real-world example. Not all data are normally distributed as in our synthetic data set; furthermore, noise is ubiquitous and unavoidable in real-world practice.
The real-world data used in this experiment is an old-school gray-scale saving icon with dimension m = n = 128, and the rank of this picture is r = 6. 8000 normally distributed A_i are generated in the matrix sensing setting and the b_i are calculated accordingly. To show that the robustness of the L1 loss is well preserved by our smoothing method, independent Cauchy noise is additionally added to all b_i.
The results are shown in Figure 8. The L2 loss cannot recover the image, and the result is completely blurred. Both the L1 and Huber losses recover a recognizable picture. However, the L1 optimization based on the nonsmooth subgradient method takes over 20x more time to converge than the Huber design, and the ratio increases further as the scale of the problem grows.

Fig. 8: Final recovery results for different loss functions. Panels: Truth (Error: 0.00%), L1 (Error: 0.71%), Huber (Error: 1.01%), L2 (Error: > 100%).

6 Conclusion

This paper considers matrix factorization for low-rank matrix recovery with a general nonsmooth loss function. The framework includes the commonly used L1 and quantile loss functions as special cases, which gives much flexibility in choosing a suitable form according to our knowledge and observations.

In the proposed algorithm, we first construct an optimal smooth approximation of the nonsmooth objective function [26]; many gradient-based algorithms can then be applied to the problem. We use vanilla gradient descent, Nesterov's momentum algorithm, Adam and YellowFin as examples and compare their performance. Though smoothing changes the structure of the problem, the benefit is that it brings much more flexibility in the choice of algorithm.
Bibliography

[1] D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approxima-


tions. Journal of the ACM (JACM), 54(2):9, 2007.
[2] G. Alefeld and X. Chen. A regularized projection method for complemen-
tarity problems with non-lipschitzian functions. Mathematics of Computation,
77(261):379–395, 2008.
[3] A. Y. Aravkin, A. Kambadur, A. C. Lozano, and R. Luss. Sparse quantile huber
regression for efficient and robust estimation. arXiv preprint arXiv:1402.4624,
2014.
[4] S. Barnett. Greatest common divisor of two polynomials. Linear Algebra and its
Applications, 3(1):7–9, 1970.
[5] S. Bhojanapalli, A. Kyrillidis, and S. Sanghavi. Dropping convexity for faster
semi-definite optimization. In Conference on Learning Theory, pages 530–582,
2016.
[6] D. Bokde, S. Girase, and D. Mukhopadhyay. Matrix factorization model in collab-
orative filtering algorithms: A survey. Procedia Computer Science, 49:136–146,
2015.
[7] L. Bottou. Large-scale machine learning with stochastic gradient descent. In
Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
[8] E. J. Candes. The restricted isometry property and its implications for compressed
sensing. Comptes rendus mathematique, 346(9-10):589–592, 2008.
[9] E. J. Candes, X. Li, and M. Soltanolkotabi. Phase retrieval via wirtinger flow:
Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–
2007, 2015.
[10] X. Chen. First order conditions for nonsmooth discretized constrained optimal
control problems. SIAM journal on control and optimization, 42(6):2004–2015,
2004.
[11] X. Chen. Smoothing methods for nonsmooth, nonconvex minimization. Mathe-
matical programming, 134(1):71–99, 2012.
[12] X. Chen, R. S. Womersley, and J. J. Ye. Minimizing the condition number of a
gram matrix. SIAM Journal on optimization, 21(1):127–148, 2011.
[13] J. Fan. Local polynomial modelling and its applications: monographs on statistics
and applied probability 66. Routledge, 2018.
[14] O. Fercoq and P. Richtárik. Smooth minimization of nonsmooth functions with
parallel coordinate descent methods. arXiv preprint arXiv:1309.5885, 2013.
[15] R. Garg and R. Khandekar. Gradient descent with sparsification: an iterative al-
gorithm for sparse recovery with restricted isometry property. In Proceedings of
the 26th Annual International Conference on Machine Learning, pages 337–344.
ACM, 2009.
[16] D. Gross, Y.-K. Liu, S. T. Flammia, S. Becker, and J. Eisert. Quantum state tomog-
raphy via compressed sensing. Physical review letters, 105(15):150401, 2010.
[17] N. Guan, D. Tao, Z. Luo, and J. Shawe-Taylor. Mahnmf: Manhattan non-negative
matrix factorization. arXiv preprint arXiv:1207.3438, 2012.

[18] P. J. Huber. Robust statistics. John Wiley and Sons, 1981.


[19] P. Jain, R. Meka, and I. S. Dhillon. Guaranteed rank minimization via singular
value projection. In Advances in Neural Information Processing Systems, pages
937–945, 2010.
[20] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alter-
nating minimization. In Proceedings of the forty-fifth annual ACM symposium on
Theory of computing, pages 665–674. ACM, 2013.
[21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[22] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recom-
mender systems. Computer, (8):30–37, 2009.
[23] I. Markovsky and K. Usevich. Low rank approximation. Springer, 2012.
[24] J. Mount. Approximation by orthogonal transform. winvector.gitbuh.io/xDrift/
orthApprox.pdf, 2014. [Online; accessed 05-Sep-2018].
[25] Y. Nesterov. A method of solving a convex programming problem with conver-
gence rate o(1/sqr(k)). Soviet Mathematics Doklady, 27:372–376, 1983.
[26] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical pro-
gramming, 103(1):127–152, 2005.
[27] A. B. Owen. A robust hybrid of lasso and ridge regression. Contemporary Math-
ematics, 443(7):59–72, 2007.
[28] S. Oymak, B. Recht, and M. Soltanolkotabi. Sharp time–data tradeoffs for linear
inverse problems. IEEE Transactions on Information Theory, 64(6):4129–4158,
2018.
[29] D. Park, A. Kyrillidis, C. Caramanis, and S. Sanghavi. Finding low-rank solu-
tions via non-convex matrix factorization, efficiently and provably. arXiv preprint
arXiv:1606.03168, 2016.
[30] G. Picci. Stochastic realization theory. In Mathematical System Theory, pages
213–229. Springer, 1991.
[31] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of
linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–
501, 2010.
[32] P. Resnick and H. R. Varian. Recommender systems. Communications of the
ACM, 40(3):56–58, 1997.
[33] B. Riley. Convergence analysis of deterministic and stochastic methods for convex
optimization. Master’s thesis, University of Waterloo, 2017.
[34] W. Su, S. Boyd, and E. Candes. A differential equation for modeling nesterov’s
accelerated gradient method: Theory and insights. In Advances in Neural Infor-
mation Processing Systems, pages 2510–2518, 2014.
[35] R. Sun and Z.-Q. Luo. Guaranteed matrix completion via non-convex factoriza-
tion. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.
[36] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initial-
ization and momentum in deep learning. In International conference on machine
learning, pages 1139–1147, 2013.
[37] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[38] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank so-


lutions of linear matrix equations via procrustes flow. In International Conference
on Machine Learning, pages 964–973, 2016.
[39] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
[40] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal. Data Mining: Practical machine
learning tools and techniques. Morgan Kaufmann, 2016.
[41] Z. Yang, Y. Zhang, W. Yan, Y. Xiang, and S. Xie. A fast non-smooth nonnega-
tive matrix factorization for learning sparse representation. IEEE access, 4:5161–
5168, 2016.
[42] J. Zhang and I. Mitliagkas. Yellowfin and the art of momentum tuning. arXiv
preprint arXiv:1706.03471, 2017.
[43] T. Zhao, Z. Wang, and H. Liu. Nonconvex low rank matrix factorization via inex-
act first order oracle. Advances in Neural Information Processing Systems, 2015.
[44] R. Zhu, D. Niu, and Z. Li. Robust web service recommendation via quantile
matrix factorization. In INFOCOM 2017-IEEE Conference on Computer Com-
munications, IEEE, pages 1–9. IEEE, 2017.

A Appendix A: Proof Sketch of Theorem 2

In the proof of Theorem 2 we only provide a road map. The key idea is first to prove the convergence of the SVP initialization, and then to show that the alternating minimization combined with Nesterov's momentum algorithm has a linear convergence rate. [38] has already proved that alternating minimization combined with gradient descent has a linear convergence rate in the least-squares matrix factorization setting, using results from [9]. Here we prove that alternating minimization combined with Nesterov's momentum algorithm also has a linear convergence rate for nonsmooth matrix factorization, under the condition that the smoothed function is L-smooth and strongly convex; for this we employ a result from [33], which analyzes the linear convergence rate of Nesterov's momentum algorithm via a Lyapunov function.
Below we only provide results for f_µ; in Theorem 1 we provide a separate proof that the solution of f_µ^λ converges to that of f^λ. The assumptions on the function f_µ are as follows:

1 This is the ξ-strong convexity assumption [29]: f_µ(A(N) − b) − f_µ(A(M) − b) ≥ ⟨∇f_µ(A(M) − b), N − M⟩ + ξ‖A(M) − A(N)‖_2^2.
2 This is the corresponding upper bound: f_µ(A(N) − b) − f_µ(A(M) − b) ≤ ⟨∇f_µ(A(M) − b), N − M⟩ + η‖A(M) − A(N)‖_2^2.
3 Without loss of generality, assume M* is the optimal matrix; then f_µ(A(M*) − b) = 0 and f_µ ≥ 0.
4 Assume that f_µ also defines a matrix norm when acting on a matrix M, with c_1‖M‖_2^2 ≤ f_µ(M) ≤ c_2‖M‖_2^2.

Remark: Assumptions 1-2 define the condition number of the function f_µ(·); similar assumptions also appear in [34]. Usually κ = η/ξ is called the condition number of a function [5].

Lemma 1. (Restricted Isometry Property (RIP)) A linear map A satisfies the r-RIP
with constant δr , if

(1 − δr )kM k2F ≤ kA(M )k22 ≤ (1 + δr )kM k2F

is satisfied for all matrices M ∈ Rm×n of rank at most r.

Lemma 2. [8] Let A satisfy 2r-RIP with constant δ2r . Then for all matrices M ,N of
rank at most r, we have

|hA(M ), A(N )i − hM, N i| ≤ δ2r kM kF kN kF .

The next lemma characterizes the convergence rate of the initialization procedure:

Lemma 3. [28] Let M ∈ R^{m×n} be an arbitrary matrix of rank r. Also let b = A(M) ∈ R^p be p linear measurements. Consider the iterative updates

    M^{t+1} ← P_r( M^t − ξ_t ∇_M f_µ(A(M^t) − b) ),

where the M^t are m × n matrices. Then

    ‖M^t − M‖_F ≤ ψ(A)^t ‖M^0 − M‖_F

holds, where ψ(A) is defined as

    ψ(A) = 2 sup { |⟨A(M), A(N)⟩ − ⟨M, N⟩| : ‖M‖_F = ‖N‖_F = 1, rank(M) ≤ 2r, rank(N) ≤ 2r }.

We can prove this lemma by using the results of Theorem 3.


First we will prove that the initialization procedure is indeed converge to the true
value of M .

A.1 Proof of the initialization result, equation (8) in Theorem 2

Lemma 4 (Lemma for initialization). Assume that ξ(1 + δ_{2k})/(1 − δ_{2k}) − η > 0. Denote f̃_µ(M) = f_µ(A(M) − b). Let M* be an optimal solution and let M^t be the iterate obtained by the SVP algorithm at the t-th iteration. Then

    f̃_µ(M^{t+1}) ≤ f̃_µ(M*) + ( ξ(1 + δ_{2k})/(1 − δ_{2k}) − η ) ‖A(M* − M^t)‖_2^2.

Proof. From the assumptions, we have

    f̃_µ(M^{t+1}) − f̃_µ(M^t) ≤ ⟨∇f̃_µ(M^t), M^{t+1} − M^t⟩ + ξ‖A(M^{t+1}) − A(M^t)‖_2^2
                             ≤ ⟨∇f̃_µ(M^t), M^{t+1} − M^t⟩ + ξ(1 + δ_{2k})‖M^{t+1} − M^t‖_F^2,

where the last inequality follows from the RIP. Let N^{t+1} = M^t − (1/(2ξ(1 + δ_{2k}))) ∇f̃_µ(M^t), and

    f_t(M) = ⟨∇f̃_µ(M^t), M − M^t⟩ + ξ(1 + δ_{2k})‖M − M^t‖_F^2.

Then

    f_t(M) = ξ(1 + δ_{2k}) ( ‖M − N^{t+1}‖_F^2 − (1/(4ξ^2(1 + δ_{2k})^2)) ‖∇f̃_µ(M^t)‖_F^2 ).

By definition, P_k(N^{t+1}) = M^{t+1}, so f_t(M^{t+1}) ≤ f_t(M*). Thus

    f̃_µ(M^{t+1}) − f̃_µ(M^t) ≤ f_t(M^{t+1}) ≤ f_t(M*)
        = ⟨∇f̃_µ(M^t), M* − M^t⟩ + ξ(1 + δ_{2k})‖M* − M^t‖_F^2
        ≤ ⟨∇f̃_µ(M^t), M* − M^t⟩ + ξ (1 + δ_{2k})/(1 − δ_{2k}) ‖A(M*) − A(M^t)‖_2^2
        ≤ ⟨∇f̃_µ(M^t), M* − M^t⟩ + η‖A(M*) − A(M^t)‖_2^2 + ( ξ(1 + δ_{2k})/(1 − δ_{2k}) − η ) ‖A(M*) − A(M^t)‖_2^2
        ≤ f̃_µ(M*) − f̃_µ(M^t) + ( ξ(1 + δ_{2k})/(1 − δ_{2k}) − η ) ‖A(M*) − A(M^t)‖_2^2.

Theorem 3. Let b = A(M*) + e for a rank-k matrix M* and an error vector e ∈ R^p. Then, under the assumption that D < 1, where

    D = 1/C^2 + ( ξ(1 + δ_{2k})/(1 − δ_{2k}) − η ) ( 2/C^2 + 1/c_1 + 2/(C√c_1) ),

the SVP algorithm with step size η_t = 1/(2ξ(1 + δ_{2k})) outputs a matrix M of rank at most k such that f̃_µ(M) ≤ (C^2 + ε)‖e‖^2/2, ε ≥ 0, in at most ⌈ log( 2c_2‖b‖_2^2 / ((C^2 + ε)‖e‖^2) ) / log(1/D) ⌉ iterations.

Proof. Let the current solution M^t satisfy f̃_µ(M^t) ≥ C^2‖e‖^2/2. By Lemma 4 and b − A(M*) = e, we have

    f̃_µ(M^{t+1}) ≤ ‖e‖^2/2 + ( ξ(1 + δ_{2k})/(1 − δ_{2k}) − η ) ‖b − A(M^t) − e‖^2
                 ≤ ‖e‖^2/2 + ( ξ(1 + δ_{2k})/(1 − δ_{2k}) − η ) ( ‖b − A(M^t)‖^2 − 2e^⊤(b − A(M^t)) + ‖e‖^2 )
                 ≤ f̃_µ(M^t)/C^2 + ( ξ(1 + δ_{2k})/(1 − δ_{2k}) − η ) ( 2f̃_µ(M^t)/C^2 + f̃_µ(M^t)/c_1 + 2f̃_µ(M^t)/(C√c_1) )
                 = D f̃_µ(M^t).

Since D < 1, combining this with the fact that f̃_µ(M^0) ≤ c_2‖b‖^2 and taking t = ⌈ log( 2c_2‖b‖_2^2 / ((C^2 + ε)‖e‖^2) ) / log(1/D) ⌉, we complete the proof.

The following lemma is adapted from [38]:

Lemma 5. Let M_1, M_2 ∈ R^{m×n} be two rank-r matrices with SVDs M_1 = U_1^⊤ Σ_1 V_1 and M_2 = U_2^⊤ Σ_2 V_2. For l = 1, 2, define the factors Ũ_l = U_l^⊤ Σ_l^{1/2} ∈ R^{m×r} and Ṽ_l = V_l^⊤ Σ_l^{1/2} ∈ R^{n×r}. Assume M_1, M_2 obey ‖M_2 − M_1‖ ≤ (1/2) σ_r(M_1). Then:

    dist^2( [Ũ_2; Ṽ_2], [Ũ_1; Ṽ_1] ) ≤ ( 2/(√2 − 1) ) ‖M_2 − M_1‖_F^2 / σ_r(M_1).

Combining Lemmas 1, 2 and 5 and following a route similar to [38], we can prove that using more than 3 log(√r κ) + 5 iterations of the SVP initialization algorithm, we obtain

    dist( [U_0; V_0], [U; V] ) ≤ (1/4) σ_r(U).   (11)

Thus we have finished the proof of convergence for the initialization procedure. Next, we study the linear decay rate of each iteration for the penalized objective function.
Before studying the theoretical properties, we first rearrange the objective function (7) using an uplifting technique, so that f_µ and the regularization term can be considered simultaneously. For a rank-r matrix M ∈ R^{m×n} with SVD M = U^⊤ Σ V, define Sym : R^{m×n} → R^{(m+n)×(m+n)} as

    Sym(X) = [ 0_{m×m}  X ; X^⊤  0_{n×n} ].

Given a block matrix A = [ A_{11}  A_{12} ; A_{21}  A_{22} ] with A_{11} ∈ R^{m×m}, A_{12} ∈ R^{m×n}, A_{21} ∈ R^{n×m}, A_{22} ∈ R^{n×n}, define

    P_diag(A) = [ A_{11}  0_{m×n} ; 0_{n×m}  A_{22} ],   P_off(A) = [ 0_{m×m}  A_{12} ; A_{21}  0_{n×n} ],

and define B as the uplifted version of the operator A:

    B(M)_k = ⟨B_k, M⟩,   where B_k = Sym(A_k).

Define W = [U^⊤; V^⊤]. As a result, we can rewrite the objective function (7) as

    g(W) := g(U, V)
          = f_µ(b − A(U^⊤ V)) + λ‖U U^⊤ − V V^⊤‖_F^2
          = f_µ( (1/2) B( Sym(U^⊤ V) − Sym(M*) ) ) + λ (1/2) ‖Sym(U^⊤ V) − Sym(M*)‖_F^2.   (12)

From (12) we can see that the non-penalized part and the penalized part have a similar structure.
As a result, although we made Assumptions 1, 2 and 4 on the function f_µ, we can see from (12) that after adding the penalization term, the penalized objective function still retains similar properties; as for Assumption 3, we can use a location transformation to make the penalized objective function satisfy it. Consequently, it does not make much difference whether we deal with the penalized or the unpenalized objective function.
Then, similarly to [38], the alternating minimization with Nesterov's momentum applied to U and V (sub-blocks of W), respectively, can be written as the NAG algorithm applied to g(W) with respect to W.
For the convergence analysis of Nesterov's momentum algorithm, we employ the following lemma, which is a theorem in [33]:

Lemma 6. For the minimization problem min_{x∈X} f(x), where x is a vector, using the Lyapunov function

    Ṽ_k = f(y_k) + ξ‖z_k − x*‖^2,

it can be shown that

    Ṽ_{k+1} − Ṽ_k = −τ_k Ṽ_k + ε_{k+1},   (13)

where the error is expressed as

    ε_{k+1} = ( τ_k^2/(4ξ) ) ‖∇f(x_k)‖^2 + τ_k ( η − ξ/τ_k ) ‖x_k − y_k‖^2,

τ_k is the step size in Nesterov's momentum algorithm, usually equal to 1/√κ, and the iterates are given by y_{k+1} = x_k − (1/(2η)) ∇f(x_k), x_{k+1} = (1/(1 + τ_k)) y_k + (τ_k/(1 + τ_k)) z_k, and z_{k+1} = z_k + τ_k ( x_{k+1} − z_k − (1/(2ξ)) ∇f(x_{k+1}) ).

Assume that τ_0 = 0, τ_1 = τ_2 = · · · = τ̃, and that ε_1, · · · , ε_{k+1} have a common upper bound ε̃. Then (13) implies

    |Ṽ_{k+1}| = | (1 − τ̃)^{k+1} Ṽ_0 + Σ_{i=1}^{k+1} (1 − τ̃)^{i−1} ε_{k+2−i} | ≤ (1 − τ̃)^{k+1} |Ṽ_0| + ( ε̃ − ε̃(1 − τ̃)^k )/τ̃.

Substituting x_k with W_{k+1} = [U^{k+1}, V^{k+1}]^⊤ and f with g, if we want to carry out the convergence analysis with respect to W_t − W* and W_0 − W*, we need to handle two parts. The first part is the error term ε̃; this can be handled by choosing the initial estimate close to the true value, so that ‖∇f(x_1)‖ can be made arbitrarily close to 0. For notational simplicity, assume that ( ε̃ − ε̃(1 − τ̃)^k )/τ̃ ≤ ε†. Since Ṽ_k still satisfies Assumptions 1 and 2, without loss of generality assume the corresponding parameters are ξ̃ and η̃.
Next, we seek the relation between W_t − W* and Ṽ_t. This involves Assumptions 1 and 2 as well as Lemma 1. With a rough handling of the gradient part in Assumptions 1 and 2, we obtain

    ξ̃ ‖A(W_t) − A(W*)‖_2^2 ≤ (1 − τ̃)^t µ̃ ‖A(W_0) − A(W*)‖_2^2 + ε̃.

Notice that ε̃ can be made arbitrarily small, so that

    ξ̃ ‖A(W_t) − A(W*)‖_2^2 ≤ (1 − τ̃_1)^t µ̃ ‖A(W_0) − A(W*)‖_2^2,

with 1 − τ̃_1 still between 0 and 1. Employing the restricted isometry property,

    ξ̃ (1 − δ_r) ‖W_t − W*‖_F^2 ≤ (1 − τ̃_1)^t µ̃ (1 + δ_r) ‖W_0 − W*‖_F^2.

Thus

    dist( [U_t; V_t], [U; V] ) ≤ (1 − τ̃_1)^t (µ̃/ξ̃) ( (1 + δ_r)/(1 − δ_r) ) dist( [U_0; V_0], [U; V] )
                              ≤ (1/4) (1 − τ̃_1)^t (µ̃/ξ̃) ( (1 + δ_r)/(1 − δ_r) ) σ_r(U).

Theorem 2 is proved.

Remark on equation (12): From (12), combined with Assumption 4, we obtain a guideline for the selection of λ, compared with [38].
