Tensor Q-Rank: New Data Dependent Definition of Tensor Rank
Hao Kong
Key Lab. of Machine Perception (MOE), School of EECS, Peking University, Beijing, China.
E-mail: [email protected]
Canyi Lu
Department of Electrical & Computer Engineering (ECE), Carnegie Mellon University, Pittsburgh, USA.
E-mail: [email protected]
Zhouchen Lin
Key Lab. of Machine Perception (MOE), School of EECS, Peking University, Beijing, China.
Z. Lin is the corresponding author.
E-mail: [email protected]
1 Introduction
The general low rank tensor recovery problem can be formulated as
min_X rank(X), s.t. Ψ(X) = Y, (1)
where Y is the observed measurement obtained by a linear operator Ψ(·) and X is the clean data. Generally, it is difficult to solve Eq. (1) directly, and different rank definitions
correspond to different models. The commonly used definitions of tensor rank are
all related to particular tensor decompositions [1]. For example, CP-rank [2] is
based on the CANDECOMP/PARAFAC decomposition [3]; multilinear rank [4] is
based on the Tucker decomposition [5]; tensor multi-rank and tubal-rank [6] are
based on t-SVD [7]; and a new tensor rank with invertible linear operator [8] is
based on T-SVD [9]. Among them, CP-rank and multilinear rank are both older
and more widely studied, while the remaining two mentioned here are relatively
new. Minimizing the rank function in Eq. (1) directly is usually NP-hard and difficult to solve within polynomial time; hence we often replace rank(X) with a convex or non-convex surrogate function. Similar to the matrix case [10, 11], with different definitions of tensor singular values, various tensor nuclear norms have been proposed as rank surrogates [7, 8, 12, 13].
Friedland and Lim [13] introduce cTNN (Tensor Nuclear Norm based on CP) as
the convex relaxation of the tensor CP-rank:
‖T‖_{cTNN} = inf{ ∑_{i=1}^{r} |λ_i| : T = ∑_{i=1}^{r} λ_i u_i ∘ v_i ∘ w_i }, (2)
where ‖u_i‖ = ‖v_i‖ = ‖w_i‖ = 1 and ∘ represents the vector outer product¹.
However, for a given tensor T ∈ R^{n1×n2×n3}, minimizing the surrogate objective ‖T‖_{cTNN} directly is difficult, because computing the CP-rank is usually NP-complete [14, 15] and computing cTNN is NP-hard in some sense [13], which also means we cannot verify the consistency of cTNN's implicit decomposition with the ground-truth CP decomposition. Meanwhile, it is hard to measure the tightness of cTNN relative to the CP-rank². Although Yuan and Zhang [16] give the sub-gradient of cTNN by leveraging its dual property, the high computational cost makes it difficult to implement.
To reduce the computational cost of the rank surrogate function, Liu et al. [12] define a tensor nuclear norm named SNN (Sum of Nuclear Norm):
¹ Please see [1] or our supplementary materials for more details.
² For the matrix case, the nuclear norm is the conjugate of the conjugate function of the rank function in the unit ball. However, it is still unknown whether this property holds for cTNN and CP-rank.
‖T‖_{SNN} := ∑_{i=1}^{d} ‖T_(i)‖_*, (3)
where T ∈ R^{n1×···×nd}, T_(i) ∈ R^{(n1···n_{i−1}n_{i+1}···n_d)×n_i} denotes the unfolding of the tensor along the i-th dimension, and ‖·‖_* is the nuclear norm of a matrix, i.e., the sum of
singular values. Its convenient computation makes SNN widely used [12, 17–20]. It is worth mentioning that, although SNN has a representation similar to the matrix case, Paredes and Pontil [21] point out that SNN is not the tightest convex relaxation of the multilinear rank [4], and is actually an overlap regularization of it. References [22–24] also propose a new regularizer named the Latent Trace Norm to better approximate the tensor rank function. In addition, since SNN unfolds the tensor directly along each dimension, an SNN based model makes insufficient use of the structural information of the tensor.
To avoid information loss in SNN, Kilmer and Martin [7] propose a tensor
decomposition named t-SVD with a Fourier transform matrix F, and Zhang et
al. [25] give a definition of the tensor nuclear norm on T ∈ Rn1 ×n2 ×n3 corresponding
to t-SVD, i.e., Tensor Nuclear Norm (TNN):
‖T‖_{TNN} := (1/n3) ∑_{i=1}^{n3} ‖G^{(i)}‖_*, where G = T ×_3 F, (4)
where G^{(i)} denotes the i-th frontal slice matrix of the tensor G³, and ×_3 is the mode-3 multilinear multiplication [5]. Benefiting from the efficient Discrete Fourier Transform and the good sampling behavior of the Fourier basis on time-series features, TNN has attracted extensive attention in recent years [25–29]. Since the Fourier transform operates along the third dimension, TNN based models have a natural computational advantage for video and other data with strong temporal continuity along a certain dimension.
However, when considering the smoothness of different data, using a fixed Fourier transform matrix F may bring some limitations. In this paper, smooth and non-smooth data along a certain dimension are understood in the usual intuitive sense: the slices of the tensor along that dimension are arranged according to a certain paradigm, e.g., a time series. For example, a continuous video is smooth, whereas a tensor obtained by concatenating several videos of different scenes, or by randomly permuting all frames, is non-smooth.
Firstly, TNN needs to implement the Singular Value Decomposition (SVD) in the complex field C, which is slightly slower than in the real field R. Besides, the experiments in related papers [25, 27, 30, 31] are usually based on special datasets which have smooth changes along the third dimension, such as RGB images and short videos. Non-smooth data may increase the number of non-zero tensor singular values [7, 25], weakening the significance of the low rank structure. Since the tensor multi-rank [25] is actually the rank of each projection matrix on different Fourier basis vectors, non-smooth changes along the third dimension may lead to large singular values appearing on the projection matrix slices which correspond to high frequencies.
³ The implementation of the Fourier transform along the third dimension of T is equivalent to multiplying by a DFT matrix F via ×_3. For more details, please see Sec. 2.2.
Fig. 1 Replace F in Eq. (4) by a matrix M and further obtain new definitions of tensor rank rank_M(X) and tensor nuclear norm ‖X‖_{M,*} by using S^{(i)}.
To address the above phenomenon, some works [8, 9, 32–34] consider improving the projection matrix of TNN, i.e., the Discrete Fourier transform matrix F in Eq. (4). These works replace F by another measurement matrix M and further obtain new definitions of tensor rank rank_M(X) and tensor nuclear norm ‖X‖_{M,*} as regularizers. Figure 1 shows the related operations. Their recovery models can be summarized as follows:
min_X ‖X‖_{M,*}, s.t. Ψ(X) = Y. (5)
Please see Sec. 2 for the relevant definitions in Eq. (5). In the following, we will discuss the motivations and limitations of these works [8, 9, 32–34], respectively.
Kernfeld, Kilmer, and Aeron [9] generalize the t-product by introducing a new operator named the cosine transform product with an arbitrary invertible linear transform L (or an arbitrary invertible matrix M). For a given T ∈ R^{n1×n2×n3} and an invertible matrix M ∈ R^{n3×n3}, they have L_M(T) = T ×_3 M and L_M^{-1}(T) = T ×_3 M^{-1}. Different from the commonly used definition of the tensor mode-i product in [1, 8, 9, 12], it should be mentioned that, for convenience in this paper, we define L_Q(T) = T ×_3 Q = fold_3(T_(3)Q), where T_(3) ∈ R^{n1n2×n3} is defined by T_(3) := unfold_3(T). That is to say, we arrange the tensor fibers T_{ij:} as rows.
Following this idea, Lu, Peng, and Wei [8] propose a new tensor nuclear norm induced by invertible linear transforms [9]. Different from [7, 25], they use a fixed invertible matrix to replace the Fourier transform matrix in TNN. Although this method improves the performance of the recovery model to a certain extent, some new problems arise, such as how to determine the fixed invertible matrix. Normally, different data need different optimal invertible matrices, but a reasonable matrix selection method is not given in [8]. Furthermore, the Frobenius norm of the invertible matrix is not controlled, which may lead to some computational problems, e.g., values approaching zero or infinity.
Additionally, Kernfeld, Kilmer, and Aeron [9] propose the idea that, with the help of a Toeplitz-plus-Hankel matrix [35], the Discrete Cosine Transform matrix C can also be used to replace F. The work [32] then proposes some fast algorithms for diagonalization and the relevant recovery model. However, C is still based on trigonometric functions, and may lead to problems similar to those of the TNN based model, as we mentioned in the last paragraph of Sec. 1.1.
Considering the efficiency of the time-space transformation, the work [33] uses the Daubechies 4 discrete wavelet transform matrix to replace F. As we know, the wavelet transform can take position information into account, which may make it better than the Fourier and cosine transforms in handling some special data, e.g., audio data. However, many wavelet bases generate transform matrices in exponential form, which means that a large-scale wavelet matrix may bring a high computational cost.
Regardless of the computational complexity, Jiang et al. [34] introduce a new projection matrix named the tight framelets transform matrix [36, 37]. They claim that redundancy in the transformation is important, as such transformed coefficients can contain information about missing data in the original domain [36]. However, we consider that redundancy is not a sufficient condition for improving the performance of the recovery model shown in Eq. (5).
In summary, different multipliers M in Eq. (5) lead to different definitions of the regularizer, which may lead to different experimental results. However, there are still no unified rules for selecting M. It can be seen from the above methods that when M is selected as an orthogonal matrix, it is convenient for calculation and interpretation. In general, projection bases are unit orthogonal. We further argue that every dataset should have its own best matching matrix, i.e., M could be data dependent. In this paper, we address the problem of how to define a better data dependent orthogonal transformation matrix.
1.3 Motivation
In the tensor completion task, we find that when dealing with some non-smooth data, Tensor Nuclear Norm (TNN) based methods usually perform worse than in the smooth case. Therefore, we aim to mitigate this phenomenon by changing the projection basis F in Eq. (4). In other words, we provide interpretable selection criteria for M in Eq. (5), e.g., making M an orthogonal matrix that is data dependent w.r.t. the data tensor X. The details are given below:
Whether in the case of matrix recovery [10, 11] or tensor recovery [8, 27, 38], the
low rank hypothesis is very important. Generally speaking, the lower the rank of
the data, the easier it is to recover with fewer observations. As can be seen from
Figure 2, we can use a better Q to make the low rank structure of the non-smooth
data more significant.
Considering the convex relaxation, the low rank property is usually represented
by (a): the distribution of singular values, or (b): the value of nuclear
norm. We may as well take these two points as prior knowledge respectively, and
specify the selection rules of Q in Eq. (6), so that the low rank property of X can
be better reflected. Therefore, we provide two methods in this paper as follows:
(a): Let Q satisfy a certain selection method to make more tensor singular
values close to 0 while the remaining ones are far from 0. From another perspective,
Fig. 2 Comparison of the two different low rank structures between our proposed regularization and TNN regularization on non-smooth video data. Left: the first 500 sorted singular values given by TNN regularization (divided by √n3) and by ours. Right: the short video with background changes.
the distribution variance of singular values should be larger, which leads to Variance
Maximization Tensor Q-Nuclear norm (VMTQN) in Sec. 3.1.
(b): Let Q minimize the nuclear norm ‖X‖_{Q,*} directly, leading to a bilevel problem. As we know, the nuclear norm is usually used as a surrogate function of the rank function. Then we use a manifold optimization method to solve the problem, which leads to Manifold Optimization Tensor Q-Nuclear norm (MOTQN) in Sec. 3.2.
1.4 Contributions
2.1 Notations
We introduce some notations and necessary definitions which will be used later.
Tensors are represented by uppercase calligraphic letters, e.g., T . Matrices are
represented by boldface uppercase letters, e.g., M. Vectors are represented by
boldface lowercase letters, e.g., v. Scalars are represented by lowercase letters, e.g.,
s. Given a third-order tensor T ∈ Rn1 ×n2 ×n3 , we use T(k) to represent its k-th
frontal slice T (:, :, k) while its (i, j, k)-th entry is represented as Tijk . σi (X) denotes
the i-th largest singular value of matrix X. X+ denotes the pseudo-inverse matrix of
X. ‖X‖_σ = σ_1(X) denotes the matrix spectral norm, ‖X‖_* = ∑_{i=1}^{min{n1,n2}} σ_i(X) denotes the matrix nuclear norm, and ‖X‖_{2,1} = ∑_{j=1}^{n2} √(∑_{i=1}^{n1} X_ij²) denotes the matrix ℓ_{2,1} norm, where X ∈ R^{n1×n2} and X_ij is the (i, j)-th entry of X.
T_(3) ∈ R^{n1n2×n3} denotes the unfolding of the tensor T along the third dimension, which is slightly different from [1, 9]. That is to say, we arrange the tensor fibers T_{ij:} as rows. We then define L_Q(T) = T ×_3 Q = fold_3(T_(3)Q) and have L_Q^{-1}(T) = T ×_3 Q^{-1}, where T_(3) := unfold_3(T).
Due to limited space, for the definitions of PT [26], multilinear multiplication [5],
t-product [7], and so on, please see our Supplementary Materials.
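To make the mode-3 unfolding convention and the operator L_Q concrete, here is a minimal NumPy sketch (the helper names unfold3, fold3, and L_Q are ours and not from any released code); it also checks the two properties used later: invertibility and Frobenius-norm preservation when Q is orthogonal.

```python
import numpy as np

def unfold3(T):
    """Mode-3 unfolding T_(3): each tube fiber T[i, j, :] becomes a row."""
    n1, n2, n3 = T.shape
    return T.reshape(n1 * n2, n3)

def fold3(M, shape):
    """Inverse of unfold3 for a target shape (n1, n2, r)."""
    return M.reshape(shape)

def L_Q(T, Q):
    """L_Q(T) = T x_3 Q = fold3(unfold3(T) Q)."""
    n1, n2, _ = T.shape
    return fold3(unfold3(T) @ Q, (n1, n2, Q.shape[1]))

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 5, 6))
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))         # a random orthogonal matrix
G = L_Q(T, Q)
assert np.allclose(L_Q(G, Q.T), T)                       # invertibility: L_Q^{-1} = L_{Q^T}
assert np.isclose(np.linalg.norm(G), np.linalg.norm(T))  # Frobenius norm is preserved
```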
For a given tensor X ∈ Rn1 ×n2 ×n3 and a Fourier transform matrix F ∈ Cn3 ×n3 , if
we use G(i) to represent the i-th frontal slice of tensor G, then the tensor multi-rank
and Tensor Nuclear Norm (TNN) of X can be formulated by mode-3 multilinear
multiplication as follows:
rank_m(X) := (r_1, ..., r_{n3}), where r_i = rank(G^{(i)}), G = X ×_3 F, (7)
‖X‖_* := (1/n3) ∑_{i=1}^{n3} ‖G^{(i)}‖_*, where G = X ×_3 F. (8)
Compared with the CP-rank and cTNN mentioned in Sec. 1.1, it is quite easy to calculate Eqs. (7) and (8) through the matrix Singular Value Decomposition (SVD).
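As an illustration of Eqs. (7) and (8), the following sketch (our own helper, assuming a relative threshold for the numerical rank) applies the DFT along the third dimension and takes slice-wise SVDs; for data that are constant along the third dimension, only the zero-frequency slice carries rank, which is the smoothness effect discussed in Sec. 1.1.

```python
import numpy as np

def multi_rank_and_tnn(X, tol=1e-8):
    """Tensor multi-rank (Eq. (7)) and TNN (Eq. (8)) of X in R^{n1 x n2 x n3}."""
    n1, n2, n3 = X.shape
    G = np.fft.fft(X, axis=2)                          # G = X x_3 F, frontal slices G[:, :, i]
    svals = [np.linalg.svd(G[:, :, i], compute_uv=False) for i in range(n3)]
    thresh = tol * max(s.max() for s in svals)         # one global numerical-rank threshold
    ranks = [int(np.sum(s > thresh)) for s in svals]
    tnn = sum(s.sum() for s in svals) / n3             # the 1/n3 factor of Eq. (8)
    return ranks, tnn

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 30, 1))
X_smooth = np.repeat(A, 20, axis=2)                    # constant (perfectly smooth) along dim 3
print(multi_rank_and_tnn(X_smooth)[0])                 # only the zero-frequency slice has nonzero rank
```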
Kernfeld, Kilmer, and Aeron [9] generalize the t-product by introducing a new
operator named cosine transform product with an arbitrary invertible linear trans-
form L (or an arbitrary invertible matrix Q). For an invertible matrix Q ∈ R^{n3×n3}, they have L_Q(X) = X ×_3 Q and L_Q^{-1}(X) = X ×_3 Q^{-1}.
Here, we further define the invertible multiplier Q as any general real orthog-
onal matrix. It is worth mentioning that the orthogonal matrix Q has two good
properties: one is invertibility, the other is to keep Frobenius norm invariant, i.e.,
kX kF = kLQ (X )kF . Then we introduce a new definition of tensor rank named
Tensor Q-rank.
Definition 1 (Tensor Q-rank) Given a tensor X ∈ R^{n1×n2×n3} and a fixed real orthogonal matrix Q ∈ R^{n3×n3}, the tensor Q-rank of X is defined as follows:
rank_Q(X) := ∑_{i=1}^{n3} rank(G^{(i)}), where G = L_Q(X) = X ×_3 Q. (9)
The corresponding low rank tensor recovery model can be written as follows:
min_X rank_Q(X), s.t. Ψ(X) = Y. (10)
Generally in the low rank recovery models, due to the discontinuity and non-
convexity of the rank function, it is quite difficult to minimize the rank function
directly. Therefore, some auxiliary definitions of tensor singular value and tensor
norm are needed to relax the rank function.
Considering the superior recovery performance of TNN in many existing tasks, e.g.,
video denoising [39] and subspace clustering [28], we can use the similar singular
value definition of TNN. Given a tensor X ∈ Rn1 ×n2 ×n3 and a fixed orthogonal
matrix Q such that G = LQ (X ), then the Q-singular value of X is defined as
{σ_j(G^{(i)})}, where i = 1, ..., n3, j = 1, ..., min{n1, n2}, G^{(i)} is the i-th frontal
slice of G, and σ(·) denotes the matrix singular value. When an orthogonal matrix
Q is fixed, the corresponding tensor spectral norm and tensor nuclear norm of X
can also be given.
‖X‖_{Q,*} := ∑_{i=1}^{n3} ‖G^{(i)}‖_*, where G = L_Q(X). (12)
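A corresponding sketch of Definition 1 and Eq. (12) (the function name and the numerical-rank threshold are our own choices): given a column orthogonal Q, form G = L_Q(X) and accumulate slice-wise ranks and nuclear norms.

```python
import numpy as np

def q_rank_and_q_nuclear(X, Q, tol=1e-8):
    """Tensor Q-rank (Eq. (9)) and tensor Q-nuclear norm (Eq. (12)) for a given Q."""
    n1, n2, n3 = X.shape
    G3 = X.reshape(n1 * n2, n3) @ Q                    # mode-3 unfolding times Q
    G = G3.reshape(n1, n2, Q.shape[1])                 # G = L_Q(X) with r frontal slices
    svals = [np.linalg.svd(G[:, :, i], compute_uv=False) for i in range(Q.shape[1])]
    thresh = tol * max(s.max() for s in svals)         # one global numerical-rank threshold
    q_rank = sum(int(np.sum(s > thresh)) for s in svals)
    q_nuclear = sum(s.sum() for s in svals)
    return q_rank, q_nuclear

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 20, 10))
print(q_rank_and_q_nuclear(X, np.eye(10)))             # with Q = I the slices of G are those of X
```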
Moreover, with any fixed orthogonal matrix Q, the convexity, duality, and
envelope properties are all preserved.
Property 1 (Convexity) Tensor Q-nuclear norm and Tensor Q-spectral norm are
both convex.
Property 2 (Duality) Tensor Q-nuclear norm is the dual norm of Tensor Q-spectral
norm, and vice versa.
Property 3 (Convex Envelope) Within the unit ball of the tensor Q-spectral norm, the tensor Q-nuclear norm is the tightest convex envelope of the tensor Q-rank.
These three properties are quite important in low rank recovery theory. Property 3 implies that we can use the tensor Q-nuclear norm as a rank surrogate. That is to say, when the orthogonal matrix Q is given, we can replace the low tensor Q-rank model (10) with the following model to recover the original tensor:
min_X ‖X‖_{Q,*}, s.t. Ψ(X) = Y. (13)
In some cases, we will encounter the situation where Q is not a square matrix, i.e., Q ∈ R^{n3×r} is column orthonormal. Then the corresponding definitions of rank_Q(X) in Eq. (9) and ‖X‖_{Q,*} in Eq. (12) also change to sums over r frontal slices instead of n3. Moreover, as for the convex envelope property, the double conjugate function of the rank function rank_Q(X) is still the corresponding nuclear norm ‖X‖_{Q,*} within a unit ball. We give the following theorem to illustrate this case:
Theorem 1 Given a tensor T ∈ R^{n1×n2×n3} and a fixed real column orthonormal matrix Q ∈ R^{n3×r}, let Q_⊥ ∈ R^{n3×(n3−r)} be the column complement matrix of Q, and let Q_t = [Q, Q_⊥] be an orthogonal matrix. Then within the unit ball D = {X | ‖X‖_{Q_t} ≤ 1}, the double conjugate function of rank_Q(X) is ‖X‖_{Q,*}:
rank_Q**(X) = ‖X‖_{Q,*}. (14)
In other words, ‖X‖_{Q,*} is still the tightest convex envelope of rank_Q(X) within the unit ball D.
Theorem 1 indicates that even if Q is not a square matrix, Eq. (13) can still be used as an effective recovery model.
It is easy to see that Eq. (15) is actually a bilevel model and is usually hard to solve directly. In the following, we will show two ways to solve this problem from the following two perspectives:
1. One is to use the prior knowledge of X to specify the selection criteria of
Q. For the low rank hypothesis, we usually measure it by the distribution of
singular values. Therefore, we consider artificially specifying the conditions that
Q should satisfy so as to maximize the variance of the corresponding singular
values.
2. The other is to give the function Q = argmin_Q f(X, Q) = argmin_{Q^⊤Q=I} ‖X‖_{Q,*} and then use manifold optimization to solve the bilevel model directly. That is to say, we directly minimize the surrogate function of the rank function (Property 3 and Theorem 1). It should be noted that although this way has stronger rationality, it incurs a higher computational complexity.
From the above two perspectives, Q will be data dependent. In the following, we introduce our two methods in two sub-sections (Sec. 3.1 and Sec. 3.2), respectively. In the last part (Sec. 3.3), considering a third-order tensor X ∈ R^{n1×n2×n3}, we analyze the applicability of each method in two different situations, i.e., n1n2 < n3 and n1n2 > n3.
Combined with the above two points, it is easy to see that we need to make more ‖G^{(i)}‖_F close to 0 while the sum of squares ∑_{i=1}^{n3} ‖G^{(i)}‖_F² is a constant C. Since the i-th frontal slice of G corresponds to the i-th column of the mode-3 unfolding G_(3) = X_(3)Q, this is equivalent to minimizing the ℓ_{2,1} norm ‖G_(3)‖_{2,1} = ‖X_(3)Q‖_{2,1}, where G_(3) and X_(3) denote the mode-3 unfolding matrices [5].
Lemma 2 Given a fixed matrix X ∈ R^{n1×n2} and its full Singular Value Decomposition X = UΣV^⊤ with U ∈ R^{n1×n1}, Σ ∈ R^{n1×n2}, and V ∈ R^{n2×n2}, the matrix of right singular vectors V optimizes the following problem:
min_{Q∈R^{n2×n2}, Q^⊤Q=I} ‖XQ‖_{2,1}. (16)
For the proofs of the above, please see Appendices B and C. Theorem 2 shows that, to minimize the ℓ_{2,1} norm ‖X_(3)Q‖_{2,1} w.r.t. Q, we can choose Q as the matrix of right singular vectors of X_(3).
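A quick numerical check of Lemma 2 (a sketch under our own naming, with the ℓ_{2,1} norm taken as the sum of column norms, as defined in Sec. 2.1): choosing Q as the right singular matrix V gives ‖XQ‖_{2,1} equal to the sum of the singular values, which is no larger than what random orthogonal choices achieve.

```python
import numpy as np

def l21(M):
    """l_{2,1} norm: the sum of the Euclidean norms of the columns of M."""
    return np.linalg.norm(M, axis=0).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
_, S, Vt = np.linalg.svd(X, full_matrices=True)      # X = U diag(S) V^T
V = Vt.T

val_V = l21(X @ V)                                    # Q = V, the choice given by Lemma 2
val_rand = min(l21(X @ np.linalg.qr(rng.standard_normal((8, 8)))[0])
               for _ in range(200))                   # best of 200 random orthogonal Q

print(np.isclose(val_V, S.sum()))   # l21(XV) equals the sum of the singular values
print(val_V <= val_rand + 1e-9)     # and it is never worse than the random candidates
```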
And the corresponding surrogate model (13) is also replaced by the following:
min_{X,Q} ‖X‖_{Q,*}, s.t. Ψ(X) = Y, Q ∈ argmin_{Q^⊤Q=I} ‖X_(3)Q‖_{2,1}, X_(3)QQ^⊤ = X_(3). (20)
In Eqs. (19) and (20), X_(3) ∈ R^{n1n2×n3} denotes the mode-3 unfolding matrix of the tensor X ∈ R^{n1×n2×n3}, and Q ∈ R^{n3×r} with r = min{n1n2, n3}.
Remark 1 Notice that Q ∈ R^{n3×r} in Eqs. (19) and (20) may not have full columns, i.e., r < n3. The corresponding definitions of rank_Q(X) in Eq. (9) and ‖X‖_{Q,*} in Eq. (12) also change to the sum of r frontal slices instead of n3. Then Theorem 1 guarantees the validity of Eq. (20).
Remark 2 In fact, from Appendix C we can see that r can be chosen as any value satisfying rank(X_(3)) ≤ r ≤ min{n1n2, n3}, as long as Q ∈ R^{n3×r} contains the whole column space of the matrix of right singular vectors V and is pseudo-invertible so that X = X ×_3 Q ×_3 Q^+ holds.
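In code, the VMTQN selection rule Q = PCA(X, 3, r) amounts to taking the top-r right singular vectors of the mode-3 unfolding X_(3); a minimal sketch (the function name is ours, not the authors' released implementation) is:

```python
import numpy as np

def pca_mode3(X, r=None):
    """Q = PCA(X, 3, r): the top-r right singular vectors of the mode-3 unfolding X_(3)."""
    n1, n2, n3 = X.shape
    X3 = X.reshape(n1 * n2, n3)                   # mode-3 unfolding, fibers X[i, j, :] as rows
    r = min(n1 * n2, n3) if r is None else r
    _, _, Vt = np.linalg.svd(X3, full_matrices=False)
    return Vt[:r].T                               # column orthonormal, shape (n3, r)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 10, 500))
Q = pca_mode3(X)                                  # r = min(100, 500) = 100
print(Q.shape, np.allclose(Q.T @ Q, np.eye(Q.shape[1])))
```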
Recalling the data-dependent low rank recovery model Eq. (15) with X ∈ R^{n1×n2×n3}, our main idea is to find a learnable Q ∈ R^{n3×n3} to minimize rank_Q(X). Inspired by Remark 3, if we let Q = argmin_{Q^⊤Q=I} ‖X‖_{Q,*} to minimize the surrogate function directly, then we get the following bilevel model:
min_{X,Q} ‖X‖_{Q,*}, s.t. Ψ(X) = Y, Q = argmin_{Q^⊤Q=I} ‖X‖_{Q,*}. (26)
In Eq. (26), the lower-level problem w.r.t. Q is actually a Stiefel manifold optimiza-
tion problem. Similarly, we can define the corresponding nuclear norm as follows:
Different from VMTQN, the learnable Q in Eq. (26) should be a square matrix, i.e., Q ∈ R^{n3×n3}. If not, as mentioned in Sec. 3.1.1, Q may converge to the singular spaces corresponding to smaller singular values. To avoid this case, we let Q ∈ R^{n3×n3}. In the following, the key point of solving this model is how to deal with the following orthogonality constrained optimization problem:
Q = argmin_{Q^⊤Q=I} ‖X‖_{Q,*} = argmin_{Q^⊤Q=I} ∑_{i=1}^{n3} ‖G^{(i)}‖_*, where G = X ×_3 Q. (28)
Note that Eq. (28) is actually a non-convex problem due to the orthogonality constraint. The usual way is to perform manifold gradient descent on the Stiefel manifold, which evolves along the manifold geodesics [40]. However, this method usually requires a lot of computation to calculate the projected gradient direction of the objective function. Meanwhile, the work [41] develops a technique to solve such orthogonality constrained problems iteratively, which generates feasible points by the Cayley transformation and only involves matrix multiplication and inversion. Here we consider using their algorithm to solve the lower-level problem.
Assume Q ∈ R^{n×r} and denote the gradient of the objective function f(Q) = ‖X‖_{Q,*} w.r.t. Q at Q_k (the k-th iteration) by P ∈ R^{n×r}. Then the projection of P onto the tangent space of the Stiefel manifold at Q_k is AQ_k, where A = PQ_k^⊤ − Q_kP^⊤ and A ∈ R^{n×n} [41]. Instead of parameterizing the geodesic of the Stiefel manifold along direction A using the exponential function, inspired by [41], we generate
feasible points by the following Cayley transformation:
Q(τ) = C(τ)Q_k, where C(τ) = (I + (τ/2)A)^{-1}(I − (τ/2)A), (29)
where I is the identity matrix and τ ∈ R is a parameter that determines the step size of Q_{k+1}. That is to say, Q(τ) is a re-parameterized geodesic w.r.t. τ on the Stiefel manifold with the following properties: (1) dQ(0)/dτ = −AQ_k; (2) Q(τ) is smooth in τ; (3) Q(0) = Q_k; (4) Q(τ)^⊤Q(τ) = I.
The work [41] shows that if τ is in a proper range, Q(τ ) can lead to a lower
objective function value than Q(0) on the Stiefel manifold. In summary, solving
the problem Q = argminQ> Q=I kX kQ,∗ consists of two steps: (1) find a proper
τ ∗ to make the value of the objective function f (Q(τ )) = kX kQ(τ ),∗ decrease; (2)
update Qk+1 by Eq. (29), i.e., Qk+1 = Q(τ ∗ ).
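A small sketch of the Cayley update in Eq. (29) (variable names are ours): given the gradient P at Q_k, form the skew-symmetric A and the feasible point Q(τ), and verify the properties Q(τ)^⊤Q(τ) = I and Q(0) = Q_k listed above.

```python
import numpy as np

def cayley_step(Qk, P, tau):
    """Q(tau) = (I + tau/2 A)^{-1} (I - tau/2 A) Qk with A = P Qk^T - Qk P^T (Eq. (29))."""
    n = Qk.shape[0]
    A = P @ Qk.T - Qk @ P.T                    # skew-symmetric: A^T = -A
    I = np.eye(n)
    return np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ Qk)

rng = np.random.default_rng(0)
Qk, _ = np.linalg.qr(rng.standard_normal((8, 8)))
P = rng.standard_normal((8, 8))                # stand-in for the gradient in Eq. (31)
Qt = cayley_step(Qk, P, tau=0.1)
print(np.allclose(Qt.T @ Qt, np.eye(8)))       # feasibility: Q(tau)^T Q(tau) = I
print(np.allclose(cayley_step(Qk, P, 0.0), Qk))  # Q(0) = Qk
```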
(1): We first compute the gradient of the objective function f (Q) = kX kQ,∗ w.r.t.
Q at Qk . According to the chain rule, we get the following:
∂f(Q)/∂Q = (∂G/∂Q) · (∂f(Q)/∂G) = (∂G_(3)/∂Q) × unfold_3(∂f(Q)/∂G). (30)
Note that G = X ×_3 Q and G_(3) = X_(3)Q; then we get ∂G_(3)/∂Q = X_(3)^⊤, where G_(3) and X_(3) are the mode-3 unfolding matrices. Additionally, Eq. (28) shows that f(Q) = ∑_{i=1}^{n3} ‖G^{(i)}‖_*, where G^{(i)} are the frontal slices of G. We let H^{(i)} = U^{(i)}V^{(i)⊤}, where H^{(i)} denotes the i-th frontal slice of H and U^{(i)}, V^{(i)} denote the left/right singular matrices of G^{(i)} given by the skinny SVD [42]. Therefore, H = ∂f(Q)/∂G is the same as in the matrix case and can be obtained from the singular value decomposition⁵.
In summary, the gradient of the objective function f (Q) w.r.t. Q at Qk (denoted
by P) can be written as follows:
Gradient = P = ∂f(Q)/∂Q = (∂G/∂Q) · (∂f(Q)/∂G) = X_(3)^⊤H_(3), (31)
where X_(3) and H_(3) are the mode-3 unfolding matrices of X and H, respectively.
(2): Then we construct a geodesic curve along the gradient direction on the Stiefel
manifold by Eq. (29):
Q(τ) = (I + (τ/2)A)^{-1}(I − (τ/2)A)Q_k, where A = PQ_k^⊤ − Q_kP^⊤. (32)
We consider the following problem for finding a proper τ :
τ* = argmin_{0≤τ≤ε} f(Q(τ)) = argmin_{0≤τ≤ε} g(τ) = argmin_{0≤τ≤ε} ‖X‖_{Q(τ),*}, (33)
4. Estimate τ* = min{ε, τ̃} by Eq. (35) and Lemma 3.
5. Update Q_{k+1} = Q(τ*) by Eq. (32).
6. end while
Output: Matrix Q_K.
Given that τ* is small enough, we can approximate g(τ) via its second order Taylor expansion at τ = 0, i.e., g(τ) = g(0) + g'(0)·τ + (1/2)g''(0)·τ² + O(τ³). It should be mentioned that since f(Q) is non-convex w.r.t. Q, the sign of g''(0) is uncertain. However, Wen et al. [41] point out that g'(0) = −(1/2)‖A‖²_F always holds. Thus we can estimate an optimal solution τ* via:
τ* = min{ε, τ̃}, where ε < 2/‖A‖, and τ̃ = −g'(0)/g''(0) if g''(0) > 0, or τ̃ = 1/‖A‖ if g''(0) ≤ 0. (35)
Here we give the following lemma to omit the calculation process (see Appendix D).
Lemma 3 Let g(τ) = f(Q(τ)) = ‖X‖_{Q(τ),*} and Q(τ) ≈ (I − τA + (τ²/2)A²)Q_k, where A is defined in Eq. (32). Then the first and second order derivatives of g(τ) evaluated at 0 can be estimated as follows:
g'(0) ≈ ⟨X_(3)^⊤H_(3), −AQ_k⟩, and g''(0) ≈ ⟨X_(3)^⊤H_(3), A²Q_k⟩, (36)
where X_(3) and H_(3) are defined the same as in Eq. (31).
By using Eq. (35) and Lemma 3, we can obtain the optimal step size τ ∗ and
then use Eq. (32) to update Qk+1 = Q(τ ∗ ). Algorithm 1 organizes the whole
calculation process.
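Combining Eqs. (31), (32), (35), and Lemma 3, one inner update of Q can be sketched as follows (the function names and the choice ε = 1.9/‖A‖_F are our own illustrative assumptions; this is not the authors' released code):

```python
import numpy as np

def f_and_grad(X, Q):
    """f(Q) = ||X||_{Q,*} and its gradient P = X_(3)^T H_(3)  (Eqs. (28) and (31))."""
    n1, n2, n3 = X.shape
    X3 = X.reshape(n1 * n2, n3)
    G3 = X3 @ Q
    f, H3 = 0.0, np.zeros_like(G3)
    for i in range(Q.shape[1]):
        U, s, Vt = np.linalg.svd(G3[:, i].reshape(n1, n2), full_matrices=False)
        f += s.sum()
        H3[:, i] = (U @ Vt).reshape(n1 * n2)     # H^(i) = U^(i) V^(i)^T
    return f, X3.T @ H3

def motqn_inner_step(X, Qk, eps_scale=1.9):
    """One update Q_{k+1} = Q(tau*) following Eqs. (32), (35) and Lemma 3."""
    fk, P = f_and_grad(X, Qk)
    A = P @ Qk.T - Qk @ P.T                      # skew-symmetric direction
    normA = np.linalg.norm(A)
    if normA < 1e-12:                            # already (numerically) stationary
        return Qk, fk
    g1 = np.sum(P * (-A @ Qk))                   # g'(0)  ~ <P, -A Qk>   (Lemma 3)
    g2 = np.sum(P * (A @ A @ Qk))                # g''(0) ~ <P, A^2 Qk>  (Lemma 3)
    eps = eps_scale / normA                      # eps < 2 / ||A||       (Eq. (35))
    tau_tilde = -g1 / g2 if g2 > 0 else 1.0 / normA
    tau = min(eps, tau_tilde)
    I = np.eye(Qk.shape[0])
    Qnext = np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ Qk)  # Eq. (32)
    return Qnext, fk

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 15, 8))
Q = np.eye(8)
for _ in range(5):
    Q, fk = motqn_inner_step(X, Q)
    print(round(fk, 3))            # the objective value f(Q_k) before each update
```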
Returning to the bilevel low rank tensor recovery model in Eq. (26), we complete the iterative updating step of the lower-level problem Eq. (28) by Algorithm 1. Once Q_{k+1} is fixed, the upper-level problem can be solved easily.
In Sec. 3.2 (MOTQN), we mention that Q ∈ R^{n3×n3} should be a square matrix, while this is not required in Sec. 3.1 (VMTQN). In this section, we start from this point and analyze the impact of the size of X ∈ R^{n1×n2×n3} on the applicability of these two methods.
3.3.1 Case 1: r = n1 n2 < n3
In this case, the VMTQN model in Eq. (22) usually performs better than other methods in terms of computational efficiency, including MOTQN and other works [8, 32–34, 38]. As we can see from Sec. 3.1, the VMTQN model needs to calculate a skinny right singular matrix V of the unfolding matrix X_(3) ∈ R^{n1n2×n3}. If r < n3, then not only is the computational complexity moderate, but Q can also play the role of feature selection like Principal Component Analysis, which corresponds to the notation Q = PCA(X, 3, r).
Meanwhile, MOTQN and the works [8, 32, 33, 38] usually need a square factor matrix Q, and [34] even requires the columns of Q to be redundant.
3.3.2 Case 2: r = n3 < n1 n2
In this case, the MOTQN model in Eq. (26) has the best explainability and rationality.
On the one hand, with the same size of Q ∈ R^{n3×n3}, MOTQN minimizes the tensor Q-nuclear norm directly, which properly corresponds to the definition of the low rank structure. On the other hand, thanks to the algorithm in [41], the optimization of the MOTQN model has a good convergence guarantee.
In the third-order tensor completion task, Ω is an index set consisting of the
indices {(i, j, k)} which can be observed, and the operator Ψ in Eqs. (21) and (22)
is replaced by an orthogonal projection operator PΩ , where PΩ (Xijk ) = Xijk if
(i, j, k) ∈ Ω and 0 otherwise. The observed tensor Y satisfies Y = PΩ (Y). Then the
tensor completion model based on our two ways are given by:
(VMTQN): min_X ‖X‖_{Q,*}, s.t. P_Ω(X) = Y, Q = PCA(X, 3, r), (37)
and
(MOTQN): min_{X,Q} ‖X‖_{Q,*}, s.t. Q = argmin_{Q^⊤Q=I} ‖X‖_{Q,*}, P_Ω(X) = Y, (38)
where X is the tensor with low rank structure. In Eq. (37), Q ∈ R^{n3×r} is a column orthonormal matrix with r = min{n1n2, n3}, while in Eq. (38), Q ∈ R^{n3×n3} is a square orthogonal matrix. To solve these models by an ADMM based
method [43], we introduce an intermediate tensor E to separate X from PΩ (·). Let
E = PΩ (X ) − X , then PΩ (X ) = Y is translated to X + E = Y, PΩ (E) = O, where
O is an all-zero tensor. Then we get the following two models:
(VMTQN): min_{X,E,Q} ‖X‖_{Q,*}, s.t. X + E = Y, P_Ω(E) = O, Q = PCA(X, 3, r), (39)
and
(MOTQN): min_{X,E,Q} ‖X‖_{Q,*}, s.t. X + E = Y, P_Ω(E) = O, Q = argmin_{Q^⊤Q=I} ‖X‖_{Q,*}. (40)
Note that in Eq. (40), the constraint Q = argmin_{Q^⊤Q=I} ‖X‖_{Q,*} is the same as the objective function, so it could be omitted. Nevertheless, in order to keep Eqs. (39) and (40) unified in form and to express the dependence of Q on X conveniently, we retain this constraint here.
Since Q is dependent on X, it is difficult to solve the models (39) and (40) w.r.t. {X, Q} directly. Here we adopt the idea of alternating minimization to solve X and Q alternately. We separate the sub-problem of solving Q as a sub-step in every K iterations, and then update X with a fixed Q by the ADMM method [27, 43].
The partial augmented Lagrangian function of Eqs. (39) and (40) is
L(X, E, Z, µ) = ‖X‖_{Q,*} + ⟨Z, Y − X − E⟩ + (µ/2)‖Y − X − E‖²_F, (41)
where Z is the dual variable and µ > 0 is the penalty parameter. Then we can update each component Q, X, E, and Z alternately. Algorithms 2 and 3 show the details of the optimization methods for Eqs. (39) and (40). In order to improve the efficiency and stable convergence of the algorithm, we introduce a parameter K to control the update frequency of Q as a heuristic design. The different effects of K on the two models are explained in Sec. 4.3 and Sec. 4.4, respectively.
Note that there is one operator Prox in the sub-step of updating X as follows:
X = Prox_{λ,‖·‖_{Q,*}}(T) := argmin_X λ‖X‖_{Q,*} + (1/2)‖X − T‖²_F, (42)
where Q ∈ R^{n3×r} is a given column orthonormal matrix and ‖X‖_{Q,*} is the tensor Q-nuclear norm of X defined in Eq. (12). Algorithm 3 shows the details of solving this operator.
For the models (37) or (39), it is hard to analyze the convergence of the corre-
sponding optimization method directly. The constraint on Q is non-linear and the
objective function is essentially non-convex w.r.t. Q, which increases the difficulty
of analysis. However, the conclusions of [43–47] guarantee the convergence to some
extent.
In practical applications, we can fix Q_k = Q in every K iterations to solve a convex problem w.r.t. X. As long as X is convergent, by using the following Lemma 4, the change of Q is bounded:
If v_ij represents the j-th element of v_i, then ‖∂v_ij / ∂(X^⊤X)‖_2 < ∞.
Algorithm 2 Solving the problems (39) and (40): VMTQN and MOTQN models by ADMM.
Input: Observation samples Y_ijk, (i, j, k) ∈ Ω, of tensor Y ∈ R^{n1×n2×n3}.
Initialize: X_0, E_0, Z_0, Q_0 ∈ R^{n3×r}. Parameters k = 1, ρ > 1, µ_0, µ_max, ε, K.
While not converged do
1. Update Q_k by one of the following:
(VMTQN): Q_k = Q_{k−1} if k mod K ≠ K − 1; Q_k = PCA(Y − E_{k−1} + Z_{k−1}/µ_{k−1}, 3, r) if k mod K = K − 1. (43)
(MOTQN): Q_k = Q_{k−1} if k mod K ≠ K − 1; Q_k = Q(τ*) by using Algorithm 1 if k mod K = K − 1. (44)
2. Update X_k by
X_k = Prox_{µ_{k−1}^{−1}, ‖·‖_{Q_k,*}}(Y − E_{k−1} + Z_{k−1}/µ_{k−1}). (45)
3. Update E_k by
E_k = P_{Ω^c}(Y − X_k + Z_{k−1}/µ_{k−1}), (46)
where Ω^c is the complement of Ω.
4. Update the dual variable Z_k by
Z_k = Z_{k−1} + µ_{k−1}(Y − X_k − E_k). (47)
5. Update µ_k by
µ_k = min{ρµ_{k−1}, µ_max}. (48)
6. Check the convergence conditions: ‖X_k − X_{k−1}‖_∞ ≤ ε, ‖E_k − E_{k−1}‖_∞ ≤ ε, and ‖Y − X_k − E_k‖_∞ ≤ ε.
7. k ← k + 1.
end While
Output: The target tensor X_k.
Algorithm 3 Solving the proximal operator Prox_{λ,‖·‖_{Q,*}}(T) in Eqs. (42) and (45).
Input: Tensor T ∈ R^{n1×n2×n3}, column orthonormal matrix Q ∈ R^{n3×r}.
1. G = T ×_3 Q.
2. for i = 1 to r:
   [U, S, V] = SVD(G^{(i)}).
   G^{(i)} = U(S − λI)_+ V^⊤, where (x)_+ = max{x, 0}.
3. end for
4. X = G ×_3 Q^⊤ + T ×_3 (I − QQ^⊤).
Output: Tensor X.
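To make Algorithms 2 and 3 concrete, here is a compact NumPy sketch (our own function names, a simplified stopping rule, and a toy problem; not the authors' released implementation) of the proximal operator in Eq. (42) and a minimal VMTQN completion loop following Eqs. (43)–(48), using the parameter defaults of Sec. 5.2 (ρ = 1.1, µ_0 = 10^{-4}, µ_max = 10^{10}, K = 1).

```python
import numpy as np

def unfold3(T):
    n1, n2, n3 = T.shape
    return T.reshape(n1 * n2, n3)

def prox_q_nuclear(T, Q, lam):
    """Algorithm 3: argmin_X lam*||X||_{Q,*} + 0.5*||X - T||_F^2  (Eq. (42))."""
    n1, n2, n3 = T.shape
    T3 = unfold3(T)
    G3 = T3 @ Q                                        # G = T x_3 Q
    for i in range(Q.shape[1]):                        # slice-wise singular value thresholding
        U, s, Vt = np.linalg.svd(G3[:, i].reshape(n1, n2), full_matrices=False)
        G3[:, i] = (U @ (np.maximum(s - lam, 0)[:, None] * Vt)).ravel()
    X3 = G3 @ Q.T + T3 @ (np.eye(n3) - Q @ Q.T)        # step 4 of Algorithm 3
    return X3.reshape(n1, n2, n3)

def vmtqn_complete(Y, mask, r, K=1, rho=1.1, mu=1e-4, mu_max=1e10, iters=300):
    """Minimal ADMM loop in the spirit of Algorithm 2 (VMTQN branch)."""
    X, E, Z = Y * mask, np.zeros_like(Y), np.zeros_like(Y)
    n3 = Y.shape[2]
    Q = np.eye(n3)[:, :r]
    for k in range(iters):
        T = Y - E + Z / mu
        if k % K == K - 1:                             # Eq. (43): Q = PCA(T, 3, r)
            _, _, Vt = np.linalg.svd(unfold3(T), full_matrices=False)
            Q = Vt[:r].T
        X = prox_q_nuclear(T, Q, 1.0 / mu)             # Eq. (45)
        E = (~mask) * (Y - X + Z / mu)                 # Eq. (46): supported on the complement of Omega
        Z = Z + mu * (Y - X - E)                       # Eq. (47)
        mu = min(rho * mu, mu_max)                     # Eq. (48)
    return X

# Toy run: a low Q-rank tensor observed at 50% of its entries.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((20, 3)), rng.standard_normal((3, 20))
X0 = np.stack([A @ B] * 10, axis=2) * rng.standard_normal((1, 1, 10))
mask = rng.random(X0.shape) < 0.5
X_hat = vmtqn_complete(X0 * mask, mask, r=10)
print(np.linalg.norm(X_hat - X0) / np.linalg.norm(X0))   # relative recovery error
```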
Theorem 4 Given a fixed Q in every K iterations, the tensor completion model (39) can be solved effectively by Algorithm 2 with Q_k = Q in Eq. (43), where Ψ is replaced by P_Ω. The rigorous convergence guarantees can be obtained directly due to convexity as follows:
Let (X*, E*, Z*) be one KKT point of problem (39) with fixed Q, X̂_K = (∑_{k=0}^{K} µ_k^{-1} X_{k+1}) / (∑_{k=0}^{K} µ_k^{-1}), and Ê_K = (∑_{k=0}^{K} µ_k^{-1} E_{k+1}) / (∑_{k=0}^{K} µ_k^{-1}). Then we have
‖X̂_{K+1} + Ê_{K+1} − Y‖²_F ≤ O(1 / ∑_{k=0}^{K} µ_k^{-1}), (50)
and
0 ≤ ‖X̂_{K+1}‖_{Q,*} − ‖X*‖_{Q,*} + ⟨Z*, X̂_{K+1} + Ê_{K+1} − Y⟩ ≤ O(1 / ∑_{k=0}^{K} µ_k^{-1}). (51)
Theorem 5 Denote the augmented Lagrangian function of the low rank tensor recovery model (38) by L(Q, X, E, Z, µ), which is shown as follows:
L(Q, X, E, Z, µ) = ‖X‖_{Q,*} + ⟨Z, Y − X − E⟩ + (µ/2)‖Y − X − E‖²_F. (53)
Then the sequence {Q_k, X_k, E_k, Z_k, µ_k} generated in Algorithm 2 with Eq. (44) satisfies the following: the function value of Eq. (53) decreases monotonically after each iteration as long as µ ≥ √((ρ + 1)C_L), where ρ is defined in Eq. (48) and C_L is a constant w.r.t. X. By the monotone bounded convergence theorem, Algorithm 2 is convergent.
Considering the low rank tensor recovery models in Eqs. (37) and (38), Ω is
an index set consisting of the indices {(i, j, k)} which can be observed, and the
orthogonal projection operator PΩ is defined as PΩ (Xijk ) = Xijk if (i, j, k) ∈ Ω
and 0 otherwise. In this part, we discuss how many observed samples |Ω| are at least needed to recover the ground-truth. In fact, the Q* obtained from the convergence
of Algorithm 2 has a decisive effect on the number of observation samples needed,
since the optimal solution satisfies the KKT conditions under Q∗ . That is to say,
we only need to analyze the performance guarantee in the case of fixed Q.
With a fixed Q, the exact tensor completion guarantee for model (13) is shown
in Theorem 6. Lu et al. [8] also have similar conclusions.
Theorem 6 Given a fixed orthogonal matrix Q ∈ R^{n3×n3} and Ω ∼ Ber(p), assume that the tensor X ∈ R^{n1×n2×n3} (n1 ≥ n2) has a low tensor Q-rank structure and rank_Q(X) = R. If |Ω| ≥ O(µRn1 log(n1n3)), then X is the unique solution to Eq. (13) with high probability, where Ψ is replaced by P_Ω, and µ is the corresponding incoherence parameter (see the Supplementary Materials).
Through the proofs of [8] and [27], the sampling rate p should be proportional to max_{ijk}{‖P_T(e_ijk)‖²_F}. (The definitions of the projection operator P_T and e_ijk can be found in [26, 27] or in the Supplementary Materials, where T is the singular space of the ground-truth.) The projection of e_ijk onto the subspace T is greatly influenced by its dimension. Obviously, when T is the whole space, ‖P_{T_Q}(e_ijk)‖²_F = 1. That is to say, a small dimension of T_Q may lead to a small max_{ijk}{‖P_{T_Q}(e_ijk)‖²_F}.
Proposition 15 in [27] also implies that for any ∆ ∈ T , we need to have
PΩ (∆) = 0 ⇔ ∆ = 0. These two conditions indicate that once the spatial
dimension of T is large (rankQ (X ) = R is large), a larger sampling rate p is needed.
And Figure 3 in [27] verifies the rationality of this deduction by experiment.
In fact, the smoothness of data along the third dimension has a great influence
on the Degrees of Freedom (DoF) of the space T. Non-smooth change along the
third dimension is likely to increase the spatial dimension of T under the Fourier
basis vectors, which makes the TNN based methods ineffective. Our experiments
on CIFAR-10 (Table 1) confirm this conclusion.
As for the models (39) and (40) with adaptive Q, our motivation is to find a better Q in order to make rank_Q(X) = R smaller and to make the spatial dimension of the corresponding T_Q as small as possible, where T_Q is the singular space of the ground-truth under Q. In other words, for more complex data with non-smoothness along the third dimension, the adaptive Q may reduce the dimension of T_Q and make max_{ijk}{‖P_{T_Q}(e_ijk)‖²_F} smaller than max_{ijk}{‖P_T(e_ijk)‖²_F}, leading to a lower required sampling rate p.
5 Experiments
PSNR = 10 log₁₀( n1n2n3 ‖X_0‖²_∞ / ‖X − X_0‖²_F ). (55)
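For reference, Eq. (55) in code form (a trivial sketch; X0 denotes the ground truth and X the recovered tensor):

```python
import numpy as np

def psnr(X, X0):
    """PSNR as in Eq. (55): 10*log10( n1*n2*n3 * ||X0||_inf^2 / ||X - X0||_F^2 )."""
    return 10.0 * np.log10(X0.size * np.abs(X0).max() ** 2 / np.sum((X - X0) ** 2))
```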
Fig. 3 The numbers plotted on the above figure are the average PSNRs within 10 random trials.
The gray scale reflects the quality of completion results of three different models (VMTQN,
MOTQN, TNN), where n1 = n2 = n3 = 50 and the white area represents a maximum PSNR
of 40.
In this part we compare our proposed methods (named VMTQN model and
MOTQN model) with the mainstream algorithm TNN [25, 27].
We examine the completion task with varying tensor Q-rank of tensor Y and
varying sampling rate p. Firstly, we generate a random tensor M ∈ R50×50×50 ,
whose entries are independently sampled from an N (0, 1/50) distribution. Actually,
the data generated in this way is usually non-smooth along each dimension. Then
we choose p in [0.01 : 0.02 : 0.99] and r in [1 : 1 : 50], where the column orthonormal
In this part we compare our proposed method with TNN [27] with Fourier matrix,
TTNN [33] with wavelet matrix, TNN-C [32] with cosine matrix, F-TNN [34] with
framelet matrix, SiLRTC [12], Tmac [48], and Latent Trace Norm [23]. We validate
our algorithm on three datasets: (1) CIFAR-10⁶; (2) COIL-20⁷; (3) HMDB51⁸. We set ρ = 1.1, µ_0 = 10^{−4}, µ_max = 10^{10}, ε = 10^{−8}, and K = 1 in our methods. As for TNN, SiLRTC, Tmac, F-TNN, and Latent Trace Norm, we use the default settings as in their released code, e.g., Lu et al.⁹ and Tomioka et al.¹⁰. For TTNN and TNN-C, whose code is not released, we implement their algorithms in MATLAB strictly according to the corresponding papers.
5.2.1 Influences of Q
5.2.2 CIFAR-10
We consider the worst case for TNN based methods that there is almost no
smoothness along the third dimension of the data. We randomly selected 3000 and
⁶ http://www.cs.toronto.edu/~kriz/cifar.html
⁷ http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
⁸ http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
⁹ https://github.com/canyilu/LibADMM
¹⁰ https://github.com/ryotat/tensor
Table 1 Comparisons of PSNR results on CIFAR images with different sampling rates.
Top: experiments on the case Y1 ∈ R32×32×3000 . Bottom: experiments on the case Y2 ∈
R32×32×10000 .
Fig. 4 Running time comparisons of different methods (PSNR vs. running time in seconds, left, and PSNR vs. iteration, right) for TQN (Oracle Q), VMTQN, TNN, and SiLRTC, where Y ∈ R^{32×32×10000} and the sampling rate p = 0.3.
10000 images from one batch of CIFAR-10 [49] as our true tensors Y1 ∈ R32×32×3000
and Y2 ∈ R32×32×10000 , respectively. Then we solve the model (39) with our
proposed Algorithm 2. The results are shown in Table 1. Note that in the latter case r = n1n2 ≪ n3 holds, so the MOTQN model has high computational complexity. Thus we will not compare it in this part.
Table 1 verifies our hypothesis that TNN regularization performs badly on
data with non-smooth change along the third dimension. Our VMTQN method is
obviously better than the other methods in the case of low sampling rate. Moreover,
by comparing the two groups of experiments, we can see that VMTQN, TMac,
and SiLRTC perform better on Y2. This may be because increasing the data
Fig. 5 Examples of the corrupted data in our completion tasks. The left figure is from COIL
dataset while the right figure is from the short video. The sampling rate is p = 0.2 in the left
and p = 0.5 in the right.
volume makes the principal components more significant. Meanwhile, the methods based on the Fourier matrix, cosine matrix, and wavelet matrix have almost no recovery effect when the sampling rate p is low. This indicates that these fixed projection bases cannot learn the data features in the case of poor continuity and insufficient sampling.
The above analyses confirm that our proposed regularization is data dependent and can lead to a better low rank structure, which makes recovery easier.
As shown in Figure 4, we test the running times of different models. The two figures indicate that, when n3 ≫ n1n2, our VMTQN model has higher computational efficiency in each iteration and better accuracy than TNN and SiLRTC. As mentioned in our previous complexity analysis, the VMTQN method has a great speed advantage in this case. Moreover, for the case n3 < n1n2, Figure 8 implies that setting r < n1n2 can balance computational efficiency and recovery accuracy.
COIL-20 [50] contains 1440 images of 20 objects which are taken from different
angles. The size of each image is processed as 128 × 128, which means Y ∈
R128×128×1440 . The upper part of Table 2 shows the results of the numerical
experiments. We select a background-changing video from HMDB51 [51] for the
video inpainting task, where Y ∈ R240×320×146 . Figure 2 shows some frames of this
video. The lower part of Table 2 shows the results. Figures 5, 6 and 7 show the experimental results on COIL-20 and the short video from HMDB51, respectively.
From the two visual figures we can see that our VMTQN and MOTQN methods perform the best among all comparative methods. Especially when the sampling rate is p = 0.2 in Figure 6, our methods have significant superiority in visual evaluation. What's more, the "Latent Trace Norm" based method performs much better than TNN on COIL, which validates our assumption that, with the help of a data-dependent V, the tensor trace norm is much more robust than TNN in processing non-smooth data.
Overall, both our methods and t-SVD based methods (e.g., TNN) perform
better than the others (e.g., SiLRTC) on these two datasets. It is mainly because
Fig. 6 Examples of COIL completion results. Method names correspond to the top of each
figure. The sampling rate p = 0.2.
Table 2 Comparisons of PSNR results on COIL images and video inpainting with different
sampling rates. Up: the COIL dataset with Y ∈ R128×128×1440 . Down: a short video from
HMDB51 with Y ∈ R240×320×126 .
the definitions of tensor singular values in t-SVD based methods can make better use of the internal structure of the tensor, and this is also the main difference between the tensor Q-nuclear norm (TQN) and the sum of nuclear norms (SNN).
Meanwhile, our method is obviously better than the others at all sampling
rates, which reflects the superiority of our data dependent Q.
Fig. 8 The relations among running times, different r, and the singular values of T(3) on
COIL, where p = 0.2.
Fig. 9 The gray scale reflects the quality (PSNR) of completion results, where n1 = n2 =
n3 = 50 and the white area represents a maximum PSNR of 40. There are three different sizes
of Q in VMTQN model to show the influences.
Fig. 10 Comparisons of PSNR and visualization results of a smooth video inpainting. Up:
PSNR results with different sampling rates. Down: visualization results with the sampling
rate p = 0.5.
corner of the first two sub-figures. From the left two sub-figures we can see that,
if the dimension of true tensor is not greater than r, the recovery performance is
consistent with that in the third sub-figure. Combined with the above analyses,
r = min{n1n2, n3} can not only save computational cost in some cases, but also keep the recovery performance of the model in "the white area", which corresponds to exact recovery.
6 Conclusions
We analyze the advantages and limitations of the current mainstream low rank regularizers, and then introduce a new data dependent definition of tensor rank named tensor Q-rank. To get a more significant low rank structure w.r.t. rank_Q, we further introduce two explainable selection methods of Q and make Q a learnable variable w.r.t. the data. Specifically, maximizing the variance of the singular value distribution leads to VMTQN, while minimizing the value of the nuclear norm through manifold optimization leads to MOTQN. We provide a convex envelope of our rank function and apply it to the tensor completion problem. By analyzing the proof of the exact recovery theorem, we explain why our methods may perform better than TNN based methods on non-smooth data (along the third dimension) with low sampling rates, and conduct experiments to verify our conclusions.
A Proof of Lemma 1
Proof Suppose that ā = (1/n)∑_{i=1}^n a_i; hence the variance of {a_1, ..., a_n} can be expressed as Var[a_i] = ∑_{i=1}^n (a_i − ā)². With ∑_{i=1}^n a_i² = C, we have the following:
max Var[a_i] ⇒ max ∑_{i=1}^n (a_i − ā)² ⇒ max ∑_{i=1}^n (a_i² + ā² − 2a_iā) ⇒ max (∑_{i=1}^n a_i²) + (∑_{i=1}^n ā²) − 2(∑_{i=1}^n a_iā) = C − nā² ⇒ min ā.
Moreover, the feasible region of {a_1, ..., a_n} is a first-quadrant Euclidean spherical surface: {(a_1, ..., a_n) | ∑_{i=1}^n a_i² = C, a_i ≥ 0}. Thus minimizing the objective function ā = (1/n)∑_{i=1}^n a_i is actually a linear hyperplane optimization problem over this set, whose optimal solutions are the intersections of the sphere with the coordinate axes, each of which corresponds to only one non-zero coordinate in {a_1, ..., a_n}. □
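A quick numerical illustration of Lemma 1 (our own sketch): among nonnegative vectors with a fixed sum of squares, concentrating all mass on a single coordinate gives a larger variance than spreading it out.

```python
import numpy as np

C, n = 10.0, 8
one_hot = np.zeros(n); one_hot[0] = np.sqrt(C)            # a single nonzero coordinate
spread = np.full(n, np.sqrt(C / n))                       # mass spread evenly
rng = np.random.default_rng(0)
a = np.abs(rng.standard_normal(n)); a *= np.sqrt(C) / np.linalg.norm(a)  # random feasible point

for v in (one_hot, spread, a):
    assert np.isclose(np.sum(v ** 2), C)                  # all lie on the sphere sum a_i^2 = C
    print(round(float(np.var(v) * n), 4))                 # sum of squared deviations; largest for one_hot
```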
B Proof of Lemma 2
Proof Firstly, X = UΣV^⊤ denotes the full Singular Value Decomposition of the matrix X with U ∈ R^{n1×n1}, Σ ∈ R^{n1×n2}, and V ∈ R^{n2×n2}. And P = V^⊤Q is also an orthogonal matrix, where P ∈ R^{n2×n2}. We use P_ij to represent the (i, j)-th element of the matrix P, and use p_i to represent the i-th column of the matrix P. Then XQ = UΣV^⊤Q = UΣP holds and we have the following:
‖XQ‖_{2,1} = ‖UΣP‖_{2,1} = ∑_{i=1}^{n2} ‖UΣp_i‖_2 = ∑_{i=1}^{n2} ‖Σp_i‖_2. (56)
If n1 ≥ n2, let σ_i = Σ_ii be the (i, i)-th element of Σ with i = 1, ..., n2. Or if n1 < n2, let Σ' = [Σ; 0] ∈ R^{n2×n2} and σ_i = Σ'_ii with i = 1, ..., n2. In this case, ∑_{i=1}^{n2} ‖Σp_i‖_2 = ∑_{i=1}^{n2} ‖Σ'p_i‖_2. Thus, we can always get {σ_1, ..., σ_{n2}} and have the equation ∑_{i=1}^{n2} ‖Σp_i‖_2 = ∑_{i=1}^{n2} √(∑_{j=1}^{n2} (σ_jP_ji)²).
We then prove that P = I optimizes the problem (16). By using Eq. (56), the objective function can be written as ∑_{i=1}^{n2} ‖Σp_i‖_2. We give the following deduction:
∑_{i=1}^{n2} ‖Σp_i‖_2 = ∑_{i=1}^{n2} √(∑_{j=1}^{n2} (σ_jP_ji)²) (a)= ∑_{i=1}^{n2} √(∑_{j=1}^{n2} (σ_jP_ji)²) · √(∑_{j=1}^{n2} P_ji²) (b)≥ ∑_{i=1}^{n2} ∑_{j=1}^{n2} σ_jP_ji² (c)= ∑_{j=1}^{n2} σ_j (∑_{i=1}^{n2} P_ji²) (d)= ∑_{j=1}^{n2} σ_j.
(a) holds because P is an orthogonal matrix with normalized columns. (b) holds because of the Cauchy inequality. (c) holds by exchanging the order of the two summations. Finally, (d) holds owing to the row normalization of P. Notice that the equality in (b) holds if and only if the two vectors (σ_1P_1i, ..., σ_{n2}P_{n2,i}) and (P_1i, ..., P_{n2,i}) are parallel. It can be seen that when P = I, this condition is satisfied. In other words, V^⊤Q = I optimizes the problem (16), which implies Q = V. □
C Proof of Theorem 2
Proof We divide r = min{n1, n2} into two cases and prove them respectively, using the same notation as in the previous proofs.
(1): If n1 < n2 and r = n1, then U ∈ R^{n1×n1}, V ∈ R^{n2×n1}, and Q ∈ R^{n2×n1}. In this case, Σ ∈ R^{n1×n1}. Let Σ' = [Σ, 0] ∈ R^{n1×n2}, V' = [V, V_⊥] ∈ R^{n2×n2}, and Q' = [Q, Q_⊥] ∈ R^{n2×n2}. Note that the constraint XQQ^⊤ = X in Eq. (17) implies V^⊤Q_⊥ = 0 and V_⊥^⊤Q = 0. That is to say, minimizing ‖XQ‖_{2,1} w.r.t. Q in Eq. (17) is equivalent to minimizing ‖Σ'V'^⊤Q'‖_{2,1} w.r.t. Q' under the constraints V^⊤Q_⊥ = 0 and V_⊥^⊤Q = 0. By using Lemma 2, Q' = V' minimizes the objective function ‖Σ'V'^⊤Q'‖_{2,1}, which also satisfies the constraints. In other words, Q = V optimizes the problem (17).
(2): If n1 ≥ n2 and r = n2, then U ∈ R^{n1×n2}, V ∈ R^{n2×n2}, and Q ∈ R^{n2×n2}. In this case, we have
‖XQ‖_{2,1} = ‖UΣP‖_{2,1} = ∑_{i=1}^{n2} ‖UΣp_i‖_2 = ∑_{i=1}^{n2} ‖Σp_i‖_2.
The remaining proof is similar to the details in Appendix B. □
D Proof of Lemma 3
Proof Let g(τ) = f(Q(τ)) = ‖X‖_{Q(τ),*} and Q(τ) ≈ (I − τA + (τ²/2)A²)Q_k, where A is defined in Eq. (32). We consider the following approximation:
g(τ) = f(Q(τ)) ≈ g(0) + ⟨ ∂f(Q(τ))/∂Q(τ) |_{τ=0}, Q(τ) − Q(0) ⟩ = g(0) + ⟨X_(3)^⊤H_(3), Q(τ) − Q(0)⟩, (58)
where Q(0) = Q_k and Eq. (31) ensures ∂f(Q(τ))/∂Q(τ)|_{τ=0} = X_(3)^⊤H_(3). Then we have
g(τ) ≈ ⟨X_(3)^⊤H_(3), (I − τA + (τ²/2)A²)Q_k⟩ + C_τ, (59)
where C_τ is a constant independent of τ. Then the first and second order derivatives of g(τ) evaluated at 0 can be estimated as follows:
g'(0) ≈ ⟨X_(3)^⊤H_(3), −AQ_k⟩, and g''(0) ≈ ⟨X_(3)^⊤H_(3), A²Q_k⟩. (60)
□
References
30. P. Zhou, C. Lu, Z. Lin, and C. Zhang, “Tensor factorization for low-rank tensor completion,”
IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1152–1163, 2018.
31. H. Kong, X. Xie, and Z. Lin, “t-schatten-p norm for low-rank tensor recovery,” IEEE
Journal of Selected Topics in Signal Processing, vol. 12, no. 6, pp. 1405–1419, 2018.
32. W.-H. Xu, X.-L. Zhao, and M. Ng, “A fast algorithm for cosine transform based tensor
singular value decomposition,” arXiv preprint arXiv:1902.03070, 2019.
33. G. Song, M. K. Ng, and X. Zhang, “Robust tensor completion using transformed tensor
svd,” arXiv preprint arXiv:1907.01113, 2019.
34. T.-X. Jiang, M. K. Ng, X.-L. Zhao, and T.-Z. Huang, “Framelet representation of tensor
nuclear norm for third-order tensor completion,” IEEE Transactions on Image Processing,
vol. 29, pp. 7233–7244, 2020.
35. M. K. Ng, R. H. Chan, and W.-C. Tang, “A fast algorithm for deblurring models with
neumann boundary conditions,” SIAM Journal on Scientific Computing, vol. 21, no. 3,
pp. 851–866, 1999.
36. J.-F. Cai, R. H. Chan, and Z. Shen, “A framelet-based image inpainting algorithm,” Applied
and Computational Harmonic Analysis, vol. 24, no. 2, pp. 131–149, 2008.
37. T.-X. Jiang, T.-Z. Huang, X.-L. Zhao, T.-Y. Ji, and L.-J. Deng, “Matrix factorization
for low-rank tensor completion using framelet prior,” Information Sciences, vol. 436,
pp. 403–417, 2018.
38. Z. Zhang and S. Aeron, “Exact tensor completion using t-SVD,” IEEE Transactions on
Signal Processing, vol. 65, no. 6, pp. 1511–1526, 2017.
39. C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan, “Tensor robust principal component
analysis with a new tensor nuclear norm,” IEEE transactions on pattern analysis and
machine intelligence, vol. 42, no. 4, pp. 925–938, 2019.
40. A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality
constraints,” SIAM journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353,
1998.
41. Z. Wen and W. Yin, “A feasible method for optimization with orthogonality constraints,”
Mathematical Programming, vol. 142, no. 1-2, pp. 397–434, 2013.
42. K. B. Petersen, M. S. Pedersen, et al., “The matrix cookbook,” Technical University of
Denmark, vol. 7, no. 15, p. 510, 2008.
43. C. Lu, J. Feng, S. Yan, and Z. Lin, “A unified alternating direction method of multipliers
by majorization minimization,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, no. 3, pp. 527–541, 2017.
44. Z. Lin, R. Liu, and H. Li, “Linearized alternating direction method with parallel splitting and
adaptive penalty for separable convex programs in machine learning,” Machine Learning,
vol. 99, no. 2, p. 287, 2015.
45. Y. Xu and W. Yin, “A block coordinate descent method for regularized multiconvex
optimization with applications to nonnegative tensor factorization and completion,” SIAM
Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758–1789, 2015.
46. Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty
for low-rank representation,” in Advances in neural information processing systems, pp. 612–
620, 2011.
47. P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds.
Princeton University Press, 2009.
48. Y. Xu, R. Hao, W. Yin, and Z. Su, “Parallel matrix factorization for low-rank tensor
completion,” Inverse Problems & Imaging, vol. 9, no. 2, pp. 601–624, 2017.
49. A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech.
rep., Citeseer, 2009.
50. S. A. Nene, S. K. Nayar, H. Murase, et al., “Columbia object image library (coil-20),” 1996.
51. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: a large video database
for human motion recognition,” in IEEE International Conference on Computer Vision,
pp. 2556–2563, 2011.