
XXX manuscript No.

(will be inserted by the editor)

Tensor Q-Rank: New Data Dependent Definition of Tensor Rank

Hao Kong · Canyi Lu · Zhouchen Lin

Received: date / Accepted: date


arXiv:1910.12016v4 [cs.LG] 15 Jun 2021

Abstract Recently, the Tensor Nuclear Norm (TNN) regularization based on t-SVD has been widely used in various low tubal-rank tensor recovery tasks. However, these models usually require smooth change of the data along the third dimension to ensure their low rank structures. In this paper, we propose a new definition of data dependent tensor rank named tensor Q-rank via a learnable orthogonal matrix Q, and further introduce a unified data dependent low rank tensor recovery model. Based on the low rank hypothesis, we introduce two interpretable selection methods for Q, under which the data tensor may have a more significant low tensor Q-rank structure than low tubal-rank structure. Specifically, maximizing the variance of the singular value distribution leads to the Variance Maximization Tensor Q-Nuclear norm (VMTQN), while minimizing the value of the nuclear norm through manifold optimization leads to the Manifold Optimization Tensor Q-Nuclear norm (MOTQN). Moreover, we apply these two models to the low rank tensor completion problem, give an effective algorithm, and briefly analyze why our method works better than TNN based methods in the case of complex data with low sampling rates. Finally, experimental results on real-world datasets demonstrate the superiority of our proposed model over other tensor rank regularization models in the tensor completion problem.

Keywords tensor rank · low rank · tensor completion · convex optimization

Hao Kong
Key Lab. of Machine Perception (MOE), School of EECS, Peking University, Beijing, China.
E-mail: [email protected]
Canyi Lu
Department of Electrical & Computer Engineering (ECE), Carnegie Mellon University, Pittsburgh, USA.
E-mail: [email protected]
Zhouchen Lin
Key Lab. of Machine Perception (MOE), School of EECS, Peking University, Beijing, China.
Z. Lin is the corresponding author.
E-mail: [email protected]

1 Introduction

With the development of data science, multi-dimensional data structures are becoming more and more complex. The low-rank tensor recovery problem, which aims to recover a low-rank tensor from an observed tensor, has also been extensively studied and applied. The problem can be formulated as the following model:

$$\min_{\mathcal{X}} \operatorname{rank}(\mathcal{X}), \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y}, \tag{1}$$

where Y is the observed measurement by a linear operator Ψ (·) and X is the clean
data. Generally, it is difficult to solve Eq. (1) directly, and different rank definitions
correspond to different models. The commonly used definitions of tensor rank are
all related to particular tensor decompositions [1]. For example, CP-rank [2] is
based on the CANDECOMP/PARAFAC decomposition [3]; multilinear rank [4] is
based on the Tucker decomposition [5]; tensor multi-rank and tubal-rank [6] are
based on t-SVD [7]; and a new tensor rank with invertible linear operator [8] is
based on T-SVD [9]. Among them, CP-rank and multilinear rank are both older
and more widely studied, while the remaining two mentioned here are relatively
new. Minimizing the rank function in Eq. (1) directly is usually NP-hard and difficult to solve in polynomial time; hence we often replace $\operatorname{rank}(\mathcal{X})$ with a convex or non-convex surrogate function. Similar to the matrix case [10, 11], with different definitions of tensor singular values, various tensor nuclear norms have been proposed as rank surrogates [7, 8, 12, 13].

1.1 Existing Mainstream Methods and Their Limitations

Friedland and Lim [13] introduce cTNN (Tensor Nuclear Norm based on CP) as
the convex relaxation of the tensor CP-rank:
$$\|\mathcal{T}\|_{cTNN} = \inf\left\{ \sum_{i=1}^{r} |\lambda_i| \;:\; \mathcal{T} = \sum_{i=1}^{r} \lambda_i\, u_i \circ v_i \circ w_i \right\}, \tag{2}$$
where $\|u_i\| = \|v_i\| = \|w_i\| = 1$ and $\circ$ represents the vector outer product¹.
However, for a given tensor $\mathcal{T} \in \mathbb{R}^{n_1\times n_2\times n_3}$, minimizing the surrogate objective $\|\mathcal{T}\|_{cTNN}$ directly is difficult, since computing CP-rank is usually NP-complete [14, 15] and computing cTNN is NP-hard in some sense [13], which also means we cannot verify the consistency of cTNN's implicit decomposition with the ground-truth CP decomposition. Meanwhile, it is hard to measure cTNN's tightness relative to the CP-rank². Although Yuan and Zhang [16] give the sub-gradient of cTNN by leveraging its dual property, the high computational cost makes it difficult to implement.
¹ Please see [1] or our supplementary materials for more details.
² For the matrix case, the nuclear norm is the conjugate of the conjugate function of the rank function in the unit ball. However, it is still unknown whether this property holds for cTNN and CP-rank.

To reduce the computation cost of computing the rank surrogate function, Liu et al. [12] define a kind of tensor nuclear norm named SNN (Sum of Nuclear Norm) based on the Tucker decomposition [5]:
$$\|\mathcal{T}\|_{SNN} = \sum_{i=1}^{d} \big\|\mathbf{T}_{(i)}\big\|_*, \tag{3}$$

where $\mathcal{T} \in \mathbb{R}^{n_1\times \dots\times n_d}$, $\mathbf{T}_{(i)} \in \mathbb{R}^{(n_1\dots n_{i-1} n_{i+1}\dots n_d)\times n_i}$ denotes unfolding the tensor along the $i$-th dimension, and $\|\cdot\|_*$ is the nuclear norm of a matrix, i.e., the sum of its singular values. The convenient calculation algorithm makes SNN widely used [12, 17–20]. It is worth mentioning that, although SNN has a similar representation to the matrix case, Paredes and Pontil [21] point out that SNN is not the tightest convex relaxation of the multilinear rank [4], and is actually an overlap regularization of it. References [22–24] also propose a new regularizer named Latent Trace Norm to better approximate the tensor rank function. In addition, because it unfolds the tensor directly along each dimension, the SNN based model makes insufficient use of the tensor's information.
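For illustration, the SNN of Eq. (3) can be evaluated in a few lines of NumPy (a sketch rather than the authors' implementation; the paper's experiments use MATLAB, and the helper names below are ours). The nuclear norm is transpose-invariant, so unfolding with the mode-$i$ fibers as rows matches Eq. (3).

import numpy as np

def unfold(T, mode):
    # Move `mode` to the front and flatten the rest: shape (n_mode, prod of others).
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def snn(T):
    # Sum of Nuclear Norms (Eq. (3)): sum of matrix nuclear norms over all mode-i unfoldings.
    return sum(np.linalg.norm(unfold(T, i), ord='nuc') for i in range(T.ndim))

# Example: a random third-order tensor.
T = np.random.randn(10, 12, 8)
print(snn(T))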
To avoid information loss in SNN, Kilmer and Martin [7] propose a tensor
decomposition named t-SVD with a Fourier transform matrix F, and Zhang et
al. [25] give a definition of the tensor nuclear norm on T ∈ Rn1 ×n2 ×n3 corresponding
to t-SVD, i.e., Tensor Nuclear Norm (TNN):
$$\|\mathcal{T}\|_{TNN} := \frac{1}{n_3}\sum_{i=1}^{n_3} \big\|\mathbf{G}^{(i)}\big\|_*, \quad \text{where } \mathcal{G} = \mathcal{T} \times_3 \mathbf{F}, \tag{4}$$

where $\mathbf{G}^{(i)}$ denotes the $i$-th frontal slice matrix of tensor $\mathcal{G}$³, and $\times_3$ is the mode-3 multilinear multiplication [5]. Benefiting from the efficient Discrete Fourier Transform and the better sampling effect of the Fourier basis on time-series features, TNN has attracted extensive attention in recent years [25–29]. The Fourier transform along the third dimension gives TNN based models a natural computational advantage for video and other data with strong temporal continuity along a certain dimension.
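For illustration, Eq. (4) amounts to an FFT along the third dimension followed by an average of slice-wise nuclear norms; a minimal NumPy sketch (not the authors' code; the function name is ours) is:

import numpy as np

def tnn(T):
    # Tensor Nuclear Norm of Eq. (4): DFT along the third dimension,
    # then the average of the nuclear norms of the frontal slices.
    G = np.fft.fft(T, axis=2)                      # G = T x_3 F
    n3 = T.shape[2]
    return sum(np.linalg.norm(G[:, :, i], ord='nuc') for i in range(n3)) / n3

T = np.random.randn(30, 30, 20)
print(tnn(T))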
However, when considering the smoothness of different data, using a fixed Fourier transform matrix $\mathbf{F}$ may bring some limitations. In this paper, we use smooth and non-smooth data along a certain dimension in the usual intuitive sense, meaning that the slices of the tensor data along that dimension are arranged according to a certain paradigm, e.g., a time series. For example, a continuous video is smooth, but if the data tensor is a concatenation of several different scene videos or a random arrangement of all frames, then the data is non-smooth.
Firstly, TNN needs to implement the Singular Value Decomposition (SVD) in the complex field $\mathbb{C}$, which is slightly slower than in the real field $\mathbb{R}$. Besides, the experiments in related papers [25, 27, 30, 31] are usually based on special datasets which change smoothly along the third dimension, such as RGB images and short videos. Non-smooth data may increase the number of non-zero tensor singular values [7, 25], weakening the significance of the low rank structure. Since the tensor multi-rank [25] is actually the rank of each projection matrix on a different Fourier basis, non-smooth change along the third dimension may lead to large singular values appearing on the projection matrix slices corresponding to the high frequencies.
³ The implementation of the Fourier transform along the third dimension of $\mathcal{T}$ is equivalent to multiplying by a DFT matrix $\mathbf{F}$ using $\times_3$. For more details, please see Sec. 2.2.


Fig. 1 Replace F in Eq. (4) by a matrix M and further obtain new definitions of the tensor rank $\operatorname{rank}_{\mathbf{M}}(\mathcal{X})$ and tensor nuclear norm $\|\mathcal{X}\|_{\mathbf{M},*}$ by using $\mathbf{S}^{(i)}$.

1.2 Related Work

To address the above issues, several works [8, 9, 32–34] consider improving the projection matrix of TNN, i.e., the Discrete Fourier transform matrix $\mathbf{F}$ in Eq. (4). These works replace $\mathbf{F}$ with another measurement matrix $\mathbf{M}$ and further obtain new definitions of tensor rank $\operatorname{rank}_{\mathbf{M}}(\mathcal{X})$ and tensor nuclear norm $\|\mathcal{X}\|_{\mathbf{M},*}$ as regularizers. Figure 1 shows the related operations. Their recovery models can be summarized as follows:
$$\min_{\mathcal{X}} \|\mathcal{X}\|_{\mathbf{M},*}, \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y}, \quad \mathbf{M} \text{ is determined by some prior knowledge.} \tag{5}$$

Please see Sec. 2 for the relevant definitions in Eq. (5). In the following, we discuss the motivations and limitations of these works [8, 9, 32–34], respectively.
Kernfeld, Kilmer, and Aeron [9] generalize the t-product by introducing a new operator named the cosine transform product with an arbitrary invertible linear transform $L$ (or arbitrary invertible matrix $\mathbf{M}$). For a given $\mathcal{T} \in \mathbb{R}^{n_1\times n_2\times n_3}$ and an invertible matrix $\mathbf{M} \in \mathbb{R}^{n_3\times n_3}$, they have $L_{\mathbf{M}}(\mathcal{T}) = \mathcal{T}\times_3\mathbf{M}$ and $L_{\mathbf{M}}^{-1}(\mathcal{T}) = \mathcal{T}\times_3\mathbf{M}^{-1}$. Different from the commonly used definition of the tensor mode-$i$ product in [1, 8, 9, 12], it should be mentioned that, for convenience in this paper, we define $L_{\mathbf{Q}}(\mathcal{T}) = \mathcal{T}\times_3\mathbf{Q} = \operatorname{fold}_3(\mathbf{T}_{(3)}\mathbf{Q})$, where $\mathbf{T}_{(3)} \in \mathbb{R}^{n_1 n_2\times n_3}$ is defined by $\mathbf{T}_{(3)} := \operatorname{unfold}_3(\mathcal{T})$. That is to say, we arrange the tensor fibers $\mathcal{T}_{ij:}$ by rows.
Following this idea, Lu, Peng, and Wei [8] propose a new tensor nuclear norm induced by invertible linear transforms [9]. Different from [7, 25], they use a fixed invertible matrix to replace the Fourier transform matrix in TNN. Although this method improves the performance of the recovery model to a certain extent, new problems arise, such as how to determine the fixed invertible matrix. Normally, different data need different optimal invertible matrices, but a reasonable matrix selection method is not given in [8]. Furthermore, the Frobenius norm of the invertible matrix is unconstrained, which may lead to numerical problems, e.g., approaching zero or infinity.
Additionally, Kernfeld, Kilmer, and Aeron [9] propose that, with the help of a Toeplitz-plus-Hankel matrix [35], the Discrete Cosine Transform matrix C can also be used to replace F. The work [32] then proposes some fast algorithms for diagonalization and the relevant recovery model. However, C is still based on trigonometric functions, and may lead to problems similar to those of the TNN based model, as mentioned in the last paragraph of Sec. 1.1.
Considering the efficiency of the time-space transformation, the work [33] uses the Daubechies 4 discrete wavelet transform matrix to replace F. As we know, the wavelet transform can take position information into account, which may make it better than the Fourier and cosine transforms in handling some special data, e.g., audio data. However, many wavelet bases generate transform matrices in exponential form, which means that a large-scale wavelet matrix may bring computational complexity problems.
Regardless of the computational complexity, Jiang et al. [34] introduce a new projection matrix, the tight framelet transform matrix [36, 37]. They claim that redundancy in the transformation is important, as the transformed coefficients can contain information about missing data in the original domain [36]. However, we consider that redundancy is not a sufficient condition for improving the recovery model shown in Eq. (5).
In summary, different multipliers $\mathbf{M}$ in Eq. (5) lead to different definitions of the regularizer, which may lead to different experimental results. However, there are still no unified rules for selecting $\mathbf{M}$. It can be seen from the above methods that when $\mathbf{M}$ is selected as an orthogonal matrix, it is convenient for calculation and interpretation. In general, projection bases are unit orthogonal. We further think that each data tensor should have its own best matching matrix, i.e., $\mathbf{M}$ could be data dependent. In this paper, we address the problem of how to define a better data dependent orthogonal transformation matrix.

1.3 Motivation

In the tensor completion task, we find that when dealing with non-smooth data, Tensor Nuclear Norm (TNN) based methods usually perform worse than on smooth data. Therefore, we want to address this issue by changing the projection basis $\mathbf{F}$ in Eq. (4). In other words, we provide some interpretable selection criteria for $\mathbf{M}$ in Eq. (5), e.g., making $\mathbf{M}$ an orthogonal matrix that is data dependent w.r.t. the data tensor $\mathcal{X}$. The details are as follows:
$$\min_{\mathcal{X},\mathbf{Q}} \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y},\ \mathbf{Q}^{\top}\mathbf{Q} = \mathbf{I},\ \mathbf{Q} \text{ is determined by } \mathcal{X}. \tag{6}$$

Whether in the case of matrix recovery [10, 11] or tensor recovery [8, 27, 38], the
low rank hypothesis is very important. Generally speaking, the lower the rank of
the data, the easier it is to recover with fewer observations. As can be seen from Figure 2, we can use a better $\mathbf{Q}$ to make the low rank structure of non-smooth data more significant.
Considering the convex relaxation, the low rank property is usually represented by (a) the distribution of singular values, or (b) the value of the nuclear norm. We take these two points as prior knowledge respectively, and specify the selection rules of $\mathbf{Q}$ in Eq. (6), so that the low rank property of $\mathcal{X}$ can be better reflected. Therefore, we provide two methods in this paper as follows:
(a): Let Q satisfy a certain selection method to make more tensor singular
values close to 0 while the remaining ones are far from 0. From another perspective,

[Figure 2, left panel: singular value (y-axis) versus singular value index in descending order (x-axis), comparing TNN and Ours.]

Fig. 2 Comparison of the two different low rank structures of our proposed regularization and TNN regularization on non-smooth video data. Left: the first 500 sorted singular values under TNN regularization (divided by n3) and ours. Right: the short video with background changes.

the distribution variance of the singular values should be larger, which leads to the Variance Maximization Tensor Q-Nuclear norm (VMTQN) in Sec. 3.1.
(b): Let $\mathbf{Q}$ minimize the nuclear norm $\|\mathcal{X}\|_{\mathbf{Q},*}$ directly, leading to a bilevel problem. As we know, the nuclear norm is usually used as a surrogate for the rank function. We then use a manifold optimization method to solve the problem, which leads to the Manifold Optimization Tensor Q-Nuclear norm (MOTQN) in Sec. 3.2.

1.4 Contributions

In summary, our main contributions include:


– We propose a unified data dependent low rank tensor recovery model, shown in Eq. (6), along with the corresponding definitions of the tensor Q-rank $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ and the tensor Q-nuclear norm $\|\mathcal{X}\|_{\mathbf{Q},*}$ with a learnable data dependent orthogonal $\mathbf{Q}$.
– From the low rank hypothesis, we consider the distribution of singular values and the value of the nuclear norm as prior knowledge respectively, leading to two different selection rules for $\mathbf{Q}$. Both methods are designed to make the low rank structure more significant. Figure 2 shows, on video data with background changes, that under our proposed selection of $\mathbf{Q}$ the low rank structure is more significant.
– For each method, we give relatively complete theoretical derivations, including interpretation and optimization. For VMTQN in Sec. 3.1, we start from variance maximization and use Theorem 2 to associate $\ell_{2,1}$ norm minimization with the singular value decomposition, and further select $\mathbf{Q}$ as the matrix of right singular vectors. MOTQN in Sec. 3.2 minimizes the nuclear norm directly and uses a manifold optimization algorithm to update $\mathbf{Q}$ in each iteration.
– Finally, we apply our proposed regularizers with adaptive Q to the tensor
completion problem. We analyze the computational complexity, convergence
and performance guarantee of our algorithm to a certain extent. Moreover, we
explain why the more significant the low rank structure, the easier the data
can be recovered, which corresponds to our motivation.

2 Notations and Preliminaries

2.1 Notations

We introduce some notations and necessary definitions which will be used later.
Tensors are represented by uppercase calligraphic letters, e.g., T . Matrices are
represented by boldface uppercase letters, e.g., M. Vectors are represented by
boldface lowercase letters, e.g., v. Scalars are represented by lowercase letters, e.g.,
$s$. Given a third-order tensor $\mathcal{T} \in \mathbb{R}^{n_1\times n_2\times n_3}$, we use $\mathbf{T}^{(k)}$ to represent its $k$-th frontal slice $\mathcal{T}(:,:,k)$, while its $(i,j,k)$-th entry is represented as $\mathcal{T}_{ijk}$. $\sigma_i(\mathbf{X})$ denotes the $i$-th largest singular value of matrix $\mathbf{X}$. $\mathbf{X}^{+}$ denotes the pseudo-inverse matrix of $\mathbf{X}$. $\|\mathbf{X}\|_{\sigma} = \sigma_1(\mathbf{X})$ denotes the matrix spectral norm. $\|\mathbf{X}\|_* = \sum_{i=1}^{\min\{n_1,n_2\}} \sigma_i(\mathbf{X})$ denotes the matrix nuclear norm and $\|\mathbf{X}\|_{2,1} = \sum_{j=1}^{n_2}\sqrt{\sum_{i=1}^{n_1} \mathbf{X}_{ij}^2}$ denotes the matrix $\ell_{2,1}$ norm, where $\mathbf{X} \in \mathbb{R}^{n_1\times n_2}$ and $\mathbf{X}_{ij}$ is the $(i,j)$-th entry of $\mathbf{X}$.
$\mathbf{T}_{(3)} \in \mathbb{R}^{n_1 n_2\times n_3}$ denotes unfolding the tensor $\mathcal{T}$ along the third dimension, which is slightly different from [1, 9]. That is to say, we arrange the tensor fibers $\mathcal{T}_{ij:}$ as the rows of $\mathbf{T}_{(3)}$. We then define $L_{\mathbf{Q}}(\mathcal{T}) = \mathcal{T}\times_3\mathbf{Q} = \operatorname{fold}_3(\mathbf{T}_{(3)}\mathbf{Q})$ and have $L_{\mathbf{Q}}^{-1}(\mathcal{T}) = \mathcal{T}\times_3\mathbf{Q}^{-1}$, where $\mathbf{T}_{(3)} := \operatorname{unfold}_3(\mathcal{T})$.
Due to limited space, for the definitions of PT [26], multilinear multiplication [5],
t-product [7], and so on, please see our Supplementary Materials.
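To make the mode-3 unfolding convention above concrete, the following NumPy sketch (illustrative only; the helper names are ours, not from the paper) implements unfold_3, fold_3, and L_Q:

import numpy as np

def unfold3(T):
    # T in R^{n1 x n2 x n3}  ->  T_(3) in R^{n1 n2 x n3}: each mode-3 fiber T_{ij:} is a row.
    n1, n2, n3 = T.shape
    return T.reshape(n1 * n2, n3)

def fold3(M, shape):
    # Inverse of unfold3 for a target tensor shape (n1, n2, n3).
    return M.reshape(shape)

def L_Q(T, Q):
    # L_Q(T) = T x_3 Q = fold3(T_(3) Q); for column-orthonormal Q in R^{n3 x r},
    # the result lives in R^{n1 x n2 x r}.
    n1, n2, _ = T.shape
    return fold3(unfold3(T) @ Q, (n1, n2, Q.shape[1]))

def L_Q_inv(G, Q):
    # For a square orthogonal Q, the inverse transform multiplies by Q^{-1} = Q^T.
    n1, n2, _ = G.shape
    return fold3(unfold3(G) @ Q.T, (n1, n2, Q.shape[0]))

# Round-trip check with a random orthogonal Q (QR of a random matrix).
T = np.random.randn(4, 5, 6)
Q, _ = np.linalg.qr(np.random.randn(6, 6))
assert np.allclose(L_Q_inv(L_Q(T, Q), Q), T)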

2.2 Tensor Q-rank

For a given tensor X ∈ Rn1 ×n2 ×n3 and a Fourier transform matrix F ∈ Cn3 ×n3 , if
we use G(i) to represent the i-th frontal slice of tensor G, then the tensor multi-rank
and Tensor Nuclear Norm (TNN) of X can be formulated by mode-3 multilinear
multiplication as follows:
$$\operatorname{rank}_m(\mathcal{X}) := \left\{ (r_1,\dots,r_{n_3}) \,\middle|\, r_i = \operatorname{rank}\!\big(\mathbf{G}^{(i)}\big),\ \mathcal{G} = \mathcal{X}\times_3\mathbf{F} \right\}, \tag{7}$$
$$\|\mathcal{X}\|_* := \frac{1}{n_3}\sum_{i=1}^{n_3} \big\|\mathbf{G}^{(i)}\big\|_*, \quad \text{where } \mathcal{G} = \mathcal{X}\times_3\mathbf{F}. \tag{8}$$

Compared with the CP-rank and cTNN mentioned in Sec. 1.1, it is quite easy to calculate Eqs. (7) and (8) through the matrix Singular Value Decomposition (SVD).
Kernfeld, Kilmer, and Aeron [9] generalize the t-product by introducing a new operator named the cosine transform product with an arbitrary invertible linear transform $L$ (or an arbitrary invertible matrix $\mathbf{Q}$). For an invertible matrix $\mathbf{Q} \in \mathbb{R}^{n_3\times n_3}$, they have $L_{\mathbf{Q}}(\mathcal{X}) = \mathcal{X}\times_3\mathbf{Q}$ and $L_{\mathbf{Q}}^{-1}(\mathcal{X}) = \mathcal{X}\times_3\mathbf{Q}^{-1}$.
Here, we further take the invertible multiplier $\mathbf{Q}$ to be a general real orthogonal matrix. It is worth mentioning that an orthogonal matrix $\mathbf{Q}$ has two good properties: one is invertibility, the other is keeping the Frobenius norm invariant, i.e., $\|\mathcal{X}\|_F = \|L_{\mathbf{Q}}(\mathcal{X})\|_F$. We then introduce a new definition of tensor rank named the tensor Q-rank.
Definition 1 (Tensor Q-rank) Given a tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ and a fixed real orthogonal matrix $\mathbf{Q} \in \mathbb{R}^{n_3\times n_3}$, the tensor Q-rank of $\mathcal{X}$ is defined as follows:
$$\operatorname{rank}_{\mathbf{Q}}(\mathcal{X}) := \sum_{i=1}^{n_3} \operatorname{rank}\!\big(\mathbf{G}^{(i)}\big), \quad \text{where } \mathcal{G} = L_{\mathbf{Q}}(\mathcal{X}) = \mathcal{X}\times_3\mathbf{Q}. \tag{9}$$

The corresponding low rank tensor recovery model can be written as follows:

$$\min_{\mathcal{X}} \operatorname{rank}_{\mathbf{Q}}(\mathcal{X}), \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y}. \tag{10}$$

Generally in the low rank recovery models, due to the discontinuity and non-
convexity of the rank function, it is quite difficult to minimize the rank function
directly. Therefore, some auxiliary definitions of tensor singular value and tensor
norm are needed to relax the rank function.

2.3 Definitions of Tensor Singular Value and Tensor Norm

Considering the superior recovery performance of TNN in many existing tasks, e.g., video denoising [39] and subspace clustering [28], we use a singular value definition similar to that of TNN. Given a tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ and a fixed orthogonal matrix $\mathbf{Q}$ such that $\mathcal{G} = L_{\mathbf{Q}}(\mathcal{X})$, the Q-singular values of $\mathcal{X}$ are defined as $\{\sigma_j(\mathbf{G}^{(i)})\}$, where $i = 1, \dots, n_3$, $j = 1, \dots, \min\{n_1, n_2\}$, $\mathbf{G}^{(i)}$ is the $i$-th frontal slice of $\mathcal{G}$, and $\sigma(\cdot)$ denotes the matrix singular value. When an orthogonal matrix $\mathbf{Q}$ is fixed, the corresponding tensor spectral norm and tensor nuclear norm of $\mathcal{X}$ can also be given.

Definition 2 (Tensor Q-spectral norm and Tensor Q-nuclear norm) Given a tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ and a fixed real orthogonal matrix $\mathbf{Q} \in \mathbb{R}^{n_3\times n_3}$, the tensor Q-spectral norm and tensor Q-nuclear norm of $\mathcal{X}$ are defined as follows:
$$\|\mathcal{X}\|_{\mathbf{Q},\sigma} := \max_{i}\ \big\|\mathbf{G}^{(i)}\big\|_{\sigma}, \quad \text{where } \mathcal{G} = L_{\mathbf{Q}}(\mathcal{X}), \tag{11}$$
$$\|\mathcal{X}\|_{\mathbf{Q},*} := \sum_{i=1}^{n_3} \big\|\mathbf{G}^{(i)}\big\|_{*}, \quad \text{where } \mathcal{G} = L_{\mathbf{Q}}(\mathcal{X}). \tag{12}$$
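Given a fixed orthogonal Q, the quantities in Eqs. (9), (11), and (12) reduce to slice-wise matrix computations; a minimal NumPy sketch (illustrative, not the authors' code, with our own function names) is:

import numpy as np

def transform(X, Q):
    # G = L_Q(X): multiply the mode-3 unfolding by Q and fold back.
    n1, n2, _ = X.shape
    return (X.reshape(n1 * n2, -1) @ Q).reshape(n1, n2, Q.shape[1])

def tensor_q_rank(X, Q, tol=1e-10):
    # Eq. (9): sum of the ranks of the frontal slices of G = L_Q(X).
    G = transform(X, Q)
    return sum(np.linalg.matrix_rank(G[:, :, i], tol=tol) for i in range(G.shape[2]))

def tensor_q_spectral(X, Q):
    # Eq. (11): maximum spectral norm over the frontal slices of G.
    G = transform(X, Q)
    return max(np.linalg.norm(G[:, :, i], ord=2) for i in range(G.shape[2]))

def tensor_q_nuclear(X, Q):
    # Eq. (12): sum of the nuclear norms of the frontal slices of G.
    G = transform(X, Q)
    return sum(np.linalg.norm(G[:, :, i], ord='nuc') for i in range(G.shape[2]))

X = np.random.randn(20, 20, 10)
Q, _ = np.linalg.qr(np.random.randn(10, 10))
print(tensor_q_rank(X, Q), tensor_q_spectral(X, Q), tensor_q_nuclear(X, Q))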

Moreover, with any fixed orthogonal matrix Q, the convexity, duality, and
envelope properties are all preserved.

Property 1 (Convexity) Tensor Q-nuclear norm and Tensor Q-spectral norm are
both convex.

Property 2 (Duality) Tensor Q-nuclear norm is the dual norm of Tensor Q-spectral
norm, and vice versa.

Property 3 (Convex Envelope) Tensor Q-nuclear norm is the tightest convex


envelope of the Tensor Q-rank within the unit ball of the Tensor Q-spectral norm.

These three properties are quite important in low rank recovery theory. Property 3 implies that we can use the tensor Q-nuclear norm as a rank surrogate. That is to say, when the orthogonal matrix Q is given, we can replace the low tensor Q-rank model (10) with model (13) to recover the original tensor:

$$\min_{\mathcal{X}} \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y}. \tag{13}$$

In some cases, we encounter the case that $\mathbf{Q}$ is not a square matrix, i.e., $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$ is column orthonormal. Then the corresponding definitions of $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ in Eq. (9) and $\|\mathcal{X}\|_{\mathbf{Q},*}$ in Eq. (12) also change to sums over $r$ frontal slices instead of $n_3$. Moreover, as for the convex envelope property, the double conjugate function of the rank function $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ is still the corresponding nuclear norm $\|\mathcal{X}\|_{\mathbf{Q},*}$ within a unit ball. We give the following theorem to illustrate this case:

Theorem 1 Given a tensor $\mathcal{T} \in \mathbb{R}^{n_1\times n_2\times n_3}$ and a fixed real column orthonormal matrix $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$, let $\mathbf{Q}_{\perp} \in \mathbb{R}^{n_3\times(n_3-r)}$ be the column complement matrix of $\mathbf{Q}$, and $\mathbf{Q}_t = [\mathbf{Q}\ \mathbf{Q}_{\perp}]$ be an orthogonal matrix. Then within the unit ball $\mathbb{D} = \{\mathcal{X} \mid \|\mathcal{X}\|_{\mathbf{Q}_t,\sigma} \le 1\}$, the double conjugate function of $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ is $\|\mathcal{X}\|_{\mathbf{Q},*}$:
$$\operatorname{rank}_{\mathbf{Q}}^{**}(\mathcal{X}) = \|\mathcal{X}\|_{\mathbf{Q},*}. \tag{14}$$
In other words, $\|\mathcal{X}\|_{\mathbf{Q},*}$ is still the tightest convex envelope of $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ within the unit ball $\mathbb{D}$.

Theorem 1 indicates that even if $\mathbf{Q}$ is not a square matrix, Eq. (13) can still be used as an effective recovery model.

3 Two Ways for Determining Q: Maximizing Variance & Stiefel Manifold Optimization

In practical problems, the selection of $\mathbf{Q}$ often has a tremendous impact on the performance of model (13). If $\mathbf{Q}$ is the identity matrix $\mathbf{I}$, it is equivalent to solving each frontal slice separately by low rank matrix methods [10]. If $\mathbf{Q}$ is a Fourier transform matrix $\mathbf{F}$, it is equivalent to the TNN based methods [25, 26, 38]. Through the analysis of [8] and our previous section, for given data $\mathcal{X}$, those $\mathbf{Q}$ that make $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ lower usually make the recovery problem (13) easier.
In the following, if we let $\mathbf{Q}$ in Eqs. (10) and (13) be a learnable variable w.r.t. the data tensor $\mathcal{X}$, we get a data dependent tensor rank and the corresponding low rank recovery model:
$$\min_{\mathcal{X},\mathbf{Q}} \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y},\ \mathbf{Q} \text{ is determined by } \mathcal{X}. \tag{15}$$

It is easy to see that Eq. (15) is actually a bilevel model and is usually hard to solve directly. In the following, we show two ways to approach this problem from two perspectives:
1. One is to use prior knowledge of $\mathcal{X}$ to specify the selection criteria of $\mathbf{Q}$. For the low rank hypothesis, we usually measure it by the distribution of singular values. Therefore, we consider artificially specifying the conditions that $\mathbf{Q}$ should satisfy so as to maximize the variance of the corresponding singular values.
2. The other is to define $\mathbf{Q} = \operatorname{argmin} f(\mathcal{X},\mathbf{Q}) = \operatorname{argmin} \|\mathcal{X}\|_{\mathbf{Q},*}$ and then use manifold optimization to solve the bilevel model directly. That is to say, we directly minimize the surrogate of the rank function (Property 3 and Theorem 1). It should be noted that although this way has higher rationality, it also has higher computational complexity.
From the above two perspectives, $\mathbf{Q}$ will be data dependent. In the following, we introduce the two methods in two sub-sections (Sec. 3.1 and Sec. 3.2). In the last part (Sec. 3.3), considering a third-order tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$, we analyze the applicability of each method in two different situations, i.e., $n_1 n_2 < n_3$ and $n_1 n_2 > n_3$.

3.1 Way I (VMTQN): Specify the Selection of Q by Variance Maximization

Let $\mathcal{G} = L_{\mathbf{Q}}(\mathcal{X}) = \mathcal{X}\times_3\mathbf{Q}$ and let $\{\mathbf{G}^{(i)}\}_i$ denote the frontal slices of $\mathcal{G}$. We hope to find a data dependent $L_{\mathbf{Q}}$ in Eqs. (12) and (13), instead of $L_{\mathbf{F}}$ in TNN (Eq. (8)), which can reduce the number of non-zero singular values of each projected slice $\mathbf{G}^{(i)}$. Our analyses are as follows.
(1): If we make $\mathbf{Q}$ an orthogonal matrix, then it is also invertible. By the unitary invariance of the Frobenius norm, the sum of the squares of the projected slices' Frobenius norms is a constant $C$, i.e., $\sum_{i=1}^{n_3}\|\mathbf{G}^{(i)}\|_F^2 = \|\mathcal{X}\|_F^2 = C$. Therefore, we need to consider how to select $\mathbf{Q}$ to make more singular values of $\{\mathbf{G}^{(i)}\}$ close to zero while the square sum of all singular values stays constant, i.e., $\sum_{i=1}^{n_3}\sum_{j}\sigma_j(\mathbf{G}^{(i)})^2 = C$.
(2): Considering the definitions of tensor rank, tensor norm, and tensor singular value corresponding to TNN in [25, 38] and the tensor Q-rank in this paper, the matrix inequality $\frac{1}{n}\sum_{j=1}^{n}\sigma_j(\mathbf{G}^{(i)}) \le \|\mathbf{G}^{(i)}\|_{\sigma} \le \|\mathbf{G}^{(i)}\|_F$ (singular value, spectral norm, and Frobenius norm, respectively) implies that the closer $\|\cdot\|_F$ is to zero, the more singular values are close to zero, which leads to a more significant tensor low rank structure (w.r.t. $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$) with high probability.

3.1.1 From Variance Maximization to Singular Matrix

Combining the above two points, it is easy to see that we need to make more $\|\mathbf{G}^{(i)}\|_F$ close to 0 while the sum of squares $\sum_{i=1}^{n_3}\|\mathbf{G}^{(i)}\|_F^2$ stays constant at $C$. From the perspective of variable distribution, we need to choose a data dependent $\mathbf{Q}$ to maximize the distribution variance of $\{\|\mathbf{G}^{(i)}\|_F\}$, where $\mathcal{G} = L_{\mathbf{Q}}(\mathcal{X})$ and $\mathbf{G}^{(i)}$ is the $i$-th frontal slice matrix of $\mathcal{G}$. For better explanation, we give the following two lemmas; the optimality condition of Lemma 1 illustrates our hypothesis that there should be more $\|\mathbf{G}^{(i)}\|_F$ close to 0⁴.

Lemma 1 Given $n$ non-negative variables $\{a_1, a_2, \dots, a_n\}$ such that $\sum_{i=1}^{n} a_i^2 = C$, maximizing the variance $\operatorname{Var}[a_i]$ is equivalent to minimizing the summation $\sum_{i=1}^{n} a_i$. Moreover, the optimality condition is that there is only one non-zero variable in $\{a_1, a_2, \dots, a_n\}$. Please see Appendix A for the proof.

By using Lemma 1, maximizing the variance of $\{\|\mathbf{G}^{(i)}\|_F\}$ is equivalent to minimizing the sum $\sum_{i=1}^{n_3}\|\mathbf{G}^{(i)}\|_F$. Then we have $\sum_{i=1}^{n_3}\|\mathbf{G}^{(i)}\|_F = \|\mathbf{G}_{(3)}\|_{2,1} = \|\mathbf{X}_{(3)}\mathbf{Q}\|_{2,1}$, where $\mathbf{G}_{(3)}$ and $\mathbf{X}_{(3)}$ denote the mode-3 unfolding matrices [5].
kX(3) Qk2,1 , where G(3) and X(3) denote the mode-3 unfolding matrices [5].

Lemma 2 Given a fixed matrix $\mathbf{X} \in \mathbb{R}^{n_1\times n_2}$ with full Singular Value Decomposition $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$, where $\mathbf{U} \in \mathbb{R}^{n_1\times n_1}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{n_1\times n_2}$, and $\mathbf{V} \in \mathbb{R}^{n_2\times n_2}$, the matrix of right singular vectors $\mathbf{V}$ optimizes the following:
$$\min_{\mathbf{Q}\in\mathbb{R}^{n_2\times n_2}} \|\mathbf{X}\mathbf{Q}\|_{2,1}, \quad \text{s.t.}\ \mathbf{Q}^{\top}\mathbf{Q} = \mathbf{I}, \tag{16}$$
where $\|\mathbf{M}\|_{2,1} = \sum_{i} \|\mathbf{M}_{:,i}\|_2$ is the sum of the $\ell_2$ norms of all column vectors. Please see Appendix B for the proof.
⁴ Notice that minimizing $\sum_{i=1}^{n} a_i$ in Lemma 1 can be seen as a linear hyperplane optimization problem defined on the non-negative part of the Euclidean spherical surface $\{(a_1,\dots,a_n) \mid \sum_{i=1}^{n} a_i^2 = C,\ a_i \ge 0\}$. The intersections of the sphere with each axis lie on the optimal hyperplane, which corresponds to only one non-zero coordinate (more variables close to 0).

Lemma 1 turns the variance maximization problem into a summation minimization problem, while Lemma 2 gives a feasible solution to the problem of minimizing the summation of $\ell_2$ norms. However, when $n_1 \le n_2$, there will be some zero columns in $\boldsymbol{\Sigma}$. We can use the skinny SVD to remove the redundant columns of $\mathbf{Q}$ in Eq. (16). Note that the size of $\mathbf{V}$ in the skinny SVD is related to the size of $\mathbf{X}$. Considering the two cases $n_1 \ge n_2$ and $n_1 < n_2$ of $\mathbf{X} \in \mathbb{R}^{n_1\times n_2}$, we introduce an auxiliary variable $r = \min\{n_1, n_2\}$ to unify the matrix of right singular vectors as $\mathbf{V} \in \mathbb{R}^{n_2\times r}$. Furthermore, we need to add an extra constraint $\mathbf{X}\mathbf{Q}\mathbf{Q}^{\top} = \mathbf{X}$ to avoid trivial solutions when $r < n_2$. Otherwise, $\mathbf{Q}$ may converge to the singular spaces corresponding to smaller singular values. For example, when $r = n_1 < n_2$ and $\mathbf{Q} \in \mathbb{R}^{n_2\times(n_2-r)}$, the optimal solution set of $\mathbf{Q}^*$ for Eq. (16) includes the null singular spaces of $\mathbf{X}$, which makes $\mathbf{X}\mathbf{Q} = \mathbf{O}$ hold with objective value 0. Then we have the following:
Theorem 2 Given a fixed matrix $\mathbf{X} \in \mathbb{R}^{n_1\times n_2}$ with $r = \min\{n_1, n_2\}$ and skinny Singular Value Decomposition $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$, where $\mathbf{U} \in \mathbb{R}^{n_1\times r}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{r\times r}$, and $\mathbf{V} \in \mathbb{R}^{n_2\times r}$, the matrix of right singular vectors $\mathbf{V}$ optimizes the following:
$$\min_{\mathbf{Q}\in\mathbb{R}^{n_2\times r}} \|\mathbf{X}\mathbf{Q}\|_{2,1}, \quad \text{s.t.}\ \mathbf{Q}^{\top}\mathbf{Q} = \mathbf{I},\ \mathbf{X}\mathbf{Q}\mathbf{Q}^{\top} = \mathbf{X}. \tag{17}$$
For the proof, please see Appendix C. Theorem 2 shows that, to minimize the $\ell_{2,1}$ norm $\|\mathbf{X}_{(3)}\mathbf{Q}\|_{2,1}$ w.r.t. $\mathbf{Q}$, we can choose $\mathbf{Q}$ as the matrix of right singular vectors of $\mathbf{X}_{(3)}$.
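In other words, the selected Q is just the matrix of right singular vectors of the mode-3 unfolding; a NumPy sketch (illustrative; the function name is ours) is:

import numpy as np

def pca_q(X, r=None):
    # PCA(X, 3, r): return the top-r right singular vectors of the mode-3 unfolding X_(3),
    # which optimize Eq. (17) (Theorem 2).
    n1, n2, n3 = X.shape
    X3 = X.reshape(n1 * n2, n3)
    if r is None:
        r = min(n1 * n2, n3)
    _, _, Vt = np.linalg.svd(X3, full_matrices=False)   # skinny SVD of X_(3)
    return Vt[:r].T                                     # Q in R^{n3 x r}

X = np.random.randn(30, 30, 50)
Q = pca_q(X)
# Sanity checks: column orthonormality and the reconstruction X_(3) Q Q^T = X_(3).
assert np.allclose(Q.T @ Q, np.eye(Q.shape[1]))
X3 = X.reshape(-1, 50)
assert np.allclose(X3 @ Q @ Q.T, X3)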

3.1.2 Details of How To Make Q Data Dependent

Through the analyses in Sec. 3.1.1, we make the selection of Q data dependent, and the following definitions show the details.
Definition 3 (VMTQN: Variance Maximization Tensor Q-Nuclear norm) Let $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ be a third-order tensor and $\mathbf{Q}$ be an orthogonal matrix. If $\mathcal{G} = \mathcal{X}\times_3\mathbf{Q}$ and $\mathbf{G}^{(i)}$ denotes the frontal slices of $\mathcal{G}$, then the Variance Maximization Tensor Q-Nuclear norm (VMTQN) is defined as follows:
$$\|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{where } \mathbf{Q} = \underset{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}}{\operatorname{argmax}}\ \operatorname{Variance}\big\{\|\mathbf{G}^{(i)}\|_F\big\}. \tag{18}$$

Note that $\mathbf{Q}$ is determined by $\mathcal{X}$. With the help of Lemma 1, Lemma 2, and Theorem 2, we can incorporate VMTQN into the low rank recovery model.
Definition 4 (Low Tensor Q-rank model with adaptive Q) By setting the adaptive Q module as a lower-level sub-problem, the low tensor Q-rank model (10) is transformed into the following:
$$\min_{\mathcal{X},\mathbf{Q}} \operatorname{rank}_{\mathbf{Q}}(\mathcal{X}), \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y},\ \mathbf{Q} \in \underset{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}}{\operatorname{argmin}} \|\mathbf{X}_{(3)}\mathbf{Q}\|_{2,1},\ \mathbf{X}\mathbf{Q}\mathbf{Q}^{\top} = \mathbf{X}. \tag{19}$$
And the corresponding surrogate model (13) is also replaced by the following:
$$\min_{\mathcal{X},\mathbf{Q}} \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y},\ \mathbf{Q} \in \underset{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}}{\operatorname{argmin}} \|\mathbf{X}_{(3)}\mathbf{Q}\|_{2,1},\ \mathbf{X}\mathbf{Q}\mathbf{Q}^{\top} = \mathbf{X}. \tag{20}$$
In Eqs. (19) and (20), $\mathbf{X}_{(3)} \in \mathbb{R}^{n_1 n_2\times n_3}$ denotes the mode-3 unfolding matrix of the tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$, and $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$ with $r = \min\{n_1 n_2, n_3\}$.

Definition 5 In fact, Theorem 2 implies $\mathbf{Q} = \mathbf{V}$, where $\mathbf{V}$ is the matrix of right singular vectors of $\mathbf{X}_{(3)}$. If we let $\operatorname{PCA}(\mathcal{X},3,r) := \operatorname{argmin}_{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}_r} \|\mathbf{X}_{(3)}\mathbf{Q}\|_{2,1}$ be the operator that returns the matrix of right singular vectors $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$, where $r = \min\{n_1 n_2, n_3\}$, then the models (19) and (20) can be abbreviated as follows:
$$\min_{\mathcal{X}} \operatorname{rank}_{\mathbf{Q}}(\mathcal{X}), \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y},\ \mathbf{Q} = \operatorname{PCA}(\mathcal{X},3,r), \tag{21}$$
$$\min_{\mathcal{X}} \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y},\ \mathbf{Q} = \operatorname{PCA}(\mathcal{X},3,r). \tag{22}$$

Remark 1 Notice that $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$ in Eqs. (19) and (20) may not have full columns, i.e., $r < n_3$. The corresponding definitions of $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ in Eq. (9) and $\|\mathcal{X}\|_{\mathbf{Q},*}$ in Eq. (12) then change to sums over $r$ frontal slices instead of $n_3$. Theorem 1 guarantees the validity of Eq. (20).

Remark 2 In fact, from Appendix C we can see that $r$ can be chosen as any value satisfying $\operatorname{rank}(\mathbf{X}_{(3)}) \le r \le \min\{n_1 n_2, n_3\}$, as long as $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$ contains the whole column space of the matrix of right singular vectors $\mathbf{V}$ and is pseudo-invertible so that $\mathcal{X} = \mathcal{X}\times_3\mathbf{Q}\times_3\mathbf{Q}^{+}$ holds.

Within this framework, the orthogonal matrix $\mathbf{Q}$ is related to the tensor $\mathcal{X}$. As we analyzed, choosing $\mathbf{Q}$ as the matrix of right singular vectors may make $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ as low as possible. In other words, there should be more "small" frontal slices of $\mathcal{X}\times_3\mathbf{Q}$, whose Frobenius norms are close to 0, to guarantee the low tensor Q-rank structure of the data with high probability.
Now the question is whether the function $\|\mathcal{X}\|_{\mathbf{Q},*}$ in Eq. (22) is still an envelope of the rank function $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ in Eq. (21) within an appropriate region. The following theorem shows that even if $\|\mathcal{X}\|_{\mathbf{Q},*}$ is no longer a convex function in the bilevel framework (22), since $\mathbf{Q}$ is dependent on $\mathcal{X}$, we can still use it as a surrogate for a lower bound of $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ in Eq. (21).

Theorem 3 Given a column orthonormal matrix $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$ with $r = \min\{n_1 n_2, n_3\}$, we use $\operatorname{rank}_{PCA}(\mathcal{X})$, $\|\mathcal{X}\|_{PCA,\sigma}$, and $\|\mathcal{X}\|_{PCA,*}$ to abbreviate the corresponding concepts as follows:
$$\operatorname{rank}_{PCA}(\mathcal{X}) := \operatorname{rank}_{\mathbf{Q}}(\mathcal{X}), \quad \text{where } \mathbf{Q} = \operatorname{PCA}(\mathcal{X},3,r), \tag{23}$$
$$\|\mathcal{X}\|_{PCA,\sigma} := \|\mathcal{X}\|_{\mathbf{Q},\sigma}, \quad \text{where } \mathbf{Q} = \operatorname{PCA}(\mathcal{X},3,r), \tag{24}$$
$$\|\mathcal{X}\|_{PCA,*} := \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{where } \mathbf{Q} = \operatorname{PCA}(\mathcal{X},3,r). \tag{25}$$
Then within the region $\mathbb{D} = \{\mathcal{X} \mid \|\mathcal{X}\|_{PCA,\sigma} \le 1\}$, the inequality $\|\mathcal{X}\|_{PCA,*} \le \operatorname{rank}_{PCA}(\mathcal{X})$ holds. Moreover, for every fixed $\mathbf{Q}$, let $S_{\mathbf{Q}}$ denote the space $\{\mathcal{X} \mid \mathbf{Q} \in \operatorname{PCA}(\mathcal{X},3,r)\}$. Then Theorem 1 indicates that $\|\mathcal{X}\|_{PCA,*}$ is still the tightest convex envelope of $\operatorname{rank}_{PCA}(\mathcal{X})$ in $S_{\mathbf{Q}} \cap \mathbb{D}$.

Remark 3 For any column orthonormal matrix $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$, the corresponding conclusion also holds as long as $\mathcal{X}\times_3(\mathbf{Q}\mathbf{Q}^{\top}) = \mathcal{X}$. That is to say, $\|\mathcal{X}\|_{\mathbf{Q},*} \le \operatorname{rank}_{\mathbf{Q}}(\mathcal{X})$ holds within the region $\{\mathcal{X} \mid \|\mathcal{X}\|_{\mathbf{Q},\sigma} \le 1\}$.

Theorem 3 shows that although $\|\mathcal{X}\|_{PCA,*}$ could be non-convex, its function value is always below $\operatorname{rank}_{PCA}(\mathcal{X})$. Therefore, model (22) can be regarded as a reasonable low rank tensor recovery model. Notice that it is actually a bilevel optimization problem.

3.2 Way II (MOTQN): Estimate Q by Manifold Optimization

Recalling the data-dependent low rank recovery model Eq. (15) with X ∈ Rn1 ×n2 ×n3 ,
our main idea is to find a learnable Q ∈ Rn3 ×n3 to minimize rankQ (X ). Inspired by
Remark 3, if we let Q = argminQ> Q=I kX kQ,∗ to minimize the surrogate function
directly, then we can get the following bilevel model:
$$\min_{\mathcal{X},\mathbf{Q}} \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{s.t.}\ \Psi(\mathcal{X}) = \mathcal{Y},\ \mathbf{Q} = \underset{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}}{\operatorname{argmin}} \|\mathcal{X}\|_{\mathbf{Q},*}. \tag{26}$$

In Eq. (26), the lower-level problem w.r.t. Q is actually a Stiefel manifold optimiza-
tion problem. Similarly, we can define the corresponding nuclear norm as follows:

Definition 6 (MOTQN: Manifold Optimization Tensor Q-Nuclear norm) Let $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ be a third-order tensor and $\mathbf{Q} \in \mathbb{R}^{n_3\times n_3}$ be an orthogonal matrix. Then the Manifold Optimization Tensor Q-Nuclear norm (MOTQN) is defined as:
$$\|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{where } \mathbf{Q} = \underset{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}}{\operatorname{argmin}} \|\mathcal{X}\|_{\mathbf{Q},*}. \tag{27}$$

Different from VMTQN, the learnable $\mathbf{Q}$ in Eq. (26) should be a square matrix, i.e., $\mathbf{Q} \in \mathbb{R}^{n_3\times n_3}$. Otherwise, as mentioned in Sec. 3.1.1, $\mathbf{Q}$ may converge to the singular spaces corresponding to smaller singular values. To avoid this case, we let $\mathbf{Q} \in \mathbb{R}^{n_3\times n_3}$. The key point in solving this model is how to deal with the following orthogonality constrained optimization problem:
$$\mathbf{Q} = \underset{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}}{\operatorname{argmin}} \|\mathcal{X}\|_{\mathbf{Q},*} = \underset{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}}{\operatorname{argmin}} \sum_{i=1}^{n_3} \big\|\mathbf{G}^{(i)}\big\|_*, \quad \text{where } \mathcal{G} = \mathcal{X}\times_3\mathbf{Q}. \tag{28}$$

Note that Eq. (28) is actually a non-convex problem due to the orthogonality constraint. The usual way is to perform manifold gradient descent on the Stiefel manifold, which evolves along the manifold geodesics [40]. However, this method usually requires a lot of computation to calculate the projected gradient direction of the objective function. Meanwhile, the work [41] develops a technique to solve such orthogonality constrained problems iteratively, which generates feasible points by the Cayley transformation and only involves matrix multiplication and inversion. Here we use their algorithm to solve the lower-level problem.

3.2.1 Optimization with Orthogonality Constraints

Assume $\mathbf{Q} \in \mathbb{R}^{n\times r}$ and denote the gradient of the objective function $f(\mathbf{Q}) = \|\mathcal{X}\|_{\mathbf{Q},*}$ w.r.t. $\mathbf{Q}$ at $\mathbf{Q}_k$ (the $k$-th iteration) by $\mathbf{P} \in \mathbb{R}^{n\times r}$. Then the projection of $\mathbf{P}$ onto the tangent space of the Stiefel manifold at $\mathbf{Q}_k$ is $\mathbf{A}\mathbf{Q}_k$, where $\mathbf{A} = \mathbf{P}\mathbf{Q}_k^{\top} - \mathbf{Q}_k\mathbf{P}^{\top}$ and $\mathbf{A} \in \mathbb{R}^{n\times n}$ [41]. Instead of parameterizing the geodesic of the Stiefel manifold along direction $\mathbf{A}$ using the exponential function, inspired by [41], we generate feasible points by the following Cayley transformation:
$$\mathbf{Q}(\tau) = \mathbf{C}(\tau)\mathbf{Q}_k, \quad \text{where } \mathbf{C}(\tau) = \Big(\mathbf{I} + \frac{\tau}{2}\mathbf{A}\Big)^{-1}\Big(\mathbf{I} - \frac{\tau}{2}\mathbf{A}\Big), \tag{29}$$
where $\mathbf{I}$ is the identity matrix and $\tau \in \mathbb{R}$ is a parameter that determines the step size of $\mathbf{Q}_{k+1}$. That is to say, $\mathbf{Q}(\tau)$ is a re-parameterized geodesic w.r.t. $\tau$ on the Stiefel manifold. Moreover, if $\mathbf{Q}_k^{\top}\mathbf{Q}_k = \mathbf{I}$ holds, then $\mathbf{Q}(\tau)$ has the following properties: (1) $\frac{d}{d\tau}\mathbf{Q}(0) = -\mathbf{A}\mathbf{Q}_k$; (2) $\mathbf{Q}(\tau)$ is smooth in $\tau$; (3) $\mathbf{Q}(0) = \mathbf{Q}_k$; (4) $\mathbf{Q}(\tau)^{\top}\mathbf{Q}(\tau) = \mathbf{I}$.
The work [41] shows that if $\tau$ is in a proper range, $\mathbf{Q}(\tau)$ leads to a lower objective function value than $\mathbf{Q}(0)$ on the Stiefel manifold. In summary, solving the problem $\mathbf{Q} = \operatorname{argmin}_{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}} \|\mathcal{X}\|_{\mathbf{Q},*}$ consists of two steps: (1) find a proper $\tau^*$ that decreases the objective value $f(\mathbf{Q}(\tau)) = \|\mathcal{X}\|_{\mathbf{Q}(\tau),*}$; (2) update $\mathbf{Q}_{k+1}$ by Eq. (29), i.e., $\mathbf{Q}_{k+1} = \mathbf{Q}(\tau^*)$.

3.2.2 Details of How To Estimate $\tau^*$ and Update $\mathbf{Q}_k$

(1): We first compute the gradient of the objective function $f(\mathbf{Q}) = \|\mathcal{X}\|_{\mathbf{Q},*}$ w.r.t. $\mathbf{Q}$ at $\mathbf{Q}_k$. According to the chain rule, we get the following:
$$\frac{\partial f(\mathbf{Q})}{\partial \mathbf{Q}} = \frac{\partial \mathcal{G}}{\partial \mathbf{Q}} \cdot \frac{\partial f(\mathbf{Q})}{\partial \mathcal{G}} = \frac{\partial \mathbf{G}_{(3)}}{\partial \mathbf{Q}} \times \operatorname{unfold}_3\!\left(\frac{\partial f(\mathbf{Q})}{\partial \mathcal{G}}\right). \tag{30}$$
Note that $\mathcal{G} = \mathcal{X}\times_3\mathbf{Q}$ and $\mathbf{G}_{(3)} = \mathbf{X}_{(3)}\mathbf{Q}$, so $\frac{\partial \mathbf{G}_{(3)}}{\partial \mathbf{Q}} = \mathbf{X}_{(3)}^{\top}$, where $\mathbf{G}_{(3)}$ and $\mathbf{X}_{(3)}$ are the mode-3 unfolding matrices. Additionally, Eq. (28) shows that $f(\mathbf{Q}) = \sum_{i=1}^{n_3}\|\mathbf{G}^{(i)}\|_*$, where $\mathbf{G}^{(i)}$ are the frontal slices of $\mathcal{G}$. We let $\mathbf{H}^{(i)} = \mathbf{U}^{(i)}\mathbf{V}^{(i)\top}$, where $\mathbf{H}^{(i)}$ denotes the $i$-th frontal slice of $\mathcal{H}$ and $\mathbf{U}^{(i)}$, $\mathbf{V}^{(i)}$ denote the left/right singular matrices of $\mathbf{G}^{(i)}$ given by the skinny SVD [42]. Therefore, $\mathcal{H} = \frac{\partial f(\mathbf{Q})}{\partial \mathcal{G}}$ is the same as in the matrix case and can be obtained from the singular value decomposition⁵.
In summary, the gradient of the objective function $f(\mathbf{Q})$ w.r.t. $\mathbf{Q}$ at $\mathbf{Q}_k$ (denoted by $\mathbf{P}$) can be written as follows:
$$\text{Gradient} = \mathbf{P} = \frac{\partial f(\mathbf{Q})}{\partial \mathbf{Q}} = \frac{\partial \mathcal{G}}{\partial \mathbf{Q}} \cdot \frac{\partial f(\mathbf{Q})}{\partial \mathcal{G}} = \mathbf{X}_{(3)}^{\top}\mathbf{H}_{(3)}, \tag{31}$$
where $\mathbf{X}_{(3)}$ and $\mathbf{H}_{(3)}$ are the mode-3 unfolding matrices of $\mathcal{X}$ and $\mathcal{H}$, respectively.
(2): Then we construct a geodesic curve along the gradient direction on the Stiefel manifold by Eq. (29):
$$\mathbf{Q}(\tau) = \Big(\mathbf{I} + \frac{\tau}{2}\mathbf{A}\Big)^{-1}\Big(\mathbf{I} - \frac{\tau}{2}\mathbf{A}\Big)\mathbf{Q}_k, \quad \text{where } \mathbf{A} = \mathbf{P}\mathbf{Q}_k^{\top} - \mathbf{Q}_k\mathbf{P}^{\top}. \tag{32}$$
We consider the following problem for finding a proper $\tau$:
$$\tau^* = \underset{0\le\tau\le\varepsilon}{\operatorname{argmin}}\, f(\mathbf{Q}(\tau)) = \underset{0\le\tau\le\varepsilon}{\operatorname{argmin}}\, g(\tau) = \underset{0\le\tau\le\varepsilon}{\operatorname{argmin}}\, \|\mathcal{X}\|_{\mathbf{Q}(\tau),*}, \tag{33}$$
where $\varepsilon$ is a given parameter that ensures $\tau^*$ is small enough and $\|\frac{\tau}{2}\mathbf{A}\| \le 1$ holds. Then we can simplify $g(\tau) = f(\mathbf{Q}(\tau))$ with the expansion $\big(\mathbf{I} + \frac{\tau}{2}\mathbf{A}\big)^{-1} = \mathbf{I} + \sum_{l=1}^{\infty}\big({-\frac{\tau}{2}}\mathbf{A}\big)^{l}$ and obtain the following:
$$g(\tau) = f(\mathbf{Q}(\tau)) = f\!\left(\Big(\mathbf{I} + 2\sum_{l=1}^{\infty}\big({-\tfrac{\tau}{2}}\mathbf{A}\big)^{l}\Big)\mathbf{Q}_k\right) \approx f\!\left(\Big(\mathbf{I} - \tau\mathbf{A} + \frac{\tau^2}{2}\mathbf{A}^2\Big)\mathbf{Q}_k\right). \tag{34}$$
⁵ The subgradient of the matrix nuclear norm $\|\mathbf{M}\|_*$ w.r.t. $\mathbf{M}$ is $\{\mathbf{U}\mathbf{V}^{\top} + \mathbf{W} \mid \mathbf{U}^{\top}\mathbf{W} = \mathbf{O},\ \mathbf{W}\mathbf{V} = \mathbf{O},\ \|\mathbf{W}\| \le 1\}$, where $\mathbf{M} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$ is the SVD of $\mathbf{M}$.

Algorithm 1 Updating Q iteratively to solve Eq. (27).
Input: Tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$, orthogonal matrix $\mathbf{Q}_0 \in \mathbb{R}^{n_3\times n_3}$.
1. while not converged
2.   Calculate $\mathbf{P} = \mathbf{X}_{(3)}^{\top}\mathbf{H}_{(3)}$ by Eq. (31).
3.   Calculate $\mathbf{A} = \mathbf{P}\mathbf{Q}_k^{\top} - \mathbf{Q}_k\mathbf{P}^{\top}$ by Eq. (32).
4.   Estimate $\tau^* = \min\{\varepsilon, \tilde{\tau}\}$ by Eq. (35) and Lemma 3.
5.   Update $\mathbf{Q}_{k+1} = \mathbf{Q}(\tau^*)$ by Eq. (32).
6. end while
Output: Matrix $\mathbf{Q}_K$.

Given that $\tau^*$ is small enough, we can approximate $g(\tau)$ via its second-order Taylor expansion at $\tau = 0$, i.e., $g(\tau) = g(0) + g'(0)\tau + \frac{1}{2}g''(0)\tau^2 + O(\tau^3)$. It should be mentioned that since $f(\mathbf{Q})$ is non-convex w.r.t. $\mathbf{Q}$, the sign of $g''(0)$ is uncertain. However, Wen et al. [41] point out that $g'(0) = -\frac{1}{2}\|\mathbf{A}\|_F^2$ always holds. Thus we can estimate an optimal solution $\tau^*$ via:
$$\tau^* = \min\{\varepsilon, \tilde{\tau}\}, \quad \text{where } \varepsilon < \frac{2}{\|\mathbf{A}\|}, \quad \text{and } \tilde{\tau} = \begin{cases} -\dfrac{g'(0)}{g''(0)}, & g''(0) > 0, \\[4pt] \dfrac{1}{\|\mathbf{A}\|}, & g''(0) \le 0. \end{cases} \tag{35}$$

Here we give the following lemma to omit the calculation process (see Appendix D).

Lemma 3 Let $g(\tau) = f(\mathbf{Q}(\tau)) = \|\mathcal{X}\|_{\mathbf{Q}(\tau),*}$ and $\mathbf{Q}(\tau) \approx \big(\mathbf{I} - \tau\mathbf{A} + \frac{\tau^2}{2}\mathbf{A}^2\big)\mathbf{Q}_k$, where $\mathbf{A}$ is defined in Eq. (32). Then the first and second order derivatives of $g(\tau)$ evaluated at 0 can be estimated as follows:
$$g'(0) \approx \big\langle \mathbf{X}_{(3)}^{\top}\mathbf{H}_{(3)},\ -\mathbf{A}\mathbf{Q}_k \big\rangle, \quad \text{and} \quad g''(0) \approx \big\langle \mathbf{X}_{(3)}^{\top}\mathbf{H}_{(3)},\ \mathbf{A}^2\mathbf{Q}_k \big\rangle, \tag{36}$$
where $\mathbf{X}_{(3)}$ and $\mathbf{H}_{(3)}$ are defined the same as in Eq. (31).

By using Eq. (35) and Lemma 3, we can obtain the optimal step size τ ∗ and
then use Eq. (32) to update Qk+1 = Q(τ ∗ ). Algorithm 1 organizes the whole
calculation process.
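To make the Q-update concrete, the following NumPy sketch (illustrative, not the released code; the function name is ours) performs one iteration of Algorithm 1, using Lemma 3 for the derivative estimates and choosing ε = 1/‖A‖, one admissible value below the bound 2/‖A‖ in Eq. (35):

import numpy as np

def motqn_q_step(X, Qk):
    # One iteration of Algorithm 1: gradient of f(Q) = ||X||_{Q,*}, Cayley curve of
    # Eq. (32), and the step size estimate of Eq. (35) with Lemma 3.
    n1, n2, n3 = X.shape
    X3 = X.reshape(n1 * n2, n3)
    G = (X3 @ Qk).reshape(n1, n2, n3)              # G = X x_3 Q_k
    H = np.empty_like(G)
    for i in range(n3):                            # H^{(i)} = U^{(i)} V^{(i)T} (skinny SVD)
        U, _, Vt = np.linalg.svd(G[:, :, i], full_matrices=False)
        H[:, :, i] = U @ Vt
    P = X3.T @ H.reshape(n1 * n2, n3)              # Eq. (31): gradient at Q_k
    A = P @ Qk.T - Qk @ P.T                        # Eq. (32): skew-symmetric direction
    g1 = np.sum(P * (-A @ Qk))                     # g'(0)  ~ <X_(3)^T H_(3), -A Q_k>
    g2 = np.sum(P * (A @ A @ Qk))                  # g''(0) ~ <X_(3)^T H_(3), A^2 Q_k>
    normA = np.linalg.norm(A, 2) + 1e-12
    eps = 1.0 / normA                              # one admissible eps < 2/||A||
    tau_tilde = -g1 / g2 if g2 > 0 else 1.0 / normA
    tau = min(eps, tau_tilde)
    # Cayley transformation, Eq. (29)/(32): Q(tau) = (I + tau/2 A)^{-1} (I - tau/2 A) Q_k.
    I = np.eye(n3)
    return np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ Qk)

# Usage: start from the identity and take one descent step on the Stiefel manifold.
X = np.random.randn(20, 20, 15)
Q = motqn_q_step(X, np.eye(15))
assert np.allclose(Q.T @ Q, np.eye(15))            # feasibility is preserved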
Back to the bilevel low rank tensor recovery model Eq. (26), for the lower-level
problem Eq. (28), we finish the iterative updating step by Algorithm 1. Once Qk+1
is fixed, the upper-level problem can be solved easily.

3.3 Applicability of VMTQN and MOTQN

In Sec.3.2 (MOTQN), we mention that Q ∈ Rn3 ×n3 should be a square matrix but
not in Sec.3.1 (VMTQN). In this section, we start from this point and analyze the
impact of the size of X ∈ Rn1 ×n2 ×n3 on the applicability of these two methods.

3.3.1 Case 1: $r = n_1 n_2 \ll n_3$

In this case, the VMTQN model in Eq. (22) usually performs better than other methods in terms of computational efficiency, including MOTQN and other works [8, 32–34, 38]. As we can see from Sec. 3.1, the VMTQN model needs to calculate the skinny right singular matrix $\mathbf{V}$ of an unfolding matrix $\mathbf{X}_{(3)} \in \mathbb{R}^{n_1 n_2\times n_3}$. If $r < n_3$, then not only is the computational complexity moderate, but $\mathbf{Q}$ can also play the role of feature selection like Principal Component Analysis, which corresponds to the notation $\mathbf{Q} = \operatorname{PCA}(\mathcal{X},3,r)$. Meanwhile, MOTQN and the works [8, 32, 33, 38] usually need a square factor matrix $\mathbf{Q}$, and [34] even requires the columns of $\mathbf{Q}$ to be redundant.

3.3.2 Case 2: $n_1 n_2 > n_3 = r$, or both have the same order of magnitude

In this case, the MOTQN model in Eq. (26) has the best explainability and rationality. On the one hand, with $\mathbf{Q} \in \mathbb{R}^{n_3\times n_3}$ of the same size, MOTQN minimizes the tensor Q-nuclear norm directly, which properly corresponds to the definition of the low rank structure. On the other hand, thanks to the algorithm in [41], the optimization of the MOTQN model has a good convergence guarantee.

4 Applications to Tensor Completion

4.1 Low Rank Tensor Completion Model

In the third-order tensor completion task, $\Omega$ is an index set consisting of the indices $\{(i,j,k)\}$ that can be observed, and the operator $\Psi$ in Eqs. (21) and (22) is replaced by an orthogonal projection operator $P_{\Omega}$, where $P_{\Omega}(\mathcal{X}_{ijk}) = \mathcal{X}_{ijk}$ if $(i,j,k) \in \Omega$ and 0 otherwise. The observed tensor $\mathcal{Y}$ satisfies $\mathcal{Y} = P_{\Omega}(\mathcal{Y})$. Then the tensor completion models based on our two ways are given by:
$$\text{(VMTQN):}\quad \min_{\mathcal{X}} \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{s.t.}\ P_{\Omega}(\mathcal{X}) = \mathcal{Y},\ \mathbf{Q} = \operatorname{PCA}(\mathcal{X},3,r), \tag{37}$$
and
$$\text{(MOTQN):}\quad \min_{\mathcal{X},\mathbf{Q}} \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{s.t.}\ \mathbf{Q} = \underset{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}}{\operatorname{argmin}} \|\mathcal{X}\|_{\mathbf{Q},*},\ P_{\Omega}(\mathcal{X}) = \mathcal{Y}, \tag{38}$$

where $\mathcal{X}$ is the tensor with low rank structure. In Eq. (37), $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$ is a column orthonormal matrix with $r = \min\{n_1 n_2, n_3\}$, while in Eq. (38), $\mathbf{Q} \in \mathbb{R}^{n_3\times n_3}$ is a square orthogonal matrix. To solve these models by an ADMM based method [43], we introduce an intermediate tensor $\mathcal{E}$ to separate $\mathcal{X}$ from $P_{\Omega}(\cdot)$. Let $\mathcal{E} = P_{\Omega}(\mathcal{X}) - \mathcal{X}$; then $P_{\Omega}(\mathcal{X}) = \mathcal{Y}$ is translated into $\mathcal{X} + \mathcal{E} = \mathcal{Y},\ P_{\Omega}(\mathcal{E}) = \mathcal{O}$, where $\mathcal{O}$ is an all-zero tensor. Then we get the following two models:
$$\text{(VMTQN):}\quad \min_{\mathcal{X},\mathcal{E},\mathbf{Q}} \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{s.t.}\ \mathcal{X} + \mathcal{E} = \mathcal{Y},\ P_{\Omega}(\mathcal{E}) = \mathcal{O},\ \mathbf{Q} = \operatorname{PCA}(\mathcal{X},3,r), \tag{39}$$
and
$$\text{(MOTQN):}\quad \min_{\mathcal{X},\mathcal{E},\mathbf{Q}} \|\mathcal{X}\|_{\mathbf{Q},*}, \quad \text{s.t.}\ \mathcal{X} + \mathcal{E} = \mathcal{Y},\ P_{\Omega}(\mathcal{E}) = \mathcal{O},\ \mathbf{Q} = \underset{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}}{\operatorname{argmin}} \|\mathcal{X}\|_{\mathbf{Q},*}. \tag{40}$$
Note that in Eq. (40), the constraint $\mathbf{Q} = \operatorname{argmin}_{\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}} \|\mathcal{X}\|_{\mathbf{Q},*}$ is the same as the objective function, thus it can be omitted. Nevertheless, in order to keep Eqs. (39) and (40) unified in form and to express the dependence between $\mathbf{Q}$ and $\mathcal{X}$ conveniently, we keep this constraint here.

4.2 Optimization Algorithm

Since $\mathbf{Q}$ is dependent on $\mathcal{X}$, it is difficult to solve the models (39) and (40) w.r.t. $\{\mathcal{X}, \mathbf{Q}\}$ directly. Here we adopt the idea of alternating minimization to solve for $\mathcal{X}$ and $\mathbf{Q}$ alternately. We separate the sub-problem of solving $\mathbf{Q}$ as a sub-step in every $K$ iterations, and then update $\mathcal{X}$ with a fixed $\mathbf{Q}$ by the ADMM method [27, 43]. The partial augmented Lagrangian function of Eqs. (39) and (40) is
$$L(\mathcal{X}, \mathcal{E}, \mathcal{Z}, \mu) = \|\mathcal{X}\|_{\mathbf{Q},*} + \langle \mathcal{Z}, \mathcal{Y} - \mathcal{X} - \mathcal{E} \rangle + \frac{\mu}{2}\|\mathcal{Y} - \mathcal{X} - \mathcal{E}\|_F^2, \tag{41}$$
where $\mathcal{Z}$ is the dual variable and $\mu > 0$ is the penalty parameter. Then we can update each component $\mathbf{Q}$, $\mathcal{X}$, $\mathcal{E}$, and $\mathcal{Z}$ alternately. Algorithms 2 and 3 show the details of the optimization methods for Eqs. (39) and (40). In order to improve the efficiency and stable convergence of the algorithm, we introduce a parameter $K$ to control the update frequency of $\mathbf{Q}$ as a heuristic design. The different effects of $K$ on the two models are explained in Sec. 4.3 and Sec. 4.4, respectively.
Note that there is a proximal operator $\operatorname{Prox}$ in the sub-step of updating $\mathcal{X}$:
$$\mathcal{X} = \operatorname{Prox}_{\lambda,\|\cdot\|_{\mathbf{Q},*}}(\mathcal{T}) := \underset{\mathcal{X}}{\operatorname{argmin}}\ \lambda\|\mathcal{X}\|_{\mathbf{Q},*} + \frac{1}{2}\|\mathcal{X} - \mathcal{T}\|_F^2, \tag{42}$$
where $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$ is a given column orthonormal matrix and $\|\mathcal{X}\|_{\mathbf{Q},*}$ is the tensor Q-nuclear norm of $\mathcal{X}$ defined in Eq. (12). Algorithm 3 shows the details of solving this operator.

4.3 Convergence Analysis

4.3.1 VMTQN Model

For the models (37) and (39), it is hard to analyze the convergence of the corresponding optimization method directly. The constraint on Q is non-linear and the objective function is essentially non-convex w.r.t. Q, which increases the difficulty of the analysis. However, the conclusions of [43–47] guarantee convergence to some extent.
In practical applications, we can fix Qk = Q in every K iterations to solve
a convex problem w.r.t. X . As long as X is convergent, by using the following
Lemma 4, the change of Q is bounded.

Lemma 4 [42] Given a matrix $\mathbf{X}$ with Singular Value Decomposition $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$, let $\mathbf{v}_i$ denote the $i$-th column of $\mathbf{V}$ and $\sigma_i$ denote the $i$-th singular value of $\mathbf{X}$. Denoting the sub-differential of a variable by $\partial(\cdot)$, we have the following:
$$\partial(\mathbf{v}_i) = \big(\sigma_i^2\mathbf{I} - \mathbf{X}^{\top}\mathbf{X}\big)^{+}\,\partial(\mathbf{X}^{\top}\mathbf{X})\,\mathbf{v}_i. \tag{49}$$
If $v_{ij}$ represents the $j$-th element of $\mathbf{v}_i$, then $\big\|\frac{\partial(v_{ij})}{\partial(\mathbf{X}^{\top}\mathbf{X})}\big\|_2 < \infty$.

Algorithm 2 Solving the problems (39) and (40): VMTQN and MOTQN models by ADMM.
Input: Observed samples $\mathcal{Y}_{ijk}$, $(i,j,k) \in \Omega$, of tensor $\mathcal{Y} \in \mathbb{R}^{n_1\times n_2\times n_3}$.
Initialize: $\mathcal{X}_0$, $\mathcal{E}_0$, $\mathcal{Z}_0$, $\mathbf{Q}_0 \in \mathbb{R}^{n_3\times r}$. Parameters $k = 1$, $\rho > 1$, $\mu_0$, $\mu_{\max}$, $\varepsilon$, $K$.
While not converged do
1. Update $\mathbf{Q}_k$ by one of the following:
$$\text{(VMTQN):}\quad \mathbf{Q}_k = \begin{cases} \mathbf{Q}_{k-1}, & k \bmod K \ne K-1, \\ \operatorname{PCA}\!\Big(\mathcal{Y} - \mathcal{E}_{k-1} + \dfrac{\mathcal{Z}_{k-1}}{\mu_{k-1}},\, 3,\, r\Big), & k \bmod K = K-1. \end{cases} \tag{43}$$
$$\text{(MOTQN):}\quad \mathbf{Q}_k = \begin{cases} \mathbf{Q}_{k-1}, & k \bmod K \ne K-1, \\ \mathbf{Q}(\tau^*) \text{ by using Algorithm 1}, & k \bmod K = K-1. \end{cases} \tag{44}$$
2. Update $\mathcal{X}_k$ by
$$\mathcal{X}_k = \operatorname{Prox}_{\mu_{k-1}^{-1},\,\|\cdot\|_{\mathbf{Q}_k,*}}\!\Big(\mathcal{Y} - \mathcal{E}_{k-1} + \frac{\mathcal{Z}_{k-1}}{\mu_{k-1}}\Big). \tag{45}$$
3. Update $\mathcal{E}_k$ by
$$\mathcal{E}_k = P_{\Omega^c}\!\Big(\mathcal{Y} - \mathcal{X}_k + \frac{\mathcal{Z}_{k-1}}{\mu_{k-1}}\Big), \tag{46}$$
where $\Omega^c$ is the complement of $\Omega$.
4. Update the dual variable $\mathcal{Z}_k$ by
$$\mathcal{Z}_k = \mathcal{Z}_{k-1} + \mu_{k-1}(\mathcal{Y} - \mathcal{X}_k - \mathcal{E}_k). \tag{47}$$
5. Update $\mu_k$ by
$$\mu_k = \min\{\rho\mu_{k-1}, \mu_{\max}\}. \tag{48}$$
6. Check the convergence conditions: $\|\mathcal{X}_k - \mathcal{X}_{k-1}\|_\infty \le \varepsilon$, $\|\mathcal{E}_k - \mathcal{E}_{k-1}\|_\infty \le \varepsilon$, and $\|\mathcal{Y} - \mathcal{X}_k - \mathcal{E}_k\|_\infty \le \varepsilon$.
7. $k \leftarrow k + 1$.
end While
Output: The target tensor $\mathcal{X}_k$.

Algorithm 3 Solving the proximal operator $\operatorname{Prox}_{\lambda,\|\cdot\|_{\mathbf{Q},*}}(\mathcal{T})$ in Eqs. (42) and (45).
Input: Tensor $\mathcal{T} \in \mathbb{R}^{n_1\times n_2\times n_3}$, column orthonormal matrix $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$.
1. $\mathcal{G} = \mathcal{T}\times_3\mathbf{Q}$.
2. for $i = 1$ to $r$:
     $[\mathbf{U}, \mathbf{S}, \mathbf{V}] = \operatorname{SVD}(\mathbf{G}^{(i)})$.
     $\mathbf{G}^{(i)} = \mathbf{U}(\mathbf{S} - \lambda\mathbf{I})_{+}\mathbf{V}^{\top}$, where $(x)_{+} = \max\{x, 0\}$.
3. end for
4. $\mathcal{X} = \mathcal{G}\times_3\mathbf{Q}^{\top} + \mathcal{T}\times_3(\mathbf{I} - \mathbf{Q}\mathbf{Q}^{\top})$.
Output: Tensor $\mathcal{X}$.
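The following NumPy sketch (illustrative only; it replaces the convergence checks of step 6 with a fixed iteration budget, and the helper names are ours, not from the released MATLAB code) implements Algorithm 3 and the VMTQN branch of Algorithm 2:

import numpy as np

def prox_q_nuclear(T, Q, lam):
    # Algorithm 3: proximal operator of lam * ||.||_{Q,*} at T, for column-orthonormal Q.
    n1, n2, n3 = T.shape
    T3 = T.reshape(n1 * n2, n3)
    G = (T3 @ Q).reshape(n1, n2, Q.shape[1])            # G = T x_3 Q
    for i in range(Q.shape[1]):
        U, s, Vt = np.linalg.svd(G[:, :, i], full_matrices=False)
        G[:, :, i] = (U * np.maximum(s - lam, 0.0)) @ Vt  # singular value thresholding
    X3 = G.reshape(n1 * n2, -1) @ Q.T + T3 @ (np.eye(n3) - Q @ Q.T)
    return X3.reshape(n1, n2, n3)

def vmtqn_complete(Y, mask, r=None, rho=1.1, mu=1e-4, mu_max=1e10, K=1, iters=200):
    # Compact VMTQN variant of Algorithm 2, Eqs. (43)-(48); hyperparameters follow
    # the values reported in the experiments section.
    n1, n2, n3 = Y.shape
    r = min(n1 * n2, n3) if r is None else r
    X, E, Z = Y.copy(), np.zeros_like(Y), np.zeros_like(Y)
    Q = np.eye(n3)[:, :r]
    for k in range(iters):
        Tk = Y - E + Z / mu
        if k % K == K - 1:                               # Eq. (43): Q = PCA(., 3, r)
            _, _, Vt = np.linalg.svd(Tk.reshape(n1 * n2, n3), full_matrices=False)
            Q = Vt[:r].T
        X = prox_q_nuclear(Tk, Q, 1.0 / mu)              # Eq. (45)
        E = np.where(mask, 0.0, Y - X + Z / mu)          # Eq. (46): supported on Omega^c
        Z = Z + mu * (Y - X - E)                         # Eq. (47)
        mu = min(rho * mu, mu_max)                       # Eq. (48)
    return X

# Usage on a small synthetic completion problem.
Y_true = np.random.randn(20, 20, 5) @ np.random.randn(5, 10)   # mode-3 unfolding has rank <= 5
mask = np.random.rand(20, 20, 10) < 0.5                         # observed entries (Omega)
Y = np.where(mask, Y_true, 0.0)
X_hat = vmtqn_complete(Y, mask)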

Lemma 4 indicates that, as long as the change of $\mathcal{X}$ is bounded by the penalty term with proper $K$ and $\rho$, the change of $\mathbf{Q}$ will also be bounded to some extent. Then $\lim_{k\to\infty} \mathbf{Q}_k \approx \operatorname{PCA}(\mathcal{X}_k, 3, r)$ gradually meets the constraints.
Unfortunately, updating the variable $\mathbf{Q}_k$ in Eq. (43) requires solving a singular linear system, while the objective norm $\|\mathcal{X}\|_{\mathbf{Q},*}$ in Eq. (39) is non-convex w.r.t. $\mathbf{Q}$. Therefore, it is difficult to prove that the Lagrangian function in Eq. (41) of Algorithm 2 decreases strictly in each iteration. However, we give another theorem stating that the iterations corresponding to Eqs. (45)–(48) are convergent in the case of fixed $\mathbf{Q}$.

Theorem 4 Given a fixed $\mathbf{Q}$ in every $K$ iterations, the tensor completion model (39) can be solved effectively by Algorithm 2 with $\mathbf{Q}_k = \mathbf{Q}$ in Eq. (43), where $\Psi$ is replaced by $P_{\Omega}$. Rigorous convergence guarantees can be obtained directly due to convexity, as follows:
Let $(\mathcal{X}^*, \mathcal{E}^*, \mathcal{Z}^*)$ be a KKT point of problem (39) with fixed $\mathbf{Q}$, $\hat{\mathcal{X}}_K = \frac{\sum_{k=0}^{K}\frac{1}{\mu_k}\mathcal{X}_{k+1}}{\sum_{k=0}^{K}\frac{1}{\mu_k}}$, and $\hat{\mathcal{E}}_K = \frac{\sum_{k=0}^{K}\frac{1}{\mu_k}\mathcal{E}_{k+1}}{\sum_{k=0}^{K}\frac{1}{\mu_k}}$; then we have
$$\|\hat{\mathcal{X}}_{K+1} + \hat{\mathcal{E}}_{K+1} - \mathcal{Y}\|_F^2 \le O\!\left(\frac{1}{\sum_{k=0}^{K}\frac{1}{\mu_k}}\right), \tag{50}$$
and
$$0 \le \|\hat{\mathcal{X}}_{K+1}\|_{\mathbf{Q},*} - \|\mathcal{X}^*\|_{\mathbf{Q},*} + \big\langle \mathcal{Z}^*,\ \hat{\mathcal{X}}_{K+1} + \hat{\mathcal{E}}_{K+1} - \mathcal{Y} \big\rangle \le O\!\left(\frac{1}{\sum_{k=0}^{K}\frac{1}{\mu_k}}\right). \tag{51}$$

4.3.2 MOTQN Model

Different from the VMTQN model, as mentioned in Sec. 3.3.2, the MOTQN model has a complete convergence guarantee with the help of [41]. The updating step in Eq. (44) strictly guarantees the decrease of the objective value $\|\mathcal{X}\|_{\mathbf{Q},*}$ with a proper step size $\tau^*$.

Lemma 5 (Lemma 3 of [41]) Denote the gradient of the objective function $f(\mathbf{Q})$ w.r.t. $\mathbf{Q}$ at $\mathbf{Q}_k$ by $\mathbf{P}$ and let $\mathbf{A} = \mathbf{P}\mathbf{Q}_k^{\top} - \mathbf{Q}_k\mathbf{P}^{\top}$ be a skew-symmetric matrix. If we define $\mathbf{Q}(\tau)$ by Eq. (32), then $\mathbf{Q}(\tau)$ is a descent curve at $\tau = 0$, that is,
$$f_{\tau}'(\mathbf{Q}(0)) := \left.\frac{\partial f(\mathbf{Q}(\tau))}{\partial \tau}\right|_{\tau=0} = -\frac{1}{2}\|\mathbf{A}\|_F^2 \le 0. \tag{52}$$
Lemma 5 indicates that, as long as $\tau$ is small enough, Eq. (44) usually decreases the value of $f(\mathbf{Q}(\tau))$. Notice that Eq. (41) is a partial augmented Lagrangian function, hence the value of the Lagrangian function also decreases after Eq. (44). Therefore, we have the following theorem to ensure the convergence of Algorithm 2:

Theorem 5 Denote the augmented Lagrangian function of the low rank tensor recovery model (38) by $L(\mathbf{Q}, \mathcal{X}, \mathcal{E}, \mathcal{Z}, \mu)$, which is given as follows:
$$L(\mathbf{Q}, \mathcal{X}, \mathcal{E}, \mathcal{Z}, \mu) = \|\mathcal{X}\|_{\mathbf{Q},*} + \langle \mathcal{Z}, \mathcal{Y} - \mathcal{X} - \mathcal{E} \rangle + \frac{\mu}{2}\|\mathcal{Y} - \mathcal{X} - \mathcal{E}\|_F^2. \tag{53}$$
Then the sequence $\{\mathbf{Q}_k, \mathcal{X}_k, \mathcal{E}_k, \mathcal{Z}_k, \mu_k\}$ generated by Algorithm 2 with Eq. (44) satisfies the following:
$$L(\mathbf{Q}_k, \mathcal{X}_k, \mathcal{E}_k, \mathcal{Z}_k, \mu_k) \ge L(\mathbf{Q}_{k+1}, \mathcal{X}_{k+1}, \mathcal{E}_{k+1}, \mathcal{Z}_{k+1}, \mu_{k+1}) + \frac{\mu_k}{2}\|\mathcal{E}_k - \mathcal{E}_{k+1}\|_F^2 + \left(\frac{\mu_k}{2} - \frac{\mu_{k+1} + \mu_k}{2\mu_k^2} C_L\right)\|\mathcal{X}_k - \mathcal{X}_{k+1}\|_F^2. \tag{54}$$
The function value of Eq. (53) decreases monotonically after each iteration as long as $\mu \ge \sqrt{(\rho+1)C_L}$, where $\rho$ is defined in Eq. (48) and $C_L$ is a constant w.r.t. $\mathcal{X}$. By the monotone bounded convergence theorem, Algorithm 2 is convergent.

4.4 Complexity Analysis

The computational complexity of the VMTQN update in Eq. (43) is $O(r n_1 n_2 n_3)$, where $r$ denotes the number of columns of $\mathbf{Q} \in \mathbb{R}^{n_3\times r}$. The complexity of the MOTQN update in Eq. (44) is $O((n_1 n_2 + n_3) n_3^2)$. As for TNN based methods [25–27, 38], they use the Fourier transform and have a complexity of $O(n_1 n_2 n_3 \log n_3)$. As can be seen, if $r < \log n_3$, VMTQN can be more efficient than the other two; otherwise, we should use a larger $K$ to control the overall calculation cost.
However, when solving our two methods or the TNN based method, the most time-consuming part is the SVD in each iteration, corresponding to Eqs. (45)–(48). In this part, the VMTQN based method has a complexity of $O(r n_1 n_2 \min\{n_1, n_2\})$, while the MOTQN and TNN based methods have a complexity of $O(n_3 n_1 n_2 \min\{n_1, n_2\})$. That is to say, as long as $r \ll n_3$, the VMTQN based method is usually more efficient than the other two.

4.5 Performance Analysis

Considering the low rank tensor recovery models in Eqs. (37) and (38), Ω is
an index set consisting of the indices {(i, j, k)} which can be observed, and the
orthogonal projection operator PΩ is defined as PΩ (Xijk ) = Xijk if (i, j, k) ∈ Ω
and 0 otherwise. In this part, we discuss at least how many observation samples |Ω|
are needed to recover the ground-truth. In fact, Q∗ obtained from the convergence
of Algorithm 2 has a decisive effect on the number of observation samples needed,
since the optimal solution satisfies the KKT conditions under Q∗ . That is to say,
we only need to analyze the performance guarantee in the case of fixed Q.
With a fixed $\mathbf{Q}$, the exact tensor completion guarantee for model (13) is shown in Theorem 6. Lu et al. [8] also have similar conclusions.

Theorem 6 Given a fixed orthogonal matrix $\mathbf{Q} \in \mathbb{R}^{n_3\times n_3}$ and $\Omega \sim \operatorname{Ber}(p)$, assume that the tensor $\mathcal{X} \in \mathbb{R}^{n_1\times n_2\times n_3}$ ($n_1 \ge n_2$) has a low tensor Q-rank structure with $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X}) = R$. If $|\Omega| \ge O(\mu R n_1 \log(n_1 n_3))$, then $\mathcal{X}$ is the unique solution to Eq. (13) with high probability, where $\Psi$ is replaced by $P_{\Omega}$, and $\mu$ is the corresponding incoherence parameter (see Supplementary Materials).

Through the proofs of [8] and [27], the sampling rate $p$ should be proportional to $\max\{\|P_T(\mathbf{e}_{ijk})\|_F^2\}$. (The definitions of the projection operator $P_T$ and $\mathbf{e}_{ijk}$ can be found in [26, 27] or in the Supplementary Materials, where $T$ is the singular space of the ground-truth.) The projection of $\mathbf{e}_{ijk}$ onto the subspace $T$ is greatly influenced by its dimension. Obviously, when $T$ is the whole space, $\|P_{T_{\mathbf{Q}}}(\mathbf{e}_{ijk})\|_F^2 = 1$. That is to say, a small dimension of $T_{\mathbf{Q}}$ may lead to a small $\max_{ijk}\{\|P_{T_{\mathbf{Q}}}(\mathbf{e}_{ijk})\|_F^2\}$. Proposition 15 in [27] also implies that for any $\Delta \in T$, we need $P_{\Omega}(\Delta) = 0 \Leftrightarrow \Delta = 0$. These two conditions indicate that once the spatial dimension of $T$ is large ($\operatorname{rank}_{\mathbf{Q}}(\mathcal{X}) = R$ is large), a larger sampling rate $p$ is needed. Figure 3 in [27] verifies this deduction by experiment.
In fact, the smoothness of the data along the third dimension has a great influence on the Degrees of Freedom (DoF) of the space $T$. Non-smooth change along the third dimension is likely to increase the spatial dimension of $T$ under the Fourier basis vectors, which makes TNN based methods ineffective. Our experiments on CIFAR-10 (Table 1) confirm this conclusion.

As for the models (39) and (40) with adaptive $\mathbf{Q}$, our motivation is to find a better $\mathbf{Q}$ in order to make $\operatorname{rank}_{\mathbf{Q}}(\mathcal{X}) = R$ smaller and the spatial dimension of the corresponding $T_{\mathbf{Q}}$ as small as possible, where $T_{\mathbf{Q}}$ is the singular space of the ground-truth under $\mathbf{Q}$. In other words, for more complex data with non-smoothness along the third dimension, the adaptive $\mathbf{Q}$ may reduce the dimension of $T_{\mathbf{Q}}$ and make $\max\{\|P_{T_{\mathbf{Q}}}(\mathbf{e}_{ijk})\|_F^2\}$ smaller than $\max\{\|P_T(\mathbf{e}_{ijk})\|_F^2\}$, lowering the required sampling rate $p$.

5 Experiments

In this section, we conduct numerical experiments to evaluate our proposed models (39) and (40). The platform is Matlab R2018b under Windows 10 on a PC with an Intel i5-7500 CPU and 16GB memory. The experimental code for most comparison methods comes from their released versions. For methods without released code, we reproduce them in Matlab R2018b strictly following the algorithms in their respective papers.
Assume that the observed corrupted tensor is Y, and the true tensor is X0 ∈
Rn1 ×n2 ×n3 . We represent the recovered tensor (output of the algorithms) as X ,
and use Peak Signal-to-Noise Ratio (PSNR) to measure the reconstruction error:

$$\mathrm{PSNR} = 10\log_{10}\!\left(\frac{n_1 n_2 n_3\,\|\mathcal{X}_0\|_\infty^2}{\|\mathcal{X} - \mathcal{X}_0\|_F^2}\right). \tag{55}$$
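In code, Eq. (55) is a one-line computation; a NumPy sketch (illustrative, with our own function name) is:

import numpy as np

def psnr(X, X0):
    # Eq. (55): reconstruction error between the recovered tensor X and the ground-truth X0.
    mse = np.linalg.norm(X - X0) ** 2              # squared Frobenius norm
    peak = X0.size * np.max(np.abs(X0)) ** 2       # n1 n2 n3 * ||X0||_inf^2
    return 10.0 * np.log10(peak / mse)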

5.1 Synthetic Experiments

[Figure 3: three panels (VMTQN Model, MOTQN Model, TNN Model); x-axis: tensor Q-rank r; y-axis: sampling rate p = |Ω|/n³.]

Fig. 3 The numbers plotted on the above figure are the average PSNRs within 10 random trials.
The gray scale reflects the quality of completion results of three different models (VMTQN,
MOTQN, TNN), where n1 = n2 = n3 = 50 and the white area represents a maximum PSNR
of 40.

In this part we compare our proposed methods (named VMTQN model and
MOTQN model) with the mainstream algorithm TNN [25, 27].
We examine the completion task with varying tensor Q-rank of tensor Y and
varying sampling rate p. Firstly, we generate a random tensor M ∈ R50×50×50 ,
whose entries are independently sampled from an N (0, 1/50) distribution. Actually,
the data generated in this way is usually non-smooth along each dimension. Then
we choose p in [0.01 : 0.02 : 0.99] and r in [1 : 1 : 50], where the column orthonormal matrix W ∈ R^{50×r} satisfies W = PCA(M, 3, r). We let Y = M ×_3 W ×_3 W^⊤ be the true tensor. After that, we create the index set Ω by using a Bernoulli model to randomly sample a subset from {1, . . . , 50} × {1, . . . , 50} × {1, . . . , 50}. The sampling rate is p = |Ω|/50^3. For each pair of (p, r), we simulate 10 times with different random seeds and take the average as the final result. As for the parameters of the VMTQN and MOTQN models in Algorithm 2, we set ρ = 1.1, μ_0 = 10^{-4}, μ_max = 10^{10}, and ε = 10^{-8}.
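
A sketch of this data-generation step is given below. We read W = PCA(M, 3, r) as the top-r principal directions of the mode-3 unfolding of M (computed here without centering, for simplicity), and Y = M ×_3 W ×_3 W^⊤ as projecting the mode-3 fibers of M onto span(W); both readings, as well as the helper names, are our own interpretation.

```python
import numpy as np

def synthetic_instance(n=50, r=10, p=0.3, seed=0):
    rng = np.random.default_rng(seed)
    M = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n, n))   # entries ~ N(0, 1/50) for n = 50

    # W = PCA(M, 3, r): top-r principal directions of the mode-3 fibers.
    M3 = M.reshape(n * n, n)                                 # rows are mode-3 fibers
    _, _, Vt = np.linalg.svd(M3, full_matrices=False)
    W = Vt[:r].T                                             # n x r, column orthonormal

    # Y = M x_3 W x_3 W^T: project each mode-3 fiber onto span(W).
    Y = (M3 @ W @ W.T).reshape(n, n, n)

    # Omega ~ Ber(p): True marks an observed entry, so that |Omega| / n^3 is about p.
    Omega = rng.random(Y.shape) < p
    return Y, Omega

Y, Omega = synthetic_instance()
print(Omega.mean())   # approximately the sampling rate p
```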
As shown in the upper left corner regions of the VMTQN and MOTQN panels in Figure 3, Algorithm 2 can effectively solve our proposed recovery models (37) and (38). The larger the tensor Q-rank, the larger the sampling rate p that is needed, which is consistent with our performance analysis in Theorem 6. By comparing the results of the three methods, we find that TNN is not robust to data with non-smooth changes. The results of the left and middle panels support our motivating assumption, which may imply that a more significant low rank structure leads to better recovery.

5.2 Real-World Datasets

In this part we compare our proposed method with TNN [27] with the Fourier matrix, TTNN [33] with the wavelet matrix, TNN-C [32] with the cosine matrix, F-TNN [34] with the framelet matrix, SiLRTC [12], Tmac [48], and Latent Trace Norm [23]. We validate our algorithm on three datasets: (1) CIFAR-10; (2) COIL-20; (3) HMDB51 (dataset links are given in the footnotes). We set ρ = 1.1, μ_0 = 10^{-4}, μ_max = 10^{10}, ε = 10^{-8}, and K = 1 in our methods. For TNN, SiLRTC, Tmac, F-TNN, and Latent Trace Norm, we use the default settings of their released code, e.g., by Lu et al. and Tomioka et al. (code links in the footnotes). For TTNN and TNN-C, whose code is not released, we implement their algorithms in MATLAB strictly according to the corresponding papers.

5.2.1 Influences of Q

Corresponding to our motivation, we use a random orthogonal matrix and an Oracle matrix (the matrix of right singular vectors of the ground-truth unfolding matrix) to test the influence of Q. The results of the TQN models with different orthogonal matrices in Tables 1 and 2 show that Q plays an important role in tensor recovery. Compared with the random Q case, our Algorithm 2 is effective at searching for a better Q. Table 1 also shows that a proper Q may make it easier to recover the ground-truth. For example, with sampling rate p ≥ 0.2 on 10000 images, an Oracle matrix Q leads to an "exact" recovery (PSNR > 200).
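
For concreteness, the two baselines can be formed as sketched below. We read "the matrix of right singular vectors of the ground-truth unfolding matrix" as the right singular vectors of the (n1·n2) × n3 mode-3 unfolding of the ground-truth, which is our interpretation; the helper names are ours.

```python
import numpy as np

def oracle_q(X0):
    """Oracle Q: right singular vectors of the (n1*n2) x n3 unfolding of the ground-truth X0."""
    n1, n2, n3 = X0.shape
    _, _, Vt = np.linalg.svd(X0.reshape(n1 * n2, n3), full_matrices=True)
    return Vt.T                                   # n3 x n3 orthogonal matrix

def random_q(n3, seed=0):
    """Random orthogonal Q for the 'TQN with Random Q' baseline."""
    rng = np.random.default_rng(seed)
    return np.linalg.qr(rng.standard_normal((n3, n3)))[0]
```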

5.2.2 CIFAR-10

We consider the worst case for TNN based methods, in which there is almost no smoothness along the third dimension of the data. We randomly select 3000 and
6 http://www.cs.toronto.edu/~kriz/cifar.html.
7 http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php.
8 http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/.
9 https://github.com/canyilu/LibADMM
10 https://github.com/ryotat/tensor

Table 1 Comparisons of PSNR results on CIFAR images with different sampling rates.
Top: experiments on the case Y1 ∈ R32×32×3000 . Bottom: experiments on the case Y2 ∈
R32×32×10000 .

Sampling Rate p 0.1 0.2 0.3 0.4 0.5 0.6


TQN with Random Q 10.86 15.47 18.09 20.20 22.30 24.49
TQN with Oracle Q (ideal) 25.39 30.85 39.43 109.52 >200 >200
VMTQN (Ours) 18.83 21.10 22.89 24.56 26.26 28.07
TNN (Fourier) [27] 9.84 12.73 15.68 18.71 21.60 24.26
TNN-C (cosine) [32] 9.63 11.92 15.17 18.45 22.09 23.95
TTNN (wavelet) [33] 8.97 13.08 17.19 19.26 23.13 25.67
F-TNN (framelet) [34] 8.84 11.95 16.56 20.61 23.77 26.02
Tmac [48] 17.81 19.29 23.06 24.89 25.74 27.46
SiLRTC [12] 16.87 20.04 21.99 23.80 25.62 27.57

Sampling Rate p 0.1 0.2 0.3 0.4 0.5 0.6


TQN with Random Q 10.84 15.45 18.06 20.19 22.29 24.48
TQN with Oracle Q (ideal) 45.75 >200 >200 >200 >200 >200
VMTQN (Ours) 19.06 21.43 23.27 24.97 26.65 28.42
TNN (Fourier) [27] 8.18 10.10 12.19 14.63 17.59 21.20
TNN-C (cosine) [32] 8.12 9.95 11.80 13.62 18.07 22.10
TTNN (wavelet) [33] 9.01 10.80 13.27 15.88 20.21 24.04
F-TNN (framelet) [34] 9.17 11.06 15.10 17.44 20.85 23.77
Tmac [48] 12.91 18.49 22.97 25.25 27.06 27.97
SiLRTC [12] 14.02 19.65 22.44 24.38 26.21 28.12

[Figure 4: two panels, "Running Time on CIFAR (10000 images)" (PSNR vs. running time in seconds) and "PSNR-Iteration on CIFAR (10000 images)" (PSNR vs. iteration), comparing TQN (Oracle Q), VMTQN, TNN, and SiLRTC.]
Fig. 4 Running time comparisons of different methods, where Y ∈ R32×32×10000 and sampling
rate p = 0.3.

10000 images from one batch of CIFAR-10 [49] as our true tensors Y1 ∈ R^{32×32×3000} and Y2 ∈ R^{32×32×10000}, respectively. Then we solve the model (39) with our proposed Algorithm 2. The results are shown in Table 1. Note that in the latter case r = n1n2 ≪ n3 holds, so the MOTQN model has high computational complexity; thus we do not include it in this comparison.
Table 1 verifies our hypothesis that TNN regularization performs badly on data with non-smooth change along the third dimension. Our VMTQN method is clearly better than the other methods at low sampling rates. Moreover, by comparing the two groups of experiments, we can see that VMTQN, Tmac, and SiLRTC perform better on Y2. This may be because increasing the data


Fig. 5 Examples of the corrupted data in our completion tasks. The left figure is from COIL
dataset while the right figure is from the short video. The sampling rate is p = 0.2 in the left
and p = 0.5 in the right.

volume makes the principal components more significant. Meanwhile, the methods based on the Fourier, cosine, and wavelet matrices have almost no recovery effect when the sampling rate p is low. This indicates that these fixed projection bases cannot learn the data features in the case of poor continuity and insufficient sampling.
The above analyses confirm that our proposed regularization is data-dependent and can lead to a better low rank structure, which makes recovery easier.

5.2.3 Running time on CIFAR

As shown in Figure 4, we test the running times of the different models. The two panels indicate that, when n3 ≫ n1n2, our VMTQN model has higher computational efficiency in each iteration and better accuracy than TNN and SiLRTC. As mentioned in our previous complexity analysis, the VMTQN method has a great speed advantage in this case. Moreover, for the case n3 < n1n2, Figure 8 implies that setting r < n1n2 can balance computational efficiency and recovery accuracy.

5.2.4 COIL-20 and Short Video from HMDB51

COIL-20 [50] contains 1440 images of 20 objects taken from different angles. Each image is resized to 128 × 128, which means Y ∈ R^{128×128×1440}. The upper part of Table 2 shows the results of the numerical experiments. We select a background-changing video from HMDB51 [51] for the video inpainting task, where Y ∈ R^{240×320×146}. Figure 2 shows some frames of this video. The lower part of Table 2 shows the results. Figures 5, 6 and 7 show the experimental results on COIL-20 and the short video from HMDB51, respectively. From the two visual figures we can see that our VMTQN and MOTQN methods perform the best among all comparative methods. Especially at the sampling rate p = 0.2 in Figure 6, our methods have a significant superiority in visual quality. What's more, the "Latent Trace Norm" based method performs much better than TNN on COIL, which validates our assumption that, with the help of the data-dependent V, the tensor trace norm is much more robust than TNN in processing non-smooth data.
Overall, both our methods and t-SVD based methods (e.g., TNN) perform
better than the others (e.g., SiLRTC) on these two datasets. It is mainly because

[Figure 6 panels: TQN Random, VMTQN (Ours), MOTQN (Ours), TNN, Clean image, TNN-C (Cosine), F-TNN (Framelet), SiLRTC, LTN.]

Fig. 6 Examples of COIL completion results. Method names correspond to the top of each
figure. The sampling rate p = 0.2.

[Figure 7 panels: TQN Random, VMTQN (Ours), MOTQN (Ours), TNN, Clean image, TNN-C (Cosine), F-TNN (Framelet), SiLRTC, LTN.]

Fig. 7 Examples of video inpainting task with sampling rate p = 0.5.



Table 2 Comparisons of PSNR results on COIL images and video inpainting with different sampling rates. Top: the COIL dataset with Y ∈ R^{128×128×1440}. Bottom: a short video from HMDB51 with Y ∈ R^{240×320×126}.

Sampling Rate p 0.1 0.2 0.3 0.4 0.5 0.6


TQN with Random Q 16.05 20.07 23.02 25.57 27.95 30.34
TQN with Oracle Q (ideal) 22.97 25.32 27.18 28.90 30.68 32.51
VMTQN (Ours) 22.79 25.34 27.29 29.08 30.86 32.74
MOTQN (Ours) 21.91 25.41 27.86 30.13 31.79 33.64
TNN (Fourier) [27] 19.20 22.08 24.45 26.61 28.72 30.91
TNN-C (cosine) [32] 19.02 22.11 24.23 27.04 28.95 30.97
TTNN (wavelet) [33] 18.15 21.42 24.47 26.93 29.11 31.10
F-TNN (framelet) [34] 17.62 20.58 22.87 24.67 27.41 29.90
Tmac [48] 19.04 22.48 24.97 26.70 27.91 28.86
SiLRTC [12] 18.87 21.80 23.89 25.67 27.37 29.14
Latent Trace Norm [23] 19.09 22.98 25.75 28.11 30.40 32.42

Sampling Rate p 0.1 0.2 0.3 0.4 0.5 0.6


TQN with Random Q 18.85 22.76 25.87 28.73 31.55 34.48
TQN with Oracle Q (ideal) 23.44 27.61 31.37 35.11 38.92 42.74
VMTQN (Ours) 23.97 28.09 31.76 35.33 39.06 42.87
MOTQN (Ours) 24.10 27.88 32.24 35.19 39.28 42.65
TNN (Fourier) [27] 22.40 25.58 28.28 30.88 33.55 36.41
TNN-C (cosine) [32] 22.15 25.34 28.17 30.96 33.51 36.62
TTNN (wavelet) [33] 19.80 21.95 24.92 30.13 32.78 36.84
F-TNN (framelet) [34] 19.01 23.44 25.94 29.32 32.06 35.13
Tmac [48] 18.54 22.79 26.08 29.70 31.17 34.26
SiLRTC [12] 18.42 22.33 25.76 29.15 32.59 36.15
Latent Trace Norm [23] 18.94 22.72 25.65 28.26 30.79 33.48

the definitions of tensor singular values in t-SVD based methods make better use of the internal structure of the tensor, and this is also the main difference between the tensor Q-nuclear norm (TQN) and the sum of nuclear norms (SNN).
Meanwhile, our method is clearly better than the others at all sampling rates, which reflects the superiority of our data dependent Q.

5.2.5 Influence of r in Q ∈ R^{n3×r}

Remarks 2 and 3 imply that r of Q ∈ R^{n3×r} in VMTQN encodes an a priori assumption on the subspace dimension of the ground-truth: the dimension of the frontal slice subspace of the true tensor T (also the column subspace of the mode-3 unfolding matrix T_(3)) is no more than r.
Figure 8 illustrates the relations among running times, different r, and the singular values of T_(3). We project the solution X_k (in Eq. (45)) onto the subspace of Q_k, that is, X̂_k := X_k ×_3 (Q_k Q_k^⊤). Meanwhile, under different r in Q ∈ R^{n3×r}, Figure 9 shows the PSNR results of the completion task with varying tensor Q-rank and varying sampling rate. The settings in Figure 9 are consistent with those in Sec. 5.1, and only the size of Q is different.
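
The projection X̂_k := X_k ×_3 (Q_k Q_k^⊤) used here simply projects every mode-3 fiber of X_k onto the column space of Q_k; a minimal sketch (with our own helper name) is:

```python
import numpy as np

def project_mode3(X, Q):
    """Return X x_3 (Q Q^T): each mode-3 fiber of X is projected onto span(Q), with Q of size n3 x r."""
    n1, n2, n3 = X.shape
    X3 = X.reshape(n1 * n2, n3)                  # rows are the mode-3 fibers
    return (X3 @ Q @ Q.T).reshape(n1, n2, n3)
```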
As shown in Figure 8, the dimension of the column subspace of T_(3) is more than 360. If r ≤ 360, the algorithm will converge to a bad point which only has an r-dimensional subspace. Therefore, in our previous experiments, we usually set r = min{n1n2, n3} to make sure that r is greater than the true tensor's

[Figure 8: panels "PSNR-Iteration on COIL-20 with Different r" (PSNR vs. iteration) and "Running Time on COIL-20 with Different r" (PSNR vs. running time in seconds), comparing TNN with VMTQN for r = 1440, 720, 360, 180; a third panel plots the singular values of T_(3) (also in log2 scale) against the singular value index.]

Fig. 8 The relations among running times, different r, and the singular values of T(3) on
COIL, where p = 0.2.

[Figure 9: three panels for the VMTQN model with Q ∈ R^{50×20}, Q ∈ R^{50×30}, and Q ∈ R^{50×50}; horizontal axis: tensor Q-rank / n (r) of the ground-truth, vertical axis: sampling rate p = |Ω|/n^3, gray scale: PSNR.]

Fig. 9 The gray scale reflects the quality (PSNR) of completion results, where n1 = n2 =
n3 = 50 and the white area represents a maximum PSNR of 40. There are three different sizes
of Q in VMTQN model to show the influences.

subspace dimension. This a priori assumption is commonly used in factorization-based algorithms. What's more, the running time decreases as r decreases. Although r = 1440 needs more time to converge than TNN, it obtains a better recovery, while a smaller r does speed up the computation but harms the accuracy.
The results of Figure 9 intuitively reflect the selection criterion of r in VMTQN, namely that r should be larger than the subspace dimension of the true tensor to get the exact recovery. According to the constraint XQQ^⊤ = X in Sec. 3.1, if the subspace dimension of the true tensor is larger than r, then this constraint can never be satisfied, and there must be a gap between the output of Algorithm 2 and the true tensor, which corresponds to the black areas in the upper right

Sampling Rate p 0.2 0.3 0.4 0.5 0.6


TNN (Fourier) [27] 25.64 28.08 30.43 32.82 35.36
VMTQN (Ours) 26.58 29.18 31.69 34.21 36.87
MOTQN (Ours) 26.34 28.55 30.07 32.41 36.59

[Figure 10 visualization panels: Clean image, Corrupted, TNN, VMTQN (Ours), MOTQN (Ours).]

Fig. 10 Comparisons of PSNR and visualization results on a smooth video inpainting task. Top: PSNR results with different sampling rates. Bottom: visualization results with the sampling rate p = 0.5.

corner of the first two sub-figures. From the left two sub-figures we can see that, if the subspace dimension of the true tensor is not greater than r, the recovery performance is consistent with that in the third sub-figure. Combined with the above analyses, setting r = min{n1n2, n3} not only saves computation in some cases, but also keeps the recovery performance of the model in "the white area", corresponding to exact recovery.

5.3 Smooth Data Experiments

To verify the effectiveness of our proposed methods on smooth data, we select a video from HMDB51 whose background remains unchanged to conduct the experiments. Figure 10 shows the PSNR and visualization results of the video inpainting task. Here we only compare with the TNN based method [27], since in recent years TNN has been considered a benchmark for handling such smooth data. The results in Figure 10 show that the VMTQN method performs best and, as the sampling rate p increases, the MOTQN method also outperforms the TNN based method, which means our proposed methods are still competitive in processing smooth data.

6 Conclusions

We analyze the advantages and limitations of the current mainstream low rank regularizers, and then introduce a new definition of data dependent tensor rank named tensor Q-rank. To obtain a more significant low rank structure w.r.t. rank_Q, we further introduce two explainable selection methods for Q and make Q a learnable variable w.r.t. the data. Specifically, maximizing the variance of the singular value distribution leads to VMTQN, while minimizing the value of the nuclear norm through manifold optimization leads to MOTQN. We provide an envelope of our rank function and apply it to the tensor completion problem. By analyzing the proof of the exact recovery theorem, we explain why our method may perform better than TNN based methods on non-smooth data (along the third dimension) with low sampling rates, and conduct experiments to verify our conclusions.

A Proof of Lemma 1
Proof Suppose that $\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i$; hence the variance of $\{a_1,\dots,a_n\}$ can be expressed as $\mathrm{Var}[a_i] = \sum_{i=1}^{n}(a_i-\bar{a})^2$. With $\sum_{i=1}^{n} a_i^2 = C$ held fixed, we have the following:
$$\max \mathrm{Var}[a_i] \;\Rightarrow\; \max \sum_{i=1}^{n}(a_i-\bar{a})^2 \;\Rightarrow\; \max \sum_{i=1}^{n}\left(a_i^2+\bar{a}^2-2a_i\bar{a}\right)$$
$$\Rightarrow\; \max \Big(\sum_{i=1}^{n}a_i^2\Big)+\Big(\sum_{i=1}^{n}\bar{a}^2\Big)-2\Big(\sum_{i=1}^{n}a_i\bar{a}\Big) \;\Rightarrow\; \max\; n\bar{a}^2-2\bar{a}(n\bar{a}) \;\Rightarrow\; \max\; -n\bar{a}^2 \;\Rightarrow\; \min\; \bar{a}$$
(due to $a_i \ge 0$).
Moreover, the feasible region of $\{a_1,\dots,a_n\}$ is a first-quadrant Euclidean spherical surface: $\{(a_1,\dots,a_n)\,|\,\sum_{i=1}^{n}a_i^2=C,\ a_i\ge 0\}$. Thus minimizing the objective function $\bar{a}=\frac{1}{n}\sum_{i=1}^{n}a_i$ is actually a linear hyperplane optimization problem, whose optimal solution set consists of the intersections of the sphere with the coordinate axes, each of which corresponds to only one non-zero coordinate in $\{a_1,\dots,a_n\}$. $\square$
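
As a quick numerical sanity check of Lemma 1 (our own illustration, not part of the paper), one can sample points on the feasible sphere {∑_i a_i^2 = C, a_i ≥ 0} and confirm that the variance never exceeds the value attained when only one coordinate is non-zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, trials = 10, 4.0, 100000

best = -np.inf
for _ in range(trials):
    a = np.abs(rng.standard_normal(n))
    a *= np.sqrt(C) / np.linalg.norm(a)     # enforce sum_i a_i^2 = C with a_i >= 0
    best = max(best, n * a.var())           # n * var = sum_i (a_i - abar)^2, the Var of the lemma

one_hot = np.zeros(n)
one_hot[0] = np.sqrt(C)                     # a single non-zero coordinate
print(best <= n * one_hot.var() + 1e-9)     # True: the axis point attains the maximum variance
```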

B Proof of Lemma 2

Proof Firstly, $X = U\Sigma V^{\top}$ denotes the full Singular Value Decomposition of the matrix $X$, with $U \in \mathbb{R}^{n_1\times n_1}$, $\Sigma \in \mathbb{R}^{n_1\times n_2}$, and $V \in \mathbb{R}^{n_2\times n_2}$. And $P = V^{\top}Q$ is also an orthogonal matrix, where $P \in \mathbb{R}^{n_2\times n_2}$. We use $P_{ij}$ to represent the $(i,j)$-th element of the matrix $P$, and use $p_i$ to represent the $i$-th column of $P$. Then $XQ = U\Sigma V^{\top}Q = U\Sigma P$ holds and we have the following:
$$\|XQ\|_{2,1} = \|U\Sigma P\|_{2,1} = \sum_{i=1}^{n_2}\|U\Sigma p_i\|_2 = \sum_{i=1}^{n_2}\|\Sigma p_i\|_2. \qquad (56)$$
If $n_1 \ge n_2$, let $\sigma_i = \Sigma_{ii}$ be the $(i,i)$-th element of $\Sigma$ with $i=1,\dots,n_2$. Or if $n_1 < n_2$, let $\Sigma' = \begin{pmatrix}\Sigma\\ 0\end{pmatrix} \in \mathbb{R}^{n_2\times n_2}$ and $\sigma_i = \Sigma'_{ii}$ with $i=1,\dots,n_2$; in this case, $\sum_{i=1}^{n_2}\|\Sigma p_i\|_2 = \sum_{i=1}^{n_2}\|\Sigma' p_i\|_2$. Thus, we can always get $\{\sigma_1,\dots,\sigma_{n_2}\}$ and have the equation $\sum_{i=1}^{n_2}\|\Sigma p_i\|_2 = \sum_{i=1}^{n_2}\sqrt{\sum_{j=1}^{n_2}(\sigma_j P_{ji})^2}$.
We then prove that $P = I$ optimizes the problem (16). By using Eq. (56), the objective function can be written as $\sum_{i=1}^{n_2}\|\Sigma p_i\|_2$. We give the following deduction:
$$\sum_{i=1}^{n_2}\|\Sigma p_i\|_2 = \sum_{i=1}^{n_2}\sqrt{\sum_{j=1}^{n_2}(\sigma_j P_{ji})^2} \overset{(a)}{=} \sum_{i=1}^{n_2}\sqrt{\sum_{j=1}^{n_2}(\sigma_j P_{ji})^2 \times \sum_{j=1}^{n_2}P_{ji}^2} \overset{(b)}{\ge} \sum_{i=1}^{n_2}\sum_{j=1}^{n_2}\sigma_j P_{ji}^2 \overset{(c)}{=} \sum_{j=1}^{n_2}\sigma_j\Big(\sum_{i=1}^{n_2}P_{ji}^2\Big) \overset{(d)}{=} \sum_{j=1}^{n_2}\sigma_j.$$
(a) holds because $P$ is an orthogonal matrix with normalized columns. (b) holds because of the Cauchy inequality. (c) holds by exchanging the order of the two summations. Finally, (d) holds owing to the row normalization of $P$. Notice that the equality in (b) holds if and only if the two vectors $(\sigma_1 P_{1i},\dots,\sigma_{n_2}P_{n_2 i})$ and $(P_{1i},\dots,P_{n_2 i})$ are parallel. It can be seen that when $P = I$, this condition is satisfied. In other words, $V^{\top}Q = I$ optimizes the problem (16), which implies $Q = V$. $\square$
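
Lemma 2 can be sanity-checked numerically in the same spirit (again only an illustration): for a random X with n1 ≥ n2, the objective ‖XQ‖_{2,1} evaluated at Q = V equals the sum of the singular values, and a random orthogonal Q never gives a smaller value:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 30, 20
X = rng.standard_normal((n1, n2))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def l21(M):
    """||M||_{2,1}: sum of the column-wise l2 norms."""
    return np.linalg.norm(M, axis=0).sum()

Q_rand = np.linalg.qr(rng.standard_normal((n2, n2)))[0]
print(np.isclose(l21(X @ Vt.T), s.sum()))    # at Q = V the objective equals the sum of singular values
print(l21(X @ Q_rand) >= s.sum() - 1e-9)     # any orthogonal Q gives a value at least as large
```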

C Proof of Theorem 2

Proof We divide $r = \min\{n_1, n_2\}$ into two cases and prove them respectively, using the same notation as in the previous proofs.
(1): If $n_1 < n_2$ and $r = n_1$, then $U \in \mathbb{R}^{n_1\times n_1}$, $V \in \mathbb{R}^{n_2\times n_1}$, and $Q \in \mathbb{R}^{n_2\times n_1}$. In this case, $\Sigma \in \mathbb{R}^{n_1\times n_1}$. Let $\Sigma' = \begin{pmatrix}\Sigma & 0\end{pmatrix} \in \mathbb{R}^{n_1\times n_2}$, $V' = \begin{pmatrix}V & V_{\perp}\end{pmatrix} \in \mathbb{R}^{n_2\times n_2}$, and $Q' = \begin{pmatrix}Q & Q_{\perp}\end{pmatrix} \in \mathbb{R}^{n_2\times n_2}$. Note that the constraint $XQQ^{\top} = X$ in Eq. (17) implies $V^{\top}Q_{\perp} = 0$ and $V_{\perp}^{\top}Q = 0$; then we have the following:
$$\|XQ\|_{2,1} = \|U\Sigma V^{\top}Q\|_{2,1} = \|\Sigma V^{\top}Q\|_{2,1} = \|\Sigma' V'^{\top}Q'\|_{2,1}. \qquad (57)$$
That is to say, minimizing $\|XQ\|_{2,1}$ w.r.t. $Q$ in Eq. (17) is equivalent to minimizing $\|\Sigma' V'^{\top}Q'\|_{2,1}$ w.r.t. $Q'$ under the constraints $V^{\top}Q_{\perp} = 0$ and $V_{\perp}^{\top}Q = 0$. By using Lemma 2, $Q' = V'$ minimizes the objective function $\|\Sigma' V'^{\top}Q'\|_{2,1}$ and also satisfies the constraints. In other words, $Q = V$ optimizes the problem (17).
(2): If $n_1 \ge n_2$ and $r = n_2$, then $U \in \mathbb{R}^{n_1\times n_2}$, $V \in \mathbb{R}^{n_2\times n_2}$, and $Q \in \mathbb{R}^{n_2\times n_2}$. In this case, we have
$$\|XQ\|_{2,1} = \|U\Sigma P\|_{2,1} = \sum_{i=1}^{n_2}\|U\Sigma p_i\|_2 = \sum_{i=1}^{n_2}\|\Sigma p_i\|_2.$$
The remaining proof is similar to the details in Appendix B. $\square$

D Proof of Lemma 3
Proof Let $g(\tau) = f(Q(\tau)) = \|\mathcal{X}\|_{Q(\tau),*}$ and $Q(\tau) \approx \left(I - \tau A + \frac{\tau^2}{2}A^2\right)Q_k$, where $A$ is defined in Eq. (32). We consider the following approximation:
$$g(\tau) = f(Q(\tau)) \approx g(0) + \left\langle \left.\frac{\partial f(Q(\tau))}{\partial Q(\tau)}\right|_{\tau=0},\, Q(\tau)-Q(0) \right\rangle = g(0) + \left\langle \mathbf{X}_{(3)}^{\top}\mathbf{H}_{(3)},\, Q(\tau)-Q(0) \right\rangle, \qquad (58)$$
where $Q(0) = Q_k$ and Eq. (31) ensures $\left.\frac{\partial f(Q(\tau))}{\partial Q(\tau)}\right|_{\tau=0} = \mathbf{X}_{(3)}^{\top}\mathbf{H}_{(3)}$. Then we have:
$$g(\tau) \approx \left\langle \mathbf{X}_{(3)}^{\top}\mathbf{H}_{(3)},\, \left(I - \tau A + \frac{\tau^2}{2}A^2\right)Q_k \right\rangle + C_{\tau}, \qquad (59)$$
where $C_{\tau}$ is a constant independent of $\tau$. Then the first and second order derivatives of $g(\tau)$ evaluated at $0$ can be estimated as follows:
$$g'(0) \approx \left\langle \mathbf{X}_{(3)}^{\top}\mathbf{H}_{(3)},\, -AQ_k \right\rangle, \quad \text{and} \quad g''(0) \approx \left\langle \mathbf{X}_{(3)}^{\top}\mathbf{H}_{(3)},\, A^2 Q_k \right\rangle. \qquad (60)$$
$\square$
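
Since Eq. (59) is quadratic in τ, the estimates in Eq. (60) can be verified by finite differences. In the sketch below, A, Q_k, and the gradient factor X_(3)^⊤H_(3) are replaced by random stand-ins of compatible size, because their exact construction (Eqs. (31)–(32)) appears earlier in the paper; this only checks the differentiation step, not the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n3 = 8
G = rng.standard_normal((n3, n3))                     # stand-in for X_(3)^T H_(3)
A = rng.standard_normal((n3, n3))                     # stand-in for the A of Eq. (32)
Qk = np.linalg.qr(rng.standard_normal((n3, n3)))[0]   # stand-in for Q_k

def g_model(tau):
    """Quadratic model of Eq. (59), without the constant C_tau."""
    return np.sum(G * ((np.eye(n3) - tau * A + 0.5 * tau**2 * (A @ A)) @ Qk))

g1 = np.sum(G * (-A @ Qk))        # first derivative at 0, Eq. (60)
g2 = np.sum(G * (A @ A @ Qk))     # second derivative at 0, Eq. (60)

eps = 1e-4
print(np.isclose((g_model(eps) - g_model(-eps)) / (2 * eps), g1))
print(np.isclose((g_model(eps) - 2 * g_model(0.0) + g_model(-eps)) / eps**2, g2, atol=1e-4))
```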

References

1. T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
2. F. L. Hitchcock, “The expression of a tensor or a polyadic as a sum of products,” Studies
in Applied Mathematics, vol. 6, no. 1-4, pp. 164–189, 1927.
3. H. A. Kiers, “Towards a standardized notation and terminology in multiway analysis,”
Journal of Chemometrics, vol. 14, no. 3, pp. 105–122, 2000.
4. F. L. Hitchcock, “Multiple invariants and generalized rank of a p-way matrix or tensor,”
Journal of Mathematics and Physics, vol. 7, no. 1-4, pp. 39–79, 1928.
5. L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika,
vol. 31, no. 3, pp. 279–311, 1966.

6. M. E. Kilmer, K. Braman, N. Hao, and R. C. Hoover, "Third-order tensors as operators on matrices: A theoretical and computational framework with applications in imaging," SIAM Journal on Matrix Analysis and Applications, vol. 34, no. 1, pp. 148–172, 2013.
7. M. E. Kilmer and C. D. Martin, “Factorization strategies for third-order tensors,” Linear
Algebra and its Applications, vol. 435, no. 3, pp. 641–658, 2011.
8. C. Lu, X. Peng, and Y. Wei, “Low-rank tensor completion with a new tensor nuclear
norm induced by invertible linear transforms,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5996–6004, 2019.
9. E. Kernfeld, M. Kilmer, and S. Aeron, “Tensor–tensor products with invertible linear
transforms,” Linear Algebra and its Applications, vol. 485, pp. 545–570, 2015.
10. E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations
of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
11. E. J. Candès and T. Tao, “The power of convex relaxation: Near-optimal matrix completion,”
IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2053–2080, 2010.
12. J. Liu, P. Musialski, P. Wonka, and J. Ye, “Tensor completion for estimating missing values
in visual data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35,
no. 1, pp. 208–220, 2013.
13. S. Friedland and L.-H. Lim, “Nuclear norm of higher-order tensors,” Mathematics of
Computation, vol. 87, no. 311, pp. 1255–1281, 2018.
14. J. Håstad, “Tensor rank is NP-complete,” Journal of Algorithms, vol. 11, no. 4, pp. 644–654,
1990.
15. C. J. Hillar and L.-H. Lim, “Most tensor problems are NP-hard,” Journal of the ACM,
vol. 60, no. 6, p. 45, 2013.
16. M. Yuan and C.-H. Zhang, “On tensor completion via nuclear norm minimization,” Foun-
dations of Computational Mathematics, vol. 16, no. 4, pp. 1031–1068, 2016.
17. Y. Fu, J. Gao, D. Tien, Z. Lin, and X. Hong, “Tensor lrr and sparse coding-based subspace
clustering,” IEEE transactions on neural networks and learning systems, vol. 27, no. 10,
pp. 2120–2133, 2016.
18. Y. Liu, F. Shang, W. Fan, J. Cheng, and H. Cheng, “Generalized higher order orthogonal
iteration for tensor learning and decomposition,” IEEE transactions on neural networks
and learning systems, vol. 27, no. 12, pp. 2551–2563, 2015.
19. H. Kasai and B. Mishra, “Low-rank tensor completion: a Riemannian manifold precon-
ditioning approach,” in International Conference on Machine Learning, pp. 1012–1021,
2016.
20. C. Li, L. Guo, Y. Tao, J. Wang, L. Qi, and Z. Dou, “Yet another Schatten norm for tensor
recovery,” in International Conference on Neural Information Processing, pp. 51–60, 2016.
21. B. Romera-Paredes and M. Pontil, “A new convex relaxation for tensor completion,” in
Advances in Neural Information Processing Systems, pp. 2967–2975, 2013.
22. R. Tomioka, K. Hayashi, and H. Kashima, “On the extension of trace norm to tensors,” in
NIPS Workshop on Tensors, Kernels, and Machine Learning, p. 7, 2010.
23. R. Tomioka and T. Suzuki, “Convex tensor decomposition via structured schatten norm
regularization,” in Advances in neural information processing systems, pp. 1331–1339,
2013.
24. K. Wimalawarne, M. Sugiyama, and R. Tomioka, “Multitask learning meets tensor fac-
torization: task imputation via convex optimization,” in Advances in neural information
processing systems, pp. 2825–2833, 2014.
25. Z. Zhang, G. Ely, S. Aeron, N. Hao, and M. Kilmer, “Novel methods for multilinear data
completion and de-noising based on tensor-SVD,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 3842–3849, 2014.
26. C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan, “Tensor robust principal component
analysis: Exact recovery of corrupted low-rank tensors via convex optimization,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5249–5257,
2016.
27. C. Lu, J. Feng, Z. Lin, and S. Yan, “Exact low tubal rank tensor recovery from gaussian
measurements,” in International Conference on Artificial Intelligence, 2018.
28. M. Yin, J. Gao, S. Xie, and Y. Guo, “Multiview subspace clustering via tensorial t-product
representation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30,
no. 3, pp. 851–864, 2018.
29. W. Hu, D. Tao, W. Zhang, Y. Xie, and Y. Yang, “The twist tensor nuclear norm for video
completion,” IEEE transactions on neural networks and learning systems, vol. 28, no. 12,
pp. 2961–2973, 2016.

30. P. Zhou, C. Lu, Z. Lin, and C. Zhang, “Tensor factorization for low-rank tensor completion,”
IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1152–1163, 2018.
31. H. Kong, X. Xie, and Z. Lin, “t-schatten-p norm for low-rank tensor recovery,” IEEE
Journal of Selected Topics in Signal Processing, vol. 12, no. 6, pp. 1405–1419, 2018.
32. W.-H. Xu, X.-L. Zhao, and M. Ng, “A fast algorithm for cosine transform based tensor
singular value decomposition,” arXiv preprint arXiv:1902.03070, 2019.
33. G. Song, M. K. Ng, and X. Zhang, “Robust tensor completion using transformed tensor
svd,” arXiv preprint arXiv:1907.01113, 2019.
34. T.-X. Jiang, M. K. Ng, X.-L. Zhao, and T.-Z. Huang, “Framelet representation of tensor
nuclear norm for third-order tensor completion,” IEEE Transactions on Image Processing,
vol. 29, pp. 7233–7244, 2020.
35. M. K. Ng, R. H. Chan, and W.-C. Tang, “A fast algorithm for deblurring models with
neumann boundary conditions,” SIAM Journal on Scientific Computing, vol. 21, no. 3,
pp. 851–866, 1999.
36. J.-F. Cai, R. H. Chan, and Z. Shen, “A framelet-based image inpainting algorithm,” Applied
and Computational Harmonic Analysis, vol. 24, no. 2, pp. 131–149, 2008.
37. T.-X. Jiang, T.-Z. Huang, X.-L. Zhao, T.-Y. Ji, and L.-J. Deng, “Matrix factorization
for low-rank tensor completion using framelet prior,” Information Sciences, vol. 436,
pp. 403–417, 2018.
38. Z. Zhang and S. Aeron, “Exact tensor completion using t-SVD,” IEEE Transactions on
Signal Processing, vol. 65, no. 6, pp. 1511–1526, 2017.
39. C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan, “Tensor robust principal component
analysis with a new tensor nuclear norm,” IEEE transactions on pattern analysis and
machine intelligence, vol. 42, no. 4, pp. 925–938, 2019.
40. A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality
constraints,” SIAM journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353,
1998.
41. Z. Wen and W. Yin, “A feasible method for optimization with orthogonality constraints,”
Mathematical Programming, vol. 142, no. 1-2, pp. 397–434, 2013.
42. K. B. Petersen, M. S. Pedersen, et al., “The matrix cookbook,” Technical University of
Denmark, vol. 7, no. 15, p. 510, 2008.
43. C. Lu, J. Feng, S. Yan, and Z. Lin, “A unified alternating direction method of multipliers
by majorization minimization,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, no. 3, pp. 527–541, 2017.
44. Z. Lin, R. Liu, and H. Li, “Linearized alternating direction method with parallel splitting and
adaptive penalty for separable convex programs in machine learning,” Machine Learning,
vol. 99, no. 2, p. 287, 2015.
45. Y. Xu and W. Yin, “A block coordinate descent method for regularized multiconvex
optimization with applications to nonnegative tensor factorization and completion,” SIAM
Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758–1789, 2015.
46. Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty
for low-rank representation,” in Advances in neural information processing systems, pp. 612–
620, 2011.
47. P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds.
Princeton University Press, 2009.
48. Y. Xu, R. Hao, W. Yin, and Z. Su, “Parallel matrix factorization for low-rank tensor
completion,” Inverse Problems & Imaging, vol. 9, no. 2, pp. 601–624, 2017.
49. A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech.
rep., Citeseer, 2009.
50. S. A. Nene, S. K. Nayar, H. Murase, et al., “Columbia object image library (coil-20),” 1996.
51. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: a large video database
for human motion recognition,” in IEEE International Conference on Computer Vision,
pp. 2556–2563, 2011.
