Tensor Computation for Data Analysis
Yipeng Liu • Jiani Liu • Zhen Long • Ce Zhu
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
High dimensionality is a key feature of big data. Classical data analysis methods rely
on representation and computation in the form of vectors and matrices, where
multi-dimensional data are unfolded into matrices for processing. However, the multi-linear
structure is lost in such vectorization or matricization, which leads to sub-optimal
performance in processing. As a multidimensional array, a tensor is a generalization of
vectors and matrices, and it is a natural representation for multi-dimensional data.
Machine learning methods based on tensor computation can avoid the loss of multi-linear
data structure suffered by their classical matrix-based counterparts.
In addition, the computational complexity in tensor subspaces can be much
lower than that in the original form. Recent advances in applied mathematics
allow us to move from classical matrix-based methods to tensor-based methods
for many data-related applications, such as signal processing, machine learning,
neuroscience, communication, quantitative finance, psychometrics, chemometrics,
quantum physics, and quantum chemistry.
This book first provides basic coverage of tensor notations, preliminary
operations, and the main tensor decompositions and their properties. Based on these,
a series of tensor analysis methods is presented as multi-linear extensions of
classical matrix-based techniques. Each tensor technique is demonstrated in
practical applications.
The book contains 13 chapters and 1 appendix. Part I has 2 chapters on
preliminaries of tensor computation.
• Chapter 1 gives an introduction to basic tensor notations, graphical representa-
tion, some special operators, and their properties.
• Chapter 2 summarizes a series of tensor decompositions, including canonical
polyadic decomposition, Tucker decomposition, block term decomposition,
tensor singular value decomposition, tensor networks, hybrid tensor decompositions,
and scalable tensor decompositions.
Part II discusses technical aspects of tensor-based data analysis methods and various
applications; it has 11 chapters.
We would like to particularly thank the current and past group members in
this area. These include Lanlan Feng, Zihan Li, Zhonghao Zhang, Huyan Huang,
Yingyue Bi, Hengling Zhao, Xingyu Cao, Shenghan Wang, Yingcong Lu, Bin Chen,
Shan Wu, Longxi Chen, Sixing Zeng, Tengteng Liu, and Mingyi Zhou. We also
thank our editors Charles Glaser and Arjun Narayanan for their advice and support.
This project is supported in part by the National Natural Science Foundation of
China under grant No. 61602091.
Contents

1 Tensor Computation
1.1 Notations
1.1.1 Special Examples
1.2 Basic Matrix Computation
1.3 Tensor Graphical Representation
1.4 Tensor Unfoldings
1.4.1 Mode-n Unfolding
1.4.2 Mode-n1n2 Unfolding
1.4.3 n-Unfolding
1.4.4 l-Shifting n-Unfolding
1.5 Tensor Products
1.5.1 Tensor Inner Product
1.5.2 Mode-n Product
1.5.3 Tensor Contraction
1.5.4 t-Product
1.5.5 3-D Convolution
1.6 Summary
References
2 Tensor Decomposition
2.1 Introduction
2.2 Canonical Polyadic Decomposition
2.2.1 Tensor Rank
2.2.2 CPD Computation
2.2.3 Uniqueness
2.3 Tucker Decomposition
2.3.1 The n-Rank
2.3.2 Computation and Optimization Model
2.3.3 Uniqueness
Index
Chapter 1
Tensor Computation
1.1 Notations
Fig. 1.1 A graphical illustration for (a) scalar, (b) vector, (c) matrix, and (d) tensor
Fig. 1.2 A graphical illustration for fibers and slices of a third-order tensor A. (a) Third-order
tensor A. (b) Horizontal slice A(2, :, :). (c) Lateral slice A(:, 2, :). (d) Frontal slice A(:, :, 2). (e)
Mode-1 fiber A(:, 2, 2). (f) Mode-2 fiber A(2, :, 2). (g) Mode-3 fiber A(2, 2, :)
and the mode-1 fiber A(:, 2, 2) is the vector [8, 12, 29]T , mode-2 fiber A(2, :, 2) is
the vector [28, 12, 17]T , and mode-3 fiber A(2, 2, :) is the vector [6, 12, 22]T .
Definition 1.2 (Diagonal Tensor) A diagonal tensor has nonzero elements only
on its super-diagonal. That is, for a diagonal tensor A ∈ R^{I1×···×IN}, its
elements satisfy A(i1, . . . , iN) = 0 unless i1 = i2 = · · · = iN. Figure 1.4 provides
an example of a third-order diagonal tensor.
C = a ◦ b ∈ R^{I×J}. (1.5)
For example, the outer product of the vectors a = [2, 6, 1]^T and b = [3, 5, 8]^T yields the matrix
C = [6 10 16; 18 30 48; 3 5 8].
By further taking the outer product with another vector c = [2, 1, 3]^T, we obtain a
third-order tensor with the frontal slices
C(:, :, 1) = [12 20 32; 36 60 96; 6 10 16], C(:, :, 2) = [6 10 16; 18 30 48; 3 5 8], C(:, :, 3) = [18 30 48; 54 90 144; 9 15 24].
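For readers who want to check such small examples numerically, the outer products above can be reproduced in a few lines of NumPy. This is our own illustrative sketch, not code from the book; the variable names are ours.

```python
import numpy as np

a = np.array([2, 6, 1])
b = np.array([3, 5, 8])
c = np.array([2, 1, 3])

M = np.outer(a, b)                      # a ◦ b, a 3 x 3 matrix
T = np.einsum('i,j,k->ijk', a, b, c)    # a ◦ b ◦ c, a 3 x 3 x 3 tensor

print(M)            # matches the matrix C above
print(T[:, :, 0])   # matches C(:, :, 1), i.e., 2 * (a b^T)
```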
Definition 1.7 (Hadamard Product) For matrices A and B with the same size I ×
J, the Hadamard product is the element-wise product
C = A ⊛ B ∈ R^{I×J} (1.7)
  = [a_{1,1}b_{1,1} a_{1,2}b_{1,2} . . . a_{1,J}b_{1,J};
     a_{2,1}b_{2,1} a_{2,2}b_{2,2} . . . a_{2,J}b_{2,J};
     . . . ;
     a_{I,1}b_{I,1} a_{I,2}b_{I,2} . . . a_{I,J}b_{I,J}].
C = A ⊗ B ∈ R^{IK×JL} (1.8)
  = [a_{1,1}B a_{1,2}B . . . a_{1,J}B;
     a_{2,1}B a_{2,2}B . . . a_{2,J}B;
     . . . ;
     a_{I,1}B a_{I,2}B . . . a_{I,J}B].
For example, the Kronecker product of the matrices A = [2 6; 2 1] and B = [1 5; 3 8] is
C = [2B 6B; 2B 1B] = [2 10 6 30; 6 16 18 48; 2 10 1 5; 6 16 3 8].
Theorem 1.1 ([9]) For any matrices A, B, C, and D of compatible sizes, the Kronecker product has the
following useful properties:
(A ⊗ B) ⊗ C = A ⊗ (B ⊗ C) (1.9)
AC ⊗ BD = (A ⊗ B)(C ⊗ D) (1.10)
(A ⊗ C)^T = A^T ⊗ C^T (1.11)
(A ⊗ C)^† = A^† ⊗ C^† (1.12)
vec(ABC) = (C^T ⊗ A) vec(B), (1.13)
6 1 Tensor Computation
where (1.13) is commonly used to deal with linear least squares problems of the
form min_X ‖Y − AXB‖_F^2.
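As a quick sanity check of (1.13), the identity can be verified numerically. The sketch below is our own illustration (not from the book); it uses column-major (MATLAB-style) vectorization and arbitrary random matrices.

```python
import numpy as np

I, J, K, L = 3, 4, 5, 2
A = np.random.randn(I, J)
B = np.random.randn(J, K)
C = np.random.randn(K, L)

vec = lambda M: M.reshape(-1, order='F')   # stack the columns (column-major vec)

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))               # True: vec(ABC) = (C^T ⊗ A) vec(B)
```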
Definition 1.10 (Khatri-Rao Product) The Khatri-Rao product of matrices A ∈
R^{I×K} and B ∈ R^{J×K} is denoted as
A ⊙ B = [a_1 ⊗ b_1, a_2 ⊗ b_2, . . . , a_K ⊗ b_K] ∈ R^{IJ×K}, (1.14)
where a_k and b_k are the k-th columns of A and B, respectively. For example, for the matrices A and B used in the
Kronecker product example above,
a_1 = [2; 2], a_2 = [6; 1], b_1 = [1; 3], b_2 = [5; 8].
(A ⊙ B) ⊙ C = A ⊙ (B ⊙ C) (1.15)
(A ⊙ B)^T (A ⊙ B) = A^T A ⊛ B^T B (1.16)
(A ⊗ B)(C ⊙ D) = AC ⊙ BD (1.17)
and
holds when D is a diagonal matrix with D = diag(d), i.e., D(i, i) = d(i) for i =
1, . . . , I if d ∈ RI .
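Because it is just a column-wise Kronecker product, the Khatri-Rao product is easy to implement. The sketch below is our own minimal NumPy version (recent SciPy also ships scipy.linalg.khatri_rao):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x K) and B (J x K), giving an IJ x K matrix."""
    I, K = A.shape
    J, K2 = B.shape
    assert K == K2, "A and B must have the same number of columns"
    # Column k is kron(A[:, k], B[:, k]).
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, K)

A = np.array([[2, 6], [2, 1]])
B = np.array([[1, 5], [3, 8]])
print(khatri_rao(A, B))   # columns: kron([2, 2], [1, 3]) and kron([6, 1], [5, 8])
```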
Definition 1.11 (Convolution) The convolution of matrices A ∈ R^{I1×I2} and B ∈
R^{J1×J2} can be defined by
C(k1, k2) = Σ_{j1,j2} B(j1, j2) A(k1 − j1 + 1, k2 − j2 + 1) ∈ R^{K1×K2},
where kn = 1, . . . , Kn with Kn = In + Jn − 1, n = 1, 2.
Definition 1.12 (Mode-N Circular Convolution) For vectors a ∈ R^I and b ∈ R^J,
their mode-N circular convolution is denoted as
c = a ⊛_N b ∈ R^N, (1.21)
where n = 1, . . . , N.
Definition 1.13 (ℓp Norm) For a matrix A ∈ R^{I×J}, its ℓp norm is denoted as
‖A‖_p = ( Σ_{i,j} |a_{i,j}|^p )^{1/p}, (1.23)
which boils down to the classical ℓ1 norm and ℓ0 norm when p = 1 and p = 0,
respectively. Compared with the ℓ1 norm, the ℓp norm with 0 < p < 1 is shown to be a better
approximation of the ℓ0 norm.
Definition 1.14 (Off-Diagonal ℓ1 Norm) The off-diagonal ℓ1 norm of a matrix A ∈
R^{I×J} is defined as the sum of the absolute values of its off-diagonal entries, i.e., Σ_{i≠j} |a_{i,j}|.
The nuclear norm of A is the sum of its singular values,
‖A‖_* = Σ_{k=1}^{min(I,J)} σ_k, (1.26)
where σ_k denotes the k-th singular value of A.
Fig. 1.5 The graphical representation of tensors. (a) Scalar. (b) Vector. (c) Matrix. (d) Tensor
1.4 Tensor Unfoldings
1.4.1 Mode-n Unfolding
Definition 1.17 (Mode-n Unfolding) For a tensor A ∈ R^{I1×I2×···×IN}, its mode-n
unfolding matrix is expressed as A(n) or A[n]. The mode-n unfolding operator
arranges the n-th mode of A along the rows and merges the remaining modes into the
columns of the unfolding matrix, according to a chosen ordering of the remaining indices.
It should be noted that the employed ordering must be consistent throughout the whole
algorithm. In this monograph, unless otherwise stated, we will use the little-endian
notation.
The only difference between A(n) and A[n] is the order of the remaining dimensions
in the column. For better understanding, we take the tensor A ∈ R3×3×3 used before
as an example. The mode-n unfolding matrices are as follows:
A(1) = A[1] = [1 14 15 15 8 7 9 5 13; 23 6 20 28 12 17 7 22 26; 24 18 8 21 29 23 21 1 19] (1.31)
A(2) = [1 23 24 15 28 21 9 7 21; 14 6 18 8 12 29 5 22 1; 15 20 8 7 17 23 13 26 19] (1.32)
A[2] = [1 15 9 23 28 7 24 21 21; 14 8 5 6 12 22 18 29 1; 15 7 13 20 17 26 8 23 19] (1.33)
A(3) = A[3] = [1 23 24 14 6 18 15 20 8; 15 28 21 8 12 29 7 17 23; 9 7 21 5 22 1 13 26 19]. (1.34)
As shown in (1.31) and (1.34), there is no difference between A(n) and A[n] when
n = 1 or n = N. Otherwise, the order of the remaining indices in the columns of the mode-n
unfolding matrix does matter, as shown in (1.32) and (1.33). A clear disadvantage of this unfolding
operator is that the resulting mode-n unfolding matrix is
extremely unbalanced, and simply processing the mode-n unfolding matrices to extract the
multiway structure information can bring a series of numerical problems.
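A mode-n unfolding is only a permute-and-reshape away. The sketch below is our own NumPy version; it follows the column-major (MATLAB-style) ordering of the remaining modes, which reproduces the little-endian matrices above.

```python
import numpy as np

def unfold(A, n):
    """Mode-n unfolding: mode n along the rows, remaining modes merged column-major."""
    return np.reshape(np.moveaxis(A, n, 0), (A.shape[n], -1), order='F')

# Rebuild the 3 x 3 x 3 example tensor from its frontal slices (read off (1.31)).
A = np.stack([np.array([[1, 14, 15], [23, 6, 20], [24, 18, 8]]),
              np.array([[15, 8, 7], [28, 12, 17], [21, 29, 23]]),
              np.array([[9, 5, 13], [7, 22, 26], [21, 1, 19]])], axis=2)

print(unfold(A, 0))   # equals A(1) in (1.31)
print(unfold(A, 1))   # equals A(2) in (1.32)
```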
1.4.2 Mode-n1n2 Unfolding
For the mode-n1n2 unfolding, the column index is given by
j = 1 + Σ_{s=1, s≠n1, s≠n2}^{N} (i_s − 1) J_s with J_s = Π_{m=1, m≠n1, m≠n2}^{s−1} I_m. (1.36)
1.4.3 n-Unfolding
where j1 and j2 are the multi-indices formed by grouping (i1, . . . , in) and (in+1, . . . , iN), respectively.
For example, the n-unfolding matrices of the third-order tensor A used before
can be expressed as
A⟨1⟩ = A[1] = [1 14 15 15 8 7 9 5 13; 23 6 20 28 12 17 7 22 26; 24 18 8 21 29 23 21 1 19] (1.38)
A⟨2⟩ = A[3]^T = [1 15 9; 23 28 7; 24 21 21; 14 8 5; 6 12 22; 18 29 1; 15 7 13; 20 17 26; 8 23 19] (1.39)
A⟨3⟩ = vec(A) = vec(A[1]). (1.40)
1.4.4 l-Shifting n-Unfolding
Definition 1.20 (l-Shifting) For a tensor A ∈ R^{I1×···×IN}, left shifting it by l yields
a tensor \overleftarrow{A}_l ∈ R^{Il×···×IN×I1×···×Il−1}. It can be implemented by the command
\overleftarrow{A}_l = permute(A, [l, . . . , N, 1, . . . , l − 1]) in MATLAB.
Definition 1.21 (l-Shifting n-Unfolding) [6, 12] The l-shifting n-unfolding operator
first shifts the original tensor A ∈ R^{I1×···×IN} by l and then unfolds the shifted tensor along the n-th mode.
The resulting matrix is denoted by A⟨l,n⟩ and has size
(I_l · · · I_{mod(l+n,N)}) × (I_{mod(l+n+1,N)} · · · I_{l−1}), with its elements
defined by
Fig. 1.7 The graphical representation of tensor unfolding methods. (a) Nth-order tensor A. (b)
Mode-n unfolding matrix A(n). (c) Mode-n unfolding matrix A[n]. (d) n-Unfolding matrix A⟨n⟩.
(e) l-Shifting Nth-order tensor \overleftarrow{A}_l. (f) l-Shifting n-unfolding matrix A⟨l,n⟩
The l-shifting n-unfolding operator with an appropriate selection of l and n can generate more balanced
matrices while retaining the structural information within the multidimensional
data. As shown in the examples for an Nth-order tensor A ∈ R^{I×···×I} in Fig. 1.7,
the mode-n unfolding matrices are extremely unbalanced, and the n-unfolding matrix is
only balanced if n = N/2, while the l-shifting n-unfolding matrix can lead to an
almost square matrix in all modes for suitable l and n.
1.5 Tensor Products
1.5.1 Tensor Inner Product
Definition 1.22 (Tensor Inner Product) The inner product of two tensors A and B of
the same size I1 × · · · × IN is formulated as
⟨A, B⟩ = Σ_{i1,...,iN} a_{i1,...,iN} b_{i1,...,iN}.
1.5.2 Mode-n Product
Definition 1.23 (Mode-n Product) The mode-n product of a tensor A ∈ R^{I1×···×IN}
and a matrix B ∈ R^{J×In} is defined by
C = A ×_n B ∈ R^{I1×···×In−1×J×In+1×···×IN}, with c_{i1,...,in−1,j,in+1,...,iN} = Σ_{in=1}^{In} a_{i1,...,iN} b_{j,in}.
Based on the mode-n unfolding definition, the mode-n product can be expressed
in matrix form as
C(n) = B A(n).
Fig. 1.8 The graphical representation for (a) tensor inner product and (b) mode-n product
Figure 1.8 provides a graphical illustration for the tensor inner product and the
commonly used mode-n product in dimensionality reduction. Taking the example
defined in (1.1), the mode-2 multiplication with the matrix B = [2 0 5; 4 7 1] ∈ R^{2×3} yields
a tensor C ∈ R^{3×2×3} with the frontal slices
C(:, :, 1) = [77 117; 146 154; 88 230] (1.46)
C(:, :, 2) = [65 123; 141 213; 157 310] (1.47)
C(:, :, 3) = [83 84; 144 208; 137 110]. (1.48)
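This small example is easy to verify numerically. The following sketch (our own, reusing the frontal slices of A recovered from (1.31)) computes the mode-2 product as an einsum contraction:

```python
import numpy as np

A = np.stack([np.array([[1, 14, 15], [23, 6, 20], [24, 18, 8]]),
              np.array([[15, 8, 7], [28, 12, 17], [21, 29, 23]]),
              np.array([[9, 5, 13], [7, 22, 26], [21, 1, 19]])], axis=2)
B = np.array([[2, 0, 5], [4, 7, 1]])

# Mode-2 product: contract the second mode of A with the columns of B.
C = np.einsum('ijk,lj->ilk', A, B)
print(C[:, :, 0])   # [[77, 117], [146, 154], [88, 230]], as in (1.46)
```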
1.5.3 Tensor Contraction
Definition 1.24 (Tensor Contraction) For tensors A ∈ R^{I1×···×IN×J1×···×JL} and
B ∈ R^{J1×···×JL×K1×···×KM}, contracting their common indices {J1, . . . , JL}
results in a tensor C = ⟨A, B⟩_L ∈ R^{I1×···×IN×K1×···×KM} whose entries are calculated
by
c_{i1,...,iN,k1,...,kM} = Σ_{j1,...,jL} a_{i1,...,iN,j1,...,jL} b_{j1,...,jL,k1,...,kM}. (1.49)
1.5.4 t-Product
Definition 1.25 (t-Product) [7] The t-product of tensors A ∈ R^{I1×I2×I3} and B ∈
R^{I2×I4×I3} is the tensor C ∈ R^{I1×I4×I3} defined tube-wise as
C(i1, i4, :) = Σ_{i2=1}^{I2} A(i1, i2, :) ⊛_{I3} B(i2, i4, :), (1.51)
where ⊛_{I3} denotes the mode-I3 circular convolution of Definition 1.12.
where Â is the fast Fourier transform (FFT) of A along the third mode and Â^(i3) is
the i3-th frontal slice of Â.
Theorem 1.3 For any tensors A ∈ R^{I1×I2×I3} and B ∈ R^{I2×I4×I3}, their t-product
can be efficiently calculated through simple matrix products in the Fourier domain as
follows:
C = A ∗ B ⟺ Ĉ^(i3) = Â^(i3) B̂^(i3), i3 = 1, . . . , I3. (1.53)
The identity tensor I is the tensor whose first frontal slice is the identity matrix and whose remaining frontal slices are zero, i.e.,
I(:, :, 1) = I; I(:, :, 2 : L) = 0, where L is the size of the third mode. A tensor Q is orthogonal if
Q^T ∗ Q = Q ∗ Q^T = I. (1.55)
1.5.5 3-D Convolution
Definition 1.30 (3-D Convolution) [5] Similar to the matrix convolution, the 3-D
convolution of tensors A ∈ R^{I1×I2×I3} and B ∈ R^{J1×J2×J3} can be defined by
C(k1, k2, k3) = Σ_{j1,j2,j3} B(j1, j2, j3) A(k1 − j1 + 1, k2 − j2 + 1, k3 − j3 + 1), (1.57)
where kn = 1, . . . , Kn with Kn = In + Jn − 1, n = 1, 2, 3.
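For completeness, this "full" 3-D convolution matches what scipy.signal.convolve computes for three-dimensional arrays; the sketch below is our own check (the 1-based indices in (1.57) become 0-based in code, with entries outside A treated as zero):

```python
import numpy as np
from scipy.signal import convolve

A = np.random.rand(4, 5, 3)
B = np.random.rand(2, 2, 2)

C = convolve(A, B, mode='full')        # shape (5, 6, 4): K_n = I_n + J_n - 1
print(C.shape)

# Spot-check one entry against the sum in (1.57).
k = (2, 3, 1)
val = sum(B[j1, j2, j3] * A[k[0]-j1, k[1]-j2, k[2]-j3]
          for j1 in range(2) for j2 in range(2) for j3 in range(2)
          if 0 <= k[0]-j1 < 4 and 0 <= k[1]-j2 < 5 and 0 <= k[2]-j3 < 3)
print(np.isclose(C[k], val))           # True
```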
1.6 Summary
References
1. Abdelfattah, A., Baboulin, M., Dobrev, V., Dongarra, J., Earl, C., Falcou, J., Haidar, A., Karlin,
I., Kolev, T., Masliah, I., et al.: High-performance tensor contractions for GPUs. Procedia
Comput. Sci. 80, 108–118 (2016)
2. Bengua, J.A., Tuan, H.D., Phien, H.N., Do, M.N.: Concatenated image completion via tensor
augmentation and completion. In: 2016 10th International Conference on Signal Processing
and Communication Systems (ICSPCS), pp. 1–7. IEEE, Piscataway (2016)
3. Cheng, D., Qi, H., Xue, A.: A survey on semi-tensor product of matrices. J. Syst. Sci. Complex.
20(2), 304–322 (2007)
4. Cichocki, A., Mandic, D., De Lathauwer, L., Zhou, G., Zhao, Q., Caiafa, C., Phan, H.A.: Tensor
decompositions for signal processing applications: from two-way to multiway component
analysis. IEEE Signal Process. Mag. 32(2), 145–163 (2015)
5. Cichocki, A., Lee, N., Oseledets, I., Phan, A.H., Zhao, Q., Mandic, D.P., et al.: Tensor
networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor
decompositions. Found. Trends Mach. Learn. 9(4–5), 249–429 (2016)
6. Huang, H., Liu, Y., Liu, J., Zhu, C.: Provable tensor ring completion. Signal Process. 171,
107486 (2020)
7. Kilmer, M.E., Braman, K., Hao, N., Hoover, R.C.: Third-order tensors as operators on matrices:
a theoretical and computational framework with applications in imaging. SIAM J. Matrix Anal.
Appl. 34(1), 148–172 (2013)
8. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500
(2009)
9. Sidiropoulos, N.D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E.E., Faloutsos, C.:
Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process.
65(13), 3551–3582 (2017)
10. Solomonik, E., Demmel, J.: Fast bilinear algorithms for symmetric tensor contractions.
Comput. Methods Appl. Math. 21(1), 211–231 (2020)
11. Tucker, L.R.: Implications of factor analysis of three-way matrices for measurement of change.
Probl. Meas. Change 15, 122–137 (1963)
12. Yu, J., Zhou, G., Zhao, Q., Xie, K.: An effective tensor completion method based on multi-
linear tensor ring decomposition. In: 2018 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC), pp. 1344–1349. IEEE, Piscat-
away (2018)
13. Zhang, Z., Aeron, S.: Exact tensor completion using t-SVD. IEEE Trans. Signal Process. 65(6),
1511–1526 (2016)
Chapter 2
Tensor Decomposition
2.1 Introduction
Tensor decomposition can break a large tensor into many small factors,
which reduces the storage and computational complexity during data processing.
Tensor decompositions originated with Hitchcock in 1927. The idea of a multi-way
model was first proposed by Cattell in 1944 [11, 12]. These concepts received
little further attention until the 1960s, when Tucker published three works on
tensor decomposition [71–73]. Carroll and Chang [10] and Harshman [31] proposed
canonical decomposition (CANDECOMP) and parallel factor decomposition
(PARAFAC), respectively, in 1970. These works first appeared in the psychometrics
literature. In addition, other tensor decompositions such as the block term
decomposition [18, 19, 21] and t-SVD [43, 44] have been proposed to improve and enrich
the family of tensor decompositions. With the development of sensors, much of the data we obtain is
large scale, which drives the development of tensor networks [6, 15, 27, 29, 46, 53,
55, 56, 58, 60, 76, 81] and scalable tensor decompositions [7, 23, 35, 41, 65].
In the last two decades, tensor decompositions have attracted much interest in
fields such as numerical linear algebra [9, 49, 61], signal processing [16, 50, 64, 67],
data mining [2, 45, 54, 77], graph analysis [25, 37, 69], neuroscience [30, 36, 79],
and computer vision [17, 26, 66].
2.2 Canonical Polyadic Decomposition
The first idea of canonical polyadic decomposition comes from Hitchcock [33, 34] in
1927, who expressed a tensor as a sum of a finite number of rank-1 tensors. After
that, Cattell [11, 12] proposed ideas for parallel proportional analysis and multiple
axes for analysis. In 1970, the form of CANDECOMP (canonical decomposition)
[10] and PARAFAC (parallel factors) [31] was proposed by Carroll, Chang, and
Harshman in the psychometrics community.
Definition 2.1 (Canonical Polyadic (CP) Decomposition) Given an Nth-order
tensor X ∈ R^{I1×···×IN}, the CP decomposition is defined by
X = Σ_{r=1}^{R} u_r^(1) ◦ u_r^(2) ◦ · · · ◦ u_r^(N) = [[U^(1), U^(2), . . . , U^(N)]], (2.1)
or element-wise as
x_{i1,i2,...,iN} = Σ_{r=1}^{R} u^(1)_{i1,r} u^(2)_{i2,r} · · · u^(N)_{iN,r}, (2.2)
X(1) = A(C ⊙ B)^T, X(2) = B(A ⊙ C)^T, X(3) = C(B ⊙ A)^T. (2.4)
2.2.1 Tensor Rank
The rank of a tensor X ∈ R^{I1×···×IN}, namely rank(X), is defined as the smallest
number of rank-1 tensors in the CP decomposition
X = Σ_{r=1}^{R} U_r, (2.5)
where U_r is a rank-1 tensor that can be represented as U_r = u_r^(1) ◦ u_r^(2) ◦ · · · ◦ u_r^(N),
and R is the smallest value for which Eq. (2.5) holds. The definition of tensor rank is analogous
to the definition of matrix rank, but the properties of matrix and tensor ranks are
quite different. One difference is that the rank of a real-valued tensor may actually
be different over R and C.
For example, let X ∈ R^{2×2×2} be a tensor whose frontal slices are
X(:, :, 1) = [1 0; 0 1], X(:, :, 2) = [0 1; −1 0].
Over the real field, X has rank 3, with a rank-3 decomposition X = [[A, B, C]] given by
A = [1 0 1; 0 1 −1], B = [1 0 1; 0 1 1], C = [1 1 0; −1 1 1].
Over the complex field, however, it has rank 2:
X = Σ_{r=1}^{2} a_r ◦ b_r ◦ c_r = [[A, B, C]], (2.7)
where
A = (1/√2) [1 1; −j j], B = (1/√2) [1 1; j −j], C = [1 1; −j j].
In general, it is difficult to determine the tensor rank. In practice, we usually compute the CP decomposition
with the tensor rank given in advance.
2.2.2 CPD Computation
Letting X ∈ R^{I1×I2×I3}, the goal of CP decomposition is to find the optimal factor
matrices that approximate X with a predefined tensor rank:
min_{A,B,C} ‖X − [[A, B, C]]‖_F^2. (2.8)
Directly solving this problem is difficult, and the alternating least squares (ALS)
method is a popular choice. The ALS approach fixes B and C to solve for A, then
fixes A and C to solve for B, then fixes A and B to solve for C, and continues to
repeat the entire procedure until a certain convergence criterion is satisfied. Under
this framework, the problem (2.8) is split into three least squares subproblems based on the unfoldings in (2.4), namely
A ← arg min_A ‖X(1) − A(C ⊙ B)^T‖_F^2, B ← arg min_B ‖X(2) − B(A ⊙ C)^T‖_F^2, C ← arg min_C ‖X(3) − C(B ⊙ A)^T‖_F^2.
The ALS method is simple to understand and easy to implement, but it can take
many iterations to converge. Moreover, it is not guaranteed to converge to a global
minimum, only to a solution where the objective function of (2.8) ceases to decrease. To
address this issue, Li et al. [49] proposed a regularized alternating least squares method
for CP decomposition (CP-RALS). In this framework, the subproblems in the (k + 1)-th
iteration can be written as follows:
A^{k+1} = arg min_A ‖X(1) − A(C^k ⊙ B^k)^T‖_F^2 + τ‖A − A^k‖_F^2, (2.9)
B^{k+1} = arg min_B ‖X(2) − B(A^{k+1} ⊙ C^k)^T‖_F^2 + τ‖B − B^k‖_F^2, (2.10)
C^{k+1} = arg min_C ‖X(3) − C(B^{k+1} ⊙ A^{k+1})^T‖_F^2 + τ‖C − C^k‖_F^2, (2.11)
Compared with ALS, RALS has two advantages: the solutions are more stable, and
abrupt changes in the iterates between successive updates are prevented.
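To make the ALS procedure concrete, here is a minimal NumPy sketch of plain CP-ALS. It is our own illustration rather than the book's algorithm; it uses column-major unfoldings and the matching Khatri-Rao ordering, which is one consistent convention and may order the factors differently from (2.4).

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding with column-major ordering of the remaining modes."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')

def khatri_rao(mats):
    """Column-wise Kronecker product of a list of matrices with R columns each."""
    R = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = (out[:, None, :] * M[None, :, :]).reshape(-1, R)
    return out

def cp_als(X, R, n_iter=50, seed=0):
    """Plain CP-ALS; returns factor matrices U[0], ..., U[N-1] with R columns each."""
    rng = np.random.default_rng(seed)
    N = X.ndim
    U = [rng.standard_normal((X.shape[n], R)) for n in range(N)]
    for _ in range(n_iter):
        for n in range(N):
            others = [U[m] for m in range(N) if m != n]
            KR = khatri_rao(others[::-1])          # reversed order to match unfold()
            G = np.ones((R, R))
            for M in others:
                G *= M.T @ M                       # Hadamard product of Gram matrices
            U[n] = unfold(X, n) @ KR @ np.linalg.pinv(G)
    return U
```

The Gram-matrix trick G = ⊛_{m≠n} U^(m)T U^(m) is exactly property (1.16) of the Khatri-Rao product, which avoids forming KR^T KR explicitly.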
2.2.3 Uniqueness
X = Σ_{r=1}^{R} (α_r a_r) ◦ (β_r b_r) ◦ (γ_r c_r) = Σ_{r=1}^{R} a_r ◦ b_r ◦ c_r. (2.14)
where R is the CP rank. Ten Berge and Sidiropoulos [70] proved that Kruskal's
condition is sufficient and necessary for R = 2 and R = 3,
but not for R > 3. Sidiropoulos and Bro [63] extended Kruskal's result to
Nth-order tensors. Let X ∈ R^{I1×I2×···×IN} be an Nth-order tensor with rank R and
suppose that its CP decomposition is
X = Σ_{r=1}^{R} u_r^(1) ◦ u_r^(2) ◦ · · · ◦ u_r^(N); (2.16)
then a sufficient condition for uniqueness is
Σ_{n=1}^{N} k_{U^(n)} ≥ 2R + (N − 1), (2.17)
where k_{U^(n)} is the Kruskal rank of the factor matrix U^(n) ∈ R^{In×R}.
According to mode-n matricization of the third-order tensor in (2.4), Liu and
Sidiropoulos [51] showed that a necessary condition for uniqueness of the CP
decomposition is
More generally, they showed that for the N-way case, a necessary condition for
uniqueness of the CP decomposition in (2.1) is
Similarly, a fourth-order tensor X ∈ R^{I1×I2×I3×I4} of rank R has a CP decomposition
that is generically unique if
R ≤ I4 and R(R − 1) ≤ I1 I2 I3 (3 I1 I2 I3 − I1 I2 − I2 I3 − I1 I3 − I1 − I2 − I3 + 3) / 4. (2.23)
2.3 Tucker Decomposition
Given an Nth-order tensor X ∈ R^{I1×···×IN}, the Tucker decomposition factorizes it into
a core tensor multiplied by a factor matrix along each mode,
X = G ×_1 U^(1) ×_2 U^(2) · · · ×_N U^(N) = [[G; U^(1), U^(2), . . . , U^(N)]], (2.24)
or element-wise as
x_{i1,i2,...,iN} = Σ_{r1=1}^{R1} Σ_{r2=1}^{R2} · · · Σ_{rN=1}^{RN} g_{r1,r2,...,rN} u^(1)_{i1,r1} u^(2)_{i2,r2} · · · u^(N)_{iN,rN}, (2.25)
where U^(n) ∈ R^{In×Rn}, n = 1, . . . , N, are factor matrices with Rn ≤ In and U^(n)T U^(n) =
I_{Rn}, and the tensor G ∈ R^{R1×R2×···×RN} is called the core tensor. Note that the core
tensor G is a full tensor and its entries represent the level of interaction between the
different components of U^(n), n = 1, . . . , N.
For example, letting X ∈ RI1 ×I2 ×I3 , we can get A ∈ RI1 ×R1 , B ∈ RI2 ×R2 ,
C ∈ RI3 ×R3 and a core tensor G ∈ RR1 ×R2 ×R3 by Tucker decomposition:
X = G ×_1 A ×_2 B ×_3 C = Σ_{r1=1}^{R1} Σ_{r2=1}^{R2} Σ_{r3=1}^{R3} g_{r1,r2,r3} a_{r1} ◦ b_{r2} ◦ c_{r3} = [[G; A, B, C]]. (2.26)
The graphical representation can be found in Fig. 2.2.
2.3.1 The n-Rank
The mode-n rank of X is defined as rank_n(X) = dim(span_R{mode-n fibers of X}),
where span_R{·} denotes the linear space spanned by the given vectors over R and dim(·) is its dimension.
For a third-order tensor, x_{:,i2,i3}, x_{i1,:,i3}, and x_{i1,i2,:} are the mode-1, mode-2, and mode-3 fibers of
X, respectively. If Rn = rank_n(X) for n = 1, 2, 3, we say that X is a rank-(R1, R2, R3) tensor;
more generally, if Rn = rank_n(X), n = 1, . . . , N, we say that X is a rank-(R1, R2, . . . , RN) tensor.
Note that the row rank and the column rank of a matrix are equal, whereas for a higher-order tensor
the i-rank and j-rank are in general not equal when i ≠ j.
2.3.2 Computation and Optimization Model
The Tucker decomposition can be computed by solving the optimization problem
min_{G, U^(n)} ‖X − G ×_1 U^(1) ×_2 · · · ×_N U^(N)‖_F^2
s. t. G ∈ R^{R1×R2×···×RN}, U^(n) ∈ R^{In×Rn}, U^(n)T U^(n) = I_{Rn×Rn}. (2.28)
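One standard way to obtain a (generally suboptimal but good) solution of this model is the truncated higher-order SVD, which takes the leading left singular vectors of each mode-n unfolding and then forms the core by mode-n products. The sketch below is our own NumPy illustration of that idea (HOSVD in the sense of [20]), not the book's algorithm:

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')

def hosvd(X, ranks):
    """Truncated HOSVD: U[n] = leading left singular vectors of the mode-n
    unfolding; core G = X x_1 U1^T x_2 ... x_N UN^T."""
    U = []
    for n, r in enumerate(ranks):
        Un, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        U.append(Un[:, :r])
    G = X
    for n, Un in enumerate(U):
        G = np.moveaxis(np.tensordot(Un.T, G, axes=([1], [n])), 0, n)  # mode-n product
    return G, U

X = np.random.rand(6, 7, 8)
G, U = hosvd(X, (3, 3, 3))
Xhat = G
for n, Un in enumerate(U):
    Xhat = np.moveaxis(np.tensordot(Un, Xhat, axes=([1], [n])), 0, n)
print(np.linalg.norm(X - Xhat) / np.linalg.norm(X))   # relative approximation error
```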
2.3.3 Uniqueness
Tucker decompositions are not unique. Here we consider the Tucker decomposition
of a three-way tensor. Letting W ∈ RR1 ×R1 , U ∈ RR2 ×R2 , V ∈ RR3 ×R3 be
nonsingular matrices, we have
We can modify the core tensor G without affecting the fit as long as we apply
the inverse modification to the factor matrices. This freedom opens the door for
choosing transformations that simplify the core structure in some way so that
most of the elements of G are zero, thereby eliminating interactions between
corresponding components and improving uniqueness.
2.4 Block Term Decomposition
The block term decomposition (BTD) writes an Nth-order tensor X ∈ R^{I1×···×IN} as a sum of R terms of low multilinear rank:
X = Σ_{r=1}^{R} G_r ×_1 U_r^(1) ×_2 U_r^(2) · · · ×_N U_r^(N), (2.33)
or element-wise as
x_{i1,...,iN} = Σ_{r=1}^{R} Σ_{j1,...,jN} G_r(j1, . . . , jN) U_r^(1)(i1, j1) U_r^(2)(i2, j2) · · · U_r^(N)(iN, jN), (2.34)
in which U_r^(n) ∈ R^{In×Jn} represents the n-th factor of the r-th term and G_r ∈ R^{J1×···×JN} is the r-th core tensor.
In particular, for a third-order tensor X ∈ R^{I×J×K}, the BTD in rank-(Lr, Mr, Nr) terms reads
X = Σ_{r=1}^{R} S_r ×_1 U_r ×_2 V_r ×_3 W_r, (2.35)
in which S_r ∈ R^{Lr×Mr×Nr} has full multilinear rank (Lr, Mr, Nr), and U_r ∈ R^{I×Lr} (with I ≥
Lr), V_r ∈ R^{J×Mr} (with J ≥ Mr), and W_r ∈ R^{K×Nr} (with K ≥ Nr) have full
column rank, 1 ≤ r ≤ R. For simplicity, we assume Lr = L, Mr = M, Nr = N for
r = 1, . . . , R hereafter.
Under this decomposition, the matrix representations of X can be written compactly using the
partition-wise Kronecker product, defined for partitioned matrices A = (A_1, . . . , A_R) and B = (B_1, . . . , B_R) as
A ⊙_b B = (A_1 ⊗ B_1, . . . , A_R ⊗ B_R). (2.39)
In particular, the BTD in rank-(Lr, Lr, 1) terms can be written as
X = Σ_{r=1}^{R} (A_r B_r^T) ◦ c_r, (2.43)
2.4.2 Uniqueness
X = Σ_{r=1}^{R} S_r ×_1 U_r ×_2 V_r ×_3 W_r, (2.55)
or
J ≥ LR and min(I/L, R) + min(K, R) ≥ R + 2;
4. IJ ≥ L^2 R and min(I/L, R) + min(J/L, R) + min(K, R) ≥ R + 2.
2.5 Tensor Singular Value Decomposition
Kilmer et al. [43, 44] first proposed the tensor singular value decomposition (t-SVD)
to build approximations to a given tensor. Different from the CP and Tucker
decompositions, the representation of a tensor in the t-SVD framework is the
t-product of three tensors. Before introducing the t-SVD, we give several important
definitions.
Definition 2.8 (t-Product) For third-order tensors A ∈ R^{I1×I2×I3} and B ∈
R^{I2×J×I3}, the t-product A ∗ B is a tensor of size I1 × J × I3:
A ∗ B = fold(circ(A) MatVec(B)), (2.56)
where
circ(A) = [A^(1) A^(I3) · · · A^(2); A^(2) A^(1) · · · A^(3); . . . ; A^(I3) A^(I3−1) · · · A^(1)],
and
MatVec(B) = [B^(1); B^(2); . . . ; B^(I3)],
where A(i3 ) and B (i3 ) , i3 = 1, . . . , I3 are the frontal slices of A and B, respectively.
It is well known that block circulant matrices can be block diagonalized by the
Fourier transform, which yields
Ā = (F ⊗ I_{I1}) circ(A) (F^* ⊗ I_{I2}) = [Â^(1) 0 · · · 0; 0 Â^(2) · · · 0; . . . ; 0 · · · 0 Â^(I3)], (2.57)
where F ∈ C^{I3×I3} is the discrete Fourier transform (DFT) matrix, F^* is the
conjugate transpose of F, I_{I1} ∈ R^{I1×I1} and I_{I2} ∈ R^{I2×I2} are identity matrices, and Â is the fast Fourier
transform (FFT) of A along the third mode.
Any third-order tensor A ∈ R^{I1×I2×I3} admits a t-SVD
A = U ∗ S ∗ V^T,
where U ∈ R^{I1×I1×I3} and V ∈ R^{I2×I2×I3} are orthogonal tensors, i.e.,
U^T ∗ U = V^T ∗ V = I, and S ∈ R^{I1×I2×I3} is f-diagonal (each frontal slice is diagonal).
In the Fourier domain, Â^(i3) = Û^(i3) Ŝ^(i3) V̂^(i3)T, i3 = 1, . . . , I3.
A graphical illustration can be seen in Fig. 2.6, and the computation of the t-SVD is
summarized in Algorithm 7. Recently, Lu et al. [52] proposed a more efficient way to
compute the t-SVD, shown in Algorithm 8.
Definition 2.10 (Tensor Tubal Rank) The tensor tubal rank, denoted by rank_t(A),
is defined as the number of nonzero singular tubes of S, where S is from the t-SVD
A = U ∗ S ∗ V^T, i.e.,
rank_t(A) = #{i : S(i, i, :) ≠ 0}.
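Theorem 1.3 makes the t-product cheap to compute: transform along the third mode, multiply the frontal slices, and transform back. Below is a minimal NumPy sketch of this (our own, not the book's Algorithm 7 or 8):

```python
import numpy as np

def t_product(A, B):
    """t-product of A (I1 x I2 x I3) and B (I2 x I4 x I3) via the FFT along mode 3."""
    Ah = np.fft.fft(A, axis=2)
    Bh = np.fft.fft(B, axis=2)
    Ch = np.einsum('ijk,jlk->ilk', Ah, Bh)   # slice-wise matrix products in Fourier domain
    return np.real(np.fft.ifft(Ch, axis=2))

A = np.random.rand(3, 4, 5)
B = np.random.rand(4, 2, 5)
C = t_product(A, B)
print(C.shape)   # (3, 2, 5)
```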
2.6 Tensor Networks
Owing to increasingly affordable recording devices and large-scale data volumes,
multidimensional data are becoming ubiquitous across the science and
engineering disciplines. Such massive datasets may have billions of entries and be of
very high order, which leads to the curse of dimensionality as the order increases.
This has spurred a renewed interest in the development of tensor-based algorithms
that are suitable for very high-order datasets. Tensor networks (TNs) decompose
higher-order tensors into sparsely interconnected low-order core tensors and
provide a natural sparse and distributed representation for big data.
In this section, we mainly introduce two important networks: one is the hierarchical
Tucker (HT) network, and the other is the tensor train (TT) network.
Hierarchical Tucker (HT) decomposition was first introduced in [29] and further
developed in [6, 27, 46, 53, 58]. It decomposes a higher-order (order > 3) tensor
into several lower-order (third-order or less) tensors by splitting the modes of the
tensor in a hierarchical way, leading to a binary tree containing a subset of modes at
each branch.
In order to define the HT decomposition format, we first introduce several
important definitions.
Definition 2.11 (Dimension Tree) A dimension tree T of order D is a binary tree
with root D, D := {1, . . . , D}, such that each node Cq ∈ T, q = 1, . . . , Q, has the
following properties:
1. A node with only one entry is called a leaf, i.e., Cp = {d}. The set of all leaves is
denoted by F(T), and the set of interior (non-leaf) nodes by E(T) = T \ F(T).
For a node Cq with complement C̄q = D \ Cq, the corresponding matricization of X ∈ R^{I1×···×ID} is
X^(q) ∈ R^{I_{Cq} × I_{C̄q}},
where I_{Cq} := Π_{c∈Cq} I_c and I_{C̄q} := Π_{c̄∈C̄q} I_c̄.
Definition 2.13 (Hierarchical Rank) Letting X ∈ R^{I1×···×ID}, the tensor tree (hierarchical)
rank k is defined as the tuple k = (k_{Cq})_{Cq∈T} with k_{Cq} = rank(X^(q)),
where "rank" denotes the standard matrix rank. With the hierarchical rank k, the set
of all tensors with hierarchical rank no larger than k is defined as
X_k := {X ∈ R^{I1×···×ID} : rank(X^(q)) ≤ k_{Cq} for all Cq ∈ T}.
Lemma 2.1 (Nestedness Property) Letting X ∈ X_k, for each node Cq and its
complement C̄q we can define the subspace U_q spanned by the columns of
X^(q) ∈ R^{I_{Cq} × I_{C̄q}}. For each Cq ∈ E(T) with two successors Cq1 and Cq2, the
space U_q satisfies
U_q ⊆ U_{q1} ⊗ U_{q2}.
Let U_{Cq} ∈ R^{I_{Cq} × k_{Cq}} denote a basis matrix of U_q, where k_{Cq} is the rank of X^(q). For Cq = Cq1 ∪ Cq2, the column vectors U_{Cq}(:, l)
of U_{Cq} fulfill the nestedness property that every vector U_{Cq}(:, l) with 1 ≤ l ≤ k_{Cq}
satisfies
U_{Cq}(:, l) = Σ_{l1=1}^{k_{Cq1}} Σ_{l2=1}^{k_{Cq2}} B_{Cq}(l, l1, l2) U_{Cq1}(:, l1) ⊗ U_{Cq2}(:, l2), (2.60)
For example, for a fourth-order tensor X with the dimension tree of Fig. 2.7, the root representation u_{{1,2,3,4}} can be decomposed as
u_{{1,2,3,4}} = Σ_{l1=1}^{k_{{1,2}}} Σ_{l2=1}^{k_{{3,4}}} B_{{1,2,3,4}}(l1, l2) U_{{1,2}}(:, l1) ⊗ U_{{3,4}}(:, l2). (2.61)
Similarly, the column vectors U_{{1,2}}(:, l), 1 ≤ l ≤ k_{{1,2}}, can be decomposed into
U_{{1,2}}(:, l) = Σ_{l1=1}^{k_{{1}}} Σ_{l2=1}^{k_{{2}}} B_{{1,2}}(l, l1, l2) U_{{1}}(:, l1) ⊗ U_{{2}}(:, l2), (2.62)
where U{d} ∈ RId ×k{d} , d = 1, . . . , 4. In this way, the leaf node representations of
X are U{1} ,U{2} ,U{3} ,U{4} and the representations of interior nodes are B{1,2} , B{3,4} ,
and B{1,2,3,4} . Figure 2.7 shows the graphical illustration.
As can be seen in Fig. 2.7, we need to store the transfer tensor B for every interior
node and the matrix U for every leaf. Therefore, assuming all In = I and all k_{Cq} = k,
the storage complexity is Σ_{p=1}^{P} I k + Σ_{q=1}^{Q−P} k^3 = P I k + (Q − P) k^3, where P is the number of leaves among the Q nodes.
Letting X ∈ R^{I1×I2×···×IN}, the goal of hierarchical Tucker decomposition is to find
the optimal B and U in hierarchical Tucker form to approximate X. Grasedyck [27]
provided two ways to update B and U: one is the root-to-leaves way, and the
other is the leaves-to-root way.
The first one is the root-to-leaves updating scheme, summarized in Algorithm 9,
where diag(S) denotes the diagonal elements of the matrix S and Length(·)
returns the size of a vector. The computational complexity of this scheme for
a tensor X ∈ R^{I1×I2×···×IN} and a dimension tree T of depth H is O((Π_{n=1}^{N} I_n)^{3/2}).
Fig. 2.8 The tree tensor network state (TTNS) with third-order cores for the representation of
12th-order tensors
Fig. 2.9 The tree tensor network state (TTNS) with fourth-order cores for the representation of
12th-order tensors
The tensor train (TT) decomposition represents an Nth-order tensor X ∈ R^{I1×I2×···×IN} as
X = Σ_{r1=1}^{R1} · · · Σ_{rN+1=1}^{RN+1} G^(1)(r1, :, r2) ◦ G^(2)(r2, :, r3) ◦ · · · ◦ G^(N)(rN, :, rN+1),
or element-wise as
x_{i1,i2,...,iN} = G^(1)(:, i1, :) G^(2)(:, i2, :) · · · G^(N)(:, iN, :),
where G^(n) ∈ R^{Rn×In×Rn+1}, n = 1, . . . , N, are the core factors and R1 = RN+1 = 1.
Definition 2.16 (TT Rank) Letting X ∈ R^{I1×I2×···×IN}, the TT rank of X, denoted
rank_TT(X), is given by the ranks of its n-unfoldings X⟨n⟩:
rank_TT(X) = [R1, . . . , RN+1],
with elements
X⟨1⟩(i1; j) = Σ_{r2=1}^{R2} U(i1, r2) V(r2, j), (2.65)
with rank(X̂_1) = R3. The second core factor is obtained by reshaping U into G^(2) ∈ R^{R2×I2×R3}. Following this
sequential SVD, we can obtain all core factors in TT format. Algorithm 11 gives
the details of the sequential SVD for TT decomposition, where the operations Length(·)
and Reshape(·) are MATLAB functions.
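The sequential-SVD idea is short enough to sketch directly. The following NumPy version is our own minimal TT-SVD with a simple singular-value threshold, not a transcription of Algorithm 11, and it uses C-order reshapes rather than MATLAB's column-major ones; it returns cores G^(n) of size Rn × In × Rn+1.

```python
import numpy as np

def tt_svd(X, eps=1e-10):
    """Sequential truncated SVD giving TT cores of shape (R_n, I_n, R_{n+1})."""
    dims = X.shape
    N = len(dims)
    cores = []
    C = np.asarray(X, dtype=float)
    r_prev = 1
    for n in range(N - 1):
        C = C.reshape(r_prev * dims[n], -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = max(1, int(np.sum(s > eps * s[0])))      # crude rank truncation
        cores.append(U[:, :r].reshape(r_prev, dims[n], r))
        C = s[:r, None] * Vt[:r, :]                  # carry the remainder forward
        r_prev = r
    cores.append(C.reshape(r_prev, dims[N - 1], 1))
    return cores

X = np.random.rand(4, 5, 6, 3)
cores = tt_svd(X)
print([G.shape for G in cores])
```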
In most cases, given the tensor train rank, we can also obtain the core factors in TT
format by ALS. For example, letting X ∈ R^{I1×I2×···×IN}, the optimization problem
is to find the best core factors in TT format to fit X:
min_{G^(1),...,G^(N)} ‖X − Σ_{r1=1}^{R1} · · · Σ_{rN+1=1}^{RN+1} G^(1)(r1, :, r2) ◦ G^(2)(r2, :, r3) ◦ · · · ◦ G^(N)(rN, :, rN+1)‖_F^2. (2.67)
The detailed steps are summarized in Algorithm 12, where the operation
Permute(·) is a MATLAB function.
The generalizations of the tensor train decomposition discussed in this part contain loop structures,
including the tensor ring (TR) decomposition [81], projected entangled-pair states
(PEPS) [56, 76], the honeycomb lattice (HCL) [1, 5, 32], the multi-scale entanglement
renormalization ansatz (MERA) [15], and so on. Among them, we emphasize
the TR decomposition due to its simplicity; for the other tensor networks, we only give
a brief introduction.
To alleviate the ordering uncertainty problem in TT decomposition, periodic
boundary conditions (PBC) were first proposed, leading to a new tensor decomposition
named tensor ring decomposition (called the tensor chain in the physics literature) [59], which
was further developed by Zhao et al. [81].
Definition 2.17 (Tensor Ring Decomposition) For an Nth-order tensor X ∈
R^{I1×···×IN}, the TR decomposition is defined as
X = Σ_{r1=1}^{R1} · · · Σ_{rN=1}^{RN} G^(1)(r1, :, r2) ◦ G^(2)(r2, :, r3) ◦ · · · ◦ G^(N)(rN, :, r1), (2.68)
with elements
X(i1, i2, . . . , iN) = tr(G^(1)(:, i1, :) G^(2)(:, i2, :) · · · G^(N)(:, iN, :)), (2.69)
where the G (n) ∈ RRn ×In ×Rn+1 , n = 1, . . . , N are the core factors and the TR ranks
are defined as [R1 , . . . , RN ]. We use F(G (1) , . . . , G (N ) ) to represent the tensor ring
decomposition. Figure 2.11 shows a representation of tensor ring decomposition.
Similar to TT decomposition, the storage complexity is NI R 2 assuming all In = I
and Rn = R in the TR model.
Like the TT decomposition, the detailed computation of the tensor ring
decomposition is summarized in Algorithm 14, where B_n = ⊗_{i=1, i≠n}^{N} G^(i) is defined
as the tensor contraction product of the cores G^(i), i ≠ n, i = 1, . . . , N.
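Equation (2.69) translates directly into code. The sketch below (our own) rebuilds a full tensor from TR cores by chaining tensordot contractions and closing the ring with a trace:

```python
import numpy as np

def tr_reconstruct(cores):
    """Full tensor from TR cores G[n] of shape (R_n, I_n, R_{n+1}), with R_{N+1} = R_1."""
    T = cores[0]                                  # (R_1, I_1, R_2)
    for G in cores[1:]:
        T = np.tensordot(T, G, axes=([-1], [0]))  # chain along the shared rank index
    return np.trace(T, axis1=0, axis2=-1)         # close the ring: sum over r_1

cores = [np.random.rand(2, 4, 3), np.random.rand(3, 5, 2), np.random.rand(2, 6, 2)]
X = tr_reconstruct(cores)
print(X.shape)   # (4, 5, 6)
```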
Sometimes, the ranks in TT format can increase rapidly with the order of a data tensor,
which may be less effective for a compact representation.
To alleviate this problem, PEPS has been proposed as a hierarchical two-dimensional TT
model. A graphical illustration can be found in Fig. 2.12. In this case, the ranks
are kept considerably smaller at the cost of employing fourth- or even fifth-order core
tensors. However, for very high-order tensors, the ranks may still increase rapidly with
an increase in the desired accuracy of approximation. For further control of the
ranks, alternative tensor networks can be employed, including the HCL, which uses
third-order cores (see Fig. 2.13), and the MERA, which consists of
both third- and fourth-order core tensors, as shown in Fig. 2.14.
2.7 Hybrid Decomposition
In [3], Ahad et al. proposed a hierarchical low-rank tensor ring decomposition. In
the first layer, the traditional tensor ring decomposition model is used to factorize
a tensor into many third-order subtensors. In the second layer, each third-order
subtensor is further decomposed by the t-SVD [42]. Figure 2.15 shows the details of
the hierarchical tensor ring decomposition.
The hierarchical low-rank tensor ring model can be formulated as the optimization problem
min_{G^(n), n=1,...,N} (1/2) ‖X − F(G^(1), . . . , G^(N))‖_F^2 + Σ_{n=1}^{N} rank_tubal(G^(n)). (2.70)
Fig. 2.14 Multi-scale entanglement renormalization ansatz (MERA) for an eighth-order tensor
2.8 Scalable Tensor Decomposition
Tensor networks are efficient for dealing with high-order data but fail to handle
large data that contain billions of entries in each mode. Therefore, developing
scalable tensor algorithms to handle such big data is interesting and necessary.
In this part, we introduce some scalable tensor decomposition methods.
According to the characteristics of the data (sparse or low rank), we can divide
them into two categories: one exploits the sparsity of the tensor, and the other
subdivides the large-scale tensor into smaller ones based on a low-rank
assumption.
When dealing with a large-scale tensor, the intermediate data explosion problem often
occurs. For example, when using the CP-ALS algorithm on large-scale data X ∈
R^{I×J×K}, the computational cost mainly comes from the factor updates
A ← X(1)(C ⊙ B)(C^T C ⊛ B^T B)^†, (2.71)
B ← X(2)(A ⊙ C)(A^T A ⊛ C^T C)^†, (2.72)
C ← X(3)(B ⊙ A)(B^T B ⊛ A^T A)^†. (2.73)
To simplify the discussion, we focus on Eq. (2.71) to illustrate the intermediate data
explosion problem in detail. To compute Eq. (2.71), we first calculate C ⊙ B
and C^T C ⊛ B^T B. For large-scale data, the matrix C ⊙ B ∈ R^{JK×R} is very large and
dense and cannot be stored even across multiple disks, causing intermediate data explosion.
GigaTensor [41] is the first work to avoid the intermediate data explosion problem
when handling a large-scale sparse tensor. The idea is to decouple C ⊙ B
in the Khatri-Rao product, perform algebraic operations involving X(1) and C and
involving X(1) and B, and then combine the results. The authors prove
that computing the r-th column of X(1)(C ⊙ B) is equivalent to computing (F1 ⊛ F2) 1_{JK}, where
F1 = X(1) ⊛ (1_I ◦ (C(:, r)^T ⊗ 1_J^T)), F2 = bin(X(1)) ⊛ (1_I ◦ (1_K^T ⊗ B(:, r)^T)), and
1_{JK} is an all-one vector of size JK; bin(·) is an operator that converts any nonzero
value into 1. The detailed fast computation is summarized in Algorithm 15.
Using this process, the flops for computing X(1)(C ⊙ B) are reduced from JKR +
2mR to 5mR, and the intermediate data size is reduced from JKR + m to max(J +
m, K + m), where m is the number of nonzeros in X(1).
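The essential point, never materializing the JK × R matrix C ⊙ B, is easy to see in code. The sketch below is our own illustration for a sparse tensor stored in coordinate form; it is not GigaTensor itself (which additionally distributes this computation over MapReduce), and all function and argument names are ours:

```python
import numpy as np

def mttkrp_column(I, coords, vals, B, C, r):
    """r-th column of X_(1) (C ⊙ B) for a sparse third-order tensor.

    coords: (nnz, 3) array of (i, j, k) indices, vals: (nnz,) values,
    B: J x R, C: K x R.  Work and memory are O(nnz), with no JK x R intermediate.
    """
    i, j, k = coords[:, 0], coords[:, 1], coords[:, 2]
    out = np.zeros(I)
    np.add.at(out, i, vals * B[j, r] * C[k, r])   # each nonzero contributes to row i
    return out
```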
Following this work, HaTen2 [35] unifies Tucker and CP decompositions into
a general framework. Beutel et al. [7] propose FlexiFaCT, a flexible tensor
decomposition method based on distributed stochastic gradient descent.
FlexiFaCT supports various types of decompositions such as matrix decomposition,
tensor decomposition, and coupled matrix-tensor factorization.
To handle data that are large scale but not sparse, we need to divide the large-scale
data into small-scale pieces for processing. Based on the low-rank assumption on big
data, the existing algorithms for dealing with large-scale low-rank tensors are mainly
divided into two groups: one is based on parallel distributed techniques, and the
other is based on projection techniques.
In the first group, the idea for handling large-scale data is a "divide and conquer"
strategy. It consists of three steps: break the large-scale tensor into small-scale
tensors, find the factors of the small-scale tensors, and combine these factors
to recover the factors of the original large-scale tensor.
PARCUBE [65] is the first approach to use the "divide and conquer" strategy to
process a large-scale tensor. The architecture of PARCUBE has three parts: first,
multiple small-scale tensors are subsampled in parallel from the large-scale tensor;
second, each small-scale tensor is independently factorized by CP decomposition;
finally, the factors of the small-scale tensors are joined via a master linear equation.
For example, given large-scale data X ∈ R^{I×J×K}, multiple small-scale
tensors are first subsampled in parallel from the large-scale tensor. The subsampling
process is illustrated in Fig. 2.16, where the sample matrices Un ∈ R^{I×In}, Vn ∈
R^{J×Jn}, Wn ∈ R^{K×Kn} are randomly drawn from an absolutely continuous distribution
and their elements are independent and identically
distributed Gaussian random variables with zero mean and unit variance.
(Fig. 2.16 illustrates the subsampling of X into a small tensor Yn via the sample matrices Un, Vn, Wn; the accompanying figure shows the overall pipeline, in which each subsampled tensor Yn is factorized by CP decomposition into Ān, B̄n, C̄n and the factors are then joined to obtain A, B, C.)
Different from the scalable distributed tensor decompositions, the idea of
randomized tensor decomposition methods is to first apply random projections
to obtain a compressed tensor, which can then be
factorized by different tensor decompositions. Finally, the decomposition
of the original tensor is obtained by projecting the factor matrices of the compressed
tensor back.
X ≈ X ×_1 U_1 U_1^T ×_2 · · · ×_N U_N U_N^T. (2.74)
Given a fixed target rank J, these basis matrices {U_n ∈ R^{In×J}}_{n=1}^{N} can be
efficiently obtained using a randomized algorithm. First, a random sample matrix
W ∈ R^{Π_{d≠n} I_d × J} is constructed, where each column is drawn from a
Gaussian distribution. Then, the random sample matrix is used to sample the column
space of X(n) ∈ R^{In × Π_{d≠n} I_d} as
Z = X(n) W, (2.75)
where Z ∈ R^{In×J} is the sketch, whose columns form an approximate basis
for the column space of X(n). According to probability theory [4, 39], if the
columns of the sample matrix W ∈ R^{Π_{d≠n} I_d × J} are linearly independent with high
probability, the random projection Z will efficiently sample the range of X(n).
Finally, the orthonormal basis can be obtained from the QR decomposition of the sketch,
U_n = QR(Z), (2.76)
where QR(·) returns the orthonormal factor. After processing all N modes, we obtain the compressed tensor Y and a set of orthonormal
matrices {U_n ∈ R^{In×J}}_{n=1}^{N}. The detailed steps are summarized in Algorithm 16.
The next step performs CP decomposition on Y and obtains compressed factor
matrices Ā_n ∈ R^{J×R}, n = 1, . . . , N. The factor matrices of the original tensor can then be
recovered by
A_n ≈ U_n Ā_n, n = 1, . . . , N. (2.78)
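The mode-wise randomized range finder in (2.75)-(2.76) is only a few lines of NumPy. The sketch below is our own illustration of the compression step, not the book's Algorithm 16; for simplicity it computes every basis from the original tensor, whereas a sequential variant could update the tensor between modes.

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')

def randomized_bases(X, J, seed=0):
    """One orthonormal I_n x J basis per mode, via Gaussian sketches of each unfolding."""
    rng = np.random.default_rng(seed)
    U = []
    for n in range(X.ndim):
        Xn = unfold(X, n)
        W = rng.standard_normal((Xn.shape[1], J))  # Gaussian sample matrix
        Z = Xn @ W                                 # sketch of the column space
        Q, _ = np.linalg.qr(Z)                     # orthonormal basis U_n
        U.append(Q)
    return U

X = np.random.rand(20, 30, 25)
U = randomized_bases(X, J=5)
Y = X
for n, Un in enumerate(U):                         # compressed tensor Y = X x_n U_n^T
    Y = np.moveaxis(np.tensordot(Un.T, Y, axes=([1], [n])), 0, n)
print(Y.shape)   # (5, 5, 5)
```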
In the same way, randomized Tucker decompositions for large-scale data have been
proposed, such as the random projection HOSVD algorithm [13] and random projection
orthogonal iteration [82]. For other randomized tensor decompositions based on
random projection, we refer the reader to [24].
Tensor decompositions are very powerful tools, which are ubiquitous in image
processing, machine learning, and computer vision, and different decompositions
suit different applications. For example, because the CP decomposition
is unique under mild conditions, it is suitable for extracting interpretable latent
factors. The Tucker decomposition has good compression ability and is frequently used for
data compression. In addition, as the tensor order increases, tensor networks,
including the tensor train, tensor tree, and tensor ring
decompositions, are utilized to alleviate the curse of dimensionality and reduce
storage.
However, there still exist some challenges in tensor decomposition. We summarize
them as follows.
• Which tensor decomposition is best suited to exploit the spatial or temporal
structure of data? For example, in [74] the authors use nonnegative CP
decomposition to represent hyperspectral images, and in [14] the authors apply
Tucker decomposition to capture spatial-temporal information in traffic speed
data. Is there a generic way to incorporate such modifications in a tensor model
and enable it to handle these data effectively? Furthermore, is there an optimal
tensor network structure that can adaptively represent data?
• In the real world, there exist many heterogeneous information networks, such as
social networks and knowledge graphs. These networks can be well represented
by graphs in which nodes carry different types of data and edges present the
relationships between them. Compared with graphs, is there a way to represent
References
1. Ablowitz, M.J., Nixon, S.D., Zhu, Y.: Conical diffraction in honeycomb lattices. Phys. Rev. A
79(5), 053830–053830 (2009)
2. Acar, E., Dunlavy, D.M., Kolda, T.G.: Link prediction on evolving data using matrix and tensor
factorizations. In: 2009 IEEE International Conference on Data Mining Workshops, pp. 262–
269. IEEE, New York (2009)
3. Ahad, A., Long, Z., Zhu, C., Liu, Y.: Hierarchical tensor ring completion (2020). arXiv e-prints,
pp. arXiv–2004
4. Ahfock, D.C., Astle, W.J., Richardson, S.: Statistical properties of sketching algorithms.
Biometrika (2020). https://doi.org/10.1093/biomet/asaa062
5. Bahat-Treidel, O., Peleg, O., Segev, M.: Symmetry breaking in honeycomb photonic lattices.
Opt. Lett. 33(19), 2251–2253 (2008)
6. Ballani, J., Grasedyck, L., Kluge, M.: Black box approximation of tensors in hierarchical
Tucker format. Linear Algebra Appl. 438(2), 639–657 (2013)
7. Beutel, A., Talukdar, P.P., Kumar, A., Faloutsos, C., Papalexakis, E.E., Xing, E.P.: Flexifact:
scalable flexible factorization of coupled tensors on hadoop. In: Proceedings of the 2014 SIAM
International Conference on Data Mining, pp. 109–117. SIAM, Philadelphia (2014)
8. Bigoni, D., Engsig-Karup, A.P., Marzouk, Y.M.: Spectral tensor-train decomposition. SIAM J.
Sci. Comput. 38(4), A2405–A2439 (2016)
9. Brachat, J., Comon, P., Mourrain, B., Tsigaridas, E.: Symmetric tensor decomposition. Linear
Algebra Appl. 433(11–12), 1851–1872 (2010)
10. Carroll, J.D., Chang, J.J.: Analysis of individual differences in multidimensional scaling via
an n-way generalization of “Eckart-Young” decomposition. Psychometrika 35(3), 283–319
(1970)
11. Cattell, R.B.: “parallel proportional profiles” and other principles for determining the choice
of factors by rotation. Psychometrika 9(4), 267–283 (1944)
12. Cattell, R.B.: The three basic factor-analytic research designs—their interrelations and deriva-
tives. Psychol. Bull. 49(5), 499–520 (1952)
13. Che, M., Wei, Y.: Randomized algorithms for the approximations of Tucker and the tensor train
decompositions. Adv. Comput. Math. 45(1), 395–428 (2019)
14. Chen, X., He, Z., Wang, J.: Spatial-temporal traffic speed patterns discovery and incomplete
data recovery via SVD-combined tensor decomposition. Transp. Res. Part C Emerg. Technol.
86, 59–77 (2018)
15. Cincio, L., Dziarmaga, J., Rams, M.: Multiscale entanglement renormalization ansatz in two
dimensions: quantum ising model. Phys. Rev. Lett. 100(24), 240603–240603 (2008)
References 55
16. Cong, F., Lin, Q.H., Kuang, L.D., Gong, X.F., Astikainen, P., Ristaniemi, T.: Tensor decompo-
sition of EEG signals: a brief review. J. Neurosci. Methods 248, 59–69 (2015)
17. Cyganek, B., Gruszczyński, S.: Hybrid computer vision system for drivers’ eye recognition
and fatigue monitoring. Neurocomputing 126, 78–94 (2014)
18. De Lathauwer, L.: Decompositions of a higher-order tensor in block terms—part I: lemmas for
partitioned matrices. SIAM J. Matrix Anal. Appl. 30(3), 1022–1032 (2008)
19. De Lathauwer, L.: Decompositions of a higher-order tensor in block terms—part II: definitions
and uniqueness. SIAM J. Matrix Anal. Appl. 30(3), 1033–1066 (2008)
20. De Lathauwer, L., De Moor, B., Vandewalle, J.: A multilinear singular value decomposition.
SIAM J. Matrix Anal. Appl. 21(4), 1253–1278 (2000)
21. De Lathauwer, L., Nion, D.: Decompositions of a higher-order tensor in block terms—part III:
alternating least squares algorithms. SIAM J. Matrix Anal. Appl. 30(3), 1067–1083 (2008)
22. De Lathauwer, L., Vandewalle, J.: Dimensionality reduction in higher-order signal processing
and rank-(r1, r2,. . . , rn) reduction in multilinear algebra. Linear Algebra Appl. 391, 31–55
(2004)
23. Erichson, N.B., Manohar, K., Brunton, S.L., Kutz, J.N.: Randomized CP tensor decomposition.
Mach. Learn. Sci. Technol. 1(2), 025012 (2020)
24. Fonał, K., Zdunek, R.: Distributed and randomized tensor train decomposition for feature
extraction. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
IEEE, New York (2019)
25. Franz, T., Schultz, A., Sizov, S., Staab, S.: Triplerank: ranking semantic web data by tensor
decomposition. In: International Semantic Web Conference, pp. 213–228. Springer, Berlin
(2009)
26. Govindu, V.M.: A tensor decomposition for geometric grouping and segmentation. In: 2005
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05),
vol. 1, pp. 1150–1157. IEEE, New York (2005)
27. Grasedyck, L.: Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal.
Appl. 31(4), 2029–2054 (2010)
28. Grelier, E., Nouy, A., Chevreuil, M.: Learning with tree-based tensor formats (2018). Preprint,
arXiv:1811.04455
29. Hackbusch, W., Kühn, S.: A new scheme for the tensor representation. J. Fourier Anal. Appl.
15(5), 706–722 (2009)
30. Hardoon, D.R., Shawe-Taylor, J.: Decomposing the tensor kernel support vector machine for
neuroscience data with structured labels. Mach. Learn. 79(1–2), 29–46 (2010)
31. Harshman, R.: Foundations of the PARAFAC procedure: models and conditions for an
“explanatory” multimodal factor analysis. In: UCLA Working Papers in Phonetics, vol. 16,
pp. 1–84 (1970)
32. Herbut, I.: Interactions and phase transitions on graphene’s honeycomb lattice. Phys. Rev. Lett.
97(14), 146401–146401 (2006)
33. Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products. J. Math. Phys.
6(1–4), 164–189 (1927)
34. Hitchcock, F.L.: Multiple invariants and generalized rank of a p-way matrix or tensor. J. Math.
Phys. 7(1–4), 39–79 (1928)
35. Jeon, I., Papalexakis, E.E., Kang, U., Faloutsos, C.: Haten2: billion-scale tensor decompo-
sitions. In: 2015 IEEE 31st International Conference on Data Engineering, pp. 1047–1058.
IEEE, New York (2015)
36. Ji, H., Li, J., Lu, R., Gu, R., Cao, L., Gong, X.: EEG classification for hybrid brain-computer
interface using a tensor based multiclass multimodal analysis scheme. Comput. Intell.
Neurosci. 2016, 1732836–1732836 (2016)
37. Jiang, B., Ding, C., Tang, J., Luo, B.: Image representation and learning with graph-Laplacian
tucker tensor decomposition. IEEE Trans. Cybern. 49(4), 1417–1426 (2018)
38. Jiang, T., Sidiropoulos, N.D.: Kruskal’s permutation lemma and the identification of CANDE-
COMP/PARAFAC and bilinear models with constant modulus constraints. IEEE Trans. Signal
Process. 52(9), 2625–2636 (2004)
56 2 Tensor Decomposition
39. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space.
Contemp. Math. 26(189–206), 1 (1984)
40. Kanatsoulis, C.I., Sidiropoulos, N.D.: Large-scale canonical polyadic decomposition via
regular tensor sampling. In: 2019 27th European Signal Processing Conference (EUSIPCO),
pp. 1–5. IEEE, New York (2019)
41. Kang, U., Papalexakis, E., Harpale, A., Faloutsos, C.: Gigatensor: scaling tensor analysis up by
100 times-algorithms and discoveries. In: Proceedings of the 18th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 316–324 (2012)
42. Kilmer, M.E., Braman, K., Hao, N., Hoover, R.C.: Third-order tensors as operators on matrices:
a theoretical and computational framework with applications in imaging. SIAM J. Matrix Anal.
Appl. 34(1), 148–172 (2013)
43. Kilmer, M.E., Martin, C.D.: Factorization strategies for third-order tensors. Linear Algebra
Appl. 435(3), 641–658 (2011)
44. Kilmer, M.E., Martin, C.D., Perrone, L.: A third-order generalization of the matrix SVD as a
product of third-order tensors. Tufts University, Department of Computer Science, Tech. Rep.
TR-2008-4 (2008)
45. Kolda, T.G., Sun, J.: Scalable tensor decompositions for multi-aspect data mining. In: 2008
Eighth IEEE International Conference on Data Mining, pp. 363–372. IEEE, New York (2008)
46. Kressner, D., Tobler, C.: htucker—a matlab toolbox for tensors in hierarchical Tucker format.
Chapter 3
Tensor Dictionary Learning
Sparse models are generally classified into two categories: synthesis sparse models and analysis sparse models. In the synthesis sparse model, a signal x is represented as a linear combination of a few atoms of an overcomplete dictionary. The analysis sparse model characterizes the signal x by multiplying it with an overcomplete analysis dictionary, which yields a sparse outcome; it is therefore also called the cosparse model. Figures 3.1 and 3.2 are schematic diagrams of the two models with overcomplete dictionaries.
Given an overcomplete dictionary Ψ = [ψ_1, ..., ψ_N] ∈ R^{I×N} (N ≥ I), the synthesis sparse model represents a signal x ∈ R^I as

x = Ψθ + e_s = Σ_{n=1}^N θ_n ψ_n + e_s,    (3.1)

where θ ∈ R^N is the sparse coefficient vector and e_s is the representation error. The sparse coding problem is formulated as

min_θ ‖θ‖_0,  s.t. ‖x − Ψθ‖_2 ≤ α,    (3.2)

where ‖θ‖_0 is the pseudo ℓ_0 norm which counts the nonzero entries and α bounds the approximation error in the sparse representation. Under mild conditions, this NP-hard problem can be approximately solved by greedy pursuit methods such as orthogonal matching pursuit (OMP) [35] or by convex relaxation such as basis pursuit [9].
In the analysis (cosparse) sparse model, a signal x ∈ R^I is characterized by an analysis dictionary Ω ∈ R^{N×I} through the number of zeros in Ωx, i.e., the cosparsity ℓ := N − ‖Ωx‖_0.
By introducing the auxiliary variable z = Ωx, the model for cosparse recovery with a fixed dictionary can be formulated as follows:

min_x ‖z‖_0,  s.t. z = Ωx,  ‖ρ − x‖_2 ≤ ε,

where ε is the error tolerance related to the noise power and ρ = x + e_a is the noisy observation, with e_a the approximation error term.
In summary, in the synthesis sparse model, the signal subspace is spanned by the columns ψ_n, n ∈ T, where the set T contains the indices of the nonzero coefficients. Correspondingly, in the analysis sparse model, the analysis subspace consists of the rows ω_n with ⟨ω_n, x⟩ = 0. In addition, when both the synthesis dictionary Ψ and the analysis dictionary Ω are overcomplete, they are likely to be very different. However, if they are complete orthogonal bases, the two kinds of representations are equivalent with Ω = Ψ^{-1}.
The dictionary is the basis for sparse and cosparse representations. There are generally two groups of methods for dictionary design. One uses fixed dictionaries characterized by mathematical functions, such as the discrete cosine transform (DCT) [2], wavelet transform (WT) [5], contourlet transform (CT) [38], shearlets [28], grouplets [61], and parametric dictionaries [59]. Although these fixed dictionaries have simple structures and low computational complexity, their basic atoms are fixed, and the atom morphology is not rich enough to match some complicated data structures. Therefore, fixed dictionaries may be suboptimal representations for some data processing applications.
The other group learns overcomplete dictionaries from training data. Compared with fixed dictionaries based on analytical design, data-driven dictionaries are versatile, simple, and efficient, and they can better match the data structure with a sparser/cosparser representation. Classical dictionary learning methods include maximum likelihood methods (MLD) [34], the method of optimal directions (MOD) [18], the maximum a posteriori probability (MAP) approach [25], generalized PCA (GPCA) [55], K-SVD [1], analysis K-SVD [47], etc.
Proposed in 1999, MOD is one of the most classical dictionary learning methods [18]. It formulates dictionary learning as a bilinear optimization model with a sparse constraint: given training signals X = [x_1, ..., x_T] ∈ R^{I×T}, it seeks a dictionary Ψ and coefficients Θ that minimize ‖X − ΨΘ‖_F^2 subject to a column-wise sparsity constraint on Θ. A closed-form updating rule with the minimum mean square error can be obtained:

Ψ := XΘ^T (ΘΘ^T)^{-1}.    (3.8)
It can be seen that the multiplication of large matrices and the inverse operation in formula (3.8) may make the MOD algorithm computationally expensive and demanding in storage.
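To make the update rule concrete, the following is a minimal NumPy sketch of the MOD dictionary update, assuming a sparse coefficient matrix Theta has already been obtained by a sparse coding step such as OMP; the variable names and the small regularization term are illustrative, not part of the original algorithm description.

```python
import numpy as np

def mod_dictionary_update(X, Theta, eps=1e-10):
    """One MOD dictionary update: Psi := X Theta^T (Theta Theta^T)^{-1}.

    X     : (I, T) training signals stacked as columns
    Theta : (N, T) sparse coefficients from the previous coding step
    """
    # Regularize the Gram matrix slightly so the inverse always exists.
    gram = Theta @ Theta.T + eps * np.eye(Theta.shape[0])
    Psi = X @ Theta.T @ np.linalg.inv(gram)
    # Normalize each atom (column) to unit Euclidean norm.
    Psi /= np.maximum(np.linalg.norm(Psi, axis=0, keepdims=True), eps)
    return Psi
```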
In the K-SVD algorithm [1], the data matrix is approximated as X ≈ ΨΘ = Σ_{n=1}^N ψ_n θ^n, where θ^n is the n-th row of Θ and ψ_n is the n-th column of Ψ. Thus X can be seen as the sum of N rank-1 matrices. To update the k-th atom, we define W ∈ R^{T×|T|} with ones at (t, T(t)) and zeros elsewhere, where the index set T = {t | 1 ≤ t ≤ T, θ^k(t) ≠ 0} collects the signals that use the k-th atom. Multiplying by W restricts the residual (with the contribution of the k-th atom removed) to these signals, which gives the restricted error matrix Ẽ_k. Then we can perform SVD on Ẽ_k as Ẽ_k = UΣV^T. The k-th dictionary atom is updated by ψ_k = u_1, and the coefficient is updated by θ̃^k = Σ(1,1) v_1, where u_1 and v_1 are the left and right singular vectors with respect to the largest singular value, respectively.
Without matrix inverse calculation, the computational complexity of K-SVD
algorithm is much lower than that of MOD algorithm. The coefficient matrix is
updated jointly with dictionary atoms in the dictionary update step, which improves
the convergence speed of the algorithm. K-SVD algorithm is one of the most widely
used dictionary learning algorithms in practice. Inspired by K-SVD, a series of
related methods such as discriminative K-SVD [64] and analysis K-SVD [47] have
been developed for possible performance improvement.
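The atom update just described can be sketched in a few lines of NumPy, assuming the coefficient matrix Theta is available from the sparse coding stage (variable names are illustrative):

```python
import numpy as np

def ksvd_atom_update(X, Psi, Theta, k):
    """Update atom k of Psi and row k of Theta on its support (K-SVD style)."""
    support = np.nonzero(Theta[k, :])[0]      # signals that use atom k
    if support.size == 0:
        return Psi, Theta
    # Error without the contribution of atom k, restricted to the support.
    E = X[:, support] - Psi @ Theta[:, support] + np.outer(Psi[:, k], Theta[k, support])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    Psi[:, k] = U[:, 0]                        # psi_k = u1
    Theta[k, support] = s[0] * Vt[0, :]        # theta_k = sigma_1 * v1
    return Psi, Theta
```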
Motivated by the analysis sparse model and the K-SVD algorithm, the analysis K-SVD algorithm is proposed to learn a cosparse dictionary [47]. Similarly, analysis dictionary learning can be divided into two tasks: analysis sparse coding and analysis dictionary learning. Given the analysis dictionary Ω and a noisy observation Y ∈ R^{I×T} of X, the optimization model for analysis dictionary learning can be
formulated as follows:

min_{Ω, X, {I_t}_{t=1}^T} ‖X − Y‖_F^2
s.t. Ω_{I_t} x_t = 0, ∀ 1 ≤ t ≤ T,    (3.11)
     Rank(Ω_{I_t}) = I − r, ∀ 1 ≤ t ≤ T,
     ‖ω_k‖_2 = 1, ∀ 1 ≤ k ≤ N,

where x_t and y_t are the t-th columns of X and Y, respectively, Ω_{I_t} ∈ R^{|I_t|×I} is the sub-matrix of Ω whose rows are indexed by I_t, the set of rows orthogonal to x_t, ω_k is the k-th row of Ω, and r is the dimension of the subspace that the signal x_t belongs to.
Fixing the dictionary Ω, the optimization problem (3.11) with respect to X can be solved for each column individually:

min_{x_t} ‖x_t − y_t‖_2^2
s.t. Ω_{I_t} x_t = 0,    (3.12)
     Rank(Ω_{I_t}) = I − r.

Fixing X, the dictionary is updated row by row. For the k-th row, only the training signals indexed by T (those whose cosparse supports involve ω_k) are considered, i.e.,

min_{ω_k, {x_t}} Σ_{t∈T} ‖x_t − y_t‖_2^2
s.t. Ω_{I_t} x_t = 0, ∀ t ∈ T,    (3.13)
     Rank(Ω_{I_t}) = I − r, ∀ t ∈ T,
     ‖ω_k‖_2 = 1,

which can be approximated by the simpler row-wise problem

min_{ω_k} ‖ω_k Y_T‖_2^2
s.t. ‖ω_k‖_2 = 1,    (3.14)

where Y_T collects the corresponding columns of Y.
Specifically, given a third-order tensor Y as an example, i.e., Y ∈ RI1 ×I2 ×I3 , the
optimization model for tensor dictionary learning in Tucker decomposition form
can be formulated as follows [69]:
min_{X, D_1, D_2, D_3} ‖Y − X ×_1 D_1 ×_2 D_2 ×_3 D_3‖_F^2
s.t. g(X) ≤ K,    (3.15)
where K represents the maximum number of nonzero entries in the sparse coeffi-
cient X . We can alternately optimize X and {Dd , d = 1, 2, 3} while fixing the other
variables. Specifically, we have two main subproblems as follows:
1. Sparse tensor coding: With all dictionaries fixed, we need to solve the corresponding sparse coding subproblem. Following an OMP-type strategy, the coefficient selected at the k-th iteration corresponds to one entry of X, i.e., a_k = x_{m_1^k, m_2^k, m_3^k}, so we can obtain the update of a at the k-th iteration from

a = (P_3 ⊗ P_2 ⊗ P_1)^† y,    (3.19)

where P_d denotes the sub-dictionary of atoms selected from D_d and y is the vectorization of Y.
2. Dictionary update: When other variables are fixed, the dictionary Dd can be
solved by an alternating least squares method. For example, to update D1 , the
optimization model (3.16) can be formulated as follows:
min_{D_1} ‖Y_(1) − D_1 X_(1) (D_3 ⊗ D_2)^T‖_F^2.    (3.20)
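As a concrete illustration of the mode-1 update (3.20), the sketch below solves the corresponding least squares problem using mode-1 unfoldings and a Kronecker product; the unfolding convention and variable names are assumptions of this sketch, not the exact implementation in the cited works.

```python
import numpy as np

def unfold(T, mode):
    """Mode-d unfolding (0-based mode), with the remaining indices in Fortran order."""
    return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

def update_D1(Y, X, D2, D3):
    """Solve min_D1 || Y_(1) - D1 X_(1) (D3 kron D2)^T ||_F^2 in closed form."""
    Y1 = unfold(Y, 0)                      # I1 x (I2*I3)
    X1 = unfold(X, 0)                      # M1 x (M2*M3)
    B = X1 @ np.kron(D3, D2).T             # M1 x (I2*I3)
    # Least squares for D1 such that D1 @ B approximates Y1, solved via lstsq.
    D1 = np.linalg.lstsq(B.T, Y1.T, rcond=None)[0].T
    return D1
```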
To avoid the pathological case in which the entries of the sparse coefficients approach zero while the entries of the dictionaries approach infinity, each matrix dictionary is commonly assumed to be normalized. The corresponding optimization model is formulated as follows [39]:

min_{X, D_1, D_2, D_3} (1/2) ‖Y − X ×_1 D_1 ×_2 D_2 ×_3 D_3‖_F^2
s.t. g(X) ≤ K;    (3.22)
     ‖D_d(:, m_d)‖_2^2 = 1, d = 1, 2, 3; m_d = 1, 2, ..., M_d.

To solve such a problem, each variable can be optimized alternately while fixing the others. In this way, problem (3.22) can be transformed into two subproblems: sparse tensor coding and tensor dictionary updating.
1. Sparse tensor coding: When all dictionary matrices in (3.22) are fixed, the
optimization model for sparse tensor coding is as follows:
min_X (1/2) ‖Y − X ×_1 D_1 ×_2 D_2 ×_3 D_3‖_F^2 + λ g(X),    (3.23)
In fact, we can get different solutions with respect to different choices of g(X). In the case of g(X) = ‖X‖_1, the solution of (3.27) is given by the iterations

X_k = S_{λ/L_k} ( X_{k−1} − (1/L_k) ∇f(X_{k−1}) ),
C_{k+1} = X_k + ((t_k − 1)/t_{k+1}) (X_k − X_{k−1}),

where S_τ(·) denotes the entry-wise soft-thresholding operator with threshold τ and L_k is a step-size constant.
For the dictionary update of D_d, define

A = X ×_1 D_1 ··· ×_{d−1} D_{d−1} ×_{d+1} D_{d+1} ··· ×_D D_D ∈ R^{I_1×···×I_{d−1}×M_d×I_{d+1}×···×I_D},

and let δ ∈ R^{M_d} with δ_{m_d} > 0 be the dual variables associated with the atom normalization constraints. The Lagrange dual function is D(δ) = min_{D_d} L(D_d, δ), whose optimum can be obtained by Newton's method or conjugate gradient. Then the optimal dictionary D_d^T = (A_(d) A_(d)^T + Λ)^{-1} (Y_(d) A_(d)^T)^T can be obtained by maximizing D(δ), where Λ = diag(δ).
We summarize the algorithm for tensor-based dictionary learning in Algo-
rithm 20. As we can see, the tensor-based dictionary learning method reduces both
the computational and memory costs in dealing with real-world multidimensional
signals.
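For the ℓ_1 case above, one proximal (ISTA-type) update can be sketched as follows in NumPy; the mode-product helper, the step size 1/L, and the variable names are illustrative assumptions rather than the exact implementation behind Algorithm 20.

```python
import numpy as np

def mode_product(T, M, mode):
    """Mode-d product T x_d M (0-based mode)."""
    out = np.tensordot(M, np.moveaxis(T, mode, 0), axes=([1], [0]))
    return np.moveaxis(out, 0, mode)

def soft_threshold(T, tau):
    return np.sign(T) * np.maximum(np.abs(T) - tau, 0.0)

def ista_step(X, Y, D1, D2, D3, lam, L):
    """One ISTA step for min_X 0.5*||Y - X x1 D1 x2 D2 x3 D3||_F^2 + lam*||X||_1."""
    # Residual R = X x1 D1 x2 D2 x3 D3 - Y
    R = mode_product(mode_product(mode_product(X, D1, 0), D2, 1), D3, 2) - Y
    # Gradient of the smooth term: R x1 D1^T x2 D2^T x3 D3^T
    G = mode_product(mode_product(mode_product(R, D1.T, 0), D2.T, 1), D3.T, 2)
    return soft_threshold(X - G / L, lam / L)
```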
When the dictionaries are overcomplete, the corresponding coefficients can be very sparse, and the sparse representation of the original data can be improved. However, the large dictionary size leads to high computational complexity, which makes overcomplete tensor dictionary learning time-consuming in some applications. One effective way to reduce the processing time is to design structured dictionaries, and orthogonality is one of the most frequently used structures.
Based on the Tucker form, the optimization model for orthogonal tensor dictio-
nary learning can be formulated as [22, 42]
min_{X, D_1, D_2, D_3} (1/2) ‖Y − X ×_1 D_1 ×_2 D_2 ×_3 D_3‖_F^2
s.t. g(X) ≤ K,    (3.29)
     D_d^H D_d = I, d = 1, 2, 3,
which can be divided into sparse tensor coding and dictionary updating in a similar way, and sparse tensor coding algorithms like OMP can be used with fixed dictionaries. When updating the dictionary D_1, an equivalent optimization model can be formulated as follows:
min_{D_1} (1/2) ‖Y_(1) − D_1 X_(1) (D_2 ⊗ D_3)^H‖_F^2
s.t. D_1^H D_1 = I,    (3.30)
Besides the Tucker form, a tensor dictionary can also be built from rank-one atoms. Given N training tensors Y_n ∈ R^{I_1×I_2×I_3}, the corresponding optimization model is

min_{D, x_n} Σ_{n=1}^N ‖Y_n − D ×_4 x_n‖_F^2
s.t. ‖x_n‖_0 ≤ K,    (3.31)
where D ∈ R^{I_1×I_2×I_3×I_0} is a tensor dictionary, I_0 is the number of atoms, x_n ∈ R^{I_0} is the sparse coefficient vector, and K represents the sparsity level. Each D^(i_0) = D(:, :, :, i_0) ∈ R^{I_1×I_2×I_3} is the i_0-th atom, which is a rank-one tensor. The i_0-th atom can be rewritten as D^(i_0) = d_{i_0}^(1) ∘ d_{i_0}^(2) ∘ d_{i_0}^(3), where d_{i_0}^(1) ∈ R^{I_1}, d_{i_0}^(2) ∈ R^{I_2}, and d_{i_0}^(3) ∈ R^{I_3} are normalized vectors.
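As a small illustration, a rank-one third-order atom D^(i_0) = d^(1) ∘ d^(2) ∘ d^(3) can be built from three normalized vectors as follows (names are illustrative):

```python
import numpy as np

def rank_one_atom(d1, d2, d3):
    """Rank-one third-order atom as the outer product of three unit-norm vectors."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    d3 = d3 / np.linalg.norm(d3)
    return np.einsum('i,j,k->ijk', d1, d2, d3)   # shape (I1, I2, I3)
```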
Likewise, the optimization model (3.31) can be solved in an alternating way. With a fixed dictionary, the sparse tensor coding model can be formulated as follows:

min_{x_n} ‖Y_n − D ×_4 x_n‖_F^2,  s.t. ‖x_n‖_0 ≤ K.    (3.32)

Following the idea of OMP [16], the multilinear orthogonal matching pursuit (MOMP) algorithm can be developed for solving (3.32). Algorithm 21 presents the details of the MOMP algorithm.
Update the dictionary by adding the atom corresponding to the maximal P_{i_0}:
  i = arg max_{i_0} P_{i_0}, I = {I, i};
  D̃_k = d_i^(1) ∘ d_i^(2) ∘ d_i^(3);
  D̃ = [D̃_1, ..., D̃_k].
Update the sparse coefficient: a_k = arg min_a ‖Y_n − D̃ ×_4 a‖_F^2.
Compute the residual: R_{k+1} = Y_n − D̃ ×_4 a_k.
Update the counter: k = k + 1.
end
Return x_n(I) = a_k.
To update the i_0-th atom D^(i_0), we need to fix the other atoms. Analogously, the third-order tensors Y_n can be stacked into a fourth-order tensor Y ∈ R^{I_1×I_2×I_3×N} with Y(:, :, :, n) = Y_n, n = 1, ..., N. The atom update in (3.31) then reduces to finding the best rank-one approximation of the restricted residual tensor, with

D^(i_0) = d_{i_0}^(1) ∘ d_{i_0}^(2) ∘ d_{i_0}^(3),
x_nz = λ_{i_0} d_{i_0}^(4),    (3.36)

where x_nz denotes the nonzero coefficients associated with the i_0-th atom over the training samples.
In [3], the tensor dictionary learning problem can also be solved easily and
efficiently by alternating direction method of multipliers (ADMM). The augmented
Lagrangian function of the problem (3.31) is formulated as follows:
L(X, G, D, Q) = (1/2) ‖Y − D ×_4 X‖_F^2 + λ Σ_{n=1}^N ‖G(n, :)‖_0 + ⟨Q, G − X⟩ + (η/2) ‖G − X‖_F^2,
Based on the t-product, the sparse representation of third-order tensor data can be formulated as follows:

Y = D ∗ X,    (3.37)

where Y ∈ R^{I_1×N×I_2}, D ∈ R^{I_1×M×I_2} is the tensor dictionary, and X ∈ R^{M×N×I_2} is the sparse coefficient.
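For reference, the t-product in (3.37) can be computed slice-wise in the Fourier domain along the third mode, as in this minimal sketch (assuming the usual FFT-based definition of the t-product; names are illustrative):

```python
import numpy as np

def t_product(D, X):
    """t-product Y = D * X for D of shape (I1, M, I2) and X of shape (M, N, I2)."""
    I2 = D.shape[2]
    D_hat = np.fft.fft(D, axis=2)
    X_hat = np.fft.fft(X, axis=2)
    Y_hat = np.empty((D.shape[0], X.shape[1], I2), dtype=complex)
    for i in range(I2):
        # Each frontal slice in the Fourier domain is an ordinary matrix product.
        Y_hat[:, :, i] = D_hat[:, :, i] @ X_hat[:, :, i]
    return np.real(np.fft.ifft(Y_hat, axis=2))
```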
Corresponding to different sparsity constraints, there are different tensor sparse coding schemes based on the t-product. The optimization model for t-product-based tensor dictionary learning can be formulated as follows:

min_{X, D} ‖Y − D ∗ X‖_F^2
s.t. g(X) ≤ K,    (3.38)

where g(X) can be the ℓ_1 norm ‖X‖_1 = Σ_{m,n,i_2} |X(m, n, i_2)|, or the ℓ_{1,1,2} norm defined by ‖X‖_{1,1,2} = Σ_{m,n} ‖X(m, n, :)‖_2, which measures the tubal sparsity.
Figure 3.4 illustrates the t-product-based sparse tensor representation with different sparsity measurements, in which red blocks represent nonzero elements and Y_n denotes the n-th lateral slice of Y.
For example, in [65], Zhang et al. use the ‖·‖_{1,1,2} norm to represent the tubal sparsity of a tensor, and the corresponding optimization model is given by

min_{D, X} (1/2) ‖Y − D ∗ X‖_F^2 + λ ‖X‖_{1,1,2},    (3.39)

where λ > 0 is a penalty parameter. Motivated by the classical K-SVD algorithm, the K-TSVD algorithm is presented in Algorithm 23 to solve the t-product-based tensor dictionary learning problem (3.39).
Especially, the optimization problem with respect to X can be reformulated in the Fourier domain as

min_{X̂} (1/2) ‖Ŷ − D̂ X̂‖_F^2 + λ I_2 ‖X̂‖_{1,1,2},    (3.40)

where Ŷ, D̂, and X̂ denote the corresponding tensors after taking the FFT along the third mode, using the fact that ‖X‖_F = ‖X̂‖_F / √I_2.

Fig. 3.4 Sparse tensor representation based on the t-product: (a) with the ℓ_1 constraint or (b) with tubal sparsity constrained by the ℓ_{1,1,2} norm

The problem (3.40) can be solved easily and efficiently by ADMM. We can divide the problem (3.40) into three subproblems as follows:
X_{k+1} = arg min_X ‖Y − DX‖_F^2 + tr(L_k^T X) + (ρ/2) ‖X − Z_k‖_F^2,
Z_{k+1} = arg min_Z ‖Z‖_{1,1,2} + (ρ/(2λ)) ‖X_{k+1} + (1/ρ) L_k − Z‖_F^2,    (3.41)
L_{k+1} = L_k + ρ (X_{k+1} − Z_{k+1}),
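The Z-subproblem above is a proximal step for the ℓ_{1,1,2} norm, which shrinks each mode-3 tube by its Euclidean norm; a minimal sketch (variable names and the way the threshold is passed are illustrative):

```python
import numpy as np

def prox_l112(T, tau):
    """Proximal operator of tau*||.||_{1,1,2}: group soft-thresholding of tubes T(m, n, :)."""
    norms = np.linalg.norm(T, axis=2, keepdims=True)          # tube-wise l2 norms
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * T

# Z-update of (3.41): Z_{k+1} = prox_{(lambda/rho)*||.||_{1,1,2}}( X_{k+1} + L_k / rho ),
# e.g., Z = prox_l112(X_new + L / rho, lam / rho)
```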
More generally, the t-product-based tensor dictionary learning model with normalized atoms can be written as

min_{D, X} (1/2) ‖Y − D ∗ X‖_F^2 + λ g(X)
s.t. ‖D(:, m, :)‖_F^2 ≤ 1, m = 1, ..., M,    (3.42)

where g(X) is ‖X‖_1 in [23] or ‖X‖_{1,1,2} in [24]. With a given tensor dictionary, this kind of sparse tensor coding problem can be solved by a tensor-based fast iterative shrinkage thresholding algorithm (TFISTA) [24], and the dictionary can be updated by ADMM in the frequency domain when the sparse coefficient is fixed.
The last section lists a series of algorithms based on different tensor decomposition forms, and most of them can be regarded as based on the synthesis sparse tensor model. Correspondingly, there exist some analysis tensor dictionary learning methods. For example, analysis tensor dictionary learning in the form of Tucker decomposition can be found in [10, 40].
The basic model can be written as

min_{X, {Ω_d}_{d=1}^D} ‖X − Y ×_1 Ω_1 ×_2 Ω_2 ··· ×_D Ω_D‖_F^2
s.t. ‖X‖_0 ≤ P − K,    (3.43)

where X is K-cosparse, i.e., it has at most P − K nonzero entries, P = Π_d P_d, and Ω_d ∈ R^{P_d×I_d} with P_d ≥ I_d is the d-th analysis dictionary, d = 1, 2, ..., D.
With an additional row normalization constraint, the target problem in [10] is as follows:

min_{X, {Ω_d}_{d=1}^D} ‖X − Y ×_1 Ω_1 ×_2 Ω_2 ··· ×_D Ω_D‖_F^2
s.t. ‖X‖_0 = P − K,    (3.44)
     ‖Ω_d(p_d, :)‖_2 = 1, p_d = 1, 2, ..., P_d; d = 1, 2, ..., D,
In practice, the clean data Y is observed through a noisy measurement

Q = Y + E,    (3.46)

where E is additive white Gaussian noise. In this case, the optimization model for updating the analysis dictionaries {Ω_d}_{d=1}^D and recovering the clean data Y is as follows [40]:

min_{{Ω_d}_{d=1}^D, Y} ‖Q − Y‖_F^2
s.t. ‖Y ×_1 Ω_1 ×_2 Ω_2 ··· ×_D Ω_D‖_0 = P − K,    (3.47)
     ‖Ω_d(p_d, :)‖_2 = 1, p_d = 1, 2, ..., P_d; d = 1, 2, ..., D.
where ω_p is the p-th row of Ω. Thus the problem (3.48) is a one-dimensional sparse coding problem, which can be solved using backward greedy (BG) [47], greedy analysis pursuit (GAP) [32], and so on.
2. Subproblem with respect to Ω_d: The optimization model for updating Ω_d is nonconvex, and an alternating optimization algorithm is employed. With an auxiliary matrix U_d built from the current estimates of the other variables, the cosparsity constraint becomes

‖Ω_d U_d‖_0 = P − K.    (3.49)

The solution can refer to Sect. 3.1.2.3, and the detailed updating procedures are summarized in Algorithm 24.
In addition to the Tucker form, convolutional analysis models for dictionary learning have also been developed recently [58]. The corresponding optimization model can be written as

min_{{O_d}_{d=1}^D, Y} Σ_{d=1}^D ‖O_d ⊛ Y‖_0,    (3.51)

where Y ∈ R^{I_1×I_2×···×I_N} is the input tensor, ⊛ denotes the convolution operator as defined in Chap. 1, and O_d ∈ R^{I_1×I_2×···×I_N} (d = 1, ..., D) are a series of analysis dictionary atoms to be estimated.
In order to avoid trivial solutions, the N-th-order tensor atom O_d can be decomposed as the convolution of two N-th-order tensors of smaller size. For convenience, a double index is used to represent the D dictionary atoms as O_{1,1}, O_{1,2}, ..., O_{I,J} with IJ = D. Then the orthogonality-constrained convolutional factorization of the (i, j)-th atom O_{i,j} can be represented as O_{i,j} = U_i ⊛ V_j, where the factors are constrained through orthogonality conditions expressed with the Kronecker delta

δ_{i,j} = 1 if i = j, and 0 otherwise.

The resulting optimization model is

min_{Y, O} Σ_{i,j=1}^{I,J} ‖O_{i,j} ⊛ Y‖_0    (3.53)
s.t. O_{i,j} = U_i ⊛ V_j, ∀ i, j,
     U_i ∈ U, V_j ∈ V,

where U and V denote the corresponding constraint sets of the factors.
The previously discussed tensor dictionary learning methods are offline: each iteration needs to be computed with all training samples, which requires large storage and high computational complexity. Especially when the samples are large-scale, the learning can be very slow. Online dictionary learning is one of the main ways to achieve real-time processing when the data are acquired dynamically.
There are several online tensor dictionary learning algorithms for fast dictionary learning [26, 43, 50, 53, 54, 68]. The data for online tensor dictionary learning are assumed to be an observed sequence of t D-th-order tensors Y_τ ∈ R^{I_1×I_2×···×I_D} (τ = 1, ..., t), and Y = {Y_1, Y_2, ..., Y_t} accumulates the previously acquired training signals. Therefore, we can build an optimization model of tensor dictionary learning in the Tucker form [53] as follows:
min_{X, {D_d}_{d=1}^D} (1/2) ‖Y − X ×_1 D_1 ×_2 D_2 ··· ×_D D_D‖_F^2 + g_1(X) + g_2(D_1, ..., D_D),    (3.54)

where g_1, g_2 are general penalty functions.
For each newly added tensor Yt , two processes are executed to retrain the
dictionaries: sparse coding and dictionary updating.
1. Sparse tensor coding of X_t: Online sparse coding is different from offline sparse coding. When new training data arrives, only the sparse coefficient with respect to the newly added sample needs to be obtained. Therefore, with all dictionary matrices fixed and X_τ (τ = 1, ..., t − 1) unchanged, the sparse representation X_t for the newly added tensor Y_t can be computed. The optimization model for sparse tensor coding is as follows:

X_t = arg min_X (1/2) ‖Y_t − X ×_1 D_1^{t−1} ×_2 D_2^{t−1} ··· ×_D D_D^{t−1}‖_F^2 + g_1(X),    (3.55)

where D_d^{t−1} ∈ R^{I_d×M_d} (d = 1, ..., D) represent the dictionaries updated at the last iteration.
We set f(X) = (1/2) ‖Y_t − X ×_1 D_1^{t−1} ×_2 D_2^{t−1} ··· ×_D D_D^{t−1}‖_F^2, which is differentiable. Therefore, (3.55) can be tackled by a proximal minimization approach [7], and the detailed updating procedures are summarized in Algorithm 25, where prox_{g_1} is a proximal operator and

∂f(X)/∂X = −Y_t ×_1 (D_1^{t−1})^T ×_2 (D_2^{t−1})^T ··· ×_D (D_D^{t−1})^T
           + X ×_1 (D_1^{t−1})^T D_1^{t−1} ×_2 (D_2^{t−1})^T D_2^{t−1} ··· ×_D (D_D^{t−1})^T D_D^{t−1}.
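Using the gradient expression above, one proximal step of the online coding can be sketched as follows, assuming g_1 = λ‖·‖_1 so that prox_{g_1} is entry-wise soft-thresholding (the helper function and names are illustrative):

```python
import numpy as np

def mode_product(T, M, mode):
    out = np.tensordot(M, np.moveaxis(T, mode, 0), axes=([1], [0]))
    return np.moveaxis(out, 0, mode)

def online_coding_step(X, Y_t, dicts, lam, step):
    """One proximal gradient step for (3.55) with g1 = lam*||.||_1."""
    # grad = -Y_t x_d D_d^T + X x_d (D_d^T D_d), over all modes d
    grad = -Y_t
    for d, D in enumerate(dicts):
        grad = mode_product(grad, D.T, d)
    term = X
    for d, D in enumerate(dicts):
        term = mode_product(term, D.T @ D, d)
    grad = grad + term
    Xn = X - step * grad
    return np.sign(Xn) * np.maximum(np.abs(Xn) - step * lam, 0.0)  # prox of lam*||.||_1
```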
2. Tensor dictionary update: For updating the dictionary D_d with fixed X, the optimization model can be formulated as

D_d^t = arg min_{D_d} L_{d,t}(D_d),    (3.56)

with

L_{d,t}(D_d) = (1/t) Σ_{τ=1}^t (1/2) ‖Y_τ − X_τ ×_1 D_1^t ··· ×_{d−1} D_{d−1}^t ×_d D_d ×_{d+1} D_{d+1}^{t−1} ··· ×_D D_D^{t−1}‖_F^2
             + g_2(D_1^t, ..., D_{d−1}^t, D_d, D_{d+1}^{t−1}, ..., D_D^{t−1}).    (3.57)
The past information can be accumulated in the statistics

U_d^t = U_d^{t−1} + N_{t,(d)} X_{t,(d)}^T,
V_d^t = V_d^{t−1} + M_{t,(d)} M_{t,(d)}^T.    (3.63)
Then the solution of the problem (3.56) is presented in Algorithm 26. Finally, the whole process of the online tensor dictionary learning (OTL) method is summarized in Algorithm 27.
Besides the Tucker-based model, some other CP-based models and algorithms
are also explored for online dictionary learning, such as online nonnegative CPD
[50] and TensorNOODL [43].
3.5 Applications
Tensor dictionary learning has been widely used in data processing applications, such as dimensionality reduction, classification, medical image reconstruction, and image denoising and fusion. Here we briefly demonstrate its effectiveness on image denoising and image fusion by numerical experiments.
In this group of experiments, the CAVE dataset [62] is used; it includes 32 scenes separated into 5 sections. Each hyperspectral image has a spatial resolution of 512 × 512 and 31 spectral bands, covering full spectral resolution reflectance from 400 to 700 nm in 10 nm steps. The dataset contains various real-world materials and objects. As additive white Gaussian noise (AWGN) can come from many natural sources, we perturb each image with Gaussian noise of standard deviation σ = 0.1. Several methods are selected for comparison, and all parameters involved in the algorithms are chosen as described in the reference papers.
Comparison methods: Matrix-based dictionary learning method K-SVD [1] is
chosen as a baseline. For tensor-based methods, we choose LRTA [45], PARAFAC
[29], TDL [36], KBRreg [57], and LTDL [20], where TDL and LTDL are tensor
dictionary learning methods. Here is a brief introduction to these five tensor-based
approaches.
• LRTA [45]: a low-rank tensor approximation method based on Tucker decompo-
sition.
• PARAFAC [29]: a low-rank tensor approximation method based on parallel
factor analysis (PARAFAC).
• TDL [36]: a tensor dictionary learning model by combining the nonlocal
similarity in space with global correlation in spectrum.
• KBR-denoising [57]: a tensor sparse coding model based on Kronecker-basis
representation for tensor sparsity measure.
• LTDL [20]: a tensor dictionary learning model which considers the nearly low-
rank approximation in a group of similar blocks and the global sparsity.
The denoising performance of different methods is compared in Table 3.1, and the denoised images are shown in Fig. 3.5. Obviously, all tensor-based methods perform better than K-SVD [1], which illustrates that tensors can preserve the original structure of the data. TDL [36], KBR [57], and LTDL [20] exploit both sparsity and low-rankness, and it can be observed that these three methods outperform the methods based on sparsity or low-rankness alone, both in the image details and in the four performance indicators.
In the image fusion experiment, the high-resolution hyperspectral image (HR-HSI) X is represented in the Tucker dictionary form

X = C ×_1 D_1 ×_2 D_2 ×_3 D_3,    (3.64)

and the observed low-resolution hyperspectral image (LR-HSI) is

Y = X ×_1 Φ_1 ×_2 Φ_2,    (3.65)

where Φ_1 and Φ_2 are the downsampling matrices on the spatial width mode and height mode, respectively.
If we only take a few spectral bands from the HSI, the other practically acquired high-resolution multispectral image (HR-MSI) Z can be regarded as the spectrally downsampled version of X:
Z = X ×_3 Φ_3.    (3.66)

Substituting (3.64) into (3.65) and (3.66) gives

Y = C ×_1 (Φ_1 D_1) ×_2 (Φ_2 D_2) ×_3 D_3 = C ×_1 D_1^* ×_2 D_2^* ×_3 D_3,
Z = C ×_1 D_1 ×_2 D_2 ×_3 (Φ_3 D_3) = C ×_1 D_1 ×_2 D_2 ×_3 D_3^*,    (3.67)

where D_1^*, D_2^*, D_3^* are the downsampled width dictionary, downsampled height dictionary, and downsampled spectral dictionary, respectively.
Thus the fusion problem can be formulated as the following least-squares optimization problem:

min_{D_1, D_2, D_3, C} ‖Y − C ×_1 D_1^* ×_2 D_2^* ×_3 D_3‖_F^2 + ‖Z − C ×_1 D_1 ×_2 D_2 ×_3 D_3^*‖_F^2
s.t. g(C) ≤ K.    (3.68)
Popular solutions for this problem include the matrix dictionary learning method
which is a generalization of simultaneous orthogonal matching pursuit (GSOMP)
[4] and tensor dictionary learning methods like non-local sparse tensor factorization
(NLSTF) [15] and coupled sparse tensor factorization (CSTF) [27]. In this section,
we conduct a simple experiment to verify the effectiveness of these dictionary
learning methods for fusing hyperspectral and multispectral images.
In this group of experiments, we use the University of Pavia image [14] as the original HR-HSI. It covers the urban area of the University of Pavia, Italy, and was acquired by the reflective optics system imaging spectrometer (ROSIS) optical sensor. The image has a size of 610 × 340 × 115, and a cropped version of size 256 × 256 × 93 is used as the reference image.
To generate the LR-HSI of size 32 × 32 × 93, the HR-HSI is downsampled by averaging 8 × 8 disjoint spatial blocks. The HR-MSI Z is generated by downsampling X with the spectral downsampling matrix Φ_3, which corresponds to selecting several bands of the HR-HSI X.
As for the quality of reconstructed HSI, six quantitative metrics are used for
evaluation, which are the root mean square error (RMSE), the relative dimensionless
global error in synthesis (ERGAS), the spectral angle mapper (SAM), the universal
image quality index (UIQI), the structural similarity (SSIM), and the computational
time. The performances are reported in Table 3.2. We can see that NLSTF [15]
and CSTF [27] outperform GSOMP [4] in terms of reconstruction accuracy and
computational time, which further illustrates the superiority of learned dictionaries
compared with fixed basis.
3.6 Summary
In this chapter, we briefly introduce the matrix dictionary learning technique and, based on it, review tensor dictionary learning as its extension. Generally we divide it into two kinds of subproblems: sparse tensor coding and tensor dictionary updating. From different aspects, we discuss sparse and cosparse tensor dictionary learning, tensor dictionary learning based on different tensor decomposition forms, and offline and online tensor dictionary learning. The corresponding motivations, optimization models, and typical algorithms are summarized as well. As a basic multilinear data-driven model for data processing, tensor dictionary learning can be widely used in signal processing, biomedical engineering, finance, computer vision, remote sensing, etc. We demonstrate its effectiveness on image denoising and data fusion in numerical experiments.
In this chapter, we mainly focus on classical tensor decomposition forms to extend dictionary learning. In fact, some of the most recently developed tensor networks may motivate new tensor dictionary learning methods with better performance. Besides the standard ℓ_0 norm for the sparsity constraint, other structural sparsity can be enforced in some applications. In addition, other structures may be imposed on the dictionaries, which may strike a good balance between learned dictionaries and mathematically designed dictionaries. Inspired by the recently proposed deep matrix factorization, deep tensor dictionary learning may also be developed.
References
1. Aharon, M., Elad, M., Bruckstein, A.: K-SVD: an algorithm for designing overcomplete
dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006)
2. Ahmed, N., Natarajan, T., Rao, K.R.: Discrete cosine transform. IEEE Trans. Comput. 100(1),
90–93 (1974)
3. Aidini, A., Tsagkatakis, G., Tsakalides, P.: Tensor dictionary learning with representation
quantization for remote sensing observation compression. In: 2020 Data Compression
Conference (DCC), pp. 283–292. IEEE, New York (2020)
4. Akhtar, N., Shafait, F., Mian, A.: Sparse spatio-spectral representation for hyperspectral image
super-resolution. In: European Conference on Computer Vision, pp. 63–78. Springer, Berlin
(2014)
5. Antonini, M., Barlaud, M., Mathieu, P., Daubechies, I.: Image coding using wavelet transform.
IEEE Trans. Image Process. 1(2), 205–220 (1992)
6. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
7. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex
and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
8. Caiafa, C.F., Cichocki, A.: Computing sparse representations of multidimensional signals using
kronecker bases. Neural Comput. 25(1), 186–220 (2013)
9. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM
Rev. 43(1), 129–159 (2001)
10. Chen, G., Zhou, Q., Li, G., Zhang, X.P., Qu, C.: Tensor based analysis dictionary learning for
color video denoising. In: 2019 IEEE International Conference on Signal, Information and
Data Processing (ICSIDP), pp. 1–4. IEEE, New York (2019)
11. Cohen, J.E., Gillis, N.: Dictionary-based tensor canonical polyadic decomposition. IEEE
Trans. Signal Process. 66(7), 1876–1889 (2017)
12. Dai, W., Xu, T., Wang, W.: Simultaneous codeword optimization (SimCO) for dictionary
update and learning. IEEE Trans. Signal Process. 60(12), 6340–6353 (2012)
13. Dantas, C.F., Cohen, J.E., Gribonval, R.: Learning tensor-structured dictionaries with applica-
tion to hyperspectral image denoising. In: 2019 27th European Signal Processing Conference
(EUSIPCO), pp. 1–5. IEEE, New York (2019)
14. Dell’Acqua, F., Gamba, P., Ferrari, A., Palmason, J.A., Benediktsson, J.A., Árnason, K.:
Exploiting spectral and spatial information in hyperspectral urban data with high resolution.
IEEE Geosci. Remote Sens. Lett. 1(4), 322–326 (2004)
15. Dian, R., Fang, L., Li, S.: Hyperspectral image super-resolution via non-local sparse tensor
factorization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 5344–5353 (2017)
16. Donoho, D.L., Elad, M., Temlyakov, V.N.: Stable recovery of sparse overcomplete representa-
tions in the presence of noise. IEEE Trans. Inf. Theory 52(1), 6–18 (2005)
17. Duan, G., Wang, H., Liu, Z., Deng, J., Chen, Y.W.: K-CPD: Learning of overcomplete
dictionaries for tensor sparse coding. In: Proceedings of the 21st International Conference
on Pattern Recognition (ICPR2012), pp. 493–496. IEEE, New York (2012)
18. Engan, K., Aase, S.O., Husoy, J.H.: Method of optimal directions for frame design. In: 1999
IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings.
ICASSP99 (Cat. No. 99CH36258), vol. 5, pp. 2443–2446. IEEE, New York (1999)
19. Fu, Y., Gao, J., Sun, Y., Hong, X.: Joint multiple dictionary learning for tensor sparse coding.
In: 2014 International Joint Conference on Neural Networks (IJCNN), pp. 2957–2964. IEEE,
New York (2014)
20. Gong, X., Chen, W., Chen, J.: A low-rank tensor dictionary learning method for hyperspectral
image denoising. IEEE Trans. Signal Process. 68, 1168–1180 (2020)
21. Gorodnitsky, I.F., Rao, B.D.: Sparse signal reconstruction from limited data using focuss: a
re-weighted minimum norm algorithm. IEEE Trans. Signal Process. 45(3), 600–616 (1997)
22. Huang, J., Zhou, G., Yu, G.: Orthogonal tensor dictionary learning for accelerated dynamic
MRI. Med. Biol. Eng. Comput. 57(9), 1933–1946 (2019)
23. Jiang, F., Liu, X.Y., Lu, H., Shen, R.: Efficient two-dimensional sparse coding using tensor-
linear combination (2017). Preprint, arXiv:1703.09690
24. Jiang, F., Liu, X.Y., Lu, H., Shen, R.: Efficient multi-dimensional tensor sparse coding using t-
linear combination. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32
(2018)
25. Kreutz-Delgado, K., Murray, J.F., Rao, B.D., Engan, K., Lee, T.W., Sejnowski, T.J.: Dictionary
learning algorithms for sparse representation. Neural Comput. 15(2), 349–396 (2003)
26. Li, P., Feng, J., Jin, X., Zhang, L., Xu, X., Yan, S.: Online robust low-rank tensor learning.
In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2180–
2186 (2017)
27. Li, S., Dian, R., Fang, L., Bioucas-Dias, J.M.: Fusing hyperspectral and multispectral images
via coupled sparse tensor factorization. IEEE Trans. Image Process. 27(8), 4118–4130 (2018)
28. Lim, W.Q.: The discrete shearlet transform: a new directional transform and compactly
supported shearlet frames. IEEE Trans. Image Process. 19(5), 1166–1180 (2010)
29. Liu, X., Bourennane, S., Fossati, C.: Denoising of hyperspectral images using the PARAFAC
model and statistical performance analysis. IEEE Trans. Geosci. Remote Sens. 50(10), 3717–
3724 (2012)
30. Liu, E., Payani, A., Fekri, F.: Seismic data compression using online double-sparse dictionary
learning schemes. Foreword by Directors, p. 6 (2017)
31. Mallat, S.G., Zhang, Z.: Matching pursuits with timefrequency dictionaries. IEEE Trans.
Signal Process. 41(12), 3397–3415 (1993)
32. Nam, S., Davies, M.E., Elad, M., Gribonval, R.: The cosparse analysis model and algorithms.
Appl. Comput. Harmon. Anal. 34(1), 30–56 (2013)
33. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive field properties by learning a
sparse code for natural images. Nature 381(6583), 607–609 (1996)
34. Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: a strategy
employed by v1? Vis. Res. 37(23), 3311–3325 (1997)
35. Pati, Y.C., Rezaiifar, R., Krishnaprasad, P.S.: Orthogonal matching pursuit: recursive function
approximation with applications to wavelet decomposition. In: Proceedings of 27th Asilomar
Conference on Signals, Systems and Computers, pp. 40–44. IEEE, New York (1993)
36. Peng, Y., Meng, D., Xu, Z., Gao, C., Yang, Y., Zhang, B.: Decomposable nonlocal tensor
dictionary learning for multispectral image denoising. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 2949–2956 (2014)
37. Peng, Y., Li, L., Liu, S., Wang, X., Li, J.: Weighted constraint based dictionary learning for
image classification. Pattern Recogn. Lett. 130, 99–106 (2020)
38. Petersen, L., Sprunger, P., Hofmann, P., Lægsgaard, E., Briner, B., Doering, M., Rust, H.P.,
Bradshaw, A., Besenbacher, F., Plummer, E.: Direct imaging of the two-dimensional fermi
contour: Fourier-transform STM. Phys. Rev. B 57(12), R6858 (1998)
39. Qi, N., Shi, Y., Sun, X., Yin, B.: TenSR: Multi-dimensional tensor sparse representation. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5916–
5925 (2016)
40. Qi, N., Shi, Y., Sun, X., Wang, J., Yin, B., Gao, J.: Multi-dimensional sparse models. IEEE
Trans. Pattern Anal. Mach. Intell. 40(1), 163–178 (2017)
41. Qiu, Q., Patel, V.M., Chellappa, R.: Information-theoretic dictionary learning for image
classification. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2173–2184 (2014)
42. Quan, Y., Huang, Y., Ji, H.: Dynamic texture recognition via orthogonal tensor dictionary
learning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 73–81
(2015)
43. Rambhatla, S., Li, X., Haupt, J.: Provable online CP/PARAFAC decomposition of a structured
tensor via dictionary learning (2020). arXiv e-prints 33, arXiv–2006
44. Ramirez, I., Sprechmann, P., Sapiro, G.: Classification and clustering via dictionary learning
with structured incoherence and shared features. In: 2010 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, pp. 3501–3508. IEEE, New York (2010)
45. Renard, N., Bourennane, S., Blanc-Talon, J.: Denoising and dimensionality reduction using
multilinear tools for hyperspectral images. IEEE Geosci. Remote Sens. Lett. 5(2), 138–142
(2008)
46. Roemer, F., Del Galdo, G., Haardt, M.: Tensor-based algorithms for learning multidimensional
separable dictionaries. In: 2014 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 3963–3967. IEEE, New York (2014)
47. Rubinstein, R., Peleg, T., Elad, M.: Analysis K-SVD: A dictionary-learning algorithm for the
analysis sparse model. IEEE Trans. Signal Process. 61(3), 661–677 (2012)
48. Soltani, S., Kilmer, M.E., Hansen, P.C.: A tensor-based dictionary learning approach to
tomographic image reconstruction. BIT Numer. Math. 56(4), 1425–1454 (2016)
49. Sprechmann, P., Sapiro, G.: Dictionary learning and sparse coding for unsupervised clustering.
In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.
2042–2045. IEEE, New York (2010)
50. Strohmeier, C., Lyu, H., Needell, D.: Online nonnegative tensor factorization and cp-dictionary
learning for Markovian data (2020). arXiv e-prints pp. arXiv–2009
51. Tan, S., Zhang, Y., Wang, G., Mou, X., Cao, G., Wu, Z., Yu, H.: Tensor-based dictionary
learning for dynamic tomographic reconstruction. Phys. Med. Biol. 60(7), 2803 (2015)
52. Tosic, I., Frossard, P.: Dictionary learning. IEEE Signal Process. Mag. 28(2), 27–38 (2011)
53. Traoré, A., Berar, M., Rakotomamonjy, A.: Online multimodal dictionary learning. Neuro-
computing 368, 163–179 (2019)
54. Variddhisai, T., Mandic, D.: Online multilinear dictionary learning (2017). arXiv e-prints pp.
arXiv–1703
55. Vidal, R., Ma, Y., Sastry, S.: Generalized principal component analysis (GPCA). IEEE Trans.
Pattern Anal. Mach. Intell. 27(12), 1945–1959 (2005)
56. Wang, S., Zhang, L., Liang, Y., Pan, Q.: Semi-coupled dictionary learning with applications to
image super-resolution and photo-sketch synthesis. In: 2012 IEEE Conference on Computer
Vision and Pattern Recognition, pp. 2216–2223. IEEE, New York (2012)
57. Xie, Q., Zhao, Q., Meng, D., Xu, Z.: Kronecker-basis-representation based tensor sparsity and
its applications to tensor recovery. IEEE Trans. Pattern Anal. Mach. Intell. 40(8), 1888–1902
(2017)
58. Xu, R., Xu, Y., Quan, Y.: Factorized tensor dictionary learning for visual tensor data
completion. IEEE Trans. Multimedia 23, 1225–1238 (2020)
59. Yaghoobi, M., Daudet, L., Davies, M.E.: Parametric dictionary design for sparse coding. IEEE
Trans. Signal Process. 57(12), 4800–4810 (2009)
60. Yang, M., Zhang, L., Feng, X., Zhang, D.: Sparse representation based fisher discrimination
dictionary learning for image classification. Int. J. Comput. Vis. 109(3), 209–232 (2014)
61. Yao, B., Fei-Fei, L.: Grouplet: a structured image representation for recognizing human and
object interactions. In: 2010 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, pp. 9–16. IEEE, New York (2010)
62. Yasuma, F., Mitsunaga, T., Iso, D., Nayar, S.K.: Generalized assorted pixel camera: postcapture
control of resolution, dynamic range, and spectrum. IEEE Trans. Image Process. 19(9), 2241–
2253 (2010)
63. Zhai, L., Zhang, Y., Lv, H., Fu, S., Yu, H.: Multiscale tensor dictionary learning approach for
multispectral image denoising. IEEE Access 6, 51898–51910 (2018)
64. Zhang, Q., Li, B.: Discriminative K-SVD for dictionary learning in face recognition. In: 2010
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2691–
2698. IEEE, New York (2010)
65. Zhang, Z., Aeron, S.: Denoising and completion of 3d data via multidimensional dictionary
learning (2015). Preprint, arXiv:1512.09227
66. Zhang, L., Wei, W., Zhang, Y., Shen, C., Van Den Hengel, A., Shi, Q.: Dictionary learning
for promoting structured sparsity in hyperspectral compressive sensing. IEEE Trans. Geosci.
Remote Sens. 54(12), 7223–7235 (2016)
67. Zhang, Y., Mou, X., Wang, G., Yu, H.: Tensor-based dictionary learning for spectral CT
reconstruction. IEEE Trans. Med. Imaging 36(1), 142–154 (2016)
68. Zhao, R., Wang, Q.: Learning separable dictionaries for sparse tensor representation: an online
approach. IEEE Trans. Circuits Syst. Express Briefs 66(3), 502–506 (2018)
69. Zubair, S., Wang, W.: Tensor dictionary learning with sparse tucker decomposition. In: 2013
18th International Conference on Digital Signal Processing (DSP), pp. 1–6. IEEE, New York
(2013)
70. Zubair, S., Yan, F., Wang, W.: Dictionary learning based sparse coefficients for audio
classification with max and average pooling. Digital Signal Process. 23(3), 960–970 (2013)
Chapter 4
Low-Rank Tensor Recovery
4.1 Introduction
In low-rank matrix completion, one needs to find a matrix X of smallest rank which matches the observed matrix M ∈ R^{I_1×I_2} at all indices in the observation set O. Since rank minimization is NP-hard, the rank is commonly replaced by its tightest convex surrogate, the nuclear norm, leading to

min_X ‖X‖_*
s.t. P_O(X) = P_O(M),

where the minimization of the nuclear norm can be achieved using the singular value thresholding algorithm, resulting in a computational complexity of O(I^3), where I = max(I_1, I_2).
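As a rough illustration of this idea, the following sketch alternates singular value shrinkage with re-imposing the observed entries; the threshold, iteration count, and names are illustrative choices, and this is a simplified scheme rather than the exact SVT algorithm analyzed in the literature.

```python
import numpy as np

def svt_complete(M, mask, tau=5.0, n_iter=200):
    """Shrinkage-based completion: keep observed entries of M (mask==True) and
    fill the rest by iterating a singular value thresholding step."""
    X = np.where(mask, M, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt      # shrink singular values by tau
        X = np.where(mask, M, X)                     # re-impose the observed entries
    return X
```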
In [11], Candès and Tao proved that, for a matrix X of rank R, the number of samples required to exactly recover M is on the order of Cμ^2 I R log^6 I with probability at least 1 − I^{−3}, where C is a constant and μ is a parameter that satisfies the strong incoherence property.
Since low-rank matrix completion can also be regarded as matrix factorization with missing components, another equivalent formulation is

min_{U,V} (1/2) ‖P_O(UV^T) − P_O(M)‖_F^2,

where U ∈ R^{I_1×R} and V ∈ R^{I_2×R}. For this group of approaches, the rank R needs to be given in advance, which requires a lot of time for parameter tuning.
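A minimal alternating least squares sketch of this factorization approach, updating each row of U and V against the observed entries only (the rank R, the regularization, and the names are illustrative assumptions):

```python
import numpy as np

def als_complete(M, mask, R=10, n_iter=50, reg=1e-3):
    """Alternating least squares for min ||P_O(U V^T - M)||_F^2 with given rank R."""
    I1, I2 = M.shape
    rng = np.random.default_rng(0)
    U, V = rng.standard_normal((I1, R)), rng.standard_normal((I2, R))
    for _ in range(n_iter):
        for i in range(I1):                      # update each row of U
            idx = np.nonzero(mask[i, :])[0]
            A = V[idx].T @ V[idx] + reg * np.eye(R)
            U[i] = np.linalg.solve(A, V[idx].T @ M[i, idx])
        for j in range(I2):                      # update each row of V
            idx = np.nonzero(mask[:, j])[0]
            A = U[idx].T @ U[idx] + reg * np.eye(R)
            V[j] = np.linalg.solve(A, U[idx].T @ M[idx, j])
    return U @ V.T
```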
In [25], Keshavan, Montanari, and Oh showed that for their algorithm the squared error is bounded as ε = ‖M̂ − M‖_F^2 ≤ C max{M} √(I_1 I_2 R) / |O| with probability 1 − 1/I^3, I = max{I_1, I_2}. Using the incoherence condition, the authors of [21] showed that their algorithm can recover an incoherent matrix M in O(log(1/ε)) steps with |O| = O((σ_1/σ_R)^6 R^7 log(I R ‖M‖_F / ε)) random entries by the perturbed power method, where σ_1 and σ_R are the largest and the smallest singular values, respectively.
Fig. 4.1 A simple tensor completion case. A third-order tensor is shown in the form of frontal slices with 11 observed values. Using only the assumption that the tensor admits a rank-1 CP decomposition, the unknown entries are derived by solving 11 equations

Note that if we only focus on one of these frontal slices, we are actually doing matrix completion.
Since tensors are generalizations of matrices, the low-rank framework for tensor completion is similar to that of matrix completion, consisting of the rank minimization model and the low-rank tensor factorization model.
The original rank minimization model for tensor completion of T ∈ RI1 ×···×ID can
be formulated as a problem that finds a tensor X of the same size with minimal rank
such that the projection of X on O matches the observations
min_X rank(X)
s.t. P_O(X) = P_O(T),    (4.4)
Motivated by the fact that the matrix nuclear norm is the tightest convex relax-
ation of matrix rank function, convex optimization models for tensor completion
mainly inherit and exploit this property [17, 19, 49]. They extend matrix nuclear
norm to tensor ones in various ways, and the corresponding convex optimization
model can be formulated as follows:
min_X ‖X‖_*
s.t. P_O(X) = P_O(T),    (4.5)

where the tensor nuclear norm varies according to different tensor decompositions. For example, the nuclear norm from the Tucker decomposition is Σ_{d=1}^D w_d ‖X_(d)‖_* [28], the nuclear norm of the tensor train (TT) decomposition is formulated as Σ_{d=1}^{D−1} w_d ‖X_⟨d⟩‖_* [7], and the tensor ring (TR) nuclear norm is defined as Σ_{d=1}^L w_d ‖X_⟨d,L⟩‖_* [19], where X_(d), X_⟨d⟩, and X_⟨d,L⟩ are the mode-d unfolding, d-unfolding, and d-shifting L-unfolding matrices of X, respectively, with respect to the different decompositions. The parameter w_d denotes the weight of each unfolding matrix.
As another popular tensor network, the nuclear norm of the hierarchical Tucker (HT) decomposition is similar to that of TR and can be realized by shifting unfoldings. Since HT corresponds to a tree graph in the tensor network notation, cutting the virtual bond between any parent and child node splits the network into two parts, and the rank of the matrix obtained by contracting the factors on the two sides of the cut defines the corresponding rank. A simple illustration of the HT decomposition is shown in Fig. 4.2.
Table 4.1 summarizes the convex relaxations for different tensor ranks. The main difference is the unfolding matrices involved, except for CP and t-SVD. Since the unfolding-based nuclear norms couple all entries of X, auxiliary variables M_d (d = 1, ..., D) are commonly introduced and the problem is solved by ADMM:

min_{X, M_d, Y_d} Σ_{d=1}^D ( w_d ‖M_{d,[d]}‖_* + ⟨M_d − X, Y_d⟩ + (ρ/2) ‖M_d − X‖_F^2 )
s.t. P_O(X) = P_O(T),    (4.6)

where the subscript [d] denotes the d-th unfolding scheme for M_d according to the selected tensor decomposition, and ρ is a penalty coefficient which can be fixed or changed with iterations from a predefined value.
The procedure of ADMM is to split the original problem into small subproblems, each of which has a variable that can be solved quickly. In this sense, (4.6) can be broken into 2D + 1 subproblems with respect to the variables M_d (d = 1, ..., D), X, and Y_d (d = 1, ..., D):

M_d^* = arg min_{M_d} w_d ‖M_{d,[d]}‖_* + ⟨M_d − X, Y_d⟩ + (ρ/2) ‖M_d − X‖_F^2, d = 1, ..., D,

X^* = arg min_X Σ_{d=1}^D ( ⟨M_d − X, Y_d⟩ + (ρ/2) ‖M_d − X‖_F^2 ),  s.t. P_O(X) = P_O(T),

Y_d = Y_d + ρ (M_d − X), d = 1, ..., D.    (4.7)
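One sweep of (4.7) for the Tucker-type sum-of-nuclear-norms relaxation can be sketched as follows, using plain mode-d unfoldings as the [d]-unfolding scheme; each M_d is obtained by singular value thresholding, X averages the M_d + Y_d/ρ and re-imposes the observations, and the multipliers are updated last (the weights, ρ, and variable names are illustrative).

```python
import numpy as np

def unfold(T, mode):
    return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

def fold(mat, mode, shape):
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(np.reshape(mat, [shape[mode]] + rest, order='F'), 0, mode)

def svt(mat, tau):
    U, s, Vt = np.linalg.svd(mat, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def admm_sweep(X, Ms, Ys, T_obs, mask, w, rho):
    """One sweep of (4.7) with mode-d unfoldings (SNN / Tucker case)."""
    D = X.ndim
    for d in range(D):   # M_d update: SVT on the mode-d unfolding of X - Y_d/rho
        Ms[d] = fold(svt(unfold(X - Ys[d] / rho, d), w[d] / rho), d, X.shape)
    X = sum(Ms[d] + Ys[d] / rho for d in range(D)) / D   # unconstrained X update
    X = np.where(mask, T_obs, X)                         # enforce P_O(X) = P_O(T)
    for d in range(D):   # dual updates
        Ys[d] = Ys[d] + rho * (Ms[d] - X)
    return X, Ms, Ys
```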
Table 4.2 Comparison of the sample complexity of five popular algorithms stemming from different tensor decompositions (for simplicity, I_d = I and R_d = R for d = 1, ..., D)

Decomposition | Sampling scheme | Model | Algorithm | Sample complexity
TR | Uniform | Sum of nuclear norm minimization | TRBU [19] | O(I^{D/2} R^2 log^7 I^{D/2})
TT | Uniform | First-order polynomials [2] | Newton's method [3] | O(D R^2 log I)
t-SVD | Uniform | Tensor nuclear norm minimization | LRTC-TNN [52] | O(I^{D−1} R log I^{D−1})
Tucker | Gaussian | Sum of nuclear norm minimization | SNN [36] | O(R^{D/2} I^{D/2})
CP | Uniform | First-order polynomials [1] | Newton's method [3] | O(I^2 log(D I R))
In addition, reference [1] uses an algebraic geometry approach, which is different from the framework used in [19, 36, 52]. Table 4.2 provides a list of recent references with a comparison of their sample complexity.
Under a specific tensor decomposition form, the nonconvex optimization model for tensor completion can be formulated as a generalized data fitting problem with predefined tensor ranks, where the desired variables are the latent factors of the tensor to be recovered.
This is recognized as a quadratic form with respect to each block variable and hence is easy to solve via BCD. To make this concrete, we demonstrate the BCD algorithm in the case of TR decomposition. Noting that the objective function regarding each TR factor is convex while the other factors are treated as fixed variables, we can specify (4.9) as
min_{U_d} ‖W_{d,1} ⊙ (U_d B_d) − W_{d,1} ⊙ T_{d,1}‖_F^2,    (4.10)

where ⊙ denotes the element-wise product, B_d ∈ R^{Π_{n≠d} I_n × R_d R_{d+1}} is obtained by contracting all the factors except the d-th one, and the projection operator is converted into a binary tensor W with “1” indicating an observed position and “0” otherwise. To alleviate the computational burden, we can further split the problem (4.10) into I_d subproblems:

min_u ‖w ⊙ (u B_d) − w ⊙ t‖_2^2,    (4.11)
where u = U_d(i_d, :), w = W_{d,1}(i_d, :), and t = T_{d,1}(i_d, :). By extracting the columns of B_d that correspond to the “1”s in w, problem (4.11) becomes a standard linear least squares problem. In one iteration of ALS, we solve Σ_{d=1}^D I_d such least squares problems. The stopping criterion can be a maximal number of iterations or a tolerance on the relative/absolute change of the variables, gradients, or objective function. The details of the BCD method for tensor completion are presented in Algorithm 29.
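For illustration, the row-wise subproblem (4.11) can be solved as an ordinary least squares problem over the observed columns, as in the sketch below (the layout of B_d and the small regularization term are assumptions of this sketch):

```python
import numpy as np

def update_row(t_row, w_row, B_d, reg=1e-8):
    """Solve min_u || w * (u B_d) - w * t ||_2^2 for one row u of U_d.

    t_row : observed data row T_{d,1}(i_d, :)
    w_row : binary mask row W_{d,1}(i_d, :)
    B_d   : (R_d*R_{d+1}) x (prod of the other dimensions) subchain matrix
    """
    idx = np.nonzero(w_row)[0]              # keep only the observed columns
    Bo, to = B_d[:, idx], t_row[idx]
    A = Bo @ Bo.T + reg * np.eye(Bo.shape[0])
    return np.linalg.solve(A, Bo @ to)      # u such that u B_d fits t on the observations
```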
4.4 Applications
With the development of cameras, visual data are widely and easily acquired and used in many situations. Visual data, including RGB images, videos, and light-field images, can be approximated by low-rank tensors. In fact, some other images also share a similar low-rank structure and can be processed by tensor completion techniques for denoising, restoration, enhancement, and compressive sensing.
Fig. 4.3 Panel (a) is the grayscale copy of lena with size 256 × 256, panel (b) shows the singular values of lena, and panel (c) is the reconstruction of (a) using only the first 50 singular values. (a) Original lena. (b) The singular values of lena. (c) Truncated lena
Fig. 4.4 The completion result for lena derived by seven methods. The tensor completion methods
show different PSNRs under the same sampling scheme, which implies various expression powers
of tensor decompositions
Tensor completion can be used in recommendation systems by exploiting latent multidimensional correlations. For example, in a movie rating system, due to the large number of movies and users, not everyone watches every movie, so the available ratings are usually only a very small portion of the data. Users rate a movie highly if it contains fundamental features they like, as shown in Fig. 4.5, which illustrates the association between the latent factor of film genre and the user-item data. Different users have different preferences, but there may be similarities between them. By using each user's own preferences and the correlation between different movies, the unknown entries can be predicted.
Fig. 4.5 A simple user-item demonstration. The square blocks represent films with different combinations of four basic genres, and the circles represent users. Each user tends to have a distinct preference and is more likely to choose the films that share more common characteristics. Mathematically, the Euclidean distance (sometimes the cosine distance is preferable) is used to measure the similarity between the films and the users, which is a key idea for user and item prediction in recommendation systems

Fig. 4.6 The completion results for UCLAF derived by seven methods, where the label “MC” on the horizontal axis means matrix completion, which is compared with the tensor-based methods as a baseline

In this subsection, the user-centered collaborative location and activity filtering (UCLAF) data [55] is used. It comprises 164 users' global positioning system (GPS) trajectories based on their partial 168 locations and 5 activity annotations. Since people of different circles and personalities tend to be active in specific places, this UCLAF dataset satisfies the low-rank property for similar reasons. For efficiency, we construct a new tensor of size 12 × 4 × 5 by selecting the slices in three directions such that each new slice has at least 20 nonzero entries. Then we randomly choose 50% of the samples from the obtained tensor for experiments.
The reconstruction error is measured by the ratio of the ℓ_2 norm of the deviation between the estimate and the ground truth to the ℓ_2 norm of the ground truth, i.e., ‖X̂ − X‖ / ‖X‖. Figure 4.6 shows the completion results of six tensor completion algorithms along with the matrix completion method as a baseline. A fact shown in Fig. 4.6 is that matrix completion even performs better than some tensor-based methods. Besides, TRBU (a TR decomposition-based algorithm) gives a relative error similar to that of the matrix completion method, which can probably be explained by the fact that TRBU can be viewed as a combination of several cyclically shifted matrix completions, one of which coincides with the matrix used in matrix completion.
108 4 Low-Rank Tensor Recovery
Fig. 4.8 Knowledge graph and tensor. Head entity, relation, and tail entity correspond to the three modes, respectively. Each element in the tensor indicates the state of a triplet, that is, “1” stands for a true fact and “0” stands for a false or missing fact
Table 4.4 Scoring functions utilized in different models. In the scoring functions, e_h and e_t represent the head and tail entity embedding vectors, respectively, r represents the relation embedding vector, and W_r denotes the relation embedding matrix

Decomposition | Model | Scoring function
CP | DistMult [48] | ⟨e_h, r, e_t⟩
CP | ComplEx [41] | Re(⟨e_h, r, ē_t⟩)
Tucker | TuckER [5] | W ×_1 e_h ×_2 r ×_3 e_t
Tensor train | TensorT [51] | ⟨e_h, W_r, e_t⟩
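For illustration, the DistMult and TuckER scoring functions from Table 4.4 can be evaluated as follows (embedding sizes and variable names are illustrative):

```python
import numpy as np

def score_distmult(e_h, r, e_t):
    """DistMult: <e_h, r, e_t> = sum_i e_h[i] * r[i] * e_t[i]."""
    return np.sum(e_h * r * e_t)

def score_tucker(W, e_h, r, e_t):
    """TuckER: W x1 e_h x2 r x3 e_t, with a core tensor W of shape (d_e, d_r, d_e)."""
    return np.einsum('ijk,i,j,k->', W, e_h, r, e_t)
```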
Table 4.5 Results of knowledge graph completion. TuckER and ComplEx are both tensor
decomposition models, while ConvE is a translational model. The results of TuckER and ComplEx
are run on our computer, and ConvE is drawn directly from the original paper since its code does
not have the option of running the FB15k dataset
Model MRR Hits@1 Hits@3 Hits@10
TuckER [5] 0.789 0.729 0.831 0.892
ComplEx [41] 0.683 0.591 0.750 0.834
ConvE [13] 0.657 0.558 0.723 0.831
We train the TuckER model on the FB15k training set, then test the learned model on the test data and evaluate it by the mean reciprocal rank (MRR) and hits@k, k ∈ {1, 3, 10}; the results are reported in Table 4.5, where the best values are marked in bold. MRR is the average inverse rank of the true triplet over all other possible triplets. Hits@k indicates the proportion of true entities ranked within the top k possible triplets. As we can see, TuckER not only achieves better results than its counterpart ComplEx but also outperforms ConvE. This result mainly benefits from the co-sharing structure of its scoring function, which takes advantage of the Tucker decomposition to encode the underlying information in the low-rank core tensor W.
Traffic flow prediction has gained more and more attention with the rapid development and deployment of intelligent transportation systems (ITS) and vehicular cyber-physical systems (VCPS). Traffic flow can be decomposed into a temporal and a spatial process; thus the traffic flow data are naturally in the form of tensors. The traffic flow prediction problem can be stated as follows. Let X(t, i) denote the observed traffic flow quantity at the t-th time interval of the i-th observation location; we then obtain a sequence X of observed traffic flow data, i = 1, 2, ..., I, t = 1, 2, ..., T. The problem is to predict the traffic flow at time interval (t + Δt) for some prediction horizon Δt based on the historical traffic information. Considering the seasonality of X, we fold X along the time dimension in days, which results in a tensor of size location × timespot × seasonality.
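As a small illustration of this folding step, a location × time matrix covering whole days can be reshaped into a location × timespot × day tensor (the 144 slots per day and the variable names are illustrative assumptions):

```python
import numpy as np

def fold_traffic(X, slots_per_day=144):
    """Reshape traffic data X of shape (locations, days*slots_per_day)
    into a tensor of shape (locations, slots_per_day, days)."""
    n_loc, n_time = X.shape
    n_days = n_time // slots_per_day
    X = X[:, :n_days * slots_per_day]                      # drop an incomplete last day
    return X.reshape(n_loc, n_days, slots_per_day).transpose(0, 2, 1)
```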
Fig. 4.10 The completion result for traffic data by seven methods
Specifically, we use a traffic dataset downloaded from the website https://zenodo.org/record/1205229#.X5FUXi-1Gu4. It can be reshaped into a tensor of size 209 × 60 × 144. We randomly sample 20% of the entries as observations and take six tensor-based algorithms to compare their performance with the baseline of matrix completion. The experimental results are presented in Fig. 4.10. Since this traffic dataset shows low-rankness in three dimensions rather than two, all tensor-based methods achieve lower relative errors than the matrix completion method. Among the tensor-based methods, TRBU and LRTC-TNN give better performance, which supports our observation in Sect. 4.4.1 that the TR decomposition has more powerful representation ability.
4.5 Summary
References
1. Ashraphijuo, M., Wang, X.: Fundamental conditions for low-CP-rank tensor completion. J.
Mach. Learn. Res. 18(1), 2116–2145 (2017)
2. Ashraphijuo, M., Wang, X.: Characterization of sampling patterns for low-tt-rank tensor
retrieval. Ann. Math. Artif. Intell. 88(8), 859–886 (2020)
3. Ashraphijuo, M., Wang, X., Zhang, J.: Low-rank data completion with very low sampling rate
using Newton’s method. IEEE Trans. Signal Process. 67(7), 1849–1859 (2019)
4. Asif, M.T., Mitrovic, N., Garg, L., Dauwels, J., Jaillet, P.: Low-dimensional models for missing
data imputation in road networks. In: 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing, pp. 3527–3531. IEEE, New York (2013)
5. Balazevic, I., Allen, C., Hospedales, T.: TuckER: Tensor factorization for knowledge graph
completion. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Lan-
guage Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pp. 5188–5197 (2019)
6. Balažević, I., Allen, C., Hospedales, T.M.: Hypernetwork knowledge graph embeddings. In:
International Conference on Artificial Neural Networks, pp. 553–565. Springer, Berlin (2019)
7. Bengua, J.A., Phien, H.N., Tuan, H.D., Do, M.N.: Efficient tensor completion for color image
and video recovery: low-rank tensor train. IEEE Trans. Image Process. 26(5), 2466–2479
(2017)
8. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created
graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD
International Conference on Management of Data, pp. 1247–1250 (2008)
9. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings
for modeling multi-relational data. In: Neural Information Processing Systems (NIPS), pp. 1–9
(2013)
10. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput.
Math. 9(6), 717 (2009)
11. Candès, E.J., Tao, T.: The power of convex relaxation: near-optimal matrix completion. IEEE
Trans. Inf. Theory 56(5), 2053–2080 (2010)
12. Conn, A.R., Gould, N.I., Toint, P.L.: Trust Region Methods. SIAM, Philadelphia (2000)
13. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2d knowledge graph
embeddings. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
14. Fabian, M., Gjergji, K., Gerhard, W., et al.: Yago: A core of semantic knowledge unifying
wordnet and wikipedia. In: 16th International World Wide Web Conference, WWW, pp. 697–
706 (2007)
15. Fan, J., Cheng, J.: Matrix completion by deep matrix factorization. Neural Netw. 98, 34–41
(2018)
16. Filipović, M., Jukić, A.: Tucker factorization with missing data with application to low-rank
tensor completion. Multidim. Syst. Sign. Process. 26(3), 677–692 (2015)
17. Gandy, S., Recht, B., Yamada, I.: Tensor completion and low-n-rank tensor recovery via convex
optimization. Inverse Prob. 27(2), 025010 (2011)
18. Hu, Y., Zhang, D., Ye, J., Li, X., He, X.: Fast and accurate matrix completion via truncated
nuclear norm regularization. IEEE Trans. Pattern Anal. Mach. Intell. 35(9), 2117–2130 (2012)
19. Huang, H., Liu, Y., Liu, J., Zhu, C.: Provable tensor ring completion. Signal Process. 171,
107486 (2020)
20. Huang, H., Liu, Y., Long, Z., Zhu, C.: Robust low-rank tensor ring completion. IEEE Trans.
Comput. Imag. 6, 1117–1126 (2020)
21. Jain, P., Netrapalli, P., Sanghavi, S.: Low-rank matrix completion using alternating minimiza-
tion. In: Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing,
pp. 665–674 (2013)
22. Jannach, D., Resnick, P., Tuzhilin, A., Zanker, M.: Recommender systems—beyond matrix
completion. Commun. ACM 59(11), 94–102 (2016)
23. Kang, Z., Peng, C., Cheng, Q.: Top-n recommender system via matrix completion. In:
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016)
24. Kazemi, S.M., Poole, D.: SimplE embedding for link prediction in knowledge graphs. In:
Advances in Neural Information Processing Systems, vol. 31 (2018)
25. Keshavan, R.H., Montanari, A., Oh, S.: Matrix completion from a few entries. IEEE Trans.
Inf. Theory 56(6), 2980–2998 (2010)
26. Kiefer, J., Wolfowitz, J., et al.: Stochastic estimation of the maximum of a regression function.
Ann. Math. Stat. 23(3), 462–466 (1952)
27. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings
for knowledge graph completion. In: Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 29 (2015)
28. Liu, J., Musialski, P., Wonka, P., Ye, J.: Tensor completion for estimating missing values in
visual data. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 208–220 (2012)
29. Liu, Y., Long, Z., Zhu, C.: Image completion using low tensor tree rank and total variation
minimization. IEEE Trans. Multimedia 21(2), 338–350 (2019)
30. Long, Z., Liu, Y., Chen, L., Zhu, C.: Low rank tensor completion for multiway visual data.
Signal Process. 155, 301–316 (2019)
31. Long Z., Zhu C., Liu, J., Liu, Y.: Bayesian low rank tensor ring for image recovery. IEEE
Trans. Image Process. 30, 3568–3580 (2021)
32. Lu, C.: A Library of ADMM for Sparse and Low-rank Optimization. National University of
Singapore (2016). https://github.com/canyilu/LibADMM
33. Lukovnikov, D., Fischer, A., Lehmann, J., Auer, S.: Neural network-based question answering
over knowledge graphs on word and character level. In: Proceedings of the 26th International
Conference on World Wide Web, pp. 1211–1220 (2017)
34. Miller, G.A.: WordNet: a lexical database for english. Commun. ACM 38(11), 39–41 (1995)
35. Moré, J.J.: The Levenberg-Marquardt algorithm: implementation and theory. In: Numerical
Analysis, pp. 105–116. Springer, Berlin (1978)
36. Mu, C., Huang, B., Wright, J., Goldfarb, D.: Square deal: lower bounds and improved
relaxations for tensor recovery. In: International Conference on Machine Learning, pp. 73–
81 (2014)
37. Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on multi-
relational data. In: Proceedings of the 28th International Conference on International
Conference on Machine Learning, pp. 809–816 (2011)
38. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407
(1951)
39. Singhal, A.: Introducing the knowledge graph: things, not strings. Official Google Blog 5
(2012)
40. Sun, Z., Deng, Z.H., Nie, J.Y., Tang, J.: RotatE: knowledge graph embedding by relational
rotation in complex space. In: International Conference on Learning Representations (2018)
41. Trouillon, T., Dance, C.R., Gaussier, É., Welbl, J., Riedel, S., Bouchard, G.: Knowledge graph
completion via complex tensor factorization. J. Mach. Learn. Res. 18(1), 4735–4772 (2017)
42. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM
57(10), 78–85 (2014)
43. Vu, T., Nguyen, T.D., Nguyen, D.Q., Phung, D., et al.: A capsule network-based embedding
model for knowledge graph completion and search personalization. In: Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2180–2189 (2019)
44. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating on
hyperplanes. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28 (2014)
45. Wang, W., Aggarwal, V., Aeron, S.: Tensor completion by alternating minimization under the
tensor train (TT) model (2016). Preprint, arXiv:1609.05587
46. Wang, W., Aggarwal, V., Aeron, S.: Efficient low rank tensor ring completion. In: Proceedings
of the IEEE International Conference on Computer Vision, pp. 5697–5705 (2017)
47. Xu, Y., Yin, W.: A block coordinate descent method for regularized multiconvex optimization
with applications to nonnegative tensor factorization and completion. SIAM J. Imag. Sci. 6(3),
1758–1789 (2013)
48. Yang, B., Yih, W.T., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning
and inference in knowledge bases. In: 3rd International Conference on Learning Representa-
tions, ICLR 2015, San Diego, CA, USA, 7–9 May 2015, Conference Track Proceedings (2015)
49. Yang, Y., Feng, Y., Suykens, J.A.: A rank-one tensor updating algorithm for tensor completion.
IEEE Signal Process Lett. 22(10), 1633–1637 (2015)
50. Yu, J., Li, C., Zhao, Q., Zhao, G.: Tensor-ring nuclear norm minimization and application for visual data completion. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3142–3146. IEEE, New York (2019)
51. Zeb, A., Haq, A.U., Zhang, D., Chen, J., Gong, Z.: KGEL: a novel end-to-end embedding
learning framework for knowledge graph completion. Expert Syst. Appl. 167, 114164 (2020)
52. Zhang, Z., Aeron, S.: Exact tensor completion using t-SVD. IEEE Trans. Signal Process. 65(6),
1511–1526 (2017)
53. Zhang, F., Yuan, N.J., Lian, D., Xie, X., Ma, W.Y.: Collaborative knowledge base embedding
for recommender systems. In: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 353–362 (2016)
54. Zhao, Q., Zhang, L., Cichocki, A.: Bayesian CP factorization of incomplete tensors with
automatic rank determination. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1751–1763
(2015)
55. Zheng, V., Cao, B., Zheng, Y., Xie, X., Yang, Q.: Collaborative filtering meets mobile
recommendation: a user-centered approach. In: Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 24 (2010)
56. Zniyed, Y., Boyer, R., de Almeida, A.L., Favier, G.: High-order tensor estimation via trains
of coupled third-order CP and Tucker decompositions. Linear Algebra Appl. 588, 304–337
(2020)
Chapter 5
Coupled Tensor for Data Analysis
5.1 Introduction
Multimodal signals widely exist in data acquisition with the help of different sen-
sors. For example, in medical diagnoses, we can obtain several types of data from a
patient at the same time, including electroencephalogram (EEG), electrocardiogram
(ECG) monitoring data, and functional magnetic resonance imaging (fMRI) scans.
Such data share some common latent components but also keep certain independent
features of their own. Therefore, it may be advantageous to analyze such data in a coupled way instead of processing each modality independently.
Joint analysis of data from multiple sources can be modeled with the help
of coupled matrix/tensor decomposition, where the obtained data are represented
by matrix/tensor and the common modes are linked in a coupled way. Coupled
matrix/tensor component analysis has attracted much attention in data mining [2, 6]
and signal processing [4, 12, 14].
For instance, coupled matrix/tensor component analysis can be used for missing
data recovery when the obtained data have low-rank structure. As shown in Fig. 5.1,
link prediction aims to give some recommendations according to the observed
data. The cube represents the relationship of locations-tourist-activities and can be
modeled as a tensor. If we only observe this data, the cold start problem would
occur [9], since the collected information is not yet sufficient. If we can obtain
the relationship of tourist-tourist, features-locations, activities-activities, and tourist-
locations, which can be regarded as the auxiliary information and modeled as
matrices, the cold start problem may be avoided. This is because the tensor and
the matrices share some latent information and when the tensor is incomplete, the
shared information in matrices will benefit the tensor recovery.
Fig. 5.1 UCLAF dataset where the tensor is incomplete and coupled with each matrix by sharing
some latent information
5.2 Coupled Tensor Component Analysis Methods

Consider a third-order tensor X coupled with a matrix Y in the first mode. The coupled matrix and tensor factorization (CMTF) model estimates the factors by minimizing

f(A, B, C, D) = (1/2)‖X − [[A, B, C]]‖_F² + (1/2)‖Y − AD^T‖_F²,   (5.1)

where [[A, B, C]] represents the CP decomposition of X with the factor matrices A, B, and C, and A and D are the factor matrices of Y. Similar to CP-ALS, problem (5.1) can be easily solved
by the ALS framework, which updates one variable with the others fixed, as summarized in Algorithm 30.
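To make the alternating updates concrete, here is a minimal NumPy sketch of ALS for the CMTF model (5.1); the unfolding and Khatri-Rao conventions, the random initialization, and the fixed iteration count are choices made for this sketch and are not taken verbatim from Algorithm 30.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front, then reshape row-major."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(B, C):
    """Column-wise Kronecker product, ordered to match `unfold` above."""
    R = B.shape[1]
    return np.einsum('ir,jr->ijr', B, C).reshape(-1, R)

def cmtf_als(X, Y, R, n_iter=100, seed=0):
    """ALS sketch for min 0.5*||X - [[A,B,C]]||_F^2 + 0.5*||Y - A @ D.T||_F^2,
    where the tensor X and the matrix Y share the first-mode factor A."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    D = rng.standard_normal((Y.shape[1], R))
    X0, X1, X2 = unfold(X, 0), unfold(X, 1), unfold(X, 2)
    for _ in range(n_iter):
        # coupled update: A sees the residuals of both the tensor and the matrix
        G = (B.T @ B) * (C.T @ C) + D.T @ D
        A = np.linalg.solve(G, (X0 @ khatri_rao(B, C) + Y @ D).T).T
        B = np.linalg.solve((A.T @ A) * (C.T @ C), (X1 @ khatri_rao(A, C)).T).T
        C = np.linalg.solve((A.T @ A) * (B.T @ B), (X2 @ khatri_rao(A, B)).T).T
        D = np.linalg.solve(A.T @ A, A.T @ Y).T
    return A, B, C, D
```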
However, this algorithm can stop at a local optimal point, and overfitting may easily occur when the tensor rank is set too large. To deal with these problems, CMTF-OPT [1] is proposed to solve for all factor matrices simultaneously by gradient-based optimization.
It is noted that CMTF models have been extended with other tensor factorizations such as Tucker [3] and BTD [15]. In this chapter, we mainly introduce the CMTF model using CP decomposition for modeling higher-order tensors, and the squared Frobenius norm is employed as the loss function.
In practice, the factors can have many properties according to the physical
properties of the data. For instance, nonnegativity is an important property of
latent factors since many real-world tensors have nonnegative values and the
hidden components have a physical meaning only when they are nonnegative.
Besides, sparsity and orthogonality constraints on latent factors can identify the
shared/unshared factors in coupled data. In this case, Acar et al. [1] proposed a flexible modeling framework that incorporates such constraints into the CMTF formulation to identify the shared and unshared components.
Besides the above CMTF model with one tensor and one matrix, both coupled data sets can be in the form of tensors. For example, in a traffic information system,
we can collect different kinds of traffic data. We can obtain a third-order traffic flow
data denoted as X ∈ RI ×J ×K , where I represents the number of road segments, J
represents the number of days, K represents the number of time slots, and its entries
represent traffic flow. In addition, the road environment data in the same road traffic
network can be collected as the side information Y ∈ RI ×L×M , where L represents
the number of lanes and M represents the number of variables about the weather,
e.g., raining, snowing, sunny, and so on. Each element denotes the frequency of
accidents. Such a coupled tensor model can be found in Fig. 5.3.
Fig. 5.3 Two third-order tensors X and Y coupled along the first mode
As shown in Fig. 5.3, tensors X and Y are coupled in the first mode. Therefore,
the coupled tensor factorization (CTF) of X and Y can be defined as

min_{A, B, C, D, E} (1/2)‖X − [[A, B, C]]‖_F² + (1/2)‖Y − [[A, D, E]]‖_F²,   (5.3)
where [[A, B, C]] represents CP decomposition for X with the factor matrices A, B
and C and [[A, D, E]] represents CP decomposition for Y. Following CMTF-OPT
[1] for coupled matrix and tensor factorization, the optimization model in (5.3) can
be solved in a similar way. The loss function of this coupled model is as follows:
f(A, B, C, D, E) = (1/2)‖X − [[A, B, C]]‖_F² + (1/2)‖Y − [[A, D, E]]‖_F²,   (5.4)
and we further define X̂ = [[A, B, C]] and Ŷ = [[A, D, E]] as the estimates of X and
Y, respectively.
First, combining all factor matrices, we can obtain the variable z:
z = [a_1^T; · · · ; a_R^T; b_1^T; · · · ; b_R^T; c_1^T; · · · ; c_R^T; d_1^T; · · · ; d_R^T; e_1^T; · · · ; e_R^T].   (5.5)
The nonlinear conjugate gradient (NCG) method with Hestenes-Stiefel updates and the Moré-Thuente line search is used to optimize the factor matrices. The details of the CTF-OPT algorithm are given in Algorithm 31.
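For intuition, the stacked gradient in z is what drives the NCG iterations; the sketch below assembles the block of the gradient of (5.4) with respect to the shared factor A under its own unfolding/Khatri-Rao convention (the remaining blocks for B, C, D, and E are analogous). Function names are illustrative.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(B, C):
    R = B.shape[1]
    return np.einsum('ir,jr->ijr', B, C).reshape(-1, R)

def grad_A(X, Y, A, B, C, D, E):
    """Gradient of f(A,B,C,D,E) in (5.4) with respect to the shared factor A."""
    Kbc = khatri_rao(B, C)   # with this convention, unfold(X, 0) ~ A @ Kbc.T
    Kde = khatri_rao(D, E)   # and unfold(Y, 0) ~ A @ Kde.T
    return (A @ Kbc.T - unfold(X, 0)) @ Kbc + (A @ Kde.T - unfold(Y, 0)) @ Kde
```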
Another kind of CTF couples the data along all modes through a linear relationship. For example, due to hardware limitations, it is hard to collect a super-resolution image (SRI), which admits both high spatial and high spectral resolution. Instead, we can obtain a hyperspectral image (HSI), which has low spatial and high spectral resolution. It can be denoted as X ∈ R^{I_H×J_H×K}, where I_H and J_H denote the spatial dimensions and K denotes the number of spectral bands. On the other hand, the obtained multispectral image (MSI), which can be represented as Y ∈ R^{I×J×K_M}, has high spatial and low spectral resolution. Both X and Y are third-
order tensor data and coupled in all modes. The graphical illustration can be found
in Fig. 5.4.
The mathematical model can be formulated as

X = Z ×1 P1 ×2 P2,   Y = Z ×3 P_M,

where Z ∈ R^{I×J×K} denotes the target SRI with the CP form Z = [[A, B, C]], P1 ∈ R^{I_H×I} and P2 ∈ R^{J_H×J} are the spatial degradation matrices, and P_M ∈ R^{K_M×K} is the spectral degradation matrix. The factor matrices can then be estimated by minimizing

f(A, B, C) = (1/2)‖X − [[P1 A, P2 B, C]]‖_F² + (1/2)‖Y − [[A, B, P_M C]]‖_F².
Fig. 5.4 An example of coupled tensor factorization
Under the ALS framework, this problem can be divided into three subproblems as follows:
f(A) = (1/2)‖X(1) − P1 A(C ⊙ P2 B)^T‖_F² + (1/2)‖Y(1) − A(P_M C ⊙ B)^T‖_F²,   (5.10)

f(B) = (1/2)‖X(2) − P2 B(C ⊙ P1 A)^T‖_F² + (1/2)‖Y(2) − B(P_M C ⊙ A)^T‖_F²,   (5.11)

f(C) = (1/2)‖X(3) − C(P2 B ⊙ P1 A)^T‖_F² + (1/2)‖Y(3) − P_M C(B ⊙ A)^T‖_F².   (5.12)
The details are summarized in Algorithm 32.
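As an illustration of how one such subproblem can be handled, the normal equations of (5.10) form a Sylvester equation in A, (P1^T P1) A (M1^T M1) + A (M2^T M2) = P1^T X(1) M1 + Y(1) M2 with M1 = C ⊙ P2 B and M2 = P_M C ⊙ B. The sketch below is one possible implementation under its own unfolding/Khatri-Rao convention; the helper names and the small ridge term are assumptions, not the exact procedure of Algorithm 32.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(B, C):
    R = B.shape[1]
    return np.einsum('ir,jr->ijr', B, C).reshape(-1, R)

def update_A(X, Y, B, C, P1, P2, PM, reg=1e-10):
    """Minimize f(A) in (5.10): X ~ [[P1 A, P2 B, C]] and Y ~ [[A, B, PM C]]."""
    M1 = khatri_rao(P2 @ B, C)   # so that unfold(X, 0) ~ (P1 @ A) @ M1.T
    M2 = khatri_rao(B, PM @ C)   # so that unfold(Y, 0) ~ A @ M2.T
    S = P1.T @ P1
    G1 = M1.T @ M1 + reg * np.eye(M1.shape[1])   # small ridge for invertibility
    G2 = M2.T @ M2
    Q = P1.T @ unfold(X, 0) @ M1 + unfold(Y, 0) @ M2
    G1inv = np.linalg.inv(G1)
    # S @ A + A @ (G2 G1^{-1}) = Q @ G1^{-1}  -->  standard Sylvester form
    return solve_sylvester(S, G2 @ G1inv, Q @ G1inv)
```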
More generally, several data sets can be coupled by sharing a subset of factor matrices. In such a generalized coupled factorization, each observed tensor X_i is modeled as

X_i(j_1, . . . , j_{M_i}) = Σ_{r=1}^{R} Π_{n=1}^{N} (U^{(n)}(j_n, r))^{p_{i,n}},   i = 1, . . . , I,   (5.13)

where the binary exponent p_{i,n} indicates whether the factor U^{(n)} participates in the i-th data set.
For example, consider three data sets that share the factor matrix U^{(2)}:

X_1(j_1, j_2, j_3) = Σ_{r=1}^{R} U^{(1)}(j_1, r) U^{(2)}(j_2, r) U^{(3)}(j_3, r),   (5.14)

X_2(j_2, j_4) = Σ_{r=1}^{R} U^{(2)}(j_2, r) U^{(4)}(j_4, r),   (5.15)

X_3(j_2, j_5) = Σ_{r=1}^{R} U^{(2)}(j_2, r) U^{(5)}(j_5, r),   (5.16)
where the number of tensors is 3 and the number of factor matrices is 5. It can be
rewritten in detail as follows:
X_1(j_1, j_2, j_3) = Σ_{r=1}^{R} U^{(1)}(j_1, r)^1 U^{(2)}(j_2, r)^1 U^{(3)}(j_3, r)^1 U^{(4)}(j_4, r)^0 U^{(5)}(j_5, r)^0,

X_2(j_2, j_4) = Σ_{r=1}^{R} U^{(1)}(j_1, r)^0 U^{(2)}(j_2, r)^1 U^{(3)}(j_3, r)^0 U^{(4)}(j_4, r)^1 U^{(5)}(j_5, r)^0,

X_3(j_2, j_5) = Σ_{r=1}^{R} U^{(1)}(j_1, r)^0 U^{(2)}(j_2, r)^1 U^{(3)}(j_3, r)^0 U^{(4)}(j_4, r)^0 U^{(5)}(j_5, r)^1,
Assuming Gaussian observation noise with variance τ², the overall loss function of the coupled factorization can be written as

L = Σ_{i=1}^{I} Σ_{j_1,...,j_{M_i}} (X_i(j_1, . . . , j_{M_i}) − X̂_i(j_1, . . . , j_{M_i}))² / (2τ²),   (5.17)
where X̂_i(j_1, . . . , j_{M_i}) = Σ_{r=1}^{R} Π_{n=1}^{N} (U^{(n)}(j_n, r))^{p_{i,n}}. For the coupled factorization, the first- and second-order derivatives of the loss with respect to U^{(n)}(j_n, r) are

∂L/∂U^{(n)}(j_n, r) = − Σ_{i=1}^{I} Σ_{j_1,...,j_{M_i}} p_{i,n} (X_i(j_1, . . . , j_{M_i}) − X̂_i(j_1, . . . , j_{M_i})) τ^{−2} ∂X̂_i(j_1, . . . , j_{M_i})/∂U^{(n)}(j_n, r),

∂²L/∂U^{(n)}(j_n, r)² = Σ_{i=1}^{I} Σ_{j_1,...,j_{M_i}} p_{i,n} τ^{−2} [∂X̂_i(j_1, . . . , j_{M_i})/∂U^{(n)}(j_n, r)]^T [∂X̂_i(j_1, . . . , j_{M_i})/∂U^{(n)}(j_n, r)].
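To see how the exponent notation works in practice, the sketch below evaluates X̂_1, X̂_2, and X̂_3 of the example (5.14)-(5.16) directly from a binary coupling matrix; all sizes and names are illustrative.

```python
import numpy as np

R = 4
dims = [6, 7, 5, 8, 9]                       # sizes of the indices j1, ..., j5 (illustrative)
U = [np.random.rand(d, R) for d in dims]     # factor matrices U^(1), ..., U^(5)

# binary coupling matrix: p[i, n] = 1 if factor U^(n+1) appears in data set X_{i+1}
p = np.array([[1, 1, 1, 0, 0],               # X1(j1, j2, j3), cf. (5.14)
              [0, 1, 0, 1, 0],               # X2(j2, j4),     cf. (5.15)
              [0, 1, 0, 0, 1]])              # X3(j2, j5),     cf. (5.16)

def reconstruct(i):
    """Evaluate the model term of X_i over the active factors only."""
    active = [U[n] for n in range(len(dims)) if p[i, n] == 1]
    letters = 'abcde'[:len(active)]
    expr = ','.join(c + 'r' for c in letters) + '->' + letters
    return np.einsum(expr, *active)

X1_hat, X2_hat, X3_hat = reconstruct(0), reconstruct(1), reconstruct(2)
print(X1_hat.shape, X2_hat.shape, X3_hat.shape)   # (6, 7, 5) (7, 8) (7, 9)
```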
5.3 Applications
The HSI X and the MSI Y are generated from the reference SRI Z by spatial and spectral degradation with additive noise:

X = Z ×1 P1 ×2 P2 + N_H,   (5.19)
Y = Z ×3 P_M + N_M,   (5.20)

where N_H and N_M are additive white Gaussian noise. We set the spatial downsampling rate P = 4 and P_M ∈ R^{4×103} for the Pavia University data, and P = 6 and P_M ∈ R^{8×128} for the Washington DC Mall data.
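A small sketch of this degradation model using mode-n products; the image sizes, the box-average downsampling matrices, and the random spectral response below are illustrative stand-ins for the real P1, P2, and P_M.

```python
import numpy as np

def mode_n_product(T, M, mode):
    """Multiply tensor T by matrix M along `mode` (M has shape new_dim x T.shape[mode])."""
    Tm = np.moveaxis(T, mode, 0)
    out = np.tensordot(M, Tm, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

I, J, K, P, KM = 120, 120, 103, 4, 4         # illustrative sizes (Pavia-like spectral dimension)
Z = np.random.rand(I, J, K)                  # stand-in for the reference SRI
P1 = np.kron(np.eye(I // P), np.ones((1, P)) / P)   # box-average spatial downsampling
P2 = np.kron(np.eye(J // P), np.ones((1, P)) / P)
PM = np.random.rand(KM, K)                   # stand-in spectral response matrix

X = mode_n_product(mode_n_product(Z, P1, 0), P2, 1) + 0.01 * np.random.randn(I // P, J // P, K)
Y = mode_n_product(Z, PM, 2) + 0.01 * np.random.randn(I, J, KM)
print(X.shape, Y.shape)   # (30, 30, 103) (120, 120, 4)
```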
Seven state-of-the-art algorithms are used for comparison, namely, FUSE [16], CNMF [20], HySure [11], SCOTT [10], STEREO [7], CSTF-FUSE [8], and CNMVF [22]. The parameters of these methods are tuned for their best performance. Figures 5.7 and 5.8 show the reconstructed images of the various methods, including FUSE, SCOTT, STEREO, CSTF-FUSE, HySure, CNMF, and CNMVF, for the Pavia University data and the Washington DC Mall data, respectively. Besides, we add Gaussian noise to the HSI and MSI with SNR = 15 dB. For the Pavia University data we show the 60th band, while for the Washington DC Mall data we choose the 40th band. From Figs. 5.7 and 5.8, we can observe that CNMVF performs best among all the state-of-the-art methods in terms of image resolution.
1 https://rslab.ut.ac.ir/data/.
2 https://rslab.ut.ac.ir/data/.
Fig. 5.7 The reconstruction of Pavia University data when SNR = 15 dB. (a) SRI. (b) CNMVF.
(c) STEREO. (d) SCOTT. (e) CSTF. (f) FUSE. (g) HySure. (h) CNMF
Fig. 5.8 The reconstruction of Washington DC Mall data when SNR = 15 dB. (a) SRI. (b) CNMVF. (c) STEREO. (d) SCOTT. (e) CSTF. (f) FUSE. (g) HySure. (h) CNMF
In a travel recommendation scenario, a tourist needs to decide where to visit and what to do. The tourist, the location, and the activities can be
considered to be linked. The task of recommending other scenic spots or activities
to tourist can be cast as a missing link prediction problem. However, the results
are likely to be poor if the prediction is done in isolation on a single view of data.
Therefore, we should give the recommendation by combining multi-view of data.
In this part, we use the GPS data [24], where the relations between tourist,
location, and activity are used to construct a third-order tensor X1 ∈ R164×168×5 .
To construct the dataset, GPS points are clustered into 168 meaningful locations,
and the user comments attached to the GPS data are manually parsed into activity
Fig. 5.9 The comparison of area under curve (AUC) using one dataset (left) and using multiple datasets (right) with 80% missing ratio. (a) The AUC performance using one dataset. (b) The AUC performance using multiple datasets
annotations for the 168 locations. Consequently, the data consists of 164 users, 168
locations, and 5 different types of activities, i.e., “Food and Drink,” “Shopping,”
“Movies and Shows,” “Sports and Exercise,” and “Tourism and Amusement.” The
element X1 (i, j, k) represents the frequency of tourist i visiting location j and
doing activity k there. The collected data also include additional side information,
such as the tourist-location preferences from the GPS trajectory data and the
location features from the points of interest database, represented as the matrix
X2 ∈ R164×168 and X3 ∈ R168×14 , respectively. Besides, the similarities on tourist-
tourist X4 and activities-activities X5 can further enhance the performance of the
recommendation system.
Figure 5.9 shows the area under curve (AUC) using one dataset (left) and using
multiple datasets (right) with 80% missing ratio. We can see that the prediction
accuracy of using multiple datasets is higher than that of using one dataset, which further illustrates that side information from different views of the same object can improve the information usage.
In addition, we have shown the prediction results of three state-of-the-art
algorithms using GPS data in Fig. 5.10, where the missing ratio ranges from
10% to 70% with step 10%. These algorithms include SDF [13], CCP [17], and
CTucker [18], which all consider the coupled mode. The first one uses Tucker decomposition under the coupled tensor factorization framework to process the GPS data. The other two algorithms are based on the coupled rank minimization framework with CP and Tucker decompositions, respectively. From Fig. 5.10, we can observe that the smaller
the missing ratio is, the better the recovery performance is. Among these three
methods, SDF performs best.
Fig. 5.10 The RES of SDF, CTucker, and CCP on the GPS data under different missing ratios
In traditional visual data recovery tasks, only the correlations within the images are considered, and the recovery relies on global/local priors. However, with the development of sensors, we can obtain more information about the images. Effectively using this information can improve the recovery performance. In this part, we use the
CASIA-SURF dataset [23] to illustrate that the side information can enhance the
recovery results. The CASIA-SURF dataset consists of 1000 subjects and 21,000
video clips with 3 modalities (RGB, depth, infrared (IR)), each frame containing
a person of size 256 × 256. For convenient computation, we choose 30 subjects with 2 modalities (RGB, IR), as shown in Fig. 5.11. Then we create a tensor X ∈ R^{256×256×30} by converting each RGB image into a grayscale image and a side information tensor Y ∈ R^{256×256×30} from the IR information.
Figure 5.12 shows the results of three existing algorithms including SDF [13],
CCP [17], and CTucker [18] on CASIA-SURF dataset. From Fig. 5.12, we could
observe that CTucker outperforms CCP and SDF for image recovery. It is noted that the recovery performance of CCP remains the same for different missing ratios, which may be caused by the CP decomposition.
Fig. 5.11 The illustration of CASIA-SURF dataset. (a) RGB image with 30 subjects. (b) IR image
with 30 subjects
Fig. 5.12 The RES of SDF, CTucker, and CCP on the CASIA-SURF data under different missing ratios
5.4 Summary
As a tool to explore the data with shared latent information, coupled tensor
component analysis plays an important role in signal processing and data mining.
In this chapter, we only focus on two coupled tensor component analysis models. One is coupled tensor completion with a partially observed tensor. The other one is coupled tensor fusion with a linear coupling. In experiments, we show that with
the assistance of side information, the performance of coupled tensor component
analysis on link prediction and visual data recovery is better than traditional tensor
component analysis.
Even though coupled tensor decomposition has been developed for many years, works on multimodality data are still few and lack theoretical foundations. Regarding these, there are three main research directions:
• How to achieve the identifiability of coupled tensor decomposition under some mild conditions?
• How to efficiently explore the structure of multimodality data by coupled tensor
decomposition?
• How to tackle these data with shared information and independent structure in a
noisy condition?
References
1. Acar, E., Nilsson, M., Saunders, M.: A flexible modeling framework for coupled matrix and
tensor factorizations. In: 2014 22nd European Signal Processing Conference (EUSIPCO), pp.
111–115. IEEE, Piscataway (2014)
2. Almutairi, F.M., Sidiropoulos, N.D., Karypis, G.: Context-aware recommendation-based
learning analytics using tensor and coupled matrix factorization. IEEE J. Selec. Topics Signal
Process. 11(5), 729–741 (2017)
3. Bahargam, S., Papalexakis, E.E.: Constrained coupled matrix-tensor factorization and its
application in pattern and topic detection. In: 2018 IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining (ASONAM), pp. 91–94. IEEE, Piscataway
(2018)
4. Chatzichristos, C., Davies, M., Escudero, J., Kofidis, E., Theodoridis, S.: Fusion of EEG
and fMRI via soft coupled tensor decompositions. In: 2018 26th European Signal Processing
Conference (EUSIPCO), pp. 56–60. IEEE, Piscataway (2018)
5. Dian, R., Li, S., Fang, L.: Learning a low tensor-train rank representation for hyperspectral
image super-resolution. IEEE Trans. Neural Netw. Learn. Syst. 30, 1–12 (2019). https://doi.
org/10.1109/TNNLS.2018.2885616
6. Ermiş, B., Acar, E., Cemgil, A.T.: Link prediction in heterogeneous data via generalized
coupled tensor factorization. Data Min. Knowl. Discov. 29(1), 203–236 (2015)
7. Kanatsoulis, C.I., Fu, X., Sidiropoulos, N.D., Ma, W.K.: Hyperspectral super-resolution: a
coupled tensor factorization approach. IEEE Trans. Signal Process. 66(24), 6503–6517 (2018)
8. Li, S., Dian, R., Fang, L., Bioucas-Dias, J.M.: Fusing hyperspectral and multispectral images
via coupled sparse tensor factorization. IEEE Trans. Image Process. 27(8), 4118–4130 (2018)
9. Lika, B., Kolomvatsos, K., Hadjiefthymiades, S.: Facing the cold start problem in recom-
mender systems. Expert Syst. Appl. 41(4), 2065–2073 (2014)
10. Prévost, C., Usevich, K., Comon, P., Brie, D.: Coupled tensor low-rank multilinear approxima-
tion for hyperspectral super-resolution. In: ICASSP 2019-2019 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 5536–5540. IEEE, Piscataway
(2019)
11. Simões, M., Bioucas-Dias, J., Almeida, L.B., Chanussot, J.: A convex formulation for
hyperspectral image superresolution via subspace-based regularization. IEEE Trans. Geosci.
Remote Sens. 53(6), 3373–3388 (2014)
12. Şimşekli, U., Yılmaz, Y.K., Cemgil, A.T.: Score guided audio restoration via generalised
coupled tensor factorisation. In: 2012 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 5369–5372. IEEE, Piscataway (2012)
13. Sorber, L., Van Barel, M., De Lathauwer, L.: Structured data fusion. IEEE J. Select.Topics
Signal Process. 9(4), 586–600 (2015)
14. Sørensen, M., De Lathauwer, L.: Coupled tensor decompositions for applications in array
signal processing. In: 2013 5th IEEE International Workshop on Computational Advances in
Multi-Sensor Adaptive Processing (CAMSAP), pp. 228–231. IEEE, Piscataway (2013)
15. Sørensen, M., De Lathauwer, L.: Coupled canonical polyadic decompositions and (coupled) decompositions in multilinear rank-(L_{r,n}, L_{r,n}, 1) terms—Part I: Uniqueness. SIAM J. Matrix Anal. Appl. 36(2), 496–522 (2015)
16. Wei, Q., Dobigeon, N., Tourneret, J.Y.: Fast fusion of multi-band images based on solving a
Sylvester equation. IEEE Trans. Image Process. 24(11), 4109–4121 (2015)
17. Wimalawarne, K., Mamitsuka, H.: Efficient convex completion of coupled tensors using
coupled nuclear norms. In: Advances in Neural Information Processing Systems, pp. 6902–
6910 (2018)
18. Wimalawarne, K., Yamada, M., Mamitsuka, H.: Convex coupled matrix and tensor completion.
Neural Comput. 30(11), 3095–3127 (2018)
19. Yılmaz, K.Y., Cemgil, A.T., Simsekli, U.: Generalised coupled tensor factorisation. In:
Advances in Neural Information Processing Systems, pp. 2151–2159 (2011)
20. Yokoya, N., Yairi, T., Iwasaki, A.: Coupled nonnegative matrix factorization unmixing for
hyperspectral and multispectral data fusion. IEEE Trans. Geosci. Remote Sens. 50(2), 528–
537 (2011)
21. Zhang, K., Wang, M., Yang, S., Jiao, L.: Spatial–spectral-graph-regularized low-rank tensor
decomposition for multispectral and hyperspectral image fusion. IEEE J. Sel. Top. Appl. Earth
Obs. Remote Sens. 11(4), 1030–1040 (2018)
22. Zhang, G., Fu, X., Huang, K., Wang, J.: Hyperspectral super-resolution: A coupled nonnegative
block-term tensor decomposition approach. In: 2019 IEEE 8th International Workshop on
Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 470–474
(2019). https://doi.org/10.1109/CAMSAP45676.2019.9022476
23. Zhang, S., Wang, X., Liu, A., Zhao, C., Wan, J., Escalera, S., Shi, H., Wang, Z., Li, S.Z.: A
dataset and benchmark for large-scale multi-modal face anti-spoofing. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 919–928 (2019)
24. Zheng, V.W., Zheng, Y., Xie, X., Yang, Q.: Towards mobile intelligence: Learning from GPS
history data for collaborative recommendation. Artif. Intell. 184, 17–37 (2012)
Chapter 6
Robust Principal Tensor Component
Analysis
Robust principal component analysis (RPCA) assumes that the observed data matrix X ∈ R^{I1×I2} can be decomposed into two additive terms:

X = L + E,   (6.1)
where L represents a low-rank matrix and the sparse matrix is denoted by E. Figure
6.1 provides an illustration for RPCA.
To separate the principal component from the sparse corruption, the optimization model can be formulated as follows:

min_{L, E} rank(L) + λ‖E‖_0,   s. t.   X = L + E,   (6.2)

where rank(L) is the rank of the matrix L, ‖E‖_0 is the ℓ0 pseudo-norm which counts the nonzero entries, and λ is a parameter to balance the sparsity and low-rank terms.
Fig. 6.1 Illustration of RPCA: the original matrix is separated into a low-rank matrix and a sparse matrix
Problem (6.2) is highly nonconvex and hard to solve. By replacing the nonconvex rank and ℓ0 norm with the convex matrix nuclear norm and ℓ1 norm, it can be relaxed into a tractable convex optimization model as follows:

min_{L, E} ‖L‖_* + λ‖E‖_1,   s. t.   X = L + E,   (6.3)

where ‖·‖_* denotes the nuclear norm, which is the sum of singular values of the low-rank component L, ‖·‖_1 denotes the ℓ1 norm, which is the sum of absolute values of all elements in the sparse component E, and the parameter λ is the weighting factor to balance the low-rank and sparse components, which is suggested to be set as 1/√max(I1, I2) for good performance. It has been proven that under some incoherence conditions, L and E can be perfectly recovered.
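For readers who wish to experiment, a minimal ADMM sketch for problem (6.3) is given below; the heuristic choice of μ and the stopping rule are assumptions of this sketch, not prescriptions from the original works.

```python
import numpy as np

def rpca_admm(X, lam=None, mu=None, n_iter=500, tol=1e-7):
    """Sketch of min ||L||_* + lam * ||E||_1  s.t.  X = L + E, solved by ADMM."""
    I1, I2 = X.shape
    lam = 1.0 / np.sqrt(max(I1, I2)) if lam is None else lam
    mu = 0.25 / (np.abs(X).mean() + 1e-12) if mu is None else mu   # heuristic step size
    L = np.zeros_like(X); E = np.zeros_like(X); Y = np.zeros_like(X)
    for _ in range(n_iter):
        # low-rank update: singular value thresholding of X - E + Y/mu
        U, s, Vt = np.linalg.svd(X - E + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0)) @ Vt
        # sparse update: entry-wise soft thresholding
        T = X - L + Y / mu
        E = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)
        # dual update
        R = X - L - E
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(X):
            break
    return L, E
```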
The major shortcoming of RPCA is that it can only handle matrix data. In
real-world applications, a lot of data are naturally of higher order, such as grayscale videos, color images, and hyperspectral images. The classic RPCA needs
to transform the high-dimensional data into a matrix, which inevitably leads to loss
of structural information.
In order to better extract the low-rank structure from high-dimensional data,
classical RPCA has been extended into tensor versions. A given tensor X ∈
RI1 ×I2 ×I3 can be decomposed as two additive tensor terms as follows:
X = L + E, (6.4)
where L denotes the low-rank tensor component and E the sparse one. The t-SVD-based methods are introduced first, and Sect. 6.2.3 presents the other methods based on Tucker and tensor train decompositions.
where Â and Ŝ represent the results of the FFT along the third mode of A and S, respectively.
Definition 6.2 (Tensor Average Rank [33]) The tensor average rank of A ∈ R^{I1×I2×I3}, denoted as rank_a(A), is defined as

rank_a(A) = (1/I3) rank(circ(A)).   (6.6)
For a third-order tensor, equipped with the TNN and the classical ℓ1 norm constraint, the RPTCA [32, 33] convex optimization model can be formulated as

min_{L, E} ‖L‖_TNN + λ‖E‖_1,   s. t.   X = L + E,   (6.7)

where λ is a regularization parameter suggested to be set as 1/√(max(I1, I2) I3), ‖L‖_TNN denotes the TNN of the low-rank tensor component L, and ‖E‖_1 is the ℓ1 norm of the sparse tensor. Figure 6.2 gives an illustration of the classical RPTCA.
It has been analyzed that an exact recovery can be obtained for both the low-rank
component and the sparse component under certain conditions. Firstly, we need to
assume that the low-rank component is not sparse. In order to guarantee successful
low-rank component extraction, the low-rank component L should satisfy the tensor
incoherence conditions.
Definition 6.3 (Tensor Incoherence Conditions [33]) For the low-rank compo-
nent L ∈ RI1 ×I2 ×I3 , we assume that the corresponding tubal rank rankt (L) = R
and it has the t-SVD L = U ∗ S ∗ V T , where U ∈ RI1 ×R×I3 and V ∈ RI2 ×R×I3 are
orthogonal tensors and S ∈ RR×R×I3 is an f-diagonal tensor. Then L should satisfy
the following tensor incoherence conditions with parameter δ:
max_{i1=1,...,I1} ‖U^T ∗ e̊_{i1}‖_F ≤ √(δR/(I1 I3)),   (6.8)

max_{i2=1,...,I2} ‖V^T ∗ e̊_{i2}‖_F ≤ √(δR/(I2 I3)),   (6.9)

and

‖U ∗ V^T‖_∞ ≤ √(δR/(I1 I2 I3²)),   (6.10)
where e̊_{i1} ∈ R^{I1×1×I3} is the tensor column basis whose (i1, 1, 1)-th element is 1 and the rest are 0, and e̊_{i2} ∈ R^{I2×1×I3} is the one with the (i2, 1, 1)-th entry equal to 1 and the rest equal to 0.
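The t-product ∗ used above (and throughout this chapter) amounts to frontal-slice matrix products in the Fourier domain; a minimal sketch for real-valued tensors, with illustrative helper names:

```python
import numpy as np

def t_product(A, B):
    """t-product of A (I1 x R x I3) and B (R x I2 x I3): slice-wise products after FFT."""
    Ah = np.fft.fft(A, axis=2)
    Bh = np.fft.fft(B, axis=2)
    Ch = np.einsum('irk,rjk->ijk', Ah, Bh)     # matrix product on every frontal slice
    return np.real(np.fft.ifft(Ch, axis=2))

def t_transpose(A):
    """Tensor transpose: transpose each frontal slice and reverse slices 2, ..., I3."""
    At = np.transpose(A, (1, 0, 2)).copy()
    At[:, :, 1:] = At[:, :, 1:][:, :, ::-1]
    return At
```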
We should note that the sparse component is assumed not to be of low tubal rank.
The RPTCA problem can be solved easily and efficiently by alternating direction
method of multipliers (ADMM) [3]. The augmented Lagrangian function from the
optimization model (6.7) is
L(L, E, Y, μ) = ‖L‖_TNN + λ‖E‖_1 + ⟨Y, X − L − E⟩ + (μ/2)‖X − L − E‖_F²,   (6.11)
which is equivalent to

L(L, E, Y, μ) = ‖L‖_TNN + λ‖E‖_1 + (μ/2)‖X − L − E + Y/μ‖_F² − (1/(2μ))‖Y‖_F².

Each variable is then updated alternately with the others fixed.
(1) Low-rank component approximation: All terms containing the low-rank component are extracted from the augmented Lagrangian function (6.11) for the update as follows:

L_{k+1} = argmin_L ‖L‖_TNN + (μ_k/2)‖L − (X − E_k + Y_k/μ_k)‖_F²,   (6.13)
It can be solved by the tensor singular value thresholding (t-SVT) operator

L_{k+1} = U ∗ S_{1/μ_k} ∗ V^T,   (6.14)

where U ∗ S ∗ V^T is the t-SVD of X − E_k + Y_k/μ_k and

S_τ = ifft((Ŝ − τ)_+),   (6.15)

where τ > 0, t_+ denotes the positive part of t, i.e., t_+ = max(t, 0), and Ŝ is the fast Fourier transform (FFT) of S along the third mode. The solution (6.14) is computed by performing the soft thresholding operator on each frontal slice Ŝ(:, :, i).
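A compact NumPy sketch of this t-SVT operator, assuming a real-valued third-order tensor; a production implementation would exploit the conjugate symmetry of the FFT to halve the number of SVDs.

```python
import numpy as np

def t_svt(A, tau):
    """Soft-threshold the singular values of every frontal slice of A in the Fourier domain."""
    I1, I2, I3 = A.shape
    Ahat = np.fft.fft(A, axis=2)
    out = np.zeros_like(Ahat)
    for k in range(I3):
        U, s, Vt = np.linalg.svd(Ahat[:, :, k], full_matrices=False)
        out[:, :, k] = (U * np.maximum(s - tau, 0)) @ Vt
    return np.real(np.fft.ifft(out, axis=2))
```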
(2) Sparse component approximation: All terms containing the sparse component
are extracted from the augmented Lagrangian function (6.11) for the update as
follows:
E_{k+1} = argmin_E λ‖E‖_1 + ⟨Y_k, X − L_{k+1} − E⟩ + (μ_k/2)‖X − L_{k+1} − E‖_F²,   (6.17)

which is equivalent to minimizing λ‖E‖_1 + (μ_k/2)‖E − (X − L_{k+1} + Y_k/μ_k)‖_F². It can be solved by

E_{k+1} = sth_{λ/μ_k}(X − L_{k+1} + Y_k/μ_k),   (6.19)
where sth_τ(X) and sth_τ(X) denote the entry-wise soft thresholding operators for a matrix X and a tensor X, respectively. For any element x in a matrix or a tensor, we have

sth_τ(x) = sgn(x) · max(|x| − τ, 0).
The updates can also be implemented with improved efficiency, as the calculations of the matrix SVDs in the t-SVD at each iteration are independent across frontal slices. The computational complexity for updating the low-rank component and the sparse component in each iteration is O(I1 I2 I3 (log I3 + 1) + ((I3 + 1)/2) I1 I2 I_min), where I_min = min(I1, I2).
In addition to the classic ℓ1 norm, several sparsity constraints have been developed with respect to different applications.
Consider the situation where some tubes of a video tensor are corrupted by heavy noise. In order to better process such pixels for video recovery, another convex optimization model of RPTCA, called tubal-PTCA, was studied in [51] as follows:

min_{L, E} ‖L‖_TNN + λ‖E‖_{1,1,2},   s. t.   X = L + E,   (6.21)

where ‖E‖_{1,1,2} is defined as the sum of the Frobenius norms of all tubal fibers in the third mode, i.e., ‖E‖_{1,1,2} = Σ_{i,j} ‖E(i, j, :)‖_F, and λ is the regularization parameter to balance the low-rank and sparse components. The optimal value of λ is 1/√max(I1, I2) for a given tensor X ∈ R^{I1×I2×I3}.
As the noise exists on tubal fibers along the third mode, ‖E‖_{1,1,2} can well characterize this kind of sparse component. Figure 6.3 shows the original tensor data and the structurally sparse noise assumed by this model. This model can remove the Gaussian noise on tubes, while the moving objects are not treated as the sparse component.
On the other hand, outliers or sample-specific corruptions are common in real
data. When each sample is arranged as a lateral slice of a third-order tensor, an
outlier-robust model for tensor PCA (OR-PTCA) has been proposed in [57] as
follows:

min_{L, E} ‖L‖_TNN + λ‖E‖_{2,1},   s. t.   X = L + E,   (6.22)

where ‖E‖_{2,1} is defined as the sum of the Frobenius norms of all lateral slices, i.e., ‖E‖_{2,1} = Σ_{i2=1}^{I2} ‖E(:, i2, :)‖_F.
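Both structured sparsity measures are straightforward to evaluate; a small sketch:

```python
import numpy as np

def norm_112(E):
    """||E||_{1,1,2}: sum of the Frobenius norms of the mode-3 tubes E(i, j, :)."""
    return np.sqrt((E ** 2).sum(axis=2)).sum()

def norm_21(E):
    """||E||_{2,1}: sum of the Frobenius norms of the lateral slices E(:, i2, :)."""
    return np.sqrt((E ** 2).sum(axis=(0, 2))).sum()
```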
OR-PTCA is the first method that has provable performance guarantee for exact
tensor subspace recovery and outlier detection under mild conditions. Figure 6.4
shows the original tensor data and sparse noise on lateral slices for the measurement
model assumed for OR-PTCA. Experimental results on four tasks including outlier
detection, clustering, semi-supervised, and supervised learning have demonstrated
the advantages of the OR-PTCA method. In addition, it has also been applied to hyperspectral image classification in [42].
To further exploit the local low-rank structure of visual data at different scales, the whole tensor can be partitioned into blocks and each block can be processed separately, leading to the following model:

min_{L_p, E_p} Σ_{p=1}^{P} (‖L_p‖_∗ + λ‖E_p‖_1)
s. t.   X = L_1 ⊞ · · · ⊞ L_P + E_1 ⊞ · · · ⊞ E_P,   (6.23)
where ⊞ denotes the concatenation operator of block tensors; P denotes the number of blocks of the whole tensor, which is equal to (I1/n)(I2/n) when the size of each block tensor is n × n × I3 and the size of the whole tensor is I1 × I2 × I3; L_p, p = 1, 2, . . . , P denotes the block low-rank components; and E_p, p = 1, 2, . . . , P represents the block sparse components. For convenience, the sparse component is defined as E_sum = E_1 ⊞ E_2 ⊞ · · · ⊞ E_P.
6.2 RPTCA Methods Based on Different Decompositions 141
Fig. 6.5 Illustration of the data model for RBPTCA
Figure 6.5 illustrates the data model for robust block principal tensor component analysis (RBPTCA), which is also called block-RPTCA.
As the ℓ1 norm is defined as the sum of the absolute values of all elements, problem (6.23) is equivalent to the following one:

min_{L_p, E} Σ_{p=1}^{P} ‖L_p‖_∗ + λ‖E‖_1
s. t.   X = L_1 ⊞ L_2 ⊞ · · · ⊞ L_P + E.   (6.24)
As shown in the above optimization model, the extraction of the sparse component is not influenced by the block decomposition. By taking advantage of the local similarity of pixels at different scales, the RBPTCA method can choose a suitable block size to better extract details.
Note that the blocks are not overlapping. In fact, the method may exhibit different performance at the same scale with overlapping blocks. Correspondingly, the standard ℓ1 norm-based sparse constraints on overlapping blocks would result in a weighted ℓ1 norm for the estimated tensor and thus provide different sparse denoising performance.
Fig. 6.6 Illustration of the tensor rotation: the frontal slice X(:, :, k) of the original tensor becomes the lateral slice X̄(:, k, :) of the rotated tensor
In some applications, however, the correlations along the other modes cannot be fully exploited. Therefore, the classical tensor nuclear norm (TNN) may not be a good choice under some circumstances.
In [17], a twist tensor nuclear norm (t-TNN) is proposed, which applies a permutation operator to the block circulant matrix of the original tensor. Its definition can be formulated as

‖X‖_{t-TNN} := ‖X̄‖_TNN,   (6.25)

where X̄ is the rotated (twist) tensor satisfying X̄(:, k, :) = X(:, :, k). The corresponding optimization model is

min_{L, E} ‖L‖_{t-TNN} + λ‖E‖_1,   s. t.   X = L + E.   (6.26)
We can see that this model is different from the classical one (6.7) in the definition of
tensor nuclear norm. In this case, the t-TNN mainly exploits the linear relationship
between frames, which is more suitable for describing a scene with motions. Indeed,
t-TNN provides a new viewpoint on the low-rank structure via the rotated tensor, and it
makes t-SVD more flexible when dealing with complex applications.
In [6, 29], an improved robust tensor PCA model has been proposed. The authors
find that low-rank information still exists in the core tensor for many visual data, and
they define an improved tensor nuclear norm (ITNN) to further exploit the low-rank
structures in multiway tensor data, which can be formulated as

‖X‖_ITNN = ‖X‖_TNN + γ‖S̄‖_*,   (6.27)

where γ is set to balance the classical TNN and the nuclear norm of the core matrix S̄.
When we define the core tensor S ∈ R^{I1×I2×I3} from the t-SVD of X and I = min(I1, I2), the core matrix S̄ ∈ R^{I×I3} satisfies S̄(i, :) = S(i, i, :). Figure 6.7 shows the transformation between the core tensor S and the core matrix S̄.
Equipped with the newly defined ITNN (6.27), the convex optimization model for the improved robust principal tensor component analysis (IRPTCA or improved PTCA) is formulated as follows:

min_{L, E} ‖L‖_ITNN + λ‖E‖_1,   s. t.   X = L + E,   (6.28)

where λ is set to balance the low-rank tensor and the sparse tensor.
In classical RPTCA based on t-SVD, the low-rank structure in the third mode
is not fully utilized. By adding the low-rank extraction of the core matrix, ITNN can characterize the low-rank structure of the tensor data more thoroughly. For
example, in background modeling, the low-rank structure of a surveillance video
mainly lies in the third dimension due to the correlation between frames. In this
case, the additional low-rank extraction in the IRPTCA method can improve the
performance effectively.
Although the convex relaxation can easily get the optimal solution, the nonconvex
approximation tends to yield more sparse or lower-rank local solutions. In [41],
instead of using convex TNN, they use nonconvex surrogate functions to approxi-
mate the tensor tubal rank and propose a tensor-based iteratively reweighted nuclear
norm solver.
Motivated by the convex TNN, given a tensor X ∈ RI1 ×I2 ×I3 and the tubal rank
M, we can define the following nonconvex penalty function:
R(X) = (1/I3) Σ_{i3=1}^{I3} Σ_{m=1}^{M} g_λ(σ_m(X̂^{(i3)})),   (6.31)
i3 =1 m=1
where gλ represents a type of nonconvex functions and σm (X̂ (i3 ) ) denotes the m-th
largest singular value of X̂ (i3 ) . If gλ (x) = x, it is equivalent to the TNN.
Based on the nonconvex function (6.31), the following nonconvex low-rank
tensor optimization problem is proposed:
1
I3 M (6.32)
= gλ σm X̂ (i3 ) + f(X ),
I3
i3 =1 m=1
where f denotes some loss functions, which can be nonconvex. In general, it should
be nontrivial. The convergence analysis of the new solver has been further provided.
In the case of matrix-based RPCA, some inevitable biases might exist when minimizing the nuclear norm to extract the low-rank component [36, 37]. In fact, a similar
problem may exist for tensor data. In [21], a novel nonconvex low-rank constraint
called the partial sum of the tubal nuclear norm (PSTNN) has been proposed, which
only consists of some small singular values. The PSTNN minimization directly
shrinks the small singular values but does nothing on the large ones.
The PSTNN is defined as

‖X‖_PSTNN = Σ_{i3=1}^{I3} ‖X̂^{(i3)}‖_{p=N},   (6.33)

where X̂^{(i3)} is the i3-th frontal slice of X̂ and ‖·‖_{p=N} denotes the partial sum of singular values (PSSV), defined as ‖X‖_{p=N} = Σ_{m=N+1}^{min(I1,I2)} σ_m(X) for a matrix X ∈ R^{I1×I2}. The corresponding optimization model for RPTCA can be formulated as follows:

min_{L, E} ‖L‖_PSTNN + λ‖E‖_1,   s. t.   X = L + E.   (6.34)
Rather than using the discrete Fourier transform (DFT) in the t-SVD, other transforms have also been adopted, such as the discrete cosine transform (DCT) [31, 48] and unitary transforms [40]. Besides, the quantum version of t-SVD and atomic-norm regularization can be found in [9] and [44], respectively.
For a D-th-order tensor X ∈ R^{I1×I2×···×ID}, its mode-(k1, k2) unfolding X_{(k1 k2)} is a third-order tensor whose (i_{k1}, i_{k2}, j)-th entry equals X(i_1, . . . , i_D) with

j = 1 + Σ_{s=1, s≠k1, s≠k2}^{D} (i_s − 1) J_s,   with   J_s = Π_{m=1, m≠k1, m≠k2}^{s−1} I_m.   (6.35)
Based on this definition, the weighted sum of tensor nuclear norm (WSTNN) of
tensor X ∈ RI1 ×I2 ×···×ID is proposed as follows:
‖X‖_WSTNN := Σ_{1≤k1<k2≤D} α_{k1k2} ‖X_{(k1 k2)}‖_TNN,   (6.36)

where α_{k1k2} ≥ 0 (1 ≤ k1 < k2 ≤ D) and Σ_{1≤k1<k2≤D} α_{k1k2} = 1. The corresponding optimization model for RPTCA can be formulated as follows:

min_{L, E} ‖L‖_WSTNN + λ‖E‖_1,   s. t.   X = L + E,   (6.37)

which can handle multiway tensors and better model the different correlations along different modes.
In addition to the t-SVD framework, some other tensor decompositions have also been applied to RPTCA recently, and an advanced model with an additional total variation constraint has been used for compressed sensing of dynamic MRI [30]. The Tucker decomposition-based robust tensor PCA was first proposed by Huang in 2014 [18]. More recently, tensor train-based RPTCA has been discussed by Zhang in 2019 [53] and by Yang in 2020 [49].
As a higher-order generalization of matrix SVD, the Tucker decomposition
decomposes a tensor into a core tensor multiplied by factor matrix along each
mode. Given the observed tensor X ∈ RI1 ×I2 ×···×IN , the corresponding Tucker rank
denoted by rankn (X ) is defined as a vector, the k-th entry of which is the rank of the
mode-k unfolding matrix X_(k). Therefore, the tensor PCA optimization model (6.4) can be transformed into

min_{L, E} Σ_{n=1}^{N} λ_n rank(L_(n)) + ‖E‖_0,   s. t.   X = L + E.   (6.38)

Similar to the matrix RPCA model (6.2), problem (6.38) is also nonconvex.
Since the matrix nuclear norm is the convex envelope of the matrix rank, the sum of nuclear norms (SNN) was first proposed to solve the Tucker rank minimization problem in 2012 and successfully applied to the tensor completion
problem [28]. In 2014, a robust tensor PCA model based on SNN minimization is
first proposed by Huang [18] as follows:

min_{L, E} Σ_{n=1}^{N} λ_n ‖L_(n)‖_* + ‖E‖_1,   s. t.   X = L + E,   (6.39)

where L_(n) ∈ R^{In × Π_{k≠n} Ik} is the mode-n matricization of the low-rank tensor
L. For example, when we have a third-order tensor, it uses the combination of
three matrix nuclear norms to solve the Tucker rank minimization problem. The
theoretical guarantee has been given for successful recovery of low-rank and sparse
components from corrupted tensor X . However, the Tucker decomposition has to
perform SVD over all the unfolding matrices, which would cause high computa-
tional costs. Meanwhile, the resulting unfolding matrices in Tucker decomposition
are often unbalanced so that the global correlation of the tensor cannot be adequately
captured.
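For reference, the SNN regularizer in (6.39) is easy to evaluate for a given tensor; the equal weights below are illustrative.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def snn(T, weights=None):
    """Sum of nuclear norms of all mode-n unfoldings, as used in the SNN model (6.39)."""
    N = T.ndim
    weights = [1.0] * N if weights is None else weights
    return sum(w * np.linalg.norm(unfold(T, n), ord='nuc') for n, w in zip(range(N), weights))
```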
It is shown that the tensor train (TT) decomposition can provide a more compact representation for multidimensional data than traditional decompositions like the Tucker decomposition [2]. In 2019, Zhang et al. first applied TT decomposition to recover noisy color images [53] with the optimization model

min_{U_n, V_n, X} Σ_{n=1}^{N−1} (α_n/2) ‖U_n V_n − X_{<n>}‖_F²,   (6.40)

where U_n ∈ R^{(Π_{j=1}^{n} Ij) × R_{n+1}}, V_n ∈ R^{R_{n+1} × (Π_{j=n+1}^{N} Ij)}, and X_{<n>} ∈ R^{(Π_{j=1}^{n} Ij) × (Π_{j=n+1}^{N} Ij)} is the n-unfolding matrix of tensor X; [R_1, . . . , R_N, R_{N+1}] with R_1 = R_{N+1} = 1 is the
tensor train rank. Model (6.40) can be solved by block coordinate descent (BCD)
method, which is similar to the algorithm in [2]. It has been shown that the proposed
method can remove the Gaussian noise efficiently.
Based on TT decomposition, Yang et al. defined a tensor train nuclear norm
(TTNN) and proposed a TT-based RPTCA [49], which can be formulated as
follows:
min_{L, E} Σ_{n=1}^{N−1} λ_n ‖L_{<n>}‖_* + ‖E‖_1,   s. t.   X = L + E,   (6.41)

where L_{<n>} ∈ R^{(Π_{j=1}^{n} Ij) × (Π_{j=n+1}^{N} Ij)} is the n-unfolding matrix of the low-rank tensor L.
This TTNN-based RPTCA aims to recover a low-rank tensor corrupted by sparse
noise, which can be solved by the classical ADMM algorithm.
6.3 Online RPTCA

In the online setting, the observed data tensor is modeled as

Z = X + E,   (6.42)

where Z is the observation, X denotes the low-rank tensor, and E is the sparse tensor. We first introduce Lemma 6.1 with respect to the TNN.
Lemma 6.1 ([52]) For a three-way tensor X ∈ RI1 ×T ×I3 with the tubal rank
rankt (X ) ≤ R, we have
‖X‖_TNN = inf_{L, R} { (I3/2)(‖L‖_F² + ‖R‖_F²) : X = L ∗ R^T }.   (6.43)
1 I3 λ1
min Z − L ∗ RT − E2F + L2F + R2F + λ2 E1 . (6.44)
L,R,E 2 2
where L ∈ R^{I1×R×I3}, R ∈ R^{T×R×I3}, and E ∈ R^{I1×T×I3}. Considering the t-th sample Z(:, t, :), we abbreviate it as Z_t ∈ R^{I1×1×I3}. Then the loss function with respect to the t-th sample is represented as

ℓ(Z_t, L) = min_{R_t, E_t} (1/2)‖Z_t − L ∗ R_t^T − E_t‖_F² + (I3 λ1/2)‖R_t‖_F² + λ2‖E_t‖_1,   (6.45)

where R_t ∈ R^{1×R×I3} and E_t ∈ R^{I1×1×I3} are the t-th components of R and E,
respectively. The total cumulative loss for the t ≤ T samples can be obtained by
f_t(L) = (1/t) Σ_{τ=1}^{t} ℓ(Z_τ, L) + (I3 λ1/(2t))‖L‖_F².   (6.46)
The main idea of ORPTCA is to develop an algorithm for minimizing the total
cumulative loss ft (L) and achieve subspace estimation in an online manner. We
define a surrogate function of the total cumulative loss ft (L), which is represented
as follows:
g_t(L) = (1/t) Σ_{τ=1}^{t} [ (1/2)‖Z_τ − L ∗ R_τ^T − E_τ‖_F² + (I3 λ1/2)‖R_τ‖_F² + λ2‖E_τ‖_1 ] + (I3 λ1/(2t))‖L‖_F².   (6.47)
This is the “non-optimized” version of ft (L). In other words, for all L, we have
gt (L) ≥ ft (L). It has been proved that minimizing gt (L) is asymptotically
equivalent to minimizing ft (L) in [47].
At time t, we get a new sample Z_t; the coefficient R_t, the sparse component E_t, and the spanning basis L will be alternately optimized. Based on the previous subspace estimate L_{t−1}, R_t and E_t are optimized by minimizing the loss function with respect to Z_t. Based on these two newly acquired factors, we minimize g_t(L) to update the spanning basis L_t. Therefore, the following two subproblems need to be solved:
(1) Subproblem with respect to R and E:

{R_t, E_t} = argmin_{R, E} (1/2)‖Z_t − L_{t−1} ∗ R^T − E‖_F² + (I3 λ1/2)‖R‖_F² + λ2‖E‖_1.   (6.48)

For convenience, we define Ẑ_t as the result of the FFT along the third mode of Z_t, i.e., Ẑ_t = fft(Z_t, [ ], 3). In the same way, we can obtain Z_t = ifft(Ẑ_t, [ ], 3). Then the solution of problem (6.48) is presented in Algorithm 35.
(2) Subproblem with respect to L: For convenience, given a tensor Â ∈ R^{I1×I2×I3} in the Fourier domain, its block diagonal matrix Ā ∈ R^{I1 I3×I2 I3} is denoted as

Ā = bdiag(Â) = diag(Â^{(1)}, Â^{(2)}, . . . , Â^{(I3)}).   (6.49)
Letting A_t = A_{t−1} + R_t^T ∗ R_t and B_t = B_{t−1} + (Z_t − E_t) ∗ R_t, we update the spanning basis in the Fourier domain as follows:

L_t = argmin_L (1/2) tr( L̄^T (Ā_t + I3 λ1 I) L̄ ) − tr( L̄^T B̄_t ),   (6.51)

which can be solved by the block coordinate descent algorithm; the details can be found in [52].
Finally, the whole process of the ORPTCA method is summarized in Algorithm 36. For the batch RPTCA method, all samples {Z_t}_{t=1}^{T} need to be stored, and the storage requirement is I1 I3 T. As for ORPTCA, we only need to save L_T ∈ R^{I1×R×I3}, R_T ∈ R^{T×R×I3}, and B_T ∈ R^{I1×R×I3}, so the storage requirement is I3 R T + 2 I1 I3 R. When R ≪ T, the storage requirement can be greatly reduced.
6.4 RPTCA with Missing Entries
The observed data may be not only partially missing but also contaminated by noise in some scenes. In this situation, robust matrix completion for matrix data has been developed in [4]. Extending this to the tensor field, robust principal tensor component analysis with missing entries, also called robust tensor completion (RTC), has been considered in [18, 19, 43]. In fact, it simultaneously deals with robust principal tensor component analysis and tensor completion.
Given the observed tensor X ∈ R^{I1×I2×I3} with O denoting the index set of observed entries, the RTC model based on the TNN (RTC-TNN) related to t-SVD [43] is formulated as follows:

min_{L, E} ‖L‖_TNN + λ‖E‖_1,   s. t.   P_O(L + E) = P_O(X),   (6.52)

where P_O is the projection operator onto the observed entries. Similarly, the RTC model based on the sum of nuclear norms (RTC-SNN) related to the Tucker decomposition can be written as

min_{L, E} Σ_{n=1}^{N} λ_n ‖L_(n)‖_* + ‖E‖_1,   s. t.   P_O(L + E) = P_O(X),   (6.53)

and the RTC model based on the tensor ring decomposition can be formulated as

min_{L, E} Σ_{n=1}^{N} λ_n ‖L_{n,L}‖_* + ‖E‖_1,   s. t.   P_O(L + E) = P_O(X),   (6.54)
where L_{n,L} is the n-shifting L-unfolding matrix of L, which relates to the tensor ring rank of L. These models can be easily solved by the classic ADMM.
6.5 Applications
RPTCA can extract low-rank component and sparse component in tensor data. In
this section, we briefly demonstrate RPTCA for different applications, including
illumination normalization for face images, image denoising, background extrac-
tion, video rain streaks removal, and infrared small target detection.
As we all know, face recognition algorithms are easily affected by the shadow of
the face image. Therefore, it is important to remove the disturbance on face images
and further improve the accuracy of the face recognition. The low-rank plus sparse
model has been successfully used for shadow removal from face images.
We choose the face images of size 192 × 168 with 64 different lighting conditions from the Yale B face database [15] and construct the tensor data as X ∈ R^{192×168×64}. Matrix-based RPCA [4] and multi-scale low-rank decomposition [38] are chosen as the baselines, which reshape the tensor into the matrix X ∈ R^{32256×64}. To show the performance enhancement by tensor methods, we choose
TNN-RPTCA [33] and block-RPTCA [11] for comparison. All the parameters are
set as recommended in their papers. The recovery results of several RPTCA methods
are shown in Fig. 6.8. We can see that the shadows can be removed efficiently,
especially by the block-RPTCA method.
Moreover, we consider the situation that face images may be not only corrupted
by noise but also partially missing. We apply robust tensor completion (RTC)
methods mentioned in Sect. 6.4 for shadow removal on face images. From the YaleB
face dataset,1 we choose the first human subject for experiments. The clean and
complete faces can be regarded as the low-rank component when the shadows are sparse. We randomly choose 50% of the pixels from the original images as the observation. The comparison results are shown in Fig. 6.9. We can see that the RTRC method can
remove most shadows.
1 http://vision.ucsd.edu/content/yale-face-database.
Fig. 6.8 Four methods for removing shadows on face images. (a) Original faces with shadows;
(b) RPCA [4]; (c) multi-scale low-rank decomposition [38]; (d) TNN-RPTCA[33]; (e) block-
RPTCA[11]
In this section, we focus on image denoising. It is one of the most fundamental and
classical tasks in image processing and computer vision. Natural images are always
corrupted by various kinds of noise during the imaging process. Since most information of a gray image is preserved by its first few singular values, an image can be well approximated by a low-rank matrix. If the noise ratio is not too high, image denoising can be modeled as recovering sparsely corrupted low-rank data, which can be processed by
robust principal component analysis.
A color image with the size of I1 ×I2 and three color channels can be represented
as a third-order tensor X ∈ R^{I1×I2×3}. Since the texture information in these three channels is very similar, the low-rank structure also exists among the channels. Classical RPCA has to process the color image on each channel independently, which neglects the strong correlations in the third dimension. RPTCA can fully utilize the low-rank structure between channels and deal with this problem better.
Fig. 6.9 Four methods for removing shadows on face images. (a) Original and observed; (b)
RTRC; (c) RTC-TNN; (d) RTC-SNN; (e) RMC. For (b), (c), (d), (e), left, recovery results; right,
absolute differences with respect to original images
Fig. 6.10 Several results of image denoising on five sample images. (a) Original images; (b) noisy
images; (c) RPCA[4]; (d) SNN-PTCA[18]; (e) TNN-PTCA[33]; (f) tubal-PTCA[51]; (g) block-
RPTCA[11]; (h) improved PTCA[29]
The key task for background extraction is to model a clear background of the
video frames and accurately separate the moving objects from the video. Generally,
it is also called background modeling or background-foreground separation in different papers. Due to the spatiotemporal correlations between the video frames,
background can be regarded as a low-rank component. The moving foreground
objects only occupy a fraction of pixels in each frame and can be treated as a sparse
component.
Traditional matrix-based RPCA has to vectorize all the frames when dealing with this problem, which leads to a loss of spatial structure information. Driven by this, tensor-based RPCA methods have been proposed and applied to solve such problems for enhanced performance.
We perform numerical experiments on the video sequences from SBI dataset
[34]. Five color image sequences are selected with the size of 320 × 240 × 3.
We choose matrix-based RPCA [4] as the baseline. For comparison, we choose
tensor-based methods including BRTF [54], TNN-RPTCA [33], and improved
RPTCA [29]. The results of the background and foreground separation are shown
in Fig. 6.11. We can observe that in most cases the tensor-based methods outperform the matrix-based one.
Fig. 6.11 Recovered background images of five example sequences. (a) Original; (b) ground
truth; (c) RPCA [4]; (d) TNN-PTCA [33]; (e) BRTF [54]; (f) improved PTCA [29]
Let O denote the observed rainy video, B the rain-free background video, and R the rain streak component. The deraining model considered in [22, 23] can be written as

min_{B, R} Σ_{n=1}^{3} ‖B_(n)‖_* + α1‖∇1 R‖_1 + α2‖R‖_1 + α3‖∇2 B‖_1 + α4‖∇t B‖_1
s. t.   O = B + R,   (6.55)
where αi , i = 1, 2, 3, 4 is set to balance the four terms and ∇1 , ∇2 , and ∇t are the
vertical, horizontal, and temporal TV operators, respectively. In this model, the first
item ∇1 R1 indicates the smoothness priors of the rain streaks in the perpendicular
direction. The second item R1 indicates the sparsity of the rain streaks. And
the items ∇2 B1 and ∇t B1 are imposed to regularize the smoothness of the
background video. Finally, the whole rain-free video is approximated by a low-rank
tensor model which is regularized by SNN. More details can be found in [22, 23].
We perform experiments on a selected color video named “foreman”2 and add synthetically generated rain to it. Figure 6.12 shows two frames of the deraining results. We can see that the rain component is effectively removed in these experimental results.
Infrared search and track (IRST) plays a crucial role in some applications such
as remote sensing and surveillance. The fundamental function of IRST system is
the infrared small target detection. The small targets obtained by infrared imaging technology usually lack texture and structure information, due to the influence of long imaging distance and complex background. In general, an infrared small target looks like a light spot, which is extremely difficult to detect. Therefore, effective detection of infrared small targets is really a challenge.
2 http://trace.eas.asu.edu/yuv/.
Fig. 6.12 Rain streaks removal from video data. From top to bottom are the 11th frame and 110th
frame of the video. (a) Groundtruth. (b) Rainy image. (c) Recovered. (d) Rain removal
Fig. 6.13 The transformation between an original matrix data and a constructed tensor
The classical infrared patch image (IPI) model [14] regards the complex back-
ground as a low-rank component and the target as an outlier. Therefore, the
conventional target detection problem is consistent with the RPCA optimization
model. In [7], in order to exploit more correlations between different patches,
Dai et al. set up an infrared patch-tensor (IPT) model. The transformation between
the original matrix data and the constructed tensor is shown in Fig. 6.13.
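To make the construction in Fig. 6.13 concrete, the sketch below builds a patch-tensor from a single image by sliding a window over it and stacking the patches as frontal slices; the patch size and stride are arbitrary choices for illustration.

```python
import numpy as np

def build_patch_tensor(img, patch=30, stride=10):
    """Slide a patch window over the image and stack the patches as the
    frontal slices of a third-order patch-tensor (IPT-style construction)."""
    H, W = img.shape
    slices = [img[i:i + patch, j:j + patch]
              for i in range(0, H - patch + 1, stride)
              for j in range(0, W - patch + 1, stride)]
    return np.stack(slices, axis=2)      # size: patch x patch x (#patches)

T = build_patch_tensor(np.random.rand(120, 160))
print(T.shape)                           # (30, 30, 140)
```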
In [50], the authors applied the partial sum of the tubal nuclear norm
(PSTNN) to detect infrared small targets. Based on the IPT model, they combine
the PSTNN with a weighted \ell_1 norm to efficiently suppress the background and
preserve the small target. The optimization model is as follows:

\min_{B, T} \ \|B\|_{PSTNN} + \lambda \|W \odot T\|_1, \quad s.t. \ O = B + T,

where \odot denotes the Hadamard product and W is a weighting tensor for the sparse
term.
Fig. 6.14 Infrared small target detection results by Zhang and Peng [50]. (a) Original image; (b)
background; (c) target
Two test images in [13, 14] are used to show the performance. Figure 6.14 shows
the infrared small target detection results by Zhang and Peng [50]. We can see that
the small target can be extracted very well.
6.6 Summary
In this chapter, we first introduce the basic principal component analysis model for
matrix data, which is then extended to the tensor version. We mainly overview
the optimization models, algorithms, and applications of RPTCA. The tensor
optimization model is set up by modeling the separation of the low-rank component
and the sparse component in high-dimensional data. Several sparse constraints have
been summarized according to different applications. In addition, a number of
low-rank constraints based on t-SVD, Tucker decomposition, tensor train, and
tensor ring decomposition have been discussed. In particular, a variety of low-
rank tensor constraints based on t-SVD have been outlined with respect to different
scales, rotation invariance, frequency component analysis, nonconvex versions,
different transforms, and higher-order cases. Besides, online RPTCA and RPTCA
with missing entries are briefly introduced. Finally, the effectiveness of RPTCA
is illustrated in five image processing applications with experimental results,
including illumination normalization for face images, image denoising, background
extraction, video rain streak removal, and infrared small target detection.
In the future, nonlocal similarity may be incorporated into the multi-scale
RPTCA method for performance improvement. Meanwhile, the multidimensional
References
1. Abdel-Hakim, A.E.: A novel approach for rain removal from videos using low-rank recovery.
In: 2014 5th International Conference on Intelligent Systems, Modelling and Simulation, pp.
351–356. IEEE, Piscataway (2014)
2. Bengua, J.A., Phien, H.N., Tuan, H.D., Do, M.N.: Efficient tensor completion for color image
and video recovery: low-rank tensor train. IEEE Trans. Image Proc. 26(5), 2466–2479 (2017)
3. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al.: Distributed optimization and
statistical learning via the alternating direction method of multipliers. Found. Trends Mach.
Learn. 3(1), 1–122 (2011)
4. Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3),
1–37 (2011)
5. Chen, L., Liu, Y., Zhu, C.: Iterative block tensor singular value thresholding for extraction
of low rank component of image data. In: 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 1862–1866. IEEE, Piscataway (2017)
6. Chen, L., Liu, Y., Zhu, C.: Robust tensor principal component analysis in all modes. In: 2018
IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE, Piscataway
(2018)
7. Dai, Y., Wu, Y.: Reweighted infrared patch-tensor model with both nonlocal and local priors
for single-frame small target detection. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens.
10(8), 3752–3767 (2017)
8. De La Torre, F., Black, M.J.: A framework for robust subspace learning. Int. J. Comput. Vision
54(1–3), 117–142 (2003)
9. Driggs, D., Becker, S., Boyd-Graber, J.: Tensor robust principal component analysis: Better
recovery with atomic norm regularization (2019). Preprint arXiv:1901.10991
10. Feng, J., Xu, H., Yan, S.: Online robust PCA via stochastic optimization. In: Advances in
Neural Information Processing Systems, pp. 404–412 (2013)
11. Feng, L., Liu, Y., Chen, L., Zhang, X., Zhu, C.: Robust block tensor principal component
analysis. Signal Process. 166, 107271 (2020)
12. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with
applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395
(1981)
13. Gao, C., Zhang, T., Li, Q.: Small infrared target detection using sparse ring representation.
IEEE Aerosp. Electron. Syst. Mag. 27(3), 21–30 (2012)
14. Gao, C., Meng, D., Yang, Y., Wang, Y., Zhou, X., Hauptmann, A.G.: Infrared patch-image
model for small target detection in a single image. IEEE Trans. Image Process. 22(12), 4996–
5009 (2013)
15. Georghiades, A.S., Belhumeur, P.N., Kriegman, D.J.: From few to many: illumination cone
models for face recognition under variable lighting and pose. IEEE Trans. Pattern Analy. Mach.
Intell. 23(6), 643–660 (2001)
16. Gnanadesikan, R., Kettenring, J.: Robust estimates, residuals, and outlier detection with
multiresponse data. Biometrics 28, 81–124 (1972)
17. Hu, W., Tao, D., Zhang, W., Xie, Y., Yang, Y.: The twist tensor nuclear norm for video
completion. IEEE Trans. Neural Netw. Learn. Syst. 28(12), 2961–2973 (2016)
18. Huang, B., Mu, C., Goldfarb, D., Wright, J.: Provable low-rank tensor recovery. Optimization-
Online 4252(2) (2014)
19. Huang, H., Liu, Y., Long, Z., Zhu, C.: Robust low-rank tensor ring completion. IEEE Trans.
Comput. Imaging 6, 1117–1126 (2020)
20. Huber, P.J.: Robust statistics, vol. 523. John Wiley & Sons, New York (2004)
21. Jiang, T.X., Huang, T.Z., Zhao, X.L., Deng, L.J.: A novel nonconvex approach to recover the
low-tubal-rank tensor data: when t-SVD meets PSSV (2017). Preprint arXiv:1712.05870
22. Jiang, T.X., Huang, T.Z., Zhao, X.L., Deng, L.J., Wang, Y.: A novel tensor-based video
rain streaks removal approach via utilizing discriminatively intrinsic priors. In: The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2827 (2017).
https://doi.org/10.1109/CVPR.2017.301
23. Jiang, T.X., Huang, T.Z., Zhao, X.L., Deng, L.J., Wang, Y.: Fastderain: A novel video rain
streak removal method using directional gradient priors. IEEE Trans. Image Process. 28(4),
2089–2102 (2018)
24. Ke, Q., Kanade, T.: Robust L1 norm factorization in the presence of outliers and missing
data by alternative convex programming. In: 2005 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 739–746. IEEE, Piscataway
(2005)
25. Kernfeld, E., Kilmer, M., Aeron, S.: Tensor–tensor products with invertible linear transforms.
Linear Algebra Appl. 485, 545–570 (2015)
26. Kim, J.H., Sim, J.Y., Kim, C.S.: Video deraining and desnowing using temporal correlation
and low-rank matrix completion. IEEE Trans. Image Process. 24(9), 2658–2670 (2015)
27. Liu, X.Y., Wang, X.: Fourth-order tensors with multidimensional discrete transforms (2017).
Preprint arXiv:1705.01576
28. Liu, J., Musialski, P., Wonka, P., Ye, J.: Tensor completion for estimating missing values in
visual data. IEEE Trans. Pattern Analy. Mach. Intell. 35(1), 208–220 (2012)
29. Liu, Y., Chen, L., Zhu, C.: Improved robust tensor principal component analysis via low-rank
core matrix. IEEE J. Sel. Topics Signal Process. 12(6), 1378–1389 (2018)
30. Liu, Y., Liu, T., Liu, J., Zhu, C.: Smooth robust tensor principal component analysis for
compressed sensing of dynamic MRI. Pattern Recogn. 102, 107252 (2020)
31. Lu, C., Zhou, P.: Exact recovery of tensor robust principal component analysis under linear
transforms (2019). Preprint arXiv:1907.08288
32. Lu, C., Feng, J., Chen, Y., Liu, W., Lin, Z., Yan, S.: Tensor robust principal component analysis:
Exact recovery of corrupted low-rank tensors via convex optimization. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 5249–5257 (2016)
33. Lu, C., Feng, J., Chen, Y., Liu, W., Lin, Z., Yan, S.: Tensor robust principal component analysis
with a new tensor nuclear norm. IEEE Trans. Pattern Analy. Mach. Intell. 42(4), 925–938
(2019)
34. Maddalena, L., Petrosino, A.: Towards benchmarking scene background initialization. In:
International Conference on Image Analysis and Processing, pp. 469–476. Springer, Berlin
(2015)
35. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and
its application to evaluating segmentation algorithms and measuring ecological statistics. In:
Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2,
pp. 416–423. IEEE, Piscataway (2001)
36. Oh, T.H., Kim, H., Tai, Y.W., Bazin, J.C., So Kweon, I.: Partial sum minimization of singular
values in RPCA for low-level vision. In: Proceedings of the IEEE International Conference on
Computer Vision, pp. 145–152 (2013)
37. Oh, T.H., Tai, Y.W., Bazin, J.C., Kim, H., Kweon, I.S.: Partial sum minimization of singular
values in robust PCA: Algorithm and applications. IEEE Trans. Pattern Analy. Mach. Intell.
38(4), 744–758 (2015)
38. Ong, F., Lustig, M.: Beyond low rank + sparse: Multiscale low rank matrix decomposition.
IEEE J. Sel. Topics Signal Process. 10(4), 672–687 (2015)
39. Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. London,
Edinburgh Dublin Philos. Mag. J. Sci. 2(11), 559–572 (1901)
40. Song, G., Ng, M.K., Zhang, X.: Robust tensor completion using transformed tensor SVD
(2019). Preprint arXiv:1907.01113
41. Su, Y., Wu, X., Liu, G.: Nonconvex low tubal rank tensor minimization. IEEE Access 7,
170831–170843 (2019)
42. Sun, W., Yang, G., Peng, J., Du, Q.: Lateral-slice sparse tensor robust principal component
analysis for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 17(1), 107–
111 (2019)
43. Wang, A., Song, X., Wu, X., Lai, Z., Jin, Z.: Robust low-tubal-rank tensor completion.
In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 3432–3436. IEEE, Piscataway (2019)
44. Wang, X., Gu, L., Lee, H.w.J., Zhang, G.: Quantum tensor singular value decomposition with
applications to recommendation systems (2019). Preprint arXiv:1910.01262
45. Wang, A., Li, C., Jin, Z., Zhao, Q.: Robust tensor decomposition via orientation invariant tubal
nuclear norms. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, No.
04, pp. 6102–6109 (2020)
46. Wang, S., Liu, Y., Feng, L., Zhu, C.: Frequency-weighted robust tensor principal component
analysis (2020). Preprint arXiv:2004.10068
47. Wijnen, M., et al.: Online tensor robust principal component analysis. Technical Report, The
Australian National University (2018)
48. Xu, W.H., Zhao, X.L., Ng, M.: A fast algorithm for cosine transform based tensor singular
value decomposition (2019). Preprint arXiv:1902.03070
49. Yang, J.H., Zhao, X.L., Ji, T.Y., Ma, T.H., Huang, T.Z.: Low-rank tensor train for tensor robust
principal component analysis. Appl. Math. Comput. 367, 124783 (2020)
50. Zhang, L., Peng, Z.: Infrared small target detection based on partial sum of the tensor nuclear
norm. Remote Sens. 11(4), 382 (2019)
51. Zhang, Z., Ely, G., Aeron, S., Hao, N., Kilmer, M.: Novel methods for multilinear data
completion and de-noising based on tensor-SVD. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 3842–3849 (2014)
52. Zhang, Z., Liu, D., Aeron, S., Vetro, A.: An online tensor robust PCA algorithm for sequential
2D data. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 2434–2438. IEEE, Piscataway (2016)
53. Zhang, Y., Han, Z., Tang, Y.: Color image denoising based on low-rank tensor train. In: Tenth
International Conference on Graphics and Image Processing (ICGIP 2018), vol. 11069, p.
110692P. International Society for Optics and Photonics, Bellingham (2019)
54. Zhao, Q., Zhang, L., Cichocki, A.: Bayesian CP factorization of incomplete tensors with
automatic rank determination. IEEE Trans. Pattern Analy. Mach. Intell. 37(9), 1751–1763
(2015)
55. Zheng, Y.B., Huang, T.Z., Zhao, X.L., Jiang, T.X., Ma, T.H., Ji, T.Y.: Mixed noise removal
in hyperspectral image via low-fibered-rank regularization. IEEE Trans. Geosci. Remote Sens.
58(1), 734–749 (2019)
56. Zheng, Y.B., Huang, T.Z., Zhao, X.L., Jiang, T.X., Ji, T.Y., Ma, T.H.: Tensor N-tubal rank and
its convex relaxation for low-rank tensor recovery. Inf. Sci. 532, 170–189 (2020)
57. Zhou, P., Feng, J.: Outlier-robust tensor PCA. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 2263–2271 (2017)
Chapter 7
Tensor Regression
With the emergence of high-dimensional data, not only data representation and
information extraction but also causal analysis and system modeling have become
noteworthy research areas. Specifically, it is a critical research direction to model
time-varying networks in order to predict possible future climate states or other
missing values at adjacent locations in climatology [3, 35, 47], to explore the user
relationships within social networks [15, 37, 50], and to identify the effective
features of specific economic activities [4]. However, such network data are
usually indexed in multiple dimensions, such as time, 3D space, and different features
or variables. The multidirectional relatedness between the multiway predictor and
the multiway response poses a big challenge to traditional regression models. The
main problems of utilizing traditional regression models here are the loss of the
inherent structural information, the heavy computational and storage burden, the
absence of uniqueness, the sensitivity to noise, and the difficulty of interpretation
brought by the huge number of model parameters [8, 55]. To address these issues,
different tensor regression frameworks have been developed to better explore the
multiway dependencies between the high-dimensional input and output and to help
the robust modeling of complex networks varying with multiple indexes.
As shown in Fig. 7.1, tensor regression can be applied in a wide range of
fields. For example, facial images or high-dimensional features extracted from
these images can be used for human age estimation [12], human facial expression
attribute estimation [32], etc. The low-rank and smooth characteristics of
the image itself can be well exploited by tensor low-rank approximation and
smoothness constraints, which greatly benefits the modeling of complex
regression systems. Besides, in computer vision, motion reconstruction or tracking
based on video sequences can also be regarded as a multiway regression problem
[31]. Neuroimaging analysis, including medical diagnosis [14, 26, 27, 55],
Fig. 7.1 High-dimensional data related regression tasks and their application fields, including climatology (cokriging, forecasting), facial image analysis (human age estimation, human attributes prediction), computer vision (motion reconstruction, video tracking), neuroimaging analysis (medical diagnosis, neural decoding, brain connectivity analysis, multitask learning), manufacturing and metrology (shape analysis, process optimization), economics (stock prediction, trade analysis), and sociology (user network analysis, traffic prediction, check-ins prediction, rating system, multitask learning, forecasting)
neural decoding [1, 40, 53], and brain connectivity analysis [21, 41], is also a
dominant research field, which has inspired developments in sparse tensor regression
methods. Furthermore, it has become more important than ever to explore the
relationship between manufacturing process parameters and the shape of manufactured
parts [48], which can save a lot of time in process optimization and help achieve
smart manufacturing.
Fig. 7.2 Tensor regression methods: projection-based methods, Bayesian-based learning, and kernel-based methods
A basic linear tensor regression model assumes a linear relationship between the
multiway predictor Xn and the multiway response Yn, as formulated by

Y_n = << X_n, B >> + E_n,   (7.3)
where B ∈ RP1 ×···×PL ×Q1 ×···×QM is the regression coefficient tensor which maps
the predictor into the response by linear projections and << Xn , B >> is set to be a
specific tensor product, such as tensor inner product and tensor contraction product.
Model (7.3) can vary with different problems in terms of data forms, the adopted
tensor products, assumptions over model parameters, and observation errors.
As shown in model (7.3), the model coefficient tensor is of much higher order
than the tensor input and output. It brings a large number of parameters
to estimate during the training process, which inevitably leads to computational and
storage burdens. Moreover, the number of model parameters can easily exceed
the sample size, which causes numerical problems. Therefore, it is necessary
to introduce assumptions over the model parameters or to explore the
characteristics of the original data to improve the model identifiability and
robustness, which is the key problem for linear tensor regression.
Based on the correlation of the predictor or the response along different modes,
a basic assumption over the high-dimensional regression coefficient tensor is low-
rankness. It can be taken as a generalization of reduced rank regression, which
solves the multivariate multiple-outcome regression problem by

\min_{B} \sum_{n=1}^{N} \| y_n - B^T x_n \|_2^2, \quad s.t. \ rank(B) \leq R,

where y_{n,q} is the q-th response in y_n and b_q is the q-th column of B, which characterizes
the relationship between the predictor x_n and the response y_{n,q}. The main reason
for the low-rankness assumption over B here is to regress the outcomes {y_{n,q}}_{q=1}^{Q}
simultaneously and explore the correlations between these responses.
Extending it to the tensor field, the optimization problem becomes

\min_{B} \sum_{n=1}^{N} \| Y_n - << X_n, B >> \|_F^2, \quad s.t. \ rank(B) \leq R,

where R denotes an upper bound of a specific tensor rank. In fact, there are two main
advantages of assuming the regression coefficient matrix or tensor to be low-rank
rather than using the original least squares method. On the one hand, the low-rank
assumption can be regarded as a regularization, which improves the robustness
of the model and reduces the number of model parameters. On the other hand, the low-rank
assumption over the regression coefficients makes good use of the similarities between
different inputs and outputs, which plays the role of data exploration during the
training process.
The simple tensor linear regression model is the most basic model to illustrate
the statistical relationship between two multidimensional variables. It directly solves
the minimization problem in (7.6) when the observation error E is assumed to
follow a Gaussian distribution. Up to now, four different optimization methods have been
used to tackle this tensor regression model, including rank minimization method,
projected gradient descent, greedy low-rank learning, and alternating least squares
method.
In [37], the low-rank tensor learning model was first developed for multitask
learning, namely, multilinear multitask learning (MLMTL).
For a set of T linear regression tasks, M_t observations {x_{m_t,t}, y_{m_t,t}} are
obtained for each task t, t = 1, ..., T, m_t = 1, ..., M_t. For multitask learning,
the tasks may be indexed by multiple indices, i.e., T = T_1 × ... × T_N. For
example, to build a restaurant recommender system, we need to regress T_1 users' T_2
rating scores based on their D descriptive attributes of different restaurants. It can
be regarded as a set of T_1 × T_2 regression tasks. For each task t = (t_1, t_2), one needs
to regress the t_1-th user's t_2-th type of scores y_{m_t,t} based on the t_1-th user's
descriptive attributes x_{m_t,t} of size D, i.e.,

y_{m_t,t} = < x_{m_t,t}, b_t > + \epsilon_{m_t,t}, \quad m_t = 1, ..., M_t, \ t = (t_1, t_2),

where b_t \in R^D is the regression coefficient vector of task t.
Extending it into N different indices, the optimization problem for MLMTL is
as follows:
\min_{B} \sum_{t=1}^{T} \sum_{m_t=1}^{M_t} L\big( y_{m_t,t}, < x_{m_t,t}, b_t > \big) + \lambda \sum_{n=1}^{N+1} rank_n(B).   (7.7)

Since directly minimizing the mode-n ranks is intractable, the rank penalty is relaxed by
the overlapped trace norm of the unfolding matrices, which leads to

\min_{B, W_n} \sum_{t=1}^{T} \sum_{m_t=1}^{M_t} L\big( y_{m_t,t}, < x_{m_t,t}, b_t > \big) + \lambda \sum_{n=1}^{N+1} \| (W_n)_{(n)} \|_{tr}   (7.8)
s.t. \ B = W_n, \ n = 1, ..., N+1,

where the overlapped trace norm of B is defined as

\| B \|_{tr} = \frac{1}{N+1} \sum_{n=1}^{N+1} \| B_{(n)} \|_{tr},   (7.9)
where B_{(n)} is the mode-n unfolding matrix of B and (W_n)_{(n)} is the mode-n
unfolding matrix of W_n.
Introducing a set of Lagrange multipliers On , the resulting augmented
Lagrangian function is given by
L(B, \{W_n\}, \{O_n\}) = \sum_{t=1}^{T} \sum_{m_t=1}^{M_t} L\big( y_{m_t,t}, < x_{m_t,t}, b_t > \big)
+ \sum_{n=1}^{N+1} \Big( \lambda \| (W_n)_{(n)} \|_{tr} - < O_n, B - W_n > + \frac{\rho}{2} \| B - W_n \|_F^2 \Big)   (7.10)
for some ρ > 0. The popular optimization algorithm ADMM can be employed
to minimize the augmented Lagrangian function (7.10), and the detailed updating
procedures are concluded in Algorithm 37.
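To make the W_n step of this ADMM concrete, minimizing (7.10) over W_n reduces to a trace (nuclear) norm proximal problem in the mode-n unfolding, solved by singular value thresholding. The sketch below assumes standard mode-n unfolding/folding and implements only the single update implied by (7.10), not the full Algorithm 37.

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def fold(M, n, shape):
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def svt(M, tau):
    # soft-threshold the singular values: prox of tau * trace norm
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def update_Wn(B, On, n, lam, rho):
    # W_n = argmin  lam*||(W_n)_(n)||_tr + (rho/2)*||W_n - (B - O_n/rho)||_F^2
    G = B - On / rho
    return fold(svt(unfold(G, n), lam / rho), n, B.shape)
```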
Besides the overlapped trace norm, some other norms are also exploited for
enhanced prediction performance, such as the latent trace norm [46] and the scaled
latent trace norm [45].
Similar to the ALS-based algorithm for tensor decomposition, this problem can be
solved by updating each of the model parameters B(1) , . . . , B(M) iteratively while
others are fixed. Specifically, with respect to each factor B(m) , problem (7.12) boils
down to the following subproblem:
which can be easily solved by the least squares method, where X̃(m) = X(m) (B(M) ⊗
· · · ⊗ B(m+1) ⊗ B(m−1) ⊗ · · · ⊗ B(1) )T . Algorithm 38 provides a summary of the
updating procedures.
For ALS-based approaches, the main computational complexity comes from the
construction of the matrix X̃(m) and its inverse. In addition, for large-scale datasets,
the storage of the intermediate variables is also a big challenge.
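As a toy illustration of such alternating updates, the sketch below performs ALS sweeps for a scalar-response model y_n = <X_n, B> + e_n with a rank-R CP-structured coefficient B = sum_r u_r o v_r o w_r; the third-order setting, the random data, and the plain least squares subproblems are assumptions made for the example and do not reproduce the exact subproblem (7.12).

```python
import numpy as np

rng = np.random.default_rng(0)
N, P1, P2, P3, R = 150, 5, 6, 4, 2
X = rng.standard_normal((N, P1, P2, P3))   # toy tensor predictors
y = rng.standard_normal(N)                 # toy scalar responses
U = rng.standard_normal((P1, R))
V = rng.standard_normal((P2, R))
W = rng.standard_normal((P3, R))

def ls_update(Z, y, shape):
    # least squares for one factor; Z holds the contracted design features
    coef, *_ = np.linalg.lstsq(Z.reshape(len(y), -1), y, rcond=None)
    return coef.reshape(shape)

for sweep in range(10):                    # alternate over the three factors
    U = ls_update(np.einsum('nijk,jr,kr->nir', X, V, W), y, (P1, R))
    V = ls_update(np.einsum('nijk,ir,kr->njr', X, U, W), y, (P2, R))
    W = ls_update(np.einsum('nijk,ir,jr->nkr', X, U, V), y, (P3, R))
```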
In [3], a unified framework is proposed for the cokriging and forecasting tasks in
spatiotemporal data analysis as follows:

\min_{B} L(B; X, Y), \quad s.t. \ rank(B) \leq R,

where X \in R^{N \times I_1 \times I_3}, Y \in R^{N \times I_2 \times I_3}, B \in R^{I_1 \times I_2 \times I_3} is the coefficient tensor,

L(B; X, Y) = \sum_{i_3=1}^{I_3} \| Y_{:,:,i_3} - X_{:,:,i_3} B_{:,:,i_3} \|_F^2,   (7.15)

and rank(B) is the sum of all the mode-n ranks of the tensor B. Rather than using
rank minimization methods or ALS-based methods, [3] proposes a greedy low n-rank
tensor learning method which searches for the best rank-1 estimate one at a time until
the stopping criterion is reached. At each iteration, the best rank-1 update is
obtained by finding the best rank-1 approximation of all the unfolding matrices and
selecting the mode which gives the maximum decrease of the objective function, as
summarized in Algorithm 39. The operation refold is the inverse of the
corresponding unfolding, which satisfies refold(B_{(d)}, d) = B.
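The mode-selection idea can be sketched as follows, where for simplicity the regression loss (7.15) is replaced by a plain tensor-approximation loss so that the decrease obtained from a rank-1 update in mode d equals the leading singular value of the residual unfolding; this is a simplified surrogate of Algorithm 39, not its exact implementation.

```python
import numpy as np

def unfold(T, d):
    return np.moveaxis(T, d, 0).reshape(T.shape[d], -1)

def fold(M, d, shape):
    full = [shape[d]] + [s for i, s in enumerate(shape) if i != d]
    return np.moveaxis(M.reshape(full), 0, d)

def greedy_low_nrank(T, n_steps=10):
    """Greedily add rank-1 components: at each step take the best rank-1
    matrix approximation among all mode unfoldings of the residual and keep
    the mode whose leading singular value (objective decrease) is largest."""
    B = np.zeros_like(T)
    for _ in range(n_steps):
        R = T - B
        best = None
        for d in range(T.ndim):
            U, s, Vt = np.linalg.svd(unfold(R, d), full_matrices=False)
            if best is None or s[0] > best[0]:
                best = (s[0], d, s[0] * np.outer(U[:, 0], Vt[0, :]))
        _, d, rank1 = best
        B = B + fold(rank1, d, T.shape)
    return B

B_hat = greedy_low_nrank(np.random.rand(6, 7, 8), n_steps=5)
```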
Both the rank minimization methods and the ALS-based methods suffer
from high storage and computational costs for large-scale high-dimensional datasets.
In addition, rank minimization methods usually converge slowly, while ALS-based
methods often lead to suboptimal solutions. Therefore, an efficient algorithm based
on projected gradient descent is proposed for spatiotemporal data analysis and
multitask learning in [50], namely, tensor projected gradient (TPG).
Given the predictor X \in R^{N \times I_1 \times I_3} and its corresponding response Y \in
R^{N \times I_2 \times I_3}, the target problem is as follows:

\min_{B} L(B; X, Y), \quad s.t. \ rank(B) \leq R,   (7.16)

where B \in R^{I_1 \times I_2 \times I_3} is the coefficient tensor and L(B; X, Y) is the fitting loss as
defined in (7.15).
Treating problem (7.16) as an unconstrained optimization problem with only
the loss term (7.15), TPG updates the coefficient tensor along the gradient descent
direction, as shown in Algorithm 40. Then a projection step enforcing the rank
constraint is performed, i.e., the estimated coefficient tensor is projected back onto
the set of tensors satisfying the rank constraint.
In generalized linear models (GLMs), the response y_n is assumed to follow an
exponential family distribution with natural parameter \theta_n and dispersion parameter
\phi_n, where a(·), b(·), and c(·) are functions that vary with the distribution, such as
the Poisson distribution and the gamma distribution. Unlike simple linear regression,
which maps the predictor directly to the response, a GLM links the linear predictor
and the response by a link function
g(\mu_n) = \alpha_n + \gamma^T z_n + < B, X_n >
        = \alpha_n + \gamma^T z_n + \Big< \sum_{r=1}^{R} u_r^{(1)} \circ \cdots \circ u_r^{(L)}, \ X_n \Big>   (7.19)
        = \alpha_n + \gamma^T z_n + \Big< \big( U^{(L)} \odot \cdots \odot U^{(1)} \big) 1_R, \ vec(X_n) \Big>,   (7.20)
where X_n \in R^{P_1 \times \cdots \times P_L} is the tensor input, z_n is the vector input, \alpha_n is the intercept,
\gamma is the vector of coefficients, and B is the coefficient tensor which represents the
impact of the tensor input X_n on the response y_n. g(\mu_n) is a link function
of the mean value \mu_n = E(y_n | X_n, z_n). For example, if the response y_n obeys
the binomial distribution, then the link function should be the logit function in
order to map the binary output into a continuous variable. For data exploration
and dimension reduction, the CP decomposition form is employed, where U^{(l)} =
[u_1^{(l)}, ..., u_R^{(l)}] \in R^{P_l \times R} and \odot denotes the Khatri-Rao product. By using the CP
factorization, the number of model parameters is decreased from O(P_1 \cdots P_L) to O(\sum_{l=1}^{L} P_l R).
The model parameters can be estimated by maximizing the log-likelihood function

L\big( \alpha_1, ..., \alpha_N, \gamma, U^{(1)}, ..., U^{(L)} \big) = \sum_{n=1}^{N} \log \mathrm{Prob}(y_n | X_n, z_n).
Based on alternating minimization, the subproblem with respect to each factor U^{(l)}
is a simple GLM problem. The updating procedure is summarized in Algorithm 42.
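A minimal sketch of this alternating scheme for a binary response with the logit link is given below, using a second-order (matrix) input and scikit-learn's logistic regression as the per-factor GLM solver; the data, rank, and regularization are arbitrary toy choices, and the snippet is not the book's Algorithm 42.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, P1, P2, R = 200, 8, 6, 2
X = rng.standard_normal((N, P1, P2))            # toy tensor (matrix) inputs
y = rng.integers(0, 2, size=N)                  # toy binary labels
U1 = rng.standard_normal((P1, R))
U2 = rng.standard_normal((P2, R))

def design_U1(X, U2):
    # features such that <B, X_n> = vec(U1) . vec(X_n @ U2)
    return np.einsum('npq,qr->npr', X, U2).reshape(X.shape[0], -1)

def design_U2(X, U1):
    return np.einsum('npq,pr->nqr', X, U1).reshape(X.shape[0], -1)

for it in range(10):                            # alternating GLM fits
    U1 = LogisticRegression(C=1.0).fit(design_U1(X, U2), y).coef_.reshape(P1, R)
    U2 = LogisticRegression(C=1.0).fit(design_U2(X, U1), y).coef_.reshape(P2, R)

B = U1 @ U2.T                                   # rank-R coefficient estimate
```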
Tensor ridge regression adds a Frobenius norm penalty on the regression
coefficient tensor to control the model complexity [12, 31, 32, 35]. The corresponding
optimization problem can be written as

\min_{B} L(B; X, Y) + \lambda \| B \|_F^2,   (7.21)
where the form of the loss function L(B; X, Y) varies with the developed
models, e.g., the vector-on-tensor regression model y_n = < X_n, B > + \epsilon_n for N
observations {X_n \in R^{P_1 \times \cdots \times P_L}, y_n \in R} in [12]; the tensor-on-vector regression
model Y = B \times_1 X + E for {X \in R^{N \times D}, Y \in R^{N \times Q_1 \times \cdots \times Q_M}} in [35]; and the tensor-
on-tensor regression model Y = < X, B >_L + E for {X \in R^{N \times P_1 \times \cdots \times P_L}, Y \in
R^{N \times Q_1 \times \cdots \times Q_M}} in [30-32].
To tackle problem (7.21), one way is to incorporate the regularization term into
the data fitting term by augmenting the data, where X̆_(1) = [X_(1), λI]^T and
Y̆_(1) = [Y_(1), λ0]^T. Then the algorithms mentioned before can be directly used
to address the resulting unregularized optimization problem.
Besides imposing the Frobenius norm penalty on the original coefficient
tensor B, it is also reasonable to penalize the latent factors under a specific factorization
form like the CP decomposition, i.e.,

\Omega(B) = \sum_{m=1}^{M} \| U^{(m)} \|_F^2,   (7.23)

which has been studied with the vector-on-tensor regression model y_n = < X_n, B >
+ \epsilon_n for N observations {X_n \in R^{P_1 \times \cdots \times P_L}, y_n \in R} in [1, 27, 40].
To tackle problem (7.24), three different ways can be used, where two are based
on the previously mentioned rank minimization method and projected gradient
descent method, respectively.
Rank Minimization Approaches Specifically, the optimization problem proposed
in [40] is given by

\min_{B} \frac{1}{2} \sum_{n=1}^{N} \big( y_n - < X_n, B > \big)^2 + \lambda_1 \| B \|_1 + \lambda_2 \| B \|_*,   (7.25)

which uses the convex tensor nuclear norm to replace the rank constraint, defined as

\| B \|_* = \frac{1}{D} \sum_{d=1}^{D} \| B_{(d)} \|_*.   (7.26)
\min_{B \in C} \frac{1}{2} \sum_{n=1}^{N} \| y_n - < B, X_n > \|_F^2,   (7.27)
where C denotes the feasible set of the constrained coefficient tensor. Imposing such
constraints on the coefficient tensor just changes the projection procedure into a
constrained low-rank approximation problem, and existing tensor representation
methods can be directly used for projecting the estimate into the desired subspace.
Greedy Low-Rank Tensor Learning In [14], a fast unit-rank tensor factorization
method is proposed, which sequentially finds unit-rank tensors for the
following problem:

\min_{B_r} \frac{1}{N} \sum_{n=1}^{N} \big( y_{n,r} - < X_n, B_r > \big)^2 + \lambda_r \| B_r \|_1 + \alpha \| B_r \|_F^2,   (7.29)
s.t. \ rank(B_r) \leq 1,
where r is the sequential index of the unit-rank tensors and y_{n,r} is the remaining
residual after r - 1 approximations. Unit-rank tensors are added until the stopping
condition is reached, and the full estimate is given by B = \sum_{r} B_r.
The key problem for [14] becomes the unit-rank tensor factorization problem,
which can be divided into several regression subproblems with elastic net penalty
terms by the alternating search approach. And each subproblem can be solved using
the stagewise regression algorithms.
Other Variants Besides, instead of simply enforcing the whole coefficient tensor
to be sparse, the sparsity constraint can also be imposed on the latent factors.
For example, the regression model adopted in [42] maps the vector input x_n to
the multidimensional response Y_n by Y_n = B \times_{M+1} x_n + E_n. Imposing the CP
decomposition on B, it can be rewritten as

Y_n = \sum_{r} w_r \, u_r^{(1)} \circ \cdots \circ u_r^{(M)} \circ u_r^{(M+1)} \times_{M+1} x_n + E_n.   (7.30)
Considering the sparsity constraint over the latent factors, we can get the optimization
problem as

\min_{w_r, u_r^{(1)}, ..., u_r^{(M+1)}} \frac{1}{2N} \sum_{n=1}^{N} \Big\| Y_n - \sum_{r} w_r \big( u_r^{(M+1)T} x_n \big) \, u_r^{(1)} \circ \cdots \circ u_r^{(M)} \Big\|_F^2   (7.31)
s.t. \ \| u_r^{(m)} \|_2 = 1, \ \| u_r^{(m)} \|_0 \leq S_m.
In this way, based on alternating minimization, the subproblem with respect to each
latent factor can be solved by efficient sparse tensor decomposition methods.
However, element-wise sparsity cannot utilize the structural information of
the data. For the multiresponse regression model y_{n,q} = < X_n, B_q > + \epsilon_n for
N observations {X_n \in R^{P_1 \times \cdots \times P_L}, y_{n,q} \in R}, simply imposing sparsity on the
coefficient tensors means fitting each response with a sparse tensor regression
model individually. This is not in line with the actual situation that the important
features for potentially correlated responses are similar. Therefore, when B_q admits
the CP decomposition B_q = [[U_q^{(1)}, ..., U_q^{(L)}]] for q = 1, ..., Q, [26] proposed a
group lasso penalty as follows:

\Omega(B_1, ..., B_Q) = \sum_{l=1}^{L} \sum_{r=1}^{R} \sum_{p_l=1}^{P_l} \Big( \sum_{q=1}^{Q} U_q^{(l)}(p_l, r)^2 \Big)^{1/2},   (7.32)
which regards the same position of {B1 , . . . , BQ } as a group and imposes sparsity
constraint over these groups. In this way, if a subregion is not related with any
response, that region will be dropped out from the model.
Besides, group sparsity constraints can be used for rank estimation. It has
been proved in [10, 51] that the rank minimization problem can be regarded as a
decomposition with sparsity constraints. Motivated by this, [12] developed a
group sparsity term to estimate the CP rank during the training process:

\Omega(B) = \sum_{r=1}^{R} \Big( \sum_{l=1}^{L} \| U^{(l)}(:, r) \|_2^2 \Big)^{1/2},   (7.33)

where B = \sum_{r=1}^{R} U^{(1)}(:, r) \circ \cdots \circ U^{(L)}(:, r). Since B is the sum of R rank-1 tensors,
treating each rank-1 tensor as a group and enforcing these components to be sparse
can make unnecessary components jointly zero and thus reduces the CP rank.
In addition, a variant which aims to estimate the appropriate tensor ring rank during
the training process is proposed in [29].
The key to Bayesian learning is the utilization of prior knowledge, and the parameters
can be tuned according to the data during the training process. This differs from
penalized regression, in which the weighting factor of the penalty term is directly
specified. In other words, penalized regression introduces the prior information
based on experience rather than the data, and the weighting factor needs to be selected
by a large number of cross-validation experiments. Bayesian regression considers
all the possibilities of the parameters and tunes them in the estimation process
according to the training data. In addition, Bayesian regression can give a probability
distribution of the predicted values, not just a point estimate.
For tensor linear regression, the key challenge for Bayesian learning is the design
of a multiway shrinkage prior for the tensor coefficients. For example, in [11], a
novel class of multiway shrinkage priors for the coefficient tensor, called the multiway
Dirichlet generalized double Pareto (M-DGDP) prior, is proposed for the model
\sigma^2 \sim \pi_\sigma, \quad \gamma \sim \pi_\gamma, \quad u_r^{(l)} \sim \pi_U,
u_r^{(l)} \sim N\big( 0, (\phi_r \tau) W_{lr} \big),   (7.35)

which can be illustrated by the hierarchical graph in Fig. 7.3. As shown in Fig. 7.3,
the distribution of the latent factor relies on the distributions of \tau, \Phi, and W_{lr}.
These three factors play different roles: \tau \sim Ga(a_\tau, b_\tau) is a global
scale parameter which enters each component through \tau_r = \phi_r \tau for r = 1, ..., R;
\Phi = [\phi_1, ..., \phi_R] \sim Dirichlet(\alpha_1, ..., \alpha_R) encourages the latent factors to be
low-rank under the tensor decomposition; and W_{lr} = diag(w_{lr,1}, ..., w_{lr,P_l}),
l = 1, ..., L, are margin-specific scale parameters which encourage shrinkage at the
local scale, where w_{lr,p_l} \sim Exp(\lambda_{lr}^2/2), \lambda_{lr} \sim Ga(a_\lambda, b_\lambda) for p_l = 1, ..., P_l,
r = 1, ..., R, l = 1, ..., L.
Fig. 7.3 The diagram of the hierarchical shrinkage prior for Bayesian dynamic tensor regres-
sion [4]
With a similar shrinkage prior, the Bayesian learning approach has been extended
to tensor autoregression models for analyzing time-varying networks in economics
[4]. An application to brain activation and connectivity analysis can be found in
[41].
Since the computation of the posterior distribution is complex, it is usually
calculated using approximate methods. These are mainly divided into two categories:
the first class simplifies the posterior distribution, such as variational inference or the
Laplace approximation, and the second class uses sampling methods, such as Gibbs
sampling and HMC sampling. Most Bayesian learning approaches for tensor linear
regression employ Gibbs sampling to compute the posterior distribution of the
coefficient tensor. Although the sampling methods are simple to implement and can
achieve high accuracy, they cost a lot of time in training. Although the derivation of
variational inference is more involved, its computational speed makes it an appealing
alternative.
The main idea of PLS is to first project the response and the predictor into a latent
space and then find the relationship between their latent factors. Specifically,
given input and output matrices X \in R^{N \times I} and Y \in R^{N \times J}, we assume there exists
a common latent factor in their matrix factorization forms as follows:

X = T U^T + E = \sum_{r=1}^{R} t_r u_r^T + E,
Y = T D V^T + F = \sum_{r=1}^{R} d_r t_r v_r^T + F,   (7.36)
The solution of PLS regression is commonly obtained by greedy learning.
That is, at each iteration, PLS only searches for one set of latent factors,
which corresponds to one desired rank-1 component. The corresponding problem at
the r-th iteration seeks the pair of weight vectors that maximizes the covariance
between the latent scores extracted from the (deflated) X and Y.
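The following sketch illustrates this greedy extraction for the matrix case (7.36), assuming centered inputs and a deflation-based scheme in which each iteration takes the leading singular pair of X_r^T Y_r as the covariance-maximizing directions; it is an illustration of the idea rather than the exact algorithm.

```python
import numpy as np

def pls_greedy(X, Y, R):
    """Greedy (deflation-based) PLS sketch for centered X (N x I), Y (N x J)."""
    N, I, J = X.shape[0], X.shape[1], Y.shape[1]
    T, U, V, d = np.zeros((N, R)), np.zeros((I, R)), np.zeros((J, R)), np.zeros(R)
    Xr, Yr = X.copy(), Y.copy()
    for r in range(R):
        Wm, s, Ct = np.linalg.svd(Xr.T @ Yr)    # covariance-maximizing directions
        w, c = Wm[:, 0], Ct[0, :]
        t = Xr @ w
        t /= np.linalg.norm(t)                  # latent score of X
        u = Xr.T @ t                            # X loading
        v = c / np.linalg.norm(c)               # Y loading direction
        dr = t @ (Yr @ v)                       # inner regression weight
        Xr -= np.outer(t, u)                    # deflate the fitted component
        Yr -= dr * np.outer(t, v)
        T[:, r], U[:, r], V[:, r], d[r] = t, u, v, dr
    return T, U, V, d

X0, Y0 = np.random.randn(50, 10), np.random.randn(50, 3)
T, U, V, d = pls_greedy(X0 - X0.mean(0), Y0 - Y0.mean(0), R=2)
```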
After obtaining all the latent factors, we can provide the prediction for a new
predictor X' accordingly. Extending PLS to tensor data, multilinear PLS decomposes
the tensor predictor X and the tensor response Y into rank-1 terms sharing common
latent vectors, i.e.,

X = \sum_{r=1}^{R} t_r \circ u_r^{(1)} \circ u_r^{(2)} + E,
Y = \sum_{r=1}^{R} d_r \, t_r \circ v_r^{(1)} \circ v_r^{(2)} + F,   (7.40)
where t_r, r = 1, ..., R, are the common latent vectors of X and Y, u_r^{(1)} and u_r^{(2)} are
the loading vectors for X, while v_r^{(1)} and v_r^{(2)} are the loading vectors of Y, r = 1, ..., R,
as shown in Fig. 7.5.
Then multilinear PLS obtains the model parameters by sequentially maximizing
the covariance of the r-th common latent factor as follows:
\{ u_r^{(1)}, u_r^{(2)}, v_r^{(1)}, v_r^{(2)} \} = \arg\max_{u_r^{(1)}, u_r^{(2)}, v_r^{(1)}, v_r^{(2)}} \ \mathrm{Cov}(t_r, c_r)   (7.41)
s.t. \ t_r = X_r \times_2 u_r^{(1)} \times_3 u_r^{(2)}, \quad c_r = Y_r \times_2 v_r^{(1)} \times_3 v_r^{(2)},
where X_r and Y_r are the remaining parts after r - 1 approximations. The detailed
updating procedure is summarized in Algorithm 43, where T = [t_1; ...; t_R] and C =
[c_1; ...; c_R]. The linear relationship between T and C is characterized by D =
(T^T T)^{-1} T^T C. After obtaining all the parameters, the prediction for new predictors
X' can be achieved by

Y'_{(1)} = X'_{(1)} \big( U^{(2)} \odot U^{(1)} \big) D \big( V^{(2)} \odot V^{(1)} \big)^T,   (7.42)
In higher-order PLS (HOPLS) [53], the predictor and response tensors are expressed as

X = \sum_{r=1}^{R} G_r^{x} \times_1 t_r \times_2 U_r^{(1)} \cdots \times_{L+1} U_r^{(L)} + E,   (7.43)
Y = \sum_{r=1}^{R} G_r^{y} \times_1 t_r \times_2 V_r^{(1)} \cdots \times_{M+1} V_r^{(M)} + F.   (7.44)
The previous section presented linear tensor regression models and some popular
solutions. However, in real data analysis, the relationship between system input and
output is complex and affected by many factors. The simple linear assumption may
not model complex regression systems well, and the resulting regression model
may fail to deliver ideal prediction performance. In this section, we introduce
some prevalent nonlinear methods which are known and studied for their
powerful fitting capabilities.
The simplest idea is to linearize the nonlinear model so that the existing linear
methods can be exploited for subsequent processing. One example is the kernel
technique, which projects the low-dimensional features into a high-dimensional
space in which the data can be separated by a linear function, as illustrated in
Fig. 7.6.
Given samples {x_n, y_n}_{n=1}^{N}, the mathematical formulation is as follows:

y_n = < b, \phi(x_n) > + \epsilon_n,

where \phi(\cdot) projects the input features into a higher-dimensional space which makes
the given samples linearly separable. Regarding \phi(x_n) as the extracted feature, b
characterizes the relationship between \phi(x_n) and y_n. For example, the optimization
problem for kernel ridge regression is

\min_{b} \sum_{n=1}^{N} \big( y_n - < b, \phi(x_n) > \big)^2 + \lambda \| b \|_2^2,

whose solution only involves the inner products \phi(x_i)^T \phi(x_j) = k(x_i, x_j).
In this way, only the kernel function needs to be defined for kernel-based regression,
with no need to know the projection function \phi(\cdot).
One may ask why we use kernel methods for regression analysis. From another
point of view, it is hard to assess the similarity of two features in a low-dimensional
space, as shown in Fig. 7.6. But it becomes clear that two features belong to the
same group when they are projected into a high-dimensional space. Therefore, the
kernel function is actually a better metric of the similarity between two features.
Motivated by the superiority of kernel methods, some kernel-based tensor
regression models have been developed as extensions to kernel spaces, such as
the kernel-based higher-order linear ridge regression (KHOLRR) [35] and kernel-
based tensor PLS [54]. The main challenge here is how to design tensor-based
kernel functions. As implemented in [35], the simplest way is to vectorize the
tensor predictors and use existing kernel functions, such as the linear kernel and the
Gaussian RBF kernel. But in such a way, the inherent structure information within
the original tensor inputs is ignored. Therefore, some kernel functions defined
in the tensor field have been proposed, such as the product kernel [38, 39], which
measures the similarity of all the mode-d unfolding matrices of two predictor tensors
X and X',

k(X, X') = \prod_{d=1}^{D} k\big( X_{(d)}, X'_{(d)} \big),   (7.50)

and the probabilistic kernel based on information divergences,

k(X, X') = \alpha^2 \prod_{d=1}^{D} \exp\Big( - \frac{1}{2 \beta_d^2} S_d( X \| X' ) \Big),   (7.51)

where S_d(X \| X') represents the similarity of X and X' in mode d, measured by the
information divergence between the distributions of X and X', \alpha is a magnitude
parameter, and \beta_d denotes a length-scale parameter.
Besides, it is also reasonable to decompose the original tensor into its factorization
form and measure the similarity through kernel functions performed over the latent
factors. For example, applying the SVD to the mode-d unfolding matrices of X, the
Chordal distance-based kernel [38] for tensor data is given by

k(X, X') = \prod_{d=1}^{D} \exp\Big( - \frac{1}{2 \beta_d^2} \| V_X^{(d)} V_X^{(d)T} - V_{X'}^{(d)} V_{X'}^{(d)T} \|_F^2 \Big),   (7.52)

where X_{(d)} = U_X^{(d)} \Sigma_X^{(d)} ( V_X^{(d)} )^T and X'_{(d)} = U_{X'}^{(d)} \Sigma_{X'}^{(d)} ( V_{X'}^{(d)} )^T. Meanwhile, [7] proposed
the TT-based kernel functions by measuring the similarity of two tensors through
kernel functions over their core factors under TT decomposition as follows:
k(X, X') = \prod_{d=1}^{D} \sum_{r_d=1}^{R_d} \sum_{r_{d+1}=1}^{R_{d+1}} k\big( G_X^{(d)}(r_d, :, r_{d+1}), \, G_{X'}^{(d)}(r_d, :, r_{d+1}) \big),   (7.53)

where G_X^{(d)} and G_{X'}^{(d)} are the d-th core factors of tensors X and X', respectively, and
{R_1, ..., R_{D+1}} is the tensor train rank.
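To show how such tensor kernels plug into a kernel method, the sketch below combines a Chordal distance-style kernel in the spirit of (7.52) with kernel ridge regression; the truncation of the singular subspaces to k_sub directions, the bandwidth, and the random data are assumptions made for the example.

```python
import numpy as np

def chordal_kernel(X1, X2, k_sub=3, beta=1.0):
    """Compare the leading right-singular subspaces of every mode-d unfolding."""
    def projectors(X):
        Ps = []
        for d in range(X.ndim):
            M = np.moveaxis(X, d, 0).reshape(X.shape[d], -1)
            _, _, Vt = np.linalg.svd(M, full_matrices=False)
            V = Vt[:k_sub].T
            Ps.append(V @ V.T)                    # subspace projector
        return Ps
    val = 1.0
    for A, B in zip(projectors(X1), projectors(X2)):
        val *= np.exp(-np.linalg.norm(A - B) ** 2 / (2.0 * beta ** 2))
    return val

rng = np.random.default_rng(0)
Xs = [rng.standard_normal((4, 5, 6)) for _ in range(20)]   # toy tensor inputs
y = rng.standard_normal(20)
K = np.array([[chordal_kernel(a, b) for b in Xs] for a in Xs])
alpha = np.linalg.solve(K + 0.1 * np.eye(len(Xs)), y)      # kernel ridge weights
X_new = rng.standard_normal((4, 5, 6))
y_pred = np.array([chordal_kernel(X_new, b) for b in Xs]) @ alpha
```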
y_n = f(X_n) + \epsilon,   (7.54)

where the latent function f is assumed to follow a Gaussian process prior
f \sim GP\big( m(X), k(X, X'); \theta \big), m(X) is the mean function of X (usually set to zero),
k(X, X') is the covariance function, i.e., a kernel function, and \theta is the collection of
hyperparameters. The noise follows \epsilon \sim N(0, \sigma^2).
After determining the distribution of the GP prior and hyperparameters in
model (7.54), the posterior distribution of the latent function can be obtained, and
the prediction can be performed based on Bayesian inference. Given new predictor
X', the predictive mean and covariance of the corresponding response y' are

y' = k(X', X) \big( k(X, X) + \sigma^2 I \big)^{-1} y,
cov(y') = k(X', X') - k(X', X) \big( k(X, X) + \sigma^2 I \big)^{-1} k(X, X').
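The predictive equations above can be implemented in a few lines; the sketch below assumes a Gaussian RBF kernel on vectorized tensor inputs and toy data, and follows the standard Cholesky-based computation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P1, P2, sigma2 = 30, 4, 5, 0.1
X = rng.standard_normal((N, P1, P2))
y = np.sin(X.sum(axis=(1, 2))) + np.sqrt(sigma2) * rng.standard_normal(N)
X_new = rng.standard_normal((5, P1, P2))

def kern(A, B, gamma=0.1):
    a, b = A.reshape(len(A), -1), B.reshape(len(B), -1)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)                    # RBF on vectorized tensors

K = kern(X, X)
Ks = kern(X_new, X)
L = np.linalg.cholesky(K + sigma2 * np.eye(N))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = Ks @ alpha                                 # predictive mean  y'
v = np.linalg.solve(L, Ks.T)
cov = kern(X_new, X_new) - v.T @ v                # predictive covariance cov(y')
```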
Another family of nonlinear tensor regression methods adopts the additive model,
in which the response is written as a sum of univariate functions of the tensor entries,

y_n = T(X_n) + \epsilon_n = \sum_{p_1=1}^{P_1} \cdots \sum_{p_L=1}^{P_L} f_{p_1,...,p_L}\big( X_n(p_1, ..., p_L) \big) + \epsilon_n,   (7.57)

where each component function is expanded over a set of basis functions,

f_{p_1,...,p_L}(x) = \sum_{h=1}^{H} \beta_{p_1,...,p_L,h} \, \psi_{p_1,...,p_L,h}(x),   (7.58)

where \psi_{p_1,...,p_L,h}(\cdot) belongs to a family of basis functions, such as polynomial splines [13]
or Fourier series [44], and \beta_{p_1,...,p_L,h} represents the weight of the basis function
\psi_{p_1,...,p_L,h}(\cdot) in f_{p_1,...,p_L}(\cdot).
Letting B_h \in R^{P_1 \times \cdots \times P_L} with B_h(p_1, ..., p_L) = \beta_{p_1,...,p_L,h} and \Psi_h(X_n) \in
R^{P_1 \times \cdots \times P_L} with the (p_1, ..., p_L)-th element equal to \psi_{p_1,...,p_L,h}\big( X_n(p_1, ..., p_L) \big),
we have

T(X_n) = \sum_{h=1}^{H} < B_h, \Psi_h(X_n) >.   (7.59)
Then the estimation of the model reduces to the estimation of the coefficient tensors
B_h. In this way, the algorithms developed for linear tensor regression models can be
employed directly for parameter estimation.
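Since the expansion (7.57)-(7.59) makes the model linear in the coefficient tensors B_h, a toy sketch of the resulting least squares fit is shown below, using a simple polynomial basis psi_h(x) = x**h as an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P1, P2, H = 100, 4, 5, 3
X = rng.standard_normal((N, P1, P2))              # toy tensor inputs
y = rng.standard_normal(N)                        # toy responses

# entrywise basis expansion Psi_h(X_n), here psi_h(x) = x**h, h = 1..H
Psi = np.stack([X ** h for h in range(1, H + 1)], axis=1)    # (N, H, P1, P2)
Z = Psi.reshape(N, -1)                                       # flattened features
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
B = coef.reshape(H, P1, P2)                                  # stacked B_h
y_hat = Z @ coef                                             # fitted T(X_n)
```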
However, the additive model has limitations for applications in large-scale data
mining or tasks with few samples. For additive model, all the predictors are included
and fitted, but most predictors are not feasible or necessary in real applications.
Fig. 7.8 An illustration of the training process of random forest based on tensor regression
Figure 7.8 illustrates the training process of random forest based on tensor regression.
Compared with the traditional random forest method, the tensor-based method
employs tensor regression models instead of multivariate Gaussian distribution
models within the leaf nodes. Recently, a tensor basis random forest method was
proposed in [19], where not only the leaf nodes but also the other child nodes are
fitted with regression models. It should be noted that [19] and [22] predict based on
the average of the existing observations in the leaf nodes, which means that the
predictions cannot take values outside the observation range.
7.5 Applications
Tensor regression has been applied in a wide range of fields, such as sociology,
climatology, computer vision, and neuroimaging analysis. In this section, we will
go through some examples for the applications of tensor regression.
Fig. 7.9 An illustration of the experimental procedure for human age estimation: data preprocessing (face alignment and feature extraction from the facial images), training on the training set, and predicting on the testing set
Here we provide a practical example of the human age estimation task from
facial images. The dataset used in this experiment is the FG-NET
dataset.1 The experimental procedure for human age estimation is illustrated in
Fig. 7.9. The facial images are first aligned and downsized to the same size. Then
a Log-Gabor filter with four scales and eight orientations is employed to process
each image. After obtaining the filtered images, a set of histograms is calculated
for each block. In this experiment, the block size is set to 8 × 8, and the number
of orientation histogram bins is set to 59. The extracted feature tensor of each
image is of size 944 × 4 × 8, which is used to build the relationship between
the facial images and human age. The preprocessed dataset is divided into a training
set and a testing set to train the desired regression model and validate the performance
of the algorithms.
We compare some tensor regression methods mentioned in terms of estimation
error and training time, including higher rank tensor ridge regression (hrTRR),
optimal rank tensor ridge regression (orSTR), higher rank support tensor regression
(hrSTR) [12], regularized multilinear regression and selection (Remurs) [40], sparse
tubal-regularized multilinear regression (Sturm) [27], and fast stagewise unit-rank
tensor factorization (SURF) [14]. Table 7.1 reports the experimental results of these
algorithms. To visualize the difference between different age ranges, we evaluate
the performance of these algorithms in five age ranges, namely, 0–6, 7–12, 13–17,
1 https://yanweifu.github.io/FG_NET_data/index.html.
Fig. 7.10 The visualization of the predicted values vs. the ground truth of the human age
estimation task. (a) hrTRR. (b) hrSTR. (c) orSTR. (d) Remurs. (e) Sturm. (f) SURF
and 18–45. Figure 7.10 shows the predicted values and the given values for the test
data over all employed algorithms and five age ranges.
the alcoholic individuals. The employed EEG dataset is publicly available.2 In this
experiment, only part of the dataset is used; the chosen part is the average of all the
trials under the same condition. After preprocessing, the resulting EEG signal is of
size 64 time points × 64 channels. Mapping the type of the subjects (alcoholic
individuals or controls) to the corresponding EEG signals, we can obtain the
coefficient tensors using several classical tensor-on-vector regression methods, such
as sparse tensor response regression (STORE) [42], tensor envelope tensor
response regression (TETRR) [24], and HOLRR [35]. The OLS method is taken
as a baseline method, too. Figure 7.11 presents the estimated coefficient tensors using
OLS, STORE, TETRR, and HOLRR, and the corresponding p-value maps.
2 https://archive.ics.uci.edu/ml/datasets/EEG+Database.
Fig. 7.11 Performance comparison of EEG data analysis, where the top two rows are the estimated
coefficient tensor by the OLS, STORE, TETRR, and HOLRR and the bottom line shows the
corresponding p-value maps, thresholded at 0.05
7.6 Summary
The key of the linear tensor regression model lies in the approximation of the regression
coefficient tensor. There are three main research directions:
1. How to use fewer parameters to better represent the multidimensional correlation
or the multidimensional function. This is also the key problem in low-rank tensor
learning.
2. How to use data priors to improve the model's identifiability and robustness.
3. Noise modeling, which is also one of the main problems in regression models, such as
the research on sparse noise [28] and mixed noise [33]; however, there is little literature
on robust tensor regression.
The linear model is simple and has strong generalization ability, but it naturally
cannot fit local characteristics well. Nonlinear models have strong fitting
capabilities and have received widespread attention in recent years. For example, in
kernel learning, how to design a suitable tensor kernel function that not only
exploits the multidimensional structure of the data but also simplifies the calculation is
an open problem. In addition, support tensor regression and tensor-based multilayer
regression networks are also promising research directions.
Fig. 7.12 The visualization of the prediction of five variables at one station, including the mean
daily maximum temperature (temp-max), mean daily minimum temperature (temp-min), days of
air frost, total rainfall, and total sunshine duration, by different algorithms
References
1. Ahmed, T., Raja, H., Bajwa, W.U.: Tensor regression using low-rank and sparse Tucker
decompositions. SIAM J. Math. Data Sci. 2(4), 944–966 (2020)
2. Allen, G.: Sparse higher-order principal components analysis. In: Artificial Intelligence and
Statistics, pp. 27–36 (2012)
3. Bahadori, M.T., Yu, Q.R., Liu, Y.: Fast multivariate spatio-temporal analysis via low rank
tensor learning. In: NIPS, pp. 3491–3499. Citeseer (2014)
4. Billio, M., Casarin, R., Kaufmann, S., Iacopini, M.: Bayesian dynamic tensor regression.
University Ca’Foscari of Venice, Dept. of Economics Research Paper Series No, 13 (2018)
5. Bro, R.: Multiway calibration. multilinear PLS. J. Chemom. 10(1), 47–61 (1996)
6. Camarrone, F., Van Hulle, M.M.: Fully bayesian tensor-based regression. In: 2016 IEEE 26th
International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE,
Piscataway (2016)
7. Chen, C., Batselier, K., Yu, W., Wong, N.: Kernelized support tensor train machines (2020).
Preprint arXiv:2001.00360
8. Cichocki, A., Phan, A.H., Zhao, Q., Lee, N., Oseledets, I., Sugiyama, M., Mandic, D.P., et al.:
Tensor networks for dimensionality reduction and large-scale optimization: part 2 applications
and future perspectives. Found Trends Mach. Learn. 9(6), 431–673 (2017)
9. Eliseyev, A., Aksenova, T.: Recursive N-way partial least squares for brain-computer interface.
PloS One 8(7), e69962 (2013)
10. Fan, J., Ding, L., Chen, Y., Udell, M.: Factor group-sparse regularization for efficient low-
rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 5105–5115
(2019)
11. Guhaniyogi, R., Qamar, S., Dunson, D.B.: Bayesian tensor regression. J. Machine Learn. Res.
18(1), 2733–2763 (2017)
12. Guo, W., Kotsia, I., Patras, I.: Tensor learning for regression. IEEE Trans. Image Proc. 21(2),
816–827 (2012)
13. Hao, B., Wang, B., Wang, P., Zhang, J., Yang, J., Sun, W.W.: Sparse tensor additive regression
(2019). Preprint arXiv:1904.00479
14. He, L., Chen, K., Xu, W., Zhou, J., Wang, F.: Boosted sparse and low-rank tensor regression.
In: Advances in Neural Information Processing Systems, pp. 1009–1018 (2018)
15. Hoff, P.D.: Multilinear tensor regression for longitudinal relational data. Ann. Appl. Stat. 9(3),
1169 (2015)
16. Hou, M., Chaib-Draa, B.: Hierarchical Tucker tensor regression: Application to brain imaging
data analysis. In: 2015 IEEE International Conference on Image Processing (ICIP), pp. 1344–
1348. IEEE, Piscataway (2015)
17. Hou, M., Chaib-draa, B.: Online incremental higher-order partial least squares regression for
fast reconstruction of motion trajectories from tensor streams. In: 2016 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6205–6209. IEEE,
Piscataway (2016)
18. Hou, M., Chaib-draa, B.: Fast recursive low-rank tensor learning for regression. In: Pro-
ceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pp.
1851–1857 (2017)
19. Kaandorp, M.L., Dwight, R.P.: Data-driven modelling of the reynolds stress tensor using
random forests with invariance. Comput. Fluids 202, 104497 (2020)
20. Kanagawa, H., Suzuki, T., Kobayashi, H., Shimizu, N., Tagami, Y.: Gaussian process nonpara-
metric tensor estimator and its minimax optimality. In: International Conference on Machine
Learning, pp. 1632–1641. PMLR (2016)
21. Karahan, E., Rojas-Lopez, P.A., Bringas-Vega, M.L., Valdes-Hernandez, P.A., Valdes-Sosa,
P.A.: Tensor analysis and fusion of multimodal brain images. Proc. IEEE 103(9), 1531–1559
(2015)
22. Kaymak, S., Patras, I.: Multimodal random forest based tensor regression. IET Comput. Vision
8(6), 650–657 (2014)
23. Kia, S.M., Beckmann, C.F., Marquand, A.F.: Scalable multi-task gaussian process tensor
regression for normative modeling of structured variation in neuroimaging data (2018).
Preprint arXiv:1808.00036
24. Li, L., Zhang, X.: Parsimonious tensor response regression. J. Amer. Stat. Assoc. 112, 1–16
(2017)
25. Li, X., Xu, D., Zhou, H., Li, L.: Tucker tensor regression and neuroimaging analysis. Stat.
Biosci. 10, 1–26 (2013)
26. Li, Z., Suk, H.I., Shen, D., Li, L.: Sparse multi-response tensor regression for Alzheimer’s
disease study with multivariate clinical assessments. IEEE Trans. Med. Imag. 35(8), 1927–
1936 (2016)
27. Li, W., Lou, J., Zhou, S., Lu, H.: Sturm: Sparse tubal-regularized multilinear regression for
fmri. In: International Workshop on Machine Learning in Medical Imaging, pp. 256–264.
Springer, Berlin (2019)
28. Liu, J., Cosman, P.C., Rao, B.D.: Robust linear regression via ℓ0 regularization. IEEE Trans.
Signal Process. 66(3), 698–713 (2017)
29. Liu, J., Zhu, C., Liu, Y.: Smooth compact tensor ring regression. IEEE Trans. Knowl. Data
Eng. (2020). https://doi.org/10.1109/TKDE.2020.3037131
30. Liu, J., Zhu, C., Long, Z., Huang, H., Liu, Y.: Low-rank tensor ring learning for multi-linear
regression. Pattern Recognit. 113, 107753 (2020)
31. Liu, Y., Liu, J., Zhu, C.: Low-rank tensor train coefficient array estimation for tensor-on-tensor
regression. IEEE Trans. Neural Netw. Learn. Syst. 31, 5402–5411 (2020)
32. Lock, E.F.: Tensor-on-tensor regression. J. Comput. Graph. Stat. 27(3), 638–647 (2018)
33. Luo, L., Yang, J., Qian, J., Tai, Y.: Nuclear-L1 norm joint regression for face reconstruction
and recognition with mixed noise. Pattern Recognit. 48(12), 3811–3824 (2015)
34. Papadogeorgou, G., Zhang, Z., Dunson, D.B.: Soft tensor regression (2019). Preprint
arXiv:1910.09699
35. Rabusseau, G., Kadri, H.: Low-rank regression with tensor responses. In: Advances in Neural
Information Processing Systems, pp. 1867–1875 (2016)
36. Raskutti, G., Yuan, M.: Convex regularization for high-dimensional tensor regression (2015).
Preprint arXiv:1512.01215
37. Romera-Paredes, B., Aung, H., Bianchi-Berthouze, N., Pontil, M.: Multilinear multitask
learning. In: International Conference on Machine Learning, pp. 1444–1452 (2013)
38. Signoretto, M., De Lathauwer, L., Suykens, J.A.: A kernel-based framework to tensorial data
analysis. Neural Netw. 24(8), 861–874 (2011)
39. Signoretto, M., Olivetti, E., De Lathauwer, L., Suykens, J.A.: Classification of multichannel
signals with cumulant-based kernels. IEEE Trans. Signal Process. 60(5), 2304–2314 (2012)
40. Song, X., Lu, H.: Multilinear regression for embedded feature selection with application to fmri
analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)
41. Spencer, D., Guhaniyogi, R., Prado, R.: Bayesian mixed effect sparse tensor response
regression model with joint estimation of activation and connectivity (2019). Preprint
arXiv:1904.00148
42. Sun, W.W., Li, L.: STORE: sparse tensor response regression and neuroimaging analysis. J.
Mach. Learn. Res. 18(1), 4908–4944 (2017)
43. Tang, X., Bi, X., Qu, A.: Individualized multilayer tensor learning with an application in
imaging analysis. J. Amer. Stat. Assoc. 115, 1–26 (2019)
44. Wahls, S., Koivunen, V., Poor, H.V., Verhaegen, M.: Learning multidimensional fourier series
with tensor trains. In: 2014 IEEE Global Conference on Signal and Information Processing
(GlobalSIP), pp. 394–398. IEEE, Piscataway (2014)
45. Wimalawarne, K., Sugiyama, M., Tomioka, R.: Multitask learning meets tensor factorization:
task imputation via convex optimization. In: Advances in Neural Information Processing
Systems, pp. 2825–2833 (2014)
46. Wimalawarne, K., Tomioka, R., Sugiyama, M.: Theoretical and experimental analyses of
tensor-based regression and classification. Neural Comput. 28(4), 686–715 (2016)
47. Xu, J., Zhou, J., Tan, P.N., Liu, X., Luo, L.: Spatio-temporal multi-task learning via tensor
decomposition. IEEE Trans. Knowl. Data Eng. 33, 2764–2775 (2019)
48. Yan, H., Paynabar, K., Pacella, M.: Structured point cloud data analysis via regularized tensor
regression for process modeling and optimization. Technometrics 61(3), 385–395 (2019)
49. Yu, R., Li, G., Liu, Y.: Tensor regression meets gaussian processes. In: International Conference
on Artificial Intelligence and Statistics, pp. 482–490 (2018)
50. Yu, R., Liu, Y.: Learning from multiway data: Simple and efficient tensor regression. In:
International Conference on Machine Learning, pp. 373–381 (2016)
51. Zha, Z., Yuan, X., Wen, B., Zhou, J., Zhang, J., Zhu, C.: A benchmark for sparse coding: When
group sparsity meets rank minimization. IEEE Trans. Image Process. 29, 5094–5109 (2020)
52. Zhang, X., Li, L.: Tensor envelope partial least-squares regression. Technometrics 59, 1–11
(2017)
53. Zhao, Q., Caiafa, C.F., Mandic, D.P., Chao, Z.C., Nagasaka, Y., Fujii, N., Zhang, L., Cichocki,
A.: Higher order partial least squares (HOPLS): a generalized multilinear regression method.
IEEE Trans. Pattern Analy. Mach. Intell. 35(7), 1660–1673 (2013)
54. Zhao, Q., Zhou, G., Adali, T., Zhang, L., Cichocki, A.: Kernelization of tensor-based models
for multiway data analysis: Processing of multidimensional structured data. IEEE Signal
Process. Mag. 30(4), 137–148 (2013)
55. Zhou, H., Li, L., Zhu, H.: Tensor regression with applications in neuroimaging data analysis.
J. Amer. Stat. Asso. 108(502), 540–552 (2013)
Chapter 8
Statistical Tensor Classification
8.1 Introduction
This chapter introduces statistical tensor classification methods, covering logistic tensor regression; support tensor machine, higher rank support tensor machine, and support Tucker machine; and tensor Fisher discriminant analysis. Finally, some practical applications of tensor classification to digital image classification and neuroimaging analysis are given.
Many vector-based learning machines for classification can be summarized in the following general form:

min_{w,b,ξ} f(w, b, ξ)
s.t. y_n c_n (w^T x_n + b) ≥ ξ_n, 1 ≤ n ≤ N,    (8.1)

where w is the projection vector, b is the bias, and ξ collects the slack variables. The decision function is determined by the sign function

sign(x) = { +1, x ≥ 0; −1, x < 0. }    (8.2)
To extend the vector-based learning machines into their tensor forms, which can accept tensors as inputs, the key problem is how to model the relationship between a multidimensional array and its corresponding label. In a vector-based classification model, the decision function maps the vector input to its label through the inner product of the measurement x_n ∈ R^I and the projection vector w ∈ R^I. Therefore, it is natural to consider tensor products to map multiway tensors to their corresponding labels.
Here we consider N training measurements which are represented as tensors
Xn ∈ RI1 ×I2 ×···×IK , 1 ≤ n ≤ N, and their corresponding class labels are yn ∈
{+1, −1}. By employing tensor product of tensor inputs Xn ∈ RI1 ×I2 ×···×IK and
multiway projection tensor W ∈ RI1 ×I2 ×···×IK , we can get the basic tensor learning
models as follows:
min_{W,b,ξ} f(W, b, ξ)
s.t. y_n c_n (⟨X_n, W⟩ + b) ≥ ξ_n, 1 ≤ n ≤ N,    (8.3)

where ⟨X_n, W⟩ refers to one specific tensor product, such as the tensor inner product. In this way, the decision function is determined by the tensor plane ⟨X, W⟩ + b = 0.
However, an obvious shortcoming is that the high dimensionality of the coef-
ficient tensor inevitably requires numerous training samples and causes huge
computational and storage costs. In machine learning, when the dimension of the training samples is much larger than their number, the classifier is prone to overfitting. Dimensionality reduction is therefore important for this kind of high-order data classification. Commonly used assumptions for reducing the dimensionality of the coefficient tensor include low rankness and sparsity.
The difference from low-rank tensor approximation for data recovery is that the latter approximates the observed data themselves, such as images and videos, whereas in regression or classification the approximation target is the multidimensional coefficient tensor, which encodes the relationship between the multiway features of the data and the labels.
In the following, we give a detailed illustration of popular statistical tensor classification algorithms from the perspective of their development from matrix versions to tensor versions: from logistic regression to logistic tensor regression, from the support vector machine to the support tensor machine, and from Fisher discriminant analysis (FDA) to tensor FDA. In fact, all these approaches can be regarded as special cases of the framework in (8.3) with different objective functions and constraints. To make this chapter self-contained and help readers understand the evolution of tensor classification algorithms, a basic introduction to the corresponding vector version is given before each tensor machine.
Logistic regression is a basic and widely used binary classification method which is
derived from linear regression. Given a training set xn ∈ RI , 1 ≤ n ≤ N and their
corresponding class labels yn ∈ {0, 1}, the decision function of logistic regression
is given by
y(x) = φ(w^T x + b),    (8.4)
where w is the weighting vector, b is the bias, and φ(·) is the logistic function, a kind of sigmoid function:

φ(x) = e^x / (1 + e^x).    (8.5)
The decision function uses the nonlinear function φ(·) to project the value of the linear function w^T x + b to the range [0, 1]. Equations (8.4) and (8.5) can be transformed into

log( y / (1 − y) ) = w^T x + b,    (8.6)

and the posterior probabilities are

Prob(y = 1|x) = 1 / (1 + e^{−(w^T x + b)}),  Prob(y = 0|x) = 1 − Prob(y = 1|x).    (8.7)

The parameters w and b can be estimated by maximizing the likelihood function

L(w, b) = ∏_{n=1}^{N} [Prob(y = 1|x_n)]^{y_n} [Prob(y = 0|x_n)]^{1−y_n},    (8.8)

which is equivalent to minimizing the negative log-likelihood:

min_{w,b} − Σ_{n=1}^{N} ( y_n (w^T x_n + b) − log(1 + e^{w^T x_n + b}) ).    (8.10)
The performance of this basic classifier deteriorates when dealing with large-scale higher-order data. To solve the problem, some regularization terms are applied to exploit possible priors over the weight vector. For example, the ℓ0 norm regularizer is commonly used to overcome overfitting and make the model sparser. Since ℓ0 optimization is NP-hard, the ℓ1 norm regularizer is often used as a substitute; the resulting convex optimization model is as follows:
min_{w,b} Σ_{n=1}^{N} ( log(1 + e^{w^T x_n + b}) − y_n (w^T x_n + b) ) + λ Σ_i |w_i|.    (8.11)
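To make the optimization in (8.11) concrete, the following minimal numpy sketch (not part of the original text) minimizes the ℓ1-regularized negative log-likelihood by proximal gradient descent; the synthetic data, regularization weight, and iteration count are illustrative assumptions.

import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of the l1 norm (elementwise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_logistic_regression(X, y, lam=0.5, n_iter=1000):
    """Minimize sum_n [log(1 + exp(w^T x_n + b)) - y_n (w^T x_n + b)] + lam * ||w||_1
    with labels y_n in {0, 1}, via proximal gradient descent."""
    N, I = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])               # append a column for the bias b
    theta = np.zeros(I + 1)                            # theta = [w; b]
    step = 1.0 / (0.25 * np.linalg.norm(Xb, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))          # Prob(y = 1 | x_n)
        theta = theta - step * (Xb.T @ (p - y))        # gradient step on the smooth part
        theta[:I] = soft_threshold(theta[:I], step * lam)  # l1 prox on w only
    return theta[:I], theta[I]

# Toy usage with synthetic data (illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_true = np.zeros(20); w_true[:3] = [2.0, -1.5, 1.0]
y = (X @ w_true > 0).astype(float)
w_hat, b_hat = l1_logistic_regression(X, y)
print("recovered support:", np.flatnonzero(np.abs(w_hat) > 1e-2))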
The classical logistic regression is based on the vector form and would lead to an
information loss when dealing with the multiway data. To avoid such a problem, we
discuss the way to extend the logistic regression to the tensor version. The training
set of logistic tensor regression is denoted as tensors X_n ∈ R^{I_1×I_2×···×I_M}, 1 ≤ n ≤ N with their class labels y_n ∈ {+1, −1}. One decision function for logistic tensor regression can be selected as follows:

y(X) = φ(⟨W, X⟩ + b),    (8.12)

where W ∈ R^{I_1×I_2×···×I_M} denotes the weighting tensor and b is the bias. Therefore,
the maximum likelihood method optimization can be formulated as follows:
min_{W,b} Σ_{n=1}^{N} ( log(1 + e^{⟨W, X_n⟩ + b}) − y_n (⟨W, X_n⟩ + b) ).    (8.13)
To reduce the number of free parameters, the weighting tensor is assumed to admit a rank-R CP decomposition W = [[U^(1), · · · , U^(M)]], where u_r^(m) ∈ R^{I_m}, m = 1, 2, . . . , M, is the projection vector, U^(m) = [u_1^(m), · · · , u_R^(m)], and R is the CP decomposition-based rank of the weighting tensor. By substituting the CP form of W into (8.13), we can get the following
problem:
min_{U^(m), m=1,··· ,M, b} Σ_{n=1}^{N} ( log(1 + e^{⟨[[U^(1),··· ,U^(M)]], X_n⟩ + b}) − y_n (⟨[[U^(1),··· ,U^(M)]], X_n⟩ + b) ).
Inspired by alternating minimization, the problem can be solved with respect to
one of the variables {U(m) , m = 1, . . . , M, b} by keeping the others fixed.
Specifically, the subproblem with respect to U(m) is rewritten as
min_{U^(m)} Σ_{n=1}^{N} ( log(1 + e^{tr(U^(m)(U^(−m))^T X_{n(m)}^T) + b}) − y_n ( tr(U^(m)(U^(−m))^T X_{n(m)}^T) + b ) ),

which, with X̃_{n(m)} = X_{n(m)} U^(−m), is equivalent to the vector problem

min_{U^(m)} Σ_{n=1}^{N} ( log(1 + e^{vec(U^(m))^T vec(X̃_{n(m)}) + b}) − y_n ( vec(U^(m))^T vec(X̃_{n(m)}) + b ) ).
It is clear that the logistic tensor regression problem can be transformed into several
vector logistic regression subproblems and easily solved in this way. The details of
logistic tensor regression can be found in Algorithm 44. In addition, the ℓ1 norm
regularizer can be applied to make the solution sparse and interpretable, which can
be well solved by the Bayesian binary regression (BBR) software introduced in [9].
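As a rough illustration of the alternating scheme (a sketch under simplifying assumptions, not the book's Algorithm 44), the snippet below stores the weighting tensor as CP factors of a third-order input and updates one factor at a time by a few gradient steps on the induced vector logistic regression subproblem; the rank, step-size rule, and iteration counts are assumptions.

import numpy as np

def cp_features(X, factors, mode):
    # Phi has the same shape as factors[mode]; <W, X> = sum(factors[mode] * Phi),
    # so Phi plays the role of the matricized feature X_tilde in the text.
    U1, U2, U3 = factors
    if mode == 0:
        return np.einsum('ijk,jr,kr->ir', X, U2, U3)
    if mode == 1:
        return np.einsum('ijk,ir,kr->jr', X, U1, U3)
    return np.einsum('ijk,ir,jr->kr', X, U1, U2)

def logistic_tensor_regression(Xs, y, rank=2, outer=20, inner=100):
    """Alternating sketch: W = [[U1, U2, U3]]; each factor update is the induced
    vector logistic regression subproblem, solved with a few gradient steps.
    Xs: list of third-order tensors, y: labels in {0, 1}."""
    rng = np.random.default_rng(0)
    factors = [0.1 * rng.standard_normal((I, rank)) for I in Xs[0].shape]
    b, N = 0.0, len(Xs)
    for _ in range(outer):
        for m in range(3):
            Phi = np.stack([cp_features(X, factors, m) for X in Xs])     # N x I_m x R
            F = np.concatenate([Phi.reshape(N, -1), np.ones((N, 1))], axis=1)
            theta = np.concatenate([factors[m].ravel(), [b]])
            step = 1.0 / (0.25 * np.linalg.norm(F, 2) ** 2)              # safe step size
            for _ in range(inner):
                p = 1.0 / (1.0 + np.exp(-F @ theta))                     # Prob(y = 1 | X_n)
                theta = theta - step * (F.T @ (p - y))
            factors[m], b = theta[:-1].reshape(factors[m].shape), theta[-1]
    return factors, b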
Support vector machine (SVM) [6] is a popular supervised learning model for classification problems. It finds an optimal classification hyperplane by maximizing
the margin between the positive measurements and the negative measurements.
Figure 8.2 provides an illustration of the support vector machine when the samples
are linearly separable.
Mathematically, given N training measurements xn ∈ RI (1 ≤ n ≤ N)
associated with the class labels yn ∈ {+1, −1}, the standard SVM finds a projection
vector w and a bias b by a constrained optimization model as follows:
min_{w,b,ξ} J_SVM(w, b, ξ) = (1/2)||w||_F^2 + λ Σ_{n=1}^{N} ξ_n
s.t. y_n (w^T x_n + b) ≥ 1 − ξ_n, 1 ≤ n ≤ N, ξ_n ≥ 0,    (8.14)
where ξ = [ξ_1; ξ_2; · · · ; ξ_N] is a vector which contains all the slack variables, and it is used to deal with the linearly non-separable case. ξ_n is also called the marginal error for the n-th measurement, and the margin is 2/||w||_2. When the problem is
Fig. 8.2 An illustration of the support vector machine when the samples are linearly separable
linearly separable, we can set ξ = 0. Under this condition the decision function is
y(xn ) = sign(wT xn + b).
To solve this problem, we give the Lagrangian function of the optimization
model (8.14) as follows:
L(w, b, ξ, α, κ) = (1/2)||w||_F^2 + λ Σ_{n=1}^{N} ξ_n − Σ_{n=1}^{N} α_n ( y_n (w^T x_n + b) − 1 + ξ_n ) − Σ_{n=1}^{N} κ_n ξ_n,    (8.15)
where α_n ≥ 0 and κ_n ≥ 0 are the Lagrange multipliers. Setting the partial derivatives of L to zero yields

∂_w L = 0 ⇒ w = Σ_{n=1}^{N} α_n y_n x_n,
∂_b L = 0 ⇒ Σ_{n=1}^{N} α_n y_n = 0,    (8.17)
∂_ξ L = 0 ⇒ λ − α_n − κ_n = 0, 1 ≤ n ≤ N.
Based on Eq. (8.17), the dual problem of optimization model (8.14) can be
formulated as follows:
max_α J_D(α) = − (1/2) Σ_{n1=1}^{N} Σ_{n2=1}^{N} y_{n1} y_{n2} x_{n1}^T x_{n2} α_{n1} α_{n2} + Σ_{n=1}^{N} α_n
s.t. Σ_{n=1}^{N} α_n y_n = 0, 0 ≤ α ≤ λ.    (8.18)
Here we can see that the dual problem (8.18) of the original problem (8.14) is a
quadratic programming problem, which has some efficient solvers.
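As an illustration of the KKT relations (8.17) rather than a full quadratic programming solver, the sketch below assumes the dual variables α have already been obtained from some QP routine and recovers the primal parameters and the decision function; the tolerance used to detect margin support vectors is an assumption.

import numpy as np

def svm_primal_from_dual(X, y, alpha, lam, tol=1e-8):
    """Given dual variables alpha for the SVM dual (8.18), recover w and b via (8.17).
    X: N x I data, y: labels in {+1, -1}, with 0 <= alpha_n <= lam."""
    w = (alpha * y) @ X                             # w = sum_n alpha_n y_n x_n
    # b from margin support vectors (0 < alpha_n < lam), where y_n (w^T x_n + b) = 1.
    on_margin = (alpha > tol) & (alpha < lam - tol)
    b = np.mean(y[on_margin] - X[on_margin] @ w)
    return w, b

def decision(X, w, b):
    # Decision function y(x) = sign(w^T x + b).
    return np.sign(X @ w + b)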
Considering a specific data processing problem, SVM may suffer from overfitting
and result in a bad performance when the number of training measurements is
limited or the measurements are tensors. Driven by the success of tensor analysis
in many learning techniques, it is expected to design a tensor extension of SVM,
namely, support tensor machine (STM) [1, 10, 20].
We consider that there are N training measurements denoted as M-th order
tensors Xn ∈ RI1 ×I2 ×···×IM , 1 ≤ n ≤ N with their corresponding class labels
yn ∈ {+1, −1}. The STM aims to find M projection vectors wm ∈ RIm , 1 ≤ m ≤ M
and a bias b to formulate a decision function y(X_n) = sign(X_n ∏_{m=1}^{M} ×_m w_m + b).
The desired parameters can be obtained by the following minimization problem:
min_{w_m|_{m=1}^{M}, b, ξ} J_STM(w_m|_{m=1}^{M}, b, ξ) = (1/2) ||w_1 ∘ w_2 ∘ · · · ∘ w_M||_F^2 + λ Σ_{n=1}^{N} ξ_n
s.t. y_n ( X_n ∏_{m=1}^{M} ×_m w_m + b ) ≥ 1 − ξ_n, 1 ≤ n ≤ N, ξ_n ≥ 0,    (8.19)
The problem is nonconvex in all the projection vectors jointly, but with {w_k, k ≠ m} fixed, the subproblem with respect to w_m reduces to a standard SVM:

min_{w_m,b,ξ} J_SVM(w_m, b, ξ) = (η/2) ||w_m||_F^2 + λ Σ_{n=1}^{N} ξ_n
s.t. y_n ( w_m^T ( X_n ∏_{1≤k≤M, k≠m} ×_k w_k ) + b ) ≥ 1 − ξ_n, 1 ≤ n ≤ N,    (8.20)

where η = ∏_{1≤k≤M, k≠m} ||w_k||_F^2.
In general, each subproblem takes the form of the vector learning framework (8.1):

min_{w_m,b,ξ} f(w_m, b, ξ)
s.t. y_n c_n ( w_m^T ( X_n ∏_{1≤k≤M, k≠m} ×_k w_k ) + b ) ≥ ξ_n, 1 ≤ n ≤ N.

The alternating updates are repeated until the projection vectors converge, e.g., until

Σ_{m=1}^{M} ( w_{m,t}^T w_{m,t−1} ||w_{m,t}||_F^{−2} − 1 ) ≤ ε,

where w_{m,t} denotes the estimate of w_m at the t-th iteration and ε is a small tolerance.
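The following sketch illustrates the key step of the alternating scheme for a third-order STM: each input tensor is contracted with the fixed projection vectors so that the subproblem in w_m becomes a standard vector SVM. The function solve_linear_svm below is only a ridge-regression stand-in so that the sketch runs end to end; in practice it would be replaced by a proper SVM solver.

import numpy as np

def contract_except(X, ws, m):
    # Multiply X by w_k along every mode k except m; the result is a vector of
    # length I_m that serves as the SVM input for the w_m subproblem.
    modes = 'ijk'
    others = [k for k in range(3) if k != m]
    expr = 'ijk,' + ','.join(modes[k] for k in others) + '->' + modes[m]
    return np.einsum(expr, X, *[ws[k] for k in others])

def solve_linear_svm(F, y, lam=1.0):
    # Placeholder stand-in (regularized least squares), NOT a real SVM solver.
    Fb = np.hstack([F, np.ones((len(F), 1))])
    theta = np.linalg.solve(Fb.T @ Fb + lam * np.eye(Fb.shape[1]), Fb.T @ y)
    return theta[:-1], theta[-1]

def stm_alternating(Xs, y, n_sweeps=10):
    """Xs: list of I1 x I2 x I3 tensors, y: labels in {+1, -1}."""
    ws = [np.ones(I) for I in Xs[0].shape]
    b = 0.0
    for _ in range(n_sweeps):
        for m in range(3):
            F = np.stack([contract_except(X, ws, m) for X in Xs])   # N x I_m features
            ws[m], b = solve_linear_svm(F, y)                       # vector subproblem
    return ws, b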
The higher rank support tensor machine (HRSTM) [14] instead assumes that the weighting tensor admits a rank-R CP structure and solves

min_{W,b,ξ} (1/2) ⟨W, W⟩ + C Σ_{n=1}^{N} ξ_n
s.t. y_n (⟨W, X_n⟩ + b) ≥ 1 − ξ_n, 1 ≤ n ≤ N, ξ ≥ 0,    (8.21)
W = [[U^(1), . . . , U^(M)]].
Unlike STM, which uses multiple projections through the mode-m products X ∏_{m=1}^{M} ×_m w_m to map the input tensor to a scalar, HRSTM applies the tensor inner product ⟨W, X_n⟩ to map the input tensor. Here all the parameters are the same as in STM except for the weighting tensor W.
Based on the definitions of the tensor inner product and CP decomposition in Chaps. 1 and 2, we can get the following:
⟨W, W⟩ = tr( W_(m) W_(m)^T ) = tr( U^(m) (U^(−m))^T U^(−m) (U^(m))^T ),    (8.22)
⟨W, X_n⟩ = tr( W_(m) X_{n(m)}^T ) = tr( U^(m) (U^(−m))^T X_{n(m)}^T ),    (8.23)

where W_(m) is the mode-m unfolding matrix of the tensor W, X_{n(m)} is the mode-m unfolding matrix of the tensor X_n, and U^(−m) = U^(M) ⊙ · · · ⊙ U^(m+1) ⊙ U^(m−1) ⊙ · · · ⊙ U^(1) is the Khatri–Rao product of the remaining factor matrices.
To solve this problem, we define A = (U^(−m))^T U^(−m), which is a positive definite matrix, and let Ũ^(m) = U^(m) A^{1/2}; then we have

tr( U^(m) (U^(−m))^T U^(−m) (U^(m))^T ) = tr( Ũ^(m) (Ũ^(m))^T ) = vec(Ũ^(m))^T vec(Ũ^(m)),
tr( U^(m) (U^(−m))^T X_{n(m)}^T ) = tr( Ũ^(m) X̃_{n(m)}^T ) = vec(Ũ^(m))^T vec(X̃_{n(m)}),

where X̃_{n(m)} = X_{n(m)} U^(−m) A^{−1/2}.
Finally, the original optimization problem (8.21) can be simplified as
min_{Ũ^(m), b, ξ} (1/2) vec(Ũ^(m))^T vec(Ũ^(m)) + C Σ_{n=1}^{N} ξ_n
s.t. y_n ( vec(Ũ^(m))^T vec(X̃_{n(m)}) + b ) ≥ 1 − ξ_n, 1 ≤ n ≤ N, ξ ≥ 0,    (8.25)

which is a standard vector SVM with respect to vec(Ũ^(m)).
The classical STM and HRSTM models assume that the weighting tensor admits a low-rank structure based on CP decomposition. It is well known that the best rank-R CP approximation of given data is NP-hard to obtain. In [13], a support Tucker machine (STuM) obtains a low-rank weighting tensor by Tucker decomposition. Compared with STM and HRSTM, which formulate the
weighting tensor under CP decomposition, STuM is more general and can explore
the multilinear rank especially for unbalanced datasets. Furthermore, since the
Tucker decomposition of a given tensor is represented as a core tensor multiplied
by a series of factor matrices along all modes, we can easily perform a low-rank
tensor approximation for dimensionality reduction which is practical when dealing
with complex real data.
Assuming that the weighting tensor is W, the optimization problem of STuM is
as follows:
min_{W,b,ξ} (1/2) ⟨W, W⟩ + C Σ_{n=1}^{N} ξ_n
s.t. y_n (⟨W, X_n⟩ + b) ≥ 1 − ξ_n, 1 ≤ n ≤ N, ξ ≥ 0,    (8.26)
W = [[G, U^(1), . . . , U^(M)]].
where G_(m) is the mode-m matricization of the core tensor G. Similar to HRSTM, the problem can be solved by alternately updating the factor matrices U^(m) and the core tensor G. For example, the subproblem with respect to the core tensor can be represented as follows:

min_{G_(1), b, ξ} (1/2) ( U vec(G_(1)) )^T ( U vec(G_(1)) ) + C Σ_{n=1}^{N} ξ_n
s.t. y_n ( ( U vec(G_(1)) )^T vec(X_n) + b ) ≥ 1 − ξ_n, 1 ≤ n ≤ N, ξ ≥ 0,    (8.28)
where U = U(M) ⊗ · · · ⊗ U(1) and G(1) is the mode-1 unfolding matrix of the core
tensor G.
As shown above, the optimization problem of STuM can also be transformed
into several standard SVM problems. The difference of STuM from HRSTM is that
STuM has an additional iteration for solving the subproblem of core tensor G. More
details can be found in [13].
In addition, another variant based on tensor train decomposition, namely, support
tensor train machine (STTM), is also proposed due to the enhanced expression
capability of tensor networks [4]. Besides, in order to deal with the complex case
where the samples are not linearly separable, some kernelized tensor machines have been
proposed, including the dual structure-preserving kernel (DuSK) for supervised
tensor learning [11] and the kernelized STTM [5].
Fisher discriminant analysis (FDA) [8, 12] has been widely applied for classifi-
cation. It aims to find a direction that separates the class means well while minimizing the variance of the total training measurements. The basic idea of FDA
is to project the high-dimensional samples into a best vector space for classification
feature extraction and dimensionality reduction. After projecting all the measure-
ments onto a specific line y = wT x, FDA separates the positive measurements
from the negative measurements by maximizing the generalized Rayleigh quotient1
between the between-class scatter matrix and within-class scatter matrix.
Here we consider N training measurements xn , 1 ≤ n ≤ N and the correspond-
ing class labels yn ∈ {+1, −1}. Among these measurements, there are N+ positive
measurements with the mean m_+ = (1/N_+) Σ_{n=1}^{N} I(y_n + 1) x_n and N_− negative measurements

¹ For a real symmetric matrix A, the Rayleigh quotient is defined as R(x) = (x^T A x)/(x^T x).
Fig. 8.3 An illustration of the Fisher discriminant analysis when the samples are linearly separable
with the mean m_− = (1/N_−) Σ_{n=1}^{N} I(y_n − 1) x_n. The mean of all the measurements is m = (1/N) Σ_{n=1}^{N} x_n, where the indicator function is defined as follows:

I(x) = { 0, x = 0; 1, x ≠ 0. }
Figure 8.3 gives an illustration of the FDA. The key problem is how to find the
best projection vector in order to ensure that the samples have the largest inter-class
distance and smallest intra-class distance in the new subspace. The projection vector
w can be determined by solving the following problem:
max_w J_FDA(w) = (w^T S_b w) / (w^T S_w w),    (8.29)

where S_b = (m_+ − m_−)(m_+ − m_−)^T is the between-class scatter matrix, so that the numerator equals (w^T (m_+ − m_−))^2, and S_w is the within-class scatter matrix.
After w is obtained, the bias can be computed as

b = ( N_− − N_+ − (N_+ m_+ + N_− m_−)^T w ) / ( N_− + N_+ ).
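For the vector case, the Rayleigh quotient (8.29) has the well-known closed-form maximizer w ∝ S_w^{-1}(m_+ − m_−). A minimal numpy sketch is given below, assuming the within-class scatter is built from the class-wise means (the exact scatter definition used in the book may differ) and using the bias formula above.

import numpy as np

def fisher_discriminant(X, y):
    """X: N x I measurements, y: labels in {+1, -1}.
    Returns the FDA projection vector w and bias b."""
    Xp, Xm = X[y == 1], X[y == -1]
    Np, Nm = len(Xp), len(Xm)
    mp, mm = Xp.mean(axis=0), Xm.mean(axis=0)
    # Assumed within-class scatter: sum of class-wise scatter matrices.
    Sw = (Xp - mp).T @ (Xp - mp) + (Xm - mm).T @ (Xm - mm)
    w = np.linalg.solve(Sw + 1e-8 * np.eye(X.shape[1]), mp - mm)   # w ~ Sw^{-1}(m+ - m-)
    b = (Nm - Np - (Np * mp + Nm * mm) @ w) / (Nm + Np)            # bias formula above
    return w, b

# Toy usage with two Gaussian classes (illustrative assumptions).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (30, 5)), rng.normal(-1.0, 1.0, (30, 5))])
y = np.array([1] * 30 + [-1] * 30)
w, b = fisher_discriminant(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))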
Just like STM in (8.19), there is no closed-form solution for TFDA. The
alternating projection method can be applied to solve this problem, and one of the
multiple projection vectors will be updated at each iteration. At each iteration, the
solution of wm can be updated by
max_{w_m} J_TFDA = || w_m^T ( (M_+ − M_−) ∏_{1≤k≤M, k≠m} ×_k w_k ) ||_2^2 / ( Σ_{n=1}^{N} || w_m^T ( (X_n − M) ∏_{1≤k≤M, k≠m} ×_k w_k ) ||_2^2 ).    (8.31)
To get the multiple projection vectors, we only need to replace the subproblem in Step 4 of Algorithm 45 with (8.31). The bias is computed by

b = ( N_− − N_+ − (N_+ M_+ + N_− M_−) ∏_{m=1}^{M} ×_m w_m ) / ( N_− + N_+ ).
8.6 Applications
To briefly show the differences between the vector-based and the tensor-based methods, two classical methods, STM and SVM, are compared on the well-known MNIST dataset [15]. There are 60k training samples and 10k testing samples in this database. Each sample is a grayscale image of size 28 × 28 denoting a handwritten digit in {"0", . . . , "9"}. Since STM and SVM are naturally binary classifiers, we divide the 10 digits into 45 digit pairs and perform the classification task on each digit pair separately. The classification accuracy is then calculated for each of the 45 experiments.
Table 8.1 shows the classification results on several selected digit pairs. As shown for these five pairs, STM achieves higher accuracy than SVM. A possible explanation is that the spatial structure information is preserved in STM. The classification accuracies of SVM and STM for all the digit pairs are also given in Fig. 8.4. It can be seen that the performance of STM is better than that of SVM in most cases.
Fig. 8.4 Classification accuracy of SVM and STM for all the digit pairs
2 http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/.
Table 8.2 Classification accuracy of different methods for different subjects in StarPlus fMRI
datasets
Subject SVM STM STuM DuSK TT classifier KSTTM
04799 36.67% 45.83% 52.50% 59.18% 57.50% 55.83%
04820 43.33% 44.12% 56.67% 58.33% 54.17% 70.00%
04847 38.33% 47.50% 49.17% 50.83% 61.67% 65.00%
05675 35.00% 37.50% 45.83% 57.50% 55.00% 60.00%
05680 38.33% 40.00% 58.33% 64.17% 60.83% 73.33%
05710 40.00% 43.33% 62.50% 59.17% 53.33% 59.17%
As shown in Table 8.2, SVM and STM perform poorly on the fMRI image classification task. Since fMRI data are very complicated, STM, which is based on a rank-1 CP decomposition, cannot achieve an acceptable classification accuracy. We attribute this to the fact that a rank-1 CP tensor is too simple to capture the complex structure of an fMRI image. Compared with SVM and STM, the Tucker-based STuM and the higher-rank CP-based DuSK give better results. Meanwhile, the TT classifier and KSTTM, which are based on the tensor train format, also give satisfactory results. Overall, we can conclude that tensor classification methods are more effective than their vector counterparts for high-dimensional data.
8.7 Summary
This chapter gives a brief introduction to statistical tensor classification. First, we introduce the concepts and the framework of tensor classification. Second, we give some examples and show how to convert vector classification models into tensor classification models. Specifically, the techniques of logistic tensor regression, support tensor machines, and tensor Fisher discriminant analysis are discussed in detail. Finally, we apply tensor classification methods to two applications, i.e., handwritten digit recognition from grayscale images and biomedical classification from fMRI images. Two groups of experiments are performed on the MNIST and fMRI datasets, respectively. The former simply validates the effectiveness of tensor classification, while the latter provides a comprehensive comparison of existing tensor classifiers. Numerical results show that classification methods based on more expressive tensor structures, like the TT format and higher-rank CP tensors, have stronger representation ability and perform better in practice.
There are some interesting future research directions. Instead of linear classifiers like SVM and logistic regression, nonlinear methods such as Bayesian probabilistic models and kernel tricks can be applied to tensor classification. Besides applying tensor decomposition to the observed multiway data for tensor classification, constructing tensor-based features is also an attractive direction.
References
1. Biswas, S.K., Milanfar, P.: Linear support tensor machine with LSK channels: Pedestrian
detection in thermal infrared images. IEEE Trans. Image Process. 26(9), 4229–4242 (2017)
2. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. Intell.
Syst. Technol. 2, 27:1–27:27 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/
libsvm
3. Chen, Z., Batselier, K., Suykens, J.A., Wong, N.: Parallelized tensor train learning of
polynomial classifiers. IEEE Trans. Neural Netw. Learn. Syst. 29(10), 4621–4632 (2017)
4. Chen, C., Batselier, K., Ko, C.Y., Wong, N.: A support tensor train machine. In: 2019
International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, Piscataway
(2019)
5. Chen, C., Batselier, K., Yu, W., Wong, N.: Kernelized support tensor train machines (2020).
Preprint arXiv:2001.00360
6. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
7. Cover, T.M.: Elements of Information Theory. Wiley, Hoboken (1999)
8. Fisher, R.A.: The statistical utilization of multiple measurements. Ann. Eugen. 8(4), 376–386
(1938)
9. Genkin, A., Lewis, D.D., Madigan, D.: Large-scale Bayesian logistic regression for text
categorization. Technometrics 49(3), 291–304 (2007)
10. Hao, Z., He, L., Chen, B., Yang, X.: A linear support higher-order tensor machine for
classification. IEEE Trans. Image Process. 22(7), 2911–2920 (2013)
11. He, L., Kong, X., Yu, P.S., Yang, X., Ragin, A.B., Hao, Z.: DuSk: A dual structure-preserving
kernel for supervised tensor learning with applications to neuroimages. In: Proceedings of
the 2014 SIAM International Conference on Data Mining, pp. 127–135. SIAM, Philadelphia
(2014)
12. Kim, S.J., Magnani, A., Boyd, S.: Robust fisher discriminant analysis. In: Advances in Neural
Information Processing Systems, pp. 659–666 (2006)
13. Kotsia, I., Patras, I.: Support tucker machines. In: CVPR 2011, pp. 633–640. IEEE, Piscataway
(2011)
14. Kotsia, I., Guo, W., Patras, I.: Higher rank support tensor machines for visual recognition.
Pattern Recognit. 45(12), 4192–4203 (2012)
15. Deng, L.: The MNIST database of handwritten digit images for machine learning research
[best of the web]. IEEE Signal Process. Mag. 29(6), 141–142 (2012)
16. Li, Q., Schonfeld, D.: Multilinear discriminant analysis for higher-order tensor data classifica-
tion. IEEE Trans. Pattern Analy. Mach. Intell. 36(12), 2524–2537 (2014)
17. Phan, A.H., Cichocki, A.: Tensor decompositions for feature extraction and classification of
high dimensional datasets. Nonlinear Theory Appl. 1(1), 37–68 (2010)
18. Sun, J., Tao, D., Faloutsos, C.: Beyond streams and graphs: dynamic tensor analysis. In:
Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pp. 374–383 (2006)
19. Tan, X., Zhang, Y., Tang, S., Shao, J., Wu, F., Zhuang, Y.: Logistic tensor regression
for classification. In: International Conference on Intelligent Science and Intelligent Data
Engineering, pp. 573–581. Springer, Berlin (2012)
20. Tao, D., Li, X., Hu, W., Maybank, S., Wu, X.: Supervised tensor learning. In: Fifth IEEE
International Conference on Data Mining (ICDM’05), pp. 8–pp. IEEE, Piscataway (2005)
21. Tao, D., Li, X., Maybank, S.J., Wu, X.: Human carrying status in visual surveillance. In: 2006
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06),
vol. 2, pp. 1670–1677. IEEE, Piscataway (2006)
Chapter 9
Tensor Subspace Cluster
9.1 Background
As the scale and diversity of data increase, adaptations to existing algorithms are
required to maintain cluster quality and speed. Many advances [43] in processing
high-dimensional data have relied on the observation that, even though these data are
high dimensional, their intrinsic dimension is often much smaller than the dimension
of the ambient space. Accordingly, subspace clustering is a technique for clustering high-dimensional data that lie in multiple low-dimensional subspaces. For
example, suppose a group of data X = [x1 , x2 , · · · , xN ] ∈ RD×N , where N is
the number of samples and D is the feature dimension. The subspace clustering
seeks clusters by assuming that these samples are chosen from K unknown lower
dimensional subspaces S1 , S2 , · · · , SK . Exploring these useful subspaces and
further grouping the observations according to these distinct subspaces are the main
aims of subspace clustering. As illustrated in Fig. 9.1, the observed data can be
clustered with three subspaces, namely, a plane and two lines.
Traditional matrix-based subspace clustering can be divided into four cate-
gories: statistical, algebraic, iterative, and spectral clustering-based methods [11].
Statistical approaches can be divided into three subclasses: iterative statistical
approaches [41], robust statistical approaches [14], and information theoretic
statistical approaches [34], which introduce different statistical distributions in their
models. Algebraic methods contain two subclasses, factorization-based methods
using matrix factorization [8, 15] and algebraic geometric methods using polyno-
mials to fit the data, such as [29, 44]. The iterative approaches alternate two steps
(i.e., allocate points to subspaces and fit subspaces to each cluster) to obtain the
clustering results such as [42, 55]. Spectral clustering-based methods include two
subclasses, one calculates the similarity matrix based on local information [16, 52],
while the other employs global information [9, 10].
However, these subspace clustering approaches based on matrix operations often suffer from limitations. One is that most existing matrix-based works vectorize the multiway data, which may destroy the inherent multiway structural information.

9.2 Tensor Subspace Cluster Based on K-Means

K-means is one of the most fundamental clustering methods. Given samples z_n ∈ R^D, 1 ≤ n ≤ N, it seeks K cluster centroids c_k by solving

min_{c_k} Σ_{n=1}^{N} min_{1≤k≤K} ||z_n − c_k||_2^2 = Σ_{k=1}^{K} Σ_{n=1}^{N} s_{kn} ||z_n − c_k||_2^2,    (9.1)

where s_{kn} = 1 if z_n is assigned to the k-th cluster and s_{kn} = 0 otherwise.
The K-means problem (9.1) can be equivalently rewritten as the matrix factorization problem

min_{C,S} ||Z − CS||_F^2
s.t. ||S(:, n)||_0 = 1, S(k, n) ∈ {0, 1},    (9.3)

where Z = [z_1, · · · , z_N] ∈ R^{D×N}, C ∈ R^{D×K} represents the centroids of the clusters, S ∈ R^{K×N} is the cluster indication matrix, and S(k, n) = 1 means Z(:, n) belongs to the k-th cluster.
Driven by the equivalence between (9.1) and the matrix factorization problem shown in (9.3), the K-means clustering can be naturally generalized to the tensor
space based on different tensor decompositions, such as classical canonical polyadic
(CP) and Tucker decomposition.
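To make the factorization view in (9.3) concrete, here is a minimal Lloyd-style numpy sketch that alternates between the indicator matrix S and the centroid matrix C; the initialization and iteration count are illustrative assumptions.

import numpy as np

def kmeans_factorization(Z, K, n_iter=50, seed=0):
    """Approximately solve min ||Z - C S||_F^2 over centroids C (D x K) and
    a one-hot indicator S (K x N), as in (9.3)."""
    rng = np.random.default_rng(seed)
    D, N = Z.shape
    C = Z[:, rng.choice(N, K, replace=False)]                       # centroid initialization
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)     # K x N squared distances
        labels = d2.argmin(axis=0)                                  # assignment step
        S = np.zeros((K, N))
        S[labels, np.arange(N)] = 1.0                               # ||S(:, n)||_0 = 1
        for k in range(K):
            if S[k].sum() > 0:
                C[:, k] = Z[:, S[k] == 1].mean(axis=1)              # centroid update
    return C, S

# Toy usage: two well-separated clusters (illustrative assumptions).
Z = np.hstack([np.random.randn(2, 50) + 5, np.random.randn(2, 50) - 5])
C, S = kmeans_factorization(Z, K=2)
print("cluster sizes:", S.sum(axis=1))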
For a third-order tensor X ∈ R^{I_1×I_2×I_3}, the model that jointly performs non-negative CP decomposition and K-means clustering on mode-1 (JTKM) can be expressed as follows [53]:

min_{U^(1),U^(2),U^(3)≥0, S, C} ||X_(1) − DU^(1)(U^(3) ⊙ U^(2))^T||_F^2 + λ||U^(1) − SC||_F^2 + η(||U^(2)||_F^2 + ||U^(3)||_F^2)
s.t. ||S(n, :)||_0 = 1, S(n, k) ∈ {0, 1},    (9.4)

where S(n, k) = 1 means U^(1)(n, :) belongs to the k-th cluster, the second term in the objective function, ||U^(1) − SC||_F^2, is a K-means penalty, the regularization terms ||U^(2)||_F^2 and ||U^(3)||_F^2 are used to control scaling, and the diagonal matrix D is applied to normalize the rows of U^(1) by their ℓ2 norms in advance [53].
Introducing an auxiliary variable Z for the row-normalized U^(1), we can get a reformulated model (9.5), which can be solved by the ADMM algorithm. The corresponding subproblems for the different variables are as follows.
Letting W = (U^(3) ⊙ U^(2))^T, the subproblem of (9.5) with respect to U^(1) can be transformed into

U^(1) = arg min_{U^(1)≥0} ||X_(1) − DU^(1)W||_F^2 + λ||U^(1) − SC||_F^2 + μ||U^(1) − Z||_F^2.    (9.6)

The auxiliary variable Z is updated by row normalization:

Z(n, :) = U^(1)(n, :) / ||U^(1)(n, :)||_2.    (9.8)
Finally, to update U(2) and U(3) , we make use of the other two matrix unfoldings
as follows:
U^(2) = arg min_{U^(2)≥0} ||X_(2) − U^(2)(U^(3) ⊙ DU^(1))^T||_F^2 + η||U^(2)||_F^2,    (9.11)
U^(3) = arg min_{U^(3)≥0} ||X_(3) − U^(3)(U^(2) ⊙ DU^(1))^T||_F^2 + η||U^(3)||_F^2.    (9.12)
Another way to combine tensor decomposition with clustering is to seek a common two-dimensional subspace for the sample matrices X_n ∈ R^{I_1×I_2}, n = 1, · · · , N, by solving

min_{U,V,Z_n} Σ_{n=1}^{N} ||X_n − U Z_n V^T||_F^2
s.t. V^T V = I, U^T U = I.    (9.13)
After obtaining {Z_n}_{n=1}^{N}, we vectorize the diagonal elements of each matrix Z_n and stack them as the row vectors of a new matrix Z̃ ∈ R^{N×min(I_1,I_2)}. Therefore, based on the distance relationship, the K-means clustering (9.3) of the observations {X_n, n = 1, · · · , N} can be transformed into the clustering of {Z_n, n = 1, · · · , N}, namely, the rows of Z̃:

min_{S,C} ||Z̃ − SC||_F^2,    (9.14)
Consider further the Tucker decomposition of the data tensor X ∈ R^{I_1×I_2×N}:

min_{U,V,W,Y} ||X − Y ×_1 U ×_2 V ×_3 W||_F^2
s.t. U^T U = I, V^T V = I, W^T W = I,    (9.15)

where U, V, W are the factor matrices and Y is the core tensor. To better understand the relationship between K-means and Tucker decomposition, we rewrite (9.15) as

min_{U,V,W,Y_(3)} ||X_(3) − WY_(3)(V ⊗ U)^T||_F^2
s.t. U^T U = I, V^T V = I, W^T W = I,    (9.16)
where X_(3) and Y_(3) are the mode-3 unfolding matrices of X and the core tensor Y, respectively. In this way, we can clearly see that W in (9.16) contains the cluster indicator information of K-means clustering, Y_(3)(V ⊗ U)^T contains the cluster centroid information, and U, V in Eq. (9.13) play the same role as U, V in the Tucker decomposition. Therefore, Tucker decomposition is equivalent to simultaneous subspace selection (i.e., data compression) as in Eq. (9.13) and K-means clustering [4] as shown in Eq. (9.14). The final clustering indicator can be found by applying a clustering method (e.g., K-means) to the matrix W.
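The discussion above suggests a simple two-step procedure: compute orthogonal Tucker factors (e.g., by a truncated HOSVD) and then run K-means on the rows of the mode-3 factor W. The sketch below follows this idea; the truncation ranks and the plain Lloyd-style K-means routine are assumptions.

import numpy as np

def unfold(T, mode):
    # Mode-m unfolding of a third-order tensor.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_factors(X, ranks):
    # Leading left singular vectors of each unfolding give orthogonal Tucker factors.
    return [np.linalg.svd(unfold(X, m), full_matrices=False)[0][:, :r]
            for m, r in enumerate(ranks)]

def simple_kmeans(Z, K, n_iter=50, seed=0):
    # Plain Lloyd iterations on the rows of Z (N x R).
    rng = np.random.default_rng(seed)
    C = Z[rng.choice(len(Z), K, replace=False)]
    for _ in range(n_iter):
        labels = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if np.any(labels == k):
                C[k] = Z[labels == k].mean(axis=0)
    return labels

def tucker_kmeans(X, ranks, K):
    """X: I1 x I2 x N data tensor (samples along mode 3); cluster the samples
    by K-means on the rows of the mode-3 factor W."""
    U, V, W = hosvd_factors(X, ranks)
    return simple_kmeans(W, K)

# Toy usage (illustrative assumptions).
X = np.random.rand(8, 8, 30)
print(tucker_kmeans(X, ranks=(4, 4, 3), K=3))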
Alternatively, the cluster indicators can be obtained directly by constraining the mode-3 factor to be binary:

min_{U,V,W,Y} ||X − Y ×_1 U ×_2 V ×_3 W||_F^2
s.t. U^T U = I, V^T V = I, W is a binary matrix.    (9.17)

Different from Eq. (9.15), W in Eq. (9.17) is a binary matrix that serves as the cluster indication matrix, and Y ×_1 U ×_2 V gives the cluster centroids. Y, W, U, V in Eq. (9.17) can be optimized using the alternating least squares algorithm, which iteratively updates one variable while fixing the others until the convergence condition is reached.
9.3 Tensor Subspace Cluster Based on Self-Representation

Self-representation-based methods assume that each data sample can be expressed as a linear combination of the other samples, i.e.,

X = XC,    (9.18)
where C ∈ R^{N×N} denotes the representations of the data samples X and C_{i,j} can be considered as a measure of a certain relationship, such as similarity or correlation, between the i-th and the j-th samples.
The self-representation matrix C ∈ R^{N×N} can reveal the structure of the data samples from the perspective of pairwise relationships [39]. However, C_{i,j} is not always the same as C_{j,i}, so the pairwise affinity matrix W = |C| + |C|^T is often applied [9, 11, 12]. Then spectral clustering is performed over W to obtain the final results.
Spectral clustering [37] is based on graph theory; it uses information from the eigenvalues (spectrum) of special matrices built from the graph or the data set to group the samples.
Sparse subspace clustering seeks the sparsest self-representation of each sample by minimizing the ℓ0 norm of C subject to X = XC, which is the problem denoted as (9.19). Since (9.19) with the ℓ0 norm is a nondeterministic polynomial (NP) hard problem, minimizing the tightest convex relaxation of the ℓ0 norm is considered instead:

min_C ||C||_1  s.t. X = XC, diag(C) = 0.    (9.20)

Additionally, in the case of data contaminated with Gaussian noise, the optimization problem in (9.20) can be generalized as follows:

min_C ||C||_1 + (λ/2)||X − XC||_F^2  s.t. diag(C) = 0.

Furthermore, in the case of data contaminated with noise or outliers, the low-rank representation (LRR) algorithm solves the problem

min_{C,E} R(C) + λL(E)  s.t. X = XC + E,
where E is the error matrix, L(E) and R(C) represent the fitting error and the data self-representation regularizers, respectively, and λ balances the two terms.
After obtaining the matrix C, we can apply the spectral clustering algorithm to
obtain the final clustering results by calculating the affinity matrix W from C.
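As a minimal illustration of the self-representation pipeline, the sketch below replaces the ℓ1 or nuclear-norm regularizer by a simple Frobenius-norm (ridge) regularizer so that C has a closed form, builds the affinity W = |C| + |C|^T, and applies spectral clustering; the regularization weight and the normalized-Laplacian embedding are assumptions.

import numpy as np

def self_representation_affinity(X, gamma=0.1):
    """X: D x N samples. Ridge-regularized self-representation
    C = argmin ||X - XC||_F^2 + gamma ||C||_F^2 (a simplification of (9.20)-(9.22))."""
    N = X.shape[1]
    G = X.T @ X
    C = np.linalg.solve(G + gamma * np.eye(N), G)
    np.fill_diagonal(C, 0.0)                       # discard trivial self-loops
    return np.abs(C) + np.abs(C).T                 # affinity W = |C| + |C|^T

def simple_kmeans_rows(Z, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    C = Z[rng.choice(len(Z), K, replace=False)]
    for _ in range(n_iter):
        labels = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if np.any(labels == k):
                C[k] = Z[labels == k].mean(axis=0)
    return labels

def spectral_clustering(W, K):
    # Normalized graph Laplacian embedding followed by K-means on the rows.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L)
    emb = vecs[:, :K]                              # K smallest eigenvectors
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    return simple_kmeans_rows(emb, K)

# Toy usage (illustrative assumptions).
X = np.random.rand(10, 60)
labels = spectral_clustering(self_representation_affinity(X), K=3)
print(np.bincount(labels))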
Since the self-representation clustering models in matrix form ignore the spatial relationships of the data, they have been extended to the tensor space to obtain better clustering performance. Assume we have a tensor dataset X = {X_1, X_2, · · · , X_N} ∈ R^{I_1×I_2×N}, where each X_n is a two-dimensional matrix; the self-representation model can then be formulated in the tensor space in a similar manner.
Applying the t-SVD [20] to the self-representation clustering model has a similar purpose to that of the Tucker decomposition: it also aims at obtaining a low-dimensional representation tensor.
Assume a dataset X = {X_1, X_2, · · · , X_N} ∈ R^{I_1×I_2×N}, where X_n is a two-dimensional matrix. Based on a new type of low-rank tensor constraint (i.e., the t-SVD-based tensor nuclear norm (t-TNN)), a subspace clustering model called t-SVD-MSC has been proposed [50]. The optimization model is as follows:

min_{C,E} ||C||_{t-TNN} + λ||E||_{2,1}
s.t. X_n = X_n C_n + E_n, n = 1, · · · , N,
C = φ(C_1, C_2, · · · , C_N),    (9.28)
E = [E_1; E_2; · · · ; E_N],
Here the t-TNN is computed as ||C||_{t-TNN} = Σ_n ||Ĉ_n||_* = Σ_n Σ_i Ŝ_n(i, i), where Ĉ = fft(C, [ ], 3) denotes the fast Fourier transform (FFT) of the tensor C along the third dimension, and Ŝ_n is obtained from the SVD of the n-th frontal slice of Ĉ as Ĉ_n = Û_n Ŝ_n V̂_n^T.
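The t-TNN can be evaluated directly from this definition: take the FFT along the third mode and sum the singular values of the frontal slices. A short numpy sketch follows; note that some papers additionally divide by the number of slices, so the normalization is an assumption.

import numpy as np

def t_tnn(C):
    """t-SVD based tensor nuclear norm of C (n1 x n2 x n3):
    FFT along mode 3, then sum the nuclear norms of the frontal slices."""
    C_hat = np.fft.fft(C, axis=2)
    total = 0.0
    for k in range(C.shape[2]):
        s = np.linalg.svd(C_hat[:, :, k], compute_uv=False)   # singular values of slice k
        total += s.sum()
    return total                     # some definitions use total / C.shape[2]

# Example: a random low-tubal-rank tensor (t-product of thin factors) has a
# comparatively small t-TNN versus a full random tensor of the same size.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 3, 5))
B = rng.standard_normal((3, 30, 5))
low = np.real(np.fft.ifft(np.einsum('irk,rjk->ijk',
                                    np.fft.fft(A, axis=2),
                                    np.fft.fft(B, axis=2)), axis=2))   # t-product A * B
print(t_tnn(low), t_tnn(rng.standard_normal((30, 30, 5))))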
By introducing an auxiliary tensor variable G, the optimization problem (9.28) can be transformed and solved by the augmented Lagrangian method [31]:

L(C_1, · · · , C_N; E_1, · · · , E_N; G) = ||G||_{t-TNN} + λ||E||_{2,1}
    + Σ_{n=1}^{N} ( ⟨Y_n, X_n − X_n C_n − E_n⟩ + (μ/2)||X_n − X_n C_n − E_n||_F^2 )
    + ⟨W, C − G⟩ + (ρ/2)||C − G||_F^2,    (9.30)

where Y_n and W are Lagrange multipliers, and μ and ρ are penalty parameters.
The subproblem with respect to E is

E* = arg min_E λ||E||_{2,1} + Σ_{n=1}^{N} ( ⟨Y_n, X_n − X_n C_n − E_n⟩ + (μ/2)||X_n − X_n C_n − E_n||_F^2 )    (9.32)
   = arg min_E (λ/μ)||E||_{2,1} + (1/2)||E − D||_F^2,

where the n-th block of D is D_n = X_n − X_n C_n + (1/μ)Y_n.
Finally, when the algorithm converges, the loop ends, and the whole process is
shown in Algorithm 48.
By further adding a view-specific hyper-Laplacian regularization term to the objective of (9.28), a hyper-Laplacian regularized multiview clustering model is obtained [51], with the constraints

X_n = X_n C_n + E_n, n = 1, · · · , N,
C = φ(C_1, C_2, · · · , C_N),    (9.36)
E = [E_1; E_2; · · · ; E_N],

where (L_h)_n represents the view-specific hyper-Laplacian matrix built on the optimized subspace representation matrix C_n, so that the local high-order geometrical structure can be discovered through the hyper-Laplacian regularized term.
optimization problem in (9.36) can be solved by ADMM to get the optimal Cn , En .
For more details about this hyper-Laplacian matrix (Lh )n , readers can refer to [51].
9.4 Applications
Fig. 9.3 An example of tensor model to represent heterogeneous information network. The left
one is the original network with three types of objects (yellow, red, and blue), and the number
within each object is the identifier of the object. Each element (brown dot) in the third-order tensor
(right one) represents a gene network (i.e., the minimum subnetwork of heterogeneous information
network) in the network (green circle in the left), and the coordinates represent the corresponding
identifiers of the objects
In [46], clustering of heterogeneous information networks is performed by applying CP decomposition to the tensor that we construct in Fig. 9.3. The optimization model for the CP decomposition can be written as

min_{U^(1),··· ,U^(N)} ||X − [[U^(1), U^(2), · · · , U^(N)]]||_F^2,    (9.37)

where U^(n) ∈ R^{I_n×K}, n = 1, · · · , N, is the cluster indication matrix of the n-th type of objects, and each column u_k^(n) = (u_{k,1}^(n), u_{k,2}^(n), · · · , u_{k,I_n}^(n))^T of the factor matrix U^(n) is the probability vector of the objects of the n-th type belonging to the k-th cluster.
It can be seen that the problem in (9.37) is an NP-hard nonconvex optimization
problem [1]. Thus, in [46], the authors adopt a stochastic gradient descent algorithm to solve this CP decomposition-based tensor clustering problem with a Tikhonov regularization-based loss function [30]. At the same time, they prove that the CP
decomposition in (9.37) is equivalent to the K-means clustering for the n-th type of
objects in heterogeneous information network G = (V, E), which means that the CP
decomposition of X obtains the clusters of all types of objects in the heterogeneous
information network G = (V, E) simultaneously.
Fig. 9.5 Schematic diagram of unsupervised classification of 12-lead ECG signals using wavelet
tensor spectral clustering
Fig. 9.6 Tensorization of ECG wavelet matrix, where Zm ∈ RD×N is one of the ECG wavelet
matrices, n is the length of one wavelet frequency sub-band, D of Zm represents the number of the
measured ECG variables, which equals 12 in this case, and I3 refers to the frequency sub-band
(i.e., I3 = N/n)
The operations above are for preprocessing. To further reduce the dimensionality
of feature data for better pattern recognition results and less memory requirements,
each Zm is transformed into the tensor form Zm ∈ RD×n×I3 as presented in Fig. 9.6,
where D represents the number of measured ECG variables, which equals
12 in this case, n is the sampling time period, and I3 refers to the frequency sub-
band. By applying the multilinear principal component analysis (MPCA) [28], the
original tensor space can be mapped onto a tensor subspace. Finally, a new kernel
function based on Euclidean distance and cosine distance, namely, two-dimensional
Gaussian kernel function (TGSC), has been applied to the spectral clustering algorithm
to obtain the final clustering results, and each cluster contains all samples from one
patient.
Fig. 9.7 Several examples from different datasets in various applications. (1) The COIL-20
dataset [11]; (2) the ORL dataset [50]; (3) the Scene-15 dataset [13]; (4) the UCI-Digits dataset [2]
Among the compared methods, LRR, Co-reg, RMSC, and DiMSC are matrix-based clustering methods, while t-SVD-MSC [50] and ETLMSC [48] are tensor-based clustering methods.
Additionally, six commonly used evaluation metrics are employed, namely, F-
score (F), precision (P), recall (R), normalized mutual information (NMI), adjusted
rand index (ARI), and accuracy (ACC) [19, 36].
Note that these six metrics capture different properties of a clustering result; for all of them, a higher value indicates better clustering performance.
As shown in Table 9.2, t-SVD-MSC and ETLMSC, which apply the tensor model to clustering, achieve the best and second-best results in terms of almost all the metrics on the four applications. Since the tensor models preserve the integrity of the data and the spatial structures well, the tensor-based methods obtain better performance than the matrix-based clustering methods.
Table 9.2 The clustering results of four datasets, namely, UCI-Digits, ORL, COIL-20, and Scene-
15 datasets. The best results are all highlighted in bold black
Datasets Method F-score Precision Recall NMI AR ACC
UCI-Digits LRR 0.316 0.292 0.344 0.422 0.234 0.356
Co-reg 0.585 0.576 0.595 0.638 0.539 0.687
RMSC 0.833 0.828 0.839 0.843 0.815 0.902
DiMSC 0.341 0.299 0.397 0.424 0.257 0.530
t-SVD-MSC 0.936 0.935 0.938 0.934 0.929 0.967
ETLMSC 0.915 0.876 0.963 0.957 0.905 0.905
ORL LRR 0.679 0.631 0.734 0.882 0.671 0.750
Co-reg 0.535 0.494 0.583 0.813 0.523 0.642
RMSC 0.623 0.586 0.665 0.853 0.614 0.700
DiMSC 0.717 0.669 0.769 0.905 0.709 0.775
t-SVD-MSC 0.976 0.962 0.989 0.995 0.975 0.974
ETLMSC 0.904 0.856 0.958 0.977 0.901 0.900
COIL20 LRR 0.597 0.570 0.628 0.761 0.576 0.679
Co-reg 0.597 0.577 0.619 0.765 0.576 0.627
RMSC 0.664 0.637 0.693 0.812 0.645 0.692
DiMSC 0.733 0.726 0.739 0.837 0.719 0.767
t-SVD-MSC 0.747 0.725 0.770 0.853 0.734 0.784
ETLMSC 0.854 0.825 0.886 0.936 0.846 0.867
Scene15 LRR 0.268 0.255 0.282 0.371 0.211 0.370
Co-reg 0.344 0.346 0.341 0.425 0.295 0.451
RMSC 0.294 0.295 0.292 0.410 0.241 0.396
DiMSC 0.181 0.176 0.186 0.261 0.118 0.288
t-SVD-MSC 0.692 0.671 0.714 0.776 0.668 0.756
ETLMSC 0.826 0.820 0.831 0.877 0.812 0.859
9.5 Summary
References
1. Acar, E., Dunlavy, D.M., Kolda, T.G.: A scalable optimization approach for fitting canonical
tensor decompositions. J. Chemometrics 25(2), 67–86 (2011)
2. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
3. Bauckhage, C.: K-means clustering is matrix factorization (2015, preprint). arXiv:1512.07548
4. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)
5. Cao, X., Zhang, C., Fu, H., Liu, S., Zhang, H.: Diversity-induced multi-view subspace cluster-
ing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
586–594 (2015)
6. Chen, Y., Xiao, X., Zhou, Y.: Jointly learning kernel representation tensor and affinity matrix
for multi-view clustering. IEEE Trans. Multimedia 22(8), 1985–1997 (2019)
7. Cheng, M., Jing, L., Ng, M.K.: Tensor-based low-dimensional representation learning for
multi-view clustering. IEEE Trans. Image Process. 28(5), 2399–2414 (2018)
8. Costeira, J.P., Kanade, T.: A multibody factorization method for independently moving objects.
Int. J. Comput. Visi. 29(3), 159–179 (1998)
9. Elhamifar, E., Vidal, R.: Sparse subspace clustering. In: 2009 IEEE Conference on Computer
Vision and Pattern Recognition, pp. 2790–2797. IEEE, Piscataway (2009)
10. Elhamifar, E., Vidal, R.: Clustering disjoint subspaces via sparse representation. In: 2010 IEEE
International Conference on Acoustics, Speech and Signal Processing, pp. 1926–1929. IEEE,
Piscataway (2010)
11. Elhamifar, E., Vidal, R.: Sparse subspace clustering: algorithm, theory, and applications. IEEE
Trans. Pattern Anal. Mach. Intell. 35(11), 2765–2781 (2013)
12. Favaro, P., Vidal, R., Ravichandran, A.: A closed form solution to robust subspace estimation
and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR
2011), pp. 1801–1807. IEEE, Piscataway (2011)
13. Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories.
In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 2, pp. 524–531. IEEE, Piscataway (2005)
14. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with
applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395
(1981)
15. Gear, C.W.: Multibody grouping from motion images. Int. J. Comput. Vision 29(2), 133–150
(1998)
16. Goh, A., Vidal, R.: Segmenting motions of different types by unsupervised manifold clustering.
In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–6. IEEE,
Piscataway (2007)
17. He, H., Tan, Y., Xing, J.: Unsupervised classification of 12-lead ECG signals using wavelet
tensor decomposition and two-dimensional gaussian spectral clustering. Knowl.-Based Syst.
163, 392–403 (2019)
18. Huang, H., Ding, C., Luo, D., Li, T.: Simultaneous tensor subspace selection and clustering:
the equivalence of high order SVD and k-means clustering. In: Proceedings of the 14th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 327–335
(2008)
19. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
20. Kilmer, M.E., Martin, C.D.: Factorization strategies for third-order tensors. Linear Algebra
Appl. 435(3), 641–658 (2011)
21. Kumar, A., Rai, P., Daume, H.: Co-regularized multi-view spectral clustering. In: Advances in
Neural Information Processing Systems, pp. 1413–1421 (2011)
22. Leginus, M., Dolog, P., Žemaitis, V.: Improving tensor based recommenders with clustering.
In: International Conference on User Modeling, Adaptation, and Personalization, pp. 151–163.
Springer, Berlin (2012)
23. Lin, Z., Liu, R., Su, Z.: Linearized alternating direction method with adaptive penalty for low-
rank representation. In: Advances in Neural Information Processing Systems, pp. 612–620
(2011)
24. Liu, G., Lin, Z., Yu, Y.: Robust subspace segmentation by low-rank representation. In:
Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 663–
670 (2010)
25. Liu, J., Liu, J., Wonka, P., Ye, J.: Sparse non-negative tensor factorization using columnwise
coordinate descent. Pattern Recognit. 45(1), 649–656 (2012)
26. Lu, C.Y., Min, H., Zhao, Z.Q., Zhu, L., Huang, D.S., Yan, S.: Robust and efficient subspace
segmentation via least squares regression. In: European Conference on Computer Vision, pp.
347–360. Springer, Berlin (2012)
27. Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., Ma, Y.: Robust recovery of subspace structures by
low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 171–184 (2012)
28. Lu, H., Plataniotis, K.N., Venetsanopoulos, A.N.: MPCA: multilinear principal component
analysis of tensor objects. IEEE Trans. Neural Netw. 19(1), 18–39 (2008)
29. Ma, Y., Yang, A.Y., Derksen, H., Fossum, R.: Estimation of subspace arrangements with
applications in modeling and segmenting mixed data. SIAM Rev. 50(3), 413–458 (2008)
30. Paatero, P.: A weighted non-negative least squares algorithm for three-way ‘PARAFAC’ factor
analysis. Chemom. Intell. Lab. Syst. 38(2), 223–242 (1997)
31. Peng, W., Li, T.: Tensor clustering via adaptive subspace iteration. Intell. Data Anal. 15(5),
695–713 (2011)
32. Peng, H., Hu, Y., Chen, J., Haiyan, W., Li, Y., Cai, H.: Integrating tensor similarity to enhance
clustering performance. IEEE Trans. Pattern Anal. Mach. Intell. (2020). https://doi.org/10.
1109/TPAMI.2020.3040306
33. Poullis, C.: Large-scale urban reconstruction with tensor clustering and global boundary
refinement. IEEE Trans. Pattern Anal. Mach. Intell. 42(5), 1132–1145 (2019)
34. Rao, S., Tron, R., Vidal, R., Ma, Y.: Motion segmentation in the presence of outlying,
incomplete, or corrupted trajectories. IEEE Trans. Pattern Anal. Mach. Intell. 32(10), 1832–
1845 (2009)
35. Ren, Z., Mukherjee, M., Bennis, M., Lloret, J.: Multikernel clustering via non-negative matrix
factorization tailored graph tensor over distributed networks. IEEE J. Sel. Areas Commun.
39(7), 1946–1956 (2021)
36. Schütze, H., Manning, C.D., Raghavan, P.: Introduction to Information Retrieval, vol. 39.
Cambridge University Press, Cambridge (2008)
37. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach.
Intell. 22(8), 888–905 (2000)
38. Sui, Y., Zhao, X., Zhang, S., Yu, X., Zhao, S., Zhang, L.: Self-expressive tracking. Pattern
Recognit. 48(9), 2872–2884 (2015)
39. Sui, Y., Wang, G., Zhang, L.: Sparse subspace clustering via low-rank structure propagation.
Pattern Recognit. 95, 261–271 (2019)
40. Sun, Y., Han, J., Zhao, P., Yin, Z., Cheng, H., Wu, T.: Rankclus: integrating clustering with
ranking for heterogeneous information network analysis. In: Proceedings of the 12th Interna-
tional Conference on Extending Database Technology: Advances in Database Technology, pp.
565–576 (2009)
41. Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural
Comput. 11(2), 443–482 (1999)
42. Tseng, P.: Nearest q-flat to m points. J. Optim. Theory Appl. 105(1), 249–252 (2000)
43. Vidal, R.: Subspace clustering. IEEE Signal Process. Mag. 28(2), 52–68 (2011)
44. Vidal, R., Ma, Y., Sastry, S.: Generalized principal component analysis (GPCA). IEEE Trans.
Pattern Anal. Mach. Intell. 27(12), 1945–1959 (2005)
45. Wright, J., Ma, Y., Mairal, J., Sapiro, G., Huang, T.S., Yan, S.: Sparse representation for
computer vision and pattern recognition. Proc. IEEE 98(6), 1031–1044 (2010)
46. Wu, J., Wang, Z., Wu, Y., Liu, L., Deng, S., Huang, H.: A tensor CP decomposition method
for clustering heterogeneous information networks via stochastic gradient descent algorithms.
Sci. Program. 2017, 2803091 (2017)
47. Wu, T., Bajwa, W.U.: A low tensor-rank representation approach for clustering of imaging data.
IEEE Signal Process. Lett. 25(8), 1196–1200 (2018)
48. Wu, J., Lin, Z., Zha, H.: Essential tensor learning for multi-view spectral clustering. IEEE
Trans. Image Proces. 28(12), 5910–5922 (2019)
49. Xia, R., Pan, Y., Du, L., Yin, J.: Robust multi-view spectral clustering via low-rank and
sparse decomposition. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial
Intelligence, pp. 2149–2155 (2014)
50. Xie, Y., Tao, D., Zhang, W., Liu, Y., Zhang, L., Qu, Y.: On unifying multi-view self-
representations for clustering by tensor multi-rank minimization. Int. J. Comput. Vis. 126(11),
1157–1179 (2018)
51. Xie, Y., Zhang, W., Qu, Y., Dai, L., Tao, D.: Hyper-laplacian regularized multilinear multiview
self-representations for clustering and semisupervised learning. IEEE Trans. Cybern. 50(2),
572–586 (2018)
52. Yan, J., Pollefeys, M.: A general framework for motion segmentation: independent, articulated,
rigid, non-rigid, degenerate and non-degenerate. In: European Conference on Computer Vision,
pp. 94–106. Springer, Berlin (2006)
53. Yang, B., Fu, X., Sidiropoulos, N.D.: Learning from hidden traits: joint factor analysis and
latent clustering. IEEE Trans. Signal Process. 65(1), 256–269 (2016)
54. Yu, K., He, L., Philip, S.Y., Zhang, W., Liu, Y.: Coupled tensor decomposition for user
clustering in mobile internet traffic interaction pattern. IEEE Access. 7, 18113–18124 (2019)
55. Zhang, T., Szlam, A., Lerman, G.: Median k-flats for hybrid linear modeling with many
outliers. In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV
Workshops, pp. 234–241. IEEE, Piscataway (2009)
Chapter 10
Tensor Decomposition in Deep Networks
10.1 Introduction to Deep Learning

A deep neural network (DNN) stacks multiple fully connected layers, where the l-th layer computes

x_l = σ(W_l x_{l−1} + b_l), l = 1, · · · , L,

where x_{l−1} ∈ R^{N_{l−1}}, W_l ∈ R^{N_l×N_{l−1}}, and b_l ∈ R^{N_l} denote the input, weight, and bias of the l-th layer, respectively, and σ(·) is a nonlinear activation function. The nonlinearity of the activation function brings powerful representation capability to
the DNN. Different activation functions may have different effects, and commonly
used activation functions include ReLU, sigmoid, etc. The model parameters Wl
and bl are trainable and often achieved by minimizing a specific loss function by
error backpropagation. Loss functions can be divided into two broad categories:
classification and regression. Mean absolute error (MAE, 1 loss) and mean square
error (MSE, 2 loss) are used for regression tasks, which are respectively represented
as

MAE = (1/N_L) Σ_{n=1}^{N_L} |y_n − ȳ_n|

and

MSE = (1/N_L) Σ_{n=1}^{N_L} (y_n − ȳ_n)^2,
where ȳ_n is the n-th element of the target vector. For classification tasks, the cross-entropy loss function not only measures the model quality well but also allows easy derivative computation. The size of the output N_L is set to the number of categories, and the cross-entropy error (CEE) for multi-class classification is expressed as

CEE = − Σ_{n=1}^{N_L} ȳ_n ln p_n,

where ȳ = [ȳ_1, ȳ_2, · · · , ȳ_{N_L}] is the target (one-hot coding) and p_n is the probability of the n-th category. The probability is usually obtained by the softmax function, which is represented as p_n = e^{y_n} / Σ_{k=1}^{N_L} e^{y_k}.
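A small numpy sketch of the forward pass and the losses described above is given below; the layer sizes, activation choice, and random data are placeholders.

import numpy as np

def dense(x, W, b, act=np.tanh):
    # One fully connected layer: x_l = sigma(W_l x_{l-1} + b_l).
    return act(W @ x + b)

def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(p, y_onehot):
    return -np.sum(y_onehot * np.log(p + 1e-12))

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
W1, b1 = rng.standard_normal((16, 8)), np.zeros(16)
W2, b2 = rng.standard_normal((3, 16)), np.zeros(3)
y = dense(dense(x0, W1, b1), W2, b2, act=lambda v: v)   # logits of 3 classes
p = softmax(y)
target = np.array([0.0, 1.0, 0.0])
print("CEE =", cross_entropy(p, target))
print("MSE =", np.mean((y - target) ** 2), "MAE =", np.mean(np.abs(y - target)))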
When dealing with multiway data, classical neural networks usually perform a vectorization operation on the input data, which may cause a loss of structural information. In order to exploit the spatial structure of multiway data, the CNN was proposed by introducing convolution operators [15]. Besides, the RNN is designed for temporally related data and can handle tasks involving continuous speech or text, such as automatic speech recognition (ASR) and text generation [24]. Figure 10.2 illustrates the core part of a CNN, and Fig. 10.4 illustrates the main idea of an RNN.
A CNN usually includes convolutional layers, pooling layers, and fully connected
(dense) layers, and the architecture is shown in Fig. 10.2. Convolutional layers,
which can extract local features and need fewer parameters than dense layers, are
the core of CNNs. Each convolutional layer is followed by a pooling layer which is
used to reduce the size of the output of the convolutional layer to avoid overfitting.
At the end of a CNN, there is usually one fully connected layer (though not limited to one), which aggregates global information and outputs a vector whose size equals the number of categories.
Suppose an input image X ∈ R^{H×W×I}, where H is the height of the image, W is the width of the image, and I is the number of channels (normally I = 3 for a color image), and design a convolutional kernel K ∈ R^{K×K×I×O}, where K is the size of the window and O is the number of output channels. The convolutional operation is mathematically expressed element-wise as
Y_{h′,w′,o} = Σ_{k_1,k_2=1}^{K} Σ_{i=1}^{I} X_{h′+k_1−1, w′+k_2−1, i} K_{k_1,k_2,i,o},
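A direct (unoptimized) numpy implementation of the element-wise convolution formula, assuming stride 1 and no padding:

import numpy as np

def conv2d_naive(X, K):
    """X: H x W x I input, K: k x k x I x O kernel; stride 1, no padding."""
    H, W, I = X.shape
    k, _, _, O = K.shape
    Y = np.zeros((H - k + 1, W - k + 1, O))
    for h in range(H - k + 1):
        for w in range(W - k + 1):
            patch = X[h:h + k, w:w + k, :]                    # k x k x I window
            Y[h, w, :] = np.einsum('abi,abio->o', patch, K)   # sum over k1, k2, i
    return Y

X = np.random.rand(8, 8, 3)
K = np.random.rand(3, 3, 3, 12)
print(conv2d_naive(X, K).shape)   # (6, 6, 12)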
Fig. 10.2 A graphical illustration for CNN. 3@128 × 128 represents an image with three channels
whose size is 128 × 128
If the size of the input is H × W and the size of the pool is P × P, then the output size is H/P × W/P (the stride size is usually set to P).
The input data of the neural network includes not only images but also speech
signals and video which are temporal (sequence) data. Although CNNs can handle
these data types, the performance is not ideal: if the sequence data are fed into a CNN one element at a time, the relational information between adjacent elements is ignored, while training on the whole sequence at once results in information redundancy. Thus RNNs are designed to handle temporal data, which can deal with the
relationship between adjacent sequences more effectively.
The main feature of an RNN is the additional hidden state at each time step. Driven by the dynamic system view, the t-th hidden state is the sum of a linear transformation of the
(t − 1)-th hidden state and the linear transformation of the current input,
ht = Wht−1 + Uxt .
The t-th output is the linear transformation of the current hidden state,
ot = Vht .
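The recurrence can be rolled out directly, as in the sketch below, which follows the linear formulation given above; practical RNNs usually wrap the hidden-state update in a nonlinearity such as tanh, which is noted in a comment.

import numpy as np

def rnn_forward(xs, W, U, V, h0):
    """xs: list of input vectors x_1..x_T. Returns outputs o_1..o_T.
    h_t = W h_{t-1} + U x_t (a tanh would typically wrap this), o_t = V h_t."""
    h, outputs = h0, []
    for x in xs:
        h = W @ h + U @ x          # hidden-state update
        outputs.append(V @ h)      # output at step t
    return outputs

rng = np.random.default_rng(0)
H, I, O, T = 4, 3, 2, 5
W = rng.standard_normal((H, H)) * 0.3
U, V = rng.standard_normal((H, I)), rng.standard_normal((O, H))
outs = rnn_forward([rng.standard_normal(I) for _ in range(T)], W, U, V, np.zeros(H))
print(len(outs), outs[0].shape)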
Fig. 10.4 shows the architecture of a basic RNN. The long short-term memory
(LSTM) [9] is a popular RNN with enhanced capability to store longer-term
temporal information, and it can alleviate the vanishing gradient problem in RNNs.
DNN is a powerful tool for computer vision tasks, but it may suffer from some
limitations. One well-known shortcoming is the black box property. It is hard
to understand how and why deep networks produce an output corresponding to
the input and why the better performance can be achieved by deeper networks.
Besides, DNN needs lots of neurons and hidden layers to improve performance.
This means that the memory and computing resource costs can be very high.
Fig. 10.4 The architecture of a basic RNN
10.2 Network Compression by Low-Rank Tensor Approximation

In either a CNN or an RNN, the weights of the fully connected layers occupy the majority of the memory [23]. In a fully connected layer, the input is flattened to a vector and converted to another output vector through a linear transformation based on the matrix product. In this way, not only does the flattening operation cause a loss of dimensionality information, but the number of required parameters is also huge. In this section,
we will introduce how to design a new network structure by using tensor products,
including t-product, mode-k product, and tensor contraction, instead of matrix
product to compress the parameters or improve the performance of neural networks.
For a third-order input, a standard fully connected layer first unfolds the data and computes

f_l(X_{l−1}) = σ(W_l X_{l−1} + B_l),

where X_{l−1} ∈ R^{N_{l−1}K×M}, W_l ∈ R^{N_lK×N_{l−1}K}, and B_l is the bias. In [19], keeping the original multiway data representation and replacing the matrix product with the t-product, we can get layers of the form

f_l(X_{l−1}) = σ(W_l ∗ X_{l−1} + B_l),

where W_l ∈ R^{N_l×N_{l−1}×K} denotes the weight tensor, X_{l−1} ∈ R^{N_{l−1}×M×K} denotes the input tensor, f_l(X_{l−1}) ∈ R^{N_l×M×K} denotes the output tensor, and B_l denotes the bias tensor.
In Fig. 10.5, we can see the linear and multilinear combinations realized by the matrix and tensor products, respectively. Notice that the matrix product in Fig. 10.5a requires N^2K^2 weight parameters, while the t-product in Fig. 10.5b only requires N^2K if N_l = N for l = 1, · · · , L. Furthermore, according to the properties of the t-product, its computation can be accelerated by the fast Fourier transform: the k-th frontal slice of the FFT of W_l ∗ X_l equals Ŵ_l^(k) X̂_l^(k), where X̂_l and Ŵ_l are the fast Fourier transforms (FFTs) of X_l and W_l along their third modes, respectively, X̂_l^(k) is the k-th frontal slice of X̂_l, and k = 1, · · · , K. Then the inverse Fourier transform can be used to obtain X_{l+1}, which greatly benefits the efficient implementation of DNNs. However, there is a strict requirement that the input tensor X_{l−1} and the output tensor X_l must share two dimensions of the same size, and the t-product is commonly used for third-order tensors.
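A sketch of one t-product layer computed in the Fourier domain as described above; the ReLU activation, the layer function f(X) = σ(W ∗ X + B), and the random shapes are assumptions.

import numpy as np

def t_product(W, X):
    """W: Nl x Nl_1 x K, X: Nl_1 x M x K. Frontal-slice products in the Fourier domain."""
    W_hat = np.fft.fft(W, axis=2)
    X_hat = np.fft.fft(X, axis=2)
    Y_hat = np.einsum('ick,cmk->imk', W_hat, X_hat)   # slice-wise matrix products
    return np.real(np.fft.ifft(Y_hat, axis=2))

def t_layer(X, W, B, act=lambda v: np.maximum(v, 0.0)):
    # One tensor layer: f(X) = sigma(W * X + B).
    return act(t_product(W, X) + B)

rng = np.random.default_rng(0)
N_in, N_out, M, K = 6, 4, 10, 5
X = rng.standard_normal((N_in, M, K))
W = rng.standard_normal((N_out, N_in, K)) / np.sqrt(N_in * K)
B = np.zeros((N_out, M, K))
print(t_layer(X, W, B).shape)   # (4, 10, 5)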
Fig. 10.5 Comparison of matrix and tensor product-based linear operators. (a) Matrix weighted
connection. (b) Tensor weighted connection
Here we consider an N-way input tensor X ∈ R^{I_1×I_2×···×I_N}. In a fully connected layer of a classical deep neural network, the input tensor is unfolded to a one-way high-dimensional vector x ∈ R^{I_1I_2···I_N}. If the number of neurons in the next layer is I_{N+1}, then the fully connected layer takes the form

y = W^T x,

where the corresponding weight matrix W ∈ R^{I_1I_2···I_N×I_{N+1}}. Because of the high dimensionality of the input vector, the weight matrix is very large, occupies a lot of memory, and slows down the computation.
In [2], based on tensor contraction, the aforementioned layer is replaced with

y_{i_{N+1}} = Σ_{i_1=1}^{I_1} · · · Σ_{i_N=1}^{I_N} X_{i_1,··· ,i_N} W_{i_1,··· ,i_N,i_{N+1}},

where W ∈ R^{I_1×···×I_N×I_{N+1}} is the coefficient tensor, which usually admits low rankness. The tensor contraction layer can maintain the multiway structure of the input and reduce the model parameters by low-rank tensor approximation.
Taking the Tucker decomposition as an example, i.e.,

W = G ×_1 U^(1) ×_2 U^(2) · · · ×_{N+1} U^(N+1),

where G ∈ R^{R_1×R_2×···×R_N×R_{N+1}} and U^(n) ∈ R^{I_n×R_n} for n = 1, · · · , N + 1. For easy understanding, we give a tensor graphical representation of the Tucker layer in Fig. 10.6.
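A sketch of the Tucker-format contraction layer: the input is contracted with the factor matrices mode by mode and then with the core, so the full weight tensor W never has to be formed explicitly; the shapes and ranks are illustrative assumptions.

import numpy as np

def tucker_layer(X, G, factors, U_out):
    """X: I1 x ... x IN input; G: R1 x ... x RN x R_{N+1} core;
    factors[n]: I_n x R_n; U_out: I_{N+1} x R_{N+1}. Returns a vector of length I_{N+1}."""
    Z = X
    for U in factors:
        # Contract the first remaining input mode of Z with the matching factor.
        Z = np.tensordot(Z, U, axes=([0], [0]))   # appends the rank mode at the end
    # Z now has shape R1 x ... x RN; contract with the core over all these modes.
    y_core = np.tensordot(G, Z, axes=(list(range(len(factors))), list(range(len(factors)))))
    return U_out @ y_core                          # map R_{N+1} -> I_{N+1}

rng = np.random.default_rng(0)
I, R = (4, 5, 6), (2, 3, 2)
X = rng.standard_normal(I)
G = rng.standard_normal(R + (3,))                  # R_{N+1} = 3
factors = [rng.standard_normal((i, r)) for i, r in zip(I, R)]
U_out = rng.standard_normal((7, 3))                # I_{N+1} = 7
print(tucker_layer(X, G, factors, U_out).shape)    # (7,)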
In fact, when the original weight tensor W is decomposed according to different tensor decompositions, new neural networks with different structures can be obtained, such as the TT layer based on the tensor train (TT) decomposition shown in Fig. 10.7. In Fig. 10.7a, the weight tensor W ∈ R^{I_1×···×I_N×I_{N+1}} is decomposed into the TT format and the layer output is a vector, while in Fig. 10.7b the core factors G^(n) ∈ R^{R_n×I_n×J_n×R_{n+1}}, n = 1, · · · , N (R_1 = R_{N+1} = 1), carry an additional output mode J_n each, so that the layer output is a tensor.
Table 10.1 gives the compression factor (CF) of these tensor contraction layers, where the CF is the ratio of the number of parameters of the fully connected layer to that of the tensorized fully connected layer, which measures the compression efficiency of tensorized neural networks.
[Fig. 10.6: tensor graphical representation of the Tucker layer, with core $\mathcal{G}$, ranks $R_1,\cdots,R_{N+1}$, factor $\mathbf{U}^{(N+1)}$, and output mode $I_{N+1}$.]
Besides the t-product and tensor contraction-based deep neural networks, some other works use the mode-$n$ product to replace the matrix product [3, 12], resulting in deep tensorized neural networks with fewer parameters than classical networks. For an input tensor $\mathcal{X} \in \mathbb{R}^{I_1\times\cdots\times I_N}$, one simple deep tensorized neural network based on the mode-$n$ product [12] seeks an output tensor of smaller size by multiplying the input by a factor matrix along every mode, i.e., $\mathcal{Y} = \mathcal{X}\times_1\mathbf{U}^{(1)}\times_2\cdots\times_N\mathbf{U}^{(N)}$ with $\mathbf{U}^{(n)} \in \mathbb{R}^{J_n\times I_n}$ and $J_n \le I_n$.
Fig. 10.7 A graphical illustration of the linear transformation in a tensor train layer. A tensor train layer is a hidden layer whose weight tensor is in the tensor train format. There are two kinds of graphical structures: (a) the output is a vector [2], which can be applied in the last hidden layer for classification tasks; (b) the output is a tensor [20], which can be applied in a middle hidden layer and connected with the next tensor train or Tucker layer
Generally, more hidden layers can improve the performance of deep neural networks. In the tensor factorization neural network (TFNN) [3], multiple factor layers and activation functions are used, as graphically illustrated in Fig. 10.8. By sharing weight factors along different ways, the TFNN can efficiently capture the data structure with much fewer parameters than neural networks without factorization. In Fig. 10.8, the input is a tensor $\mathcal{X} \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$, and the weights are composed of $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n\times J_n}$, $n=1,\cdots,N$, $\mathbf{V}^{(n)} \in \mathbb{R}^{J_n\times K_n}$, $n=1,\cdots,N$, and $\mathcal{W} \in \mathbb{R}^{K_1\times\cdots\times K_N\times C}$. Each factor layer is followed by a nonlinear activation function $\sigma(\cdot)$, and the softmax function $s(\cdot)$ is used in the last layer for classification tasks.
[Fig. 10.8: graphical illustration of the TFNN, with factor layers of sizes $I_n\times J_n$ and $J_n\times K_n$ and a softmax output $s(\cdot)$.]
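For intuition, here is a minimal NumPy sketch of a TFNN forward pass, written for a two-way input for brevity; the function and variable names are illustrative, and the nonlinearity is assumed to be tanh.

```python
import numpy as np

def tfnn_forward_2way(X, U1, U2, V1, V2, W):
    # Hypothetical TFNN sketch for a two-way input X (I1 x I2): each factor
    # layer multiplies the hidden tensor by one factor matrix per mode and
    # applies a nonlinearity; the output layer contracts with the weight
    # tensor W (K1 x K2 x C) and applies the softmax s(.).
    sigma = np.tanh
    H1 = sigma(U1.T @ X @ U2)                             # first factor layer:  I_n -> J_n
    H2 = sigma(V1.T @ H1 @ V2)                            # second factor layer: J_n -> K_n
    logits = np.tensordot(W, H2, axes=([0, 1], [0, 1]))   # contract K1, K2 -> C class scores
    e = np.exp(logits - logits.max())
    return e / e.sum()                                    # softmax s(.)
```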
The aforementioned neural network structures compress DNNs whose input is a tensor. Here, we additionally introduce another neural network whose input is a vector, named the tensor neural network (TNN) [32], which is efficient for large vocabulary speech recognition tasks. Figure 10.9 gives a graphical illustration of the network structure of the TNN, which is composed of a double projection layer and a tensor layer. The forward propagation in this model can be represented as
$$\mathbf{h}_1 = \sigma(\mathbf{U}_1\mathbf{x}_1), \quad \mathbf{h}_2 = \sigma(\mathbf{U}_2\mathbf{x}_2), \quad z_j = \sigma\big(\mathbf{h}_1^{\mathrm{T}}\mathcal{W}(:,:,j)\mathbf{h}_2\big), \quad j = 1,\cdots,J,$$
where $\mathbf{x} \in \mathbb{R}^{I}$ is the input vector, $\mathbf{U}_1 \in \mathbb{R}^{R_1\times I}$ and $\mathbf{U}_2 \in \mathbb{R}^{R_2\times I}$ are two affine matrices constituting the double projection layer, $\mathcal{W} \in \mathbb{R}^{R_1\times R_2\times J}$ is the bilinear transformation tensor constituting the tensor layer, and $\sigma(\cdot)$ is a nonlinear activation function; the resulting $\mathbf{z}$ is passed through the subsequent layers to produce the output $\mathbf{y}$.
[Fig. 10.9: the TNN architecture: the input $\mathbf{x}$ is projected by $\mathbf{U}_1$ and $\mathbf{U}_2$, combined by the tensor layer $\mathcal{W}$, and mapped through further layers to the output $\mathbf{y}$.]
Fig. 10.10 A graphical illustration of the architecture of the modified HG. There are three signal pathways: the cyan region denotes the downsampling/encoder, the blue region denotes the upsampling/decoder, and the red blocks represent skip connections. The transparent black block gives a detailed illustration of each basic block module, which is composed of $B_{\text{depth}}$ convolutional layers followed by a pooling layer. The kernel of each convolutional layer is of size $H\times W\times F_{\text{in}}\times F_{\text{out}}$
Note that $\mathbf{x}_1 = \mathbf{x}_2 = \mathbf{x}$ in Fig. 10.9.
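The tensor layer above admits a compact implementation; the following is a minimal NumPy sketch with illustrative names, using tanh as the assumed activation.

```python
import numpy as np

def tnn_tensor_layer(x, U1, U2, W, sigma=np.tanh):
    # Hypothetical sketch of the TNN double-projection + tensor layer:
    # x (length I) is projected twice, then combined bilinearly by the
    # third-order tensor W (R1 x R2 x J).
    h1 = sigma(U1 @ x)                              # R1-dimensional projection
    h2 = sigma(U2 @ x)                              # R2-dimensional projection
    # z_j = sigma(h1^T W(:, :, j) h2) for j = 1, ..., J
    z = sigma(np.einsum('r,rsj,s->j', h1, W, h2))
    return z
```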
Fig. 10.11 A graphical illustration of the Tucker decomposition of the whole parameter tensor of the modified HG network
Most deep neural networks are end-to-end models, and it is difficult to understand their characteristics; they behave like "black boxes." One well-publicized proposition is that deep networks are more efficient than shallow ones. A series of works has demonstrated this phenomenon using various network architectures [4, 18]. However, they fail to explain why the performance of a deep network is better than that of a shallow network.
In [4], the authors try to establish a ternary correspondence between functions, tensor decompositions, and neural networks and provide some theoretical explanations of depth efficiency. As we know, neural networks can express a wide range of functions; in this framework, the function computed for output $y$ over input patches $\mathbf{x}_1,\ldots,\mathbf{x}_N$ is written as
$$h_y(\mathbf{x}_1,\ldots,\mathbf{x}_N) = \sum_{d_1,\ldots,d_N=1}^{D} \mathcal{A}^y_{d_1,\ldots,d_N}\prod_{n=1}^{N} f_{\theta_{d_n}}(\mathbf{x}_n). \qquad (10.10)$$
When the coefficient tensor $\mathcal{A}^y$ admits a rank-$R$ CP decomposition, (10.10) can be written as
$$h_y(\mathbf{x}_1,\ldots,\mathbf{x}_N) = \sum_{r=1}^{R} a^y_r\prod_{n=1}^{N}\sum_{j_n=1}^{J} u^n_r(j_n) f_{\theta_{j_n}}(\mathbf{x}_n), \qquad (10.11)$$
where $\mathbf{a}^y = [a^y_1,\cdots,a^y_R]^{\mathrm{T}} \in \mathbb{R}^{R}$. This can be realized by a single-hidden-layer convolutional arithmetic circuit, as shown in Fig. 10.13.
Based on the hierarchical Tucker (HT) decomposition, we can get the expression of $\mathcal{A}^y$ as follows:
$$\phi^{(1,j)}_{r_1} = \sum_{r_0=1}^{R_0} a^{1,j,r_1}_{r_0}\,\mathbf{u}^{(0,2j-1)}_{r_0}\circ\mathbf{u}^{(0,2j)}_{r_0}, \qquad (10.12)$$
$$\vdots$$
$$\phi^{(l,j)}_{r_l} = \sum_{r_{l-1}=1}^{R_{l-1}} a^{l,j,r_l}_{r_{l-1}}\,\phi^{(l-1,2j-1)}_{r_{l-1}}\circ\phi^{(l-1,2j)}_{r_{l-1}}, \qquad (10.13)$$
$$\vdots$$
$$\phi^{(L-1,j)}_{r_{L-1}} = \sum_{r_{L-2}=1}^{R_{L-2}} a^{L-1,j,r_{L-1}}_{r_{L-2}}\,\phi^{(L-2,2j-1)}_{r_{L-2}}\circ\phi^{(L-2,2j)}_{r_{L-2}}, \qquad (10.14)$$
$$\mathcal{A}^y = \sum_{r_{L-1}=1}^{R_{L-1}} a^{L,y}_{r_{L-1}}\,\phi^{(L-1,1)}_{r_{L-1}}\circ\phi^{(L-1,2)}_{r_{L-1}}. \qquad (10.15)$$
[Fig. 10.13 depicts the computations $\text{rep}(n,d) = f_{\theta_d}(\mathbf{x}_n)$, $\text{conv}(n,r) = \langle\mathbf{u}^{(n)}_r, \text{rep}(n,:)\rangle$, $\text{pool}(r) = \prod_{n=1}^{N}\text{conv}(n,r)$, and $\text{out}(y) = \langle\mathbf{a}^y, \text{pool}(:)\rangle$.]
Fig. 10.13 A graphical illustration of the single-hidden-layer convolutional arithmetic circuit architecture. $d$ is the index of output channels in the first convolution layer, $r$ is the index of output channels in the hidden layer, and $y$ is the index of output channels in the last dense (output) layer, with $d = 1,\cdots,D$, $r = 1,\cdots,R$, $y = 1,\cdots,Y$
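The circuit of Fig. 10.13 is easy to simulate numerically; the following is a minimal NumPy sketch of the score (10.11) with $J = D$, where all names are illustrative.

```python
import numpy as np

def cac_score(patches, theta, U, A, y):
    # Single-hidden-layer convolutional arithmetic circuit of Fig. 10.13:
    # representation -> 1x1 conv -> product pooling -> dense output.
    # patches: list of N local patches, theta: list of D representation
    # functions f_{theta_d}, U: N x R x D weights, A: Y x R output weights.
    rep = np.array([[f(x) for f in theta] for x in patches])   # rep(n, d)
    conv = np.einsum('nrd,nd->nr', U, rep)      # conv(n, r) = <u_r^(n), rep(n, :)>
    pool = conv.prod(axis=0)                    # pool(r): product pooling over n
    return A[y] @ pool                          # out(y) = <a^y, pool(:)>

# toy usage: N = 4 patches, D = 3 representation functions, R = 5, Y = 2
theta = [lambda x, c=c: float(np.sum(x) + c) for c in range(3)]
patches = [np.random.rand(2, 2) for _ in range(4)]
U, A = np.random.rand(4, 5, 3), np.random.rand(2, 5)
print(cac_score(patches, theta, U, A, y=0))
```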
[Fig. 10.15 depicts the computations $\text{pool}_0(j, r_0) = \prod_{j'\in\{2j-1,2j\}}\text{conv}_0(j', r_0)$, $\text{pool}_{L-1}(r_{L-1}) = \prod_{j'\in\{1,2\}}\text{conv}_{L-1}(j', r_{L-1})$, and $\text{out}(y) = \langle\mathbf{a}^{L,y}, \text{pool}_{L-1}(:)\rangle$.]
Fig. 10.15 A graphical illustration of the convolutional arithmetic circuit architecture with $L$ hidden layers. $d$ is the index of output channels in the first convolution layer, $r_l$ is the index of output channels in the $l$-th hidden layer, and $y$ is the index of output channels in the last dense (output) layer
This deep circuit corresponds to the HT decomposition. It is well known that the tensor generated by the HT decomposition can capture more information than that generated by the CP decomposition, since the HT format involves a hierarchy of ranks rather than a single one. Therefore, to a certain extent, this provides an explanation for the depth efficiency of neural networks.
10.4 Applications
For the MNIST dataset, 55,000 images in the training set are used for training, and the other 5000 images in the training set are used for validation. The best model on the validation set is then applied to the test set. In addition, for all the neural network models compared here, the optimizer is the popular stochastic gradient descent (SGD) algorithm with learning rate 0.01, momentum 0.95, and weight decay 0.0005; the batch size is set to 500 and the number of epochs is 100. It should be noted that we pad the original $28\times 28$ images to $32\times 32$ in this group of experiments, which eases the implementation of different TT networks and provides more choices for the compression factor.
The neural network models compared in this part include a single-hidden-layer dense neural network and its tensor extensions, namely TT networks and Tucker networks. In fact, different network design parameters lead to different network structures. We describe the construction of these networks in detail as follows.
Fig. 10.16 A graphical illustration of the baseline network and its extension to different TT
network structures. (a) Baseline. (b) TT network1. (c) TT network2. (d) TT network3
• Baseline The construction of the baseline network is shown in Fig. 10.16a: the input is a vector $\mathbf{x} \in \mathbb{R}^{1024}$, the weight matrix of the hidden layer is of size $1024\times 1024$, the activation function after the hidden layer is ReLU, and the weight matrix of the output layer is of size $1024\times 10$.
• TT network Driven by the TT layer introduced in Fig. 10.7, we can replace the hidden layer in Fig. 10.16a by a TT layer, as shown in Fig. 10.16b–d. Regarding the input data as multiway tensors $\mathcal{X} \in \mathbb{R}^{4\times 8\times 8\times 4}$, $\mathcal{X} \in \mathbb{R}^{4\times 4\times 4\times 4\times 4}$, or $\mathcal{X} \in \mathbb{R}^{2\times 4\times 4\times 4\times 4\times 2}$, different TT layers can be obtained. The weights of the TT layer are the cores $\mathcal{G}^{(n)}$, $n = 1,\cdots,N$, where $N$ is set to 4, 5, and 6 in Fig. 10.16b, c, and d, respectively (a minimal forward-pass sketch of such a layer is given after this list).
• Tucker network Similarly, based on Fig. 10.6, we can substitute the hidden layer in Fig. 10.17a with a Tucker layer, as shown in Fig. 10.17b. The input data is a matrix $\mathbf{X} \in \mathbb{R}^{32\times 32}$; the weights of the Tucker layer include $\mathcal{G} \in \mathbb{R}^{R_1\times R_2\times R_3}$ and three factor matrices $\mathbf{U}^{(1)}$, $\mathbf{U}^{(2)}$, and $\mathbf{U}^{(3)}$.
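The following is a minimal NumPy sketch of the TT-layer forward pass used above; the core shapes and rank values are illustrative assumptions, not the exact configuration of the experiments.

```python
import numpy as np

def tt_layer_forward(X, cores):
    # Sketch of a TT-layer forward pass. The dense weight mapping an
    # I_1 x ... x I_N input tensor to a J_1 x ... x J_N output tensor is
    # stored as TT cores G^(n) of shape R_n x I_n x J_n x R_{n+1}
    # (with R_1 = R_{N+1} = 1); the input is contracted with one core at a time.
    Z = X[np.newaxis]                           # prepend the boundary rank R_1 = 1
    for G in cores:
        # contract the current rank index and input mode with the core;
        # the produced output mode J_n is appended at the end
        Z = np.einsum('rijs,ri...->s...j', G, Z)
    return Z[0]                                 # drop the boundary rank R_{N+1} = 1

# e.g. a 4 x 8 x 8 x 4 input mapped to a 4 x 8 x 8 x 4 hidden tensor
# (reshaped back to a length-1024 vector), with all TT ranks set to 2
shapes = [(1, 4, 4, 2), (2, 8, 8, 2), (2, 8, 8, 2), (2, 4, 4, 1)]
cores = [np.random.randn(*s) * 0.1 for s in shapes]
x = np.random.randn(4, 8, 8, 4)
print(tt_layer_forward(x, cores).reshape(-1).shape)   # (1024,)
```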
Because the MNIST dataset is relatively easy, we also use Fashion-MNIST, which has the same image size and number of classes as MNIST. The classification results on the MNIST and Fashion-MNIST datasets obtained by the baseline network, TT networks, and Tucker networks with different design parameters are shown in Table 10.2.
Fig. 10.17 A graphical comparison of the baseline network and its extension to the Tucker network, where $R_1$, $R_2$, and $R_3$ are the Tucker ranks. (a) Baseline. (b) Tucker network
The CF here is defined as the ratio of the number of parameters in the baseline network to that of the compressed one, and "Ranks" denotes the ranks of the corresponding TT or Tucker layer. The CF and the classification accuracy are the two main performance metrics of interest. For each dataset, the best classification accuracy and its corresponding CF are marked in bold.
From Table 10.2, we observe that the TT networks achieve higher compression factors, while the Tucker networks improve the accuracy slightly while still compressing the model. This is because the TT network reshapes the input data into a higher-order tensor, which sacrifices a small amount of precision, whereas the Tucker network maintains the original structure and uses a low-rank constraint on the weight parameters to reduce the redundancy.
CNN is effective for many image classification tasks. In a CNN, a few convolutional layers (including pooling and ReLU) are usually followed by a fully connected layer which produces an output category vector, e.g., ten categories correspond to a ten-dimensional vector. However, the output of the last convolutional layer is a multiway tensor. In order to map this multiway tensor into a vector, two strategies are commonly adopted: flattening or global average pooling (GAP). The flattening operation is also known as vectorization; for example, a tensor of size $8\times 8\times 32$ is converted to a vector of size $2048\times 1$. GAP is an averaging operation over all feature maps along the output channel axis. Specifically, for a third-order feature tensor of size $H\times W\times N_o$, the output of a GAP layer is a vector of length $N_o$, obtained by simply taking the mean value of each $H\times W$ feature map of the corresponding output channel.
By using tensor factorization, we can map the multiway tensor into the output vector without the flattening operation, which maintains the original feature structure and reduces the number of required parameters. Here, we build a simple CNN which includes two convolutional layers (a convolutional layer with a $5\times 5\times 1\times 16$ kernel, ReLU, and $2\times 2$ max pooling, followed by a convolutional layer with a $5\times 5\times 16\times 32$ kernel, ReLU, and $2\times 2$ max pooling) and a fully connected layer. It can be extended into GAP-CNN, TT-CNN, and Tucker-CNN by replacing its final fully connected layer with a GAP layer, the TT layer in Fig. 10.7, or the Tucker layer in Fig. 10.6, respectively. Figure 10.18 provides a comparison of the structure of the last layer of the CNN, GAP-CNN, TT-CNN, and Tucker-CNN.
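GAP itself is a one-line operation; the following tiny NumPy sketch illustrates it on the feature size mentioned above.

```python
import numpy as np

def global_average_pooling(H):
    # GAP over a feature tensor of size H x W x N_o: average each H x W
    # feature map, giving one value per output channel.
    return H.mean(axis=(0, 1))

# e.g. an 8 x 8 x 32 feature tensor is reduced to a length-32 vector,
# instead of the length-2048 vector produced by flattening
features = np.random.rand(8, 8, 32)
print(global_average_pooling(features).shape)   # (32,)
```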
The developed networks are used to tackle classification tasks on the CIFAR-10 dataset, which is more complex than MNIST. The corresponding results are shown in Table 10.3. As shown in Table 10.3, Tucker-CNN can reduce the number of model parameters appropriately with negligible precision loss. With the same precision loss, GAP achieves a higher compression factor. The reason may be that the convolutional layers have already reduced the redundancy to a certain degree, so the Tucker layer and the TT layer cannot remove many more parameters without a loss of precision.
Fig. 10.18 A graphical illustration of different operations in the last layer of the CNN. (a) Flattening operation, (b) GAP operation, (c) TT layer, where $[R_2; R_3; R_4]$ is the rank, (d) Tucker layer, where $[R_1; R_2; R_3; R_4]$ is the rank
10.5 Summary
References
1. Calvi, G.G., Moniri, A., Mahfouz, M., Zhao, Q., Mandic, D.P.: Compression and interpretabil-
ity of deep neural networks via Tucker tensor layer: from first principles to tensor valued
back-propagation (2019, eprints). arXiv–1903
2. Cao, X., Rabusseau, G.: Tensor regression networks with various low-rank tensor approxima-
tions (2017, preprint). arXiv:1712.09520
262 10 Tensor Decomposition in Deep Networks
3. Chien, J.T., Bao, Y.T.: Tensor-factorized neural networks. IEEE Trans. Neural Netw. Learn.
Syst. 29, 1998–2011 (2017). https://doi.org/10.1109/TNNLS.2017.2690379
4. Cohen, N., Sharir, O., Levine, Y., Tamari, R., Yakira, D., Shashua, A.: Analysis and design of
convolutional networks via hierarchical tensor decompositions (2017, eprints). arXiv–1705
5. Garipov, T., Podoprikhin, D., Novikov, A., Vetrov, D.: Ultimate tensorization: compressing
convolutional and FC layers alike (2016, eprints). arXiv–1611
6. Goldberg, Y.: A primer on neural network models for natural language processing. J. Artif.
Intell. Res. 57, 345–420 (2016)
7. Guo, J., Li, Y., Lin, W., Chen, Y., Li, J.: Network decoupling: from regular to depthwise
separable convolutions (2018, eprints). arXiv–1808
8. Hawkins, C., Zhang, Z.: End-to-end variational bayesian training of tensorized neural networks
with automatic rank determination (2020, eprints). arXiv–2010
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
10. Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with
low rank expansions. In: Proceedings of the British Machine Vision Conference. BMVA Press,
Saint-Ouen-l’Aumône (2014)
11. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751.
Association for Computational Linguistics, Doha, Qatar (2014). https://doi.org/10.3115/v1/
D14-1181. https://www.aclweb.org/anthology/D14-1181
12. Kossaifi, J., Khanna, A., Lipton, Z., Furlanello, T., Anandkumar, A.: Tensor contraction layers
for parsimonious deep nets. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition Workshops, pp. 26–32 (2017)
13. Kossaifi, J., Bulat, A., Tzimiropoulos, G., Pantic, M.: T-net: Parametrizing fully convolutional
nets with a single high-order tensor. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 7822–7831 (2019)
14. Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I.V., Lempitsky, V.S.: Speeding-up convolu-
tional neural networks using fine-tuned CP-Decomposition. In: International Conference on
Learning Representations ICLR (Poster) (2015)
15. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proc. IEEE 86, 2278–2324 (1998). https://doi.org/10.1109/5.726791
16. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
3431–3440 (2015)
17. Makantasis, K., Voulodimos, A., Doulamis, A., Bakalos, N., Doulamis, N.: Space-time domain
tensor neural networks: an application on human pose recognition (2020, e-prints). arXiv–2004
18. Mhaskar, H., Liao, Q., Poggio, T.: Learning functions: when is deep better than shallow.
preprint arXiv:1603.00988
19. Newman, E., Horesh, L., Avron, H., Kilmer, M.: Stable tensor neural networks for rapid deep
learning (2018, e-prints). arXiv–1811
20. Novikov, A., Podoprikhin, D., Osokin, A., Vetrov, D.P.: Tensorizing neural networks. In:
Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural
Information Processing Systems, vol. 28. Curran Associates, Red Hook (2015). https://
proceedings.neurips.cc/paper/2015/file/6855456e2fe46a9d49d3d3af4f57443d-Paper.pdf
21. Qi, J., Hu, H., Wang, Y., Yang, C.H.H., Siniscalchi, S.M., Lee, C.H.: Tensor-to-vector
regression for multi-channel speech enhancement based on tensor-train network. In: IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7504–
7508. IEEE, Piscataway (2020)
22. Selvan, R., Dam, E.B.: Tensor networks for medical image classification. In: Medical Imaging
with Deep Learning, pp. 721–732. PMLR, Westminster (2020)
23. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. In: International Conference on Learning Representations (2015)
References 263
24. Tachioka, Y., Ishii, J.: Long short-term memory recurrent-neural-network-based bandwidth
extension for automatic speech recognition. Acoust. Sci. Technol. 37(6), 319–321 (2016)
25. Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured
long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), pp. 1556–1566 (2015)
26. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features
with 3d convolutional networks. In: Proceedings of the IEEE International Conference on
Computer Vision, pp. 4489–4497 (2015)
27. Wang, W., Sun, Y., Eriksson, B., Wang, W., Aggarwal, V.: Wide compression: tensor ring nets.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
9329–9338 (2018)
28. Xu, Y., Li, Y., Zhang, S., Wen, W., Wang, B., Dai, W., Qi, Y., Chen, Y., Lin, W., Xiong, H.:
Trained rank pruning for efficient deep neural networks (2019, e-prints). arXiv–1910
29. Yang, Y., Krompass, D., Tresp, V.: Tensor-train recurrent neural networks for video classifica-
tion. In: International Conference on Machine Learning, pp. 3891–3900. PMLR, Westminster
(2017)
30. Ye, J., Li, G., Chen, D., Yang, H., Zhe, S., Xu, Z.: Block-term tensor neural networks. Neural
Netw. 130, 11–21 (2020)
31. Yin, M., Liao, S., Liu, X.Y., Wang, X., Yuan, B.: Compressing recurrent neural networks using
hierarchical tucker tensor decomposition (2020, e-prints). arXiv–2005
32. Yu, D., Deng, L., Seide, F.: The deep tensor neural network with applications to large
vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 21(2), 388–396
(2012)
33. Zhang, X., Zou, J., He, K., Sun, J.: Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell. 38(10), 1943–1955 (2015)
Chapter 11
Deep Networks for Tensor Approximation
11.1 Introduction
Deep learning is a sub-field of machine learning which learns useful features for specific tasks based on artificial neural networks. Owing to breakthroughs in computing resources and data, deep learning has achieved great success in many fields, such as computer vision (CV) [17] and natural language processing (NLP) [38].
With the emergence of a large amount of multiway data, tensor approximation has been applied in many fields, such as compressive sensing (CS) [22], tensor completion [25], tensor principal component analysis [27], tensor regression and classification [33, 46], and tensor clustering [37]. Besides the machine learning techniques in the other chapters, there exist some studies dedicated to solving tensor approximation problems through deep networks [3, 9, 23, 45, 47]. The main idea of these approaches is to employ powerful deep networks as a tool for tackling some difficult problems in tensor approximation, such as tensor rank determination [47] and the simulation of specific nonlinear proximal operators [8].
Classical deep neural networks directly map the input tensor to the approximation result. These networks usually consist of stacked nonlinear operational layers, such as autoencoders [31], convolutional neural networks (CNNs) [17], and generative adversarial networks (GANs) [36]. These models are usually trained end-to-end using the well-known backpropagation algorithm. They enjoy powerful fitting and learning capability but suffer from poor interpretability due to their black box characteristic.
Model-based methods usually apply interpretable iterative algorithms to approximate the multidimensional data. In order to combine the advantages of both deep networks and model-based methods, two strategies have been employed, namely deep unrolling and deep plug-and-play (PnP), which usually have theoretical guarantees. Deep unrolling maps popular iterative algorithms for model-based methods onto deep networks [8, 9, 28, 32, 41, 44]. It can be regarded as a neural network realization of an iterative algorithm with a fixed number of iterations, such as ISTA-Net [41] and ADMM-Net [32]. With trainable parameters, these deep unrolling methods commonly give better performance than the original algorithms while using fewer iterations. Deep unrolling models are trained end-to-end and usually enjoy the speed of classical deep networks and the interpretability of model-based methods.
Deep PnP is usually applied in data reconstruction-related tasks like tensor completion [45] and tensor compressed sensing [40]. It regards a specific subproblem that usually appears in model-based methods as a denoising problem and solves it using pre-trained deep networks. Unlike deep unrolling, the number of iterations of a deep PnP model is not fixed. Deep PnP can use deep neural networks to capture semantic information that expresses tensor details while retaining the interpretability of model-based methods.
In the remainder of this chapter, we introduce these three different frameworks for tensor approximation in detail, namely classical deep neural networks, deep unrolling, and deep PnP. Some applications of these frameworks are given, including tensor rank approximation [47], video snapshot compressive imaging (SCI) [9, 44], and low-rank tensor completion [45].
$$\hat{\mathcal{X}} = \mathcal{N}_{\text{cla}}(\mathcal{X}^0, \Theta),$$
where $\mathcal{N}_{\text{cla}}(\cdot,\Theta)$ is the generalized classical deep learning model, $\Theta$ contains the trainable parameters of the model, $\mathcal{X}^0$ is the tensor for approximation, and $\hat{\mathcal{X}}$ is the approximated data.
[Fig. 11.1: a classical deep learning model maps the input tensor directly to the approximated tensor.]
In Fig. 11.1, the green boxes are layers of the deep neural network, which can be linear layers or linear layers combined with nonlinear activation operators. A linear layer can be a fully connected, convolutional, or recurrent layer, and the activation operator can be, e.g., the sigmoid function or the rectified linear unit (ReLU). Furthermore, some strategies can be combined with the model in Fig. 11.1 to improve its performance, such as the residual connection in ResNet [10].
There are some classical deep learning models developed for tensor approximation problems, such as rank approximation [47] and video compressed sensing [13]. However, due to the black box characteristic of these classical networks, there is no good interpretation or theoretical guarantee, and such completely data-driven end-to-end approaches may risk some undesired effects [12]. Interested readers can refer to [7] for more details about classical deep neural networks.
In order to describe deep unrolling more clearly, we first briefly describe the generalized framework of model-based methods. Model-based methods solve an optimization problem that combines a data fidelity term with a regularization term $R(\cdot)$, and the minimizer $\hat{\mathcal{X}}$ is taken as the approximation result. They usually obtain $\hat{\mathcal{X}}$ by nonlinear iterative algorithms, such as the iterative shrinkage thresholding algorithm (ISTA) [8, 9, 41], approximate message passing (AMP) [1, 30, 44], and the alternating direction method of multipliers (ADMM) [28, 32]. If the initialization is $\mathcal{X}^0$, for easy understanding, we formulate the $k$-th iteration of the generalized approximation algorithm as
$$\mathcal{X}^k = \mathcal{N}_{\text{mod}}(\mathcal{X}^{k-1}), \qquad (11.3)$$
where $\mathcal{N}_{\text{mod}}(\cdot)$ contains all the calculations of one iteration of the algorithm. For example, in ADMM, $\mathcal{N}_{\text{mod}}(\cdot)$ may contain more than two steps to obtain $\mathcal{X}^k$. Model-based methods repeat (11.3) until convergence. They have interpretation guarantees but usually cost a lot of time.
Inspired by iterative algorithms that process data step by step, deep unrolling maps iterative approximation algorithms onto step-fixed deep neural networks. Expressing the unrolled model as $\mathcal{N}_{\text{unr}}(\cdot,\Theta)$, the approximation process of deep unrolling models can be expressed as
$$\hat{\mathcal{X}} = \mathcal{N}_{\text{unr}}(\mathcal{X}^0, \Theta) = \mathcal{N}^{K}_{\text{unr}}(\mathcal{X}^{K-1}, \Theta^K), \qquad (11.4)$$
where $\mathcal{X}^k$ is the updated tensor after the first $k$ modules, $\mathcal{N}^k_{\text{unr}}(\cdot,\Theta^k)$ is the $k$-th module/phase, $\Theta^k$ contains the trainable parameters of the $k$-th module, and $\Theta$ is the collection of all $\Theta^k$, $k = 1,\cdots,K$.
Figure 11.2 illustrates the difference between generalized model-based methods
and generalized deep unrolling models. As shown in Fig. 11.2, the unrolled model
usually consists of several phases with the same structure, where each phase
corresponds to one iteration in the original iterative restoration algorithm.
For many inverse problems, such as tensor completion [14, 21, 42] and sparse coding [48], model-based methods usually introduce handcrafted regularization terms, such as $\|\cdot\|_1$, $\|\cdot\|_*$, and $\|\cdot\|_0$. To solve these problems, many nonlinear operators have been developed; for example, the soft thresholding function is designed for $\|\cdot\|_1$ and $\|\cdot\|_*$, and the hard thresholding function is designed for $\|\cdot\|_0$. Some studies build the deep unrolling model by training the related parameters of these nonlinear operators. Taking the soft thresholding function as an example, it is designed for sparse data without value limitations [6] and can be expressed as
$$\text{soft}(x,\theta) = \operatorname{sign}(x)\max(|x|-\theta, 0),$$
where $\theta$ is the soft thresholding value. The generalized deep unrolling model with trainable soft thresholding functions is illustrated in Fig. 11.3, where each phase takes the form $\mathcal{X}^k = \text{soft}\big(\mathcal{N}^k_{\text{unr1}}(\mathcal{X}^{k-1},\Theta^k_1), \theta^k\big)$.
Fig. 11.3 Generalized framework of deep unrolling models with handcrafted thresholding functions for data priors
Fig. 11.4 Generalized framework of deep unrolling models with learnable operators for data priors
Obviously, $\mathcal{N}^k_{\text{unr1}}(\cdot,\Theta^k_1)$ is a part of $\mathcal{N}^k_{\text{unr}}(\cdot,\Theta^k)$; it contains the desired operators of the related iterative algorithm for a specific task, and we show some examples in Sect. 11.5.
In order to obtain better performance with a smaller number of phases, the parameter $\theta^k$ is usually trained jointly with the other parameters of the deep unrolling model. For example, in [8], $\theta^k$ is trained together with weights related to sparse dictionaries to obtain efficient sparse coding with only 10 to 20 phases, and in [9], $\theta^k$ is jointly trained with CNNs designed for sparse transformation to achieve video snapshot compressive imaging. Besides the soft thresholding function, some deep unrolling models consider other thresholding functions; for example, the bounded linear unit [35] is designed for the $\ell_\infty$ norm and the hard thresholding linear unit [34] is designed for the $\ell_0$ norm.
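For a concrete picture of this strategy, the following is a minimal NumPy sketch of one unrolled architecture, a LISTA-style network in the spirit of [8]; the parameter names and the weight sharing across phases are illustrative assumptions.

```python
import numpy as np

def soft(x, theta):
    # Soft thresholding: the proximal operator of the l1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def lista_forward(y, W, S, thetas):
    # Hypothetical LISTA-style unrolled sparse coding: each phase applies a
    # learned linear step followed by a trainable soft threshold theta^k.
    # W and S are learned matrices shared across phases; thetas[k] is the
    # threshold of the k-th phase.
    x = soft(W @ y, thetas[0])            # initial phase
    for theta in thetas[1:]:
        x = soft(W @ y + S @ x, theta)    # k-th phase: linear step + threshold
    return x
```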
However, handcrafted regularizers may ignore some semantic information useful for data representation. To this end, some studies [4, 5, 44] try to employ deep neural networks to learn data priors directly. Figure 11.4 illustrates the generalized framework of deep unrolling models with trainable data priors. In this framework, each phase of the model can be expressed as
$$\mathcal{X}^k = \mathcal{N}^k_{\text{unr2}}\big(\mathcal{N}^k_{\text{unr1}}(\mathcal{X}^{k-1},\Theta^k_1), \Theta^k_2\big),$$
where $\Theta^k = \{\Theta^k_1, \Theta^k_2\}$ and $\mathcal{N}^k_{\text{unr2}}(\cdot,\Theta^k_2)$ is a deep neural network which is used to represent detailed data priors, with $\Theta^k_2$ containing its parameters.
This unrolling strategy can be applied to many model-based algorithms. For example, in [4], $\mathcal{N}^k_{\text{unr2}}(\cdot,\Theta^k_2)$ is used to fit the gradient for an unrolled gradient descent algorithm, and in [44], $\mathcal{N}^k_{\text{unr2}}(\cdot,\Theta^k_2)$ is developed to estimate the difference between the original data and the corrupted data in each iteration. Technically, the framework in Fig. 11.4 is more flexible than the framework in Fig. 11.3, since $\mathcal{N}^k_{\text{unr2}}(\cdot,\Theta^k_2)$ can be realized by various network architectures.
where $R(\mathcal{X})$ is the regularizer for a specific task. The solution of this subproblem is usually given by the corresponding proximal operator, e.g., the soft thresholding operator for the $\ell_1$ norm.
It is obvious that problem (11.9) can be regarded as a regularized image denoising problem, where $\sigma$ determines the noise level [45]. Driven by this equivalence, some studies use plug-and-play (PnP), a strategy that employs advanced denoisers as a surrogate of the proximal operator in ADMM or other proximal algorithms. It is a highly flexible framework that exploits the power of advanced image priors [2, 40, 45].
In the PnP framework, at each iteration, problem (11.9) is solved by a denoiser $\mathcal{N}_{\text{den}}(\cdot)$. The denoisers could be different for different regularizers, and some state-of-the-art denoisers can be used, such as BM3D [11] and NLM [11]. Moreover, to exploit deep information of the data, deep neural networks can be employed as the denoiser [29], which is known as deep PnP. Figure 11.5 shows the basic framework of deep PnP models, and the $k$-th iteration can be expressed as
$$\mathcal{X}^k = \mathcal{N}_{\text{PnP}}\big(\mathcal{X}^{k-1}\big) = \mathcal{N}^k_{\text{den}}\big(\mathcal{N}_{\text{mod}_1}(\mathcal{X}^{k-1}), \Theta^k\big), \qquad (11.11)$$
where $\mathcal{N}_{\text{mod}_1}(\cdot)$ is the part of $\mathcal{N}_{\text{mod}}(\cdot)$ other than the subproblem (11.9), and $\mathcal{N}^k_{\text{den}}(\cdot,\Theta^k)$ is the denoiser in the $k$-th iteration, which is determined by $\sigma$. The difference between $\mathcal{N}_{\text{mod}_1}(\cdot)$ and $\mathcal{N}^k_{\text{unr1}}(\cdot,\Theta^k_1)$ is that $\mathcal{N}^k_{\text{unr1}}(\cdot,\Theta^k_1)$ usually contains
trainable parameters that are trained together with the other parameters, whereas there are usually no trainable parameters in $\mathcal{N}_{\text{mod}_1}(\cdot)$.
Unlike deep unrolling models, deep PnP models do not need to be trained end-to-end: the employed deep network-based denoisers are usually pre-trained. Moreover, similar to model-based methods, deep PnP methods do not have a fixed number of iterations as deep unrolling models do.
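The generic deep PnP loop can be sketched as follows; the callables, tolerance, and iteration cap are illustrative placeholders rather than a specific published algorithm.

```python
import numpy as np

def deep_pnp(X0, model_step, denoiser, sigma, num_iters=50, tol=1e-4):
    # Hypothetical deep PnP sketch: alternate a model-based update (everything
    # other than the denoising subproblem) with a pre-trained denoiser that
    # replaces the proximal operator. model_step and denoiser are user-supplied
    # callables.
    X = X0
    for _ in range(num_iters):
        Z = model_step(X)            # data-consistency / multiplier updates
        X_new = denoiser(Z, sigma)   # plug-and-play proximal step
        if np.linalg.norm(X_new - X) / max(np.linalg.norm(X), 1e-12) < tol:
            X = X_new
            break                    # the iteration count is not fixed in advance
        X = X_new
    return X
```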
11.5 Applications
Given a noisy measurement $\mathcal{Y}$ of tensor $\mathcal{X}$, the tensor rank network (TRN) aims to estimate the rank of $\mathcal{X}$ using CNNs by minimizing the loss function
$$L = \frac{1}{N_t}\sum_{n=1}^{N_t}\big(R_n - \hat{R}_n\big)^2, \qquad (11.12)$$
where $N_t$ is the number of training samples, $R_n$ is the optimal rank value of the $n$-th sample $\mathcal{X}_n$, and $\hat{R}_n$ represents the estimated rank.
The mathematical expression of the TRN is given in (11.13). To assist the rank estimation, a pre-decomposition with an overestimated rank bound $\bar{R}$ is first performed by solving
$$\min_{\bar{\mathbf{w}},\{\bar{\mathbf{a}}^{(1)}_r\},\{\bar{\mathbf{a}}^{(2)}_r\},\{\bar{\mathbf{a}}^{(3)}_r\}}\ \Big\|\mathcal{Y} - \sum_{r=1}^{\bar{R}}\bar{w}_r\,\bar{\mathbf{a}}^{(1)}_r\circ\bar{\mathbf{a}}^{(2)}_r\circ\bar{\mathbf{a}}^{(3)}_r\Big\|_F^2. \qquad (11.15)$$
This optimization problem can be solved by alternating least squares (ALS) [16]. The obtained matrices $\bar{\mathbf{A}}^{(1)}$, $\bar{\mathbf{A}}^{(2)}$, $\bar{\mathbf{A}}^{(3)}$ can be stacked into a tensor $\bar{\mathcal{Y}}$ along the third mode.
As Fig. 11.7 illustrates, we can input $\bar{\mathcal{Y}}$ to the network (11.13) to get the predicted rank $\hat{R}$. In fact, the network (11.13) is used to solve the following optimization model:
$$\begin{aligned}
\min_{\hat{R},\hat{\mathbf{w}},\{\mathbf{a}^{(n)}_r\}}\ & \hat{R} = \|\hat{\mathbf{w}}\|_0 \\
\text{s.t.}\ & \sum_{r=1}^{\|\hat{\mathbf{w}}\|_0}\hat{w}_r\,\mathbf{a}^{(1)}_r\circ\mathbf{a}^{(2)}_r\circ\mathbf{a}^{(3)}_r = \sum_{r=1}^{\bar{R}}\bar{w}_r\,\bar{\mathbf{a}}^{(1)}_r\circ\bar{\mathbf{a}}^{(2)}_r\circ\bar{\mathbf{a}}^{(3)}_r.
\end{aligned} \qquad (11.16)$$
This CP-rank estimation method is called tensor rank network with pre-
decomposition (TRN-PD), which is summarized in Algorithm 49.
Fig. 11.8 The performance comparison between TRN and TRN-PD on the 50 × 50 × 50 synthetic
dataset at different noise levels
In this experiment, all the methods are applied to synthetic datasets. We generate tensors with different ranks: the rank $R$ of the synthetic $50\times 50\times 50$ tensors ranges from 50 to 100, and the rank $R$ of the synthetic $20\times 20\times 20$ tensors ranges from 20 to 70. Each tensor is generated with added random Gaussian noise. In this way, we get four groups of datasets at different noise levels; more specifically, the signal-to-noise ratio (SNR) is set to 10, 15, 20, and 25 dB, respectively. The numbers of training samples are 2000 and 3000 for the $50\times 50\times 50$ and $20\times 20\times 20$ synthetic datasets, respectively. We use the mean square error (MSE) to evaluate the estimation accuracy, defined as $\text{MSE} = \frac{1}{M}\sum_{m=1}^{M}(R_m - \hat{R}_m)^2$, where $R_m$ is the real rank of the testing data and $\hat{R}_m$ is the predicted rank.
For the dataset of size $50\times 50\times 50$, we set the predefined rank bound of the pre-decomposition $R$ from 100 to 140. Figure 11.8 illustrates the learning process on the $50\times 50\times 50$ dataset over the first 50 iterations, where TRN-PD($R$) denotes the TRN-PD method with predefined rank bound $R$. During training, we perform training and validation alternately; the training loss and validation loss curves present the accuracy on the training and validation sets, respectively. From the results, TRN-PD works well on this dataset, but TRN can hardly converge.
Fig. 11.9 Illustration of MSE performance comparison between TRN and TRN-PD on the 20 ×
20 × 20 synthetic dataset at different noise levels
[Fig. 11.10: sampling and reconstruction of video SCI.]
Figure 11.10 shows the sampling and reconstruction of video SCI. The reconstruction model in Fig. 11.10 can be either a model-based method or a deep unrolling method. In the following, we introduce a deep unrolling model named Tensor FISTA-Net [9], which applies an existing sparse regularizer.
Tensor FISTA-Net applies the sparse regularizer $\|\mathcal{D}(\mathcal{X})\|_1$ as $R(\mathcal{X})$, where $\mathcal{D}(\mathcal{X})$ is an unknown transformation that maps $\mathcal{X}$ into a sparse domain. Tensor FISTA-Net is built by unrolling the classical fast iterative shrinkage thresholding algorithm (FISTA) into $K$ phases, and each phase consists of three steps, where $\mathcal{X}^k$ is the output of the $k$-th phase, $t^k$ and $\rho^k$ are trainable parameters, and $\mathcal{A}^k$ is formulated as
$$\mathcal{A}^k(:,:,i) = \sum_{b=1}^{B}\mathcal{C}(:,:,b)\odot\mathcal{Z}^k(:,:,b) - \mathbf{Y}, \quad i = 1,\cdots,B. \qquad (11.24)$$
The training loss of Tensor FISTA-Net combines three terms, where $L_{\text{fid}}$ measures the fidelity of the reconstructed frames, $L_{\text{inv}}$ measures the accuracy of the inverse-transformation functions, and $L_{\text{spa}}$ measures the sparsity in the transformed domains. They are computed by
$$L_{\text{fid}} = \frac{1}{N_t}\sum_{n=1}^{N_t}\big\|\mathcal{X}^K_n - \mathcal{X}_n\big\|_F^2, \qquad (11.28)$$
$$L_{\text{inv}} = \frac{1}{N_tK}\sum_{n=1}^{N_t}\sum_{k=1}^{K}\big\|\tilde{\mathcal{N}}_{\text{tra}}\big(\mathcal{N}_{\text{tra}}(\mathcal{X}^k_n)\big) - \mathcal{X}^k_n\big\|_F^2, \qquad (11.29)$$
$$L_{\text{spa}} = \frac{1}{N_tK}\sum_{n=1}^{N_t}\sum_{k=1}^{K}\big\|\mathcal{N}_{\text{tra}}(\mathcal{X}^k_n)\big\|_1, \qquad (11.30)$$
Table 11.1 Performance comparison of GAP-TV, DeSCI, and Tensor FISTA-Net on video SCI (PSNR (dB)/SSIM)

Method                  Kobe            Aerial          Vehicle
GAP-TV [39]             27.67/0.8843    25.75/0.8582    24.92/0.8185
DeSCI [24]              35.32/0.9674    24.80/0.8432    28.47/0.9247
Tensor FISTA-Net [9]    29.46/0.8983    28.61/0.9031    27.77/0.9211
• DeSCI [24]: reconstructing video based on the idea of weighted nuclear norm.
In this subsection, we set $M = N = 256$ and $B = 8$, and we use three training video datasets, namely NBA, Central Park Aerial, and Vehicle Crashing Tests from [9]. All video frames are resized to $256\times 256$ and their luminance components are extracted. Each training set contains 8000 frames, i.e., 1000 measurements. Tensor FISTA-Net is trained for 500 epochs with batch size 2; the Adam algorithm [15] is adopted for updating the parameters, with a learning rate of 0.0001. Three datasets, namely the Kobe, Aerial, and Vehicle Crash datasets, are used for testing, and each test set contains four video groups of size $256\times 256\times 8$.
Table 11.1 shows the average PSNR and SSIM of GAP-TV, DeSCI, and Tensor FISTA-Net on the three test sets, where the best results are marked in bold. It can be noticed that Tensor FISTA-Net achieves better reconstruction results on the Aerial test set. Table 11.2 shows the reconstruction time of the three methods on the three test sets, measured on an Intel Core i5-8300H CPU. Although DeSCI performs better on two test sets, namely Kobe and Vehicle Crashing, it costs a lot of time, while Tensor FISTA-Net achieves comparable performance on these two datasets while reconstructing the videos quickly. Furthermore, we emphasize that Tensor FISTA-Net can be accelerated using GPU devices.
The tensor completion problem can be formulated as
$$\min_{\mathcal{X}}\ R(\mathcal{X}) \quad \text{s.t.}\ P_{O}(\mathcal{X}) = P_{O}(\mathcal{T}), \qquad (11.31)$$
where $R(\cdot)$ is the regularizer for completion, $\mathcal{X}$ is the underlying tensor, $\mathcal{T}$ is the observed incomplete tensor, and $P_{O}(\mathcal{X})$ projects $\mathcal{X}$ onto the index set $O$ of the observations. Low-rank regularizers [25, 26] are usually applied for tensor completion. However, such regularizers capture the global structures of tensors but ignore the details [45]. In this subsection, we introduce a deep PnP model dubbed deep PnP prior for low-rank tensor completion (DP3LRTC) [45], which can also explore local features using deep networks.
DP3LRTC adopts the composite model
$$\min_{\mathcal{X}}\ \|\mathcal{X}\|_{\text{TNN}} + \lambda R_{\text{unk}}(\mathcal{X}) \quad \text{s.t.}\ P_{O}(\mathcal{X}) = P_{O}(\mathcal{T}), \qquad (11.32)$$
where $\|\cdot\|_{\text{TNN}}$ is the tensor tubal nuclear norm and $R_{\text{unk}}(\cdot)$ denotes an unknown regularizer. DP3LRTC uses deep networks to represent the information of $R_{\text{unk}}(\cdot)$ and thereby explores the local features of tensors.
The optimization problem (11.32) can be solved using ADMM. Introducing two
auxiliary variables Y and Z, the k-th iteration of the ADMM algorithm can be
expressed as follows:
$$\mathcal{Y}^k = \operatorname*{argmin}_{\mathcal{Y}}\ \|\mathcal{Y}\|_{\text{TNN}} + \frac{\beta}{2}\big\|\mathcal{X}^{k-1} - \mathcal{Y} + \mathcal{T}^{k-1}_1/\beta\big\|_F^2, \qquad (11.33)$$
$$\mathcal{Z}^k = \operatorname*{argmin}_{\mathcal{Z}}\ \lambda R(\mathcal{Z}) + \frac{\beta}{2}\big\|\mathcal{X}^{k-1} - \mathcal{Z} + \mathcal{T}^{k-1}_2/\beta\big\|_F^2, \qquad (11.34)$$
$$\mathcal{X}^k = \operatorname*{argmin}_{\mathcal{X}}\ 1_S(\mathcal{X}) + \frac{\beta}{2}\big\|\mathcal{X} - \mathcal{Y}^k + \mathcal{T}^{k-1}_1/\beta\big\|_F^2 + \frac{\beta}{2}\big\|\mathcal{X} - \mathcal{Z}^k + \mathcal{T}^{k-1}_2/\beta\big\|_F^2, \qquad (11.35)$$
$$\mathcal{T}^k_1 = \mathcal{T}^{k-1}_1 + \beta\big(\mathcal{X}^k - \mathcal{Y}^k\big), \qquad (11.36)$$
$$\mathcal{T}^k_2 = \mathcal{T}^{k-1}_2 + \beta\big(\mathcal{X}^k - \mathcal{Z}^k\big), \qquad (11.37)$$
where $\lambda$ and $\beta$ are hyperparameters and $\mathcal{T}^k_1$ and $\mathcal{T}^k_2$ are intermediate (multiplier) variables. $\mathcal{X}^k$ is the output of the $k$-th iteration, and $1_S(\mathcal{X})$ is the indicator function defined as
$$1_S(\mathcal{X}) = \begin{cases} 0, & \text{if } \mathcal{X}\in S,\\ \infty, & \text{otherwise},\end{cases} \qquad (11.38)$$
DP3LRTC regards (11.34) as a denoising problem like (11.9) and solves it using a classical deep network-based denoiser dubbed FFDNet [43] as follows:
$$\mathcal{Z}^k = \mathcal{N}_{\text{FFD}}\big(\mathcal{X}^{k-1} + \mathcal{T}^{k-1}_2/\beta,\ \sigma\big), \qquad (11.39)$$
where $\mathcal{N}_{\text{FFD}}(\cdot)$ denotes FFDNet and $\sigma = \sqrt{\lambda/\beta}$ is related to the error level between the estimation and the ground truth. We emphasize that $\mathcal{N}_{\text{FFD}}(\cdot)$ is pre-trained. Algorithm 50 summarizes the process of solving (11.32); in this case, (11.33) and (11.35)–(11.37) constitute $\mathcal{N}_{\text{mod}_1}(\cdot)$ in Fig. 11.5.
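One DP3LRTC-style iteration can be sketched compactly as follows; the proximal operator of the tubal nuclear norm and the denoiser are passed in as user-supplied callables, so the code is a hedged illustration rather than the reference implementation of [45].

```python
import numpy as np

def dp3lrtc_iteration(X, T1, T2, T_obs, mask, tnn_prox, denoiser, beta, sigma):
    # Hypothetical sketch of one ADMM iteration (11.33)-(11.37): tnn_prox is a
    # user-supplied proximal operator of the tubal nuclear norm (t-SVT) and
    # denoiser is the pre-trained plug-and-play denoiser (e.g. FFDNet);
    # mask is a boolean array of observed entries and T_obs holds their values.
    Y = tnn_prox(X + T1 / beta, 1.0 / beta)          # (11.33): TNN proximal step
    Z = denoiser(X + T2 / beta, sigma)               # (11.34)/(11.39): PnP denoising step
    X_new = ((Y - T1 / beta) + (Z - T2 / beta)) / 2  # (11.35): average of the two estimates
    X_new[mask] = T_obs[mask]                        # enforce the observed entries (1_S)
    T1 = T1 + beta * (X_new - Y)                     # (11.36): multiplier update
    T2 = T2 + beta * (X_new - Z)                     # (11.37): multiplier update
    return X_new, T1, T2
```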
Fig. 11.11 PSNR of completed Lena image by DP3LRTC, SNN, TNN, and TNN-3DTV at
different sampling ratios of 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5
Fig. 11.12 Recovered Lena image by DP3LRTC, SNN, TNN, and TNN-3DTV at the sampling
ratio of 0.1
DP3LRTC can not only exploit the global information but also capture the details of images, which brings better completion performance. Furthermore, DP3LRTC does not consume much time to complete a tensor.
11.6 Summary
References
1. Borgerding, M., Schniter, P., Rangan, S.: AMP-Inspired deep networks for sparse linear inverse
problems. IEEE Trans. Signal Process. 65(16), 4293–4308 (2017)
2. Chan, S.H., Wang, X., Elgendy, O.A.: Plug-and-play ADMM for image restoration: fixed-point
convergence and applications. IEEE Trans. Comput. Imag. 3(1), 84–98 (2016)
3. Che, M., Cichocki, A., Wei, Y.: Neural networks for computing best rank-one approximations
of tensors and its applications. Neurocomputing 267, 114–133 (2017)
4. Diamond, S., Sitzmann, V., Heide, F., Wetzstein, G.: Unrolled optimization with deep priors
(2017, e-prints). arXiv–1705
5. Dong, W., Wang, P., Yin, W., Shi, G., Wu, F., Lu, X.: Denoising prior driven deep neural
network for image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 41(10), 2305–2318
(2018)
6. Donoho, D.L., Maleki, A., Montanari, A.: Message-passing algorithms for compressed
sensing. Proc. Natl. Acad. Sci. 106(45), 18914–18919 (2009)
7. Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep Learning, vol. 1. MIT Press,
Cambridge (2016)
8. Gregor, K., LeCun, Y.: Learning fast approximations of sparse coding. In: Proceedings of the
27th International Conference on International Conference on Machine Learning, pp. 399–406
(2010)
9. Han, X., Wu, B., Shou, Z., Liu, X.Y., Zhang, Y., Kong, L.: Tensor FISTA-Net for real-
time snapshot compressive imaging. In: Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 34, pp. 10933–10940 (2020)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–
778 (2016)
11. Heide, F., Steinberger, M., Tsai, Y.T., Rouf, M., Pajak, D., Reddy, D., Gallo, O., Liu, J.,
Heidrich, W., Egiazarian, K., et al.: Flexisp: a flexible camera image processing framework.
ACM Trans. Graph. 33(6), 1–13 (2014)
References 283
12. Huang, Y., Würfl, T., Breininger, K., Liu, L., Lauritsch, G., Maier, A.: Some investigations
on robustness of deep learning in limited angle tomography. In: International Conference on
Medical Image Computing and Computer-Assisted Intervention, pp. 145–153. Springer, Berlin
(2018)
13. Iliadis, M., Spinoulas, L., Katsaggelos, A.K.: Deep fully-connected networks for video
compressive sensing. Digital Signal Process. 72, 9–18 (2018)
14. Jiang, F., Liu, X.Y., Lu, H., Shen, R.: Anisotropic total variation regularized low-rank
tensor completion based on tensor nuclear norm for color image inpainting. In: 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1363–
1367. IEEE, Piscataway (2018)
15. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International
Conference on Learning Representations, ICLR 2015, San Diego, May 7–9, 2015, Conference
Track Proceedings (2015)
16. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500
(2009)
17. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105
(2012)
18. Lawrence, S., Giles, C.L., Tsoi, A.C., Back, A.D.: Face recognition: a convolutional neural-
network approach. IEEE Trans. Neural Netw. 8(1), 98–113 (1997)
19. LeCun, Y.: Convolutional networks for images, speech, and time series. In: The Handbook of
Brain Theory and Neural Networks, pp. 255–258 (1995)
20. LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., Jackel,
L.D.: Handwritten digit recognition with a back-propagation network. In: Advances in Neural
Information Processing Systems, pp. 396–404 (1990)
21. Liu, J., Musialski, P., Wonka, P., Ye, J.: Tensor completion for estimating missing values in
visual data. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 208–220 (2012)
22. Liu, Y., De Vos, M., Van Huffel, S.: Compressed sensing of multichannel EEG signals: the
simultaneous cosparsity and low-rank optimization. IEEE Trans. Biomed. Eng. 62(8), 2055–
2061 (2015)
23. Liu, B., Xu, Z., Li, Y.: Tensor decomposition via variational auto-encoder (2016, e-prints).
arXiv–1611
24. Liu, Y., Yuan, X., Suo, J., Brady, D.J., Dai, Q.: Rank minimization for snapshot compressive
imaging. IEEE Trans. Pattern Anal. Mach. Intell. 41(12), 2990–3006 (2018)
25. Liu, Y., Long, Z., Huang, H., Zhu, C.: Low CP rank and tucker rank tensor completion for
estimating missing components in image data. IEEE Trans. Circuits Syst. Video Technol. 30(4),
944–954 (2019)
26. Long, Z., Liu, Y., Chen, L., Zhu, C.: Low rank tensor completion for multiway visual data.
Signal Process. 155, 301–316 (2019)
27. Lu, C., Feng, J., Chen, Y., Liu, W., Lin, Z., Yan, S.: Tensor robust principal component analysis:
exact recovery of corrupted low-rank tensors via convex optimization. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 5249–5257 (2016)
28. Ma, J., Liu, X.Y., Shou, Z., Yuan, X.: Deep tensor admm-net for snapshot compressive imaging.
In: Proceedings of the IEEE International Conference on Computer Vision, pp. 10223–10232
(2019)
29. Meinhardt, T., Moller, M., Hazirbas, C., Cremers, D.: Learning proximal operators: using
denoising networks for regularizing inverse imaging problems. In: Proceedings of the IEEE
International Conference on Computer Vision, pp. 1781–1790 (2017)
30. Metzler, C., Mousavi, A., Baraniuk, R.: Learned D-AMP: principled neural network based
compressive image recovery. In: Advances in Neural Information Processing Systems, pp.
1772–1783 (2017)
31. Mousavi, A., Patel, A.B., Baraniuk, R.G.: A deep learning approach to structured signal recov-
ery. In: The 53rd Annual Allerton Conference on Communication, Control, and Computing,
pp. 1336–1343. IEEE, Piscataway (2015)
284 11 Deep Networks for Tensor Approximation
32. Sun, J., Li, H., Xu, Z., et al.: Deep ADMM-Net for compressive sensing MRI. In: Advances in
Neural Information Processing Systems, pp. 10–18 (2016)
33. Tan, X., Zhang, Y., Tang, S., Shao, J., Wu, F., Zhuang, Y.: Logistic tensor regression
for classification. In: International Conference on Intelligent Science and Intelligent Data
Engineering, pp. 573–581. Springer, Berlin (2012)
34. Wang, Z., Ling, Q., Huang, T.: Learning deep l0 encoders. In: AAAI Conference on Artificial
Intelligence, pp. 2194–2200 (2016)
35. Wang, Z., Yang, Y., Chang, S., Ling, Q., Huang, T.S.: Learning a deep ∞ encoder for hashing.
In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence,
pp. 2174–2180. AAAI Press, Palo Alto (2016)
36. Yang, G., Yu, S., Dong, H., Slabaugh, G., Dragotti, P.L., Ye, X., Liu, F., Arridge, S., Keegan, J.,
Guo, Y., et al.: DAGAN: deep de-aliasing generative adversarial networks for fast compressed
sensing MRI reconstruction. IEEE Trans. Med. Imag. 37(6), 1310–1321 (2017)
37. Yin, M., Gao, J., Xie, S., Guo, Y.: Multiview subspace clustering via tensorial t-product
representation. IEEE Trans. Neural Netw. Learn. Syst. 30(3), 851–864 (2018)
38. Young, T., Hazarika, D., Poria, S., Cambria, E.: Recent trends in deep learning based natural
language processing. IEEE Comput. Intell. Mag. 13(3), 55–75 (2018)
39. Yuan, X.: Generalized alternating projection based total variation minimization for com-
pressive sensing. In: 2016 IEEE International Conference on Image Processing (ICIP), pp.
2539–2543. IEEE, Piscataway (2016)
40. Yuan, X., Liu, Y., Suo, J., Dai, Q.: Plug-and-play algorithms for large-scale snapshot
compressive imaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 1447–1457 (2020)
41. Zhang, J., Ghanem, B.: ISTA-Net: interpretable optimization-inspired deep network for image
compressive sensing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1828–1837 (2018)
42. Zhang, Z., Ely, G., Aeron, S., Hao, N., Kilmer, M.: Novel methods for multilinear data
completion and de-noising based on tensor-SVD. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 3842–3849 (2014)
43. Zhang, K., Zuo, W., Zhang, L.: FFDNet: toward a fast and flexible solution for CNN-based
image denoising. IEEE Trans. Image Process. 27(9), 4608–4622 (2018)
44. Zhang, Z., Liu, Y., Liu, J., Wen, F., Zhu, C.: AMP-Net: denoising based deep unfolding for
compressive image sensing. IEEE Trans. Image Process. 30, 1487–1500 (2020)
45. Zhao, X.L., Xu, W.H., Jiang, T.X., Wang, Y., Ng, M.K.: Deep plug-and-play prior for low-rank
tensor completion. Neurocomputing 400, 137–149 (2020)
46. Zhou, H., Li, L., Zhu, H.: Tensor regression with applications in neuroimaging data analysis.
J. Am. Stat. Assoc. 108(502), 540–552 (2013)
47. Zhou, M., Liu, Y., Long, Z., Chen, L., Zhu, C.: Tensor rank learning in CP decomposition via
convolutional neural network. Signal Process. Image Commun. 73, 12–21 (2019)
48. Zubair, S., Wang, W.: Tensor dictionary learning with sparse tucker decomposition. In: 2013
18th International Conference on Digital Signal Processing (DSP), pp. 1–6. IEEE, Piscataway
(2013)
Chapter 12
Tensor-Based Gaussian Graphical Model
12.1 Background
Fig. 12.1 One-to-one correspondence between the precision matrix and graph: the left picture
is a precision matrix in which black block (off-diagonal) indicates nonzero entries and each
nonzero entry corresponds to an edge in graph; the right figure is the graph drawn according to
the relationship shown on the left. (a) Precision matrix. (b) Graph
The probability density function of a $P$-dimensional Gaussian random vector $\mathbf{x}$ is
$$\text{Prob}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{P/2}\det(\boldsymbol{\Sigma})^{1/2}}\exp\Big(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big), \qquad (12.1)$$
where $\boldsymbol{\mu}$ is the mean vector, $\boldsymbol{\Sigma}$ is the covariance matrix, and $\det\boldsymbol{\Sigma}$ is the determinant of $\boldsymbol{\Sigma}$. The vector-based GGM exploits the relationships between variables by estimating the precision matrix $\boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}$ using maximum likelihood estimation and characterizing the conditional independence of the underlying variables. By taking the logarithm of the probability density function (12.1) and maximizing it, we can formulate the corresponding optimization model as
$$\max_{\boldsymbol{\Omega}}\ \log\det\boldsymbol{\Omega} - \operatorname{tr}(\mathbf{S}\boldsymbol{\Omega}), \qquad (12.2)$$
where $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu})(\mathbf{x}_n-\boldsymbol{\mu})^{\mathrm{T}}$ denotes the sample covariance matrix. To promote sparsity in the precision matrix, an $\ell_1$ penalty is further imposed, leading to
$$\min_{\boldsymbol{\Omega}\succ 0}\ -\log\det\boldsymbol{\Omega} + \operatorname{tr}(\mathbf{S}\boldsymbol{\Omega}) + \lambda\|\boldsymbol{\Omega}\|_1, \qquad (12.3)$$
where $\lambda$ is the tradeoff parameter between the objective function and the sparsity constraint.
In order to solve the optimization problem (12.3), different methods have been employed, such as the interior point method [19], the block coordinate descent (BCD) algorithm [1], the Glasso method [6], neighborhood selection [11], and a linear programming procedure [18]. Among all these methods, Glasso is the most popular one; we concentrate on it and give its detailed solution in what follows.
Using the fact that the derivative of $\log\det\boldsymbol{\Omega}$ equals $\boldsymbol{\Omega}^{-1}$, the gradient of the Lagrangian function of (12.3) with respect to $\boldsymbol{\Omega}$ vanishes when the following equation holds:
$$\boldsymbol{\Omega}^{-1} - \mathbf{S} - \lambda\operatorname{Sign}(\boldsymbol{\Omega}) = \mathbf{0}. \qquad (12.4)$$
The solution for $\boldsymbol{\Omega}$ and its inverse $\boldsymbol{\Sigma} = \boldsymbol{\Omega}^{-1}$ can be acquired by sequentially solving several regression problems. To be specific, we divide $\boldsymbol{\Omega}\in\mathbb{R}^{P\times P}$ into two parts: the first part is composed of its first $P-1$ rows and columns, denoted $\boldsymbol{\Omega}_{11}$, and the second part is composed of the remaining elements. The same partition is applied to $\boldsymbol{\Sigma}\in\mathbb{R}^{P\times P}$. Denoting by $\mathbf{U}$ the estimation of $\boldsymbol{\Sigma}$, we have
$$\begin{bmatrix}\mathbf{U}_{11} & \mathbf{u}_{12}\\ \mathbf{u}_{12}^{\mathrm{T}} & u_{22}\end{bmatrix}\begin{bmatrix}\boldsymbol{\Omega}_{11} & \boldsymbol{\omega}_{12}\\ \boldsymbol{\omega}_{12}^{\mathrm{T}} & \omega_{22}\end{bmatrix} = \begin{bmatrix}\mathbf{I} & \mathbf{0}\\ \mathbf{0} & 1\end{bmatrix}.$$
This implies $\mathbf{u}_{12} = -\mathbf{U}_{11}\boldsymbol{\omega}_{12}/\omega_{22}$. Substituting it into the last column of (12.4) and letting $\boldsymbol{\beta} = -\boldsymbol{\omega}_{12}/\omega_{22}$, we obtain
$$\mathbf{U}_{11}\boldsymbol{\beta} - \mathbf{s}_{12} + \lambda\operatorname{Sign}(\boldsymbol{\beta}) = \mathbf{0}. \qquad (12.7)$$
To solve this problem, we can equivalently estimate a lasso regression of the form
$$\min_{\boldsymbol{\beta}}\ \frac{1}{2}\|\mathbf{y} - \mathbf{Z}\boldsymbol{\beta}\|_F^2 + \lambda\|\boldsymbol{\beta}\|_1, \qquad (12.8)$$
where $\mathbf{y}$ is the output variable and $\mathbf{Z}$ is the predictor matrix. Its gradient condition can be derived as
$$\mathbf{Z}^{\mathrm{T}}\mathbf{Z}\boldsymbol{\beta} - \mathbf{Z}^{\mathrm{T}}\mathbf{y} + \lambda\operatorname{Sign}(\boldsymbol{\beta}) = \mathbf{0}. \qquad (12.9)$$
Notice that when $\mathbf{Z}^{\mathrm{T}}\mathbf{Z} = \mathbf{U}_{11}$ and $\mathbf{Z}^{\mathrm{T}}\mathbf{y} = \mathbf{s}_{12}$, the solution of (12.7) is equal to that of (12.8). We summarize the above steps in Algorithm 51.
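In practice, the whole Glasso procedure is available off the shelf; the following is a minimal example using scikit-learn's GraphicalLasso estimator on synthetic, purely illustrative data.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Synthetic example: N samples of a P-variate Gaussian (illustrative only).
rng = np.random.default_rng(0)
P, N = 6, 500
A = rng.normal(size=(P, P))
Sigma = A @ A.T + P * np.eye(P)          # a well-conditioned covariance
X = rng.multivariate_normal(np.zeros(P), Sigma, size=N)

# l1-penalized maximum likelihood estimation of the precision matrix (12.3).
model = GraphicalLasso(alpha=0.1).fit(X)
Omega_hat = model.precision_             # sparse estimate of Sigma^{-1}
edges = np.argwhere(np.abs(np.triu(Omega_hat, k=1)) > 1e-6)
print("estimated conditional-dependence edges:", edges.tolist())
```

Nonzero off-diagonal entries of the estimated precision matrix correspond to edges of the graph, as in Fig. 12.1.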
Besides the large number of variables collected, structural information encoded
in them plays an important role in some research areas. For example, in neuro-
science, electroencephalography (EEG) is often used to analyze the relationships
between brain signal and the trigger of certain events, e.g., alcohol consumption.
In real experiments, we can obtain the EEG recordings of each subject from
multichannel electrodes during a period of time. Therefore, the observation of each
subject is a channel by time matrix. If we directly use vector-variate-based GGM to
analyze these EEG data, it would suffer from performance degradation due to the
neglect of structural information.
Assume each observation $\mathbf{X}_n\in\mathbb{R}^{P\times Q}$ follows the matrix normal distribution with mean $\mathbf{M}$ and covariance matrices $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$, whose probability density function is
$$\text{Prob}(\mathbf{X}_n|\mathbf{M},\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2) = \frac{\exp\{-\frac{1}{2}\operatorname{tr}(\boldsymbol{\Sigma}_2^{-1}(\mathbf{X}_n-\mathbf{M})^{\mathrm{T}}\boldsymbol{\Sigma}_1^{-1}(\mathbf{X}_n-\mathbf{M}))\}}{(2\pi)^{\frac{PQ}{2}}\det(\boldsymbol{\Sigma}_2)^{\frac{P}{2}}\det(\boldsymbol{\Sigma}_1)^{\frac{Q}{2}}}. \qquad (12.10)$$
The precision matrices of the row and column variables in the sample matrix are defined as $\boldsymbol{\Omega}_1 = \boldsymbol{\Sigma}_1^{-1}$ and $\boldsymbol{\Omega}_2 = \boldsymbol{\Sigma}_2^{-1}$, respectively. In the matrix-variate-based GGM, maximum likelihood estimation is also used to find the optimal precision matrices for $N$ samples $\mathbf{X}_n$, and the objective function can be formulated as
$$\max_{\boldsymbol{\Omega}_1,\boldsymbol{\Omega}_2}\ \frac{NQ}{2}\log\det\boldsymbol{\Omega}_1 + \frac{NP}{2}\log\det\boldsymbol{\Omega}_2 - \frac{1}{2}\sum_{n=1}^{N}\operatorname{tr}\big(\mathbf{X}_n\boldsymbol{\Omega}_2\mathbf{X}_n^{\mathrm{T}}\boldsymbol{\Omega}_1\big).$$
Adding sparsity-inducing penalties on the two precision matrices leads to
$$\min_{\boldsymbol{\Omega}_1,\boldsymbol{\Omega}_2}\ -\frac{NQ}{2}\log\det\boldsymbol{\Omega}_1 - \frac{NP}{2}\log\det\boldsymbol{\Omega}_2 + \frac{1}{2}\sum_{n=1}^{N}\operatorname{tr}\big(\mathbf{X}_n\boldsymbol{\Omega}_2\mathbf{X}_n^{\mathrm{T}}\boldsymbol{\Omega}_1\big) + \lambda_1R_1(\boldsymbol{\Omega}_1) + \lambda_2R_2(\boldsymbol{\Omega}_2). \qquad (12.11)$$
For the penalties $R_1$ and $R_2$, there are multiple choices, such as the most commonly used $\ell_1$ norm penalty [14], the nonconvex SCAD (smoothly clipped absolute deviation) penalty [5], the adaptive $\ell_1$ penalty [22], the MCP (minimax concave penalty) [20], and the truncated $\ell_1$ penalty [13].
Here we follow the idea in [8], which considers the estimation of the precision matrices under the $\ell_1$-penalized likelihood framework:
$$\min_{\boldsymbol{\Omega}_1,\boldsymbol{\Omega}_2}\ -\frac{NQ}{2}\log\det\boldsymbol{\Omega}_1 - \frac{NP}{2}\log\det\boldsymbol{\Omega}_2 + \frac{1}{2}\sum_{n=1}^{N}\operatorname{tr}\big(\mathbf{X}_n\boldsymbol{\Omega}_2\mathbf{X}_n^{\mathrm{T}}\boldsymbol{\Omega}_1\big) + \lambda_1\|\boldsymbol{\Omega}_1\|_{1,\text{off}} + \lambda_2\|\boldsymbol{\Omega}_2\|_{1,\text{off}}, \qquad (12.12)$$
where $\|\cdot\|_{1,\text{off}}$ is the off-diagonal $\ell_1$ norm, i.e., $\|\boldsymbol{\Omega}_1\|_{1,\text{off}} = \sum_{p\neq i}|\boldsymbol{\Omega}_1(p,i)|$ and $\|\boldsymbol{\Omega}_2\|_{1,\text{off}} = \sum_{q\neq j}|\boldsymbol{\Omega}_2(q,j)|$.
With the optimization model (12.12), we can acquire precision matrices satisfying the sparsity condition. However, (12.12) is obviously nonconvex. Thanks to its biconvexity, we can minimize over $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$ separately in iterations [8, 17]: in practice, we fix one precision matrix and solve for the other. For example, if we fix $\boldsymbol{\Omega}_1$, the objective is convex with respect to $\boldsymbol{\Omega}_2$, and vice versa. Therefore, the optimization model (12.12) can be transformed into two subproblems as follows:
$$\min_{\boldsymbol{\Omega}_2}\ \frac{1}{Q}\sum_{n=1}^{N}\operatorname{tr}\big(\boldsymbol{\Omega}_2\cdot\mathbf{X}_n^{\mathrm{T}}\boldsymbol{\Omega}_1\mathbf{X}_n\big) - \frac{1}{Q}\log\det\boldsymbol{\Omega}_2 + \frac{1}{Q}\lambda_2\sum_{q\neq j}\big|\boldsymbol{\Omega}_2(q,j)\big|, \qquad (12.13)$$
$$\min_{\boldsymbol{\Omega}_1}\ \frac{1}{P}\sum_{n=1}^{N}\operatorname{tr}\big(\boldsymbol{\Omega}_1\cdot\mathbf{X}_n\boldsymbol{\Omega}_2\mathbf{X}_n^{\mathrm{T}}\big) - \frac{1}{P}\log\det\boldsymbol{\Omega}_1 + \frac{1}{P}\lambda_1\sum_{p\neq i}\big|\boldsymbol{\Omega}_1(p,i)\big|. \qquad (12.14)$$
Note that each of these two optimization models can be treated as a vector-variate GGM with an $\ell_1$ penalty imposed on the precision matrix [1, 6, 11]. According to [8], we can iteratively solve the above minimization problems using Glasso [6].
In more generalized cases, the data to be processed can be in tensor form. Simply
rearranging the tensor into a matrix or a vector will lose vital structural information,
rendering unsatisfactory results. Besides, if we directly reshape the tensor data to fit
in the vector- or matrix-variate-based GGM, it will require a huge precision matrix,
which is computationally expensive.
To address the above issue, similar to the generalization from vector-based
processing to matrix-based processing, tensor-variate-based GGM, a natural gen-
eralization of matrix-variate-based GGM, has been proposed [7, 9, 12, 16]. In
particular, given $N$ i.i.d. $K$-th order observations $\mathcal{T}_1,\cdots,\mathcal{T}_N\in\mathbb{R}^{I_1\times\cdots\times I_K}$ from the tensor normal distribution $\text{TN}(\mathbf{0};\boldsymbol{\Sigma}_1,\cdots,\boldsymbol{\Sigma}_K)$, its PDF is
$$\text{Prob}(\mathcal{T}_n|\boldsymbol{\Sigma}_1,\cdots,\boldsymbol{\Sigma}_K) = (2\pi)^{-\frac{I}{2}}\prod_{k=1}^{K}|\boldsymbol{\Sigma}_k|^{-\frac{I}{2I_k}}\exp\big(-\|\mathcal{T}_n\times_{1:K}\boldsymbol{\Sigma}^{-\frac{1}{2}}\|_F^2/2\big), \qquad (12.15)$$
where $I = \prod_{k=1}^{K}I_k$ and $\boldsymbol{\Sigma}^{-\frac{1}{2}} := \{\boldsymbol{\Sigma}_1^{-\frac{1}{2}},\cdots,\boldsymbol{\Sigma}_K^{-\frac{1}{2}}\}$, with $\boldsymbol{\Sigma}_1,\ldots,\boldsymbol{\Sigma}_K$ being the covariance matrices corresponding to the different modes. $\times_{1:K}$ denotes the tensor mode product from mode-$1$ to mode-$K$, i.e., $\mathcal{T}\times_{1:K}\{\mathbf{A}_1,\cdots,\mathbf{A}_K\} = \mathcal{T}\times_1\mathbf{A}_1\times_2\cdots\times_K\mathbf{A}_K$. In this case, the covariance matrix of the tensor normal distribution is separable in the sense that it is the Kronecker product of small covariance matrices, each of which corresponds to one way of the tensor. It means that the PDF of $\mathcal{T}_n$ satisfies $\text{Prob}(\mathcal{T}_n|\boldsymbol{\Sigma}_1,\cdots,\boldsymbol{\Sigma}_K)$ if and only if the PDF of $\operatorname{vec}(\mathcal{T}_n)$ satisfies $\text{Prob}(\operatorname{vec}(\mathcal{T}_n)|\boldsymbol{\Sigma}_K\otimes\cdots\otimes\boldsymbol{\Sigma}_1)$, where $\operatorname{vec}(\cdot)$ is the vectorization operator and $\otimes$ is the matrix Kronecker product.
The negative log-likelihood function for $N$ observations $\mathcal{T}_1,\cdots,\mathcal{T}_N$ is
$$H(\boldsymbol{\Omega}_1,\cdots,\boldsymbol{\Omega}_K) = -\sum_{k=1}^{K}\frac{I}{2I_k}\log\det\boldsymbol{\Omega}_k + \frac{1}{2}\operatorname{tr}\big(\mathbf{S}(\boldsymbol{\Omega}_K\otimes\cdots\otimes\boldsymbol{\Omega}_1)\big), \qquad (12.16)$$
where $\boldsymbol{\Omega}_k = \boldsymbol{\Sigma}_k^{-1}$, $k = 1,\cdots,K$, and $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}\operatorname{vec}(\mathcal{T}_n)\operatorname{vec}(\mathcal{T}_n)^{\mathrm{T}}$.
To encourage sparsity in each precision matrix, a penalized log-likelihood estimator is considered, forming the following sparse tensor graphical optimization model [9]:
$$\min_{\boldsymbol{\Omega}_k,\,k=1,\cdots,K}\ -\sum_{k=1}^{K}\frac{1}{I_k}\log\det\boldsymbol{\Omega}_k + \frac{1}{I}\operatorname{tr}\big(\mathbf{S}(\boldsymbol{\Omega}_K\otimes\cdots\otimes\boldsymbol{\Omega}_1)\big) + \sum_{k=1}^{K}\lambda_kR_k(\boldsymbol{\Omega}_k). \qquad (12.17)$$
When $K = 1$ and $K = 2$, it degenerates into the sparse Gaussian graphical model [6] and the sparse matrix graphical model [8], respectively.
If we employ the lasso penalty on the precision matrices, the problem of interest becomes
$$\min_{\boldsymbol{\Omega}_k,\,k=1,\cdots,K}\ -\sum_{k=1}^{K}\frac{1}{I_k}\log\det\boldsymbol{\Omega}_k + \frac{1}{I}\operatorname{tr}\big(\mathbf{S}(\boldsymbol{\Omega}_K\otimes\cdots\otimes\boldsymbol{\Omega}_1)\big) + \sum_{k=1}^{K}\lambda_k\|\boldsymbol{\Omega}_k\|_{1,\text{off}}. \qquad (12.18)$$
This problem is convex in each $\boldsymbol{\Omega}_k$ with the others fixed, so it can be solved by alternately updating one precision matrix at a time, where each subproblem is
$$\min_{\boldsymbol{\Omega}_k}\ -\frac{1}{I_k}\log\det\boldsymbol{\Omega}_k + \frac{1}{I_k}\operatorname{tr}(\mathbf{S}_k\boldsymbol{\Omega}_k) + \lambda_k\|\boldsymbol{\Omega}_k\|_{1,\text{off}}, \qquad (12.19)$$
where $\mathbf{S}_k = \frac{I_k}{NI}\sum_{n=1}^{N}\mathbf{V}^k_{(k)}\big(\mathbf{V}^k_{(k)}\big)^{\mathrm{T}}$, $\mathbf{V}^k_{(k)}$ is the mode-$k$ unfolding of $\mathcal{V}^k = \mathcal{T}_n\times_{1:K}\{\boldsymbol{\Omega}_1^{\frac{1}{2}},\cdots,\boldsymbol{\Omega}_{k-1}^{\frac{1}{2}},\mathbf{I}_{I_k},\boldsymbol{\Omega}_{k+1}^{\frac{1}{2}},\cdots,\boldsymbol{\Omega}_K^{\frac{1}{2}}\}$, $\times_{1:K}$ is the tensor mode product defined above, and $\|\boldsymbol{\Omega}_k\|_{1,\text{off}} = \sum_{p\neq q}|\boldsymbol{\Omega}_k(p,q)|$ is the off-diagonal $\ell_1$ norm.
By applying Glasso to the optimization model (12.19), we can obtain the solution for the precision matrices $\boldsymbol{\Omega}_k$, $k=1,\cdots,K$. This procedure is called Tlasso [9], and the details are given in Algorithm 52.
Algorithm 52: Sparse tensor graphical model via tensor lasso (Tlasso)
Input: Tensor samples $\mathcal{T}_1,\cdots,\mathcal{T}_N$, tuning parameters $\lambda_1,\cdots,\lambda_K$, maximum number of
iterations $T$. $\boldsymbol{\Omega}_1^{(0)},\cdots,\boldsymbol{\Omega}_K^{(0)}$ are randomly initialized as symmetric and positive definite
matrices. $t=0$.
While $t<T$ do:
    $t=t+1$
    For $k=1,\cdots,K$:
        Update $\boldsymbol{\Omega}_k^{(t)}$ in (12.19) via Glasso, given $\boldsymbol{\Omega}_1^{(t)},\cdots,\boldsymbol{\Omega}_{k-1}^{(t)},\boldsymbol{\Omega}_{k+1}^{(t-1)},\cdots,\boldsymbol{\Omega}_K^{(t-1)}$.
        Normalize $\boldsymbol{\Omega}_k^{(t)}$ such that $\|\boldsymbol{\Omega}_k^{(t)}\|_F=1$.
    End for
End while
Output: $\hat{\boldsymbol{\Omega}}_k=\boldsymbol{\Omega}_k^{(T)}$, $k=1,\cdots,K$.
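A compact Python rendering of Algorithm 52 is given below, again relying on scikit-learn's graphical_lasso for the per-mode update; the helper functions and the fixed number of sweeps are illustrative assumptions rather than part of the Tlasso reference implementation [9].

```python
import numpy as np
from scipy.linalg import sqrtm
from sklearn.covariance import graphical_lasso

def mode_product(T, M, k):
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, k, 0), axes=(1, 0)), 0, k)

def tlasso(samples, lams, T_max=5):
    """Sketch of Algorithm 52 (Tlasso). samples: list of K-way arrays of equal shape."""
    dims = samples[0].shape
    K, N, I = len(dims), len(samples), int(np.prod(dims))
    Omegas = [np.eye(d) for d in dims]                       # symmetric positive definite init
    for _ in range(T_max):
        for k in range(K):
            # Build S_k of (12.19) from T_n x_{1:K} {Omega_1^{1/2}, ..., I, ..., Omega_K^{1/2}}
            Sk = np.zeros((dims[k], dims[k]))
            roots = [np.real(sqrtm(Om)) for Om in Omegas]
            for Tn in samples:
                V = Tn
                for j in range(K):
                    if j != k:
                        V = mode_product(V, roots[j], j)
                Vk = np.moveaxis(V, k, 0).reshape(dims[k], -1)   # mode-k unfolding
                Sk += Vk @ Vk.T
            Sk *= dims[k] / (N * I)
            _, Om = graphical_lasso(Sk, alpha=lams[k])           # Glasso update
            Omegas[k] = Om / np.linalg.norm(Om, 'fro')           # set ||Omega_k||_F = 1
    return Omegas
```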
Alternatively, sparsity can be imposed through explicit $\ell_0$ constraints on each precision
matrix [16], leading to
$$
\min_{\boldsymbol{\Omega}_k\succ 0,\,k=1,\cdots,K}\ \mathcal{H}(\boldsymbol{\Omega}_1,\cdots,\boldsymbol{\Omega}_K)
\quad\text{s.\,t.}\ \|\boldsymbol{\Omega}_k\|_0\le S_k,\ k=1,\cdots,K,
\qquad (12.20)
$$
where the objective $\mathcal{H}(\boldsymbol{\Omega}_1,\cdots,\boldsymbol{\Omega}_K)=-\sum_{k=1}^{K}\frac{1}{I_k}\log\det\boldsymbol{\Omega}_k+\frac{1}{I}\operatorname{tr}\big(\mathbf{S}(\boldsymbol{\Omega}_K\otimes\cdots\otimes\boldsymbol{\Omega}_1)\big)$
is the smooth part of (12.18), $\boldsymbol{\Omega}_k\succ 0$ means that $\boldsymbol{\Omega}_k$ is positive definite, and $S_k$ is the
parameter that controls the sparsity of $\boldsymbol{\Omega}_k$.
Since the sparsity constraint in (12.20) is not convex with respect to $\boldsymbol{\Omega}_k$, $k=
1,\cdots,K$, we resort to alternating minimization and update each variable with the
others fixed. In this case, we obtain $K$ subproblems of the following form:
$$
\min_{\boldsymbol{\Omega}_k\succ 0}\ \mathcal{H}_k(\boldsymbol{\Omega}_k)
\quad\text{s.\,t.}\ \|\boldsymbol{\Omega}_k\|_0\le S_k,
\qquad (12.21)
$$
where $\mathcal{H}_k(\boldsymbol{\Omega}_k)$ denotes the objective of (12.20) viewed as a function of $\boldsymbol{\Omega}_k$ with the
other precision matrices fixed. The details can be found in Algorithm 53. Let $\boldsymbol{\Omega}_{\neq k}=\boldsymbol{\Omega}_K\otimes\cdots\otimes\boldsymbol{\Omega}_{k+1}\otimes\boldsymbol{\Omega}_{k-1}\otimes
\cdots\otimes\boldsymbol{\Omega}_1$, and the $\operatorname{trunc}(\boldsymbol{\Omega},\mathbb{S})$ operation in Algorithm 53 is defined as
$$
\operatorname{trunc}(\boldsymbol{\Omega},\mathbb{S})_{i,j}=
\begin{cases}
\boldsymbol{\Omega}(i,j) & \text{if }(i,j)\in\mathbb{S},\\
0 & \text{if }(i,j)\notin\mathbb{S},
\end{cases}
$$
where $\boldsymbol{\Omega}\in\mathbb{R}^{I\times J}$ and $\mathbb{S}\subseteq\{(i,j):i=1,\cdots,I,\ j=1,\cdots,J\}$.
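A minimal Python version of the trunc operation is shown below; the top-magnitude rule used to build the support set is one common choice and only an assumption about how $S_k$ might be enforced, not a restatement of Algorithm 53.

```python
import numpy as np

def trunc(Omega, support):
    """Keep Omega(i, j) for (i, j) in the support set and zero out all other entries."""
    out = np.zeros_like(Omega)
    rows, cols = zip(*support)
    out[rows, cols] = Omega[rows, cols]
    return out

def top_support(Omega, S_k):
    """One possible support set: the S_k entries of largest magnitude."""
    flat = np.argsort(np.abs(Omega), axis=None)[::-1][:S_k]
    return [tuple(np.unravel_index(f, Omega.shape)) for f in flat]

# Example: keep the 5 largest-magnitude entries of a random symmetric matrix.
A = np.random.default_rng(0).normal(size=(4, 4))
A = (A + A.T) / 2
print(trunc(A, top_support(A, 5)))
```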
12.5 Applications
The aim of GGM is to reveal the conditional dependence structure among variables, which is
important for uncovering the hidden structure within data. In this section, we mainly discuss
two applications employing the tensor-variate-based GGM, i.e., environmental prediction
and the mice aging study.
1 ftp://ftp2.psl.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/.
Fig. 12.2 Region of interest in our experiment within (90°W, 137.5°W) and (17.5°N, 40°N),
which corresponds to parts of the Pacific Ocean, the USA, and Mexico. The red line denotes 23.5°N,
which is the dividing line between the tropical and north temperate zones. The black grid represents
the 20 × 10 partition of this area, and the numbers on the left side index the different
latitude regions
Fig. 12.3 The graphical structure of precision matrices predicted by Tlasso based on the ten
samples. Graphs (a), (b), and (c) illustrate the velocity relationships in longitude, latitude, and
time, respectively. Red dot denotes the corresponding variable in each mode
Age-related diseases such as Alzheimer's disease and Parkinson's disease have
long been a pressing problem in medical research. The study of the
aging process may shed light on the triggering factors of these diseases. To this end, a
series of pioneering experiments has been carried out on mice, owing to their genetic
and physiological resemblance to humans.
Fig. 12.4 The graphical structure of precision matrices in three modes. The numbers in three
figures denote different genes, tissues, and mice, respectively. (a) Genes. (b) Tissues. (c) Mice
In this part, we mainly focus on a gene expression database² for the mice aging study.
The aim of analyzing these data is to investigate the relationships between
genes and the aging process. In this dataset, gene expression levels are collected from
mice across ten age groups, such as 1 month, 6 months, 16 months, and 24 months,
with ten mice in each group. The expression levels of 8932 genes in 16
tissues are recorded for every mouse.
For ease of illustration, we only select 8 gene expression levels in 12 tissues from
three 1-month-old mice to form a sample of size 8 × 12 × 3. To be specific, the 8 chosen
genes are “Igh-6,” “Slc16a1,” “Icsbp1,” “Cxcr4,” “Prdx6,” “Rpl37,” “Rpl10,” and
“Fcgr1,” and the 12 selected tissues are “adrenal glands,” “cerebellum,” “cerebrum,”
“eye,” “hippocampus,” “kidney,” “lung,” “muscle,” “spinal cord,” “spleen,” “stria-
tum,” and “thymus.” Under the same settings as the work in [9], the analytical
results are given in Fig. 12.4.
Figure 12.4a shows the relationship among the selected eight genes. We can
notice that “Slc16a1” (node 2) and “Rpl37” (node 6) are relatively independent
compared with the other genes. The reason may be that the biological functions of these
two genes are less closely related to those of the others. For example, the biological
functions of “Igh-6” (node 1), “Icsbp1” (node 3), “Cxcr4” (node 4), and “Fcgr1”
(node 8) are all related to the immune system of the mice and thus closely
connected, while “Slc16a1” (node 2) is a gene involved in the regulation
of central metabolic pathways and insulin secretion. Figure 12.4b demonstrates
the relationship among the 12 different tissues. It indicates potential internal
connections between “lung” (node 7), “spleen” (node 10), and “thymus” (node 12),
which can provide clues for studies in related fields.
2 http://cmgm.stanford.edu/~kimlab/aging_mouse.
12.6 Summary
References
1. Banerjee, O., El Ghaoui, L., d’Aspremont, A.: Model selection through sparse maximum
likelihood estimation for multivariate gaussian or binary data. J. Mach. Learn. Res. 9, 485–
516 (2008)
2. Beal, M.J., Jojic, N., Attias, H.: A graphical model for audiovisual object tracking. IEEE Trans.
Pattern Anal. Mach. Intell. 25(7), 828–836 (2003)
3. Brouard, C., de Givry, S., Schiex, T.: Pushing data into CP models using graphical model
learning and solving. In: International Conference on Principles and Practice of Constraint
Programming, pp. 811–827. Springer, Berlin (2020)
4. d’Aspremont, A., Banerjee, O., El Ghaoui, L.: First-order methods for sparse covariance
selection. SIAM J. Matrix Anal. Appl. 30(1), 56–66 (2008)
5. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties.
J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
6. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical
lasso. Biostatistics 9(3), 432–441 (2008)
7. He, S., Yin, J., Li, H., Wang, X.: Graphical model selection and estimation for high dimensional
tensor data. J. Multivar. Anal. 128, 165–185 (2014)
8. Leng, C., Tang, C.Y.: Sparse matrix graphical models. J. Am. Stat. Assoc. 107(499), 1187–
1200 (2012)
9. Lyu, X., Sun, W.W., Wang, Z., Liu, H., Yang, J., Cheng, G.: Tensor graphical model: non-
convex optimization and statistical inference. IEEE Trans. Pattern Anal. Mach. Intell. 42(8),
2024–2037 (2019)
10. Ma, C., Lu, J., Liu, H.: Inter-subject analysis: a partial Gaussian graphical model approach. J.
Am. Stat. Assoc. 116(534), 746–755 (2021)
11. Meinshausen, N., Bühlmann, P., et al.: High-dimensional graphs and variable selection with
the lasso. Ann. Stat. 34(3), 1436–1462 (2006)
12. Shahid, N., Grassi, F., Vandergheynst, P.: Multilinear low-rank tensors on graphs & applica-
tions (2016, preprint). arXiv:1611.04835
13. Shen, X., Pan, W., Zhu, Y.: Likelihood-based selection and sharp parameter estimation. J. Am.
Stat. Assoc. 107(497), 223–232 (2012)
14. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58(1), 267–
288 (1996)
15. Wu, W.B., Pourahmadi, M.: Nonparametric estimation of large covariance matrices of
longitudinal data. Biometrika 90(4), 831–844 (2003)
16. Xu, P., Zhang, T., Gu, Q.: Efficient algorithm for sparse tensor-variate gaussian graphical
models via gradient descent. In: Artificial Intelligence and Statistics, pp. 923–932. PMLR,
Westminster (2017)
17. Yin, J., Li, H.: Model selection and estimation in the matrix normal graphical model. J.
Multivariate Anal. 107, 119–140 (2012)
18. Yuan, M.: High dimensional inverse covariance matrix estimation via linear programming. J.
Mach. Learn. Res. 11, 2261–2286 (2010)
19. Yuan, M., Lin, Y.: Model selection and estimation in the gaussian graphical model. Biometrika
94(1), 19–35 (2007)
20. Zhang, C.H., et al.: Nearly unbiased variable selection under minimax concave penalty. Ann.
Stat. 38(2), 894–942 (2010)
21. Zhou, S., et al.: Gemini: graph estimation with matrix variate normal instances. Ann. Stat.
42(2), 532–562 (2014)
22. Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429
(2006)
Chapter 13
Tensor Sketch
13.1 Introduction
Consider an over-determined least squares problem with $\mathbf{A}\in\mathbb{R}^{J\times I}$ and $\mathbf{y}\in\mathbb{R}^{J}$,
$$
\mathbf{x}^{*}=\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{I}}\|\mathbf{A}\mathbf{x}-\mathbf{y}\|_F.
$$
Sketching replaces it with the smaller problem
$$
\tilde{\mathbf{x}}=\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{I}}\|\mathbf{S}(\mathbf{A}\mathbf{x}-\mathbf{y})\|_F,
$$
which satisfies $\|\mathbf{A}\tilde{\mathbf{x}}-\mathbf{y}\|_F\le(1+\epsilon)\|\mathbf{A}\mathbf{x}^{*}-\mathbf{y}\|_F$
with high probability for a small $\epsilon>0$, where $\mathbf{S}\in\mathbb{R}^{L\times J}$ ($L\ll J$) is the sketching matrix.
We can see that sketching tries to model the relationship between $\mathbf{x}$ and $\mathbf{y}$ with
fewer samples based on the sketching matrix $\mathbf{S}$. It sacrifices estimation accuracy to a
certain extent in exchange for reduced computational and storage complexity.
An intuitive way to choose the sketching matrix $\mathbf{S}$ is to use an i.i.d. Gaussian
random matrix. However, in this case $\mathbf{S}$ is dense, which makes the matrix
multiplication inefficient. As a result, sketching matrices are commonly drawn from certain
families of structured random maps (e.g., oblivious sparse norm approximating projection
transforms [28], the fast Johnson–Lindenstrauss transform [2]). In this way, one obtains
input-sparsity computation [11, 32], improved approximation guarantees [33], and
much lower storage requirements. Another distinguishing feature of sketching is that,
unlike compressed sensing, which relies on strict assumptions such as sparsity, the
quality of sketching is almost independent of the input data structure [9, 40],
although certain structures can speed up the computation.
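The following Python snippet is a minimal numerical illustration of this trade-off: it solves an over-determined least squares problem exactly and with a CountSketch-style sketching matrix applied implicitly; the sizes and random seed are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
J_rows, I_cols, L = 20000, 50, 400            # tall problem, sketch size L << J_rows
A = rng.normal(size=(J_rows, I_cols))
y = A @ rng.normal(size=I_cols) + 0.01 * rng.normal(size=J_rows)

# Apply a CountSketch-style S (one nonzero per row of A) without forming S explicitly.
h = rng.integers(0, L, size=J_rows)           # hash of each row into one of L buckets
s = rng.choice([-1.0, 1.0], size=J_rows)      # random signs
SA, Sy = np.zeros((L, I_cols)), np.zeros(L)
np.add.at(SA, h, s[:, None] * A)              # SA = S @ A
np.add.at(Sy, h, s * y)                       # Sy = S @ y

x_exact = np.linalg.lstsq(A, y, rcond=None)[0]
x_sketch = np.linalg.lstsq(SA, Sy, rcond=None)[0]
# The residual ratio is close to 1, i.e., the sketched solution is nearly as good.
print(np.linalg.norm(A @ x_sketch - y) / np.linalg.norm(A @ x_exact - y))
```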
One simple and effective sketching method is count sketch (CS), which was first
designed to estimate the most frequent items in a data stream [10]. Its key idea is to
build sketching matrices from random hash functions and project the input vector
into a lower-dimensional sketched space. In [29], Pagh introduces tensor sketch (TS)
for compressing matrix multiplication. TS can be regarded as a specialized case of
CS, where the hash function used by TS is built by cleverly composing multiple
hash functions from CS. In this manner, TS can be represented by polynomial
multiplication and hence can be computed quickly via the fast Fourier transform (FFT).
Compared with CS, TS achieves great acceleration and storage reduction for rank-1
matrices/tensors. TS has been successfully used in many statistical learning tasks such as
kernel approximation [1, 5, 30], tensor decomposition [26, 36, 38, 44], Kronecker
product regression [13, 14], and network compression [21, 31, 34].
Despite the effectiveness of CS and TS, they have drawbacks in certain cases.
For example, when the input data are higher-order tensors, CS and TS both map the
multidimensional data into the vector space in preprocessing, which partly ignores
the multidimensional structure information. Besides, CS is memory-consuming,
since the vectorization of the multidimensional data leads to a long vector
and the hash function is required to have the same size as the input vector. To
solve these problems, higher-order count sketch (HCS) is proposed [34] to directly
sketch the input tensor into a lower-dimensional one of the same order, so that
the multidimensional structure information is preserved. Compared with CS, HCS
achieves better approximation accuracy and provides efficient computation directly
in the sketched space for some tensor contractions and tensor products.
In the following, we provide a detailed illustration of CS, TS, and HCS. The
remaining part is organized as follows. Sections 13.2, 13.3, and 13.4 introduce the basic
concepts of CS, TS, and HCS, respectively. In Sect. 13.5, we present some appli-
cations of tensor sketches in tensor decomposition, Kronecker product regression,
and network approximation. Finally, the chapter is concluded in Sect. 13.6.
13.2 Count Sketch
Given two 2-wise independent hash functions $h:[I]\mapsto[J]$ and $s:[I]\mapsto\{\pm 1\}$, the count
sketch of a vector $\mathbf{x}\in\mathbb{R}^{I}$ is the vector $\mathbf{y}=\mathrm{CS}(\mathbf{x})\in\mathbb{R}^{J}$ with entries
$\mathbf{y}(j)=\sum_{i:\,h(i)=j}s(i)\mathbf{x}(i)$, which can be written compactly as
$$
\mathbf{y}=\mathbf{S}\mathbf{x},
\qquad (13.2)
$$
where $\mathbf{S}\in\mathbb{R}^{J\times I}$ has entries $\mathbf{S}(j,i)=s(i)$ if $h(i)=j$ and $0$ otherwise. The original entry
can then be estimated as $\hat{\mathbf{x}}(i)=s(i)\mathbf{y}(h(i))$.
To make the estimation more robust, one can take $D$ independent sketches and
compute the median of the corresponding estimates [10]. The following theorem gives the error analysis of
the CS recovery:
Theorem 13.1 ([10]) For any $\mathbf{x}\in\mathbb{R}^{I}$, given two 2-wise independent hash functions
$h:[I]\mapsto[J]$ and $s:[I]\mapsto\{\pm 1\}$, $\hat{\mathbf{x}}(i)$ provides an unbiased estimator of $\mathbf{x}(i)$ with
$\mathrm{Var}(\hat{\mathbf{x}}(i))\le\|\mathbf{x}\|_2^2/J$.
Below we present a toy example showing how count sketch works. Two vectors $\mathbf{x}_1$ and
$\mathbf{x}_2$ are sketched with three independent pairs of hash functions built from the mod (modular)
operation, yielding count sketches $\mathrm{CS}_d(\mathbf{x}_1)$ and $\mathrm{CS}_d(\mathbf{x}_2)$ for $d=1,2,3$.
Using the fact that the inner product is preserved under the same sketches [30], we can
approximate the inner product $\langle\mathbf{x}_1,\mathbf{x}_2\rangle$ by
$$
\langle\mathbf{x}_1,\mathbf{x}_2\rangle\approx\operatorname{median}\big(\langle\mathrm{CS}_1(\mathbf{x}_1),\mathrm{CS}_1(\mathbf{x}_2)\rangle,\cdots,\langle\mathrm{CS}_3(\mathbf{x}_1),\mathrm{CS}_3(\mathbf{x}_2)\rangle\big)=5.
$$
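A similar computation can be reproduced numerically with a few lines of Python; the vectors, hash length, and number of independent sketches below are arbitrary, and the hashes are drawn fully at random rather than from a 2-wise independent family, so this is only an illustrative sketch of the idea.

```python
import numpy as np

def count_sketch(x, h, s, J):
    """CS(x)(j) = sum over i with h(i) = j of s(i) * x(i)."""
    y = np.zeros(J)
    np.add.at(y, h, s * x)
    return y

rng = np.random.default_rng(0)
I, J, D = 1000, 64, 3
x1, x2 = rng.normal(size=I), rng.normal(size=I)

estimates = []
for _ in range(D):                              # D independent sketches
    h = rng.integers(0, J, size=I)
    s = rng.choice([-1.0, 1.0], size=I)
    estimates.append(count_sketch(x1, h, s, J) @ count_sketch(x2, h, s, J))

print(np.median(estimates), x1 @ x2)            # sketched estimate vs. exact inner product
```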
When the input is a matrix or a higher-order tensor, however, applying CS to its vectorization
quickly becomes infeasible for practical use as the number of modes increases. Therefore, it is
necessary to consider efficient sketching methods for matrix and tensor operations.
13.3 Tensor Sketch

In [30], Pham et al. propose TS by generalizing CS to higher-order data. First,
we consider the case where $\mathbf{X}=\mathbf{x}\circ\mathbf{x}\in\mathbb{R}^{I\times I}$ is a rank-1 matrix. If we build $H$ and
$S$ from two 2-wise independent hash pairs $(h_1,s_1)$ and $(h_2,s_2)$ as $H(i_1,i_2)=\operatorname{mod}(h_1(i_1)+h_2(i_2),J)$
and $S(i_1,i_2)=s_1(i_1)s_2(i_2)$, the tensor sketch of $\mathbf{X}$ can be computed as
$$
\mathrm{TS}(\mathbf{X})=\mathrm{CS}_1(\mathbf{x})\circledast_J\mathrm{CS}_2(\mathbf{x})
=\mathcal{F}^{-1}\big(\mathcal{F}(\mathrm{CS}_1(\mathbf{x}))*\mathcal{F}(\mathrm{CS}_2(\mathbf{x}))\big),
$$
where $\circledast_J$ and $*$ denote the mode-$J$ circular convolution and the Hadamard product,
respectively, and $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the FFT and its inverse operation. The
calculation of $\mathrm{TS}(\mathbf{X})$ requires $O(\mathrm{nnz}(\mathbf{x})+J\log J)$ time and $O(I)$ storage, which
is a great improvement compared with $\mathrm{CS}(\mathbf{X})$.
The extension to the higher-order case is straightforward. Given $\mathcal{X}=\underbrace{\mathbf{x}\circ\cdots\circ\mathbf{x}}_{N\ \text{times}}\in
\mathbb{R}^{I\times\cdots\times I}$ and $N$ pairs of 2-wise independent hash functions $h_1,\cdots,h_N:[I]\mapsto[J]$,
$s_1,\cdots,s_N:[I]\mapsto\{\pm 1\}$, the tensor sketch $\mathrm{TS}(\mathcal{X})\in\mathbb{R}^{J}$ can be calculated in
$O(N(\mathrm{nnz}(\mathbf{x})+J\log J))$ time by
$$
\mathrm{TS}(\mathcal{X})=\mathcal{F}^{-1}\big(\mathcal{F}(\mathrm{CS}_1(\mathbf{x}))*\mathcal{F}(\mathrm{CS}_2(\mathbf{x}))*\cdots*\mathcal{F}(\mathrm{CS}_N(\mathbf{x}))\big).
$$
It is noted that the above discussion is based on the assumption of a rank-1 matrix
or tensor. If this is not the case, the computation of TS takes $O(\mathrm{nnz}(\mathbf{X}))$ time by (13.6)
for a matrix. Although the computational complexity is then the same as that of CS, the
storage requirement is reduced from $O(I^2)$ to $O(I)$. Therefore, TS is more efficient
when dealing with high-order data.
It is shown that tensor sketching provides an oblivious subspace embedding [6]
and can be applied in the context of low-rank approximation, principal component
regression, and canonical correlation analysis with polynomial kernels [5]. Its other
variants include recursive TS [1], polynomial TS [18], sparse TS [42], etc.
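To make the FFT-based construction concrete, the Python sketch below computes TS of the rank-1 tensor $\mathbf{x}\circ\mathbf{x}\circ\mathbf{x}$ from the count sketches of $\mathbf{x}$ alone and checks that the Frobenius norm is roughly preserved; the sizes, seed, and function names are illustrative.

```python
import numpy as np

def count_sketch(x, h, s, J):
    y = np.zeros(J)
    np.add.at(y, h, s * x)
    return y

def tensor_sketch_rank1(x, hashes, signs, J):
    """TS(x∘x∘...∘x): inverse FFT of the entry-wise product of the FFTs of the count sketches."""
    F = np.ones(J, dtype=complex)
    for h, s in zip(hashes, signs):
        F *= np.fft.fft(count_sketch(x, h, s, J))
    return np.real(np.fft.ifft(F))

rng = np.random.default_rng(1)
I, J, N = 500, 1024, 3
x = rng.normal(size=I)
hashes = [rng.integers(0, J, size=I) for _ in range(N)]
signs = [rng.choice([-1.0, 1.0], size=I) for _ in range(N)]
ts = tensor_sketch_rank1(x, hashes, signs, J)
print(np.linalg.norm(ts), np.linalg.norm(x) ** N)   # roughly equal for large enough J
```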
13.4 Higher-Order Count Sketch

Instead of vectorizing the input, HCS sketches each mode separately. Given a tensor
$\mathcal{X}\in\mathbb{R}^{I_1\times\cdots\times I_N}$, its higher-order count sketch is
$$
\mathcal{Y}=\mathcal{X}\times_1\mathbf{H}_1\times_2\mathbf{H}_2\cdots\times_N\mathbf{H}_N
=[\![\mathcal{X};\mathbf{H}_1,\mathbf{H}_2,\cdots,\mathbf{H}_N]\!],
\qquad (13.10)
$$
where $\mathbf{H}_n$ denotes the sketching matrix applied along mode $n$.
Table 13.1 Computational and storage complexity of CS, TS, and HCS for a third-order tensor X ∈ R^{I×I×I}. J is the hash length

Sketching method             CS           TS                    HCS
X = x∘x∘x                    O(nnz(X))    O(nnz(x) + J log J)   O(nnz(x) + J^3)
General tensor X             O(nnz(X))    O(nnz(X))             O(nnz(X))
Storage for hash functions   O(I^3)       O(I)                  O(I)
13.5 Applications
It should be noted that the discussions here cannot cover everything. Readers are
encouraged to refer to related papers (most with available codes) for more details.
For example, a tensor sketch-based acceleration of the robust tensor power method (RTPM) is proposed
for fast CP decomposition [39]. RTPM is an iterative method for CP decomposition
proposed in [3]. It extracts one rank-1 component at a time until the CP rank
reaches the upper bound or the approximation error falls below a threshold.
The update of RTPM requires two tensor contractions $\mathcal{T}(\mathbf{u},\mathbf{u},\mathbf{u})$ and $\mathcal{T}(\mathbf{I},\mathbf{u},\mathbf{u})$,
defined as
$$
\mathcal{T}(\mathbf{u},\mathbf{u},\mathbf{u})=\langle\mathcal{T},\,\mathbf{u}\circ\mathbf{u}\circ\mathbf{u}\rangle,\qquad
\mathcal{T}(\mathbf{I},\mathbf{u},\mathbf{u})=\big[\langle\mathcal{T},\,\mathbf{e}_1\circ\mathbf{u}\circ\mathbf{u}\rangle,\cdots,\langle\mathcal{T},\,\mathbf{e}_I\circ\mathbf{u}\circ\mathbf{u}\rangle\big]^{\mathsf T},
\qquad (13.14)
$$
where $\mathbf{e}_i$ denotes the $i$-th standard basis vector.
where (13.17) holds due to the unitary property of the FFT. Both $\mathrm{TS}(\mathcal{T})$ and $\mathbf{s}$ are
independent of $i$ and can be calculated beforehand. Both (13.16) and (13.17) can be
computed in $O(I+J\log J)$ time. FRTPM requires $O(NLT(I+DJ\log J))$ time and
$O(DJ)$ storage, where $D$ denotes the number of independent sketches and $J$ denotes
the sketching dimension. In this way, the computation and space complexity can be
greatly reduced in comparison with the naive RTPM.
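The core trick behind this acceleration can be illustrated as follows: the contraction $\mathcal{T}(\mathbf{u},\mathbf{u},\mathbf{u})=\langle\mathcal{T},\mathbf{u}\circ\mathbf{u}\circ\mathbf{u}\rangle$ is approximated by the inner product of $\mathrm{TS}(\mathcal{T})$ (computed once) with the cheaply obtained $\mathrm{TS}(\mathbf{u}\circ\mathbf{u}\circ\mathbf{u})$. The Python sketch below is a minimal, unoptimized rendering of this idea on a near-rank-1 tensor and is not the FRTPM implementation of [39].

```python
import numpy as np

def count_sketch(v, h, s, J):
    out = np.zeros(J)
    np.add.at(out, h, s * v)
    return out

def ts_dense(T, hashes, signs, J):
    """TS of a dense third-order tensor via the combined hash/sign (O(nnz(T)) time)."""
    idx = (hashes[0][:, None, None] + hashes[1][None, :, None]
           + hashes[2][None, None, :]) % J
    sgn = signs[0][:, None, None] * signs[1][None, :, None] * signs[2][None, None, :]
    out = np.zeros(J)
    np.add.at(out, idx.ravel(), (sgn * T).ravel())
    return out

def ts_rank1(u, hashes, signs, J):
    F = np.ones(J, dtype=complex)
    for h, s in zip(hashes, signs):
        F *= np.fft.fft(count_sketch(u, h, s, J))
    return np.real(np.fft.ifft(F))

rng = np.random.default_rng(2)
I, J = 40, 4096
v = rng.normal(size=I); v /= np.linalg.norm(v)
T = 10.0 * np.einsum('i,j,k->ijk', v, v, v) + 0.1 * rng.normal(size=(I, I, I))
u = v                                               # query direction
hashes = [rng.integers(0, J, size=I) for _ in range(3)]
signs = [rng.choice([-1.0, 1.0], size=I) for _ in range(3)]
TS_T = ts_dense(T, hashes, signs, J)                # computed once, reused for every u
approx = TS_T @ ts_rank1(u, hashes, signs, J)       # ≈ T(u, u, u)
exact = np.einsum('ijk,i,j,k->', T, u, u, u)
print(approx, exact)   # agree up to sketching error; medians over D sketches reduce it further
```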
To validate the effectiveness of FRTPM, we conduct experiments on a synthetic
symmetric tensor $\mathcal{T}\in\mathbb{R}^{100\times 100\times 100}$ and a real-world asymmetric dataset. For the
synthetic experiment, the approximation performance is evaluated by the
residual norm, which is defined as
$$
\mathrm{RN}(\mathcal{T},\hat{\mathcal{T}})=\|\mathcal{T}-\hat{\mathcal{T}}\|_F.
$$
3 https://github.com/andrewssobral/lrslibrary.
Table 13.2 Residual norm and running time on rank-10 approximation of a synthetic symmetric tensor T ∈ R^{100×100×100} by RTPM and FRTPM. σ denotes the level of Gaussian noise

                               Residual norm                      Running time (s)
            Sketch dim         1500    2000    2500    3000       1500    2000    2500    3000
σ = 0.01    FRTPM   D = 2      1.0070  0.9991  0.9683  0.7765     1.1099  1.2477  1.8701  2.2574
                    D = 4      0.9265  0.6641  0.5702  0.5171     2.2053  2.6157  3.8553  4.1976
                    D = 6      0.6114  0.5590  0.4783  0.4294     3.4393  3.6519  5.2013  5.5934
            RTPM    /          0.0998                             49.4443
σ = 0.1     FRTPM   D = 2      1.0503  1.0506  1.0388  0.8760     1.1551  1.2745  1.9331  2.1459
                    D = 4      1.0372  0.7568  0.6922  0.5990     2.1082  2.4586  3.5557  4.0272
                    D = 6      0.7579  0.6371  0.5860  0.5382     3.1546  3.7927  5.3187  5.4595
            RTPM    /          0.3152                             50.5022
Table 13.3 PSNR and running time on rank-10 approximation of dataset Shop by RTPM and FRTPM

                      PSNR                                              Running time (s)
         Sketch dim   5000     5500     6000     6500     7000         5000     5500     6000     6500     7000
FRTPM    D = 5        15.5824  20.2086  20.4758  19.0921  21.2039      4.9433   5.1242   5.8073   5.7232   6.0740
         D = 10       23.7788  24.2276  24.5138  24.9739  24.8809      9.1855   9.9487   10.7859  11.4765  14.4035
         D = 15       24.8358  25.1802  25.0410  25.5410  25.7796      16.2121  18.1471  20.1453  21.3052  22.7188
RTPM     /            28.9158                                          268.9146
Fig. 13.2 Rank-10 approximation on dataset Shop. The first column: RTPM (ground truth). The
second to eighth columns: sketching dimension of 5000, 5500, · · · , 7000, respectively. (a) D = 5.
(b) D = 10. (c) D = 15
For Tucker decomposition, a given tensor $\mathcal{X}\in\mathbb{R}^{I_1\times\cdots\times I_N}$ is approximated as
$$
\mathcal{X}\approx\mathcal{G}\times_1\mathbf{U}^{(1)}\times_2\mathbf{U}^{(2)}\cdots\times_N\mathbf{U}^{(N)}
=[\![\mathcal{G};\mathbf{U}^{(1)},\mathbf{U}^{(2)},\cdots,\mathbf{U}^{(N)}]\!],
$$
where $\mathcal{G}\in\mathbb{R}^{R_1\times\cdots\times R_N}$ is the core tensor and $\mathbf{U}^{(n)}\in\mathbb{R}^{I_n\times R_n}$ $(n=1,\cdots,N)$
are the factor matrices. Higher-order orthogonal iteration (HOOI) is a commonly used
algorithm for Tucker decomposition (see Algorithm 4 in Chap. 2). The standard
HOOI involves iteratively solving two alternating least squares (ALS) problems, (13.19) for the
factor matrices and (13.20) for the core tensor. As discussed at the beginning of this chapter,
a least squares problem can be approximated by its sketched counterpart with
$\|\mathbf{A}\tilde{\mathbf{x}}-\mathbf{y}\|_F\le(1+\epsilon)\|\mathbf{A}\mathbf{x}^{*}-\mathbf{y}\|_F$ [14]. Therefore, (13.19) and (13.20) can be
approximated by
$$
\mathbf{U}^{(n)}=\operatorname*{arg\,min}_{\mathbf{U}}\big\|\mathrm{TS}^{(n)}\big((\otimes_{i=N,i\neq n}^{1}\mathbf{U}^{(i)})\mathbf{G}_{(n)}^{\mathsf T}\big)\mathbf{U}^{\mathsf T}-\mathrm{TS}^{(n)}\big(\mathbf{X}_{(n)}^{\mathsf T}\big)\big\|_F^2,
\qquad (13.23)
$$
$$
\mathrm{vec}(\mathcal{G})=\operatorname*{arg\,min}_{\mathrm{vec}(\mathcal{G})}\big\|\mathrm{TS}^{(N+1)}\big(\otimes_{i=N}^{1}\mathbf{U}^{(i)}\big)\mathrm{vec}(\mathcal{G})-\mathrm{TS}^{(N+1)}\big(\mathrm{vec}(\mathcal{X})\big)\big\|_F^2,
\qquad (13.24)
$$
where $\otimes_{n=N}^{1}\mathbf{U}^{(n)}$ is short for $\mathbf{U}^{(N)}\otimes\cdots\otimes\mathbf{U}^{(1)}$, $\mathrm{CS}_1^{(n)}:\mathbb{R}^{I_n}\to\mathbb{R}^{J_1}$ and $\mathrm{CS}_2^{(n)}:
\mathbb{R}^{I_n}\to\mathbb{R}^{J_2}$ $(J_1<J_2$, $n=1,\cdots,N)$ are two different count sketch operators, and
$\mathrm{TS}^{(n)}:\mathbb{R}^{\prod_{i\neq n}I_i}\to\mathbb{R}^{J_1}$ and $\mathrm{TS}^{(N+1)}:\mathbb{R}^{\prod_{n}I_n}\to\mathbb{R}^{J_2}$ are two different tensor sketch
operators built on $\mathrm{CS}_1^{(n)}$ and $\mathrm{CS}_2^{(n)}$, respectively.
Substituting (13.23) and (13.24) into Algorithm 4, we obtain the first algorithm,
named Tucker-TS, summarized in Algorithm 54.
The second algorithm, Tucker-TTMTS, applies tensor sketch to a given tensor
that can be represented in chains of $n$-mode tensor-times-matrix (TTM) form, which the
standard HOOI requires computing. In particular, the core tensor is approximated from the
sketched quantities as
$$
\mathrm{vec}(\mathcal{G})=\big(\mathrm{TS}^{(N+1)}(\otimes_{i=N}^{1}\mathbf{U}^{(i)})\big)^{\mathsf T}\,\mathrm{TS}^{(N+1)}\big(\mathrm{vec}(\mathcal{X})\big).
\qquad (13.28)
$$
4 https://github.com/andrewssobral/lrslibrary.
Fig. 13.3 Comparison of Tucker-TS, Tucker-TTMTS, and Tucker-ALS on a rank (10, 10, 10)
synthetic sparse tensor T ∈ R500×500×500 . K is a parameter with regard to the sketching dimension.
(a) Relative norm against different values of K. (b) Running time against different values of K
Fig. 13.4 Rank-10 Tucker-TS approximation on dataset escalator. The first column: Tucker-ALS
(ground truth). The second to ninth columns: Tucker-TS with K = 4, 6, · · · , 20
In Fig. 13.3, we compare the relative error and running time of Tucker-TS,
Tucker-TTMTS, and Tucker-ALS on the sparse synthetic tensor $\mathcal{T}\in\mathbb{R}^{500\times 500\times 500}$
with Tucker rank (10, 10, 10). Tucker-TS achieves a relative error similar to that of
Tucker-ALS, while Tucker-TTMTS is the fastest at the cost of a higher relative error.
Figure 13.4 displays the rank-10 approximations of Tucker-ALS and Tucker-TS
on the dataset Escalator. The approximation is acceptable even when the
sketching parameter $K$ is small, and it improves as $K$ increases.
Recall the over-determined least squares problem $\min_{\mathbf{x}\in\mathbb{R}^{I}}\|\mathbf{A}\mathbf{x}-\mathbf{y}\|_F$ at the begin-
ning of this chapter. In [14], Diao et al. consider the case where $\mathbf{A}$ is a Kronecker
product of several matrices, e.g., $\mathbf{A}=\mathbf{B}\otimes\mathbf{C}$. By applying tensor sketch, one obtains
$\tilde{\mathbf{x}}=\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{I}}\|\mathrm{TS}((\mathbf{B}\otimes\mathbf{C})\mathbf{x}-\mathbf{y})\|_F$ with $\|\mathbf{A}\tilde{\mathbf{x}}-\mathbf{y}\|_F\le(1+\epsilon)\|\mathbf{A}\mathbf{x}^{*}-\mathbf{y}\|_F$ in
$O(\mathrm{nnz}(\mathbf{B})+\mathrm{nnz}(\mathbf{C})+\mathrm{nnz}(\mathbf{y})+\mathrm{poly}(N_1N_2/\epsilon))$ time, where $N_1$, $N_2$ are the column sizes
of $\mathbf{B}$, $\mathbf{C}$, respectively, and $\mathrm{poly}(\cdot)$ denotes some low-order polynomial.
Considering more general cases, given $\mathbf{A}^{(n)}\in\mathbb{R}^{P_n\times Q_n}$ for $n=1,\cdots,N$, we
denote $\mathbf{A}=\mathbf{A}^{(1)}\otimes\cdots\otimes\mathbf{A}^{(N)}\in\mathbb{R}^{P\times Q}$ with $P=\prod_{n=1}^{N}P_n$ and $Q=\prod_{n=1}^{N}Q_n$.
Based on the fact that applying tensor sketch to the Kronecker product of several
matrices equals convolving their count sketches [5], we can accelerate the
computation using the FFT. Therefore, for $q_n\in[Q_n]$, $n=1,\cdots,N$, the tensor sketch
of the corresponding column of $\mathbf{A}$ is
$$
\mathrm{TS}\big(\mathbf{A}^{(1)}(:,q_1)\otimes\cdots\otimes\mathbf{A}^{(N)}(:,q_N)\big)
=\mathcal{F}^{-1}\big(\mathcal{F}(\mathrm{CS}_1(\mathbf{A}^{(1)}(:,q_1)))*\cdots*\mathcal{F}(\mathrm{CS}_N(\mathbf{A}^{(N)}(:,q_N)))\big).
$$
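A small Python sketch of this column-wise construction is given below; it never forms $\mathbf{A}=\mathbf{B}\otimes\mathbf{C}$ explicitly and checks one column against a direct computation. The helper names and sizes are illustrative.

```python
import numpy as np

def cs_cols(M, h, s, J):
    """Count sketch applied to every column of M; returns a J x Q array."""
    out = np.zeros((J, M.shape[1]))
    np.add.at(out, h, s[:, None] * M)
    return out

def ts_kron_cols(mats, hashes, signs, J):
    """TS of every column of mats[0] ⊗ mats[1] ⊗ ... without forming the Kronecker product."""
    F = np.ones((J, 1), dtype=complex)
    for M, h, s in zip(mats, hashes, signs):
        FM = np.fft.fft(cs_cols(M, h, s, J), axis=0)
        F = np.einsum('jp,jq->jpq', F, FM).reshape(J, -1)   # columns in Kronecker order
    return np.real(np.fft.ifft(F, axis=0))

rng = np.random.default_rng(3)
B, C = rng.normal(size=(60, 4)), rng.normal(size=(50, 5))
J = 256
hashes = [rng.integers(0, J, size=M.shape[0]) for M in (B, C)]
signs = [rng.choice([-1.0, 1.0], size=M.shape[0]) for M in (B, C)]
S_cols = ts_kron_cols([B, C], hashes, signs, J)             # J x (4*5)

# Check column (q1, q2) = (2, 3) against sketching the explicit Kronecker column.
col = np.kron(B[:, 2], C[:, 3])
H = (hashes[0][:, None] + hashes[1][None, :]).ravel() % J   # combined hash
S = (signs[0][:, None] * signs[1][None, :]).ravel()
direct = np.zeros(J); np.add.at(direct, H, S * col)
print(np.allclose(S_cols[:, 2 * 5 + 3], direct))            # True
```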
To test the algorithm performance, we use the codes in [27] on the MNIST dataset
[23]. The experiment aims to see how well the distance between handwritten digits
can be preserved by TS. After preprocessing (see [27] for details), we obtain a rank-
10 CPD-based approximation of a tensor $\mathcal{A}=[\![\mathbf{A}^{(1)},\mathbf{A}^{(2)},\mathbf{A}^{(3)}]\!]\in\mathbb{R}^{32\times 32\times 100}$ con-
taining handwritten “4”s and another tensor $\mathcal{B}=[\![\mathbf{B}^{(1)},\mathbf{B}^{(2)},\mathbf{B}^{(3)}]\!]\in\mathbb{R}^{32\times 32\times 100}$
for handwritten “9”s. The reason for choosing this particular pair of digits is that
handwritten “4”s and “9”s can be hard to distinguish, especially after the low-rank
approximation.
Denote $\mathbf{A}:=\mathbf{A}^{(1)}\odot\mathbf{A}^{(2)}\odot\mathbf{A}^{(3)}\in\mathbb{R}^{102400\times 10}$ and $\mathbf{B}:=\mathbf{B}^{(1)}\odot\mathbf{B}^{(2)}\odot\mathbf{B}^{(3)}\in
\mathbb{R}^{102400\times 10}$, where $\odot$ denotes the Khatri–Rao product. We take
$$
\frac{\|\mathrm{TS}(\mathbf{A}-\mathbf{B})\|_F}{\|\mathbf{A}-\mathbf{B}\|_F}-1
\qquad (13.31)
$$
as the performance metric.
The result is shown in Fig. 13.5. Clearly, as the sketching dimension increases, the estimation becomes more accurate.
In [14], similar extensive work on regularized spline regression is presented;
however, this is beyond the scope of our discussion.
Fig. 13.5 The performance in terms of the metric (13.31) against the increasing sketching
dimension, characterized by the (a) mean value, (b) standard deviation, and (c) maximum value
over 1000 trials
Deep neural networks have received a lot of attention in many fields due to their powerful
representation capability (see Chaps. 10 and 11). However, as networks
grow deeper, the number of parameters increases rapidly, which
leads to a huge computational and storage burden. Therefore, it is helpful if we
can approximate the original networks with smaller ones while preserving their
intrinsic structures. In this subsection, we introduce two works using tensor sketch
for network approximation.
The traditional convolutional (CONV) and fully connected (FC) layers of a convolu-
tional neural network (CNN) can be represented by
$$
\mathcal{I}_{\mathrm{out}}=\mathrm{Conv}(\mathcal{I}_{\mathrm{in}},\mathcal{K}),\qquad
\mathbf{h}_{\mathrm{out}}=\mathbf{W}\mathbf{h}_{\mathrm{in}}+\mathbf{b},
\qquad (13.32)
$$
where $\mathcal{I}_{\mathrm{in}}\in\mathbb{R}^{H_1\times W_1\times I_1}$ and $\mathcal{I}_{\mathrm{out}}\in\mathbb{R}^{H_2\times W_2\times I_2}$ are the tensors before and after
the CONV layer, respectively, and $\mathbf{h}_{\mathrm{in}}\in\mathbb{R}^{I_1}$ and $\mathbf{h}_{\mathrm{out}}\in\mathbb{R}^{I_2}$ are the vectors before
and after the FC layer, respectively. $\mathcal{K}\in\mathbb{R}^{I_1\times H\times W\times I_2}$ and $\mathbf{W}\in\mathbb{R}^{I_2\times I_1}$ denote the
convolutional kernel and the weight matrix, respectively. When the input dimensions
of $\mathcal{I}_{\mathrm{in}}$ are high, the heavy computation and storage requirements of CNNs can burden
hardware implementation.
In [21], it is shown that the operations in (13.32) can be approximated by a
randomized tensor sketch approach. Specifically, the mode-$n$ sketch of a tensor $\mathcal{X}\in
\mathbb{R}^{I_1\times\cdots\times I_N}$ is defined by the mode-$n$ tensor-matrix product
$$
\mathcal{S}^{(n)}=\mathcal{X}\times_n\mathbf{U},
\qquad (13.33)
$$
where $\mathbf{U}$ is a random scaled sign matrix of appropriate size. Based on such sketches
$\mathcal{S}_1^{(d)}$ and $\mathcal{S}_2^{(d)}$ of the convolutional kernel $\mathcal{K}$, the sketched CONV layer (SK-CONV) computes
$$
\hat{\mathcal{I}}_{\mathrm{out}}=\frac{1}{2D}\sum_{d=1}^{D}\mathrm{Conv}\Big(\mathcal{I}_{\mathrm{in}},\ \mathcal{S}_1^{(d)}\times_4\mathbf{U}_1^{(d)\mathsf T}+\mathcal{S}_2^{(d)}\mathbf{U}_2^{(d)\mathsf T}\Big),
\qquad (13.34)
$$
where $\mathbf{U}_1^{(d)}\in\mathbb{R}^{K\times I_2}$ and $\mathbf{U}_2^{(d)}\in\mathbb{R}^{KHW\times I_1HW}$ are two independent random scaled
sign matrices for $d\in[D]$. The estimation is made more robust by taking $D$
sketches and computing the average. $\mathcal{S}_2^{(d)}\mathbf{U}_2^{(d)\mathsf T}\in\mathbb{R}^{I_1\times H\times W\times I_2}$ denotes a tensor
contraction computed element-wise by
$$
\big(\mathcal{S}_2^{(d)}\mathbf{U}_2^{(d)\mathsf T}\big)(p,q,r,s)=\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{w=1}^{W}\mathcal{S}_2^{(d)}(k,h,w,s)\,\mathbf{U}_2^{(d)}(khw,\,pqr).
\qquad (13.35)
$$
Moreover, $\hat{\mathcal{I}}_{\mathrm{out}}$ is proved to be an unbiased estimator of $\mathcal{I}_{\mathrm{out}}$ with bounded variance [21].
The sketched FC layer (SK-FC) can be built in a similar way as
$$
\hat{\mathbf{h}}_{\mathrm{out}}=\frac{1}{2D}\sum_{d=1}^{D}\big(\mathbf{S}_1^{(d)}\times_1\mathbf{U}_1^{(d)\mathsf T}+\mathbf{S}_2^{(d)}\times_2\mathbf{U}_2^{(d)\mathsf T}\big)\mathbf{h}_{\mathrm{in}}+\mathbf{b}
=\frac{1}{2D}\sum_{d=1}^{D}\big(\mathbf{U}_1^{(d)\mathsf T}\mathbf{S}_1^{(d)}+\mathbf{S}_2^{(d)}\mathbf{U}_2^{(d)}\big)\mathbf{h}_{\mathrm{in}}+\mathbf{b},
\qquad (13.36)
$$
where $\mathbf{S}_1^{(d)}$ and $\mathbf{S}_2^{(d)}$ are the mode-1 and mode-2 sketches of the weight matrix $\mathbf{W}$.
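The following Python snippet numerically illustrates the SK-FC construction in (13.36): the two sketches of W (which are what would actually be stored) are recombined with the scaled sign matrices and, averaged over D draws, approximate the original layer output. The $1/\sqrt{K}$ scaling of the sign matrices is an assumption made so that the estimator is unbiased in this sketch; all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
I1, I2, K, D = 256, 128, 64, 20
W = rng.normal(size=(I2, I1)); b = rng.normal(size=I2); h_in = rng.normal(size=I1)

h_hat = np.zeros(I2)
for _ in range(D):
    U1 = rng.choice([-1.0, 1.0], size=(K, I2)) / np.sqrt(K)   # scaled sign matrices
    U2 = rng.choice([-1.0, 1.0], size=(K, I1)) / np.sqrt(K)
    S1 = U1 @ W              # mode-1 sketch of W (K x I1), stored instead of W
    S2 = W @ U2.T            # mode-2 sketch of W (I2 x K)
    h_hat += (U1.T @ S1 + S2 @ U2) @ h_in
h_hat = h_hat / (2 * D) + b

rel_err = np.linalg.norm(h_hat - (W @ h_in + b)) / np.linalg.norm(W @ h_in + b)
print(rel_err)               # accuracy improves as K and D grow, at the cost of more parameters
```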
Table 13.4 Number of parameters and computation time for SK-CONV and SK-FC and the corresponding baselines (assuming KN ≤ I_1 I_2 / (I_1 + I_2) holds)

Layer name   Number of parameters    Computation time
CONV         H W I_1 I_2             O(H_2 W_2 I_1 I_2)
SK-CONV      D H W K (I_1 + I_2)     O(D H_2 W_2 K (I_1 + I_2))
FC           I_1 I_2                 O(I_1 I_2)
SK-FC        D K (I_1 + I_2)         O(D K (I_1 + I_2))
In [22], the tensor regression layer (TRL) is proposed to replace the flattening operation
and fully connected layers of traditional CNNs with learnable low-rank regression
weights. Given the input activation $\mathcal{X}\in\mathbb{R}^{I_0\times I_1\times\cdots\times I_N}$ and the regression weight tensor
$\mathcal{W}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N\times I_{N+1}}$, where $I_0$ and $I_{N+1}$ denote the batch size and the number of
classes, respectively, the prediction output $\mathbf{Y}\in\mathbb{R}^{I_0\times I_{N+1}}$ is computed
element-wise by
$$
\mathbf{Y}(i,j)=\big\langle\mathbf{X}_{[1]}(i,:)^{\mathsf T},\ \mathbf{W}_{[N+1]}(j,:)^{\mathsf T}\big\rangle+b.
\qquad (13.37)
$$
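In matrix form, (13.37) is simply $\mathbf{Y}=\mathbf{X}_{[1]}\mathbf{W}_{[N+1]}^{\mathsf T}+b$, as the short Python check below illustrates; the shapes are chosen to mirror Fig. 13.6, and the random data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(6)
I0, dims, C = 8, (7, 7, 32), 10              # batch size, activation shape, classes
X = rng.normal(size=(I0, *dims))             # input activations
W = rng.normal(size=(*dims, C))              # regression weight tensor
b = 0.5

X1 = X.reshape(I0, -1)                       # X_[1]: unfold along the batch mode
W_last = np.moveaxis(W, -1, 0).reshape(C, -1)  # W_[N+1]: unfold along the class mode
Y = X1 @ W_last.T + b                        # all entries of (13.37) at once

# Element-wise check of (13.37) for one pair (i, j):
i, j = 3, 7
assert np.isclose(Y[i, j], X1[i] @ W_last[j] + b)
print(Y.shape)                               # (8, 10)
```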
In [34], it is shown that HCS can be applied to further sketch the weights
and the input activation before the regression is performed. More specifically, suppose the
tensor weight admits a Tucker decomposition $\mathcal{W}=[\![\mathcal{G};\mathbf{U}^{(1)},\mathbf{U}^{(2)},\cdots,\mathbf{U}^{(N+1)}]\!]$;
then HCS can be applied to sketch the factors of this decomposition and the input activation,
so that the regression in (13.37) is carried out in the sketched space.
5 https://github.com/xwcao/LowRankTRN.
Fig. 13.6 Network structures of the TRL (a) and HCS-TRL (b). The batch size is set to 1 for ease
of illustration
Table 13.5 The comparison of TRL and HCS-TRL on dataset FMNIST

Model     Compression ratio (%)   Test accuracy (%)
TRL       /                       87.08
HCS-TRL   5.8                     85.39
13.6 Conclusion
inputs. Therefore, learning-based sketches also open a fresh research domain for
enhancing the performance of sketching methods for data of interest [19, 20, 24].
References
1. Ahle, T.D., Kapralov, M., Knudsen, J.B., Pagh, R., Velingker, A., Woodruff, D.P., Zandieh,
A.: Oblivious sketching of high-degree polynomial kernels. In: Proceedings of the Fourteenth
Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 141–160. SIAM, Philadelphia
(2020)
2. Ailon, N., Chazelle, B.: Approximate nearest neighbors and the fast johnson-lindenstrauss
transform. In: Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of
Computing, pp. 557–563 (2006)
3. Anandkumar, A., Ge, R., Hsu, D., Kakade, S., Telgarsky, M.: Tensor decompositions for
learning latent variable models. J. Mach. Learn. Res. 15, 2773–2832 (2014)
4. Anandkumar, A., Ge, R., Janzamin, M.: Guaranteed non-orthogonal tensor decomposition via
alternating rank-1 updates (2014, preprint). arXiv:1402.5180
5. Avron, H., Nguyen, H., Woodruff, D.: Subspace embeddings for the polynomial kernel. In:
Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in
Neural Information Processing Systems, vol. 27. Curran Associates, Red Hook (2014). https://
proceedings.neurips.cc/paper/2014/file/b571ecea16a9824023ee1af16897a582-Paper.pdf
6. Avron, H., Nguyễn, H.L., Woodruff, D.P.: Subspace embeddings for the polynomial
kernel. In: Proceedings of the 27th International Conference on Neural Information Processing
Systems - Volume 2, NIPS’14, pp. 2258–2266. MIT Press, Cambridge (2014)
7. Bhojanapalli, S., Sanghavi, S.: A new sampling technique for tensors (2015, preprint).
arXiv:1502.05023
8. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to
image and text data. In: Proceedings of the Seventh ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 245–250 (2001)
9. Bringmann, K., Panagiotou, K.: Efficient sampling methods for discrete distributions. In:
International Colloquium on Automata, Languages, and Programming, pp. 133–144. Springer,
Berlin (2012)
10. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. Autom.
Lang. Program. 2380, 693–703 (2002)
11. Clarkson, K.L., Woodruff, D.P.: Low-rank approximation and regression in input sparsity time.
J. ACM 63(6), 1–45 (2017)
12. Demaine, E.D., López-Ortiz, A., Munro, J.I.: Frequency estimation of internet packet streams
with limited space. In: European Symposium on Algorithms, pp. 348–360. Springer, Berlin
(2002)
13. Diao, H., Jayaram, R., Song, Z., Sun, W., Woodruff, D.: Optimal sketching for kronecker prod-
uct regression and low rank approximation. In: Advances in Neural Information Processing
Systems, pp. 4737–4748 (2019)
14. Diao, H., Song, Z., Sun, W., Woodruff, D.P.: Sketching for kronecker product regression and
p-splines. In: Proceedings of International Conference on Artificial Intelligence and Statistics
(AISTATS 2018) (2018)
15. Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
16. Eldar, Y.C., Kutyniok, G.: Compressed Sensing: Theory and Applications. Cambridge Univer-
sity Press, Cambridge (2012)
17. Gama, F., Marques, A.G., Mateos, G., Ribeiro, A.: Rethinking sketching as sampling: a graph
signal processing approach. Signal Process. 169, 107404 (2020)
320 13 Tensor Sketch
18. Han, I., Avron, H., Shin, J.: Polynomial tensor sketch for element-wise function of low-
rank matrix. In: International Conference on Machine Learning, pp. 3984–3993. PMLR,
Westminster (2020)
19. Hsu, C.Y., Indyk, P., Katabi, D., Vakilian, A.: Learning-based frequency estimation algorithms.
In: International Conference on Learning Representations (2019)
20. Indyk, P., Vakilian, A., Yuan, Y.: Learning-based low-rank approximations. In: Advances in
Neural Information Processing Systems, vol. 32. Curran Associates, Red Hook (2019). https://
proceedings.neurips.cc/paper/2019/file/1625abb8e458a79765c62009235e9d5b-Paper.pdf
21. Kasiviswanathan, S.P., Narodytska, N., Jin, H.: Network approximation using tensor sketching.
In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelli-
gence, IJCAI-18, pp. 2319–2325. International Joint Conferences on Artificial Intelligence
Organization (2018). https://doi.org/10.24963/ijcai.2018/321
22. Kossaifi, J., Lipton, Z.C., Kolbeinsson, A., Khanna, A., Furlanello, T., Anandkumar, A.: Tensor
regression networks. J. Mach. Learn. Res. 21, 1–21 (2020)
23. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proc. IEEE 86(11), 2278–2324 (1998)
24. Liu, S., Liu, T., Vakilian, A., Wan, Y., Woodruff, D.P.: Extending and improving learned
CountSketch (2020, preprint). arXiv:2007.09890
25. Ma, J., Zhang, Q., Ho, J.C., Xiong, L.: Spatio-temporal tensor sketching via adaptive sampling
(2020, preprint). arXiv:2006.11943
26. Malik, O.A., Becker, S.: Low-rank tucker decomposition of large tensors using tensorsketch.
Adv. Neural Inf. Process. Syst. 31, 10096–10106 (2018)
27. Malik, O.A., Becker, S.: Guarantees for the kronecker fast johnson–lindenstrauss transform
using a coherence and sampling argument. Linear Algebra Appl. 602, 120–137 (2020)
28. Nelson, J., Nguyên, H.L.: OSNAP: faster numerical linear algebra algorithms via sparser
subspace embeddings. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer
Science, pp. 117–126. IEEE, Piscataway (2013)
29. Pagh, R.: Compressed matrix multiplication. ACM Trans. Comput. Theory 5(3), 1–17 (2013)
30. Pham, N., Pagh, R.: Fast and scalable polynomial kernels via explicit feature maps. In:
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD’13, vol. 128815, pp. 239–247. ACM, New York (2013)
31. Prasad Kasiviswanathan, S., Narodytska, N., Jin, H.: Deep neural network approximation using
tensor sketching (2017, e-prints). arXiv:1710.07850
32. Sarlos, T.: Improved approximation algorithms for large matrices via random projections. In:
2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pp.
143–152. IEEE, Piscataway (2006)
33. Sarlós, T., Benczúr, A.A., Csalogány, K., Fogaras, D., Rácz, B.: To randomize or not to
randomize: space optimal summaries for hyperlink analysis. In: Proceedings of the 15th
International Conference on World Wide Web, pp. 297–306 (2006)
34. Shi, Y.: Efficient tensor operations via compression and parallel computation. Ph.D. Thesis,
UC Irvine (2019)
35. Shi, Y., Anandkumar, A.: Higher-order count sketch: dimensionality reduction that retains
efficient tensor operations. In: Data Compression Conference, DCC 2020, Snowbird, March
24–27, 2020, p. 394. IEEE, Piscataway (2020). https://doi.org/10.1109/DCC47342.2020.00045
36. Sun, Y., Guo, Y., Luo, C., Tropp, J., Udell, M.: Low-rank tucker approximation of a tensor
from streaming data. SIAM J. Math. Data Sci. 2(4), 1123–1150 (2020)
37. Vempala, S.S.: The Random Projection Method, vol. 65. American Mathematical Society,
Providence (2005)
38. Wang, Y., Tung, H.Y., Smola, A., Anandkumar, A.: Fast and guaranteed tensor decomposition
via sketching. UC Irvine 1, 991–999 (2015). https://escholarship.org/uc/item/6zt3b0g3
39. Wang, Y., Tung, H.Y., Smola, A.J., Anandkumar, A.: Fast and guaranteed tensor decomposition
via sketching. In: Advances in Neural Information Processing Systems, pp. 991–999 (2015)
40. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor
classification. J. Mach. Learn. Res. 10, 207–244 (2009)
References 321
41. Woodruff, D.P., et al.: Sketching as a tool for numerical linear algebra. Found. Trends® Theor.
Comput. Sci. 10(1–2), 1–157 (2014)
42. Xia, D., Yuan, M.: Effective tensor sketching via sparsification. IEEE Trans. Inf. Theory 67(2),
1356–1369 (2021)
43. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for benchmarking
machine learning algorithms (2017, preprint). arXiv:1708.07747 (2017)
44. Yang, B., Zamzam, A., Sidiropoulos, N.D.: Parasketch: parallel tensor factorization via
sketching. In: Proceedings of the 2018 SIAM International Conference on Data Mining, pp.
396–404. SIAM, Philadelphia (2018)
45. Yang, B., Zamzam, A.S., Sidiropoulos, N.D.: Large scale tensor factorization via parallel
sketches. IEEE Trans. Knowl. Data Eng. (2020). https://doi.org/10.1109/TKDE.2020.2982144
Appendix A
Tensor Software
As multiway data widely exist in many data processing applications, software
that can directly perform tensor-based machine learning is needed in many
fields. After years of development, there are various packages on different platforms.
Here we conduct a relatively complete survey of the existing packages, and the
details are summarized in Table A.1.
Python and MATLAB are the two main platforms for these packages. TensorLy [8]
and tntorch [4, 6, 7] are for Python, while TensorToolbox [1–3] and Tensorlab [5, 7,
10, 11] are for MATLAB. There are also packages in C, C++, and OpenMP, such as
mptensor and SPLATT [9].
These packages are not only developed for different platforms but also focus
on different specific fields. Most of the existing packages can perform basic tensor
operations and decompositions. Apart from that, tntorch and hottbox can solve
tensor regression, classification, and statistical analysis problems. TensorToolbox
and Tensorlab can deal with structured tensors. Data fusion problems can be solved
by Tensorlab, and sensitivity analysis can be performed with tntorch. According to
Table A.1, it is worth noting that no single package integrates all operations
on one platform. One needs to turn to different platforms and refer to the
corresponding user manuals in order to achieve different goals.
We also provide the links to the corresponding software and user manuals in Table A.2
at the end of this appendix.
References
1. Bader, B.W., Kolda, T.G.: Algorithm 862: MATLAB tensor classes for fast algorithm pro-
totyping. ACM Trans. Math. Softw. 32(4), 635–653 (2006). https://doi.org/10.1145/1186785.
1186794
2. Bader, B.W., Kolda, T.G.: Efficient MATLAB computations with sparse and factored tensors.
SIAM J. Sci. Comput. 30(1), 205–231 (2007). https://doi.org/10.1137/060676489
3. Bader, B.W., Kolda, T.G., et al.: Matlab tensor toolbox version 3.1 (2019). https://www.
tensortoolbox.org
4. Ballester-Ripoll, R., Paredes, E.G., Pajarola, R.: Sobol tensor trains for global sensitivity
analysis. Reliab. Eng. Syst. Saf. 183, 311–322 (2019)
5. Cichocki, A., Mandic, D., De Lathauwer, L., Zhou, G., Zhao, Q., Caiafa, C., Phan, H.A.: Tensor
decompositions for signal processing applications: from two-way to multiway component
analysis. IEEE Signal Process. Mag. 32(2), 145–163 (2015)
References 327
6. Constantine, P.G., Zaharatos, B., Campanelli, M.: Discovering an active subspace in a single-
diode solar cell model. Stat. Anal. Data Min. ASA Data Sci. J. 8(5–6), 264–273 (2015)
7. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500
(2009)
8. Kossaifi, J., Panagakis, Y., Anandkumar, A., Pantic, M.: Tensorly: tensor learning in python. J.
Mach. Learn. Res. 20(26), 1–6 (2019)
9. Smith, S., Karypis, G.: SPLATT: the surprisingly parallel spArse tensor toolkit (2016). http://
cs.umn.edu/~splatt/
10. Sorber, L., Barel, M.V., Lathauwer, L.D.: Numerical solution of bivariate and polyanalytic
polynomial systems. SIAM J. Numer. Anal. 52(4), 1551–1572 (2014)
11. Sorber, L., Domanov, I., Van Barel, M., De Lathauwer, L.: Exact line and plane search for
tensor optimization. Comput. Optim. Appl. 63(1), 121–142 (2016)
Index
Compression factor (CF) of tensor contraction tensor rank network (TRN) model,
layers, 247, 249 271–275
Computational complexity, 102, 103 trainable parameters, 266
Computation and optimization model, 27–28 deep plug-and-play (PnP)
Convex multilinear multitask learning ADMM, 270
(MLMTL-C), 192 denoisers, 270
Convex tensor nuclear norm, 176 framework, 271
Convolutional layer, 6, 243 optimization problem, 270
Convolutional neural network (CNN), tensor completion problem, 278–281
243–244, 315 deep unrolling
Co-regularization (Co-reg) method for spectral framework, 268
clustering, 235 handcrafted thresholding functions, 269
Cosparsity model, 59–61 inverse problems, 268
Count sketch (CS), 300 learnable operators, 269
computational and storage complexity of, nonlinear iteration algorithms, 267
305 optimization problem, 267
error analysis of, 301 snapshot compressive imaging (SCI),
Internet packet streams, 301 275–278
K-wise independent, 301 Deep neural network (DNN)
Rademacher distribution, 301 CIFAR-10 dataset, 259, 260
Coupled matrix and tensor factorization model convolutional neural network, 243–244
(CMTF), 116–118 cross-entropy error of multi-classification,
Coupled matrix/tensor component analysis, 242
115 disadvantages, 261
Coupled sparse tensor factorization (CSTF), 87 illustration of, 242
Coupled tensor image classification, 253, 256, 259, 261
applications low-rank tensor approximation
heterogeneous data, link prediction in, deep tensorized neural networks based
125–128 on mode-n product, 248–251
HSI-MSI fusion, 124–125 network structure design, 245
visual data recovery, 128–129 parameter optimization, 251–252
coupled matrix/tensor component analysis, tensor contraction layer, 247–248
115 t-product-based DNNs, 246
joint analysis of data, 115 mean absolute error, 241, 242
Coupled tensor component analysis methods mean square error, 241, 242
coupled matrix and tensor factorization MNIST dataset
model, 116–118 baseline network, 257
coupled tensor factorization model, stochastic gradient descent algorithm,
118–121 256
generalized coupled tensor factorization TT network, 257–259
model, 121–124 Tucker network, 257–259
Coupled tensor factorization model, 118–121 recurrent neural network, 244–245
CP decomposition-based approaches, 72–74 resource-constrained devices, 261
CP decomposition-based clustering, 221–223 tensor decomposition
Cross-entropy error (CEE), 242 CP decomposition, 254
Curse of dimensionality, 37, 53 and functions, 253
hierarchical Tucker decomposition,
254, 255
D single hidden layer convolutional
Deep learning methods arithmetic circuit, 254, 255
classical deep neural networks Deep plug-and-play (PnP)
linear layer, 267 ADMM, 270
in ResNet, 267 denoisers, 270
structure of, 266 framework, 271
H
F Hadamard product, 5, 158
Face clustering, 234, 235 Handwriting digit recognition, 214
Fast Fourier transform (FFT), 229, 246, 300 Heterogeneous data, 220
Fast Johnson–Lindenstrauss transform, 300 Heterogeneous information networks
Fast RTPM (FRTPM), 306 clustering, 231–233
Fast unit-rank tensor factorization method, 177 Hierarchical low-rank tensor ring
f -diagonal tensor, 3 decomposition, 46–48
sparse and cosparse representation, linear tensor regression models (see Linear
59–61 tensor regression models)
on mode-n product reduced rank regression, 167
analysis with noisy observations, 78–80 nonlinear tensor regression
noiseless analysis tensor dictionary Gaussian process regression, 186–187
learning, 77–78 kernel-based learning, 184–186
tensor convolutional analysis dictionary random forest-based tensor regression,
learning model, 80–81 188–189
T-linear-based approaches, 74–77 tensor additive models, 187–188
Tucker decomposition-based approaches, research directions, 194
65 tensor-on-tensor regression, 192–193
classical Tucker-based tensor dictionary tensor-on-vector regression, 191–192
learning, 65–68 tensor regression framework, 164–165
Tucker-based tensor dictionary learning vector-on-tensor regression
with normalized constraints, 68–71 experimental process of the age
Tucker-based tensor dictionary learning estimation task, 190
with orthogonal constraints, 71 hrSTR, 190
Tensor envelope PLS (TEPLS), 180 hrTRR, 190
Tensor envelope tensor response regression orSTR, 190
(TETRR), 192 predicted values vs. the ground truth of
Tensor factorization neural network (TFNN), the human age estimation task, 191
249, 250 SURF, 190
Tensor Fisher discriminant analysis (TFDA), Tensor regression layer (TRL), 317–318
213 Tensor ridge regression, 174–175
Tensor incoherence conditions, 136–137 Tensor ring completion via alternating least
Tensor inner product, 13, 201, 209, 306 square (TR-ALS), 102
Tensor networks, 37 Tensor ring decomposition, 45–46
hierarchical Tucker decomposition, 37–42 Tensor ring nuclear norm minimization for
computation of, 39–40 tensor completion (TRNNM), 102
generalization of, 40–42 Tensor singular value decomposition, 35–37
tensor train decomposition, 42–43 Tensor singular value decomposition
computation of, 43–44 (t-SVD)-based RPTCA
generations of, 44–46 block-RPTCA, 140, 141
Tensor neural network (TNN), 250, 251 classical RPTCA model and algorithm
Tensor nuclear norm (TNN), 97, 135 tensor average rank, 135–136
Tensor-on-tensor regression, 192–193 tensor incoherence conditions, 136–137
based on CP decomposition (TTR-CP), tensor nuclear norm, 135
192, 194 tensor singular value thresholding,
based on TT decomposition (TTR-TT), 137–139
192, 194 data model for RBPTCA, 141
Tensor-on-vector regression model, 175, discrete cosine transform, 145
191–192 frequency-weighted tensor nuclear norm,
Tensor partial least squares regression, 143–144
181–183 low-rank core matrix, 142–143
Tensor projected gradient (TPG), 171, nonconvex RPTCA, 144–145
172 outlier-robust model for tensor PCA, 139,
Tensor rank network (TRN) model 140
experimental analysis, 274–275 rotation invariance, 141–142
learning architecture, 272 tubal-PTCA, 139, 155
MSE performance, 275 twist tensor nuclear norm, 142
pre-decomposition architecture, 272–274 Tensor sketch
Tensor regression computational and storage complexity of,
high-dimensional data-related regression 305
tasks, 163–164 computational complexity, 303