
Received September 25, 2019, accepted October 19, 2019, date of publication October 28, 2019, date of current version November 19, 2019.


Digital Object Identifier 10.1109/ACCESS.2019.2949814

A Survey on Tensor Techniques and Applications in Machine Learning
YUWANG JI, QIANG WANG, XUAN LI, AND JIE LIU
Beijing University of Posts and Telecommunications, Beijing 100876, China
Corresponding author: Qiang Wang ([email protected])
This work was supported in part by the Beijing Natural Science Foundation under Grant L182037, in part by the National Natural Science
Foundation of China under Grant 61871045 and Grant 61325006, in part by the Beijing Natural Science Foundation under Grant L172033,
in part by the National Natural Science Foundation of China under Grant 6197106661325006, and in part by the 111 Project of China
under Grant B16006.

ABSTRACT This survey gives a comprehensive overview of tensor techniques and applications in machine learning. Tensors represent higher-order statistics. Nowadays, many applications based on machine learning algorithms require a large amount of structured high-dimensional input data, and as the amount of data grows, the complexity of these algorithms increases exponentially with the vector size. Some researchers have found that using tensors instead of the original input vectors can effectively solve these high-dimensional problems. This survey introduces the basic knowledge of tensors, including tensor operations, tensor decomposition, some tensor-based algorithms, and some applications of tensors in machine learning and deep learning, for those who are interested in learning about tensors. Tensor decomposition is highlighted because it can effectively extract the structural features of data, and many algorithms and applications are built on it. The organizational framework of this paper is as follows. In part one, we introduce basic tensor operations, including tensor decomposition. In part two, applications of tensors in machine learning and deep learning, including regression, supervised classification, data preprocessing, and unsupervised classification based on low-rank tensor approximation algorithms, are introduced in detail. Finally, we briefly discuss urgent challenges, opportunities and prospects for tensors.

INDEX TERMS Machine learning, tensor decomposition, higher order statistics, data preprocessing,
classification.

(The associate editor coordinating the review of this manuscript and approving it for publication was Massimo Cafaro.)

I. INTRODUCTION
"Tensor" was first introduced by William Rowan Hamilton in 1846 and later became known to scientists through the publication of Levi-Civita's book The Absolute Differential Calculus [72]. Because of its structured representation of data and its ability to reduce the complexity of multidimensional arrays, the tensor has gradually been applied in various fields, such as dictionary learning (Ghassemi et al.) [88], magnetic resonance imaging (MRI) (Xu et al.) [148], spectral data classification (Makantasis et al.) [69], and image deblurring (Geng et al.) [75].

When traditional vector-valued data are extended to tensor-valued data, traditional vector-based algorithms no longer work. Thereupon, some scientists extended the traditional vector-based machine learning algorithms to tensors, such as the support tensor machine (STM) (Tao et al. [27]; Biswas and Milanfar [121]; Hao et al. [164]), tensor Fisher discriminant analysis (Lechuga) [38], tensor regression (Hoa et al.) [89], tensor completion (Du et al.) [150], and so on. Recently, a series of new tensor-based algorithms have been widely used in biomedicine and image processing. Compared with traditional vector-based algorithms, tensor-based algorithms can achieve lower computational complexity and better accuracy. Through these tensor-based algorithms, high-dimensional problems can be solved effectively, and accuracy can be improved without destroying the data structure.

FIGURE 1. A general block diagram of the survey.

The key references for this survey are (Cichocki et al.) [3] and (Kolda and Bader) [127]. The main purpose of this survey is to introduce basic machine learning applications related to tensor decomposition and the tensor network model. Similar to matrix decomposition, tensor decomposition is used to decompose a complex high-dimensional tensor into the form of a sum of products of factor tensors or factor vectors. A tensor network decomposes the high-dimensional tensor into sparse factor matrices and low-order core tensors, which we call factors or blocks. In this way, we obtain a compressed (that is, distributed) representation of large-size data, enhancing the advantages of interpretation and calculation.

Tensor decomposition is regarded as a sub-tensor network in this survey. That is to say, a tensor decomposition can be used in the same way as a tensor network. We can divide the data into related and irrelevant parts by using tensor decomposition. High-dimensional big data can be compressed several times over without breaking data correlations by using tensor decomposition (tensor networks). Moreover, tensor decomposition can be used to reduce the number of unknown parameters, and then an exact solution can be obtained by alternating iterative algorithms.

We provide a general block diagram of the survey (see figure 1). The survey consists of two parts. In part one, we first give the basic definitions and notations of tensors in Chapter A. Then we introduce the basic operations of tensors, and the block diagram of the tensor network structure, in Chapter B. Next, we describe tensor decomposition, including several famous decompositions such as the CP (regularization) decomposition, the Tucker decomposition, the tensor train decomposition and the higher-order singular value decomposition (also known as higher-order tensor decomposition), in Chapter C. In Chapter D, we give a detailed description of the tensor train decomposition and the related algorithms. In Chapter E, i.e., the last section of the first part, we summarize the advantages and disadvantages of these decompositions and their applications. In part two, we mainly describe tensor application algorithms in machine learning and deep learning. In Chapter A, we introduce the application of structured tensors in data preprocessing, including tensor completion and tensor dictionary learning. In Chapter B of this part, we introduce some applications of tensors in classification, including algorithm innovation and data innovation. Then, we illustrate the application of tensors in regression, including tensor regression and multivariate tensor regression, in Chapter C. At the end of part two, we explain the background of the tensor network and discuss its advantages, shortcomings, opportunities and challenges in detail.

II. PART ONE: TENSOR AND TENSOR OPERATION
A. TENSOR NOTATIONS
A tensor can be seen as a generalization of multidimensional arrays.


For example, a scalar quantity can be considered as a 0-order tensor, a vector can be treated as a first-order tensor, and a matrix can be regarded as a second-order tensor. A third-order tensor looks like a cuboid (see figure 2).

FIGURE 2. A 3rd-order tensor looks like a cuboid [3].

A fourth-order tensor is an extension of the third-order tensor along one dimension (see figure 3).

FIGURE 3. A 4th-order tensor extending along the lateral direction [3].

As you can imagine, a fifth-order tensor is an extension of a third-order tensor in two directions (see figure 4).

FIGURE 4. A fifth-order tensor extending along the lateral and longitudinal directions [3].

We use underlined uppercase letters to indicate tensors, that is, Y ∈ R^{I_1×I_2×I_3×···×I_N} represents an Nth-order tensor. Then, similar to the diagonal matrix, we define the diagonal tensor as Λ ∈ R^{I×I×I×···×I} or Υ ∈ R^{I×I×I×···×I}. Similar to the transposition of a matrix, we define the transposition of a tensor as follows: if Y ∈ R^{I_1×I_2×I_3×···×I_N}, then Y^T ∈ R^{I_N×I_{N−1}×I_{N−2}×···×I_1}. We could also use this symbol to represent a 2nd-order tensor (matrix) and a 1st-order tensor (vector), but for convenience we use separate symbols for them: Y ∈ R^{I×J} for a matrix, y ∈ R^I for a vector, and y ∈ R for a scalar.

We use y_{i_1,i_2,···,i_N} to represent the entries of an Nth-order tensor Y ∈ R^{I_1×I_2×I_3×···×I_N}. The order of a tensor is the total number of its "dimensions" or "modes". The size of a tensor means the range of values that a given dimension of the tensor can take. For example, a tensor Y ∈ R^{3×4×5×6} is of order 4, size 3 in mode-1, size 4 in mode-2, size 5 in mode-3 and size 6 in mode-4. In order to describe tensors more simply, simple tensor network diagrams will be used. We use square geometric nodes (sometimes polygons such as pentagons or hexagons) to represent a tensor, and the outgoing lines of a node represent the indices of particular dimensions (see figure 5 and figure 6). An Nth-order tensor can be expressed in a similar way.

FIGURE 5. A simple network diagram of the tensor: vector y ∈ R^I, matrix Y ∈ R^{I×J}, 3rd-order tensor Y ∈ R^{I×J×K} [3].

FIGURE 6. A simple network diagram of the 4th-order diagonal tensor, Υ ∈ R^{I×I×I×I}.

We also need to know the definitions of tensor slices and tensor fibers. For a 3rd-order tensor, a tensor fiber (see figure 7) is a vector obtained by fixing two tensor indices, and a tensor slice (see figure 8) is a matrix obtained by fixing one index.

FIGURE 7. Tensor fibers (vectors) for a 3rd-order tensor. It's like tofu being cut in both directions [127].

FIGURE 8. Tensor slices (matrices) for a 3rd-order tensor. It's like tofu being cut in one direction [127].

We use a simple example to illustrate tensor slices and tensor fibers:
C = ([1 2; 3 4], [5 6; 7 8])   (1)
This is a 3rd-order tensor C ∈ R^{2×2×2}, written here as the pair of matrices C(1,:,:) and C(2,:,:). For tensor slices (matrices), we get two matrices C(1,:,:) and C(2,:,:) when we fix the first dimension: [1 2; 3 4] and [5 6; 7 8], which we usually call horizontal slices. If we fix the second dimension, we get another two matrices C(:,1,:) and C(:,2,:): [1 2; 5 6] and [3 4; 7 8], which we usually call lateral slices. If we fix the third dimension, we get another two matrices C(:,:,1) and C(:,:,2): [1 3; 5 7] and [2 4; 6 8], which we usually call frontal slices.

For tensor fibers (vectors), we get four vectors C(1,1,:), C(1,2,:), C(2,1,:), C(2,2,:) when we fix the first and second indices: [1 2], [3 4], [5 6], [7 8]. If we fix the first and third indices, we get another four vectors C(1,:,1), C(1,:,2), C(2,:,1), C(2,:,2): [1 3], [2 4], [5 7], [6 8]. If we fix the second and third indices, we get another four vectors C(:,1,1), C(:,1,2), C(:,2,1), C(:,2,2): [1 5], [2 6], [3 7], [4 8].

B. TENSOR OPERATION
In this chapter, we begin to discuss some basic tensor operations. Tensor operations are similar to those of traditional linear algebra, but are richer and more meaningful. The same operations will also be applied to the tensor decompositions in Chapter C. In order to give a clearer description of the formulas, we provide examples and graphical illustrations. Thirteen tensor calculation formulas will be given.

We first define the following operation symbols: ⊗ means the Kronecker product, ⊙ means the Khatri-Rao product, ∘ means the outer product, and ×_n means the mode-n product. Next, we introduce a few commonly used formulas.

1) THE SUM OF TWO TENSORS
C = A + B   (2)
where A ∈ R^{I_1×I_2×I_3×···×I_N}, B ∈ R^{I_1×I_2×I_3×···×I_N}, C ∈ R^{I_1×I_2×I_3×···×I_N}, and c_{i_1,···,i_N} = a_{i_1,···,i_N} + b_{i_1,···,i_N}.

2) THE MODE-n PRODUCT OF A TENSOR AND A VECTOR
C = A ×_{nv} b   (3)
where in ×_{nv}, v means vector and n means mode-n, A ∈ R^{I_1×I_2×I_3×···×I_N} is the Nth-order tensor, and b ∈ R^{I_n} is the vector. They yield a tensor C ∈ R^{I_1×···×I_{n−1}×I_{n+1}×···×I_N} with entries c_{i_1,···,i_{n−1},i_{n+1},···,i_N} = Σ_{i_n=1}^{I_n} a_{i_1,···,i_{n−1},i_n,i_{n+1},···,i_N} b_{i_n}. For example,
C = ([1 2; 3 4], [5 6; 7 8]) ×_{2v} [2 3]^T   (4)
where C_{11} = A_{111} b_1 + A_{121} b_2 = 1 × 2 + 3 × 3 = 11. Continuing in the same way, we get C = [11 16; 31 36].

3) THE MODE-n PRODUCT OF A TENSOR AND A MATRIX
C = A ×_{nm} B   (5)
where in ×_{nm}, m means matrix and n means mode-n, A ∈ R^{I_1×I_2×I_3×···×I_N} is the Nth-order tensor and B ∈ R^{J×I_n} is the matrix. They yield a tensor C ∈ R^{I_1×···×I_{n−1}×J×I_{n+1}×···×I_N} with entries c_{i_1,···,i_{n−1},j,i_{n+1},···,i_N} = Σ_{i_n=1}^{I_n} a_{i_1,···,i_{n−1},i_n,i_{n+1},···,i_N} b_{j,i_n}.

4) THE MODE-(a,b) PRODUCT (TENSOR CONTRACTION) OF A TENSOR AND ANOTHER TENSOR
C = A ×_{(a,b)} B   (6)
where A ∈ R^{I_1×I_2×I_3×···×I_N} is the Nth-order tensor, B ∈ R^{J_1×J_2×J_3×···×J_M} is another tensor, and we should note that I_a = J_b (a ∈ [1,N], b ∈ [1,M]). They yield a tensor C ∈ R^{I_1×···×I_{a−1}×I_{a+1}×···×I_N×J_1×···×J_{b−1}×J_{b+1}×···×J_M} with entries c_{i_1,···,i_{a−1},i_{a+1},···,i_N,j_1,···,j_{b−1},j_{b+1},···,j_M} = Σ_{i_a=1}^{I_a} a_{i_1,···,i_a,···,i_N} b_{j_1,···,j_{b−1},i_a,j_{b+1},···,j_M}. Note that it is also called tensor contraction because the order of the new tensor is the sum of the orders of the two original tensors minus the two contracted modes of equal size. We draw a picture to show tensor contraction (see figure 9). For convenience, when two tensors share one dimension of the same size, the parentheses in the above formula are usually omitted, that is, C = A ×_{a,b} B.

FIGURE 9. (a) The tensor contraction of two 4th-order tensors, I_3 = J_1, C = A ×_{(3,1)} B ∈ R^{I_1×I_2×I_4×J_2×J_3×J_4}. (b) The tensor contraction of two 5th-order tensors, I_3 = J_1, I_4 = J_5, C = A ×_{(3,1)(4,5)} B ∈ R^{I_1×I_2×I_5×J_2×J_3×J_4}.

From the three formulas above (the mode-n product of a tensor with a vector, a matrix, and a tensor), we can see that the mode-n product cancels the shared dimension while the remaining dimensions are concatenated, which is similar to the matrix product, but a little different.

5) THE TRANSPOSITION OF TENSOR CONTRACTION
C = (A ×_{a,b} B)^T = B^T ×_{b,a} A^T   (7)
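To make these products concrete, here is a minimal NumPy sketch (the library choice and all variable names are ours, not the paper's); it reproduces the small example of eq. (4) and illustrates eq. (5) and eq. (6) with einsum and tensordot.

import numpy as np

# the running example: A(1,:,:) = [1 2; 3 4], A(2,:,:) = [5 6; 7 8]
A = np.array([[[1., 2.], [3., 4.]],
              [[5., 6.], [7., 8.]]])

# mode-2 product with the vector b = [2, 3]^T (the second index is contracted), eq. (3)-(4)
b = np.array([2., 3.])
print(np.einsum('ijk,j->ik', A, b))           # [[11. 16.] [31. 36.]], as in the text

# mode-2 product with a matrix B in R^{3x2}, eq. (5): the shared size 2 is contracted
B = np.random.randn(3, 2)
print(np.einsum('ijk,bj->ibk', A, B).shape)   # (2, 3, 2)

# mode-(3,1) product (tensor contraction) of two tensors over one matching mode, eq. (6)
X = np.random.randn(2, 3, 4)
Y = np.random.randn(4, 5, 6)
print(np.tensordot(X, Y, axes=(2, 0)).shape)  # (2, 3, 5, 6)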


6) THE CONJUGATE TRANSPOSE OF A 3RD-ORDER TENSOR
The conjugate transpose of a 3rd-order tensor X ∈ R^{I_1×I_2×I_3} is a tensor X* ∈ R^{I_2×I_1×I_3} obtained by conjugate transposing each of the frontal slices (fixing the third index, X(:,:,i_3)) and then reversing the order of the transposed frontal slices 2 through I_3. We give a simple example (see formula 8):
C = ([1 3; 2 4], [5 7; 6 8]),   C* = ([1 3; 5 7], [2 4; 6 8])   (8)

7) THE OUTER PRODUCT OF A TENSOR AND ANOTHER TENSOR
C = A ∘ B   (9)
where A ∈ R^{I_1×I_2×I_3×···×I_N} and B ∈ R^{J_1×J_2×J_3×···×J_M}. They yield an (N+M)th-order tensor C with entries c_{i_1,···,i_N,j_1,···,j_M} = a_{i_1,···,i_N} b_{j_1,···,j_M}.

8) THE (RIGHT) KRONECKER PRODUCT OF TWO TENSORS
C = A ⊗_R B   (10)
where in ⊗_R, R means right, A ∈ R^{I_1×I_2×I_3×···×I_N} and B ∈ R^{J_1×J_2×J_3×···×J_N}. They yield a tensor C ∈ R^{J_1I_1×···×J_NI_N} with entries c_{i_1j_1,···,i_Nj_N} = a_{i_1,···,i_N} b_{j_1,···,j_N}, where i_Nj_N = j_N + (i_N − 1)J_N is called a multi-index. Note that for the Kronecker product, the two tensors must have the same order: one cannot take the Kronecker product of a matrix and a 3rd-order tensor, only of a 3rd-order tensor with another 3rd-order tensor. A simple example for second-order matrices is provided as follows:
C = [1 2; 3 4] ⊗_R [5 6; 7 8] = [1×5 1×6 2×5 2×6; 1×7 1×8 2×7 2×8; 3×5 3×6 4×5 4×6; 3×7 3×8 4×7 4×8]   (11)
In fact,
C = A ⊗_R B = B ⊗_L A   (12)
where ⊗_L denotes the left Kronecker product.

9) THE RIGHT KHATRI-RAO PRODUCT OF MATRICES
C = A ⊙_R B = [a_1 ⊗_R b_1, a_2 ⊗_R b_2, ···, a_K ⊗_R b_K] ∈ R^{IJ×K}   (13)
where A = [a_1, a_2, a_3, ···, a_K] ∈ R^{I×K} and B = [b_1, b_2, b_3, ···, b_K] ∈ R^{J×K}. The left Khatri-Rao product of matrices is defined similarly. For convenience, the right Khatri-Rao product of matrices is used in this survey, so we abbreviate ⊙_R as ⊙.

10) THE MODE-n MATRICIZATION AND VECTORIZATION OF THE TENSOR
In the previous section, we introduced the concepts of tensor slice and tensor fiber. Here we present two similar but different concepts, matricization and vectorization. A tensor slice or fiber takes some specific elements of the tensor to form a matrix or a vector, while matricization and vectorization rearrange all the elements into a matrix or a vector. We now give a formal definition.
The mode-n matricization of a tensor Y ∈ R^{I_1×I_2×I_3×···×I_N} is
mat(Y)_n = Y_{mn} ∈ R^{I_n × I_1···I_{n−1}I_{n+1}···I_N}, where the matrix element (i_n, j) has j = 1 + Σ_{k=1,k≠n}^{N} (i_k − 1)J_k with J_k = Π_{m=1,m≠n}^{k−1} I_m.   (14)
The mode-n vectorization of a tensor Y ∈ R^{I_1×I_2×I_3×···×I_N} is
vec(Y)_n = Y_{vn} ∈ R^{I_n I_1···I_{n−1}I_{n+1}···I_N}   (15)
For the mode-n vectorization, we first perform the mode-n matricization and then stack the columns of the resulting matrix. Of course, vectors and matrices can also be transformed back into tensors. We give an example of the mode-1 matricization and vectorization of a 3rd-order tensor (see formula 16):
C = ([1 3; 2 4], [5 7; 6 8]) ⇔ [1 2 3 4; 5 6 7 8] ⇔ [1 5 2 6 3 7 4 8]^T   (16)

11) THE TENSOR QUANTITATIVE PRODUCT
c = A • B = Σ_{j_1=1}^{J_1} ··· Σ_{j_N=1}^{J_N} a_{j_1,···,j_N} b_{j_1,···,j_N}   (17)
where A ∈ R^{J_1×···×J_N}, B ∈ R^{J_1×···×J_N}. Note that the requirements for the tensor quantitative product are strict: not only must the two tensors have the same order, but their sizes must also be the same. In this way, we can further define the Frobenius norm of a tensor:
||A||_F = (A • A)^{1/2}   (18)
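The following NumPy snippet (helper name and examples are ours) illustrates the Kronecker product of eq. (11), the right Khatri-Rao product of eq. (13), the mode-1 matricization and vectorization of formula (16), and the Frobenius norm of eq. (18).

import numpy as np

def khatri_rao(A, B):
    # column-wise (right) Khatri-Rao product, eq. (13): column r is kron(A[:, r], B[:, r])
    I, R = A.shape
    J, _ = B.shape
    return np.einsum('ir,jr->ijr', A, B).reshape(I * J, R)

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
print(np.kron(A, B))        # the Kronecker product of eq. (11)
print(khatri_rao(A, B))     # a 4 x 2 matrix

def unfold(T, n):
    # mode-n matricization following eq. (14): earlier remaining indices vary fastest
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order='F')

# the tensor of formula (16): C(1,:,:) = [1 3; 2 4], C(2,:,:) = [5 7; 6 8]
C = np.array([[[1., 3.], [2., 4.]],
              [[5., 7.], [6., 8.]]])
C_m1 = unfold(C, 0)
print(C_m1)                    # [[1. 2. 3. 4.] [5. 6. 7. 8.]]
print(C_m1.ravel(order='F'))   # [1. 5. 2. 6. 3. 7. 4. 8.], the mode-1 vectorization

# the quantitative (inner) product and the Frobenius norm, eq. (17)-(18)
print(np.sum(C * C) ** 0.5, np.linalg.norm(C))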


12) THE TENSOR ELEMENT PRODUCT
C = A ⊛ B   (19)
where C ∈ R^{I_1×···×I_N}, A ∈ R^{I_1×···×I_N}, B ∈ R^{I_1×···×I_N}, and c_{i_1,···,i_N} = a_{i_1,···,i_N} b_{i_1,···,i_N}. Note that since it is an element-wise product, similar to the quantitative product, the two tensors must have the same order and the same size.

13) THE TENSOR TRACE
Similar to the trace of a matrix, a tensor also has a trace. (Gu, 2009) [162] proposed the concept of the tensor trace. Let us first look at the concept of inner indices. If a tensor has the same size for several dimensions, those same-size dimensions are called inner indices. For example, a tensor X ∈ R^{A×B×A} has two inner indices: modes 1 and 3 are both of size A. Then, we define the tensor trace as follows:
x = Trace(X) = Σ_{r=1}^{R} X(r,:,r)   (20)
x = [x_1, x_2, ···, x_B]^T,   x_i = Σ_{r=1}^{R} X(r,i,r)   (21)
x = [tr(X_1), ···, tr(X_B)]^T,   X_i = X(:,i,:) ∈ R^{R×R}   (22)
Let us give an example using the 3rd-order tensor that we used before:
C = ([1 2; 3 4], [5 6; 7 8])   (23)
c = Trace(C) = [1 + 6, 3 + 8]^T = [7, 11]^T   (24)
C_1 = [1 2; 5 6],   C_2 = [3 4; 7 8]   (25)

14) THE TENSOR CONVOLUTION
Tensors also have a convolution, which is similar to matrix convolution. For two Nth-order tensors A ∈ R^{I_1×I_2×I_3×···×I_N} and B ∈ R^{J_1×J_2×J_3×···×J_N}, their tensor convolution is
C = A ∗ B   (26)
where C ∈ R^{(I_1+J_1−1)×(I_2+J_2−1)×···×(I_N+J_N−1)}, with entries c_{k_1,k_2,···,k_N} = Σ_{j_1=1}^{J_1} Σ_{j_2=1}^{J_2} ··· Σ_{j_N=1}^{J_N} b_{j_1,···,j_N} a_{k_1−j_1,···,k_N−j_N}. For a simple and intuitive display, we use matrix convolution as an illustration (see figure 10).

FIGURE 10. A schematic diagram of the results of matrix convolution, with C_{11} = 0 × 1 = 0, C_{12} = 1 × 1 + 0 × 2 = 1, C_{22} = 1 × 2 + 2 × 1 + 1 × 1 + 0 × 0 = 5, ···.

15) SHORT SUMMARY
The formulas for tensor operations described above are relatively basic ones. Because a tensor can be seen as a generalization of matrices and vectors, the above formulas also apply to vectors and matrices (just change the order to 1 or 2 in the formulas). Many researchers have also defined some new operations, such as the strong Kronecker product (de Launey and Seberry [140]; Phan et al. [8]) and the mode-n Khatri-Rao product of tensors (Ballard et al.) [33]. Based on the Kronecker product, these two operations simply group the operands into blocks and perform the Kronecker product blockwise.
This chapter mainly introduced the basic calculation formulas commonly used for tensors. If you want to know more about many other formulas, please refer to (Kolda and Bader) [127].

C. TENSOR DECOMPOSITION
This chapter begins to discuss tensor decomposition, which is similar to but different from matrix decomposition. Tensor decomposition aims to reduce the computational complexity while preserving the data structure, so as to better deal with the data. Tensor decomposition technology has gradually been adopted in data analysis and processing. This chapter will focus on five main types of decomposition, i.e., the Canonical Polyadic (CP) decomposition, the Tucker decomposition, the MultiLinear Singular Value decomposition (the higher-order SVD or HOSVD), the Hierarchical Tucker (HT) decomposition and the tensor-train (TT) decomposition.

1) THE CANONICAL POLYADIC (CP) DECOMPOSITION
Before introducing the CP decomposition, we first introduce bidirectional component analysis, i.e., the constrained low-rank matrix factorization:
C = AΛB^T + E = Σ_{r=1}^{R} λ_r a_r b_r^T + E   (27)
where Λ = diag(λ_1,···,λ_R) is a diagonal matrix, C ∈ R^{I×J} is a known matrix (for example, known input data), and E ∈ R^{I×J} is a noise matrix. A = [a_1,···,a_R] ∈ R^{I×R} and B = [b_1,···,b_R] ∈ R^{J×R} are two unknown factor matrices with a_r ∈ R^I, b_r ∈ R^J, r ∈ [1,R]. In fact, if the noise matrix is very small, it can be ignored and the expression can be written as C ≈ AΛB^T.
Based on this low-rank matrix decomposition, (Hitchcock [31]; Harshman, 1970 [110]) proposed the CP decomposition of tensors. Before introducing the definition of the CP decomposition, we give the definition of a rank-1 tensor. If a tensor can be represented as
Y = b^1 ∘ b^2 ∘ ··· ∘ b^N   (28)
where Y ∈ R^{I_1×I_2×I_3×···×I_N}, b^n ∈ R^{I_n}, and y_{i_1,···,i_N} = b^1_{i_1} ··· b^N_{i_N}, then we call it a rank-1 tensor. In the CP decomposition, a tensor is decomposed into a weighted sum of such rank-1 tensors (outer products of vectors).
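A short NumPy check (variable names ours) of the tensor trace of eq. (20)-(24) and of a rank-1 tensor built as the outer product of eq. (28).

import numpy as np

# tensor trace for the running example C in R^{2x2x2}; modes 1 and 3 are the inner indices
C = np.array([[[1., 2.], [3., 4.]],
              [[5., 6.], [7., 8.]]])
print(np.einsum('rir->i', C))    # [ 7. 11.], matching eq. (24)

# a rank-1 tensor as an outer product of three vectors, eq. (28)
b1, b2, b3 = np.random.randn(3), np.random.randn(4), np.random.randn(5)
Y = np.einsum('i,j,k->ijk', b1, b2, b3)
print(Y.shape)                   # (3, 4, 5)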


The CP decomposition is defined as follows:
Y ≈ Σ_{r=1}^{R} λ_r b^1_r ∘ b^2_r ∘ ··· ∘ b^N_r = Λ ×_{1m} B^1 ×_{2m} B^2 ··· ×_{Nm} B^N   (29)
Similar to the constrained low-rank matrix factorization that we have just described, λ_r = Λ_{r,r,···,r}, r ∈ [1,R], are the entries of the diagonal core tensor Λ ∈ R^{R×R×···×R}, and B^n = [b^n_1, b^n_2, ···, b^n_R] ∈ R^{I_n×R} are the factor matrices. With the help of the other formulas, the CP decomposition has many other equivalent expressions, among which we give two commonly used ones. Considering a special case, when all factor matrices are the same, we call the CP decomposition a symmetric tensor decomposition; then Y ∈ R^{I×I×···×I}. Figure 11 shows the CP decomposition of a 3rd-order tensor (see figure 11).

FIGURE 11. CP decomposition of a 3rd-order tensor, Y ≈ Λ ×_{1m} B^1 ×_{2m} B^2 ×_{3m} B^3 [3].

CP Rank: Similar to a matrix, a tensor also has a rank. Since it arises from the CP decomposition, we call it the CP rank. The CP rank is the smallest R for which the CP decomposition in the above formula holds exactly. We use r_cp(Y) to denote the CP rank.
In practice, unlike traditional matrix decomposition, tensors usually suffer interference (such as noise or even data loss). Therefore, it is usually difficult to find the exact solution of the CP decomposition, so most solutions are approximate.
So the question arises: how do we obtain an approximate CP decomposition of a tensor, or in other words, how can we get the core tensor? The general approach is to first find the factor matrices B^n by minimizing an appropriate loss function. (A. Vorobyov, 2005) [116] presents a loss function similar to the least squares method:
J(B^1,···,B^N) = ||Y − Λ ×_{1m} B^1 ··· ×_{Nm} B^N||_F^2   (30)
Our goal is to minimize this loss function, and we use the alternating least squares method, which means iterative optimization by fixing the values of all variables except one. That is to say, one of the N factor matrices, B^n, is optimized at a time, keeping the values of the other N−1 factor matrices unchanged (we first initialize all N factor matrices, and optimize only B^1 by gradient descent while keeping the initial values of B^2 to B^N unchanged). This becomes a single-variable loss function optimization problem. We then continue to iterate until the iteration threshold is reached or the algorithm has converged. The derivation is not given here; we give the results directly, taking the 4th-order tensor as an example. The factor matrices can be iteratively updated as
B^n = Y_{mn} [(B^N ⊙_R ··· ⊙_R B^{n+1} ⊙_R B^{n−1} ⊙_R ··· ⊙_R B^1)^T]^†   (31)
where Y_{mn} represents the mode-n matricization of the tensor Y, and † denotes the Moore-Penrose pseudo-inverse of a matrix. We give an algorithm for the 4th-order tensor CP decomposition (see Algorithm 1).

Algorithm 1 The CP Decomposition Algorithm of a 4th-Order Tensor
Input: The 4th-order tensor Y ∈ R^{I×J×K×L}
Output: Factor matrices A, B, C, D and the core tensor Λ
1: Initialize A, B, C, D and the CP rank R, where R ≤ min{IJ, JK, IK};
2: while the iteration threshold is not reached or the algorithm has not converged do
3:   A = Y_{m1}[(D ⊙_R C ⊙_R B)^T]^†;
4:   Normalize the column vectors of A to unit vectors;
5:   B = Y_{m2}[(D ⊙_R C ⊙_R A)^T]^†;
6:   Normalize the column vectors of B to unit vectors;
7:   C = Y_{m3}[(D ⊙_R B ⊙_R A)^T]^†;
8:   Normalize the column vectors of C to unit vectors;
9:   D = Y_{m4}[(C ⊙_R B ⊙_R A)^T]^†;
10:  Normalize the column vectors of D to unit vectors;
11:  Save the values of the norms of the R column vectors of the factor matrix C to the core tensor Λ;
12: end while
13: return the factor matrices A, B, C, D and the core tensor Λ

From the above algorithm, we can see that the key to computing the CP decomposition is to compute the Khatri-Rao products and the pseudo-inverses of the matrices. (Choi and Vishwanathan [63]; Karlsson et al. [79]) proposed least-squares solution methods for the CP decomposition, and the detailed derivations can be found in those references.


2) THE TUCKER DECOMPOSITION
The Tucker decomposition was first proposed by (Tucker) [81], hence its name. Similar to the CP decomposition, the Tucker decomposition also splits a tensor into a small core tensor and factor matrices, but we need to pay attention to the fact that the core tensor here is not necessarily diagonal. We define the Tucker decomposition as follows:
Y ≈ Σ_{r_1=1}^{R_1} ··· Σ_{r_N=1}^{R_N} a_{r_1 r_2···r_N} b^1_{r_1} ∘ b^2_{r_2} ∘ ··· ∘ b^N_{r_N} = A ×_{1m} B^1 ×_{2m} B^2 ··· ×_{Nm} B^N,
Y_{v1} = (B^N ⊗_R B^{N−1} ··· ⊗_R B^1) A_{v1}   (32)
where a_{r_1 r_2···r_N} are the entries of the small core tensor A ∈ R^{R_1×R_2×···×R_N}, B^n = [b^n_1, b^n_2, ···, b^n_{R_n}] ∈ R^{I_n×R_n} are the factor matrices, Y_{v1} is the mode-1 vectorization of the tensor Y, and A_{v1} is the mode-1 vectorization of the core tensor A. In fact, the CP decomposition is a special form of the Tucker decomposition. We draw two decomposition figures to compare them intuitively (see figure 12 and figure 13).

FIGURE 12. Comparison of CP decomposition and Tucker decomposition for a 3rd-order tensor. The upper part is the Tucker decomposition Y ≈ A ×_{1m} A ×_{2m} B ×_{3m} C and the lower part is the CP decomposition Y ≈ Λ ×_{1m} A ×_{2m} B ×_{3m} C, Y ∈ R^{I×J×K}.

FIGURE 13. Comparison of CP decomposition and Tucker decomposition for a 3rd-order tensor (simple tensor network schematic mode). The upper part is the CP decomposition Y ≈ Λ ×_{1m} B^1 ×_{2m} B^2 ×_{3m} B^3 and the lower part is the Tucker decomposition Y ≈ A ×_{1m} B^1 ×_{2m} B^2 ×_{3m} B^3, Y ∈ R^{I_1×I_2×I_3}. Note that the factor matrices can be represented by different letters.

As we can see from the two figures (figure 12 and figure 13), the CP decomposition is a special form of the Tucker decomposition: once the general core tensor degenerates into a diagonal core tensor, the Tucker decomposition becomes the CP decomposition. As with the CP decomposition, we can use the properties of the other formulas to represent the Tucker decomposition; here we give two that are commonly used.
Multiple Linear Rank (Tucker Rank): Unlike the CP decomposition, a new rank is defined here for the Tucker decomposition, which we call the multiple linear rank. The multiple linear rank of the tensor is (R_1, R_2, ···, R_N). Moreover, if the above Tucker decomposition holds with equality, then the multiple linear rank of a tensor Y ∈ R^{I_1×I_2×I_3×···×I_N} is defined as
r_ml(Y) = (r(Y_{m1}), r(Y_{m2}), ···, r(Y_{mN}))   (33)
where Y_{mn} is the mode-n matricization of the tensor Y and r(Y_{mn}) is the matrix rank of that matricization.
If the above Tucker decomposition holds with equality, then it has the following important properties:
1. The CP rank of any tensor Y = A ×_{1m} B^1 ×_{2m} B^2 ··· ×_{Nm} B^N is equal to the CP rank of its small core tensor A:
r_cp(Y) = r_cp(A)   (34)
where A is the small core tensor of Y.
2. If a tensor Y has full column rank factor matrices and its multiple linear rank is (R_1, R_2, ···, R_N), then
R_n ≤ Π_{k≠n} R_k,   ∀n.   (35)
3. If a tensor Y ∈ R^{I×I×I×···×I} has full column rank factor matrices and its corresponding CP decomposition is symmetric (all the factor matrices are the same), then its core tensor A ∈ R^{R×R×R×···×R} is also symmetric. In this case, the Tucker decomposition is equivalent to the CP decomposition, which is called a symmetric decomposition (as we defined before).
4. If a tensor Y ∈ R^{I_1×I_2×···×I_N} has full column rank factor matrices and all the factor matrices are orthogonal, then the Frobenius norms of the tensor Y and its core tensor A are equal:
||Y||_F = ||A||_F   (36)
There are also some special Tucker decompositions, which are briefly described here. The case of full column rank of all the factor matrices, as in property 3 above, is usually called an independent Tucker decomposition. On this basis, if all the factor matrices are also orthogonal, that is, B_n^T B_n = I_{R_n}, we call it an orthogonal Tucker decomposition. If there are N identity matrices among the factor matrices, we usually call it a Tucker-N decomposition (see figure 14). For example, if Y = A ×_{1m} B^1 ×_{2m} B^2 ×_{3m} I ×_{4m} I, then we call it a Tucker-2 decomposition.

FIGURE 14. Because only B^3 is the identity matrix, the graph is the Tucker-1 decomposition model.


Here we briefly introduce some operational properties of the Tucker decomposition. Consider two Nth-order tensors X = A_X ×_{1m} B^1 ×_{2m} B^2 ··· ×_{Nm} B^N and Y = A_Y ×_{1m} C^1 ×_{2m} C^2 ··· ×_{Nm} C^N, whose multiple linear ranks are (R_1, R_2, ···, R_N) and (Q_1, Q_2, ···, Q_N), respectively. Then they have the following computational properties:
1. The (right or left) Kronecker product of the two tensors:
Z = X ⊗ Y = (A_X ⊗ A_Y) ×_{1m} (B^1 ⊗ C^1) ··· ×_{Nm} (B^N ⊗ C^N)   (37)
2. The Hadamard product of the two tensors (of the same sizes and order):
Z = X ⊛ Y = (A_X ⊗_L A_Y) ×_{1m} (B^1 ⊙ C^1) ··· ×_{Nm} (B^N ⊙ C^N)   (38)
3. The inner product of the two tensors:
z = X • Y = (vec(X)_1)^T vec(Y)_1 = (vec(A_X)_1)^T (((B^1)^T C^1) ⊗_L ((B^2)^T C^2) ⊗_L ··· ⊗_L ((B^N)^T C^N)) vec(A_Y)_1   (39)
Here it is noted that we used the vector-equivalent representation of the Tucker decomposition:
vec(Y)_1 = (B^N ⊗_R B^{N−1} ··· ⊗_R B^1) vec(A)_1   (40)
Substituting formula 40 into 39 gives the result.

3) THE HIERARCHICAL TUCKER DECOMPOSITION
(Hackbusch and Kühn [142]; Grasedyck [78]) introduced the Hierarchical Tucker decomposition. The Hierarchical Tucker (HT) decomposition decomposes a tensor in a hierarchical way, similar to a binary tree split. It is important to note that for the HT decomposition, all the core tensors must be of order three or less. In other words, the factor matrices connected to a core tensor cannot exceed three; more simply, in a tensor network diagram, a core tensor cannot have more than three lines connected to it. Also, the HT decomposition model graphs cannot contain any loops. We draw diagrams of the HT decomposition of a 5th-order tensor and a 6th-order tensor so that we can understand it more intuitively (see figure 15 and figure 16).

FIGURE 15. Schematic diagram of the HT decomposition of a 5th-order tensor, in which the core tensor is split into two small-size 3rd-order tensors A_{12}, A_{345}, and the right core tensor is split into the factor matrix B_3 and a 3rd-order core tensor of smaller size A_{45}. Finally, A_{12} and A_{45} continue to be decomposed into the last four factor matrices B_1, B_2, B_3, B_4. The diagram on the right is the HT tensor network structure diagram with the core tensor A_{12345} in the original left image replaced by a connecting line.

FIGURE 16. Similar to the HT decomposition of the 5th-order tensor in figure 15; note that since the decomposition of the core tensor differs at the first step, there are two kinds of decomposition. The upper figure decomposes A_{123456} according to dimensions 12 and 3456, and the lower figure decomposes A_{123456} according to dimensions 123 and 456. The results are not the same, but both are HT decompositions.

From figure 15 and figure 16, we can see that the first step of the HT decomposition is to choose the dimensions to be separated. For a 5th-order tensor, we can extract any one dimension or any two dimensions, and the steps are repeated until the 5th-order tensor becomes five factor matrices. In fact, we can observe that the HT decomposition replaces the core tensor A of the Tucker decomposition with low-order interconnected kernels, thus forming a distributed tensor network. We draw the conversion between the HT decomposition and the Tucker decomposition of a 5th-order tensor (see figure 17). Of course, we can see that with the increase in order, these distributed networks (HT decomposition networks) are not unique (see figure 16).

FIGURE 17. The Tucker decomposition of the 5th-order tensor and its equivalent HT decomposition; the right side is the equivalent conversion of the left Tucker decomposition.

We can use the vector form of the Tucker decomposition to explain the HT decomposition network in figure 15:
vec(Y)_1 = (B^1 ⊗_L B^2 ··· ⊗_L B^5) vec(A_{12345})_1
vec(A_{12345})_1 = vec(A_{12})_1 ⊗_L vec(A_{345})_1
vec(A_{12})_1 = B^1 ⊗_L B^2
vec(A_{345})_1 = B^3 ⊗_L vec(A_{45})_1
vec(A_{45})_1 = B^4 ⊗_L B^5   (41)

In fact, the core idea is to replace the core tensor with tensors of smaller order until the original tensor is decomposed into factor matrices. Finally, the original tensor is decomposed into several 3rd-order tensors and several factor matrices connected to each other. Here we introduced the HT decomposition of 5th-order and 6th-order tensors; the tensor network diagram of the HT decomposition of higher-order tensors can be drawn in a similar way, and for more details please refer to (Tobler [22]; Kressner et al. [23]).
After the Tucker decomposition, although the size of the core tensor is reduced, its order is still the same as before. When the order of the original tensor is very large (for example, greater than 10), we usually express it with a distributed tensor network similar to the HT decomposition. That is, the order of the core tensors is not limited to 3; according to actual needs, it can be 4 or 5 (see figure 18).

FIGURE 18. The blue rectangles represent the core tensors and the red circles represent the factor matrices. The diagram on the left is an 18th-order tensor HT decomposition tensor network diagram, in which 4th-order small-size core tensors are connected to each other. The diagram on the right is a 20th-order tensor HT decomposition tensor network diagram, in which 5th-order small-size core tensors are connected to each other.

4) THE HIGHER ORDER SVD (HOSVD) DECOMPOSITION
The higher-order singular value decomposition of a tensor can be considered as another special form of the Tucker decomposition (De Lathauwer et al.) [73], where the factor matrices and the core tensor are all orthogonal.
The definition of core tensor orthogonality is as follows:
1. The tensor slices in each mode of the tensor should be mutually orthogonal; for example, for a 3rd-order tensor A ∈ R^{I×J×K}:
(A_{a,:,:})(A_{b,:,:}) = 0, for a ≠ b, a,b ∈ [1,I]
(A_{:,c,:})(A_{:,d,:}) = 0, for c ≠ d, c,d ∈ [1,J]
(A_{:,:,e})(A_{:,:,f}) = 0, for e ≠ f, e,f ∈ [1,K]   (42)
2. The Frobenius norms of the slices in each mode of the tensor should be ordered (non-increasing) as the running index increases; for example, for a 3rd-order tensor:
||A_{a,:,:}||_F ≥ ||A_{b,:,:}||_F, for a < b, a,b ∈ [1,I]
||A_{:,c,:}||_F ≥ ||A_{:,d,:}||_F, for c < d, c,d ∈ [1,J]
||A_{:,:,e}||_F ≥ ||A_{:,:,f}||_F, for e < f, e,f ∈ [1,K]   (43)
In fact, these orthogonality constraints on the tensor are very similar to the constraints of the matrix SVD. Similar to the truncated SVD of a matrix, the tensor also has a truncated HOSVD decomposition (see figure 19).

FIGURE 19. The truncated HOSVD decomposition of a 3rd-order tensor [127].

The first step in computing the HOSVD is to perform the mode-n matricization of the original input tensor and then use a truncated or randomized SVD to find the factor matrices (see equation 44):
X_{mn} = U_n S_n V_n^T = [U_{n1}, U_{n2}][S_{n1}, 0][V_{n1}^T, V_{n2}^T]   (44)
When the factor matrices are obtained, the core tensor can be computed using the following formula:
A = X ×_{1m} B_1^T ×_{2m} B_2^T ··· ×_{Nm} B_N^T   (45)
where X ∈ R^{I_1×I_2×···×I_N} is the input tensor, A ∈ R^{R_1×R_2×···×R_N} is the core tensor, and B_n ∈ R^{I_n×R_n} are the factor matrices. See Algorithm 2 for details and refer to (Vannieuwenhoven et al. [101]; Halko et al. [96]).

Algorithm 2 The Truncated HOSVD of the Tensor (N. Vannieuwenhoven, 2012) [101]
Input: The Nth-order input tensor X ∈ R^{I_1×I_2×···×I_N}, truncation rank (R_1, R_2, ···, R_N) and accuracy ε
Output: Estimated value X̂ = A ×_{1m} B_1 ×_{2m} B_2 ··· ×_{Nm} B_N, the core tensor A ∈ R^{R_1×R_2×···×R_N} and the factor matrices B_n ∈ R^{I_n×R_n} such that ||X − X̂||_F ≤ ε
1: A ← X;
2: for n = 1 to N do
3:   [U_n, S_n, V_n^T] = [U_{n1}, U_{n2}][S_{n1}, 0][V_{n1}^T, V_{n2}^T] = truncated-svd(A_{mn}, ε/√N);
4:   B_n = U_{n1};
5:   A_{mn} ← S_{n1} V_{n1}^T;
6: end for
7: A = X ×_{1m} B_1^T ×_{2m} B_2^T ··· ×_{Nm} B_N^T;
8: return the core tensor A and the factor matrices B_n

After performing the mode-n matricization of the tensor, if the tensor size is too large, we can also obtain the factor matrices by matrix partitioning, as follows:
X_{mn} = [X_{1n}, X_{2n}, ···, X_{Mn}] = U_n S_n [V_{1n}^T, V_{2n}^T, ···, V_{Mn}^T]   (46)
where we divide the resulting matrix (called the unfolded matrix) X_{mn} into M parts. Then we use the eigenvalue decomposition X_{mn} X_{mn}^T = Σ_{m=1}^{M} X_{mn} X_{mn}^T = U_n (S_n)^2 U_n^T, with U_n = [U_{n1}, U_{n2}] and B_n = U_{n1}, and we can get V_{mn} = X_{mn}^T U_n (S_n)^{−1}. Thus, computational complexity and memory are decreased and the efficiency is improved to some extent by matrix partitioning. At the same time, it also alleviates the curse of dimensionality.
Some researchers have proposed a random SVD decomposition algorithm for matrices with large size and low rank. (Halko et al.) [96] reduced the original input matrix to a small matrix by random sketching, i.e., by multiplying with a random sampling matrix (see Algorithm 3).

Algorithm 3 The Random SVD Decomposition Algorithm for Large-Size and Low-Rank Matrices (Halko et al.) [96]
Input: The large-size and low-rank matrix X ∈ R^{I×J}, estimated rank R, oversampling parameter P, overestimated rank R̂ = R + P, exponent of the power method q (q = 0 or 1)
Output: the SVD of X: orthogonal matrix U ∈ R^{I×R̂}, diagonal matrix S ∈ R^{R̂×R̂} and V ∈ R^{J×R̂}
1: Initialize a random Gaussian matrix W ∈ R^{J×R̂};
2: Calculate the sample matrix Y = (XX^T)^q XW ∈ R^{I×R̂};
3: Compute the QR decomposition of the sample matrix Y = QR;
4: Calculate the small-size matrix A = Q^T X ∈ R^{R̂×J};
5: Compute the SVD of the small-size matrix A = Û S V^T;
6: Calculate the orthogonal matrix U = QÛ;
7: return the orthogonal matrix U ∈ R^{I×R̂}, the diagonal matrix S ∈ R^{R̂×R̂} and V ∈ R^{J×R̂}

The advantage of using an overestimated rank is that it can achieve a more accurate approximation of the matrix. (Chen et al.) [129] improved the approximation of the SVD by integrating multiple random sketches, that is, multiplying the input matrix X by a set of random Gaussian matrices. (Halko et al.) [96] used a special sampling matrix to greatly reduce the execution time of the algorithm while reducing complexity. However, for a matrix with slow singular value decay, this method results in lower accuracy of the SVD.
Many researchers have developed a variety of different algorithms to solve the HOSVD decomposition. For details, please refer to (Vannieuwenhoven et al. [101]; Austin et al. [138]; Constantine et al. [103]).
Compared with the truncated SVD of a standard matrix, the tensor HOSVD does not produce the best multiple-linear-rank approximation, but only a weak (quasi-optimal) approximation (De Lathauwer et al.) [73]:
||X − A ×_{1m} B_1 ×_{2m} B_2 ··· ×_{Nm} B_N|| ≤ √N ||X − X̂_Perfect||   (47)
where X̂_Perfect is the best approximation of X.
In order to find an accurate approximation of the Tucker decomposition, researchers have extended the alternating least squares method to higher-order orthogonal iterations (Jeon et al. [56]; Austin et al. [138]; Constantine et al. [103]; De Lathauwer et al. [74]). For details, please refer to Algorithm 4.

Algorithm 4 The Higher-Order Orthogonal Iterations (Austin et al. [138]; De Lathauwer et al. [74])
Input: The Nth-order input tensor X ∈ R^{I_1×I_2×···×I_N} to be decomposed by Tucker.
Output: the core tensor A and factor orthogonal matrices B_n, B_n^T B_n = I_{R_n}
1: Initialize all parameters via the truncated HOSVD in Algorithm 2;
2: while the cost function ||X − A ×_{1m} B_1 ··· ×_{Nm} B_N||_F^2 has not converged do
3:   for n = 1 to N do
4:     Y ← X ×_{(p≠n)m} (B_p^T);
5:     Z ← Y_{mn} (Y_{mn})^T ∈ R^{R×R};
6:     B_n ← leading R_n eigenvectors of Z;
7:   end for
8:   A ← Y ×_{Nm} (B_N^T);
9: end while
10: return the core tensor A and the factor matrices B_n

When the size of the original tensor is too large (too many elements), memory becomes insufficient and the computational complexity may also increase. In this case, the operation can be simplified in the form of a matrix product: simply put, the mode-n product of the tensor and the matrix is converted into an ordinary matrix product to simplify the operation and reduce memory (see figure 20).
For a large tensor, another way to simplify the operation is the blocking method. It simply divides the original tensor and the factor matrix into blocks, and then performs the mode-n products between the small matrices and the small tensors (see figure 21).
As seen from figure 21, we divide the input tensor X into small blocks X^{(x_1,x_2,···,x_N)}. Similarly, we divide the factor matrix B_n^T into blocks B^{(x_n,b_n)}. The block of the tensor A_n produced by the mode-n product of the matrix and the tensor is equal to
A_n^{(x_1,x_2,···,b_n,···,x_N)} = Σ_{x_n=1}^{X_n} X^{(x_1,x_2,···,x_n,···,x_N)} ×_{nm} (B_n(x_n, b_n))^T   (48)

FIGURE 20. The mode-n product of a factor matrix and a large-size 3rd-order tensor. When the tensor size is very large, the mode-n matricization of the tensor can be performed first, and then multiplied by the factor matrix to simplify the calculation.

FIGURE 21. The mode-n product of a factor matrix and a large-size 3rd-order tensor. When the tensor/matrix size is very large, we can also divide the large-size tensor/matrix into several small-size tensors/matrices, and then operate on the small blocks, thus simplifying the operation [52].
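A compact NumPy rendering of the randomized SVD of Algorithm 3 (function name ours); the loop implements the power-method factor (XX^T)^q XW.

import numpy as np

def randomized_svd(X, R, P=5, q=1):
    W = np.random.randn(X.shape[1], R + P)   # random Gaussian matrix, overestimated rank R + P
    Y = X @ W                                # sample matrix
    for _ in range(q):
        Y = X @ (X.T @ Y)                    # Y = (X X^T)^q X W
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis of the sampled range
    U_small, S, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return Q @ U_small, S, Vt

X = np.random.randn(500, 40) @ np.random.randn(40, 300)              # large, low-rank test matrix
U, S, Vt = randomized_svd(X, R=40)
print(np.linalg.norm(X - U @ np.diag(S) @ Vt) / np.linalg.norm(X))   # near machine precision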


5) THE TENSOR-SVD (T-SVD) DECOMPOSITION
Similar to the SVD decomposition of a matrix, when we extend the matrix to a 3rd-order tensor, we call it the Tensor-SVD decomposition:
X = U ×_{(2,1)(3,3)} S ×_{2,1} V*   (49)
where X ∈ R^{I_1×I_2×I_3}, U ∈ R^{I_1×I_1×I_3}, V ∈ R^{I_2×I_2×I_3}, and S ∈ R^{I_1×I_2×I_3} is an f-diagonal tensor, i.e., each of its frontal slices is a diagonal matrix. We draw a figure to show the t-SVD decomposition (see figure 22).

FIGURE 22. T-SVD of a 3rd-order tensor [20].

6) THE TENSOR CROSS-APPROXIMATION
Before we discuss the concept of the tensor cross-approximation, we first introduce some concepts of matrix cross approximation. (Bebendorf et al. [85]; Khoromskij and Veit [16]) proposed the concept of the matrix cross approximation (MCA). The main role of the MCA is to reduce the size of the original large matrix by finding a linear combination of several components of the matrix, thereby decreasing computational complexity and memory. These components are usually a small fraction of the original matrix. This method has the premise that the original matrix is highly redundant, so that it can be approximated by a small matrix with some marginal information lost. We illustrate the MCA in figure 23.

FIGURE 23. Schematic diagram of MCA, X = ABC + E, A ∈ R^{I×A}, B ∈ R^{A×B}, C ∈ R^{B×J}, E ∈ R^{I×J} [3].

From figure 23, the specific formula of the MCA method is
X = ABC + E   (50)
where A ∈ R^{I×A} is a small matrix obtained by selecting A appropriate columns from the original matrix X, C ∈ R^{B×J} is a small matrix obtained by selecting B appropriate rows from the original matrix X, B ∈ R^{A×B} is the small matrix linking them, and E ∈ R^{I×J} is the redundancy (error) matrix.
Obviously, if the elements of the error matrix are small enough, we can convert the above MCA decomposition of X into a CR matrix decomposition:
X ≈ CR   (51)
where C = A ∈ R^{I×A} and R = BC ∈ R^{A×J}, or C = AB ∈ R^{I×B} and R = C ∈ R^{B×J}.
Note that in order to reduce the size, A ≪ J and B ≪ I, and to minimize the Frobenius norm of the redundancy matrix ||E||_F, the choices of A and B are also very important. Generally, if A and B are given, then the three matrices can be obtained according to the method shown in figure 23 (split the original matrix and then obtain the three matrices from it). Another special property is that when rank(X) ≤ min(A, B), the matrix cross-approximation is exact, or the error matrix E is very small and can be ignored, i.e., X = ABC.
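A small NumPy illustration of the X = ABC cross approximation of eq. (50)-(51): A holds selected columns of X, C holds selected rows, and B links them. Choosing B by pseudo-inverses, as done here, is our own simple choice for the sketch, not the specific selection strategy of [85] or [16].

import numpy as np

X = np.random.randn(100, 8) @ np.random.randn(8, 80)       # a highly redundant (low-rank) matrix
n_sel = 10                                                  # number of selected columns and rows
cols = np.random.choice(X.shape[1], n_sel, replace=False)
rows = np.random.choice(X.shape[0], n_sel, replace=False)
A = X[:, cols]                                              # I x A block of columns
C = X[rows, :]                                              # B x J block of rows
B = np.linalg.pinv(A) @ X @ np.linalg.pinv(C)               # A x B linking matrix
print(np.linalg.norm(X - A @ B @ C) / np.linalg.norm(X))    # ~0, since rank(X) <= min(A, B)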


Now we extend the concept of the MCA to tensors, i.e., the tensor cross-approximation (TCA). There are usually two ways to implement TCA.
1. (Mahoney et al.) [93] extended MCA to the matrix form of tensor data (that is, compute the matricizations of the tensor and then apply MCA).
2. (Caiafa and Cichocki) [17] proposed a Fiber Sampling Tucker Decomposition (FSTD) that operates directly on the input tensor, but with the premise that it is based on a low-rank Tucker decomposition. Since a tensor usually has a good low-rank Tucker decomposition, the FSTD algorithm is often used. Figures 24 and 25 show the TCA by the FSTD algorithm.

FIGURE 24. The schematic diagram of TCA is similar to MCA. Note that R_1, R_2 and R_3 are selected appropriately, and then four new tensors A, B, C, D are formed. The rightmost part is the equivalent Tucker decomposition diagram. For a detailed derivation, please refer to formula 52.

FIGURE 25. A simple tensor network diagram of TCA and Tucker decomposition. Here is a schematic diagram of the conversion between TCA and Tucker decomposition [3].

We can see from figures 24 and 25 that the FSTD algorithm first finds a suitable cross tensor from the original input tensor, and then changes the size of the core tensor; specifically, by the formula
X = A ×_{1m} B^1 ×_{2m} B^2 ×_{3m} B^3 = W ×_{1m} B_{m1} ×_{2m} C_{m2} ×_{3m} D_{m3}   (52)
where the first equality is the standard Tucker decomposition. In the second equality, B_{m1} ∈ R^{I_1×R_2R_3}, C_{m2} ∈ R^{I_2×R_1R_3}, D_{m3} ∈ R^{I_3×R_1R_2}, and W = A ×_{1m} A_{m1}^† ×_{2m} A_{m2}^† ×_{3m} A_{m3}^† ∈ R^{R_2R_3×R_1R_3×R_1R_2}. Note that B_{m1} A_{m1}^† = B^1, C_{m2} A_{m2}^† = B^2, D_{m3} A_{m3}^† = B^3. The above is for a 3rd-order tensor; when the order becomes 2 (a matrix), it is easy to see that the TCA degenerates into the MCA.
For an Nth-order tensor, the formula for FSTD is as follows (Caiafa and Cichocki, 2015) [17]:
X = A ×_{1m} B^1 ×_{2m} B^2 ··· ×_{Nm} B^N = W ×_{1m} C^1_{m1} ×_{2m} C^2_{m2} ··· ×_{Nm} C^N_{mN}   (53)
For a 3rd-order tensor, the four cross tensors of the above FSTD (W, B, C, D) can be obtained by random projection (cf. formula 52), as follows:
W = X ×_{1m} B^1 ×_{2m} B^2 ×_{3m} B^3 ∈ R^{R_1×R_2×R_3}
B = X ×_{2m} B^2 ×_{3m} B^3 ∈ R^{I_1×R_2×R_3}
C = X ×_{1m} B^1 ×_{3m} B^3 ∈ R^{R_1×I_2×R_3}
D = X ×_{1m} B^1 ×_{2m} B^2 ∈ R^{R_1×R_2×I_3}   (54)
where B^n ∈ R^{R_n×I_n} are the projection matrices.

7) THE TENSOR TRAIN AND TENSOR CHAIN DECOMPOSITION
The CP decomposition is a special case of the Tucker decomposition, and when the core tensor of the Tucker decomposition is further decomposed into a hierarchical tree structure, it becomes the HT decomposition. The Tensor Chain (TC) decomposition is a special case of the HT decomposition: the core tensors are connected in series and aligned, i.e., every core tensor has the same order, and at the same time all the factor matrices are identity matrices. The advantage of having core tensors of the same form and identity factor matrices is that it can significantly reduce the amount of computation, facilitate subsequent optimization, and so on. The Tensor Train (TT) decomposition is also a special case of the HT decomposition; (Oseledets [60]; Oseledets and Tyrtyshnikov [61]) first put forward the concept of the TT decomposition. The only difference between the TT decomposition and the TC decomposition is that in the TT decomposition the order of the first and the Nth core tensors is one less than the order of the intermediate N−2 core tensors. In different domains, the TT decomposition has different names: generally speaking, in the field of physics, the Tensor Chain (TC) decomposition is referred to as the Matrix Product State (MPS) decomposition with periodic boundary conditions (PBC), while the TT decomposition is referred to as the MPS decomposition with open boundary conditions. Before we give the concrete expression, we draw a picture to give an intuitive explanation of the TT decomposition and the TC decomposition (see figure 26 and figure 27).

FIGURE 26. The TT and TC decompositions of a large vector. The vector is first reorganized into a suitable Nth-order tensor, Y ∈ R^{I_1×I_2×···×I_N} ← y ∈ R^I, I = I_1 I_2 ··· I_N, and then the TT and TC decompositions are performed on the Nth-order tensor. Figure (a) is the TT decomposition, and figure (b) is the TC decomposition. Please refer to formula 55 for the TT decomposition of an Nth-order tensor.

In figure 26 and figure 27, we first transform the large vector and matrix into an Nth-order and a 2Nth-order small tensor, respectively; then we decompose them by TT or TC. We can see that the only difference between the TT decomposition and the TC decomposition is that the TC decomposition connects the first core tensor and the last core tensor with a single line R_N.
Then we give a concrete mathematical expression of the TT decomposition of an Nth-order tensor Y ∈ R^{I_1×I_2×I_3×···×I_N}:
Y = A^1 ×_{3,1} A^2 ··· ×_{3,1} A^N   (55)
where A^n ∈ R^{R_{n−1}×I_n×R_n}, R_0 = R_N = 1, n = 1, 2, ···, N, and
y_{i_1,i_2,···,i_N} = Σ_{r_1,r_2,···,r_{N−1}=1}^{R_1,R_2,···,R_{N−1}} a^1_{1,i_1,r_1} a^2_{r_1,i_2,r_2} ··· a^{N−1}_{r_{N−2},i_{N−1},r_{N−1}} a^N_{r_{N−1},i_N,1}   (56)
where y_{i_1,i_2,···,i_N} and a^n_{r_{n−1},i_n,r_n} are the entries of Y and A^n, respectively.
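A NumPy sketch of computing a TT representation by successive SVDs, in the spirit of the TT decomposition of [60] (helper names ours), followed by rebuilding the tensor by chaining the cores as in eq. (55).

import numpy as np

def tt_svd(Y, eps=1e-12):
    # peel off one 3rd-order core at a time with an SVD of the reshaped remainder
    dims, N = Y.shape, Y.ndim
    cores, r_prev = [], 1
    C = Y.reshape(r_prev * dims[0], -1)
    for n in range(N - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = max(1, int(np.sum(s > eps * s[0])))
        cores.append(U[:, :r].reshape(r_prev, dims[n], r))
        C = (s[:r, None] * Vt[:r]).reshape(r * dims[n + 1], -1)
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))
    return cores

def tt_to_full(cores):
    T = cores[0]
    for G in cores[1:]:
        T = np.tensordot(T, G, axes=(-1, 0))    # chain the cores over the rank indices
    return T.reshape([G.shape[1] for G in cores])

Y = np.random.randn(4, 3, 5, 2)
cores = tt_svd(Y)
print([G.shape for G in cores])            # the TT ranks appear as the connecting dimensions
print(np.allclose(tt_to_full(cores), Y))   # exact when nothing is truncated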


The same decomposition can also be written with outer products of tensor fibers:
Y = Σ_{r_1,r_2,···,r_{N−1}=1}^{R_1,R_2,···,R_{N−1}} a_1^{1,r_1} ∘ a_2^{r_1,r_2} ∘ ··· ∘ a_{N−1}^{r_{N−2},r_{N−1}} ∘ a_N^{r_{N−1},1}   (57)
where a_n^{r_{n−1},r_n} = A^n(r_{n−1},:,r_n) ∈ R^{I_n} are tensor fibers (vectors).

FIGURE 27. The TT and TC decompositions of a large matrix. The matrix is first reorganized into a suitable 2Nth-order tensor, Y ∈ R^{I_1×J_1×···×I_N×J_N} ← Y ∈ R^{I×J}, I = I_1 I_2 ··· I_N, J = J_1 J_2 ··· J_N, and then the TT and TC decompositions are performed on the 2Nth-order tensor. Figure (a) is the TT decomposition, and figure (b) is the TC decomposition. Please refer to formula 58 for the TT decomposition of a 2Nth-order tensor.

The above three formulas are the TT decomposition formulas corresponding to a large vector reshaped into an Nth-order tensor (that is, figure 26). Similar to the TT decomposition of the Nth-order tensor, the TT decomposition of the 2Nth-order tensor (see figure 27) is as follows:
Y = A^1 ×_{4,1} A^2 ··· ×_{4,1} A^N   (58)
where A^n ∈ R^{R_{n−1}×I_n×J_n×R_n}, R_0 = R_N = 1, n = 1, 2, ···, N, and
y_{i_1,j_1,···,i_N,j_N} = Σ_{r_1,r_2,···,r_{N−1}=1}^{R_1,R_2,···,R_{N−1}} a^1_{1,i_1,j_1,r_1} a^2_{r_1,i_2,j_2,r_2} ··· a^{N−1}_{r_{N−2},i_{N−1},j_{N−1},r_{N−1}} a^N_{r_{N−1},i_N,j_N,1}   (59)
where y_{i_1,j_1,···,i_N,j_N} and a^n_{r_{n−1},i_n,j_n,r_n} are the entries of Y and A^n, respectively.
Y = Σ_{r_1,r_2,···,r_{N−1}=1}^{R_1,R_2,···,R_{N−1}} A_1^{1,r_1} ∘ A_2^{r_1,r_2} ∘ ··· ∘ A_{N−1}^{r_{N−2},r_{N−1}} ∘ A_N^{r_{N−1},1}   (60)
where A_n^{r_{n−1},r_n} = A^n(r_{n−1},:,:,r_n) ∈ R^{I_n×J_n} are tensor slices (matrices).
Similarly, a 3rd-order large tensor or a higher-order large tensor can be decomposed by TT in the same way (by reshaping it into a 3Nth-order or higher-order tensor).
Here we no longer give the mathematical expression of the TC decomposition, because there is almost no difference between the TT decomposition and the TC decomposition (mainly, the first and last core tensors have an additional dimension of size R_N).
We have thus given three common forms: the first is the product form between core tensor contractions, the second is the scalar (entrywise) expression, and the third is the outer product of tensor slices or tensor fibers. There are some other mathematical expressions for other uses; for example, the TT decomposition can also be written by performing the mode-n matricization of the core tensors and then using the strong Kronecker product or tensor slices. Those who are interested can refer to (Cichocki et al.) [3].
Similar to the CP rank, we define the TT rank:
r_TT(Y) = (R_1, R_2, ···, R_{N−1}),   R_n = rank(Y_{mcn}) = r(Y_{mcn})   (61)
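Eq. (61) expresses the TT rank through the mode-n canonical matricization Y_{mcn}, which is formally defined in eq. (62) below; here is a short NumPy check (helper name ours).

import numpy as np

def canonical_matricization(Y, n):
    # the first n modes index the rows, the remaining modes index the columns, cf. eq. (62)
    return Y.reshape(int(np.prod(Y.shape[:n])), -1)

Y = np.random.randn(4, 3, 5, 2)
print([np.linalg.matrix_rank(canonical_matricization(Y, n)) for n in range(1, Y.ndim)])
# for a generic tensor this gives min(product of left sizes, product of right sizes): [4, 10, 2]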


Here we add a concept. We have previously introduced the


definition of mode-n matricization of tensor. But in fact, there
are two ways to perform the matricization of tensor. One of
them is to extract one dimension as the first dimension of
the resulting matrix, and the remaining N-1 dimensions as
the second dimension. The other is to extract n dimensions
from the original tensor as the first dimension of the resulting
matrix, and the remaining (N-n) dimensions as the second
dimension. We call the latter mode-n cannonical matri-
cization of the tensor:
mat(Y )cn = Y mcn ∈ RI1 ···In ×In+1 ···IN (62)
where m means matricization, c means cannonical, n means FIGURE 28. Projected Entangled Pair States(PEPS) and Projected
mode-n. Entangled Pair Operators(PEPO). PEPS on the left and PEPO on the right.
The blue rectangles represent core tensors. They use the 5th and
According to the mathematical expression of TT decom- 6th-order core tensors, respectively [3].
position and the definition of TT rank, we give the computa-
tional complexity of TT decomposition.
N
X
Rn−1 In Rn ∼ O(NIR2 ), R = max Rn , I = max In (63)
n n
n=1

We can see from the formula that the complexity is related


to the TT rank. Thus, we need to find a suitable low rank TT
decomposition to reduce the complexity.
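To make the mode-n canonical matricization of formula 62 and the TT ranks of formula 61 concrete, we give a minimal NumPy sketch below. The function names, the row-major reshape convention, and the use of numpy.linalg.matrix_rank as a stand-in for the exact matrix rank are our own illustrative assumptions, not notation from the survey.

```python
import numpy as np

def mode_n_canonical_matricization(Y, n):
    """Group the first n dimensions into rows and the remaining ones into columns (formula 62)."""
    rows = int(np.prod(Y.shape[:n]))
    return Y.reshape(rows, -1)

def tt_ranks_from_unfoldings(Y):
    """The TT ranks equal the ranks of the canonical matricizations (formula 61)."""
    N = Y.ndim
    return [np.linalg.matrix_rank(mode_n_canonical_matricization(Y, n))
            for n in range(1, N)]

# usage sketch: a random 4th-order tensor
Y = np.random.rand(3, 4, 2, 5)
print(mode_n_canonical_matricization(Y, 2).shape)  # (12, 10)
print(tt_ranks_from_unfoldings(Y))                 # one rank per internal TT bond
```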

8) THE TENSOR NETWORKS (DECOMPOSITIONS) WITH CYCLES
In the above sections, we briefly introduced TT decomposition, HT decomposition and other tree tensor networks. We should note that all the tensor decomposition networks mentioned above do not contain a circle (except TC). We also mentioned in the previous section that the TT rank usually increases with the growth of the dimension of the original data tensor that needs to be decomposed. As the depth of decomposition increases for an arbitrary tree-shaped tensor network, the TT rank will also increase. In order to reduce the TT rank, researchers invented some layered tensor networks with loops. (Verstraete et al. [32]; Schuch et al. [100]) proposed Projected Entangled Pair States (PEPS) and Projected Entangled Pair Operators (PEPO), respectively (see figure 28). In these two kinds of tensor networks, they replaced the 3rd-order core tensors of the original TT decomposition with 5th and 6th-order core tensors, respectively. But they reduced the tensor rank at the expense of higher complexity, because the original 3rd-order core tensors rise to 5th or 6th order.
Sometimes, for some higher-order tensors in science and physics, the above two kinds of networks may not be enough to reduce the rank. Some researchers have proposed new tensor networks with more circles. (Giovannetti et al. [135]; Matsueda [50]) produced the Honey-Comb Lattice (HCL) and the Multi-scale Entanglement Renormalization Ansatz (MERA), respectively (see figure 29). They used the 3rd and 4th-order core tensors, respectively. However, as the number of cycles increases, the overall computational complexity of the network increases, i.e., we need to calculate more circles. In short, in order to reach a balance between rank and complexity in practice, the network will be selected according to the need.

FIGURE 29. Honey-Comb Lattice (HCL) and the Multi-scale Entanglement Renormalization Ansatz (MERA). HCL on the left and MERA on the right. The blue rectangles represent core tensors and the red circles represent factor matrices. HCL uses the 3rd-order core tensors while MERA uses the 3rd-order and 4th-order tensors.

Compared with the former two tensor networks with cycles, the size and dimension of the core tensors in MERA are usually smaller, so the number of unknown parameters (variables or free parameters) will be reduced, and the corresponding computational complexity will also be decreased. At the same time, the MERA network with cycles can help us find the relationship and interaction between tensors and free parameters. In general, the main idea of these four methods is to reduce the TT rank by increasing the number of core tensors and reducing the size of the core tensors, but usually at the cost of increasing computational complexity. The advantage of a small-size tensor is that it is easier to manage, and it can reduce the number of free parameters in the network. For a single small-scale tensor, the calculation is relatively simple. At the same time, we can see that due to the cycle structure, these four networks can usually describe the correlation between variables well.




D. THE NATURE AND ALGORITHM OF TT DECOMPOSITION
1) BASIC OPERATIONS IN TT DECOMPOSITION
If large-size tensors are given in the form of TT decomposition, then many calculations can be performed on the small-size core tensors. By performing operations on the small-size core tensors, the unknown parameters can be reduced effectively, and the operations can be simplified to achieve the effect of the optimization algorithm.
Consider two Nth-order tensors in TT decomposition:

X = X_1 ×_{3,1} X_2 · · · ×_{3,1} X_N ∈ R^{I_1×I_2×I_3×···I_N}
Y = Y_1 ×_{3,1} Y_2 · · · ×_{3,1} Y_N ∈ R^{I_1×I_2×I_3×···I_N}   (64)

where the core tensors are X_n ∈ R^{R_{n-1}×I_n×R_n}, Y_n ∈ R^{Q_{n-1}×I_n×Q_n} and their TT ranks are r_{TT}(X) = (R_1, · · · , R_{N-1}) and r_{TT}(Y) = (Q_1, · · · , Q_{N-1}), respectively. Note that the size and dimension of the two tensors are the same. Their operations have the following properties:
1. the Hadamard product of two tensors:

Z = X ⊛ Y = Z_1 ×_{3,1} Z_2 · · · ×_{3,1} Z_N   (65)

We can use tensor slices to represent the core tensors Z_n:

Z_n^{(i_n)} = X_n^{(i_n)} ⊗_L Y_n^{(i_n)},   n = 1, · · · , N,   i_n = 1, · · · , I_n   (66)

where Z_n ∈ R^{R_{n-1}Q_{n-1} × I_n × R_nQ_n} is the core tensor and Z_n^{(i_n)} ∈ R^{R_{n-1}Q_{n-1} × R_nQ_n}, X_n^{(i_n)} ∈ R^{R_{n-1} × R_n}, Y_n^{(i_n)} ∈ R^{Q_{n-1} × Q_n} are the tensor slices (obtained by fixing the second dimension at i_n).
2. the sum of two tensors:

Z = X + Y   (67)

where its TT rank is r_{TT}(Z) = r_{TT}(X) + r_{TT}(Y) = (R_1 + Q_1, R_2 + Q_2, · · · , R_{N-1} + Q_{N-1}). Similar to the previous one, we can still use tensor slices to represent Z_n:

Z_n^{(i_n)} = \begin{bmatrix} X_n^{(i_n)} & 0 \\ 0 & Y_n^{(i_n)} \end{bmatrix},   n = 2, 3, 4, · · · , N − 1   (68)

Note that the tensor slices of the first and last core tensors are as follows:

Z_1^{(i_1)} = \begin{bmatrix} X_1^{(i_1)} & Y_1^{(i_1)} \end{bmatrix},   Z_N^{(i_N)} = \begin{bmatrix} X_N^{(i_N)} \\ Y_N^{(i_N)} \end{bmatrix}   (69)

3. the quantitative product (inner product) of two tensors:

A_N = X • Y
A_n = X_n ×_{1,2} (Y_n ×_{1m} A_{n-1}) ∈ R^{R_n × Q_n},   n = 1, · · · , N   (70)

Then we calculate the final result by an iterative method; for the specific process see algorithm 5.

Algorithm 5 The Quantitative Product of Two Tensors Expressed in the Form of TT Decomposition
Input: The two Nth-order tensors X = X_1 ×_{3,1} X_2 · · · ×_{3,1} X_N ∈ R^{I_1×I_2×I_3×···I_N}, Y = Y_1 ×_{3,1} Y_2 · · · ×_{3,1} Y_N ∈ R^{I_1×I_2×I_3×···I_N}, where X_n ∈ R^{R_{n-1}×I_n×R_n}, Y_n ∈ R^{Q_{n-1}×I_n×Q_n}, R_0 = Q_0 = R_N = Q_N = 1.
Output: the quantitative product of the two tensors
Initialize A_0 = 1;
for n = 1 to N do
  (Z_n)_{m1} = A_{n-1} (Y_n)_{m1} ∈ R^{Q_{n-1} × Q_n};
  A_n = ((X_n)_{mc2})^T (Z_n)_{mc2} ∈ R^{R_n × Q_n};
end for
return A_N = X • Y ∈ R

4. the multiplication of a large-size matrix and a vector:

Ax ≈ y   (71)

where A ∈ R^{I×J}, X ∈ R^{J_1×J_2×···×J_N} ← x ∈ R^J, J = J_1 J_2 · · · J_N, Y ∈ R^{I_1×I_2×···×I_N} ← y ∈ R^I, I = I_1 I_2 · · · I_N are decomposed in TT. We give an intuitive picture to show it (see figure 30).

FIGURE 30. The multiplication of large-size matrix and vector. Ax ≈ y, A ∈ R^{I×J}, X ∈ R^{J_1×J_2×···×J_N} ← x ∈ R^J, J = J_1 J_2 · · · J_N, Y ∈ R^{I_1×I_2×···×I_N} ← y ∈ R^I, I = I_1 I_2 · · · I_N [3].

As we can see from figure 30, A_n ∈ R^{A_{n-1}×I_n×J_n×A_n}, X_n ∈ R^{R_{n-1}×J_n×R_n}, Y_n ∈ R^{Q_{n-1}×I_n×Q_n}. If starting from the form of the outer product of the TT decomposition, it is as follows:

A = \sum_{a_1, a_2, · · · , a_{N-1}=1}^{A_1, A_2, · · · , A_{N-1}} A_1^{1, a_1} ◦ A_2^{a_1, a_2} ◦ · · · ◦ A_N^{a_{N-1}, 1}
X = \sum_{r_1, r_2, · · · , r_{N-1}=1}^{R_1, R_2, · · · , R_{N-1}} x_1^{1, r_1} ◦ x_2^{r_1, r_2} ◦ · · · ◦ x_N^{r_{N-1}, 1}
Y = \sum_{r_1, r_2, · · · , r_{N-1}=1}^{R_1, R_2, · · · , R_{N-1}} y_1^{1, r_1} ◦ y_2^{r_1, r_2} ◦ · · · ◦ y_N^{r_{N-1}, 1}   (72)

then the multiplication of a matrix and a vector is equivalent to:

y_n^{r_{n-1}, r_n} = A_n^{a_{n-1}, a_n} x_n^{r_{n-1}, r_n},   Q_n = A_n R_n,   n = 1, 2, · · · , N   (73)

Similarly, we can use the tensor network of TT decomposition to represent some loss functions (see figure 31).

FIGURE 31. Special loss function represented by a TT decomposition network. J(x) = x^T A^T A x, where x^T, A^T, A, x are all represented by TT decomposition.

Similar to the multiplication of matrices and vectors, TT decomposition can also be used to simplify the solution of the multiplication between large-scale matrices and matrices, and here we omit its solution. Since the outer product calculation is relatively simple, we use the outer product expression of the TT decomposition to simplify the multiplication between the matrix and the vector. Of course, we can also use the TT decomposition expressed in the form of Kronecker products or tensor contractions for a simplified solution. For more calculations on TT decomposition, please refer to (Kazeev et al. [132]; Lee and Cichocki [99]).
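To make the core-wise computation concrete, we give a minimal NumPy sketch of the quantitative (inner) product of two tensors stored in TT format, in the spirit of algorithm 5. The core layout (R_{n-1}, I_n, R_n) and the helper name tt_inner_product are our own illustrative choices, not notation from the survey.

```python
import numpy as np

def tt_inner_product(x_cores, y_cores):
    """Inner product of two tensors given by TT cores.

    Each core has shape (R_{n-1}, I_n, R_n); boundary ranks are 1.
    The running matrix A has shape (R_n, Q_n), as in algorithm 5.
    """
    A = np.ones((1, 1))
    for Xn, Yn in zip(x_cores, y_cores):
        # contract the previous result into the Y core: (R_{n-1}, I_n, Q_n)
        Z = np.einsum('rq,qiw->riw', A, Yn)
        # contract over the shared physical index i_n and the left rank R_{n-1}
        A = np.einsum('riw,riv->vw', Z, Xn)  # shape (R_n, Q_n)
    return A.item()

# usage sketch: two random 4th-order tensors in TT format with matching sizes
dims, rx, ry = [3, 4, 2, 5], [1, 2, 3, 2, 1], [1, 2, 2, 3, 1]
x_cores = [np.random.rand(rx[n], dims[n], rx[n + 1]) for n in range(4)]
y_cores = [np.random.rand(ry[n], dims[n], ry[n + 1]) for n in range(4)]
print(tt_inner_product(x_cores, y_cores))
```

The point of the sketch is that the cost scales with the TT ranks and mode sizes of the cores, never with the full number of entries of the two tensors.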

2) TT DECOMPOSITION SOLUTION
The solution of the TT decomposition is similar to the solution of the truncated HOSVD algorithm mentioned above (see algorithm 2), and the following constraint needs to be met:

(‖Y − \hat{Y}‖_{l_2})^2 ≤ \sum_{n=1}^{N-1} \sum_{k=R_n+1}^{I_n} (σ_k(Y_{mcn}))^2   (74)

where \hat{Y} is the approximate estimated tensor of the original tensor. The l_2 norm of the tensor is equal to the Frobenius norm of the tensor. σ_k(Y_{mcn}) represents the kth largest singular value of the nth canonical matricization of the input tensor Y.
Under the above constraint, there are usually three basic ways to obtain the solution of the TT decomposition:
1 SVD-based TT algorithm (TT-SVD)
2 Algorithm based on low rank matrix decomposition (LRMD)
3 Restricted Tucker-1 decomposition (RT1D)

FIGURE 32. SVD-based TT algorithm (TT-SVD) [40] for a 4th-order tensor X ∈ R^{I_1×I_2×I_3×I_4}. First, we perform the mode-n matricization of the tensor X; here we perform the mode-1 matricization for convenience. Then we perform the SVD decomposition and execute algorithm 6 step by step.

FIGURE 33. Algorithm based on low rank matrix decomposition for a 4th-order tensor X ∈ R^{I_1×I_2×I_3×I_4}. First, we perform the mode-n matricization of the tensor X; here we perform the mode-1 matricization for convenience. Then we perform the CR/MCA/LR or other low-rank matrix decomposition methods, and then proceed step by step according to algorithm 7.

The SVD-based TT algorithm first performs the mode-n matricization on the input tensor and then performs the HOSVD decomposition (see figure 32 and algorithm 6).
The algorithm based on low rank matrix decomposition (LRMD) is similar to the SVD-based TT algorithm (see figure 33 and algorithm 7). The only difference we noticed is that, after performing the mode-n matricization of the first step, the original complex SVD operation is simplified by using matrix cross-approximation or CR decomposition.
Note that in the above two methods, we constructed the mode-n matricization of a tensor and then performed matrix decomposition-related operations. The third method, Restricted Tucker-1 decomposition (RT1D), converts the original input tensor into a 3rd-order tensor, and then performs Tucker-1 and Tucker-2 decomposition (see figure 34).



Algorithm 6 SVD-Based TT Algorithm (TT-SVD) [40]
Input: The Nth-order tensor X ∈ R^{I_1×I_2×···I_N} and accuracy ε
Output: Approximate tensor of TT decomposition \hat{X}, such that ‖X − \hat{X}‖_F ≤ ε
1: Initialize R_0 = 1, Z_1 = X_{m1};
2: for n = 1 to N−1 do
3:   [U_n, S_n, V_n^T] = [U_{n1}, U_{n2}][S_{n1}, 0][V_{n1}^T, V_{n2}^T] = truncated-svd(Z_n, ε/√N);
4:   A_n = U_{n1};
5:   Reshape A_n in the manner described in figure 32, A_n = A_n.reshape([R_{n-1}, I_n, R_n]);
6:   Z_{n+1} = S_n V_n^T.reshape([R_n I_{n+1}, I_{n+2} I_{n+3} · · · I_N]);
7: end for
8: Compute the last core A_N = Z_N.reshape([R_{N-1}, I_N, 1]);
9: return the cores A_1, A_2, · · · , A_N

Algorithm 7 Algorithm Based on Low Rank Matrix Decomposition (LRMD) (Taking CR Decomposition as an Example) [29]
Input: The Nth-order tensor X ∈ R^{I_1×I_2×···I_N} and accuracy ε
Output: Approximate tensor of TT decomposition \hat{X}, such that ‖X − \hat{X}‖_F ≤ ε
1: Initialize R_0 = 1, Z_1 = X_{m1};
2: for n = 1 to N−1 do
3:   [C_n, R_n] = CR-decomposition(Z_n, ε);
4:   Choose the suitable R_n;
5:   Reshape C_n in the manner described in figure 33, A_n = C_n.reshape([R_{n-1}, I_n, R_n]);
6:   Z_{n+1} = R_n.reshape([R_n I_{n+1}, I_{n+2} I_{n+3} · · · I_N]);
7: end for
8: Compute the last core A_N = Z_N.reshape([R_{N-1}, I_N, 1]);
9: return the cores A_1, A_2, · · · , A_N

It can be seen from figure 34 that we first compress the original tensor into a 3rd-order tensor. It should be noted that we use the first and Nth dimensions of the original Nth-order tensor as the first and third dimensions of the new 3rd-order tensor, respectively. The remaining N−2 dimensions are all multiplied together as the second dimension of the new 3rd-order tensor. Specifically, as shown in the following formula:

Y_1 = X.reshape([I_1, I_2 I_3 · · · I_{N-1}, I_N])   (75)

For the new third-order tensor, we first perform the Tucker-1 decomposition. Then, according to the parity of N, we perform the Tucker-2 or Tucker-1 decomposition, respectively. Specifically, as shown in algorithm 8.

Algorithm 8 Restricted Tucker-1 Decomposition (RT1D) [10]
Input: The Nth-order tensor X ∈ R^{I_1×I_2×···I_N} and accuracy ε
Output: Approximate tensor of TT decomposition \hat{X}, such that ‖X − \hat{X}‖_F ≤ ε
1: Initialize R_0 = R_N = 1, Y_1 = Tensorization(X) ∈ R^{I_1 × I_2 I_3 ··· I_{N-1} × I_N};
2: if N is an odd number then
3:   for n = 1 to (N−1)/2 do
4:     [B_n, Y_{n+1}, B_{N+1−n}] = Tucker-1-decomposition(Y_n, ε);
5:     Estimate R_n = size(B_n, 2), R_{N−n} = size(B_{N+1−n}, 2);
6:     Reshape Y_n in the manner described in figure 34, Y_{n+1} = Y_{n+1}.reshape([R_{n−1} I_n, I_{n+1} · · · I_{N−n}, I_{N−n+1} R_{N−n+1}]), A_n = B_n.reshape([R_{n−1}, I_n, R_n]), A_{N+1−n} = B_{N+1−n}.reshape([R_{N−n}, I_{N+1−n}, R_{N+1−n}]);
7:   end for
8:   Compute the last core A_{(N+1)/2} = Y_{(N+1)/2}.reshape([R_{(N−1)/2}, I_{(N+1)/2}, R_{(N+1)/2}]);
9: else {N is an even number}
10:   for n = 1 to (N−2)/2 do
11:     [B_n, Y_{n+1}, B_{N+1−n}] = Tucker-1-decomposition(Y_n, ε);
12:     Estimate R_n = size(B_n, 2), R_{N−n} = size(B_{N+1−n}, 2);
13:     Reshape Y_n in the manner described in figure 34, Y_{n+1} = Y_{n+1}.reshape([R_{n−1} I_n, I_{n+1} · · · I_{N−1}, R_{N−1}]), A_n = B_n.reshape([R_{n−1}, I_n, R_n]), A_{N+1−n} = B_{N+1−n}.reshape([R_{N−n}, I_{N+1−n}, R_{N+1−n}]);
14:   end for
15:   [B_{N/2}, Y_{(N+2)/2}] = Tucker-2-decomposition(Y_{N/2}, ε);
16:   Reshape Y_n in the manner described in figure 34, A_{N/2} = B_{N/2}.reshape([R_{(N−2)/2}, I_{N/2}, R_{N/2}]), A_{(N+2)/2} = Y_{(N+2)/2}.reshape([R_{N/2}, I_{(N+2)/2}, R_{(N+2)/2}]);
17: end if
18: return the cores A_1, A_2, · · · , A_N
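Before moving on, we give a minimal NumPy sketch of the TT-SVD procedure of algorithm 6. The relative truncation threshold used here (discarding singular values whose tail energy is within a per-step budget) and the function name tt_svd are our own illustrative assumptions rather than the exact criterion of the survey.

```python
import numpy as np

def tt_svd(X, eps=1e-10):
    """Decompose a dense tensor X into TT cores by successive truncated SVDs.

    Returns a list of cores with shapes (R_{n-1}, I_n, R_n), R_0 = R_N = 1.
    """
    dims = X.shape
    N = len(dims)
    delta = eps / np.sqrt(N - 1) * np.linalg.norm(X)   # per-step truncation budget
    cores, r_prev = [], 1
    Z = X.reshape(r_prev * dims[0], -1)
    for n in range(N - 1):
        U, S, Vt = np.linalg.svd(Z, full_matrices=False)
        # keep the smallest rank whose discarded tail stays within the budget
        tail = np.sqrt(np.cumsum(S[::-1] ** 2))[::-1]
        r = max(1, int(np.sum(tail > delta)))
        cores.append(U[:, :r].reshape(r_prev, dims[n], r))
        Z = (S[:r, None] * Vt[:r]).reshape(r * dims[n + 1], -1)
        r_prev = r
    cores.append(Z.reshape(r_prev, dims[-1], 1))
    return cores

# usage sketch: a random 4th-order tensor
X = np.random.rand(4, 5, 3, 6)
cores = tt_svd(X, eps=1e-12)
print([c.shape for c in cores])
```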




FIGURE 34. Restricted Tucker-1 decomposition(RT1D) [10] for a 4th-order tensor X ∈ R I1 ×I2 ×I3 ×I4 and a 5th-order tensor
X ∈ R I1 ×I2 ×I3 ×I4 ×I5 . Similar to TT-SVD and LRMD, we first convert the original tensor into a new 3rd-order tensor, next
perform the Tucker-1 decomposition, and then follow the algorithm 8 step by step. On the left is a schematic diagram of
the 4th-order tensor and on the right is a schematic diagram of the 5th-order tensor.

3) TT TRUNCATION
In the previous section, we discussed the problem of increased complexity due to an increased TT rank. Therefore, if we still use the TT decomposition, we need to adopt some approximate decomposition algorithms to reduce the TT rank. (Oseledets) [60] proposed an algorithm called TT Truncation. The algorithm takes as input a tensor with a large TT rank, and its goal is to find an approximate solution whose rank is much smaller than that of the original input tensor. For TT Truncation, please refer to algorithm 9 and figure 35.
It is noted that algorithm 9 actually performs the mode-n canonical matricization of the core tensors and then performs a low rank matrix approximation (SVD and QR). We notice that, in the process of calculating the low rank matrix decomposition, the size of the matrix becomes smaller and smaller because of the continuous iterative optimization, so the complexity is continuously reduced while performing the decomposition. By TT Truncation, the TT rank can be reduced to the utmost extent and the corresponding approximate tensor can be found, which greatly reduces the computational complexity and improves the efficiency of future data processing, mathematical operations, and so on. Of course, some researchers have developed a similar method for the HT decomposition. For details, please refer to (Kressner and Tobler) [25].

E. BRIEF SUMMARY FOR PART ONE
Part one mainly introduced the basic knowledge about tensors, including the definition of a tensor, the operations on tensors, and the concept of tensor decomposition. As a new technique, tensor decomposition can reduce the computational complexity and memory by decomposing a tensor into lower-order tensors, matrices, and vectors. At the same time, it can preserve the data structure, effectively reduce the dimension, avoid the curse-of-dimensionality problems, and extract the important parts we need from the correlations. At the same time, the biggest feature of tensor decomposition is that the increase of dimension will lead to the non-uniqueness of the decomposition. So we usually want to get an approximate solution of it instead of an exact solution, so that we do not waste too much computation time and can still get a good approximation of the original data.
Due to the limited space of this survey, there are some new tensor decompositions that are not covered in detail, such as t-svd (Zhang and Aeron) [165] and tensor ring decomposition (Zhao et al.) [109]. The above introduction covers several important tensor decompositions of this survey, which have important applications in part two. At the same time, some of these decomposition algorithms have their own advantages or limitations.
For CP decomposition, due to its particularity, if a certain constraint condition is imposed on the factor matrices or core tensor, an accurate solution can be obtained. The constraint is mainly determined according to the required environment. The advantage is that it can extract the structured information of the data, which helps better extract and process the required data, and improves the accuracy of the application in the future. For the Tucker decomposition, since the decomposition is general, there are usually many solutions, so it is usually considered to impose a constraint term, such as the orthogonal constraint we mentioned above. Then the Tucker decomposition becomes the HOSVD decomposition.




Algorithm 9 TT Truncation (I. V. Oseledets, 2011) [60]
Input: The Nth-order tensor Y = Y^1 ×_{3,1} Y^2 · · · ×_{3,1} Y^N ∈ R^{I_1×I_2×I_3×···I_N}, with core tensors Y^n ∈ R^{R_{n-1}×I_n×R_n} and a large TT rank r_{TT}(Y) = (R_1, R_2, · · · , R_{N-1}), R_n = r(Y_{mcn})
Output: Approximate tensor of TT decomposition \hat{Y} with a small TT rank r_{TT}(\hat{Y}) = (\hat{R}_1, \hat{R}_2, · · · , \hat{R}_{N-1}), \hat{R}_0 = \hat{R}_N = 1, \hat{R}_n = r(\hat{Y}_{mcn}), such that ‖Y − \hat{Y}‖_F ≤ ε
1: Initialize \hat{Y} = Y, a = ε/√(N−1);
2: for n = 1 to N−1 do
3:   [Q^n, R^n] = QR-decomposition(Y^n_{mc2});
4:   Replace Y^n_{mc2} = Q^n and Y^{n+1}_{mc1} = R^n Y^{n+1}_{mc1}, Y^{n+1}_{mc1} ∈ R^{R_n × I_{n+1} R_{n+1}};
5: end for
6: for n = N to 2 do
7:   [U^n, Λ^n, V^{nT}] = truncated-svd(Y^n_{mc1}, a);
8:   find the smallest rank \hat{R}_{n-1} such that \sum_{i > \hat{R}_{n-1}} α_i^2 ≤ a^2 ‖α‖_1^2 = a^2 (\sum_{i=1}^{R_{n-1}} |α_i|)^2;
9:   Replace \hat{Y}^{n-1}_{mc2} = \hat{Y}^{n-1}_{mc2} \hat{U}^n \hat{Λ}^n ∈ R^{R_{n-2} I_{n-1} × \hat{R}_{n-1}} and \hat{Y}^n_{mc1} = \hat{V}^{nT} ∈ R^{\hat{R}_{n-1} × I_n \hat{R}_n};
10:  Reshape \hat{Y}^n = \hat{Y}^n_{mc2}.reshape([\hat{R}_{n-1}, I_n, \hat{R}_n]);
11: end for
12: return Approximate tensor \hat{Y} = \hat{Y}^1 ×_{3,1} \hat{Y}^2 · · · ×_{3,1} \hat{Y}^N ∈ R^{I_1×I_2×I_3×···I_N}, where \hat{Y}^n ∈ R^{\hat{R}_{n-1}×I_n×\hat{R}_n}, \hat{R}_0 = \hat{R}_N = 1

FIGURE 35. TT Truncation for a 3rd-order tensor. First perform the TT decomposition on the original third-order tensor (the rank of the TT decomposition at this time is relatively large), and then the TT approximate solution with the lower TT rank is found step by step according to algorithm 9.

For HT decomposition, the utility is relatively poor due to the need to determine the binary tree, and most of us use TT decomposition. The biggest advantage of TT decomposition is that only core tensors are used, and thus we just need to calculate between core tensors. However, as we mentioned earlier, one of the biggest drawbacks of TT decomposition is that if there is no lower TT rank (i.e., there is no low rank TT solution), the computational complexity will be high.
Next we will introduce some basic tensor decomposition applications.

F. VARIOUS TYPES OF DECOMPOSITION APPLICATIONS
We can find that almost all tensor-based algorithms are inseparable from tensor decomposition because of the huge amount of unknown parameters. Therefore, tensor decomposition becomes very important in high-dimensional problems. As described in part two of this survey, we can find that rank-one decomposition can be applied in tensor regression and the support tensor machine to solve optimization problems with constraint terms. However, since not all tensors can be given a rank-one decomposition, its application has certain limitations. Some results can be seen in recent papers, such as Zhou et al.'s Tensor regression with applications in neuroimaging data analysis [55], Chen et al.'s A hierarchical support tensor machine structure for target detection on high-resolution remote sensing Images [45], Makantasis et al.'s Tensor-Based Classification Models for Hyperspectral Data Analysis [69], and Makantasis et al.'s Tensor-Based Nonlinear Classifier for High-Order Data Analysis [70].
For CP decomposition, the best approximate solution can usually be found even if there are no special constraints on the original tensor or factors (such as orthogonal, independent, sparse, etc.). Therefore, CP decomposition is applied in many tensor-based algorithms. Some results can be seen in recent papers, such as Tresp et al.'s Learning with memory embeddings [137], Biswas and Milanfar's Linear Support Tensor Machine With LSK Channels: Pedestrian Detection




in Thermal Infrared Images [121], Pham and Yan’s Tensor


Decomposition of Gait Dynamics in Parkinson’s Dis-
ease [125], Xu et al.’s Application of support higher-order
tensor machine in fault diagnosis of electric vehicle range-
extender [147], Zdunek and Fonal's Randomized Nonnega-
tive Tensor Factorization for Feature Extraction from High-
dimensional Signals [115], Kisil et al.’s Common and Indi-
vidual Feature Extraction Using Tensor Decompositions: a
Remedy for the Curse of Dimensionality? [57], and Kargas
and Sidiropoulos’s Completing a joint PMF from projections:
A low-rank coupled tensor factorization approach [98].
In practice, we tend to impose constraints on the original input tensor or the resulting core tensor. So the application of Tucker decomposition is usually translated into the application of HOSVD decomposition. In fact, HOSVD decomposition is a multidimensional extension of PCA. Some results can be seen in recent papers, such as Hu et al.'s Attribute-Enhanced Face Recognition with Neural Tensor Fusion Networks [37], Fanaee-T and Gama's Tensor-based anomaly detection: An interdisciplinary survey [46], Chen et al.'s Robust supervised learning based on tensor network method [156], Sofuoglu and Aviyente's A Two-Stage Approach to Robust Tensor Decomposition [118], Imtia and Sarwate's Improved Algorithms for Differentially Private Orthogonal Tensor Decomposition [47], Kisil et al.'s Common and Individual Feature Extraction Using Tensor Decompositions: a Remedy for the Curse of Dimensionality? [57], and Kossaifi et al.'s Tensor Contraction Layers for Parsimonious Deep Nets [64].
Because of their intuitive tree or chain representations, HT and TT decomposition are used in many places. However, because the tree structure is not necessarily unique, the HT decomposition always has a variety of tree structures. So researchers often extend HT decomposition to a fixed TT decomposition. For the traditional HT decomposition, please refer to Bachmayr et al.'s Tensor networks and hierarchical tensors for the solution of high-dimensional partial differential equations [86], Zhang and Barzilay's Hierarchical low-rank tensors for multilingual transfer parsing [158], and Kountchev and Kountcheva's Truncated Hierarchical SVD for image sequences, represented as third order tensor [111]. In recent years, there have been many studies on TT decomposition, especially in terms of properties and algorithms. Here we give some references, such as Kressner and Uschmajew's On low-rank approximability of solutions to high-dimensional operator equations and eigenvalue problems [26], Steinlechner's Riemannian Optimization for Solving High-Dimensional Problems with Low-Rank Tensor Structure [92], Phan et al.'s Tensor networks for latent variable analysis. Part I: Algorithms for tensor train decomposition [10], Wu et al.'s General tensor spectral co-clustering for higher-order data [130], Chen et al.'s Parallelized Tensor Train Learning of Polynomial Classifiers [161], Wang et al.'s Support vector machine based on low-rank tensor train decomposition for big data applications [155], and Xu et al.'s Whole Brain fMRI Pattern Analysis Based on Tensor Neural Network [148].

FIGURE 36. Traditional linear regression model. Where (a) is a linear regression of a two-dimensional plane, and (b) is a linear regression of three-dimensional space.

III. PART TWO: TENSOR APPLICATION IN MACHINE LEARNING AND DEEP LEARNING
The second part builds on the first part's tensor operations and tensor decompositions. This part mainly discusses the application of innovative tensor-based algorithms in machine learning and deep learning. For example, converting the traditional input vector to a new tensor produces a new tensor-based algorithm, such as the support tensor machine, tensor regression, and so on. These algorithms mainly achieve the goal of improving accuracy by finding structured information in the original data and performing subsequent data processing. Some algorithms tensorize a weight matrix or vector, and use tensor decomposition on the resulting tensor. The number of elements can be reduced by tensor decomposition, which can effectively reduce the complexity and running time. Before the specific description, we give the outline of the algorithms (see table 1).

A. APPLICATION OF TENSOR IN REGRESSION
1) TENSOR REGRESSION
Consider a traditional linear regression model (see figure 36):

y = w^T x + b   (76)

where x ∈ R^N is the sample feature vector, w ∈ R^N is the coefficient vector, and b is the bias. Regression models are often used for prediction, such as stock market forecasts, weather forecasts, etc. When we expand the input x into a tensor, it becomes tensor regression. First let's consider a simple case where the input is a tensor X ∈ R^{I_1×I_2···I_N} and the predicted value y is a scalar. Usually tensor regression has the following expression:

y = W • X + b   or   y = W • X + b + a^T c   (77)

where W ∈ R^{I_1×I_2···I_N} is the coefficient tensor, and b is the bias. Some researchers will sometimes add a vector-valued covariate c. In general, the solution of tensor regression is to decompose the coefficient tensor and then solve the factors by the alternating least squares (ALS) method, such as rank-1 decomposition, CP decomposition, Tucker decomposition,




TABLE 1. Tensor based algorithm.

TT decomposition, etc. For example, (Zhou et al.) [55] pro- The nonlinear function of the above formula can be mod-
posed the rank-1 and CP decomposition. Then the formula eled by a Gaussian process, as follows:
becomes:
f (X ) ∼ GP(m(X ), k(X , e
X )|θ) (81)
1 2 N T
y = w ◦ w ◦ ···w •X +b+a c
where m(X ) is the mean function, k(X , e
X ) is the kernel func-
y = 3 ×1m W1 ×2m W2 · · · ×Nm WN • X + b + aT c (78)
tion and θ is the associated hyperparameter. For the sake of
simplicity, we use the standard Gaussian process m(X ) = 0.
Tensor regression of the Tucker decomposition form is
For the kernel function, we use the product probability kernel:
similar. For details, please refer to (Hoff et al. [102];
Yu et al. [113]). The general tensor regression is attributed N X X
e
Y x|n )]
D[p(x|n )||q(e
to solving the following minimization problem: k(X , e
X ) = α2 exp( ) (82)
−2βn2
n=1
N
where α represents the amplitude parameter and β repre-
X
L(a, b, W ) = arg min yi − yi )2 ,
(b i = 1, · · · , N (79)
a,b,W
P px
i=1 sents the scale parameter, D(p||q) = x=1 p(x)log q(x) =
R px X
xp(x)log q(x) dx means the KL divergence, p(x|n )
whereb yi = W • X i + b + aT c represents the predicted value
means the Gaussian distribution of vector variable x =
corresponding to the ith tensor sample, X i represents the ith
[x1 , ·P
· · , xId ], the mean vector and the covariance matrix are
tensor sample, and yi represents the true value of the ith tensor n
µn , , respectively. Note that the mean vector and the
sample.
covariance matrix are determined from the mode-n matri-
We give the following general algorithm for tensor regres-
cization X mn of X by treating each X mn as a probability model
sion (see algorithm 10).
with In number of variables and I1 ×· · ·×In−1 ×In+1 · · ·×IN
number of observations.
2) TENSOR VARIABLE GAUSSIAN PROCESS REGRESSION
When we have determined the parameters from the training
Tensor variable Gaussian process regression is similar to what set, the purpose of the tensor Gaussian process regression is
we have introduced in the previous section. The same thing is to infer the probability distribution of the output for the test
that the input X i ∈ RI1 ×I2 ···IN is still an Nth-order tensor and point X test , i.e.:
the output yi is a scalar, and the difference is that the input here
is subject to Gaussian distribution. (Hou et al.) [90] assumed p(ytest |X test , X, y, θ, σ 2 ) (83)
that the output consists of a nonlinear function with respect
to input X and Gaussian noise  i ∼ N (0, σ 2 ), as follows: where X = [X 1 , X 2 , · · · , X N ]T ∈ RN ×I1 ×I2 ···IN means com-
bining all sample tensors, and y = [y1 , y2 , · · · , yN ]T ∈ RN .
yi = f (X i ) +  i i = 1, · · · , N (80) But actually we only need to know the distribution of f (X test )




Algorithm 10 Tensor Regression Algorithm (Hoff) [102] used the residual mean squared error to mea-
(Zhou et al.) [55] sure the error between the true value and the prediction value:
Input: PN
N Nth-order sample data tensors X i ∈ RI1 ×I2 ···IN , i = ||Y i − AX i BT ||2F
(A, B) = arg min i=1 (87)
1, · · · , N and its true value yi , a vector-valued covariate A,B n
c.;
Output: By deriving the above formula, we finally get:
a, b, W ; X X
N A=( Y i B(X i )T )( X i BT B(X i )T )−1
yi − yi )2 ;
P
1: Initialize W = 0, solve (a, b) = mina,b,W (b X X
i=1 B = ( (Y i )T AX i )( (X i )T AT AX i )−1 (88)
2: Initialize the factor matrices Wn for n = 1, · · · , N and
core tensor 3 for CP decomposition or initialize the Similarly, we can get A and B respectively by alternating
factor vectosr for rank-1 decomposition, other decompo- least squares.
sition is similar; We further extend to generalized tensor regression as
3: while the number of iterations is not reached or there is follows:
no convergence do
4: for n=1 to N do Y i = X i ×1m W1 ×2m W2 · · · ×nm WN + E (89)
5: solve Wn = minWn L(a, b, 3, W1 , · · · , Wn−1 ,
where Wn ∈ RJn ×In are coefficient matrices (factor matrices)
Wn+1 , · · · , WN );
and X i ∈ RI1 ×I2 ···×IN , Y i ∈ RJ1 ×J2 ···×JN are input and output
6: end for
tensors, respectively. E ∈ RJ1 ×J2 ···×JN is a Noise tensor.
7: solve 3 = min3 L(a, b, 3, W1 , · · · , WN );
N
Note that there is a property between the mode-n product
yi − yi )2 ; and the Kronecker product, as follows:
P
8: (a, b) = mina,b,W (b
i=1
9: end while Z = X ×1m W1 ×2m W2 · · · ×nm WN
Z mn = Wn X mn (WN ⊗R · · · ⊗R Wn+1
⊗R Wn−1 · · · W1 )T
according to the expression. So it finally turns to solve the
following expression: Z v1 = (WN ⊗R · · · ⊗R W1 )X v1 (90)

p(f (X test )|X test , X, y, θ, σ 2 ) (84) Therefore, we only need to adopt the mode-n matricization
on both sides of the formula 89 to get the solution:
Here we omit the complicated calculations and give the
results directly. It is noted that the test samples are also subject Y mn = Wn e
X mn + E mn (91)
to the Gaussian distribution, and the probability properties of
the distribution is accorded to Bayesian conditions. We get: where mat(e X )n = mat(X )n (WN ⊗R · · · ⊗ Wn+1 ⊗R
Wn−1 · · · W1 )T . Then through formula 88 we finally get:
p(f (X test )|X test , X, y, θ, σ 2 ) ∼ N (µtest , σtest
2
) (85)
i
X X i i
Wn = ( X mn )T ) (
Y imn (e X mn )T )−1 (92)
X mn (e
e
where µtest = k(X test , X)T (K + σ 2 I )−1 y and σtest2 =
k(X , X ) − k(X , X) (K + σ I ) k(X , X).
test test test T 2 −1 test
Finally, we give the specific algorithm of the whole gener-
Tensor variable Gaussian process regression is generally alized tensor regression (see algorithm 11).
used to deal with noise-bearing and Gaussian-distributed
data. It has certain limitations, and this method is compu-
tationally expensive. Without using tensor decomposition, Algorithm 11 Generalized Tensor Regression (Hoff) [102]
the amount of parameter data is very large. Thus, the amount Input:
of calculation will also increase exponentially. N Nth-order sample data tensors X i ∈ RI1 ×I2 ···IN , i =
1, · · · , N and output tensor Y i ∈ RJ1 ×J2 ,;
3) GENERALIZED TENSOR REGRESSION Output:
Now we introduce a more general case where both input Wn , n = 1, · · · , N ;
and output are tensors. We start with a simple second-order 1: Initialize W n as random matrices;
matrix. A second-order matrix regression is as follows: 2: while the number of iterations is not reached or there is
no convergence do
Y i = AX i BT + E (86) 3: for n=1 to N do
where X i ∈ RI1 ×I2 , Y i ∈ RJ1 ×J2 , i = 1, · · · , N are N input 4: Calculate Wn by formula 92;
sample matrices and corresponding output sample matrices. 5: end for
A ∈ RJ1 ×I1 and B ∈ RJ2 ×I2 are unknown coefficient matrices. 6: return Wn ;
7: end while
E ∈ RJ1 ×J2 is a noise matrix with mean-zero.




4) TENSOR REPRESENTATION OF MULTIVARIATE where A is an N 2 th-order tensor with size 2 × 2 × · · · × 2, B is


POLYNOMIAL REGRESSION FOR SCALAR VARIABLES an Nth-order tensor with size (N +1)×(N +1)×· · ·×(N +1),
Multivariate polynomial regression is a generalization of lin- and V (xn ) is the Vandermonde vector of xn :
ear regression and multiple regression, which predicts the T
V (xn ) = 1 xn xn2 · · · xnN

(100)
next moment or possible future output by processing the
interaction between multiple variables. We first consider a Because of its breadth and generality, this model is used
simple binary quadratic regression: in many fields, such as weather prediction, stock forecasting
and other regression models. But we can clearly see that
y = a0 + a1 x1 + a2 x2 + a12 x1 x2 + a11 x12 + a22 x22 (93) as the variables increase, the unknown coefficients will rise
One of the coefficients a12 represents the relationship exponentially, which will greatly increase the complexity.
between two variables. We can use the multiplication of So we need to reduce unknown parameters by low-rank tensor
matrix and vector to represent the above formula: decomposition network. (Chen et al.) [159] decomposed the
   coefficient tensor with low rank TT decomposition. Another
a0 a2 a22 1 way is to use a truncation model, which is similar to the
y = 1 x1 x12  a1
 
a12 0   x2  (94) coefficient tensor of the binary case just mentioned. It only
a11 0 0 x22 takes two elements for each dimension of B, so the truncated
Consider a more complex complete binary quadratic expression is as follows:
     
polynomial regression (or more general binary quadratic 1 1 1
y = Bt ×1v ×2v · · · ×Nv (101)
polynomial regression): x1 x2 xN
y = a0 + a1 x1 + a2 x2 + a12 x1 x2 + a11 x12 + a22 x22 where Bt is an Nth-order truncated tensor of size
2 × · · · × 2. However, the coefficient tensor B in the second
+ a112 x12 x2 + a122 x1 x22 + a1122 x12 x22 (95)
term of formula 108 is the N 2 th-order tensor. This is the
We can also use a similar product form of matrix and Nth-order tensor. The calculation can be simplified greatly,
vector, as follows: and the subsequent duplicates are also reduced by using a
   new truncated tensor.
a0 a2 a22 1
y = 1 x1 x12  a1
 
a12 a122   x2  (96) 5) TENSOR REPRESENTATION OF MULTIVARIATE
a11 a112 a1122 x22 POLYNOMIAL REGRESSION FOR VECTOR VARIABLES
Using the mode-n product property of tensor and vector, In the previous section, we introduced the multivariate poly-
the above equation can be transformed into: nomial regression of traditional scalar variables. Then we
        extend to the vector form. Here we directly give the general
1 1 1 1
y = A ×1v ×2v ×3v ×4v (97) vector form of the generalized (complete) binary quadratic
x1 x1 x2 x2
regression, which is similar to formula 95, as follows:
where the 4th-order tensor A ∈ R2×2×2×2 is the coefficient y = a0 + aT1 x1 + aT2 x2 C xT1 A12 x2 C xT1 A11 x1
tensor and :
+ xT2 A22 x2 + A112 ×1v x1 ×2v x1 ×3v x2
1 1
     
1
a0 a 2 a 1 a 12 + A122 ×1v x1 ×2v x2 ×3v x2
2   ,  12 4
    
  1 1   + A1122 ×1v x1 ×2v x2 ×3v x2 ×4v x2 (102)
 a2 a22 a12 a122 
2 4 2
 
A=  
1 1
 
1
 (98)
 where x1 , x2 represent vectors, A11 , A12 , A22 represent
 2 a1
 a12   a11 a112  matrices. A112 , A122 are 3rd-order tensors and A1122 is a
4  , 1 2 
 1 1  4th-order tensor.
a12 a122 a112 a1122 We directly give the equivalent tensor product form of the
4 2 2
above formula, as follows:
For a general case of N variables, the complete multivariate
N 2 N
polynomial regression can be expressed as follows (Chen and X X
Billings) [117]: y = w0 + Ai1 ,i2 ,··· ,in ×1v xi1 · · · ×nv xin
n=1 i1 ,i2 ,··· ,in =1
N N      
ai1 i2 ···iN x1i1 x2i2 · · · xNiN 1 1 1
X X
y= ··· = B ×1v · · · ×Nv ×(N +1)v
i1 =0 i =0
x1 x1 x2
N       
1
 
1
 
1
1 1 1 · · · ×2Nv · · · ×(N 2 −N +1)v · · · ×(N 2 )v
= A ×1v · · · ×Nv ×(N +1)v · · · ×2Nv x2 xN xN
x1 x1 x2
 
1
 
1
 
1 = C ×1v V (x1 ) ×2v V (x2 ) · · · ×Nv V (xN ) (103)
· · · ×(N 2 −N +1)v · · · ×(N 2 )v
x2 xN xN where xn ∈ In , Ai1 ,i2 ,··· ,in are Nth-order tensors(n ∈ [1, N 2 ])
= B ×1v V (x1 ) ×2v V (x2 ) · · · ×Nv V (xN ) (99) of size Ii1 × Ii2 · · · × Iin , B is an Nth-order tensor with




N +1
size B1 × B2 × · · · × BN , where Bn = (In )In −1−1 , C is an As can be seen from figure 37, the purpose of the SVM is
N 2 th-order tensor with size (I1 + 1) × · · · × (I1 + 1) × (I2 + to find a hyperplane wT x + b = 0, x = [x1 , x2 , · · · , xm ] to
1) × · · · × (I2 + 1) · · · × (IN + 1) · · · × (IN + 1), and V (xn ) distinguish between the two classes. We give the two types
is the Vandermonde vector of xn : of labels +1 and −1 respectively. Where the distance from a
point x to the hyperplane in the sample space is:
T
xn T (xn ⊗ xn )T · · · (xn ⊗ · · · ⊗ xn )T |wT x + b|

V (xn ) = 1
d= (105)
(104) kwk
As shown in figure 37, the point closest to the hyperplane
This model can be generalized to multidimensional ten- is called the support vector, and the sum of the distances
sors. Similar to the scalar form of multivariate polynomial of the two heterogeneous support vectors to the hyperplane
regression, the multivariate polynomial regression in the form is:
of vector (tensor) also has an exponential rise in complexity 2
γ = (106)
as the variable n increases. Similarly, we can reduce the coef- kwk
ficient tensor from N 2 th-order to Nth-order. We can also use
CP decomposition, Tucker decomposition, or TT decompo-
sition to get a truncated model. Please refer to (Stoudenmire
andSchwab,2016 [30]; Cohen andShashua, 2016 [94]) for
details.

6) DISCUSSION AND COMPARISON


This section mainly introduces five tensor regressions.
We extend from the simplest model to the multivariate regres-
sion of the most complex vector variable values. Usually
the first and third tensor regressions are more common. For
the first tensor regression, since the factors obtained by the
rank 1 decomposition are all vectors, the calculation is rela-
tively simple and the complexity is low. The tensor regres-
sion based on CP decomposition is slightly different from
the tensor regression based on Tucker decomposition. When FIGURE 37. A simple schematic of a linear SVM. As shown, the input is a
given tensor rank, CP decomposition is unique, and Tucker first-order tensor (vector) and the size is 2.
decomposition is usually not unique. Since the factors after
CP decomposition are all matrices, in general, the factor after It is also called the margin. In order to find the hyperplane
Tucker decomposition still has core tensor, so CP decompo- with the largest interval, it is converted to solve the following
sition is still much simpler in performing operations. Since optimization problem:
the factors obtained by CP decomposition is usually unique, 2
and the factors of Tucker decomposition is usually not unique. max
w,b kwk
For Tucker regression, we can choose the most accurate ones
s.t. yi (wT xi C b) ≥ 1, i = 1··· ,m (107)
from many weighted tensors. Therefore, the Tucker regres-
sion is better than CP regression in terms of the accuracy. where kwk is the two norm of the vector w.
For multivariate generalized regression scenarios, tensors In fact, the training samples are linearly inseparable in
are used instead of complex coefficients. The coefficient many cases, which is called the soft interval. The general
tensor decomposition not only reduces the complexity, but constraint formula for the SVM is shown as follows:
also clearly expresses the structural relationship between the M
kwk2 X
data. max +C ξj
w,b,ξj 2
j=1
B. APPLICATION OF TENSOR IN CLASSIFICATION s.t. yj (w xj C b) ≥ 1 − ξj , ξj ≥ 0,
T
j = 1, 2 · · · , M .
1) SUPPORT TENSOR MACHINE(STM) APPLICATION IN
(108)
IMAGES CLASSFICATION
a: THE SUPPORT VECTOR MACHINE(SVM) where ξj = l(yi (wT xi + b) − 1) is called slack variables. l is a
First, we briefly review the concept of support vector loss function. There are three commonly used loss functions
machine(SVM). The SVM is first proposed by (Cortes and hinge loss : lhinge (x) = max(0, 1 − x);
Vapnik) [18] to find a hyperplane to distinguish between
the two different categories, what we usually call the binary exponential loss : lexp (x) = exp(−x);
classifier (see figure 37). logistic loss : llog (x) = log(1 + exp(−x)); (109)
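The survey goes on to solve these models through their Lagrangian duals. Purely as a rough illustration of the primal soft-margin objective of formula 108 written as a minimization with the hinge loss above, here is a tiny subgradient-descent sketch; the learning rate, iteration count, and function name are our own assumptions and this is not the solution method used in the survey.

```python
import numpy as np

def linear_svm_subgradient(X, y, C=1.0, lr=1e-3, n_iter=2000):
    """Minimize 0.5*||w||^2 + C * sum(max(0, 1 - y_j (w^T x_j + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1                       # samples with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# usage sketch on a linearly separable toy set
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w, b = linear_svm_subgradient(X, y)
print(np.mean(np.sign(X @ w + b) == y))
```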




m
Later researchers (Zhao et al.) [144] converted the above where F(W , b) 1 T
P
= 2 tr(W W ) + C max(0, 1 −
constraints into the following formula: j=1
yj [tr(W T Xj ) + b]), G(S) = λkW k∗ . Due to the complexity
kwk2 γ of the SMM solution, please refer to (Luo et al.) [80] for
min + ξTξ
w,b,ξj 2 2 details.
s.t. yj (wT xj C b) = 1 − ξj , ξj ≥ 0, j = 1, 2 · · · , M .
c: THE SUPPORT TENSOR MACHINE(STM)
(110)
If we further extend the matrix to tensor, we will get the
where ξ = [ξ1 , ξ2 , · · · , ξM ] ∈ RM . Note that formula 110 Support Tensor Machine(STM). In general, STM currently
has two major differences compared to formula 108. 1: in have five constraint expressions, we first give the original
order to facilitate the calculation, the above constraint is constraint expression:
changed from inequality to equality. 2: the loss function in M
formula 110 is the mean square loss. The benefit of this kW k2 X
max +C ξj
modification is that the solution will be easier. Generally, w,b,ξj 2
j=1
the solution is developed by Lagrangian multiplier method. s.t. yj (W • X j + b) ≥ 1 − ξj ξj ≥ 0, j = 1, 2 · · · , M .
We do not repeated derivation here. For details, please refer to
(114)
(Corts and Vapnik) [18].
Here we usually choose to decompose the coefficient
b: THE SUPPORT MATRIX MACHINE(SMM) tensor W , and the researchers give four solutions in total.
If we extend the input sample from vector to second- (Tao et al.) [27] proposed to decompose the coefficient
order tensor (matrix), we will get the Support Matrix tensor into the form of the rank-one vector outer prod-
Machine(SMM). (Luo) [80] proposed the concept of the uct, i.e., W = w1 ◦ w2 ◦ · · · wN (see formula 28).
Support Matrix Machine. We consider a matrix sample Xa ∈ (Kotsia et al.) [58] performed CP decomposition on the coef-
RI ×J , a = 1, 2, · · · , m. The hinge loss function are replaced R
λr w1 ◦ w2 ◦ · · · wN (see for-
P
in SMM. The following constraint formula is obtained: ficient tensor, i.e., W =
r=1
mula 29). (Kotsia and Patras) [59] performed Tucker decom-
M
1 X position on the coefficient tensor, i.e., W = A ×1m W1 ×2m
min tr(W T W ) + C max(0, 1 − yj [tr(W T Xj ) + b])
W ,b,ξj 2 W2 · · · ×Nm WN (see formula 108). (Wang et al.) [155] per-
j=1
formed TT decomposition on the coefficient tensor, i.e., W =
+ λkW k∗
W 1 ×3,1 W 2 · · · ×3,1 W N (see formula 55). Substituting these
s.t. yj [tr(W T Xj ) + b] ≥ 1 − max(0, 1 − yj [tr(W T Xj ) + b]). three decompositions will result in three forms of STM.
(111) In general, the solution of STM is similar to the solution of
CP decomposition. The central idea is based on the alternat-
where kW k∗ (we usually call it the nuclear norm) represents ing least squares method, that is, N-1 other optimization items
the sum of all singular values of the matrix W, C and λ are fixed first, and only one item is updated at a time. For
are coefficient. In fact, we get the following properties after example, if we use the form of the rank-one decomposition
performing the mode-1 vectorization of the matrix w = for coefficient tensor, then the constraint expression becomes
vec(W T )1 . as follows (see algorithm 12):

tr(W T Xj ) = vec(W T )T1 vec(XjT )1 = wT xi 1 X M


max kwm k2 α + C ξj
tr(W T W ) = vec(W T )T1 vec(W T )1 = wT w. (112) wm ,b,ξj 2
j=1

Substituting the formula 136 into the formula 135 returns s.t. ×(i6=m)v wi ) + b) ≥ 1 − ξj ,
yj (wTm (X j
the constraint expression of the original SVM. Note that i = 1, 2 · · · , n − 1, n + 1, · · · , N .j = 1, 2 · · · , M .
in order to protect the data structure from being destroyed, (115)
we generally do not perform the mode-n vectorization of
the matrix and convert it into a traditional SVM. So we where α = k ◦N 2
i=1,i6 =m wi k .
give the optimization problem directly in the form of a Then the label of a test sample, X test , can be predicted as
matrix. According to (Goldstein et al.) [39], they further follows:
converted the above constraints into the following augmented
Lagrangian function form: y = sign(X test ×1v w1 · · · ×Nv wN + b) (116)

L(W , b, S, λ) = F(W , b) + G(S) + tr[3T (S − W )] However, the above-mentioned alternating least squares
a iteration method usually needs a lot of time and com-
+ kS − W k2F , a is hyperparameter putational memory, and only obtian a local optimal solu-
2
(113) tion. So many researchers proposed other algorithms.
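As a small illustration of the rank-one STM pieces above: the decision rule of formula 116 and the contraction X ×_{(i≠m)v} w_i that appears in the per-mode subproblem of formula 115 are both chains of mode-n vector products. The helper names below are our own, and the per-mode weight solve itself (an ordinary linear SVM on the projected samples) is omitted.

```python
import numpy as np

def project_except(X, ws, m):
    """Contract X with every weight vector except the one for mode m (formula 115)."""
    out = np.moveaxis(X, m, 0)                       # keep mode m as axis 0
    for i, w in enumerate(ws):
        if i == m:
            continue
        out = np.tensordot(out, w, axes=([1], [0]))  # contract the next remaining mode
    return out                                       # vector of length I_m

def stm_predict(X, ws, b):
    """Decision rule of formula 116: y = sign(X x_1v w_1 ... x_Nv w_N + b)."""
    val = X
    for w in ws:
        val = np.tensordot(val, w, axes=([0], [0]))  # successive mode-n vector products
    return np.sign(val + b)

# usage sketch on a random 3rd-order sample
rng = np.random.default_rng(0)
X = rng.random((3, 4, 5))
ws = [rng.random(3), rng.random(4), rng.random(5)]
print(project_except(X, ws, 1).shape)   # (4,) -- input to the mode-1 SVM subproblem
print(stm_predict(X, ws, b=-1.0))
```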




Algorithm 12 Support Tensor Machine (Hao et al.) [164]


Input:
Input tensor sample sets X j ∈ RI1 ×I2 ···IN , j = 1, 2 · · · , M
and label yj ∈ {+1, −1};
Output:
wi , i = 1, · · · , N and b;
1: if the required number of iterations is not reached then
2: for m=1 to N do
3: Initialize w1 , w2 , · · · , wm−1 , wm+1 , · · · , wN
4: α = k ◦N i=1,i6=m wi k
2

5: Calculate wm by solving the binary optimization


problem of the formula 139.
6: end for
7: end if
FIGURE 38. A simple schematic of the RBM, with the visible layer variable
x = [x1 , · · · , x5 ] on the left and the hidden layer variable y = [y1 , y2 ] on
the right.
(Z.Hao, 2013) [164] proposed to transform the formula 114
into the following constraint expression:
derived from the following formula:
M M
1 X X
y = σ (W T x + b)
min αi αi yi yj (X i • X j ) − αj (118)
α 2
i,j=1 j=1
where σ (x) = 1+e1 −x is the activation function.
M
X Then we carry out the back propagation algorithm, which
s.t. αj yj = 0, 0 ≤ αj ≤ C j = 1, 2 · · · , M . (117) recalculates the value of the visible layer as the input of the
j=1 hidden layer’s value:
where αj are the Lagrange multipliers. Note that if the input x = σ (W y + a) (119)
tensor becomes a vector, formula 117 will become the dual
problem of the standard SVM. When the back-propagation recalculated visible layer
STM has gradually entered the field of machine learning value is not equal to the original visible layer value, the oper-
due to its ability to preserve data structures and improve ation is repeated, which is the training process of the
performance. STM with different constraints have sepa- restricted Boltzmann machine. KL divergence is usually used
rate application scenarios. For example, STM based on CP in a Restricted Boltzmann Machine to measure the distance
decomposition is applied to pedestrian detection of thermal between the distributions of these two variables. RBM is a
infrared rays in order to find pedestrians in a group of probability distribution model based on energy, as follows:
images for precise positioning (Biswas and Milanfar) [121]. E(x, y) = −aT x − bT y − yT W x (120)
STM based on rank-one decomposition is applied to
high-resolution remote sensing image target detection Then we derive the joint probability distribution of the
(Chen et al.) [45]. STM based on the original dual problem hidden layer variable y and the visible layer variable x:
solving algorithm is applied to the fault diagnosis of electric 1 −E(x,y)
vehicle range finder (Xu et al.) [147]. P(x, y) =
e (121)
Z
P
where Z = . In order to make the distribution of these
2) HIGH-ORDER RESTRICTED BOLTZMANN MACHINES x,y
(HORBM) FOR CLASSIFICATION two values as close as possible to maximize the likelihood
We first review the concept of Restricted Boltzmann function of the input samples:
Machines. RBM is a random neural network, which can N
X
be used for algorithm modeling of dimensionality reduc- L(W , a, b) = argmax E[ log P(xt )]
tion, classification, collaborative filtering, etc. In RBM, t=1
it contains two layers, including visible layer and hid- X
P(x) = P(x, y) (122)
den layer (see figure 38). Where the visible layer is,
y
x = [x1 , x2 , · · · , xM ]T ∈ RM , hidden layer is, y =
[y1 , y2 , · · · , yN ]T ∈ RN , xm = {0, 1}, m ∈ [1, M ] and We assume that there are N input samples, x t ∈ RM , t ∈
yn = {0, 1}, m ∈ [1, N ]. The weight of the interconnection [1, N ] in visible layer. Since the derivative form of the above
is W ∈ RM ×N . The visible layer has a bias of a ∈ RM and formula cannot be solved generally, the deep learning pio-
the hidden layer has a bias of b ∈ RN . When the input value neer Hinton proposed the CD algorithm (i.e.,k times Gibbs
of the visible layer is given, the value of the hidden layer is sampling) to obtain an approximate solution. Here we give a




expressed as follows:
E(X , y) = A • X − bT y − W • (X ◦ y) (125)
where A ∈ RI1 ×I2 ···×IN , b ∈ RJ are the biases of the visible
and hidden layers, respectively. And similarly, the hidden
layer variable y = [y1 , · · · , yJ ]T can be expressed as:
yj = σ (X • W (:, · · · , :, j) + bj ), j = 1, 2 · · · , J (126)
A major problem is that as the input tensor dimension
increases, the weight tensor elements will multiply. We usu-
ally use low rank tensor decomposition to solve the problem.
For example, if we perform CP decomposition on weight
tensors:
FIGURE 39. Schematic diagram of the energy function of the three sets of
variables. The middle is the weight tensor, the above is the variable a,
the lower is the variable b, and the right is the variable c. W ≈ 3×1m W1 ×2m W2 · · · ×Nm WN ×(N +1)m WN +1 (127)
where Wn ∈ RIn ×R , n = 1, · · · , N , WN +1 ∈ RJ ×R are factor
matrices, and 3 is the diagonal tensor. ThenQ the number of
very simple update formula based on the actual application. elements is reduced from the original J N n=1 In to R(J +
In practice, it usually takes only one sample to achieve very PN
n=1 I n + 1).
accurate results, so the updated formula is as follows:
More simply, if the weight tensor can be expressed in the
W = W + α(xy − x1 y1 ) form of a rank-one vector outer product:
a = a + α(x − x1 ) W = w1 ◦ w2 ◦ · · · wN ◦ wN +1 (128)
b = b + α(y − y1 ) (123)
where wn ∈ RIn , n = 1, · · · , N , wN +1 ∈ RJ . ThenQ the
where α ∈ [0, 1] is the learning rate, x1
is the updated value number of elements is reduced from the original J N n=1 In
to J + N
P
of the visible layer variable x obtained by the first back- n=1 I n .
propagation of the hidden layer y, and y1 is the first update Finally, we introduce a latent conditional high-order
value of the hidden layer obtained by x1 forward propagation Boltzmann machines(CHBM). (Huang et al.) [151] pro-
again. If it is k(k > 1) times, we only need to change the x1 posed latent conditional high-order Boltzmann machine for
of the above formula to xk (the value of the visible layer classification. The algorithm is similar to the high-order
variable obtained by the kth back-propagation). For details, Boltzmann machine of the three sets of variables we just
please refer to (Hinton) [35]. mentioned. However, in CHBM, input data are two N sample
If we increase the number of layers, the traditional RBM features xi ∈ RI , yi ∈ RJ , i = 1, · · · , N and z is the
will become a higher dimension, which we call High-order relationship label of xi , yi where z = [z1 , z2 ]. For each
restricted Boltzmann machines (HORBM). For example, sample, if x and y are matched, z = [1, 0], else z = [0, 1]
for three sets of variables, a ∈ RI , b ∈ RJ , c ∈ RK , the energy (‘‘one-hot’’ encoding). Then the author adds another set of
function can be represented (see figure 39): binary-valued latent variables to the hidden layer. The entire
structure is shown in figure 40. Where h denotes the intrinsic
,J ,K
IX
relationship between x and y. h and z are connected by a
E(a, b, c) = − wi,j,k ai bj ck − d T a − eT b − f T c weight matrix U. Then its energy function is as follows:
i,j,k=1
= W ×1v a ×2v b ×3v c − d T a − eT b − f T c E(x, y, h, z) = W ×1v x ×2v y ×3v h − hT U z
(124) − aT x − bT y − cT h − d T z (129)

where a ∈ RI and b ∈ RJ are two input variables, which where a, b, , c , d are the biases of x, y, h, z,
can be understood as two visible layers, c ∈ RK is a hidden respectively.
layer variable, and d, e, f correspond to the biases of three Then the value of zt , t = {1, 2} (which is also known as
variables. activation conditional probability) is :
Note that the input of the visible, hidden layer or the IJ
X
additional layer of the RBM is a vector. If the input becomes hk = p(hk |x, y) = σ ( wijk xi yj + ck )
a tensor, we call it Tensor-variate Restricted Boltzmann ij
machines (TvRBMs) (Nguyen et al.) [126]. We assume that K
X
the visible layer variable is, X ∈ RI1 ×I2 ···×IN , and the hidden zt = p(zt |x, y, h) = σ (dt + hk Ukt )
layer variable is, y ∈ RJ , so the weight tensor is W ∈ k
RI1 ×I2 ···×IN ×J . Then the energy function can be similarly k = 1, · · · , K . t = {1, 2} (130)




FIGURE 40. Schematic diagram of the energy function of the four sets of variables. The middle is the weight tensor, the above is the variable x, the lower is the variable y, the right is the hidden layer variable h, and the far right is the label z. z and h are connected by a weight matrix U.

In fact, the model is a two-layer RBM. The first layer is a ternary RBM (x, y, h), and the second layer is the traditional binary RBM (h, z). For the 3rd-order tensor W of the first layer, we can use the CP decomposition to solve it.

3) POLYNOMIAL CLASSIFIER ALGORITHM BASED ON TENSOR TT DECOMPOSITION
Polynomial classifiers are often used for classification because of their ability to generate complex surfaces and their good fit to raw data. However, when coming to high-dimensional spaces, the multivariate polynomial can only use some specific kernels in the support vector machine, and the kernel function must be mapped to the high-dimensional space for processing, which increases the difficulty of data processing. In order to enable the polynomial classifier to handle high-dimensional problems, (Chen et al.) [161] simplified the operation by expressing the polynomial with the tensor inner product in the TT format, and proposed two algorithms.

First we give the definition of a pure-power-n polynomial: given a vector n = (n1, n2, ..., nm), if in a polynomial f with m variables the highest power of each variable xi is ni, i = 1, 2, ..., m, then the polynomial f is called a pure-power-n polynomial.

Example 1: The polynomial f = 1 + x1 + 3x2^3 + 2x3 + 4x3^2 − 2x2x3^2 − 5x1x2x3 is a pure-power-n polynomial with n = (1, 3, 2).

A pure-power-n polynomial can be expressed equivalently by the mode-n product of vectors and a tensor A ∈ R^{(n1+1)×(n2+1)×···×(nm+1)}:

f = A ×1v v(x1)^T ×2v v(x2)^T ··· ×mv v(xm)^T (131)

where v(xi) are the Vandermonde vectors:

v(xi) = (1, xi, xi^2, ..., xi^{ni})^T, i = 1, 2, ..., m (132)

Example 2: For the polynomial f in example 1, since n = (1, 3, 2), then v(x1) = (1, x1)^T, v(x2) = (1, x2, x2^2, x2^3)^T, and v(x3) = (1, x3, x3^2)^T. The nonzero elements of the coefficient tensor A ∈ R^{2×4×3} are a111 = 1, a211 = 1, a141 = 3, a112 = 2, a113 = 4, a123 = −2, a222 = −5. We combine the indices of the three Vandermonde vectors to get the indices of A; for example, −5x1x2x3 comes from v(x1)[2] = x1, v(x2)[2] = x2, v(x3)[2] = x3, so a222 = −5.

Given a set of N training samples (xi, yi), i = 1, 2, ..., N, xi ∈ R^m, after feature extraction each feature vector is mapped to a high-dimensional space by the mapping T:

T(x) = v(x1) ◦ v(x2) ◦ ··· ◦ v(xm), x = (x1, x2, ..., xm)^T (133)

Therefore, formula 131 is further equivalent to:

f = T(xi) • A (134)

Example 3: Here we consider the example of a binary polynomial for the sake of simplicity. Assuming f = 2 + 3x1 − x2 + 2x1^2 + 4x1x2 − 2x1^2x2 + 7x2^2, we can get n = (2, 2), v(x1) = (1, x1, x1^2)^T, v(x2) = (1, x2, x2^2)^T; then, according to formulas 9 and 17, both T(x) and A are 2nd-order tensors (matrices):

T(x) = [1, x2, x2^2; x1, x1x2, x1x2^2; x1^2, x1^2x2, x1^2x2^2], A = [2, −1, 7; 3, 4, 0; 2, −2, 0] (135)

Similar to the idea of SVM, polynomial classification looks for a hyperplane to distinguish between the two types of examples. Its ultimate goal is to find the coefficient tensor A so that:

yi (T(xi) • A) > 0, i = 1, 2, ..., N (136)

Considering the TT decomposition of the coefficient tensor A, A = A1 ×3,1 A2 ··· ×3,1 Am, the above polynomial expression (formula 134) has the following further properties:

f = T(xi) • A = A ×1v v(x1)^T ×2v v(x2)^T ··· ×mv v(xm)^T
= (A1 ×2v v(x1)^T) ··· (Am ×2v v(xm)^T)
= Aj ×1v pj(x) ×2v v(xj)^T ×3v qj(x)^T
= (qj(x)^T ⊗L v(xj)^T ⊗L pj(x)) (Aj)v1, for any j = 1, 2, ..., m (137)

where p1(x) = 1, pj(x) = Π_{k=1}^{j−1} (Ak ×2v v(xk)^T) for j ≥ 2, qm(x) = 1, qj(x) = Π_{k=j+1}^{m} (Ak ×2v v(xk)^T) for j < m, and vec(Aj)2 means the mode-2 vectorization of the tensor (see formula 15).

Example 4: For the polynomial f in example 3, according to formula 137, T(x) • A = (q2(x)^T ⊗L v(x2)^T ⊗L p2(x)) vec(A2)2, taking j = 2. Then we get:

q2(x) = 1, v(x2) = (1, x2, x2^2)^T, p2(x) = A1 ×2v v(x1)^T, v(x1) = (1, x1, x1^2)^T (138)
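As a quick numerical check of formulas 131 to 135, the sketch below evaluates the binary polynomial of Example 3 both directly and as the tensor inner product T(x) • A of formula 134; the NumPy encoding of T(x) and A is an assumption made only for illustration.

```python
import numpy as np

def vandermonde(xi, ni):
    # v(x_i) = (1, x_i, x_i^2, ..., x_i^{n_i})^T, formula 132
    return np.array([xi ** p for p in range(ni + 1)])

# Example 3: f = 2 + 3x1 - x2 + 2x1^2 + 4x1x2 - 2x1^2 x2 + 7x2^2, n = (2, 2)
A = np.array([[2.0, -1.0, 7.0],
              [3.0,  4.0, 0.0],
              [2.0, -2.0, 0.0]])   # coefficient tensor (here a matrix), formula 135

def f_direct(x1, x2):
    return 2 + 3*x1 - x2 + 2*x1**2 + 4*x1*x2 - 2*x1**2*x2 + 7*x2**2

def f_tensor(x1, x2):
    # T(x) = v(x1) o v(x2) (outer product), f = <T(x), A>, formula 134
    T = np.outer(vandermonde(x1, 2), vandermonde(x2, 2))
    return np.sum(T * A)

for x1, x2 in [(0.3, -1.2), (2.0, 0.5)]:
    assert np.isclose(f_direct(x1, x2), f_tensor(x1, x2))
```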


(Chen et al.) [161] proposed two loss functions, the least squares loss and the logistic loss function:

J(A) = (1/N) Σ_{i=1}^{N} (T(xi) • A − yi)^2,
J(A) = −(1/N) Σ_{i=1}^{N} [ ((1 + yi)/2) ln(gA(xi)) + ((1 − yi)/2) ln(1 − gA(xi)) ] (139)

where the first formula is the least squares loss function, the second is the logistic loss function, and gA(xi) = σ(T(xi) • A), σ(x) = 1/(1 + e^{−x}).

According to formula 137, the least squares loss function of formula 139 can be further transformed into:

J(A) = (1/N) ||Cj (Aj)v1 − y||_2^2, with T(xi) • A = Cj[i] (Aj)v1 (140)

where

Cj = [ qj(x^1)^T ⊗L v(x_j^1)^T ⊗L pj(x^1); qj(x^2)^T ⊗L v(x_j^2)^T ⊗L pj(x^2); ··· ; qj(x^N)^T ⊗L v(x_j^N)^T ⊗L pj(x^N) ] (141)

Cj[i] means the ith row of Cj, and y = [y1, y2, ..., yN]^T.

If we further add a regularization term, the final loss function is:

Jloss(A) = J(A) + (α/2)(A • A) (142)

Finally, the problem is transformed into minimizing this loss function over the tensor A in TT format. In fact, the idea of the optimization is still similar to alternating least squares, which we call the improved alternating least squares method. The central idea is to update only one core An in each iteration while keeping the other cores unchanged. In general, we first update from A1 to AN, so that the left half is updated, which we call the forward half-sweep. Then we update from AN back to A1, so that the right half is updated, which we call the backward half-sweep. When both the forward and backward half-sweeps are completed, one iteration is completed (see figure 41 and algorithm 13).

Algorithm 13 The Improved Least Squares Method for TT Decomposition (Chen et al.) [161]
Input: Loss function Jloss(A) and an initial guess for the TT decomposition of the Nth-order tensor A = A1 ×3,1 A2 ··· ×3,1 AN, An ∈ R^{Rn−1×In×Rn};
Output: A in the TT format, A = argmin Jloss(A), A = Â1 ×3,1 Â2 ··· ×3,1 ÂN;
1: if the required number of iterations is not reached then
2: for n = 1 to N do
3: solve: Ãn = argmin_{Ãn} J(Ã1, ..., Ãn−1, Ãn, An+1, ..., AN);
4: [Qn, Rn] = QR-decomposition of Ãn_{mc2};
5: Ãn+1 = Ãn+1 ×2m Rn;
6: end for
7: for n = N to 1 do
8: solve: Ân = argmin_{Ân} J(Â1, ..., Ân−1, Ân, Ãn+1, ..., ÃN);
9: [Qn, Rn] = QR-decomposition of Ân_{mc2};
10: Ân−1 = Ân−1 ×2m Rn;
11: end for
12: end if
13: return A = Â1 ×3,1 Â2 ··· ×3,1 ÂN;

4) FEATURE TENSOR GENERATION (TENSORIZATION) FOR IMAGE CLASSIFICATION
Feature tensor generation (tensorization) is often used in image processing, where it means finding good image feature representations. By finding feature tensors, that is, by extracting valid data, image classification can be done better, thereby improving classification accuracy. Usually we need to use some means to convert 2D images into 3D feature tensors to extract information.

Feature tensor generation transforms the original image X into another 3rd-order high-dimensional image Y, which can maintain the spatial relationships within the image. The size of each transformed image Y is much smaller than the original image X, and the original image X can be accurately recovered from the transformed 3D image Y.

Image-based feature tensor generation generally proceeds by the following steps (see algorithm 14). We also made a picture to show the process of generating feature tensors (see figure 42).

We can recover the original image by reversing the above steps. The feature tensor is highly compatible with the deep learning method commonly used on images, the Convolutional Neural Network (CNN). So for general image processing, classification can be done by first finding the feature tensor of the image and then using a CNN to classify it. Similar to CNN's convolutional layer, this operation reduces the size of the original image because n and k are smaller than the size N of the original input image, which can significantly reduce computing time and memory consumption. For details, please refer to (Yang et al.) [51].

5) TENSOR-BASED FEATURE FUSION FOR FACE RECOGNITION
In general, traditional face recognition has only a single input x, and the output expression is as follows:

y = f(W^T x + b) (143)

But (Hu et al.) [37] proposed to combine the face attribute feature and the face recognition feature, which simply means adding an input z.
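Returning briefly to formulas 140 to 142, each core update in Algorithm 13 above reduces to a small regularized linear least-squares problem in the vectorized core. The sketch below solves one such update, under the simplifying assumptions that the design matrix Cj and the labels y have already been assembled and that the regularizer is applied only to the core being updated.

```python
import numpy as np

def update_core(C_j, y, core_shape, alpha=0.1):
    """One core update of the improved ALS (formulas 140-142).

    C_j: (N, P) matrix whose i-th row is q_j(x^i)^T (x) v(x_j^i)^T (x) p_j(x^i).
    y:   (N,) label vector.
    core_shape: shape of the TT core A_j, with prod(core_shape) == P.
    Minimizes (1/N)||C_j a - y||^2 + (alpha/2)||a||^2 over a = vec(A_j)
    (regularizing only this core is a simplification of formula 142).
    """
    N, P = C_j.shape
    lhs = C_j.T @ C_j / N + (alpha / 2.0) * np.eye(P)
    rhs = C_j.T @ y / N
    a = np.linalg.solve(lhs, rhs)
    return a.reshape(core_shape)

# toy usage with random data (illustrative only)
rng = np.random.default_rng(1)
C_j = rng.normal(size=(50, 12))
y = rng.choice([-1.0, 1.0], size=50)
A_j = update_core(C_j, y, core_shape=(2, 3, 2))
```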


Algorithm 14 Feature Tensor Generation (Yang et al.) [51]
Input: The original image X ∈ R^{N×N};
Output: The feature tensor, which we call F;
1: First, the original input image X is divided into n×n block regions, and then multi-level perception is performed to find the feature representation of each block region; we call the block regions Ya,b (a, b = 1, 2, ..., n);
2: Perform the DCT transform for each block: D_{a,b}(u, v) = c(u)c(v) Σ_{x=0}^{K} Σ_{y=0}^{K} Y_{a,b}(x, y) cos[π/K (x + 1/2) u] cos[π/K (y + 1/2) v], K = N/n; if u = 0, c(u) = sqrt(1/(K+1)), else c(u) = sqrt(2/(K+1));
3: Convert the matrices D_{a,b}(u, v) to vectors (vectorization), C^{a,b} = [D_{a,b}(0, 0), D_{a,b}(0, 1), D_{a,b}(0, 2), ..., D_{a,b}(K, K)];
4: Pick the first k elements of each C^{a,b}, C^{ab} = C^{a,b}[0 : k];
5: Finally, all these element groups become a feature tensor F = [C^{11}, C^{12}, ..., C^{1n}; C^{21}, C^{22}, ..., C^{2n}; ...; C^{n1}, C^{n2}, ..., C^{nn}], where the C^{ab} ∈ R^k are vectors and F ∈ R^{n×n×k} is a 3rd-order tensor;
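A minimal NumPy sketch of the block-DCT construction summarized in Algorithm 14 above. The multi-level perception step is omitted, the standard orthonormal DCT-II scaling (which differs slightly from the c(u) constants above) and the row-major ordering used to pick the first k coefficients are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def dct2(block):
    # 2-D DCT-II of a K x K block with orthonormal scaling (assumed normalization)
    K = block.shape[0]
    u = np.arange(K)
    basis = np.cos(np.pi * (u[None, :] + 0.5) * u[:, None] / K)  # basis[f, t]
    scale = np.full(K, np.sqrt(2.0 / K))
    scale[0] = np.sqrt(1.0 / K)
    D = scale[:, None] * basis
    return D @ block @ D.T

def feature_tensor(X, n, k):
    """Split an N x N image into n x n blocks, DCT each block,
    keep the first k coefficients per block -> feature tensor (n, n, k)."""
    N = X.shape[0]
    K = N // n
    F = np.zeros((n, n, k))
    for a in range(n):
        for b in range(n):
            block = X[a*K:(a+1)*K, b*K:(b+1)*K]
            coeffs = dct2(block).ravel()   # C^{a,b}: vectorized DCT coefficients
            F[a, b, :] = coeffs[:k]        # pick the first k elements
    return F

img = np.random.default_rng(2).random((800, 800))
F = feature_tensor(img, n=8, k=100)        # 8 x 8 x 100, as in figure 42
```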

FIGURE 41. The improved least squares method for TT decomposition [161]. First the forward half-sweep, then the backward half-sweep; one forward plus one backward half-sweep is one iteration. When the forward and backward half-sweeps are both completed, one iterative update is completed. After each update of a core, the green matrix R generated by the QR decomposition of that core is absorbed into the adjacent core, which is then updated in turn.

FIGURE 42. Feature tensor generation example (n = 8). We assume that the original image (800 × 800) is divided into 8 × 8 blocks, and each block is 100 × 100. Then we perform the DCT transformation. Finally it is encoded into a feature tensor of size 8 × 8 × 100.

Then the output model of (Hu et al.) [37] becomes as follows:

yi = softmax(W ×1v xi ×3v zi), softmax(x)_i = e^{x_i} / Σ_j e^{x_j} (144)

where the bias is omitted, xi ∈ R^A, zi ∈ R^B, yi ∈ R^C, i = 1, ..., N, and the weight tensor is W ∈ R^{A×C×B}. The goal is still to optimize the loss function between the predicted and true values. The author used the Tucker decomposition W = S ×1m UA ×2m UC ×3m UB, UA ∈ R^{A×RA}, UB ∈ R^{B×RB}, S ∈ R^{RA×RC×RB}. Then formula 144 becomes as follows:

yi = softmax(S ×1m UA ×2m UC ×3m UB ×1v xi ×3v zi) (145)

According to the properties of the mode-n product, this can be converted into:

yi = softmax(S ×1m (UA xi) ×2m UC ×3m (UB zi)) (146)

According to the nature of the Kronecker product, it can be further transformed into:

yi = softmax(((UA xi) ⊗ (UB zi)) mat(S)_2^T UC) (147)

where ((UA xi) ⊗ (UB zi)) mat(S)_2^T is called the fused feature. The entire classification process is shown in figure 43.

Finally, the entire training process is actually the process of solving the factor matrices and the core tensor.
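As a small consistency check of formulas 144 to 147, the sketch below evaluates the output once with the full weight tensor W and once through the fused feature; the random Tucker factors and the use of U^T (so that the dimensions match) are assumptions made for illustration, not the exact convention of [37].

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(3)
A, B, C = 6, 5, 4                    # sizes of x, z and the output y (assumed)
RA, RB, RC = 3, 3, 3                 # Tucker ranks (assumed)
UA = rng.normal(size=(A, RA))
UB = rng.normal(size=(B, RB))
UC = rng.normal(size=(C, RC))
S = rng.normal(size=(RA, RC, RB))    # core tensor, modes ordered (A, C, B) as in the text

x = rng.normal(size=A)
z = rng.normal(size=B)

# formulas 144/145: y = softmax(W x_{1v} x x_{3v} z), W = S x_{1m} UA x_{2m} UC x_{3m} UB
W = np.einsum('pqr,ap,cq,br->acb', S, UA, UC, UB)
y_full = softmax(np.einsum('acb,a,b->c', W, x, z))

# formulas 146/147: fold the inputs into the factors first (fused feature)
xa = UA.T @ x                        # plays the role of (UA x) in the text
zb = UB.T @ z
fused = np.kron(xa, zb) @ np.reshape(np.transpose(S, (0, 2, 1)), (RA * RB, RC))
y_fused = softmax(fused @ UC.T)

assert np.allclose(y_full, y_fused)  # the two forms give the same logits
```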


This way of decomposing can reduce the number of parameters, thus reducing the computation time, and the efficiency of large-scale data processing can be improved.

FIGURE 43. Tensor-based feature fusion neural network.

6) TENSOR-BASED GRAPH EMBEDDING ALGORITHM
The graph embedding algorithm is generally used to better classify data by reducing the dimensionality of the data while preserving the data structure of the graph. In order to accurately classify and identify the target object in an image, (Hu et al.) [143] used second-order tensor-based graph embedding to learn the discriminant subspace (discriminative embedding space), and distinguished the target object image blocks and the background image blocks in this discriminant subspace.

First they assume that the input training sample set consists of Nth-order tensors, X_a ∈ R^{R1×R2×···×Rn}, a = 1, 2, ..., N. They construct an intrinsic graph G^i to characterize the correlation between the target samples and the background samples. In addition they also construct a penalty graph G^p to characterize the difference between the target samples and the background samples, so as to separate them from the image. These two graphs represent the geometry and discriminant structure of the input samples. Define the weight matrices of the two graphs separately, W^i, W^p. The element W^i_{ab} in W^i represents the degree of similarity between the vertices X_a and X_b, and the element W^p_{ab} in W^p represents the degree of difference between X_a and X_b.

Tensor-based graph embedding aims to find a best low-dimensional tensor representation for each vertex in a graph G, and to make the low-dimensional tensor describe the similarity between vertices well. The optimal tensor representation of the vertices is obtained by solving the following optimization problem:

J(B1, B2, ..., BN) = arg min_{Bn} ( Σ_{a=1}^{N} Σ_{b=1}^{N} ||X_a ×1m B1 ··· ×Nm BN − X_b ×1m B1 ··· ×Nm BN||_F^2 W^i_{ab} )
s.t. Σ_{a=1}^{N} Σ_{b=1}^{N} ||X_a ×1m B1 ··· ×Nm BN − X_b ×1m B1 ··· ×Nm BN||_F^2 W^p_{ab} = d (148)

where Bn ∈ R^{In×Rn} are called transfer matrices and d is a constant chosen according to need. In fact, we can see that this is similar to the constrained optimal solution of the Tucker (HOSVD) decomposition: Bn are actually factor matrices, and X is the core tensor in the Tucker decomposition. But note that here In ≤ Rn.

However, the image itself can be seen as a matrix. (He et al.) [146] proposed the solution to the above problem. First, the mode-n matricization of the tensor is used to convert the above optimization problem equivalently. Note that since the inputs X_a ∈ R^{R1×R2}, a = 1, 2, ..., N are second-order tensors, according to the definition of the mode-n matricization, the following formula is established:

X_{m1} = X, X_{m2} = X^T (149)

According to figure 20, the mode-n product of a tensor and a matrix can be converted into a matrix product. Then Y_a = X_a ×1m B1 ×2m B2 in formula 148 becomes as follows:

Y_a = X_a ×1m B1 ×2m B2,
A_1 = (A_1)_{m1} = B1 (X_a)_{m1} = B1 X_a,
Y_a = A_1 ×2m B2,
(Y_a)_{m2} = B2 (A_1)_{m2},
Y_a = ((Y_a)_{m2})^T = (B2 (A_1)_{m2})^T = B1 X_a B2^T (150)

The optimization problem of formula 148 is converted into:

J(B1, B2) = arg min_{B1,B2} ( Σ_{a=1}^{N} Σ_{b=1}^{N} ||B1 X_a B2^T − B1 X_b B2^T||_F^2 W^i_{ab} )
s.t. Σ_{a=1}^{N} Σ_{b=1}^{N} ||B1 X_a B2^T − B1 X_b B2^T||_F^2 W^p_{ab} = d (151)

To further simplify the operation, (He et al.) [146] defined two diagonal matrices Λ^i ∈ R^{N×N} and Λ^p ∈ R^{N×N}, where the diagonal elements are λ^i_{aa} = Σ_{b=1}^{N} W^i_{ab} and λ^p_{aa} = Σ_{b=1}^{N} W^p_{ab}, respectively. Formula 151 can be further transformed into:

J(A, B) = arg min_{A,B} trace(B^T (Λ^i_A − W^i_A) B)
s.t. trace(B^T (Λ^p_A − W^p_A) B) = d/2 (152)

For convenience, A^T = B1, B^T = B2, and

Λ^i_A = Σ_{a=1}^{N} λ^i_{aa} X_a^T A A^T X_a, W^i_A = Σ_{a=1}^{N} Σ_{b=1}^{N} W^i_{ab} X_a^T A A^T X_b,
Λ^p_A = Σ_{a=1}^{N} λ^p_{aa} X_a^T A A^T X_a, W^p_A = Σ_{a=1}^{N} Σ_{b=1}^{N} W^p_{ab} X_a^T A A^T X_b (153)
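To illustrate formulas 152 and 153 and the alternating update described next, the sketch below assembles the four matrices of formula 153 for a fixed A and solves the resulting generalized eigenvalue problem; the shapes, the symmetric random weights, the small ridge term and the use of scipy.linalg.eigh are assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def scatter_matrices(X_list, W, A):
    """Formula 153 for fixed A: Lambda_A = sum_a lambda_aa X_a^T A A^T X_a,
    W_A = sum_{a,b} W_ab X_a^T A A^T X_b."""
    lam = W.sum(axis=1)                   # diagonal elements lambda_aa
    M = A @ A.T
    P = [X.T @ M for X in X_list]         # X_a^T A A^T, reused below
    n = len(X_list)
    Lambda_A = sum(lam[a] * P[a] @ X_list[a] for a in range(n))
    W_A = sum(W[a, b] * P[a] @ X_list[b] for a in range(n) for b in range(n))
    return Lambda_A, W_A

rng = np.random.default_rng(4)
R1, R2, N, I1 = 6, 5, 8, 3
X_list = [rng.normal(size=(R1, R2)) for _ in range(N)]
Wi = rng.random((N, N)); Wi = (Wi + Wi.T) / 2   # intrinsic-graph weights (symmetric, assumed)
Wp = rng.random((N, N)); Wp = (Wp + Wp.T) / 2   # penalty-graph weights
A = np.eye(R1)[:, :I1]                          # initialization as in Algorithm 15

Li, WAi = scatter_matrices(X_list, Wi, A)
Lp, WAp = scatter_matrices(X_list, Wp, A)
# (Lambda^p_A - W^p_A) b = c (Lambda^i_A - W^i_A) b: keep eigenvectors of the largest c
Si = Li - WAi + 1e-8 * np.eye(R2)               # small ridge so the right-hand matrix is PD
Sp = Lp - WAp
vals, vecs = eigh(Sp, Si)
B = vecs[:, -I1:]   # leading generalized eigenvectors (count follows the text's convention)
```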


Using the idea of alternating least squares, when fixing A, B consists of I1 generalized eigenvectors, which correspond to the first I1 largest eigenvalues and satisfy the equation (Λ^p_A − W^p_A) b = c (Λ^i_A − W^i_A) b. When fixing B, A consists of I2 generalized eigenvectors, which correspond to the first I2 largest eigenvalues and satisfy the equation (Λ^p_B − W^p_B) a = c (Λ^i_B − W^i_B) a.

According to the above analysis, we present a graph embedding algorithm based on a second-order tensor (matrix) (see algorithm 15).

Algorithm 15 2nd-Order Tensor-Based Graph Embedding Algorithm (Hu et al.) [143]
Input: Input tensor sample set X_a ∈ R^{R1×R2}, a = 1, 2, ..., N;
Output: Transfer matrices (factor matrices) A^T, B^T;
1: Initialize A: take the first I1 columns of the unit matrix I ∈ R^{R1×R1} as the matrix A;
2: Initialize the weight coefficients W^i_{ab} and W^p_{ab} according to (Weiming Hu, 2017);
3: for k = 1 to n do
4: Calculate Λ^i_A, W^i_A, Λ^p_A, W^p_A in formula 153;
5: Calculate B by using the properties of generalized eigenvectors: (Λ^p_A − W^p_A) b = c (Λ^i_A − W^i_A) b;
6: Calculate Λ^i_B, W^i_B, Λ^p_B, W^p_B in formula 153 by exchanging A and B;
7: Replace A by using the properties of generalized eigenvectors: (Λ^p_B − W^p_B) a = c (Λ^i_B − W^i_B) a;
8: end for
9: return Transfer matrices (factor matrices) A^T, B^T;

Note that since images are mostly two-dimensional, the above authors used the form of a second-order matrix. In fact, if the input is a higher-dimensional tensor (n > 3), we can still use alternating least squares to convert the above multivariate optimization problem into a single-variable optimization problem.

7) DISCUSSION AND COMPARISON
This section mainly introduced some tensor-based classification algorithms. Among them, the Support Tensor Machine (STM) and High-order Restricted Boltzmann machines (HORBM) are the main ones. Similar to the traditional SVM, STM can only perform binary classification. When faced with multi-class problems, we need to perform STM multiple times, and the more popular method now is to use a neural network instead. Due to the simplicity of rank-one decomposition, we mainly described the STM algorithm based on rank-one decomposition. For high-order Boltzmann machines, we extended them to a more general case. At the same time, when performing CP decomposition or rank-one decomposition of a tensor, not only is the number of unknown parameters reduced, but an efficient iterative solution can also be obtained by ALS. Tensors can also combine various high-dimensional features to improve the classification accuracy; therefore, a tensor-based feature fusion technique was presented for classification. Finally, by using tensors to separate the target from the background pattern, the discriminant space can be effectively learned, thereby effectively detecting the target in the picture.

C. APPLICATION OF TENSOR IN DATA PREPROCESSING
1) TENSOR DICTIONARY LEARNING
Dictionary learning refers to finding a sparse representation of the original data while preserving the structure of the data without distortion, thereby achieving data compression and ultimately reducing computational complexity (see figure 44). General dictionary learning boils down to the following optimization problem:

min_{A, xi} Σ_{i=1}^{N} ||yi − A xi||_2^2 + λ Σ_{i=1}^{N} ||xi||_1, i = 1, ..., N (154)

where A ∈ R^{J×I} is the sparse matrix, yi ∈ R^J, i = 1, ..., N are the N raw data vectors, and xi ∈ R^I are their sparse representations.

FIGURE 44. Schematic diagram of dictionary learning: on the left, y is the original input data, A is a sparse matrix, and x is a sparse representation of y.

We now extend the vector case to the tensor case. When the input raw data Y ∈ R^{I1×I2×···×IN} is an Nth-order tensor, this produces tensor dictionary learning. The tensor decomposition method is usually used to solve the tensor dictionary learning problem. (Ghassemi et al.) [88] used the Kronecker product representation of the Tucker decomposition to represent the above optimization problem. According to the expression of the Tucker decomposition in formula 108, we get

Y_{v1} = (B_N ⊗R B_{N−1} ··· ⊗R B_1) X_{v1} (155)

We combine N samples as N column vectors of a new matrix Y, and get the matrix expression as follows:

Y = (B_N ⊗R B_{N−1} ··· ⊗R B_1) X (156)

At this point we call the factor matrices Kronecker structured (KS) matrices and let D = B_N ⊗L B_{N−1} ··· ⊗L B_1.
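A small numerical check of formula 155, under the assumptions that ⊗R denotes the ordinary Kronecker product and that (·)v1 is the mode-1 (column-major) vectorization; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
I1, I2, I3 = 4, 3, 2          # data dimensions (assumed)
R1, R2, R3 = 2, 2, 2          # Tucker ranks (assumed)

X = rng.normal(size=(R1, R2, R3))   # core tensor
B1 = rng.normal(size=(I1, R1))
B2 = rng.normal(size=(I2, R2))
B3 = rng.normal(size=(I3, R3))

# Tucker reconstruction Y = X x_1m B1 x_2m B2 x_3m B3
Y = np.einsum('pqr,ip,jq,kr->ijk', X, B1, B2, B3)

# formula 155: Y_{v1} = (B3 (x) B2 (x) B1) X_{v1}, mode-1 (Fortran-order) vectorization
D = np.kron(B3, np.kron(B2, B1))
assert np.allclose(Y.flatten(order='F'), D @ X.flatten(order='F'))
```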


However, a more general sparse matrix is a low-rank separation structure matrix, that is, a sum of KS matrices, as follows:

D = Σ_{i=1}^{I} B^i_N ⊗R B^i_{N−1} ··· ⊗R B^i_1 (157)

Consider another property: let D = B2 ⊗R B1; then the elements of D can be rearranged into the form of a vector outer product, D^r = vec(B1)_1 ◦ vec(B2)_1. We can therefore convert equation 157 equivalently to the following:

D^r = Σ_{i=1}^{I} (B^i_1)_{v1} ◦ (B^i_2)_{v1} ··· ◦ (B^i_N)_{v1} (158)

So we can use this structure as a regularization term. Finally, we get the optimization expression for tensor dictionary learning as follows:

min_{D,X} (1/2) ||Y − DX||_F^2 + (λ/N) Σ_{n=1}^{N} ||D^r_{mn}||_* (159)

where D = Σ_{i=1}^{I} B^i_N ⊗R B^i_{N−1} ··· ⊗R B^i_1, and ||D^r_{mn}||_* is the nuclear norm of the matrix obtained after the mode-n matricization of the tensor D^r. It is generally solved by the Lagrangian multiplier method. Since the solution process is too complicated, it is omitted here; for details, please refer to (Ghassemi et al.) [88].

2) TENSOR COMPLETION FOR DATA PROCESSING
In data processing, there are sometimes missing values in the data. There are many ways to complete the missing data, and the popular ones are matrix estimation and matrix completion. If the input data is a tensor, then we call them tensor estimation and tensor completion. Tensor estimation and tensor completion are similar: both require solving a corresponding constrained minimization problem. However, tensor estimation mainly minimizes the mean square error between the estimated value and the original value. Here we mainly introduce tensor completion. General tensor completion aims to seek the optimal solution of the following expression:

min_Y ||(X − Y) ⊛ Ĩ||_F^2 (160)

where X ∈ R^{I1×I2×···×IN} is a tensor with missing values, Y ∈ R^{I1×I2×···×IN} is the reconstruction tensor, ⊛ is the element-wise product (see formula 19), and Ĩ ∈ R^{I1×I2×···×IN} represents the indexes of missing values in X. The entries of Ĩ are as follows:

ĩ_{i1 i2 ··· iN} = 0 if x_{i1 i2 ··· iN} is missing, 1 otherwise (161)

The first step in such problems is usually to find a low-rank approximation of the original tensor. The conventional method uses one of the tensor decompositions introduced in part one, such as the CP, HOSVD, or TT decomposition. (Peng et al.) [68] used the HOSVD decomposition, but they did not use the traditional truncated SVD algorithm (see algorithm 2). Since traditional algorithms need to initialize the approximate rank of the given tensor and the factor matrices first, which actually requires a lot of pre-calculation, they proposed an adaptive algorithm to obtain the low-rank approximation of the tensor.

First they set an error parameter α ∈ [0, 1]. Then, similar to truncated SVD, an SVD is performed on the mode-k matricization of the core tensor A_{mk}, k = 1, 2, ..., N: A_{mk} = U_k S_k V^T, where S is a diagonal matrix with nonzero entries s_{jj}, j = 1, ..., K, K = rank(A_{mk}). The optimal rank can be obtained by

R_k = min_{R_k} { R_0 < R_k < I_k : ( Σ_{j=R_k+1}^{K} s_{jj} ) / ( Σ_{j=1}^{K} s_{jj} ) < α } (162)

where R_0 is a predefined lower bound on the rank, which prevents the rank from becoming too small. The detailed process is shown in algorithm 16.

Algorithm 16 The Adaptive HOSVD Decomposition of the Tensor (Peng et al.) [68]
Input: The Nth-order data tensor X ∈ R^{I1×I2×···×IN}, error parameter α ∈ [0, 1], R_0;
Output: The core tensor A ∈ R^{R1×R2×···×RN} and the factor matrices B_n ∈ R^{In×Rn};
1: A^0 ← X;
2: for n = 1 to N do
3: [U_n, S_n, V_n^T] = SVD(A^{n−1}_{mn}), and then compute the rank R_n by formula 162;
4: Select the first R_n column vectors of U_n to form the factor matrix B_n, B_n = U_n(:, 1 : R_n) = [u_1, u_2, ..., u_{R_n}];
5: A^n_{mn} = S_n(1 : R_n, 1 : R_n) V_n(:, 1 : R_n)^T;
6: end for
7: A = A^N;
8: return the core tensor A and factor matrices B_n

When the improved HOSVD decomposition is completed, we obtain the factor matrices B_n, and low-rank approximate solutions of the original tensor X are obtained by the mode-n product of the original input tensor with the factor matrices, as follows:

Z_i = X ×1m (B_1 B_1^T) ×2m (B_2 B_2^T) ··· ×im (B_i B_i^T) (163)

where i = 1, ..., N, so we can get N low-rank approximate solutions of X: Z_1, Z_2, ..., Z_N. We take the average of these N tensors as a best approximation of the original tensor X: X ≈ (1/N) Σ_{i=1}^{N} Z_i.
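A compact NumPy sketch of the adaptive HOSVD of Algorithm 16 together with the averaged low-rank approximations of formula 163; the unfolding convention, the simple energy criterion implementing formula 162 and the function names are assumptions made for illustration.

```python
import numpy as np

def unfold(T, mode):
    # mode-n matricization: rows indexed by the chosen mode
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def adaptive_rank(s, alpha, R0):
    # formula 162: smallest R > R0 whose tail energy fraction is below alpha
    total = s.sum()
    for R in range(R0 + 1, len(s) + 1):
        if s[R:].sum() / total < alpha:
            return R
    return len(s)

def adaptive_hosvd(X, alpha=0.05, R0=1):
    """Algorithm 16: adaptive HOSVD; returns the factor matrices B_n."""
    A = X.copy()
    factors = []
    for n in range(X.ndim):
        U, s, _ = np.linalg.svd(unfold(A, n), full_matrices=False)
        R = adaptive_rank(s, alpha, R0)
        Bn = U[:, :R]
        factors.append(Bn)
        # project mode n onto the selected subspace before the next mode
        A = np.moveaxis(np.tensordot(Bn.T, A, axes=([1], [n])), 0, n)
    return factors

def low_rank_average(X, factors):
    # formula 163: Z_i = X x_1m (B1 B1^T) ... x_im (B_i B_i^T), then average the Z_i
    Zs = []
    A = X.copy()
    for n, Bn in enumerate(factors):
        P = Bn @ Bn.T
        A = np.moveaxis(np.tensordot(P, A, axes=([1], [n])), 0, n)
        Zs.append(A.copy())
    return sum(Zs) / len(Zs)

X = np.random.default_rng(6).normal(size=(10, 8, 6))
factors = adaptive_hosvd(X)
X_approx = low_rank_average(X, factors)
```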


After the previous steps, we first perform a zero-filling operation on the missing entries of X, obtaining the filled tensor X̂, and then we compute its approximate solution:

D = (1/N) Σ_{i=1}^{N} Z_i (164)

Finally, the missing values are updated with the following formula:

X̂ = X ⊛ Ĩ + D ⊛ (¬Ĩ) (165)

where ¬ is the Boolean NOT operator (i.e., 0 ← 1, 1 ← 0). The entire tensor completion algorithm is shown in algorithm 17.

Algorithm 17 Tensor Completion (Zisen Fang, 2018)
Input: The Nth-order data tensor X ∈ R^{I1×I2×···×IN} with missing values, Ĩ, error parameter α ∈ [0, 1], R_0, the required number of iterations L;
Output: The Nth-order tensor X̂ obtained after the missing values are completed, the core tensor A ∈ R^{R1×R2×···×RN} and the factor matrices B_n ∈ R^{In×Rn};
1: X̂^0 ← X ⊛ Ĩ, D^0 = 0;
2: for n = 1 to L do
3: Obtain the core tensor A and factor matrices B_n by applying algorithm 16 to X̂^{n−1};
4: for k = 1 to N do
5: Z_k = X ×1m (B_1 B_1^T) ×2m (B_2 B_2^T) ··· ×km (B_k B_k^T);
6: end for
7: D^n = (1/N) Σ_{k=1}^{N} Z_k;
8: X̂^n = X ⊛ Ĩ + D^n ⊛ (¬Ĩ);
9: end for
10: X̂ ← X̂^L;

3) DISCUSSION AND COMPARISON
This section focused on two common kinds of data preprocessing: dimensionality reduction and data completion. For dictionary learning, we introduced a tensor model based on the Tucker decomposition, which can be solved easily due to the properties of the Tucker decomposition. At the same time, we also introduced the latest tensor completion algorithm based on the improved HOSVD decomposition for completing missing data values.

D. BRIEF SUMMARY FOR PART TWO
Part two introduced the applications of tensor algorithms, including data preprocessing, data classification and data prediction (regression). We can see from part two that, in order to solve the high-dimensional problem, more and more researchers have begun to develop tensor-based algorithms. The biggest feature of tensor algorithms is that they can effectively use the data structure to extract useful information. At the same time, tensor decomposition is used to reduce the number of unknown parameters and the size of the original tensor. Finally, the original problem is transformed into a single-variable optimization problem by the alternating least squares algorithm. Tensor-based algorithms not only preserve the interrelationships among the data features, but also improve the accuracy.

IV. CHALLENGES AND PROSPECTS
A. CHALLENGES
As a technology that has risen in recent years, tensors are gradually being applied to various fields, such as medicine, biology, computer vision and machine learning. But at the same time they also face many challenges.

For example, existing tensor-based tracking algorithms cannot completely detect the intrinsic local geometry and discriminant structure of the image blocks in tensor form. As a result, they often ignore the influence of the background, are interfered with by the background area, and the accuracy of target tracking is reduced.

Regardless of whether the classification problem comes from machine learning or deep learning, tensor decomposition also requires more parameters. In order to improve the accuracy, a large number of samples are needed. Without a better training algorithm, a large number of parameters will cause slow convergence or even no convergence. At the same time, how to obtain a huge amount of data is also a very important issue. Due to limited samples, researchers often choose to experiment on simulated data. After all, simulated data differ from real data, so the accuracy is not fully guaranteed when applied to real high-dimensional data.

The traditional tensor decompositions introduced in part one, such as the Tucker decomposition and the CP decomposition, all decompose the input tensor into multiple low-order factors. However, due to noise from illumination, occlusion or other practical factors, they are prone to deviations. Thus the accuracy of the decomposition decreases, which means that the robustness of these decomposition algorithms is relatively poor.

Moreover, for data processing, some general tensor algorithms directly decompose the input features into multiple dimensions, which excessively considers the combination of these features with other useless features. So it is a huge challenge to accurately extract useful information in the decomposition and abandon the useless combinations of features.

The last big problem concerns tensor decomposition algorithms. When it comes to tensor decomposition, it is indispensable to talk about alternating least squares, which obtains the factors of the tensor decomposition by iteratively updating a single core at a time. However, these algorithms have a common problem, namely the problem of initialization. In deep learning and machine learning, if the weight initialization is not appropriate, it will cause long


convergence times or even non-convergence. Therefore, how to effectively initialize the tensor rank and the factor matrices is a huge challenge.

B. PROSPECTS
For the above problems, we propose the following research directions:

1. For target detection and image tracking, can we find a tensor-based algorithm that can capture the features between the background image and the target image?
In many cases, we need to extract the target we are looking for from an image, and dynamic video is even more difficult. How well we can grasp the characteristics of the target and the background and distinguish them will affect the accuracy of target tracking. Therefore, we urgently need to develop a tensor-based tracking algorithm that can capture the local geometric structural relationship and the discriminant relationship between background and target image blocks.

2. How to optimize the learning algorithm or avoid saddle points?
How to improve the traditional gradient descent algorithm, or how to avoid saddle points, becomes an urgent requirement for tensor-based deep learning. For deep learning, when the dimension becomes higher, the first problem we think of is the increase in computational complexity and computation time. In general, we use tensor decomposition for dimensionality reduction, but some new problems are inevitably generated. A common problem with gradient descent is that it tends to fall into local minima. As the dimension rises, such problems become more widespread, and we still need to invent improved algorithms to prevent the network from falling into local minima. Saddle points also arise due to the high dimension, which makes the problem non-convex. Therefore, the learning update algorithms urgently need to be improved, otherwise the accuracy cannot be improved.

3. Can the non-convex problem of the weight optimization process be transformed into a convex optimization problem?
As the dimension increases, the objective function will in general become non-convex, which leads to a non-convex optimization problem. Non-convex optimization problems are usually difficult to solve, so we always want to find an equivalent convex optimization problem to solve instead. Can we use effective tensor decompositions or other algorithms to transform non-convex objective functions into convex functions and optimize them?

4. How to reduce the required samples and convergence time while ensuring accuracy?
Some researchers tried to convert the original tensor problem into a traditional vector problem, which not only destroys the original data structure, but also greatly increases the number of parameters. The current method for tensor data is tensor decomposition, which directly converts the data tensor into factor matrices and a core tensor. Some researchers have reduced the parameters by tensor contraction calculations (Kossaifi et al.) [64]. However, whether tensor decomposition or tensor contraction is used to reduce the unknown parameters, the number of unknowns that actually need to be solved is still larger than in ordinary problems. Therefore, for the required sample data, one approach is to use simulated data, and the other is to fill in the samples that are missing in reality by tensor completion. However, both methods have accuracy problems, which also affect the training of the subsequent models.

5. Is it possible to improve low-rank tensor decomposition algorithms?
We mentioned in the last section that tensor decomposition algorithms face the problem of initializing the factor matrices, the core tensor and the tensor rank. For the factor matrices and core tensor, we tend to use the usual random Gaussian variables to initialize the parameters. According to the structural characteristics of the tensor, or its mode-n matricization and vectorization, we can add some additional prior information to the factor matrices and core tensor. So can we apply constraints to the factor matrices and core tensor to better exploit their properties and initialize them effectively? Or can we improve the alternating least squares algorithm so that it can discover the characteristics of the original input tensor and initialize itself automatically? For the tensor rank, we have just introduced a new improved algorithm in the part two tensor completion algorithm (see algorithm 17).

V. CONCLUSION
This survey focuses on the basics of tensors, including tensor definitions, tensor operations, tensor decomposition, and low-rank tensor-based algorithms. At the same time, we also describe the application of tensor decomposition in various fields and introduce some applications of tensor algorithms in machine learning and deep learning. Finally, we discuss the opportunities and challenges of tensors.

REFERENCES
[1] A. Bibi and B. Ghanem, "High order tensor formulation for convolutional sparse coding," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 1790-1798.
[2] A. Cichocki, "Tensor decompositions: A new concept in brain data analysis?" 2013, arXiv:1305.0395. [Online]. Available: https://arxiv.org/abs/1305.0395
[3] A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao, and D. P. Mandic, "Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions," Found. Trends Mach. Learn., vol. 9, nos. 4-5, pp. 249-429, 2016.
[4] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan, "Tensor decompositions for signal processing applications: From two-way to multiway component analysis," IEEE Signal Process. Mag., vol. 32, no. 2, pp. 145-163, Mar. 2015.
[5] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. Chichester, U.K.: Wiley, 2009.
[6] A. Cichocki, A.-H. Phan, Q. Zhao, N. Lee, and I. Oseledets, "Tensor networks for dimensionality reduction and large-scale optimization: Part 2 applications and future perspectives," Found. Trends Mach. Learn., vol. 9, no. 6, pp. 431-673, 2017.
[7] A. Desai, M. Ghashami, and J. M. Phillips, "Improved practical matrix sketching with guarantees," IEEE Trans. Knowl. Data Eng., vol. 28, no. 7, pp. 1678-1690, Jul. 2016.


[8] A.-H. Phan, A. Cichocki, P. Tichavský, D. Mandic, and K. Matsuoka, [32] F. Verstraete, V. Murg, and J. I. Cirac, ‘‘Matrix product states, projected
‘‘On revealing replicating structures in multiway data: A novel tensor entangled pair states, and variational renormalization group methods for
decomposition approach,’’ in Proc. 10th Int. Conf. LVA/ICA. Berlin, quantum spin systems,’’ Adv. Phys., vol. 57, no. 2, pp. 143–224, 2008.
Germany: Springer, Mar. 2012, pp. 297–305. [33] G. Ballard, N. Knight, and K. Rouse, ‘‘Communication lower bounds for
[9] A. H. Phan and A. Cichocki, ‘‘Extended HALS algorithm for nonnegative matricized tensor times Khatri-Rao product,’’ in Proc. IEEE Int. Parallel
Tucker decomposition and its applications for multiway analysis and Distrib. Process. Symp. (IPDPS), Vancouver, BC, Canada, May 2018,
classification,’’ Neurocomputing, vol. 74, no. 11, pp. 1956–1969, 2011. pp. 557–567.
[10] A.-H. Phan, A. Cichocki, A. Uschmajew, P. Tichavsky, G. Luta, and [34] G. Chabriel, M. Kleinsteuber, E. Moreau, H. Shen, P. Tichavsky, and
D. Mandic, ‘‘Tensor networks for latent variable analysis. Part I: A. Yeredor, ‘‘Joint matrices decompositions and blind source separation:
Algorithms for tensor train decomposition,’’ 2016, arXiv:1609.09230. A survey of methods, identification, and applications,’’ IEEE Signal
[Online]. Available: https://arxiv.org/abs/1609.09230 Process. Mag., vol. 31, no. 3, pp. 34–43, May 2014.
[11] A. Hyvärinen, ‘‘Independent component analysis: Recent advances,’’ [35] G. E. Hinton, ‘‘Training products of experts by minimizing contrastive
Philos. Trans. Roy. Soc. A, Math., Phys. Eng. Sci., vol. 371, no. 1984, divergence,’’ Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002.
2013, Art. no. 20110534. [36] G. Evenbly and G. Vidal, ‘‘Algorithms for entanglement renormal-
[12] A. Kolbeinsson, J. Kossaifi, Y. Panagakis, A. Bulat, A. Anandkumar, ization,’’ Phys. Rev. B, Condens. Matter, vol. 79, no. 14, 2009,
I. Tzoulaki, and P. Matthews, ‘‘Robust deep networks with randomized Art. no. 144108.
tensor regression layers,’’ 2019, arXiv:1902.10758. [Online]. Available: [37] G. Hu, Y. Hua, Y. Yuan, Z. Zhang, Z. Lu, S. S. Mukherjee,
https://arxiv.org/abs/1902.10758 T. M. Hospedales, N. M. Robertson, and Y. Yang, ‘‘Attribute-enhanced
[13] A. Tjandra, S. Sakti, and S. Nakamura, ‘‘Tensor decomposition for com- face recognition with neural tensor fusion networks,’’ in Proc. IEEE Int.
pressing recurrent neural network,’’ in Proc. Int. Joint Conf. Neural Netw. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 3764–3773.
(IJCNN), Rio de Janeiro, Brazil, Jul. 2018, pp. 1–8. [38] G. Lechuga, L. Le Brusquet, V. Perlbarg, L. Puybasset, D. Galanaud, and
[14] B. Jiang, F. Yang, and S. Zhang, ‘‘Tensor and its Tucker core: The invari- A. Tenenhaus, ‘‘Discriminant analysis for multiway data,’’ in Proc. Int.
ance relationships,’’ Jan. 2016, arXiv:1601.01469. [Online]. Available: Conf. Partial Least Squares Related Methods, in Springer Proceedings in
https://arxiv.org/abs/1601.01469 Mathematics and Statistics, 2015, pp. 115–126.
[15] B. Mao, Z. M. Fadlullah, F. Tang, N. Kato, O. Akashi, T. Inoue, and [39] T. Goldstein, B. Odonoghue, and S. Setzer, ‘‘Fast alternating direction
K. Mizutani, ‘‘A tensor based deep learning technique for intelligent optimization methods,’’ CAM Rep., 2012, pp. 12–35.
packet routing,’’ in Proc. IEEE Global Commun. Conf. (GLOBECOM), [40] G. Vidal, ‘‘Efficient classical simulation of slightly entangled quantum
Singapore, Dec. 2017, pp. 1–6. computations,’’ Phys. Rev. Lett., vol. 91, no. 14, 2003, Art. no. 147902.
[16] B. Khoromskij and A. Veit, ‘‘Efficient computation of highly oscillatory [41] G. Zhou and A. Cichocki, ‘‘Fast and unique Tucker decompositions via
integrals by using QTT tensor approximation,’’ Comput. Methods Appl. multiway blind source separation,’’ Bull. Polish Acad. Sci., vol. 60, no. 3,
Math., vol. 16, no. 1, pp. 145–159, 2016. pp. 389–407, 2012.
[17] C. F. Caiafa and A. Cichocki, ‘‘Stable, robust, and super fast recon- [42] G. Zhou and A. Cichocki, ‘‘Canonical polyadic decomposition based on a
struction of tensors using multi-way projections,’’ IEEE Trans. Signal single mode blind source separation,’’ IEEE Signal Process. Lett., vol. 19,
Process., vol. 63, no. 3, pp. 780–793, Feb. 2015. no. 8, pp. 523–526, Aug. 2012.
[18] C. Cortes and V. Vapnik, ‘‘Support-vector networks,’’ Mach. Learn., [43] G. Zhou, A. Cichocki, Y. Zhang, and D. P. Mandic, ‘‘Group component
vol. 20, no. 3, pp. 273–297, 1995. analysis for multiblock data: Common and individual feature extraction,’’
[19] C. M. Crainiceanu, B. S. Caffo, S. Luo, V. M. Zipunnikov, and IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 11, pp. 2426–2439,
N. M. Punjabi, ‘‘Population value decomposition, a framework for the Nov. 2016.
analysis of image populations,’’ J. Amer. Statist. Assoc., vol. 106, no. 495, [44] G. Zhou, Q. Zhao, Y. Zhang, T. Adalı, S. Xie, and A. Cichocki, ‘‘Linked
pp. 775–790, 2011. component analysis from matrices to high-order tensors: Applications to
[20] C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan, ‘‘Tensor robust biomedical data,’’ Proc. IEEE, vol. 104, no. 2, pp. 310–331, Feb. 2016.
principal component analysis with a new tensor nuclear norm,’’ IEEE [45] H. Chen, Q. Ren, and Y. Zhang, ‘‘A hierarchical support tensor machine
Trans. Pattern Anal. Mach. Intell., to be published. structure for target detection on high-resolution remote sensing images,’’
[21] C. Peng, L. Zou, and D.-S. Huang, ‘‘Discovery of relationships between in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Fort Worth,
long non-coding RNAs and genes in human diseases based on tensor TX, USA, Jul. 2017, pp. 594–597.
completion,’’ IEEE Access, vol. 6, pp. 59152–59162, 2018. [46] H. Fanaee-T and J. Gama, ‘‘Tensor-based anomaly detection: An interdis-
[22] C. Tobler, ‘‘Low-rank tensor methods for linear systems and eigenvalue ciplinary survey,’’ Knowl.-Based Syst., vol. 98, pp. 130–147, Apr. 2016.
problems,’’ M.S. thesis, ETH Zürich, Zürich, Switzerland, 2012. [47] H. Imtia and A. D. Sarwate, ‘‘Improved algorithms for differentially pri-
[23] D. Kressner, M. Steinlechner, and A. Uschmajew, ‘‘Low-rank tensor vate orthogonal tensor decomposition,’’ in Proc. IEEE Int. Conf. Acoust.,
methods with subspace correction for symmetric eigenvalue problems,’’ Speech Signal Process. (ICASSP), Calgary, AB, Canada, Apr. 2018,
SIAM J. Sci. Comput., vol. 36, no. 5, pp. A2346–A2368, 2014. pp. 2201–2205.
[24] D. Kressner, M. Steinlechner, and B. Vandereycken, ‘‘Low-rank tensor [48] H. Lu, L. Zhang, Z. Cao, W. Wei, K. Xian, C. Shen, and
completion by Riemannian optimization,’’ BIT Numer. Math., vol. 54, A. van den Hengel, ‘‘When unsupervised domain adaptation meets
no. 2, pp. 447–468, 2014. tensor representations,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV),
[25] D. Kressner and C. Tobler, ‘‘Algorithm 941: Htucker—A MATLAB Venice, Italy, Oct. 2017, pp. 599–608.
toolbox for tensors in hierarchical Tucker format,’’ ACM Trans. Math. [49] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, ‘‘A survey of
Softw., vol. 40, no. 3, 2014, Art. no. 22. multilinear subspace learning for tensor data,’’ Pattern Recognit., vol. 44,
[26] D. Kressner and A. Uschmajew, ‘‘On low-rank approximability of solu- no. 7, pp. 1540–1551, 2011.
tions to high-dimensional operator equations and eigenvalue problems,’’ [50] H. Matsueda, ‘‘Analytic optimization of a MERA network and its rele-
Linear Algebra Appl., vol. 493, pp. 556–572, Mar. 2016. vance to quantum integrability and wavelet,’’ 2016, arXiv:1608.02205.
[27] D. Tao, X. Li, W. Hu, S. Maybank, and X. Wu, ‘‘Supervised tensor [Online]. Available: https://arxiv.org/abs/1608.02205
learning,’’ in Proc. 5th IEEE Int. Conf. Data Mining (ICDM), Nov. 2005, [51] H. Yang, J. Su, Y. Zou, B. Yu, and E. F. Y. Young, ‘‘Layout hotspot
pp. 8–16. detection with feature tensor generation and deep biased learning,’’ in
[28] D. Wang, H. Shen, and Y. Truong, ‘‘Efficient dimension reduction Proc. 54th ACM/EDAC/IEEE Design Autom. Conf. (DAC), Austin, TX,
for high-dimensional matrix-valued data,’’ Neurocomputing, vol. 190, USA, 2017, pp. 1–6.
pp. 25–34, May 2016. [52] H. Wang, Q. Wu, L. Shi, Y. Yu, and N. Ahuja, ‘‘Out-of-core tensor
[29] E. Corona, A. Rahimian, and D. Zorin, ‘‘A tensor-train accelerated approximation of multi-dimensional matrices of visual data,’’ ACM Trans.
solver for integral equations in complex geometries,’’ Nov. 2015, Graph., vol. 24, no. 3, pp. 527–535, 2005.
arXiv:1511.06029. [Online]. Available: https://arxiv.org/abs/1511.06029 [53] H. Wang, D. Huang, Y. Wang, and H. Yang, ‘‘Facial aging simulation via
[30] E. M. Stoudenmire and D. J. Schwab, ‘‘Supervised learning with tensor completion and metric learning,’’ IET Comput. Vis., vol. 11, no. 1,
quantum-inspired tensor networks,’’ 2016, arXiv:1605.05775. [Online]. pp. 78–86, Feb. 2017.
Available: https://arxiv.org/abs/1605.05775 [54] H. Zhao, Z. Wei, and H. Yan, ‘‘Detection of correlated co-clusters in
[31] F. L. Hitchcock, ‘‘Multiple invariants and generalized rank of a p-way tensor data based on the slice-wise factorization,’’ in Proc. Int. Conf.
matrix or tensor,’’ J. Math. Phys., vol. 7, pp. 39–79, Apr. 1928. Mach. Learn. (ICMLC), Ningbo, China, Jul. 2017, pp. 182–188.


[55] H. Zhou, L. Li, and H. Zhu, ‘‘Tensor regression with applications in [77] L. Grasedyck, D. Kressner, and C. Tobler, ‘‘A literature survey of low-
neuroimaging data analysis,’’ J. Amer. Stat. Assoc., vol. 108, no. 502, rank tensor approximation techniques,’’ GAMM-Mitteilungen, vol. 36,
pp. 540–552, 2013. no. 1, pp. 53–78, 2013.
[56] I. Jeon, E. E. Papalexakis, C. Faloutsos, L. Sael, and U. Kang, ‘‘Mining [78] L. Grasedyck, ‘‘Hierarchical singular value decomposition of tensors,’’
billion-scale tensors: Algorithms and discoveries,’’ VLDB J., vol. 25, SIAM J. Matrix Anal. Appl., vol. 31, no. 4, pp. 2029–2054, 2010.
no. 4, pp. 519–544, 2016. [79] L. Karlsson, D. Kressner, and A. Uschmajew, ‘‘Parallel algorithms
[57] I. Kisil, G. G. Calvi, A. Cichocki, and D. P. Mandic, ‘‘Common and for tensor completion in the CP format,’’ Parallel Comput., vol. 57,
individual feature extraction using tensor decompositions: A remedy pp. 222–234, Sep. 2016.
for the curse of dimensionality?’’ in Proc. IEEE Int. Conf. Acoust., [80] L. Luo, Y. Xie, Z. Zhang, and W.-J. Li, ‘‘Support matrix machines,’’ in
Speech Signal Process. (ICASSP), Calgary, AB, Canada, Apr. 2018, Proc. Int. Conf. Mach. Learn. (ICML), 2015, pp. 938–947.
pp. 6299–6303. [81] L. R. Tucker, ‘‘Implications of factor analysis of three-way matrices for
[58] I. Kotsia, W. Guo, and I. Patras, ‘‘Higher rank support tensor machines measurement of change,’’ in Problems Measuring Change, C. W. Harris,
for visual recognition,’’ Pattern Recognit., vol. 45, no. 12, pp. 4192–4203, Ed. Madison, WI, USA: Univ. Wisconsin Press, 1963, pp. 122–137.
2012. [82] L. Sorber, I. Domanov, M. Van Barel, and L. De Lathauwer, ‘‘Exact line
[59] I. Kotsia and I. Patras, ‘‘Support Tucker machines,’’ in Proc. IEEE Conf. and plane search for tensor optimization,’’ Comput. Optim. Appl., vol. 63,
Comput. Vis. Pattern Recognit., Jun. 2011, pp. 633–640. no. 1, pp. 121–142, 2016.
[60] I. V. Oseledets, ‘‘Tensor-train decomposition,’’ SIAM J. Sci. Comput., [83] L. Yuan, Q. Zhao, and J. Cao, ‘‘High-order tensor completion for
vol. 33, no. 5, pp. 2295–2317, 2011. data recovery via sparse tensor-train optimization,’’ in Proc. IEEE Int.
[61] I. V. Oseledets and E. E. Tyrtyshnikov, ‘‘Breaking the curse of dimension- Conf. Acoust., Speech Signal Process. (ICASSP), Calgary, AB, Canada,
ality, or how to use SVD in many dimensions,’’ SIAM J. Sci. Comput., Apr. 2018, pp. 1258–1262.
vol. 31, no. 5, pp. 3744–3759, 2009. [84] L. Zhai, Y. Zhang, H. Lv, S. Fu, and H. Yu, ‘‘Multiscale tensor dictionary
[62] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher, ‘‘Randomized single- learning approach for multispectral image denoising,’’ IEEE Access,
view algorithms for low-rank matrix approximation,’’ Tech. Rep., 2016. vol. 6, pp. 51898–51910, 2018.
[63] J. H. Choi and S. Vishwanathan, ‘‘DFacTo: Distributed factorization of [85] M. Bebendorf, C. Kuske, and R. Venn, ‘‘Wideband nested cross approx-
tensors,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1296–1304. imation for Helmholtz problems,’’ Numerische Mathematik, vol. 130,
[64] J. Kossaifi, A. Khanna, Z. Lipton, T. Furlanello, and A. Anandkumar, no. 1, pp. 1–34, 2015.
‘‘Tensor contraction layers for parsimonious deep nets,’’ in Proc. IEEE [86] M. Bachmayr, R. Schneider, and A. Uschmajew, ‘‘Tensor networks and
Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Honolulu, hierarchical tensors for the solution of high-dimensional partial differen-
HI, USA, Jul. 2017, pp. 1940–1946. tial equations,’’ Found. Comput. Math., vol. 16, no. 6, pp. 1423–1472,
[65] J. Virta and K. Nordhausen, ‘‘Blind source separation for nonstationary 2016.
tensor-valued time series,’’ in Proc. IEEE 27th Int. Workshop Mach. [87] M. Espig, M. Schuster, A. Killaitis, N. Waldren, P. Whnert, S. Handschuh,
Learn. Signal Process. (MLSP), Tokyo, Japan, Sep. 2017, pp. 1–6. and H. Auer, ‘‘TensorCalculus library,’’ Tech. Rep., 2012.
[66] K. Batselier and N. Wong, ‘‘A constructive arbitrary-degree Kronecker [88] M. Ghassemi, Z. Shakeri, A. D. Sarwate, and W. U. Bajwa, ‘‘STARK:
product decomposition of tensors,’’ 2015, arXiv:1507.08805. [Online]. Structured dictionary learning through rank-one tensor recovery,’’ in
Available: https://arxiv.org/abs/1507.08805 Proc. IEEE 7th Int. Workshop Comput. Adv. Multi-Sensor Adapt. Pro-
[67] K. Batselier, H. Liu, and N. Wong, ‘‘A constructive algorithm for decom- cess. (CAMSAP), Curacao, Netherlands Antilles, Dec. 2017, pp. 1–5.
posing a tensor into a finite sum of orthonormal rank-1 terms,’’ SIAM J. [89] M. Hou, ‘‘Tensor-based regression models and applications,’’
Matrix Anal. Appl., vol. 36, no. 3, pp. 1315–1337, 2015. Ph.D. dissertation, Laval Univ., Quebec City, QC, Canada, 2017.
[68] K.-Y. Peng, S.-Y. Fu, Y.-P. Liu, and W.-C. Hsu, ‘‘Adaptive runtime [90] M. Hou, Y. Wang, and B. Chaib-draa, ‘‘Online local Gaussian pro-
exploiting sparsity in tensor of deep learning neural network on het- cess for tensor-variate regression: Application to fast reconstruction of
erogeneous systems,’’ in Proc. Int. Conf. Embedded Comput. Syst., limb movements from brain signal,’’ in Proc. IEEE Int. Conf. Acoust.,
Archit., Modeling, Simulation (SAMOS), Pythagorion, Greece, Jul. 2017, Speech Signal Process. (ICASSP), Brisbane, QLD, Australia, Apr. 2015,
pp. 105–112. pp. 5490–5494.
[69] K. Makantasis, A. D. Doulamis, N. D. Doulamis, and A. Nikitakis, [91] M. Hou, Q. Zhao, B. Chaib-Draa, and A. Cichocki, ‘‘Common and dis-
‘‘Tensor-based classification models for hyperspectral data analysis,’’ criminative subspace kernel-based multiblock tensor partial least squares
IEEE Trans. Geosci. Remote Sens., vol. 56, no. 12, pp. 6884–6898, regression,’’ in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 1673–1679.
Dec. 2018. [92] M. Steinlechner, ‘‘Riemannian optimization for solving high-dimensional
[70] K. Makantasis, A. Doulamis, N. Doulamis, A. Nikitakis, and problems with low-rank tensor structure,’’ Ph.D. dissertation, 2016.
A. Voulodimos, ‘‘Tensor-based nonlinear classifier for high-order [93] M. W. Mahoney, M. Maggioni, and P. Drineas, ‘‘Tensor-CUR decompo-
data analysis,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. sitions for tensor-based data,’’ SIAM J. Matrix Anal. Appl., vol. 30, no. 3,
(ICASSP), Calgary, AB, Canada, Apr. 2018, pp. 2221–2225. pp. 957–987, 2008.
[71] K. Naskovska and M. Haardt, ‘‘Extension of the semi-algebraic frame- [94] N. Cohen and A. Shashua, ‘‘Inductive bias of deep convolutional net-
work for approximate CP decompositions via simultaneous matrix diag- works through pooling geometry,’’ 2016, arXiv:1605.06743. [Online].
onalization to the efficient calculation of coupled CP decompositions,’’ Available: https://arxiv.org/abs/1605.06743
in Proc. 50th Asilomar Conf. Signals, Syst. Comput., Pacific Grove, CA, [95] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang,
USA, Nov. 2016, pp. 1728–1732. E. E. Papalexakis, and C. Faloutsos, ‘‘Tensor decomposition for
[72] T. Levi-Civita, The Absolute Differential Calculus. London, U.K.: Blackie signal processing and machine learning,’’ IEEE Trans. Signal Process.,
and Son, 1927. vol. 65, no. 13, pp. 3551–3582, Jal. 2017.
[73] L. De Lathauwer, B. De Moor, and J. Vandewalle, ‘‘A multilinear sin- [96] N. Halko, P. G. Martinsson, and J. A. Tropp, ‘‘Finding structure with ran-
gular value decomposition,’’ SIAM J. Matrix Anal. Appl., vol. 21, no. 4, domness: Probabilistic algorithms for constructing approximate matrix
pp. 1253–1278, 2000. decompositions,’’ SIAM Rev., vol. 53, no. 2, pp. 217–288, 2011.
[74] L. De Lathauwer, B. De Moor, and J. Vandewalle, ‘‘On the best rank-1 [97] N. H. Nguyen, P. Drineas, and T. D. Tran, ‘‘Tensor sparsification via a
and rank-(R1 ,R2 ,. . .,RN ) approximation of higher-order tensors,’’ SIAM bound on the spectral norm of random tensors,’’ 2015, arXiv:1005.4732.
J. Matrix Anal. Appl., vol. 21, no. 4, pp. 1324–1342, 2000. [Online]. Available: https://arxiv.org/abs/1005.4732
[75] L. Geng, X. Nie, S. Niu, Y. Yin, and J. Lin, ‘‘Structural compact core [98] N. Kargas and N. D. Sidiropoulos, ‘‘Completing a joint PMF from pro-
tensor dictionary learning for multispec-tral remote sensing image deblur- jections: A low-rank coupled tensor factorization approach,’’ in Proc. Inf.
ring,’’ in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Athens, Theory Appl. Workshop (ITA), San Diego, CA, USA, Feb. 2017, pp. 1–6.
Greece, Oct. 2018, pp. 2865–2869. [99] N. Lee and A. Cichocki, ‘‘Fundamental tensor operations for large-
[76] L. Albera, H. Becker, A. Karfoul, R. Gribonval, A. Kachenoura, scale data analysis using tensor network formats,’’ Multidimensional Syst.
S. Bensaid, L. Senhadji, A. Hernandez, and I. Merlet, ‘‘Localization of Signal Process., vol. 29, no. 3, pp. 921–960, 2018.
spatially distributed brain sources after a tensor-based preprocessing of [100] N. Schuch, I. Cirac, and D. Pérez-García, ‘‘PEPS as ground states:
interictal epileptic EEG data,’’ in Proc. 37th Annu. Int. Conf. IEEE Eng. Degeneracy and topology,’’ Ann. Phys., vol. 325, no. 10, pp. 2153–2192,
Med. Biol. Soc. (EMBC), Milan, Italy, Aug. 2015, pp. 6995–6998. 2010.


[101] N. Vannieuwenhoven, R. Vandebril, and K. Meerbergen, ‘‘A new trunca- [124] S. Yang, M. Wang, Z. Feng, Z. Liu, and R. Li, ‘‘Deep sparse tensor
tion strategy for the higher-order singular value decomposition,’’ SIAM J. filtering network for synthetic aperture radar images classification,’’
Sci. Comput., vol. 34, no. 2, pp. A1027–A1052, 2012. IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 8, pp. 3919–3924,
[102] P. D. Hoff, "Multilinear tensor regression for longitudinal relational data," Ann. Appl. Statist., vol. 9, no. 3, pp. 1169–1193, 2015.
[103] P. G. Constantine, D. F. Gleich, Y. Hou, and J. Templeton, "Model reduction with MapReduce-enabled tall and skinny singular value decomposition," SIAM J. Sci. Comput., vol. 36, no. 5, pp. S166–S191, 2014.
[104] P. M. Kroonenberg, Applied Multiway Data Analysis. New York, NY, USA: Wiley, 2008.
[105] Q. Li, G. An, and Q. Ruan, "3D facial expression recognition using orthogonal tensor marginal Fisher analysis on geometric maps," in Proc. Int. Conf. Wavelet Anal. Pattern Recognit. (ICWAPR), Ningbo, China, Jul. 2017, pp. 65–71.
[106] Q. Li and G. Tang, "Convex and nonconvex geometries of symmetric tensor factorization," in Proc. 51st Asilomar Conf. Signals, Syst., Comput., Pacific Grove, CA, USA, Oct./Nov. 2017, pp. 305–309.
[107] Q. Shi, Y.-M. Cheung, Q. Zhao, and H. Lu, "Feature extraction for incomplete data via low-rank tensor decomposition with feature regularization," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 6, pp. 1803–1817, Jun. 2019.
[108] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, "A tensor-train deep computation model for industry informatics big data feature learning," IEEE Trans. Ind. Informat., vol. 14, no. 7, pp. 3197–3204, Jul. 2018.
[109] Q. Zhao, G. Zhou, S. Xie, L. Zhang, and A. Cichocki, "Tensor ring decomposition," 2016, arXiv:1606.05535. [Online]. Available: https://arxiv.org/abs/1606.05535
[110] R. A. Harshman, "Foundations of the PARAFAC procedure: Models and conditions for an 'explanatory' multimodal factor analysis," UCLA Working Papers in Phonetics, vol. 16, pp. 1–84, 1970.
[111] R. Kountchev and R. Kountcheva, "Truncated hierarchical SVD for image sequences, represented as third order tensor," in Proc. 8th Int. Conf. Inf. Technol. (ICIT), Amman, Jordan, May 2017, pp. 166–173.
[112] R. Orús, "A practical introduction to tensor networks: Matrix product states and projected entangled pair states," Ann. Phys., vol. 349, pp. 117–158, Oct. 2014.
[113] R. Yu and Y. Liu, "Learning from multiway data: Simple and efficient tensor regression," in Proc. 33rd Int. Conf. Mach. Learn. (ICML), 2016, pp. 373–381.
[114] R. Zhao and Q. Wang, "Learning separable dictionaries for sparse tensor representation: An online approach," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 66, no. 3, pp. 502–506, Mar. 2019.
[115] R. Zdunek and K. Fonal, "Randomized nonnegative tensor factorization for feature extraction from high-dimensional signals," in Proc. 25th Int. Conf. Syst., Signals Image Process. (IWSSIP), Maribor, Slovenia, Jun. 2018, pp. 1–5.
[116] S. A. Vorobyov, Y. Rong, N. D. Sidiropoulos, and A. B. Gershman, "Robust iterative fitting of multilinear models," IEEE Trans. Signal Process., vol. 53, no. 8, pp. 2678–2689, Aug. 2005.
[117] S. Chen and S. A. Billings, "Representations of non-linear systems: The NARMAX model," Int. J. Control, vol. 49, no. 3, pp. 1013–1032, 1989.
[118] S. E. Sofuoglu and S. Aviyente, "A two-stage approach to robust tensor decomposition," in Proc. IEEE Stat. Signal Process. Workshop (SSP), Freiburg, Germany, Jun. 2018, pp. 831–835.
[119] S. Han and P. Woodford, "Comparison of dimension reduction methods using polarimetric SAR images for tensor-based feature extraction," in Proc. 12th Eur. Conf. Synth. Aperture Radar (EUSAR), Aachen, Germany, Jun. 2018, pp. 1–6.
[120] S. Kallam, S. M. Basha, D. S. Rajput, R. Patan, B. Balamurugan, and S. A. K. Basha, "Evaluating the performance of deep learning techniques on classification using tensor flow application," in Proc. Int. Conf. Adv. Comput. Commun. Eng. (ICACCE), Paris, France, Jun. 2018, pp. 331–335.
[121] S. K. Biswas and P. Milanfar, "Linear support tensor machine with LSK channels: Pedestrian detection in thermal infrared images," IEEE Trans. Image Process., vol. 26, no. 9, pp. 4229–4242, Sep. 2017.
[122] S. Savvaki, G. Tsagkatakis, A. Panousopoulou, and P. Tsakalides, "Matrix and tensor completion on a human activity recognition framework," IEEE J. Biomed. Health Inform., vol. 21, no. 6, pp. 1554–1561, Nov. 2017.
[123] S. V. Dolgov and D. V. Savostyanov, "Alternating minimal energy methods for linear systems in higher dimensions," SIAM J. Sci. Comput., vol. 36, no. 5, pp. A2248–A2271, 2014.
[125] T. D. Pham and H. Yan, "Tensor decomposition of gait dynamics in Parkinson's disease," IEEE Trans. Biomed. Eng., vol. 65, no. 8, pp. 1820–1827, Aug. 2018.
[126] T. D. Nguyen, T. Tran, D. Phung, and S. Venkatesh, "Tensor-variate restricted Boltzmann machines," in Proc. AAAI, 2015, pp. 2887–2893.
[127] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Rev., vol. 51, no. 3, pp. 455–500, 2009.
[128] T.-X. Jiang, T.-Z. Huang, X.-L. Zhao, L.-J. Deng, and Y. Wang, "A novel tensor-based video rain streaks removal approach via utilizing discriminatively intrinsic priors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 2818–2827.
[129] T.-L. Chen, D. D. Chang, S.-Y. Huang, H. Chen, C. Lin, and W. Wang, "Integrating multiple random sketches for singular value decomposition," 2016, arXiv:1608.08285. [Online]. Available: https://arxiv.org/abs/1608.08285
[130] T. Wu, A. R. Benson, and D. F. Gleich, "General tensor spectral co-clustering for higher-order data," 2016, arXiv:1603.00395. [Online]. Available: https://arxiv.org/abs/1603.00395
[131] T. Yokota, N. Lee, and A. Cichocki, "Robust multilinear tensor rank estimation using higher order singular value decomposition and information criteria," IEEE Trans. Signal Process., vol. 65, no. 5, pp. 1196–1206, Mar. 2017.
[132] V. A. Kazeev, B. N. Khoromskij, and E. E. Tyrtyshnikov, "Multilevel Toeplitz matrices generated by tensor-structured vectors and convolution with logarithmic complexity," SIAM J. Sci. Comput., vol. 35, no. 3, pp. A1511–A1536, 2013.
[133] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, 2009, Art. no. 15.
[134] V. de Silva and L.-H. Lim, "Tensor rank and the ill-posedness of the best low-rank approximation problem," SIAM J. Matrix Anal. Appl., vol. 30, no. 3, pp. 1084–1127, 2008.
[135] V. Giovannetti, S. Montangero, and R. Fazio, "Quantum multiscale entanglement renormalization ansatz channels," Phys. Rev. Lett., vol. 101, no. 18, 2008, Art. no. 180503.
[136] V. Kuleshov, A. Chaganty, and P. Liang, "Tensor factorization via matrix factorization," in Proc. 18th Int. Conf. Artif. Intell. Statist., 2015, pp. 507–516.
[137] V. Tresp, C. Esteban, Y. Yang, S. Baier, and D. Krompaß, "Learning with memory embeddings," 2015, arXiv:1511.07972. [Online]. Available: https://arxiv.org/abs/1511.07972
[138] W. Austin, G. Ballard, and T. G. Kolda, "Parallel tensor compression for large-scale scientific data," 2015, arXiv:1510.06689. [Online]. Available: https://arxiv.org/abs/1510.06689
[139] W. Chu and Z. Ghahramani, "Probabilistic models for incomplete multidimensional arrays," in Proc. 12th Int. Conf. Artif. Intell. Statist., vol. 5, 2009, pp. 89–96.
[140] W. de Launey and J. Seberry, "The strong Kronecker product," J. Combinat. Theory, Ser. A, vol. 66, no. 2, pp. 192–213, 1994.
[141] W. Guo, I. Kotsia, and I. Patras, "Tensor learning for regression," IEEE Trans. Image Process., vol. 21, no. 2, pp. 816–827, Feb. 2012.
[142] W. Hackbusch and S. Kühn, "A new scheme for the tensor representation," J. Fourier Anal. Appl., vol. 15, no. 5, pp. 706–722, Oct. 2009.
[143] W. Hu, J. Gao, J. Xing, C. Zhang, and S. Maybank, "Semi-supervised tensor-based graph embedding learning and its application to visual discriminant tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 172–188, Jan. 2017.
[144] X. Zhao, H. Shi, M. Lv, and L. Jing, "Least squares twin support tensor machine for classification," J. Inf. Comput. Sci., vol. 11, no. 12, pp. 4175–4189, 2014.
[145] X. Deng, P. Jiang, X. Peng, and C. Mi, "An intelligent outlier detection method with one class support Tucker machine and genetic algorithm toward big sensor data in Internet of Things," IEEE Trans. Ind. Electron., vol. 66, no. 6, pp. 4672–4683, Jun. 2019.
[146] X. He, D. Cai, and P. Niyogi, "Tensor subspace analysis," in Proc. Annu. Conf. Neural Inf. Process. Syst., 2006, pp. 499–506.
[147] X. Xu, N. Zhang, Y. Yan, and Q. Shen, "Application of support higher-order tensor machine in fault diagnosis of electric vehicle range-extender," in Proc. Chin. Autom. Congr. (CAC), Jinan, China, Oct. 2017, pp. 6033–6037.
[148] X. Xu, Q. Wu, S. Wang, J. Liu, J. Sun, and A. Cichocki, "Whole brain fMRI pattern analysis based on tensor neural network," IEEE Access, vol. 6, pp. 29297–29305, 2018.
[149] X. Zhang, "A nonconvex relaxation approach to low-rank tensor completion," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 6, pp. 1659–1671, Jun. 2019.
[150] Y. Du, G. Han, Y. Quan, Z. Yu, H.-S. Wong, C. L. P. Chen, and J. Zhang, "Exploiting global low-rank structure and local sparsity nature for tensor completion," IEEE Trans. Cybern., vol. 49, no. 11, pp. 3898–3910, Nov. 2019.
[151] Y. Huang, W. Wang, L. Wang, and T. Tan, "Conditional high-order Boltzmann machines for supervised relation learning," IEEE Trans. Image Process., vol. 26, no. 9, pp. 4297–4310, Sep. 2017.
[152] Y.-J. Kao, Y.-D. Hsieh, and P. Chen, "Uni10: An open-source library for tensor network algorithms," J. Phys., Conf. Ser., vol. 640, no. 1, 2015, Art. no. 012040.
[153] Y. Liu, "Low-rank tensor regression: Scalability and applications," in Proc. IEEE 7th Int. Workshop Comput. Adv. Multi-Sensor Adapt. Process. (CAMSAP), Curacao, Netherlands Antilles, Dec. 2017, pp. 1–5.
[154] Y. Wang, H.-Y. Tung, A. Smola, and A. Anandkumar, "Fast and guaranteed tensor decomposition via sketching," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 991–999.
[155] Y. Wang, W. Zhang, Z. Yu, Z. Gu, H. Liu, Z. Cai, C. Wang, and S. Gao, "Support vector machine based on low-rank tensor train decomposition for big data applications," in Proc. 12th IEEE Conf. Ind. Electron. Appl. (ICIEA), Siem Reap, Cambodia, Jun. 2017, pp. 850–853.
[156] Y. W. Chen, K. Guo, and Y. Pan, "Robust supervised learning based on tensor network method," in Proc. 33rd Youth Acad. Annu. Conf. Chin. Assoc. Automat. (YAC), Nanjing, China, May 2018, pp. 311–315.
[157] Y. Xiang, Q. Jiang, J. He, X. Jin, L. Wu, and S. Yao, "The advance of support tensor machine," in Proc. IEEE 16th Int. Conf. Softw. Eng. Res., Manage. Appl. (SERA), Kunming, China, Jun. 2018, pp. 121–128.
[158] Y. Zhang and R. Barzilay, "Hierarchical low-rank tensors for multilingual transfer parsing," in Proc. Conf. Empirical Methods Natural Lang. Process., 2015, pp. 1857–1867.
[159] Z. Chen, K. Batselier, J. A. K. Suykens, and N. Wong, "Parallelized tensor train learning of polynomial classifiers," 2016, arXiv:1612.06505. [Online]. Available: https://arxiv.org/abs/1612.06505
[160] Z. Chen, B. Yang, and B. Wang, "Hyperspectral target detection: A preprocessing method based on tensor principal component analysis," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Valencia, Spain, 2018, pp. 2753–2756.
[161] Z. Chen, K. Batselier, J. A. K. Suykens, and N. Wong, "Parallelized tensor train learning of polynomial classifiers," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 10, pp. 4621–4632, Oct. 2018.
[162] Z.-C. Gu, M. Levin, B. Swingle, and X.-G. Wen, "Tensor-product representations for string-net condensed states," Phys. Rev. B, Condens. Matter, vol. 79, no. 8, 2009, Art. no. 085118.
[163] Z. Fang, X. Yang, L. Han, and X. Liu, "A sequentially truncated higher order singular value decomposition-based algorithm for tensor completion," IEEE Trans. Cybern., vol. 49, no. 5, pp. 1956–1967, May 2019.
[164] Z. Hao, L. He, B. Chen, and X. Yang, "A linear support higher-order tensor machine for classification," IEEE Trans. Image Process., vol. 22, no. 7, pp. 2911–2920, Jul. 2013.
[165] Z. Zhang and S. Aeron, "Exact tensor completion using t-SVD," IEEE Trans. Signal Process., vol. 65, no. 6, pp. 1511–1526, Mar. 2017.

YUWANG JI is currently pursuing the master's degree with the National Engineering Laboratory for Mobile Network Security, Wireless Technology Innovation Institute, Beijing University of Posts and Telecommunications (BUPT). His current research interests include tensor applications in machine learning and time series analysis.

QIANG WANG received the Ph.D. degree in communication engineering from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2008. Since 2008, he has been with the School of Information and Communication Engineering, BUPT, where he is currently an Associate Professor. He has participated in many national projects, such as NSFC and 863 projects. His research interests include information theory, machine learning, wireless communications, VLSI, and statistical inference.

XUAN LI is currently pursuing the master's degree with the National Engineering Laboratory for Mobile Network Security, Wireless Technology Innovation Institute, Beijing University of Posts and Telecommunications (BUPT). Her current research interests include UAV-assisted networks and reinforcement learning.

JIE LIU is currently pursuing the master's degree with the National Engineering Laboratory for Mobile Network Security, Wireless Technology Innovation Institute, Beijing University of Posts and Telecommunications (BUPT). Her current research interests include time series prediction and reinforcement learning.