CPD-Structured Multivariate Polynomial Optimization

Muzaffer Ayvaz 1,2* and Lieven De Lathauwer 1,2

1 Department of Electrical Engineering (ESAT), KU Leuven, Leuven, Belgium, 2 Group Science, Engineering and Technology, KU Leuven Kulak, Kortrijk, Belgium
Tensor decompositions such as the canonical polyadic decomposition (CPD) and tensor trains (TT) are promising tools for breaking the curse of dimensionality. Tensors are multi-indexed arrays. They preserve the higher-order structure that is inherent in data, are able to model nonlinear interactions, and can be decomposed uniquely under mild conditions [1–3]. Efficient numerical optimization algorithms have been developed for tensor decompositions. In the context of CPD, the Gauss–Newton algorithm has been effectively implemented in both line search and trust-region frameworks by exploiting the CPD structure [4–6]. A low-complexity damped Gauss–Newton algorithm has also been proposed [7]. Moreover, a randomized block sampling approach has been proposed which achieves linear time complexity for the CPD of large tensors by utilizing the Gauss–Newton algorithm [8]. Many data science problems such as latent factor analysis have been solved by reformulating them as tensor decomposition problems [9–12]. An inexact Gauss–Newton algorithm has been proposed for scaling the CPD of large tensors with non-least-squares cost functions [13]. Moreover, a generalized Gauss–Newton algorithm with an efficient parallel implementation has been proposed for tensor completion with generalized loss functions [14]. Our aim in this work is to extend these efficient numerical approaches to a broader class of problems that includes not only tensor decompositions but also the optimization of multilinear/polynomial cost functions. Examples include, but are not limited to, matrix and tensor eigenvalue problems, nonlinear dimensionality reduction, nonlinear blind source separation, multivariate polynomial regression, and classification problems.

In this study, we develop a framework called Tensor-Based Multivariate Polynomial Optimization (TeMPO) to deal with nonlinear optimization problems commonly encountered in signal processing, machine learning and artificial intelligence. A preliminary version, where only a rank-1 CPD is considered with an application in blind identification, appeared as the conference paper [15]. In the TeMPO framework, these nonlinear functions are approximated or modeled by multivariate polynomials. Then, low-rank tensors are used to represent the polynomial under consideration. This approach reduces the number of parameters that define the system, and hence enables us to develop efficient numerical optimization algorithms. To further elaborate on the proposed methodology, let us consider the optimization problem

    min_p l(p(z), θ),    (1)

where l : R × R^M → R_+ denotes a loss function such as the mean squared error, p : R^I → R denotes an unknown multivariate polynomial, z ∈ R^I denotes input data, and θ ∈ R^M denotes output data. We compactly represent the polynomial p(z) through low-rank tensors. One possible way to do this is to write the polynomial as a sum of homogeneous polynomials as follows:

    p(z) := Σ_{j=0}^{d} T_j z^j,    (2)

where T_j denotes a low-rank tensor of order j, and T_j z^j denotes the mode-n product (see Section 2.1) of the tensor T_j and the vector z for all modes. By convention, T_0 is assumed to be a scalar and z^0 is assumed to be the scalar 1. From now on, we call (2) the type I model. We can represent a multivariate polynomial with a single tensor by utilizing a process called homogenization, augmenting the independent variable z by a constant 1:

    p(z̃) := W z̃^d,    (3)

where W is a tensor of order d, and z̃ = [1; z]. Hereafter, we call (3) the type II model.

An n-variate polynomial of degree d has O(n^d) coefficients. This exponential dependence on d is the so-called curse of dimensionality. In the TeMPO framework, we break the curse of dimensionality by assuming low-rank structure in the coefficient tensors. For example, when a rank-R symmetric CPD structure is used, the number of parameters needed to represent an n-variate polynomial of degree d is ndR, which is linear in the number of variables. Several low-rank structures for tensors have been introduced in the literature [1, 2, 16], e.g., the canonical polyadic decomposition (CPD), the Tucker decomposition, the hierarchical Tucker decomposition (HT) [17], and the tensor train decomposition (TT) [18]. All of these structures can be incorporated into the TeMPO framework; however, in this paper we restrict ourselves to symmetric CPDs. Note that different types of low-rank structure allow us to represent different sub-classes of polynomials. Of course, different representations differ in storage space and computational complexity. A more detailed exposition will be given in Section 3.2. Note also that the type I model allows us to constrain each term separately, while the type II model does not. Therefore, the type I model is a more general representation of multivariate polynomials, which may provide better results depending on the application.

Besides breaking the curse of dimensionality, exploiting low-rank representations of tensors enables us to derive efficient expressions for objective function and gradient evaluations. These then lead us to develop scalable algorithms. We apply our framework to image classification by adapting the second-order Gauss–Newton algorithm and exploiting the symmetric CPD structure in two different tensor representations of multivariate polynomials. We show that the TeMPO framework with symmetric CPD structure achieves similar or better accuracy than various methods such as MLPs and tensor networks with different structures for the classical MNIST and Fashion MNIST datasets, while using fewer parameters and therefore less memory.

Related Work
Several tensor-based methods have been reported in the literature for regression and classification, two problems that are in the class of problems (1). In most of these approaches, a linear model

    y = ⟨W, X⟩ + b,    (4)

is used, where W denotes a weight tensor and X represents nonlinear features of the input data. This model corresponds to the type II model when a symmetric CPD structure is imposed
on the weight tensor W and X is composed of polynomial features of the input data. Clearly, imposing different structures on the weight tensor W and using different nonlinear features in the tensor X leads to a different representation of the nonlinear interaction between input data and output data. For example, exponential machines utilize the tensor train format in the weight tensor with a norm regularization term in the optimization [19]. In this approach, the Riemannian gradient descent algorithm is used for solving the optimization problem. In a similar approach, tensor trains are used with the feature map φ(x_j) = [cos(πx_j/2), sin(πx_j/2)], by using the density matrix renormalization group (DMRG) algorithm and the first-order ADAM algorithm for the optimization of different cost functions [20, 21]. The same feature map is also used for the linear model (4) by imposing projected entangled pair states (PEPS) structure on the weight tensor W [22]. The CPD format in model (4) has also been studied in the realm of tensor regression with Frobenius norm and group sparsity norm regularization terms while using a coordinate-descent approach [23]. A similar model is also considered by utilizing the symmetric CPD format and the second-order Gauss–Newton algorithm with algebraic initialization for multivariate polynomial regression [24]. Several approaches have been proposed that utilize CPD or Tucker formats in tensor regression and use different regularization strategies to prevent overfitting [25, 26]. Also, the hierarchical Tucker (HT) format has been used in the tensor regression context for the generalized linear model (GLM) y = α^T x + ⟨W, X⟩. This approach was successfully applied to brain imaging data sets and uses a block relaxation algorithm, which solves a sequence of lower-dimensional optimization problems [27].

Similarly, several models related to the type I model have been considered in various settings. For example, Kar and Karnick use random polynomial features and parameterize the coefficients of the polynomial under consideration [28]. The parameterization used in this approach has been shown to be equivalent to imposing the CPD format on the weight tensor W [29]. Another approach is factorization machines, which use a multivariate polynomial kernel in the realm of support vector machines (SVM) [30]. For second-order factorization machines, a first-order stochastic gradient descent algorithm has been proposed. This approach has a linear time complexity. Higher-order factorization machines use the ANOVA kernel to achieve a linear time complexity and have been successfully applied to link prediction models using stochastic gradient descent [31]. The ANOVA kernel does not use symmetric tensors in the representation and instead only considers combinations of distinct features [31]. Also, factorization machines in the symmetric CPD format have been considered using first-order and BFGS-type algorithms [32]. Tensor machines generalize both the Kar–Karnick random features approach and factorization machines. It has been shown that these approaches correspond to specific types of tensor machines in the CPD format. Further, it has been shown that empirical risk minimization is an efficient method for finding locally optimal tensor machines if the optimization algorithm avoids saddle points [29].

As can be seen from the literature summary above, one of the differences between our approach and the above methods is the model used. The type I model (2) has not been examined with the symmetric CPD structure in the weight tensors, to the best of our knowledge. Another difference of our approach from the above methods is the algorithm used. While first-order algorithms are used in most of these approaches, we utilize the second-order batch Gauss–Newton (GN) algorithm. Although first-order methods have the advantage of lower per-iteration complexity, second-order GN algorithms generally require fewer iterations to converge and fewer hyperparameters to be optimized. Moreover, the GN algorithm using a trust-region is more robust in the sense that it converges to a (local) minimum for any starting point under mild conditions, and it is less prone to swamps (many iterations with little to no improvement) [5, 6, 33].

We summarize our contributions as follows:

• We develop the TeMPO framework, which is able to solve many nonlinear problems with ubiquitous applications in signal processing, machine learning and artificial intelligence. Moreover, we develop an efficient second-order Gauss–Newton algorithm for optimizing multivariate polynomials in the CPD format.
• We determine the conditions under which the tensorized linear model (4) with polynomial features and the multivariate polynomial model (2) coincide when the symmetric CPD format is used in their representations.
• We show that TeMPO achieves similar or better accuracy than various methods such as multilayer perceptrons (MLPs) and tensor networks with different architectures, including tensor trains (TT), tree tensor networks, and projected entangled pair states (PEPS). We also show that TeMPO requires the optimization of fewer parameters and less memory than these methods for the classification of the MNIST and Fashion MNIST datasets.
• Last but not least, our framework can be interpreted as an advancement of higher-order factorization machines; we introduce an efficient second-order Gauss–Newton algorithm for higher-order factorization machines.

The remaining part of this article is organized as follows. In Section 2, we describe notation and background information concerning tensors. In Section 3, we describe the TeMPO framework in a more detailed manner. Section 3 also covers the details of the representation of polynomials by symmetric CPD structured tensors. In Section 3, we also show how to exploit the symmetric CPD structure to obtain efficient expressions for the gradient and Jacobian-vector products, which are necessary for the Gauss–Newton algorithm. The formulation of the image classification problem in the context of TeMPO, numerical experiments and related discussions are covered in Section 4. We conclude our paper with remarks on future work in the last section.

2. PRELIMINARIES

2.1. Notation
A tensor is a higher-order generalization of a vector (first-order) and a matrix (second-order). Following established conventions, we denote scalars, vectors, matrices, and tensors by a, a, A, and A,
respectively. The transpose of a matrix A is denoted as A^T. The ith column vector of a matrix A is denoted as a_i, i.e., A = (a_1, a_2, . . .). The entry with row index i and column index j in a matrix A, i.e., (A)_ij, is denoted by a_ij. Similarly, (A)_{i1 i2 ... iN} is denoted by a_{i1 i2 ... iN}. Diag(a) denotes the diagonal matrix whose entries are composed from the vector a. On the other hand, diag(A) denotes a vector composed from the diagonal elements of A. The vectorization operator vec(A) for A ∈ K^{I×J} stacks all the columns of A into a column vector a ∈ K^{IJ}. The reverse operation unvec(a) reshapes a vector a into a matrix A ∈ K^{I×J}. The identity matrix of size (K × K) is denoted by I_K. A vector of length K with all entries equal to 1 is denoted by 1_K. The l2 norm of a vector a is denoted by ||a||_2. The row-wise and column-wise concatenation of two vectors a and b is denoted by [a, b] and [a; b], respectively. The outer product, Kronecker product, Khatri–Rao product, and Hadamard product are denoted by ⊗, ⊗, ⊙, and ∗, respectively. The nth power of a vector x with respect to the Kronecker product is defined as x^{⊗n} = x ⊗ x^{⊗(n−1)}, with x^{⊗0} = 1. Similarly, x^{⊙n} and x^{∗n} denote the nth power of the vector x with respect to the Khatri–Rao product and the Hadamard product, respectively. The mode-n product of a tensor A ∈ K^{I1×I2×...×IN} (with K meaning either R or C) and a vector x ∈ K^{In}, denoted by A ·_n x^T, is defined element-wise as

    (A ·_n x^T)_{i1 i2 ··· i_{n−1} i_{n+1} ··· iN} = Σ_{in=1}^{In} a_{i1 i2 ··· in ··· iN} x_{in}.

The mode-n product of a tensor A ∈ K^{I×I×...×I} of order k and a vector x ∈ K^I for all modes is defined as

    A x^k := A ·_1 x^T ·_2 x^T . . . ·_k x^T.

A mode-n vector or mode-n fiber of a tensor A ∈ K^{I1×I2×...×IN} is a vector obtained by fixing every index except the nth. The mode-n matricization of A is a matrix A_{[n; N,N−1,...,n+1,n−1,...,1]} collecting all the mode-n vectors as its columns. For example, an entry a_{i1 i2 i3} of a tensor A ∈ K^{I×J×K} is mapped to the (i2, q) entry of the matrix A_{[2;3,1]} with q = i1 + (i3 − 1)I. The binomial coefficient is denoted by C_n^k = n!/((n−k)! k!). Some useful definitions are listed below.

Definition 1 (Symmetric Tensor). A tensor A ∈ K^{I×I×...×I} of order k is called symmetric if its entries are invariant under the permutation of its indices.

As a consequence of this definition, the matrix representations of symmetric tensors in different modes are all equal.

Definition 4 (Khatri–Rao Product). Given two matrices A ∈ K^{I×K} and B ∈ K^{J×K}, their Khatri–Rao product, also known as columnwise Kronecker product, is

    A ⊙ B = [a_1 ⊗ b_1, a_2 ⊗ b_2, . . . , a_K ⊗ b_K] ∈ K^{IJ×K},

where a_i and b_i denote the ith column of the matrices A and B, respectively.

Definition 5 (Hadamard Product). Given two matrices A ∈ K^{I×J} and B ∈ K^{I×J} with the same size, their Hadamard product is the elementwise product, i.e.,

    A ∗ B = [a_{1,1} b_{1,1}, ···, a_{1,J} b_{1,J}; ... ; a_{I,1} b_{I,1}, ···, a_{I,J} b_{I,J}] ∈ K^{I×J}.

The following properties will be useful for our derivations.

Property 1. Let A ∈ K^{I×J}, X ∈ K^{J×K}, B ∈ K^{K×L}. Then

    vec(AXB) = (B^T ⊗ A) vec(X) ∈ K^{IL}.

Moreover, if X ∈ K^{J×J} is a diagonal matrix and B ∈ K^{J×L}, then

    vec(AXB) = (B^T ⊙ A) diag(X) ∈ K^{IL}.

Property 2. Let A ∈ K^{I×J}, B ∈ K^{K×J}, C ∈ K^{I×L}, and D ∈ K^{K×L}. Then

    (A ⊙ B)^T (C ⊙ D) = (A^T C) ∗ (B^T D) ∈ K^{J×L}.

Property 3. For matrices A ∈ R^{I×J} and B ∈ R^{J×K}, and for the function f(A, B) = AB, the following equations hold:

    ∂vec(f(A, B))/∂vec(A) = B^T ⊗ I_I,    ∂vec(f(A, B))/∂vec(B) = I_K ⊗ A.

2.2. Canonical Polyadic Decomposition
Here, we will briefly describe the canonical polyadic decomposition. A more detailed description of the CPD can be found in [1] and references therein. The CPD writes a tensor T ∈ K^{I1×I2×...×IN} as a sum of R rank-1 tensors and is denoted by ⟦U^(1), . . . , U^(N)⟧, with factor matrices U^(n) ∈ K^{In×R}, where R denotes the number of rank-1 terms.
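As a quick numerical sanity check of Property 2, which is used repeatedly in the derivations of Section 3.4, the following NumPy sketch verifies the identity (A ⊙ B)^T (C ⊙ D) = (A^T C) ∗ (B^T D) on random matrices. This example is our illustration and is not part of the original paper; the khatri_rao helper is defined locally.

```python
import numpy as np

def khatri_rao(A, B):
    # Columnwise Kronecker product: column r of the result equals kron(A[:, r], B[:, r]).
    I, R = A.shape
    J, R2 = B.shape
    assert R == R2
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # A in K^{I x J} with I = 4, J = 3
B = rng.standard_normal((5, 3))   # B in K^{K x J} with K = 5
C = rng.standard_normal((4, 2))   # C in K^{I x L} with L = 2
D = rng.standard_normal((5, 2))   # D in K^{K x L}

lhs = khatri_rao(A, B).T @ khatri_rao(C, D)   # (A ⊙ B)^T (C ⊙ D), a J x L matrix
rhs = (A.T @ C) * (B.T @ D)                   # (A^T C) * (B^T D), Hadamard product
print(np.allclose(lhs, rhs))                  # True
```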
FIGURE 1 | Polyadic decomposition of a third-order symmetric tensor T. It is called canonical (CPD) if R is equal to the rank of T, i.e., R is minimal. It allows a compact representation of polynomials.

For a symmetric tensor, the CPD can be written with a single factor matrix as T = ⟦U, . . . , U; c⟧, where U ∈ K^{I×R}, and c ∈ K^R is a vector of weights which allows us to give minus signs to the factors for even-degree symmetric tensors, see Figure 1. The matrix unfolding of a symmetric CPD is given by

    T = U Diag(c) (U ⊙ U ⊙ · · · ⊙ U)^T.

TeMPO deals with the optimization of multilinear/polynomial cost functions with or without additional constraints, which is a more general setting than tensor decomposition or the retrieval of a tensor factorization. To better describe the scope, let us consider the following class of objective functions:

    l(θ, p(z)),    (5)
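The following NumPy sketch (our illustration, not part of the original paper) builds a rank-R symmetric CPD tensor T = ⟦U, U, U; c⟧ of order 3 and checks the unfolding formula T = U Diag(c) (U ⊙ U)^T numerically.

```python
import numpy as np

def khatri_rao(A, B):
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(1)
I, R = 4, 3
U = rng.standard_normal((I, R))
c = rng.standard_normal(R)

# Full symmetric tensor: T = sum_r c_r * (u_r outer u_r outer u_r).
T = np.einsum('r,ir,jr,kr->ijk', c, U, U, U)

T1 = T.reshape(I, I * I)                  # mode-1 unfolding, rows indexed by i1
M = U @ np.diag(c) @ khatri_rao(U, U).T   # U Diag(c) (U ⊙ U)^T
print(np.allclose(T1, M))                 # True
```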
    min_{T_0,...,T_d} p(T_0, . . . , T_d, Z) = min_{T_0,...,T_d} (1/2) Σ_{k=1}^{K} (y_k − T_0 − Σ_{j=1}^{d} T_j z_k^j)^2,
        subject to rank(T_j) = R_j.    (7)

Analogously, for the type II model we obtain

    min_T (1/2) Σ_{k=1}^{K} (y_k − T z̃_k^d)^2,    subject to rank(T) = R,    (8)

where T denotes the low-rank structured coefficient tensor of order d to be optimized, Z̃ ∈ R^{(I+1)×K} denotes the augmented input data matrix, and z̃_k denotes the kth column of Z̃, i.e., z̃_k = [1; z_k].

3.2. Tensor Representation of Polynomials
In this subsection, we examine the type I and type II models in detail. A (symmetric) tensor T of order d and dimension n can be associated with a homogeneous n-variate polynomial p(z) of degree d [44], as shown in Equation (3).

Type I: Since any polynomial can be written as a sum of homogeneous polynomials of increasing degrees, any polynomial of degree d can be written by using tensors of order up to d, as shown in Equation (2). Note that in the tensor representation of polynomials, any tensor can be assumed to be symmetric without loss of generality. Indeed, any homogeneous polynomial p(z) of degree d ∈ N can be represented by a multilinear form T z^d, where T ∈ K^{I×I×...×I} is a symmetric tensor of order d and z ∈ K^I.

To see this, suppose a homogeneous polynomial p(z) is represented as

    p(z) = T̃ z^d = Σ_{i1,i2,...,id=1}^{I} t̃_{i1 i2 ... id} z_{i1} z_{i2} . . . z_{id},

where T̃ ∈ K^{I×I×...×I} is a tensor of order d. Since the terms z_{i1} z_{i2} . . . z_{id} are invariant under the permutation of indices, we may write

    p(z) = Σ_{i1,i2,...,id=1}^{I} t_{i1 i2 ... id} z_{i1} z_{i2} . . . z_{id},    where    t_{i1 i2 ... id} = (1/d!) Σ_{(i1,i2,...,id)∈Π(i1 i2 ... id)} t̃_{i1 i2 ... id};

here Π(i1 i2 . . . id) denotes the collection of all permutations of the indices (i1, i2, . . . , id). Since the entries of T are invariant under the permutation of indices, we can conclude that T is symmetric. The above discussion reveals that there are infinitely many representations of a given polynomial. Indeed, two representations with tensors T and W are equal as long as the summation of the corresponding entries over the permutations of indices remains the same, i.e.,

    Σ_{(i1,i2,...,id)∈Π(i1 i2 ... id)} t_{i1 i2 ... id} = Σ_{(i1,i2,...,id)∈Π(i1 i2 ... id)} w_{i1 i2 ... id}.

In the ANOVA kernel used in higher-order factorization machines, all t_{Π(i1 i2 ... id)} are set to zero except t_{(i1<i2<...<id)} [31], which leads to a sparse representation. In this paper, we use symmetric tensors for two reasons. The first reason is that the CPD of a symmetric tensor can be expressed by a single factor matrix. Therefore, the symmetric CPD representation of a multivariate polynomial requires a smaller number of parameters in comparison with a non-symmetric representation. The second reason is that there is a rich history of the representation of polynomials with symmetric tensors in the field of algebraic geometry under the name of the Waring problem [45].

Type II: Augmenting the independent variable vector z by a constant 1 (see footnote 1), i.e., z̃ = [1; z], leads to a different representation of non-homogeneous polynomials that uses a single dth-order symmetric tensor for the inhomogeneous multivariate polynomial of degree d, as shown in Equation (3). This process is called homogenization [46] and is graphically illustrated in Figure 2.

Footnote 1: Since the weight vector c is used in the parametrization of the tensors, different choices of the constant in z̃ lead to mathematically equivalent cost functions in the optimization problems. On the other hand, the choice of the constant may imply numerical differences; in situations of this type, one should generally choose a constant that "makes sense for the application."
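To make the homogenization concrete, the following sketch (our illustration with an assumed block layout for W; not taken from the paper) checks for d = 2 that a quadratic p(z) = T_0 + t_1^T z + z^T T_2 z can be evaluated as z̃^T W z̃ with z̃ = [1; z].

```python
import numpy as np

rng = np.random.default_rng(2)
I = 5
T0 = rng.standard_normal()            # scalar term
t1 = rng.standard_normal(I)           # first-order coefficients
T2 = rng.standard_normal((I, I))
T2 = (T2 + T2.T) / 2                  # symmetric second-order coefficients

# One possible homogenized tensor: split the linear term over the two off-diagonal blocks.
W = np.zeros((I + 1, I + 1))
W[0, 0] = T0
W[0, 1:] = t1 / 2
W[1:, 0] = t1 / 2
W[1:, 1:] = T2

z = rng.standard_normal(I)
z_tilde = np.concatenate(([1.0], z))

p_type1 = T0 + t1 @ z + z @ T2 @ z    # type I evaluation
p_type2 = z_tilde @ W @ z_tilde       # type II evaluation, W z~^2
print(np.isclose(p_type1, p_type2))   # True
```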
If we just use full tensors, the type I and II models are interchangeable. However, it is important to note that when low-rank structure is imposed on the coefficient tensors, the two representations yield different classes of low-rank multivariate polynomials. Hence, these approaches may lead to different results depending on the application. The former approach requires more parameters since it uses more factor matrices. The difference in the number of parameters should be taken into account to prevent underfitting and overfitting. A more detailed description of the storage complexity is given in Section 3.5. Moreover, the type I model allows us to constrain each term in the representation separately. In modeling multivariate polynomials, one might not wish the terms of different order to have some shared structure, in which case one should choose the type I model to work with. Similarly, the type II model should be chosen if some shared structure is desired in the terms of different order. To further elaborate on the effects of homogenization on the rank of a tensor, let us consider the following proposition.

Proposition 1. Let p(z) : R^I → R be a multivariate polynomial of order d defined as in Equation (2) by a scalar T_0 and symmetric tensors T_j ∈ R^{I×I×...×I} for j = 1, 2, . . . , d. Moreover, let W ∈ R^{(I+1)×...×(I+1)} be the corresponding tensor obtained from the homogenization process. The tensors W and T_j have the same rank R if and only if the tensors T_j admit unique CPDs with shared factor matrices and a weight vector c, i.e.,

    T_j = ⟦U, . . . , U; C_d^j (c^T)^{⊙(d−j)}⟧,    and    T_0 = Σ_{r=1}^{R} ((c^T)^{⊙d})_r.

Proof 1. Let the CPD of the tensor W be defined as ⟦V, . . . , V⟧, where, for convenience but without loss of generality, the weights of the rank-1 terms are assumed to be 1. Since W is obtained by the homogenization process, partitioning V as [v^T; Q] and using the definition of the CPD, we obtain

    T_j = ⟦Q, . . . , Q; C_d^j (v^T)^{⊙(d−j)}⟧,    and    T_0 = Σ_{r=1}^{R} ((v^T)^{⊙d})_r.    (9)

Since the CPDs of the tensors T_j are unique, the equality (9) holds if and only if the equalities Q = U and v = c also hold.

Remark 1. In the above proof, we assumed that the vector v does not contain any zero elements. Note that if the vector v does contain zero elements, it cancels the corresponding rank-1 terms. Therefore, in that case rank(W) > rank(T_j), for j = 1, . . . , d − 1. Moreover, the uniqueness of the CPDs of T_j implies that rank(W) ≥ rank(T_j). Since the equality rank(W) = rank(T_j) holds only when the tensors T_j have shared factor matrices as described above, we can conclude that in all other cases rank(W) > rank(T_j).

Proposition 1 together with Remark 1 reveals that if W admits a rank-R CPD, there exist tensors T_j that admit rank-R_j CPDs with shared factors and R_j ≤ R. Hence, the expressive power of the type II model is weaker than that of the type I model, i.e., the type II model requires higher rank values than the type I model to be able to model functions of the same complexity. In other words, the set of polynomials represented by the type II model is a strict subset of the set of polynomials represented by the type I model for the same rank values.

Although we focus in this study on the type I and type II models in the symmetric CPD format, the TeMPO framework is not limited to these. TeMPO collects low-rank tensor representations of multivariate polynomials under one roof by utilizing various other tensor decompositions such as TT, HT, and non-symmetric and partially symmetric CPD formats (see footnote 2). In this way, TeMPO breaks the curse of dimensionality and makes it possible to develop efficient second-order algorithms for the optimization of a more general class of multivariate polynomials. Moreover, the use of structured tensors and multilinear algebra makes it easy to incorporate other polynomial bases and, more generally, other nonlinear feature maps rather than the standard polynomial bases into the TeMPO framework. From this point of view, TeMPO can be interpreted as a generalization of higher-order factorization machines, which use particular types of multivariate polynomials with the standard polynomial bases and utilize first-order and BFGS-type algorithms [30–32, 47].

Footnote 2: Note that the non-symmetric and partially symmetric CPD formats are fairly straightforward variants of the symmetric CPD format, and the derivations presented in Section 3.4 can be generalized to these formats with slight modifications.

3.3. Gauss–Newton Algorithm
Most standard first-order and second-order numerical optimization algorithms can be used for solving problem (8). Since the objective function under consideration is a least-squares function, we will utilize the second-order batch Gauss–Newton (GN) algorithm using a trust-region to take advantage of its attractive properties, such as quadratic convergence near a local optimum point, resistance to swamps, the ease of incorporating constraints, and the ability to exploit multilinear structure. In case the objective function is not of least-squares type, the inexact GN algorithm can also be utilized. Below, we briefly describe the GN algorithm using a trust-region, and then derive the expressions for Jacobian and Jacobian-vector products for tensors in the symmetric CPD format. In nonlinear least-squares problems, the objective function is the squared error between a data vector y and a nonlinear model m(z) [6, 33]:

    f(z) = (1/2) ||m(z) − y||_2^2 = (1/2) r^T r,    (10)

where z ∈ R^I. The algorithm updates the initial guess iteratively by taking a step of length α_k in the direction p_k at iteration k, i.e.,

    z_k = z_{k−1} + α_k p_k,

until some stopping criteria are satisfied. Line search and trust-region are the two main approaches used to determine α_k and p_k. Here, we focus on the dogleg trust-region approach. In this approach, one sets α_k = 1. Then, given a trust-region of radius δ_k, the GN step p_k^{gn} and the steepest descent step p_k^{sd} for the current iteration, the step direction p_k is determined by the following procedure:
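For reference, a minimal sketch of the standard dogleg selection rule is given below. This is our illustration, not code from the paper; it assumes that the GN step p_gn, the steepest-descent (Cauchy) step p_sd and the radius delta have already been computed.

```python
import numpy as np

def dogleg_step(p_gn, p_sd, delta):
    # Standard dogleg choice of the trust-region step of radius delta.
    if np.linalg.norm(p_gn) <= delta:
        return p_gn                                  # full GN step fits in the region
    if np.linalg.norm(p_sd) >= delta:
        return delta * p_sd / np.linalg.norm(p_sd)   # scaled steepest-descent step
    # Otherwise walk from p_sd toward p_gn until the boundary ||p|| = delta is hit.
    d = p_gn - p_sd
    a = d @ d
    b = 2 * (p_sd @ d)
    c = p_sd @ p_sd - delta**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p_sd + tau * d
```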
Algorithm 1: GN algorithm using dogleg trust-region for the type II model.
  Input:  Z – input data matrix
          y – vector of values (labels in the classification case) for each data point in Z
          U, c – initial factor matrix and weight vector
          T_0 – initial scalar
  Output: U, c – optimized factor matrix and weight vector
          T_0 – optimized scalar
  while not converged do
      r_k ← compute the residual vector using Equations (21) and (22)
      g_k ← compute the gradient using Equation (27)
      p_k ← solve the linear system of equations (12) using the CG method
      U, c, T_0 ← update via the dogleg trust-region explained in Section 3.3
  end

In the GN framework, the gradient of the objective function can be written as J^T r, and the Hessian is approximated by J^T J, where J is the Jacobian matrix composed of partial derivatives of the residual vector r. Hence, it is sufficient to derive expressions for the Jacobian and Jacobian-vector products. We begin with the first-order derivatives of the multilinear form T z^d, where T is in the symmetric CPD format, with respect to its factors, and then proceed to the derivation of Jacobian and Jacobian-vector products for problems (7) and (8). The derivations made here can be used in other TeMPO problems with slight modifications.

3.4.1. Derivatives of the Multilinear Form in the Symmetric CPD Format
By using the matrix unfolding of the tensor in the symmetric CPD format and Property 2 of the Khatri–Rao product, the multilinear form T z^d can be written as

    T z^d = c^T (U^T z)^{∗d},    (13)
By defining W_j = [w_{j,1}, w_{j,2}, . . . , w_{j,K}], we can write the residual vector r in a compact form as

    r = y − T_0 · 1_K − Σ_{j=1}^{d} (c_j^T W_j^{∗j})^T.    (14)

Using the above Equation (14), the objective function can be computed as the l2 norm of the residual vector r.

Jacobian: The Jacobian matrix for problem (7), with the tensors T_j in their symmetric CPD format, can be written in a compact form as

    J = [J_1; . . . ; J_K],    where    J_k = [1, ∂r_k/∂vec(U_1), . . . , ∂r_k/∂vec(U_d), ∂r_k/∂c_1, . . . , ∂r_k/∂c_d].    (15)

Note that we used the fact ∂r_k/∂T_0 = 1 in the above equation. By utilizing Lemma 1 and Lemma 2, the derivative of each term of the residual vector with respect to U_j and c_j can be expressed as

    ∂r_k/∂vec(U_j) = −j [(c_j ∗ w_{j,k}^{∗(j−1)}) ⊗ z_k]^T,    and    ∂r_k/∂c_j = (w_{j,k}^{∗j})^T.    (16)

By defining W̃_j = −j W_j^{∗(j−1)} for j = 1, . . . , d, and Z = [z_1, z_2, . . . , z_K], the Jacobian matrix J in (15) can be obtained in the following compact block form:

    J = [1_K, ((C_1 W̃_1) ⊙ Z)^T, . . . , ((C_d W̃_d) ⊙ Z)^T, V],    (17)

where V is a K × d block matrix in which each block is defined as V_{k,j} = (w_{j,k}^{∗j})^T, C_j = Diag(c_j), and d is the degree of the polynomial under consideration. Since we only need the Jacobian-vector products for the GN algorithm, the explicit construction of the Jacobian matrix is not required. The Jacobian-vector products can be obtained in a more memory-efficient way, as described below.

Jacobian-Vector Product: The product of the Jacobian J by a vector x can be obtained using block matrix operations. The product of each block term by a vector vec(X_j) = x_j can be obtained by utilizing Properties 1 and 2 as

    ((C_j W̃_j) ⊙ Z)^T x_j = ((X_j^T Z) ∗ (C_j W̃_j))^T 1_R.    (18)

Note that the multiplication of a matrix by 1_R from the right is equivalent to summing the columns of the matrix under consideration. Therefore, neither the multiplication by 1_R nor the transposition of the matrix (X_j^T Z) ∗ (C_j W̃_j) in Equation (18) is necessary to obtain the Jacobian-vector product. Note also that, since the matrices C_j are diagonal, the product C_j W̃_j can be obtained in a memory-efficient way by multiplying the rows of W̃_j by the corresponding diagonal elements of C_j, without explicitly forming the matrices C_j. Overall, the product of the Jacobian J and the vector x can be obtained by partitioning the vector x, i.e., x = [x_0; x_1; x_2; . . . ; x_d; x_v], and by using the Equations (17) and (18) as

    Jx = x_0 · 1_K + Σ_{j=1}^{d} ((X_j^T Z) ∗ (C_j W̃_j))^T 1_R + V x_v,

where X_j = unvec(x_j).

Jacobian Transpose-Vector Product and Gradient: In a similar way, block-wise multiplication of the Jacobian transpose J^T by a vector can be obtained from the expression

    ((C_j W̃_j) ⊙ Z) x = vec(Z Diag(x) (C_j W̃_j)^T).    (19)

Note that right multiplication by a diagonal matrix can be done efficiently by only multiplying the columns of the matrix with the corresponding diagonal elements, without explicitly forming the diagonal matrix. Overall, by defining ξ_j = vec(Z Diag(x) (C_j W̃_j)^T), we can obtain the product of the Jacobian transpose J^T and a vector x in the following form:

    J^T x = [Σ_{k=1}^{K} x_k; ξ_1; ξ_2; . . . ; ξ_d; V^T x].    (20)

The gradient can be obtained as the product of the Jacobian transpose J^T and the residual vector r. Defining η_j = vec(Z Diag(r) (C_j W̃_j)^T) and utilizing the Equations (19) and (20), we can obtain the gradient as

    g = [Σ_{k=1}^{K} r_k; η_1; η_2; . . . ; η_d; V^T r].

3.4.3. Exploiting Structure in the Type II Model
Objective Function: The computation of the objective function for the type II model is similar to that of the type I model. Utilizing Property 2 and Equation (13), the residual vector for problem (8) can be obtained as r = y − µ with

    µ = [c^T w_1^{∗d}; c^T w_2^{∗d}; . . . ; c^T w_K^{∗d}],    (21)

where w_k = U^T z̃_k. By defining W = [w_1, w_2, . . . , w_K], we can write the residual vector r in a compact form as

    r = y − (c^T W^{∗d})^T.    (22)

Using the above Equation (22), the objective function can be computed as the l2 norm of the residual vector r.

Jacobian: The Jacobian matrix of the cost function in (8) can be defined in a compact form as

    J = [J_1; J_2; . . . ; J_K],    where    J_k = [∂r_k/∂vec(U), ∂r_k/∂c].    (23)
Utilizing Lemma 1 and Lemma 2 and using the equations in (16), the parts of J_k in Equation (23) can be written as

    ∂r_k/∂vec(U) = −d [(c ∗ w_k^{∗(d−1)}) ⊗ z̃_k]^T,    ∂r_k/∂c = (w_k^{∗d})^T.

By defining W̃ = −d W^{∗(d−1)}, V = [∂r_1/∂c; ∂r_2/∂c; . . . ; ∂r_K/∂c], and Z = [z̃_1, z̃_2, . . . , z̃_K], the Jacobian matrix can be obtained in the following compact form:

    J = [((C W̃) ⊙ Z)^T, V].    (24)

As mentioned earlier, explicit construction of the Jacobian matrix J is not required. We only require the Jacobian-vector and Jacobian transpose-vector products and derive efficient expressions for these products below.

Jacobian-Vector Product: The product of the Jacobian matrix J and a vector x can be obtained in a similar way as for the type I model, by partitioning the vector x, i.e., x = [x_u; x_c], and utilizing Properties 1 and 2 and Equation (24), as

    Jx = ((X_u^T Z) ∗ (C W̃))^T 1_R + V x_c,    (25)

where X_u = unvec(x_u). As mentioned earlier for Equation (18), explicit construction of the diagonal matrix C is not required. The product C W̃ can be obtained in a memory-efficient way by multiplying the rows of W̃ by the corresponding diagonal elements of C.

Jacobian Transpose-Vector Product and Gradient: In a similar way, utilizing Properties 1 and 2 and Equation (24), the product of the Jacobian transpose J^T and a vector x can be written as

    J^T x = [vec(Z Diag(x) (C W̃)^T); V^T x].    (26)

Since the gradient is the product of the Jacobian transpose J^T and the residual vector r, it directly follows from the above Equation (26) as

    g = [vec(Z Diag(r) (C W̃)^T); V^T r].    (27)

3.5. Complexity Analysis
We now analyze the storage and computational complexity of TeMPO when optimizing over symmetric rank-R CPD structured tensors T ∈ K^{I×I×...×I} of order d. The analysis is presented here for the type II model. However, since the numbers of optimization parameters of the type I and type II models (see Equations 2, 3) for an I-variate polynomial of degree d are proportional to each other, the analysis also applies to the type I model. Indeed, the computational complexity of the type I model is d times the computational complexity of the type II model. We also compare with the storage and computational complexity of TT and PEPS tensor networks.

Representing a multivariate polynomial with I independent variables and of degree d in dense format requires storing C_{I+d}^{d} elements. Using Stirling's approximation, it can be shown that the storage complexity for a multivariate polynomial represented in dense format is O(I^d) for d ≪ I. In the symmetric CPD format, we need to store only the factor matrix U ∈ R^{I×R} and the vector of weights c ∈ R^R, where R is the rank of the symmetric CPD. Therefore, the storage complexity for the type II model using the symmetric CPD format is O(IR). This shows that the symmetric CPD format breaks the curse of dimensionality, since the storage complexity in this format is linear in terms of the rank R and the dimension I.

As is clear from Equation (22), the construction of the matrix W and its dth Hadamard (elementwise) power dominates the computational complexity of the objective function. The construction of a single column of the matrix W requires the multiplication of U^T ∈ R^{R×I} and z̃_k ∈ R^I. Thus, the computational complexity of constructing the matrix W is O(IKR). The dth Hadamard power of the matrix W can be computed recursively by using the relation W^{∗(2m)} = (W^{∗m})^{∗2}. Thus, the computational complexity of the dth Hadamard power of the matrix W ∈ R^{R×K} is O(log(d)RK). Therefore, the total computational complexity for computing the objective function for a batch of size K is O((I + log(d))KR). Since log(d) ≪ I, the computational complexity for the objective function in Equation (8) is O(IKR).

The gradient of the objective function in Equation (8) can be obtained by multiplying the Jacobian transpose J^T by the residual vector r. As shown in Equation (27), this operation requires the multiplication of a matrix Z ∈ R^{I×K} by a diagonal matrix Diag(r), and the multiplication of the matrices Z Diag(r) and (C W̃)^T with sizes (I × K) and (K × R), respectively. Note that the entries of the product C W̃ were already obtained in the computation of the objective function. Further, the computational complexity for the product Z Diag(r) is O(IK). Consequently, the computational complexity for the multiplication of Z Diag(r) and (C W̃)^T is O(IKR). In addition, the computation of V^T r in Equation (27) requires O(KR) operations. However, KR ≪ IKR. Therefore, the total computational complexity for computing the gradient is O(IKR) for R ≫ 1.

In addition, TeMPO uses the GN algorithm for the optimization. However, this is not a requirement and first-order methods can be utilized within TeMPO as well. GN requires solving the linear system of equations in (12). Tensorlab's implementation of GN uses the conjugate-gradient (CG) method, which requires only the Grammian-vector product for solving (12). This operation requires multiplication of the Jacobian and its transpose by a vector. The computational complexity of multiplying the transpose of the Jacobian by a vector is O(IKR), as described above. The computationally most expensive operations in the multiplication of the Jacobian by a vector are the multiplication of the matrices X_u^T and Z with sizes (R × I) and (I × K), and the Hadamard product of two matrices of size (R × K), as shown in Equation (25). Hence, the computational complexity of computing Jx is O(IKR). Note that the entries of the product C W̃ were already obtained in the computation of the objective function. Therefore, the total computational complexity for a single CG iteration is O(2IKR).
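The structure-exploiting gradient of Equations (24)-(27) can be sketched in NumPy as follows (our illustration, not from the paper; column-major vectorization matches vec(U) stacking the columns of U):

```python
import numpy as np

rng = np.random.default_rng(5)
I, R, d, K = 6, 4, 3, 10
U = rng.standard_normal((I + 1, R))
c = rng.standard_normal(R)
Z = np.vstack([np.ones((1, K)), rng.standard_normal((I, K))])   # columns are the z~_k
y = rng.standard_normal(K)

W = U.T @ Z                        # R x K
r = y - c @ W**d                   # residual, Equation (22)
W_tilde = -d * W**(d - 1)          # W~ = -d W^{*(d-1)}
CW = c[:, None] * W_tilde          # C W~ without forming Diag(c) explicitly
V = (W**d).T                       # K x R, row k equals (w_k^{*d})^T = dr_k/dc

g_U = (Z * r) @ CW.T               # Z Diag(r) (C W~)^T, again without forming Diag(r)
g = np.concatenate([g_U.flatten(order='F'), V.T @ r])   # gradient J^T r, Equation (27)
```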
TABLE 1 | Comparison of the computational complexity of TeMPO with TT and PEPS tensor networks for a batch size of K.

                                 TeMPO         PEPS                         TT
  Storage                        O(dIR)                                     O(nI R_TT^2)
  Objective func.      1         O(dIKR)       O(K R_BT^3 R_PS^6)           O(nI R_TT^2 + R_TT^3 log(I))
  Gradient             1         O(dIKR)       O(αK R_BT^3 R_PS^6)          O(α(nIK R_TT^2 + K R_TT^3 log(I)))
  Gramian-vector       it_CG     O(2dIKR)      −                            −
Note that a large number of CG iterations in the solution of the linear equations for the GN algorithm might increase the computation time compared to first-order algorithms. In fact, the number of CG iterations scales with the number of optimization variables (IR) if the exact solution of the normal equations is desired. This may lead to a quadratic complexity of O(2(IR)^2 K). However, we observed in our experiments that a small number of CG iterations was sufficient to obtain accurate results. For example, we set the maximum number of CG iterations to 10 for the classification of the MNIST and Fashion MNIST datasets, where the number of unknowns is 784 × R with R ranging from 10 to 150.

The storage complexity of a tensor network with TT architecture is bounded by O(nI R_TT^2) for a tensor of order I with dimensions (n × n × . . . × n), where R_TT denotes the TT-rank [48]. In the image classification applications presented in [20, 21], n is equal to 2 and I is the size of a single image. Note that the storage complexity of TT increases with powers of the TT-rank R_TT. The total computational complexity of TT for computing the objective function has been reported as O(nI R_TT^2 + R_TT^3 log(I)) when the contraction order defined in [21] is used. When the sweeping algorithm described in [20] is used, the computational complexity of the objective function for TT is O(n^3 R_TT^3 I) for a single data point of size I. Similar to the storage complexity, the computational complexity of the objective function for TT increases with powers of the TT-rank of the tensor under consideration. On the other hand, automatic differentiation (AD) is one of the methods used to compute the gradient of TT. The computational complexity of automatic differentiation is linear in the complexity of the evaluation of the objective function [49]. Therefore, the computational complexity of the gradient for the TT tensor network presented in [21] is O(α(nIK R_TT^2 + K R_TT^3 log(I))) for a batch size of K, with α > 1. The total computational complexity of the TT tensor network for a batch size of K has been reported as O(m R_TT^2 (R_TT + K)) for a single iteration of the stochastic Riemannian gradient descent algorithm [19]. As is clear from the above discussion, both the storage and the computational complexity of TT increase with a power of the TT-rank regardless of the algorithm used, while for TeMPO they increase linearly with the rank in the symmetric CPD case.

The computational complexity of a single forward pass of PEPS for a batch size of K is O(K R_BT^3 R_PS^6) when the boundary matrix product state method is used. Here R_BT is the bond dimension (rank) of the boundary matrix product state of PEPS and R_PS is the bond dimension of PEPS. In addition, the backward pass for PEPS requires O(αK R_BT^3 R_PS^6) operations (with α > 1) when automatic differentiation is used [22].

The above analysis shows that TeMPO is computationally less expensive than TT and PEPS, even though it uses a second-order algorithm. All these results are summarized in Table 1. The fundamental reason for this is the linear storage complexity of the symmetric CPD format. Both TT and PEPS involve third and higher-order tensors, which makes their computational complexity increase with powers of the bond dimension. On the other hand, the CPD format is known to be numerically less stable than the TT format, which relies on orthogonal matrices.

4. NUMERICAL EXPERIMENTS
We conducted an experiment on a regression problem using synthetic data to illustrate the TeMPO framework and compared TeMPO with different implementations of SVMs in Section 4.1. Next, we applied our framework to the blind deconvolution of constant modulus (CM) signals and compared with the analytical CM algorithm (ACMA) [50], the optimal step-size CM algorithm (OSCMA) [51], and the LS-CPD framework [52] in Section 4.2. In Section 4.3, we further illustrate TeMPO with the image classification problem. We performed experiments on the MNIST and Fashion MNIST datasets and compared the accuracy and number of optimization parameters with MLPs, and TT and PEPS tensor networks. We performed the experiments on a computer with an Intel Core i7-8850H CPU at 2.60 GHz with 6 cores and 32 GB of RAM, using MATLAB R2021b and Tensorlab 3.0 [11].

In our blind deconvolution experiments, we used the complex GN algorithm with the conjugate gradient Steihaug method. We used the second-order batch Gauss–Newton algorithm for the regression and classification, following the same intuition as in [53]. In each epoch of the algorithm, we randomly shuffle the data points in the training set and process all data points by dividing them into batches. In the regression and binary classification case, we optimize a single cost function. In the multi-label classification case, for each batch, we randomly select a cost function f_l defined for each label to optimize. Thus our algorithm does not guarantee that each f_l will be trained by all training images in each epoch in the multi-label classification case. To guarantee this, the algorithm can be modified such that for each batch all cost functions f_l are optimized, at the cost of increasing the CPU time by a factor of the number of classes L. However, in that case the algorithm might need fewer epochs to converge. The overall algorithm is summarized in Algorithm 2.
Algorithm 2: Batched GN algorithm using dogleg trust-region for regression and classification for the type II model.
  Input:  Z – input data matrix
          y – vector of values (labels in the classification case) for each data point in Z
          U_1, . . . , U_L – initial factor matrices for each label (single in the regression case)
          c_1, . . . , c_L – initial weight vectors for each label (single in the regression case)
          T_{0,l} – initial scalar for each label (single in the regression case)
          epoch – number of epochs
          batchsize – batch size
  Output: U_1, . . . , U_L – optimized factor matrices for each label (single in the regression case)
          c_1, . . . , c_L – optimized weight vectors for each label (single in the regression case)
  for each epoch do
      shuffle input data
      for each batch do
          l ← 1
          if multi-label classification then
              l ← randomly select a label l to optimize f_l, 0 < l ≤ L
          end
          U_l, c_l, T_{0,l} ← optimize f_l using Algorithm 1
      end
  end

Algorithm 2 is given for the type II model for ease of explanation. Slight modifications are sufficient to obtain an algorithm for the type I model.

We define the relative error as the relative difference in l2 norm, ||f − f̂||_2 / ||f||_2, with f̂ an estimate for a vector f, and the signal-to-noise ratio (SNR) as 20 log10(||f||_2 / ||η||_2), where η = f̂ − f.

4.1. Regression
In this experiment, we considered a low-rank smooth function f(x) : R^N → R, namely

    f(x) = Σ_{r=1}^{R_f} α_r e^{(a_r^T x)},    (28)

where x ∈ [−1, 1]^N, R_f is the rank of the function f(x), and the coefficients α_r are scalars randomly chosen from the standard normal distribution. We generated 5,000 test samples and 1,000 training samples for N = 50 and R_f = 8. Each vector a_r ∈ R^N was a unit-norm vector drawn from the standard normal distribution. Each of the samples of x was drawn from the uniform distribution. We initialized each factor matrix with a matrix whose elements were randomly drawn from the standard normal distribution, and scaled it to unit norm. We initialized each weight vector in the same way as the factor matrices. We approximated f(x) by the type I and type II models of degree 5 whose coefficient tensors were represented in the rank-R symmetric CPD format. We set the batch size to 500 and the maximum number of GN iterations to 5 for each batch. In Figure 3, we show the median relative test and training errors for R = {2, 4, 8, 16} as a function of the number of epochs for 100 trials. Each epoch corresponds to optimization over the full training set. It is clear from Figure 3 that TeMPO produces more accurate results and generalizes better when using higher rank values, for both the type I and type II model. Good performance is also observed for R = 16 > R_f = 8, meaning that TeMPO is robust to over-estimation of the number of parameters. For low rank values, i.e., R < R_f, the type I model produces better results than the type II model because it involves more parameters that can be tuned, cf. the discussion of Proposition 1.

In the second stage of the experiment, we trained the type I and type II models for a multivariate polynomial of degree 5 with noisy measurements. We added Gaussian noise to the function values for a given SNR, i.e.,

    f(x) = Σ_{r=1}^{R} α_r e^{(a_r^T x)} + η,    (29)

where η denotes the noise. We ran our algorithm with the same settings as in the noiseless case for an SNR ranging from 10 dB to 50 dB. In Figure 4, we show the median errors for 100 trials as a function of the SNR. We have similar observations as in the noiseless case. Apart from these observations, although the accuracy of our algorithm decreases for SNR ≤ 20 dB, it still maintains good accuracy for SNR > 20 dB, as shown in Figure 4. Moreover, as can be observed from Figure 4 (left), the type I model overfits for R = {8, 16} and SNR ≤ 20 dB, in agreement with the result of Proposition 1.

In our next experiment, we trained the type I and type II models with larger-size samples, i.e., N = 250 and R = {8, 16, 32, 64}, to assess how the CPU time depends on the rank. In Figure 5, we show the median CPU time per epoch as a function of the rank. It is evident from the figure that the computational complexity of the type I model is d times the computational complexity of the type II model (cf. Section 3.5). Moreover, Figure 5 confirms that the computational complexity of our algorithm is linear in the rank (cf. Section 3.5).

In our next experiment, we examined the generalization abilities of the Gauss–Newton and ADAM [54] algorithms in our framework. We trained the type I model for a multivariate polynomial of degree 5 with both of these algorithms for different numbers of training samples to fit the rank-8 function given in Equation (29). We set R = 8, N = 50, and SNR = 20 dB. For the ADAM algorithm, we set the step size, the exponential decay rate for the first momentum (β_1), and the exponential decay rate for the second momentum (β_2) to 0.01, 0.9, and 0.99, respectively. In Figure 6, we show the median training and test accuracies of these algorithms for numbers of training samples ranging from 500 to 5,000, as a function of the number of epochs for 100 trials.
FIGURE 3 | (Left) The median test (dashed lines) and training (solid lines) errors of the type I model for 100 trials on the synthetic data for a rank-8 function given as in Equation (28). The number of samples for the training dataset is set to 5,000 and for the test dataset it is set to 1,000. The batch size is set to 500 and the maximum number of GN iterations is set to 5. (Right) The median test (dashed lines) and training (solid lines) errors of the type II model with the same algorithm settings. TeMPO produces more accurate results and generalizes better for higher rank values for both the type I and type II model. The performance is robust to overparameterization (R > R_f). The type I model produces better results for low rank values, i.e., R < R_f.
FIGURE 4 | (Left) The median test (dashed lines) and training (solid lines) errors of the type I model for 100 trials on the synthetic noisy data for a rank-8 function given as in Equation (29). The number of samples for the training dataset is set to 5,000 and for the test dataset it is set to 1,000. The batch size is set to 500 and the maximum number of GN iterations is set to 5. (Right) The median test (dashed lines) and training (solid lines) errors of the type II model with the same algorithm settings. TeMPO produces more accurate results and generalizes better for higher rank values for both the type I and type II model in the presence of noise as well. Again, the type I model produces better results for low rank values, i.e., R < R_f, because it involves more parameters than the type II model.
It is evident from Figure 6 that the presented Gauss–Newton algorithm produces more accurate results than the ADAM algorithm, and it also requires fewer epochs to converge in these experimental settings.

We also compared TeMPO with SVMs using a polynomial kernel. We ran the same experiment for a number of training samples ranging from 500 to 5,000. We set the rank to 8, i.e., R = R_f, for TeMPO. We used the built-in Matlab routine fitrsvm and the LS-SVMlab toolbox [55, 56]. We set the degree of the polynomial kernel to 5, i.e., equal to the degree of the type I and type II models, for fitrsvm. LS-SVMlab automatically tunes the degree to 3 to find the best fit. In Figure 7 (left), we show the median test and training errors for SVM, the type I and type II model. It is clear from Figure 7 (left) that the type I and type II models generalize better than fitrsvm. A possible reason is the dense parameterization of SVMs, while TeMPO uses a low-rank parameterization. Moreover, as shown in Figure 7 (right), our algorithm is faster than SVMs for numbers of training samples above 1,000. This is due to the higher memory requirement of SVMs. Typically, kernel-based methods such as LS-SVM have a storage and computational complexity of O(N^2) [55], with N the number of training samples. In contrast, Figure 7 (right) confirms that the computational complexity of TeMPO is linear in the number of training samples (cf. Section 3.5).
FIGURE 5 | The median CPU time (seconds) per epoch for the type I and type II models as a function of the rank for a rank-8 function given as in Equation (28), for 100 trials. The number of samples for the training dataset is set to 5,000 and for the test dataset it is set to 1,000. The batch size is set to 500 and the maximum number of GN iterations is set to 5. The figure confirms that the computational complexity of the type I model is d times the computational complexity of the type II model (cf. Section 3.5). Moreover, the computational complexity of the algorithm is linear in the rank (cf. Section 3.5). The figure is in a logarithmic scale on the horizontal axis.
FIGURE 6 | Comparison of the median test (dashed lines) and training (solid lines) errors of the Gauss–Newton and ADAM algorithms as a function of the number of epochs for 100 trials. The type I model for a rank-8 function given as in Equation (29), in the presence of SNR 20 dB Gaussian noise, is used to generate the training and test sets. The batch size is set to 10% of the training set size. For the Gauss–Newton algorithm, the maximum numbers of GN iterations and CG iterations are set to 1 and 5, respectively. For the ADAM algorithm, the step size, β1 and β2 are set to 0.01, 0.9, and 0.99, respectively. The number of training samples is set to 500 (top-left), 1,000 (top-right), 2,000 (bottom-left), and 5,000 (bottom-right). The presented Gauss–Newton algorithm produces more accurate results than the ADAM algorithm and also requires fewer epochs to converge in these experimental settings.
of Jacobian-vector products for the problem (35) have been presented in [15].
A number of algorithms have been developed to solve (33) and (34). The analytical CM algorithm (ACMA) [50] writes (34) as a generalized matrix eigenvalue problem in the absence of noise, under the assumption that the null space of M is one-dimensional, which makes ACMA more restrictive than TeMPO. In the presence of noise, ACMA writes (34) as the simultaneous diagonalization of a number of matrices and solves it by extended QZ iteration. Gradient descent and stochastic gradient descent algorithms have also been proposed for the minimization of the expected value E{(|y_n^T w| - c)^2}. The optimal step-size CMA (OS-CMA) algorithm [51] uses gradient descent with a step size that is computed algebraically. The problem in (35) can also be interpreted as a linear system with a rank-1 constrained solution, which fits the LS-CPD framework in [52]. LS-CPD solves (33) by relaxing the complex conjugate w to a possibly different vector v ∈ C^L and utilizing the second-order GN algorithm with a dogleg trust-region method. We solve (35) by utilizing the complex GN algorithm with the conjugate gradient Steihaug method implemented in Tensorlab 3.0 [11]. We compare with these algorithms in terms of computation time and accuracy.
We consider an autoregressive model of degree L = 10 with coefficients uniformly distributed on [0, 1], sample length K = 600, and c = 1. We add scaled Gaussian noise to the measurements to obtain a particular SNR. We run 50 experiments starting from the algebraic solution presented in [52] for LS-CPD, OS-CMA, and TeMPO. In Figure 8 (left), we show the median relative error on w as a function of SNR. It is clear from Figure 8 (left) that TeMPO achieves similar accuracy as LS-CPD and OS-CMA, which are more accurate than ACMA. In Figure 8 (right), we show the median CPU time in seconds as a function of SNR. Clearly, TeMPO is faster than ACMA, OS-CMA, and LS-CPD for SNR ≥ 10 dB by exploiting the structure of the data.
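To make the constant modulus cost discussed above concrete, the following Matlab sketch evaluates the cost (|y_n^T w| - c)^2 averaged over the samples and performs a plain fixed-step stochastic-gradient update; the data generation, step size, and iteration count are illustrative assumptions, and this is not the OS-CMA, ACMA, LS-CPD, or TeMPO implementation.

```matlab
% Hedged sketch: the constant modulus cost from the text, with an illustrative
% fixed-step stochastic-gradient update (not OS-CMA/ACMA/LS-CPD/TeMPO).
L = 10; K = 600; c = 1;                     % settings from the experiment
Y = randn(K, L) + 1i*randn(K, L);           % placeholder received samples (assumption)
w = randn(L, 1) + 1i*randn(L, 1);           % initial equalizer (assumption)

cmcost = @(w) 0.5*mean((abs(Y*w) - c).^2);  % CM cost over all samples

mu = 1e-3;                                  % fixed step size (assumption)
for it = 1:200
    k = randi(K);                           % pick one sample (stochastic step)
    s = Y(k, :)*w;                          % filter output y_k^T w
    % Wirtinger-type gradient of (|s| - c)^2 with respect to conj(w)
    g = (abs(s) - c) * (s/abs(s)) * Y(k, :)';
    w = w - mu*g;
end
fprintf('CM cost after updates: %.3e\n', cmcost(w));
```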
FIGURE 7 | (Left) The median test (dashed lines) and training (solid lines) errors of SVMs with polynomial kernel, the type I and type II model for a rank-8 function
given as in Equation (29) in the presence of SNR 20 dB Gaussian noise as a function of the number of training samples for 100 trials. The batch size is equal to 10% of
the training set size. The maximum number of GN iterations is set to 5 for the type I and type II model. Specifically, for the SVMs, the built-in Matlab routine fitrsvm
and LS-SVMlab toolbox were used to obtain the results. The relative errors of LS-SVMlab for the sample sizes 500, 1, 000, and 2, 000 are 1.6e-6, 2.2e-6, and
3.3e-6, respectively. The presented algorithm generalizes better than fitrsvm in these experimental settings. (Right) The median CPU times (seconds) with the
same setting. The computational complexity of our algorithm is linear in the problem size as expected, and it is faster than SVMs for numbers of training samples
above 1, 000. The figures are in a logarithmic scale on both the horizontal and vertical axes.
FIGURE 8 | (Left) The median relative errors (dB) of LS-CPD, OS-CMA, ACMA, and TeMPO with respect to SNR (dB) for an autoregressive model of degree L = 10
with coefficients uniformly distributed between zero and one and sample length K = 600, for 50 trials. TeMPO obtains accuracy similar to LS-CPD and OS-CMA, while
being more accurate than ACMA. (Right) The median CPU times (seconds) with the same settings. TeMPO is faster than the other algorithms for SNR > 10 dB.
4.3. Image Classification
Multi-class image classification amounts to the determination of a possibly nonlinear function f that maps input images Zk to integer scalar labels yk, which are known for a training set. In this study, we represent f by a multivariate polynomial p. Following the one-versus-all strategy, we define a cost function fl for each label l that maps the input image Zk to a scalar value as

f_l(p_l, z_1, \ldots, z_K) = \frac{1}{2} \sum_{k=1}^{K} \big( y_k - p_l(z_k) \big)^2,

where zk = vec(Zk) and where yk = 1 if zk is labeled as l and yk = 0 otherwise. The polynomial pl can be chosen within the type I or the type II model class. For the type I model, the optimization problem can be written as

\min_{p_l} f_l(p_l, z_1, \ldots, z_K), \quad \text{subject to} \quad p_l(z_k) = T_{l,0} + \sum_{j=1}^{d} \mathcal{T}_{l,j} z_k^{j} \quad \text{and} \quad \mathcal{T}_{l,j} = [\![ U_{l,j}, \ldots, U_{l,j}; c_{l,j} ]\!]. \qquad (37)
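Since each T_{l,j} is a symmetric rank-R CPD, the contraction T_{l,j} z^j reduces to c_{l,j}^T (U_{l,j}^T z).^j, which is what keeps the evaluation cheap. The following Matlab sketch evaluates such a type I polynomial and the corresponding one-versus-all residual for a single image; the sizes and factor matrices are illustrative placeholders rather than trained TeMPO parameters.

```matlab
% Hedged sketch: evaluating the type I model p_l(z) = T_{l,0} + sum_j T_{l,j} z^j
% when each T_{l,j} is a symmetric rank-R CPD [[U_j, ..., U_j; c_j]], so that
% T_{l,j} z^j = c_j' * (U_j' * z).^j.  Sizes and factors are placeholders.
n = 784; d = 5; R = 10;                    % vectorized 28x28 image, degree, rank
T0 = randn(1);                             % constant term (assumption)
U  = arrayfun(@(j) randn(n, R), 1:d, 'UniformOutput', false);  % factor matrices
cw = arrayfun(@(j) randn(R, 1), 1:d, 'UniformOutput', false);  % weight vectors

z  = randn(n, 1);                          % vectorized input image (assumption)
pl = T0;
for j = 1:d
    pl = pl + cw{j}' * (U{j}' * z).^j;     % T_{l,j} contracted with z in all modes
end

yk = 1;                                    % label indicator for class l
fl = 0.5*(yk - pl)^2;                      % per-sample contribution to the cost
fprintf('p_l(z) = %.3f, loss = %.3f\n', pl, fl);
```

In practice the same evaluation is vectorized over a mini-batch, one cost fl is trained per class, and, following the one-versus-all strategy, a test image is typically assigned to the label whose polynomial gives the largest response.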
FIGURE 9 | Test (solid lines) and training (dashed lines) accuracies of the type I model for the MNIST dataset with respect to the number of epochs. The full training
set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 100 and the maximum number of GN iterations is set to 1. TeMPO achieves high
accuracy even for low rank values, i.e., R = {10, 20}. Both the test and training accuracy increase mildly as the rank increases.
FIGURE 10 | Test (solid lines) and training (dashed lines) accuracies of the type I model for the Fashion MNIST dataset with respect to the number of epochs. The full
training set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 100 and the maximum number of GN iterations is set to 1. Similar to the
MNIST dataset, TeMPO achieves good accuracy even for low rank values and both the test and training accuracy mildly increase as the rank increases.
We repeated the same experiments for the Fashion MNIST dataset, which is harder to classify. We show the training history in Figure 10. The observations made for the MNIST dataset also apply to the Fashion MNIST dataset. However, the test and training accuracies are lower for the Fashion MNIST dataset, in agreement with previous works. Also, our algorithm requires more epochs to converge for the Fashion MNIST dataset.
In our next experiment, we set the maximum number of GN iterations to 5. We observed that our algorithm needs fewer epochs to converge and produces more accurate results with this setting. The comparisons for the MNIST and Fashion MNIST datasets are shown in Figures 11, 12, respectively. The improvement in the test accuracy for the Fashion MNIST dataset is around 1% and more pronounced than the improvement in the test accuracy for the MNIST dataset. TeMPO achieves around 98.30% test accuracy for the MNIST dataset and around 90% test accuracy for the Fashion MNIST dataset with R = 150.

Results of the Type II Model
We repeated the same experiments for the type II model. We used the same settings as for the type I model. However, we set the batch size to 200 to obtain an accuracy similar to that of the type I model. We show the training history in Figure 13. Similar to previous experiments, our algorithm performs well even for low rank values, and produces more accurate results for higher rank values. TeMPO achieves around 98% test accuracy and 100% training accuracy after 200 epochs with R = 150 for the MNIST dataset.
In Figure 14, we show the training history for the Fashion MNIST dataset. As for the type I model, the test and training accuracies are lower than for the MNIST dataset. The algorithm converges in around 100 epochs and achieves around 89.30% test accuracy with R = 150. Moreover, our algorithm achieves around 99% training accuracy after 400 epochs.
We repeated the same experiments with the maximum number of GN iterations set to 5. The comparisons for the MNIST and Fashion MNIST datasets are shown in Figure 15. Contrary to our observation for the type I model, the test accuracy now decreases for both datasets. A possible reason is that when the residuals are big, doing more GN iterations may not lead to a better direction for minimizing (37). A similar observation was made in [53] for training DNNs: it was shown experimentally that a higher number of CG iterations might not produce more accurate results if the Hessian obtained from a mini-batch is not reliable due to non-representative batches and/or big residuals. On the other hand, if the residuals are small, a higher number of CG iterations can produce more accurate results thanks to the curvature information [53].
FIGURE 11 | Comparison of test accuracies of the type I model on the MNIST dataset for different maximum numbers of GN iterations as a function of the number of
epochs. The full training set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 100 and the maximum number of GN iterations is set to 1
(dashed lines) and to 5 (solid lines).
FIGURE 12 | Comparison of test accuracies of the type I model on the Fashion MNIST dataset for different maximum numbers of GN iterations as a function of the
number of epochs. The full training set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 100 and the maximum number of GN
iterations is set to 1 (dashed lines) and to 5 (solid lines).
FIGURE 13 | Test (solid lines) and training (dashed lines) accuracies of the type II model for the MNIST dataset with respect to the number of epochs. The full training
set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 200 and the maximum number of GN iterations is set to 1. Both the test and
training accuracy increase as the rank increases. The improvement in the accuracy gets smaller as the rank increases. The algorithm achieves around 100% training
accuracy after 200 epochs.
FIGURE 14 | Test (solid lines) and training (dashed lines) accuracies of the type II model for the Fashion MNIST dataset with respect to the number of epochs. The full
training set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 200 and the maximum number of GN iterations is set to 1. Both the test and
training accuracy increase as the rank increases. Also the improvement in the accuracy gets smaller as the rank increases. The algorithm achieves around 99%
training accuracy after 400 epochs.
FIGURE 15 | Comparison of test accuracies of the type II model on the MNIST (top) and Fashion MNIST (bottom) datasets for different maximum numbers of GN
iterations as a function of the number of epochs. The full training set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 200 and the
maximum number of GN iterations is set to 1 (dashed lines) and to 5 (solid lines).
Comparisons
We now compare TeMPO with different models, namely: TT tensor networks [21], TT-structured tree tensor networks (TTN) [64], a multi-layer perceptron (MLP) with 784−1000−10 neurons, an MLP with a convolution layer (CNN-MLP), PEPS, and PEPS with a convolution layer (CNN-PEPS) [22]. We compare in terms of the test accuracy for the Fashion MNIST dataset. We summarize the test accuracy of the different models in Table 2. TeMPO achieves better accuracy than TT, PEPS, and MLP, while optimizing for fewer parameters and using less memory (cf. Table 1). The accuracy of TeMPO is lower than that of CNN-MLP and CNN-PEPS, as expected, since it does not use a convolution layer. Note that the accuracy of TeMPO can be further improved by tuning parameters such as the rank, the number of CG iterations, the trust-region radius, the batch size, and the degree of the multivariate polynomial.

5. CONCLUSION AND FUTURE WORK
We presented the TeMPO framework for use in nonlinear optimization problems arising in signal processing, machine learning, and artificial intelligence. We modeled the nonlinearities in these problems by multivariate polynomials represented by low-rank tensors. In particular, we investigated the symmetric CPD format in this study. By taking advantage of the low-rank symmetric CPD structure, we developed an efficient second-order batch Gauss–Newton algorithm. We demonstrated the efficiency of TeMPO with some illustrative examples.
TABLE 2 | The test accuracy of different models for the Fashion MNIST dataset.

Model              Test accuracy (%)
TT                 88.0
MLP                88.3
PEPS               88.3
TTN                89.0
TeMPO (Type II)    89.3
TeMPO (Type I)     89.9
CNN–MLP            91.0
CNN–PEPS           91.2

The bold values indicate the results from the proposed methods.

DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. This data can be found at: http://yann.lecun.com/exdb/mnist/; https://github.com/zalandoresearch/fashion-mnist.

AUTHOR CONTRIBUTIONS

MA developed the theory and Matlab implementation. He is the main contributor to the numerical experiments and also wrote the first draft of the manuscript. LD conceived the idea and supervised the project. Both authors contributed to manuscript revision, read, and approved the submitted version.
revision, read, and approved the submitted version.
17. Grasedyck L. Hierarchical singular value decomposition of tensors. SIAM J Matrix Anal Appl. (2010) 31:2029–54. doi: 10.1137/090764189
18. Oseledets IV, Tyrtyshnikov EE. Breaking the curse of dimensionality, or how to use SVD in many dimensions. SIAM J Sci Comput. (2009) 31:3744–59. doi: 10.1137/090748330
19. Novikov A, Trofimov M, Oseledets IV. Exponential machines. In: 5th International Conference on Learning Representations, ICLR 2017. Toulon (2017). Available online at: https://openreview.net/forum?id=rkm1sE4tg
20. Stoudenmire EM, Schwab DJ. Supervised learning with tensor networks. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, editors. Advances in Neural Information Processing Systems. Vol. 29. Barcelona: Curran Associates, Inc. (2016). Available online at: https://proceedings.neurips.cc/paper/2016/file/5314b9674c86e3f9d1ba25ef9bb32895-Paper.pdf
21. Efthymiou S, Hidary J, Leichenauer S. TensorNetwork for machine learning. arXiv:1906.06329. (2019). doi: 10.48550/arXiv.1906.06329
22. Cheng S, Wang L, Zhang P. Supervised learning with projected entangled pair states. Phys Rev B. (2021) 103:125117. doi: 10.1103/PhysRevB.103.125117
23. Guo W, Kotsia I, Patras I. Tensor learning for regression. IEEE Trans Image Process. (2012) 21:816–27. doi: 10.1109/TIP.2011.2165291
24. Hendrikx S, Boussé M, Vervliet N, De Lathauwer L. Algebraic and optimization based algorithms for multivariate regression using symmetric tensor decomposition. In: Proceedings of the 2019 IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). Guadeloupe (2019). p. 475–9. doi: 10.1109/CAMSAP45676.2019.9022662
25. Rabusseau G, Kadri H. Low-rank regression with tensor responses. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, editors. Advances in Neural Information Processing Systems. Vol. 29. Barcelona: Curran Associates, Inc. (2016). Available online at: https://proceedings.neurips.cc/paper/2016/file/3806734b256c27e41ec2c6bffa26d9e7-Paper.pdf
26. Yu R, Liu Y. Learning from multiway data: simple and efficient tensor regression. In: Balcan MF, Weinberger KQ, editors. Proceedings of the 33rd International Conference on Machine Learning, Vol. 48 of Proceedings of Machine Learning Research. New York, NY (2016). p. 373–81. Available online at: https://proceedings.mlr.press/v48/yu16.html
27. Hou M, Chaib-Draa B. Hierarchical Tucker tensor regression: application to brain imaging data analysis. In: Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP 2015). Québec, QC (2015). p. 1344–8. doi: 10.1109/ICIP.2015.7351019
28. Kar P, Karnick H. Random feature maps for dot product kernels. In: Lawrence ND, Girolami M, editors. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, Vol. 22 of Proceedings of Machine Learning Research. La Palma (2012). p. 583–91. Available online at: https://proceedings.mlr.press/v22/kar12.html
29. Yang J, Gittens A. Tensor machines for learning target-specific polynomial features. arXiv:1504.01697. (2015). doi: 10.48550/arXiv.1504.01697
30. Rendle S. Factorization machines. In: 2010 IEEE International Conference on Data Mining. Sydney (2010). p. 995–1000. doi: 10.1109/ICDM.2010.127
31. Blondel M, Fujino A, Ueda N, Ishihata M. Higher-order factorization machines. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16. Red Hook, NY: Curran Associates Inc. (2016). p. 3359–67.
32. Blondel M, Ishihata M, Fujino A, Ueda N. Polynomial networks and factorization machines: new insights and efficient training algorithms. In: Proceedings of the 33rd International Conference on Machine Learning. Vol. 48. New York, NY (2016). p. 850–8.
33. Nocedal J, Wright S. Numerical Optimization. New York, NY: Springer (2006).
34. Kruskal JB. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algeb Appl. (1977) 18:95–138. doi: 10.1016/0024-3795(77)90069-6
35. Sidiropoulos ND, Bro R. On the uniqueness of multilinear decomposition of N-way arrays. J Chemometr. (2000) 14:229–39. doi: 10.1002/1099-128X(200005/06)14:3<229::AID-CEM587>3.0.CO;2-N
36. Domanov I, De Lathauwer L. On the uniqueness of the canonical polyadic decomposition of third-order tensors – Part II: uniqueness of the overall decomposition. SIAM J Matrix Anal Appl. (2013) 34:876–903. doi: 10.1137/120877258
37. Domanov I, De Lathauwer L. Canonical polyadic decomposition of third-order tensors: relaxed uniqueness conditions and algebraic algorithm. Linear Algeb Appl. (2017) 513:342–75. doi: 10.1016/j.laa.2016.10.019
38. Boyd JP, Ong JR. Exponentially-convergent strategies for defeating the Runge phenomenon for the approximation of non-periodic functions, part I: single-interval schemes. Commun Comput Phys. (2009) 5:484–97.
39. Trefethen LN. Approximation Theory and Approximation Practice, Extended Edition. Philadelphia, PA: SIAM (2019). doi: 10.1137/1.9781611975949
40. De Lathauwer L, De Moor B, Vandewalle J. On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM J Matrix Anal Appl. (2000) 21:1324–42. doi: 10.1137/S0895479898346995
41. Zhang T, Golub G. Rank-one approximation to high order tensors. SIAM J Matrix Anal Appl. (2001) 23:534–50. doi: 10.1137/S0895479899352045
42. Guan Y, Chu MT, Chu D. SVD-based algorithms for the best rank-1 approximation of a symmetric tensor. SIAM J Matrix Anal Appl. (2018) 39:1095–115. doi: 10.1137/17M1136699
43. Nie J, Wang L. Semidefinite relaxations for best rank-1 tensor approximations. SIAM J Matrix Anal Appl. (2013) 35:1155–79. doi: 10.1137/130935112
44. Brachat J, Comon P, Mourrain B, Tsigaridas E. Symmetric tensor decomposition. Linear Algeb Appl. (2010) 433:1851–72. doi: 10.1016/j.laa.2010.06.046
45. Alexander J, Hirschowitz A. Polynomial interpolation in several variables. Adv Comput Math. (1995) 4:201–22.
46. Debals O. Tensorization and Applications in Blind Source Separation. Leuven: KU Leuven (2017).
47. Blondel M, Niculae V, Otsuka T, Ueda N. Multi-output polynomial networks and factorization machines. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. Long Beach, CA (2017). p. 3349–59.
48. Khoromskij BN. Tensor Numerical Methods in Scientific Computing. Berlin; Boston: De Gruyter (2018). doi: 10.1515/9783110365917
49. Margossian CC. A review of automatic differentiation and its efficient implementation. WIREs Data Mining Knowl Discov. (2019) 9:e1305. doi: 10.1002/widm.1305
50. van der Veen AJ, Paulraj A. An analytical constant modulus algorithm. IEEE Trans Signal Process. (1996) 44:1136–55. doi: 10.1109/78.502327
51. Zarzoso V, Comon P. Optimal step-size constant modulus algorithm. IEEE Trans Commun. (2008) 56:10–3. doi: 10.1109/TCOMM.2008.050484
52. Boussé M, Vervliet N, Domanov I, Debals O, De Lathauwer L. Linear systems with a canonical polyadic decomposition constrained solution: algorithms and applications. Numer Linear Algeb Appl. (2018) 25:e2190. doi: 10.1002/nla.2190
53. Gargiani M, Zanelli A, Diehl M, Hutter F. On the promise of the stochastic generalized Gauss–Newton method for training DNNs. arXiv:2006.02409. (2020). doi: 10.48550/arXiv.2006.02409
54. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y, editors. 3rd International Conference on Learning Representations, ICLR 2015. San Diego, CA (2015). Available online at: http://arxiv.org/abs/1412.6980
55. De Brabanter K, Karsmakers P, Ojeda F, Alzate C, De Brabanter J, Pelckmans K, et al. LS-SVMlab Toolbox User's Guide Version 1.8. Leuven: ESAT-STADIUS (2010). p. 10–46.
56. Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J. Least Squares Support Vector Machines. Singapore: World Scientific (2002). doi: 10.1142/5089
57. Ljung L. System Identification: Theory for the User. 2nd ed. Upper Saddle River, NJ: Prentice Hall (1999). doi: 10.1002/047134608X.W1046
58. Johnson R, Schniter P, Endres TJ, Behm JD, Brown DR, Casas RA. Blind equalization using the constant modulus criterion: a review. Proc IEEE. (1998) 86:1927–50. doi: 10.1109/5.720246
59. van der Veen AJ. Algebraic methods for deterministic blind beamforming. Proc IEEE. (1998) 86:1987–2008. doi: 10.1109/5.720249
60. De Lathauwer L. Algebraic techniques for the blind deconvolution of constant modulus signals. In: Proceedings of the 12th European Signal Processing Conference (EUSIPCO 2004). Vienna (2004). p. 225–8.
61. Householder AS. Unitary triangularization of a nonsymmetric matrix. J ACM. (1958) 5:339–42. doi: 10.1145/320941.320947
62. Deng L. The MNIST database of handwritten digit images for machine learning research. IEEE Sign Process Mag. (2012) 29:141–2. doi: 10.1109/MSP.2012.2211477
63. Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747. (2017). doi: 10.48550/arXiv.1708.07747
64. Stoudenmire EM. Learning relevant features of data with multi-scale tensor networks. Quant Sci Technol. (2018) 3:034003. doi: 10.1088/2058-9565/aaba1a

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2022 Ayvaz and De Lathauwer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.