CPD-Structured Multivariate Polynomial Optimization

Muzaffer Ayvaz 1,2* and Lieven De Lathauwer 1,2

1 Department of Electrical Engineering (ESAT), KU Leuven, Leuven, Belgium, 2 Group Science, Engineering and Technology, KU Leuven Kulak, Kortrijk, Belgium
Tensor decompositions such as the canonical polyadic decomposition (CPD) and tensor trains (TT) are promising tools for breaking the curse of dimensionality. Tensors are multi-indexed arrays. They preserve the higher-order structure that is inherent in data, are able to model nonlinear interactions, and can be decomposed uniquely under mild conditions [1–3]. Efficient numerical optimization algorithms have been developed for tensor decompositions. In the context of CPD, the Gauss–Newton algorithm has been effectively implemented in both line search and trust-region frameworks by exploiting the CPD structure [4–6]. A low-complexity damped Gauss–Newton algorithm has also been proposed [7]. Moreover, a randomized block sampling approach has been proposed which achieves linear time complexity for the CPD of large tensors by utilizing the Gauss–Newton algorithm [8]. Many data science problems such as latent factor analysis have been solved by reformulating them as tensor decomposition problems [9–12]. An inexact Gauss–Newton algorithm has been proposed for scaling the CPD of large tensors with non-least-squares cost functions [13]. Moreover, a generalized Gauss–Newton algorithm with an efficient parallel implementation has been proposed for tensor completion with generalized loss functions [14]. Our aim in this work is to extend these efficient numerical approaches to a broader class of problems that includes not only tensor decompositions but also the optimization of multilinear/polynomial cost functions. Examples include, but are not limited to, matrix and tensor eigenvalue problems, nonlinear dimensionality reduction, nonlinear blind source separation, multivariate polynomial regression, and classification problems.

In this study, we develop a framework called Tensor-Based Multivariate Polynomial Optimization (TeMPO) to deal with nonlinear optimization problems commonly encountered in signal processing, machine learning and artificial intelligence. A preliminary version, where only a rank-1 CPD is considered with an application in blind identification, appeared as the conference paper [15]. In the TeMPO framework, these nonlinear functions are approximated or modeled by multivariate polynomials. Then, low-rank tensors are used to represent the polynomial under consideration. This approach reduces the number of parameters that define the system, and hence enables us to develop efficient numerical optimization algorithms. To further elaborate on the proposed methodology, let us consider the optimization problem

    min_p l(p(z), θ),    (1)

where l : R × R^M → R_+ denotes a loss function such as the mean squared error, p : R^I → R denotes an unknown multivariate polynomial, z ∈ R^I denotes input data, and θ ∈ R^M denotes output data. We compactly represent the polynomial p(z) through low-rank tensors. One possible way to do this is to write the polynomial as a sum of homogeneous polynomials as follows:

    p(z) := Σ_{j=0}^{d} T_j z^j,    (2)

where T_j denotes a low-rank tensor of order j, and T_j z^j denotes the mode-n product (see Section 2.1) of the tensor T_j and the vector z for all modes. By convention, T_0 is assumed to be a scalar and z^0 is assumed to be the scalar 1. From now on, we call (2) the type I model. We can represent a multivariate polynomial with a single tensor by utilizing a process called homogenization, augmenting the independent variable z by a constant 1:

    p(z̃) := W z̃^d,    (3)

where W is a tensor of order d, and z̃ = [1; z]. Hereafter, we call (3) the type II model.

An n-variate polynomial of degree d has O(n^d) coefficients. This exponential dependence on d is the so-called curse of dimensionality. In the TeMPO framework, we break the curse of dimensionality by assuming low-rank structure in the coefficient tensors. For example, when a rank-R symmetric CPD structure is used, the number of parameters needed to represent an n-variate polynomial of degree d is ndR, which is linear in the number of variables. Several low-rank structures for tensors have been introduced in the literature [1, 2, 16], e.g., the canonical polyadic decomposition (CPD), the Tucker decomposition, the hierarchical Tucker decomposition (HT) [17], and the tensor train decomposition (TT) [18]. All of these structures can be incorporated into the TeMPO framework; however, in this paper we restrict ourselves to symmetric CPDs. Note that different types of low-rank structure allow us to represent different sub-classes of polynomials. Of course, different representations differ in storage space and computational complexity. A more detailed exposition will be given in Section 3.2. Note also that the type I model allows us to constrain each term separately, while the type II model does not. Therefore, the type I model is a more general representation of multivariate polynomials, which may provide better results depending on the application.

Besides breaking the curse of dimensionality, exploiting low-rank representations of tensors enables us to derive efficient expressions for objective function and gradient evaluations. These then lead us to develop scalable algorithms. We apply our framework to image classification by adapting the second-order Gauss–Newton algorithm and exploiting the symmetric CPD structure in two different tensor representations of multivariate polynomials. We show that the TeMPO framework with symmetric CPD structure achieves similar or better accuracy than various methods such as MLPs and tensor networks with different structures for the classical MNIST and Fashion MNIST datasets, while using fewer parameters and therefore less memory.

Related Work
Several tensor-based methods have been reported in the literature for regression and classification, two problems that are in the class of problems (1). In most of these approaches, a linear model

    y = ⟨W, X⟩ + b,    (4)

is used, where W denotes a weight tensor and X represents nonlinear features of the input data. This model corresponds to the type II model when a symmetric CPD structure is imposed
on the weight tensor W and X is composed of polynomial features of the input data. Clearly, imposing different structures on the weight tensor W and using different nonlinear features in the tensor X leads to a different representation of the nonlinear interaction between input data and output data. For example, exponential machines utilize the tensor train format in the weight tensor with a norm regularization term in the optimization [19]. In this approach, the Riemannian gradient descent algorithm is used for solving the optimization problem. In a similar approach, tensor trains are used with the feature map φ(x_j) = [cos(πx_j/2), sin(πx_j/2)], by using the density matrix renormalization group (DMRG) algorithm and the first-order ADAM algorithm for the optimization of different cost functions [20, 21]. The same feature map is also used for the linear model (4) by imposing projected entangled pair states (PEPS) structure on the weight tensor W [22]. The CPD format in model (4) has also been studied in the realm of tensor regression with Frobenius norm and group sparsity norm regularization terms while using a coordinate-descent approach [23]. A similar model is also considered by utilizing the symmetric CPD format and the second-order Gauss–Newton algorithm with algebraic initialization for multivariate polynomial regression [24]. Several approaches have been proposed that utilize CPD or Tucker formats in tensor regression and use different regularization strategies to prevent overfitting [25, 26]. Also, the hierarchical Tucker (HT) format has been used in the tensor regression context for the generalized linear model (GLM) y = α^T x + ⟨W, X⟩. This approach was successfully applied to brain imaging data sets and uses a block relaxation algorithm, which solves a sequence of lower-dimensional optimization problems [27].

Similarly, several models related to the type I model have been considered in various settings. For example, Kar and Karnick use random polynomial features and parameterize the coefficients of the polynomial under consideration [28]. The parameterization used in this approach has been shown to be equivalent to imposing the CPD format on the weight tensor W [29]. Another approach is factorization machines, which use a multivariate polynomial kernel in the realm of support vector machines (SVM) [30]. For second-order factorization machines, a first-order stochastic gradient descent algorithm has been proposed. This approach has a linear time complexity. Higher-order factorization machines use the ANOVA kernel to achieve a linear time complexity and have been successfully applied to link prediction models using stochastic gradient descent [31]. The ANOVA kernel does not use symmetric tensors in the representation and instead only considers combinations of distinct features [31]. Also, factorization machines in the symmetric CPD format have been considered using first-order and BFGS-type algorithms [32]. Tensor machines generalize both the Kar–Karnick random features approach and factorization machines. It has been shown that these approaches correspond to specific types of tensor machines in the CPD format. Further, it has been shown that empirical risk minimization is an efficient method for finding locally optimal tensor machines if the optimization algorithm avoids saddle points [29].

As can be seen from the literature summary above, one of the differences between our approach and the above methods is the model used. The type I model (2) has not been examined with the symmetric CPD structure in the weight tensors, to the best of our knowledge. Another difference of our approach from the above methods is the algorithm used. While first-order algorithms are used in most of these approaches, we utilize the second-order batch Gauss–Newton (GN) algorithm. Although first-order methods have the advantage of lower per-iteration complexity, second-order GN algorithms generally require fewer iterations to converge and fewer hyperparameters to be optimized. Moreover, the GN algorithm using a trust-region is more robust in the sense that it converges to a (local) minimum for any starting point under mild conditions, and it is less prone to swamps (many iterations with little to no improvement) [5, 6, 33].

We summarize our contributions as follows:

• We develop the TeMPO framework, which is able to solve many nonlinear problems with ubiquitous applications in signal processing, machine learning and artificial intelligence. Moreover, we develop an efficient second-order Gauss–Newton algorithm for optimizing multivariate polynomials in the CPD format.
• We determine the conditions under which the tensorized linear model (4) with polynomial features and the multivariate polynomial model (2) coincide when the symmetric CPD format is used in their representations.
• We show that TeMPO achieves similar or better accuracy than various methods such as multilayer perceptrons (MLPs) and tensor networks with different architectures, including tensor trains (TT), tree tensor networks, and projected entangled pair states (PEPS). We also show that TeMPO requires the optimization of fewer parameters and less memory than these methods for the classification of the MNIST and Fashion MNIST datasets.
• Last but not least, our framework can be interpreted as an advancement of higher-order factorization machines; we introduce an efficient second-order Gauss–Newton algorithm for higher-order factorization machines.

The remaining part of this article is organized as follows. In Section 2, we describe notation and background information concerning tensors. In Section 3, we describe the TeMPO framework in a more detailed manner. Section 3 also covers the details of the representation of polynomials by symmetric CPD structured tensors. In Section 3, we also show how to exploit the symmetric CPD structure to obtain efficient expressions for the gradient and Jacobian-vector products, which are necessary for the Gauss–Newton algorithm. The formulation of the image classification problem in the context of TeMPO, numerical experiments and related discussions are covered in Section 4. We conclude our paper with remarks on future work in the last section.

2. PRELIMINARIES

2.1. Notation
A tensor is a higher-order generalization of a vector (first-order) and a matrix (second-order). Following established conventions, we denote scalars, vectors, matrices, and tensors by a, a, A, and A,
respectively. The transpose of a matrix A is denoted as A^T. The ith column vector of a matrix A is denoted as a_i, i.e., A = (a_1, a_2, . . .). The entry with row index i and column index j in a matrix A, i.e., (A)_ij, is denoted by a_ij. Similarly, (A)_{i1 i2 ... iN} is denoted by a_{i1 i2 ... iN}. Diag(a) denotes the diagonal matrix whose entries are composed from the vector a. On the other hand, diag(A) denotes a vector composed from the diagonal elements of A. The vectorization operator vec(A) for A ∈ K^{I×J} stacks all the columns of A into a column vector a ∈ K^{IJ}. The reverse operation unvec(a) reshapes a vector a into a matrix A ∈ K^{I×J}. The identity matrix of size (K × K) is denoted by I_K. A vector of length K with all entries equal to 1 is denoted by 1_K. The l2 norm of a vector a is denoted by ||a||_2. The row-wise and column-wise concatenation of two vectors a and b is denoted by [a, b] and [a; b], respectively. The outer product, Kronecker product, Khatri–Rao product, and Hadamard product are denoted by ⊗, ⊗, ⊙, and ∗, respectively. The nth power of a vector x with respect to the Kronecker product is defined as x^{⊗n} = x ⊗ x^{⊗(n−1)}, with x^{⊗0} = 1. Similarly, x^{⊙n} and x^{∗n} denote the nth power of the vector x with respect to the Khatri–Rao product and the Hadamard product, respectively. The mode-n product of a tensor A ∈ K^{I1×I2×...×IN} (with K meaning either R or C) and a vector x ∈ K^{In}, denoted by A ·_n x^T, is defined element-wise as

    (A ·_n x^T)_{i1 i2 ··· i_{n−1} i_{n+1} ··· iN} = Σ_{in=1}^{In} a_{i1 i2 ··· in ··· iN} x_{in}.

The mode-n product of a tensor A ∈ K^{I×I×...×I} of order k and a vector x ∈ K^I for all modes is defined as

    A x^k := A ·_1 x^T ·_2 x^T . . . ·_k x^T.

A mode-n vector or mode-n fiber of a tensor A ∈ K^{I1×I2×...×IN} is a vector obtained by fixing every index except the nth. The mode-n matricization of A is a matrix A_{[n; N,N−1,...,n+1,n−1,...,1]} collecting all the mode-n vectors as its columns. For example, an entry a_{i1 i2 i3} of a tensor A ∈ K^{I×J×K} is mapped to the (i2, q) entry of the matrix A_{[2;3,1]} with q = i1 + (i3 − 1)I. The binomial coefficient is denoted by C_n^k = n!/((n−k)! k!). Some useful definitions are listed below.

Definition 1 (Symmetric Tensor). A tensor A ∈ K^{I×I×...×I} of order k is called symmetric if its entries are invariant under the permutation of its indices.

As a consequence of this definition, the matrix representations of symmetric tensors in different modes are all equal.

Definition 4 (Khatri–Rao Product). Given two matrices A ∈ K^{I×K} and B ∈ K^{J×K}, their Khatri–Rao product, also known as columnwise Kronecker product, is

    A ⊙ B = [a_1 ⊗ b_1, a_2 ⊗ b_2, . . . , a_K ⊗ b_K] ∈ K^{IJ×K},

where a_i and b_i denote the ith column of the matrices A and B, respectively.

Definition 5 (Hadamard Product). Given two matrices A ∈ K^{I×J} and B ∈ K^{I×J} with the same size, their Hadamard product is the elementwise product, i.e.,

    A ∗ B = [a_{1,1} b_{1,1}, ···, a_{1,J} b_{1,J}; ... ; a_{I,1} b_{I,1}, ···, a_{I,J} b_{I,J}] ∈ K^{I×J}.

The following properties will be useful for our derivations.

Property 1. Let A ∈ K^{I×J}, X ∈ K^{J×K}, B ∈ K^{K×L}. Then

    vec(AXB) = (B^T ⊗ A) vec(X) ∈ K^{IL}.

Moreover, if X ∈ K^{J×J} is a diagonal matrix and B ∈ K^{J×L}, then

    vec(AXB) = (B^T ⊙ A) diag(X) ∈ K^{IL}.

Property 2. Let A ∈ K^{I×J}, B ∈ K^{K×J}, C ∈ K^{I×L}, and D ∈ K^{K×L}. Then

    (A ⊙ B)^T (C ⊙ D) = (A^T C) ∗ (B^T D) ∈ K^{J×L}.

Property 3. For matrices A ∈ R^{I×J} and B ∈ R^{J×K}, and for the function f(A, B) = AB, the following equations hold:

    ∂vec(f(A, B))/∂vec(A) = B^T ⊗ I_I,    ∂vec(f(A, B))/∂vec(B) = I_K ⊗ A.

2.2. Canonical Polyadic Decomposition
Here, we will briefly describe the canonical polyadic decomposition. A more detailed description of the CPD can be found in [1] and references therein. The CPD writes a tensor T ∈ K^{I1×I2×...×IN} as a sum of R rank-1 tensors and is denoted by ⟦U^(1), . . . , U^(N)⟧, with factor matrices U^(n) ∈ K^{In×R}, where R denotes the number of rank-1 terms.
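As a quick numerical sanity check of Property 2, which is used repeatedly in the derivations of Section 3.4, the following NumPy sketch verifies the identity (A ⊙ B)^T (C ⊙ D) = (A^T C) ∗ (B^T D) on random matrices. This example is our illustration and is not part of the original paper; the khatri_rao helper is defined locally.

```python
import numpy as np

def khatri_rao(A, B):
    # Columnwise Kronecker product: column r of the result equals kron(A[:, r], B[:, r]).
    I, R = A.shape
    J, R2 = B.shape
    assert R == R2
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # A in K^{I x J} with I = 4, J = 3
B = rng.standard_normal((5, 3))   # B in K^{K x J} with K = 5
C = rng.standard_normal((4, 2))   # C in K^{I x L} with L = 2
D = rng.standard_normal((5, 2))   # D in K^{K x L}

lhs = khatri_rao(A, B).T @ khatri_rao(C, D)   # (A ⊙ B)^T (C ⊙ D), a J x L matrix
rhs = (A.T @ C) * (B.T @ D)                   # (A^T C) * (B^T D), Hadamard product
print(np.allclose(lhs, rhs))                  # True
```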
FIGURE 1 | Polyadic decomposition of a third-order symmetric tensor T. It is called canonical (CPD) if R is equal to the rank of T, i.e., R is minimal. It allows a compact representation of polynomials.

For a symmetric tensor, the CPD can be written with a single factor matrix as T = ⟦U, . . . , U; c⟧, where U ∈ K^{I×R}, and c ∈ K^R is a vector of weights which allows us to give minus signs to the factors for even-degree symmetric tensors, see Figure 1. The matrix unfolding of a symmetric CPD is given by

    T = U Diag(c) (U ⊙ U ⊙ · · · ⊙ U)^T.

TeMPO deals with the optimization of multilinear/polynomial cost functions with or without additional constraints, which is a more general setting than tensor decomposition or the retrieval of a tensor factorization. To better describe the scope, let us consider the following class of objective functions:

    l(θ, p(z)),    (5)
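The following NumPy sketch (our illustration, not part of the original paper) builds a rank-R symmetric CPD tensor T = ⟦U, U, U; c⟧ of order 3 and checks the unfolding formula T = U Diag(c) (U ⊙ U)^T numerically.

```python
import numpy as np

def khatri_rao(A, B):
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(1)
I, R = 4, 3
U = rng.standard_normal((I, R))
c = rng.standard_normal(R)

# Full symmetric tensor: T = sum_r c_r * (u_r outer u_r outer u_r).
T = np.einsum('r,ir,jr,kr->ijk', c, U, U, U)

T1 = T.reshape(I, I * I)                  # mode-1 unfolding, rows indexed by i1
M = U @ np.diag(c) @ khatri_rao(U, U).T   # U Diag(c) (U ⊙ U)^T
print(np.allclose(T1, M))                 # True
```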
    min_{T_0,...,T_d} p(T_0, . . . , T_d, Z) = min_{T_0,...,T_d} (1/2) Σ_{k=1}^{K} (y_k − T_0 − Σ_{j=1}^{d} T_j z_k^j)^2,
        subject to rank(T_j) = R_j.    (7)

Analogously, for the type II model we obtain

    min_T (1/2) Σ_{k=1}^{K} (y_k − T z̃_k^d)^2,    subject to rank(T) = R,    (8)

where T denotes the low-rank structured coefficient tensor of order d to be optimized, Z̃ ∈ R^{(I+1)×K} denotes the augmented input data matrix, and z̃_k denotes the kth column of Z̃, i.e., z̃_k = [1; z_k].

3.2. Tensor Representation of Polynomials
In this subsection, we examine the type I and type II models in detail. A (symmetric) tensor T of order d and dimension n can be associated with a homogeneous n-variate polynomial p(z) of degree d [44], as shown in Equation (3).

Type I: Since any polynomial can be written as a sum of homogeneous polynomials of increasing degrees, any polynomial of degree d can be written by using tensors of order up to d, as shown in Equation (2). Note that in the tensor representation of polynomials, any tensor can be assumed to be symmetric without loss of generality. Indeed, any homogeneous polynomial p(z) of degree d ∈ N can be represented by a multilinear form T z^d, where T ∈ K^{I×I×...×I} is a symmetric tensor of order d and z ∈ K^I.

To see this, suppose a homogeneous polynomial p(z) is represented as

    p(z) = T̃ z^d = Σ_{i1,i2,...,id=1}^{I} t̃_{i1 i2 ... id} z_{i1} z_{i2} . . . z_{id},

where T̃ ∈ K^{I×I×...×I} is a tensor of order d. Since the terms z_{i1} z_{i2} . . . z_{id} are invariant under the permutation of indices, we may write

    p(z) = Σ_{i1,i2,...,id=1}^{I} t_{i1 i2 ... id} z_{i1} z_{i2} . . . z_{id},    where    t_{i1 i2 ... id} = (1/d!) Σ_{(i1,i2,...,id)∈Π(i1 i2 ... id)} t̃_{i1 i2 ... id};

here Π(i1 i2 . . . id) denotes the collection of all permutations of the indices (i1, i2, . . . , id). Since the entries of T are invariant under the permutation of indices, we can conclude that T is symmetric. The above discussion reveals that there are infinitely many representations of a given polynomial. Indeed, two representations with tensors T and W are equal as long as the summation of the corresponding entries over the permutations of indices remains the same, i.e.,

    Σ_{(i1,i2,...,id)∈Π(i1 i2 ... id)} t_{i1 i2 ... id} = Σ_{(i1,i2,...,id)∈Π(i1 i2 ... id)} w_{i1 i2 ... id}.

In the ANOVA kernel used in higher-order factorization machines, all t_{Π(i1 i2 ... id)} are set to zero except t_{(i1<i2<...<id)} [31], which leads to a sparse representation. In this paper, we use symmetric tensors for two reasons. The first reason is that the CPD of a symmetric tensor can be expressed by a single factor matrix. Therefore, the symmetric CPD representation of a multivariate polynomial requires a smaller number of parameters in comparison with a non-symmetric representation. The second reason is that there is a rich history of the representation of polynomials with symmetric tensors in the field of algebraic geometry under the name of the Waring problem [45].

Type II: Augmenting the independent variable vector z by a constant 1 (see footnote 1), i.e., z̃ = [1; z], leads to a different representation of non-homogeneous polynomials that uses a single dth-order symmetric tensor for the inhomogeneous multivariate polynomial of degree d, as shown in Equation (3). This process is called homogenization [46] and is graphically illustrated in Figure 2.

Footnote 1: Since the weight vector c is used in the parametrization of the tensors, different choices of the constant in z̃ lead to mathematically equivalent cost functions in the optimization problems. On the other hand, the choice of the constant may imply numerical differences; in situations of this type, one should generally choose a constant that "makes sense for the application."
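To make the homogenization concrete, the following sketch (our illustration with an assumed block layout for W; not taken from the paper) checks for d = 2 that a quadratic p(z) = T_0 + t_1^T z + z^T T_2 z can be evaluated as z̃^T W z̃ with z̃ = [1; z].

```python
import numpy as np

rng = np.random.default_rng(2)
I = 5
T0 = rng.standard_normal()            # scalar term
t1 = rng.standard_normal(I)           # first-order coefficients
T2 = rng.standard_normal((I, I))
T2 = (T2 + T2.T) / 2                  # symmetric second-order coefficients

# One possible homogenized tensor: split the linear term over the two off-diagonal blocks.
W = np.zeros((I + 1, I + 1))
W[0, 0] = T0
W[0, 1:] = t1 / 2
W[1:, 0] = t1 / 2
W[1:, 1:] = T2

z = rng.standard_normal(I)
z_tilde = np.concatenate(([1.0], z))

p_type1 = T0 + t1 @ z + z @ T2 @ z    # type I evaluation
p_type2 = z_tilde @ W @ z_tilde       # type II evaluation, W z~^2
print(np.isclose(p_type1, p_type2))   # True
```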
If we just use full tensors, the type I and II models are interchangeable. However, it is important to note that when low-rank structure is imposed on the coefficient tensors, the two representations yield different classes of low-rank multivariate polynomials. Hence, these approaches may lead to different results depending on the application. The former approach requires more parameters since it uses more factor matrices. The difference in the number of parameters should be taken into account to prevent underfitting and overfitting. A more detailed description of the storage complexity is given in Section 3.5. Moreover, the type I model allows us to constrain each term in the representation separately. In modeling multivariate polynomials, one might not wish the terms of different order to have some shared structure, in which case one should choose the type I model to work with. Similarly, the type II model should be chosen if some shared structure is desired in the terms of different order. To further elaborate on the effects of homogenization on the rank of a tensor, let us consider the following proposition.

Proposition 1. Let p(z) : R^I → R be a multivariate polynomial of order d defined as in Equation (2) by a scalar T_0 and symmetric tensors T_j ∈ R^{I×I×...×I} for j = 1, 2, . . . , d. Moreover, let W ∈ R^{(I+1)×...×(I+1)} be the corresponding tensor obtained from the homogenization process. The tensors W and T_j have the same rank R if and only if the tensors T_j admit unique CPDs with shared factor matrices and a weight vector c, i.e.,

    T_j = ⟦U, . . . , U; C_d^j (c^T)^{⊙(d−j)}⟧,    and    T_0 = Σ_{r=1}^{R} ((c^T)^{⊙d})_r.

Proof 1. Let the CPD of the tensor W be defined as ⟦V, . . . , V⟧, where, for convenience but without loss of generality, the weights of the rank-1 terms are assumed to be 1. Since W is obtained by the homogenization process, partitioning V as [v^T; Q] and using the definition of the CPD, we obtain

    T_j = ⟦Q, . . . , Q; C_d^j (v^T)^{⊙(d−j)}⟧,    and    T_0 = Σ_{r=1}^{R} ((v^T)^{⊙d})_r.    (9)

Since the CPDs of the tensors T_j are unique, the equality (9) holds if and only if the equalities Q = U and v = c also hold.

Remark 1. In the above proof, we assumed that the vector v does not contain any zero elements. Note that if the vector v does contain zero elements, it cancels the corresponding rank-1 terms. Therefore, in that case rank(W) > rank(T_j), for j = 1, . . . , d − 1. Moreover, the uniqueness of the CPDs of T_j implies that rank(W) ≥ rank(T_j). Since the equality rank(W) = rank(T_j) holds only when the tensors T_j have shared factor matrices as described above, we can conclude that in all other cases rank(W) > rank(T_j).

Proposition 1 together with Remark 1 reveals that if W admits a rank-R CPD, there exist tensors T_j that admit rank-R_j CPDs with shared factors and R_j ≤ R. Hence, the expressive power of the type II model is weaker than that of the type I model, i.e., the type II model requires higher rank values than the type I model to be able to model functions of the same complexity. In other words, the set of polynomials represented by the type II model is a strict subset of the set of polynomials represented by the type I model for the same rank values.

Although we focus in this study on the type I and type II models in the symmetric CPD format, the TeMPO framework is not limited to these. TeMPO collects low-rank tensor representations of multivariate polynomials under one roof by utilizing various other tensor decompositions such as TT, HT, and non-symmetric and partially symmetric CPD formats (see footnote 2). In this way, TeMPO breaks the curse of dimensionality and makes it possible to develop efficient second-order algorithms for the optimization of a more general class of multivariate polynomials. Moreover, the use of structured tensors and multilinear algebra makes it easy to incorporate other polynomial bases and, more generally, other nonlinear feature maps rather than the standard polynomial bases into the TeMPO framework. From this point of view, TeMPO can be interpreted as a generalization of higher-order factorization machines, which use particular types of multivariate polynomials with the standard polynomial bases and utilize first-order and BFGS-type algorithms [30–32, 47].

Footnote 2: Note that the non-symmetric and partially symmetric CPD formats are fairly straightforward variants of the symmetric CPD format, and the derivations presented in Section 3.4 can be generalized to these formats with slight modifications.

3.3. Gauss–Newton Algorithm
Most standard first-order and second-order numerical optimization algorithms can be used for solving problem (8). Since the objective function under consideration is a least-squares function, we will utilize the second-order batch Gauss–Newton (GN) algorithm using a trust-region to take advantage of its attractive properties, such as quadratic convergence near a local optimum point, resistance to swamps, the ease of incorporating constraints, and the ability to exploit multilinear structure. In case the objective function is not of least-squares type, the inexact GN algorithm can also be utilized. Below, we briefly describe the GN algorithm using a trust-region, and then derive the expressions for Jacobian and Jacobian-vector products for tensors in the symmetric CPD format. In nonlinear least-squares problems, the objective function is the squared error between a data vector y and a nonlinear model m(z) [6, 33]:

    f(z) = (1/2) ||m(z) − y||_2^2 = (1/2) r^T r,    (10)

where z ∈ R^I. The algorithm updates the initial guess iteratively by taking a step of length α_k in the direction p_k at iteration k, i.e.,

    z_k = z_{k−1} + α_k p_k,

until some stopping criteria are satisfied. Line search and trust-region are the two main approaches used to determine α_k and p_k. Here, we focus on the dogleg trust-region approach. In this approach, one sets α_k = 1. Then, given a trust-region of radius δ_k, the GN step p_k^{gn} and the steepest descent step p_k^{sd} for the current iteration, the step direction p_k is determined by the following procedure:
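For reference, a minimal sketch of the standard dogleg selection rule is given below. This is our illustration, not code from the paper; it assumes that the GN step p_gn, the steepest-descent (Cauchy) step p_sd and the radius delta have already been computed.

```python
import numpy as np

def dogleg_step(p_gn, p_sd, delta):
    # Standard dogleg choice of the trust-region step of radius delta.
    if np.linalg.norm(p_gn) <= delta:
        return p_gn                                  # full GN step fits in the region
    if np.linalg.norm(p_sd) >= delta:
        return delta * p_sd / np.linalg.norm(p_sd)   # scaled steepest-descent step
    # Otherwise walk from p_sd toward p_gn until the boundary ||p|| = delta is hit.
    d = p_gn - p_sd
    a = d @ d
    b = 2 * (p_sd @ d)
    c = p_sd @ p_sd - delta**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p_sd + tau * d
```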
Algorithm 1: GN algorithm using dogleg trust-region for the type II model.
  Input:  Z – input data matrix
          y – vector of values (labels in the classification case) for each data point in Z
          U, c – initial factor matrix and weight vector
          T_0 – initial scalar
  Output: U, c – optimized factor matrix and weight vector
          T_0 – optimized scalar
  while not converged do
      r_k ← compute the residual vector using Equations (21) and (22)
      g_k ← compute the gradient using Equation (27)
      p_k ← solve the linear system of equations (12) using the CG method
      U, c, T_0 ← update via the dogleg trust-region explained in Section 3.3
  end

In the GN framework, the gradient of the objective function can be written as J^T r, and the Hessian is approximated by J^T J, where J is the Jacobian matrix composed of partial derivatives of the residual vector r. Hence, it is sufficient to derive expressions for the Jacobian and Jacobian-vector products. We begin with the first-order derivatives of the multilinear form T z^d, where T is in the symmetric CPD format, with respect to its factors, and then proceed to the derivation of Jacobian and Jacobian-vector products for problems (7) and (8). The derivations made here can be used in other TeMPO problems with slight modifications.

3.4.1. Derivatives of the Multilinear Form in the Symmetric CPD Format
By using the matrix unfolding of the tensor in the symmetric CPD format and Property 2 of the Khatri–Rao product, the multilinear form T z^d can be written as

    T z^d = c^T (U^T z)^{∗d},    (13)
By defining W_j = [w_{j,1}, w_{j,2}, . . . , w_{j,K}], we can write the residual vector r in a compact form as

    r = y − T_0 · 1_K − Σ_{j=1}^{d} (c_j^T W_j^{∗j})^T.    (14)

Using the above Equation (14), the objective function can be computed as the l2 norm of the residual vector r.

Jacobian: The Jacobian matrix for problem (7), with the tensors T_j in their symmetric CPD format, can be written in a compact form as

    J = [J_1; . . . ; J_K],    where    J_k = [1, ∂r_k/∂vec(U_1), . . . , ∂r_k/∂vec(U_d), ∂r_k/∂c_1, . . . , ∂r_k/∂c_d].    (15)

Note that we used the fact ∂r_k/∂T_0 = 1 in the above equation. By utilizing Lemma 1 and Lemma 2, the derivative of each term of the residual vector with respect to U_j and c_j can be expressed as

    ∂r_k/∂vec(U_j) = −j [(c_j ∗ w_{j,k}^{∗(j−1)}) ⊗ z_k]^T,    and    ∂r_k/∂c_j = (w_{j,k}^{∗j})^T.    (16)

By defining W̃_j = −j W_j^{∗(j−1)} for j = 1, . . . , d, and Z = [z_1, z_2, . . . , z_K], the Jacobian matrix J in (15) can be obtained in the following compact block form:

    J = [1_K, ((C_1 W̃_1) ⊙ Z)^T, . . . , ((C_d W̃_d) ⊙ Z)^T, V],    (17)

where V is a K × d block matrix in which each block is defined as V_{k,j} = (w_{j,k}^{∗j})^T, C_j = Diag(c_j), and d is the degree of the polynomial under consideration. Since we only need the Jacobian-vector products for the GN algorithm, the explicit construction of the Jacobian matrix is not required. The Jacobian-vector products can be obtained in a more memory-efficient way, as described below.

Jacobian-Vector Product: The product of the Jacobian J by a vector x can be obtained using block matrix operations. The product of each block term by a vector vec(X_j) = x_j can be obtained by utilizing Properties 1 and 2 as

    ((C_j W̃_j) ⊙ Z)^T x_j = ((X_j^T Z) ∗ (C_j W̃_j))^T 1_R.    (18)

Note that the multiplication of a matrix by 1_R from the right is equivalent to summing the columns of the matrix under consideration. Therefore, neither the multiplication by 1_R nor the transposition of the matrix (X_j^T Z) ∗ (C_j W̃_j) in Equation (18) is necessary to obtain the Jacobian-vector product. Note also that, since the matrices C_j are diagonal, the product C_j W̃_j can be obtained in a memory-efficient way by multiplying the rows of W̃_j by the corresponding diagonal elements of C_j, without explicitly forming the matrices C_j. Overall, the product of the Jacobian J and the vector x can be obtained by partitioning the vector x, i.e., x = [x_0; x_1; x_2; . . . ; x_d; x_v], and by using the Equations (17) and (18) as

    Jx = x_0 · 1_K + Σ_{j=1}^{d} ((X_j^T Z) ∗ (C_j W̃_j))^T 1_R + V x_v,

where X_j = unvec(x_j).

Jacobian Transpose-Vector Product and Gradient: In a similar way, block-wise multiplication of the Jacobian transpose J^T by a vector can be obtained from the expression

    ((C_j W̃_j) ⊙ Z) x = vec(Z Diag(x) (C_j W̃_j)^T).    (19)

Note that right multiplication by a diagonal matrix can be done efficiently by only multiplying the columns of the matrix with the corresponding diagonal elements, without explicitly forming the diagonal matrix. Overall, by defining ξ_j = vec(Z Diag(x) (C_j W̃_j)^T), we can obtain the product of the Jacobian transpose J^T and a vector x in the following form:

    J^T x = [Σ_{k=1}^{K} x_k; ξ_1; ξ_2; . . . ; ξ_d; V^T x].    (20)

The gradient can be obtained as the product of the Jacobian transpose J^T and the residual vector r. Defining η_j = vec(Z Diag(r) (C_j W̃_j)^T) and utilizing the Equations (19) and (20), we can obtain the gradient as

    g = [Σ_{k=1}^{K} r_k; η_1; η_2; . . . ; η_d; V^T r].

3.4.3. Exploiting Structure in the Type II Model
Objective Function: The computation of the objective function for the type II model is similar to that of the type I model. Utilizing Property 2 and Equation (13), the residual vector for problem (8) can be obtained as r = y − µ with

    µ = [c^T w_1^{∗d}; c^T w_2^{∗d}; . . . ; c^T w_K^{∗d}],    (21)

where w_k = U^T z̃_k. By defining W = [w_1, w_2, . . . , w_K], we can write the residual vector r in a compact form as

    r = y − (c^T W^{∗d})^T.    (22)

Using the above Equation (22), the objective function can be computed as the l2 norm of the residual vector r.

Jacobian: The Jacobian matrix of the cost function in (8) can be defined in a compact form as

    J = [J_1; J_2; . . . ; J_K],    where    J_k = [∂r_k/∂vec(U), ∂r_k/∂c].    (23)
Utilizing Lemma 1 and Lemma 2 and using the equations in (16), the parts of J_k in Equation (23) can be written as

    ∂r_k/∂vec(U) = −d [(c ∗ w_k^{∗(d−1)}) ⊗ z̃_k]^T,    ∂r_k/∂c = (w_k^{∗d})^T.

By defining W̃ = −d W^{∗(d−1)}, V = [∂r_1/∂c; ∂r_2/∂c; . . . ; ∂r_K/∂c], and Z = [z̃_1, z̃_2, . . . , z̃_K], the Jacobian matrix can be obtained in the following compact form:

    J = [((C W̃) ⊙ Z)^T, V].    (24)

As mentioned earlier, explicit construction of the Jacobian matrix J is not required. We only require the Jacobian-vector and Jacobian transpose-vector products and derive efficient expressions for these products below.

Jacobian-Vector Product: The product of the Jacobian matrix J and a vector x can be obtained in a similar way as for the type I model, by partitioning the vector x, i.e., x = [x_u; x_c], and utilizing Properties 1 and 2 and Equation (24), as

    Jx = ((X_u^T Z) ∗ (C W̃))^T 1_R + V x_c,    (25)

where X_u = unvec(x_u). As mentioned earlier for Equation (18), explicit construction of the diagonal matrix C is not required. The product C W̃ can be obtained in a memory-efficient way by multiplying the rows of W̃ by the corresponding diagonal elements of C.

Jacobian Transpose-Vector Product and Gradient: In a similar way, utilizing Properties 1 and 2 and Equation (24), the product of the Jacobian transpose J^T and a vector x can be written as

    J^T x = [vec(Z Diag(x) (C W̃)^T); V^T x].    (26)

Since the gradient is the product of the Jacobian transpose J^T and the residual vector r, it directly follows from the above Equation (26) as

    g = [vec(Z Diag(r) (C W̃)^T); V^T r].    (27)

3.5. Complexity Analysis
We now analyze the storage and computational complexity of TeMPO when optimizing over symmetric rank-R CPD structured tensors T ∈ K^{I×I×...×I} of order d. The analysis is presented here for the type II model. However, since the numbers of optimization parameters of the type I and type II models (see Equations 2, 3) for an I-variate polynomial of degree d are proportional to each other, the analysis also applies to the type I model. Indeed, the computational complexity of the type I model is d times the computational complexity of the type II model. We also compare with the storage and computational complexity of TT and PEPS tensor networks.

Representing a multivariate polynomial with I independent variables and of degree d in dense format requires storing C_{I+d}^{d} elements. Using Stirling's approximation, it can be shown that the storage complexity for a multivariate polynomial represented in dense format is O(I^d) for d ≪ I. In the symmetric CPD format, we need to store only the factor matrix U ∈ R^{I×R} and the vector of weights c ∈ R^R, where R is the rank of the symmetric CPD. Therefore, the storage complexity for the type II model using the symmetric CPD format is O(IR). This shows that the symmetric CPD format breaks the curse of dimensionality, since the storage complexity in this format is linear in terms of the rank R and the dimension I.

As is clear from Equation (22), the construction of the matrix W and its dth Hadamard (elementwise) power dominates the computational complexity of the objective function. The construction of a single column of the matrix W requires the multiplication of U^T ∈ R^{R×I} and z̃_k ∈ R^I. Thus, the computational complexity of constructing the matrix W is O(IKR). The dth Hadamard power of the matrix W can be computed recursively by using the relation W^{∗(2m)} = (W^{∗m})^{∗2}. Thus, the computational complexity of the dth Hadamard power of the matrix W ∈ R^{R×K} is O(log(d)RK). Therefore, the total computational complexity for computing the objective function for a batch of size K is O((I + log(d))KR). Since log(d) ≪ I, the computational complexity for the objective function in Equation (8) is O(IKR).

The gradient of the objective function in Equation (8) can be obtained by multiplying the Jacobian transpose J^T by the residual vector r. As shown in Equation (27), this operation requires the multiplication of a matrix Z ∈ R^{I×K} by a diagonal matrix Diag(r), and the multiplication of the matrices Z Diag(r) and (C W̃)^T with sizes (I × K) and (K × R), respectively. Note that the entries of the product C W̃ were already obtained in the computation of the objective function. Further, the computational complexity for the product Z Diag(r) is O(IK). Consequently, the computational complexity for the multiplication of Z Diag(r) and (C W̃)^T is O(IKR). In addition, the computation of V^T r in Equation (27) requires O(KR) operations. However, KR ≪ IKR. Therefore, the total computational complexity for computing the gradient is O(IKR) for R ≫ 1.

In addition, TeMPO uses the GN algorithm for the optimization. However, this is not a requirement and first-order methods can be utilized within TeMPO as well. GN requires solving the linear system of equations in (12). Tensorlab's implementation of GN uses the conjugate-gradient (CG) method, which requires only the Grammian-vector product for solving (12). This operation requires multiplication of the Jacobian and its transpose by a vector. The computational complexity of multiplying the transpose of the Jacobian by a vector is O(IKR), as described above. The computationally most expensive operations in the multiplication of the Jacobian by a vector are the multiplication of the matrices X_u^T and Z with sizes (R × I) and (I × K), and the Hadamard product of two matrices of size (R × K), as shown in Equation (25). Hence, the computational complexity of computing Jx is O(IKR). Note that the entries of the product C W̃ were already obtained in the computation of the objective function. Therefore, the total computational complexity for a single CG iteration is O(2IKR).
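The structure-exploiting gradient of Equations (24)-(27) can be sketched in NumPy as follows (our illustration, not from the paper; column-major vectorization matches vec(U) stacking the columns of U):

```python
import numpy as np

rng = np.random.default_rng(5)
I, R, d, K = 6, 4, 3, 10
U = rng.standard_normal((I + 1, R))
c = rng.standard_normal(R)
Z = np.vstack([np.ones((1, K)), rng.standard_normal((I, K))])   # columns are the z~_k
y = rng.standard_normal(K)

W = U.T @ Z                        # R x K
r = y - c @ W**d                   # residual, Equation (22)
W_tilde = -d * W**(d - 1)          # W~ = -d W^{*(d-1)}
CW = c[:, None] * W_tilde          # C W~ without forming Diag(c) explicitly
V = (W**d).T                       # K x R, row k equals (w_k^{*d})^T = dr_k/dc

g_U = (Z * r) @ CW.T               # Z Diag(r) (C W~)^T, again without forming Diag(r)
g = np.concatenate([g_U.flatten(order='F'), V.T @ r])   # gradient J^T r, Equation (27)
```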
TABLE 1 | Comparison of the computational complexity of TeMPO with TT and PEPS tensor networks for a batch size of K.

                                 TeMPO         PEPS                         TT
  Storage                        O(dIR)                                     O(nI R_TT^2)
  Objective func.      1         O(dIKR)       O(K R_BT^3 R_PS^6)           O(nI R_TT^2 + R_TT^3 log(I))
  Gradient             1         O(dIKR)       O(αK R_BT^3 R_PS^6)          O(α(nIK R_TT^2 + K R_TT^3 log(I)))
  Gramian-vector       it_CG     O(2dIKR)      −                            −
Note that a large number of CG iterations in the solution of the linear equations for the GN algorithm might increase the computation time compared to first-order algorithms. In fact, the number of CG iterations scales with the number of optimization variables (IR) if the exact solution of the normal equations is desired. This may lead to a quadratic complexity of O(2(IR)^2 K). However, we observed in our experiments that a small number of CG iterations was sufficient to obtain accurate results. For example, we set the maximum number of CG iterations to 10 for the classification of the MNIST and Fashion MNIST datasets, where the number of unknowns is 784 × R with R ranging from 10 to 150.

The storage complexity of a tensor network with TT architecture is bounded by O(nI R_TT^2) for a tensor of order I with dimensions (n × n × . . . × n), where R_TT denotes the TT-rank [48]. In the image classification applications presented in [20, 21], n is equal to 2 and I is the size of a single image. Note that the storage complexity of TT increases with powers of the TT-rank R_TT. The total computational complexity of TT for computing the objective function has been reported as O(nI R_TT^2 + R_TT^3 log(I)) when the contraction order defined in [21] is used. When the sweeping algorithm described in [20] is used, the computational complexity of the objective function for TT is O(n^3 R_TT^3 I) for a single data point of size I. Similar to the storage complexity, the computational complexity of the objective function for TT increases with powers of the TT-rank of the tensor under consideration. On the other hand, automatic differentiation (AD) is one of the methods used to compute the gradient of TT. The computational complexity of automatic differentiation is linear in the complexity of the evaluation of the objective function [49]. Therefore, the computational complexity of the gradient for the TT tensor network presented in [21] is O(α(nIK R_TT^2 + K R_TT^3 log(I))) for a batch size of K, with α > 1. The total computational complexity of the TT tensor network for a batch size of K has been reported as O(m R_TT^2 (R_TT + K)) for a single iteration of the stochastic Riemannian gradient descent algorithm [19]. As is clear from the above discussion, both the storage and the computational complexity of TT increase with a power of the TT-rank regardless of the algorithm used, while for TeMPO they increase linearly with the rank in the symmetric CPD case.

The computational complexity of a single forward pass of PEPS for a batch size of K is O(K R_BT^3 R_PS^6) when the boundary matrix product state method is used. Here R_BT is the bond dimension (rank) of the boundary matrix product state of PEPS and R_PS is the bond dimension of PEPS. In addition, the backward pass for PEPS requires O(αK R_BT^3 R_PS^6) operations (with α > 1) when automatic differentiation is used [22].

The above analysis shows that TeMPO is computationally less expensive than TT and PEPS, even though it uses a second-order algorithm. All these results are summarized in Table 1. The fundamental reason for this is the linear storage complexity of the symmetric CPD format. Both TT and PEPS involve third and higher-order tensors, which makes their computational complexity increase with powers of the bond dimension. On the other hand, the CPD format is known to be numerically less stable than the TT format, which relies on orthogonal matrices.

4. NUMERICAL EXPERIMENTS
We conducted an experiment on a regression problem using synthetic data to illustrate the TeMPO framework and compared TeMPO with different implementations of SVMs in Section 4.1. Next, we applied our framework to the blind deconvolution of constant modulus (CM) signals and compared with the analytical CM algorithm (ACMA) [50], the optimal step-size CM algorithm (OSCMA) [51], and the LS-CPD framework [52] in Section 4.2. In Section 4.3, we further illustrate TeMPO with the image classification problem. We performed experiments on the MNIST and Fashion MNIST datasets and compared the accuracy and number of optimization parameters with MLPs, and TT and PEPS tensor networks. We performed the experiments on a computer with an Intel Core i7-8850H CPU at 2.60 GHz with 6 cores and 32 GB of RAM, using MATLAB R2021b and Tensorlab 3.0 [11].

In our blind deconvolution experiments, we used the complex GN algorithm with the conjugate gradient Steihaug method. We used the second-order batch Gauss–Newton algorithm for the regression and classification, following the same intuition as in [53]. In each epoch of the algorithm, we randomly shuffle the data points in the training set and process all data points by dividing them into batches. In the regression and binary classification case, we optimize a single cost function. In the multi-label classification case, for each batch, we randomly select a cost function f_l defined for each label to optimize. Thus our algorithm does not guarantee that each f_l will be trained by all training images in each epoch in the multi-label classification case. To guarantee this, the algorithm can be modified such that for each batch all cost functions f_l are optimized, at the cost of increasing the CPU time by a factor of the number of classes L. However, in that case the algorithm might need fewer epochs to converge. The overall algorithm is summarized in Algorithm 2.
Algorithm 2: Batched GN algorithm using dogleg trust-region for regression and classification for the type II model.
  Input:  Z – input data matrix
          y – vector of values (labels in the classification case) for each data point in Z
          U_1, . . . , U_L – initial factor matrices for each label (single in the regression case)
          c_1, . . . , c_L – initial weight vectors for each label (single in the regression case)
          T_{0,l} – initial scalar for each label (single in the regression case)
          epoch – number of epochs
          batchsize – batch size
  Output: U_1, . . . , U_L – optimized factor matrices for each label (single in the regression case)
          c_1, . . . , c_L – optimized weight vectors for each label (single in the regression case)
  for each epoch do
      shuffle input data
      for each batch do
          l ← 1
          if multi-label classification then
              l ← randomly select a label l to optimize f_l, 0 < l ≤ L
          end
          U_l, c_l, T_{0,l} ← optimize f_l using Algorithm 1
      end
  end

Algorithm 2 is given for the type II model for ease of explanation. Slight modifications are sufficient to obtain an algorithm for the type I model.

We define the relative error as the relative difference in l2 norm, ||f − f̂||_2 / ||f||_2, with f̂ an estimate for a vector f, and the signal-to-noise ratio (SNR) as 20 log10(||f||_2 / ||η||_2), where η = f̂ − f.

4.1. Regression
In this experiment, we considered a low-rank smooth function f(x) : R^N → R, namely

    f(x) = Σ_{r=1}^{R_f} α_r e^{(a_r^T x)},    (28)

where x ∈ [−1, 1]^N, R_f is the rank of the function f(x), and the coefficients α_r are scalars randomly chosen from the standard normal distribution. We generated 5,000 test samples and 1,000 training samples for N = 50 and R_f = 8. Each vector a_r ∈ R^N was a unit-norm vector drawn from the standard normal distribution. Each of the samples of x was drawn from the uniform distribution. We initialized each factor matrix with a matrix whose elements were randomly drawn from the standard normal distribution, and scaled it to unit norm. We initialized each weight vector in the same way as the factor matrices. We approximated f(x) by the type I and type II models of degree 5 whose coefficient tensors were represented in the rank-R symmetric CPD format. We set the batch size to 500 and the maximum number of GN iterations to 5 for each batch. In Figure 3, we show the median relative test and training errors for R = {2, 4, 8, 16} as a function of the number of epochs for 100 trials. Each epoch corresponds to optimization over the full training set. It is clear from Figure 3 that TeMPO produces more accurate results and generalizes better when using higher rank values, for both the type I and type II model. Good performance is also observed for R = 16 > R_f = 8, meaning that TeMPO is robust to over-estimation of the number of parameters. For low rank values, i.e., R < R_f, the type I model produces better results than the type II model because it involves more parameters that can be tuned, cf. the discussion of Proposition 1.

In the second stage of the experiment, we trained the type I and type II models for a multivariate polynomial of degree 5 with noisy measurements. We added Gaussian noise to the function values for a given SNR, i.e.,

    f(x) = Σ_{r=1}^{R} α_r e^{(a_r^T x)} + η,    (29)

where η denotes the noise. We ran our algorithm with the same settings as in the noiseless case for an SNR ranging from 10 dB to 50 dB. In Figure 4, we show the median errors for 100 trials as a function of the SNR. We have similar observations as in the noiseless case. Apart from these observations, although the accuracy of our algorithm decreases for SNR ≤ 20 dB, it still maintains good accuracy for SNR > 20 dB, as shown in Figure 4. Moreover, as can be observed from Figure 4 (left), the type I model overfits for R = {8, 16} and SNR ≤ 20 dB, in agreement with the result of Proposition 1.

In our next experiment, we trained the type I and type II models with larger-size samples, i.e., N = 250 and R = {8, 16, 32, 64}, to assess how the CPU time depends on the rank. In Figure 5, we show the median CPU time per epoch as a function of the rank. It is evident from the figure that the computational complexity of the type I model is d times the computational complexity of the type II model (cf. Section 3.5). Moreover, Figure 5 confirms that the computational complexity of our algorithm is linear in the rank (cf. Section 3.5).

In our next experiment, we examined the generalization abilities of the Gauss–Newton and ADAM [54] algorithms in our framework. We trained the type I model for a multivariate polynomial of degree 5 with both of these algorithms for different numbers of training samples to fit the rank-8 function given in Equation (29). We set R = 8, N = 50, and SNR = 20 dB. For the ADAM algorithm, we set the step size, the exponential decay rate for the first momentum (β_1), and the exponential decay rate for the second momentum (β_2) to 0.01, 0.9, and 0.99, respectively. In Figure 6, we show the median training and test accuracies of these algorithms for numbers of training samples ranging from 500 to 5,000, as a function of the number of epochs for 100 trials.
FIGURE 3 | (Left) The median test (dashed lines) and training (solid lines) errors of the type I model for 100 trials on the synthetic data for a rank-8 function given as in Equation (28). The number of samples for the training dataset is set to 5,000 and for the test dataset it is set to 1,000. The batch size is set to 500 and the maximum number of GN iterations is set to 5. (Right) The median test (dashed lines) and training (solid lines) errors of the type II model with the same algorithm settings. TeMPO produces more accurate results and generalizes better for higher rank values for both the type I and type II model. The performance is robust to overparameterization (R > R_f). The type I model produces better results for low rank values, i.e., R < R_f.
FIGURE 4 | (Left) The median test (dashed lines) and training (solid lines) errors of the type I model for 100 trials on the synthetic noisy data for a rank-8 function given as in Equation (29). The number of samples for the training dataset is set to 5,000 and for the test dataset it is set to 1,000. The batch size is set to 500 and the maximum number of GN iterations is set to 5. (Right) The median test (dashed lines) and training (solid lines) errors of the type II model with the same algorithm settings. TeMPO produces more accurate results and generalizes better for higher rank values for both the type I and type II model in the presence of noise as well. Again, the type I model produces better results for low rank values, i.e., R < R_f, because it involves more parameters than the type II model.
It is evident from Figure 6 that the presented Gauss–Newton algorithm produces more accurate results than the ADAM algorithm, and it also requires fewer epochs to converge in these experimental settings.

We also compared TeMPO with SVMs using a polynomial kernel. We ran the same experiment for a number of training samples ranging from 500 to 5,000. We set the rank to 8, i.e., R = R_f, for TeMPO. We used the built-in Matlab routine fitrsvm and the LS-SVMlab toolbox [55, 56]. We set the degree of the polynomial kernel to 5, i.e., equal to the degree of the type I and type II models, for fitrsvm. LS-SVMlab automatically tunes the degree to 3 to find the best fit. In Figure 7 (left), we show the median test and training errors for SVM, the type I and type II model. It is clear from Figure 7 (left) that the type I and type II models generalize better than fitrsvm. A possible reason is the dense parameterization of SVMs, while TeMPO uses a low-rank parameterization. Moreover, as shown in Figure 7 (right), our algorithm is faster than SVMs for numbers of training samples above 1,000. This is due to the higher memory requirement of SVMs. Typically, kernel-based methods such as LS-SVM have a storage and computational complexity of O(N^2) [55], with N the number of training samples. In contrast, Figure 7 (right) confirms that the computational complexity of TeMPO is linear in the number of training samples (cf. Section 3.5).
FIGURE 5 | The median CPU time (seconds) per epoch for the type I and type II models as a function of the rank for a rank-8 function given as in Equation (28), for 100 trials. The number of samples for the training dataset is set to 5,000 and for the test dataset it is set to 1,000. The batch size is set to 500 and the maximum number of GN iterations is set to 5. The figure confirms that the computational complexity of the type I model is d times the computational complexity of the type II model (cf. Section 3.5). Moreover, the computational complexity of the algorithm is linear in the rank (cf. Section 3.5). The figure is in a logarithmic scale on the horizontal axis.
FIGURE 6 | Comparison of the median test (dashed lines) and training (solid lines) errors of the Gauss–Newton and ADAM algorithms as a function of the number of epochs for 100 trials. The type I model for a rank-8 function given as in Equation (29), in the presence of SNR 20 dB Gaussian noise, is used to generate the training and test sets. The batch size is set to 10% of the training set size. For the Gauss–Newton algorithm, the maximum numbers of GN iterations and CG iterations are set to 1 and 5, respectively. For the ADAM algorithm, the step size, β1 and β2 are set to 0.01, 0.9, and 0.99, respectively. The number of training samples is set to 500 (top-left), 1,000 (top-right), 2,000 (bottom-left), and 5,000 (bottom-right). The presented Gauss–Newton algorithm produces more accurate results than the ADAM algorithm and also requires fewer epochs to converge in these experimental settings.
of Jacobian-vector products for the problem (35) have been presented in [15].
A number of algorithms have been developed to solve (33) and (34). The analytical CM algorithm (ACMA) [50] writes (34) as a generalized matrix eigenvalue problem in the absence of noise, under the assumption that the null space of M is one-dimensional, which makes ACMA more restrictive than TeMPO. In the presence of noise, ACMA writes (34) as the simultaneous diagonalization of a number of matrices and solves it by extended QZ iteration. Gradient descent and stochastic gradient descent algorithms have also been proposed for the minimization of the expected value E{(|y_n^T w| - c)^2}. The optimal step-size CMA (OS-CMA) algorithm [51] uses gradient descent with a step size that is computed algebraically. The problem in (35) can also be interpreted as a linear system with a rank-1 constrained solution, which fits the LS-CPD framework in [52]. LS-CPD solves (33) by relaxing the complex conjugate w to a possibly different vector v ∈ C^L and utilizing the second-order GN algorithm with a dogleg trust-region method. We solve (35) by utilizing the complex GN algorithm with the conjugate gradient Steihaug method implemented in Tensorlab 3.0 [11]. We compare with these algorithms in terms of computation time and accuracy.
We consider an autoregressive model of degree L = 10 with coefficients uniformly distributed on [0, 1], sample length K = 600, and c = 1. We add scaled Gaussian noise to the measurements to obtain a particular SNR. We run 50 experiments starting from the algebraic solution presented in [52] for LS-CPD, OS-CMA, and TeMPO. In Figure 8 (left), we show the median relative error on w as a function of SNR. It is clear from Figure 8 (left) that TeMPO achieves similar accuracy as LS-CPD and OS-CMA, which are more accurate than ACMA. In Figure 8 (right), we show the median CPU time in seconds as a function of SNR. Clearly, TeMPO is faster than ACMA, OS-CMA, and LS-CPD for SNR ≥ 10 dB by exploiting the structure of the data.
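To make the constant modulus cost discussed above concrete, the following Matlab sketch evaluates the cost (|y_n^T w| - c)^2 averaged over the samples and performs a plain fixed-step stochastic-gradient update; the data generation, step size, and iteration count are illustrative assumptions, and this is not the OS-CMA, ACMA, LS-CPD, or TeMPO implementation.

```matlab
% Hedged sketch: the constant modulus cost from the text, with an illustrative
% fixed-step stochastic-gradient update (not OS-CMA/ACMA/LS-CPD/TeMPO).
L = 10; K = 600; c = 1;                     % settings from the experiment
Y = randn(K, L) + 1i*randn(K, L);           % placeholder received samples (assumption)
w = randn(L, 1) + 1i*randn(L, 1);           % initial equalizer (assumption)

cmcost = @(w) 0.5*mean((abs(Y*w) - c).^2);  % CM cost over all samples

mu = 1e-3;                                  % fixed step size (assumption)
for it = 1:200
    k = randi(K);                           % pick one sample (stochastic step)
    s = Y(k, :)*w;                          % filter output y_k^T w
    % Wirtinger-type gradient of (|s| - c)^2 with respect to conj(w)
    g = (abs(s) - c) * (s/abs(s)) * Y(k, :)';
    w = w - mu*g;
end
fprintf('CM cost after updates: %.3e\n', cmcost(w));
```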
FIGURE 7 | (Left) The median test (dashed lines) and training (solid lines) errors of SVMs with polynomial kernel, the type I and type II model for a rank-8 function
given as in Equation (29) in the presence of SNR 20 dB Gaussian noise as a function of the number of training samples for 100 trials. The batch size is equal to 10% of
the training set size. The maximum number of GN iterations is set to 5 for the type I and type II model. Specifically, for the SVMs, the built-in Matlab routine fitrsvm
and LS-SVMlab toolbox were used to obtain the results. The relative errors of LS-SVMlab for the sample sizes 500, 1, 000, and 2, 000 are 1.6e-6, 2.2e-6, and
3.3e-6, respectively. The presented algorithm generalizes better than fitrsvm in these experimental settings. (Right) The median CPU times (seconds) with the
same setting. The computational complexity of our algorithm is linear in the problem size as expected, and it is faster than SVMs for numbers of training samples
above 1, 000. The figures are in a logarithmic scale on both the horizontal and vertical axes.
FIGURE 8 | (Left) The median relative errors (dB) of LS-CPD, OS-CMA, ACMA, and TeMPO with respect to SNR (dB) for an autoregressive model of degree L = 10
with coefficients uniformly distributed between zero and one and sample length K = 600, for 50 trials. TeMPO obtains accuracy similar to LS-CPD and OS-CMA, while
being more accurate than ACMA. (Right) The median CPU times (seconds) with the same settings. TeMPO is faster than the other algorithms for SNR > 10 dB.
4.3. Image Classification
Multi-class image classification amounts to the determination of a possibly nonlinear function f that maps input images Zk to integer scalar labels yk, which are known for a training set. In this study, we represent f by a multivariate polynomial p. Following the one-versus-all strategy, we define a cost function fl for each label l that maps the input image Zk to a scalar value as

f_l(p_l, z_1, \ldots, z_K) = \frac{1}{2} \sum_{k=1}^{K} \big( y_k - p_l(z_k) \big)^2,

where zk = vec(Zk) and where yk = 1 if zk is labeled as l and yk = 0 otherwise. The polynomial pl can be chosen within the type I or the type II model class. For the type I model, the optimization problem can be written as

\min_{p_l} f_l(p_l, z_1, \ldots, z_K), \quad \text{subject to} \quad p_l(z_k) = T_{l,0} + \sum_{j=1}^{d} \mathcal{T}_{l,j} z_k^{j} \quad \text{and} \quad \mathcal{T}_{l,j} = [\![ U_{l,j}, \ldots, U_{l,j}; c_{l,j} ]\!]. \qquad (37)
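Since each T_{l,j} is a symmetric rank-R CPD, the contraction T_{l,j} z^j reduces to c_{l,j}^T (U_{l,j}^T z).^j, which is what keeps the evaluation cheap. The following Matlab sketch evaluates such a type I polynomial and the corresponding one-versus-all residual for a single image; the sizes and factor matrices are illustrative placeholders rather than trained TeMPO parameters.

```matlab
% Hedged sketch: evaluating the type I model p_l(z) = T_{l,0} + sum_j T_{l,j} z^j
% when each T_{l,j} is a symmetric rank-R CPD [[U_j, ..., U_j; c_j]], so that
% T_{l,j} z^j = c_j' * (U_j' * z).^j.  Sizes and factors are placeholders.
n = 784; d = 5; R = 10;                    % vectorized 28x28 image, degree, rank
T0 = randn(1);                             % constant term (assumption)
U  = arrayfun(@(j) randn(n, R), 1:d, 'UniformOutput', false);  % factor matrices
cw = arrayfun(@(j) randn(R, 1), 1:d, 'UniformOutput', false);  % weight vectors

z  = randn(n, 1);                          % vectorized input image (assumption)
pl = T0;
for j = 1:d
    pl = pl + cw{j}' * (U{j}' * z).^j;     % T_{l,j} contracted with z in all modes
end

yk = 1;                                    % label indicator for class l
fl = 0.5*(yk - pl)^2;                      % per-sample contribution to the cost
fprintf('p_l(z) = %.3f, loss = %.3f\n', pl, fl);
```

In practice the same evaluation is vectorized over a mini-batch, one cost fl is trained per class, and, following the one-versus-all strategy, a test image is typically assigned to the label whose polynomial gives the largest response.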
FIGURE 9 | Test (solid lines) and training (dashed lines) accuracies of the type I model for the MNIST dataset with respect to the number of epochs. The full training
set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 100 and the maximum number of GN iterations is set to 1. TeMPO achieves high
accuracy even for low rank values, i.e., R = {10, 20}. Both the test and training accuracy increase mildly as the rank increases.
FIGURE 10 | Test (solid lines) and training (dashed lines) accuracies of the type I model for the Fashion MNIST dataset with respect to the number of epochs. The full
training set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 100 and the maximum number of GN iterations is set to 1. Similar to the
MNIST dataset, TeMPO achieves good accuracy even for low rank values and both the test and training accuracy mildly increase as the rank increases.
We repeated the same experiments for the Fashion MNIST dataset, which is harder to classify. We show the training history in Figure 10. The observations made for the MNIST dataset also apply to the Fashion MNIST dataset. However, the test and training accuracies are lower for the Fashion MNIST dataset, in agreement with previous works. Also, our algorithm requires more epochs to converge for the Fashion MNIST dataset.
In our next experiment, we set the maximum number of GN iterations to 5. We observed that our algorithm needs fewer epochs to converge and produces more accurate results with this setting. The comparisons for the MNIST and Fashion MNIST datasets are shown in Figures 11, 12, respectively. The improvement in the test accuracy for the Fashion MNIST dataset is around 1% and more pronounced than the improvement in the test accuracy for the MNIST dataset. TeMPO achieves around 98.30% test accuracy for the MNIST dataset and around 90% test accuracy for the Fashion MNIST dataset with R = 150.

Results of the Type II Model
We repeated the same experiments for the type II model. We used the same settings as for the type I model. However, we set the batch size to 200 to obtain an accuracy similar to that of the type I model. We show the training history in Figure 13. Similar to previous experiments, our algorithm performs well even for low rank values, and produces more accurate results for higher rank values. TeMPO achieves around 98% test accuracy and 100% training accuracy after 200 epochs with R = 150 for the MNIST dataset.
In Figure 14, we show the training history for the Fashion MNIST dataset. As for the type I model, the test and training accuracies are lower than for the MNIST dataset. The algorithm converges in around 100 epochs and achieves around 89.30% test accuracy with R = 150. Moreover, our algorithm achieves around 99% training accuracy after 400 epochs.
We repeated the same experiments with the maximum number of GN iterations set to 5. The comparisons for the MNIST and Fashion MNIST datasets are shown in Figure 15. Contrary to our observation for the type I model, the test accuracy now decreases for both datasets. A possible reason is that when the residuals are big, doing more GN iterations may not lead to a better direction for minimizing (37). A similar observation was made in [53] for training DNNs: it was shown experimentally that a higher number of CG iterations might not produce more accurate results if the Hessian obtained from a mini-batch is not reliable due to non-representative batches and/or big residuals. On the other hand, if the residuals are small, a higher number of CG iterations can produce more accurate results thanks to the curvature information [53].
FIGURE 11 | Comparison of test accuracies of the type I model on the MNIST dataset for different maximum numbers of GN iterations as a function of the number of
epochs. The full training set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 100 and the maximum number of GN iterations is set to 1
(dashed lines) and to 5 (solid lines).
FIGURE 12 | Comparison of test accuracies of the type I model on the Fashion MNIST dataset for different maximum numbers of GN iterations as a function of the
number of epochs. The full training set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 100 and the maximum number of GN
iterations is set to 1 (dashed lines) and to 5 (solid lines).
FIGURE 13 | Test (solid lines) and training (dashed lines) accuracies of the type II model for the MNIST dataset with respect to the number of epochs. The full training
set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 200 and the maximum number of GN iterations is set to 1. Both the test and
training accuracy increase as the rank increases. The improvement in the accuracy gets smaller as the rank increases. The algorithm achieves around 100% training
accuracy after 200 epochs.
FIGURE 14 | Test (solid lines) and training (dashed lines) accuracies of the type II model for the Fashion MNIST dataset with respect to the number of epochs. The full
training set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 200 and the maximum number of GN iterations is set to 1. Both the test and
training accuracy increase as the rank increases. Also the improvement in the accuracy gets smaller as the rank increases. The algorithm achieves around 99%
training accuracy after 400 epochs.
FIGURE 15 | Comparison of test accuracies of the type II model on the MNIST (top) and Fashion MNIST (bottom) datasets for different maximum numbers of GN
iterations as a function of the number of epochs. The full training set (60, 000 images) and test set (10, 000 images) are used. The batch size is set to 200 and the
maximum number of GN iterations is set to 1 (dashed lines) and to 5 (solid lines).
Comparisons
We now compare TeMPO with different models, namely: TT tensor networks [21], TT-structured tree tensor networks (TTN) [64], a multi-layer perceptron (MLP) with 784−1000−10 neurons, an MLP with a convolution layer (CNN-MLP), PEPS, and PEPS with a convolution layer (CNN-PEPS) [22]. We compare in terms of the test accuracy for the Fashion MNIST dataset. We summarize the test accuracy of the different models in Table 2. TeMPO achieves better accuracy than TT, PEPS, and MLP, while optimizing for fewer parameters and using less memory (cf. Table 1). The accuracy of TeMPO is lower than that of CNN-MLP and CNN-PEPS, as expected, since it does not use a convolution layer. Note that the accuracy of TeMPO can be further improved by tuning parameters such as the rank, the number of CG iterations, the trust-region radius, the batch size, and the degree of the multivariate polynomial.

5. CONCLUSION AND FUTURE WORK
We presented the TeMPO framework for use in nonlinear optimization problems arising in signal processing, machine learning, and artificial intelligence. We modeled the nonlinearities in these problems by multivariate polynomials represented by low-rank tensors. In particular, we investigated the symmetric CPD format in this study. By taking advantage of the low-rank symmetric CPD structure, we developed an efficient second-order batch Gauss–Newton algorithm. We demonstrated the efficiency of TeMPO with some illustrative examples.
TABLE 2 | The test accuracy of different models for the Fashion MNIST dataset.

Model              Test accuracy (%)
TT                 88.0
MLP                88.3
PEPS               88.3
TTN                89.0
TeMPO (Type II)    89.3
TeMPO (Type I)     89.9
CNN–MLP            91.0
CNN–PEPS           91.2

The bold values indicate the results from the proposed methods.

DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. This data can be found at: http://yann.lecun.com/exdb/mnist/; https://github.com/zalandoresearch/fashion-mnist.

AUTHOR CONTRIBUTIONS

MA developed the theory and Matlab implementation. He is the main contributor to the numerical experiments and also wrote the first draft of the manuscript. LD conceived the idea and supervised the project. Both authors contributed to manuscript revision, read, and approved the submitted version.
revision, read, and approved the submitted version.
17. Grasedyck L. Hierarchical singular value decomposition of tensors. SIAM J Matrix Anal Appl. (2010) 31:2029–54. doi: 10.1137/090764189
18. Oseledets IV, Tyrtyshnikov EE. Breaking the curse of dimensionality, or how to use SVD in many dimensions. SIAM J Sci Comput. (2009) 31:3744–59. doi: 10.1137/090748330
19. Novikov A, Trofimov M, Oseledets IV. Exponential machines. In: 5th International Conference on Learning Representations, ICLR 2017. Toulon (2017). Available online at: https://openreview.net/forum?id=rkm1sE4tg
20. Stoudenmire EM, Schwab DJ. Supervised learning with tensor networks. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, editors. Advances in Neural Information Processing Systems. Vol. 29. Barcelona: Curran Associates, Inc. (2016). Available online at: https://proceedings.neurips.cc/paper/2016/file/5314b9674c86e3f9d1ba25ef9bb32895-Paper.pdf
21. Efthymiou S, Hidary J, Leichenauer S. TensorNetwork for machine learning. arXiv:1906.06329. (2019). doi: 10.48550/arXiv.1906.06329
22. Cheng S, Wang L, Zhang P. Supervised learning with projected entangled pair states. Phys Rev B. (2021) 103:125117. doi: 10.1103/PhysRevB.103.125117
23. Guo W, Kotsia I, Patras I. Tensor learning for regression. IEEE Trans Image Process. (2012) 21:816–27. doi: 10.1109/TIP.2011.2165291
24. Hendrikx S, Boussé M, Vervliet N, De Lathauwer L. Algebraic and optimization based algorithms for multivariate regression using symmetric tensor decomposition. In: Proceedings of the 2019 IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). Guadeloupe (2019). p. 475–9. doi: 10.1109/CAMSAP45676.2019.9022662
25. Rabusseau G, Kadri H. Low-rank regression with tensor responses. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, editors. Advances in Neural Information Processing Systems. Vol. 29. Barcelona: Curran Associates, Inc. (2016). Available online at: https://proceedings.neurips.cc/paper/2016/file/3806734b256c27e41ec2c6bffa26d9e7-Paper.pdf
26. Yu R, Liu Y. Learning from multiway data: simple and efficient tensor regression. In: Balcan MF, Weinberger KQ, editors. Proceedings of the 33rd International Conference on Machine Learning, Vol. 48 of Proceedings of Machine Learning Research. New York, NY (2016). p. 373–81. Available online at: https://proceedings.mlr.press/v48/yu16.html
27. Hou M, Chaib-Draa B. Hierarchical Tucker tensor regression: application to brain imaging data analysis. In: Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP 2015). Québec, QC (2015). p. 1344–8. doi: 10.1109/ICIP.2015.7351019
28. Kar P, Karnick H. Random feature maps for dot product kernels. In: Lawrence ND, Girolami M, editors. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, Vol. 22 of Proceedings of Machine Learning Research. La Palma (2012). p. 583–91. Available online at: https://proceedings.mlr.press/v22/kar12.html
29. Yang J, Gittens A. Tensor machines for learning target-specific polynomial features. arXiv:1504.01697. (2015). doi: 10.48550/arXiv.1504.01697
30. Rendle S. Factorization machines. In: 2010 IEEE International Conference on Data Mining. Sydney (2010). p. 995–1000. doi: 10.1109/ICDM.2010.127
31. Blondel M, Fujino A, Ueda N, Ishihata M. Higher-order factorization machines. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16. Red Hook, NY: Curran Associates Inc. (2016). p. 3359–67.
32. Blondel M, Ishihata M, Fujino A, Ueda N. Polynomial networks and factorization machines: new insights and efficient training algorithms. In: Proceedings of the 33rd International Conference on Machine Learning. Vol. 48. New York, NY (2016). p. 850–8.
33. Nocedal J, Wright S. Numerical Optimization. New York, NY: Springer (2006).
34. Kruskal JB. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algeb Appl. (1977) 18:95–138. doi: 10.1016/0024-3795(77)90069-6
35. Sidiropoulos ND, Bro R. On the uniqueness of multilinear decomposition of N-way arrays. J Chemometr. (2000) 14:229–39. doi: 10.1002/1099-128X(200005/06)14:3<229::AID-CEM587>3.0.CO;2-N
36. Domanov I, De Lathauwer L. On the uniqueness of the canonical polyadic decomposition of third-order tensors – Part II: uniqueness of the overall decomposition. SIAM J Matrix Anal Appl. (2013) 34:876–903. doi: 10.1137/120877258
37. Domanov I, De Lathauwer L. Canonical polyadic decomposition of third-order tensors: relaxed uniqueness conditions and algebraic algorithm. Linear Algeb Appl. (2017) 513:342–75. doi: 10.1016/j.laa.2016.10.019
38. Boyd JP, Ong JR. Exponentially-convergent strategies for defeating the Runge phenomenon for the approximation of non-periodic functions, part I: single-interval schemes. Commun Comput Phys. (2009) 5:484–97.
39. Trefethen LN. Approximation Theory and Approximation Practice, Extended Edition. Philadelphia, PA: SIAM (2019). doi: 10.1137/1.9781611975949
40. De Lathauwer L, De Moor B, Vandewalle J. On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM J Matrix Anal Appl. (2000) 21:1324–42. doi: 10.1137/S0895479898346995
41. Zhang T, Golub G. Rank-one approximation to high order tensors. SIAM J Matrix Anal Appl. (2001) 23:534–50. doi: 10.1137/S0895479899352045
42. Guan Y, Chu MT, Chu D. SVD-based algorithms for the best rank-1 approximation of a symmetric tensor. SIAM J Matrix Anal Appl. (2018) 39:1095–115. doi: 10.1137/17M1136699
43. Nie J, Wang L. Semidefinite relaxations for best rank-1 tensor approximations. SIAM J Matrix Anal Appl. (2013) 35:1155–79. doi: 10.1137/130935112
44. Brachat J, Comon P, Mourrain B, Tsigaridas E. Symmetric tensor decomposition. Linear Algeb Appl. (2010) 433:1851–72. doi: 10.1016/j.laa.2010.06.046
45. Alexander J, Hirschowitz A. Polynomial interpolation in several variables. Adv Comput Math. (1995) 4:201–22.
46. Debals O. Tensorization and Applications in Blind Source Separation. Leuven: KU Leuven (2017).
47. Blondel M, Niculae V, Otsuka T, Ueda N. Multi-output polynomial networks and factorization machines. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. Long Beach, CA (2017). p. 3349–59.
48. Khoromskij BN. Tensor Numerical Methods in Scientific Computing. Berlin; Boston: De Gruyter (2018). doi: 10.1515/9783110365917
49. Margossian CC. A review of automatic differentiation and its efficient implementation. WIREs Data Mining Knowl Discov. (2019) 9:e1305. doi: 10.1002/widm.1305
50. van der Veen AJ, Paulraj A. An analytical constant modulus algorithm. IEEE Trans Signal Process. (1996) 44:1136–55. doi: 10.1109/78.502327
51. Zarzoso V, Comon P. Optimal step-size constant modulus algorithm. IEEE Trans Commun. (2008) 56:10–3. doi: 10.1109/TCOMM.2008.050484
52. Boussé M, Vervliet N, Domanov I, Debals O, De Lathauwer L. Linear systems with a canonical polyadic decomposition constrained solution: algorithms and applications. Numer Linear Algeb Appl. (2018) 25:e2190. doi: 10.1002/nla.2190
53. Gargiani M, Zanelli A, Diehl M, Hutter F. On the promise of the stochastic generalized Gauss–Newton method for training DNNs. arXiv:2006.02409. (2020). doi: 10.48550/arXiv.2006.02409
54. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y, editors. 3rd International Conference on Learning Representations, ICLR 2015. San Diego, CA (2015). Available online at: http://arxiv.org/abs/1412.6980
55. De Brabanter K, Karsmakers P, Ojeda F, Alzate C, De Brabanter J, Pelckmans K, et al. LS-SVMlab Toolbox User's Guide Version 1.8. Leuven: ESAT-STADIUS (2010). p. 10–46.
56. Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J. Least Squares Support Vector Machines. Singapore: World Scientific (2002). doi: 10.1142/5089
57. Ljung L. System Identification: Theory for the User. 2nd ed. Upper Saddle River, NJ: Prentice Hall (1999). doi: 10.1002/047134608X.W1046
58. Johnson R, Schniter P, Endres TJ, Behm JD, Brown DR, Casas RA. Blind equalization using the constant modulus criterion: a review. Proc IEEE. (1998) 86:1927–50. doi: 10.1109/5.720246
59. van der Veen AJ. Algebraic methods for deterministic blind beamforming. Proc IEEE. (1998) 86:1987–2008. doi: 10.1109/5.720249
60. De Lathauwer L. Algebraic techniques for the blind deconvolution of constant modulus signals. In: Proceedings of the 12th European Signal Processing Conference (EUSIPCO 2004). Vienna (2004). p. 225–8.
61. Householder AS. Unitary triangularization of a nonsymmetric matrix. J ACM. (1958) 5:339–42. doi: 10.1145/320941.320947
62. Deng L. The MNIST database of handwritten digit images for machine learning research. IEEE Sign Process Mag. (2012) 29:141–2. doi: 10.1109/MSP.2012.2211477
63. Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747. (2017). doi: 10.48550/arXiv.1708.07747
64. Stoudenmire EM. Learning relevant features of data with multi-scale tensor networks. Quant Sci Technol. (2018) 3:034003. doi: 10.1088/2058-9565/aaba1a

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2022 Ayvaz and De Lathauwer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.