Abstract. In this paper we propose an efficient method to compress a high dimensional function
into a tensor ring format, based on alternating least squares (ALS). Since the function has size
exponential in d, where d is the number of dimensions, we propose an efficient sampling scheme
to obtain O(d) important samples in order to learn the tensor ring. Furthermore, we devise an
initialization method for ALS that allows fast convergence in practice. Numerical examples show
that to approximate a function with similar accuracy, the tensor ring format provided by the proposed
method has fewer parameters than the tensor-train format and also better respects the structure of
the original function.
Key words. tensor decompositions, tensor train, randomized algorithm, function approximation
DOI. 10.1137/17M1154382
Here $H^k \in \mathbb{R}^{r_{k-1}\times n\times r_k}$ with $r_k \leq r$, and we often refer to $(r_1,\ldots,r_d)$ as the TR rank. This tensor format generalizes the TT format, for which $H^1 \in \mathbb{R}^{1\times n\times r_1}$ and $H^d \in \mathbb{R}^{r_{d-1}\times n\times 1}$. The difference between TR and TT is illustrated in Figure 1 using the tensor network diagrams introduced in section 1.1.
Fig. 1. Tensor network diagrams for the TT format and the TR format.
Due to the exponential number of entries, typically we do not have access to the entire tensor $f$. Therefore, the TR format has to be found based on ``interpolation'' from $f(\Omega)$, where $\Omega$ is a subset of $[n]^d$. For simplicity, in the rest of the note we assume $r_1 = r_2 = \cdots = r_d = r$.
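To make the format concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) that evaluates a TR with equal ranks $r$ at a single multi-index $x$ by multiplying the slices $H^k[x_k]$ around the ring and taking the trace.

```python
import numpy as np

def tr_eval(cores, x):
    """Evaluate a tensor ring at the multi-index x.

    cores : list of d arrays, cores[k] has shape (r, n, r)
            (all bond dimensions equal to r, as assumed in the text).
    x     : sequence of d integers in [0, n).
    Returns Tr(H^1[x_1] ... H^d[x_d]).
    """
    r = cores[0].shape[0]
    M = np.eye(r)
    for H, xk in zip(cores, x):
        M = M @ H[:, xk, :]          # multiply the x_k-th slice
    return np.trace(M)

# example: d = 5 random cores with n = 3, r = 2
rng = np.random.default_rng(0)
cores = [rng.standard_normal((2, 3, 2)) for _ in range(5)]
print(tr_eval(cores, (0, 2, 1, 1, 0)))
```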
1.1. Notations. We first summarize the notations used in this note and introduce tensor network diagrams for ease of presentation. Depending on the context, $f$ is often referred to as a $d$-tensor of size $n^d$ (instead of a function). For a $p$-tensor $T$, given two disjoint subsets $\alpha, \beta \subset [p]$ where $\alpha \cup \beta = [p]$, we use $T_{\alpha;\beta}$ to denote the matrix obtained by reshaping $T$ such that the dimensions in $\alpha$ index the rows and those in $\beta$ index the columns. Similarly, we use $f_{\alpha;\beta}(\Omega_1, \Omega_2)$ to indicate the operation of reshaping $f$ into a matrix, followed by row and column subsampling according to $\Omega_1, \Omega_2$. For any vector $x \in [n]^d$ and any integer $i$, we let
Fig. 2. (a) Tensor diagram for a 3-tensor A and a 4-tensor B. (b) Contraction between tensors
A and B.
leg of B), we mean (with the implicit assumption that the dimensions represented by
these legs have the same size)
(7) $\sum_{k} A_{i_1 i_2 k} B_{k j_2 j_3 j_4}$.
See the review article [12] for a more complete introduction to tensor network diagrams.
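For instance, the contraction in (7) can be checked numerically with `np.einsum`; the shapes below are arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 5, 6))      # 3-tensor A_{i1 i2 k}
B = rng.standard_normal((6, 7, 8, 9))   # 4-tensor B_{k j2 j3 j4}

# contraction over the shared leg k, as in (7)
C = np.einsum('abk,kcde->abcde', A, B)
# equivalent to contracting the third leg of A with the first leg of B
assert np.allclose(C, np.tensordot(A, B, axes=([2], [0])))
print(C.shape)   # (4, 5, 7, 8, 9)
```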
1.2. Previous approaches. In this section, we survey previous approaches for
compressing a blackbox function into TT or TR. In [13], successive CUR (skeleton)
decompositions [6] are applied to find a decomposition of tensor f in TT format.
In [4], a similar scheme is applied to find a TR decomposition of the tensor. A crucial step in [4] is to ``disentangle'' one of the 3-tensors $H^k$'s, say $H^1$, from the TR. First, $f$ is treated as a matrix where the first dimension of $f$ gives the rows and the second, third, \ldots, $d$th dimensions of $f$ give the columns, i.e., $f$ is reshaped to $f_{1;[d]\setminus 1}$. Then a CUR decomposition is applied such that
(8) $f_{1;[d]\setminus 1} = CUR$

and the matrix $C \in \mathbb{R}^{n\times r^2}$ in the decomposition is regarded as $H^1_{2;3,1}$ (the $R$ part in the CUR decomposition is never formed due to its exponential size). As noted by the
authors in [4], a shortcoming of the method lies in the reshaping of $C$ into $H^1$. As in any factorization of a low-rank matrix, there is an inherent ambiguity for the CUR decomposition in that $CUR = CA\,A^{-1}UR$ for any invertible matrix $A$. Such ambiguity in determining $H^1$ may lead to a large TR rank in the subsequent determination of
$H^2, H^3, \ldots, H^d$. More recently, [22] proposes various alternating least squares (ALS)-based techniques to determine the TR decomposition of a tensor $f$. However, they only consider the situation where the entries of $f$ are fully observed, which limits the applicability of their algorithms to the case of rather small $d$. Moreover, depending on the initialization, ALS can suffer from slow convergence. In [18], ALS is used to determine the TR in a more general setting where only partial observations of the function $f$ are given. In this paper, we further assume the freedom to observe any $O(d)$ entries of the tensor $f$. As we shall see, leveraging such freedom, the complexity of the iterations can be reduced significantly compared to the ALS procedure in [18].
1.3. Our contributions. In this paper, assuming f admits a rank-r TR decom-
position, we propose an ALS-based two-phase method to reconstruct the TR when
only a few entries of f can be sampled. Here we summarize our contributions.
where

$H^k[x_k] := H^k(:, x_k, :) \in \mathbb{R}^{r\times r}$

denotes the $x_k$th slice of the 3-tensor $H^k$ along the second dimension. It is computationally infeasible just to set up problem (9), as we need to evaluate $f$ $n^d$ times.
Therefore, analogously to the matrix or CP-tensor completion problem [3, 21], a ``TR completion'' problem [18]
(10) $\min_{H^1,\ldots,H^d} \sum_{x\in\Omega} \bigl(\mathrm{Tr}(H^1[x_1]\cdots H^d[x_d]) - f(x_1,\ldots,x_d)\bigr)^2$,

where $\Omega$ is a subset of $[n]^d$, should be solved instead. Since there are a total of $dnr^2$ parameters for the tensors $H^1,\ldots,H^d$, there is hope that by observing a small number of entries of $f$ (at least $O(ndr^2)$), we can obtain the rank-$r$ TR.
A standard approach for solving minimization problems of the type (10) is via ALS. At every iteration of ALS, a particular $H^k$ is treated as the variable while $H^l$, $l \neq k$, are kept fixed. Then $H^k$ is optimized w.r.t. the least-squares cost in (10). More precisely, to determine $H^k$, we solve

(11) $\min_{H^k} \sum_{x\in\Omega} \bigl(\mathrm{Tr}(H^k[x_k]\, C_{x\setminus x_k}) - f(x)\bigr)^2$,

where

(12) $C_{x\setminus x_k} := H^{k+1}[x_{k+1}]\cdots H^d[x_d]\, H^1[x_1]\cdots H^{k-1}[x_{k-1}], \quad x \in \Omega$.
By an abuse of notation, we use $x\setminus x_k$ to denote the exclusion of $x_k$ from the $d$-tuple $x$. As mentioned previously, $|\Omega|$ should be at least $O(ndr^2)$ in order to determine the TR decomposition. This creates a large computational cost in each iteration of the ALS, as it takes $|\Omega|(d-1)$ (which has $O(d^2)$ scaling, as $|\Omega|$ has size $O(d)$) matrix multiplications just to construct $C_{x\setminus x_k}$ for all $x \in \Omega$. When $d$ is large, such quadratic scaling in $d$ for setting up the least-squares problem in each iteration of the ALS is
undesirable.
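To make the cost structure explicit, the following sketch implements one (unregularized) ALS update of $H^k$ as described by (11)-(12). It is our own illustration, with `samples` and `values` standing for $\Omega$ and $f(\Omega)$, and it deliberately recomputes each $C_{x\setminus x_k}$ with $d-1$ matrix products per sample, which is exactly the cost discussed above.

```python
import numpy as np

def als_update_core(cores, k, samples, values):
    """One ALS update of the core H^k, following (11)-(12).

    cores   : list of d arrays, each of shape (r, n, r)
    samples : list of d-tuples of indices (playing the role of Omega)
    values  : observed entries f(x) for x in samples
    """
    d = len(cores)
    r, n, _ = cores[k].shape
    rows = [[] for _ in range(n)]   # least-squares rows, grouped by the value of x_k
    rhs = [[] for _ in range(n)]
    for x, fx in zip(samples, values):
        # C_{x\x_k} = H^{k+1}[x_{k+1}] ... H^d[x_d] H^1[x_1] ... H^{k-1}[x_{k-1}]
        C = np.eye(r)
        for j in list(range(k + 1, d)) + list(range(k)):
            C = C @ cores[j][:, x[j], :]          # d - 1 matrix products per sample
        # Tr(H^k[x_k] C) = <vec(H^k[x_k]), vec(C^T)>, so vec(C^T) is the LS row
        rows[x[k]].append(C.T.ravel())
        rhs[x[k]].append(fx)
    new_core = cores[k].copy()
    for v in range(n):                            # one small LS problem per slice
        if rows[v]:
            A, b = np.asarray(rows[v]), np.asarray(rhs[v])
            h, *_ = np.linalg.lstsq(A, b, rcond=None)
            new_core[:, v, :] = h.reshape(r, r)
    return new_core
```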
The following simple but crucial observation allows us to gain a further speedup. Although $O(ndr^2)$ observations of $f$ are required to determine all the components $H^1,\ldots,H^d$, when it comes to determining each individual $H^k$ via solving the linear system (11), only $O(nr^2)$ equations are required for the well-posedness of the linear system. This motivates us to use different $\Omega_k$'s, each having size $O(nr^2)$ (with $|\Omega_1| + \cdots + |\Omega_d| \sim O(ndr^2)$), to determine the different $H^k$'s in the ALS steps, instead of using a fixed set $\Omega$ with size $O(ndr^2)$ for all $H^k$'s. If $\Omega_k$ is constructed from densely sampling the dimensions near $k$ (where a neighborhood is defined according to the ring geometry) while sparsely sampling the dimensions far away from $k$, computational savings can be achieved. The specific construction of $\Omega_k$ is made precise in section 2.1. We further remark that if

holds with small error for every $x \in [n]^d$, then using any $\Omega_k \subset [n]^d$ in place of $\Omega$ in (11) should give similar solutions, as long as (11) is well-posed. Therefore, we solve
(14) $\min_{H^k} \sum_{x\in\Omega_k} \bigl(\mathrm{Tr}(H^k[x_k]\, C_{x\setminus x_k}) - f(x)\bigr)^2$
instead of (11) in each step of the ALS, where the index sets $\Omega_k$ depend on $k$. We note that in practice, a regularization term $\lambda\sigma_k\|H^k[x_k]\|_F^2$ is added to the cost in (14) to reduce numerical instability resulting from a potentially high condition number of the least-squares problem (14). In all of our experiments, $\lambda$ is set to $10^{-9}$ and $\sigma_k$ is the top singular value of the Hessian of the least-squares problem (14). From our experience, the quality of the TR is rather insensitive to the choice of $\lambda$, which indicates that the problem of determining the $H^k$'s is rather well-posed.
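A sketch of the corresponding regularized solve, with $\sigma_k$ taken as the top singular value of the least-squares Hessian $A^TA$ as described above (the function and variable names are ours):

```python
import numpy as np

def regularized_ls(A, b, lam=1e-9):
    """Solve min_h ||A h - b||^2 + lam * sigma * ||h||^2, where sigma is the
    top singular value of the Hessian A^T A of the least-squares cost."""
    H = A.T @ A
    sigma = np.linalg.eigvalsh(H)[-1]     # largest eigenvalue of A^T A
    return np.linalg.solve(H + lam * sigma * np.eye(H.shape[0]), A.T @ b)
```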
At this point it is clear that there are two issues that need to be addressed. The first issue concerns the choice of $\Omega_k$, $k \in [d]$. The second issue is that the nonconvex nature of the TR completion problem (10) may cause difficulty in the convergence of ALS. We address the first issue using a hierarchical sampling strategy. As for the second issue, by making certain probabilistic assumptions on $f$, we are able to obtain a cheap and intuitive initialization that allows fast convergence. Before moving on, we summarize the full algorithm in Algorithm 1. The steps of Algorithm 1 are further detailed in sections 2.1, 2.2, and 2.3.
2.1. Constructing $\Omega_k$. In this section, we detail the construction of $\Omega_k$ for each $k \in [d]$. We first construct an index set $\Omega_k^{\mathrm{envi}} \subset [n]^{d-3}$ with fixed size $s$. The elements in $\Omega_k^{\mathrm{envi}}$ correspond to different choices of indices for the $[d]\setminus\{k-1,k,k+1\}$th dimensions of the function $f$. Then for each of the elements in $\Omega_k^{\mathrm{envi}}$, we sample all possible indices from the $(k-1)$th, $k$th, $(k+1)$th dimensions of $f$ to construct $\Omega_k$, i.e., letting
then

(17) $\bigl\| H^k_{2;3,1} - f_{k;[d]\setminus k}(\Omega_k)\,[\mathrm{vec}(C_{x\setminus x_k})]^\dagger_{x\in\Omega_k} \bigr\|_F \leq \frac{\epsilon}{\sigma_{\min}\bigl([\mathrm{vec}(C_{x\setminus x_k})]_{x\in\Omega_k}\bigr)}$,

where $\dagger$ denotes the pseudoinverse and $\sigma_{\min}$ denotes the smallest singular value. Therefore, $\mathrm{Range}(H^k_{2;3,1})$ is similar to $\mathrm{Range}(f_{k;[d]\setminus k}(\Omega_k))$. On the other hand, an optimal $H^k$ should satisfy

(18) $H^k_{2;3,1}\,[\mathrm{vec}(C_{x\setminus x_k})]_{x\in[n]^d} = f_{k;[d]\setminus k}.$

Here we emphasize that it is possible to reshape $f(\Omega_k)$ into a matrix $f_{k;[d]\setminus k}(\Omega_k)$ as in (17) due to the product structure of $\Omega_k$ in (15), where the indices along dimension $k$ are fully sampled. The second criterion is that we require the cost in (14) to approximate the cost in (9).
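A minimal sketch of the construction of $\Omega_k$ from an environment set (here the environment indices are drawn uniformly at random purely for illustration; the paper instead selects them hierarchically via Algorithms 2 and 3, and ring-periodic indexing of $k-1$ and $k+1$ is assumed):

```python
import itertools
import numpy as np

def build_omega_k(d, n, k, s, rng):
    """Construct Omega_k: densely sample dimensions k-1, k, k+1 and pair
    them with s fixed 'environment' index choices for the remaining dims."""
    local = [(k - 1) % d, k, (k + 1) % d]          # ring neighborhood of k
    envi_dims = [j for j in range(d) if j not in local]
    # s environment samples (uniform here; the paper uses skeletons instead)
    omega_envi = rng.integers(0, n, size=(s, len(envi_dims)))
    omega_k = []
    for env in omega_envi:
        for trip in itertools.product(range(n), repeat=3):
            x = np.empty(d, dtype=int)
            x[envi_dims] = env
            x[local] = trip
            omega_k.append(tuple(x))
    return omega_k     # s * n^3 samples

# usage: rng = np.random.default_rng(0); omega_3 = build_omega_k(12, 3, 3, 4, rng)
```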
To meet the first criterion, we propose a hierarchical strategy to determine $\Omega_k^{\mathrm{envi}}$ such that $f_{k;[d]\setminus k}(\Omega_k)$ has large singular values. Assuming $d = 3\cdot 2^L$ for some natural number $L$, we summarize such a strategy in Algorithm 2 (the upward pass) and Algorithm 3 (the downward pass). The dimensions are divided into groups of size $3\cdot 2^{L-l}$ on each level $l$ for $l = 1,\ldots,L$. We emphasize that level $l = 1$ corresponds to the coarsest partitioning of the dimensions of the tensor $f$. The purpose of the upward pass is to hierarchically find skeletons $\Theta^{\mathrm{in},l}_k$ which represent the $k$th group of indices, while in the downward pass the environment skeletons $\Theta^{\mathrm{envi},L}_k$ with $k \in [2^L]$ are obtained. Then another upward pass can be reinitiated. Instead of sampling new $\Theta^{\mathrm{envi},l}_k$'s, the stored $\Theta^{\mathrm{envi},l}_k$'s from the downward pass are used. Multiple upward-downward passes can be called to further improve these skeletons.
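The skeleton-selection step inside each pass can be sketched with a column-pivoted QR as a stand-in for a rank-revealing QR factorization (cf. [9]); `F` plays the role of the sampled matrix in (22), and the function below is our own illustration.

```python
import numpy as np
from scipy.linalg import qr

def select_skeletons(F, num_skel):
    """Pick num_skel 'important' columns of F via column-pivoted QR,
    a standard stand-in for a rank-revealing QR factorization."""
    _, _, piv = qr(F, mode='economic', pivoting=True)
    return np.sort(piv[:num_skel])   # indices of the selected columns

# usage: keep the columns of the sampled block indexed by the skeletons
# skel = select_skeletons(F, s); F_skel = F[:, skel]
```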
Finally, we let

There are $2^L$ index sets after this step. For each $k \in [2^L]$, construct the set of environment skeletons $\Theta^{\mathrm{envi},L}_k$.
for l = L to l = 1
2: Find the skeletons within each index set $\tilde\Theta^{\mathrm{in},l}_k$, $k \in [2^l]$, where the elements in each $\tilde\Theta^{\mathrm{in},l}_k$ are multi-indices of length $3\cdot 2^{L-l}$. Apply RRQR factorization to the matrix

(22) $f(\Theta^{\mathrm{envi},l}_k; \tilde\Theta^{\mathrm{in},l}_k) \in \mathbb{R}^{s\times|\tilde\Theta^{\mathrm{in},l}_k|}$
to obtain the skeletons $\Theta^{\mathrm{in},l}_k$ for all $k \in [2^l]$, and form the index sets $\tilde\Theta^{\mathrm{in},l-1}_k := \Theta^{\mathrm{in},l}_{2k-1}\times\Theta^{\mathrm{in},l}_{2k}$ for all $k \in [2^{l-1}]$.
end for
Algorithm 3
Ensure:
Skeletons $\Theta^{\mathrm{envi},l}_k$'s.
1: Let $\Theta^{\mathrm{envi},1}_1 = \Theta^{\mathrm{in},1}_2$ and $\Theta^{\mathrm{envi},1}_2 = \Theta^{\mathrm{in},1}_1$.
for l = 2 to l = L
2: For each $k \in [2^l]$, we obtain $\Theta^{\mathrm{envi},l}_k$ by applying RRQR factorization to

or

end for
Fig. 3. A gauge $G_k$ needs to be inserted between $T^{k,C}$ and $T^{k+1,C}$.
Fig. 4. Plot of convergence of the ALS using both random and the proposed initializations for the numerical example given in section 4.3 with $n = 3$, $d = 12$. The error measure is defined in (40).
the number of iterations in ALS, when using the proposed initialization and random initialization. By random initialization, we mean that the $H^k$'s are initialized by sampling their entries independently from the normal distribution. Then ALS is performed on the example detailed in section 4.3 with $n = 3$, $d = 12$. We set the TR rank to $r = 3$. As we can see, after one iteration of ALS, we already obtain a $10^{-4}$ error using our proposed method, whereas with random initialization, the convergence of ALS is slower and the solution has a lower accuracy.
2.3. Alternating least squares. After constructing $\Omega_k$ and initializing $H^k$, $k \in [d]$, we start ALS by solving problem (14) at each iteration. This completes Algorithm 1.
When running ALS, sometimes we want to increase the TR rank to obtain a higher accuracy approximation to the function $f$. In this case, we simply add a row and a column of random entries to each $H^k$, i.e.,
(27) $H^k(:, i, :) \leftarrow \begin{bmatrix} H^k(:, i, :) & \epsilon^{i,k}_1 \\ \epsilon^{i,k}_2 & 1 \end{bmatrix}, \quad i = 1,\ldots,n, \; k = 1,\ldots,d,$
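A sketch of this rank-increase step, padding each slice as in (27); the scale of the new random entries is our choice, while the corner entry 1 follows (27).

```python
import numpy as np

def increase_rank(cores, eps=1e-3, rng=None):
    """Pad every core slice H^k(:, i, :) with a random row and column,
    growing the TR rank from r to r + 1, as in (27)."""
    rng = rng or np.random.default_rng()
    new_cores = []
    for H in cores:                       # H has shape (r, n, r)
        r, n, _ = H.shape
        G = np.empty((r + 1, n, r + 1))
        for i in range(n):
            G[:r, i, :r] = H[:, i, :]
            G[:r, i, r] = eps * rng.standard_normal(r)    # new column
            G[r, i, :r] = eps * rng.standard_normal(r)    # new row
            G[r, i, r] = 1.0                              # corner entry
        new_cores.append(G)
    return new_cores
```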
Algorithm 4
Require:
Function $f : [n]^d \rightarrow \mathbb{R}$.
Ensure:
$T^{k,L} \in \mathbb{R}^{n\times r}$, $T^{k,C} \in \mathbb{R}^{r\times n\times r}$, $T^{k,R} \in \mathbb{R}^{r\times n}$, $k \in [d]$.
for k = 1 to k = d
1: Pick an arbitrary $z \in [n]^{d-3}$ and let

(28) $\Omega_k^{\mathrm{ini}} := \bigl\{ x \in [n]^d \mid x_{[d]\setminus\{k-1,k,k+1\}} = z,\; x_{k-1}, x_k, x_{k+1} \in [n] \bigr\}$.
Define

Let $\tilde T^{k,C} \in \mathbb{R}^{r\times n\times r}$ be reshaped from $U_R\Sigma_R \in \mathbb{R}^{rn\times r}$.
4: Let $T^{k,L} := U_L\Sigma_L^{1/2}$ and $T^{k,R} := \Sigma_R^{1/2}V_R^T$. Let $T^{k,C}$ be defined by
$T^{k,C} := \Sigma_L^{-1/2}\,\tilde T^{k,C}\,\Sigma_R^{-1/2}.$
end for
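The following sketch carries out the local factorization in Algorithm 4 for a single $k$. The two truncated SVDs fill in the steps that are not reproduced above, in the standard TT-SVD fashion, so the details (and the assumption $r \leq n$) should be read as our reconstruction rather than the exact procedure of the paper.

```python
import numpy as np

def local_init(f, d, n, k, r, z):
    """Sketch of Algorithm 4 for one k: factor the n x n x n block of f with
    the other indices fixed at z into T^{k,L} (n x r), T^{k,C} (r x n x r),
    T^{k,R} (r x n).  Assumes r <= n."""
    # sample the block on Omega_k^ini as in (28)
    B = np.empty((n, n, n))
    dims = [(k - 1) % d, k, (k + 1) % d]
    others = [j for j in range(d) if j not in dims]
    for i, j, l in np.ndindex(n, n, n):
        x = np.empty(d, dtype=int)
        x[others] = z
        x[dims] = (i, j, l)
        B[i, j, l] = f(tuple(x))
    # first SVD: split dimension k-1 from (k, k+1)
    UL, sL, VLt = np.linalg.svd(B.reshape(n, n * n), full_matrices=False)
    UL, sL, VLt = UL[:, :r], sL[:r], VLt[:r, :]
    # second SVD: split (k-1, k) from k+1, applied to Sigma_L V_L^T
    M = (np.diag(sL) @ VLt).reshape(r * n, n)
    UR, sR, VRt = np.linalg.svd(M, full_matrices=False)
    UR, sR, VRt = UR[:, :r], sR[:r], VRt[:r, :]
    T_C_tilde = (UR @ np.diag(sR)).reshape(r, n, r)      # reshaped U_R Sigma_R
    # step 4: distribute the singular values
    T_L = UL @ np.diag(np.sqrt(sL))
    T_R = np.diag(np.sqrt(sR)) @ VRt
    T_C = np.einsum('i,ijk,k->ijk', 1.0 / np.sqrt(sL), T_C_tilde, 1.0 / np.sqrt(sR))
    return T_L, T_C, T_R
```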
Algorithm 5
Require:
Function $f : [n]^d \rightarrow \mathbb{R}$, and $T^{k,L}$, $T^{k,C}$, $T^{k,R}$ for $k \in [d]$ from Algorithm 4.
Ensure:
Initialization $H^k$, $k \in [d]$.
for k = 1 to k = d
1: Pick an arbitrary $z \in [n]^{d-4}$ and let

(32) $\Omega_k^{\mathrm{gauge}} := \bigl\{ x \in [n]^d \mid x_{[d]\setminus\{k-1,k,k+1,k+2\}} = z,\; x_{k-1}, x_k, x_{k+1}, x_{k+2} \in [n] \bigr\}$
and sample

$H^k = T^{k,C}\,G_k$

end for
for some functions $g, h$. Here ``$\propto$'' denotes proportionality up to a constant.
We note that Assumption 1 holds if $f$ is a nonnegative function and admits a Markovian structure. Such functions can arise from a Gibbs distribution with energy defined by short-range interactions [20], for example, the Ising model.
Next we make a certain nondegeneracy assumption on the TR $f$.
Assumption 2. Any segment $H$ of the TR $f$ (for example, $H^a$, $H^b$, $H^{c_1}$, $H^{c_2}$ shown in Figure 6) satisfies

(36) $\mathrm{rank}(H_{L+1,L+2;[L]}) = r^2.$
𝐻" =
𝐿(+ +1 1 2 … 𝐿" 𝐿() +1
1 1
Region 𝑎
2… …2
𝐿* + 1 𝐿* + 2
𝐿(+ +2 𝐿() +2 𝐿
𝐿(+ ()
𝐻* = 1 2 … 𝐿*
Region 𝑐0 Region 𝑏 Region 𝑐/
2 L
Since $H_{L+1,L+2;[L]} \in \mathbb{R}^{r^2\times n^L}$, it is natural to expect that when $n^L \geq r^2$, $H_{L+1,L+2;[L]}$ is of rank $r^2$ generically [15].
We now state a proposition that leads us to the intuition behind designing the
initialization procedure Algorithm 4.
Proposition 1. Let

(37) $s_1 = e_{i_1}\otimes e_{i_2}\otimes\cdots\otimes e_{i_{L_a}}, \qquad s_2 = e_{j_1}\otimes e_{j_2}\otimes\cdots\otimes e_{j_{L_b}}$

be any two arbitrary sampling vectors, where $\{e_k\}_{k=1}^n$ is the canonical basis of $\mathbb{R}^n$. If $L_a, L_b, L_{c_1}, L_{c_2} \geq \max(L_0, L_{\mathrm{buffer}})$, the two matrices $B^1, B^2 \in \mathbb{R}^{r\times r}$ defined in Figure 7 are rank-1.
Proof. Due to Assumption 2, $H^{c_1}_{L_{c_1}+1,L_{c_1}+2;[L_{c_1}]} \in \mathbb{R}^{r^2\times n^{L_{c_1}}}$ and $H^{c_2}_{L_{c_2}+1,L_{c_2}+2;[L_{c_2}]} \in \mathbb{R}^{r^2\times n^{L_{c_2}}}$ defined in Figure 7 are of rank $r^2$. Along with the implication of Assumption 1 that

(38) $\mathrm{rank}\Bigl(\bigl(H^{c_1}_{L_{c_1}+1,L_{c_1}+2;[L_{c_1}]}\bigr)^T\bigl(B^1\otimes B^2\bigr)H^{c_2}_{L_{c_2}+1,L_{c_2}+2;[L_{c_2}]}\Bigr) = 1,$

we get

(39) $\mathrm{rank}(B^1\otimes B^2) = 1.$

Since $\mathrm{rank}(B^1)\,\mathrm{rank}(B^2) = \mathrm{rank}(B^1\otimes B^2) = 1$, it follows that $B^1$ and $B^2$ are both of rank 1.
Fig. 7. Definition of the matrices $B^1$ and $B^2$ obtained by applying the sampling vectors $s_1$ and $s_2$ to regions $a$ and $b$, respectively.
Fig. 8. Applying a sampling vector $s_2$ in the canonical basis to region $b$ gives the TT.
Whenever it is feasible, we let $\Omega = [n]^d$. Otherwise, we subsample $\Omega$ from $[n]^d$ at random: for every $x \in \Omega$, $x_i$ is drawn from $[n]$ uniformly at random. If the dimensionality of $f$ is large, we simply sample $\Omega$ from $[n]^d$ at random. For the proposed algorithm, we also measure the error on the entries sampled for learning the TR as

$\sqrt{\sum \bigl(\mathrm{Tr}(H^1[x_1]\cdots H^d[x_d]) - f(x_1,\ldots,x_d)\bigr)^2}$
In the experiments, we compare our method, denoted as ITR-ALS (``I'' stands for ``initialized''), with TR-ALS proposed in [18]. In [18], the cost in (9) is minimized using ALS, where (11) is solved for each $k$ in an alternating fashion. Although [18] proposed an SVD-based initialization approach similar to the recursive SVD algorithm for TT [13], this method has exponential complexity in $d$. Therefore the comparison with such an initialization is omitted and we use a random initialization for TR-ALS. As we shall see, ITR-ALS is generally an order of magnitude faster than TR-ALS, due to the special structure of the samples. For each experiment we run both TR-ALS and ITR-ALS five times and report the median accuracy. For TR-ALS, we often have to use fewer samples so that the running time is not excessively long (recall that TR-ALS has $O(d^2)$ complexity per iteration). To compare with the algorithm in [4], we simply cite the results in [4] since the software is not publicly available.
We also compare ourselves with the density matrix renormalization group (DMRG)-cross algorithm [16] (which gives a TT). As a method based on interpolative decomposition, DMRG-cross is able to obtain a high quality approximation if we allow a large TT-rank representation. Since we obtain the TR based on ALS optimization, the accuracy may not be comparable to DMRG-cross. What we want to emphasize here is that if the given situation only requires moderate accuracy, our method could give a more economical representation than the TT obtained from DMRG-cross. To convey this message, we set the accuracy of DMRG-cross so that it matches the accuracy of our proposed method.
4.1. Example 1: A toy example. We first compress the function

(42) $f(x_1,\ldots,x_d) = \frac{1}{\sqrt{1 + x_1^2 + \cdots + x_d^2}}, \qquad x_k \in [0,1],$

considered in [4] into a TR. The results are presented in Table 1. In this example, we let $s = 4$ (recall that $s$ is the size of $\Omega_k^{\mathrm{envi}}$) in ITR-ALS. The number of samples we can afford to use for TR-ALS is smaller than for ITR-ALS due to the excessively long running time, since each iteration of TR-ALS has a complexity scaling of $O(d^2)$. In this example, although sometimes ITR-ALS has lower accuracy than TR-ALS, the running time of ITR-ALS is significantly shorter. In particular, for the case when $d = 12$, TR-ALS fails to converge using the same amount of samples as ITR-ALS. Both ITR-ALS and TR-ALS give a TR with tensor components of smaller sizes than the TT. The error $E$ reported for the case of $d = 12$ is obtained from sampling $10^5$ entries of the tensor $f$.
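For reference, the black-box oracle for (42) is straightforward to set up; the sketch below assumes that each index $x_k \in \{0,\ldots,n-1\}$ is mapped to a uniform grid point of $[0,1]$ (the precise grid convention is not spelled out above).

```python
import numpy as np

def make_oracle(n, d):
    """Black-box access to f in (42) on an n-point uniform grid of [0, 1]^d."""
    grid = np.linspace(0.0, 1.0, n)
    def f(x):                       # x is a d-tuple of integers in [0, n)
        t = grid[np.asarray(x)]
        return 1.0 / np.sqrt(1.0 + np.sum(t ** 2))
    return f

f = make_oracle(n=10, d=6)
print(f((0, 3, 9, 2, 5, 7)))
```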
4.2. Example 2: Ising spin glass. In this example, we demonstrate the advantage of ITR-ALS in compressing a high-dimensional function arising from many-body physics, the traditional field where TT or MPS is extensively used [1, 19]. We consider compressing the free energy of an Ising spin glass with a ring geometry:

(43) $f(J_1,\ldots,J_d) = -\frac{1}{\beta}\log\mathrm{Tr}\Biggl[\,\prod_{i=1}^{d}\begin{bmatrix} e^{\beta J_i} & e^{-\beta J_i} \\ e^{-\beta J_i} & e^{\beta J_i}\end{bmatrix}\Biggr].$

We let $\beta = 10$ and $J_i \in \{-2.5, -1.5, 1, 2\}$, $i \in [d]$. This corresponds to an Ising model with a temperature of about 0.1K. The results are presented in Table 2. We let the number of environment samples $s = 5$. When computing the error $E$ for the case
Table 1
Results for Example 1. $n$ corresponds to the number of uniform grid points on $[0,1]$ for each $x_k$. The tuple $(r_1,\ldots,r_d)$ indicates the rank of the learned TR and TT. $E_{\mathrm{skeleton}}$ is computed on the samples used for learning the TR.

Table 2
Results for Example 2. Learning the free energy of the Ising spin glass.
of $d = 24$, due to the size of $f$, we simply subsample $10^5$ entries of $f$, where the $J_i$'s are sampled independently and uniformly from $\{-2.5, -1.5, 1, 2\}$. For $d = 12$, the solution obtained by ITR-ALS is superior due to the initialization procedure. We see that in both the $d = 12$ and $d = 24$ cases, the running time of TR-ALS is much longer compared to ITR-ALS.
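For completeness, (43) can be evaluated directly by multiplying the $2\times 2$ transfer matrices; the sketch below is our own illustration of the black-box oracle, with $\beta$ and the coupling values taken from the text.

```python
import numpy as np

def ising_free_energy(J, beta=10.0):
    """Evaluate f(J_1, ..., J_d) in (43): free energy of an Ising ring."""
    M = np.eye(2)
    for Ji in J:
        T = np.array([[np.exp(beta * Ji), np.exp(-beta * Ji)],
                      [np.exp(-beta * Ji), np.exp(beta * Ji)]])
        M = M @ T              # transfer-matrix product; for large d, rescale
                               # the running product to avoid overflow
    return -np.log(np.trace(M)) / beta

rng = np.random.default_rng(0)
J = rng.choice([-2.5, -1.5, 1.0, 2.0], size=12)
print(ising_free_energy(J))
```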
4.3. Example 3: Parametric elliptic partial differential equation (PDE). In this section, we demonstrate the performance of our method in solving a parametric PDE. We are interested in solving an elliptic equation with random coefficients

(44) $\frac{\partial}{\partial x}\Bigl(a(x)\Bigl(\frac{\partial}{\partial x}u(x) + 1\Bigr)\Bigr) = 0, \qquad x \in [0,1],$
subject to a periodic boundary condition, where $a(\cdot)$ is a random field. In particular, we want to parameterize the effective conductance function

(45) $A_{\mathrm{eff}}(a(\cdot)) := \int_{[0,1]} a(x)\Bigl(\frac{\partial}{\partial x}u(x) + 1\Bigr)^2\,dx$
as a TR. By discretizing the domain into $d$ segments and assuming $a(x) = \sum_{i=1}^{d} a_i\chi_i(x)$, where each $a_i \in \{1, 2, 3\}$ and the $\chi_i$'s are step functions on uniform intervals of $[0,1]$, we determine $A_{\mathrm{eff}}(a_1,\ldots,a_d)$ as a TR. In this case, the effective coefficient has an analytic formula
(46) $A_{\mathrm{eff}}(a_1,\ldots,a_d) = \Bigl(\frac{1}{d}\sum_{i=1}^{d} a_i^{-1}\Bigr)^{-1}$
and we use this formula to generate samples to learn the TR. For this example, we pick $s = 4$. The results are reported in Table 3. When computing $E$ with $d = 24$, again $10^5$ entries of $f$ are subsampled, where the $a_i$'s are sampled independently and uniformly from $\{1, 2, 3\}$. We note that although in this situation there is an analytic formula for the function we want to learn as a TR, we foresee further usage of our method when solving parametric PDEs with periodic boundary conditions, where there is no analytic formula for the physical quantity of interest (for example, for the cases considered in [10]).
5. Conclusion. In this paper, we propose a method for learning a TR representation based on ALS. Since the problem of determining a TR is a nonconvex optimization problem, we propose an initialization strategy that helps the convergence of ALS. Furthermore, since using the entire tensor $f$ in the ALS is infeasible, we propose an efficient hierarchical sampling method to identify the important samples. Our method provides a more economical representation of the tensor $f$ than the TT format. As for future work, we plan to investigate the performance of the algorithms for quantum systems. One difficulty is that Assumption 1, which underlies the proposed initialization procedure, does not in general hold for quantum systems with short-range interactions. Instead, a natural assumption for a quantum state exhibiting a TR format representation is the exponential correlation decay [7, 2]. The design of efficient algorithms to determine the TR representation under such an assumption is left for future work. Another natural direction is to extend the proposed method to tensor networks in higher spatial dimensions, which we shall also explore in the future.
Appendix A. Stability of initialization. In this section, we analyze the
stability of the proposed initialization procedure, where we relax Assumption 1 to
approximate Markovianity.
Assumption 3. Let

(47) $\Omega_z := \bigl\{ (x_{c_1}, x_{a\cup b}, x_{c_2}) \mid x_{c_1} \in [n]^{L_{c_1}},\; x_{c_2} \in [n]^{L_{c_2}},\; x_{a\cup b} = z \bigr\}$

for some given $z \in [n]^{L_a+L_b}$. For any $z \in [n]^{L_a+L_b}$, we assume

(49) $\frac{\|B^1\|_2^2}{\|B^1\|_F^2},\; \frac{\|B^2\|_2^2}{\|B^2\|_F^2} \;\geq\; \frac{\alpha}{\kappa^4}.$
Proof. By Assumption 3,

(50) $\alpha \leq \frac{\bigl\|\bigl(H^{c_1}_{L_{c_1}+1,L_{c_1}+2;[L_{c_1}]}\bigr)^T\bigl(B^1\otimes B^2\bigr)H^{c_2}_{L_{c_2}+1,L_{c_2}+2;[L_{c_2}]}\bigr\|_2^2}{\bigl\|\bigl(H^{c_1}_{L_{c_1}+1,L_{c_1}+2;[L_{c_1}]}\bigr)^T\bigl(B^1\otimes B^2\bigr)H^{c_2}_{L_{c_2}+1,L_{c_2}+2;[L_{c_2}]}\bigr\|_F^2} \leq \kappa_{c_1}^2\kappa_{c_2}^2\,\frac{\|B^1\otimes B^2\|_2^2}{\|B^1\otimes B^2\|_F^2} = \kappa_{c_1}^2\kappa_{c_2}^2\,\frac{\|B^1\|_2^2\,\|B^2\|_2^2}{\|B^1\|_F^2\,\|B^2\|_F^2},$

where $\kappa_{c_1}, \kappa_{c_2} \leq \kappa$ are the condition numbers of $H^{c_1}_{L_{c_1}+1,L_{c_1}+2;[L_{c_1}]}$ and $H^{c_2}_{L_{c_2}+1,L_{c_2}+2;[L_{c_2}]}$, respectively.
Let $p_b(q_b)^T$ be the best rank-1 approximation to $B^2$. Before stating the next corollary, we define $H^{[d]\setminus b}$ and $\tilde H^{[d]\setminus a}$ in Figure 9.
Corollary 1. Under the assumptions of Lemma 1, for any sampling operator $s_2$ defined in Proposition 1,

(51) $\frac{\bigl\| H^{[d]\setminus b}_{[d-L_b];d-L_b+1,d-L_b+2}\,\mathrm{vec}(p_b(q_b)^T) - f_{[d]\setminus b;b}\,s_2 \bigr\|_2^2}{\|f_{[d]\setminus b;b}\,s_2\|_F^2} \leq \kappa^2\Bigl(1 - \frac{\alpha}{\kappa^4}\Bigr).$
Fig. 9. Definition of $H^{[d]\setminus b}$ and $\tilde H^{[d]\setminus a}$.
(52) $\frac{\|B^2 - p_b(q_b)^T\|_F^2}{\|B^2\|_F^2} = \frac{\|B^2\|_F^2 - \|p_b(q_b)^T\|_F^2}{\|B^2\|_F^2} \leq 1 - \frac{\alpha}{\kappa^4}.$
Then

(53) $\frac{\bigl\| H^{[d]\setminus b}_{[d-L_b];d-L_b+1,d-L_b+2}\,\mathrm{vec}(p_b(q_b)^T) - f_{[d]\setminus b;b}\,s_2 \bigr\|_2^2}{\|f_{[d]\setminus b;b}\,s_2\|_2^2}
\leq \frac{\bigl\| H^{[d]\setminus b}_{[d-L_b];d-L_b+1,d-L_b+2}\bigr\|_2^2\,\bigl\| H^b_{L_b+1,L_b+2;[L_b]}\,s_2 - \mathrm{vec}(p_b(q_b)^T)\bigr\|_2^2}{\bigl\| H^{[d]\setminus b}_{[d-L_b];d-L_b+1,d-L_b+2}\,H^b_{L_b+1,L_b+2;[L_b]}\,s_2\bigr\|_2^2}
\leq \kappa_{[d]\setminus b}^2\,\frac{\bigl\| H^b_{L_b+1,L_b+2;[L_b]}\,s_2 - \mathrm{vec}(p_b(q_b)^T)\bigr\|_2^2}{\|H^b_{L_b+1,L_b+2;[L_b]}\,s_2\|_2^2},$

where $\kappa_{[d]\setminus b}$ is the condition number of $H^{[d]\setminus b}_{[d-L_b];d-L_b+1,d-L_b+2}$. Recall that $H^b$ is defined in Figure 5.
This corollary states that the situation in Figure 8 holds approximately. More precisely, let $T, \hat T \in \mathbb{R}^{n^{d-L_b}}$ be defined as

(54) $T := H^{[d]\setminus b}_{[d-L_b];d-L_b+1,d-L_b+2}\,\mathrm{vec}(p_b(q_b)^T), \qquad \hat T := f_{[d]\setminus b;b}\,s_2,$
In the following, we want to show that we can approximately extract the $H^k$'s in region $a$. For this, we need to take the right-inverses of $\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}$ and $\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}$, defined in Figure 10(b). This requires a singular value lower bound, provided by the next lemma.
Lemma 2. Let $\sigma_k : \mathbb{R}^{m_1\times m_2} \rightarrow \mathbb{R}$ be the function that extracts the $k$th singular value of an $m_1\times m_2$ matrix. Then

(56) $\frac{\sigma_r\bigl(\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}\bigr)^2\,\sigma_r\bigl(\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}\bigr)^2}{\bigl\|\tilde H^{[d]\setminus a}_{d-L_a+1,d-L_a+2;[d-L_a]}\bigr\|_2^2} \;\geq\; \frac{1}{\kappa^6} - \frac{2\sqrt{r}}{\kappa^2}\sqrt{1 - \frac{\alpha}{\kappa^4}},$
Fig. 10. (a) Definition of $T$ and $\hat T$. The dimensions in regions $a$, $c_1$, $c_2$ are grouped into $\mathcal{I}_a$, $\mathcal{I}_{c_1}$, $\mathcal{I}_{c_2}$, respectively, for the tensors $T$ and $\hat T$. (b) Individual components of $T$.
assuming

(57) $\frac{1}{\kappa^4} - 2\sqrt{r}\sqrt{1 - \frac{\alpha}{\kappa^4}} \geq 0.$
Proof. First,

(58) $\frac{\sigma_{r^2}\bigl(T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}\bigr)^2}{\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}\|_2^2}
\leq \frac{\bigl\|H^a_{[L_a];L_a+1,L_a+2}\bigr\|_2^2\,\sigma_{r^2}\bigl(\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}\otimes\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}\bigr)^2}{\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}\|_2^2}
= \frac{\bigl\|H^a_{[L_a];L_a+1,L_a+2}\bigr\|_2^2\,\sigma_r\bigl(\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}\bigr)^2\,\sigma_r\bigl(\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}\bigr)^2}{\bigl\|H^a_{[L_a];L_a+1,L_a+2}\,\tilde H^{[d]\setminus a}_{d-L_a+1,d-L_a+2;[d-L_a]}\bigr\|_2^2}
\leq \frac{\bigl\|H^a_{[L_a];L_a+1,L_a+2}\bigr\|_2^2\,\sigma_r\bigl(\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}\bigr)^2\,\sigma_r\bigl(\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}\bigr)^2}{\sigma_{r^2}\bigl(H^a_{[L_a];L_a+1,L_a+2}\bigr)^2\,\bigl\|\tilde H^{[d]\setminus a}_{d-L_a+1,d-L_a+2;[d-L_a]}\bigr\|_2^2}
\leq \kappa^2\,\frac{\sigma_r\bigl(\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}\bigr)^2\,\sigma_r\bigl(\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}\bigr)^2}{\bigl\|\tilde H^{[d]\setminus a}_{d-L_a+1,d-L_a+2;[d-L_a]}\bigr\|_2^2},$

which follows from (54) and the definition of $\tilde H^{[d]\setminus a}$ in Figure 9.
Observe that

(59) $\frac{\sigma_{r^2}\bigl(T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\bigr)^2}{\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\|_2^2}
\geq \frac{\sigma_{r^2}\bigl(\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\bigr)^2 - 2\|E\|_F\,\sigma_{r^2}\bigl(\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\bigr) + \|E\|_F^2}{\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\|_2^2}
\geq \frac{\sigma_{r^2}\bigl(\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\bigr)^2}{\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\|_2^2} - \frac{2\|E\|_F\,\sigma_{r^2}\bigl(\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\bigr)}{\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\|_2^2},$

where

$\frac{\sigma_{r^2}\bigl(\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\bigr)^2}{\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\|_2^2}
\geq \frac{\sigma_{r^2}\bigl(H^a_{[L_a];L_a+1,L_a+2}\bigr)^2\,\sigma_{r^2}\bigl(\tilde H^{[d]\setminus a}_{d-L_a+1,d-L_a+2;[d-L_a]}\bigr)^2}{\bigl\|H^a_{[L_a];L_a+1,L_a+2}\bigr\|_2^2\,\bigl\|\tilde H^{[d]\setminus a}_{d-L_a+1,d-L_a+2;[d-L_a]}\bigr\|_2^2},$

and assuming $\|E\|_F \leq \sigma_{r^2}\bigl(\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\bigr)$. Such an assumption holds when demanding the lower bound in (59) to be nonnegative, i.e.,

(61) $\frac{\sigma_{r^2}\bigl(\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\bigr)^2}{\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\|_2^2} - \frac{2\|E\|_F\,\sigma_{r^2}\bigl(\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\bigr)}{\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\|_2^2} \geq \frac{1}{\kappa^4} - 2\sqrt{r}\sqrt{1 - \frac{\alpha}{\kappa^4}} \geq 0.$
where $I$ is the identity matrix. Let $P_1^* \in \Pi_1$ be the best rank-$r$ projection for $\hat T_{\mathcal{I}_{c_2}\mathcal{I}_a;\mathcal{I}_{c_1}}$, in the sense that $\hat T_{\mathcal{I}_{c_2}\mathcal{I}_a;\mathcal{I}_{c_1}} P_1^* \approx \hat T_{\mathcal{I}_{c_2}\mathcal{I}_a;\mathcal{I}_{c_1}}$ in the Frobenius norm, and

$P_2^* = \operatorname*{argmin}_{P_2\in\Pi_2}\,\bigl\|(\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I\otimes P_2) - \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}})(P_1^*\otimes I)\bigr\|_F^2.$

Then

(63) $\bigl\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I\otimes P_2^*)(P_1^*\otimes I) - \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}\bigr\|_F^2 \leq 2\|E\|_F^2.$
Proof. To simplify the notation, let $\tilde T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}} := \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I\otimes P_2)$. Then

(64) $\min_{P_2\in\Pi_2}\bigl\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I\otimes P_2)(P_1^*\otimes I) - \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}\bigr\|_F^2
= \min_{P_2\in\Pi_2}\bigl\|(\tilde T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}} - \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}} + \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}})(P_1^*\otimes I) - \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}\bigr\|_F^2
= \min_{P_2\in\Pi_2}\bigl\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I - P_1^*\otimes I)\bigr\|_F^2 + \bigl\|(\tilde T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}} - \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}})(P_1^*\otimes I)\bigr\|_F^2
\leq \min_{P_2\in\Pi_2}\bigl\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I - P_1^*\otimes I)\bigr\|_F^2 + \bigl\|\tilde T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}} - \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}\bigr\|_F^2
= \min_{P_2\in\Pi_2}\bigl\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I - P_1^*\otimes I)\bigr\|_F^2 + \bigl\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I - I\otimes P_2)\bigr\|_F^2.$

The inequality comes from the fact that $P_1^*\otimes I$ is a projection matrix. Next,

(65) $\bigl\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I - P_1^*\otimes I)\bigr\|_F^2 + \min_{P_2\in\Pi_2}\bigl\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I - I\otimes P_2)\bigr\|_F^2
= \min_{P_1\in\Pi_1}\bigl\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I - P_1\otimes I)\bigr\|_F^2 + \min_{P_2\in\Pi_2}\bigl\|\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I - I\otimes P_2)\bigr\|_F^2
\leq 2\|E\|_F^2,$

and we can conclude the lemma. The equality comes from the definition of $P_1^*$, whereas the inequality is due to the facts that $P_1$, $P_2$ are rank-$r$ projectors and that there exists $T$ such that $\hat T = T - E$, where $\mathrm{rank}(T_{\mathcal{I}_{c_1}\mathcal{I}_a;\mathcal{I}_{c_2}}), \mathrm{rank}(T_{\mathcal{I}_{c_1};\mathcal{I}_a\mathcal{I}_{c_2}}) \leq r$.
We are ready to state the final proposition.
Proposition 2. Let

(66) $\hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}} := \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}(I\otimes P_2^*)(P_1^*\otimes I),$

where ``$\dagger$'' is used to denote the pseudoinverse of a matrix, if the upper bound is positive. When $\kappa = 1 + \delta_\kappa$ and $\alpha = 1 - \delta_\alpha$, where $\delta_\kappa, \delta_\alpha \geq 0$ are small parameters, we have

(68) $\frac{\bigl\| H^a_{[L_a];L_a+1,L_a+2} - \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}\bigl(\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}\otimes\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}\bigr)^\dagger\bigr\|_F^2}{\bigl\|H^a_{[L_a];L_a+1,L_a+2}\bigr\|_F^2} \leq O(\delta_\alpha + 4\delta_\kappa).$
Recalling that

(70) $H^a_{[L_a];L_a+1,L_a+2} = T_{\mathcal{I}_a;\mathcal{I}_{c_1},\mathcal{I}_{c_2}}\bigl(\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}\otimes\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}\bigr)^\dagger,$

we have

(71) $\frac{\bigl\| H^a_{[L_a];L_a+1,L_a+2} - \hat T_{\mathcal{I}_a;\mathcal{I}_{c_1}\mathcal{I}_{c_2}}\bigl(\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}\otimes\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}\bigr)^\dagger\bigr\|_F^2}{\bigl\|H^a_{[L_a];L_a+1,L_a+2}\bigr\|_F^2}
\leq \frac{(1+\sqrt{2})^2\,\|E\|_F^2}{\sigma_r\bigl(\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}\bigr)^2\,\sigma_r\bigl(\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}\bigr)^2\,\bigl\|H^a_{[L_a];L_a+1,L_a+2}\bigr\|_F^2}
\leq \frac{(1+\sqrt{2})^2\,\bigl\|\tilde H^{[d]\setminus a}_{d-L_a+1,d-L_a+2;[d-L_a]}\bigr\|_2^2\,\|E\|_F^2}{\sigma_r\bigl(\tilde H^{c_1}_{L_{c_1}+1;[L_{c_1}]}\bigr)^2\,\sigma_r\bigl(\tilde H^{c_2}_{L_{c_2}+1;[L_{c_2}]}\bigr)^2\,\|\hat T\|_F^2}
\leq \frac{(1+\sqrt{2})^2}{\frac{1}{\kappa^6} - \frac{2\sqrt{r}}{\kappa^2}\sqrt{1-\frac{\alpha}{\kappa^4}}}\; 2\sqrt{r}\,\kappa^2\sqrt{1-\frac{\alpha}{\kappa^4}}.$

The first inequality follows from (69) and (70), and the last inequality follows from Corollary 1 and Lemma 2.
When $L_a = L_{c_1} = L_{c_2} = 1$, applying Algorithm 4 to $\hat T$ results in $\hat T$ (represented by the tensors $T^{a,L}$, $T^{a,C}$, and $T^{a,R}$). Therefore, this proposition essentially implies that $T^{a,C}$ approximates $H^a$ up to a gauge transformation.
REFERENCES
[1] I. Affleck, T. Kennedy, E. H. Lieb, and H. Tasaki, Valence bond ground states in isotropic
quantum antiferromagnets, Comm. Math. Phys., 115 (1988), pp. 477--528.
[2] F. G. S. L. Brandao and M. Horodecki, Exponential decay of correlations implies area law,
Comm. Math. Phys., 333 (2015), pp. 761--798.
[3] E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Found. Comput. Math., 9 (2009), 717.
[4] M. Espig, K. K. Naraparaju, and J. Schneider, A note on tensor chain approximation,
Comput. Vis. Sci., 15 (2012), pp. 331--344.
[5] S. Friedland, V. Mehrmann, A. Miedlar, and M. Nkengla, Fast low rank approximations
of matrices and tensors, Electron. J. Linear Algebra, 22 (2011), 67.
[6] F. R. Gantmacher and J. L. Brenner, Applications of the Theory of Matrices,
Dover, Mineola, NY, 2005.
[7] M. B. Hastings and T. Koma, Spectral gap and exponential decay of correlations, Comm.
Math. Phys., 265 (2006), pp. 781--804.
[8] F. L. Hitchcock, The expression of a tensor or a polyadic as a sum of products, J. Math.
Phys., 6 (1927), pp. 164--189.
[9] Y. P. Hong and C.-T. Pan, Rank-revealing QR factorizations and the singular value decom-
position, Math. Comp., 58 (1992), pp. 213--232.
[10] Y. Khoo, J. Lu, and L. Ying, Solving parametric PDE problems with artificial neural net-
works, European J. Appl. Math., 32 (2021), pp. 421--435.
[11] L. Mirsky, Symmetric gauge functions and unitarily invariant norms, Quart. J. Math., 11
(1960), pp. 50--59.
[12] R. Orus, A practical introduction to tensor networks: Matrix product states and projected
entangled pair states, Ann. Phys., 349 (2013), pp. 117--158.
[13] I. Oseledets and E. Tyrtyshnikov, TT-cross approximation for multidimensional arrays,
Linear Algebra Appl., 432 (2010), pp. 70--88.
[14] I. V. Oseledets, Tensor-train decomposition, SIAM J. Sci. Comput., 33 (2011), pp. 2295--2317.
[15] D. Perez-Garcia, F. Verstraete, M. M. Wolf, and J. I. Cirac, Matrix product state
representations, Quantum Inf. Comput., 7 (2007), pp. 401--430.
[16] D. Savostyanov and I. Oseledets, Fast adaptive interpolation of multi-dimensional arrays
in tensor train format, in 2011 7th International Workshop on Multidimensional (nD)
Systems, IEEE, Piscataway, NJ, 2011, pp. 1--8.
[17] L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika, 31
(1966), pp. 279--311.
[18] W. Wang, V. Aggarwal, and S. Aeron, Efficient low rank tensor ring completion, Proceed-
ings of the IEEE International Conference on Computer Vision, IEEE Computer Society,
Los Alamitos, CA, 2017, pp. 5697--5705.
[19] S. R. White, Density matrix formulation for quantum renormalization groups, Phys. Rev.
Lett., 69 (1992), pp. 2863--2866.
[20] M. M. Wolf, F. Verstraete, M. B. Hastings, and J. I. Cirac, Area laws in quantum
systems: Mutual information and correlations, Phys. Rev. Lett., 100 (2008), 070502.
[21] M. Yuan and C.-H. Zhang, On tensor completion via nuclear norm minimization, Found.
Comput. Math., 16 (2016), pp. 1031--1068.
[22] Q. Zhao, G. Zhou, S. Xie, L. Zhang, and A. Cichocki, Tensor Ring Decomposition, preprint,
arXiv:1606.05535, 2016.