This paper has been accepted for publication in the proceedings of the 2nd IEEE Global Conference on Signal and Information Processing (GlobalSIP), which was held in Atlanta, GA, USA in December 2014.

Copyright 2014 IEEE. Published in the 2nd IEEE Global Conference on Signal and Information Processing (GlobalSIP 2014), scheduled for 3-5 December 2014 in Atlanta, GA, USA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.
Learning Multidimensional Fourier Series With Tensor Trains
Sander Wahls*, Visa Koivunen†, H. Vincent Poor‡, and Michel Verhaegen*

*Delft Center for Systems and Control, TU Delft, The Netherlands. Email: {s.wahls,m.verhaegen}@tudelft.nl
†Department of Signal Processing and Acoustics, Aalto University, Finland. Email: [email protected]
‡Department of Electrical Engineering, Princeton University, USA. Email: [email protected]
Abstract—How to learn a function from observations of inputs and noisy outputs is a fundamental problem in machine learning. Often, an approximation of the desired function is found by minimizing a risk functional over some function space. The space of candidate functions should contain good approximations of the true function, but it should also be such that the minimization of the risk functional is computationally feasible. In this paper, finite multidimensional Fourier series are used as candidate functions. Their impressive approximative capabilities are illustrated by showing that Gaussian-kernel estimators can be approximated arbitrarily well over any compact set of bandwidths with a fixed number of Fourier coefficients. However, the solution of the associated risk minimization problem is computationally feasible only if the dimension d of the inputs is small because the number of required Fourier coefficients grows exponentially with d. This problem is addressed by using the tensor train format to model the tensor of Fourier coefficients under a low-rank constraint. An algorithm for least-squares regression is derived and the potential of this approach is illustrated in numerical experiments. The computational complexity of the algorithm grows only linearly both with the number of observations N and the input dimension d.

[...] where $\Sigma = \operatorname{diag}(\sigma_1,\ldots,\sigma_d) > 0$ is a positive-definite weighting matrix and $\|x\|_\Sigma^2 := x^T\Sigma x$, is a popular example for a space of approximation functions. The space $\mathcal{G}_\Sigma$ is $N$-dimensional, which is why the minimization of the risk (2) becomes infeasible for large-scale data sets. In order to reduce the computational complexity, Rahimi and Recht [5] have proposed to replace $\mathcal{G}_\Sigma$ with a lower-dimensional space that is in some sense close to $\mathcal{G}_\Sigma$. With samples $\omega_1,\ldots,\omega_D$ taken from a normal distribution with zero mean and covariance $2\Sigma$, and samples $b_1,\ldots,b_D$ taken from the uniform distribution on $[0,2\pi]$, they proposed to replace $\mathcal{G}_\Sigma$ with [6, p. 3]

$$\mathcal{R}_{\Sigma,D} := \Big\{ f(x) = \sum_{i=1}^{D} \alpha_i \cos(\omega_i^T x + b_i) \;:\; \alpha_i \in \mathbb{R},\; i = 1,\ldots,D \Big\}, \qquad \|f\|_{\mathcal{R}_{\Sigma,D}}^2 := \sum_{i=1}^{D} \alpha_i^2\, e^{\|x[i]\|_\Sigma^2}.$$
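As a concrete illustration of the random-feature construction just described, the following minimal NumPy sketch draws frequencies and phases as above and fits a ridge regressor on the resulting cosine features. The function names, the ridge solver, and the isotropic choice Σ = σI are illustrative assumptions and are not taken from the paper or its published code.

```python
import numpy as np

def random_fourier_features(X, D, sigma, rng=None):
    """Map inputs X (N x d) to D cosine features, following the construction
    of Rahimi and Recht sketched above with uniform weights Sigma = sigma * I."""
    rng = np.random.default_rng(0) if rng is None else rng
    N, d = X.shape
    # Frequencies from a zero-mean normal with covariance 2*Sigma = 2*sigma*I.
    W = rng.normal(scale=np.sqrt(2.0 * sigma), size=(D, d))
    # Phases from the uniform distribution on [0, 2*pi].
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.cos(X @ W.T + b)  # N x D feature matrix

def fit_rff_ridge(X, y, D=200, sigma=1.0, lam=1e-3):
    """Least-squares fit of the coefficients alpha_i with a small ridge penalty."""
    Phi = random_fourier_features(X, D, sigma)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
```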
Using the multidimensional Fourier series of the indicator function $I_X$ of $X$ [12, Ch. 8.1], we find that the quadratic norm satisfies

$$\|f\|_T^2 := \int_X f(x)\overline{f(x)}\,dx = \sum_{l,k\in I^d} c_l \bar{c}_k \int_{\mathbb{R}^d} e^{2\pi i (k-l)^T x/p}\, I_X(x)\,dx = \sum_{l,k\in I^d} c_l \bar{c}_k \prod_{i=1}^{d} \operatorname{sinc}\!\left(\frac{k_i - l_i}{p}\right). \qquad (6)$$
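Equation (6) says that the quadratic norm is a quadratic form in the Fourier coefficients whose kernel factorizes into one sinc matrix per dimension. The NumPy sketch below evaluates it for a generic coefficient array; the concrete index set and the assumption that X is a unit cube centred at the origin (so that the entries reduce to the normalized sinc) are illustrative choices, not taken from the paper.

```python
import numpy as np

def sinc_gram(idx, p):
    """One-dimensional Gram matrix [sinc((k - l)/p)]_{l,k} over the index set idx."""
    idx = np.asarray(idx, dtype=float)
    return np.sinc((idx[None, :] - idx[:, None]) / p)  # np.sinc(x) = sin(pi x)/(pi x)

def quadratic_norm(C, idx, p):
    """||f||_T^2 from (6) for a coefficient array C with one axis per dimension."""
    S = sinc_gram(idx, p)
    T = C
    for _ in range(C.ndim):
        # Contract the sinc matrix onto the leading axis; after C.ndim steps the
        # original axis order is restored because each result is appended last.
        T = np.tensordot(T, S, axes=([0], [0]))
    return float(np.real(np.vdot(C, T)))
```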
The following proposition demonstrates that $T_{m,p}$ provides arbitrarily good approximations of whole families of Gaussian kernels if the parameters $m$ and $p$ are chosen large enough. [...]

$$\sup_{x\in X} |g(x) - f(x)| \le \sum_{j=1}^{M}\sum_{i=1}^{N} |\alpha_{i,j}|\,|\beta_j|. \qquad (8)$$

Proof: Since all elements in $E$ are positive-definite and $E$ is [...]

[...] such that $c_l = G_1(l_1)\cdots G_d(l_d)$ [11]. We denote the set of all such tensor trains by $T^r_m$. The corresponding subset of $T_{m,p}$ is

$$T^r_{m,p} := \big\{ f(x) = \llbracket C, D(x)\rrbracket \;:\; C \in T^r_m \big\}.$$

At this point, note that $D(x)$ is a tensor train of rank one because its entries factor into products of univariate terms. The next lemma shows how functions in $T^r_{m,p}$ can be evaluated efficiently.

Lemma 3 (In part from [11], p. 2309). Consider two tensor trains

$$C = [c_l]_{l\in I^d} \in T^r_m, \qquad c_l = G_1(l_1)\cdots G_d(l_d), \qquad (10)$$
$$Z = [z_l]_{l\in I^d} \in T^1_m, \qquad z_l = z_1(l_1)\cdots z_d(l_d), \quad [...]$$
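The conclusion of Lemma 3 is not reproduced above, but the kind of evaluation it enables can be sketched: contracting the coefficient tensor train against a rank-one tensor such as D(x) reduces to a product of small matrices, one per core. The helper below is a generic illustration of that contraction, not the paper's implementation; the Fourier choice of the factors mentioned in the comment is likewise only an assumption consistent with the series used here.

```python
import numpy as np

def tt_dot_rank_one(cores, factors):
    """Contract a tensor train (cores[k] has shape (|I|, r_{k-1}, r_k), holding the
    matrices G_k(l)) with a rank-one tensor whose k-th factor is the vector factors[k]."""
    acc = np.ones((1, 1))
    for G, z in zip(cores, factors):
        # Gamma_k = sum_l z[l] * G_k(l): collapse the mode index of the k-th core.
        Gamma = np.tensordot(z, G, axes=([0], [0]))
        acc = acc @ Gamma
    return acc.item()  # 1 x 1 result, since the boundary ranks are one

# For f(x) = <<C, D(x)>> one would choose the factors as univariate Fourier samples,
# e.g. factors[k] = np.exp(2j * np.pi * idx * x[k] / p) over the index set idx.
```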
[...] has the well-known property that $AXB = C$ for arbitrary matrices $A$, $B$, $C$ and $X$ of compatible dimensions if and only if $(B^T \otimes A)\operatorname{vec}(X) = \operatorname{vec}(C)$ [16]. Thus, with $B = I_r$, one obtains $\operatorname{vec}(\Gamma_k) = (I_r \otimes H_k)\operatorname{vec}(G_k^L)$. We find that [...]
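The vectorization identity quoted above is standard and easy to check numerically; the snippet below is only such a sanity check (note that it uses the column-major vec convention that the identity assumes).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.reshape(-1, order="F")  # column-major vectorization

# AXB = C  <=>  (B^T kron A) vec(X) = vec(C)
assert np.allclose(np.kron(B.T, A) @ vec(X), vec(A @ X @ B))
```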
$$\operatorname{tr}\Big\{\sum_{l_k,s_k\in I} A_k^{*} G_k^{*}(s_k)\, B_k^{*} B_k\, G_k(l_k)\, A_k \operatorname{sinc}\Big(\frac{s_k - l_k}{p}\Big)\Big\}
= \operatorname{tr}\Big\{\sum_{s\in I} A_k^{*} G_k^{*}(s)\Big(\sum_{l\in I}\operatorname{sinc}\Big(\frac{s-l}{p}\Big) B_k^{*} B_k\, G_k(l)\Big) A_k\Big\}
= \operatorname{tr}\big\{A_k^{*} (G_k^{L})^{*} (S^{*}S \otimes B_k^{*}B_k)\, G_k^{L} A_k\big\}
= \big\|(S\otimes B_k)\, G_k^{L} A_k\big\|_F^2 = \text{RHS of (13)}.$$

$$A_k A_k^{*} = \sum_{l_{k+1},s_{k+1}\in I}\cdots\sum_{l_d,s_d\in I} G_{k+1}(l_{k+1})\cdots G_d(l_d)\, G_d^{*}(s_d)\cdots G_{k+1}^{*}(s_{k+1})\, \operatorname{sinc}\Big(\frac{s_d - l_d}{p}\Big)\cdots\operatorname{sinc}\Big(\frac{s_{k+1} - l_{k+1}}{p}\Big), \qquad (14)$$

$$B_k^{*} B_k = \sum_{l_{k-1},s_{k-1}\in I}\cdots\sum_{l_1,s_1\in I} G_{k-1}^{*}(s_{k-1})\cdots G_1^{*}(s_1)\, G_1(l_1)\cdots G_{k-1}(l_{k-1})\, \operatorname{sinc}\Big(\frac{s_1 - l_1}{p}\Big)\cdots\operatorname{sinc}\Big(\frac{s_{k-1} - l_{k-1}}{p}\Big). \qquad (15)$$

Remark 5. The matrix $[\operatorname{sinc}(\frac{l-s}{p})]_{l,s}$ is positive semi-definite by (6).

Remark 6. The matrices $\Gamma_1,\ldots,\Gamma_d$ in Lemma 3 can be computed using $O(dmr^2)$ floating point operations (flops). Forming $L_k$ and $R_k$ then takes $O(dr^2)$ flops because $\Gamma_1$ and $\Gamma_d$ are vectors. The computation of $A_k$ and $B_k$ in Lemma 4 requires $O(dm^2r^3)$ flops.
B. Risk Minimization Over Low-rank Coefficient Tensors

The risk (2) is in general not convex over $F = T^r_{m,p}$. We propose to use an alternating least squares approach as in [14] and [15] to find a local minimum. For the quadratic loss $\ell(x,y) = |x-y|^2$, the risk (2) can be rewritten as follows. Let $f(x) = \llbracket C, D(x)\rrbracket$, where the coefficient tensor train $C = [c_l]$ is given by $c_l = G_1(l_1)\cdots G_d(l_d)$. Then, for any $k$ and with $Z_i := D(x[i])$, Lemmas 3 and 4 show that the risk can be written as

$$R_{\mathrm{emp}}(f) = \frac{1}{N}\sum_{j=1}^{N} \big|y[j] - \llbracket C, D(x[j])\rrbracket\big|^2 + \lambda\|f\|_{T_{m,p}}^2 = \frac{1}{N}\left\| \begin{bmatrix} y[1]\\ \vdots\\ y[N]\\ 0 \end{bmatrix} - \begin{bmatrix} R_k(Z_1)^T \otimes L_k(Z_1)H_k(Z_1)\\ \vdots\\ R_k(Z_N)^T \otimes L_k(Z_N)H_k(Z_N)\\ \sqrt{N\lambda}\, A_k^T \otimes S \otimes B_k \end{bmatrix} \operatorname{vec}(G_k^L) \right\|^2. \qquad (16)$$

The size of the coefficient matrix in (16) is $(N + mr^2)\times mr^2$ for $k \neq 1, d$. Thus, a single core $\{G_k(l)\}_{l\in I}$ of the tensor $C$ can be updated by solving the linear least squares problem to minimize (16). In the alternating least squares approach, the cores are updated sequentially. In one iteration of the algorithm, first $G_1$ is updated by minimizing (16), then $G_2$ is updated in the same way, etc., until $G_d$ has been updated. The iterations are repeated until convergence.
An important implementation detail arises because the representation $c_l = G_1(l_1)\cdots G_d(l_d)$ of a tensor train is highly non-unique. To avoid numerical problems, the tensor train should be stored using a canonical representation. A representation $c_l = G_1(l_1)\cdots G_d(l_d)$ [...] instead be directly orthonormalized using, e.g., the modified Gram-Schmidt method. Algorithm 1 provides an overview of the procedure.

Remark 7. Algorithm 1 converges locally around a local minimum if the Hessian of this minimum has maximal rank [15, Corollary 2.9].

Remark 8. The minimization of (16) in a least-squares sense requires $O((N + mr^2)m^2r^4)$ flops using standard techniques. The orthonormalization of a core requires $O(mr^3)$ flops. Remark 6 implies that forming the coefficient matrix in (16) in general requires $O(d(N + mr)mr^2)$ flops. In Algorithm 1, however, where the cores are updated sequentially, it is possible to do this more efficiently. The matrices $L_k$ can be updated as $L_k = L_{k-1}\Gamma_{k-1}$. The matrices $R_1,\ldots,R_d$ can be efficiently precomputed at the beginning of each iteration (when $k = 1$) using the formula $R_{j-1} = \Gamma_j R_j$ because the $R_{k+1},\ldots,R_d$ are independent of $\Gamma_k$. A similar strategy may be used to cope with the regularization matrices $A_k$ and $B_k$. In this way, the costs of finding the coefficient matrix in (16) can be reduced to $O((N + mr)mr^2)$ flops. Then, the total cost of updating one core is $O((N + mr^2)m^2r^4)$ flops, and a complete iteration in Algorithm 1 can be carried out using only $O(d(N + mr^2)m^2r^4)$ flops.
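To make the sweep just described concrete, here is a schematic NumPy sketch of updating one core at a time by linear least squares and cycling over the cores. It is only an illustration of the control flow under simplified assumptions: it omits the regularization block of (16), the efficient L_k/R_k bookkeeping of Remark 8, and the orthonormalization step, and none of the names come from the paper's code.

```python
import numpy as np

def gamma(core, z):
    # Gamma_k = sum_l z[l] * G_k(l); core has shape (|I|, r_left, r_right).
    return np.tensordot(z, core, axes=([0], [0]))

def als_sweeps(cores, Z, y, n_iter=10, lam=1e-6):
    """Schematic alternating least squares over the cores of a tensor train.
    Z[i][k] is the univariate feature vector of sample i in dimension k and y[i]
    the corresponding output. The regularization block of (16), the efficient
    L_k/R_k updates of Remark 8, and the orthonormalization step are omitted."""
    y = np.asarray(y)
    d, N = len(cores), len(y)
    for _ in range(n_iter):
        for k in range(d):
            rows = []
            for i in range(N):
                # Partial products of the core matrices to the left/right of core k.
                L = np.ones((1, 1))
                for j in range(k):
                    L = L @ gamma(cores[j], Z[i][j])
                R = np.ones((1, 1))
                for j in range(d - 1, k, -1):
                    R = gamma(cores[j], Z[i][j]) @ R
                # The prediction is linear in core k: the coefficient of
                # G_k[l, a, b] is Z[i][k][l] * L[0, a] * R[b, 0].
                rows.append(np.kron(Z[i][k], np.kron(L.ravel(), R.ravel())))
            A = np.vstack(rows)
            g = np.linalg.solve(A.conj().T @ A + lam * np.eye(A.shape[1]),
                                A.conj().T @ y)
            cores[k] = g.reshape(cores[k].shape)
    return cores
```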
IV. NUMERICAL EXPERIMENTS

Setup: We have benchmarked Algorithm 1 for minimizing (2) with $F = T^r_{m,p}$ and $\ell(x,y) = |x-y|^2$ against standard kernel ridge regression [21] (KRR, $F = \mathcal{G}_\Sigma$) and random Fourier features (RFF, $F = \mathcal{R}_{\Sigma,D}$) for several data sets that have been downloaded from [22]. Each data set has first been randomly permuted and then partitioned into a training data set (70% of the data) and a testing data set (30% of the data). Parameters which are not given in Figure 1 have been chosen by performing a grid search. Each combination of the parameters was evaluated by performing a 5-fold cross validation on the training data. The predictors with respect to the best parameters were then trained on the training data and evaluated on the test data. The reported errors are average values taken over 10 experiments.

Implementation Details: The inputs $x[1],\ldots,x[N]$ have been rescaled (all with the same scalar) such that $x[1],\ldots,x[N] \in X$ with $X$ as in (1). Uniform weights $\Sigma = \sigma I$ have been used in order to keep the grid search feasible. The dimension of the random Fourier features was chosen equal to the number of floats needed to store the tensor train used in Algorithm 1: $D = 2mr + (d-2)mr^2$. When random initializations were used (Alg. 1 and RFF), three different initializations have been evaluated and the one with the smallest training error was used. Algorithm 1 always performed 10 iterations. The source code is available online at http://bitbucket.com/wahls/mdfourier.
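The evaluation protocol above (70/30 split, grid search scored by 5-fold cross validation on the training part, averaging over repeated runs) can be sketched generically as follows; the helper names and the scoring convention are illustrative and independent of the paper's code.

```python
import numpy as np

def kfold_indices(n, k=5, rng=None):
    """Split n sample indices into k folds after a random permutation."""
    rng = np.random.default_rng(0) if rng is None else rng
    return np.array_split(rng.permutation(n), k)

def grid_search(fit, score, X, y, grid, k=5):
    """Return the parameter combination with the best average validation error.
    `fit(Xtr, ytr, params)` returns a model; `score(model, Xval, yval)` returns an
    error to be minimized (both are placeholders for any of the compared predictors)."""
    best_params, best_err = None, np.inf
    for params in grid:
        errs = []
        for val_idx in kfold_indices(len(y), k):
            tr_idx = np.setdiff1d(np.arange(len(y)), val_idx)
            model = fit(X[tr_idx], y[tr_idx], params)
            errs.append(score(model, X[val_idx], y[val_idx]))
        if np.mean(errs) < best_err:
            best_params, best_err = params, float(np.mean(errs))
    return best_params
```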
Results: The results are reported in Figure 1. Algorithm 1 has been able to perform similarly to kernel ridge regression with less resources ($D$ floats instead of $N$) on all three data sets. Algorithm 1 performed better than random Fourier features that have been provided the same amount of memory in all cases. On the airfoil and yacht data sets, the test error could be reduced significantly by a factor of three and five, respectively. The high average test error for random Fourier features on the airfoil data set was caused by a single experiment (out of ten).

V. CONCLUSION

The numerical experiments have confirmed the approximative capabilities of multidimensional Fourier series even if a low-rank constraint is placed on the tensor of Fourier coefficients. The proposed algorithm performs as well as kernel ridge regression and better [...]

[...]
$$\sup_{\sigma\in[\underline{\sigma},\bar{\sigma}]} \frac{\sqrt{\pi}}{p\sqrt{\sigma}}\Big(1 + \frac{2\sigma p^2}{m\pi^2}\Big) \le \cdots \le 2\sqrt{\frac{\pi}{\underline{\sigma}}}
\;\Longrightarrow\;
\sup_{\sigma\in[\underline{\sigma},\bar{\sigma}]} E_2(m,p,\sigma) \le \sup_{\sigma\in[\underline{\sigma},\bar{\sigma}]} e^{-\frac{m^2\pi^2}{4p^2\sigma}}\, \sup_{\sigma\in[\underline{\sigma},\bar{\sigma}]} \frac{\sqrt{\pi}}{p\sqrt{\sigma}}\Big(1 + \frac{2\sigma p^2}{m\pi^2}\Big). \qquad (18)$$

The claim now follows from Lemma 9, (17) and (18):
$$\sup\;\sup\;\Big| e^{-\sigma x^2} - \sum_{l} b_l\, e^{2\pi i l x/p}\Big| \le \cdots$$

[...] $\sup_{\sigma\in[\underline{\sigma},\bar{\sigma}]} \exp\!\big(-\sigma(2p-1)^2/4\big) + \exp\!\big(-\sigma(2p+1)^2/4\big) \le 8\varepsilon$ and $\sup_{\sigma\in[\underline{\sigma},\bar{\sigma}]} 1/(\sigma(2p-1)) + 1/(\sigma(2p+1)) \le 1$. Then, [...]

Let us now fix arbitrary $x, y \in X$, and define $e_i(x,y,\Sigma) := e^{-\sigma_i(x_i-y_i)^2} - g_{\sigma_i}(x_i-y_i)$. Note that $|e_i| \le \varepsilon/2$ by (19). The [...]