
CS 598 EVS: Tensor Computations

Basics of Tensor Computations

Edgar Solomonik

University of Illinois at Urbana-Champaign


Tensors
A tensor is a collection of elements
▶ its dimensions define the size of the collection
▶ its order is the number of different dimensions
▶ specifying an index along each tensor mode defines an element of the tensor

A few examples of tensors are
▶ Order 0 tensors are scalars, e.g., s ∈ R
▶ Order 1 tensors are vectors, e.g., v ∈ R^n
▶ Order 2 tensors are matrices, e.g., A ∈ R^(m×n)
▶ An order 3 tensor with dimensions s1 × s2 × s3 is denoted as T ∈ R^(s1×s2×s3)
  with elements t_ijk for i ∈ {1, ..., s1}, j ∈ {1, ..., s2}, k ∈ {1, ..., s3}
Reshaping Tensors

It's often helpful to use alternative views of the same collection of elements
▶ Folding a tensor yields a higher-order tensor with the same elements
▶ Unfolding a tensor yields a lower-order tensor with the same elements
▶ In linear algebra, we have the unfolding v = vec(A), which stacks the
  columns of A ∈ R^(m×n) to produce v ∈ R^(mn)
▶ For a tensor T ∈ R^(s1×s2×s3), v = vec(T) gives v ∈ R^(s1 s2 s3) with
      v_{i+(j−1)s1+(k−1)s1 s2} = t_ijk
▶ A common set of unfoldings is given by matricizations of a tensor, e.g., for
  order 3,
      T_(1) ∈ R^(s1 × s2 s3),  T_(2) ∈ R^(s2 × s1 s3),  and  T_(3) ∈ R^(s3 × s1 s2)
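These reshapes can be written in a few lines of numpy. The following is a minimal sketch (not from the lecture), assuming Fortran (column-major) ordering so that the vec index formula above holds; the matricize helper name and its column ordering are one common convention among several.

```python
# Minimal numpy sketch (assumed, not from the lecture) of vec and mode-m
# matricization; order='F' matches v_{i+(j-1)s1+(k-1)s1 s2} = t_ijk.
import numpy as np

s1, s2, s3 = 2, 3, 4
T = np.random.rand(s1, s2, s3)

# vec(T): flatten with the first mode varying fastest
v = T.reshape(-1, order='F')
i, j, k = 1, 2, 3                                  # zero-based indices
assert np.isclose(v[i + j*s1 + k*s1*s2], T[i, j, k])

def matricize(T, m):
    """Mode-m matricization: mode m indexes rows, the remaining modes are
    flattened into columns (one common convention among several)."""
    return np.moveaxis(T, m, 0).reshape(T.shape[m], -1, order='F')

T1, T2, T3 = matricize(T, 0), matricize(T, 1), matricize(T, 2)
print(T1.shape, T2.shape, T3.shape)                # (2, 12) (3, 8) (4, 6)
```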
Matrices and Tensors as Operators and Multilinear Forms
▶ What is a matrix?
  ▶ A collection of numbers arranged into an array of dimensions m × n, e.g.,
    M ∈ R^(m×n)
  ▶ A linear operator f_M(x) = Mx
  ▶ A bilinear form x^T M y
▶ What is a tensor?
  ▶ A collection of numbers arranged into an array of a particular order, with
    dimensions l × m × n × ···, e.g., T ∈ R^(l×m×n) is order 3
  ▶ A multilinear operator z = f_T(x, y), with
        z_i = ∑_{j,k} t_ijk x_j y_k
  ▶ A multilinear form
        ∑_{i,j,k} t_ijk x_i y_j z_k
Tensor Transposition
For tensors of order ≥ 3, there is more than one way to transpose modes
▶ A tensor transposition is defined by a permutation p containing the elements
  {1, ..., d}:
      y_{i_p1, ..., i_pd} = x_{i1, ..., id}
  (see the numpy sketch after this list)
▶ In this notation, a transposition A^T of a matrix A is defined by p = [2, 1], so that
      b_{i2 i1} = a_{i1 i2}
▶ Tensor transposition is a convenient primitive for manipulating
  multidimensional arrays and mapping tensor computations to linear algebra
▶ When elementwise expressions are used in tensor algebra, indices are often
  carried through to avoid transpositions
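In numpy, a tensor transposition is np.transpose; this minimal sketch (assumed, not from the lecture) applies the permutation p = [2, 3, 1] in the notation above (zero-based axes=(1, 2, 0)).

```python
# Minimal numpy sketch (assumed) of a tensor transposition
# y_{i_p1,...,i_pd} = x_{i1,...,id}.
import numpy as np

X = np.random.rand(3, 4, 5)
Y = np.transpose(X, axes=(1, 2, 0))         # p = [2, 3, 1] in one-based notation
assert Y.shape == (4, 5, 3)
assert np.isclose(Y[2, 3, 1], X[1, 2, 3])   # y_{i2, i3, i1} = x_{i1, i2, i3}
```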
Tensor Symmetry
We say a tensor is symmetric if ∀j, k ∈ {1, ..., d}
      t_{i1...ij...ik...id} = t_{i1...ik...ij...id}
A tensor is antisymmetric (skew-symmetric) if ∀j, k ∈ {1, ..., d}
      t_{i1...ij...ik...id} = (−1) t_{i1...ik...ij...id}
A tensor is partially symmetric if such index interchanges are restricted to be
within disjoint subsets of {1, ..., d}, e.g., if the subsets for d = 4 are {1, 2} and
{3, 4}, then
      t_ijkl = t_jikl = t_jilk = t_ijlk
Tensor Sparsity
We say a tensor T is diagonal if for some v,
      t_{i1,...,id} = { v_{i1} : i1 = ··· = id
                      { 0     : otherwise
                    = v_{i1} δ_{i1 i2} δ_{i2 i3} ··· δ_{i(d−1) id}
▶ In the literature, such tensors are sometimes also referred to as 'superdiagonal'
▶ Generalizes a diagonal matrix
▶ A diagonal tensor is symmetric (and not antisymmetric)
If most of the tensor entries are zeros, the tensor is sparse
▶ Generalizes the notion of sparse matrices
▶ Sparsity enables computational and memory savings
▶ We will consider data structures and algorithms for sparse tensor operations
  later in the course
Tensor Products and Kronecker Products
Tensor products can be defined with respect to maps f : Vf → Wf and g : Vg → Wg
      h = f × g  ⇒  h : (Vf × Vg) → (Wf × Wg),  h(x, y) = f(x) g(y)
Tensors can be used to represent multilinear maps and have a corresponding
definition for a tensor product
      T = X × Y  ⇒  t_{i1,...,im,j1,...,jn} = x_{i1,...,im} y_{j1,...,jn}
The Kronecker product between two matrices A ∈ R^(m1×m2), B ∈ R^(n1×n2),
      C = A ⊗ B  ⇒  c_{i2+(i1−1)n1, j2+(j1−1)n2} = a_{i1 j1} b_{i2 j2}
corresponds to transposing and unfolding the tensor product
Tensor Contractions
A tensor contraction multiplies elements of two tensors and computes partial
sums to produce a third, in a fashion expressible by pairing up modes of different
tensors, defining an einsum (the term stems from Einstein's summation convention)

  tensor contraction       einsum
  inner product            w = ∑_i u_i v_i
  outer product            w_ij = u_i v_j
  pointwise product        w_i = u_i v_i
  Hadamard product         w_ij = u_ij v_ij
  matrix multiplication    w_ij = ∑_k u_ik v_kj
  batched mat.-mul.        w_ijl = ∑_k u_ikl v_kjl
  tensor times matrix      w_ilk = ∑_j u_ijk v_lj

The terms 'contraction' and 'einsum' are also often used when more than two
operands are involved (numpy einsum strings for the table entries are sketched below)
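For concreteness, the table entries translate directly into numpy.einsum strings; this is a minimal sketch with assumed shapes, not part of the lecture.

```python
# Minimal numpy.einsum sketch (assumed) of the contractions listed above.
import numpy as np

n = 4
u, v = np.random.rand(n), np.random.rand(n)
U, V = np.random.rand(n, n), np.random.rand(n, n)
T, S = np.random.rand(n, n, n), np.random.rand(n, n, n)

w_inner = np.einsum('i,i->', u, v)            # inner product
W_outer = np.einsum('i,j->ij', u, v)          # outer product
w_point = np.einsum('i,i->i', u, v)           # pointwise product
W_hadam = np.einsum('ij,ij->ij', U, V)        # Hadamard product
W_mm    = np.einsum('ik,kj->ij', U, V)        # matrix multiplication
W_bmm   = np.einsum('ikl,kjl->ijl', T, S)     # batched mat.-mul.
W_ttm   = np.einsum('ijk,lj->ilk', T, U)      # tensor times matrix
```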
General Tensor Contractions
Given a tensor U of order s + v and a tensor V of order v + t, a tensor contraction summing
over v modes can be written as
      w_{i1...is j1...jt} = ∑_{k1...kv} u_{i1...is k1...kv} v_{k1...kv j1...jt}
▶ This form omits 'Hadamard indices', i.e., indices that appear in both inputs
  and the output (as with pointwise product, Hadamard product, and batched
  mat.-mul.)
▶ Other contractions can be mapped to this form after transposition
Unfolding the tensors reduces the tensor contraction to matrix multiplication
▶ Combine (unfold) consecutive indices in appropriate groups of size s, t, or v
▶ If all tensor modes are of dimension n, obtain matrix–matrix product
  C = AB, where C ∈ R^(n^s × n^t), A ∈ R^(n^s × n^v), and B ∈ R^(n^v × n^t)
  (see the sketch after this list)
▶ Assuming classical matrix multiplication, contraction requires n^(s+t+v)
  elementwise products and n^(s+t+v) − n^(s+t) additions
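The reduction to matrix multiplication can be checked numerically; the following minimal sketch (assumed, with s = t = 2 and v = 1) compares an einsum contraction against the product of the unfoldings.

```python
# Minimal numpy sketch (assumed): a contraction over v modes equals a matmul
# of unfoldings, here with s = t = 2 and v = 1 so U, V have order 3 and W order 4.
import numpy as np

n, s, v, t = 3, 2, 1, 2
U = np.random.rand(*([n] * (s + v)))
V = np.random.rand(*([n] * (v + t)))

# w_{i1 i2 j1 j2} = sum_k u_{i1 i2 k} v_{k j1 j2}
W_einsum = np.einsum('abk,kcd->abcd', U, V)

# unfold to an (n^s x n^v) times (n^v x n^t) matrix product, then fold back
A = U.reshape(n**s, n**v)
B = V.reshape(n**v, n**t)
W_matmul = (A @ B).reshape([n] * (s + t))

assert np.allclose(W_einsum, W_matmul)
```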
Properties of Einsums
Given an elementwise expression containing a product of tensors, the operands
commute
▶ For example AB ≠ BA, but
      ∑_k a_ik b_kj = ∑_k b_kj a_ik
▶ Similarly, with multiple terms, we can bring summations out and reorder as
  needed, e.g., for ABC
      ∑_k a_ik ( ∑_l b_kl c_lj ) = ∑_{kl} c_lj b_kl a_ik
A contraction can be succinctly described by a tensor diagram
▶ Indices in contractions are only meaningful insofar as they are matched up
▶ A tensor diagram is defined by a graph with a vertex for each tensor and an
  edge/leg for each index/mode
▶ Indices that are not summed are drawn by pointing the legs/edges into
  whitespace
Matrix-style Notation for Tensor Contractions
The tensor times matrix contraction along the mth mode of U to produce W is
expressed as follows
      W = U ×m V  ⇒  W_(m) = V U_(m)
▶ W_(m) and U_(m) are unfoldings where the mth mode is mapped to be an index
  into rows of the matrix
▶ To perform multiple tensor times matrix products, we can write, e.g.,
      W = U ×1 X ×2 Y ×3 Z  ⇒  w_ijk = ∑_{pqr} u_pqr x_ip y_jq z_kr
The Khatri-Rao product of two matrices U ∈ R^(m×k) and V ∈ R^(n×k) produces
W ∈ R^(mn×k) so that
      W = [ u_1 ⊗ v_1  ···  u_k ⊗ v_k ]
The Khatri-Rao product computes the einsum ŵ_ijk = u_ik v_jk and then unfolds Ŵ so
that w_{j+(i−1)n, k} = ŵ_ijk (a numpy sketch follows below)
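A minimal numpy sketch of the Khatri-Rao product along these lines (assumed, not from the lecture):

```python
# Minimal numpy sketch (assumed) of the Khatri-Rao product: column r of W is
# the Kronecker product of column r of U with column r of V.
import numpy as np

m, n, k = 3, 4, 5
U, V = np.random.rand(m, k), np.random.rand(n, k)

W_hat = np.einsum('ik,jk->ijk', U, V)        # w_hat[i, j, r] = U[i, r] * V[j, r]
W = W_hat.reshape(m * n, k)                  # unfold the (i, j) modes into rows

# check against explicit per-column Kronecker products
W_ref = np.column_stack([np.kron(U[:, r], V[:, r]) for r in range(k)])
assert np.allclose(W, W_ref)
```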
Identities with Kronecker and Khatri-Rao Products
▶ Matrix multiplication is distributive over the Kronecker product
      (A ⊗ B)(C ⊗ D) = AC ⊗ BD
  we can derive this from the einsum expression
      ∑_{kl} a_ik b_jl c_kp d_lq = ( ∑_k a_ik c_kp )( ∑_l b_jl d_lq )
▶ For the Khatri-Rao product, a similar distributive identity is
      (A ⊙ B)^T (C ⊙ D) = A^T C ∗ B^T D
  where ∗ denotes the Hadamard product, which holds since
      ∑_{kl} a_ki b_li c_kj d_lj = ( ∑_k a_ki c_kj )( ∑_l b_li d_lj )
  (a numeric check is sketched below)
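Both identities are easy to verify numerically; the minimal sketch below (assumed, using a hypothetical khatri_rao helper) checks them for random matrices.

```python
# Minimal numpy sketch (assumed) verifying the Kronecker and Khatri-Rao identities.
import numpy as np

def khatri_rao(A, B):
    """Columnwise Kronecker product of A (m x k) and B (n x k)."""
    m, k = A.shape
    n, _ = B.shape
    return np.einsum('ik,jk->ijk', A, B).reshape(m * n, k)

# (A ⊗ B)(C ⊗ D) = AC ⊗ BD
A, B = np.random.rand(3, 4), np.random.rand(5, 4)
C, D = np.random.rand(4, 2), np.random.rand(4, 2)
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# (A ⊙ B)^T (C ⊙ D) = (A^T C) * (B^T D)
A, B = np.random.rand(4, 3), np.random.rand(5, 3)
C, D = np.random.rand(4, 3), np.random.rand(5, 3)
assert np.allclose(khatri_rao(A, B).T @ khatri_rao(C, D), (A.T @ C) * (B.T @ D))
```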
Multilinear Tensor Operations
Given an order d tensor T, define the multilinear function x^(1) = f^(T)(x^(2), ..., x^(d))
▶ For an order 3 tensor (see the numpy sketch after this list),
      x^(1)_{i1} = ∑_{i2,i3} t_{i1 i2 i3} x^(2)_{i2} x^(3)_{i3}
      ⇒  f^(T)(x^(2), x^(3)) = T ×2 x^(2) ×3 x^(3) = T_(1) ( x^(2) ⊗ x^(3) )
▶ For an order 2 tensor, we simply have the matrix-vector product y = Ax
▶ For higher order tensors, we define the function as follows:
      x^(1)_{i1} = ∑_{i2...id} t_{i1...id} x^(2)_{i2} ··· x^(d)_{id}
      ⇒  f^(T)(x^(2), ..., x^(d)) = T ×2 x^(2) ··· ×d x^(d) = T_(1) ( x^(2) ⊗ ··· ⊗ x^(d) )
▶ More generally, we can associate d functions with a tensor T, one for each choice of
  output mode; for output mode m, we can compute
      x^(m) = T_(m) ( x^(1) ⊗ ··· ⊗ x^(m−1) ⊗ x^(m+1) ⊗ ··· ⊗ x^(d) )
  which gives f^(T̃) where T̃ is a transposition of T defined so that T̃_(1) = T_(m)
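A minimal numpy sketch (assumed, not from the lecture) of f^(T) for an order 3 tensor, both elementwise and via the mode-1 unfolding:

```python
# Minimal numpy sketch (assumed) of x^(1) = f^(T)(x^(2), x^(3)) for order 3.
import numpy as np

n = 4
T = np.random.rand(n, n, n)
x2, x3 = np.random.rand(n), np.random.rand(n)

x1 = np.einsum('ijk,j,k->i', T, x2, x3)              # x1_i = sum_{jk} t_ijk x2_j x3_k
x1_unfolded = T.reshape(n, n * n) @ np.kron(x2, x3)  # T_(1) (x^(2) ⊗ x^(3))
assert np.allclose(x1, x1_unfolded)
```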


Batched Multilinear Operations
The multilinear map f^(T) is frequently used in tensor computations
▶ Two common primitives (MTTKRP and TTMc) correspond to sets (batches) of
  multilinear function evaluations
▶ Given a tensor T ∈ R^(n×···×n) and matrices U^(1), ..., U^(d) ∈ R^(n×R), the
  matricized tensor times Khatri-Rao product (MTTKRP) computes
      u^(1)_{i1 r} = ∑_{i2...id} t_{i1...id} u^(2)_{i2 r} ··· u^(d)_{id r}
  which we can express columnwise as
      u^(1)_r = f^(T)(u^(2)_r, ..., u^(d)_r) = T ×2 u^(2)_r ··· ×d u^(d)_r = T_(1) ( u^(2)_r ⊗ ··· ⊗ u^(d)_r )
▶ With the same inputs, the tensor-times-matrix chain (TTMc) computes
      u^(1)_{i1 r2...rd} = ∑_{i2...id} t_{i1...id} u^(2)_{i2 r2} ··· u^(d)_{id rd}
  which we can express columnwise as
      u^(1)_{r2...rd} = f^(T)(u^(2)_{r2}, ..., u^(d)_{rd})
  (numpy sketches of both primitives follow below)
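For an order 3 tensor both primitives are single einsums; a minimal sketch (assumed, not from the lecture):

```python
# Minimal numpy sketch (assumed) of MTTKRP and TTMc for an order-3 tensor T
# with factor matrices U2, U3 of R columns each.
import numpy as np

n, R = 4, 3
T = np.random.rand(n, n, n)
U2, U3 = np.random.rand(n, R), np.random.rand(n, R)

# MTTKRP: u1[i, r] = sum_{jk} T[i, j, k] U2[j, r] U3[k, r]
U1_mttkrp = np.einsum('ijk,jr,kr->ir', T, U2, U3)

# TTMc: u1[i, p, q] = sum_{jk} T[i, j, k] U2[j, p] U3[k, q]
U1_ttmc = np.einsum('ijk,jp,kq->ipq', T, U2, U3)

# column r of the MTTKRP output is one multilinear function evaluation
r = 1
assert np.allclose(U1_mttkrp[:, r],
                   np.einsum('ijk,j,k->i', T, U2[:, r], U3[:, r]))
```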
Tensor Norm and Conditioning of Multilinear Functions
We can define elementwise and operator norms for a tensor T
▶ The tensor Frobenius norm generalizes the matrix Frobenius norm
      ||T||_F = ( ∑_{i1...id} |t_{i1...id}|^2 )^(1/2) = ||vec(T)||_2 = ||T_(m)||_F
▶ Denoting S^(n−1) ⊂ R^n as the unit sphere (the set of vectors with norm one), we
  define the tensor operator (spectral) norm to generalize the matrix 2-norm as
      ||T||_2 = sup_{x^(1),...,x^(d) ∈ S^(n−1)} ∑_{i1...id} t_{i1...id} x^(1)_{i1} ··· x^(d)_{id}
             = sup_{x^(1),...,x^(d) ∈ S^(n−1)} ⟨ x^(1), f^(T)(x^(2), ..., x^(d)) ⟩
             = sup_{x^(2),...,x^(d) ∈ S^(n−1)} ||f^(T)(x^(2), ..., x^(d))||_2
▶ These norms satisfy the following inequalities:
      max_{i1...id} |t_{i1...id}| ≤ ||T||_2 ≤ ||T||_F   and   ||T ×m M||_2 ≤ ||T||_2 ||M||_2
Conditioning of Multilinear Functions
Evaluation of the multilinear map is typically ill-posed for worst-case inputs
▶ The conditioning of evaluating f^(T)(x^(2), ..., x^(d)) with x^(2), ..., x^(d) ∈ S^(n−1)
  with respect to perturbation in a variable x^(m) for any m ≥ 2 is
      κ^(m)_{f^(T)}(x^(2), ..., x^(d)) = ||J^(m)_{f^(T)}(x^(2), ..., x^(d))||_2 / ||f^(T)(x^(2), ..., x^(d))||_2
  where G = J^(m)_{f^(T)}(x^(2), ..., x^(d)) is given by g_ij = d f^(T)_i(x^(2), ..., x^(d)) / d x^(m)_j
▶ If we wish to associate a single condition number with a tensor, we can tightly
  bound the numerator
      ||J^(m)_{f^(T)}(x^(2), ..., x^(d))||_2 ≤ ||T||_2
▶ However, the condition number goes to infinity (the problem becomes ill-posed)
  when ||f^(T)(x^(2), ..., x^(d))||_2 = 0
▶ Consequently, we wish to lower bound the denominator in
      κ_{f^(T)} = ||T||_2 / inf_{x^(2),...,x^(d) ∈ S^(n−1)} ||f^(T)(x^(2), ..., x^(d))||_2
Well-conditioned Tensors
For equidimensional tensors (all modes of the same size), some small ideally
conditioned tensors exist
▶ For order 2 tensors, for any dimension n, there exist n × n orthogonal
  matrices with unit condition number
▶ For order 3, there exist tensors T ∈ R^(n×n×n) with n ∈ {2, 4, 8}, s.t.
      inf_{x^(2),...,x^(d) ∈ S^(n−1)} ||f^(T)(x^(2), ..., x^(d))||_2 = ||T||_2 = 1
  which correspond to ideally conditioned multilinear maps (generalize
  orthogonal matrices)
▶ For n = 2, an example of such a tensor is given by combining the two slices
      [ 1  0 ]        [  0  1 ]
      [ 0  1 ]  and   [ −1  0 ]
  while for n = 4, an example is given by combining 4 slices that are signed
  permutation matrices (related to quaternion multiplication)
Ill-conditioned Tensors
For n ∉ {2, 4, 8}, given any T ∈ R^(n×n×n), inf_{x,y ∈ S^(n−1)} ||f^(T)(x, y)||_2 = 0
▶ In 1889, Adolf Hurwitz posed the problem of finding identities of the form
      (x_1^2 + ··· + x_l^2)(y_1^2 + ··· + y_m^2) = z_1^2 + ··· + z_n^2
▶ In 1922, Johann Radon derived results that imply that over the reals, when
  l = m = n, solutions exist only if n ∈ {2, 4, 8}
▶ If, for T and any vectors x, y,
      ||T ×2 x ×3 y||_2 / (||x||_2 ||y||_2) = 1  ⇒  ||T ×2 x ×3 y||_2^2 = ||x||_2^2 ||y||_2^2,
  we can define bilinear forms that provide a solution to the Hurwitz problem
      z_i = ∑_j ∑_k t_ijk x_j y_k
▶ Radon's result immediately implies κ_{f^(T)} > 1 for n ∉ {2, 4, 8}, while a 1962
  result by J. F. Adams gives κ_{f^(T)} = ∞, as there exists a linear combination
  of any n real n × n matrices that is rank-deficient for n ∉ {2, 4, 8}
CP Decomposition
▶ The canonical polyadic or CANDECOMP/PARAFAC (CP) decomposition
  expresses an order d tensor in terms of d factor matrices
▶ For a tensor T ∈ R^(n×n×n), the CP decomposition is defined by matrices U, V,
  and W such that
      t_ijk = ∑_{r=1}^{R} u_ir v_jr w_kr
  the columns of U, V, and W are generally not orthonormal, but may be
  normalized, so that
      t_ijk = ∑_{r=1}^{R} σ_r u_ir v_jr w_kr
  where each σ_r ≥ 0 and ||u_r||_2 = ||v_r||_2 = ||w_r||_2 = 1
  (see the numpy sketch after this list)
▶ For an order d tensor, the decomposition generalizes as follows:
      t_{i1...id} = ∑_{r=1}^{R} ∏_{j=1}^{d} u^(j)_{ij r}
▶ Its rank is generally bounded by R ≤ n^(d−1)
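A minimal numpy sketch (assumed, not from the lecture) of assembling an order 3 tensor from CP factors, including the equivalent unfolded form T_(1) = U (V ⊙ W)^T:

```python
# Minimal numpy sketch (assumed) of a rank-R CP representation of an order-3 tensor.
import numpy as np

n, R = 5, 3
U, V, W = np.random.rand(n, R), np.random.rand(n, R), np.random.rand(n, R)

T = np.einsum('ir,jr,kr->ijk', U, V, W)          # t_ijk = sum_r u_ir v_jr w_kr

# equivalently via the mode-1 unfolding and the Khatri-Rao product V ⊙ W
VW = np.einsum('jr,kr->jkr', V, W).reshape(n * n, R)
assert np.allclose(T, (U @ VW.T).reshape(n, n, n))
```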


CP Decomposition Basics
▶ The CP decomposition is useful in a variety of contexts
  ▶ If an exact decomposition with R ≪ n^(d−1) is expected to exist
  ▶ If an approximate decomposition with R ≪ n^(d−1) is expected to exist
  ▶ If the factor matrices from an approximate decomposition with R = O(1) are
    expected to contain information about the tensor data
  ▶ CP is a widely used tool, appearing in many domains of science and data analysis
▶ Basic properties and methods
  ▶ Uniqueness (modulo normalization) is dependent on rank
  ▶ Finding the CP rank of a tensor or computing the CP decomposition is NP-hard
    (even with R = 1)
  ▶ The typical rank of tensors (the likely rank of a random tensor) is generally less than
    the maximal possible rank
  ▶ CP approximation is a nonlinear least squares (NLS) problem and NLS methods
    can be applied in a black-box fashion, but the structure of the decomposition motivates
    alternating least-squares (ALS) optimization (a sketch follows below)
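As an illustration of the last point, here is a minimal CP-ALS sketch for an order 3 tensor (assumed, not the lecture's reference implementation); each factor update solves the normal equations of a linear least-squares problem whose right-hand side is an MTTKRP.

```python
# Minimal CP-ALS sketch (assumed) for an order-3 tensor.
import numpy as np

def cp_als(T, R, iters=50):
    n1, n2, n3 = T.shape
    U, V, W = (np.random.rand(n, R) for n in (n1, n2, n3))
    for _ in range(iters):
        # update U: minimize ||T_(1) - U (V ⊙ W)^T||_F via the normal equations
        U = np.linalg.solve((V.T @ V) * (W.T @ W),
                            np.einsum('ijk,jr,kr->ir', T, V, W).T).T
        V = np.linalg.solve((U.T @ U) * (W.T @ W),
                            np.einsum('ijk,ir,kr->jr', T, U, W).T).T
        W = np.linalg.solve((U.T @ U) * (V.T @ V),
                            np.einsum('ijk,ir,jr->kr', T, U, V).T).T
    return U, V, W

# usage: fit a random tensor that is exactly rank 3
U0, V0, W0 = (np.random.rand(6, 3) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', U0, V0, W0)
U, V, W = cp_als(T, R=3)
print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', U, V, W)))
```
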
Tucker Decomposition
▶ The Tucker decomposition expresses an order d tensor via a smaller order d
  core tensor and d factor matrices
▶ For a tensor T ∈ R^(n×n×n), the Tucker decomposition is defined by a core tensor
  Z ∈ R^(R1×R2×R3) and factor matrices U, V, and W with orthonormal columns,
  such that
      t_ijk = ∑_{p=1}^{R1} ∑_{q=1}^{R2} ∑_{r=1}^{R3} z_pqr u_ip v_jq w_kr
  (see the numpy sketch after this list)
▶ For general tensor order, the Tucker decomposition is defined as
      t_{i1...id} = ∑_{r1=1}^{R1} ··· ∑_{rd=1}^{Rd} z_{r1...rd} ∏_{j=1}^{d} u^(j)_{ij rj}
  which can also be expressed as
      T = Z ×1 U^(1) ··· ×d U^(d)
▶ The Tucker ranks (R1, R2, R3) are each bounded by the respective tensor
  dimensions, in this case R1, R2, R3 ≤ n
▶ In relation to CP, Tucker is formed by taking all combinations of tensor products
  between columns of the factor matrices, while CP takes only disjoint products
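A minimal numpy sketch (assumed, not from the lecture) of assembling an order 3 tensor from a Tucker core and factor matrices:

```python
# Minimal numpy sketch (assumed) of a Tucker representation T = Z x1 U x2 V x3 W.
import numpy as np

n, (R1, R2, R3) = 6, (2, 3, 4)
Z = np.random.rand(R1, R2, R3)
U, V, W = np.random.rand(n, R1), np.random.rand(n, R2), np.random.rand(n, R3)

T = np.einsum('pqr,ip,jq,kr->ijk', Z, U, V, W)   # t_ijk = sum_pqr z_pqr u_ip v_jq w_kr
print(T.shape)                                   # (6, 6, 6)
```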
Tucker Decomposition Basics
▶ The Tucker decomposition is used in many of the same contexts as CP
  ▶ If an exact decomposition with each Rj < n is expected to exist
  ▶ If an approximate decomposition with Rj < n is expected to exist
  ▶ If the factor matrices from an approximate decomposition with Rj = O(1) are
    expected to contain information about the tensor data
  ▶ Tucker is most often used for data compression and appears less often than CP
    in theoretical analysis
▶ Basic properties and methods
  ▶ The Tucker decomposition is not unique (one can pass transformations between
    the core tensor and factor matrices, which also permits their orthogonalization)
  ▶ Finding the best Tucker approximation is NP-hard (for R = 1, CP = Tucker)
  ▶ If an exact decomposition exists, it can be computed by the higher-order SVD
    (HOSVD), which performs d SVDs on unfoldings (a sketch follows after this list)
  ▶ HOSVD obtains a good approximation with cost O(n^(d+1)) (reducible to O(n^d R)
    via randomized SVD or QR with column pivoting)
  ▶ Accuracy can be improved by iterative nonlinear optimization methods, such as
    higher-order orthogonal iteration (HOOI)
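To make the HOSVD step concrete, here is a minimal truncated-HOSVD sketch for an order 3 tensor (assumed; the hosvd helper name and truncation choices are illustrative, not the lecture's code).

```python
# Minimal truncated HOSVD sketch (assumed) for an order-3 tensor: an SVD of
# each mode unfolding gives a factor matrix; the core is T contracted with them.
import numpy as np

def hosvd(T, ranks):
    factors = []
    for m, R in enumerate(ranks):
        Tm = np.moveaxis(T, m, 0).reshape(T.shape[m], -1)    # mode-m unfolding
        Um = np.linalg.svd(Tm, full_matrices=False)[0][:, :R]
        factors.append(Um)                                   # leading R left singular vectors
    U, V, W = factors
    Z = np.einsum('ijk,ip,jq,kr->pqr', T, U, V, W)           # core tensor
    return Z, U, V, W

T = np.random.rand(6, 6, 6)
Z, U, V, W = hosvd(T, (3, 3, 3))
T_approx = np.einsum('pqr,ip,jq,kr->ijk', Z, U, V, W)
print(np.linalg.norm(T - T_approx) / np.linalg.norm(T))      # relative error
```
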
Tensor Train Decomposition
▶ The tensor train decomposition expresses an order d tensor as a chain of
  products of order 2 or order 3 tensors
▶ For an order 4 tensor, we can express the tensor train decomposition as
      t_ijkl = ∑_{p,q,r} u_ip v_pjq w_qkr z_rl
  (see the numpy sketch after this list)
▶ More generally, the tensor train decomposition is defined as follows:
      t_{i1...id} = ∑_{r1=1}^{R1} ··· ∑_{r(d−1)=1}^{R(d−1)} u^(1)_{i1 r1} ( ∏_{j=2}^{d−1} u^(j)_{r(j−1) ij rj} ) u^(d)_{r(d−1) id}
▶ In the physics literature, it is known as a matrix product state (MPS), as we can
  write it in the form
      t_{i1...id} = ⟨ u^(1)_{i1}, U^(2)_{i2} ··· U^(d−1)_{i(d−1)} u^(d)_{id} ⟩
▶ For an equidimensional tensor, the ranks are bounded as Rj ≤ min(n^j, n^(d−j))
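A minimal numpy sketch (assumed, not from the lecture) of assembling the order 4 example above from its tensor-train cores:

```python
# Minimal numpy sketch (assumed) of an order-4 tensor train:
# t_ijkl = sum_{pqr} u_ip v_pjq w_qkr z_rl.
import numpy as np

n, (R1, R2, R3) = 4, (2, 3, 2)
U = np.random.rand(n, R1)             # first core (order 2)
V = np.random.rand(R1, n, R2)         # interior cores (order 3)
W = np.random.rand(R2, n, R3)
Z = np.random.rand(R3, n)             # last core (order 2)

T = np.einsum('ip,pjq,qkr,rl->ijkl', U, V, W, Z)
print(T.shape)                        # (4, 4, 4, 4)
```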


Tensor Train Decomposition Basics
▶ Tensor train has applications in quantum simulation and in numerical PDEs
  ▶ It's useful whenever the tensor is low-rank or approximately low-rank, i.e.,
    Rj Rj+1 < n^(d−1) for all j < d − 1
  ▶ MPS (tensor train) and extensions are widely used to approximate quantum
    systems with Θ(d) particles/spins
  ▶ Often the MPS is optimized relative to an implicit operator (often of a similar
    form, referred to as a matrix product operator (MPO))
  ▶ Operators and solutions to some standard numerical PDEs admit tensor-train
    approximations that yield exponential compression
▶ Basic properties and methods
  ▶ The tensor train decomposition is not unique (one can pass transformations,
    permitting orthogonalization into canonical forms)
  ▶ Approximation with tensor train is NP-hard (for R = 1, CP = Tucker = TT)
  ▶ If an exact decomposition exists, it can be computed by the tensor train SVD
    (TTSVD), which performs d − 1 SVDs
  ▶ TTSVD can be done with cost O(n^(d+1)), or O(n^d R) with a faster low-rank SVD
  ▶ Iterative (alternating) optimization is generally used when optimizing the tensor
    train relative to an implicit operator or to refine TTSVD
Summary of Tensor Decomposition Basics

We can compare the aforementioned decompositions for an order d tensor with all
dimensions equal to n and all decomposition ranks equal to R

  decomposition          CP                   Tucker         tensor train
  size                   dnR                  dnR + R^d      2nR + (d − 2)nR^2
  uniqueness             if R ≤ (3n − 2)/2    no             no
  orthogonalizability    none                 partial        partial
  exact decomposition    NP-hard              O(n^(d+1))     O(n^(d+1))
  approximation          NP-hard              NP-hard        NP-hard
  typical method         ALS                  HOSVD          TT-ALS (implicit)
