Cognitive Networked Sensing and Big Data

Robert Qiu • Michael Wicks

Robert Qiu, Tennessee Technological University, Cookeville, Tennessee, USA
Michael Wicks, Utica, NY, USA

Preface
The idea of writing this book entitled “Cognitive Networked Sensing and Big Data”
started with the plan to write a briefing book on wireless distributed computing
and cognitive sensing. During our research on large-scale cognitive radio network
(and its experimental testbed), we realized that big data played a central role. As a
result, the book project reflects this paradigm shift. In this context, sensing is roughly equivalent to "measurement."
We attempt to answer the following basic questions. How do we sense the radio
environment using a large-scale network? What is unique to cognitive radio? What
do we do with the big data? How does the sample size affect the sensing accuracy?
To address these questions, we are naturally led to ask ourselves: What mathematical tools are required? What is the state of the art among the analytical tools? How are these tools used?
Our prerequisite is the graduate-level course on random variables and processes.
Some familiarity with wireless communication and signal processing is useful.
This book is complementary to our previous book, "Cognitive Radio Communications and Networking: Principles and Practice" (John Wiley and Sons, 2012), and to another book by the first author, "Introduction to Smart Grid" (John Wiley and Sons, 2014). The current book can be viewed as supplying the mathematical tools for the two Wiley books.
Chapter 1 provides the necessary background to support the rest of the book. No attempt has been made to make this book fully self-contained. The book surveys many of the latest results in the literature, and we often include preliminary tools from the publications we cite. These preliminary tools may still be too difficult for some readers. Roughly, our prerequisite is a graduate-level course on random variables and processes.
Chapters 2–5 (Part I) are the core of this book. The contents of these chapters
should be new to most graduate students in electrical and computer engineering.
Chapter 2 deals with the sum of matrix-valued random variables. One basic
question is “how does the sample size affect the accuracy.” The basic building block
of the data is the sample covariance matrix, which is a random matrix. Bernstein-
type concentration inequalities are of special interest.
low-rank structure is explicitly exploited. This is one justification for putting low-
rank matrix recovery (Chap. 9) before this chapter. A modern trend is to exploit the
structure of the data (sparsity and low rank) within detection theory. Research in this direction is growing rapidly, and we survey some of the latest results in this chapter.
An unexpected chapter is Chap. 11 on probability-constrained optimization. Owing to recent progress (notably by Nemirovski, as recently as 2003), optimization with probabilistic constraints, often regarded as computationally intractable in the past, may
be formulated in terms of deterministic convex problems that can be solved using
modern convex optimization solvers. The “closed-form” Bernstein concentration
inequalities play a central role in this formulation.
In Chap. 12, we show how concentration inequalities play a central role in data-friendly processing such as low-rank matrix approximation. We only want to
point out the connection.
Chapter 13 is designed to put all the pieces together; it could alternatively serve as Chap. 1. So many problems remain open: we have only touched the tip of the big-data iceberg. Chapter 1 also provides motivation for the other chapters of this book.
The first author wants to thank the students of his Fall 2012 course, ECE 7970 Random Matrices, Concentration and Networking; their comments greatly improved this book. We also want to thank the PhD students at TTU who helped with proofreading: Jason Bonior, Shujie Hou, Xia Li, Feng Lin, and Changchun Zhang. The simulations made by Feng Lin shaped the conceptions and formulations in many places of this book, in particular on hypothesis detection. Dr. Zhen Hu and Dr. Nan Guo at TTU are also thanked for helpful discussions. The first author's research collaborator Professor Husheng Li (University of Tennessee at Knoxville) is acknowledged for many inspiring discussions.
The first author's work has been supported, for many years, by the Office of Naval Research (ONR) through the program manager Dr. Santanu K. Das (Code 31). Our friend Paul James Browning was instrumental in making this book possible. This work is partly funded by the National Science Foundation (NSF) through two grants (ECCS-0901420 and CNS-1247778), the Office of Naval Research through two grants (N00010-10-10810 and N00014-11-1-0006), and the Air Force Office of Scientific Research, via a local contractor (prime contract number FA8650-10-D-1750-Task 4). Some parts of this book were finished while the first author was a visiting professor during the summer of 2012 at the Centre for Quantifiable Quality of Service in Communication Systems (Q2S), the Norwegian University of Science and Technology (NTNU), Trondheim, Norway. Many discussions with his host Professor Yuming Jiang are acknowledged here.
The authors want to thank our editor Brett Kurzman at Springer (US) for his interest in this book. We also acknowledge Rebecca Hytowitz at Springer (US) for her help.
The first author wants to thank his mentors who, for the first time, exposed him
to many subjects of this book: Weigan Lin (UESTC, China) on remote sensing,
Contents

Part I Theory

1 Mathematical Foundation
  1.1 Basic Probability
    1.1.1 Union Bound
    1.1.2 Independence
    1.1.3 Pairs of Random Variables
    1.1.4 The Markov and Chebyshev Inequalities and Chernoff Bound
    1.1.5 Characteristic Function and Fourier Transform
    1.1.6 Laplace Transform of the pdf
    1.1.7 Probability Generating Function
  1.2 Sums of Independent (Scalar-Valued) Random Variables and Central Limit Theorem
  1.3 Sums of Independent (Scalar-Valued) Random Variables and Classical Deviation Inequalities: Hoeffding, Bernstein, and Efron-Stein
    1.3.1 Transform of Probability Bounds to Expectation Bounds
    1.3.2 Hoeffding's Inequality
    1.3.3 Bernstein's Inequality
    1.3.4 Efron-Stein Inequality
  1.4 Probability and Matrix Analysis
    1.4.1 Eigenvalues, Trace and Sums of Hermitian Matrices
    1.4.2 Positive Semidefinite Matrices
    1.4.3 Partial Ordering of Positive Semidefinite Matrices
    1.4.4 Definitions of f(A) for Arbitrary f
    1.4.5 Norms of Matrices and Vectors
    1.4.6 Expectation
    1.4.7 Moments and Tails

Part II Applications

Bibliography
Index
Introduction
This book deals with the data that is collected from a cognitive radio network.
Although this is the motivation, the contents really treat the general mathematical
models and the latest results in the literature.
Big data, only at its dawn, refers to things one can do at a large scale that cannot
be done at a smaller one [4]. Mankind's constraints are really functions of the scale at which we operate; scale really matters. Big data sees a shift from causation to correlation, to inferring probabilities. Big data is messy: the data is huge and can tolerate inexactitude. Mathematical models crunch mountains of data to predict gains while trying to reduce risks.
At this writing, big data is viewed as a paradigm shift in science and engineering,
as illustrated in Fig. 1. In November 2011, when we put the final touches on our book [5] on cognitive radio networks, we recognized the fundamental significance of big data, so the very first section of that book (Sect. 1.1 of [5]) was entitled "big data." Our understanding was that, due to spectrum sensing, the cognitive radio network leads us naturally towards big data. In the last 18 months, as a result of writing this book, this understanding has gone even further: we swam in the mathematical domains, appreciating the beauty of the consequence of big data, namely high dimensions. Book writing is truly a journey; it helps one to understand the subject much more deeply than otherwise. It is believed that
smart grid [6] will use many big data concepts and hopefully some mathematical
tools that are covered in this book. Many mathematical insights could not be made explicit if the dimensions were not assumed to be large. As a result, concentration inequalities are natural tools to capture these insights in a non-asymptotic manner.
Figure 13.1 illustrates the vision of big data that will be the foundation to
understand cognitive networked sensing, cognitive radio network, cognitive radar
and even smart grid. We will further develop this vision in the book on smart
grid [6]. High-dimensional statistics is the driver behind these subjects. Random matrices are natural building blocks for modeling big data. The concentration of measure phenomenon is of fundamental significance for modeling a large number of random matrices; it is a phenomenon unique to high-dimensional spaces.
Fig. 1 The vision of big data: high-dimensional statistics underlies big data, which connects cognitive networked sensing, cognitive radar, cognitive radio network, and smart grid
To get a feel for the book, let us consider one basic problem. The large data sets
are conveniently expressed as a matrix
$$\mathbf{X} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{m1} & X_{m2} & \cdots & X_{mn} \end{bmatrix} \in \mathbb{C}^{m\times n},$$
where Xij are random variables, e.g., sub-Gaussian random variables. Here m, n
are finite and large. For example, m = 100, n = 100. The spectrum of a random
matrix $\mathbf{X}$ tends to stabilize as the dimensions of $\mathbf{X}$ grow to infinity. In the last few years, attention has turned to local and non-asymptotic regimes, in which the dimensions of $\mathbf{X}$ are fixed rather than allowed to grow to infinity. The concentration of measure phenomenon naturally occurs. The eigenvalues $\lambda_i\!\left(\mathbf{X}^T\mathbf{X}\right)$, $i = 1, \ldots, n$, are natural mathematical objects to study. The eigenvalues can be viewed as Lipschitz functions that can be handled by Talagrand's concentration inequality, which expresses the insight that the sum of a large number of random variables is nearly constant with high probability. We can often treat
both standard Gaussian and Bernoulli random variables in the unified framework of
the sub-Gaussian family.
Theorem 1 (Talagrand's Concentration Inequality). For every product probability $P$ on $\{-1,1\}^n$, consider a convex, Lipschitz function $f : \mathbb{R}^n \to \mathbb{R}$ with Lipschitz constant $L$. Let $X_1, \ldots, X_n$ be independent random variables taking values in $\{-1, 1\}$, let $Y = f(X_1, \ldots, X_n)$, and let $M_Y$ be a median of $Y$. Then, for every $t > 0$,
$$P\left(|Y - M_Y| \geqslant t\right) \leqslant 4\, e^{-t^2/16L^2}. \qquad (1)$$
For a random matrix X ∈ Rn×n , the following functions are Lipschitz functions:
$$(1)\ \lambda_{\max}(\mathbf{X});\quad (2)\ \lambda_{\min}(\mathbf{X});\quad (3)\ \operatorname{Tr}(\mathbf{X});\quad (4)\ \sum_{i=1}^{k}\lambda_i(\mathbf{X});\quad (5)\ \sum_{i=1}^{k}\lambda_{n-i+1}(\mathbf{X}).$$
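To make this concentration phenomenon concrete, the following short simulation (a sketch of our own; the matrix size, number of trials, and window width are illustrative choices, not taken from the book) draws random symmetric ±1 matrices and shows how the largest eigenvalue, a Lipschitz function of the entries, fluctuates only slightly around its median:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 500

lam_max = np.empty(trials)
for t in range(trials):
    # symmetric matrix with independent +/-1 entries on and above the diagonal
    A = rng.choice([-1.0, 1.0], size=(n, n))
    A = np.triu(A) + np.triu(A, 1).T
    # lambda_max is a 1-Lipschitz function of the matrix (in Frobenius norm)
    lam_max[t] = np.linalg.eigvalsh(A)[-1]

median = np.median(lam_max)
print("median of lambda_max:", median)
print("fraction within +/- 5 of the median:",
      np.mean(np.abs(lam_max - median) < 5.0))
```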
1 Mathematical Foundation

This chapter provides the necessary background to support the rest of the book. No attempt has been made to make this book fully self-contained. The book surveys many recent results in the literature, and we often include preliminary tools from the publications we cite. These preliminary tools may still be too difficult for some readers. Roughly, our prerequisite is a graduate-level course on random variables and processes.
The probability of an event is expressed as $P(\cdot)$, and we use $E$ for the expectation operator. For conditional expectation, we use the notation $E_X Z$, which represents integration with respect to $X$, holding all other variables fixed. We sometimes omit the parentheses when there is no possibility of confusion. Finally, we remind the reader of the analysts' convention that roman letters $c$, $C$, etc. denote universal constants that may change at every appearance.
where the indicator function IA (ω) takes the value of 1 if ω ∈ A and 0 otherwise.
The union bound (or Bonferroni's inequality, or Boole's inequality) states that for a collection of events $A_i\in\mathcal{F}$, $i = 1, \ldots, n$, we have
$$P\left(\bigcup_{i=1}^{n}A_i\right) \leqslant \sum_{i=1}^{n}P(A_i). \qquad (1.1)$$
1.1.2 Independence
In general, it can be shown that the random variables $X$ and $Y$ are independent if and only if their joint cdf is equal to the product of their marginal cdfs:
Similarly, if $X$ and $Y$ are jointly continuous, then $X$ and $Y$ are independent if and only if their joint pdf is equal to the product of their marginal pdfs:
Suppose that X and Y are independent random variables, and let g(X, Y ) =
g1 (X)g2 (Y ). Find E [g (X, Y )] = E [g1 (X) g2 (Y )]. It follows that
$$E\left[g(X,Y)\right] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_1(x')\,g_2(y')\, f_X(x')\,f_Y(y')\,dx'\,dy' = \left(\int_{-\infty}^{\infty} g_1(x')f_X(x')\,dx'\right)\left(\int_{-\infty}^{\infty} g_2(y')f_Y(y')\,dy'\right) = E[g_1(X)]\,E[g_2(Y)].$$
Let us consider the sum of two random variables. Z = X + Y . Find FZ (z) and
fZ (z) in terms of the joint pdf of X and Y .
The cdf of $Z$ is found by integrating the joint pdf of $X$ and $Y$ over the region of the plane corresponding to the event $\{X + Y \leqslant z\}$:
$$F_Z(z) = \int_{-\infty}^{\infty}\int_{-\infty}^{z-x'} f_{X,Y}(x', y')\,dy'\,dx'.$$
The pdf of $Z$ is
$$f_Z(z) = \frac{d}{dz}F_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x', z-x')\,dx'.$$
Thus the pdf for the sum of two random variables is given by a superposition
integral.
If X and Y are independent random variables, then by (1.3), the pdf is given by
the convolution integral of the marginal pdf’s of X and Y :
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x')\,f_Y(z-x')\,dx'.$$
Now suppose that the mean E [X] = m and the variance Var [X] = σ 2 of a random
variable are known, and that we are interested in bounding $P(|X - m| \geqslant a)$. The Chebyshev inequality states that
$$P\left(|X - m| \geqslant a\right) \leqslant \frac{\sigma^2}{a^2}. \qquad (1.8)$$
The Chebyshev inequality is a consequence of the Markov inequality. More
generally, taking $\varphi(x) = x^q$ ($x \geqslant 0$), for any $q > 0$ we have the moment bound
$$P\left(|X - EX| \geqslant t\right) \leqslant \frac{E|X - EX|^q}{t^q}. \qquad (1.9)$$
In specific examples, we may choose the value of q to optimize the obtained upper
bound. Such moment bounds often provide very sharp estimates of the tail probabilities. A related idea is at the basis of Chernoff's bounding method. Taking $\varphi(x) = e^{sx}$, where $s$ is an arbitrary positive number, for any random variable $X$ and any $t\in\mathbb{R}$ we have
$$P(X \geqslant t) = P\left(e^{sX} \geqslant e^{st}\right) \leqslant \frac{E\,e^{sX}}{e^{st}}. \qquad (1.10)$$
If more information is available than just the mean and variance, then it is possible
to obtain bounds that are tighter than the Markov and Chebyshev inequalities. The
region of interest is $A = \{t \geqslant a\}$, so let $I_A(t)$ be the indicator function, that is, $I_A(t) = 1$ for $t\in A$ and $I_A(t) = 0$ otherwise. Consider the bound $I_A(t) \leqslant e^{s(t-a)}$, $s > 0$. The resulting bound is
$$P(X \geqslant a) = \int_{0}^{\infty} I_A(t)f_X(t)\,dt \leqslant \int_{0}^{\infty} e^{s(t-a)}f_X(t)\,dt = e^{-sa}\int_{0}^{\infty} e^{st}f_X(t)\,dt = e^{-sa}\,E\left[e^{sX}\right]. \qquad (1.11)$$
This bound is the Chernoff bound, which can be seen to depend on the expected
value of an exponential function of X. This function is called the moment generating
function and is related to the Fourier and Laplace transforms in the following
subsections. In Chernoff’s method, we find an s ≥ 0 that minimizes the upper
bound or makes the upper bound small. Even though Chernoff bounds are never as
good as the best moment bound, in many cases they are easier to handle [7].
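As a quick numerical illustration (our own sketch, not from the book; the Bernoulli variables, threshold, and sample sizes are arbitrary choices), the following code compares the empirical tail of a sum of independent Bernoulli(1/2) variables with the Markov, Chebyshev, and optimized Chernoff bounds:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 100_000
a = 65  # threshold on the sum S of n Bernoulli(1/2) variables

S = rng.integers(0, 2, size=(trials, n)).sum(axis=1)
empirical = np.mean(S >= a)

markov = (n / 2) / a                         # P(S >= a) <= E[S] / a
chebyshev = (n / 4) / (a - n / 2) ** 2       # variance n/4, deviation a - n/2
s_grid = np.linspace(1e-3, 5, 2000)          # optimize the Chernoff bound over s
chernoff = np.min(np.exp(-s_grid * a) * ((1 + np.exp(s_grid)) / 2) ** n)

print(f"empirical  {empirical:.2e}")
print(f"Markov     {markov:.2e}")
print(f"Chebyshev  {chebyshev:.2e}")
print(f"Chernoff   {chernoff:.2e}")
```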
Transform methods are extremely useful. In many applications, the solution is given by the convolution of two functions, $f_1(x) * f_2(x)$. The Fourier transform converts this convolution integral into a product of two functions in the transformed domain. This is a basic result for linear systems.
Most of time we deal with discrete random variables that are integer-valued. The
characteristic function is then defined as
Equation (1.12) is the Fourier transform of the sequence pX (k). The following
inverse formula allows us to recover the probabilities pX (k) from ΦX (ω)
$$p_X(k) = \frac{1}{2\pi}\int_{0}^{2\pi}\Phi_X(\omega)\,e^{-j\omega k}\,d\omega, \qquad k = 0, \pm 1, \pm 2, \ldots$$
The $p_X(k)$ are simply the coefficients of the Fourier series of the periodic function $\Phi_X(\omega)$. The moments of $X$ can be obtained from $\Phi_X(\omega)$, a very basic idea. The power series (Taylor series) can be used to expand the complex exponential $e^{j\omega x}$ in the definition of $\Phi_X(\omega)$:
$$\Phi_X(\omega) = \int_{-\infty}^{\infty} f_X(x)\left[1 + j\omega x + \frac{(j\omega x)^2}{2!} + \cdots\right]dx.$$
Assuming that all the moments of X are finite and the series can be integrated term
by term, we have
$$\Phi_X(\omega) = 1 + j\omega E[X] + \frac{(j\omega)^2}{2!}E\left[X^2\right] + \cdots + \frac{(j\omega)^n}{n!}E\left[X^n\right] + \cdots.$$
If we differentiate the above expression $n$ times and evaluate the result at $\omega = 0$, we obtain
$$\left.\frac{d^n}{d\omega^n}\Phi_X(\omega)\right|_{\omega=0} = j^n\,E\left[X^n\right],$$
Note that X ∗ (s) can be regarded as a Laplace transform of the pdf or as an expected
value of a function of X, e−sX . When X is replaced with a matrix-valued random
variable X, we are motivated to study
$$X^*(s) = \int_{0}^{\infty} f_X(x)\,e^{-sx}\,dx = E\left[e^{-sX}\right]. \qquad (1.14)$$
Through the spectral mapping theorem defined in Theorem 1.4.4, f (A) is defined
simply by applying the function f to the eigenvalues of A, where f (x) is an arbitrary
function. Here we have f (x) = e−sx . The eigenvalues of the matrix-valued random
variable e−sX are scalar-valued random variables.
The moment theorem also holds for X ∗ (s):
$$E\left[X^n\right] = (-1)^n\left.\frac{d^n}{ds^n}X^*(s)\right|_{s=0}.$$
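As a brief worked example (ours, using the standard exponential density $f_X(x) = \alpha e^{-\alpha x}$, $x \geqslant 0$, consistent with the characteristic function $\alpha/(\alpha - j\omega)$ quoted later in this section), the Laplace transform and its derivatives deliver all moments at once:
$$X^*(s) = \int_0^\infty \alpha e^{-\alpha x}e^{-sx}\,dx = \frac{\alpha}{\alpha + s}, \qquad \frac{d^n}{ds^n}X^*(s) = \frac{(-1)^n\,n!\,\alpha}{(\alpha+s)^{n+1}}, \qquad E\left[X^n\right] = (-1)^n\left.\frac{d^n}{ds^n}X^*(s)\right|_{s=0} = \frac{n!}{\alpha^n}.$$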
The first expression is the expected value of the function of $N$, $z^N$. The second expression is the z-transform of the probability mass function (with a sign change in the exponent). Table 3.1 of [8, p. 175] shows the probability generating function for some discrete random variables. Note that the characteristic function of $N$ is given by $\Phi_N(\omega) = E\left[e^{j\omega N}\right] = G_N\!\left(e^{j\omega}\right)$.
We follow [8] on the standard Fourier-analytic proof of the central limit theorem
for scalar-valued random variables. This material allows us to warm up, and set
the stage for a parallel development of a theory for studying the sums of the matrix-
valued random variables. The Fourier-analytic proof of the central limit theorem is one of the quickest (and slickest) proofs available for this theorem, and is accordingly the "standard" proof given in probability textbooks [9].
Let X1 , X2 , . . . , Xn be n independent random variables. In this section, we show
how the standard Fourier transform methods can be used to find the pdf of Sn =
X1 + X 2 + . . . + X n .
First, consider the n = 2 case, Z = X + Y , where X and Y are independent
random variables. The characteristic function of Z is given by
$$\Phi_Z(\omega) = E\left[e^{j\omega Z}\right] = E\left[e^{j\omega(X+Y)}\right] = E\left[e^{j\omega X}e^{j\omega Y}\right] = E\left[e^{j\omega X}\right]E\left[e^{j\omega Y}\right] = \Phi_X(\omega)\,\Phi_Y(\omega),$$
where the second line follows from the fact that functions of independent random
variables (i.e., ejωX and ejωY ) are also independent random variables, as discussed
in (1.4). Thus the characteristic function of Z is the product of the individual
characteristic functions of X and Y .
Recall that ΦZ (ω) can be also viewed as the Fourier transform of the pdf of Z:
Equation (1.16) states the well-known result that the Fourier transform of a
convolution of two functions is equal to the product of the individual Fourier
transforms. Now consider the sum of n independent random variables:
Sn = X1 + X2 + · · · + X n .
Thus the pdf of Sn can then be found by finding the inverse Fourier transform of
the product of the individual characteristic functions of the Xi ’s:
so by (1.17),
$$\Phi_{S_n}(\omega) = \prod_{k=1}^{n} e^{\,j\omega m_k - \omega^2\sigma_k^2/2} = \exp\!\left(j\omega\left(m_1 + m_2 + \cdots + m_n\right) - \frac{\omega^2}{2}\left(\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2\right)\right).$$
$$\Phi_X(\omega) = \frac{\alpha}{\alpha - j\omega}.$$
Example 1.2.4. Find the generating function for a sum of $n$ independent, identically distributed (geometric) random variables.
The generating function for a single geometric random variable is given by
$$G_X(z) = \frac{pz}{1 - qz},$$
so the generating function of the sum is $G_{S_n}(z) = \left(\frac{pz}{1-qz}\right)^n$. From Table 3.1 of [8], we see that this is the generating function of a negative binomial random variable with parameters $p$ and $n$.
We are mainly concerned with upper bounds for the probabilities of deviations from the mean, that is, we wish to bound $P(S_n - ES_n \geqslant t)$, with $S_n = \sum_{i=1}^{n}X_i$, where $X_1, \ldots, X_n$ are independent real-valued random variables.
In other words, writing $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\operatorname{Var}[X_i]$, we have
$$P\left(S_n - ES_n \geqslant t\right) \leqslant \frac{n\sigma^2}{t^2}.$$
This simple inequality is at the basis of the weak law of large numbers.
$$P\left(X \geqslant a + tb\right) \leqslant e^{-t+h},$$
then, for all $p \geqslant 1$,
$$E\,X^p \leqslant 2\left(a + bh + bp\right)^p. \qquad (1.20)$$
$$P\left(S_n - ES_n \geqslant t\right) \leqslant e^{-st}\,E\exp\!\left(s\sum_{i=1}^{n}\left(X_i - EX_i\right)\right) = e^{-st}\prod_{i=1}^{n}E\,e^{s(X_i - EX_i)} \quad\text{by independence.} \qquad (1.22)$$
Now the problem of finding tight bounds comes down to finding a good upper bound
for the moment generating function of the random variables Xi − EXi . There are
many ways of doing this. In the case of bounded random variables, perhaps the most
elegant version is due to Hoeffding [11]:
Lemma 1.3.1 (Hoeffding’s Inequality). Let X be a random variable with EX =
0, a ≤ X ≤ b. Then for s ≥ 0,
$$E\,e^{sX} \leqslant e^{s^2(b-a)^2/8}. \qquad (1.23)$$
By convexity of the exponential function,
$$e^{sx} \leqslant \frac{x-a}{b-a}\,e^{sb} + \frac{b-x}{b-a}\,e^{sa} \qquad\text{for } a \leqslant x \leqslant b.$$
Expectation is linear. Exploiting $EX = 0$ and using the notation $p = \frac{-a}{b-a}$, we have
$$E\,e^{sX} \leqslant \frac{b}{b-a}\,e^{sa} - \frac{a}{b-a}\,e^{sb} = \left(1 - p + p\,e^{s(b-a)}\right)e^{-ps(b-a)} = e^{\varphi(u)},$$
where $u = s(b-a)$ and $\varphi(u) = -pu + \log\left(1 - p + p\,e^{u}\right)$. By direct calculation, the derivatives of $\varphi$ are
$$\varphi'(u) = -p + \frac{p}{p + (1-p)e^{-u}}, \qquad \varphi''(u) = \frac{p(1-p)e^{-u}}{\left(p + (1-p)e^{-u}\right)^{2}} \leqslant \frac{1}{4}.$$
Since $\varphi(0) = \varphi'(0) = 0$, Taylor's theorem gives, for some $\theta$ between $0$ and $u$,
$$\varphi(u) = \varphi(0) + u\varphi'(0) + \frac{u^2}{2}\varphi''(\theta) \leqslant \frac{u^2}{8} = \frac{s^2(b-a)^2}{8}.$$
$$\begin{aligned}
P\left(S_n - ES_n \geqslant t\right) &\leqslant e^{-st}\prod_{i=1}^{n} e^{s^2(b_i-a_i)^2/8} \qquad\text{(by Lemma 1.3.1)}\\
&= e^{-st}\,e^{s^2\sum_{i=1}^{n}(b_i-a_i)^2/8}\\
&= e^{-2t^2/\sum_{i=1}^{n}(b_i-a_i)^2} \qquad\text{by choosing } s = 4t\Big/\sum_{i=1}^{n}(b_i-a_i)^2.
\end{aligned}$$
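The following quick simulation (our own sketch; the uniform variables, threshold, and trial count are illustrative choices) compares the empirical deviation probability of a sum of bounded variables with the Hoeffding bound $\exp\!\left(-2t^2/\sum_i(b_i-a_i)^2\right)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 200, 50_000
t = 10.0  # deviation threshold for S_n - E S_n

# X_i uniform on [0, 1], so a_i = 0, b_i = 1 and sum (b_i - a_i)^2 = n
X = rng.random((trials, n))
deviations = X.sum(axis=1) - n / 2

empirical = np.mean(deviations >= t)
hoeffding = np.exp(-2 * t**2 / n)

print(f"empirical tail  {empirical:.2e}")
print(f"Hoeffding bound {hoeffding:.2e}")
```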
Assume, without loss of generality, that $EX_i = 0$ for all $i = 1, \ldots, n$. Our starting point is again (1.22), that is, we need bounds for $E\,e^{sX_i}$. Introduce $\sigma_i^2 = E\left[X_i^2\right]$, and
$$F_i = \sum_{k=2}^{\infty}\frac{s^{k-2}\,E\left[X_i^k\right]}{k!\,\sigma_i^2}.$$
Since $e^{sx} = 1 + sx + \sum_{k=2}^{\infty} s^kx^k/k!$ and the expectation is linear, we may write
$$E\,e^{sX_i} = 1 + sE[X_i] + \sum_{k=2}^{\infty}\frac{s^k E\left[X_i^k\right]}{k!} = 1 + s^2\sigma_i^2 F_i \quad(\text{since } E[X_i] = 0) \;\leqslant\; e^{s^2\sigma_i^2 F_i}.$$
Now assume that the $X_i$ are bounded such that $|X_i| \leqslant c$. Then for each $k \geqslant 2$,
$$E\left[X_i^k\right] \leqslant c^{k-2}\sigma_i^2.$$
Thus,
$$F_i \leqslant \sum_{k=2}^{\infty}\frac{s^{k-2}c^{k-2}\sigma_i^2}{k!\,\sigma_i^2} = \frac{1}{(sc)^2}\sum_{k=2}^{\infty}\frac{(sc)^k}{k!} = \frac{e^{sc} - 1 - sc}{(sc)^2}.$$
Returning to (1.22) and using the notation $\sigma^2 = (1/n)\sum_{i=1}^{n}\sigma_i^2$, we have
$$P\left(\sum_{i=1}^{n}X_i \geqslant t\right) \leqslant e^{\,n\sigma^2\left(e^{sc}-1-sc\right)/c^2 - st}.$$
i=1
n
nσ 2 ct
P Xi t exp − 2 h ,
i=1
c nσ 2
n
with σ 2 = σi2 .
i=1
$$P\left(\frac{1}{n}\sum_{i=1}^{n}X_i \geqslant t\right) \leqslant \exp\left(-\frac{n t^2}{2\sigma^2 + 2ct/3}\right). \qquad (1.26)$$
We see that, except for the term $2ct/3$, Bernstein's inequality is quantitatively right when compared with the central limit theorem: the central limit theorem states that
$$P\left(\sqrt{\frac{n}{\sigma^2}}\;\frac{1}{n}\sum_{i=1}^{n}\left(X_i - EX_i\right) \geqslant y\right) \;\to\; 1 - \Phi(y) \leqslant \frac{1}{\sqrt{2\pi}}\,\frac{e^{-y^2/2}}{y},$$
which suggests
$$P\left(\frac{1}{n}\sum_{i=1}^{n}\left(X_i - EX_i\right) \geqslant t\right) \approx e^{-nt^2/(2\sigma^2)},$$
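As a rough numerical check (our own sketch; the bounded variables and all parameters are illustrative assumptions), the following code compares the empirical tail of an average of centered, bounded variables with the Bernstein bound of (1.26):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 500, 20_000
t = 0.05  # threshold for the average (1/n) * sum X_i

# X_i uniform on [-c, c]: mean zero, |X_i| <= c, variance sigma^2 = c^2 / 3
c = 1.0
sigma2 = c**2 / 3

X = rng.uniform(-c, c, size=(trials, n))
empirical = np.mean(X.mean(axis=1) >= t)
bernstein = np.exp(-n * t**2 / (2 * sigma2 + 2 * c * t / 3))

print(f"empirical tail  {empirical:.2e}")
print(f"Bernstein bound {bernstein:.2e}")
```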
In particular, by taking f (x) = esx , we see that all inequalities derived from the
sums of independent random variables Yi using Chernoff’s bounding remain true
for the sums of the Xi ’s. (This result is due to Hoeffding [11].)
The main purpose of these notes [7] is to show how many of the tail inequalities
of the sums of independent random variables can be extended to general functions
of independent random variables. The simplest, yet surprisingly powerful inequality
of this kind is known as the Efron-Stein inequality.
$$\lambda_1(\mathbf{A}) \geqslant \cdots \geqslant \lambda_n(\mathbf{A}), \qquad \mathbf{u}_1(\mathbf{A}), \ldots, \mathbf{u}_n(\mathbf{A}) \in \mathbb{C}^n.$$
The set of the eigenvalues $\{\lambda_1(\mathbf{A}), \ldots, \lambda_n(\mathbf{A})\}$ is known as the spectrum of $\mathbf{A}$. The eigenvalues are sorted in a non-increasing manner. The trace of an $n\times n$ matrix is equal to the sum of its eigenvalues,
$$\operatorname{Tr}(\mathbf{A}) = \sum_{i=1}^{n}\lambda_i(\mathbf{A}).$$
Tr (A + B) = Tr (A) + Tr (B) .
We have
$$\lambda_1(\mathbf{A}+\mathbf{B}) + \cdots + \lambda_k(\mathbf{A}+\mathbf{B}) \leqslant \lambda_1(\mathbf{A}) + \cdots + \lambda_k(\mathbf{A}) + \lambda_1(\mathbf{B}) + \cdots + \lambda_k(\mathbf{B}). \qquad (1.28)$$
In particular, we have
Inequality is one of the main topics in modern matrix theory [16]. An arbitrary complex matrix $\mathbf{A}$ is Hermitian if $\mathbf{A} = \mathbf{A}^H$, where $H$ stands for the conjugate transpose of a matrix. If a Hermitian matrix $\mathbf{A}$ is positive semidefinite, we say
A ≥ 0. (1.30)
where ⇒ has the meaning of “implies.” When A is a random matrix, its deter-
minant det A and its trace Tr A are scalar random variables. Trace is a linear
operator [17, p. 30].
For every complex matrix A, the Gram matrix AAH is positive semidefinite
[16, p. 163]:
AAH ≥ 0. (1.32)
The eigenvalues of $\left(\mathbf{A}\mathbf{A}^H\right)^{1/2}$ are the singular values of $\mathbf{A}$.
It follows from [17, p. 189] that
$$\operatorname{Tr}\mathbf{A} = \sum_{i=1}^{n}\lambda_i, \qquad \det\mathbf{A} = \prod_{i=1}^{n}\lambda_i,$$
$$\operatorname{Tr}\mathbf{A}^k = \sum_{i=1}^{n}\lambda_i^k, \qquad k = 1, 2, \ldots \qquad (1.33)$$
$$(\det\mathbf{A})^{1/n} \leqslant \frac{1}{n}\operatorname{Tr}\mathbf{A}, \qquad (1.34)$$
1 1/n
Tr AX (det A) (1.35)
n
and
Tr AAH = 0 if and only if A = 0. (1.37)
B ≥ A if B − A ≥ 0. (1.38)
A partial order may be defined using (1.38). We hold the intuition that matrix B
is somehow “greater” than matrix A. If B ≥ A ≥ 0, then [16, p. 169]
A + B ≥ B, (1.40)
2.
3.
Y ≥ X ≥ 0 ⇒ EY ≥ EX. (1.43)
The definitions of $f(\mathbf{A})$ of a matrix $\mathbf{A}$ for a general function $f$ were posed by Sylvester and others. An elegant treatment is given in [20]. A special function called the spectrum is studied in [21]. Most often, we deal with the PSD matrix, $\mathbf{A} \geqslant 0$. See also the references [17, 18, 22–24].
When f (t) is a polynomial or rational function with scalar coefficients and a
scalar argument, t, it is natural to define f (A) by substituting A for t, replacing
division by matrix inverse, and replacing 1 by the identity matrix. Then, for example,
$$f(t) = \frac{1+t^2}{1-t} \;\Rightarrow\; f(\mathbf{A}) = \left(\mathbf{I}-\mathbf{A}\right)^{-1}\left(\mathbf{I}+\mathbf{A}^2\right) \quad\text{if } 1\notin\Lambda(\mathbf{A}). \qquad (1.44)$$
1−t
Here, Λ (A) denotes the set of eigenvalues of A (the spectrum of A). For a general
theory, we need a way of defining f (A) that is applicable to arbitrary functions f .
Any matrix A ∈ Cn×n can be expressed in the Jordan canonical form
where
$$\mathbf{J}_k = \mathbf{J}_k(\lambda_k) = \begin{pmatrix} \lambda_k & 1 & & 0\\ & \lambda_k & \ddots & \\ & & \ddots & 1\\ 0 & & & \lambda_k \end{pmatrix} \in \mathbb{C}^{m_k\times m_k}, \qquad (1.46)$$
where
$$f(\mathbf{J}_k) = \begin{pmatrix} f(\lambda_k) & f'(\lambda_k) & \cdots & \dfrac{f^{(m_k-1)}(\lambda_k)}{(m_k-1)!}\\ & f(\lambda_k) & \ddots & \vdots\\ & & \ddots & f'(\lambda_k)\\ & & & f(\lambda_k) \end{pmatrix}. \qquad (1.48)$$
Several remarks are in order. First, the definition yields an f (A) that can be shown
to be independent of the particular Jordan canonical form that is used. Second, if $\mathbf{A}$ is diagonalizable, the Jordan canonical form reduces to an eigendecomposition $\mathbf{A} = \mathbf{Z}^{-1}\mathbf{D}\mathbf{Z}$, with $\mathbf{D} = \operatorname{diag}(\lambda_i)$ and the columns of $\mathbf{Z}$ (renamed $\mathbf{U}$) eigenvectors of $\mathbf{A}$; in this case the above definition yields
Therefore, for diagonalizable matrices, f (A) has the same eigenvectors as A and
its eigenvalues are obtained by applying f to those of A.
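To make the diagonalizable case concrete, here is a small sketch (ours; it uses a real symmetric matrix so that the eigendecomposition is orthogonal, and the exponential as the example function) that builds $f(\mathbf{A})$ by applying $f$ to the eigenvalues and checks the result against SciPy's matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
n = 5
B = rng.standard_normal((n, n))
A = (B + B.T) / 2  # real symmetric, hence diagonalizable with orthogonal U

# f(A) = U f(D) U^T, with f applied entrywise to the eigenvalues
w, U = np.linalg.eigh(A)
f_of_A = U @ np.diag(np.exp(w)) @ U.T

print("max |U exp(D) U^T - expm(A)| =", np.max(np.abs(f_of_A - expm(A))))
```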
See [25] for matrix norms. The matrix p-norm is defined, for $1 \leqslant p \leqslant \infty$, as
$$\|\mathbf{A}\|_p = \max_{\mathbf{x}\neq 0}\frac{\|\mathbf{A}\mathbf{x}\|_p}{\|\mathbf{x}\|_p},$$
where $\|\mathbf{x}\|_p = \left(\sum_{i=1}^{n}|x_i|^p\right)^{1/p}$. When $p = 2$, it is called the spectral norm, $\|\mathbf{A}\|_2 = \|\mathbf{A}\|$. The Frobenius norm is defined as
$$\|\mathbf{A}\|_F = \left(\sum_{i=1}^{n}\sum_{j=1}^{n}|a_{ij}|^2\right)^{1/2},$$
and, writing $\mathbf{C} = \mathbf{A}\mathbf{B}$,
$$\|\mathbf{A}\mathbf{B}\|_F^2 = \|\mathbf{C}\|_F^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}|c_{ij}|^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}\left|\sum_{k=1}^{n}a_{ik}b_{kj}\right|^2.$$
Applying the Cauchy-Schwarz inequality to the expression $\sum_{k=1}^{n}a_{ik}b_{kj}$, we find that
$$\|\mathbf{A}\mathbf{B}\|_F^2 \leqslant \sum_{i=1}^{n}\sum_{j=1}^{n}\left(\sum_{k=1}^{n}a_{ik}^2\right)\left(\sum_{k=1}^{n}b_{kj}^2\right) = \left(\sum_{i=1}^{n}\sum_{k=1}^{n}a_{ik}^2\right)\left(\sum_{j=1}^{n}\sum_{k=1}^{n}b_{kj}^2\right) = \|\mathbf{A}\|_F^2\,\|\mathbf{B}\|_F^2.$$
$$\|\mathbf{A}\|_{S_p} = \left(\sum_{i=1}^{n}\sigma_i^p\right)^{1/p} \quad\text{for } 1\leqslant p\leqslant\infty, \qquad \|\mathbf{A}\|_{S_\infty} = \|\mathbf{A}\|_{op} = \sigma_1,$$
for a matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$. When $p = \infty$, we obtain the operator norm (or spectral norm) $\|\mathbf{A}\|_{op}$, which is the largest singular value; it is so commonly used that we sometimes write $\|\mathbf{A}\|$ for it. When $p = 2$, we obtain the commonly used Hilbert-Schmidt norm, or Frobenius norm, $\|\mathbf{A}\|_{S_2} = \|\mathbf{A}\|_F$. When $p = 1$, $\|\mathbf{A}\|_{S_1}$ is the nuclear norm. Note that $\|\mathbf{A}\|$ is the spectral norm, while $\|\mathbf{A}\|_F$ is the Frobenius norm. A drawback of the spectral norm is that it is expensive to compute, unlike the Frobenius norm. We have the following properties of the Schatten p-norm:
1. When $p < q$, the inequality $\|\mathbf{A}\|_{S_q} \leqslant \|\mathbf{A}\|_{S_p}$ holds.
2. If $r$ is the rank of $\mathbf{A}$, then with $p > \log(r)$, it holds that $\|\mathbf{A}\| \leqslant \|\mathbf{A}\|_{S_p} \leqslant e\,\|\mathbf{A}\|$.
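The following short sketch (ours; the random test matrix and the chosen values of p are arbitrary) computes Schatten p-norms from the singular values and checks the monotonicity property 1 above numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 6))
sv = np.linalg.svd(A, compute_uv=False)  # singular values sigma_1 >= ... >= sigma_n

def schatten(p):
    # Schatten p-norm: the l_p norm of the singular-value vector
    return np.sum(sv**p) ** (1.0 / p)

for p in (1, 2, 4, 8):
    print(f"p = {p}: ||A||_Sp = {schatten(p):.4f}")
print("operator norm (p -> infinity):", sv[0])
print("Frobenius check:", np.isclose(schatten(2), np.linalg.norm(A, 'fro')))
```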
Let $\langle\mathbf{X},\mathbf{Y}\rangle = \operatorname{Tr}\left(\mathbf{X}^T\mathbf{Y}\right)$ represent the Euclidean inner product between two matrices and $\|\mathbf{X}\|_F = \sqrt{\langle\mathbf{X},\mathbf{X}\rangle}$. It can easily be shown that
$$\|\mathbf{X}\|_F = \sup_{\|\mathbf{G}\|_F = 1}\operatorname{Tr}\left(\mathbf{X}^T\mathbf{G}\right) = \sup_{\|\mathbf{G}\|_F = 1}\langle\mathbf{X},\mathbf{G}\rangle.$$
Similarly, for vectors,
$$\|\mathbf{x}\| = \sup_{\|\mathbf{y}\| = 1}\langle\mathbf{x},\mathbf{y}\rangle.$$
1.4.6 Expectation
We follow closely [9] for this basic background knowledge to set the stage for future
applications. Given an unsigned random variable X (i.e., a random variable taking
values in $[0, +\infty]$), one can define the expectation or mean $EX$ as the unsigned integral
$$EX = \int_{0}^{\infty} x\,d\mu_X(x),$$
for an absolutely integrable real random variable
$$EX = \int_{\mathbb{R}} x\,d\mu_X(x),$$
and
$$EX = \int_{\mathbb{C}} x\,d\mu_X(x)$$
in the complex case. Similar definitions apply to a vector-valued random variable (note that in finite dimensions all norms are equivalent, so the precise choice of norm used to define $|X|$ is not relevant here). If $\mathbf{x} = (X_1, \ldots, X_n)$ is a vector-valued random variable, then $\mathbf{x}$ is absolutely integrable if and only if the components $X_i$ are all absolutely integrable, in which case one has
$$E\mathbf{x} = (EX_1, \ldots, EX_n).$$
By the Fubini-Tonelli theorem, the same result also applies to infinite sums $\sum_{i=1}^{\infty}c_iX_i$, provided that $\sum_{i=1}^{\infty}|c_i|\,E|X_i|$ is finite.
$$EX \leqslant EY.$$
For an unsigned random variable, we have the obvious but very useful Markov inequality
$$P(X \geqslant \lambda) \leqslant \frac{1}{\lambda}\,EX$$
for any $\lambda > 0$. For signed random variables, Markov's inequality becomes
$$P(|X| \geqslant \lambda) \leqslant \frac{1}{\lambda}\,E|X|.$$
If $X$ is an absolutely integrable or unsigned scalar random variable, and $F$ is a measurable function from the scalars to the unsigned extended reals $[0, +\infty]$, then one has the change of variables formula
$$E\,F(X) = \int F(x)\,d\mu_X(x).$$
In particular, one can consider the absolute moments $E|X|^k$, the exponential moments $E\,e^{tX}$, and the Fourier moments $E\,e^{jtX}$ for real $t$ and $X$, or $E\,e^{j\,\mathbf{t}\cdot\mathbf{x}}$ for complex or vector-valued arguments.
The reason for developing the scalar and vector cases is that we are motivated to study
$$E\,e^{\mathbf{X}}, \qquad E\,e^{\mathbf{X}_1 + \cdots + \mathbf{X}_n}$$
for matrix-valued random variables $\mathbf{X}, \mathbf{X}_1, \ldots, \mathbf{X}_n$.
for all real t, and if and only if there exists C > 0 such that
$$E|X|^k \leqslant (Ck)^{k/2}$$
If X is sub-Gaussian (or has sub-exponential tails with exponent a > 1), then
from dominated convergence we have the Taylor expansion
$$E\,e^{tX} = 1 + \sum_{k=1}^{\infty}\frac{t^k}{k!}\,EX^k$$
for any real or complex t, thus relating the exponential and Fourier moments with
the kth moments.
The proof of this is as follows. Let $I_{\{|X|^p\geqslant x\}}$ be the indicator random variable that takes the value 1 on the event $|X|^p\geqslant x$ and 0 otherwise. Using Fubini's theorem, we derive
$$E|X|^p = \int_{\Omega}|X|^p\,dP = \int_{\Omega}\int_{0}^{|X|^p}dx\,dP = \int_{\Omega}\int_{0}^{\infty}I_{\{|X|^p\geqslant x\}}\,dx\,dP = \int_{0}^{\infty}\int_{\Omega}I_{\{|X|^p\geqslant x\}}\,dP\,dx = \int_{0}^{\infty}P\left(|X|^p\geqslant x\right)dx = p\int_{0}^{\infty}P\left(|X|^p\geqslant t^p\right)t^{p-1}\,dt = p\int_{0}^{\infty}P\left(|X|\geqslant t\right)t^{p-1}\,dt,$$
$$E|XY| \leqslant \left(E|X|^p\right)^{1/p}\left(E|Y|^q\right)^{1/q}.$$
The function P (|X| > t) is called the tail of X. The Markov inequality is a
simple way of estimating the tail. We can also use the moments to estimate the tails.
The next statement, due to Tropp [26], is simple but powerful. Suppose $X$ is a random variable satisfying
$$\left(E|X|^p\right)^{1/p} \leqslant \alpha\,\beta^{1/p}\,p^{1/\gamma} \quad\text{for all } p\geqslant p_0.$$
The proof of the claim is short. By Markov's inequality, we obtain for an arbitrary $p\geqslant p_0$
$$P\left(|X|\geqslant e^{1/\gamma}\alpha t\right) \leqslant \frac{E|X|^p}{\left(e^{1/\gamma}\alpha t\right)^p} \leqslant \beta\left(\frac{\alpha\,p^{1/\gamma}}{e^{1/\gamma}\alpha t}\right)^{p}.$$
In many cases, this inequality yields essentially sharp results for the appropriate
choice of p.
If $X, Y$ are independent with $E[Y] = 0$ and $k\geqslant 2$, then [29]
$$E|X|^k \leqslant E|X - Y|^k.$$
with $C_\beta \leqslant \sqrt{2} + \frac{1}{4\sqrt{2}\,\ln(4\beta)}$.
Proof. According to (1.51), we have, for some $\alpha \geqslant \sqrt{2}$,
$$E\max_{i=1,\ldots,N}|X_i| = \int_{0}^{\infty}P\left(\max_{i=1,\ldots,N}|X_i| > t\right)dt \leqslant \int_{0}^{\alpha}1\,dt + \int_{\alpha}^{\infty}P\left(\max_{i=1,\ldots,N}|X_i| > t\right)dt \leqslant \alpha + \sum_{i=1}^{N}\int_{\alpha}^{\infty}P\left(|X_i| > t\right)dt \leqslant \alpha + N\beta\int_{\alpha}^{\infty}e^{-t^2/2}\,dt.$$
Thus we have
$$E\max_{i=1,\ldots,N}|X_i| \leqslant \alpha + N\beta\int_{\alpha}^{\infty}e^{-t^2/2}\,dt \leqslant \alpha + \frac{N\beta}{\alpha}\,e^{-\alpha^2/2}.$$
Now we choose $\alpha = \sqrt{2\ln(4\beta N)} \geqslant \sqrt{2\ln 4} \geqslant \sqrt{2}$. This gives
$$E\max_{i=1,\ldots,N}|X_i| \leqslant \sqrt{2\ln(4\beta N)} + \frac{1}{4\sqrt{2\ln(4\beta N)}} = \left(\sqrt{2} + \frac{1}{4\sqrt{2}\,\ln(4\beta N)}\right)\sqrt{\ln(4\beta N)} \leqslant C_\beta\sqrt{\ln(4\beta N)}.$$
Some results are formulated in terms of moments; the transition to a tail bound can
be established by the following standard result, which easily follows from Markov’s
inequality.
Theorem 1.4.2 ([31]). Suppose X is a random variable satisfying
$$\left(E|X|^p\right)^{1/p} \leqslant \alpha + \beta\sqrt{p} + \gamma\,p \quad\text{for all } p\geqslant p_0.$$
A random vector $\mathbf{x} = (X_1, \ldots, X_n)^T \in\mathbb{R}^n$ is a collection of $n$ random variables $X_i$ on a common probability space. Its expectation is the vector
$$E\mathbf{x} = (EX_1, \ldots, EX_n)^T \in\mathbb{R}^n.$$
1.4.9 Convergence
Gnedenko and Kolmogorov [34] point out: "In the formal construction of a course
in the theory of probability, limit theorems appear as a kind of superstructure
over elementary chapters, in which all problems have finite, purely arithmetic
Next, Taylor's expansion and the mean zero and boundedness hypotheses can be used to show that, for every $i$,
$$E\,e^{\lambda X_i} \leqslant e^{\lambda^2\operatorname{var}(X_i)}, \qquad 0\leqslant\lambda\leqslant 1.$$
This results in
$$p \leqslant e^{-\lambda t + \lambda^2\sigma^2}, \qquad\text{where } \sigma^2 = \sum_{i=1}^{n}\operatorname{var}(X_i).$$
i=1
The optimal choice of the parameter λ ∼ min τ /2σ 2 , 1 implies Chernoff’s
inequality
2 2
p max e−t /σ , e−t/2 .
EX = [EXij ] .
In other words, the expectation of a matrix (or vector) is just the matrix of
expectations of the individual elements.
The basic properties of expectation still hold in these extensions.
If A, B, c are nonrandom, then [37, p. 276]
$\mathcal{M}_d$ is the set of real symmetric $d\times d$ matrices, and $\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ denotes the set of complex Hermitian $d\times d$ matrices. The spectral theorem states that every $\mathbf{A}\in\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ has $d$ real eigenvalues (possibly with repetitions) that correspond to an orthonormal set of eigenvectors. $\lambda_{\max}(\mathbf{A})$ is the largest eigenvalue of $\mathbf{A}$. All the eigenvalues are sorted in non-increasing manner. The spectrum of $\mathbf{A}$, denoted by $\operatorname{spec}(\mathbf{A})$, is the multiset of all eigenvalues, where each eigenvalue appears a number of times equal to its multiplicity. We let
Using the spectral theorem for the identity matrix $\mathbf{I}$ gives $\|\mathbf{I}\| = 1$. Moreover, the trace of $\mathbf{A}$, $\operatorname{Tr}(\mathbf{A})$, is defined as the sum of the diagonal entries of $\mathbf{A}$. The trace of a matrix is equal to the sum of the eigenvalues of $\mathbf{A}$, or
$$\operatorname{Tr}(\mathbf{A}) = \sum_{i=1}^{d}\lambda_i(\mathbf{A}).$$
Given a matrix ensemble $\mathbf{X}$, there are many statistics of $\mathbf{X}$ that one may wish to consider, e.g., the eigenvalues or singular values of $\mathbf{X}$, the trace, the determinant, etc. Among the basic statistics is the operator norm [9, p. 106], which is an upper bound on many quantities. For example, $\|\mathbf{A}\|_{op}$ is also the largest singular value $\sigma_1(\mathbf{A})$ of $\mathbf{A}$, and thus dominates the other singular values; similarly, all eigenvalues $\lambda_i(\mathbf{A})$ of $\mathbf{A}$ clearly have magnitude at most $\|\mathbf{A}\|_{op}$, since $|\lambda_i(\mathbf{A})| \leqslant \sigma_1(\mathbf{A})$. Because of this, it is particularly important to obtain good upper bounds,
$$P\left(\|\mathbf{A}\|_{op} \geqslant \lambda\right) \leqslant \cdots,$$
on this quantity, for various thresholds $\lambda$. Lower tail bounds are also of interest; for instance, they give confidence that the upper tail bounds are sharp.
We denote by $|\mathbf{A}|$ the positive operator (or matrix) $\left(\mathbf{A}^*\mathbf{A}\right)^{1/2}$ and by $s(\mathbf{A})$ the vector whose coordinates are the singular values of $\mathbf{A}$, arranged as $s_1(\mathbf{A}) \geqslant s_2(\mathbf{A}) \geqslant \cdots \geqslant s_n(\mathbf{A})$. We have [23]
$$\|\mathbf{A}\| = \left\|\,|\mathbf{A}|\,\right\| = s_1(\mathbf{A}).$$
Now, if $\mathbf{U}, \mathbf{V}$ are unitary operators on $\mathbb{C}^n$, then $|\mathbf{U}\mathbf{A}\mathbf{V}| = \mathbf{V}^*|\mathbf{A}|\mathbf{V}$ and hence
$$\|\mathbf{A}\| = \|\mathbf{U}\mathbf{A}\mathbf{V}\|$$
for all unitary operators $\mathbf{U}, \mathbf{V}$. The last property is called unitary invariance. Several other norms have this property. We will use the symbol $|||\mathbf{A}|||$ to mean a norm on $n\times n$ matrices that satisfies
$$|||\mathbf{U}\mathbf{A}\mathbf{V}||| = |||\mathbf{A}|||$$
for all $\mathbf{A}$ and for all unitary $\mathbf{U}, \mathbf{V}$. We will call such a norm a unitarily invariant norm. We will normalize such norms so that they take the value 1 on the matrix $\operatorname{diag}(1, 0, \ldots, 0)$.
$$f(t) = \frac{1+t^2}{1-t} \;\Rightarrow\; f(\mathbf{A}) = \left(\mathbf{I}-\mathbf{A}\right)^{-1}\left(\mathbf{I}+\mathbf{A}^2\right)$$
if $1\notin\operatorname{spec}(\mathbf{A})$.
If $f(t)$ has a convergent power series representation, such as
$$\log(1+t) = t - \frac{t^2}{2} + \frac{t^3}{3} - \frac{t^4}{4} + \cdots, \qquad |t| < 1,$$
we can again simply substitute $\mathbf{A}$ for $t$ to define
$$\log(\mathbf{I}+\mathbf{A}) = \mathbf{A} - \frac{\mathbf{A}^2}{2} + \frac{\mathbf{A}^3}{3} - \frac{\mathbf{A}^4}{4} + \cdots, \qquad \rho(\mathbf{A}) < 1.$$
Here ρ denotes the spectral radius and the condition ρ (A) < 1 ensures convergence
of the matrix series. In this ad hoc fashion, a wide variety of matrix functions can be
defined. This approach is certainly appealing to engineering communities; however, it has several drawbacks.
Theorem 1.4.4 (Spectral Mapping Theorem [38]). Let $f:\mathbb{C}\to\mathbb{C}$ be an entire analytic function with a power-series representation $f(z) = \sum_{l\geqslant 0}c_lz^l$, $z\in\mathbb{C}$. If all $c_l$ are real, we define the mapping
$$f(\mathbf{A}) \equiv \sum_{l\geqslant 0}c_l\mathbf{A}^l, \qquad \mathbf{A}\in\mathbb{C}^{d\times d}_{\mathrm{Herm}}, \qquad (1.59)$$
where $\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ is the set of Hermitian $d\times d$ matrices. The expression corresponds to a map from $\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ to itself. The so-called spectral mapping property is expressed as
$$\operatorname{spec}\left(f(\mathbf{A})\right) = f\left(\operatorname{spec}(\mathbf{A})\right). \qquad (1.60)$$
By (1.60), we mean that the eigenvalues of $f(\mathbf{A})$ are the numbers $f(\lambda)$ with $\lambda\in\operatorname{spec}(\mathbf{A})$. Moreover, the multiplicity of $\xi\in\operatorname{spec}(f(\mathbf{A}))$ is the sum of the multiplicities of all preimages of $\xi$ under $f$ that lie in $\operatorname{spec}(\mathbf{A})$.
For any function f : R → R, we extend f to a function on Hermitian matrices
as follows. We define a map on diagonal matrices by applying the function to
each diagonal entry. We extend f to a function on Hermitian matrices using the
eigenvalue decomposition. Let A = UDUT be a spectral decomposition of A.
Then, we define
$$f(\mathbf{A}) = \mathbf{U}\begin{pmatrix} f(D_{1,1}) & & 0\\ & \ddots & \\ 0 & & f(D_{d,d})\end{pmatrix}\mathbf{U}^T. \qquad (1.61)$$
The square function f (t) = t2 is monotone in the usual real-valued sense but not
monotone√ in the operator monotone sense. It turns out that the square root function
f (t) = t is also operator monotone.
The square function is, however, operator convex. The cube function is not operator convex. After seeing these examples, let us present the Löwner-Heinz theorem.
Theorem 1.4.7 (Löwner-Heinz Theorem). For $-1\leqslant p\leqslant 0$, the function $f(t) = -t^p$ is operator monotone and operator concave. For $0\leqslant p\leqslant 1$, the function $f(t) = t^p$ is operator monotone and operator concave. For $1\leqslant p\leqslant 2$, the function $f(t) = t^p$ is operator convex. Furthermore, $f(t) = \log(t)$ is operator monotone and operator concave, while $f(t) = t\log(t)$ is operator convex.
f (t) = t−1 is operator convex and f (t) = −t−1 is operator monotone.
$$\frac{\psi(t)-\psi(0)}{t} \leqslant \psi(1) - \psi(0)$$
for all $t$. Taking the limit $t\to 0$, we obtain
$$\log\frac{\operatorname{Tr}\left[e^{\mathbf{A}+\mathbf{B}}\right]}{\operatorname{Tr}\left[e^{\mathbf{A}}\right]} \geqslant \frac{\operatorname{Tr}\left[\mathbf{B}\,e^{\mathbf{A}}\right]}{\operatorname{Tr}\left[e^{\mathbf{A}}\right]}. \qquad (1.66)$$
$$H_0: \mathbf{y} = \mathbf{w}, \qquad H_1: \mathbf{y} = \mathbf{x} + \mathbf{w},$$
where $\mathbf{x}$ and $\mathbf{w}$ are the signal and noise vectors and $\mathbf{y}$ is the output vector. The covariance matrix relation is used to rewrite the problem as the matrix-valued hypothesis test
$$H_0: \mathbf{R}_{yy} = \mathbf{R}_{ww} := \mathbf{A}, \qquad H_1: \mathbf{R}_{yy} = \mathbf{R}_{xx} + \mathbf{R}_{ww} := \mathbf{A} + \mathbf{B},$$
which leads to the decision rule
$$H_1: \log\frac{\operatorname{Tr}\left[e^{\mathbf{R}_{xx}+\mathbf{R}_{ww}}\right]}{\operatorname{Tr}\left[e^{\mathbf{R}_{ww}}\right]} \geqslant T_0, \quad\text{with } T_0 = \frac{\operatorname{Tr}\left[\mathbf{R}_{xx}\,e^{\mathbf{R}_{ww}}\right]}{\operatorname{Tr}\left[e^{\mathbf{R}_{ww}}\right]}; \qquad H_0: \text{otherwise}.$$
Thus, a priori knowledge of $\mathbf{R}_{xx}$, $\mathbf{R}_{ww}$ can be used to set the threshold $T_0$. In the real world, estimated covariance matrices must be used in place of the true covariance matrices above. We often consider a number of estimated covariance matrices, which are random matrices, and we naturally want to consider a sum of these estimated covariance matrices. Thus, we obtain
$$H_1: \log\left(\frac{\operatorname{Tr}\left[e^{\left(\hat{\mathbf{R}}_{xx,1}+\cdots+\hat{\mathbf{R}}_{xx,n}\right)+\left(\hat{\mathbf{R}}_{ww,1}+\cdots+\hat{\mathbf{R}}_{ww,n}\right)}\right]}{\operatorname{Tr}\left[e^{\hat{\mathbf{R}}_{ww,1}+\cdots+\hat{\mathbf{R}}_{ww,n}}\right]}\right) \geqslant T_0; \qquad H_0: \text{otherwise},$$
with
$$T_0 = \frac{\operatorname{Tr}\left[\left(\hat{\mathbf{R}}_{xx,1}+\cdots+\hat{\mathbf{R}}_{xx,n}\right)e^{\hat{\mathbf{R}}_{ww,1}+\cdots+\hat{\mathbf{R}}_{ww,n}}\right]}{\operatorname{Tr}\left[e^{\hat{\mathbf{R}}_{ww,1}+\cdots+\hat{\mathbf{R}}_{ww,n}}\right]}.$$
If the bounds of sums of random matrices can be used to bound the threshold T0 ,
the problem can be greatly simplified. This example provides one motivation for
systematically studying the sums of random matrices in this book.
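As an illustration of this kind of detector (our own sketch, not code from the book; the signal and noise covariances, sample sizes, and use of Gaussian sample covariance matrices are assumptions made for the example), the following code evaluates the trace-exponential statistic under both hypotheses and compares it with the threshold:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(6)
d, samples = 4, 200

# assumed true covariances for the illustration
R_ww = 0.5 * np.eye(d)                   # noise covariance
R_xx = np.diag([1.0, 0.8, 0.3, 0.1])     # signal covariance

def sample_cov(R, n):
    # sample covariance matrix from n i.i.d. Gaussian vectors with covariance R
    Y = rng.multivariate_normal(np.zeros(d), R, size=n)
    return Y.T @ Y / n

def statistic(R_yy_hat, R_ww_hat):
    # log( Tr exp(R_yy_hat) / Tr exp(R_ww_hat) ), a trace-exponential statistic
    return np.log(np.trace(expm(R_yy_hat)) / np.trace(expm(R_ww_hat)))

R_ww_hat = sample_cov(R_ww, samples)
T0 = np.trace(R_xx @ expm(R_ww_hat)) / np.trace(expm(R_ww_hat))  # threshold

stat_H0 = statistic(sample_cov(R_ww, samples), R_ww_hat)          # noise only
stat_H1 = statistic(sample_cov(R_xx + R_ww, samples), R_ww_hat)   # signal + noise

print(f"threshold T0        {T0:.3f}")
print(f"statistic under H0  {stat_H0:.3f}")
print(f"statistic under H1  {stat_H1:.3f}")
```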
$$\mathbf{I} + \mathbf{A} \leqslant e^{\mathbf{A}}, \qquad (1.68)$$
$$\cosh(\mathbf{A}) \leqslant e^{\mathbf{A}^2/2}. \qquad (1.69)$$
We often work with the trace of the matrix exponential, Tr exp : A → Tr eA . The
trace exponential function is convex [22]. It is also monotone [22] with respect to
the semidefinite order:
$$\mathbf{A} \leqslant \mathbf{H} \;\Rightarrow\; \operatorname{Tr}e^{\mathbf{A}} \leqslant \operatorname{Tr}e^{\mathbf{H}}. \qquad (1.70)$$
The matrix exponential does not convert sums into products, but the trace exponential has a related property that serves as a limited substitute.
For $n\times n$ complex matrices, the matrix exponential is defined by the Taylor series (a power series representation, of course) as
$$\exp(\mathbf{A}) = e^{\mathbf{A}} = \sum_{k=0}^{\infty}\frac{1}{k!}\mathbf{A}^k.$$
For a proof, we refer to [23, 41]. For a survey of the Golden-Thompson and other trace inequalities, see [40]. The Golden-Thompson inequality holds with the trace replaced by an arbitrary unitarily invariant norm; see [23, Theorem 9.3.7]. A version of the Golden-Thompson inequality for three matrices fails:
$$\operatorname{Tr}\,e^{\mathbf{A}+\mathbf{B}+\mathbf{C}} \leqslant \operatorname{Tr}\left(e^{\mathbf{A}}e^{\mathbf{B}}e^{\mathbf{C}}\right) \quad\text{does not hold in general.}$$
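A quick numerical check of the two-matrix Golden-Thompson inequality, $\operatorname{Tr}\,e^{\mathbf{A}+\mathbf{B}} \leqslant \operatorname{Tr}\left(e^{\mathbf{A}}e^{\mathbf{B}}\right)$, is easy to run (our own sketch with random Hermitian matrices of an arbitrary size):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(7)
d = 5

def random_hermitian(d):
    M = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (M + M.conj().T) / 2

for trial in range(5):
    A, B = random_hermitian(d), random_hermitian(d)
    lhs = np.trace(expm(A + B)).real
    rhs = np.trace(expm(A) @ expm(B)).real
    print(f"Tr e^(A+B) = {lhs:10.4f}  <=  Tr(e^A e^B) = {rhs:10.4f}:", lhs <= rhs)
```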
We define the matrix logarithm as the functional inverse of the matrix exponential:
log eA := A for all Hermitian matrix A. (1.72)
where 0 denotes the zero matrix whose entries are all zeros. The logarithm is also
operator concave:
$$\tau\log\mathbf{A} + (1-\tau)\log\mathbf{H} \leqslant \log\left(\tau\mathbf{A} + (1-\tau)\mathbf{H}\right)$$
for all positive definite $\mathbf{A}$, $\mathbf{H}$ and $\tau\in[0,1]$. Operator monotone functions and operator convex functions are depressingly rare. In particular, the matrix exponential does not belong to either class [23, Chap. V]. Fortunately, the trace inequalities of a matrix-valued function can be used as a limited substitute. For a survey, see [40], which is very accessible. Carlen [15] is also ideal for a beginner.
Quantum relative entropy is also called quantum information divergence and von
Neumann divergence. It has a nice geometric interpretation [42].
A new class of matrix nearness problems uses a directed distance measure called
a Bregman divergence. We define the Bregman divergence of the matrix X from the
matrix Y as
for a positive-definite matrix. Note that the trace function is linear. The divergence
D (X; Y) can be viewed as the difference between ϕ (X) and the best affine
approximation of the entropy at the matrix Y. In other words, (1.75) is the special
case of (1.76) when ϕ (X) is given by (1.77). The entropy function ϕ given in (1.77)
is a strictly convex function, which implies that the affine approximation strictly
underestimates this ϕ. This observation gives us the following fact.
Fact 1.4.13 (Klein's inequality). The quantum relative entropy is nonnegative:
$$D(\mathbf{X};\mathbf{Y}) \geqslant 0.$$
Introducing the definition of the quantum relative entropy into Fact 1.4.13 and rearranging, we obtain
$$\operatorname{Tr}\mathbf{Y} \geqslant \operatorname{Tr}\left(\mathbf{X}\log\mathbf{Y} - \mathbf{X}\log\mathbf{X} + \mathbf{X}\right).$$
When X = Y, both sides are equal. We can summarize this observation in a lemma
for convenience.
Lemma 1.4.14 (Variation formula for trace [43]). Let Y be a positive-definite
matrix. Then,
This lemma is a restatement of the fact that quantum relative entropy is nonnegative.
The convexity of quantum relative entropy has paramount importance.
Fact 1.4.15 (Lindblad [44]). The quantum relative entropy defined in (1.75) is a
jointly convex function. That is,
t · f (x1 ; y1 ) + (1 − t) f (x2 ; y2 )
The second line follows from the assumption that f (·; ·) be a jointly concave
function. In words, the partial maximum is a concave function.
f = w1 f 1 + · · · + w m f m (1.78)
Lieb’s Theorem is the foundation for studying the sum of random matrices in
Chap. 2. We present a succinct proof of this theorem, following the arguments of Tropp
[49]. Although the main ideas of Tropp’s presentation are drawn from [46], his proof
provides a geometric intuition for Theorem 1.4.17 and connects it to another major
result. Section 1.4.19 provides all the necessary tools for this proof.
Theorem 1.4.17 (Lieb [50]). Fix a Hermitian matrix H. The function
Y = exp (H + log A)
to obtain
Using the quantum relative entropy of (1.75), this expression can be rewritten as
the whole bracket on the right-hand-side of (1.80) is also a jointly convex function
of the matrix variables A and X. It follows from Proposition 1.4.16 that the
right-hand-side of (1.80) defines a concave function of A. This observation
completes the proof.
We require a simple but powerful corollary of Lieb’s theorem. This result connects
expectation with the trace exponential.
Corollary 1.4.18. Let H be a fixed Hermitian matrix, and let X be a random
Hermitian matrix. Then
$$E\operatorname{Tr}\exp(\mathbf{H}+\mathbf{X}) \leqslant \operatorname{Tr}\exp\left(\mathbf{H} + \log E\,e^{\mathbf{X}}\right).$$
The first relation follows from the definition (1.72) of the matrix logarithm because
Y is always positive definite, Y > 0. Lieb’s result, Theorem 1.4.17, says that the
trace function is concave in Y, so in the second relation we may invoke Jensen’s
inequality to draw the expectation inside the logarithm.
1.4.21 Dilations
The matrix A is called positive semi-definite if all of its eigenvalues are non-
negative. This is denoted A 0. Furthermore, for any two Hermitian matrices
A and B, we write A ≥ B if A − B ≥ 0. One can define a semidefinite order or
partial order on all Hermitian matrices. See [22] for a treatment of this topic.
For any t, the eigenvalues of A − tI are λ1 − t, . . . , λd − t. The spectral norm of
$\mathbf{A}$, denoted as $\|\mathbf{A}\|$, is defined to be $\max_i|\lambda_i|$. Thus $-\|\mathbf{A}\|\cdot\mathbf{I} \leqslant \mathbf{A} \leqslant \|\mathbf{A}\|\cdot\mathbf{I}$.
Claim 1.4.19. Let $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ be Hermitian $d\times d$ matrices satisfying $\mathbf{A}\geqslant 0$ and $\mathbf{B}\leqslant\mathbf{C}$. Then $\operatorname{Tr}(\mathbf{A}\cdot\mathbf{B}) \leqslant \operatorname{Tr}(\mathbf{A}\cdot\mathbf{C})$.
Notice that $\leqslant$ is a partial order and that
$$\mathbf{A}, \mathbf{B}, \mathbf{A}', \mathbf{B}'\in\mathbb{C}^{d\times d}_{\mathrm{Herm}},\ \mathbf{A}\leqslant\mathbf{B} \text{ and } \mathbf{A}'\leqslant\mathbf{B}' \;\Rightarrow\; \mathbf{A}+\mathbf{A}' \leqslant \mathbf{B}+\mathbf{B}'.$$
Moreover, for $\mathbf{A}\in\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ with $\mathbf{A}\geqslant 0$,
$$\operatorname{Tr}(\mathbf{A}\cdot\mathbf{B}) \leqslant \|\mathbf{B}\|\operatorname{Tr}(\mathbf{A}), \qquad (1.84)$$
The relation (1.86) is also a specific instance of Kadison’s inequality [23, Theorem
2.3.3].
Moreover, one can check that the usual product rule is satisfied: if $\mathbf{Z}, \mathbf{W}:\Omega\to\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ are measurable and independent, then
Finally,
1.4.25 Isometries
we see that transforms of the type that the theorem deals with are characterized
by the fact that they preserve distances. For this reason, we call such a transform
an isometry. An isometry on a finite-dimensional space is necessarily orthogonal or unitary; use of this terminology will enable us to treat the real and the complex cases simultaneously. On a finite-dimensional space, we observe that an isometry is
always invertible and that U−1 (= U∗ ) is an isometry along with U.
A matrix $\mathbf{V}_{-}\in\mathbb{V}^n_k$ achieves equality in (1.90) if and only if its columns span a dominant $k$-dimensional invariant subspace of $\mathbf{A}$. Likewise, a matrix $\mathbf{V}_{+}\in\mathbb{V}^n_{n-k+1}$ achieves equality in (1.89) if and only if its columns span a bottom $(n-k+1)$-dimensional invariant subspace of $\mathbf{A}$.
The $\pm$ subscripts in Theorem 1.4.22 are chosen to reflect the fact that $\lambda_k(\mathbf{A})$ is the minimum eigenvalue of $\mathbf{V}_{-}^{*}\mathbf{A}\mathbf{V}_{-}$ and the maximum eigenvalue of $\mathbf{V}_{+}^{*}\mathbf{A}\mathbf{V}_{+}$. As a consequence of Theorem 1.4.22, when $\mathbf{A}$ is Hermitian,
This fact (1.91) allows us to use the same techniques we develop for bounding the
eigenvalues from above to bound them from below. The use of this fact is given in
Sect. 2.13.
available. The larger the dimension $n$, the bigger the gap between the two sides of (1.93); see Fig. 1.2 for an illustration. Jensen's inequality says that if $f$ is convex,
$$f(EX) \leqslant E\,f(X),$$
provided that the expectations exist. Empirically, we found that when $n$ is large enough, say $n\geqslant 100$, we can write (1.93) in the form
y = Hx + n
Fig. 1.1 Comparison of quadratic and bilinear forms as a function of dimension n. Expectation is approximated by an average of K = 100 Monte Carlo simulations. (a) n = 2. (b) n = 10. (c) n = 100
$$\langle\mathbf{y},\mathbf{x}\rangle = \langle\mathbf{H}\mathbf{x},\mathbf{x}\rangle + \langle\mathbf{n},\mathbf{x}\rangle = \sum_{i,j=1}^{n}a_{ij}X_iX_j + \langle\mathbf{n},\mathbf{x}\rangle.$$
Fig. 1.2 Comparison of quadratic and bilinear forms as a function of K and C. Expectation is approximated by an average of K Monte Carlo simulations. n = 100. (a) K = 100, C = 1.4. (b) K = 500, C = 1.15. (c) K = 1,000, C = 1.1
where $\mathbf{A}$ is an $n\times n$ matrix with zero diagonal, and $\langle\mathbf{A}\mathbf{n},\mathbf{n}\rangle$ is the quadratic form. On the other hand, if we have an independent copy $\mathbf{n}'$ of $\mathbf{n}$, we can form $\langle\mathbf{A}\mathbf{n},\mathbf{n}'\rangle$, which is the bilinear form. Thus, (1.93) can be used to establish the inequality relation between the two.
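A simulation in the spirit of Figs. 1.1 and 1.2 (our own sketch; the Gaussian noise model, the random matrix A, and the Monte Carlo sizes are illustrative assumptions) compares the quadratic form ⟨An, n⟩ with the decoupled bilinear form ⟨An, n′⟩:

```python
import numpy as np

rng = np.random.default_rng(8)
n, K = 100, 1000  # dimension and number of Monte Carlo samples

# random coefficient matrix with zero diagonal
A = rng.standard_normal((n, n))
np.fill_diagonal(A, 0.0)

N = rng.standard_normal((K, n))        # K copies of the noise vector n
N_prime = rng.standard_normal((K, n))  # independent copies n'

quadratic = np.einsum('ki,ij,kj->k', N, A, N)        # <A n, n>
bilinear = np.einsum('ki,ij,kj->k', N, A, N_prime)   # <A n, n'>

print("mean |quadratic| :", np.mean(np.abs(quadratic)))
print("mean |bilinear|  :", np.mean(np.abs(bilinear)))
```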
1.5 Decoupling from Dependance to Independence 51
$$Ef\left(\langle\mathbf{A}\mathbf{x},\mathbf{x}\rangle\right) \leqslant E_{\varepsilon}E_{\mathbf{x}}\,f\!\left(4\sum_{i,j\in[n]}\varepsilon_i(1-\varepsilon_j)\,a_{ij}X_iX_j\right).$$
Conditioned on $\varepsilon$, define the subset $I = \{i:\varepsilon_i = 1\}$; then
$$4\sum_{i,j\in[n]}\varepsilon_i(1-\varepsilon_j)\,a_{ij}X_iX_j = 4\sum_{(i,j)\in I\times I^c}a_{ij}X_iX_j,$$
where $I^c$ is the complement of the subset $I$. Since the $X_i$, $i\in I$, are independent of the $X_j$, $j\in I^c$, the distribution of the sum will not change if we replace the $X_j$, $j\in I^c$, by their independent copies $X'_j$, $j\in I^c$, the coordinates of the independent copy $\mathbf{x}'$ of $\mathbf{x}$. As a result, we have
$$Ef\left(\langle\mathbf{A}\mathbf{x},\mathbf{x}\rangle\right) \leqslant E_{\varepsilon}E_{\mathbf{x},\mathbf{x}'}\,f\!\left(4\sum_{(i,j)\in I\times I^c}a_{ij}X_iX'_j\right).$$
$$Ef(Y) = Ef(Y + EZ) \leqslant Ef(Y+Z).$$
Applying this bound with
$$Y = 4\sum_{(i,j)\in I\times I^c}a_{ij}X_iX'_j, \qquad Z = 4\sum_{(i,j)\notin I\times I^c}a_{ij}X_iX'_j,$$
we arrive at
$$E_{\mathbf{x},\mathbf{x}'}f(Y) \leqslant E_{\mathbf{x},\mathbf{x}'}f(Y+Z) = E\,f\!\left(4\sum_{i,j=1}^{n}a_{ij}X_iX'_j\right) = Ef\left(4\langle\mathbf{A}\mathbf{x},\mathbf{x}'\rangle\right).$$
Theorem 1.5.5 dates back to [61], but appeared with explicit constants and with
a much simplified proof in [62]. Let X be the operator norm and X F be the
Frobenius norm.
Theorem 1.5.5 (Theorem 17 of Boucheron et al. [62]). Let $\mathbf{X}$ be the $N\times N$ matrix with entries $x_{i,j}$, and assume that $x_{i,i} = 0$ (zero diagonal) for all $i\in\{1, \ldots, N\}$. Let $\xi = \{\xi_i\}_{i=1}^{N}$ be a Rademacher sequence. Then, for any $t > 0$,
$$P\left(\sum_{i,j}\xi_i\xi_j x_{i,j} > t\right) \leqslant 2\exp\left(-\frac{1}{64}\min\left\{\frac{96\,t}{65\,\|\mathbf{X}\|},\ \frac{t^2}{\|\mathbf{X}\|_F^2}\right\}\right), \qquad (1.96)$$
or, equivalently,
$$P\left(\xi^T\mathbf{X}\xi > t\right) \leqslant 2\exp\left(-\frac{1}{64}\min\left\{\frac{96\,t}{65\,\|\mathbf{X}\|},\ \frac{t^2}{\|\mathbf{X}\|_F^2}\right\}\right).$$
$$\sup_{\mathbf{X}\in\mathcal{F}}\ \sup_{\|\mathbf{z}\|_2^2\leqslant 1}\mathbf{z}^T\mathbf{X}\mathbf{z} = 1$$
for $\mathbf{z}\in\mathbb{R}^n$.
Theorem 1.5.6 (Theorem 17 of Boucheron et al. [62]). For all $t > 0$,
$$P\left(Z \geqslant E[Z] + t\right) \leqslant \exp\left(-\frac{t^2}{32E[Y] + 65t/3}\right)$$
Here, we highlight the fundamentals of random matrix theory that will be needed
in Chap. 9 that deals with high-dimensional data processing motivated by the
large-scale cognitive radio network testbed. As mentioned in Sect. 9.1, the basic
building block for each node in the data processing is a random matrix (e.g., sample
covariance matrix). A sum of random matrices arises naturally. Classical textbooks
deal with a sum of scalar-valued random variables—and the central limit theorem.
Here, we deal with a sum of random matrices—matrix-valued random variables.
Many new challenges will be encountered due to this fundamental paradigm shift.
For example, scalar-valued random variables are commutative, while matrix-valued
random variables are non-commutative. See Tao [9].
This is the standard method for the proof of the central limit theorem. Given any real random variable $X$, the characteristic function $F_X(t):\mathbb{R}\to\mathbb{C}$ is defined as
$$F_X(t) = E\,e^{jtX}.$$
For a vector-valued random variable $\mathbf{X}$,
$$F_{\mathbf{X}}(\mathbf{t}) = E\,e^{j\,\mathbf{t}\cdot\mathbf{X}},$$
where $\cdot$ denotes the Euclidean inner product on $\mathbb{R}^n$. One can similarly define the characteristic function on complex vector spaces $\mathbb{C}^n$ by using the complex inner product.
The most elementary (but still remarkably effective) method is the moment method [63]. The method is to understand the distribution of a random variable $X$ via its moments $EX^k$. This method is equivalent to the Fourier method: if we Taylor expand $e^{jtX}$ and formally exchange the series and expectation, we arrive at the heuristic identity
$$F_X(t) = \sum_{k=0}^{\infty}\frac{(jt)^k}{k!}\,EX^k.$$
$$\operatorname{tr}_n = \frac{1}{n}\operatorname{Tr}.$$
The case $\sigma^2 = \frac{1}{2}$ gives the normalization used by Wigner [64] and Mehta [65], while the case $\sigma^2 = \frac{1}{n}$ gives the normalization used by Voiculescu [66]. We say that $\mathbf{A}$ is a standard Hermitian Gaussian random $n\times n$ matrix with entries of variance $\sigma^2$ if the following conditions are satisfied:
1. The entries $a_{ij}$, $1\leqslant i\leqslant j\leqslant n$, form a set of $\frac{1}{2}n(n+1)$ independent, complex-valued random variables.
2. For each $i$ in $\{1, 2, \ldots, n\}$, $a_{ii}$ is a real-valued random variable with distribution $N\!\left(0, \frac{1}{2}\sigma^2\right)$.
3. When $i < j$, the real and imaginary parts $\operatorname{Re}(a_{ij})$, $\operatorname{Im}(a_{ij})$ of $a_{ij}$ are independent, identically distributed random variables with distribution $N\!\left(0, \frac{1}{2}\sigma^2\right)$.
4. When $i > j$, $a_{ij} = \bar{a}_{ji}$, where $\bar{d}$ is the complex conjugate of $d$.
We denote by HGRM($n$, $\sigma^2$) the set of all such random matrices. If $\mathbf{A}$ is an element of HGRM($n$, $\sigma^2$), then
$$E|a_{ij}|^2 = \sigma^2 \quad\text{for all } i, j.$$
The distribution of the real-valued random variable $a_{ii}$ has density
$$x \mapsto \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{x^2}{2\sigma^2}\right), \qquad x\in\mathbb{R},$$
$$\operatorname{Tr}\left(\mathbf{H}^2\right) = \sum_{i=1}^{n}h_{ii}^2 + 2\sum_{i<j}|h_{ij}|^2.$$
$$\lim_{n\to\infty}E\left(\operatorname{tr}_n\left[\exp(s\mathbf{X}_n)\right]\right) = \frac{1}{2\pi}\int_{-2}^{2}\exp(sx)\sqrt{4-x^2}\,dx,$$
and the convergence is uniform on compact subsets of $\mathbb{C}$. Further, we have, for the $k$-th moment of $\mathbf{X}_n$,
$$\lim_{n\to\infty}E\left(\operatorname{tr}_n\left[\mathbf{X}_n^k\right]\right) = \frac{1}{2\pi}\int_{-2}^{2}x^k\sqrt{4-x^2}\,dx,$$
and, more generally,
$$\lim_{n\to\infty}E\left(\operatorname{tr}_n\left[f(\mathbf{X}_n)\right]\right) = \frac{1}{2\pi}\int_{-2}^{2}f(x)\sqrt{4-x^2}\,dx.$$
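The following small experiment (our own sketch; it uses a real symmetric Gaussian, GOE-like, matrix normalized so that the semicircle on [−2, 2] is the limiting law) compares empirical normalized trace moments with the semicircle moments, which are the Catalan numbers 1, 2, 5, 14 for k = 2, 4, 6, 8:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 1000

# GOE-like normalization: the eigenvalue distribution approaches the
# semicircle law on [-2, 2] as n grows
G = rng.standard_normal((n, n))
A = (G + G.T) / np.sqrt(2 * n)

eig = np.linalg.eigvalsh(A)
catalan = {2: 1, 4: 2, 6: 5, 8: 14}  # semicircle moments (Catalan numbers)

for k, c in catalan.items():
    emp = np.mean(eig**k)  # tr_n(A^k) = average of lambda_i^k
    print(f"k = {k}: tr_n(A^k) = {emp:.3f}   semicircle moment = {c}")
```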
Then the initial values are $C(0,n) = n$, $C(1,n) = n^2$, and for fixed $n$ in $\mathbb{N}$, the numbers $C(k,n)$ satisfy the recursion formula
$$C(k+1,n) = n\cdot\frac{4k+2}{k+2}\cdot C(k,n) + \frac{k\left(4k^2-1\right)}{k+2}\cdot C(k-1,n), \qquad k\geqslant 1. \qquad (1.97)$$
We can further show that $C(k,n)$ has the following form [71].
Here the notation $\mathbb{N}_0$ denotes the set of nonnegative integers (including zero), in contrast with $\mathbb{N}$. The coefficients $a_i(k)$, $i, k\in\mathbb{N}_0$, are determined by the following recursive formula:
$$a_i(k) = 0, \quad i\geqslant \tfrac{k}{2}+1,$$
$$a_0(k) = \frac{1}{k+1}\binom{2k}{k}, \quad k\in\mathbb{N}_0,$$
$$a_i(k+1) = \frac{4k+2}{k+2}\cdot a_i(k) + \frac{k\left(4k^2-1\right)}{k+2}\cdot a_{i-1}(k-1), \quad k, i\in\mathbb{N}.$$
$$E\operatorname{tr}_n\left[\mathbf{A}^2\right] = 1, \quad E\operatorname{tr}_n\left[\mathbf{A}^4\right] = 2 + \frac{1}{n^2}, \quad E\operatorname{tr}_n\left[\mathbf{A}^6\right] = 5 + \frac{10}{n^2}, \quad E\operatorname{tr}_n\left[\mathbf{A}^8\right] = 14 + \frac{70}{n^2} + \frac{21}{n^4}, \quad E\operatorname{tr}_n\left[\mathbf{A}^{10}\right] = 42 + \frac{420}{n^2} + \frac{483}{n^4},$$
etc. The constant term in $E\operatorname{tr}_n\left[\mathbf{A}^{2k}\right]$ is
$$a_0(k) = \frac{1}{k+1}\binom{2k}{k} = \frac{1}{2\pi}\int_{-2}^{2}x^{2k}\sqrt{4-x^2}\,dx,$$
$$\varphi_k^{\alpha}(x) = \left(\frac{k!}{\Gamma(k+\alpha+1)}\right)^{1/2}x^{\alpha/2}\exp(-x/2)\cdot L_k^{\alpha}(x), \qquad k\in\mathbb{N}_0, \qquad (1.99)$$
where $\left(L_k^{\alpha}(x)\right)_{k\in\mathbb{N}_0}$ is the sequence of generalized Laguerre polynomials of order $\alpha$, i.e.,
$$L_k^{\alpha}(x) = (k!)^{-1}\,x^{-\alpha}\exp(x)\cdot\frac{d^k}{dx^k}\left(x^{k+\alpha}\exp(-x)\right), \qquad k\in\mathbb{N}_0.$$
If m ≤ n, we have that
1A map f : X → Y between two topological spaces is called Borel (or Borel measurable) if
f −1 (A) is a Borel set for any open set A.
$$E\left(\operatorname{Tr}\left[f(\mathbf{B}^*\mathbf{B})\right]\right) = (n-m)f(0) + \int_{0}^{\infty}f(x)\sum_{k=0}^{m-1}\left(\varphi_k^{\,n-m}(x)\right)^2dx.$$
$$E\operatorname{tr}_n\left[(\mathbf{Q}^*\mathbf{Q})^3\right] = c^3 + 3c^2 + c + c\,n^{-2},$$
$$E\operatorname{tr}_n\left[(\mathbf{Q}^*\mathbf{Q})^4\right] = c^4 + 6c^3 + 6c^2 + c + \left(5c^2 + 5c\right)n^{-2},$$
$$E\operatorname{tr}_n\left[(\mathbf{Q}^*\mathbf{Q})^5\right] = c^5 + 10c^4 + 20c^3 + 10c^2 + c + \left(15c^3 + 40c^2 + 15c\right)n^{-2} + 8c\,n^{-4}. \qquad (1.100)$$
In general, $E\operatorname{tr}_n\left[(\mathbf{Q}^*\mathbf{Q})^k\right]$ is a polynomial of degree $\left[\frac{k-1}{2}\right]$ in $n^{-2}$, for fixed $c$.
The material here is taken from [72–74]. Buldygin and Solntsev [74] develop and use this tool systematically. If $S_N = \sum_{i=1}^{N}a_iX_i$, where the $X_i$ are Bernoulli ($\pm 1$) random variables, then its moment generating function satisfies $E\,e^{tS_N} \leqslant e^{\sigma^2t^2/2}$. On the other hand, if $X$ is a Gaussian random variable with mean zero and variance $EX^2 = \sigma^2$, its moment generating function $E\,e^{tX}$ is exactly $e^{\sigma^2t^2/2}$. This led Kahane [75] to make the following definition. A random variable $X$ is sub-Gaussian, with exponent $b$, if
$$E\,e^{tX} \leqslant e^{b^2t^2/2} \qquad (1.101)$$
2 The precise meaning of this equivalence is the following: there is an absolute constant $C$ such that property $i$ implies property $j$ with parameter $K_j \leqslant CK_i$ for any two properties $i, j = 1, 2, 3$.
2. Moments: $\left(E|X|^p\right)^{1/p} \leqslant K_2\sqrt{p}$ for all $p\geqslant 1$.
3. Super-exponential moment: $E\exp\left(X^2/K_3^2\right) \leqslant e$.
Moreover, if $EX = 0$, then properties 1–3 are also equivalent to the following one:
4. Laplace transform condition: $E\left[\exp(tX)\right] \leqslant \exp\left(t^2K_4^2\right)$ for all $t\in\mathbb{R}$.
If $X_1, \ldots, X_N$ are independent sub-Gaussian random variables with exponents $b_1, \ldots, b_N$ respectively and $a_1, \ldots, a_N$ are real numbers, then [73, p. 109]
$$E\,e^{t(a_1X_1+\cdots+a_NX_N)} = \prod_{i=1}^{N}E\,e^{ta_iX_i} \leqslant \prod_{i=1}^{N}e^{a_i^2b_i^2t^2/2},$$
so that $a_1X_1+\cdots+a_NX_N$ is sub-Gaussian, with exponent $\left(a_1^2b_1^2+\cdots+a_N^2b_N^2\right)^{1/2}$.
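A quick numerical sanity check of this fact (our own sketch; Rademacher variables have exponent b = 1, and the weights and grid of t values are arbitrary choices) verifies the Laplace-transform bound for a weighted sum:

```python
import numpy as np

rng = np.random.default_rng(10)
N, trials = 20, 200_000

a = rng.standard_normal(N)
a /= np.linalg.norm(a)              # now sum_i a_i^2 = 1, so the sum has exponent 1

X = rng.choice([-1.0, 1.0], size=(trials, N))  # Rademacher variables, exponent b_i = 1
S = X @ a                                      # weighted sum a_1 X_1 + ... + a_N X_N

for t in (0.5, 1.0, 2.0):
    mgf = np.mean(np.exp(t * S))               # empirical E exp(tS)
    bound = np.exp(t**2 / 2)                   # sub-Gaussian bound exp(b^2 t^2 / 2)
    print(f"t = {t}: E exp(tS) ~ {mgf:.3f}  <=  bound {bound:.3f}")
```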
We say that a random variable Y majorizes in distribution another random
variable X if there exists a number α ∈ (0, 1] such that, for all t > 0, one has [74]
We have that
$$\tau(X) = \inf\left\{a\geqslant 0 : E\exp(tX) \leqslant \exp\left(a^2t^2/2\right) \text{ for all } t\in\mathbb{R}\right\}.$$
Since
$$E\exp(Xt) = 1 + tEX + \frac{1}{2}t^2EX^2 + o(t^2),$$
$$\exp\left(a^2t^2/2\right) = 1 + \frac{1}{2}t^2a^2 + o(t^2),$$
as $t\to 0$, the inequality
$$E\exp(Xt) \leqslant \exp\left(a^2t^2/2\right), \qquad t\in\mathbb{R},$$
implies
$$EX = 0, \qquad EX^2 \leqslant a^2.$$
This is why each sub-Gaussian random variable $X$ has zero mean and satisfies the condition
$$EX^2 \leqslant \tau^2(X).$$
where
$$\theta(X) = \sup_{n\geqslant 1}\left(\frac{2^n\,n!}{(2n)!}\,EX^{2n}\right)^{1/(2n)}.$$
In such a way, the class of sub-Gaussian random variable is closed with respect to
multiplication by scalars. This class, however, is not closed with respect to addition
of random variables. The next statement motivates us to set this class out.
Proof. We follow [76] for the short proof. Set $v_i = a_i\Big/\left(\sum_{i=1}^{n}a_i^2\right)^{1/2}$. We have to show that the random variable $Y = \sum_{i=1}^{n}v_iX_i$ is sub-Gaussian. Let us check the Laplace transform condition (4) of the definition of a sub-Gaussian random variable. For any $t\in\mathbb{R}$,
64 1 Mathematical Foundation
n !
n
E exp t v i Xi = E exp (tvi Xi )
i=1
i=1
!
n n 2 2
exp t2 vi2 K42 = exp t2 K42 vi2 = et K4 .
i=1 i=1
The inequality here follows from Laplace transform condition (4). The constant in
front of the exponent in Laplace transform condition (4) is 1, this fact plays the
crucial role here.
Theorem 1.7.4 can be used to give a very short proof of a classical inequality due to Khinchin.
Theorem 1.7.5 (Khinchin's inequality). Let $X_1,\ldots,X_n$ be independent centered sub-Gaussian random variables. Then for any $p \ge 1$, there exist constants $A_p, B_p > 0$ such that
$$A_p\left(\sum_{i=1}^n a_i^2\right)^{1/2} \le \left(\mathbb{E}\left|\sum_{i=1}^n a_iX_i\right|^p\right)^{1/p} \le B_p\left(\sum_{i=1}^n a_i^2\right)^{1/2}.$$
Moreover,
$$\left(\mathbb{E}|Y|^p\right)^{1/p} \le C\sqrt{p} =: B_p.$$
Thus,
$$B_3^{-3}\left(\mathbb{E}|Y|^2\right)^{1/2} \le \mathbb{E}|Y|.$$
1.8 Sub-Gaussian Random Vectors
Let $S^{n-1}$ denote the unit sphere in $\mathbb{R}^n$ (resp. in $\mathbb{C}^n$). For two complex vectors $a,b\in\mathbb{C}^n$, the inner product is $\langle a,b\rangle = \sum_{i=1}^n a_i\bar b_i$, where the bar stands for the complex conjugate. A mean-zero random vector $x$ on $\mathbb{C}^n$ is called isotropic if for every $\theta\in S^{n-1}$,
$$\mathbb{E}\left|\langle x,\theta\rangle\right|^2 = 1.$$
for every $\theta\in S^{n-1}$, and any $t > 0$. It is well known that, up to an absolute constant, the tail estimates in the definition of a sub-Gaussian random vector are equivalent to the moment characterization
$$\sup_{\theta\in S^{n-1}}\left(\mathbb{E}\left|\langle x,\theta\rangle\right|^p\right)^{1/p} \le \sqrt{p}\,L.$$
$$\left(\mathbb{E}\sup_{t\in T}\left|\langle t,Y\rangle\right|^p\right)^{1/p} \le c\left[\mathbb{E}\sup_{t\in T}\left|\langle t,G\rangle\right| + \sup_{t\in T}\left(\mathbb{E}\left|\langle t,Y\rangle\right|^p\right)^{1/p}\right],$$
where $c$ is a constant which depends only on $L$ and $G = \sum_{i=1}^N g_ix_i$ for $g_1,\ldots,g_N$ independent standard Gaussian random variables.
If p, q ∈ [1, ∞) satisfy 1/p + 1/q = 1, then the p and q norms are dual to
each other. In particular, the Euclidean norm is self-dual p = q = 2. Similarly, the
Schatten p-norm is dual to the Schatten q-norm. If · is some norm on CN and B∗
is the unit ball in the dual norm of · , then the above theorem implies that
$$\left(\mathbb{E}\|Y\|^p\right)^{1/p} \le c\left[\mathbb{E}\|G\| + \sup_{t\in B^*}\left(\mathbb{E}\left|\langle t,Y\rangle\right|^p\right)^{1/p}\right].$$
Some random variables have tails heavier than Gaussian. The following properties are equivalent for $K_i > 0$:
$$\|X\|_{\psi_1} = \sup_{p\ge 1}\frac{1}{p}\left(\mathbb{E}|X|^p\right)^{1/p}.$$
$$\mathbb{P}\left(\left|\sum_{i=1}^N a_iX_i\right| \ge t\right) \le 2\exp\left[-c\min\left(\frac{t^2}{K^2\|a\|_2^2},\ \frac{t}{K\|a\|_\infty}\right)\right]. \tag{1.104}$$
This follows from the triangle inequality $\|X-\mathbb{E}X\|_{\psi_2} \le \|X\|_{\psi_2} + \|\mathbb{E}X\|_{\psi_2}$ along with $\|\mathbb{E}X\|_{\psi_2} = |\mathbb{E}X| \lesssim \|X\|_{\psi_2}$, and similarly for the sub-exponential norm.
$$\forall x\in K,\ \exists y\in\mathcal{N}:\ d(x,y) < \varepsilon.$$
$$\forall x\in K,\ \exists y\in S:\ d(x,y) \le \varepsilon.$$
These two notions are closely related. Namely, we have the following elementary
Lemma.
Lemma 1.10.1. Let $K$ be a subset of a metric space $(T,d)$, and let the set $\mathcal{N}\subset T$ be an $\varepsilon$-net for $K$. Then
1. There exists a $2\varepsilon$-net $\mathcal{N}'\subset K$ such that $|\mathcal{N}'| \le |\mathcal{N}|$;
2. Any $2\varepsilon$-separated set $S\subset K$ satisfies $|S| \le |\mathcal{N}|$;
3. On the other hand, any maximal $\varepsilon$-separated set $S\subset K$ is an $\varepsilon$-net for $K$.
Let $N = |\mathcal{N}|$ be the minimum cardinality of an $\varepsilon$-net of $T$, also called the covering number of $T$ at scale $\varepsilon$.
Lemma 1.10.2 (Covering numbers of the sphere). The unit Euclidean sphere $S^{n-1}$ equipped with the Euclidean metric satisfies for every $\varepsilon > 0$ that
$$N\left(S^{n-1},\varepsilon\right) \le \left(1+\frac{2}{\varepsilon}\right)^n.$$
Lemma 1.10.3 (Volumetric estimate). For any $\varepsilon < 1$ there exists an $\varepsilon$-net $\mathcal{N}\subset S^{n-1}$ such that
$$|\mathcal{N}| \le \left(1+\frac{2}{\varepsilon}\right)^n \le \left(\frac{3}{\varepsilon}\right)^n.$$
Proof. Let $\mathcal{N}$ be a maximal $\varepsilon$-separated subset of the sphere $S^{n-1}$, and let $B_2^n$ be the Euclidean ball. Then for any distinct points $x,y\in\mathcal{N}$,
$$\left(x + \tfrac{\varepsilon}{2}B_2^n\right)\cap\left(y + \tfrac{\varepsilon}{2}B_2^n\right) = \varnothing.$$
So,
$$|\mathcal{N}|\cdot\mathrm{vol}\left(\tfrac{\varepsilon}{2}B_2^n\right) = \mathrm{vol}\left(\bigcup_{x\in\mathcal{N}}\left(x+\tfrac{\varepsilon}{2}B_2^n\right)\right) \le \mathrm{vol}\left(\left(1+\tfrac{\varepsilon}{2}\right)B_2^n\right),$$
which implies
$$|\mathcal{N}| \le \left(1+\frac{2}{\varepsilon}\right)^n \le \left(\frac{3}{\varepsilon}\right)^n.$$
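The greedy construction behind the proof is easy to emulate numerically: a maximal $\varepsilon$-separated set chosen greedily from points on the sphere is automatically an $\varepsilon$-net, and its size can be compared with the volumetric bound. This is only an illustrative sketch (random candidate points approximate maximality).

```python
# Greedy eps-separated subset of S^(n-1) versus the volumetric bound (1 + 2/eps)^n.
import numpy as np

rng = np.random.default_rng(2)
n, eps = 3, 0.7
candidates = rng.standard_normal((20000, n))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

net = []
for x in candidates:                       # keep x only if it is eps-far from the current net
    if all(np.linalg.norm(x - y) >= eps for y in net):
        net.append(x)

print("greedy eps-separated set size:", len(net))
print("volumetric bound (1 + 2/eps)^n =", (1 + 2 / eps) ** n)
```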
Using $\varepsilon$-nets, we prove a basic bound on the first singular value of a random sub-Gaussian matrix: Let $\mathbf{A}$ be an $m\times n$ random matrix, $m \ge n$, whose entries are independent copies of a sub-Gaussian random variable. Then
$$\mathbb{P}\left(s_1 > t\sqrt{m}\right) \le e^{-c_1t^2m} \quad\text{for } t \ge C_0.$$
One simple but basic idea in the study of sums of independent random variables
is the concept of symmetrization [27]. The simplest probabilistic object is the
Rademacher random variable ε, which takes the two values ±1 with equal prob-
ability 1/2. A random vector is symmetric (or has symmetric distribution) if x
and −x have the same distribution [77]. In this case x and εx, where ε is a
Rademacher random variable independent of x, have the same distribution. Let
xi , i = 1, . . . , n be independent symmetric random vectors. The joint distribution
of εi xi , i = 1, . . . , n is that of the original sequence if the coefficients εi are either
non-random with values ±1, or they are random and independent from each other
and all $x_i$ with $P(\varepsilon_i = \pm 1) = 1/2$.
The technique of symmetrization leads to so-called Rademacher sums $\sum_{i=1}^N\varepsilon_ix_i$, where the $x_i$ are scalars, $\sum_{i=1}^N\varepsilon_ix_i$, where the $x_i$ are vectors, and $\sum_{i=1}^N\varepsilon_i\mathbf{X}_i$, where the $\mathbf{X}_i$ are matrices. Although quite simple, symmetrization is very powerful since there are nice estimates for Rademacher sums available—the so-called Khintchine inequalities. A sequence of independent Rademacher variables is referred to as a Rademacher sequence. A Rademacher series in a Banach space $X$ is a sum of the form
$$\sum_{i=1}^{\infty}\varepsilon_ix_i,$$
$$\|x\|_p = \|x\|_\infty = \sup_i|x_i|, \quad p = \infty.$$
This inequality is typically established only for real scalars, but the real case implies that the complex case holds with the same constant.
Lemma 1.11.2 (Khintchine inequality [29]). For $a\in\mathbb{R}^n$, $x\in\{-1,1\}^n$ uniform, and $k \ge 2$ an even integer, $\mathbb{E}\left(a^Tx\right)^k \le \|a\|_2^k\cdot k^{k/2}$.
If the original sequences consist of independent random vectors, all the random
vectors x(1) and x(2) used to construct symmetrization are independent. Random
vectors defined in (1.105) are also independent and symmetric.
A Banach space $B$ is a vector space over the field of the real or complex numbers equipped with a norm $\|\cdot\|$ for which it is complete. We consider Rademacher averages $\sum_i\varepsilon_ix_i$ with vector-valued coefficients as a natural analog of the Gaussian averages $\sum_i g_ix_i$. A sequence $(\varepsilon_i)$ of independent random variables taking the values $+1$ and $-1$ with equal probability $1/2$ consists of symmetric Bernoulli, or Rademacher, random variables. We usually call $(\varepsilon_i)$ a Rademacher sequence or Bernoulli sequence. We often investigate finite or convergent sums $\sum_i\varepsilon_ix_i$ with vector-valued coefficients $x_i$.
For arbitrary $m\times n$ matrices, $\|\mathbf{A}\|_{S_p}$ denotes the Schatten $p$-norm of an $m\times n$ matrix $\mathbf{A}$, i.e.,
$$\|\mathbf{A}\|_{S_p} = \|\sigma(\mathbf{A})\|_p,$$
where $\sigma\in\mathbb{R}^{\min\{m,n\}}$ is the vector of singular values of $\mathbf{A}$, and $\|\cdot\|_p$ is the usual $l_p$-norm defined above. When $p = \infty$, it is also called the spectral (or matrix) norm. The Rademacher average is given by
$$\left\|\sum_i\varepsilon_i\mathbf{A}_i\right\|_{S_p} = \left\|\sigma\left(\sum_i\varepsilon_i\mathbf{A}_i\right)\right\|_p.$$
where (Xi ) is a finite sequence of independent mean zero real valued random
vectors. This type of identity extends to Hilbert space valued random variables, but
does not in general hold for arbitrary Banach space valued random variables. The
classical theory is developed under this orthogonal property.
Lemma 1.11.3 (Ledoux and Talagrand [27]). Let $F : \mathbb{R}_+\to\mathbb{R}_+$ be convex. Then, for any finite sequence $(X_i)$ of independent zero-mean random variables in a Banach space $B$ such that $\mathbb{E}F(\|X_i\|) < \infty$ for every $i$,
$$\mathbb{E}F\left(\frac12\left\|\sum_i\varepsilon_iX_i\right\|\right) \le \mathbb{E}F\left(\left\|\sum_iX_i\right\|\right) \le \mathbb{E}F\left(2\left\|\sum_i\varepsilon_iX_i\right\|\right).$$
Rademacher series appear as a basic tool for studying sums of independent random
variables in a Banach space [27, Lemma 6.3].
Proposition 1.11.4 (Symmetrization [27]). Let {Xi } be a finite sequence of
independent, zero-mean random variables taking values in a Banach space B. Then
$$\left(\mathbb{E}\left\|\sum_iX_i\right\|_B^p\right)^{1/p} \le 2\left(\mathbb{E}\left\|\sum_i\varepsilon_iX_i\right\|_B^p\right)^{1/p},$$
The relation follows from [27, Eq. (6.2)] and the fact that $M(Y) \le 2\,\mathbb{E}Y$ for every nonnegative random variable $Y$. Here $M$ denotes the median.
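The symmetrization inequality can be checked by simulation. The sketch below uses centered exponential random vectors in $\mathbb{R}^d$ with the Euclidean norm (an arbitrary illustrative choice of Banach space) and compares $\mathbb{E}\|\sum_iX_i\|$ with $2\,\mathbb{E}\|\sum_i\varepsilon_iX_i\|$.

```python
# Symmetrization: E|| sum X_i || <= 2 E|| sum eps_i X_i || for independent
# zero-mean random vectors X_i (Euclidean norm used here).
import numpy as np

rng = np.random.default_rng(3)
N, d, trials = 30, 5, 20000
X = rng.exponential(1.0, size=(trials, N, d)) - 1.0     # centered, non-symmetric summands
eps = rng.choice([-1.0, 1.0], size=(trials, N, 1))

lhs = np.mean(np.linalg.norm(X.sum(axis=1), axis=1))
rhs = np.mean(np.linalg.norm((eps * X).sum(axis=1), axis=1))
print(f"E||sum X_i||         = {lhs:.3f}")
print(f"2 E||sum eps_i X_i|| = {2 * rhs:.3f}   (upper bound)")
```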
See Sect. 5.7 for the applications of the results here. The proofs below are taken from [79]. We also follow [79] for the exposition and the notation here. By $g, g_i$, we denote independent $N(0,1)$ Gaussian random variables. For a random variable $\xi$ and $p > 0$, we put $\|\xi\|_p = \left(\mathbb{E}|\xi|^p\right)^{1/p}$. Let $\|\cdot\|_F$ be the Frobenius norm and $\|\cdot\|_{\mathrm{op}}$ the operator norm. By $|\cdot|$ and $\langle\cdot,\cdot\rangle$ we denote the standard Euclidean norm and the inner product on $\mathbb{R}^n$.
A random variable $\xi$ is called sub-Gaussian if there is a constant $\beta < \infty$ such that
$$\|\xi\|_{2k} \le \beta\,\|g\|_{2k}, \quad k = 1,2,\ldots \tag{1.107}$$
We refer to the infimum over all $\beta$ satisfying (1.107) as the sub-Gaussian constant of $\xi$. An equivalent definition is often given in terms of the $\psi_2$-norm. Denoting the Orlicz function $\exp(x^2)-1$ by $\psi_2$, $\xi$ is sub-Gaussian if and only if $\|\xi\|_{\psi_2} < \infty$.
Denoting the sub-Gaussian constant of $\xi$ by $\tilde\beta$, a direct calculation shows the following common (and not optimal) estimate:
$$\tilde\beta \le \|\xi\|_{\psi_2} \le \tilde\beta\,\|g\|_{\psi_2} = \tilde\beta\sqrt{8/3}.$$
The lower estimate follows since $\mathbb{E}\psi_2(X) \ge \mathbb{E}X^{2k}/k!$ for $k = 1,2,\ldots$. The upper one uses the fact that $\mathbb{E}\exp(tg^2) = 1/\sqrt{1-2t}$ for $t < 1/2$.
Apart from the Gaussian random variables, the prime example of sub-Gaussian
random variables are Bernoulli random variables, taking values +1 and −1 with
equal probability P (ξ = +1) = P (ξ = −1) = 1/2.
Very often, we work with random vectors in Rn of the form ξ = (ξ1 , ξ2 , . . . , ξn ),
where ξi are independent sub-Gaussian random variables, and we refer to such
vectors as sub-Gaussian random vectors. We require that $\mathrm{Var}(\xi_i) \ge 1$ and that the sub-Gaussian constants are at most $\beta$. Under these assumptions we have, hence, $\beta \ge 1$.
We have the following fact: for any $t > 0$,
$$\mathbb{P}\left(|\xi| \ge t\sqrt{n}\right) \le \exp\left(n\ln 2 - \frac{t^2n}{3\beta^2}\right). \tag{1.109}$$
In particular, $\mathbb{P}\left(|\xi| \ge 3\beta\sqrt{n}\right) \le e^{-2n}$. Let us prove the result. For an arbitrary $s > 0$ and $1 \le i \le n$ we have
$$\mathbb{E}\exp\left(\frac{\xi_i^2}{s^2}\right) = \sum_{k=0}^{\infty}\frac{1}{k!\,s^{2k}}\,\mathbb{E}\xi_i^{2k} \le \sum_{k=0}^{\infty}\frac{\beta^{2k}}{k!\,s^{2k}}\,\mathbb{E}g^{2k} = \mathbb{E}\exp\left(\frac{(\beta g)^2}{s^2}\right).$$
This last quantity is less than or equal to 2 for, e.g., $s = \sqrt{3}\,\beta$. For this choice of $s$,
$$\mathbb{P}\left(\sum_{i=1}^n\xi_i^2 \ge t^2n\right) \le \mathbb{E}\exp\left(\frac{1}{s^2}\sum_{i=1}^n\xi_i^2 - \frac{t^2n}{s^2}\right) \le \exp\left(-\frac{t^2n}{3\beta^2}\right)\prod_{i=1}^n\mathbb{E}\exp\left(\frac{\xi_i^2}{s^2}\right) \le \exp\left(-\frac{t^2n}{3\beta^2}\right)\cdot 2^n,$$
$$\mathbb{P}\left(\sum_{i=1}^n b_i\left(\mathbb{E}\xi_i^2 - \xi_i^2\right) > \left(2t\sum_{i=1}^n b_i^2\,\mathbb{E}\xi_i^4\right)^{1/2}\right) \le e^{-t}.$$
Proof. We may obviously assume that $\sum_{i=1}^n b_i^2\,\mathbb{E}\xi_i^4 > 0$. For $x > 0$, we have $e^{-x} = \left(e^x\right)^{-1} \le \left(1+x+x^2/2\right)^{-1} \le 1 - x + x^2/2$. Thus for $\lambda \ge 0$,
$$\mathbb{E}\exp\left(-\lambda\xi_i^2\right) \le 1 - \lambda\,\mathbb{E}\xi_i^2 + \tfrac12\lambda^2\,\mathbb{E}\xi_i^4 \le \exp\left(-\lambda\,\mathbb{E}\xi_i^2 + \tfrac12\lambda^2\,\mathbb{E}\xi_i^4\right).$$
Letting $S = \sum_{i=1}^n b_i\left(\mathbb{E}\xi_i^2-\xi_i^2\right)$, we get $\mathbb{E}\exp(\lambda S) \le \exp\left(\tfrac12\lambda^2\sum_{i=1}^n b_i^2\,\mathbb{E}\xi_i^4\right)$, and for any $u \ge 0$,
$$\mathbb{P}\left(S \ge u\right) \le \inf_{\lambda\ge 0}\mathbb{E}\exp\left(\lambda S - \lambda u\right) \le \exp\left(-\frac{u^2}{2\sum_{i=1}^n b_i^2\,\mathbb{E}\xi_i^4}\right).$$
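A simulation of this lemma with Gaussian $\xi_i$ (so $\mathbb{E}\xi_i^2 = 1$, $\mathbb{E}\xi_i^4 = 3$) and nonnegative weights $b_i$ shows the $e^{-t}$ tail bound in action; the weights are arbitrary illustrative values.

```python
# Monte Carlo check of the one-sided tail bound above for Gaussian xi_i.
import numpy as np

rng = np.random.default_rng(4)
n, trials = 40, 200000
b = rng.uniform(0.1, 1.0, size=n)            # nonnegative weights
V = np.sum(b**2 * 3.0)                       # sum_i b_i^2 * E xi_i^4
xi = rng.standard_normal((trials, n))
S = (b * (1.0 - xi**2)).sum(axis=1)          # sum_i b_i (E xi_i^2 - xi_i^2)
for t in (1.0, 2.0, 3.0):
    thr = np.sqrt(2 * t * V)
    print(f"t={t}:  empirical P = {np.mean(S > thr):.4f}   bound e^-t = {np.exp(-t):.4f}")
```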
Lemma 1.12.2 is somewhat special since we assume that the coefficients are nonnegative. In the general case one has, for any sequence of independent random variables $\xi_i$ with sub-Gaussian constant at most $\beta$ and $t > 1$,
$$\mathbb{P}\left(\sum_{i=1}^n a_i\left(\mathbb{E}\xi_i^2 - \xi_i^2\right) > C\beta^2\left(t\,\|(a_i)\|_\infty + \sqrt{t}\,\|(a_i)\|_2\right)\right) \le e^{-t}. \tag{1.110}$$
We provide a sketch of the proof for the sake of completeness. Let $\tilde\xi_i$ be an independent copy of $\xi_i$. We have by Jensen's inequality, for $p \ge 1$,
$$\left\|\sum_{i=1}^n a_i\left(\mathbb{E}\xi_i^2 - \xi_i^2\right)\right\|_p \le \left\|\sum_{i=1}^n a_i\left(\xi_i^2 - \tilde\xi_i^2\right)\right\|_p,$$
where the $\eta_i$ are i.i.d. symmetric exponential random variables with variance 1. So, for positive integer $k$,
$$\left\|\sum_{i=1}^n a_i\left(\xi_i^2 - \tilde\xi_i^2\right)\right\|_{2k} \le 4\beta^2\left\|\sum_{i=1}^n a_i\eta_i\right\|_{2k} \le C_1\beta^2\left(k\,\|(a_i)\|_\infty + \sqrt{k}\,\|(a_i)\|_2\right),$$
We follow [33] for this introduction. The standard reference is [81]. See also [27,82].
One of the fundamental issues of probability theory is the study of suprema of stochastic processes. In particular, in many situations one needs to estimate the quantity $\mathbb{E}\sup_{t\in T}X_t$, where $(X_t)_{t\in T}$ is a stochastic process; in order to avoid measurability problems, one may assume that $T$ is countable. The modern approach to this problem is based on chaining techniques. The most important case of centered Gaussian processes is well understood. In this case, the boundedness of the process is related to the geometry of the metric space $(T,d)$, where
$$d(t,s) = \left(\mathbb{E}(X_t - X_s)^2\right)^{1/2}.$$
In 1967, R. Dudley [83] obtained an upper bound for $\mathbb{E}\sup_{t\in T}X_t$ in terms of entropy numbers, and in 1975 X. Fernique [84] improved Dudley's bound using so-called majorizing measures. In 1987, Talagrand [85] showed that Fernique's bound may be reversed and that for centered Gaussian processes $(X_t)$,
$$\frac{1}{L}\,\gamma_2(T,d) \le \mathbb{E}\sup_{t\in T}X_t \le L\,\gamma_2(T,d),$$
where the infimum runs over all sequences $T_i$ of subsets of $T$ such that $|T_0| = 1$ and $|T_i| \le 2^{2^i}$.
Sums of independent random vectors [27, 74, 86] are classical topics nowadays. It
is natural to convert sums of random matrices into sums of independent random
vectors that can be handled using the classical machinery. Sums of dependent
random vectors are much less understood: Stein’s method is very powerful [87].
Often we are interested in the sample covariance matrix $\frac{1}{N}\sum_{i=1}^N x_i\otimes x_i$, the sum of $N$ rank-one matrices, where $x_1,\ldots,x_N$ are $N$ independent random vectors. More generally, we consider $\frac{1}{N}\sum_{i=1}^N X_i$, where the $X_i$ are independent random Hermitian matrices. Let
$$\Lambda = \begin{bmatrix}\lambda_1\\ \lambda_2\\ \vdots\\ \lambda_n\end{bmatrix}\in\mathbb{R}^n$$
and consider the average $\frac{1}{N}\sum_{i=1}^N\Lambda_i$.
$$\log\hat\Phi_{0,R} = -\frac12\langle y, Ry\rangle.$$
We assume that
$$R = \frac{1}{N}\sum_{i=1}^N\mathrm{Cov}(x_i) = \frac{1}{N}\sum_{i=1}^N R_i,$$
$$l_{s,N} = \sup_{\|y\|=1}\frac{\frac{1}{N}\sum_{i=1}^N\mathbb{E}\left(|\langle y,x_i\rangle|^s\right)}{\left(\frac{1}{N}\sum_{i=1}^N\mathbb{E}\langle y,x_i\rangle^2\right)^{s/2}}\,N^{-(s-2)/2} \quad (s \ge 2). \tag{1.111}$$
$$l_{s,N} \le N^{-(s-2)/2}\sup_{\|y\|=1}\frac{\rho_s\,\|y\|^s}{\left[\langle y,Ry\rangle\right]^{s/2}} = \frac{\rho_s}{\lambda_{\min}^{s/2}}\,N^{-(s-2)/2},$$
where $\lambda_{\min}$ is the smallest eigenvalue of the average covariance matrix $R$. In one dimension (i.e., $n = 1$),
$$l_{s,N} = \frac{\rho_s}{\rho_2^{s/2}}\,N^{-(s-2)/2}, \quad s \ge 2.$$
If $R = I$, then
$$\mathbb{E}|\langle y,x_i\rangle|^s \le \sum_{i=1}^N\mathbb{E}|\langle y,x_i\rangle|^s \le N^{s/2}\,l_{s,N}\,\|y\|^s.$$
Assume that $l_{4,N} \le 1$ and $t \le \tfrac12\,l_{4,N}^{-1/4}$.
One has
$$\left|\sum_{i=1}^N\hat G_i\left(\sqrt{N}\,B_t\right) - \exp\left(-\tfrac12 t^2\right)\left(1+\frac{1}{6\sqrt{N}}\,\mu_3(t)\right)\right| \le (0.175)\,l_{4,N}\,t^4\,e^{-\frac12 t^4} + \left[(0.018)\,l_{4,N}^2\,t^8 + \tfrac{1}{36}\,l_{3,N}^2\,t^6\right]e^{-(0.383)\,t^2},$$
where
$$\mu_3(t) = \frac{1}{N}\sum_{i=1}^N\mathbb{E}\,\langle t,x_i\rangle^3.$$
1.16 Linear Bounded and Compact Operators
The material here is standard, taken from [88, 89]. Let X and Y always be normed
spaces and A : X → Y be a linear operator. The linear operator A is bounded if
there exists c > 0 such that
$$\|Ax\| \le c\,\|x\| \quad\text{for all } x\in X.$$
$$\|A\| = \sup_{x\neq 0}\frac{\|Ax\|}{\|x\|}. \tag{1.112}$$
$$(Ax)(t) := \int_a^b k(t,s)\,x(s)\,ds, \quad t\in(c,d),\ x\in L^2(a,b), \tag{1.113}$$
$$\|A\|_{L^2} \le \left(\int_c^d\int_a^b|k(t,s)|^2\,ds\,dt\right)^{1/2}.$$
Let $k$ be continuous on $[c,d]\times[a,b]$. Then $A$ is also well-defined, linear, and bounded from $C[a,b]$ into $C[c,d]$ and
$$\|A\|_\infty \le \max_{t\in[c,d]}\int_a^b|k(t,s)|\,ds.$$
We can extend the above results to integral operators with weakly singular kernels. A kernel is weakly singular on $[a,b]\times[a,b]$ if $k$ is defined and continuous for all $t,s\in[a,b]$, $t\neq s$, and there exist constants $c > 0$ and $\alpha\in[0,1)$ such that
$$|k(t,s)| \le \frac{c}{|t-s|^{\alpha}} \quad\text{for all } t,s\in[a,b],\ t\neq s.$$
Let k be weakly singular on [a, b]. Then the integral operator A, defined
in (1.113) for [c, d] = [a, b], is well-defined and bounded as an operator in L2 (a, b)
as well as in C[a, b].
Let $A : X\to Y$ be a linear and bounded operator between Hilbert spaces. Then there exists one and only one linear bounded operator $A^* : Y\to X$ with the property
$$\langle Ax, y\rangle = \langle x, A^*y\rangle \quad\text{for all } x\in X,\ y\in Y.$$
The set of all compact operators from $X$ to $Y$ is a closed subspace of the vector space of bounded linear operators from $X$ to $Y$. Recall that
$$L^2(a,b) = \left\{x : (a,b)\to\mathbb{C}\ :\ x \text{ is measurable and } |x|^2 \text{ is integrable}\right\}.$$
Let $k\in L^2\left((c,d)\times(a,b)\right)$. The operator $K : L^2(a,b)\to L^2(c,d)$, defined by
$$(Kx)(t) := \int_a^b k(t,s)\,x(s)\,ds, \quad t\in(c,d),\ x\in L^2(a,b), \tag{1.114}$$
is compact from $L^2(a,b)$ into $L^2(c,d)$. Let $k$ be continuous on $(c,d)\times(a,b)$ or weakly singular on $(a,b)\times(a,b)$ (in this case $(c,d) = (a,b)$). Then $K$ defined by (1.114) is also compact as an operator from $C[a,b]$ to $C[c,d]$.
The material here is standard, taken from [88, 89]. The most important results in
functional analysis are collected here. Define
$$N = \left\{x\in L^2(a,b) : x(t) = 0 \text{ almost everywhere on } [a,b]\right\}.$$
2. For every eigenvalue $\lambda\neq 0$, there exist only finitely many linearly independent eigenvectors, i.e., the eigenspaces are finite-dimensional. Eigenvectors corresponding to different eigenvalues are orthogonal.
3. We order the eigenvalues in the form
$$X = \mathrm{cl}(H)\oplus N(K).$$
Then
$$(K^*x)(t) := \int_t^1 x(s)\,ds \quad\text{and}\quad (KK^*x)(t) := \int_0^t\left(\int_s^1 x(\tau)\,d\tau\right)ds.$$
$$x_i(t) = \sqrt{2}\,\cos\left(\frac{2i-1}{2}\pi t\right),\ i\in\mathbb{N}, \quad\text{and}\quad \lambda_i = \frac{4}{(2i-1)^2\pi^2},\ i\in\mathbb{N}.$$
Example 1.17.2 (Porter and Stirling [89]). Let the operator $K$ on $L^2(0,1)$ be defined by
$$(K\varphi)(x) = \int_0^1\log\left|\frac{\sqrt{x}+\sqrt{t}}{\sqrt{x}-\sqrt{t}}\right|\,\varphi(t)\,dt \quad (0\le x\le 1).$$
$$(T^*\varphi)(x) = \int_x^1\frac{1}{\sqrt{t-x}}\,\varphi(t)\,dt \quad (0\le x\le 1).$$
positivity.
Chapter 2
Sums of Matrix-Valued Random Variables
This chapter gives an exhaustive treatment of the line of research on sums of matrix-valued random variables. We will present eight different derivation methods in the context of the matrix Laplace transform method. The emphasis is placed on methods that will hopefully be useful for some engineering applications. Although powerful, the methods are elementary in nature. It is remarkable that some modern results on matrix completion can be simply derived by using the framework of sums of matrix-valued random variables. The treatment here is self-contained. All the necessary tools are developed in Chap. 1. The contents of this book are complementary to our book [5]. We have a small overlap with the results of [36].
In this chapter, the classical, commutative theory of probability is generalized to the more general theory of non-commutative probability. Non-commutative algebras of random variables ("observations") and their expectations (or "trace") are built. Matrices or operators take the role of scalar random variables, and the trace takes the role of expectation. This is very similar to free probability [9].
The theory of real random variables provides the framework of much of modern
probability theory [8], such as laws of large numbers, limit theorems, and proba-
bility estimates for “deviations”, when sums of independent random variables are
involved. However, some authors have started to develop analogous theories for the
case that the algebraic structure of the reals is substituted by more general structures
such as groups, vector spaces, etc., see for example [90].
In a remarkable work [36], Ahlswede and Winter has laid the ground for the
fundamentals of a theory of (self-adjoint) operator valued random variables. There,
the large deviation bounds are derived. A self-adjoint operator includes finite
dimensions (often called Hermitian matrix) and infinite dimensions. For the purpose
of this book, finite dimensions are sufficient. We will prefer Hermitian matrix.
To extend the theory from scalars to matrices, the fundamental difficulty arises from the fact that, in general, two matrices do not commute: $\mathbf{AB}\neq\mathbf{BA}$. The functions of a matrix can be defined; for example, the matrix exponential $e^{\mathbf{A}}$ is defined in [20]. As expected, $e^{\mathbf{A}+\mathbf{B}}\neq e^{\mathbf{A}}e^{\mathbf{B}}$ in general, although the scalar exponential has the elementary property $e^{a+b} = e^ae^b$ for two scalars $a,b$. Fortunately, we have the Golden–Thompson inequality, which provides a limited replacement for this elementary property of the scalar exponential. The Golden–Thompson inequality
$$\mathrm{Tr}\,e^{\mathbf{A}+\mathbf{B}} \le \mathrm{Tr}\left(e^{\mathbf{A}}e^{\mathbf{B}}\right),$$
for Hermitian matrices $\mathbf{A},\mathbf{B}$, is the most complicated result that we will use.
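The Golden–Thompson inequality is easy to test numerically for small Hermitian matrices; a minimal sketch:

```python
# Numerical check of Golden-Thompson: Tr e^(A+B) <= Tr(e^A e^B) for Hermitian A, B.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
for _ in range(5):
    X = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
    Y = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
    A, B = (X + X.conj().T) / 2, (Y + Y.conj().T) / 2
    lhs = np.trace(expm(A + B)).real
    rhs = np.trace(expm(A) @ expm(B)).real
    print(f"Tr e^(A+B) = {lhs:10.3f}  <=  Tr(e^A e^B) = {rhs:10.3f}")
```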
Through the spectral mapping theorem, the eigenvalues of an arbitrary matrix function $f(\mathbf{A})$ are $f(\lambda_i)$, where $\lambda_i$ is the $i$-th eigenvalue of $\mathbf{A}$. In particular, for $f(x) = e^x$ with $x$ a scalar, the eigenvalues of $e^{\mathbf{A}}$ are $e^{\lambda_i}$, which are, of course, positive (i.e., $e^{\lambda_i} > 0$). In other words, the matrix exponential $e^{\mathbf{A}}$ is ALWAYS positive semidefinite. The positive real numbers have a lot of special structure to exploit, compared with arbitrary real numbers. This elementary fact motivates the wide use of positive semidefinite (PSD) matrices, for example in convex optimization and quantum information theory. Through the spectral mapping theorem, all the eigenvalues of positive semidefinite matrices are nonnegative.
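A minimal numerical illustration of the spectral mapping theorem for the matrix exponential: the eigenvalues of $e^{\mathbf{A}}$ are exactly $e^{\lambda_i}$, hence positive, for Hermitian $\mathbf{A}$.

```python
# Eigenvalues of e^A equal e^(lambda_i) and are strictly positive (Hermitian A).
import numpy as np
from scipy.linalg import expm, eigvalsh

rng = np.random.default_rng(6)
X = rng.standard_normal((5, 5))
A = (X + X.T) / 2                     # real symmetric (Hermitian) matrix
lam = eigvalsh(A)
lam_exp = eigvalsh(expm(A))
print("spectral mapping holds:", np.allclose(np.sort(np.exp(lam)), np.sort(lam_exp)))
print("all eigenvalues of e^A positive:", bool(np.all(lam_exp > 0)))
```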
For a sequence of scalar random variables (real or complex numbers) $x_1,\ldots,x_n$, we can study its convergence by studying the so-called partial sum $S_n = x_1 + \cdots + x_n = \sum_{i=1}^n x_i$. We say the sequence converges to a limit value $S = \mathbb{E}[x]$ if the limit $S$ exists as $n\to\infty$. In analogy with the scalar counterparts, we can similarly define
$$S_n = \mathbf{X}_1 + \cdots + \mathbf{X}_n = \sum_{i=1}^n\mathbf{X}_i,$$
Due to the basic nature of sums of random matrices, we give several versions of the theorems and their derivations. Although their techniques are essentially equivalent, the assumptions and arguments are sufficiently different to justify the space. The techniques for handling matrix-valued random variables are very subtle; it is our intention to give an exhaustive survey of these techniques. Even a seemingly small twist of the problem can cause a lot of technical difficulties. These presentations serve as examples to illustrate the key steps. Repetition is the best teacher—practice makes perfect. This is the rationale behind this chapter. It is hoped that the audience pays attention to the methods, not the particular derived inequalities.
The Laplace transform method is the standard technique for scalar-valued random variables; it is remarkable that this method can be extended to the matrix setting. We argue that this is a breakthrough in studying matrix concentration. This method is used as a thread to tie together all the surveyed literature. For completeness, we run the risk of "borrowing" too much from the cited references. Here we give credit to those cited authors. We try our best to add more details about their arguments with the hope of being more accessible.
The presentation here is essentially the same as [91, 92] whose style is very friendly
and accessible. We present Harvey’s version first.
Let X be a random d×d matrix, i.e., a matrix whose entries are all random variables.
We define EX to be the matrix whose entries are the expectation of the entries of
X. Since expectation and trace are both linear, they commute:
$$\mathbb{E}\,\mathrm{Tr}\,e^{\lambda S_n} \le \prod_{i=1}^n\left\|\mathbb{E}_{X_i}\,e^{\lambda\mathbf{X}_i}\right\|\cdot\mathrm{Tr}\,e^{\lambda\mathbf{0}} = \prod_{i=1}^n\left\|\mathbb{E}\,e^{\lambda\mathbf{X}_i}\right\|\cdot\mathrm{Tr}\,e^{\lambda\mathbf{0}},$$
where $\mathbf{0}$ is the zero matrix of size $d\times d$. (The assumption of a symmetric matrix is too strong for many applications. Since we often deal with complex entries, the assumption of a Hermitian matrix is reasonable. This is the fatal flaw of this version; otherwise, it is very useful.) So $e^{\lambda\mathbf{0}} = \mathbf{I}$ and $\mathrm{Tr}(\mathbf{I}) = d$, where $\mathbf{I}$ is the identity matrix whose diagonal entries are all 1. Therefore,
$$\mathbb{E}\,\mathrm{Tr}\,e^{\lambda S_n} \le d\cdot\prod_{i=1}^n\left\|\mathbb{E}\,e^{\lambda\mathbf{X}_i}\right\|.$$
$$\Pr\left[\text{some eigenvalue of } S_n \text{ is greater than } t\right] \le d\,e^{-\lambda t}\prod_{i=1}^n\left\|\mathbb{E}\,e^{\lambda\mathbf{X}_i}\right\|.$$
We can also bound the probability that any eigenvalue of $S_n$ is less than $-t$ by applying the same argument to $-S_n$. This shows that the probability that any eigenvalue of $S_n$ lies outside $[-t,t]$ is
$$\mathbb{P}\left(\|S_n\| > t\right) \le d\,e^{-\lambda t}\left[\prod_{i=1}^n\left\|\mathbb{E}\,e^{\lambda\mathbf{X}_i}\right\| + \prod_{i=1}^n\left\|\mathbb{E}\,e^{-\lambda\mathbf{X}_i}\right\|\right]. \tag{2.3}$$
This is the basic inequality. Much like the Chernoff bound, numerous variations and
generalizations are possible. Two useful versions are stated here without proof.
Theorem 2.2.1. Let $\mathbf{Y}$ be a random, symmetric, positive semidefinite $d\times d$ matrix such that $\mathbb{E}[\mathbf{Y}] = \mathbf{I}$. Suppose $\|\mathbf{Y}\| \le R$ for some fixed scalar $R \ge 1$. Let $\mathbf{Y}_1,\ldots,\mathbf{Y}_k$ be independent copies of $\mathbf{Y}$ (i.e., independently sampled matrices with the same distribution as $\mathbf{Y}$). For any $\varepsilon\in(0,1)$, we have
$$\mathbb{P}\left[(1-\varepsilon)\mathbf{I} \preceq \frac{1}{k}\sum_{i=1}^k\mathbf{Y}_i \preceq (1+\varepsilon)\mathbf{I}\right] \ge 1 - 2d\cdot\exp\left(-\varepsilon^2k/4R\right).$$
This event is equivalent to the sample average $\frac{1}{k}\sum_{i=1}^k\mathbf{Y}_i$ having minimum eigenvalue at least $1-\varepsilon$ and maximum eigenvalue at most $1+\varepsilon$.
Proof. See [92].
Corollary 2.2.2. Let $\mathbf{Z}$ be a random, symmetric, positive semidefinite $d\times d$ matrix. Define $\mathbf{U} = \mathbb{E}[\mathbf{Z}]$ and suppose $\mathbf{Z} \preceq R\cdot\mathbf{U}$ for some scalar $R \ge 1$. Let $\mathbf{Z}_1,\ldots,\mathbf{Z}_k$ be independent copies of $\mathbf{Z}$ (i.e., independently sampled matrices with the same distribution as $\mathbf{Z}$). For any $\varepsilon\in(0,1)$, we have
$$\mathbb{P}\left[(1-\varepsilon)\mathbf{U} \preceq \frac{1}{k}\sum_{i=1}^k\mathbf{Z}_i \preceq (1+\varepsilon)\mathbf{U}\right] \ge 1 - 2d\cdot\exp\left(-\varepsilon^2k/4R\right).$$
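The following Monte Carlo sketch illustrates the flavor of Theorem 2.2.1. The sampling model ($x$ uniform on the sphere of radius $\sqrt{d}$, so that $\mathbb{E}[xx^T] = \mathbf{I}$ and $\|xx^T\| = d = R$) is our own illustrative assumption, not the book's example.

```python
# Eigenvalues of the sample average (1/k) sum_i x_i x_i^T concentrate near 1.
import numpy as np

rng = np.random.default_rng(7)
d, k = 8, 4000
x = rng.standard_normal((k, d))
x *= np.sqrt(d) / np.linalg.norm(x, axis=1, keepdims=True)   # E[x x^T] = I, ||x x^T|| = d
avg = (x[:, :, None] * x[:, None, :]).mean(axis=0)           # sample average of rank-one PSD matrices
eig = np.linalg.eigvalsh(avg)
print("min/max eigenvalue of the sample average:", eig[0], eig[-1])
```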
Note that $R \ge \sqrt{d}$ because
$$d = \mathrm{Tr}\,\mathbf{I} = \mathrm{Tr}\,\mathbb{E}\left[xx^T\right] = \mathbb{E}\,\mathrm{Tr}\left[xx^T\right] = \mathbb{E}\left[x^Tx\right].$$
$$1 + y \le e^y, \quad\forall y\in\mathbb{R}, \qquad e^y \le 1 + y + y^2, \quad\forall y\in[-1,1].$$
Since $\|\mathbf{X}_i\| \le 1$, for any $\lambda\in[0,1]$, we have $e^{\lambda\mathbf{X}_i} \preceq \mathbf{I} + \lambda\mathbf{X}_i + \lambda^2\mathbf{X}_i^2$, and so
$$\mathbb{E}\,e^{\lambda\mathbf{X}_i} \preceq \mathbb{E}\left[\mathbf{I} + \lambda\mathbf{X}_i + \lambda^2\mathbf{X}_i^2\right] \preceq \mathbf{I} + \lambda^2\,\mathbb{E}\left[\mathbf{X}_i^2\right] \preceq e^{\lambda^2\mathbb{E}[\mathbf{X}_i^2]} \preceq e^{\lambda^2/4R^2}\,\mathbf{I},$$
by Eq. (2.4). Thus, $\left\|\mathbb{E}\,e^{\lambda\mathbf{X}_i}\right\| \le e^{\lambda^2/4R^2}$. The same analysis also shows that $\left\|\mathbb{E}\,e^{-\lambda\mathbf{X}_i}\right\| \le e^{\lambda^2/4R^2}$. Substituting this into Eq. (2.3), we obtain
$$\mathbb{P}\left(\left\|\frac{1}{2R^2}\sum_{i=1}^n x_ix_i^T - \mathbf{I}\right\| > t\right) \le 2d\cdot e^{-\lambda t}\prod_{i=1}^n e^{\lambda^2/4R^2} = 2d\cdot\exp\left(-\lambda t + n\lambda^2/4R^2\right).$$
$$S = \mathbf{X}_1 + \cdots + \mathbf{X}_n.$$
$$e^{\mathbf{A}+\mathbf{B}} \neq e^{\mathbf{A}}e^{\mathbf{B}}.$$
This estimate is not sharp: $e^{\lambda S_n} \npreceq e^{\lambda t}\mathbf{I}$ means the biggest eigenvalue of $e^{\lambda S_n}$ exceeds $e^{\lambda t}$, while $\mathrm{Tr}\,e^{\lambda S_n} > e^{\lambda t}$ means that the sum of all $d$ eigenvalues exceeds the same. Since $S_n = \mathbf{X}_n + S_{n-1}$, we use the Golden–Thompson inequality to separate the last term from the sum:
$$\mathbb{E}\,\mathrm{Tr}\,e^{\lambda S_n} \le \mathbb{E}\,\mathrm{Tr}\left(e^{\lambda\mathbf{X}_n}e^{\lambda S_{n-1}}\right).$$
Now, using independence and the fact that $\mathbb{E}$ and trace commute, we continue to write
$$= \mathbb{E}_{n-1}\,\mathrm{Tr}\left(\mathbb{E}_n\,e^{\lambda\mathbf{X}_n}\cdot e^{\lambda S_{n-1}}\right) \le \left\|\mathbb{E}_n\,e^{\lambda\mathbf{X}_n}\right\|\cdot\mathbb{E}_{n-1}\,\mathrm{Tr}\,e^{\lambda S_{n-1}},$$
since
$$\mathrm{Tr}(\mathbf{AB}) \le \|\mathbf{A}\|\,\mathrm{Tr}(\mathbf{B}),$$
for $\mathbf{A},\mathbf{B}\in\mathbb{M}_d$. Continuing by induction, we reach (since $\mathrm{Tr}\,\mathbf{I} = d$)
$$\mathbb{E}\,\mathrm{Tr}\,e^{\lambda S_n} \le d\cdot\prod_{i=1}^n\left\|\mathbb{E}\,e^{\lambda\mathbf{X}_i}\right\|.$$
$$\mathbb{P}\left(S_n \npreceq t\mathbf{I}\right) \le d\,e^{-\lambda t}\cdot\prod_{i=1}^n\left\|\mathbb{E}\,e^{\lambda\mathbf{X}_i}\right\|.$$
$$\mathbb{P}\left(\|S_n\| > t\right) \le 2d\,e^{-\lambda t}\cdot\prod_{i=1}^n\left\|\mathbb{E}\,e^{\lambda\mathbf{X}_i}\right\|. \tag{2.5}$$
As in the real valued case, full independence is never needed in the above
argument. It works out well for martingales.
Theorem 2.2.4 (Chernoff-type inequality). Let $\mathbf{X}_i\in\mathbb{M}_d$ be independent mean-zero random matrices with $\|\mathbf{X}_i\| \le 1$ for all $i$ almost surely. Let
$$S_n = \mathbf{X}_1 + \cdots + \mathbf{X}_n = \sum_{i=1}^n\mathbf{X}_i, \qquad \sigma^2 = \sum_{i=1}^n\mathrm{var}\,\mathbf{X}_i.$$
$$1 + y \le e^y \le 1 + y + y^2$$
is valid for real numbers $y\in[-1,1]$ (actually a bit beyond) [95]. From the two bounds, we get (replacing $y$ with $\mathbf{Y}$)
$$\mathbf{I} + \mathbf{Y} \preceq e^{\mathbf{Y}} \preceq \mathbf{I} + \mathbf{Y} + \mathbf{Y}^2.$$
Using the bounds twice (first the upper bound and then the lower bound), we have
$$\mathbb{E}\,e^{\mathbf{Y}} \preceq \mathbb{E}\left[\mathbf{I} + \mathbf{Y} + \mathbf{Y}^2\right] = \mathbf{I} + \mathbb{E}\left[\mathbf{Y}^2\right] \preceq e^{\mathbb{E}(\mathbf{Y}^2)}.$$
Hence by (2.5),
$$\mathbb{P}\left(\|S_n\| > t\right) \le 2d\cdot e^{-\lambda t + \lambda^2\sigma^2}.$$
With the optimal choice of $\lambda = \min\left(t/2\sigma^2, 1\right)$, the conclusion of the Theorem follows.
Does the Theorem hold for $\sigma^2$ replaced by $\left\|\sum_{i=1}^n\mathbb{E}\,\mathbf{X}_i^2\right\|$?
Consider the random matrix $\mathbf{Z}_n$. We closely follow Oliveira [38], whose exposition is highly accessible. In particular, he reviews all the needed theorems, all of which are collected in Chap. 1 for easy reference. In this subsection, the matrices are assumed to be $d\times d$ Hermitian matrices, that is, $\mathbf{A}\in\mathbb{C}^{d\times d}_{\mathrm{Herm}}$, where $\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ is the set of $d\times d$ Hermitian matrices.
Notice that
$$\mathbb{E}\,e^{s\|\mathbf{Z}_n\|} \le \mathbb{E}\,e^{s\lambda_{\max}(\mathbf{Z}_n)} + \mathbb{E}\,e^{s\lambda_{\max}(-\mathbf{Z}_n)} = 2\,\mathbb{E}\,e^{s\lambda_{\max}(\mathbf{Z}_n)} \tag{2.7}$$
since $\|\mathbf{Z}_n\| = \max\left\{\lambda_{\max}(\mathbf{Z}_n),\lambda_{\max}(-\mathbf{Z}_n)\right\}$ and $\mathbf{Z}_n$ has the same law as $-\mathbf{Z}_n$.
Up to now, Oliveira's proof in [38] has followed Ahlswede and Winter's argument in [36]. The next lemma is originally due to Oliveira [38]. Now Oliveira considers the special case
$$\mathbf{Z}_n = \sum_{i=1}^n\varepsilon_i\mathbf{A}_i, \tag{2.10}$$
Proof. In (2.11), we have used the fact that trace and expectation commute,
according to (1.87). The key proof steps have been followed by Rudelson [93],
Harvey [91, 92], and Wigderson and Xiao [94].
Ahlswede and Winter [36] were the first who used the matrix Laplace transform
method. Ahlswede–Winter's derivation, taken from [36], is presented in detail below. We have postponed their original version until now, for ease of understanding. Their paper and Tropp's long paper [53] are two of the most important sources on this topic. We first digress to a hypothesis testing problem for motivation:
$$H_0:\ \mathbf{A}_1,\ldots,\mathbf{A}_K \qquad\text{versus}\qquad H_1:\ \mathbf{B}_1,\ldots,\mathbf{B}_K,$$
$$\mathrm{Tr}\sum_{k=1}^K\mathbf{A}_k = \xi \le \mathrm{Tr}\sum_{k=1}^K\mathbf{B}_k,$$
2. Otherwise, claim $H_0$.
$$\mathrm{Tr}\,e^{\mathbf{A}+\mathbf{B}} \le \mathrm{Tr}\,e^{\mathbf{A}}e^{\mathbf{B}}.$$
The three-matrix analogue $\mathrm{Tr}\,e^{\mathbf{A}+\mathbf{B}+\mathbf{C}} \le \mathrm{Tr}\,e^{\mathbf{A}}e^{\mathbf{B}}e^{\mathbf{C}}$ is known to be false.
Let A and B be two Hermitian matrices of the same size. If A − B is positive
semidefinite, we write [16]
A ≥ B or B ≤ A. (2.12)
A ≥ B ⇔ X∗ AX ≥ X∗ BX (2.13)
In the ground-breaking work of [36], they focus on a structure that has vital interest
in the algebra of operators on a (complex) Hilbert space, and in particular, the real
vector space of self-adjoint operators. Through the spectral mapping theorem, these
self-adjoint operators can be regarded as a partially ordered generalization of the
reals, as reals are embedded in the complex numbers. To study the convergence of
sums of matrix-valued random variables, this partial order is necessary. It will be
clear later.
One can generalize the exponentially good estimate for large deviations by the
so-called Bernstein trick that gives the famous Chernoff bound [96, 97].
A matrix-valued random variable X : Ω → As , where
As = {A ∈ A : A = A∗ } (2.14)
is the self-adjoint part of the C∗ -algebra A [98], which is a real vector space. Let
L(H) be the full operator algebra of the complex Hilbert space H. We denote d =
dim(H), which is assumed to be finite. Here dim means the dimensionality of the
vector space. In the general case, d = TrI, and A can be embedded into L(C d ) as an
algebra, preserving the trace. Note the trace (often regarded as expectation) has the
property Tr (AB) = Tr (BA), for any two matrices (or operators) of A, B. In free
probability3 [99], this is a (optional) axiom as very weak form of commutativity in
the trace [9, p. 169].
The real cone
A+ = {A ∈ A : A = A∗ ≥ 0} (2.15)
induces a partial order ≤ in As . This partial order is in analogy with the order of
two real numbers a ≤ b. The partial order is the main interest in what follows. We
can introduce some convenient notation: for A, B ∈ As the closed interval [A, B]
is defined as
From which, we see that these nonlinear inequalities are equivalent to infinitely
many linear inequalities, which is better adapted to the vector space structure
of As .
3. The operator mapping A → As , for s ∈ [0, 1] and A → log A are defined on
A+ , and both are operator monotone and operator concave. In contrast, A → As ,
for s > 2 and A → exp A are neither operator monotone nor operator convex.
Remarkably, A → As , for s ∈ [1, 2] is operator convex (though not operator
monotone). See Sect. 1.4.22 for definitions.
4. The mapping A → Tr exp A is monotone and convex. See [50].
5. Golden–Thompson inequality [23]: for $\mathbf{A},\mathbf{B}\in\mathcal{A}_s$, $\mathrm{Tr}\,e^{\mathbf{A}+\mathbf{B}} \le \mathrm{Tr}\left(e^{\mathbf{A}}e^{\mathbf{B}}\right)$.
Items 1–3 follow from Loewner's theorem. A good account of the partial order is [18, 22, 23]. Note that rather few mappings (functions) are operator convex (concave) or operator monotone. Fortunately, we are interested in trace functions, which form much bigger sets [18].
Take a look at (2.19) for example. Since H0 : A = I + X, and A ∈ As (even
stronger A ∈ A+ ), it follows from (2.18) that
The use of (2.19) allows us to separately study the diagonal part and the non-
diagonal part of the covariance matrix of the noise, since all the diagonal elements
are equal for a WSS random process. At low SNR, the goal is to find some ratio or
threshold that is statistically stable over a large number of Monte Carlo trials.
$$\mathbf{R}_s = \mathrm{Tr}\,\mathbf{R}_s\,\left(\mathbf{I} + b\,\boldsymbol{\sigma}_1\right),$$
where
$$\boldsymbol{\sigma}_1 = \begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix}.$$
The eigenvalues are
$$\lambda_{1,2} = \tfrac12\,\mathrm{Tr}\,\mathbf{A} \pm \tfrac12\sqrt{\mathrm{Tr}^2\mathbf{A} - 4\det\mathbf{A}}.$$
The exponential of the matrix $\mathbf{X}$, $e^{\mathbf{X}}$, has positive entries, and in fact [101]
$$e^{\mathbf{X}} = \begin{pmatrix}\cosh\sqrt{ab} & \sqrt{\tfrac{b}{a}}\,\sinh\sqrt{ab}\\[2pt] \sqrt{\tfrac{a}{b}}\,\sinh\sqrt{ab} & \cosh\sqrt{ab}\end{pmatrix}.$$
$$\mathbb{P}\left(X \ge a\right) \le \frac{\mathbb{E}[X]}{a} \quad\text{for } X \text{ nonnegative}. \tag{2.20}$$
Theorem 2.2.10 (Markov inequality). Let $\mathbf{X}$ be a matrix-valued random variable with values in $\mathcal{A}_+$ and expectation
$$\mathbf{M} = \mathbb{E}\mathbf{X} = \sum_{\mathbf{X}'}\Pr\{\mathbf{X}=\mathbf{X}'\}\,\mathbf{X}'. \tag{2.21}$$
$$\mathbb{P}\left(\mathbf{Y}\npreceq\mathbf{I}\right) \le \mathrm{Tr}\left(\mathbb{E}[\mathbf{Y}]\right).$$
Note from (1.87) that the trace and expectation commute! This is seen as follows:
$$\mathbb{E}[\mathbf{Y}] = \sum_{\mathbf{Y}'}P(\mathbf{Y}=\mathbf{Y}')\,\mathbf{Y}' \succeq \sum_{\mathbf{Y}'\npreceq\mathbf{I}}P(\mathbf{Y}=\mathbf{Y}')\,\mathbf{Y}'.$$
The second inequality follows from the fact that $\mathbf{Y}$ is positive and the realizations split into $\{\mathbf{Y}\preceq\mathbf{I}\}\cup\{\mathbf{Y}\npreceq\mathbf{I}\}$. All eigenvalues of $\mathbf{Y}$ are positive. Ignoring the event $\{\mathbf{Y}\npreceq\mathbf{I}\}$ is equivalent to removing some positive eigenvalues from the spectrum of $\mathbf{Y}$, $\mathrm{spec}(\mathbf{Y})$. Taking traces, and observing that a positive operator (or matrix) which is not less than or equal to $\mathbf{I}$ must have trace at least 1, we find
$$\mathrm{Tr}\left(\mathbb{E}[\mathbf{Y}]\right) \ge \sum_{\mathbf{Y}'\npreceq\mathbf{I}}P(\mathbf{Y}=\mathbf{Y}')\,\mathrm{Tr}(\mathbf{Y}') \ge \sum_{\mathbf{Y}'\npreceq\mathbf{I}}P(\mathbf{Y}=\mathbf{Y}') = \mathbb{P}\left(\mathbf{Y}\npreceq\mathbf{I}\right),$$
$$\mathbb{P}\left(|X - m| \ge a\right) \le \frac{\sigma^2}{a^2}. \tag{2.23}$$
The Chebyshev inequality is a consequence of the Markov inequality. In analogy with the scalar case, if we assume knowledge about the matrix-valued expectation and the matrix-valued variance, we can prove the matrix-valued Chebyshev inequality.
Theorem 2.2.11 (Chebyshev inequality). Let $\mathbf{X}$ be a matrix-valued random variable with values in $\mathcal{A}_s$, expectation $\mathbf{M} = \mathbb{E}\mathbf{X}$, and variance
$$\mathrm{Var}\,\mathbf{X} = \mathbf{S}^2 = \mathbb{E}\left(\mathbf{X}-\mathbf{M}\right)^2 = \mathbb{E}\left(\mathbf{X}^2\right) - \mathbf{M}^2. \tag{2.24}$$
For $\boldsymbol{\Delta} \ge 0$,
$$\mathbb{P}\left\{|\mathbf{X}-\mathbf{M}| \npreceq \boldsymbol{\Delta}\right\} \le \mathrm{Tr}\left(\mathbf{S}^2\boldsymbol{\Delta}^{-2}\right). \tag{2.25}$$
$$|\mathbf{X}-\mathbf{M}| \preceq \boldsymbol{\Delta} \ \Leftarrow\ (\mathbf{X}-\mathbf{M})^2 \preceq \boldsymbol{\Delta}^2,$$
since the square root $(\cdot)^{1/2}$ is operator monotone. See Sect. 1.4.22. We find that
$$\mathbb{P}\left(|\mathbf{X}-\mathbf{M}| \npreceq \boldsymbol{\Delta}\right) \le \mathbb{P}\left((\mathbf{X}-\mathbf{M})^2 \npreceq \boldsymbol{\Delta}^2\right) \le \mathrm{Tr}\left(\mathbf{S}^2\boldsymbol{\Delta}^{-2}\right).$$
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\notin\left[\mathbf{M}-\boldsymbol{\Delta},\,\mathbf{M}+\boldsymbol{\Delta}\right]\right) \le \frac{1}{n}\,\mathrm{Tr}\left(\mathbf{S}^2\boldsymbol{\Delta}^{-2}\right), \tag{2.26}$$
$$\mathbb{P}\left(\sum_{i=1}^n\mathbf{X}_i\notin\left[n\mathbf{M}-\sqrt{n}\,\boldsymbol{\Delta},\,n\mathbf{M}+\sqrt{n}\,\boldsymbol{\Delta}\right]\right) \le \mathrm{Tr}\left(\mathbf{S}^2\boldsymbol{\Delta}^{-2}\right).$$
Here, the second line is because the mapping $\mathbf{X}\to\mathbf{T}\mathbf{X}\mathbf{T}^*$ is bijective and preserves the order. $\mathbf{T}\mathbf{Y}\mathbf{T}^*$ and $\mathbf{T}\mathbf{B}\mathbf{T}^*$ are two commuting matrices. For commuting matrices $\mathbf{A},\mathbf{B}$, $\mathbf{A}\preceq\mathbf{B}$ is equivalent to $e^{\mathbf{A}}\preceq e^{\mathbf{B}}$, from which the third line follows. The last line follows from Theorem 2.2.10.
The Bernstein trick is a crucial step. The problem is reduced to the form $\mathrm{Tr}\,\mathbb{E}e^{\mathbf{Z}}$, where $\mathbf{Z} = \mathbf{T}\mathbf{Y}\mathbf{T}^* - \mathbf{T}\mathbf{B}\mathbf{T}^*$ is Hermitian. We really do not know whether $\mathbf{Z}$ is nonnegative or positive. But we do not care, since the matrix exponential of any Hermitian $\mathbf{A}$ is always nonnegative. As a consequence of using the Bernstein trick, we only need to deal with nonnegative matrices.
But we need another key ingredient—the Golden–Thompson inequality—since for Hermitian $\mathbf{A},\mathbf{B}$ we have $e^{\mathbf{A}+\mathbf{B}}\neq e^{\mathbf{A}}\cdot e^{\mathbf{B}}$ in general, unlike $e^{a+b} = e^a\cdot e^b$ for two scalars $a,b$. For two Hermitian matrices $\mathbf{A},\mathbf{B}$, we have the Golden–Thompson inequality
$$\mathrm{Tr}\,e^{\mathbf{A}+\mathbf{B}} \le \mathrm{Tr}\left(e^{\mathbf{A}}\cdot e^{\mathbf{B}}\right).$$
Proof. Using the previous lemma (obtained from the Bernstein trick) with $\mathbf{Y} = \sum_{i=1}^n\mathbf{X}_i$ and $\mathbf{B} = n\mathbf{A}$, we find
$$\begin{aligned}
\mathbb{P}\left(\sum_{i=1}^n\mathbf{X}_i \npreceq n\mathbf{A}\right) &\le \mathrm{Tr}\,\mathbb{E}\exp\left(\sum_{i=1}^n\mathbf{T}(\mathbf{X}_i-\mathbf{A})\mathbf{T}^*\right)\\
&= \mathbb{E}\,\mathrm{Tr}\exp\left(\sum_{i=1}^n\mathbf{T}(\mathbf{X}_i-\mathbf{A})\mathbf{T}^*\right)\\
&\le \mathbb{E}\,\mathrm{Tr}\left[\exp\left(\sum_{i=1}^{n-1}\mathbf{T}(\mathbf{X}_i-\mathbf{A})\mathbf{T}^*\right)\exp\left[\mathbf{T}(\mathbf{X}_n-\mathbf{A})\mathbf{T}^*\right]\right]\\
&= \mathbb{E}_{\mathbf{X}_1,\ldots,\mathbf{X}_{n-1}}\,\mathrm{Tr}\left[\exp\left(\sum_{i=1}^{n-1}\mathbf{T}(\mathbf{X}_i-\mathbf{A})\mathbf{T}^*\right)\mathbb{E}\exp\left[\mathbf{T}(\mathbf{X}_n-\mathbf{A})\mathbf{T}^*\right]\right]\\
&\le \left\|\mathbb{E}\exp\left[\mathbf{T}(\mathbf{X}_n-\mathbf{A})\mathbf{T}^*\right]\right\|\cdot\mathbb{E}_{\mathbf{X}_1,\ldots,\mathbf{X}_{n-1}}\,\mathrm{Tr}\exp\left(\sum_{i=1}^{n-1}\mathbf{T}(\mathbf{X}_i-\mathbf{A})\mathbf{T}^*\right);
\end{aligned}$$
the first line follows from Lemma 2.2.13. The second line is from the fact that the trace and expectation commute, according to (1.87). In the third line, we use the famous Golden–Thompson inequality (1.71). In the fourth line, we take the expectation over $\mathbf{X}_n$. The fifth line is due to the norm property (1.84). The sixth line applies the fifth step by induction $n$ times; $d$ comes from the fact $\mathrm{Tr}\exp(\mathbf{0}) = \mathrm{Tr}\,\mathbf{I} = d$, where $\mathbf{0}$ is the zero matrix whose entries are all zero.
The problem is now to minimize $\left\|\mathbb{E}\exp\left[\mathbf{T}(\mathbf{X}_n-\mathbf{A})\mathbf{T}^*\right]\right\|$ with respect to $\mathbf{T}$. For a matrix $\mathbf{T}$, we have the polar decomposition $\mathbf{T} = |\mathbf{T}|\cdot\mathbf{U}$, where $\mathbf{U}$ is a unitary matrix; so, without loss of generality, we may assume that $\mathbf{T}$ is Hermitian. Let us pursue the special case of a bounded matrix-valued random variable.
Defining the (binary) information divergence $D(a\|m) = a\log_2\frac{a}{m} + (1-a)\log_2\frac{1-a}{1-m}$, we have
$$\mathbb{P}\left(\sum_{i=1}^n\mathbf{X}_i \npreceq n\mathbf{A}\right) \le d\cdot\exp\left(-n\,D(a\|m)\right). \tag{2.31}$$
Proof. The second part follows from the first by considering $\mathbf{Y}_i = \mathbf{I}-\mathbf{X}_i$, and the observation that $D(a\|m) = D(1-a\|1-m)$. To prove it, we apply Theorem 2.2.14 with the special case $\mathbf{T} = \sqrt{t}\,\mathbf{I}$ to obtain
$$\mathbb{P}\left(\sum_{i=1}^n\mathbf{X}_i \npreceq n\mathbf{A}\right) \le \mathbb{P}\left(\sum_{i=1}^n\mathbf{X}_i \npreceq na\mathbf{I}\right).$$
As a consequence, we obtain the following estimates. The first line follows from using (2.33). The second line follows from the hypothesis $\mathbb{E}\mathbf{X} \preceq m\mathbf{I}$. The third line follows from the spectral norm property (1.57) for the identity matrix $\mathbf{I}$: $\|\mathbf{I}\| = 1$. Choosing
$$t = \log\left(\frac{a}{m}\cdot\frac{1-m}{1-a}\right) > 0,$$
$$D\left((1+x)\mu\,\|\,\mu\right) \ge \frac{x^2\mu}{2\ln 2}.$$
We refer to [102] for a proof. His proof directly follows from Ahlswede-Winter [36]
with some revisions.
The version of derivation, taken from [103], is more general in that the random
matrices need not be identically distributed. A symmetric matrix is assumed. It is
our conjecture that results of [103] may be easily extended to a Hermitian matrix.
Theorem 2.2.17 (Noncommutative Bernstein Inequality [104]). Let $\mathbf{X}_1,\ldots,\mathbf{X}_L$ be independent zero-mean random matrices of dimension $d_1\times d_2$. Suppose $\rho_k^2 = \max\left\{\left\|\mathbb{E}\left(\mathbf{X}_k\mathbf{X}_k^*\right)\right\|,\ \left\|\mathbb{E}\left(\mathbf{X}_k^*\mathbf{X}_k\right)\right\|\right\}$ and $\|\mathbf{X}_k\| \le M$ almost surely for all $k$. Then, for any $\tau > 0$,
$$\mathbb{P}\left(\left\|\sum_{k=1}^L\mathbf{X}_k\right\| > \tau\right) \le (d_1+d_2)\exp\left(\frac{-\tau^2/2}{\sum_{k=1}^L\rho_k^2 + M\tau/3}\right). \tag{2.35}$$
Note that in the case $d_1 = d_2 = 1$, this is precisely the two-sided version of the standard Bernstein Inequality. When the $\mathbf{X}_k$ are diagonal, this bound is the same as applying the standard Bernstein Inequality and a union bound to the diagonal of the matrix summation. Besides, observe that the right-hand side is less than
$$(d_1+d_2)\exp\left(-\frac{3}{8}\,\frac{\tau^2}{\sum_{k=1}^L\rho_k^2}\right)$$
as long as $\tau \le \frac{1}{M}\sum_{k=1}^L\rho_k^2$. This condensed form of the inequality is often easier to use.
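The bound (2.35) can be compared against simulation. The sketch below uses random sign matrices $\mathbf{X}_k = \varepsilon_k\mathbf{B}_k$ with fixed rectangular $\mathbf{B}_k$ (an illustrative choice of summands; any zero-mean bounded matrices would do).

```python
# Empirical tail of || sum_k X_k || versus the noncommutative Bernstein bound (2.35).
import numpy as np

rng = np.random.default_rng(8)
d1, d2, L, trials = 6, 4, 50, 5000
B = rng.standard_normal((L, d1, d2)) / np.sqrt(L)
rho2 = np.array([max(np.linalg.norm(b @ b.T, 2), np.linalg.norm(b.T @ b, 2)) for b in B])
M = max(np.linalg.norm(b, 2) for b in B)

eps = rng.choice([-1.0, 1.0], size=(trials, L, 1, 1))
norms = np.linalg.norm((eps * B).sum(axis=1), ord=2, axis=(1, 2))
for tau in (1.0, 1.5, 2.0):
    bound = (d1 + d2) * np.exp(-0.5 * tau**2 / (rho2.sum() + M * tau / 3))
    print(f"tau={tau}: empirical P(||sum|| > tau) = {np.mean(norms > tau):.4f}   bound = {bound:.4f}")
```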
Chernoff bounds are extremely useful in probability. Intuitively, they say that a
random sample approximates the average, with a probability of deviation that goes
down exponentially with the number of samples. Typically we are concerned about
real-valued random variables, but recently several applications have called for large-
deviations bounds for matrix-valued random variables. Such a bound was given by
Ahlswede and Winter [36, 105].
All of Wigderson and Xiao’s results [94] are extended to complex Hermitian
matrices, or abstractly to self-adjoint operators over any Hilbert spaces where the
operations of addition, multiplication, trace exponential, and norm are efficiently
computable. Wigderson and Xiao [94] essentially follows the original style of
Ahlswede and Winter [36] in the validity of their method.
$$\begin{aligned}
&= \mathbb{P}\left(\lambda_{\max}\left(\sum_{i=1}^n\theta\mathbf{X}_i\right) \ge \theta t\right) &&\text{(positive homogeneity of the eigenvalue map)}\\
&\le e^{-\theta t}\cdot\mathbb{E}\exp\left(\lambda_{\max}\left(\sum_{i=1}^n\theta\mathbf{X}_i\right)\right) &&\text{(Markov's inequality)}\\
&= e^{-\theta t}\cdot\mathbb{E}\,\lambda_{\max}\left(\exp\left(\sum_{i=1}^n\theta\mathbf{X}_i\right)\right) &&\text{(the spectral mapping theorem)}\\
&\le e^{-\theta t}\cdot\mathbb{E}\,\mathrm{Tr}\exp\left(\sum_{i=1}^n\theta\mathbf{X}_i\right) &&\text{(the exponential of a Hermitian matrix is positive definite)}
\end{aligned}\tag{2.36}$$
This section develops some general probability inequalities for the maximum
eigenvalue of a sum of independent random matrices. The main ingredient is
a matrix extension of the scalar-valued Laplace transform method for sums of
independent real random variables, see Sect. 1.1.6.
Before introducing the matrix-valued Laplace transform, we need to define matrix moments and cumulants, in analogy with Sect. 1.1.6 for the scalar setting. At this point, a quick review of Sect. 1.1 will illuminate the contrast between the scalar and matrix settings. The central idea of Ahlswede and Winter [36] is to extend the textbook Laplace transform method from the scalar setting to the matrix setting.
Consider a Hermitian matrix $\mathbf{X}$ that has moments of all orders. By analogy with the classical scalar definitions (Sect. 1.1.7), we may construct matrix extensions of the moment generating function and the cumulant generating function.
The coefficients $\mathbb{E}\mathbf{X}^k$ are called matrix moments and the $\boldsymbol{\Xi}_k$ are called matrix cumulants. The matrix cumulant $\boldsymbol{\Xi}_k$ has a formal expression as a noncommutative polynomial in the matrix moments up to order $k$. In particular, the first cumulant is the mean and the second cumulant is the variance:
$$\boldsymbol{\Xi}_1 = \mathbb{E}(\mathbf{X}) \quad\text{and}\quad \boldsymbol{\Xi}_2 = \mathbb{E}\left(\mathbf{X}^2\right) - \left(\mathbb{E}\mathbf{X}\right)^2.$$
In words, we can control tail probabilities for the maximum eigenvalue of a random matrix by producing a bound for the trace of the matrix moment generating function defined in (2.37). Let us show how Bernstein's Laplace transform technique extends to the matrix setting. The basic idea is due to Ahlswede–Winter [36], but we follow Oliveira [38] in this presentation.
Proof. Fix a positive number $\theta$. Observe that
$$\mathbb{P}\left(\lambda_{\max}(\mathbf{Y}) \ge t\right) = \mathbb{P}\left(\lambda_{\max}(\theta\mathbf{Y}) \ge \theta t\right) = \mathbb{P}\left(e^{\lambda_{\max}(\theta\mathbf{Y})} \ge e^{\theta t}\right) \le e^{-\theta t}\cdot\mathbb{E}\,e^{\lambda_{\max}(\theta\mathbf{Y})}.$$
The first identity uses the homogeneity of the maximum eigenvalue map. The second relies on the monotonicity of the scalar exponential function; the third relation is Markov's inequality. To bound the exponential, note that
$$e^{\lambda_{\max}(\theta\mathbf{Y})} = \lambda_{\max}\left(e^{\theta\mathbf{Y}}\right) \le \mathrm{Tr}\,e^{\theta\mathbf{Y}}.$$
The first relation is the spectral mapping theorem (Sect. 1.4.13). The second relation holds because the exponential of a Hermitian matrix is always positive definite—the eigenvalues of the matrix exponential are always positive (see Sect. 1.4.16 for the matrix exponential); thus, the maximum eigenvalue of a positive definite matrix is dominated by the trace. Combining the latter two relations, we reach
$$\mathbb{P}\left(\lambda_{\max}(\mathbf{Y}) \ge t\right) \le \inf_{\theta>0}\,e^{-\theta t}\cdot\mathbb{E}\,\mathrm{Tr}\,e^{\theta\mathbf{Y}}.$$
This inequality holds for any positive $\theta$, so we may take an infimum to complete the proof.
In the scalar setting of Sect. 1.2, the Laplace transform method is very effective for studying sums of independent (scalar-valued) random variables, because the moment generating function decomposes. Consider an independent sequence $X_k$ of real random variables. Operating formally, we see that the scalar moment generating function of the sum satisfies a multiplicative rule:
$$M_{\sum_k X_k}(\theta) = \mathbb{E}\exp\left(\theta\sum_kX_k\right) = \mathbb{E}\prod_k e^{\theta X_k} = \prod_k\mathbb{E}\,e^{\theta X_k} = \prod_k M_{X_k}(\theta). \tag{2.38}$$
This calculation relies on the fact that the scalar exponential function converts sums to products, a property the matrix exponential does not share; see Sect. 1.4.16. Thus, there is no immediate analog of (2.38) in the matrix setting.
4 In analysis the infimum or greatest lower bound of a subset S of real numbers is denoted by
inf(S) and is defined to be the biggest real number that is smaller than or equal to every number in
S . An important property of the real numbers is that every set of real numbers has an infimum (any
bounded nonempty subset of the real numbers has an infimum in the non-extended real numbers).
For example, inf {1, 2, 3} = 1, inf {x ∈ R, 0 < x < 1} = 0.
Ahlswede and Winter attempted to imitate the multiplicative rule (2.38) using the following observation. When $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent random matrices,
$$\mathrm{Tr}\,M_{\mathbf{X}_1+\mathbf{X}_2}(\theta) \le \mathbb{E}\,\mathrm{Tr}\left(e^{\theta\mathbf{X}_1}e^{\theta\mathbf{X}_2}\right) = \mathrm{Tr}\left(\mathbb{E}\,e^{\theta\mathbf{X}_1}\cdot\mathbb{E}\,e^{\theta\mathbf{X}_2}\right) = \mathrm{Tr}\left(M_{\mathbf{X}_1}(\theta)\cdot M_{\mathbf{X}_2}(\theta)\right). \tag{2.39}$$
The first relation is the Golden–Thompson trace inequality (1.71). Unfortunately, we cannot extend the bound (2.39) to include additional matrices. This cold fact may suggest that the Golden–Thompson inequality is not the natural way to proceed. In Sect. 2.2.4, we have given a full exposition of the Ahlswede–Winter method. Here, we follow a different path due to [53].
Let us return to the problem of bounding the matrix moment generating function of an independent sum. Although the multiplicative rule (2.38) for the matrix case is a dead end, the scalar cumulant generating function has a related property that can be extended. For an independent sequence $X_k$ of real random variables, the scalar cumulant generating function is additive:
$$\Xi_{\sum_k X_k}(\theta) = \log\mathbb{E}\exp\left(\theta\sum_kX_k\right) = \sum_k\log\mathbb{E}\,e^{\theta X_k} = \sum_k\Xi_{X_k}(\theta), \tag{2.40}$$
where the second relation follows from (2.38) when we take logarithms.
The key insight of Tropp's approach is that Corollary 1.4.18 offers a completely different way to extend the addition rule (2.40) from the scalar setting to the matrix setting. Indeed, this is a remarkable breakthrough, and much better results have been obtained thanks to it. This justifies the parallel development of Tropp's method with the Ahlswede–Winter method of Sect. 2.2.4.
Lemma 2.5.1 (Subadditivity of Matrix Cumulant Generating Functions). Consider a finite sequence $\{\mathbf{X}_k\}$ of independent, random, Hermitian matrices. Then
$$\mathbb{E}\,\mathrm{Tr}\exp\left(\sum_k\theta\mathbf{X}_k\right) \le \mathrm{Tr}\exp\left(\sum_k\log\mathbb{E}\,e^{\theta\mathbf{X}_k}\right) \quad\text{for } \theta\in\mathbb{R}. \tag{2.41}$$
Proof. It does no harm to assume $\theta = 1$. Let $\mathbb{E}_k$ denote the expectation conditioned on $\mathbf{X}_1,\ldots,\mathbf{X}_k$. Abbreviate
$$\boldsymbol{\Xi}_k := \log\mathbb{E}_{k-1}\,e^{\mathbf{X}_k} = \log\mathbb{E}\,e^{\mathbf{X}_k}.$$
Then
$$\begin{aligned}
\mathbb{E}\,\mathrm{Tr}\exp\left(\sum_{k=1}^n\mathbf{X}_k\right) &= \mathbb{E}_0\cdots\mathbb{E}_{n-1}\,\mathrm{Tr}\exp\left(\sum_{k=1}^{n-1}\mathbf{X}_k + \mathbf{X}_n\right)\\
&\le \mathbb{E}_0\cdots\mathbb{E}_{n-2}\,\mathrm{Tr}\exp\left(\sum_{k=1}^{n-1}\mathbf{X}_k + \log\mathbb{E}_{n-1}\,e^{\mathbf{X}_n}\right)\\
&= \mathbb{E}_0\cdots\mathbb{E}_{n-2}\,\mathrm{Tr}\exp\left(\sum_{k=1}^{n-2}\mathbf{X}_k + \mathbf{X}_{n-1} + \boldsymbol{\Xi}_n\right)\\
&\le \mathbb{E}_0\cdots\mathbb{E}_{n-3}\,\mathrm{Tr}\exp\left(\sum_{k=1}^{n-2}\mathbf{X}_k + \boldsymbol{\Xi}_{n-1} + \boldsymbol{\Xi}_n\right)\\
&\le \cdots \le \mathrm{Tr}\exp\left(\sum_{k=1}^n\boldsymbol{\Xi}_k\right).
\end{aligned}$$
The first line follows from the tower property of conditional expectation. At each step $m = 1,2,\ldots,n$, we use Corollary 1.4.18 with the fixed matrix $\mathbf{H}$ equal to
$$\mathbf{H}_m = \sum_{k=1}^{m-1}\mathbf{X}_k + \sum_{k=m+1}^n\boldsymbol{\Xi}_k.$$
This section contains abstract tail bounds for sums of random matrices.
Theorem 2.6.1 (Master Tail Bound for Independent Sums—Tropp [53]). Consider a finite sequence $\{\mathbf{X}_k\}$ of independent, random, Hermitian matrices. For all $t\in\mathbb{R}$,
$$\mathbb{P}\left(\lambda_{\max}\left(\sum_k\mathbf{X}_k\right)\ge t\right) \le \inf_{\theta>0}\,e^{-\theta t}\cdot\mathrm{Tr}\exp\left(\sum_k\log\mathbb{E}\,e^{\theta\mathbf{X}_k}\right). \tag{2.42}$$
Proof. Substitute the subadditivity rule for matrix cumulant generating functions, Eq. (2.41), into the Laplace transform bound, Proposition 2.3.1.
Now we are in a position to apply the very general inequality of (2.42) to some
specific situations. The first corollary adapts Theorem 2.6.1 to the case that arises
most often in practice.
Corollary 2.6.2 (Tropp [53]). Consider a finite sequence $\{\mathbf{X}_k\}$ of independent, random, Hermitian matrices with dimension $d$. Assume that there is a function $g : (0,\infty)\to[0,\infty]$ and a sequence $\{\mathbf{A}_k\}$ of fixed Hermitian matrices that satisfy the relations. Then, with $\rho := \lambda_{\max}\left(\sum_k\mathbf{A}_k\right)$,
$$\mathbb{P}\left(\lambda_{\max}\left(\sum_k\mathbf{X}_k\right)\ge t\right) \le d\cdot\inf_{\theta>0}\,e^{-\theta t + g(\theta)\cdot\rho}, \tag{2.44}$$
because of the property (1.73) that the matrix logarithm is operator monotone. Recall the fact (1.70) that the trace exponential is monotone with respect to the semidefinite order. As a result, we can introduce each relation from the sequence (2.45) into the master inequality (2.42). For $\theta > 0$, it follows that
$$\begin{aligned}
\mathbb{P}\left(\lambda_{\max}\left(\sum_k\mathbf{X}_k\right)\ge t\right) &\le e^{-\theta t}\cdot\mathrm{Tr}\exp\left(g(\theta)\cdot\sum_k\mathbf{A}_k\right)\\
&\le e^{-\theta t}\cdot d\cdot\lambda_{\max}\left(\exp\left(g(\theta)\cdot\sum_k\mathbf{A}_k\right)\right)\\
&= e^{-\theta t}\cdot d\cdot\exp\left(g(\theta)\cdot\lambda_{\max}\left(\sum_k\mathbf{A}_k\right)\right).
\end{aligned}$$
The second inequality holds because the trace of a positive definite matrix, such as the exponential, is bounded by the dimension $d$ times the maximum eigenvalue. The last line depends on the spectral mapping Theorem 1.4.4 and the fact that the function $g$ is nonnegative. Identify the quantity $\rho$, and take the infimum over positive $\theta$ to reach the conclusion (2.44).
(2.46)
Proof. Recall the fact (1.74) that the matrix logarithm is operator concave. For $\theta > 0$, it follows that
$$\sum_{k=1}^n\log\mathbb{E}\,e^{\theta\mathbf{X}_k} = n\cdot\frac{1}{n}\sum_{k=1}^n\log\mathbb{E}\,e^{\theta\mathbf{X}_k} \preceq n\cdot\log\left(\frac{1}{n}\sum_{k=1}^n\mathbb{E}\,e^{\theta\mathbf{X}_k}\right).$$
The property (1.70) that the trace exponential is monotone allows us to introduce the latter relation into the master inequality (2.42) to obtain
$$\mathbb{P}\left(\lambda_{\max}\left(\sum_{k=1}^n\mathbf{X}_k\right)\ge t\right) \le e^{-\theta t}\cdot\mathrm{Tr}\exp\left(n\cdot\log\left(\frac{1}{n}\sum_{k=1}^n\mathbb{E}\,e^{\theta\mathbf{X}_k}\right)\right).$$
To complete the proof, we bound the trace by $d$ times the maximum eigenvalue, and we invoke the spectral mapping Theorem 1.4.4 (twice) to draw the maximum eigenvalue map inside the logarithm. Take the infimum over positive $\theta$ to reach (2.46).
We can study the minimum eigenvalue of a sum of random Hermitian matrices because $\lambda_{\min}(\mathbf{A}) = -\lambda_{\max}(-\mathbf{A})$. As a result,
$$\mathbb{P}\left(\lambda_{\min}\left(\sum_{k=1}^n\mathbf{X}_k\right)\le t\right) = \mathbb{P}\left(\lambda_{\max}\left(\sum_{k=1}^n(-\mathbf{X}_k)\right)\ge -t\right).$$
We can also analyze the maximum singular value of a sum of random rectangular matrices by applying the results to the Hermitian dilation (1.81). For a finite sequence $\{\mathbf{Z}_k\}$ of independent, random, rectangular matrices, we have
$$\mathbb{P}\left(\left\|\sum_k\mathbf{Z}_k\right\|\ge t\right) = \mathbb{P}\left(\lambda_{\max}\left(\sum_k\varphi(\mathbf{Z}_k)\right)\ge t\right),$$
on account of (1.83) and the property that the dilation is real-linear; $\varphi$ denotes the dilation. This device allows us to extend most of the tail bounds developed in this book to rectangular matrices.
Ahlswede and Winter use a different approach to bound the matrix moment generating function, which relies on the multiplication bound (2.39) for the trace exponential of a sum of two independent, random, Hermitian matrices.
Consider a sequence $\{\mathbf{X}_k,\ k = 1,2,\ldots,n\}$ of independent, random, Hermitian matrices with dimension $d$, and let $\mathbf{Y} = \sum_k\mathbf{X}_k$. The trace inequality (2.39) implies that
$$\begin{aligned}
\mathrm{Tr}\,M_{\mathbf{Y}}(\theta) &\le \mathbb{E}\,\mathrm{Tr}\left[e^{\sum_{k=1}^{n-1}\theta\mathbf{X}_k}\,e^{\theta\mathbf{X}_n}\right] = \mathrm{Tr}\,\mathbb{E}\left[e^{\sum_{k=1}^{n-1}\theta\mathbf{X}_k}\,e^{\theta\mathbf{X}_n}\right]\\
&= \mathrm{Tr}\left[\left(\mathbb{E}\,e^{\sum_{k=1}^{n-1}\theta\mathbf{X}_k}\right)\mathbb{E}\,e^{\theta\mathbf{X}_n}\right] \le \mathrm{Tr}\left(\mathbb{E}\,e^{\sum_{k=1}^{n-1}\theta\mathbf{X}_k}\right)\cdot\lambda_{\max}\left(\mathbb{E}\,e^{\theta\mathbf{X}_n}\right).
\end{aligned}$$
These steps are carefully spelled out in previous sections, for example Sect. 2.2.4. Iterating this procedure leads to the relation
$$\mathrm{Tr}\,M_{\mathbf{Y}}(\theta) \le (\mathrm{Tr}\,\mathbf{I})\prod_k\lambda_{\max}\left(\mathbb{E}\,e^{\theta\mathbf{X}_k}\right) = d\cdot\exp\left(\sum_k\lambda_{\max}\left(\log\mathbb{E}\,e^{\theta\mathbf{X}_k}\right)\right). \tag{2.47}$$
This bound (2.47) is the key to the Ahlswede–Winter method. As a consequence, their approach generally leads to tail bounds that depend on a scale parameter involving "the sum of eigenvalues." In contrast, Tropp's approach is based on the subadditivity of cumulants, Eq. (2.41), which implies that
$$\mathrm{Tr}\,M_{\mathbf{Y}}(\theta) \le d\cdot\exp\left(\lambda_{\max}\left(\sum_k\log\mathbb{E}\,e^{\theta\mathbf{X}_k}\right)\right). \tag{2.48}$$
This result shows that a Gaussian series with real coefficients satisfies a normal-type tail bound where the variance is controlled by the sum of the squared coefficients. The relation (2.49) follows easily from the scalar Laplace transform method. See Example 1.2.1 for the derivation of the characteristic function; an inverse Fourier transform of this derived characteristic function leads to (2.49). So far, our exposition in this section is based on standard textbooks.
The inequality (2.49) can be generalized directly to the noncommutative setting.
The matrix Laplace transform method, Proposition 2.3.1, delivers the following
result.
Theorem 2.7.1 (Matrix Gaussian and Rademacher Series—Tropp [53]). Consider a finite sequence $\{\mathbf{A}_k\}$ of fixed Hermitian matrices with dimension $d$, and let $\{\gamma_k\}$ be a finite sequence of independent standard normal variables. Compute the variance parameter
$$\sigma^2 := \left\|\sum_k\mathbf{A}_k^2\right\|. \tag{2.50}$$
In particular,
$$\mathbb{P}\left(\left\|\sum_k\gamma_k\mathbf{A}_k\right\|\ge t\right) \le 2d\cdot e^{-t^2/2\sigma^2}. \tag{2.52}$$
Observe that the bound (2.51) reduces to the scalar result (2.49) when the dimension $d = 1$. The generalization (2.50) has been proven by Tropp [53] to be sharp, and it is also demonstrated that Theorem 2.7.1 cannot be improved without changing its form.
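A short experiment shows how the variance parameter $\sigma^2$ drives the tail of a matrix Gaussian series; the fixed Hermitian coefficients $\mathbf{A}_k$ below are drawn once at random purely for illustration.

```python
# Matrix Gaussian series Y = sum_k gamma_k A_k: empirical tail of ||Y|| versus
# the bound 2 d exp(-t^2 / (2 sigma^2)) of Theorem 2.7.1.
import numpy as np

rng = np.random.default_rng(9)
d, K, trials = 5, 30, 10000
G = rng.standard_normal((K, d, d))
A = (G + np.transpose(G, (0, 2, 1))) / np.sqrt(2 * K)      # fixed Hermitian coefficients
sigma2 = np.linalg.norm(np.sum(A @ A, axis=0), 2)          # || sum_k A_k^2 ||

gamma = rng.standard_normal((trials, K, 1, 1))
norms = np.linalg.norm((gamma * A).sum(axis=1), ord=2, axis=(1, 2))
for t in (4.0, 5.5, 7.0):
    bound = 2 * d * np.exp(-t**2 / (2 * sigma2))
    print(f"t={t}: empirical P(||Y|| >= t) = {np.mean(norms >= t):.4f}   bound = {bound:.4f}")
```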
Most of the inequalities in this book have variants that concern the maximum singular value of a sum of rectangular random matrices. These extensions follow immediately, as mentioned above, when we apply the results for Hermitian matrices to the Hermitian dilation of the sums of rectangular matrices. Here is a general version of Theorem 2.7.1.
Corollary 2.7.2 (Rectangular Matrix Gaussian and Rademacher Series—Tropp [53]). Consider a finite sequence $\{\mathbf{B}_k\}$ of fixed matrices with dimension $d_1\times d_2$, and let $\{\gamma_k\}$ be a finite sequence of independent standard normal variables. Compute the variance parameter
$$\sigma^2 := \max\left\{\left\|\sum_k\mathbf{B}_k\mathbf{B}_k^*\right\|,\ \left\|\sum_k\mathbf{B}_k^*\mathbf{B}_k\right\|\right\}.$$
$$\mathbb{E}\gamma^{2k+1} = 0 \quad\text{and}\quad \mathbb{E}\gamma^{2k} = \frac{(2k)!}{k!\,2^k} \quad\text{for } k = 0,1,2,\ldots.$$
Therefore,
$$\mathbb{E}\,e^{\gamma\mathbf{A}} = \mathbf{I} + \sum_{k=1}^{\infty}\frac{\mathbb{E}\gamma^{2k}\,\mathbf{A}^{2k}}{(2k)!} = \mathbf{I} + \sum_{k=1}^{\infty}\frac{\left(\mathbf{A}^2/2\right)^k}{k!} = e^{\mathbf{A}^2/2}.$$
The first relation holds since the odd terms in the series vanish. With this lemma, the tail bounds for Hermitian matrix Gaussian and Rademacher series follow easily.
Proof of Theorem 2.7.1. Let $\{\xi_k\}$ be a finite sequence of independent standard normal variables or independent Rademacher variables. Invoke Lemma 2.7.3 to obtain
$$\mathbb{E}\,e^{\xi_k\theta\mathbf{A}_k} \preceq e^{g(\theta)\cdot\mathbf{A}_k^2}, \quad\text{where } g(\theta) := \theta^2/2 \text{ for } \theta > 0.$$
Recall that
$$\sigma^2 = \left\|\sum_k\mathbf{A}_k^2\right\| = \lambda_{\max}\left(\sum_k\mathbf{A}_k^2\right).$$
Since standard Gaussian and Rademacher variables are symmetric, the inequality (2.53) implies that
$$\mathbb{P}\left(-\lambda_{\min}\left(\sum_k\xi_k\mathbf{A}_k\right)\ge t\right) = \mathbb{P}\left(\lambda_{\max}\left(\sum_k(-\xi_k)\mathbf{A}_k\right)\ge t\right) \le d\cdot e^{-t^2/2\sigma^2}.$$
Apply the union bound to the estimates for $\lambda_{\max}$ and $-\lambda_{\min}$ to complete the proof.
We use the Hermitian dilation of the series.
Proof of Corollary 2.7.2. Let $\{\xi_k\}$ be a finite sequence of independent standard normal variables or independent Rademacher variables. Consider the sequence $\{\xi_k\varphi(\mathbf{B}_k)\}$ of random Hermitian matrices with dimension $d_1+d_2$. The spectral identity (1.83) ensures that
$$\left\|\sum_k\xi_k\mathbf{B}_k\right\| = \lambda_{\max}\left(\varphi\left(\sum_k\xi_k\mathbf{B}_k\right)\right) = \lambda_{\max}\left(\sum_k\xi_k\varphi(\mathbf{B}_k)\right).$$
Theorem 2.7.1 is used. Simply observe that the matrix variance parameter (2.50) satisfies the relation
$$\sigma^2 = \left\|\sum_k\varphi(\mathbf{B}_k)^2\right\| = \left\|\begin{bmatrix}\sum_k\mathbf{B}_k\mathbf{B}_k^* & \mathbf{0}\\ \mathbf{0} & \sum_k\mathbf{B}_k^*\mathbf{B}_k\end{bmatrix}\right\| = \max\left\{\left\|\sum_k\mathbf{B}_k\mathbf{B}_k^*\right\|,\ \left\|\sum_k\mathbf{B}_k^*\mathbf{B}_k\right\|\right\}.$$
The symbols $b_{i:}$ and $b_{:j}$ represent the $i$th row and $j$th column of the matrix $\mathbf{B}$. An immediate consequence of (2.54) is that the median of the norm satisfies
$$M\left(\left\|\boldsymbol{\Gamma}_{\mathbf{B}}\right\|\right) \le \sigma\sqrt{2\log\left(2(d_1+d_2)\right)}. \tag{2.55}$$
For these nonuniform Gaussian matrices, the estimate (2.55) for the median has the correct order. We compare [106, Theorem 1] and [107, Theorem 3.1], although the results are not fully comparable. See Sect. 9.2.2 for extended work.
To establish (2.54), we first decompose the matrix of interest as a Gaussian series. Similarly,
$$\sum_{ij}\left(b_{ij}\mathbf{C}_{ij}\right)^*\left(b_{ij}\mathbf{C}_{ij}\right) = \sum_j\left(\sum_i|b_{ij}|^2\right)\mathbf{C}_{jj} = \mathrm{diag}\left(\|b_{:1}\|^2,\ldots,\|b_{:d_2}\|^2\right).$$
Thus,
$$\sigma^2 = \max\left\{\left\|\mathrm{diag}\left(\|b_{1:}\|^2,\ldots,\|b_{d_1:}\|^2\right)\right\|,\ \left\|\mathrm{diag}\left(\|b_{:1}\|^2,\ldots,\|b_{:d_2}\|^2\right)\right\|\right\} = \max\left\{\max_i\|b_{i:}\|^2,\ \max_j\|b_{:j}\|^2\right\}.$$
$$\mathbf{Y} = \sum_k\gamma_k\mathbf{A}_k \tag{2.56}$$
is used for many practical applications later in this book since it allows each sensor to be represented by the $k$th matrix.
Example 2.9.1 (NC-OFDM Radar and Communications). A subcarrier (or tone)
has a frequency fk , k = 1, . . . , N . Typically, N = 64 or N = 128. A radio
sinusoid ej2πfk t is transmitted by the transmitter (cell phone tower or radar). This
radio signal passes through the radio environment and “senses” the environment.
Each sensor collects some length of data over the sensing time. The data vector $\mathbf{y}_k$ of length $10^6$ is stored and processed for only one sensor. In other words, we typically receive $N = 128$ copies of measurements using one sensor. Of course, we can use more sensors, say $M = 100$.
We can extract the data structure using a covariance matrix that is to be directly
estimated from the data. For example, a sample covariance matrix can be used. We
call the estimated covariance matrix R̂k , k = 1, 2, .., N . We may desire to know the
impact of N subcarriers on the sensing performance. Equation (2.56) is a natural
model for this problem at hand. If we want to investigate the impact of M = 100
sensors on the sensing performance (via collaboration from a wireless network), we
need a data fusion algorithm. Intuitively, we can simply consider the sum of these extracted covariance matrices (random matrices). So we have a total of $n = MN = 100\times 128 = 12{,}800$ random matrices at our disposal. Formally, we have
$$\mathbf{Y} = \sum_k\gamma_k\mathbf{A}_k = \sum_{k=1}^{n=12{,}800}\gamma_k\hat{\mathbf{R}}_k.$$
Here, we are interested in the nonasymptotic view in statistics [108]: when the number of observations $n$ is large, fitting large, complex data sets requires dealing with huge collections of models at different scales. Throughout the book, we promote this nonasymptotic view by solving practical problems in wireless sensing and communications. This is a problem with "Big Data." In this view, one takes the number of observations as it is and tries to evaluate the effect of all the influential parameters. Here this parameter is $n$, the total number of measurements. Within one second, we have a total of $10^6\times 128\times 100 \approx 10^{10}$ points of data at our disposal. We need models at different scales to represent the data.
A remarkable feature of Theorem 2.7.1 is that it always allows us to obtain reasonably accurate estimates for the expected norm of this Hermitian Gaussian series
$$\mathbf{Y} = \sum_k\gamma_k\mathbf{A}_k. \tag{2.57}$$
To establish this point, we first derive upper and lower bounds for the second moment of $\|\mathbf{Y}\|$. Note $\|\mathbf{Y}\|$ is a scalar random variable. Using Theorem 2.7.1 gives
$$\mathbb{E}\|\mathbf{Y}\|^2 = \int_0^{\infty}\mathbb{P}\left(\|\mathbf{Y}\| > \sqrt{t}\right)dt \le 2\sigma^2\log(2d) + 2d\int_{2\sigma^2\log(2d)}^{\infty}e^{-t/2\sigma^2}\,dt = 2\sigma^2\log(2ed).$$
The (homogeneous) first and second moments of the norm of a Gaussian series are equivalent up to a universal constant [109, Corollary 3.2], so we have
$$c\sigma \le \mathbb{E}\left(\|\mathbf{Y}\|\right) \le \sigma\sqrt{2\log(2ed)}. \tag{2.58}$$
According to (2.58), the matrix variance parameter σ 2 controls the expected norm
E ( Y ) up to a factor that depends very weakly on the dimension d. A similar
remark goes to the median value M ( Y ).
$$\bar\mu_{\min} := \lambda_{\min}\left(\frac{1}{n}\sum_{k=1}^n\mathbb{E}\mathbf{X}_k\right) \quad\text{and}\quad \bar\mu_{\max} := \lambda_{\max}\left(\frac{1}{n}\sum_{k=1}^n\mathbb{E}\mathbf{X}_k\right).$$
Then
$$\mathbb{P}\left(\lambda_{\min}\left(\frac{1}{n}\sum_{k=1}^n\mathbf{X}_k\right)\le\alpha\right) \le d\cdot e^{-n\,D(\alpha\|\bar\mu_{\min})} \quad\text{for } 0\le\alpha\le\bar\mu_{\min},$$
$$\mathbb{P}\left(\lambda_{\max}\left(\frac{1}{n}\sum_{k=1}^n\mathbf{X}_k\right)\ge\alpha\right) \le d\cdot e^{-n\,D(\alpha\|\bar\mu_{\max})} \quad\text{for } \bar\mu_{\max}\le\alpha\le 1.$$
$$\mu_{\min} := \lambda_{\min}\left(\sum_{k=1}^n\mathbb{E}\mathbf{X}_k\right) \quad\text{and}\quad \mu_{\max} := \lambda_{\max}\left(\sum_{k=1}^n\mathbb{E}\mathbf{X}_k\right).$$
Then
$$\mathbb{P}\left(\lambda_{\min}\left(\sum_{k=1}^n\mathbf{X}_k\right)\le(1-\delta)\mu_{\min}\right) \le d\cdot\left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\mu_{\min}/R} \quad\text{for } \delta\in[0,1],$$
$$\mathbb{P}\left(\lambda_{\max}\left(\sum_{k=1}^n\mathbf{X}_k\right)\ge(1+\delta)\mu_{\max}\right) \le d\cdot\left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu_{\max}/R} \quad\text{for } \delta\ge 0.$$
The minimum eigenvalue has normal-type behavior while the maximum eigenvalue exhibits Poisson-type decay.
Before giving the proofs, we consider applications.
Example 2.10.3 (Rectangular Random Matrix). Matrix Chernoff inequalities are very effective for studying random matrices with independent columns. Consider a rectangular random matrix
$$\mathbf{Z} = \left[\mathbf{z}_1\ \mathbf{z}_2\ \cdots\ \mathbf{z}_n\right],$$
whose sample covariance matrix $\hat{\mathbf{R}}$ is an estimate of the true covariance matrix $\mathbf{R}$. One is interested in the error $\left\|\hat{\mathbf{R}}-\mathbf{R}\right\|$ as a function of the number of sample vectors, $n$. The norm of $\mathbf{Z}$ satisfies
$$\|\mathbf{Z}\|^2 = \lambda_{\max}\left(\mathbf{Z}\mathbf{Z}^*\right) = \lambda_{\max}\left(\sum_{k=1}^n\mathbf{z}_k\mathbf{z}_k^*\right). \tag{2.59}$$
$$s_m(\mathbf{Z})^2 = \lambda_{\min}\left(\mathbf{Z}\mathbf{Z}^*\right) = \lambda_{\min}\left(\sum_{k=1}^n\mathbf{z}_k\mathbf{z}_k^*\right).$$
In each case, the summands are stochastically independent and positive semidefinite (rank-1) matrices, so the matrix Chernoff bounds apply.
Corollary 2.10.2 gives accurate estimates for the expectation of the maximum eigenvalue:
$$\mu_{\max} \le \mathbb{E}\,\lambda_{\max}\left(\sum_{k=1}^n\mathbf{X}_k\right) \le C\cdot\max\left\{\mu_{\max},\ R\log d\right\}. \tag{2.60}$$
The lower bound is Jensen's inequality; the upper bound comes from a standard calculation. The dimensional dependence vanishes when the mean $\mu_{\max}$ is sufficiently large in comparison with the upper bound $R$! A priori knowledge of an accurate bound $R$ in $\lambda_{\max}(\mathbf{X}_k)\le R$ translates into a tighter bound in (2.60).
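A small simulation in the spirit of Example 2.10.3: sums of independent rank-one projections $\mathbf{z}_k\mathbf{z}_k^T$, with $\mathbf{z}_k$ uniform on the unit sphere (our illustrative choice, so that $R = 1$ and $\mu_{\max} = n/d$), have maximum eigenvalue close to $\mu_{\max}$ once $\mu_{\max}$ dominates $R\log d$.

```python
# Maximum eigenvalue of a sum of independent rank-one PSD matrices versus mu_max.
import numpy as np

rng = np.random.default_rng(10)
d, n, trials = 10, 200, 2000
z = rng.standard_normal((trials, n, d))
z /= np.linalg.norm(z, axis=2, keepdims=True)        # z_k uniform on the unit sphere
sums = np.einsum('tni,tnj->tij', z, z)               # sum_k z_k z_k^T, per trial
lam_max = np.linalg.eigvalsh(sums)[:, -1]
print("mu_max = n/d =", n / d)
print("E lambda_max(sum X_k) ≈", lam_max.mean())
print("R log d =", np.log(d))
```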
Proof of Theorem 2.10.1. We start with a semidefinite bound for the matrix moment
generating function of a random positive semidefinite contraction.
Lemma 2.10.4 (Chernoff moment generating function). Suppose that X is a
random positive semidefinite matrix that satisfies λmax (Xk ) 1. Then
E eθX I + eθ − 1 (EX) for θ ∈ R.
The proof of Lemma 2.10.4 parallels the classical argument; the matrix adaptation is
due to Ashlwede and Winter [36], which is followed in the proof of Theorem 2.2.15.
Proof of Lemma 2.10.4. Consider the function
f (x) = eθx .
Since f is convex, its graph lies below the chord connecting any two of its points.
More explicitly,
e^{θx} ≤ 1 + (e^θ − 1)·x   for x ∈ [0, 1].
The eigenvalues of X lie in the interval [0, 1], so the transfer rule (1.61) implies that
e^{θX} ≼ I + (e^θ − 1) X.
After substituting these quantiles into (2.61), we obtain the information divergence
upper bound.
Proof of Corollary 2.10.2, Upper Bound. Assume that the summands satisfy the uni-
form eigenvalue bound with R = 1; the general result follows by rescaling.
The shortest route to the weaker Chernoff bound starts at (2.61). The numerical
inequality log(1 + x) ≤ x, valid for x > −1, implies that
P{ λ_max( Σ_{k=1}^n X_k ) ≥ t } ≤ d · exp( −θt + g(θ)·n μ̄_max ) = d · exp( −θt + g(θ)·μ_max ).
Since λmin (−A) = −λmax (A), we can again use Corollary 2.6.3 as follows.
P{ λ_min( Σ_{k=1}^n X_k ) ≤ t } = P{ λ_max( Σ_{k=1}^n (−X_k) ) ≥ −t }
  ≤ d · exp( θt + n · log λ_max( (1/n) Σ_{k=1}^n ( I − g(θ)·E X_k ) ) )
  = d · exp( θt + n · log( 1 − g(θ)·λ_min( (1/n) Σ_{k=1}^n E X_k ) ) )
  = d · exp( θt + n · log( 1 − g(θ)·μ̄_min ) ).
(2.62)
Make the substitution t → nα. The right-hand side is minimum when
In the scalar setting, Bennett and Bernstein inequalities deal with a sum of
independent, zero-mean random variables that are either bounded or subexponential.
In the matrix setting, the analogous results concern a sum of zero-mean random
matrices. Recall that the classical Chernoff bounds concern the sum of independent,
nonnegative, and uniformly bounded random variables while, matrix Chernoff
bounds deal with a sum of independent, positive semidefinite, random matrices
whose maximum eigenvalues are subject to a uniform bound. Let us consider a
motivating example first.
Example 2.11.1 (Signal plus Noise Model). Let R̂_yy denote the sample covariance
matrix of the received signal plus noise, R̂_xx that of the signal, and R̂_ww that of the
Gaussian noise. The centered noise sample covariance matrices satisfy the conditions
of independent, zero-mean random matrices. All these matrices are independent,
nonnegative random matrices.
Our first result considers the case where the maximum eigenvalue of each
summand satisfies a uniform bound. Recall from Example 2.10.3 that the norm of a
rectangular random matrix Z satisfies
‖Z‖² = λ_max( ZZ* ) = λ_max( Σ_{k=1}^n z_k z_k* ).   (2.63)
After the multi-path channel propagation with fading, the constraints become
When a number of transmitters, say n, are emitting at the same time, the total
received signal is described by
Y = X_1 + ··· + X_n = Σ_{k=1}^n X_k.
E X_k = 0;   E( X_k^p ) ≼ (p!/2) · R^{p−2} A_k²,   for p = 2, 3, 4, ....
Compute the variance parameter
σ² := ‖ Σ_{k=1}^n A_k² ‖.
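The sketch below (not from the book; the bounded summands, sizes, and threshold are illustrative choices) checks a Bernstein-type tail of the form 2d·exp(−(t²/2)/(σ² + Rt/3)) against the empirical tail of ‖Σ_k X_k‖ for zero-mean summands X_k = ε_k B_k built from Rademacher signs.

```python
# Minimal sketch: matrix Bernstein variance parameter and tail check for
# bounded, zero-mean Hermitian summands X_k = eps_k * B_k.
import numpy as np

rng = np.random.default_rng(2)
d, n, n_trials = 6, 40, 4000

B = [(lambda M: (M + M.T) / (2 * d))(rng.standard_normal((d, d))) for _ in range(n)]
R = max(np.abs(np.linalg.eigvalsh(Bk)).max() for Bk in B)   # uniform bound on ||X_k||
sigma2 = np.linalg.norm(sum(Bk @ Bk for Bk in B), 2)        # ||sum E X_k^2|| (eps_k^2 = 1)

t = 3.0 * np.sqrt(sigma2)
hits = 0
for _ in range(n_trials):
    signs = rng.choice([-1.0, 1.0], size=n)                 # zero-mean Rademacher weights
    S = sum(s * Bk for s, Bk in zip(signs, B))
    hits += np.linalg.norm(S, 2) >= t

print("empirical tail:     ", hits / n_trials)
print("Bernstein-type bound:", 2 * d * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3)))
```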
This section, taking material from [43, 49, 53, 111], combines the matrix Laplace
transform method of Sect. 2.3 with the Courant-Fischer characterization of eigen-
values (Theorem 1.4.22) to obtain nontrivial bounds on the interior eigenvalues of a
sum of random Hermitian matrices. We will use this approach for estimates of the
covariance matrix.
The first identify follows from the positive homogeneity of eigenvalue maps and
the second from the monotonicity of the scalar exponential function. The final two
steps are Markov’s inequality and (1.89).
Let us bound the expectation. Interchange the order of the exponential and the
minimum, due to the monotonicity of the scalar exponential function; then apply the
spectral mapping Theorem 1.4.4 to see that
E exp( min_{V ∈ V^n_{n−k+1}} λ_max( θ V* X V ) ) = E min_{V ∈ V^n_{n−k+1}} λ_max( exp( θ V* X V ) ).
The first step uses Jensen’s inequality. The second inequality follows because the
exponential of a Hermitian matrix is always positive definite—see Sect. 1.4.16, so
its largest eigenvalue is smaller than its trace. The trace functional is linear, which is
very critical. The expectation is also linear. Thus we can exchange the order of the
expectation and the trace: trace and expectation commute—see (1.87).
Combine these observations and take the infimum over all positive θ to complete
the argument.
Now let apply Theorem 2.13.1 to the case that X can be expressed as a sum of
independent, Hermitian, random matrices. In this case, we develop the right-hand
side of the Laplace transform bound (2.65) by using the following result.
Theorem 2.13.2 (Tropp [111]). Consider a finite sequence {Xi } of independent,
Hermitian, random matrices with dimension n and a sequence {Ai } of fixed
Hermitian matrices with dimension n that satisfy the relations
E e^{X_i} ≼ e^{A_i}.   (2.66)
In particular,
E Tr exp( Σ_i X_i ) ≤ Tr exp( Σ_i A_i ).   (2.68)
Theorem 2.13.2 is an extension of Lemma 2.5.1, which establishes the result (2.68).
The proof depends on a recent result of [112], which extends Lieb's earlier classical
result [50, Theorem 6]. Here M^n_H represents the set of n × n Hermitian matrices.
Proposition 2.13.3 (Lieb-Seiringer 2005). Let H be a Hermitian matrix with
dimension k. Let V ∈ Vnk be an isometric embedding of Ck into Cn for some
k ≤ n. Then the function
A → Tr exp{ H + V*(log A) V }
is concave on the cone of positive definite matrices.
Proof of Theorem 2.13.2. First, combining the given condition (2.66) with the
operator monotonicity of the matrix logarithm gives the following for each k:
The first step follows from Proposition 2.13.3 and Jensen’s inequality, and the
second depends on (2.69) and the monotonicity of the trace exponential. Iterate
this argument to complete the proof. The main result follows from combining
Theorems 2.13.1 and 2.13.2.
Theorem 2.13.4 (Minimax Laplace Transform). Consider a finite sequence
{Xi } of independent, random, Hermitian matrices with dimension n, and let k ≤ n
be an integer.
1. Let {A_i(V)} be a sequence of Hermitian matrices that satisfy the semidefinite
relations
E e^{θ V* X_i V} ≼ e^{g(θ) A_i(V)}
for all V ∈ V^n_{n−k+1}, where g : (0, ∞) → [0, ∞). Then, for all t ∈ R,
P{ λ_k( Σ_i X_i ) ≥ t } ≤ inf_{θ>0} min_{V ∈ V^n_{n−k+1}} e^{−θt} · Tr exp( g(θ) Σ_i A_i(V) ).
The first bound in Theorem 2.13.4 requires less detailed information on how
compression affects the summands but correspondingly does not yield as sharp
results as the second.
Classical Chernoff bounds in Sect. 1.1.4 establish that the tails of a sum of
independent, nonnegative, random variables decay subexponentially. Tropp [53]
develops Chernoff bounds for the maximum and minimum eigenvalues of a sum
of independent, positive-semidefinite matrices. In particular, sample covariance
matrices are positive-semidefinite and the sums of independent, sample covariance
matrices are ubiquitous. Following Gittens and Tropp [111], we extend this analysis
to study the interior eigenvalues. The analogy with the scalar-valued random
variables in Sect. 1.1.4 is aimed at, in this development. At this point, it is insightful
if the audience reviews the materials in Sects. 1.1.4 and 1.3.
Intuitively, how concentrated the summands are determines the tail bounds of the
eigenvalues; in other words, if the ranges of some operators are aligned, the maximum
eigenvalue of their sum probably varies more than that of a sum of operators whose
ranges are orthogonal. We are interested in a finite sequence of random summands
{X_i} that concentrates in a given subspace. To measure how much this sequence
concentrates, we define a function ψ : ∪_{1≤k≤n} V^n_k → R with the property
max_i λ_max( V* X_i V ) ≤ ψ(V)   almost surely for each V ∈ ∪_{1≤k≤n} V^n_k.   (2.70)
Then
P{ λ_k( Σ_i X_i ) ≥ (1+δ) μ_k } ≤ (n−k+1) · [ e^{δ} / (1+δ)^{1+δ} ]^{μ_k/ψ(V_+)}   for δ > 0, and
P{ λ_k( Σ_i X_i ) ≤ (1−δ) μ_k } ≤ k · [ e^{−δ} / (1−δ)^{1−δ} ]^{μ_k/ψ(V_−)}   for δ ∈ [0, 1].
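As a rough illustration, the minimal sketch below (not from the book; it does not compute the compression subspaces V_± or the function ψ, and all sizes are synthetic) shows that an interior eigenvalue λ_k of a sum of independent rank-one PSD summands concentrates around the corresponding eigenvalue of the expected sum.

```python
# Minimal sketch: concentration of the k-th eigenvalue of a sum of rank-one
# PSD matrices x_i x_i^T around lambda_k of the expected sum.
import numpy as np

rng = np.random.default_rng(3)
d, n, k, n_trials = 8, 200, 3, 1000          # k-th largest eigenvalue

Sigma = np.diag(np.linspace(1.0, 2.0, d))    # E[x x^T]
mu_k = n * np.sort(np.linalg.eigvalsh(Sigma))[::-1][k - 1]

lam_k = []
for _ in range(n_trials):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    S = X.T @ X                               # sum of rank-one PSD summands
    lam_k.append(np.sort(np.linalg.eigvalsh(S))[::-1][k - 1])

print("lambda_k of expected sum:  ", mu_k)
print("mean / std of lambda_k(S): ", np.mean(lam_k), np.std(lam_k))
```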
The following lemma is due to Ahlswede and Winter [36]; see also [53, Lemma 5.8].
Lemma 2.14.2. Suppose that X is a random positive-semidefinite matrix that
satisfies λ_max(X) ≤ 1. Then
E e^{θX} ≼ exp( (e^θ − 1)(E X) )   for θ ∈ R.
Proof of Theorem 2.14.1, upper bound. Without loss of generality, we consider the
case ψ (V+ ) = 1; the general case follows due to homogeneity. Define
A_i(V_+) = V_+* (E X_i) V_+   and   g(θ) = e^θ − 1.
The trace can be bounded by the maximum eigenvalue (since the maximum
eigenvalue is nonnegative), by taking into account the reduced dimension of the
summands:
Tr exp( g(θ) Σ_i V_+* (E X_i) V_+ ) ≤ (n−k+1) · λ_max( exp( g(θ) Σ_i V_+* (E X_i) V_+ ) )
  = (n−k+1) · exp( g(θ) · λ_max( Σ_i V_+* (E X_i) V_+ ) ).
The equality follows from the spectral mapping theorem (Theorem 1.4.4 at Page 34).
We identify the quantity μk ; then combine the last two inequalities to give
P{ λ_k( Σ_i X_i ) ≥ (1+δ) μ_k } ≤ (n−k+1) · inf_{θ>0} e^{[g(θ)−θ(1+δ)]μ_k}.
Choosing θ = log(1+δ) minimizes the right-hand side, which gives the desired
upper tail bound.
Proof of Theorem 2.14.1, lower bound. The proof of the lower bound is very similar
to that of the upper bound. As above, consider only ψ(V_−) = 1. It follows
from (1.91) (Page 47) that
P{ λ_k( Σ_i X_i ) ≤ (1−δ) μ_k } = P{ λ_{n−k+1}( Σ_i (−X_i) ) ≥ −(1−δ) μ_k }.
(2.71)
Applying Lemma 2.14.2, we find that, for θ > 0,
E e^{θ(−V_−* X_i V_−)} = E e^{(−θ)(V_−* X_i V_−)} ≼ exp( g(θ) · E( −V_−* X_i V_− ) )
  = exp( g(θ) · V_−* (−E X_i) V_− ),
where g(θ) = 1 − e^{−θ}. The last equality follows from the linearity of the expectation.
Using Theorem 2.13.4, we find the latter probability in (2.71) is bounded by
inf_{θ>0} e^{θ(1−δ)μ_k} · Tr exp( g(θ) Σ_i V_−* (−E X_i) V_− ).
The trace can be bounded by the maximum eigenvalue (since the maximum
eigenvalue is nonnegative), by taking into account the reduced dimension of the
summands:
Tr exp( g(θ) Σ_i V_−* (−E X_i) V_− ) ≤ k · λ_max( exp( g(θ) Σ_i V_−* (−E X_i) V_− ) )
  = k · exp( −g(θ) · λ_min( Σ_i V_−* (E X_i) V_− ) ).
The equality follows from the spectral mapping theorem, Theorem 1.4.4
(Page 34), and (1.92) (Page 47). In the second equality, we identify the quantity μk .
Note that −g(θ) ≤ 0. Our argument establishes the bound
P{ λ_k( Σ_i X_i ) ≤ (1−δ) μ_k } ≤ k · inf_{θ>0} e^{[θ(1−δ)−g(θ)]μ_k}.
The right-hand side is minimized when θ = −log(1−δ), which gives the desired
lower tail bound.
From the two proofs, we see that the property that the maximum eigenvalue is
nonnegative is fundamental. Using this property, we convert the trace functional
into the maximum eigenvalue functional. Then the Courant-Fischer theorem,
Theorem 1.4.22, can be used. The spectral mapping theorem is applied almost
everywhere; it must be kept in mind. The non-commutative property is fundamental
in studying random matrices. By using the eigenvalues and their variational
properties, it is very convenient to think of random matrices as scalar-valued random
variables, converting the two-dimensional problem into a one-dimensional one,
which is much more convenient to handle.
The linearity of the expectation and the trace is basic, and we must always bear
it in mind. The trace, a linear functional, converts a random matrix into a
scalar-valued random variable; so does the k-th interior eigenvalue, which is a non-linear
functional. Since trace and expectation commute, it follows from (1.87), which says
that
that
as we deal with the scalar-valued random variables. This intuition lies at the very
basis of modern probability. In this book, one purpose is to prepare us for the
intuition of this "approximation" (2.73) for a given n, large but finite; n is
taken as it is. We are interested not in the asymptotic limit as n → ∞ but in the non-
asymptotic analysis. One natural metric is the k-th interior eigenvalue
λ_k( (1/n) Σ_{i=1}^n X_i − E X ).
Note the interior eigenvalues are non-linear functionals. We cannot simply separate
the two terms.
We can use the linear trace functional that is the sum of all eigenvalues. As a
result, we have
Σ_k λ_k( (1/n) Σ_{i=1}^n X_i − E X ) = Tr( (1/n) Σ_{i=1}^n X_i − E X )
  = Tr( (1/n) Σ_{i=1}^n X_i ) − Tr( E X )
  = (1/n) Σ_{i=1}^n Tr( X_i ) − E( Tr X ).
The linearity of the trace is used in the second and third equality. The property
that trace and expectation commute is used in the third equality. Indeed, the linear
trace functional is convenient, but a lot of statistical information is contained in the
interior eigenvalues. For example, the median of the eigenvalues, rather than the
average of the eigenvalues—the trace divided by its dimension can be viewed as the
average, is more representative statistically.
We are in particular interested in the signal plus noise model in the matrix setting.
We consider instead
E( X + Z ) ≅ (1/n) Σ_{i=1}^n ( X_i + Z_i ),   (2.74)
for
X, Z, X_i, Z_i ≽ 0   and   X, Z, X_i, Z_i ∈ C^{m×m},
where X, Xi represent the signal and Z, Zi the noise. Recall that A ≥ 0 means that
A is positive semidefinite (Hermitian and all eigenvalues of A are nonnegative).
Samples covariance matrices of dimensions m × m are most often used in this
context.
Since we have a prior knowledge that X, Xi are of low rank, the low-rank matrix
recovery naturally fits into this framework. We can choose the matrix dimension
m such that enough information of the signal matrices Xi is recovered, but we
don’t care if sufficient information of Zi can be recovered for this chosen m. For
example, only the first dominant k eigenvalues of X, Xi are recovered, which will
be treated in Sect. 2.10 Low Rank Approximation. We conclude that sums of
random matrices naturally impose structures on the data that exhibit themselves
only in the matrix setting. The low rank and positive semidefiniteness of sample
covariance matrices are examples of such structures. When
the data is big, we must impose these additional structures for high-dimensional data
processing.
The intuition of exploiting (2.74) is as follows: if the estimates of Xi are so
accurate that they are independent and identically distributed Xi = X0 , then we
rewrite (2.74) as
E( X + Z ) ≅ (1/n) Σ_{i=1}^n ( X_i + Z_i ) = (1/n) Σ_{i=1}^n ( X_0 + Z_i ) = X_0 + (1/n) Σ_{i=1}^n Z_i.   (2.75)
All we care about is that, through the sums of random matrices, (1/n) Σ_{i=1}^n X_i
performs statistically better than (1/n) Σ_{i=1}^n Z_i. For example, we can use the
linear trace functional (average operation) and the non-linear median functional.
To calculate the median value of λ_k, 1 ≤ k ≤ n, we consider
M[ λ_k( (1/n) Σ_{i=1}^n X_i + (1/n) Σ_{i=1}^n Z_i ) ],
where the linearity of the trace is used. This is simply the standard sum of scalar-
valued random variables. It is expected, via the central limit theorem, that their
sum approaches the Gaussian distribution for a reasonably large n. As pointed
out before, this trace operation throws away a lot of statistical information that is
available in the random matrices, for example, the matrix structures.
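The minimal sketch below (not from the book; the rank-one "signal", the noise level, and all sizes are synthetic choices) contrasts the linear trace functional with a nonlinear eigenvalue statistic on an averaged collection of sample covariance matrices, in the spirit of (2.74)-(2.75).

```python
# Minimal sketch: trace/m (average eigenvalue) vs. median eigenvalue vs. the
# top eigenvalue for averaged sample covariances of a low-rank signal plus noise.
import numpy as np

rng = np.random.default_rng(4)
m, n_samples, n_matrices = 16, 16, 50

u = rng.standard_normal((m, 1)); u /= np.linalg.norm(u)
X0 = 2.0 * (u @ u.T)                           # rank-one "signal" covariance

def sample_cov(C):
    Y = rng.multivariate_normal(np.zeros(m), C, size=n_samples)
    return Y.T @ Y / n_samples

avg = sum(sample_cov(X0 + np.eye(m)) for _ in range(n_matrices)) / n_matrices
eigs = np.linalg.eigvalsh(avg)
print("trace / m (average eigenvalue):", np.trace(avg) / m)
print("median eigenvalue:             ", np.median(eigs))
print("largest eigenvalue (signal):   ", eigs[-1])
```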
Proof. We follow [113]. The proof is by induction. For N = 0, it is easy to check
that the lemma is correct. For N ≥ 1, assume as the inductive hypothesis that (2.77)
holds with N replaced by N − 1. In this case, we have that
E[ Tr( exp( Σ_{i=1}^N X_i − Σ_{i=1}^N ln E_i[exp(X_i)] ) − I ) ]
  = E[ E_N[ Tr( exp( Σ_{i=1}^{N−1} X_i − Σ_{i=1}^N ln E_i[exp(X_i)] + ln exp(X_N) ) − I ) ] ]
  ≤ E[ E_N[ Tr( exp( Σ_{i=1}^{N−1} X_i − Σ_{i=1}^N ln E_i[exp(X_i)] + ln E_N[exp(X_N)] ) − I ) ] ]
  = E[ Tr( exp( Σ_{i=1}^{N−1} X_i − Σ_{i=1}^{N−1} ln E_i[exp(X_i)] ) − I ) ]
  ≤ 0,
where the second line follows from Theorem 1.4.17 and Jensen’s inequality. The
fifth line follows from the inductive hypothesis.
While (2.77) gives the trace result, sometimes we need the largest eigenvalue.
Theorem 2.16.2 (Largest eigenvalue—Hsu, Kakade and Zhang [113]). For any
α ∈ R and any t > 0
N N
P λmax α Xi − log Ei [exp (αXi )] >t
i=1 i=1
N N −1
Tr E −α Xi + log Ei [exp (αXi )] · et −t−1 .
i=1 i=1
Proof. Define the matrix A = α Σ_{i=1}^N X_i − Σ_{i=1}^N log E_i[exp(αX_i)]. Note that
g(x) = e^x − x − 1 is non-negative for all real x, and increasing for x ≥ 0. Letting
λ_i(A) be the i-th eigenvalue of the matrix A, we have
where I(x) is the indicator function of x. The second line follows from the spectral
mapping theorem. The third line follows from the increasing property of the function
g(x). The last line follows from Lemma 2.16.1.
When Σ_{i=1}^N X_i is zero mean, the first term in Theorem 2.16.2 vanishes, so only the
trace term
Tr( E[ Σ_{i=1}^N log E_i[exp(αX_i)] ] )
needs to be controlled. This is done under conditions of the form
E_i[X_i] = 0,
λ_max( (1/N) Σ_{i=1}^N log E_i[exp(αX_i)] ) ≤ α²σ̄²/2,
E Tr( (1/N) Σ_{i=1}^N log E_i[exp(αX_i)] ) ≤ α²σ̄²κ̄/2,
or, in terms of the summands themselves,
E_i[X_i] = 0,   λ_max(X_i) ≤ b̄,
λ_max( (1/N) Σ_{i=1}^N E_i[X_i²] ) ≤ σ̄²,
E Tr( (1/N) Σ_{i=1}^N E_i[X_i²] ) ≤ σ̄²κ̄.
that E X_i = 0 and ‖X_i‖ ≤ 1 almost surely. Denote σ² = ‖ Σ_{i=1}^N E X_i² ‖. Then, for
any t > 0,
P( ‖ Σ_{i=1}^N X_i ‖ > t ) ≤ 2 · ( Tr( Σ_{i=1}^N E X_i² ) / σ² ) · exp( −Ψ_σ(t) ) · r_σ(t),
where Ψ_σ(t) = (t²/2)/(σ² + t/3) and r_σ(t) = 1 + 6/( t² log²(1 + t/σ²) ).
If Σ_{i=1}^N E X_i² is of approximately low rank, i.e., has many small eigenvalues, the
number of non-zero eigenvalues may still be large. The term Tr( Σ_{i=1}^N E X_i² ) / σ²,
however, can be much smaller than the dimension n. Minsker [116] has applied
Theorem 2.16.5 to the problem of learning the continuous-time kernel.
A concentration inequality for sums of matrix-valued martingale differences
is also obtained by Minsker [116]. Let E_{i−1}[·] stand for the conditional expectation
E[· | X_1, ..., X_{i−1}].
Theorem 2.16.6 (Minsker [116]). Let X_1, ..., X_N be a sequence of martingale
differences with values in the set of n × n Hermitian matrices
such that ‖X_i‖ ≤ 1 almost surely. Denote W_N = Σ_{i=1}^N E_{i−1}[X_i²]. Then, for any
t > 0,
P( ‖ Σ_{i=1}^N X_i ‖ > t, λ_max(W_N) ≤ σ² ) ≤ 2 Tr( σ^{−2} E W_N ) · exp( −Ψ_σ(t) ) · ( 1 + 6/Ψ_σ²(t) ),
P( ‖ Σ_i X_i ‖ ≥ t ) ≤ 2n · exp( −(t²/2)/(σ² + Kt/3) ).
In fact, Khintchine only established the inequality for the case where p ≥ 2 is an
even integer. Since his work, much effort has been spent on determining the optimal
value of Cp in (2.78). In particular, it has been shown [117] that for p ≥ 2, the value
C_p^* = ( 2^p / π )^{1/2} Γ( (p+1)/2 )
is the best possible. Here Γ(·) is the Gamma function. Using Stirling’s formula, one
can show [117] that Cp∗ is of the order pp/2 for all p ≥ 2.
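The tiny sketch below (not from the book) simply evaluates this constant and compares it with p^{p/2}, which can be used to check the Stirling-type growth claim numerically.

```python
# Minimal sketch: evaluate C_p^* = (2^p/pi)^{1/2} * Gamma((p+1)/2) and compare
# with p^{p/2}.
from math import gamma, pi, sqrt

for p in (2, 4, 8, 16):
    C_star = sqrt(2**p / pi) * gamma((p + 1) / 2)
    print(p, C_star, p**(p / 2), C_star / p**(p / 2))
```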
The Khintchine inequality has been extended to arbitrary m × n matrices.
Here ‖A‖_{Sp} denotes the Schatten p-norm of an m × n matrix A, i.e., ‖A‖_{Sp} =
‖σ(A)‖_p, where σ(A) ∈ R^{min{m,n}} is the vector of singular values of A, and ‖·‖_p is
the usual l_p-norm.
Theorem 2.17.3 (Khintchine’s inequality for arbitrary m × n matrices [118]).
Let ξ1 , . . . , ξN be a sequence of independent Bernoulli random variables, and let
X1 , . . . , XN be arbitrary m × n matrices. Then, for any N = 1, 2, . . . and p ≥ 2,
we have
E[ ‖ Σ_{i=1}^N ξ_i X_i ‖_{Sp}^p ] ≤ p^{p/2} · ( Σ_{i=1}^N ‖X_i‖_{Sp}² )^{p/2}.
The normalization Σ_{i=1}^N ‖X_i‖_{Sp}² is not the only one possible for a
Khintchine-type inequality to hold. In 1986, Lust-Piquard showed another
possibility.
The proof of Lust-Piquard does not provide an estimate for γ_p. In 1998, Pisier [120]
showed that
γ_p ≤ α p^{p/2}
for some absolute constant α > 0. Using the result of Buchholz [121], we have
α ≤ (π/e)^{p/2} / 2^{p/4} < 1
for all p ≥ 2. We note that Theorem 2.17.4 is valid (with γ_p ≤ α p^{p/2} < p^{p/2}) when
ξ_1, ..., ξ_N are i.i.d. standard Gaussian random variables [121].
Let Ci be arbitrary m × n matrices such that
Σ_{i=1}^N C_i C_i^T ≼ I_m,   Σ_{i=1}^N C_i^T C_i ≼ I_n.   (2.79)
Here we state a result of [123] that is stronger than Theorem 2.17.4. We deal with
a bilinear form. Let X and Y be two matrices of size n × k_1 and n × k_2 which satisfy
X^T Y = 0,
and let {x_i} and {y_i} be the row vectors of X and Y, respectively. Denote by ε_i a
sequence of i.i.d. {0/1} Bernoulli random variables with P(ε_i = 1) = ε̄. Then, for
p ≥ 2,
( E‖ Σ_i ε_i x_i^T y_i ‖_{Sp}^p )^{1/p} ≤ 2√2 γ_p² max_i ‖x_i‖ max_i ‖y_i‖
  + √(2ε̄) γ_p max{ max_i ‖x_i‖ ‖ Σ_i y_i^T y_i ‖_{Sp}^{1/2}, max_i ‖y_i‖ ‖ Σ_i x_i^T x_i ‖_{Sp}^{1/2} },
(2.80)
where γ_p is the absolute constant defined in Theorem 2.17.4. The proof of (2.80)
uses the following result:
( E‖ Σ_i ε_i x_i^T x_i ‖_{Sp}^p )^{1/p} ≤ 2 γ_p² max_i ‖x_i‖² + ε̄ ‖ Σ_i x_i^T x_i ‖_{Sp}
for p ≥ 2.
Now consider X^T X = I; then for p ≥ log k, we have [123]
( E‖ I_{k×k} − (1/ε̄) Σ_{i=1}^n ε_i x_i^T x_i ‖_{Sp}^p )^{1/p} ≤ C √(p/ε̄) max_i ‖x_i‖,
where C = 2^{3/4}√(πe) ≈ 5. This result guarantees the invertibility of a sub-
matrix formed by sampling a few columns (or rows) of a matrix X.
Theorem 2.17.6 ([124]). Let X ∈ Rn×n , be a random matrix whose entries are
independent, zero-mean, random variables. Then, for p ≥ log n,
( E‖X‖^p )^{1/p} ≤ c_0 2^{1/p} √p [ ( E max_i ( Σ_j X_ij² )^{p/2} )^{1/p} + ( E max_j ( Σ_i X_ij² )^{p/2} )^{1/p} ],
where c_0 ≤ 2^{3/4}√(πe) < 5.
Theorem 2.17.7 ([124]). Let A ∈ Rn×n be any matrix and à ∈ Rn×n be a
random matrix such that
EÃ=A.
where c_0 ≤ 2^{3/4}√(πe) < 5.
Σ_{i=1}^N A_i ≼ Σ_{i=1}^N y_i A_i ≼ (1 + ε) Σ_{i=1}^N A_i.
The algorithm runs in O(N n3 /ε2 ) time. Moreover, the result continues to hold if
the input matrices A1 , . . . , AN are Hermitian and positive semi-definite.
Theorem 2.18.2 ([125]). Let A_1, ..., A_N be symmetric, positive semidefinite ma-
trices of size n × n and let y = (y_1, ..., y_N)^T ∈ R^N satisfy y ≥ 0 and Σ_{i=1}^N y_i = 1.
For any ε ∈ (0, 1), there exists x ≥ 0 with Σ_{i=1}^N x_i = 1 such that x has O(n/ε)
nonzero entries and
(1 − ε) Σ_{i=1}^N y_i A_i ≼ Σ_{i=1}^N x_i A_i ≼ (1 + ε) Σ_{i=1}^N y_i A_i.
This chapter relies heavily on the work of Tropp [53]. Due to its user-friendly
nature, we take much material from it.
Column subsampling of matrices with orthogonal rows is treated in [111].
Exchangeable pairs for sums of dependent random matrices is studied by Mackey,
Jordan, Chen, Farrell, Tropp [126]. Learning with integral operators [127, 128]
is relevant. Element-wise matrix sparsification by Drineas [129] is interesting.
See [130, Page 15] for some matrix concentration inequalities.
Chapter 3
Concentration of Measure
Concentration of measure plays a central role in the content of this book. This chap-
ter gives the first account of this subject. Bernstein-type concentration inequalities
are often used to investigate the sums of random variables (scalars, vectors and
matrices). In particular, we surveyed the recent status of sums of random matrices in
Chap. 2, which gives a first impression of the classical view of the
subject.
It is safe to say that the modern viewpoint of the subject is the concentration
of measure phenomenon through Talagrand’s inequality. Lipschitz functions are
basic mathematical objects to study. As a result, many complicated quantities can
be viewed as Lipschitz functions that can be handled in the framework of Talagrand.
This new viewpoint has profound impact on the whole structure of this book.
In some sense, the whole book is to prepare the audience to get comfortable with
this picture.
Sn = X1 + · · · + X n .
P( |S_n|/n ≥ t ) ≤ 2 e^{−nt²/2},   t ≥ 0.
Let
Z = Σ_{i=1}^n a_i ( X_i² − 1 ).
P( Z ≤ −2 ‖a‖₂ √t ) ≤ e^{−t}.   (3.1)
The following consequence of this bound is useful [135]: for all x ≥ 1, we have
P( (Z − n)/n ≥ 4x ) ≤ e^{−nx}.   (3.3)
ψ(u) = log E exp( u( X² − 1 ) ) = −u − (1/2) log(1 − 2u).
For u < 1/2,
ψ(u) ≤ u² / (1 − 2u).
Indeed,
ψ(u) = 2u² Σ_{k≥0} (2u)^k / (k+2)   and   u²/(1 − 2u) = u² Σ_{k≥0} (2u)^k.
Thus,
log E e^{uZ} = Σ_{i=1}^n log E exp( a_i u ( X_i² − 1 ) ) ≤ Σ_{i=1}^n a_i² u² / (1 − 2 a_i u)
  ≤ ‖a‖₂² u² / (1 − 2 ‖a‖_∞ u).
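A minimal Monte Carlo sketch (not from the book; the weights a_i and the level t are arbitrary choices) checking the lower-tail bound (3.1) for Z = Σ_i a_i(X_i² − 1):

```python
# Minimal sketch: empirical tail of Z = sum_i a_i (X_i^2 - 1) vs. the chi-square
# bound P( Z <= -2 ||a||_2 sqrt(t) ) <= e^{-t} in (3.1).
import numpy as np

rng = np.random.default_rng(5)
n, n_trials, t = 100, 50000, 2.0
a = np.ones(n) / n

thr = -2 * np.linalg.norm(a) * np.sqrt(t)
X = rng.standard_normal((n_trials, n))
Z = (X**2 - 1) @ a
print("empirical tail:", np.mean(Z <= thr))
print("bound e^{-t}:  ", np.exp(-t))
```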
Later we need to use the Lipschitz norm (also called the Lipschitz constant). For a
Lipschitz function f : R^n → R, the Lipschitz norm is defined as
‖f‖_L = sup_{x, y ∈ R^n} |f(x) − f(y)| / ‖x − y‖₂.
For i = 1, . . . , n let (Xi , || · ||i ), be normed spaces equipped with norm || · ||i ,
let Ωi be a finite subset of Xi with diameter at most one and let Pi be a probability
measure on Ωi . Define
X = ( ⊕_{i=1}^n X_i )_{ℓ2}
and
Ω = Ω1 × Ω2 × · · · Ωn ⊂ X
and let
P = P(n) = P1 × P2 × · · · × Pn
where
D² ≥ Σ_{i=1}^n ( v_i − u_i )².
‖x‖₂ = ( Σ_{i=1}^n x_i² )^{1/2},   ‖x‖_∞ = max_{i=1,...,n} |x_i|,   ‖x‖₁ = Σ_{i=1}^n |x_i|.
s = x_1 + ··· + x_N = Σ_{j=1}^N x_j.
For every t ≥ 0,
P( { | ‖s‖ − E‖s‖ | ≥ t } ) ≤ 2 e^{−t²/(2D²)},   (3.7)
where D² ≥ Σ_{j=1}^N ‖x_j‖_∞².
where ‖a‖ = ( Σ_{i=1}^n a_i² )^{1/2}. Consider a function f : Ω_1 × Ω_2 × ··· × Ω_n → R such that
i=1
for every x ∈ Ω1 × Ω2 × · · · × Ωn there exists a = a (x) ∈ Rn+ with ||a|| = 1 such
that for every y ∈ Ω1 × Ω2 × · · · × Ωn ,
ui ≤ Yi ≤ vi , i = 1, . . . , n.
Set
Z = sup_{t∈T} Σ_{i=1}^n t_i Y_i   (3.9)
and apply Theorem 3.3.4 on the product space Π_{i=1}^n [u_i, v_i] under the product
probability measure of the laws of the Y_i, i = 1, ..., n. Let t achieve the supremum
of f(x). Then, for every x ∈ Π_{i=1}^n [u_i, v_i],
f(x) = Σ_{i=1}^n t_i x_i ≤ Σ_{i=1}^n t_i y_i + Σ_{i=1}^n |t_i| |x_i − y_i|
  ≤ f(y) + σ Σ_{i=1}^n ( |t_i| |u_i − v_i| / σ ) 1_{x_i ≠ y_i}.
Thus, σ1 f (x) satisfies (3.8) with a = a (x) = σ1 (|t1 | , . . . , |tn |). Combining
Theorem 3.3.4 with (3.7), we have the following consequence.
Theorem 3.3.5 (Corollary 4.8 of [141]). Let Z be defined as (3.9) and denote the
median of Z by MZ. Then, for every t ≥ 0,
P( |Z − MZ| ≥ t ) ≤ 4 e^{−t²/(4σ²)}.
In addition,
|EZ − MZ| ≤ 4√π σ   and   Var(Z) ≤ 16σ².
Convex functions are of practical interest. The following result has the central
significance to a lot of applications.
Theorem 3.3.6 (Talagrand’s concentration inequality [142]). For every product
probability P on [−1, 1]n , consider a convex and Lipschitz function f : Rn → R
with Lipschitz constant L. Let X_1, ..., X_n be independent random variables taking
values in [−1, 1]. Let Y = f(X_1, ..., X_n) and let m be a median of Y. Then for every
t ≥ 0, we have
P( |Y − m| ≥ t ) ≤ 4 e^{−t²/(16L²)}.
See [142, Theorem 6.6] for a proof. Also see Corollary 4.10 of [141]. Let us see
how we can modify Theorem 3.3.6 to have concentration around the mean instead
of the median. Following [143], we just notice that by Theorem 3.3.6,
E(Y − m)² ≤ 64 L².   (3.10)
Since E(Y − m)² ≥ Var(Y), this shows that
P( |Y − E[Y]| ≥ 16L ) ≤ 1/4.
Using the definition of a median, this implies that
E [Y ] − 16L ≤ m ≤ E [Y ] + 16L.
this constant. See [23, 144] for related, more general results. Here, it suffices to give
some examples studied in [145]. The eigenvalues (or singular values) are Lipschitz
functions of the matrix elements. In particular, for each k ∈ {1, ..., n},
the k-th largest singular value σ_k(X) (respectively the eigenvalue λ_k(X)) is Lipschitz with
constant 1 if (X_ij)_{i,j=1}^n is considered as an element of the Euclidean space R^{n²}
(respectively the submanifold of R^{n²} corresponding to the Hermitian matrices).
If one insists on thinking of X as a matrix, this corresponds to considering the
underlying Hilbert-Schmidt metric. The Lipschitz constant of σ_k(X/√n) is 1/√n,
since the variances of the entries are 1/n. The trace function Tr(X) has a
Lipschitz constant of 1/n. From (3.11), the variance of the trace function is 1/n
times smaller than that of the largest eigenvalue (or the smallest eigenvalue). The same
is true for the singular values.
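The minimal sketch below (not from the book) checks numerically the Lipschitz property used here: each ordered eigenvalue map X → λ_k(X) changes by at most ‖X − Y‖ in the Hilbert-Schmidt (Frobenius) metric on symmetric matrices.

```python
# Minimal sketch: |lambda_k(X) - lambda_k(Y)| <= ||X - Y||_F for symmetric X, Y.
import numpy as np

rng = np.random.default_rng(6)
n = 10
sym = lambda M: (M + M.T) / 2
X, E = sym(rng.standard_normal((n, n))), sym(rng.standard_normal((n, n)))

ratios = []
for eps in (1.0, 0.1, 0.01):
    Y = X + eps * E
    dl = np.abs(np.linalg.eigvalsh(X) - np.linalg.eigvalsh(Y))   # all k at once
    ratios.append(dl.max() / np.linalg.norm(X - Y, "fro"))
print("max_k |lambda_k(X)-lambda_k(Y)| / ||X-Y||_F:", ratios)    # each ratio <= 1
```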
Theorem 3.3.7 (Theorem 4.18 of [141]). Let f : R^n → R be a real-valued function
such that ‖f‖_L ≤ σ and such that its Lipschitz coefficient with respect to the ℓ1-
metric is less than or equal to κ, that is,
|f(x) − f(y)| ≤ κ Σ_{i=1}^n |x_i − y_i|,   x, y ∈ R^n.
for some numerical constant C > 0 where M is either a median of f or its mean.
Conversely, the concentration result holds.
Let μ1 , . . . , μn be arbitrary probability measures on the unit interval [0, 1] and
let P be the product probability measure P = μ1 ⊗ . . . ⊗ μn on [0, 1]n . We say a
function on Rn is separately convex if it is convex in each coordinate. Recall that a
convex function on R is continuous and almost everywhere differentiable.
Theorem 3.3.8 (Theorem 5.9 of [141]). Let f be separately convex and 1-
Lipschitz on Rn . Then, for every product probability measure P on [0, 1]n , and
every t ≥ 0,
P( f ≥ ∫ f dP + t ) ≤ e^{−t²/4}.
where the last step follows from the Cauchy-Schwarz inequality. So the Lipschitz
norm is f L ≤ σ. The use of Theorem 3.3.6 or Theorem 3.3.8 gives the following
theorem, a main result we are motivated to develop.
Theorem 3.3.9 (Theorem 7.3 of [141]). Let η_1, ..., η_n be independent (scalar-
valued) random variables such that |η_i| ≤ 1 almost surely, for i = 1, ..., n, and let
v_1, ..., v_n be vectors in a normed space E with norm ‖·‖. For every t ≥ 0,
P( ‖ Σ_{i=1}^n η_i v_i ‖ ≥ M + t ) ≤ 2 e^{−t²/(16σ²)},
where M is either the mean or a median of ‖ Σ_{i=1}^n η_i v_i ‖ and where
σ² = sup_{‖z‖≤1} Σ_{i=1}^n ⟨z, v_i⟩².
1 The supremum is the least upper bound of a set S defined as a quantity M such that no member
of the set exceeds M . It is denoted as sup x.
x∈S
S = sup_{g∈G} Σ_{i=1}^n g( Z_i ).
Theorem 3.3.11 (Corollary 7.8 of Ledoux [141]). Suppose that |g| ≤ η for every g ∈ G and
that g(Z_1), ..., g(Z_n) have zero mean for every g ∈ G. Then, for all t ≥ 0,
P( |S − ES| ≥ t ) ≤ 3 exp( −( t/(Cη) ) log( 1 + ηt/( σ² + η E S̄ ) ) ),   (3.12)
where
σ² = sup_{g∈G} Σ_{i=1}^n E g²( Z_i ),
S̄ = sup_{g∈G} Σ_{i=1}^n |g( Z_i )|,
where (Au)_1 means the first coordinate of the vector Au, and so on. Let σ_max =
max_i σ_i. Our goal here is to show that f is a Lipschitz function with Lipschitz
constant (or norm) bounded by σ_max. We only consider the case when A is diagonal; the
general case can be treated similarly. For two vectors x, y ∈ R^n, we have
Thus,
‖X‖_F = sup_{‖G‖_F=1} Tr( X^T G ) = sup_{‖G‖_F=1} ⟨X, G⟩.
GF =1 GF =1
Note that trace and inner product are both linear. For vectors, the only norm we
consider is the 2 -norm, so we simply denote the 2 -norm of a vector by ||x||
which is equal to √⟨x, x⟩, where ⟨x, y⟩ is the Euclidean inner product between
two vectors. Like matrices, it is easy to show that
‖x‖ = sup_{‖y‖=1} ⟨x, y⟩.
See Sect. 1.4.5 for details about norms of matrices and vectors.
Now, let Z_j = ξ_j x_j^T y_j (a rank-one matrix); we have
‖S‖_F = ‖ Σ_{j=1}^m Z_j ‖_F = sup_{‖G‖_F=1} Σ_{j=1}^m ⟨Z_j, G⟩ = sup_{‖G‖_F=1} Σ_{j=1}^m g( Z_j ).
Since SF > 0, the expected value of SF is equal to the expected value of S̄. That is
ESF = ES̄F . We can bound the absolute value of g (Zj ) as
|g( Z_j )| ≤ |⟨ ξ_j x_j^T y_j, G ⟩| ≤ ‖ ξ_j x_j^T y_j ‖_F ≤ ‖x_j‖ ‖y_j‖,
E g²( Z_j ) = E ⟨ ξ_j x_j^T y_j, G ⟩² = (d/m) ⟨ x_j^T y_j, G ⟩² ≤ (d/m) ‖ x_j^T y_j ‖_F²,
we have
Σ_{j=1}^m E g²( Z_j ) ≤ (d/m) Σ_{j=1}^m ‖ x_j^T y_j ‖_F² = (d/m) Σ_{j=1}^m Tr( x_j^T y_j y_j^T x_j )
  ≤ (d/m) max_j ‖y_j‖² Tr( Σ_{j=1}^m x_j^T x_j ) = (kd/m) max_j ‖y_j‖²,
(3.13)
where k = Tr( Σ_{j=1}^m x_j^T x_j ). In the first inequality of the second line, we have
used (1.84), which is repeated here for convenience:
Tr( A · B ) ≤ ‖B‖ Tr( A ),   (3.14)
where A ≽ 0 and ‖B‖ is the spectral norm (largest singular value). Proceeding similarly,
we also have
Σ_{j=1}^m E g²( Z_j ) ≤ (d/m) max_j ‖x_j‖² Tr( Σ_{j=1}^m y_j^T y_j ) = (dα/m) max_j ‖x_j‖²,
where α = Tr( Σ_{j=1}^m y_j^T y_j ). So we choose
σ² = Σ_{j=1}^m E g²( Z_j ) ≤ (d/m) max{ α max_j ‖x_j‖², k max_j ‖y_j‖² }.   (3.15)
Apply the powerful Talagrand inequality (3.12) and note that, from the expectation
inequality, σ² + η E S_F ≤ σ E S_F + η E S_F ≤ E² S_F; we have
P( S_F − E S_F ≥ t ) ≤ 3 exp( −( t/(Cη) ) log( 1 + ηt/( σ² + η E² S_F ) ) )
  ≤ 3 exp( −t² / ( C_1 E² S_F ) ).
The last inequality follows from the fact that log(1 + x) ≥ 2x/3 for 0 ≤ x ≤ 1.
Thus, t must be chosen to satisfy ηt ≤ E² S_F.
Choose t = C log β E S_F where C is a small numerical constant. By some
Theorem 3.3.14 (Theorem 7.3 of Ledoux [141]). Let ξ_1, ..., ξ_n be a sequence of
independent random variables such that |ξ_i| ≤ 1 almost surely, i = 1, ..., n,
and let x_1, ..., x_n be vectors in a Banach space. Then, for every t ≥ 0,
P( ‖ Σ_{i=1}^n ξ_i x_i ‖ ≥ M + t ) ≤ 2 exp( −t²/(16σ²) ),   (3.16)
where M is either the mean or a median of ‖ Σ_{i=1}^n ξ_i x_i ‖ and
σ² = sup_{‖y‖≤1} Σ_{i=1}^n ⟨y, x_i⟩².
The theorem claims that the sum of vectors with random weights is distributed
like a Gaussian around its mean or median, with standard deviation 2√2 σ. This
theorem strongly bounds the supremum of a sum of vectors x_1, x_2, ..., x_n with
random weights in a Banach space. For applications such as (3.15), we need to bound
max_i ‖x_i‖ and max_i ‖y_i‖. For details, see [123].
Let {a_ij} be an n × n array of real numbers. Let π be chosen uniformly at random
from the set of all permutations of {1, ..., n}, and let X = Σ_{i=1}^n a_{iπ(i)}. This class of
random variables was first studied by Hoeffding [147].
Theorem 3.3.15 ([133]). Let {a_ij}_{i,j=1}^n be a collection of numbers from [0, 1]. Let
π be drawn from the uniform distribution over the set of all permutations of {1, ..., n},
and let X = Σ_{i=1}^n a_{iπ(i)}. Then
P( |X − EX| ≥ t ) ≤ 2 exp( −t² / ( 4EX + 2t ) )
for all t ≥ 0.
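A minimal Monte Carlo sketch of Theorem 3.3.15 (not from the book; the array a and the level t are arbitrary choices):

```python
# Minimal sketch: concentration of X = sum_i a_{i, pi(i)} for a uniformly random
# permutation pi, compared with the bound 2 exp(-t^2 / (4 E X + 2 t)).
import numpy as np

rng = np.random.default_rng(7)
n, n_trials, t = 50, 20000, 15.0
a = rng.uniform(0.0, 1.0, size=(n, n))

EX = a.mean() * n                               # E X = (1/n) sum_{ij} a_ij
vals = np.array([a[np.arange(n), rng.permutation(n)].sum() for _ in range(n_trials)])
print("empirical P(|X-EX| >= t):  ", np.mean(np.abs(vals - EX) >= t))
print("bound 2 exp(-t^2/(4EX+2t)):", 2 * np.exp(-t**2 / (4 * EX + 2 * t)))
```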
where || · || denotes the operator (or matrix) norm. The Gaussian process Xu,v =
Zu⊗v (X) , u, v ∈ S n−1 is now compared with Yu,v = Z(u,v) , where (u, v) is
regarded as an element of Rn × Rn = R2n . Now we need the Slepian-Fernique
lemma.
Lemma 3.4.1 (Slepian-Fernique lemma [145]). Let (X_t)_{t∈T} and (Y_t)_{t∈T} be two
families of jointly Gaussian mean-zero random variables such that
(a) ‖X_t − X_{t'}‖₂ ≤ ‖Y_t − Y_{t'}‖₂   for t, t' ∈ T.
Then E sup_{t∈T} X_t ≤ E sup_{t∈T} Y_t.
To see that the Slepian-Fernique lemma applies, we only need to verify that, for
u, v, u , v ∈ S n−1 , where S n−1 is a sphere in Rn ,
|u ⊗ v − u' ⊗ v'|² ≤ |(u, v) − (u', v')|² = |u − u'|² + |v − v'|²,
where |·| is the usual Euclidean norm. On the other hand, for (x, y) ∈ Rn × Rn , we
have
Z_{(u,v)}(x, y) = ⟨x, u⟩ + ⟨y, v⟩,
so
√n E‖G^{(n)}‖ ≤ 2 ∫_{R^n} |x| dγ_n(x),
(1/√2) ( G  −G ; G  G ) = (1/√2) ( ( 1 0 ; 0 1 ) ⊗ G + ( 0 −1 ; 1 0 ) ⊗ G ).
We take material from [150, 151] for this presentation. See Sect. 7.6 for some
applications. A stochastic process is a collection Xt , t ∈ T̃ , of complex-valued
random variables indexed by some set T̃ . We are interested in bounding the
moments of the supremum of Xt , t ∈ T̃ . To avoid measurability issues, we define,
for a subset T ⊂ T̃ , the lattice supremum as
E sup_{t∈T} |X_t| = sup{ E sup_{t∈F} |X_t| : F ⊂ T, F finite }.   (3.21)
Now we apply Dudley's inequality for the special case of the Rademacher process
of the form
X_t = Σ_{i=1}^M ε_i x_i(t),   t ∈ T̃,   (3.24)
where x(t) = ( x_1(t), ..., x_M(t) ) and ‖·‖₂ is the standard Euclidean norm. So
we can rewrite the (pseudo-)metric as
d(s, t) = ( E|X_t − X_s|² )^{1/2} = ‖ x(t) − x(s) ‖₂.   (3.26)
Hoeffding's inequality shows that the Rademacher process (3.24) satisfies the
concentration property (3.23). We deal with the Rademacher process, while the
original result was for Gaussian processes; see also [27, 81, 141, 151, 152].
For a subset T ⊂ T̃, the covering number N(T, d, δ) is defined as the smallest
integer N such that there exists a subset E ⊂ T̃ with cardinality |E| = N satisfying
T ⊂ ∪_{t∈E} B_d(t, δ),   B_d(t, δ) = { s ∈ T̃ : d(t, s) ≤ δ }.   (3.27)
D (T ) = sup d (s, t) .
s,t∈T
E sup_{t∈T} |X_t − X_{t_0}| ≤ 16.51 · ∫_0^{D(T)} √( ln N(T, d, u) ) du + 4.424 · D(T).   (3.28)
Further, for p ≥ 2,
( E sup_{t∈T} |X_t − X_{t_0}|^p )^{1/p} ≤ 6.028^{1/p} √p ( 14.372 ∫_0^{D(T)} √( ln N(T, d, u) ) du + 5.818 · D(T) ).
(3.29)
The main proof ingredients are the covering number arguments and the concentra-
tion of measure. The estimate (3.29) is also valid for 1 ≤ p ≤ 2 with possibly
slightly different constants: this can be seen, for instance, from interpolation
between p = 1 and p = 2. The theorem and its proof easily extend to Banach
space valued processes satisfying
P( |X_t − X_s| ≥ u d(t, s) ) ≤ 2 e^{−u²/2},   u > 0,   s, t ∈ T̃.
Inequality (3.29) for the increments of the process can be used in the following way
to bound the supremum:
( E sup_{t∈T} |X_t|^p )^{1/p} ≤ inf_{t_0∈T} [ ( E sup_{t∈T} |X_t − X_{t_0}|^p )^{1/p} + ( E|X_{t_0}|^p )^{1/p} ]
  ≤ 6.028^{1/p} √p ( 14.372 ∫_0^{D(T)} √( ln N(T, d, u) ) du + 5.818 · D(T) )
The second term is often easy to estimate. Also, for a centered real-valued process,
that is EXt = 0, for all t ∈ T̃ , we have
E sup Xt = E sup (Xt − Xt0 ) ≤ E sup |Xt − Xt0 | . (3.31)
t∈T t∈T t∈T
16.51 + (2 × 4.424)/√(ln 2) < 30.
‖A‖_{p→q} = max_{‖x‖_p=1} ‖Ax‖_q,
where σ_i(A) denotes a singular value of the matrix A. The ℓ_∞-operator norm is given by
‖A‖_{∞→∞} = max_{‖x‖_∞=1} ‖Ax‖_∞ = max_{i=1,...,m} Σ_{j=1}^n |A_ij|,
where A_ij are the entries of the matrix A. Also we have the norm ‖A‖_{1→2}:
‖A‖_{1→2} = sup_{‖u‖₁=1} ‖Au‖₂
  = sup_{‖v‖₂=1} sup_{‖u‖₁=1} v^T A u
  = max_{i=1,...,d} ‖A_i‖₂,
where A_i denotes the i-th column of A.
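The minimal sketch below (not from the book; the matrix is arbitrary) evaluates the closed forms just given for ‖A‖_{∞→∞} (maximum absolute row sum) and ‖A‖_{1→2} (maximum column ℓ2-norm), and cross-checks the latter against the extreme points ±e_j of the ℓ1 ball.

```python
# Minimal sketch: l_infty->l_infty and l_1->l_2 operator norms from their
# closed forms, plus a brute-force check on the extreme points of the l1 ball.
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((5, 4))

norm_inf = np.abs(A).sum(axis=1).max()          # ||A||_{infty->infty}: max abs row sum
norm_12 = np.linalg.norm(A, axis=0).max()       # ||A||_{1->2}: max column l2 norm

brute = max(np.linalg.norm(A @ e) for e in np.eye(A.shape[1]))   # sup over +/- e_j
print(norm_inf, norm_12, brute)
```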
The inner product induces the Hilbert-Schmidt (or Frobenius) norm
‖A‖_F = ‖A‖_{HS} = √⟨A, A⟩.
Our goal here is to derive the concentration inequalities of √1n Xv 2 . The emphasis
is on the standard approach of using Slepian-lemma [27,141] as well as an extension
due to Gordon [153]. We follow [135] closely for our exposition of this approach.
See also Sect. 3.4 and the approach used for the proof of Theorem 3.8.4.
Given some index set U × V , let {Yu,v , (u, v) ∈ U × V } and
{Zu,v , (u, v) ∈ U × V } be a pair of zero-mean Gaussian processes. Given the
semi-norm on the processes defined via σ(X) = ( E X² )^{1/2}, Slepian's lemma
states that if
σ( Y_{u,v} − Y_{u',v'} ) ≤ σ( Z_{u,v} − Z_{u',v'} )   for all (u, v) and (u', v') in U × V,
(3.33)
then
One version of Gordon's extension [153] asserts that if the inequality (3.33) holds
for all (u, v) and (u', v') in U × V, and holds with equality when v = v', then
E sup inf Yu,v ≤ E sup inf Zu,v . (3.35)
u∈U v∈V u∈U v∈V
Now let us turn to the problem at hand. Any random matrix X from the given
ensemble can be written as WΣ1/2 , where W ∈ Rn×d is a matrix with i.i.d.
N (0, 1) entries, and Σ1/2 is the symmetric matrix square root. We choose the set U
as the unit ball
S n−1 = {u ∈ Rn : u 2 = 1} ,
(3.36)
Now we use the Cauchy-Schwarz inequality and the equalities ‖u‖₂ = ‖u'‖₂ = 1
and ‖ṽ‖₂ = ‖ṽ'‖₂; we have u^T u' − ‖u‖₂² ≤ 0 and ‖ṽ‖₂² − ṽ^T ṽ' ≥ 0. As a result,
We claim that the Gaussian process Yu,v satisfies the conditions of Gordon’s lemma
in terms of the zero-mean Gaussian process Zu,v given by
Zu,v = gT u + hT Σ1/2 v , (3.38)
where g ∈ Rn and h ∈ Rd are both standard Gaussian vectors (i.e., with i.i.d.
N (0, 1) entries). To prove the claim, we compute
σ²( Z_{u,v} − Z_{u',v'} ) = ‖u − u'‖₂² + ‖ Σ^{1/2}( v − v' ) ‖₂² = ‖u − u'‖₂² + ‖ṽ − ṽ'‖₂².
says that Slepian's condition (3.33) holds. On the other hand, when v = v', we
see from (3.36) that
so that the equality required for Gordon’s inequality (3.34) also holds.
Upper Bound: Since all the conditions required for Gordon’s inequality (3.34) are
satisfied, we have
E[ sup_{v∈V(r)} ‖Xv‖₂ ] = E[ sup_{(u,v)∈S^{n−1}×V(r)} u^T X v ]
  ≤ E[ sup_{(u,v)∈S^{n−1}×V(r)} Z_{u,v} ]
  = E[ sup_{‖u‖₂=1} g^T u ] + E[ sup_{v∈V(r)} h^T Σ^{1/2} v ]
  ≤ E[ ‖g‖₂ ] + E[ sup_{v∈V(r)} h^T Σ^{1/2} v ].
By convexity, we have
E[ ‖g‖₂ ] ≤ √( E‖g‖₂² ) = √( E Tr( g g^T ) ) = √( Tr E( g g^T ) ) = √n,
since E[ g g^T ] = I_{n×n}. From this, we obtain that
E[ sup_{v∈V(r)} ‖Xv‖₂ ] ≤ √n + E[ sup_{v∈V(r)} h^T Σ^{1/2} v ].   (3.39)
Since each element ( Σ^{1/2} h )_i is zero-mean Gaussian with variance at most ρ(Σ) =
max_i Σ_ii, standard results on Gaussian maxima (e.g., [27]) imply that
E‖ Σ^{1/2} h ‖_∞ ≤ √( 3ρ(Σ) log d ).
Combining with (3.39) and dividing by √n,
E[ sup_{v∈V(r)} ‖Xv‖₂ ] / √n ≤ 1 + [ 3ρ(Σ) log d / n ]^{1/2} r.   (3.40)
we obtain that
√n [ f(W) − f(W') ] = sup_{v∈V(r)} ‖ W Σ^{1/2} v ‖₂ − sup_{v∈V(r)} ‖ W' Σ^{1/2} v ‖₂
  ≤ sup_{v∈V(r)} ‖ Σ^{1/2} v ‖₂ ‖ W − W' ‖_F
  = ‖ W − W' ‖_F,
since ‖ Σ^{1/2} v ‖₂ = 1 for all v ∈ V(r). We have thus shown that the Lipschitz
constant L ≤ 1/√n. Following the rest of the derivations in [135], we conclude that
(1/√n) ‖Xv‖₂ ≤ 3 ‖ Σ^{1/2} v ‖₂ + 6 √( ρ(Σ) log d / n ) ‖v‖₁   for all v ∈ R^d.   (3.42)
Lower Bound: We use Gordon’s inequality to show the lower bound. We have
E[ sup_{v∈V(r)} ( −‖Xv‖₂ ) ] ≤ E[ sup_{v∈V(r)} inf_{u∈S^{n−1}} Z_{u,v} ]
  = E[ inf_{u∈S^{n−1}} g^T u ] + E[ sup_{v∈V(r)} h^T Σ^{1/2} v ]
  ≤ −E[ ‖g‖₂ ] + [ 3ρ(Σ) log d ]^{1/2} r,
where we have used the previous derivation to upper bound E[ sup_{v∈V(r)} h^T Σ^{1/2} v ].
Since |E‖g‖₂ − √n| = o(√n), we have E‖g‖₂ ≥ √n/2 for all n ≥ 1. We divide by
√n and add 1 to both sides so that
E[ sup_{v∈V(r)} ( 1 − ‖Xv‖₂/√n ) ] ≤ 1/2 + [ 3ρ(Σ) log d / n ]^{1/2} r.   (3.43)
Defining
f(W) = sup_{v∈V(r)} ( 1 − ‖Xv‖₂/√n ),
we can use the same arguments to show that its Lipschitz constant is at most 1/√n.
Following the rest of the arguments in [135], we conclude that
(1/√n) ‖Xv‖₂ ≥ (1/2) ‖ Σ^{1/2} v ‖₂ − 6 √( ρ(Σ) log d / n ) ‖v‖₁   for all v ∈ R^d.
The notions of sparsity can be defined precisely in terms of the ℓ_p-balls³ for
p ∈ (0, 1], defined as [135]
B_p(R_p) = { z ∈ R^n : ‖z‖_p^p = Σ_{i=1}^n |z_i|^p ≤ R_p },   (3.44)
where I is the indicator function and z has exactly k non-zero entries, with k ≪ n.
See Sect. 8.7 for its application in linear regression.
To illustrate the discretization arguments of the set, we consider another result,
taken also from Raskutti, Wainwright and Yu [135].
Theorem 3.6.2 (Lemma 6 of Raskutti, Wainwright and Yu [135]). Consider a
random matrix X ∈ R^{N×n} with the ℓ2-norm upper bound ‖Xz‖₂/√N ≤ κ‖z‖₂ for all
sparse vectors with at most 2s non-zero entries, z ∈ B_0(2s), i.e.,
B_0(2s) = { z ∈ R^n : Σ_{i=1}^n I[z_i ≠ 0] ≤ 2s },   (3.46)
and a zero-mean white Gaussian random vector w ∈ R^N with variance σ², i.e.,
w ∼ N( 0, σ² I_{N×N} ). Then, for any radius R > 0, we have
sup_{‖z‖₀≤2s, ‖z‖₂≤R} (1/N) |w^T X z| ≤ 6 σ R κ √( s log(n/s) / N )
with probability greater than 1 − c_1 exp( −c_2 min{ N, s log(n − s) } ).
Proof. For a given radius R > 0, define the set
S (s, R) = {z ∈ Rn : z 0 ≤ 2s, z 2 ≤ R} ,
3 Strictly speaking, these sets are not “balls” when p < 1, since they fail to be convex.
Z_N = sup_{z∈S(s,R)} (1/N) | w^T X z |.
For a given ε ∈ (0, 1) to be chosen later, let us upper bound the minimal cardinality
of a set that covers the set S(s, R) up to (Rε)-accuracy in ℓ2-norm. We claim
that we may find a covering set { z^1, ..., z^K } ⊂ S(s, R) with cardinality K =
K(s, R, ε) satisfying
log K(s, R, ε) ≤ log ( n choose 2s ) + 2s log(1/ε).
To establish the claim, we note that there are ( n choose 2s ) subsets of size 2s within
the set {1, 2, ..., n}. Also, for any 2s-sized subset, there is an (Rε)-covering in
ℓ2-norm of the ball B(R) (radius R) with at most 2^{2s log(1/ε)} elements [154].
As a result, for each z ∈ S(s, R), we may find some z^l such that ‖z − z^l‖₂ ≤ Rε.
By the triangle inequality, we have
(1/N) |w^T X z| ≤ (1/N) |w^T X z^l| + (1/N) |w^T X ( z − z^l )|
  ≤ (1/N) |w^T X z^l| + ( ‖w‖₂/√N ) · ( ‖X( z − z^l )‖₂/√N ).
Also, since the variate ‖w‖₂²/σ² has a χ² distribution with N degrees of freedom, we have
‖w‖₂/√N ≤ 2σ with probability at least 1 − c_1 exp(−c_2 N), using standard tail
bounds (see Sect. 3.2). Putting the pieces together, we have that
(1/N) |w^T X z| ≤ (1/N) |w^T X z^l| + 2κRεσ
with high probability. Taking the supremum over z on both sides gives
Z_N ≤ max_{l=1,...,K(s,R,ε)} (1/N) |w^T X z^l| + 2κRεσ.
We need to bound the finite maximum over the covering set. See Sect. 1.10. First
we observe that each variate w^T X z^l / N is zero-mean Gaussian with variance
σ² ‖X z^l‖₂² / N². Under the assumed conditions on z^l and X, this variance is at
most σ²κ²R²/N, so that by standard Gaussian tail bounds, we conclude that
Z_N ≤ σκR √( 2 log K(s,R,ε) / N ) + 2κRεσ
    = σκR ( √( 2 log K(s,R,ε) / N ) + 2ε ).   (3.49)
Moreover,
log K(s,R,ε)/N ≤ ( 2s + s log(n/s) )/N + s log(n/s)/N,
where the last line uses standard bounds on binomial coefficients. Since n/s ≥ 2
by assumption, we conclude that our choice of ε guarantees that log K(s,R,ε) ≤
5 s log(n/s). Substituting these relations into the inequality (3.49), we conclude
that
Z_N ≤ σRκ [ 4 √( s log(n/2s)/N ) + 2 √( s log(n/2s)/N ) ],
as claimed. Since log K(s,R,ε) ≥ s log(n − 2s), this event occurs with probability
at least
1 − c_1 exp( −c_2 min{ N, s log(n − s) } ),
as claimed.
⎛ " " 2 ⎞
/ /
/1 T / k k
P ⎝/ /
/ n X X − Ik×k / > 2 +t + +t ⎠ + ≤ 2 exp −nt2 /2 .
2 n n
(3.50)
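A minimal Monte Carlo sketch of (3.50) (not from the book; n, k, and t are arbitrary choices):

```python
# Minimal sketch: empirical tail of ||(1/n) X^T X - I|| vs. the bound (3.50)
# for an n x k matrix X with i.i.d. N(0,1) entries.
import numpy as np

rng = np.random.default_rng(9)
n, k, t, n_trials = 400, 20, 0.15, 2000

thr = 2 * (np.sqrt(k / n) + t) + (np.sqrt(k / n) + t) ** 2
hits = 0
for _ in range(n_trials):
    X = rng.standard_normal((n, k))
    hits += np.linalg.norm(X.T @ X / n - np.eye(k), 2) > thr
print("empirical tail:         ", hits / n_trials)
print("bound 2 exp(-n t^2 / 2):", 2 * np.exp(-n * t**2 / 2))
```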
We can extend to more general Gaussian ensembles. In particular, for a positive
definite matrix Σ ∈ R^{k×k}, setting Y = X √Σ gives an n × k matrix with i.i.d. rows,
x_i ∼ N(0, Σ). Then
‖ (1/n) Y^T Y − Σ ‖₂ = ‖ √Σ ( (1/n) X^T X − I_{k×k} ) √Σ ‖₂
is upper-bounded by λ_max(Σ) ‖ (1/n) X^T X − I_{k×k} ‖₂. So the claim (3.51) follows
from the basic bound (3.50). Similarly, we have
‖ ( (1/n) Y^T Y )^{−1} − Σ^{−1} ‖₂ = ‖ Σ^{−1/2} ( ( (1/n) X^T X )^{−1} − I_{k×k} ) Σ^{−1/2} ‖₂
  ≤ ( 1/λ_min(Σ) ) ‖ ( (1/n) X^T X )^{−1} − I_{k×k} ‖₂,
so that the claim (3.52) follows from the basic bound (3.50).
Theorem 3.7.2 (Lemma 9 of Negahban and Wainwright [155]). For k ≤ n, let
Y ∈ Rn×k be a random matrix having i.i.d. rows, xi ∼ N (0, Σ).
1. If the covariance matrix Σ has maximum eigenvalue λ_max(Σ) < +∞, then for
all t > 0,
P( ‖ (1/n) Y^T Y − Σ ‖₂ > λ_max(Σ) [ 2( √(k/n) + t ) + ( √(k/n) + t )² ] ) ≤ 2 exp( −nt²/2 ).   (3.51)
2. If the covariance matrix Σ has minimum eigenvalue λ_min(Σ) > 0, then for all
t > 0,
P( ‖ ( (1/n) Y^T Y )^{−1} − Σ^{−1} ‖₂ > ( 1/λ_min(Σ) ) [ 2( √(k/n) + t ) + ( √(k/n) + t )² ] ) ≤ 2 exp( −nt²/2 ).   (3.52)
For t = √(k/n), since k/n ≤ 1, we have
Below, we condition on the event ‖D‖₂ < 8√(k/n).
Let e_j denote the unit vector with 1 in position j, and let z = (z_1, ..., z_k) ∈ R^k be
a fixed vector. Define, for each i = 1, ..., k, the random variable of interest
where u_j is the j-th column of the unitary matrix U. As an example, consider the
variable max_i |V_i|. Since the V_i are identically distributed, it is sufficient to obtain an
exponential tail bound on {V_1 > t}.
Under the conditioned event ‖D‖₂ < 8√(k/n), we have
|V_1| ≤ 8√(k/n) [ |z_1| + u_1^T D ( Σ_{l=2}^k z_l u_l ) ].   (3.54)
= 8√(k/n) √(k−1) ‖z‖_∞ ‖ u_1 − u_1' ‖₂,
where we have used the fact that ‖w‖₂ = √( Σ_{l=2}^k z_l² ), by the orthonormality of the
{u_l} vectors. Since E[F(u_1)] = 0, by concentration of measure for Lipschitz
functions on the sphere [141], for all t > 0, we have
P( |F(u_1)| > t ‖z‖_∞ ) ≤ 2 exp( −c_1 (k−1) n t² / ( 128 k (k−1) ) )
  ≤ 2 exp( −c_1 n t² / (128 k) ).
n = Ω( k log(p − k) ), there are positive constants c_1 and c_2 such that
P( ‖ ( Y^T Y/n − I_{k×k} )^{−1} z ‖_∞ ≥ c_1 ‖z‖_∞ ) ≤ 4 exp( −c_1 min{ k, log(p − k) } ).
Example 3.7.5 (Feature-based detection). In our previous work in the data domain
[156, 157] and the kernel domain [158–160], features of a signal are used for
detection. Under hypothesis H_0, there is only white Gaussian noise, while under H_1, there
is a signal (with some detectable features) in the presence of white Gaussian noise.
For example, we can use the leading eigenvector of the covariance matrix R as the
feature, which is a fixed vector z. The inverse of the sample covariance matrix is
considered, R̂^{−1} = ( Y^T Y/n − I_{k×k} )^{−1}, where Y is as assumed in Theorem 3.7.4.
Consequently, the problem boils down to
R̂^{−1} z = ( Y^T Y/n − I_{k×k} )^{−1} z.
So that
E Z² − |E Z|² = Tr(R)/n − ( E‖√R x‖₂/√n )² ≤ 2‖R‖_op/√n,   (3.56)
for all t > 0. Setting τ = ( t − 2/√n )² in the bound (3.57) gives that
P( (1/n) | ‖√R x‖₂² − Tr(R) | ≥ τ ‖R‖_op ) ≤ 2 exp( −(n/2)( t − 2/√n )² ).
(3.58)
Similarly, considering t² = ‖R‖_op in the bound (3.57) gives that, with probability
greater than 1 − 2 exp(−n/2), we have
(1/n) ‖√R x‖₂² ≤ Tr(R)/n + 3‖R‖_op ≤ 4‖R‖_op.   (3.59)
We take material from [139]. Defining the standard Gaussian random matrix
G = (Gij )1≤i≤n,1≤j≤p ∈ Rn×p , we have the p × p Wishart random matrix
1 T
W= G G − Ip , (3.60)
n
where Ip is the p × p identity matrix. We essentially deal with the “sums of
Gaussian product” random variates. Let Z_1 and Z_2 be independent Gaussian random
variables; we consider the sum Σ_{i=1}^n X_i where X_i are i.i.d. with X_i ∼ Z_1 Z_2, 1 ≤ i ≤ n. The
Let wiT be the i-th row of Wishart matrix W ∈ Rp×p , and giT be the i-th row of
data matrix G ∈ Rn×p . The linear combination of off-diagonal entries of the first
row w1
⟨a, w_1⟩ = Σ_{j=2}^p a_j W_{1j} = (1/n) Σ_{i=1}^n G_{1i} Σ_{j=2}^p G_{ij} a_j.
Note that {ξ_i}_{i=1}^n is a collection of independent standard Gaussian random vari-
ables. Also, {ξ_i}_{i=1}^n are independent of {G_{i1}}_{i=1}^n. Now we have
⟨a, w_1⟩ = ( ‖a‖₂/n ) Σ_{i=1}^n G_{1i} ξ_i,
Combining (3.4) and (3.62), we can bound a linear combination of first-row entries.
Noting that W_{11} = (1/n) Σ_{i=1}^n (G_{1i})² − 1 is a centered χ²_n variable, we have
P( |w_i^T x| > t ) = P( | Σ_{j=1}^p W_{ij} x_j | > t ) ≤ P( |x_1 W_{11}| + | Σ_{j=2}^p W_{ij} x_j | > t )
  ≤ P( |x_1 W_{11}| > t/2 ) + P( | Σ_{j=2}^p W_{ij} x_j | > t/2 )
  ≤ 2 exp( −3nt² / ( 16·4 x_1² ) ) + C exp( −3nt² / ( 2·4 Σ_{j=2}^p x_j² ) )
  ≤ 2 max(2, C) exp( −3nt² / ( 16·4 Σ_{j=1}^p x_j² ) ).
There is nothing special about the "first" row, so we conclude the following. Note that
the inner product is ⟨w_i, x⟩ = w_i^T x = Σ_{j=1}^p W_{ij} x_j, i = 1, ..., p, for a vector x ∈ R^p.
Theorem 3.7.8 (Lemma 15 of [139]). Let w_i^T be the i-th row of the Wishart matrix
W ∈ R^{p×p}, defined as (3.60). For t > 0 small enough, there are (numerical
constants) c > 0 and C > 0 such that for all x ∈ R^p \ {0},
P( |w_i^T x| > t ) ≤ C exp( −c n t² / ‖x‖² ),   i = 1, ..., p.
For a vector a, ‖a‖_p is the ℓ_p norm. For a matrix A ∈ R^{m×n}, the singular values are
ordered decreasingly as σ_1(A) ≥ σ_2(A) ≥ ··· ≥ σ_{min(m,n)}(A). Then we have
σ_max(A) = σ_1(A) and σ_min(A) = σ_{min(m,n)}(A). Let the operator norm of the
matrix A be defined as ‖A‖_op = σ_1(A). The nuclear norm is defined as
‖A‖_* = Σ_{i=1}^{min(m,n)} σ_i(A),
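Both norms can be read off the singular values; a minimal sketch (the matrix is an arbitrary example):

```python
# Minimal sketch: operator norm ||A||_op = sigma_1(A) and nuclear norm
# ||A||_* = sum_i sigma_i(A), computed from the SVD.
import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)
s = np.linalg.svd(A, compute_uv=False)          # singular values, decreasing
print("operator norm:", s[0], "=", np.linalg.norm(A, 2))
print("nuclear norm: ", s.sum(), "=", np.linalg.norm(A, "nuc"))
```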
vec(Xi ) ∼ N (0, Σ) .
where the random matrix X ∈ Rm1 ×m2 is sampled from the Σ-ensemble. For the
special case (white Gaussian random vector) Σ = I, we have ρ2 (Σ) = 1.
Now we are ready to study the concentration of measure for the operator norm.
and moreover
P( ‖X‖_op ≥ E‖X‖_op + t ) ≤ exp( −t² / ( 2ρ²(Σ) ) ).   (3.64)
‖X‖_op = sup_{‖u‖₂=1, ‖v‖₂=1} u^T X v
P( ‖ (1/N) Σ_{i=1}^N ε_i X_i ‖_op ≥ c_0 ν ρ(Σ) ( √(m_1/N) + √(m_2/N) ) ) ≤ c_1 exp( −c_2 (m_1 + m_2) ).
Z = (1/N) Σ_{i=1}^N ε_i X_i.
Since the random matrices {X_i}_{i=1}^N are i.i.d. Gaussian, if the sequence {ε_i}_{i=1}^N is
fixed (by conditioning as needed), then the random matrix Z is a sample from the
Γ-Gaussian ensemble with the covariance matrix Γ = ( ‖ε‖₂²/N² ) Σ. So, if Z̃ ∈ R^{m_1×m_2}
is a random matrix drawn from the ( ‖ε‖₂²/N² ) Σ-ensemble, we have
P( ‖Z‖_op ≥ t ) ≤ P( ‖Z̃‖_op ≥ t ).
and
P( ‖Z̃‖_op ≥ E‖Z̃‖_op + t ) ≤ exp( −c_1 N t² / ( ν² ρ²(Σ) ) )
for a universal constant c_1. Setting t² = Ω( N^{−1} ν² ρ²(Σ) ( √m_1 + √m_2 )² ) gives
the claim.
This following result follows by adapting known concentration results for
random matrices (see [138] for details):
Theorem 3.8.3 (Lemma 2 of Negahban and Wainwright [155]). Let X ∈ Rm×n
be a random matrix with i.i.d. rows sampled from an n-variate N(0, Σ) distribution.
Then for m ≥ 2n, we have
P( σ_min( (1/m) X^T X ) ≥ (1/9) σ_min(Σ),  σ_max( (1/m) X^T X ) ≤ 9 σ_max(Σ) ) ≥ 1 − 4 exp(−n/2).
Consider zero-mean Gaussian random vectors wi defined as wi ∼N 0, ν 2 Im1 ×m1
and random vectors xi defined as xi ∼ N (0, Σ). We define random matrices
X, W as
X = [ x_1^T ; x_2^T ; ··· ; x_n^T ] ∈ R^{n×m_2}   and   W = [ w_1^T ; w_2^T ; ··· ; w_n^T ] ∈ R^{n×m_1}.   (3.65)
The following proof, taken from [155], will illustrate the standard approach:
arguments based on Gordon-Slepian lemma (see also Sects. 3.4 and 3.6) and
Gaussian concentration of measure [27, 141].
Proof. Let S m−1 = {u ∈ Rm : u 2 = 1} denote the Euclidean sphere in m-
dimensional space. The operator norm of interest has the variation representation
(1/n) ‖X^T W‖_op = (1/n) sup_{u∈S^{m_1−1}} sup_{v∈S^{m_2−1}} v^T X^T W u.   (3.66)
Define
Ψ(a, b) = (1/n) sup_{u∈aS^{m_1−1}} sup_{v∈bS^{m_2−1}} v^T X^T W u.
Our goal is to upper bound Ψ(1, 1). Note that Ψ(a, b) = ab Ψ(1, 1) due to the
bilinearity of the right-hand side of (3.66).
Let
A = { u^1, ..., u^A },   B = { v^1, ..., v^B }
denote 1/4-coverings of S^{m_1−1} and S^{m_2−1}, respectively. We now claim that the
upper bound
Ψ(1, 1) ≤ 4 max_{u^a∈A, v^b∈B} ⟨ X v^b, W u^a ⟩   (3.67)
is valid. To establish the claim, note that since the sets A and B are 1/4-covers,
for any pair (u, v) ∈ S^{m_1−1} × S^{m_2−1}, there exists a pair (u^a, v^b) ∈ A × B such
that u = u^a + Δu and v = v^b + Δv, with
max{ ‖Δu‖₂, ‖Δv‖₂ } ≤ 1/4.
| ⟨ X Δv, W u^a ⟩ | ≤ (1/4) Ψ(1, 1),
as well as
| ⟨ X Δv, W Δu ⟩ | ≤ (1/16) Ψ(1, 1).
Substituting these bounds into (3.68) and taking suprema over the left- and right-
hand sides, we conclude that
Ψ(1, 1) ≤ max_{u^a∈A, v^b∈B} ⟨ X v^b, W u^a ⟩ + (9/16) Ψ(1, 1).
P( |Z| ≥ t ) ≤ P( |Z| ≥ t | T ) + P( T^c )
  ≤ exp( −n t² / ( 2ν²( 4 + ‖Σ‖_op ) ) ) + 2 exp(−n/2).
Combining this tail bound with the upper bound (3.69), we obtain
P( |Ψ(1, 1)| ≥ 4δ_n ) ≤ 8^{m_1+m_2} exp( −n t² / ( 2ν²( 4 + ‖Σ‖_op ) ) ) + 2 exp(−n/2).
(3.70)
Setting t² = 20 ν² ‖Σ‖_op (m_1 + m_2)/n, this probability vanishes as long as n >
16 (m_1 + m_2).
Consider the vector random operator ϕ (A) : Rm1 ×m2 → RN with ϕ (A) =
(ϕ1 (A) , . . . , ϕN (A)) ∈ RN . The scalar random operator ϕi (A) is defined by
ϕi (A) = Xi , A , i = 1, . . . , N, (3.71)
N
where the matrices {Xi }i=1 are formed from the Σ-ensemble, i.e., vec(Xi ) ∼
N (0, Σ).
Theorem 3.8.5 (Proposition 1 of Negahban and Wainwright [155]). Consider
the random operator ϕ(A) defined in (3.71). Then, for all A ∈ Rm1 ×m2 , the
random operator ϕ(A) satisfies
" "
1 1//√
/
/ m1 m2
√ ϕ (A) 2 ≥ / Σ vec (A)/ − 12ρ (Σ) + A 1 (3.72)
N 4 2 N N
The proof of Theorem 3.8.5 follows from the use of Gaussian comparison inequal-
ities [27] and concentration of measure [141]. Its proof is similar to the proof of
Theorem 3.8.4 above. We see [155] for details.
We refer to Sect. 1.7 on sub-Gaussian random variables and Sect. 1.9 for the
background on exponential random variables. Given a zero-mean random variable
Y , we refer to
‖Y‖_{ψ1} = sup_{l≥1} (1/l) ( E|Y|^l )^{1/l}
as its sub-exponential parameter. The finiteness of this quantity guarantees existence
of all moments, and hence large-deviation bounds of the Bernstein type.
We say that a random matrix X ∈ R^{n×p} is sub-Gaussian with parameters
(Σ, σ²) if
1. Each row x_i^T ∈ R^p, i = 1, ..., n, is sampled independently from a zero-mean
distribution with covariance Σ, and
2. For any unit vector u ∈ R^p, the random variable u^T x_i is sub-Gaussian with
parameter at most σ.
If we form a random matrix by drawing each row independently from the distribution
N(0, Σ), then the resulting matrix X ∈ R^{n×p} is a sub-Gaussian matrix with
parameters (Σ, ‖Σ‖_op), where ‖A‖_op is the operator norm of matrix A.
By Lemma 1.9.1, if a (scalar-valued) random variable X is zero-mean sub-
Gaussian with parameter σ, then the random variable Y = X² − E[X²] is sub-
exponential with ‖Y‖_{ψ1} ≤ 2σ². It then follows that if X_1, ..., X_N are zero-mean
i.i.d. sub-Gaussian random variables, we have the deviation inequality
P( | (1/N) Σ_{i=1}^N ( X_i² − E X_i² ) | ≥ t ) ≤ 2 exp( −c N min( t²/(4σ⁴), t/(2σ²) ) )
for all t ≥ 0, where c > 0 is a universal constant (see Corollary 1.9.3). This deviation
bound may be used to obtain the following result.
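A minimal Monte Carlo sketch of this deviation inequality (not from the book; the universal constant c below is an illustrative choice, since the statement only asserts that some c > 0 works):

```python
# Minimal sketch: sub-exponential deviation of (1/N) sum_i (X_i^2 - E X_i^2)
# for Gaussian X_i, compared with 2 exp(-c N min(t^2/(4 s^4), t/(2 s^2))).
import numpy as np

rng = np.random.default_rng(10)
N, sigma, t, n_trials, c = 200, 1.0, 0.5, 50000, 1.0 / 8   # c is illustrative only

X = sigma * rng.standard_normal((n_trials, N))
dev = np.abs((X**2).mean(axis=1) - sigma**2)
print("empirical tail:", np.mean(dev >= t))
print("bound:", 2 * np.exp(-c * N * min(t**2 / (4 * sigma**4), t / (2 * sigma**2))))
```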
Theorem 3.9.1 (Lemma 14 of [164]). If X ∈ R^{n×p_1} is a zero-mean sub-Gaussian
matrix with parameters (Σ_x, σ_x²), then for any fixed (unit) vector v ∈ R^{p_1}, we have
P( | ‖Xv‖₂² − E‖Xv‖₂² | ≥ nt ) ≤ 2 exp( −c n min( t²/σ_x⁴, t/σ_x² ) ).   (3.74)
Moreover, if Y ∈ R^{n×p_2} is a zero-mean sub-Gaussian matrix with parameters
(Σ_y, σ_y²), then
P( ‖ (1/n) Y^T X − cov(y_i, x_i) ‖_max ≥ t ) ≤ 6 p_1 p_2 exp( −c n min( t²/(σ_x σ_y)², t/(σ_x σ_y) ) ).
(3.75)
In particular,
P( ‖ (1/n) Y^T X − cov(y_i, x_i) ‖_max ≥ c_0 σ_x σ_y √( log p / n ) ) ≤ c_1 exp( −c_2 log p ).
(3.76)
The ℓ_l balls are defined as
B_l(s) = { v ∈ R^p : ‖v‖_l ≤ s },   l = 0, 1, 2,
where the ℓ_0 norm ‖x‖_0 stands for the number of non-zero entries of the vector x. The
sparse set is
K(s) = B_0(s) ∩ B_2(1) = { v ∈ R^p : ‖v‖₂ ≤ 1, ‖v‖₀ ≤ s },
and the cone set is
C(s) = { v ∈ R^p : ‖v‖₁ ≤ √s ‖v‖₂ }.
Then
v^T Γ v ≤ 27 ( ‖v‖₂² + (1/s) ‖v‖₁² )   ∀ v ∈ R^p.
We consider the dependent data. The rows of X are drawn from a stationary vector
autoregressive (AR) process [155] according to
Σx = AΣx AT + Σv .
P( (1/n) | ‖y‖₂² − E‖y‖₂² | ≥ 4tσ² ) ≤ 2 exp( −(n/2)( t − 2/√n )² ) + 2 exp(−n/2).
Z = X + W,
Γ̂_add = (1/n) Z^T Z − Σ_w.
Then
P( | v^T ( Γ̂ − Σ_x ) v | ≥ 4tς² ) ≤ 2 exp( −(n/2)( t − 2/√n )² ) + 2 exp(−n/2),
where
ς² = ‖Σ_w‖_op + 2‖Σ_x‖_op / ( 1 − ‖A‖_op )                       (additive noise case),
ς² = ( 1/(1 − ρ_max)² ) · 2‖Σ_x‖_op / ( 1 − ‖A‖_op )               (missing data case).
We draw material from [149] in this section. Let μ be the standard Gaussian measure
on R^n with density (2π)^{−n/2} e^{−|x|²/2} with respect to Lebesgue measure. Here |x|
is the usual Euclidean norm of the vector x. One basic concentration property [141]
indicates that for every Lipschitz function F : R^n → R with Lipschitz constant
‖F‖_L ≤ 1, and every t ≥ 0, we have
μ( | F − ∫ F dμ | ≥ t ) ≤ 2 e^{−t²/2}.   (3.78)
The same holds for a median of F instead of the mean. One fundamental property
of (3.78) is its independence of dimension n of the underlying state space. Later on,
we find that (3.78) holds for non-Gaussian classes of random variables. Eigenvalues
are the matrix functions of interest.
Let us illustrate the approach by studying concentration for largest eigenvalues.
For example, consider the Gaussian unitary ensemble (GUE) X: For each integer
n ≥ 1, X = (Xij )1≤i,j≤n is an n × n Hermitian centered Gaussian random matrix
with variance σ 2 . Equivalently, the random matrix X is distributed according to the
probability distribution
P(dX) = (1/Z) exp( −Tr X² / (2σ²) ) dX
on the space H_n ≅ R^{n²} of n × n Hermitian matrices, where
dX = Π_{1≤i≤n} dX_ii Π_{1≤i<j≤n} d Re(X_ij) d Im(X_ij)
λ_max(X) = sup_{|u|=1} u X u^H,   (3.79)
where, for fixed u, the map X → u X u^H is linear in X; the expression is a quadratic
form in u. Later, in Chap. 4, we study the concentration of quadratic forms. λ_max(X)
is easily seen (see Chap. 4) to be a 1-Lipschitz map of the n² independent real and
imaginary entries
$$X_{ii},\ 1 \le i \le n, \qquad \mathrm{Re}(X_{ij})/\sqrt{2},\ \mathrm{Im}(X_{ij})/\sqrt{2},\ 1 \le i < j \le n,$$
of the matrix $X$. Using Theorem 3.3.10 together with the scaling of the variance $\sigma^2 = \frac{1}{4n}$, we get the following concentration inequality on $\lambda_{\max}(X)$.
Let us estimate $\mathbb{E}\lambda_{\max}(X)$. We emphasize the approach used here. Consider the real-valued Gaussian process
$$G_u = uXu^H = \sum_{i,j=1}^{n}X_{ij}u_i\bar{u}_j, \qquad |u| = 1,$$
with
$$\mathbb{E}|G_u - G_v|^2 = \sigma^2\sum_{i,j=1}^{n}\left|u_i\bar{u}_j - v_i\bar{v}_j\right|^2.$$
When $\sigma^2 = \frac{1}{4n}$, we thus have that
$$\mathbb{E}\lambda_{\max}(X) \le \sqrt{2}. \qquad (3.80)$$
Equation (3.80) extends to the class of sub-Gaussian distributions, including random matrices with symmetric Bernoulli entries. Combining (3.80) with Theorem 3.10.1, we have that for every $t \ge 0$,
$$\mathbb{P}\left(\lambda_{\max}(X) \ge \sqrt{2} + t\right) \le 2e^{-2nt^2}.$$
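A minimal simulation of this concentration phenomenon (our own sketch; one common GUE construction, with $n$ and the trial count chosen arbitrarily) generates Hermitian Gaussian matrices with entry variance $\sigma^2 = 1/(4n)$ and looks at the mean and the spread of $\lambda_{\max}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 500
sigma2 = 1.0 / (4 * n)                     # entry variance used in the text

lmax = np.empty(trials)
for t in range(trials):
    A = rng.normal(scale=np.sqrt(sigma2 / 2), size=(n, n)) \
        + 1j * rng.normal(scale=np.sqrt(sigma2 / 2), size=(n, n))
    H = (A + A.conj().T) / np.sqrt(2)      # Hermitian; off-diagonal E|H_ij|^2 = sigma2
    lmax[t] = np.linalg.eigvalsh(H).max()

print("E lambda_max ~", lmax.mean())       # bounded, O(1)
print("std          ~", lmax.std())        # small: lambda_max concentrates around its mean
```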
Based on the supremum representation (3.79) of the largest eigenvalue, we can use
another chaining approach [27, 81]. The supremum of Gaussian or more general
process (Zt )t∈T is considered. We study the random variable sup Zt as a function
t∈T
of the set T and its expectation E sup Zt . Then we can study the probability
t∈T
P sup Zt ≥ r , r ≥ 0.
t∈T
For real symmetric matrices $X$, we have
$$Z_u = uXu^T = \sum_{i,j=1}^{n}X_{ij}u_iu_j, \qquad |u| = 1.$$
A probability measure $\mu$ on $\mathbb{R}^n$ is said to satisfy a logarithmic Sobolev inequality with constant $C > 0$ if
$$\int_{\mathbb{R}^n}f^2\log f^2\,d\mu \le 2C\int_{\mathbb{R}^n}|\nabla f|^2\,d\mu \qquad (3.82)$$
for every smooth enough function $f: \mathbb{R}^n \to \mathbb{R}$ such that $\int f^2\,d\mu = 1$.
The prototype example is the standard Gaussian measure on $\mathbb{R}^n$, which satisfies (3.82) with $C = 1$. Another example consists of probability measures on $\mathbb{R}^n$ of the type $d\mu(x) = e^{-U(x)}\,dx$ with a uniformly convex potential $U$. If probability measures $\mu_1, \dots, \mu_n$ on $\mathbb{R}$ satisfy (3.82) with the same constant $C$, then the product measure $\mu_1 \otimes \cdots \otimes \mu_n$ also satisfies it (on $\mathbb{R}^n$) with the same constant.
By the so-called Herbst argument, we can apply the logarithmic Sobolev
inequality (3.82) to study concentration of measure. If μ satisfies (3.82), then for
any 1-Lipschitz function F : Rn → R and any t ∈ R,
$$\int e^{tF}\,d\mu \le e^{t\int F\,d\mu + Ct^2/2},$$
so that the dimension-free concentration property (3.81) holds. We refer to [141] for
more details. Related Poincare inequalities may be also considered similarly in this
context.
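To make the last step explicit, here is the standard closing of the Herbst argument (our addition, not a quotation of [141]): Markov's inequality applied to the exponential-moment bound, followed by optimization in $t$, with $C$ the log-Sobolev constant from (3.82).

$$\mu\left(F - \int F\,d\mu \ge r\right) \le e^{-tr}\int e^{t\left(F - \int F\,d\mu\right)}d\mu \le e^{-tr + Ct^2/2} \;\le\; e^{-r^2/(2C)} \quad \text{(choosing } t = r/C\text{)},$$

and the same bound applied to $-F$ gives the two-sided, dimension-free concentration of the form (3.81).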
The goal here is to apply the concentration of measure. For a random vector x ∈ Rd ,
we study its projections to subspaces. The central problem here is to show that
for most subspaces, the resulting distributions are about the same, approximately
Gaussian, and to determine how large the dimension k of the subspace may be,
relative to d, for this phenomenon to persist.
The Euclidean length of a vector $x \in \mathbb{R}^d$ is defined by $\|x\| = \sqrt{x_1^2 + \cdots + x_d^2}$. The Stiefel manifold$^4$ $\mathcal{Z}_{d,k} \subset \mathbb{R}^{k\times d}$ is defined by
$$\mathcal{Z}_{d,k} = \left\{Z = (z_1, \dots, z_k) : z_i \in \mathbb{R}^d,\ \langle z_i, z_j\rangle = \delta_{ij}\ \forall\, 1 \le i, j \le k\right\},$$
with the metric $\rho(Z, Z')$ between a pair of matrices $Z$ and $Z'$—two points in the manifold $\mathcal{Z}_{d,k}$—defined by
$$\rho(Z, Z') = \left(\sum_{i=1}^{k}\|z_i - z_i'\|^2\right)^{1/2}.$$
4A manifold of dimension n is a topological space that near each point resembles n-dimensional
Euclidean space. More precisely, each point of an n-dimensional manifold has a neighborhood that
is homeomorphic to the Euclidean space of dimension n.
A modulus of continuity of a function $f$ is a function $\omega$ such that $|f(x) - f(y)| \le \omega(\|x - y\|)$ for all $x$ and $y$ in the domain of $f$. Since moduli of continuity are required to be
infinitesimal at 0, a function turns out to be uniformly continuous if and only if
it admits a modulus of continuity. Moreover, relevance to the notion is given by
the fact that sets of functions sharing the same modulus of continuity are exactly
equicontinuous families. For instance, the modulus ω(t) = Lt describes the L-
Lipschitz functions, the moduli ω(t) = Ltα describe the Hölder continuity, the
modulus ω(t) = Lt(|log(t)| + 1) describes the almost Lipschitz class, and so on.
Theorem 3.11.1 (Milman and Schechtman [167]). For any $F: \mathcal{Z}_{n,k} \to \mathbb{R}$ with median $M_F$ and modulus of continuity $\omega_F(t)$, $t > 0$,
$$\mathbb{P}\left(|F(z_1, \dots, z_k) - M_F(z_1, \dots, z_k)| > \omega_F(t)\right) < \sqrt{\frac{\pi}{2}}\,e^{-nt^2/8}, \qquad (3.83)$$
where P is the rotation-invariant probability measure on the span of z1 , . . . , zk .
Let $x$ be a random vector in $\mathbb{R}^d$ and let $Z \in \mathcal{Z}_{d,k}$. Let
$$x_Z = (\langle x, z_1\rangle, \dots, \langle x, z_k\rangle) \in \mathbb{R}^k;$$
that is, $x_Z$ is the projection of the vector $x$ onto the span of $Z$, a projection from dimension $d$ to dimension $k$.
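The following sketch (ours; the dimensions and the Rademacher choice for $x$ are arbitrary illustrative assumptions) projects a non-Gaussian random vector in $\mathbb{R}^d$ onto the span of a random $k$-frame obtained from a QR decomposition, and compares the first moments of the projected coordinates with those of a standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, trials = 2000, 3, 4000

proj = np.empty((trials, k))
for t in range(trials):
    x = rng.choice([-1.0, 1.0], size=d)           # non-Gaussian, E||x||^2 = d (sigma = 1)
    Z, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal k-frame: a point of Z_{d,k}
    proj[t] = Z.T @ x                             # x_Z = (<x,z_1>, ..., <x,z_k>)

print("coordinate means:", proj.mean(axis=0))     # ~ 0
print("coordinate stds :", proj.std(axis=0))      # ~ 1 = sigma
print("kurtosis (~3 for Gaussian):",
      ((proj - proj.mean(0)) ** 4).mean(0) / proj.var(0) ** 2)
```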
The bounded-Lipschitz distance between two random vectors $x$ and $y$ is defined by
$$d_{BL}(x, y) = \sup_{\|f\|_1 \le 1}\left|\mathbb{E}f(x) - \mathbb{E}f(y)\right|,$$
where
$$\|f\|_1 = \max\{\|f\|_\infty, \|f\|_L\},$$
with the Lipschitz constant of $f$ defined by $\|f\|_L = \sup_{x\neq y}\dfrac{|f(x) - f(y)|}{\|x - y\|}$.
Theorem 3.11.2 (Meckes [165]). Let $x$ be a random vector in $\mathbb{R}^d$ with $\mathbb{E}x = 0$, $\mathbb{E}\|x\|^2 = \sigma^2 d$, and let $\alpha = \mathbb{E}\left|\|x\|^2/\sigma^2 - d\right|$. If $Z$ is a random point of $\mathcal{Z}_{d,k}$,
Theorem 3.11.3 (Meckes [165]). Suppose that $\beta$ is defined by $\beta = \sup_{y\in\mathbb{S}^{d-1}}\mathbb{E}\langle x, y\rangle^2$. For $z_i \in \mathbb{R}^d$, $i = 1, \dots, k$, and $Z = (z_1, \dots, z_k) \in \mathcal{Z}_{d,k}$, let
$$d_{BL}(x_Z, \sigma w) = \sup_{\|f\|_1 \le 1}\left|\mathbb{E}_x f(x_Z) - \mathbb{E}f(\sigma w)\right|.$$
That is, $d_{BL}(x_Z, \sigma w)$ is the conditional bounded-Lipschitz distance from the random point $x_Z$ (a random vector in $\mathbb{R}^k$) to the standard Gaussian random vector $\sigma w$, conditioned on the matrix $Z$. Then, for $t > \sqrt{2\pi\beta^2/d}$ and $Z$ a random point of the manifold (a random matrix in $\mathbb{R}^{k\times d}$) $\mathcal{Z}_{d,k}$,
$$\mathbb{P}\left(|d_{BL}(x_Z, \sigma w) - \mathbb{E}d_{BL}(x_Z, \sigma w)| > t\right) \le \sqrt{\frac{\pi}{2}}\,e^{-dt^2/32\beta}.$$
Theorem 3.11.4 (Meckes [168]). With the notation as in the previous theorems, we have
$$\mathbb{E}d_{BL}(x_Z, \sigma w) \le C\left[\frac{(k\beta + \beta\log d)\,\beta^{2/(9k+12)}}{k^{2/3}\beta^{2/3}d^{2/(3k+4)}} + \frac{\sigma\sqrt{k}\,(\alpha+1) + k}{d-1}\right].$$
In particular, under the additional assumptions that $\alpha \le C_0\sqrt{d}$ and $\beta = 1$, we have
$$\mathbb{E}d_{BL}(x_Z, \sigma w) \le C\,\frac{k + \log d}{k^{2/3}d^{2/(3k+4)}},$$
where $\mathbb{E}_x$ denotes the expectation with respect to the distribution of the random vector $x$ only; that is,
$$\mathbb{E}_x f(x_Z) = \mathbb{E}\left[f(x_Z)\,|\,Z\right].$$
The goal here is to apply the concentration of measure. We use the standard
method here. We need to find the Lipschitz constant first. For a pair of random
points $Z$ and $Z'$ of the manifold, onto whose spans the same random vector $x$ is projected, we observe that, for $f$ with $\|f\|_1 \le 1$ given,
$$\left|\mathbb{E}_x f(x_Z) - \mathbb{E}_x f(x_{Z'})\right| \le \mathbb{E}\left\|x_Z - x_{Z'}\right\| \le \left(\sum_{i=1}^{k}\mathbb{E}\langle x, z_i - z_i'\rangle^2\right)^{1/2} \le \rho(Z, Z')\sqrt{\beta}.$$
It follows that $F(Z) = d_{BL}(x_Z, \sigma w)$ is a $\sqrt{\beta}$-Lipschitz function of $Z$. So as long as $t > \sqrt{2\pi\beta^2/d}$, replacing the median of $F$ with its mean only changes the constants:
$$\mathbb{P}(|F(Z) - \mathbb{E}F(Z)| > t) \le \mathbb{P}(|F(Z) - M_F(Z)| > t - |M_F(Z) - \mathbb{E}F(Z)|) \le \mathbb{P}(|F(Z) - M_F(Z)| > t/2) \le \sqrt{\frac{\pi}{2}}\,e^{-dt^2/32\beta}.$$
The standard reference for concentration of measure is [141] and [27]. We only
provide necessary results that are needed for the later chapters. Although some
results are recent, no attempt has been made to survey the latest results in the
literature.
We follow closely [169, 170] and [62, 171], whose presentations are highly accessible. The entropy method was introduced by Ledoux [172] and further refined by Massart [173] and Rio [174]. Many applications are considered in [62, 171, 175–178].
Chapter 4
Concentration of Eigenvalues
and Their Functionals
Chapters 4 and 5 are the core of this book. Talagrand's concentration inequality is a very powerful tool in probability theory. Lipschitz functions are the central mathematical objects. Eigenvalues and their functionals may be shown to be Lipschitz functions, so Talagrand's framework suffices. Concentration inequalities for many complicated random variables are also surveyed here from the latest publications. As a whole, we bring together concentration results that are motivated by future engineering applications.
Eigenvalues and norms are the bread and butter when we deal with random matrices. The supremum of a stochastic process [27, 82] has become a basic tool. The aim of this section is to make connections between the two topics: we can represent eigenvalues and norms in terms of the supremum of a stochastic process.
The standard reference for our matrix analysis is Bhatia [23]. The inner product of two finite-dimensional vectors in a Hilbert space $\mathcal{H}$ is denoted by $\langle u, v\rangle$. The norm of a vector is denoted by $\|u\| = \langle u, u\rangle^{1/2}$. A matrix $A$ is self-adjoint or Hermitian if $A^* = A$, skew-Hermitian if $A^* = -A$, and unitary if $A^*A = I = AA^*$. The operator norm is
$$\|A\| = \sup_{\|x\|=1}\|Ax\|.$$
Equivalently,
$$\|A\| = \sup_{\|x\|=\|y\|=1}|\langle y, Ax\rangle|,$$
and, for a Hermitian matrix $A$,
$$\|A\| = \sup_{\|x\|=1}|\langle x, Ax\rangle|.$$
In terms of singular values,
$$\|A\| = \sigma_1(A) = \left\|A^*A\right\|^{1/2}.$$
The Frobenius norm is
$$\|A\|_F = \left(\sum_{i=1}^{n}\sigma_i^2(A)\right)^{1/2} = \left(\mathrm{Tr}\,A^*A\right)^{1/2}.$$
i=1
This makes the norm useful in calculations with matrices. It is called the Frobenius norm, Schatten 2-norm, Hilbert–Schmidt norm, or Euclidean norm.
Both $\|A\|$ and $\|A\|_F$ have an important invariance property called unitary invariance: we have $\|UAV\| = \|A\|$ and $\|UAV\|_F = \|A\|_F$ for all unitary $U, V$. Any two norms on a finite-dimensional space are equivalent. For the norms $\|A\|$ and $\|A\|_F$, it follows from the properties above that
$$\|A\| \le \|A\|_F \le \sqrt{n}\,\|A\| \qquad (4.1)$$
for every A. Equation (4.1) is the central result we want to revisit here.
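A two-line numerical check of (4.1) (our own sketch; the matrix size and distribution are arbitrary) compares the operator and Frobenius norms of a random matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))

op = np.linalg.norm(A, 2)        # operator norm (largest singular value)
fro = np.linalg.norm(A, 'fro')   # Frobenius norm

# The two inequalities of (4.1): ||A|| <= ||A||_F <= sqrt(n) ||A||.
print(op <= fro + 1e-10, fro <= np.sqrt(n) * op + 1e-10)
print(op, fro, np.sqrt(n) * op)
```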
Exercise 4.1.1 (Neumann series). If $\|A\| < 1$, then $I - A$ is invertible and
$$(I - A)^{-1} = I + A + A^2 + \cdots + A^k + \cdots.$$
For every matrix $A$, the series
$$\exp A = e^{A} = I + A + \frac{1}{2!}A^2 + \cdots + \frac{1}{k!}A^k + \cdots$$
converges. The matrix $\exp A$ is always invertible, with
$$\left(e^{A}\right)^{-1} = e^{-A}.$$
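Both facts are easy to verify numerically; the following sketch (our addition; the sizes and the rescaling of $A$ are arbitrary choices) sums the Neumann series and checks the inverse of the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
n = 6
A = rng.normal(size=(n, n))
A *= 0.5 / np.linalg.norm(A, 2)          # enforce ||A|| < 1 for the Neumann series

# Partial sums of I + A + A^2 + ... approximate (I - A)^{-1}.
S, term = np.eye(n), np.eye(n)
for _ in range(60):
    term = term @ A
    S += term
print("Neumann series error:", np.linalg.norm(S - np.linalg.inv(np.eye(n) - A)))

# exp(A) is always invertible, with inverse exp(-A).
print("exp(A) exp(-A) = I ?", np.allclose(expm(A) @ expm(-A), np.eye(n)))
```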
The numerical radius is defined as
$$w(A) = \sup_{\|x\|=1}|\langle x, Ax\rangle|.$$
We note that $\mathrm{spr}(A) \le w(A) \le \|A\|$, where $\mathrm{spr}(A)$ is the spectral radius. The three quantities are equal if (but not only if) the matrix is normal.
Let $A$ be Hermitian with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. We have
$$\lambda_1 = \max\{\langle x, Ax\rangle : \|x\| = 1\}, \qquad \lambda_n = \min\{\langle x, Ax\rangle : \|x\| = 1\}. \qquad (4.2)$$
More generally,
$$\sum_{i=1}^{k}\lambda_i(A) = \max\sum_{i=1}^{k}\langle x_i, Ax_i\rangle, \qquad \sum_{i=n-k+1}^{n}\lambda_i(A) = \min\sum_{i=1}^{k}\langle x_i, Ax_i\rangle,$$
where the maximum and the minimum are taken over all choices of orthonormal $k$-tuples $(x_1, \dots, x_k)$ in $\mathcal{H}$. The first statement is referred to as the Ky Fan maximum principle.
Equivalently,
$$\sum_{i=n-k+1}^{n}\lambda_i(A) = \min\sum_{i=1}^{k}\langle x_i, Ax_i\rangle,$$
where the minimum is taken over all choices of orthonormal $k$-tuples $(x_1, \dots, x_k)$ in $\mathcal{H}$.
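A quick numerical illustration of the Ky Fan maximum principle (ours; the matrix size, $k$, and number of random frames are arbitrary): random orthonormal $k$-frames never exceed the sum of the top $k$ eigenvalues, which is attained by the leading eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 8, 3
B = rng.normal(size=(n, n))
A = (B + B.T) / 2                                 # real symmetric test matrix

top_k = np.sort(np.linalg.eigvalsh(A))[::-1][:k].sum()

best = -np.inf
for _ in range(5000):
    X, _ = np.linalg.qr(rng.normal(size=(n, k)))  # random orthonormal k-tuple
    best = max(best, np.trace(X.T @ A @ X))       # sum_i <x_i, A x_i>

print("sum of top-k eigenvalues:", top_k)
print("best random frame value :", best, "(never exceeds the top-k sum)")
```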
The following lemma is from [152], but we follow the exposition of [69]. Let $G$ denote the Gaussian distribution on $\mathbb{R}^n$ with density
$$\frac{dG(x)}{dx} = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\|x\|^2}{2\sigma^2}\right),$$
where $\|x\|^2 = x_1^2 + \cdots + x_n^2$ is the squared Euclidean norm of $x$. Furthermore, suppose $F: \mathbb{R}^n \to \mathbb{R}$ is a $K$-Lipschitz function, that is,
$$|F(x) - F(y)| \le K\|x - y\|, \qquad x, y \in \mathbb{R}^n,$$
for some positive Lipschitz constant $K$. Then for any positive number $t$, we have
$$G\left(\{x \in \mathbb{R}^n : |F(x) - \mathbb{E}(F(x))| > t\}\right) \le 2\exp\left(-\frac{ct^2}{K^2\sigma^2}\right),$$
where $\mathbb{E}(F(x)) = \int_{\mathbb{R}^n}F(x)\,dG(x)$ and $c = 2/\pi^2$.
The case $\sigma = 1$ is proven in [152]. The general case follows from the mapping $x \mapsto \sigma x: \mathbb{R}^n \to \mathbb{R}^n$, under which the composed function $x \mapsto F(\sigma x)$ satisfies a Lipschitz condition with constant $K\sigma$.
Now let us consider the Hilbert–Schmidt norm (also called the Frobenius norm or Euclidean norm) $\|\cdot\|_F$ under a Lipschitz functional mapping. Let $f: \mathbb{R} \to \mathbb{R}$ be a function that satisfies the Lipschitz condition
$$|f(s) - f(t)| \le K|s - t|, \qquad s, t \in \mathbb{R}.$$
Then for any $n \in \mathbb{N}$ and all complex Hermitian matrices $A, B \in \mathbb{C}^{n\times n}$, we have that
$$\|f(A) - f(B)\|_F \le K\|A - B\|_F,$$
where, for Hermitian $C$,
$$\|C\|_F = \left(\mathrm{Tr}\left(C^*C\right)\right)^{1/2} = \left(\mathrm{Tr}\,C^2\right)^{1/2}.$$
The following lemma (Lemma 4.3.1) is at the heart of the results. First we recall that
$$\mathrm{Tr}(f(A)) = \sum_{i=1}^{n}f(\lambda_i(A)), \qquad (4.3)$$
where
$$f(A) = Uf(D)U^*,$$
$f(D)$ is the diagonal matrix with entries $f(\lambda_1), \dots, f(\lambda_n)$, and $U^*$ denotes the conjugate transpose of $U$.
In particular,
$$\sum_{i=1}^{n}|\lambda_i(A) - \lambda_i(B)|^2 \le \|A - B\|_F^2 \le \sum_{i=1}^{n}|\lambda_i(A) - \lambda_{n-i+1}(B)|^2. \qquad (4.4)$$
Thus the map $A \mapsto (\lambda_1(A), \dots, \lambda_n(A))$ is Lipschitz with constant one, following (4.4). With the aid of Lidskii's theorem [18, p. 657] (Theorem 4.3.2 here), we have
$$\left|\sum_{i=1}^{n}f(\lambda_i(A)) - \sum_{i=1}^{n}f(\lambda_i(B))\right| \le |f|_L\sum_{i=1}^{n}|\lambda_i(A) - \lambda_i(B)| \le \sqrt{n}\,|f|_L\left(\sum_{i=1}^{n}|\lambda_i(A) - \lambda_i(B)|^2\right)^{1/2} \le \sqrt{n}\,|f|_L\,\|A - B\|_F. \qquad (4.5)$$
We use the definition of a Lipschitz function in the first inequality. The second step follows from the Cauchy–Schwarz inequality [181, p. 31]: for arbitrary real numbers $a_i, b_i \in \mathbb{R}$,
$$|a_1b_1 + a_2b_2 + \cdots + a_nb_n|^2 \le \left(a_1^2 + a_2^2 + \cdots + a_n^2\right)\left(b_1^2 + b_2^2 + \cdots + b_n^2\right). \qquad (4.6)$$
Thus the map $A \mapsto \sum_{k=1}^{n}f(\lambda_k(A))$ is Lipschitz with a constant bounded above by $\sqrt{n}\,|f|_L$. Observe that using $\sum_{k=1}^{n}f(\lambda_k(A))$ rather than $\lambda_k(A)$ increases the Lipschitz constant from $1$ to $\sqrt{n}\,|f|_L$. This observation is useful later in Sect. 4.5 when the tail bound is considered.
Lemma 4.3.3 ([182]). For a given $n_1 \times n_2$ matrix $X$, let $\sigma_i(X)$ be the $i$-th largest singular value. Let $f(X)$ be a function on matrices of the form $f(X) = \sum_{i=1}^{m}a_i\sigma_i(X)$ for some real constants $\{a_i\}_{i=1}^{m}$. Then $f(X)$ is a Lipschitz function with constant $\sqrt{\sum_{i=1}^{m}a_i^2}$.
Let us consider the special case $f(t) = t^k$. The power $(A + \varepsilon B)^k$ is expanded as
$$\mathrm{Tr}\,(A + \varepsilon B)^k = \mathrm{Tr}\,A^k + \varepsilon k\,\mathrm{Tr}\left(A^{k-1}B\right) + O\left(\varepsilon^2\right).$$
Recall that $\sum_{i=1}^{n}f(\lambda_i(A))$ is equal to $\mathrm{Tr}(f(A))$. The largest eigenvalue $\lambda_1(X)$ is convex and the smallest eigenvalue $\lambda_n(X)$ is concave in $X$.
Let us consider the sums of the $k$ largest and $k$ smallest eigenvalues: for $k = 1, 2, \dots, n$,
$$F_k(A) = \sum_{i=1}^{k}\lambda_i(A), \qquad G_k(A) = \sum_{i=1}^{k}\lambda_{n-i+1}(A) = \mathrm{Tr}(A) - F_{n-k}(A). \qquad (4.8)$$
The trace function of (4.3) is the special case of these two functions with $k = n$. Now we follow [183] to derive the Lipschitz constant of functions defined in (4.8), which turns out to be $C|f|_L\sqrt{k}$ with $C \le 1$. Recall the Euclidean or Frobenius norm is defined as
$$\|X\|_F = \left(\sum_{i,j=1}^{n}|X_{ij}|^2\right)^{1/2} = \sqrt{2}\left(\sum_{i=1}^{n}\frac{X_{ii}^2}{2} + \sum_{1\le i<j\le n}|X_{ij}|^2\right)^{1/2}.$$
For Hermitian $X$,
$$\|X\|_F = \left(\sum_{i=1}^{n}\lambda_i^2(X)\right)^{1/2},$$
which implies from (4.1) that $\lambda_i(X)$ is a 1-Lipschitz function of $X$ with respect to $\|X\|_F$.
$F_k$ is positively homogeneous (of degree 1) and $F_k(-A) = -G_k(A)$. From this we have that
$$|F_k(A) - F_k(B)| \le \max\{F_k(A - B),\ -G_k(A - B)\} \le \sqrt{k}\,\|A - B\|_F,$$
$$|G_k(A) - G_k(B)| \le \max\{G_k(A - B),\ -F_k(A - B)\} \le \sqrt{k}\,\|A - B\|_F. \qquad (4.9)$$
In other words, the functions $F_k(A)$ and $G_k(A)$ are Lipschitz continuous with Lipschitz constant $\sqrt{k}$. For the trace function, we have $k = n$. Moreover, $F_k(A)$ is convex and $G_k(A)$ is concave. This follows from Ky Fan's maximum principle in (4.2) or Davis' characterization [184] of all convex unitarily invariant functions of a self-adjoint matrix.
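A numerical spot-check of the $\sqrt{k}$-Lipschitz bound in (4.9) (our sketch; dimensions, $k$, and trial count are arbitrary) over random symmetric pairs:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, trials = 10, 4, 2000

def F_k(M, k):
    """Sum of the k largest eigenvalues of a symmetric matrix."""
    return np.sort(np.linalg.eigvalsh(M))[::-1][:k].sum()

worst = 0.0
for _ in range(trials):
    A = rng.normal(size=(n, n)); A = (A + A.T) / 2
    B = rng.normal(size=(n, n)); B = (B + B.T) / 2
    ratio = abs(F_k(A, k) - F_k(B, k)) / np.linalg.norm(A - B, 'fro')
    worst = max(worst, ratio)

print("largest observed ratio:", worst, "<= sqrt(k) =", np.sqrt(k))
```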
Let us give our version of the proof of (4.9); there are no details about this proof in [183]. When $A \ge B$, implying that $\lambda_i(A) \ge \lambda_i(B)$, we have
$$|F_k(A) - F_k(B)| = \left|\sum_{i=1}^{k}\lambda_i(A) - \sum_{i=1}^{k}\lambda_i(B)\right| = \sum_{i=1}^{k}(\lambda_i(A) - \lambda_i(B)) = \sum_{i=1}^{k}|\lambda_i(A) - \lambda_i(B)| \le \sum_{i=1}^{k}\lambda_i(A - B) = F_k(A - B).$$
When $A \le B$, we similarly have
$$|F_k(A) - F_k(B)| = \left|\sum_{i=1}^{k}\lambda_i(A) - \sum_{i=1}^{k}\lambda_i(B)\right| = \sum_{i=1}^{k}(\lambda_i(B) - \lambda_i(A)) = \sum_{i=1}^{k}|\lambda_i(A) - \lambda_i(B)| \le \sum_{i=1}^{k}\lambda_i(B - A) = F_k(B - A) = -G_k(A - B).$$
More generally, consider
$$\varphi_k(A) = \sum_{i=1}^{k}f(\lambda_i(A)). \qquad (4.11)$$
We have
$$\left|\sum_{i=1}^{k}f(\lambda_i(A)) - \sum_{i=1}^{k}f(\lambda_i(B))\right| = \left|\sum_{i=1}^{k}\left[f(\lambda_i(A)) - f(\lambda_i(B))\right]\right| \le \sum_{i=1}^{k}|f(\lambda_i(A)) - f(\lambda_i(B))| \le |f|_L\sum_{i=1}^{k}|\lambda_i(A) - \lambda_i(B)|$$
$$\le |f|_L\sqrt{k}\left(\sum_{i=1}^{k}|\lambda_i(A) - \lambda_i(B)|^2\right)^{1/2} = C|f|_L\sqrt{k}\left(\sum_{i=1}^{n}|\lambda_i(A) - \lambda_i(B)|^2\right)^{1/2} \le C|f|_L\sqrt{k}\,\|A - B\|_F,$$
where
$$C = \frac{\left(\sum_{i=1}^{k}|\lambda_i(A) - \lambda_i(B)|^2\right)^{1/2}}{\left(\sum_{i=1}^{n}|\lambda_i(A) - \lambda_i(B)|^2\right)^{1/2}} \le 1.$$
The third line follows from the triangle inequality for complex numbers [181, p. 30]: for $n$ complex numbers $z_i$,
$$\left|\sum_{i=1}^{n}z_i\right| = |z_1 + z_2 + \cdots + z_n| \le |z_1| + |z_2| + \cdots + |z_n| = \sum_{i=1}^{n}|z_i|. \qquad (4.12)$$
In particular, we set $z_i = \lambda_i(A) - \lambda_i(B)$. The fourth line follows from the definition of a Lipschitz function: for $f: \mathbb{R} \to \mathbb{R}$, $|f(s) - f(t)| \le |f|_L|s - t|$. The fifth line follows from the Cauchy–Schwarz inequality (4.6) by identifying $a_i = |\lambda_i(A) - \lambda_i(B)|$, $b_i = 1$. The final line follows from Lidskii's theorem (4.4).
Example 4.3.4 (Standard hypothesis testing problem revisited: Moments as a func-
tion of SNR). Our standard hypothesis testing problem is expressed as
$$H_0: y = n, \qquad H_1: y = \sqrt{SNR}\,x + n \qquad (4.13)$$
where x is the signal vector in Cn and n the noise vector in Cn . We assume that x
is independent of n. SNR is the dimensionless real number representing the signal
to noise ratio. It is assumed that N independent realizations of these vector valued
random variables are observed.
Representing
$$Y = \begin{bmatrix}y_1^T\\ \vdots\\ y_N^T\end{bmatrix}, \qquad X = \begin{bmatrix}x_1^T\\ \vdots\\ x_N^T\end{bmatrix}, \qquad N = \begin{bmatrix}n_1^T\\ \vdots\\ n_N^T\end{bmatrix},$$
we have
$$H_0: Y = N, \qquad H_1: Y = \sqrt{SNR}\cdot X + N. \qquad (4.14)$$
The sample covariance matrices are
$$S_y = \frac{1}{N}\sum_{i=1}^{N}y_iy_i^*, \qquad S_x = \frac{1}{N}\sum_{i=1}^{N}x_ix_i^*, \qquad S_n = \frac{1}{N}\sum_{i=1}^{N}n_in_i^*. \qquad (4.15)$$
We rewrite (4.15) as
$$H_0: NS_n = NN^*,$$
$$H_1: NS_y = YY^* = \left(\sqrt{SNR}\cdot X + N\right)\left(\sqrt{SNR}\cdot X + N\right)^* = SNR\cdot XX^* + NN^* + \sqrt{SNR}\cdot(XN^* + NX^*), \qquad (4.16)$$
and, in terms of the true covariance matrices,
$$H_0: R_y = R_n, \qquad H_1: R_y = SNR\cdot R_x + R_n.$$
Let $NN^*$ and $N'N'^*$ be two independent copies of the underlying random matrices. When the sample number $N$ grows large, we expect the sample covariance matrices to approach the true covariance matrix as closely as we desire. In other words, $\frac{1}{N}NN^*$ and $\frac{1}{N}N'N'^*$ are very close to each other. With this intuition, we may consider the following matrix function
$$S(SNR, X) = SNR\cdot\frac{1}{N}XX^* + \sqrt{SNR}\cdot\frac{1}{N}(XN^* + NX^*) + \frac{1}{N}\left(NN^* - N'N'^*\right).$$
Taking the trace, we reach a more convenient form
$$f(SNR, X) = \mathrm{Tr}(SS^*), \qquad SS^* = A + \varepsilon(SNR)\,B.$$
Exercise 4.3.5. Show that SS∗ has the following form SS∗ = A + ε (SN R) B.
Example 4.3.6 (Moments of Random Matrices [23]). The special case of $f(t) = t^k$ for integer $k$ is Lipschitz continuous. Using the fact that [48, p. 72] $\log x \le x - 1$, we have
$$\sum_{i=1}^{n}\log\lambda_i^k = k\sum_{i=1}^{n}\log\lambda_i \le k\sum_{i=1}^{n}(\lambda_i - 1),$$
and
$$\log|f(x) - f(y)| \le \frac{1}{a}|x - y|,$$
where $f(t) = t^a$.
4.4 Approximation of Matrix Functions Using Matrix Taylor Series
For example,
$$f(t) = \frac{1 + t^2}{1 - t} \;\Rightarrow\; f(A) = (I - A)^{-1}\left(I + A^2\right) \quad \text{if } 1 \notin \Omega(A).$$
Here $\Omega(A)$ is the set of eigenvalues of $A$ (also called the spectrum of $A$). Note that rational functions of a matrix commute, so it does not matter whether we write $(I - A)^{-1}(I + A^2)$ or $(I + A^2)(I - A)^{-1}$. If $f$ has a convergent power series representation, such as
$$\log(1 + t) = t - \frac{t^2}{2} + \frac{t^3}{3} - \frac{t^4}{4} + \cdots, \qquad |t| < 1,$$
we can again simply substitute $A$ for $t$ to define
$$\log(I + A) = A - \frac{A^2}{2} + \frac{A^3}{3} - \frac{A^4}{4} + \cdots, \qquad \rho(A) < 1.$$
Here $\rho$ is the spectral radius; the condition $\rho(A) < 1$ ensures convergence of the matrix series. We can also consider a polynomial
$$f(t) = P_n(t) = a_0 + a_1t + \cdots + a_nt^n,$$
for which we have
$$f(A) = P_n(A) = a_0 + a_1A + \cdots + a_nA^n.$$
A basic tool for approximating matrix functions is the Taylor series. We state
a theorem from [20, Theorem 4.7] that guarantees the validity of a matrix Taylor
series if the eigenvalues of the “increment” lie within the radius of convergence of
the associated scalar Taylor series.
Theorem 4.4.1 (convergence of matrix Taylor series). Suppose $f$ has a Taylor series expansion
$$f(z) = \sum_{k=0}^{\infty}a_k(z - \alpha)^k, \qquad a_k = \frac{f^{(k)}(\alpha)}{k!}, \qquad (4.17)$$
and define
$$f(A) = \sum_{k=0}^{\infty}a_k(A - \alpha I)^k.$$
$$\exp(A) = I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \cdots,$$
$$\cos(A) = I - \frac{A^2}{2!} + \frac{A^4}{4!} - \frac{A^6}{6!} + \cdots,$$
$$\sin(A) = A - \frac{A^3}{3!} + \frac{A^5}{5!} - \frac{A^7}{7!} + \cdots,$$
$$\log(I + A) = A - \frac{A^2}{2} + \frac{A^3}{3} - \frac{A^4}{4} + \cdots, \qquad \rho(A) < 1,$$
the first three series having infinite radius of convergence. These series can be used to approximate the respective functions by summing a suitable finite number of terms. Two types of errors arise: truncation errors, and rounding errors in the floating-point evaluation. Truncation errors are bounded in the following result from [20, Theorem 4.8].
Theorem 4.4.2 (Taylor series truncation error bound). Suppose $f$ has the Taylor series expansion (4.17) with radius of convergence $r$. If $A \in \mathbb{C}^{n\times n}$ with $\rho(A - \alpha I) < r$, then for any matrix norm
$$\left\|f(A) - \sum_{k=0}^{K-1}a_k(A - \alpha I)^k\right\| \le \frac{1}{K!}\max_{0\le t\le 1}\left\|(A - \alpha I)^Kf^{(K)}(\alpha I + t(A - \alpha I))\right\|. \qquad (4.18)$$
In order to apply this theorem, we need to bound the term $\max_{0\le t\le 1}\left\|(A - \alpha I)^Kf^{(K)}(\alpha I + t(A - \alpha I))\right\|$. For certain functions $f$ this is easy. We illustrate using the cosine function, with $\alpha = 0$, $K = 2k + 2$, and
$$T_{2k}(A) = \sum_{i=0}^{k}\frac{(-1)^i}{(2i)!}A^{2i},$$
so that
$$\left\|\cos(A) - T_{2k}(A)\right\|_\infty \le \frac{1}{(2k+2)!}\max_{0\le t\le 1}\left\|A^{2k+2}\cos^{(2k+2)}(tA)\right\|_\infty \le \frac{1}{(2k+2)!}\left\|A^{2k+2}\right\|_\infty\max_{0\le t\le 1}\left\|\cos^{(2k+2)}(tA)\right\|_\infty.$$
Now
$$\max_{0\le t\le 1}\left\|\cos^{(2k+2)}(tA)\right\|_\infty = \max_{0\le t\le 1}\left\|\cos(tA)\right\|_\infty \le 1 + \frac{\|A\|_\infty^2}{2!} + \frac{\|A\|_\infty^4}{4!} + \cdots = \cosh\left(\|A\|_\infty\right),$$
and thus the error in the truncated Taylor series approximation to the matrix cosine has the bound
$$\left\|\cos(A) - T_{2k}(A)\right\|_\infty \le \frac{\left\|A^{2k+2}\right\|_\infty}{(2k+2)!}\cosh\left(\|A\|_\infty\right).$$
$$A^{-1} = -\frac{1}{c_n}\left(A^{n-1} + \sum_{i=1}^{n-1}c_iA^{n-i-1}\right).$$
$$F(X) = a_0I + a_1X + \cdots + a_KX^K = \sum_{k=0}^{K}a_kX^k,$$
where the $a_k$ are scalar-valued coefficients. Often we are interested in the trace of this matrix function
$$\mathrm{Tr}(F(X)) = a_0\,\mathrm{Tr}(I) + a_1\,\mathrm{Tr}(X) + \cdots + a_K\,\mathrm{Tr}\left(X^K\right) = \sum_{k=0}^{K}a_k\,\mathrm{Tr}\,X^k,$$
$$\mathrm{Tr}(\mathbb{E}(F(X))) = \mathbb{E}(\mathrm{Tr}(F(X))) = \mathbb{E}\sum_{k=0}^{K}a_k\,\mathrm{Tr}\,X^k = \sum_{k=0}^{K}a_k\,\mathbb{E}\,\mathrm{Tr}\,X^k.$$
Let us consider the fluctuation of this trace function around its expectation
$$\mathrm{Tr}(F(X)) - \mathrm{Tr}(\mathbb{E}F(X)) = \sum_{k=0}^{K}a_k\,\mathrm{Tr}\,X^k - \sum_{k=0}^{K}a_k\,\mathbb{E}\,\mathrm{Tr}\,X^k = \sum_{k=0}^{K}a_k\left(\mathrm{Tr}\,X^k - \mathbb{E}\,\mathrm{Tr}\,X^k\right),$$
$$|\mathrm{Tr}(F(X)) - \mathrm{Tr}(\mathbb{E}F(X))| \le \sum_{k=0}^{K}|a_k|\left|\mathrm{Tr}\,X^k - \mathbb{E}\,\mathrm{Tr}\,X^k\right| \le \sum_{k=0}^{K}|a_k|\left(\left|\mathrm{Tr}\,X^k\right| + \left|\mathbb{E}\,\mathrm{Tr}\,X^k\right|\right), \qquad (4.19)$$
since, for two complex scalars $a, b$, $|a - b| \le |a| + |b|$. $\mathbb{E}\,\mathrm{Tr}\,X^k$ can be calculated. In fact, $|\mathrm{Tr}(A)|$ is a seminorm (but not a norm [23, p. 101]) of $A$. Another relevant norm is the weakly unitarily invariant norm
$$F(X) - \mathbb{E}(F(X)) = \sum_{k=0}^{K}a_kX^k - \mathbb{E}\sum_{k=0}^{K}a_kX^k = \sum_{k=0}^{K}a_kX^k - \sum_{k=0}^{K}a_k\mathbb{E}X^k = \sum_{k=0}^{K}a_k\left(X^k - \mathbb{E}X^k\right). \qquad (4.20)$$
According to (4.20), the problem boils down to a sum of random matrices, which
are
already treated in other chapters of this book. The random matrices Xk − E Xk
play a fundamental role in this problem.
Now we can define the distance as the unitarily invariant norm [23, p. 91] $|\!|\!|\cdot|\!|\!|$ of the matrix fluctuation. We have
$$|\!|\!|UAV|\!|\!| = |\!|\!|A|\!|\!|,$$
$$(A + B)_+ \preceq U^*A_+U + V^*B_+V, \qquad (4.21)$$
$$\mathrm{Tr}\,(A + B)_+ \le \mathrm{Tr}\,A_+ + \mathrm{Tr}\,B_+, \qquad (4.22)$$
|f (x) − f (y)| K x − y
and
2
P (|f (x) − Ef (x)| κt) Ce−ct (4.26)
for some absolute constants C, c > 0, where Mf (x) is the median of f (x) .
See [63] for a proof. Let us illustrate how to use this theorem, by considering the
operator norm of a random matrix.
The operator (or matrix) norm A op is the most important statistic of a random
matrix A. It is a basic statistic at our disposal. We define
$$\|A\|_{op} = \sup_{x\in\mathbb{C}^n:\,\|x\|=1}\|Ax\|,$$
where x is the Euclidean norm of vector x. The operator norm is the basic upper
bound for many other quantities.
The operator norm $\|A\|_{op}$ is also the largest singular value $\sigma_{\max}(A)$ or $\sigma_1(A)$, assuming that all singular values are sorted in non-increasing order. $\|A\|_{op}$ dominates the other singular values; similarly, all eigenvalues $\lambda_i(A)$ of $A$ have magnitude at most $\|A\|_{op}$.
Suppose that the coefficients $\xi_{ij}$ of $A$ are independent, have mean zero, and are uniformly bounded in magnitude by 1 ($\kappa = 1$). We consider $\sigma_1(A)$ as a function $f\left((\xi_{ij})_{1\le i,j\le n}\right)$ of the independent complex $\xi_{ij}$; thus $f$ is a function from $\mathbb{C}^{n^2}$ to $\mathbb{R}$. The convexity of the operator norm tells us that $f$ is convex. The elementary bound is
$$\sigma_1(A) \le \|A\|_F,$$
where
$$\|A\|_F = \left(\sum_{i=1}^{n}\sum_{j=1}^{n}|\xi_{ij}|^2\right)^{1/2},$$
and
$$\mathbb{P}\left(|\sigma_1(A) - \mathbb{E}\sigma_1(A)| \ge t\right) \le Ce^{-ct^2}, \qquad (4.28)$$
$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}f(\lambda_i(A)) - \mathbb{E}\frac{1}{n}\sum_{i=1}^{n}f(\lambda_i(A))\right| \ge t\right) \le Ce^{-ct^2}, \qquad (4.29)$$
$$|\lambda_i(A + B) - \lambda_i(A)| \le \|B\|_{op}.$$
It is easy to observe that the operator norm of a matrix $A = (\xi_{ij})$ bounds the magnitude of any of its coefficients, thus
$$|\xi_{ij}| \le \sigma_1(A),$$
or, equivalently, the event $\{\sigma_1(A) \le t\}$ is contained in each of the events $\{|\xi_{ij}| \le t\}$. We can view the upper tail event $\{\sigma_1(A) \ge t\}$ as containing a union of many simpler events $\{|\xi_{ij}| \ge t\}$. In the i.i.d. case $\xi_{ij} \equiv \xi$, setting $t = \alpha\sqrt{n}$ for some fixed $\alpha$ independent of $n$, we have
$$\mathbb{P}(\sigma_1(A) \le t) \le \mathbb{P}\left(|\xi| \le \alpha\sqrt{n}\right)^{n^2}.$$
$$h_{ij} = \frac{1}{\sqrt{n}}(x_{ij} + \sqrt{-1}\,y_{ij}) \quad \text{for all } 1 \le i < j \le n, \qquad h_{ii} = \frac{1}{\sqrt{n}}x_{ii} \quad \text{for all } 1 \le i \le n,$$
where $\{x_{ij}, y_{ij}, x_{ii}\}$ is a collection of real independent, identically distributed random variables with $\mathbb{E}x_{ij} = 0$ and $\mathbb{E}x_{ij}^2 = 1/2$. The diagonal elements are often assumed to have a different distribution, with $\mathbb{E}x_{ii} = 0$ and $\mathbb{E}x_{ii}^2 = 1$. The entries scale with the dimension $n$. The scaling is chosen such that, in the limit $n \to \infty$, all eigenvalues of $H$ remain bounded. To see this, we use
$$\mathbb{E}\sum_{k=1}^{n}\lambda_k^2 = \mathbb{E}\,\mathrm{Tr}\,H^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}\mathbb{E}|h_{ij}|^2 = n^2\,\mathbb{E}|h_{ij}|^2.$$
In Sect. 4.3, we observed that using $\sum_{k=1}^{n}f(\lambda_k(A))$ rather than $\lambda_k(A)$ increases the Lipschitz constant from $1$ to $\sqrt{n}\,|f|_L$.
Theorem 4.6.1 (Guionnet and Zeitouni [180]). Suppose that the laws of the
entries {xij , yij , xii } satisfies the logarithmic Sobolev inequality with constant
c > 0. Then, for any Lipschitz function f : R → C, with Lipschitz constant |f |L
and t > 0, we have that
$$\mathbb{P}\left(\left|\frac{1}{n}\mathrm{Tr}\,f(H) - \mathbb{E}\frac{1}{n}\mathrm{Tr}\,f(H)\right| \ge t\right) \le 2e^{-\frac{n^2t^2}{4c|f|_L^2}}. \qquad (4.30)$$
Moreover, for any $k = 1, \dots, n$,
$$\mathbb{P}\left(|f(\lambda_k) - \mathbb{E}f(\lambda_k)| \ge t\right) \le 2e^{-\frac{nt^2}{4c|f|_L^2}}. \qquad (4.31)$$
In order to prove this theorem, we use the observation of Herbst that Lipschitz functions of random matrices satisfying the log-Sobolev inequality exhibit Gaussian concentration.
Theorem 4.6.2 (Herbst). Suppose that $\mathbb{P}$ satisfies the log-Sobolev inequality on $\mathbb{R}^m$ with constant $c$. Let $G: \mathbb{R}^m \to \mathbb{R}$ be a Lipschitz function with constant $|G|_L$. Then, for every $t > 0$,
$$\mathbb{P}\left(|G(x) - \mathbb{E}G(x)| \ge t\right) \le 2e^{-\frac{t^2}{2c|G|_L^2}}.$$
$$\left|g(\Lambda(X)) - g(\Lambda(X'))\right| \le |g|_L\left\|\Lambda(X) - \Lambda(X')\right\|_2 = |g|_L\sqrt{\sum_{i=1}^{n}|\lambda_i(X) - \lambda_i(X')|^2} \le |g|_L\sqrt{\mathrm{Tr}\left(\left(H(X) - H(X')\right)^2\right)}$$
$$= |g|_L\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}|h_{ij}(X) - h_{ij}(X')|^2} \le \sqrt{2/n}\,|g|_L\left\|X - X'\right\|_{\mathbb{R}^{n^2}}. \qquad (4.32)$$
The first inequality of the second line follows from the Hoffman–Wielandt lemma above. $\|X - X'\|_{\mathbb{R}^{n^2}}$ is also the Frobenius norm.
Since $g(\Lambda) = \mathrm{Tr}\,f(H) = \sum_{k=1}^{n}f(\lambda_k)$ is such that
$$|g(\Lambda) - g(\Lambda')| \le |f|_L\sum_{i=1}^{n}|\lambda_i - \lambda_i'| \le \sqrt{n}\,|f|_L\,\|\Lambda - \Lambda'\|_2,$$
it follows that $g$ is a Lipschitz function on $\mathbb{R}^n$ with constant $\sqrt{n}\,|f|_L$. Combined with (4.32), this completes the proof of the corollary.
Now we have all the ingredients to prove Theorem 4.6.1.
Proof of Theorem 4.6.1. Let $X = (\{x_{ij}, y_{ij}, x_{ii}\}) \in \mathbb{R}^{n^2}$. Let $G(X) = \mathrm{Tr}\,f(H(X))$. Then the matrix function $G$ is Lipschitz with constant $\sqrt{2}\,|f|_L$. By Theorem 4.6.2, it follows that
$$\mathbb{P}\left(\left|\frac{1}{n}\mathrm{Tr}\,f(H) - \mathbb{E}\frac{1}{n}\mathrm{Tr}\,f(H)\right| \ge t\right) \le 2e^{-\frac{n^2t^2}{4c|f|_L^2}}.$$
To show (4.31), we see that, using Corollary 4.6.4, the matrix function $G(X) = f(\lambda_k)$ is Lipschitz with constant $\sqrt{2/n}\,|f|_L$. By Theorem 4.6.2, we find that
$$\mathbb{P}\left(|f(\lambda_k) - \mathbb{E}f(\lambda_k)| \ge t\right) \le 2e^{-\frac{nt^2}{4c|f|_L^2}},$$
which is (4.31).
Example 4.6.5 (Applications of Theorem 4.6.1). We consider the special case $f(s) = s$, so that $|f|_L^2 = 1$. From (4.31), for the $k$-th eigenvalue $\lambda_k$ of the random matrix $H$, we see at once that, for any $k = 1, \dots, n$,
$$\mathbb{P}\left(|\lambda_k - \mathbb{E}\lambda_k| \ge t\right) \le 2e^{-\frac{nt^2}{4c}}.$$
For the trace function $\frac{1}{n}\mathrm{Tr}(H)$, on the other hand, from (4.30) we have
$$\mathbb{P}\left(\left|\frac{1}{n}\mathrm{Tr}(H) - \mathbb{E}\frac{1}{n}\mathrm{Tr}(H)\right| \ge t\right) \le 2e^{-\frac{n^2t^2}{4c}}.$$
The right-hand-side of the above two inequalities has the same Gaussian tail but with
different “variances”. The k-th eigenvalue has a variance of σ 2 = 2c/n, while the
normalized trace function has that of σ 2 = 2c/n2 . The variance of the normalized
trace function is 1/n times that of the k-th eigenvalue. For example, when n =
100, this factor is 0.01. In other words, the normalized trace function—viewed as a statistical average of $n$ eigenvalues (random variables)—reduces the variance by 20 dB compared with each individual eigenvalue. In [14], this phenomenon has been used for signal detection at extremely low signal-to-noise ratios, such as $SNR = -34$ dB.
The Wishart matrix is involved in [14], while here the Wigner matrix is used. In Sect. 4.8, it is shown that for the normalized trace function of a Wishart matrix we have a tail bound of $2e^{-\frac{n^2t^2}{4c}}$, similar to a Wigner matrix. On the other hand, for the largest singular value (operator norm)—see Theorem 4.8.11—the tail bound is $Ce^{-cnt^2}$.
For hypothesis testing such as studied in [14], reducing the variance of the
statistic metrics (such as the k-th eigenvalue and the normalized trace function)
is critical to algorithms. From this perspective, we can easily understand why the
use of the normalized trace as statistic metric in hypothesis testing leads to much
better results than algorithms that use the k-th eigenvalue [188] (in particular, k = 1
and k = n for the largest and smallest eigenvalues, respectively). Another big
advantage is that the trace function is linear. It is insightful to view the trace function as a statistical average $\frac{1}{n}\mathrm{Tr}(H) = \frac{1}{n}\sum_{k=1}^{n}\lambda_k$, where $\lambda_k$ is a random variable. The statistical average of $n$ random variables, of course, reduces the variance—a classical result in probability and statistics. Often one deals with sums of independent random variables [189], but here the $n$ eigenvalues are not necessarily independent. Techniques like Stein's method [87] can be used to approximate this sum by a Gaussian distribution. Besides, the reduction of the variance by a factor of $n$ obtained by using the trace function rather than each individual eigenvalue is not obvious to observe with classical techniques, since it is very difficult to deal with sums of $n$ dependent random variables.
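A small Monte Carlo experiment (our own sketch; the Wigner construction, size, and trial count are arbitrary illustrative choices) makes the variance gap concrete by comparing a single eigenvalue statistic with the normalized trace.

```python
import numpy as np

rng = np.random.default_rng(8)
n, trials = 100, 1000

lam_max, trace_n = np.empty(trials), np.empty(trials)
for t in range(trials):
    G = rng.normal(size=(n, n)) / np.sqrt(n)
    H = (G + G.T) / np.sqrt(2)          # a Wigner-type symmetric matrix
    ev = np.linalg.eigvalsh(H)
    lam_max[t] = ev[-1]                 # a single (largest) eigenvalue statistic
    trace_n[t] = ev.mean()              # normalized trace (1/n) Tr H

print("var of largest eigenvalue:", lam_max.var())
print("var of normalized trace  :", trace_n.var())
# The normalized trace fluctuates far less than a single eigenvalue,
# which is why it is preferred as a detection statistic in the text.
```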
For a general Lipschitz function $f$ with constant $|f|_L$, the variances are $\sigma^2 = 2c|f|_L^2/n$ and $\sigma^2 = 2c|f|_L^2/n^2$ for the $k$-th eigenvalue and the normalized trace, respectively; the Lipschitz constant increases the variance by a factor of $|f|_L^2$.
If we seek to find a statistic metric (viewed as a matrix function) that has a
minimum variance, we find here that the normalized trace function n1 Tr (H) is
optimum in the sense of minimizing the variance. This finding is in agreement with
empirical results in [14]. To a large degree, a large part of this book was motivated
to understand why the normalized trace function n1 Tr (H) always gave us the best
results in Monte Carlo simulations. It seemed that we could not find a better matrix
function for the statistic metrics. The justification for recommending the normalized
trace function n1 Tr (H) is satisfactory to the authors.
Following [190], we also refer to [191, 192]. The Schatten $p$-norm of a matrix $A$ is defined as $\|A\|_p = \left(\mathrm{Tr}\left(A^HA\right)^{p/2}\right)^{1/p}$. The limiting case $p = \infty$ corresponds to the operator (or spectral) norm, while $p = 2$ gives the Hilbert–Schmidt (or Frobenius) norm. We also denote by $\|\cdot\|_p$ the $L_p$-norm of a real or complex random variable, or the $\ell_p$-norm of a vector in $\mathbb{R}^n$ or $\mathbb{C}^n$.
A random vector $x$ in a normed space $V$ satisfies the (sub-Gaussian) convex concentration property (CCP), or is in the class CCP, if
$$\mathbb{P}\left(|f(x) - Mf(x)| \ge t\right) \le Ce^{-ct^2} \qquad (4.33)$$
for every $t > 0$ and every convex 1-Lipschitz function $f: V \to \mathbb{R}$, where $C, c > 0$ are constants (parameters) independent of $f$ and $t$, and $M$ denotes a median of a random variable.
Theorem 4.7.1 (Meckes [190]). Let $X_1, \dots, X_m \in \mathbb{C}^{n\times n}$ be independent, centered random matrices that satisfy the convex concentration property (4.33) (with respect to the Frobenius norm on $\mathbb{C}^{n\times n}$) and let $k \ge 1$ be an integer. Let $P$ be a noncommutative $*$-polynomial in $m$ variables of degree at most $k$, normalized so that its coefficients have modulus at most 1. Define the complex random variable
$$Z_P = \mathrm{Tr}\,P\left(\frac{1}{\sqrt{n}}X_1, \dots, \frac{1}{\sqrt{n}}X_m\right).$$
The conclusion holds also for non-centered random matrices if—when $k \ge 2$—we assume that $\mathbb{E}\|X_i\|_{2(k-1)} \le Cn^{k/2(k-1)}$ for all $i$. We also have that, for $q \ge 1$,
$$\left\|Z_P - \mathbb{E}Z_P\right\|_q \le C_{m,k}\max\left(\sqrt{q},\ \left(\frac{q}{m}\right)^{k/2}\right),$$
and
$$\mathbb{P}\left(\left|\mathrm{Tr}\left(\frac{1}{\sqrt{n}}X\right)^k - M\,\mathrm{Tr}\left(\frac{1}{\sqrt{n}}X\right)^k\right| \ge t\right) \le C\exp\left(-\min\left(c_kt^2,\ cnt^{2/k}\right)\right).$$
The integral of a function $f(\cdot)$ with respect to the measure induced by $F^M$ is denoted by
$$F^M(f) = \frac{1}{n}\sum_{i=1}^{n}f(\lambda_i(M)) = \frac{1}{n}\mathrm{Tr}\,f(M).$$
For certain classes of random matrices $M$ and certain classes of functions $f$, it can be shown that $F^M(f)$ is concentrated around its expectation $\mathbb{E}F^M(f)$ or around any median $MF^M(f)$.
For a Lipschitz function $g$, we write $\|g\|_L$ for its Lipschitz constant. To state the result, we also need functions of bounded variation: $f: (a, b) \to \mathbb{R}$ is of bounded variation on $(a, b)$ (where $-\infty \le a \le b \le \infty$) if
$$V_f(a, b) = \sup_{n\ge 1}\ \sup_{a < x_0 \le x_1 \le \cdots \le x_n < b}\ \sum_{k=1}^{n}|f(x_k) - f(x_{k-1})|$$
is finite [195, Sect. X.1]. A function is of bounded variation if and only if it can be written as the difference of two bounded monotone functions on $(a, b)$. The indicator function $g: x \mapsto \mathbf{1}\{x \le \lambda\}$ is of bounded variation on $\mathbb{R}$ with $V_g(\mathbb{R}) = 1$ for each $\lambda \in \mathbb{R}$.
Theorem 4.8.1 (Guntuboyina and Leeb [196]). Let $X$ be an $m \times n$ matrix whose row vectors are independent, set the Wishart matrix $S = X^TX/m$, and fix $f: \mathbb{R} \to \mathbb{R}$.
1. Suppose that $f$ is such that the mapping $x \mapsto f(x^2)$ is convex and Lipschitz, and suppose that $|X_{i,j}| \le 1$ for each $i$ and $j$. For all $t > 0$, we then have
$$\mathbb{P}\left(\left|\frac{1}{n}\mathrm{Tr}\,f(S) - M\left(\frac{1}{n}\mathrm{Tr}\,f(S)\right)\right| \ge t\right) \le 4\exp\left(-\frac{n^2m}{n+m}\,\frac{t^2}{8\|f(\cdot^2)\|_L^2}\right), \qquad (4.34)$$
where $M$ stands for the median. [From the upper bound (4.34), one can also obtain a similar bound for $\mathbb{P}\left(\left|\frac{1}{n}\mathrm{Tr}\,f(S) - \mathbb{E}\frac{1}{n}\mathrm{Tr}\,f(S)\right| \ge t\right)$ using standard methods (e.g., [197]).]
2. Suppose that f is of bounded variation on R. For all t > 0, we then have
$$\mathbb{P}\left(\left|\frac{1}{n}\mathrm{Tr}\,f(S) - \mathbb{E}\frac{1}{n}\mathrm{Tr}\,f(S)\right| \ge t\right) \le 2\exp\left(-\frac{n^2}{m}\,\frac{2t^2}{V_f^2(\mathbb{R})}\right).$$
holds almost surely for each i = 1, . . . , m and for some (fixed) integer r.
Theorem 4.8.2 (Guntuboyina and Leeb [196]). Assume (4.36) is satisfied for each $i = 1, \dots, m$. Assume $f: \mathbb{R} \to \mathbb{R}$ is of bounded variation on $\mathbb{R}$. For any $t > 0$, we have that
$$\mathbb{P}\left(\left|\frac{1}{n}\mathrm{Tr}\,f\!\left(X/\sqrt{m}\right) - \mathbb{E}\frac{1}{n}\mathrm{Tr}\,f\!\left(X/\sqrt{m}\right)\right| \ge t\right) \le 2\exp\left(-\frac{n^2}{m}\,\frac{2t^2}{r^2V_f^2(\mathbb{R})}\right). \qquad (4.37)$$
To estimate the integer r in (4.35), we need the following lemma.
Lemma 4.8.3 (Lemma 2.2 and 2.6 of [198]). Let A and B be symmetric n × n
matrices and let C and D be m × n matrices. We have
/ /
/1 /
/ Tr f (A) − 1 Tr f (B)/ rank (A − B) ,
/n n / n
∞
and
/ /
/1 / rank (C − D)
/ Tr f CT C − 1 Tr f DT D / .
/n n / n
∞
where the entries Zij of matrix Z are independent and satisfy |Zij | ≤ 1. The
(random) sample covariance
2 matrix is S = XT X/m. For a function f such that
the mapping x → f x is convex and Lipschitz, we then have that, for all t > 0
- .
1 1 n2 m t2
P Tr f (S) − M Tr f (S) t 4 exp − 2 .
n n n + m 8C 2 f (·2 ) L
(4.38)
where C = (1 + B ) H with || · || denoting the operator (or matrix) norm.
We focus on Wishart random matrices (sample covariance matrices), that is, $S = \frac{1}{n}YY^H$, where $Y$ is a rectangular $p \times n$ matrix. Our objective is to use new exponential inequalities for statistics of the form
$$Z = g(\lambda_1, \cdots, \lambda_p),$$
where $(\lambda_i)_{1\le i\le p}$ is the set of eigenvalues of $S$. These inequalities are upper bounds on $\mathbb{E}\,e^{Z - \mathbb{E}Z}$ and lead to natural bounds on $\mathbb{P}(|Z - \mathbb{E}Z| \ge t)$ for large values of $t$.
What is new, following [199], is that: (i) $g$ is a once or twice differentiable function (i.e., not necessarily of the form (4.39)); (ii) the entries of $Y$ are possibly dependent; (iii) they are not necessarily identically distributed; (iv) the bound is instrumental when both $p$ and $n$ are large. In particular, we consider
$$g(\lambda_1, \cdots, \lambda_p) = \sum_{k=1}^{p}\varphi(\lambda_k), \qquad (4.39)$$
used in [14]. We propose to investigate this topic using new results from the statistics literature [180, 199, 201].
For a vector x, ||x|| stands for the Euclidean norm of the vector x. ||X||p is the
usual Lp -norm of the real random variable X.
Theorem 4.8.5 (Deylon [199]). Let $Q$ be an $N \times N$ deterministic matrix, let $X = (X_{ij})$, $1 \le i \le N$, $1 \le j \le n$, be a matrix of independent random entries, and set
$$Y = \frac{1}{n}QXX^TQ,$$
where
$$\xi = \|Q\|\sup_{i,j}\|X_{ij}\|_\infty, \qquad \gamma_1 = \sup_{k,\lambda}\left|\frac{\partial g}{\partial\lambda_k}(\lambda)\right|, \qquad \gamma_2 = \sup_{\lambda}\left\|\nabla^2g(\lambda)\right\|\ \text{(matrix norm)}.$$
When dealing with matrices with independent columns, we have $Q = I$, where $I$ is the identity matrix.
Theorem 4.8.6 (Deylon [199]). Let $X = (X_{ij})$, $1 \le i \le N$, $1 \le j \le n$, be a matrix of independent random entries, and set
$$Y = \frac{1}{n}XX^T, \qquad Z = g(\lambda_1, \dots, \lambda_N),$$
where
$$\gamma_1 = \sup_{\lambda}\left\|\frac{\partial g}{\partial\lambda}(\lambda)\right\|_1, \qquad a = \sup_{j}\left\|\sum_{i}X_{ij}^2\right\|_\infty^{1/2}.$$
For each $c_1, c_2 > 0$, let $\mathcal{L}(c_1, c_2)$ be the class of probability measures on $\mathbb{R}$ that arise as laws of random variables of the form $u(Z)$, where $Z$ is a standard Gaussian random variable and $u$ is a twice continuously differentiable function such that, for all $x \in \mathbb{R}$, $|u'(x)| \le c_1$ and $|u''(x)| \le c_2$. For example, the standard Gaussian law is in $\mathcal{L}(1, 0)$. Taking $u$ to be the Gaussian cumulative distribution function, we see that the uniform distribution on the unit interval is in $\mathcal{L}\left((2\pi)^{-1/2}, (2\pi e)^{-1/2}\right)$. We say a random variable $X$ is "in $\mathcal{L}(c_1, c_2)$" instead of the more elaborate statement that "the distribution of $X$ belongs to $\mathcal{L}(c_1, c_2)$." For two random variables $X$ and $Y$, the supremum of $|\mathbb{P}(X \in B) - \mathbb{P}(Y \in B)|$ as $B$ ranges over all Borel sets is called the total variation distance between the laws of $X$ and $Y$, often denoted simply by $d_{TV}(X, Y)$. The following theorem gives normal approximation bounds for general smooth functions of independent random variables whose laws are in $\mathcal{L}(c_1, c_2)$ for some finite $c_1, c_2$.
Define
$$\kappa_0 = \left(\sum_{i=1}^{n}\mathbb{E}\left(\frac{\partial g}{\partial x_i}(X)\right)^4\right)^{1/2}, \qquad \kappa_1 = \left(\mathbb{E}\|\nabla g(X)\|^4\right)^{1/4}, \qquad \kappa_2 = \left(\mathbb{E}\left\|\nabla^2g(X)\right\|^4\right)^{1/4}.$$
Suppose $W = g(x)$ has a finite fourth moment and let $\sigma^2 = \mathrm{Var}(W)$. Let $Z$ be a normal random variable with the same mean and variance as $W$. Then
$$d_{TV}(W, Z) \le \frac{2\sqrt{5}\left(c_1c_2\kappa_0 + c_1^3\kappa_1\kappa_2\right)}{\sigma^2}.$$
If we slightly change the setup by assuming that $x$ is a Gaussian random vector with mean $0$ and covariance matrix $\Sigma$, keeping all other notation the same, then the corresponding bound is
$$d_{TV}(W, Z) \le \frac{2\sqrt{5}\,\|\Sigma\|^{3/2}\kappa_1\kappa_2}{\sigma^2}.$$
The cornerstone of Chatterjee [201] is Stein's method [202]. Let us consider a particular function $f$. Let $n$ be a fixed positive integer and $J$ a finite indexing set. Suppose that for each $1 \le i, j \le n$ we have a $C^2$ map $a_{ij}: \mathbb{R}^J \to \mathbb{C}$. For each $x \in \mathbb{R}^J$, let $A(x)$ be the complex $n \times n$ matrix whose $(i, j)$-th element is $a_{ij}(x)$. Let
$$f(z) = \sum_{k=0}^{\infty}b_kz^k$$
and
$$A = \frac{1}{n}XX^T.$$
Theorem 4.8.9 (Proposition 4.6 of Chatterjee [201]). Let λ be the largest eigen-
value of A. Take any entire function f and define f1 and f2 as in Theorem 4.8.8. Let
1/4
4 1/4
and b = E f1 (λ) + 2N −1/2 λf2 (λ)
4
a = E f1 (λ) λ2 . Suppose
W = Re Tr f (A (x)) has finite fourth moment and let σ 2 = Var (W ) . Let Z be a
normal random variable with the same mean and variance as W. Then
√ √
8 5 c 1 c 2 a2 N c31 abN
dT V (W, Z) 2 + 3/2 .
σ n n
If we slightly change the setup by assuming that the entries of x is jointly Gaussian
with mean 0 and nN × nN covariance matrix Σ, keeping all other notation the
same, then the corresponding bound is
√ 3/2
8 5 Σ abN
dT V (W, Z) 2 3/2
.
σ n
Double Wishart matrices are important in the statistical theory of canonical correlations [203, Sect. 2.2]. Let $N$, $n$, $M$ be three positive integers. Let $X = (X_{ij})_{1\le i\le N, 1\le j\le n}$ and $Y = (Y_{ij})_{1\le i\le N, 1\le j\le M}$ be collections of independent random variables in $\mathcal{L}(c_1, c_2)$ for some finite $c_1, c_2$. Define the double Wishart matrix as
$$A = XX^T\left(YY^T\right)^{-1}.$$
Theorem 4.8.10 (Jiang [204]). Let X be an n×n matrix with complex entries, and
λ1 , . . . , λn are its eigenvalues. Let μX be the empirical law of λi , 1 ≤ i ≤ n. Let μ
be a probability measure. Then ρ (μX , μ) is a continuous function in the entries of
matrix X, where ρ is defined as in (4.40).
Proof. The proof of [204] illustrates a lot of insights into the function ρ (μX , μ) ,
so we include their proof here for convenience. The eigenvalues of X is denoted by
λi , 1 ≤ i ≤ n. First, we observe that
n
1
f (x)μX (dx) = f (λi (X)).
n i=1
C
where in the last step we use the Lipschitz property of f : |f (x) − f (y)| |x − y|
for any x and y. Since the above inequality is true for any permutation π, we have
that
|ρ (μX , μ) − ρ (μY , μ)| min max λπ(i) (X) − λi (Y) .
π 1in
⎛ ⎞1/(2n)
·⎝
1/n 2⎠
|ρ (μX , μ) − ρ (μY , μ)| 22−1/n X − Y 2 |xij − yij | .
1i,jn
||·|| is the Euclidean length (norm) of a vector. The infimum above is over probability
measures π on A × A with marginals μ and ν. Note that dp ≤ dq when p ≤ q. The
$L_1$ Wasserstein distance can be equivalently defined [59] by
$$d_1(\mu, \nu) = \sup_{f}\left|\int f\,d\mu - \int f\,d\nu\right|,$$
where the supremum is over $f$ in the unit ball $B(\mathrm{Lip}(A))$ of $\mathrm{Lip}(A)$.
Denote the space of n × n Hermitian matrices by Msa n×n . Let A be a random
n × n Hermitian matrix. An essential condition on some of random matrices used
in the construction below is the following. Suppose that for some C, c > 0,
$$\mathbb{P}\left(|F(A) - \mathbb{E}F(A)| \ge t\right) \le C\exp\left(-ct^2\right) \qquad (4.43)$$
for every t > 0 and F : Msa n×n → R which is 1-Lipschitz with respect to the
Hilbert-Schmidt norm (or Frobenius norm). Examples in which (4.43) is satisfied
include:
1. The diagonal and upper-diagonal entries of the matrix $M$ are independent and each satisfies a quadratic transportation cost inequality with constant $c/\sqrt{n}$. This is slightly more general than the assumption of a log-Sobolev inequality [141,
Sect. 6.2], and is essentially the most general condition with independent en-
tries [206]. It holds, e.g., for Gaussian entries and, more generally, for entries
with densities of the form e−nuij (x) where uij (x) ≥ c > 0.
2. The distribution of M itself has a density proportional to e−n Tr u(M) with u :
R → R such that u (x) ≥ c > 0. This is a subclass of the unitarily invariant
ensembles [207]. The hypothesis on u guarantees that M satisfies a log-Sobolev
inequality [208, Proposition 4.4.26].
The first model is as follows. Let U(n) be the group of n × n unitary matrices.
Let U ∈ U(n) distributed according to Haar measure, independent of A, and let
Pk denote the projection of Rn onto the span of the first k basis elements. Define a
random matrix M by
$$M = P_kUAU^HP_k^H \qquad (4.44)$$
The empirical spectral measure of an $N \times N$ matrix $X$ is
$$\mu_X = \frac{1}{N}\sum_{i=1}^{N}\delta_{\lambda_i(X)},$$
where $\lambda_i(X)$ are the eigenvalues of the matrix $X$ and $\delta$ is the Dirac measure. The empirical spectral measure of the random matrix $M$ is denoted by $\mu_M$, while its expectation is denoted by $\mu = \mathbb{E}\mu_M$.
Theorem 4.8.11 (Meckes, E.S. and Meckes, M.W. [192]). Suppose that matrix A
satisfies (4.43) for every 1-Lipschitz function F : Msa
n×n → R.
1. If F : Msa
n×n → R is 1-Lipschitz, then for M = Pk UAU Pk ,
H H
P (|F (M) − EF (M)| t) C exp −cnt2
Zf = f dμM − f dμ,
then
P (|Zf − EZf | t) C exp −cknt2
Theorem 4.8.12 (Meckes, E.S. and Meckes, M.W. [192]). Suppose that matrix
A satisfies (4.43) for every 1-Lipschitz function F : Msa n×n → R. Let M =
Pk UAUH PH k , and let μ M denote the empirical spectral distribution of random
matrix M with μ = EμM . Then,
1/3
C E M op C
Ed1 (μM , EμM ) 1/3
1/3
,
(kn) (kn)
and so
C
P d1 (μM , EμM ) 1/3
+t C exp −cknt2
(kn)
$$M = UAU^H + B, \qquad (4.45)$$
$$\mathbb{P}\left(|F(M) - \mathbb{E}F(M)| \ge t\right) \le C\exp\left(-cnt^2\right).$$
If
$$Z_f = \int f\,d\mu_M - \int f\,d\rho,$$
then
$$\mathbb{P}\left(|Z_f - \mathbb{E}Z_f| \ge t\right) \le C\exp\left(-cn^2t^2\right),$$
and so
$$\mathbb{P}\left(d_1(\mu_M, \mathbb{E}\mu_M) \ge \frac{C}{n^{2/3}} + t\right) \le C\exp\left(-cn^2t^2\right), \qquad d_1(\mu_{M_n}, \mathbb{E}\mu_{M_n}) \le \frac{C}{n^{2/3}}$$
for all sufficiently large $n$, where $C$ depends only on the bounds on the sizes of the spectra of $A_n$ and $B_n$.
We follow [210]. If A and B are two Hermitian matrices with a known spectrum
(the set of eigenvalues), it is a classical problem to determine all possibilities for
the spectrum of A + B. The problem goes back at least to H. Weyl [211]. Later,
Horn [212] suggested a list of inequalities, which must be satisfied by eigenvalues
$$H = A + UBU^H,$$
where $A$ and $B$ are two fixed $n \times n$ Hermitian matrices and $U$ is a random unitary matrix with the Haar distribution on the unitary group $U(n)$. Then the eigenvalues of $H$ are random and we are interested in their joint distribution. Let $\lambda_1(A) \ge \cdots \ge \lambda_n(A)$ (repeated by multiplicities) denote the eigenvalues of $A$, and define the spectral measure of $A$ as
$$\mu_A = \frac{1}{n}\sum_{i=1}^{n}\delta_{\lambda_i(A)} \qquad (4.46)$$
and its distribution function as
$$F_A(x) = \frac{\#\{i : \lambda_i \le x\}}{n}.$$
By an ingenious application of Stein's method [87], Chatterjee [213] proved that for every $x \in \mathbb{R}$,
$$\mathbb{P}\left(|F_H(x) - \mathbb{E}F_H(x)| \ge t\right) \le 2\exp\left(-\frac{cnt^2}{\log n}\right).$$
For the free convolution $\mu_A \boxplus \mu_B$ with distribution function $F$, one also has
$$\mathbb{P}\left(\sup_x|F_H(x) - F(x)| \ge t\right) \le \exp\left(-c_2\frac{n^2t^2\varepsilon}{(\log n)^2}\right),$$
Consequently, we have
$$\mathbb{E}\left\|F_A - F\right\|_\infty \le \frac{13 + 8\sqrt{\log k}}{\sqrt{k}}.$$
We follow [63] and [3] for our exposition of Wigner’s trace method. The idea of
using Wigner’s trace method to obtain an upper bound for the eigenvalue of A,
$\lambda(A)$, was initiated in [215]. The standard linear algebra identity is
$$\mathrm{Tr}(A) = \sum_{i=1}^{N}\lambda_i(A).$$
The trace of a matrix is the sum of its eigenvalues, and the trace is a linear functional. More generally, we have
$$\mathrm{Tr}\left(A^k\right) = \sum_{i=1}^{N}\lambda_i^k(A)$$
and, for Hermitian $A$ and even $k$,
$$\sigma_1(A)^k \le \mathrm{Tr}\left(A^k\right) \le n\,\sigma_1(A)^k. \qquad (4.47)$$
The knowledge of the $k$-th moment $\mathrm{Tr}\,A^k$ controls the operator norm (also the largest singular value) up to a multiplicative factor of $n^{1/k}$. Taking larger and larger $k$, we obtain more accurate control on the operator norm.
Let us see how the moment method works in practice. The simplest case is that of the second moment $\mathrm{Tr}\,A^2$, which in the Hermitian case works out to
$$\mathrm{Tr}\,A^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}|\xi_{ij}|^2 = \|A\|_F^2.$$
The expression $\sum_{i=1}^{n}\sum_{j=1}^{n}|\xi_{ij}|^2$ is easy to compute in practice. For instance, for a symmetric matrix $A$ consisting of Bernoulli random variables taking values $\pm 1$, this expression is exactly equal to $n^2$.
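A one-screen verification of this identity (our own sketch; the size is arbitrary) for a symmetric $\pm 1$ matrix:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200
B = rng.choice([-1.0, 1.0], size=(n, n))
A = np.triu(B) + np.triu(B, 1).T            # symmetric matrix of +/-1 entries

tr_A2 = np.trace(A @ A)
print(tr_A2, "=", np.linalg.norm(A, 'fro') ** 2, "=", n ** 2)
```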
asymptotically almost surely. In fact, if the ξij have uniformly sub-exponential tail,
we have (4.48) with overwhelming probability.
Applying (4.48), we have the bounds
$$(1 + o(1))\sqrt{n} \le \sigma_1(A) \le (1 + o(1))\,n \qquad (4.49)$$
asymptotically almost surely. The median of $\sigma_1(A)$ is at least $(1 + o(1))\sqrt{n}$. But the upper bound here is terrible. We need to move to higher moments to improve it.
Let us move to the fourth moment. For simplicity, all entries $\xi_{ij}$ have zero mean and unit variance. To control moments beyond the second, we also assume that all entries are bounded in magnitude by some $K$. We expand
$$\mathrm{Tr}\,A^4 = \sum_{1\le i_1,i_2,i_3,i_4\le n}\xi_{i_1i_2}\xi_{i_2i_3}\xi_{i_3i_4}\xi_{i_4i_1}, \qquad \mathbb{E}\,\mathrm{Tr}\,A^4 = \sum_{1\le i_1,i_2,i_3,i_4\le n}\mathbb{E}\,\xi_{i_1i_2}\xi_{i_2i_3}\xi_{i_3i_4}\xi_{i_4i_1}.$$
One can view this sum graphically, as a sum over length-four cycles in the vertex set $\{1, \dots, n\}$. Using combinatorial arguments [63], we have
$$\mathbb{E}\,\mathrm{Tr}\,A^4 \le O\left(n^3\right) + O\left(n^2K^2\right).$$
In particular, under the assumption $K = O(\sqrt{n})$, we have
$$\mathbb{E}\,\mathrm{Tr}\,A^4 \le O\left(n^3\right).$$
For general even $k$,
$$\mathbb{E}\,\mathrm{Tr}\,A^k = \sum_{1\le i_1,\dots,i_k\le n}\mathbb{E}\,\xi_{i_1i_2}\cdots\xi_{i_ki_1},$$
and we have
$$\mathbb{E}\,\mathrm{Tr}\,A^k \le (k/2)^kn^{k/2+1}\max\left(1, K/\sqrt{n}\right)^{k-2}.$$
Consequently,
$$\mathbb{E}\,\sigma_1(A)^k \le (k/2)^kn^{k/2+1}\max\left(1, K/\sqrt{n}\right)^{k-2}, \qquad \mathbb{P}(\sigma_1(A) \ge t) \le \frac{1}{t^k}(k/2)^kn^{k/2+1}\max\left(1, K/\sqrt{n}\right)^{k-2}.$$
The Catalan numbers are
$$C_{k/2} = \frac{k!}{(k/2+1)!\,(k/2)!} \qquad (4.50)$$
for all $k = 2, 4, 6, \dots$. Note that $n(n-1)\cdots(n-k/2) = (1 + o_k(1))\,n^{k/2+1}$. Putting all the above computations together, we conclude:
Theorem 4.11.2 (Moment computation). Let $A$ be a real symmetric random matrix, with the upper triangular elements $\xi_{ij}$, $i \le j$, jointly independent with mean zero and variance one, and bounded in magnitude by $o(\sqrt{n})$. Let $k$ be a positive even integer. Then we have
$$\mathbb{E}\,\mathrm{Tr}\,A^k = \left(C_{k/2} + o_k(1)\right)n^{k/2+1}.$$
By comparison, if
$$S = X_1 + \cdots + X_n$$
is the sum of $n$ i.i.d. random variables of mean zero and variance one, the corresponding even moments involve
$$C_{k/2} = \frac{k!}{2^{k/2}(k/2)!},$$
whereas here $C_{k/2}$ is given by (4.50). In particular, from the trivial bound $C_{k/2} \le 2^k$ one has
$$\mathbb{E}\,\mathrm{Tr}\,A^k \le (2 + o(1))^kn^{k/2+1},$$
which yields
$$\limsup_{n\to\infty}\frac{\sigma_1(A)}{\sqrt{n}} \le 2.$$
valid for any n × n matrix A with complex entries and every positive integer
k.
It is possible to adapt all of the above moment calculations for $\mathrm{Tr}\,A^k$ in the symmetric or Hermitian cases to give analogous results for $\mathrm{Tr}\left((AA^*)^{k/2}\right)$ in the non-symmetric cases. Another approach is to use the augmented matrix defined as
$$\tilde{A} = \begin{bmatrix}0 & A\\ A^* & 0\end{bmatrix},$$
with
ω := ω R + jω I = (ωij )1i,jn , ωij = ω̄ij ,
A = (A)ij , A = A∗ .
1i,jn
Here, {ωij }1i,jn are independent complex random variables with laws
{Pij }1i,jn , Pij being a probability measure on C with
0 1
and $A$ is a non-random complex matrix with entries $(A_{ij})_{1\le i,j\le n}$ uniformly bounded by, say, $a$.
We consider a real-valued function on $\mathbb{R}$. For a compact set $K$, denote by $|K|$ its diameter, that is, the maximal distance between two points of $K$. For a Lipschitz function $f: \mathbb{R}^k \to \mathbb{R}$, we define the Lipschitz constant $|f|_L$ by
$$|f|_L = \sup_{x\neq y}\frac{|f(x) - f(y)|}{\|x - y\|},$$
$$\int f^2\log\frac{f^2}{\int f^2\,d\mu}\,d\mu \le 2c\int|f'|^2\,d\mu.$$
The most well known estimate on λmax (X) is perhaps the following theorem [222].
Theorem 4.13.2 (Füredi and Komlós [222]). For a random matrix $X$ as above, there is a positive constant $c = c(\sigma, K)$ such that
$$2\sigma\sqrt{n} - cn^{1/3}\ln n \le \lambda_{\max}(X) \le 2\sigma\sqrt{n} + cn^{1/3}\ln n.$$
In this theorem, $c$ does not depend on $\sigma$; we do not have to assume anything about the variances.
Theorem 4.13.4 (Vu [3]). For a random matrix $X$ as above, there is a positive constant $c = c(\sigma, K)$ such that
$$\lambda_{\max}(X) \le 2\sigma\sqrt{n} + cn^{1/4}\ln n.$$
When the entries of $X$ are i.i.d. symmetric random variables, there are sharper bounds. The best current bound we know of is due to Soshnikov [216], which shows that the error term in Theorem 4.13.4 can be reduced to $n^{1/6+o(1)}$.
The general concentration principles do not yield the correct small-deviation rate for the largest eigenvalues. They do, however, apply to large classes of Lipschitz functions. For $M \in \mathbb{N}$, we denote by $\langle\cdot,\cdot\rangle$ the Euclidean scalar product on $\mathbb{R}^M$ (or $\mathbb{C}^M$). For two vectors $x = (x_1, \dots, x_M)$ and $y = (y_1, \dots, y_M)$, we have $\langle x, y\rangle = \sum_{i=1}^{M}x_iy_i$ (or $\langle x, y\rangle = \sum_{i=1}^{M}x_iy_i^*$). The Euclidean norm is defined as $\|x\| = \sqrt{\langle x, x\rangle}$.
For a Lipschitz function $f: \mathbb{R}^M \to \mathbb{R}$, we define the Lipschitz constant $|f|_L$ by
$$|f|_L = \sup_{x\neq y\in\mathbb{R}^M}\frac{|f(x) - f(y)|}{\|x - y\|}, \qquad |f(x) - f(y)| \le |f|_L\sum_{i=1}^{M}|x_i - y_i|,$$
where $\mu = \mathbb{E}\frac{1}{n}\sum_{i=1}^{n}\delta_{\lambda_i}$ is the mean spectral measure. Inequality (4.51) has the $n^2$ speed of the large deviation principles for spectral measures. With the additional assumption of convexity on $f$, similar inequalities hold for real or complex matrices with entries that are independent with bounded support.²
Lemma 4.14.1 ([180, 224]). Let f : R → R be Lipschitz with Lipschitz constant
|f |L . X denotes the Hermitian (or symmetric) matrix with entries (Xij )1i,jn , the
map
“Good” here has, for instance, the meaning that the law satisfies a log-Sobolev
inequality; an example is when (Xij )1i,jN are independent, Gaussian random
variables with uniformly bounded covariance.
The significance of results such as Lemma 4.14.1 is that they provide bounds
on deviations that do not depend on the dimension n of the random matrix. They
can be used to prove law of large numbers—reducing the proof of the almost sure
convergence to the proof of the convergence in expectation E. They can also be used
to relax the proof of a central limit theorem: when $\alpha = 2$, Lemma 4.14.1 says that the random variable $\mathrm{Tr}(f(X)) - \mathbb{E}\,\mathrm{Tr}(f(X))$ has a sub-Gaussian tail, providing tightness arguments for free.
Theorem 4.14.2 ([225]). Let f TV be the total variation norm,
m
f TV = sup |f (xi ) − f (xi−1 )|.
x1 <···<xm
i=2
X is either Wigner or the Wishart matrices. Then, for any t > 0 and any function f
n
with finite total variation norm so that E n1 f (λi (X)) < ∞,
i=1
- .
1 n n
1 − 8cn t2
P f (λi (X)) − E f (λi (X)) t f 2e X ,
n n TV
i=1 i=1
2 The support of a function is the set of points where the function is not zero-valued, or the closure
of that set. In probability theory, the support of a probability distribution can be loosely thought of
as the closure of the set of possible values of a random variable having that distribution.
n
- n
.
1 1
√ f (λi (X)) − E f (λi (X)) .
n i=1
n i=1
In Theorem 4.14.2, we only require independence of the vectors, rather than the
entries.
A probability measure $P$ on $E^n$ is said to satisfy the logarithmic Sobolev inequality with constant $c$ if, for any differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, we have
$$\int f^2\left(\log f^2 - \log\int f^2\,dP\right)dP \le 2c\int\sum_{i=1}^{n}\left|\frac{\partial f}{\partial x_i}\right|^2dP. \qquad (4.52)$$
Inequality (4.52) implies the sub-Gaussian tails which we use commonly. See [141, 226] for a general study. It suffices to know that the Gaussian law satisfies the logarithmic Sobolev inequality (4.52).
Lemma 4.14.3 (Herbst). Assume that $P$ satisfies the logarithmic Sobolev inequality (4.52) on $\mathbb{R}^n$ with constant $c$. Let $g$ be a Lipschitz function on $\mathbb{R}^n$ with Lipschitz constant $|g|_L$. Then, for all $t \in \mathbb{R}$, we have
$$\int e^{t(g - \mathbb{E}_P(g))}\,dP \le e^{ct^2|g|_L^2/2},$$
Thus
$$x^*Ax = \sigma^2\,\mathrm{Tr}\,A + O\left(t\left(\mathrm{Tr}\,A^*A\right)^{1/2}\right)$$
outside of an event of probability $O\left(e^{-ct}\right)$, where $X = \sum_i x_ix_i^*$.
Theorem 4.15.1. Let $x$ and $\sigma$ be as on the previous page, and let $V$ be a $d$-dimensional complex subspace of $\mathbb{C}^n$. Let $P_V$ be the orthogonal projection onto $V$. Then one has
$$0.9\,d\sigma^2 \le \|P_V(x)\|^2 \le 1.1\,d\sigma^2$$
outside of an event of probability $O\left(e^{-cd}\right)$.
where we have used the triangle inequality for the Euclidean norm. This says that the function $f(x)$ is convex. Similarly, using the Cauchy–Schwarz inequality and the fact that $\|v\|_2 = 1$, Equation (4.53) implies that the function $f(x)$ is 1-Lipschitz.
If $Y$ is an $n \times p$ matrix, we naturally denote by $Y_{i,j}$ its $(i, j)$ entry and call $\bar{Y}$ the matrix whose $j$th column is constant and equal to $\bar{Y}_{\cdot,j}$. The sample covariance matrix of the data stored in matrix $Y$ is
$$S_p = \frac{1}{n-1}\left(Y - \bar{Y}\right)^T\left(Y - \bar{Y}\right).$$
We have
$$Y - \bar{Y} = \left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)Y = \left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)XG,$$
where $I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the centering matrix. Here $\mathbf{1}$ is a column vector of ones and $I_n$ is the $n \times n$ identity matrix. We often encounter the quadratic form
$$\frac{1}{n-1}v^TX^T\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)Xv.$$
Now we use this function as another example to illustrate how to show that a function is convex and Lipschitz, which is required in order to apply Talagrand's inequality.
Example 4.15.3 ($f(x) = f(X) = \left\|\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)Xv\right\|_2$). Convexity is a simple consequence of the fact that norms are convex. The Lipschitz coefficient is $\|v\|_2\left\|I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right\|_2$. The eigenvalues of the matrix $I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ are $n-1$ ones and one zero, i.e.,
$$\lambda_i\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) = 1,\ i = 1, \dots, n-1, \qquad \lambda_n\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) = 0.$$
As a consequence, its operator norm, the largest singular value, is therefore 1, i.e.,
$$\sigma_{\max}\left(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) = \left\|I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right\|_2 = 1.$$
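These spectral facts about the centering matrix are easy to confirm directly (our sketch; $n$ is arbitrary):

```python
import numpy as np

n = 10
H = np.eye(n) - np.ones((n, n)) / n             # centering matrix I_n - (1/n) 1 1^T

print(np.sort(np.linalg.eigvalsh(H)))           # one zero and n-1 ones
print("operator norm:", np.linalg.norm(H, 2))   # equals 1
```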
We now apply the above results to justify the centering process in statistics. In (statistical) practice, we almost always use the centered sample covariance matrix
$$S_p = \frac{1}{n-1}\left(Y - \bar{Y}\right)^T\left(Y - \bar{Y}\right)$$
rather than the uncentered
$$\tilde{S}_p = \frac{1}{n}Y^TY.$$
Sometimes there are debates as to whether one should use the correlation matrix of the data or their covariance matrix. It is therefore important for practitioners to understand the behavior of correlation matrices in high dimensions. The matrix norm (or operator norm) is the largest singular value. In general, the operator norm $\left\|S_p - \tilde{S}_p\right\|_{op}$ does not go to zero. So a coarse bound of the type
$$\left|\lambda_1(S_p) - \lambda_1\left(\tilde{S}_p\right)\right| \le \left\|S_p - \tilde{S}_p\right\|_{op}$$
is not enough to determine the behavior of $\lambda_1(S_p)$ from that of $\lambda_1\left(\tilde{S}_p\right)$.
Letting the centering matrix H = In − n1 11T , we see that
Sp − S̃p = Hn Y
This justifies the assertion that when the norm of a sample covariance matrix (which
is not re-centered) whose entries have mean 0 converges to the right endpoint of the
support of its limiting spectral distribution, so does the norm of the centered sample
covariance matrix.
When dealing with Sp , the mean of the entries of Y does not matter, so we can
assume, without loss of generality, that the mean is zero.
Let us deal with another type of concentration for the quadratic form. Suppose that the random vector $x \in \mathbb{R}^n$ has the property: for any convex and 1-Lipschitz function (with respect to the Euclidean norm) $F: \mathbb{R}^n \to \mathbb{R}$, we have
$$\mathbb{P}\left(|F(x) - m_F| > t\right) \le C\exp\left(-c(n)t^2\right),$$
where $m_F$ is the median of $F(x)$, and $C$ and $c(n)$ are independent of the dimension $n$. We allow $c(n)$ to be a constant or to go to zero with $n$ like $n^{-\alpha}$, $0 \le \alpha \le 1$. Suppose, further, that the random vector has zero mean and covariance $\Sigma$, $\mathbb{E}(x) = 0$, $\mathbb{E}(xx^*) = \Sigma$, with the operator norm (the largest singular value) bounded as $\|\Sigma\|_{op} \le \log(n)$.
Consider a complex deterministic matrix $M$ such that $\|M\|_{op} \le \kappa$, where $\kappa$ is independent of the dimension $n$; then the quadratic form $\frac{1}{n}x^*Mx$ is strongly concentrated around its mean $\frac{1}{n}\mathrm{Tr}(M\Sigma)$.
In other words, strong concentration for $x^*M_rx$ and $x^*M_ix$ will imply strong concentration for the sum of those two terms (quadratic forms). Note that $x^*Ax = \sum_i x_i^*A_{ii}x_i + \cdots$ is real for real numbers $A_{ii}$; so $x^*M_rx$ is real, $(x^*M_rx)^* = x^*M_rx$, and it equals $x^*\frac{M + M^*}{2}x$.
The map is $K$-Lipschitz with coefficient $K = \sqrt{\kappa/n}$ with respect to the Euclidean norm. The map is also convex, by noting that
$$\phi: x \mapsto \sqrt{x^*M_+x/n} = \left\|M_+^{1/2}x/\sqrt{n}\right\|_2.$$
All norms are convex. $\|A\|_2 = \sqrt{\sum_{i,j}A_{i,j}^2}$ is the Euclidean norm.
Theorem 4.15.4 (Lemma A.2 of El Karoui (2010) [230]). Suppose the random
vector z is a vector in Rn with i.i.d. entries of mean 0 and variance σ 2 . The
covariance matrix of z is Σ. Suppose that the entries of z are bounded by 1.
Let A be a symmetric matrix, with the largest singular value σ1 (A). Set Cn =
128 exp(4π)σ1 (A) /n. Then, for all t/2 > Cn , we have
√ 2
−n(t/2−Cn )2 /32/ 1+2 σ1 (Σ)
P n1 zT Az − n1 σ 2 Tr(A) > t 8 exp(4π)e
/σ1 (A)
√ 2
−n/32/ 1+2 σ1 (Σ) /σ1 (A)
+8 exp(4π)e .
and $a_- = \sup\left\{\sup_{i=1,\dots,n}(-a_i),\ 0\right\}$. Then the two concentration inequalities hold true for all $t > 0$:
$$\mathbb{P}\left(\sum_{i=1}^{n}a_iz_i^2 + \sum_{i=1}^{n}b_iz_i \ge \sum_{i=1}^{n}a_i + 2\sqrt{\left(\sum_{i=1}^{n}a_i^2 + \tfrac{1}{2}\sum_{i=1}^{n}b_i^2\right)t} + 2a_+t\right) \le e^{-t},$$
$$\mathbb{P}\left(\sum_{i=1}^{n}a_iz_i^2 + \sum_{i=1}^{n}b_iz_i \le \sum_{i=1}^{n}a_i - 2\sqrt{\left(\sum_{i=1}^{n}a_i^2 + \tfrac{1}{2}\sum_{i=1}^{n}b_i^2\right)t} - 2a_-t\right) \le e^{-t}.$$
and for any $t \in \left(0, \sqrt{\frac{\mathrm{Tr}(A)}{\|A\|_{op}}} - 1\right)$, we have
$$\mathbb{P}\left(\frac{z^TAz}{\mathrm{Tr}(A)} \le \left(1 - \sqrt{\frac{\|A\|_{op}}{\mathrm{Tr}(A)}} - t\sqrt{\frac{\|A\|_{op}}{\mathrm{Tr}(A)}}\right)^2\right) \le e^{-t^2/2}. \qquad (4.58)$$
Here ||X||op denote the operator or matrix norm (largest singular value) of matrix
A
op
X. It is obvious that Tr(A) is less than or equal to 1. The error terms involve the
operator norm as opposed to the Frobenius norm and hence are of independent
interest.
Proof. We follow [237] for a proof. First, $f(z) = \sqrt{z^TAz} = \left\|A^{1/2}z\right\|_2$ has Lipschitz constant $\sqrt{\|A\|_{op}}$ with respect to the Euclidean norm on $\mathbb{R}^n$. By the Cirel'son–Ibragimov–Sudakov inequality for Lipschitz functions of Gaussian vectors [161], it follows that for any $s > 0$,
$$\mathbb{P}\left(f(z) \le \mathbb{E}[f(z)] - s\right) \le \exp\left(-\frac{s^2}{2\|A\|_{op}}\right). \qquad (4.59)$$
From the Poincaré inequality for Gaussian measures [238], the variance of $f(z)$ is bounded above as
$$\mathrm{var}[f(z)] \le \|A\|_{op}.$$
Since $\mathbb{E}\left[f(z)^2\right] = \mathrm{Tr}(A)$, the expectation of $f(z)$ is lower bounded as
$$\mathbb{E}[f(z)] \ge \sqrt{\mathrm{Tr}(A) - \|A\|_{op}}.$$
Inserting this lower bound into the concentration inequality (4.59) gives
$$\mathbb{P}\left(f(z) \le \sqrt{\mathrm{Tr}(A) - \|A\|_{op}} - s\right) \le \exp\left(-\frac{s^2}{2\|A\|_{op}}\right).$$
"
norm is defined as A F = A2i,j = Tr (A2 ). For a vector x, ||x|| denotes
i,j
the standard Euclidean norm—2 -norm.
Theorem 4.15.9 (Concentration of Gaussian random variables [239]). Let
a, b ∈ R. Assume that max {|a| , |b|} α > 0. Let X ∼ N (0, 1) . Then, for t > 0
max { a 2 , A F} α > 0.
H0 : y = x
H1 : y = x + z
n
Tr P2 = Tr (P) = pii = d
i=1
and |pij | 1. Furthermore, the distance between a random vector x and a subspace
V is defined by the orthogonal projection matrix P in a quadratic form
∗
dist (x, V) = (Px) (Px) = x∗ P∗ Px = x∗ PPx = x∗ P2 x = x∗ Px
2
n
pij ξi ξj∗ = pij ξi ξj∗ .
2
= pi i |ξi | + (4.63)
1i=jn i=1 1i=jn
n
For instance, for a spectral decomposition of A, A = λi ui uH
i , we can define
i=1
d
P= i , where d ≤ n. For a random matrix A, the projection matrix P is
ui uH
i=1
also a random matrix. What is the surprising is that the distance between a random
vector and a subspace—the orthogonal projection of a random vector onto a large
space—is strongly concentrated. This tool has a geometric flavor.
The distance dist (x, V) is a (scalar-valued) random variable. It is easy to show
2
that E dist (x, V) = d so that √ it is indeed natural to expect that with high
probability dist (x, V) is around d.
We use a complex version of Talagrand’s inequality, obtained by slightly
modifying the proof [141].
Theorem 4.16.1. Let D be the unit disk {z ∈ C : |z| 1} . For every product
probability P on Dn , every convex 1-Lipschitz function f : Cn → R and every
t ≥ 0,
2
P (|f − M (f )| t) 4e−t /16
,
2
P (|X − M (X)| t) 4e−t .
Then
The bound 100 is ad hoc and can be replaced by a much smaller constant.
Proof. Set M = M(X) and let F (x) be the distribution function of X. We have
M −i+1
2
E (X) = x∂F (x) M + 4 |i|e−i M + 100.
i=1 M −i i
√
M (X) − d 2K.
√
Consider the event E+ that |dist (x, V)| d + 2K, which implies that
2
√
|dist (x, V)| d + 4K d + 4K 2 .
n
2
Set S1 = pi i |ξi | − 1 . Then it follows from Chebyshev’s inequality that
i=1
n √ √ E |S1 |
2
2
P pi i |ξi | d + 2K d P |S1 | 2K d .
i=1
4dK 2
n 2 n n
2 2 4
E |S1 | = p2ii E |ξi | − 1 = p2ii E |ξi | − 1 p2ii K = dK.
i=1 i=1 i=1
Thus we have
√ E |S 1 |
2
1 1
P |S1 | 2K d .
4dK 2 K 10
∗ 2
Similarly, set S2 = pi j ξi ξj . Then we have E S22 = |pij | d.
1i=jn i=j
Again, using Chebyshev’s inequality, we have
√ d 1 1
P |S2 | 2K d .
4dK 2 K 10
√
Combining the results, it follows that P (E+ ) 15 , and so M (dist (x, V)) d +
2K. √
To prove the lower bound, let E− √ be the event that dist (x, V) d−2K which
2
implies that dist (x, V) d − 4K d + K 2 . We thus have
√ √ √
2
P (E+ ) P dist (x, V) d −2K d P S1 d −K d + P S1 K d .
4.17 Concentration of Random Matrices in the Stieltjes Transform Domain 261
1
Both terms on the right-hand side can be bounded by 5 by the same arguments as
above. The proof is complete.
Example. Similarity Based Hypothesis Detection
A subspace V is the subspace spanned by the first k eigenvectors (corresponding
to the largest k eigenvalues).
As the Fourier transform is the tool of choice for a linear time-invariant system, the
Stieltjes transform is the fundamental tool for studying the random matrix. For a
random matrix A of n × n, we define the Stieltjes transform as
1 −1
mA (z) = Tr (A − zI) , (4.65)
n
where I is the identity matrix of n × n. Here z is a complex variable.
Consider xi , i = 1, . . . , N independent random (column) vectors in Rn . xi xTi
is a rank-one matrix of n × n. We often consider the sample covariance matrix of
n × n which is expressed as the sum of N rank-one matrices
N
S= xi xTi = XXT ,
i=1
Similar to the case of the Fourier transform, it is more convenient to study the
Stieltjes transform of the sample covariance matrix S defined as
1 −1
mS (z) = Tr (S − zI) .
n
Note that (4.66) makes no assumption whatsoever about the structure of the vectors
n
{xi }i=1 , other than the fact that they are independent.
Proof. We closely follow El Karou [229] for a proof. They use a sum of martingale
difference, followed by Azuma’s inequality[141, Lemma 4.1]. We define Sk = S −
i
xk xTk . Let Fi denote the filtration generated by random vectors {xl }l=1 . The first
classical step is (from Bai [198, p. 649]) to express the random variable of interest
as a sum of martingale differences:
n
mS (z) − EmS (z) = E (mS (z) |Fk ) − E (mS (z) |Fk−1 ) .
k=1
Note that
−1 −1
E Tr (Sk − zI) |Fk = E Tr (Sk − zI) |Fk−1 .
So we have that
The last inequality follows[243, Lemma 2.6]. As a result, the desired random
variable mS (z) − EmS (z) is a sum of bounded martingale differences. The same
would be true for both real and imaginary parts. For both of them, we apply Azuma’s
inequality[141, Lemma 4.1] to obtain that
P (|Re (mS (z) − EmS (z))| > t) 2 exp −t2 n2 v 2 / (8N ) ,
4.17 Concentration of Random Matrices in the Stieltjes Transform Domain 263
The decay rate given by Azuma’s inequality does not match the rate that
appears in results concerning concentration behavior of linear spectral √ statistics (see
Sect. 4.14). The rate given by Azuma’s inequality is n, rather than n. This decay
rate is very important in practice. It is explicitly related to the detection probability
and the false alarm, in a hypothesis testing problem.
The results using Azuma’s inequality can handle many situations that are not
covered by the current results on linear spectral statistics in Sect. 4.14. The “correct”
rate can be recovered using ideas similar to Sect. 4.14. As a matter of fact, if
we consider the Stieltjes transform of the measure that puts mass 1/n at each
singular values of the sample covariance matrix M = X∗ X/N, it is an easy
exercise to
√ this function of X is K-Lipschitz with the Lipschitz coefficient
show that
K = 1/ nN v 2 , with respect to the Euclidean norm (or Frobenius or Hilbert-
Schmidt) norm.
Example 4.17.1 (Independent Random Vectors). Consider the hypothesis testing
problem
H0 : yi = wi , i = 1, .., N
H1 : yi = xi + wi , i = 1, .., N
where N independent vectors are considered. Here wi is a random noise vector and
si is a random signal vector. Considering the sample covariance matrices, we have
N N
H0 : S = yi yiT = wi wiT = WWT
i=1 i=1
N N
T T
H1 : S = yi yiT = (xi + wi ) (xi + wi ) = (X + W) (X + W) .
i=1 i=1
1 −1
H0 : mS (z) = Tr WWT − zI
n
1 −1
T
H1 : mS (z) = Tr (X + W) (X + W) − zI .
n
The Stieltjes transform is strongly concentrated, so
264 4 Concentration of Eigenvalues and Their Functionals
H0 : P (|mS (z) − EmS (z)| > t) 4 exp −t2 n2 v 2 / (16N ) ,
1 −1 1 −1
where EmS (z) = E Tr WWT − zI = Tr E WWT − zI ,
n n
2 2 2
H1 : P (|mS (z) − EmS (z)| > t) 4 exp −t n v / (16N ) ,
1 −1
T
where EmS (z) = Tr E (X + W) (X + W) − zI .
n
Note, both the expectation and the trace are linear functions so they commute—their
order can be exchanged. Also note that
−1
(A + B) = A−1 + B−1 .
In fact
Then we have
−1
−1
T
WWT − zI − (X + W) (X + W) − zI
−1 −1
T
= WWT − zI XXT + XWT + WXT (X + W) (X + W) − zI .
1 1 1
log (I + A) = A − A2 + A3 − A4 + · · · , ρ (A) < 1,
2 3 4
−1 1 1 1
(I − A) = A − A2 + A3 − A4 + · · · , ρ (A) < 1,
2 3 4
where ρ (A) is the (spectral) radius of convergence[20, p. 77]. The series for
−1
(I − A) is also called Neumann series [23, p. 7].
are important [245]. There are essentially two such inequalities known, the so-
called basic inequalities known as strong subadditivity and weak monotonicity. To
be precise, we are interested in only linear inequalities involving the entropies
of various reduced states of a multiparty quantum state. We will demonstrate
how the concentration inequalities are established for these basic inequalities. One
motivation is for quantum information processing.
For any quantum state described by a Hermitian positive semi-definite matrix ρ,
the von Neumann entropy of ρ is defined as
f (X) = Uf (D) UH
n
S(X) = − Tr (X log X) = − λi (X) log λi (X) (4.68)
i=1
1 1 2 2
− n t2
P Tr (−X log X) − E Tr (−X log X) t 2e 4c|g|L . (4.70)
n n
The second inequality above is called the subadditivity for the von Neumann
entropy. The first one, called triangular inequality (also known as Araki-Lieb
inequality [246]), is regarded as the quantum analog of the inequality
H(X) H(X, Y )
for Shannon entropy H(X) and H(X, Y ) the joint entropy of two random variables
X and Y.
The strong subadditivity (SSA) of the von Neumann entropy proved by Lieb
and Ruskai [247, 248] plays the same role as the basic inequalities for the classical
entropy. For distinct quantum systems A, B, and C, strong subadditivity can be
represented by the following two equivalent forms
The expression on the left-hand side of the SSA inequality is known as the (quan-
tum) conditional mutual information, and is commonly denoted as I (A : B |C ) .
Inequality (WM) is usually called weak monotonicity.
As pointed above, we are interested in only linear inequalities involving the
entropies of various reduced states of a multiparty quantum state. Let us consider
the (quantum) mutual information
where ρA , ρB , ρAB are random matrices. Using the technique applied to treat the
von Neumann entropy (4.67), we similarly establish the concentration inequalities
like (4.70), by first evaluating the Lipschitz constant |I (A : B)|L of the function
I (A : B) and I (A : B |C ) . We can even extend to more general information
4.19 Supremum of a Random Process 267
Sub-Gaussian random variables (Sect. 1.7) is a convenient and quite wide class,
which includes as special cases the standard normal, Bernoulli, and all bounded
random variables. Let (Z1 , Z2 , . . .) be (possibly dependable) mean-zero sub-
Gaussian random variables, i.e., E [Zi ] = 0, according to Lemma 1.7.1, there exists
constants σ1 , σ2 , . . . such that
t2 σi2
E [exp (tZi )] exp , t ∈ R.
2
which, according to Lemma 1.7.1, implies that these functionals are sub-Gaussian
random variables. We modify the arguments of [113] for our context.
Now let X = diag (Z1 , Z2 , . . .) be the random diagonal matrix with the Zi on
the diagonal. Since E [Zi ] = 0, we have EX = 0. By the operator monotonicity of
the matrix logarithm (Theorem 1.4.7), we have that
t2 σ12 t2 σ22
log E [exp (tX)] diag , ,... .
2 2
Due to (1.27): λmax (A + B) λmax (A) + λmax (B) , the largest eigenvalue has
the following relation
268 4 Concentration of Eigenvalues and Their Functionals
2 2 2 2
t σ t σ
λmax (log E [exp (tX)]) λmax diag 2 1 , 2 2 , . . .
t2 σi2 t2
sup 2 = 2 v,
i
and
2 2 2 2
t σ t σ
Tr (log E [exp (tX)]) Tr diag 2 1 , 2 2 , . . .
t2 σi2 t2
2 2
= 2 = 2 σi = t 2vκ .
i i
Letting t = 2 (τ + log κ) > 2.6 for τ > 0 and interpreting λmax (X) as sup Zi ,
i
finally we have that
⎛ : ⎛ ⎞⎞
; 2
; σ
⎜ ; i
⎟⎟
P⎜ Zi > ; 2 ⎜
<2 sup σi ⎝log sup σ 2 + τ ⎠⎟
i −τ
⎝sup
i i
⎠e . (4.74)
i
i
Consider the special case: Zi ∼ N (0, 1) are N i.i.d. standard Gaussian random
variables. Equation (4.74) says that the largest of the Zi is O(log N + τ ) with
probability at least 1 − e−τ . This is known to be tight up to constants so the
log N term cannot be removed. Besides, (4.74) can be applied to a countably
infinite number of mean-zero Gaussian random variables Zi ∼ N 0, σi2 , or more
generally, sub-Gaussian random variables, as long as the sum of the σi2 is finite.
Reference [180] is the first paper to study the concentration of the spectral measure
for large matrices.
Concentration of eigenvalues in kernel space [253–255]. We may use a ker-
nel [159] to map the data to the high-dimensional feature space, even if the data
samples are in low-dimensional space. We can exploit the high-dimensional space
for concentration of eigenvalues in kernel space [254,255]. Concentration of random
matrices in the kernel space is studied in [229, 230, 256, 257].
4.20 Further Comments 269
Chapters 4 and 5 are the core of this book. Chapter 6 is included to form the
comparison with this chapter. The development of the theory in this chapter will
culminate in the sense of random matrices. The point of viewing this chapter
as a novel statistical tool will have far-reaching impact on applications such
as covariance matrix estimation, detection, compressed sensing, low-rank matrix
recovery, etc. Two primary examples are: (1) approximation of covariance matrix;
(2) restricted isometry property (see Chap. 7).
The non-asymptotic, local theory of random matrices is in its infancy. The goal
of this chapter is to bring together the latest results to give a comprehensive account
of this subject. No attempt is made to make the treatment exhaustive. However, for
engineering problems, this treatment may contain the main relevant results in the
literature.
The so-called geometric functional analysis studies high-dimensional sets and
linear operators, combining ideas and methods from convex geometry, functional
analysis and probability. While the complexity of a set may increase with the
dimension, it is crucial to point out that passing to a high-dimensional setting may
reveal properties of an object, which are obscure in low dimensions. For example,
the average of a few random variables may exhibit a peculiar behavior, while the
average of a large number of random variables will be close to a constant with high
probability. This observation is especially relevant to big data [4]: one can do at a
large scale that cannot be done at a smaller one, to extract new insights.
Another idea is probabilistic considerations in geometric problems. To prove the
existence of a section of a convex body having a certain property, we can show that
a random section possesses this property with positive probability. This powerful
method allows to prove results in situations, where deterministic constructions are
unknown, or unavailable.
In studying spectral properties of random matrices, the connection between the
areas is interesting: the origins of the problems are purely probabilistic, while the
methods draw from functional analysis and convexity.
R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 271
DOI 10.1007/978-1-4614-4544-9 5,
© Springer Science+Business Media New York 2014
272 5 Non-asymptotic, Local Theory of Random Matrices
| x, y | x y ∗
K o = {y ∈ Rn : x, y 1 for all x ∈ K}
λ (1−λ)
|λA + (1 − λ) B| |A| |B| .
H0 : A
H1 : A + B
where A and B are two non-empty compact subsets of Rn . It follows from (5.1)
that: Claim H0 if
1/n 1/n 1/n
|A + B| − |B| γ |A| ,
x, y dx = 0
K
b (K) ε,
Theorem 5.2.4 (Kannanand Lovász and Simonovits [266]). Given 0 < δ, ε < 1,
there exists a randomized algorithm finding an affine transformation A such that
AK is in ε-nearly isotropic position with probability at least 1 − δ. The number of
oracle calls is
O ln (εδ) n5 ln n .
T
Σ (K) = EK (x − b) (x − b) .
The trace of Σ (K) is the average square distance of points of K from the center of
gravity, which we also call the second moment of K. We recall from the definition
of the isotropic position. The body K ⊆ Rn is isotropic position if and only if
b = 0 and Σ (K) = I,
2 2 2
E x, y = x, y dμ(x) = y 2 .
Rn
λ 1−λ
μ (λA + (1 − λ)B) μ(A) μ(B) .
λ 1−λ
P (x ∈ λA + (1 − λ) B) P(x ∈ A) P(x ∈ B) .
276 5 Non-asymptotic, Local Theory of Random Matrices
Theorem 5.3.1 (Paouris [272]). There exists an absolute constant c > 0 such that
if K is an isotropic convex body in Rn , then
√ √
P x ∈ K : x 2 c nLK t e−t n
for every t ≥ 1.
Theorem 5.3.2 (Paouris [272]). These exists constants c, C√> 0 such that for any
isotropic, log-concave random vector x in Rn , for any p ≤ c n,
1/2
p 1/p 2
(E x 2 ) C E x 2 . (5.6)
for any y ∈ Rn . For any nondegenerate log-concave vector x, there exists an affine
transformation T such that Tx is isotropic.
n
2 1/2
For x ∈ Rn , we define the Euclidean norm x 2 as x 2 = xi . More
i=1
generally, the lp norm is defined as
n
1/p
p
x p = |xi | .
i=1
which is also sample covariance matrix. If N is sufficiently large, then with high
probability
/ /
/ / 1
/Σ̂ − Σ/ , Σ= (y ⊗ y),
vol (K) K
will be small. Here Σ is also the true covariance matrix. Kannan et al. [266] proved
2
that it is enough to take N = c nε for some constant c. This result was greatly
improved by Bourgain [273]. He has shown that one can take N = C (ε) nlog3 n.
278 5 Non-asymptotic, Local Theory of Random Matrices
Since the situation is invariant under a linear transformation, we may assume that the
body K is in the isotropic position. The the result of Bourgain may be reformulated
as follows:
Theorem 5.4.1 (Bourgain [273]). Let K be a convex body in Rn in the isotropic
position. Fix ε > 0 and choose independently N random points x1 , . . . , xN ∈ K,
N C (ε) nlog3 n.
The work of Rudelson [93] is well-known. He has shown that this theorem follows
from a general result about random vectors in Rn . Let y be a random vector. Denote
by EX the expectation of a random variable X. We say that y is the isotropic
position if
E (y ⊗ y) = I. (5.7)
so to make the right hand side of (5.8) smaller than 1, we have to assume that
N cn log n.
Proof. The proof has two steps. The first step is relatively standard. First we
introduce a Bernoulli random process and estimate the expectation of the norm
in (5.8) by the expectation of its supremum. Then, we construct a majorizing
measure to obtain a bound for the latest.
5.4 Rudelson’s Theorem 279
We have
1/2 N 2/ log N 1/2
log N
E max yi E yi
i=1,...,N i=1 i
1/ log N
log N
N 1/ log N
· E y .
Then, denoting
/ /
/1 N /
/ /
D = E/ yi ⊗yi − I/ ,
/N /
i=1
If
√
log N log N
1/ log N
C· √ · E y 1,
N
we arrive at
√
log N log N
1/ log N
D 2C · √ · E y ,
N
which completes the proof of Theorem 5.4.2.
Let us apply Theorem 5.4.2 to the problem of Kannan et al. [266].
Corollary 5.4.4 (Rudelson [93]). Let ε > 0 and let K be an n-dimensional convex
body in the isotropic position. Let
n n
N C· 2
· log2 2
ε ε
and let y1 , . . . , yN be independent random vectors uniformly distributed in K. Then
/ /
/1 N /
/ /
E/ yi ⊗yi − I/ ε.
/N /
i=1
Corollary 5.4.4 follows from this estimate and Theorem 5.4.2. By a Lemma of
Borell [167, Appendix III], most of the volume of a convex √ body in the isotropic
position is concerned within the Euclidean ball of radius c n. So, it is might be of
interest to consider a random vector uniformly distributed in the intersection of a
convex body K and such a ball B2n .
Corollary 5.4.5 (Rudelson [93]). Let ε, R > 0 and let K be an n-dimensional
convex body in the isotropic position. Suppose that R c log 1/ε and let
R2
N C0 · · log n
ε2
5.5 Sample Covariance Matrices with Independent Rows 281
Singular values of matrices with independent rows (without assuming that the
entries are independent) are treated here, with material taken from Mendelson and
Pajor [274].
Let us first introduce a notion of isotropic position. Let x be a random vector
selected randomly from a convex symmetric body in Rn which is in an isotropic
position. By this we mean the following: let K ⊂ Rn be a convex and symmetric
set with a nonempty interior. We say that K is in an isotropic position if for any
y ∈ Rn ,
1 2 2
| y, x | dx = y ,
vol (K) K
where the volume and the integral are with respect to the Lebesgue measure on Rn
and ·, · and · are, respectively, the scalar product and the norm in the Euclidean
space l2n . In other words, if one considers the normalized volume measure on K and
x is a random vector with that distribution, then a body is in an isotropic position if
for any y ∈ Rn ,
2 2
E| y, x | = y .
N
Let x be a random vector on Rn and consider {xi }i=1 which are N independent
random vectors distributed as x. Consider the random operator X : Rn →
RN defined by
⎡ T⎤
x1
⎢ xT ⎥
⎢ 2 ⎥
X=⎢ . ⎥
⎣ .. ⎦
xTN N ×n
N
where {xi }i=1 are independent random variables distributed according to the
normalized volume measure on the body K. A difficulty arises when the matrix
X has dependent entries; in the standard setup in the theory of random matrices,
one studies matrices with i.i.d. entries.
282 5 Non-asymptotic, Local Theory of Random Matrices
N
X∗ X = xi ⊗ xi . We will show that under very mild conditions on x, with high
i=1
probability,
/ /
/1 N /
/ /
/ x i ⊗ x i − Σ/ (5.10)
/N /n
i=1 l2 →l2n
N
1 2 1 2
1−ε xi , y = Σy 1 + ε.
N i=1
N
Equivalently, this theorem says that N1 Σ : l2n → l2N is a good embedding of l2n .
When the random vector x has independent, standard Gaussian coordinates it is
known that for any y ∈ S n−1 ,
" N "
n 1 2 n
1−2 xi , y 1+2
N N i=1
N
holds with high probability (see the survey [145, Theorem II.13]). In the Gaussian
case, Theorem 5.5.1 is asymptotically optimal, up to a numerical constant.
Bourgain’s result was improved by Rudelson [93], who removed one power of
the logarithm while proving a more general statement.
Theorem 5.5.2 (Rudelson [93]). There exists an absolute constant C for which the
following holds. Let y be a random vector in Rn which satisfies that E (y ⊗ y) = I.
Then,
/ / "
/1 N / log N 1/ log N
/ / log N
E/ xi ⊗ xi −I/ C E y .
/N / N
i=1
5.5 Sample Covariance Matrices with Independent Rows 283
A proof of this theorem is given in Sect. 2.2.1.2, using the concentration of matrices.
For a vector-valued random variable y and k ≥ 1, the ψk norm of y is
+ ,
k
|y|
y ψk = inf C > 0 : E exp 2 .
Ck
A standard argument [13] shows that ify has a bounded ψk norm, then the tail of y
2
decays faster than 2 exp −tk / y ψk .
Before introducing the main result of [274], we give two results: a well known
symmetrization theorem [13] and Rudelson [93]. A Rademacher random variable is
a random variable taking values ±1 with probability 1/2.
Theorem 5.5.4 (Symmetrization Theorem [13]). Let Z be a stochastic process
indexed by a set F and let N be an integer. For every i ≤ N, let μi : F → R
N
be arbitrary functions and set {Zi }i=1 to be independent copies of Z. Under mild
topological conditions on F and (μi ) ensuring the measurability of the events below,
for any t > 0,
N N
t
βN (t) P sup Zi (f ) > t 2P sup εi (Zi (f ) − μi (f )) > ,
f ∈F i=1
f ∈F i=1
2
N
where {εi }i=1 are independent Rademacher random variables and
N
t
βN (t) = inf P Zi (f ) < .
f ∈F 2
i=1
N
We express the operator norm of (xi ⊗ xi − Σ) as the supremum of an
i=1
empirical process. Indeed, let Y be the set of tensors v ⊗ w, where v and w are
vectors in the unit Euclidean ball. Then,
284 5 Non-asymptotic, Local Theory of Random Matrices
N
(x ⊗ x − Σ) = sup x ⊗ x − Σ, y .
i=1 y∈Y
N
Consider the process indexed by Y defined by Z (y) = 1
N xi ⊗ xi − Σ, y .
i=1
Clearly, for every y, EZ (y) = 0 (the expectation is a linear operator), and
/ /
/1 N /
/ /
sup Z (y) = / (xi ⊗ xi − Σ)/ .
y∈Y /N /
i=1
N
1 4
var xi ⊗ xi − Σ, y sup E| x, z | ρ4 .
N i=1 z∈S n−1
ρ4
βN (t) 1 − .
N t2
Corollary 5.5.5. Let x be a random vector which satisfies Assumption 5.5.3 and let
x1 , . . . , xN be independent copies of x. Then,
/ / / /
/ N / / N / tN
/ / / /
P / xi ⊗ xi − Σ/ tN 4P / εi x i ⊗ x i / > ,
/ / / / 2
i=1 i=1
provided that x c ρ4 /N , for some absolute constant c.
Next, we need to estimate the norm of the symmetric random (vector-valued)
N
variables εi xi ⊗ xi . We follow Rudelson [93], who builds on an inequality due
i=1
to Lust-Piquard and Pisier [276].
Theorem 5.5.6 (Rudelson [93]). There exists an absolute constant c such that for
any integers n and N, and for any x1 , . . . , xN ∈ Rn and any k ≥ 1,
5.5 Sample Covariance Matrices with Independent Rows 285
⎛ / /k ⎞1/k / /1/2
/ N / 0 √ 1/ N /
⎝ E/
/
/ ⎠
εi xi ⊗ xi / c max
/
log n, k /
/
xi ⊗ xi / max xi ,
/ / / / 1iN
i=1 i=1
N
where {εi }i=1 are independent Rademacher random variables.
This moment inequality immediately give a ψ2 estimate on the random variable
N
εi x i ⊗ x i .
i=1
Corollary 5.5.7. These exists an absolute constant c such that for any integers n
and N, and for any x1 , . . . , xN ∈ Rn and any t > 0,
/ / 2
/ N /
/ / t
P / εi xi ⊗ xi / ≥ t 2 exp − 2 ,
/ / Δ
i=1
/ /1/2
√ /N /
where Δ = c log n/
/ x i ⊗ x /
i/ max xi .
i=1 1iN
Finally, we are ready to present the main result of Mendelson and Pajor [274] and
its proof.
Theorem 5.5.8 (Mendelson and Pajor [274]). There exists an absolute constant
c for which the following holds. Let x be a random vector in Rn which satisfies
Assumption 5.5.3 and set Z = x . For any integers n and N let
√ 1/α
log n(log N ) ρ2 1/2
An,N = Z ψα √ and Bn,N = √ + Σ An,N .
N N
−1
where β = (1 + 2/α) and Σ = E (x ⊗ x) .
Proof. First, recall that if z is a vector-valued random variable with a bounded ψα
norm, and if z1 , . . . , zN are N independent copies of z, then
/ /
/ /
/ max zi / C z 1/α
/1iN / ψα log N,
ψα
1/k
k
E max |zi | Ck 1/α z ψα log
1/α
N. (5.11)
1iN
/ /1/2 / /
√ /N / / /
where Δ = c log n/ xi ⊗ xi / /max xi /
/ / /
/ for some constant c. Setting c0 to
i=1 1iN
be the constant in Corollary
5.5.5, then by Fubini’s theorem and dividing the region
of integration
to t c 0 ρ4 /N (in this region there is no control on P (V t)) and
4
t > c0 ρ /N , it follows that the k-th order moments are
) ∞ k−1
EV k = kt P (V t) dt
) c0 √ρ4 /N k−1
0
)∞
= 0 √ kt P (V t) dt + 0 ktk−1 P (V t) dt
) c ρ4 /N k−1 )∞ 2 2
00 kt P (V t) dt + 8Ex 0 ktk−1 exp − t ΔN2 dt
k Δ k
c0 ρ4 /N + ck k k/2 Ex N
k/2 //k/2 / /k
1/
/
/ /
N /
c k
N E
k log n
xi ⊗ xi /
N/
/
/ /max xi /
/
i=1 1iN
/ / k/2
k/2 / N /
1 / / k
c k k log n
N E N / xi ⊗ xi − Σ/ + Σ max xi
1iN
1/2
i=1
k/2 1/2
k 2k
ck k log
N
n
E V + Σ E max xi
1iN
for some new absolute constant c. Thus, setting Z = x and using Assump-
tion 5.5.3 and (5.11), we obtain
1/k 1/2 1/k 1/2
ρ2 1 1 log n
EV k c √ + ck α + 2 log1/α N Zψα EV k + Σ ,
N N
5.5 Sample Covariance Matrices with Independent Rows 287
1/2
log n
for some new absolute constant c. Set An,N = N log1/α N Z ψα and
−1
β = (1 + 2/α) . Thus,
and thus,
V ψβ c max Bn,N , A2n,N ,
√
The second case is when Z ψα c1 n, where x is a random vector associated
with a convex body in an isotropic position.
Corollary 5.5.10 (Mendelson and Pajor [274]). These exists an absolute constant
√ x be a random vector in R that satisfies
n
c for which the following holds. Let
Assumption 5.5.3 with Z ψα c1 n. Then, for any t > 0
/ /
/ N /
/ /
P / xi ⊗ xi −Σ/ tN
/ /
i=1
⎛ + " ,1/2 ⎞
ρ2 (n log n) log N 2 (n log n) log N
exp ⎝−c x/ max √ +ρc1 , c1 ⎠.
N N N
Let us consider two applications: the singular values and the integral operators.
For the first one, the random vector x corresponds to the volume measure of some
convex symmetric body in an isotropic position.
288 5 Non-asymptotic, Local Theory of Random Matrices
Corollary 5.5.11 (Mendelson and Pajor [274]). These exist an absolute constant
c1 , c2 , c3 , c4 for which the following holds. Let K ⊂ Rn be a symmetric convex body
in an isotropic position, let x1 , . . . , xN be N independent points sampled according
to the normalized volume measure on K, and set
⎡ ⎤
xT1
⎢ xT ⎥
⎢ 2 ⎥
X=⎢ . ⎥
⎣ .. ⎦
xTN
√ 1/4
√ √ √ N
P 1 − t N λi 1 + t N 1 − exp −c2 t1/2 .
(log N ) (n log n)
√
2. If N > c3 n, then with probability at least 1/2, λ1 c4 N log n.
Example 5.5.12 (Learning Integral Operators [274]). Let us apply the results
above to the approximation of the integral operators. See also [277] for a learning
theory from an approximation theory viewpoint. A compact integral operator with
a symmetric kernel is approximated by the matrix of an empirical version of the
operator [278]. What sample size yields given accuracy?
Let Ω ∈ Rd and set ν to be a probability measure on Ω. Let t be a random
∞
variable on Ω distributed based on ν and consider X(t) = λi φi (x)φi , where
i=1
∞ ∞
{φi }i=1 is a complete basis in L(Ω, μ) and {λi }i=1 ∈ l1 .
Let L be a bounded, positive-definite kernel [89] on some probability space
(Ω, μ). (See also Sect. 1.17 for the background on positive operators.) By Mercer’s
∞
theorem, these is an orthonormal basis of L(Ω, μ), denoted by {φi }i=1 such that
∞
L (t, s) = λi φi (x)φi (s) .
i=1
Thus,
and the square of the singular values of the random matrix Σ are the eigenvalues of
the Gram matrix (or empirical covariance matrix)
N
G = ( X (ti ) , X (tj ) )i,j=1
5.5 Sample Covariance Matrices with Independent Rows 289
TL = L (x, y) f (y)dν.
This question is useful to kernel-based learning [279]. It is not clear how the
eigenvalues of the integral operator should be estimated from the given data in the
form of the Gram matrix G. Our results below enable us to do just that; Indeed, if
∞
X(t) = λi φi (x)φi , then
i=1
∞
E (X ⊗ X) = λi φ i , · φ i = TL .
i=1
∞
where {λi }i=1 are the eigenvalues of TL , the integral operator associated with L
∞
and μ, and {φi }i=1 are complete bases in L(μ). Also TL is a trace-class operator
∞ )
since λi = L (x, x)dμ (x).
i=1
To apply Theorem 5.5.8, we encounter a difficulty since X(t) if of infinite
dimensions. To overcome this, define
n
Xn (t) = λi φi (x)φi ,
i=1
where n is to be specified later. Mendelson and Pajor [274] shows that Xn (t) is
sufficiently close to X(t); it follows that our problem is really finite dimensional.
A deviation inequality of (5.10) enables us to estimate with high probability the
eigenvalues of the integral operators (infinite dimensions) using the eigenvalues of
the Gram matrix or empirical covariance matrix (finite dimensions). This problem
was also studied in [278], as pointed out above.
Learning with integral operators [127, 128] is relevant.
Example 5.5.13 (Inverse Problems as Integral Operators [88]). The inverse prob-
lems may be formulated as integral operators [280], that in turn may be approxi-
mated by the empirical version of the integral operator, as shown in Example 5.5.12.
Integral operator are compact operators [89] in many natural topologies under
very weak conditions on the kernel. Many examples are given in [281].
290 5 Non-asymptotic, Local Theory of Random Matrices
Once the connection between the integral operators and the concentration
inequality is recognized, we can further extend this connection with electromagnetic
inverse problems such as RF tomography [282, 283].
For relevant work, we refer to [127, 284–287].
for every t ≥ 1.
Bobkov and Nazrov [294, 295] have clarified the picture of volume distribution
on isotropic, unconditional convex bodies. A symmetric convex body K is called
unconditional if, for every choice of real numbers ti , and every choice of εi ∈
{−1, 1}, 1 ≤ i ≤ n,
ε1 t1 e1 + · · · + εn tn en K = t1 e1 + · · · + tn en K,
5.6 Concentration for Isotropic, Log-Concave Random Vectors 291
for every t ≥ 1.
Reference [291] gives a short proof of Paouris’ inequality. Assume that x has a log-
concave distribution (a typical example of such a distribution is a random vector
uniformly distributed on a convex body). Assume further that it is centered and
its covariance matrix is the identity such a random vector will be called isotropic.
The tail behavior of the Euclidean norm ||x||2 of an isotropic, log-concave random
vector x ∈ Rn , states that for every t > 1,
√ √
P x 2 ct n e−t n .
More precisely, we have that for any log-concave random vector x and any p ≥ 1,
p 1/p p 1/p
(E x 2 ) ∼E x 2 + sup (E| y, x | ) .
y∈S n−1
This result had a huge impact on the study of log-concave measures and has a lot of
applications in that subject.
Let x ∈ Rn be a random vector, denote the weak p-th moment of x by
p 1/p
σp (x) = sup (E| y, x | ) .
y∈S n−1
Theorem 5.6.2 ([291]). For any log-concave random vector x ∈ Rn and any
p ≥ 1,
p 1/p
(E x 2 ) C (E x 2 + σp (x)) ,
Theorem 5.6.3 (Approximation of the identity operator [272]). Let ε ∈ (0, 1).
Assume that n ≥ n0 and let K be an isotropic convex body in Rn . If N
c (ε) n log n, where c > 0 is an absolute constant, and if x1 , . . . , xN ∈ Rn are
independent random points uniformly distributed in K, then with probability greater
than 1 − ε we have
N
1 2
(1 − ε) L2K xi , y (1 + ε) L2K ,
N i=1
and
5.6 Concentration for Isotropic, Log-Concave Random Vectors 293
1/q
q
E x p C2 (δ) pn1/p + q for q 2.
t C log (en/k) .
Taken from [298–300], the development here is motivated for the convergence of
empirical (or sample) covariance matrix.
In the recent years a lot of work was done on the study of the empirical
covariance matrix, and on understanding related random matrices with independent
rows or columns. In particular, such matrices appear naturally in two important
(and distinct) directions. That is (1) estimation of covariance matrices of high-
dimensional distributions by empirical covariance matrices; and (2) the Restricted
Isometry Property of sensing matrices defined in the Compressive Sensing theory.
See elsewhere of the book for the background on RIP and covariance matrix
estimation.
Let x ∈ Rn be a centered vector whose covariance matrix is the identity matrix,
and x1 , . . . , xN ∈ Rn are independent copies of x. Let A be a random n × N
matrix whose columns are (xi ). By λmin (respectively λmax ) we denote the smallest
(respectively the largest) singular value of the empirical covariance matrix AAT .
For Gaussian matrices, it is known that
" "
n λmin λmax n
1−C 1+C . (5.17)
N N N N
The strength of the above results is that the conditions (5.18) and (5.19) are valid
for many classes of distributions.
Example 5.6.7 (Uniformly distributed).
√ Random vectors uniformly distributed on
the Euclidean ball of radius K n clearly satisfy (5.19). They also satisfy (5.18)
with ψ = CK.
Lemma 5.6.8 (Lemma 3.1 of [301]). Let x1 , . . . , xN ∈ Rn be i.i.d. random
vectors, distributed according to an isotropic, log-concave probability measure
√ on
Rn . There exists an absolute positive constant C0 such that for any N ≤ e n and
for every K ≥ 1 one has
√
max |Xi | C0 K n.
iN
5.7 Concentration Inequality for Small Ball Probability 295
where C and c are absolute constants. The result follow by the union bound.
Example 5.6.9 (Log-concave isotropic). Log-concave, isotropic random vectors in
Rn . Such vectors satisfy (5.18) and (5.19) for some absolute constants ψ and K. The
boundedness condition follows from Lemma 5.6.8. A version of result with weaker
was proven by Aubrun [297] in the case of isotropic, log-concave rand vectors under
an additional assumption of unconditionality.
Example 5.6.10 (Any isotropic random vector satisfying the Poincare inequality).
Any isotropic random vectors (xi )iN ∈ Rn , satisfying the Poincare inequality
with constant L, i.e., such that
2
varf (xi ) L2 E|∇f (xi )|
2 2 2
E|Ax − y| E|A (x − Ex)| = a2ij Var (ξi ) A F .
i.j
296 5 Non-asymptotic, Local Theory of Random Matrices
p := P (|Ax − y| A F /2) .
Then
p2 = P |Ax − y| A Ax − y A
F /2, P F /2
P (|Az| A F ) .
Let B = AAT = (bij ). Then bii = a2ij 0 and
j
2
|Az| = Bz, z = bij Zi Zj = bii Zi2 + 2 bij Zi Zj .
i,j i i<j
2
bii EZi2 2 Tr (B) = 2 A F .
i
Thus
2 2
p2 P |Az| A F
2
P 2 bij Zi Zj + bii Zi2 − EZi2 − A F
i<j i
2 2
P bij Zi Zj A /3 +P bii Zi2 − EZi2 − A /3
i<j F
i
F
(5.21)
/ /
Note that we have (bij δi=j ) F B F = /AAT /F A op A F
2
and (bij δi=j ) op B op + (bij δi=j ) op 2 B op 2 A op . So by
Lemma 1.12.1, we get
⎛ ⎞ ⎛ 2 ⎞
4 −1 A
⎝
P bij Zi Zj A
2
/3⎠ 2 exp ⎝− C β F ⎠ . (5.22)
F
i<j A op
5.7 Concentration Inequality for Small Ball Probability 297
We have Zi 4 2 ξ 4 2β gi 4 , thus
2 2 2
b2ii EZi4 48β 4 b2ii 48β 4 B F 48β 4 A op A F .
i i
(5.23)
P ( Ax − y 2 t A HS ) t
b Aop
,
1/n
where VK = (|K| / |B2n |) ,
√
(C/η) ln(AF /(6β n))
α = 3β(4πVK ) ,
2
and η = A F / β 4 n . Here c0 is the constant from Theorem 5.7.1, and C is a
universal constant.
H0 : y = x
H1 : y = s + x
y−Bx2
for BF τ0 , claim H0 .
y−Bx2
for BF τ1 , claim H1 .
y−Bx2 y−Bx2
We are interested in P BF τ 1 |H 1 and P BF τ 0 |H0 , which
may be handled by the above theorems.
Example 5.7.6 (Hypothesis Testing for Compressed Sensed Data). In Sect. 10.16,
the observation vector for compressed sensing is modeled in (10.54) and rep-
eated here
y = A (s + x) (5.24)
where s ∈ Rn is an unknown vector
in S, A ∈ R m×n
is a random matrix with i.i.d.
N (0, 1) entries, and x ∼ N 0, σ 2 In×n denotes noise with known variance σ 2
that is independent of A. The noise model (5.24) is different from a more commonly
studied case
y = As + x (5.25)
where z ∼ N 0, σ Im×m .
2
5.8 Moment Estimates 299
H0 : y = x
H1 : y = A(s + x).
For B = AH , we have
/ / / /
/y − Bx / y − BAx /y − AH Ax/
2 2 2
= = .
B F B F A F
For a random matrix X ∈ Cn1 ×n2 , the singular values of the random matrix X
are σ1 σ2 . . . σn , n = max{n1 , n2 }. We can form a random vector σ =
T
(σ1 , σ2 , . . . , σn ) . For N independent copies X1 , . . . , XN of the random matrix
X, we can form N independent copies σ 1 , . . . , σ N of the random vector σ. So we
can use the moments for random vectors to study these random matrices.
The aim here is to approximate the moments by the empirical averages with high
probability. The material is taken from [267]. For every 1 ≤ q ≤ +∞,we define q ∗
300 5 Non-asymptotic, Local Theory of Random Matrices
to be the conjugate of q, i.e., 1/q + 1/q ∗ = 1. Let α > 0 and let ν be a probability
measure on (X, Ω). For a function f : X → R define the ψα -norm by
= inf λ > 0
α
f ψα exp (|f | /λ) dν 2 .
X
Chebychev’s inequality shows that the functions with bounded ψα -norm are
strongly concentrated, namely ν {x ||f (x)| > λt } C exp (−tα ). We denote by D
the radius of the symmetric convex set K i.e., the smallest D such that K ⊂ DB2n ,
where B2n is the unit ball in Rn .
Let K ∈ Rn be a convex symmetric body, and · K the norm. The modulus of
convexity of K is defined for any ε ∈ (0, 2) by
/ /
/x + y/
δK (ε) = inf 1 − / /
/ 2 / , x K = 1, y K = 1, x − y K > ε . (5.26)
K
We say that K has modulus of convexity of power type q ≥ 2 if δK (ε) cεq for
every ε ∈ (0, 2). This property is equivalent to the fact that the inequality
/ / / /
/ x + y /q 1 / x − y /q 1
/ / / / q q
/ 2 / + λq / 2 / 2( x K + y K) ,
K K
the maximal deviation of the empirical p-norm of x from the exact one. We want
to bound Vp (K) under minimum assumptions on the body K and random vector
x. We choose the size of the sample N such that this deviation is small with high
probability.
To bound such random process, we must have some control of the random
variable max1iN x 2 , where || · ||2 denotes the standard Euclidean norm. To this
end we introduce the parameter
p 1/p
κp,N (x) = (Emax1iN x 2 ) .
5.8 Moment Estimates 301
The constant Cp,λ in Theorem 5.8.1 depends on p and on the parameter λ in (5.26).
That minimal assumptions on the vector x are enough to guarantee that EVp (K)
becomes small for large N. In most cases, κp,N (x) may be bounded by a simple
quantity:
N
1/p
p
κp,N (x) E xi 2 .
i=1
From the sharp estimate of Theorem 5.3.2, we will deduce the following.
Theorem 5.8.2 (Guédon and Rudelson [267]). Let x be an isotropic, log-concave
random
√ vector in Rn , and let x1 , . . . , xN be N independent copies of x. If N
e c n
, then for any p ≥ 2,
√
p 1/p C n if p log N
κp,N (x) = (Emax1iN x 2 ) √
Cp n if p log N.
Theorem 5.8.3 (Guédon and Rudelson [267]). For any ε ∈ (0, 1) and p ≥ 2 these
exists n0 (ε, p) such that for any n ≥ n0 , the following holds: let x be an isotropic,
log-concave random vector in Rn , let x1 , . . . , xN be N independent copies of x, if
B C
N = Cp np/2 log n/ε2
1/p
then for any t > ε, with probability greater than 1 − C exp − t/Cp ε , for
any y ∈ Rn
302 5 Non-asymptotic, Local Theory of Random Matrices
N
p 1 p p
(1 − t) E| x, y | | xi , y | (1 + t) E| x, y | .
N i=1
The constants Cp , Cp > 0 are real numbers depending only on p.
Let us consider the classical case when x is a Gaussian random vector in Rn .
Let x1 , . . . , xN be independent copies of x. Let p∗ denote the conjugate of p. For
t = (t1 , . . . , tN )T ∈ RN , we have
N N
1/p
p
sup ti x i , y = | xi , y | ,
t∈Bp∗
N
i=1 i=1
N
where ti xi , y is the Gaussian random variable.
i=1
Let Z and Y be Gaussian vectors in RN and Rn , respectively. Using Gordon’s
inequalities [306], it is easy to show that whenever E Z p ε−1 E Y 2 (i.e. for a
universal constant c, we have N cp pp/2 np/2 /εp )
N
1/p
1 p
E Z p −E Y 2 E inf | xi , y |
y∈S n−1 N i=1
N
1/p
1 p
E sup | xi , y | E Z p +E Y 2,
y∈S n−1 N i=1
where E Z p + E Y 2 / E Z p − E Y 2 (1 + ε) / (1 − ε). It is there-
fore possible to get (with high probability with respect to the dimension n, see [307])
a family of N random vectors x1 , . . . , xN such that for every y ∈ Rn
N
1/p
1 p 1+ε
A y | xi , y | A y 2.
2
N i=1
1−ε
This argument significantly improves the bound on m in Theorem 5.8.3 for Gaussian
random vectors.
Below we will be able to extend the estimate for the Gaussian random vector
to random vector x satisfying the ψ2 -norm condition for linear functionals y →
x, y with the same dependence on N. A random variable Z satisfies the ψ2 -norm
condition if and only if for any λ ∈ R
2
E exp (λZ) 2 exp cλ2 · Z 2 .
5.8 Moment Estimates 303
We follow [308] for our development. Related work is Chevet type inequality and
norms of sub-matrices [267, 309].
Let x ∈ Rn be a random vector in a finite dimensional Euclidean space E with
Euclidean norm x and scalar product < ·, · >. As above, for p > 0, we denote
the weak p-th moment of x by
p 1/p
σp (x) = sup (E y, x ) .
y∈S n−1
p 1/p p 1/p
Clearly (E x ) σp (x) and by Hölder’s inequality, (E x ) E x .
Sometimes we are interested in reversed inequalities of the form
p 1/p
(E x ) C1 E x + C2 σp (x) (5.27)
for p ≥ 1 and constants C1 , C2 .
This is known for some classes of distributions and the question has been studied
in a more general setting (see [288] and references there). Our objective here is to
describe classes for which the relationship (5.27) is satisfied.
Let us recall some known results when (5.27) holds. It clearly holds for Gaussian
vectors and it is not difficult to see that (5.27) is true for sub-Gaussian vectors.
Another example of such a class is the class of so-called log-concave vectors. It is
known that for every log-concave random vector x in a finite dimensional Euclidean
space and any p > 0,
p 1/p
(E x ) C (E x + σp (x)) ,
where C > 0 is a universal constant.
Here we consider the class of complex measures introduced by Borell. Let κ < 0.
A probability measure P on Rm is called κ-concave if for 0 < θ < 1 and for all
compact subsets A, B ∈ Rm with positive measures one has
κ κ 1/κ
P ((1 − θ) A + θB) ((1 − θ) P(A) + θP(B) ) . (5.28)
A random vector with a κ-concave distribution is called κ-concave. Note that a
log-concave vector is also κ-concave for any κ < 0.
For κ > −1, a κ-concave vector satisfies (5.27) for all 0 < (1 + ε) p < −1/κ
with C1 and C2 depending only on ε.
304 5 Non-asymptotic, Local Theory of Random Matrices
Definition 5.8.5. Let p > 0, m = p, and λ ≥ 1. We say that a random vector x in
E satisfies the assumption H(p, λ) if for every linear mapping A : E → Rm such
that y = Ax is non-degenerate there is a gauge || · || on Rm such that E y < ∞
and
p 1/p
(E y ) < λE y . (5.29)
For example, the standard Gaussian and Rademacher vectors satisfy the above
condition. More generally, a sub-Gaussian random vector also satisfies the above
condition.
Theorem 5.8.6 ([308]). Let p > 0 and λ ≥ 1. If a random vector x in a finite
dimensional Euclidean space satisfies H(p, λ), then
p 1/p
(E x ) < c (λE x + σp (x)) ,
where c is a universal constant.
We can apply above results to the problem of the approximation of the covariance
matrix by the empirical covariance matrix. For a random vector x the covariance
matrix of x is given by ExxT . It is equal to the identity operator I if x is isotropic.
N
The empirical covariance matrix of a sample of size N is defined by N1 xi xTi ,
i=1
where x1 , x2 , . . . , xN are independent copies of x. The main question is how small
can be taken in order that these two matrices are close to each other in the operator
norm.
It was proved there that for N ≥ n and log-concave n-dimensional vectors
x1 , x2 , . . . , xN one has
/ / "
/1 N /
/ / n
/ x i x i − I/ C
T
/N / N
i=1 op
√
with probability at least 1 − 2 exp (−c n), where I is the identity matrix, and · op
is the operator norm and c, C are absolute positive constants.
In [310, Theorem 1.1], the following condition was introduced: an isotropic
random vector x ∈ Rn is said to satisfy the strong regularity assumption if for
some η, C > 0 and every rank r ≤ n orthogonal projection P, one has that for very
t>C
√
P Px 2 t r C/ t2+2η r1+η .
In [308], it was shown that an isotropic (−1/q) satisfies this assumption. For
simplicity we give this (without proof) with η = 1.
Theorem 5.8.7 ([308]). Let n ≥ 1, a > 0 and q = max {4, 2a log n} . Let x ∈ Rn
be an isotropic (−1/q) random vector. Then there is an absolute constant C such
that for every rank r orthogonal projection P and every t ≥ C exp (4/a), one has
√ 0 1
4
P Px 2 t r C max (a log a) , exp (32/a) / t4 r2 .
5.9 Law of Large Numbers for Matrix-Valued Random Variables 305
Theorem 1.1 from [310] and the above lemma immediately imply the following
corollary on the approximation of the covariance matrix by the sample covariance
matrix.
Corollary 5.8.8 ([308]). Let n ≥ 1, a > 0 and q = max {4, 2a log n} . Let
x1 , . . . , xN be independent (−1/q)-concave, isotropic random vector in Rn . Then
for every ε ∈ (0, 1) and every N ≥ C(ε)n, one has
/ /
/1 N /
/ /
E/ xi ⊗ xi − In×n / ε
T
/N /
i=1 op
For p ≤ ∞, the finite dimensional lp spaces are denoted as lpn . Thus lpn is the Banach
space Rn , · p , where
n
1/p
p
x p = |x|i
i=1
for p ≤ ∞, and x ∞ = maxi |xi |. The closed unit ball of lp is denoted by Bpn :=
0 1
x: x p1 .
306 5 Non-asymptotic, Local Theory of Random Matrices
Furthermore, the large deviation theory allows one to estimate the probability that
N
the empirical mean N1 Xi stays close to the true mean EX.
i=1
Matrix-valued versions of this inequality are harder to prove. The absolute value
must be replaced by the matrix norm. So, instead of proving a large deviation
estimate for a single random variable, we have to estimate the supremum of
a random process. This requires deeper probabilistic techniques. The following
theorem generalizes the main result of [93].
Theorem 5.9.1 (Rudelson and Vershynin [311]). Let y be a random vector in
Rn , which is uniformly bounded almost everywhere: y 2 α. Assume for
normalization that y ⊗ y 2 1. Let y1 , . . . , yN be independent copies of y. Let
"
log N
σ := C · α.
N
Then
1. If σ < 1, then
/ /
/1 N /
/ /
E/ yi ⊗y − E (y ⊗ y)/ σ.
/N /
i=1 2
5.9 Law of Large Numbers for Matrix-Valued Random Variables 307
Theorem 5.9.1 generalizes Theorem 5.4.2. Part (1) is a law of large numbers,
and part (2) is a large deviation estimate for matrix-valued random variables. The
bounded assumption y 2 α can be too strong for some applications and can be
q
relaxed to the moment assumption E y 2 αq , where q = log N. The estimate
in Theorem 5.9.1 is in general optimal (see [311]). Part 2 also holds under an
assumption that the moments of y 2 have a nice decay.
Proof. Following [311], we prove this theorem in two steps. First we use the
standard symmetrization technique for random variables in Banach spaces, see
e.g. [27, Sect. 6]. Then, we adapt the technique of [93] to obtain a bound on a
symmetric random process. Note the expectation E(·) and the average operation
N
N
1
xi ⊗xi are linear functionals.
i=1
Let ε1 , . . . , εN denote independent Bernoulli random variables taking values 1,
−1 with probability 1/2. Let y1 , . . . , yN , ȳ1 , . . . , ȳN be independent copies of y.
We shall denote by Ey , Eȳ , and Eε the expectation according to yi , ȳi , and εi ,
respectively.
Let p ≥ 1. We shall estimate
/ /p 1/p
/1 N /
/ /
Ep := E / yi ⊗yi − E (y ⊗ y)/ . (5.31)
/N /
i=1 2
Note that
N
1
Ey (y ⊗ y) = Eȳ (ȳ ⊗ ȳ) = Eȳ ȳi ⊗ȳi .
N i=1
p
We put this into (5.31). Since x → x 2 is a convex function on Rn , Jensen’s
inequality implies that
/ /p 1/p
/1 N N /
/ 1 /
Ep Ey Eȳ / yi ⊗yi − ȳi ⊗ȳi / .
/N N /
i=1 i=1 2
Denote
N N
1 1
Y= εi yi ⊗ yi and Ȳ = εi ȳi ⊗ ȳi .
N i=1
N i=1
Then
/ / / / / /p
/Y − Ȳ/p Y + /Ȳ/2
p
2p Y
p
+ /Ȳ/2 ,
2 2 2
/ /p
= E /Ȳ/2 . Thus we obtain
p
and E Y 2
/ /p 1/p
/1 N /
/ /
Ep 2 E y E ε / εi (yi ⊗ yi )/ .
/N /
i=1 2
We shall estimate the last expectation using Lemma 5.4.3, which was a lemma
from [93]. We need to consider the higher order moments:
Lemma 5.9.2 (Rudelson [93]). Let y1 , . . . , yN be vectors in Rk and ε1 , . . . , εN
be independent Bernoulli variables taking values 1, −1 with probability 1/2. Then
/ /p p / /1/2
/1 N / / N /
/ / / /
E/ εi yi ⊗yi / C0 p + log k · max yi 2·/ yi ⊗yi / .
/N / i=1,...,N / /
i=1 2 i=1 2
√ / /p 1/2p
/1 N /
p + log N / /
Ep 2C0 ·α· E/ yi ⊗ yi / . (5.32)
N /N /
i=1 2
So we get
1/2
σp1/2 log N
Ep (Ep + 1) , where σ = 4C0 α.
2 N
5.10 Low Rank Approximation 309
It follows that
√
min (Ep , 1) σ p. (5.33)
To prove part 1 of the theorem, note that σ ≤ 1 by the assumption. We thus obtain
E1 ≤ σ. This proves part 1.
1/p
To prove part 2, we consider Ep = (EZ p ) , where
/ /
/1 N /
/ /
Z=/ yi ⊗ yi − E (y ⊗ y)/ .
/N /
i=1 2
p 1/p √
(E min (Z, 1) ) min (Ep , 1) σ p. (5.34)
We can express this moment bound as a tail probability estimate using the following
standard lemma, see e.g. [27, Lemmas 3.7 and 4.10].
Lemma 5.9.4. Let Z be a nonnegative random variable. Assume that there exists a
1/p √
constant K > 0 such that (EZ p ) K p for all p ≥ 1. Then
P (Z > t) 2 exp −c1 t2 /K 2 for all t > 0.
1. Optimality. The almost linear sample complexity O(r log r) achieved in Theo-
rem 5.10.1 is optimal. The best previous result had O(r2 ) [314, 315]
2 2
2. Numerical rank. The numerical rank r = A F / A 2 is a relaxation of the
exact notion of rank. Indeed, one always has r(A)rank(A). The numerical rank
is stable under small perturbation of the matrix, as opposed to the exact rank.
3. Law of large numbers for matrix-valued random variables. The new feature is a
use of Rudelson’s argument about random vectors in the isotropic position. See
Sect. 5.4. It yields a law of large numbers for matrix-valued random variables.
We apply it for independent copies of a rank one random matrix, which is given
by a random row of the matrix AT A—the sample covariance matrix.
4. Functional-analytic nature. A matrix is a linear operator between finite-
dimensional normed spaces. It is natural to look for stable quantities tied to
linear operators, which govern the picture. For example, operator (matrix)
norms are stable quantities, while rank is not. The low rank approximation in
Theorem 5.10.1 is only controlled by the numerical rank r. The dimension n
does not play a separate role in these results.
2
Proof. By the homogeneity, we can assume A 2 = 1. The following lemma
from [314, 316] reduces Theorem 5.10.1 to a comparison of A and a sample Ã
in the spectral norm.
Lemma 5.10.2 (Drineas and Kannan [314, 316]).
/ /
2 2 / /
A − APk 2 σk+1 (A) + 2/AT A − ÃT Ã/ .
2
1
à stands
for the matrix/ A [16]. By a result
kernel or null space of / of perturbation
T / T /
theory, σk+1 A A − σk+1 Ã Ã /A A − Ã Ã/ . This proves the
T T
2
lemma.
Let x1 , . . . , xm denote the rows of the matrix A. Then
ker(A) = {v ∈ V : A(v) = 0 ∈ W }
m
AT A = xi ⊗ xi .
i=1
We shall regard the matrix AT A as the true mean of a bounded matrix valued
random variable, while ÃT Ã will be its empirical mean; then the Law of Large
Numbers for matrix valued random variables, Theorem 5.9.1, will be used. To this
purpose, we define a random vector y ∈ Rn as
A xi
P y= F
xi = 2
.
xi 2 A F
N
1 √
AT A = E (y ⊗ y) , ÃT Ã = √ yi ⊗ yi , α := y 2 = A F = r.
N i=1
Thus Theorem 5.9.1 gives that, with probability at least 1 − 2 exp(−c/δ), we have
√ //
/1/2
/
A − APk 2 σk+1 (A) + 2 /AT A − ÃT Ã/ σk+1 (A) + ε.
2
where Ai is the i-th row of A. This sampling can be done in one pass through the
matrix A if the matrix is stored row-by-row, and in two passes if its entries are
stored in arbitrary order [317, Sect. 5.1]. Second, the algorithm computes the SVD
of the N × n matrix Ã, which consists of the normalized sampled rows. This can be
done in time O(N n)+ the time needed to compute the SVD of a N ×N matrix. The
latter can be done by one of the known methods. This algorithm is takes significantly
less time than computing SVD of the original m × n matrix A. In particular, this
algorithm is linear in the dimensions of the matrix (and polynomial in N ).
The material here is taken from [72]. For a general random matrix A with
independent centered entries bounded by 1, one can use Talagrand’s concentration
inequality for convex Lipschitz functions on the cube [142, 148, 231]. Since
σmax (A) = A (or σ1 (A)) is a convex function of A. Talagrand’s concentration
inequality implies
2
P (|σmax (A) − M (σmax (A))| t) 2e−ct ,
where M is the median. Although the precise value of the median may be unknown,
integration of this inequality shows that
|Eσmax (A) − M (σmax (A))| C
Proof. σmin (A) and σmax (A) are 1-Lipschitz (K = 1) functions of matrices A
considered as vectors in Rn . The conclusion now follows from the estimates on the
expectation (Theorem 5.11.1) and Gaussian concentration (Theorem 5.11.2).
Lemma 5.11.4 (Approximate isometries [72]). Consider a matrix X that satisfies
X∗ X − I max δ, δ 2 (5.37)
/ / "
/1 ∗ /
/ A A − I/ max δ, δ 2 where δ = C n + √t . (5.39)
/N / N N
5.12 Random Matrices with Independent Rows 315
With the aid of Lemma 1.10.4, we can evaluate the norm in (5.41) on a 14 -net N
of the unit sphere S n−1 :
/ / @ A
/1 ∗ / 1 ∗ 1
/ A A − I/ 2 max A A − I x, x = 2 max
Ax
2
− 1 .
/N / x∈N N x∈N N 2
To complete the proof, it is sufficient to show that, with the required probability,
1 ε
max − 1 .
2
Ax 2
x∈N N 2
By Lemma 1.10.2, we can choose the net N such that it has cardinality |N | 9n .
2
Step 2: Concentration. Let us fix any vector x ∈ S n−1 . We can rewrite Ax 2
as a sum of independent (scalar) random variables
316 5 Non-asymptotic, Local Theory of Random Matrices
N N
2 2
Ax 2 = Ai , x =: Zi2 (5.42)
i=1 i=1
where the last inequality follows by the definition of δ and using the inequality
2
(a + b) a2 + b2 for a, b ≥ 0.
Step 3: Union Bound. Taking the union bound over the elements of the net N
with the cardinality |N | ≤ 9n , together with (5.43), we have
c
1 ε 1 c
1
P max Ax22 − 1 9n · 2 exp − 4 C 2 n + t2 2 exp − 4 t2
x∈N N 2 K K
Proof. The proof is taken from [72] for the proof and changed to our notation habits.
We shall use the non-commutative Bernstein’s inequality (for a sum of independent
random matrices).
Step 1: Reduction to a sum of independent random matrices. We first note that
2
m ≥ n ≥ 1 since by Lemma 5.2.1 we have that E Ai 2 = n. Our argument
here is parallel
√ to Step 1 of Theorem 5.12.1. Recalling Lemma 5.11.4 for the matrix
B = A/ N , we find that the desired inequality (5.44) is equivalent to
/ / "
/1 ∗ / m
/ A A − I/ max δ, δ 2 = ε, δ=t . (5.45)
/N / N
Here · is the operator (or spectral) norm. It is more convenient to express this
random matrix as a sum of independent random matrices—the sum is a linear
operator:
N N
1 ∗ 1
A A−I= Ai ⊗ A i − I = Xi , (5.46)
N N i=1 i=1
1 1 2
1 2m
Xi 2 ( Ai ⊗ Ai + 1) = Ai 2 +1 (m + 1) = K.
N N N N
N
Here · 2 is the Euclidean norm. To estimate the total variance EX2i , we first
i=1
need to compute
1 2 1
X2i = 2 (Ai ⊗ Ai ) − 2 Ai ⊗ Ai + I .
N N
1 2
EX2i = E(A i ⊗ A i ) − I . (5.47)
N2
318 5 Non-asymptotic, Local Theory of Random Matrices
Since
2 2
(Ai ⊗ Ai ) = Ai 2 Ai ⊗ A i
2
is
/ a positive semi-definite
/ matrix and Ai 2 m by assumption, it follows that
/ 2/
/E(Ai ⊗ Ai ) / m · EAi ⊗ Ai = m. Inserting this into (5.47), we have
/ /
/EX2i / 1 (m + 1) 2m .
N2 N2
where we used the assumption that m ≥ 1. This leads to
/ /
/ N / / / 2m
/ /
/ EX2i / N · max /EX2i / = = σ2 .
/ / i N
i=1
1/2
√ √
A Σ N + t m. (5.49)
Proof. Since
2
Σ = E (Ai ⊗ Ai ) E Ai ⊗ Ai = E Ai 2 m,
5.12 Random Matrices with Independent Rows 319
Let us remark on non-identical second moment. The assumption that the rows
Ai have a common second moment matrix Σ is not essential in Theorems 5.12.3
and 5.12.5. More general versions of these results can be formulated. For example,
if Ai have arbitrary second moment matrices
Σi = E (xi ⊗ xi ) ,
1
N
then the claim of Theorem 5.12.5 holds with Σ = N Σi .
i=1
320 5 Non-asymptotic, Local Theory of Random Matrices
Σ = E (x ⊗ x) .
The simplest way to estimate Σ is to take some N independent samples xi from the
distribution and form the sample covariance matrix
N
1
ΣN = xi ⊗ xi .
N i=1
ΣN → Σ almost surely N → ∞.
So, taking sufficiently many samples, we are guaranteed to estimate the covariance
matrix as well as we want. This, however, does not deal with the quantitative
aspect of convergence: what is the minimal sample size N that guarantees this
approximation with a given accuracy?
We can rewrite ΣN as
N
1 1 ∗
ΣN = xi ⊗ xi = X X,
N i=1
N
where
⎡⎤
xT1
⎢ ⎥
X = ⎣ ... ⎦ .
xTN
The X is a N × n random matrix with independent rows xi , i = 1, . . . , N, but
usually not independent entries.
Theorem 5.13.1 (Covariance estimation for sub-Gaussian distributions [72]).
Consider a sub-Gaussian distribution in Rn with covariance matrix Σ, and let
2
ε ∈ (0, 1), t ≥ 1. Then, with probability at least 1 − 2e−t n , one has
2
If N C(t/ε) n then ΣN − Σ ε. (5.50)
2 s ≥ 0, with probability
Proof. It follows from (5.39) that for every at √
least
1 − 2e−cs , we have ΣN − Σ max δ, δ 2 where δ = C n/N + s/ N .
√
The claim follows for s = C t n where C = CK is sufficiently large.
For arbitrary centered Gaussian distribution in Rn , (5.50) becomes
2
If N C(t/ε) n then ΣN − Σ ε Σ . (5.51)
Recall that Σ = σ1 (Σ) is the matrix norm which is also equal to the largest
singular value. So, by Markov’s
√ inequality, most of the distribution is supported
in a centered ball of radius m where m = O ( Σ n). If all the distribution
is supported there, i.e., if x = O ( Σ n) almost surely, then the claim of
Theorem 5.52 holds with sample size
2
N C(t/ε) r log n.
Let us consider low-rank estimation. In this case, the distribution in Rn lies close
to a low dimensional subspace. As a result, a much smaller sample size N is
sufficient for covariance estimation. The intrinsic dimension of the distribution can
be expressed as the effective rank of the matrix Σ, defined as
Tr (Σ)
r (Σ) = .
Σ
One always has r (Σ) rank (Σ) n, and this bound is sharp. For example, for
an isotropic random vector x in Rn , we have Σ = I and r (Σ) = n.
322 5 Non-asymptotic, Local Theory of Random Matrices
The effective rank r = r (Σ) always controls the typical norm of x, since
2
E x 2 = T r (Σ) = r Σ . It follows from Markov’s √ inequality that most of the
distribution is supported within a ball of radius m wherem = r Σ . Assume that
all of the distribution is supported there, i.e., if x = O r Σ almost surely,
then, the claim of Theorem 5.52 holds with sample size
2
N C(t/ε) r log n.
The material here can be found in [321]. In recent years, interest in matrix
valued random variables gained momentum. Many of the results dealing with real
random variables and random vectors were extended to cover random matrices.
Concentration inequalities like Bernstein, Hoeffding and others were obtained in the
non-commutative setting. The methods used were mostly combination of methods
from the real/vector case and some matrix inequalities like the Golden-Thompson
inequality.
The method will work properly for a class of matrices satisfying a matrix strong
regularity assumption which we denote by (MSR) and can be viewed as an analog
to the property (SR) defined in [310]. For an n × n matrix, denote · op by the
operator norm of A on n2 .
Definition 5.13.3 (Property (MSR)). Let Y be an n × n positive semi-definite
random matrix such that EY = In×n . We will say that Y satisfies (MSR) if for
some η > 0 we have:
c
P ( PYP t) ∀t c · rank (P)
t1+η
where P is orthogonal projection of Rn .
5.13 Covariance Matrix Estimation 323
The proof of Theorem 5.13.4 is based on two theorems with the smallest and largest
N
eigenvalues of N1 Xi , which are of independent interest.
i=1
we have
N
1
Eλmin Xi 1 − ε.
N i=1
N
Eλmax Xi C (η) (n + N ) .
i=1
N
1
Eλmax Xi 1 + ε.
N i=1
We primarily follow Rudelson and Vershynin [322] and Vershynin [323] for our
exposition. Relevant work also includes [72, 322, 324–333].
Let A be an N × n matrix whose entries with real independent, centered random
variables with certain moment assumptions. Random matrix theory studies the
distribution
√ of the singular values σk (A), which are the eigenvalues of |A| =
AT A arranged in nonincreasing order. Of particular significance are the largest
and the smallest random variables
Here, we consider sub-Gaussian random variables ξ—those whose tails are domi-
nated by that of the standard normal random variables. That is, a random variable is
called sug-Gaussian if there exists B > 0 such that
2
/B 2
P (|ξ| > t) 2e−t for all t > 0. (5.54)
p 1/p √
(E|ξ| ) CB p for all p 1, (5.55)
σ1 (A) √ σn (A) √
√ → 1 + α, √ →1− α
N N
almost surely. The result was proved in [302] for Gaussian matrices, and in [334]
for matrices with independent and identically distributed entries with finite fourth
moment. In other words, we have asymptotically
√ √ √ √
σ1 (A) ∼ N+ n, σn (A) ∼ N − n. (5.56)
Let λ1 (A) be the largest eigenvalue of the n×n random matrix A. Following [336],
we present
3/2
P (λ1 (A) 2 + t) Ce−cnt ,
valid uniformly for all n and t. This inequality is sharp for “small deviation” and
complements the usual “large deviation” inequality. Our motivation is to illustrate
the simplest idea to get such a concentration inequality.
The Gaussian concentration is the easiest. It is straightforward consequence of
the measure concentration phenomenon [145] that
2
P (λ1 (A) Mλ1 (A) + t) e−nt , ∀t > 0, ∀n (5.57)
where Mλ1 (A) stands for the median of λ1 (A) with respect to the probability
measure P. One has the same upper bound estimate is the median Mλ1 (A) is
replaced by the expected value Eλ1 (A), which is easier to compute.
The value of Mλ1 (A) can be controlled: for example we have
√
Mλ1 (A) 2 + c/ n.
√
√ 2 1
P λn (B) n + N + nt C exp − N t3/2 ,
C
√
√ 2 C 1
P λ1 (B) N − n − nt exp − N t3/2 ,
1 − N/n C
A result of [337] gives an optimal bound for tall matrices, those whose aspect ratio
α = n/N satisfies α < α0 for some sufficiently small constant α0 . Recalling (5.56),
one should expect that tall matrices satisfy
√
σn (A) c N with high probability. (5.58)
It was indeed proven in [337] that for a tall ±1 matrices one has
√
P σn (A) c N e−cN (5.59)
As we move toward square matrices, making the aspect ratio α = n/N arbitrarily
close to 1, the problem becomes harder. One still expect (5.58) to be true as long
as α < 1 is any constant. It was proved in [338] for arbitrary aspect ratios α <
1−c/ log n and for general random matrices with independent sub-Gaussian entries.
One has
√
P σn (A) cα N e−cN , (5.60)
where cα > 0 depends on α and the maximal sub-Gaussian moment of the entries.
Later [339], the dependence of cα on the aspect ratio in (5.60) was improved
for random ±1 matrices; however the probability estimates there was weaker
than (5.60). An estimate for sub-Gaussian random matrices
√ of all dimensions was
obtained in a breakthrough work [340]. For any ε C/ N , it was shown that
√ √ N −n
P σn (A) ε (1 − α) N − n (Cε) + e−cN .
However, because of the factor 1 − α, this estimate is suboptimal and does not
correspond to the expected asymptotic behavior (5.56).
5.14 Concentration of Singular Values 327
The extreme case for the problem of estimating the singular values is for the square
matrices, where N = n. Equation (5.56) is useless for square matrices. However,
for√“almost” square matrices, those with constant defect N − n = O(1) is of order
1/ N , so (5.56) heuristically suggests that
c
σn (A) √ with high probability. (5.61)
N
This conjecture was proved recently in [341] for all square sub-Gaussian matrices:
c
P σn (A) √ Cε + e−cN . (5.62)
N
Rudelson and Vershynin [322] proved the conjectural bound for σ(A), which is
valid for all sub-Gaussian matrices in all fixed dimensions N, n. The bound is
optimal for matrices with all aspects we encountered above.
Theorem 5.14.2 (Rudelson and Vershynin [322]). Let A be an N × n random
matrix, N ≥ n, whose elements are independent copies of a mean 0 sub-Gaussian
random variable with unit variance. Then, for every t > 0, we have
√ √
N −n+1
P σn (A) t N − n − 1 (Ct) + e−cN , (5.63)
This is a version of (5.56), now valid for all fixed dimensions. This bound were
explicitly conjectured in [342].
Theorem 5.14.2 seems to be new even for Gaussian matrices.
Vershynin [323] extends the argument of [322] for random matrices with
bounded (4 + ε)-th moment. It follows directly from the argument of [322] and
[323, Theorem 1.1] (which is Eq. 5.69 below.)
328 5 Non-asymptotic, Local Theory of Random Matrices
After the paper of [323] was written, two important related results appeared on
the universality of the smallest singular value in two extreme regimes–for almost
square matrices [326] and for genuinely rectangular matrices [324]. The result of
Tao and Vu [326] works for square and almost square matrices where the defect N −
n is constant. It is valid for matrices with i.i.d. entries with mean 0, unit variance, and
bounded C-th moment where C is a sufficiently large absolute constant. The result
says that the smallest singular value of such N × n matrices A is asymptotically the
same as of the Gaussian matrix G of the same dimensions and with i.i.d. standard
normal entries. Specifically,
P N σn (G) t − N −c − N −c P N σn (A) t ,
2 2
P N σn (G) t + N −c + N −c .
2
(5.65)
Another result was obtained by Feldheim and Sodin [324] for genuinely rectangular
matrices, i.e. with aspect ratio N/n separated from 1 by a constant and with sub-
Gaussian i.i.d. entries. In particular, they proved
√
√ 2 C 3/2
P σn (A) N − n − tN e−cnt . (5.66)
1− N/n
where the summation is over all permutations of n elements. If xi,j are i.i.d. 0 mean
variables with unit variance and X is an n × n matrix with entries xi,j then an easy
computation shows that
2
per (A) = E det A1/2 X ,
5.14 Concentration of Singular Values 329
Further, we have
det2 A1/2 G √
P log > 2C n log n exp −c n/ log n .
per (A)
σ1 (A) = sup Ax 2 ,
x:x2 =1
For random matrices with independent and identically distributed entries, the
spectral norm is well studies. Let B be an N × n matrix whose entries are real
independent and identically distributed random variables with mean 0, variance 1,
and finite fourth moment. Estimates of the type
√ √
σ1 (B) ∼ N + n, (5.67)
are known to hold (and are sharp) in both the limit regime for dimensions increasing
to infinity, and the non-limit regime where the dimensions are fixed. The meaning
of (5.67) is that, for a family of matrices
√ as above whose aspect ratio N/n converges
√
to a constant, the ratio σ1 (B) / N + n converges to 1 almost surely [344].
In the non-limit regime, i.e., for arbitrary dimensions n and N. Variants of (5.67)
was proved by Seginer [107] and Latala [194]. If B is an N × n matrix whose
entries are i.i.d. mean 0 random variables, then denoting the rows of B by xi and
the columns by yj , the result of Seginer [107] says that
Eσ1 (B) C Emax xi 2 +E max yi 2
i i
max
i
If the entries of the matrix B are not necessarily identically distributed, the result of
Latala [194] says that
⎛ ⎛ ⎞1/4 ⎞
⎜ ⎟
Eσ1 (B) C ⎝Emax xi 2 + E max yi 2 +⎝ b4ij ⎠ ⎠ ,
i i
i,j
The main result of [323] is an extension of the optimal bound (5.68) to the class
of random matrices with non-independent entries, but which can be factored through
a matrix with independent entries.
Theorem 5.14.4 (Vershynin [323]). Let ε ∈ (0, 1) and let m, n, N be positive
integers. Consider a random m × n matrix B = ΓA, where A is an N × n random
matrix whose entries are independent random variables with mean 0 and (4 + ε)-th
moment bounded by 1, and Γ is an m×N non-random matrix such that σ1 (Γ) 1.
Then
√ √
Eσ1 (B) C (ε) m+ n (5.69)
Random determinant can be used the test metric for hypothesis testing, especially
for extremely low signal to noise ratio. The variance of random determinant is much
smaller than that of individual eigenvalues. We follow [347] for this exposition. Let
An be an n×n random matrix whose entries aij , 1 ≤ i, j ≤ n, are independent real
random variables of 0 mean and unit variance. We will refer to the entries aij as the
atom variables. This shows that almost surely, log |det (An )| is (1/2 + o(1)) n log n
but does not provide any distributional information. For other models of random
matrices, we refer to [348].
332 5 Non-asymptotic, Local Theory of Random Matrices
In [242], Tao and Vu proved that for Bernoulli random matrices, with probability
tending to one (as n tents to infinity)
√ √
n! exp −c n log n |det (An )| n!ω (n) (5.71)
for any function ω(n) tending to infinity with n. We say that a random variable ξ
satisfies condition C0 (with positive constants C1 , C2 ) if
P (|ξ| t) C1 exp −tC2 (5.72)
for all t > 0. Nguyen and Vu [347] showed that the logarithm of |det (An )| satisfies
a central limit theorem. Assume that all atom variables aij satisfy condition C0 with
some positive constants C1 , C2 . Then
⎛ ⎞
log |det (A )| − 1
log (n − 1)!
sup P ⎝ 2
n 2
t − Φ(t) log−1/3+o(1) n.
⎠
t∈R 1
log n
2
)t 2 (5.73)
Here Φ(t) = P (N (0, 1) < t) = √1
2π −∞
exp −x /2 dx. An equivalent form is
log det A2n − log (n − 1)!
sup P √ t − Φ(t) log−1/3+o(1) n. (5.74)
t∈R 2 log n
H0 : Ry = Rx
H1 : Ry = Rx + Rs
Empirical CDF
1
Bernoulli matrices
0.9 Gaussian matrices
standard normal curve
0.8
0.7
0.6
F(x)
0.5
0.4
0.3
0.2
0.1
0
-5 -4 -3 -2 -1 0 1 2 3 4
x
√
Fig. 5.1 The plot compares the distributions of log det A2 − log (n − 1)! / 2 log n for random
Bernoulli matrices, random Gaussian matrices, and N (0, 1). We sampled 1,000 matrices of size
1,000 by 1,000 for each ensemble
for extremely low signal to noise ratio. The variance of random determinant is much
smaller than that of individual eigenvalues. According to (5.73), the hypothesis
test is
H0 : log |det (Rx )|
H1 : log |det (Rx + Rs )|
Let A, B be complex matrices of n×n, and assume that A is positive definite. Then
for n ≥ 2 [18, p. 535]
1
log |det (Rx + Rs )| log (n − 1)!.
2
334 5 Non-asymptotic, Local Theory of Random Matrices
We follow [350] for this exposition. Given an n × n random matrix A, what is the
probability that A is invertible, or at least “close” to being invertible? One natural
way to measure this property is to estimate the following small ball probability
P (sn (A) t) ,
where
def 1
sn (A) = inf Ax = .
x1 =1
2
A−1
In the case when the entries of A are i.i.d. random variables with appropriate
moment assumption, the problem was studied in [239, 241, 341, 351, 352]. In
particular, in [341] it is shown that if the above diagonal entries of A are
continuous and satisfy certain regularity conditions, namely that the entries are i.i.d.
subGaussian and satisfy certain smoothness conditions, then
√
P (sn (A) t) C nt + e−cn . (5.76)
The regularity assumptions were completely removed in [356] at the cost of a n5/2
(independence of the entries in the non-symmetric part is still needed). On the other
hand, in the discrete case, the result of [360] shows that if A is, say, symmetric
whose above diagonal entries are i.i.d. Bernoulli random variables, then
Further development in this direction can be found in [362] estimates similar to (1.2)
are given in the case when A is a Bernoulli random matrix, and in [356, 358, 359],
where A is symmetric.
An alternative way to measure the invertibility of a random matrix A is to
estimate det(A), which was studied in [242,363,364] (when the entries are discrete
distributions). Here we show that if the diagonal entries of A are independent
continuous random variables, we can easily get a small ball estimate for det(Γ+A),
where Γ being an arbitrary deterministic matrix.
Let A be an n × n random matrix, such that each diagonal entry Ai,i is a
continuous random variable, independent from all the other entries of A. Friedland
and Giladi [350] showed that for every n × n matrix Γ and every t ≥ 0
P [ A t] 2αnt,
n/(2n−1) (n−1)/(2n−1) 1/(2n−1) (5.78)
P [sn (A) t] (2α) (E A ) t .
Equation (5.78) can be applied to the case when the random matrix A is symmetric,
under very weak assumptions on the distributions and the moments of the entries
and under no independence assumptions on the above diagonal entries. When A is
symmetric, we have
Thus, in this case we get a far better small ball estimate for the norm
n
P [ A t] (2αt) .
336 5 Non-asymptotic, Local Theory of Random Matrices
Rudelson [76] gives an excellent self-contained lecture notes. We take some material
from his notes to get a feel of the proof ingredients. His style is very elegant.
In the classic work on numerical inversion of large matrices, von Neumann
and his associates used random matrices to test their algorithms, and they specu-
lated [365, pp. 14, 477, 555] that
√
sn (A) ∼ 1/ n with high probability. (5.79)
In a more precise form, this estimate was conjectured by Smale [366] and proved
by Edelman[367] and Szarek[368] for random Gaussian matrices A, i.e., those with
i.i.d. standard normal entries. Edelman’s theorem states that for every t ∈ (0, 1)
√
P sn (A) t/ n ∼ t. (5.80)
In [341], the conjecture (5.79) is proved in full generality under the fourth moment
assumption.
Theorem 5.15.1 (Invertibility: fourth moment [341]). Let A be an n × n matrix
whose entries are independent centered real random variables with variances at
least 1 and fourth moments bounded by B. Then, for every δ > 0 there exist ε > 0
and n0 which depend (polynomially) only on δ and B, such that
√
P sn (A) t/ n δ for all n n0 .
Spielman and Teng[369] conjectured that (5.80) should hold for the random sign
matrices up to an exponentially small term that accounts for their singularity
probability:
√
P sn (A) t/ n ε + cn .
Large complex system often exhibit remarkably simple universal patterns as the
numbers of degrees of freedom increases [370]. The simplest example is the
central limit theorem: the fluctuation of the sums of independent random scalars,
irrespective of their distributions, follows the Gaussian distribution. The other
cornerstone of probability theory is to treat the Poisson point process as the universal
limit of many independent point-like evens in space or time. The mathematical
assumption of independence is often too strong. What if independence is not
realistic approximation and strong correlations need to be modelled? Is there a
universality for strongly correlated models?
In a sensor network of time-evolving measurements consisting of many
sensors—vector time series, it is probably realistic to assume that the measurements
of sensors have strong correlations.
Let ξ be a real-valued or complex-valued random variable. Let A denote the n×n
random matrix whose entries are i.i.d. copies of ξ. One of the two normalizations
will be imposed on ξ:
• R-normalization: ξ is real-valued with Eξ = 0 and Eξ 2 = 1.
2 2
• C-normalization: ξ is complex-valued with Eξ = 0, E Re (ξ) = E Im (ξ) =
2 , and E Re (ξ) Im (ξ) = 0.
1
The joint distribution of the bottom k singular values of real or complex A was
computed in [373].
The error term o(1) in (5.82) is not explicitly stated in [372], but Tao and Vu [326]
gives the form of O(n−c ) for some absolute constant c > 0.
4
Under the assumption of bounded fourth moment E|ξ| < ∞, it was shown
by Rudelson and Vershynin [341] that
2
P nσn (A) t f (t) + o(1)
for all fixed t > 0, where g(t) goes to zero as t → 0. Similarly, in [374] it was
shown that
2
P nσn (A) t g (t) + o(1)
for all fixed t > 0, where g(t) goes to zero as t → ∞. Under stronger assumption
that ξ is sub-Gaussian, the lower tail estimate was improved in [341] to
√
2
P nσn (A) t C t + cn (5.85)
for some constant C > 0 and 0 < c < 1 depending on the sub-Gaussian moments
of ξ. At the other extreme, with no moment assumption on ξ, the bound
5
−1− 2 a−a2
n−a+o(1)
2
P nσn (A) n
Conjecture 5.16.3 (Spielman and Teng [369]). Let ξ be the Bernoulli random
variable. Then there is a constant 0 < c < 1 such that for all t ≥ 0
2
P nσn (A) t t + cn . (5.86)
A new method was introduced by Tao and Vu [326] to study small singular values.
Their method is analytic in nature and enables us to prove the universality of the
limiting distribution of nσn 2 .
Theorem 5.16.4 (Universality for the least singular value [326]). Let ξ be a (real
C
or complex) random variable of mean 0 and variance 1. Suppose E|ξ| 0 < ∞ for
some sufficiently large absolute constant C0 . Then, for all t > 0, we have,
t √
1 + x −(x/2+√x)
dx + o(n−c )
2
P nσn (A) t = √ e (5.87)
0 2 x
if ξ is R-normalized, and
t
e−x dx + o(n−c )
2
P nσn (A) t =
0
0
e dx do not play any important role. This comparison (or coupling) approach
is in the spirit of Lindeberg’s proof [376] of the central limit theorem.
Tao and Vu’s arguments are completely effective, and give an explicit value
for C0 . For example, C0 = 104 is certainly sufficient. Clearly, one can lower C0
significantly.
Theorem 5.16.4 can be extended to rectangular random matrices of (n − l) × n
dimensions (Fig. 5.2).
Theorem 5.16.5 (Universality for the least singular value of rectangular matri-
ces [326]). Let ξ be a (real or complex) random variable of mean 0 and variance
C
1. Suppose E|ξ| 0 < ∞ for some sufficiently large absolute constant C0 . Let l be a
constant. Let
2 2 2
X = nσn−l (A(ξ)) , XgR = nσn−l (A(gR )) , XgC = nσn−l (A(gC )) .
Bernoulli Gaussian
Fig. 5.2 Plotted above are the curves P nσn−l (A (ξ))2 x , for l = 0, 1, 2 based on data from
1,000 randomly generated matrices with n = 100. The curves on the left were generated with ξ
being a random Bernoulli variable, taking the values +1 and −1 each with probability 1/2; The
curves on the right were generated with ξ being a random Gaussian variable. In both cases, the
curves from left to right correspond to the cases l = 0, 1, 2, respectively
P X t − n−c − n−c P (XgR t) P X t + n−c + n−c
if ξ is R-normalized, and
P X t − n−c − n−c P (XgC t) P X t + n−c + n−c
if ξ is C-normalized.
Theorem 5.16.4 can be extended to random matrices with independent (but not
necessarily identical) entries.
Theorem 5.16.6 (Random matrices with independent entries [326]). Let ξij be
a (real or complex) random variables with mean 0 and variance 1 (R-normalized or
C
C-normalized). Suppose E|ξ| 0 < C1 for some sufficiently large absolute constant
C0 and C1 . Then, for all t > 0, we have,
t √
1 + x −(x/2+√x)
dx + o(n−c )
2
P nσn (A) t = √ e (5.88)
0 2 x
if ξij are all R-normalized, and
t
e−x dx + o(n−c )
2
P nσn (A) t =
0
if ξij are all C-normalized, where c > 0 is an absolute constant. The implied
C
constants in the O(·) notation depends on E|ξ| 0 but are uniform in t.
5.16 Universality of Singular Values 341
σ1 (A)
κ (A) = .
σn (A)
It√is well known [2] that the largest singular value is concentrated strongly around
2 n. Combining Theorem 5.16.4 with this fact, we have the following for the
general setting.
Lemma 5.16.7 (Concentration of the largest singular value [326]). Under the
setting of Theorem
√ 5.16.4, we have, with probability 1 − exp −nΩ(1) , σ1 (A) =
(2 + o(1)) n.
Corollary 5.16.8 (Conditional number [326]). Let ξij be a (real or complex)
random variables with mean 0 and variance 1 (R-normalized or C-normalized).
C
Suppose E|ξ| 0 < C1 for some sufficiently large absolute constant C0 and C1 .
Then, for all t > 0, we have,
t √
1 1 + x −(x/2+√x)
P κ (A (ξ)) t = √ e dx + o(n−c ) (5.89)
2n 0 2 x
if ξij are all C-normalized, where c > 0 is an absolute constant. The implied
C
constants in the O(·) notation depends on E|ξ| 0 but are uniform in t.
Let ξ be a complex random variable with mean 0 and variance 1. Let A be the
random matrix of size n whose entries are i.i.d. copies of ξ and Γ be a fixed matrix
of the same size. Here we study the conditional number and least singular value of
the matrix B = Γ+A. This is called signal plus noise matrix model. It is interesting
to find the “signal” matrix Γ does play a role on tail bounds for the least singular
value of Γ + A.
Example 5.16.9 (Covariance Matrix). The conditional number is a random variable
of interest to many applications [372, 377]. For example,
Σ̂ = Σ + Z or Z = Σ̂ − Σ
342 5 Non-asymptotic, Local Theory of Random Matrices
where Σ is the true covariance matrix (deterministic and assumed to be known) and
Σ̂ = n1 XX∗ is the sample (empirical) covariance matrix—random matrix; here X
is the data matrix which is the only matrix available to the statistician.
Example 5.16.10 (Hypothesis testing for two matrices).
H0 : Ry = Rn
H1 : Ry = Rx + Rn
for all t > 0. The smallest α is called the sub-Gaussian moment of ξ. For a more
general sub-Gaussian random variable ξ, Rudelson and Vershynin proved [341] the
following.
Theorem 5.16.13 (Sub-Gaussian random matrix with i.i.d. entries [341]). Let
ξ be a sub-Gaussian random variable with mean 0, variance 1 and sub-Gaussian
moment α. Let c be an arbitrary positive constant. Let A be the random matrix
whose entries are i.i.d. copies of ξ, Then, there is a constant C > 0 (depending on
α) such that, for any t ≥ n−c ,
This theorem requires very little about the random variable ξ. It does not need to
be sub-Gaussian nor even has bounded moments. All we ask is that the variable is
bounded from zero, which basically means ξ is indeed “random”. Thus, it guarantees
the well-conditionness of B = Γ + A in a very general setting.
The weakness of this theorem is that the dependence of C2 on C1 and C, while
explicit, is too generous. The work of [362] improved this dependence significantly.
Let us deal with the non-Gaussian random matrix.
Theorem 5.16.15 (The non-Gaussian random matrix [362]). There are positive
constants c1 and c2 such that the following holds. Let A be the n × n Bernoulli
matrix with n even. For any α ≥ n, there is an n × n deterministic matrix Γ such
that Γ = α and
n 1
P σn (Γ + A) c1 c2 √ .
α n
This theorem only assumes bounded second moment on ξ. The assumption that the
entries of A are i.i.d. is for convenience. A slightly weaker result would hold if one
omit this assumption.
Let us deal with the condition number.
Theorem 5.16.17 (Conditional number—Tao and Vu [362]). Let ξ be a random
variable with mean 0 and bounded second moment, and let γ ≥ 1/2, C ≥ 0 be
constants. There is a constant c depending on ξ, γ, C such that the following holds.
Let A be the n × n matrix whose entries are i.i.d. copies of ξ, Γ be a deterministic
matrix satisfying Γ nγ . Then
P κ (Γ + A) 2n(2C+2)γ c n−C+o(1) + P ( A nγ ) .
344 5 Non-asymptotic, Local Theory of Random Matrices
σ1 (Γ + A) σ1 (Γ) + σ1 (A) = Γ + A nγ + A .
If one replaces the sub-Gaussian condition by the weaker condition that ξ has fourth
moment bounded α, then one has a weaker conclusion that
√
E A C1 n.
√
For A = O ( n), (5.90) becomes
P σn (Γ + A) n−C−1/2 n−C+o(1) . (5.91)
√
In the case of A = O ( n), this implies that almost surely σn (Γ + A)
n−1/2+o(1) . For the special case of Γ = 0, this matches [341, Theorem 1.1], up
to the o(1) term.
To state the results in[380–382], we need the following two conditions. Let X =
(xij ) be a M × N data matrix with independent centered real valued entries with
variance 1/M :
1
xij = √ ξij , Eξij = 0, Eξij
2
= 1. (5.93)
M
Furthermore, the entries ξij have a sub-exponential decay, i.e., there exists a constant
1
P (|ξij | > t) exp (−tκ ) . (5.94)
κ
The sample covariance matrix corresponding to data matrix X is given by S =
XH X. We are interested in the regime
All the results here are also valid for complex valued entries with the moment
condition (5.93) replaced with its complex valued analogue:
1 2
xij = √ ξij , Eξij = 0, Eξij
2
= 0, E |ξ|ij = 1. (5.96)
M
By the singular value decomposition of X, there exist orthonormal bases
M M
X= λi ui viH = λi uH
i vi ,
i=1 i=1
Let us study the correlation matrix. For a Euclidean vector a ∈ Rn , define the l2
norm
:
; n
;
a 2=< a2i .
i=1
The matrix XH X is the usual covariance matrix. The j-th column of X is denoted
by xj . Define the matrix M × N X̃ = (x̃ij )
xij
x̃ij = . (5.98)
xj 2
M
Using the identity Ex̃2ij = ME
1
x̃2ij , we have
i=1
1
Ex̃2ij = .
M
Let λ̃i is the eigenvalues of the matrix X̃H X̃, sorted decreasingly, similarly to λi ,
the eigenvalues of the matrix XH X̃.
The key difficulty to be overcome is the strong dependence of the entries of
the correlation matrix. The main result here states that, asymptotically, the k-point
(k ≥ 1) correlation functions of the extreme eigenvalues (at the both edges of the
spectrum) of the correlation matrix X̃H X̃ converge to those of Gaussian correlation
matrix, i.e., Tracy-Widom law, and thus in particular, the largest and smallest
eigenvalues of X̃H X̃, after appropriate centering and rescaling, converge to the
Tracy-Widom distribution.
Theorem 5.16.21 ([380]). Let X with independent entries satisfying (5.93), (5.94)
(or (5.96) for complex entries), (5.95), and (5.98). For any fixed k > 0,
⎛ √
√ 2 √ √ 2 ⎞
⎜ M λ̃1 − N+ M M λ̃k − N+ M ⎟
⎝ √ √ 1 1/3 , · · · , √ √ 1/3 ⎠ → TW1 ,
N+ M √ + √1 N+ M √1 + √1
N M N M
(5.99)
where TW1 denotes the Tracy-Widom distribution. An analogous statement holds
for the k-smallest (non-trivial) eigenvalues.
As a special case, we also obtain the Tracy-Widom law for the Gaussian correlation
matrices. To reduce the variance of the test statistics, we prefer the sum of the
k
functions of eigenvalues f λ̃i for k ≤ min{M, N } where f : R → R
i=1
is a function (say convex and Lipschitz). Note that the eigenvalues λ̃i are highly
5.16 Universality of Singular Values 347
y i = ai + w i , i = 1, . . . , N
where ai is the signal vector and wi the random noise. Define the (random) sample
covariance matrices as
N N N
Y= yi yiH , A= ai aH
i , W= wi wiH .
i=1 i=1 i=1
It follows that
Y = A + W. (5.100)
Often the entries of W are normally distributed (or Gaussian random variables). We
are often interested in a more general matrix model
Y =A+W+J (5.101)
where the entries of matrix J are not normally distributed (or non-Gaussian random
variables). For example, when jamming signals are present in a communications
or sensing system. We like to perform PCA on the matrix Y. Often the rank of
matrix A is much lower than that of W. We can project the high dimensional matrix
348 5 Non-asymptotic, Local Theory of Random Matrices
for t ≥ β2 .
We assume that (1)
m
1 < α1 α2
n
where α1 and α2 are fixed constants and that S is a covariance matrix whose entries
have an exponent decay (condition (C0)) and (2) have the same first four moments
as those of a LUE matrix. The following summarizes a number of quantitative
bounds.
Theorem 5.16.23 ([386]).
1. In the bulk of the spectrum. Let η ∈ (0, 12 ]. There is a constant C > 0
(depending on η, α1 , α2 ) such that, for all covariance matrices Sm,n , with
ηn ≤ i ≤ (1 − η)n,
log n
Var (λi ) C .
n2
2. Between the bulk and the edge of the spectrum. There is a constant κ > 0
(depending on α1 , α2 ) such that the following holds. For all K > κ, for all
η ∈ (0, 12 ], there exists a constant C > 0 (depending on K, η, α1 , α2 ) such that,
for all covariance matrices Sm,n , with (1 − η)n ≤ i ≤ n − K log n,
log (n − i)
Var (λi ) C 2/3
.
n4/3 (n − i)
5.17 Further Comments 349
1
Var (λn ) C .
n4/3
This subject of this chapter is in its infancy. We tried to give a comprehensive review
of this subject by emphasizing both techniques and results. The chief motivation is
to make connections with the topics in other chapters. This line of work deserves
further research.
The work [72] is the first tutorial treatment along this line of research. The two
proofs taken from [72] form the backbone of this chapter. In the context of our
book, this chapter is mainly a statistical tool for covariance matrix estimation—
sample covariance matrix is a random matrix with independent rows. Chapter 6 is
included here to highlight the contrast between two different statistical frameworks:
non-asymptotic, local approaches and asymptotic, global approaches.
Chapter 6
Asymptotic, Global Theory of Random Matrices
The chapter contains standard results for asymptotic, global theory of random matri-
ces. The goal is for readers to compare these results with results of non-asymptotic,
local theory of random matrices (Chap. 5). A recent treatment of this subject is given
by Qiu et al. [5].
The applications included in [5] are so rich; one wonder whether a parallel
development can be done along the line of non-asymptotic, local theory of random
matrices, Chap. 5. The connections with those applications are the chief reason why
this chapter is included.
y =x+n
Ry = Rx + Rn ,
y i = x i + ni , i = 1, 2, . . . , N.
R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 351
DOI 10.1007/978-1-4614-4544-9 6,
© Springer Science+Business Media New York 2014
352 6 Asymptotic, Global Theory of Random Matrices
N N N
1 1 1
yi ⊗ yi∗ = xi ⊗ x∗i + ni ⊗ n∗i + junk
N i=1
N i=1
N i=1
where “junk” denotes two other terms. It is more convenient to consider the matrix
form
⎡ T⎤ ⎡ T⎤ ⎡ T⎤
y1 x1 n1
⎢ .. ⎥ ⎢ .. ⎥ ⎢ .. ⎥
Y=⎣ . ⎦ , X=⎣ . ⎦ , N=⎣ . ⎦ ,
T
yN N ×n
xTN N ×n
nTN N ×n
1 ∗ 1 1
Y Y = X∗ X + N∗ N + junk. (6.1)
N N N
A natural question arises from the above exercise: What happens if n → ∞, N →
n → α?
∞, but N
The asymptotic regime
n
n → ∞, N → ∞, but → α? (6.2)
N
calls for a global analysis that is completely different from that of non-asymptotic,
local analysis of random matrices (Chap. 5). Stieltjes transform and free probability
are two alternative frameworks (but highly related) used to conduct such an analysis.
The former is an analogue of Fourier transform in a linear system. The latter is an
analogue of independence for random variables.
Although this analysis is asymptotic, the result is very accurate even for small
matrices whose size n is as small as less than five [5]. Therefore, the result is very
relevant to practical limits. For example, we often consider n = 100, N = 200 with
α = 0.5.
In Sect. 1.6.3, a random matrix
k of arbitrary N × n is studied in the form of
∗ k
E Tr A and E Tr(B B) where A is a Gaussian random matrix (or Wigner
matrix) and B is a Hermitian Gaussian random matrix.
Section 1.6.3 illustrates the power of Moment Method for the arbitrary matrix sizes
in the Gaussian random matrix framework. Section 4.11 treats this topic in the
non-asymptotic, local context.
The moment method is computationally intensive, but straightforward.
Sometimes, we need to consider the asymptotic regime of (6.2), where moment
method may not be feasible. In real-time computation, the computationally efficient
techniques such as Stieltjes transform and free probability is attractive.
The basic starting point is the observation that the moments of the ESD μA can
be expressed as normalized traces of powers of A of n × n
1 k
xk dμA (x) = Tr(A) . (6.4)
R n
1 k
xk dE [μA (x)] = E Tr(A) . (6.5)
R n
k
The concentration of measure for the trace function Tr(A) is treated in Sect. 4.3.
From this result, the E [μA (x)] are uniformly sub-Gaussian:
2
n2
E [μA {|x| > t}] Ce−ct
for t > C, where C are absolute (so the decay improves quite rapidly with n). From
this and the Carleman continuity theorem, Theorem 1.6.1, the circular law can be
established, through computing the mean and variance of moments.
To prove the convergence in expectation to the semicircle law μsc , it suffices to
show that
- k .
1 1
E Tr √ A = xk dμsc (x) + ok (1) (6.6)
n n R
Equation (6.4) is the starting point for the moment method. The Stieltjes transform
also proceeds from this fundamental identity
1 1 −1
dμA (x) = Tr(A − zI) (6.7)
R x−z n
for any complex z not in the support of μA . The expression in the left hand side is
called Stieltjes transform of A or of μA , and denote it as mA (z). The expression
−1
(A − zI) is the resolvent of A, and plays a significant role in the spectral theory
of that matrix. Sometimes, we can consider the normalized version M = √1n A
for an arbitrary
√ matrix A of n × n, since the matrix norm of A is typically of
size O( n). The Stieltjes transform, in analogy with Fourier transform for a linear
time-invariant system, takes full advantage of specific linear-algebraic structure of
this problem, and, in particular, of rich structure of resolvents. For example, one can
use the Neumann series
−1
(I − X) = I + X + X2 + · · · + X k + · · · .
1 1 1 1 1
mM (z) = − − 2 3/2 TrM − 3 2 TrM − · · · ,
z z n z n
valid for z sufficiently large. This is reminiscent of how the characteristic function
EejtX of a scalar random variable can be viewed as a generating function of the
moments EX k .
For fixed z = a + jb away from the real axis, the Stieltjes transform mMn (z) is
quite stable in n. When Mn is a Wigner matrix of size n × n, using a standard
concentration of measure result, such as McDiarmid’s inequality, we conclude
concentration of mMn (z) around its mean:
√ 2
P |mMn (a + jb) − EmMn (a + jb)| > t/ n Ce−ct (6.8)
for all t > 0 and some absolute constants C, c > 0. For details of derivations, we
refer to [63, p. 146].
The concentration of measure says that mMn (z) is very close to its mean. It does
not, however, tell much about what this mean is. We must exploit the linearity of
trace (and expectation) such as
n
- −1 .
1 1
m √1 An (z) = E √ An − zIn
n n i=1 n
ii
where x ∈ C n−1
n−1is ∗the right column of Bn with the bottom right entry bnn removed,
∗
and y ∈ C is the bottom row with the bottom right entry bnn removed.
Assume that Bn and Bn−1 are both invertible, we have that
356 6 Asymptotic, Global Theory of Random Matrices
1
B−1 = . (6.10)
bnn − y∗ B−1
n nn
n−1 x
1
Em √1 An (z) = −E −1 , (6.11)
n
1 ∗
z+ nx
√1 An−1
n
− zIn−1 x− √1 ξnn
n
where x ∈ Cn−1 is the right column of An with the bottom right entry ξnn removed
(the (ij)-th entry of An is a random variable ξij ). The beauty of (6.11) is to tie
together the random matrix An of size n × n and the random matrix An−1 of size
(n − 1) × (n − 1).
−1
Next, we need to understand the quadratic form x∗ √1n An−1 − zIn−1 x.
We rewrite this as x∗ Rx, where R is the resolvent matrix. This distribution of
the random matrix R is understandably complicated. The core idea, however, is to
exploit the observation that the (n − 1)-dimensional vector x involves only these
entries of An that do not lie in An−1 , so the random matrix R and the random
vector x are independent. As a consequence of this key observation, we can use the
randomness of x to do most of the work in understanding the quadratic form x∗ Rx,
without having to know much about R at all!
Concentration of quadratic forms like x∗ Rx has been studied in Chap. 5. It turns
out that
√ 2
P |x∗ Rx − E (x∗ Rx)| t n Ce−ct
for any determistic matrix R of operator norm O(1). The expectation E (x∗ Rx) is
expressed as
n−1 n−1
∗
E (x Rx) = Eξ¯in rij ξjn
i=1 j=1
where ξin are entries of x, and rij are entries of R. Since the ξin are iid with mean
zero and variance one, the standard second moment computation shows that this
expectation is less than the trace
n−1
Tr (R) = rii
i=1
√ 2
P |x∗ Rx − Tr (R)| t n Ce−ct (6.12)
we can apply conditional expectation. The trace of this matrix is nothing but the
Stieltjes transform m √ 1 An−1 (z) of matrix An−1 . Since the normalization factor
n−1
is slightly off, we have
√ √
n n
Tr (R) = n √ m √ 1 An−1 ( √ z).
n−1 n−1 n−1
with overwhelming probability. Putting this back to (6.11), we have the remarkable
self-consistent equation
1
m √1 An (z) =− + o(1).
n z + m √1 An (z)
n
Following the arguments of [63, p. 150], we see that m √1 An (z) converges to a limit
n
mA (z), as n → ∞. Thus we have shown that
1
mA (z) = − ,
z + mA (z)
1 1/2
4 − x2 +
dx = μsc .
2π
For details, we refer to [63, p. 151].
We take the liberty of freely drawing material from [63] for this brief exposition.
We highlight concepts, basic properties, and practical material.
6.5.1 Concept
In the foundation of modern probability, as laid out by Kolmogorov, the basic objects
of study are:
1. Sample space Ω, whose elements ω represent all the possible states.
2. One can select a σ−algebra of events, and assign probabilities to these events.
3. One builds (commutative) algebra of random variables X and one can assign
expectation.
In measure theory, the underlying measure space Ω plays a prominent foun-
dational role. In probability theory, in contrast, events and their probabilities are
viewed as fundamental, with the sample space Ω being abstracted away as much as
possible, and with the random variables and expectations being viewed as derived
concepts.
If we take the above abstraction process one step further, we can view the algebra
of random variables and their expectations as being the foundational concept, and
ignoring both the presence of the original sample space, the algebra of events, or the
probability of measure.
There are two reasons for considering the above foundational structures. First,
it allows one to more easily take certain types of limits, such as: the large n limit
n → ∞ when considering n × n random matrices, because quantities build from
the algebra of random variables and their expectations, the normalized moments of
random matrices, which tend to be quite stable in the large n limit, even the sample
space and event space varies with n.
6.5 Free Probability 359
Second, the abstract formalism allows one to generalize the classical commuta-
tive theory of probability to the more general theory of non-commutative probability
which does not have the classical counterparts such as sample space or event space.
Instead, this general theory is built upon a non-commutative algebra of random
variables (or “observables”) and their expectations (or “traces”). The more general
formalism includes as special cases, classical probability and spectral theory (with
matrices or operators taking the role of random variables and the trace taking
the role of expectation). Random matrix theory is considered a natural blend of
classical probability and spectrum theory whereas quantum mechanics with physical
observables, takes the role of random variables and their expected values on a given
state which is the expectation. In short, the idea is to make algebra the foundation
of the theory, as opposed to other choices of foundations such as sets, measure,
categories, etc. It is part of more general “non-commutative way of thinking.1 ”
The significance of free probability to random matrix theory lies in the fundamental
observation that random matrices that have independent entries in the classical
sense, also tend to be independent2 in the free probability sense, in the large n limit
n → ∞. Because of this observation, many tedious computations in random matrix
theory, particularly those of an algebraic or enumerative combinatorial nature, can
be performed more quickly and systematically by using the framework of free
probability, which by design is optimized for algebraic tasks rather than analytical
ones.
Free probability is an excellent tool for computing various expressions of interest
in random matrix theory, such as asymptotic values of normalized moments in the
large n limit n → ∞. Questions like the rate of convergence cannot be answered
by free probability that covers only the asymptotic regime in which n is sent to
infinity. Tools such as concentration of measure (Chap. 5) can be combined with
free probability to recover such rate information.
As an example, let us reconsider (6.1):
1 ∗ 1 ∗ 1 1 1 1
Y Y = (X + N) (X + N) = X∗ X + N∗ N + X∗ N + N∗ X.
N N N N N N
(6.13)
The rate of convergence how N1 X∗ X converges to its true covariance matrix Rx
can be understood using concentration of measure (Chap. 5), by taking advantage of
k
low rank structure of the Rx . But the form of Tr(A + B) can be easily handled by
free probability.
zero, then E [ABAB]
If A, B are freely independent, and of expectation
vanishes, but E [AABB] instead factors as E A2 E B2 . Since
1 ∗ k
ETr (A + B) (A + B)
n
can be calculated by free probability [387], (6.13) can be handled as a special case.
Qiu et al. [5] gives a comprehensive survey of free probability in wireless
communications and signal processing. Couillet and Debbah [388] gives a deeper
treatment of free probability in wireless communication. Tulino and Verdu [389] is
the first book-form treatment of this topic in wireless communication.
E [f (X) g (Y )] = 0
When X is a Gaussian random matrix, the exact expression for τ Xk = E n1 TrXk
Y is a Hermitian Gaussian random
is available in a closed form (Sect.1.6.3). When
matrix, the exact expression for τ (Y Y) = E n1 Tr(Y∗ Y) is also available in
∗ k k
X, Y L2 (τ ) τ (XY) .
This obeys the usual axioms of an inner product, except that it is only positive semi-
definite rather than positive definite. One can impose positive definiteness by adding
an axiom that the trace is faithful, which means that τ (X∗ X) = 0 if and only if
X = 0. However, this is not needed here.
Without the faithfulness, the norm can be defined using the positive semi-definite
inner product
1/2
= (τ (X∗ X))
1/2
X L2 (τ ) X, X L2 (τ ) .
for any k ≥ 0.
362 6 Asymptotic, Global Theory of Random Matrices
XY L2 (τ ) ρ (X) Y L2 (τ ) .
τ (XYZ) = τ (ZYX) .
We now come to the fundamental concept in free probability, namely the free
independence.
6.5 Free Probability 363
τ {[P1 (Xn,i1 ) − τ (P1 (Xn,i1 ))] · · · [Pm (Xn,im ) − τ (Pm (Xn,im ))]} = 0
τ {[P1 (Xn,i1 ) − τ (P1 (Xn,i1 ))] · · · [Pm (Xn,im ) − τ (Pm (Xn,im ))]} → 0
1 ∗ 1 ∗ 1 1 1 1
Y Y = (X + N) (X + N) = X∗ X + N∗ N + X∗ N + N∗ X,
N N N N N N
(6.16)
where random matrices X, N are classically independent. Thus, X, N are also
asymptotically free. Taking the trace of (6.16), we have that
1 1 1 1 1
τ (Y∗ Y) = τ (X∗ X) + τ (N∗ N) + τ (X∗ N) + τ (N∗ X) .
N N N N N
Using (6.15), we have that
which vanish since τ (N) = τ (N∗ ) = 0 for random matrices whose entries are
zero-mean. Finally, we obtain that
The intuition here is that while a large random matrix X will certainly correlate
with itself so that, Tr (X∗ X) will be large, if we interpose an independent
random matrix N of trace zero, the correlation is largely destroyed; for instance,
Tr (X∗ XN) will be quite small.
We give a typical instance of this phenomenon here:
Proposition 6.5.9 (Asymptotic freeness of Wigner matrices [9]). LetAn,1 , . . . ,An,k
be a collection of independent n × n Wigner matrices, where the coefficients all
have uniformly bounded m-th moments for each m. Then the random variables
An,1 , . . . , An,k are asymptotically free.
A Wigner matrix is called Hermitian random Gaussian matrices. In Sect. 1.6.3.
We consider
An,i = U∗i Di Ui
When two classically independent random variables X, Y are added up, the
distribution μX+Y of the sum X + Y is the convolution μX+Y = μX ⊗ μY of
the distributions μX and μY . This convolution can be computed by means of the
characteristic function
FX = τ ejtX = ejtx dμX (x)
R
Table 6.1 Common random matrices and their moments (The entries of W are i.i.d. with zero
1 1
mean and variance N ; W is square N ×N , unless otherwise specified. tr (H) lim N Tr (H))
N →∞
− 1
Haar distribution T = W WH W 2
1
−1
mA (z) = τ (A − z) = dμX (x),
R x−z
Table 6.1 gives common random matrices and their moments. Table 6.2 gives
definition of commonly encountered random matrices for convergence laws,
Table 6.3 gives the comparison of Stieltjes, R- and S-Transforms.
Let the random matrix W be square N × N with i.i.d. entries with zero mean
and variance N1 . Let Ω be the set containing eigenvalues of W. The empirical
distribution of the eigenvalues
1
Δ
PH (z) = |{λ ∈ Ω : Re λ < Re z and Im λ < Im z}|
N
converges a non-random distribution functions as N → ∞. Table 6.2 lists
commonly used random matrices and their density functions.
Table 6.1 compiles some moments for commonly encountered matrices
from [390]. Calculating eigenvalues λk of a matrix X is not a linear operation.
Calculation of the moments of the eigenvalue distribution is, however, conveniently
done using a normalized trace since
366 6 Asymptotic, Global Theory of Random Matrices
Table 6.2 Definition of commonly encountered random matrices for convergence laws (The
1
entries of W are i.i.d. with zero mean and variance N ; W is square N × N , unless otherwise
specified)
Convergence laws Definitions Density functions
⎧1
⎨ π |z| < 1
Full-circle law W square N × N pW (z) = 0 elsewhere
⎩
⎧ 1 √
⎪
⎪ 4 − x2 |x| < 2
⎨ 2π
W+W H 0 elsewhere
Semi-circle law K= √ pK (z) =
2 ⎪
⎪
⎩
⎧
⎪ √ 0≤x≤2
⎪ 1
⎨ π 4 − x2
√ 2
Quarter circle law Q= WWH pQ (z) =
⎪
⎪ 0 elsewhere
⎩
⎧ &
⎪
⎪
1 4−x
0≤x≤4
⎪
⎨ 2π x
− 1 1
Haar distribution T = W WH W 2 pT (z) = 2π
δ (|z| − 1)
1 1
√ |x| < 2
Inverse semi-circle Y =T+ TH pY (z) = π 4−x2
law 0 elsewhere
N
1 1
λm
k = Tr (Xm ) .
N N
k=1
1
tr (X) lim Tr (X) .
N →∞ N
Table 6.2 is made self-contained and only some remarks are made here. For Haar
distribution, all eigenvalues lie on the complex unit circle since the matrix T is
unitary. The essential nature is that the eigenvalues are uniformly distributed. Haar
distribution demands for Gaussian distributed entries in the random matrix W. This
6.6 Tables for Stieltjes, R- and S-Transforms 367
condition does not seem to be necessary, but allowing for any complex distribution
with zero mean and finite variance is not sufficient.
Table 6.33 lists some transforms (Stieltjes, R-, S-transforms) and their prop-
erties. The Stieltjes Transform is more fundamental since both R-transform and
S-transform can be expressed in terms of the Stieltjes transform.
The central mathematical tool for algorithm analysis and development is the
concentration of measure for random matrices. This chapter is motivated to provide
applications examples for the theory developed in Part I. We emphasize the central
role of random matrices.
Compressed sensing is a recent revolution. It is built upon the observation that
sparsity plays a central role in the structure of a vector. The unexpected message
here is that for a sparse signal, the relevant “information” is much less that what we
thought previously. As a result, to recover the sparse signal, the required samples
are much less than what is required by the traditional Shannon’s sampling theorem.
The compressed sensing problem deals with how to recover sparse vectors from
highly incomplete information using efficient algorithms. To formulate the proce-
dure, a complex vector x ∈ CN is called s-sparse if
x 0 := |{ : x = 0}| = # { : x = 0} s.
where x 0 denotes the 0 -norm of the vector x. The 0 -norm represents the total
number of how many non-zero components there are in the vector x. The p -norm
for a real number p is defined as
N
1/p
p
x p = |xi | , 1 ≤ p < ∞.
i=1
y = Φx.
R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 371
DOI 10.1007/978-1-4614-4544-9 7,
© Springer Science+Business Media New York 2014
372 7 Compressed Sensing and Sparse Recovery
for suitable constants κ ≥ 1 and δ , then many algorithms precisely recover any
s-sparse vectors x from the measurements y = Φx. Moreover, if x can be well
approximated by an s sparse vector, then for noisy observations
y = Φx + e, e 2 α, (7.5)
these algorithms return a reconstruction x̃ that satisfies an error bound of the form
1
x − x̃ 2 C1 √ σs (x)1 + C2 α, (7.6)
s
7.1 Compressed Sensing 373
Table 7.1 Values of the constants [391] κ and δ in (7.4) that guaran-
tee success for various recovery algorithms
Algorithm κ δ References
3√
1 -minimization (7.2) 2 ≈ 0.4652 [392–395]
&
4+ 6
2
CoSaMP 4 √ ≈ 0.3843 [396, 397]
5+ 73
Iterative hard thresholding 3 1/2 [395, 398]
√
Hard thresholding pursuit 3 1/ 3 ≈ 0.5774 [399]
where
denotes the error of best s-term approximation in 1 and C1 , C2 > 0 are constants.
For illustration, we include Table 7.1 (from [391]) which lists available values
for the constants κ and δs in (7.4) that guarantee (7.6) for several algorithms along
with respective references.
Remarkably, all optimal measurement matrices known so far are random matri-
ces. For example, a Bernoulli random matrix Φ ∈ Rn×N has entries
√
φjk = εjk / n, 1 ≤ j ≤ n, 1 ≤ k ≤ N,
where εik are independent, symmetric {−1, 1}-valued random variables. Its
restricted isometry constant satisfies
δs δ
n Cδ −2 (s ln (eN/s) + ln (1/η)) ,
n Cδ −2 s log (N/s) .
Table 7.2 List of measurement matrices [391] that have been proven to be RIP, scaling of sparsity
s in the number of measurements n, and the respective Shannon entropy of the (random) matrix
n × N measurement matrix Shannon entropy RIP regime References
Gaussian nN 12 log (2πe) s Cn/ log N [400–402]
Rademacher entries nN s Cn/ log N [400]
N log2 N − nlog2 n
Partial Fourier matrix s Cn/ log4 N [402, 403]
−(N − n)log2 (N − n)
Partial circulant Rademacher N s Cn2/3 /log2/3 N [403]
Gabor, Rademacher window n s Cn2/3 /log2 n [404]
√
Gabor, alltop window 0 sC n [405]
In Table 7.2 we list the Shannon entropy (in bits) of various random matrices
along with the available RIP estimates.
⎧ 1/p
⎨ |x |p
⎪ N
i , 0 < p < ∞,
x = x = i=1 (7.7)
2 N
2 ⎪
⎩ max |xi | , p = ∞.
i=1,...,N
We are given a set A of points in RN with N typically large. We would like to embed
these points into a lower-dimensional Euclidean space Rn which approximately
preserving the relative distance between any two of these points.
Lemma 7.2.1 (Johnson–Lindenstrauss [411]). Let ε ∈ (0, 1) be given. For every
set A of k points in RN , if n is a positive integer such that n n0 = O ln k/ε2 ,
there exists a Lipschitz mapping f : RN → Rn such that
2 2
(1 − ε) x − y N f (x) − f (y) n (1 + ε) x − y N
2 2 2
for x, y ∈ A.
The Johnson–Lindenstrauss (JL) lemma states [412] that any set of k points in
high dimensional Euclidean space can be embedded into O(log(k)/ε2 ) dimensions,
without distorting the distance between any two points by more than a factor
between 1 − ε and 1 + ε. As a consequence, the JL lemma has become a valuable
tool for dimensionality reduction.
In its original form, the Johnson–Lindenstrauss lemma reads as follows.
7.2 Johnson–Lindenstrauss Lemma and Restricted Isometry Property 375
for all i, j ∈ {1, 2, . . . , k}. Here || · ||2 stands for the Euclidean norm in RN or Rm ,
respectively.
It is known that this setting of m is nearly tight; Alon [413] showed that one must
have
m = Ω ε−2 log(k) log (1/ε) .
2 2 2
(1 − ε) y 2 Φy 2 (1 + ε) y 2 , for all y ∈ E. (7.9)
When Φ is a random matrix, the proof that Φ satisfies the JL lemma with high
probability boils down to showing a concentration inequality of the typev
2 2 2
P (1 − ε) x 2 Φx 2 (1 + ε) x 2 1 − 2 exp −c0 ε2 m , (7.10)
2
Next, one must show that for any x ∈ RN , the random variable Φx n is sharply
2
concentrated a round its expected value (concentration of measure), thanks to the
moment conditions; that is,
2
P Φx n − x N t x N 2e−ng(t) , 0 < t < 1,
2 2
(7.12)
2 2 2
where the probability is taken over all n × N matrices Φ and g(t) is a constant
depending only on t such that for all t ∈ (0, 1), g(t) > 0.
Finally, one uses the union bound to the set of differences between all possible
pairs of points in A.
√
+1/ n with probability 1/2,
φij ∼ √ (7.13)
−1/ n with probability 1/2,
with probability
1 − 2(12/δ) e−g(δ/2)n .
k
(7.17)
It is well known from covering numbers (see, e.g., [419]) that we can choose
k
such a set with the cardinality |AT | (12/δ) .
2. Concentration of measure through the union bound. We use the union bound to
apply (7.12) to this set of points t = δ/2. It thus follows that, with probability at
least the right-hand side of (7.17), we have
2 2
(1 − δ/2) a N Φa n (1 + δ/2) a N , for all a ∈ AT (7.19)
2 2 2
The goal is to show that α ≤ δ. To do this, we recall from (7.18) that for any
x ∈ XT we can pick up a point (vector) a ∈ AT such that x − a N δ/4.
2
In this case we have
Since by definition α is the smallest number for which (7.20) holds, we obtain
α δ/2 + (1 + α) δ/4. Solving for α gives α 3δ/4/ (1 − δ/4) δ, as
desired. So we have shown the upper inequality in (7.16). The lower inequality
follows from this since
Φx n Φa n − Φ (x − a) n (1 − δ/2) − (1 + δ) δ/4 1 − δ,
2 2 2
2 2 2
(1 − δk ) xT N ΦT xT N (1 + δk ) xT N (7.22)
2 2 2
holds for all sets T with |T | ≤ k. The condition (7.22) is equivalent to requiring that
the Grammian matrix ΦTT ΦT has all of its eigenvalues in [1 − δk , 1 + δk ]. (Here ΦTT
is the transpose of ΦT .)
The similarity between the expressions in (7.9) and (7.22) suggests a connection
between the JL lemma and the Restricted Isometry Property. A first result in this
direction was established in [400], wherein it was shown that random matrices
satisfying a concentration inequality of type (7.10) (and hence the JL Lemma)
satisfy the RIP of optimal order. More precisely, the authors prove the following
theorem.
Theorem 7.2.4 (Baraniuk, Davenport, DeVore, Wakin [400]). Suppose that
n, N, and 0 < δ < 1 are given. For x ∈ RN , if the probability distribution
generating the n × N matrices Φ (ω) , ω ∈ RnN , satisfies the concentration
inequalities (7.12)
2
2e−ng(t) ,
2 2
P Φx n − x n t x n 0 < t < 1, (7.23)
2 2 2
then, there exist constant c1 , c2 > 0 depending only on δ such that the restricted
isometry property (7.22) holds for Φ (ω) with the prescribed δ and any k
c1 n/ log(N/k) with probability at least 1 − 2e−c2 n .
Proof. From Theorem 7.2.3, we know that for each of the k-dimensional spaces
XT , the matrix Φ (ω) will fail to satisfy (7.23) with probability at most
2(12/δ) e−g(δ/2)n .
k
(7.24)
N k
There are (2N/k) such subspaces. Thus, (7.23) will fail to hold with
k
probability (7.23) at most
2(2N/k) (12/δ) e−g(δ/2)n = 2 exp [−g (δ/2) n + k (log (eN/k) + log (12/δ))] .
k k
(7.25)
Thus, for a fixed c1 > 0, whenever k c1 n/ log (N/k) , we will have that the
exponent in the exponential on the right side of (7.25) is at most −c2 n if
As a result, we can always choose c1 > 0 sufficiently small to ensure that c2 > 0.
This prove that with probability at least 1 − 2e−c2 n , the matrix Φ (ω) will satisfy
(7.16) for each x. From this one can easily obtain the theorem.
The JL lemma implies the Restricted Isometry Property. Theorem 7.2.5 below is
a converse result to Theorem 7.2.4: They show that RIP matrices, with randomized
7.2 Johnson–Lindenstrauss Lemma and Restricted Isometry Property 379
Now we closely follow [29] to develop some useful techniques and at the same
time gives a shorter proof of Johnson–Lindenstrauss lemma. To prove Lemma 7.2.2,
it actually suffices to prove the following.
Lemma 7.2.7 (Nelson [29]). For any 0 < ε, δ < 1/2 and positive integer d, there
exists a distribution D over Rm×N for m = O ε−2 log (1/δ) such that for any
x ∈ RN with x 2 = 1,
2
P Ax 2 − 1 > ε < δ.
"
For A ∈ Rn×n , we define the Frobenius norm as A F = A2i,j =
i,j
Tr (A2 ). We also define A 2→2 = sup Ax 2 , which is also equal to the
x2 =1
largest magnitude of an eigenvalue of A when A has all real eigenvalues (e.g., it
is symmetric). We need the Hanson-Wright inequality [61]. We follow [29] for the
proof since its approach is modern.
Theorem 7.2.8 (Hanson-Wright inequality [29, 61]). For A ∈ Rn×n symmetric
and x ∈ Rn with the xi independent having subGaussian entries of mean zero and
variance one,
0 1
P xT Ax − Tr A2 > t C exp − min C t2 / A F , C t/ A 2→2 ,
2
C, C 0> 0 are universal constants.
where 1 Also, this holds even if the xi are only
2 2
Ω 1 + min t / A F , t/ A 2→2 -wise independent.
√
where z is an mN -dimensional vector formed by concatenating the rows of m·A,
and [N ] = {1, . . . , N }. We use the block-diagonal matrix T ∈ RmN ×mN with m
blocks, where each block is the N × N rank-one matrix xxT /m. Now we get our
2 2
desired quadratic form Ax 2 = zT Tz. Besides, Tr (T) = x 2 = 1. Next, we
T
want to argue that the quadratic form z Tz has a concentration around Tr (T) , for
which we can use Theorem 7.2.8. In particular, we have that
7.2 Johnson–Lindenstrauss Lemma and Restricted Isometry Property 381
P Ax22 −1 > ε =P zT Tz − Tr (T) > ε C exp − min C ε2 / T2F , C ε/T2→2 .
2 4
Direct computation yields T F = 1/m · x 2 = 1/m. Also, x is the
only eigenvector of the rank one matrix xxT /m with non-zero eigenvalues, and
2
furthermore its eigenvalue is x 2 /m = 1/m. Thus, we have the induced matrix
norm
A 2→2 = 1/m. Plugging these in gives error probability δ for m =
Ω ε−2 log (1/δ) .
n
[29]). For a ∈ R , x ∈ {−1, 1} uniform,
n
Lemma 7.2.10 (Khintchine inequality
k k
and k ≥ 2 an even integer, E a x
T
a 2 ·k . k/2
Lemma 7.2.11
(Moments
[29]). If X, Y are independent with E[Y ] = 0 and k ≥
k k
2, then E |X| E |X − Y | .
We here give a proof of Theorem 7.2.8 in the case that x is uniform in {−1, 1}n .
Theorem 7.2.12 (Concentration for quadratic form [29]). For A ∈ Rn×n
symmetric and x ∈ Rn with the xi independent having subGaussian entries of
mean zero and variance one,
k 0√ 1k
E xT Ax − Tr (A) C k · max
2
k A F ,k A 2→2
n
where y ∈ {−1, 1} is random and independent of x.
If we swap xi with yi , then x+y remains constant as does |xi −yi | and that xi −yi
is replaced with its negation. See Sect. 1.11 for this symmetrization. The advantage
of this approach is that we can conditional expectation and apply sophisticated
methods to estimate the moments of the Rademacher
series. Let us do averaging
T
over all such swaps. Let ξi = (x + y) A , and ηi = xi − yi . Let zi the indicator
i
random variable: zi is 1 if we did not swap and −1 if we did. Let z = (z1 , . . . , zn ).
382 7 Compressed Sensing and Sparse Recovery
T
Then we have (x + y) A (x − y) = ξi ηi zi . Averaging over all swaps,
i
k
k/2 k/2
The first inequality is by Lemma 7.2.10, and the second uses that |ηi | 2. Note
that
2 2 2
ξi2 = A (x + y) 2 2 Ax 2 + 2 Ay 2 ,
i
and thus
! k " k/2 k/2
E T
x Ax 2k kk/2 · E 2 Ax22 +2 Ay22 4k kk/2 · E Ax22 ,
p 1/p
with the final inequality using Minkowski’s inequality (namely that |E|X+Y | |
p p 1/p
|E|X| + E|Y | | for any random variables X, Y and any 1 ≤ p∞).
Next let us
2
deal with Ax 2 = Ax, Ax = xT A2 x. Let B = A2 − Tr A2 I/n. Then
2
Tr (B) = 0. Also B F A F A 2→2 , and B 2→2 A 2→2 . The former
holds since
2
2 2
B F λ4i − λ2i /n λ4i A F A 2→2 .
i i i
n
The latter is valid since the eigenvalues of B are λ2i − λ2j /n for each i ∈ [n].
j=1
The largest eigenvalue of B is thus at most that of A2 , and since λ2i ≥ 0, the smallest
2
eigenvalue of B cannot be smaller than − A 2→2 .
Then we have that
! " ! "
k/2 k/2 k/2
E Ax22 = E A2F + xT Bx 2k max AkF , E xT Bx .
where the final equality follows since the middle term above is the geometric mean
of the other two, and thus is dominated by at least one of them. This proves our
hypothesis as long as C ≥ 64.
To prove our statement for general k, set k = 2 log2 k . Then using the power
mean inequality and our results for k a power of 2,
k k k/k 0√ 1k
E xT Ax E xT Ax 128k max k A F,k A 2→2 .
Example 7.2.13 (Concentration using for quadratic form).
H0 : y = x, x ∈ Rn
H1 : y = x + z, z ∈ Rn
T
where x = (x1 , . . . , xn ) and xi are independent, subGaussian random variables
with zero mean and variance one and z is independent of x. For H0 , it follows from
Hanson-Wright inequality that
k 0√ 1k
E xT Ax − Tr (A) C k · max
2
k A F ,k A 2→2
0√ 1k
2
γ0 = C k · max k A F ,k A 2→2 .
Partial random Fourier matrices [406, 427] Φ ∈ Cm×n arise as random row
submatrices of the discrete Fourier matrix and their restricted isometry constants
satisfy δs δ with high probability provided that
m Cδ −2 slog3 s log n.
The theory of compressed sensing has been developed for classes of signals
that have a very sparse representation in an orthonormal basis. This is a rather
stringent restriction. Indeed, allowing the signal to be sparse with respect to a
redundant dictionary adds a lot of flexibility and significantly extends the range
of applicability. Already the use of two orthonormal basis instead of just one
dramatically increases the class of signals that can be modelled in this way [430].
Throughout this section, x denotes the standard Euclidean norm. Signals y
are not sparse in an orthonormal basis but rather in a redundant dictionary dictionary
Φ ∈ Rd×K with K > d. Now y = Φx, where x has only few non-zero components.
The goal is to reconstruct y from few measurements. Given a suitable measurement
matrix A ∈ Rn×d , we want to recover y from the measurements s = Ay = AΦx.
The key idea then is to use the sparse representation in Φ to drive the reconstruction
procedure, i.e., try to identify the sparse coefficient sequence x and from that
reconstruct y. Clearly, we may represent s = Ψx with
Ψ = AΦ ∈ Rn×K .
Table 7.3 Greedy algorithms. Goal: reconstruct x from s = Ψx. Columns of Ψ are
denoted by ψ i , and Ψ†Λ is the pseudo-inverse of ΨΛ
Orthogonal matching pursuits Thresholding
Initialize: z = 0, r = s, Λ = Ø find: Λ that contains the indices
find: k = arg maxi | r, ψ i | corresponding to the S largest values
update: Λ = Λ ∪ {k} , r = s − ΨΛ Ψ†Λ s of | s, ψ i |
iterate until stopping criterion is attained. output: x = Ψ†Λ s.
output: x = Ψ†Λ s.
for all v ∈ Rd and some constant c > 0. Let us list some examples of
random matrices that satisfy the above condition: (1) Gaussian ensemble; (2)
Bernoulli ensemble; (3) Isotropic subGaussian ensembles; (4) Basis transformation.
See [430].
Using the concentration inequality (7.28), we can now investigate the isometry
constants for a matrix of the type AΦ, where A is an n × d random measurement
matrix and Φ is a d × K deterministic dictionary, [430] follows the approach taken
in [400], which was inspired by proofs for the Johnson–Lindenstrauss lemma [416].
See Sect. 7.2.
A matrix, which is a composition of a random matrix of certain type and a
deterministic dictionary, has small restricted isometry constants.
Theorem 7.5.1 (Lemma 2.1 of Rauhut, Schnass, and Vandergheynst [430]). Let
A be a random matrix of size n × d drawn from a distribution that satisfies the
concentration inequality (7.28). Extract from the d × K deterministic dictionary Φ
any sub-dictionary ΦΛ of size S, in Rd i.e., |Λ| = S with (local) isometry constant
δΛ = δΛ (Φ) . For 0 < δ < 1, we have set
ν := δΛ + δ + δΛ δ. (7.29)
Then
2 2 2
(1 − ν) x AΦΛ x 2 x (1 + ν) (7.30)
The key ingredient for the proof Theorem 7.5.1 is a finite ε-covering (a set of
points1 ) of the unit sphere, which is included below for convenience.
We denote the unit Euclidean ball by B2n = {x ∈ Rn : x 1} and the unit
sphere S n−1 = {x ∈ Rn : x = 1}, respectively. For a finite set, the cardinality
of A is denoted by |A|, and for a set A ∈ Rn , conv A denotes the convex hull
of A. The following fact is well-known and standard: see, e.g, [152, Lemma 4.10]
and [407, Lemma 2.2].
Theorem 7.5.2 (ε-Cover). Let n ≥ 1 and ε > 0. There exists an ε-cover Γ ⊂ B2n
of the unit Euclidean ball B2n with respect to the Euclidean metric such that B2n ⊂
−1
(1 − ε) conv Γ and
n
|Γ| (1 + 2/ε) .
Similarly, there exists Γ ⊂ S n−1 which is an ε-cover of the sphere S n−1 and
|Γ | (1 + 2/ε) .
n
Proof of Theorem 7.5.1. First we choose a finite ε-covering of the unit sphere in
RS , i.e., a set of points Q, with q = 1 for all q ∈ Q, such that for all q ∈ Q,
such that for all x = 1
min x − q ε
q∈Q
for some ε ∈ (0, 1). According to Theorem 7.5.2, there exists such a Q with |Q|
S
(1 + 2/ε) . Applying the measure concentration in (7.29) with t < 1/3 to all the
points ΦΛ q and taking the union bound we obtain
2 2 2
(1 − t) ΦΛ q AΦΛ q (1 + t) ΦΛ q for all q ∈ Q. (7.31)
Next we estimate ν in terms of ε, t. Since for all x with ||x|| = 1 we can choose a
point q such that x − q ε and obtain
Since ν is the smallest possible constant for which (7.32) holds it also has to satisfy
7.5 Composition of a Random Matrix and a Deterministic Dictionary 387
√ √
1 + ν 1 + t 1 + δΛ + (1 + ν) ε.
1+ε
1+ν 2 (1 + δΛ ) .
(1 − t)
Thus,
ν < δ + δΛ (1 + δ) .
Now square both sides and observe that ν < 1 (otherwise we have nothing to show).
Then we finally arrive at
√ 2
2 1/2 1/2
AΦΛ x (1 − t) (1 − δΛ ) − ε 2
√ 1/2 1/2
(1 − t) (1 − δΛ√
) − 2 2ε(1 − t) (1 − δΛ ) + 2ε2
1 − δΛ − t − 2 2ε 1 − δΛ − δ 1 − ν.
for some δ ∈ (0, 1) and t > 0. Then with probability at least 1 − e−t the composed
matrix Ψ = AΦ has restricted isometry constant
388 7 Compressed Sensing and Sparse Recovery
does not hold for (local) isometry constants δS (AΦ) δS (Φ) + δ (1 + δS (Φ))
by the probability
S
12
P (δS (AΦ) > δS (Φ) + δ (1 + δS (Φ))) 2 1 + exp −cδ 2 n/9 .
δ
K
By taking the union bound over all possible sub-dictionaries of size S
S
we can estimate the probability of δS (AΦ) = sup δΛ (AΦ) not
Λ={1,...,K},|Λ|=S
satisfying (7.34) by
S
K 12
P (δS (AΦ) > δS (Φ) + δ (1 + δS (Φ))) 2 1+ exp −cδ 2 n/9 .
S δ
K S
Using Stirling’s formula (eK/S) and demanding that the above term is
S
less than e−t completes the proof.
It is interesting to observe the stability of inner products under multiplication
with a random matrix A, i.e.,
Ax, Ay ≈ x, y .
Theorem 7.5.4 (Lemma 3.1 of Rauhut, Schnass, and Vandergheynst [430]). Let
x, y ∈ RN with x 2 , y 2 1. Assume that A is an n × N random matrix with
independent N (0, 1/n) entries (independent of x, y). Then for all t > 0
t2
P (| Ax, Ay − x, y | t) 2 exp −n , (7.35)
C1 + C2 t
√
with C1 = √4e
6π
≈ 2.5044, and C 2 = e 2 ≈ 3.8442.
The analogue√statement holds for a random matrix A whose entries are
independent ±1/ n Bernoulli random variables.
7.5 Composition of a Random Matrix and a Deterministic Dictionary 389
N
EY = xk yk = x, y .
k=1
The new random variable Z is known as Gaussian chaos of order 2. Note that
EZ = 0.
To apply Theorem 1.3.4, we need to show the moment bound (1.24) for the
random variable Z. A general bound for Gaussian chaos [27, p. 65] gives
p/2
p p 2
E|Z| (p − 1) E|Z| . (7.36)
√
for p ≥ 2. Using Stirling’s formula, p! = 2πppp e−p eRp , 1
12p+1 Rp 1
12p , we
have that, for p ≥ 3,
390 7 Compressed Sensing and Sparse Recovery
(p − 1) pp/2
p 2
E|Z| = p! √ E|Z|
eRp 2πpe−p pp
p
1 e2 p! 2 2
(p−2)/2
2
= 1− √ e E|Z| E|Z|
p eRp 2πp
e (p−2)/2
2 2
Rp √ p! e2 E|Z| E|Z|
e 2πp
1/2 (p−2) e
2
p! e E|Z| √ E|Z|2 .
6π
Compare the above with the moment bound (1.24) holds for all p ≥ 3 with
1/2 2e
2 2
M = e E|Z| , σ 2 = √ E|Z| ,
6π
and by direct inspection we see that it also holds for p = 2.
2
Now we need to determine E|Z| . Using the independence of the Gaussian
random variables gk , we obtain
⎡
E|Z| 2
= E⎣ gj gk gj gk xj xk xj xk + 2 gj gk gk2 − 1 xj xk xj xk
j =k j =k j =k k
⎤
2
+ gk − 1 gk2 − 1 xk yk xk yk ⎦
k k
Further
2
E|Z| = E gj2 E gk2 xj yj xk yk + E gj2 E gk2 x2j yk2
j=k k=j
2
+ E gk2 − 1 x2k yk2 . (7.37)
k
Finally,
2
E|Z| = x j yj x k yk + x2j yk2 + 2 x2k yk2
k=j k=j k
= x j yj x k yk + x2j yk2
j,k j,k
2 2 2
= x, y + x 2 y 2 2 (7.38)
n
P (| Ax, Ay − x, y | t)= P Z nt
=1
1 n2 t2 t2
2 exp − = 2 exp −n ,
2 nσ 2 + nM t C1 + C2 t
2 √
with C1 = √2e 6π
E|Z| √4e 6π
≈ 2.5044 and C2 = e 2 ≈ 3.8442.
For the case of Bernoulli random matrices, the derivation is completely analogue.
We just have to replace the standard Gaussian random variables gk by εk = ±1
Bernoulli random variables. In particular, the estimate (7.36) for the chaos variable
Z is still valid, see [27, p. 105]. Furthermore, for Bernoulli variables εk = ±1 we
clearly have ε2k = 1. Hence, going through the estimate above we see that in (7.37)
the last term is actually zero, so the final bound in (7.38) is still valid.
Now we are in a position to investigate recovery from random measurements
by thresholding, using Theorem 7.5.4. Thresholding works by comparing inner
products of the signal with the atoms of the dictionary.
Example 7.5.6 (Recovery from random measurements by thresholding [430]). Let
A be an n × K random matrix satisfying one of the two probability models of
Theorem 7.5.4. We know thresholding will succeed if we have
The probability of the good components having responses lower than the threshold
can be further estimated as
P min | Ay, Azi | min | y, zi | − 2ε
i∈Λ i∈Λ
P ∪ | Ay, Azi | min | y, zi | − 2ε
i∈Λ i∈Λ
P | y, zi − Ay, Azi | 2ε
i∈Λ
t2 /4
2 |Λ| exp −n C1 +C 2 t/2 .
Similarly, we can bound the probability of the bad components being higher than
the threshold,
392 7 Compressed Sensing and Sparse Recovery
P max | Ay, Azk | max | y, zk | + ε
2
k∈Λ̄ k∈Λ̄
P ∪ | Ay, Azk | max | y, zk | + 2
ε
k∈Λ̄ k∈Λ̄
P | Ay, Azk − y, zk | 2ε
k∈Λ̄
t2 /4
2 Λ̄ exp −n C1 +C 2 t/2
.
Combining the these two estimates we obtain that the probability of success for
thresholding is exceeding
t2 /4
1 − 2K exp −n .
C1 + C2 t/2
Theorem 7.5.4 finally follows from requiring this probability to be higher than
1 − e−t and solving for n.
We summarize the result in the following theorem.
Theorem 7.5.7 (Theorem 3.2 of Rauhut, Schnass, and Vandergheynst [430]).
Let Ψ be a d × K dictionary. Assume that the support of x for a signal y = Φx,
normalized to have y 2 = 1, could be recovered by thresholding with a margin ε,
i.e.
where C (ε) = 4C1 ε−2 + 2C2 ε−1 and C1 , C2 are constants from Theorem 7.5.4.
Circular matrices are connected to circular convolution, defined for two vectors
x, z ∈ Cn by
n
(z ⊗ x)j := zjk xk , j = 1, . . . , n,
k=1
7.6 Restricted Isometry Property for Partial Random Circulant Matrices 393
where
j $ k = j − k mod n
Hx = z ⊗ x
T
and has entries Hjk = zjk . Given a vector z = (z0 , . . . , zn−1 ) ∈ Cn , we
introduce the circulant matrix
⎡ ⎤
z0 zn−1 · · · z1
⎢ z1 z0 · · · z2 ⎥
⎢ ⎥
.. ⎥ ∈ C
n×n
Hz = ⎢ . .. .
⎣ .. . . ⎦
zn−1 zn−2 · · · z0
Square matrices are not very interesting for compressed sensing, so we our attention
to a row submatrix of H. Consider an arbitrary index set Ω ⊂ {0, 1, . . . , n − 1}
whose cardinality |Ω| = m. We define the operator RΩ : Cn → Cm that restricts
n
a vector x ∈ Cn to its entries in Ω. Let ε = {εi }i=1 be a Rademacher vector of
length n, i.e., a random vector with independent entries distributed according to
P (εi = ±1) = 1/2. The associated partial random circulant matrix is given by
1
Φ = √ RΩ Hε ∈ Rm×n (7.39)
m
1 1
Φx = √ RΩ Hε x = √ RΩ (ε ⊗ x) . (7.40)
m m
T = {x ∈ Rn : x 0 s, x 2 1} . (7.41)
Now consider
D E
2 2
||ΦH Φ − I||| = sup ΦH Φ − I x, x = sup Φx 2 − x 2 = δs . (7.42)
x∈T x∈T
Let S be the cyclic shift down operator on column vectors in Rn . Applying the
power Sk to x will cycle x downward by k coordinates:
Sk x
= xk ,
H
where $ is subtraction modulo n. Note that Sk = S−k = Sn−k . Then we
rewrite Ψ as a random sum of shift operators,
n
1
Φ= √ εk R Ω S k .
m
k=1
It follows that
n n
1 −k 1
Φ Φ−I=
H
εk ε S RH
Ω RΩ S
= εk ε S−k PΩ S . (7.43)
m m
k= k=
F (ω, ) = e−i2πω/n , 0 ω, n − 1.
Here F is unnormalized. The hat symbol denotes the Fourier transform of a vector:
x̂ = Fx. Use the property of Fourier transform: a shift in the time domain
followed by a Fourier transform may be written as a Fourier transform followed
by a frequency modulation
FSk = Mk F,
1 −1
where P̂Ω = n FPΩ F . The matrix P̂Ω has several nice properties:
1. P̂Ω is circulant and conjugate symmetric.
2. Along the diagonal P̂Ω (ω, ω) = m/n2 , and off the diagonal P̂Ω (ω, ω)
m/n2 .
3. Since the rows and columns of P̂Ω are circular shifts of one another,
2 2 / / 2
/ /
P̂Ω (ω, ξ) = P̂Ω (ω, ξ) = /P̂Ω / /n = m/n3 .
F
ω ξ
Gx = ε, Zx ε where x ∈ T. (7.46)
−k
1 H
m x̂ M P̂Ω M x̂,
k=
Zx (k, ) = .
0, k=
1 H H
Zx = F X̂ P̂Ω X̂F − diag FH X̂H P̂Ω X̂F , (7.47)
m
where X̂ = diag (x̂) is the diagonal matrix constructed from the vector x̂. The term
homogeneous second-order chaos is used to refer to a random process Gx of the
form (7.46) where each matrix Zx is conjugate symmetric and hollow, i.e., has zeros
on the diagonal. To bound the expected supremum of the random process Gx over
the set T , we apply a version of Dudleys inequality that is specialized to this setting.
Define two pseudo-metrics on the index set T :
Let N (T, di , r) denote the minimum number of balls of radius r in the metric d1 , d2
that we need to cover the set T .
Proposition 7.6.4 (Dudley’s inequality for chaos). Suppose that Gx is a homoge-
neous second-order chaos process indexed by a set T . Fix a point x0 ∈ T . There
exists a universal constant C such that
∞ ∞
E sup |Gx − Gx0 | C max log N (T, d1 , r) dr, log N (T, d2 , r)dr .
x∈T 0 0
(7.48)
Our statement of the proposition follows [403] and looks different from the versions
presented in the literature [27, Theorem 11.22] and [81, Theorem 2.5.2]. For details
about how to bound these integrals, we see [403].
Immediately following Example 7.6.3, we can get the following theorem.
Theorem 7.6.5 (Theorem 1.1 of Rauhut, Romberg, and Tropp [403]). Let Ω be
an arbitrary subset of {0, 1, . . . , n − 1} with cardinality |Ω| = m. Let Ψ be the
corresponding partial random circulant matrix (7.39) generated by a Rademacher
sequence, and let δs denote the s-th restricted isometry constant. Then,
"
s3/2 3/2 s
E [δs ] C1 max log n, log s log n (7.49)
m m
0 1
m C2 max δ −1 s3/2 log3/2 n, δ −2 slog2 nlog2 s, , (7.50)
Proposition 7.6.7 (Tail Bound for Chaos). under the preceding assumptions, for
all t ≥ 0,
t2
P (Y E [Y ] + t) exp − 2
(7.51)
32V + 65U t/3
1 H H
Zx = Ax − diag (Ax ) for Ax = F X̂ P̂Ω X̂F.
m
398 7 Compressed Sensing and Sparse Recovery
hx = Zx ε for x ∈ T.
d2 (x, y) = Zx − Zy F.
2
then with probability at least 1 − n−(log n)(log s) , the restricted isometry constant
of Φ satisfies δs ≤ δ, in other words,
2
P (δs δ) n−(log n)(log s) .
2 2 2
(1 − δ) x 2 ΦD x 2 (1 + δ) x 2 .
Now we present a result that is more general than Theorem 7.6.9. The
L-sub-Gaussian random variables are defined in Sect. 1.8.
Theorem 7.6.11 (Theorem 4.1 of Krahmer, Mendelson, and Rauhut [31]). Let
n
ξ = {ξi }i=1 be a random vector with independent mean-zero, variance one, L-sub-
Gaussian entries. If, for s ≤ n and η, δ ∈ (0, 1),
0 1
m cδ −2 s max (log s) (log n) , log (1/η)
2 2
(7.54)
then with probability at least 1 − η, the restricted isometry constant of the partial
random circulant matrix Φ ∈ Rm×n generated by ξ satisfies δs δ. The constant
c > 0 depends only on L.
Here, we only introduce the proof ingredient. Let Vx z = √1m PΩ (x ⊗ z) , where
the projection operator PΩ : Cn → Cm is given by the positive-definite (sample
covariance) matrix
PΩ = RH
Ω RΩ ,
that is,
and hence
2 2
δs =sup Vx ξ 2 −E Vx ξ 2 ,
x∈Ts
400 7 Compressed Sensing and Sparse Recovery
So, we have
1
Vx ξ = √ PΩ F−1 X̂Fξ,
m
where X̂ is the diagonal matrix, whose diagonal is the Fourier transform Fx. In
short,
1
Vx = √ P̂Ω X̂F,
m
where P̂Ω = PΩ F−1 . For details of the proof, we refer to [31]. The proof ingredient
is novel. The approach of suprema of chaos processes is indexed by a set of matrices,
which is based on a chaining method due to Talagrand. See Sect. 7.8.
Π (k, ) = M Tk ,
435]. The m × m2 matrix Ψy whose columns are vectors Π (k, ) y, (k, ) ∈ Z2m is
called a Gabor synthesis matrix,
2
Ψy = M Tk y ∈ Cm×m , (k, ) ∈ Z2m . (7.56)
The operators Π (k, ) = M Tk are called time-frequency shifts and the system
Π (k, ) of all time-frequency shifts forms a basis of the matrix space Cm×m [436,
437]. Ψy allows for fast matrix vector multiplication algorithms based on the FFT.
Theorem 7.7.1 (Theorem 1.3 of Krahmer, Mendelson, and Rauhut [31]). Let ε
2
be a Rademacher vector and consider the Gabor synthesis matrix Ψy ∈ Cm×m
defined in (7.56) generated by y = √1m ε. If
m cδ −2 s(log s) (log m) ,
2 2
2
then with probability at least 1 − m−(log m)·(log s) , the restricted isometry constant
of Ψy satisfies δs ≤ δ.
Now we consider ∈ Cn to be a Rademacher or Steinhaus sequence, that is,
a vector of independent random variables taking values +1 and −1 with equal
probability, respectively, taking values uniformly distributed on the complex torus
S 1 = {z ∈ C : |z| = 1} . The normalized window is
1
g = √ .
n
2
Theorem 7.7.2 (Pfander and Rauhut [404]). Let Ψg ∈ Cn×n be a draw of
the random Gabor synthesis with normalized Steinhaus or Rademacher generating
vector.
1. The expectation of the restricted isometry constant δs of Ψg , s ≤ n, satisfies
+ " ,
s3/2 s3/2 log3/2 n
Eδs max C1 log n log s, C2 , (7.57)
n n
2
/σ 2 C3 s3/2 (log n)log2 s
P (δs Eδ + t) e−t , where σ 2 = , (7.58)
n
where C3 > 0 is a universal constant.
With slight variations of the proof one can show similar statements for normalized
Gaussian or subGaussian random windows g.
402 7 Compressed Sensing and Sparse Recovery
H= x(k,) Π (k, ) .
(k,)∈Zn ×Zn
Time-shifts delay is due to the multipath propagation, and the frequency-shifts are
due to the Doppler effects caused by moving transmitter, receiver and scatterers.
Physical considerations often suggest that x be rather sparse as, indeed, the number
of present scatterers can be assumed to be small in most cases. The same model is
used in sonar and radar.
Given a single input-output pair (g, Hg) , our task is to find the sparse coefficient
vector x. In other words, we need to find H ∈ Cn×n , or equivalently x, from its
action y = Hz on a single vector z. Writing
where I is the identity matrix and Ψ = Ψg . The Gabor synthesis matrix Ψg has the
form
7.7 Restricted Isometry Property for Time-Frequency Structured Random Matrices 403
n−1
Ψg = g k Ak
k=0
with
A0 = I|M|M2 | · · · |Mn−1 , A1 = I|MT|M2 T| · · · |Mn−1 Tk ,
With
n−1
AH
k Ak = nI,
k=0
it follows that
n−1 n−1 n−1 n−1
1 1
ΨH Ψ − I = −I + k k AH
k Ak = k k Wk ,k ,
n n
k=0 k =0 k=0 k =0
where
Wk ,k = k Ak , k = k,
AH
0, k = k.
B(x)k,k = xH AH
k Ak x.
Then we have
where
k Ak x = B (x) ,
ε̄k εk xH AH H
Zx = (7.62)
k =k
with x ∈ Ts = {x ∈ Cn×n : x 2 = 1, x 0 s} .
A process of the type (7.62) is called Rademacher or Steinhaus chaos process of
order 2. In order to bound such a process, [404] uses the following Theorem, see for
404 7 Compressed Sensing and Sparse Recovery
example, [27, Theorem 11.22] or [81, Theorem 2.5.2], where it is stated for Gaus-
sian processes and in terms of majorizing measure (generic chaining) conditions.
The formulation below requires the operator norm A 2→2 = max Ax 2 and
x2 =1
1/2
1/2 2
the Frobenius norm A F = Tr AH A = |Ai,j | , where Tr (A)
i,j
denotes the trace of a matrix A. z 2 denotes the 2 -norm of the vector z.
T
Theorem 7.7.5 (Pfander and Rauhut [404]). Let = (1 , . . . , n ) be a
Rademacher or Steinhaus sequence, and let
k Ak x = B (x)
ε̄k εk xH AH H
Zx =
k =k
Let N (T, di , r) be the minimum number of balls of radius r in the metric di needed
to cover the set T. Then, these exists a universal constant C > 0 such that, for an
arbitrary x0 ∈ T,
∞ ∞
E sup |Zx − Zx0 | C max log N (T, d1 , r) dr, log N (T, d2 , r) dr .
x∈T 0 0
(7.63)
The proof ingredients include: (1) decoupling [56, Theorem 3.1.1]; (2) the contrac-
tion principle [27, Theorem 4.4]. For a Rademacher sequence, the result is stated
in [403, Proposition 2.2].
The following result is a slight variant of Theorem 17 in [62], which in turn is an
improved version of a striking result due to Talagrand [231].
Theorem 7.7.6 (Pfander and Rauhut [404]). Let the set of matrices B =
T
{B (x)} ∈ Cn×n , x ∈ T, where T is the set of vectors. Let ε = (ε1 , . . . , εn )
be a sequence of i.i.d. Rademacher or Steinhaus random variables. Assume that
the matrix B(x) has zero diagonal, i.e., Bi,i (x) = 0 for all x ∈ T. Let Y be the
random variable
2
H n−1 n−1
Y = sup ε B (x) ε = εk εk B(x)k ,k .
x∈T
k=1 k =1
7.8 Suprema of Chaos Processes 405
Define U and V as
and
2
n n
2
V = E sup B (x) ε = E sup εk B(x)k ,k .
x∈T
2
x∈T
k =1 k=1
Then, for t ≥ 0,
t2
P (Y E [Y ] + t) exp − .
32V + 65U t/3
is denoted by dF (A). Similarly, the radius of the set A in the operator norm
A 2→2 = sup Ax 2
x2 1
d2→2 (A) 2
γ2 (A, · 2→2 ) c log N (A, · 2→2 , r)dr. (7.68)
0
This type of entropy integral was suggested by Dudley [83] to bound the supremum
of Gaussian processes.
Under mild measurability assumptions, if {Gt : t ∈ T } is a centered Gaussian
process by a set T, then
The upper bound is due to Fernique [84] while the lower bound is due to Talagrand’s
majorizing measures theorem [81, 85].
With the notions above, we are ready to stage the main results.
Theorem 7.8.2 (Theorem 1.4 of Krahmer, Mendelson, and Rauhut [31]). Let
A ∈ Rm×n be a symmetric set of matrices, A = −A. Let be a Rademacher
vector of length n. Then
2
E sup A22 − E A22 C1 dF (A) γ2 A, ·2→2 + γ2 A, ·2→2 =: C2 E
A∈A
(7.70)
where
p 1/p
X Lp = (E|X| ) .
Theorem 7.8.4 (Lemma 3.3 of Krahmer, Mendelson, and Rauhut [31]). Let A
be a set of matrices, let ξ = (ξ1 , . . . , ξn ) be an L-subGaussian random vector, and
let ξ be an independent copy of ξ. Then, for every p ≥ 1,
/ / / /
/ = >/ / / /= >/
/ sup Aξ, Aξ / L γ2 (A, · ) / sup Aξ / + sup / Aξ, Aξ /L .
/A∈A / 2→2 /
A∈A
2/
A∈A p
Lp Lp
The proof in [31] follows a chaining argument. We also refer to Sect. 1.5 for
decoupling from dependance to independence.
Theorem 7.8.5 (Theorem 3.4 of Krahmer, Mendelson, and Rauhut [31]). Let
L ≥ 1 and ξ = (ξ1 , . . . , ξn ) , where ξi , i = 1, . . . , n are independent mean-zero,
variance one, L-subGaussian random variables, and let A be a set of matrices.
Then, for every p ≥ 1,
/ /
/ / √
/ sup Aξ / L γ2 (A, ·
/A∈A 2/ 2→2 ) + dF (A) + pd2→2 (A) ,
Lp
2 2
sup Aξ 2 − E Aξ 2 L γ2 (A, · 2→2 ) [γ2 (A, · 2→2 ) + dF (A)]
A∈A
√
+ pd2→2 (A) [γ2 (A, · 2→2 ) + dF (A)] + pd22→2 (A) .
N
yk = aj xk−j . (7.73)
j=1
7.9 Concentration for Random Toeplitz Matrix 409
y = Xh, (7.74)
where
⎡ ⎤
xN xN −1 · · · x1
⎢ xN +1 xN · · · x2 ⎥
⎢ ⎥
X=⎢ .. .. . . .. ⎥ (7.75)
⎣ . . . . ⎦
xN +M −1 xN +M −2 · · · XM
N +M −1
is an M × N Toeplitz matrix. Here we consider the entries x = {xi }i=1
drawn from an i.i.d. Gaussian random sequences. To state the result, we need the
eigenvalues of the covariance matrix of the vector h defined as
⎡ ⎤
A (0) · · · A (M − 1)
A (1)
⎢ A (1) · · · A (M − 2) ⎥
A (0)
⎢ ⎥
R=⎢ .. .. .... ⎥ (7.76)
⎣ . . . . ⎦
A (M − 1) A (M − 2) · · · A (0)
where
N −τ
A (τ ) = hi hi+τ , τ = 0, 1, . . . , M − 1
i=1
Allowing X to have M × N i.i.d. Gaussian entries with zero mean and unit variance
(thus no Toeplitz structure) will give the concentration bound [415]
0 1
2 2 2 M 2
P y 2 −M h 2 tM h 2 2 exp − t .
4
Thus, to achieve the same probability bound for Toeplitz matrices requires choosing
M larger by a factor of 2ρ (h) or 2μ (h).
See [441], for the spectral norm of a random Toeplitz matrix. See also [442].
The central goal of compressed sensing is to capture attributes of a signal using very
few measurements. In most work to date, this broader objective is exemplified by
the important special case in which a k-sparse vector x ∈ Rn (with n large) is to
be reconstructed from a small number N of linear measurements with k < N < n.
In this problem, measurement data constitute a vector √1N Φx, where Φ is an N × n
matrix called the sensing matrix.
The two fundamental questions in compressed sensing are: how to construct
suitable sensing matrices Φ, and how to recover x from √1N Φx efficiently. In [443]
the authors constructed a large class of deterministic sensing matrices that satisfy a
statistical restricted isometry property. Because we will be interested in expected-
case performance only, we need not impose RIP; we shall instead work with the
weaker Statistical Restricted Isometry Property.
Definition 7.10.1 (Statistical restricted isometry property). An N × n (sensing)
matrix Φ is said to be a (k, δ, ε)-statistical restricted isometry property matrix if, for
k-sparse vectors x ∈ Rn , the inequalities
/ /2
/ 1 /
/ / (1 + δ) x
2 2
(1 − δ) x 2 √
/ N Φx / 2 ,
Sparsity recovery and compressed sensing are interchangeable terms. This sparsity
concept can be extended to the matrix case: sparsity recovery of the vector of
singular values. We follow [444] for this exposition.
The observed data y is modeled as
y = A (M) + z, (8.1)
R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 411
DOI 10.1007/978-1-4614-4544-9 8,
© Springer Science+Business Media New York 2014
412 8 Matrix Completion and Low-Rank Matrix Recovery
⎡ ⎤
vec (A1 )
⎢ vec (A2 ) ⎥
⎢ ⎥
A (X) = ⎢ .. ⎥ vec (X) (8.2)
⎣ . ⎦
vec (Am )
where vec (X) is a long vector obtained by stacking the columns of matrix X.
The matrix version of the restricted isometry property (RIP) is an integral tool in
proving theoretical results. For each integer r = 1, 2, . . . , n, the isometry constant
δr of A is the smallest value such that
2 2 2
(1 − δr ) X F A (X) 2 (1 + δr ) X F (8.3)
minimize X ∗
∗
subject to A (v) γ (8.4)
v = y − A (X)
where · is the operator norm and · ∗ is its dual, i.e., the nuclear norm. The
nuclear norm of a matrix X is the sum of the singular values of X and the
operator norm is its largest singular value. A∗ is the adjoint of A. X F is
the Frobenius norm (the 2 -norm of the vector of singular
values).
Suppose z is a Gaussian vector with i.i.d. N 0, σ 2 entries, and let n =
max {n1 , n2 }. Then if C0 > 4 (1 + δ1 ) log 12
√
A∗ (z) C0 nσ, s (8.5)
with probability at least 1 − 2e−cn for a fixed numerical constant c > 0. The scalar
δ1 is the restricted isometry constant at rank r = 1.
We can reformulate (8.4) as a semi-definite program (SDP)
γIn A∗ (v)
∗ ∗ 0.
[A (v)] γIn
T
for all vectors v = (v1 , . . . , vm ) ∈ Rm .
The matrix version of the restricted isometry property (RIP) is an integral tool in
proving theoretical results. For each integer r = 1, 2, . . . , n, the isometry constant
δr of A is the smallest value such that
2 2 2
(1 − δr ) X F A (X) 2 (1 + δr ) X F (8.8)
holds for all matrices X of rank at most r. We say that A satisfies the RIP at rank r
if δr (or δ4r ) is bounded by a sufficiently small constant between 0 and 1.
Which linear maps A satisfy the RIP? One example is the Gaussian measurement
ensemble. A is a Gaussian measurement ensemble if each ‘row’ ai , 1 i m,
contains i.i.d. N (0, 1/m) entries (and the ai ’s are independent from each other).
We have selected the variance of the entries to be 1/m so that for a fixed matrix X,
2 2
E A (X) 2 = X F .
414 8 Matrix Completion and Low-Rank Matrix Recovery
Theorem 8.2.1 (Recht et al. [446]). Fix 0 ≤ δ < 1 and let A is a random
measurement ensemble obeying the following conditions: for any given X ∈
Rn1 ×n2 and fixed 0 < t < 1,
2 2 2
P A (X) 2 − X F >t X F C exp (−cm) (8.9)
for fixed constants C, t > 0 (which may depend on t). Then, if m ≥ Dnr, A satisfies
the RIP with isometry constant δr ≤ δ with probability exceeding 1 − Ce−dm for
fixed constants D, d > 0.
2
If A is a Gaussian random measurement ensemble, A (X) 2 is distributed as
m−1 X F times a chi-squared random variable with m degrees of freedom, and
2
we have
m
2 2 2
P A (X) 2 − X F > t X F 2 exp t2 /2 − t3 /3 . (8.10)
2
Similarly, A satisfies (8.10) in the case when each entry
√ of each√‘row’ ai has i.i.d.
entries that are equally likely to take the values +1/ m or −1/ m [446], or if A
is a random projection [416, 446]. Finally, A satisfies (8.9) if the “rows” ai contain
sub-Gaussian entries [447].
In Theorem 8.2.1, the degrees of freedom of n1 × n2 matrix of rank r is r(n1 +
n2 − r).
Given the observed vector y, the optimal solution to (8.4) is our estimator M̂ (y).
For the data vector and the linear model
y = Ax + z (8.11)
where A ∈ Rm×n and the zi ’s are i.i.d. N 0, σ 2 . Let λi AT A be the
eigenvalues of the matrix AT A. Then [448, p. 403]
2 −1
n
σ2
inf sup E x̂ − x 2 = σ 2 Tr AT A = (8.12)
x̂ x∈Rn
i=1
λi (AT A)
where x̂ is estimate of x.
Suppose that the measurement operator is fixed and satisfies the RIP, and that
the noise vector z = (z1 , . . . , zn )T ∼ N 0, σ 2 In . Then any estimator M̂ (y)
obeys [444, 4.2.8]
/ /
/ / 1
sup E/M̂ (y) − M/ nrσ 2 . (8.13)
M:rank(M)r F 1 + δr
8.4 Low Rank Matrix Recovery for Hypothesis Detection 415
y =x+n
Ry = Rx + Rn ,
y i = x i + ni , i = 1, 2, . . . , N.
Assume that xi are dependent random vectors, while ni are independent random
vectors.
Let us consider the sample covariance matrix
N N N
1 1 1
yi ⊗ yi∗ = xi ⊗ x∗i + ni ⊗ n∗i + junk
N i=1
N i=1
N i=1
where “junk” denotes another two terms. When the sample size N increases,
concentration of measure phenomenon occurs. It is remarkable that, on the right
N
hand side, the convergence rate of the first sample covariance matrix N1 xi ⊗ x∗i
i=1
N
and that of the second sample covariance matrix 1
N ni ⊗ n∗i are different! If we
i=1
N
further assume Rx is of low rank, R̂x = 1
N xi ⊗ x∗i converges to its true value
i=1
N
Rx ; R̂n = 1
N ni ⊗ n∗i converges to its true value Rn . Their convergence rates,
i=1
however, are different. &
%
416 8 Matrix Completion and Low-Rank Matrix Recovery
Example 8.4.1 illustrates the fundamental role of low rank matrix recovery in the
framework of signal plus noise. We can take advantage of the faster convergence of
the low rank structure of signal, since, for a given recovery accuracy, the required
samples N for low rank matrix recovery is O(r log(n)), where r is the rank of the
signal covariance matrix. Results such as Rudelson’s theorem in Sect. 5.4 are the
basis for understanding such a convergence rate.
d ' n,
• 1 -regularized quadratic programs (also known as the Lasso) for sparse linear
regression
• Second-order cone program (SOCP) for the group Lasso
• Semidefinite programming relaxation (SDP) for various problems include sparse
PCA and low-rank matrix estimation.
Yi = Xi , A + Zi , i = 1, 2, . . . , N, (8.15)
we rewrite (8.15) as
y = ϕ + z. (8.16)
y = ϕ(A) + z. (8.17)
the observation matrix Xi ∈ Rm1 ×m2 has i.i.d. zero-mean, unit-variance Gaussian
N (0, 1) entries. In a more general observation model [155], the entries of Xi are
allowed to have general Gaussian dependencies.
For a rectangular matrix B ∈ Rm1 ×m2 , the nuclear or trace norm is defined as
min{m1 ,m2 }
B ∗ = σi (B),
i=1
which is the sum of its singular values that are sorted in non-increasing order. The
maximum singular value is σmax = σ1 and the minimum singular value σmin =
σmin{m1 ,m2 } . The operator norm is defined as A op = σ1 (A). Given a collection
of observation (Yi , Xi ) ∈ R × Rm1 ×m2 , the problem at hand is to estimate the
unknown matrix A ∈ S, where S is a general convex subset of Rm1 ×m2 . The
problem may be formulated as an optimization problem
1 2
 ∈ arg min y − ϕ(A) 2 + γN A ∗ . (8.18)
A∈S 2N
The key condition for us to control the matrix error A − Â between Â, the
SDP solution (8.18), and the unknown matrix A is the so-called restricted
strong convexity. This condition guarantees that the quadratic loss function in
the SDP (8.18) is strictly convex over a restricted set of directions. Let the set
C ⊆ R × Rm1 ×m2 denote the restricted directions. We say the random operator
ϕ satisfies restricted strong convexity over the set C if there exists some κ (ϕ) > 0
such that
1 2 2
ϕ(Δ) 2 κ (ϕ) Δ F for all Δ ∈ C. (8.19)
2N
Recall that N is the number of observations defined in (8.15).
8.6 Matrix Compressed Sensing 419
When q = 0, the set B (R0 ) corresponds to the set of matrices with rank at most R0 .
Theorem 8.6.2 (Nearly low-rank matrix recovery [155]). Suppose / that A/ ∈ B
/N /
(Rq )∩S, the regularization parameter is lower bounded as γN 2/ /
/ εi Xi / /N ,
i=1 op
N
where {εi }i=1 are random variables, and the random operator ϕ satisfies
the restricted strong convexity with parameter κ (ϕ) ∈ (0, 1) over the set
C(Rq /γN q ; δ). Then, any solution  to the SDP (8.18) satisfies the bound
+ 1−q/2 ,
/ /
/ / γN
/Â − A / max δ, 32 Rq . (8.22)
F κ (ϕ)
The error (8.22) reduces to the exact rank case (8.20) when q = 0 and δ = 0.
Example 8.6.3 (Matrix compressed sensing with dependent sampling [155]).
A standard matrix compressed sensing has the form
420 8 Matrix Completion and Low-Rank Matrix Recovery
Yi = Xi , A + Zi , i = 1, 2, . . . , N, (8.23)
where the observation matrix Xi ∈ Rm1 ×m2 has i.i.d. standard Gaussian N (0, 1)
entries. Equation (8.23) is an instance of (8.15). Here, we study a more general
observation model, in which the entries of Xi are allowed to have general Gaussian
dependence.
Equation (8.17) involves a random Gaussian operator mapping Rm1 ×m2 to RN .
We repeat some definitions in Sect. 3.8 for convenience. For a matrix A ∈
Rm1 ×m2 , we use vector vec(A) ∈ RM , M = m1 m2 . Given a symmetric positive
definite matrix Σ ∈ RM ×M , we say that the random matrix Xi is sampled from the
Σ-ensemble if
vec(Xi ) ∼ N (0, Σ) .
where the random matrix X ∈ Rm1 ×m2 is sampled from the Σ-ensemble. For the
2
special case (white Gaussian random vector) Σ = I, we have
√ ρ (Σ) = 1.
The noise vector ∈ R satisfies the bound 2 2ν N for some constant ν.
N
This assumption holds for any bounded noise, and also holds with high probability
for any random noise vector with sub-Gaussian entries with parameter ν. The
simplest case is that of Gaussian noise N (0, ν 2 ).
N
Suppose that the matrices {Xi }i=1 are drawn i.i.d. from the Σ-ensemble, and
that the unknown matrix A ∈ B (Rq ) ∩ S for some q ∈ (0, 1]. Then there are
universal constant c0 , c1 , c2 such that a sample size
yi = A zi + wi , i = 1, 2, . . . , n,
where wi ∼ N 0, ν 2 Im1 ×m2 is observation noise vector. We assume that the
covariates zi are random, i.e., zi ∼ N (0, Σ), i.i.d. for some m2 -dimensional
covariance matrix Σ > 0.
Consider A ∈ B (Rq ) ∩ S. There are universal constants c1 , c2 , c3 such that if
we solve the SDP (8.18) with regularization parameter
"
ν m1 + m 2
γN = 10 σmax (Σ) ,
m1 n
we have
2 2 1−q/2 1−q/2
/ /2
/ / ν σmax (Σ) m1 + m 2
P /Â − A / c1 2 Rq
F σmin (Σ) n
/ /2
/ / m1 + m2
/Â − A / c1 ν 2 r
F n
The goal of the problem is to estimate the unknown matrix A ∈ Rm×m on the
n
basis of a sequence of vector samples {zt }t=1 .
It is natural to expect that the system is controlled primarily by a low-dimensional
subset of variables, implying that A is of low rank. Besides, A is a Hankel matrix.
T
Since zt = [Zt1 · · · Ztm ] is m-dimension column vector, the sample size of
scalar random variables is N = nm. Letting k = 1, 2, . . . , m index the dimension,
we have
= >
Z(t+1)k = ek zTt , A + Wtk . (8.28)
(t, k) → i = (t − 1)k + k.
After doing this, the autoregressive problem can be written in the form of (8.15)
with Yi = Z(t+1)k and observation matrix Xi = ek zTt .
n
Suppose that we are given n samples {zt }t=1 from a m-dimensional autore-
gressive process (8.27) that is stationary, based on a system matrix that is stable
( A op α 1) and approximately low-rank (A ∈ B (Rq )∩S). Then, there are
universal constants c1 , c2 , c3 such that if we solve the SDP (8.18) with regularization
parameter
"
2c0 Σ op m
γN = ,
m (1 − α) n
To prove (8.29), we need the following results (8.30) and (8.31). We need the
notation
⎡ ⎤ ⎡ ⎤
zT1 zT2
⎢ zT ⎥ ⎢ zT3 ⎥
⎢ 2 ⎥ ⎢ ⎥
X=⎢ . ⎥ ∈ Rn×m and Y = ⎢ .. ⎥ ∈ Rn×m .
⎣ .. ⎦ ⎣ . ⎦
zTn zTn+1
Let W be a matrix where each row is sampled i.i.d. from the N (0, C) distribution
corresponding to the innovation noise driving the VAR process. With this notation,
and the relation N = nm, the SDP objective function (8.18) is written as
1 1 / /
/Y − XAT /2 + γn A
F ∗ ,
m 2n
where γn = γN m.
The eigenspectrum of the matrix of the matrix XT X/n is well controlled in
terms of the stationary covariance matrix: in particular, as long as n > c3 m, we have
24σmax (Σ)
P σmax X X/n
T
2c1 exp (−c2 m) , and
1−α (8.30)
P σmin X X/n 0.25σmin (Σ) 2c1 exp (−c2 m) .
T
&
%
y = Xβ + w (8.32)
The notions of sparsity can be defined more precisely in terms of the p -balls1
for p ∈ (0, 1], defined as [135]
+ d
,
p p
Bp (Rp ) = β ∈ Rd : β p = |β i | Rp , (8.33)
i=1
1 T
Zn = sup w Xv , (8.36)
v∈S(s,R) n
where X, w are given in (8.32). Let us show how to bound the random variable Zn .
The approach used by Raskutti et al. [135] is emphasized and followed closely here.
For a given ε ∈ (0, 1) to be chosen, we need to bound the minimal cardinality of
a set that covers S (s,
R) up to (Rε)-accuracy
in -norm. We claim that we may find
such a covering set v1 , . . . , vN ⊂ S (s, R) with cardinality N = N (s, R, ε) that
is upper bounded by
d
log N (s, R, ε) log + 2s log (1/ε) .
2s
1 Strictly speaking, these sets are not “balls” when p < 1, since they fail to be convex.
8.7 Linear Regression 425
d
To establish the claim, we note that there are subsets of size 2s within
2s
{1, 2, . . . , d}. Also, for any 2s-sized subset, there are an (Rε)-covering in 2 -norm
of the ball B2 (R) with at most 22s log(1/ε) elements (e.g., [154]).
/ As ak /result, for each vector v ∈ S (s, R), we may find some v such that
k
/v − v / Rε. By triangle inequality, we obtain
2
T T T
1 w Xv 1 w Xvk + 1 w X v − v k
n n n
T / /
1 w Xvk + 1
w 2 /X v − v k /2 .
n n
1 Xv 2
√ κ, for all v ∈ B0 (2s) .
n v 2
2
Since the vector w ∈ Rn is Gaussian, the variate w 2 /σ 2 is the χ2 distribution
with n degrees of freedom, we have √1n w 2 2σ with probability at least 1 −
c1 exp (−c2 n), where c1 , c2 are two numerical constants, using standard tail bounds
(see Sect. 3.2). Putting all the pieces together, we obtain
1 T 1
w Xv wT Xvk + 2κσRε
n n
with high probability. Taking the supremum over v on both sides gives
1 T
Zn max w Xvk + 2κσRε.
k=1,2,...,N n
It remains to bound the finite maximum over the covering set. First, we see that
/ /2
each variate n1 wT Xvk is zero-mean Gaussian with variance σ 2 /Xvi /2 /n2 . Now
by standard Gaussian tail bounds, we conclude that
√
Zn σRκ 3 log N (s, R, ε)/ n + 2κσRε
√ (8.37)
= σRκ 3 log N (s, R, ε)/ n + 2ε .
1 − c1 exp (−c
with probability greater than √ 2 log N (s, R, ε)).
Finally, suppose ε = s log (d/2s)/ n. With this choice and assuming that
n ≤ d, we have
426 8 Matrix Completion and Low-Rank Matrix Recovery
⎛ ⎞
d⎠
log⎝
log N (s,R,ε) 2s s log( s log(d/2s)
n
)
n n + n
⎛ ⎞
d⎠
log⎝
2s s log(d/s)
n + n
2s+2s log(d/s) s log(d/s)
n + n ,
where the final line uses standard bounds on binomial coefficients. Since d/s ≥ 2
by assumption, we conclude that our choice of ε guarantees that
log N (s, R, ε)
5s log (d/s) .
n
Inserting these relations into (8.37), we conclude that
s log (d/s)
Zn 6σRκ .
n
Since log N (s, R, ε) s log (d − 2s), this event occurs with probability at least
1 − c1 exp (−c2 min {n, s log (d − s)}). We summarize the results in this theorem.
Theorem 8.7.1 ([135]). If the 2 -norm of random matrix X is bounded by
Xv2
√1
n v2
κ for all v with at most 2s non-zeros, i.e., v ∈ B0 (2s), and
w ∈ R is additive Gaussian noise, i.e., w ∼ N 0, σ 2 In×n , then for any radius
d
R > 0, we have
1 T s log (d/s)
sup w Xv 6σκR ,
v0 2s, v2 R n n
1 2 T
w Xe
2
Xv 2
n n
the right-hand side of which is exactly the expression required by Theorem 8.7.1, if
we identify v = e.
8.8 Multi-task Matrix Regression 427
yi = Xβ i + wi , i = 1, 2, . . . , d2 ,
Y = XB + W (8.38)
where Y = [y1 , . . . , yd2 ] and W = [w1 , . . . , wd2 ] are both matrices in Rn×d2
and B = β 1 , . . . , β d2 ∈ Rd1 ×d2 is a matrix of regression vectors. In multi-task
learning, each column of B is called a task and each row of B is a feature.
A special structure has the form of low-rank plus sparse decomposition
B=Θ+Γ
where Θ is of low rank and Γ is sparse, with a small number of non-zero entries.
For example, Γ is row-sparse, with a small number of non-zero rows. It follows
from (8.38) that
Y = X (Θ + Γ) + W (8.39)
In the following examples, the entries of W are assumed to be i.i.d. zero-mean
Gaussian with variance ν 2 , i.e. Wij ∼ N 0, ν 2 .
Example 8.8.1 (Concentration for product of two random matrices [453]). Con-
sider the product of two random matrices defined above
Z = XT W ∈ Rd1 ×d2 .
It can be shown that the matrix Z has independent columns, with each column
zj ∼ N/ 0, ν 2 XT X/n . Let σmax be the maximum eigenvalue of matrix X. Since
/
/XT X/ σmax 2
, known results on the singular values of Gaussian random
op
matrices [145] imply that
/ / 4 (d1 + d2 ) νσmax
P /XT W/op √ 2 exp (−c (d1 + d2 )) .
n
Let xj be the j-th column of matrix X. Let κmax = max xj 2 be the maximum
j=1,...,d1
2 -norm over columns. Since the 2 -norm of the columns of X are bounded by κmax ,
2
the entries of XT W are i.i.d. and Gaussian with variance at most (νκmax ) /n. As a
result, the standard Gaussian tail bound combined with union bound gives
428 8 Matrix Completion and Low-Rank Matrix Recovery
/ / 4νσ
P /XT W/∞ √max log (d1 d2 ) exp (− log (d1 d2 )) ,
n
wk → wk 2
"
ν log d2
P max wk 2 √ + 4ν exp (−3 log d2 ) . &
%
k=1,2,...,d2 d2 d1 d2
Example 8.8.3 (Concentration for trace inner product of two matrices [453]).
We study the function defined as
Z (s) = √
sup | W, Δ | .
Δ1 s, ΔF 1
8.9 Matrix Completion 429
2
t d1 d2
P (Z (s) E [Z (s)] + t) exp − .
2ν 2
4sν 2
d
1 d2
Setting t2 = d1 d2 log s , we have
2sν d1 d2
Z (s) E [Z (s)] + log
d1 d2 s
It remains to upper bound the expected value E [Z (s)]. In order to√do so, we
use [153, Theorem 5.1(ii)] with (q0 , q1 ) = (1, 2), n = d1 d2 , and t = s, thereby
obtaining
? ?
ν √ 2d d
1 2 ν 2d1 d2
E [Z (s)] c s 2 + log c s log .
d1 d2 s d1 d2 s
where ak is the k-th column of matrix A ∈ Rd1 ×d2 . We can study the function
Z̃ (s) = √
sup | W, Δ |
Δ2,1 s, ΔF 1
which is Lipschitz with constant √dν d . Similar to above, we can use the standard
1 2
approach: (1) derive concentration of measure for Gaussian Lipschitz functions; (2)
upper bound the expectation. For details, we see [453]. &
%
This section is taken from Recht [103] for low rank matrix recovery, primarily due
to his highly accessible presentation.
430 8 Matrix Completion and Low-Rank Matrix Recovery
The nuclear norm ||X||∗ of a matrix X is equal to the sum of its singular values
σi (X), and is the best convex lower bound of the rank function that is NP-hard.
i
The intuition behind this heuristic is that while the rank function counts the number
of nonvanishing singular values, the nuclear norm sums their amplitudes, much like
how the 1 norm is a useful surrogate for counting the number of nonzeros in a
vector. Besides, the nuclear norm can be minimized subject to equality constraints
via semidefinite programming (SDP).2
Let us review some matrix preliminaries and also fix the notation. Matrices are
bold capital, vectors are bold lower case and scalars or entries are not bold. For
example, X is a matrix, Xij its (i, j)-th entry. Likewise, x is a vector, and xi its i-th
component. If uk ∈ Rn for 1 k d is a collection of vectors, [u1 , . . . , ud ] will
denote the n × d matrix whose k-th column is uk . ek will denote the k-th standard
basis vector in Rd , equal to 1 in component k and 0 everywhere else. X∗ and x∗
denote the transpose of matrices X and x.
The spectral norm of a matrix is denoted ||X||. The Euclidean inner product
between two matrices is X, Y = Tr (X∗ Y) , and the corresponding Euclidean
norm, called the Frobenius or Hilbert-Schmidt norm, is denoted X F . That is,
1
X F = X, X 2 . Or
2
X F = X, X = Tr XT X , (8.40)
which is a linear operator of XT X since the trace function is linear. The nuclear
norm of a matrix is ||X||∗ . The maximum entry of X (in absolute value) is denoted
by X ∞ = maxij |Xij |, where of course | · | is the absolute value. For vectors, the
only norm applied is the Euclidean 2 norm, simply denoted ||x||.
Linear transformations that act on matrices will be denoted by calligraphic
letters. In particular, the identity operator is I. The spectral norm (the top singular
value) of such an operator is A = supX:XF 1 A (X) F . Subspaces are also
denoted by calligraphic letters.
We suggest the audience to review [454, Chap. 5] for background. We only review
the key definitions needed later. For a set of vectors S = {v1 , . . . , vr }, the subspace
2 The SDP is of course the convex optimization. It is a common practice that once a problem is
recast in terms of a convex optimization problem, then the problem may be solved, using many
general-purpose solvers such as CVX.
8.9 Matrix Completion 431
M⊥ = {x ∈ V : m, x = 0 for all m ∈ M} .
Rn1 ×n2 = T ⊕ T ⊥
where T is the linear space spanned by elements of the form uk y∗ and xv∗k , 1
k r, where x and y are arbitrary, and T ⊥ is its orthogonal complement. T ⊥ is
the subspace of matrices spanned by the family (xy∗ ), where x (respectively y) is
any vector orthogonal to U (respectively V).
The orthogonal projection PT of a matrix Z onto subspace T is defined as
where PU and PV are the orthogonal projections onto U and V, respectively. While
PU and PV are matrices, PT is a linear operator that maps a matrix to another
matrix. The orthogonal projection of a matrix Z onto T ⊥ is written as
With the aid of (8.40), the Frobenius norm of PT ⊥ (ea e∗b ) is given as3
3 This equation in the original paper [104] has a typo and is corrected here.
432 8 Matrix Completion and Low-Rank Matrix Recovery
2 2
PU ea μ (U ) r/n1 , PV eb μ (V) r/n2 , (8.43)
n1 + n2 n1 + n2
PT (ea e∗b )
2
F max {μ (U ) , μ (V)} r μ0 r , (8.44)
n1 n2 n1 n2
For any subspace, the smallest value which μ (W) can be is 1, achieved,√for
example, if W is spanned by vectors whose entries all have magnitude 1/ n.
The largest value for μ (W), on the other hand, is n/r which would correspond
to any subspace that contains a standard basis element. If a matrix has row and
column spaces with low coherence, then each entry can be expected to provide about
the same amount of information.
⎛ ⎞
-/ / .
/ L / ⎜ −τ 2 /2 ⎟
/ / ⎜ ⎟
P / Xk / > τ (d1 + d2 ) exp ⎜ L ⎟. (8.45)
/ / ⎝ 2 ⎠
k=1 ρk + M τ /3
k=1
minimize X ∗
(8.46)
subject to Xij = Mij (i, j) ∈ Ω.
2−2β
is unique
√
and equal to M with probability at least 1 − 6 log (n2 ) (n1 + n2 ) −
2−2 β
n2 .
The proof is very short and straightforward. It only uses basic matrix analysis,
elementary large deviation bounds and a noncommutative version of Bernsterin’s
inequality (See Theorem 8.9.3).
Recovering low-Rank matrices is studied by Gross [102].
Example 8.9.5 (A secure communications protocol that is robust to sparse er-
rors [455]). We want to securely transmit a binary message across a communica-
tions channel. Our theory shows that decoding the message via deconvolution also
makes this secure scheme perfectly robust to sparse corruptions such as erasures or
malicious interference.
d
We model the binary message as a sign vector m0 ∈ {±1} . Choose a random
basis Q ∈ Od . The transmitter sends the scrambled message s0 = Qm0 across the
channel, where it is corrupted by an unknown sparse vector c0 ∈ Rd . The receiver
must determine the original message given only the corrupted signal
z0 = s0 + c0 = Qm0 + c0
signals c0 and m0 . Since the message m0 is a sign vector, we also have the side
information m0 ∞ = 1. Our receiver then recovers the message with the convex
deconvolution method
minimize c 1
(8.47)
subject to m ∞ = 1 and Qm + c = z0 ,
where the decision variables are c, m ∈ Rd . For example, d = 100. This method
succeeds if (c0 , m0 ) is the unique optimal point of (8.47). &
%
Example 8.9.6 (Low-rank matrix recovery with generic sparse corruptions [455]).
Consider the matrix observation
Z0 = X0 + R (Y0 ) ∈ Rn×n ,
minimize X S1
(8.48)
subject to Y 1 α and X + R (Y) = Z0 .
where
Yi = Tr (RXi ) + Wi , i = 1, . . . , n, (8.50)
8.10 Von Neumann Entropy Penalization and Low-Rank Matrix Estimation 435
Let Mm×m be the set of all m × m matrices with complex entries. Tr(S) denotes
the trace of S ∈ Mm×m , and S∗ denotes its adjoint matrix. Let Hm×m be the set of
all m × m Hermitian matrices with complex entries, and let
Sm×m ≡ S ∈ Hm×m (C) : S 0, Tr (S) = 1
be the set of all nonnegatively definite Hermitian matrices of trace 1. The matrices
of Sm×m can be interpreted, for instance, as density matrices, describing the states
of a quantum system; or covariance matrices, describing the states of the observed
phenomenon.
Let X ∈ Hm×m (C) be a matrix (an observable) with spectral representation
m
X= λi Pi , (8.52)
i=1
where λi are the eigenvalues of X and Pi are its spectral projectors. Then, a matrix-
valued measurement of X in a state of R ∈ S ∈ Mm×m would result in outcomes
λi with probabilities λi = Tr (RPi ) and its expectation is ER X = Tr (RX).
Let X1 , . . . , Xn ∈ Hm×m (C) be given matrices (observables), and let R ∈
S m×m
be an unknown state of the system. A statistical problem is to estimate
the unknown R, based on the matrix-valued observations (X1 , Y1 ) , . . . , (Xn , Yn ),
where Y1 , . . . , Yn are outcomes of matrix-valued measurements of the observables
X1 , . . . , Xn for the system identically prepared n times in the state R. In other
words, the unknown state R of the system is to be “learned” from a set of n linear
measurements in a number of “directions” X1 , . . . , Xn .
It is assumed that the matrix-valued design variables X1 , . . . , Xn are also
random; specifically, they are i.i.d. Hermitian m × m matrices with distribution Π.
436 8 Matrix Completion and Low-Rank Matrix Recovery
In this case, the observations (X1 , Y1 ) , . . . , (Xn , Yn ) are i.i.d., and they satisfy the
following model:
Yi = Tr (RXi ) + Wi , i = 1, . . . , n, (8.53)
The linear space of matrices Mm×m (C) can be equipped with the Hilbert-Schmidt
inner product,
A, B = Tr (AB∗ ) .
2 1 2
E| A, X | = A 2 ,
m2
1/2
where · 2 = ·, · is the Hilbert-Schmidt (or Frobenius) norm.
Example 8.10.1 (Matrix Completion). Let {ei : i = 1, . . . , m} the canonical basis
of Cm , where ei are m-dimensional vectors. We first define
Eii = ei ⊗ ei , i = 1, . . . , m
1 1 (8.54)
Eij = √ (ei ⊗ ej + ej ⊗ ei ) , Eji = √ (ei ⊗ ej − ej ⊗ ei ) ,
2 2
EXij
2
= 1, i = 1, . . . , m, and EXij
2
= 12 , i < j.
2. Rademacher design:
2 1 2
E| A, X | = A 2 , A ∈ Hm×m (C) ,
m2
m
for any Hermitian matrix S with spectral representation S = λ (φi ⊗ φi ) and
i=1
any function f defined on a set that contains the spectrum of S. See Sect. 1.4.13.
Let us consider the case of sampling from an orthonormal basis {E1 , . . . , Em2 }
of Hm×m (C) (that consists of Hermitian matrices). Let us call the distribution Π in
{E1 , . . . , Em2 } nearly uniform if and only there exist constants c1 , c2 such that
1 2 1 2
max Π ({Ei }) c1 and A L2 (Π) c2 A 2 , A ∈ Hm×m (C) .
1im2 m2 m2
Clearly the matrix completion design (Example 8.10.1) is a special case of sampling
from such nearly uniform distributions.
We study the following estimator of the unknown state R defined a solution of a
penalized empirical risk minimization problem:
- n
.
1 2
R̂ε = arg min (Yi − Tr (SXi )) + ε Tr (S log S) , (8.55)
S∈Sm×m n i=1
/ /2 "
/ ε / m mtm
/R̂ − R/ C ε log p ∧ log ∨ σw
L2 (Π) ε n
√
√ m (τn log n ∨ tm )
∨ σw ∨ m . (8.58)
n
/ /2
/ ε / 2
/R̂ − R/ inf 2 S−R L2 (Π)
L2 (Π) S∈Sm×m
2
σw rank (S) mtm log2 (mn) m (τn log n ∨ tm )
+C ∨ .
n n
(8.59)
Let us present three tools that have been used for low-rank matrix estimation,
since they are of general interest. We must bear in mind that random matrices are
noncommutative, which is fundamentally different form the scalar-valued random
variables.
Noncommutative Kullback-Liebler and other distances. We use noncommutative
extensions of classical distances between probability distributions such as Kullback-
Liebler and Hellinger distances. We use the symmetrized Kullback-Liebler distance
between two states S1 , S1 ∈ Sm×m defined as
440 8 Matrix Completion and Low-Rank Matrix Recovery
The sums of random matrices can be understood, with the help of concentra-
tion of measure. A matrix may be viewed as a vector in n-dimensional space.
In this section, we make the connection between the sum of random matrices and
the optimization problem. We draw material from [459] for the background of
incremental methods. Consider the sum of a large number of component functions
N
fi (x). The number of components N is very large. We can further consider the
i=1
optimization problem
N
minimize fi (x)
i=1 (8.60)
subject to x ∈ X ,
on the entire cost function. If each incremental iteration tends to make reasonable
progress in some “average” sense, then, depending on the value of N , an incremental
method may significantly outperform (by orders of magnitude) its nonincremental
counterparts. This framework provides flexibility in exploiting the special structure
of fi , including randomization in the selection of components. It is suitable for large-
scale, distributed optimization—such as in Big Data [5].
Incremental subgradient methods apply to the case where the component functions f_i are convex and nondifferentiable at some points, for example
f_i(x) = x_i \otimes x_i, \qquad f_i(x) = c_i\, x_i \otimes x_i, \qquad R_y \triangleq \mathbb{E}[y_i \otimes y_i] = I,
or the l_1-regularized least-squares objective
\sum_{i=1}^{N} (a_i^T x - b_i)^2 + \gamma \sum_{j=1}^{n} |x_j|, \quad \text{s.t. } x = (x_1, \dots, x_n) \in \mathbb{R}^n,
where h_i(x) represents the difference between the i-th measurement (out of N) from a physical system and the output of a parametric model whose parameter vector is x. Another choice is
f_i(x) = g(a_i^T x - b_i).
In a related stochastic setting, w is a random variable taking a finite but very large number of values w_i, i = 1, ..., N, with corresponding probabilities \pi_i. Then the cost function consists of the sum of the N random functions \pi_i F(x, w_i).
Example 8.11.4 (Distributed incremental optimization in sensor networks). Consider a network of N sensors where data are collected and used to solve some inference problem involving a parameter vector x. If f_i(x) represents an error penalty for the data collected by the i-th sensor, then the inference problem is of the form (8.60). One approach is the centralized approach: collect all the data at a fusion center. The preferable alternative is the distributed approach: it saves data-communication overhead and/or takes advantage of parallelism in computation. In the age of Big Data, this distributed alternative is almost mandatory due to the need for storing massive amounts of data.
In such an approach, the current iterate x_k is passed from one sensor to another, with each sensor i performing an incremental iteration involving just its local component function f_i, and the entire cost function need not be known at any one location! See [461, 462] for details.
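A minimal sketch of an incremental subgradient iteration for the l_1-regularized least-squares cost discussed above; the data, step-size rule, and regularization weight below are illustrative assumptions, not the authors' implementation.

```matlab
% Incremental subgradient descent for sum_i (a_i'*x - b_i)^2 + gamma*||x||_1.
rng(0);
N = 1000; n = 20;                     % many components, modest dimension
A = randn(N, n);
xtrue = zeros(n, 1); xtrue(1:3) = [2; -1; 0.5];
b = A*xtrue + 0.1*randn(N, 1);
gamma = 0.1;
x = zeros(n, 1);
for k = 1:50                          % passes over the data
  for i = randperm(N)                 % randomized component selection
    ai = A(i, :)'; ri = ai'*x - b(i);
    g = 2*ri*ai + (gamma/N)*sign(x);  % subgradient of one component
    x = x - (1/(k*N))*g;              % diminishing step size
  end
end
disp(norm(x - xtrue));                % should be small after a few passes
```

The point of the sketch is that each update touches a single component f_i, so no node ever needs the entire cost function.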
Our interest in the problem of spectral factorization and phase retrieval is motivated by the pioneering work of [463, 464], where the problem at hand is connected with the recently developed machinery of matrix completion; see, e.g., [102, 465–472] for the most cited papers. This connection was first made in [473], followed by Candes et al. [463, 464]. The first experimental demonstration was made, via an LED source with a 620 nm central wavelength, in [474].
8.12.1 Methodology
where
X(e^{j\omega}) = \frac{1}{\sqrt{n}} \sum_{0 \le k < n} x[k]\, e^{-j\omega k}, \quad \omega \in \Omega,
R(e^{j\omega}) = \frac{1}{\sqrt{n}} \sum_{0 \le k < n} r[k]\, e^{-j\omega k}, \quad \omega \in \Omega,
are the discrete Fourier transforms of x_n and r_n, respectively. The task of spectral factorization is equivalent to recovering the missing phase information of X(e^{j\omega}) from its squared magnitude |X(e^{j\omega})|^2 [475]. This problem is often called phase retrieval in the literature [476]. Spectral factorization and phase retrieval have been extensively studied; see, e.g., [476, 477] for comprehensive surveys.
Let the unknown x and the observed vector b be collected as
x = [x_1, x_2, \dots, x_N]^T \quad \text{and} \quad b = [b_1, b_2, \dots, b_N]^T.
Each entry b_i is the squared magnitude of the inner product between the signal and some vector z_i. Our task is to find the unknown vector x. The most important observation is to linearize the nonlinear quadratic measurements. It is well known that this can be done by interpreting the quadratic measurements as linear measurements of the unknown rank-one matrix X = xx^*. By using this "trick", we can solve a linear problem in the unknown, matrix-valued variable X. It is remarkable that one additional dimension (from one to two dimensions) can make such a decisive difference. This trick has been systematically exploited in the context of matrix completion.
As a result of this trick, we have
|\langle z_i, x\rangle|^2 = \mathrm{Tr}(|\langle z_i, x\rangle|^2)   (trace property of a scalar)
 = \mathrm{Tr}\big((z_i^* x)(z_i^* x)^*\big)   (definitions of \langle\cdot,\cdot\rangle and |\cdot|^2)
 = \mathrm{Tr}(z_i^*\, x x^*\, z_i)   (property of transpose (Hermitian))
 = \mathrm{Tr}(z_i^* X z_i)   (KEY: identifying X = xx^*)
 = \mathrm{Tr}(z_i z_i^* X)   (cyclical property of trace)
 = \mathrm{Tr}(A_i X)   (identifying A_i = z_i z_i^*).   (8.64)
The first equality follows from the fact that the trace of a scalar equals the scalar itself, that is, Tr(α) = α. The second equality follows from the definition of the inner product \langle c, d\rangle = c^* d and the definition of the squared modulus of a scalar, that is, for any complex scalar α, |α|^2 = αα^* = α^*α. The third equality follows from the property of the Hermitian transpose (transpose and conjugation), that is, (AB)^* = B^*A^*. The fourth step is critical: we identify the rank-one matrix X = xx^*. Note that A^*A and AA^* are always positive semidefinite, A^*A, AA^* ≥ 0, for any matrix A. The fifth equality follows from the famous cyclical property of the trace [17, p. 31], Tr(AB) = Tr(BA). Note that the trace is a linear operator. The last step is reached by identifying another rank-one matrix A_i = z_i z_i^*.
Since the trace is a linear operator [17, p. 30], the phase retrieval problem, by combining (8.63) and (8.64), comes down to a linear problem in the unknown rank-one matrix X. Let A be the linear operator that maps positive semidefinite matrices into {Tr(A_i X) : i = 1, ..., N}. In other words, we are given N observed pairs {y_i, b_i : i = 1, ..., N}, where y_i = Tr(A_i X) is a scalar-valued random variable. Thus, the phase retrieval problem is equivalent to
\text{find } X, \ \text{subject to } \mathcal{A}(X) = b,\ X \ge 0,\ \mathrm{rank}(X) = 1 \quad \Leftrightarrow \quad \text{minimize } \mathrm{rank}(X), \ \text{subject to } \mathcal{A}(X) = b,\ X \ge 0. \qquad (8.65)
After solving the left-hand side of (8.65), we can factorize the rank-one solution X as xx^*. The equivalence between the left- and right-hand sides of (8.65) holds because, whenever a rank-one feasible point exists, any rank minimizer must itself be of rank one.
The rank minimization problem (8.65) is NP-hard. It is well known that the rank function can be replaced by the trace norm, a convex surrogate [478, 479]. This technique gives the familiar semidefinite programming (SDP) problem
\text{minimize } \mathrm{Tr}(X) \quad \text{subject to } \mathcal{A}(X) = b,\ X \ge 0, \qquad (8.66)
where A is a linear operator. This problem is convex and there exists a wide array of general-purpose solvers. As far as we are concerned, the problem is solved once it can be formulated in terms of convex optimization. For example, in [473], (8.66) was solved using the solver SDPT3 [480] with the interface provided by the package CVX [481]. In [474], the singular value thresholding (SVT) method [466] was used. In [463], all algorithms were implemented in MATLAB using TFOCS [482].
The trace norm promotes low-rank solutions, which is why it is used so often as a convex proxy for the rank. We can also solve a sequence of weighted trace-norm problems, a technique which provides even more accurate solutions [483, 484]. Choose ε > 0; start with W_0 = I and, for k = 0, 1, ..., inductively define X_k as the optimal solution to
\text{minimize } \mathrm{Tr}(W_k X) \quad \text{subject to } \mathcal{A}(X) = b,\ X \ge 0. \qquad (8.67)
where
Yi = Tr (RXi ) + Wi , i = 1, . . . , n. (8.70)
intensity. The simplified quantity is called the mutual intensity. The measurable quantity of the classical field after propagation over a distance z is [474, 486]
I(x_0; z) = \iint dx_1\, dx_2\, J(x_1, x_2)\, \exp\Big[-\frac{j\pi}{\lambda z}\big(x_1^2 - x_2^2\big)\Big] \exp\Big[j 2\pi \frac{(x_1 - x_2)}{\lambda z}\, x_0\Big], \qquad (8.71)
where \bar{I}(\mu; z) is the Fourier transform of the vector of measured intensities with respect to x_0. Thus, radial slices of the Ambiguity Function may be obtained by Fourier transforming the vectors of intensities measured at the corresponding propagation distances z. From the Ambiguity Function, the mutual intensity J(x, \Delta x) can be recovered by an additional inverse Fourier transform, subject to sufficient sampling.
Let us formulate the problem as a linear model. The measured intensity data are first arranged in the Ambiguity Function space. The mutual intensity J is defined as the unknown to solve for. To relate the unknown (mutual intensity J) to the measurements (Ambiguity Function A), the center-difference coordinate transform is first applied, which can be expressed as a linear transform L on the mutual intensity J; this is followed by a Fourier transform F and additive measurement noise. Formally, we have
A = F · L · J + e.
The propagation operator for the mutual intensity P_x is unitary and Hermitian, since it preserves energy. The goal of low-rank matrix recovery is to minimize the rank (the effective number of coherent modes). We formulate the physically meaningful prior belief that the number of significant coherent modes is very small (the eigenvalues of most modes are either very small or zero).
Mathematically, if we denote the eigenvalues by λ_i and the estimated mutual intensity by Ĵ, the problem can be formulated as
\text{minimize } \mathrm{rank}(\hat{J}) \quad \text{subject to } A = F \cdot L \cdot J, \quad \lambda_i \ge 0 \ \text{and}\ \sum_i \lambda_i = 1. \qquad (8.72)
Direct rank minimization is NP-hard. We solve instead a proxy problem, replacing the rank with the "nuclear norm". The nuclear norm of a matrix is defined as the sum of the singular values of the matrix. The corresponding problem is stated as
\text{minimize } \|\hat{J}\|_* \quad \text{subject to } A = F \cdot L \cdot J, \quad \lambda_i \ge 0 \ \text{and}\ \sum_i \lambda_i = 1. \qquad (8.73)
This problem is a convex optimization problem, which can be solved using general-purpose solvers. In [474], the singular value thresholding (SVT) method [466] was used.
We follow our paper [487] on a novel single-step approach to self-coherent tomography for the exposition. Phase retrieval is implicitly executed.
E_{\rm scatter}(l^t_{n_t} \to \Omega \to l^r_{n_r}) = \sum_{n_d=1}^{N_d} G(l^d_{n_d} \to l^r_{n_r})\, E_{\rm tot}(l^t_{n_t} \to l^d_{n_d})\, \tau_{n_d}. \qquad (8.75)
In Eq. (8.75), G(l^d_{n_d} \to l^r_{n_r}) is the wave-propagation Green's function from location l^d_{n_d} to location l^r_{n_r}, and E_{\rm tot}(l^t_{n_t} \to l^d_{n_d}) is the total field in the target subarea l^d_{n_d} caused by the sounding signal from the n_t-th transmitter sensor, which can be represented by the state equation
E_{\rm tot}(l^t_{n_t} \to l^d_{n_d}) = E_{\rm inc}(l^t_{n_t} \to l^d_{n_d}) + \sum_{n_d'=1,\, n_d' \ne n_d}^{N_d} G(l^d_{n_d'} \to l^d_{n_d})\, E_{\rm tot}(l^t_{n_t} \to l^d_{n_d'})\, \tau_{n_d'}. \qquad (8.76)
Define e_{\rm inc,m} \in \mathbb{C}^{N_t N_r \times 1} as
e_{\rm inc,m} = \big[E_{\rm inc}(l^t_1 \to l^r_1),\; E_{\rm inc}(l^t_1 \to l^r_2),\; \dots,\; E_{\rm inc}(l^t_1 \to l^r_{N_r}),\; E_{\rm inc}(l^t_2 \to l^r_1),\; \dots,\; E_{\rm inc}(l^t_{N_t} \to l^r_{N_r})\big]^T. \qquad (8.78)
Define e_{\rm scatter,m} \in \mathbb{C}^{N_t N_r \times 1} as
e_{\rm scatter,m} = \big[e_{\rm scatter,m,1},\; e_{\rm scatter,m,2},\; \dots,\; e_{\rm scatter,m,N_t}\big]^T \qquad (8.79)
and τ is
\tau = [\tau_1, \tau_2, \dots, \tau_{N_d}]^T. \qquad (8.82)
\mathbf{G}_s = \begin{bmatrix} 0 & G(l^d_2 \to l^d_1) & \cdots & G(l^d_{N_d} \to l^d_1) \\ G(l^d_1 \to l^d_2) & 0 & \cdots & G(l^d_{N_d} \to l^d_2) \\ \vdots & \vdots & \ddots & \vdots \\ G(l^d_1 \to l^d_{N_d}) & G(l^d_2 \to l^d_{N_d}) & \cdots & 0 \end{bmatrix} \qquad (8.84)
and e_{\rm inc,s,n_t} \in \mathbb{C}^{N_d \times 1} is
e_{\rm inc,s,n_t} = \big[E_{\rm inc}(l^t_{n_t} \to l^d_1),\; E_{\rm inc}(l^t_{n_t} \to l^d_2),\; \dots,\; E_{\rm inc}(l^t_{n_t} \to l^d_{N_d})\big]^T. \qquad (8.85)
Define B_m \in \mathbb{C}^{N_t N_r \times N_d} as
B_m = \begin{bmatrix} G_m & 0 & \cdots & 0 \\ 0 & G_m & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & G_m \end{bmatrix} E_{\rm tot,s}. \qquad (8.87)
where
A = [a_1^H\; a_2^H\; \dots\; a_M^H]^H \qquad (8.90)
y = [y_1^H\; y_2^H\; \dots\; y_M^H]^H \qquad (8.91)
and
o_i = (a_i x)(a_i x)^H = a_i\, x x^H a_i^H = \mathrm{trace}(a_i^H a_i\, x x^H), \qquad (8.93)
o_i = \mathrm{trace}(A_i X). \qquad (8.94)
The phase retrieval problem can then be written as
\text{minimize } \mathrm{rank}(X) \quad \text{subject to } o_i = \mathrm{trace}(A_i X),\ i = 1, 2, \dots, M,\ \ X \ge 0. \qquad (8.95)
X≥0
However, the rank function is not a convex function and the optimization
problem (8.95) is not a convex optimization problem. Hence, the rank function
is relaxed to the trace function or the nuclear norm function which is a convex
function. The optimization problem (8.95) can be relaxed to an SDP,
minimize
trace(X)
subject to (8.96)
oi = trace(Ai X), i = 1, 2, . . . , M
X≥0
If the optimal solution X of (8.96) is a rank-1 matrix, then the optimal solution x to the original phase retrieval problem is obtained by an eigen-decomposition of X. However, there is still a phase ambiguity problem. When the number of measurements M is smaller than necessary for a unique solution, additional assumptions are needed to select one of the solutions [489]. Motivated by compressive sensing, if we would like to seek a sparse vector x, the objective function in the SDP (8.96) can be replaced by trace(X) + δ‖X‖_1, where ‖·‖_1 returns the l_1 norm of a matrix (the sum of the absolute values of its entries) and δ is a design parameter [489].
Here, the solution to the linearized self-coherent tomography problem is given first. Then, a novel single-step approach based on the Born iterative method is proposed to deal with self-coherent tomography when mutual multi-scattering is taken into account. The distorted-wave Born approximation (DWBA) is used here to linearize self-coherent tomography. Specifically, all multiple scattering within the target domain is ignored in DWBA [490, 491]. Hence, E_{\rm tot}(l^t_{n_t} \to l^d_{n_d}) in Eq. (8.76) reduces to E_{\rm tot}(l^t_{n_t} \to l^d_{n_d}) = E_{\rm inc}(l^t_{n_t} \to l^d_{n_d}), and B_m in Eq. (8.87) simplifies to
B_m = \begin{bmatrix} G_m & 0 & \cdots & 0 \\ 0 & G_m & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & G_m \end{bmatrix} \begin{bmatrix} \mathrm{diag}(e_{\rm inc,s,1}) \\ \mathrm{diag}(e_{\rm inc,s,2}) \\ \vdots \\ \mathrm{diag}(e_{\rm inc,s,N_t}) \end{bmatrix}. \qquad (8.97)
o = |c + Ax|^2 \qquad (8.98)
and
o_i = |c_i + a_i x|^2 = \mathrm{trace}(A_i X) + |c_i|^2 + (a_i x)\, c_i^* + (a_i x)^* c_i, \qquad (8.99)
where * returns the conjugate of a complex number. There are two unknown variables, X and x, in Eq. (8.99), which is different from Eq. (8.94), where there is only one unknown variable X. In order to solve the set of nonlinear equations in Eq. (8.98) for x, the following SDP is proposed:
\text{minimize } \mathrm{trace}(X) + \delta\,\|x\|_2
\text{subject to } o_i = \mathrm{trace}(A_i X) + |c_i|^2 + (a_i x)\, c_i^* + (a_i x)^* c_i, \quad i = 1, 2, \dots, N_t N_r,
\begin{bmatrix} X & x \\ x^H & 1 \end{bmatrix} \ge 0, \quad X \ge 0, \qquad (8.100)
where ‖·‖_2 returns the l_2 norm of a vector and δ is a design parameter. The optimal solution x can then be obtained without phase ambiguity. Furthermore, if we know additional prior information about x, for example, bounds on the real or imaginary part of each entry of x, this prior information can be incorporated into the optimization problem (8.100) as linear constraints:
\text{minimize } \mathrm{trace}(X) + \delta\,\|x\|_2
\text{subject to } o_i = \mathrm{trace}(A_i X) + |c_i|^2 + (a_i x)\, c_i^* + (a_i x)^* c_i, \quad i = 1, 2, \dots, N_t N_r,
b^{\rm lower}_{\rm real} \le \mathrm{real}(x) \le b^{\rm upper}_{\rm real}, \qquad b^{\rm lower}_{\rm imag} \le \mathrm{imag}(x) \le b^{\rm upper}_{\rm imag},
\begin{bmatrix} X & x \\ x^H & 1 \end{bmatrix} \ge 0, \quad X \ge 0, \qquad (8.101)
where real(·) returns the real part and imag(·) the imaginary part of a complex number. b^{\rm lower}_{\rm real} and b^{\rm upper}_{\rm real} are the lower and upper bounds of the real part of x, respectively. Similarly, b^{\rm lower}_{\rm imag} and b^{\rm upper}_{\rm imag} are the lower and upper bounds of the imaginary part of x, respectively.
If mutual multi-scattering is considered, we have to solve Eq. (8.88) to obtain τ, i.e., x. The novel single-step approach based on the Born iterative method proceeds as follows:
1. Set τ^{(0)} to zero; t = −1.
2. t = t + 1; compute B_m^{(t)} from Eqs. (8.87) and (8.86) using τ^{(t)}.
3. Solve the inverse problem in Eq. (8.98) by the following SDP, using B_m^{(t)}, to get τ^{(t+1)}:
\text{minimize } \mathrm{trace}(X) + \delta_1\,\|x\|_2 + \delta_2\,\|o - u\|_2
\text{subject to } u_i = \mathrm{trace}(A_i X) + |c_i|^2 + (a_i x)\, c_i^* + (a_i x)^* c_i, \quad i = 1, 2, \dots, N_t N_r,
b^{\rm lower}_{\rm real} \le \mathrm{real}(x) \le b^{\rm upper}_{\rm real}, \qquad b^{\rm lower}_{\rm imag} \le \mathrm{imag}(x) \le b^{\rm upper}_{\rm imag},
\begin{bmatrix} X & x \\ x^H & 1 \end{bmatrix} \ge 0, \quad X \ge 0. \qquad (8.102)
Noisy low-rank matrix completion with a general sampling distribution was studied by Klopp [492]. Concentration-based guarantees are studied by Foygel et al. [493], Foygel and Srebro [494], and Koltchinskii and Rangel [495]. The paper [496] introduces a penalized matrix estimation procedure aiming at solutions that are simultaneously sparse and low-rank.
Related work on phase retrieval includes [463, 464, 473, 474, 497–502]; in particular, robust phase retrieval for sparse signals is considered in [503].
Chapter 9
Covariance Matrix Estimation in High Dimensions
The nonasymptotic point of view [108] may turn out to be relevant when the number
of observations is large. It is to fit large complex sets of data that one needs to
deal with possibly huge collections of models at different scales. This approach
allows the collections of models together with their dimensions to vary freely,
letting the dimensions be possibly of the same order of magnitude as the number
of observations. Concentration inequalities are the probabilistic tools that we need
to develop a nonasymptotic theory.
A hybrid, large-scale cognitive radio network (CRN) testbed consisting of 100
hybrid nodes: 84 USRP2 nodes and 16 WARP nodes, as shown in Fig. 9.1, is
deployed at Tennessee Technological University. In each node, non-contiguous
orthogonal frequency division multiplexing (NC-OFDM) waveforms are agile and
programmable, as shown in Fig. 9.2, due to the use of software defined radios;
such waveforms are ideal for the convergence of communications and sensing.
The network can work in two different modes: sense and communicate. They can even work in a hybrid mode: communicating while sensing. From a sensing point of view, this network is an active wireless sensor network. Consequently, many analytical tools can be borrowed from wireless sensor networks; on the other hand, there is a fundamental difference between our sensing problems and the traditional
Fig. 9.1 A large-scale cognitive radio network is deployed at Tennessee Technological University,
as an experimental testbed on campus. A hybrid network consisting of 80 USRP2 nodes and 16
WARP nodes. The ultimate goal is to demonstrate the big picture: sense, communicate, compute,
and control
wireless sensor network. The main difference derives from the nature of the SDR and dynamic spectrum access (DSA) for a cognitive radio. The large-scale CRN testbed has received little attention in the literature.
With the vision of the big picture (sense, communicate, compute, and control), we deal with Big Data. A fundamental problem is to determine what information needs to be stored locally and what information is to be communicated in real time. The communication data rates ultimately determine how the computing is distributed across the whole network. It is impossible to solve this problem analytically, since the answer depends on the application. The experience gained with this network will enable us to develop better ways to approach this problem. At this point, through a heuristic approach, we assume that only the covariance matrix of the data is measured at each node and communicated in real time. More specifically, at each USRP2 or WARP node, only the covariance matrix of the data is communicated across the network in real time, at a data rate of, say, 1 Mbps. These (sensing) nodes record data much faster than the communication speed; for example, a data rate of 20 Mbps can be supported by USRP2 nodes.
The problem is sometimes called wireless distributed computing. Our view emphasizes the convergence of sensing and communications. Distributed (parallel) computing is needed to support all the applications we have in mind, with the purpose of control across the network.
Fig. 9.2 The non-contiguous orthogonal frequency division multiplexing (NC-OFDM) wave-
forms are suitable for both communications and sensing. The agile, programmable waveforms
are made available by software defined radios (SDR)
The vision of this section is interesting, especially in the context of the Smart Grid, which is a huge network full of sensors across the whole grid. There is an analogy between the Smart Grid and a social network: each node of the grid is an agent. This connection is a long-term research topic. An in-depth treatment of this topic is beyond the scope of this monograph.
The classical estimator for the covariance matrix is the sample covariance matrix
\hat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^H. \qquad (9.2)
where |A| = (A^H A)^{1/2}. This type of spectral-norm error bound defined in (9.3)
is quite powerful. It limits the magnitude of the estimator error for each entry of
the covariance matrix; it even controls the error in estimating the eigenvalues of the
covariance using the eigenvalues of the sample covariance.
Unfortunately, the error bound (9.3) for the sample covariance estimator demands
a lot of samples. Typical positive results state that the sample covariance matrix
estimator is precise when the number of samples is proportional to the number of
variables, provided that the distribution decays fast enough. For example, assuming
that x follows a normal distribution:
n \ge C\varepsilon^{-2} p \;\Rightarrow\; \|\hat{\Sigma} - \Sigma\| \le \varepsilon\,\|\Sigma\| \ \text{with high probability}, \qquad (9.4)
for all i = 1, ..., N, almost surely. For all ε ∈ (0, 1/2) and δ ∈ (0, 1),
\mathbb{P}\Big[\lambda_{\max}\Big(\frac{1}{N}\sum_{i=1}^{N} x_i x_i^T\Big) > 1 + \frac{C_{\varepsilon,\delta,N}}{1-2\varepsilon} \ \text{ or }\ \lambda_{\min}\Big(\frac{1}{N}\sum_{i=1}^{N} x_i x_i^T\Big) < 1 - \frac{C_{\varepsilon,\delta,N}}{1-2\varepsilon}\Big] \le \delta,
where
C_{\varepsilon,\delta,N} = \gamma\cdot\Bigg(\sqrt{\frac{32\,\big(n\log(1+2/\varepsilon)+\log(2/\delta)\big)}{N}} + \frac{2\,\big(n\log(1+2/\varepsilon)+\log(2/\delta)\big)}{N}\Bigg).
The sub-Gaussian property most readily lends itself to bounds on linear combinations of sub-Gaussian random variables. However, the outer products are certain quadratic combinations. We bootstrap from the bound for linear combinations to bound the moment generating function of the quadratic combinations. From there, we get the desired tail bound.
Let W be a (scalar-valued) non-negative random variable. For any β ∈ R, we have
\mathbb{E}[\exp(\beta W)] - \beta\,\mathbb{E}[W] - 1 = \beta \int_0^{\infty} \big(\exp(\beta t) - 1\big)\,\mathbb{P}[W > t]\,dt. \qquad (9.5)
Theorem 9.2.2 (Sums of random vector outer products (quadratic form) [113]). Let x_1, ..., x_N be random vectors in R^n such that, for some γ ≥ 0,
\mathbb{E}\big[x_i x_i^T \mid x_1, \dots, x_{i-1}\big] = I \quad \text{and} \quad \mathbb{E}\big[\exp(\alpha^T x_i) \mid x_1, \dots, x_{i-1}\big] \le \exp\big(\|\alpha\|^2\gamma/2\big) \ \text{for all } \alpha \in \mathbb{R}^n,
for all i = 1, ..., N, almost surely. For all α ∈ R^n such that ‖α‖ = 1 and all δ ∈ (0, 1),
\mathbb{P}\Big[\alpha^T\Big(\frac{1}{N}\sum_{i=1}^{N} x_i x_i^T\Big)\alpha > 1 + \sqrt{\frac{32\gamma^2\log(1/\delta)}{N}} + \frac{2\gamma\log(1/\delta)}{N}\Big] \le \delta \quad \text{and}
\mathbb{P}\Big[\alpha^T\Big(\frac{1}{N}\sum_{i=1}^{N} x_i x_i^T\Big)\alpha < 1 - \sqrt{\frac{32\gamma^2\log(1/\delta)}{N}}\Big] \le \delta.
See [113] for a proof, based on (9.5). With this theorem, we can bound the smallest and largest eigenvalues of the empirical covariance matrix by applying the bound for the Rayleigh quotient (quadratic form) in the above theorem, together with a covering argument from Pisier [152].
M \odot \hat{\Sigma},
where the symbol \odot denotes the component-wise (i.e., Schur or Hadamard) product. The following expression bounds the root-mean-square spectral-norm error that this estimator incurs:
\Big(\mathbb{E}\|M\odot\hat{\Sigma} - \Sigma\|^2\Big)^{1/2} \le \underbrace{\Big(\mathbb{E}\|M\odot\hat{\Sigma} - M\odot\Sigma\|^2\Big)^{1/2}}_{\text{variance}} + \underbrace{\|M\odot\Sigma - \Sigma\|}_{\text{bias}}. \qquad (9.6)
This bound is analogous to the classical bias-variance decomposition for the mean-squared error (MSE) of a point estimator. To obtain an effective estimator, we must design a mask that controls both the bias and the variance in (9.6). We cannot neglect too many components of the covariance matrix, or else the bias in the masked estimator may compromise its accuracy. On the other hand, each additional component we include in our estimator contributes to the size of the variance term. In the case where the covariance matrix is sparse, it is natural to strike a balance between these two effects by refusing to estimate entries of the covariance matrix that we know a priori to be small or zero.
For a stationary random process, the covariance matrix is Toeplitz. A Toeplitz matrix or diagonal-constant matrix, named after Otto Toeplitz, is a matrix in which each descending diagonal from left to right is constant. For instance, any matrix of the form
\begin{bmatrix} a & b & c & d \\ e & a & b & c \\ f & e & a & b \\ g & f & e & a \end{bmatrix}
is a Toeplitz matrix.
Example 9.2.3 (The Banded Estimator of a Decaying Matrix). Let us consider the example where the entries of the covariance matrix Σ decay away from the diagonal. Suppose that, for a fixed parameter α > 1,
(\Sigma)_{ij} \le |i - j + 1|^{-\alpha} \quad \text{for each pair } (i, j) \text{ of indices.}
This type of property may hold for a random process whose correlations are localized in time. Related structure arises from random fields that have short spatial correlation scales.
A simple (suboptimal) approach to this covariance estimation problem is to focus
on a band of entries near the diagonal. Suppose that the bandwidth B = 2b + 1 for a
nonnegative integer b. For example, a mask with bandwidth B = 3 for an ensemble
of p = 5 variables takes the form
M_{\rm band} = \begin{bmatrix} 1 & 1 & & & \\ 1 & 1 & 1 & & \\ & 1 & 1 & 1 & \\ & & 1 & 1 & 1 \\ & & & 1 & 1 \end{bmatrix}.
Gershgorin's theorem [187, Sect. 6.1] implies that the spectral norm of a symmetric matrix is dominated by the maximum l_1 norm of a column, so
\|M\odot\Sigma - \Sigma\| \le 2\sum_{k>b}(k+1)^{-\alpha} \le \frac{2}{\alpha-1}\,(b+1)^{1-\alpha}.
The second inequality follows by comparing the sum with an integral. A similar calculation shows
\|\Sigma\| \le 1 + 2(\alpha-1)^{-1}.
Assuming the covariance matrix really does have constant spectral norm, it follows that
\|M\odot\Sigma - \Sigma\| \lesssim B^{1-\alpha}\,\|\Sigma\|.
The main result for masked covariance estimation is presented here. The norm ‖·‖_∞ returns the maximum absolute entry of a vector, but we use a separate notation ‖·‖_max for the maximum absolute entry of a matrix. We also require the norm
\|A\|_{1\to 2} = \max_j \Big(\sum_i |a_{ij}|^2\Big)^{1/2}.
The notation reflects the fact that this is the natural norm for linear maps from l_1 into l_2.
Theorem 9.2.4 (Chen and Tropp [130]). Fix a p × p symmetric mask matrix M,
where p ≥ 3. Suppose that x is a Gaussian random vector in Rp with mean zero.
Define the covariance matrix Σ and Σ̂ in (9.1) and (9.2). Then the variance of the
masked sample covariance estimator satisfies
\frac{\mathbb{E}\|M\odot\hat{\Sigma} - M\odot\Sigma\|}{\|\Sigma\|} \le C\Bigg[\Bigg(\frac{\Sigma_{\max}}{\|\Sigma\|}\cdot\frac{\|M\|_{1\to2}^2\,\log p}{n}\Bigg)^{1/2} + \frac{\Sigma_{\max}}{\|\Sigma\|}\cdot\frac{\|M\|\,\log p\cdot\log(np)}{n}\Bigg]. \qquad (9.8)
In this subsection, we use masks that take 0-1 values to gain intuition. Our analysis uses two separate metrics that quantify the complexity of the mask. The first complexity metric is the square of the maximum column norm,
\|M\|_{1\to2}^2 = \max_j \sum_i |m_{ij}|^2.
Roughly, this quantity counts the number of interactions we want to estimate that involve the variable j, and the maximum computes a bound over all p variables. This metric is "local" in nature. The second complexity metric is the spectral norm ‖M‖ of the mask matrix, which provides a more "global" view of the complexity of the interactions that we estimate.
Let us use some examples to illustrate. First, suppose we estimate the entire covariance matrix, so the mask is the matrix of ones:
M = \text{matrix of ones} \;\Rightarrow\; \|M\|_{1\to2}^2 = p \ \text{and}\ \|M\| = p.
Next, consider the mask that arises from the banded estimator in Example 9.2.3:
M = \text{0-1 matrix with bandwidth } B \;\Rightarrow\; \|M\|_{1\to2}^2 \le B \ \text{and}\ \|M\| \le B,
since there are at most B ones in each row and column. When B ≪ p, the banded matrix asks us to estimate far fewer interactions than the full mask, so we expect the estimation problem to be much easier.
A random process is wide-sense stationary (WSS) if its mean is constant for all time indices (i.e., independent of time) and its autocorrelation depends only on the difference of the time indices. A WSS discrete random process x[n] is statistically characterized by a constant mean
\bar{x}[n] = \bar{x},
and an autocorrelation sequence r_{xx}[m] = \mathbb{E}\{x[n]\,x^*[n-m]\}, where * denotes the complex conjugate. The terms "correlation" and "covariance" are often used synonymously in the literature, but they are formally identical only for zero-mean processes. The covariance matrix
R_M = \begin{bmatrix} r_{xx}[0] & r_{xx}^*[1] & \cdots & r_{xx}^*[M] \\ r_{xx}[1] & r_{xx}[0] & \cdots & r_{xx}^*[M-1] \\ \vdots & \vdots & \ddots & \vdots \\ r_{xx}[M] & r_{xx}[M-1] & \cdots & r_{xx}[0] \end{bmatrix}
is positive semidefinite, since
a^H R_{xx}\, a = \sum_{m=0}^{M}\sum_{n=0}^{M} a[m]\,a^*[n]\,r_{xx}[m-n] \ge 0, \qquad (9.9)
λi (RM ) ≥ 0, i = 1, . . . , M.
As mentioned above, in the case where the covariance matrix is sparse, it is natural to strike a balance between these two effects by refusing to estimate entries of the covariance matrix that we know a priori to be small or zero. Let us illustrate this point with an example that is crucial to high-dimensional data processing.
Example 9.2.5 (A Sum of Sinusoids in White Gaussian Noise [506]). Let us sample the continuous-time signal at a sampling interval T_s. If there are L real sinusoids
x[n] = \sum_{l=1}^{L} A_l \sin(2\pi f_l n T_s + \theta_l),
each of which has a phase that is uniformly distributed on the interval 0 to 2π, independent of the other phases, then the mean of the L sinusoids is zero and the autocorrelation sequence is
r_{xx}[m] = \sum_{l=1}^{L} \frac{A_l^2}{2}\cos(2\pi f_l m T_s).
For complex sinusoids,
x[n] = \sum_{l=1}^{L} A_l \exp[j(2\pi f_l n T_s + \theta_l)], \qquad r_{xx}[m] = \sum_{l=1}^{L} A_l^2 \exp(j 2\pi f_l m T_s).
A white Gaussian noise is uncorrelated with itself for all lags except m = 0, for which the variance is σ^2. The autocorrelation sequence is r_{ww}[m] = σ^2 δ[m], whose Fourier transform is a constant for all frequencies, justifying the name white noise. The covariance matrix is
R_{ww} = \sigma^2 I = \sigma^2 \begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{bmatrix}. \qquad (9.10)
R_{yy} = R_{xx} + R_{ww} = \sum_{l=1}^{L} A_l^2\, \mathbf{v}_M(f_l)\,\mathbf{v}_M^H(f_l) + \sigma^2 I, \qquad (9.12)
Only diagonal entries are affected by the ideal covariance matrix of the noise.
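A minimal sketch of (9.12) with synthetic parameters: the ideal covariance of complex sinusoids in white noise is built from steering vectors and compared with the sample covariance of simulated snapshots; the amplitudes, frequencies, noise level, and sizes are illustrative assumptions.

```matlab
% Covariance of complex sinusoids in white noise versus its sample estimate.
rng(3);
M = 16; Ts = 1e-3;                        % covariance size and sampling interval
A = [1.0 0.5];  f = [60 140];             % amplitudes and frequencies (Hz)
sigma2 = 0.1;  L = numel(A);
m = (0:M-1)';
V = exp(1j*2*pi*m*f*Ts);                  % columns are steering vectors v_M(f_l)
Ryy = V*diag(A.^2)*V' + sigma2*eye(M);    % ideal covariance (9.12)

N = 2000;                                 % snapshots for the sample covariance
theta = 2*pi*rand(L, N);                  % i.i.d. uniform phases
X = V*(diag(A)*exp(1j*theta)) + sqrt(sigma2/2)*(randn(M,N) + 1j*randn(M,N));
RyyHat = (X*X')/N;                        % sample covariance (9.2)
fprintf('relative error: %.3f\n', norm(RyyHat - Ryy)/norm(Ryy));
```

The signal part of the covariance has rank L, which is exactly the low-rank structure exploited later in this chapter.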
A better model for the noise is
\hat{R}_{ww} = \begin{bmatrix} 1 & \rho & \rho & \cdots & \rho \\ \rho & 1 & \rho & \ddots & \vdots \\ \rho & \rho & \ddots & \ddots & \rho \\ \vdots & \ddots & \ddots & \ddots & \rho \\ \rho & \cdots & \rho & \rho & 1 \end{bmatrix},
y = x + w,
where it is assumed that the random signal and the random noise are independent; it follows [507] that R_{yy} = R_{xx} + R_{ww}.
The difficulty arises from the fact that R_{xx} is unknown. Our task at hand is to separate the two matrices—a matrix separation problem. Our problem is greatly simplified if some special structure of these matrices can be exploited! Two special structures are important: (1) R_{xx} is of low rank; (2) R_{xx} is sparse.
In the real world, we are given data from which to estimate the covariance matrix of the noisy signal R_{yy},
where we have used the assumption that the random signal and the random noise are independent—which is reasonable for most covariance matrix estimators in mind. Two estimators R̂_{xx}, R̂_{ww} are required. It is very critical to remember that their difficulties are fundamentally different; two different estimators must be used. The basic reason is that the signal subspace and the noise subspace are different—even though we cannot always separate the two subspaces using tools such as singular value decomposition or eigenvalue decomposition.
Example 9.2.7 (Sample Covariance Matrix). The classical estimator for the covariance matrix is the sample covariance matrix defined in (9.2) and repeated here for convenience:
\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^H. \qquad (9.15)
\hat{R}_{yy} = \frac{1}{n}\sum_{i=1}^{n} y_i y_i^H = \frac{1}{n}\sum_{i=1}^{n} (x_i + w_i)(x_i + w_i)^H
 = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^H + \frac{1}{n}\sum_{i=1}^{n} w_i w_i^H + \underbrace{\frac{1}{n}\sum_{i=1}^{n} x_i w_i^H}_{\to 0,\ n\to\infty} + \underbrace{\frac{1}{n}\sum_{i=1}^{n} w_i x_i^H}_{\to 0,\ n\to\infty} \quad \text{(zero-mean random vectors)}
 \to \frac{1}{n}\sum_{i=1}^{n} x_i x_i^H + \frac{1}{n}\sum_{i=1}^{n} w_i w_i^H, \quad n\to\infty
 = \hat{R}_{xx} + \hat{R}_{ww}. \qquad (9.16)
Under what conditions does (9.16) approximate (9.17) with high accuracy? The asymptotic process in the derivation of (9.16) hides the difficulty of data processing. We need to make the asymptotic process explicit via feasible algorithms. In other words, we require a non-asymptotic theory for high-dimensional processing. Indeed, n approaches a very large but finite value (say n = N = 10^5). For ε ∈ (0, 1), we require that
\|\hat{R}_{xx} - R_{xx}\| \le \varepsilon\,\|R_{xx}\| \quad \text{and} \quad \|\hat{R}_{ww} - R_{ww}\| \le \varepsilon\,\|R_{ww}\|.
To achieve the same accuracy ε, the sample size n = N_x required for the signal covariance estimator R̂_{xx} is much smaller than the sample size n = N_w required for the noise covariance estimator R̂_{ww}. This observation is very critical in data processing.
For a given n, how close is R̂_{xx} to R_{xx}? For a given n, how close is R̂_{ww} to R_{ww}?
Let A, B ∈ C^{N×N} be Hermitian matrices. Then [16]
\lambda_i(A) + \lambda_N(B) \le \lambda_i(A + B) \le \lambda_i(A) + \lambda_1(B), \qquad (9.19)
\sum_{i=1}^{q} \lambda_i(\hat{R}_{xx}) + q\,\lambda_M(\hat{R}_{ww}) \le \sum_{i=1}^{q} \lambda_i(\hat{R}_{xx} + \hat{R}_{ww}) \le \sum_{i=1}^{q} \lambda_i(\hat{R}_{xx}) + q\,\lambda_1(\hat{R}_{ww}). \qquad (9.20)
For q = 1, we have
\lambda_1(\hat{R}_{xx}) + \lambda_M(\hat{R}_{ww}) \le \lambda_1(\hat{R}_{xx} + \hat{R}_{ww}) \le \lambda_1(\hat{R}_{xx}) + \lambda_1(\hat{R}_{ww}),
More generally,
\mathrm{Tr}(A^k) = \sum_{i=1}^{n} \lambda_i^k(A), \quad A \in \mathbb{C}^{n\times n},\ k \in \mathbb{N}.
In particular, if k = 2, 4, ... is an even integer, then (\mathrm{Tr}\,A^k)^{1/k} is just the l_k norm of these eigenvalues, and we have [9, p. 115]
\|A\|_{\rm op}^k \le \mathrm{Tr}(A^k) \le n\,\|A\|_{\rm op}^k,
where ‖·‖_op is the operator norm.
All eigenvalues we deal with here are non-negative, since the sample covariance matrix defined in (9.15) is positive semidefinite. The eigenvalues, their sum, and the trace of a random matrix are scalar-valued random variables, so their expectations E can be considered. Since expectation and trace are both linear, they commute [38, 91]:
\mathbb{E}\,\mathrm{Tr}(A) = \mathrm{Tr}(\mathbb{E}A). \qquad (9.22)
\mathrm{Tr}(\mathbb{E}\hat{R}_{xx}) + M\,\mathbb{E}\lambda_M(\hat{R}_{ww}) \le \mathrm{Tr}(\mathbb{E}\hat{R}_{yy}) = \mathrm{Tr}(\mathbb{E}\hat{R}_{xx}) + \mathrm{Tr}(\mathbb{E}\hat{R}_{ww}) \le \mathrm{Tr}(\mathbb{E}\hat{R}_{xx}) + M\,\mathbb{E}\lambda_1(\hat{R}_{ww}). \qquad (9.24)
Obviously, \mathbb{E}\lambda_M(\hat{R}_{ww}) and \mathbb{E}\lambda_1(\hat{R}_{ww}) are non-negative scalars, since
\lambda_i(\hat{R}_{ww}) \ge 0, \quad i = 1, \dots, M.
We are really concerned with
\Big|\sum_{k=1}^{K} \lambda_1(R_{yy,k}) - \sum_{k=1}^{K} \lambda_1(\hat{R}_{yy,k})\Big| \le \varepsilon \sum_{k=1}^{K} \lambda_1(R_{yy,k}). \qquad (9.25)
We follow [508]. For a stationary random process, the covariance matrix is Toeplitz. A Toeplitz matrix or diagonal-constant matrix, named after Otto Toeplitz, is a matrix in which each descending diagonal from left to right is constant, such as the covariance matrix R_M displayed above.
for AT = 2c log T /T where c is a constant. The diagonal elements are never
thresholded. The thresholded estimate may not be positive definite.
In the context of time series, the observations have an intrinsic temporal order and
we expect that observations are weakly dependent if they are far apart, so banding
seems to be natural. However, if there are many zeros or very weak correlations
within the band, the banding method does not automatically generate a sparse
estimate.
\hat{\Sigma}_N = \frac{1}{N}\sum_{i=1}^{N} x_i \otimes x_i.
If the sample size satisfies
N \ge C\varepsilon^{-2-2/\eta}\cdot n,
one has
\mathbb{E}\Big\|\frac{1}{N}\sum_{i=1}^{N} x_i\otimes x_i - I\Big\| \le \varepsilon. \qquad (9.28)
Here, C = 512\,(48C_0)^{2+2/\eta}(6 + 6/\eta)^{1+4/\eta}.
Corollary 9.3.2 (Covariance estimation [310]). Consider a random vector x valued in R^n with covariance matrix Σ. Assume that, for some C_0, η > 0, the isotropic random vector z = Σ^{-1/2}x satisfies
\mathbb{P}\big(\|P z\|_2^2 > t\big) \le C_0\, t^{-1-\eta} \quad \text{for } t > C_0\,\mathrm{rank}(P) \qquad (9.29)
for every orthogonal projection P in R^n. Then, provided
N \ge C\varepsilon^{-2-2/\eta}\cdot n,
the sample covariance matrix \hat{\Sigma}_N obtained from N independent copies of x satisfies
\mathbb{E}\|\hat{\Sigma}_N - \Sigma\| \le \varepsilon\,\|\Sigma\|. \qquad (9.30)
Theorem 9.3.1 says that, for sufficiently large N, all eigenvalues of the sample covariance matrix \hat{\Sigma}_N are concentrated near 1. The following corollary extends this to a result that holds for all N.
1 - C_1 y^c \le \mathbb{E}\lambda_{\min}(\hat{\Sigma}_N) \le \mathbb{E}\lambda_{\max}(\hat{\Sigma}_N) \le 1 + C_1(y + y^c). \qquad (9.31)
Here c = \frac{\eta}{2\eta+2}, C_1 = 512\,(16C_0)^{2+2/\eta}(6+6/\eta)^{1+4/\eta}, and \lambda_{\min}(\hat{\Sigma}_N), \lambda_{\max}(\hat{\Sigma}_N) denote the smallest and largest eigenvalues of \hat{\Sigma}_N, respectively.
N \ge C\varepsilon^{-2-2/\eta}\cdot n,
Here C = 40\,(10C_0)^{2/\eta}.
\mathbb{E}\|M\odot\hat{\Sigma}_N - M\odot\Sigma\| \le C\log^3(2n)\Big(\frac{\|M\|_{1,2}}{\sqrt{N}} + \frac{\|M\|}{N}\Big)\|\Sigma\|. \qquad (9.33)
Proof. We note that \|M\|_{1,2} \le \sqrt{k} and \|M\| \le k, and apply Theorem 9.4.1.
Corollary 9.4.2 implies that, for every ε ∈ (0, 1), the sample size
N \ge 4C^2\varepsilon^{-2} k\log^6(2n) \quad \text{suffices for} \quad \mathbb{E}\|M\odot\hat{\Sigma}_N - M\odot\Sigma\| \le \varepsilon\,\|\Sigma\|. \qquad (9.35)
For sparse matrices M with k ≪ n, this makes partial estimation possible with N ≪ n observations. Therefore, (9.35) is a satisfactory "sparse" version of a classical bound such as the one in Corollary 9.29.
Identifying the non-zero entries of Σ by thresholding. If we assume that all non-zero entries of Σ are bounded away from zero by a margin h > 0, then a sample size of
N \gtrsim h^{-2}\log(2n)
would ensure that all their locations are estimated correctly with probability approaching 1. With this assumption, we could derive a bound for the thresholded estimator.
Example 9.4.3 (Thresholded estimator). An n × n tridiagonal Toeplitz matrix has
the form
\Sigma = \begin{bmatrix} b & a & 0 & \cdots & 0 \\ a & b & a & \ddots & \vdots \\ 0 & a & \ddots & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & a \\ 0 & \cdots & 0 & a & b \end{bmatrix}_{n\times n}.
R_{ww} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{bmatrix}_{n\times n} + \begin{bmatrix} h & h & \cdots & h \\ h & h & \ddots & \vdots \\ \vdots & \ddots & \ddots & h \\ h & \cdots & h & h \end{bmatrix}_{n\times n}.
All non-zero entries in Σ are bounded away from zero by a margin of h > 0.
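A minimal sketch of the thresholded estimator of Example 9.4.3 on a synthetic tridiagonal Toeplitz covariance; the margin h, dimensions, and sample size are illustrative assumptions.

```matlab
% Hard-thresholding the sample covariance to recover the support of Sigma.
rng(4);
n = 40; N = 400; a = 0.4; b = 1;
Sigma = toeplitz([b a zeros(1, n-2)]);        % tridiagonal Toeplitz covariance
X = randn(N, n) * chol(Sigma);                % N samples with covariance Sigma
SigmaHat = (X'*X)/N;
h = a/2;                                      % assumed margin / threshold level
SigmaThr = SigmaHat .* (abs(SigmaHat) >= h);  % keep entries above the threshold
SigmaThr(1:n+1:end) = diag(SigmaHat);         % diagonal entries are never thresholded
supportOK = isequal(SigmaThr ~= 0, Sigma ~= 0);
fprintf('support recovered: %d, spectral error: %.3f\n', supportOK, norm(SigmaThr - Sigma));
```

With N on the order of h^{-2} log n, the thresholded estimate typically identifies the true support, in line with the sample-size requirement stated above.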
Example 9.5.1 (Infinite-dimensional data [113]). Let x_1, ..., x_N be i.i.d. random vectors with true covariance matrix \Sigma = \mathbb{E}[x_i x_i^T], K = \mathbb{E}[x_i x_i^T x_i x_i^T], and \|x\|_2 \le \alpha almost surely for some α > 0. Define the random matrices X_i = x_i x_i^T - \Sigma and the sample covariance matrix (a random matrix) \hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T. We have \lambda_{\max}(X_i) \le \alpha^2 - \lambda_{\min}(X_i). Also,
\lambda_{\max}\Big(\frac{1}{N}\sum_{i=1}^{N} \mathbb{E}X_i^2\Big) = \lambda_{\max}(K - \Sigma^2)
and
\mathbb{E}\,\mathrm{Tr}\Big(\frac{1}{N}\sum_{i=1}^{N} X_i^2\Big) = \mathrm{Tr}(K - \Sigma^2).
\mathbb{P}\Bigg(\|\Sigma - \hat{\Sigma}\| > \sqrt{\frac{2t\,\lambda_{\max}(K-\Sigma^2)}{N}} + \frac{\max\{\alpha^2 - \lambda_{\min}(X_i),\, \lambda_{\max}(\Sigma)\}\,t}{3N}\Bigg) \le \frac{\mathrm{Tr}(K-\Sigma^2)}{\lambda_{\max}(K-\Sigma^2)}\cdot 2t\,(e^{t} - t - 1)^{-1}.
The relevant notion of intrinsic dimension is \mathrm{Tr}(K-\Sigma^2)/\lambda_{\max}(K-\Sigma^2), which can be finite even when the random vectors x_i take values in an infinite-dimensional Hilbert space.
We follow [509, 510] for our exposition. Consider a matrix model of signal plus
noise
Y =S+X (9.36)
\|A\|_{S_p} = \Big(\sum_{i=1}^{n} \lambda_i^p\Big)^{1/p} \ \text{for } 1 \le p \le \infty, \qquad \|A\|_{S_\infty} = \|A\|_{\rm op} = \lambda_1,
\|P_r A\|_F = \|P_r A\|_2 = \|P_r A\|_{S_2}.
Let O_{n\times n,r} be the set of all orthogonal rank-r projections onto subspaces of R^n, so that we can write P_r \in O_{n\times n,r}. For any A \in \mathbb{R}^{n\times n}, its singular values \lambda_1, \dots, \lambda_n are ordered in decreasing magnitude. In terms of singular values, we have
\|A\|_F^2 = \sum_{i=1}^{n} \lambda_i^2, \qquad \|P_r A\|_F^2 = \sum_{i=1}^{r} \lambda_i^2.
P_r^{(1)} = U I_{r\times r} U^T \quad \text{and} \quad P_r^{(2)} = \tilde{U} I_{r\times r}\tilde{U}^T,
we have
\mathrm{Tr}\big(P_r^{(1)}P_r^{(2)}\big) = \mathrm{Tr}\big(U I_{r\times r}U^T\,\tilde{U} I_{r\times r}\tilde{U}^T\big) = \mathrm{Tr}\big(\tilde{U}^T U I_{r\times r}U^T\tilde{U} I_{r\times r}\big),
\mathrm{Tr}\big(P_r^{(1)}P_r^{(2)}\big) = \mathrm{Tr}(\Pi I_{r\times r}) = \sum_{i=1}^{r} \Pi_{ii} \ge 0.
We conclude that
\|P_r^{(2)} - P_r^{(1)}\|_F = \big\|\big(I - P_r^{(1)}\big) - \big(I - P_r^{(2)}\big)\big\|_F \le \sqrt{2\,\big(r \wedge (n-r)\big)}.
F
Let S n−1 be the Euclidean sphere in n-dimensional space. Finally, by the symmetry
(2) (1)
of Pr − Pr , we obtain
(2) T (2)
P −P(1) = P(2) (1) x P −P(1) x
r −Pr = λ1 P(2)
r −Pr
(1)
r r S∞ op
= sup r r
x∈S n−1
T (2)
(1)
= sup x Pr x − x Pr x 1.
T
x∈S n−1
∈[0,1] ∈[0,1]
The largest singular value (the operator norm) for the difference of two projection
matrices is bounded by 1.
(2) (1)
For (9.36), it is useful to bound the trace Tr XT Pr − Pr S and
/ /2
/ / 2
/P̃r Y/ − Pr Y F . Motivated for this purpose, we consider the trace
F
\mathrm{Tr}\big(A^T(P_r^{(2)} - P_r^{(1)})B\big), for two arbitrary rank-r projections P_r^{(1)}, P_r^{(2)} \in S_{n,r} and two arbitrary matrices A, B \in \mathbb{R}^{n\times n}.
First, we observe that
P_r^{(2)} - P_r^{(1)} = P_r^{(2)} - P_r^{(2)}P_r^{(1)} + P_r^{(2)}P_r^{(1)} - P_r^{(1)} = P_r^{(2)}\big(I - P_r^{(1)}\big) + \big(P_r^{(2)} - I\big)P_r^{(1)}.
\le \frac{1}{\sqrt{2}}\sqrt{r\wedge(n-r)}\cdot\|BA^T\|_{S_\infty}\cdot\|P_r^{(2)} - P_r^{(1)}\|_F + \frac{1}{\sqrt{2}}\sqrt{r\wedge(n-r)}\cdot\|AB^T\|_{S_\infty}\cdot\|P_r^{(2)} - P_r^{(1)}\|_F
\le \sqrt{2\,\big(r\wedge(n-r)\big)}\,\lambda_1(A)\,\lambda_1(B)\,\|P_r^{(2)} - P_r^{(1)}\|_F. \qquad (9.37)
The inequality (9.37) is optimal in the following sense: when, for n ≥ 2r, there are r orthonormal vectors u_1, ..., u_r and ũ_1, ..., ũ_r, we can form two rank-r projection matrices
P_r^{(1)} = \sum_{i=1}^{r} u_i u_i^T, \qquad P_r^{(2)} = \sum_{i=1}^{r} \big(\sqrt{1-\alpha^2}\,u_i + \alpha\tilde{u}_i\big)\big(\sqrt{1-\alpha^2}\,u_i + \alpha\tilde{u}_i\big)^T,
and
A = \mu I, \qquad B = \nu\big(P_r^{(1)} - P_r^{(2)}\big).
In fact, for this case, the left-hand side of the inequality (9.37) attains the upper bound for any real numbers 0 ≤ α ≤ 1 and μ, ν > 0.
For the considered case, let us explicitly evaluate the left-hand side of the inequality (9.37):
\big|\mathrm{Tr}\big(A^T(P_r^{(2)} - P_r^{(1)})B\big)\big| = \mu\nu\,\big|\mathrm{Tr}\big[(P_r^{(2)} - P_r^{(1)})(P_r^{(1)} - P_r^{(2)})\big]\big| = \mu\nu\big(2r - 2\,\mathrm{Tr}(P_r^{(1)}P_r^{(2)})\big) = \mu\nu\big(2r - 2r(1-\alpha^2)\big) = 2r\mu\nu\alpha^2.
We have used \|P_r^{(1)} - P_r^{(2)}\|_F = \alpha\sqrt{2r} and \lambda_1(P_r^{(1)} - P_r^{(2)}) = \alpha. Let us establish them now. The first one is simple, since
\|P_r^{(1)} - P_r^{(2)}\|_F = \sqrt{\mathrm{Tr}\big[(P_r^{(1)} - P_r^{(2)})(P_r^{(1)} - P_r^{(2)})\big]} = \alpha\sqrt{2r}.
To prove the second one, we can check that α and −α are the only non-zero eigenvalues of the difference matrix P_r^{(1)} - P_r^{(2)}, and their eigenspaces are given by
W_\alpha = \mathrm{span}\Big\{\sqrt{\tfrac{1+\alpha}{2}}\,u_i - \sqrt{\tfrac{1-\alpha}{2}}\,\tilde{u}_i,\ i = 1, \dots, r\Big\} \quad \text{and} \quad W_{-\alpha} = \mathrm{span}\Big\{\sqrt{\tfrac{1-\alpha}{2}}\,u_i + \sqrt{\tfrac{1+\alpha}{2}}\,\tilde{u}_i,\ i = 1, \dots, r\Big\}.
Since P_r^{(1)} - P_r^{(2)} is symmetric, it follows that
\lambda_1\big(P_r^{(1)} - P_r^{(2)}\big) = \max\big\{\lambda_{\max}(P_r^{(1)} - P_r^{(2)}),\ -\lambda_{\min}(P_r^{(1)} - P_r^{(2)})\big\} = \alpha.
Now we are in a position to consider the model (9.36) using the process
Z = \|\tilde{P}_r Y\|_F^2 - \|P_r Y\|_F^2 = \|\tilde{P}_r(S+X)\|_F^2 - \|P_r(S+X)\|_F^2,
for a pair of rank-r projections. Recall that the projection matrix P_r is the rank-r projection which maximizes the Hilbert-Schmidt norm
\|P_r A\|_F = \|P_r A\|_2 = \|P_r A\|_{S_2}.
On the other hand, \tilde{P}_r \in O_{n\times n,r} is an arbitrary orthogonal rank-r projection onto a subspace of R^n. Obviously, Z is a functional of the projection matrix \tilde{P}_r. The supremum is denoted by
where \tilde{P}_r is a location of the supremum. In general, \tilde{P}_r is not unique. In addition, the differences \|\tilde{P}_r Y\|_F^2 - \|P_r Y\|_F^2 are usually not centered.
F
Theorem 9.6.1 (Upper bound for Gaussian matrices [510]). Let the distribution
of Xij be centered Gaussian with variance σ 2 and rank(X) r. Then, for r ≤
n − r, the following bound holds
⎛ ⎛⎛ ⎞1/2 ⎞⎞
2r
⎜ ⎜⎜
1
λ2i ⎟ ⎟⎟
⎜ λ21 λ1 ⎜⎜
r
⎟ λ1 λ21 ⎟⎟
EZ σ 2 rn ⎜min
i=r+1
, 1+ √ + min ⎜⎜ ⎟ · √ , 2 2 ⎟⎟ ,
⎝ λ2r σ n ⎝⎝ 2
λr ⎠ σ n λr − λr+1 ⎠⎠
λ2
1
where λ2 2 is set to infinity, if λr = λr+1.
r −λr+1
Theorem 9.6.1 is valid for Gaussian entries, while the following theorem is more general and covers i.i.d. entries with finite fourth moment.
Theorem 9.6.2 (Universal upper bound [510]). Assume that the i.i.d. entries X_{ij} of the random matrix X have finite variance σ^2 and finite fourth moment m_4. Then we have
where
I = \sigma^2 + \sqrt{m_4} + \frac{\lambda_1}{\sqrt{n}}\big(\sigma + m_4^{1/4}\big),
II = \frac{\lambda_1^2}{\lambda_r^2-\lambda_{r+1}^2}\big(\sigma + \sqrt{m_4}\big)^2 \quad \text{if } \lambda_r > \lambda_{r+1},
is strictly positive and bounded away from zero, uniformly over S ∈ R^{n×n}. In fact,
\mathbb{E}\hat{s} - \lambda_1^2 \in \big[c_1\sigma^2 n,\ c_2\big(\sigma^2 n + \sigma\sqrt{n}\,\lambda_1(S)\big)\big].
The disadvantage of this estimator is its large variance for large values of n: it depends quadratically on the dimension. If r = rank(S) < n, the matrix S can be fully characterized by (2n − r)r parameters, as can be seen from the singular value decomposition. In other words, if r ≪ n, the intrinsic dimension of the problem is of the order rn rather than n^2. For every matrix with rank(S) = r, we have
\|S\|_F^2 = \|P_r S\|_F^2.
Elementary analysis shows that \|P_r Y\|_F^2 - \sigma^2 rn unbiasedly estimates \|S\|_F^2, and
\mathrm{Var}\big(\|P_r Y\|_F^2 - \sigma^2 rn\big) = 2\sigma^4 rn + 4\sigma^2\|S\|_F^2; \qquad (9.41)
that is, \sigma^{-1}\big(\|P_r Y\|_F^2 - \sigma^2 rn - \|S\|_F^2\big) is approximately centered Gaussian with variance 4\|S\|_F^2 if \sigma^2 rn = o(1) in an asymptotic framework, and 4\|S\|_F^2 is the asymptotic efficiency lower bound [134]. The statistic \|P_r Y\|_F^2 - \sigma^2 rn, however, cannot be used as an estimator, since P_r = P_r(S) depends on S itself and is unknown a priori. The analysis of [509] argues that the empirical low-rank projections \|\hat{P}_r Y\|_F^2 - \sigma^2 rn cannot be successfully used for efficient estimation of \|S\|_F^2, even if rank(S) < n is explicitly known beforehand.
y_i = x_i + v_i, \quad i = 1, \dots, N,
\hat{R}_y = \frac{1}{N}\sum_{i=1}^{N} y_i y_i^T,
\tilde{R}_x = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T - R_x,
where
\Delta = \frac{1}{N}\sum_{i=1}^{N} v_i v_i^T + \frac{1}{N}\sum_{i=1}^{N}\big(x_i v_i^T + v_i x_i^T\big).
Let us demonstrate how to use concentration of measure (see Sects. 3.7 and 3.9) in this context. Let us assume that R_x has rank at most r. We can write
\tilde{R}_x = Q\Big(\frac{1}{N}\sum_{i=1}^{N} z_i z_i^T - I_{r\times r}\Big)Q^T,
We propose to study weak signal detection under the framework of sums of random matrices. This matrix setting is natural for many radar problems, such as orthogonal frequency division multiplexing (OFDM) radar and distributed apertures. Each subcarrier (or antenna sensor) can be modeled as a random matrix, via, e.g., the sample covariance matrix estimated using the time sequence of the data. Often the data sequence is extremely long. One fundamental problem is to break the long data record into shorter data segments. Each short data segment is sufficiently long to estimate the sample covariance matrix of the underlying distribution. If we have 128 subcarriers and 100 short data segments, we will have 12,800 random matrices at our disposal. The most natural approach for data fusion is to sum up the 12,800 random matrices.
The random matrix (here sample covariance matrix) is the basic information
block in our proposed formalism. In this novel formalism, we take the number of
observations as it is and try to evaluate the effect of all the influential parameters. The number of observations, large but finite, is taken as it is and treated as "given". From this given number of observations, we want algorithms that perform as well as they can. We desire to estimate the covariance
matrix using a smaller number of observations; this way, a larger number of
covariance matrices can be obtained. In our proposed formalism, low rank matrix
recovery (or matrix completion) plays a fundamental role.
Principal component analysis (PCA) [208] is a classical method for reducing the
dimension of data, say, from high-dimensional subset of Rn down to some subsets
of Rd , with d ! n. PCA operates by projecting the data onto the d directions of
maximal variance, as captured by eigenvectors of the n×n true covariance matrix Σ.
See Sect. 3.6 for background and notation on induced operator norms. See the PhD
dissertation [511] for a treatment of high-dimensional principal component analysis.
We freely take material from [511] in this section to give some background on PCA
and its SDP formulation.
PCA as subspace of maximal variance. Consider a collection of data points
xi , i = 1, . . . , N in Rn , drawn i.i.d. from a distribution P. We denote the expectation
with respect to this distribution by E. Assume that the distribution is centered, i.e., Ex = 0, and that \mathbb{E}\|x\|_2^2 < \infty. We collect \{x_i\}_{i=1}^{N} in a matrix X \in \mathbb{R}^{N\times n}; thus, x_i represents the i-th row of X. Let Σ and \hat{\Sigma} = \hat{\Sigma}_N denote the true covariance matrix and the sample covariance matrix, respectively. We have
\Sigma := \mathbb{E}[xx^T], \qquad \hat{\Sigma} := \frac{1}{N}X^T X = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T. \qquad (10.1)
that is, z is a direction along which the projection of the distribution has maximal variance. Noting that \mathbb{E}(z^Tx)^2 = \mathbb{E}[(z^Tx)(z^Tx)] = z^T\mathbb{E}[xx^T]z, we obtain
SDP formulation. Let us derive an SDP equivalent to (10.3). Using the cyclic property of the trace, we have \mathrm{Tr}(z^T\Sigma z) = \mathrm{Tr}(\Sigma zz^T). For a matrix Z \in \mathbb{R}^{n\times n}, Z \ge 0 and rank(Z) = 1 is equivalent to Z = zz^T for some z \in \mathbb{R}^n. Imposing the additional condition Tr(Z) = 1 is equivalent to the additional constraint \|z\|_2 = 1. Now, after dropping the rank(Z) = 1 constraint, we obtain a relaxation of (10.3).
Σ̂ = Σ + Δ (10.5)
where Δ = ΔN denotes a random noisy matrix, typically arising from having only
a finite number N of samples.
A natural question is under what conditions the sample eigenvectors based on \hat{\Sigma} are consistent estimators of their true analogues based on Σ. In the classical theory of PCA, the model dimension n is viewed as fixed, and asymptotic statements are established as N goes to infinity, N → ∞. However, such "fixed n, large N" scaling may be inappropriate for today's big data applications, where the model dimension n is comparable to or even larger than the number of observations N, i.e., n ≥ N.
where β > 0 is some positive constant measuring the signal-to-noise ratio (SNR). The eigenvalues of Σ are all equal to 1 except for the largest one, which is 1 + β; z is the leading principal component of Σ. One then forms the sample covariance matrix \hat{\Sigma} and obtains its maximal eigenvector ẑ, hoping that ẑ is a consistent estimate of z.
Unfortunately, this does not happen unless n/N → 0, as shown by Paul and Johnstone [513], among others. See also [203]. As (N, n) → ∞ with n/N → α, the following phase transition occurs asymptotically:
\langle\hat{z}, z\rangle^2 \to \begin{cases} 0, & \beta \le \sqrt{\alpha} \\ \dfrac{1-\alpha/\beta^2}{1+\alpha/\beta^2}, & \beta > \sqrt{\alpha}. \end{cases} \qquad (10.7)
Note that \langle\hat{z}, z\rangle^2 measures the cosine of the angle between ẑ and z and is related to the projection 2-distance between the corresponding 1-dimensional subspaces. Neither case in (10.7) shows consistency, i.e., \langle\hat{z}, z\rangle^2 \to 1. This has led to research on additional structure/constraints that one may impose on z to allow for consistent estimation.
y = Hx + z
exists a privileged direction along which x has more variance. Here we consider the case where the privileged direction is sparse. The covariance matrix is a sparse rank-one perturbation of the identity matrix I_N. Formally, let v ∈ R^N be such that ‖v‖_2 = 1, ‖v‖_0 ≤ k, and θ > 0. The hypothesis testing problem is
H_0: x \sim \mathcal{N}(0, I)
H_1: x \sim \mathcal{N}(0, I + \theta vv^T). \qquad (10.8)
This problem is considered in [157], where v is the so-called feature. Later, a perturbation by several rank-one matrices is considered in [158, 159]. This idea is carried out in a kernel space [385]. What is new in this section is the inclusion of the sparsity of v. The model under H_1 is a generalization of the spiked covariance model, since it allows v to be k-sparse on the unit Euclidean sphere. The statement of H_1 is invariant under rotations of the k relevant variables.
Denote by Σ the covariance matrix of x. We most often use the empirical covariance matrix \hat{\Sigma} defined by
\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T = \frac{1}{n}XX^T,
where
X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nn} \end{bmatrix}_{N\times n}.
yk = xk + zk , k = 1, 2, . . . , n
where xk represents the information and zk the noise. Often, we have n independent
copies of Y at our disposal for high-dimensional data processing. It is natural to
consider the matrix concentration of measure:
y1 + · · · + yn = (x1 + · · · + xn ) + (z1 + · · · + zn ) ,
Yk = Xk + Zk , k = 1, 2, . . . , n
where Xk represents the information and Zk the noise. Often, we have n indepen-
dent copies of Y at our disposal for high-dimensional data processing. It is natural
to consider the matrix concentration of measure:
Y1 + · · · + Yn = (X1 + · · · + Xn ) + (Z1 + · · · + Zn ) ,
The trace is linear, so
\mathrm{Tr}(Y_1 + \cdots + Y_n) = \mathrm{Tr}(X_1 + \cdots + X_n) + \mathrm{Tr}(Z_1 + \cdots + Z_n) = (\mathrm{Tr}\,X_1 + \cdots + \mathrm{Tr}\,X_n) + (\mathrm{Tr}\,Z_1 + \cdots + \mathrm{Tr}\,Z_n).
We assume that the eigenvalues, singular values, and diagonal entries of Hermitian matrices are arranged in decreasing order. Thus, λ_1 = λ_max and λ_n = λ_min.
Theorem 10.6.1 (Eigenvalues of Sums of Two Matrices [16]). Let A, B be n×n Hermitian matrices. Then
In particular,
(10.10)
The third line of (10.10) follows from the fact that [53, p. 13], where A is a Hermitian matrix. In the fourth line, we have made the assumption that the sums of information matrices (X_1 + ··· + X_n) and noise matrices (Z_1 + ··· + Z_n) are independent of each other. Similarly, we have the lower bound
\lambda_{\min}\big[(X_1 + \cdots + X_n) + (Z_1 + \cdots + Z_n) + (-Z_1 - \cdots - Z_n)\big]
\ge \lambda_{\min}\big[(X_1 + \cdots + X_n) + (Z_1 + \cdots + Z_n)\big] + \lambda_{\min}\big[-(Z_1 + \cdots + Z_n)\big]
= \lambda_{\min}\big[(X_1 + \cdots + X_n) + (Z_1 + \cdots + Z_n)\big] - \lambda_{\max}\big(Z_1 + \cdots + Z_n\big)
\ge \lambda_{\min}\big(X_1 + \cdots + X_n\big) + \lambda_{\min}\big(Z_1 + \cdots + Z_n\big) - \lambda_{\max}\big(Z_1 + \cdots + Z_n\big). \qquad (10.11)
H_0: \ Y = N
H_1: \ Y = \sqrt{SNR}\cdot X + N \qquad (10.12)
where SNR represents the signal-to-noise ratio, and X and N are two m×n random matrices. We assume that X is independent of N. The problem (10.12) is equivalent to the following:
H_0: \ YY^H = NN^H
H_1: \ YY^H = SNR\cdot XX^H + NN^H + \sqrt{SNR}\,\big(XN^H + NX^H\big) \qquad (10.13)
One metric of interest is the covariance matrix together with its trace,
f(SNR, X) = \mathrm{Tr}\Big[\big(\mathbb{E}[YY^H] - \mathbb{E}[NN^H]\big)^2\Big]. \qquad (10.14)
This function is not only positive but also linear in the sense that the trace is linear. When only N independent realizations are available, we can replace the expectation with its empirical average,
\hat{f}(SNR, X) = \mathrm{Tr}\Big[\Big(\frac{1}{N}\sum_{i=1}^{N} Y_i Y_i^H - \frac{1}{N}\sum_{i=1}^{N} N_i N_i^H\Big)^2\Big]. \qquad (10.15)
Fig. 10.1 Random matrix detection: (a) SNR = −30 dB, N = 100; (b) SNR = −36 dB, N = 1,000
2 2
P fˆ(SN R, X) − Efˆ(SN R, X) > t Ce−n t /c (10.16)
When A, B are Hermitian, it is fundamental to realize that TAT^* and TBT^* are two commuting matrices. Obviously TAT^* and TBT^* are Hermitian, since A^* = A, B^* = B. Using the fact that for any two complex matrices C, D, (CD)^* = D^*C^*, we get
\big(TBT^*\,TAT^*\big)^* = \big(TAT^*\big)^*\big(TBT^*\big)^* = TA^*T^*\,TB^*T^* = TAT^*\,TBT^*, \qquad (10.17)
which says that TAT^*\,TBT^* = TBT^*\,TAT^*, verifying the claim.
For commutative matrices C, D, C ≤ D is equivalent to eC ≤ eD .
e^{A+B} = e^A e^B
if and only if the two matrices A, B commute: AB = BA. Thus, it follows that
e^{TAT^*} < e^{TBT^*}, \qquad (10.18)
Let A be a C^*-algebra, and let A_s be the set of all self-adjoint (Hermitian) matrices, i.e., the self-adjoint part of A. Let us recall a lemma that was proven in Sect. 2.2.4; we repeat the lemma and its proof here for convenience.
Lemma 10.8.1 (Large deviations and Bernstein trick). For matrix-valued random variables A, B ∈ A_s and T ∈ A such that T^*T > 0,
\mathbb{P}\{A \not\preceq B\} \le \mathrm{Tr}\,\mathbb{E}e^{TAT^* - TBT^*} = \mathbb{E}\,\mathrm{Tr}\,e^{TAT^* - TBT^*}. \qquad (10.20)
Here, the second line is because the mapping X → TXT^* is bijective and preserves the order. As shown above in (10.17), when A, B are Hermitian, TAT^* and TBT^* are two commuting matrices. For commuting matrices C, D, C ≤ D is equivalent to e^C ≤ e^D, from which the third line follows. The last line follows from Chebyshev's inequality (2.2.11).
The closed form of \mathbb{E}\,\mathrm{Tr}\,e^X is available in Sect. 1.6.4.
The famous Golden-Thompson inequality is recalled here:
\mathrm{Tr}\,e^{A+B} \le \mathrm{Tr}\big(e^A e^B\big), \qquad (10.22)
where A, B are arbitrary Hermitian matrices. This inequality is very tight (almost sharp).
\mathrm{Tr}(AB) \le \|A\|_{\rm op}\,\mathrm{Tr}(B)
H_0: \ A, \quad A \ge 0
H_1: \ A + B, \quad A \ge 0,\ B \ge 0 \qquad (10.23)
where A and B are random matrices. One cannot resist the temptation of using
\mathbb{P}(A + B > A) = \mathbb{P}\big(e^{A+B} > e^A\big),
which is false. It is true only when A and B commute, i.e., AB = BA. In fact, in general, we have
\mathbb{P}(A + B > A) \ne \mathbb{P}\big(e^{A+B} > e^A\big).
However, TAT^* and TBT^* are two commuting matrices when A, B are Hermitian.
Let us consider another formulation:
H_0: \ X,
H_1: \ C + X, \quad C \text{ is fixed.}
We claim H_1 if the decision metric \mathbb{E}\,\mathrm{Tr}\exp(C + X) is greater than some positive threshold t that we can freely set. We can compare this expression with the left-hand side of (10.24), by replacing C with H. The probability of detection for this algorithm is
\mathbb{P}\big(\mathbb{E}\,\mathrm{Tr}\exp(C + X) > t\big),
due to (10.24). As a result, \mathbb{E}e^X, the expectation of the exponential of the random matrix X, plays a basic role in this hypothesis-testing problem.
The first inequality follows from the Golden-Thompson inequality (10.22). The second inequality follows from Tr(AB) ≤ Tr(A)Tr(B) when A ≥ 0 and B ≥ 0 are of the same size [16, Theorem 6.5]. Note that all the eigenvalues of a matrix exponential are nonnegative. The final step follows from the fact that C is fixed. It follows from (10.25) that
\mathbb{P}\big(\mathbb{E}\,\mathrm{Tr}\,e^{C+X} > t\big) \le \mathbb{P}\big(\mathrm{Tr}(e^C)\,\mathbb{E}\,\mathrm{Tr}\,e^X > t\big),
which is the final upper bound of interest. The quantity \mathbb{E}\,\mathrm{Tr}\,e^X plays a basic role. Fortunately, for Hermitian Gaussian random matrices X, the closed-form expression for \mathbb{E}\,\mathrm{Tr}\,e^X is obtained in Sect. 1.6.4.
Example 10.8.3 (Commutative property of TAT^* and TBT^*). EXPM(X) is the matrix exponential of X. EXPM is computed using a scaling and squaring algorithm with a Padé approximation, while EXP(X) is the element-wise exponential of the entries of X. For example, EXP(0) gives a matrix whose entries are all ones, while EXPM(0) is the identity matrix, whose diagonal elements are ones and all off-diagonal elements are zeros. It is very critical to realize that EXPM(X) should be used to calculate the matrix exponential of X, rather than EXP(X).
Without loss of generality, we set T = randn(m,n), where randn(m,n) gives an m×n random matrix whose entries are normally distributed pseudorandom numbers, since we only need to require T^*T > 0. For two Hermitian matrices A, B, the MATLAB expression
expm(T*A*T' - T*B*T')*expm(T*B*T') - expm(T*A*T')
gives zeros, while, unless AB = BA, the expression e^{A-B}e^B differs from e^A. Note that the inverse of the matrix exponential is (e^A)^{-1} = e^{-A}.
play a basic role in random matrix analysis. Often, we can only observe the product of two matrix-valued random variables X, Y, so we can readily calculate the expectation of the product of X and Y. Assume we want to "deconvolve" the role of X to obtain the expectation of Y. This can be done using
\mathbb{E}(XY)\,\big(\mathbb{E}(X)\big)^{-1} = \mathbb{E}(Y),
assuming \big(\mathbb{E}(X)\big)^{-1} exists.^1
Example 10.8.4 (The product rule of matrix expectation: E(XY) = E(X)E(Y)). In particular, we consider the matrix exponentials
X = e^{TAT^* - TBT^*}, \qquad Y = e^{TBT^*},
where T, A, B are assumed the same as in Example 10.8.3. Then, we have that
\mathbb{E}\big(e^{TAT^* - TBT^*}\big)\,\mathbb{E}\big(e^{TBT^*}\big) = \mathbb{E}(XY) = \mathbb{E}\big(e^{TAT^* - TBT^*}\,e^{TBT^*}\big) = \mathbb{E}\big(e^{TAT^* - TBT^* + TBT^*}\big) = \mathbb{E}\big(e^{TAT^*}\big).
The first line uses the product rule of matrix expectation. The second line follows from (10.19). In MATLAB simulation, the expectation is implemented using N independent Monte Carlo runs and replaced with the average of the N random matrices
\frac{1}{N}\sum_{i=1}^{N} Z_i.
1 This is not guaranteed. The probability that the random matrix X is singular is studied in the
literature [241, 242].
where TT^* > 0, and A, B are two Hermitian random matrices. Usually, A is a Gaussian random matrix representing the noise. Using Monte Carlo simulations, one often has knowledge of \mathbb{E}\big(e^{TAT^*}\big) and \mathbb{E}\big(e^{TAT^* + TBT^*}\big). Using arguments similar to Example 10.8.4, we have that
\mathbb{E}\big(e^{TAT^* + TBT^*}\big)\,\mathbb{E}\big(e^{-TAT^*}\big) = \mathbb{E}\big(e^{TAT^* + TBT^*}\,e^{-TAT^*}\big) = \mathbb{E}\big(e^{TAT^* + TBT^* - TAT^*}\big) = \mathbb{E}\big(e^{TBT^*}\big).
If B = 0, then e^{TBT^*} = I and thus \mathbb{E}\big(e^{TBT^*}\big) = I, since e^0 = I. If B is very weak but B ≠ 0, we often encounter B as a random matrix. Note that
\log\mathbb{E}e^{TBT^*} \ne 0
Claim H_1 if
\mathrm{Tr}\Big[\mathbb{E}\big(e^{TAT^* + TBT^*}\big)\,\mathbb{E}\big(e^{-TAT^*}\big)\Big] > \gamma.
The integral representation
e^{A+B} - e^A = \int_0^1 e^{(1-t)A}\,B\,e^{t(A+B)}\,dt
can be used to study the difference e^{A+B} - e^A. Our goal is to understand the perturbation due to a very weak B.
H_0: \ x \sim \mathcal{N}(0, I_n)
H_1: \ x \sim \mathcal{N}\big(0, I_n + \theta vv^T\big).
\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T.
Taking τ ∈ [τ_0, τ_1] allows us to control the type I and type II errors of the test
\psi(\hat{\Sigma}) = \mathbf{1}\{\varphi(\hat{\Sigma}) > \tau\},
where 1{·} denotes the indicator function. As desired, this test discriminates between the hypotheses with probability 1 − δ.
The sample covariance matrix Σ̂ has been studied extensively [5,388]. Convergence
of the empirical covariance matrix to the true covariance matrix in spectral norm
has received attention [518–520] under various elementwise sparsity and using
thresholding methods. Our assumption allows for relevant variables to produce
arbitrary small entries and thus we cannot use such results. A natural statistic would
be, for example, using the largest eigenvalue of the covariance matrix.
where the convergenceholds almost surely [198, 383]. Yin et al. [344] established
that E (x) = 0 and E x4 < ∞ is a necessary and sufficient condition for this
almost sure convergence to hold. As Σ̂ 0, its number of positive eigenvalues is
equal to its rank (which is smaller than N ), and we have
1
n 1 n Tr Σ̂
λmax Σ̂ λi Σ̂ Tr Σ̂ .
rank Σ̂ i=1
N N Nn
These two results indicate that the largest eigenvalue will not be able to
discriminate between the two hypotheses unless θ > Cn/N for some positive
constant C. In a “large n/small N ” scenario, this corresponds to a very strong signal
indeed.
When adding a finite rank perturbation to a Wishart matrix, a phase transition [522]
arises already in the moderate dimensional regime where n/N → α ∈ (0, 1).
A very general class of random matrices exhibit similar behavior, under finite rank
perturbation, as shown by Tao [523]. These results are extended to more general
distributions in [524]. The analysis of [514] indicates that detection using the
largest eigenvalue is impossible already for moderate dimensions, without further
assumptions. Nevertheless, resorting to the sparsity assumption allows us to bypass
this intrinsic limitation of using the largest eigenvalue as a test statistic.
To exploit the sparsity assumption, we use the fact that only a small submatrix of the
empirical covariance matrix will be affected by the perturbation. Let A be a n × n
matrix and fix k < n. We define the k-sparse largest eigenvalue by
For a set S, we denote by |S| the cardinality of S. We have the same equalities as
for regular eigenvalues
λkmax (In ) = 1 and λkmax In + θvvT = 1 + θ.
The k-sparse largest eigenvalue behaves differently under the two hypotheses as
soon as there is a k × k matrix with a significantly higher largest eigenvalue.
for any A ≥ 0.
504 10 Detection in High Dimensions
1
N
2
λkmax Σ̂ vT Σ̂ = xTi v .
N i=1
This problem involves only linear functionals. Since x ∼ N 0, In + θvvT , we
have xTi v ∼ N (0, 1 + θ).
Define the new random variable
N
1 1 T 2
Y = x v −1 ,
N i=1 1 + θ i
Hence, taking t = log (1/δ), we have Y −2 log (1/δ) /N with probability
1 − δ. Therefore, under H1 , we have with probability 1 − δ
"
log (1/δ)
λmax Σ̂ 1 + θ − 2 (1 + θ)
k
. (10.28)
N
We now establish the following: Under H0 , with probability 1 − δ
"
k log (9en/k) + log (1/δ) k log (9en/k) + log (1/δ)
λmax Σ̂ 1 + 4
k
+4 .
N N
(10.29)
We adapt a technique from [72, Lemma 3]. Let A be a symmetric n × n matrix,
and let N be an -net of sphere S n−1 for some ∈ [0, 1]. For -net, please refer to
Sect. 1.10. Then,
−1
λmax (A) = sup | Ax, x | (1 − 2ε) sup | Ax, x | .
x∈S n−1 x∈Nε
10.11 Sparse Principal Component Detection 505
Using a 1/4-net over the unit sphere of Rk , there exists a subset Nk of the unit
sphere of Rk , with cardinality smaller than 9k , such that for any A ∈ S+
k
Under H0 , we have
0 1
λkmax Σ̂ = 1 + max λmax Σ̂S − 1 ,
|S|=k
where the maximum in the right-hand side is taken over all subsets of {1, . . . , n}
that have cardinality k. See [514] for details.
where
"
k log (9en/k) + log (1/δ) k log (9en/k) + log (1/δ)
τ0 = 1 + 4 +4
N N
"
log (1/δ)
τ1 = 1 + θ − 2 (1 + θ) .
N
When τ1 > τ0 , we take τ ∈ [τ0 , τ1 ] and define the following test
0 1
ψ Σ̂ = 1 ϕ Σ̂ > τ .
Considering the asymptotic regimes, for large n, N, k, taking δ = n−β with β > 0,
gives a sequence of tests ψN that discriminate between H0 and H1 with probability
converging to 1, for any fixed θ > 0, as soon as
k log (n)
→ 0.
N
Theorem 10.11.1 gives the upper bound. The lower bound for the probability of
error is also found in [514]. We observe a gap between the upper and lower bound,
with a term in log(n/k) in the upper bound, and one log(n/k 2 ) in the lower bound.
However, by considering some regimes for n, N and k, it disappears. Indeed, as
soon as n ≥ k 2+ , for some > 0, upper bound and lower bounds match up to
constants, and the detection rate for the sparse eigenvalue is optimal in a minimax
sense. Under this assumption, detection becomes impossible if
"
k log (n/k)
θ<C ,
N
for a small enough constant C > 0.
A major breakthrough for sparse PCA was achieved in [516], who introduced a
SDP relaxation for λkmax , but tightness of this relaxation is, to this day, unknown.
Making the change of variables X = xxT , in (10.27) yields
This problem contains two sources of non-convexity: the l0 norm constraint and
the rank constraint. We make two relaxations in order to have a convex feasible
set. First, for a semidefinite matrix X, with trace 1, and sparsity k 2 , the Cauchy-
Schwartz inequality yields ||X||1 ≤ k, which is substituted to the cardinality
constraint in this relaxation. Simply dropping the rank constraint leads to the
following relaxation of our original problem:
Together with (10.32), Lemma 10.12.1 implies that for any z ≥ 0 and any matrix U
such that U ∞ z, it holds
Similarly, we can obtain (see [514]): Under H0 , we have, with high probability 1−δ,
8 8
k2 log (4n2 /δ) k log 4n2 /δ log (2n/δ) log (2n/δ)
SDPk Σ̂ 1 + 2 +2 +2 +2 .
N N N N
Whenever τ̂1 > τ̂0 , we take the threshold τ and define the following computation-
ally efficient test
0 1
ψ̂ Σ̂ = 1 SDPk Σ̂ > τ .
Theorem 10.12.2 (Berthet and Rigollet [514]). Assume that n, N, k and δ are
such that θ̄ ≤ 1, where
" " "
k 2 log (4n2 /δ) k log 4n2 /δ log (2n/δ) log (1/δ)
θ̄ = 2 +2 +2 +4 .
N N N N
0 1
Then, for any θ > θ̄, any τ ∈ [τ̂0 , τ̂1 ], the test ψ̂ Σ̂ = 1 SDPk Σ̂ > τ
discriminates between H0 and H1 with probability 1 − δ.
−β
By considering asymptotic regimes, for large n, N, k, taking δ = n with β >
0, gives a sequence of tests ψ̂N Σ̂ that discriminates between H0 and H1 with
probability converging to 1, for any fixed θ > 0, as soon as
k 2 log (n)
→ 0.
N
Compared with Theorem 10.11.1, this price to pay for using
√ this convex relaxation
is to multiply the minimum detection level by a factor of k. In most examples, k
remains small so that this is not a very high price.
y = Ax + z (10.35)
where A ∈ Rm×n and z ∼ N 0, σ 2 I . We are often interested in the under-
determined setting where m may be much smaller than n. In general, one would
not expect to be able to accurately recover x when m < n since there are more
unknowns than observations. However it is by now well-known that by exploiting
sparsity, it is possible to accurately estimate x.
If we suppose that the entries of the matrix A are i.i.d. N (0, 1/n), then one can
show that for any x ∈ Bk := {x : x 0 k}, 1 minimization techniques such as
the Lasso or the Dantzig selector produce a recovery x̂ such that
1 2 kσ 2
x − x̂ 2 C0 log n (10.36)
n m
holds with high probability provided that m = Ω (k log (n/k)) [527]. We consider
the worst case error over all x ∈ Bk , i.e.,
510 10 Detection in High Dimensions
1 2
E (A) = inf sup E
x̂ (y) − x 2 . (10.37)
x̂ x∈Bk n
The following theorem gives a fundamental limit on the minimax risk which holds
for any matrix A and any possible recovery algorithm.
Theorem 10.13.1 (Candès and Davenport [526]). Suppose that we observe y =
Ax + z where x is a k-sparse vector, A is an m × n matrix with m ≥ k, and
z ∼ N 0, σ 2 I . Then there exists a constant C1 > 0 such that for all A,
kσ 2
E (A) C1 2 log (n/k) . (10.38)
A F
kσ 2
E (A) 2 . (10.39)
A F
This theorem says that there is no A and no recovery algorithm that does
fundamentally better than the Dantzig selector (10.36) up to a constant (say, 1/128);
that is, ignoring the difference in the factors log n/kand log n. In this sense, the
results of compressive sensing are, indeed, at the limit.
Corollary 10.13.2 (Candès and Davenport [526]). Suppose that we observe y =
A (x + w) where
x is a k-sparse vector, A is an m × n matrix with k ≤ m ≤ n,
and w ∼ N 0, σ 2 I . Then for all A
kσ 2 kσ 2
E (A) C1 log (n/k) and E (A) . (10.40)
m m
The intuition behind this result is that when noise is added to the measurements, we
can boost the SNR by rescaling A to have higher norm. When we instead add noise
to the signal, the noise is also scaled by A, and so no matter how A is designed
there will always be a penalty of 1/m.
The relevant work is [528] and [135]. We only sketch the proof ingredients. The
proof of the lower bound (10.38) follows a similar course as in [135]. We will
suppose that x is distributed uniformly on a finite set of points X ⊂ Bk , where
X is constructed so that the elements of X are well separated. This allows us to
show a lemma which follows from Fano’s inequality combined with the convexity
of the Kullback-Leibler (KL) divergence. The problem of constructing the packing
set X exploits the matrix Bernstein inequality of Ahlswede and Winter.
10.14 Detection of High-Dimensional Vectors 511
We follow [529]. See also [530] for a relevant work. Detection of correlations
is considered in [531, 532]. We emphasize the approach of the Kullback-Leibler
divergence used in the proof of Theorem 10.14.2. Consider the hypothesis testing
problem
H0 : y = z
H1 : y = x + z (10.41)
where x, y, z ∈ Rn and x is the unknown signal and z is additive noise. Here only
one noisy observation is available per coordinate. The vector x is assumed to be
sparse. Denote the scalar inner product of two column vectors a, b by a, b = aT b.
Now we have that
H0 : yi = y, ai = z, ai = zi , i = 1, . . . , N
H1 : yi = x, ai + zi , i = 1, . . . , N (10.42)
where the measurement vectors ai ’s have Euclidean norm bounded by 1 and the
noise zi ’s are i.i.d. standard Gaussian, i.e., N (0, 1). A test procedure based on
N measurements of the form (10.42) is a binary function of the data, i.e., T =
T (a1 , y1 , . . . , aN , yN ), with T = ε ∈ {0, 1} indicating that T favors T . The worst-
case risk of a test T is defined as
γ (T ) := P0 (T = 1) + max Px (T = 0) ,
x∈X
where Px denotes the distribution of the data when x is the true underlying vector
and the subset X ∈ Rn \ {0}. With a prior π on the set of alternatives X , the
corresponding average Bayes risk is expressed as
γπ (T ) := P0 (T = 1) + Eπ Px (T = 0) ,
where Eπ denotes the expectation under π. For any prior π and any test procedure
T , we have
γ (T ) γπ (T ) . (10.43)
T
For a vector a = (a1 , . . . , am ) , we use the notation
m
1/2 m
a = a2i , |a| = |ai |
i=1 i=1
512 10 Detection in High Dimensions
to represent the Euclidea norm and 1 -norm. For a matrix A, the operator norm is
defined as
Ax
A op = sup .
x=0 x
The 1 denotes the vector with all coordinates equal to 1.
Vectors with non-negative entries may be relevant to imaging processing.
√
Proposition 10.14.1 (Arias-Castro [529]). Consider x = 1/ n. Suppose we take
N measurements of the form (10.42) for all i. Consider the test that rejects H0 when
N
√
yi > τ N ,
i=1
1
γπ (T ) = 1 − Pπ − P0 TV , (10.44)
2
where C = {cij } := Eπ xxT . The first line is by definition; the second line
follows from the definition of Pπ /P0 , the use of Jensen’s inequality justified by the
convexity of x → − log x, and by Fubini’s theorem; the third line follows from
independence of ai , yi and x (under P0 ) and the fact that E [yi ] = 0; The fourth
is by independence of ai , and x (under P0 ) and by Fubini’s theorem; the fifth line
follows since ai 1 for all i.
Recall that under π the support of x is chosen uniformly at random among the
subsets of size k. Then we have
k
cii = μ2 Pπ (xi = 0) = μ2 · , ∀i,
n
and
k k−1
cij = μ2 Pπ (xi = 0, xj = 0) = μ2 · · , i = j.
n n−1
K (Pπ , P0 ) N · μ2 k 2 /n,
and returning (10.44) via (10.45), we bound the risk of the likelihood ratio test
γ (T ) 1 − K (Pπ , P0 ) /8 1 − N/ (8n)kμ.
514 10 Detection in High Dimensions
From Proposition 10.14.1 and Theorem 10.14.2, we conclude that the following
a nonnegative vector x ∈ R from
n
is true in a minmax sense: Reliable detection of
N noisy linear measurements is possible if N/n |x| → ∞ and impossible if
N/n |x| → 0.
Theorem 10.14.3 (Theorem 2 of Arias-Castro [529]). Let X (±μ, k) denote the
set of vectors in Rn having exactly k non-zero entries all equal to ±μ, with μ > 0.
Based on N measurements of the form (10.42), possiblyadaptive, any test for H0 :
x = 0 versus H1 : x ∈ X (±μ, k) has risk at least 1 − N k/ (8n)μ.
2
In particular, the risk against alternatives H1 : x ∈ X (±μ, k) with N/n x =
(N/n) kμ2 → 0, goes to 1 uniformly over all procedures.
We choose the uniform prior on X (±μ, k). The proof is then completely parallel
to that of Theorem 10.14.2, now with C = μ2 (k/n) I since the signs of the nonzero
entries of x are i.i.d. Rademacher. Thus C op = μ2 (k/n).
H0 : y = z
H1 : y = x + z (10.46)
m 2
is nearly n w 2 with high probability.
516 10 Detection in High Dimensions
We need the following three equations [534] to bound three parts in (10.47). First,
m 2 2 m 2
(1 − α) w 2 wΩ 2 (1 + α) w 2 (10.48)
n n
with probability at least 1 − 2δ. Second,
/ T /
/UΩ wΩ /2 (1 + β)2 m kμ(S) w 2
(10.49)
2 2
n n
with probability at least 1 − δ. Third,
/ /
/ T −1 / n
/ UΩ UΩ / (10.50)
2 (1 − γ) m
with probability at least 1 − δ, provided that γ < 1. The proof tools for the
three equations are McDiarmid’s Inequality [539] and Noncommutative Bernstein
Inequality (see elsewhere of this book).
We follow [540]. See also Example 5.7.6. We study the problem of detecting
whether a high-dimensional vector x ∈ Rn lies in a known low-dimensional sub-
space S, given few compressive measurements of the vector. In high-dimensional
settings, it is desirable to acquire only a small set of compressive measurements of
the vector instead of measuring every coordinate. The objective is not to reconstruct
the vector, but to detect whether the vector sensed using compressive measurements
lies in a low-dimensional subspace or not.
One class of problems [541–543] is to consider a simple hypothesis test of
whether the vector x is 0 (i.e. observed vector is purely noise) or a known signal
vector s:
H0 : x = 0 vs. H1 : x = s. (10.51)
10.16 Subspace Detection of High-Dimensional Vectors Using Compressive Sensing 517
Another class of problems [532, 534, 540] is to consider the subspace detection
setting, where the subspace is known but the exact signal vector is unknown. This
set up leads to the composite hypothesis test:
H0 : x ∈ S vs. H1 : x ∈
/ S. (10.52)
Equivalently, let x⊥ denote the component of x that does not lie in S. Now we have
the composite hypothesis test
y = A (x + w) (10.54)
y = Ax + z (10.55)
where z ∼ N 0, σ 2 Im×m . For fixed A, we have y ∼ N Ax, σ 2 AAT and
y ∼ N Ax, σ 2 Im×m . Compressive linear measurements may be formed later
on to optimize storage or data collection.
Define the projection operator PU = UUT . Then x⊥ = (I − PU ) x, where x⊥
2
is the component of x that does not lie in S, and x ∈ S if and only if x⊥ 2 = 0.
/ /
/ −1/2 /2
Similar to [534], we define the test statistic T = /(I − PBU ) AAT y/
−1/2
2
based on the observed vector y and study its properties, where B = AAT A.
Here PBU is the projection operator onto the column space of BU, specifically
−1
T T
PBU = BU (BU) BU (BU)
T
if (BU) BU exists.
Now are ready to present the main result. For the sake of notational simplicity,
we directly work with the matrix B and its marginal distribution. Writing y =
2
B (x + w), we have T = (I − PBU ) y 2 . Since A is i.i.d. normal, the distribution
of the row span of A (and hence B) will be uniform over m-dimensional subspaces
−1/2
of Rn [544]. Furthermore, due to the AAT term, the rows of B will be
orthogonal (almost surely). First, we show that, in the absence of noise, the test
2 2
statistic T = (I − PBU ) Bx 2 is close to m x⊥ 2 /n with high probability.
518 10 Detection in High Dimensions
Theorem 10.16.1 (Azizyan and Singh [540]). Let 0 < r < m < n, 0 < α0 <
1 and β0 , β1 , β2 > 1. With probability at least 1− exp [(1−α0 + log α0 ) m/2]
− exp [(1−β0 + log β0 ) m/2]−exp [(1−β1+log β1) m/2]−exp [(1−β2 + log β2) r/2]
m r m
2 2 2
α0 − β1 β2 x⊥ 2 (I − PBU ) Bx 2 β0 x⊥ 2 (10.56)
n n n
The proof of Theorem 10.16.1 follows from random projections and concentration
of measure. Theorem 10.16.1 implies the following corollary.
Corollary 10.16.2 (Azizyan and Singh [540]). If m c1 r log m, then with
probability at least 1 − c2 exp (−c3 m),
m 2 2 m 2
d1 x⊥ 2 (I − PBU ) Bx 2 d2 x⊥ 2
n n
for some universal constants c1 > 0, c2 > 0, c3 ∈ (0, 1), d1 ∈ (0, 1), d2 > 1.
Corollary 10.16.5 states that given just over r noiseless compressive measurements,
2
we can estimate x⊥ 2 accurately with high probability. In the presence of noise, it
is natural to consider the hypothesis test:
H0
2 <
T = (I − PBU ) y 2 η (10.57)
>
H1
The following result bounds the false alarm level and missed detection rate of this
2
test (for appropriately chosen η) assuming a lower bound on x⊥ 2 under H1 .
Theorem 10.16.3 (Azizyan and Singh [540]). If the assumptions of Corol-
lary 10.16.5 are satisfied, and if for any x ∈ H1
2 4e + 2 r
x⊥ 2 σ2 1− n,
d1 m
then
and
lower bound on the probability of error of any test. A corollary of this theorem
implies that the proposed test statistic is optimal, that is, every test with probability
of missed detection and false alarm decreasing exponentially in the number of
compressive samples m requires that the energy off the subspace scale as n.
Theorem 10.16.4 (Azizyan and Singh [540]). Let P0 be the joint distribution of
B and y under the null hypothesis. Let P1 be the joint distribution of B and y
under the alternative hypothesis where y = B (x + w), for some fixed x such that
x = x⊥ and x 2 = M > 0. If conditions of Corollary 10.16.5 are satisfied, then
1 M2 m
inf max Pi (φ = i) exp − 2
φ i=0,1 8 2σ n
1 −K(P0 ,P1 )
inf max Pi (φ = i) e
φ i=0,1 8
x22 m
= 2σ 2 n .
Corollary 10.16.5 (Azizyan and Singh [540]). If there exists a hypothesis test φ
based on B and y such that for all n and σ 2 ,
for some C0 , C1 > 0, then there exists some C > 0 such that
2
x⊥ 2 Cσ 2 (1 − r/m) n
yi = Tr (AXi ) + zi , i = 1, . . . , N (10.58)
iid
where z1 , . . . , zN ∼ N 0, σ 2 , σ > 0 known, and the sensing matrices Xi satisfy
2
either Xi F 1 or E Xi F = 1. We are interested in two measurement
schemes2 : (1) adaptive or sequential, that is, the measurement matrices Xi is a
(possibly randomized) function of (yj , Xj )j∈[i−1] ; (2) the measurement matrices
are chosen at once, that is, passively.
We follow [237] here. The use of concentration of measure for the standard
quadratic form of a random matrix is the primary reason for this whole section.
There are two independent sets of samples {x1 , . . . , xn1 } and {y1 , . . . , yn2 } ∈ Rp .
They are generated in an i.i.d. manner from p-dimensional multivariate Gaussian
distributions N (μ1 , Σ) and N (μ2 , Σ) respectively, where the mean vectors μ1
and μ2 , and positive-definitive covariance matrix Σ > 0, are all fixed and unknown.
The hypothesis testing problem of interest here is
H0 : μ1 = μ2 versus H1 : μ1 = μ2 . (10.59)
The most well known test statistic for this problem is the Hotelling T 2 statistic,
defined by
n1 n2 T −1
T2 = (x̄ − ȳ) Σ̂ (x̄ − ȳ) , (10.60)
n1 + n2
1
n1
1
n2
where x̄ = n1 xi and ȳ = n2 yi are sample mean vectors, Σ̂ is the pooled
i=1 i=1
sample covariance matrix, given by
n1 n2
1 T 1 T
Σ̂ = (xi − x̄) (xi − x̄) + (yi − ȳ) (yi − ȳ)
n i=1
n i=1
If Pk is drawn from the Haar distribution, independently of the data, then our
random projection-based test statistic is defined by
−1
n1 n2 T
T̂k2 = (x̄ − ȳ) EPk Pk PTk Σ̂Pk PTk (x̄ − ȳ) .
n1 + n2
For a desired nominal level α ∈ (0, 1), our testing procedure rejects the null
hypothesis H0 if and only if T̂k2 tα , where
522 10 Detection in High Dimensions
?
yn 2yn √
tα ≡ n+ nz1−α ,
1 − yn (1 − yn )
3
yn = k/n and z1−α is the 1 − α quartile of the standard Gaussian distribution. T̂k2
is asymptotically Gaussian.
To state a theorem, we make the condition (A1).
A1 There is a constant y ∈ (0, 1) such that yn = y + o √1n .
2
yn 2yn √
We also need to define two parameters μ̂n = 1−y n
n, and σ̂ n = (1−y )3
n.
n
T̂k2 − μ̂n d
−
→ N (0, 1) , (10.61)
μ̂n
d
(Here −→ stands for convergence in distribution) and as a result, the critical value
tα satisfies
P T̂k2 tα = α + o(1).
n1 +n2
Proof. Following [237], we only give a sketch of the proof. Let τ = n1 n2 and
√
z ∼ N (0, Ip×p ). Under the null hypothesis that δ = 0, we have x̄−ȳ = τΣ 1/2
z,
and as a result,
−1
T̂k2 = zT Σ1/2 EPk Pk PTk Σ̂Pk PTk Σ1/2 z, (10.62)
F GH I
A
which gives us the standard quadratic form T̂k2 = zT Az. The use of concentration
of measure for the standard quadratic form is the primary reason for this whole
section. Please refer to Sect. 4.15, in particular, Theorem 4.15.8.
Here, A is a random matrix. We may take x̄ − ȳ and Σ̂ to be independent for
Gaussian data [208]. As a result, we may assume that z and A are independent. Our
overall plan is to work conditionally on A, and use the representation
T
zT Az − μ̂n z Az − μ̂n
P x = EA Pz x |A ,
σ̂n σ̂n
where x ∈ R.
10.18 Two-Sample Test in High Dimensions 523
Let oPA (1) stand for a positive constant in probability under PA . To demonstrate
the asymptotic Gaussian distribution of zT Az, in Sect. B.4 of [237], it is shown that
Aop
AF = oPA (1) where || · ||F denotes the Frobenius norm. This implies that the
Lyupanov condition [58], which in turn implies the Lindeberg condition, and it then
follows [87] that
zT Az − Tr (A)
sup Pz √ x |A − Φ (x) = oPA (1) . (10.63)
x∈R 2 A F
√ √
Tr (A) − μ̂n = oPA n and A F − σ̂n = oPA n . (10.64)
and the central limit theorem (10.61) follows from the dominated convergence
theorem.
To state another theorem, we need the following two conditions:
• (A2) There is a constant b ∈ (0, 1) such that nn1 = b + o √1n .
• (A3) (Local alternative) The shift vector and covariance matrix satisfy
δ T Σ−1 δ = o(1).
Theorem 10.18.2 (Lopes, Jacob and Wainwright [237]). Assume that conditions
(A1), (A2), and (A3) hold. Then, as (n, p) → ∞, the power function satisfies
"
1−y √
P Tk2 tα = Φ −z1−α + b (1 − b) · · Δk n + o(1). (10.65)
2y
where
−1
Δk = δ EPk Pk PTk Σ̂Pk
T
PTk δ.
Proof. The heart of this proof is to use the conditional expectation and the
concentration of quadratic form. We work under the alternative hypothesis. Let
τ = nn11+n 2
n2 and z ∼ N (0, Ip×p ). Consider the limiting value of the power function
P n T̂k tα . Since the shift vector is nonzero δ = 0, We can break the test
1 2
−1
n1 n2 T
T̂k2 = (x̄ − ȳ) EPk Pk PTk Σ̂Pk PTk (x̄ − ȳ) ,
n1 + n2
where
√
x̄ − ȳ = τ Σ1/2 z + δ,
and z is a standard Gaussian p-vector. Expanding the definition of T̂k2 and adjusting
by a factor n, we have the decomposition
1 2
T̂ = I + II + III,
n k
where
1 T
I= z Az (10.66)
n
1
II = 2 √ zT AΣ−1/2 δ (10.67)
n τ
1 T −1/2
III = δ Σ AΣ−1/2 δ. (10.68)
nτ
Recall that
−1
A = Σ1/2 EPk Pk PTk Σ̂Pk PTk Σ1/2 .
(10.69)
1 n
We also define the numerical sequence ln = τn n−k−1 Δk . In Sects. C.1 and C.2
of [237], they establish the limits
√ √
Pz n |II| ε |A = oPA (1) , and n (III − ln ) = oPA (1)
(10.70)
where the error term oPA (1) is bounded by 1. Integrating over A and applying the
dominated convergence theorem, we obtain
⎛ ? ⎞
√ (1 − yn ) ⎠
3
P Tk2 tα = Φ ⎝−z1−α + n ln + o (1) .
2yn
k √1 n1 √1
Using the assumptions yn = n = a+o n
and n = b+o n
, we conclude
"
1−y √
P Tk2 tα = Φ −z1−α + b (1 − b) · · Δk n + o(1),
2y
H0 : A = Rn
H1 : B = Rx + Rn (10.71)
526 10 Detection in High Dimensions
where Rx is the true covariance matrix of the unknown signal and Rn is the
true covariance matrix of the noise. The optimal average probability of correct
detection [5, p. 117] is
1 1
+ A − B 1,
2 4
√
where the trace norm X 1 = Tr XXH is the sum of the absolute eigenvalues.
See also [548, 549] for the original derivation.
In practice, we need to use the sample covariance matrix to replace the true
covariance matrix, so we have
H0 : Â = R̂n
H1 : B̂ = R̂x + R̂n
where R̂x is the true covariance matrix of the unknown signal and R̂n is the true
covariance matrix of the noise. Using the triangle inequality of the norm, we have
/ / / / / /
/ / / / / /
/Â − B̂/ /Â − A/ + /B − B̂/ .
1 1 1
In [550], Sharpnack, Rinaldo, and Singh consider the basic but fundamental task of
deciding whether a given graph, over which a noisy signal is observed, contains a
cluster of anomalous or activated nodes comprising an induced connected subgraph.
Ramirez, Vita, Santamaria and Scharf [551] studies the existence of locally most
powerful invariant tests for the problem of testing the covariance structure of a set
of Gaussian random vectors. In practical scenarios the above test can provide better
performance than the typically used generalized likelihood ratio test (GLRT).
Onatski, Moreira and Hallin [552] consider the problem of testing the null
hypothesis of sphericity of a high-dimensional covariance matrix against an alterna-
tive of multiple symmetry-breaking directions (multispiked alternatives).
Chapter 11
Probability Constrained Optimization
where x is the decision vector, K is a closed convex cone, and A (x, ξ) is defined as
N
A (x, ξ) = A0 (x) + σ ξi Ai (x), (11.2)
i=1
where
• Ai (·) are affine mapping from Rn to finite-dimensional real vector space E;
• ξi are scalar random perturbations satisfying the relations
1. ξi are mutually independent;
2. E {ξi } = 0;
√
E exp ξi2 /4 2; (11.3)
R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 527
DOI 10.1007/978-1-4614-4544-9 11,
© Springer Science+Business Media New York 2014
528 11 Probability Constrained Optimization
here (11.2) is a randomly perturbed Conic Quadratic Inequality (CQI), where the
data are affine in the perturbations. (3) E = Sm is the space of m × m symmetric
matrices, K = Sm + is the cone of positive semidefinite matrices from S ; here (11.2)
m
Prob {ξ : A (x, ξ) ∈
/ K} ε, (11.4)
for a given ε ! 1. Our ultimate goal is to optimize over the resulting set, under the
additive constraints on x. A fundamental problem is that (11.4) is computationally
intractable. The solution to the problem is connected with concentration of measure
through large deviations of sums of random matrices, following Nemirovski [554].
Typically, the only way to estimate the probability for a probability constraint to
be violated at a given point is to use Monte-Carlo simulations (so-called scenario
approach [555–558]) with sample sizes of 1ε ; this becomes too costly when ε is
small such as 10−5 or less. A natural idea is to look for tractable approximations of
the probability constraint, i.e., for efficiently verifiable sufficient conditions for its
validity. The advantage of this approach is its generality, it imposes no restrictions
on the distribution of ξ and on how the data enter the constraints.
An alternative to the scenario approximation is an approximation based on
“closed form” upper bounding of the probability for the randomly perturbed
constraints A (x, ξ) ∈ K to be violated. The advantage of the “closed form”
approach as compared to the scenario one is that the resulting approximations
are deterministic convex problems with sizes independent of the required value
of ε, so that the approximations also remain practical in the case of very small
values of ε. A new class of “closed form” approximations, referred to as Bernstein
approximations, is proposed in [559].
Example 11.1.1 (Covariance Matrix Estimation). For independent random vectors
xk , k = 1, . . . , K, the sample covariance—a Hermitian, positive semidefinite,
random matrix—defined as
11.2 Sums of Random Symmetric Matrices 529
K
R̂ = xk xTk , xk ∈ Rn
k=1
Note that the n elements of the k-th vector xk may be dependent random variables.
Now, assume that there are N + 1 observed sample covariances R̂0 (x) , i =
0, 1, 2, . . . , N such that
N
R̂ (x, ξ) = R̂0 (x) + σ ξi R̂i (x), (11.5)
i=1
where x is the decision vector. Thus, the covariance matrix estimation is recast in
terms of an optimization. Later it will be shown that the optimization problem is
convex and may be solved efficiently using the general-purpose solver that is widely
available online. Once the covariance matrix is estimated, we can enter the second
stage of detection process for extremely weak signal. This line of research seems
novel.
which, translates to
N
B2i I.
i=1
A natural conjecture is that the latter condition is sufficient as well. This answer is
affirmative, as proven by So [122]. A relaxed version of this conjecture has been
proven by Nemirovski [560]: Specifically, under the above condition, the typical
norm of SN is O(1)m1/6 with the probability
0 1
Prob SN > tm1/6 O(1) exp −O(1)t2
N
A0 (x) − ξi Ai (x) 0, (11.7)
i=1
where A1 (x) , . . . , AN (x) are affine functions of the decision vector x taking
values in the space Sn of symmetric n × n matrices, and ξi are independent of
each other random perturbations. Without loss of generality, ξi can be assumed to
have zero means.
A natural idea is to consider the probability constraint
+ N
,
Prob ξ = (ξ1 , . . . , ξN ) : A0 (x) − ξi Ai (x) 0 1 − , (11.8)
i=1
when is small, like 10−6 or 10−8 . A natural way to overcome this difficulty
is to replace “intractable” (11.7) with its “tractable approximation”—an explicit
constraint on x.
An necessary condition for x to be feasible for (11.7) is A0 (x) ≥ 0;
strengthening this necessary condition to be A0 (x) > 0, x is feasible for the
probability constraint if and only if the sum of random matrices
N
−1/2 1/2
SN = ξ i A0 (x) Ai (x) A0 (x)
i=1
F GH I
Yi
where A 2 = Tr (AAT ) is the Frobenius norm of a matrix. This problem is
equivalent to the quadratic maximization problem
⎧ ⎫
⎨ ⎬
T T n×n T
P = max 2 Tr A[k]X[k]X [k ]A [k ] : X [k] ∈ R , X [k] X [k] = In , k = 1, . . . , N .
X[1],...,X[N ] ⎩ ⎭
k<k
(11.9)
When N > 2, this problem is intractable. For N = 2, there is a closed form solution.
Equation (11.29) allows for a straightforward semidefinite relaxation. Geometrically
speaking, we are given N collections of points in Rn and are seeking for rotations
which make these collections as close to each other as possible, the closeness being
measured by the sum of squared distances.
Then
+/ N / ,
/ /
/ /
t 7m1/4 ⇒ Prob / ξi Bi / tΘ 54 exp −t2 /32 ,
/ /
i=1
+/ N / ,
/ /
/ /
t 7m1/6 ⇒ Prob / ξi Bi / tΘ 22 exp −t2 /32 . (11.11)
/ /
i=1
See [560] for a proof. Equation (11.11) is extended by Nemirovski [560] to hold for
the case of independent, Gaussian, symmetric n × n random matrices X1 , . . . , XN
with zero means and σ > 0 such that
N
E X2i σ 2 In . (11.12)
i=1
and ξi be independent random scalars with zero mean and of order of 1. Then
+ ,
N
t O (1) ln (m + n) ⇒ Prob ξ : ξi Ci tΘ O(1) exp −O(1)t2 .
i=1
(11.14)
Using the deterministic symmetric (m + n) × (m + n) Bi defined as
CTi
Bi = ,
Ci
the following theorem follows from Theorem 11.2.4 and (11.12).
Theorem 11.2.5 (Nemirovski [560]). Let deterministic m × n matrices Ci satis-
fying (11.13) with Θ = 1., and let ξi be independent random scalars with zero first
and third moment and such that either |ξi | ≤ 1 for all i ≤ N, or ξ ∼ N (0, 1) for
all i ≤ N. Then
+N ,
1/4
t 7(m + n) ⇒ Prob ξi Ci tΘ 54 exp −t2 /32 .
i=1
+ ,
1/6
N
t 7(m + n) ⇒ Prob ξi Ci tΘ 22 exp −t2 /32 . (11.15)
i=1
+ ,
N
4
t 4 min (m, n) ⇒ Prob ξi Ci tΘ exp −t2 /16 . (11.16)
i=1
3
It is argued in [560] that the threshold t = Ω ln (m + n) is in some sense
the best one could hope for. So [122] finds that the behavior of the random
N
variable SN ≡ ξi Xi has been extensively studied in the functional analysis
i=1
and probability theory literature. One of the tools is the so-called Khintchine-type
inequalities [121].
We say ξ1 , . . . , ξN are independent Bernoulli random variables when each ξi
takes on the values ±1 with equal probability.
Theorem 11.2.6 (So [122]). Let ξ1 , . . . , ξN be independent mean zero random
variables, each of which is either (i) supported on [-1,1], or (ii) Gaussian with
variance one. Further, let X1 , . . . , XN be arbitrary m × n matrices satisfying
max(m, n) 2 and (11.13) with Θ = 1. Then, for any t ≥ 1/2, we have
/ /
/ N /
/ / −t
Prob / ξi Xi / 2e (1 + t) ln max {m, n} (max {m, n})
/ /
i=1
/ 1/2 / / 1/2 /
/ / / /
/ N
/ / N
/
/ Xi XTi / m1/p , / XTi Xi / n1/p .
/ / / /
/ i=1 / / i=1 /
Sp Sp
for any p ≥ 2. Note that ||A|| denotes the spectrum norm of matrix A. Now, by
Markov’s inequality, for any s > 0 and p ≥ 2, we have
⎛/ / ⎞ ⎡/ /p ⎤
/ N / / N / pp/2 · max {m, n}
/ / / /
P ⎝/ ξ i Xi / s⎠ s−p · E ⎣/ ξ i Xi / ⎦ .
/ / / / sp
i=1 ∞
i=1 ∞
as desired.
Next, we consider the case where ξ1 , . . . , ξN are independent mean zero
random variables supported on [−1, 1]. Let ε1 , . . . , εN be i.i.d. Bernoulli random
variables; ε1 , . . . , εN are independent of the ξi ’s. A standard symmetrization
argument (e.g., see Lemma 1.11.3 which is Lemma 6.3 in [27]), together with
Fubini’s theorem and Theorem 2.17.4 implies that
p p
N N
E
ξ i X i
2p · Eξ Eε
ε i ξ i X i
i=1 Sp i=1
⎡ ⎧Sp ⎫⎤
⎨ 1/2
p 1/2
p ⎬
N
N
2p · pp/2 · Eξ ⎣max ξ 2 Xi XT , ξ 2 XT Xi ⎦
⎩ i=1 i i
i=1 i i ⎭
Sp Sp
2p · pp/2 · max {m, n} .
&
%
Example 11.3.2 will use Theorem 11.2.6. A
536 11 Probability Constrained Optimization
where A1 (x) , . . . , AN (x) are affine functions of the decision vector x taking
values in the space Sn of symmetric n × n matrices, and ξi are independent of
each other’s random perturbations. ξi , i = 1, . . . , N are random real perturbations
which we assume to be independent with zero means “of order of 1” and with “light
tails”—we will make the two assumptions precise below.
Here we are interested in the sufficient conditions for the decision vector x such
that the random perturbed LMI (11.17) holds true with probability ≥ 1 − , where
<< 1. Clearly, we have
A0 (x) 0.
To simplify, we consider the strengthened condition A0 (x) > 0. For such decision
vector x, letting
−1/2 −1/2
Bi (x) = A0 (x) Ai (x) A0 (x) ,
the question becomes to describe those x such that
+N ,
Prob ξi Bi (x) 0 1 − . (11.18)
i=1
Precise description seems to be completely intractable. The trick is to let the closed-
form probability inequalities for sums of random matrices “do the most of the job”!
What we are about to do are verifiable sufficient conditions for (11.18) to hold true.
Using Theorem 11.2.3, we immediately obtain the following sufficient condition:
Let n ≥ 2, perturbations ξi be independent with zero means such that
E exp ξi 2 exp (1) , i = 1, . . . , N.
Then the condition
N
2 1
−1/2 −1/2
A0 (x) > 0 & A0 (x) Ai (x) A0 (x)
i=1
450 exp (1) ln 3ε (ln m)
(11.19)
N / /2
/ −1/2 −1/2 /
/ A0 (x) Ai (x) A0 (x)/ τ (11.20)
i=1
N
1
−A0 (x) μi Ai (x) A0 (x) , μi > 0, i = 1, . . . , N, τ.
i=1
μ2i
N
−νi A Ai (x) νi A, i = 1, . . . , N, νi 2 τ
i=1
in variables x, νi , τ.
Using Theorem 11.2.3, we arrive at the following sufficient conditions: Let
perturbations ξi be independent with zero mean and zero third moments such that
|ξi | 1, i = 1, . . . , N, or such that ξi ∼ N (0, 1), i = 1, . . . , N . Let, further,
∈ (0, 1) be such that one of the following two conditions is satisfied
5 49m1/2
(a) ln
4 32
(11.21)
22 49m1/3
(b) ln
32
minimize cT x
0
subject to F(x)
N (11.24)
Prob A0 (x) − ξi Ai (x) 0 1 − , (†)
i=1
x ∈ Rn .
where
−1/2 −1/2
Ãi (x) = A0 (x) Ai (x) A0 (x) .
Now suppose that we can choose γ = γ () > 0 such that whenever
N
Ã2i (x) γ 2 I (11.26)
i=1
holds, the constraint condition (†) in (11.24) is satisfied. Then, we say that (11.26)
is a sufficient condition for (†) to hold. Using the Schur complement [560], (11.26)
can be rewritten as a linear matrix inequality
⎡ ⎤
γA0 (x) A1 (x) · · · AN (x)
⎢ A1 (x) γA0 (x) ⎥
⎢ ⎥
⎢ .. . ⎥ 0. (11.27)
⎣ . . . ⎦
AN (x) γA0 (x)
Thus, by replacing the constraint condition (†) with (11.27), the original prob-
lem (11.24) is tractable. Moreover, any solution x ∈ Rn that satisfies F(x) 0
and (11.27) will be feasible for the original probability constrained problem
of (11.24).
Now we are in a position to demonstrate how Theorem 11.2.6 can be used
for problem solving. If the random variables ξ1 , . . . , ξN satisfy the conditions
11.3 Applications of Sums of Random Matrices 539
of Theorem 11.2.6, and at the same time, if (11.26) holds for γ γ () ≡
−1
8e ln (m/) , then for any ∈ (0, 1/2], it follows that
/ /
N / N /
/ /
Prob ξi Ãi (x) I = Prob / ξi Ãi (x)/ 1 > 1 − . (11.28)
/ /
i=1 i=1 ∞
N
2
Since 1
γ Ãi (x) I, by Theorem 2.17.4 and Markov’s inequality (see the
i=1
above proof of Theorem 11.2.6), we have
/ /
/ N /
/ 1 / 1
Prob / ξi Ãi (x) / > m · exp −1/ 8eγ 2 .
/ γ / γ
i=1 ∞
This gives (11.28). Finally, using (11.25), we have the following theorem.
Theorem 11.3.3. Let ξ1 , . . . , ξN be independent mean zero random variables, each
of which is either (i) supported on [-1,1], or Gaussian with variance one. Consider
the probability constrained problem (11.24). Then, for any ∈ (0, 1/2], the positive
−1
semi-definite constraint with γ γ () ≡ 8e ln (m/) is a safe tractable
approximation of (11.24).
This
theorem improves upon Nemirovski’s result in [560], which requires γ =
1/6
O m + ln (1/) before one could claim that the constraint is a safe tractable
approximation of (11.24). &
%
Example 11.3.4 (Nonconvex quadratic optimization under orthogonality con-
straints [560]—continued). Consider the following optimization problem
⎧ ⎫
⎪
⎪ X, BX 1 (a) ⎪
⎪
⎨ ⎬
X, Bl X 1, l = 1, . . . , L (b)
P = max X, AX : , (11.29)
X∈Mm×n ⎪
⎪ CX = 0, (c) ⎪
⎪
⎩ ⎭
X 1. (d)
where
• Mm×n is the space
of m × n matrices equipped with the Frobenius inner product
X, Y = Tr XYT , and X = max { XY 2 : Y 1} is, as always, the
Y
spectral norm of X ∈ Mm×n ,
• The mappings A, B, B l are symmetric linear mappings from Mm×n into Mm×n ,
• B is positive semidefinite,
540 11 Probability Constrained Optimization
Observe that
X (X) 0,
m
n
and that cij · xij = 0 if and only if
i=1 j=1
11.3 Applications of Sums of Random Matrices 541
⎛ ⎞2
m n m n m n
0=⎝ cij · xij ⎠ ≡ cij · xij · ckl · xkl = Tr (X (C) X (X)) .
i=1 j=1 i=1 j=1 k=1 l=1
and similarly
X 1 ⇔ XXT Im .
Since the entries in the matrix product XXT are linear combinations of the entries
in X (X), we have
XXT Im ⇔ S (X (X)) Im ,
XXT In ⇔ T (X (X)) In ,
where Cμ ∈ Smn
+ is given by
μ
Cij,kl = Cμ,kl · Cμ,kl .
where
Ω = max max μk + 32 ln (132K), 32 ln (12 (L + 1)) ,
1kK
1/6
μk = min 7(mk + nk ) , 4 min (mk , nk ) .
In particular, one has both the lower bound and the upper bound
The numerical simulations are given by Nemirovski [560]: The SDP solver mincx
(LMI Toolbox for MATLAB) was used, which means at most 1,000–1,200 free
entries in the decision matrix X (X). &
%
We see So [122] for data-driven distributionally robust stochastic programming.
Convex approximations of chance constrained programs are studied in Shapiro and
Nemirovski [559].
N
F (x, ξ) = A0 (x) + ξi Ai (x) (11.34)
i=1
N
P A0 (x) + ξi Ai (x) 0
ξ
i=1
N
by means of the sums of random matrices ξi Ai (x). By concentration of
i=1
measure, by using sums of random matrices, Cheung, So, and Wang [562] arrive
at a safe tractable approximation of chance-constrained, quadratically perturbed,
linear matrix inequalities
⎛ ⎞
N
P ⎝A0 (x) + ξi Ai (x) + ξj ξk Bjk 0⎠ 1 − , (11.35)
ξ
i=1 1jkN
We follow [236] closely for our exposition. The motivation is to demonstrate the
use of concentration of measure in a probabilistically constrained optimization
problem. Let || · || and || · ||F represent the vector Euclidean norm and matrix
norm, respectively. We write x ∼ CN (0, C) if x is a zero-mean, circular symmetric
complex Gassian random vector with covarance matrix C ≥ 0.
Consider so-called multiuser multiple input single output (MISO) problem,
where the base station, or the transmitter, sends parallel data streams to multiple
users over the sample fading channel. The transmission is unicast, i.e., each data
stream is exclusively sent for one user. The base station has Nt transmit antennas
and the signaling strategy is beamforming. Let x(t) ∈ CNt denote the multi-antenna
transmit signal vector of the base station at time t. We have that
K
x(t) = wk sk (t), (11.36)
k=1
where wk ∈ CNt is the transmit beamforming vector for user k, K is the number
of users, and sk (t) is the
data stream
of user k, which is assumed to have zero-
2
mean and unit power E |sk (t)| = 1. It is also assumed that sk (t) is statistically
independent of one another. For user i, the received signal is
yi (t) = hH
i x(t) + ni (t), (11.37)
544 11 Probability Constrained Optimization
where hi ∈ CNt is the channel gain from the base station to user i, and ni (t) is an
additive noise, which is assumed to have zero mean and variance σ 2 > 0.
A common assumption in transmit beamforming is that the base station has
perfect knowledge of h1 , . . . hK , the so-called perfect channel state information
(CSI). Here, we assume the CSI is not perfect and modeled as
hi = h̄i + ei , i = 1, . . . , K,
where hi is the actual channel, h̄i ∈ CNt is the presumed channel at the base station,
and ei ∈ CNt is the respective error that is assumed to be random. We model the
error vector using complex Gaussian CSI errors,
ei ∼ CN (0, Ci ) (11.38)
K
minimize Tr(Wi )
W1 ,...,WK ∈HNt )
i=1 ) * *
H 1
subject to P h̄i +ei γi
Wi − Wk h̄i +ei σi2 1−ρi , i = 1, . . . , K,
k=i
W1 , . . . , W K 0
rank (Wi ) = 1, i = 1, . . . , K,
(11.41)
where the connection between (11.40) and (11.41) lies in the feasible point
equivalence
Wi = wi wiH , i = 1, . . . , K.
K
minimize Tr(Wi )
W1 ,...,WK ∈HNti=1
) ) * *
H 1
subject to P h̄i +ei γi
Wi − Wk h̄i +ei σi2 1−ρi , i = 1, . . . , K,
k=i
W1 , . . . , WK 0.
(11.42)
where HNt denotes Hermitian matrix with size Nt by Nt . The benefit of this
relaxation is that the inequalities inside the probability functions in (11.42) are
linear in W1 , . . . , WK , which makes the probabilistic constraints in (11.42) more
tractable. An issue that comes from the semidefinite relaxation is the solution
rank: the removal of the rank constraints rank (Wi ) = 1 means that the solution
(W1 , . . . , WK ) to problem (11.42) may have rank higher than one. A stan-
dard way of tacking this is to apply some rank-one approximation procedure
to (W1 , . . . , WK ) to generate a feasible beamforming solution (w1 , . . . , wK )
to (11.39). See [565] for a review and references. An algorithm is given in [236],
which in turn follows the spirit of [566].
Let us present the second step: restriction. The relaxation step alone above does
not provide a convex approximation of the main problem (11.39). The semidefinite
relation probabilistic constraints (11.42) remain intractable. This is the moment
that the Bernstein-type concentration inequalities play a central role. Concentration
of measure lies in the central stage behind the Bernstein-type concentration
inequalities. The Bernstein-type inequality (4.56) is used here. Let z = e and y = r
√ 2 2
in Theorem 4.15.7. Denote T (t) = Tr (Q) − 2t Q F + 2 y − tλ+ (Q).
We reformulate the challenge as the following: Consider the chance constraint
P eH Qe + 2 Re eH r + s 0 1 − ρ, (11.43)
546 11 Probability Constrained Optimization
The advantage of the above decomposition approach is that the distribution of the
random vector e may be √
non-Gaussian.
√ n In particular, e ∈ Rn is a zero-mean random
vector supported on − 3, 3 with independent components. For details, we
refer to [236, 562].
11.6.1 Introduction
This section follows [567] and deals with probabilistically secured joint amplify-
and-forward (AF) relay by cooperative jamming. AF relay with N relay nodes is
the simple relay strategy for cooperative communication to extend communication
range and improve communication quality. Due to the broadcast nature of wireless
communication, the transmitted signal is easily intercepted. Hence, communication
security is another important performance measure which should be enhanced.
Artificial noise and cooperative jamming have been used for physical layer security
[568, 569].
All relay nodes perform forwarding and jamming at the same time cooperatively.
A numerical approach based on optimization theory is proposed to obtain forward-
ing complex weights and the general-rank cooperative jamming complex weight
matrix simultaneously. SDR is the core of the numerical approach. Cooperative
jamming with the general-rank complex weight matrix removes the non-convex
548 11 Probability Constrained Optimization
where gs1 is the transmitted complex weight for Alice and gd1 is the transmitted
complex weight for Bob; hsrn 1 is CSI between Alice and the nth relay node; hdrn 1
is CSI between Bob and the nth relay node; wrn 1 is the background Gaussian noise
with zero mean and σr2n 1 variance for the nth relay node. Similarly, the received
signal plus artificial noise for Eve is,
Relay 2
Information
Signal Articial Noise
Alice
Bob
Relay N
Eve
Relay 2
Eve
Information Signal plus
Artificial Noise plus
Cooperative Jamming
Relay N
where hse1 is CSI between Alice and Eve; hde1 is CSI between Bob and Eve; we1
2
is the background Gaussian noise with zero mean and σe1 variance for Eve. Thus,
SINR for Eve in the first hop is,
2
|hse1 gs1 |
SINRe1 = 2 (11.51)
|hde1 gd1 | + σe1
2
In the second hop, N relay nodes perform joint AF relay and cooperative
jamming simultaneously. The function diagram of the second hop is shown in
Fig. 11.2.
The transmitted signal plus cooperative jamming for the nth relay node is
s rn 2 = g rn 2 y rn 1 + J rn 2 (11.52)
550 11 Probability Constrained Optimization
where grn 2 is the forwarding complex weight for the nth relay node and
Jrn 2 , n = 1, 2, . . . , N constitutes cooperative jamming jr2 ,
⎡ ⎤
J r1 2
⎢ Jr 2 ⎥
⎢ 2 ⎥
jr2 =⎢ . ⎥ (11.53)
⎣ .. ⎦
J rN 2
jr2 = Uz (11.54)
where (·)H denotes Hermitian operator and diag {·} returns the main diagonal of
matrix or puts vector on the main diagonal of matrix.
The received signal plus artificial noise by Bob is,
N
yd2 = hrn d2 (grn 2 yrn 1 + Jrn 2 ) + wd2 (11.57)
n=1
where hrn d2 is CSI between the nth relay node and Bob; wd2 is the background
2
Gaussian noise with zero mean and σd2 variance for Bob. Similarly, the received
signal plus artificial noise for Eve in the second hop is,
N
ye2 = hrn e2 (grn 2 yrn 1 + Jrn 2 ) + we2 (11.58)
n=1
where hrn e2 is CSI between the nth relay node and Eve; we2 is the background
2
Gaussian noise with zero mean and σe2 variance for Eve.
In the second hop, SINR for Bob is,
2
N
h g h g
rn d2 rn 2 srn 1 s1
SINRd2 = 2
n=1
2
N N N
h g h g + |h g |2 2
σ +E h J 2
+σd2
r n d2 r n 2 dr n 1 d1 r n d2 r n 2 rn 1 r n d2 r n
2
n=1 n=1 n=1
(11.59)
11.6 Probabilistically Secured Joint Amplify-and-Forward Relay by Cooperative Jamming 551
Due to the known information about J, Bob can cancel artificial noise generated
2
N
by itself. Thus, hrn d2 grn 2 hdrn 1 gd1 can be removed from the denominator in
n=1
Eq. (11.59) which means SINR for Bob is
2
N
h g h g
rn d2 rn 2 srn 1 s1
SINRd2 = n=1
+ 2 , (11.60)
N N
2 2
|hrn d2 grn 2 | σrn 1 + E hrn d2 Jrn 2 + σd2
2
n=1 n=1
(11.61)
Here, we assume there is no memory for Eve to store the received data in the
first hop. Eve cannot do any type of combinations for the received data to decode
information.
Let,
hrd2 = hr1 d2 hr2 d2 · · · hrN d2 (11.62)
Then,
⎧ 2 ⎫
⎨ N ⎬
E hrn d2 Jrn 2 = hrd2 UE zzH UH hH (11.63)
⎩ ⎭ rd2
n=1
= hrd2 UUH hH
rd2 (11.64)
Similarly, let,
hre2 = hr1 e2 hr2 e2 · · · hrN e2 (11.65)
Then,
⎧ 2 ⎫
⎨ N ⎬
E hrn e2 Jrn 2 = hre2 UUH hH (11.66)
⎩ ⎭ re2
n=1
Based on the previous assumption, hsrn 1 , hdrn 1 , and hrn d2 are perfect known.
While, hse1 , hde1 , and hrn e2 are partially known. Without loss of generality, hse1 ,
hde1 , and hrn e2 all follow independent complex Gaussian distribution with zero
552 11 Probability Constrained Optimization
mean and unit variance (zero mean and 12 variance for both real and imaginary
parts). Due to the randomness of hse1 , hde1 , and hrn e2 , SINRe1 and SINRe2 are
also random variables.
Joint AF relay and cooperative jamming with probabilistic security consideration
would like to solve the following optimization problems,
find
gs1 , gd1 , grn 2 , n = 1, 2, . . . , N, U
subject to
9 :
|grn 2 hsrn 1 gs1 |2 +|grn 2 hdrn 1 gd1 |2 +|grn 2 |2 σr2n 1 +E |Jrn 2 |2 ≤ Prn 2 , n = 1, 2, . . . , N
Pr (SINRe1 ≥ γe ) ≤ δe
Pr (SINRe2 ≥ γe ) ≤ δe
SINRd2 ≥ γd
(11.67)
where Prn 2 is the individual power constraint for the nth relay node; γe is the
targeted SINR for Eve and δe is its violation probability; γd is the targeted SINR
for Bob.
2 2
−γe σe1
|gs1 |
e |gs1 |2
2 2 ≤ δe (11.68)
|gs1 | + γe |gd1 |
2
From inequality (11.68), given |gs1 | , we can easily get,
2
−γe σe1
2
e |gs1 |2
− δe |gs1 |
2
|gd1 | ≥ (11.69)
γ e δe
Let
⎡ ⎤
g r1 2
⎢ gr 2 ⎥
⎢ 2 ⎥
gr2 = ⎢ . ⎥, (11.70)
⎣ .. ⎦
g rN 2
hsr1 = hsr1 1 hsr2 1 · · · hsrN 1 (11.71)
and
hdr1 = hdr1 1 hdr2 1 · · · hdrN 1 (11.72)
Y = UUH (11.75)
where Y is the semidefinite matrix and the rank of Y should be equal to or smaller
than r.
Pr (SINRe2 ≥ γe ) ≤ δe can be simplified as,
Pr hre2 QhH re2 ≥ σe2 γe ≤ δe
2
(11.76)
where Q is equal to
2 2
Q = Hsr1 XHH sr1 |gs1 | − γe Hdr1 XHdr1 |gd1 | − γe diag {diag {X}} σr2 − γe Y
H 2
(11.77)
Then probabilistic constrain in (11.76) can be approximated as,
0.5
trace (Q) + (−2log (δe )) a − blog (δe ) − σe2
2
γe ≤ 0
Q F ≤a
(11.78)
bI − Q ≥ 0
b≥0
where trace(·) returns the sum of the diagonal elements of matrix; · F return
Frobenius norm of matrix.
554 11 Probability Constrained Optimization
Based on the definition of X and Y, the individual power constraint for the nth
relay node in the optimization problem (11.67) can be rewritten as,
2 2
(X)n,n |hsrn 1 gs1 | + |hdrn 1 gd1 | + σr2n 1 + (Y)n,n ≤ Prn 2 , n = 1, 2, . . . , N
(11.79)
where (·)i,j returns the entry of matrix with the ith row and jth column.
SINRd2 constraint SINRd2 ≥ γd in the optimization problem (11.67) can be
reformulated as,
trace aH aX ≥ γd trace HH 2 H
rd2 Hrd2 σr1 X + trace hrd2 hrd2 Y + σd2
2
(11.80)
where
and
minimize
N 2 2
n=1 ((X) n,n |h sr n 1 g s1 | + |h dr n 1 g d1 | + σ 2
r n 1 + (Y)n,n )
subjectto
2 2
(X)n,n |hsrn 1 gs1 | + |hdrn 1 gd1 | + σr2n 1 + (Y)n,n ≤ Prn 2 , n = 1, 2, . . . , N
0.5
trace (Q) + (−2log (δe )) a − blog (δe ) − σe2
2
γe ≤ 0
Q F ≤a
bI − Q ≥ 0
b≥0
trace aH aX ≥ γd trace HH rd2 Hrd2 σ 2
r1 X + trace h rd2
H
h rd2 Y + σ 2
d2
X≥0
rank(X) = 1
Y≥0
(11.83)
Due to the non-convex rank constraint, the optimization problem (11.83) is an
NP-hard problem. We have to remove rank constraint and the optimization prob-
lem (11.83) becomes an SDP problem which can be solved efficiently. However,
11.6 Probabilistically Secured Joint Amplify-and-Forward Relay by Cooperative Jamming 555
minimize
2 2
λ(trace(X)−trace(Xx∗ (x∗ )H ))+(( N 2
n=1 ((X)n,n |hsrn 1 gs1 | +|hdrn 1 gd1 | +σrn 1
+(Y)n,n ))−P ∗ )
subject to
(X)n,n |hsrn 1 gs1 |2 +|hdrn 1 gd1 |2 +σr2n 1 +(Y)n,n ≤ Prn 2 , n = 1, 2, . . . , N
trace (Q) + (−2log (δe ))0.5 a−blog (δe ) −σe2 2
γe ≤ 0
QF ≤ a
bI−Q ≥ 0
b≥0
2
2
trace aH aX ≥ γd trace HH rd2 Hrd2 σr1 X +trace hrd2 hrd2 Y +σd2
H
X≥0
Y≥0
(11.84)
where λ is the design parameter; then optimal solution X∗ and Y∗ will be
updated; if X∗ is the rank-1 matrix, then Algorithm 1 goes to step 3; otherwise
Algorithm 1 goes to step 2;
3. Get optimal solutions to gr2 and U by eigen-decompositions to X∗ and Y∗ ;
Algorithm 1 is finished.
In the optimization problem (11.84), the minimization of trace(X) −
trace(Xx∗ (x∗ )H ) tries to force X∗ to be a rank-one matrix
and the minimization
N
of ( n=1 ((X)n,n |hsrn 1 gs1 | + |hdrn 1 gd1 | + σrn 1 + (Y)n,n )) − P ∗ tries to
2 2 2
4
4% Violation
8% Violation
3.5 12% Violation
Total Power by All Relay Nodes
3
2.5
1.5
0.5
0
2 4 6 8 10 12 14 16 18 20
Bob Targeted SINR
0.082
60
50
40
30
20
10
0
0 0.5 1 1.5 2 2.5
Received SINR for Eve
40
35
30
25
20
15
10
0
0 0.5 1 1.5 2 2.5 3
Received SINR for Eve
1
H = √ Hn ∈ Rn×n . (12.1)
n
R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 559
DOI 10.1007/978-1-4614-4544-9 12,
© Springer Science+Business Media New York 2014
560 12 Database Friendly Data Processing
2
Then, with probability at least 1 − δ C /24
− 7δ,
/ /
/A − YYH A/ (1 + 50ε) · A − Ak
F F
and
) *
√ log (n/δ) ε
A−YY H A 6+ ε 15 + ·A−A k + ·A−Ak F .
2 2
C 2 log (k/δ) 8C 2 log (k/δ)
where ri rTi is a diagonal matrix with only one non-zero entry; the t-th diagonal entry
is equal to 1/pt with probability
pt . Thus, by construction, for any set of non-zero
sampling probabilities, E ri rTi = IN ×N . Since we are averaging k independent
copies, it is reasonable to expect a concentration around the mean, with respect to
k, and in some sense, QT Q essentially behaves like the identity.
Theorem 12.2.1 (Symmetric Orthonormal Subspace Sampling [586]). Let
T
U = u1 · · · uN ∈ RN ×n
uTt D2 ut
pt β .
Tr (D2 )
Then, if k 4ρ (D) /βε2 ln 2n
δ , with probability at least 1 − δ,
/ 2 /
/D − DUT QT QUD/ ε D 2 .
2
N
2
E (x) = Ax − y 2 = aTt x − yt .
t=1
This problem was formulated to use a non-commutative Bernstein bound. For
details, see [587].
562 12 Database Friendly Data Processing
The matrix product ABT for large dimensions is a challenging problem. Here
we reformulate this standard linear algebra operation in terms of sums of random
matrices. It can be viewed as non-uniform sampling of the columns of A and B.
We take material from Hsu, Kakade, and Zhang [113], converting to our notation.
We make some comments and connections with other parts of the book at the
appropriate points.
Let A = [a1 |· · · |an ] , and B = [b1 |· · · |bn ] be fixed matrices, each with n
columns. Assume that aj = 0 and bj = 0 for all j = 1, . . . , n. If n is very
large, which is common in the age of Big Data, then the standard, straightforward
computation of the product ABT can be too computation-expensive. An alternative
is to take a small (non-uniform) random samples of the columns of A and B,
say aj1 , bj1 , . . . , ajN , bjN , or aji , bji for i = 1, 2, . . . , N . Then we compute a
weighted sum of outer products1
N
1 1
ABT ∼
= aj b T (12.3)
N i=1
pj i i j i
where pji > 0 is the a priori probability of choosing the column index ji ∈
{1, . . . , n} from a collection of N columns. The “average” and randomness do
most of the work, as observed by Donoho [132]: The regularity of having many
“identical” dimensions over which one can “average” is a fundamental tool. The
scheme of (12.3) was originally proposed and analyzed by Drinceas, Kannan, and
Mahoney [313].
Let X1 , . . . , XN be i.i.d. random matrices with the discrete distribution given by
- .
1 0 aj bTj
P Xi = = pj
pj bTj aj 0
Let
N
1 0 ABT
M̂ = Xi and M= .
N i=1
BT A 0
/ /
/ /
The spectral norm error /M̂ − M/ is used to describe the approximation of ABT
2
1
using the average of N outer products pij aij bTij , where the indices are such that
ji = j ⇔ Xi = aj bTj /pj
for all i = 1, . . . , N . Again, the “average” plays a fundamental role. Our goal is to
use Theorem 2.16.4 to bound this error. To apply this theorem, we must first check
the conditions.
We have the following relations
⎡
n ⎤
0 aj b T
n
1 0 aj b T ⎢ j ⎥
=⎢ ⎥=M
j=1
E [Xi ] = pj j
⎣n ⎦
pj bT
j aj 0 bT
j=1 j aj 0
j=1
⎛ ⎞
2 n
1 aj b T
j b j aj
T 0
n
2
Tr E Xi =Tr ⎝ pj ⎠= aj 22 bj 22 =2Z 2
p2j 0 j aj aj b j
bT T pj
j=1 j=1
ABT BT A 0
Tr (E [Xi ])2 =Tr M2 = = 2Tr ABT BT A .
0 BT AABT
Let || · ||2 stand for the spectral norm. The following norm inequalities can be
obtained:
/- ./
1/ /
/
0 aj bTj / 1/ /
/aj bTj / = Z
Xi 2 max / T / = max
j=1,...,n pj / bj aj 0 / j=1,...,n pj 2
2
/ /
EXi 2 = M 2 /ABT /2 A 2 B 2
/ 2 /
/E Xi / A B Z.
2 2 2
Using Theorem 2.16.4 and a union bound, finally we arrive at the following: For
any ε ∈ (0, 1) , and δ ∈ (0, 1) , if
" √ √
8 5 1 + rA rB log 4 rA rB + log (1/δ)
N +2 ,
3 3 ε2
then with probability at least 1 − δ over the random choice of column indices
j1 , . . . , j N ,
/ /
/ /
/1 N 1 /
/ T/
aij bij − AB / ε A 2 B 2 ,
T
/N p
/ j=1 ij /
2
564 12 Database Friendly Data Processing
where
2 2 2 2
rA = A F / A 2 ∈ (1, rank (A)) , and rB = B F / B 2 ∈ (1, rank (B))
are the numerical (or stable) rank. Here || · ||F stands for the Frobenius (or Hilbert-
Schmidt) norm. For details, we refer to [113].
In the age of Big Data, we have a data deluge. Data is expressed as matrices.
One wonders what is the best way of efficiently generating “sketches” of matrices.
Formally, we define the problem as follows: Given a matrix A ∈ Rn×n , and an
error parameter ε, construct a sketch à ∈ Rn×n of A such that
/ /
/ /
/A − Ã/ ε
2
and the number of non-zero entries in à is minimized. Here || · ||2 is the spectral
norm of a matrix (the largest singular value), while || · ||F is the Frobenius form. See
Sect. 1.4.5.
Algorithm 12.4.1 (Matrix sparsification [124]).
1. Input: matrix A ∈ R^{n×n}, sampling parameter s.
2. For all 1 ≤ i, j ≤ n do
   - If A_ij² ≤ (log n / n) · (‖A‖_F² / s), then set Ã_ij = 0;
   - else if A_ij² ≥ ‖A‖_F² / s, then set Ã_ij = A_ij;
   - else set Ã_ij = A_ij / p_ij with probability p_ij = s A_ij² / ‖A‖_F² < 1, and Ã_ij = 0 with probability 1 − p_ij.
3. Output: matrix Ã ∈ R^{n×n}.
An algorithm is shown in Algorithm 12.4.1. When n ≥ 300 and

    s = C n (log n) log²(n / log² n) / ε² ,

the sparsification guarantee ‖A − Ã‖_2 ≤ ε stated above can be achieved; see [124] for the precise statement and constants.
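The following is a minimal MATLAB sketch of Algorithm 12.4.1, with the thresholds exactly as reconstructed above; the test matrix and the value of s are illustrative assumptions only.

    % Element-wise sparsification following Algorithm 12.4.1.
    n = 400; s = 4000;
    A = randn(n) / sqrt(n);                      % any test matrix

    fro2   = norm(A, 'fro')^2;
    small  = A.^2 <= (log(n)/n) * (fro2/s);      % step 2, first branch: drop tiny entries
    large  = A.^2 >= fro2/s;                     % step 2, second branch: keep large entries
    middle = ~small & ~large;                    % step 2, third branch: sample the rest

    Pij  = s * A.^2 / fro2;                      % p_ij = s*A_ij^2/||A||_F^2 (< 1 on "middle")
    keep = middle & (rand(n) < Pij);             % Bernoulli(p_ij) coin flips

    Atil            = zeros(n);
    Atil(large)     = A(large);
    Atil(keep)      = A(keep) ./ Pij(keep);      % rescale kept entries by 1/p_ij
    fprintf('kept %.1f%% of entries, ||A - Atil||_2 = %.3f\n', ...
            100*nnz(Atil)/n^2, norm(A - Atil));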
A related direction is the modeling of the data by higher-order arrays, or tensors (e.g., arrays with more
than two modes). A natural example is time-evolving data, where the third mode of
the tensor represents time [588]. Concentration of measure has been used in [10] to
design algorithms.
For any d-mode or order-d tensor A ∈ R^{n×n×···×n} (d times), its Frobenius norm ‖A‖_F is
defined as the square root of the sum of the squares of its elements. Now we define
tensor-vector products: let x, y be vectors in R^n. Then
    A ×_1 x = Σ_{i=1}^{n} A_{ijk···l} x_i ,
    A ×_2 x = Σ_{j=1}^{n} A_{ijk···l} x_j ,
    A ×_3 x = Σ_{k=1}^{n} A_{ijk···l} x_k ,   etc.
The outcome of the above operations is an order-(d−1) tensor. The above definition
may be extended to handle multiple tensor-vector products, e.g.,

    A ×_1 x ×_2 y = Σ_{i=1}^{n} Σ_{j=1}^{n} A_{ijk···l} x_i y_j .

The spectral norm of A is defined as

    ‖A‖_2 = sup_{x_1, ..., x_d ∈ R^n} |A ×_1 x_1 · · · ×_d x_d| ,

where all the vectors x_1, . . . , x_d ∈ R^n are unit vectors, i.e., ‖x_i‖_2 = 1 for all
i ∈ [d]. The notation [d] stands for the set {1, 2, . . . , d}.
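A minimal MATLAB illustration of these mode products for an order-3 tensor follows; the size n, the random tensor, and the Monte Carlo probe of the spectral norm are illustrative only (the code uses MATLAB's implicit expansion).

    % Mode products A x_1 x, A x_2 x, and A x_1 x x_2 y for an order-3 tensor.
    n = 20;
    A = randn(n, n, n);                                   % order-3 tensor A_{ijk}
    x = randn(n, 1);  x = x / norm(x);                    % unit vectors
    y = randn(n, 1);  y = y / norm(y);

    A1x   = squeeze(sum(A .* reshape(x, [n 1 1]), 1));    % (A x_1 x)_{jk} = sum_i A_{ijk} x_i
    A2x   = squeeze(sum(A .* reshape(x, [1 n 1]), 2));    % (A x_2 x)_{ik} = sum_j A_{ijk} x_j
    A1x2y = squeeze(sum(sum(A .* (x * y'), 1), 2));       % sum_{i,j} A_{ijk} x_i y_j (vector in k)

    % Crude Monte Carlo lower bound on ||A||_2 = sup |A x_1 u x_2 v x_3 w|.
    best = 0;
    for t = 1:2000
        u = randn(n,1); u = u/norm(u);
        v = randn(n,1); v = v/norm(v);
        w = randn(n,1); w = w/norm(w);
        val  = abs(sum(sum(sum(A .* (u .* reshape(v,[1 n 1]) .* reshape(w,[1 1 n]))))));
        best = max(best, val);
    end
    fprintf('A x_1 x is %dx%d; crude lower bound on ||A||_2: %.3f\n', ...
            size(A1x,1), size(A1x,2), best);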
Given an order-d tensor A ∈ R^{n×n×···×n} (d times) and an error parameter ε > 0,
construct an order-d tensor sketch Ã ∈ R^{n×n×···×n} such that

    ‖A − Ã‖_2 ≤ ε ‖Ã‖_2 .

Here the stable rank of A is defined as

    st(A) = ‖A‖_F² / ‖A‖_2² .
Theorem 3.3.10 has been applied by Nguyen et al. [10] to obtain the above result.
See Mahoney [589] for randomized algorithms for matrices and data.
Chapter 13
From Network to Big Data
The main goal of this chapter is to put together all the pieces treated in previous
chapters. We treat the subject from a systems engineering point of view. This chapter
motivates the whole book. We only have space to view the problems from ten
thousand feet.
Figure 13.1 illustrates the vision of big data that forms the foundation for
understanding cognitive networked sensing, the cognitive radio network, cognitive radar,
and even the smart grid. We further develop this vision in the book on smart
grid [6]. High-dimensional statistics is the driver behind these subjects. Random
matrices are natural building blocks for modeling big data, and the concentration
of measure phenomenon, which is unique to high-dimensional spaces, is of
fundamental significance when a large number of random matrices must be modeled.
Large data sets are conveniently expressed as a matrix
        [ X11  X12  ···  X1n ]
        [ X21  X22  ···  X2n ]
    X = [  ⋮    ⋮    ⋱    ⋮  ]  ∈ C^{m×n} ,
        [ Xm1  Xm2  ···  Xmn ]
where the X_ij are random variables, e.g., sub-Gaussian random variables. Here m and n
are finite and large; for example, m = 100, n = 100. The spectrum of a random
matrix X tends to stabilize as the dimensions of X grow to infinity. In the last
few years, attention has turned to the local and non-asymptotic regimes, in which the
dimensions of X are fixed rather than allowed to grow to infinity. The concentration
of measure phenomenon naturally occurs in this setting. The
eigenvalues λ_i(X^T X), i = 1, . . . , n, are natural mathematical objects to study.
Fig. 13.1 (schematic) High-dimensional statistics at the center, connecting Big Data, Smart Grid, cognitive networked sensing, cognitive radar, and the cognitive radio network
For a random matrix X ∈ R^{n×n}, the following functions are Lipschitz functions:

    (1) λmax(X);   (2) λmin(X);   (3) Tr(X);   (4) Σ_{i=1}^{k} λ_i(X);   (5) Σ_{i=1}^{k} λ_{n−i+1}(X).
Let

    x = [X_1, X_2, . . . , X_n]^T ,   y = [Y_1, Y_2, . . . , Y_n]^T ,   s = [S_1, S_2, . . . , S_n]^T ,   x, y, s ∈ C^n .
Consider the detection problem

    H_0 : y = x
    H_1 : y = s + x.

In terms of covariance matrices, this becomes

    H_0 : R_y = R_x
    H_1 : R_y = R_s + R_x = low-rank matrix + sparse matrix,

where R_s is often of low rank. When x is a white Gaussian random vector, we
have
                          [ 1  0  ···  0 ]
                          [ 0  1  ···  ⋮ ]
    R_x = σ² I_{n×n} = σ² [ ⋮  ⋮   ⋱   0 ] ,
                          [ 0  ···  0  1 ]
which is sparse in the sense that the non-zero entries lie only along the diagonal. Using
matrix decomposition [590], we are able to separate the two matrices even when
σ² is very small, say at a signal-to-noise ratio of SNR = 10^{-6}. Anti-jamming is one
motivating example.
Unfortunately, when the sample size N of the random vector x ∈ R^n is
finite, the sparse matrix assumption on R_x is not satisfied. Let us study a simple
example. Consider i.i.d. Gaussian random variables X_i ∼ N(0, 1), i = 1, . . . , n.
In MATLAB code, we have the random data vector x = randn(n, 1), so the true
covariance matrix is R_x = σ² I_{n×n}. For N independent copies of x, we define the
sample covariance matrix

    R̂_x = (1/N) Σ_{i=1}^{N} x_i x_i^T = (1/N) X X^T ∈ R^{n×n} ,

where the data matrix is X = [x_1 x_2 · · · x_N] ∈ R^{n×N}.

Fig. 13.2 Random sample covariance matrices of size n × n with n = 100: (a) N = 10; (b) N = 100

The first two columns of R̂_x are shown in Fig. 13.2 for (a) N = 10 and (b) N = 100. The variance
in case (b) is much smaller than that in case (a). The convergence is measured by
‖R̂_x − R_x‖ = ‖R̂_x − σ² I_{n×n}‖. Consider the regime
    as n → ∞,  N → ∞,  N/n → c ∈ (0, 1).     (13.3)
Under this regime, the fundamental question is: does

    R̂_x → σ² I_{n×n} ?
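A small MATLAB experiment probes this question numerically; the dimension, sample sizes, and number of trials are chosen only for illustration.

    % Spectral-norm error ||Rhat_x - sigma^2 I|| for fixed n as N grows.
    n = 100; sigma2 = 1; trials = 20;
    for N = [10 100 1000 10000]
        err = 0;
        for t = 1:trials
            X    = sqrt(sigma2) * randn(n, N);       % N independent copies of x
            Rhat = (X * X') / N;                     % sample covariance matrix
            err  = err + norm(Rhat - sigma2*eye(n)) / trials;
        end
        fprintf('N = %6d:  average ||Rhat - sigma^2 I||_2 = %.3f\n', N, err);
    end

For N comparable to n the error is of order one, which is exactly the finite-sample effect discussed above.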
In terms of sample covariance matrices, the detection problem becomes

    H_0 : R̂_y = R̂_x
    H_1 : R̂_y = R̂_s + R̂_x ,

where R̂_x is a random matrix that is not sparse. We are led to evaluate the
following simple, intuitive test:

    Tr R̂_y ≥ γ_1 + Tr R̂_x ,   claim H_1 ;
    Tr R̂_y ≤ γ_0 + Tr R̂_x ,   claim H_0 .
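The following toy MATLAB simulation illustrates the trace test. The rank-one signal, the SNR value, and the threshold choice are illustrative assumptions, and Tr R̂_x is replaced by its nominal value nσ² for a known white-noise covariance.

    % Toy simulation of the trace detector.
    n = 100; N = 200; sigma2 = 1; trials = 500;
    u = randn(n, 1); u = u / norm(u);                  % fixed (unknown) signal direction
    snr = 5;                                           % per-sample signal power / sigma^2

    T0 = zeros(trials, 1); T1 = zeros(trials, 1);
    for t = 1:trials
        X = sqrt(sigma2) * randn(n, N);                % noise-only samples (H0)
        S = sqrt(snr * sigma2) * u * randn(1, N);      % rank-one signal component
        T0(t) = trace(X * X') / N;                     % Tr(Rhat_y) under H0
        Y = X + S;
        T1(t) = trace(Y * Y') / N;                     % Tr(Rhat_y) under H1
    end

    sT0    = sort(T0);
    gamma1 = sT0(ceil(0.99 * trials)) - n * sigma2;    % threshold for ~1% false alarm
    fprintf('P(claim H1 | H1) = %.2f at false-alarm rate ~ 0.01\n', ...
            mean(T1 > n * sigma2 + gamma1));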
Let us use Fig. 13.3 to represent a cognitive radio network of N nodes. The network
works in a TDMA manner. When one node, say i = 1, transmits, the rest of the
nodes i = 2, . . . , 80 record the data. We can randomly (or deterministically) choose
the next node to transmit.
    R̂_x = (1/N) Σ_{i=1}^{N} x_i x_i^T = (1/N) X X^T .
Fig. 13.3 (schematic) Each node samples at a rate of 20 Msps to form a data vector x = [X1, X2, . . . , XM]^T with M = 100; across the network of N = 80 nodes the data matrix X = [x1 x2 · · · xN] is collected within about 2.5 ms, and real-time in-network (distributed) processing extracts statistics such as the eigenvalues λi(X), i = 1, . . . , min{M, N}, connecting the non-asymptotic theory of random matrices, random graphs, and Big Data
As mentioned above, the network works in a TDMA manner. When one node,
say i = 1, transmits, the rest of the nodes i = 2, . . . , 80 record the data. We can
randomly (or deterministically) choose the next node to transmit. An algorithm can
be designed to control the data collection for the network.
As shown in Fig. 13.4, both communication and sensing modes are enabled in the
Apollo testbed at TTU. The switch between the two modes is fast.
Large data sets are collected at each node i = 1, . . . , 80. After the collection,
some real-time in-network processing can be done to extract the statistical parameters,
which are then stored locally at each node. Of course, the raw data can also be stored
in the local database without any in-network processing.
It is very important to remark that the data sets are large and difficult to move
around. To disseminate information about these large data sets, we need to rely on
statistical tools to reduce the dimensionality, which is often the first thing to
do with the data. Principal component analysis (PCA) is one such tool; PCA takes
the sample covariance matrix as the input to the algorithm.
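A minimal MATLAB sketch of this PCA step follows; the synthetic data, the low-dimensional subspace, and the value of k are hypothetical choices used only for illustration.

    % PCA-based dimensionality reduction from the sample covariance matrix.
    n = 100; N = 500; k = 5;
    U0 = orth(randn(n, k));                       % hypothetical signal subspace
    X  = 3 * U0 * randn(k, N) + randn(n, N);      % data = subspace signal + white noise

    Rhat     = (X * X') / N;                      % sample covariance matrix (the PCA input)
    [V, D]   = eig((Rhat + Rhat') / 2);           % eigendecomposition (symmetrized)
    [d, ord] = sort(diag(D), 'descend');
    Vk = V(:, ord(1:k));                          % top-k principal directions
    Z  = Vk' * X;                                 % k-by-N reduced representation to disseminate
    fprintf('energy captured by the top %d principal components: %.1f%%\n', ...
            k, 100 * sum(d(1:k)) / sum(d));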
Management of these large data sets is very challenging; for example, how do we
index them?
At this stage, the data is assumed to be stored and managed properly, with effective
indexing. We can then mine the data for the information that is needed. The goals include:
(1) higher spectrum efficiency; (2) higher energy efficiency; (3) enhanced security.
Unmanned aerial vehicles (UAVs) enable mobility of the wireless network. It is
especially interesting to study the large data sets as a function of space and time.
The Global Positioning System (GPS) is used to obtain the three-dimensional position
together with time stamps.
The Smart Grid requires a two-way information flow [6]. The Smart Grid can be viewed
as a large network; as a result, Big Data aspects are essential. In another work,
we systematically explore this viewpoint, using the mathematical tools of the current
book as new departure points. Figure 13.1 illustrates this viewpoint.
The large size and dynamic nature of complex networks [594–598] enable a
deep connection between statistical graph theory and the possibility of describing
macroscopic phenomena in terms of the dynamic evolution of the basic elements
of the system [599]. In our new book [5], a cognitive radio network is modeled
as a complex network. (The use of a cognitive radio network in a smart grid is
considered in [592].) The book [600] also models the communication network as a
random network.
So little is understood about (large) networks. The rather simple question "What
is a robust network?" seems beyond the realm of present understanding [601]. Any
complex network can be represented by a graph, and any graph can be represented by an
adjacency matrix, from which other matrices such as the Laplacian are derived. One
of the most beautiful aspects of linear algebra is the notion that to each matrix a set
of eigenvalues can be associated, with corresponding eigenvectors. As shown below,
the most profound observation is that these eigenvalues, for general random matrices,
are strongly concentrated. As a result, eigenvalues, which are scalar-valued
random variables, are natural metrics for describing the complex network (random
graph). A close analogy is that the spectral domain of the Fourier transform is
the natural domain in which to study a random signal. A graph consists of a set of nodes
connected by a set of links. Some properties of a graph in the topology domain are
connected with the eigenvalues of random matrices.
For undirected graphs the adjacency matrix is symmetric,¹ x_ij = x_ji, and therefore
contains redundant information. If the graph consists of N nodes and L links, then
the N × N adjacency matrix can be written as

    X = U Λ U^T .     (13.5)

As a result, the study of a cognitive radio network boils down to the study of the random
matrix X.
When the elements x_ij are random variables, the adjacency matrix X is a
random matrix, so random matrix theory can be used as a powerful
mathematical tool. The connection is very deep; our book [5] has dedicated more
than 230 pages to this topic. The most profound observation is that the spectrum of
the random matrix is highly concentrated, and the strikingly powerful Talagrand
inequality lies at the heart of this observation. As a result, the statistical behavior
of the graph spectra of complex networks is of interest [601].
A useful approach is to investigate the eigenvalues (spectrum) of the random
adjacency matrix and of the (normalized) Laplacian matrix. There is a connection between
the spectrum and the topology of a graph. The duality between the topology and
spectral domains is not new and has been studied in the field of mathematics called
algebraic graph theory [602, 603]. What is new is the connection with complex
networks [601].
A promising model is based on the concentration of the adjacency matrix
and of the Laplacian in random graphs with independent edges, along a line of
research in [604–608]; see [604], for example, for the model. We consider a random
graph G such that each edge is determined by an independent random variable,
where the probability of each edge is not assumed to be equal, i.e., P(v_i ∼ v_j) =
p_ij. Each edge of G is independent of every other edge.
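A small MATLAB experiment with this independent-edge model follows. The particular choice p_ij = p_i p_j below is an illustrative assumption, not a model prescribed by the text; it simply produces non-uniform edge probabilities.

    % Random graph with independent edges and non-uniform probabilities p_ij.
    n = 300;
    p = 0.05 + 0.25 * rand(n, 1);                 % per-node "activity" levels
    P = min(p * p', 1);                           % edge probabilities p_ij

    U = triu(rand(n) < P, 1);                     % one realization: upper-triangular coin flips
    X = double(U) + double(U)';                   % symmetric 0/1 adjacency matrix

    lam  = eig(X);                                % spectrum of the realization
    lamE = eig(P - diag(diag(P)));                % spectrum of the expected adjacency matrix
    fprintf('lambda_max(X) = %.2f vs lambda_max(E[X]) = %.2f (spectral concentration)\n', ...
            max(lam), max(lamE));

Repeating the experiment shows that the extreme eigenvalues of X vary little from realization to realization and stay close to those of the expected adjacency matrix, which is the concentration behavior discussed here.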
¹ Unless mentioned otherwise, we assume in this section that the graph is undirected and that X is symmetric.
For random graphs with such general distributions, several bounds on the
spectrum of the corresponding adjacency matrix and (normalized) Laplacian matrix
can be derived. The eigenvalues of the adjacency matrix have many applications in
graph theory, such as describing certain topological features of a graph (e.g.,
connectivity) and enumerating the occurrences of subgraphs [602, 609].
The data collection is viewed as a statistical inverse problem. Our results have
broader applicability in data collection, e.g., problems in social networking, game
theory, network security, and logistics [610].
Bibliography
20. N. J. Higham, Functions of Matrices: Theory and Computation. Society for Industrial and
Applied Mathematics, 2008. [20, 33, 86, 211, 212, 213, 264, 496]
21. L. Trefethen and M. Embree, Spectra and Pseudospectra: The Behavior of Nonnormal
Matrices and Operators. Princeton University Press, 2005. [20]
22. R. Bhatia, Positive Definite Matrices. Princeton University Press, 2007. [20, 38, 44, 98]
23. R. Bhatia, Matrix analysis. Springer, 1997. [33, 39, 41, 45, 92, 98, 153, 199, 210, 214, 215,
264, 488, 500]
24. A. W. Marshall, I. Olkin, and B. C. Arnold, Inequalities: Theory of Majorization and Its
Applications. Springer Verl, 2011. [20]
25. D. Watkins, Fundamentals of Matrix Computations. Wiley, third ed., 2010. [21]
26. J. A. Tropp, “On the conditioning of random subdictionaries,” Applied and Computational
Harmonic Analysis, vol. 25, no. 1, pp. 1–24, 2008. [27]
27. M. Ledoux and M. Talagrand, Probability in Banach spaces. Springer, 1991. [27, 28, 69, 71,
75, 76, 77, 163, 165, 166, 168, 182, 184, 185, 192, 198, 199, 307, 309, 389, 391, 394, 396,
404, 535]
28. J. A. Tropp, J. N. Laska, M. F. Duarte, J. K. Romberg, and R. G. Baraniuk, “Beyond nyquist:
Efficient sampling of sparse bandlimited signals,” Information Theory, IEEE Transactions on,
vol. 56, no. 1, pp. 520–544, 2010. [27, 71, 384]
29. J. Nelson, “Johnson-lindenstrauss notes,” tech. rep., Technical report, MIT-CSAIL, Available
at web.mit.edu/minilek/www/jl notes.pdf, 2010. [27, 70, 379, 380, 381, 384]
30. H. Rauhut, “Compressive sensing and structured random matrices,” Theoretical foundations
and numerical methods for sparse recovery, vol. 9, pp. 1–92, 2010. [28, 164, 379, 394]
31. F. Krahmer, S. Mendelson, and H. Rauhut, “Suprema of chaos processes and the restricted
isometry property,” arXiv preprint arXiv:1207.0235, 2012. [29, 65, 398, 399, 400, 401, 407,
408]
32. R. Latala, “On weak tail domination of random vectors,” Bull. Pol. Acad. Sci. Math., vol. 57,
pp. 75–80, 2009. [29]
33. W. Bednorz and R. Latala, “On the suprema of bernoulli processes,” Comptes Rendus
Mathematique, 2013. [29, 75, 76]
34. B. Gnedenko and A. Kolmogorov, Limit Distributions for Sums of Independent Random
Variables. Addison-Wesley, 1954. [30]
35. R. Vershynin, “A note on sums of independent random matrices after ahlswede-winter.” http://
www-personal.umich.edu/∼romanv/teaching/reading-group/ahlswede-winter.pdf. Seminar
Notes. [31, 91]
36. R. Ahlswede and A. Winter, “Strong converse for identification via quantum channels,”
Information Theory, IEEE Transactions on, vol. 48, no. 3, pp. 569–579, 2002. [31, 85, 87,
94, 95, 96, 97, 106, 107, 108, 122, 123, 132]
37. T. Fine, Probability and Probabilistic Reasoning for Electrical Engineering. Pearson-Prentice
Hall, 2006. [32]
38. R. Oliveira, “Sums of random hermitian matrices and an inequality by rudelson,” Elect.
Comm. Probab, vol. 15, pp. 203–212, 2010. [34, 94, 95, 108, 473]
39. T. Rockafellar, Conjugate duality and optimization. Philadelphia: SIAM, 1974. [36]
40. D. Petz, “A survey of trace inequalities.” Functional Analysis and Operator Theory, 287–298,
Banach Center Publications, 30 (Warszawa), 1994. www.renyi.hu/∼petz/pdf/64.pdf. [38, 39]
41. R. Vershynin, “Golden-thompson inequality.” www-personal.umich.edu/∼romanv/teaching/
reading-group/golden-thompson.pdf. Seminar Notes. [39]
42. I. Dhillon and J. Tropp, “Matrix nearness problems with bregman divergences,” SIAM Journal
on Matrix Analysis and Applications, vol. 29, no. 4, pp. 1120–1146, 2007. [40]
43. J. Tropp, “From joint convexity of quantum relative entropy to a concavity theorem of lieb,”
in Proc. Amer. Math. Soc, vol. 140, pp. 1757–1760, 2012. [41, 128]
44. G. Lindblad, “Expectations and entropy inequalities for finite quantum systems,” Communi-
cations in Mathematical Physics, vol. 39, no. 2, pp. 111–119, 1974. [41]
45. E. G. Effros, “A matrix convexity approach to some celebrated quantum inequalities,”
vol. 106, pp. 1006–1008, National Acad Sciences, 2009. [41]
46. E. Carlen and E. Lieb, “A minkowski type trace inequality and strong subadditivity of
quantum entropy ii: convexity and concavity,” Letters in Mathematical Physics, vol. 83, no. 2,
pp. 107–126, 2008. [41, 42]
47. T. Rockafellar, Conjugate duality and optimization. SIAM, 1974. Regional conference series
in applied mathematics. [41]
48. S. Boyd and L. Vandenberghe, Convex optimization. Cambridge Univ Pr, 2004. [42, 48, 210,
217, 418, 489]
49. J. Tropp, “Freedman’s inequality for matrix martingales,” Electron. Commun. Probab, vol. 16,
pp. 262–270, 2011. [42, 128, 137]
50. E. Lieb, “Convex trace functions and the wigner-yanase-dyson conjecture,” Advances in
Mathematics, vol. 11, no. 3, pp. 267–288, 1973. [42, 98, 129]
51. V. I. Paulsen, Completely Bounded Maps and Operator Algebras. Cambridge Press, 2002.
[43]
52. T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley. [44]
53. J. Tropp, “User-friendly tail bounds for sums of random matrices,” Foundations of Computa-
tional Mathematics, vol. 12, no. 4, pp. 389–434, 2011. [45, 47, 95, 107, 110, 111, 112, 115,
116, 121, 122, 127, 128, 131, 132, 140, 144, 493]
54. F. Hansen and G. Pedersen, “Jensen’s operator inequality,” Bulletin of the London Mathemat-
ical Society, vol. 35, no. 4, pp. 553–564, 2003. [45]
55. P. Halmos, Finite-Dimensional Vector Spaces. Springer, 1958. [46]
56. V. De la Peña and E. Giné, Decoupling: from dependence to independence. Springer Verlag,
1999. [47, 404]
57. R. Vershynin, “A simple decoupling inequality in probability theory,” May 2011. [47, 51]
58. P. Billingsley, Probability and Measure. Wiley, 2008. [51, 523]
59. R. Dudley, Real analysis and probability, vol. 74. Cambridge University Press, 2002. [51,
230, 232]
60. M. A. Arcones and E. Giné, “On decoupling, series expansions, and tail behavior of chaos
processes,” Journal of Theoretical Probability, vol. 6, no. 1, pp. 101–122, 1993. [52]
61. D. L. Hanson and F. T. Wright, “A bound on tail probabilities for quadratic forms in
independent random variables,” The Annals of Mathematical Statistics, pp. 1079–1083, 1971.
[53, 73, 380]
62. S. Boucheron, G. Lugosi, and P. Massart, “Concentration inequalities using the entropy
method,” The Annals of Probability, vol. 31, no. 3, pp. 1583–1614, 2003. [53, 198, 397,
404]
63. T. Tao, Topics in Random Matrix Theory. Amer Mathematical Society, 2012. [54, 55, 216,
238, 239, 241, 242, 355, 357, 358]
64. E. Wigner, “Distribution laws for the roots of a random hermitian matrix,” Statistical Theories
of Spectra: Fluctuations, pp. 446–461, 1965. [55]
65. M. Mehta, Random matrices, vol. 142. Academic press, 2004. [55]
66. D. Voiculescu, “Limit laws for random matrices and free products,” Inventiones mathemati-
cae, vol. 104, no. 1, pp. 201–220, 1991. [55, 236, 363]
67. J. Wishart, “The generalised product moment distribution in samples from a normal multi-
variate population,” Biometrika, vol. 20, no. 1/2, pp. 32–52, 1928. [56]
68. P. Hsu, “On the distribution of roots of certain determinantal equations,” Annals of Human
Genetics, vol. 9, no. 3, pp. 250–258, 1939. [56, 137]
69. U. Haagerup and S. Thorbjørnsen, “Random matrices with complex gaussian entries,”
Expositiones Mathematicae, vol. 21, no. 4, pp. 293–337, 2003. [57, 58, 59, 202, 203, 214]
70. A. Erdelyi, W. Magnus, Oberhettinger, and F. Tricomi, eds., Higher Transcendental Func-
tions, Vol. 1–3. McGraw-Hill, 1953. [57]
71. J. Harer and D. Zagier, “The euler characteristic of the moduli space of curves,” Inventiones
Mathematicae, vol. 85, no. 3, pp. 457–485, 1986. [58]
72. R. Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” Arxiv
preprint arXiv:1011.3027v5, July 2011. [60, 66, 67, 68, 313, 314, 315, 316, 317, 318, 319,
320, 321, 324, 325, 349, 462, 504]
73. D. Garling, Inequalities: a journey into linear analysis. Cambridge University Press, 2007.
[61]
74. V. Buldygin and S. Solntsev, Asymptotic behaviour of linearly transformed sums of random
variables. Kluwer, 1997. [60, 61, 63, 76, 77]
75. J. Kahane, Some random series of functions. Cambridge Univ Press, 2nd ed., 1985. [60]
76. M. Rudelson, “Lecture notes on non-asymptotic theory of random matrices,” arXiv preprint
arXiv:1301.2382, 2013. [63, 64, 68, 336]
77. V. Yurinsky, Sums and Gaussian vectors. Springer-Verlag, 1995. [69]
78. U. Haagerup, “The best constants in the khintchine inequality,” Studia Math., vol. 70, pp. 231–
283, 1981. [69]
79. R. Latala, P. Mankiewicz, K. Oleszkiewicz, and N. Tomczak-Jaegermann, “Banach-mazur
distances and projections on random subgaussian polytopes,” Discrete & Computational
Geometry, vol. 38, no. 1, pp. 29–50, 2007. [72, 73, 295, 296, 297, 298]
80. E. D. Gluskin and S. Kwapien, “Tail and moment estimates for sums of independent random
variable,” Studia Math., vol. 114, pp. 303–309, 1995. [75]
81. M. Talagrand, The generic chaining: upper and lower bounds of stochastic processes.
Springer Verlag, 2005. [75, 163, 165, 192, 394, 396, 404, 405, 406, 407]
82. M. Talagrand, Upper and Lower Bounds for Stochastic Processes, Modern Methods and
Classical Problems. Springer-Verlag, in press. Ergebnisse der Mathematik. [75, 199]
83. R. M. Dudley, “The sizes of compact subsets of hilbert space and continuity of gaussian
processes,” J. Funct. Anal, vol. 1, no. 3, pp. 290–330, 1967. [75, 406]
84. X. Fernique, “Régularité des trajectoires des fonctions aléatoires gaussiennes,” Ecole d’Eté
de Probabilités de Saint-Flour IV-1974, pp. 1–96, 1975. [75, 407]
85. M. Talagrand, “Regularity of gaussian processes,” Acta mathematica, vol. 159, no. 1, pp. 99–
149, 1987. [75, 407]
86. R. Bhattacharya and R. Rao, Normal approximation and asymptotic expansions, vol. 64.
Society for Industrial & Applied, 1986. [76, 77, 78]
87. L. Chen, L. Goldstein, and Q. Shao, Normal Approximation by Stein’s Method. Springer,
2010. [76, 221, 236, 347, 523]
88. A. Kirsch, An introduction to the mathematical theory of inverse problems, vol. 120. Springer
Science+ Business Media, 2011. [79, 80, 289]
89. D. Porter and D. S. Stirling, Integral equations: a practical treatment, from spectral theory to
applications, vol. 5. Cambridge University Press, 1990. [79, 80, 82, 288, 289, 314]
90. U. Grenander, Probabilities on Algebraic Structures. New York: Wiley, 1963. [85]
91. N. Harvey, “C&o 750: Randomized algorithms winter 2011 lecture 11 notes.” http://www.
math.uwaterloo.ca/∼harvey/W11/, Winter 2011. [87, 95, 473]
92. N. Harvey, “Lecture 12 concentration for sums of random matrices and lecture 13 the
ahlswede-winter inequality.” http://www.cs.ubc.ca/∼nickhar/W12/, February 2012. Lecture
Notes for UBC CPSC 536N: Randomized Algorithms. [87, 89, 90, 95]
93. M. Rudelson, “Random vectors in the isotropic position,” Journal of Functional Analysis,
vol. 164, no. 1, pp. 60–72, 1999. [90, 95, 277, 278, 279, 280, 281, 282, 283, 284, 292, 306,
307, 308, 441, 442]
94. A. Wigderson and D. Xiao, “Derandomizing the ahlswede-winter matrix-valued chernoff
bound using pessimistic estimators, and applications,” Theory of Computing, vol. 4, no. 1,
pp. 53–76, 2008. [91, 95, 107]
95. D. Gross, Y. Liu, S. Flammia, S. Becker, and J. Eisert, “Quantum state tomography via
compressed sensing,” Physical review letters, vol. 105, no. 15, p. 150401, 2010. [93, 107]
96. D. Dubhashi and A. Panconesi, Concentration of Measure for the Analysis of Randomized
Algorithms. Cambridge Univ Press, 2009. [97]
97. H. Ngo, “Cse 694: Probabilistic analysis and randomized algorithms.” http://www.cse.buffalo.
edu/∼hungngo/classes/2011/Spring-694/lectures/l4.pdf, Spring 2011. SUNY at Buffalo. [97]
98. O. Bratteli and D. W. Robinson, Operator Algebras amd Quantum Statistical Mechanics I.
Springer-Verlag, 1979. [97]
99. D. Voiculescu, K. Dykema, and A. Nica, Free Random Variables. American Mathematical
Society, 1992. [97]
100. P. J. Schreier and L. L. Scharf, Statistical Signal Processing of Complex-Valued Data: The
Theory of Improper and Noncircular Signals. Cambridge University Press, 2010. [99]
101. J. Lawson and Y. Lim, “The geometric mean, matrices, metrics, and more,” The American
Mathematical Monthly, vol. 108, no. 9, pp. 797–812, 2001. [99]
102. D. Gross, “Recovering low-rank matrices from few coefficients in any basis,” Information
Theory, IEEE Transactions on, vol. 57, no. 3, pp. 1548–1566, 2011. [106, 433, 443]
103. B. Recht, “A simpler approach to matrix completion,” Arxiv preprint arxiv:0910.0651, 2009.
[106, 107, 429]
104. B. Recht, “A simpler approach to matrix completion,” The Journal of Machine Learning
Research, vol. 12, pp. 3413–3430, 2011. [106, 431, 432, 433]
105. R. Ahlswede and A. Winter, “Addendum to strong converse for identification via quantum
channels,” Information Theory, IEEE Transactions on, vol. 49, no. 1, p. 346, 2003. [107]
106. R. Latala, “Some estimates of norms of random matrices,” AMERICAN MATHEMATICAL
SOCIETY, vol. 133, no. 5, pp. 1273–1282, 2005. [118]
107. Y. Seginer, “The expected norm of random matrices,” Combinatorics Probability and
Computing, vol. 9, no. 2, pp. 149–166, 2000. [118, 223, 330]
108. P. Massart, Concentration Inequalities and Model Selection. Springer, 2007. [120, 457]
109. M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes.
Springer, 1991. [120]
110. R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge Univ Press, 1995. [122]
111. A. Gittens and J. Tropp, “Tail bounds for all eigenvalues of a sum of random matrices,” Arxiv
preprint arXiv:1104.4513, 2011. [128, 129, 131, 144]
112. E. Lieb and R. Seiringer, “Stronger subadditivity of entropy,” Physical Review A, vol. 71,
no. 6, p. 062329, 2005. [128, 129]
113. D. Hsu, S. Kakade, and T. Zhang, “Tail inequalities for sums of random matrices that depend
on the intrinsic dimension,” 2011. [137, 138, 139, 267, 462, 463, 478, 562, 564]
114. B. Schölkopf, A. Smola, and K. Müller, Kernel principal component analysis, ch. Kernel
principal component analysis, pp. 327–352. MIT Press, 1999. [137]
115. A. Magen and A. Zouzias, “Low rank matrix-valued chernoff bounds and approximate matrix
multiplication,” in Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on
Discrete Algorithms, pp. 1422–1436, SIAM, 2011. [137]
116. S. Minsker, “On some extensions of bernstein’s inequality for self-adjoint operators,” Arxiv
preprint arXiv:1112.5448, 2011. [139, 140]
117. G. Peshkir and A. Shiryaev, “The khintchine inequalities and martingale expanding sphere of
their action,” Russian Mathematical Surveys, vol. 50, no. 5, pp. 849–904, 1995. [141]
118. N. Tomczak-Jaegermann, “The moduli of smoothness and convexity and the rademacher
averages of trace classes S_p (1 ≤ p < ∞),” Studia Math, vol. 50, pp. 163–182, 1974. [141, 300]
119. F. Lust-Piquard, “Inégalités de Khintchine dans C_p (1 < p < ∞),” CR Acad. Sci. Paris, vol. 303,
pp. 289–292, 1986. [142]
120. G. Pisier, “Non-commutative vector valued lp-spaces and completely p-summing maps,”
Astérisque, vol. 247, p. 131, 1998. [142]
121. A. Buchholz, “Operator khintchine inequality in non-commutative probability,” Mathematis-
che Annalen, vol. 319, no. 1, pp. 1–16, 2001. [142, 534]
122. A. So, “Moment inequalities for sums of random matrices and their applications in optimiza-
tion,” Mathematical programming, vol. 130, no. 1, pp. 125–151, 2011. [142, 530, 534, 537,
542, 548]
123. N. Nguyen, T. Do, and T. Tran, “A fast and efficient algorithm for low-rank approximation
of a matrix,” in Proceedings of the 41st annual ACM symposium on Theory of computing,
pp. 215–224, ACM, 2009. [143, 156, 159]
124. N. Nguyen, P. Drineas, and T. Tran, “Matrix sparsification via the khintchine inequality,”
2009. [143, 564]
125. M. de Carli Silva, N. Harvey, and C. Sato, “Sparse sums of positive semidefinite matrices,”
2011. [144]
126. L. Mackey, M. Jordan, R. Chen, B. Farrell, and J. Tropp, “Matrix concentration inequalities
via the method of exchangeable pairs,” Arxiv preprint arXiv:1201.6002, 2012. [144]
127. L. Rosasco, M. Belkin, and E. D. Vito, “On learning with integral operators,” The Journal of
Machine Learning Research, vol. 11, pp. 905–934, 2010. [144, 289, 290]
128. L. Rosasco, M. Belkin, and E. De Vito, “A note on learning with integral operators,” [144,
289]
129. P. Drineas and A. Zouzias, “A note on element-wise matrix sparsification via matrix-valued
chernoff bounds,” Preprint, 2010. [144]
130. R. Chen, A. Gittens, and J. Tropp, “The masked sample covariance estimator: An
analysis via matrix concentration inequalities,” Information and Inference: A Journal of the
IMA, pp. 1–19, 2012. [144, 463, 466]
131. M. Ledoux, The Concentration of Measure Pheonomenon. American Mathematical Society,
2000. [145, 146]
132. D. Donoho et al., “High-dimensional data analysis: The curses and blessings of dimensional-
ity,” AMS Math Challenges Lecture, pp. 1–32, 2000. [146, 267, 562]
133. S. Chatterjee, “Stein’s method for concentration inequalities,” Probability theory and related
fields, vol. 138, no. 1, pp. 305–321, 2007. [146, 159]
134. B. Laurent and P. Massart, “Adaptive estimation of a quadratic functional by model selection,”
The annals of Statistics, vol. 28, no. 5, pp. 1302–1338, 2000. [147, 148, 485, 504]
135. G. Raskutti, M. Wainwright, and B. Yu, “Minimax rates of estimation for high-dimensional
linear regression over ℓq-balls,” Information Theory, IEEE Transactions on,
vol. 57, no. 10, pp. 6976–6994, 2011. [147, 166, 169, 170, 171, 424, 426, 510]
136. L. Birgé and P. Massart, “Minimum contrast estimators on sieves: exponential bounds and
rates of convergence,” Bernoulli, vol. 4, no. 3, pp. 329–375, 1998. [148, 254]
137. I. Johnstone, State of the Art in Probability and Statastics, vol. 31, ch. Chi-square oracle
inequalities, pp. 399–418. Institute of Mathematical Statistics, ims lecture notes ed., 2001.
[148]
138. M. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery using
ℓ1-constrained quadratic programming (Lasso),” Information Theory, IEEE Transactions on, vol. 55, no. 5, pp. 2183–
2202, 2009. [148, 175, 176, 182]
139. A. Amini and M. Wainwright, “High-dimensional analysis of semidefinite relaxations for
sparse principal components,” in Information Theory, 2008. ISIT 2008. IEEE International
Symposium on, pp. 2454–2458, IEEE, 2008. [148, 178, 180]
140. W. Johnson and G. Schechtman, “Remarks on talagrand’s deviation inequality for rademacher
functions,” Functional Analysis, pp. 72–77, 1991. [149]
141. M. Ledoux, The concentration of measure phenomenon, vol. 89. Amer Mathematical Society,
2001. [150, 151, 152, 153, 154, 155, 158, 163, 166, 169, 176, 177, 181, 182, 185, 188, 190,
194, 198, 232, 248, 258, 262, 313, 428]
142. M. Talagrand, “A new look at independence,” The Annals of probability, vol. 24, no. 1, pp. 1–
34, 1996. [152, 313]
143. S. Chatterjee, “Matrix estimation by universal singular value thresholding,” arXiv preprint
arXiv:1212.1247, 2012. [152]
144. R. Bhatia, C. Davis, and A. McIntosh, “Perturbation of spectral subspaces and solution of
linear operator equations,” Linear Algebra and its Applications, vol. 52, pp. 45–67, 1983.
[153]
145. K. Davidson and S. Szarek, “Local operator theory, random matrices and banach spaces,”
Handbook of the geometry of Banach spaces, vol. 1, pp. 317–366, 2001. [153, 159, 160, 161,
173, 191, 282, 313, 325, 427, 486]
146. A. DasGupta, Probability for Statistics and Machine Learning. Springer, 2011. [155]
147. W. Hoeffding, “A combinatorial central limit theorem,” The Annals of Mathematical Statis-
tics, vol. 22, no. 4, pp. 558–566, 1951. [159]
172. M. Ledoux, “On talagrand’s deviation inequalities for product measures,” ESAIM: Probability
and statistics, vol. 1, pp. 63–87, 1996. [198]
173. P. Massart, “About the constants in talagrand’s concentration inequalities for empirical
processes,” Annals of Probability, pp. 863–884, 2000. [198]
174. E. Rio, “Inégalités de concentration pour les processus empiriques de classes de parties,”
Probability Theory and Related Fields, vol. 119, no. 2, pp. 163–175, 2001. [198]
175. S. Boucheron, O. Bousquet, G. Lugosi, and P. Massart, “Moment inequalities for functions of
independent random variables,” The Annals of Probability, vol. 33, no. 2, pp. 514–560, 2005.
[198]
176. S. Boucheron, O. Bousquet, G. Lugosi, et al., “Theory of classification: A survey of some
recent advances,” ESAIM Probability and statistics, vol. 9, pp. 323–375, 2005. []
177. S. Boucheron, G. Lugosi, P. Massart, et al., “On concentration of self-bounding functions,”
Electronic Journal of Probability, vol. 14, no. 64, pp. 1884–1899, 2009. []
178. S. Boucheron, P. Massart, et al., “A high-dimensional wilks phenomenon,” Probability theory
and related fields, vol. 150, no. 3, p. 405, 2011. [198]
179. A. Connes, “Classification of injective factors,” Ann. of Math, vol. 104, no. 2, pp. 73–115,
1976. [203]
180. A. Guionnet and O. Zeitouni, “Concentration of the spectral measure for large matrices,”
Electron. Comm. Probab, vol. 5, pp. 119–136, 2000. [204, 218, 219, 227, 243, 244, 247, 268,
494]
181. I. N. Bronshtein, K. A. Semendiaev, and K. A. Hirsch, Handbook of mathematics. Van
Nostrand Reinhold New York, NY, 5th ed., 2007. [204, 208]
182. A. Khajehnejad, S. Oymak, and B. Hassibi, “Subspace expanders and matrix rank minimiza-
tion,” arXiv preprint arXiv:1102.3947, 2011. [205]
183. M. Meckes, “Concentration of norms and eigenvalues of random matrices,” Journal of
Functional Analysis, vol. 211, no. 2, pp. 508–524, 2004. [206, 224]
184. C. Davis, “All convex invariant functions of hermitian matrices,” Archiv der Mathematik,
vol. 8, no. 4, pp. 276–278, 1957. [206]
185. L. Li, “Concentration of measure for random matrices.” private communication, October
2012. Tennessee Technological University. [207]
186. N. Berestycki and R. Nickl, “Concentration of measure,” tech. rep., Technical report,
University of Cambridge, 2009. [218, 219]
187. R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1994. [219,
465]
188. Y. Zeng and Y. Liang, “Maximum-minimum eigenvalue detection for cognitive radio,” in
IEEE 18th International Symposium on Personal, Indoor and Mobile Radio Communications
(PIMRC) 2007, pp. 1–5, 2007. [221]
189. V. Petrov and A. Brown, Sums of independent random variables, vol. 197. Springer-Verlag
Berlin, 1975. [221]
190. M. Meckes and S. Szarek, “Concentration for noncommutative polynomials in random
matrices,” in Proc. Amer. Math. Soc, vol. 140, pp. 1803–1813, 2012. [222, 223, 269]
191. G. W. Anderson, “Convergence of the largest singular value of a polynomial in independent
wigner matrices,” arXiv preprint arXiv:1103.4825, 2011. [222]
192. E. Meckes and M. Meckes, “Concentration and convergence rates for spectral measures of
random matrices,” Probability Theory and Related Fields, pp. 1–20, 2011. [222, 233, 234,
235]
193. R. Serfling, “Approximation theorems of mathematical statistics (wiley series in probability
and statistics),” 1981. [223]
194. R. Latala, “Some estimates of norms of random matrices,” Proceedings of the American
Mathematical Society, pp. 1273–1282, 2005. [223, 330]
195. S. Lang, Real and functional analysis. 1993. [224]
196. A. Guntuboyina and H. Leeb, “Concentration of the spectral measure of large wishart matrices
with dependent entries,” Electron. Commun. Probab, vol. 14, pp. 334–342, 2009. [224, 225,
226]
221. T. Tao and V. Vu, “Random covariance matrices: Universality of local statistics of eigenval-
ues,” Arxiv preprint arxiv:0912.0966, 2009. [243]
222. Z. Füredi and J. Komlós, “The eigenvalues of random symmetric matrices,” Combinatorica,
vol. 1, no. 3, pp. 233–241, 1981. [245]
223. M. Krivelevich and V. Vu, “Approximating the independence number and the chromatic
number in expected polynomial time,” Journal of combinatorial optimization, vol. 6, no. 2,
pp. 143–155, 2002. [245]
224. A. Guionnet, “Lecture notes, minneapolis,” 2012. [247, 249]
225. C. Bordenave, P. Caputo, and D. Chafaı̈, “Spectrum of non-hermitian heavy tailed random
matrices,” Communications in Mathematical Physics, vol. 307, no. 2, pp. 513–560, 2011.
[247]
226. A. Guionnet and B. Zegarlinski, “Lectures on logarithmic sobolev inequalities,” Séminaire de
Probabilités, XXXVI, vol. 1801, pp. 1–134, 2003. [248]
227. D. Ruelle, Statistical mechanics: rigorous results. Amsterdam: Benjamin, 1969. [248]
228. T. Tao and V. Vu, “Random matrices: Sharp concentration of eigenvalues,” Arxiv preprint
arXiv:1201.4789, 2012. [249]
229. N. El Karoui, “Concentration of measure and spectra of random matrices: applications to
correlation matrices, elliptical distributions and beyond,” The Annals of Applied Probability,
vol. 19, no. 6, pp. 2362–2405, 2009. [250, 252, 253, 261, 262, 268]
230. N. El Karoui, “The spectrum of kernel random matrices,” The Annals of Statistics, vol. 38,
no. 1, pp. 1–50, 2010. [254, 268]
231. M. Talagrand, “New concentration inequalities in product spaces,” Inventiones Mathematicae,
vol. 126, no. 3, pp. 505–563, 1996. [254, 313, 404]
232. L. Birgé and P. Massart, “Gaussian model selection,” Journal of the European Mathematical
Society, vol. 3, no. 3, pp. 203–268, 2001. [254]
233. L. Birgé and P. Massart, “Minimal penalties for gaussian model selection,” Probability theory
and related fields, vol. 138, no. 1, pp. 33–73, 2007. [254]
234. P. Massart, “Some applications of concentration inequalities to statistics,” in Annales-Faculte
des Sciences Toulouse Mathematiques, vol. 9, pp. 245–303, Université Paul Sabatier, 2000.
[254]
235. I. Bechar, “A bernstein-type inequality for stochastic processes of quadratic forms of gaussian
variables,” arXiv preprint arXiv:0909.3595, 2009. [254, 548, 552]
236. K. Wang, A. So, T. Chang, W. Ma, and C. Chi, “Outage constrained robust transmit
optimization for multiuser miso downlinks: Tractable approximations by conic optimization,”
arXiv preprint arXiv:1108.0982, 2011. [255, 543, 544, 545, 547, 548, 552]
237. M. Lopes, L. Jacob, and M. Wainwright, “A more powerful two-sample test in high
dimensions using random projection,” arXiv preprint arXiv:1108.2401, 2011. [255, 256, 520,
522, 523, 525]
238. W. Beckner, “A generalized poincaré inequality for gaussian measures,” Proceedings of the
American Mathematical Society, pp. 397–400, 1989. [256]
239. M. Rudelson and R. Vershynin, “Invertibility of random matrices: unitary and orthogonal
perturbations,” arXiv preprint arXiv:1206.5180, June 2012. Version 1. [257, 334]
240. T. Tao and V. Vu, “Random matrices: Universality of local eigenvalue statistics,” Acta
mathematica, pp. 1–78, 2011. [259]
241. T. Tao and V. Vu, “Random matrices: The distribution of the smallest singular values,”
Geometric And Functional Analysis, vol. 20, no. 1, pp. 260–297, 2010. [259, 334, 499]
242. T. Tao and V. Vu, “On random±1 matrices: singularity and determinant,” Random Structures
& Algorithms, vol. 28, no. 1, pp. 1–23, 2006. [259, 332, 335, 499]
243. J. Silverstein and Z. Bai, “On the empirical distribution of eigenvalues of a class of large
dimensional random matrices,” Journal of Multivariate analysis, vol. 54, no. 2, pp. 175–192,
1995. [262]
244. J. Von Neumann, Mathematische grundlagen der quantenmechanik, vol. 38. Springer, 1995.
[264]
245. J. Cadney, N. Linden, and A. Winter, “Infinitely many constrained inequalities for the von
neumann entropy,” Information Theory, IEEE Transactions on, vol. 58, no. 6, pp. 3657–3663,
2012. [265, 267]
246. H. Araki and E. Lieb, “Entropy inequalities,” Communications in Mathematical Physics,
vol. 18, no. 2, pp. 160–170, 1970. [266]
247. E. Lieb and M. Ruskai, “A fundamental property of quantum-mechanical entropy,” Physical
Review Letters, vol. 30, no. 10, pp. 434–436, 1973. [266]
248. E. Lieb and M. Ruskai, “Proof of the strong subadditivity of quantum-mechanical entropy,”
Journal of Mathematical Physics, vol. 14, pp. 1938–1941, 1973. [266]
249. N. Pippenger, “The inequalities of quantum information theory,” Information Theory, IEEE
Transactions on, vol. 49, no. 4, pp. 773–789, 2003. [267]
250. T. Chan, “Recent progresses in characterising information inequalities,” Entropy, vol. 13,
no. 2, pp. 379–401, 2011. []
251. T. Chan, D. Guo, and R. Yeung, “Entropy functions and determinant inequalities,” in
Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pp. 1251–
1255, IEEE, 2012. []
252. R. Yeung, “Facts of entropy,” IEEE Information Theory Society Newsletter, pp. 6–15,
December 2012. [267]
253. C. Williams and M. Seeger, “The effect of the input density distribution on kernel-based
classifiers,” in Proceedings of the 17th International Conference on Machine Learning,
Citeseer, 2000. [268]
254. J. Shawe-Taylor, N. Cristianini, and J. Kandola, “On the concentration of spectral properties,”
Advances in neural information processing systems, vol. 1, pp. 511–518, 2002. [268]
255. J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola, “On the eigenspectrum of
the gram matrix and the generalization error of kernel-pca,” Information Theory, IEEE
Transactions on, vol. 51, no. 7, pp. 2510–2522, 2005. [268]
256. Y. Do and V. Vu, “The spectrum of random kernel matrices,” arXiv preprint arXiv:1206.3763,
2012. [268]
257. X. Cheng and A. Singer, “The spectrum of random inner-product kernel matrices,”
arXiv:1202.3155v1 [math.PR], p. 40, 2012. [268]
258. N. Ross et al., “Fundamentals of stein’s method,” Probability Surveys, vol. 8, pp. 210–293,
2011. [269, 347]
259. Z. Chen and J. Dongarra, “Condition numbers of gaussian random matrices,” Arxiv preprint
arXiv:0810.0800, 2008. [269]
260. M. Junge and Q. Zeng, “Noncommutative bennett and rosenthal inequalities,” Arxiv preprint
arXiv:1111.1027, 2011. [269]
261. A. Giannopoulos, “Notes on isotropic convex bodies,” Warsaw University Notes, 2003. [272]
262. Y. D. Burago, V. A. Zalgaller, and A. Sossinsky, Geometric inequalities, vol. 1988. Springer
Berlin, 1988. [273]
263. R. Schneider, Convex bodies: the Brunn-Minkowski theory, vol. 44. Cambridge Univ Pr, 1993.
[273, 292]
264. V. Milman and A. Pajor, “Isotropic position and inertia ellipsoids and zonoids of the unit
ball of a normed n-dimensional space,” Geometric aspects of functional analysis, pp. 64–104,
1989. [274, 277, 291]
265. C. Borell, “The brunn-minkowski inequality in gauss space,” Inventiones Mathematicae,
vol. 30, no. 2, pp. 207–216, 1975. [274]
266. R. Kannan, L. Lovász, and M. Simonovits, “Random walks and an o*(n5) volume algorithm
for convex bodies,” Random structures and algorithms, vol. 11, no. 1, pp. 1–50, 1997. [274,
275, 277, 280, 282, 442]
267. O. Guédon and M. Rudelson, “Lp-moments of random vectors via majorizing measures,”
Advances in Mathematics, vol. 208, no. 2, pp. 798–823, 2007. [275, 276, 299, 301, 303]
268. C. Borell, “Convex measures on locally convex spaces,” Arkiv för Matematik, vol. 12, no. 1,
pp. 239–252, 1974. [276]
269. C. Borell, “Convex set functions ind-space,” Periodica Mathematica Hungarica, vol. 6, no. 2,
pp. 111–136, 1975. [276]
270. A. Prékopa, “Logarithmic concave measures with application to stochastic programming,”
Acta Sci. Math.(Szeged), vol. 32, no. 197, pp. 301–3, 1971. [276]
271. S. Vempala, “Recent progress and open problems in algorithmic convex geometry,” in
IARCS Annual Conference on Foundations of Software Technology and Theoretical Com-
puter Science (FSTTCS 2010), Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2010.
[276]
272. G. Paouris, “Concentration of mass on convex bodies,” Geometric and Functional Analysis,
vol. 16, no. 5, pp. 1021–1049, 2006. [277, 290, 291, 292, 295]
273. J. Bourgain, “Random points in isotropic convex sets,” Convex geometric analysis, Berkeley,
CA, pp. 53–58, 1996. [277, 278, 282]
274. S. Mendelson and A. Pajor, “On singular values of matrices with independent rows,”
Bernoulli, vol. 12, no. 5, pp. 761–773, 2006. [277, 281, 282, 283, 285, 287, 288, 289]
275. S. Alesker, “ψ2-estimate for the euclidean norm on a convex body in isotropic position,”
Operator theory, vol. 77, pp. 1–4, 1995. [280, 290]
276. F. Lust-Piquard and G. Pisier, “Non commutative khintchine and paley inequalities,” Arkiv
för Matematik, vol. 29, no. 1, pp. 241–260, 1991. [284]
277. F. Cucker and D. X. Zhou, Learning theory: an approximation theory viewpoint, vol. 24.
Cambridge University Press, 2007. [288]
278. V. Koltchinskii and E. Giné, “Random matrix approximation of spectra of integral operators,”
Bernoulli, vol. 6, no. 1, pp. 113–167, 2000. [288, 289]
279. S. Mendelson, “On the performance of kernel classes,” The Journal of Machine Learning
Research, vol. 4, pp. 759–771, 2003. [289]
280. R. Kress, Linear Integral Equations. Berlin: Springer-Verlag, 1989. [289]
281. G. W. Hanson and A. B. Yakovlev, Operator theory for electromagnetics: an introduction.
Springer Verlag, 2002. [289]
282. R. C. Qiu, Z. Hu, M. Wicks, L. Li, S. J. Hou, and L. Gary, “Wireless Tomography, Part
II: A System Engineering Approach,” in 5th International Waveform Diversity & Design
Conference, (Niagara Falls, Canada), August 2010. [290]
283. R. C. Qiu, M. C. Wicks, L. Li, Z. Hu, and S. J. Hou, “Wireless Tomography, Part I: A
NovelApproach to Remote Sensing,” in 5th International Waveform Diversity & Design
Conference, (Niagara Falls, Canada), August 2010. [290]
284. E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone, “Learning from
examples as an inverse problem,” Journal of Machine Learning Research, vol. 6, no. 1, p. 883,
2006. [290]
285. E. De Vito, L. Rosasco, and A. Toigo, “Learning sets with separating kernels,” arXiv preprint
arXiv:1204.3573, 2012. []
286. E. De Vito, V. Umanità, and S. Villa, “An extension of mercer theorem to matrix-valued
measurable kernels,” Applied and Computational Harmonic Analysis, 2012. []
287. E. De Vito, L. Rosasco, and A. Toigo, “A universally consistent spectral estimator for the
support of a distribution,” [290]
288. R. Latala, “Weak and strong moments of random vectors,” Marcinkiewicz Centenary Volume,
Banach Center Publ., vol. 95, pp. 115–121, 2011. [290, 303]
289. R. Latala, “Order statistics and concentration of lr norms for log-concave vectors,” Journal of
Functional Analysis, 2011. [292, 293]
290. R. Adamczak, R. Latala, A. E. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Geometry
of log-concave ensembles of random matrices and approximate reconstruction,” Comptes
Rendus Mathematique, vol. 349, no. 13, pp. 783–786, 2011. []
291. R. Adamczak, R. Latala, A. E. Litvak, K. Oleszkiewicz, A. Pajor, and N. Tomczak-
Jaegermann, “A short proof of paouris’ inequality,” arXiv preprint arXiv:1205.2515, 2012.
[290, 291]
292. J. Bourgain, “On the distribution of polynomials on high dimensional convex sets,” Geometric
aspects of functional analysis, pp. 127–137, 1991. [290]
293. B. Klartag, “An isomorphic version of the slicing problem,” Journal of Functional Analysis,
vol. 218, no. 2, pp. 372–394, 2005. [290]
294. S. Bobkov and F. Nazarov, “On convex bodies and log-concave probability measures with
unconditional basis,” Geometric aspects of functional analysis, pp. 53–69, 2003. [290]
295. S. Bobkov and F. Nazarov, “Large deviations of typical linear functionals on a convex body
with unconditional basis,” Progress in Probability, pp. 3–14, 2003. [290]
296. A. Giannopoulos, M. Hartzoulaki, and A. Tsolomitis, “Random points in isotropic un-
conditional convex bodies,” Journal of the London Mathematical Society, vol. 72, no. 3,
pp. 779–798, 2005. [292]
297. G. Aubrun, “Sampling convex bodies: a random matrix approach,” Proceedings of the
American Mathematical Society, vol. 135, no. 5, pp. 1293–1304, 2007. [292, 295]
298. R. Adamczak, “A tail inequality for suprema of unbounded empirical processes with
applications to markov chains,” Electron. J. Probab, vol. 13, pp. 1000–1034, 2008. [293]
299. R. Adamczak, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Sharp bounds on the rate
of convergence of the empirical covariance matrix,” Comptes Rendus Mathematique, 2011.
[294]
300. R. Adamczak, R. Latala, A. E. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Tail estimates
for norms of sums of log-concave random vectors,” arXiv preprint arXiv:1107.4070, 2011.
[293]
301. R. Adamczak, A. E. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Quantitative estimates
of the convergence of the empirical covariance matrix in log-concave ensembles,” Journal of
the American Mathematical Society, vol. 23, no. 2, p. 535, 2010. [294, 315]
302. Z. Bai and Y. Yin, “Limit of the smallest eigenvalue of a large dimensional sample covariance
matrix,” The annals of Probability, pp. 1275–1294, 1993. [294, 324]
303. R. Kannan, L. Lovász, and M. Simonovits, “Isoperimetric problems for convex bodies and a
localization lemma,” Discrete & Computational Geometry, vol. 13, no. 1, pp. 541–559, 1995.
[295]
304. G. Paouris, “Small ball probability estimates for log-concave measures,” Trans. Amer. Math.
Soc, vol. 364, pp. 287–308, 2012. [295, 297, 305]
305. J. A. Clarkson, “Uniformly convex spaces,” Transactions of the American Mathematical
Society, vol. 40, no. 3, pp. 396–414, 1936. [300]
306. Y. Gordon, “Gaussian processes and almost spherical sections of convex bodies,” The Annals
of Probability, pp. 180–188, 1988. [302]
307. Y. Gordon, On Milman’s inequality and random subspaces which escape through a mesh in
Rn . Springer, 1988. Geometric aspects of functional analysis (1986/87), 84–106, Lecture
Notes in Math., 1317. [302]
308. R. Adamczak, O. Guedon, R. Latala, A. E. Litvak, K. Oleszkiewicz, A. Pajor, and
N. Tomczak-Jaegermann, “Moment estimates for convex measures,” arXiv preprint
arXiv:1207.6618, 2012. [303, 304, 305]
309. A. Pajor and N. Tomczak-Jaegermann, “Chevet type inequality and norms of sub-matrices,”
[303]
310. N. Srivastava and R. Vershynin, “Covariance estimation for distributions with 2 + ε
moments,” Arxiv preprint arXiv:1106.2775, 2011. [304, 305, 322, 475, 476]
311. M. Rudelson and R. Vershynin, “Sampling from large matrices: An approach through
geometric functional analysis,” Journal of the ACM (JACM), vol. 54, no. 4, p. 21, 2007. [306,
307, 310]
312. A. Frieze, R. Kannan, and S. Vempala, “Fast monte-carlo algorithms for finding low-rank
approximations,” Journal of the ACM (JACM), vol. 51, no. 6, pp. 1025–1041, 2004. [310]
313. P. Drineas, R. Kannan, and M. Mahoney, “Fast monte carlo algorithms for matrices i:
Approximating matrix multiplication,” SIAM Journal on Computing, vol. 36, no. 1, p. 132,
2006. [562]
314. P. Drineas, R. Kannan, and M. Mahoney, “Fast monte carlo algorithms for matrices ii:
Computing a low-rank approximation to a matrix,” SIAM Journal on Computing, vol. 36,
no. 1, p. 158, 2006. [311]
315. P. Drineas, R. Kannan, M. Mahoney, et al., “Fast monte carlo algorithms for matrices iii:
Computing a compressed approximate matrix decomposition,” SIAM Journal on Computing,
vol. 36, no. 1, p. 184, 2006. [311]
316. P. Drineas and R. Kannan, “Pass efficient algorithms for approximating large matrices,”
in Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms,
pp. 223–232, Society for Industrial and Applied Mathematics, 2003. [310, 311]
317. P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, “Clustering large graphs via the
singular value decomposition,” Machine Learning, vol. 56, no. 1, pp. 9–33, 2004. [313]
318. V. Mil’man, “New proof of the theorem of a. dvoretzky on intersections of convex bodies,”
Functional Analysis and its Applications, vol. 5, no. 4, pp. 288–295, 1971. [315]
319. K. Ball, “An elementary introduction to modern convex geometry,” Flavors of geometry,
vol. 31, pp. 1–58, 1997. [315]
320. R. Vershynin, “Approximating the moments of marginals of high-dimensional distributions,”
The Annals of Probability, vol. 39, no. 4, pp. 1591–1606, 2011. [322]
321. P. Youssef, “Estimating the covariance of random matrices,” arXiv preprint arXiv:1301.6607,
2013. [322, 323]
322. M. Rudelson and R. Vershynin, “Smallest singular value of a random rectangular matrix,”
Communications on Pure and Applied Mathematics, vol. 62, no. 12, pp. 1707–1739, 2009.
[324, 327]
323. R. Vershynin, “Spectral norm of products of random and deterministic matrices,” Probability
theory and related fields, vol. 150, no. 3, p. 471, 2011. [324, 327, 328, 329, 331]
324. O. Feldheim and S. Sodin, “A universality result for the smallest eigenvalues of certain sample
covariance matrices,” Geometric And Functional Analysis, vol. 20, no. 1, pp. 88–123, 2010.
[324, 326, 328]
325. M. Lai and Y. Liu, “The probabilistic estimates on the largest and smallest q-singular values
of pre-gaussian random matrices,” submitted to Advances in Mathematics, 2010. []
326. T. Tao and V. Vu, “Random matrices: The distribution of the smallest singular values,”
Geometric And Functional Analysis, vol. 20, no. 1, pp. 260–297, 2010. [328, 338, 339, 340,
341]
327. D. Chafaï, “Singular values of random matrices,” Lecture Notes, November 2009. Univer-
sité Paris-Est Marne-la-Vallée. []
328. M. Lai and Y. Liu, “A study on the largest and smallest q-singular values of random
matrices,” []
329. A. Litvak and O. Rivasplata, “Smallest singular value of sparse random matrices,” Arxiv
preprint arXiv:1106.0938, 2011. []
330. R. Vershynin, “Invertibility of symmetric random matrices,” Arxiv preprint arXiv:1102.0300,
2011. []
331. H. Nguyen and V. Vu, “Optimal inverse littlewood-offord theorems,” Advances in Mathemat-
ics, 2011. []
332. Y. Eliseeva, F. Götze, and A. Zaitsev, “Estimates for the concentration functions in the
littlewood–offord problem,” Arxiv preprint arXiv:1203.6763, 2012. []
333. S. Mendelson and G. Paouris, “On the singular values of random matrices,” Preprint, 2012.
[324]
334. J. Silverstein, “The smallest eigenvalue of a large dimensional wishart matrix,” The Annals of
Probability, vol. 13, no. 4, pp. 1364–1368, 1985. [324]
335. M. Rudelson and R. Vershynin, “Non-asymptotic theory of random matrices: extreme singular
values,” Arxiv preprint arXiv:1003.2990, 2010. [325]
336. G. Aubrun, “A sharp small deviation inequality for the largest eigenvalue of a random matrix,”
Séminaire de Probabilités XXXVIII, pp. 320–337, 2005. [325]
337. G. Bennett, L. Dor, V. Goodman, W. Johnson, and C. Newman, “On uncomplemented
subspaces of lp, 1 < p < 2,” Israel Journal of Mathematics, vol. 26, no. 2, pp. 178–187, 1977.
[326]
338. A. Litvak, A. Pajor, M. Rudelson, and N. Tomczak-Jaegermann, “Smallest singular value of
random matrices and geometry of random polytopes,” Advances in Mathematics, vol. 195,
no. 2, pp. 491–523, 2005. [326, 462]
339. S. Artstein-Avidan, O. Friedland, V. Milman, and S. Sodin, “Polynomial bounds for large
bernoulli sections of l1 n,” Israel Journal of Mathematics, vol. 156, no. 1, pp. 141–155, 2006.
[326]
340. M. Rudelson, “Lower estimates for the singular values of random matrices,” Comptes Rendus
Mathematique, vol. 342, no. 4, pp. 247–252, 2006. [326]
341. M. Rudelson and R. Vershynin, “The littlewood–offord problem and invertibility of random
matrices,” Advances in Mathematics, vol. 218, no. 2, pp. 600–633, 2008. [327, 334, 336, 338,
342, 344, 345]
342. R. Vershynin, “Some problems in asymptotic convex geometry and random matrices moti-
vated by numerical algorithms,” Arxiv preprint cs/0703093, 2007. [327]
343. M. Rudelson and O. Zeitouni, “Singular values of gaussian matrices and permanent estima-
tors,” arXiv preprint arXiv:1301.6268, 2013. [329]
344. Y. Yin, Z. Bai, and P. Krishnaiah, “On the limit of the largest eigenvalue of the large
dimensional sample covariance matrix,” Probability Theory and Related Fields, vol. 78, no. 4,
pp. 509–521, 1988. [330, 502]
345. Z. Bai and J. Silverstein, “No eigenvalues outside the support of the limiting spectral
distribution of large-dimensional sample covariance matrices,” The Annals of Probability,
vol. 26, no. 1, pp. 316–345, 1998. [331]
346. Z. Bai and J. Silverstein, “Exact separation of eigenvalues of large dimensional sample
covariance matrices,” The Annals of Probability, vol. 27, no. 3, pp. 1536–1555, 1999. [331]
347. H. Nguyen and V. Vu, “Random matrices: Law of the determinant,” Arxiv preprint
arXiv:1112.0752, 2011. [331, 332]
348. A. Rouault, "Asymptotic behavior of random determinants in the Laguerre, Gram and Jacobi ensembles," Latin American Journal of Probability and Mathematical Statistics (ALEA), vol. 3, pp. 181–230, 2007. [331]
349. N. Goodman, “The distribution of the determinant of a complex wishart distributed matrix,”
Annals of Mathematical Statistics, pp. 178–180, 1963. [332]
350. O. Friedland and O. Giladi, “A simple observation on random matrices with continuous
diagonal entries,” arXiv preprint arXiv:1302.0388, 2013. [334, 335]
351. J. Bourgain, V. H. Vu, and P. M. Wood, “On the singularity probability of discrete random
matrices,” Journal of Functional Analysis, vol. 258, no. 2, pp. 559–603, 2010. [334]
352. T. Tao and V. Vu, “From the littlewood-offord problem to the circular law: universality of
the spectral distribution of random matrices,” Bulletin of the American Mathematical Society,
vol. 46, no. 3, p. 377, 2009. [334]
353. R. Adamczak, O. Guédon, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Smallest sin-
gular value of random matrices with independent columns,” Comptes Rendus Mathematique,
vol. 346, no. 15, pp. 853–856, 2008. [334]
354. R. Adamczak, O. Guédon, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann, "Condition number of a square matrix with iid columns drawn from a convex body," Proc. Amer. Math. Soc., vol. 140, pp. 987–998, 2012. [334]
355. L. Erdős, B. Schlein, and H.-T. Yau, “Wegner estimate and level repulsion for wigner random
matrices,” International Mathematics Research Notices, vol. 2010, no. 3, pp. 436–479, 2010.
[334]
356. B. Farrell and R. Vershynin, “Smoothed analysis of symmetric random matrices with
continuous distributions,” arXiv preprint arXiv:1212.3531, 2012. [335]
357. A. Maltsev and B. Schlein, “A wegner estimate for wigner matrices,” Entropy and the
Quantum II, vol. 552, p. 145, 2011. []
358. H. H. Nguyen, “Inverse littlewood–offord problems and the singularity of random symmetric
matrices,” Duke Mathematical Journal, vol. 161, no. 4, pp. 545–586, 2012. [335]
359. H. H. Nguyen, “On the least singular value of random symmetric matrices,” Electron. J.
Probab., vol. 17, pp. 1–19, 2012. [335]
360. R. Vershynin, “Invertibility of symmetric random matrices,” Random Structures & Algo-
rithms, 2012. [334, 335]
361. A. Sankar, D. Spielman, and S. Teng, “Smoothed analysis of the condition numbers and
growth factors of matrices,” SIAM J. Matrix Anal. Appl., vol. 2, pp. 446–476, 2006. [335]
362. T. Tao and V. Vu, “Smooth analysis of the condition number and the least singular value,”
Mathematics of Computation, vol. 79, no. 272, pp. 2333–2352, 2010. [335, 338, 343, 344]
363. K. P. Costello and V. Vu, “Concentration of random determinants and permanent estimators,”
SIAM Journal on Discrete Mathematics, vol. 23, no. 3, pp. 1356–1371, 2009. [335]
364. T. Tao and V. Vu, “On the permanent of random bernoulli matrices,” Advances in Mathemat-
ics, vol. 220, no. 3, pp. 657–669, 2009. [335]
365. A. H. Taub, "John von Neumann: Collected works, volume V: Design of computers, theory of automata and numerical analysis," 1963. [336]
366. S. Smale, “On the efficiency of algorithms of analysis,” Bull. Amer. Math. Soc.(NS), vol. 13,
1985. [336, 337, 342]
367. A. Edelman, “Eigenvalues and condition numbers of random matrices,” SIAM Journal on
Matrix Analysis and Applications, vol. 9, no. 4, pp. 543–560, 1988. [336]
368. S. J. Szarek, “Condition numbers of random matrices,” J. Complexity, vol. 7, no. 2, pp. 131–
149, 1991. [336]
369. D. Spielman and S. Teng, “Smoothed analysis of algorithms: Why the simplex algorithm
usually takes polynomial time,” Journal of the ACM (JACM), vol. 51, no. 3, pp. 385–463,
2004. [336, 338, 339]
370. L. Erdős, "Universality for random matrices and log-gases," arXiv preprint arXiv:1212.0839, 2012. [337, 347]
371. J. Von Neumann and H. Goldstine, “Numerical inverting of matrices of high order,” Bull.
Amer. Math. Soc, vol. 53, no. 11, pp. 1021–1099, 1947. [337]
372. A. Edelman, Eigenvalues and condition numbers of random matrices. PhD thesis, Mas-
sachusetts Institute of Technology, 1989. [337, 338, 341, 342]
373. P. Forrester, “The spectrum edge of random matrix ensembles,” Nuclear Physics B, vol. 402,
no. 3, pp. 709–728, 1993. [338]
374. M. Rudelson and R. Vershynin, "The least singular value of a random square matrix is O(n^{-1/2})," Comptes Rendus Mathematique, vol. 346, no. 15, pp. 893–896, 2008. [338]
375. V. Vu and T. Tao, “The condition number of a randomly perturbed matrix,” in Proceedings of
the thirty-ninth annual ACM symposium on Theory of computing, pp. 248–255, ACM, 2007.
[338]
376. J. Lindeberg, “Eine neue herleitung des exponentialgesetzes in der wahrscheinlichkeitsrech-
nung,” Mathematische Zeitschrift, vol. 15, no. 1, pp. 211–225, 1922. [339]
377. A. Edelman and B. Sutton, “Tails of condition number distributions,” simulation, vol. 1, p. 2,
2008. [341]
378. T. Sarlos, “Improved approximation algorithms for large matrices via random projections,”
in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on,
pp. 143–152, IEEE, 2006. [342]
379. T. Tao and V. Vu, “Random matrices: the circular law,” Arxiv preprint arXiv:0708.2895, 2007.
[342, 343]
380. N. Pillai and J. Yin, “Edge universality of correlation matrices,” arXiv preprint
arXiv:1112.2381, 2011. [345, 346, 347]
381. N. Pillai and J. Yin, “Universality of covariance matrices,” arXiv preprint arXiv:1110.2501,
2011. [345]
382. Z. Bao, G. Pan, and W. Zhou, “Tracy-widom law for the extreme eigenvalues of sample
correlation matrices,” 2011. [345]
383. I. Johnstone, “On the distribution of the largest eigenvalue in principal components analysis,”
The Annals of statistics, vol. 29, no. 2, pp. 295–327, 2001. [347, 461, 502]
384. S. Hou, R. C. Qiu, J. P. Browning, and M. C. Wicks, “Spectrum Sensing in Cognitive Radio
with Subspace Matching,” in IEEE Waveform Diversity and Design Conference, January
2012. [347, 514]
385. S. Hou and R. C. Qiu, “Spectrum sensing for cognitive radio using kernel-based learning,”
arXiv preprint arXiv:1105.2978, 2011. [347, 491, 514]
386. S. Dallaporta, “Eigenvalue variance bounds for wigner and covariance random matrices,”
Random Matrices: Theory and Applications, vol. 1, no. 03, 2011. [348]
387. Ø. Ryan, A. Masucci, S. Yang, and M. Debbah, “Finite dimensional statistical inference,”
Information Theory, IEEE Transactions on, vol. 57, no. 4, pp. 2457–2473, 2011. [360]
388. R. Couillet and M. Debbah, Random Matrix Methods for Wireless Communications. Cam-
bridge University Press, 2011. [360, 502]
389. A. Tulino and S. Verdu, Random matrix theory and wireless communications. now Publishers
Inc., 2004. [360]
390. R. Müller, “Applications of large random matrices in communications engineering,” in Proc.
Int. Conf. on Advances Internet, Process., Syst., Interdisciplinary Research (IPSI), Sveti
Stefan, Montenegro, 2003. [365, 367]
391. G. E. Pfander, H. Rauhut, and J. A. Tropp, “The restricted isometry property for time–
frequency structured random matrices,” Probability Theory and Related Fields, pp. 1–31,
2011. [373, 374, 400]
392. T. T. Cai, L. Wang, and G. Xu, “Shifting inequality and recovery of sparse signals,” Signal
Processing, IEEE Transactions on, vol. 58, no. 3, pp. 1300–1308, 2010. [373]
393. E. Candes, “The Restricted Isometry Property and Its Implications for Compressed Sensing,”
Comptes rendus-Mathematique, vol. 346, no. 9–10, pp. 589–592, 2008. []
394. E. Candes, J. Romberg, and T. Tao, “Stable Signal Recovery from Incomplete and Inaccurate
Measurements,” Comm. Pure Appl. Math, vol. 59, no. 8, pp. 1207–1223, 2006. []
395. S. Foucart, "A note on guaranteed sparse recovery via l1-minimization," Applied and Computational Harmonic Analysis, vol. 29, no. 1, pp. 97–103, 2010. [373]
396. S. Foucart, “Sparse recovery algorithms: sufficient conditions in terms of restricted isometry
constants,” Approximation Theory XIII: San Antonio 2010, pp. 65–77, 2012. [373]
397. D. Needell and J. A. Tropp, “Cosamp: Iterative signal recovery from incomplete and
inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–
321, 2009. [373]
398. T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,”
Applied and Computational Harmonic Analysis, vol. 27, no. 3, pp. 265–274, 2009. [373]
399. S. Foucart, “Hard thresholding pursuit: an algorithm for compressive sensing,” SIAM Journal
on Numerical Analysis, vol. 49, no. 6, pp. 2543–2563, 2011. [373]
400. R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A Simple Proof of the Restricted
Isometry Property for Random Matrices.” Submitted for publication, January 2007. [373,
374, 375, 376, 378, 385]
401. D. L. Donoho and J. Tanner, “Counting faces of randomly-projected polytopes when the
projection radically lowers dimension,” J. Amer. Math. Soc, vol. 22, no. 1, pp. 1–53, 2009.
[]
402. M. Rudelson and R. Vershynin, “On sparse reconstruction from fourier and gaussian
measurements,” Communications on Pure and Applied Mathematics, vol. 61, no. 8, pp. 1025–
1045, 2008. [374, 394]
403. H. Rauhut, J. Romberg, and J. A. Tropp, “Restricted isometries for partial random circulant
matrices,” Applied and Computational Harmonic Analysis, vol. 32, no. 2, pp. 242–254, 2012.
[374, 393, 394, 395, 396, 397, 398, 404]
404. G. E. Pfander and H. Rauhut, “Sparsity in time-frequency representations,” Journal of Fourier
Analysis and Applications, vol. 16, no. 2, pp. 233–260, 2010. [374, 400, 401, 402, 403, 404]
405. G. Pfander, H. Rauhut, and J. Tanner, “Identification of Matrices having a Sparse Represen-
tation,” in Preprint, 2007. [374, 400]
406. E. Candes and T. Tao, “Near-Optimal Signal Recovery From Random Projections: Universal
Encoding Strategies?,” IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–
5425, 2006. [373, 384]
407. S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann, “Uniform uncertainty principle for
bernoulli and subgaussian ensembles,” Constructive Approximation, vol. 28, no. 3, pp. 277–
289, 2008. [373, 386]
408. A. Cohen, W. Dahmen, and R. DeVore, “Compressed Sensing and Best k-Term Approxima-
tion,” in Submitted for publication, July, 2006. [373]
409. S. Foucart, A. Pajor, H. Rauhut, and T. Ullrich, “The gelfand widths of lp-balls for 0 < p <
1,” Journal of Complexity, vol. 26, no. 6, pp. 629–640, 2010. []
410. A. Y. Garnaev and E. D. Gluskin, “The widths of a euclidean ball,” in Dokl. Akad. Nauk SSSR,
vol. 277, pp. 1048–1052, 1984. [373]
411. W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984. [374, 375, 379]
412. F. Krahmer and R. Ward, “New and improved johnson-lindenstrauss embeddings via the
restricted isometry property,” SIAM Journal on Mathematical Analysis, vol. 43, no. 3,
pp. 1269–1281, 2011. [374, 375, 379, 398]
413. N. Alon, “Problems and results in extremal combinatorics-i,” Discrete Mathematics, vol. 273,
no. 1, pp. 31–53, 2003. [375]
414. N. Ailon and E. Liberty, “An almost optimal unrestricted fast johnson-lindenstrauss trans-
form,” in Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete
Algorithms, pp. 185–191, SIAM, 2011. [375]
415. D. Achlioptas, “Database-friendly random projections,” in Proceedings of the twentieth ACM
SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 274–281,
ACM, 2001. [375, 376, 408, 410]
416. D. Achlioptas, “Database-friendly random projections: Johnson-lindenstrauss with binary
coins,” Journal of computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003. [375,
379, 385, 414]
417. N. Ailon and B. Chazelle, “Approximate nearest neighbors and the fast johnson-lindenstrauss
transform,” in Proceedings of the thirty-eighth annual ACM symposium on Theory of
computing, pp. 557–563, ACM, 2006. [375]
418. N. Ailon and B. Chazelle, “The fast johnson-lindenstrauss transform and approximate nearest
neighbors,” SIAM Journal on Computing, vol. 39, no. 1, pp. 302–322, 2009. [375]
419. G. Lorentz, M. Golitschek, and Y. Makovoz, "Constructive approximation, volume 304 of Grundlehren Math. Wiss.," 1996. [377]
420. R. I. Arriaga and S. Vempala, “An algorithmic theory of learning: Robust concepts and
random projection,” Machine Learning, vol. 63, no. 2, pp. 161–182, 2006. [379]
421. K. L. Clarkson and D. P. Woodruff, “Numerical linear algebra in the streaming model,” in
Proceedings of the 41st annual ACM symposium on Theory of computing, pp. 205–214, ACM,
2009. []
422. S. Dasgupta and A. Gupta, “An elementary proof of a theorem of johnson and lindenstrauss,”
Random Structures & Algorithms, vol. 22, no. 1, pp. 60–65, 2002. []
423. P. Frankl and H. Maehara, “The johnson-lindenstrauss lemma and the sphericity of some
graphs,” Journal of Combinatorial Theory, Series B, vol. 44, no. 3, pp. 355–362, 1988. []
424. P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse
of dimensionality,” in Proceedings of the thirtieth annual ACM symposium on Theory of
computing, pp. 604–613, ACM, 1998. []
425. D. M. Kane and J. Nelson, “A derandomized sparse johnson-lindenstrauss transform,” arXiv
preprint arXiv:1006.3585, 2010. []
426. J. Matoušek, “On variants of the johnson–lindenstrauss lemma,” Random Structures &
Algorithms, vol. 33, no. 2, pp. 142–156, 2008. [379]
427. M. Rudelson and R. Vershynin, “Geometric approach to error-correcting codes and recon-
struction of signals,” International Mathematics Research Notices, vol. 2005, no. 64, p. 4019,
2005. [384]
428. J. Vybíral, "A variant of the Johnson–Lindenstrauss lemma for circulant matrices," Journal of Functional Analysis, vol. 260, no. 4, pp. 1096–1105, 2011. [384, 398]
429. A. Hinrichs and J. Vybíral, "Johnson-Lindenstrauss lemma for circulant matrices," Random Structures & Algorithms, vol. 39, no. 3, pp. 391–398, 2011. [384, 398]
430. H. Rauhut, K. Schnass, and P. Vandergheynst, “Compressed Sensing and Redundant Dictio-
naries,” IEEE Transactions on Information Theory, vol. 54, no. 5, pp. 2210–2219, 2008. [384,
385, 387, 388, 389, 391, 392]
431. W. U. Bajwa, J. Haupt, A. M. Sayeed, and R. Nowak, “Compressed channel sensing: A new
approach to estimating sparse multipath channels,” Proceedings of the IEEE, vol. 98, no. 6,
pp. 1058–1076, 2010. [400]
432. J. Chiu and L. Demanet, “Matrix probing and its conditioning,” SIAM Journal on Numerical
Analysis, vol. 50, no. 1, pp. 171–193, 2012. [400]
433. G. E. Pfander, “Gabor frames in finite dimensions,” Finite Frames, pp. 193–239, 2012. [400]
434. J. Haupt, W. U. Bajwa, G. Raz, and R. Nowak, “Toeplitz compressed sensing matrices
with applications to sparse channel estimation,” Information Theory, IEEE Transactions on,
vol. 56, no. 11, pp. 5862–5875, 2010. [400]
435. K. Gröchenig, Foundations of time-frequency analysis. Birkhäuser Boston, 2000. [401]
436. F. Krahmer, G. E. Pfander, and P. Rashkov, “Uncertainty in time–frequency representations
on finite abelian groups and applications,” Applied and Computational Harmonic Analysis,
vol. 25, no. 2, pp. 209–225, 2008. [401]
437. J. Lawrence, G. E. Pfander, and D. Walnut, “Linear independence of gabor systems in finite
dimensional vector spaces,” Journal of Fourier Analysis and Applications, vol. 11, no. 6,
pp. 715–726, 2005. [401]
438. B. M. Sanandaji, T. L. Vincent, and M. B. Wakin, “Concentration of measure inequalities
for compressive toeplitz matrices with applications to detection and system identification,” in
Decision and Control (CDC), 2010 49th IEEE Conference on, pp. 2922–2929, IEEE, 2010.
[408]
439. B. M. Sanandaji, T. L. Vincent, and M. B. Wakin, “Concentration of measure inequalities for
toeplitz matrices with applications,” arXiv preprint arXiv:1112.1968, 2011. [409]
440. B. M. Sanandaji, M. B. Wakin, and T. L. Vincent, “Observability with random observations,”
arXiv preprint arXiv:1211.4077, 2012. [408]
441. M. Meckes, “On the spectral norm of a random toeplitz matrix,” Electron. Comm. Probab,
vol. 12, pp. 315–325, 2007. [410]
442. R. Adamczak, “A few remarks on the operator norm of random toeplitz matrices,” Journal of
Theoretical Probability, vol. 23, no. 1, pp. 85–108, 2010. [410]
443. R. Calderbank, S. Howard, and S. Jafarpour, “Construction of a large class of deterministic
sensing matrices that satisfy a statistical isometry property,” Selected Topics in Signal
Processing, IEEE Journal of, vol. 4, no. 2, pp. 358–374, 2010. [410]
444. Y. Plan, Compressed sensing, sparse approximation, and low-rank matrix estimation. PhD
thesis, California Institute of Technology, 2011. [411, 414, 415]
445. R. Vershynin, “Math 280 lecture notes,” 2007. [413]
446. B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix
equations via nuclear norm minimization,” SIAM review, vol. 52, no. 3, pp. 471–501, 2010.
[414]
447. R. Vershynin, “On large random almost euclidean bases,” Acta Math. Univ. Comenianae,
vol. 69, no. 2, pp. 137–144, 2000. [414]
448. E. L. Lehmann and G. Casella, Theory of point estimation, vol. 31. Springer, 1998. [414]
449. S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu, “A unified framework for high-
dimensional analysis of m-estimators with decomposable regularizers,” arXiv preprint
arXiv:1010.2731, 2010. [416]
450. A. Agarwal, S. Negahban, and M. Wainwright, “Fast global convergence of gradient methods
for high-dimensional statistical recovery,” arXiv preprint arXiv:1104.4824, 2011. [416]
451. B. Recht, M. Fazel, and P. Parrilo, “Guaranteed Minimum-Rank Solutions of Linear Matrix
Equations via Nuclear Norm Minimization,” in Arxiv preprint arXiv:0706.4138, 2007. [417]
452. H. Lütkepohl, “New introduction to multiple time series analysis,” 2005. [422]
453. A. Agarwal, S. Negahban, and M. Wainwright, “Noisy matrix decomposition via convex
relaxation: Optimal rates in high dimensions,” arXiv preprint arXiv:1102.4807, 2011. [427,
428, 429, 485, 486]
454. C. Meyer, Matrix analysis and applied linear algebra. SIAM, 2000. [430]
455. M. McCoy and J. Tropp, “Sharp recovery bounds for convex deconvolution, with applica-
tions,” Arxiv preprint arXiv:1205.1580, 2012. [433, 434]
456. V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence
for matrix decomposition,” SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
[434]
457. V. Koltchinskii, “Von neumann entropy penalization and low-rank matrix estimation,” The
Annals of Statistics, vol. 39, no. 6, pp. 2936–2973, 2012. [434, 438, 439, 447]
458. M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information. Cambridge University Press, 10th anniversary ed., 2010. [438]
459. S. Sra, S. Nowozin, and S. Wright, eds., Optimization for Machine Learning. MIT Press, 2012. Chapter 4 (Bertsekas): Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey. [440, 441]
460. M. Rudelson, “Contact points of convex bodies,” Israel Journal of Mathematics, vol. 101,
no. 1, pp. 93–124, 1997. [441]
461. D. Blatt, A. Hero, and H. Gauchman, “A convergent incremental gradient method with a
constant step size,” SIAM Journal on Optimization, vol. 18, no. 1, pp. 29–51, 2007. [443]
462. M. Rabbat and R. Nowak, “Quantized incremental algorithms for distributed optimization,”
Selected Areas in Communications, IEEE Journal on, vol. 23, no. 4, pp. 798–808, 2005. [443]
463. E. Candes, Y. Eldar, T. Strohmer, and V. Voroninski, “Phase retrieval via matrix completion,”
Arxiv preprint arXiv:1109.0573, 2011. [443, 444, 446, 447, 456]
464. E. Candes, T. Strohmer, and V. Voroninski, “Phaselift: Exact and stable signal recovery from
magnitude measurements via convex programming,” Arxiv preprint arXiv:1109.4499, 2011.
[443, 444, 456]
465. E. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of
Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009. [443]
466. J. Cai, E. Candes, and Z. Shen, “A singular value thresholding algorithm for matrix
completion,” Arxiv preprint Arxiv:0810.3286, 2008. [446, 449]
467. E. Candes and T. Tao, “The power of convex relaxation: Near-optimal matrix completion,”
Information Theory, IEEE Transactions on, vol. 56, no. 5, pp. 2053–2080, 2010. []
468. A. Alfakih, A. Khandani, and H. Wolkowicz, “Solving euclidean distance matrix completion
problems via semidefinite programming,” Computational optimization and applications,
vol. 12, no. 1, pp. 13–30, 1999. []
469. M. Fukuda, M. Kojima, K. Murota, K. Nakata, et al., “Exploiting sparsity in semidefinite
programming via matrix completion i: General framework,” SIAM Journal on Optimization,
vol. 11, no. 3, pp. 647–674, 2001. []
470. R. Keshavan, A. Montanari, and S. Oh, “Matrix completion from a few entries,” Information
Theory, IEEE Transactions on, vol. 56, no. 6, pp. 2980–2998, 2010. []
471. C. Johnson, “Matrix completion problems: a survey,” in Proceedings of Symposia in Applied
Mathematics, vol. 40, pp. 171–198, 1990. []
472. E. Candes and Y. Plan, “Matrix completion with noise,” Proceedings of the IEEE, vol. 98,
no. 6, pp. 925–936, 2010. [443]
473. A. Chai, M. Moscoso, and G. Papanicolaou, “Array imaging using intensity-only measure-
ments,” Inverse Problems, vol. 27, p. 015005, 2011. [443, 446, 456]
474. L. Tian, J. Lee, S. Oh, and G. Barbastathis, “Experimental compressive phase space
tomography,” Optics Express, vol. 20, no. 8, pp. 8296–8308, 2012. [443, 446, 447, 448,
449, 456]
475. Y. Lu and M. Vetterli, “Sparse spectral factorization: Unicity and reconstruction algorithms,”
in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference
on, pp. 5976–5979, IEEE, 2011. [444]
476. J. Fienup, “Phase retrieval algorithms: a comparison,” Applied Optics, vol. 21, no. 15,
pp. 2758–2769, 1982. [444]
477. A. Sayed and T. Kailath, “A survey of spectral factorization methods,” Numerical linear
algebra with applications, vol. 8, no. 6–7, pp. 467–496, 2001. [444]
478. C. Beck and R. D’Andrea, “Computational study and comparisons of lft reducibility
methods,” in American Control Conference, 1998. Proceedings of the 1998, vol. 2, pp. 1013–
1017, IEEE, 1998. [446]
479. M. Mesbahi and G. Papavassilopoulos, “On the rank minimization problem over a positive
semidefinite linear matrix inequality,” Automatic Control, IEEE Transactions on, vol. 42,
no. 2, pp. 239–243, 1997. [446]
480. K. Toh, M. Todd, and R. Tütüncü, "SDPT3 - a MATLAB software package for semidefinite programming, version 1.3," Optimization Methods and Software, vol. 11, no. 1–4, pp. 545–581, 1999. [446]
481. M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming." Available: http://stanford.edu/~boyd/cvx, 2008. [446]
482. S. Becker, E. Candès, and M. Grant, “Templates for convex cone problems with applications
to sparse signal recovery,” Mathematical Programming Computation, pp. 1–54, 2011. [446]
483. E. Candes, M. Wakin, and S. Boyd, “Enhancing sparsity by reweighted l1 minimization,”
Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 877–905, 2008. [446]
484. M. Fazel, H. Hindi, and S. Boyd, "Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices," in American Control Conference, 2003. Proceedings of the 2003, vol. 3, pp. 2156–2162, IEEE, 2003. [446]
485. M. Fazel, Matrix rank minimization with applications. PhD thesis, Stanford University, 2002. [446]
486. L. Mandel and E. Wolf, Optical Coherence and Quantum Optics. Cambridge University
Press, 1995. [447, 448]
487. Z. Hu, R. Qiu, J. Browning, and M. Wicks, "A novel single-step approach for self-coherent tomography using semidefinite relaxation," IEEE Geoscience and Remote Sensing Letters, to appear. [449]
488. M. Grant and S. Boyd, “Cvx: Matlab software for disciplined convex programming, version
1.21.” http://cvxr.com/cvx, 2010. [453, 556]
489. H. Ohlsson, A. Y. Yang, R. Dong, and S. S. Sastry, “Compressive phase retrieval from squared
output measurements via semidefinite programming,” arXiv preprint arXiv:1111.6323, 2012.
[454]
490. A. Devaney, E. Marengo, and F. Gruber, “Time-reversal-based imaging and inverse scattering
of multiply scattering point targets,” The Journal of the Acoustical Society of America,
vol. 118, pp. 3129–3138, 2005. [454]
491. L. Lo Monte, D. Erricolo, F. Soldovieri, and M. C. Wicks, “Radio frequency tomography
for tunnel detection,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 48, no. 3,
pp. 1128–1137, 2010. [454]
492. O. Klopp, “Noisy low-rank matrix completion with general sampling distribution,” Arxiv
preprint arXiv:1203.0108, 2012. [456]
493. R. Foygel, R. Salakhutdinov, O. Shamir, and N. Srebro, “Learning with the weighted trace-
norm under arbitrary sampling distributions,” Arxiv preprint arXiv:1106.4251, 2011. [456]
494. R. Foygel and N. Srebro, “Concentration-based guarantees for low-rank matrix reconstruc-
tion,” Arxiv preprint arXiv:1102.3923, 2011. [456]
495. V. Koltchinskii and P. Rangel, “Low rank estimation of similarities on graphs,” Arxiv preprint
arXiv:1205.1868, 2012. [456]
496. E. Richard, P. Savalle, and N. Vayatis, “Estimation of simultaneously sparse and low rank
matrices,” in Proceeding of 29th Annual International Conference on Machine Learning,
2012. [456]
497. H. Ohlsson, A. Yang, R. Dong, and S. Sastry, “Compressive phase retrieval from squared
output measurements via semidefinite programming,” Arxiv preprint arXiv:1111.6323, 2011.
[456]
498. A. Fannjiang and W. Liao, “Compressed sensing phase retrieval,” in Proceedings of IEEE
Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, 2011. []
499. A. Fannjiang, “Absolute uniqueness of phase retrieval with random illumination,” Arxiv
preprint arXiv:1110.5097, 2011. []
500. S. Oymak and B. Hassibi, “Recovering jointly sparse signals via joint basis pursuit,” Arxiv
preprint arXiv:1202.3531, 2012. []
501. Z. Wen, C. Yang, X. Liu, and S. Marchesini, “Alternating direction methods for classical and
ptychographic phase retrieval,” []
502. H. Ohlsson, A. Yang, R. Dong, and S. Sastry, “Compressive phase retrieval via lifting,” [456]
503. K. Jaganathan, S. Oymak, and B. Hassibi, “On robust phase retrieval for sparse signals,” in
Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference
on, pp. 794–799, IEEE, 2012. [456]
504. Y. Chen, A. Wiesel, and A. Hero, “Robust shrinkage estimation of high-dimensional
covariance matrices,” Signal Processing, IEEE Transactions on, vol. 59, no. 9, pp. 4097–
4107, 2011. [460]
505. E. Levina and R. Vershynin, “Partial estimation of covariance matrices,” Probability Theory
and Related Fields, pp. 1–15, 2011. [463, 476, 477]
506. S. Marple, Digital Spectral Analysis with Applications. Prentice-Hall, 1987. [468, 469]
507. C. W. Therrien, Discrete Random Signals and Statistical Signal Processing. Prentice-Hall,
1992. [471]
508. H. Xiao and W. Wu, “Covariance matrix estimation for stationary time series,” Arxiv preprint
arXiv:1105.4563, 2011. [474]
509. A. Rohde, “Accuracy of empirical projections of high-dimensional gaussian matrices,” arXiv
preprint arXiv:1107.5481, 2011. [479, 481, 484, 485]
510. K. Jurczak, “A universal expectation bound on empirical projections of deformed random
matrices,” arXiv preprint arXiv:1209.5943, 2012. [479, 483]
511. A. Amini, “High-dimensional principal component analysis,” 2011. [488]
512. L. Vandenberghe and S. Boyd, “Semidefinite programming,” SIAM review, vol. 38, no. 1,
pp. 49–95, 1996. [489]
513. D. Paul and I. M. Johnstone, “Augmented sparse principal component analysis for high
dimensional data,” arXiv preprint arXiv:1202.1242, 2012. [490]
514. Q. Berthet and P. Rigollet, “Optimal detection of sparse principal components in high
dimension,” Arxiv preprint arXiv:1202.5070, 2012. [490, 503, 505, 506, 508, 509]
515. A. d’Aspremont, F. Bach, and L. Ghaoui, “Approximation bounds for sparse principal
component analysis,” Arxiv preprint arXiv:1205.0121, 2012. [490]
516. A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet, "A direct formulation for sparse PCA using semidefinite programming," SIAM Review, vol. 49, 2007. [501, 506]
517. H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” Journal of
computational and graphical statistics, vol. 15, no. 2, pp. 265–286, 2006. [501]
518. P. Bickel and E. Levina, “Covariance regularization by thresholding,” The Annals of Statistics,
vol. 36, no. 6, pp. 2577–2604, 2008. [502]
519. T. Cai, C. Zhang, and H. Zhou, “Optimal rates of convergence for covariance matrix
estimation,” The Annals of Statistics, vol. 38, no. 4, pp. 2118–2144, 2010. []
520. N. El Karoui, “Spectrum estimation for large dimensional covariance matrices using random
matrix theory,” The Annals of Statistics, vol. 36, no. 6, pp. 2757–2790, 2008. [502]
521. S. Geman, “The spectral radius of large random matrices,” The Annals of Probability, vol. 14,
no. 4, pp. 1318–1328, 1986. [502]
522. J. Baik, G. Ben Arous, and S. Péché, “Phase transition of the largest eigenvalue for nonnull
complex sample covariance matrices,” The Annals of Probability, vol. 33, no. 5, pp. 1643–
1697, 2005. [503]
523. T. Tao, “Outliers in the spectrum of iid matrices with bounded rank perturbations,” Probability
Theory and Related Fields, pp. 1–33, 2011. [503]
524. F. Benaych-Georges, A. Guionnet, M. Maida, et al., “Fluctuations of the extreme eigenvalues
of finite rank deformations of random matrices,” Electronic Journal of Probability, vol. 16,
pp. 1621–1662, 2010. [503]
525. F. Bach, S. Ahipasaoglu, and A. d’Aspremont, “Convex relaxations for subset selection,”
Arxiv preprint ArXiv:1006.3601, 2010. [507]
526. E. J. Candès and M. A. Davenport, “How well can we estimate a sparse vector?,” Applied and
Computational Harmonic Analysis, 2012. [509, 510]
527. E. Candes and T. Tao, “The Dantzig Selector: Statistical Estimation When p is much larger
than n,” Annals of Statistics, vol. 35, no. 6, pp. 2313–2351, 2007. [509]
528. F. Ye and C.-H. Zhang, “Rate minimaxity of the lasso and dantzig selector for the lq loss in lr
balls,” The Journal of Machine Learning Research, vol. 9999, pp. 3519–3540, 2010. [510]
529. E. Arias-Castro, “Detecting a vector based on linear measurements,” Electronic Journal of
Statistics, vol. 6, pp. 547–558, 2012. [511, 512, 514]
530. E. Arias-Castro, E. Candes, and M. Davenport, “On the fundamental limits of adaptive
sensing,” arXiv preprint arXiv:1111.4646, 2011. [511]
531. E. Arias-Castro, S. Bubeck, and G. Lugosi, “Detection of correlations,” The Annals of
Statistics, vol. 40, no. 1, pp. 412–435, 2012. [511]
532. E. Arias-Castro, S. Bubeck, and G. Lugosi, “Detecting positive correlations in a multivariate
sample,” 2012. [511, 517]
533. A. B. Tsybakov, Introduction to nonparametric estimation. Springer, 2008. [513, 519]
534. L. Balzano, B. Recht, and R. Nowak, “High-dimensional matched subspace detection when
data are missing,” in Information Theory Proceedings (ISIT), 2010 IEEE International
Symposium on, pp. 1638–1642, IEEE, 2010. [514, 516, 517]
535. L. K. Balzano, Handling missing data in high-dimensional subspace modeling. PhD thesis, University of Wisconsin, 2012. [514]
536. L. L. Scharf, Statistical Signal Processing. Addison-Wesley, 1991. [514]
537. L. L. Scharf and B. Friedlander, “Matched subspace detectors,” Signal Processing, IEEE
Transactions on, vol. 42, no. 8, pp. 2146–2157, 1994. [514]
538. S. Hou, R. C. Qiu, M. Bryant, and M. C. Wicks, “Spectrum Sensing in Cognitive Radio with
Robust Principal Component Analysis,” in IEEE Waveform Diversity and Design Conference,
January 2012. [514]
539. C. McDiarmid, “On the method of bounded differences,” Surveys in combinatorics, vol. 141,
no. 1, pp. 148–188, 1989. [516]
540. M. Azizyan and A. Singh, “Subspace detection of high-dimensional vectors using compres-
sive sampling,” in IEEE SSP, 2012. [516, 517, 518, 519]
541. M. Davenport, M. Wakin, and R. Baraniuk, "Detection and estimation with compressive measurements," Tech. Rep. TREE0610, Rice University ECE Department, 2006. [516]
542. J. Paredes, Z. Wang, G. Arce, and B. Sadler, “Compressive matched subspace detection,”
in Proc. 17th European Signal Processing Conference, Glasgow, Scotland, pp. 120–124,
Citeseer, 2009. []
543. J. Haupt and R. Nowak, “Compressive sampling for signal detection,” in Acoustics, Speech
and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 3,
pp. III–1509, IEEE, 2007. [516]
544. A. James, “Distributions of matrix variates and latent roots derived from normal samples,”
The Annals of Mathematical Statistics, pp. 475–501, 1964. [517]
545. S. Balakrishnan, M. Kolar, A. Rinaldo, and A. Singh, “Recovering block-structured activa-
tions using compressive measurements,” arXiv preprint arXiv:1209.3431, 2012. [520]
546. R. Muirhead, Aspects of Multivariate Statistical Theory. Wiley, 2005. [521]
547. S. Vempala, The random projection method, vol. 65. Amer Mathematical Society, 2005. [521]
548. C. Helstrom, “Quantum detection and estimation theory,” Journal of Statistical Physics,
vol. 1, no. 2, pp. 231–252, 1969. [526]
549. C. Helstrom, “Detection theory and quantum mechanics,” Information and Control, vol. 10,
no. 3, pp. 254–291, 1967. [526]
550. J. Sharpnack, A. Rinaldo, and A. Singh, “Changepoint detection over graphs with the spectral
scan statistic,” arXiv preprint arXiv:1206.0773, 2012. [526]
551. D. Ramírez, J. Vía, I. Santamaría, and L. Scharf, "Locally most powerful invariant tests for correlation and sphericity of Gaussian vectors," arXiv preprint arXiv:1204.5635, 2012. [526]
552. A. Onatski, M. Moreira, and M. Hallin, “Signal detection in high dimension: The multispiked
case,” arXiv preprint arXiv:1210.5663, 2012. [526]
553. A. Nemirovski, “On tractable approximations of randomly perturbed convex constraints,” in
Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, vol. 3, pp. 2419–2422,
IEEE, 2003. [527]
554. A. Nemirovski, "Regular Banach spaces and large deviations of random sums," Paper in progress, E-print: http://www2.isye.gatech.edu/~nemirovs, 2004. [528]
555. D. De Farias and B. Van Roy, “On constraint sampling in the linear programming approach
to approximate dynamic programming,” Mathematics of operations research, vol. 29, no. 3,
pp. 462–478, 2004. [528]
556. G. Calafiore and M. Campi, “Uncertain convex programs: randomized solutions and confi-
dence levels,” Mathematical Programming, vol. 102, no. 1, pp. 25–46, 2005. []
557. E. Erdoğan and G. Iyengar, “Ambiguous chance constrained problems and robust optimiza-
tion,” Mathematical Programming, vol. 107, no. 1, pp. 37–61, 2006. []
558. M. Campi and S. Garatti, “The exact feasibility of randomized solutions of uncertain convex
programs,” SIAM Journal on Optimization, vol. 19, no. 3, pp. 1211–1230, 2008. [528]
559. A. Nemirovski and A. Shapiro, “Convex approximations of chance constrained programs,”
SIAM Journal on Optimization, vol. 17, no. 4, pp. 969–996, 2006. [528, 542]
560. A. Nemirovski, “Sums of random symmetric matrices and quadratic optimization under
orthogonality constraints,” Mathematical programming, vol. 109, no. 2, pp. 283–317, 2007.
[529, 530, 531, 532, 533, 534, 536, 538, 539, 540, 542]
561. S. Janson, “Large deviations for sums of partly dependent random variables,” Random
Structures & Algorithms, vol. 24, no. 3, pp. 234–248, 2004. [542]
562. S. Cheung, A. So, and K. Wang, “Chance-constrained linear matrix inequalities with
dependent perturbations: A safe tractable approximation approach,” Preprint, 2011. [542,
543, 547]
563. A. Ben-Tal and A. Nemirovski, “On safe tractable approximations of chance-constrained
linear matrix inequalities,” Mathematics of Operations Research, vol. 34, no. 1, pp. 1–25,
2009. [542]
564. A. Gershman, N. Sidiropoulos, S. Shahbazpanahi, M. Bengtsson, and B. Ottersten, “Convex
optimization-based beamforming,” Signal Processing Magazine, IEEE, vol. 27, no. 3, pp. 62–
75, 2010. [544]
565. Z. Luo, W. Ma, A. So, Y. Ye, and S. Zhang, “Semidefinite relaxation of quadratic optimization
problems,” Signal Processing Magazine, IEEE, vol. 27, no. 3, pp. 20–34, 2010. [544, 545]
566. E. Karipidis, N. Sidiropoulos, and Z. Luo, “Quality of service and max-min fair transmit
beamforming to multiple cochannel multicast groups,” Signal Processing, IEEE Transactions
on, vol. 56, no. 3, pp. 1268–1279, 2008. [545]
567. Z. Hu, R. Qiu, and J. P. Browning, "Joint Amplify-and-Forward Relay and Cooperative Jamming with Probabilistic Security Consideration," 2013. Submitted to IEEE Communications Letters. [547]
568. H.-M. Wang, M. Luo, X.-G. Xia, and Q. Yin, “Joint cooperative beamforming and jamming
to secure af relay systems with individual power constraint and no eavesdropper’s csi,” Signal
Processing Letters, IEEE, vol. 20, no. 1, pp. 39–42, 2013. [547]
569. Y. Yang, Q. Li, W.-K. Ma, J. Ge, and P. Ching, “Cooperative secure beamforming for af
relay networks with multiple eavesdroppers,” Signal Processing Letters, IEEE, vol. 20, no. 1,
pp. 35–38, 2013. [547]
570. D. Ponukumati, F. Gao, and C. Xing, “Robust peer-to-peer relay beamforming: A probabilistic
approach,” 2013. [548, 556]
571. S. Kandukuri and S. Boyd, “Optimal power control in interference-limited fading wireless
channels with outage-probability specifications,” Wireless Communications, IEEE Transac-
tions on, vol. 1, no. 1, pp. 46–55, 2002. [552]
572. S. Ma and D. Sun, “Chance constrained robust beamforming in cognitive radio networks,”
2013. [552]
573. D. Zhang, Y. Hu, J. Ye, X. Li, and X. He, “Matrix completion by truncated nuclear norm
regularization,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference
on, pp. 2192–2199, IEEE, 2012. [555]
574. K.-Y. Wang, T.-H. Chang, W.-K. Ma, A.-C. So, and C.-Y. Chi, “Probabilistic sinr constrained
robust transmit beamforming: A bernstein-type inequality based conservative approach,” in
Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on,
pp. 3080–3083, IEEE, 2011. [556]
575. K.-Y. Wang, T.-H. Chang, W.-K. Ma, and C.-Y. Chi, “A semidefinite relaxation based
conservative approach to robust transmit beamforming with probabilistic sinr constraints,”
in Proc. EUSIPCO, pp. 23–27, 2010. [556]
576. K. Yang, J. Huang, Y. Wu, X. Wang, and M. Chiang, “Distributed robust optimization (dro)
part i: Framework and example,” 2008. [556]
577. K. Yang, Y. Wu, J. Huang, X. Wang, and S. Verdú, “Distributed robust optimization
for communication networks,” in INFOCOM 2008. The 27th Conference on Computer
Communications. IEEE, pp. 1157–1165, IEEE, 2008. [556]
578. M. Chen and M. Chiang, “Distributed optimization in networking: Recent advances in com-
binatorial and robust formulations,” Modeling and Optimization: Theory and Applications,
pp. 25–52, 2012. [556]
579. G. Calafiore, F. Dabbene, and R. Tempo, “Randomized algorithms in robust control,” in
Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, vol. 2, pp. 1908–1913,
IEEE, 2003. [556]
580. R. Tempo, G. Calafiore, and F. Dabbene, Randomized algorithms for analysis and control of
uncertain systems. Springer, 2004. []
581. X. Chen, J. Aravena, and K. Zhou, “Risk analysis in robust control-making the case for
probabilistic robust control,” in American Control Conference, 2005. Proceedings of the 2005,
pp. 1533–1538, IEEE, 2005. []
582. Z. Zhou and R. Cogill, “An algorithm for state constrained stochastic linear-quadratic
control,” in American Control Conference (ACC), 2011, pp. 1476–1481, IEEE, 2011. [556]
583. A. M.-C. So and Y. J. A. Zhang, "Distributionally robust slow adaptive OFDMA with soft QoS via linear programming," IEEE J. Sel. Areas Commun. [Online]. Available: http://www1.se.cuhk.edu.hk/~manchoso. [558]
584. C. Boutsidis and A. Gittens, "Improved matrix algorithms via the subsampled randomized Hadamard transform," arXiv preprint arXiv:1204.0062, 2012. [559, 560]
585. A. Gittens, "Var (Xjk)," 2012. [559]
586. M. Magdon-Ismail, “Using a non-commutative bernstein bound to approximate some matrix
algorithms in the spectral norm,” Arxiv preprint arXiv:1103.5453, 2011. [560, 561]
587. M. Magdon-Ismail, “Row sampling for matrix algorithms via a non-commutative bernstein
bound,” Arxiv preprint arXiv:1008.0587, 2010. [561]
588. C. Faloutsos, T. Kolda, and J. Sun, “Mining large time-evolving data using matrix and tensor
tools,” in ICDM Conference, 2007. [565]
589. M. Mahoney, “Randomized algorithms for matrices and data,” Arxiv preprint
arXiv:1104.5557, 2011. [566]
590. R. Ranganathan, R. Qiu, Z. Hu, S. Hou, Z. Chen, M. Pazos-Revilla, and N. Guo, Communica-
tion and Networking in Smart Grids, ch. Cognitive Radio Network for Smart Grid. Auerbach
Publications, Taylor & Francis Group, CRC, 2013. [569]
591. R. C. Qiu, Z. Chen, N. Guo, Y. Song, P. Zhang, H. Li, and L. Lai, “Towards a real-time
cognitive radio network testbed: architecture, hardware platform, and application to smart
grid,” in Networking Technologies for Software Defined Radio (SDR) Networks, 2010 Fifth
IEEE Workshop on, pp. 1–6, IEEE, 2010. [572]
592. R. Qiu, Z. Hu, Z. Chen, N. Guo, R. Ranganathan, S. Hou, and G. Zheng, “Cognitive radio
network for the smart grid: Experimental system architecture, control algorithms, security,
and microgrid testbed,” Smart Grid, IEEE Transactions on, no. 99, pp. 1–18, 2011. [574]
593. R. Qiu, C. Zhang, Z. Hu, and M. Wicks, "Towards a large-scale cognitive radio network testbed: Spectrum sensing, system architecture, and distributed sensing," Journal of Communications, vol. 7, pp. 552–566, July 2012. [572]
594. E. Estrada, The structure of complex networks: theory and applications. Oxford University
Press, 2012. [574]
595. M. Newman, Networks: An Introduction. Oxford University Press, 2010. []
596. M. Newman, “The structure and function of complex networks,” SIAM review, vol. 45, no. 2,
pp. 167–256, 2003. []
597. M. Newman and M. Girvan, “Finding and evaluating community structure in networks,”
Physical review E, vol. 69, no. 2, p. 026113, 2004. []
598. S. Strogatz, “Exploring complex networks,” Nature, vol. 410, no. 6825, pp. 268–276, 2001.
[574]
599. A. Barrat, M. Barthelemy, and A. Vespignani, Dynamical processes on complex networks.
Cambridge University Press, 2008. [574]
600. M. Franceschetti and R. Meester, Random networks for communication: from statistical
physics to information systems. Cambridge University Press, 2007. [574]
601. P. Van Mieghem, Graph spectra for complex networks. Cambridge University Press, 2011.
[574, 575]
602. D. Cvetković, M. Doob, and H. Sachs, Spectra of graphs: theory and applications. Academic Press, 1980. [575, 576]
603. C. Godsil and G. Royle, Algebraic graph theory. Springer, 2001. [575]
604. F. Chung and M. Radcliffe, “On the spectra of general random graphs,” the electronic journal
of combinatorics, vol. 18, no. P215, p. 1, 2011. [575]
605. R. Oliveira, “Concentration of the adjacency matrix and of the laplacian in random graphs
with independent edges,” arXiv preprint arXiv:0911.0600, 2009. []
606. R. Oliveira, “The spectrum of random k-lifts of large graphs (with possibly large k),” arXiv
preprint arXiv:0911.4741, 2009. []
607. A. Gundert and U. Wagner, “On laplacians of random complexes,” in Proceedings of the 2012
symposuim on Computational Geometry, pp. 151–160, ACM, 2012. []
608. L. Lu and X. Peng, “Loose laplacian spectra of random hypergraphs,” Random Structures &
Algorithms, 2012. [575]
609. F. Chung, Spectral graph theory. American Mathematical Society, 1997. [576]
610. B. Osting, C. Brune, and S. Osher, “Optimal data collection for improved rankings expose
well-connected graphs,” arXiv preprint arXiv:1207.6430, 2012. [576]
Index
E
Efron-Stein inequality, 17
Eigenvalues
  chaining approach, 192
  general random matrices, 193–194
  GUE, 190
  Lipschitz mapping, 202–203
  matrix function approximation
    matrix Taylor series (see Matrix Taylor series)
    polynomial/rational function, 211
  and norms supremum representation
    Cartesian decomposition, 199
    Frobenius norm, 200
    inner product, 199, 200
    matrix exponential, 201–202
    Neumann series, 200
    norm of matrix, 199
    unitary invariant, 200
  quadratic forms (see Quadratic forms)
  random matrices (see Random matrices)
  random vector and subspace distance
    Chebyshev's inequality, 260–261
    concentration around median, 258
    orthogonal projection matrix, 257–260
    product probability, 258
    similarity based hypothesis detection, 261
    Talagrand's inequality, 258, 259
  smoothness and convexity
    Cauchy-Schwartz's inequality, 204, 208
    Euclidean norm, 206
    function of matrix, 203
    Guionnet and Zeitoumi lemma, 204
    Ky Fan inequality, 206–207
    Lidskii theorem, 204–205
    linear trace function, 205
    moments of random matrices, 210
    standard hypothesis testing problem, 208–209
  supremum of random process, 267–268
  symmetric random matrix, 244–245
  Talagrand concentration inequality, 191–192, 216–218
  trace functionals, 243–244
  variational characterization, 190
Empirical spectral distribution (ESD), 352–353
ε-Nets arguments, 67–68
Euclidean norm, 21, 200, 202, 206, 217, 246, 257, 523
Euclidean scalar product, 246, 506

F
Feature-based detection, 177
Fourier transform, 6–10, 54, 77, 261, 352, 354, 394, 395, 400, 444, 447, 448, 474
Free probability
  convolution, 364–365
  definitions and properties, 360–362
  free independence, 362–364
  large n limit, 358
  measure theory, 358
  non-commutative probability, 359
  practical significance, 359–360
  random matrix theory, 359
Frobenius norms, 21, 22, 156–158, 180, 200, 202, 206, 217, 246, 256–257, 305, 523
Fubini's inequality, 51
Fubini's theorem, 286, 513, 535
Fubini-Tonelli theorem, 23, 26

G
Gaussian and Wishart random matrices, 173–180
Gaussian concentration inequality, 155–156
Gaussian orthogonal ensembles (GOE), 161
Gaussian random matrix (GRM), 55, 56, 178, 192, 342, 353, 361, 373, 498, 500
Gaussian unitary ensemble (GUE), 190, 246
Geometric functional analysis, 271
Global positioning system (GPS), 573
GOE. See Gaussian orthogonal ensembles (GOE)
Golden-Thompson inequality, 38–39, 86, 91, 322, 496
Gordon's inequality, 168–170
Gordon-Slepian lemma, 182
Gordon's theorem, 313
GPS. See Global positioning system (GPS)
Gram matrix, 18, 288–289
Greedy algorithms, 372, 384–385
GRM. See Gaussian random matrix (GRM)
Gross, Liu, Flammia, Becker, and Eisert derivation, 106
Guionnet and Zeitouni theorem, 219

H
Haar distribution, 366, 521
Hamming metric, 150
Hanson-Wright inequality, 380, 383
Harvey's derivation
  Ahlswede-Winter inequality, 87–89
  Rudelson's theorem, 90–91
Heavy-tailed rows
  expected singular values, 319
  isotropic vector assumption, 317
  linear operator, 317
  non-commutative Bernstein inequality, 317–318
  non-isotropic, 318–319
Hermitian Gaussian random matrices (HGRM)
  m, n, σ², 59–60
  n, σ², 55–59
Hermitian matrix, 17–18, 85, 86, 361–362
Hermitian positive semi-definite matrix, 265–266
HGRM. See Hermitian Gaussian random matrices (HGRM)
High dimensional data processing, 459
High-dimensional matched subspace detection
  Balzano, Recht, and Nowak theorem, 516
  binary hypothesis test, 514
  coherence of subspace, 515
  projection operator, 515
High-dimensional statistics, 416–417
High-dimensional vectors
  Arias-Castro theorem, 512–514
  average Bayes risk, 511
  operator norm, 512
  subspace detection, compressive sensing
    Azizyan and Singh theorem, 518–519
    composite hypothesis test, 517
    hypothesis test, 516
    low-dimensional subspace, 516
    observation vector, 517
    projection operator, 517
  worst-case risk, 511
High-dimensions
  PCA (see Principal component analysis (PCA))
  two-sample test
    Haar distribution, 521
    Hotelling T² statistic, 520–521
    hypothesis testing problem, 520
    Kullback-Leibler (KL) divergence, 521
    Lopes, Jacob and Wainwright theorem, 522–525
    random-random projection method, 521
Hilbert-Schmidt inner product, 436
Hilbert-Schmidt norm, 200, 202, 206, 217, 246, 479, 482, 523
Hilbert-Schmidt scalar product, 159
Hoeffding's inequality, 12–14, 149–150
Hölder's inequality, 26, 303
Homogeneous linear constraints, 540
Hypothesis detection, 525–526

I
Information plus noise model
  sums of random matrices, 492–494
  sums of random vectors, 491–492
Intrusion/activity detection, 459
Isometries, 46, 314, 385, 413, 414

J
Jensen's inequality, 30, 45, 210, 256, 513
Johnson–Lindenstrauss (JL) lemma
  Bernoulli random variables, 376
  Chernoff inequalities, 376
  circulant matrices, 384
  Euclidean space, 374
  Gaussian random variables, 376
  k-dimensional signals, 377
  linear map, 375
  Lipschitz map, 374–375
  random matrices, 375
  union bound, 377

K
Kernel-based learning, 289
Kernel space, 491
Khinchin's inequality, 64, 70–71
Khintchine's inequality, 140–144, 381, 534
Klein's inequality, 40–41
Kullback-Leibler (KL) divergence, 510, 511, 513, 521
Ky Fan maximum principle, 201

L
Laguerre orthogonal ensemble (LOE), 348
Laguerre unitary ensemble (LUE), 348
Laplace transform, 6, 8, 61, 63, 64, 108, 130–131, 147
Laplacian matrix, 575–576
Large random matrices, 351–352
  data sets matrix, 567
  eigenvalues, 568
  linear spectral statistics (see Linear spectral statistics)
  measure phenomenon, 567
  Talagrand's concentration inequality, 568
  trace functions, 249
Liapounov coefficient, 77
Lieb's theorem, 42–43
Limit distribution laws, 352
Linear bounded and compact operators, 79–80
Linear filtering, 134–136
Linear functionals, 134, 146, 151, 153, 253, 264, 276, 302, 307, 504