
Robert Qiu · Michael Wicks

Cognitive Networked Sensing and Big Data

Robert Qiu
Tennessee Technological University
Cookeville, Tennessee, USA

Michael Wicks
Utica, NY, USA

ISBN 978-1-4614-4543-2 ISBN 978-1-4614-4544-9 (eBook)


DOI 10.1007/978-1-4614-4544-9
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2013942391

© Springer Science+Business Media New York 2014


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To
Lily Liman Li
Preface

The idea of writing this book entitled “Cognitive Networked Sensing and Big Data”
started with the plan to write a briefing book on wireless distributed computing
and cognitive sensing. During our research on a large-scale cognitive radio network (and its experimental testbed), we realized that big data played a central role. As a result, the book project reflects this paradigm shift. In this context, sensing is roughly equivalent to "measurement."
We attempt to answer the following basic questions. How do we sense the radio environment using a large-scale network? What is unique to cognitive radio? What do we do with the big data? How does the sample size affect the sensing accuracy?
To address these questions, we are naturally led to ask ourselves: What mathematical tools are required? What is the state of the art for these analytical tools? How are these tools used?
Our prerequisite is the graduate-level course on random variables and processes.
Some familiarity with wireless communication and signal processing is useful.
This book is complementary to our previous book "Cognitive Radio Communications and Networking: Principles and Practice" (John Wiley and Sons 2012). It is also complementary to another book by the first author, "Introduction to Smart Grid" (John Wiley and Sons 2014). The current book can be viewed as providing the mathematical tools for those two Wiley books.
Chapter 1 provides the necessary background to support the rest of the book.
No attempt has been made to make this book fully self-contained. The book surveys many of the latest results in the literature, and we often include preliminary tools from the original publications. These preliminary tools may still be too difficult for much of the audience. Roughly, our prerequisite is the graduate-level course on random variables and processes.
Chapters 2–5 (Part I) are the core of this book. The contents of these chapters
should be new to most graduate students in electrical and computer engineering.
Chapter 2 deals with the sum of matrix-valued random variables. One basic
question is “how does the sample size affect the accuracy.” The basic building block
of the data is the sample covariance matrix, which is a random matrix. Bernstein-
type concentration inequalities are of special interest.


Chapter 3 collects together the deepest mathematical theorems in this book.
This chapter is really the departure point of the whole book. Chapter 2 is placed before this chapter because we want the audience to first understand how to deal with the basic linear functions of matrices. The theory of concentration inequalities tries to answer the following question: given a random vector x taking values in some measurable space X (which is usually some high-dimensional Euclidean space), and a measurable map f : X → R, what is a good explicit bound on P(|f(x) − Ef(x)| ≥ t)? Exact evaluation or accurate approximation is, of course, the central purpose of probability theory itself. In situations where exact evaluation or accurate approximation is not possible, which is the case for many practical problems, concentration inequalities aim to do the next best job by providing rapidly decaying tail bounds. The goal of this book is to systematically deal with this "next best job" when classical probability theory fails to apply in these situations.
The sum of random matrices is a sum of linear matrix functions. Non-linear
matrix functions are encountered in practice. This motivates us to study, in Chap. 4,
the concentration of measure phenomenon, unique to high dimensions. The so-
called Lipschitz functions of matrices, such as eigenvalues, are the mathematical objects of interest.
Chapter 5 is the culmination of the theoretical development of random matrix theory in this book. The goal of this chapter is to survey the latest results in the mathematical literature, and we have tried to be exhaustive about recent results. To the best of our knowledge, these results have never been used in engineering applications. Although the prerequisites for this chapter are highly demanding, it is our belief that the pay-off will be significant for engineering graduates who manage to understand the chapter.
Chapter 6 is included for completeness, with the major goal of letting readers compare its parallel results with those of Chap. 5. Our book "Cognitive Radio Communications and Networking: Principles and Practice" (John Wiley and Sons 2012) contains 230 pages of complementary material on this subject.
In Part II, we attempt to apply these mathematical tools to different applications.
The emphasis is on the connection between the theory and the diverse applications.
No attempt is made to collect all the scattered results in one place.
Chapter 7 deals with compressed sensing and recovery of sparse vectors. Concentration inequalities play the central role in sparse recovery. The so-called restricted isometry property for sensing matrices is another way of stating concentration of measure.
A matrix can be decomposed into its eigenvalues and the corresponding eigenvectors. When the matrix has low rank, we can equivalently say that the vector of eigenvalues is sparse. Chapter 8 deals with this aspect in the context of concentration of measure.
Statistics starts with covariance matrix estimation. In Chap. 9, we deal with this problem in high dimensions. We view compressed sensing and low-rank matrix recovery as more basic than covariance matrix estimation.
Once the covariance matrix is estimated, we can apply the statistical information to different applications. In Chap. 10, we apply the covariance matrix to hypothesis detection in high dimensions. During the study of the information plus noise model, the low-rank structure is explicitly exploited. This is one justification for putting low-rank matrix recovery (Chap. 8) before this chapter. A modern trend is to exploit the structure of the data (sparsity and low rank) in detection theory. The research in this direction is growing rapidly; indeed, we survey some of the latest results in this chapter.
An unexpected chapter is Chap. 11 on probability constrained optimization. Due to recent progress (as recently as 2003, by Nemirovski), optimization with probabilistic constraints, often regarded as computationally intractable in the past, may be formulated in terms of deterministic convex problems that can be solved using modern convex optimization solvers. The "closed-form" Bernstein concentration inequalities play a central role in this formulation.
In Chap. 12, we show how concentration inequalities play a central role in database friendly data processing, such as low rank matrix approximation. We only want to point out the connection.
Chapter 13 is designed to put all the pieces together; it could equally have been placed as Chap. 1. We can see that many problems remain open, and we have only touched the tip of the iceberg of big data. Chapter 1 also gives motivation for the other chapters of this book.
The first author wants to thank the students of his Fall 2012 course, ECE 7970 Random Matrices, Concentration and Networking; their comments greatly improved this book. We also want to thank the PhD students at TTU for their help in proofreading: Jason Bonior, Shujie Hou, Xia Li, Feng Lin, and Changchun Zhang. The simulations made by Feng Lin, indeed, shaped the conception and formulation of many parts of this book, in particular on hypothesis detection. Dr. Zhen Hu and Dr. Nan Guo at TTU are also thanked for their helpful discussions. The first author's research collaborator Professor Husheng Li (University of Tennessee at Knoxville) is acknowledged for many inspiring discussions.
The first author's work has been supported, for many years, by the Office of Naval Research (ONR) through the program manager Dr. Santanu K. Das (code 31). Our friend Paul James Browning was instrumental in making this book possible. This work is partly funded by the National Science Foundation (NSF) through two grants (ECCS-0901420 and CNS-1247778), the Office of Naval Research through two grants (N00010-10-10810 and N00014-11-1-0006), and the Air Force Office of Scientific Research, via a local contractor (prime contract number FA8650-10-D-1750-Task 4). Some parts of this book were finished while the first author was a visiting professor during the summer of 2012 at the Centre for Quantifiable Quality of Service in Communication Systems (Q2S), the Norwegian University of Science and Technology (NTNU), Trondheim, Norway. Many discussions with his host Professor Yuming Jiang are acknowledged here.
The authors want to thank our editor Brett Kurzman at Springer (US) for his
interest in this book. We acknowledge Rebecca Hytowitz at Springer (US) for her
help.
The first author wants to thank his mentors who, for the first time, exposed him
to many subjects of this book: Weigan Lin (UESTC, China) on remote sensing,
Zhengde Wu (UESTC, China) on electromagnetics, Shuzhang Liu (UESTC, China)
on electromagnetic materials, I-Tai Lu (Polytechnic Institute of NYU) on radio
propagation, Lawrence Carin (Duke) on physics-based signal processing, Leopold
Felsen on scattering, and Henry Bertoni on radio propagation in wireless channels.
His industrial colleagues at GTE Labs (now Verizon Wireless) and Bell Labs
(Alcatel-Lucent) greatly shaped his view.
Last, but not least, the first author wants to thank his wife Lily Liman Li for her love, encouragement, and support, which sustained him during the lonely (but exciting) book-writing journey; she is always there. His children Michelle, David and Jackie light his life. This is, indeed, a special moment to record a personal note: this summer his eldest daughter Michelle is going to drive on her own, David is happy in his high school, and Jackie smiles again. This summer also marks one decade since his move back to academia, after spending 8 years in industry. Writing books like this was among the deepest wishes that drove him to read mathematical papers and books while watching TV for the last two decades. Finally, he remembers his mother Suxian Li, who passed away in 2006. His father Dafu Qiu lives in China. His parents-in-law Lumei Li and Jinxue Chen have lived with him for many years. Their love and daily encouragement really make a difference in his life.
Contents

Part I Theory

1 Mathematical Foundation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Basic Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Union Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Independence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Pairs of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.4 The Markov and Chebyshev Inequalities
and Chernoff Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5 Characteristic Function and Fourier Transform . . . . . . . . 6
1.1.6 Laplace Transform of the pdf . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.7 Probability Generating Function . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Sums of Independent (Scalar-Valued) Random
Variables and Central Limit Theorem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Sums of Independent (Scalar-Valued) Random
Variables and Classical Deviation Inequalities:
Hoeffding, Bernstein, and Efron-Stein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Transform of Probability Bounds
to Expectation Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.2 Hoeffding’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.3 Bernstein’s Inequality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.4 Efron-Stein Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4 Probability and Matrix Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.1 Eigenvalues, Trace and Sums of Hermitian
Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2 Positive Semidefinite Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.3 Partial Ordering of Positive Semidefinite Matrices . . . . 19
1.4.4 Definitions of f (A) for Arbitrary f . . . . . . . . . . . . . . . . . . . . 20
1.4.5 Norms of Matrices and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4.6 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4.7 Moments and Tails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26


1.4.8 Random Vector and Jensen’s Inequality . . . . . . . . . . . . . . . . 29


1.4.9 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.4.10 Sums of Independent Scalar-Valued
Random Variables: Chernoff’s Inequality . . . . . . . . . . . . . . 30
1.4.11 Extensions of Expectation to Matrix-Valued
Random Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.4.12 Eigenvalues and Spectral Norm . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.4.13 Spectral Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.4.14 Operator Convexity and Monotonicity. . . . . . . . . . . . . . . . . . 35
1.4.15 Convexity and Monotonicity for Trace Functions . . . . . . 36
1.4.16 The Matrix Exponential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.4.17 Golden-Thompson Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.4.18 The Matrix Logarithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.4.19 Quantum Relative Entropy and Bregman
Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.4.20 Lieb’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.4.21 Dilations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.4.22 The Positive Semi-definite Matrices
and Partial Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.4.23 Expectation and the Semidefinite Order . . . . . . . . . . . . . . . . 45
1.4.24 Probability with Matrices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.4.25 Isometries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.4.26 Courant-Fischer Characterization of Eigenvalues . . . . . . 46
1.5 Decoupling from Dependance to Independence . . . . . . . . . . . . . . . . . . . 47
1.6 Fundamentals of Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.6.1 Fourier Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.6.2 The Moment Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.6.3 Expected Moments of Random Matrices
with Complex Gaussian Entries . . . . . . . . . . . . . . . . . . . . . . . . . 55
1.6.4 Hermitian Gaussian Random Matrices
HGRM(n, σ 2 ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
1.6.5 Hermitian Gaussian Random Matrices
GRM(m, n, σ 2 ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
1.7 Sub-Gaussian Random Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
1.8 Sub-Gaussian Random Vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.9 Sub-exponential Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
1.10 ε-Nets Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
1.11 Rademacher Averages and Symmetrization . . . . . . . . . . . . . . . . . . . . . . . 69
1.12 Operators Acting on Sub-Gaussian Random Vectors . . . . . . . . . . . . . 72
1.13 Supremum of Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
1.14 Bernoulli Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
1.15 Converting Sums of Random Matrices into Sums
of Random Vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
1.16 Linear Bounded and Compact Operators . . . . . . . . . . . . . . . . . . . . . . . . . . 79
1.17 Spectrum for Compact Self-Adjoint Operators. . . . . . . . . . . . . . . . . . . . 80

2 Sums of Matrix-Valued Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85


2.1 Methodology for Sums of Random Matrices . . . . . . . . . . . . . . . . . . . . . . 85
2.2 Matrix Laplace Transform Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.2.1 Method 1—Harvey’s Derivation . . . . . . . . . . . . . . . . . . . . . . . . 87
2.2.2 Method 2—Vershynin’s Derivation . . . . . . . . . . . . . . . . . . . . . 91
2.2.3 Method 3—Oliveria’s Derivation . . . . . . . . . . . . . . . . . . . . . . . 94
2.2.4 Method 4—Ahlswede-Winter’s Derivation. . . . . . . . . . . . . 95
2.2.5 Derivation Method 5—Gross, Liu,
Flammia, Becker, and Eisert . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.2.6 Method 6—Recht’s Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.2.7 Method 7—Derivation by Wigderson and Xiao . . . . . . . . 107
2.2.8 Method 8—Tropp’s Derivation. . . . . . . . . . . . . . . . . . . . . . . . . . 107
2.3 Cumulate-Based Matrix-Valued Laplace Transform Method. . . . . 108
2.4 The Failure of the Matrix Generating Function . . . . . . . . . . . . . . . . . . . 109
2.5 Subadditivity of the Matrix Cumulant Generating Function . . . . . . 110
2.6 Tail Bounds for Independent Sums . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
2.6.1 Comparison Between Tropp’s Method
and Ahlswede–Winter Method . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.7 Matrix Gaussian Series—Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.8 Application: A Gaussian Matrix with Nonuniform Variances . . . . 118
2.9 Controlling the Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
2.10 Sums of Random Positive Semidefinite Matrices . . . . . . . . . . . . . . . . . 121
2.11 Matrix Bennett and Bernstein Inequalities. . . . . . . . . . . . . . . . . . . . . . . . . 125
2.12 Minimax Matrix Laplace Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
2.13 Tail Bounds for All Eigenvalues of a Sum of Random
Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
2.14 Chernoff Bounds for Interior Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . 131
2.15 Linear Filtering Through Sums of Random Matrices . . . . . . . . . . . . . 134
2.16 Dimension-Free Inequalities for Sums of Random Matrices . . . . . 137
2.17 Some Khintchine-Type Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
2.18 Sparse Sums of Positive Semi-definite Matrices . . . . . . . . . . . . . . . . . . 144
2.19 Further Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
3 Concentration of Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
3.1 Concentration of Measure Phenomenon . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
3.2 Chi-Square Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
3.3 Concentration of Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.4 Slepian-Fernique Lemma and Concentration
of Gaussian Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
3.5 Dudley’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.6 Concentration of Induced Operator Norms . . . . . . . . . . . . . . . . . . . . . . . . 165
3.7 Concentration of Gaussian and Wishart Random Matrices . . . . . . . 173
3.8 Concentration of Operator Norms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
3.9 Concentration of Sub-Gaussian Random Matrices . . . . . . . . . . . . . . . . 185

3.10 Concentration for Largest Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190


3.10.1 Talagrand’s Inequality Approach . . . . . . . . . . . . . . . . . . . . . . . 191
3.10.2 Chaining Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
3.10.3 General Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
3.11 Concentration for Projection of Random Vectors . . . . . . . . . . . . . . . . . 194
3.12 Further Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
4 Concentration of Eigenvalues and Their Functionals . . . . . . . . . . . . . . . . . . 199
4.1 Supremum Representation of Eigenvalues and Norms. . . . . . . . . . . . 199
4.2 Lipschitz Mapping of Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
4.3 Smoothness and Convexity of the Eigenvalues
of a Matrix and Traces of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
4.4 Approximation of Matrix Functions Using Matrix
Taylor Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.5 Talagrand Concentration Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
4.6 Concentration of the Spectral Measure for Wigner
Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
4.7 Concentration of Noncommutative Polynomials
in Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
4.8 Concentration of the Spectral Measure for Wishart
Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
4.9 Concentration for Sums of Two Random Matrices. . . . . . . . . . . . . . . . 235
4.10 Concentration for Submatrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.11 The Moment Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
4.12 Concentration of Trace Functionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
4.13 Concentration of the Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
4.14 Concentration for Functions of Large Random
Matrices: Linear Spectral Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
4.15 Concentration of Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
4.16 Distance Between a Random Vector and a Subspace . . . . . . . . . . . . . 257
4.17 Concentration of Random Matrices in the Stieltjes
Transform Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
4.18 Concentration of von Neumann Entropy Functions . . . . . . . . . . . . . . . 264
4.19 Supremum of a Random Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
4.20 Further Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
5 Non-asymptotic, Local Theory of Random Matrices . . . . . . . . . . . . . . . . . . . 271
5.1 Notation and Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
5.2 Isotropic Convex Bodies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.3 Log-Concave Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
5.4 Rudelson’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
5.5 Sample Covariance Matrices with Independent Rows . . . . . . . . . . . . 281

5.6 Concentration for Isotropic, Log-Concave Random Vectors . . . . . 290


5.6.1 Paouris’ Concentration Inequality . . . . . . . . . . . . . . . . . . . . . . 290
5.6.2 Non-increasing Rearrangement and Order
Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
5.6.3 Sample Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
5.7 Concentration Inequality for Small Ball Probability . . . . . . . . . . . . . . 295
5.8 Moment Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
5.8.1 Moments for Isotropic Log-Concave
Random Vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
5.8.2 Moments for Convex Measures . . . . . . . . . . . . . . . . . . . . . . . . . 303
5.9 Law of Large Numbers for Matrix-Valued Random Variables . . . 305
5.10 Low Rank Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
5.11 Random Matrices with Independent Entries . . . . . . . . . . . . . . . . . . . . . . . 313
5.12 Random Matrices with Independent Rows . . . . . . . . . . . . . . . . . . . . . . . . 314
5.12.1 Independent Rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
5.12.2 Heavy-Tailed Rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
5.13 Covariance Matrix Estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
5.13.1 Estimating the Covariance of Random Matrices . . . . . . . 322
5.14 Concentration of Singular Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
5.14.1 Sharp Small Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
5.14.2 Sample Covariance Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
5.14.3 Tall Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
5.14.4 Almost Square Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
5.14.5 Square Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
5.14.6 Rectangular Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
5.14.7 Products of Random and Deterministic Matrices . . . . . . 329
5.14.8 Random Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
5.15 Invertibility of Random Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
5.16 Universality of Singular Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
5.16.1 Random Matrix Plus Deterministic Matrix . . . . . . . . . . . . . 341
5.16.2 Universality for Covariance and Correlation
Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
5.17 Further Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
6 Asymptotic, Global Theory of Random Matrices . . . . . . . . . . . . . . . . . . . . . . . 351
6.1 Large Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
6.2 The Limit Distribution Laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
6.3 The Moment Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
6.4 Stieltjes Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
6.5 Free Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
6.5.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
6.5.2 Practical Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
6.5.3 Definitions and Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . 360

6.5.4 Free Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362


6.5.5 Free Convolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
6.6 Tables for Stieltjes, R- and S-Transforms. . . . . . . . . . . . . . . . . . . . . . . . . . 365

Part II Applications

7 Compressed Sensing and Sparse Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371


7.1 Compressed Sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
7.2 Johnson–Lindenstrauss Lemma and Restricted
Isometry Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
7.3 Structured Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
7.3.1 Partial Random Fourier Matrices . . . . . . . . . . . . . . . . . . . . . . . 384
7.4 Johnson–Lindenstrauss Lemma for Circulant Matrices . . . . . . . . . . . 384
7.5 Composition of a Random Matrix and a Deterministic
Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
7.6 Restricted Isometry Property for Partial Random
Circulant Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
7.7 Restricted Isometry Property for Time-Frequency
Structured Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
7.8 Suprema of Chaos Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
7.9 Concentration for Random Toeplitz Matrix . . . . . . . . . . . . . . . . . . . . . . . 408
7.10 Deterministic Sensing Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
8 Matrix Completion and Low-Rank Matrix Recovery . . . . . . . . . . . . . . . . . . 411
8.1 Low Rank Matrix Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
8.2 Matrix Restricted Isometry Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
8.3 Recovery Error Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
8.4 Low Rank Matrix Recovery for Hypothesis Detection . . . . . . . . . . . 415
8.5 High-Dimensional Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
8.6 Matrix Compressed Sensing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
8.6.1 Observation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
8.6.2 Nuclear Norm Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
8.6.3 Restricted Strong Convexity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
8.6.4 Error Bounds for Low-Rank Matrix Recovery . . . . . . . . . 419
8.7 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
8.8 Multi-task Matrix Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
8.9 Matrix Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
8.9.1 Orthogonal Decomposition and Orthogonal
Projection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
8.9.2 Matrix Completion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
8.10 Von Neumann Entropy Penalization and Low-Rank
Matrix Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
8.10.1 System Model and Formalism . . . . . . . . . . . . . . . . . . . . . . . . . . 435
8.10.2 Sampling from an Orthogonal Basis . . . . . . . . . . . . . . . . . . . . 436

8.10.3 Low-Rank Matrix Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 437


8.10.4 Tools for Low-Rank Matrix Estimation . . . . . . . . . . . . . . . . 439
8.11 Sum of a Large Number of Convex Component Functions . . . . . . . 440
8.12 Phase Retrieval via Matrix Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
8.12.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
8.12.2 Matrix Recovery via Convex Programming . . . . . . . . . . . . 446
8.12.3 Phase Space Tomography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
8.12.4 Self-Coherent RF Tomography. . . . . . . . . . . . . . . . . . . . . . . . . . 449
8.13 Further Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
9 Covariance Matrix Estimation in High Dimensions . . . . . . . . . . . . . . . . . . . 457
9.1 Big Picture: Sense, Communicate, Compute, and Control . . . . . . . 457
9.1.1 Received Signal Strength (RSS) and
Applications to Anomaly Detection . . . . . . . . . . . . . . . . . . . . 460
9.1.2 NC-OFDM Waveforms and Applications
to Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
9.2 Covariance Matrix Estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
9.2.1 Classical Covariance Estimation . . . . . . . . . . . . . . . . . . . . . . . . 461
9.2.2 Masked Sample Covariance Matrix . . . . . . . . . . . . . . . . . . . . 463
9.2.3 Covariance Matrix Estimation for
Stationary Time Series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
9.3 Covariance Matrix Estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
9.4 Partial Estimation of Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
9.5 Covariance Matrix Estimation in Infinite-Dimensional Data . . . . . 478
9.6 Matrix Model of Signal Plus Noise Y = S + X . . . . . . . . . . . . . . . . . . 479
9.7 Robust Covariance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
10 Detection in High Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
10.1 OFDM Radar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
10.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
10.2.1 PCA Inconsistency in High-Dimensional Setting . . . . . . 490
10.3 Space-Time Coding Combined with CS . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
10.4 Sparse Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
10.5 Information Plus Noise Model Using Sums
of Random Vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
10.6 Information Plus Noise Model Using Sums
of Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
10.7 Matrix Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
10.8 Random Matrix Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
10.9 Sphericity Test with Sparse Alternative. . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
10.10 Connection with Random Matrix Theory. . . . . . . . . . . . . . . . . . . . . . . . . . 502
10.10.1 Spectral Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
10.10.2 Low Rank Perturbation of Wishart Matrices . . . . . . . . . . . 503
10.10.3 Sparse Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503

10.11 Sparse Principal Component Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503


10.11.1 Concentration Inequalities for the k-Sparse
Largest Eigenvalue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
10.11.2 Hypothesis Testing with λkmax . . . . . . . . . . . . . . . . . . . . . . . . . . 505
10.12 Semidefinite Methods for Sparse Principal
Component Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
10.12.1 Semidefinite Relaxation for λkmax . . . . . . . . . . . . . . . . . . . . . . . 506
10.12.2 High Probability Bounds for Convex Relaxation . . . . . . 508
10.12.3 Hypothesis Testing with Convex Methods. . . . . . . . . . . . . . 508
10.13 Sparse Vector Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
10.14 Detection of High-Dimensional Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
10.15 High-Dimensional Matched Subspace Detection . . . . . . . . . . . . . . . . . 514
10.16 Subspace Detection of High-Dimensional Vectors
Using Compressive Sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
10.17 Detection for Data Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
10.18 Two-Sample Test in High Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
10.19 Connection with Hypothesis Detection
of Noncommuntative Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
10.20 Further Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
11 Probability Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
11.1 The Problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
11.2 Sums of Random Symmetric Matrices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
11.3 Applications of Sums of Random Matrices. . . . . . . . . . . . . . . . . . . . . . . . 536
11.4 Chance-Constrained Linear Matrix Inequalities. . . . . . . . . . . . . . . . . . . 542
11.5 Probabilistically Constrained Optimization Problem . . . . . . . . . . . . . 543
11.6 Probabilistically Secured Joint Amplify-and-Forward
Relay by Cooperative Jamming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
11.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
11.6.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
11.6.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
11.6.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
11.7 Further Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
12 Database Friendly Data Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
12.1 Low Rank Matrix Approximation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
12.2 Row Sampling for Matrix Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
12.3 Approximate Matrix Multiplication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
12.4 Matrix and Tensor Sparsification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
12.5 Further Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
13 From Network to Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
13.1 Large Random Matrices for Big Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
13.2 A Case Study for Hypothesis Detection in High Dimensions . . . . 569
13.3 Cognitive Radio Network Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
13.4 Wireless Distributed Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571

13.5 Data Collection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572


13.6 Data Storage and Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
13.7 Data Mining of Large Data Sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
13.8 Mobility of Network Enabled by UAVs . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
13.9 Smart Grid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
13.10 From Cognitive Radio Network to Complex Network
and to Random Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
13.11 Random Matrix Theory and Concentration of Measure . . . . . . . . . . 574

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
Introduction

This book deals with the data that is collected from a cognitive radio network.
Although this is the motivation, the contents really treat the general mathematical
models and the latest results in the literature.
Big data, only at its dawn, refers to things one can do at a large scale that cannot be done at a smaller one [4]. Mankind's constraints are really functions of the scale at which we operate; scale really matters. Big data sees a shift from causation to correlation, to inferring probabilities. Big data is messy: the data is huge, and can tolerate inexactitude. Mathematical models crunch mountains of data to predict gains, while trying to reduce risks.
At this writing, big data is viewed as a paradigm shift in science and engineering, as illustrated in Fig. 1. In November 2011, when we gave the final touches to our book [5] on cognitive radio networks, the authors of this book recognized the fundamental significance of big data, so the very first section title (Sect. 1.1 of [5]) was "big data." Our understanding was that, due to spectrum sensing, the cognitive radio network leads us naturally towards big data. In the last 18 months, as a result of writing this book, this understanding went even further: we swam in the mathematical domains, appreciating the beauty of the consequence of big data, namely high dimensions. Book writing is truly a journey, and it helps one understand the subject much more deeply than otherwise. It is believed that the smart grid [6] will use many big data concepts and, hopefully, some of the mathematical tools that are covered in this book. Many mathematical insights could not be made explicit if the dimensions were not assumed to be large. As a result, concentration inequalities are natural tools to capture this insight in a non-asymptotic manner.
Figure 13.1 illustrates the vision of big data as the foundation for understanding cognitive networked sensing, cognitive radio networks, cognitive radar, and even the smart grid. We will further develop this vision in the book on smart grid [6]. High-dimensional statistics is the driver behind these subjects. Random matrices are natural building blocks to model big data. The concentration of measure phenomenon is of fundamental significance for modeling a large number of random matrices, and it is unique to high-dimensional spaces.


[Fig. 1 Big data vision: a diagram linking Big Data with High-Dimensional Statistics, Cognitive Networked Sensing, Smart Grid, Cognitive Radar, and Cognitive Radio Network]

To get a feel for the book, let us consider one basic problem. The large data sets
are conveniently expressed as a matrix
X = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{m1} & X_{m2} & \cdots & X_{mn} \end{bmatrix} \in \mathbb{C}^{m \times n},

where the X_{ij} are random variables, e.g., sub-Gaussian random variables. Here m, n are finite and large; for example, m = 100, n = 100. The spectrum of a random matrix X tends to stabilize as the dimensions of X grow to infinity. In the last few years, attention has turned to local and non-asymptotic regimes, in which the dimensions of X are fixed rather than growing to infinity. The concentration of measure phenomenon naturally occurs. The eigenvalues λ_i(X^T X), i = 1, ..., n, are natural mathematical objects to study. The eigenvalues can be viewed as Lipschitz functions that can be handled by Talagrand's concentration inequality, which expresses the insight that the sum of a large number of random variables is nearly constant with high probability. We can often treat both standard Gaussian and Bernoulli random variables in the unified framework of the sub-Gaussian family.
Theorem 1 (Talagrand's Concentration Inequality). For every product probability P on {−1, 1}^n, consider a convex and Lipschitz function f : R^n → R with Lipschitz constant L. Let X_1, ..., X_n be independent random variables taking values in {−1, 1}, let Y = f(X_1, ..., X_n), and let M_Y be a median of Y. Then for every t > 0, we have

P(|Y - M_Y| \ge t) \le 4 e^{-t^2/(16 L^2)}. \qquad (1)

The random variable Y has the following property:

\mathrm{Var}(Y) \le 16 L^2, \qquad E[Y] - 16L \le M_Y \le E[Y] + 16L. \qquad (2)

For a random matrix X ∈ R^{n×n}, the following functions are Lipschitz functions:

(1)\ \lambda_{\max}(X); \quad (2)\ \lambda_{\min}(X); \quad (3)\ \mathrm{Tr}(X); \quad (4)\ \sum_{i=1}^{k} \lambda_i(X); \quad (5)\ \sum_{i=1}^{k} \lambda_{n-i+1}(X),

where Tr(X) has a Lipschitz constant of L = 1/n, and λ_i(X), i = 1, ..., n, has a Lipschitz constant of L = 1/√n. So the variance of Tr(X) is upper bounded by 16/n², while the variance of λ_i(X), i = 1, ..., n, is upper bounded by 16/n. The variance bound for Tr(X) is thus smaller by a factor of 1/n than that for λ_i(X); for n = 100, for example, their ratio is 20 dB. The variance exerts a fundamental control over hypothesis detection.
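As a rough numerical check of this variance gap, the following sketch (our own illustration, not taken from the book; it uses Gaussian rather than ±1 entries, and NumPy is an assumed tool) compares the empirical fluctuation of the normalized trace with that of the largest eigenvalue:

import numpy as np

# Illustration: the normalized trace of a symmetric random matrix fluctuates
# far less than a single eigenvalue, consistent with the Lipschitz-constant
# comparison above (L = 1/n versus L = 1/sqrt(n)).
rng = np.random.default_rng(0)
n, trials = 100, 500
traces, lam_max = [], []
for _ in range(trials):
    G = rng.standard_normal((n, n))
    X = (G + G.T) / np.sqrt(2 * n)                 # symmetric, scaled random matrix
    traces.append(np.trace(X) / n)                 # normalized trace
    lam_max.append(np.linalg.eigvalsh(X)[-1])      # largest eigenvalue
print("Var[(1/n) Tr(X)]   ~", np.var(traces))
print("Var[lambda_max(X)] ~", np.var(lam_max))

For n = 100 the two printed variances differ by more than an order of magnitude, which is the kind of gap the trace-versus-eigenvalue comparison above is pointing at.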
Part I
Theory
Chapter 1
Mathematical Foundation

This chapter provides the necessary background to support the rest of the book.
No attempt has been made to make this book fully self-contained. The book surveys many recent results in the literature, and we often include preliminary tools from the original publications. These preliminary tools may still be too difficult for much of the audience. Roughly, our prerequisite is the graduate-level course on random variables and processes.

1.1 Basic Probability

The probability of an event is expressed as P(·), and we use E for the expectation operator. For conditional expectation, we use the notation E_X Z, which represents integration with respect to X, holding all other variables fixed. We sometimes omit the parentheses when there is no possibility of confusion. Finally, we remind the reader of the analyst's convention that roman letters c, C, etc. denote universal constants that may change at every appearance.

1.1.1 Union Bound

Let (Ω, F, P) be a probability space, where F denotes a σ-algebra on the sample space Ω and P a probability measure on (Ω, F). The probability of an event A ∈ F is denoted by

P(A) = \int_A dP(\omega) = \int_{\Omega} I_A(\omega) \, dP(\omega),


where the indicator function IA (ω) takes the value of 1 if ω ∈ A and 0 otherwise.
The union bound (or Bonferroni’s inequality, or Boole’s inequality) states that for a
collection of events Ai ∈ F, i = 1, . . . , n, we have
P\left( \bigcup_{i=1}^{n} A_i \right) \le \sum_{i=1}^{n} P(A_i). \qquad (1.1)
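A minimal Monte Carlo check of (1.1), given as our own illustration (the interval events, the uniform sample, and the sample size are arbitrary choices):

import numpy as np

# Five overlapping interval events for a Uniform(0,1) sample; the probability of
# their union never exceeds the sum of the individual probabilities, as in (1.1).
rng = np.random.default_rng(1)
U = rng.uniform(size=100_000)
events = [(0.1 * i, 0.1 * i + 0.15) for i in range(5)]
in_any = np.zeros_like(U, dtype=bool)
for a, b in events:
    in_any |= (a < U) & (U < b)
p_union = in_any.mean()
p_sum = sum(((a < U) & (U < b)).mean() for a, b in events)
print(p_union, "<=", p_sum)     # approximately 0.55 <= 0.75 here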

1.1.2 Independence

In general, it can be shown that the random variables X and Y are independent if and only if their joint cdf is equal to the product of the marginal cdf's:

F_{X,Y}(x, y) = F_X(x) F_Y(y) \quad \text{for all } x \text{ and } y. \qquad (1.2)

Similarly, if X and Y are jointly continuous, then X and Y are independent if and only if their joint pdf is equal to the product of the marginal pdf's:

f_{X,Y}(x, y) = f_X(x) f_Y(y) \quad \text{for all } x \text{ and } y. \qquad (1.3)

Equation (1.3) is obtained from (1.2) by differentiation. Conversely, (1.2) is obtained from (1.3) by integration.

1.1.3 Pairs of Random Variables

Suppose that X and Y are independent random variables, and let g(X, Y) = g_1(X) g_2(Y). Find E[g(X, Y)] = E[g_1(X) g_2(Y)]. It follows that

E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g_1(x') g_2(y') f_X(x') f_Y(y') \, dx' \, dy'
= \left( \int_{-\infty}^{\infty} g_1(x') f_X(x') \, dx' \right) \left( \int_{-\infty}^{\infty} g_2(y') f_Y(y') \, dy' \right)
= E[g_1(X)] \, E[g_2(Y)]. \qquad (1.4)

Let us consider the sum of two random variables, Z = X + Y. Find F_Z(z) and f_Z(z) in terms of the joint pdf of X and Y.
The cdf of Z is found by integrating the joint pdf of X and Y over the region of the plane corresponding to the event {Z ≤ z} = {X + Y ≤ z}:

F_Z(z) = \int_{-\infty}^{\infty} \int_{-\infty}^{z - x'} f_{X,Y}(x', y') \, dy' \, dx'.

The pdf of Z is

f_Z(z) = \frac{d}{dz} F_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x', z - x') \, dx'.

Thus the pdf for the sum of two random variables is given by a superposition integral.
If X and Y are independent random variables, then by (1.3), the pdf is given by the convolution integral of the marginal pdf's of X and Y:

f_Z(z) = \int_{-\infty}^{\infty} f_X(x') f_Y(z - x') \, dx'.

Thus, the pdf of Z = X + Y is given by the convolution of the pdf’s of X and Y :

fZ (z) = fX (x) ∗ fY (y). (1.5)
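The convolution formula (1.5) is easy to verify numerically. The sketch below is our own illustration (NumPy is assumed, and Exp(1) variables are used because their convolution is also known in closed form as the Gamma(2, 1) pdf t e^{-t}):

import numpy as np

# Empirical pdf of Z = X + Y versus the numerical convolution of the marginals.
rng = np.random.default_rng(2)
x = rng.exponential(1.0, size=200_000)            # X ~ Exp(1)
y = rng.exponential(1.0, size=200_000)            # Y ~ Exp(1), independent of X
z = x + y

t = np.linspace(0.0, 10.0, 500)
dt = t[1] - t[0]
f = np.exp(-t)                                    # marginal pdf of Exp(1) on the grid
f_conv = np.convolve(f, f)[: t.size] * dt         # discretized (f_X * f_Y)(t)

hist, _ = np.histogram(z, bins=t, density=True)   # empirical pdf of Z
print(np.max(np.abs(f_conv[:-1] - hist)))         # small discrepancy (binning plus Monte Carlo noise)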

1.1.4 The Markov and Chebyshev Inequalities and Chernoff Bound

The inequalities in this subsection will be generalized to the matrix setting, replacing the scalar-valued random variable X with matrix-valued random variables X (random matrices).
In general, the mean and variance of a random variable do not provide enough information to determine the cdf/pdf. However, the mean and variance do allow us to obtain bounds for probabilities of the form P(|X| ≥ t). Suppose that X is a nonnegative random variable with mean E[X]. The Markov inequality then states that

P(X \ge a) \le \frac{E[X]}{a} \quad \text{for } X \text{ nonnegative.} \qquad (1.6)
It follows from Markov's inequality that if φ is a strictly monotonically increasing nonnegative-valued function, then for any random variable X and real number t, we have [7]

P(X \ge t) = P\{\varphi(X) \ge \varphi(t)\} \le \frac{E \, \varphi(X)}{\varphi(t)}. \qquad (1.7)
An application of (1.7) with φ(x) = x² is Chebyshev's inequality: if X is an arbitrary random variable and t > 0, then

P(|X - EX| \ge t) = P\left( |X - EX|^2 \ge t^2 \right) \le \frac{E|X - EX|^2}{t^2} = \frac{\mathrm{Var}[X]}{t^2}.

Now suppose that the mean E[X] = m and the variance Var[X] = σ² of a random variable are known, and that we are interested in bounding P(|X − m| ≥ a). The Chebyshev inequality states that

P(|X - m| \ge a) \le \frac{\sigma^2}{a^2}. \qquad (1.8)
The Chebyshev inequality is a consequence of the Markov inequality. More generally, taking φ(x) = x^q (x ≥ 0), for any q > 0 we have the moment bound

P(|X - EX| \ge t) \le \frac{E|X - EX|^q}{t^q}. \qquad (1.9)
In specific examples, we may choose the value of q to optimize the obtained upper bound. Such moment bounds often provide very sharp estimates of the tail probabilities. A related idea is at the basis of Chernoff's bounding method. Taking φ(x) = e^{sx}, where s is an arbitrary positive number, for any random variable X and any t ∈ R, we have

P(X \ge t) = P\left( e^{sX} \ge e^{st} \right) \le \frac{E \, e^{sX}}{e^{st}}. \qquad (1.10)
If more information is available than just the mean and variance, then it is possible to obtain bounds that are tighter than the Markov and Chebyshev inequalities. The region of interest is A = {t : t ≥ a}, so let I_A(t) be the indicator function, that is, I_A(t) = 1 for t ∈ A and I_A(t) = 0 otherwise. Consider the bound I_A(t) ≤ e^{s(t−a)}, s > 0. The resulting bound is

P(X \ge a) = \int_0^{\infty} I_A(t) f_X(t) \, dt \le \int_0^{\infty} e^{s(t-a)} f_X(t) \, dt = e^{-sa} \int_0^{\infty} e^{st} f_X(t) \, dt = e^{-sa} E\left[ e^{sX} \right]. \qquad (1.11)

This bound is the Chernoff bound, which can be seen to depend on the expected
value of an exponential function of X. This function is called the moment generating
function and is related to the Fourier and Laplace transforms in the following
subsections. In Chernoff’s method, we find an s ≥ 0 that minimizes the upper
bound or makes the upper bound small. Even though Chernoff bounds are never as
good as the best moment bound, in many cases they are easier to handle [7].

1.1.5 Characteristic Function and Fourier Transform

Transform methods are extremely useful. In many applications, the solution is given
by the convolution of two functions f1 (x)∗f2 (x). The Fourier transform will convert
this convolution integral into a product of two functions in the transformed domains.
This is a result of a linear system, which is most fundamental.
1.1 Basic Probability 7

The characteristic function of a random variable is defined by


  ∞
ΦX (ω) = E ejωX = fX (x)ejωx dx,
−∞

where j = −1 is the imaginary unit number. If we view ΦX (ω) as a Fourier
transform, then we have from the Fourier transform inversion formula that the pdf
of X is given by

1
fX (x) = ΦX (ω)e−jωx dω.
2π −∞

If X is a discrete random variable, it follows that

ΦX (ω) = pX (xk )ejωxk discrete random variables.


k

Most of time we deal with discrete random variables that are integer-valued. The
characteristic function is then defined as

ΦX (ω) = pX (k)ejωk integer-valued random variables. (1.12)


k

Equation (1.12) is the Fourier transform of the sequence pX (k). The following
inverse formula allows us to recover the probabilities pX (k) from ΦX (ω)


1
pX (k) = ΦX (ω) e−jωk dω k = 0, ± 1, ± 2, . . .
2π 0

pX (k) are simply the coefficients of the Fourier series of the periodic function
ΦX (ω). The moments of X can be defined by ΦX (ω)—a very basic idea. The
power series (Taylor series) can be used to expand the complex exponential e−jωx
in the definition of ΦX (ω):
∞  
1 2
ΦX (ω) = fX (x) 1 + jωX + (jωX) + · · · ejωx dx.
−∞ 2!
Assuming that all the moments of X are finite and the series can be integrated term
by term, we have

1 2   1 n
ΦX (ω) = 1 + jωE [X] + (jω) E X 2 + · · · + (jω) E [X n ] + · · · .
2! n!
If we differentiate the above expression once and evaluate the result at ω = 0, we
obtain
 n 
d 
ΦX (ω) = j n E [X n ] ,
dω ω=0
8 1 Mathematical Foundation

which yields the final result


 n 
1 d 
E [X ] = n
n
ΦX (ω) .
j dω ω=0

Example 1.1.1 (Chernoff Bound for Gaussian Random Variable). Let X be a


Gaussian random variable with mean m and variance σ 2 . Find the Chernoff bound
for X. 

1.1.6 Laplace Transform of the pdf

When we deal with nonnegative continuous random variables, it is customary to


work with the Laplace transform of the pdf,
∞  
X ∗ (s) = fX (x)e−sx dx = E e−sX . (1.13)
0

Note that X ∗ (s) can be regarded as a Laplace transform of the pdf or as an expected
value of a function of X, e−sX . When X is replaced with a matrix-valued random
variable X, we are motivated to study
∞  
X∗ (s) = fX (x)e−sx dx = E e−sX . (1.14)
0

Through the spectral mapping theorem defined in Theorem 1.4.4, f (A) is defined
simply by applying the function f to the eigenvalues of A, where f (x) is an arbitrary
function. Here we have f (x) = e−sx . The eigenvalues of the matrix-valued random
variable e−sX are scalar-valued random variables.
The moment theorem also holds for X ∗ (s):

n dn ∗ 
E [X n ] = (−1) X (s)  .
dsn s=0

1.1.7 Probability Generating Function

In problems where random variables are nonnegative, it is usually more convenient


to use the z-transform or the Laplace transform. The probability generating
function GN (z) of a nonnegative integer-valued random variables N is defined by

 
GN (z) = E z N
= pN (k)z k . (1.15)
k=0
1.2 Sums of Independent (Scalar-Valued) Random Variables and Central Limit Theorem 9

The first expression is the expected value of the function of N , z N . The second
expression is the z-transform the probability mass function (with a sign change in
the exponent). Table 3.1 of [8, p. 175] shows the probability generating function for
  variables. Note that the characteristic function of N is given
some discrete random
by GN (z) = E z N = GN (ejω ).

1.2 Sums of Independent (Scalar-Valued) Random


Variables and Central Limit Theorem

We follow [8] on the standard Fourier-analytic proof of the central limit theorem
for scalar-valued random variables. This material allows us to warm up, and set
the stage for a parallel development of a theory for studying the sums of the matrix-
valued random variables. The Fourier-analytical proof of the central limit theorem is
one of the quickest (and slickest) proofs available for this theorem, and accordingly
the “standard” proof given in probability textbooks [9].
Let X1 , X2 , . . . , Xn be n independent random variables. In this section, we show
how the standard Fourier transform methods can be used to find the pdf of Sn =
X1 + X 2 + . . . + X n .
First, consider the n = 2 case, Z = X + Y , where X and Y are independent
random variables. The characteristic function of Z is given by
     
ΦZ (ω) = E ejωZ = E ejω(X+Y ) = E ejωX ejωY
   
= E ejωX E ejωY = ΦX (ω) ΦY (ω) ,

where the second line follows from the fact that functions of independent random
variables (i.e., ejωX and ejωY ) are also independent random variables, as discussed
in (1.4). Thus the characteristic function of Z is the product of the individual
characteristic functions of X and Y .
Recall that ΦZ (ω) can be also viewed as the Fourier transform of the pdf of Z:

ΦZ (ω) = F {fZ (z)} .

According to (1.5), we obtain

ΦZ (ω) = F {fZ (z)} = F {fX (x) ∗ fY (y)} = ΦX (ω) ΦY (ω) . (1.16)

Equation (1.16) states the well-known result that the Fourier transform of a
convolution of two functions is equal to the product of the individual Fourier
transforms. Now consider the sum of n independent random variables:

Sn = X1 + X2 + · · · + X n .
10 1 Mathematical Foundation

The characteristic function of Sn is


     
ΦSn (ω) = E ejωSn = E ejω(X1 +X2 +···+Xn ) = E ejωX1 ejωX2 · · · ejωXn
     
= E ejωX1 E ejωX2 · · · E ejωXn
= ΦX1 (ω) ΦX2 (ω) · · · ΦXn (ω) . (1.17)

Thus the pdf of Sn can then be found by finding the inverse Fourier transform of
the product of the individual characteristic functions of the Xi ’s:

fSn (x) = F −1 {ΦX1 (ω) ΦX1 (ω) · · · ΦXn (ω)} . (1.18)

Example 1.2.1 (Sum of Independent Gaussian Random Variables). Let Sn be the


sum of n independent Gaussian random variables with respective means and
variances, m1 , m2 , . . . , mn and σ 2 1 , σ 2 2 , . . . , σ 2 n . Find the pdf of Sn . The char-
acteristic function of Xk is
2 2
ΦXk (ω) = e+jωmk −ω σk /2

so by (1.17),


n
2 2
ΦSn (ω) = e+jωmk −ω σk /2

k=1
  
= exp +jω (m1 + m2 + · · · + mn ) − ω 2 σ12 + σ22 + · · · + σn2 .

This is the characteristic function of a Gaussian random variable. Thus Sn is a


Gaussian random variable with mean m1 + m2 + · · · + mn and variance σ12 + σ22 +
· · · + σn2 . 
Example 1.2.2 (Sum of i.i.d. Random Variables). Find the pdf of a sum of n
independent, identically distributed random variables with characteristic functions

ΦXk (ω) = ΦX (ω) for k = 1, 2, . . . , n.

Equation (1.17) immediately implies that the characteristic function Sn is


      n
ΦSn (ω) = E ejωX1 E ejωX2 · · · E ejωXn = {ΦX (ω)} .

The pdf of Sn is found by taking the inverse transform of this expression. 


Example 1.2.3 (Sum of i.i.d. Exponential Random Variables). Find the pdf of a sum
of n independent exponentially distributed random variables, all with parameter α.
The characteristic function of a single exponential random variable is
1.3 Sums of Independent (Scalar-Valued) Random Variables 11

α
ΦX (ω) = .
α − jω

From the previous example we then have that


 n
n α
ΦSn (ω) = {ΦX (ω)} = .
α − jω

From Table 4.1 of [8], we see that Sn is an m-Erlang random variables. 


When dealing with integer-valued random variables it is usually preferable to
work with the probability generating function defined in (1.15)
 
GN (z) = E z N .

The generating function for a sum of independent discrete random variables, N =


X1 + · · · + Xn , is
     
GN (z) = E z X1 +···+Xn = E z X1 · · · E z Xn = GX1 (z) · · · GXn (z). (1.19)

Example 1.2.4. Find the generating function for a sum of n independent, identically
distributed random variables.
The generating function for a single geometric random variable is given by
pz
GX (z) = .
1 − qz

Therefore, the generating function for a sum of n such independent random


variables is
 n
pz
GN (z) = .
1 − qz

From Table 3.1 of [8], we see that this is the generating function of a negative
binomial random variable with parameter p and n. 

1.3 Sums of Independent (Scalar-Valued) Random


Variables and Classical Deviation Inequalities:
Hoeffding, Bernstein, and Efron-Stein

We are mainly concerned with upper bounds for the probabilities of deviations
n
from the mean, that is, to obtain P (Sn − ESn  t), with Sn = Xi , where
i=1
X1 , . . . , Xn are independent real-valued random variables.
12 1 Mathematical Foundation

Chebyshev’s inequality and independence imply



n
Var [Xn ]
Var [Sn ] i=1
P (Sn − ESn  t)  = .
t2 t2

1

n
In other words, writing σ 2 = n Var [Xn ], we have
i=1

nσ 2
P (Sn − ESn  t)  .
t2
This simple inequality is at the basis of the weak law of large numbers.

1.3.1 Transform of Probability Bounds to Expectation Bounds

We often need to convert a probability bound for the random variable X to an


expectation bound for X p , for all p ≥ 1. This conversion is of independent interest.
It may be more convenient to apply expectation bounds since expectation can be
approximated by average. Our result here is due to [10]. Let X be a random variable
assuming non-negative values. Let a, b, t and h be non-negative parameters. If we
have exponential-like tail

P (X  a + tb)  e−t+h ,
then, for all p ≥ 1,
p
EX p  2(a + bh + bp) . (1.20)

On the other hand, if we have Gaussian-like tail


2
P (X  a + tb)  e−t +h
,

then, for all p ≥ 1,


√  √  p
EX p  3 ∗ p a + b h + b p/2 . (1.21)

1.3.2 Hoeffding’s Inequality

Chernoff’s bounding method, described in Sect. 1.1.4, is especially convenient for


bounding tail probabilities of sums of independent random variables. The reason is
that since the expected value of a product of independent random variables equals
the product of the expected variables—this is not true for matrix-valued random
variables, Chernoff’s bound becomes
1.3 Sums of Independent (Scalar-Valued) Random Variables 13

  
−st

n
P (Sn − ESn  t)  e E exp s (Xi − EXi )
i=1
!
n   (1.22)
= e−st E es(Xi −EXi ) by independence.
i=1

Now the problem of finding tight bounds comes down to finding a good upper bound
for the moment generating function of the random variables Xi − EXi . There are
many ways of doing this. In the case of bounded random variables, perhaps the most
elegant version is due to Hoeffding [11]:
Lemma 1.3.1 (Hoeffding’s Inequality). Let X be a random variable with EX =
0, a ≤ X ≤ b. Then for s ≥ 0,
  2 2
E esX  es (b−a) /8 . (1.23)

Proof. The convexity of the exponential function implies

x − a sb b − x sa
esx  e + e for a  x  b.
b−a b−a

−a
Expectation is linear. Exploiting EX = 0, and using the notation p = b−a we have

EesX  b
b−a
esa − e
a sb
b−a

= 1 − p + pes(b−a) e−ps(b−a)
 eφ(u) ,

where u = s(b − a), and φ (u) = −pu + log (1 − p + peu ). But by direct
calculation, we can see that the derivative of φ is
p
φ (u) = −p + ,
p + (1 − p) e−u

thus φ (u) = φ (0) = 0. Besides,

p (1 − p) e−u 1
φ (u) = 2  .
(p + (1 − p)e−u ) 4

Therefore, by Taylor’s theorem, for some θ ∈ [0, u],

2
u2  u2 s2 (b − a) 
φ (u) = φ (0) + uφ (0) + φ (θ)  = .
2 8 8
14 1 Mathematical Foundation

Now we directly plug this lemma into (1.22):

P (Sn − ESn  t)

n
2
(bi −ai )2 /8
 e−st es (by Lemma 1.3.1 )
i=1

n
s2 (bi −ai )2 /8
−st
=e e i=1


n n

−2t2 / (bi −ai )2 2
=e i=1 by choosing s = 4t/ (bi − ai ) .
i=1

Theorem 1.3.2 (Hoeffding’s Tail Inequality [11]). Let X1 , . . . , Xn be indepen-


dent bounded random variables such that Xi falls in the interval [ai , bi ] with
probability one. Then, for any t > 0, we have

n
−2t2 / (bi −ai )2
P (Sn − ESn  t)  e i=1 and

n
−2t2 / (bi −ai )2
P (Sn − ESn  −t)  e i=1 .

1.3.3 Bernstein’s Inequality

Assume, without loss of generality, that EXi = 0 for all i = 1, . . . , n. Our starting
 
point is again (1.22), that is, we need bounds for EesXi . Introduce σi2 = E Xi2 ,
and
∞  
sk−2 E Xik
Fi = .
k!σi2
k=2



Since esx = 1 + sx + sk xk /k!, and the expectation is linear, we may write
k=2


∞ sk E[Xik ]
EesXi = 1 + sE [Xi ] + k!
k=2
= 1 + s2 σi2 Fi (since E [Xi ] = 0)
2
σi2 Fi
 es .

Now assume that Xi ’s are bounded such that |Xi | ≤ c. Then for each k ≥ 2,
 
E Xik  ck−2 σi2 .
1.3 Sums of Independent (Scalar-Valued) Random Variables 15

Thus,
∞ ∞ k
sk−2 ck−2 σi2 1 (sc) esc − 1 − sc
Fi  = = .
k!σi2 (sc)
2 k! (sc)
2
k=2 k=2

Thus we have obtained


  sc
s2 σ 2 e −1−sc
E esXi  e i (sc)2 .


n
Returning to (1.22) and using the notation σ 2 = (1/n) σi2 , we have
i=1

n

2
(esc −1−sc)/c2 −st
P Xi  t  enσ .
i=1

Now we are free to choose s. The upper bound is minimized for


 
1 tc
s= log 1 + .
c nσ 2

Resubstituting this value, we obtain Bennett’s inequality [12]:


Theorem 1.3.3 (Bennett’s Inequality). Let X1 , . . . , Xn be independent real-
valued random variables with zero mean, and assume with zero mean, and assume
that |Xi | ≤ c with probability one. Let
n
1
σ2 = Var {Xi }.
n i=1

Then, for any t > 0,

n
   
nσ 2 ct
P Xi  t  exp − 2 h ,
i=1
c nσ 2

where h(u) = (1 + u) log(1 + u) − u for u ≥ 0.


The following inequality is due to Bennett (also referred to as Bernstein’s
inequality) [12, Eq. (7)] and [13, Lemma 2.2.11].
Theorem 1.3.4 (Bennett [12]). Let X1 , . . . , Xn be independent random variables
with zero mean such that
p
E|Xi |  p!M p−2 σi2 /2 (1.24)
16 1 Mathematical Foundation

for some p ≥ 2 and some constants M and σi , i = 1, . . . , n. Then for t > 0


  
 n 
  2 2
P  Xi   t  2e−t /2(σ +M t) (1.25)
 
i=1


n
with σ 2 = σi2 .
i=1

See Example 7.5.5 for its application.


Applying the elementary inequality h(u)  u2 / (2 + 2u/3), u ≥ 0 which can
be seen by comparing the derivatives of both sides, we obtain a classical inequality
of Bernstein [7]:
Theorem 1.3.5 (Bernstein Inequality). Under the conditions of the previous
theorem, for any t > 0,

n
  
1 nt2
P Xi  t  exp − 2
. (1.26)
n i=1
2σ + 2ct/3

We see that, except for the term 2ct/3, Bernstein’s inequality is quantitatively right
when compared with the central limit theorem: the central limit theorem states that
" n
  2
n 1 1 e−y /2
P Xi − EXi y → 1 − Φ(y)  √ ,
σ2 n i=1
2π y

from which we would expect, at least in a certain range of the parameters,


something like

n

1 2 2
P Xi − EXi  t ≈ e−nt /(2σ ) ,
n i=1

which is comparable with (1.26).


Exercise 1.3.6 (Sampling Without Replacement). Let X be a finite set with N
elements, and let X1 , . . . , Xn be a random sample without replacement from X and
Y1 , . . . , Yn a random sample with replacement from X . Show that for any convex
real-valued function f ,
n
 n

Ef Xi  Ef Yi .
i=1 i=1

In particular, by taking f (x) = esx , we see that all inequalities derived from the
sums of independent random variables Yi using Chernoff’s bounding remain true
for the sums of the Xi ’s. (This result is due to Hoeffding [11].)
1.4 Probability and Matrix Analysis 17

1.3.4 Efron-Stein Inequality

The main purpose of these notes [7] is to show how many of the tail inequalities
of the sums of independent random variables can be extended to general functions
of independent random variables. The simplest, yet surprisingly powerful inequality
of this kind is known as the Efron-Stein inequality.

1.4 Probability and Matrix Analysis

1.4.1 Eigenvalues, Trace and Sums of Hermitian Matrices

Let A is a Hermitian n × n matrix. By the spectral theorem for Hermitian matrices,


one diagonalize A using a sequence

λ1 (A)  · · ·  λn (A)

of n real eigenvalues, together with an orthonormal basis of eigenvectors

u1 (A) , . . . , un (A) ∈ Cn .

The set of the eigenvalues {λ1 (A) , . . . , λn (A)} is known as the spectrum of A.
The eigenvalues are sorted in a non-increasing manner. The trace of a n × n matrix
is equal to the sum of the its eigenvalues
n
Tr (A) = λn .
i=1

The linearity of trace

Tr (A + B) = Tr (A) + Tr (B) .

The first eigenvalue is defined as

λ1 (A) = sup vH Av.


|v|=1

We have

λ1 (A + B)  λ1 (A) + λ1 (B) . (1.27)

The Weyl inequalities are

λi+j−1 (A + B)  λi (A) + λj (B) ,


18 1 Mathematical Foundation

and the Ky Fan inequality

λ1 (A+B) + · · · +λk (A+B)  λ1 (A) + · · · +λk (A) +λ1 (B) + · · · +λk (B)
(1.28)

In particular, we have

Tr (A + B)  Tr (A) + Tr (B) . (1.29)

One consequence of these inequalities is that the spectrum of a Hermitian matrix


is stable with respect to small perturbations. This is very important when we deal
with an extremely weak signal that can be viewed as small perturbations within the
noise [14].
An n × n density matrix ρ is a positive definite matrix with Tr(ρ) = 1. Let Sn
denote the set of density matrices on Cn . This is a convex set [15].
A ≥ 0 is equivalent to saying that all eigenvalues of A are nonnegative, i.e.,
λi (A)  0.

1.4.2 Positive Semidefinite Matrices

Inequality is one of the main topics in modern matrix theory [16]. An arbitrary
complex matrix A is Hermitian, if A = AH , where H stands for conjugate and
transpose of a matrix. If a Hermitian matrix A is positive semidefinite, we say

A ≥ 0. (1.30)

Matrix A is positive semidefinite, i.e., A ≥ 0, if all the eigenvalues λi (A) are


nonnegative [16, p. 166]. In addition,

A ≥ 0 ⇒ det A ≥ 0 and A > 0 ⇒ det A > 0, (1.31)

where ⇒ has the meaning of “implies.” When A is a random matrix, its deter-
minant det A and its trace Tr A are scalar random variables. Trace is a linear
operator [17, p. 30].
For every complex matrix A, the Gram matrix AAH is positive semidefinite
[16, p. 163]:

AAH ≥ 0. (1.32)
 1
The eigenvalues of AAH 2 are the singular values of A.
It follows from [17, p. 189] that
n 
n
Tr A = λi , det A = λi ,
i=1 i=1
1.4 Probability and Matrix Analysis 19

n
Tr Ak = λki , k = 1, 2, . . . (1.33)
i=1

where λi are the eigenvalues of the matrix A ∈ Cn×n .


It follows from [17, p. 392] that

1/n 1
(det A)  Tr A, (1.34)
n

for every positive semidefinite matrix A ∈ Cn×n .


It follows from [17, p. 393] that

1 1/n
Tr AX  (det A) (1.35)
n

for every positive semidefinite matrix A, X ∈ Cn×n and further det X = 1.

1.4.3 Partial Ordering of Positive Semidefinite Matrices

By the Cauchy-Schwarz inequality, it is immediate that


 
|Tr (AB)|  Tr AAH Tr BBH (1.36)

and

Tr AAH = 0 if and only if A = 0. (1.37)

For a pair of positive semidefinite matrices A and B, we say

B ≥ A if B − A ≥ 0. (1.38)

A partial order may be defined using (1.38). We hold the intuition that matrix B
is somehow “greater” than matrix A. If B ≥ A ≥ 0, then [16, p. 169]

Tr B  Tr A, det B  det A, B−1  A−1 . (1.39)

If A  0 and B  0 be of the same size. Then [16, p. 166]


1.

A + B ≥ B, (1.40)

2.

A1/2 BA1/2  0, (1.41)


20 1 Mathematical Foundation

3.

Tr (AB) ≤ Tr (A) Tr (B) , (1.42)

4. The eigenvalues of AB are all nonnegative. Furthermore, AB is positive


semidefinite if and only if AB = BA.
A sample covariance matrix is a random matrix. For a pair of random matrices X
and Y , in analogy with their scalar counterparts, their expectations are of particular
interest:

Y ≥ X ≥ 0 ⇒ EY ≥ EX. (1.43)

Proof. Since the expectation of a random matrix can be viewed as a convex


combination, and also the positive semidefinite (PSD) cone is convex [18, p. 459],
expectation preserves the semidefinite order [19]. 

1.4.4 Definitions of f (A) for Arbitrary f

The definitions of f (A) of matrix A for general function f were posed by Sylvester
and others. A eleglant treatment is given in [20]. A special function called spectrum
is studied [21]. Most often, we deal with the PSD matrix, A ≥ 0. References [17,
18, 22–24].
When f (t) is a polynomial or rational function with scalar coefficients and a
scalar argument, t, it is natural to define f (A) by substituting A for t, replacing
division by matrix inverse, and replacing 1 by the identity matrix. Then, for example,

1 + t2 −1 
f (t) = ⇒ f (A) = (I − A) I + A2 if 1 ∈
/ Λ (A) . (1.44)
1−t
Here, Λ (A) denotes the set of eigenvalues of A (the spectrum of A). For a general
theory, we need a way of defining f (A) that is applicable to arbitrary functions f .
Any matrix A ∈ Cn×n can be expressed in the Jordan canonical form

Z−1 AZ = J = diag (J1 , J2 , . . . , Jp ) (1.45)

where
⎛ ⎞
λ1 1 0
⎜ .. ⎟ mk ×mk
Jk = Jk (λk ) = ⎝ . 1 ⎠∈C , (1.46)
0 λk

where Z is nonsingular and m1 + m2 + · · · + mp = n.


Let f be defined on the spectrum of A ∈ Cn×n and let A have the Jordan
canonical form (1.45). Then,
1.4 Probability and Matrix Analysis 21

f (A)  Zf (J) Z−1 = Z diag (f (Jk )) Z−1 , (1.47)

where
⎛ ⎞
f (mk −1) (λk )
f (λk ) f  (λk ) · · · (mk −1)!
⎜ ⎟
⎜ . .. ⎟
⎜ f (λk ) . . . ⎟
f (Jk )  ⎜ ⎟. (1.48)
⎜ .. ⎟
⎝ . f  (λk ) ⎠
f (λk )

Several remarks are in order. First, the definition yields an f (A) that can be shown
to be independent of the particular Jordan canonical form that is used. Second, if A
is diagonalizable, then the Jordan canonical form reduces to an eigen-decomposition
A = Z−1 DZ, with D = diag (λi ) and the columns of Z (renamed U) eigenvectors
of A, the above definition yields

f (A) = Zf (J) Z−1 = Uf (D) U−1 = Uf (λi ) U−1 . (1.49)

Therefore, for diagonalizable matrices, f (A) has the same eigenvectors as A and
its eigenvalues are obtained by applying f to those of A.

1.4.5 Norms of Matrices and Vectors

See [25] for matrix norms. The matrix p-norm is defined, for 1 ≤ p ≤ ∞, as
Ax p
A p = max ,
x=0 x p

 1/p

n
p
where x p = |xi | . When p = 2, it is called spectral norm A 2 =
i=1
A . The Frobenius norm is defined as
⎛ ⎞1/2
n n
=⎝ |aij | ⎠
2
A F ,
i=1 j=1

which can be computed element-wise. It is the same as the Euclidean norm on



n
vectors. Let C = AB. Then cij = aik bkj . Thus
k=1

n n n n n
2 2 2 2
AB F = C F = |cij | = |aik bkj | .
i=1 j=1 i=1 j=1 k=1
22 1 Mathematical Foundation


n
Applying the Cauchy-Schwarz inequality to the expression aik bkj , we find that
k=1

 

n  n   n  
a2   b2 
n
2
AB F  ik kj
 n n  
i=1 j=1 k=1 k=1
 n n
   2    2 
= aik bkj 
i=1 k=1 j=1 k=1
= A F B F.

Let σi , i = 1, . . . , n be singular values that are sorted in decreasing magnitude. The


Schatten-p norm is defined as

n
1/p
A Sp = σip for 1  p  ∞, and A ∞ = A op = σ1 ,
i=1

for a matrix A ∈ Rn×n . When p = ∞, we obtain the operator norm (or spectral
norm) A op , which is the largest singular value. It is so commonly used that we
sometimes use ||A|| to represent it. When p = 2, we obtain the commonly called
Hilbert-Schmidt norm or Frobenius norm A S2 = A F . When p = 1, A S1
denotes the nuclear norm. Note that ||A|| is the spectral norm, while A F is
the Frobenius norm. The drawback of the spectral norm it that it is expensive to
compute; it is not the Frobenius norm. We have the following properties of Schatten
p-norm
1. When p < q, the inequality occurs: A Sq  A Sp .
2. If r is a rank of A, then with p > log(r), it holds that A  A Sp  e A .

Let X, Y = Tr XT Y represent the Euclidean inner product between two
matrices and X F = X, X . It can be easily shown that

X F = sup Tr XT G = sup X, G .
GF =1 GF =1

Note that trace and inner product are both linear.


For vectors, the only norm we consider is the l2 -norm, so we simply denote the
l2 -norm of a vector by ||x|| which is equal to x, x , where x, y is the Euclidean
inner product between two vectors. Like matrices, it is easy to show

x = sup x, y .
y=1
1.4 Probability and Matrix Analysis 23

1.4.6 Expectation

We follow closely [9] for this basic background knowledge to set the stage for future
applications. Given an unsigned random variable X (i.e., a random variable taking
values in [0, +∞],) one can define the expectation or mean EX as the unsigned
integral

EX = xdμX (x),
0

which by the Fubini-Tonelli theorem [9, p. 13] can also be rewritten as



EX = P (X  λ)dλ.
0

The expectation of an unsigned variable lies in [0, +∞]. If X is a scalar random


variable (which is allowed to take the value ∞), for which E |X| < ∞, we find X
is absolutely integrable, in which case we can define its expectation as

EX = xdμX (x)
R

in the real case, or

EX = xdμX (x)
C

in the complex case. Similarly, for a vector-valued random variable (note in finite
dimensions, all norms are equivalent, so the precise choice of norm used to define
|X| is not relevant here). If x = (X1 , . . . , Xn ) is a vector-valued random variable,
then X is absolutely integrable if and only if the components Xi are all absolutely
integrable, in which case one has

Ex = (EX1 , . . . , EXn ) .

A fundamentally important property of expectation is that it is linear: if


X1 , . . . , Xn are absolutely integrable scalar random variables and c1 , . . . , ck are
finite scalars, then c1 X1 , . . . , ck Xk is also absolutely integrable and

E (c1 X1 + · · · + ck Xk ) = c1 EX1 + · · · + ck EXk . (1.50)



By the Fubini-Tonelli theorem, the same result also applies to infinite sums c i Xi ,
i=1


provided that |ci | E |Xi | is finite.
i=1
24 1 Mathematical Foundation

Linearity of expectation requires no assumption of independence or dependence


amongst the individual random variables Xi ; that is what makes this property of
expectation so powerful [9, p. 14]. We use this linearity of expectation so often that
we typically omit an explicit reference to it when it is being used.
Expectation is also monotone: if X ≤ Y is true for some unsigned or real
absolutely integrable X, Y , then

EX  EY.

For an unsigned random variable, we have the obvious but very useful Markov
inequality

1
P (X  λ)  EX
λ
for some λ > 0. For the signed random variables, Markov’s inequality becomes

1
P (|X|  λ)  E |X| .
λ
If X is an absolutely integrable or unsigned scalar random variable, and F is a
measurable function from the scalars to the unsigned extended reals [0, +∞], then
one has the change of variables formula

EF (X) = F (x)dμX (x)


R

when X is real-valued and

EF (X) = F (x)dμX (x)


C

when X is complex-valued. The same formula applies to signed or complex F if


it is known that |F (x)| is absolutely integrable. Important examples of expressions
such as EF (X) are moments

k
E|X|

for k ≥ 1 (particularly k= 1, 2, 4), exponential moments

EetX

for real t, X, and Fourier moments (or the characteristic function)

EejtX
1.4 Probability and Matrix Analysis 25

for real t, X, or

ejt·x

for complex or vector-valued t, x, where · denotes a real inner product. We shall


encounter the resolvents
1
E
X −z

for complex z.
The reason for developing the scalar and vector cases is because we are motivated
to study

EeX

for matrix-valued X, where the entries of X may be deterministic or random


variables. A random matrix X and its functional f (X) can be studied using this
framework (f (t) is an arbitrary function of r). For example,

EeX1 +···+Xn

is of special interest when Xi are independent random matrices.


Once the second moment of a scalar random variable is finite, one can define the
variance
2
var (X) = E|X − EX| .

From Markov’s inequality we thus have Chebyshev’s inequality


var (X)
P (|X − EX|  λ)  .
λ2
A real-valued random variable X is sub-Gaussian if and only if there exists C >
0 such that

EetX  C exp Ct2

for all real t, and if and only if there exists C > 0 such that

k k/2
E|X|  (Ck)

for all integers k ≥ 1.


A real-valued random variable X has sub-exponential tails if and only if there exists
C > 0 such that
k 
E|X|  exp Ck C

for all positive integers k.


26 1 Mathematical Foundation

If X is sub-Gaussian (or has sub-exponential tails with exponent a > 1), then
from dominated convergence we have the Taylor expansion

tk
EetX = 1 + EX k
k!
k=1

for any real or complex t, thus relating the exponential and Fourier moments with
the kth moments.

1.4.7 Moments and Tails


p
The quantities E|X| , 0 < p < ∞ are called absolute moments. The absolute
moments of a random variable X can be expressed as

p
E|X| = p P (|X| > t)tp−1 dt, p > 0. (1.51)
0

The proof of this is as follows. Let I{|X|p x} is the indicator random variable: takes
p
1 on the event |X|  x and 0 otherwise. Using Fubini’s theorem, we derive

p ) p ) ) |X|p ) )∞
E|X| = |X| dP = Ω 0
Ω
dxdP = Ω 0 I{|X|p x} dxdP
)∞) )∞ p
= 0 Ω I{|X|p x} dPdx = 0 P (|X|  x) dx
)∞ )∞
= p 0 P (|X|  tp )tp−1 dt = p 0 P (|X|  t)tp−1 dt,

where we also used a change of variables.


p 1/p
For 1  p < ∞, (E|X| ) defines a norm on the Lp (Ω, P)-space of all
p-integrable random variables, in particular, the triangular inequality

p 1/p p 1/p p 1/p


(E|X + Y | )  (E|X| ) + (E|Y | ) (1.52)
p
holds for X, Y ∈ Lp (Ω, P) = {X measurable, E|X| < ∞}.
Let p, q ≥ 1 with 1/p + 1/q = 1, Hölder’s inequality states that

p 1/p q 1/q
E |XY |  (E|X| ) (E|Y | )

for random variables X, Y . The space case p = q = 2 is the Cauchy-Schwartz


inequality. It follows from Hölder’s inequality that for 0 < p  q < ∞,

p 1/p q 1/q
(E|X| )  (E|Y | ) .
1.4 Probability and Matrix Analysis 27

The function P (|X| > t) is called the tail of X. The Markov inequality is a
simple way of estimating the tail. We can also use the moments to estimate the tails.
The next statement due to Tropp [26] is simple but powerful. Suppose X is a random
variable satisfying

p 1/p
(E|X| )  αβ 1/p p1/γ , for all p  p0

for some constants α, β, γ, p0 > 0. Then


 
P |X|  e1/γ αt  βe−t /γ
γ

The proof of the claim is short. By Markov’s inequality, we obtain for an arbitrary
κ>0
  p  1/γ p
E|X| αp
P |X|  e 1/γ
αt  κ p  β .
(e αt) eκ αt

Choose p = tγ and the optimal value κ = 1/γ to obtain the claim.


Also the converse of the above claim can be shown [27]. Important special cases
p 1/p √
are γ = 1, 2. In particular, if (E|X| )  αβ 1/p p, for all p ≥ 2, then X satisfies
the subGaussian tail estimate
  2 √
P |X|  e1/2 αt  βe−t /2 for all t  2. (1.53)

For a random variable Z, we define its Lp norm


p 1/p
Ep (Z) = (E|Z| ) .

We use a simple technique for bounding the moments of a maximum. Consider an


arbitrary set {Z1 , . . . , ZN } of random variables. We have that

Ep (maxi Zi ) = N 1/p maxEp (Zi ) .


i

To check this claim, simply note that [28]


1/p 1/p
 p 1/p p p  1/p
Emaxi |Zi |  E |Zi |  E|Zi |  N · maxi E|Zi |p .
i i

In many cases, this inequality yields essentially sharp results for the appropriate
choice of p.  
k
If X, Y are independent with E[Y ] = 0 and k ≥ 2, then [29] E |X| 
 
k
E |X − Y | .
28 1 Mathematical Foundation

For random variables satisfying a subGaussian tail estimate, we have the


following useful lemma [27]. See also [30].
Lemma 1.4.1. Let X1 , . . . , XN be random variables satisfying
2 √
P (|Xi |  t)  βe−t /2
for all t  2, i = 1, . . . , N,

for some β ≥ 1. Then



E max |Xi |  Cβ ln (4βN )
i=1,...,N


with Cβ  2 + √1 .
4 2(4β)

Proof. According to (1.51), we have, for some α  2
 
)∞
E max |Xi | = 0 P max |Xi | > t dt
i=1,...,N i=1,...,N  
)α )∞
 0 1dt + α P max |Xi | > t dt
i=1,...,N
)∞ 
N
α+ α
P (|Xi | > t)dt
i=1
)∞ 2
α+ N β α e−t /2 dt.

In the second line, we used the union bound.


A change of variable gives
∞ ∞ ∞
2 2 2 2
e−t /2
dt = e−(t+u) /2
dt = e−u /2
e−tu e−t /2
dt.
u 0 0

On the right hand side, using e−tu  1 for t, u ≥ 0 gives


∞ ∞
"
−t2 /2 −u2 /2 −t2 /2 π −u2 /2
e dt  e e dt = e .
u 0 2
2
On the other hand, using e−t /2
 1 for t ≥ 0 gives
∞ ∞
2 2 1 −u2 /2
e−t /2
dt  e−u /2
e−tu dt = e .
u 0 u

Combining the two results, we have


∞ " 
2 π 1 2
e−t /2
dt  min , e−u /2
.
u 2 u
1.4 Probability and Matrix Analysis 29

Thus we have

2 1 2
E max |Xi |  α + N β e−t /2
dt  α + N β e−α /2 .
i=1,...,N α α
  √
Now we choose α = 2 ln (4βN )  2 ln (4)  2. This gives

E max |Xi |  2 ln (4βN ) + √ 1
i=1,...,N 4 2 ln(4βN )
√  
= 1
2 + 4√2 ln(4βN )
ln (4βN )  C β ln (4βN ).


Some results are formulated in terms of moments; the transition to a tail bound can
be established by the following standard result, which easily follows from Markov’s
inequality.
Theorem 1.4.2 ([31]). Suppose X is a random variable satisfying

p 1/p √
(E|X| )  α + β p + γp for all p  p0

for some α, β, γ, p0 > 0. Then, for t ≥ p0 ,


  √ 
P E |X|  e α + β t + γt  e−t .

If a Bernoulli vector y weakly dominates random vector x then y strongly


dominates x. See also [32].
Theorem 1.4.3 (Bednorz and Latala [33]). Let x, y be random vectors in a

separate Banach space (F, || · ||) such that y = ui εi for some vectors ui ∈ F
i1
and

P (|ϕ (x)|  t)  P (|ϕ (y)|  t) for all ϕ ∈ F ∗ , t > 0.

Then there exists universal constant L such that:

P ( x  t)  LP ( y  t/L) for all t > 0.

1.4.8 Random Vector and Jensen’s Inequality

T
A random vector x = (X1 , . . . , Xn ) ∈ Rn is a collection of n random variables
Xi on a common probability space. Its expectation is the vector
T
Ex = (EX1 , . . . , EXn ) ∈ Rn .
30 1 Mathematical Foundation

A complex random vector z = x + jy ∈ Cn is a special case of a 2n-dimensional


real random vector (x, y) ∈ R2n .
A collection of random vectors x1 , . . . , xN ∈ Cn is called (stochastic ally)
independent if for all measurable subsets A1 , . . . , AN ⊂ Cn ,

P (x1 ∈ A1 , . . . , xN ∈ AN ) = P (x1 ∈ A1 ) · · · P (xN ∈ AN ) .

Functions of independent random vectors are again independent. A random vector


x in Cn will be called an independent copy of x if x and x are independent and
have the same distribution, that is, P (x ∈ A) = P (x ∈ A) for all A ∈ Cn .
Jensen’s inequality says that: let f : Cn → R be a convex function, and let
x ∈ Cn be a random vector. Then

f (Ex)  Ef (x) . (1.54)

1.4.9 Convergence

For a sequence xn of scalars to converge to a limit x, for every ε > 0, we have


|x − xn | ≤ ε for all sufficiently large n. This notion of convergence is generalized
to metric space convergence.
Let R = (R, d) be a σ-compact metric space (with the σ-algebra, and let Xn be
a sequence of random variables taking values in R. Xn converges almost surely to
X if, for almost every ω ∈ Ω, Xn (ω) converges to X (ω), or equivalently
 
P lim d (Xn , X)  ε = 1
n→∞

for every ε > 0.


Xn converges in probability to X if, for every ε > 0, one has

lim inf P (d (Xn , X)  ε) = 1,


n→∞

or equivalently if d (Xn , X) ≤ ε holds asymptotically almost surely for every


ε > 0.
Xn converges in distribution to X if, for every bounded continuous function
F : R → R, one has

lim EF (Xn ) = EF (X) .


n→∞

1.4.10 Sums of Independent Scalar-Valued Random


Variables: Chernoff’s Inequality

Gnedenko and Kolmogorov [34] points out: “In the formal construction of a course
in the theory of probability, limit theorems appear as a kind of superstructure
over elementary chapters, in which all problems have finite, purely arithmetic
1.4 Probability and Matrix Analysis 31

character. In reality, however, the epistemological value of the theory of probability


is revealed only by limit theorems. Moreover, without limit theorems it is impossible
to understand the real content of the primary concept of all our sciences—the
concept of probability. In fact, all epistemological value of the theory of probability
is based on this: that large-scale random phenomena in their collective action create
strict, nonrandom regularity. The very concept of mathematical probability would
be fruitless if it did not find its realization in the frequency of occurrence of events
under large-scale repetition of uniform conditions (a realization which is always
approximate and not wholly reliable), but that becomes, in principle, arbitrarily
precise and reliable as the number of repetitions increases.”
The philosophy behind the above cited paragraph is especially relevant to the
Big Data: The dimensions of the data are high but finite. We seek a theory of purely
arithmetic character: a non-asymptotic theory.
We follow closely [35] in this development. Ashlwede and Winter [36] proposed
a new approach to develop inequalities for sums of independent random matrices.
Ashlwede-Winter’s method [36] is parallel to the classical approach to derivation
inequalities for real valued random variables. Let X1 , . . . , Xn be independent mean
zero random variables. We are interested in the magnitude
n
Sn = X1 + . . . + X n = Xi .
i=1
For simplicity, we shall assume that |Xi | ≤ 1 almost surely. This hypothesis can be
relaxed to some control of the moments, precisely to having sub-exponential tail.
Fix t > 0 and let λ > 0 be a parameter to be determined later. Our task is to
estimate

p  P (Sn > t) = P eSn > et .

By Markov inequality and using independence, we have



p  e−λt EeλSn = e−λt EeλXi .
i

Next, Taylor’s expansion and the mean zero and boundedness hypotheses can be
used to show that, for every i,
2
eλXi  eλ var Xi
, 0  λ  1.

This results in
n
2
σ2
p  e−λt+λ , where σ 2  var Xi .
i=1

The optimal choice of the parameter λ ∼ min τ /2σ 2 , 1 implies Chernoff’s
inequality
 2 2 
p  max e−t /σ , e−t/2 .
32 1 Mathematical Foundation

1.4.11 Extensions of Expectation to Matrix-Valued


Random Variables

If X is a matrix (or vector) with random variables, then

EX = [EXij ] .

In other words, the expectation of a matrix (or vector) is just the matrix of
expectations of the individual elements.
The basic properties of expectation still hold in these extensions.
If A, B, c are nonrandom, then [37, p. 276]

E (AX + BX + c) = AEX + BEX + c. (1.55)

Define a weighted sum


n
Sn = a 1 X 1 + · · · + a n X n = aT X.
i=1

1.4.12 Eigenvalues and Spectral Norm

Md is the set of real symmetric d × d matrices. Cd×d Herm denote the set of complex
Hermitian d × d matrices. The spectral theorem state that all A ∈ Cd×d Herm have d
real eigenvalues (possibly with repetitions) that correspond to an orthonormal set of
eigenvectors. λmax (A) is the largest eigenvalue of A. All the eigenvalues are sorted
in non-increasing manner. The spectrum of A, denoted by spec(A), is the multiset
of all eigenvalues, where each eigenvalue appears a number of times equal to its
multiplicity. We let

C ≡ max |Cv| (1.56)


v∈Cd |v|=1

Herm (| · | is the Euclidean norm). By the spectral


denote the operator norm of C ∈ Cd×d
theorem,

A = max {λmax (A) , λmax (−A)} . (1.57)

Using the spectral theorem for the identity matrix I gives: I = 1. Moreover,
the trace of A, Tr (A) is defined as the sum of the sum of the diagonal entries of A.
The trace of a matrix is equal to the sum of the eigenvalues of A, or

d
Tr (A) = λi (A) .
i=1

See Sect. 1.4.1 for more properties of trace and eigenvalues.


1.4 Probability and Matrix Analysis 33

Given a matrix ensemble X, there are many statistics of X that one may wish to
consider, e.g., the eigenvalues or singular values of X, the trace, and determinant,
etc. Including basic statistics, namely the operator norm [9, p. 106]. This is a basic
upper bound on many quantities. For example, A op is also the largest singular
value σ1 (A) of A, and thus dominates the other singular values; similarly, all
eigenvalues λi (A) of A clearly have magnitude at most A op since λi (A) =
2
σ1 (A) . Because of this, it is particularly important to obtain good upper bounds,
 
P A op  λ  ··· ,

on this quantity, for various thresholds λ. Lower tail bounds are also of interest; for
instance, they give confidence that the upper tail bounds are sharp.
We denote |A| the positive operator (or matrix) (A∗ A)
1/2
and by s(A) the
vector whose coordinates are the singular values of A, arranged as s1 (A) 
s2 (A)  · · ·  sn (A). We have [23]

A= |A| = s1 (A) .

Now, if U, V are unitary operators on Cn×n , then |UAV| = V∗ |A| V and hence

A = UAV

for all unitary operators U, V. The last property is called unitary invariance. Several
other norms have this property. We will use the symbol |||A||| to mean a norm on
n × n matrices that satisfies

|||A||| = |||UAV||| (1.58)

for all A and for unitary U, V. We will call such a norm a unitarily invariant
norm. We will normalize such norms such that they take the value 1 on the matrix
diag(1, 0, . . . , 0).

1.4.13 Spectral Mapping

The multiset of all the eigenvalues of A is called the spectrum of A, denoted


spec(A), where each eigenvalue appears a number of times equal to its multiplicity.
When f (t) is a polynomial or rational function with scalar coefficients and a
scalar argument, t, it is natural to define f (A) by substituting A for t, replacing
division by matrix inversion, and replacing 1 by the identity matrix [1, 20]. For
example,
34 1 Mathematical Foundation

1 + t2 −1 2
f (t) = ⇒ f (A) = (I − A) (I + A)
1−t

if 1 ∈/ spec (A).
If f (t) has a convergent power series representation, such as

t2 t3 t4
log(1 + t) = t − + − , |t| < 1,
2 3 4
we again can simply substitute A for t, to define

A2 A3 A4
log(I + A) = A − + − , ρ (A) < 1.
2 3 4

Here ρ denotes the spectral radius and the condition ρ (A) < 1 ensures convergence
of the matrix series. In this ad hoc fashion, a wide variety of matrix functions can be
defined. This approach is certainly appealing to engineering communities, however,
this approach has several drawbacks.
Theorem 1.4.4 (Spectral Mapping Theorem [38]). Let f : C → C be an entire

analytic function with a power-series representation f(z) = cl z l ,(z ∈ C). If all
l0
cl are real, we define the mapping expression:

f (A) ≡ cl Al , A ∈ Cd×d
Herm , (1.59)
l0

where Cd×d
Herm is the set of Hermitian matrices of d×d. The expression corresponds to
a map from Cd×d
Herm to itself. The so-called spectral mapping property is expressed as:

spec f (A) = f (spec (A)) . (1.60)

By (1.60), we mean that the eigenvalues of f (A) are the numbers f (λ) with λ ∈
spec(A). Moreover, the multiplicity of ξ ∈ spec (A) is the sum of the multiplicity
of all preimages of ξ under f that lie in spec(A).
For any function f : R → R, we extend f to a function on Hermitian matrices
as follows. We define a map on diagonal matrices by applying the function to
each diagonal entry. We extend f to a function on Hermitian matrices using the
eigenvalue decomposition. Let A = UDUT be a spectral decomposition of A.
Then, we define
⎛ ⎞
f (D1,1 ) 0
⎜ .. ⎟ T
f (A) = U ⎝ . ⎠U . (1.61)
0 f (Dd,d )
1.4 Probability and Matrix Analysis 35

In other words, f (A) is defined simply by applying the function f to the


eigenvalues of A. The spectral mapping theorem states that each eigenvalue of
f (A) is equal to f (λ) for some eigenvalue λ of A. This point is obvious from
our definition.
Standard inequalities for real functions typically do not have parallel versions
that hold for the semidefinite ordering. Nevertheless, there is one type of relation
(referred to as the transfer rule) for real functions that always extends to the
semidefinite setting:
Claim 1.4.5 (Transfer Rule). Let f : R → R and g : R → R satisfy f (x) ≤ g(x)
for all x ∈ [l, u] ⊂ R. Let A be a symmetric matrix for which all eigenvalues lie in
[l, u] (i.e., lI  A  uI.) Then

f (A)  g (A) . (1.62)

1.4.14 Operator Convexity and Monotonicity

We closely follow [15].


Definition 1.4.6 (Operator convexity and monotonicity). A function f :
(0, ∞) → R is said to be operator monotone in case whenever for all n, and
all positive definite matrix A, B > 0,

A > B ⇒ f (A) > f (B) . (1.63)

A function f : (0, ∞) → R is said to be operator convex in case whenever for all n,


and all positive definite matrixA, B > 0, and 0 < t < 1,

f ((1 − t) A + tB)  (1 − t)f (A) tf (B) . (1.64)

The square function f (t) = t2 is monotone in the usual real-valued sense but not
monotone√ in the operator monotone sense. It turns out that the square root function
f (t) = t is also operator monotone.
The square function is, however, operator convex. The cube function is not
operator convex. After seeing these examples, let us present the Lőwner-Hernz
Theorem.
Theorem 1.4.7 (Lőwner-Hernz Theorem). For −1 ≤ p ≤ 0, the function f (t) =
−tp is operator monotone and operator concave. For 0 ≤ p ≤ 1, the function
f (t) = tp is operator monotone and operator concave. For 1 ≤ p ≤ 2, the function
f (t) = −tp is operator monotone and operator convex. Furthermore, f (t) = log(t)
is operator monotone and operator concave., while f (t) = t log(t) is operator
convex.
f (t) = t−1 is operator convex and f (t) = −t−1 is operator monotone.
36 1 Mathematical Foundation

1.4.15 Convexity and Monotonicity for Trace Functions

Theorem 1.4.8 (Convexity and monotonicity for trace functions). Let f : R →


R be continuous, and let n be any natural number. Then if t → f (t) is monotone
increasing, so is A → Tr f (A) on Hermitian matrices. Likewise, if t → f (t) is
convex, so is A → Tr f (A) on Hermitian matrices, and strictly so if f is strictly
convex.
Much less is required of f in the context of the trace functions: f is continuous and
convex (or monotone increasing).
Theorem 1.4.9 (Peierls-Bogoliubov Inequality). For every natural number n,
the map

A → log {Tr [exp(A)]} (1.65)

is convex on Hermitian matrices.


Indeed, for Hermitian matrices A, B and 0 < t < 1, let ψ(t) be the function

ψ(t) : A → log {Tr [exp(A)]} .

By Theorem 1.4.9, this is convex, and hence

ψ(t) − ψ(0)
ψ(1) − ψ(0) 
t
for all t. Taking the limit t → 0, we obtain
   
Tr eA+B Tr BeA
log  . (1.66)
Tr [eA ] Tr [eA ]

Frequently, this consequence of Theorem 1.4.9 is referred to as the  Peierls-



Bogoliubov Inequality. Not only are both of the functions H → log Tr eH and
ρ → −S (ρ) are both convex, they are Legendre Transforms of one another. Here
ρ is a density matrix. See [39] for a full mathematical treatment of the Legendre
Transform.
Example 1.4.10 (A Novel Use of Peierls-Bogoliubov Inequality for Hypothesis
Testing). To our best knowledge, this example provides a novel use of the Peierls-
Bogoliubov Inequality. The hypothesis for signal plus noise model is

H0 : y = w
H1 : y = x + w
1.4 Probability and Matrix Analysis 37

where x, y, w are signal and noise vectors and y is the output vector. The covariance
matrix relation is used to rewrite the problem as the matrix-valued hypothesis testing

H0 : Ryy = Rww := A
H1 : Ryy = Rxx + Rww := A + B

when x, w are independent. The covariance matrices can be treated as density


C
matrices with some normalizations: C → Tr(C) . Our task is to decide between
two alternative hypotheses: H0 and H1 . This is the very nature, in analogy with the
quantum testing of two alternative states. See [5]. The use of (1.66) gives
   
Tr eRxx +Rww Tr Rxx eRww
log  .
Tr [eRww ] Tr [eRww ]

Let us consider a threshold detector:

H0 : otherwise
   
Tr eRxx +Rww Tr Rxx eRww
H1 : log  T0 , with T0 = .
Tr [eRww ] Tr [eRww ]

Thus, the a prior knowledge of Rxx , Rww can be used to set the threshold of T0 .
In real world, an estimated covariance matrix must be used to replace the above
covariance matrix. We often consider a number of estimated covariance matrices,
which are random matrices. We naturally want to consider a sum of these estimated
covariance matrices. Thus, we obtain

H0 : otherwise
⎛  ⎞
Tr e(R̂xx,1 +···+R̂xx,n )+(R̂ww,1 +···+R̂ww,n )
H1 : log ⎝   ⎠  T0
Tr eR̂ww,1 +···+R̂ww,n

with
  
Tr R̂xx,1 + · · · + R̂xx,n eR̂ww,1 +···+R̂ww,n
T0 =   .
Tr eR̂ww,1 +···+R̂ww,n

If the bounds of sums of random matrices can be used to bound the threshold T0 ,
the problem can be greatly simplified. This example provides one motivation for
systematically studying the sums of random matrices in this book. 
38 1 Mathematical Foundation

1.4.16 The Matrix Exponential

The exponential of an Hermitian matrix A can be defined by applying (1.61) with


the function f (x) = ex . Alternatively, we may use the power series expansion
(Taylor’s series)

Ak
exp (A) = eA := I + . (1.67)
k!
k=1

The exponential of an Hermitian matrix is ALWAYS positive definite because of


the spectral mapping theorem (Theorem 1.4.4 and Eq. (1.62)): eλi (A) > 0, where
λi (A) is the i-th eigenvalue of A. On account of the transfer rule (1.62), the matrix
exponential satisfies some simple semidefinite relations. For each Hermitian matrix
A, it holds that

I + A  eA , and (1.68)

2
cosh (A)  eA /2
. (1.69)

We often work with the trace of the matrix exponential, Tr exp : A → Tr eA . The
trace exponential function is convex [22]. It is also monotone [22] with respect to
the semidefinite order:

A  H ⇒ Tr eA  Tr eH . (1.70)

See [40] for short proofs of these facts.

1.4.17 Golden-Thompson Inequality

The matrix exponential doe not convert sums into products, but the trace exponential
has a related property that serves as a limited substitute.
For n×n complex matrices, the matrix exponential is defined by the Taylor series
( of course a power series representation) as

1 k
exp(A) = eA = A .
k!
k=0

For commutative matrices A and B: AB = BA, we see that eA+B = eA eB by


multiplying the Taylor series. This identity is not true for general non-commutative
matrices. In fact, it always fails if A and B do not commute, see [40].
1.4 Probability and Matrix Analysis 39

The matrix exponential is convergent for all square matrices. Furthermore, it is


 for A ∈ Md that an eigenbasis of A is also an eigenbasis of exp(A)
not hard to see
and that λi eA = eλi (A) for all 1 ≤ i ≤ d. Also, for all A ∈ Md , it holds that
eA ≥ 0.
We will be interested in the case of the exponential function f (x) = ex . For any
Hermitian matrix A, note that eA is positive semi-definite. Whereas in the scalar
case ea+b = ea eb holds, it is not necessarily true in the matrix case that eA+B =
eA · eB . However, the following useful inequality does hold:
Theorem 1.4.11 (Golden-Thompson Inequality). Let A and B be arbitrary
Hermitian d × d matrices. Then
 
Tr eA+B  Tr eA · eB . (1.71)

For a proof, we refer to [23, 41]. For a survey of Golden-Thompson and other
trace inequalities, see [40]. Golden-Thompson inequality holds for arbitrary unitary-
invariant norm replacing the trace, see [23] Theorem 9.3.7. A version of Golden-
Thompson inequality for three matrices fails:
 
Tr eA+B+C  Tr eA eB eC .

1.4.18 The Matrix Logarithm

We define the matrix logarithm as the functional inverse of the matrix exponential:

log eA := A for all Hermitian matrix A. (1.72)

This formula determines the logarithm on the positive definite cone.


The matrix logarithm interfaces beautifully with the semidefinite order [23,
Exercise 4.2.5]. In fact, the logarithm is operator monotone:

0  A  H ⇒ log A  log H, (1.73)

where 0 denotes the zero matrix whose entries are all zeros. The logarithm is also
operator concave:

τ log A + (1 − τ ) log H  log (τ A + (1 − τ ) H) (1.74)

for all positive definite A, H and τ ∈ [0, 1]. Operator monotone functions and
operator convex functions are depressingly rare. In particular, the matrix exponential
does not belong to either class [23, Chap. V]. Fortunately, the trace inequalities of a
matrix-valued function can be used as limited substitute. For a survey, see [40]which
is very accessible. Carlen [15] is also ideal for a beginner.
40 1 Mathematical Foundation

1.4.19 Quantum Relative Entropy and Bregman Divergence

Quantum relative entropy can be interpreted as a measure of dissimilarity between


two positive-definite matrices.
Definition 1.4.12 (Quantum relative entropy). Let X, Y be positive-definite
matrices. The quantum relative entropy of X with respect to Y is defined as

D (X; Y) = Tr [X log X − X log Y − (X − Y)]


= Tr [X (log X − log Y) − (X − Y)] . (1.75)

Quantum relative entropy is also called quantum information divergence and von
Neumann divergence. It has a nice geometric interpretation [42].
A new class of matrix nearness problems uses a directed distance measure called
a Bregman divergence. We define the Bregman divergence of the matrix X from the
matrix Y as

Dϕ (X; Y)  ϕ (X) − ϕ (Y) − ∇ϕ (X) , X − Y , (1.76)

where the matrix inner product X, Y = Re Tr XY∗ . Two principal examples of


2
Bregman divergences are the following. When ϕ (X) = 12 X F , the associated
2
divergence is the squared Frobenius norm 12 X − Y F . When ϕ (X) is the
negative Shannon entropy, we obtain the Kullback-Leibler divergence, which is also
known as relative entropy. But these two cases are just the tip of the iceberg [42].
In general, Bregman divergences provide a powerful way to measure the distance
between matrices. The problem can be formulated in terms of convex optimization
*
minimize Dϕ (X; Y) subject to X ∈ Ck ,
X k

where Ck is a finite collection of closed, convex sets whose intersection is nonempty.


For example, we can apply it to the problem of learning a divergence from data.
Define the quantum entropy function

ϕ (X) = Tr (X log X) (1.77)

for a positive-definite matrix. Note that the trace function is linear. The divergence
D (X; Y) can be viewed as the difference between ϕ (X) and the best affine
approximation of the entropy at the matrix Y. In other words, (1.75) is the special
case of (1.76) when ϕ (X) is given by (1.77). The entropy function ϕ given in (1.77)
is a strictly convex function, which implies that the affine approximation strictly
underestimates this ϕ. This observation gives us the following fact.
Fact 1.4.13 (Klein’s inequality). The quantum relative entropy is nonnegative

D (X; Y)  0.

Equality holds if and only if X = Y.


1.4 Probability and Matrix Analysis 41

Introducing the definition of the quantum relative entropy into Fact (1.4.13), and
rearranging, we obtain

Tr Y  Tr (X log Y − X log X + X) .

When X = Y, both sides are equal. We can summarize this observation in a lemma
for convenience.
Lemma 1.4.14 (Variation formula for trace [43]). Let Y be a positive-definite
matrix. Then,

Tr Y = max Tr (X log Y − X log X + X) .


X>0

This lemma is a restatement of the fact that quantum relative entropy is nonnegative.
The convexity of quantum relative entropy has paramount importance.
Fact 1.4.15 (Lindblad [44]). The quantum relative entropy defined in (1.75) is a
jointly convex function. That is,

D (tX1 + (1 − t) X2 ; tY1 + (1 − t) Y2 )  tD (X1 ; Y1 ) + (1 − t) D (X2 ; Y2 ) , t ∈ [0, 1] ,

where Xi and Yi are positive definite for i = 1, 2.


Bhatia’s book [23, IX.6 and Problem IX.8.17] gives a clear account of this approach.
A very accessible work is [45].
A final useful tool is a basic result in matrix theory and convex analysis [46,
Lemma 2.3]. Following [43], a short proof originally from [47] is included here for
convenience.
Proposition 1.4.16. Let f (·; ·) be a jointly concave function. Then, the function
y → maxx f (x; y) obtained by partial maximization is concave, assuming the
maximization is always attained.
Proof. For a pair of points y1 and y2 , there are points x1 and x2 that meet

f (x1 ; y1 ) = maxf (x; y1 ) and f (x2 ; y2 ) = maxf (x; y2 ) .


x x

For each t ∈ [0, 1], the joint concavity of f says that

maxx f (x; ty1 + (1 − t) y2 )  f (tx1 + (1 − t) x2 ; ty1 + (1 − t) y2 )

 t · f (x1 ; y1 ) + (1 − t) f (x2 ; y2 )

= t · maxx f (x; y1 ) + (1 − t) f (x2 ; y2 ) .

The second line follows from the assumption that f (·; ·) be a jointly concave
function. In words, the partial maximum is a concave function. 
42 1 Mathematical Foundation

If f is a convex function and α > 0, then the function αf is convex. If f1 and f2


are both convex, then so is their sum f1 + f2 . Combining nonnegative scaling and
addition [48, p. 79], we see that the set of convex functions is itself a convex cone:
a nonnegative weighted sum of convex functions

f = w1 f 1 + · · · + w m f m (1.78)

is convex. Similarly, a nonnegative weighted sum of concave functions is concave.


A linear function is of course convex. Let Sn×n stand for the set of the n × n
symmetric matrix. Any linear function f : Sn×n → R can be represented in the
form

f (X) = Tr (CX) , C ∈ Sn×n . (1.79)

1.4.20 Lieb’s Theorem

Lieb’s Theorem is the foundation for studying the sum of random matrices in
Chap. 2. We present a succinct proof this theorem, following the arguments of Tropp
[49]. Although the main ideas of Tropp’s presentation are drawn from [46], his proof
provides a geometric intuition for Theorem 1.4.17 and connects it to another major
result. Section 1.4.19 provides all the necessary tools for this proof.
Theorem 1.4.17 (Lieb [50]). Fix a Hermitian matrix H. The function

A → Tr exp (H + log (A))

is concave on the positive-definite cone.


Proof. In the variational formula, Lemma 1.4.14, select

Y = exp (H + log A)

to obtain

Tr exp (H + log A) = max Tr (X (H + log A) − X log X + X) .


X>0

Using the quantum relative entropy of (1.75), this expression can be rewritten as

Tr exp (H + log A) = max [Tr (XH) − (D (X; A) − Tr A)] . (1.80)


X>0

Note that trace is a linear function.


For a Hermitian matrix H, Fact 1.4.15 says that D (X; A) is a jointly convex
function of the matrix variables A and X. Due to the linearity of the trace function,
1.4 Probability and Matrix Analysis 43

the whole bracket on the right-hand-side of (1.80) is also a jointly convex function
of the matrix variables A and X. It follows from Proposition 1.4.16 that the
right-hand-side of (1.80) defines a concave function of A. This observation
completes the proof. 
We require a simple but powerful corollary of Lieb’s theorem. This result connects
expectation with the trace exponential.
Corollary 1.4.18. Let H be a fixed Hermitian matrix, and let X be a random
Hermitian matrix. Then
 
E Tr exp (H + X)  Tr exp H + log EeX .

Proof. Define the random matrix Y = eX, and calculate that

E Tr exp (H + X) = E Tr exp (H + log Y)


 Tr exp (H + log (EY))

= Tr exp H + log EeX .

The first relation follows from the definition (1.72) of the matrix logarithm because
Y is always positive definite, Y > 0. Lieb’s result, Theorem 1.4.17, says that the
trace function is concave in Y, so in the second relation we may invoke Jensen’s
inequality to draw the expectation inside the logarithm. 

1.4.21 Dilations

An extraordinary fruitful idea from operator theory is to embed matrices within


larger block matrices, called dilations [51]. The Hermitian dilation of a rectangular
matrix B is

0 B
ϕ (B) = . (1.81)
B∗ 0

Evidently, ϕ (B) is always Hermitian. A short calculation yields the important


identity

2 BB∗ 0
ϕ(B) = . (1.82)
0 B∗ B

It is also be verified that the Hermitian dilation preserves spectral information:

λmax (ϕ (B)) = ϕ (B) = B . (1.83)

We use dilations to extend results for Hermitian matrices to rectangular matrices.


44 1 Mathematical Foundation

Consider a channel [52, p. 279] in which the additive Gaussian noise is a


stochastic process with a finite-dimensional covariance matrix. Information theory
tells us that the information (both quantum and classical) depends on only the
eigenvalues of the covariance matrix. The dilations preserve the “information”.

1.4.22 The Positive Semi-definite Matrices and Partial Order

The matrix A is called positive semi-definite if all of its eigenvalues are non-
negative. This is denoted A  0. Furthermore, for any two Hermitian matrices
A and B, we write A ≥ B if A − B ≥ 0. One can define a semidefinite order or
partial order on all Hermitian matrices. See [22] for a treatment of this topic.
For any t, the eigenvalues of A − tI are λ1 − t, . . . , λd − t. The spectral norm of
A, denoted as A , is defined to be maxi |λi |. Thus − A · I  A  A · I.
Claim 1.4.19. Let A, B and C be Hermitian d × d matrices satisfying A ≥ 0 and
B ≤ C. Then, Tr (A · B)  Tr (A · C).
Notice that ≤ is a partial order and that
     
A, B,A ,B ∈ Cd×d
Herm , A  B and A  B ⇒ A + A  B + B .

Moreover, spectral mapping (1.60) implies that

A ∈ Cd×d
Herm , A  0.
2

Corollary 1.4.20 (Trace-norm property). If A ≥ 0, then

Tr (A · B)  B Tr (A) , (1.84)

where B is the spectrum norm (largest singular value).


Proof. Apply Claim 1.4.19 with C = B · I and note that Tr (αA) = α Tr (A)
for any scalar α. 
Suppose a real function f on an interval I has the following property [22, p. 60]:
if A and B are two elements of Hn (I) and A ≥ B, then f (A)  f (B). We say
that such a function f is matrix monotone of order n on I. If f is matrix monotone
of order n for n = 1, 2, . . ., then we say f is operator monotone.
Matrix convexity of order n and operator convexity can be defined in a similar
way. The function f (t) = tr , on the interval [0, ∞) is operator monotone for 0 ≤
r ≤ 1, and is operator convex for 1 ≤ r ≤ 2 and for −1 ≤ r ≤ 0.
1.4 Probability and Matrix Analysis 45

1.4.23 Expectation and the Semidefinite Order

Since the expectation of a random matrix can be viewed as a convex combination


and the positive semidefinite cone is convex, then expectation preserves the
semidefinite order [53]:

X  Y almost surely ⇒ EX  EY. (1.85)

Every operator convex function admits an operator Jensen’s inequality [54]. In


particular, the matrix square is operator convex, which implies that
2 
(EX)  E X2 . (1.86)

The relation (1.86) is also a specific instance of Kadison’s inequality [23, Theorem
2.3.3].

1.4.24 Probability with Matrices

Assume (Ω, F, P) is a probability space and Z : Ω → Cd×d Herm is measurable with


respect to F and the Borel σ-field on Cd×d Herm . This is equivalent to requiring that all
entries of Z be complex-valued random variables. Cd×d Herm is a metrically complete
vector space and one can naturally define an expected value E [Z] ∈ Cd×d Herm . This
turns out to be the matrix E [Z] ∈ Cd×d
Herm whose (i, j)-entry is the expected value of
the (i, j)-entry of Z. Of course, E [Z] is only defined if all entries of Z are integrable,
but this will always be in the case in this section.
The definition of expectation implies that trace and expectation commute:

Tr (E [Z]) = E (Tr [Z]) . (1.87)

Moreover, one can check that the usual product rule is satisfied: If Z, W : Ω →
Cd×d
Herm are measurable and independent, then

E [ZW] = E [Z] E [W] . (1.88)

Finally,

Herm satisfies Z  0 almost surely (a.s.), then E [Z] ≥ 0,


if Z : Ω → Cd×d

which is an easy consequence of another readily checked fact: (v, E [Z] v) =


E [(v, Zv)] , v ∈ Cd , where (·, ·) is the standard Euclidean inner product.
46 1 Mathematical Foundation

1.4.25 Isometries

There is an analogy between numbers and transforms [55, p. 142]. A finite-


dimensional transform is a finite-dimensional matrix.
Theorem 1.4.21 (Orthogonal or Unitary). The following three conditions on a
linear transform U on inner product space are equivalent to each other.
1. U∗ U = I,
2. (Ux, Uy) = (x, y) for all x and y,
3. Ux = x for all x.
Since condition 3 implies that

Ux − Uy = x − y for all x and y,

we see that transforms of the type that the theorem deals with are characterized
by the fact that they preserve distances. For this reason, we call such a transform
an isometry. An isometry on a finite-dimensional space is necessarily orthogonal
or unitary, use of this terminology will enable us to treat the real and the complex
cases simultaneously. On a finite-dimensional space, we observe that an isometry is
always invertible and that U−1 (= U∗ ) is an isometry along with U.

1.4.26 Courant-Fischer Characterization of Eigenvalues

The expectation of a random variable is EX. We write X ∼ Bern (p) to indicate


that X has a Bernoulli distribution with mean p. In Sect. 2.12, one of the central
tools is the variational characterization of a Hermitian matrix given by the Courant-
Fischer theorem. For integers d and n satisfying 1 ≤ d ≤ n, the complex Stiefel
manifold
 
Vnd = V ∈ Cn×d : V∗ V = I

is the collection of orthonormal bases for the d-dimensional subspaces of Cn , or,


equivalently, the collection of all isometric embeddings of Cd into a subspace of
Cn . Then the matrix V∗ AV can be interpreted as the compression of A to the
space spanned by V.
Theorem 1.4.22 (Courant-Fischer). Let A is a Hermitian matrix with dimen-
sion n. Then

λk (A) = min λmax (V∗ AV) and (1.89)


V∈Vn
n−k+1

λk (A) = maxn λmin (V∗ AV) . (1.90)


V∈Vk
1.5 Decoupling from Dependance to Independence 47

A matrix V− ∈ Vnk achieves equality in (1.90) if and only if its columns span a
dominant k-dimensional invariant subspace A. Likewise, a matrix V+ ∈ Vnn−k+1
achieves equality in (1.89) if and only if its columns span a bottom (n − k + 1)-
dimensional invariant subspace A.
The ± subscripts in Theorem 1.4.22 are chosen to reflect the fact that λk (A) is the
∗ ∗
minimum eigenvalue of V− AV− and the maximum eigenvalue of V+ AV+ . As a
consequence of Theorem 1.4.22, when A is Hermitian,

λk (−A) = −λn−k+1 (A) . (1.91)

In other words, for the minimum eigenvalue of a Hermitian matrix A, we


have [53, p. 13]

λmin (A) = −λmax (−A) . (1.92)

This fact (1.91) allows us to use the same techniques we develop for bounding the
eigenvalues from above to bound them from below. The use of this fact is given in
Sect. 2.13.

1.5 Decoupling from Dependance to Independence

Decoupling is a technique of replacing quadratic forms of random variables by


bilinear forms. The monograph [56] gives a systematic study of decoupling and
its applications. A simple decoupling inequality is given by Vershynin [57]. Both
the result and its proof are well known but his short proof is not easy to find in the
literature. In a more general form, for multilinear forms, this inequality can be found
in [56, Theorem 3.1.1].
Theorem 1.5.1. Let A be an n × n (n ≥ 2) matrix with zero diagonal. Let x =
(X1 , . . . , Xn ) , n ≥ 2 be a random vector with independent mean zero coefficients.
Then, for every convex function f , one has

Ef ( Ax, x )  Ef (4 Ax, x ) (1.93)

where x is an independent copy of x.


The consequence of the theorem can be equivalently stated as
⎛ ⎞ ⎛ ⎞
n n

Ef ⎝ aij Xi Xj ⎠  Ef ⎝4 aij Xi Xj ⎠
i,j=1 i,j=1
  

where x = X1 , . . . , Xn is an independent copy of x. In practice, the indepen-
dent copy of a random vector is easily available but the true random vector is hardly
48 1 Mathematical Foundation

available. The larger the dimension n, the bigger gap between both sides of (1.93).
We see Fig. 1.2 for illustration. Jensen’s inequality says that if f is convex,

f (Ex)  Ef (x) , (1.94)

provided that the expectations exist. Empirically, we found when n is large enough,
say n ≥ 100, we can write (1.93) as the form

Ef ( Ax, x )  Ef (C Ax, x ) (1.95)

where C > 1, say C = 1.1. In practice, the expectation is replaced by an average


of K simulations. The size of K affects the tightness of the inequality. Empirical
evidence shows that K = 100–1,000 is sufficient. For Gaussian random vectors, C
is a function of K and n, or C(K, n). We cannot rigorously prove (1.95), however.
Examples of the convex function f (x) in (1.93) include [48, p. 71]
• Exponential. eax is convex on R, for any a ∈ R.
• Powers. xa is convex on R++ , the set of positive real numbers, when a ≥ 1 or
a ≤ 0, and concave for 0 ≤ a ≤ 1.
• Powers of absolute value. |x|p , for p ≥ 1, is convex on R.
• Logarithm. log x is concave on R++ .
• Negative entropy. x log x is convex, either on R++ , or R+ , (the set of nonnegative
real numbers) defined as 0 for x = 0.
Example 1.5.2 (Quadratic form is bigger than bilinear form). The function f (t) =
|t| is convex on R. We will use this simple function in our simulations below.
Let us illustrate Theorem 1.5.1 using MATLAB simulations (See the Box for
the MATLAB code). Without loss of generality, we consider a fixed n × n matrix
A whose entries are Gaussian random variables. The matrix A has zero diagonal.
For the random vector x, we also assume the Gaussian random variables as its n
entries. An independent copy x is assumed to be available to form the bilinear
form. For the quadratic form, we assume that the true random vector x is available:
this assumption is much stronger than the availability of an independent copy x of
the Gaussian random vector x (Fig. 1.1).
Using the MATLAB code in the Box, we obtain Fig. 1.2. It is seen that the right-
hand side (bilinear form) of (1.93) is always greater than the left-hand side of (1.93).
Example 1.5.3 (Multiple-Input, Multiple-Output). Given a linear system, we have
that

y = Hx + n

where x = (X1 , . . . , Xn ) is the input random vector, y = (Y1 , . . . , Yn ) is the output


random vector and n = (N1 , . . . , Nn ) is the noise vector (often Gaussian). The H
is an n × n matrix with zero diagonal. One is interested in the inner product (in the
quadratic form)
1.5 Decoupling from Dependance to Independence 49

a Quadratic and Bilinear Forms b Quadratic and Bilinear Forms


12 Quadratic
40 Quadratic
Bilinear Bilinear
35

Values for Quadratic and Bilinear


Values for Quadratic and Bilinear

10
30
8
25

6 20

15
4
10
2
5

0 0
0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Monte Carlo Index i Monte Carlo Index i
n=2 n=10
c Quadratic and Bilinear Forms
400
Quadratic
Bilinear

350
Values for Quadratic and Bilinear

300

250

200

150

100

50
0 10 20 30 40 50 60 70 80 90 100
Monte Carlo Index i
n=100

Fig. 1.1 Comparison of quadratic and bilinear forms as a function of dimension n. Expectation is
approximated by an average of K = 100 Monte Carlo simulations. (a) n = 2. (b) n = 10. (c) n = 100

n
y, x = Hx, x + n, x = aij Xi Xj + n, x .
i,j=1

if there is the complete knowledge of x = (X


 1 , . . . , Xn ). 
Unfortunately, sometimes
we only know the independent copy x = X1 , . . . , Xn , rather than the random
vector x = (X1 , . . . , Xn ). Under this circumstance, we arrive at a bilinear form
n

y, x = Hx, x + n, x = aij Xi Xj + n, x .
i,j=1
50 1 Mathematical Foundation

a Quadratic and Bilinear Forms b Quadratic and Bilinear Forms


130 Quadratic
100 Quadratic
Bilinear Bilinear
Values for Quadratic and Bilinear

Values for Quadratic and Bilinear


120 95

110
90
100
85
90
80
80

70 75

60 70
0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Monte Carlo Index i Monte Carlo Index i
K=100, C=1.4 K=500, C=1.15
c Quadratic and Bilinear Forms
92 Quadratic
Bilinear
90
Values for Quadratic and Bilinear

88

86

84

82

80

78

76

74

72
0 10 20 30 40 50 60 70 80 90 100
Monte Carlo Index i
K=1000, C=1.1

Fig. 1.2 Comparison of quadratic and bilinear forms as a function of K and C . Expectation is
approximated by an average of K Monte Carlo simulations. n = 100. (a) K = 100, C = 1.4 (b)
K = 500, C = 1.15 (c) K = 1,000, C = 1.1

Similarly, if we have the complete information of n, we can form

Ay, n = AHx, n + An, n ,

where A is an n × n matrix with zero diagonal, and An, n is the quadratic form.
On the other hand, if we have the independent copy n of n, we can form

Ay, n = AHx, n + An, n ,

where An, n is the bilinear form. Thus, (1.93) can be used to establish the
inequality relation. 
1.5 Decoupling from Dependance to Independence 51

MATLAB Code: Comparing Quadratic and Bilinear Forms


clear all;
for itest=1:100
K=100;n=10; Q=0; B=0; C=4; A1=randn(n,n); A=A1;
for itry=1:K % We use K Monte Carlo simulations to approximate the
expectation
for i=1:n
A(i,i)=0; % A is an n x n matrix with zero diagonal
end
x=randn(n,1); % random vector x
x1=randn(n,1); % independent copy x of random vector x

Q=Q+abs(x *A*x); % f (x) = |x|p , for p ≥ 1, is convex on R . p=1 is chosen
here.
B=B+abs(C*x1 *A*x); % The prime  represents the Hermitian transpose in
MATLAB
end
Q=Q/K; B=B/K; % An average of K Monte Carlo simulations to approximate
the expectation
Qtest(itest,1)=Q; Btest(itest,1)=B;
end
n=1:length(Qtest);
figure(1), plot(n,Qtest(n), b– ,n,Btest(n), r-* ), xlabel( Monte Carlo Index i ),
ylabel( Values for Quadratic and Bilinear ), title( Quadratic and Bilinear Forms ),
legend( Quadratic ,  Bilinear ), grid

n
Proof of Theorem 1.5.1. We give a short proof due to [57]. Let A = (aij )i,j=1 ,
and let ε1 , . . . , εn be independent Bernoulli random variables with P (εi = 0) =
P (εi = 1) = 12 . Let Eε be the conditional expectation with respect to these
random variables εi , i = 1, . . . , n. and similarly for the conditional
  expectation



with respect to random vectors x = (X1 , . . . , Xn ) and x = X1 , . . . , Xn . Let
[n] = {1, 2, . . . , n} be the set. We have

Ax, x = aij Xi Xj = 4Eε εi (1 − εi ) aij Xi Xj .


i,j∈[n] i,j∈[n]

By Jensen’s inequality (1.94) and Fubini’s inequality [58, 59],


⎛ ⎞

Ef ( Ax, x )  Eε Ex f ⎝4 εi (1 − εi ) aij Xi Xj ⎠ .
i,j∈[n]
52 1 Mathematical Foundation

We fix a realization of ε1 , . . . , εn and consider the subset I = {i ∈ [n] : εi = 1}.


Then we obtain

4 εi (1 − εi ) aij Xi Xj = 4 εi (1 − εi ) aij Xi Xj .
i,j∈[n] (i,j)∈I×I c

where I c is the complement the subset I. Since the Xi , i ∈ I are independent of the
Xj , j ∈ I c , the distribution of the sum will not change if we replace Xj , j ∈ I c by
 
their independent copies Xj , j ∈ I c , the coordinates of the independent copy x of
the x. As a result, we have
⎛ ⎞

Ef ( Ax, x )  Eε Ex,x f ⎝4 εi (1 − εi ) aij Xi Xj ⎠ .
(i,j)∈I×I c

We use a simple consequence of Jensen’s inequality (1.94): If Y and Z are


independent random variables and EZ = 0 then

Ef (Y ) = Ef (Y + EZ)  E (Y + Z) .

Using this fact for

 
Y =4 aij Xi Xj , Z=4 aij Xi Xj ,
(i,j)∈I×I c (i,j)∈I×I
/ c

we arrive at
⎛ ⎞
n

Ex,x f (Y ) = Ex,x f (Y + Z) = f ⎝4 aij Xi Xj ⎠ = Ef (4 Ax, x ) .
i,j=1

Taking the expectation with respect to (εi ), we complete the proof. 

The following result is valid for the Gaussian case.


Theorem 1.5.4 (Arcones and Giné [60]). These exists an absolute constant C
such that the following holds for all p ≥ 1. Let g = (g1 , . . . , gn ) be a sequence of
independent standard Gaussian random variables. If A is a collection of Hermitian
matrices and g is an independent copy of g, then
 p  p
 n   n 
 n
 2   

E sup  gi gj Ai,j +  
gi − 1 Ai,i   C E sup 
p
gi gj Ai,j  .


A∈A i,j=1 i=1  A∈A i,j=1 


1.5 Decoupling from Dependance to Independence 53

In other words, we have


  p
E sup  Ag, g + Tr AggT − Tr (A)  C p E sup | Ag, g | .
p
A∈A A∈A

Theorem 1.5.5 dates back to [61], but appeared with explicit constants and with
a much simplified proof in [62]. Let X be the operator norm and X F be the
Frobenius norm.
Theorem 1.5.5 (Theorem 17 of Boucheron et al. [62]). Let X be the N × N
matrix with extries xi,j and assume that xi,i = 0 (zero diagonal) for all i ∈
n
{1, . . . , N }. Let ξ = {ξi }i=1 be a Rademacher sequence. Then, for any t > 0,
⎛  ⎞ + ,
 
  1 96
65 t t2

P   ⎠
ξi ξj xi,j  > t  2 exp − min , . (1.96)
64 X 2
 i,j  X F

Or
+ ,
   96
t2
  1 65 t
P ξ T Xξ  > t  2 exp − min , 2 .
64 X X F

Let F denote a collection of n × n symmetric matrices X, and ε1 , . . . , εn are


i.i.d. Rademacher variables. For convenience assume that the matrices X have zero
diagonal, that is, Xii = 0 for all X ∈ F and i = 1, . . . , n. Suppose the supremum
of the L2 operator norm of matrices (X)X∈F is finite, and without loss of generality
we assume that this supremum equals one, that is,

sup sup zT Xz = 1
X∈F z22 1

for z ∈ Rn .
Theorem 1.5.6 (Theorem 17 of Boucheron et al. [62]). For all t > 0,
 
t2
P (Z  E [Z] + t)  exp − 2
32E [Y ] + 65t/3

where the random variable Y is defined as


⎛ ⎛ ⎞2 ⎞1/2
n n
⎜ ⎝ ⎟
Y = sup ⎝ εj Xij ⎠ ⎠ .
X∈F i=1 j=1
54 1 Mathematical Foundation

1.6 Fundamentals of Random Matrices

Here, we highlight the fundamentals of random matrix theory that will be needed
in Chap. 9 that deals with high-dimensional data processing motivated by the
large-scale cognitive radio network testbed. As mentioned in Sect. 9.1, the basic
building block for each node in the data processing is a random matrix (e.g., sample
covariance matrix). A sum of random matrices arise naturally. Classical textbooks
deal with a sum of scalar-valued random variables—and the central limit theorem.
Here, we deal with a sum of random matrices—matrix-valued random variables.
Many new challenges will be encountered due to this fundamental paradigm shift.
For example, scalar-valued random variables are commutative, while matrix-valued
random variables are non-commutative. See Tao [9].

1.6.1 Fourier Method

This method is standard method for the proof of the central limit theorem. Given any
real random variable X, the characteristic function FX (t) : R → C is defined as

FX (t) = EejtX .

Equivalently, FX is the Fourier transform of the probability measure μX .The signed


Bernoulli distribution has FX = cos(t) and the normal distribution N μ, σ 2 has
2 2
FX (t) = ejtμ e−σ t /2 .
For a random vector X taking values in Rn , we define FX (t) : Rn → C as

FX (t) = Eejt·X

where · denotes the Euclidean inner product on Rn . One can similarly define the
characteristic function on complex vector spaces Cn by using the complex inner
product

(z1 , . . . , zn ) · (w1 , . . . , wn ) = Re (z1 w̄1 + · · · + zn w̄n ) .

1.6.2 The Moment Method

The most elementary (but still remarkably effective) method is the moment
method [63]. The method is to understand the distribution of a random variable
X via its moments X k . This method is equivalent to Fourier method. If we Taylor
expand ejtX and formally exchange the series and expectation, we arrive at the
heuristic identity
1.6 Fundamentals of Random Matrices 55

∞ k
(jk)
FX (t) = EX k
k!
k=0

which connects the characteristic functions of a real variable X as a kind of


generating functions for the moments. In practice, the moment method tends to look
somewhat different from the Fourier methods, and it is more apparent how to modify
them to non-independent or non-commutative settings.
The Fourier phases x → eitx are bounded, but the moment function x → xk
becomes unbounded at infinity. One can deal with this issue, however, as long as
one has sufficient decay:
Theorem 1.6.1 (Carleman continuity theorem). Let Xn be a sequence of uni-
formly sub-Gaussian real random variables, and let X be another sub-Gaussian
random variable. Then the following statements are equivalent:
1. For every k = 0, 1, 2, . . ., EXnk converges to EX k .
2. Xn converges in distribution to X.
See [63] for a proof.

1.6.3 Expected Moments of Random Matrices with Complex


Gaussian Entries

Recall that for ξ in R and σ 2 ∈ ]0, ∞[, N ξ, 12 σ 2 denotes the Gaussian distribution
with mean ξ and variance σ 2 . The normalized trace is defined as

1
trn = Tr.
n

The first class, denoted Hermitian Gaussian Random Matrices or HGRM(n,σ 2 ), is


a class of Hermitian n × n random matrices A = (aij ), satisfying that the entries
A = (aij ) , 1  i  j  n, forms a set of 12 n (n + 1) independent, Gaussian
random variables, which are complex valued whenever i < j, and fulfill that
 
2
E (aij ) = 0, and E |aij | = σ 2 , for all i, j.

The case σ 2 = 12 gives the normalization used by Wigner [64] and Mehta [65],
while the case σ 2 = n1 gives the normalization used by Voiculescu [66]. We say that
A is a standard Hermitian Gaussian random n × n matrix with entries of variance
σ 2 , if the following conditions are satisfied:
1. The entries aij , 1  i  j  n, form a set of 12 n (n + 1) independent, complex
valued random variables.
56 1 Mathematical Foundation

2. Foreach i in {1, 2, . . . , n}, aii is a real valued random variable with distribution
N 0, 12 σ 2 .
3. When i ≤ j, the real and imaginary parts Re (aij ) , Im (aij ), of aij are indepen-
dent, identically distributed random variables with distribution N 0, 12 σ 2 .
4. When j > i, aij = āji , where d¯ is the complex conjugate of d.
We denote by HGRM(n,σ 2 ) the set of all such random matrices. If A is an element
of HGRM(n,σ 2 ), then
 
2
E |aij | = σ 2 , for all i, j.

The distribution of the real valued random variable aii has density
 
x2
x → √ 1
2πσ 2
exp − 2σ 2 , x ∈ R,

with respect to Lebesgue measure on R, whereas, if i ≤ j, the distribution of the


complex valued random variable aij has density
 
z2
z → √ 1
2πσ 2
exp − 2σ 2 , z ∈ C,

with respect to (w.r.t.) Lebesgue measure on C. For any complex matrix H, we


have that

 n
2
Tr H 2
= hii + 2 |hij | .
i=1 i<j

The distribution of an element A of HGRM(n,σ 2 ) has the density


 
H → √ 1
2πσ 2
exp − 2σ1 2 Tr H2 , H ∈ Cn×n .

The second class, denoted Gaussian Random Matrices or GRM(m,n,σ 2 ), is a


class of m × n random matrices B = (bij ) , 1  i  m, 1  j  n. This
class forms a set of mn independent, complex-valued, Gaussian random variables,
satisfying that
 
2
E (bij ) = 0, and E |bij | = σ 2 , for all i, j.

We say B is a standard Gaussian random matrix of m×n with entries of variance σ 2 ,


if real valued random variables Re (bij ) , Im (bij ) , 1  i, j  n, form a family of
2mn  independent, identically distributed (i.i.d.) random variables, with distribution
N 0, 12 σ 2 . This class starts with Wishart [67] and Hsu [68].
We are interested in the explicit formulas for the mean values E (Tr [exp (sA)])
in the Wigner case, and
1.6 Fundamentals of Random Matrices 57

E (Tr [B∗ B exp (sB∗ B)])

in the Wishart case, as functions of a complex parameter s. A new and entirely


analytical treatment of these problems is given by the classical work of Haagerup
and Thorbjørnsen [69] which we follow closely for this section.

1.6.4 Hermitian Gaussian Random Matrices HGRM(n, σ 2 )

We need the confluent hyper-geometric function [70, Vol. 1, p. 248] (a, c, x) →


Φ (a, c, x) that is defined as

(a)n xn a x a(a + 1) x2
Φ (a, c, x) = =1+ + + ··· ,
n=0
(c)n n! c1 c(c + 1) 2

for a, c, x in C, such that c ∈


/ Z\N. In particular, if a ∈ Z\N, then (x) → Φ (a, c, x)
is a polynomial in x of degree −a, for any permitted c. For any non-negative integer
n, and any complex number w, we apply the notation

1, if n = 0,
(w)n =
w(w + 1)(w + 2) · · · (w + n − 1) , if n ∈ N.

For any element A of HGRM(n, σ 2 ) and any s ∈ C, we have that


 
σ 2 s2 
E (Tr [exp (sA)]) = n · exp · Φ 1 − n, 2; −σ 2 s2 .
2

If (Xn ) is a sequence of random matrices, such that Xn ∈ HGRM n, n1 for all n
in N. Then for any s ∈ C, we have that

1 2 
lim E (Tr [exp (sXn )]) = exp (sx) 4 − x2 dx,
n→∞ 2π −2

and the convergence is uniform on compact subsets of C. Further we have that for
the k-th moment of Xn

   1 2 
lim E trn Xkn = xk 4 − x2 dx,
n→∞ 2π −2

and in general, for every continuous bounded function f : R → C,

1 2 
lim E (trn [f (Xn )]) = f (x) 4 − x2 dx.
n→∞ 2π −2
58 1 Mathematical Foundation

Often, a recursion formula is efficient in calculation. Let A is an element of


HGRM(n, 1), and for integer k define
  
C (k, n) = E Tr A2k .

Then the initial values are C (0, n) = n, C (1, n) = n2 , and for fixed n in N, the
numbers C(k, n) satisfy the recursion formula:

4k + 2 k 4k 2 − 1
C (k + 1, n) = n · · C (k, n) + · C (k − 1, n) , k  1.
k+2 k+2
(1.97)
We can further show that C(k, n) has the following form [71]
 
k
2

C (k, n) = ai (k) nk+1−2i , k ∈ N0 , n ∈ N. (1.98)


i=0

Here the notation N0 denotes the integer that does not include zero, in contrast with
N. The coefficients ai (k) , i, k ∈ N0 are determined by the following recursive
formula
 
ai (k) = 0, i  k2 + 1,
 
1 2k
a0 (k) = , k ∈ N0
k+1 k

4k + 2 k 4k 2 − 1
ai (k + 1) = · ai (k) + · ai−1 (k − 1) , k, i ∈ N.
k+2 k+2

A list of the numbers of ai (k) is given in [71, p. 459].


From (1.97) and (1.98), we can get [69] for any A in HGRM(n, 1),
  
E Tr A2 = n2 ,
  
E Tr A4 = 2n3 + n,
  
E Tr A6 = 5n4 + 10n2 ,
  
E Tr A8 = 14n5 + 70n3 + 21n,
  
E Tr A10 = 42n6 + 420n4 + 483n2 ,

etc. If we replace the A above by an element X of HGRM(n, n1 ), and Tr by trn ,


then we have to divide the above numbers by nk+1 . Finally, for X of HGRM(n, n1 ),
we have
1.6 Fundamentals of Random Matrices 59

  
E trn A2 = 1,
  
E trn A4 =2+ 1
n2 ,
  
E trn A6 =5+ 10
n2 ,
  
E trn A8 = 14 + 70
n2 + 21
n4 ,
  
E trn A10 = 42 + 420
n2 + 483
n4 ,
  
etc. The constant term in E trn A2k is
  2 
1 2k 1
a0 (k) = = x2k 4 − x2 dx,
k+1 k 2π −2

in concordance with Wigner’s semi-circle law.

1.6.5 Hermitian Gaussian Random Matrices GRM(m, n, σ 2 )

We first define the function ϕα


k (x) as

 1/2
k!
ϕα
k (x) = xα exp (−x) · Lα
k (x) , k ∈ N0 , (1.99)
Γ (k + α + 1)

where Lα
k (x)k∈N0 is the sequence of generalized Laguerre polynomials of order α,
i.e.,

−1 −α dk  k+α

k (x) = (k!) x exp (x) · x exp (−x) , k ∈ N0 .
dxk

Here Γ (x) is the Gamma function.


Now we can state a corollary from [69]. Let B be an element of GRM(m, n, 1),
let ϕαk (x) , α ∈ ]0, ∞[ , k ∈ N0 , be the functions introduced in (1.99), and let
f : ]0, ∞[ → R a Borel function.1 If m ≥ n, we have that
-n−1 .
∞ 
∗ 2
E (Tr [f (B B)]) = f (x) ϕm−n
k (x) dx.
0 i=0

If m ≤ n, we have that

1A map f : X → Y between two topological spaces is called Borel (or Borel measurable) if
f −1 (A) is a Borel set for any open set A.
60 1 Mathematical Foundation

-n−1 .
∞  2
E (Tr [f (B∗ B)]) = (n − m)f (0) + f (x) ϕn−m
k (x) dx.
0 i=0

We need to define the hyper-geometric function F


Q of m × n is an element from GRM(m, n, n1 ). Denote c = m
n. We have

E (trn [Q∗ Q]) = c


 
E trn (Q∗ Q) = c2 + c,
2

  
E trn (Q∗ Q) = c3 + 3c2 + c + cn−2
3

   
E trn (Q∗ Q) = c4 + 6c3 + 6c2 + c + 5c2 + 5c n−2
4

   
E trn (Q∗ Q) = c5 +10c4 +20c3 +10c2 +c + 15c3 +40c2 +15c n−2 +8cn−4 .
5

(1.100)
 
In general, E trn (Q∗ Q) is a polynomial of degree [ k−1 −2
k
2 ] in n , for fixed c.

1.7 Sub-Gaussian Random Variables

The material here is taken from [72–74]. Buldygin and Solntsev [74] develops and

N
uses this tool systemically. If SN = ai Xi , where Xi are the Bernoulli random
i=1
  2 2
variables, then its generating moment function E etX satisfies E etX  eσ t /2 .
On the other hand, if X is a Gaussian random variable
 with mean zero and variance
2 2
E X 2 = σ 2 , its moment generating function E etX is eσ t /2 . This led Kahane
[75] to make the following definition. A random variable X is sub-Gaussian, with
exponent b, if
 2 2
E etX  eb t /2 (1.101)

for all −∞ < t < ∞.


Lemma 1.7.1 (Equivalence of sub-Gaussian properties [72]). Let X be a ran-
dom variable. Then the following properties are equivalent with parameters Ki > 0
that are different from each other by at most an absolute constant.2

1. Tails: P (|X| > t)  exp 1 − t2 /K12 , for all t ≥ 0.

2 The precise meaning of this equivalence is the following: There is an absolute constant C such
that property i implies property j with parameter Kj  CKi for any two properties i, j = 1, 2, 3.
1.7 Sub-Gaussian Random Variables 61

p 1/p √
2. Moments: (E|X| )  K2 p, for all p ≥ 1.
3. Super-exponential moment: E exp X 2 /K32  e.
Moreover, if EX = 0, then properties 1–3 are also equivalent to the
following one: 
4. Laplace transform condition: E [exp (tX)]  exp t2 K42 for all t ∈ R.
If X1 , . . . , XN are independent sub-Gaussian random variables with exponents
b1 , . . . , bN respectively and a1 , . . . , aN are real numbers, then [73, p. 109]

  
N
 
N
2 2
E e t(a1 X1 +···+aN XN )
= E e tai Xi
 eai bi /2 ,
i=1 i=1

 1/2
so that a1 X1 + · · · +aN XN is sub-Gaussian, with exponent a21 b21 + · · · +a2N b2N .
We say that a random variable Y majorizes in distribution another random
variable X if there exists a number α ∈ (0, 1] such that, for all t > 0, one has [74]

αP (|X| > t)  P (|Y | > t) .

In a similar manner, we say that a sequence of random variables {Yi , i  1} unifo-


rmly majorizes in distribution another sequence of random variables {Xi , i  1} if
there exists a number α ∈ (0, 1] such that, for all t > 0, and i ≥ 1, one has

αP (|Xi | > t)  P (|Yi | > t) .

Consider the quantity


 1/2
2 ln E exp (Xt)
τ (X) = sup . (1.102)
|t|>0 t2

We have that
  
τ (X) = inf t  0 : E exp (Xt)  exp X 2 t2 /2 , t∈R .

Further if X is sub-Gaussian if and only if τ (X) < ∞.


The quantity τ (X) will be called the sub-Gaussian standard. Since one has

1
E exp (Xt) = 1 + tEX + t2 EX 2 + o(t2 ),
2
 2 2 1 2 2
exp a t /2 = 1 + t a + o(t2 ),
2
as t → 0, then the inequality

E exp (Xt)  exp X 2 t2 /2 , t∈R
62 1 Mathematical Foundation

may only hold if

EX = 0, EX 2  a2 .

This is why each sub-Gaussian random variable X has zero mean and satisfies the
condition

EX 2  τ 2 (X) .

We say the sub-Gaussian random variable X has parameters (0, τ 2 ).


If X is a sub-Gaussian random variable X has parameters (0, τ 2 ), then

P (X > t)  exp −t2 /τ 2 ,

P (−X > t)  exp −t2 /τ 2 ,

P (|X| > t)  2 exp −t2 /τ 2 .

Assume that X1 , . . . , XN are independent sub-Gaussian random variables. Then


one has
n
 n
τ2 Xi  τ 2 (Xi ),
i=1 i=1
m
 n

max τ 2
Xi τ 2
Xk .
1imn
k=i k=1

Assume that X is a zero-mean random variable. Then the following inequality


holds

τ (X)  2θ (X) ,

where

2n · n!
θ (X) = sup EX 2n .
n1 (2n)!

We say that a random variable X is strictly sub-Gaussian if X is sub-Gaussian


and EX 2 = τ 2 (X). If a ∈ R and X is strictly sub-Gaussian then we have
2
τ 2 (aX) = a2 τ 2 (X) = a2 EX 2 = E(aX) .

In such a way, the class of sub-Gaussian random variable is closed with respect to
multiplication by scalars. This class, however, is not closed with respect to addition
of random variables. The next statement motivates us to set this class out.
1.7 Sub-Gaussian Random Variables 63

Lemma 1.7.2 ([74]). Let X1 , . . . , XN be independently sub-Gaussian random



n
variables, and {c1 , . . . , cN } ∈ R. Then ci Xi is strictly sub-Gaussian random
i=1
variable.
Theorem 1.7.3 ([74]). Assume that Y is a Gaussian random variable with zero
mean and variance σ 2 (or with parameters (0, σ 2 )). Then
1. If X is a sub-Gaussian random variable with parameters (0, τ 2 ) and σ 2 > τ 2 ,
then Y majorizes X in distribution.
2. If Y majorizes in distribution some zero-mean random variable X, then the
random variableX is a sub-Gaussian random variable.
Assume that {Xi , i  1} is a sequence of sub-Gaussian random variable with
parameters (0, τi2 ), i  1 while {Yi , i  1} is a sequence of Gaussian random
variable with parameters (0, ασi2 ), i  1, α > 1. Then, the sequence {Yi , i  1}
uniformly majorizes in distribution {Xi , i  1}.
A random n-dimensional vector x will be called standard sub-Gaussian vec-
tor [74, p. 209] if, in some orthogonal basis of the space Rn , its components
X (1) , . . . , X (n) are jointly independently sub-Gaussian random variables. Then
we set
 
τn (x) = max τ X (i) ,
1in

where τ (X) is defined in (1.102). The simplest example of standard sub-Gaussian


vector is the standard n-dimensional Gaussian random vector y.
A linear combination of independent subGaussian random variables is subGaus-
sian. As a special case, a linear combination of independent Gaussian random
variables is Gaussian.
Theorem 1.7.4. Let X1 , . . . , Xn be independent centered subGaussian random
variables. Then for any a1 , . . . , an ∈ R
⎛ ⎞
  
 n  ⎜ ct2 ⎟
 
P  ai Xi  > t  2 exp ⎜
⎝− 
⎟.

  n
2
i=1 ai
i=1

 n 
 2 1/2
Proof. We follow [76] for the short proof. Set vi = ai / ai . We have to
i=1

n
show that the random variable Y = vi Xi is subGaussian. Let us check the
i=1
Laplace transform condition (4) of the definition of a subGaussian random variable.
For any t ∈ R
64 1 Mathematical Foundation

 

n !
n
E exp t v i Xi = E exp (tvi Xi )
i=1 
i=1 
!
n  n 2 2
 exp t2 vi2 K42 = exp t2 K42 vi2 = et K4 .
i=1 i=1

The inequality here follows from Laplace transform condition (4). The constant in
front of the exponent in Laplace transform condition (4) is 1, this fact plays the
crucial role here. 
Theorem 1.7.4 can be used to give a very short proof of a classical inequality due
to Khinchin.
Theorem 1.7.5 (Khinchin’s inequality). Let X1 , . . . , Xn be independent centered
subGaussian random variables. Then for any p ≥ 1, there exists constants Ap , Bp >
0 such that the inequality
1/2  p 1/p 1/2
n  n  n
 
Ap a2i  E ai Xi   Bp a2i
 
i=1 i=1 i=1

holds for all a1 , . . . , an ∈ R.


Proof. We follow [76] for the proof. Without loss of generality, we assume that
 n 
 2 1/2
ai = 1. Let p ≥ 2. Then by Hölder’s inequality
i=1

1/2 ⎛  2 ⎞1/2  p 1/p


n  n   n 
   
a2i = ⎝E  ai Xi  ⎠  E ai Xi  ,
   
i=1 i=1 i=1

so Ap = 1. Using Theorem 1.7.4, we know that the linear combination Y =


n
ai Xi is a subGaussian random variable. Hence,
i=1

p 1/p √
(E|Y | )  C p : Bp .

This is the right asymptotic as p → ∞.


In the case 1 ≤ p ≤ 2 it is enough to prove the inequality for p = 1. Again, using
Hölder’s inequality, we can choose Bp = 1. Applying Khinchin’s inequality with
p = 3, we have
   1/2 3/2  3/4
E|Y |2 = E |Y |1/2 · |Y |3/2  (E |Y |)1/2 · E|Y |3  (E |Y |)1/2 · B3 E|Y |2 .

Thus,
 1/2
B3−3 E|Y |
2
 E |Y | .


1.8 Sub-Gaussian Random Vectors 65

1.8 Sub-Gaussian Random Vectors

Let S n−1 denote the unit sphere in Rn (resp. in Cn ). For two complex vectors
n
a, b ∈ Cn , the inner product is a, b = ai b̄i , where the bar standards for the
i=1
complex conjugate. A mean-zero random vector x on Cn is called isotropic if for
every θ ∈ S n−1 ,
2
E| x, θ | = 1.

A random vector x is called L-sub-Gaussian if it is isotropic and



P (| x, θ |  t)  2 exp −t2 /2L2

for every θ ∈ S n−1 , and any t > 0. It is well known that, up to an absolute constant,
the tail estimates in the definition of a sug-Gaussian random vector are equivalent
to the moment characterization
p 1/p √
sup (| x, θ | )  pL.
θ∈S n−1

Assume that a random vector ξ has independent coordinates ξi , each of which


is an L-sub-Gaussian random variable of mean zero and variance one. One
may verify by direct computation that ξ is L-sub-Gaussian. Rademacher vectors,
standard Gaussian vectors, (that is, random vectors with independent normally
distributed entries of mean zero and variance one), as well as Steinhaus vectors
(that is, random vectors with independent entries that are uniformly distributed on
{z ∈ C : |z| = 1}), are examples of isotropic, L-subGaussian random vectors for
an absolute constant L. Bernoulli random vectors (X = ±1 with equal probability
1/2 for X = +1 and X = −1) are special cases of Steinhaus vectors.
The following well-known bound is relating strong and weak moments. A proof
based on chaining and the majorizing measures theorem is given in [31].
Theorem 1.8.1 (Theorem 2.3 of Krahmer and Mendelson and Rauhut [31]).
Let x1 , . . . , xn ∈ CN and T ∈ CN . If ξ is an isotropic, L-sub-Gaussian random

n
vector and Y = ξi xi , then for every p ≥ 1,
i=1

 1/p  
p p 1/p
E sup | t, Y |  c E sup | t, G | + sup (E| t, Y | ) ,
t∈T t∈T t∈T


N
where c is a constant which depends only on L and G = gi xi for g1 , . . . , gN
i=1
independent standard Gaussian random variables.
66 1 Mathematical Foundation

If p, q ∈ [1, ∞) satisfy 1/p + 1/q = 1, then the p and q norms are dual to
each other. In particular, the Euclidean norm is self-dual p = q = 2. Similarly, the
Schatten p-norm is dual to the Schatten q-norm. If · is some norm on CN and B∗
is the unit ball in the dual norm of · , then the above theorem implies that
 
p 1/p p 1/p
(E Y )  c E G + sup (E| t, Y | ) .
t∈B∗

1.9 Sub-exponential Random Variables

Some random variables have tails heavier than Gaussian. The following properties
are equivalent for Ki > 0

P (|X| > t)  exp (1 − t/K1 ) for all t  0;


p 1/p
(E|Y | )  K2 p for all p  1;
E exp (X/K3 )  e. (1.103)

A random variable X that satisfies one of the equivalent properties of (1.103)


is called a sub-exponential random variable. The sub-exponential norm, denoted
X ψ1 , is defined to be the smallest parameter K2 . In other words,

1 p 1/p
X ψ1 = sup (E|X| ) .
p1 p

Lemma 1.9.1 (Sub-exponential is sub-Gaussian squared [72]). A (scalar val-


ued) random variable X is sub-Gaussian if and only if X 2 is sub-exponential.
Moreover,
2 2 2
X ψ2  X ψ1 2 X ψ2 .

Lemma 1.9.2 (Moment generating function [72]). Let X be a centered sub-


exponential random variable. Then, for t such that |t|  c/ X ψ1 , one has
 
2
E exp (tX)  exp Ct2 X ψ1

where C, c are absolute constants.


Corollary 1.9.3 (Bernstein-type inequality [72]). Let X1 , . . . , XN be indepen-
2
dent centered sub-exponential random variables, and let K = maxi Xi ψ1 . Then
for every a = (a1 , . . . , aN ) ∈ RN and every t ≥ 0, we have
1.10 ε-Nets Arguments 67

   - .
 N  t2
  t
P  ai Xi   t  2 exp −c min 2, K a (1.104)
  K2 a 2 ∞
i=1

where c > 0 is an absolute constant.


Corollary 1.9.4 (Corollary 17 of Vershynin [72]). Let X1 , . . . , XN be indepen-
2
dent centered sub-exponential random variables, and let K = maxi Xi ψ1 . Then,
for every t ≥ 0, we have
N     2 
 
  t t
P  ai Xi   tN  2 exp −c min , N
  K2 K
i=1

where c > 0 is an absolute constant.


Remark 1.9.5 (Centering).
The definition of sub-Gaussian and sub-exponential random variables X does not
require them to be centered. In any case, one can always center X using the simple
fact that if X is sub-Gaussian (or sub-exponential), then so is X − EX. Also,
2 2 2 2
X − EX ψ2 2 X ψ2 , X − EX ψ1 2 X ψ1 .

2 2 2
This follows from triangle inequality X − EX ψ2  X ψ2 + EX ψ2 along
2 2
with EX ψ2 = |EX|  X ψ2 , and similarly for the sub-exponential norm.

1.10 ε-Nets Arguments

Let (T, d) be a metric space. Let K ⊂ T . A set N ⊂ T is called an ε-net for K if

∀x ∈ K, ∃y ∈ N d (x, y) < ε.

A set S ⊂ K is called ε-separated if

∀x ∈ K, ∃y ∈ S d (x, y)  ε.

These two notions are closely related. Namely, we have the following elementary
Lemma.
Lemma 1.10.1. Let K be a subset of a metric space (T, d), and let set N ⊂ T be
an ε-net for K. Then
1. There exists a 2ε-net N  ⊂ K such that N  ≤ N ;
2. Any 2ε-separated set S ⊂ K satisfies S ≤ N ;
3. From the other side, any maximal ε-separated set S  ⊂ K is an ε-net for K.
Let N = |N | be the minimum cardinality of an ε-net of T , also called the covering
number of T at scale ε.
68 1 Mathematical Foundation

Lemma 1.10.2 (Covering numbers of the sphere). The unit Euclidean sphere
S n−1 equipped with the Euclidean metric satisfies for every ε > 0 that
 n
 2
N S n−1 , ε  1+ .
ε

Lemma 1.10.3 (Volumetric estimate). For any ε < 1 there exists an ε-net such
that N ⊂ S n−1 such that
 n  n
2 3
|N |  1 +  .
ε ε

Proof. Let N be a maximal ε-separated subset of sphere S n−1 . Let B2n be Euclidean
ball. Then for any distinct points x, y ∈ N
 ε   ε 
x + B2n ∩ y + B2n = ∅.
2 2
So,

ε   ε   ε  n
|N | · vol B2n = vol x + B2n  vol 1+ B2 ,
2 2 2
x∈N

which implies

 n  n
2 3
|N |  1+  . 
ε ε

Using ε-nets, we prove a basic bound on the first singular value of a random
subGaussian matrix: Let A be an m × n random matrix, m ≥ n, whose entries
are independent copies of a subGaussian random variable. Then
 √ 2
P s1 > t m  e−c1 t m for t  C0 .

See [76] for a proof.


Lemma 1.10.4 (Computing the spectral norm on a net). Let A be a symmetric
n × n matrix, and let Nε be an ε-net of S n−1 for some ε ∈ (0, 1). Then
−1
A = sup | Ax, x |  (1 − 2ε) sup | Ax, x | .
x∈S n−1 x∈Nε

See [72] for a proof.


1.11 Rademacher Averages and Symmetrization 69

1.11 Rademacher Averages and Symmetrization

One simple but basic idea in the study of sums of independent random variables
is the concept of symmetrization [27]. The simplest probabilistic object is the
Rademacher random variable ε, which takes the two values ±1 with equal prob-
ability 1/2. A random vector is symmetric (or has symmetric distribution) if x
and −x have the same distribution [77]. In this case x and εx, where ε is a
Rademacher random variable independent of x, have the same distribution. Let
xi , i = 1, . . . , n be independent symmetric random vectors. The joint distribution
of εi xi , i = 1, . . . , n is that of the original sequence if the coefficients εi are either
non-random with values ±1, or they are random and independent from each other
and all xi with P (εi = ±1) = 12 .
N
The technique of symmetrization leads to so-called Rademacher sums εi x i ,
i=1

N 
N
where xi are scalars, εi xi , where xi are vectors and εi Xi , where Xi are ma-
i=1 i=1
trices. Although quite simple, symmetrization is very powerful since there are nice
estimates for Rademacher sums available—the so-called Khintchine inequalities.
A sequence of independent Rademacher variables is referred to as a Rademacher
sequence. A Rademacher series in a Banach space X is a sum of the form

εi x i
i=1

where x is a sequence of points in X and εi is an (independent) Rademacher


sequence.
For 0 < p ≤ ∞, the lp -norm is defined as
1/p
p
x p = |xi | < ∞.
i

|| · ||2 denotes the Euclidean norm. For p = ∞,

x p = x = sup |xi | , p = ∞.
i

We use || · || to represent the case of p = ∞.


For Rademacher series with scalar coefficients, the most important result is the
inequality of Khintchine. The following sharp version is due to Haagerup [78].
Proposition 1.11.1 (Khintchine). Let p ≥ 2. For every sequence {ai } of complex
scalars,
  - .1/2
 
  2
Ep  εi a i   C p |ai | ,
 
i i
70 1 Mathematical Foundation

where the optimal constant


 1/p √ 1/p
p! √
Cp = p/2
 2 e−0.5 p.
2 (p/2)!

This inequality is typically established only for real scalars, but the real case implies
that the complex case holds with the same constant.
n
 [29]). For a ∈ R , x ∈ {−1, 1} uniform,
n
Lemma 1.11.2 (Khintchine inequality

k k
and k ≥ 2 an even integer, E aT x  a 2 · k k/2 .

For a family of random variables, it is often useful to consider a number of its


independent copies, viz., independent families of random vectors having the same
distributions.
A symmetrization of the sequence of random vectors is the difference of two
independent copies of this sequence

x̃ = x(1) − x(2) . (1.105)

If the original sequences consist of independent random vectors, all the random
vectors x(1) and x(2) used to construct symmetrization are independent. Random
vectors defined in (1.105) are also independent and symmetric.
A Banach space B is a vector space over the field of the real or complex numbers
equipped with a norm || · || for which it is complete. We consider Rademacher

averages εi xi with vector valued coefficients as a natural analog of the Gaussian
i

averages gi xi . A sequence (εi ) of independent random variables taking the
i
values +1 and −1 with equal probability 1/2, that is symmetric Bernoulli or
Rademacher random variables. We usually call (εi ) a Rademacher sequence or

Bernoulli sequence. We often investigate finite or convergent sums εi xi with
i
vector valued coefficients xi .
For arbitrary m × n matrices, A Sp denotes the Schatten p-norm of an m × n
matrix A, i.e.,

A Sp = σ (A) p ,

where σ ∈ Rmin{m,n} is the vector of singular values of A, and || · ||p is the usual
lp -norm defined above. When p = ∞, it is also called the spectrum (or matrix)
norm. The Rademacher average is given by
/ / / /
/ / / /
/ / / /
/ ε i Ai / = /σ ε i Ai / .
/ / / /
i Sp i p
1.11 Rademacher Averages and Symmetrization 71

The infinite dimensional setting is characterized by the lack of the orthogonal


property
/ /2
/ /
/ / 2
E/ xi / = E xi ,
/ /
i 2 i

where (Xi ) is a finite sequence of independent mean zero real valued random
vectors. This type of identity extends to Hilbert space valued random variables, but
does not in general hold for arbitrary Banach space valued random variables. The
classical theory is developed under this orthogonal property.
Lemma 1.11.3 (Ledoux and Talagrand [27]). Let F : R+ → R+ be convex.
Then, for any finite sequence (Xi ) of independent zero mean random variables in a
Banach space B such that EF ( Xi ) < ∞ for every i,
/ / / / / /
1/
/
/
/
/
/
/
/
/
/
/
/
EF / εi Xi /  EF / Xi /  EF 2/ εi X i / .
2/ i
/ /
i
/ /
i
/

Rademacher series appear as a basic tool for studying sums of independent random
variables in a Banach space [27, Lemma 6.3].
Proposition 1.11.4 (Symmetrization [27]). Let {Xi } be a finite sequence of
independent, zero-mean random variables taking values in a Banach space B. Then
/ / / /
/ / / /
p/ / p/ /
E / Xi /  2E / εi X i / ,
/ / / /
i B i B

where εi is a Rademacher sequence independent of {Xi }.


In other words [28], the moments of the sum are controlled by the moments of
the associated Rademacher series. The advantage of this approach is that we can
condition on the choice of Xi and apply sophisticated methods to estimate the
moments of the Rademacher series.
We need some facts about symmetrized random variables. Suppose that Z is a
zero-mean random variable that takes values in a Banach space B. We may define
the symmetrized variable Y = Z − Z0 , where Z0 is an independent copy of Z.
The tail of the symmetrized variable Y is closely related to the tail of Z. Indeed, we
have [28]

P( Z B > 2E Z B + t)  P ( Y B > t) . (1.106)

The relation follows from [27, Eq. (6.2)] and the fact that M (Y )  2EY for every
nonnegative random variable Y . Here M denotes the median.
72 1 Mathematical Foundation

1.12 Operators Acting on Sub-Gaussian Random Vectors

See Sect. 5.7 for the applications of the results here. The proofs below are taken
from [79]. We also follow [79] for the exposition and the notation here. By g, gi ,
we denote independent N (0, 1) Gaussian random variables. For a random variable
p 1/p
ξ and p > 0, we put ξ p = (E|ξ| ) . Let · F be the Frobenius norm and · op
the operator norm. By | · | and < ·, · > we denote the standard Euclidean norm and
the inner product on Rn .
A random variable ξ is called sub-Gaussian if there is a constant β < ∞ such
that

ξ 2k β g 2k k = 1, 2, . . . (1.107)

We refer to the infimum overall all β satisfying (1.107) as the sub-Gaussian constant
of ξ. An equivalent definition is often given in terms of the ψ2 -norm. Denoting the
Orlicz function ψ2 (x) = exp(x2 ) − 1 by ψ2 , ξ is sub-Gaussian if and only if

ξ ψ2 := inf {t > 0 |ψ2 (ξ/t)  1 } < ∞. (1.108)

Denting the sub-Gaussian constant of ξ by β̄, a direct calculation will show the
following common (and not optimal) estimate

β̃  ξ ψ2  β̃ g ψ2 = β̃ 8/3.

The lower estimate follows since Eψ2 (X)√ EX 2k /k! for k = 1, 2, . . .. The upper
one is using the fact that E exp(tg 2 ) = 1/ 1 − 2t for t < 1/2.
Apart from the Gaussian random variables, the prime example of sub-Gaussian
random variables are Bernoulli random variables, taking values +1 and −1 with
equal probability P (ξ = +1) = P (ξ = −1) = 1/2.
Very often, we work with random vectors in Rn of the form ξ = (ξ1 , ξ2 , . . . , ξn ),
where ξi are independent sub-Gaussian random variables, and we refer to such
vectors as sub-Gaussian random vectors. We require that Var (ξi )  1 and sub-
Gaussian constants are at most β. Under these assumptions, we have

Eξi2  Var (ξi )  1 = Eg 2 ,

hence β ≥ 1.
We have the following fact: for any t > 0,
 √   
P |ξ|  t n  exp n ln 2 − t2 / 3β 2 . (1.109)

In particular, P (|ξ|  3β n)  e−2n . Let us prove the result. For an arbitrary
s > 0 and 1 ≤ i ≤ n we have
1.12 Operators Acting on Sub-Gaussian Random Vectors 73

  ∞ ∞

2
ξi2 1 β 2k (βg)
E exp = Eξ 2k  Eg 2k = E exp .
s2 k! · s2k i k! · s2k s2
k=0 k=0


This last quantity is less than or equal to 2 since, e.g., s = 3β. For this choice of s,
    

n 
n
P ξi2 t n
2
 E exp 1
s2 ξi2 −t n2
i=1 i=1
 !
n    2 
2 ξi2
 exp − ts2 n
E exp s2  exp − 3β
t n
2 · 2n ,
i=1

which is the desired result.


Theorem 1.12.1 (Lemma 3.1 of Latala [79]). Let ξ1 , ξ2 , . . . , ξn be a sequence of
independent symmetric sub-Gaussian random variables satisfying (1.107) and let
A = (aij ) be a symmetic matrix with zero diagonal. Then, for any t > 1,
⎛  ⎞
  √ 
 
P ⎝ aij ξi ξj   Cβ 2 t A F + t A op ⎠  e−t ,
 i<j 

where C is a universal constant.


Proof. By (1.107) and by the symmetry of ξi , we immediately get a + bξi 2k 
a + bβgi 2k for any real numbers a, b and a positive integer k. So we have
/ / / / / /
/ / / / / /
/ / / / / /
/ a ξ ξ / / a βg βg / = β2/ a g × g /
j/ .
/ ij i j / / ij i j/ / ij i
/ i<j / / i<j / / i<j /
2k 2k 2k

Using the Hanson-Wright estimate [61], we get


/ /
/ / √ 
/ /
/ aij gi × gj /  C β2 2k A + 2k A
/ / F op
/ i<j /
2k

with some universal constant C  . Taking k = t/2, then by Chebyshev’s inequality


⎛  ⎞
  √ 
 

P  aij ξi ξj   eC  β 2 2k A + 2k A ⎠  e−2k  e−t .
F op
 i<j 

Statement follows, since k ≤ t. 


Theorem 1.12.2 (Lemma 3.2 of Latala [79]). Let ξ1 , ξ2 , . . . , ξn be a sequence of
independent random variables with finite fourth moments. Then for any nonnegative
coefficients bi and t > 0,
74 1 Mathematical Foundation

⎛   12 ⎞
 n
  n
 2 
P ⎝ bi Eξi − ξi  >
2
2t b2i Eξi4 ⎠  e−t .
 
i=1 i=1

n
Proof. We may obviously assume that b2i Eξi4 > 0. For x > 0, we have e−x =
−1  −1
i=1
(ex )  1 + x + x2 /2  1 − x + x2 /2. Thus for λ ≥ 0,
 
E exp −λξi2  1 − Eξi2 + 12 λ2 Eξi4  exp −λEξi2 + 12 λ2 Eξi4 .
 

n  
n
Letting S = bi Eξi2 − ξi2 , we get E exp (λS)  exp 1 2
2λ b2i Eξi4 , and
i=1 i=1
for any u ≥ 0,
⎛ ⎞
⎜ u2 ⎟
P (S  u)  inf E exp (λS − λu)  exp ⎜
⎝− 
n
⎟.

λ0
2 bi Eξi
2 4
i=1


Lemma 1.12.2 is somewhat special since we assume that coefficients are non-
negative. In the general case one has for any sequence of independent random
variables ξi with sub-Gaussian constant at most β and t > 1,
  
 n  2   √ 
 2 
P  ai Eξi − ξi  > Cβ t (ai ) ∞ + t |(ai )|
2
 e−t . (1.110)
 
i=1

We provide a sketch of the proof for the sake of completeness. Let ξ˜i be an
independent copy of ξi . We have by Jensen’s inequality for p ≥ 1,
/ / / /
/ n
 / / n  /
/ / / /
/ ai Eξi2 − ξi2 /  / ai ξi2 − ξ˜i2 / .
/ / / /
i=1 p i=1 p

Random variables ξi2 − ξ˜i2 are independent, symmetric. For k ≥ 1,


/ / / / / /
/ 2 ˜2 /
/ ξi − ξ i /  2β 2 /gi2 /2k  4β 2 /ηi2 /2k
2k

where ηi are i.i.d. symmetric exponential random variables with variance 1. So, for
positive integer k,
1.13 Supremum of Stochastic Processes 75

/ / / /
/ n  / / n /  √ 
/ / 2/ /
/ ai ξi2 − ξ˜i2 /  4β / ai η i /  C1 β 2 k (ai ) ∞ + k (ai ) 2 ,
/ / / /
i=1 2k i=1 2k

where the last inequality follows by the Gluskin-Kwapien estimate [80]. So by


Chebyshev’s inequality
  
 n
   √ 
 
P  ai Eξi2 − ξi2  > eC1 β 2 k (ai ) ∞ + k (ai )  e−2t
  2
i=1

and the assertion easily follows.

1.13 Supremum of Stochastic Processes

We follow [33] for this introduction. The standard reference is [81]. See also [27,82].
One of the fundamental issues of the probability theory is the study of suprema
of stochastic processes. In particular, in many situations one needs to estimate the
quantity Esupt∈T Xt , where supt∈T Xt is a stochastic process. T in order to avoid
measurability problems, one may assume that T is countable. The modern approach
to this problem is based on chaining techniques. The most important case of centered
Gaussian process is well understood. In this case, the boundedness of the process is
related to the geometry of the metric space (T, d), where
 1/2
2
d (t, s) = E(Xt − Xs ) .

In 1967, R. Dudley [83] obtained an upper bound for Esupt∈T Xt in terms of entry
numbers and in 1975 X. Fernique [84] improved Dudley’s bound using so-called
majorizing measures. In 1987, Talagrand [85] showed that Fernique’s bound may
be reversed and that for centered Gaussian processes (Xt ),

1
γ2 (T, d)  EsupXt  Lγ2 (T, d) ,
L t∈T

where L is a universal constant. There are many equivalent definitions of the


Talagrand’s gamma function γ2 , for example one may define

γα (T, d) = inf sup 2n/α d (t, Ti ) ,
t∈T i=0

where the infimum runs over all sequences Ti of subsets of T such that |T0 | = 1 and
i
|Ti | = 22 .
76 1 Mathematical Foundation

1.14 Bernoulli Sequence

Another fundamental class of processes is based on the Bernoulli sequence [33],


i.e. the sequence (εi )i1 of i.i.d. symmetric random variables taking values ±1.

For t = 2 , the series Xt := ti εi converges almost surely and for T ∈ 2 ,
i1
we may define a Bernoulli process (Xt )t∈T and try to estimate b(T ) = E sup Xt .
t∈T
There are two ways to bound b(T ). The first one is a consequence of the uniform

bound |Xt |  t 1 = |ti |, so that b(T )  sup t 1 . Another is based on the
i1 t∈T

domination by the canonical Gaussian process Gt := ti gi , where gi are i.d.d.
i1
N (0, 1) random variables. Assuming the independence of (gi ) and (εi ), Jensen’s
inequality implies:
"
2
g(T ) = E sup ti gi = E sup ti εi |gi | E sup ti εi E |gi | = b(T ).
t∈T t∈T t∈T π
i1 i1 i1

1.15 Converting Sums of Random Matrices into Sums


of Random Vectors

Sums of independent random vectors [27, 74, 86] are classical topics nowadays. It
is natural to convert sums of random matrices into sums of independent random
vectors that can be handled using the classical machinery. Sums of dependent
random vectors are much less understood: Stein’s method is very powerful [87].

N
Often we are interested in the sample covariance matrix N1 xi ⊗ xi the sums
i=1
of N rank-one matrices where x1 , . . . , xN are N independent random vectors.

N
More generally, we consider N1 Xi where Xi are independent random Hermitian
i=1
matrices. Let
⎤ ⎡
λ1
⎢ λ2 ⎥
⎢ ⎥
Λ = ⎢ . ⎥ ∈ Rn
⎣ .. ⎦
λn

be a vector of eigenvalues of Y. For each random matrix Xi , we have one random


vector Λi for all i = 1, . . . , N . As a result, we obtain N independent random
vectors consisting of eigenvalues. We are interested in the sums of these independent
random vectors
1.15 Converting Sums of Random Matrices into Sums of Random Vectors 77

N
1
Λi .
N i=1

Then we can use Theorem 1.15.1 to approximate characteristic functions with


normal distribution. In [86], there are many other theorems that can be used in this
context. Our technique is to convert a random matrix into a random vector that is
much easier to handle using classical results [27, 74, 86], since very often, we are
only interested in eigenvalues only. We shall pursue this connection more in the
future research.
For example, we use the classical results [86]. Let x1 , . . . , xN be N independent
random vectors in Rn , each having a zero mean and a finite s-th order absolute
moment for some s ≥ 2. Let P̂ stand for Fourier transform of P , the characteristic
function of a probability measure P . We here study P̂(x1 +...+xN )/√N , the rate
of convergence of the characteristic function of a probability measure, to Φ̂0,R
where R is the average of the covariance matrices of x1 , . . . , xN . Here for normal
distribution in Rn ,

1
log Φ̂0,R = − y, Ry .
2
We assume that
N N
1 1
R= Cov (xi ) = Ri
N i=1
N i=1

is nonsingular. Then, we define the Liapounov coefficient


N
s
1
N E (| y, xi | )
ls,N = sup  i=1
N −(s−2)/2 (s  2). (1.111)
y=1 
N   s/2
2
1
N E y, xi
i=1

It is easy to check that $l_{s,N}$ is independent of scale: if $B$ is a nonsingular $n\times n$ matrix, then $Bx_1,\ldots,Bx_N$ have the same Liapounov coefficient as $x_1,\ldots,x_N$. If we write

$$\rho_{r,i} = E\left(\|x_i\|^r\right), \quad 1 \le i \le N, \qquad \rho_r = \frac{1}{N}\sum_{i=1}^N \rho_{r,i}, \quad r \ge 0,$$

then according to (1.111), we have

$$l_{s,N} \le N^{-(s-2)/2}\sup_{\|y\|=1} \frac{\rho_s \|y\|^s}{\left[\langle y, Ry\rangle\right]^{s/2}} = \frac{\rho_s}{\lambda_{\min}^{s/2}}\, N^{-(s-2)/2},$$

where $\lambda_{\min}$ is the smallest eigenvalue of the average covariance matrix $R$. In one dimension (i.e., $n = 1$),

$$l_{s,N} = \frac{\rho_s}{\rho_2^{s/2}}\, N^{-(s-2)/2}, \qquad s \ge 2.$$

If $R = I$, then

$$\sum_{i=1}^N E|\langle y, x_i\rangle|^s \le N^{s/2}\, l_{s,N}\, \|y\|^s.$$

Now we are ready to state a theorem.

Theorem 1.15.1 (Theorem 8.6 of [86]). Let $x_1,\ldots,x_N$ be $N$ independent random vectors in $\mathbb{R}^n$ having distributions $G_1,\ldots,G_N$, respectively. Suppose that each random vector $x_i$ has zero mean and a finite fourth absolute moment. Assume that the average covariance matrix $R$ is nonsingular. Also assume

$$l_{4,N} \le 1.$$

Then for all $t$ satisfying

$$\|t\| \le \frac{1}{2}\, l_{4,N}^{-1/4},$$

one has

$$\left| \prod_{i=1}^N \hat G_i\!\left(\frac{Bt}{\sqrt N}\right) - \exp\!\left(-\tfrac{1}{2}\|t\|^2\right)\left[1 + \frac{i^3}{6\sqrt N}\,\mu_3(t)\right] \right| \le (0.175)\, l_{4,N}\, \|t\|^4 \exp\!\left(-\tfrac{1}{2}\|t\|^2\right) + \left[(0.018)\, l_{4,N}^2\, \|t\|^8 + \tfrac{1}{36}\, l_{3,N}^2\, \|t\|^6\right]\exp\!\left(-(0.383)\|t\|^2\right),$$

where $B$ is the positive-definite symmetric matrix defined by $B^2 = R^{-1}$, and

$$\mu_3(t) = \frac{1}{N}\sum_{i=1}^N E\langle t, x_i\rangle^3.$$

1.16 Linear Bounded and Compact Operators

The material here is standard, taken from [88, 89]. Let $X$ and $Y$ always be normed spaces and $A : X \to Y$ be a linear operator. The linear operator $A$ is bounded if there exists $c > 0$ such that

$$\|Ax\| \le c\,\|x\| \quad \text{for all } x \in X.$$

The smallest of these constants is called the norm of $A$, i.e.,

$$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|}. \qquad (1.112)$$

The following are equivalent:
1. $A$ is bounded.
2. $A$ is continuous at $x = 0$, i.e., $x_i \to 0$ implies that $Ax_i \to 0$.
3. $A$ is continuous for every $x \in X$.

The space $L(X, Y)$ of all linear bounded mappings from $X$ to $Y$ with the operator norm is a normed space. Let $A \in L(X, Y)$, $B \in L(Y, Z)$; then $BA \in L(X, Z)$ and $\|BA\| \le \|B\|\,\|A\|$.
Let $k \in L^2((c,d)\times(a,b))$. The integral operator

$$(Ax)(t) := \int_a^b k(t,s)\, x(s)\, ds, \quad t \in (c,d),\ x \in L^2(a,b), \qquad (1.113)$$

is well-defined, linear, and bounded from $L^2(a,b)$ to $L^2(c,d)$. Furthermore,

$$\|A\|_{L^2} \le \left(\int_c^d\!\!\int_a^b |k(t,s)|^2\, ds\, dt\right)^{1/2}.$$

Let $k$ be continuous on $[c,d]\times[a,b]$. Then $A$ is also well-defined, linear, and bounded from $C[a,b]$ into $C[c,d]$, and

$$\|A\|_\infty \le \max_{t\in[c,d]} \int_a^b |k(t,s)|\, ds.$$
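Both norm bounds are easy to test on a discretized kernel. The sketch below (our own illustration; the kernel $k(t,s) = e^{-|t-s|}$ on $(0,1)\times(0,1)$ and the grid size are arbitrary choices) approximates the integral operator by a quadrature matrix and compares its $L^2$ operator norm with the two bounds.

```python
import numpy as np

# Midpoint-rule discretization of (Ax)(t) = int_0^1 k(t,s) x(s) ds with k(t,s) = exp(-|t-s|).
n = 400
h = 1.0 / n
s = (np.arange(n) + 0.5) * h
K = np.exp(-np.abs(s[:, None] - s[None, :]))

A = h * K                                        # matrix acting on the node values

op_norm_L2 = np.linalg.norm(A, 2)                # approximates ||A|| on L^2(0,1)
hs_bound   = np.sqrt(h * h * (K ** 2).sum())     # approximates (int int |k|^2 ds dt)^{1/2}
sup_bound  = (h * np.abs(K).sum(axis=1)).max()   # approximates max_t int |k(t,s)| ds

print("||A||_{L^2} ~", op_norm_L2, "  Hilbert-Schmidt bound ~", hs_bound)
print("bound for ||A||_infty on C[0,1] ~", sup_bound)
```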

We can extend the above results to integral operators with weakly singular kernels. A kernel $k$ is weakly singular on $[a,b]\times[a,b]$ if $k$ is defined and continuous for all $t,s \in [a,b]$, $t \ne s$, and there exist constants $c > 0$ and $\alpha \in [0,1)$ such that

$$|k(t,s)| \le \frac{c}{|t-s|^\alpha} \quad \text{for all } t,s \in [a,b],\ t \ne s.$$

Let k be weakly singular on [a, b]. Then the integral operator A, defined
in (1.113) for [c, d] = [a, b], is well-defined and bounded as an operator in L2 (a, b)
as well as in C[a, b].
Let A : X → Y be a linear and bounded operator between Hilbert spaces.
Then there exists one and only one linear bounded operator A∗ : Y → X with the
property

$$\langle Ax, y\rangle = \langle x, A^*y\rangle \quad \text{for all } x \in X,\ y \in Y.$$

This operator A∗ : Y → X is called the adjoint operator to A. For X = Y , the


operator A is called self-adjoint if A∗ = A.
The operator $K : X \to Y$ is called compact if it maps every bounded set $S$ into a relatively compact set $K(S)$. A set $M \subset Y$ is called relatively compact if every bounded sequence $(y_i) \subset M$ has an accumulation point in $\mathrm{cl}(M)$, i.e., the closure $\mathrm{cl}(M)$ is compact. The closure of a set $M$ is defined as

$$\mathrm{cl}(M) := \left\{ x \in Y : \text{there exists } (x_k)_k \subset M \text{ with } x = \lim_{k\to\infty} x_k \right\}.$$

The set of all compact operators from $X$ to $Y$ is a closed subspace of the normed space $L(X, Y)$. Recall that

$$L^2(a,b) = \left\{ x : (a,b) \to \mathbb{C} : x \text{ is measurable and } |x|^2 \text{ is integrable} \right\}.$$

Let $k \in L^2((c,d)\times(a,b))$. The operator $K : L^2(a,b) \to L^2(c,d)$, defined by

$$(Kx)(t) := \int_a^b k(t,s)\, x(s)\, ds, \quad t \in (c,d),\ x \in L^2(a,b), \qquad (1.114)$$

is compact from $L^2(a,b)$ to $L^2(c,d)$. Let $k$ be continuous on $[c,d]\times[a,b]$ or weakly singular on $[a,b]\times[a,b]$ (in this case $(c,d) = (a,b)$). Then $K$ defined by (1.114) is also compact as an operator from $C[a,b]$ to $C[c,d]$.

1.17 Spectrum for Compact Self-Adjoint Operators

The material here is standard, taken from [88, 89]. The most important results in functional analysis are collected here. Define

$$N = \left\{ x \in L^2(a,b) : x(t) = 0 \text{ almost everywhere on } [a,b] \right\}.$$

Let $K : X \to X$ be compact and self-adjoint (and $\ne 0$). Then the following holds:

1. The spectrum consists only of eigenvalues and possibly 0. Every eigenvalue of $K$ is real-valued. $K$ has at least one but at most a countable number of eigenvalues, with 0 as the only possible accumulation point.

2. For every eigenvalue $\lambda \ne 0$ there exist only finitely many linearly independent eigenvectors, i.e., the eigenspaces are finite-dimensional. Eigenvectors corresponding to different eigenvalues are orthogonal.
3. We order the eigenvalues in the form

$$|\lambda_1| \ge |\lambda_2| \ge |\lambda_3| \ge \cdots$$

and denote by $P_i : X \to N(K - \lambda_i I)$ the orthogonal projection onto the eigenspace corresponding to $\lambda_i$. If there exist only a finite number $\lambda_1,\ldots,\lambda_m$ of eigenvalues, then

$$K = \sum_{i=1}^m \lambda_i P_i.$$

If there exists an infinite sequence $(\lambda_i)$ of eigenvalues, then

$$K = \sum_{i=1}^\infty \lambda_i P_i,$$

where the series converges in the operator norm. Furthermore,

$$\left\| K - \sum_{i=1}^m \lambda_i P_i \right\| = |\lambda_{m+1}|.$$

4. Let $H$ be the linear span of all of the eigenvectors corresponding to the eigenvalues $\lambda_i \ne 0$ of $K$. Then

$$X = \mathrm{cl}(H) \oplus N(K).$$

Let $X$ and $Y$ be Hilbert spaces and $K : X \to Y$ a compact operator with adjoint $K^* : Y \to X$. Every eigenvalue $\lambda$ of $K^*K$ is nonnegative because $K^*Kx = \lambda x$ implies that $\lambda\langle x, x\rangle = \langle K^*Kx, x\rangle = \langle Kx, Kx\rangle \ge 0$, i.e., $\lambda \ge 0$. The square roots $\sigma_i = \sqrt{\lambda_i}$ of the eigenvalues $\lambda_i$, $i \in J$, of the compact self-adjoint operator $K^*K : X \to X$ are called the singular values of $K$. Here again, the index set $J \subset \mathbb{N}$ could be either finite or $J = \mathbb{N}$.
A compact self-adjoint operator is non-negative (positive) if and only if all of its eigenvalues are non-negative (positive). The sum of two non-negative operators is non-negative, and it is positive if one of the summands is positive. If an operator is positive and bounded below, then it is invertible and its inverse is positive and bounded below.
Every non-negative compact operator K in a Hilbert space H has a unique non-
negative square root G; that is, if K is non-negative and compact, there is a unique
non-negative bounded linear map G such that G 2 = K. G is compact and commutes
with every bounded operator which commutes with K.

An operator A : X → Y is compact if and only if its adjoint A∗ : Y → X is


compact. If A∗ A is compact, then A is also compact.
Example 1.17.1 (Integral operator). Let $K : L^2(0,1) \to L^2(0,1)$ be defined by

$$(Kx)(t) := \int_0^t x(s)\, ds, \quad t \in (0,1),\ x \in L^2(0,1).$$

Then

$$(K^*x)(t) := \int_t^1 x(s)\, ds \quad \text{and} \quad (K^*Kx)(t) := \int_t^1\!\left(\int_0^s x(\tau)\, d\tau\right) ds.$$

The eigenvalue problem $K^*Kx = \lambda x$ is equivalent to

$$\lambda x(t) = \int_t^1\!\left(\int_0^s x(\tau)\, d\tau\right) ds, \quad t \in (0,1).$$

Differentiating twice, we observe that for $\lambda \ne 0$ this is equivalent to the eigenvalue problem

$$\lambda x'' + x = 0 \ \text{in } (0,1), \qquad x(1) = x'(0) = 0.$$

Solving this gives

$$x_i(t) = \sqrt{2}\,\cos\frac{(2i-1)\pi t}{2}, \quad i \in \mathbb{N}, \qquad \lambda_i = \frac{4}{(2i-1)^2\pi^2}, \quad i \in \mathbb{N}. \qquad \square$$
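A quick numerical cross-check of this example (our own sketch; the grid size is an arbitrary choice): discretize $K$ on a grid and compare the largest eigenvalues of $K^*K$ with $4/((2i-1)^2\pi^2)$.

```python
import numpy as np

# Midpoint-rule discretization of (Kx)(t) = int_0^t x(s) ds on (0,1).
n = 1000
h = 1.0 / n
K = h * np.tril(np.ones((n, n)))                      # cumulative-sum (Volterra) matrix

eigs = np.sort(np.linalg.eigvalsh(K.T @ K))[::-1]     # eigenvalues of K*K, descending
predicted = [4.0 / (((2 * i - 1) * np.pi) ** 2) for i in range(1, 6)]

print("computed :", eigs[:5])
print("predicted:", predicted)
```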
π 2 (2i − 1) π 2

Example 1.17.2 (Porter and Stirling [89]). Let the operator $K$ on $L^2(0,1)$ be defined by

$$(K\varphi)(x) = \int_0^1 \log\left|\frac{\sqrt x + \sqrt t}{\sqrt x - \sqrt t}\right| \varphi(t)\, dt \quad (0 \le x \le 1).$$

To show that $K$ is a positive operator, consider

$$(T\varphi)(x) = \int_0^x \frac{1}{\sqrt{x-t}}\, \varphi(t)\, dt \quad (0 \le x \le 1).$$

Although the kernel is an unbounded function, it is a Schur kernel and therefore $T$ is a bounded operator on $L^2(0,1)$, with adjoint given by

$$(T^*\varphi)(x) = \int_x^1 \frac{1}{\sqrt{t-x}}\, \varphi(t)\, dt \quad (0 \le x \le 1).$$

The kernel of $TT^*$ is, for $x \ne t$,

$$k(x,t) = \int_0^{\min(x,t)} \frac{ds}{\sqrt{x-s}\,\sqrt{t-s}} = \log\left|\frac{\sqrt x + \sqrt t}{\sqrt x - \sqrt t}\right|,$$

the integration being most easily performed using the substitution $u = \sqrt{x-s} + \sqrt{t-s}$. Therefore $K = TT^*$, and since $T^*\varphi = 0$ implies $\varphi = 0$, $K$ is positive. $\square$
Chapter 2
Sums of Matrix-Valued Random Variables

This chapter gives an exhaustive treatment of the line of research on sums of matrix-valued random variables. We will present eight different derivation methods in the context of the matrix Laplace transform method. The emphasis is placed on the methods that will hopefully be useful in engineering applications. Although powerful, the methods are elementary in nature. It is remarkable that some modern results on matrix completion can be derived simply by using the framework of sums of matrix-valued random variables. The treatment here is self-contained. All the necessary tools are developed in Chap. 1. The contents of this book are complementary to our book [5]. We have a small overlap with the results of [36].

In this chapter, the classical, commutative theory of probability is generalized to the more general theory of non-commutative probability. Non-commutative algebras of random variables ("observations") and their expectations (or "trace") are built. Matrices or operators take the role of scalar random variables and the trace takes the role of expectation. This is very similar to free probability [9].

2.1 Methodology for Sums of Random Matrices

The theory of real random variables provides the framework of much of modern
probability theory [8], such as laws of large numbers, limit theorems, and proba-
bility estimates for “deviations”, when sums of independent random variables are
involved. However, some authors have started to develop analogous theories for the
case that the algebraic structure of the reals is substituted by more general structures
such as groups, vector spaces, etc., see for example [90].
In a remarkable work [36], Ahlswede and Winter have laid the ground for the fundamentals of a theory of (self-adjoint) operator-valued random variables. There, large deviation bounds are derived. A self-adjoint operator includes the finite-dimensional case (often called a Hermitian matrix) and the infinite-dimensional case. For the purposes of this book, finite dimensions are sufficient. We will prefer the term Hermitian matrix.


To extend the theory from scalars to matrices, the fundamental difficulty arises from the fact that, in general, two matrices do not commute: $AB \ne BA$. Functions of a matrix can still be defined; for example, the matrix exponential is defined [20] as $e^A$. As expected, $e^{A+B} \ne e^A e^B$ in general, although the scalar exponential has the elementary property $e^{a+b} = e^a e^b$ for two scalars $a, b$. Fortunately, we have the Golden-Thompson inequality, which serves as a limited replacement for this elementary property of the scalar exponential. The Golden-Thompson inequality

$$\mathrm{Tr}\, e^{A+B} \le \mathrm{Tr}\left(e^A e^B\right),$$

for Hermitian matrices $A, B$, is the most complicated result that we will use.
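Because the Golden-Thompson inequality is used repeatedly below, a quick numerical sanity check may be helpful. The sketch (ours, with randomly drawn Hermitian matrices) verifies $\mathrm{Tr}\, e^{A+B} \le \mathrm{Tr}(e^A e^B)$ and also confirms that $e^{A+B} \ne e^A e^B$ in general.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)

def random_hermitian(d):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (G + G.conj().T) / 2

for _ in range(5):
    A, B = random_hermitian(4), random_hermitian(4)
    lhs = np.trace(expm(A + B)).real                 # Tr e^{A+B}
    rhs = np.trace(expm(A) @ expm(B)).real           # Tr(e^A e^B)
    same = np.allclose(expm(A + B), expm(A) @ expm(B))
    print(f"Tr e^(A+B) = {lhs:10.4f} <= Tr(e^A e^B) = {rhs:10.4f} | e^(A+B) == e^A e^B ? {same}")
```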
Through the spectral mapping theorem, the eigenvalues of an arbitrary matrix function $f(A)$ are $f(\lambda_i)$, where $\lambda_i$ is the $i$-th eigenvalue of $A$. In particular, take $f(x) = e^x$ for a scalar $x$; the eigenvalues of $e^A$ are $e^{\lambda_i}$, which are, of course, positive (i.e., $e^{\lambda_i} > 0$). In other words, the matrix exponential $e^A$ is ALWAYS positive semidefinite for an arbitrary Hermitian matrix $A$. The positive real numbers have a lot of special structure to exploit, compared with arbitrary real numbers. This elementary fact motivates the wide use of positive semidefinite (PSD) matrices, for example, in convex optimization and quantum information theory. Through the spectral mapping theorem, all the eigenvalues of positive semidefinite matrices are nonnegative.
For a sequence of scalar random variables (real or complex numbers) $x_1,\ldots,x_n$, we can study its convergence by studying the so-called partial sum $S_n = x_1 + \cdots + x_n = \sum_{i=1}^n x_i$. We say the sequence converges to a limit value $S = E[x]$ if such a limit $S$ exists as $n \to \infty$. In analogy with the scalar counterparts, we can similarly define

$$S_n = X_1 + \cdots + X_n = \sum_{i=1}^n X_i,$$

for a sequence of Hermitian matrices $X_1,\ldots,X_n$. We say the sequence converges to a limit matrix $S = E[X]$ if such a limit $S$ exists as $n \to \infty$.
One nice thing about the positive numbers is the ordering: when $a = 0.4$ and $b = 0.5$, we can say $a < b$. In analogy, we say the partial order $A \le B$ holds if all the eigenvalues of $B - A$ are nonnegative, which is equivalent to saying that $B - A$ is a positive semidefinite matrix. Since the matrix exponential $e^A$ is always positive semidefinite for a Hermitian matrix $A$, we can instead study $e^A \le e^B$ to infer about the partial order $A \le B$. The function $x \to e^{sx}$ is monotone, non-decreasing and positive for all $s \ge 0$. By the spectral mapping theorem, we can study the eigenvalues, which are scalar random variables. Thus a matrix-valued random variable is converted into a scalar-valued random variable, using the spectral mapping theorem as a bridge. For our interest, what matters is the spectrum ($\mathrm{spec}(A)$, the set of all eigenvalues).

In summary, the sums of random matrices are elementary in nature. We emphasize the fundamental contribution of Ahlswede and Winter [36], since their work has triggered a snowball of research along this line.

2.2 Matrix Laplace Transform Method

Due to the basic nature of sums of random matrices, we give several versions of the theorems and their derivations. Although their techniques are essentially equivalent, the assumptions and arguments are sufficiently different to justify the space. The techniques for handling matrix-valued random variables are very subtle; it is our intention to give an exhaustive survey of these techniques. Even a seemingly small twist of the problem can cause a lot of technical difficulties. These presentations serve as examples to illustrate the key steps. Repetition is the best teacher; practice makes perfect. This is the rationale behind this chapter. It is hoped that the audience pays attention to the methods, not the particular derived inequalities.

The Laplace transform method is the standard technique for scalar-valued random variables; it is remarkable that this method can be extended to the matrix setting. We argue that this is a breakthrough in studying matrix concentration. This method is used as a thread to tie together all the surveyed literature. For completeness, we run the risk of "borrowing" too much from the cited references. Here we give credit to those cited authors. We try our best to add more details about their arguments with the hope of being more accessible.

2.2.1 Method 1—Harvey’s Derivation

The presentation here is essentially the same as [91, 92] whose style is very friendly
and accessible. We present Harvey’s version first.

2.2.1.1 The Ahlswede-Winter Inequality

Let $X$ be a random $d\times d$ matrix, i.e., a matrix whose entries are all random variables. We define $EX$ to be the matrix whose entries are the expectations of the entries of $X$. Since expectation and trace are both linear, they commute:

$$E[\mathrm{Tr}\, X] = \sum_{A} P(X = A)\sum_i A_{i,i} = \sum_i \sum_A P(X = A)\, A_{i,i} = \sum_i \sum_a P(X_{i,i} = a)\cdot a = \sum_i E(X_{i,i}) = \mathrm{Tr}(EX).$$

Let $X_1,\ldots,X_n$ be random, symmetric¹ matrices of size $d\times d$. Define the partial sums

$$S_n = X_1 + \cdots + X_n = \sum_{i=1}^n X_i.$$

$A \ge 0$ is equivalent to saying that all eigenvalues of $A$ are nonnegative, i.e., $\lambda_i(A) \ge 0$. We would like to analyze the probability that all eigenvalues of $S_n$ are at most $t$, i.e., $S_n \le tI$. This is equivalent to the event that all eigenvalues of $e^{\lambda S_n}$ are at most $e^{\lambda t}$, i.e., $e^{\lambda S_n} \le e^{\lambda t} I$. If this event fails to hold, then certainly $\mathrm{Tr}\, e^{\lambda S_n} > e^{\lambda t}$, since all eigenvalues of $e^{\lambda S_n}$ are nonnegative. Thus, we have argued that

$$\Pr[\text{some eigenvalue of } S_n \text{ is greater than } t] \le P\left(\mathrm{Tr}\, e^{\lambda S_n} > e^{\lambda t}\right) \le E\left[\mathrm{Tr}\, e^{\lambda S_n}\right]\big/ e^{\lambda t}, \qquad (2.1)$$
by Markov’s inequality. Now, as in the proof of the Chernoff bound, we want
to bound this expectation by a product of expectations, which will lead to an
exponentially decreasing tail bound. This is where the Golden-Thompson inequality
is needed.
   
$$\begin{aligned}
E\left[\mathrm{Tr}\, e^{\lambda S_n}\right] &= E\left[\mathrm{Tr}\, e^{\lambda X_n + \lambda S_{n-1}}\right] && (\text{since } S_n = X_n + S_{n-1})\\
&\le E\left[\mathrm{Tr}\left(e^{\lambda X_n}\cdot e^{\lambda S_{n-1}}\right)\right] && (\text{by the Golden-Thompson inequality})\\
&= E_{X_1,\ldots,X_{n-1}} E_{X_n}\left[\mathrm{Tr}\left(e^{\lambda X_n}\cdot e^{\lambda S_{n-1}}\right)\right] && (\text{since the } X_i\text{'s are mutually independent})\\
&= E_{X_1,\ldots,X_{n-1}}\mathrm{Tr}\left(E_{X_n}\left[e^{\lambda X_n}\cdot e^{\lambda S_{n-1}}\right]\right) && (\text{since trace and expectation commute})\\
&= E_{X_1,\ldots,X_{n-1}}\mathrm{Tr}\left(E_{X_n}\left[e^{\lambda X_n}\right]\cdot e^{\lambda S_{n-1}}\right) && (\text{since } X_n \text{ and } S_{n-1} \text{ are independent})\\
&\le E_{X_1,\ldots,X_{n-1}}\left(\left\|E_{X_n}\left[e^{\lambda X_n}\right]\right\|\cdot \mathrm{Tr}\, e^{\lambda S_{n-1}}\right) && (\text{by the corollary of the trace-norm property})\\
&= \left\|E_{X_n}\left[e^{\lambda X_n}\right]\right\|\cdot E_{X_1,\ldots,X_{n-1}}\left[\mathrm{Tr}\, e^{\lambda S_{n-1}}\right]. && (2.2)
\end{aligned}$$
Applying this inequality inductively, we get

$$E\left[\mathrm{Tr}\, e^{\lambda S_n}\right] \le \left(\prod_{i=1}^n \left\|E_{X_i}\left[e^{\lambda X_i}\right]\right\|\right)\cdot \mathrm{Tr}\, e^{\lambda \mathbf{0}} = \left(\prod_{i=1}^n \left\|E\left[e^{\lambda X_i}\right]\right\|\right)\cdot \mathrm{Tr}\, e^{\lambda \mathbf{0}},$$

1 The assumption symmetric matrix is too strong for many applications. Since we often deal with

complex entries, the assumption of Hermitian matrix is reasonable. This is the fatal flaw of this
version. Otherwise, it is very useful.

where $\mathbf{0}$ is the zero matrix of size $d\times d$. So $e^{\lambda \mathbf{0}} = I$ and $\mathrm{Tr}(I) = d$, where $I$ is the identity matrix whose diagonal entries are all 1. Therefore,

$$E\left[\mathrm{Tr}\, e^{\lambda S_n}\right] \le d\cdot \prod_{i=1}^n \left\|E\left[e^{\lambda X_i}\right]\right\|.$$
i=1

Combining this with (2.1), we obtain

$$\Pr[\text{some eigenvalue of } S_n \text{ is greater than } t] \le d\, e^{-\lambda t} \prod_{i=1}^n \left\|E\left[e^{\lambda X_i}\right]\right\|.$$

We can also bound the probability that any eigenvalue of $S_n$ is less than $-t$ by applying the same argument to $-S_n$. This shows that the probability that any eigenvalue of $S_n$ lies outside $[-t, t]$ satisfies

$$P(\|S_n\| > t) \le d\, e^{-\lambda t}\left[\prod_{i=1}^n \left\|E\left[e^{\lambda X_i}\right]\right\| + \prod_{i=1}^n \left\|E\left[e^{-\lambda X_i}\right]\right\|\right]. \qquad (2.3)$$

This is the basic inequality. Much like the Chernoff bound, numerous variations and generalizations are possible. Two useful versions are stated here without proof.

Theorem 2.2.1. Let $Y$ be a random, symmetric, positive semi-definite $d\times d$ matrix such that $E[Y] = I$. Suppose $\|Y\| \le R$ for some fixed scalar $R \ge 1$. Let $Y_1,\ldots,Y_k$ be independent copies of $Y$ (i.e., independently sampled matrices with the same distribution as $Y$). For any $\varepsilon \in (0,1)$, we have

$$P\left[(1-\varepsilon) I \preceq \frac{1}{k}\sum_{i=1}^k Y_i \preceq (1+\varepsilon) I\right] \ge 1 - 2d\cdot \exp\left(-\varepsilon^2 k/4R\right).$$

This event is equivalent to the sample average $\frac{1}{k}\sum_{i=1}^k Y_i$ having minimum eigenvalue at least $1-\varepsilon$ and maximum eigenvalue at most $1+\varepsilon$.
Proof. See [92]. 
Corollary 2.2.2. Let $Z$ be a random, symmetric, positive semi-definite $d\times d$ matrix. Define $U = E[Z]$ and suppose $Z \preceq R\cdot U$ for some scalar $R \ge 1$. Let $Z_1,\ldots,Z_k$ be independent copies of $Z$ (i.e., independently sampled matrices with the same distribution as $Z$). For any $\varepsilon \in (0,1)$, we have

$$P\left[(1-\varepsilon) U \preceq \frac{1}{k}\sum_{i=1}^k Z_i \preceq (1+\varepsilon) U\right] \ge 1 - 2d\cdot \exp\left(-\varepsilon^2 k/4R\right).$$

Proof. See [92]. 
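The following Monte Carlo sketch (ours; the distribution of $Z$ and all parameters are arbitrary choices, and it is an illustration rather than a proof) checks the flavor of Corollary 2.2.2: sample averages of i.i.d. positive semidefinite matrices stay between $(1-\varepsilon)U$ and $(1+\varepsilon)U$ with high empirical frequency.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, eps, trials = 5, 2000, 0.2, 200

def sample_Z():
    # Z = x x^T with a bounded random vector x, so Z is positive semidefinite.
    x = rng.uniform(-1.0, 1.0, size=d)
    return np.outer(x, x)

U = np.mean([sample_Z() for _ in range(20000)], axis=0)   # estimate of E[Z]

def sandwiched(eps):
    S = np.mean([sample_Z() for _ in range(k)], axis=0)
    lower_ok = np.linalg.eigvalsh(S - (1 - eps) * U).min() >= -1e-12
    upper_ok = np.linalg.eigvalsh((1 + eps) * U - S).min() >= -1e-12
    return lower_ok and upper_ok

freq = np.mean([sandwiched(eps) for _ in range(trials)])
print("empirical P[(1-eps)U <= (1/k) sum Z_i <= (1+eps)U] ~", freq)
```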

2.2.1.2 Rudelson’s Theorem

In this section, we show how the Ahlswede-Winter inequality can be used to prove a concentration inequality for random vectors due to Rudelson. His original proof was quite different [93].
The motivation for Rudelson’s inequality comes from the problem of approx-
imately computing the volume of a convex body. When solving this problem, a
convenient first step is to transform the body into the “isotropic position”, which
is a technical way of saying “roughly like the unit sphere.” To perform this first
step, one requires a concentration inequality for randomly sampled vectors, which
is provided by Rudelson’s theorem.
Theorem 2.2.3 (Rudelson's Theorem [93]). Let $x \in \mathbb{R}^d$ be a random vector such that $E\left[xx^T\right] = I$. Suppose $\|x\| \le R$. Let $x_1,\ldots,x_n$ be independent copies of $x$. For any $\varepsilon \in (0,1)$, we have

$$P\left[\left\|\frac{1}{n}\sum_{i=1}^n x_i x_i^T - I\right\| > \varepsilon\right] \le 2d\cdot \exp\left(-\varepsilon^2 n/4R^2\right).$$

Note that $R^2 \ge d$ because

$$d = \mathrm{Tr}\, I = \mathrm{Tr}\, E\left[xx^T\right] = E\left[\mathrm{Tr}\left(xx^T\right)\right] = E\left[x^T x\right] = E\|x\|^2 \le R^2,$$

since $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$.


Proof. We apply the Ahlswede-Winter inequality with the rank-1 matrix Xi
1    1 
Xi = 2
xi xTi − E xi xTi = xi xTi − I .
2R 2R2
Note that EXi = 0, Xi  1, and
  1  2

E X2i = 4
E xi xTi − I
4R
1 0  T 2
 1  
= 4
E x i x i − I (since E xi xTi = I)
4R
1   
 E xTi xi xi xTi
4R4
R2  
 4
E xi xTi (since xi  R)
4R
I
= . (2.4)
4R2
Now we use Claim 1.4.5 together with the inequalities

$$1 + y \le e^y, \ \forall y \in \mathbb{R}, \qquad e^y \le 1 + y + y^2, \ \forall y \in [-1,1].$$

Since $\|X_i\| \le 1$, for any $\lambda \in [0,1]$ we have $e^{\lambda X_i} \preceq I + \lambda X_i + \lambda^2 X_i^2$, and so

$$E\left[e^{\lambda X_i}\right] \preceq E\left[I + \lambda X_i + \lambda^2 X_i^2\right] = I + \lambda^2 E\left[X_i^2\right] \preceq e^{\lambda^2 E\left[X_i^2\right]} \preceq e^{\lambda^2/4R^2}\, I,$$

by Eq. (2.4). Thus, $\left\|E\left[e^{\lambda X_i}\right]\right\| \le e^{\lambda^2/4R^2}$. The same analysis also shows that $\left\|E\left[e^{-\lambda X_i}\right]\right\| \le e^{\lambda^2/4R^2}$. Substituting this into Eq. (2.3), we obtain

$$P\left[\left\|\sum_{i=1}^n \frac{1}{2R^2}\left(x_i x_i^T - I\right)\right\| > t\right] \le 2d\cdot e^{-\lambda t}\prod_{i=1}^n e^{\lambda^2/4R^2} = 2d\cdot \exp\left(-\lambda t + n\lambda^2/4R^2\right).$$

Substituting $t = n\varepsilon/2R^2$ and $\lambda = \varepsilon$ proves the theorem. $\square$
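As a numerical companion to Rudelson's theorem (our own illustration; the dimension and the particular isotropic distribution are arbitrary choices), the sketch below samples bounded isotropic vectors with $\|x\| = \sqrt d = R$ and tracks how $\left\|\frac1n\sum_i x_ix_i^T - I\right\|$ decays as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 10

def sample_x():
    # Uniform on the sphere of radius sqrt(d): E[x x^T] = I and ||x|| = sqrt(d).
    v = rng.normal(size=d)
    return np.sqrt(d) * v / np.linalg.norm(v)

for n in [100, 1000, 10000]:
    X = np.array([sample_x() for _ in range(n)])
    deviation = np.linalg.norm(X.T @ X / n - np.eye(d), 2)
    print(f"n = {n:6d}   ||(1/n) sum x_i x_i^T - I|| = {deviation:.4f}")
```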

2.2.2 Method 2—Vershynin’s Derivation

We now give the derivation method of Vershynin, taken from [35]. Let $X_1,\ldots,X_n$ be independent random $d\times d$ real matrices, and let

$$S_n = X_1 + \cdots + X_n.$$

We will be interested in the magnitude of the deviation $S_n - ES_n$ in the operator norm $\|\cdot\|$.

Now we try to generalize the method of Sect. 1.4.10 when $X_i \in M_d$ are independent random matrices of mean zero, where $M_d$ denotes the class of symmetric $d\times d$ matrices.

For $A \in M_d$, the matrix exponential $e^A$ is defined as usual by the Taylor series. $e^A$ has the same eigenvectors as $A$, and eigenvalues $e^{\lambda_i(A)} > 0$. The partial order $A \ge B$ means $A - B \ge 0$, i.e., $A - B$ is positive semi-definite (its eigenvalues are nonnegative). By using the exponential function of $A$, we deal with a positive semi-definite matrix, which has a fundamental structure to exploit.

The non-trivial part is that, in general,

$$e^{A+B} \ne e^A e^B.$$

However, the famous Golden-Thompson inequality [94] states that

$$\mathrm{Tr}\, e^{A+B} \le \mathrm{Tr}\left(e^A e^B\right)$$

holds for arbitrary $A, B \in M_d$ (and in fact for an arbitrary unitarily-invariant norm replacing the trace [23]). Therefore, for $S_n = X_1 + \cdots + X_n = \sum_{i=1}^n X_i$ and for $I$ being the identity matrix on $M_d$, we have

$$p := P(S_n \not\le tI) = P\left(e^{\lambda S_n} \not\le e^{\lambda t} I\right) \le P\left(\mathrm{Tr}\, e^{\lambda S_n} > e^{\lambda t}\right) \le e^{-\lambda t}\, E\,\mathrm{Tr}\, e^{\lambda S_n}.$$

This estimate is not sharp: $e^{\lambda S_n} \not\le e^{\lambda t} I$ means the biggest eigenvalue of $e^{\lambda S_n}$ exceeds $e^{\lambda t}$, while $\mathrm{Tr}\, e^{\lambda S_n} > e^{\lambda t}$ means that the sum of all $d$ eigenvalues exceeds the same value.

Since $S_n = X_n + S_{n-1}$, we use the Golden-Thompson inequality to separate the last term from the sum:

$$E\,\mathrm{Tr}\, e^{\lambda S_n} \le E\,\mathrm{Tr}\left(e^{\lambda X_n} e^{\lambda S_{n-1}}\right).$$

Now, using independence and the fact that expectation and trace commute, we continue to write

$$= E_{n-1}\,\mathrm{Tr}\left(E_n e^{\lambda X_n}\cdot e^{\lambda S_{n-1}}\right) \le \left\|E_n e^{\lambda X_n}\right\|\cdot E_{n-1}\,\mathrm{Tr}\, e^{\lambda S_{n-1}},$$

since

$$\mathrm{Tr}(AB) \le \|A\|\,\mathrm{Tr}(B)$$

for $A, B \in M_d$ with $B \succeq 0$.
Continuing by induction, we reach (since $\mathrm{Tr}\, I = d$)

$$E\,\mathrm{Tr}\, e^{\lambda S_n} \le d\cdot \prod_{i=1}^n \left\|E e^{\lambda X_i}\right\|.$$

We have proved that

$$P(S_n \not\le tI) \le d\, e^{-\lambda t}\cdot \prod_{i=1}^n \left\|E e^{\lambda X_i}\right\|.$$

Repeating the argument for $-S_n$ and using that $-tI \le S_n \le tI$ is equivalent to $\|S_n\| \le t$, we have shown that

$$P(\|S_n\| > t) \le 2d\, e^{-\lambda t}\cdot \prod_{i=1}^n \left\|E e^{\lambda X_i}\right\|. \qquad (2.5)$$

As in the real-valued case, full independence is never needed in the above argument. It works out well for martingales.
Theorem 2.2.4 (Chernoff-type inequality). Let $X_i \in M_d$ be independent mean zero random matrices with $\|X_i\| \le 1$ for all $i$ almost surely. Let

$$S_n = X_1 + \cdots + X_n = \sum_{i=1}^n X_i, \qquad \sigma^2 = \sum_{i=1}^n \left\|\mathrm{Var}\, X_i\right\|.$$

Then, for every $t > 0$, we have

$$P(\|S_n\| > t) \le 2d\cdot \max\left(e^{-t^2/4\sigma^2},\ e^{-t/2}\right).$$

To prove this theorem, we have to estimate (2.5). The standard estimate

$$1 + y \le e^y \le 1 + y + y^2$$

is valid for real numbers $y \in [-1,1]$ (actually a bit beyond) [95]. From the two bounds, we get (replacing $y$ with a matrix $Y$, $\|Y\| \le 1$)

$$I + Y \preceq e^Y \preceq I + Y + Y^2.$$

Using the bounds twice (first the upper bound and then the lower bound) and $EY = 0$, we have

$$E e^Y \preceq E\left(I + Y + Y^2\right) = I + E\left(Y^2\right) \preceq e^{E(Y^2)}.$$

Let $0 < \lambda \le 1$. Therefore, by the Theorem's hypothesis,

$$\left\|E e^{\lambda X_i}\right\| \le \left\|e^{\lambda^2 E(X_i^2)}\right\| = e^{\lambda^2 \left\|E(X_i^2)\right\|}.$$

Hence by (2.5),

$$P(\|S_n\| > t) \le 2d\cdot e^{-\lambda t + \lambda^2\sigma^2}.$$

With the optimal choice of $\lambda = \min\left(t/2\sigma^2, 1\right)$, the conclusion of the Theorem follows.

Does the Theorem hold with $\sigma^2$ replaced by $\left\|\sum_{i=1}^n E\left(X_i^2\right)\right\|$?

Corollary 2.2.5. Let $X_i \in M_d$ be independent positive semi-definite random matrices with $\|X_i\| \le 1$ for all $i$ almost surely. Let

$$S_n = X_1 + \cdots + X_n = \sum_{i=1}^n X_i, \qquad E = \left\|\sum_{i=1}^n EX_i\right\|.$$

Then, for every $\varepsilon \in (0,1)$, we have

$$P(\|S_n - ES_n\| > \varepsilon E) \le d\cdot e^{-\varepsilon^2 E/4}.$$

2.2.3 Method 3—Oliveira's Derivation

Consider the random matrix $Z_n$. We closely follow Oliveira [38], whose exposition is highly accessible. In particular, he reviews all the needed theorems, all of which are collected in Chap. 1 for easy reference. In this subsection, the matrices are assumed to be $d\times d$ Hermitian matrices, that is, $A \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}$, where $\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ is the set of $d\times d$ Hermitian matrices.

2.2.3.1 Bernstein Trick

The usual Bernstein trick implies that for all $t \ge 0$,

$$P\left(\lambda_{\max}(Z_n) > t\right) \le \inf_{s>0}\, e^{-st}\, E\left[e^{s\lambda_{\max}(Z_n)}\right]. \qquad (2.6)$$

Notice that

$$E\left[e^{s\|Z_n\|}\right] \le E\left[e^{s\lambda_{\max}(Z_n)}\right] + E\left[e^{s\lambda_{\max}(-Z_n)}\right] = 2E\left[e^{s\lambda_{\max}(Z_n)}\right] \qquad (2.7)$$

since $\|Z_n\| = \max\left\{\lambda_{\max}(Z_n), \lambda_{\max}(-Z_n)\right\}$ and $Z_n$ has the same law as $-Z_n$.

2.2.3.2 Spectral Mapping

The function $x \to e^{sx}$ is monotone, non-decreasing and positive for all $s \ge 0$. It follows from the spectral mapping property (1.60) that for all $s \ge 0$ the largest eigenvalue of $e^{sZ_n}$ is $e^{s\lambda_{\max}(Z_n)}$, and all eigenvalues of $e^{sZ_n}$ are nonnegative. Using the equality "trace = sum of eigenvalues" implies that for all $s \ge 0$,

$$E\left[e^{s\lambda_{\max}(Z_n)}\right] = E\left[\lambda_{\max}\left(e^{sZ_n}\right)\right] \le E\left[\mathrm{Tr}\, e^{sZ_n}\right]. \qquad (2.8)$$

Combining (2.6), (2.7) with (2.8) gives

$$\forall t \ge 0, \quad P(\|Z_n\| > t) \le 2\inf_{s>0}\, e^{-st}\, E\left[\mathrm{Tr}\, e^{sZ_n}\right]. \qquad (2.9)$$

Up to this point, Oliveira's proof in [38] has followed Ahlswede and Winter's argument in [36]. The next lemma is originally due to Oliveira [38]. Now Oliveira considers the special case

$$Z_n = \sum_{i=1}^n \varepsilon_i A_i, \qquad (2.10)$$

where the $\varepsilon_i$ are random coefficients and $A_1,\ldots,A_n$ are deterministic Hermitian matrices. Recall that a Rademacher sequence is a sequence $(\varepsilon_i)_{i=1}^n$ of i.i.d. random variables with $\varepsilon_1$ uniform over $\{-1, 1\}$. A standard Gaussian sequence is a sequence of i.i.d. standard Gaussian random variables.
Lemma 2.2.6 (Oliveira [38]). For all $s \in \mathbb{R}$,

$$E\left[\mathrm{Tr}\, e^{sZ_n}\right] = \mathrm{Tr}\, E\left[e^{sZ_n}\right] \le \mathrm{Tr}\left[\exp\left(\frac{s^2\sum_{i=1}^n A_i^2}{2}\right)\right]. \qquad (2.11)$$

Proof. In (2.11), we have used the fact that trace and expectation commute, according to (1.87). Similar key proof steps appear in Rudelson [93], Harvey [91, 92], and Wigderson and Xiao [94]. $\square$
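A small numerical check of (2.11) (ours; the fixed matrices $A_i$, the value of $s$, and the sample size are arbitrary choices): estimate $E\,\mathrm{Tr}\, e^{sZ_n}$ for a Rademacher series $Z_n = \sum_i \varepsilon_i A_i$ by Monte Carlo and compare it with $\mathrm{Tr}\exp\!\left(s^2\sum_i A_i^2/2\right)$.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
d, n, s, trials = 4, 6, 0.7, 5000

def random_hermitian(d):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (G + G.conj().T) / 2

A = [random_hermitian(d) for _ in range(n)]      # fixed Hermitian coefficients

def tr_exp_sZ():
    eps = rng.choice([-1.0, 1.0], size=n)        # Rademacher signs
    Z = sum(e * Ai for e, Ai in zip(eps, A))
    return np.trace(expm(s * Z)).real

lhs = np.mean([tr_exp_sZ() for _ in range(trials)])
rhs = np.trace(expm(s ** 2 * sum(Ai @ Ai for Ai in A) / 2)).real
print("E Tr e^{sZ_n} ~", lhs, " <=  Tr exp(s^2 sum A_i^2 / 2) =", rhs)
```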

2.2.4 Method 4—Ahlswede-Winter’s Derivation

Ahlswede and Winter [36] were the first to use the matrix Laplace transform method. Their derivation, taken from [36], is presented in detail below. We have postponed their original version until now for easier understanding. Their paper and Tropp's long paper [53] are two of the most important sources on this topic. We first digress to a hypothesis testing problem for motivation.

Consider the hypothesis testing problem

$$H_0 : A_1,\ldots,A_K$$
$$H_1 : B_1,\ldots,B_K$$

where sequences of positive, random matrices $A_i$, $i = 1,\ldots,K$, and $B_i$, $i = 1,\ldots,K$, are considered.
Algorithm 2.2.7 (Detection Using Traces of Sums of Covariance Matrices).
1. Claim $H_1$ if

$$\xi := \sum_{k=1}^K \mathrm{Tr}\, A_k \le \sum_{k=1}^K \mathrm{Tr}\, B_k.$$

2. Otherwise, claim $H_0$.

Only diagonal elements are used in Algorithm 2.2.7; however, non-diagonal elements contain information of use to detection. The exponential of a matrix provides one tool; see Example 2.2.9. In particular, we have

$$\mathrm{Tr}\, e^{A+B} \le \mathrm{Tr}\, e^A e^B.$$

The following matrix inequality

$$\mathrm{Tr}\, e^{A+B+C} \le \mathrm{Tr}\, e^A e^B e^C$$

is known to be false.
is known to be false.
Let A and B be two Hermitian matrices of the same size. If A − B is positive
semidefinite, we write [16]

A ≥ B or B ≤ A. (2.12)

≥ is a partial ordering, referred to as Löwner partial ordering, on the set of Hermitian


matrices, that is,
1. A ≥ A for every Hermitian matrix A,
2. If A ≥ B and B ≥ A, then A = B, and
3. If A ≥ B and B ≥ C, then A ≥ C.
The statement A ≥ 0 ⇔ X∗ AX ≥ 0 is generalized as follows:

A ≥ B ⇔ X∗ AX ≥ X∗ BX (2.13)

for every complex matrix X.


A hypothesis detection problem can be viewed as a problem of partially ordering the measured matrices for the individual hypotheses. If many ($K$) copies of the measured matrices $A_k$ and $B_k$ are at our disposal, it is natural to ask this fundamental question:

Is $B_1 + B_2 + \cdots + B_K$ (statistically) different from $A_1 + A_2 + \cdots + A_K$?

To answer this question motivates this whole section. It turns out that a new theory is needed. We freely use [36], which contains a relatively complete appendix for this topic.

The theory of real random variables provides the framework of much of modern probability theory, such as laws of large numbers, limit theorems, and probability estimates for large deviations, when sums of independent random variables are involved. Researchers have developed analogous theories for the case where the algebraic structure of the reals is substituted by more general structures such as groups, vector spaces, etc.

For our current problem of hypothesis detection, we focus on a structure that is of vital interest in quantum probability theory, namely the algebra

of operators2 on a (complex) Hilbert space. In particular, the real vector space of


self-adjoint operators (Hermitian matrices) can be regarded as a partially ordered
generalization of the reals, as reals are embedded in the complex numbers.

2.2.4.1 Fundamentals of Matrix-Valued Random Variables

In the ground-breaking work of [36], they focus on a structure that has vital interest
in the algebra of operators on a (complex) Hilbert space, and in particular, the real
vector space of self-adjoint operators. Through the spectral mapping theorem, these
self-adjoint operators can be regarded as a partially ordered generalization of the
reals, as reals are embedded in the complex numbers. To study the convergence of
sums of matrix-valued random variables, this partial order is necessary. It will be
clear later.
One can generalize the exponentially good estimate for large deviations by the
so-called Bernstein trick that gives the famous Chernoff bound [96, 97].
A matrix-valued random variable X : Ω → As , where

As = {A ∈ A : A = A∗ } (2.14)

is the self-adjoint part of the C∗ -algebra A [98], which is a real vector space. Let
L(H) be the full operator algebra of the complex Hilbert space H. We denote d =
dim(H), which is assumed to be finite. Here dim means the dimensionality of the
vector space. In the general case, d = TrI, and A can be embedded into L(C d ) as an
algebra, preserving the trace. Note that the trace (often regarded as expectation) has the property $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$ for any two matrices (or operators) $A, B$. In free probability³ [99], this is an (optional) axiom, a very weak form of commutativity of the trace [9, p. 169].
The real cone

A+ = {A ∈ A : A = A∗ ≥ 0} (2.15)

induces a partial order ≤ in As . This partial order is in analogy with the order of
two real numbers a ≤ b. The partial order is the main interest in what follows. We
can introduce some convenient notation: for A, B ∈ As the closed interval [A, B]
is defined as

[A, B] = {X ∈ As : A ≤ X ≤ B}. (2.16)

² The finite-dimensional operators and matrices are used interchangeably.


3 The idea of free probability is to make algebra (such as operator algebras C ∗ -algebra, von
Neumann algebras) the foundation of the theory, as opposed to other possible choices of
foundations such as sets, measures, categories, etc.

This is in analogy with the interval $x \in [a,b]$ when $a \le x \le b$ for $a, b \in \mathbb{R}$. Similarly, open and half-open intervals $(A, B)$, $[A, B)$, etc., are defined. For simplicity, the space $\Omega$ on which the random variable lives is discrete. Some remarks on the matrix (or operator) order are as follows.

1. The notation "$\le$", when used for matrices, is not a total order unless the algebra $\mathcal{A}$ consists only of the complex scalars, i.e., $\mathcal{A} = \mathbb{C}$, in which case the set of self-adjoint operators is the real line, i.e., $\mathcal{A}_s = \mathbb{R}$. Thus in this (classical) case the theory developed below reduces to the study of real random variables.
2. $A \ge 0$ is equivalent to saying that all eigenvalues of $A$ are nonnegative. These are $d$ nonlinear inequalities. However, we have the alternative characterization:

$$A \ge 0 \ \Leftrightarrow\ \mathrm{Tr}(\rho A) \ge 0 \ \text{for every density operator } \rho \ \Leftrightarrow\ \mathrm{Tr}(\pi A) \ge 0 \ \text{for every one-dimensional projector } \pi. \qquad (2.17)$$

From this we see that these nonlinear inequalities are equivalent to infinitely many linear inequalities, which is better adapted to the vector space structure of $\mathcal{A}_s$.
3. The operator mappings $A \to A^s$, for $s \in [0,1]$, and $A \to \log A$ are defined on $\mathcal{A}_+$, and both are operator monotone and operator concave. In contrast, $A \to A^s$, for $s > 2$, and $A \to \exp A$ are neither operator monotone nor operator convex. Remarkably, $A \to A^s$, for $s \in [1,2]$, is operator convex (though not operator monotone). See Sect. 1.4.22 for definitions.
4. The mapping $A \to \mathrm{Tr}\exp A$ is monotone and convex. See [50].
5. Golden-Thompson inequality [23]: for $A, B \in \mathcal{A}_s$,

$$\mathrm{Tr}\exp(A + B) \le \mathrm{Tr}\left((\exp A)(\exp B)\right). \qquad (2.18)$$

Items 1–3 follow from Loewner's theorem. A good account of the partial order is [18, 22, 23]. Note that rather few mappings (functions) are operator convex (concave) or operator monotone. Fortunately, we are interested in trace functions, for which the admissible classes are much bigger [18].
Take a look at (2.19) for example. Since H0 : A = I + X, and A ∈ As (even
stronger A ∈ A+ ), it follows from (2.18) that

H0 : Tr exp(A) = Tr exp(I + X) ≤ Tr((exp I)(exp X)). (2.19)

The use of (2.19) allows us to separately study the diagonal part and the non-
diagonal part of the covariance matrix of the noise, since all the diagonal elements
are equal for a WSS random process. At low SNR, the goal is to find some ratio or
threshold that is statistically stable over a large number of Monte Carlo trials.

Algorithm 2.2.8 (Ratio detection algorithm using trace exponentials).
1. Claim $H_1$ if

$$\xi = \frac{\mathrm{Tr}\exp(A)}{\mathrm{Tr}\left((\exp I)(\exp X)\right)} \ge 1,$$

where $A$ is the measured covariance matrix with or without signals and $X = \dfrac{R_w}{\sigma_w^2} - I$.
2. Otherwise, claim $H_0$.
Example 2.2.9 (Exponential of a 2×2 matrix). The 2×2 covariance matrix for $L$ sinusoidal signals has a symmetric structure with identical diagonal elements,

$$R_s = \frac{\mathrm{Tr}\, R_s}{2}\,(I + b\,\sigma_1),$$

where

$$\sigma_1 = \begin{pmatrix} 0 & 1\\ 1 & 0\end{pmatrix}$$

and $b$ is a positive number. Obviously, $\mathrm{Tr}\,\sigma_1 = 0$. We can study the diagonal elements and non-diagonal elements separately. The two eigenvalues of the 2×2 matrix [100]

$$A = \begin{pmatrix} a & b\\ c & d\end{pmatrix}$$

are

$$\lambda_{1,2} = \tfrac{1}{2}\mathrm{Tr}\, A \pm \tfrac{1}{2}\sqrt{\mathrm{Tr}^2 A - 4\det A},$$

and the corresponding eigenvectors are, respectively,

$$u_1 = \frac{1}{\|u_1\|}\begin{pmatrix} b\\ \lambda_1 - a\end{pmatrix}, \qquad u_2 = \frac{1}{\|u_2\|}\begin{pmatrix} b\\ \lambda_2 - a\end{pmatrix}.$$

To study how the zero-trace structure affects the exponential, consider

$$X = \begin{pmatrix} 0 & b\\ a & 0\end{pmatrix}.$$

The exponential of the matrix $X$, $e^X$, has positive entries, and in fact [101]

$$e^X = \begin{pmatrix} \cosh\sqrt{ab} & \sqrt{b/a}\,\sinh\sqrt{ab}\\ \sqrt{a/b}\,\sinh\sqrt{ab} & \cosh\sqrt{ab}\end{pmatrix}.$$
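The closed form above can be confirmed directly; the following check (ours, with arbitrary positive entries $a$ and $b$) uses the fact that $X^2 = ab\, I$.

```python
import numpy as np
from scipy.linalg import expm

a, b = 2.0, 0.5                      # arbitrary positive entries
X = np.array([[0.0, b], [a, 0.0]])
r = np.sqrt(a * b)

closed_form = np.array([
    [np.cosh(r),                  np.sqrt(b / a) * np.sinh(r)],
    [np.sqrt(a / b) * np.sinh(r), np.cosh(r)],
])

print(np.allclose(expm(X), closed_form))   # True: e^X matches the closed form
```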



2.2.4.2 Matrix-Valued Concentration Inequalities

In analogy with the scalar-valued random variables, we can develop a matrix-valued


Markov inequality. Suppose that $X$ is a nonnegative random variable with mean $E[X]$. The scalar-valued Markov inequality of (1.6) states that

$$P(X \ge a) \le \frac{E[X]}{a} \quad \text{for } X \text{ nonnegative}. \qquad (2.20)$$
Theorem 2.2.10 (Markov inequality). Let $X$ be a matrix-valued random variable with values in $\mathcal{A}_+$ and expectation

$$M = EX = \sum_{X'} \Pr\{X = X'\}\, X', \qquad (2.21)$$

and let $A \ge 0$ be a fixed positive semidefinite matrix. Then

$$\Pr\{X \not\le A\} \le \mathrm{Tr}\left(MA^{-1}\right). \qquad (2.22)$$

Proof. The support of $A$ is assumed to contain the support of $M$; otherwise the theorem is trivial. Let us consider the positive matrix-valued random variable $Y = A^{-1/2}XA^{-1/2}$, which has expectation $E[Y] = A^{-1/2}E[X]A^{-1/2}$, using the product rule of (1.88), $E[XY] = E[X]E[Y]$, together with the fact that the expectation of a constant matrix is itself, $E[A] = A$. Since $\mathrm{Tr}\, E[Y] = \mathrm{Tr}(A^{-1/2}MA^{-1/2}) = \mathrm{Tr}(MA^{-1})$ and the events $\{X \le A\}$ and $\{Y \le I\}$ coincide, we have to show that

$$P(Y \not\le I) \le \mathrm{Tr}\left(E[Y]\right).$$

Note from (1.87) that the trace and expectation commute. This is seen as follows:

$$E[Y] = \sum_{Y'} P(Y = Y')\, Y' \ge \sum_{Y' \not\le I} P(Y = Y')\, Y'.$$

The inequality follows from the fact that $Y$ is positive: ignoring the outcomes with $Y' \le I$ amounts to removing positive semidefinite terms from the sum. Taking traces, and observing that a positive operator (or matrix) which is not less than or equal to $I$ must have trace at least 1, we find

$$\mathrm{Tr}\left(E[Y]\right) \ge \sum_{Y' \not\le I} P(Y = Y')\,\mathrm{Tr}(Y') \ge \sum_{Y' \not\le I} P(Y = Y') = P(Y \not\le I),$$

which is what we wanted. $\square$
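To see the operator Markov inequality in action, the following Monte Carlo sketch (ours; the Wishart-type sampling model and the threshold matrix $A$ are arbitrary choices) estimates $\Pr\{X \not\le A\}$ and compares it with $\mathrm{Tr}(MA^{-1})$.

```python
import numpy as np

rng = np.random.default_rng(6)
d, trials = 3, 20000

def sample_X():
    # A positive semidefinite random matrix (sum of two random rank-one terms).
    G = rng.normal(size=(d, 2))
    return G @ G.T

M = np.mean([sample_X() for _ in range(trials)], axis=0)     # estimate of E[X]
A = 10.0 * np.eye(d)                                         # fixed threshold matrix

not_dominated = np.mean([np.linalg.eigvalsh(A - sample_X()).min() < 0
                         for _ in range(trials)])
bound = np.trace(M @ np.linalg.inv(A))
print("P{X is not <= A} ~", not_dominated, " <=  Tr(M A^{-1}) ~", bound)
```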



In the case of $\mathcal{H} = \mathbb{C}$, the theorem reduces to the well-known Markov inequality for nonnegative real random variables. One easily sees that, as in the classical case, the inequality is optimal in the sense that there are examples where it is attained with equality.

Suppose that the mean $m$ and the variance $\sigma^2$ of a scalar-valued random variable $X$ are known. The Chebyshev inequality of (1.8) states that

$$P(|X - m| \ge a) \le \frac{\sigma^2}{a^2}. \qquad (2.23)$$
The Chebyshev inequality is a consequence of the Markov inequality.
In analogy with the scalar case, if we assume knowledge about the matrix-
valued expectation and the matrix-valued variance, we can prove the matrix-valued
Chebyshev inequality.
Theorem 2.2.11 (Chebyshev inequality). Let $X$ be a matrix-valued random variable with values in $\mathcal{A}_s$, expectation $M = EX$, and variance

$$\mathrm{Var}\, X = S^2 = E(X - M)^2 = E(X^2) - M^2. \qquad (2.24)$$

For $\Delta \ge 0$,

$$P\{|X - M| \not\le \Delta\} \le \mathrm{Tr}\left(S^2\Delta^{-2}\right). \qquad (2.25)$$

Proof. Observe that

$$|X - M| \le \Delta \ \Leftarrow\ (X - M)^2 \le \Delta^2,$$

since the square root $\sqrt{(\cdot)}$ is operator monotone (see Sect. 1.4.22). We find that

$$P\left(|X - M| \not\le \Delta\right) \le P\left((X - M)^2 \not\le \Delta^2\right) \le \mathrm{Tr}\left(S^2\Delta^{-2}\right).$$

The last step follows from Theorem 2.2.10. $\square$


If $X, Y$ are independent, then $\mathrm{Var}(X + Y) = \mathrm{Var}\, X + \mathrm{Var}\, Y$. This is the same as in the classical case, but one has to pay attention to the noncommutativity, which causes technical difficulty.
Corollary 2.2.12 (Weak law of large numbers). Let $X, X_1, X_2,\ldots,X_n$ be identically, independently distributed (i.i.d.) matrix-valued random variables with values in $\mathcal{A}_s$, expectation $M = EX$, and variance $\mathrm{Var}\, X = S^2$. For $\Delta \ge 0$,

$$P\left\{\frac{1}{n}\sum_{i=1}^n X_i \notin [M - \Delta, M + \Delta]\right\} \le \frac{1}{n}\,\mathrm{Tr}\left(S^2\Delta^{-2}\right), \qquad (2.26)$$

$$P\left\{\sum_{i=1}^n X_i \notin [nM - n\Delta,\ nM + n\Delta]\right\} \le \frac{1}{n}\,\mathrm{Tr}\left(S^2\Delta^{-2}\right).$$

Proof. Observe that $Y \notin [M - \Delta, M + \Delta]$ is equivalent to $|Y - M| \not\le \Delta$, and apply the previous theorem to $\frac{1}{n}\sum_i X_i$, whose variance is $S^2/n$. The event $|Y - M| \le \Delta$ says that the absolute value of $Y - M$ is bounded by $\Delta$, whose eigenvalues are of course nonnegative. The matrix $Y - M$ is Hermitian (but not necessarily nonnegative or nonpositive); that is why the absolute value operation is needed. $\square$
When we look at these functions in matrix-valued inequalities, we look at the functions of their eigenvalues; the spectral mapping theorem is used all the time.
Lemma 2.2.13 (Large deviations and Bernstein trick). For a matrix-valued random variable $Y$, $B \in \mathcal{A}_s$, and $T \in \mathcal{A}$ such that $T^*T > 0$,

$$P\{Y \not\le B\} \le \mathrm{Tr}\, E\, e^{TYT^* - TBT^*}. \qquad (2.27)$$

Proof. We directly calculate

$$\begin{aligned}
P\left(Y \not\le B\right) &= P\left(Y - B \not\le 0\right)\\
&= P\left(TYT^* - TBT^* \not\le 0\right)\\
&= P\left(e^{TYT^* - TBT^*} \not\le I\right)\\
&\le \mathrm{Tr}\, E\, e^{TYT^* - TBT^*}.
\end{aligned}$$

Here, the second line holds because the mapping $X \to TXT^*$ is bijective and preserves the order. The matrix $TYT^* - TBT^*$ trivially commutes with the zero matrix, and for commuting Hermitian matrices $A, B$, $A \le B$ is equivalent to $e^A \le e^B$, from which the third line follows. The last line follows from Theorem 2.2.10. $\square$

The Bernstein trick is a crucial step. The problem is reduced to the form $\mathrm{Tr}\, E\, e^Z$, where $Z = TYT^* - TBT^*$ is Hermitian. We really do not know whether $Z$ is nonnegative or positive, but we do not care, since the matrix exponential of any Hermitian matrix is always nonnegative. As a consequence of using the Bernstein trick, we only need to deal with nonnegative matrices.

But we need another key ingredient, the Golden-Thompson inequality, since for Hermitian $A, B$ we have $e^{A+B} \ne e^A\cdot e^B$ in general, unlike $e^{a+b} = e^a\cdot e^b$ for two scalars $a, b$. For two Hermitian matrices $A, B$, we have the Golden-Thompson inequality

$$\mathrm{Tr}\, e^{A+B} \le \mathrm{Tr}\left(e^A\cdot e^B\right).$$

Theorem 2.2.14 (Concentration for i.i.d. matrix-valued random variables). Let $X, X_1,\ldots,X_n$ be i.i.d. matrix-valued random variables with values in $\mathcal{A}_s$, and let $A \in \mathcal{A}_s$. Then for $T \in \mathcal{A}$ with $T^*T > 0$,

$$P\left\{\sum_{i=1}^n X_i \not\le nA\right\} \le d\cdot\left\|E\exp\left(TXT^* - TAT^*\right)\right\|^n. \qquad (2.28)$$

Define the binary I-divergence as

$$D(u\|v) = u(\log u - \log v) + (1 - u)(\log(1 - u) - \log(1 - v)). \qquad (2.29)$$


Proof. Using the previous lemma (obtained from the Bernstein trick) with $Y = \sum_{i=1}^n X_i$ and $B = nA$, we find

$$\begin{aligned}
P\left\{\sum_{i=1}^n X_i \not\le nA\right\} &\le \mathrm{Tr}\, E\exp\left(\sum_{i=1}^n T(X_i - A)T^*\right)\\
&= E\,\mathrm{Tr}\exp\left(\sum_{i=1}^n T(X_i - A)T^*\right)\\
&\le E\,\mathrm{Tr}\left[\exp\left(\sum_{i=1}^{n-1} T(X_i - A)T^*\right)\exp\left(T(X_n - A)T^*\right)\right]\\
&= E_{X_1,\ldots,X_{n-1}}\mathrm{Tr}\left[\exp\left(\sum_{i=1}^{n-1} T(X_i - A)T^*\right) E\exp\left(T(X_n - A)T^*\right)\right]\\
&\le \left\|E\exp\left(T(X_n - A)T^*\right)\right\|\cdot E_{X_1,\ldots,X_{n-1}}\mathrm{Tr}\exp\left(\sum_{i=1}^{n-1} T(X_i - A)T^*\right)\\
&\le d\cdot\left\|E\exp\left(T(X_n - A)T^*\right)\right\|^n.
\end{aligned}$$

The first line follows from Lemma 2.2.13. The second line is from the fact that trace and expectation commute, according to (1.87). In the third line, we use the famous Golden-Thompson inequality (1.71). In the fourth line, we take the expectation over $X_n$. The fifth line is due to the norm property (1.84). The sixth line is obtained by applying the previous steps inductively $n$ times; the factor $d$ comes from the fact that $\mathrm{Tr}\exp(\mathbf{0}) = \mathrm{Tr}\, I = d$, where $\mathbf{0}$ is the zero matrix whose entries are all zero. $\square$

The problem is now to minimize $\left\|E\exp\left[T(X - A)T^*\right]\right\|$ with respect to $T$. For an arbitrary matrix $T$, we have the polar decomposition $T = |T|\cdot U$, where $U$ is a unitary matrix; so, without loss of generality, we may assume that $T$ is Hermitian. Let us pursue the special case of a bounded matrix-valued random variable.
Defining

D (u||v) = u (log u − log v) + (1 − u) [log (1 − u) − log (1 − v)]



we find the following matrix-valued Chernoff bound.


Theorem 2.2.15 (Matrix-valued Chernoff Bound). Let $X, X_1,\ldots,X_n$ be i.i.d. matrix-valued random variables with values in $[0, I] \subset \mathcal{A}_s$, $EX \le mI$, $A \ge aI$, $1 \ge a \ge m \ge 0$. Then

$$P\left\{\sum_{i=1}^n X_i \not\le nA\right\} \le d\cdot\exp\left(-nD(a\|m)\right). \qquad (2.30)$$

Similarly, if $EX \ge mI$, $A \le aI$, $0 \le a \le m \le 1$, then

$$P\left\{\sum_{i=1}^n X_i \not\ge nA\right\} \le d\cdot\exp\left(-nD(a\|m)\right). \qquad (2.31)$$

As a consequence, for $EX = M \ge \mu I$ and $0 \le \varepsilon \le \tfrac{1}{2}$,

$$P\left\{\frac{1}{n}\sum_{i=1}^n X_i \notin [(1-\varepsilon)M, (1+\varepsilon)M]\right\} \le 2d\cdot\exp\left(-n\,\frac{\varepsilon^2\mu}{2\ln 2}\right). \qquad (2.32)$$

Proof. The second part follows from the first by considering $Y_i = I - X_i$ and the observation that $D(a\|m) = D(1-a\|1-m)$. To prove the first part, we apply Theorem 2.2.14 with the special case $T = \sqrt{t}\, I$ to obtain

$$P\left\{\sum_{i=1}^n X_i \not\le nA\right\} \le P\left\{\sum_{i=1}^n X_i \not\le naI\right\} \le d\cdot\left\|E\exp(tX)\exp(-taI)\right\|^n = d\cdot\left\|E\exp(tX)\exp(-ta)\right\|^n.$$

Note that $\exp(-taI) = \exp(-ta)I$ and $AI = A$. Now the convexity of the exponential function $\exp(x)$ implies that

$$\frac{\exp(tx) - 1}{x} \le \frac{\exp(t) - 1}{1}, \qquad 0 \le x \le 1,\ x \in \mathbb{R},$$

which, by replacing $x$ with the matrix $X \in \mathcal{A}_s$ and 1 with the identity matrix $I$ (see Sect. 1.4.13 for this rule), yields

$$\exp(tX) - I \le X\left(\exp(t) - 1\right).$$

As a consequence, we have

$$E\exp(tX) \le I + \left(\exp(t) - 1\right) EX. \qquad (2.33)$$

Hence, we have

$$\begin{aligned}
\left\|E\exp(tX)\exp(-ta)\right\| &\le \exp(-ta)\left\|I + (\exp(t) - 1)\, EX\right\|\\
&\le \exp(-ta)\left\|I + (\exp(t) - 1)\, mI\right\|\\
&= \exp(-ta)\left[1 + (\exp(t) - 1)\, m\right].
\end{aligned}$$

The first line follows from (2.33). The second line follows from the hypothesis $EX \le mI$. The third line follows from the spectral norm property of (1.57) for the identity matrix $I$: $\|I\| = 1$. Choosing

$$t = \log\left(\frac{a}{m}\cdot\frac{1-m}{1-a}\right) > 0,$$

the right-hand side becomes exactly $\exp\left(-D(a\|m)\right)$.


To prove the last claim of the theorem, consider the variables $Y_i = \mu M^{-1/2} X_i M^{-1/2}$, which have expectation $EY_i = \mu I$ and satisfy $Y_i \in [0, I]$ by hypothesis. Since

$$\frac{1}{n}\sum_{i=1}^n X_i \in [(1-\varepsilon)M, (1+\varepsilon)M] \ \Leftrightarrow\ \frac{1}{n}\sum_{i=1}^n Y_i \in [(1-\varepsilon)\mu I, (1+\varepsilon)\mu I],$$

we can apply what we just proved to obtain

$$\begin{aligned}
P\left\{\frac{1}{n}\sum_{i=1}^n X_i \notin [(1-\varepsilon)M, (1+\varepsilon)M]\right\} &\le P\left\{\frac{1}{n}\sum_{i=1}^n X_i \not\le (1+\varepsilon)M\right\} + P\left\{\frac{1}{n}\sum_{i=1}^n X_i \not\ge (1-\varepsilon)M\right\}\\
&\le d\left\{\exp\left[-nD\left((1-\varepsilon)\mu\|\mu\right)\right] + \exp\left[-nD\left((1+\varepsilon)\mu\|\mu\right)\right]\right\}\\
&\le 2d\cdot\exp\left(-n\,\frac{\varepsilon^2\mu}{2\ln 2}\right).
\end{aligned}$$

The last line follows from the already used inequality

$$D\left((1+x)\mu\|\mu\right) \ge \frac{x^2\mu}{2\ln 2}. \qquad \square$$



2.2.5 Derivation Method 5—Gross, Liu, Flammia, Becker, and Eisert

Let $\|A\|$ be the operator norm of a matrix $A$.

Theorem 2.2.16 (Matrix-Bernstein inequality—Gross [102]). Let $X_i$, $i = 1,\ldots,N$, be i.i.d. zero-mean, Hermitian matrix-valued random variables of size $n\times n$. Assume $\sigma_0, \mu \in \mathbb{R}$ are such that $\left\|X_i^2\right\| \le \sigma_0^2$ and $\|X_i\| \le \mu$. Set $S = \sum_{i=1}^N X_i$ and let $\sigma^2 = N\sigma_0^2$, an upper bound on the variance of $S$. Then

$$P(\|S\| > t) \le 2n\exp\left(-\frac{t^2}{4\sigma^2}\right), \quad t \le \frac{2\sigma^2}{\mu}; \qquad P(\|S\| > t) \le 2n\exp\left(-\frac{t}{2\mu}\right), \quad t > \frac{2\sigma^2}{\mu}. \qquad (2.34)$$

We refer to [102] for a proof. His proof directly follows Ahlswede-Winter [36] with some revisions.

2.2.6 Method 6—Recht’s Derivation

The version of the derivation taken from [103] is more general in that the random matrices need not be identically distributed. A symmetric matrix is assumed. It is our conjecture that the results of [103] may be easily extended to Hermitian matrices.

Theorem 2.2.17 (Noncommutative Bernstein Inequality [104]). Let $X_1,\ldots,X_L$ be independent zero-mean random matrices of dimension $d_1\times d_2$. Suppose $\rho_k^2 = \max\left\{\left\|E\left(X_k X_k^*\right)\right\|, \left\|E\left(X_k^* X_k\right)\right\|\right\}$ and $\|X_k\| \le M$ almost surely for all $k$. Then, for any $\tau > 0$,

$$P\left[\left\|\sum_{k=1}^L X_k\right\| > \tau\right] \le (d_1 + d_2)\exp\left(\frac{-\tau^2/2}{\sum_{k=1}^L \rho_k^2 + M\tau/3}\right). \qquad (2.35)$$

Note that in the case $d_1 = d_2 = 1$, this is precisely the two-sided version of the standard Bernstein inequality. When the $X_k$ are diagonal, this bound is the same as applying the standard Bernstein inequality and a union bound to the diagonal of the matrix summation. Besides, observe that the right-hand side is less than $(d_1 + d_2)\exp\left(-\frac{3}{8}\,\tau^2\big/\sum_{k=1}^L \rho_k^2\right)$ as long as $\tau \le \frac{1}{M}\sum_{k=1}^L \rho_k^2$. This condensed form of the inequality is used exclusively throughout [103]. Theorem 2.2.17 is a corollary of a Chernoff bound for finite-dimensional operators developed by Ahlswede and Winter [36]. A similar inequality for symmetric i.i.d. matrices is proposed in [95]. Here $\|\cdot\|$ denotes the spectral norm (the top singular value) of an operator.
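For reference, the right-hand side of (2.35) and its condensed form are easy to evaluate numerically; the helper below (our own, with made-up parameter values) simply plugs numbers into the two expressions.

```python
import numpy as np

def bernstein_tail(tau, rho2_sum, M, d1, d2):
    """Right-hand side of (2.35)."""
    return (d1 + d2) * np.exp(-tau ** 2 / 2.0 / (rho2_sum + M * tau / 3.0))

def condensed_tail(tau, rho2_sum, d1, d2):
    """Condensed form, valid when tau <= rho2_sum / M."""
    return (d1 + d2) * np.exp(-3.0 * tau ** 2 / (8.0 * rho2_sum))

# Illustrative numbers only: L terms with identical variance proxies.
L, rho2, M, d1, d2 = 200, 0.05, 1.0, 50, 60
rho2_sum = L * rho2
for tau in [1.0, 2.0, 5.0]:
    print(tau, bernstein_tail(tau, rho2_sum, M, d1, d2), condensed_tail(tau, rho2_sum, d1, d2))
```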

2.2.7 Method 7—Derivation by Wigderson and Xiao

Chernoff bounds are extremely useful in probability. Intuitively, they say that a random sample approximates the average, with a probability of deviation that goes down exponentially with the number of samples. Typically we are concerned with real-valued random variables, but recently several applications have called for large-deviation bounds for matrix-valued random variables. Such a bound was given by Ahlswede and Winter [36, 105].

All of Wigderson and Xiao's results [94] extend to complex Hermitian matrices, or abstractly to self-adjoint operators over any Hilbert space where the operations of addition, multiplication, trace, exponential, and norm are efficiently computable. Wigderson and Xiao [94] essentially follow the original style of Ahlswede and Winter [36] in their method.

2.2.8 Method 8—Tropp’s Derivation

The derivation follows [53]:

$$\begin{aligned}
P\left[\lambda_{\max}\left(\sum_{i=1}^n X_i\right) \ge t\right] &= P\left[\lambda_{\max}\left(\sum_{i=1}^n \theta X_i\right) \ge \theta t\right] && (\text{positive homogeneity of the eigenvalue map})\\
&\le e^{-\theta t}\cdot E\exp\left(\lambda_{\max}\left(\sum_{i=1}^n \theta X_i\right)\right) && (\text{Markov's inequality})\\
&= e^{-\theta t}\cdot E\,\lambda_{\max}\left(\exp\left(\sum_{i=1}^n \theta X_i\right)\right) && (\text{the spectral mapping theorem})\\
&\le e^{-\theta t}\cdot E\,\mathrm{Tr}\exp\left(\sum_{i=1}^n \theta X_i\right) && (\text{the exponential of a Hermitian matrix is positive definite}). \qquad (2.36)
\end{aligned}$$

2.3 Cumulant-Based Matrix-Valued Laplace Transform Method

This section develops some general probability inequalities for the maximum eigenvalue of a sum of independent random matrices. The main ingredient is a matrix extension of the scalar-valued Laplace transform method for sums of independent real random variables; see Sect. 1.1.6.

Before introducing the matrix-valued Laplace transform, we need to define matrix moments and cumulants, in analogy with Sect. 1.1.6 for the scalar setting. At this point, a quick review of Sect. 1.1 will illuminate the contrast between the scalar and matrix settings. The central idea of Ahlswede and Winter [36] is to extend the textbook idea of the Laplace transform method from the scalar setting to the matrix setting.

Consider a Hermitian matrix $X$ that has moments of all orders. By analogy with the classical scalar definitions (Sect. 1.1.7), we may construct matrix extensions of the moment generating function and the cumulant generating function:

$$M_X(\theta) := E e^{\theta X} \quad \text{and} \quad \Xi_X(\theta) := \log E e^{\theta X} \quad \text{for } \theta \in \mathbb{R}. \qquad (2.37)$$

We have the formal power series expansions:

$$M_X(\theta) = I + \sum_{k=1}^\infty \frac{\theta^k}{k!}\, E X^k \quad \text{and} \quad \Xi_X(\theta) = \sum_{k=1}^\infty \frac{\theta^k}{k!}\,\Xi_k.$$

The coefficients $EX^k$ are called matrix moments, and the $\Xi_k$ are called matrix cumulants. The matrix cumulant $\Xi_k$ has a formal expression as a noncommutative polynomial in the matrix moments up to order $k$. In particular, the first cumulant is the mean and the second cumulant is the variance:

$$\Xi_1 = E(X) \quad \text{and} \quad \Xi_2 = E\left(X^2\right) - \left(E(X)\right)^2.$$

Higher-order cumulants are harder to write down and to interpret.


Proposition 2.3.1 (The Laplace Transform Method). Let $Y$ be a random Hermitian matrix. For all $t \in \mathbb{R}$,

$$P\left(\lambda_{\max}(Y) \ge t\right) \le \inf_{\theta>0}\, e^{-\theta t}\cdot E\,\mathrm{Tr}\, e^{\theta Y}.$$

In words, we can control tail probabilities for the maximum eigenvalue of a random matrix by producing a bound for the trace of the matrix moment generating function defined in (2.37). Let us show how Bernstein's Laplace transform technique extends to the matrix setting. The basic idea is due to Ahlswede and Winter [36], but we follow Oliveira [38] in this presentation.

Proof. Fix a positive number $\theta$. Observe that

$$P\left(\lambda_{\max}(Y) \ge t\right) = P\left(\lambda_{\max}(\theta Y) \ge \theta t\right) = P\left(e^{\lambda_{\max}(\theta Y)} \ge e^{\theta t}\right) \le e^{-\theta t}\cdot E\, e^{\lambda_{\max}(\theta Y)}.$$

The first identity uses the homogeneity of the maximum eigenvalue map. The second relies on the monotonicity of the scalar exponential function; the third relation is Markov's inequality. To bound the exponential, note that

$$e^{\lambda_{\max}(\theta Y)} = \lambda_{\max}\left(e^{\theta Y}\right) \le \mathrm{Tr}\, e^{\theta Y}.$$

The first relation is the spectral mapping theorem (Sect. 1.4.13). The second relation holds because the exponential of a Hermitian matrix is always positive definite (the eigenvalues of the matrix exponential are always positive; see Sect. 1.4.16 for the matrix exponential); thus, the maximum eigenvalue of a positive definite matrix is dominated by the trace. Combining the latter two relations, we reach

$$P\left(\lambda_{\max}(Y) \ge t\right) \le e^{-\theta t}\cdot E\,\mathrm{Tr}\, e^{\theta Y}.$$

This inequality holds for any positive $\theta$, so we may take the infimum⁴ over $\theta > 0$ to complete the proof. $\square$
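The infimum over $\theta$ in Proposition 2.3.1 is a one-dimensional optimization that is easy to carry out numerically. The sketch below (our own illustration; the distribution of $Y$, the threshold $t$, and the grid of $\theta$ values are arbitrary choices) estimates $E\,\mathrm{Tr}\, e^{\theta Y}$ by Monte Carlo via the eigenvalues of $Y$ and minimizes $e^{-\theta t}\, E\,\mathrm{Tr}\, e^{\theta Y}$ over the grid.

```python
import numpy as np

rng = np.random.default_rng(7)
d, trials, t = 4, 5000, 2.0

def sample_eigs():
    # Eigenvalues of a small random symmetric matrix (one sample of Y).
    G = rng.uniform(-0.5, 0.5, size=(d, d))
    return np.linalg.eigvalsh((G + G.T) / 2)

eigs = np.array([sample_eigs() for _ in range(trials)])   # one row per sample of Y

def laplace_bound(theta):
    # e^{-theta t} * E Tr e^{theta Y}, using Tr e^{theta Y} = sum_i e^{theta lambda_i}.
    return np.exp(-theta * t) * np.exp(theta * eigs).sum(axis=1).mean()

thetas = np.linspace(0.1, 10.0, 200)
print("empirical P(lambda_max(Y) >= t) =", (eigs.max(axis=1) >= t).mean())
print("Laplace transform bound        ~", min(laplace_bound(th) for th in thetas))
```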

2.4 The Failure of the Matrix Generating Function

In the scalar setting of Sect. 1.2, the Laplace transform method is very effective for studying sums of independent (scalar-valued) random variables, because the moment generating function decomposes. Consider an independent sequence $(X_k)$ of real random variables. Operating formally, we see that the scalar moment generating function of the sum satisfies a multiplicative rule:

$$M_{\sum_k X_k}(\theta) = E\exp\left(\sum_k \theta X_k\right) = E\prod_k e^{\theta X_k} = \prod_k E e^{\theta X_k} = \prod_k M_{X_k}(\theta). \qquad (2.38)$$

This calculation relies on the fact that the scalar exponential function converts sums to products, a property the matrix exponential does not share (see Sect. 1.4.16). Thus, there is no immediate analog of (2.38) in the matrix setting.

4 In analysis the infimum or greatest lower bound of a subset S of real numbers is denoted by
inf(S) and is defined to be the biggest real number that is smaller than or equal to every number in
S . An important property of the real numbers is that every set of real numbers has an infimum (any
bounded nonempty subset of the real numbers has an infimum in the non-extended real numbers).
For example, inf {1, 2, 3} = 1, inf {x ∈ R, 0 < x < 1} = 0.

Ahlswede and Winter attempt to imitate the multiplicative rule (2.38) using the following observation. When $X_1$ and $X_2$ are independent random matrices,

$$\mathrm{Tr}\, M_{X_1+X_2}(\theta) \le E\,\mathrm{Tr}\left(e^{\theta X_1} e^{\theta X_2}\right) = \mathrm{Tr}\left(E e^{\theta X_1}\, E e^{\theta X_2}\right) = \mathrm{Tr}\left(M_{X_1}(\theta)\cdot M_{X_2}(\theta)\right). \qquad (2.39)$$

The first relation is the Golden-Thompson trace inequality (1.71). Unfortunately, we cannot extend the bound (2.39) to include additional matrices. This fact suggests that the Golden-Thompson inequality may not be the natural way to proceed. In Sect. 2.2.4, we have given a full exposition of the Ahlswede-Winter method. Here, we follow a different path due to [53].

2.5 Subadditivity of the Matrix Cumulant Generating Function

Let us return to the problem of bounding the matrix moment generating function of an independent sum. Although the multiplicative rule (2.38) is a dead end in the matrix case, the scalar cumulant generating function has a related property that can be extended. For an independent sequence $(X_k)$ of real random variables, the scalar cumulant generating function is additive:

$$\Xi_{\sum_k X_k}(\theta) = \log E\exp\left(\sum_k \theta X_k\right) = \log\prod_k E e^{\theta X_k} = \sum_k \Xi_{X_k}(\theta), \qquad (2.40)$$

where the second relation follows from (2.38) when we take logarithms.

The key insight of Tropp's approach is that Corollary 1.4.18 offers a completely different way to extend the addition rule (2.40) from the scalar setting to the matrix setting. Indeed, this is a remarkable breakthrough; much better results have been obtained due to it. This justifies the parallel development of Tropp's method with the Ahlswede-Winter method of Sect. 2.2.4.

Lemma 2.5.1 (Subadditivity of Matrix Cumulant Generating Functions). Consider a finite sequence $\{X_k\}$ of independent, random, Hermitian matrices. Then

$$E\,\mathrm{Tr}\exp\left(\sum_k \theta X_k\right) \le \mathrm{Tr}\exp\left(\sum_k \log E e^{\theta X_k}\right) \quad \text{for } \theta \in \mathbb{R}. \qquad (2.41)$$

Proof. There is no harm in assuming $\theta = 1$. Let $E_k$ denote the expectation conditioned on $X_1,\ldots,X_k$. Abbreviate

$$\Xi_k := \log E_{k-1} e^{X_k} = \log E e^{X_k},$$

where the equality holds because the sequence $\{X_k\}$ is independent. Then

$$\begin{aligned}
E\,\mathrm{Tr}\exp\left(\sum_{k=1}^n X_k\right) &= E_0\cdots E_{n-1}\,\mathrm{Tr}\exp\left(\sum_{k=1}^{n-1} X_k + X_n\right)\\
&\le E_0\cdots E_{n-2}\,\mathrm{Tr}\exp\left(\sum_{k=1}^{n-1} X_k + \log E_{n-1} e^{X_n}\right)\\
&= E_0\cdots E_{n-2}\,\mathrm{Tr}\exp\left(\sum_{k=1}^{n-2} X_k + X_{n-1} + \Xi_n\right)\\
&\le E_0\cdots E_{n-3}\,\mathrm{Tr}\exp\left(\sum_{k=1}^{n-2} X_k + \Xi_{n-1} + \Xi_n\right)\\
&\le \cdots \le \mathrm{Tr}\exp\left(\sum_{k=1}^n \Xi_k\right).
\end{aligned}$$

The first line follows from the tower property of conditional expectation. At each step $m = 1, 2,\ldots,n$, we use Corollary 1.4.18 with the fixed matrix $H$ equal to

$$H_m = \sum_{k=1}^{m-1} X_k + \sum_{k=m+1}^n \Xi_k.$$

This is legal because $H_m$ does not depend on $X_m$. $\square$

To contrast with the additive rule (2.40), we may rewrite (2.41), by using definition (2.37), in the form

$$\mathrm{Tr}\exp\left(\Xi_{\sum_k X_k}(\theta)\right) \le \mathrm{Tr}\exp\left(\sum_k \Xi_{X_k}(\theta)\right) \quad \text{for } \theta \in \mathbb{R}.$$
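The subadditivity rule (2.41) can also be checked numerically for $\theta = 1$; the sketch below (ours, with small random bounded symmetric summands and arbitrary sample sizes) estimates both sides by Monte Carlo.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(8)
d, n, trials = 3, 4, 4000

def sample_Xk():
    G = rng.uniform(-0.3, 0.3, size=(d, d))
    return (G + G.T) / 2

# Estimate E e^{X_k} for each k, then form Tr exp(sum_k log E e^{X_k}).
E_exp = [np.mean([expm(sample_Xk()) for _ in range(trials)], axis=0) for _ in range(n)]
rhs = np.trace(expm(sum(logm(E) for E in E_exp))).real

# Estimate the left side E Tr exp(sum_k X_k) directly.
lhs = np.mean([np.trace(expm(sum(sample_Xk() for _ in range(n)))).real
               for _ in range(trials)])

print("E Tr exp(sum X_k) ~", lhs, " <=  Tr exp(sum log E e^{X_k}) ~", rhs)
```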

2.6 Tail Bounds for Independent Sums

This section contains abstract tail bounds for the sums of random matrices.
Theorem 2.6.1 (Master Tail Bound for Independent Sums—Tropp [53]). Consider a finite sequence $\{X_k\}$ of independent, random, Hermitian matrices. For all $t \in \mathbb{R}$,

$$P\left[\lambda_{\max}\left(\sum_k X_k\right) \ge t\right] \le \inf_{\theta>0}\, e^{-\theta t}\cdot \mathrm{Tr}\exp\left(\sum_k \log E e^{\theta X_k}\right). \qquad (2.42)$$

Proof. Substitute the subadditivity rule for matrix cumulant generating functions, Eq. (2.41), into the Laplace transform bound, Proposition 2.3.1. $\square$

Now we are in a position to apply the very general inequality of (2.42) to some
specific situations. The first corollary adapts Theorem 2.6.1 to the case that arises
most often in practice.
Corollary 2.6.2 (Tropp [53]). Consider a finite sequence $\{X_k\}$ of independent, random, Hermitian matrices with dimension $d$. Assume that there is a function $g : (0,\infty) \to [0,\infty]$ and a sequence $\{A_k\}$ of fixed Hermitian matrices that satisfy the relations

$$E e^{\theta X_k} \preceq e^{g(\theta)\cdot A_k} \quad \text{for } \theta > 0. \qquad (2.43)$$

Define the scale parameter

$$\rho := \lambda_{\max}\left(\sum_k A_k\right).$$

Then, for all $t \in \mathbb{R}$,

$$P\left[\lambda_{\max}\left(\sum_k X_k\right) \ge t\right] \le d\cdot\inf_{\theta>0}\, e^{-\theta t + g(\theta)\cdot\rho}. \qquad (2.44)$$

Proof. The hypothesis (2.43) implies that

$$\log E e^{\theta X_k} \preceq g(\theta)\cdot A_k \quad \text{for } \theta > 0, \qquad (2.45)$$

because of the property (1.73) that the matrix logarithm is operator monotone. Recall the fact (1.70) that the trace exponential is monotone with respect to the semidefinite order. As a result, we can introduce each relation from the sequence (2.45) into the master inequality (2.42). For $\theta > 0$, it follows that

$$\begin{aligned}
P\left[\lambda_{\max}\left(\sum_k X_k\right) \ge t\right] &\le e^{-\theta t}\cdot\mathrm{Tr}\exp\left(g(\theta)\cdot\sum_k A_k\right)\\
&\le e^{-\theta t}\cdot d\cdot\lambda_{\max}\left(\exp\left(g(\theta)\cdot\sum_k A_k\right)\right)\\
&= e^{-\theta t}\cdot d\cdot\exp\left(g(\theta)\cdot\lambda_{\max}\left(\sum_k A_k\right)\right).
\end{aligned}$$

The second inequality holds because the trace of a positive definite matrix, such as the exponential, is bounded by the dimension $d$ times the maximum eigenvalue. The last line depends on the spectral mapping Theorem 1.4.4 and the fact that the function $g$ is nonnegative. Identify the quantity $\rho$, and take the infimum over positive $\theta$ to reach the conclusion (2.44). $\square$

Let us state another consequence of Theorem 2.6.1. This bound is sometimes more convenient than Corollary 2.6.2, since it combines the matrix generating functions of the random matrices together under a single logarithm.

Corollary 2.6.3. Consider a sequence $\{X_k,\ k = 1, 2,\ldots,n\}$ of independent, random, Hermitian matrices with dimension $d$. For all $t \in \mathbb{R}$,

$$P\left[\lambda_{\max}\left(\sum_{k=1}^n X_k\right) \ge t\right] \le d\cdot\inf_{\theta>0}\exp\left(-\theta t + n\cdot\log\lambda_{\max}\left(\frac{1}{n}\sum_{k=1}^n E e^{\theta X_k}\right)\right). \qquad (2.46)$$

Proof. Recall the fact (1.74) that the matrix logarithm is operator concave. For $\theta > 0$, it follows that

$$\sum_{k=1}^n \log E e^{\theta X_k} = n\cdot\frac{1}{n}\sum_{k=1}^n \log E e^{\theta X_k} \preceq n\cdot\log\left(\frac{1}{n}\sum_{k=1}^n E e^{\theta X_k}\right).$$

The property (1.70) that the trace exponential is monotone allows us to introduce the latter relation into the master inequality (2.42) to obtain

$$P\left[\lambda_{\max}\left(\sum_{k=1}^n X_k\right) \ge t\right] \le e^{-\theta t}\cdot\mathrm{Tr}\exp\left(n\cdot\log\left(\frac{1}{n}\sum_{k=1}^n E e^{\theta X_k}\right)\right).$$

To complete the proof, we bound the trace by $d$ times the maximum eigenvalue, and we invoke the spectral mapping Theorem 1.4.4 (twice) to draw the maximum eigenvalue map inside the logarithm. Take the infimum over positive $\theta$ to reach (2.46). $\square$
We can study the minimum eigenvalue of a sum of random Hermitian matrices
because

λmin (X) = −λmax (−X) .

As a result,
n
  n
 
P λmin Xk t = P λmax −Xk  −t .
k=1 k=1

We can also analyze the maximum singular value of a sum of random rectangular
matrices by applying the results to the Hermitian dilation (1.81). For a finite
sequence {Zk } of independent, random, rectangular matrices, we have
/ /   
/ /
/ /
P / Xk /  t = P λmax ϕ (Zk ) t
/ /
k k
114 2 Sums of Matrix-Valued Random Variables

on account of (1.83) and the property that the dilation is real-linear. ϕ means
dilation. This device allows us to extend most of the tail bounds developed in this
book to rectangular matrices.

2.6.1 Comparison Between Tropp’s Method


and Ahlswede–Winter Method

Ahlswede and Winter uses a different approach to bound the matrix moment
generating function, which uses the multiplication bound (2.39) for the trace
exponential of a sum of two independent, random, Hermitian matrices.
Consider a sequence {Xk , k = 1, 2, .
. . , n} of independent, random, Hermitian
matrices with dimension d, and let Y = k Xk . The trace inequality (2.39) implies
that
⎡ n−1 ⎤ ⎡ n−1 ⎤
 
θXk θXk
Tr MY (θ)  E Tr ⎣e k=1 e θXn ⎦
= Tr E ⎣e k=1 eθXn ⎦
⎡⎛ ⎞ ⎤

n−1
θXk 
= Tr ⎣⎝Ee k=1 ⎠ EeθXn ⎦
⎛ ⎞

n−1
θXk 
 Tr ⎝Ee k=1 ⎠ · λmax EeθXn .

These steps are carefully spelled out in previous sections, for example Sect. 2.2.4.
Iterating this procedure leads to the relation

  
Tr MY (θ)  (Tr I) λmax Ee θXk
= d · exp λmax log Ee θXk
.
k k
(2.47)
This bound (2.47) is the key to the Ahlswede–Winter method. As a consequence,
their approach generally leads to tail bounds that depend on a scale parameter
involving “the sum of eigenvalues.” In contrast, the Tropp’s approach is based on
the subadditivity of cumulants, Eq. 2.41, which implies that

Tr MY (θ)  d · exp λmax log Ee θXk
. (2.48)
k

(2.48) contains a scale parameter that involves the “eigenvalues of a sum.”


2.7 Matrix Gaussian Series—Case Study 115

2.7 Matrix Gaussian Series—Case Study

A matrix Gaussian series stands among the simplest instances of a sum of


independent random matrices. We study this fundamental problem to gain insights.
Consider a finite sequence ak of real numbers and finite sequence {γk } of
independent, standard Gaussian variables. We have

2
/2σ 2
P γ k ak  t  e−t where σ 2 := a2k . (2.49)
k k

This result justifies that a Gaussian series with real coefficients satisfies a normal-
type tail bound where the variance is controlled by the sum of the sequence
coefficients. The relation (2.49) follows easily from the scalar Laplace transform
method. See Example 1.2.1 for the derivation of the characteristic function;
A Fourier inverse transform of this derived characteristic function will lead to (2.49).
So far, our exposition in this section is based on the standard textbook.
The inequality (2.49) can be generalized directly to the noncommutative setting.
The matrix Laplace transform method, Proposition 2.3.1, delivers the following
result.
Theorem 2.7.1 (Matrix Gaussian and Rademacher Series—Tropp [53]). Con-
sider a finite sequence {Ak } of fixed Hermitian matrices with dimension d, and
let γk be a finite sequence of independent standard normal variables. Compute the
variance parameter
/ /
/ /
2/ /
σ := / A2k / . (2.50)
/ /
k

Then, for all t > 0,


 
2
/2σ 2
P λmax γ k Ak t  d · e−t . (2.51)
k

In particular,
/ / 
/ /
/ / 2 2
P / γk Ak /  t  2d · e−t /2σ . (2.52)
/ /
k

The same bounds hold when we replace γk by a finite sequence of independent


Rademacher random variables.
116 2 Sums of Matrix-Valued Random Variables

Observe that the bound (2.51) reduces to the scalar result (2.49) when the
dimension d = 1. The generalization of (2.50) has been proven by Tropp [53] to
be sharp and is also demonstrated that Theorem 2.7.1 cannot be improved without
changing its form.
Most of the inequalities in this book have variants that concern the maximum
singular value of a sum of rectangular random matrices. These extensions follow
immediately, as mentioned above, when we apply the Hermitian matrices to the
Hermitian dilation of the sums of rectangular matrices. Here is a general version of
Theorem 2.7.1.
Corollary 2.7.2 (Rectangular Matrix Gaussian and Radamacher Series—
Tropp [53]). Consider a finite sequence {Bk } of fixed matrices with dimension
d1 × d2 , and let γk be a finite sequence of independent standard normal variables.
Compute the variance parameter
+/ / / /,
/ / / /
/ / / /
2
σ := max / Bk B∗k / , / B∗k Bk / .
/ / / /
k k

Then, for all t > 0,


/ / 
/ /
/ / 2 2
P / γk Bk /  t  (d1 + d2 ) · e−t /2σ .
/ /
k

The same bounds hold when we replace γk by a finite sequence of independent


Rademacher random variables.
To prove Theorem 2.7.1 and Corollary 2.7.2, we need a lemma first.
Lemma 2.7.3 (Rademacher and Gaussian moment generating functions). Sup-
pose that A is an Hermitian matrix. Let ε be a Rademacher random variable, and
let γ be a standard normal random variable. Then
2
A2 /2 2
A2 /2
EeεθA  eθ and EeγθA = eθ for θ ∈ R.

Proof. Absorbing θ into A, we may assume θ = 1 in each case. By direct


calculation,
2
EeεA = cosh (A)  eA /2
,

where the second relation is (1.69).


For the Gaussian case, recall that the moments of a standard normal variable
satisfy

(2k)!
Eγ 2k+1 = 0 and Eγ 2k = for k = 0, 1, 2, . . . .
k!2k
2.7 Matrix Gaussian Series—Case Study 117

Therefore,

∞  ∞  k
E γ 2k A2k A2 /2 2
Ee γA
=I+ =I+ = eA /2
.
(2k)! k!
k=1 k=1

The first relation holds since the odd terms in the series vanish. With this lemma, the
tail bounds for Hermitian matrix Gaussian and Rademacher series follow easily. 
Proof of Theorem 2.7.1. Let {ξk } be a finite sequence of independent standard
normal variables or independent Rademacher variables. Invoke Lemma 2.7.3 to
obtain
2
Eeξk θA  eg(θ)·Ak where g (θ) := θ2 /2 for θ > 0.

Recall that
/ / 
/ /
2 / /
σ =/ A2k / = λmax A2k .
/ /
k k

Corollary 2.6.2 gives


 
0 2
1 2
/2σ 2
P λmax ξ k Ak t  d · inf e−θt+g(θ)·σ = d · e−t . (2.53)
θ>0
k

For the record, the infimum is attained when θ = t/σ 2 .


To obtain the norm bound (2.52), recall that

Y = max {λmax (Y) , −λmin (Y)} .

Since standard Gaussian and Rademacher variables are symmetric, the inequal-
ity (2.53) implies that
   
2
/2σ 2
P −λmin ξ k Ak t = P λmax (−ξk ) Ak t  d·e−t .
k k

Apply the union bound to the estimates for λmax and −λmin to complete the proof.
We use the Hermitian dilation of the series. 
Proof of Corollary 2.7.2. Let {ξk } be a finite sequence of independent standard
normal variables or independent Rademacher variables. Consider the sequence
{ξk ϕ (Bk )} of random Hermitian matrices with dimension d1 + d2 . The spectral
identity (1.83) ensures that
118 2 Sums of Matrix-Valued Random Variables

/ /  
/ /
/ /
/ ξk Bk / = λmax ϕ ξk B k = λmax ξk ϕ (Bk ) .
/ /
k k k

Theorem 2.7.1 is used. Simply observe that the matrix variance parameter (2.50)
satisfies the relation
/⎡ ⎤/
/ / /  B B∗ /
/ / / k k 0 /
/ 2 / / ⎢ ⎥ /
σ2 = / ϕ(Bk ) / = /⎣ k  ∗ ⎦/
/ / / 0 Bk Bk /
k / k
/
+/ / / /,
/ / / /
/ ∗/ / ∗ /
= max / Bk Bk / , / Bk Bk / .
/ / / /
k k

on account of the identity (1.82). 

2.8 Application: A Gaussian Matrix with Nonuniform


Variances

Fix a d1 × d2 matrix B and draw a random d1 × d2 matrix Γ whose entries are


independent, standard normal variables. Let  denote the componentwise (i.e.,
Schur or Hadamard) product of matrices. Construct the random matrix B  Γ, and
observe that its (i, j) component is a Gaussian variable with mean zero and variance
|bij |2 . We claim that
2
/2σ 2
P { Γ  B  t}  (d1 + d2 ) · e−t . (2.54)

The symbols bi: and b:j represent the ith row and jth column of the matrix B. An
immediate sequence of (2.54) is that the median of the norm satisfies

M( Γ  B )  σ 2 log (2 (d1 + d2 )). (2.55)

These are nonuniform Gaussian matrices where the estimate (2.55) for the median
has the correct order. We compare [106, Theorem 1] and [107, Theorem 3.1]
although the results are not fully comparable. See Sect. 9.2.2 for extended work.
To establish (2.54), we first decompose the matrix of interest as a Gaussian series:

ΓB= γij · bij · Cij .


ij
2.9 Controlling the Expectation 119

Now, let us determine the variance parameter σ 2 . Note that


⎛ ⎞
 
∗ ⎝ |bij | ⎠ Cii = diag
2 2 2
(bij Cij )(bij Cij ) = b1: , . . . , bd1 : .
ij i j

Similarly,

 
∗ 2 2 2
(bij Cij ) (bij Cij ) = |bij | Cjj = diag b:1 , . . . , b:d2 .
ij j i

Thus,
0/  / /  /1
/ 2 2 / / 2 2 /
σ 2 = max /diag b1: , . . . , bd1 : / , /diag b:1 , . . . , b:d2 /
0 1
2 2
= max maxi bi: , maxj b:j .

An application of Corollary 2.7.2 gives the tail bound of (2.54).

2.9 Controlling the Expectation

The Hermitian Gaussian series

Y= γ k Ak (2.56)
k

is used for many practical applications later in this book since it allows each sensor
to be represented by the kth matrix.
Example 2.9.1 (NC-OFDM Radar and Communications). A subcarrier (or tone)
has a frequency fk , k = 1, . . . , N . Typically, N = 64 or N = 128. A radio
sinusoid ej2πfk t is transmitted by the transmitter (cell phone tower or radar). This
radio signal passes through the radio environment and “senses” the environment.
Each sensor collects some length of data over the sensing time. The data vector yk
of length 106 is stored and processed for only one sensor. In other words, we receive
typically N = 128 copies of measurements for using one sensor. Of course, we can
use more sensors, say M = 100.
We can extract the data structure using a covariance matrix that is to be directly
estimated from the data. For example, a sample covariance matrix can be used. We
call the estimated covariance matrix R̂k , k = 1, 2, .., N . We may desire to know the
impact of N subcarriers on the sensing performance. Equation (2.56) is a natural
model for this problem at hand. If we want to investigate the impact of M = 100
sensors on the sensing performance (via collaboration from a wireless network), we
120 2 Sums of Matrix-Valued Random Variables

need a data fusion algorithm. Intuitively, we can simply consider the sum of these
extracted covariance matrices (random matrices). So we have a total of n = M N =
100 × 128 = 12,800 random matrices at our disposal. Formally, we have

n=128,000
Y= γ k Ak = γk R̂k .
k k=1

Here, we are interested in the nonasymptotic view in statistics [108]: when the
number of observations n is large, we fit large complex sets of data that one needs
to deal with huge collections of models at different scales. Throughout the book, we
promote this nonasymptotic view by solving practical problems in wireless sensing
and communications. This is a problem with “Big Data”. In this novel view, one
takes the number of observations as it is and try to evaluate the effect of all the
influential parameters. Here this parameter is n, the total number of measurements.
Within one second, we have a total of 106 × 128 × 100 ≈ 1010 points of data at our
disposal. We need models at different scales to represent the data. 
A remarkable feature of Theorem 2.7.1 is that it always allows us to obtain
reasonably accurate estimates for the expected norm of this Hermitian Gaussian
series

Y= γ k Ak . (2.57)
k

To establish this point, we first derive the upper and lower bounds for the second
moment of ||Y||. Note ||Y|| is a scalar random variable. Using Theorem 2.7.1 gives
  )∞  √
2
E Y = 0
P Y > t dt
)∞ 2
= 2σ 2 log (2d) + 2d 2σ2 log(2d) e−t/2σ dt = 2σ 2 log (2ed) .

Jensen’s inequality furnishes the lower estimate:


/ /
  / 2 / / / /
/
/
/
E Y
2 /
=E Y / / 2/
 EY = / A2k / = σ2 .
/ /
k

The (homogeneous) first and second moments of the norm of a Gaussian series are
equivalent up to a universal constant [109, Corollary 3.2], so we have

cσ  E ( Y )  σ 2 log (2ed). (2.58)

According to (2.58), the matrix variance parameter σ 2 controls the expected norm
E ( Y ) up to a factor that depends very weakly on the dimension d. A similar
remark goes to the median value M ( Y ).
2.10 Sums of Random Positive Semidefinite Matrices 121

In (2.58), the dimensional dependence is a new feature of probability inequalities


in the matrix setting. We cannot remove the factor d from the bound in Theorem
2.7.1.

2.10 Sums of Random Positive Semidefinite Matrices

The classical Chernoff bounds concern the sum of independent, nonnegative,


and uniformly bounded random variables. In contrast, matrix Chernoff bounds
deal with a sum of independent, positive semidefinite, random matrices whose
maximum eigenvalues are subject to a uniform bound. For example, the sample
covariance matrices satisfy the conditions of independent, positive semidefinite,
random matrices. This connection plays a fundamental role when we deal with
cognitive sensing in the network setting consisting of a number of sensors. Roughly,
each sensor can be modeled by a sample covariance matrix.
The first result parallels with the strongest version of the scalar Chernoff
inequality for the proportion of successes in a sequence of independent, (but not
identical) Bernoulli trials [7, Excercise 7].
Theorem 2.10.1 (Matrix Chernoff I—Tropp [53]). Consider a sequence
{Xk : k = 1, . . . , n} of independent, random, Hermitian matrices that satisfy

Xk  0 and λmax (Xk )  1 almost surely.

Compute the minimum and maximum eigenvalues of the average expectation,

n
 n

1 1
μ̄min := λmin EXk and μ̄max := λmax EXk .
n n
k=1 k=1

Then
  n  

P λmin n1 Xk  α  d · e−n·D(α||μ̄min ) for 0  α  μ̄min ,
  k=1  

n
P λmax n
1
Xk  α  d · e−n·D(α||μ̄max ) for 0  α  μ̄max .
k=1

the binary information divergence

D (a||u) = a (log (a) − log (u)) + (1 − a) (log (1 − a) − log (1 − u))

for a, u ∈ [0, 1].


Tropp [53] found the following weaker version of Theorem 2.10.1 produces
excellent results but is simpler to apply. This corollary corresponds with the usual
122 2 Sums of Matrix-Valued Random Variables

statement of the scalar Chernoff inequalities for sums of nonnegative random


variables; see [7, Excercise 8] [110, Sect. 4.1]. Theorem 2.10.1 is a considerable
strengthening of the version of Ahlswede-Winter [36, Theorem 19], in which case
their result requires the assumption that the summands are identically distributed.
Corollary 2.10.2 (Matrix Chernoff II—Tropp [53]). Consider a sequence
{Xk : k = 1, . . . , n} of independent, random, Hermitian matrices that satisfy

Xk  0 and λmax (Xk )  R almost surely.

Compute the minimum and maximum eigenvalues of the average expectation,

n
 n

μmin := λmin EXk and μmax := λmax EXk .
k=1 k=1

Then
  n    −δ μmin /R

P λmin Xk  (1 − δ) μmin  d · (1−δ)
e
1−δ for δ ∈ [0,1],
 
k=1    μmax /R
n

P λmax Xk  (1 + δ) μmax  d · (1+δ) 1+δ for δ  0.
k=1

The following standard simplification of Corollary 2.10.2 is useful:


  n  
 2
P λmin Xk  tμmin  d · e−(1−t) μmin /2R for t ∈ [0,1],
 k=1  n  
  tμmax /R
P λmax Xk  tμmax  d · et for t  e.
k=1

The minimum eigenvalues has norm-type behavior while the maximum eigenvalues
exhibits Poisson-type decay.
Before giving the proofs, we consider applications.
Example 2.10.3 (Rectangular Random Matrix). Matrix Chernoff inequalities are
very effective for studying random matrices with independent columns. Consider
a rectangular random matrix
 
Z = z1 z 2 · · · z n

where {zk } is a family of independent random vector in Cm . The sample covariance


matrix is defined as
n
1 1
R̂ = ZZ∗ = zk z∗k ,
n n
k=1
2.10 Sums of Random Positive Semidefinite Matrices 123

which is /an estimate of the true covariance matrix R. One is interested in the error
/
/ /
/R̂ − R/ as a function of the number of sample vectors, n. The norm of Z satisfies

n


zk z∗k
2
Z = λmax (ZZ ) = λmax . (2.59)
k=1

Similarly, the minimum singular value sm of the matrix satisfies

n


zk z∗k
2
sm (Z) = λmin (ZZ ) = λmin .
k=1

In each case, the summands are stochastically independent and positive semidefinite
(rank 1) matrices, so the matrix Chernoff bounds apply. 
Corollary 2.10.2 gives accurate estimates for the expectation of the maximum
eigenvalue:

n

μmax  Eλmax Xk  C · max {μmax , R log d} . (2.60)
k=1

The lower bound is Jensen’s inequality; the upper bound is from a standard calcu-
lation. The dimensional dependence vanishes, when the mean μmax is sufficiently
large in comparison with the upper bound R! The a prior knowledge of knowing R
accurately in λmax (Xk )  R converts into the tighter bound in (2.60).
Proof of Theorem 2.10.1. We start with a semidefinite bound for the matrix moment
generating function of a random positive semidefinite contraction.
Lemma 2.10.4 (Chernoff moment generating function). Suppose that X is a
random positive semidefinite matrix that satisfies λmax (Xk )  1. Then
 
E eθX  I + eθ − 1 (EX) for θ ∈ R.

The proof of Lemma 2.10.4 parallels the classical argument; the matrix adaptation is
due to Ashlwede and Winter [36], which is followed in the proof of Theorem 2.2.15.
Proof of Lemma 2.10.4. Consider the function

f (x) = eθx .

Since f is convex, its graph has below the chord connecting two points. In particular,

f (x)  f (0) + [f (1) − f (0)] · x for x ∈ [0, 1].


124 2 Sums of Matrix-Valued Random Variables

More explicitly,

eθx  1 + eθ − 1 · x for x ∈ [0, 1].

The eigenvalues of X lie in the interval of [0, 1], so the transfer rule (1.61) implies
that

eθX  I + eθ − 1 X.

Expectation respects the semidefinite order, so



EeθX  I + eθ − 1 (EX) .

This is the advertised result of Lemma 2.10.4. 


Proof Theorem 2.10.1, Upper Bound. The Chernoff moment generating function,
Lemma 2.10.4, states that

EeθXk  I + g (θ) (EXk ) where g (θ) = eθ − 1 for θ > 0.

As a result, Corollary 2.6.3 implies that


  n     
 1 
n
P λmax Xk  t  d · exp −θt + n · log ·λmax n (I + g (θ) (EXk ) )
k=1   k=1 
1 
n
= d · exp −θt + n · log ·λmax I + g (θ) · n ((EXk ) )
k=1
= d · exp (−θt + n · log · (1 + g (θ) · μ̄max )) .
(2.61)
The third line follows from the spectral mapping Theorem 1.4.13 and the definition
of μ̄max . Make the change of variable t → nα. The right-hand side is smallest when

θ = log (α/ (1 − α)) − log (μ̄max / (1 − μ̄max )) .

After substituting these quantiles into (2.61), we obtain the information divergence
upper bound. 
Proof Corollary 2.10.2, Upper Bound. Assume that the summands satisfy the uni-
form eigenvalue bound with R = 1; the general result follows by re-scaling.
The shortest route to the weaker Chernoff bound starts at (2.61). The numerical
inequality log(1 + x) ≤ x, valid for x > −1, implies that
 n

P λmax Xk t d · exp (−θt+g(θ) · nμ̄max ) =d · exp (−θt+g(θ) · μmax ) .
k=1

Make the change of variable t → (1 + δ) μmax , and select the parameter θ =


log (1 + δ). Simplify the resulting tail bound to complete the proof. 
2.11 Matrix Bennett and Bernstein Inequalities 125

The lower bounds follow from a closely related arguments.


Proof Theorem 2.10.1, Lower Bound. Our starting point is Corollary 2.6.3, consid-
ering the sequence {−Xk }. In this case, the Chernoff moment generating function,
Lemma 2.10.4, states that

Eeθ(−Xk ) = Ee(−θ)Xk  I − g(θ) · (EXk ) where g(θ) = 1 − e−θ for θ > 0.

Since λmin (−A) = −λmax (A), we can again use Corollary 2.6.3 as follows.
  n     n  
 
P λmin Xk  t = P λmax (−Xk )  −t
k=1  k=1  n 

 d · exp θt + n · log λmax n1 (I − g(θ) · EXk )
  k=1  n 

= d · exp θt + n · log 1 − g(θ) · λmin n1 EX k
k=1
= d · exp (θt + n · log (1 − g(θ) · μ̄min )) .
(2.62)
Make the substitution t → nα. The right-hand side is minimum when

θ = log (μ̄min / (1 − μ̄min )) − log (α/ (1 − α)) .

These steps result in the information divergence lower bound. 


Proof Corollary 2.10.2, Lower Bound. As before, assume that the unform bound
R = 1. We obtain the weaker lower bound as a consequence of (2.62). The
numerical inequality log(1 + x) ≤ x, is valid for x > −1, so we have
+ n
 ,
P λmin Xk t  d · exp (θt − g(θ) · nμ̄min ) = d · exp (θt − g(θ) · μmin ) .
k=1

Make the substitution t → (1 − δ) μmin , and select the parameter θ = − log (1 − δ)


to complete the proof. 

2.11 Matrix Bennett and Bernstein Inequalities

In the scalar setting, Bennett and Bernstein inequalities deal with a sum of
independent, zero-mean random variables that are either bounded or subexponential.
In the matrix setting, the analogous results concern a sum of zero-mean random
matrices. Recall that the classical Chernoff bounds concern the sum of independent,
nonnegative, and uniformly bounded random variables while, matrix Chernoff
bounds deal with a sum of independent, positive semidefinite, random matrices
whose maximum eigenvalues are subject to a uniform bound. Let us consider a
motivating example first.
126 2 Sums of Matrix-Valued Random Variables

Example 2.11.1 (Signal plus Noise Model). For example, the sample covariance
matrices of Gaussian noise, R̂ww , satisfy the conditions of independent, zero-mean,
random matrices. Formally

R̂yy = R̂xx + R̂ww ,

R̂yy represent the sample covariance matrix of the received signal plus noise and
R̂xx of the signal. Apparently, R̂ww , is a zero-mean random matrix. All these
matrices are independent, nonnegative, random matrices. 
Our first result considers the case where the maximum eigenvalue of each
summand satisfies a uniform bound. Recall from Example 2.10.3 that the norm of a
rectangular random matrix Z satisfies

n


zk z∗k
2
Z = λmax (ZZ ) = λmax . (2.63)
k=1

Physically, we can call the norm as the power.


Example 2.11.2 (Transmitters with bounded power). Consider a practical applica-
tion. Assume each transmitter is modeled as the random matrix {Zk }, k = 1, . . . , n.
We have the a prior knowledge that its transmission is bounded in some manner.
A model is to consider

EZk = M; λmax (Zk )  R1 , k = 1, 2, . . . , n.

After the multi-path channel propagation with fading, the constraints become

EXk = N; λmax (Xk )  R2 , k = 1, 2, . . . , n.

Without loss of generality, we can always considered the centered matrix-valued


random variable

EXk = 0; λmax (Xk )  R, k = 1, 2, . . . , n.

When a number of transmitters, say n, are emitting at the same time, the total
received signal is described by
n
Y = X1 + · · · , Xn = Xk .
k=1


2.11 Matrix Bennett and Bernstein Inequalities 127

Theorem 2.11.3 (Matrix Bernstein:Bounded Case—Theorem 6.1 of Tropp [53]).


Consider a finite sequence {Xk } of independent, random, Hermitian matrices with
dimension d. Assume that

EXk = 0; λmax (Xk )  R, almost surely.

Compute the norm of the total variance,


/ /
/ n
 /
/ /
2
σ := / E X2k /.
/ /
k=1

Then the following chain of inequalities holds for all t ≥ 0.


  n    2
  Rt 
P λmax Xk  t  d · exp − R σ
2 · h σ2 (i)
k=1  2

 d · exp − σ2t+Rt/3 /2
(ii)
 
d · exp −3t /8σ 2 2
for t  σ 2 /R;
 (iii)
d · exp (−3t/8R) for t  σ 2 /R.
(2.64)
The function h (x) := (1 + x) log (1 + x) − x f or x  0.
Theorem 2.11.4 (Matrix Bernstein: Subexponential Case—Theorem 6.2
of Tropp [53]). Consider a finite sequence {Xk } of independent, random,
Hermitian matrices with dimension d. Assume that

p!
EXk = 0; E (Xpk )  · Rp−2 A2k , for p = 2,3,4, . . . .
2!
Compute the variance parameter
/ /
/ n /
/ /
σ 2 := / A2k / .
/ /
k=1

Then the following chain of inequalities holds for all t ≥ 0.


  n    
 2
P λmax Xk ≥ t  d · exp − σt2 +Rt
/2
k=1  
d · exp −t2 /4σ 2 for t  σ 2 /R;

d · exp (−t/4R) for t  σ 2 /R.
128 2 Sums of Matrix-Valued Random Variables

2.12 Minimax Matrix Laplace Method

This section, taking material from [43, 49, 53, 111], combines the matrix Laplace
transform method of Sect. 2.3 with the Courant-Fischer characterization of eigen-
values (Theorem 1.4.22) to obtain nontrivial bounds on the interior eigenvalues of a
sum of random Hermitian matrices. We will use this approach for estimates of the
covariance matrix.

2.13 Tail Bounds for All Eigenvalues of a Sum of Random


Matrices

In this section, closely following Tropp [111], we develop a generic bound on


the tail probabilities of eigenvalues of sums of independent, random, Hermitian
matrices. We establish this bound by supplementing the matrix Laplace transform
methodology of Tropp [53], that is treated before in Sect. 2.3, with Theorem 1.4.22
and a new result, due to Lieb and Steiringer [112], on the concavity of a certain trace
function on the cone of positive-definite matrices.
Theorem 1.4.22 allows us to relate the behavior of the kth eigenvalue of a matrix
to the behavior of the largest eigenvalue of an appropriate compression of the matrix.
Theorem 2.13.1 (Tropp [111]). Let X be a random, Hermitian matrix with
dimension n, and let k ≤ n be an integer. Then, for all t ∈ R,
0 ∗
1
P (λk (X)  t)  inf min e−θt · E Tr eθV XV
. (2.65)
θ>0 V∈Vn−k+1
n

Proof. Let θ be a fixed positive number. Then


 
P (λk (X)  t) = P (λk (θX)  θt) = P eλk (θX)  eθt
 e−θt · Eeλk (θX)
 
−θt ∗
=e · E exp min λmax (θV XV) .
V∈Vn
n−k+1

The first identify follows from the positive homogeneity of eigenvalue maps and
the second from the monotonicity of the scalar exponential function. The final two
steps are Markov’s inequality and (1.89).
Let us bound the expectation. Interchange the order of the exponential and the
minimum, due to the monotonicity of the scalar exponential function; then apply the
spectral mapping Theorem 1.4.4 to see that
2.13 Tail Bounds for All Eigenvalues of a Sum of Random Matrices 129

 

E exp min λmax (θV XV) =E min λmax (exp (θV∗ XV))
V∈Vn
n−k+1 V∈Vn
n−k+1

 min Eλmax (exp (θV∗ XV))


V∈Vn
n−k+1

 min E Tr (exp (θV∗ XV)) .


V∈Vn
n−k+1

The first step uses Jensen’s inequality. The second inequality follows because the
exponential of a Hermitian matrix is always positive definite—see Sect. 1.4.16, so
its largest eigenvalue is smaller than its trace. The trace functional is linear, which is
very critical. The expectation is also linear. Thus we can exchange the order of the
expectation and the trace: trace and expectation commute—see (1.87).
Combine these observations and take the infimum over all positive θ to complete
the argument. 
Now let apply Theorem 2.13.1 to the case that X can be expressed as a sum of
independent, Hermitian, random matrices. In this case, we develop the right-hand
side of the Laplace transform bound (2.65) by using the following result.
Theorem 2.13.2 (Tropp [111]). Consider a finite sequence {Xi } of independent,
Hermitian, random matrices with dimension n and a sequence {Ai } of fixed
Hermitian matrices with dimension n that satisfy the relations

E eX i  eA i . (2.66)

Let V ∈ Vnk be an isometric embedding of Ck into Cn for some k ≤ n. Then


+ , + ,
E Tr exp V ∗ Xi V  Tr exp V ∗ Ai V . (2.67)
i i

In particular,
+ , + ,
E Tr exp Xi  Tr exp Ai . (2.68)
i i

Theorem 2.13.2 is an extension of Lemma 2.5.1, which establish the result of (2.68).
The proof depends on a recent result of [112], which extends Lieb’s earlier classical
result [50, Theorem 6]. Here MnH represents the set of Hermitian matrices of n × n.
Proposition 2.13.3 (Lieb-Seiringer 2005). Let H be a Hermitian matrix with
dimension k. Let V ∈ Vnk be an isometric embedding of Ck into Cn for some
k ≤ n. Then the function

A → Tr exp {H + V∗ (log A) V}

is concave on the cone of positive-definite matrices in MnH .


130 2 Sums of Matrix-Valued Random Variables

Proof of Theorem 2.13.2. First, combining the given condition (2.66) with the
operator monotonicity of the matrix logarithm gives the following for each k:

log EeXk  Ak . (2.69)

Let Ek denote the expectation conditioned on the first k summands, X1 through Xk .


Then
   
 ∗  ∗ ∗
 Xj

E Tr exp V Xi V = EE1 · · · Ej−1 Tr exp V Xi V + V log e V
ij
ij−1 
 ∗
 ∗Xj

 EE1 · · · Ej−2 Tr exp V Xi V + V log Ee V
ij−1 
  
 EE1 · · · Ej−2 Tr exp V∗ Xi V + V∗ log eAj V
ij−1 
 ∗ ∗
= EE1 · · · Ej−2 Tr exp V Xi V + V Aj V .
ij−1

The first step follows from Proposition 2.13.3 and Jensen’s inequality, and the
second depends on (2.69) and the monotonicity of the trace exponential. Iterate
this argument to complete the proof. The main result follows from combining
Theorems 2.13.1 and 2.13.2. 
Theorem 2.13.4 (Minimax Laplace Transform). Consider a finite sequence
{Xi } of independent, random, Hermitian matrices with dimension n, and let k ≤ n
be an integer.
1. Let {Ai } be a sequence of Hermitian matrices that satisfy the semidefinite
relations

E eθXi  eg(θ)Ai

where g : (0, ∞) → [0, ∞). Then, for all t ∈ R,


  - + ,.
−θt ∗
P λk Xi t  inf min e · Tr exp g (θ) V Ai V .
θ>0 V∈Vn
n−k+1
i i

2. Ai : Vnn−k+1 → MnH be a sequence of functions that satisfy the semidefinite


relations
 ∗

E eθV Xi V  eg(θ)Ai (V)

for all V ∈ Vnn−k+1 where g : (0, ∞) → [0, ∞). Then, for all t ∈ R,
2.14 Chernoff Bounds for Interior Eigenvalues 131

  - + ,.
−θt
P λk Xi t  inf min e · Tr exp g (θ) Ai (V) .
θ>0 V∈Vn
n−k+1
i i

The first bound in Theorem 2.13.4 requires less detailed information on how
compression affects the summands but correspondingly does not yield as sharp
results as the second.

2.14 Chernoff Bounds for Interior Eigenvalues

Classical Chernoff bounds in Sect. 1.1.4 establish that the tails of a sum of
independent, nonnegative, random variables decay subexponentially. Tropp [53]
develops Chernoff bounds for the maximum and minimum eigenvalues of a sum
of independent, positive-semidefinite matrices. In particular, sample covariance
matrices are positive-semidefinite and the sums of independent, sample covariance
matrices are ubiquitous. Following Gittens and Tropp [111], we extend this analysis
to study the interior eigenvalues. The analogy with the scalar-valued random
variables in Sect. 1.1.4 is aimed at, in this development. At this point, it is insightful
if the audience reviews the materials in Sects. 1.1.4 and 1.3.
Intuitively, how concentrated the summands will determine the eigenvalues tail
bounds; in other words, if we align the ranges of some operators, the maximum
eigenvalue of a sum of these operators varies probably more than that of a sum
of operators whose ranges are orthogonal. We are interested in a finite sequence
of random summands {Xi }. This sequence will concentrate in a given subspace.
To measure how much this sequence concentrate, we define a function ψ :
∪1kn Vnk → R that has the property

maxλmax (V∗ Xi V)  ψ (V) almost surely for each V ∈ ∪1kn Vnk . (2.70)
i

Theorem 2.14.1 (Eigenvalue Chernoff Bounds [111]). Consider a finite


sequence {Xi } of independent, random, positive-semidefinite matrices with
dimension n. Given an integer k ≤ n, define

μ k = λk EXi ,
i

and let V+ ∈ Vnn−k+1 and V− ∈ Vnk be isometric embeddings that satisfy


 
∗ ∗
μk = λmax V+ EXi V+ = λmin V− EXi V− .
i i
132 2 Sums of Matrix-Valued Random Variables

Then
 μk /ψ(V+ )

P λk Xi  (1 + δ) μk  (n − k + 1) · for δ > 0, and
i (1 + δ)1+δ
 μk /ψ(V− )
e−δ
P λk Xi  (1 − δ) μk k· for δ ∈ [0, 1],
i (1 − δ)1−δ

where ψ is a function that satisfies (2.70).


Practically, if it is difficult to estimate ψ (V+ ) and ψ (V− ), we can use the weaker
estimates

ψ (V+ )  max max V∗ Xi V = max Xi


V∈Vn
n−k+1
i i

ψ (V− )  maxn max V∗ Xi V = max Xi .


V∈Vk i i

The following lemma is due to Ahlswede and Winter [36]; see also [53, Lemma 5.8].
Lemma 2.14.2. Suppose that X is a random positive-semidefinite matrix that
satisfies λmax (X)  1. Then
 θ
EeθX  exp e − 1 (EX) for θ ∈ R.

Proof of Theorem 2.14.1, upper bound. Without loss of generality, we consider the
case ψ (V+ ) = 1; the general case follows due to homogeneity. Define

Ai (V+ ) = V+ EXi V+ and g (θ) = eθ − 1.

Using Theorem 2.13.4 and Lemma 2.14.2 gives


 
P λk Xi  (1 + δ) μk  inf e−θ(1+δ)μk · Tr exp g (θ) ∗
V+ EX i V + .
θ>0
i i

The trace can be bounded by the maximum eigenvalue (since the maximum
eigenvalue is nonnegative), by taking into account the reduced dimension of the
summands:
    
 ∗  ∗
Tr exp g (θ) V+ EX i V +  (n−k+1) · λmax exp g (θ) V+ EX i V +
i  i 
 ∗
= (n−k+1) · exp g (θ) · λmax V + EX i V + .
i

The equality follows from the spectral mapping theorem (Theorem 1.4.4 at Page 34).
We identify the quantity μk ; then combine the last two inequalities to give
2.14 Chernoff Bounds for Interior Eigenvalues 133

 
P λk Xi  (1 + δ) μk  (n − k + 1) · inf e[g(θ)−θ(1+δ)]μk .
θ>0
i

By choosing θ = log (1 + δ), the right-hand side is minimized (by taking care of
the infimum), which gives the desired upper tail bound. 
Proof of Theorem 2.14.1, lower bound. The proof of lower bound is very similar
to that of upper bound above. As above, consider only ψ (V− ) = 1. It follow
from (1.91) (Page 47) that
   
P λk Xi  (1 − δ) μk = P λn−k+1 −Xi  − (1 − δ) μk .
i i
(2.71)
Applying Lemma 2.14.2, we find that, for θ > 0,
   
Eeθ(−V−∗ Xi V− ) = Ee(−θ)V− Xi V−  exp g (θ) · E −V−∗

Xi V−
  ∗
= exp g (θ) · V− (−EXi ) V− ,

where g (θ) = 1−eθ . The last equality follows from the linearity of the expectation.
Using Theorem 2.13.4, we find the latter probability in (2.71) is bounded by
+ ,

inf e θ(1−δ)μk
· Tr exp g (θ) V− (−EXi ) V− .
θ>0
i

The trace can be bounded by the maximum eigenvalue (since the maximum
eigenvalue is nonnegative), by taking into account the reduced dimension of the
summands:
    
 ∗  ∗
Tr exp g (θ) V− (−EXi ) V−  k · λmax exp g (θ) V− (−EXi ) V−
i  i 

= k · exp −g (θ) · λmin V− (EXi ) V−
i

= k · exp {−g (θ) · μk } .

The equality follows from the spectral mapping theorem, Theorem 1.4.4
(Page 34), and (1.92) (Page 47). In the second equality, we identify the quantity μk .
Note that −g(θ) ≤ 0. Our argument establishes the bound
 
P λk Xi  (1 + δ) μk  k · inf e[θ(1+δ)−g(θ)]μk .
θ>0
i

The right-hand side is minimized, (by taking care of the infimum), when θ =
− log (1 − δ), which gives the desired upper tail bound. 
134 2 Sums of Matrix-Valued Random Variables

From the two proofs, we see the property that the maximum eigenvalue is
nonnegative is fundamental. Using this property, we convert the trace functional
into the maximum eigenvalue functional. Then the Courant-Fischer theorem,
Theorem 1.4.22, can be used. The spectral mapping theorem is applied almost
everywhere; it is must be recalled behind the mind. The non-commutative property
is fundamental in studying random matrices. By using the eigenvalues and their
variation property, it is very convenient to think of random matrices as scalar-
valued random variables, in which we convert the two dimensional problem into
one-dimensional problem—much more convenient to handle.

2.15 Linear Filtering Through Sums of Random Matrices

The linearity of the expectation and the trace is so basic. We must always bear
this mind. The trace which is a linear functional converts a random matrix into a
scalar-valued random variable; so as the kth interior eigenvalue which is a non-linear
functional. Since trace and expectation commute, it follows from (1.87), which says
that

E (Tr X) = Tr (EX) . (2.72)

As said above, in the left-hand side, Tr X is a scalar-valued random variable, so its


expectation is treated as our standard textbooks on random variables and processes;
remarkably, in the right-hand side, the expectation of a random matrix EX is also a
matrix whose entries are expected values. After this expectation, a trace functional
converts the matrix value into a scalar value. One cannot help replacing EX with
the empirical average—a sum of random matrices, that is,
n
1
EX ∼
= Xi , (2.73)
n i=1

as we deal with the scalar-valued random variables. This intuition lies at the very
basis of modern probability. In this book, one purpose is to prepare us for the
intuition of this “approximation” (2.73), for a given n, large but finite—the n is
taken as it is. We are not interested in the asymptotic limit as n → ∞, rather the non-
asymptotic analysis. One natural metric of measure is the kth interior eigenvalues

n

1
λk Xi − EX .
n i=1

Note the interior eigenvalues are non-linear functionals. We cannot simply separate
the two terms.
2.15 Linear Filtering Through Sums of Random Matrices 135

We can use the linear trace functional that is the sum of all eigenvalues. As a
result, we have
 n   n   n 
   
λk n 1
Xi − EX = Tr n 1
Xi − EX = Tr n 1
Xi − Tr (EX)
k i=1 i=1 i=1

n
= 1
n Tr Xi − E (Tr X) .
i=1

The linearity of the trace is used in the second and third equality. The property
that trace and expectation commute is used in the third equality. Indeed, the linear
trace functional is convenient, but a lot of statistical information is contained in the
interior eigenvalues. For example, the median of the eigenvalues, rather than the
average of the eigenvalues—the trace divided by its dimension can be viewed as the
average, is more representative statistically.
We are in particular interested in the signal plus noise model in the matrix setting.
We consider instead
n
1
E (X + Z) ∼
= (Xi + Zi ), (2.74)
n i=1

for

X, Z, Xi , Zi  0 and X, Z, Xi , Zi ∈ Cm×m ,

where X, Xi represent the signal and Z, Zi the noise. Recall that A ≥ 0 means that
A is positive semidefinite (Hermitian and all eigenvalues of A are nonnegative).
Samples covariance matrices of dimensions m × m are most often used in this
context.
Since we have a prior knowledge that X, Xi are of low rank, the low-rank matrix
recovery naturally fits into this framework. We can choose the matrix dimension
m such that enough information of the signal matrices Xi is recovered, but we
don’t care if sufficient information of Zi can be recovered for this chosen m. For
example, only the first dominant k eigenvalues of X, Xi are recovered, which will
be treated in Sect. 2.10 Low Rank Approximation. We conclude that the sums of
random matrices have the fundamental nature of imposing the structures of the data
that only exhibit themselves in the matrix setting. The low rank and the positive
semi-definite of sample covariance matrices belong to these data structures. When
the data is big, we must impose these additional structures for high-dimensional data
processing.
The intuition of exploiting (2.74) is as follows: if the estimates of Xi are so
accurate that they are independent and identically distributed Xi = X0 , then we
rewrite (2.74) as
n n n
1 1 1
E (X + Z) ∼
= (Xi + Zi ) = (X0 + Zi ) = X0 + Zi . (2.75)
n i=1
n i=1
n i=1
136 2 Sums of Matrix-Valued Random Variables

Practically, X + Z cannot be separated. The exploitation of the additional low-


rank structure of the signal matrices allows us to extract the signal matrix X0 .

n
The average of the noise matrices n1 Zi will reduce the total noise power (total
i=1
variance), while the signal power is kept constant. This process can effectively
improve the signal to noise ratio, which is especially critical to detection of
extremely weak signals (relative to the noise power).
The basic observation is that the above data processing only involves the linear
operations. The process is also blind, which means that no prior knowledge of the
noise matrix is used. We only take advantage of the rank structure of the underlying
signal and noise matrices: the dimensions of the signal space is lower than the
dimensions of the noise space. The above process can be extended to more general
case: Xi are more dependent than Zi , where Xi are dependent on each other and so
are Zi , but Xi are independent of Zi . Thus, we rewrite (2.74) as
n n n
1 1 1
E (X + Z) ∼
= (Xi + Zi ) = Xi + Zi . (2.76)
n i=1
n i=1
n i=1

1

n
All we care is that, through the sums of random matrices, n Xi is performing
i=1
1

n
statistically better than n Zi . For example, we can use the linear trace functional
i=1
(average operation) and the non-linear median functional. To calculate the median
value of λk , 1 ≤ k ≤ n,
- n n
.
1 1
M λk Xi + Zi ,
n i=1
n i=1

where M is the median value which is a scalar-valued random variable, we need to


calculate
n n

1 1
λk Xi + Zi 1 ≤ k ≤ n.
n i=1 n i=1

The average operation comes down to a trace operation


n n
1 1
Tr Xi + Tr Zi ,
n i=1
n i=1

where the linearity of the trace is used. This is simply the standard sum of scalar-
valued random variables. It is expected, via the central limit theorem, that their
sum approaches to the Gaussian distribution, for a reasonably large n. As pointed
out before, this trace operation throws away a lot of statistical information that is
available in the random matrices, for example, the matrix structures.
2.16 Dimension-Free Inequalities for Sums of Random Matrices 137

2.16 Dimension-Free Inequalities for Sums of Random


Matrices

Sums of random matrices arise in many statistical and probabilistic applications,


and hence their concentration behavior is of fundamental significance. Surprisingly,
the classical exponential moment method used to derive tail inequalities for scalar
random variables carries over to the matrix setting when augmented with certain
matrix trace inequalities [68, 113]. Altogether, these results have proven invaluable
in constructing and simplifying many probabilistic arguments concerning sums of
random matrices.
One deficiency of many of these previous inequalities is their dependence on the
explicit matrix dimension, which prevents their application to infinite dimensional
spaces that arise in a variety of data analysis tasks, such as kernel based machine
learning [114]. In this subsection, we follow [68, 113] to prove analogous results
where dimension is replaced with a trace quantity that can be small, even when
the explicit matrix dimension is large or infinite. Magen and Zouzias [115] also
gives similar results that are complicated and fall short of giving an exponential tail
inequality.
We use Ei [·] as shorthand for Ei [·] = Ei [· |X1 , . . . , Xi ], the conditional
expectation. The main idea is to use Theorem 1.4.17: Lieb’s theorem.
Lemma 2.16.1 (Tropp [49]). Let I be the identity matrix for the range of the Xi .
Then
- N N
 .
E Tr exp Xi − ln Ei [exp (Xi )] −I  0. (2.77)
i=1 i=1

Proof. We follow [113]. The induction method is used for the proof. For N = 0, it
is easy to check the lemma is correct. For N ≥ 1, assume as the inductive hypothesis
that (2.77) holds with N with replaced with N − 1. In this case, we have that
- N N
 .
E Tr exp Xi − ln Ei [exp (Xi )] −I
i=1 i=1
- - N −1 N
 ..
= E EN Tr exp Xi − ln Ei [exp (Xi )] + ln exp (XN ) −I
i=1 i=1
- - N −1 N
 ..
E EN Tr exp Xi − ln Ei [exp (Xi )] + ln EN exp (XN ) −I
i=1 i=1
- - N −1 N −1
 ..
= E EN Tr exp Xi − ln Ei [exp (Xi )] −I
i=1 i=1

0
138 2 Sums of Matrix-Valued Random Variables

where the second line follows from Theorem 1.4.17 and Jensen’s inequality. The
fifth line follows from the inductive hypothesis. 
While (2.77) gives the trace result, sometimes we need the largest eigenvalue.
Theorem 2.16.2 (Largest eigenvalue—Hsu, Kakade and Zhang [113]). For any
α ∈ R and any t > 0
 N N

P λmax α Xi − log Ei [exp (αXi )] >t
i=1 i=1
N N  −1
 Tr E −α Xi + log Ei [exp (αXi )] · et −t−1 .
i=1 i=1


N 
N
Proof. Define a new matrix A = α Xi − log Ei [exp (αXi )].. Note that
i=1 i=1
g(x) = ex − x − 1 is non-negative for all real x, and increasing for x ≥ 0. Let
λi (A) be the i-th eigenvalue of the matrix A, we have

P [λmax (A) > t] (et − t − 1) = E [I


 λ(λmax (A) > t) (et − t − 1)]
 E e max (A)
− λmax (A) − 1 
  λi (A)
E e − λi (A) − 1
i
 E (Tr [exp (A) − A − I])
 Tr (E [−A])

where I(x) is the indicator function of x. The second line follows from the spectral
mapping theorem. The third line follows from the increasing property of the function
g(x). The last line follows from Lemma 2.16.1. 

N
When Xi is zero mean, then the first term in Theorem 2.16.2 vanishes, so the
i=1
trace term
N

Tr E log Ei [exp (αXi )]
i=1

can be made small by an appropriate choice of α.


Theorem 2.16.3 (Matrix sub-Gaussian bound—Hsu, Kakade and Zhang
[113]). If there exists σ̄ > 0 and κ̄ > 0 such that for all i = 1, . . . , N ,
2.16 Dimension-Free Inequalities for Sums of Random Matrices 139

Ei [Xi ] = 0
N

1 α2 σ̄ 2
λmax log Ei [exp (αXi )] 
N i=1
2
N

1 α2 σ̄ 2 κ̄
E Tr log Ei [exp (αXi )] 
N i=1
2

for all α > 0 almost surely, then for any t > 0,


-  " .
1
N
2σ̄ 2 t  −1
P λmax Xi >  κ̄ · t et − t − 1 .
N i=1
N

The proof is simple. We refer to [113] for a proof.


Theorem 2.16.4 (Matrix Bernstein Bound—Hsu, Kakade and Zhang [113]). If
there exists b̄ > 0, σ̄ > 0 and κ̄ > 0 such that for all i = 1, . . . , N ,

Ei [Xi ] = 0
λmax (Xi )  b̄

1
N
 
λmax Ei X2i  σ̄ 2
N i=1

1
N
 
E Tr Ei X2i  σ̄ 2 κ̄
N i=1

almost surely, then for any t > 0,


-  " .
1
N
2σ̄ 2 t b̄t  −1
P λmax Xi > +  κ̄ · t et − t − 1 .
N i=1
N 3N

The proof is simple. We refer to [113] for a proof.


Explicit dependence on the dimension of the matrix does not allow straightfor-
ward use of these results in the infinite-dimensional setting. Minsker [116] deals
with this issue by extension of previous results. This new result is of interest to
low rank matrix recovery and approximation matrix multiplication. Let || · || denote
the operator norm A = max {λi (A)}, where λi are eigenvalues of a Hermitian
i
operator A. Expectation EX is taken elementwise.
Theorem 2.16.5 (Dimension-free Bernstein inequality—Minsker [116]). Let
X1 , . . . , XN be a sequence of n × n independent Hermitian random matrices such
140 2 Sums of Matrix-Valued Random Variables

/ /
/ N /
/ 2/
that EXi = 0 and ||Xi || ≤ 1 almost surely. Denote σ = / EXi /. Then, for
2
i=1
any t > 0
N 
/ /  
/ N / Tr EX2i
/ / i=1
P / Xi / > t  2 exp (−Ψσ (t)) · rσ (t)
/ / σ2
i=1

t2 /2 6
where Ψσ (t) = σ 2 +t/3 and rσ (t) = 1 + t2 log2 (1+t/σ 2 )
.


N
If EX2i is of approximately low rank, i.e., has many small eigenvalues, the
i=1 N 

Tr EX2i
i=1
number of non-zero eigenvalues are big. The term σ2 , however, is can
be much smaller than the dimension n. Minsker [116] has applied Theorem 2.16.5
to the problem of learning the continuous-time kernel.
A concentration inequality for the sums of matrix-valued martingale differences
is also obtained by Minsker [116]. Let Ei−1 [·] stand for the conditional expectation
Ei−1 [· |X1 , . . . , Xi ].
Theorem 2.16.6 (Minsker [116]). Let X1 , . . . , XN be a sequence of martingale
differences with values in the set of n × n independent Hermitian random matrices
N
such that ||Xi || ≤ 1 almost surely. Denote WN = Ei−1 X2i . Then, for any
i=1
t > 0,
N   t   
6
P Xi >t, λmax (WN ) σ 2 2Tr p − 2 EWN exp (−Ψσ (t)) · 1+ ,
σ Ψ2σ (t)
i=1

where p (t) = min (−t, 1).

2.17 Some Khintchine-Type Inequalities

Theorem 2.17.1 (Non-commutative Bernstein-type inequality [53]). Consider a


finite sequence Xi of independent centered Hermitian random n × n matrices.
Assume we have for some numbers K and σ such that
/ /
/ /
/ /
Xi  K almost surely, / EX2i /  σ2 .
/ /
i

Then, for every t ≥ 0, we have


2.17 Some Khintchine-Type Inequalities 141

 
−t2 /2
P ( Xi  t)  2n · exp .
σ 2 + Kt/3

We say ξ1 , . . . , ξN are independent Bernoulli random variables when each ξi


takes on the values ±1 with equal probability. In 1923, in an effort to provide a
sharp estimate on the rate of convergence in Borel’s strong law of large numbers,
Khintchine proved the following inequality that now bears his name.
Theorem 2.17.2 (Khintchine’s inequality [117]). Let ξ1 , . . . , ξN be a sequence
of independent Bernoulli random variables, and let X1 , . . . , XN be an arbitrary
sequence of scalars. Then, for any N = 1, 2, . . . and p ∈ (0, ∞), there exists an
absolute constant Cp > 0 such that
- p . p/2
 N  N
  2
E  ξi X i   C p · |Xi | . (2.78)
 
i=1 i=1

In fact, Khintchine only established the inequality for the case where p ≥ 2 is an
even integer. Since his work, much effort has been spent on determining the optimal
value of Cp in (2.78). In particular, it has been shown [117] that for p ≥ 2, the value
 1/2  
2p p+1
Cp∗ = Γ
π 2

is the best possible. Here Γ(·) is the Gamma function. Using Stirling’s formula, one
can show [117] that Cp∗ is of the order pp/2 for all p ≥ 2.
The Khintchine inequality is extended to the case for arbitrary m × n matrices.
Here A Sp denotes the Schatten p-norm of an m × n matrix A, i.e., A Sp =
σ (A) p , where σ ∈ Rmin{m,n} is the vector of singular values of A, and || · ||p is
the usual lp -norm.
Theorem 2.17.3 (Khintchine’s inequality for arbitrary m × n matrices [118]).
Let ξ1 , . . . , ξN be a sequence of independent Bernoulli random variables, and let
X1 , . . . , XN be arbitrary m × n matrices. Then, for any N = 1, 2, . . . and p ≥ 2,
we have
⎡/ /p ⎤ p/2
/N / N
/ /
E ⎣/ ξ i Xi / ⎦  p ·
p/2 2
Xi S p .
/ /
i=1 Sp i=1


N
2
The normalization Xi Sp is not the only one possible in order for a
i=1
Khintchine-type inequality to hold. In 1986, Lust-Piquard showed another one
possibility.
142 2 Sums of Matrix-Valued Random Variables

Theorem 2.17.4 (Non-Commutative Khintchine’s inequality for arbitrary


m × n matrices [119]). Let ξ1 , . . . , ξN be a sequence of independent Bernoulli
random variables, and let X1 , . . . , XN be an arbitrary sequence of m×n matrices.
Then, for any N = 1, 2, . . . and p ≥ 2, there exists an absolute constant γp > 0
such that
⎧/ ⎫
⎡/ /p ⎤ ⎪ 1/2 / / 1/2 /
/N / ⎨/ / N /p / N
/ /
/p ⎪
/ ⎬
/ /
E ⎣/ ξi Xi / ⎦  γp ·max / X X T / , / X T
X /
/ ⎪.
/ / ⎪/
⎩ /
i i /
/
/
/
i i
/ ⎭
i=1 Sp i=1 i=1
Sp Sp

The proof of Lust-Piquard does not provide an estimate for γp . In 1998, Pisier [120]
showed that

γp  αpp/2

for some absolute constant α > 0. Using the result of Buchholz [121], we have

p/2
α  (π/e) /2p/4 < 1

for all p ≥ 2. We note that Theorem 2.17.4 is valid (with γp  αpp/2 < pp/2 ) when
ξ1 , . . . , ξN are i.i.d. standard Gaussian random variables [121].
Let Ci be arbitrary m × n matrices such that
N N
Ci CTi  Im , CTi Ci  In . (2.79)
i=1 i=1

So [122] derived another useful theorem.


Theorem 2.17.5 (So [122]). Let ξ1 , . . . , ξN be independent mean zero random
variables, each of which is either (i) supported on [−1,1], or (ii) Gaussian with
variance one. Further, let X1 , . . . , XN be arbitrary m × n matrices satisfying
max(m, n)  2 and (2.79). Then, for any t ≥ 1/2, we have
/ / 
/ N / 
/ / −t
Prob / ξi Xi /  2e (1 + t) ln max {m, n}  (max {m, n})
/ /
i=1

if ξ1 , . . . , ξN are i.i.d. Bernoulli or standard normal random variables; and


/ / 
/ N / 
/ / −t
Prob / ξi Xi /  8e (1 + t) ln max {m, n}  (max {m, n})
/ /
i=1

if ξ1 , . . . , ξN are independent mean zero random variables supported on [−1,1].


We refer to Sect. 11.2 for a proof.
2.17 Some Khintchine-Type Inequalities 143

Here we state a result of [123] that is stronger than Theorem 2.17.4. We deal with
bilinear form. Let X and Y be two matrices of size n × k1 and n × k2 which satisfy

XT Y = 0

and let {xi } and {yi } be row vectors of X and Y, respectively. Denote εi to be a
sequence of i.i.d. {0/1} Bernoulli random variables with P(εi = 1) = ε̄. Then, for
p≥2
⎛ / /p ⎞1/p
/ / √
⎝E / / /
εi xTi yi / ⎠  2 2γp2 max xi max yi +
/ / i i
i Sp
⎧ / /1/2 / /1/2 ⎫
(2.80)
√ ⎨ / / / / ⎬
/ / / /
2 ε̄γp max max xi / yiT yi / , max yi / xTi xi / ,
⎩ i / / i / / ⎭
i Sp i Sp

where γp is the absolute constant defined in Theorem 2.17.4. This proof of (2.80)
uses the following result
⎛ / /p ⎞1/p / /
/ / / /
⎝E /
/
/
εi xTi xi / ⎠  2γp2 max xi
2 /
+ ε̄ /
/
xTi xi /
/ / i / /
i Sp i Sp

for p ≥ 2.
Now consider XT X = I, then for p ≥ log k, we have [123]
⎛ / /p ⎞1/p
/ n / "
⎝E /
/Ik×k −
1 T / ⎠
εi x i x i / C
p
max xi ,
/ ε̄ / ε̄ i
i=1 Sp


where C = 23/4 πe ≈ 5. This result guarantees that the invertibility of a sub-
matrix which is formed from sampling a few columns (or rows) of a matrix X.
Theorem 2.17.6 ([124]). Let X ∈ Rn×n , be a random matrix whose entries are
independent, zero-mean, random variables. Then, for p ≥ log n,
⎛:
; ⎛ ⎞p : ⎞1/p
; ; p
√ ⎜; ; ⎟
 c0 21/p p⎝<E⎝max Xij ⎠ + <E max
p 1/p
(E X ) 2 2
Xij ⎠ ,
i j
j i


where c0  23/4 πe < 5.
Theorem 2.17.7 ([124]). Let A ∈ Rn×n be any matrix and à ∈ Rn×n be a
random matrix such that

EÃ=A.
144 2 Sums of Matrix-Valued Random Variables

Then, for p ≥ log n,


⎛:
; ⎛ ⎞p : ⎞1/p
; ; p
 / /p 1/p ;
/ / 1/p √ ⎜; ⎟
E/A−Ã/ c0 2 p⎝<E⎝max Ãij +<E max
2 ⎠ Ã2ij ⎠ ,
i j
j i


where c0 23/4 πe<5.

2.18 Sparse Sums of Positive Semi-definite Matrices

Theorem 2.18.1 ([125]). Let A1 , . . . , AN be symmetric, positive semidefinite ma-


trices of size n × n and arbitrary rank. For any ε ∈ (0, 1), there is a deterministic
algorithm to construct a vector y ∈ RN with O(n/ε2 ) nonzero entries such that
y ≥ 0 and

N N N
Ai  yi Ai  (1 + ε) Ai .
i=1 i=1 i=1

The algorithm runs in O(N n3 /ε2 ) time. Moreover, the result continues to hold if
the input matrices A1 , . . . , AN are Hermitian and positive semi-definite.
Theorem 2.18.2 ([125]). Let A1 , . . . , AN be symmetric, positive semidefinite ma-
T 
N
trices of size n × n and let y = (y1 , . . . , yN ) ∈ RN satisfy y ≥ 0 and yi = 1.
i=1

N
For any ε ∈ (0, 1), these exists x  0 with xi = 1 such that x has O(n/ε)
i=1
nonzero entries and
N N N
(1 − ε) y i Ai  xi Ai  (1 + ε) y i Ai .
i=1 i=1 i=1

2.19 Further Comments

This chapter heavily relies on the work of Tropp [53]. Due to its user-friendly
nature, we take so much material from it.
Column subsampling of matrices with orthogonal rows is treated in [111].
Exchangeable pairs for sums of dependent random matrices is studied by Mackey,
Jordan, Chen, Farrell, Tropp [126]. Learning with integral operators [127, 128]
is relevant. Element-wise matrix sparsification by Drineas [129] is interesting.
See [130, Page 15] for some matrix concentration inequalities.
Chapter 3
Concentration of Measure

Concentration of measure plays a central role in the content of this book. This chap-
ter gives the first account of this subject. Bernstein-type concentration inequalities
are often used to investigate the sums of random variables (scalars, vectors and
matrices). In particular, we survey the recent status of sums of random matrices in
Chap. 2, which gives us the straightforward impression of the classical view of the
subject.
It is safe to say that the modern viewpoint of the subject is the concentration
of measure phenomenon through Talagrand’s inequality. Lipschitz functions are
basic mathematical objects to study. As a result, many complicated quantities can
be viewed as Lipschitz functions that can be handled in the framework of Talagrand.
This new viewpoint has profound impact on the whole structure of this book.
In some sense, the whole book is to prepare the audience to get comfortable with
this picture.

3.1 Concentration of Measure Phenomenon

Increase in dimensionality can often help to mathematical analysis. This is called


blessing of dimensionality. The regularity of having many “identical” dimensions
over which one can “average” is a fundamental tool.
Let X1 , X2 , . . . , Xn be a sequence of independent random variables taking
values ±1 with equal probability, and set, for example,

Sn = X1 + · · · + X n .

We think of Sn of the individual variables Xi . The classical law of large number


says that Sn is essentially constant
√ (equal to 0). By the central limit theorem, the
fluctuations of Sn are of order n which is hardly zero. But as Sn takes values as
large as n, this is the scale at which one should measure Sn , in which case Sn /n
indeed essentially zero as expressed by the classical exponential bound [131]

R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 145
DOI 10.1007/978-1-4614-4544-9 3,
© Springer Science+Business Media New York 2014
146 3 Concentration of Measure

 
|Sn | 2
P t  2e−nt /2
, t  0.
n

According to M. Talagrand [131], one probabilistic aspect of measure concentration


is that a random variable that depends (in a smooth way) on the influence of many
independent variables is essentially constant!
Due to concentration of measure, a Lipschitz function is nearly constant [132,
p. 17]. Even more important, the tails behave at worst like a scalar Gaussian random
variable with absolutely controlled mean and variance.
Measure concentration is surprisingly shared by a number of cases that gen-
eralized previous examples: (1) by replacing linear functionals (such as sums of
independent random variables) by arbitrary Lipschitz functions of the samples;
(2) by considering measures that are not of product form. The difference between
the concentration phenomenon and the standard probabilistic views on probability
inequalities and law of large numbers theorems is made explicit by the extension to
Lipschitz (and even Hölder type) functions and more general measures. This insight
is simple, yet fundamental. This concept is extended to the matrix setting.
The theory of concentration inequality tries to answer the following question:
Given a random vector x taking value in some measurable space X (which is usually
some high dimensional Euclidean space), and a measurable map f : X → R,
what is a good explicit bound on P (|f (x) − Ef (x)| ≥ t)? Exact evaluation or
accurate approximation is, of course, the central purpose of probability theory
itself. In situations where exact evaluation or accurate approximation is not possible,
concentration inequalities aim to do the next best job by providing rapidly decaying
tail bounds [133].

3.2 Chi-Square Distributions

The χ2 distribution is a basic probability distribution. Let us first study the χ2


distribution to get a feel of concentration in high dimensions.
Lemma 3.2.1. Let (X1 , . . . , Yn ) be i.i.d. Gaussian variables, with mean 0 and
variance 1. Let a1 , . . . , an be nonnegative. We set
n
2
a ∞ = sup |ai | , a 2 = a2i .
i=1,...,n
i=1

Let
n

Z= ai Xi2 − 1 .
i=1
3.2 Chi-Square Distributions 147

Then, the following inequalities hold for any positive t :


 √ 
P Z ≥ 2 a 2 t + 2 a ∞ t ≤ e−t ,
2

 √ (3.1)
P Z ≤ −2 a 2 t ≤ e−t .

As an immediate corollary of Lemma 3.2.1, one obtains an exponential inequality


for chi-square distributions. Let Z be a centralized χ2 statistic with n degrees of
freedom [134]. Then for all t ≥ 0,
 √ 
P Z ≥ n + 2 nt + 2t ≤ e−t ,
 √  (3.2)
P Z ≤ n − 2 nt ≤ e−t .

The following consequence of this bound is useful [135]: for all x ≥ 1, we have
 
Z −n
P ≥ 4x ≤ e−nx . (3.3)
n

Starting with the first inequality bound of (3.2), setting t = nx gives


 
Z −n √
P ≥ 2 x + 2x ≤ e−nx .
n
√  −nx
Since 4x ≥ 2 x + 2x for x ≥ 1, we have P Z−n
n ≥ 4x ≤ e , for all x ≥ 1.
Proof. Let X a random variable with N (0, 1) distribution. Let ψ denote the
logarithm of the Laplace transform of X 2 − 1,

    
ψ(u) = log E exp u X 2 − 1 = −u − 1
2 log (1 − 2u) .

Then, for 0 < u < 12 ,

u2
ψ(u) ≤ .
(1 − 2u)

Indeed, considering the power series expansion, we have

1 k u2 k
ψ(u) = 2u2 (2u) and = u2 (2u) .
k+2 (1 − 2u)
k≥0 k≥0
148 3 Concentration of Measure

Thus,
   n 
   
n   
n
a2i u2
log E euZ = log E exp ai u Xi2 − 1 ≤ 1−2ai u
i=1 i=1 i=1
a22
≤ 1−2a∞ u .

We now refer to [136]. It is proved that if


   vu2
log E euZ ≤ ,
2 (1 − cu)
then, for any positive t,
 √ 
P Z ≥ ct + 2 vt ≤ e−t .

The first inequality in (3.1) holds.


In order to prove the second inequality in (3.1), we just note that for −1/2 <
u < 0, ψ(u) ≤ u2 . This concludes the proof. 
Given a centralized χ2 -variate X with n degrees of freedom, then for all t ∈
(0, 1/2), we have
 
3 2
P (X ≥ n (1 + t)) ≤ exp − nt ,
16
  (3.4)
3 2
P (X ≤ n (1 − t)) ≤ exp − nt ,
16
The first bound in (3.4) is taken from [137] and the second one from [134].
Wainwright [138] puts these two bounds together.
For a centralized χ2n variable X with d degrees of freedom, these exists a constant
C > 0, such that [139]
C 
P (X > n (1 + t)) ≥ √ exp −nt2 /2
n
for all t ∈ (0, 1).

3.3 Concentration of Random Vectors

Later we need to use the Lipschitz norm (also called Lipschitz constant). For a
Lipschitz function f : Rn → R, the Lipschitz norm defined as

|f (x) − f (y)|
f L = sup .
x,y∈Rn x−y 2

We say such a function is f L -Lipschitz.


3.3 Concentration of Random Vectors 149

For i = 1, . . . , n let (Xi , || · ||i ), be normed spaces equipped with norm || · ||i ,
let Ωi be a finite subset of Xi with diameter at most one and let Pi be a probability
measure on Ωi . Define
n

X= ⊕Xi
i=1 2

and

Ω = Ω1 × Ω2 × · · · Ωn ⊂ X

and let

P = P(n) = P1 × P2 × · · · × Pn

be the product probability measure on Ω. For a subset A ⊆ Ω and t ∈ Ω let

φA (t) = d (t, conv A)

be the distance in X from t to the convex hull of the set A.


1 2
Theorem 3.3.1 (Johnson and Schechtman [140]). Ee 4 φA (t) ≤ 1
P(A) .

Theorem 3.3.2 (Johnson and Schechtman [140]). Let 2 ≤ p ≤ ∞ and let f be


a real convex function on the convex hull of the set Ω, i.e., conv Ω. Let σp be the
Lipschitz constant. Then, for all t > 0,

P (|f − Mf | > t) ≤ 4e−t


p
/4σpp
(3.5)

where Mf is the median of f . A similar inequality holds with expectation replacing


the median

P (|f − Ef | > t) ≤ Ke−δt


p
/σpp

where one can take K = 8, δ = 1/32.


Applied to sums S = X1 +. . .+XN of real-valued independent random variables
Y1 , . . . , Yn on some probability space (Ω, A, P) such that ui ≤ Yi ≤ vi , i =
1, . . . , n, we have a Hoeffding type inequality
2
/2D 2
P (S ≥ E (S) + t) ≤ e−t (3.6)

where
n
2
D2 ≥ (vi − ui ) .
i=1
150 3 Concentration of Measure

Let · p be the Lp -norm. The following functions are norms in Rn

n
1/2 n
x 2 = x2i , x ∞ = max |xi | , x 1 = |xi |.
i=1,...,n
i=1 i=1

Consider a Banach space E with (arbitrary) norm · .


Theorem 3.3.3 (Hoeffding type inequality of [141]). Let x1 , . . . , xN be indepen-
dent bounded random vectors in a Banach space (E, · ), and let

N
s = x1 + . . . + xN = xj .
j=1

For every t ≥ 0,
2
/2D 2
P ({| s − E ( s )| ≥ t}) ≤ 2e−t (3.7)


N
2
where D2 ≥ xj ∞.
j=1

Equation (3.7) is an extension of (3.6).


The Hamming metric is defined as
n
da (x, y) = ai 1{xi =yi } , a = (a1 , . . . , an ) ∈ Rn+ ,
i=1


n
where a = a2i . Consider a function f : Ω1 × Ω2 × · · · × Ωn → R such that
i=1
for every x ∈ Ω1 × Ω2 × · · · × Ωn there exists a = a (x) ∈ Rn+ with ||a|| = 1 such
that for every y ∈ Ω1 × Ω2 × · · · × Ωn ,

f (x) ≤ f (y) + da (x, y) . (3.8)

Theorem 3.3.4 (Corollary 4.7 of [141]). Let P be a product probability measure


on the product space Ω1 × Ω2 × · · · × Ωn and let f : Ω1 × Ω2 × · · · × Ωn → R be
1-Lipschitz in the sense of (3.8). Then, for every t ≥ 0,
2
P (|f − Mf | ≥ t) ≤ 4e−t /4

where Mf is a median of f for P.


Replacing f with −f , Theorem 3.3.4 applies if (3.8) is replaced by f (y) ≤ f (x) +
da (x, y).
3.3 Concentration of Random Vectors 151

A typical application of Theorem 3.3.4 is involved in supremum of linear func-


tionals. Let us study this example. In a probabilistic language, consider independent
real-valued random variables Y1 , . . . , Yn on some probability space (Ω, A, P) such
that for real numbers ui , vi , i = 1, . . . , n,

ui ≤ Yi ≤ vi , i = 1, . . . , n.

Set
n
Z = sup ti Y i (3.9)
t∈T i=1

where T is a finite or countable family of vectors t = (t1 , . . . , tn ) ∈ Rn such that


the “variance” σ is finite
1/2
n
 2
σ = sup ti v i − u i
2 2
< ∞.
t∈T i=1

Now observe that


n
f (x) = Z = sup ti x i
t∈T i=1

!
n
and apply Theorem 3.3.4 on the product space [ui , vi ] under the product
i=1
probability measure of the laws of the Yi , i = 1, . . . , n. Let t achieve the supremum
!
n
of f (x). Then, for every x ∈ [ui , vi ],
i=1


n 
n 
n
f (x) = ti x i ≤ ti y i + |ti | |xi − yi |
i=1 i=1 i=1


n
|ti ||ui −vi |
≤ f (y) + σ σ 1{xi =yi } .
i=1

Thus, σ1 f (x) satisfies (3.8) with a = a (x) = σ1 (|t1 | , . . . , |tn |). Combining
Theorem 3.3.4 with (3.7), we have the following consequence.
Theorem 3.3.5 (Corollary 4.8 of [141]). Let Z be defined as (3.9) and denote the
median of Z by MZ. Then, for every t ≥ 0,
2
/4σ 2
P (|Z − MZ| ≥ t) ≤ 4e−t .

In addition,

|EZ − MZ| ≤ 4 πσ and Var (Z) ≤ 16σ 2 .
152 3 Concentration of Measure

Convex functions are of practical interest. The following result has the central
significance to a lot of applications.
Theorem 3.3.6 (Talagrand’s concentration inequality [142]). For every product
probability P on [−1, 1]n , consider a convex and Lipschitz function f : Rn → R
with Lipschitz constant L. Let X1 , . . . , Xn be independent random variables taking
values [−1,1]. Let Y = f (X1 , . . . , Xn ) and let m be a median of Y . Then for every
t ≥ 0, we have
2
/16L2
P (|Y − m| ≥ t) ≤ 4e−t .

See [142, Theorem 6.6] for a proof. Also see Corollary 4.10 of [141]. Let us see
how we can modify Theorem 3.3.6 to have concentration around the mean instead
of the median. Following [143], we just notice that by Theorem 3.3.6,
2
E(Y − m) ≤ 64L2 . (3.10)
2
Since E(Y − m) ≥ Var(Y ), this shows that

Var(Y ) ≤ 64L2 . (3.11)

Thus by Chebychev’s inequality,

1
P (|Y − E [Y ]| ≥ 16L) ≤ .
4
Using the definition of a median, this implies that

E [Y ] − 16L ≤ m ≤ E [Y ] + 16L.

Together with Theorem 3.3.6, we have that for any t ≥ 0,


2
/2L2
P (|Y − E [Y ]| ≥ 16L + t) ≤ 4e−t .

It is essential to point that the eigenvalues of random matrices can be viewed as


functions of matrix entries. Note that eigenvalues are not very regular functions
of general (non-normal) matrices. For a Hermitian matrix A ∈ Rn×n , and the
eigenvalues λi , i = 1, . . . , n are sorted in decreasing magnitude. The following
functions are convex: (1) the largest eigenvalue λ1 (A), (2) the sum of first k largest
k k
eigenvalues λi (A), (3) the sum of the smallest k eigenvalues λn−i+1 (A).
i=1 i=1
However, Theorem 3.3.6 can be applied to these convex functions–not necessarily
linear.
According to (3.10) and (3.11), the Lipschitz constant L controls the mean and
the variance of the function f . Later on in this book, we have shown how to evaluate
3.3 Concentration of Random Vectors 153

this constant. See [23, 144] for related more general result. Here, it suffices to give
some examples studied in [145]. The eigenvalues (or singular values) are Lipschitz
functions with respect to the matrix elements. In particular, for each k ∈ {1, . . . , n},
the k-th largest singular value σk (X) (the eigenvalue λk (X)) is Lipschitz with
n 2
constant 1 if (Xij )i,j=1 is considered as an element of the Euclidean space Rn
2
(respectively the submanifold of Rn corresponding to the Hermitian matrices).
If one insists on thinking of X as a matrix, this corresponds to considering
√ √the
underlying Hilbert-Schmidt metric. The Lipschitz constant of σk (X/ n) is 1/ n,
since the variances of the entries being 1/n. The trace function Tr (X) has a
Lipschitz constant of 1/n. Form (3.11), The variance of the trace function is 1/n
times smaller than that the largest eigenvalue (or the smallest eigenvalue). The same
is true to the singular value.
Theorem 3.3.7 (Theorem 4.18 of [141]). Let f : Rn → R be a real-value function
such that f L ≤ σ and such that its Lipschitz coefficient with respect to the 1 -
metric is less than or equal to κ, that is
n
|f (x) − f (y)| ≤ κ |xi − yi |, x, y ∈ Rn .
i=1

Then, for every t > 0,


  
1 t t2
P (f ≥ M + t) ≤ C exp − min ,
K κ σ2

for some numerical constant C > 0 where M is either a median of f or its mean.
Conversely, the concentration result holds.
Let μ1 , . . . , μn be arbitrary probability measures on the unit interval [0, 1] and
let P be the product probability measure P = μ1 ⊗ . . . ⊗ μn on [0, 1]n . We say a
function on Rn is separately convex if it is convex in each coordinate. Recall that a
convex function on R is continuous and almost everywhere differentiable.
Theorem 3.3.8 (Theorem 5.9 of [141]). Let f be separately convex and 1-
Lipschitz on Rn . Then, for every product probability measure P on [0, 1]n , and
every t ≥ 0,
 
2
P f≥ f dP + t ≤ e−t /4
.

The norm is a convex function. The norm is a supremum of linear functionals.


Consider the convex function f : Rn → R defined as
/ /
/ n /
/ /
f (x) = / xi vi / , x = (x1 , . . . , xn ) ∈ Rn
/ /
i=1
154 3 Concentration of Measure

where vi , i = 1, . . . , n are vectors in an arbitrary normed space E with norm || · ||.


Then, by duality, for x, y ∈ Rn ,
/ n /
/ /
|f (x) − f (y)| ≤ /
/ (x i − y i ) v /
i/
i=1
n
= sup (xi − yi ) z, vi ≤ σ x − y 2
z≤1 i=1

where the last step follows from the Cauchy-Schwarz inequality. So the Lipschitz
norm is f L ≤ σ. The use of Theorem 3.3.6 or Theorem 3.3.8 gives the following
theorem, a main result we are motivated to develop.
Theorem 3.3.9 (Theorem 7.3 of [141]). Let η1 , . . . , ηn be independent (scalar-
valued) random variables such that ηi ≤ 1 almost surely, for i = 1, . . . , n, and let
v1 , . . . , vn be vectors in a normed space E with norm || · ||. For every t ≥ 0,
/ / 
/ n /
/ / 2 2
P / ηi vi / ≥ M + t ≤ 2e−t /16σ
/ /
i=1
/ /
/n /
where M is either the mean or a median of /
/ η v /
i i / and where
i=1

n
2
σ 2 = sup z, vi .
z≤1 i=1

Theorem 3.3.9 is an infinite-dimensional extension of the Hoeffding type


inequality (3.6).
Let us state a well known result on concentration of measure for a standard
Gaussian vector. Let γ = γn be the standard Gaussian measure on Rn with
−n/2 −|x|2 /2
density (2π) e where |x| is the usual )Euclidean norm for vector x. The
expectation of a function is defined as Ef (x) = Rn f (x) dγn (x).
Theorem 3.3.10 (Equation (1.4) of Ledoux [141]). Let f : Rn → R be a
Lipschitz function and let f L be its Lipschitz norm. If x ∈ Rn is a standard
Gaussian vector, (a vector whose entries are independent standard Gaussian
random variables), then for all t > 0
 √  2
P f (x) ≥ Ef (x) + t 2 f L ≤ e−t .

We are interested in the supremum1 of a sum of independent random variables


Z1 , Z2 , . . . , Zn in Banach space

1 The supremum is the least upper bound of a set S defined as a quantity M such that no member
of the set exceeds M . It is denoted as sup x.
x∈S
3.3 Concentration of Random Vectors 155

n
S = sup g (Zi ).
g∈G i=1

Theorem 3.3.11 (Corollary 7.8 of Ledoux [141]). If |g| ≤ η for every g ∈ G and
g (Z1 ) , . . . , g (Zn ) have zero mean for every g ∈ G . Then, for all t ≥ 0,
  
t ηt
P (|S − ES| ≥ t) ≤ 3 exp − log 1 + 2 , (3.12)
Cη σ + ηES̄

where
n
σ 2 = sup Eg 2 (Zi ),
g∈G i=1

n
S̄ = sup |g (Zi )|,
g∈G i=1

and C > 0 is a small numerical constant.


Let us see how to apply Theorem 3.3.10.
Example 3.3.12 (Maximum of Correlated Normals—Example 17.8 of [146]).
We study the concentration inequality for the maximum of n jointly dis-
tributed Gaussian random variables. Consider a Gaussian random vector x =
(X1 , . . . , Xn ) ∼ N (0, Σ), where Σ is positive definite, and let σi = Var (Xi ) , i =
1, 2, . . . , n. Let Σ = AAT .
Let us first consider the function f : Rn → R defined by

f (u) = max {(Au)1 , . . . , (Au)n }

where (Au)1 means the first coordinate of the vector Ay and so on. Let σmax =
maxi σi . Our goal here is to show that f is a Lipschitz function with Lipschitz
constant (or norm) by σmax . We only consider the case when A is diagonal; the
general case can be treated similarly. For two vectors x, y ∈ Rn , we have

(Au)1 = a11 u1 = a11 v1 + (a11 u1 − a11 v1 ) ≤ a11 v1 + σmax u − v


≤ max {(Au)1 , (Au)2 , . . . , (Au)n } + σmax u − v .

Using the same arguments, for each i, we have

(Au)i ≤ max {(Au)1 , (Au)2 , . . . , (Au)n } + σmax u − v .

Thus,
156 3 Concentration of Measure

f (u) = max {(Au)1 , . . . , (Au)n }


≤ max {(Au)1 , (Au)2 , . . . , (Au)n } + σmax u − v
= f (v) + σmax u − v .

By switching the roles of u and v, we have f (v) ≤ f (u) + σmax u − v .


Combining both, we obtain

|f (u) − f (v)| ≤ σmax u − v ,

implying that the Lipschitz norm is σmax . Next, we observe that

max {X1 , . . . , Xn } = max {(Az)1 , (Az)2 , . . . , (Az)n } = f (z)

where z is a random Gaussian vector z = (Z1 , . . . , Zn ) ∼ N (0, Σ). Since the


function f (z) is Lipschitz with constant σmax , we apply Theorem 3.3.10 to obtain

P (|max {X1 , . . . , Xn } − E [max {X1 , . . . , Xn }]| > t)


2 2
= P (|f (z) − E [f (z)]| > t) ≤ e−t /2σmax
.

It is remarkable that although X1 , X2 , . . . , Xn are not assumed to be indepen-


dent, we can still prove an Gaussian concentration inequality by using only the
coordinate-wise variances, and the inequality is valid for all n. 
Let us see how to apply Theorem 3.3.11 to study the Frobenius norm bound of
the sum of vectors, following [123] closely.
Example 3.3.13 (Frobenius norm bound of the sum of vectors [123]). Let ξj be
i.i.d. Bernoulli 0/1 random variables with P (ξj = 1) = d/m whose subscript j
represents the entry selected from a set {1, 2, . . . , m}. In particular, we have
/ /
/m /
/ /
/
SF = / ξj x j y j /
T
/
/ j=1 /
F

where xj and yj are vectors.



Let X, Y = Tr XT Y represent the Euclidean inner product between two
matrices and X F = X, X . It can be easily shown that


X F = sup Tr XT G = sup X, G .
GF =1 GF =1

Note that trace and inner product are both linear. For vectors, the only norm we
consider is the 2 -norm, so we simply denote the 2 -norm of a vector by ||x||
3.3 Concentration of Random Vectors 157


which is equal to x, x , where x, y is the Euclidean inner product between
two vectors. Like matrices, it is easy to show

x = sup x, y .
y=1

See Sect. 1.4.5 for details about norms of matrices and vectors.
Now, let Zj = ξi xTj y (rank one matrix), we have
/ /
/m /
/ / m m
/
SF = / Zj / = sup Z , G = sup g (Zj ).
/ j
/ j=1 / GF =1 j=1 GF =1 j=1
F

Since SF > 0, the expected value of SF is equal to the expected value of S̄. That is
ESF = ES̄F . We can bound the absolute value of g (Zj ) as
= > / /
|g (Zj )| ≤  ξi xTj yj , G  ≤ /ξi xTj yj /F ≤ xj yj ,

where · is the 2 norm of a vector. We take η = max xj yj so that


j
|g (Zj )| ≤ η.

n
Now we compute the term σ 2 = sup Eg 2 (Zi ), in Theorem 3.3.11. Since
g∈G i=1

= >2 d = T > d / /
/xTj yj /2 ,
Eg 2 (Zj ) = Eξi xTj yj , G = xj yj , G ≤ F
m m
we have
m
d / T /2
m m

Eg 2 (Zj ) ≤ /x j y j / = d Tr xTj yj yj xj
m F m
j=1 j=1 j=1
⎛ ⎞
d
m
 d
m
kd
Tr ⎝ xTj xj ⎠ =
2 2 2
≤ yj Tr xTj xj ≤ max yj max yj ,
m j=1
m j j=1
m j
 (3.13)

m
where k = Tr xTj xj . In the first inequality of the second line, we have
j=1
used (1.84) that is repeated here for convenience

Tr (A · B) ≤ B Tr (A) . (3.14)

when A ≥ 0 and B is the spectrum norm (largest singular value). Prove similarly,
we also have
158 3 Concentration of Measure

⎛ ⎞
m m
d dα
Tr ⎝ yjT yj ⎠ =
2 2
Eg 2 (Zj ) ≤ max xj max xj ,
j=1
m j j=1
m j



m
where α = Tr yjT yj . So we choose
j=1

m  
d 2 2
σ2 = Eg 2 (Zj ) ≤ max α max xj , k max yj . (3.15)
j=1
m j j

Apply the powerful Talagrand’s inequality (3.12) and note that from expectation
inequality, σ 2 + ηESF ≤ σESF + ηESF = E2 SF we have
  
P (SF − ESF ≥ t) ≤ 3 exp − Cηt
log 1 + ηt
σ 2 +ηE2 SF
 2

≤ 3 exp − C1 E2tSF .

The last inequality follows from the fact that log (1 + x) ≥ 2x/3 for 0 ≤ x ≤ 1.
2 to satisfy ηt ≤ E SF .
2
Thus, t must be chosen
Choose t = C log β ESF where C is a small numerical constant. By some
3

calculations, we can show that ηt ≤ E2 SF as ηt ≤ E2 SF d ≥ C 2 μm log β3 .


Therefore,
 
3
P (SF − ESF ≥ t) ≤ 3 exp − log = β.
β
2 2
There is a small constant such that C1 log β3 = C log 3
β + 1. Finally, we
summarize the result as
 " 
3
P SF ≤ C1 log · ESF ≥ 1 − β.
β


Theorem 3.3.14 (Theorem 7.3 of Ledoux [141]). Let ξ1 , . . . , ξn be a sequence of
independent random variable such that |ξi | ≤ 1 almost surely with i = 1, . . . , n
and let x1 , . . . , xn be vectors in Banach space. Then, for every t ≥ 0,
/ /   
/ n / t2
/ /
P / ξi xi / ≥ M + t ≤ 2 exp − (3.16)
/ / 16σ 2
i=1
3.4 Slepian-Fernique Lemma and Concentration of Gaussian Random Matrices 159

/ n /
/ /
where M is either the mean or median of / ξi xi /
/
/ and
i=1

n
σ 2 = sup y, xi .
y≤1 i=1

The theorem claims that the sum of vectors with random weights is √ distributed
like Gaussian around its mean or median, with standard deviation 2 2σ. This
theorem strongly bounds the supremum of a sum of vectors x1 , x2 , . . . , xn with
random weights in Banach space. For applications such as (3.15), we need to bound
max xi and max yi . For details, we see [123].
i i
Let {aij } be an n×n array of real numbers. Let π be chosen uniformly at random

n
from the set of all permutations of {1, . . . , n}, and let X = aiπ(i) . This class of
i=1
random variables was first studied by Hoeffding [147].
n
Theorem 3.3.15 ([133]). Let {aij }i,j=1 be a collection of numbers from [0, 1]. Let
π be chosen uniformly at random from the set of all permutations of {1, . . . , n},
n n
and let X = aiπ(i) . Let X = aiπ(i) , where π is drawn from the uniform
i=1 i=1
distribution over the set of all permutations of {1, . . . , n}. Then
 
t2
P (|X − EX| ≥ t) ≤ 2 exp −
4EX + 2t

for all t ≥ 0.

3.4 Slepian-Fernique Lemma and Concentration of Gaussian


Random Matrices

Following [145], we formulate the eigenvalue problem in terms of the Gaussian


process Zu . For u ∈ RN , we define Zu = ·, u . For a matrix X and vectors
u, v ∈ Rn , we have

Xu, v = Tr (X (v ⊗ u)) = X, u ⊗ v Tr = Zu⊗v (X) ,


n
where v ⊗ u stands for the rank one matrix (ui vi )i,j=1 , that is, the matrix of the

map x → x, v u. Here X, Y T r = Tr XYT is the trace duality, often called
the Hilbert-Schmidt scalar product, can be also thought of as the usual scalar product
2
on Rn . The key observation is as follows
160 3 Concentration of Measure

X = max Xu, v = max Zu⊗v (X) (3.17)


u,v∈S n−1 u,v∈S n−1

where || · || denotes the operator (or matrix) norm. The Gaussian process Xu,v =
Zu⊗v (X) , u, v ∈ S n−1 is now compared with Yu,v = Z(u,v) , where (u, v) is
regarded as an element of Rn × Rn = R2n . Now we need the Slepian-Fernique
lemma.
Lemma 3.4.1 (Slepian-Fernique lemma [145]). Let (Xt )t∈T and (Yt )t∈T be two
families of jointly Gaussian mean zero random variable such that

(a) X t − X t 2 ≤ Y t − Y t 2, f ort, t ∈ T

Then

E max Xt ≤ E max Yt . (3.18)


t∈T t∈T

Similarly, if T = ∪s∈S Ts and

(b) X t − X t 2 ≤ Y t − Y t 2, if t ∈ Ts , t ∈ Ts with s = s .


(c) X t − X t 2 ≥ Y t − Y t 2, if t, t ∈ Ts for some s.

then one has

E max min Xs,t ≤ E max min Ys,t .


t∈S t∈Ts t∈S t∈Ts

To see that the Slepian-Fernique lemma applies, we only need to verify that, for
u, v, u , v ∈ S n−1 , where S n−1 is a sphere in Rn ,

|u ⊗ v − u ⊗ v | ≤ |(u, v) − (u , v )| = |u − u | + |v − v | ,
2 2 2 2

where |·| is the usual Euclidean norm. On the other hand, for (x, y) ∈ Rn × Rn , we
have

Z(u,v) (x, y) = x, u + y, v

so

max Z(u,v) (x, y) = |x| + |y| .


u,v∈S n−1

This is just implying that · (U ×V )◦ = · U ◦ + · V ◦ . (If K ∈ Rn , one has the


gauge, or the Minkowski functional of the polar of K given by maxu∈K Zu =
· K ◦ .) Therefore, the assertion of Lemma 3.4.1 translates to
3.4 Slepian-Fernique Lemma and Concentration of Gaussian Random Matrices 161

√ / /
/
/
nE /G(n) / ≤ 2 |x| dγn (x),
Rn

where γ = γn is the standard Gaussian measure on Rn with density


−n/2 −|x|2 /2
(2π) e where |x| is the usual Euclidean norm for vector x. Here
G = G(n) is the n × n real Gaussian matrices.
1
By comparing with the second moment of |x|, the last integral is seen to be ≤ n 2 .
|x|2 is distributed according to the familiar χ2 (n) law. The same argument, applied
just to symmetric tensors u ⊗ u, allows to analyze λ1 (GOE), the largest eigenvalue
of the Gaussian orthogonal ensembles (GOE).
Using the Theorem 3.3.10 and remark following it, we have the following
theorem. Let N denote the set of all the natural numbers, and M the median of
the set. As usual, Φ(t) = γ1 ((−∞, t)) is the cumulative distribution function of the
N(0, 1) Gaussian random variable.
Theorem 3.4.2 (Theorem 2.11 of [145]). Given n ∈ N, consider the ensembles
of n × n matrices G, and GOE. If the random variable F equals either G or
λ1 (GOE), then
MF < EF < 2,
where M standards for the median operator. As a result, for any t > 0,

P (F ≥ 2 + κt) < 1 − Φ(t) < exp −nt2 /2 , (3.19)

where κ = 1 in the case of G and κ = 2 in the case of λ1 (GOE).
For the rectangular matrix of Gaussian matrices with independent entries, we
have the following result.
Theorem 3.4.3 (Theorem 2.13 of [145]). Given m, n ∈ N, with m ≤ n, put β =
m/n and consider the n × m random matrix Γ whose entries are real, independent
Gaussian random variables following N (0, 1/n) law. Let the singular values be
s1 (Γ), . . . , sm (Γ). Then

1 + β < Esm (Γ) ≤ Ms1 (Γ) ≤ Es1 (Γ) < 1 + β

and as a result, for any t > 0,


2
max {P (s1 (Γ) ≥ 1 + β + t) , P (s1 (Γ) ≥ 1 − β − t)} < 1 − Φ(t) ≤ e−nt .
(3.20)
The beauty of the above result is that the inequality (3.20) is valid for all m, n rather
than asymptotically.
The proof of Theorem 3.4.3 is similar to that of Theorem 3.4.2. We use the second
part of Lemma 3.4.1.
Complex matrices can be viewed as real matrices with a special structure. Let
G(n) denote the complex non-Hermitian matrix: all the entries are independent and
of the form x + jy, where x, y are independent real N (0, 1/2n) Gaussian random
variables. We consider
162 3 Concentration of Measure

   
1 G −G 1 10 0 −1 
√ =√ ⊗G+ ⊗G
2 G G 2 01 1 0

where G and G are independent copies of matrix G(n) .


Another family of interest is product measure. This application requires an
additional convexity assumption on the functionals. If μ is a product measure on
Rn with compactly supported factors, a fundamental result of M. Talagrand [148]
shows that (3.78) holds for every Lipschitz and convex function. More precisely,
assume that μ = μ1 ⊗ · · · ⊗ μn , where each μi is supported on [a, b]. Then, for
every Lipschitz and convex function F : Rn → R,
2
/4(b−a)2
μ ({|F − MF | ≥ t}) ≤ 4e−t .

By the variational representation of (3.79), the largest eigenvalue of a symmetric


(or Hermitian) matrix is clearly a convex function of the entries.2 The largest
eigenvalue is 1-Lipschitz, as pointed above. We get our theorem.
Theorem 3.4.4 (Proposition 3.3 of [149] ). Let X be a real symmetric n×n matrix
such that the entries Xij , 1 ≤ i ≤ j ≤ n are independent random variables with
|Xij | ≤ 1. Then, for any t ≥ 0,
2
P (|λmax (X) − Mλmax (X)| ≥ t) ≤ 4e−t /32
.

Up to some numerical constants, the median M can be replaced by the mean E.


A similar result is expected for all the eigenvalues.

3.5 Dudley’s Inequality

We take material from [150, 151] for this presentation. See Sect. 7.6 for some
applications. A stochastic process is a collection Xt , t ∈ T̃ , of complex-valued
random variables indexed by some set T̃ . We are interested in bounding the
moments of the supremum of Xt , t ∈ T̃ . To avoid measurability issues, we define,
for a subset T ⊂ T̃ , the lattice supremum as
 
E sup |Xt | = sup E sup |Xt | , F ⊂ T, F finite . (3.21)
t∈T t∈F

We endow the set T̃ with the pseudometric


 1/2
2
d (s, t) = E|Xt − Xs | . (3.22)

2 They are linear in the entries.


3.5 Dudley’s Inequality 163

In contrast to a metric, a pseudometric does not need to separate points, i.e.,


d(s, t) = 0 does not necessarily imply s = t. We further assume that the increments
of the process Xt , t ∈ T̃ satisfy the concentration property
2
P (|Xt − Xs | ≥ ud (t, s)) ≤ 2e−t /2
, u > 0, s, t ∈ T̃ . (3.23)

Now we apply Dudley’s inequality for the special case of the Rademacher process
of the form
M
Xt = εi xi (t), t ⊂ T̃ , (3.24)
i=1

where ε = (ε1 , . . . , εM ) is a Rademacher sequence and the xi (t) : T̃ → C are


some deterministic functions. We have
 2
 M 
d(s, t) = E|Xt − Xs | = E εi (xi (t) − xi (s))

2 2
i=1 (3.25)

M
2 2
= (xi (t) − xi (s)) = x (t) − x(s) 2 ,
i=1

where x (t) = (x1 (t) , . . . , xM (t)) and || · ||2 is the standard Euclidean norm. So
we can rewrite the (pseudo-)metric as
 1/2
2
d (s, t) = E|Xt − Xs | = x (t) − x(s) 2 . (3.26)

Hoeffding’s inequality shows that the Rademacher process (3.24) satisfies the
concentration property (3.23). We deal with the Rademacher process, while the
original process was for Gaussian process, see also [27, 81, 141, 151, 152].
For a subset T ⊂ T̃ , the covering number N (T, d, δ) is defined as the smallest
integer N such that there exists a subset E ⊂ T̃ with cardinality |E| = N satisfying
0 1
T ⊂ Bd (t, δ) , Bd (t, δ) = s ∈ T̃ , d(t, s) ≤ δ . (3.27)
t∈E

In words, T can be covered by N balls of radius δ in the metric d. The diameter of


the set T in the metric is defined as

D (T ) = sup d (s, t) .
s,t∈T

We state the theorem without proof.


164 3 Concentration of Measure

Theorem 3.5.1 (Rauhut [30]). Let Xt , t ∈ T̃ , be a complex-valued process


indexed by a pseudometric space (T̃ , d) with pseudometric defined by (3.22) which
satisfies (3.23) . Then, for a subset T ⊂ T̃ and any point t0 ∈ T it holds

D(T ) 
E sup |Xt − Xt0 | ≤ 16.51 · ln (N (T, d, u))du + 4.424 · D (T ) . (3.28)
t∈T 0

Further, for p ≥ 2,
 1/p  D(T ) 
E sup |Xt −Xt0 |p ≤ 6.0281/p 14.372 ln (N (T, d, u))du + 5.818 · D (T ) .
t∈T 0

(3.29)

The main proof ingredients are the covering number arguments and the concentra-
tion of measure. The estimate (3.29) is also valid for 1 ≤ p ≤ 2 with possibly
slightly different constants: this can be seen, for instance, from interpolation
between p = 1 and p = 2. The theorem and its proof easily extend to Banach
space valued processes satisfying
2
P (|Xt − Xs | ≥ ud (t, s)) ≤ 2e−t /2
, u > 0, s, t ∈ T̃ .
Inequality (3.29) for the increments of the process can be used in the following way
to bound the supremum
 1/p  1/p 
p p p 1/p
E sup |Xt | ≤ inf E sup |Xt − Xt0 | + (E|Xt0 | )
t∈T t0 ∈T t∈T

 D(T )

≤ 6.0281/p p 14.372 ln (N (T, d, u))du + 5.818 · D (T )
0

+ inf (E|Xt0 |p )1/p. (3.30)


t0 ∈T

The second term is often easy to estimate. Also, for a centered real-valued process,
that is EXt = 0, for all t ∈ T̃ , we have
E sup Xt = E sup (Xt − Xt0 ) ≤ E sup |Xt − Xt0 | . (3.31)
t∈T t∈T t∈T

For completeness we also state the usual version of Dudley’s inequality.


Corollary 3.5.2. Let Xt , t ∈ T , be a real-valued centered process indexed by a
pseudometric space (T, d) such that (3.23) holds. Then
D(T ) 
E sup Xt ≤ 30 ln (N (T, d, u))du. (3.32)
t∈T 0
3.6 Concentration of Induced Operator Norms 165

Proof. Without loss of generality, we assume that D (T ) = 1. Then, it follows that


N (T, d, δ) ≥ 2, for all u < 1/2. Indeed, if N (T, d, δ) = 1, for some u < 1/2 then,
for any δ > 0, there would be two points of distance at least 1 − δ that are covered
by one ball of radius u. This is a contradiction to the triangle inequality. So,

D(T )  1/2  ln 2
ln (N (T, d, u))du ≥ ln (2)du = D (T ) .
0 0 2

Therefore, (3.32) follows from (3.31) and the estimate

2 × 4.424
16.51 + √ < 30.
ln 2

Generalizations of Dudley’s inequality are contained in [27, 81]. 

3.6 Concentration of Induced Operator Norms


 1/p

n
p
For a vector x ∈ Rn , we use x p = |xi | to denote its p -norm. For
i=1
a matrix, we use X p→q to denote the matrix operator norm induced by vectors
norms p and q . More precisely,

A p→q = max Ax p .
xq =1

The spectral norm for a matrix A ∈ Rm×n is given by

A 2→2 = max Ax 2 = max {σi (A)} ,


x2 =1 i=1,...,m

where σi (A) is the singular value of matrix A. The ∞ -operator norm is given by
n
A ∞→∞ = max Ax ∞ = max |Aij |,
x∞ =1 i=1,...,m
j=1

where Aij are the entries of matrix A. Also we have the norm X 1→2

A 1→2 = sup Au 2
u1 =1
= sup sup vT Au
v2 =1 u1 =1
= max A 2.
i=1,...,d
166 3 Concentration of Measure

The matrix inner product for two matrices is defined as


 
A, B = Tr ABT = Tr AT B = Xij Yij .
i,j

The inner product induces the Hilbert-Schmidt norm (or Frobenius) norm

A F = A HS = A, A .

A widely studied instance is the standard Gaussian ensemble. Consider a random


Gaussian matrix X ∈ Rn×d , formed by drawing each row xi ∈ Rd i.i.d. from an
N (0, Σ). Or
⎡ ⎤
x1
⎢ .. ⎥
X = ⎣. ⎦.
xn

Our goal here is to derive the concentration inequalities of √1n Xv 2 . The emphasis
is on the standard approach of using Slepian-lemma [27,141] as well as an extension
due to Gordon [153]. We follow [135] closely for our exposition of this approach.
See also Sect. 3.4 and the approach used for the proof of Theorem 3.8.4.
Given some index set U × V , let {Yu,v , (u, v) ∈ U × V } and
{Zu,v , (u, v) ∈ U × V } be a pair of zero-mean Gaussian processes. Given the
   1/2
semi-norm on the processes defined via σ (X) = E X 2 , Slepian’s lemma
states that if

σ (Yu,v − Yu ,v ) ≤ σ (Zu,v − Zu ,v ) for all (u, v) and (u , v ) in U × V,
(3.33)
then

E sup Yu,v ≤ E sup Zu,v . (3.34)


(u,v)∈U ×V (u,v)∈U ×V

One version of Gordon’s extension [153] asserts that if the inequality (3.33) holds
for for all (u, v) and (u , v ) in U ×V , and holds with equality when v = v , then
 
E sup inf Yu,v ≤ E sup inf Zu,v . (3.35)
u∈U v∈V u∈U v∈V

Now let us turn to the problem at hand. Any random matrix X from the given
ensemble can be written as WΣ1/2 , where W ∈ Rn×d is a matrix with i.i.d.
N (0, 1) entries, and Σ1/2 is the symmetric matrix square root. We choose the set U
as the unit ball

S n−1 = {u ∈ Rn : u 2 = 1} ,
3.6 Concentration of Induced Operator Norms 167

and for some radius r, we choose V as the set


0 / / 1
/ / q
V (r) = v ∈ Rn : /Σ1/2 v/ = 1, u q ≤r .
2

For any v ∈ V (r), we use the shorthand ṽ = Σ1/2 v.


Consider the centered Gaussian processes Yu,v = uT Wv indexed by the set
S n−1
× V (r). Given two pairs (u, v) and (u , v ) in the set S n−1 × V (r), we
have
  2
σ 2 Yu,v − Yu ,v = uṽT − u (ṽ )T
F
2
= uṽT − u ṽT + u ṽT − u (ṽ )T
F   
= ṽ22 u − u 22 + u 22 ṽ − ṽ 22 +2 uT u − u 22 ṽ22 −ṽT ṽ

(3.36)

Now we use the Cauchy-Schwarz inequality and the equalities: u 2 = u 2 = 1
and ṽ 2 = ṽ 2 , we have uT u − u 2 ≤ 0, and ṽ 2 − ṽT ṽ ≥ 0. As a result,
2 2

we may conclude that

σ 2 (Yu,v − Yu ,v ) ≤ u − u + ṽ − ṽ


2 2
2 2 . (3.37)

We claim that the Gaussian process Yu,v satisfies the conditions of Gordon’s lemma
in terms of the zero-mean Gaussian process Zu,v given by
 
Zu,v = gT u + hT Σ1/2 v , (3.38)

where g ∈ Rn and h ∈ Rd are both standard Gaussian vectors (i.e., with i.i.d.
N (0, 1) entries). To prove the claim, we compute

/ /2
/ /
σ 2 (Zu,v − Zu ,v ) = u − u + /Σ1/2 (v − v )/
2
2
2
 2
+ ṽ − ṽ
2
= u− u 2 2 .

From (3.37), we see that

σ 2 (Yu,v − Yu ,v ) ≤ σ 2 (Zu,v − Zu ,v ) ,

says that the Slepian’s condition (3.33) holds. On the other hand, when v = v , we
see Eq. (3.36) that

σ 2 (Yu,v − Yu ,v ) = u − u


2
2 = σ 2 (Zu,v − Zu ,v ) ,
168 3 Concentration of Measure

so that the equality required for Gordon’s inequality (3.34) also holds.
Upper Bound: Since all the conditions required for Gordon’s inequality (3.34) are
satisfied, we have
- . - .
2
E sup Xv 2 =E sup uT Xv
v∈V (r) (u,v)∈S n−1 ×V (r)

- .
≤E sup Zu,v
(u,v)∈S n−1 ×V (r)

- . - .
 
=E sup gT u + E sup hT Σ1/2 v
u2 =1 v∈V (r)

- .
 
≤ E [ g 2] + E sup hT Σ 1/2
v .
v∈V (r)

By convexity, we have
"   2 2
2 √
E [ g 2] ≤ E g 2 = E Tr (ggT ) = Tr E (gT g) = n,


since E gT g = In×n . From this, we obtain that
- . - .
√  
2
E sup Xv 2 ≤ n+E sup h T
Σ 1/2
v . (3.39)
v∈V (r) v∈V (r)

Turning to the remaining term, we have


   / / / /
  / / / /
sup hT Σ1/2 v  ≤ sup v 1 /Σ1/2 v/ ≤ r/Σ1/2 v/ .
v∈V (r) v∈V (r) ∞ ∞

 
Since each element Σ1/2 v is zero-mean Gaussian with variance at most ρ (Σ) =
i
max Σii , standard results on Gaussian maxima (e.g., [27]) imply that
i

/ /  
/ /
E /Σ1/2 v/ ≤ 3ρ (Σ) log d.

Putting all the pieces together, we conclude that for q = 1


3.6 Concentration of Induced Operator Norms 169

- .
√ 1/2
E sup Xv 2 / n ≤ 1 + [3ρ (Σ) log d/n] r. (3.40)
v∈V (r)

Having controlled the expectation, in the standard two-step approach to establish


concentration inequality, it remains to establish sharp concentration around its
expectation. Let f : RD → R be Lipschitz function with constant L with respect to
the 1 -norm. Thus if w ∼ N (0, ID×D ) is standard normal, we are guaranteed [141]
that for all t > 0,
 
t2
P (|f (w) − E [f (w)]| ≥ t) ≤ 2 exp − 2 . (3.41)
2L

Note the dimension-independent nature of this inequality. Now we use it to the


random matrix W ∈ Rn×d , which is viewed as a standard normal random vector in
D = nd dimensions. Let us consider the function
/ / √
/ /
f (W) = sup /WΣ1/2 v/ / n,
v∈V (r) 2

we obtain that
√ / / / /
/ / / /
n [f (W) − f (W )] = sup /WΣ1/2 v/ − sup /W Σ1/2 v/
v∈V (r) / 2 2
/ v∈V (r)
/ 1/2 / 
≤ sup /Σ v/ W − W F
v∈V (r) 2

= W − W F
/ /
/ /
since /Σ1/2 v/ = 1 for all v ∈ V (r). We have thus shown that the Lipschitz
2√
constant L ≤ 1/ n. Following the rest of the derivations in [135], we conclude that
/ /  1/2
1 / / 1
√ Xv 2 ≤ 3/Σ1/2 v/ + 6 ρ (Σ) log d v 1 for all v ∈ Rd . (3.42)
n 2 n

Lower Bound: We use Gordon’s inequality to show the lower bound. We have

− inf Xv 2 = sup − Xv 2 = sup inf uT Xv.


v∈V (r) v∈V (r) v∈V (r) u∈U

Applying Gordon’s inequality, we obtain


170 3 Concentration of Measure

- . - .
E sup − Xv 2 ≤E sup infn−1 Zu,v
v∈V (r) v∈V (r) u∈S

 - .
=E inf g u +E
T
sup h Σ T 1/2
v
u∈S n−1 v∈V (r)

1/2
≤ −E [ g 2 ] + [3ρ (Σ) log d] r.
- .
where we have used previous derivation to upper bound E sup hT Σ v . 1/2
v∈V (r)
√ √ √
Since
√ |E g 2 − n| = o ( n), E g 2 ≥ n/2 for all n ≥ 1. We divide by
n and add 1 to both sides so that
- .
√ 1/2
E sup (1 − Xv 2 ) / n ≤ 1/2 + [3ρ (Σ) log d] r. (3.43)
v∈V (r)

Defining

f (W) = sup (1 − Xv 2 ) / n,
v∈V (r)


we can use the same arguments to show that its Lipschitz constant is at most 1/ n.
Following the rest of arguments in [135], we conclude that

1/ / 1/2
1 / 1/2 / 1
√ Xv 2 ≥ /Σ v/ − 6 ρ (Σ) log d v 1 for all v ∈ Rd .
n 2 2 n

For convenience, we summarize this result in a theorem.


Theorem 3.6.1 (Proposition 1 of Raskutti, Wainwright and Yu [135]). Consider
a random matrix X ∈ Rn×d formed by drawing each row from xi ∈ Rd , i =
1, 2, . . . , n i.i.d. from an N (0, Σ) distribution. Then for some numerical constants
ck ∈ (0, ∞), k = 1, 2, we have
/ /  1/2
1 / / 1
√ Xv 2 ≤ 3/Σ1/2 v/ + 6 ρ (Σ) log d v 1 for all v ∈ Rd ,
n 2 n
and

1/ / 1/2
1 / 1/2 / 1
√ Xv 2 ≥ /Σ v/ − 6 ρ (Σ) log d v 1 for all v ∈ Rd ,
n 2 2 n
with probability 1 − c1 exp (−c2 n).
3.6 Concentration of Induced Operator Norms 171

The notions of sparsity can be defined precisely in terms of the p -balls3 for
p ∈ (0, 1], defined as [135]
+ n
,
p p
Bp (Rp ) = z ∈ Rn : z p = |zi | ≤ Rp , (3.44)
i=1

where z = (z1 , z2 , . . . , zn )T . In the limiting case of p = 0, we have the 0 -ball


+ n
,
B0 (k) = z ∈ Rn : I [Zi = 0] ≤ k , (3.45)
i=1

where I is the indicator function and z has exactly k non-zero entries, where k ! n.
We see Sect. 8.7 for its application in linear regression.
To illustrate the discretization arguments of the set, we consider another result,
taken also from Raskutti, Wainwright and Yu [135].
Theorem 3.6.2 (Lemma 6 of Raskutti, Wainwright and Yu [135]). Consider a
Xz
random matrix X ∈ RN ×n with the 2 -norm upper-bounded by √N z2 ≤ κ for all
2
sparse vectors with exactly 2s non-zero entries z ∈ B0 (2s), i.e.
+ n
,
B0 (2s) = z ∈ Rn : I [Zi = 0] ≤ 2s , (3.46)
i=1

and a zero-mean
 white Gaussian random vector w ∈ Rn with variance σ 2 , i.e.,
w ∼ N 0, σ In×n . Then, for any radius R > 0, we have
2

" 
1  T  s log (n/s)
P sup w Xz ≥ 6σRκ
z0 ≤2s, z2 ≤R N N

≤ c1 exp (−c2 min {N, s log (n−s)}) . (3.47)

In other words, we have


"
1  T  s log (n/s)
sup w Xz ≤ 6σRκ (3.48)
z0 ≤2s, z2 ≤R n N

with probability greater than 1 − c1 exp (−c2 min {N, s log (n − s)}).
Proof. For a given radius R > 0, define the set

S (s, R) = {z ∈ Rn : z 0 ≤ 2s, z 2 ≤ R} ,

and the random variable ZN = ZN (s, R) given by

3 Strictly speaking, these sets are not “balls” when p < 1, since they fail to be convex.
172 3 Concentration of Measure

1  T 
ZN = sup w Xz .
z∈S(s,R) N

For a given ε ∈ (0, 1) to be chosen later, let us upper bound the minimal cardinality
of a set that covers the set S (s, R) up to (Rε)-accuracy
 in 2 -norm. Now we claim
that we may find a covering set z1 , . . . , zN ⊂ S (s, R) with cardinality K =
K (s, R, ε)
 
n
log K (s, R, ε) ≤ log + 2s log (1/ε) .
2s
 
n
To establish the claim, we note that there are subsets of size 2s within
2s
the set {1, 2, . . . , n}. Also, for any 2s-sized subset, there is an (R)-covering in
2 -norm of the ball B(R) (radius R) with at most 22s log(1/ε) elements [154].
/ As la/ result, for each z ⊂ S (s, R), we may find some ||z ||2 such that
l
/z − z / ≤ Rε. By triangle inequality, we have
2
 T   T   T  
1 w Xz ≤ 1 w Xzl  + 1 w X z − zi 
N N N

 T  wT X(z−zi )
≤ 1 w Xzl  + w2
√ √ 2
.
N N N

Using the assumption on X, we have


/ T  / √ / /
/w X z − zi / / N ≤ κ/ z − zi / ≤ κRε.
2 2

w2
Also, since the variate √N 2 is χ2 distribution with N degrees of freedom, we have

w 2 / N ≤ 2σ with probability at least 1 − c1 exp (−c2 N ), using standard tail
bounds (See Sect. 3.2). Putting together the pieces, we have that
1  T  1  T 
w Xz ≤ w Xzl  + 2κRεσ
N N
with high probability. Taking the supremum over z on both sides gives
1  T 
ZN ≤ max w Xzl  + 2κRεσ.
l=1,...,N N
We need to bound the finite maximum over the covering set. See Sect. 1.10. First
we observe that each variate wT Xzl /N is zero mean Gaussian with variance
/ /2
σ 2 /Xzl / /N 2 . Under the assumed conditions on zl and X, this variance is at
2
most σ 2 κ2 R2 /N , so that by standard Gaussian tail bounds, we conclude that
3.7 Concentration of Gaussian and Wishart Random Matrices 173

2
K(s,R,ε)
ZN ≤ σκR + 2κRεσ.
2 N  (3.49)
K(s,R,ε)
= σκR N + 2ε .

with probability at least 1 − 2


c1 exp (−c2 log K (s, R, ε)).
s log(n/2s)
Finally, suppose that ε = N . With this choice and recalling that N ≤n
by assumption, we have
⎛ ⎞
log⎝
n ⎠
2s s log N
log K(s,R,ε)
N
≤ N
+ s log(n/2s)
N
⎛ ⎞
log⎝
n ⎠
2s
≤ N
+ s log(n/s)
N

≤ 2s+s log(n/s)
N
+ s log(n/s)
N
,
where the last line uses standard bounds on binomial coefficients. Since n/s ≥ 2
by assumption, we conclude that our choice of ε guarantees that log K(s,R,ε)
N ≤
5s log (n/s). Substituting these relations into the inequality (3.49), we conclude
that
+ " " ,
s log (n/2s) s log (n/2s)
ZN ≤ σRκ 4 +2 ,
N N

as claimed. Since log K (s, R, ε) ≥ s log (n − 2s), this event occurs with probabil-
ity at least
1 − c1 exp (−c2 min {N, s log (n − s)}) ,
as claimed. 

3.7 Concentration of Gaussian and Wishart


Random Matrices

Theorem 3.7.1 (Davidson and Szarek [145]). For k ≤ n, let X ∈ Rn×k be a


random matrix from a standard Gaussian ensemble, i.e., (Xij ∼ N (0, 1), i.i.d.).
Then, for all t > 0,
174 3 Concentration of Measure

⎛ "  " 2 ⎞
/ /
/1 T / k k 
P ⎝/ /
/ n X X − Ik×k / > 2 +t + +t ⎠ + ≤ 2 exp −nt2 /2 .
2 n n
(3.50)
We can extend to more general Gaussian √ensembles. In particular, for a positive
definite matrix Σ ∈ Rk×k , setting Y = X Σ gives n × k matrix with i.i.d. rows,
xi ∼ N (0, Σ). Then
/ / /   /
/1 T / /√ √ /
/ Y Y − Σ/ = / Σ 1 XT X − Ik×k Σ/
/n / / n /
2 2
/ /
is upper-bounded by λmax (Σ) / n1 YT Y − Ik×k /2 . So the claim (3.51) follows
from the basic bound (3.50). Similarly, we have

 −1  −1
1 T −1/2 1 T −1/2
Y Y −Σ−1 = Σ X X −Ik×k Σ
n n
2 2
 −1
1 T 1
≤ X X −Ik×k
n λmin (Σ)
2

so that the claim (3.52) follows from the basic bound (3.50).
Theorem 3.7.2 (Lemma 9 of Negahban and Wainwright [155]). For k ≤ n, let
Y ∈ Rn×k be a random matrix having i.i.d. rows, xi ∼ N (0, Σ).
1. If the covariance matrix Σ has maximum eigenvalue λmax (Σ) < +∞, then for
all t > 0,

/ / "  " 2 
/1 T / n n
/ /
P / Y Y − Σ/ > λmax (Σ) 2 +t + +t
n 2 N N

≤ 2 exp −nt2 /2 . (3.51)

2. If the covariance matrix Σ has minimum eigenvalue λmin (Σ) > 0, then for all
t > 0,
/ −1 / "  " 2 
/ 1 /
/ −1 / 1 n n
P / T
Y Y −Σ / > 2 +t + +t
/ n / λmin (Σ) N N
2

≤ 2 exp −nt2 /2 . (3.52)
2
For t = k
n, then since k/n ≤ 1, we have
3.7 Concentration of Gaussian and Wishart Random Matrices 175

"  " 2  +" , "


n n k k k
γ= 2 +t + +t =4 + ≤8 .
N N n n n

Let us consider applying Theorem 3.7.2, following [138]. As a result, we have a


specialized version of (3.51)
/ / " 
/1 T / k
P / /
/ n Y Y − Σ/ > 8λmax (Σ) n ≤ 2 exp (−k/2) .
2
/  / " 
/ 1 −1 /
/ −1 / 8 k
P / T
Y Y −Σ / > ≤ 2 exp (−k/2) . (3.53)
/ n / λmin (Σ) n
2

Example 3.7.3 (Concentration inequality for random matrix (sample covariance


matrix)[138]). We often need to deal with the random matrix (sample covariance
 −1
matrix) YT Y/n , where Y ∈ Rn×k is a random matrix whose entries are i.i.d.
elements Yij ∼ N (0, 1). Consider the eigen decomposition
 −1
YT Y/n − Ik×k = UT DU,

where D is diagonal and U is unitary. Since the distribution of Y is


invariant
/ to rotations, /the matrices D and U are independent. Since D 2 =
/ T −1 /
/ Y Y/n − Ik×k / , the random matrix bound (3.53) implies that
2
  
P D 2 > 8 k/n ≤ 2 exp (−k/2) .


Below, we condition on the event D 2 < 8 k/n.
Let ej denote the unit vector with 1 in position j, and z = (z1 , . . . , zk ) ∈ Rk be
a fixed vector z ∈ Rk . Define, for each i = 1, . . . , k, the random variable of interest
⎡ ⎤

Vi = eTi UT DUz = zi uTi Dui + uTi D ⎣ zl u l ⎦


l=i

where uj is the j-th column of the unitary matrix U. As an example, consider the
variable maxi |Vi |. Since Vi is identically distributed, it is sufficient to obtain an
exponential tail bound on {V1 > t}. 
Under the conditioned event on D 2 < 8 k/n, we have
- .
 k
|V1 | ≤ 8 k/n |z1 | + uT1 D zl u l . (3.54)
l=2
176 3 Concentration of Measure

As a result, it is sufficient to obtain asharp tail bound on the second term.


k
Conditioned on D and the vector w = zl ul , the random vector u1 ∈ Rk
l=2
is uniformly distributed over a sphere in k − 1 dimensions: one dimension is lost
since u1 must be orthogonal to w ∈ Rk . Now consider the function

F (u1 ) = uT1 Dw;

we can show that this function is


Lipschitz
√ (with respect to the Euclidean norm)
with constant at most F L ≤ 8 k/n k − 1 z ∞ . For any pair of vectors u1

and u1 , we have
     T 

   

F 1
(u ) − F u 1  = u
 1 − u 1 Dw 
/ /
/  /
≤ /u1 − u1 / D 2 w 2
2
?
  / /
k
/  /
≤ 8 k/n zl2 /u1 − u1 /
l=2 2

 √ / /
/  /
= 8 k/n k − 1 z ∞ /u1 − u1 /
2
?

k
where we have used the fact that w 2 = zl2 , by the orthonormality of the
l=2
{ul } vectors. Since E [F (u1 )] = 0, by concentration of measure for Lipschitz
functions on the sphere [141], for all t > 0, we have
 2

P (|F (u1 )| > t z ∞ ) ≤ 2 exp −c1 (k − 1) 128 kt(k−1)
n

 2

≤ 2 exp −c1 128k
nt
.

Taking union bound, we have


     
nt2 nt2
P max |F (ui )| > tz∞ ≤ 2k exp −c1 = 2 exp −c1 + log k .
i=1,2,...,k 128k 128k

Consider log (p − k) > log k, if we set t = 256k log(p−k)


c1 n , then this probability
vanishes at rate 2 exp (−c2 log (p − k)). If we assume n = Ω (k log (p − k)), the
quantity t is order one. 
We can summarize the above example in a formal theorem.
Theorem 3.7.4 (Lemma 5 of [138]). Consider a fixed nonzero vector z ∈ Rk and
a random matrix Y ∈ Rn×k with i.i.d. elements Yij ∼ N (0, 1). Under the scaling
3.7 Concentration of Gaussian and Wishart Random Matrices 177

n = Ω (k log (p − k)), there are positive constants c1 and c2 such that for all t > 0
/  / 
/ −1 /
P / YT Y/n − Ik×k z/ ≥ c1 z ∞ ≤ 4 exp (− c1 min {k, log (p − k)}).

Example 3.7.5 (Feature-based detection). In our previous work in the data domain
[156, 157] and the kernel domain [158–160], features of a signal are used for
detection. For hypothesis H0 , there is only white Gaussian noise, while for H1 , there
is a signal (with some detectable features) in presence of the white Gaussian noise.
For example, we can use the leading eigenvector of the covariance matrix R as the
feature which is a fixed vector z. The inverse of the sample covariance matrix is
 −1
considered R̂−1 = YT Y/n − Ik×k , where Y is as assumed in Theorem 3.7.4.
Consequently, the problem boils down to
 −1

R̂−1 z = YT Y/n − Ik×k z.

We can bound the above expression. 


The following theorem shows sharp concentration of a Lipschitz function of
Gaussian random variables around its mean. Let ||y||2 be the 2 -norm of an arbitrary
Gaussian vector y.
Theorem 3.7.6 ([141, 161]). Let a random vector x ∈ Rn have i.i.d. N (0, 1)
entries, and let f : Rn → R be Lipschitz with constant L, (i.e., |f (x) − f (y)| ≤
2
L x − y 2 , ∀x, y ∈ Rn ). Then, for all t > 0, we have
 
t2
P (|f (x) − f (y)| > t) ≤ 2 exp − 2 .
2L

√ complex matrix A. Let


Let σ1 be the largest singular value of a rectangular
A op = σ1 is the operator norm of matrix A. Let R be the symmetric matrix
square root, and consider the function

f (x) = Rx 2 / n.

Since it is Lipschitz with constant R op / n, Theorem 3.7.6 implies that

 √ nt2
P | Rx 2 − E Rx 2 | > nt ≤ 2 exp − , (3.55)
2 R op

for all t > 0. By integrating this tail bound, we find the variable Z = Rx 2 / n
satisfies the bound

var (Z) ≤ 4 R op / n.
178 3 Concentration of Measure

So that
 2    √  2 √
EZ − |EZ| =  Tr (R) /n − E Rx / n  ≤ 2 R
2 op / n. (3.56)

Combining (3.56) with (3.55), we obtain


⎛ ? ⎞ 
 / 
 1 /√ / /   R op nt 2
P ⎝ / Rx/ − Tr (R) ≥ t + 2 ⎠ ≤ 2 exp − (3.57)
n 2 n 2 R op

√ 2
for all t > 0. Setting τ = (t − 2/ n) R op in the bound (3.57) gives that
 /    2 
 1 /√ /
/   2 1 2

P  / Rx/ − Tr (R) ≥ τ R op ≤ 2 exp − n t − √ .
n 2 2 n
(3.58)
2
Similarly, considering t = R op in the bound (3.57) gives that with probability
greater than 1 − 2 exp (−n/2), we have
 "  "
 x Tr (R)  2 2
 Tr (R)
 √ +2
≤ +3 R ≤4 R op . (3.59)
 n n  n op

Using the two bounds, we have


    
 x 2 " Tr (R)   x "  x "  2
   Tr (R)   Tr (R) 
 √ 2− = √ 2 −  √ 2 +  ≤ 4τ R op .
 n n   n n  n n 

We summarize the above result here.


Theorem 3.7.7 (Lemma I.2 of Negahban and Wainwright√ [155]). Given a
Gaussian random vector x ∼ N (0, R), for all t > 2/ n, we have
   
1  2

 2
P  x 2 − Tr (R) > 4t R op ≤ 2 exp − 21 n t− √ +2 exp (−n/2) .
n n

We take material from [139]. Defining the standard Gaussian random matrix
G = (Gij )1≤i≤n,1≤j≤p ∈ Rn×p , we have the p × p Wishart random matrix

1 T
W= G G − Ip , (3.60)
n
where Ip is the p × p identity matrix. We essentially deal with the “sums of
Gaussian product” random variates. Let Z1 and Z2 be independent Gaussian random
n i.i.d.
variables, we consider the sum Xi where Xi ∼ Z1 Z2 , 1 ≤ i ≤ n. The
i=1
3.7 Concentration of Gaussian and Wishart Random Matrices 179

following tails bound is also known [162, 163]


  
1 n  
 
P  Xi  > t ≤ C exp −3nt2 /2 as t → 0; (3.61)
n 
i=1

Let wiT be the i-th row of Wishart matrix W ∈ Rp×p , and giT be the i-th row of
data matrix G ∈ Rn×p . The linear combination of off-diagonal entries of the first
row w1
n n p
1
a, w1 = aj W1j = G1i Gij aj ,
j=2
n i=1 j=2

for a vector a = (a2 , . . . , ap ) ∈ Rp−1 . Let


p
1 1
ξi = a, gi = Gij aj .
a 2 a 2 j=2

n
Note that {ξi }i=1 is a collection of independent standard Gaussian random vari-
n n
ables. Also, {ξi }i=1 are independent of {Gi1 }i=1 . Now we have

n
1
a, w1 = a 2 G1i ξi ,
n i=1

which is a (scaled) sum of Gaussian products. Using (3.61), we obtain


 
2
P (| a, w1 | > t) ≤ C exp −3nt2 /2 a 2 . (3.62)

Combining (3.4) and (3.62), we can bound a linear combination of first-row entries.
n
2
Noting that W11 = n1 (G1i ) − 1 is a centered χ2n , we have
i=1


p 
p
P (|wi x| > t) = P Wij xj > t ≤ P |x1 W11 | + Wij xj > t
j=1 j=2


p
≤ P (|x1 W11 | > t/2) + P Wij xj > t/2
j=2

⎛ ⎞
 
3nt2 2
≤ 2 exp − 16·4x 2 +C exp ⎝− 3nt
p

1 2·4 x2
j
i=2

⎛ ⎞
3nt2
≤ 2 max (2, C) exp ⎝− 
p
⎠.
16·4 x2
j
i=1
180 3 Concentration of Measure

There is nothing special about the “first” row, we conclude the following. Note that
p
the inner product is wi , x = wiT x = Wij xj , i = 1, . . . , p for a vector x ∈ Rp .
j=1

Theorem 3.7.8 (Lemma 15 of [139]). Let wiT be the j-th row of Wishart matrix
W ∈ Rp×p , defined as (3.60). For t > 0 small enough, there are (numerical
constants) c > 0 and C > 0 such that for all x ∈ Rn \ {0},

   
P wiT x > t ≤ C exp −cnt2 / x
2
, i = 1, . . . , p.

3.8 Concentration of Operator Norms

For a vector a, a p is the p norm. For a matrix A ∈ Rm×n , the singular values are
ordered decreasingly as σ1 (A) ≥ σ2 (A) ≥ · · · ≥ σmin(m,n) (A). Then we have
σmax (A) = σ1 (A), and σmin (A) = σmin(m,n) (A). Let the operator norm of the
matrix A be defined as A op = σ1 (A). The nuclear norm is defined as

min(m,n)
A ∗ = σi (A),
i=1

while the Frobenius norm is defined as


:
2 ;min(m,n)
;
A F = Tr (AT A) = < σ 2 (A). i
i=1

For a matrix A ∈ Rm1 ×m2 , we use vector vec(A) ∈ RM , M = m1 m2 . Given a


symmetric positive definite matrix Σ ∈ RM ×M , we say that the random matrix Xi
is sampled from the Σ-ensemble if

vec(Xi ) ∼ N (0, Σ) .

We define the quantity



ρ2 (Σ) = sup var uT Xv ,
u1 =1,v1 =1

where the random matrix X ∈ Rm1 ×m2 is sampled from the Σ-ensemble. For the
special case (white Gaussian random vector) Σ = I, we have ρ2 (Σ) = 1.
Now we are ready to study the concentration of measure for the operator norm.
3.8 Concentration of Operator Norms 181

Theorem 3.8.1 (Negahban and Wainwright [155]). Let X ∈ Rm1 ×m2 be a


random sample from the Σ-ensemble. Then we have
  √ √
E X op ≤ 12ρ (Σ) [ m1 + m2 ] (3.63)

and moreover
     
t2
P X op ≥E X op + t ≤ exp − 2
. (3.64)
2ρ (Σ)

Proof. The variational representation

X op = sup uT Xv
u1 =1,v1 =1

is the starting point. Since each (bi-linear) variable uT Xv is zero-mean Gaussian,


thus we find that the operator norm X op is the supremum of a Gaussian process.
The bound (3.64) follows from Ledoux [141, Theorem 7.1].
We now use a simple covering argument to establish the upper bound (3.63). For
more details, we refer to [155]. 
Theorem 3.8.2 (Lemma C.1 of Negahban and Wainwright [155]). The random
N
matrix {Xi }i=1 are drawn i.i.d. from the Σ-Gaussian ensemble, √ i.e., vec(Xi ) ∼
N (0, Σ). For a random vector  = (1 , . . . , N ), if  2 ≤ 2ν N , then there are
universal constants c0 , c1 , c2 > 0 such that

⎛/ / ⎞
/ N / " " 
/1 / m1 m2 ⎠
P ⎝/ i X i / ≥ c0 νρ (Σ) + ≤ c1 exp (−c2 (m1 + m2 )) .
/N / N N
i=1 op

Proof. Define the random matrix Z as

N
1
Z= ε i Xi .
N i=1

N N
Since the random matrices {Xi }i=1 are i.i.d. Gaussian, if the sequence {i }i=1 are
fixed (by conditioning as needed), then the random matrix Z is a sample from the

Γ-Gaussian ensemble with the covariance matrix Γ = N 22 Σ. So, if Z̃ ∈ Rm1 ×m2
2
is a random matrix drawn from the N 2 Σ-ensemble, we have
  / / 
/ /
P Z op ≥ t ≤ P /Z̃/ ≥t .
op
182 3 Concentration of Measure

Using Theorem 3.8.1, we have


/ / √
/ / 12 2νρ (Σ) √ √
E /Z̃/ ≤ √ ( m1 + m2 )
op N

and
/ / / /   
/ / / / N t2
P /Z̃/ ≥ E /Z̃/ +t ≤ exp −c1
op op ν 2 ρ2 (Σ)
 √ √ 
2
for a universal constant c1 . Setting t2 = Ω N −1 ν 2 ρ2 (Σ) m1 + m2 gives
the claim. 
This following result follows by adapting known concentration results for
random matrices (see [138] for details):
Theorem 3.8.3 (Lemma 2 of Negahban and Wainwright [155]). Let X ∈ Rm×n
be a random matrix with i.i.d. rows sampled from a n-variate N (0, Σ) distribution.
Then for m ≥ 2n, we have
     
1 T 1 1 T
P σmin X X ≥ σmin (Σ) , σmax X X ≥ 9σmax (Σ) ≥ 1 − 4 exp (−n/2) .
n 9 n


Consider zero-mean Gaussian random vectors wi defined as wi ∼N 0, ν 2 Im1 ×m1
and random vectors xi defined as xi ∼ N (0, Σ). We define random matrices
X, W as
⎡ T⎤ ⎡ T⎤
x1 w1
⎢ xT ⎥ ⎢ wT ⎥
⎢ 2⎥ ⎢ 2⎥
X = ⎢ . ⎥ ∈ Rn×m2 and W = ⎢ . ⎥ ∈ Rn×m1 . (3.65)
⎣ .. ⎦ ⎣ .. ⎦
xTn wnT

Theorem 3.8.4 (Lemma 3 of Negahban and Wainwright [155]). For random


matrices X, W defined in (3.65), there are constants c1 , c2 > 0 such that
" 
1/ /  m1 + m 2
P / /
X W op ≥ 5ν σmax (Σ)
T
≤ c1 exp (−c2 (m1 + m2 )) .
n n

The following proof, taken from [155], will illustrate the standard approach:
arguments based on Gordon-Slepian lemma (see also Sects. 3.4 and 3.6) and
Gaussian concentration of measure [27, 141].
Proof. Let S m−1 = {u ∈ Rm : u 2 = 1} denote the Euclidean sphere in m-
dimensional space. The operator norm of interest has the variation representation
3.8 Concentration of Operator Norms 183

1/ /
/XT W/ = 1 sup
op
sup vT XT Wu. (3.66)
n n u∈S m1 −1 v∈S m2 −1

For positive scalars a and b, define the random quantity

1
Ψ (a, b) = sup sup vT XT Wu.
n u∈aS m1 −1 v∈bS m2 −1

Our goal is to upper bound Ψ (1, 1). Note that Ψ (a, b) = abΨ (1, 1) due to the
bi-linear property of the right-hand side of (3.66).
Let
   
A = u1 , . . . , uA , B = u1 , . . . , uB

denote the 1/4 coverings of S m1 −1 and S m2 −1 , respectively. We now claim that the
upper bound
= b >
Ψ (1, 1) ≤ 4 max Xv , Wua . (3.67)
ua ∈A,vb ∈B

is valid. To establish the claim, since we note that the sets A and B are 1/4-covers,
for any pair (u, v) ∈ S m1 −1 × S m2 −1 , there exists a pair ua , vb ∈ A × B, such
that u = ua + Δu and v = vb + Δv, with

max { Δu 2 , Δv 2 } ≤ 1/4.

Consequently, due to the linearity of the inner product, we have


= > = >
Xv, Wu = Xvb , Wua + Xvb , WΔu + XΔv, Wua + XΔu, WΔv .
(3.68)
By construction, we have the bound
= b >
 Xv , WΔu  ≤ Ψ (1, 1/4) = 1 Ψ (1, 1) ,
4
and similarly

1
| XΔv, Wua | ≤ Ψ (1, 1) ,
4
as well as
1
| XΔu, WΔv | ≤ Ψ (1, 1) .
16
Substituting these bounds into (3.68) and taking suprema over the left and right-
hand sides, we conclude that
184 3 Concentration of Measure

= > 9
Ψ (1, 1) ≤ max Xvb , Wua + Ψ (1, 1)
ua ∈A,vb ∈B 16

from which (3.67) follows.


Now we need to control the discrete maximum. See Sect. 1.10 related theorems.
According to [27, 154], there exists a 1/4 covering of spheres S m1 −1 and S m2 −1
with at most A ≤ 8m1 and B ≤ 8m2 elements, respectively. As a result, we have
 
1= b >
P (|Ψ (1, 1)| ≥ 4δn) ≤ 8m1 +m2 max P Xv , Wua ≥ δ . (3.69)
ua ∈A,vb ∈B n

The rest is to do obtain a good bound on the quantity


n
1 1
Xv, Wu = v, xi u, wi
n n i=1

where (u, v) ∈ S m1 −1 × S m2 −1 are arbitrary but fixed. Here, xi and w  i are,


respectively, the i-th row of matrices X and W. Since xi ∈ Rm1 has i.i.d. N 0, ν 2
elements and u is fixed, we have

Zi = u, wi ∼ N 0, ν 2 , i = 1, . . . , n.
n
These variables {Zi }i=1 are independent from each other, and of the random matrix
X. So, conditioned on X, the sum
n
1
Z= v, xi u, wi
n i=1

is zero-mean Gaussian with variance


 
ν2 1 2 ν2 / /
/XT X/n/ .
α2 = Xv 2 ≤ op
n n n

Define the event


 
9ν 2
T = α2 ≤ Σ op .
n

Using Theorem 3.8.3, we have


/ T /
/X X/n/ ≤ 9σmax (Σ)
op

with probability at least 1 − 2 exp (−n/2), which implies that P (T c ) ≤


2 exp (−n/2). Therefore, conditioned on the event T and its complement T c ,
we have
3.9 Concentration of Sub-Gaussian Random Matrices 185

P (|Z| ≥ t) ≤ P (|Z| ≥ t |T ) + P (T c )
 
t2
≤ exp −n 2ν 2 + 2 exp (−n/2) .
( 4+Σop )

Combining this tail bound with the upper bound (3.69), we obtain
⎛ ⎞
2
t
P (|Ψ (1, 1)| ≥ 4δn) ≤ 8m1 +m2 exp ⎝−n   ⎠ + 2 exp (−n/2) .
2ν 2 4+ Σ op
(3.70)
m1 +m2
Setting t2 = 20ν 2 Σ op n , this probability vanishes as long as n >
16 (m1 + m2 ). 
Consider the vector random operator ϕ (A) : Rm1 ×m2 → RN with ϕ (A) =
(ϕ1 (A) , . . . , ϕN (A)) ∈ RN . The scalar random operator ϕi (A) is defined by

ϕi (A) = Xi , A , i = 1, . . . , N, (3.71)
N
where the matrices {Xi }i=1 are formed from the Σ-ensemble, i.e., vec(Xi ) ∼
N (0, Σ).
Theorem 3.8.5 (Proposition 1 of Negahban and Wainwright [155]). Consider
the random operator ϕ(A) defined in (3.71). Then, for all A ∈ Rm1 ×m2 , the
random operator ϕ(A) satisfies
" " 
1 1//√
/
/ m1 m2
√ ϕ (A) 2 ≥ / Σ vec (A)/ − 12ρ (Σ) + A 1 (3.72)
N 4 2 N N

with probability at least 1 − 2 exp (−N/32). In other words, we have


 " "  
1 1//√
/
/ m1 m2
P √ ϕ (A) 2 ≤ / Σ vec (A)/ −12ρ (Σ) + A 1
N 4 2 N N
≤ 2 exp (−N/32) . (3.73)

The proof of Theorem 3.8.5 follows from the use of Gaussian comparison inequal-
ities [27] and concentration of measure [141]. Its proof is similar to the proof of
Theorem 3.8.4 above. We see [155] for details.

3.9 Concentration of Sub-Gaussian Random Matrices

We refer to Sect. 1.7 on sub-Gaussian random variables and Sect. 1.9 for the
background on exponential random variables. Given a zero-mean random variable
Y , we refer to
186 3 Concentration of Measure

1 l
1/l
Y ψ1 = sup E|Y |
l≥1 l
as its sub-exponential parameter. The finiteness of this quantity guarantees existence
of all moments, and hence large-deviation bounds of the Bernstein type.
 We2 say that a random matrix X ∈ R
n×p
is sub-Gaussian with parameters
Σ, σ if
1. Each row xTi ∈ Rp , i = 1, . . . , n is sampled independently from a zero-mean
distribution with covariance Σ, and
2. For any unit vector u ∈ Rp , the random variable uT xi is sub-Gaussian with
parameter at most σ.
If we a random matrix by drawing each row independently from the distribution
N (0, Σ), then
 the resulting
 matrix X ∈ Rn×p is a sub-Gaussian matrix with
parameters Σ, Σ op , where A op is the operator norm of matrix A.
By Lemma 1.9.1, if a (scalar valued) random variable X is a zero-mean
 sub-
Gaussian with parameter σ, then the random variable Y = X 2 − E X 2 is sub-
exponential with Y ψ1 ≤ 2σ 2 . It then follows that if X1 , . . . , Xn are zero-mean
i.i.d. sub-Gaussian random variables, we have the deviation inequality
     2 
 N
 
1   nt nt
P  Xi2 −E Xi2  ≥ t ≤ 2 exp −c min ,
N   4σ 2 2σ 2
i=1

for all t ≥ 0 where c > 0 is a universal constant (see Corollary 1.9.3). This deviation
bound may be used to obtain the following result.
Theorem 3.9.1 (Lemma 14 of [164]). If X ∈ Rn×p1 is a zero-mean sub-Gaussian
matrix with parameters Σx , σx2 , then for any fixed (unit) vector v ∈ Rp , we have
     2 
 2 2 t t
P  Xv 2 − E Xv 2  ≥ nt ≤ 2 exp −cn min , . (3.74)
σx4 σx2

Moreover,
 if Y ∈ Rn×p2 is a zero-mean sub-Gaussian matrix with parameters
Σy , σy2 , then
/ 1 /    
/ / t2 t
P / YT X − cov (yi , xi )/ ≥ t ≤ 6p1 p2 exp −cn min 2
, ,
n max (σx σy ) σx σy
(3.75)

where Xi and Yi are the i-th rows of X and Y, respectively. In particular, if n 


log p, then
3.9 Concentration of Sub-Gaussian Random Matrices 187

/ /  "
/1 T / log p
P / /
/ n Y X − cov (yi , xi )/ ≥ t ≤ c0 σx σy ≤ c1 exp (−c2 log p) .
max n
(3.76)
The 1 balls are defined as

Bl (s) = {v ∈ Rp : v l ≤ s, l = 0, 1, 2} .

For a parameter s ≥ 1, use the notation

K (s) = {v ∈ Rp : v 2 ≤ 1, v 0 ≤ s} ,

the 0 norm x 0 stands for the non-zeros of vector x. The sparse set is
K (s) = B0 (s) ∩ B2 (1) and the cone set is
 √ 
C (s) = v ∈ Rp : v 1 ≤ s v 2 .

We use the following result to control deviations uniformly over vectors in Rp .


Theorem 3.9.2 (Lemma 12 of [164]). For a fixed matrix Γ ∈ Rp×p , parameter
s ≥ 1, and tolerance ε > 0, suppose we have the deviation condition
 T 
v Γv ≤ ε ∀v ∈ K (s) .

Then
 
 T  1
v Γv ≤ 27 v
2
+ v
2
∀v ∈ Rp .
2 1
s

Theorem 3.9.3 (Lemma 13 of [164]). Suppose s ≥ 1 and Γ̂ is an estimator of Σx


satisfying the deviation condition
    λ
 T  min (Σx )
v Γ̂ − Σx v ≤ ∀v ∈ K (2s) .
54
Then we have the lower-restricted eigenvalue condition
    1
 T  2 1 2
v Γ̂ − Σx v ≥ λmin (Σx ) v 2 − λmin (Σx ) v 1
2 2s
and the upper-restricted eigenvalue condition
    3
 T  2 1 2
v Γ̂ − Σx v ≤ λmax (Σx ) v 2 + λmax (Σx ) v 1 .
2 2s
We combine Theorem 3.9.1 with a discretization argument and union bound to
obtain the next result.
188 3 Concentration of Measure

Theorem 3.9.4 (Lemma 15 of [164]). If X ∈ Rn×p is a zero-mean sub-Gaussian


matrix with parameters Σ, σ 2 , then there is a universal constant c > 0 such that
! "   2  
1 1 t t
P sup Xv22 −E Xv22 ≥ t ≤ 2 exp −cn min , +2s log p .
v∈K(2s) n n σ4 σ2

We consider the dependent data. The rows of X are drawn from a stationary vector
autoregressive (AR) process [155] according to

xi+1 = Axi + vi , i = 1, 2, . . . , n − 1, (3.77)

where vi ∈ Rp is a zero-mean noise vector with covariance matrix Σv , and A ∈


Rp×p is a driving matrix with spectral norm A 2 < 1. We assume the rows of X
are drawn from a Gaussian distribution with Σx , such that

Σx = AΣx AT + Σv .

Theorem 3.9.5 (Lemma 16 of [164]). Suppose y = [Y1 , Y2 , . . . , Yn ] ∈ Rn is a


mixture of multivariate Gaussians yi ∼ N (0, Qi ), and let σ 2 = sup Qj op . Then
j
for all t > √2 , we have
n

 1 1    2 
 2 2  2 1 2
P  y2 − E y2  ≥ 4tσ ≤ 2 exp − n t − √ + +2 exp (−n/2) .
n n 2 n

This result is a generalization of Theorem 3.7.7. It follows from the concentration


of Lipschitz functions of Gaussian random vectors [141].
√ By definition, the random
vector y is a mixture of random vectors of the form Qi xi , where xi ∼ N (0, In ).
The key idea is to study the function

fj (x) = Qj x 2 / n

and obtain the Lipschitz constant as Qj op / n. Also note that fj (x) is a sub-

Gaussian random variable with parameter σj2 = Qj op / n. So the mixture
1
n y 2 is sub-Gaussian with parameter σ
2
= n1 sup Qj op . The rest follows
j
from [155].
Example 3.9.6 (Additive noise [164]). Suppose we observe

Z = X + W,

where W is a random matrix independent of X, with the rows wi drawn from a


zero-mean distribution with known covariance matrix Σw . We define
3.9 Concentration of Sub-Gaussian Random Matrices 189

1 T
Γ̂add = Z Z − Σw .
n

Note that Γ̂add is not positive semidefinite. 


Example 3.9.7 (Missing data [164]). The entries of matrix X are missing at
random. We observe the matrix Z ∈ Rn×p with entries

Xij with probability 1 − ρ,
Zij =
0 otherwise.

Given the observed matrix Z ∈ Rn×p , we use


 
1 T 1 T
Γ̂miss = Z̃ Z̃ − ρ diag Z̃ Z̃
n n

where Z̃ij = Zij / (1 − ρ). 


Theorem 3.9.8 (Lemma 17 of [164]). Let X ∈ Rn×p is a Gaussian random
matrix, with rows xi generated according to a vector autoregression (3.77) with
driving matrix A. Let v ∈ Rp be a fixed vector with unit norm. Then for all t > √2n ,

      2 
 T  1 2
P v Γ̂ − Σx v ≥ 4tς 2 ≤ 2 exp − n t − √ + 2 exp (−n/2) ,
2 n

where

2Σx op
⎨ Σw + (additive noise case) .
2 op 1−Aop
ς = 2Σx op
⎩ 1
(missing data case) .
(1−ρmax )2 1−Aop

Theorem 3.9.9 (Lemma 18 of [164]). Let X ∈ Rn×p is a Gaussian random


matrix, with rows xi generated according to a vector autoregression (3.77) with
driving matrix A. Let v ∈ Rp be a fixed vector with unit norm. Then for all t > √2n ,
  2 
   
  2
P sup vT Γ̂ − Σx v ≥ 4tς 2 ≤ 4 exp −cn t − √ + 2s log p ,
v∈K(2s) n

where ς is defined as in Lemma 3.9.8.


190 3 Concentration of Measure

3.10 Concentration for Largest Eigenvalues

We draw material from [149] in this section. Let μ be the standard Gaussian measure
−n/2 −|x|2 /2
on Rn with density (2π) e with respect to Lebesgue measure. Here |x|
is the usual Euclidean norm for vector x. One basic concentration property [141]
indicates that for every Lipschitz function F : Rn → R with the Lipschitz constant
||F ||L ≤ 1, and every t ≥ 0, we have
  
 
μ F − 
F dμ ≥ t
2
≤ 2e−t /2 . (3.78)


The same holds for a median of F instead of the mean. One fundamental property
of (3.78) is its independence of dimension n of the underlying state space. Later on,
we find that (3.78) holds for non-Gaussian classes of random variables. Eigenvalues
are the matrix functions of interest.
Let us illustrate the approach by studying concentration for largest eigenvalues.
For example, consider the Gaussian unitary ensemble (GUE) X: For each integer
n ≥ 1, X = (Xij )1≤i,j≤n is an n × n Hermitian centered Gaussian random matrix
with variance σ 2 . Equivalently, the random matrix X is distributed according to the
probability distribution

1  
P (dX) = exp − Tr X2 /2σ 2 dX
Z

on the space Hn ∼
2
= Rn of n × n Hermitian matrices where
 
dX = dXii d Re (Xij ) d Im (Xij )
1≤i≤n 1≤i,j≤n

is Lebesgue measure on Hn and Z is the normalization constant. This probability


measure is invariant under the action of the unitary group on Hn in the sense that
UXUH has the same law as X for each unitary element U of Hn . The random
matrix X is then said to be an element of the Gaussian unitary ensemble (GUE)
(“ensemble” for probability distribution).
The variational characterization is critical to the largest eigenvalue

λmax (X) = sup uXuH (3.79)


|u|=1

where the function λmax is linear in X. The expression is the quadratic form. Later
on in Chap. 4, we study the concentration of the quadratic forms. λmax (X) is easily
seen (see Chap. 4) to be a 1-Lipschitz map of the n2 independent real and imaginary
entries
3.10 Concentration for Largest Eigenvalues 191

√ √
Xii , 1 ≤ i ≤ n, Re (Xij ) / 2, 1 ≤ i ≤ n, Im (Xij ) / 2, 1 ≤ i ≤ n,

of matrix X. Using Theorem (3.3.10) together with the scaling of the variance σ 2 =
1
4n , we get the following concentration inequality on λmax (X).

Theorem 3.10.1. For all n ≥ 1 and t ≥ 0,


2
P ({|λmax (X) − Eλmax (X)| ≥ t}) ≤ 2e−2nt .

As a consequence, note that

var (λmax (X)) ≤ C/n.

Using Random Matrix Theory [145], this variance is

var (λmax (X)) ≤ C/n4/3 .

Viewing the largest eigenvalue as one particular example of Lipschitz function of


the entries of the matrix does not reflect enough the structure of the model. This
comment more or less applies to all the results presented in this book deduced from
the concentration principle.

3.10.1 Talagrand’s Inequality Approach

Let us estimate Eλmax (X). We emphasize the approach used here. Consider the
real-valued Gaussian process
n
Gu = uXuH = Xij ui ūj , |u| = 1,
i,j=1

where u = (u1 , . . . , un ) ∈ Cn . We have that for u, v ∈ Cn ,

  n
2
E |Gu − Gv | = σ 2 |ui ūj − vi v̄j |.
i,j=1

We define the Gaussian random processes indexed by the vector u ∈ Cn , |u| = 1.


We have that for u ∈ Cn , |u| = 1
n n
Hu = gi Re (ui ) + hi Im (ui )
i=1 i=1
192 3 Concentration of Measure

where g1 , . . . , gn , h1 , . . . , hn are independent standard Gaussian variables. It fol-


lows that for every u, v ∈ Cn , such that |u| = |v| = 1,
   
2 2
E |Gu − Gv | ≤ 2σ 2 E |Hu − Hv | .

By the Slepian-Fernique lemma [27] we have that


  ⎛- .1/2 ⎞
√ √ n
E sup Gu ≤ 2σE sup Gu ≤ 2 2σE ⎝ gi2 ⎠.
|u|=1 |u|=1 i=1

1
When σ 2 = 4n , we thus have that

Eλmax (X) ≤ 2. (3.80)
Equation (3.80) extends to the class of sub-Gaussian distributions including random
matrices with symmetric Bernoulli entries.
Combining (3.80) with Theorem 3.10.1, we have that for every t ≥ 0,
0 √ 1 2
P λmax (X) ≥ 2+t ≤ 2e−2nt .

3.10.2 Chaining Approach

Based on the supremum representation (3.79) of the largest eigenvalue, we can use
another chaining approach [27, 81]. The supremum of Gaussian or more general
process (Zt )t∈T is considered. We study the random variable sup Zt as a function
  t∈T

of the set T and its expectation E sup Zt . Then we can study the probability
  t∈T

P sup Zt ≥ r , r ≥ 0.
t∈T
For real symmetric matrices X, we have
n
Zu = uXuT = Xij ui uj , |u| = 1,
i,j=1

where u = (u1 , . . . , un ) ∈ Rn and Xij , 1 ≤ i ≤ j ≤ n are independent centered


either Gaussian or Bernoulli random variables. Note Zu is linear in the entries of
Xij , ui , uj . Basically, Zu is the sum of independent random variables. To study the
size of the unit sphere |u| = 1 under the L2 -metric, we have that
n
2
E|Zu − Zv | = |ui uj − vi vj |, |u| = |v| = 1.
i,j=1
3.10 Concentration for Largest Eigenvalues 193

3.10.3 General Random Matrices

The main interest of the theory is to apply concentration of measure to general


families of random matrices. For measures μ on Rn , the dimension-free inequality
is defined as
  
 
μ F − F dμ ≥ t 2
≤ Ce−t /C , t ≥ 0 (3.81)
 

for some constant C > 0 independent ) of dimension n and every 1-Lipschitz


function F : Rn → R. The mean F dμ can be replaced by a median of F .
The primary message is that (3.81) is valid for non-Gaussian random variables.
Consider the example of independent uniform entries. If X = (Xij )1≤i,j≤n is a
real symmetric n × n matrix, then its eigenvalues are 1-Lipschitz functions of the
entries.
Theorem 3.10.2 (Proposition 3.2 of [149]). Let X = (Xij )1≤i,j≤n be a real
symmetric n × n random matrix and Y = (Yij )1≤i,j≤n be a real n × n random
matrix. Assume that the distributions of the random vector Xij , 1 ≤ i ≤ j ≤ n and
2
Yij , 1 ≤ i ≤ j ≤ n in Rn(n+1)/2 and, respectively, Rn satisfy the dimension-free
concentration property (3.81). Then, if τ is any eigenvalue of X, and singular value
of Y, respectively, for every t ≥ 0,
2 2
P (|τ (X) − Eτ (X)| ≥ t) ≤ Ce−t /2C
, respectively,Ce−t /C
.

Below we give two examples of distributions satisfying concentration inequalities


of the type (3.81). The first class is measures satisfying a logarithmic Sobolev
inequality that is a natural extension of the Gaussian example. A probability measure
μ on R, or Rn is said to satisfy a logarithmic Sobolev inequality if for some constant
C > 0,

2
f 2 log f 2 dμ ≤ 2C |∇f | dμ (3.82)
Rn Rn
)
for every smooth enough function f : Rn → R such that f 2 dμ = 1.
The prototype example is the standard Gaussian measure on Rn which satis-
fies (3.82) with C = 1. Another example consists of probability measures on Rn of
the type

dμ (x) = e−V (x) dx


 
2
where V − c |x| /2 is a convex function for some constant c > 0. The measures
satisfy (3.82) for C = 1/c.
Regarding the logarithmic Sobolev inequality, an important point to remember
is its stability by product that gives dimension-free constants. If μ1 , . . . , μn are
probability measures on Rn satisfying the logarithmic Sobolev inequality (3.82)
194 3 Concentration of Measure

with the same constant C, then the product measure μ1 ⊗ · · · ⊗ μn also satisfies it
(on Rn ) with the same constant.
By the so-called Herbst argument, we can apply the logarithmic Sobolev
inequality (3.82) to study concentration of measure. If μ satisfies (3.82), then for
any 1-Lipschitz function F : Rn → R and any t ∈ R,

F dμ+Ct2 /2
etF dμ ≤ et .

In particular, by a simple use of Markov’s exponential inequality (for both F and


−F ), for any t > 0,
  
 
μ F − F dμ ≥ t
2
≤ 2e−t /2C ,


so that the dimension-free concentration property (3.81) holds. We refer to [141] for
more details. Related Poincare inequalities may be also considered similarly in this
context.

3.11 Concentration for Projection of Random Vectors

The goal here is to apply the concentration of measure. For a random vector x ∈ Rd ,
we study its projections to subspaces. The central problem here is to show that
for most subspaces, the resulting distributions are about the same, approximately
Gaussian, and to determine how large the dimension k of the subspace may be,
relative to d, for this phenomenon to persist. 
The Euclidean length of a vector x ∈ Rd is defined by x = x21 + . . . , x2d .
The Stiefel manifold4 Zd,k ∈ Rk×d is defined by
 
Zd,k = Z = (z1 , . . . , zk ) : zi ∈ Rd , zi , zj = δij ∀1 ≤ i, j ≤ k ,

with metric ρ (Z, Z ) between a pair of two matrices Z and Z —two points in the
manifold Zd,k —defined by

k
1/2

zi
2
ρ (Z, Z ) = z− .
i=1

The manifold Zd,k preserves a rotation-invariant (Haar) probability measure.

4A manifold of dimension n is a topological space that near each point resembles n-dimensional
Euclidean space. More precisely, each point of an n-dimensional manifold has a neighborhood that
is homeomorphic to the Euclidean space of dimension n.
3.11 Concentration for Projection of Random Vectors 195

One version of the concentration of measure [165] is included here. We need


some notation first. A modulus of continuity [166] is a function ω : [0, ∞] → [0, ∞]
used to measure quantitatively the uniform continuity of functions. So, a function
f : I → R admits ω as a modulus of continuity if and only if

|f (x) − f (y)| ≤ ω (|x − y|)

for all x and y in the domain of f . Since moduli of continuity are required to be
infinitesimal at 0, a function turns out to be uniformly continuous if and only if
it admits a modulus of continuity. Moreover, relevance to the notion is given by
the fact that sets of functions sharing the same modulus of continuity are exactly
equicontinuous families. For instance, the modulus ω(t) = Lt describes the L-
Lipschitz functions, the moduli ω(t) = Ltα describe the Hölder continuity, the
modulus ω(t) = Lt(|log(t)| + 1) describes the almost Lipschitz class, and so on.
Theorem 3.11.1 (Milman and Schechtman [167]). For any F : Zn,k → R with
the median MF and modulus of continuity ωF (t) , t > 0
"
π −nt2 /8
P (|F (z1 , . . . , zk ) − MF (z1 , . . . , zk )| > ωF (t)) < e , (3.83)
2
where P is the rotation-invariant probability measure on the span of z1 , . . . , zk .
Let x be a random vector in Rd and let Z ∈ Zd,k . Let

xz = ( x, z1 , . . . , x, zk ) ∈ Rk ;

that is, xz is the projection of the vector x onto the span of Z. xz is a projection
from dimension d to dimension k.
The bounded-Lipschitz distance between two random vectors x and y is
defined by

dBL (x, y) = sup Ef (x) − Ef (y) ,


f 1 ≤1

where
f 1 = max { f ∞, f L}
f (x)−f (y)
with the Lipschitz constant of f defined by f L = sup x−y .
x=y

 vector in R , with Ex = 0,
d
Theorem
  x be a random
 3.11.2 (Meckes [165]). Let
2  2 
E x = σ d, and let α = E  x /σ − d . If Z is a random point of
2 2

the manifold Zd,k , xz is defined as above, and w is a standard Gaussian random


vector, then

σ k (α + 1) + σk
dBL (xz , σw) ≤ .
d+1
196 3 Concentration of Measure

2
Theorem 3.11.3 (Meckes [165]). Suppose that β is defined by β = sup E x, y .
y∈Sd−1
For zi ∈ Rd , i = 1, . . . , k and Z = (z1 , . . . , zk ) ∈ Zd,k , let

dBL (xz , σw) = sup Ef ( x, z1 , . . . , x, zk ) − Ef (σw1 , . . . , σwk ) ;


f 1 ≤1

That is, dBL (xz , σw) is the conditional bounded-Lipschitz distance from the
random point xz (a random vector in Rk ) to the standard
2 Gaussian random vector
β
σw, conditioned on the matrix Z. Then, for t > 2π d and Z a random point of
the manifold (a random matrix in R k×d
) Zd,k ,
"
π −dt2 /32β
P (|dBL (xz , σw) − EdBL (xz , σw)| > t) ≤ e .
2

Theorem 3.11.4 (Meckes [168]). With the notation as in the previous theorems,
we have
⎡ √ ⎤
(kβ + β log d) β 2/(9k+12) σ k (α + 1) + k
EdBL (xz , σw) ≤ C ⎣ + ⎦.
k 2/3 β 2/3 d2/(3k+4) d−1

In particular, under the additional assumptions that α ≤ C0 d and β = 1, then
k + log (d)
EdBL (xz , σw) ≤ C .
k 2/3 d2/(3k+4)

The assumption that β = 1 is automatically satisfies,


 if
 the covariance matrix of
the random vector x ∈ Rd is the identity, i.e., E xxT = Id×d ; in the language
√ the case that the random vector x is isotropic. The
of geometry, this is simply
assumption that α = O( d) is a geometrically natural one that arise, for example,
if x is distributed uniformly on the isotropic dilate of the 1 ball in Rd . The key
observation in the proof is to view the distance as the supremum of a stochastic
process.
Proof of Theorem 3.11.2. This proof follows [165], changing to our notation.
Define the function F : Zd,k → R by

F (Z) = sup Ex f (xz ) − Ef (σw) ,


f 1 ≤1

where Ex denotes the expectation with respect to the distribution of the random
vector x only; that is,

Ex f (xz ) = E [f (xz ) |Z ] .

The goal here is to apply the concentration of measure. We use the standard
method here. We need to find the Lipschitz constant first. For a pair of random
3.11 Concentration for Projection of Random Vectors 197

vectors x and x , which will be projected to the same span of Z, we observe that for
f with f 1 ≤ 1 given,

Ex f (xz ) − Ef (σw) − Ex f (xz ) − Ef (σw)


≤ Ex f (xz ) − Ex f (xz )
= E [ f ( x, z1 , . . . , x, zk ) − f ( x, z1 , . . . , x, zk ) |Z, Z ]
≤ E [ f ( x, z1 − z1 , . . . , x, zk − zk ) |Z, Z ]
:
; k @ A2
; z − z i
≤< zi − zi E x, i
2

i=1
zi − zi

≤ ρ (Z, Z ) β.

It follows that

dBL (xz , σw) − dBL (xz , σw) (3.84)


 
 
 
=  sup Ex f (xz ) − Ef (σw) − sup Ex f (xz ) − Ef (σw)  (3.85)
f 1 ≤1 f 1 ≤1 

≤ sup | Ex f (xz ) − Ef (σw) − Ex f (xz ) − Ef (σw) | (3.86)


f 1 ≤1

≤ ρ (Z, Z ) β. (3.87)
(3.88)

Thus, dBL (xz , σw) is a Lipschitz function with Lipschitz constant β.
Applying the concentration of measure inequality (3.83), then we have that
"
π −t2 d/8β
P (|F (z1 , . . . , zk ) − MF (z1 , . . . , zk )| > t) ≤ e .
2

Now, if Z = (z1 , . . . , zk ) is a Haar-distributed random point of the manifold Zd,k ,


then

|EF (Z) −MF (Z)| ≤ E |F (Z) −MF (Z)| = P (|EF (Z) −MF (Z)| > t)dt
0

" "
π −t2 d/8β β
≤ e dt = π .
0 2 d
198 3 Concentration of Measure

2
β
So as long as t > 2π d, replacing the median of F with its mean only changes the
constants:

P (|F (Z) − EF (Z)| > t) ≤ P (|F (Z) − MF (Z)| > t − |MF (Z) − EF (Z)|)
 2
≤ P (|F (Z) − MF (Z)| > t/2) ≤ π2 e−dt /32β .

3.12 Further Comments

The standard reference for concentration of measure is [141] and [27]. We only
provide necessary results that are needed for the later chapters. Although some
results are recent, no attempt has been made to survey the latest results in the
literature.
We follow closely [169,170] and [62,171], which versions are highly accessible.
The entropy method is introduced by Ledoux [172] and further refined by Mas-
sart [173] and Rio [174]. Many applications are considered [62, 171, 175–178].
Chapter 4
Concentration of Eigenvalues
and Their Functionals

Chapters 4 and 5 are the core of this book. Talagrand’s concentration inequality is
a very powerful tool in probability theory. Lipschitz functions are the mathematics
objects. Eigenvalues and their functionals may be shown to be Lipschitz functions
so the Talagrand’s framework is sufficient. Concentration inequalities for many
complicated random variables are also surveyed here from the latest publications.
As a whole, we bring together concentration results that are motivated for future
engineering applications.

4.1 Supremum Representation of Eigenvalues and Norms

Eigenvalues and norms are the butter and bread when we deal with random matrices.
The supremum of a stochastic process [27, 82] has become a basic tool. The aim
of this section to make connections between the two topics: we can represent
eigenvalues and norms in terms of the supremum of a stochastic process.
The standard reference for our matrices analysis is Bhatia [23]. The inner
product of two finite-dimensional vectors in a Hilbert space H is denoted by u, v .
1/2
The form of a vector is denoted by u = u, u . A matrix is self-adjoint or
Hermitian if A = A, skew-Hermitian if A = −A, unitary if A∗ A = I =
∗ ∗

AA∗ , and normal if A∗ A = AA∗ .


Every complex matrix can be decomposed into
A = Re A + i Im A
A+A∗ ∗
where Re A = 2 and Im A = A−A 2 . This is called the Cartesian
Decomposition of A into its “real” and “imaginary” parts. The matrices Re A and
Im A are both Hermitian.
The norm of a matrix A is defined as

A = sup Ax .
x=1

R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 199
DOI 10.1007/978-1-4614-4544-9 4,
© Springer Science+Business Media New York 2014
200 4 Concentration of Eigenvalues and Their Functionals

We also have the inner product version

A = sup | y, Ax | .
x=y=1

When A is Hermitian, we have

A = sup | x, Ax | .
x=1

For every matrix A, we have

A = σ1 (A) = A∗ A
1/2
.

When A is normal, we have

A = max {|λi (A)|} .

Another useful norm is the norm

n
1/2
= (TrA∗ A)
1/2
A F = σi2 (A) .
i=1

If aij are entries of a matrix A, then


⎛ ⎞1/2
n
=⎝ |aij | ⎠
2
A F .
i,j=1

This makes this norm useful in calculations with matrices. This is called Frobenius
norm or Schatten 2-norm or the Hilbert-Schmidt norm, or Euclidean norm.
Both A and A F have an important invariance property called unitary
invariant: we have UAV = A and UAV F = A F for all unitary U, V.
Any two norms on a finite-dimensional space are equivalent. For the norms A
and A F , it follows from the properties above that

A  A F  n A (4.1)

for every A. Equation (4.1) is the central result we want to revisit here.
Exercise 4.1.1 (Neumann series). If A < 1, then I − A is invertible and
−1
(I − A) = I + A + A2 + · · · + A k + · · ·

is a convergent power series. This is called the Neumann series.


4.1 Supremum Representation of Eigenvalues and Norms 201

Exercise 4.1.2 (Matrix Exponential). For any matrix A the series

1 2 1
exp A = eA = I + A + A + · · · + Ak + · · ·
2! k!
converges. The matrix exp A is always convertible
 −1
eA = e−A .

Conversely, every invertible matrix can be expressed as the exponential of some


matrix. Every unitary matrix can be expressed as the exponential of a skew-
Hermitian matrix.
The number w(A) defined as

w (A) = sup | x, Ax |
x=1

is called the numerical radius of A. The spectral radius of a matrix A is defined as

spr (A) = max {|λ (A)|} .

We note that spr (A)  w (A)  A . They three are equal if (but not only if) the
matrix is normal.
Let A be Hermitian with eigenvalues λ1  λ2  · · ·  λn . We have

λ1 = max { x, Ax : x = 1} ,
λn = min { x, Ax : x = 1} . (4.2)

The inner product of two finite-dimensional vectors in a Hilbert space H is denoted


1/2
by u, v . The form of a vector is denoted by u = u, u .
For every k = 1, 2, . . . , n

k k
λi (A) = max xi , Axi ,
i=1 i=1
k k
λi (A) = min xi , Axi ,
i=n−k+1 i=1

where the maximum and the minimum are taken over all choices of orthogonal k-
tuples (x1 , . . . , xk ) in H. The first statement is referred to as Ky Fan Maximum
Principle.
202 4 Concentration of Eigenvalues and Their Functionals

If A is positive, then for every k = 1, 2, . . . , n,


n 
n
λi (A) = min xi , Axi ,
i=n−k+1 i=n−k+1

where the minimum is taken over all choices of orthogonal k-tuples (x1 , . . . , xk )
in H.

4.2 Lipschitz Mapping of Eigenvalues

The following lemma is from [152] but we follow the exposition of [69]. Let G
denote the Gaussian distribution on Rn with density

2
dG (x) 1 x
= n exp − ,
dx (2πσ 2 ) 2σ 2

2
and x = x21 + · · · + x2n is the Euclidean norm of x. Furthermore, for a K-
Lipschitz function F : Rn → R, we have

|F (x) − F (y)|  K x − y , x, y ∈ Rn ,

for some positive Lipschitz constant K. Then for any positive number t, we have
that
 
ct2
G ({x ∈ Rn : |F (x) − F (y)| > t})  2 exp − 2 2
K σ
)
where E (F (x)) = Rn F (x) dG (x), and c = π22 .
The case of σ = 1 is proven in [152]. The general case follows by using the
following mapping. Under the mapping x → σx : Rn → Rn , the composed
function x → F (σx) satisfies a Lipschitz condition with constant Kσ.
Now let us consider the Hilbert-Schmidt norm (also called Frobenius form and
Euclidean norm) · F under the Lipschitz functional mapping. Let f : R → R be a
function that satisfies the Lipschitz condition

|f (s) − f (t)|  K |s − t| , s, t ∈ R.

Then for any n in N, and all complex Hermitian matrices A, B ∈ Cn×n . We have
that

f (A) − f (B) F ≤K A−B F,


4.3 Smoothness and Convexity of the Eigenvalues of a Matrix and Traces of Matrices 203

where
  1/2
= (Tr (C∗ C))
1/2
C F = Tr C2 ,

for all complex Hermitian matrix C.


A short proof follows from [179] but we follow closely [69] for the exposition
here. We start with the spectral decomposition
n n
A= λ i Ei , B= μi Fi ,
i=1 i=1

where λi and μi are eigenvalues of A, and B, respectively, and where Ei and Fi


are two families of mutually orthogonal one-dimensional projections (adding up to
In ). Using Tr (Ei Fj )  0 for all i, j, we obtain that
   
2 2 2
f (A) − f (B) F = Tr f (A) + Tr f (B) − 2Tr (f (A) f (B))

n
2
= (f (λi ) − f (μi )) · Tr (Ei Fj )
i,j=1

n
2
 K2 · (λi − μi ) · Tr (Ei Fj )
i,j=1
2
= K2 A − B F .

4.3 Smoothness and Convexity of the Eigenvalues of a Matrix


and Traces of Matrices

The following lemma (Lemma 4.3.1) is at the heart of the results. First we recall
that
n
Tr (f (A)) = f (λi (A)), (4.3)
i=1

where λi (A) are the eigenvalues of A. Consider a Hermitian n × n matrix A. Let


f be a real valued function on R. We can study the function of the matrix, f (A).
If A = UDU∗ , for a diagonal real matrix D = diag (λ1 , . . . , λn ) and a unitary
matrix U, then

f (A) = Uf (D) U∗

where f (D) is the diagonal matrix with entries f (λ1 ) , . . . , f (λn ) and U∗ denotes
the conjugate, transpose of U.
204 4 Concentration of Eigenvalues and Their Functionals

Lemma 4.3.1 (Guionnet and Zeitoumi [180]).


1. If f is a real-valued convex function on R, it holds that Tr (f (A)) =
n
f (λi (A)) is convex.
i=1
2
2. If f is a Lipschitz function on R, A →
√ Tr (f (A)) is a Lipschitz function on R
n

with Lipschitz constant bounded by n|f |L .


Theorem 4.3.2 (Lidskii [18]). Let A, B be Hermitian matrices. Then, there is a
doubly stochastic matrix E such that
n
λi (A + B) − λi (A)  Ei,m λi (B)
m=1

In particular,
n n
2 2 2
|λi (A) − λi (B)|  A − B F  |λi (A) − λn−i+1 (B)| . (4.4)
i=1 i=1

For all integer k, the functional


2
(Aij )1i,jn ∈ Rn → λk (A)

is Lipschitz with constant one, following (4.4). With the aid of Lidskii’s theo-
rem [18, p. 657] (Theorem 4.3.2 here), we have
 
 n n  n
 
 f (λi (A)) − f (λi (B)) |f |L |λi (A) − λi (B)|
 
i=1 i=1 i=1

n
1/2
√ 2
 n|f |L |λi (A) − λi (B)|
i=1

 n|f |L A − B F. (4.5)

We use the definition of a Lipschitz function in the first inequality. The second step
follows from Cauchy-Schwartz’s inequality [181, p. 31]: for arbitrary real numbers
a i , bi ∈ R
2 2
|a1 b1 + a2 b2 + · · · + an bn |  a21 + a22 + · · · + a2n b21 + b22 + · · · + b2n . (4.6)

In particular, we have used bi = 1. The second inequality is a direct consequence of


Lidskii’s theorem, Theorem 4.3.2. In other words, we have shown that the function
n
2
(Aij )1i,jn ∈ Rn → f (λk (A))
k=1
4.3 Smoothness and Convexity of the Eigenvalues of a Matrix and Traces of Matrices 205


is Lipschitz with a constant bounded above by n|f |L . Observe that using
n
f (λk (A)) rather than λk (A) increases the Lipschitz constant from 1 to

k=1
n|f |L . This observation is useful later in Sect. 4.5 when the tail bound is
considered.
Lemma 4.3.3 ([182]). For a given n1 × n2 matrix X, let σi (X) the i-th largest
singular value. Let f (X) be a function on matrices in the following form: f (X) =
m
m
ai σi (X) for some real constants {ai }i=1 . Then f (X) is a Lipschitz function
i=1 ?
m
with a constant of a2i .
i=1

k
Let us consider the special case f (t) = tk . The power series (A + εB) is
expanded as
    
k
Tr (A + εB) = Tr Ak + εk Tr Ak−1 B + O ε2 .

Or for small ε we have


    
k
Tr (A + εB) − Tr Ak = εk Tr Ak−1 B + O ε2 → 0, as ε → 0. (4.7)

Recall that the trace function is linear.


1  k
      
lim Tr (A + εB) − Tr Ak = k Tr Ak−1 B = Tr Ak B
ε→0 ε

where (·) denotes the derivative. More generally, if f is continuously differentiable,


we have
- n n
.
1
lim f (λi (A + εB)) − f (λi (A)) = Tr (f  (A) B) .
ε→0 ε
i=1 i=1


n
Recall that f (λi (A)) is equal to Tr (f (A)) . λ1 (X) is convex and λn (X) is
i=1
concave.
Let us consider the function of the sum of the first largest eigenvalues: for k =
1, 2, . . . , n

k
Fk (A) = λi (A)
i=1
k
Gk (A) = λn−i+1 (A) = Tr (A) − Fk (A) . (4.8)
i=1
206 4 Concentration of Eigenvalues and Their Functionals

The trace function of (4.3) is the special case of these two function with k = n.
Now we follow [183] to derive √the Lipschitz constant of functions defined
in (4.8), which turns out to be C|f |L k, where C < 1. Recall the Euclidean norm
or the Frobenius norm is defined as
⎛ ⎞1/2 ⎛ ⎞1/2
 
n
√ n
 Xii 2
X =⎝ |Xij | ⎠
2
= 2⎝ √  + |Xij | ⎠ .
2
F  2
i,j=1 i=1 1ijn

Recall also that

n
1/2
X F = λ2i (X) ,
i=1

which implies from (4.1) that λi (X) is a 1-Lipschitz function of X with respect to
X F.
Fk is positively homogeneous (of degree 1) and Fk (−A) = −Gk (A) . From
this we have that

|Fk (A) − Fk (B)|  max {Fk (A − B) , −Gk (A − B)}  k A − B F ,

|Gk (A) − Gk (B)|  max {Gk (A − B) , −Fk (A − B)}  k A − B F .
(4.9)
In other words, the functions√Fk (A) , Gk (A) : Rn → R are Lipschitz continuous
with the Lipschitz constant k. For a trace function, we have k = n. Moreover,
Fk (A) is convex and Gk (A) is concave. This follows from Ky Fan’s maximum
principle in (4.2) or Davis’ characterization [184] of all convex unitarily invariant
functions of a self-adjoint matrix.
Let us give our version of the proof of (4.9). There are no details about this proof
in [183]. When A ≥ B, implying that λi (A)  λi (B) , we have
   
 k k   k 
   
|Fk (A) −Fk (B)| =  λi (A) − λi (B) =  (λi (A) −λi (B))
   
i=1 i=1 i=1
k k k
 |λi (A) −λi (B)|= (λi (A) −λi (B))  λi (A−B)=Fk (A−B) .
i=1 i=1 i=1

In the second line, we have used the Ky Fan inequality (1.28)

λ1 (C + D) + · · · + λk (C + D)  λ1 (C) + · · · + λk (C) + λ1 (D) + · · · + λk (D)


(4.10)

where C and D are Hermitian, by identifying C = B, D = A − B, C + D = A..


4.3 Smoothness and Convexity of the Eigenvalues of a Matrix and Traces of Matrices 207

If A < B, implying that λi (B)  λi (A) , we have

k k k
|Fk (A) − Fk (B)| = λi (A) − λi (B) = (λi (A) − λi (B))
i=1 i=1 i=1

k k k
 |λi (A) −λi (B)|= (λi (B) −λi (A))  λi (B−A)=Fk (B−A) = −Gk (A−B) .
i=1 i=1 i=1

Let us define the following function: for k = 1, 2, . . . , n

k
ϕk (A) = f (λi (A)) (4.11)
i=1

where f : R → R is the Lipschitz function with constant |f |L . We can


k
compare (4.8) and (4.11). We can show [185] that ϕk (A) = f (λi (A)) is a
√ i=1
Lipschitz function with a constant bounded by k|f |L . It follows from [185] that

 
 k  k
 
 f (λi (A)) − f (λi (B))
 
i=1 i=1
 k 
 
 
= [f (λi (A)) − f (λi (B))]
 
i=1
k
 |f (λi (A)) − f (λi (B))|
i=1
k
 |f |L |λi (A) − λi (B)|
i=1
1/2
√ k
2
 |f |L k |λi (A) − λi (B)|
i=1
1/2
√ n
2
 C|f |L k |λi (A) − λi (B)|
i=1

 C|f |L k A − B F,
208 4 Concentration of Eigenvalues and Their Functionals

where
 1/2

k
2
|λi (A) − λi (B)|
C = i=1 1/2  1.
n
2
|λi (A) − λi (B)|
i=1

The third line follows from the triangle inequality for complex numbers [181, 30]:
for n complex numbers
 
 n  n
 
 zi  = |z1 + z2 + · · · + zn |  |z1 | + |z2 | + · · · + |zn | = |zi | . (4.12)
 
i=1 i=1

In particular, we set zi = λi (A) − λi (B) . The fourth line follows from the
definition for a Lipschitz function: for f : R → R, |f (s) − f (t)|  |f |L |s − t| .
The fifth line follows from Cauchy-Schwartz’s inequality (4.6) by identifying ai =
|λi (A) − λi (B)| , bi = 1. The final line follows from Liskii’s theorem (4.4).
Example 4.3.4 (Standard hypothesis testing problem revisited: Moments as a func-
tion of SNR). Our standard hypothesis testing problem is expressed as

H0 : y = n
√ (4.13)
H1 : y = SN Rx + n

where x is the signal vector in Cn and n the noise vector in Cn . We assume that x
is independent of n. SNR is the dimensionless real number representing the signal
to noise ratio. It is assumed that N independent realizations of these vector valued
random variables are observed.
Representing
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
y1T xT1 nT1
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
Y = ⎣ ... ⎦ , X = ⎣ ... ⎦ , N = ⎣ ... ⎦ ,
T
yN xTN nTN

we can rewrite (4.13) as

H0 : Y = N

H1 : Y = SN R · X + N (4.14)

We form the sample covariance matrices as follows:

N N N
1 1 1
Sy = yi yi∗ , Sx = xi x∗i , Sn = ni n∗i . (4.15)
N i=1
N i=1
N i=1
4.3 Smoothness and Convexity of the Eigenvalues of a Matrix and Traces of Matrices 209

we rewrite (4.15) as

H0 : N Sn = NN∗
√  √ ∗
H1 : N Sy = YY∗ = SN R · X + N SN R · X + N

= SN R · XX∗ + NN∗ + SN R · (XN∗ + NX∗ ) (4.16)

H0 can be viewed as a function of H1 as we take the limit SN R → 0. This function


is apparently continuous as a function of SN R. This way we can consider one
function of H1 only and take the limit for another hypothesis. In the ideal case of
N → ∞, the true covariance version of the problem is as follows

H0 : Ry = Rn
H1 : Ry = SN R · Rx + Rn .

Let NN∗ and N N∗ are two independent copies of underlying random matrices.
When the sample number N goes large, we expect that the sample covariance
matrices approach the true covariance matrix, as close as we desire. In other words,
1 ∗ 1  ∗
N NN and N N N are very close to each other. With this intuition, we may
consider the following matrix function

1 1√ 1
S(SN R, X) = SN R· XX∗ + SN R·(XN∗ + NX∗ )+ (NN∗ − N N∗ )
N N N
Taking the trace of both sides, we reach a more convenient form

f (SN R, X) = Tr (SS∗ ) .

f (SN R, X) is apparently a continuous function of SN R. f (SN R, X) represents


hypothesis H0 as we take the limit SN R → 0. Note that the trace rmT r is a linear
function. More generally we may consider the k-th moment
 
g (SN R, X) = Tr (SS∗ ) .
k

SS∗ can be written as a form

SS∗ = A + ε (SN R) B,

where ε (SN R) → 0, SN R → 0. It follows from (4.7) that


    
k
Tr (A + εB) − Tr Ak = εk Tr Ak−1 B + O ε2 → 0, as SN R → 0.


210 4 Concentration of Eigenvalues and Their Functionals

Exercise 4.3.5. Show that SS∗ has the following form SS∗ = A + ε (SN R) B.
Example 4.3.6 (Moments of Random Matrices [23]). The special case of f (t) = tk
for integer k is Lipschitz continuous. Using the fact that [48, p. 72]

f (x) = log (ex1 + ex2 + · · · + exn )


k
is a convex function of x, and setting λki = elog λi = xi , we have

 n
f (x) = log λk1 + · · · + λkn = log λki
i=1

is also a convex function.


Jensen’s inequality says that for a convex function φ

φ ai x i  ai φ (xi ), ai = 1, ai  0.
i i i

It follows from Jensen’s inequality that

n
 n n n
f (x) = log λki  log λki =k log λi  k (λi − 1) .
i=1 i=1 i=1 i=1

In the last step, we use the inequality that log x  x − 1, x > 0.


Since λi (A) is 1-Lipschitz, f (x) is also Lipschitz.
Another direct approach is to use Lemma 4.3.1: If f : R → R is a Lipschitz
function, then
n
1
F =√ f (λi )
n i=1

is a Lipschitz function of (real and imaginary) entries of A. If f is convex on the


real line, then F is convex on the space of matrices (Klein’s lemma). Clearly f (t) =
ta , a ≥ 1 is a convex function from R to R. f (t) = ta is also Lipschitz.
Since a1 log f (t) = log (t)  t − 1, t > 0 we have

1
log |f (x) − f (y)|  |x − y|
a

where f (t) = ta . 
4.4 Approximation of Matrix Functions Using Matrix Taylor Series 211

4.4 Approximation of Matrix Functions Using Matrix


Taylor Series

When f (t) is a polynomial or rational function with scalar coefficients ai and a


scalar argument t, it is natural to define [20] f (A) by substituting A for t, replacing
division by matrix inverse and replacing 1 by the identity matrix. Then, for example

1 + t2 −1 
f (t) = ⇒ f (A) = (I − A) I + A2 if 1 ∈
/ Ω (A) .
1−t

Here Ω (A) is the set of eigenvalues of A (also called the spectrum of A). Note
that rational functions of a matrix commute, so it does not matter whether we write
−1 −1
(I − A) I + A2 or I + A2 (I − A) . If f has a convergent power series
representation, such as

t2 t3 t4
log (1 + t) = t − + − + · · · , |t| < 1,
2 3 4
we can again simply substitute A for t to define

A2 3
A A4
log (1 + A) = A − + − + · · · , ρ (A) < 1.
2 3 4

Here, ρ is the spectral radius of the condition ρ (A) < 1 ensures convergence of the
matrix series. We can consider the polynomial

f (t) = Pn (t) = a0 + a1 t + · · · + an tn .

For we have

f (A) = Pn (A) = a0 + a1 A + · · · + an An .

A basic tool for approximating matrix functions is the Taylor series. We state
a theorem from [20, Theorem 4.7] that guarantees the validity of a matrix Taylor
series if the eigenvalues of the “increment” lie within the radius of convergence of
the associated scalar Taylor series.
Theorem 4.4.1 (convergence of matrix Taylor series). Suppose f has a Taylor
series expansion
∞  
k f (k) (α)
f (z) = ak (z − α) ak = (4.17)
k!
k=1

with radius of convergence r. If A ∈ Cn×n , then f (A) is defined and is given by


212 4 Concentration of Eigenvalues and Their Functionals


k
f (A) = ak (A − αI)
k=1

if and only if each of the distinct eigenvalues λ1 , . . . , λs of A satisfies one of the


conditions
1. |λi − α|  r,
2. |λi − α| = r, and the series for f (ni −1) (λ) (where ni is the index of λi ) is
convergent at the points λ = λi , i = 1, 2, . . . , s.
The four most important matrix Taylor series are

A2 A3
exp (A) = I + A + + + ··· ,
2! 3!
A2 A4 A6
cos (A) = I − + − + ··· ,
2! 4! 6!
A3 A5 A7
sin (A) = I − + − + ··· ,
3! 5! 7!
A2 A3 A4
log (I + A) = A − + − + ··· , ρ (A) < 1,
2! 3! 4!
the first three series having infinite radius of convergence. These series can be used
to approximate the respective functions, by summing a suitable finite number of
terms. Two types of errors arise: truncated errors, and rounding errors in the floating
point evaluation. Truncated errors are bounded in the following result from [20,
Theorem 4.8].
Theorem 4.4.2 (Taylor series truncation error bound). Suppose f has the Tay-
lor series expansion (4.17) with radius of convergence r. If A ∈ Cn×n with
ρ (A − αI) < r, then for any matrix norm
/ /
/ K / / /
/ k/ 1 / K /
/f (A) − ak (A − αI) /  max /(A − αI) f (K) (αI + t (A − αI))/ .
/ / K! 0t1
k=1
(4.18)
/
/ K
In order to apply this theorem, we need to bound the term max /(A − αI) f (K)
0t1
(αI + t (A − αI)) . For certain function f this is easy. We illustrate using the
cosine function, with α = 0, and K = 2k + 2, and

2k i
(−1) 2i
T2k (A) = A ,
i=0
(2i)!

the bound of Theorem 4.4.2 is, for the ∞-norm,


4.4 Approximation of Matrix Functions Using Matrix Taylor Series 213

/ /
cos (A) − T2k (A) ∞ = (2k+2)!
1
max /A2k+2 cos(2k+2 (tA)/∞
/ 2k+2 / /
0≤t≤1 /
1
= (2k+2)! /A / max /cos(2k+2 (tA)/ .
∞ 0≤t≤1 ∞

Now
/ /
max /cos(2k+2 (tA)/∞ = max cos (tA) ∞
0≤t≤1 0≤t≤1
A∞ A4∞
≤1+ 2! + 4! + · · · = cosh ( A ∞) ,

and thus the error is the truncated Taylor series approximation to the matrix cosine
has the bound
/ 2k+2 /
/A /

cos (A) − T2k (A) ∞ ≤ cosh ( A ∞ ) .
(2k + 2)!

Consider the characteristic function

det (A − tI) = tn + c1 tn−1 + · · · + cn .

By the Cayley-Hamilton theorem,

n−1

−1 1
A =− A n−1
+ ci A n−i−1
.
cn i=1

The ci can be obtained by computing the moments



mk = Tr Ak , k = 1, . . . , n,

and then solving the Newton identities [20, p. 90]


⎤ ⎡
⎡ ⎤
⎡ ⎤
c1 m1
1 ⎢ ⎥ ⎢ ⎥
⎢ m ⎥ ⎢ c 2 ⎥ ⎢ m2 ⎥
⎢ 1 2 ⎥⎢ . ⎥ ⎢ . ⎥
⎢ ⎥⎢ . ⎥ ⎢ . ⎥
⎢ m2 m1 3 ⎥⎢ . ⎥ ⎢ . ⎥
⎢ ⎥⎢ . ⎥ = ⎢ . ⎥.
⎢ m3 m2 m1 4 ⎥⎢ . ⎥ ⎢ . ⎥
⎢ ⎥⎢ . ⎥ ⎢ . ⎥
⎢ .. .. .. .. .. ⎥⎢ ⎥ ⎢ ⎥
⎣ . . . . . ⎦ ⎢ .. ⎥ ⎢ .. ⎥
⎣ . ⎦ ⎣ . ⎦
mn−1 · · · m3 m2 m1 n
cn mn

Let us consider a power series of random matrices Xk , k = 1, . . . , K. We define


a matrix function F (X) as
214 4 Concentration of Eigenvalues and Their Functionals

K
F (X) = a0 I + a1 X + · · · + aK XK = a k Xk ,
k=0

where ak are scalar-valued coefficients. Often we are interested in the trace of this
matrix function

 K

Tr (F (X)) = a0 Tr (I) + a1 Tr (X) + · · · + aK Tr XK = ak Tr Xk .
k=0

Taking the expectation from both sides gives

K
 K
 
Tr (E (F (X))) = E (Tr (F (X))) = E ak Tr Xk = ak E Tr XK .
k=0 k=0

Let us consider the fluctuation of this trace function around its expectation

K
 K
 
Tr (F (X)) − Tr (EF (X)) = ak Tr Xk − ak E Tr XK
k=0 k=0
K
    
= ak Tr XK − E Tr XK .
k=0

If we are interested in the absolute of this fluctuation, or called the distance


Tr (F (X)) from Tr (EF (X)) , it follows that

K
    
|Tr (F (X)) − Tr (EF (X))| ak Tr XK − E Tr XK ,
k=0
K
      
 ak Tr XK  + E Tr XK  , (4.19)
k=0

 
since, for two complex scalars a, b, a − b  |a − b|  |a| + |b| . E Tr XK can
be calculated. In fact, |Tr (A)| is a seminorm (but not a norm [23, p. 101]) of A.
Another relevant norm is weakly unitarily invariant norm

τ (A) = τ (U∗ AU) , for all A, U ∈ Cn×n .


 
The closed formof expected moments E Tr XK is extensively studied and
obtained [69]. Tr XK can be obtained numerically.
More generally, we can study F (X) − EF (X) rather than its trace function; we
have the matrix-valued fluctuation
4.4 Approximation of Matrix Functions Using Matrix Taylor Series 215


K K K K

F (X) − E (F (X)) = a k Xk − E a k Xk = a k Xk − a k E Xk
k=0 k=0 k=0 k=0
K
 
= a k Xk − E X k . (4.20)
k=0

According to (4.20), the problem boils down to a sum of random matrices, which
 are
already treated in other chapters of this book. The random matrices Xk − E Xk
play a fundamental role in this problem.
Now we can define the distance as the unitarily invariant norm [23, p. 91] |·|
of the matrix fluctuation. We have

|UAV| = |A|

for all A of n × n and for unitary U, V.


For any complex matrix A, we use A + = (A∗ A)
1/2
to denote the positive
semidefinite matrix [16, p. 235]. The main result, due to R. Thompson, is that for
any square complex matrices A and B of the same size, there exist two unitary
matrices U and V such that

A+B +  U∗ A +U + V∗ B + V. (4.21)

Note that it is false to write A + B +  A + + B + . However, we can take


the trace of both sides and use the linearity of the trace to get

Tr A + B +  Tr A + + Tr B +, (4.22)

since TrU∗ U = I and TrV∗ V = I. Thus we have


/  / / / /  /
Tr/Xk + E −Xk /+  Tr/Xk /+ + Tr/E −Xk /+ . (4.23)

Inserting (4.23) into (4.20) yields


K  / / /  / 
Tr F (X) − E (F (X)) +  ak Tr/Xk /+ + Tr/E −Xk /+ . (4.24)
k=0

We can choose the coefficients ak to minimize the right-hand-side of (4.19)


or (4.24). It is interesting to compare (4.19) with (4.24). If Xk are positive semi-
definite, Xk  0, both are almost identical. If X ≥ 0, then Xk ≥ 0. Also note
that
n
k
TrA = λki (A),
i=1

where λi are the eigenvalues of A.


216 4 Concentration of Eigenvalues and Their Functionals

4.5 Talagrand Concentration Inequality

Let x = (X1 , . . . , Xn ) be a random vector consisting of n random variables. We


say a function f : Rn → R is Lipschitz with constant K or K-Lipschitz, if

|f (x) − f (y)|  K x − y

for all x, y ∈ Rn . Here we use the Euclidean norm · on Rn .


Theorem 4.5.1 (Talagrand concentration inequality [148]). Let κ > 0, and let
X1 , . . . , Xn be independent complex variables with |Xi |  κ for all 1 ≤ i ≤ n. Let
f : Cn → R be a 1-Lipschitz and convex function (where we identify Cn with R2n
for the purposes of defining “Lipschitz” and “convex”). Then, for any t, one has
2
P (|f (x) − Mf (x)|  κt)  Ce−ct (4.25)

and
2
P (|f (x) − Ef (x)|  κt)  Ce−ct (4.26)

for some absolute constants C, c > 0, where Mf (x) is the median of f (x) .
See [63] for a proof. Let us illustrate how to use this theorem, by considering the
operator norm of a random matrix.
The operator (or matrix) norm A op is the most important statistic of a random
matrix A. It is a basic statistic at our disposal. We define

A op = sup Ax
x∈Cn :x=1

where x is the Euclidean norm of vector x. The operator norm is the basic upper
bound for many other quantities.
The operator norm A op is also the largest singular value σmax (A) or σ1 (A)
assuming that all singular values are sorted in an non-increasing order. A op
dominates the other singular values; similarly, all eigenvalues λi (A) of A have
magnitude at most A op .
Suppose that the coefficients ξij of A are independent, have mean zero, and
uniformly
  in magnitude by 1 (κ = 1). We consider σ1 (A) as a function
bounded
2
f (ξij )1i,jn of the independent complex ξij , thus f is a function from Cn
to R. The convexity of the operator norm tells us that f is convex. The elementary
bound is

A  A F or σ1 (A)  A F or σmax (A)  A F (4.27)


4.5 Talagrand Concentration Inequality 217

where
⎛ ⎞1/2
n n
=⎝ |ξij | ⎠
2
A F
i=1 j=1

is the Frobenius norm, also known as the Hilbert-Schmidt norm or 2-Schatten


norm. Combining the triangle inequality with (4.27) tells us that f is Lipschitz
with constant 1 (K = 1). Now, we are ready to apply Talagrand’s inequality,
Theorem 4.5.1. We thus obtain the following: For any t > 0, one has
2
P (σ1 (A) − Mσ1 (A)  t)  Ce−ct

and
2
P (σ1 (A) − Eσ1 (A)  t)  Ce−ct (4.28)

for some constants C, c > 0.


If f : R → R is a Lipschitz function, then
n
1
F =√ f (λi )
n i=1

is also a Lipschitz function of (real and imaginary) entries of A. If f is convex on the


real line, then F is convex on the space of matrices (Klein’s lemma). As a result, we
can use the general concentration principle to functions of the eigenvalues λi (A).
By applying Talagrand’s inequality, Theorem 4.5.1, it follows that

n n

1 1 2
P f (λi (A)) − E f (λi (A))  t  Ce−ct (4.29)
n i=1
n i=1

Talagrand’s inequality, as formulated in Theorem 4.5.1, heavily relies on con-


vexity. As a result, we cannot apply it directly to non-convex matrix statistics, such
as singular values σi (A) other than the largest singular value σ1 (A). The partial
k
sum σi (A), the sum of the first k singular values, is convex. See [48] for more
i=1
convex functions based on singular values.
The eigenvalue stability inequality is

|λi (A + B) − λi (A)|  B op

The spectrum of A + B is close to that of A if B is small in operator norm. In


particular, we see that the map A → λi (A) is Lipschitz continuous on the space of
Hermitian matrices, for fixed 1 ≤ i ≤ n.
218 4 Concentration of Eigenvalues and Their Functionals

It is easy to observe that the operator norm of a matrix A = (ξij ) bounds the
magnitude of any of its coefficients, thus

sup |ξij |  σ1 (A)


1i,jn

or, equivalently
⎛ ⎞

P (σ1 (A)  t)  P ⎝ |ξij |  t⎠ .


1i,jn

We can view the upper tail event σ1 (A)  t as a union of√many simpler events
|ξij |  t. In the i.i.d. case ξij ≡ ξ, and setting t = α n for some fixed α
independent of n, we have
  √ n2
P (σ1 (A)  t)  P |ξ|  α n .

4.6 Concentration of the Spectral Measure for Wigner


Random Matrices

We follow [180, 186]. A Hermitian Wigner matrix is an n × n matrix H =


(hij )1ijn such that

1
hij = √ (xij + jyij ) for all 1  i  j  n
n
1
hii = √ xii for all 1  i  n
n

where {xij , yij , xii } are a collection of real independent, identically distributed
random variables with Exij = 0 and Ex2ij = 1/2.
The diagonal elements are often assumed to have a different distribution, with
Exij = 0 and Ex2ij = 1. The entries scale with the dimension n. The scaling is
chosen such that, in the limit n → ∞, all eigenvalues of H remain bounded. To see
this, we use
n n n
2 2
E λ2k = E Tr H2 = E |hij | = n2 E|hij |
k=1 i=1 j=1

where λk , k = 1, . . . , n are the eigenvalues of H. If all λk stay bounded and of order


one in the limit of large dimension n, we must have E Tr H2 " n and therefore
2
E|hij | " n1 . Note that a trace function is linear.
4.6 Concentration of the Spectral Measure for Wigner Random Matrices 219

n
In Sect. 4.3, we observe that using f (λk (A)) rather than λk (A) increases
√ k=1
the Lipschitz constant from 1 to n|f |L .
Theorem 4.6.1 (Guionnet and Zeitouni [180]). Suppose that the laws of the
entries {xij , yij , xii } satisfies the logarithmic Sobolev inequality with constant
c > 0. Then, for any Lipschitz function f : R → C, with Lipschitz constant |f |L
and t > 0, we have that
  
1 1  2 2
− n t2
P  Tr f (H) − E Tr f (H)  t  2e 4c|f |L . (4.30)
n n

Moreover, for any k = 1, . . . , n, we have

nt2

4c|f |2
P (|f (λk ) − Ef (λk )|  t)  2e L . (4.31)

In order to prove this theorem, we want to use of the observation of Herbst that
Lipschitz functions of random matrices satisfying the log-Sobolev inequality exhibit
Gaussian concentration.
Theorem 4.6.2 (Herbst). Suppose that P satisfies the log-Sobolev inequalities on
Rn with constant c. Let G : Rm → R be a Lipschitz function with constant |G|L .
Then, for every t > 0,
2
− 2c|G|
t
P (|g(x) − Eg(x)|  t)  2e L .

See [186] for a proof.


To prove the theorem, we need another lemma.
Lemma 4.6.3 (Hoffman-Wielandt). Let A, B be n × n Hermitian matrices, with
eigenvalues
n
2 2
|λi (A) − λi (B)|  Tr (A − B) .
i=1

We see [187] for a proof.


2
Corollary 4.6.4. Let X = ({xij , yij , xii }) ∈ Rn and let λk (X), k = 1, . . . , n be
the eigenvalues of the Wigner matrix X = H(X). Let g : Rn → R be a Lipschitz
2
function with constant |g|L . Then the mapX ∈ Rn → g (λ1 (X) , . . . , λn (X)) ∈
R is a Lipschitz function with coefficient 2/n|g|L . In particular, if f : R → R is
2 
n
a Lipschitz function with constant |f |L , the map X ∈ Rn → f (λk ) ∈ R is a
√ i=1
Lipschitz function with constant 2|f |L .
Proof. Let Λ = (λ1 , . . . , λn ) . Observe that
220 4 Concentration of Eigenvalues and Their Functionals

    
g (Λ (X)) −g Λ X |g|L Λ (X) −Λ X 2
#
$ n &
$
= |g|L % |λi (X) − λi (X )||g|L Tr (H (X) −H (X ))2
i=1
#
$
$ n n

= |g|L % |hij (X) −hij (X )|2  2/n|g|L X−X Rn
.
2
i=1 j=1

(4.32)

The first inequality of the second line follows from the lemma of Hoffman-Wielandt
above. X − X Rn2 is also the Frobenius norm.
n
Since g (Λ) = Tr f (H) = f (λk ) is such that
i=1

n   √
 
|g (Λ) − g (Λ )|  |f |L λi − λi   n|f |L X − X Rn
i=1

it follows that g is a Lipschitz function on Rn with constant n|f |L . Combined
with (4.32), we complete the proof of the corollary. 
Now we have all the ingredients to prove Theorem 4.6.1.
2
Proof of Theorem 4.6.1. Let X = ({xij , yij , xii }) ∈ Rn . Let√G (X) =
Tr f (H (X)) . Then the matrix function G is Lipschitz with constant 2|f |L . By
Theorem 4.6.2, it follows that
  
1 1  2 2
− n t2
 
P  Tr f (H) − E Tr f (H)  t  2e 4c|f |L .
n n
To show (4.31), we see that, using Corollary 4.6.4, the matrix function G(X) =
f (λk ) is Lipschitz with constant 2/n|f |L . By Theorem 4.6.2, we find that
nt2

4c|f |2
P (|f (λk ) − Ef (λk )|  t)  2e L ,

which is (4.31). 
Example 4.6.5 (Applications of Theorem 4.6.1). We consider a special case of
2
f (s) = s. Thus |f |L = 1. From (4.31) for the k-th eigenvalue λk of random
matrix H, we see at once that, for any k = 1, . . . , n,

nt2
P (|λk − Eλk |  t)  2e− 4c .

1
For the trace function n Tr (H), on the other hand, from (4.30), we have
  
1 1  n 2 t2
P  Tr (H) − E Tr (H)  t  2e− 4c .

n n
4.6 Concentration of the Spectral Measure for Wigner Random Matrices 221

The right-hand-side of the above two inequalities has the same Gaussian tail but with
different “variances”. The k-th eigenvalue has a variance of σ 2 = 2c/n, while the
normalized trace function has that of σ 2 = 2c/n2 . The variance of the normalized
trace function is 1/n times that of the k-th eigenvalue. For example, when n =
100, this factor is 0.01. In other words, for the normalized trace function—viewed
as a statistical average of n eigenvalues (random variables)—reduces the variance
by 20 dB, compared with the each individual eigenvalue. In [14], this phenomenon
has been used for signal detection of extremely low signal to noise ratio, such as
SN R = −34 dB.
The Wishart matrix is involved in [14] while here the Wigner matrix is used. In
Sect. 4.8, it is shown that for the normalized trace function of a Wishart matrix, we
n 2 t2
have the tail bound of 2e− 4c , similar to a Wigner matrix. On the other hand, for
the largest singular value (operator norm)—see Theorem 4.8.11, the tail bound is
2
Ce−cnt .
For hypothesis testing such as studied in [14], reducing the variance of the
statistic metrics (such as the k-th eigenvalue and the normalized trace function)
is critical to algorithms. From this perspective, we can easily understand why the
use of the normalized trace as statistic metric in hypothesis testing leads to much
better results than algorithms that use the k-th eigenvalue [188] (in particular, k = 1
and k = n for the largest and smallest eigenvalues, respectively). Another big
advantage is that the trace function is linear. It is insightful to view the trace function
n
as a statistic average n1 Tr (H) = n1 λk , where λk is a random variable. The
k=1
statistical average of n random variables, of course, reduces the variance, a classical
result in probability and statistics. Often one deals with sums of independent
random variables [189]. But here the n eigenvalues are not necessarily independent.
Techniques like Stein’s method [87] can be used to approximate this sum using
Gaussian distribution. Besides, the reduction of the variance by a factor of n by
using the trace function rather than each individual eigenvalue is not obvious to
observe, if we use the classical techniques since it is very difficult to deal with sums
of n dependent random variables.
2
For the general case we have the variance of σ 2 = 2c |f |L /n and σ 2 =
2
2c |f |L /n for the k-th eigenvalue and normalized trace, respectively. The general
2
2
Lipschitz function f with constant |f |L increases the variance by a factor of |f |L .
If we seek to find a statistic metric (viewed as a matrix function) that has a
minimum variance, we find here that the normalized trace function n1 Tr (H) is
optimum in the sense of minimizing the variance. This finding is in agreement with
empirical results in [14]. To a large degree, a large part of this book was motivated
to understand why the normalized trace function n1 Tr (H) always gave us the best
results in Monte Carlo simulations. It seemed that we could not find a better matrix
function for the statistic metrics. The justification for recommending the normalized
trace function n1 Tr (H) is satisfactory to the authors. 
222 4 Concentration of Eigenvalues and Their Functionals

4.7 Concentration of Noncommutative Polynomials


in Random Matrices

Following [190], we also refer to [191, 192]. The Schatten p-norm of a matrix A
  
p/2 1/p
is defined as A p = Tr AH A . The limiting case p = ∞ corresponds
to the operator (or spectral) norm, while p = 2 leads to the Hilbert-Schmidt (or
Frobenius) norm. We also denote by · p the Lp -norm of a real or complex random
variable, or p -norm of a vector in Rn or Cn .
A random vector x in a normed space V satisfies the (sub-Gaussian) convex
concentration property (CCP), or in the class CCP, if
2
P [|f (x) − Mf (x)|  t]  Ce−ct (4.33)

for every t > 0 and every convex 1-Lipschitz function f : V → R, where C, c > 0
are constants (parameters) independent of f and t, and M denotes a median of a
random variable.
Theorem 4.7.1 (Meckes [190]). Let X1 , . . . , Xm ∈ Cn×n be independent, cen-
tered random matrices that satisfy the convex concentration property (4.33) (with
respect to the Frobenius norm on Cn×n ) and let k ≥ 1 be an integer. Let P be a
noncommutative ∗-polynomial in m variables of degree at most k, normalized so
that its coefficients have modulus at most 1. Define the complex random variable
 
1 1
ZP = Tr P √ X1 , . . . , √ X m .
n n

Then, for t > 0,


 0 1
P [|ZP − EZP |  t]  Cm,k exp −cm,k min t2 , nt2/k .

The conclusion holds also for non-centered random matrices if—when k ≥ 2—we
assume that EXi 2(k−1)  Cnk/2(k−1) for all i.
We also have that for q ≥ 1
 
 √  q k/2
ZP − EZP q  Cm,k max q, .
m

Consider a special case. Let X ∈ Cn×n be a random Hermitian matrix which


satisfies the convex concentration property (4.33) (with respect to the Frobenius
norm on the Hermitian matrix space), let k ≥ 1 be an integer, and suppose—when
 2(k−1)
k ≥ 2—that Tr √1n X  Cn. Then, for t > 0,
4.8 Concentration of the Spectral Measure for Wishart Random Matrices 223

- 
 k  k  .
 0 1
 1 1 
P Tr √ X − M Tr √ X   t  C exp − min ck t2 , cnt2/k .
 n n 

Consider non-Hermtian matrices.


Theorem 4.7.2 (Meckes [190]). Let X∈Cn×n be a random matrix which satisfies
the convex concentration property (4.33) (with respect to the Frobenius norm on
Cn×n ), let k ≥ 1 be an integer, and suppose—when k ≥ 2—that EX 2(k−1) 
cnk/2(k−1) . Then, for t>0,
- 
 k  k  .  0 1
 1 1 
P Tr √ X −E Tr √ X  t C (k+1) exp − min ck t2 , cnt2/k .
 n n 

4.8 Concentration of the Spectral Measure for Wishart


Random Matrices

Two forms of approximations have central importance in statistical applica-


tions [193]. In one form, one random variable is approximated by another random
variable. In the other, a given distribution is approximated by another.
We consider the Euclidean operator (or matrix) norm
⎧ ⎫
⎨ ⎬
(aij ) = (aij ) l2 →l2 = sup aij xi yj : x2i  1, yj2  1
⎩ ⎭
i,j i j

of random matrices whose entries are independent random variables.


Seginer [107] showed that for any m × n random matrix X = (Xij )im,jn
with iid mean zero entries
⎛ ? ? ⎞

E (Xij )  C ⎝E max 2 + E max


Xij 2 ⎠,
Xij
im jn
jn im

where C is a universal constant.


For any random matrix with independent mean zero entries, Latala [194] showed
that
⎛ ⎛ ⎞1/4 ⎞
? ?
⎜ 2 +⎝ 2⎠ ⎟
E (Xij )  C ⎝max EXij
2 + max EXij EXij ⎠,
i j
j i i,j

where C is some universal constant.


224 4 Concentration of Eigenvalues and Their Functionals

Reference [183] is relevant in the context.


For a symmetric n × n matrix M, Fλ (λ) is the cumulative distribution function
(CDF) of the spectral distribution of matrix M
n
1
FM (λ) = {λi (M)  λ}, λ ∈ R.
n i=1

The integral of a function f (·) with respect to the measure induced by FM is denoted
by the function
n
1 1
FM (f ) = f (λi (M)) = Tr f (M) .
n i=1
n

For certain classes of random matrices M and certain classes of functions f, it can
be shown that FM (f ) is concentrated around its expectation EFM (f ) or around any
median MFM (f ).
For a Lipschitz function g, we write ||g||L for its Lipschitz constant. To state the
result, we also need to define bounded functions: f : (a, b) → R are of bounded
variation on (a, b) (where −∞ ≤ a ≤ b ≤ ∞), in the sense that
n
Vf (a, b) = sup sup |f (xk ) − f (xk−1 )|
n1 a<x0 x1 ···xn
k=1

is finite [195, Sect. X.1]. A function is bounded variation if and only if it can be
written as the difference of two bounded monotone functions on (a, b). The indicator
function g : x → {x  λ} is of bounded variation on R with Vg (R) = 1 for each
λ ∈ R.
Theorem 4.8.1 (Guntuboyina and Lebb [196]). Let X be an m × n matrix
whose row-vectors are independent, set a Wishart matrix S = XT X/m, and
fixf : R → R.

1. Suppose that f is such that the mapping x → f x2 is convex and Lipschitz,
and suppose that |Xi,j | ≤ 1 for each i and j. For all t > 0, we then have
   - .
1 1  nm t2
P  Tr f (S) − M Tr f (S)  t  4 exp − 2 .
n n n + m 8 f (·2 ) L

(4.34)
where M stands for the median.
 [From the upper bound (4.34), one can also
obtain a similar bound for P  n1 Tr f (S) − E n1 Tr f (S)  t using standard
methods (e.g.[197])]
2. Suppose that f is of bounded variation on R. For all t > 0, we then have
4.8 Concentration of the Spectral Measure for Wishart Random Matrices 225

   - .
1 1  n2
2t 2
P  Tr f (S) − E Tr f (S)  t  2 exp − .
n n m Vf2 (R)

In particular, for each λ ∈ R and all t > 0, we have


   
1 1  2n2 2
 
P  Tr f (S) − E Tr f (S)  t  2 exp − t .
n n m

These bounds cannot be improved qualitatively without imposing additional as-


sumptions. This theorem only requires that the rows of X are independent while
allowing for dependence within each row of X.
Let us consider the case that is more general than Theorem 4.8.1. Let X be a
random symmetric n × n matrix. Let X be a function of m independent random
quantities y1 , . . . , ym , i.e., X = X (y1 , . . . , ym ) . Write

X(i) = X (y1 , . . . , yi−1 , ỹ i , yi+1 , . . . , ym ) (4.35)

where ỹi is distributed the same as yi and represents an independent copy of yi .


ỹi , i = 1, . . . , m are independent of y1 , . . . , ym . Assume that
/ /
/1  √  √ /
/ Tr f X/ m − 1 Tr f X(i) / m / r (4.36)
/n n / n

holds almost surely for each i = 1, . . . , m and for some (fixed) integer r.
Theorem 4.8.2 (Guntuboyina and Lebb [196]). Assume (4.36) is satisfied for
each i = 1, . . . , m. Assume f : R → R is of bounded variation on R. For any
t > 0, we have that
   - .
1  √ 1  √  n2
2t 2
P  Tr f X/ m − E Tr f X/ m   t  2 exp −
 .
n n m r2 V 2 (R) f
(4.37)
To estimate the integer r in (4.35), we need the following lemma.
Lemma 4.8.3 (Lemma 2.2 and 2.6 of [198]). Let A and B be symmetric n × n
matrices and let C and D be m × n matrices. We have
/ /
/1 /
/ Tr f (A) − 1 Tr f (B)/  rank (A − B) ,
/n n / n

and
/ /
/1   / rank (C − D)
/ Tr f CT C − 1 Tr f DT D /  .
/n n / n

226 4 Concentration of Eigenvalues and Their Functionals

Now we are ready to consider an example to illustrate the application.


Example 4.8.4 (Sample covariance matrix of vector moving average (MA) pro-
cesses [196]). Consider an m × n matrix X whose row-vectors follow a vector
MA process of order 2, i.e.,
T
(Xi,· ) = yi+1 + Byi , i = 1, . . . , m

where y1 , . . . , ym ∈ Rn are m+1 independent n-vectors and B is some fixed n×n


matrix.
Now for the innovations yi , we assume that yi = Hzi , i = 1, . . . , m + 1, where
H is a fixed n × n matrix. Write Z as
T
Z = (z1 , . . . , zm+1 ) = (Zij ) , i = 1, . . . , m + 1, j = 1, . . . , n

where the entries Zij of matrix Z are independent and satisfy |Zij | ≤ 1. The
(random) sample covariance
 2 matrix is S = XT X/m. For a function f such that
the mapping x → f x is convex and Lipschitz, we then have that, for all t > 0
   - .
1 1  n2 m t2
P  Tr f (S) − M Tr f (S)  t  4 exp − 2 .
n n n + m 8C 2 f (·2 ) L
(4.38)
where C = (1 + B ) H with || · || denoting the operator (or matrix) norm.

We focus on Wishart random matrices (sample covariance matrix), that is S =
1 H
n YY , where Y is a rectangular p × n matrix. Our objective is to use new
exponential inequalities of the form

Z = g (λ1 , · · · , λp )

where (λi )1ip is the set of eigenvalues of S. These inequalities will be upper
 
bounds on E eZ−EZ and lead to natural bounds P (|Z − EZ|  t) for large values
of t.
What is new is following [199]: (i) g is a once or twice differentiable function
(i.e., not necessarily of the form (4.39)); (ii) the entries of Y are possibly dependent;
(iii) they are not necessarily identically distributed; (iv) the bound is instrumental
when both p and n are large. In particular, we consider
p
g (λ1 , · · · , λp ) = ϕ (λk ). (4.39)
k=1

A direct application is to spectrum sensing. EZ is used to set the threshold for


hypothesis detection [14, 200]. Only simple trace functional of the form (4.39) is
4.8 Concentration of the Spectral Measure for Wishart Random Matrices 227

used in [14]. We propose to investigate this topic using new results from statistical
literature [180, 199, 201].
For a vector x, ||x|| stands for the Euclidean norm of the vector x. ||X||p is the
usual Lp -norm of the real random variable X.
Theorem 4.8.5 (Deylon [199]). Let Q be a N × N deterministic matrix, X =
(Xij ) , 1  i  N, 1  j  n be a matrix of random independent entries, and set

1
Y= QXXT Q.
n

Let λ → g (λ) be a twice differentiable symmetric function on RN and define


the random variable Z = g (λ1 , . . . , λN ) where (λ1 , . . . , λN ) is the vector of the
eigenvalues of Y. Then
 2 
  64N 4 N 2
E e Z−EZ
 exp ξ γ1 + ξ γ2 ,
n n

where

ξ = Q sup Xij ∞
i,j
 
 ∂g 
γ1 = sup  (λ)
k,λ ∂λk
/ /
γ2 = sup /∇2 g (λ)/ (matrix norm).
λ

In particular, for any t > 0


 −2 
n 2 −4 N 2
P (|Z − EZ| > t)  2 exp − t ξ γ1 + ξ γ2 .
256N n

When dealing with the matrices with independent columns, we have Q = I, where
I is the identity matrix.
Theorem 4.8.6 (Deylon [199]). Let X = (Xij ) , 1  i  N, 1  j  n be a
matrix of random independent entries, and set

1
Y= XXT .
n

Let λ → g (λ) be a twice differentiable symmetric function on RN and define the


random variable

Z = g (λ1 , . . . , λN )
228 4 Concentration of Eigenvalues and Their Functionals

where (λ1 , . . . , λN ) is the vector of the eigenvalues of Y. Then


 
  64a2 γ12
E e Z−EZ
 exp ,
n

where
 
 ∂g 

γ1 = sup  (λ)
λ ∂λ 1
/ /1/2
/ /
/ 2/
a = sup / Xij / .
j / /
i ∞

In particular, for any t > 0


 
  nt2
E e Z−EZ
 2 exp − .
64a4 γ12

Random matrices can be expressed as functions of independent random vari-


ables. Then we think of the linear statistics of eigenvalues as functions of these
independent random variables. The central idea is pictorially represented as

large vector x ⇒ matrix A (x) ⇒ linear statistic f (λi (A)) = g (x) .


i

For each c1 , c2 > 0, let L(c1 , c2 ) be the class of probability measures on R that arise
as laws for random variables like u(Z), where Z is a standard Gaussian random
variable and u is a twice continuously differentiable function such that for all x ∈ R

|u (x)  c1 | and |u (x)  c2 | .

For example, the standard Gaussian law is in L(1, 0). Again, taking u = the
Gaussian cumulative distribution, we see  that the uniform distribution on the
−1/2 −1/2
unit interval is in L (2π) , (2πe) . A random variable is said to be
“in L(c1 , c2 )” instead of the more elaborate statement that “the distribution of
X belongs to L(c1 , c2 ).” For two random variables X and Y, the supremum of
|P (X ∈ B) − P (Y ∈ B)| as B ranges from all Borel sets is called the total varia-
tion distance between the laws of X and Y, often denoted simply by dT V (X, Y ).
The following theorem gives normal approximation bounds for general smooth
functions of independent random variables whose law are in L(c1 , c2 ) for some
finite c1 , c2 .

Theorem 4.8.7 (Theorem 2.2 of Chatterjee [201]). Let x = (X1 , . . . , Xn ) be a


vector of independent random variable in L(c1 , c2 ) for some finite c1 , c2 . Take any
g ∈ C 2 (Rn ) and let ∇g and ∇2 g denote the gradient and Hessian of g. Let
4.8 Concentration of the Spectral Measure for Wishart Random Matrices 229

1/2
n
∂g
4
 1/4  4
1/4
κ0 = E (X) , κ1 = E∇g (X)4 , κ2 = E ∇2 g (X) .
i=1
∂xi

Suppose W = g (x) has a finite fourth moment and let σ 2 = Var (W ) . Let Z be a
normal random variable with the same mean and variance as W. Then
√ 
2 5 c1 c2 κ0 + c31 κ1 κ2
dT V (W, Z)  .
σ2
If we slightly change the setup by assuming that x is a Gaussian random vector
with mean 0 and covariance matrix Σ, keeping all other notations the same, then
the corresponding bound is
√ 3/2
2 5 Σ κ1 κ2
dT V (W, Z)  .
σ2
The cornerstone of Chatterjee[201] is Stein’s method [202]. Let us consider a
particular function f . Let n be a fixed positive integer and J be a finite indexing
set. Suppose that for each 1 ≤ i, j, ≤ n, we have a C 2 map aij : R2 → C. For each
x ∈ RJ , let A(x) be the complex n × n matrix whose (i, j)-th element is aij (x).
Let

f (z) = bk z k
k=0

be an analytic function on the complex plane. Let X = (Xu )u∈J be a collection


of independent random variables in L(c1 , c2 ) for some finite c1 , c2 . Under this very
general setup, we give an explicit bound on the total variation distance between
the laws of the random variable Re Tr f (A (x)) and a Gaussian random variable
with matching mean and variance. (Here Re z denotes the real part of a complex
number z.)
Theorem 4.8.8 (Theorem 3.1 of Chatterjee [201]). Let all notations be as above.
Suppose W = Re Tr f (A (x)) has finite fourth moment and let σ 2 = Var (W ) .
Let Z be a normal random variable with the same mean and variance as W. Then
√ 
2 5 c1 c2 κ0 + c31 κ1 κ2
dT V (W, Z)  .
σ2
If we slightly change the setup by assuming that x is a Gaussian random vector
with mean 0 and covariance matrix Σ, keeping all other notations the same, then
the corresponding bound is
√ 3/2
2 5 Σ κ1 κ2
dT V (W, Z)  .
σ2
230 4 Concentration of Eigenvalues and Their Functionals

Let us consider the Wishart matrix or sample covariance matrix. Let N ≤ n


be two positive integers, and let X = (Xij )1iN,1jn be a collection of
independent random variables in L(c1 , c2 ) for some finite c1 , c2 . Let

1
A= XXT .
n
Theorem 4.8.9 (Proposition 4.6 of Chatterjee [201]). Let λ be the largest eigen-
value of A. Take any entire function f and define f1 and f2 as in Theorem 4.8.8. Let
  1/4   
4 1/4
and b = E f1 (λ) + 2N −1/2 λf2 (λ)
4
a = E f1 (λ) λ2 . Suppose
W = Re Tr f (A (x)) has finite fourth moment and let σ 2 = Var (W ) . Let Z be a
normal random variable with the same mean and variance as W. Then
√ √ 
8 5 c 1 c 2 a2 N c31 abN
dT V (W, Z)  2 + 3/2 .
σ n n

If we slightly change the setup by assuming that the entries of x is jointly Gaussian
with mean 0 and nN × nN covariance matrix Σ, keeping all other notation the
same, then the corresponding bound is
√ 3/2
8 5 Σ abN
dT V (W, Z)  2 3/2
.
σ n
Double Wishart matrices are important in statistical theory of canonical corre-
lations [203, Sect. 2.2]. Let N  n  M be three positive integers. Let X =
(Xij )1iN,1jn and Y = (Xij )1iN,1jM be a collection of independent
random variables in L(c1 , c2 ) for some finite c1 , c2 . Define the double Wishart
matrix as
 −1
A = XXT YYT .

A theorem similar to Theorem 4.8.9 is obtained in [201]. In [204], a central limit


theorem is proven for the Jacob matrix ensemble. A Jacob matrix is defined as A =
 −1
XXT XXT + YYT , when the matrices X and Y have independent standard
Gaussian entries.
Let μ and ν be two probability measures on C (or R2 ). We define [59]
 
 
 
ρ (μ, ν) = sup  f (x)μ (dx) − f (x)ν (dx) , (4.40)
f L 1  
C C
4.8 Concentration of the Spectral Measure for Wishart Random Matrices 231

where f above is a bounded Lipschitz function defined on C with f = sup |f (x)|


x∈C
and f L = f + sup |f (x)−f
|x−y|
(y)|
.
x=y

Theorem 4.8.10 (Jiang [204]). Let X be an n×n matrix with complex entries, and
λ1 , . . . , λn are its eigenvalues. Let μX be the empirical law of λi , 1 ≤ i ≤ n. Let μ
be a probability measure. Then ρ (μX , μ) is a continuous function in the entries of
matrix X, where ρ is defined as in (4.40).
Proof. The proof of [204] illustrates a lot of insights into the function ρ (μX , μ) ,
so we include their proof here for convenience. The eigenvalues of X is denoted by
λi , 1 ≤ i ≤ n. First, we observe that
n
1
f (x)μX (dx) = f (λi (X)).
n i=1
C

Then by triangle inequality, for any permutation π of 1, 2, . . . , n,


 n 
  n 
|ρ (μX , μ) − ρ (μY , μ)|  1 
sup  f (λi (X)) − f (λi (Y))
n
f L 1 i=1  i=1 
n  n 

 max sup  f λπ(i) (X) − f (λi (Y))
1in f  1 i=1
 L  i=1
 max λπ(i) (X) − λi (Y) ,
1in

where in the last step we use the Lipschitz property of f : |f (x) − f (y)|  |x − y|
for any x and y. Since the above inequality is true for any permutation π, we have
that
 
|ρ (μX , μ) − ρ (μY , μ)|  min max λπ(i) (X) − λi (Y) .
π 1in

Using Theorem 2 of [205], we have that


 
min max λπ(i) (X) − λi (Y)  22−1/n X − Y
1/n 1−1/n
2 ( X 2 + Y 2)
π 1in

where X 2 is the operator norm of X. Let X = (xij ) and Y = (yij ). We know


 2
that X 2  |xij | . Therefore
1i,jn

⎛ ⎞1/(2n)

·⎝
1/n 2⎠
|ρ (μX , μ) − ρ (μY , μ)|  22−1/n X − Y 2 |xij − yij | .
1i,jn


232 4 Concentration of Eigenvalues and Their Functionals

For a subset A ⊆ C, the space of Lipschitz functions f : A → R is denoted


by Lip (A). Denote by P(A) the space of all probability measures supported in A,
and by Pp (A) be the space of all probability measures with finite p-th moment,
equipped with the Lp Wasserstein distance dp defined by
 1/p
p
dp (μ, ν) = inf x − y dπ (x, y) . (4.41)
π

||·|| is the Euclidean length (norm) of a vector. The infimum above is over probability
measures π on A × A with marginals μ and ν. Note that dp ≤ dq when p ≤ q. The
L1 Wasserstein distance can be equivalently defined [59] by

d1 (μ, ν) = sup [f (x) − f (y)]dμ (x) dν(y), (4.42)


f

where the supremum is over f in the unit ball B (Lip (A)) of Lip (A) .
Denote the space of n × n Hermitian matrices by Msa n×n . Let A be a random
n × n Hermitian matrix. An essential condition on some of random matrices used
in the construction below is the following. Suppose that for some C, c > 0,
 
P (|F (A) − EF (A)|  t)  C exp −ct2 (4.43)

for every t > 0 and F : Msa n×n → R which is 1-Lipschitz with respect to the
Hilbert-Schmidt norm (or Frobenius norm). Examples in which (4.43) is satisfied
include:
1. The diagonal and upper-diagonal entries of matrix M are independent √ and each
satisfy a quadratic transportation cost inequality with constant c/ n. This is
slightly more general than the assumption of a log-Sobolev inequality [141,
Sect. 6.2], and is essentially the most general condition with independent en-
tries [206]. It holds, e.g., for Gaussian entries and, more generally, for entries

with densities of the form e−nuij (x) where uij (x) ≥ c > 0.
2. The distribution of M itself has a density proportional to e−n Tr u(M) with u :
R → R such that u (x) ≥ c > 0. This is a subclass of the unitarily invariant
ensembles [207]. The hypothesis on u guarantees that M satisfies a log-Sobolev
inequality [208, Proposition 4.4.26].
The first model is as follows. Let U(n) be the group of n × n unitary matrices.
Let U ∈ U(n) distributed according to Haar measure, independent of A, and let
Pk denote the projection of Rn onto the span of the first k basis elements. Define a
random matrix M by

M = Pk UAUH PH
k (4.44)
4.8 Concentration of the Spectral Measure for Wishart Random Matrices 233

Then M is a compression of A (as an operator on Rn ) to a random k-dimensional


subspace chosen independently of A. This compression is relevant to the spectrum
sensing based on principal component analysis [156, 157] and kernel principal
component analysis [159]. The central idea is based on an observation that under
extremely signal to noise ratios the first several eigenvectors (or features)—
corresponding to the first k basis elements above—are found to be robust.
For an N × N Hermitian matrix X, we define the empirical spectral measure as

N
1
μX = δλi (X) ,
N i=1

where λi (X) are eigenvalues of matrix X, and δ is the Dirac function. The empirical
spectral measure of random matrix M is denoted by μM , while its expectation is
denoted by μ = EμM .
Theorem 4.8.11 (Meckes, E.S. and Meckes, M.W. [192]). Suppose that matrix A
satisfies (4.43) for every 1-Lipschitz function F : Msa
n×n → R.

1. If F : Msa
n×n → R is 1-Lipschitz, then for M = Pk UAU Pk ,
H H

 
P (|F (M) − EF (M)|  t)  C exp −cnt2

for every t > 0.


2. In particular,
    
 
P  M op − E M op   t  C exp −cnt2

for every t > 0.


3. For any fixed probability measure μ ∈ P2 (C) , and 1-Lipschitz f : R → R, if

Zf = f dμM − f dμ,

then
 
P (|Zf − EZf |  t)  C exp −cknt2

for every t > 0.


4. For any fixed probability measure μ ∈ P2 (C) , and 1 ≤ p ≤ 2,
 
P (|dp (μ, ν) − Edp (μ, ν)|  t)  C exp −cknt2

for every t > 0.


Let us consider the expectation Ed1 (μM , EμM ) .
234 4 Concentration of Eigenvalues and Their Functionals

Theorem 4.8.12 (Meckes, E.S. and Meckes, M.W. [192]). Suppose that matrix
A satisfies (4.43) for every 1-Lipschitz function F : Msa n×n → R. Let M =
Pk UAUH PH k , and let μ M denote the empirical spectral distribution of random
matrix M with μ = EμM . Then,
 1/3
C  E M op C 
Ed1 (μM , EμM )  1/3
 1/3
,
(kn) (kn)

and so

C   
P d1 (μM , EμM )  1/3
+t  C exp −cknt2
(kn)

for every t > 0.


Let us consider the second model that has been considered in free probabil-
ity [209]. Let A, B be n × n Hermitian matrices and satisfy (4.43). Let U be Haar
distributed, with A, B, U independent. Define [192]

M = UAUH + B, (4.45)

the “randomized sum” of A and B.


Theorem 4.8.13 (Meckes, E.S. and Meckes, M.W. [192]). Let A, B satis-
fies (4.43) and let U ∈ U(n) be Haar-distributed with A, B, U independent of
each other. Define M = UAUH + B.
1. There exist C, c depending only on the constants in (4.43) for A and B, such that
if F : Msa
k×k → R is 1-Lipschitz, then

 
P (|F (M) − EF (M)|  t)  C exp −cnt2

for every t > 0.


2. In particular,
    
 
P  M op −E M op   t  C exp −cnt2

for every t > 0.


3. For any fixed probability measure ρ ∈ P2 (R) , and 1-Lipschitz f : R → R, if

Zf = f dμM − f dρ,
4.9 Concentration for Sums of Two Random Matrices 235

then
 
P (|Zf − EZf |  t)  C exp −cn2 t2

for every t > 0.


4. For any fixed probability measure ρ ∈ P2 (R) , and 1 ≤ p ≤ 2,
 
P (|dp (μ, ν) − Edp (μ, ν)|  t)  C exp −cn2 t2

for every t > 0.


Theorem 4.8.14 (Meckes, E.S. and Meckes, M.W. [192]). In the setting of The-
orem 4.8.13, there are constants c, C, C  , C  depending only on the concentration
hypothesis for A and B, such that
 1/3
C E M op C
Ed1 (μM , EμM )  2/3
 2/3
,
(n) (n)

and so

C  
P d1 (μM , EμM )  2/3
+t  C  exp −cn2 t2
(n)

for every t > 0.


Theorem 4.8.15 (Meckes, E.S. and Meckes, M.W. (2011) [192]). For each n, let
An , Bn ∈ Msan×n be fixed matrices with spectra bounded independently of n. Let
U ∈ U(n) be Haar-distributed. Let Mn = UAn UH + Bn and let μn = EμMn .
Then with probability 1,

C
d1 (μMn , EμMn ) 
n2/3
for all sufficiently large n, where C depends only on the bounds on the sizes of the
spectra of An and Bn .

4.9 Concentration for Sums of Two Random Matrices

We follow [210]. If A and B are two Hermitian matrices with a known spectrum
(the set of eigenvalues), it is a classical problem to determine all possibilities for
the spectrum of A + B. The problem goes back at least to H. Weyl [211]. Later,
Horn [212] suggested a list of inequalities, which must be satisfied by eigenvalues
236 4 Concentration of Eigenvalues and Their Functionals

of A + B. For larger matrices, it is natural to consider the probabilistic analogue of


this problem, when matrices A and B are “in general position.” Let

H = A + UBUH ,

where A and B are two fixed n × n Hermitian matrices and U is a random unitary
matrix with the Haar distribution on the unitary group U (n). Then, the eigenvalues
of H are random and we are interested in their joint distribution. Let λ1 (A) 
· · ·  λn (A) (repeated by multiplicities) denote the eigenvalues of A, and define
the spectral measure of A as
n
1
μA = δλi (A) . (4.46)
n i=1

Define similarly for μA and μH . The empirical spectral cumulative distribution


function of A is defined as

# {i : λi  x}
FA (x) = .
n
By an ingenious application of the Stein’s method [87], Chatterjee [213] proved that
for every x ∈ R
 
n
P (|FH (x) − EFH (x)|  t)  2 exp −ct 2
,
log n

where c is a numerical constant. The decay rate of tail bound is sub-linear in n.


Kargin [210] greatly improved the decay rate of tail bound to n2 . To state his result,
we need to define the free deconvolution [209].
When n is large, it is natural to define

μA  μ B ,

where  denotes free convolution, a non-linear operation on probability measures


introduced by Voiculescu [66].
Assumption 4.9.1. The measure μA  μB is absolutely continuous everywhere on
R, and its density is bounded by a constant C.
Theorem 4.9.2 (Kargin [210]). Suppose that Assumption 4.9.1 holds. Let FH (x)
and F (x) be cumulative distribution functions for the eigenvalues
 of H = A +
4/ε
UBUH and for μA  μB , respectively. For all n  exp (c1 /t) ,

   
n2
P sup |FH (x) − F (x)|  t  exp −c2 t2 ε ,
x (log n)
4.10 Concentration for Submatrices 237

where c1 and c2 are positive and depend only on C, ε ∈ (0, 2] , and K =


max { A , B } .
The main tools in the proof of this theorem are the Stieltjes transform method and
standard concentration inequalities.

4.10 Concentration for Submatrices

Let M be a square matrix of order n. For any two sets of integers i1 , . . . , ik


and j1 , . . . , jl between 1 and n, M (i1 , . . . , ik ; j1 , . . . , jl ) denotes the submatrix
of M formed by deleting all rows except rows i1 , . . . , ik and all columns except
columns j1 , . . . , jl . A submatrix like M (i1 , . . . , ik ; j1 , . . . , jl ) is called a principal
submatrix.
We define FA (x) the empirical spectral cumulative distribution function of
matrix A as (4.46). The following result shows that given 1 ≤ k ≤ n and any
Hermitian matrix M of order n, the empirical spectral distribution is almost the
same for almost every principal submatrix M of order k.
Theorem 4.10.1 (Chatterjee and Ledoux [214]). Take any 1 ≤ k ≤ n and a
Hermitian matrix M of order n. Let A be a principal submatrix of M chosen
uniformly at random from the set of all k × k principal submatrices of M. Let F
be the expected spectral cumulative distribution function, that is, F (x) = EFA (x).
Then, for each t ≥ 0,
  √
1 √
P FA − F ∞  √ + t  12 ke−t k/8 .
k

Consequently, we have

13 + 8 log k
E FA − F ∞  √ .
k

Exactly the same results hold if A is a k × n submatrix of M chosen uniformly at


random, and FA is the empirical cumulative distribution function of the singular
values of A. Moreover, in this case M need not be Hermitian.
Theorem 4.10.1 can be used to sample the matrix from a larger database. In the
example [214] of an n × n covariance matrix with n = 100, they chose k = 20 and
picked two k×k principal submatrices A and B of M, uniformly and independently
at random. The two distributions FA and FB are statistically distinguishable.
238 4 Concentration of Eigenvalues and Their Functionals

4.11 The Moment Method

We follow [63] and [3] for our exposition of Wigner’s trace method. The idea of
using Wigner’s trace method to obtain an upper bound for the eigenvalue of A,
λ (A) , was initiated in [215]. The standard linear algebra identity has

N
Tr (A) = λi (A).
i=1

The trace of a matrix is the sum of its eigenvalues. The trace is a linear functional.
More generally, we have

 N
Tr Ak = λki (A).
i=1

The linearity of expectation implies


n   
k
E λi (A) = E Tr Ak .
i=1

For an even integer k, the geometric average of the high-order moment


  1/k
Tr Ak is the lk norm of these eigenvalues, and we have

k  k
σ1 (A)  Tr Ak  nσ1 (A) (4.47)

The knowledge of the k-th moment Tr Ak controls the operator norm (the also
the largest singular value) up to a multiplicative factor of n1/k . Taking larger and
larger k, we should obtain more accurate control on the operator norm.
Let us see how the moment
 method works in practice. The simplest case is that
of the second moment Tr A2 , which in the Hermitian case works out to

 n n
2 2
Tr A2 = |ξij | = A F .
i=1 j=1


n 
n
2
The expression |ξij | is easy to compute in practice. For instance, for the
i=1 j=1
symmetric matrix A consisting of Bernoulli random variables taking random values
of ±1, this expression is exactly equal to n2 .
4.11 The Moment Method 239

From the weak law of large numbers, we have


n n
2
|ξij | = (1 + o(1)) n2 (4.48)
i=1 j=1

asymptotically almost surely. In fact, if the ξij have uniformly sub-exponential tail,
we have (4.48) with overwhelming probability.
Applying (4.48) , we have the bounds

(1 + o(1)) n  σ1 (A)  (1 + o(1)) n (4.49)

asymptotically almost surely. The median of σ1 (A) is at least (1 + o(1)) n. But
the upper bound here is terrible. We need to move to higher moments to improve it.
Let us move to the fourth moment. For simplicity, all entries ξij have zero mean
and unit variance. To control moments beyond the second moments, we also assume
that all entries are bounded in magnitude by some K. We expand

 n
Tr A4 = ξi1 i2 ξi2 i3 ξi3 i4 ξi4 i1 .
1i1 ,i2 ,i3 ,i4 n

To understand this expression, we take expectations

 n
E Tr A4 = Eξi1 i2 ξi2 i3 ξi3 i4 ξi4 i1 .
1i1 ,i2 ,i3 ,i4 n

One can view this sum graphically, as a sum over length four cycles in the vertex set
{1, . . . , n}.
Using the combinatorial arguments [63], we have
  
E Tr A4  O n3 + O n2 K 2 .

In particular, if we have the assumption K = O( n), then we have
 
E Tr A4  O n3 .

Consider the general k-th moment

 n
E Tr Ak = Eξi1 i2 · · · ξik i1 .
1i1 ,...,ik n

We have
 k  √ k−2
E Tr Ak  (k/2) nk/2+1 max 1, K/ n .
240 4 Concentration of Eigenvalues and Their Functionals

With the aid of (4.47), one has

k k  √ k−2
Eσ1 (A)  (k/2) nk/2+1 max 1, K/ n ,

and so by Markov’s inequality, we have

1 k  √ k−2
P (σ1 (A)  t)  k
(k/2) nk/2+1 max 1, K/ n
t

for all t > 0. This gives the median of σ1 (A) at


 √  √ 
O n1/k k n max 1, K/ n .

We can optimize this in k√by choosing k to be√comparable to log n, and then we


obtain an upper bound O ( n log n max (1, K/ n)) for the median. After a slight
tweaking of the constants, we have
√  √
σ1 (A) = O n log n max 1, K/ n

with high probability.


We can summarize the above results into the following:
Proposition 4.11.1 (Weak upper bound). Let A be a random Hermitian matrix,
with the upper triangular entries ξij , i ≤ j being independent with mean zero and
variance at most 1, and bounded in magnitude by K. Then,
√  √
σ1 (A) = O n log n max 1, K/ n

with high probability.


√ √
When K ≤ n, this gives an upper bound of O ( √n log n) , which is still off by a
logarithmic factor from the expected bound of O ( n) . We will remove this factor
later. √
Now let us consider the case when K = o ( n) , and each entry has variance
1
exactly 1. We have the upper bound
 k
E Tr Ak  (k/2) nk/2+1 .

We later need the classical formula for the Catalan number


n
Cn+1 = Ci Cn−i
i=0

1 Later we will relax this to “at most 1”.


4.11 The Moment Method 241

for all n ≥ 1 with C0 = 1, and use this to deduce that

k!
Ck/2 = (4.50)
(k/2 + 1)! (k/2)!

for all k = 2, 4, 6, . . . .
Note that n (n − 1) · · · (n − k/2) = (1 + ok (1)) nk/2+1 . After putting all the
above computations we conclude
Theorem 4.11.2 (Moment computation). Let A be a real symmetric random
matrix, with the upper triangular elements ξij , i ≤ j jointly
√ independent with mean
zero and variance one, and bounded in magnitude by o( n). Let k be a positive
even integer. Then we have
 
E Tr Ak = Ck/2 + ok (1) nk/2+1

where Ck/2 is given by (4.50).


If we allow the ξij to have variance at most one, rather than equal to one, we obtain
the upper bound
 
E Tr Ak  Ck/2 + ok (1) nk/2+1 .

Theorem 4.11.2 is also valid for Hermitian random matrices.


Theorem 4.11.2 can be compared with the formula
  
ES k = Ck/2 + ok (1) nk/2

derived in [63, Sect. 2.1], where

S = X1 + · · · + Xn

is the sum of n i.i.d. random variables of mean zero and variance one, and

 k!
Ck/2 = .
2k/2 (k/2)!

Combining Theorem 4.11.2 with (4.47) we obtain a lower bound


 
k
Eσ1 (A)  Ck/2 + ok (1) nk/2 .

Proposition 4.11.3 (Lower Bai-Yin theorem). Let A be a real symmetric random


matrix, with the upper triangular elements ξij , i ≤ j jointly independent with mean
zero and variance one, and bounded √in magnitude by O(1). Then the median (or
mean) of σ1 (A) at least (2 − o(1)) n.
242 4 Concentration of Eigenvalues and Their Functionals

Let us remove the logarithmic factor in the following theorem.


Theorem 4.11.4 (Improved moment bound). Let A be a real symmetric random
matrix, with the upper triangular elements ξij , i ≤ j jointly independent with mean
0.49
zero and variance one, and bounded  in2 magnitude by O(n ) (say). Let k be a
positive even integer of size k = O log n (say). Then we have
  
E Tr Ak  Ck/2 nk/2+1 + O k O(1) 2k nk/2+0.98

where Ck/2 is given by (4.50). In particular, from the trivial bound Ck/2  2k one
has
 k
E Tr Ak  (2 + o(1)) nk/2+1 .

We refer to [63] for the proof.


Theorem 4.11.5 (Weak Bai-Yin theorem, upper bound). Let A = (ξij )1i,jn
be a real symmetric random matrix, whose entries all have the same distribution
ξ, with mean zero, variance one, and fourth moment √ O(1). Then for every
ε > 0 independent of n, one has σ1 (A)  √ (2 + ε) n asymptotically almost
surely. In particular, σ1 (A)  (2 + o (1)) n asymptotically√ almost surely; as
a consequence, the median of σ1 (A) is at most (2 + o (1)) n. (If ξ is bounded,
we see, in particular,
√ from Proposition 4.11.3 that the median is in fact equal to
(2 + o (1)) n.)
The fourth moment hypothesis is best possible.
Theorem 4.11.6 (Strong Bai-Yin theorem, upper bound). Let ξ be a real random
variable with mean zero, variance one and finite fourth moment, and for all 1 ≤ i ≤
j, let ξij be an i.i.d. sequence with distribution ξ, and set ξij = ξji . Let A =
(ξij )1i,jn be the random matrix formed by the top left n × n block. Then almost
surely, one has

σ1 (A)
lim sup √  2.
n→∞ n

It is a routine matter to generalize the Bai-Yin result from real symmetric


matrices to Hermitian matrices. We use a substitute of (4.47), names the bounds
 
σ1 (A)  Tr (AA∗ )
k k/2 k
 nσ1 (A) ,

valid for any n × n matrix A with complex entries and every positive integer
 k.
It is possible to adapt all of the above moment calculationsfor Tr Ak in the
symmetric or Hermitian cases to give analogous cases for Tr (AA∗ )
k/2
in the
non-symmetric cases. Another approach is to use the augmented matrix defined as
4.12 Concentration of Trace Functionals 243


0 A
à = ,
A∗ 0

which is 2n × 2n Hermitian matrix. If A has singular values σ1 (A) , . . . , σn (A) ,


then à has eigenvalues ±σ1 (A) , . . . , ±σ
 n√(A) .  √  
σ1 (A) is concentrated in the range of 2 n − O n−1/6 , 2 n + O n−1/6 ,
and even to √ get a universal distribution for the normalized expression
(σ1 (A) − 2 n) n1/6 , known as the Tracy-Widom law [216]. See [217–221].

4.12 Concentration of Trace Functionals

Consider a random n × n Hermitian matrix X. The sample covariance matrix has


the form XDX∗ where D is a diagonal matrix. Our goal here is to study the
non-asymptotic deviations, following [180]. We give concentration inequalities for
functions of the empirical measure of eigenvalues for large, random, Hermitian
matrices X, with not necessarily Gaussian entries. The results apply in particular
to non-Gaussian Wigner and Wishart matrices.
Consider
  1
X = (X)ij , X = X∗ , Xij = √ Aij ωij
1i,jn n

with

ω := ω R + jω I = (ωij )1i,jn , ωij = ω̄ij ,
 
A = (A)ij , A = A∗ .
1i,jn

Here, {ωij }1i,jn are independent complex random variables with laws
{Pij }1i,jn , Pij being a probability measure on C with

Pij (ωij ∈ Θ) = 1u+jv∈Θ PijR (du) PijI (dv) ,

0 1
and A is a non-random complex matrix with entires (A)ij uniformly
1i,jn
bounded by, say, a.
We consider a real-valued function on R. For a compact set K, denoted by |K|
its diameter, that is the maximal distance between two points of K. For a Lipschitz
function f : Rk → R, we define the Lipschitz constant |f |L by

|f (x) − f (y)|
|f |L = sup ,
x,y x−y
244 4 Concentration of Eigenvalues and Their Functionals

where · denotes the Euclidean norm on Rk .


We say that a measure μ on R satisfies the logarithmic Sobolev inequality with
(not necessarily optimal) constant c if, for any differentiable function f,

f2   2
 
f 2 log )  2c f  dμ.
f 2 dμ

A measure μ satisfying the logarithmic Sobolev inequality has sub-Gaussian


tails [197].
Theorem 4.12.1 (Guionnet and Zeitouni [180]).
(a) Assume that the (Pij , i  j ∈ N) are uniformly compactly supported, that is
that there exists a compact set K ∈ C so that for any 1  i  j  n,
Pij (K√c ) = 0. Assume f is convex and Lipschitz. Then, for any t > t0 (n) =
8 |K| πa|f |L /n > 0,
+ ,
2
n2 (t − t0 (n))
P (|Tr (f (X)) − E Tr (f (X))|  t)  4n exp − 2 2 .
16|K| a2 |f |L

(b) If PijR , PijI , i  j ∈ N satisfy the logarithmic Sobolev inequality with uniform
constant c, then for any Lipschitz function f, for any t > 0,
+ ,
n2 t2
P (|Tr (f (X)) − E Tr (f (X))|  t)  2n exp − 2 .
8ca2 |f |L
 R I
We regard Tr (f (X)) as a function of the entries ωij , ωij 1i,jn
.

4.13 Concentration of the Eigenvalues

Concentration of eigenvalues are treated in [1–3]. Trace functionals are easier to


handle since trace is a linear operator. The k-th largest eigenvalue is a nonlinear
functional. We consider the quite general model of random symmetric matrices.
Let xij , 1 ≤ i ≤ j ≤ n, be independent, real random variables with absolute
value at most 1. Define xij = xji and consider the symmetric random matrix X =
n
(xij )i,j=1 . We consider λk the k-th largest eigenvalue of X.
Theorem 4.13.1 (Alon, Krivelevich, and Vu [1]). For every 1 ≤ k ≤ n,
the probability that λk (X) deviates from its median by more than t is at most
2 2
4e−t /32k . The same estimate holds for the probability that λn−k+1 (X) deviates
from its median by more than t.
In practice, we can replace the median M with the expectation E.
4.13 Concentration of the Eigenvalues 245

Let xij , 1 ≤ i ≤ j ≤ n, be independent—but not necessarily identical—random


variables:
• |xij |  K for all 1  i  j  n;
• E (xij ) = 0, for all 1  i  j  n;
• Var (xij ) = σ 2 , for all 1  i  j  n.
n
Define xij = xji and consider the symmetric random matrix X = (xij )i,j=1 . We
consider the largest eigenvalue of X, defined as
 T 
λmax (X) = sup v Xv .
v∈Rn ,v=1

The most well known estimate on λmax (X) is perhaps the following theorem [222].
Theorem 4.13.2 (Füredi and Komlós [222]). For a random matrix X as above,
there is a positive constant c = c (σ, K) such that
√ √
2σ n − cn1/3 ln n  λmax (X)  2σ n + cn1/3 ln n,

holds almost surely.


Theorem 4.13.3 (Krivelevich and Vu [223]). For a random matrix X as above,
there is a positive constant c = c (K) such that for any t > 0
2
P (|λmax (X) − Eλmax (X)|  ct)  4e−t /32
.

In this theorem, c does not depend on σ; we do not have to assume anything about
the variances.
Theorem 4.13.4 (Vu [3]). For a random matrix X as above, there is a positive
constant c = c (σ, K) such that

λmax (X)  2σ n + cn1/4 ln n,

holds almost surely.



Theorem 4.13.5 (Vu [3]). There are constants C and C such that the following
holds. Let xij , 1 ≤ i ≤ j ≤ n, be independent random variables, each of
which has mean 0 and variance σ 2 and is bounded in absolute value K, where

σ  C n−1/2 Kln2 n. Then, almost surely,
√ 1/2
λmax (X)  2σ n + C(Kσ) n1/4 ln n.

When the entries of X is i.i.d. symmetric random variables, there are sharper
bounds. The best current bound we know of is due to Soshnikov [216], which shows
that the error term in Theorem 4.13.4 can be reduced to n1/6+o(1) .
246 4 Concentration of Eigenvalues and Their Functionals

4.14 Concentration for Functions of Large Random


Matrices: Linear Spectral Statistics

The general concentration principle do not yield the correct small deviation rate for
the largest eigenvalues. They, however, apply to large classes of Lipschitz functions.
For M ∈ N, we denote by ·, · the Euclidean scalar product on RM (or CM ). For

M
two vectors x = (x1 , . . . , xM ) and y = (y1 , . . . , yM ), we have x, y = x i yi
i=1

M 
(or x, y = xi yi∗ ). The Euclidean norm is defined as x = x, x .
i=1
For a Lipschitz function f : RM → R, we define the Lipschitz constant |f |L by

|f (x) − f (y)|
|f |L = sup ,
x=y∈RM x−y

where · denotes the Euclidean norm on RM . In other words, we have

M
|f (x) − f (y)|  |f |L |xi − yi |,
i=1

where xi , yi are elements of the vectors x, y on RM (or CM ).


Consider X being a n × n Hermitian matrix whose entries are centered,
independent Gaussian random variables with variance σ 2 . In other words, X is a
Hermitian matrix such that the entries above the diagonal are independent, complex
(real on the diagonal) Gaussian random variables with zero mean and variance σ 2 .
This is so called the Gaussian Unitary Ensembles (GUE).
Lemma 4.3.1 says that: If f : R → R is a Lipschitz function, then
n
1
F =√ f (λi )
n i=1

is a Lipschitz function of (real and imaginary) entries of X. If f is convex on the


real line, then F is convex on the space of matrices (Klein’s lemma). As a result,
we can use the general concentration principle to functions of the eigenvalues. For
example, if X is a GUE random matrix with variance σ 2 = 4n 1
, and if f : R → R
is 1-Lipschitz, for any t ≥ 0,
+ n
, 
1 2 2
P f (λi ) − f dμ t  2e−n t
, (4.51)
n i=1
4.14 Concentration for Functions of Large Random Matrices: Linear Spectral Statistics 247

 

n
where μ = E 1
n δλi is the mean spectral measure. Inequality (4.51) has the
i=1
2
n speed of the large deviation principles for spectral measures. With the additional
assumption of convexity on f , similar inequalities hold for real or complex matrices
with the entries that are independent with bounded support.2
Lemma 4.14.1 ([180, 224]). Let f : R → R be Lipschitz with Lipschitz constant
|f |L . X denotes the Hermitian (or symmetric) matrix with entries (Xij )1i,jn , the
map

(Xij )1i,jn → Tr (f (X))



is a Lipschitz function with constant n|f |L . If the joint law of (Xij )1i,jn is
“good”, there is α > 0, constant c > 0 and C < ∞ so that for all n ∈ N

P (|Tr (f (X)) − E Tr (f (X))|  t|f |L )  Ce−c|t| .


α

“Good” here has, for instance, the meaning that the law satisfies a log-Sobolev
inequality; an example is when (Xij )1i,jN are independent, Gaussian random
variables with uniformly bounded covariance.
The significance of results such as Lemma 4.14.1 is that they provide bounds
on deviations that do not depend on the dimension n of the random matrix. They
can be used to prove law of large numbers—reducing the proof of the almost sure
convergence to the proof of the convergence in expectation E. They can also be used
to relax the proof of a central limit theorem: when α = 2, Lemma 4.14.1 says that
the random variable Tr (f (X)) − E Tr (f (X)) has a sub-Gaussian tail, providing
tightness augments for free.
Theorem 4.14.2 ([225]). Let f TV be the total variation norm,

m
f TV = sup |f (xi ) − f (xi−1 )|.
x1 <···<xm
i=2

X is either Wigner or the Wishart matrices.  Then, for any t > 0 and any function f
  n 
with finite total variation norm so that E  n1 f (λi (X)) < ∞,
i=1

 - . 
1 n n 
 1  − 8cn t2
P  f (λi (X)) − E f (λi (X))   t f  2e X ,
n n  TV
i=1 i=1

2 The support of a function is the set of points where the function is not zero-valued, or the closure

of that set. In probability theory, the support of a probability distribution can be loosely thought of
as the closure of the set of possible values of a random variable having that distribution.
248 4 Concentration of Eigenvalues and Their Functionals

where cX = 1 for Wigner’s matrices and M/n for Wishart matrices.



n
Recall that f (λi (X)) = Tr (f (X)) . The above speed is not optimal for laws
i=1  n

n 
which have sufficiently fast decaying rates: f (λi (X)) − E f (λi (X)) is
i=1 i=1
of order one. However, it is optimal rate for instance for matrices whose entries have
heavy tails where the central limit theorem holds for

n
- n
.
1 1
√ f (λi (X)) − E f (λi (X)) .
n i=1
n i=1

In Theorem 4.14.2, we only require independence of the vectors, rather than the
entries.
A probability measure P on En is said to satisfy the logarithmic Sobolev
inequality with constant c, if for any differentiable function f : Rn → R, we have
  n  2
∂f
f 2 log f 2 − log f 2 dP 2c dP. (4.52)
i=1
∂xi

(4.52) implies sub-Gaussian tails, which we use commonly. See [141, 226] for a
general study. It is sufficient to know that Gaussian law satisfies the logarithmic
Sobolev inequality (4.52).
Lemma 4.14.3 (Herbst). Assume that P satisfies the logarithmic Sobolev inequal-
ity (4.52) on Rn with constant c. Let g be a Lipschitz function on Rn with Lipschitz
constant |g|L . Then, for all t ∈ R, we have

2
|f |2L /2
et(g−EP (g)) dP ect ,

and so for all t > 0


2
|f |2L /2
P (|g − EP (g)|  t)  2e−ct .

Lemma 4.14.3 implies that EP (g) is finite.


Klein’s lemma is given in [227].
Lemma 4.14.4 (Klein’s lemma [227]). Let f : R → R be a convex function. Then,
if A is the n × n Hermitian matrix with entries (Aij )1i,jn on the above the
diagonal, the map (function) ψf
n
2
ψf : (Aij )1i,jn ∈ Cn → f (λi (A))
i=1
4.15 Concentration of Quadratic Forms 249

is convex. If f is twice continuously differentiable with f  (x)  c for all x, ψf is


also twice continuously differentiable with Hessian bounded below by cI.
See [224] for a proof.

4.15 Concentration of Quadratic Forms

We follow [228] for this development. Let x ∈ Cn be a vector of independent


random variables ξ1 , . . . , ξn of mean zero and variance σ 2 , obeying the uniform
subexponential delay bound E (|ξi |  tc0 σ)  e−t for all t ≥ c1 and 1 ≤ i ≤ n,
and some c0 , c1 > 0 independent of n. Let A be an n × n matrix. Then for any
t > 0, one has
  
P x∗ Ax − σ 2 Tr A  tσ 2 (Tr A∗ A)  Ce−ct .
1/2 c

Thus
  
x∗ Ax = σ 2 Tr A + O t(Tr A∗ A)
1/2


outside of an event of probability O e−ct .
c

We consider an example (Likelihood Ratio Testing for Detection) : H1 is


claimed if

y∗ R−1 y > γLRT .

If we have a number of random vectors xi , we can consider the following


quadratic form

x∗i Axi = Tr x∗i Axi = Tr (x∗i Axi ) = Tr (Axi x∗i )
i i i i
- .
= Tr (Axi x∗i ) = Tr A (xi x∗i ) = Tr [AX] ,
i i


where X = (xi x∗i ).
i

Theorem 4.15.1. Let x and σ as the previous page, and let V a d-dimensional
complex subspace of Cn . Let PV be the orthogonal projection to V. Then one has
2
0.9dσ 2  PV (x)  1.1dσ 2
250 4 Concentration of Eigenvalues and Their Functionals


outside of an event of probability O e−cd .
c

Proof. Apply the preceding proposition with A = PV ;(so Tr A = Tr A∗ A = d)


and t = d1/2 /10. 
Example 4.15.2 (f (x) = Xv 2 ). Consider the n × n sample covariance matrix
R = n1 XT X, where X is the data matrix of n × p. We closely follow [229]. We
note the quadratic form
/ /2
/ 1 /
vT Rv = / √
/ n Xv / ,
/
2

where v is a column vector. We want to show that this function f (x) = Xv 2


is convex and 1-Lipschitz with respect to the Euclidean norm. The function f (x)
maps a vector Rnp to R and is defined by turning the vector x into the matrix X, by
first filling the rows of X, and then computing the Euclidean norm of Xv. In fact,
for θ ∈ [0, 1] and x, z ∈ Rnp ,

f (θx + (1 − θ) z) = (θX + (1 − θ) Z) v 2  θXv 2 + (1 − θ) Zv 2


= θf (x) + (1 − θ) f (z).

where we have used the triangular inequality of the Euclidean norm. This says that
the function f (x) is convex. Similarly,

|f (x) − f (z)| = | Xv 2 − Zv 2 |  (X − Z) v 2  X−Z F v 2 (4.53)


= x − z 2,

2
using the Cauchy-Schwartz inequality and the fact that v 2 = 1. Equation (4.53)
implies that the function f (x) is 1-Lipschitz. 
If Y is an n × p matrix, we naturally denote by Yi,j its (i, j) entry and call Ȳ
the matrix whose jth column is constant and equal to Ȳ·,j . The sample covariance
matrix of the data stored in matrix Y is
1  T 
Sp = Y − Ȳ Y − Ȳ .
n−1

We have
   
1 1
Y − Ȳ = In − 11T Y= In − 11T XG
n n

where In − n1 11T is the centering matrix. Here 1 is a column vector of ones and
In is the identity matrix of n × n. We often encounter the quadratic form
4.15 Concentration of Quadratic Forms 251

 
1 1
v T XT In − 11T Xv.
n−1 n

So the same strategy as above can be used, with the f defined as


/  /
/ 1 T /
f (x) = f (X) = / I −
/ n n 11 Xv / .
/
2

Now we use this function as another example to illustrate how to show a function is
convex and Lipschitz, which is required to apply the Talagrand’s inequality.
/ /
Example 4.15.3 (f (x) = f (X) = / In − n1 11T Xv/2 ). Convexity is a simple
consequence of the fact that norms are
/ convex. /
The Lipschitz coefficient is v 2 /In − n1 11T /2 . The eigenvalues of the matrix

In − n1 11T are n − 1 ones and one zero, i.e.,
   
1 1
λi In − 11T = 1, i = 1, . . . , n − 1, and λn In − 11T = 0.
n n

As a consequence, its operator norm, the largest singular value, is therefore 1, i.e.,
  / /
1 / 1 T/
σmax In − 11T = /
/ I n − 11 / = 1.
/
n n 2


We now apply the above results to justify the centering process in statistics. In
(statistical) practice, we almost always use the centered sample covariance matrix

1  T 
Sp = Y − Ȳ Y − Ȳ ,
n−1

in contrast to the (uncenterted) correlation matrix

1 T
S̃p = Y Y.
n
Sometimes there are debates as to whether one should use the correlation matrix
the data or their covariance matrix. It is therefore important for practitioners to
understand the behavior of correlation matrices in high dimensions. The matrix
norm (or operator
/ / norm) is the largest singular value. In general, the operator norm
/ /
/Sp − S̃p / does not go to zero. So a course bound of the type
op

   / /
  / /
λ1 (Sp ) − λ1 S̃p   /Sp − S̃p /
op
252 4 Concentration of Eigenvalues and Their Functionals

 
is nor enough to determine the behavior of λ1 (Sp ) from that of λ1 S̃p .
Letting the centering matrix H = In − n1 11T , we see that

Sp − S̃p = Hn Y

which is a linear matrix problem: a product of a deterministic matrix Hn and a


random matrix Y. Therefore, since the largest singular value σmax is a matrix norm,
we have
 √  √  √
σmax Y − Ȳ / n  σmax (Hn ) σmax Y/ n = σmax Y/ n

since Hn is a symmetric (thus Hermitian) matrix with (n − 1) eigenvalues equal to


1 and one eigenvalue equal to 0.
 T 
The correlation matrix n1 YT Y and covariance matrix n−1 1
Y − Ȳ Y − Ȳ
have, asymptotically, the same spectral distribution, see [229]. Letting l1 denote the
right endpoint of the support of this limiting distribution (if it exists), we have that
 √
lim inf σmax Y − Ȳ / n − 1  l1 .
/ /
So, when / n1 YT Y/op → l1 , we have
/ /
/ 1   /
/ T
Y − Ȳ /
/ n − 1 Y − Ȳ / → l1 .
op

This justifies the assertion that when the norm of a sample covariance matrix (which
is not re-centered) whose entries have mean 0 converges to the right endpoint of the
support of its limiting spectral distribution, so does the norm of the centered sample
covariance matrix.
When dealing with Sp , the mean of the entries of Y does not matter, so we can
assume, without loss of generality, that the mean is zero.
Let us deal with another type of concentration in the quadratic form. Suppose
that the random vector x ∈ Rn has the property: For any convex and 1-Lipschitz
function (with respect to the Euclidean norm) F : Rn → R, we have,

P (|F (x) − mF | > t)  C exp −c (n) t2 ,

where mF is the median of F (x), and C and c(n) are independent of the
dimension n. We allow c(n) to be a constant or to go to zero with n like n−α , 0 
α  1. Suppose, further, that the random vector has zero mean and variance Σ,
E (x) = 0, E (xx∗ ) = Σ, with the bound of the operator norm (the largest singular
value) given by Σ op  log (n) .
Consider a complex deterministic matrix M such that M op  κ, where
κ is independent of dimension n, then the quadratic form n1 x∗ Mx is strongly
concentrated around its mean n1 Tr (MΣ) .
4.15 Concentration of Quadratic Forms 253

In particular, if for ε > 0,


   
1 1 
log P  x∗ Mx − Tr (MΣ) > tn (ε)
1+2ε
# − log (n) . (4.54)
n n

where # denotes the asymptotics for large n. If there is a non-zero expectation


E (x) = 0, then the same results are true when one replaces x with x − E (x)
everywhere and Σ is the covariance of x.
If the bounding parameter κ is allowed to vary with the dimension n, then the
same results hold when one replace tn (ε) by tn (ε) κ, or equivalently, divides M
by ε.
Proof. We closely follow [229], changing to our notation. A complex matrix M
can be decomposed into the real part and imaginary part M = Mr + jMi ,
where Mr and Mi are real matrices. On the other hand, the spectral norm of
those matrices is upper bounded by κ. This is of course true for the real part since
Mr = (M + M∗ ) /2.
For a random vector x, the quadratic form x∗ Ax = Tr (Axx∗ ) is a linear
functional of the complex matrix A and the rank-1 random matrix xx∗ . Note that
the trace function Tr is a linear functional. Let A = Mr + jMi . It follows that

x∗ (Mr + jMi ) x = Tr ((Mr + jMi ) xx∗ )


= Tr (Mr xx∗ ) + jTr (Mr xx∗ )
= x∗ Mr x + jx∗ Mi x.

In other words, strong concentration for x∗ Mr x and x∗ Mi x will imply strong con-
 ∗
centration for the sum of those two terms (quadratic forms). x∗ Ax = xi Aii xi ,
i

which is real for real numbers Aii . So x∗ Mr x is real, (x∗ Mr x) = x∗ Mr x and
 
∗ M + M∗
x x.
2

Hence, instead of working on Mr , we can work on the symmetrized version.


We now decompose Mr + MTr /2 into the positive part and negative part
M+ + M− , where M+ is positive semidefinite and −M− is positive semidefinite
T
(or 0 if TMr + Mr /2 is also positive semidefinite). This is possible since
Mr + Mr /2 is real symmetric. We can carry out this decomposition by simply
following its spectral decomposition. As pointed out before, both matrices have
spectral norm less than κ.
Now we are in a position to consider the functional, which is our main purpose.
The map
 √
φ : x → x∗ M+ x/ n
254 4 Concentration of Eigenvalues and Their Functionals


is K-Lipschitz with the coefficient K = κ/n, with respect to the Euclidean norm.
The map is also convex, by noting that
 √ / /
/ 1/2 √ /
φ : x → x∗ M+ x/ n = /M+ x/ n/ .
2
"  
All norms are convex. A = A 2 is the Euclidean norm. 
2 i,j
i,j

Theorem 4.15.4 (Lemma A.2 of El Karoui (2010) [230]). Suppose the random
vector z is a vector in Rn with i.i.d. entries of mean 0 and variance σ 2 . The
covariance matrix of z is Σ. Suppose that the entries of z are bounded by 1.
Let A be a symmetric matrix, with the largest singular value σ1 (A). Set Cn =
128 exp(4π)σ1 (A) /n. Then, for all t/2 > Cn , we have

   √ 2
−n(t/2−Cn )2 /32/ 1+2 σ1 (Σ)
P  n1 zT Az − n1 σ 2 Tr(A) > t  8 exp(4π)e
/σ1 (A)
 √ 2
−n/32/ 1+2 σ1 (Σ) /σ1 (A)
+8 exp(4π)e .

Talagrand’s works [148,231] can be viewed as the infinite-dimensional analogue


of Bernstein’s inequality. Non-asymptotic bounds for statistics [136, 232, 233] are
studied for model selection. For tutorial, we refer to [161, 234]. Bechar [235]
establishes a new Bernstein-type inequality which controls quadratic forms of
Gaussian random variables.
Theorem 4.15.5 (Bechar [235]). Let a = (ai )i=1,...,n and b = (bi )i=1,...,n be
two n-dimensional real vectors. z = (zi )i=1,...,n is an n-dimensional standard
Gaussian random vector, i.e., zi , i = 1, . . . , n are i.i.d.
 zero-mean Gaussian
 random
variables with standard deviation 1. Set a+ = sup sup {ai } , 0 , and a− =
  i=1,...,n

sup sup {−ai } , 0 . Then the two concentration inequalities hold true for all
i=1,...,n
t > 0,
⎛ : ⎞
;
n
 n
; n
 √
P⎝ ai zi2 + bi zi  a2 + 2<
i a2i + 12 b2i · t + 2a+ t⎠  e−t
i=1 i=1 i=1
⎛ : ⎞
;
n
 n
; n
 √
P⎝ ai zi2 + bi zi  ai − 2<
2
a2i + 12 b2i · t − 2a− t⎠  e−t .
i=1 i=1 i=1

Theorem 4.15.6 (Real-valued quadratic forms—Bechar [235]). Consider the


random expression zT Az + bT z, where A is n × n real square matrix, b is
an n-dimensional real vector, and z = (zi )i=1,...,n is an n-dimensional standard
Gaussian random vector, i.e., zi , i = 1, . . . , n are i.i.d. zero-mean Gaussian
4.15 Concentration of Quadratic Forms 255

random variables with standard deviation 1. Let us denote by λi , i = 1, . . . , n


1 T
the 
eigenvalues of the
 symmetric matrix 2 A + A , and  let us put λ+ =
sup sup {λi } , 0 , and λ− = sup sup {−λi } , 0 . Then, the following
i=1,...,n i=1,...,n
two concentration results hold true for all t > 0
" 
1 1 √
 e−t
2 2
P z Az + b z  Tr (A) + 2
T T
A + AT + b · t + 2λ+ t
4 2
" 
1 1 √
 e−t .
2 2
P z Az + b z  Tr (A) − 2
T T
A + AT + b · t − 2λ− t
4 2
(4.55)
Theorem 4.15.6 for quadratic forms of real-valued Gaussian random variables is extended to the complex case in the following theorem for Bernstein's inequality. Let $\mathbf{I}_n$ be the identity matrix of $n \times n$.
Theorem 4.15.7 (Complex-valued quadratic forms—Wang, So, Chang, Ma and Chi [236]). Let $\mathbf{z}$ be a complex Gaussian vector with zero mean and covariance matrix $\mathbf{I}_n$, i.e., $\mathbf{z} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}_n)$, let $\mathbf{Q}$ be an $n \times n$ Hermitian matrix, and let $\mathbf{y} \in \mathbb{C}^n$. Then, for any $t > 0$, we have
$$\mathbb{P}\left( \mathbf{z}^H\mathbf{Q}\mathbf{z} + 2\operatorname{Re}\left(\mathbf{z}^H\mathbf{y}\right) \ge \operatorname{Tr}(\mathbf{Q}) - \sqrt{2t}\sqrt{\left\|\mathbf{Q}\right\|_F^2 + 2\left\|\mathbf{y}\right\|^2} - t\,\lambda_+(\mathbf{Q}) \right) \ge 1 - e^{-t}, \quad (4.56)$$
with $\lambda_+(\mathbf{Q}) = \max\left\{ \lambda_{\max}(-\mathbf{Q}), 0 \right\}$.
Equation (4.56) is used to bound the probability that the quadratic form $\mathbf{z}^H\mathbf{Q}\mathbf{z} + 2\operatorname{Re}\left(\mathbf{z}^H\mathbf{y}\right)$ of complex Gaussian random variables deviates from its mean $\operatorname{Tr}(\mathbf{Q})$.
Theorem 4.15.8 (Lopes, Jacob, Wainwright [237]). Let $\mathbf{A} \in \mathbb{R}^{n\times n}$ be a positive semidefinite matrix with $\|\mathbf{A}\|_{\mathrm{op}} > 0$, and let $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{n\times n})$ be a Gaussian vector. Then, for any $t > 0$,
$$\mathbb{P}\left( \frac{\mathbf{z}^T\mathbf{A}\mathbf{z}}{\operatorname{Tr}(\mathbf{A})} \ge \left( 1 + t\sqrt{\frac{\|\mathbf{A}\|_{\mathrm{op}}}{\operatorname{Tr}(\mathbf{A})}} \right)^2 \right) \le e^{-t^2/2}, \quad (4.57)$$
and for any $t \in \left( 0, \sqrt{\frac{\operatorname{Tr}(\mathbf{A})}{\|\mathbf{A}\|_{\mathrm{op}}}} - 1 \right)$, we have
$$\mathbb{P}\left( \frac{\mathbf{z}^T\mathbf{A}\mathbf{z}}{\operatorname{Tr}(\mathbf{A})} \le \left( 1 - \sqrt{\frac{\|\mathbf{A}\|_{\mathrm{op}}}{\operatorname{Tr}(\mathbf{A})}} - t\sqrt{\frac{\|\mathbf{A}\|_{\mathrm{op}}}{\operatorname{Tr}(\mathbf{A})}} \right)^2 \right) \le e^{-t^2/2}. \quad (4.58)$$
Here $\|\mathbf{X}\|_{\mathrm{op}}$ denotes the operator or matrix norm (largest singular value) of a matrix $\mathbf{X}$. It is obvious that $\|\mathbf{A}\|_{\mathrm{op}}/\operatorname{Tr}(\mathbf{A})$ is less than or equal to 1. The error terms involve the operator norm as opposed to the Frobenius norm and hence are of independent interest.
Proof. We follow [237] for a proof. First, $f(\mathbf{z}) = \sqrt{\mathbf{z}^T\mathbf{A}\mathbf{z}} = \left\|\mathbf{A}^{1/2}\mathbf{z}\right\|_2$ has Lipschitz constant $\sqrt{\|\mathbf{A}\|_{\mathrm{op}}}$ with respect to the Euclidean norm on $\mathbb{R}^n$. By the Cirel'son-Ibragimov-Sudakov inequality for Lipschitz functions of Gaussian vectors [161], it follows that for any $s > 0$,
$$\mathbb{P}\left( f(\mathbf{z}) \le \mathbb{E}\left[f(\mathbf{z})\right] - s \right) \le \exp\left( -\frac{s^2}{2\|\mathbf{A}\|_{\mathrm{op}}} \right). \quad (4.59)$$
From the Poincaré inequality for Gaussian measures [238], the variance of $f(\mathbf{z})$ is bounded above as
$$\operatorname{var}\left[f(\mathbf{z})\right] \le \|\mathbf{A}\|_{\mathrm{op}}.$$
Since $\mathbb{E}\left[f(\mathbf{z})^2\right] = \operatorname{Tr}(\mathbf{A})$, the expectation of $f(\mathbf{z})$ is lower bounded as
$$\mathbb{E}\left[f(\mathbf{z})\right]^2 \ge \operatorname{Tr}(\mathbf{A}) - \|\mathbf{A}\|_{\mathrm{op}}.$$
Inserting this lower bound into the concentration inequality of (4.59) gives
$$\mathbb{P}\left( f(\mathbf{z}) \le \sqrt{\operatorname{Tr}(\mathbf{A}) - \|\mathbf{A}\|_{\mathrm{op}}} - s \right) \le \exp\left( -\frac{s^2}{2\|\mathbf{A}\|_{\mathrm{op}}} \right).$$
Let us turn to the bound (4.57). We start with the upper-tail version of (4.59), that is $\mathbb{P}\left( f(\mathbf{z}) \ge \mathbb{E}\left[f(\mathbf{z})\right] + s \right) \le \exp\left( -\frac{s^2}{2\|\mathbf{A}\|_{\mathrm{op}}} \right)$ for $s > 0$. By Jensen's inequality, it follows that
$$\mathbb{E}\left[f(\mathbf{z})\right] = \mathbb{E}\sqrt{\mathbf{z}^T\mathbf{A}\mathbf{z}} \le \sqrt{\mathbb{E}\left[\mathbf{z}^T\mathbf{A}\mathbf{z}\right]} = \sqrt{\operatorname{Tr}(\mathbf{A})},$$
from which we have $\mathbb{P}\left( f(\mathbf{z}) \ge \sqrt{\operatorname{Tr}(\mathbf{A})} + s \right) \le \exp\left( -\frac{s^2}{2\|\mathbf{A}\|_{\mathrm{op}}} \right)$, and setting $s^2 = t^2\|\mathbf{A}\|_{\mathrm{op}}$ for $t > 0$ gives the result of (4.57). $\square$
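As a quick numerical sanity check (not part of the original development; the Wishart-type test matrix, dimensions, and trial counts below are illustrative choices), the upper-tail bound (4.57) can be compared against Monte Carlo estimates. A minimal Python/NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 20000

# A fixed positive semidefinite matrix (Wishart-type construction).
G = rng.standard_normal((n, n))
A = G @ G.T
tr_A, op_A = np.trace(A), np.linalg.norm(A, 2)   # trace and operator norm

# Monte Carlo samples of the normalized quadratic form z^T A z / Tr(A).
Z = rng.standard_normal((trials, n))
q = np.einsum('ij,ij->i', Z @ A, Z) / tr_A

for t in (1.0, 2.0, 3.0):
    threshold = (1.0 + t * np.sqrt(op_A / tr_A)) ** 2   # right-hand side of (4.57)
    print(f"t={t}: empirical tail {np.mean(q >= threshold):.4f} "
          f"<= bound {np.exp(-t ** 2 / 2):.4f}")
```

The empirical exceedance frequencies should fall below the $e^{-t^2/2}$ bound for each $t$.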
In Sect. 10.18, we apply Theorem 4.15.8 for the two-sample test in high dimensions. We present small ball probability estimates for linear and quadratic functions of Gaussian random variables. $C$ is a universal constant. The Frobenius norm is defined as $\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} A_{i,j}^2} = \sqrt{\operatorname{Tr}\left(\mathbf{A}^T\mathbf{A}\right)}$. For a vector $\mathbf{x}$, $\|\mathbf{x}\|$ denotes the standard Euclidean norm—the $\ell_2$-norm.
Theorem 4.15.9 (Concentration of Gaussian random variables [239]). Let $a, b \in \mathbb{R}$. Assume that $\max\{|a|, |b|\} \ge \alpha > 0$. Let $X \sim \mathcal{N}(0,1)$. Then, for $t > 0$,
$$\mathbb{P}\left( |a + bX| \le t \right) \le Ct/\alpha. \quad (4.60)$$
Theorem 4.15.10 (Concentration of Gaussian random vector [239]). Let $\mathbf{a} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n\times n}$. Assume that
$$\max\left\{ \|\mathbf{a}\|_2, \|\mathbf{A}\|_F \right\} \ge \alpha > 0.$$
Let $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)$. Then, for $t > 0$,
$$\mathbb{P}\left( \|\mathbf{a} + \mathbf{A}\mathbf{z}\|_2 \le t \right) \le Ct\sqrt{n}/\alpha. \quad (4.61)$$
Theorem 4.15.11 ([239]). Let $\mathbf{a} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n\times n}$. Let $\mathbf{z} \sim \mathcal{N}\left(\boldsymbol{\mu}, \sigma^2\mathbf{I}_n\right)$ for some $\boldsymbol{\mu} \in \mathbb{R}^n$. Then, for $t > 0$,
$$\mathbb{P}\left( \|\mathbf{a} + \mathbf{A}\mathbf{z}\|_2 \le t\sigma\sqrt{2}\,\|\mathbf{A}\|_F \right) \le Ct\sqrt{n}.$$
Theorem 4.15.12 (Concentration of quadrics in Gaussian random variables [239]). Let $a \in \mathbb{R}$, $\mathbf{b} \in \mathbb{R}^n$ and let $\mathbf{A} \in \mathbb{R}^{n\times n}$ be a symmetric matrix. Assume that $|a| \ge \alpha > 0$. Let $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)$. Then, for $t > 0$,
$$\mathbb{P}\left( \left| a + \mathbf{z}^T\mathbf{b} + \mathbf{z}^T\mathbf{A}\mathbf{z} \right| \le t \right) \le Ct^{1/6}\sqrt{n}/\alpha. \quad (4.62)$$
Example 4.15.13 (Hypothesis testing).
$$H_0: \mathbf{y} = \mathbf{x}$$
$$H_1: \mathbf{y} = \mathbf{x} + \mathbf{z}$$
where $\mathbf{x}$ is independent of $\mathbf{z}$. See also Example 7.2.13. $\square$
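The example is stated abstractly. As one hedged illustration (the Gaussian noise model and signal level below are assumptions of this sketch, not of the text), suppose $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)$ plays the role of noise. Theorem 4.15.6 with $\mathbf{A} = \mathbf{I}_n$ and $\mathbf{b} = \mathbf{0}$ then gives an explicit false-alarm threshold for the energy statistic $\|\mathbf{y}\|^2$, since $\operatorname{Tr}(\mathbf{A}) = n$, $\tfrac{1}{4}\|\mathbf{A}+\mathbf{A}^T\|_F^2 = n$, and $\lambda_+ = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 20000
t = np.log(100.0)                                  # target false-alarm rate e^{-t} = 1%

# Threshold from Theorem 4.15.6 with A = I_n, b = 0.
tau = n + 2.0 * np.sqrt(n * t) + 2.0 * t

x = rng.standard_normal((trials, n))               # H0: y = x (noise only, assumed Gaussian)
y1 = x + rng.standard_normal((trials, n))          # H1: y = x + z (illustrative signal z)

print("false-alarm rate:", np.mean(np.sum(x ** 2, axis=1) > tau), "(target <= 0.01)")
print("detection rate  :", np.mean(np.sum(y1 ** 2, axis=1) > tau))
```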

4.16 Distance Between a Random Vector and a Subspace

Let $\mathbf{P} = (p_{ij})_{1\le i,j\le n}$ be the $n \times n$ orthogonal projection matrix onto $V$. Obviously, we have
$$\operatorname{Tr}\left(\mathbf{P}^2\right) = \operatorname{Tr}(\mathbf{P}) = \sum_{i=1}^n p_{ii} = d$$
and $|p_{ij}| \le 1$. Furthermore, the distance between a random vector $\mathbf{x}$ and a subspace $V$ is defined by the orthogonal projection matrix $\mathbf{P}$ in a quadratic form
$$\operatorname{dist}^2(\mathbf{x}, V) = (\mathbf{P}\mathbf{x})^*(\mathbf{P}\mathbf{x}) = \mathbf{x}^*\mathbf{P}^*\mathbf{P}\mathbf{x} = \mathbf{x}^*\mathbf{P}^2\mathbf{x} = \mathbf{x}^*\mathbf{P}\mathbf{x} = \sum_{i=1}^n p_{ii}|\xi_i|^2 + \sum_{1\le i\ne j\le n} p_{ij}\xi_i\xi_j^*. \quad (4.63)$$
For instance, for a spectral decomposition of $\mathbf{A}$, $\mathbf{A} = \sum_{i=1}^n \lambda_i \mathbf{u}_i\mathbf{u}_i^H$, we can define $\mathbf{P} = \sum_{i=1}^d \mathbf{u}_i\mathbf{u}_i^H$, where $d \le n$. For a random matrix $\mathbf{A}$, the projection matrix $\mathbf{P}$ is also a random matrix. What is surprising is that the distance between a random vector and a subspace—the orthogonal projection of a random vector onto a large space—is strongly concentrated. This tool has a geometric flavor.
The distance $\operatorname{dist}(\mathbf{x}, V)$ is a (scalar-valued) random variable. It is easy to show that $\mathbb{E}\operatorname{dist}^2(\mathbf{x}, V) = d$, so that it is indeed natural to expect that with high probability $\operatorname{dist}(\mathbf{x}, V)$ is around $\sqrt{d}$.
We use a complex version of Talagrand's inequality, obtained by slightly modifying the proof [141].
Theorem 4.16.1. Let $D$ be the unit disk $\{z \in \mathbb{C}: |z| \le 1\}$. For every product probability $\mathbb{P}$ on $D^n$, every convex 1-Lipschitz function $f: \mathbb{C}^n \to \mathbb{R}$ and every $t \ge 0$,
$$\mathbb{P}\left( |f - M(f)| \ge t \right) \le 4e^{-t^2/16},$$
where $M(f)$ denotes the median of $f$.
The result still holds for the space $D_1 \times \ldots \times D_n$, where the $D_i$ are complex regions with diameter 2. An easy change of variable reveals the following generalization of this inequality. If we consider the probability for a dilate $K \cdot D^n$ of the unit disk for some $K > 0$, rather than $D^n$ itself, then for every $t \ge 0$, we have instead
$$\mathbb{P}\left( |f - M(f)| \ge t \right) \le 4e^{-t^2/16K^2}. \quad (4.64)$$
Theorem 4.16.1 shows concentration around the median. In applications, it is more useful to have concentration around the mean. This can be done following the well-known lemma [141, 167], which shows that concentration around the median implies that the mean and the median are close.
Lemma 4.16.2. Let $X$ be a random variable such that for any $t \ge 0$,
$$\mathbb{P}\left( |X - M(X)| \ge t \right) \le 4e^{-t^2}.$$
Then
$$|\mathbb{E}(X) - M(X)| \le 100.$$
The bound 100 is ad hoc and can be replaced by a much smaller constant.
Proof. Set $M = M(X)$ and let $F(x)$ be the distribution function of $X$. Splitting the integral over unit-length intervals around $M$ and using the tail bound on each interval, we have
$$\mathbb{E}(X) = \int x\, dF(x) \le M + \sum_{i\ge 1} 4\,i\, e^{-i^2} \le M + 100.$$
We can prove the lower bound similarly. $\square$


Now we are in a position to state a lemma and present its proof.
Theorem 4.16.3 (Projection Lemma, Lemma 68 of [240]). Let $\mathbf{x} = (\xi_1, \ldots, \xi_n) \in \mathbb{C}^n$ be a random vector whose entries are independent with mean zero, variance 1, and are bounded in magnitude by $K$ almost surely for some $K \ge 10\left( \mathbb{E}|\xi|^4 + 1 \right)$. Let $V$ be a subspace of dimension $d$ and let $\operatorname{dist}(\mathbf{x}, V)$ be the length of the orthogonal projection of $\mathbf{x}$ onto $V$, as in (4.63). Then
$$\mathbb{P}\left( \left| \operatorname{dist}(\mathbf{x}, V) - \sqrt{d} \right| > t \right) \le 10\exp\left( -\frac{t^2}{10K^2} \right).$$
In particular, one has
$$\operatorname{dist}(\mathbf{x}, V) = \sqrt{d} + O(K\log n)$$
with overwhelming probability.


Proof. We follow [240, Appendix B] closely. See also [241, Appendix E]. This
proof is a simple generalization of Tao and Vu in [242].
The standard approach of using Talagrand’s inequality is to study the functional
property of f = dist (x, V) as a function of vector x ∈ Cn . The functional map

x → |dist (x, V)|

is clearly convex and 1-Lipschitz. Applying (4.64), we have that


2
/16K 2
P (|dist (x, V) − M (dist (x, V))|  t)  4e−t

for any t > 0.


To conclude the proof, it is sufficient to show the following
260 4 Concentration of Eigenvalues and Their Functionals

 √ 

M (X) − d  2K.

Consider the event E+ that |dist (x, V)|  d + 2K, which implies that

2

|dist (x, V)|  d + 4K d + 4K 2 .

From the definition of the orthogonal projection (4.63), we have
$$\mathbb{P}(E_+) \le \mathbb{P}\left( \sum_{i=1}^n p_{ii}|\xi_i|^2 \ge d + 2K\sqrt{d} \right) + \mathbb{P}\left( \Big| \sum_{1\le i\ne j\le n} p_{ij}\xi_i\xi_j^* \Big| \ge 2K\sqrt{d} \right).$$
Set $S_1 = \sum_{i=1}^n p_{ii}\left( |\xi_i|^2 - 1 \right)$. Then it follows from Chebyshev's inequality that
$$\mathbb{P}\left( \sum_{i=1}^n p_{ii}|\xi_i|^2 \ge d + 2K\sqrt{d} \right) \le \mathbb{P}\left( |S_1| \ge 2K\sqrt{d} \right) \le \frac{\mathbb{E}|S_1|^2}{4dK^2}.$$
On the other hand, by the assumption on $K$,
$$\mathbb{E}|S_1|^2 = \sum_{i=1}^n p_{ii}^2\, \mathbb{E}\left( |\xi_i|^2 - 1 \right)^2 = \sum_{i=1}^n p_{ii}^2\left( \mathbb{E}|\xi_i|^4 - 1 \right) \le \sum_{i=1}^n p_{ii}^2 K \le dK.$$
Thus we have
$$\mathbb{P}\left( |S_1| \ge 2K\sqrt{d} \right) \le \frac{\mathbb{E}|S_1|^2}{4dK^2} \le \frac{1}{4K} \le \frac{1}{10}.$$
Similarly, set $S_2 = \Big| \sum_{1\le i\ne j\le n} p_{ij}\xi_i\xi_j^* \Big|$. Then we have $\mathbb{E}S_2^2 = \sum_{i\ne j} |p_{ij}|^2 \le d$. Again, using Chebyshev's inequality, we have
$$\mathbb{P}\left( |S_2| \ge 2K\sqrt{d} \right) \le \frac{d}{4dK^2} \le \frac{1}{K} \le \frac{1}{10}.$$
Combining the results, it follows that $\mathbb{P}(E_+) \le \frac{1}{5}$, and so $M\left(\operatorname{dist}(\mathbf{x}, V)\right) \le \sqrt{d} + 2K$.
To prove the lower bound, let $E_-$ be the event that $\operatorname{dist}(\mathbf{x}, V) \le \sqrt{d} - 2K$, which implies that $\operatorname{dist}^2(\mathbf{x}, V) \le d - 4K\sqrt{d} + 4K^2$. We thus have
$$\mathbb{P}(E_-) \le \mathbb{P}\left( \operatorname{dist}^2(\mathbf{x}, V) \le d - 2K\sqrt{d} \right) \le \mathbb{P}\left( S_1 \le -K\sqrt{d} \right) + \mathbb{P}\left( |S_2| \ge K\sqrt{d} \right).$$
Both terms on the right-hand side can be bounded by $\frac{1}{5}$ by the same arguments as above. The proof is complete. $\square$
Example (Similarity-Based Hypothesis Detection). Let the subspace $V$ be the subspace spanned by the first $k$ eigenvectors (corresponding to the largest $k$ eigenvalues) of a data-derived matrix such as a sample covariance matrix. By Theorem 4.16.3, for a fresh random vector with independent normalized entries, $\operatorname{dist}(\mathbf{x}, V)$ concentrates around $\sqrt{k}$, so a large deviation of the observed projection length from $\sqrt{k}$ can serve as a test statistic, as sketched below.
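A minimal numerical sketch (the planted-subspace model and all dimensions below are assumptions of this illustration, not of the text): estimate $V$ from training data, project fresh isotropic vectors onto it, and observe the concentration of $\operatorname{dist}(\mathbf{x}, V) = \|\mathbf{P}\mathbf{x}\|_2$ around $\sqrt{k}$ promised by Theorem 4.16.3.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, k, trials = 100, 500, 5, 2000

# Training data with a planted k-dimensional signal subspace.
U = np.linalg.qr(rng.standard_normal((n, k)))[0]       # orthonormal basis of the planted subspace
train = rng.standard_normal((N, k)) @ (3.0 * U.T) + rng.standard_normal((N, n))
S = train.T @ train / N                                # sample covariance matrix
V = np.linalg.eigh(S)[1][:, -k:]                       # top-k eigenvectors span V
P = V @ V.T                                            # orthogonal projection onto V

# For fresh isotropic vectors, ||P x||_2 concentrates around sqrt(k).
x = rng.standard_normal((trials, n))
proj_len = np.linalg.norm(x @ P, axis=1)
print(f"mean ||Px|| = {proj_len.mean():.3f}, sqrt(k) = {np.sqrt(k):.3f}, "
      f"std = {proj_len.std():.3f}")
```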

4.17 Concentration of Random Matrices in the Stieltjes Transform Domain

As the Fourier transform is the tool of choice for a linear time-invariant system, the Stieltjes transform is the fundamental tool for studying a random matrix. For a random matrix $\mathbf{A}$ of $n \times n$, we define the Stieltjes transform as
$$m_{\mathbf{A}}(z) = \frac{1}{n}\operatorname{Tr}\left( \mathbf{A} - z\mathbf{I} \right)^{-1}, \quad (4.65)$$
where $\mathbf{I}$ is the identity matrix of $n \times n$. Here $z$ is a complex variable.
Consider $\mathbf{x}_i$, $i = 1, \ldots, N$ independent random (column) vectors in $\mathbb{R}^n$. Each $\mathbf{x}_i\mathbf{x}_i^T$ is a rank-one matrix of $n \times n$. We often consider the sample covariance matrix of $n \times n$, which is expressed as the sum of $N$ rank-one matrices
$$\mathbf{S} = \sum_{i=1}^N \mathbf{x}_i\mathbf{x}_i^T = \mathbf{X}^T\mathbf{X},$$
where the data matrix $\mathbf{X}$ of $N \times n$ is defined as
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}.$$
Similar to the case of the Fourier transform, it is more convenient to study the Stieltjes transform of the sample covariance matrix $\mathbf{S}$, defined as
$$m_{\mathbf{S}}(z) = \frac{1}{n}\operatorname{Tr}\left( \mathbf{S} - z\mathbf{I} \right)^{-1}.$$
It is remarkable that $m_{\mathbf{S}}(z)$ is strongly concentrated, as first shown in [229]. Letting $\operatorname{Im}[z] = v$, we have [229]
$$\mathbb{P}\left( |m_{\mathbf{S}}(z) - \mathbb{E}m_{\mathbf{S}}(z)| > t \right) \le 4\exp\left( -t^2 n^2 v^2 / (16N) \right). \quad (4.66)$$

Note that (4.66) makes no assumption whatsoever about the structure of the vectors $\{\mathbf{x}_i\}_{i=1}^N$, other than the fact that they are independent.
Proof. We closely follow El Karoui [229] for a proof. The argument uses a sum of martingale differences, followed by Azuma's inequality [141, Lemma 4.1]. We define $\mathbf{S}_k = \mathbf{S} - \mathbf{x}_k\mathbf{x}_k^T$. Let $\mathcal{F}_i$ denote the filtration generated by the random vectors $\{\mathbf{x}_l\}_{l=1}^i$. The first classical step is (from Bai [198, p. 649]) to express the random variable of interest as a sum of martingale differences:
$$m_{\mathbf{S}}(z) - \mathbb{E}m_{\mathbf{S}}(z) = \sum_{k=1}^N \left[ \mathbb{E}\left( m_{\mathbf{S}}(z)\,|\,\mathcal{F}_k \right) - \mathbb{E}\left( m_{\mathbf{S}}(z)\,|\,\mathcal{F}_{k-1} \right) \right].$$
Note that
$$\mathbb{E}\left( \operatorname{Tr}\left( \mathbf{S}_k - z\mathbf{I} \right)^{-1} \big|\,\mathcal{F}_k \right) = \mathbb{E}\left( \operatorname{Tr}\left( \mathbf{S}_k - z\mathbf{I} \right)^{-1} \big|\,\mathcal{F}_{k-1} \right).$$
So we have that
$$\left| \mathbb{E}\left( m_{\mathbf{S}}(z)|\mathcal{F}_k \right) - \mathbb{E}\left( m_{\mathbf{S}}(z)|\mathcal{F}_{k-1} \right) \right| \le \left| \mathbb{E}\left( m_{\mathbf{S}}(z)|\mathcal{F}_k \right) - \mathbb{E}\left( \tfrac{1}{n}\operatorname{Tr}\left( \mathbf{S}_k - z\mathbf{I} \right)^{-1}\big|\mathcal{F}_k \right) \right| + \left| \mathbb{E}\left( \tfrac{1}{n}\operatorname{Tr}\left( \mathbf{S}_k - z\mathbf{I} \right)^{-1}\big|\mathcal{F}_{k-1} \right) - \mathbb{E}\left( m_{\mathbf{S}}(z)|\mathcal{F}_{k-1} \right) \right| \le \frac{2}{nv}.$$
The last inequality follows from [243, Lemma 2.6]. As a result, the desired random variable $m_{\mathbf{S}}(z) - \mathbb{E}m_{\mathbf{S}}(z)$ is a sum of bounded martingale differences. The same is true for both its real and imaginary parts. For both of them, we apply Azuma's inequality [141, Lemma 4.1] to obtain that
$$\mathbb{P}\left( \left| \operatorname{Re}\left( m_{\mathbf{S}}(z) - \mathbb{E}m_{\mathbf{S}}(z) \right) \right| > t \right) \le 2\exp\left( -t^2 n^2 v^2 / (8N) \right),$$
and similarly for its imaginary part. We thus conclude that
$$\mathbb{P}\left( |m_{\mathbf{S}}(z) - \mathbb{E}m_{\mathbf{S}}(z)| > t \right) \le \mathbb{P}\left( \left| \operatorname{Re}\left( m_{\mathbf{S}}(z) - \mathbb{E}m_{\mathbf{S}}(z) \right) \right| > t/\sqrt{2} \right) + \mathbb{P}\left( \left| \operatorname{Im}\left( m_{\mathbf{S}}(z) - \mathbb{E}m_{\mathbf{S}}(z) \right) \right| > t/\sqrt{2} \right) \le 4\exp\left( -t^2 n^2 v^2 / (16N) \right). \quad \square$$



The decay rate given by Azuma's inequality does not match the rate that appears in results concerning the concentration behavior of linear spectral statistics (see Sect. 4.14). The rate given by Azuma's inequality is $\sqrt{n}$, rather than $n$. This decay rate is very important in practice: it is explicitly related to the detection probability and the false alarm rate in a hypothesis testing problem.
The results using Azuma's inequality can handle many situations that are not covered by the current results on linear spectral statistics in Sect. 4.14. The "correct" rate can be recovered using ideas similar to Sect. 4.14. As a matter of fact, if we consider the Stieltjes transform of the measure that puts mass $1/n$ at each singular value of the sample covariance matrix $\mathbf{M} = \mathbf{X}^*\mathbf{X}/N$, it is an easy exercise to show that this function of $\mathbf{X}$ is $K$-Lipschitz with the Lipschitz coefficient $K = 1/\left( \sqrt{nN}\, v^2 \right)$, with respect to the Euclidean (or Frobenius, or Hilbert-Schmidt) norm.
Example 4.17.1 (Independent Random Vectors). Consider the hypothesis testing problem
$$H_0: \mathbf{y}_i = \mathbf{w}_i, \quad i = 1, \ldots, N$$
$$H_1: \mathbf{y}_i = \mathbf{x}_i + \mathbf{w}_i, \quad i = 1, \ldots, N$$
where $N$ independent vectors are considered. Here $\mathbf{w}_i$ is a random noise vector and $\mathbf{x}_i$ is a random signal vector. Considering the sample covariance matrices, we have
$$H_0: \mathbf{S} = \sum_{i=1}^N \mathbf{y}_i\mathbf{y}_i^T = \sum_{i=1}^N \mathbf{w}_i\mathbf{w}_i^T = \mathbf{W}\mathbf{W}^T,$$
$$H_1: \mathbf{S} = \sum_{i=1}^N \mathbf{y}_i\mathbf{y}_i^T = \sum_{i=1}^N \left( \mathbf{x}_i + \mathbf{w}_i \right)\left( \mathbf{x}_i + \mathbf{w}_i \right)^T = \left( \mathbf{X} + \mathbf{W} \right)\left( \mathbf{X} + \mathbf{W} \right)^T.$$
Taking the Stieltjes transform leads to
$$H_0: m_{\mathbf{S}}(z) = \frac{1}{n}\operatorname{Tr}\left( \mathbf{W}\mathbf{W}^T - z\mathbf{I} \right)^{-1},$$
$$H_1: m_{\mathbf{S}}(z) = \frac{1}{n}\operatorname{Tr}\left( \left( \mathbf{X} + \mathbf{W} \right)\left( \mathbf{X} + \mathbf{W} \right)^T - z\mathbf{I} \right)^{-1}.$$
The Stieltjes transform is strongly concentrated, so
$$H_0: \mathbb{P}\left( |m_{\mathbf{S}}(z) - \mathbb{E}m_{\mathbf{S}}(z)| > t \right) \le 4\exp\left( -t^2 n^2 v^2 / (16N) \right), \ \text{where } \mathbb{E}m_{\mathbf{S}}(z) = \mathbb{E}\frac{1}{n}\operatorname{Tr}\left( \mathbf{W}\mathbf{W}^T - z\mathbf{I} \right)^{-1} = \frac{1}{n}\operatorname{Tr}\,\mathbb{E}\left( \mathbf{W}\mathbf{W}^T - z\mathbf{I} \right)^{-1},$$
$$H_1: \mathbb{P}\left( |m_{\mathbf{S}}(z) - \mathbb{E}m_{\mathbf{S}}(z)| > t \right) \le 4\exp\left( -t^2 n^2 v^2 / (16N) \right), \ \text{where } \mathbb{E}m_{\mathbf{S}}(z) = \frac{1}{n}\operatorname{Tr}\,\mathbb{E}\left( \left( \mathbf{X} + \mathbf{W} \right)\left( \mathbf{X} + \mathbf{W} \right)^T - z\mathbf{I} \right)^{-1}.$$
Note that both the expectation and the trace are linear operators, so they commute—their order can be exchanged. Also note that, in general,
$$\left( \mathbf{A} + \mathbf{B} \right)^{-1} \ne \mathbf{A}^{-1} + \mathbf{B}^{-1}.$$
In fact
$$\mathbf{A}^{-1} - \mathbf{B}^{-1} = \mathbf{A}^{-1}\left( \mathbf{B} - \mathbf{A} \right)\mathbf{B}^{-1}.$$
Then we have
$$\left( \mathbf{W}\mathbf{W}^T - z\mathbf{I} \right)^{-1} - \left( \left( \mathbf{X} + \mathbf{W} \right)\left( \mathbf{X} + \mathbf{W} \right)^T - z\mathbf{I} \right)^{-1} = \left( \mathbf{W}\mathbf{W}^T - z\mathbf{I} \right)^{-1}\left( \mathbf{X}\mathbf{X}^T + \mathbf{X}\mathbf{W}^T + \mathbf{W}\mathbf{X}^T \right)\left( \left( \mathbf{X} + \mathbf{W} \right)\left( \mathbf{X} + \mathbf{W} \right)^T - z\mathbf{I} \right)^{-1}.$$
Two relevant Taylor series are
$$\log\left( \mathbf{I} + \mathbf{A} \right) = \mathbf{A} - \frac{1}{2}\mathbf{A}^2 + \frac{1}{3}\mathbf{A}^3 - \frac{1}{4}\mathbf{A}^4 + \cdots, \quad \rho(\mathbf{A}) < 1,$$
$$\left( \mathbf{I} - \mathbf{A} \right)^{-1} = \mathbf{I} + \mathbf{A} + \mathbf{A}^2 + \mathbf{A}^3 + \cdots, \quad \rho(\mathbf{A}) < 1,$$
where $\rho(\mathbf{A})$ is the spectral radius [20, p. 77]. The series for $\left( \mathbf{I} - \mathbf{A} \right)^{-1}$ is also called the Neumann series [23, p. 7]. $\square$
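A minimal numerical sketch of Example 4.17.1 (the dimensions, the evaluation point $z$, the signal level, and the $1/N$ normalization of the sample covariance are assumptions of this sketch): across independent trials, the empirical Stieltjes transform concentrates tightly around its mean under each hypothesis, so a threshold placed between the two means separates $H_0$ from $H_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, trials = 50, 200, 200
z = 1.0 + 1.0j                          # evaluation point, Im[z] = v = 1

def stieltjes(S, z):
    """m_S(z) = (1/n) Tr (S - zI)^{-1}."""
    return np.trace(np.linalg.inv(S - z * np.eye(S.shape[0]))) / S.shape[0]

m0, m1 = [], []
for _ in range(trials):
    W = rng.standard_normal((n, N))                 # noise only (H0)
    X = 0.8 * rng.standard_normal((n, N))           # extra signal component (H1)
    m0.append(stieltjes(W @ W.T / N, z))
    m1.append(stieltjes((X + W) @ (X + W).T / N, z))

m0, m1 = np.array(m0), np.array(m1)
print("H0: mean", np.round(m0.mean(), 4), " max deviation", np.round(np.abs(m0 - m0.mean()).max(), 4))
print("H1: mean", np.round(m1.mean(), 4), " max deviation", np.round(np.abs(m1 - m1.mean()).max(), 4))
```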

4.18 Concentration of von Neumann Entropy Functions

The von Neumann entropy [244] is a generalization of the classical entropy (Shannon entropy) to the field of quantum mechanics. The von Neumann entropy is one of the cornerstones of quantum information theory. It plays an essential role in the expressions for the best achievable rates of virtually every coding theorem. In particular, when proving the optimality of these expressions, it is the inequalities governing the relative magnitudes of the entropies of different subsystems which are important [245]. There are essentially two such inequalities known, the so-called basic inequalities known as strong subadditivity and weak monotonicity. To be precise, we are interested only in linear inequalities involving the entropies of various reduced states of a multiparty quantum state. We will demonstrate how concentration inequalities are established for these basic inequalities. One motivation is quantum information processing.
For any quantum state described by a Hermitian positive semi-definite matrix ρ,
the von Neumann entropy of ρ is defined as

S (ρ) = − Tr (ρ log ρ) (4.67)

A natural question arises when the Hermitian positive semi-definite matrix $\boldsymbol{\rho}$ is a random matrix, rather than a deterministic matrix. For notation, here we prefer using bold upper-case symbols $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$ to represent random matrices. If $\mathbf{X}$ is a symmetric (or Hermitian) matrix of $n \times n$ and $f$ is a bounded measurable function, $f(\mathbf{X})$ is defined as the matrix with the same eigenvectors as $\mathbf{X}$ but with eigenvalues that are the images by $f$ of those of $\mathbf{X}$; namely, if $\mathbf{e}$ is an eigenvector of $\mathbf{X}$ with eigenvalue $\lambda$, $\mathbf{X}\mathbf{e} = \lambda\mathbf{e}$, then we have $f(\mathbf{X})\mathbf{e} = f(\lambda)\mathbf{e}$. For the spectral decomposition $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{U}^H$ with $\mathbf{U}$ orthogonal (unitary) and $\mathbf{D} = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ diagonal real, one has
$$f(\mathbf{X}) = \mathbf{U}f(\mathbf{D})\mathbf{U}^H$$
with $\left( f(\mathbf{D}) \right)_{ii} = f(\lambda_i)$, $i = 1, \ldots, n$. We rewrite (4.67) as
$$S(\mathbf{X}) = -\operatorname{Tr}\left( \mathbf{X}\log\mathbf{X} \right) = -\sum_{i=1}^n \lambda_i(\mathbf{X})\log\lambda_i(\mathbf{X}), \quad (4.68)$$
where $\lambda_i(\mathbf{X})$, $i = 1, \ldots, n$ are the eigenvalues of $\mathbf{X}$. Recall from Corollary 4.6.4 or Lemma 4.3.1 that if $g: \mathbb{R} \to \mathbb{R}$ is a Lipschitz function with constant $|g|_L$, the map $\mathbf{X} \in \mathbb{R}^{n^2} \mapsto \sum_{k=1}^n g(\lambda_k) \in \mathbb{R}$ is a Lipschitz function with constant $\sqrt{2}|g|_L$. Observe from (4.68) that
$$g(t) = -t\log t, \quad t \in \mathbb{R},$$
is a Lipschitz function (over the relevant range of eigenvalues) with the constant given by
$$|g|_L = \sup_{s\ne t}\frac{|g(s) - g(t)|}{|s - t|} = \sup_{s\ne t}\frac{|s\log s - t\log t|}{|s - t|}. \quad (4.69)$$

Using Lemma 4.14.1, we have that
$$\mathbb{P}\left( \left| \frac{1}{n}\operatorname{Tr}\left( -\mathbf{X}\log\mathbf{X} \right) - \mathbb{E}\frac{1}{n}\operatorname{Tr}\left( -\mathbf{X}\log\mathbf{X} \right) \right| \ge t \right) \le 2e^{-\frac{n^2t^2}{4c|g|_L^2}}, \quad (4.70)$$
where $|g|_L$ is given by (4.69).
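As a numerical illustration of (4.70) (the Wishart-based ensemble of random density matrices below is an assumption of this sketch, not from the text), the von Neumann entropy of a random state fluctuates very little around its mean, and the fluctuations shrink as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)

def von_neumann_entropy(rho):
    # S(rho) = -Tr(rho log rho), computed through the eigenvalues of rho.
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]
    return float(-np.sum(lam * np.log(lam)))

for n in (16, 64, 256):
    entropies = []
    for _ in range(200):
        G = rng.standard_normal((n, 2 * n))
        W = G @ G.T                        # Wishart matrix (Hermitian, PSD)
        rho = W / np.trace(W)              # random density matrix: PSD with unit trace
        entropies.append(von_neumann_entropy(rho))
    entropies = np.array(entropies)
    print(f"n={n:4d}: mean S = {entropies.mean():.4f}, std = {entropies.std():.2e}")
```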


Consider distinct quantum systems $A$ and $B$. The joint system is described by a Hermitian positive semi-definite matrix $\boldsymbol{\rho}_{AB}$. The individual systems are described by $\boldsymbol{\rho}_A$ and $\boldsymbol{\rho}_B$, which are obtained from $\boldsymbol{\rho}_{AB}$ by taking partial traces. We simply use $S(A)$ to represent the entropy of system $A$, i.e., $S(\boldsymbol{\rho}_A)$. In the following, the same convention applies to other joint or individual systems. It is well known that
$$\left| S(\boldsymbol{\rho}_A) - S(\boldsymbol{\rho}_B) \right| \le S(\boldsymbol{\rho}_{AB}) \le S(\boldsymbol{\rho}_A) + S(\boldsymbol{\rho}_B). \quad (4.71)$$
The second inequality above is called the subadditivity of the von Neumann entropy. The first one, called the triangular inequality (also known as the Araki-Lieb inequality [246]), is regarded as the quantum analog of the inequality
$$H(X) \le H(X, Y)$$
for the Shannon entropy $H(X)$ and the joint entropy $H(X, Y)$ of two random variables $X$ and $Y$.
The strong subadditivity (SSA) of the von Neumann entropy proved by Lieb and Ruskai [247, 248] plays the same role as the basic inequalities for the classical entropy. For distinct quantum systems $A$, $B$, and $C$, strong subadditivity can be represented by the following two equivalent forms:
$$S(\boldsymbol{\rho}_{AC}) + S(\boldsymbol{\rho}_{BC}) - S(\boldsymbol{\rho}_A) - S(\boldsymbol{\rho}_B) \ge 0 \quad \text{(SSA)}$$
$$S(\boldsymbol{\rho}_{AB}) + S(\boldsymbol{\rho}_{BC}) - S(\boldsymbol{\rho}_B) - S(\boldsymbol{\rho}_{ABC}) \ge 0 \quad \text{(WM)} \quad (4.72)$$
The expression on the left-hand side of the SSA inequality is known as the (quantum) conditional mutual information, and is commonly denoted as $I(A : B\,|\,C)$. Inequality (WM) is usually called weak monotonicity.
As pointed out above, we are interested only in linear inequalities involving the entropies of various reduced states of a multiparty quantum state. Let us consider the (quantum) mutual information
$$I(A : B) \triangleq S(\boldsymbol{\rho}_A) + S(\boldsymbol{\rho}_B) - S(\boldsymbol{\rho}_{AB}) = -\operatorname{Tr}\left( \boldsymbol{\rho}_A\log\boldsymbol{\rho}_A \right) - \operatorname{Tr}\left( \boldsymbol{\rho}_B\log\boldsymbol{\rho}_B \right) + \operatorname{Tr}\left( \boldsymbol{\rho}_{AB}\log\boldsymbol{\rho}_{AB} \right), \quad (4.73)$$
where $\boldsymbol{\rho}_A$, $\boldsymbol{\rho}_B$, $\boldsymbol{\rho}_{AB}$ are random matrices. Using the technique applied to treat the von Neumann entropy (4.67), we can similarly establish concentration inequalities like (4.70), by first evaluating the Lipschitz constants of the functions $I(A : B)$ and $I(A : B\,|\,C)$. We can even extend this to more general information inequalities [245, 249–252]. Recently, infinitely many new, constrained inequalities for the von Neumann entropy have been found by Cadney et al. [245].
Here, we make the explicit connection between concentration inequalities and these information inequalities. This new framework allows us to study the information stability when random states (and thus random matrices), rather than deterministic states, are considered. This direction needs more research.

4.19 Supremum of a Random Process

Sub-Gaussian random variables (Sect. 1.7) form a convenient and quite wide class, which includes as special cases the standard normal, Bernoulli, and all bounded random variables. Let $(Z_1, Z_2, \ldots)$ be (possibly dependent) mean-zero sub-Gaussian random variables, i.e., $\mathbb{E}[Z_i] = 0$; according to Lemma 1.7.1, there exist constants $\sigma_1, \sigma_2, \ldots$ such that
$$\mathbb{E}\left[ \exp(tZ_i) \right] \le \exp\left( \frac{t^2\sigma_i^2}{2} \right), \quad t \in \mathbb{R}.$$
We further assume that the supremum of the variance proxies is bounded, i.e., $v = \sup_i \sigma_i^2 < \infty$, and set $\kappa = \frac{1}{v}\sum_i \sigma_i^2$.
Due to concentration of measure, a Lipschitz function is nearly constant [132, p. 17]. Even more importantly, the tails behave at worst like those of a scalar Gaussian random variable with absolutely controlled mean and variance.
Previously in this chapter, many functionals of random matrices, such as the largest eigenvalue and the trace, were shown to have a tail like
$$\mathbb{P}\left( |X| > t \right) \le \exp\left( 1 - t^2/C \right),$$
which, according to Lemma 1.7.1, implies that these functionals are sub-Gaussian random variables. We modify the arguments of [113] for our context.
Now let $\mathbf{X} = \operatorname{diag}(Z_1, Z_2, \ldots)$ be the random diagonal matrix with the $Z_i$ on the diagonal. Since $\mathbb{E}[Z_i] = 0$, we have $\mathbb{E}\mathbf{X} = \mathbf{0}$. By the operator monotonicity of the matrix logarithm (Theorem 1.4.7), we have that
$$\log\mathbb{E}\left[ \exp(t\mathbf{X}) \right] \preceq \operatorname{diag}\left( \frac{t^2\sigma_1^2}{2}, \frac{t^2\sigma_2^2}{2}, \ldots \right).$$

Due to (1.27), $\lambda_{\max}(\mathbf{A} + \mathbf{B}) \le \lambda_{\max}(\mathbf{A}) + \lambda_{\max}(\mathbf{B})$, the largest eigenvalue satisfies
$$\lambda_{\max}\left( \log\mathbb{E}\left[ \exp(t\mathbf{X}) \right] \right) \le \lambda_{\max}\left( \operatorname{diag}\left( \frac{t^2\sigma_1^2}{2}, \frac{t^2\sigma_2^2}{2}, \ldots \right) \right) \le \sup_i \frac{t^2\sigma_i^2}{2} = \frac{t^2}{2}v,$$
and
$$\operatorname{Tr}\left( \log\mathbb{E}\left[ \exp(t\mathbf{X}) \right] \right) \le \operatorname{Tr}\left( \operatorname{diag}\left( \frac{t^2\sigma_1^2}{2}, \frac{t^2\sigma_2^2}{2}, \ldots \right) \right) = \sum_i \frac{t^2\sigma_i^2}{2} = \frac{t^2 v\kappa}{2},$$
where we have used the property of the trace function (1.29): $\operatorname{Tr}(\mathbf{A} + \mathbf{B}) = \operatorname{Tr}(\mathbf{A}) + \operatorname{Tr}(\mathbf{B})$.
By Theorem 2.16.3, we have
$$\mathbb{P}\left( \lambda_{\max}(\mathbf{X}) > \sqrt{2vt} \right) \le \kappa \cdot t\left( e^t - t - 1 \right)^{-1}.$$
Letting $t = 2(\tau + \log\kappa) > 2.6$ for $\tau > 0$ and interpreting $\lambda_{\max}(\mathbf{X})$ as $\sup_i Z_i$, finally we have that
$$\mathbb{P}\left( \sup_i Z_i > \sqrt{2\sup_i\sigma_i^2\left( \log\frac{\sum_i\sigma_i^2}{\sup_i\sigma_i^2} + \tau \right)} \right) \le e^{-\tau}. \quad (4.74)$$

Consider the special case where $Z_i \sim \mathcal{N}(0,1)$ are $N$ i.i.d. standard Gaussian random variables. Equation (4.74) then says that the largest of the $Z_i$ is $O\left( \sqrt{\log N + \tau} \right)$ with probability at least $1 - e^{-\tau}$. This is known to be tight up to constants, so the $\log N$ term cannot be removed. Besides, (4.74) can be applied to a countably infinite number of mean-zero Gaussian random variables $Z_i \sim \mathcal{N}\left( 0, \sigma_i^2 \right)$, or more generally, sub-Gaussian random variables, as long as the sum of the $\sigma_i^2$ is finite.
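A quick numerical illustration of (4.74) in the i.i.d. standard Gaussian case (the sample sizes and trial counts are illustrative choices of this sketch): with $\sigma_i \equiv 1$ and $N$ variables the bound reads $\mathbb{P}\big( \max_i Z_i > \sqrt{2(\log N + \tau)} \big) \le e^{-\tau}$, and the $\sqrt{\log N}$ growth of the maximum is visible empirically.

```python
import numpy as np

rng = np.random.default_rng(5)
tau, trials = 2.0, 1000                       # e^{-tau} ~ 0.135 target exceedance

for N in (10, 100, 1000, 10000):
    maxima = rng.standard_normal((trials, N)).max(axis=1)
    bound = np.sqrt(2.0 * (np.log(N) + tau))  # threshold from (4.74) with sigma_i = 1
    print(f"N={N:6d}: mean max = {maxima.mean():.3f}, "
          f"P(max > {bound:.3f}) = {np.mean(maxima > bound):.4f} (<= {np.exp(-tau):.3f})")
```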

4.20 Further Comments

Reference [180] is the first paper to study the concentration of the spectral measure for large matrices.
Concentration of eigenvalues in kernel space is treated in [253–255]. We may use a kernel [159] to map the data to a high-dimensional feature space, even if the data samples are in a low-dimensional space. We can exploit the high-dimensional space for concentration of eigenvalues in kernel space [254, 255]. Concentration of random matrices in the kernel space is studied in [229, 230, 256, 257].
For dependent random variables, we refer to [258]. Condition numbers of Gaussian random matrices are studied in [259]. Noncommutative Bennett and Rosenthal inequalities [260] are also relevant in this context. Concentration for noncommutative polynomials is studied in [190].
Chapter 5
Non-asymptotic, Local Theory
of Random Matrices

Chapters 4 and 5 are the core of this book. Chapter 6 is included to form a comparison with this chapter. The development of the theory in this chapter culminates in a non-asymptotic understanding of random matrices. Viewing this chapter as a novel statistical tool will have far-reaching impact on applications such as covariance matrix estimation, detection, compressed sensing, low-rank matrix recovery, etc. Two primary examples are: (1) approximation of the covariance matrix; (2) the restricted isometry property (see Chap. 7).
The non-asymptotic, local theory of random matrices is in its infancy. The goal
of this chapter is to bring together the latest results to give a comprehensive account
of this subject. No attempt is made to make the treatment exhaustive. However, for
engineering problems, this treatment may contain the main relevant results in the
literature.
The so-called geometric functional analysis studies high-dimensional sets and
linear operators, combining ideas and methods from convex geometry, functional
analysis and probability. While the complexity of a set may increase with the
dimension, it is crucial to point out that passing to a high-dimensional setting may
reveal properties of an object, which are obscure in low dimensions. For example,
the average of a few random variables may exhibit a peculiar behavior, while the
average of a large number of random variables will be close to a constant with high
probability. This observation is especially relevant to big data [4]: one can do at a
large scale that cannot be done at a smaller one, to extract new insights.
Another idea is probabilistic considerations in geometric problems. To prove the
existence of a section of a convex body having a certain property, we can show that
a random section possesses this property with positive probability. This powerful
method allows to prove results in situations, where deterministic constructions are
unknown, or unavailable.
In studying spectral properties of random matrices, the connection between the
areas is interesting: the origins of the problems are purely probabilistic, while the
methods draw from functional analysis and convexity.


5.1 Notation and Basics


We follow [261] for our notation here. In this chapter $|\mathbf{x}| = \|\mathbf{x}\|_2 = \left( \sum_{i=1}^n x_i^2 \right)^{1/2}$ is the standard Euclidean norm. Sometimes by $\|\cdot\|$ we also denote the standard Euclidean norm. Vectors are bold, lower case. Matrices are bold, upper case. We work on $\mathbb{R}^n$, which is equipped with a Euclidean structure $\langle\cdot,\cdot\rangle$. We write $B_2^n$ for the Euclidean unit ball and $S^{n-1}$ for the unit sphere. We fix a coordinate system defined by an orthonormal basis $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$. Volume ($n$-dimensional Lebesgue measure) and the cardinality of a finite set are also denoted by $|\cdot|$. We write $\omega_n$ for the volume of $B_2^n$.
Let $K$ be a symmetric convex body in $\mathbb{R}^n$. The function
$$\|\mathbf{x}\|_K = \min\{ t > 0: \mathbf{x} \in tK \}$$
is a norm on $\mathbb{R}^n$. The normed space $(\mathbb{R}^n, \|\cdot\|_K)$ will be denoted by $X_K$. Conversely, if $X = (\mathbb{R}^n, \|\cdot\|)$ is a normed space, then the unit ball $K_X = \{\mathbf{x} \in \mathbb{R}^n: \|\mathbf{x}\| \le 1\}$ of $X$ is a symmetric convex body in $\mathbb{R}^n$.
The dual norm $\|\cdot\|_*$ of $\|\cdot\|$ is defined by $\|\mathbf{y}\|_* = \max\{ |\langle\mathbf{x},\mathbf{y}\rangle|: \|\mathbf{x}\| \le 1 \}$. From this definition it is clear that
$$|\langle\mathbf{x},\mathbf{y}\rangle| \le \|\mathbf{x}\|\,\|\mathbf{y}\|_*$$
for all $\mathbf{x},\mathbf{y} \in \mathbb{R}^n$. If $X^*$ is the dual space of $X$, then its unit ball is $K_{X^*} = K_X^{\circ}$, where
$$K^{\circ} = \{ \mathbf{y} \in \mathbb{R}^n: \langle\mathbf{x},\mathbf{y}\rangle \le 1 \ \text{for all } \mathbf{x} \in K \}$$
is the polar body of $K$.


The Brunn-Minkowski inequality describes the effect of Minkowski addition on volumes: If $A$ and $B$ are two non-empty compact subsets of $\mathbb{R}^n$, then
$$|A + B|^{1/n} \ge |A|^{1/n} + |B|^{1/n}, \quad (5.1)$$
where $A + B = \{ a + b: a \in A, b \in B \}$. It follows that, for every $\lambda \in (0,1)$,
$$|\lambda A + (1-\lambda)B|^{1/n} \ge \lambda|A|^{1/n} + (1-\lambda)|B|^{1/n},$$
and, by the arithmetic-geometric means inequality,
$$|\lambda A + (1-\lambda)B| \ge |A|^{\lambda}|B|^{(1-\lambda)}.$$

For a hypothesis testing problem
$$H_0: A$$
$$H_1: A + B$$
where $A$ and $B$ are two non-empty compact subsets of $\mathbb{R}^n$, it follows from (5.1) that we may claim $H_0$ if
$$|A + B|^{1/n} - |B|^{1/n} \le \gamma \le |A|^{1/n},$$
where $\gamma$ is the threshold. Often the set $A$ is unknown in this situation.

The basic references on classical and asymptotic convex geometry are [152, 167,
262, 263].
Example 5.1.1 (A sequence of OFDM tones is modeled as a vector of random variables). Let $X_i$, $i = 1, \ldots, n$ be random variables with $X_i \in \mathbb{C}$. An arbitrary tone of the OFDM modulation waveform in the frequency domain may be modeled as $X_i$. For convenience, we form a vector of random variables $\mathbf{x} = (X_1, \ldots, X_n) \in \mathbb{C}^n$. Due to the fading of the multipath channel and the radio resource allocation across tones, the random variables $X_i$ and $X_j$ are not independent. Dependent random variables are difficult to deal with.
Each modulation waveform can be viewed as a random vector $\mathbf{x} \in \mathbb{C}^n$. We are interested in a sequence of independent copies $\mathbf{x}_1, \ldots, \mathbf{x}_N$ of the random vector $\mathbf{x}$. $\square$


5.2 Isotropic Convex Bodies

Lemma 5.2.1. Let $\mathbf{x}, \mathbf{y}$ be independent isotropic random vectors in $\mathbb{R}^n$. Then
$$\mathbb{E}\|\mathbf{x}\|_2^2 = n \quad \text{and} \quad \mathbb{E}\langle\mathbf{x},\mathbf{y}\rangle^2 = n.$$
A convex body is a compact and full-dimensional convex set $K \subseteq \mathbb{R}^n$ with $0 \in \operatorname{int}(K)$. A convex body $K$ is called symmetric if $\mathbf{x} \in K \Rightarrow -\mathbf{x} \in K$. We say that $K$ has its center of mass at the origin if
$$\int_K \langle\mathbf{x},\mathbf{y}\rangle\, d\mathbf{x} = 0$$
for every $\mathbf{y} \in S^{n-1}$.


Definition 5.2.2 (Isotropic Position). Let $K$ be a convex body in $\mathbb{R}^n$, and let $b(K)$ denote its center of gravity. We say that $K$ is in isotropic position if its center of gravity is at the origin, and for each $i, j$, $1 \le i \le j \le n$, we have
$$\frac{1}{\operatorname{vol}(K)}\int_K x_i x_j\, d\mathbf{x} = \begin{cases} 1, & i = j, \\ 0, & i \ne j, \end{cases} \quad (5.2)$$
or equivalently, for every vector $\mathbf{y} \in \mathbb{R}^n$,
$$\frac{1}{\operatorname{vol}(K)}\int_K \left( \mathbf{y}^T\mathbf{x} \right)^2 d\mathbf{x} = \frac{1}{\operatorname{vol}(K)}\int_K \langle\mathbf{y},\mathbf{x}\rangle^2\, d\mathbf{x} = \|\mathbf{y}\|^2. \quad (5.3)$$
By $\|\cdot\|$ we denote the standard Euclidean norm. Here $x_i$ is the $i$th coordinate of $\mathbf{x}$. Our normalization is slightly different from [264]; their definition corresponds to applying a homothetical transformation to get $\operatorname{vol}(K) = 1$. The isotropic position has many interesting features. Among others, it minimizes $\frac{1}{\operatorname{vol}(K)}\int_K \|\mathbf{x}\|^2\, d\mathbf{x}$ (see [264]).
If $K$ is in isotropic position, then
$$\frac{1}{\operatorname{vol}(K)}\int_K \|\mathbf{x}\|^2\, d\mathbf{x} = n,$$
from which it follows that "most" (i.e., all but a fraction $\varepsilon$) of $K$ is contained in a ball of radius $\sqrt{n/\varepsilon}$. Using a result of Borell [265], one can show that the radius of the ball can be replaced by $2\sqrt{2n\log(1/\varepsilon)}$. Also, if $K$ is in isotropic position, it contains the unit ball [266, Lemma 5.1]. It is well known that for every convex body, there is an affine transformation mapping it to a body in isotropic position, and this transformation is unique up to an isometry fixing the origin.
We have to allow an error $\varepsilon > 0$, and want to find an affine transformation bringing $K$ into nearly isotropic position.
Definition 5.2.3 (Nearly Isotropic Position). We say that $K$ is in $\varepsilon$-nearly isotropic position ($0 < \varepsilon \le 1$) if
$$\|b(K)\| \le \varepsilon,$$
and for every vector $\mathbf{y} \in \mathbb{R}^n$,
$$(1-\varepsilon)\|\mathbf{y}\|^2 \le \frac{1}{\operatorname{vol}(K)}\int_{K-b(K)} \left( \mathbf{y}^T\mathbf{x} \right)^2 d\mathbf{x} \le (1+\varepsilon)\|\mathbf{y}\|^2. \quad (5.4)$$
Theorem 5.2.4 (Kannan, Lovász and Simonovits [266]). Given $0 < \delta, \varepsilon < 1$, there exists a randomized algorithm finding an affine transformation $\mathbf{A}$ such that $\mathbf{A}K$ is in $\varepsilon$-nearly isotropic position with probability at least $1 - \delta$. The number of oracle calls is
$$O\left( n^5 \ln n \cdot \ln\frac{1}{\varepsilon\delta} \right).$$

Given a convex body $K \subseteq \mathbb{R}^n$ and a function $f: K \to \mathbb{R}^n$, we denote by $\mathbb{E}_K(f)$ the average of $f$ over $K$, i.e.,
$$\mathbb{E}_K(f) = \frac{1}{\operatorname{vol}(K)}\int_K f(\mathbf{x})\, d\mathbf{x}.$$
We denote by $b = b(K) = \mathbb{E}_K(\mathbf{x})$ the center of gravity of $K$, and by $\Sigma(K)$ the $n \times n$ matrix
$$\Sigma(K) = \mathbb{E}_K\left( (\mathbf{x} - b)(\mathbf{x} - b)^T \right).$$
The trace of $\Sigma(K)$ is the average squared distance of points of $K$ from the center of gravity, which we also call the second moment of $K$. We recall the definition of the isotropic position: the body $K \subseteq \mathbb{R}^n$ is in isotropic position if and only if
$$b = 0 \quad \text{and} \quad \Sigma(K) = \mathbf{I},$$
where $\mathbf{I}$ is the identity matrix. In this case, we have
$$\mathbb{E}_K(x_i) = 0, \quad \mathbb{E}_K\left( x_i^2 \right) = 1, \quad \mathbb{E}_K(x_i x_j) = 0 \ (i \ne j).$$
The second moment of $K$ is then $n$, and therefore all but a fraction $\varepsilon$ of its volume lies inside the ball $\sqrt{n/\varepsilon}\, B$, where $B$ is the unit ball.
If $K$ is in isotropic position, then [266]
$$\sqrt{\frac{n+2}{n}}\, B \subseteq K \subseteq \sqrt{n(n+2)}\, B.$$
Given a probability measure $\mu$ on $\mathbb{R}^n$ (suitably normalized), there is always a Euclidean structure, the canonical inner product denoted $\langle\cdot,\cdot\rangle$, for which this measure is isotropic, i.e., for every $\mathbf{y} \in \mathbb{R}^n$,
$$\mathbb{E}\langle\mathbf{x},\mathbf{y}\rangle^2 = \int_{\mathbb{R}^n} \langle\mathbf{x},\mathbf{y}\rangle^2\, d\mu(\mathbf{x}) = \|\mathbf{y}\|_2^2.$$
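A hedged, sample-based sketch of the normalization behind Definitions 5.2.2 and 5.2.3 (the box-shaped stand-in for $K$ and the sample sizes are assumptions of this illustration): estimate the center of gravity and covariance from points drawn in $K$, apply the affine map $\mathbf{x} \mapsto \hat{\boldsymbol{\Sigma}}^{-1/2}(\mathbf{x} - \hat{b})$, and check that the directional second moments are close to 1 afterwards.

```python
import numpy as np

rng = np.random.default_rng(6)
n, N = 10, 200000

# Points drawn from an anisotropic, shifted box -- a simple stand-in for a convex body K.
half_widths = rng.uniform(0.5, 5.0, size=n)
shift = rng.uniform(-1.0, 1.0, size=n)
X = rng.uniform(-1.0, 1.0, size=(N, n)) * half_widths + shift

# Empirical affine normalization: x -> Sigma^{-1/2} (x - b).
b = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)
L = np.linalg.cholesky(np.linalg.inv(Sigma))   # L L^T = Sigma^{-1}
Y = (X - b) @ L                                # nearly isotropic sample (Cov(Y) ~ I)

# Directional second moments E <y, theta>^2 over random unit directions theta.
thetas = rng.standard_normal((50, n))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
second_moments = ((Y @ thetas.T) ** 2).mean(axis=0)
print("directional second moments: min %.3f, max %.3f (target 1)"
      % (second_moments.min(), second_moments.max()))
```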

5.3 Log-Concave Random Vectors

We need to consider $\mathbf{x}$ that is an isotropic, log-concave random vector in $\mathbb{R}^n$ (for example, a vector uniformly distributed in an isotropic convex body) [267]. A probability measure $\mu$ on $\mathbb{R}^n$ is said to be log-concave if for all compact sets $A, B$ and every $\lambda \in [0,1]$,
$$\mu\left( \lambda A + (1-\lambda)B \right) \ge \mu(A)^{\lambda}\mu(B)^{1-\lambda}.$$
In other words, an $n$-dimensional random vector is called log-concave if it has a log-concave distribution, i.e., for any compact nonempty sets $A, B \subseteq \mathbb{R}^n$ and $\lambda \in (0,1)$,
$$\mathbb{P}\left( \mathbf{x} \in \lambda A + (1-\lambda)B \right) \ge \mathbb{P}(\mathbf{x} \in A)^{\lambda}\,\mathbb{P}(\mathbf{x} \in B)^{1-\lambda}.$$

According to Borell [268], a vector with full-dimensional support is log-concave if and only if it has a density of the form $e^{-f}$, where $f: \mathbb{R}^n \to (-\infty, \infty)$ is a convex function. See [268, 269] for a general study of this class of measures.
It is known that any affine image, in particular any projection, of a log-concave random vector is log-concave. Moreover, if $\mathbf{x}$ and $\mathbf{y}$ are independent log-concave random vectors, then so is $\mathbf{x} + \mathbf{y}$ (see [268, 270]).
One important and simple model of a centered log-concave random variable with variance 1 is the symmetric exponential random variable $E$, which has density $f(t) = \frac{1}{\sqrt{2}}\exp\left( -\sqrt{2}\,|t| \right)$. In particular, for every $s > 0$ we have $\mathbb{P}\left( |E| \ge s \right) \le \exp\left( -s/\sqrt{2} \right)$. Recall that $|\mathbf{x}| = \|\mathbf{x}\|_2 = \left( \sum_{i=1}^n x_i^2 \right)^{1/2}$ is the standard Euclidean norm; sometimes by $\|\cdot\|$ we also denote the standard Euclidean norm.
Sometimes by · we denote also the standard Euclidean norm.
Log-concave measures are commonly encountered in convex geometry, since through the Brunn-Minkowski inequality, uniform distributions on convex bodies and their low-dimensional marginals are log-concave. The class of log-concave measures on $\mathbb{R}^n$ is the smallest class of probability measures closed under linear transformations and weak limits that contains the uniform distributions on convex bodies. Vectors with logarithmically concave distributions are called log-concave.
The Euclidean norm of an $n$-dimensional log-concave random vector has moments of all orders [268]. A log-concave probability is supported on some convex subset of an affine subspace where it has a density. In particular, when the support of the probability generates the whole space $\mathbb{R}^n$ (in which case we deal, in short, with a full-dimensional probability), a characterization of Borell (see [268, 269]) states that the probability is absolutely continuous with respect to the Lebesgue measure and has a density which is log-concave. We say that a random vector is log-concave if its distribution is a log-concave measure.
The indicator function of a convex set and the density function of a Gaussian distribution are two canonical examples of log-concave functions [271, p. 43]. Every centered log-concave random variable $Z$ with variance 1 satisfies a sub-exponential inequality: for every $s > 0$, $\mathbb{P}\left( |Z| \ge s \right) \le C\exp(-s/C)$, where $C > 0$ is an absolute constant [268]. For a random variable $Z$ we define the $\psi_1$-norm by
$$\|Z\|_{\psi_1} = \inf\{ C > 0: \mathbb{E}\exp\left( |Z|/C \right) \le 2 \},$$
and we say that $Z$ is $\psi_1$ with constant $\psi$ if $\|Z\|_{\psi_1} \le \psi$.
A particular case of a log-concave probability measure is the normalized uniform (Lebesgue) measure on a convex body. Borell's inequality (see [267]) implies that the linear functionals $\mathbf{x} \mapsto \langle\mathbf{x},\mathbf{y}\rangle$ satisfy Khintchine-type inequalities with respect to log-concave probability measures. That is, if $p \ge 2$, then for every $\mathbf{y} \in \mathbb{R}^n$,
$$\left( \mathbb{E}|\langle\mathbf{x},\mathbf{y}\rangle|^2 \right)^{1/2} \le \left( \mathbb{E}|\langle\mathbf{x},\mathbf{y}\rangle|^p \right)^{1/p} \le Cp\left( \mathbb{E}|\langle\mathbf{x},\mathbf{y}\rangle|^2 \right)^{1/2}. \quad (5.5)$$

Very recently, we have the following.



Theorem 5.3.1 (Paouris [272]). There exists an absolute constant $c > 0$ such that if $K$ is an isotropic convex body in $\mathbb{R}^n$, then
$$\mathbb{P}\left( \mathbf{x} \in K: \|\mathbf{x}\|_2 \ge c\sqrt{n}L_K t \right) \le e^{-t\sqrt{n}}$$
for every $t \ge 1$.
Theorem 5.3.2 (Paouris [272]). There exist constants $c, C > 0$ such that for any isotropic, log-concave random vector $\mathbf{x}$ in $\mathbb{R}^n$ and any $p \le c\sqrt{n}$,
$$\left( \mathbb{E}\|\mathbf{x}\|_2^p \right)^{1/p} \le C\left( \mathbb{E}\|\mathbf{x}\|_2^2 \right)^{1/2}. \quad (5.6)$$
5.4 Rudelson’s Theorem

A random vector $\mathbf{x} = (X_1, \ldots, X_n)$ is isotropic [93, 264, 273, 274] if $\mathbb{E}X_i = 0$ and $\operatorname{Cov}(X_i, X_j) = \delta_{ij}$ for all $i, j \le n$. Equivalently, an $n$-dimensional random vector with mean zero is isotropic if
$$\mathbb{E}\langle\mathbf{y},\mathbf{x}\rangle^2 = \|\mathbf{y}\|_2^2$$
for any $\mathbf{y} \in \mathbb{R}^n$. For any nondegenerate log-concave vector $\mathbf{x}$, there exists an affine transformation $T$ such that $T\mathbf{x}$ is isotropic.
For $\mathbf{x} \in \mathbb{R}^n$, we define the Euclidean norm $\|\mathbf{x}\|_2$ as $\|\mathbf{x}\|_2 = \left( \sum_{i=1}^n x_i^2 \right)^{1/2}$. More generally, the $\ell_p$ norm is defined as
$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}.$$

Consider $N$ random points $\mathbf{y}_1, \ldots, \mathbf{y}_N$ drawn independently and uniformly from the body $K$, and put
$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\otimes\mathbf{y}_i,$$
which is the sample covariance matrix. If $N$ is sufficiently large, then with high probability
$$\left\| \hat{\boldsymbol{\Sigma}} - \boldsymbol{\Sigma} \right\|, \quad \boldsymbol{\Sigma} = \frac{1}{\operatorname{vol}(K)}\int_K \left( \mathbf{y}\otimes\mathbf{y} \right) d\mathbf{y},$$
will be small. Here $\boldsymbol{\Sigma}$ is the true covariance matrix. Kannan et al. [266] proved that it is enough to take $N = c\,n^2/\varepsilon^2$ for some constant $c$. This result was greatly improved by Bourgain [273], who has shown that one can take $N = C(\varepsilon)\, n\log^3 n$.

Since the situation is invariant under a linear transformation, we may assume that the body $K$ is in the isotropic position. The result of Bourgain may then be reformulated as follows.
Theorem 5.4.1 (Bourgain [273]). Let $K$ be a convex body in $\mathbb{R}^n$ in the isotropic position. Fix $\varepsilon > 0$ and choose independently $N$ random points $\mathbf{x}_1, \ldots, \mathbf{x}_N \in K$,
$$N \ge C(\varepsilon)\, n\log^3 n.$$
Then with probability at least $1 - \varepsilon$, for any $\mathbf{y} \in \mathbb{R}^n$ one has
$$(1-\varepsilon)\|\mathbf{y}\|^2 < \frac{1}{N}\sum_{i=1}^N \langle\mathbf{x}_i,\mathbf{y}\rangle^2 < (1+\varepsilon)\|\mathbf{y}\|^2.$$

The work of Rudelson [93] is well known. He has shown that this theorem follows from a general result about random vectors in $\mathbb{R}^n$. Let $\mathbf{y}$ be a random vector, and denote by $\mathbb{E}X$ the expectation of a random variable $X$. We say that $\mathbf{y}$ is in the isotropic position if
$$\mathbb{E}\left( \mathbf{y}\otimes\mathbf{y} \right) = \mathbf{I}. \quad (5.7)$$
If $\mathbf{y}$ is uniformly distributed in a convex body $K$, then this is equivalent to the fact that $K$ is in the isotropic position. The proof of Theorem 5.4.2 is taken from [93].
Theorem 5.4.2 (Rudelson [93]). Let $\mathbf{y} \in \mathbb{R}^n$ be a random vector in the isotropic position. Let $N$ be a natural number and let $\mathbf{y}_1, \ldots, \mathbf{y}_N$ be independent copies of $\mathbf{y}$. Then,
$$\mathbb{E}\left\| \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\otimes\mathbf{y}_i - \mathbf{I} \right\| \le C\cdot\frac{\sqrt{\log N}}{\sqrt{N}}\cdot\left( \mathbb{E}\|\mathbf{y}\|^{\log N} \right)^{1/\log N}, \quad (5.8)$$
provided that the last expression is smaller than 1.
Taking the trace of (5.7), we obtain that
$$\mathbb{E}\|\mathbf{y}\|^2 = n,$$
so to make the right-hand side of (5.8) smaller than 1, we have to assume that
$$N \ge c\, n\log n.$$

Proof. The proof has two steps. The first step is relatively standard: we introduce a Bernoulli random process and estimate the expectation of the norm in (5.8) by the expectation of its supremum. Then, we construct a majorizing measure to obtain a bound for the latter.

First, let $\varepsilon_1, \ldots, \varepsilon_N$ be independent Bernoulli variables taking values $1, -1$ with probability $1/2$, and let $\mathbf{y}_1, \ldots, \mathbf{y}_N, \bar{\mathbf{y}}_1, \ldots, \bar{\mathbf{y}}_N$ be independent copies of $\mathbf{y}$. Denote by $\mathbb{E}_{\mathbf{y}}, \mathbb{E}_{\varepsilon}$ the expectation with respect to $\mathbf{y}$ and $\varepsilon$, respectively. Since $\mathbf{y}_i\otimes\mathbf{y}_i - \bar{\mathbf{y}}_i\otimes\bar{\mathbf{y}}_i$ is a symmetric random variable, we have
$$\mathbb{E}_{\mathbf{y}}\left\| \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\otimes\mathbf{y}_i - \mathbf{I} \right\| \le \mathbb{E}_{\mathbf{y}}\mathbb{E}_{\bar{\mathbf{y}}}\left\| \frac{1}{N}\sum_{i=1}^N \left( \mathbf{y}_i\otimes\mathbf{y}_i - \bar{\mathbf{y}}_i\otimes\bar{\mathbf{y}}_i \right) \right\| = \mathbb{E}_{\varepsilon}\mathbb{E}_{\mathbf{y}}\mathbb{E}_{\bar{\mathbf{y}}}\left\| \frac{1}{N}\sum_{i=1}^N \varepsilon_i\left( \mathbf{y}_i\otimes\mathbf{y}_i - \bar{\mathbf{y}}_i\otimes\bar{\mathbf{y}}_i \right) \right\| \le 2\,\mathbb{E}_{\mathbf{y}}\mathbb{E}_{\varepsilon}\left\| \frac{1}{N}\sum_{i=1}^N \varepsilon_i\,\mathbf{y}_i\otimes\mathbf{y}_i \right\|.$$
To estimate the last expectation, we need the following lemma.
Lemma 5.4.3 (Rudelson [93]). Let $\mathbf{y}_1, \ldots, \mathbf{y}_N$ be vectors in $\mathbb{R}^n$ and let $\varepsilon_1, \ldots, \varepsilon_N$ be independent Bernoulli variables taking values $1, -1$ with probability $1/2$. Then
$$\mathbb{E}\left\| \sum_{i=1}^N \varepsilon_i\,\mathbf{y}_i\otimes\mathbf{y}_i \right\| \le C\sqrt{\log N}\cdot\max_{i=1,\ldots,N}\|\mathbf{y}_i\|\cdot\left\| \sum_{i=1}^N \mathbf{y}_i\otimes\mathbf{y}_i \right\|^{1/2}.$$
The lemma was proven in [93]. Applying the lemma, we get
$$\mathbb{E}\left\| \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\otimes\mathbf{y}_i - \mathbf{I} \right\| \le C\cdot\frac{\sqrt{\log N}}{\sqrt{N}}\cdot\left( \mathbb{E}\max_{i=1,\ldots,N}\|\mathbf{y}_i\|^2 \right)^{1/2}\cdot\left( \mathbb{E}\left\| \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\otimes\mathbf{y}_i \right\| \right)^{1/2}. \quad (5.9)$$
We have
$$\left( \mathbb{E}\max_{i=1,\ldots,N}\|\mathbf{y}_i\|^2 \right)^{1/2} \le \left( \mathbb{E}\sum_{i=1}^N \|\mathbf{y}_i\|^{\log N} \right)^{1/\log N} = N^{1/\log N}\cdot\left( \mathbb{E}\|\mathbf{y}\|^{\log N} \right)^{1/\log N}.$$

Then, denoting
$$D = \mathbb{E}\left\| \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\otimes\mathbf{y}_i - \mathbf{I} \right\|,$$
we obtain through (5.9)
$$D \le C\cdot\frac{\sqrt{\log N}}{\sqrt{N}}\cdot\left( \mathbb{E}\|\mathbf{y}\|^{\log N} \right)^{1/\log N}\cdot\left( D + 1 \right)^{1/2}.$$
If
$$C\cdot\frac{\sqrt{\log N}}{\sqrt{N}}\cdot\left( \mathbb{E}\|\mathbf{y}\|^{\log N} \right)^{1/\log N} \le 1,$$
we arrive at
$$D \le 2C\cdot\frac{\sqrt{\log N}}{\sqrt{N}}\cdot\left( \mathbb{E}\|\mathbf{y}\|^{\log N} \right)^{1/\log N},$$
which completes the proof of Theorem 5.4.2. $\square$
Let us apply Theorem 5.4.2 to the problem of Kannan et al. [266].
Corollary 5.4.4 (Rudelson [93]). Let $\varepsilon > 0$ and let $K$ be an $n$-dimensional convex body in the isotropic position. Let
$$N \ge C\cdot\frac{n}{\varepsilon^2}\cdot\log^2\frac{n}{\varepsilon^2}$$
and let $\mathbf{y}_1, \ldots, \mathbf{y}_N$ be independent random vectors uniformly distributed in $K$. Then
$$\mathbb{E}\left\| \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\otimes\mathbf{y}_i - \mathbf{I} \right\| \le \varepsilon.$$
Proof. It follows from a result of Alesker [275] that
$$\mathbb{E}\exp\left( \frac{\|\mathbf{y}\|^2}{c\cdot n} \right) \le 2$$
for some absolute constant $c$. Then
$$\left( \mathbb{E}\|\mathbf{y}\|^{\log N} \right)^{1/\log N} \le \left( 2\max_{t\ge 0}\, t^{\log N}e^{-t^2/(cn)} \right)^{1/\log N} \le \left( Cn\log N \right)^{1/2},$$
since $\mathbb{E}\|\mathbf{y}\|^{\log N} \le \mathbb{E}\left[ \exp\left( \|\mathbf{y}\|^2/(cn) \right) \right]\cdot\max_{t\ge 0}\, t^{\log N}e^{-t^2/(cn)}$. Corollary 5.4.4 follows from this estimate and Theorem 5.4.2. By a lemma of Borell [167, Appendix III], most of the volume of a convex body in the isotropic position is contained within a Euclidean ball of radius $c\sqrt{n}$. So, it might be of interest to consider a random vector uniformly distributed in the intersection of a convex body $K$ and such a ball $R\sqrt{n}B_2^n$. $\square$
Corollary 5.4.5 (Rudelson [93]). Let $\varepsilon, R > 0$ and let $K$ be an $n$-dimensional convex body in the isotropic position. Suppose that $R \ge c\log(1/\varepsilon)$, let
$$N \ge C_0\cdot\frac{R^2}{\varepsilon^2}\cdot\log n,$$
and let $\mathbf{y}_1, \ldots, \mathbf{y}_N$ be independent random vectors uniformly distributed in the intersection of $K$ and the ball $R\sqrt{n}B_2^n$, i.e., in $K\cap R\sqrt{n}B_2^n$. Then
$$\mathbb{E}\left\| \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\otimes\mathbf{y}_i - \mathbf{I} \right\| \le \varepsilon.$$
See [93] for the proof.
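A numerical sketch consistent with Corollaries 5.4.4 and 5.4.5 (the sampling body—the Euclidean ball of radius $\sqrt{n+2}$, which is isotropic—and the constants below are choices of this illustration, not of the text): the operator-norm deviation $\big\| \frac{1}{N}\sum_i \mathbf{y}_i\otimes\mathbf{y}_i - \mathbf{I} \big\|$ becomes small once $N$ is a modest multiple of $n\log n$.

```python
import numpy as np

rng = np.random.default_rng(7)

def uniform_ball(N, n):
    """Uniform points in the Euclidean ball of radius sqrt(n+2), an isotropic convex body."""
    g = rng.standard_normal((N, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)          # random directions
    r = rng.uniform(size=(N, 1)) ** (1.0 / n)              # radii for the uniform volume measure
    return np.sqrt(n + 2.0) * r * g

n = 50
for c in (1, 4, 16):
    N = int(c * n * np.log(n))
    Y = uniform_ball(N, n)
    err = np.linalg.norm(Y.T @ Y / N - np.eye(n), 2)       # operator-norm deviation from I
    print(f"N = {N:5d} (~{c} n log n): deviation = {err:.3f}")
```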

5.5 Sample Covariance Matrices with Independent Rows

Singular values of matrices with independent rows (without assuming that the
entries are independent) are treated here, with material taken from Mendelson and
Pajor [274].
Let us first introduce a notion of isotropic position. Let x be a random vector
selected randomly from a convex symmetric body in Rn which is in an isotropic
position. By this we mean the following: let K ⊂ Rn be a convex and symmetric
set with a nonempty interior. We say that $K$ is in an isotropic position if for any $\mathbf{y} \in \mathbb{R}^n$,
$$\frac{1}{\operatorname{vol}(K)}\int_K |\langle\mathbf{y},\mathbf{x}\rangle|^2\, d\mathbf{x} = \|\mathbf{y}\|^2,$$
where the volume and the integral are with respect to the Lebesgue measure on $\mathbb{R}^n$ and $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ are, respectively, the scalar product and the norm in the Euclidean space $\ell_2^n$. In other words, if one considers the normalized volume measure on $K$ and $\mathbf{x}$ is a random vector with that distribution, then the body is in an isotropic position if for any $\mathbf{y} \in \mathbb{R}^n$,
$$\mathbb{E}|\langle\mathbf{y},\mathbf{x}\rangle|^2 = \|\mathbf{y}\|^2.$$
Let $\mathbf{x}$ be a random vector on $\mathbb{R}^n$ and consider $\{\mathbf{x}_i\}_{i=1}^N$, which are $N$ independent random vectors distributed as $\mathbf{x}$. Consider the random operator $\mathbf{X}: \mathbb{R}^n \to \mathbb{R}^N$ defined by
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}_{N\times n},$$
where $\{\mathbf{x}_i\}_{i=1}^N$ are independent random variables distributed according to the normalized volume measure on the body $K$. A difficulty arises when the matrix $\mathbf{X}$ has dependent entries; in the standard setup in the theory of random matrices, one studies matrices with i.i.d. entries.

The method we are following from [274] is surprisingly simple. If $N \ge n$, the first $n$ eigenvalues of $\mathbf{X}\mathbf{X}^* = \left( \langle\mathbf{x}_i,\mathbf{x}_j\rangle \right)_{i,j=1}^N$ are the same as the eigenvalues of $\mathbf{X}^*\mathbf{X} = \sum_{i=1}^N \mathbf{x}_i\otimes\mathbf{x}_i$. We will show that under very mild conditions on $\mathbf{x}$, with high probability,
$$\left\| \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma} \right\|_{\ell_2^n\to\ell_2^n} \quad (5.10)$$
tends to 0 quickly as $N$ tends to infinity, where $\boldsymbol{\Sigma} = \mathbb{E}\left( \mathbf{x}\otimes\mathbf{x} \right)$. In particular, with high probability, the eigenvalues of $\frac{1}{N}\sum_{i=1}^N \mathbf{x}_i\otimes\mathbf{x}_i$ are close to the eigenvalues of $\boldsymbol{\Sigma}$.
The general approximation question was motivated by an application in Complexity Theory, studied by Kannan, Lovász and Simonovits [266], regarding algorithms that approximate the volume of convex bodies. Later, Bourgain and Rudelson obtained the following results.
Theorem 5.5.1 (Bourgain [273]). For every $\varepsilon > 0$ there exists a constant $c(\varepsilon)$ for which the following holds. If $K$ is a convex symmetric body in $\mathbb{R}^n$ in isotropic position and $N \ge c(\varepsilon)\, n\log^3 n$, then with probability at least $1 - \varepsilon$, for any $\mathbf{y} \in S^{n-1}$,
$$1 - \varepsilon \le \frac{1}{N}\sum_{i=1}^N \langle\mathbf{x}_i,\mathbf{y}\rangle^2 \le 1 + \varepsilon.$$
Equivalently, this theorem says that $\frac{1}{\sqrt{N}}\mathbf{X}: \ell_2^n \to \ell_2^N$ is a good embedding of $\ell_2^n$. When the random vector $\mathbf{x}$ has independent, standard Gaussian coordinates it is known that for any $\mathbf{y} \in S^{n-1}$,
$$1 - 2\sqrt{\frac{n}{N}} \le \frac{1}{N}\sum_{i=1}^N \langle\mathbf{x}_i,\mathbf{y}\rangle^2 \le 1 + 2\sqrt{\frac{n}{N}}$$
holds with high probability (see the survey [145, Theorem II.13]). In the Gaussian case, Theorem 5.5.1 is asymptotically optimal, up to a numerical constant.
Bourgain's result was improved by Rudelson [93], who removed one power of the logarithm while proving a more general statement.
Theorem 5.5.2 (Rudelson [93]). There exists an absolute constant $C$ for which the following holds. Let $\mathbf{y}$ be a random vector in $\mathbb{R}^n$ which satisfies $\mathbb{E}\left( \mathbf{y}\otimes\mathbf{y} \right) = \mathbf{I}$, and let $\mathbf{y}_1, \ldots, \mathbf{y}_N$ be independent copies of $\mathbf{y}$. Then,
$$\mathbb{E}\left\| \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\otimes\mathbf{y}_i - \mathbf{I} \right\| \le C\sqrt{\frac{\log N}{N}}\left( \mathbb{E}\|\mathbf{y}\|^{\log N} \right)^{1/\log N}.$$

A proof of this theorem is given in Sect. 2.2.1.2, using the concentration of matrices. For a vector-valued random variable $\mathbf{y}$ and $k \ge 1$, the $\psi_k$ norm of $\mathbf{y}$ is
$$\|\mathbf{y}\|_{\psi_k} = \inf\left\{ C > 0: \mathbb{E}\exp\left( \frac{|\mathbf{y}|^k}{C^k} \right) \le 2 \right\}.$$
A standard argument [13] shows that if $\mathbf{y}$ has a bounded $\psi_k$ norm, then the tail of $|\mathbf{y}|$ decays faster than $2\exp\left( -t^k/\|\mathbf{y}\|_{\psi_k}^k \right)$.
Assumption 5.5.3. Let $\mathbf{x}$ be a random vector in $\mathbb{R}^n$. We will assume that
1. There is some $\rho > 0$ such that for every $\mathbf{y} \in S^{n-1}$, $\left( \mathbb{E}|\langle\mathbf{x},\mathbf{y}\rangle|^4 \right)^{1/4} \le \rho$.
2. Set $Z = \|\mathbf{x}\|$. Then $\|Z\|_{\psi_\alpha} < \infty$ for some $\alpha$.
Assumption 5.5.3 implies that the average operator $\boldsymbol{\Sigma}$ satisfies $\|\boldsymbol{\Sigma}\| \le \rho^2$. Indeed,
$$\|\boldsymbol{\Sigma}\| = \sup_{\mathbf{y}_1,\mathbf{y}_2\in S^{n-1}}\langle\boldsymbol{\Sigma}\mathbf{y}_1,\mathbf{y}_2\rangle = \sup_{\mathbf{y}_1,\mathbf{y}_2\in S^{n-1}}\mathbb{E}\langle\mathbf{x},\mathbf{y}_1\rangle\langle\mathbf{x},\mathbf{y}_2\rangle \le \sup_{\mathbf{y}\in S^{n-1}}\mathbb{E}\langle\mathbf{x},\mathbf{y}\rangle^2 \le \rho^2.$$

Before introducing the main result of [274], we give two results: a well-known symmetrization theorem [13] and a result of Rudelson [93]. A Rademacher random variable is a random variable taking values $\pm 1$ with probability $1/2$.
Theorem 5.5.4 (Symmetrization Theorem [13]). Let $Z$ be a stochastic process indexed by a set $F$ and let $N$ be an integer. For every $i \le N$, let $\mu_i: F \to \mathbb{R}$ be arbitrary functions and set $\{Z_i\}_{i=1}^N$ to be independent copies of $Z$. Under mild topological conditions on $F$ and $(\mu_i)$ ensuring the measurability of the events below, for any $t > 0$,
$$\beta_N(t)\,\mathbb{P}\left( \sup_{f\in F}\left| \sum_{i=1}^N Z_i(f) \right| > t \right) \le 2\,\mathbb{P}\left( \sup_{f\in F}\left| \sum_{i=1}^N \varepsilon_i\left( Z_i(f) - \mu_i(f) \right) \right| > \frac{t}{2} \right),$$
where $\{\varepsilon_i\}_{i=1}^N$ are independent Rademacher random variables and
$$\beta_N(t) = \inf_{f\in F}\mathbb{P}\left( \left| \sum_{i=1}^N Z_i(f) \right| < \frac{t}{2} \right).$$


We express the operator norm of $\sum_{i=1}^N \left( \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma} \right)$ as the supremum of an empirical process. Indeed, let $Y$ be the set of tensors $\mathbf{v}\otimes\mathbf{w}$, where $\mathbf{v}$ and $\mathbf{w}$ are vectors in the unit Euclidean ball. Then,
$$\left\| \sum_{i=1}^N \left( \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma} \right) \right\| = \sup_{y\in Y}\left\langle \sum_{i=1}^N \left( \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma} \right), y \right\rangle.$$
Consider the process indexed by $Y$ defined by $Z(y) = \frac{1}{N}\sum_{i=1}^N \left\langle \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma}, y \right\rangle$. Clearly, for every $y$, $\mathbb{E}Z(y) = 0$ (the expectation is a linear operator), and
$$\sup_{y\in Y}Z(y) = \left\| \frac{1}{N}\sum_{i=1}^N \left( \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma} \right) \right\|.$$
To apply Theorem 5.5.4, one has to estimate, for any fixed $y \in Y$,
$$\mathbb{P}\left( \left| \sum_{i=1}^N \left\langle \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma}, y \right\rangle \right| > Nt \right).$$
For any fixed $y \in Y$, it follows that
$$\operatorname{var}\left( \left\langle \mathbf{x}\otimes\mathbf{x} - \boldsymbol{\Sigma}, y \right\rangle \right) \le \sup_{\mathbf{z}\in S^{n-1}}\mathbb{E}|\langle\mathbf{x},\mathbf{z}\rangle|^4 \le \rho^4.$$
In particular, $\operatorname{var}(Z(y)) \le \rho^4/N$, implying by Chebyshev's inequality that
$$\beta_N(t) \ge 1 - \frac{\rho^4}{Nt^2}.$$
Corollary 5.5.5. Let $\mathbf{x}$ be a random vector which satisfies Assumption 5.5.3 and let $\mathbf{x}_1, \ldots, \mathbf{x}_N$ be independent copies of $\mathbf{x}$. Then,
$$\mathbb{P}\left( \left\| \sum_{i=1}^N \left( \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma} \right) \right\| \ge tN \right) \le 4\,\mathbb{P}\left( \left\| \sum_{i=1}^N \varepsilon_i\,\mathbf{x}_i\otimes\mathbf{x}_i \right\| > \frac{tN}{2} \right),$$
provided that $t \ge c\sqrt{\rho^4/N}$, for some absolute constant $c$.
Next, we need to estimate the norm of the symmetric random (vector-valued) variable $\sum_{i=1}^N \varepsilon_i\,\mathbf{x}_i\otimes\mathbf{x}_i$. We follow Rudelson [93], who builds on an inequality due to Lust-Piquard and Pisier [276].
Theorem 5.5.6 (Rudelson [93]). There exists an absolute constant $c$ such that for any integers $n$ and $N$, for any $\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathbb{R}^n$ and any $k \ge 1$,
$$\left( \mathbb{E}\left\| \sum_{i=1}^N \varepsilon_i\,\mathbf{x}_i\otimes\mathbf{x}_i \right\|^k \right)^{1/k} \le c\max\left\{ \sqrt{\log n}, \sqrt{k} \right\}\left\| \sum_{i=1}^N \mathbf{x}_i\otimes\mathbf{x}_i \right\|^{1/2}\max_{1\le i\le N}\|\mathbf{x}_i\|,$$
where $\{\varepsilon_i\}_{i=1}^N$ are independent Rademacher random variables.
This moment inequality immediately gives a $\psi_2$ estimate on the random variable $\left\| \sum_{i=1}^N \varepsilon_i\,\mathbf{x}_i\otimes\mathbf{x}_i \right\|$.
Corollary 5.5.7. There exists an absolute constant $c$ such that for any integers $n$ and $N$, for any $\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathbb{R}^n$ and any $t > 0$,
$$\mathbb{P}\left( \left\| \sum_{i=1}^N \varepsilon_i\,\mathbf{x}_i\otimes\mathbf{x}_i \right\| \ge t \right) \le 2\exp\left( -\frac{t^2}{\Delta^2} \right),$$
where $\Delta = c\sqrt{\log n}\left\| \sum_{i=1}^N \mathbf{x}_i\otimes\mathbf{x}_i \right\|^{1/2}\max_{1\le i\le N}\|\mathbf{x}_i\|$.

Finally, we are ready to present the main result of Mendelson and Pajor [274] and its proof.
Theorem 5.5.8 (Mendelson and Pajor [274]). There exists an absolute constant $c$ for which the following holds. Let $\mathbf{x}$ be a random vector in $\mathbb{R}^n$ which satisfies Assumption 5.5.3 and set $Z = \|\mathbf{x}\|$. For any integers $n$ and $N$ let
$$A_{n,N} = \|Z\|_{\psi_\alpha}\frac{\sqrt{\log n}\,(\log N)^{1/\alpha}}{\sqrt{N}} \quad \text{and} \quad B_{n,N} = \frac{\rho^2}{\sqrt{N}} + \|\boldsymbol{\Sigma}\|^{1/2}A_{n,N}.$$
Then, for any $t > 0$,
$$\mathbb{P}\left( \left\| \sum_{i=1}^N \left( \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma} \right) \right\| \ge tN \right) \le \exp\left( -\left( \frac{ct}{\max\left\{ B_{n,N}, A_{n,N}^2 \right\}} \right)^{\beta} \right),$$
where $\beta = \left( 1 + 2/\alpha \right)^{-1}$ and $\boldsymbol{\Sigma} = \mathbb{E}\left( \mathbf{x}\otimes\mathbf{x} \right)$.
Proof. First, recall that if $\mathbf{z}$ is a vector-valued random variable with a bounded $\psi_\alpha$ norm, and if $\mathbf{z}_1, \ldots, \mathbf{z}_N$ are $N$ independent copies of $\mathbf{z}$, then
$$\left\| \max_{1\le i\le N}\|\mathbf{z}_i\| \right\|_{\psi_\alpha} \le C\|\mathbf{z}\|_{\psi_\alpha}\log^{1/\alpha}N,$$
for an absolute constant $C$. Hence, for any $k$,
$$\left( \mathbb{E}\max_{1\le i\le N}|\mathbf{z}_i|^k \right)^{1/k} \le Ck^{1/\alpha}\|\mathbf{z}\|_{\psi_\alpha}\log^{1/\alpha}N. \quad (5.11)$$
1iN

Consider the scalar-valued random variables


/ / / /
/1 N / /1 N /
/ / / /
U =/ εi xi ⊗ xi / and V = / (xi ⊗ xi − Σ)/ .
/N / /N /
i=1 i=1

Combining Corollaries 5.5.5 and 5.5.7, we obtain

P (V  t)  4P (U  t/2) = 4E  x Pε (U  t/2 |x1 , . . . , xN )


t2 N 2
 8Ex exp − Δ2 .

/ /1/2 / /
√ /N / / /
where Δ = c log n/ xi ⊗ xi / /max xi /
/ / /
/ for some constant c. Setting c0 to
i=1 1iN
be the constant in Corollary
 5.5.5, then by Fubini’s theorem and dividing the region
of integration
 to t  c 0 ρ4 /N (in this region there is no control on P (V  t)) and
4
t > c0 ρ /N , it follows that the k-th order moments are
) ∞ k−1
EV k = kt P (V  t) dt
) c0 √ρ4 /N k−1
0
)∞
= 0 √ kt P (V  t) dt + 0 ktk−1 P (V  t) dt
) c ρ4 /N k−1 )∞  2 2
 00 kt P (V  t) dt + 8Ex 0 ktk−1 exp − t ΔN2 dt
  k Δ k
 c0 ρ4 /N + ck k k/2 Ex N

for some new absolute constant c.


We can bound the second term by using
$$c^k\,\mathbb{E}\left( \frac{\Delta}{N} \right)^k = c^k\left( \frac{\log n}{N} \right)^{k/2}\mathbb{E}\left[ \left\| \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i\otimes\mathbf{x}_i \right\|^{k/2}\max_{1\le i\le N}\|\mathbf{x}_i\|^k \right] \le c^k\left( \frac{\log n}{N} \right)^{k/2}\mathbb{E}\left[ \left( \left\| \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma} \right\| + \|\boldsymbol{\Sigma}\| \right)^{k/2}\max_{1\le i\le N}\|\mathbf{x}_i\|^k \right]$$
$$\le c^k\left( \frac{\log n}{N} \right)^{k/2}\left( \left( \mathbb{E}V^k \right)^{1/2} + \|\boldsymbol{\Sigma}\|^{k/2} \right)\left( \mathbb{E}\max_{1\le i\le N}\|\mathbf{x}_i\|^{2k} \right)^{1/2}$$
for some new absolute constant $c$. Thus, setting $Z = \|\mathbf{x}\|$ and using Assumption 5.5.3 and (5.11), we obtain
$$\left( \mathbb{E}V^k \right)^{1/k} \le c\frac{\rho^2}{\sqrt{N}} + ck^{\frac{1}{\alpha}+\frac{1}{2}}\left( \frac{\log n}{N} \right)^{1/2}\log^{1/\alpha}N\,\|Z\|_{\psi_\alpha}\left( \left( \mathbb{E}V^k \right)^{1/k} + \|\boldsymbol{\Sigma}\| \right)^{1/2},$$

for some new absolute constant $c$. Set $A_{n,N} = \left( \frac{\log n}{N} \right)^{1/2}\log^{1/\alpha}N\,\|Z\|_{\psi_\alpha}$ and $\beta = \left( 1 + 2/\alpha \right)^{-1}$, so that $k^{\frac{1}{\alpha}+\frac{1}{2}} = k^{\frac{1}{2\beta}}$. Thus,
$$\left( \mathbb{E}V^k \right)^{1/k} \le c\frac{\rho^2}{\sqrt{N}} + ck^{\frac{1}{2\beta}}\|\boldsymbol{\Sigma}\|^{1/2}A_{n,N} + ck^{\frac{1}{2\beta}}A_{n,N}\left( \mathbb{E}V^k \right)^{1/2k},$$
from which we have
$$\left( \mathbb{E}V^k \right)^{1/k} \le ck^{1/\beta}\max\left\{ \frac{\rho^2}{\sqrt{N}} + \|\boldsymbol{\Sigma}\|^{1/2}A_{n,N},\; A_{n,N}^2 \right\},$$
and thus,
$$\|V\|_{\psi_\beta} \le c\max\left\{ B_{n,N}, A_{n,N}^2 \right\},$$
from which the estimate of the theorem follows by a standard argument. $\square$


Consider the case where $\mathbf{x}$ is a bounded random vector. Then, for any $\alpha$, $\|Z\|_{\psi_\alpha} \le \sup\|\mathbf{x}\| \equiv R$, and by taking $\alpha \to \infty$ one can set $\beta = 1$ and $A_{n,N} = R\sqrt{\frac{\log n}{N}}$. We obtain the following corollary.
Corollary 5.5.9 (Mendelson and Pajor [274]). There exists an absolute constant $c$ for which the following holds. Let $\mathbf{x}$ be a random vector in $\mathbb{R}^n$ supported in $RB_2^n$ which satisfies Assumption 5.5.3. Then, for any $t > 0$,
$$\mathbb{P}\left( \left\| \sum_{i=1}^N \left( \mathbf{x}_i\otimes\mathbf{x}_i - \boldsymbol{\Sigma} \right) \right\| \ge tN \right) \le \exp\left( -\frac{ct}{R^2}\min\left\{ \frac{\sqrt{N}}{\sqrt{\log n}}, \frac{N}{\log n} \right\} \right).$$
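A short numerical sketch in the setting of Corollary 5.5.9 (the Rademacher model—bounded coordinates with $R = \sqrt{n}$ and $\boldsymbol{\Sigma} = \mathbf{I}$—is an assumed example of this illustration): the operator-norm error of the empirical covariance decays at roughly the $N^{-1/2}$ rate once $N$ exceeds a multiple of $n$.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 40

for N in (200, 800, 3200, 12800):
    X = rng.choice([-1.0, 1.0], size=(N, n))       # bounded isotropic rows, Sigma = I
    err = np.linalg.norm(X.T @ X / N - np.eye(n), 2)
    print(f"N = {N:6d}: operator-norm error = {err:.4f}, "
          f"reference sqrt(n/N) = {np.sqrt(n / N):.4f}")
```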


The second case is when Z ψα  c1 n, where x is a random vector associated
with a convex body in an isotropic position.
Corollary 5.5.10 (Mendelson and Pajor [274]). These exists an absolute constant
√ x be a random vector in R that satisfies
n
c for which the following holds. Let
Assumption 5.5.3 with Z ψα  c1 n. Then, for any t > 0
/ / 
/ N /
/ /
P / xi ⊗ xi −Σ/  tN
/ /
i=1
⎛ + " ,1/2 ⎞
ρ2 (n log n) log N 2 (n log n) log N
 exp ⎝−c x/ max √ +ρc1 , c1 ⎠.
N N N

Let us consider two applications: the singular values and the integral operators.
For the first one, the random vector x corresponds to the volume measure of some
convex symmetric body in an isotropic position.

Corollary 5.5.11 (Mendelson and Pajor [274]). There exist absolute constants $c_1, c_2, c_3, c_4$ for which the following holds. Let $K \subset \mathbb{R}^n$ be a symmetric convex body in an isotropic position, let $\mathbf{x}_1, \ldots, \mathbf{x}_N$ be $N$ independent points sampled according to the normalized volume measure on $K$, and set
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}$$
with non-zero singular values $\lambda_1, \ldots, \lambda_n$.
1. If $N \ge c_1 n\log^2 n$, then for every $t > 0$,
$$\mathbb{P}\left( (1-t)\sqrt{N} \le \lambda_i \le (1+t)\sqrt{N} \right) \ge 1 - \exp\left( -c_2 t^{1/2}\left( \frac{N}{(\log N)(n\log n)} \right)^{1/4} \right).$$
2. If $N > c_3 n$, then with probability at least $1/2$, $\lambda_1 \ge c_4\sqrt{N\log n}$.
Example 5.5.12 (Learning Integral Operators [274]). Let us apply the results above to the approximation of integral operators. See also [277] for learning theory from an approximation theory viewpoint. A compact integral operator with a symmetric kernel is approximated by the matrix of an empirical version of the operator [278]. What sample size yields a given accuracy?
Let $\Omega \subseteq \mathbb{R}^d$ and let $\nu$ be a probability measure on $\Omega$. Let $t$ be a random variable on $\Omega$ distributed according to $\nu$ and consider $X(t) = \sum_{i=1}^\infty \sqrt{\lambda_i}\,\phi_i(t)\phi_i$, where $\{\phi_i\}_{i=1}^\infty$ is a complete basis in $L_2(\Omega, \mu)$ and $\{\lambda_i\}_{i=1}^\infty \in \ell_1$.
Let $L$ be a bounded, positive-definite kernel [89] on some probability space $(\Omega, \mu)$. (See also Sect. 1.17 for the background on positive operators.) By Mercer's theorem, there is an orthonormal basis of $L_2(\Omega, \mu)$, denoted by $\{\phi_i\}_{i=1}^\infty$, such that
$$L(t, s) = \sum_{i=1}^\infty \lambda_i\phi_i(t)\phi_i(s).$$
Thus,
$$\langle X(t), X(s)\rangle = L(t, s)$$
and the squares of the singular values of the random matrix $\mathbf{X}$ are the eigenvalues of the Gram matrix (or empirical covariance matrix)
$$\mathbf{G} = \left( \langle X(t_i), X(t_j)\rangle \right)_{i,j=1}^N,$$
where $t_1, \ldots, t_N$ are random variables distributed according to $\nu$. It is natural to ask if the eigenvalues of this Gram matrix $\mathbf{G}$ converge in some sense to the eigenvalues of the integral operator
$$\left( T_L f \right)(x) = \int L(x, y)f(y)\,d\nu(y).$$
This question is useful to kernel-based learning [279]. It is not clear how the eigenvalues of the integral operator should be estimated from the given data in the form of the Gram matrix $\mathbf{G}$. The results above enable us to do just that; indeed, if $X(t) = \sum_{i=1}^\infty \sqrt{\lambda_i}\,\phi_i(t)\phi_i$, then
$$\mathbb{E}\left( X\otimes X \right) = \sum_{i=1}^\infty \lambda_i\langle\phi_i, \cdot\rangle\phi_i = T_L.$$
Assume $L$ is continuous and that $\Omega$ is compact. Thus, by Mercer's theorem,
$$L(x, y) = \sum_{i=1}^\infty \lambda_i\phi_i(x)\phi_i(y),$$
where $\{\lambda_i\}_{i=1}^\infty$ are the eigenvalues of $T_L$, the integral operator associated with $L$ and $\mu$, and $\{\phi_i\}_{i=1}^\infty$ form a complete basis in $L_2(\mu)$. Also, $T_L$ is a trace-class operator since $\sum_{i=1}^\infty \lambda_i = \int L(x, x)\,d\mu(x)$.
To apply Theorem 5.5.8, we encounter a difficulty since $X(t)$ is of infinite dimension. To overcome this, define
$$X_n(t) = \sum_{i=1}^n \sqrt{\lambda_i}\,\phi_i(t)\phi_i,$$
where $n$ is to be specified later. Mendelson and Pajor [274] show that $X_n(t)$ is sufficiently close to $X(t)$; it follows that our problem is really finite dimensional.
A deviation inequality of the form (5.10) enables us to estimate, with high probability, the eigenvalues of the integral operator (infinite dimensions) using the eigenvalues of the Gram matrix or empirical covariance matrix (finite dimensions). This problem was also studied in [278], as pointed out above.
Learning with integral operators [127, 128] is relevant, as sketched below. $\square$
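A minimal numerical sketch of Example 5.5.12 (the Gaussian kernel on $[0,1]$ with the uniform measure, the bandwidth, and the sample sizes are illustrative assumptions of this sketch): the leading eigenvalues of the empirical Gram matrix $\mathbf{G}/N$ stabilize and agree with a fine quadrature (Nyström-type) discretization of the integral operator $T_L$.

```python
import numpy as np

rng = np.random.default_rng(9)

def gaussian_kernel(s, t, width=0.3):
    """L(s, t) = exp(-(s - t)^2 / (2 width^2)) evaluated on grids of points."""
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * width ** 2))

# Leading eigenvalues of the empirical Gram matrix G/N with t_i i.i.d. uniform on [0, 1].
N = 1000
t = rng.uniform(size=N)
gram_eigs = np.linalg.eigvalsh(gaussian_kernel(t, t) / N)[::-1][:5]

# Reference: fine uniform-grid discretization of the integral operator T_L.
M = 2000
grid = (np.arange(M) + 0.5) / M
op_eigs = np.linalg.eigvalsh(gaussian_kernel(grid, grid) / M)[::-1][:5]

print("Gram-matrix estimates:", np.round(gram_eigs, 4))
print("quadrature reference :", np.round(op_eigs, 4))
```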
Example 5.5.13 (Inverse Problems as Integral Operators [88]). Inverse problems may be formulated as integral operators [280], which in turn may be approximated by the empirical version of the integral operator, as shown in Example 5.5.12. Integral operators are compact operators [89] in many natural topologies under very weak conditions on the kernel. Many examples are given in [281].
Once the connection between the integral operators and the concentration inequality is recognized, we can further extend this connection to electromagnetic inverse problems such as RF tomography [282, 283]. For relevant work, we refer to [127, 284–287]. $\square$

5.6 Concentration for Isotropic, Log-Concave Random Vectors

The material here can be found in [288–291].

5.6.1 Paouris' Concentration Inequality

For our exposition, we closely follow [272], a breakthrough work. Let $K$ be an isotropic convex body in $\mathbb{R}^n$. This implies that $K$ has volume equal to 1, its center of mass is at the origin, and its inertia matrix is a multiple of the identity. Equivalently, there is a positive constant $L_K$, the isotropic constant of $K$, such that
$$\int_K \langle\mathbf{x},\mathbf{y}\rangle^2\, d\mathbf{x} = L_K^2 \quad (5.12)$$
for every $\mathbf{y} \in S^{n-1}$.


A major problem in Asymptotic Convex Geometry is whether there is an absolute constant $C > 0$ such that $L_K \le C$ for every $n$ and every isotropic convex body $K$ in $\mathbb{R}^n$. The best known estimate, due to Bourgain [292], is $L_K \le c\,\sqrt[4]{n}\,\log n$, where $c$ is an absolute constant. Klartag [293] has obtained an isomorphic answer to the question: for every symmetric convex body $K$ in $\mathbb{R}^n$, there is a second symmetric convex body $T$ in $\mathbb{R}^n$ whose Banach-Mazur distance from $K$ is $O(\log n)$ and whose isotropic constant is bounded by an absolute constant: $L_T \le c$.
The starting point of [272] is the following concentration estimate of Alesker [275]: there is an absolute constant $c > 0$ such that if $K$ is an isotropic convex body in $\mathbb{R}^n$, then
$$\mathbb{P}\left( \mathbf{x} \in K: \|\mathbf{x}\|_2 \ge c\sqrt{n}L_K t \right) \le 2e^{-t^2} \quad (5.13)$$
for every $t \ge 1$.
Bobkov and Nazarov [294, 295] have clarified the picture of volume distribution on isotropic, unconditional convex bodies. A symmetric convex body $K$ is called unconditional if, for every choice of real numbers $t_i$ and every choice of $\varepsilon_i \in \{-1, 1\}$, $1 \le i \le n$,
$$\left\| \varepsilon_1 t_1\mathbf{e}_1 + \cdots + \varepsilon_n t_n\mathbf{e}_n \right\|_K = \left\| t_1\mathbf{e}_1 + \cdots + t_n\mathbf{e}_n \right\|_K,$$
where $\|\cdot\|_K$ is the norm that corresponds to $K$ and $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$ is the standard orthonormal basis of $\mathbb{R}^n$. In particular, they obtained a striking strengthening of (5.13) in the case of a 1-unconditional isotropic convex body: there is an absolute constant $c > 0$ such that if $K$ is a 1-unconditional isotropic convex body in $\mathbb{R}^n$, then
$$\mathbb{P}\left( \mathbf{x} \in K: \|\mathbf{x}\|_2 \ge c\sqrt{n}\,t \right) \le 2e^{-t\sqrt{n}} \quad (5.14)$$
for every $t \ge 1$. Note that $L_K \approx 1$ in the case of 1-unconditional convex bodies [264].
Paouris [272] obtained the following theorem in its full generality.
Theorem 5.6.1 (Paouris’ Inequality for Isotropic Convex Body [272]). There is
an absolute constant c > 0 such that if K is an isotropic convex body in Rn , then
 √  √
P x∈K: x 2  c nLk t  2e−t n

for every t ≥ 1.
Reference [291] gives a short proof of Paouris’ inequality. Assume that x has a log-
concave distribution (a typical example of such a distribution is a random vector
uniformly distributed on a convex body). Assume further that it is centered and
its covariance matrix is the identity such a random vector will be called isotropic.
The tail behavior of the Euclidean norm ||x||2 of an isotropic, log-concave random
vector x ∈ Rn , states that for every t > 1,
 √ √
P x 2  ct n  e−t n .

More precisely, we have that for any log-concave random vector x and any p ≥ 1,

p 1/p p 1/p
(E x 2 ) ∼E x 2 + sup (E| y, x | ) .
y∈S n−1

This result had a huge impact on the study of log-concave measures and has a lot of
applications in that subject.
Let x ∈ Rn be a random vector, denote the weak p-th moment of x by
p 1/p
σp (x) = sup (E| y, x | ) .
y∈S n−1

Theorem 5.6.2 ([291]). For any log-concave random vector x ∈ Rn and any
p ≥ 1,

p 1/p
(E x 2 )  C (E x 2 + σp (x)) ,

where C is an absolute positive constant.


Theorem 5.6.1 allows us to prove the following in full generality.
292 5 Non-asymptotic, Local Theory of Random Matrices

Theorem 5.6.3 (Approximation of the identity operator [272]). Let ε ∈ (0, 1).
Assume that n ≥ n0 and let K be an isotropic convex body in Rn . If N 
c (ε) n log n, where c > 0 is an absolute constant, and if x1 , . . . , xN ∈ Rn are
independent random points uniformly distributed in K, then with probability greater
than 1 − ε we have
N
1 2
(1 − ε) L2K  xi , y  (1 + ε) L2K ,
N i=1

for every y ∈ S n−1 .


In the proof of Theorem 5.6.3, Paouris followed the argument of [296] that
incorporates the concentration estimate of Theorem 5.6.1 into Rudelson’s approach
to the problem [93]. Theorem 5.4.2 from Rudelson [93] is used as a lemma.
We refer to the books [152, 167, 263] for basic facts from the Brunn-Minkowski
and the asymptotic theory of finite dimensional normed spaces.
Aubrun [297] has proved that in the unconditional case, only C()n random
points are enough to obtain (1 + ε)-approximation of the identity operator as in
Theorem 5.6.3.
All the previous results remain valid if we replace Lebesgue measure on the
isotropic convex body by an arbitrary isotropic, log-concave measure.

5.6.2 Non-increasing Rearrangement and Order Statistics

Log-concave vectors has recently received a lot of attention. Paouris’ concentration


of mass [272] says that, for any isotropic, log-concave vector x in Rn ,
 √ √
P x 2  Ct n  e−t n . (5.15)
 1/2

n
where |x| = x 2 = x2i . One wonders about the concentration of p
i=1
norm. For p ∈ (1, 2), this is an easy consequence of (5.15) and Hölder’s inequality.
For p > 2, new ideas are suggested by Latala [289], based on tail estimates of order
statistics of x.
Theorem 5.6.4 (Latala [289]). For any δ > 0 there exist constants C1 (δ), C2 (δ) ≤
Cδ −1/2 such that for any p ≥ 2 + δ,
 
− 1 t
P x p  t  e C1 (δ) for t  C1 (δ) pn1/p

and
5.6 Concentration for Isotropic, Log-Concave Random Vectors 293

 1/q  
q
E x p  C2 (δ) pn1/p + q for q  2.

For an n-dimensional random vector x, by |X1∗ |  . . .  |Xn∗ | we denote the


non-increasing rearrangement of |X1 | , . . . , |Xn | . Random variable X1∗ , 1 ≤ k ≤ n,
are called order statistics of X. In particular,

|X1∗ | = max {|X1 | , . . . , |Xn |} , and |Xn∗ | = min {|X1 | , . . . , |Xn |} .

Random variables Xn∗ are called order statistics of X.


By (5.15), we immediately have for isotropic, log-concave vectors x =
(X1 , . . . , Xn )
1

P (Xk∗  t)  e− C kt
(5.16)

for t  Cn/k. The main result of [289] is to show (5.16) is valid for

t  C log (en/k) .

5.6.3 Sample Covariance Matrix

Taken from [298–300], the development here is motivated for the convergence of
empirical (or sample) covariance matrix.
In the recent years a lot of work was done on the study of the empirical
covariance matrix, and on understanding related random matrices with independent
rows or columns. In particular, such matrices appear naturally in two important
(and distinct) directions. That is (1) estimation of covariance matrices of high-
dimensional distributions by empirical covariance matrices; and (2) the Restricted
Isometry Property of sensing matrices defined in the Compressive Sensing theory.
See elsewhere of the book for the background on RIP and covariance matrix
estimation.
Let x ∈ Rn be a centered vector whose covariance matrix is the identity matrix,
and x1 , . . . , xN ∈ Rn are independent copies of x. Let A be a random n × N
matrix whose columns are (xi ). By λmin (respectively λmax ) we denote the smallest
(respectively the largest) singular value of the empirical covariance matrix AAT .
For Gaussian matrices, it is known that
" "
n λmin λmax n
1−C   1+C . (5.17)
N N N N

with probability close to 1.


294 5 Non-asymptotic, Local Theory of Random Matrices

Let x1 , . . . , xN ∈ Rn be a sequence of random vectors on Rn (not necessarily


identical). We say that is uniformly distributed if for some ψ > 0

sup sup | xi , y | ψ  ψ, (5.18)


iN y∈S n−1

for a random variable Y ∈ R, Y ψ1 = inf {C > 0; E exp (|Y | /C)  2} . We say


it satisfies the boundedness condition with constant K (for some K ≥ 1) if
 0 1 
√ √
 e−
1/4
P max |Xi | / n > K max 1, (N/n) n
. (5.19)
iN

Theorem 5.6.5 (Adamczak et al. [299]). Let N, n be positive integers and


ψ, K > 1. Let x1 , . . . , xN ∈ Rn be independent random
√ vectors satisfying (5.18)
and (5.19). Then with probability at least 1 − 2e−c n one has
  "
1 N  
 2  2 2 n
sup  | xi , y | − E| xi , y |   C(ψ + K) .
y∈S n−1  N i=1
 N

Theorem 5.6.5 improves estimates obtained in [301] for log-concave, isotropic


vectors. There, the result had a logarithmic factor. Theorem 5.6.5 removes this factor
completely leading to the best possible estimate for an arbitrary N, that is to an
estimate known for random matrices as in the Gaussian case.
As a consequence, we obtain in our setting, the quantitative version of Bai-Yin
theorem [302] known for random matrices with i.i.d. entries.
Theorem 5.6.6 (Adamczak et al. [299]). Let A be a random n × N matrix, whose
columns x1 , . . . , xN ∈ Rn be isotropic√ random vectors satisfying Theorem 5.6.5.
Then with probability at least 1 − 2e−c n one has
" "
2 n λmin λmax 2 n
1 − C(ψ + K)    1 + C(ψ + K) . (5.20)
N N N N

The strength of the above results is that the conditions (5.18) and (5.19) are valid
for many classes of distributions.
Example 5.6.7 (Uniformly distributed).
√ Random vectors uniformly distributed on
the Euclidean ball of radius K n clearly satisfy (5.19). They also satisfy (5.18)
with ψ = CK. 
Lemma 5.6.8 (Lemma 3.1 of [301]). Let x1 , . . . , xN ∈ Rn be i.i.d. random
vectors, distributed according to an isotropic, log-concave probability measure
√ on
Rn . There exists an absolute positive constant C0 such that for any N ≤ e n and
for every K ≥ 1 one has

max |Xi |  C0 K n.
iN
5.7 Concentration Inequality for Small Ball Probability 295

Proof. From [272] we have for every i ≤ N


 √ √
P |Xi |  Ct n  e−tc n ,

where C and c are absolute constants. The result follow by the union bound. 
Example 5.6.9 (Log-concave isotropic). Log-concave, isotropic random vectors in
Rn . Such vectors satisfy (5.18) and (5.19) for some absolute constants ψ and K. The
boundedness condition follows from Lemma 5.6.8. A version of result with weaker
was proven by Aubrun [297] in the case of isotropic, log-concave rand vectors under
an additional assumption of unconditionality. 
Example 5.6.10 (Any isotropic random vector satisfying the Poincare inequality).
Any isotropic random vectors (xi )iN ∈ Rn , satisfying the Poincare inequality
with constant L, i.e., such that
2
varf (xi )  L2 E|∇f (xi )|

for all compactly supported smooth functions, satisfying (5.18) with ψ = CL


and (5.19) with K = CL.
The question from [303] regarding whether all log-concave, isotropic random
vectors satisfy the Poincare inequality with an absolute constant is one of the major
open problems in the theory of log-concave measures. 

5.7 Concentration Inequality for Small Ball Probability

We give a small ball probability inequality for isotropic log-concave probability


measures, taking material from [79, 304]. There is an effort to replace the notion
of independence by the “geometry” of convex bodies, since a log-concave measure
should be considered as the measure-theoretic equivalent of a convex body. Most of
these recent results make heavy use of tools from the asymptotic theory of finite-
dimensional normed spaces.
Theorem 5.7.1 (Theorem 2.5 of Latala [79]). Let A is an n×n matrix and let x =
(ξ1 , . . . , ξn ) be a random vector, where ξi are independent sub-Gaussian random
variables with var (ξi )  1 and sub-Gaussian bounded by β. Then, for any y ∈ Rn ,
one has
   
A A
P Ax − y 2  2 HS  2 exp − βc04 AHS ,
op

where c0 > 0 is a universal constant.


A standard calculation shows that in the situation of the theorem, for A = [aij ]
one has

2 2 2
E|Ax − y|  E|A (x − Ex)| = a2ij Var (ξi )  A F .
i.j
296 5 Non-asymptotic, Local Theory of Random Matrices

Proof. Due to the highly accessible


  nature of this proof, we include the arguments

taken from [79]. Let x = ξ1 , . . . , ξn be the independent copy of x and set
z = (Z1 , . . . , Zn ) = x − x . Variables Zi are independent symmetric with sub-
Gaussian constants at most 2β. See also Sect. 1.11 for rademacher averages and
symmetrization that will be used below.
Put

p := P (|Ax − y|  A F /2) .

Then
   
p2 = P |Ax − y|  A Ax − y  A
F /2, P F /2
 P (|Az|  A F ) .

Let B = AAT = (bij ). Then bii = a2ij  0 and
j

2
|Az| = Bz, z = bij Zi Zj = bii Zi2 + 2 bij Zi Zj .
i,j i i<j

Since Var (Zi ) = 2 Var (ξi )  2, so that

2
bii EZi2  2 Tr (B) = 2 A F .
i

Thus
 
2 2
p2  P |Az|  A F

   2
P 2 bij Zi Zj + bii Zi2 − EZi2 − A F
 i<j  i
  
   
  2 2
 P  bij Zi Zj   A /3 +P bii Zi2 − EZi2  − A /3
i<j  F
i
F

(5.21)
/ /
Note that we have (bij δi=j ) F  B F = /AAT /F  A op A F
2
and (bij δi=j ) op  B op + (bij δi=j ) op  2 B op  2 A op . So by
Lemma 1.12.1, we get
⎛  ⎞ ⎛ 2 ⎞
 
    4 −1 A

P  bij Zi Zj   A
2
/3⎠  2 exp ⎝− C β F ⎠ . (5.22)
F
 i<j  A op
5.7 Concentration Inequality for Small Ball Probability 297

We have Zi 4 2 ξ 4  2β gi 4 , thus

2 2 2
b2ii EZi4  48β 4 b2ii  48β 4 B F  48β 4 A op A F .
i i

Therefore, by Lemma 1.12.2,


   ⎛ 2 ⎞
  2  
 2  −1 A
 exp ⎝− C  β 4 ⎠.
2
P  bii Zi − EZi   A /3 F
  F
A op
i

(5.23)

Thus, combining (5.21)–(5.23), we have


⎛ 2 ⎞
 −1 A
p2  4 exp ⎝− max (C  , C  ) β 4 F ⎠,
A op

which completes the proof. 


Theorem 5.7.2 (Proposition 2.6 of Latala [79]). Let A is a non-zero n × n matrix
and let x = (g1 , . . . , gn ) be a random vector, where gi are independent standard
Gaussian N (0, 1) random variables. Then, for any t ∈ (0, c1 ) and any y ∈ Rn ,
one has
 2
A
c2 AHS
P ( Ax − y 2 t A HS )  t
op ,
where c1 , c2 > 0 are universal constants.
Theorem 5.7.3 (Paouris [304]). Let x is an isotropic log-concave random vector
in Rn , which has sub-Gaussian constant b. Let A is a non-zero n × n matrix. Then,
for any t ∈ (0, c1 ) and any y ∈ Rn , one has
 2
c2 AHS

P ( Ax − y 2 t A HS )  t
b Aop
,

where c1 , c2 > 0 are universal constants.


For a subset A ∈ Rn we denote the convex hull of A by conv A and the
symmetric convex hull of A is denoted by conv (A ∪ −A). By a symmetric body
B ∈ Rn we mean a centrally symmetric compact subset Rn with nonempty interior,
i.e., B is a convex body satisfying B = −B. Often, we identify such a symmetric
convex body B with the n-dimensional Banach space (Rn , · B ) for which B is
the unit ball. B2n stands for the Euclidean unit ball in Rn . The volume of a body
B ∈ Rn is denoted by |B|.
In [79, Theorem 4.2], we get estimates of a similar type as in Theorem 5.7.1 for
the probability that Ax belongs to a general convex and symmetric set K rather
than to a Euclidean ball.
298 5 Non-asymptotic, Local Theory of Random Matrices

Theorem 5.7.4 (Theorem 4.2 of Latala [79]). Let A is a non-zero n × n matrix


and let x = (g1 , . . . , gn ), where ξi are independent sub-Gaussian random variables
with Var (ξi )  1 and sub-Gaussian constants at most β. Let K ∈ Rn be a
symmetric convex body satisfying B2n ⊂ K. Then

 √  c A
2
P Ax ∈ α A op nK  3 exp − 4 F
2 ,
2β A op

1/n
where VK = (|K| / |B2n |) ,

(C/η) ln(AF /(6β n))
α = 3β(4πVK ) ,
2 
and η = A F / β 4 n . Here c0 is the constant from Theorem 5.7.1, and C is a
universal constant.

Example 5.7.5 (Hypothesis Testing). Consider the hypothesis testing

H0 : y = x
H1 : y = s + x

where s ∈ Rn is the unknown signal vector and x ∈ Rn is the noise vector


x = (ξ1 , . . . , ξn ) where ξi are independent sub-random variables that satisfy the
conditions of the above theorems. Let B be a non-zero n × n matrix. The algorithm
is described by:

y−Bx2
for BF  τ0 , claim H0 .
y−Bx2
for BF  τ1 , claim H1 .
   
y−Bx2 y−Bx2
We are interested in P BF  τ 1 |H 1 and P BF  τ 0 |H0 , which
may be handled by the above theorems. 
Example 5.7.6 (Hypothesis Testing for Compressed Sensed Data). In Sect. 10.16,
the observation vector for compressed sensing is modeled in (10.54) and rep-
eated here
y = A (s + x) (5.24)
where s ∈ Rn is an unknown vector
 in S, A ∈ R m×n
is a random matrix with i.i.d.
N (0, 1) entries, and x ∼ N 0, σ 2 In×n denotes noise with known variance σ 2
that is independent of A. The noise model (5.24) is different from a more commonly
studied case
y = As + x (5.25)

where z ∼ N 0, σ Im×m .
2
5.8 Moment Estimates 299

Consider the hypothesis testing

H0 : y = x
H1 : y = A(s + x).

For B = AH , we have
/ / / /
/y − Bx / y − BAx /y − AH Ax/
2 2 2
= = .
B F B F A F

The algorithm has become


/ /
/y − AH Ax/
for 2
< τ0 , claim H0 .
A F
/ /
/y − AH Ax/
for 2
> τ1 , claim H1 .
A F
   
y−AH Ax y−AH Ax
We are interested P AF
2
 τ1 |H1 and P AF
2
 τ0 |H0 ,
which can be handled by above theorems. When using these theorems, we only
need to replace A by AH A. In a sense, AH A behaves like an identity matrix. 
For N ≥ n consider N random vectors xi = (ξ1,i , ξ2,i , . . . , ξn,i ) ∈
Rn , i = 1, . . . , N, where ξi,j are independent sub-Gaussian random variables
with Var (ξi )  1 and sub-Gaussian constants at most β. Denote the matrix
[ξi,j ]in,jN by X. See Sect. 1.12 for the notation and relevant background.

5.8 Moment Estimates

For a random matrix X ∈ Cn1 ×n2 , the singular values of the random matrix X
are σ1  σ2  . . .  σn , n = max{n1 , n2 }. We can form a random vector σ =
T
(σ1 , σ2 , . . . , σn ) . For N independent copies X1 , . . . , XN of the random matrix
X, we can form N independent copies σ 1 , . . . , σ N of the random vector σ. So we
can use the moments for random vectors to study these random matrices.

5.8.1 Moments for Isotropic Log-Concave Random Vectors

The aim here is to approximate the moments by the empirical averages with high
probability. The material is taken from [267]. For every 1 ≤ q ≤ +∞,we define q ∗
300 5 Non-asymptotic, Local Theory of Random Matrices

to be the conjugate of q, i.e., 1/q + 1/q ∗ = 1. Let α > 0 and let ν be a probability
measure on (X, Ω). For a function f : X → R define the ψα -norm by
  

= inf λ > 0 
α
f ψα exp (|f | /λ) dν  2 .
X

Chebychev’s inequality shows that the functions with bounded ψα -norm are
strongly concentrated, namely ν {x ||f (x)| > λt }  C exp (−tα ). We denote by D
the radius of the symmetric convex set K i.e., the smallest D such that K ⊂ DB2n ,
where B2n is the unit ball in Rn .
Let K ∈ Rn be a convex symmetric body, and · K the norm. The modulus of
convexity of K is defined for any ε ∈ (0, 2) by
 / / 
/x + y/
δK (ε) = inf 1 − / /
/ 2 / , x K = 1, y K = 1, x − y K > ε . (5.26)
K

We say that K has modulus of convexity of power type q ≥ 2 if δK (ε)  cεq for
every ε ∈ (0, 2). This property is equivalent to the fact that the inequality
/ / / /
/ x + y /q 1 / x − y /q 1
/ / / / q q
/ 2 / + λq / 2 /  2( x K + y K) ,
K K

for all x, y ∈ Rn . Here λ > 0 is a constant depending only on c and q. We


shall say that K has modulus of convexity of power type q with constant λ.
Classical examples of convex bodies satisfying this property are unit balls of finite
dimensional subspaces of Lq [305] or of non-commutative Lq -spaces (like Schatten
trace class matrices [118]).
Given a random vector x ∈ Rn , let x1 , . . . , xN be N independent copies of x.
Let K ∈ Rn be a convex symmetric body. Denote by
 
1 N 
 p p
Vp (K) = sup  | xi , y | − E| x, y | 
y∈K  N i=1


the maximal deviation of the empirical p-norm of x from the exact one. We want
to bound Vp (K) under minimum assumptions on the body K and random vector
x. We choose the size of the sample N such that this deviation is small with high
probability.
To bound such random process, we must have some control of the random
variable max1iN x 2 , where || · ||2 denotes the standard Euclidean norm. To this
end we introduce the parameter

p 1/p
κp,N (x) = (Emax1iN x 2 ) .
5.8 Moment Estimates 301

Theorem 5.8.1 (Guédon and Rudelson [267]). Let K ⊂ (Rn , ·, · ) be a


symmetric convex body of radius D. Assume that K has modulus of convexity
of power type q for some q ≥ 2. Let p ≥ q and q ∗ be the conjugate of q.
Let x be a random vector in Rn , and let x1 , . . . , xN be N independent copies of
x. Assume that
2/q ∗
(log N ) p p
Cp,λ (D · κp,N (x))  β 2 · sup E| x, y |
N y∈K

for some β < 1. Then


p
EVp (K)  2β · sup E| x, a | .
a∈K

The constant Cp,λ in Theorem 5.8.1 depends on p and on the parameter λ in (5.26).
That minimal assumptions on the vector x are enough to guarantee that EVp (K)
becomes small for large N. In most cases, κp,N (x) may be bounded by a simple
quantity:

N
1/p
p
κp,N (x)  E xi 2 .
i=1

Let us investigate the case of x being an isotropic, log-concave random vector in


Rn (or also a vector uniformly distributed in an isotropic convex body). From (5.6),
we have
 1/2
2
·, y ψ1  C E x, y .

From the sharp estimate of Theorem 5.3.2, we will deduce the following.
Theorem 5.8.2 (Guédon and Rudelson [267]). Let x be an isotropic, log-concave
random
√ vector in Rn , and let x1 , . . . , xN be N independent copies of x. If N 
e c n
, then for any p ≥ 2,
 √
p 1/p C n if p  log N
κp,N (x) = (Emax1iN x 2 )  √
Cp n if p  log N.

Theorem 5.8.3 (Guédon and Rudelson [267]). For any ε ∈ (0, 1) and p ≥ 2 these
exists n0 (ε, p) such that for any n ≥ n0 , the following holds: let x be an isotropic,
log-concave random vector in Rn , let x1 , . . . , xN be N independent copies of x, if
B C
N = Cp np/2 log n/ε2
  1/p 

then for any t > ε, with probability greater than 1 − C exp − t/Cp ε , for
any y ∈ Rn
302 5 Non-asymptotic, Local Theory of Random Matrices

N
p 1 p p
(1 − t) E| x, y |  | xi , y |  (1 + t) E| x, y | .
N i=1

The constants Cp , Cp > 0 are real numbers depending only on p.
Let us consider the classical case when x is a Gaussian random vector in Rn .
Let x1 , . . . , xN be independent copies of x. Let p∗ denote the conjugate of p. For
t = (t1 , . . . , tN )T ∈ RN , we have

N N
1/p
p
sup ti x i , y = | xi , y | ,
t∈Bp∗
N
i=1 i=1


N
where ti xi , y is the Gaussian random variable.
i=1
Let Z and Y be Gaussian vectors in RN and Rn , respectively. Using Gordon’s
inequalities [306], it is easy to show that whenever E Z p  ε−1 E Y 2 (i.e. for a
universal constant c, we have N  cp pp/2 np/2 /εp )

N
1/p
1 p
E Z p −E Y 2 E inf | xi , y |
y∈S n−1 N i=1

N
1/p
1 p
 E sup | xi , y | E Z p +E Y 2,
y∈S n−1 N i=1

   
where E Z p + E Y 2 / E Z p − E Y 2  (1 + ε) / (1 − ε). It is there-
fore possible to get (with high probability with respect to the dimension n, see [307])
a family of N random vectors x1 , . . . , xN such that for every y ∈ Rn

N
1/p
1 p 1+ε
A y  | xi , y | A y 2.
2
N i=1
1−ε

This argument significantly improves the bound on m in Theorem 5.8.3 for Gaussian
random vectors.
Below we will be able to extend the estimate for the Gaussian random vector
to random vector x satisfying the ψ2 -norm condition for linear functionals y →
x, y with the same dependence on N. A random variable Z satisfies the ψ2 -norm
condition if and only if for any λ ∈ R
 
2
E exp (λZ)  2 exp cλ2 · Z 2 .
5.8 Moment Estimates 303

Theorem 5.8.4 (Guédon and Rudelson [267]). Let x be an isotropic random


vector in Rn such that all functionals y → x, y satisfy the ψ2 -norm condition. Let
x1 , . . . , xN be independent copies of the random vector x. Then for every p ≥ 2
and every N  np/2
N
1/p
1 p √
sup | xi , y |  c p.
y∈B2n N i=1

5.8.2 Moments for Convex Measures

We follow [308] for our development. Related work is Chevet type inequality and
norms of sub-matrices [267, 309].
Let x ∈ Rn be a random vector in a finite dimensional Euclidean space E with
Euclidean norm x and scalar product < ·, · >. As above, for p > 0, we denote
the weak p-th moment of x by
p 1/p
σp (x) = sup (E y, x ) .
y∈S n−1

p 1/p p 1/p
Clearly (E x )  σp (x) and by Hölder’s inequality, (E x )  E x .
Sometimes we are interested in reversed inequalities of the form
p 1/p
(E x )  C1 E x + C2 σp (x) (5.27)
for p ≥ 1 and constants C1 , C2 .
This is known for some classes of distributions and the question has been studied
in a more general setting (see [288] and references there). Our objective here is to
describe classes for which the relationship (5.27) is satisfied.
Let us recall some known results when (5.27) holds. It clearly holds for Gaussian
vectors and it is not difficult to see that (5.27) is true for sub-Gaussian vectors.
Another example of such a class is the class of so-called log-concave vectors. It is
known that for every log-concave random vector x in a finite dimensional Euclidean
space and any p > 0,
p 1/p
(E x )  C (E x + σp (x)) ,
where C > 0 is a universal constant.
Here we consider the class of complex measures introduced by Borell. Let κ < 0.
A probability measure P on Rm is called κ-concave if for 0 < θ < 1 and for all
compact subsets A, B ∈ Rm with positive measures one has
κ κ 1/κ
P ((1 − θ) A + θB)  ((1 − θ) P(A) + θP(B) ) . (5.28)
A random vector with a κ-concave distribution is called κ-concave. Note that a
log-concave vector is also κ-concave for any κ < 0.
For κ > −1, a κ-concave vector satisfies (5.27) for all 0 < (1 + ε) p < −1/κ
with C1 and C2 depending only on ε.
304 5 Non-asymptotic, Local Theory of Random Matrices

Definition 5.8.5. Let p > 0, m = p, and λ ≥ 1. We say that a random vector x in
E satisfies the assumption H(p, λ) if for every linear mapping A : E → Rm such
that y = Ax is non-degenerate there is a gauge || · || on Rm such that E y < ∞
and
p 1/p
(E y ) < λE y . (5.29)

For example, the standard Gaussian and Rademacher vectors satisfy the above
condition. More generally, a sub-Gaussian random vector also satisfies the above
condition.
Theorem 5.8.6 ([308]). Let p > 0 and λ ≥ 1. If a random vector x in a finite
dimensional Euclidean space satisfies H(p, λ), then
p 1/p
(E x ) < c (λE x + σp (x)) ,
where c is a universal constant.
We can apply above results to the problem of the approximation of the covariance
matrix by the empirical covariance matrix. For a random vector x the covariance
matrix of x is given by ExxT . It is equal to the identity operator I if x is isotropic.

N
The empirical covariance matrix of a sample of size N is defined by N1 xi xTi ,
i=1
where x1 , x2 , . . . , xN are independent copies of x. The main question is how small
can be taken in order that these two matrices are close to each other in the operator
norm.
It was proved there that for N ≥ n and log-concave n-dimensional vectors
x1 , x2 , . . . , xN one has
/ / "
/1 N /
/ / n
/ x i x i − I/  C
T
/N / N
i=1 op

with probability at least 1 − 2 exp (−c n), where I is the identity matrix, and · op
is the operator norm and c, C are absolute positive constants.
In [310, Theorem 1.1], the following condition was introduced: an isotropic
random vector x ∈ Rn is said to satisfy the strong regularity assumption if for
some η, C > 0 and every rank r ≤ n orthogonal projection P, one has that for very
t>C
 √ 
P Px 2  t r  C/ t2+2η r1+η .
In [308], it was shown that an isotropic (−1/q) satisfies this assumption. For
simplicity we give this (without proof) with η = 1.
Theorem 5.8.7 ([308]). Let n ≥ 1, a > 0 and q = max {4, 2a log n} . Let x ∈ Rn
be an isotropic (−1/q) random vector. Then there is an absolute constant C such
that for every rank r orthogonal projection P and every t ≥ C exp (4/a), one has
 √ 0 1 
4
P Px 2  t r  C max (a log a) , exp (32/a) / t4 r2 .
5.9 Law of Large Numbers for Matrix-Valued Random Variables 305

Theorem 1.1 from [310] and the above lemma immediately imply the following
corollary on the approximation of the covariance matrix by the sample covariance
matrix.
Corollary 5.8.8 ([308]). Let n ≥ 1, a > 0 and q = max {4, 2a log n} . Let
x1 , . . . , xN be independent (−1/q)-concave, isotropic random vector in Rn . Then
for every ε ∈ (0, 1) and every N ≥ C(ε)n, one has
/ /
/1 N /
/ /
E/ xi ⊗ xi − In×n /  ε
T
/N /
i=1 op

where C(ε, a) depends only on a and ε.


The following result was proved for small ball probability estimates.
Theorem 5.8.9 (Paouris [304]). Let x be a centered log-concave random vector in
a finite dimensional Euclidean space. For every t ∈ (0, c ) one has
  1/2  2 1/2
 tc(Ex2 ) /σ2 (x) ,
2
P x 2t E x 2

where c, c > 0 are universal positive constants.


The following result generalizes the above result to the setting of convex distribu-
tions.
Theorem 5.8.10 (Paouris [304]). Let n ≥ 1 and q > 1. Let x be a centered n-
dimensional (−1/q)-concave random vector. Assume 1  p  min {q, n/2} . Then,
for every ε ∈ (0, 1),
   3p
c p q2
P ( x 2  tE x 2 )  1 + (2c) tp ,
q−p (q − p) (q − 1)
whenever E x 2  2Cσp (x), where c, C are constants.

5.9 Law of Large Numbers for Matrix-Valued Random


Variables

For p ≤ ∞, the finite dimensional lp spaces are denoted as lpn . Thus lpn is the Banach
 
space Rn , · p , where

n
1/p
p
x p = |x|i
i=1

for p ≤ ∞, and x ∞ = maxi |xi |. The closed unit ball of lp is denoted by Bpn :=
0 1
x: x p1 .
306 5 Non-asymptotic, Local Theory of Random Matrices

The canonical basis of Rn is denoted by (e1 , . . . , en ). Let x, y ∈ Rn . The


canonical inner product is denoted by x, y := xT y. The tensor product (outer
product) is defined as x ⊗ y = yxT ; thus z = x, z y for all z ∈ Rn .
Let A = (Aij ) be an m × n real matrix. The spectral norm of A is the operator
norm l2 → l2 , defined as
Ax 2
A 2 := sup = σ1 (A) ,
x∈Rn x 2
where σ1 (A) is the largest singular value of A. The Frobenius norm A F is
defined as
2
A F := A2ij = σi (A) ,
i,j i

where σi (A) are the singular values of A.


C denotes positive absolute constants. The a = O(b) notation means that a ≤ Cb
for some absolute constant C.
For the scalar random variables, the classical Law of Large Numbers says the
following: let X be a bounded random variable and X1 , . . . , XN be independent
copies of X. Then
   
1 N 
  1
E Xi − EX  = O √ . (5.30)
N  N
i=1

Furthermore, the large deviation theory allows one to estimate the probability that
N
the empirical mean N1 Xi stays close to the true mean EX.
i=1
Matrix-valued versions of this inequality are harder to prove. The absolute value
must be replaced by the matrix norm. So, instead of proving a large deviation
estimate for a single random variable, we have to estimate the supremum of
a random process. This requires deeper probabilistic techniques. The following
theorem generalizes the main result of [93].
Theorem 5.9.1 (Rudelson and Vershynin [311]). Let y be a random vector in
Rn , which is uniformly bounded almost everywhere: y 2  α. Assume for
normalization that y ⊗ y 2  1. Let y1 , . . . , yN be independent copies of y. Let
"
log N
σ := C · α.
N

Then
1. If σ < 1, then
/ /
/1 N /
/ /
E/ yi ⊗y − E (y ⊗ y)/  σ.
/N /
i=1 2
5.9 Law of Large Numbers for Matrix-Valued Random Variables 307

2. For every t ∈ (0, 1),


+/ / ,
/1 N /
/ / 2 2
P / yi ⊗yi − E (y ⊗ y)/ > t  2e−ct /σ .
/N /
i=1 2

Theorem 5.9.1 generalizes Theorem 5.4.2. Part (1) is a law of large numbers,
and part (2) is a large deviation estimate for matrix-valued random variables. The
bounded assumption y 2  α can be too strong for some applications and can be
q
relaxed to the moment assumption E y 2  αq , where q = log N. The estimate
in Theorem 5.9.1 is in general optimal (see [311]). Part 2 also holds under an
assumption that the moments of y 2 have a nice decay.
Proof. Following [311], we prove this theorem in two steps. First we use the
standard symmetrization technique for random variables in Banach spaces, see
e.g. [27, Sect. 6]. Then, we adapt the technique of [93] to obtain a bound on a
symmetric random process. Note the expectation E(·) and the average operation
N
N
1
xi ⊗xi are linear functionals.
i=1
Let ε1 , . . . , εN denote independent Bernoulli random variables taking values 1,
−1 with probability 1/2. Let y1 , . . . , yN , ȳ1 , . . . , ȳN be independent copies of y.
We shall denote by Ey , Eȳ , and Eε the expectation according to yi , ȳi , and εi ,
respectively.
Let p ≥ 1. We shall estimate
/ /p 1/p
/1 N /
/ /
Ep := E / yi ⊗yi − E (y ⊗ y)/ . (5.31)
/N /
i=1 2

Note that
N

1
Ey (y ⊗ y) = Eȳ (ȳ ⊗ ȳ) = Eȳ ȳi ⊗ȳi .
N i=1
p
We put this into (5.31). Since x → x 2 is a convex function on Rn , Jensen’s
inequality implies that
/ /p 1/p
/1 N N /
/ 1 /
Ep  Ey Eȳ / yi ⊗yi − ȳi ⊗ȳi / .
/N N /
i=1 i=1 2

Since yi ⊗ yi − ȳi ⊗ ȳi is a symmetric random variable, it is distributed identically


with εi (yi ⊗ yi − ȳi ⊗ ȳi ). As a result, we have
/ /p 1/p
/1 N /
/ /
Ep  Ey Eȳ Eε / εi (yi ⊗ yi − ȳi ⊗ ȳi )/ .
/N /
i=1 2
308 5 Non-asymptotic, Local Theory of Random Matrices

Denote
N N
1 1
Y= εi yi ⊗ yi and Ȳ = εi ȳi ⊗ ȳi .
N i=1
N i=1

Then
/ /  / /  / /p
/Y − Ȳ/p  Y + /Ȳ/2
p
 2p Y
p
+ /Ȳ/2 ,
2 2 2

/ /p
= E /Ȳ/2 . Thus we obtain
p
and E Y 2

/ /p 1/p
/1 N /
/ /
Ep  2 E y E ε / εi (yi ⊗ yi )/ .
/N /
i=1 2

We shall estimate the last expectation using Lemma 5.4.3, which was a lemma
from [93]. We need to consider the higher order moments:
Lemma 5.9.2 (Rudelson [93]). Let y1 , . . . , yN be vectors in Rk and ε1 , . . . , εN
be independent Bernoulli variables taking values 1, −1 with probability 1/2. Then
/ /p  p / /1/2
/1 N /  / N /
/ / / /
E/ εi yi ⊗yi /  C0 p + log k · max yi 2·/ yi ⊗yi / .
/N / i=1,...,N / /
i=1 2 i=1 2

Remark 5.9.3. We can consider the vectors y1 , . . . , yN as vectors in their linear


span, so we can always choose the dimension k of the ambient space at most N.
Combining Lemma 5.9.2 with Remark 5.9.3 and using Hölder’s inequality, we
obtain

√ / /p 1/2p
/1 N /
p + log N / /
Ep  2C0 ·α· E/ yi ⊗ yi / . (5.32)
N /N /
i=1 2

By Minkowski’s inequality we have


 p 1/p
 p 1/p


N
1 
N
E yi ⊗ yi N E N
yi ⊗ yi −E (y ⊗ y) +E (y ⊗ y)2
i=1 2 i=1 2
 N (Ep +1) .

So we get
 1/2
σp1/2 log N
Ep  (Ep + 1) , where σ = 4C0 α.
2 N
5.10 Low Rank Approximation 309

It follows that

min (Ep , 1)  σ p. (5.33)

To prove part 1 of the theorem, note that σ ≤ 1 by the assumption. We thus obtain
E1 ≤ σ. This proves part 1.
1/p
To prove part 2, we consider Ep = (EZ p ) , where
/ /
/1 N /
/ /
Z=/ yi ⊗ yi − E (y ⊗ y)/ .
/N /
i=1 2

So (5.33) implies that

p 1/p √
(E min (Z, 1) )  min (Ep , 1)  σ p. (5.34)

We can express this moment bound as a tail probability estimate using the following
standard lemma, see e.g. [27, Lemmas 3.7 and 4.10].
Lemma 5.9.4. Let Z be a nonnegative random variable. Assume that there exists a
1/p √
constant K > 0 such that (EZ p )  K p for all p ≥ 1. Then

P (Z > t)  2 exp −c1 t2 /K 2 for all t > 0.

It thus follows from this and from (5.34) that



P (min (Z, 1) > t)  2 exp −c1 t2 /K 2 for all t > 0.

This completes the proof of the theorem. 


Example 5.9.5 (Bounded Random Vectors). In Theorem 5.9.1, we let y be a random
vector in Rn , which is uniformly bounded almost everywhere: y 2  α. Since
A 2 = σ1 (A), where σ1 is the largest singular value. This is very convenient
to use in practice. In many problems, we have the prior knowledge that the
random vectors are bounded as above. This bound constraint leads to more sharp
inequalities. One task is how to formulate the problem using this additional bound
constraint.

5.10 Low Rank Approximation

We assume that A has a small rank–or can be approximated by an (unknown) matrix


of a small rank. We intend to find a low rank approximation of A, from only a small
random submatrix of A.
310 5 Non-asymptotic, Local Theory of Random Matrices

Solving this problem is essential to development of fast Monte-Carlo algorithms


for computations on large matrices. An extremely large matrix—say, of the order of
105 × 105 —is impossible to upload into the random access memory (RAM) of a
computer; it is instead stored in an external memory. On the other hand, sampling a
submatrix of A, storing it in RAM and computing its small rank approximation is
feasible.
The best fixed rank approximation to A is given by the partial sum of the Singular
Value Decomposition (SVD)
A= σi (A) ui ⊗ vi
i
where σi (A) are the nonincreasing and nonnegative sequence of the singular values
of A, and ui and vi are left and right singular vectors of A, respectively. The best
rank k approximation to A in both the spectral and Frobenius norms is thus AP k ,
where Pk is the orthogonal projection onto the top k left singular vectors of A. In
particular, for the spectral norm we have
min A−B 2 = A − APk 2 = σk+1 (A) . (5.35)
B:rank(B)k

However, computing Pk , which gives the first elements of the SVD of a m × n


matrix A is often impossible in practice since (1) it would take many passes through
A, which is extremely slow for a matrix stored in an external memory; (2) this
would take superlinear time in m + n. Instead, it was proposed in [312–316] to use
the Monte-Carlo methodology: namely, appropriate the k-th partial sum of the SVD
of A by the k-th partial sum of the SVD of a random submatrix of A. Rudelson and
Vershynin [311] have shown the following:
1. With almost linear sample complexity O(r log r), that is by sampling only
O(r log r) random rows of A, if A is approximiable by a rank r matrix;
2. In one pass through A if the matrix is stored row-by-row, and in two passes if its
entries are stored in arbitrary order;
3. Using RAM space are stored and time O(n + m) (and polynomial in r and k).
Theorem 5.10.1 (Rudelson and Vershynin [311]). Let A be an m×n matrix with
2 2
numerical rank r = A F / A 2 . Let ε, δ ∈ (0, 1), and let d ≤ m be an integer
such that
 r   r 
d  C 4 log 4 .
ε δ ε δ
Consider a d × n matrix Ã, which consists of normalized rows of A picked
independently with replacement, with probabilities proportional to the squares
of their Euclidean lengths. Then, with probability at least 1 − 2 exp(−c/δ), the
following holds. For a positive integer k, let Pk be the orthogonal projection onto
the top k left singular value vectors of Ã. Then
2
A − APk 2  σk+1 (A) + ε A 2 . (5.36)

Here and in the sequel, C, c, C1 , . . . denote positive absolute constants.


We make the following remarks:
5.10 Low Rank Approximation 311

1. Optimality. The almost linear sample complexity O(r log r) achieved in Theo-
rem 5.10.1 is optimal. The best previous result had O(r2 ) [314, 315]
2 2
2. Numerical rank. The numerical rank r = A F / A 2 is a relaxation of the
exact notion of rank. Indeed, one always has r(A)rank(A). The numerical rank
is stable under small perturbation of the matrix, as opposed to the exact rank.
3. Law of large numbers for matrix-valued random variables. The new feature is a
use of Rudelson’s argument about random vectors in the isotropic position. See
Sect. 5.4. It yields a law of large numbers for matrix-valued random variables.
We apply it for independent copies of a rank one random matrix, which is given
by a random row of the matrix AT A—the sample covariance matrix.
4. Functional-analytic nature. A matrix is a linear operator between finite-
dimensional normed spaces. It is natural to look for stable quantities tied to
linear operators, which govern the picture. For example, operator (matrix)
norms are stable quantities, while rank is not. The low rank approximation in
Theorem 5.10.1 is only controlled by the numerical rank r. The dimension n
does not play a separate role in these results.
2
Proof. By the homogeneity, we can assume A 2 = 1. The following lemma
from [314, 316] reduces Theorem 5.10.1 to a comparison of A and a sample Ã
in the spectral norm.
Lemma 5.10.2 (Drineas and Kannan [314, 316]).
/ /
2 2 / /
A − APk 2  σk+1 (A) + 2/AT A − ÃT Ã/ .
2

Proof of the Lemma. We have


' T (
A−APk 22 = sup Ax22 = sup A Ax, x
x∈ker Pk ,x2 =1 x∈ker Pk ,x2 =1
)  * ) *
 sup AT A − ÃT Ã x, x + sup ÃT Ãx, x
x∈ker Pk ,x2 =1 x∈ker Pk ,x2 =1
 
= AT A − ÃT Ã +σk+1 ÃT Ã .
2

1
à stands
 for the matrix/ A [16]. By a result
kernel or null space of / of perturbation
  T  / T /
theory, σk+1 A A − σk+1 Ã Ã   /A A − Ã Ã/ . This proves the
T T
2
lemma.
Let x1 , . . . , xm denote the rows of the matrix A. Then

1 Let A is a linear transformation from vector V to vector W . The subset in V

ker(A) = {v ∈ V : A(v) = 0 ∈ W }

is a subspace of V , called the kernel or null space of A.


312 5 Non-asymptotic, Local Theory of Random Matrices

m
AT A = xi ⊗ xi .
i=1

We shall regard the matrix AT A as the true mean of a bounded matrix valued
random variable, while ÃT Ã will be its empirical mean; then the Law of Large
Numbers for matrix valued random variables, Theorem 5.9.1, will be used. To this
purpose, we define a random vector y ∈ Rn as
 
A xi
P y= F
xi = 2
.
xi 2 A F

Let y1 , . . . , yN be independent copies of y. Let the matrix à consist of rows


√1 y1 , . . . , √1 yN . The normalization of à is different from the statement of
N N
Theorem 5.10.1: in the proof, it is convenient to multiply à by the factor √1 A
N F.
However the singular value vectors of à and thus Pk do not change. Then,

N
1 √
AT A = E (y ⊗ y) , ÃT Ã = √ yi ⊗ yi , α := y 2 = A F = r.
N i=1

We can thus apply Theorem 5.9.1. Due to our assumption on N, we have


 1/2
log N 1 2 1/2
σ := 4C0 ·r  ε δ < 1.
N 2

Thus Theorem 5.9.1 gives that, with probability at least 1 − 2 exp(−c/δ), we have

√ //
/1/2
/
A − APk 2  σk+1 (A) + 2 /AT A − ÃT Ã/  σk+1 (A) + ε.
2

This proves Theorem 5.10.1. 


Let us comment on algorithmic aspects of Theorem 5.10.1. Finding a good low
rank approximation to a matrix A comes down, due to Theorem 5.10.1, to sampling
a random submatrix Ā and computing its SVD (actually, left singular vectors are
2 2
needed). The algorithm works well if the numerical rank r = A F / A 2 of the
matrix A is small. This is the case, in particular, when A is essentially a low rank
matrix, since r (A)  rank (A).
First, the algorithm samples N = O(r log r) random rows of A. That is, it takes
N independent samples of the random vector y whose law is
  2
Ai Ai
P y= = 2
2
Ai 2 A F
5.11 Random Matrices with Independent Entries 313

where Ai is the i-th row of A. This sampling can be done in one pass through the
matrix A if the matrix is stored row-by-row, and in two passes if its entries are
stored in arbitrary order [317, Sect. 5.1]. Second, the algorithm computes the SVD
of the N × n matrix Ã, which consists of the normalized sampled rows. This can be
done in time O(N n)+ the time needed to compute the SVD of a N ×N matrix. The
latter can be done by one of the known methods. This algorithm is takes significantly
less time than computing SVD of the original m × n matrix A. In particular, this
algorithm is linear in the dimensions of the matrix (and polynomial in N ).

5.11 Random Matrices with Independent Entries

The material here is taken from [72]. For a general random matrix A with
independent centered entries bounded by 1, one can use Talagrand’s concentration
inequality for convex Lipschitz functions on the cube [142, 148, 231]. Since
σmax (A) = A (or σ1 (A)) is a convex function of A. Talagrand’s concentration
inequality implies
2
P (|σmax (A) − M (σmax (A))|  t)  2e−ct ,
where M is the median. Although the precise value of the median may be unknown,
integration of this inequality shows that
|Eσmax (A) − M (σmax (A))|  C

Theorem 5.11.1 (Gordon’s theorem for Gaussian matrices [72]). Let A be an


N × n matrix whose entries are independent standard normal random variables.
Then,
√ √ √ √
N − n  Eσmin (A)  Eσmax (A)  N + n.

Let f be a real valued Lipschitz function on Rn with Lipschitz constant K, i.e.


|f (x) − f (y)|  K x − y 2 for all x, y ∈ Rn (such functions are also called
K-Lipschitz).
Theorem 5.11.2 (Concentration in Gauss space [141]). Let a real-valued func-
tion f is K-Lipschitz on Rn . Let x be the standard normal random vector in Rn .
Then, for every t ≥ 0, one has
2
/2K 2
P (f (x) − Ef (x) > t)  e−t .

Corollary 5.11.3 (Gaussian matrices, deviation; [145]). Let A be an N × n


matrix whose entries are independent standard normal random variables. Then,
2
for every t ≥ 0, with probability at least 1 − 2e−t /2 , one has
√ √ √ √
N − n − t  σmin (A)  σmax (A)  N + n + t.
314 5 Non-asymptotic, Local Theory of Random Matrices

Proof. σmin (A) and σmax (A) are 1-Lipschitz (K = 1) functions of matrices A
considered as vectors in Rn . The conclusion now follows from the estimates on the
expectation (Theorem 5.11.1) and Gaussian concentration (Theorem 5.11.2). 
Lemma 5.11.4 (Approximate isometries [72]). Consider a matrix X that satisfies

X∗ X − I  max δ, δ 2 (5.37)

for some δ > 0. Then

1 − δ  σmin (A)  σmax (A)  1 + δ. (5.38)

Conversely, if X satisfies (5.38) for some δ > 0, then



X∗ X − I  3 max δ, δ 2 .
 
Often, we have δ = O n/N .

5.12 Random Matrices with Independent Rows

Independent rows are used to form a random matrix. In an abstract setting, an


infinite-dimensional function (finite-dimensional vector) is regarded as a ‘point’
in some suitable space and an infinite-dimensional integral operator (finite-
dimensional matrix) as a transformation of one ‘point’ to another. Since a point
is conceptually simpler than a function, this view has the merit of removing some
mathematical clutter from the problem, making it possible to see the salient issues
more clearly [89, p. ix].
Traditionally, we require the entries of a random matrix are independent; Here,
however, the requirements of independent rows are much more relaxed than
independent entries. A row is a finite-dimensional vector (a ‘point’ in a finite-
dimensional vector space).
The two proofs taken from [72] are used to illustrate the approach by showing
how concentration inequalities are at the core of the proofs. In particular, the
approach can handle the tough problem of matrix rows with heavy tails.

5.12.1 Independent Rows

/ / "
/1 ∗ / 
/ A A − I/  max δ, δ 2 where δ = C n + √t . (5.39)
/N / N N
5.12 Random Matrices with Independent Rows 315

Theorem 5.12.1 (Sub-Gaussian rows [72]). Let A be an N ×n matrix whose rows


Ai are independent, sub-Gaussian, isotropic random vectors in Rn . Then, for every
2
t ≥ 0, the following inequality holds with probability at least 1 − 2n · e−ct :
√ √ √ √
N − C n − t  σmin (A)  σmax (A)  N + C n + t. (5.40)

Here C = CK , c = cK > 0 depend only on the sub-Gaussian norm K =


2
maxi Ai ψ2 of the rows.
This result is a general version of Corollary 5.11.3; instead of independent
Gaussian entries we allow independent sub-Gaussian entries such as Gaussian and
Bernoulli. It also applies to some natural matrices whose entries are not independent.
Proof. The proof is taken from [72], and changed to our notation habits. The
proof is a basic version of a covering argument, and it has three steps. The use of
covering arguments in a similar context goes back to Milman’s proof of Dvoretzky’s
theorem [318]. See e.g. [152,319] for an introduction. In the more narrow context of
extremal singular values of random matrices, this type of argument appears recently
e.g. in [301].
We need to control Ax 2 for all vectors x on the unit sphere S n−1 . To this
purpose, we discretize the sphere using the net N (called the approximation or
sampling step), establish a tight control of Ax 2 for every fixed vector x ∈ N with
high probability (the concentration step), and finish off by taking a union bound over
all x in the net. The concentration step will be based on the deviation inequality for
sub-exponential random variable, Corollary 1.9.4.

Step 1: Approximation. Recalling Lemma 5.11.4 for the matrix B = A/ N
we see that the claim of the theorem is equivalent to
/ / "
/1 ∗ / 
/ A A − I/  max δ, δ 2 where δ = C m √t . (5.41)
/N / N N

With the aid of Lemma 1.10.4, we can evaluate the norm in (5.41) on a 14 -net N
of the unit sphere S n−1 :
/ / @  A  
/1 ∗ /  1 ∗  1 
/ A A − I/  2 max  A A − I x, x  = 2 max 
 Ax
2
− 1 .
/N / x∈N  N x∈N N 2

To complete the proof, it is sufficient to show that, with the required probability,
 
1  ε
max  − 1  .
2
Ax 2
x∈N N 2

By Lemma 1.10.2, we can choose the net N such that it has cardinality |N |  9n .
2
Step 2: Concentration. Let us fix any vector x ∈ S n−1 . We can rewrite Ax 2
as a sum of independent (scalar) random variables
316 5 Non-asymptotic, Local Theory of Random Matrices

N N
2 2
Ax 2 = Ai , x =: Zi2 (5.42)
i=1 i=1

where Ai denote the rows of the matrix A. By assumption, Zi = Ai , x are


independent, sub-Gaussian random variables with EZi2 = 1 and Zi ψ2  K.
Thus, by Remark 1.9.5 and Lemma 1.9.1, Zi − 1 are independent, centered sub-
exponential random variables with Zi − 1 ψ1  2 Zi ψ1  4 Zi ψ2  4K.
Now we can use an exponential deviation inequality, Corollary 1.9.3, to control
 1/2
the sum (5.42). Since K  Zi ψ2  √12 EZi2 = √12 , this leads to
   c
1 ε 1
N
ε 1   
P Ax22 − 1  =P Zi2 − 1   2 exp − 4 min ε2 , ε N
N 2 N 2 K
i=1
 c1 2  c1  2 
2
= 2 exp − δ N  2 exp − C n + t
K4 K4
(5.43)

where the last inequality follows by the definition of δ and using the inequality
2
(a + b)  a2 + b2 for a, b ≥ 0.
Step 3: Union Bound. Taking the union bound over the elements of the net N
with the cardinality |N | ≤ 9n , together with (5.43), we have
   c 
1 ε 1   c
1

P max Ax22 − 1   9n · 2 exp − 4 C 2 n + t2  2 exp − 4 t2
x∈N N 2 K K

 the second inequality follows for C = CK sufficiently large, e.g., C =


where
K 2 ln 9/c1 .


5.12.2 Heavy-Tailed Rows

Theorem 5.12.2 (Heavy-tailed rows [72]). Let A be an N × n matrix whose rows


√ i are independent random vectors in R . Let m be a number such that Ai 2 
n
A
m almost surely for all i. Then, for every t ≥ 0, the following inequality holds
2
with probability at least 1 − 2n · e−ct :
√ √ √ √
N − t m  σmin (A)  σmax (A)  N + t m. (5.44)

Here c is an absolute constant.


2
E Ai 2 = n. This says that one would typically use
Recall from Lemma 5.2.1 that √
Theorem 5.12.2 with m = O( m). In this case the result has a form
√ √ √ √
N − t n  σmin (A)  σmax (A)  N + t n,
 2
probability at least 1 − 2n · e−c t .
5.12 Random Matrices with Independent Rows 317

Proof. The proof is taken from [72] for the proof and changed to our notation habits.
We shall use the non-commutative Bernstein’s inequality (for a sum of independent
random matrices).
Step 1: Reduction to a sum of independent random matrices. We first note that
2
m ≥ n ≥ 1 since by Lemma 5.2.1 we have that E Ai 2 = n. Our argument
here is parallel
√ to Step 1 of Theorem 5.12.1. Recalling Lemma 5.11.4 for the matrix
B = A/ N , we find that the desired inequality (5.44) is equivalent to
/ / "
/1 ∗ /  m
/ A A − I/  max δ, δ 2 = ε, δ=t . (5.45)
/N / N

Here · is the operator (or spectral) norm. It is more convenient to express this
random matrix as a sum of independent random matrices—the sum is a linear
operator:

N N
1 ∗ 1
A A−I= Ai ⊗ A i − I = Xi , (5.46)
N N i=1 i=1

where Xi = N1 (Ai ⊗ Ai − I). Here Xi are independent centered n × n random


matrices. Equation (5.46) is the standard form we have treated previously in this
book.
Step 2: Estimating the mean, range, and variance. Now we are in a position

N
to apply the non-commutative Bernstein inequality, for the sum Xi . Since Ai
i=1
are isotropic random vectors, we have EAi ⊗ Ai = I, which implies EXi = 0,
which is a required condition to use the non-commutative Bernstein inequality,
Theorem 2.17.1. √
We estimate the range of Xi using the assumption that Ai 2  m and m ≥ 1:

1 1  2
 1 2m
Xi 2  ( Ai ⊗ Ai + 1) = Ai 2 +1  (m + 1)  = K.
N N N N


N
Here · 2 is the Euclidean norm. To estimate the total variance EX2i , we first
i=1
need to compute
 
1 2 1
X2i = 2 (Ai ⊗ Ai ) − 2 Ai ⊗ Ai + I .
N N

Then, using the isotropic vector assumption EAi ⊗ Ai = I, we have

1  2

EX2i = E(A i ⊗ A i ) − I . (5.47)
N2
318 5 Non-asymptotic, Local Theory of Random Matrices

Since
2 2
(Ai ⊗ Ai ) = Ai 2 Ai ⊗ A i

2
is
/ a positive semi-definite
/ matrix and Ai 2  m by assumption, it follows that
/ 2/
/E(Ai ⊗ Ai ) /  m · EAi ⊗ Ai = m. Inserting this into (5.47), we have

/ /
/EX2i /  1 (m + 1)  2m .
N2 N2
where we used the assumption that m ≥ 1. This leads to
/ /
/ N / / / 2m
/ /
/ EX2i / N · max /EX2i / = = σ2 .
/ / i N
i=1

Step 3: Applying the non-commutative Bernstein’s inequality. Applying the


non-commutative Bernstein inequality, Theorem 2.17.1, and recalling the defini-
tions of ε and δ in (5.45), we bound the probability in question as
/ N /    2 
/ 1 ∗ / / /
/ / / /
P N A A − I  ε = P / Xi /  ε  2n · exp −c min σε 2 , Kε
i=1   
 2n · exp −c min ε2 , ε · 2m
N
2  
= 2n · exp −c · δ2mN
= 2n · exp −ct2 /2 .

This completes the proof. 


Theorem 5.12.3 (Heavy-tailed rows, non-isotropic [72]). Let A be an N × n
matrix whose rows Ai are independent random vectors in Rn with the common
second
√ moment matrix Σ = E (xi ⊗ xi ). Let m be a number such that Ai 2 
m almost surely for all i. Then, for every t ≥ 0, the following inequality holds
2
with probability at least 1 − n · e−ct :
/ /   "
/1 ∗ / m
/ A A − Σ/  max Σ 1/2
δ, δ 2 where δ = t . (5.48)
/N / N

Here c is an absolute constant. In particular, this inequality gives

1/2
√ √
A  Σ N + t m. (5.49)

Proof. Since
2
Σ = E (Ai ⊗ Ai )  E Ai ⊗ Ai = E Ai 2  m,
5.12 Random Matrices with Independent Rows 319

we have m  Σ . Then (5.48) follows by a straightforward modification of the


arguments of Theorem 5.12.2. Also, if (5.48) holds, then by triangle inequality
/ / / /
= / N1 A∗ A/  Σ + / N1 A∗ A − Σ/
1 2
N A 2
1/2
 Σ + Σ δ + δ2.

Taking the square root and multiplying both sides by N , we have (5.49). 
The almost sure boundedness in Theorem 5.12.2 may be too restrictive, and it
can be relaxed to a bound in expectation.
Theorem 5.12.4 (Heavy-tailed rows; expected singular values [72]). Let A be
an N × n matrix whose rows Ai are independent, isotropic random vectors in Rn .
2
Let m = EmaxiN Ai 2 . Then
 √  

E max σi (A) − N   C m log min (N, n)
iN

where C is an absolute constant.


The proof of this result is similar to that of Theorem 5.12.2, except that this time
Rudelson’s Lemma, Lemma 5.4.3, instead of matrix Bernstein’s inequality. For
details, we refer to [72].
Theorem 5.12.5 (Heavy-tailed rows, non-isotropic, expectation [72]). Let A be
an N × n matrix whose rows Ai are independent random vectors in Rn with the
2
common second moment matrix Σ = E (xi ⊗ xi ). Let m = EmaxiN Ai 2 .
Then
/ / "
/1 ∗ /   m log min (N, n)
E/ / 1/2
/ N A A − Σ/  max Σ
2
δ, δ where δ = C .
N

Here C is an absolute constant. In particular, this inequality gives


 1/2 √ 
2 1/2
E A  Σ N + C m log min (N, n).

Let us remark on non-identical second moment. The assumption that the rows
Ai have a common second moment matrix Σ is not essential in Theorems 5.12.3
and 5.12.5. More general versions of these results can be formulated. For example,
if Ai have arbitrary second moment matrices

Σi = E (xi ⊗ xi ) ,

1

N
then the claim of Theorem 5.12.5 holds with Σ = N Σi .
i=1
320 5 Non-asymptotic, Local Theory of Random Matrices

5.13 Covariance Matrix Estimation

Let x be a random vector in Rn ; for simplicity we assume that x is centered,2


Ex = 0. The covariance matrix of x is the n × n matrix

Σ = E (x ⊗ x) .

The simplest way to estimate Σ is to take some N independent samples xi from the
distribution and form the sample covariance matrix
N
1
ΣN = xi ⊗ xi .
N i=1

By the law of large numbers,

ΣN → Σ almost surely N → ∞.

So, taking sufficiently many samples, we are guaranteed to estimate the covariance
matrix as well as we want. This, however, does not deal with the quantitative
aspect of convergence: what is the minimal sample size N that guarantees this
approximation with a given accuracy?
We can rewrite ΣN as
N
1 1 ∗
ΣN = xi ⊗ xi = X X,
N i=1
N

where
⎡⎤
xT1
⎢ ⎥
X = ⎣ ... ⎦ .
xTN
The X is a N × n random matrix with independent rows xi , i = 1, . . . , N, but
usually not independent entries.
Theorem 5.13.1 (Covariance estimation for sub-Gaussian distributions [72]).
Consider a sub-Gaussian distribution in Rn with covariance matrix Σ, and let
2
ε ∈ (0, 1), t ≥ 1. Then, with probability at least 1 − 2e−t n , one has
2
If N  C(t/ε) n then ΣN − Σ  ε. (5.50)

Here C = CK depends only on the sub-Gaussian norm K = x ψ2 of a random


vector taken from this distribution.

2 Moregenerally, in this section we estimate the second moment matrix E (x ⊗ x) of an arbitrary


random vector x (not necessarily centered).
5.13 Covariance Matrix Estimation 321

2  s ≥ 0, with probability
Proof. It follows from (5.39) that for every  at √
least
1 − 2e−cs , we have ΣN − Σ  max δ, δ 2 where δ = C n/N + s/ N .
 √  
The claim follows for s = C t n where C = CK is sufficiently large. 
For arbitrary centered Gaussian distribution in Rn , (5.50) becomes
2
If N  C(t/ε) n then ΣN − Σ  ε Σ . (5.51)

Here C is an absolute constant.


Theorem 5.12.3 gives a similar estimation result for arbitrary distribution,
possibly heavy-tailed.
Theorem 5.13.2 (Covariance estimation for arbitrary distributions [72]). Con-
sider a distribution in Rn with covariance √
matrix Σ, and supported in some centered
Euclidean ball whose radius we denote m. Let ε ∈ (0, 1), t ≥ 1. Then, with
2
probability at least 1 − n−t , one has
2 −1
If N  C(t/ε) Σ m log n, then ΣN − Σ  ε Σ . (5.52)

Here C is an absolute constant.


Proof. It follows from Theorem 5.12.3 that for every s ≥ 0,with probability at least
2 
1 − n · e−cs , we have ΣN − Σ  max Σ
1/2
δ, δ 2 where δ = s m/N .
2 −1
Thus, if N  C(s/ε) Σ m log n, then ΣN − Σ  ε Σ . The claim
 √ 
follows for s = C t log n where C is a sufficiently large absolute constant. 
Theorem 5.52 is typically met with m = O ( Σ n). For a random vector x
chosen from the distribution at hand, the expected norm is
2
E x 2 = Tr (Σ)  n Σ .

Recall that Σ = σ1 (Σ) is the matrix norm which is also equal to the largest
singular value. So, by Markov’s
√ inequality, most of the distribution is supported
in a centered ball of radius m where m = O ( Σ n). If all the distribution
is supported there, i.e., if x = O ( Σ n) almost surely, then the claim of
Theorem 5.52 holds with sample size
2
N  C(t/ε) r log n.

Let us consider low-rank estimation. In this case, the distribution in Rn lies close
to a low dimensional subspace. As a result, a much smaller sample size N is
sufficient for covariance estimation. The intrinsic dimension of the distribution can
be expressed as the effective rank of the matrix Σ, defined as
Tr (Σ)
r (Σ) = .
Σ
One always has r (Σ)  rank (Σ)  n, and this bound is sharp. For example, for
an isotropic random vector x in Rn , we have Σ = I and r (Σ) = n.
322 5 Non-asymptotic, Local Theory of Random Matrices

The effective rank r = r (Σ) always controls the typical norm of x, since
2
E x 2 = T r (Σ) = r Σ . It follows from Markov’s √ inequality that most of the
distribution is supported within a ball of radius m wherem = r Σ  . Assume that

all of the distribution is supported there, i.e., if x = O r Σ almost surely,
then, the claim of Theorem 5.52 holds with sample size
2
N  C(t/ε) r log n.

The bounded assumption in Theorem 5.52 is necessary. For an isotropic distribution


which is highly concentrated at the origin, the sample covariance matrix will likely
2
equal 0. Still we can use a weaker assumption EmaxiN xi 2  m where xi
denote the sample points. In this case, the covariance estimation will be guaranteed
in expectation rather than with high probability.
A different way to impose the bounded assumption √ is to reject any sample
points xi that fall outside the centered ball of radius m. This is equivalent
to sampling from the conditional distribution inside the ball. The conditional
distribution satisfies the bounded requirement, so the results obtained above provide
a good covariance estimation for it. In many cases, this estimate works even for the
original distribution—that
√ is, if only a small part of the distribution lies outside the
ball of radius m. For more details, refer to [320].

5.13.1 Estimating the Covariance of Random Matrices

The material here can be found in [321]. In recent years, interest in matrix
valued random variables gained momentum. Many of the results dealing with real
random variables and random vectors were extended to cover random matrices.
Concentration inequalities like Bernstein, Hoeffding and others were obtained in the
non-commutative setting. The methods used were mostly combination of methods
from the real/vector case and some matrix inequalities like the Golden-Thompson
inequality.
The method will work properly for a class of matrices satisfying a matrix strong
regularity assumption which we denote by (MSR) and can be viewed as an analog
to the property (SR) defined in [310]. For an n × n matrix, denote · op by the
operator norm of A on n2 .
Definition 5.13.3 (Property (MSR)). Let Y be an n × n positive semi-definite
random matrix such that EY = In×n . We will say that Y satisfies (MSR) if for
some η > 0 we have:
c
P ( PYP  t)  ∀t  c · rank (P)
t1+η
where P is orthogonal projection of Rn .
5.13 Covariance Matrix Estimation 323

Theorem 5.13.4 (Youssef [321]). Let X be an n × n positive semi-definite random


matrix satisfying EX = In×n and (MSR) for some η > 0. Then for every ε < 1,
n
taking N = C1 (η) ε2+2/η we have
/ /
/1 N /
/ /
E/ Xi − In×n / ε
/N /
i=1 op

where X1 , . . . , XN are independent copies of X.


We also introduce a regularity assumption on the moments which we denote by
(MWR):
p
∃p > 1 such that E Xz, z  Cp ∀z ∈ S n−1 .

The proof of Theorem 5.13.4 is based on two theorems with the smallest and largest

N
eigenvalues of N1 Xi , which are of independent interest.
i=1

Theorem 5.13.5 (Youssef [321]). Let Xi n × n independent, positive semi-definite


random matrices satisfying EXi = In×n and (MWR). Let ε < 1, then for
n
N  16(16Cp )1/(p−1) 2p−1
ε p−1

we have
N

1
Eλmin Xi  1 − ε.
N i=1

Theorem 5.13.6 (Youssef [321]). Let Xi n × n independent, positive semi-definite


random matrices satisfying EXi = In×n and (MSR). Then for any N we have

N

Eλmax Xi  C (η) (n + N ) .
i=1

Moreover, for ε < 1 and N  C2 (η) ε2+2/η


n
we have

N

1
Eλmax Xi  1 + ε.
N i=1

Consider the case of log-concave matrices is also covered in [321].


324 5 Non-asymptotic, Local Theory of Random Matrices

5.14 Concentration of Singular Values

We primarily follow Rudelson and Vershynin [322] and Vershynin [323] for our
exposition. Relevant work also includes [72, 322, 324–333].
Let A be an N × n matrix whose entries with real independent, centered random
variables with certain moment assumptions. Random matrix theory studies the
distribution
√ of the singular values σk (A), which are the eigenvalues of |A| =
AT A arranged in nonincreasing order. Of particular significance are the largest
and the smallest random variables

σ1 (A) = sup Ax 2 , σn (A) = inf Ax 2 . (5.53)


x:x2 =1 x:x2 =1

Here, we consider sub-Gaussian random variables ξ—those whose tails are domi-
nated by that of the standard normal random variables. That is, a random variable is
called sug-Gaussian if there exists B > 0 such that
2
/B 2
P (|ξ| > t)  2e−t for all t > 0. (5.54)

The minimal B in this inequality is called the sub-Gaussian moment of ξ.


Inequality (5.54) is often equivalently formulated as the moment condition

p 1/p √
(E|ξ| )  CB p for all p  1, (5.55)

where C is an absolute constant. The class of sub-Gaussian random variables


includes many random variables that arise naturally in applications, such as normal,
symmetric ±1 and general bounded random variables.
In this section, we study N × n real random matrices A whose entries are
independent and identically distributed mean 0 sub-Gaussian random variables. The
asymptotic behavior of the extremal singular values of A is well understood. If the
entries have unit variance and the dimension n grows infinity while the aspect ratio
n/N converges to a constant α ∈ (0, 1), then

σ1 (A) √ σn (A) √
√ → 1 + α, √ →1− α
N N

almost surely. The result was proved in [302] for Gaussian matrices, and in [334]
for matrices with independent and identically distributed entries with finite fourth
moment. In other words, we have asymptotically
√ √ √ √
σ1 (A) ∼ N+ n, σn (A) ∼ N − n. (5.56)

Recently, consider efforts were made to understand non-asymptotic estimates


similar to (5.56), which would hold for arbitrary fixed dimensions N and n.
5.14 Concentration of Singular Values 325

Ledoux [149] is a survey on the largest singular value. A modern, non-asymptotic


survey is given in [335], while a tutorial is given in [72]. The discussion in this
section is on the smallest singular value, which is much harder to control.

5.14.1 Sharp Small Deviation

Let λ1 (A) be the largest eigenvalue of the n×n random matrix A. Following [336],
we present
3/2
P (λ1 (A)  2 + t)  Ce−cnt ,

valid uniformly for all n and t. This inequality is sharp for “small deviation” and
complements the usual “large deviation” inequality. Our motivation is to illustrate
the simplest idea to get such a concentration inequality.
The Gaussian concentration is the easiest. It is straightforward consequence of
the measure concentration phenomenon [145] that
2
P (λ1 (A)  Mλ1 (A) + t)  e−nt , ∀t > 0, ∀n (5.57)

where Mλ1 (A) stands for the median of λ1 (A) with respect to the probability
measure P. One has the same upper bound estimate is the median Mλ1 (A) is
replaced by the expected value Eλ1 (A), which is easier to compute.
The value of Mλ1 (A) can be controlled: for example we have

Mλ1 (A)  2 + c/ n.

5.14.2 Sample Covariance Matrices

The entries of the random matrices will be (complex-valued) random variables ξ


satisfying the following assumptions:
1. (A1) The distribution of ξ is symmetric; that is, ξ and −ξ are identically
distributed;
2k k
2. (A2) E|ξ|  (C0 k) for some constant C0 > 0 (ξ has sub-Gaussian tails).
Also we assume that either Eξ 2 = Eξ ξ¯ = 1 or Eξ 2 = 0; Eξ ξ¯ = 1.
326 5 Non-asymptotic, Local Theory of Random Matrices

Theorem 5.14.1 ([324]).


 
 √  1
P A  2 n (1 + t)  C exp − nt3/2 ,
C

 √   
√ 2 1
P λn (B)  n + N + nt  C exp − N t3/2 ,
C
 √   
√ 2 C 1
P λ1 (B)  N − n − nt   exp − N t3/2 ,
1 − N/n C

5.14.3 Tall Matrices

A result of [337] gives an optimal bound for tall matrices, those whose aspect ratio
α = n/N satisfies α < α0 for some sufficiently small constant α0 . Recalling (5.56),
one should expect that tall matrices satisfy

σn (A)  c N with high probability. (5.58)
It was indeed proven in [337] that for a tall ±1 matrices one has
 √ 
P σn (A)  c N  e−cN (5.59)

where α0 > 0 and c > 0 are absolute constants.

5.14.4 Almost Square Matrices

As we move toward square matrices, making the aspect ratio α = n/N arbitrarily
close to 1, the problem becomes harder. One still expect (5.58) to be true as long
as α < 1 is any constant. It was proved in [338] for arbitrary aspect ratios α <
1−c/ log n and for general random matrices with independent sub-Gaussian entries.
One has
 √ 
P σn (A)  cα N  e−cN , (5.60)

where cα > 0 depends on α and the maximal sub-Gaussian moment of the entries.
Later [339], the dependence of cα on the aspect ratio in (5.60) was improved
for random ±1 matrices; however the probability estimates there was weaker
than (5.60). An estimate for sub-Gaussian random matrices
√ of all dimensions was
obtained in a breakthrough work [340]. For any ε  C/ N , it was shown that
 √ √  N −n
P σn (A)  ε (1 − α) N − n  (Cε) + e−cN .

However, because of the factor 1 − α, this estimate is suboptimal and does not
correspond to the expected asymptotic behavior (5.56).
5.14 Concentration of Singular Values 327

5.14.5 Square Matrices

The extreme case for the problem of estimating the singular values is for the square
matrices, where N = n. Equation (5.56) is useless for square matrices. However,
for√“almost” square matrices, those with constant defect N − n = O(1) is of order
1/ N , so (5.56) heuristically suggests that
c
σn (A)  √ with high probability. (5.61)
N

This conjecture was proved recently in [341] for all square sub-Gaussian matrices:
 
c
P σn (A)  √  Cε + e−cN . (5.62)
N

5.14.6 Rectangular Matrices

Rudelson and Vershynin [322] proved the conjectural bound for σ(A), which is
valid for all sub-Gaussian matrices in all fixed dimensions N, n. The bound is
optimal for matrices with all aspects we encountered above.
Theorem 5.14.2 (Rudelson and Vershynin [322]). Let A be an N × n random
matrix, N ≥ n, whose elements are independent copies of a mean 0 sub-Gaussian
random variable with unit variance. Then, for every t > 0, we have
 √ √ 
N −n+1
P σn (A)  t N − n − 1  (Ct) + e−cN , (5.63)

where C, c depend (polynomial) only on the sub-Gaussian moment B.


For tall matrices, Theorem 5.14.2 clearly amounts√to the √
known estimates (5.58)√and
(5.59). For square matrices N = n, the quantity N − N − 1 is of order 1/ N ,
so Theorem 5.14.2 amounts to the known estimates (5.61) and (5.62). Finally, for
matrices that are arbitrarily close to square, Theorem 5.14.2 gives the new optimal
estimate
√ √ 
σn (A)  c N− n with high probability. (5.64)

This is a version of (5.56), now valid for all fixed dimensions. This bound were
explicitly conjectured in [342].
Theorem 5.14.2 seems to be new even for Gaussian matrices.
Vershynin [323] extends the argument of [322] for random matrices with
bounded (4 + ε)-th moment. It follows directly from the argument of [322] and
[323, Theorem 1.1] (which is Eq. 5.69 below.)
328 5 Non-asymptotic, Local Theory of Random Matrices

Corollary 5.14.3 (Vershynin [323]). Let ε ∈ (0, 1) and N ≥ n be positive


integers. Let A be a random N × n matrix whose entries are i.i.d. random variables
with mean 0, unit variance and (4 + ε)-th moment bounded by α. Then, for every
δ > 0 there exists t > 0 and n0 which depend only on ε, δ and α, and such that
 √ √ 
P σn (A)  t N − n − 1  δ, for all n  n0 .

After the paper of [323] was written, two important related results appeared on
the universality of the smallest singular value in two extreme regimes–for almost
square matrices [326] and for genuinely rectangular matrices [324]. The result of
Tao and Vu [326] works for square and almost square matrices where the defect N −
n is constant. It is valid for matrices with i.i.d. entries with mean 0, unit variance, and
bounded C-th moment where C is a sufficiently large absolute constant. The result
says that the smallest singular value of such N × n matrices A is asymptotically the
same as of the Gaussian matrix G of the same dimensions and with i.i.d. standard
normal entries. Specifically,
   
P N σn (G)  t − N −c − N −c  P N σn (A)  t ,
2 2
 
 P N σn (G)  t + N −c + N −c .
2

(5.65)
Another result was obtained by Feldheim and Sodin [324] for genuinely rectangular
matrices, i.e. with aspect ratio N/n separated from 1 by a constant and with sub-
Gaussian i.i.d. entries. In particular, they proved
 √ 
√ 2 C 3/2
P σn (A)  N − n − tN   e−cnt . (5.66)
1− N/n

Equations (5.63) and (5.66) complements each other—the former is multiplicative


(and is valid for arbitrary dimensions) while the latter is additive (and is applicable
for genuinely rectangular matrices.) Each of these two inequalities clearly has the
regime where it is stronger.
The permanent of an n × n matrix A is defined as

per (A) = a1,π(1) a2,π(2) · · · an,π(n) ,


π∈Sn

where the summation is over all permutations of n elements. If xi,j are i.i.d. 0 mean
variables with unit variance and X is an n × n matrix with entries xi,j then an easy
computation shows that
  2
per (A) = E det A1/2  X ,
5.14 Concentration of Singular Values 329

where for any two n × m matrices A, B, D = A  B denotes their Hadamard


or Schur, product, i.e., the n × m matrix with entries di,j = ai,j · bi,j , and
1/2
where A1/2 (i, j) = A(i, j) . For a class of matrices that arise from δ, κ-
strongly connected graph, i.e., graphs with good expansion properties, Rudelson
and Zeitouni [343] showed the following: Let G be the n × n standard Gaussian
matrix. The constants C, C, c, c, . . . depend only on δ, κ. For any τ ≥ 1 and any
adjacency matrix A of a (δ, κ)-strongly connected graph,
      
P log det2 A1/2  G  − E log det2 A1/2  G  > C(τ n log n)
1/3

 exp (−τ ) + exp (−c n/ log n) .
and
      
   
E log det2 A1/2 G   log per (A)  E log det2 A1/2 G  +C  n log n.

Further, we have
   
 det2 A1/2  G    √
  
P log  > 2C n log n  exp −c n/ log n .
 per (A) 

For the smallest singular value, they showed that


 √ 
P sn (A  G)  ct/ n  t + e−c n ,

and for any n/2 < k < n − 4


 
n−k 
P sk (A  G)  ct √  t(n−k)/4 + e−c n .
n

5.14.7 Products of Random and Deterministic Matrices

We study the B = ΓA, where A is a random matrix with independent 0 mean


entries and Γ is a fixed matrix. Under the (4 + ε)-th moment assumption on the
entries of A, it√ √ in [323] that the spectral norm of such an N × n matrix B
is shown
is bounded by N + n, which is sharp.
B = ΓA can be equivalently regarded as sample covariance matrices of a wide
class of random vectors—the linear transformation of vectors with independent
entries.
Recall the spectral norm W 2 is defined as the √largest singular value of a matrix
W, which equals the largest eigenvalue of |A| = AT A. Equivalently, the spectral
norm can be defined as the l2 → l2 operator norm:

σ1 (A) = sup Ax 2 ,
x:x2 =1

where · 2 denotes the Euclidean norm.


330 5 Non-asymptotic, Local Theory of Random Matrices

For random matrices with independent and identically distributed entries, the
spectral norm is well studies. Let B be an N × n matrix whose entries are real
independent and identically distributed random variables with mean 0, variance 1,
and finite fourth moment. Estimates of the type
√ √
σ1 (B) ∼ N + n, (5.67)

are known to hold (and are sharp) in both the limit regime for dimensions increasing
to infinity, and the non-limit regime where the dimensions are fixed. The meaning
of (5.67) is that, for a family of matrices
√ as above whose aspect ratio N/n converges
√ 
to a constant, the ratio σ1 (B) / N + n converges to 1 almost surely [344].
In the non-limit regime, i.e., for arbitrary dimensions n and N. Variants of (5.67)
was proved by Seginer [107] and Latala [194]. If B is an N × n matrix whose
entries are i.i.d. mean 0 random variables, then denoting the rows of B by xi and
the columns by yj , the result of Seginer [107] says that
 
Eσ1 (B)  C Emax xi 2 +E max yi 2
i i

where C is an absolute constant. The estimate is sharp because σ1 (B) is bounded


below by the Euclidean norm of any row and any column of B.

max
i

If the entries of the matrix B are not necessarily identically distributed, the result of
Latala [194] says that
⎛ ⎛ ⎞1/4 ⎞
⎜ ⎟
Eσ1 (B)  C ⎝Emax xi 2 + E max yi 2 +⎝ b4ij ⎠ ⎠ ,
i i
i,j

where bij are entries of the matrix B. In particular, if B is an N × n matrix whose


entries are independent random variables with mean 0 and fourth moments bounded
by 1, then one can deduce from either Seginer’s or Latala’s result that
√ √ 
Eσ1 (B)  C N+ n . (5.68)

This is a variant of (5.67) in the non-limit regime.


The fourth moment is known to be necessary. Consider again a family of matrices
whose dimensions N and n increase to infinity, and whose aspect ratio N/n
converges to a constant. If the entries are i.i.d. random variables with
√ mean√0 and
infinite fourth moment, then the upper limit of the ratio σ1 (B) / N + n is
infinite almost surely [344].
5.14 Concentration of Singular Values 331

The main result of [323] is an extension of the optimal bound (5.68) to the class
of random matrices with non-independent entries, but which can be factored through
a matrix with independent entries.
Theorem 5.14.4 (Vershynin [323]). Let ε ∈ (0, 1) and let m, n, N be positive
integers. Consider a random m × n matrix B = ΓA, where A is an N × n random
matrix whose entries are independent random variables with mean 0 and (4 + ε)-th
moment bounded by 1, and Γ is an m×N non-random matrix such that σ1 (Γ)  1.
Then
√ √
Eσ1 (B)  C (ε) m+ n (5.69)

where C (ε) is a function that depends only on ε.


Let us give some remarks on Eq. 5.69:
1. The conclusion is independent of N.
2. The proof of Eq. 5.69 gives the stronger estimate
 √
Eσ1 (B)  C (ε) σ1 (Γ) m + Γ HS

which is valid for arbitrary (non-random) m × N matrix Γ. Here · HS denotes


the Hilbert-Schmidt norm or Frobenius norm. This result is independent of the
dimensions of Γ, therefore holds for an arbitrary linear operator Γ acting from
the N -dimensional Euclidean space l2N to an arbitrary Hilbert space.
3. Equation 5.69 can be interpreted in terms of sample covariance matrices of
random vectors in Rm of the form Γx, where x is a random vector in Rn with in-
dependent entries. Let A be the random matrix whose columns are n independent
samples of the random vector x. Then B = ΓA is the matrix whose columns
are n independent samples of the random vector Γx. The sample covariance
matrix of the random vector Γx, is defined as Σ = n1 BB T
 . Equation 5.69
says that the largest eigenvalue of Σ is bounded by C1 (ε) 1 + m n , which is
further bound by C2 (ε) for the number of samples n ≥ m (and independently of
the dimension N ). This problem was studied [345, 346] in the asymptotic limit
regime for m = N, where the result must of course dependent on N.

5.14.8 Random Determinant

Random determinant can be used the test metric for hypothesis testing, especially
for extremely low signal to noise ratio. The variance of random determinant is much
smaller than that of individual eigenvalues. We follow [347] for this exposition. Let
An be an n×n random matrix whose entries aij , 1 ≤ i, j ≤ n, are independent real
random variables of 0 mean and unit variance. We will refer to the entries aij as the
atom variables. This shows that almost surely, log |det (An )| is (1/2 + o(1)) n log n
but does not provide any distributional information. For other models of random
matrices, we refer to [348].
332 5 Non-asymptotic, Local Theory of Random Matrices

In [349], Goodman considered random Gaussian matrices where the atom


variables are i.i.d. standard Gaussian variables. He noticed that in this case
the determinant is a product of independent Chi-square variables. Therefore, its
logarithm is the sum of independent variables and thus one expects a central limit
theorem to hold. In fact, using properties of Chi square distribution, it is not very
hard to prove

log |det (An )| − 12 log (n − 1)!


2 → N (0, 1) . (5.70)
1
2 log n

In [242], Tao and Vu proved that for Bernoulli random matrices, with probability
tending to one (as n tents to infinity)
√    √
n! exp −c n log n  |det (An )|  n!ω (n) (5.71)

for any function ω(n) tending to infinity with n. We say that a random variable ξ
satisfies condition C0 (with positive constants C1 , C2 ) if

P (|ξ|  t)  C1 exp −tC2 (5.72)

for all t > 0. Nguyen and Vu [347] showed that the logarithm of |det (An )| satisfies
a central limit theorem. Assume that all atom variables aij satisfy condition C0 with
some positive constants C1 , C2 . Then
 ⎛ ⎞ 
 
 log |det (A )| − 1
log (n − 1)! 

sup P ⎝ 2
n 2
 t − Φ(t)  log−1/3+o(1) n.

t∈R  1
log n 
2

)t  2 (5.73)
Here Φ(t) = P (N (0, 1) < t) = √1
2π −∞
exp −x /2 dx. An equivalent form is
   
 log det A2n − log (n − 1)! 
 
sup P √  t − Φ(t)  log−1/3+o(1) n. (5.74)
t∈R  2 log n 

For illustration, we see Fig. 5.1.


Example 5.14.5 (Hypothesis testing).

H0 : Ry = Rx
H1 : Ry = Rx + Rs

where Rx is an n × n random matrix whose entries are independent real random


variables of 0 mean and unit variance, and Rs is an positive definite matrix of n× n.
Random determinant can be used the test metric for hypothesis testing, especially
5.14 Concentration of Singular Values 333

Empirical CDF
1
Bernoulli matrices
0.9 Gaussian matrices
standard normal curve
0.8

0.7

0.6
F(x)

0.5

0.4

0.3

0.2

0.1

0
-5 -4 -3 -2 -1 0 1 2 3 4
x
  √
Fig. 5.1 The plot compares the distributions of log det A2 − log (n − 1)! / 2 log n for random
Bernoulli matrices, random Gaussian matrices, and N (0, 1). We sampled 1,000 matrices of size
1,000 by 1,000 for each ensemble

for extremely low signal to noise ratio. The variance of random determinant is much
smaller than that of individual eigenvalues. According to (5.73), the hypothesis
test is
H0 : log |det (Rx )|
H1 : log |det (Rx + Rs )|
Let A, B be complex matrices of n×n, and assume that A is positive definite. Then
for n ≥ 2 [18, p. 535]

det A  |det A| + |det B|  |det (A + B)| (5.75)


It follows from (5.75) that
1
H0 : log |det (Rx )| ≈ log (n − 1)!
2
1
H1 : log |det (Rx + Rs )|  log (det Rs + |det Rx |)  log |det (Rx )| ≈ log (n − 1)!
2
where Rs is assumed to be positive definite and Rx is an arbitrary complex matrix.
Our algorithm claims H1 if

1
log |det (Rx + Rs )|  log (n − 1)!.
2
334 5 Non-asymptotic, Local Theory of Random Matrices

Equivalently, from (5.74), we can investigate



H0 : log det R2x ≈ log (n − 1)!
2 
H1 : log det (Rx + Rs ) = log det R2x + R2s + Rx Rs + Rs Rx

where Rx an n × n random matrix whose entries are independent real random


variables of 0 mean and unit variance, and Rs is an arbitrary complex matrix of
n × n.


5.15 Invertibility of Random Matrix

We follow [350] for this exposition. Given an n × n random matrix A, what is the
probability that A is invertible, or at least “close” to being invertible? One natural
way to measure this property is to estimate the following small ball probability

P (sn (A)  t) ,

where

def 1
sn (A) = inf Ax = .
x1 =1
2
A−1

In the case when the entries of A are i.i.d. random variables with appropriate
moment assumption, the problem was studied in [239, 241, 341, 351, 352]. In
particular, in [341] it is shown that if the above diagonal entries of A are
continuous and satisfy certain regularity conditions, namely that the entries are i.i.d.
subGaussian and satisfy certain smoothness conditions, then

P (sn (A)  t)  C nt + e−cn . (5.76)

where c, C depend on the moment of the entries.


Several cases of dependent entries have also been studied. A bound similar
to (5.76) for the case when the rows are independent log-concave random vectors
was obtained in [353, 354]. Another case of dependent entries is when the matrix is
symmetric, which was studied in [355–360]. In particular, in [355], it is shown that
if the above diagonal entries of A are continuous and satisfy certain regularity con-
ditions, namely that the entries are i.i.d. subgaussian and satisfy certain smoothness
conditions, then

P [sn (A)  t]  C nt.
5.15 Invertibility of Random Matrix 335

The regularity assumptions were completely removed in [356] at the cost of a n5/2
(independence of the entries in the non-symmetric part is still needed). On the other
hand, in the discrete case, the result of [360] shows that if A is, say, symmetric
whose above diagonal entries are i.i.d. Bernoulli random variables, then

P [sn (A) = 0]  e−n ,


c

where c is an absolute constant.


A more general case is the so called Smooth Analysis of random matrices, where
now we replace the matrix A by A + Γ, where Γ being an arbitrary deterministic
matrix. The first result in this direction can be found in [361], where it is shown that
if A is a random matrix with i.i.d. standard normal entries, then

P (sn (Γ + A)  t)  C nt. (5.77)

Further development in this direction can be found in [362] estimates similar to (1.2)
are given in the case when A is a Bernoulli random matrix, and in [356, 358, 359],
where A is symmetric.
An alternative way to measure the invertibility of a random matrix A is to
estimate det(A), which was studied in [242,363,364] (when the entries are discrete
distributions). Here we show that if the diagonal entries of A are independent
continuous random variables, we can easily get a small ball estimate for det(Γ+A),
where Γ being an arbitrary deterministic matrix.
Let A be an n × n random matrix, such that each diagonal entry Ai,i is a
continuous random variable, independent from all the other entries of A. Friedland
and Giladi [350] showed that for every n × n matrix Γ and every t ≥ 0

P (|det (A + Γ)|  t)  2αnt,

where α is a uniform upper bound on the densities of Ai,i . Further, we have

P [ A  t]  2αnt,
n/(2n−1) (n−1)/(2n−1) 1/(2n−1) (5.78)
P [sn (A)  t]  (2α) (E A ) t .

Equation (5.78) can be applied to the case when the random matrix A is symmetric,
under very weak assumptions on the distributions and the moments of the entries
and under no independence assumptions on the above diagonal entries. When A is
symmetric, we have

A = sup Ax, x  max |Ai,i | .


x1 =1 1in

Thus, in this case we get a far better small ball estimate for the norm
n
P [ A  t]  (2αt) .
336 5 Non-asymptotic, Local Theory of Random Matrices

Rudelson [76] gives an excellent self-contained lecture notes. We take some material
from his notes to get a feel of the proof ingredients. His style is very elegant.
In the classic work on numerical inversion of large matrices, von Neumann
and his associates used random matrices to test their algorithms, and they specu-
lated [365, pp. 14, 477, 555] that

sn (A) ∼ 1/ n with high probability. (5.79)

In a more precise form, this estimate was conjectured by Smale [366] and proved
by Edelman[367] and Szarek[368] for random Gaussian matrices A, i.e., those with
i.i.d. standard normal entries. Edelman’s theorem states that for every t ∈ (0, 1)
 √
P sn (A)  t/ n ∼ t. (5.80)

In [341], the conjecture (5.79) is proved in full generality under the fourth moment
assumption.
Theorem 5.15.1 (Invertibility: fourth moment [341]). Let A be an n × n matrix
whose entries are independent centered real random variables with variances at
least 1 and fourth moments bounded by B. Then, for every δ > 0 there exist ε > 0
and n0 which depend (polynomially) only on δ and B, such that
 √
P sn (A)  t/ n  δ for all n  n0 .

Spielman and Teng[369] conjectured that (5.80) should hold for the random sign
matrices up to an exponentially small term that accounts for their singularity
probability:
 √
P sn (A)  t/ n  ε + cn .

Rudelson and Vershynin prove Spielman-Teng’s conjecture up to a coefficient in


front of t. Moreover, they show that this type of behavior is common for all matrices
with subGaussian i.i.d. entries.
Theorem 5.15.2 (Invertibility: subGaussian [341]). Let A be an n × n matrix
whose entries are independent copies of a centered real subGaussian random
variable. Then, for every t ≥ 0, one has
 √
P sn (A)  t/ n  Cε + cn . (5.81)

where C > 0 and c ∈ (0, 1).


5.16 Universality of Singular Values 337

5.16 Universality of Singular Values

Large complex system often exhibit remarkably simple universal patterns as the
numbers of degrees of freedom increases [370]. The simplest example is the
central limit theorem: the fluctuation of the sums of independent random scalars,
irrespective of their distributions, follows the Gaussian distribution. The other
cornerstone of probability theory is to treat the Poisson point process as the universal
limit of many independent point-like evens in space or time. The mathematical
assumption of independence is often too strong. What if independence is not
realistic approximation and strong correlations need to be modelled? Is there a
universality for strongly correlated models?
In a sensor network of time-evolving measurements consisting of many
sensors—vector time series, it is probably realistic to assume that the measurements
of sensors have strong correlations.
Let ξ be a real-valued or complex-valued random variable. Let A denote the n×n
random matrix whose entries are i.i.d. copies of ξ. One of the two normalizations
will be imposed on ξ:
• R-normalization: ξ is real-valued with Eξ = 0 and Eξ 2 = 1.
2 2
• C-normalization: ξ is complex-valued with Eξ = 0, E Re (ξ) = E Im (ξ) =
2 , and E Re (ξ) Im (ξ) = 0.
1

In both cases, ξ has mean 0 and variance 1.


Example 5.16.1 (Normalizations). A model example of a R-normalized random
variable is the real Gaussian N (0, 1). Another R-normalized random variable is
Bernoulli, in which ξ equals +1 or −1 with an equal probability 1/2 of each.
A model example of C-normalization is the complex Gaussian whose real and
imaginary parts are i.i.d. copies of √12 N (0, 1). 
2
One frequently views σn (A) as the eigenvalues of the sample covariance matrix
AA∗ , where ∗ denotes the Hermitian (conjugate and transpose) of a matrix. It
is more traditional to write down the limiting distributions in terms of σ 2 . We
study the “hard edge” of the spectrum, and specifically the least singular value
σn (A). This problem has a long history. It first appeared in the worked of von
Neuman and Goldstein concerning numerical inversion of large matrices [371].
Later, Smale [366] made a specific conjecture about the magnitude of σn . Motivated
by a question of Smale, Edelman [372] computed the distribution of σn (ξ) for the
real and complex Gaussian cases:
Theorem 5.16.2 (Limiting Distribution for Gaussian Models [372]). For any
fixed t ≥ 0, we have, for real cases,
  t √
2 1 + x −(x/2+√x)
P nσn (A)  t = √ e dx + o(1) (5.82)
0 2 x
338 5 Non-asymptotic, Local Theory of Random Matrices

as well as the exact (!) formula, for complex cases,


  t
e−x dx.
2
P nσn (A)  t =
0

Both integrals can be computed explicitly. By exchange of variables, we have


t √ √
1 + x −(x/2+√x)
√ e dx = 1 − e−t/2− t . (5.83)
0 2 x

Also, it is clear that


t
e−x dx = 1 − e−t . (5.84)
0

The joint distribution of the bottom k singular values of real or complex A was
computed in [373].
The error term o(1) in (5.82) is not explicitly stated in [372], but Tao and Vu [326]
gives the form of O(n−c ) for some absolute constant c > 0.
4
Under the assumption of bounded fourth moment E|ξ| < ∞, it was shown
by Rudelson and Vershynin [341] that
 
2
P nσn (A)  t  f (t) + o(1)

for all fixed t > 0, where g(t) goes to zero as t → 0. Similarly, in [374] it was
shown that
 
2
P nσn (A)  t  g (t) + o(1)

for all fixed t > 0, where g(t) goes to zero as t → ∞. Under stronger assumption
that ξ is sub-Gaussian, the lower tail estimate was improved in [341] to
  √
2
P nσn (A)  t  C t + cn (5.85)

for some constant C > 0 and 0 < c < 1 depending on the sub-Gaussian moments
of ξ. At the other extreme, with no moment assumption on ξ, the bound
 
5
−1− 2 a−a2
 n−a+o(1)
2
P nσn (A)  n

was shown for any fixed a > 0 in [362].


A common feature of the above mention results is that they give good upper
and lower tail bounds on nσn 2 , but not the distribution law. In fact, many pa-
pers [341, 374, 375] are partially motivated by the following conjecture of Spielman
and Teng [369].
5.16 Universality of Singular Values 339

Conjecture 5.16.3 (Spielman and Teng [369]). Let ξ be the Bernoulli random
variable. Then there is a constant 0 < c < 1 such that for all t ≥ 0
 
2
P nσn (A)  t  t + cn . (5.86)

A new method was introduced by Tao and Vu [326] to study small singular values.
Their method is analytic in nature and enables us to prove the universality of the
limiting distribution of nσn 2 .
Theorem 5.16.4 (Universality for the least singular value [326]). Let ξ be a (real
C
or complex) random variable of mean 0 and variance 1. Suppose E|ξ| 0 < ∞ for
some sufficiently large absolute constant C0 . Then, for all t > 0, we have,
  t √
1 + x −(x/2+√x)
dx + o(n−c )
2
P nσn (A)  t = √ e (5.87)
0 2 x

if ξ is R-normalized, and
  t
e−x dx + o(n−c )
2
P nσn (A)  t =
0

if ξ is C-normalized, where c > 0 is an absolute constant. The implied constants in


C
the O(·) notation depends on E|ξ| 0 but are uniform in t.
Very roughly, one can swap ξ with the appropriate Gaussian distribution gR or
gC , at which point one can basically apply Theorem 5.16.4 as a black box. In other
words, the law of nσn 2 is universal with respect to the choice of ξ by a direct
) t √x −(x/2+√x)
comparison to the Gaussian models. The exact formula 0 1+ √ e dx and
) t −x 2 x

0
e dx do not play any important role. This comparison (or coupling) approach
is in the spirit of Lindeberg’s proof [376] of the central limit theorem.
Tao and Vu’s arguments are completely effective, and give an explicit value
for C0 . For example, C0 = 104 is certainly sufficient. Clearly, one can lower C0
significantly.
Theorem 5.16.4 can be extended to rectangular random matrices of (n − l) × n
dimensions (Fig. 5.2).
Theorem 5.16.5 (Universality for the least singular value of rectangular matri-
ces [326]). Let ξ be a (real or complex) random variable of mean 0 and variance
C
1. Suppose E|ξ| 0 < ∞ for some sufficiently large absolute constant C0 . Let l be a
constant. Let
2 2 2
X = nσn−l (A(ξ)) , XgR = nσn−l (A(gR )) , XgC = nσn−l (A(gC )) .

Then, there is a constant c > 0 such that for all t ≥ 0, we have,


340 5 Non-asymptotic, Local Theory of Random Matrices

Bernoulli Gaussian

 
Fig. 5.2 Plotted above are the curves P nσn−l (A (ξ))2  x , for l = 0, 1, 2 based on data from
1,000 randomly generated matrices with n = 100. The curves on the left were generated with ξ
being a random Bernoulli variable, taking the values +1 and −1 each with probability 1/2; The
curves on the right were generated with ξ being a random Gaussian variable. In both cases, the
curves from left to right correspond to the cases l = 0, 1, 2, respectively

 
P X  t − n−c − n−c  P (XgR  t)  P X  t + n−c + n−c

if ξ is R-normalized, and
 
P X  t − n−c − n−c  P (XgC  t)  P X  t + n−c + n−c

if ξ is C-normalized.
Theorem 5.16.4 can be extended to random matrices with independent (but not
necessarily identical) entries.
Theorem 5.16.6 (Random matrices with independent entries [326]). Let ξij be
a (real or complex) random variables with mean 0 and variance 1 (R-normalized or
C
C-normalized). Suppose E|ξ| 0 < C1 for some sufficiently large absolute constant
C0 and C1 . Then, for all t > 0, we have,
  t √
1 + x −(x/2+√x)
dx + o(n−c )
2
P nσn (A)  t = √ e (5.88)
0 2 x
if ξij are all R-normalized, and
  t
e−x dx + o(n−c )
2
P nσn (A)  t =
0

if ξij are all C-normalized, where c > 0 is an absolute constant. The implied
C
constants in the O(·) notation depends on E|ξ| 0 but are uniform in t.
5.16 Universality of Singular Values 341

Let us extend Theorem 5.16.4 for the condition number. Let A be an n × n


matrix, its condition number κ (X) is defined as

σ1 (A)
κ (A) = .
σn (A)

It√is well known [2] that the largest singular value is concentrated strongly around
2 n. Combining Theorem 5.16.4 with this fact, we have the following for the
general setting.
Lemma 5.16.7 (Concentration of the largest singular value [326]). Under the
setting of Theorem
√ 5.16.4, we have, with probability 1 − exp −nΩ(1) , σ1 (A) =
(2 + o(1)) n.
Corollary 5.16.8 (Conditional number [326]). Let ξij be a (real or complex)
random variables with mean 0 and variance 1 (R-normalized or C-normalized).
C
Suppose E|ξ| 0 < C1 for some sufficiently large absolute constant C0 and C1 .
Then, for all t > 0, we have,
  t √
1 1 + x −(x/2+√x)
P κ (A (ξ))  t = √ e dx + o(n−c ) (5.89)
2n 0 2 x

if ξij are all R-normalized, and


  t
1
P κ (A (ξ))  t = e−x dx + o(n−c )
2n 0

if ξij are all C-normalized, where c > 0 is an absolute constant. The implied
C
constants in the O(·) notation depends on E|ξ| 0 but are uniform in t.

5.16.1 Random Matrix Plus Deterministic Matrix

Let ξ be a complex random variable with mean 0 and variance 1. Let A be the
random matrix of size n whose entries are i.i.d. copies of ξ and Γ be a fixed matrix
of the same size. Here we study the conditional number and least singular value of
the matrix B = Γ+A. This is called signal plus noise matrix model. It is interesting
to find the “signal” matrix Γ does play a role on tail bounds for the least singular
value of Γ + A.
Example 5.16.9 (Covariance Matrix). The conditional number is a random variable
of interest to many applications [372, 377]. For example,

Σ̂ = Σ + Z or Z = Σ̂ − Σ
342 5 Non-asymptotic, Local Theory of Random Matrices

where Σ is the true covariance matrix (deterministic and assumed to be known) and
Σ̂ = n1 XX∗ is the sample (empirical) covariance matrix—random matrix; here X
is the data matrix which is the only matrix available to the statistician. 
Example 5.16.10 (Hypothesis testing for two matrices).

H0 : Ry = Rn
H1 : Ry = Rx + Rn

where Rx is an arbitrary deterministic matrix and Rn is a random matrix. 


Let us consider the Gaussian case. Improving the results of Kostlan
√ and
Oceanu [366] and Edelman [372] computed the limiting distribution of nσn (A).
Theorem 5.16.11 (Gaussian random matrix with i.i.d. entries [372]). There is
a constant C > 0 such that the following holds. Let ξ be a real Gaussian random
variable with mean 0 and variance 1, let A be the random matrix whose entries are
i.i.d. copies of ξ, and let Γ be an arbitrary fixed matrix. Then, for any t > 0,

P (σn (A)  t)  n1/2 t.

Considering the more general model B = Γ + A, Sankar, Spielman and Teng


proved [378].
Theorem 5.16.12 (Deterministic Matrix plus Gaussian random matrix [378]).
There is a constant C > 0 such that the following holds. Let ξ be a real Gaussian
random variable with mean 0 and variance 1, let A be the random matrix whose
entries are i.i.d. copies of ξ, and let Γ be an arbitrary fixed matrix. B = Γ + A.
Then, for any t > 0,

P (σn (B)  t)  Cn1/2 t.

We say ξ is sub-Gaussian if there is a constant α > 0 such that


2
/α2
P (|ξ|  t)  2e−t

for all t > 0. The smallest α is called the sub-Gaussian moment of ξ. For a more
general sub-Gaussian random variable ξ, Rudelson and Vershynin proved [341] the
following.
Theorem 5.16.13 (Sub-Gaussian random matrix with i.i.d. entries [341]). Let
ξ be a sub-Gaussian random variable with mean 0, variance 1 and sub-Gaussian
moment α. Let c be an arbitrary positive constant. Let A be the random matrix
whose entries are i.i.d. copies of ξ, Then, there is a constant C > 0 (depending on
α) such that, for any t ≥ n−c ,

P (σn (A)  t)  Cn1/2 t.

For the general model B = Γ + A, Tao and Vu proved [379]


5.16 Universality of Singular Values 343

Theorem 5.16.14 (A general model B = Γ + A [379]). Let ξ be a random


variable with non-zero variance. Then for any constants C1 , C > 0 there exists
a constant C2 > 0 (depending on C1 , C, ξ) such that the following holds. Let A be
the random matrix whose entries are i.i.d. copies of ξ, and let Γ be any deterministic
n × n matrix with norm Γ  nC . Then, we have

P σn (Γ + A)  n−C2  n−C1 .

This theorem requires very little about the random variable ξ. It does not need to
be sub-Gaussian nor even has bounded moments. All we ask is that the variable is
bounded from zero, which basically means ξ is indeed “random”. Thus, it guarantees
the well-conditionness of B = Γ + A in a very general setting.
The weakness of this theorem is that the dependence of C2 on C1 and C, while
explicit, is too generous. The work of [362] improved this dependence significantly.
Let us deal with the non-Gaussian random matrix.
Theorem 5.16.15 (The non-Gaussian random matrix [362]). There are positive
constants c1 and c2 such that the following holds. Let A be the n × n Bernoulli
matrix with n even. For any α ≥ n, there is an n × n deterministic matrix Γ such
that Γ = α and
 n 1
P σn (Γ + A)  c1  c2 √ .
α n

The main result of [362] is the following theorem


Theorem 5.16.16 (Bounded second moment on ξ—Tao and Vu [362]). Let ξ be
a random variable with mean 0 and bounded second moment, and let γ ≥ 1/2, C ≥
0 be constants. There is a constant c depending on ξ, γ, C such that the following
holds. Let A be the n × n matrix whose entries are i.i.d. copies of ξ, Γ be a
deterministic matrix satisfying Γ  nγ . Then
   
P σn (Γ + A)  n−(2C+1)γ  c n−C+o(1) + P ( A  nγ ) .

This theorem only assumes bounded second moment on ξ. The assumption that the
entries of A are i.i.d. is for convenience. A slightly weaker result would hold if one
omit this assumption.
Let us deal with the condition number.
Theorem 5.16.17 (Conditional number—Tao and Vu [362]). Let ξ be a random
variable with mean 0 and bounded second moment, and let γ ≥ 1/2, C ≥ 0 be
constants. There is a constant c depending on ξ, γ, C such that the following holds.
Let A be the n × n matrix whose entries are i.i.d. copies of ξ, Γ be a deterministic
matrix satisfying Γ  nγ . Then
   
P κ (Γ + A)  2n(2C+2)γ  c n−C+o(1) + P ( A  nγ ) .
344 5 Non-asymptotic, Local Theory of Random Matrices

Proof. Since κ (Γ + A)=σ1 (Γ + A) /σn (Γ + A), it follows that if κ (Γ + A) 


2n(2C+2)γ , then at least one of the two events σn (Γ + A)  n−(2C+1)γ and
σ1 (Γ + A)  2nγ holds. On the other hand, we have

σ1 (Γ + A)  σ1 (Γ) + σ1 (A) = Γ + A  nγ + A .

The claim follows. 


Let us consider several special cases and connect Theorem 5.16.16 with other
existing results. First, we consider the sub-Gaussian case of ξ. Due to [341], one
can have a strong bound on P ( A  nγ ).
Theorem 5.16.18 (Tao and Vu [362]). Let α be a positive constant. There are
positive constants C1 , C2 depending on α such that the following holds. Let ξ be
a sub-Gaussian random variable with 0 mean, variance one and sub-Gaussian
moment α, and A be a random matrix whose entries are i.i.d. copies of ξ. Then
 √
P A  C1 n  e−C2 n .

If one replaces the sub-Gaussian condition by the weaker condition that ξ has fourth
moment bounded α, then one has a weaker conclusion that

E A  C1 n.

Combining Theorems 5.16.16 and 5.16.18, we have


Theorem 5.16.19 (Bounded second moment on ξ—Tao and Vu [362]). Let C
and γ be arbitrary positive constants. Let ξ be a sub-Gaussian random variable
with mean 0 and variance 1. Let A be the n × n matrix whose entries are i.i.d.
copies of ξ, Γ be a deterministic matrix satisfying Γ  nγ . Then
 √ 
−2C−1
P σn (Γ + A)  n+ A  n−C+o(1) . (5.90)


For A = O ( n), (5.90) becomes
 
P σn (Γ + A)  n−C−1/2  n−C+o(1) . (5.91)

Up to a loss of magnitude no(1) , this matches Theorem 5.16.13, which treated


the base case Γ = 0.
If we assume bounded fourth moment instead of sub-Gaussian, we can√ use the
second half of Theorem 5.16.18 to deduce the following, for A = O ( n),
 √ 
−1+o(1)
P σn (Γ + A)  n+ A = o(1). (5.92)
5.16 Universality of Singular Values 345


In the case of A = O ( n), this implies that almost surely σn (Γ + A) 
n−1/2+o(1) . For the special case of Γ = 0, this matches [341, Theorem 1.1], up
to the o(1) term.

5.16.2 Universality for Covariance and Correlation Matrices

To state the results in[380–382], we need the following two conditions. Let X =
(xij ) be a M × N data matrix with independent centered real valued entries with
variance 1/M :
1
xij = √ ξij , Eξij = 0, Eξij
2
= 1. (5.93)
M
Furthermore, the entries ξij have a sub-exponential decay, i.e., there exists a constant
1
P (|ξij | > t)  exp (−tκ ) . (5.94)
κ
The sample covariance matrix corresponding to data matrix X is given by S =
XH X. We are interested in the regime

d = dN = N/M, lim d = 0, 1, ∞. (5.95)


N →∞

All the results here are also valid for complex valued entries with the moment
condition (5.93) replaced with its complex valued analogue:
1 2
xij = √ ξij , Eξij = 0, Eξij
2
= 0, E |ξ|ij = 1. (5.96)
M
By the singular value decomposition of X, there exist orthonormal bases
M  M 
X= λi ui viH = λi uH
i vi ,
i=1 i=1

where λ1  λ2  · · ·  λmax{M,N }  0, λi = 0 for λi = 0, min{N, M } + 1 


i  max{N, M }.
Theorem 5.16.20 ([381]). Let X with independent entries satisfying (5.93)
and (5.94). For any fixed k > 0,
⎛ √ √ 2 √ √ 2 ⎞
⎜ M λ1 − N + M M λ k − N + M ⎟
⎝ √ √  1 1/3 , · · · , √ √  1/3 ⎠ → T W 1 ,
N+ M √ + √1 N+ M √1 + √1
N M N M
(5.97)
where TW1 denotes the Tracy-Widom distribution. An analogous statement holds
for the smallest eigenvalues.
346 5 Non-asymptotic, Local Theory of Random Matrices

Let us study the correlation matrix. For a Euclidean vector a ∈ Rn , define the l2
norm
:
; n
;
a 2=< a2i .
i=1

The matrix XH X is the usual covariance matrix. The j-th column of X is denoted
by xj . Define the matrix M × N X̃ = (x̃ij )

xij
x̃ij = . (5.98)
xj 2


M
Using the identity Ex̃2ij = ME
1
x̃2ij , we have
i=1

1
Ex̃2ij = .
M

Let λ̃i is the eigenvalues of the matrix X̃H X̃, sorted decreasingly, similarly to λi ,
the eigenvalues of the matrix XH X̃.
The key difficulty to be overcome is the strong dependence of the entries of
the correlation matrix. The main result here states that, asymptotically, the k-point
(k ≥ 1) correlation functions of the extreme eigenvalues (at the both edges of the
spectrum) of the correlation matrix X̃H X̃ converge to those of Gaussian correlation
matrix, i.e., Tracy-Widom law, and thus in particular, the largest and smallest
eigenvalues of X̃H X̃, after appropriate centering and rescaling, converge to the
Tracy-Widom distribution.
Theorem 5.16.21 ([380]). Let X with independent entries satisfying (5.93), (5.94)
(or (5.96) for complex entries), (5.95), and (5.98). For any fixed k > 0,
⎛ √
√ 2 √ √ 2 ⎞
⎜ M λ̃1 − N+ M M λ̃k − N+ M ⎟
⎝ √ √  1 1/3 , · · · , √ √  1/3 ⎠ → TW1 ,
N+ M √ + √1 N+ M √1 + √1
N M N M
(5.99)
where TW1 denotes the Tracy-Widom distribution. An analogous statement holds
for the k-smallest (non-trivial) eigenvalues.
As a special case, we also obtain the Tracy-Widom law for the Gaussian correlation
matrices. To reduce the variance of the test statistics, we prefer the sum of the
k  
functions of eigenvalues f λ̃i for k ≤ min{M, N } where f : R → R
i=1
is a function (say convex and Lipschitz). Note that the eigenvalues λ̃i are highly
5.16 Universality of Singular Values 347

correlated random variables. Often the sums of independent random variables


are extensively studied. Their dependable counterparts are less studied: Stein’s
method [87, 258] can be used for this purpose.
Example 5.16.22 (Universality and Principal Component Analysis). Covariance
matrices are ubiquitous in modern multivariate statistics where the advance of
technology has leads to high dimensional data sets—Big Data. Correlation matrices
are often preferred. The Principal Component Analysis (PCA) is not invariant to
change of scale in the matrix entries. It is often recommended first to standardize
the matrix entries and then perform PCA on the resulting correlation matrix [383].
Equivalently, one performs PCA on the sample correlation matrix.
The PCA based detection algorithms (hypothesis tests) are studied for spectrum
sensing in cognitive radio [156,157] where the signal to noise ratio is extremely low
such as −20 dB. The kernel PCA is also studied in this context [384, 385].
Akin to the central limit theorem,universality [370, 380] refers to the phe-
nomenon that the asymptotic distributions of various functionals of covari-
ance/correlation matrices (such as eigenvalues, eigenvectors etc.) are identical to
those of Gaussian covariance/correlation matrices. These results let us calculate
the exact asymptotic distributions of various test statistics without restrictive
distributional assumptions of the matrix entries. For example, one can perform
various hypothesis tests under the assumption that the matrix entries are not
normally distributed but use the same test statistic as in the Gaussian case.
For random vectors ai , wi , yi ∈ Cn , our signal model is defined as

y i = ai + w i , i = 1, . . . , N

where ai is the signal vector and wi the random noise. Define the (random) sample
covariance matrices as
N N N
Y= yi yiH , A= ai aH
i , W= wi wiH .
i=1 i=1 i=1

It follows that

Y = A + W. (5.100)

Often the entries of W are normally distributed (or Gaussian random variables). We
are often interested in a more general matrix model

Y =A+W+J (5.101)

where the entries of matrix J are not normally distributed (or non-Gaussian random
variables). For example, when jamming signals are present in a communications
or sensing system. We like to perform PCA on the matrix Y. Often the rank of
matrix A is much lower than that of W. We can project the high dimensional matrix
348 5 Non-asymptotic, Local Theory of Random Matrices

Y into lower dimensions, hoping to expose more structures of A. The Gaussian


model of (5.100) is well studied. Universality implies that we are able to use the test
statistic of (5.100) to study that of (5.101). 
We provide the analogous non-asymptotic bounds on the variance of eigenvalues
for random covariance matrices, following [386]. Let X be a m×n (real or complex)
random matrix, with m > n, such that its entries are independent, centered and have
variance 1. The random covariance matrix (Wishart matrix) S is defined as
1 H
S= X X.
n
An important example is the case when all the entries of X are Gaussian. Then
S belongs to the so-called Laguerre Unitary Ensemble (LUE) if the entries are
complex and to the Laguerre Orthogonal Ensemble (LOE) if they are real. All the
eigenvalues are nonnegative and will be sorted increasingly 0  λ1  · · ·  λn .
We say that Sm,n satisfy condition (C0) if the real part ξ and imaginary part ξ˜
of (Sm,n )i,j are independent and have an exponential decay: there are two positive
constants β1 and β2 such that
   
 
P |ξ|  tβ1  e−t and P ξ˜  tβ1  e−t

for t ≥ β2 .
We assume that (1)
m
1 < α1   α2
n
where α1 and α2 are fixed constants and that S is a covariance matrix whose entries
have an exponent decay (condition (C0)) and (2) have the same first four moments
as those of a LUE matrix. The following summarizes a number of quantitative
bounds.
Theorem 5.16.23 ([386]).
1. In the bulk of the spectrum. Let η ∈ (0, 12 ]. There is a constant C > 0
(depending on η, α1 , α2 ) such that, for all covariance matrices Sm,n , with
ηn ≤ i ≤ (1 − η)n,

log n
Var (λi )  C .
n2
2. Between the bulk and the edge of the spectrum. There is a constant κ > 0
(depending on α1 , α2 ) such that the following holds. For all K > κ, for all
η ∈ (0, 12 ], there exists a constant C > 0 (depending on K, η, α1 , α2 ) such that,
for all covariance matrices Sm,n , with (1 − η)n ≤ i ≤ n − K log n,

log (n − i)
Var (λi )  C 2/3
.
n4/3 (n − i)
5.17 Further Comments 349

3. At the edge of the spectrum. There exists a constant C > 0 (depending on


α1 , α2 ) such that, for all covariance matrices Sm,n ,

1
Var (λn )  C .
n4/3

5.17 Further Comments

This subject of this chapter is in its infancy. We tried to give a comprehensive review
of this subject by emphasizing both techniques and results. The chief motivation is
to make connections with the topics in other chapters. This line of work deserves
further research.
The work [72] is the first tutorial treatment along this line of research. The two
proofs taken from [72] form the backbone of this chapter. In the context of our
book, this chapter is mainly a statistical tool for covariance matrix estimation—
sample covariance matrix is a random matrix with independent rows. Chapter 6 is
included here to highlight the contrast between two different statistical frameworks:
non-asymptotic, local approaches and asymptotic, global approaches.
Chapter 6
Asymptotic, Global Theory of Random Matrices

The chapter contains standard results for asymptotic, global theory of random matri-
ces. The goal is for readers to compare these results with results of non-asymptotic,
local theory of random matrices (Chap. 5). A recent treatment of this subject is given
by Qiu et al. [5].
The applications included in [5] are so rich; one wonder whether a parallel
development can be done along the line of non-asymptotic, local theory of random
matrices, Chap. 5. The connections with those applications are the chief reason why
this chapter is included.

6.1 Large Random Matrices

Example 6.1.1 (Large random matrices). Consider n-dimensional random vectors


y, x, n ∈ Rn

y =x+n

where vector x is independent of n. The components x1 , . . . , xn of the random


vector x are scalar valued random variables, and, in general, may be dependent
random variables. For the random vector n, this is similar. The true covariance
matrix has the relation

Ry = Rx + Rn ,

due to the independence between x and n.


Assume now there are N copies of random vector y:

y i = x i + ni , i = 1, 2, . . . , N.

Let us consider the sample covariance matrix

R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 351
DOI 10.1007/978-1-4614-4544-9 6,
© Springer Science+Business Media New York 2014
352 6 Asymptotic, Global Theory of Random Matrices

N N N
1 1 1
yi ⊗ yi∗ = xi ⊗ x∗i + ni ⊗ n∗i + junk
N i=1
N i=1
N i=1

where “junk” denotes two other terms. It is more convenient to consider the matrix
form
⎡ T⎤ ⎡ T⎤ ⎡ T⎤
y1 x1 n1
⎢ .. ⎥ ⎢ .. ⎥ ⎢ .. ⎥
Y=⎣ . ⎦ , X=⎣ . ⎦ , N=⎣ . ⎦ ,
T
yN N ×n
xTN N ×n
nTN N ×n

where all matrices are of size n × n. Thus it follows that

1 ∗ 1 1
Y Y = X∗ X + N∗ N + junk. (6.1)
N N N
A natural question arises from the above exercise: What happens if n → ∞, N →
n → α?
∞, but N 
The asymptotic regime
n
n → ∞, N → ∞, but → α? (6.2)
N
calls for a global analysis that is completely different from that of non-asymptotic,
local analysis of random matrices (Chap. 5). Stieltjes transform and free probability
are two alternative frameworks (but highly related) used to conduct such an analysis.
The former is an analogue of Fourier transform in a linear system. The latter is an
analogue of independence for random variables.
Although this analysis is asymptotic, the result is very accurate even for small
matrices whose size n is as small as less than five [5]. Therefore, the result is very
relevant to practical limits. For example, we often consider n = 100, N = 200 with
α = 0.5.
In Sect. 1.6.3, a random matrix
  k   of arbitrary N × n is studied in the form of
∗ k
E Tr A and E Tr(B B) where A is a Gaussian random matrix (or Wigner
matrix) and B is a Hermitian Gaussian random matrix.

6.2 The Limit Distribution Laws

The empirical spectral distribution (ESD) is defined as


n
1
μX = δλi (X) , (6.3)
n i=1
6.3 The Moment Method 353

of X, where λ1 (X)  · · ·  λn (X) are the (necessarily real) eigenvalues of X,


counting multiplicity. The ESD is a probability measure, which can be viewed as a
distribution of the normalized eigenvalues of X.
Let A be the Wigner matrix defined in Sect. 1.6.3.
Theorem 6.2.1 (Semicircle Law). Let A be the top left n × n minors of an infinite
Wigner matrix (ξij )i,j1 . Then the ESDs μA converge almost surely (and hence
also in probability and in expectation) to Wigner semicircle distribution
⎧  1/2
⎨ 2
1
2π 4 − |x| dx, if |x| < 2,
μsc =
⎩ 0, otherwise.

Almost sure convergence (or with probability one) implies convergence in


probability. Convergence in probability implies convergence in distribution (or
expectation). The reverse is false in general.
When a sample covariance matrix S = n1 X∗ X is considered, we will reach the
so-called Marcenko-Pasture law, defined as
 
1
2πxα (b − x) (x − a)dx, if a  x  b,
μM P =
0, otherwise,

where α is defined in (6.2).

6.3 The Moment Method

Section 1.6.3 illustrates the power of Moment Method for the arbitrary matrix sizes
in the Gaussian random matrix framework. Section 4.11 treats this topic in the
non-asymptotic, local context.
The moment method is computationally intensive, but straightforward.
Sometimes, we need to consider the asymptotic regime of (6.2), where moment
method may not be feasible. In real-time computation, the computationally efficient
techniques such as Stieltjes transform and free probability is attractive.
The basic starting point is the observation that the moments of the ESD μA can
be expressed as normalized traces of powers of A of n × n

1 k
xk dμA (x) = Tr(A) . (6.4)
R n

where A is the Wigner


√ matrix defined in Sect. 1.6.3. The matrix norm of A is
typically of size O( n), so it is natural to work with the normalized matrix √1n A.
But for notation simplicity, we drop the normalization term.
Since expectation is linear, on taking expectation, it follows that
354 6 Asymptotic, Global Theory of Random Matrices

1  k

xk dE [μA (x)] = E Tr(A) . (6.5)
R n

k
The concentration of measure for the trace function Tr(A) is treated in Sect. 4.3.
From this result, the E [μA (x)] are uniformly sub-Gaussian:
2
n2
E [μA {|x| > t}]  Ce−ct

for t > C, where C are absolute (so the decay improves quite rapidly with n). From
this and the Carleman continuity theorem, Theorem 1.6.1, the circular law can be
established, through computing the mean and variance of moments.
To prove the convergence in expectation to the semicircle law μsc , it suffices to
show that
-  k .
1 1
E Tr √ A = xk dμsc (x) + ok (1) (6.6)
n n R

for k = 1, 2, . . . , where ok (1) is an expression that goes to zero as n → ∞ for


fixed k.

6.4 Stieltjes Transform

Equation (6.4) is the starting point for the moment method. The Stieltjes transform
also proceeds from this fundamental identity

1 1 −1
dμA (x) = Tr(A − zI) (6.7)
R x−z n

for any complex z not in the support of μA . The expression in the left hand side is
called Stieltjes transform of A or of μA , and denote it as mA (z). The expression
−1
(A − zI) is the resolvent of A, and plays a significant role in the spectral theory
of that matrix. Sometimes, we can consider the normalized version M = √1n A
for an arbitrary
√ matrix A of n × n, since the matrix norm of A is typically of
size O( n). The Stieltjes transform, in analogy with Fourier transform for a linear
time-invariant system, takes full advantage of specific linear-algebraic structure of
this problem, and, in particular, of rich structure of resolvents. For example, one can
use the Neumann series
−1
(I − X) = I + X + X2 + · · · + X k + · · · .

One can further exploit the linearity of trace using:


−1
Tr(I − X) = TrI + TrX + TrX2 + · · · + TrXk + · · · .
6.4 Stieltjes Transform 355

The Stieltjes transform can be viewed as a generating function of the moments


via the above Neumann series (an infinite Taylor series of matrices)

1 1 1 1 1
mM (z) = − − 2 3/2 TrM − 3 2 TrM − · · · ,
z z n z n

valid for z sufficiently large. This is reminiscent of how the characteristic function
EejtX of a scalar random variable can be viewed as a generating function of the
moments EX k .
For fixed z = a + jb away from the real axis, the Stieltjes transform mMn (z) is
quite stable in n. When Mn is a Wigner matrix of size n × n, using a standard
concentration of measure result, such as McDiarmid’s inequality, we conclude
concentration of mMn (z) around its mean:
 √  2
P |mMn (a + jb) − EmMn (a + jb)| > t/ n  Ce−ct (6.8)

for all t > 0 and some absolute constants C, c > 0. For details of derivations, we
refer to [63, p. 146].
The concentration of measure says that mMn (z) is very close to its mean. It does
not, however, tell much about what this mean is. We must exploit the linearity of
trace (and expectation) such as

n
- −1 .
1 1
m √1 An (z) = E √ An − zIn
n n i=1 n
ii

where [B]ii is the diagonal ii-th component of a matrix B. Because An is


Wigner
 matrix, on permuting the rows and columns, all of the random variables
−1
√1 An − zIn have the same distribution. Thus we may simplify the above
n
ii
expression as
- −1 .
1
Em √1 An (z) =E √ An − zIn . (6.9)
n n
nn

We only need to deal with the last entry of an inverse of a matrix.


Schur complement is very convenient. Let Bn be an n × n matrix, let Bn−1 be
the top left n − 1 × n − 1 minor, let bnn be the bottom right entry of Bn such that

Bn−1 x
Bn =
y∗ bnn

where x ∈ C n−1
 n−1is ∗the right column of Bn with the bottom right entry bnn removed,

and y ∈ C is the bottom row with the bottom right entry bnn removed.
Assume that Bn and Bn−1 are both invertible, we have that
356 6 Asymptotic, Global Theory of Random Matrices

  1
B−1 = . (6.10)
bnn − y∗ B−1
n nn
n−1 x

This expression can be obtained as follows: Solve the equation An v = en , where


en is the n-th basis vector, using the method of Schur complements (or from first
principles).
In our situations, the matrices √1n An − zIn and √1n An−1 − zIn−1 are
automatically invertible. Inserting (6.10) into (6.9) (and recalling that we normalized
the diagonal of An to vanish), it follows that

1
Em √1 An (z) = −E  −1 , (6.11)
n
1 ∗
z+ nx
√1 An−1
n
− zIn−1 x− √1 ξnn
n

where x ∈ Cn−1 is the right column of An with the bottom right entry ξnn removed
(the (ij)-th entry of An is a random variable ξij ). The beauty of (6.11) is to tie
together the random matrix An of size n × n and the random matrix An−1 of size
(n − 1) × (n − 1).
 −1
Next, we need to understand the quadratic form x∗ √1n An−1 − zIn−1 x.
We rewrite this as x∗ Rx, where R is the resolvent matrix. This distribution of
the random matrix R is understandably complicated. The core idea, however, is to
exploit the observation that the (n − 1)-dimensional vector x involves only these
entries of An that do not lie in An−1 , so the random matrix R and the random
vector x are independent. As a consequence of this key observation, we can use the
randomness of x to do most of the work in understanding the quadratic form x∗ Rx,
without having to know much about R at all!
Concentration of quadratic forms like x∗ Rx has been studied in Chap. 5. It turns
out that
 √  2
P |x∗ Rx − E (x∗ Rx)|  t n  Ce−ct

for any determistic matrix R of operator norm O(1). The expectation E (x∗ Rx) is
expressed as
n−1 n−1

E (x Rx) = Eξ¯in rij ξjn
i=1 j=1

where ξin are entries of x, and rij are entries of R. Since the ξin are iid with mean
zero and variance one, the standard second moment computation shows that this
expectation is less than the trace
n−1
Tr (R) = rii
i=1

of R. We have shown the concentration of the measure


6.4 Stieltjes Transform 357

 √  2
P |x∗ Rx − Tr (R)|  t n  Ce−ct (6.12)

for any deterministic matrix R of


√ operator norm O(1), and any t > 0. Informally,
x∗ Rx is typically Tr (R) + O ( n) .
The bound (6.12) was shown for a deterministic matrix R, but by using
conditional expectation this works for any random matrix as long as the matrix R is
independent of random vector x. For our specific matrix
 −1
1
R= √ An−1 − zIn−1 ,
n

we can apply conditional expectation. The trace of this matrix is nothing but the
Stieltjes transform m √ 1 An−1 (z) of matrix An−1 . Since the normalization factor
n−1
is slightly off, we have
√ √
n n
Tr (R) = n √ m √ 1 An−1 ( √ z).
n−1 n−1 n−1

By some subtle arguments available at [63, p. 149], we have


 
Tr (R) = n m √ 1 An−1 (z) + o(1) .
n−1

In particular, from (6.8) and (6.12), we see that


 
x∗ Rx = n Em √ 1 An−1 (z) + o(1)
n−1

with overwhelming probability. Putting this back to (6.11), we have the remarkable
self-consistent equation

1
m √1 An (z) =− + o(1).
n z + m √1 An (z)
n

Following the arguments of [63, p. 150], we see that m √1 An (z) converges to a limit
n
mA (z), as n → ∞. Thus we have shown that

1
mA (z) = − ,
z + mA (z)

which has two solutions



z± z2 − 4
mA (z) = − .
2
358 6 Asymptotic, Global Theory of Random Matrices

It is argued that the positive branch is the correct solution



z+ z2 − 4
mA (z) = − ,
2
through which we reach the semicircular law

1  1/2
4 − x2 +
dx = μsc .

For details, we refer to [63, p. 151].

6.5 Free Probability

We take the liberty of freely drawing material from [63] for this brief exposition.
We highlight concepts, basic properties, and practical material.

6.5.1 Concept

In the foundation of modern probability, as laid out by Kolmogorov, the basic objects
of study are:
1. Sample space Ω, whose elements ω represent all the possible states.
2. One can select a σ−algebra of events, and assign probabilities to these events.
3. One builds (commutative) algebra of random variables X and one can assign
expectation.
In measure theory, the underlying measure space Ω plays a prominent foun-
dational role. In probability theory, in contrast, events and their probabilities are
viewed as fundamental, with the sample space Ω being abstracted away as much as
possible, and with the random variables and expectations being viewed as derived
concepts.
If we take the above abstraction process one step further, we can view the algebra
of random variables and their expectations as being the foundational concept, and
ignoring both the presence of the original sample space, the algebra of events, or the
probability of measure.
There are two reasons for considering the above foundational structures. First,
it allows one to more easily take certain types of limits, such as: the large n limit
n → ∞ when considering n × n random matrices, because quantities build from
the algebra of random variables and their expectations, the normalized moments of
random matrices, which tend to be quite stable in the large n limit, even the sample
space and event space varies with n.
6.5 Free Probability 359

Second, the abstract formalism allows one to generalize the classical commuta-
tive theory of probability to the more general theory of non-commutative probability
which does not have the classical counterparts such as sample space or event space.
Instead, this general theory is built upon a non-commutative algebra of random
variables (or “observables”) and their expectations (or “traces”). The more general
formalism includes as special cases, classical probability and spectral theory (with
matrices or operators taking the role of random variables and the trace taking
the role of expectation). Random matrix theory is considered a natural blend of
classical probability and spectrum theory whereas quantum mechanics with physical
observables, takes the role of random variables and their expected values on a given
state which is the expectation. In short, the idea is to make algebra the foundation
of the theory, as opposed to other choices of foundations such as sets, measure,
categories, etc. It is part of more general “non-commutative way of thinking.1 ”

6.5.2 Practical Significance

The significance of free probability to random matrix theory lies in the fundamental
observation that random matrices that have independent entries in the classical
sense, also tend to be independent2 in the free probability sense, in the large n limit
n → ∞. Because of this observation, many tedious computations in random matrix
theory, particularly those of an algebraic or enumerative combinatorial nature, can
be performed more quickly and systematically by using the framework of free
probability, which by design is optimized for algebraic tasks rather than analytical
ones.
Free probability is an excellent tool for computing various expressions of interest
in random matrix theory, such as asymptotic values of normalized moments in the
large n limit n → ∞. Questions like the rate of convergence cannot be answered
by free probability that covers only the asymptotic regime in which n is sent to
infinity. Tools such as concentration of measure (Chap. 5) can be combined with
free probability to recover such rate information.
As an example, let us reconsider (6.1):

1 ∗ 1 ∗ 1 1 1 1
Y Y = (X + N) (X + N) = X∗ X + N∗ N + X∗ N + N∗ X.
N N N N N N
(6.13)
The rate of convergence how N1 X∗ X converges to its true covariance matrix Rx
can be understood using concentration of measure (Chap. 5), by taking advantage of

1 Thefoundational preference is a meta-mathematical one rather than a mathematical one.


2 Thisis only possible because of the highly non-commutative nature of these matrices; this is not
possible for non-trivial commuting independent random variables to be freely independent.
360 6 Asymptotic, Global Theory of Random Matrices

k
low rank structure of the Rx . But the form of Tr(A + B) can be easily handled by
free probability.
   zero, then E [ABAB]
If A, B are freely independent, and of expectation
vanishes, but E [AABB] instead factors as E A2 E B2 . Since
 
1 ∗ k
ETr (A + B) (A + B)
n
can be calculated by free probability [387], (6.13) can be handled as a special case.
Qiu et al. [5] gives a comprehensive survey of free probability in wireless
communications and signal processing. Couillet and Debbah [388] gives a deeper
treatment of free probability in wireless communication. Tulino and Verdu [389] is
the first book-form treatment of this topic in wireless communication.

6.5.3 Definitions and Basic Properties

In the classical (commutative) probability, two (bounded, real-valued) random


variables X, Y are independent if one has

E [f (X) g (Y )] = 0

whenever f, g : R → R are well-behaved (such as polynomials) such that all of


E [f (X)] and E [g (Y )] vanish. For two (bounded, Hermitian) non-commutative
random variables X, Y, the classical notion no longer applies. We consider, as a
substitute, the notion of being freely independent (or free for short), which means
that

E [f1 (X) g1 (Y ) · · · fk (X) gk (Y )] = 0 (6.14)

where f1 , g1 , . . . , fk , gk : R → R are well-behaved functions such that E [f1 (X)],


E [g1 (X)] , . . . , E [fk (X)] , E [gk (X)] vanish.
Example 6.5.1 (Random matrix variables). Random matrix theory combines
classical probability theory with finite-dimensional spectral theory, with the random
variables of interest now being the random matrices X, all of whose entries have all
moments finite. The normalized trace τ is given by
1
τ (X)  E TrX.
n
Thus one takes both the normalized matrix trace and the probabilistic expectation,
in order to obtain a deterministic scalar. As seen before, the moment method for
random matrices is based on the moments
 1
τ Xk = E TrXk .
n
6.5 Free Probability 361


When X is a Gaussian random matrix, the exact expression for τ Xk = E n1 TrXk
 Y is a Hermitian Gaussian random
is available in a closed form (Sect.1.6.3). When
matrix, the exact expression for τ (Y Y) = E n1 Tr(Y∗ Y) is also available in
∗ k k

a closed form (Sect. 1.6.3). 


The expectation operator is a map that is linear. In fact, it is ∗-linear, which means
that it is linear and also that E (X∗ ) = EX, where the bar represents the complex
conjugate. The analogue of the expectation operation for a deterministic matrix is
the normalized trace τ (X) = n1 TrX.
Definition 6.5.2 (Non-commutative probability space, preliminary definition).
A non-commutative probability space (A, τ ) will consist of a (potentially non-
commutative) ∗-algebra A of (potentially non-commutative) random variables (or
observables) with identity 1, together with a trace τ : A → C, which is a ∗-
linear functional that maps 1 to 1. This trace will be required to obey a number
of additional axioms.
Axiom 6.5.3 (Non-negativity). For any X ∈ A, we have τ (X∗ X)  0. (X∗ X is
Hermitian, and so its trace τ (X∗ X) is necessarily a real number.)
In the language of Von Neumann algebras, this axiom (together with the normal-
ization τ (1) = 1) is asserting that τ is a state. This axiom is the non-commutative
analogue of the Kolmogorov axiom that all events have non-negative probability.
With this axiom, we can define a positive semi-definite inner product , L2 (τ ) on
A by

X, Y L2 (τ )  τ (XY) .

This obeys the usual axioms of an inner product, except that it is only positive semi-
definite rather than positive definite. One can impose positive definiteness by adding
an axiom that the trace is faithful, which means that τ (X∗ X) = 0 if and only if
X = 0. However, this is not needed here.
Without the faithfulness, the norm can be defined using the positive semi-definite
inner product
 1/2
= (τ (X∗ X))
1/2
X L2 (τ )  X, X L2 (τ ) .

In particular, we have the Cauchy-Schwartz inequality


 
 
 X, Y L2 (τ )  X L2 (τ ) Y L2 (τ ) .

This leads to an important monotonicity:


  2k−1 1/(2k−1)   2k 1/2k   2k+2 1/(2k+2)
τ X   τ X   τ X  ,

for any k ≥ 0.
362 6 Asymptotic, Global Theory of Random Matrices

As a consequence, we can define the spectral radius ρ (X) of a Hermitian


matrix as
  1/2k
ρ (X) = lim τ X2k  ,
k→∞

in which case we arrive at the inequality


  k 
τ X   (ρ (X))k

for k = 0, 1, 2, . . . . We then say a Hermitian matrix is bounded if its spectral radius


is finite. We can further obtain [9]

XY L2 (τ )  ρ (X) Y L2 (τ ) .

Proposition 6.5.4 (Boundedness [9]). Let X be a bounded Hermitian matrix, and


let P : C → C be a polynomial. Then

|τ (P (X))|  sup |P (x)| .


x∈[−ρ(X),ρ(X)]

The spectral theorem is completely a single bounded Hermitian matrix in a


non-commutative probability space. This can be extended to multiple commuting
Hermitian elements. But, this is not true for multiple non-commuting elements.
We assume as a final (optional) axiom a weak form of commutativity in the trace.
Axiom 6.5.5 (Trace). For any two elements X, Y, we have τ (XY) = τ (YX) .
From this axiom, we can cyclically permute products in a trace

τ (XYZ) = τ (ZYX) .

Definition 6.5.6 (Non-commutative probability space, final definition [9]). A


non-commutative probability space (A, τ ) consists of a ∗-algebra A with identity
1, together with a ∗-linear functional τ : A → C, that maps 1 to 1 and obeys the
non-negativity axiom. If τ obeys the trace axiom, we say that the non-commutative
probability space is tracial. If τ obeys the faithfulness axiom, we say that the non-
commutative probability space is faithful.

6.5.4 Free Independence

We now come to the fundamental concept in free probability, namely the free
independence.
6.5 Free Probability 363

Definition 6.5.7 (Free independence). A collection of X1 , . . . , Xk of random


variables in a non-commutative probability space (A, τ ) is freely independent (or
free for short) if one has

τ {[P1 (Xn,i1 ) − τ (P1 (Xn,i1 ))] · · · [Pm (Xn,im ) − τ (Pm (Xn,im ))]} = 0

whenever P1 , . . . , Pm are polynomials and i1 , . . . , im ∈ 1, . . . , k are indices with


no two adjacent ij equal.
A sequence Xn,1 , . . . , Xn,k of random variables in a non-commutative
probability space (A, τ ) is asymptotically freely independent (or asymptotically
free for short) if one has

τ {[P1 (Xn,i1 ) − τ (P1 (Xn,i1 ))] · · · [Pm (Xn,im ) − τ (Pm (Xn,im ))]} → 0

as n → ∞ whenever P1 , . . . , Pm are polynomials and i1 , . . . , im ∈ 1, . . . , k are


indices with no two adjacent ij equal.
For classically independent commutingrandom (matrix
k
 kvalued) variables X, Y,
knowledge of the individual moments
 τ X  and τ Y give complete infor-
mation on the joint moments τ Xk Yl = τ Xk τ Yl . The same fact is true
for freely independent random variables, though the situation is more complicated.
In particular, we have that

τ (XY) = τ (X) τ (Y) ,



τ (XYX) = τ (Y) τ X2 ,
2   2 2 2
τ (XYXY) = τ (X) τ Y2 + τ X2 τ (Y) − τ (X) τ (Y) . (6.15)

For detailed derivations of the above formula, we refer to [9].


There is a fundamental connection between free probability and random matrices
first observed by Voiculescu [66]: classically independent families of random
matrices are asymptotically free!
Example 6.5.8 (Asymptotically free). As an illustration, let us reconsider (6.1):

1 ∗ 1 ∗ 1 1 1 1
Y Y = (X + N) (X + N) = X∗ X + N∗ N + X∗ N + N∗ X,
N N N N N N
(6.16)
where random matrices X, N are classically independent. Thus, X, N are also
asymptotically free. Taking the trace of (6.16), we have that

1 1 1 1 1
τ (Y∗ Y) = τ (X∗ X) + τ (N∗ N) + τ (X∗ N) + τ (N∗ X) .
N N N N N
Using (6.15), we have that

τ (X∗ N) = τ (X∗ ) τ (N) and τ (N∗ X) = τ (N∗ ) τ (X) ,


364 6 Asymptotic, Global Theory of Random Matrices

which vanish since τ (N) = τ (N∗ ) = 0 for random matrices whose entries are
zero-mean. Finally, we obtain that

τ (Y∗ Y) = τ (X∗ X) + τ (N∗ N) . (6.17)


The intuition here is that while a large random matrix X will certainly correlate
with itself so that, Tr (X∗ X) will be large, if we interpose an independent
random matrix N of trace zero, the correlation is largely destroyed; for instance,
Tr (X∗ XN) will be quite small.
We give a typical instance of this phenomenon here:
Proposition 6.5.9 (Asymptotic freeness of Wigner matrices [9]). LetAn,1 , . . . ,An,k
be a collection of independent n × n Wigner matrices, where the coefficients all
have uniformly bounded m-th moments for each m. Then the random variables
An,1 , . . . , An,k are asymptotically free.
A Wigner matrix is called Hermitian random Gaussian matrices. In Sect. 1.6.3.
We consider

An,i = U∗i Di Ui

where Di are deterministic Hermitian matrices with uniformly bounded eigenval-


ues, and the Ui are iid unitary matrices drawn from Haar measure on the unitary
group U (n). One can also show that the An,i are asymptotically free.

6.5.5 Free Convolution

When two classically independent random variables X, Y are added up, the
distribution μX+Y of the sum X + Y is the convolution μX+Y = μX ⊗ μY of
the distributions μX and μY . This convolution can be computed by means of the
characteristic function

FX = τ ejtX = ejtx dμX (x)
R

using the simple formula


   
τ ejt(X+Y ) = τ ejtX τ ejtY .

There is an analogous theory when summing two freely independent (Hermitian)


non-commutative random variables A, B; the distribution μA+B turns out to be
a certain combination μA  μB , known as free convolution of μA and μB . The
Stieltjes transform, rather than the characteristic function, is the correct tool to use
6.6 Tables for Stieltjes, R- and S-Transforms 365

Table 6.1 Common random matrices and their moments (The entries of W are i.i.d. with zero
1 1
mean and variance N ; W is square N ×N , unless otherwise specified. tr (H)  lim N Tr (H))
N →∞

Convergence laws Definitions Moments


Full-circle law W square N × N
 
W+W H   1 2m
Semi-circle law K= √ tr K2m =
2 m+1 m
 
√ 22m  1 m−1
Quarter circle law Q= WWH tr (Qm ) = πm m m−1 ∀ m odd
+1
2 2
Q2

R = WH W,
Deformed quarter
W ∈ CN ×βN
circle law   
  1 
m m m
R2 tr R2m = βi
m
i=1 i i−1

 − 1
Haar distribution T = W WH W 2

Inverse semi-circle law Y = T + TH

  1
−1
mA (z) = τ (A − z) = dμX (x),
R x−z

which has been discussed earlier.

6.6 Tables for Stieltjes, R- and S-Transforms

Table 6.1 gives common random matrices and their moments. Table 6.2 gives
definition of commonly encountered random matrices for convergence laws,
Table 6.3 gives the comparison of Stieltjes, R- and S-Transforms.
Let the random matrix W be square N × N with i.i.d. entries with zero mean
and variance N1 . Let Ω be the set containing eigenvalues of W. The empirical
distribution of the eigenvalues
1
Δ
PH (z) = |{λ ∈ Ω : Re λ < Re z and Im λ < Im z}|
N
converges a non-random distribution functions as N → ∞. Table 6.2 lists
commonly used random matrices and their density functions.
Table 6.1 compiles some moments for commonly encountered matrices
from [390]. Calculating eigenvalues λk of a matrix X is not a linear operation.
Calculation of the moments of the eigenvalue distribution is, however, conveniently
done using a normalized trace since
366 6 Asymptotic, Global Theory of Random Matrices

Table 6.2 Definition of commonly encountered random matrices for convergence laws (The
1
entries of W are i.i.d. with zero mean and variance N ; W is square N × N , unless otherwise
specified)
Convergence laws Definitions Density functions
⎧1
⎨ π |z| < 1
Full-circle law W square N × N pW (z) = 0 elsewhere

⎧ 1 √

⎪ 4 − x2 |x| < 2
⎨ 2π
W+W H 0 elsewhere
Semi-circle law K= √ pK (z) =
2 ⎪



⎪ √ 0≤x≤2
⎪ 1
⎨ π 4 − x2
√ 2
Quarter circle law Q= WWH pQ (z) =

⎪ 0 elsewhere

⎧ &


1 4−x
0≤x≤4

⎨ 2π x

Q2 pQ2 (z) = 0 elsewhere






 2
√ 4β−(x2 −1−β )
R = WH W, a≤x≤b
Deformed quarter
pR (z) =  √πx+
W ∈ CN ×βN 1 − β δ(x) elsewhere
circle law √ √
a= 1− β ,√
b=1+ β
4β−(x−1−β)2
R2 a2 ≤ x ≤ b 2
pR2 (z) =  √2πx+
1− β δ(x) elsewhere

 − 1 1
Haar distribution T = W WH W 2 pT (z) = 2π
δ (|z| − 1)
 1 1
√ |x| < 2
Inverse semi-circle Y =T+ TH pY (z) = π 4−x2
law 0 elsewhere

N
1 1
λm
k = Tr (Xm ) .
N N
k=1

Thus, in the large matrix limit, we define tr(X) as

1
tr (X)  lim Tr (X) .
N →∞ N

Table 6.2 is made self-contained and only some remarks are made here. For Haar
distribution, all eigenvalues lie on the complex unit circle since the matrix T is
unitary. The essential nature is that the eigenvalues are uniformly distributed. Haar
distribution demands for Gaussian distributed entries in the random matrix W. This
6.6 Tables for Stieltjes, R- and S-Transforms 367

condition does not seem to be necessary, but allowing for any complex distribution
with zero mean and finite variance is not sufficient.
Table 6.33 lists some transforms (Stieltjes, R-, S-transforms) and their prop-
erties. The Stieltjes Transform is more fundamental since both R-transform and
S-transform can be expressed in terms of the Stieltjes transform.

3 This table is primarily compiled from [390].


Table 6.3 Table of Stieltjes, R- and S-transforms
368

Stieltjes transform R-transform S-transform


Δ 1+z −1
/ S(z) = z
Υ (z),
Δ 1 Δ  
G (z) = dP (x) , Imz>0, ImG(z) ≥ 0 R (z) = G−1 (−z) −z −1 Δ −1
x−z Υ(z) = −z G−1 z −1 −1
1 1
GαI (z)= α−z RαI (z) =α SαI (z) = α ,
&
GK (z)= z2 1− z42 − z2 RK (z) =z SK (z) =undefined
& 
2 1 1 1
z
GQ (z)= 1− z42 2
− arcsin z
− z2 − 2π RQ2 (z)= 1−z SQ2 (z)= 1+z
&
β 1
GQ2 (z)= 12 1− z4 − 12 RR2 (z)= 1−z SR2 (z)= β+z
& √
(1−β)2 (1−β) −1+ 1+4z 2
GR2 (z)= 4z 2
− 1+β
2z
+ 14 − 12 − 2z RY (z)= z
SY (z)=undefined
−sign(Re z)
GY (z)= √ RαX (z) =αRX (αz) SAB (z)=SA (z)SB (z)
z 2 −4
√ √ /
Gλ ( z )−Gλ (− z )
Gλ2 (z) = √ lim R (z) = xdP (x)
2 z z→∞

GXXH (z) =βGXH X (z) + β−1


z
, X ∈ CN ×βN RA+B (z) =RA (z) RB (z)
 
  GA+B RA+B (−z) −z −1 =z
/ ydPY (x)
GX+WYWH (z) =GX z−β 1+yGX+WYWH (z)

Im z>0, X, Y, W jointly independent.


/1
GWWH (z) = u(x, z)dx,
⎡ 0 ⎤−1
⎢ /β w(x,y)dy ⎥
u(x, z)=⎣−z+ 1 ⎦ , x ∈ [0, 1]
0 1+ u(x ,z)w(x ,y)dx
0
6 Asymptotic, Global Theory of Random Matrices
Part II
Applications
Chapter 7
Compressed Sensing and Sparse Recovery

The central mathematical tool for algorithm analysis and development is the
concentration of measure for random matrices. This chapter is motivated to provide
applications examples for the theory developed in Part I. We emphasize the central
role of random matrices.
Compressed sensing is a recent revolution. It is built upon the observation that
sparsity plays a central role in the structure of a vector. The unexpected message
here is that for a sparse signal, the relevant “information” is much less that what we
thought previously. As a result, to recover the sparse signal, the required samples
are much less than what is required by the traditional Shannon’s sampling theorem.

7.1 Compressed Sensing

The compressed sensing problem deals with how to recover sparse vectors from
highly incomplete information using efficient algorithms. To formulate the proce-
dure, a complex vector x ∈ CN is called s-sparse if

x 0 := |{ : x = 0}| = # { : x = 0}  s.

where x 0 denotes the 0 -norm of the vector x. The 0 -norm represents the total
number of how many non-zero components there are in the vector x. The p -norm
for a real number p is defined as
N
1/p
p
x p = |xi | , 1 ≤ p < ∞.
i=1

Given a rectangular complex matrix Φ ∈ Cn×N , called the measurement matrix,


the task is to reconstruct the complex vector x ∈ CN from the linear measurements

y = Φx.

R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 371
DOI 10.1007/978-1-4614-4544-9 7,
© Springer Science+Business Media New York 2014
372 7 Compressed Sensing and Sparse Recovery

We are interested in the case n ! N, so that this system is under-determined. It is


well known that without additional structure constraints on the complex vector x,
this linear system problem has no solution. The central message of the compressed
sensing is the surprising discovery that under the additional structure constraint that
the complex vector x ∈ CN is s-sparse, then the situation changes.
The naive approach for reconstruction, namely, 0 -minimization,

minimize z 0 subject to Φz = y, (7.1)

which is NP-hard in general. There are several well-known tractable alternatives—


for instance, 1 -minimization

minimize z 1 subject to Φz = y, (7.2)

where z 1 = |z1 |+|z2 |+· · ·+|zN | for z = (z1 , z2 , . . . , zN ) ∈ CN . Equation (7.2)


is a convex optimization problem and may be solved efficiently by tools such as
CVX.
The restricted isometry property (RIP) streamlines the analysis of recovery
algorithms. The restricted isometry property (RIP) also offers a very elegant way
to analyze 1 -minimization and greedy algorithms.
To guarantee recoverability of the sparse vector x in (7.1) by means of 1 -
minimization and greedy algorithms, it suffices to establish the restricted isometry
property (RIP) of the so-called measurement matrix Φ: For a matrix Φ ∈ Cn×N
and sparsity s < N, the restricted isometry constant δs is defined as the smallest
positive number such that
2 2 2
(1 − δs ) x 2  Φx 2  (1 + δs ) x 2 for all x ∈ CN with x 0 ≤ s.
(7.3)
In words, the statement (7.3) requires that all column submatrices of Φ with at most
s columns are well-conditioned. Informally, Φ is said to satisfy the RIP with order
s when the level δs is “small”.
A number of recovery algorithms (Table 7.1) are provably effective for sparse
recovery if the matrix Φ satisfies the RIP. More precisely, suppose that the matrix
Φ obeys (7.3) with

δκs < δ  (7.4)

for suitable constants κ ≥ 1 and δ  , then many algorithms precisely recover any
s-sparse vectors x from the measurements y = Φx. Moreover, if x can be well
approximated by an s sparse vector, then for noisy observations

y = Φx + e, e 2  α, (7.5)

these algorithms return a reconstruction x̃ that satisfies an error bound of the form

1
x − x̃ 2  C1 √ σs (x)1 + C2 α, (7.6)
s
7.1 Compressed Sensing 373

Table 7.1 Values of the constants [391] κ and δ in (7.4) that guaran-
tee success for various recovery algorithms
Algorithm κ δ References
3√
1 -minimization (7.2) 2 ≈ 0.4652 [392–395]
&
4+ 6
2
CoSaMP 4 √ ≈ 0.3843 [396, 397]
5+ 73
Iterative hard thresholding 3 1/2 [395, 398]

Hard thresholding pursuit 3 1/ 3 ≈ 0.5774 [399]

where

σs (x)1 = inf x−z 1


z0 s

denotes the error of best s-term approximation in 1 and C1 , C2 > 0 are constants.
For illustration, we include Table 7.1 (from [391]) which lists available values
for the constants κ and δs in (7.4) that guarantee (7.6) for several algorithms along
with respective references.
Remarkably, all optimal measurement matrices known so far are random matri-
ces. For example, a Bernoulli random matrix Φ ∈ Rn×N has entries

φjk = εjk / n, 1 ≤ j ≤ n, 1 ≤ k ≤ N,

where εik are independent, symmetric {−1, 1}-valued random variables. Its
restricted isometry constant satisfies

δs  δ

with probability at least 1 − η provided that

n  Cδ −2 (s ln (eN/s) + ln (1/η)) ,

where C is an absolute constant.


On the other hand, Gaussian random matrices, that is, matrices that have
independent, normally distributed entries with mean zero and variance one, have
been shown [400, 406, 407] to have restricted isometry constants of √1n Φ satisfy
δs ≤ δ with high probability provided that

n  Cδ −2 s log (N/s) .

That is, the number n of Gaussian measurements required to reconstruct an s-sparse


signal of length N is linear in the sparsity and logarithmic in the ambient dimension.
It follows [391] from lower estimates of Gelfand widths that this bound on the
required samples is optimal [408–410], that is, the log-factor must be present.
More structured measurement matrices are considered for practical considera-
tions in Sect. 7.3.
374 7 Compressed Sensing and Sparse Recovery

Table 7.2 List of measurement matrices [391] that have been proven to be RIP, scaling of sparsity
s in the number of measurements n, and the respective Shannon entropy of the (random) matrix
n × N measurement matrix Shannon entropy RIP regime References
Gaussian nN 12 log (2πe) s  Cn/ log N [400–402]
Rademacher entries nN s  Cn/ log N [400]
N log2 N − nlog2 n
Partial Fourier matrix s  Cn/ log4 N [402, 403]
−(N − n)log2 (N − n)
Partial circulant Rademacher N s  Cn2/3 /log2/3 N [403]
Gabor, Rademacher window n s  Cn2/3 /log2 n [404]

Gabor, alltop window 0 sC n [405]

In Table 7.2 we list the Shannon entropy (in bits) of various random matrices
along with the available RIP estimates.

7.2 Johnson–Lindenstrauss Lemma and Restricted Isometry


Property

p -norm of a vector x = (x1 , . . . , xN ) ∈ R is defined by


The N T N

⎧ 1/p
⎨  |x |p
⎪ N
i , 0 < p < ∞,
x = x = i=1 (7.7)
2 N
2 ⎪
⎩ max |xi | , p = ∞.
i=1,...,N

We are given a set A of points in RN with N typically large. We would like to embed
these points into a lower-dimensional Euclidean space Rn which approximately
preserving the relative distance between any two of these points.
Lemma 7.2.1 (Johnson–Lindenstrauss [411]). Let ε ∈ (0, 1) be given. For every
set A of k points in RN , if n is a positive integer such that n  n0 = O ln k/ε2 ,
there exists a Lipschitz mapping f : RN → Rn such that
2 2
(1 − ε) x − y N  f (x) − f (y) n  (1 + ε) x − y N
2 2 2

for x, y ∈ A.
The Johnson–Lindenstrauss (JL) lemma states [412] that any set of k points in
high dimensional Euclidean space can be embedded into O(log(k)/ε2 ) dimensions,
without distorting the distance between any two points by more than a factor
between 1 − ε and 1 + ε. As a consequence, the JL lemma has become a valuable
tool for dimensionality reduction.
In its original form, the Johnson–Lindenstrauss lemma reads as follows.
7.2 Johnson–Lindenstrauss Lemma and Restricted Isometry Property 375

Lemma 7.2.2 (Johnson–Lindenstrauss [411]). Let  ε ∈ (0, 1) be given. Let


x1 , . . . , xk ∈ RN be arbitrary points. Let m = O log k/ε2 be a natural number.
Then there exists a Lipschitz map f : RN → Rm such that
2 2 2
(1 − ε) xi − xj 2  f (xi ) − f (xj ) 2  (1 + ε) xi − xj 2 (7.8)

for all i, j ∈ {1, 2, . . . , k}. Here || · ||2 stands for the Euclidean norm in RN or Rm ,
respectively.
It is known that this setting of m is nearly tight; Alon [413] showed that one must
have

m = Ω ε−2 log(k) log (1/ε) .

In most of these frameworks, the map f under consideration is a linear map


represented by an m × N matrix Φ. In this case, one can consider the set of
differences E = {xi − xj } ; To prove the theorem, one then needs to show that

2 2 2
(1 − ε) y 2  Φy 2  (1 + ε) y 2 , for all y ∈ E. (7.9)

When Φ is a random matrix, the proof that Φ satisfies the JL lemma with high
probability boils down to showing a concentration inequality of the typev
  
2 2 2
P (1 − ε) x 2  Φx 2  (1 + ε) x 2  1 − 2 exp −c0 ε2 m , (7.10)

for an arbitrary fixed x ∈ RN , where c0 is an absolute constant in the optimal case,


and in addition possibly dependent on N in almost-optimal sense as e.g. in [414].
In order to reduce storage space and implementation time of such embeddings, the
design of structured random JL embeddings has been an active area of research
in recent years [412]. Of particular importance in this context is whether fast (i.e.
O(N log(N ))) multiplication algorithms are available for the resulting matrices.
The map f is a linear mapping represented by an n × N matrix Φ whose entries
are randomly drawn from certain probability distributions. A concise description of
this evolution is provided in [415]. As observed by Achlioptas in [416], the mapping
f : RN → Rm may be realized by a random matrix, where each component
is selected independently at random with a fixed distribution. This decreases the
time for evaluation of the function f (x) essentially. An important breakthrough was
achieved by Ailon and Chazelle in [417, 418].
Here let us show how to prove the JL lemma using such random matrices,
following [400]. One first shows that for any x ∈ RN , the random variable
 
2 2
E Φx n = x N . (7.11)
2 2
376 7 Compressed Sensing and Sparse Recovery

2
Next, one must show that for any x ∈ RN , the random variable Φx n is sharply
2
concentrated a round its expected value (concentration of measure), thanks to the
moment conditions; that is,
  
 2 
P  Φx n − x N   t x N  2e−ng(t) , 0 < t < 1,
2 2
(7.12)
2 2 2

where the probability is taken over all n × N matrices Φ and g(t) is a constant
depending only on t such that for all t ∈ (0, 1), g(t) > 0.
Finally, one uses the union bound to the set of differences between all possible
pairs of points in A.
 √
+1/ n with probability 1/2,
φij ∼ √ (7.13)
−1/ n with probability 1/2,

or related distributions such as


⎧ 
⎨ + 3/n with probability 1/6,
φij = 0  with probability 2/3, (7.14)

+ 3/n with probability 1/6.

Perhaps, the most prominent example is the n × N random matrices Φ whose


entries φi,j , are independent realizations of Gaussian random variables
 
1
φij ∼ N 0, . (7.15)
n

The verification of (7.12) with g(t) = t2 /4 − t3 /6 is elementary using Chernoff


inequalities and a comparison of the moments of a Bernouli random variable to
those of a Gaussian random variable.
In [415], it is shown that we can also use matrices whose entries are independent
realizations of ±1 Bernoulli random variables.
Let (Ω, F, μ) be the probability space where μ a probability measure and let Z
be a random variable on Ω. Given n and N, we can generate random matrices by
choosing the entries φij as independent realizations of Z. This gives the random
matrices Φ (ω), ω ∈ ΩnN . Given any set of indices T with the number of elements
of the set (cardinality) |T | ≤ k, denote by XT the set of all vectors in RN that are
zero outside of T .
Theorem 7.2.3 (Baraniuk, Davenport, DeVore, Wakin [400]). Let Φ (ω), ω ∈
ΩnN , be a random matrix of size n × N drawn according to any distribution that
satisfies the concentration inequality (7.12). Then, for any set T with the cardinality
|T | = k < n and any 0 < δ < 1, we have

(1 − δ) x N  Φx n  (1 + δ) x N , for all x ∈ XT (7.16)


2 2 2
7.2 Johnson–Lindenstrauss Lemma and Restricted Isometry Property 377

with probability

 1 − 2(12/δ) e−g(δ/2)n .
k
(7.17)

Proof. 1. Nets of points. It is sufficient to prove (7.16) for the case of x N = 1,


2
due to the linearity of Φ. We choose a finite set of points AT such that AT ⊆ XT ,
a N = 1 for all a ∈ AT , and for all x ∈ XT with x N = 1 we have
2 2

min x − a N  δ/4. (7.18)


a∈AT 2

It is well known from covering numbers (see, e.g., [419]) that we can choose
k
such a set with the cardinality |AT |  (12/δ) .
2. Concentration of measure through the union bound. We use the union bound to
apply (7.12) to this set of points t = δ/2. It thus follows that, with probability at
least the right-hand side of (7.17), we have
2 2
(1 − δ/2) a N  Φa n  (1 + δ/2) a N , for all a ∈ AT (7.19)
2 2 2

3. Extension to all possible k-dimensional signals. We define α as the smallest


number such that
2
Φx n  (1 + α) x N , for all x ∈ XT . (7.20)
2 2

The goal is to show that α ≤ δ. To do this, we recall from (7.18) that for any
x ∈ XT we can pick up a point (vector) a ∈ AT such that x − a N  δ/4.
2
In this case we have

Φx n  Φa n + Φ (x − a) n  (1 + δ/2) + (1 + α) δ/4. (7.21)


2 2 2

Since by definition α is the smallest number for which (7.20) holds, we obtain
α  δ/2 + (1 + α) δ/4. Solving for α gives α  3δ/4/ (1 − δ/4)  δ, as
desired. So we have shown the upper inequality in (7.16). The lower inequality
follows from this since

Φx n  Φa n − Φ (x − a) n  (1 − δ/2) − (1 + δ) δ/4  1 − δ,
2 2 2

which complements the claim. 


Now we can apply Theorem 7.2.3 to obtain the so-called restricted isometry
property (RIP) in compressive sensing (CS). Given a matrix Φ and any set T of
column indices, we denote by ΦT the n × |T | matrix composed of these columns.
Similarly, for x ∈ RN , we denote by xT the vector obtained by retaining only the
entries in x corresponding to the column indices T . We say that a matrix Φ satisfies
the restricted isometry property of order k is there exists a δk ∈ (0, 1) such that
378 7 Compressed Sensing and Sparse Recovery

2 2 2
(1 − δk ) xT N  ΦT xT N  (1 + δk ) xT N (7.22)
2 2 2

holds for all sets T with |T | ≤ k. The condition (7.22) is equivalent to requiring that
the Grammian matrix ΦTT ΦT has all of its eigenvalues in [1 − δk , 1 + δk ]. (Here ΦTT
is the transpose of ΦT .)
The similarity between the expressions in (7.9) and (7.22) suggests a connection
between the JL lemma and the Restricted Isometry Property. A first result in this
direction was established in [400], wherein it was shown that random matrices
satisfying a concentration inequality of type (7.10) (and hence the JL Lemma)
satisfy the RIP of optimal order. More precisely, the authors prove the following
theorem.
Theorem 7.2.4 (Baraniuk, Davenport, DeVore, Wakin [400]). Suppose that
n, N, and 0 < δ < 1 are given. For x ∈ RN , if the probability distribution
generating the n × N matrices Φ (ω) , ω ∈ RnN , satisfies the concentration
inequalities (7.12)
  
 2 
 2e−ng(t) ,
2 2
P  Φx n − x n  t x n 0 < t < 1, (7.23)
2 2 2

then, there exist constant c1 , c2 > 0 depending only on δ such that the restricted
isometry property (7.22) holds for Φ (ω) with the prescribed δ and any k 
c1 n/ log(N/k) with probability at least 1 − 2e−c2 n .
Proof. From Theorem 7.2.3, we know that for each of the k-dimensional spaces
XT , the matrix Φ (ω) will fail to satisfy (7.23) with probability at most

2(12/δ) e−g(δ/2)n .
k
(7.24)
 
N k
There are  (2N/k) such subspaces. Thus, (7.23) will fail to hold with
k
probability (7.23) at most

2(2N/k) (12/δ) e−g(δ/2)n = 2 exp [−g (δ/2) n + k (log (eN/k) + log (12/δ))] .
k k

(7.25)
Thus, for a fixed c1 > 0, whenever k  c1 n/ log (N/k) , we will have that the
exponent in the exponential on the right side of (7.25) is at most −c2 n if

c2  g (δ/2) − c1 [1 + (1 + log (12/δ)) / log (N/k)] .

As a result, we can always choose c1 > 0 sufficiently small to ensure that c2 > 0.
This prove that with probability at least 1 − 2e−c2 n , the matrix Φ (ω) will satisfy
(7.16) for each x. From this one can easily obtain the theorem. 
The JL lemma implies the Restricted Isometry Property. Theorem 7.2.5 below is
a converse result to Theorem 7.2.4: They show that RIP matrices, with randomized
7.2 Johnson–Lindenstrauss Lemma and Restricted Isometry Property 379

column signs, provide Johnson–Lindenstrauss embeddings that are optimal up to


logarithmic factors in the ambient dimension. In particular, RIP matrices of optimal
order provide Johnson–Lindenstrauss embeddings of optimal order as such, up to a
logarithmic factor in N (see Theorem 7.2.5). Note that without randomization, such
a converse is impossible as vectors in the null space of the fixed parent matrix are
always mapped to zero.
For a vector x ∈ RN , we denote Dx = (Di,j ) ∈ RN ×N the diagonal matrix
satisfying Dj,j = xj .
Theorem 7.2.5 (Theorem 3.1 of Krahmer and Ward [412]). Fix η > 0 and
ε ∈ (0, 1), and consider a finite set E ∈ RN if cardinality |E| = p. Set
k  40 log 4pη , and suppose that Φ ∈ R
m×N
satisfies the Restricted Isometry
Property of order k and level δ  ε/4. Let ξ ∈ R be a Rademacher sequence, i.e.,
N
N
uniformly distributed on {−1, 1} . Then with probability exceeding 1 − η,
2 2 2
(1 − ε) x 2  ΦDξ x 2  (1 + ε) x 2 (7.26)

uniformly for all x ∈ E.


The proof of Theorem 7.2.5 follows from the use of three ingredients: (1)
Concentration of measure result: Hoeffding’s inequality; (2) Concentration of
measure result: Theorem 1.5.5; (3) RIP matrices: Theorem 7.2.6.
Theorem 7.2.6 (Proposition 2.5 of Rauhut [30]). Suppose that Φ ∈ Rm×N has
the Restricted Isometry Property of order 2s and level δ. Then for any two disjoint
subsets J , L ⊂ {1, . . . , N } of size |J |  s, |L|  s,
/ /
/ H /
/Φ(J ) Φ(L) /  δ.

Now we closely follow [29] to develop some useful techniques and at the same
time gives a shorter proof of Johnson–Lindenstrauss lemma. To prove Lemma 7.2.2,
it actually suffices to prove the following.
Lemma 7.2.7 (Nelson [29]). For any 0 < ε, δ < 1/2 and positive integer d, there
exists a distribution D over Rm×N for m = O ε−2 log (1/δ) such that for any
x ∈ RN with x 2 = 1,
  
 2 
P  Ax 2 − 1 > ε < δ.

Let x1 , . . . , xn ∈ RN be arbitrary points. Lemma 7.2.7 implies Lemma 7.2.2,


since we can set δ = 1/n2 then apply a union bound on the vectors
(xi − xj ) / xi − xj 2 for all i < j. The first proof of Lemma 7.2.2 was given
by Johnson and Lindenstrauss [411]. Later, proofs are given when D can be taken
as a distribution over matrices with independent Gaussian or Bernoulli entries, or
even more generally, Ω (log (1/δ))-wise independent entries which each have mean
zero, variance 1/m, and a subGaussian tail. See [416, 420–426].
380 7 Compressed Sensing and Sparse Recovery

"
For A ∈ Rn×n , we define the Frobenius norm as A F = A2i,j =
i,j

Tr (A2 ). We also define A 2→2 = sup Ax 2 , which is also equal to the
x2 =1
largest magnitude of an eigenvalue of A when A has all real eigenvalues (e.g., it
is symmetric). We need the Hanson-Wright inequality [61]. We follow [29] for the
proof since its approach is modern.
Theorem 7.2.8 (Hanson-Wright inequality [29, 61]). For A ∈ Rn×n symmetric
and x ∈ Rn with the xi independent having subGaussian entries of mean zero and
variance one,
    0 1
P xT Ax − Tr A2  > t  C exp − min C  t2 / A F , C  t/ A 2→2 ,
2


 C, C 0> 0 are universal constants.
where 1 Also, this holds even if the xi are only
2 2
Ω 1 + min t / A F , t/ A 2→2 -wise independent.

Proof. By Markov’s inequality,


   α 
P xT Ax − Tr (A) > t  tα · E xT Ax − Tr (A)
0  
2
for any α > 0. We apply Theorem 7.2.12 with α = min t2 / 256 · A F , t/ (256·
A 2→2 )} . 

The following theorem implies Lemma 7.2.7.


Theorem 7.2.9 (Sub-Gaussian matrix [29]). For N > 0 an integer and any 0 <
ε, δ < 1/2, let A be an m × N random
 matrix with subGaussian entries of mean
zero and variance 1/m for m = Ω ε−2 log (1/δ) . Then for any x ∈ RN with
x 2 = 1,
  
 2 
P  Ax 2 − 1 > ε < δ.

Proof. Observe that


⎛ ⎞
m
1 ⎝ xj xk zi,j zi,k ⎠,
2
Ax 2 = · (7.27)
m i=1
(j,k)∈[N ]×[N ]


where z is an mN -dimensional vector formed by concatenating the rows of m·A,
and [N ] = {1, . . . , N }. We use the block-diagonal matrix T ∈ RmN ×mN with m
blocks, where each block is the N × N rank-one matrix xxT /m. Now we get our
2 2
desired quadratic form Ax 2 = zT Tz. Besides, Tr (T) = x 2 = 1. Next, we
T
want to argue that the quadratic form z Tz has a concentration around Tr (T) , for
which we can use Theorem 7.2.8. In particular, we have that
7.2 Johnson–Lindenstrauss Lemma and Restricted Isometry Property 381

    
 
P Ax22 −1 > ε =P zT Tz − Tr (T) > ε C exp − min C  ε2 / T2F , C  ε/T2→2 .

2 4
Direct computation yields T F = 1/m · x 2 = 1/m. Also, x is the
only eigenvector of the rank one matrix xxT /m with non-zero eigenvalues, and
2
furthermore its eigenvalue is x 2 /m = 1/m. Thus, we have the induced matrix
norm
 A 2→2 = 1/m. Plugging these in gives error probability δ for m =
Ω ε−2 log (1/δ) . 
n
 [29]). For a ∈ R , x ∈ {−1, 1} uniform,
n
Lemma 7.2.10 (Khintchine inequality

k k
and k ≥ 2 an even integer, E a x
T
 a 2 ·k . k/2

Lemma 7.2.11
 (Moments
 [29]). If X, Y are independent with E[Y ] = 0 and k ≥
k k
2, then E |X|  E |X − Y | .

We here give a proof of Theorem 7.2.8 in the case that x is uniform in {−1, 1}n .
Theorem 7.2.12 (Concentration for quadratic form [29]). For A ∈ Rn×n
symmetric and x ∈ Rn with the xi independent having subGaussian entries of
mean zero and variance one,
 k  0√ 1k
E xT Ax − Tr (A)  C k · max
2
k A F ,k A 2→2

where C > 0 is a universal constant.


Proof. The proof is taken from [29] and we only slightly change some wording
and notation for the sake of our habits. Without loss of generality we can assume
Tr (A) = 0. The reason is that if we consider A = A − (Tr (A) /n) · I, then
xT Ax − Tr (A) = xT A x, and we obtain A F  A F , and A 2→2 
2 A 2→2 . We use the  induction method. We consider k a power of 2. For
 T 2  2 2  2  2
k = 2, E x Ax = 4 Ai,j , and A F = Ai,i + 2 Ai,j . Thus
  i<j i i<j
2 2
E xT Ax 2 A F.
Next, we assume the statement of our theorem for k/2 and prove it for the
hypothesis of k. Lemma 7.2.11 is used to establish
    k
k   
E T
x Ax
k

 E x Ax − y Ay
T T  T
= E (x + y) A (x − y) ,

n
where y ∈ {−1, 1} is random and independent of x.
If we swap xi with yi , then x+y remains constant as does |xi −yi | and that xi −yi
is replaced with its negation. See Sect. 1.11 for this symmetrization. The advantage
of this approach is that we can conditional expectation and apply sophisticated
methods to estimate the moments  of the Rademacher
 series. Let us do averaging
T
over all such swaps. Let ξi = (x + y) A , and ηi = xi − yi . Let zi the indicator
i
random variable: zi is 1 if we did not swap and −1 if we did. Let z = (z1 , . . . , zn ).
382 7 Compressed Sensing and Sparse Recovery

T 
Then we have (x + y) A (x − y) = ξi ηi zi . Averaging over all swaps,
i

 k
 k/2 k/2

Ez (x+y)T A (x−y) = ξi ηi zi  ξi2 ηi2 · kk/2  2k kk/2 · ξi2 .


i i i

The first inequality is by Lemma 7.2.10, and the second uses that |ηi |  2. Note
that
2 2 2
ξi2 = A (x + y) 2  2 Ax 2 + 2 Ay 2 ,
i

and thus
! k "  k/2   k/2 
E T
x Ax  2k kk/2 · E 2 Ax22 +2 Ay22  4k kk/2 · E Ax22 ,

p 1/p
with the final inequality using Minkowski’s inequality (namely that |E|X+Y | | 
p p 1/p
|E|X| + E|Y | | for any random variables X, Y and any 1 ≤ p∞).
 Next let us
2
deal with Ax 2 = Ax, Ax = xT A2 x. Let B = A2 − Tr A2 I/n. Then
2
Tr (B) = 0. Also B F  A F A 2→2 , and B 2→2  A 2→2 . The former
holds since
 2
2 2
B F  λ4i − λ2i /n  λ4i  A F A 2→2 .
i i i



n
The latter is valid since the eigenvalues of B are λ2i − λ2j /n for each i ∈ [n].
j=1
The largest eigenvalue of B is thus at most that of A2 , and since λ2i ≥ 0, the smallest
2
eigenvalue of B cannot be smaller than − A 2→2 .
Then we have that
 ! "  ! "
k/2  k/2 k/2
E Ax22 = E A2F + xT Bx  2k max AkF , E xT Bx .

Hence using the inductive hypothesis on B we have that


 k  0√  1k
E xT Ax  8k max k A F , C k/2 k 3/4 B F , C k/2 k B 2→2
0√  1k
 8k C k/2 max k A F , k 3/4 A F A 2→2 , k A 2→2
0√ 1k
= 8k C k/2 max k A F , k A 2→2 ,
7.3 Structured Random Matrices 383

where the final equality follows since the middle term above is the geometric mean
of the other two, and thus is dominated by at least one of them. This proves our
hypothesis as long as C ≥ 64.
To prove our statement for general k, set k  = 2 log2 k . Then using the power
mean inequality and our results for k  a power of 2,
 k     k k/k 0√ 1k
E xT Ax  E xT Ax  128k max k A F,k A 2→2 .


Example 7.2.13 (Concentration using for quadratic form).

H0 : y = x, x ∈ Rn
H1 : y = x + z, z ∈ Rn

T
where x = (x1 , . . . , xn ) and xi are independent, subGaussian random variables
with zero mean and variance one and z is independent of x. For H0 , it follows from
Hanson-Wright inequality that

 k  0√ 1k
E xT Ax − Tr (A)  C k · max
2
k A F ,k A 2→2

where C is a universal constant. The algorithm claims H1 if the test metric


 k 
E yT Ay − Tr (A) > γ

where γ is the threshold. A good estimate of the threshold is

0√ 1k
2
γ0 = C k · max k A F ,k A 2→2 . 

7.3 Structured Random Matrices

As pointed out above, remarkably, all optimal measurement matrices known so


far are random matrices. In practice, structure is an additional requirement on
the measurement matrix Φ. Indeed, certain applications impose constraints on the
matrix Φ and recovery algorithms can be accelerated when fast matrix vector
multiplication routines are available for Φ.
384 7 Compressed Sensing and Sparse Recovery

7.3.1 Partial Random Fourier Matrices

Partial random Fourier matrices [406, 427] Φ ∈ Cm×n arise as random row
submatrices of the discrete Fourier matrix and their restricted isometry constants
satisfy δs  δ with high probability provided that

m  Cδ −2 slog3 s log n.

7.4 Johnson–Lindenstrauss Lemma for Circulant Matrices

Beyond Nyquist: Efficient sampling of sparse bandlimited signals [28] Johnson–


Lindenstrauss notes[29]
A variant of the Johnson–Lindenstrauss lemma for circulant matrices [428]
Johnson–Lindenstrauss lemma for circulant matrices [429]

7.5 Composition of a Random Matrix and a Deterministic


Dictionary

The theory of compressed sensing has been developed for classes of signals
that have a very sparse representation in an orthonormal basis. This is a rather
stringent restriction. Indeed, allowing the signal to be sparse with respect to a
redundant dictionary adds a lot of flexibility and significantly extends the range
of applicability. Already the use of two orthonormal basis instead of just one
dramatically increases the class of signals that can be modelled in this way [430].
Throughout this section, x denotes the standard Euclidean norm. Signals y
are not sparse in an orthonormal basis but rather in a redundant dictionary dictionary
Φ ∈ Rd×K with K > d. Now y = Φx, where x has only few non-zero components.
The goal is to reconstruct y from few measurements. Given a suitable measurement
matrix A ∈ Rn×d , we want to recover y from the measurements s = Ay = AΦx.
The key idea then is to use the sparse representation in Φ to drive the reconstruction
procedure, i.e., try to identify the sparse coefficient sequence x and from that
reconstruct y. Clearly, we may represent s = Ψx with

Ψ = AΦ ∈ Rn×K .

In Table 7.3, two greedy algorithms are listed.


We will assume that A is an n × N random matrix that satisfies
  
 2 2
 2e−cnt /2 , t ∈ (0, 1/3)
2 2
P  Av − z   t v (7.28)
7.5 Composition of a Random Matrix and a Deterministic Dictionary 385

Table 7.3 Greedy algorithms. Goal: reconstruct x from s = Ψx. Columns of Ψ are
denoted by ψ i , and Ψ†Λ is the pseudo-inverse of ΨΛ
Orthogonal matching pursuits Thresholding
Initialize: z = 0, r = s, Λ = Ø find: Λ that contains the indices
find: k = arg maxi | r, ψ i | corresponding to the S largest values
update: Λ = Λ ∪ {k} , r = s − ΨΛ Ψ†Λ s of | s, ψ i |
iterate until stopping criterion is attained. output: x = Ψ†Λ s.
output: x = Ψ†Λ s.

for all v ∈ Rd and some constant c > 0. Let us list some examples of
random matrices that satisfy the above condition: (1) Gaussian ensemble; (2)
Bernoulli ensemble; (3) Isotropic subGaussian ensembles; (4) Basis transformation.
See [430].
Using the concentration inequality (7.28), we can now investigate the isometry
constants for a matrix of the type AΦ, where A is an n × d random measurement
matrix and Φ is a d × K deterministic dictionary, [430] follows the approach taken
in [400], which was inspired by proofs for the Johnson–Lindenstrauss lemma [416].
See Sect. 7.2.
A matrix, which is a composition of a random matrix of certain type and a
deterministic dictionary, has small restricted isometry constants.
Theorem 7.5.1 (Lemma 2.1 of Rauhut, Schnass, and Vandergheynst [430]). Let
A be a random matrix of size n × d drawn from a distribution that satisfies the
concentration inequality (7.28). Extract from the d × K deterministic dictionary Φ
any sub-dictionary ΦΛ of size S, in Rd i.e., |Λ| = S with (local) isometry constant
δΛ = δΛ (Φ) . For 0 < δ < 1, we have set

ν := δΛ + δ + δΛ δ. (7.29)

Then
2 2 2
(1 − ν) x  AΦΛ x 2  x (1 + ν) (7.30)

with probability exceeding


2
1 − 2(1 + 12/δ) e−cδ
S n/9
.

The key ingredient for the proof Theorem 7.5.1 is a finite ε-covering (a set of
points1 ) of the unit sphere, which is included below for convenience.
We denote the unit Euclidean ball by B2n = {x ∈ Rn : x  1} and the unit
sphere S n−1 = {x ∈ Rn : x = 1}, respectively. For a finite set, the cardinality

1A point in a vector space is a vector.


386 7 Compressed Sensing and Sparse Recovery

of A is denoted by |A|, and for a set A ∈ Rn , conv A denotes the convex hull
of A. The following fact is well-known and standard: see, e.g, [152, Lemma 4.10]
and [407, Lemma 2.2].
Theorem 7.5.2 (ε-Cover). Let n ≥ 1 and ε > 0. There exists an ε-cover Γ ⊂ B2n
of the unit Euclidean ball B2n with respect to the Euclidean metric such that B2n ⊂
−1
(1 − ε) conv Γ and
n
|Γ|  (1 + 2/ε) .

Similarly, there exists Γ ⊂ S n−1 which is an ε-cover of the sphere S n−1 and

|Γ |  (1 + 2/ε) .
n

Proof of Theorem 7.5.1. First we choose a finite ε-covering of the unit sphere in
RS , i.e., a set of points Q, with q = 1 for all q ∈ Q, such that for all q ∈ Q,
such that for all x = 1

min x − q  ε
q∈Q

for some ε ∈ (0, 1). According to Theorem 7.5.2, there exists such a Q with |Q| 
S
(1 + 2/ε) . Applying the measure concentration in (7.29) with t < 1/3 to all the
points ΦΛ q and taking the union bound we obtain
2 2 2
(1 − t) ΦΛ q  AΦΛ q  (1 + t) ΦΛ q for all q ∈ Q. (7.31)

with probability larger than


 S
2 2
1−2 1+ e−cnt .
ε

Define ν as the smallest number such that


2 2
AΦΛ x  (1 + ν) x for all x supported on Λ. (7.32)

Next we estimate ν in terms of ε, t. Since for all x with ||x|| = 1 we can choose a
point q such that x − q  ε and obtain

AΦΛ x  AΦΛ q + AΦΛ (x − q)


1/2
 (1 + t) ΦΛ q + AΦΛ (x − q)
1/2 1/2 1/2
 (1 + t) (1 + δΛ ) + (1 + ν) ε.

Since ν is the smallest possible constant for which (7.32) holds it also has to satisfy
7.5 Composition of a Random Matrix and a Deterministic Dictionary 387

√ √ 
1 + ν  1 + t 1 + δΛ + (1 + ν) ε.

Simplifying the above equation gives

1+ε
1+ν  2 (1 + δΛ ) .
(1 − t)

Now we choose ε = δ/6, and t = δ/3 < 1/3. Then

1+t 1 + δ/3 1 + δ/3 1 + δ/3 2δ/3


= = < =1+ < 1 + δ.
(1 − ε) (1 − δ/6) 1 − δ/3 + δ 2 /36 1 − δ/3 1 − δ/3

Thus,

ν < δ + δΛ (1 + δ) .

To get a lower bound we operate in a similar fashion,

AΦΛ x  AΦΛ q − AΦΛ (x − q)


1/2
 (1 + t) ΦΛ q − AΦΛ (x − q)
1/2 1/2 1/2
 (1 − t) (1 − δΛ ) − (1 + ν) ε.

Now square both sides and observe that ν < 1 (otherwise we have nothing to show).
Then we finally arrive at
 √ 2
2 1/2 1/2
AΦΛ x  (1 − t) (1 − δΛ ) − ε 2
√ 1/2 1/2
 (1 − t) (1 − δΛ√
) − 2 2ε(1 − t) (1 − δΛ ) + 2ε2
 1 − δΛ − t − 2 2ε  1 − δΛ − δ  1 − ν.

This completes the proof. 

Based on the previous theorem it is easy to derive an estimation of the global


restricted isometry constants of the composed matrix Ψ = AΦ.
Theorem 7.5.3 (Theorem 2.2 of Rauhut, Schnass, and Vandergheynst [430]).
Let Φ ∈ Rd×K be a deterministic dictionary of size K in Rd with restricted isometry
constant δS (Φ) , S ∈ N. Let A ∈ Rn×d be a random matrix satisfying (7.28) and
assume

n  Cδ −2 (S log (K/S) + log (2e (1 + 12/δ)) + t) (7.33)

for some δ ∈ (0, 1) and t > 0. Then with probability at least 1 − e−t the composed
matrix Ψ = AΦ has restricted isometry constant
388 7 Compressed Sensing and Sparse Recovery

δS (AΦ)  δS (Φ) + δ (1 + δS (Φ)) . (7.34)

The constant satisfies C ≤ 9/c.


Proof. Using Theorem 7.5.1 we can estimate the probability that a sub-dictionary

ΨS = (AΦ)S = AΦS , S = {1, . . . , K}

does not hold for (local) isometry constants δS (AΦ)  δS (Φ) + δ (1 + δS (Φ))
by the probability
 S
12 
P (δS (AΦ) > δS (Φ) + δ (1 + δS (Φ)))  2 1 + exp −cδ 2 n/9 .
δ
 
K
By taking the union bound over all possible sub-dictionaries of size S
S
we can estimate the probability of δS (AΦ) = sup δΛ (AΦ) not
Λ={1,...,K},|Λ|=S
satisfying (7.34) by
  S
K 12 
P (δS (AΦ) > δS (Φ) + δ (1 + δS (Φ)))  2 1+ exp −cδ 2 n/9 .
S δ

 
K S
Using Stirling’s formula  (eK/S) and demanding that the above term is
S
less than e−t completes the proof. 
It is interesting to observe the stability of inner products under multiplication
with a random matrix A, i.e.,

Ax, Ay ≈ x, y .
Theorem 7.5.4 (Lemma 3.1 of Rauhut, Schnass, and Vandergheynst [430]). Let
x, y ∈ RN with x 2 , y 2  1. Assume that A is an n × N random matrix with
independent N (0, 1/n) entries (independent of x, y). Then for all t > 0
 
t2
P (| Ax, Ay − x, y |  t)  2 exp −n , (7.35)
C1 + C2 t

with C1 = √4e

≈ 2.5044, and C 2 = e 2 ≈ 3.8442.
The analogue√statement holds for a random matrix A whose entries are
independent ±1/ n Bernoulli random variables.
7.5 Composition of a Random Matrix and a Deterministic Dictionary 389

Taking x = y in the theorem provides the concentration inequality (7.28) for


Gaussian and Bernoulli matrices. Due to the elementary nature of the method used
in the proof [430], we include the proof here as an example below.
We need Theorem 1.3.4 due to Bennett [12, Eq. (7)].
Example 7.5.5 (Concentration of measure for inner products). Observe that
n n n
1
Ax, Ay = gk gj xk yk
n
=1 k=1 j=1

where gk ,  = 1, . . . , n, k = 1, . . . , N are independent standard Gaussian random


variables. We consider the random variable
N N
Y = g k g j x k yj
k=1 j=1

where again gk , k = 1, . . . , N are independent standard Gaussian random variables.


Now we can write
n
1
Ax, Ay = Y
n
=1

where Y are independent copies of Y .


The expectation of Y is easily expressed as

N
EY = xk yk = x, y .
k=1

Hence, also E [ Ax, Ay ] = x, y . Let



Z = Y − EY = g j g k x j yk + gk2 − 1 xk yk .
k=j k

The new random variable Z is known as Gaussian chaos of order 2. Note that
EZ = 0.
To apply Theorem 1.3.4, we need to show the moment bound (1.24) for the
random variable Z. A general bound for Gaussian chaos [27, p. 65] gives
 p/2
p p 2
E|Z|  (p − 1) E|Z| . (7.36)


for p ≥ 2. Using Stirling’s formula, p! = 2πppp e−p eRp , 1
12p+1  Rp  1
12p , we
have that, for p ≥ 3,
390 7 Compressed Sensing and Sparse Recovery

(p − 1)  pp/2
p 2
E|Z| = p! √ E|Z|
eRp 2πpe−p pp
 p
1 e2 p!  2 2
(p−2)/2
2
= 1− √ e E|Z| E|Z|
p eRp 2πp
e  (p−2)/2
2 2
 Rp √ p! e2 E|Z| E|Z|
e 2πp
  1/2 (p−2) e
2
 p! e E|Z| √ E|Z|2 .

Compare the above with the moment bound (1.24) holds for all p ≥ 3 with
 1/2 2e
2 2
M = e E|Z| , σ 2 = √ E|Z| ,

and by direct inspection we see that it also holds for p = 2.
2
Now we need to determine E|Z| . Using the independence of the Gaussian
random variables gk , we obtain

 
E|Z| 2
= E⎣ gj gk gj  gk xj xk xj  xk + 2 gj gk gk2 − 1 xj xk xj  xk
j =k j  =k j =k k

 2  
+ gk − 1 gk2 − 1 xk yk xk yk ⎦
k k

Further
2        
E|Z| = E gj2 E gk2 xj yj xk yk + E gj2 E gk2 x2j yk2
j=k k=j
 2

+ E gk2 − 1 x2k yk2 . (7.37)
k

Finally,
2
E|Z| = x j yj x k yk + x2j yk2 + 2 x2k yk2
k=j k=j k

= x j yj x k yk + x2j yk2
j,k j,k
2 2 2
= x, y + x 2 y 2 2 (7.38)

since by assumption x 2 , y 2  1. Denoting by Z ,  = 1, . . . , n independent


copies of Z, Theorem 1.3.4 gives
7.5 Composition of a Random Matrix and a Deterministic Dictionary 391

 n  
 
 
P (| Ax, Ay − x, y |  t)= P  Z   nt
 
=1
   
1 n2 t2 t2
 2 exp − = 2 exp −n ,
2 nσ 2 + nM t C1 + C2 t

2 √
with C1 = √2e 6π
E|Z|  √4e 6π
≈ 2.5044 and C2 = e 2 ≈ 3.8442.
For the case of Bernoulli random matrices, the derivation is completely analogue.
We just have to replace the standard Gaussian random variables gk by εk = ±1
Bernoulli random variables. In particular, the estimate (7.36) for the chaos variable
Z is still valid, see [27, p. 105]. Furthermore, for Bernoulli variables εk = ±1 we
clearly have ε2k = 1. Hence, going through the estimate above we see that in (7.37)
the last term is actually zero, so the final bound in (7.38) is still valid. 
Now we are in a position to investigate recovery from random measurements
by thresholding, using Theorem 7.5.4. Thresholding works by comparing inner
products of the signal with the atoms of the dictionary.
Example 7.5.6 (Recovery from random measurements by thresholding [430]). Let
A be an n × K random matrix satisfying one of the two probability models of
Theorem 7.5.4. We know thresholding will succeed if we have

min | Ay, Azi | > max | Ay, Azk | .


i∈Λ k∈Λ̄

We need to estimate the probability that the above inequality is violated


 
P min | Ay, Azi |  max | Ay, Azk |

i∈Λ k∈Λ̄   
 P min | Ay, Azi |  min | y, zi | − 2ε +P max | Ay, Azk |  max | y, zk | + 2ε .
i∈Λ i∈Λ k∈Λ̄ k∈Λ̄

The probability of the good components having responses lower than the threshold
can be further estimated as
 
P min | Ay, Azi |  min | y, zi | − 2ε
i∈Λ   i∈Λ 
 P ∪ | Ay, Azi |  min | y, zi | − 2ε
 i∈Λ  i∈Λ
 P | y, zi − Ay, Azi |  2ε
i∈Λ  
t2 /4
 2 |Λ| exp −n C1 +C 2 t/2 .

Similarly, we can bound the probability of the bad components being higher than
the threshold,
392 7 Compressed Sensing and Sparse Recovery

 
P max | Ay, Azk |  max | y, zk | + ε
2
k∈Λ̄   k∈Λ̄ 
 P ∪ | Ay, Azk |  max | y, zk | + 2
ε

 k∈Λ̄ k∈Λ̄
 P | Ay, Azk − y, zk |  2ε
 
k∈Λ̄  
t2 /4
 2 Λ̄ exp −n C1 +C 2 t/2
.

Combining the these two estimates we obtain that the probability of success for
thresholding is exceeding
 
t2 /4
1 − 2K exp −n .
C1 + C2 t/2

Theorem 7.5.4 finally follows from requiring this probability to be higher than
1 − e−t and solving for n. 
We summarize the result in the following theorem.
Theorem 7.5.7 (Theorem 3.2 of Rauhut, Schnass, and Vandergheynst [430]).
Let Ψ be a d × K dictionary. Assume that the support of x for a signal y = Φx,
normalized to have y 2 = 1, could be recovered by thresholding with a margin ε,
i.e.

min | Ay, Azi | > max | Ay, Azk | + ε.


i∈Λ k∈Λ̄

Let A be an n × d random matrix satisfying one of the two probability models of


Theorem 7.5.4. Then, with probability exceeding 1 − e−t , the support and thus the
signal can be reconstructed via thresholding from the n-dimensional measurement
vector w = Ay = AΦx as long as

n  C (ε) (log (2K) + t) .

where C (ε) = 4C1 ε−2 + 2C2 ε−1 and C1 , C2 are constants from Theorem 7.5.4.

7.6 Restricted Isometry Property for Partial Random


Circulant Matrices

Circular matrices are connected to circular convolution, defined for two vectors
x, z ∈ Cn by
n
(z ⊗ x)j := zjk xk , j = 1, . . . , n,
k=1
7.6 Restricted Isometry Property for Partial Random Circulant Matrices 393

where

j $ k = j − k mod n

is the cyclic subtraction. The circular matrix H = Hz ∈ Cn×n associated with z is


given by

Hx = z ⊗ x
T
and has entries Hjk = zjk . Given a vector z = (z0 , . . . , zn−1 ) ∈ Cn , we
introduce the circulant matrix
⎡ ⎤
z0 zn−1 · · · z1
⎢ z1 z0 · · · z2 ⎥
⎢ ⎥
.. ⎥ ∈ C
n×n
Hz = ⎢ . .. .
⎣ .. . . ⎦
zn−1 zn−2 · · · z0

Square matrices are not very interesting for compressed sensing, so we our attention
to a row submatrix of H. Consider an arbitrary index set Ω ⊂ {0, 1, . . . , n − 1}
whose cardinality |Ω| = m. We define the operator RΩ : Cn → Cm that restricts
n
a vector x ∈ Cn to its entries in Ω. Let ε = {εi }i=1 be a Rademacher vector of
length n, i.e., a random vector with independent entries distributed according to
P (εi = ±1) = 1/2. The associated partial random circulant matrix is given by

1
Φ = √ RΩ Hε ∈ Rm×n (7.39)
m

and acts on complex vectors x ∈ Cn via

1 1
Φx = √ RΩ Hε x = √ RΩ (ε ⊗ x) . (7.40)
m m

In other words, Φx is a circular matrix generated by a Rademacher vector, where


the rows outside Ω are removed.
Example 7.6.1 (Expectation of the restricted isometry constant [403]). We study
the expectation of the restricted isometry constant, E [δs ]. The goal is to convert
E [δs ] to another form that is easier to bound. Let T denote the set of all s-sparse
signals in the Euclidean unit ball:

T = {x ∈ Rn : x 0  s, x 2  1} . (7.41)

Define a function ||| · ||| on Hermitian n × n matrices via the expression


 
|||A||| = sup xT Ax .
x∈T
394 7 Compressed Sensing and Sparse Recovery

Now consider
D  E  
   2 2
||ΦH Φ − I||| = sup  ΦH Φ − I x, x  = sup  Φx 2 − x 2  = δs . (7.42)
x∈T x∈T

Let S be the cyclic shift down operator on column vectors in Rn . Applying the
power Sk to x will cycle x downward by k coordinates:

Sk x 
= xk ,
 H
where $ is subtraction modulo n. Note that Sk = S−k = Sn−k . Then we
rewrite Ψ as a random sum of shift operators,
n
1
Φ= √ εk R Ω S k .
m
k=1

It follows that
n n
1 −k 1
Φ Φ−I=
H
εk ε S RH
Ω RΩ S

= εk ε S−k PΩ S . (7.43)
m m
k= k=

where PΩ = RH Ω RΩ is the n × n diagonal projector onto the coordinates in Ω.


Applying PΩ to the vector x preserves the values of x on the set Ω while setting the
values outside of Ω to zero. Combining (7.42) and (7.43), we get the final form
n
1
δs = sup |Gx | where Gx = εk ε xH S−k PΩ S x. (7.44)
x∈T m
k=

We may regard the restricted isometry constant as the supremum of a random


process indexed by the set T defined above. The expected supremum of this
process can be bounded using some sophisticated techniques like Rademacher
chaos, covering number estimates, and chaining [27, 81]. 
The next example is to re-express the random process Gx (defined in (7.44))
in the Fourier domain. This is a key tool. [403] uses a version of the classical
Dudley inequality—Sect. 3.5—for Rademacher chaos that bounds the expectation
of its supremum by the maximum of two entropy integrals that involve covering
numbers with respect to two different metrics. Then they use elementary ideas
from Fourier analysis to provide bounds for these metrics. This reduction allows
them to exploit covering number estimates from the RIP analysis for partial Fourier
matrices [30, 402] to complete the argument.
Example 7.6.2 (Fourier representation of the random process [403]). This
approach studied here is elementary and is of independent interest. Let F be the
n × n discrete Fourier transform matrix whose entries are given by the expression
7.6 Restricted Isometry Property for Partial Random Circulant Matrices 395

F (ω, ) = e−i2πω/n , 0  ω,   n − 1.

Here F is unnormalized. The hat symbol denotes the Fourier transform of a vector:
x̂ = Fx. Use the property of Fourier transform: a shift in the time domain
followed by a Fourier transform may be written as a Fourier transform followed
by a frequency modulation

FSk = Mk F,

where M is the diagonal matrix with entries M (ω, ω) = e−i2πω/n for 0  ω 


n − 1. Now we are ready to handle Gx . The random process Gx has the Fourier
transform representation
n
1
Gx = εk ε x̂H M−k P̂Ω M x̂, (7.45)
m
k=

1 −1
where P̂Ω = n FPΩ F . The matrix P̂Ω has several nice properties:
1. P̂Ω is circulant and conjugate symmetric.  
 
2. Along the diagonal P̂Ω (ω, ω) = m/n2 , and off the diagonal P̂Ω (ω, ω) 
m/n2 .
3. Since the rows and columns of P̂Ω are circular shifts of one another,
 2  2 / / 2
    / /
P̂Ω (ω, ξ) = P̂Ω (ω, ξ) = /P̂Ω / /n = m/n3 .
F
ω ξ

4. P̂Ω has exactly m nonzero eigenvalues λi , i =/ 1, ./. . , m, each of which is equal


/ /
to λi = 1/n. As such, P̂Ω has spectral norm /P̂Ω / = 1/n and Frobenius norm
/ /2
/ /
/P̂Ω / = m/n2 .
F

These properties immediately follow from the fact that PΩ = RH Ω RΩ is a diagonal


matrix with 0–1 entries. The matrix P̂Ω inherits conjugate symmetry from PΩ . P̂Ω
is circulant since it is diagonalized by the Fourier transform. Since we form P̂Ω
by applying a similar transform to PΩ , they have the same eigenvalue modulo the
scalar factor 1/n. We can further rewrite the random process (7.45) in terms of the
form (7.46) we desire for sophisticated techniques. 
T
Example 7.6.3 (Integrability of chaos processes [403]). Let ε = (ε1 , . . . , εn ) .
The process (7.45) can be written as a quadratic form

Gx = ε, Zx ε where x ∈ T. (7.46)

The matrix Zx has entries


396 7 Compressed Sensing and Sparse Recovery

 −k
1 H
m x̂ M P̂Ω M x̂,  
k=
Zx (k, ) = .
0, k=

A short calculation verifies that this matrix can be expressed compactly as

1  H H   
Zx = F X̂ P̂Ω X̂F − diag FH X̂H P̂Ω X̂F , (7.47)
m

where X̂ = diag (x̂) is the diagonal matrix constructed from the vector x̂. The term
homogeneous second-order chaos is used to refer to a random process Gx of the
form (7.46) where each matrix Zx is conjugate symmetric and hollow, i.e., has zeros
on the diagonal. To bound the expected supremum of the random process Gx over
the set T , we apply a version of Dudleys inequality that is specialized to this setting.
Define two pseudo-metrics on the index set T :

d1 (x, y) = Zx − Zy and d2 (x, y) = Zx − Zy F.

Let N (T, di , r) denote the minimum number of balls of radius r in the metric d1 , d2
that we need to cover the set T .
Proposition 7.6.4 (Dudley’s inequality for chaos). Suppose that Gx is a homoge-
neous second-order chaos process indexed by a set T . Fix a point x0 ∈ T . There
exists a universal constant C such that
 ∞ ∞ 
E sup |Gx − Gx0 |  C max log N (T, d1 , r) dr, log N (T, d2 , r)dr .
x∈T 0 0

(7.48)

Our statement of the proposition follows [403] and looks different from the versions
presented in the literature [27, Theorem 11.22] and [81, Theorem 2.5.2]. For details
about how to bound these integrals, we see [403]. 
Immediately following Example 7.6.3, we can get the following theorem.
Theorem 7.6.5 (Theorem 1.1 of Rauhut, Romberg, and Tropp [403]). Let Ω be
an arbitrary subset of {0, 1, . . . , n − 1} with cardinality |Ω| = m. Let Ψ be the
corresponding partial random circulant matrix (7.39) generated by a Rademacher
sequence, and let δs denote the s-th restricted isometry constant. Then,
 " 
s3/2 3/2 s
E [δs ]  C1 max log n, log s log n (7.49)
m m

where C1 > 0 is a universal constant.


In particular, (7.49) implies that for a given δ ∈ (0, 1), we have E [δs ]  δ provided
7.6 Restricted Isometry Property for Partial Random Circulant Matrices 397

0 1
m  C2 max δ −1 s3/2 log3/2 n, δ −2 slog2 nlog2 s, , (7.50)

where C2 > 0 is another universal constant.


Theorem 7.6.5 also says that partial random circulant matrices Φ obey the RIP
(7.3) in expectation. The following theorem tell us that the random variable δs does
not deviate much from its expectation.
Theorem 7.6.6 (Theorem 1.2 of Rauhut, Romberg, and Tropp [403]). Let the
random variable δs defined as in Theorem 7.6.5. Then for 0 ≤ t ≤ 1
2
/σ 2 s
P (δs  E [δs ] + t)  e−t where σ 2 = C3 log2 slog2 n,
m
for a universal constant C3 > 0.
Proof. Since the techniques used here for the proof are interesting, we present the
detailed derivations, closely following [403]. We require Theorem 1.5.6, which is
Theorem 17 in [62].
Let F denote a collection of n×n symmetric matrices Z, and ε1 , . . . , εn are i.i.d.
Rademacher variables. Assume that Z has zero diagonal, that is, Z(i, i) = 0, i =
1, . . . , n. for each Z ∈ F. We are interested in the concentration variable
n n
Y = sup εk ε Z (k, ) .
Z∈F
k=1 =1

Define two variance parameters


 2
n  n 
 
U = sup Z and V = E sup
2
 ε Z (k, ) .
Z∈F Z∈F  
k=1 =1

Proposition 7.6.7 (Tail Bound for Chaos). under the preceding assumptions, for
all t ≥ 0,
 
t2
P (Y  E [Y ] + t)  exp − 2
(7.51)
32V + 65U t/3

Putting together (7.44), (7.46), and (7.47), we have


n n
δs = sup |Gx | = sup εk ε Zx (k, )
x∈T x∈T
k=1 =1

where the matrix Zx has the expression

1 H H
Zx = Ax − diag (Ax ) for Ax = F X̂ P̂Ω X̂F.
m
398 7 Compressed Sensing and Sparse Recovery

As a result, (7.51) applies to the random variable δs .


To bound the other parameter V 2 , we use the following “vector version” of the
Dudley inequality.
Proposition 7.6.8 ([403]). Consider the vector-valued random process

hx = Zx ε for x ∈ T.

The pseudo-metric is defined as

d2 (x, y) = Zx − Zy F.

For a point x0 ∈ T. There exists a universal constant C > 0 such that


 1/2
2
∞ 
E sup hx − hx0 2 C N (T, d2 , r)dr. (7.52)
x∈T 0

With x0 = 0, the left-hand side of (7.52) is exactly V . The rest is straightforward.


See [403] for details. 
Theorem 7.6.9 (Theorem 1.1 of Krahmer, Mendelson, and Rauhut [31]). Let
Φ ∈ Rm×n be a draw of partial random circulant matrix generated by a
Rademacher vector ε. If
 
m  cδ −2 s log2 s log2 n , (7.53)

2
then with probability at least 1 − n−(log n)(log s) , the restricted isometry constant
of Φ satisfies δs ≤ δ, in other words,
2
P (δs  δ)  n−(log n)(log s) .

The constant c > 0 is universal.


Combining Theorem 7.6.9 with the work [412] on the relation between the restricted
isometry property and the Johnson–Lindenstrauss lemma, we obtain the following.
See [428, 429] for previous work.
Theorem 7.6.10 (Theorem 1.2 of Krahmer, Mendelson, and Rauhut [31]). Fix
η, δ ∈ (0, 1), and consider a finite set E ∈ Rn of cardinality |E| = p. Choose

m  C1 δ −2 log (C2 p) (log log (C2 p)) (log n) ,


2 2

where the constants C1 , C2 depend only on η. Let Φ ∈ Rm×n be a partial circulant


matrix generated by a Rademacher vector . Furthermore, let  ∈ Rn be a
Rademacher vector independent of  and set D to be the diagonal matrix with
diagonal  . Then with probability exceeding 1 − η, for every x ∈ E,
7.6 Restricted Isometry Property for Partial Random Circulant Matrices 399

2 2 2
(1 − δ) x 2  ΦD x 2  (1 + δ) x 2 .

Now we present a result that is more general than Theorem 7.6.9. The
L-sub-Gaussian random variables are defined in Sect. 1.8.
Theorem 7.6.11 (Theorem 4.1 of Krahmer, Mendelson, and Rauhut [31]). Let
n
ξ = {ξi }i=1 be a random vector with independent mean-zero, variance one, L-sub-
Gaussian entries. If, for s ≤ n and η, δ ∈ (0, 1),
0 1
m  cδ −2 s max (log s) (log n) , log (1/η)
2 2
(7.54)

then with probability at least 1 − η, the restricted isometry constant of the partial
random circulant matrix Φ ∈ Rm×n generated by ξ satisfies δs  δ. The constant
c > 0 depends only on L.
Here, we only introduce the proof ingredient. Let Vx z = √1m PΩ (x ⊗ z) , where
the projection operator PΩ : Cn → Cm is given by the positive-definite (sample
covariance) matrix

PΩ = RH
Ω RΩ ,

that is,

(PΩ x) = x for  ∈ Ω and (PΩ x) = 0 for  ∈


/ Ω.

Define the unit ball with sparsity constrain


0 1
2
Ts = x ∈ Cn : x 2  1, x 0 s .

Let the operator norm be defined as A = sup Ax 2 . The restricted isometry


x2 =1
constant of Φ is expressed as
     
δs = sup RΩ (ξ ⊗ x)2 − x22  = sup PΩ (x ⊗ ξ)2 − x22  = sup Vx ξ2 − x22  .
x∈Ts x∈Ts x∈Ts

The δs is indexed by the set Ts of vectors. Since |Ω| = m, it follows that


n n
2 1 1 2 2
E Vx ξ = E ξj ξ¯k xj x̄j = |xj | = x 2
m m
∈Ω k,l=1 ∈Ω k,l=1

and hence
 
 2 2
δs =sup  Vx ξ 2 −E Vx ξ 2  ,
x∈Ts
400 7 Compressed Sensing and Sparse Recovery

which is the process studied in Theorem 7.8.3.


The proof of Theorem 7.6.11 requires a Fourier domain description of Φ. Let the
matrix F by the unnormalized Fourier transform with elements Fjk = ei2πjk/n . By
convolution theorem, for every 1 ≤ j ≤ n,

F(x ⊗ y)j = F(x)j · F(y)j .

So, we have

1
Vx ξ = √ PΩ F−1 X̂Fξ,
m

where X̂ is the diagonal matrix, whose diagonal is the Fourier transform Fx. In
short,
1
Vx = √ P̂Ω X̂F,
m

where P̂Ω = PΩ F−1 . For details of the proof, we refer to [31]. The proof ingredient
is novel. The approach of suprema of chaos processes is indexed by a set of matrices,
which is based on a chaining method due to Talagrand. See Sect. 7.8.

7.7 Restricted Isometry Property for Time-Frequency


Structured Random Matrices

Applications of random Gabor synthesis matrices include operator identification


(channel estimation in wireless communications), radar and sonar [405, 431, 431].
The restricted isometry property for time-frequency structured random matrices is
treated in [31, 391, 404, 432, 433].
Here we follow [31] to highlight the novel approach. The translation and
modulation operators on Cm are defined by (Ty)j = ei2πj/m yj = ω j yj , where
ω = ei2π/m and $ again denotes cyclic subtraction, this time modulo m. Observe
that
  
Tk y j
= yjk and My j
= ei2πj/m yj = ω j yj . (7.55)

The time-frequency shifts are given by

Π (k, ) = M Tk ,

where (k, ) ∈ Z2m =Zm × Zm = {{0, . . . , m  − 1} {0, . . . , m − 1}}. For y ∈


Cm \ {0} , the system Π (k, ) y : (k, ) ∈ Z2m , is called a Gabor system [434,
7.7 Restricted Isometry Property for Time-Frequency Structured Random Matrices 401

435]. The m × m2 matrix Ψy whose columns are vectors Π (k, ) y, (k, ) ∈ Z2m is
called a Gabor synthesis matrix,
2
Ψy = M Tk y ∈ Cm×m , (k, ) ∈ Z2m . (7.56)

The operators Π (k, ) = M Tk are called time-frequency shifts and the system
Π (k, ) of all time-frequency shifts forms a basis of the matrix space Cm×m [436,
437]. Ψy allows for fast matrix vector multiplication algorithms based on the FFT.
Theorem 7.7.1 (Theorem 1.3 of Krahmer, Mendelson, and Rauhut [31]). Let ε
2
be a Rademacher vector and consider the Gabor synthesis matrix Ψy ∈ Cm×m
defined in (7.56) generated by y = √1m ε. If

m  cδ −2 s(log s) (log m) ,
2 2

2
then with probability at least 1 − m−(log m)·(log s) , the restricted isometry constant
of Ψy satisfies δs ≤ δ.
Now we consider  ∈ Cn to be a Rademacher or Steinhaus sequence, that is,
a vector of independent random variables taking values +1 and −1 with equal
probability, respectively, taking values uniformly distributed on the complex torus
S 1 = {z ∈ C : |z| = 1} . The normalized window is
1
g = √ .
n

2
Theorem 7.7.2 (Pfander and Rauhut [404]). Let Ψg ∈ Cn×n be a draw of
the random Gabor synthesis with normalized Steinhaus or Rademacher generating
vector.
1. The expectation of the restricted isometry constant δs of Ψg , s ≤ n, satisfies
+ " ,
s3/2  s3/2 log3/2 n
Eδs  max C1 log n log s, C2 , (7.57)
n n

where C1 , C2 are universal constants.


2. For 0 ≤ t ≤ 1, we have

2
/σ 2 C3 s3/2 (log n)log2 s
P (δs  Eδ + t)  e−t , where σ 2 = , (7.58)
n
where C3 > 0 is a universal constant.
With slight variations of the proof one can show similar statements for normalized
Gaussian or subGaussian random windows g.
402 7 Compressed Sensing and Sparse Recovery

Example 7.7.3 (Wireless communications and radar [404]). A common finite-


dimensional model for the channel operator, which combines digital (discrete) to
analog conversion, the analog channel, and the analog to digital conversion. It is
given by

H= x(k,) Π (k, ) .
(k,)∈Zn ×Zn

Time-shifts delay is due to the multipath propagation, and the frequency-shifts are
due to the Doppler effects caused by moving transmitter, receiver and scatterers.
Physical considerations often suggest that x be rather sparse as, indeed, the number
of present scatterers can be assumed to be small in most cases. The same model is
used in sonar and radar.
Given a single input-output pair (g, Hg) , our task is to find the sparse coefficient
vector x. In other words, we need to find H ∈ Cn×n , or equivalently x, from its
action y = Hz on a single vector z. Writing

y = Hg = x(k,) Π (k, ) g = Ψg x, (7.59)


(k,)∈Zn ×Zn

with unknown but sparse x, we arrive at a compressed sensing problem. In this


setup, we clearly have the freedom to choose the vector g, and we may choose it as
a random Rademacher or Steinhaus sequence. Then, the restricted isometry property
of Ψg , as shown in Theorem 7.7.2, ensures recovery of sufficiently sparse x, and
thus of the associated operator H.
Recovery of the sparse vector x in (7.59) can be also interpreted as finding a
sparse time-frequency representation of a given y with respect to the window g. 
Let us use one example to highlight the approach that is used to prove
Theorem 7.7.2.
Example 7.7.4 (Expectation of the restricted isometry constant [404]). We first
rewrite the restricted isometry constants δs . Let the set of s-sparse vectors with
unit 2 -norm be defined as
0 2
1
T = Ts = x ∈ Cn : x 2 = 1, x 0  s .

We express δs as the following semi-norm on Hermitian matrices


   
 
δs = sup xH ΨH Ψ − I x (7.60)
x∈Ts

where I is the identity matrix and Ψ = Ψg . The Gabor synthesis matrix Ψg has the
form
7.7 Restricted Isometry Property for Time-Frequency Structured Random Matrices 403

n−1
Ψg = g k Ak
k=0

with
 
A0 = I|M|M2 | · · · |Mn−1 , A1 = I|MT|M2 T| · · · |Mn−1 Tk ,

and so on. In short, for k ∈ Zn ,



Ak = Tk |MTk |M2 Tk | · · · |Mn−1 Tk .

With
n−1
AH
k Ak = nI,
k=0

it follows that
n−1 n−1 n−1 n−1
1 1
ΨH Ψ − I = −I +  k   k AH
k  Ak = k k Wk ,k ,
n n
k=0 k =0 k=0 k =0

where
 
Wk ,k = k Ak , k = k,
AH
0, k  = k.

We use the matrix B (x) ∈ Cn×n , x ∈ Ts , given by matrix entries

B(x)k,k = xH AH
k Ak x.

Then we have

nEδs = E sup |Zx | = E sup |Zx − Z0 | , (7.61)


x∈Ts x∈Ts

where

k Ak x =  B (x) ,
ε̄k εk xH AH H
Zx = (7.62)
k =k

with x ∈ Ts = {x ∈ Cn×n : x 2 = 1, x 0  s} . 
A process of the type (7.62) is called Rademacher or Steinhaus chaos process of
order 2. In order to bound such a process, [404] uses the following Theorem, see for
404 7 Compressed Sensing and Sparse Recovery

example, [27, Theorem 11.22] or [81, Theorem 2.5.2], where it is stated for Gaus-
sian processes and in terms of majorizing measure (generic chaining) conditions.
The formulation below requires the operator norm A 2→2 = max Ax 2 and
x2 =1
1/2
 1/2  2
the Frobenius norm A F = Tr AH A = |Ai,j | , where Tr (A)
i,j
denotes the trace of a matrix A. z 2 denotes the 2 -norm of the vector z.
T
Theorem 7.7.5 (Pfander and Rauhut [404]). Let  = (1 , . . . , n ) be a
Rademacher or Steinhaus sequence, and let

k Ak x =  B (x) 
ε̄k εk xH AH H
Zx =
k =k

be an associated chaos process of order 2, indexed by x ∈ T, where assume that


B(x) is Hermitian with zero diagonal, that is, B(x)k,k = 0 and B(x)k ,k =
B(x)k,k . We define two (pseudo-)metrics on the set T,

d1 (x, y) = B(x) − B(y) 2→2 ,


d2 (x, y) = B(x) − B(y) F.

Let N (T, di , r) be the minimum number of balls of radius r in the metric di needed
to cover the set T. Then, these exists a universal constant C > 0 such that, for an
arbitrary x0 ∈ T,
 ∞ ∞ 
E sup |Zx − Zx0 |  C max log N (T, d1 , r) dr, log N (T, d2 , r) dr .
x∈T 0 0

(7.63)

The proof ingredients include: (1) decoupling [56, Theorem 3.1.1]; (2) the contrac-
tion principle [27, Theorem 4.4]. For a Rademacher sequence, the result is stated
in [403, Proposition 2.2].
The following result is a slight variant of Theorem 17 in [62], which in turn is an
improved version of a striking result due to Talagrand [231].
Theorem 7.7.6 (Pfander and Rauhut [404]). Let the set of matrices B =
T
{B (x)} ∈ Cn×n , x ∈ T, where T is the set of vectors. Let ε = (ε1 , . . . , εn )
be a sequence of i.i.d. Rademacher or Steinhaus random variables. Assume that
the matrix B(x) has zero diagonal, i.e., Bi,i (x) = 0 for all x ∈ T. Let Y be the
random variable
 2
 H  n−1 n−1 

Y = sup ε B (x) ε =  εk εk B(x)k ,k  .
x∈T  

k=1 k =1
7.8 Suprema of Chaos Processes 405

Define U and V as

U = sup B (x) 2→2


x∈T

and
 2
n  n 
2  
V = E sup B (x) ε = E sup  εk B(x)k ,k  .
x∈T
2
x∈T  
k =1 k=1

Then, for t ≥ 0,
 
t2
P (Y  E [Y ] + t)  exp − .
32V + 65U t/3

7.8 Suprema of Chaos Processes

The approach of suprema of chaos processes is indexed by a set of matrices, which


is based on a chaining method.
Both for partial random matrices and for time-frequency structured random
matrices generated by Rademacher vectors, the restricted isometry constants δs can
be expressed as a (scalared) random variable X of the form
 
 2 2
X = sup  A 2 − E A 2  , (7.64)
A∈A

where A is a set of matrices and  is a Rademacher vector. By expanding the


2 -norms, we rewrite (7.64) as
 
 
  H 
X = sup   i j A A i,j  , (7.65)
A∈A  
i=j

which is a homogeneous chaos process of order 2 indexed by the positive semidef-


inite matrices AH A. Talagrand [81] considers general homogenous chaos process
of the form
 
 
 
X = sup   i j Bi,j  , (7.66)
B∈B  
i=j

where B ⊂ Cn×n is a set of (not necessarily positive semidefinite) matrices. He


derives the bound

EY  C1 γ2 (B, · F) + C2 γ2 (B, · 2→2 ) . (7.67)


406 7 Compressed Sensing and Sparse Recovery

where the Talagrand’s functional γα is defined below.


The core of the generic chaining methodology is based on the following
definition:
Definition 7.8.1 (Talagrand [81]). For a metric space (T, d), an admissible se-
quence of T is a collection of subsets of T, {Ts : s  0} , such that for every s ≥ 1,
s
|Ts |  22 and |T0 | = 1. For β ≥ 1, define the γβ functional by

γβ (T, d) = inf sup 2s/β d (t, Ts ) ,
t∈T s=0

where the infimum is taken with respect to all admissible sequences of T .


We need some new notation to proceed. For a set of matrices A, the radius of the
set A in the Frobenius norm
2
A F = Tr (AH A)

is denoted by dF (A). Similarly, the radius of the set A in the operator norm

A 2→2 = sup Ax 2
x2 1

is denoted by d2→2 (A). That is,

dF (A) = sup A F, d2→2 (A) = sup A 2→2 .


A∈A A∈A

A metric space is a set T where a notion of distance d (called a metric) between


elements of the set is defined. We denote the metric space by (T, d). For a metric
space (T, d) and r > 0, the covering number N (T, d, r) is the minimum number
of open balls of radius r in the metric space (T, d) to cover. Talagrand’s functionals
γα can be bounded in terms of such covering numbers by the well known Dudley
integral (see, e.g., [81]). A more specific formulation for the γ2 -functional of a set
of matrices A equipped with the operator norm is

d2→2 (A) 2
γ2 (A, · 2→2 )  c log N (A, · 2→2 , r)dr. (7.68)
0

This type of entropy integral was suggested by Dudley [83] to bound the supremum
of Gaussian processes.
Under mild measurability assumptions, if {Gt : t ∈ T } is a centered Gaussian
process by a set T, then

c1 γ2 (T, d)  E sup Gt  c2 γ2 (T, d) , (7.69)


t∈T
7.8 Suprema of Chaos Processes 407

where c1 and c2 are absolute constants, and for every s, t ∈ T,


2
d2 (s, t) = E|Gs − Gt | .

The upper bound is due to Fernique [84] while the lower bound is due to Talagrand’s
majorizing measures theorem [81, 85].
With the notions above, we are ready to stage the main results.
Theorem 7.8.2 (Theorem 1.4 of Krahmer, Mendelson, and Rauhut [31]). Let
A ∈ Rm×n be a symmetric set of matrices, A = −A. Let  be a Rademacher
vector of length n. Then
    2 
E sup A22 − E A22  C1 dF (A) γ2 A, ·2→2 + γ2 A, ·2→2 =: C2 E
A∈A

(7.70)

Furthermore, for t > 0,


      2 
 2 2 t t
P sup  A 2 − E A 2   C2 E + t  2 exp −C3 min , ,
A∈A V2 U
(7.71)

where

V = d2→2 (A) [γ2 (A, · 2→2 ) + dF (A)] and U = d22→2 (A) .

The constants C1 , C2 , C3 are universal.


The symmetry assumption A = −A was made for the sake of simplicity. The
following more general theorem does not use this assumption.
One proof ingredient is the well-known bound relating strong and weak moments
for L-sub-Gaussian random vectors, see Sect. 1.8.
Theorem 7.8.3 (Theorem 3.1 of Krahmer, Mendelson, and Rauhut [31]). Let A
be a set of matrices, and let ξ be a random vector whose entries ξi are independent,
mean-zero, variance one, and L-sub-Gaussian random variables. Set

E = γ2 (A, · 2→2 ) [γ2 (A, · 2→2 ) + dF (A)] + dF (A) d2→2 (A) ,


V = d2→2 (A) [γ2 (A, · 2→2 ) + dF (A)] , and U = d22→2 (A) .

Then, for t > 0,


      2 
 2 2 t t
P sup  Aξ 2 − E Aξ 2   c1 E + t  2 exp −c3 min , .
A∈A V2 U
(7.72)
408 7 Compressed Sensing and Sparse Recovery

The constants c1 , c2 depend only on L.


Some notation is needed. We write x  y if there is an absolute constant c for
which x  cy. x ∼ y means that c1 x ≤ y ≤ c2 y for absolute constants c1 , c2 . If
the constants depend on some parameter u, we write x u y. The Lp -norm of a
random variable X, or its p-th moment, is given by

p 1/p
X Lp = (E|X| ) .

Theorem 7.8.4 (Lemma 3.3 of Krahmer, Mendelson, and Rauhut [31]). Let A
be a set of matrices, let ξ = (ξ1 , . . . , ξn ) be an L-subGaussian random vector, and
let ξ  be an independent copy of ξ. Then, for every p ≥ 1,
/ / / /
/ = >/ / / /= >/
/ sup Aξ, Aξ  / L γ2 (A, · ) / sup Aξ / + sup / Aξ, Aξ  /L .
/A∈A / 2→2 /
A∈A
2/
A∈A p
Lp Lp

The proof in [31] follows a chaining argument. We also refer to Sect. 1.5 for
decoupling from dependance to independence.
Theorem 7.8.5 (Theorem 3.4 of Krahmer, Mendelson, and Rauhut [31]). Let
L ≥ 1 and ξ = (ξ1 , . . . , ξn ) , where ξi , i = 1, . . . , n are independent mean-zero,
variance one, L-subGaussian random variables, and let A be a set of matrices.
Then, for every p ≥ 1,
/ /
/ / √
/ sup Aξ / L γ2 (A, ·
/A∈A 2/ 2→2 ) + dF (A) + pd2→2 (A) ,
 Lp 
 2 2
sup  Aξ 2 − E Aξ 2  L γ2 (A, · 2→2 ) [γ2 (A, · 2→2 ) + dF (A)]
A∈A

+ pd2→2 (A) [γ2 (A, · 2→2 ) + dF (A)] + pd22→2 (A) .

7.9 Concentration for Random Toeplitz Matrix

Unstructured random matrices [415] are studied for concentration of measure.


In practical applications, measurement matrices possess a certain structure[438–
440]. Toeplitz matrices arise from the convolution process. For a linear time-
N
invariant (LTI) system with system impulse response h = {hk }k=1 . Let x =
N +M −1
{xk }k=1 be the applied input discrete-time waveform. Suppose the xk and hk
are zero-padded from both sides. The output waveform is

N
yk = aj xk−j . (7.73)
j=1
7.9 Concentration for Random Toeplitz Matrix 409

Keeping only M consecutive observations of the output waveform, y =


N +M
{yk }k=N +1 , we rewrite (7.73) as

y = Xh, (7.74)

where
⎡ ⎤
xN xN −1 · · · x1
⎢ xN +1 xN · · · x2 ⎥
⎢ ⎥
X=⎢ .. .. . . .. ⎥ (7.75)
⎣ . . . . ⎦
xN +M −1 xN +M −2 · · · XM

N +M −1
is an M × N Toeplitz matrix. Here we consider the entries x = {xi }i=1
drawn from an i.i.d. Gaussian random sequences. To state the result, we need the
eigenvalues of the covariance matrix of the vector h defined as
⎡ ⎤
A (0) · · · A (M − 1)
A (1)
⎢ A (1) · · · A (M − 2) ⎥
A (0)
⎢ ⎥
R=⎢ .. .. .... ⎥ (7.76)
⎣ . . . . ⎦
A (M − 1) A (M − 2) · · · A (0)

where
N −τ
A (τ ) = hi hi+τ , τ = 0, 1, . . . , M − 1
i=1

is the un-normalized sample autocorrelation function of h ∈ RN . Let a 2 be the


Euclidean norm of the vector a.
Theorem 7.9.1 (Sanandaji, Vincent, and Wakin [439]). Let h ∈ RN be fixed.
Define two quantities

M
λ2i (R)
maxi λi (R) i=1
ρ (h) = 2 and μ (h) = 4 ,
h 2 M h 2

where λi (R) is the i-th eigenvalue of R. Let y = Xh, where X is a random


Toeplitz matrix (defined in (7.75))
 with i.i.d. Gaussian entries having zero-mean
2 2
and unit variance. Noting that E y 2 = M h 2 , then for any t ∈ (0, 1), the
upper tail probability bound is
0 1  
2 2 2 M 2
P y 2 −M h 2  tM h 2  exp − t (7.77)
8ρ (h)
410 7 Compressed Sensing and Sparse Recovery

and the lower tail probability bound is


0 1  
2 2 2 M 2
P y 2 −M h 2  −tM h 2  exp − t . (7.78)
8μ (h)

Allowing X to have M × N i.i.d. Gaussian entries with zero mean and unit variance
(thus no Toeplitz structure) will give the concentration bound [415]
0 1  
2 2 2 M 2
P y 2 −M h 2  tM h 2  2 exp − t .
4

Thus, to achieve the same probability bound for Toeplitz matrices requires choosing
M larger by a factor of 2ρ (h) or 2μ (h).
See [441], for the spectral norm of a random Toeplitz matrix. See also [442].

7.10 Deterministic Sensing Matrices

The central goal of compressed sensing is to capture attributes of a signal using very
few measurements. In most work to date, this broader objective is exemplified by
the important special case in which a k-sparse vector x ∈ Rn (with n large) is to
be reconstructed from a small number N of linear measurements with k < N < n.
In this problem, measurement data constitute a vector √1N Φx, where Φ is an N × n
matrix called the sensing matrix.
The two fundamental questions in compressed sensing are: how to construct
suitable sensing matrices Φ, and how to recover x from √1N Φx efficiently. In [443]
the authors constructed a large class of deterministic sensing matrices that satisfy a
statistical restricted isometry property. Because we will be interested in expected-
case performance only, we need not impose RIP; we shall instead work with the
weaker Statistical Restricted Isometry Property.
Definition 7.10.1 (Statistical restricted isometry property). An N × n (sensing)
matrix Φ is said to be a (k, δ, ε)-statistical restricted isometry property matrix if, for
k-sparse vectors x ∈ Rn , the inequalities
/ /2
/ 1 /
/ /  (1 + δ) x
2 2
(1 − δ) x 2 √
/ N Φx / 2 ,

hold with probability exceeding 1 − ε (with respect to a uniform distribution of the


vectors x among all k-sparse vectors in Rn of the same norm).
Norms without subscript denote 2 -norms. Discrete chirp sensing matrices are
studied in [443]. The proof of unique reconstruction [443] uses a version of the
classical McDiarmid concentration inequality.
Chapter 8
Matrix Completion and Low-Rank Matrix
Recovery

This chapter is a natural development following Chap. 7. In other words, Chaps. 7


and 8 may be viewed as two parallel developments. In Chap. 7, compressed
sensing exploits the sparsity structure in a vector, while low-rank matrix recovery—
Chap. 8—exploits the low-rank structure of a matrix: sparse in the vector composed
of singular values. The theory ultimately traces back to concentration of measure
due to high dimensions.

8.1 Low Rank Matrix Recovery

Sparsity recovery and compressed sensing are interchangeable terms. This sparsity
concept can be extended to the matrix case: sparsity recovery of the vector of
singular values. We follow [444] for this exposition.
The observed data y is modeled as

y = A (M) + z, (8.1)

where M is an unknown n1 × n2 matrix, A : Rn1 ×n2 → Rm is a linear mapping,


and z isan m-dimensional noise term. For  example, z is a Gaussian vector with
i.i.d. N 0, σ 2 entries, written as z ∼ N 0, σ 2 I where the covariance matrix is
essentially the identify matrix. The goal is to recover a good approximation of M
while requiring as few measurements as possible.
For some sequences of matrices Ai and with the standard inner product
A, X = Tr (A∗ X) where A∗ is the adjoint of A. Each Ai is similar to a
compressed sensing matrix. We has the intuition of forming a large matrix

R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 411
DOI 10.1007/978-1-4614-4544-9 8,
© Springer Science+Business Media New York 2014
412 8 Matrix Completion and Low-Rank Matrix Recovery

⎡ ⎤
vec (A1 )
⎢ vec (A2 ) ⎥
⎢ ⎥
A (X) = ⎢ .. ⎥ vec (X) (8.2)
⎣ . ⎦
vec (Am )

where vec (X) is a long vector obtained by stacking the columns of matrix X.
The matrix version of the restricted isometry property (RIP) is an integral tool in
proving theoretical results. For each integer r = 1, 2, . . . , n, the isometry constant
δr of A is the smallest value such that
2 2 2
(1 − δr ) X F  A (X) 2  (1 + δr ) X F (8.3)

holds for all matrices X of rank at most r.


We present only two algorithms. First, it is direct to have the optimization
problem

minimize X ∗

subject to A (v)  γ (8.4)
v = y − A (X)

where · is the operator norm and · ∗ is its dual, i.e., the nuclear norm. The
nuclear norm of a matrix X is the sum of the singular values of X and the
operator norm is its largest singular value. A∗ is the adjoint of A. X F is
the Frobenius norm (the 2 -norm of the vector of singular
 values).
Suppose z is a Gaussian vector with i.i.d. N 0, σ 2 entries, and let n =

max {n1 , n2 }. Then if C0 > 4 (1 + δ1 ) log 12

A∗ (z)  C0 nσ, s (8.5)

with probability at least 1 − 2e−cn for a fixed numerical constant c > 0. The scalar
δ1 is the restricted isometry constant at rank r = 1.
We can reformulate (8.4) as a semi-definite program (SDP)

minimize Tr (W1 ) /2 + Tr (W2 ) /2


 (8.6)
W1 X
subject to 0
X∗ W 2

with optimization variables X, W1 , W2 ∈ Rn×n . We say a matrix Q ≥ 0 if Q is


positive semi-definite (all its eigenvalues are nonnegative).
Second, the constraint A∗ (v)  γ is an SDP constraint since it can be
expressed as the linear matrix inequality (LMI)
8.2 Matrix Restricted Isometry Property 413


γIn A∗ (v)
∗ ∗  0.
[A (v)] γIn

As a result, (8.4) can be reformulated as the SDP

minimize Tr (W1 ) /2 + Tr (W2 ) /2


⎡ ⎤
W1 X 0 0
⎢ X∗ W 2 0 0 ⎥
subject to ⎢
⎣ 0 0
⎥0 (8.7)
γIn A (v) ⎦


0 0 [A∗ (v)] γIn
v = y − A (X) ,

with optimization variables X, W1 , W2 ∈ Rn×n .

8.2 Matrix Restricted Isometry Property

Non-Asymptotic Theory of Random Matrices Lecture 6: Norm of a Random Matrix


[445]
The matrix X∗ is the adjoint of X, and for the linear operator A : Rn1 ×n2 →
R , A∗ : Rm → Rn1 ×n2 is the adjoint operator. Specifically, if [A (X)]i =
m

Ai , X for all matrices X ∈ Rn1 ×n2 , then


m
A∗ (X) = v i Ai
i=1

T
for all vectors v = (v1 , . . . , vm ) ∈ Rm .
The matrix version of the restricted isometry property (RIP) is an integral tool in
proving theoretical results. For each integer r = 1, 2, . . . , n, the isometry constant
δr of A is the smallest value such that
2 2 2
(1 − δr ) X F  A (X) 2  (1 + δr ) X F (8.8)

holds for all matrices X of rank at most r. We say that A satisfies the RIP at rank r
if δr (or δ4r ) is bounded by a sufficiently small constant between 0 and 1.
Which linear maps A satisfy the RIP? One example is the Gaussian measurement
ensemble. A is a Gaussian measurement ensemble if each ‘row’ ai , 1  i  m,
contains i.i.d. N (0, 1/m) entries (and the ai ’s are independent from each other).
We have selected the variance of the entries to be 1/m so that for a fixed matrix X,
2 2
E A (X) 2 = X F .
414 8 Matrix Completion and Low-Rank Matrix Recovery

Theorem 8.2.1 (Recht et al. [446]). Fix 0 ≤ δ < 1 and let A is a random
measurement ensemble obeying the following conditions: for any given X ∈
Rn1 ×n2 and fixed 0 < t < 1,
  
 2 2 2
P  A (X) 2 − X F >t X F  C exp (−cm) (8.9)

for fixed constants C, t > 0 (which may depend on t). Then, if m ≥ Dnr, A satisfies
the RIP with isometry constant δr ≤ δ with probability exceeding 1 − Ce−dm for
fixed constants D, d > 0.
2
If A is a Gaussian random measurement ensemble, A (X) 2 is distributed as
m−1 X F times a chi-squared random variable with m degrees of freedom, and
2

we have
   m  
 2 2 2
P  A (X) 2 − X F  > t X F  2 exp t2 /2 − t3 /3 . (8.10)
2
Similarly, A satisfies (8.10) in the case when each entry
√ of each√‘row’ ai has i.i.d.
entries that are equally likely to take the values +1/ m or −1/ m [446], or if A
is a random projection [416, 446]. Finally, A satisfies (8.9) if the “rows” ai contain
sub-Gaussian entries [447].
In Theorem 8.2.1, the degrees of freedom of n1 × n2 matrix of rank r is r(n1 +
n2 − r).

8.3 Recovery Error Bounds

Given the observed vector y, the optimal solution to (8.4) is our estimator M̂ (y).
For the data vector and the linear model

y = Ax + z (8.11)
 
where A ∈ Rm×n and the zi ’s are i.i.d. N 0, σ 2 . Let λi AT A be the
eigenvalues of the matrix AT A. Then [448, p. 403]

2  −1
n
σ2
inf sup E x̂ − x 2 = σ 2 Tr AT A = (8.12)
x̂ x∈Rn
i=1
λi (AT A)

where x̂ is estimate of x.
Suppose that the measurement operator is  fixed and satisfies the RIP, and that
the noise vector z = (z1 , . . . , zn )T ∼ N 0, σ 2 In . Then any estimator M̂ (y)
obeys [444, 4.2.8]
/ /
/ / 1
sup E/M̂ (y) − M/  nrσ 2 . (8.13)
M:rank(M)r F 1 + δr
8.4 Low Rank Matrix Recovery for Hypothesis Detection 415

Further, we have [444, 4.2.9]


/ /2 
/ / 1
sup P /M̂ (y) − M/  nrσ 2  1 − e−nr/16 . (8.14)
M:rank(M)r F 2 (1 + δr )

8.4 Low Rank Matrix Recovery for Hypothesis Detection

Example 8.4.1 (Different convergence rates of sums of random matrices). Consider


n-dimensional random vectors y, x, n ∈ Rn

y =x+n

where vector x is independent of n. The components x1 , . . . , xn of the random


vector x are scalar valued random variables, and, in general, may be dependent
random variables. For the random vector n, this is similar. The true covariance
matrix has the relation

Ry = Rx + Rn ,

due to the independence between x and n.


Assume now there are N copies of random vector y:

y i = x i + ni , i = 1, 2, . . . , N.

Assume that xi are dependent random vectors, while ni are independent random
vectors.
Let us consider the sample covariance matrix

N N N
1 1 1
yi ⊗ yi∗ = xi ⊗ x∗i + ni ⊗ n∗i + junk
N i=1
N i=1
N i=1

where “junk” denotes another two terms. When the sample size N increases,
concentration of measure phenomenon occurs. It is remarkable that, on the right

N
hand side, the convergence rate of the first sample covariance matrix N1 xi ⊗ x∗i
i=1

N
and that of the second sample covariance matrix 1
N ni ⊗ n∗i are different! If we
i=1

N
further assume Rx is of low rank, R̂x = 1
N xi ⊗ x∗i converges to its true value
i=1

N
Rx ; R̂n = 1
N ni ⊗ n∗i converges to its true value Rn . Their convergence rates,
i=1
however, are different. &
%
416 8 Matrix Completion and Low-Rank Matrix Recovery

Example 8.4.1 illustrates the fundamental role of low rank matrix recovery in the
framework of signal plus noise. We can take advantage of the faster convergence of
the low rank structure of signal, since, for a given recovery accuracy, the required
samples N for low rank matrix recovery is O(r log(n)), where r is the rank of the
signal covariance matrix. Results such as Rudelson’s theorem in Sect. 5.4 are the
basis for understanding such a convergence rate.

8.5 High-Dimensional Statistics

High-dimensional statistics is concerned with models in which the ambient


dimension d of the problem may be of the same order as—or substantially larger
than—the sample size n. It is so called in the “large d, small n” regime.
The rapid development of data collection technology is a major driving force: it
allows for more observations (larger n) and also for more variables to be measured
(larger d). Examples are ubiquitous throughout science and engineering, including
gene array data, medical imaging, remote sensing, and astronomical data. Terabytes
of data are produced.
In the absence of additional structure, it is often impossible to obtain consistent
estimators unless the ratio d/n converges to zero. On the other hand, the advent of
the big data age requires solving inference problems with

d ' n,

so that consistency is not possible without imposing additional structure. Typical


values of n and d include: n = 100–2,500 and d = 100–20,000.
There are several lines of work within high-dimensional statistics, all of which
are based on low-dimensional constraint on the model space, and then studying the
behavior of different estimators. Examples [155, 449, 450] include
• Linear regression with sparse constraints
• Multivariate or multi-task forms of regression
• System identification for autoregressive processes
• Estimation of structured covariance or inverse covariance matrices
• Graphic model selection
• Sparse principal component analysis
• Low rank matrix recovery from random projections
• Matrix decomposition problems
• Estimation of sparse additive non-parametric models
• Collaborative filtering
On the computation side, many well-known estimators are based on a convex
optimization problem formed by the sum of a loss function with a weighted
regularizer. Examples of convex programs include
8.6 Matrix Compressed Sensing 417

• 1 -regularized quadratic programs (also known as the Lasso) for sparse linear
regression
• Second-order cone program (SOCP) for the group Lasso
• Semidefinite programming relaxation (SDP) for various problems include sparse
PCA and low-rank matrix estimation.

8.6 Matrix Compressed Sensing

Section 3.8 is the foundation for this section.


For a vector a, the p -norm is denoted by ||a||p . ||a||2 represents the Euclidean
norm. For pairs of matrices A, B ∈ Rm1 ×m2 , we define the trace inner product of
two matrices A, B as

A, B = Tr AT B .
?
m1 m2 2
The Frobenius or trace norm is defined as A F = |aij | . The element-
i=1 j=1
1 m
m 2
wise 1 -norm A 1 is defined as A 1 = |aij |.
i=1 j=1

8.6.1 Observation Model

A linear observation model [155] is defined as

Yi = Xi , A + Zi , i = 1, 2, . . . , N, (8.15)

which is specified by a sequence of random matrices Xi , and observation noise Zi .


Of course, Yi , Zi and the matrix inner products ϕi = Xi , A are scalar-valued
random variables, but not necessarily Gaussian. After defining the vectors
T
y = [Y1 , . . . , YN ]T , z = [Z1 , . . . , ZN ]T , ϕ = [ϕ1 , . . . , ϕN ] ,

we rewrite (8.15) as

y = ϕ + z. (8.16)

In order to highlight the ϕ as a linear functional of the matrix A, we use

y = ϕ(A) + z. (8.17)

The vector ϕ(A) is viewed as a (high but finite-dimensional) random Gaussian


operator mapping Rm1 ×m2 to RN . In a typical matrix compressed sensing [451],
418 8 Matrix Completion and Low-Rank Matrix Recovery

the observation matrix Xi ∈ Rm1 ×m2 has i.i.d. zero-mean, unit-variance Gaussian
N (0, 1) entries. In a more general observation model [155], the entries of Xi are
allowed to have general Gaussian dependencies.

8.6.2 Nuclear Norm Regularization

For a rectangular matrix B ∈ Rm1 ×m2 , the nuclear or trace norm is defined as

min{m1 ,m2 }
B ∗ = σi (B),
i=1

which is the sum of its singular values that are sorted in non-increasing order. The
maximum singular value is σmax = σ1 and the minimum singular value σmin =
σmin{m1 ,m2 } . The operator norm is defined as A op = σ1 (A). Given a collection
of observation (Yi , Xi ) ∈ R × Rm1 ×m2 , the problem at hand is to estimate the
unknown matrix A ∈ S, where S is a general convex subset of Rm1 ×m2 . The
problem may be formulated as an optimization problem
 
1 2
 ∈ arg min y − ϕ(A) 2 + γN A ∗ . (8.18)
A∈S 2N

Equation (8.18) is a semidefinite program (SDP) convex optimization problem [48],


which can be solved efficiently using standard software packages.
A natural question arises: How accurate will the solution  in (8.18) be when
compared with the true unknown A ?

8.6.3 Restricted Strong Convexity

The key condition for us to control the matrix error A − Â between Â, the
SDP solution (8.18), and the unknown matrix A is the so-called restricted
strong convexity. This condition guarantees that the quadratic loss function in
the SDP (8.18) is strictly convex over a restricted set of directions. Let the set
C ⊆ R × Rm1 ×m2 denote the restricted directions. We say the random operator
ϕ satisfies restricted strong convexity over the set C if there exists some κ (ϕ) > 0
such that
1 2 2
ϕ(Δ) 2  κ (ϕ) Δ F for all Δ ∈ C. (8.19)
2N
Recall that N is the number of observations defined in (8.15).
8.6 Matrix Compressed Sensing 419

Let r an integer r  m = min {m1 , m2 } and δ ≥ 0 be a tolerance parameter.


The set C(r; δ) defines a set whose conditions are too technical to state here.
Another ingredient is the choice of the regularization parameter γN used in
solving the SDP (8.18).

8.6.4 Error Bounds for Low-Rank Matrix Recovery

Now we are ready to state our main results.


Theorem 8.6.1 (Exact low-rank matrix recovery [155]). Suppose A ∈ S has
rank r, and the random operator ϕ satisfies the /restricted/strong convexity with
/N /
respect to the set C(r; 0). Then, as long as γN  2/ / N
/ εi Xi / /N , where {εi }i=1
i=1 op
are random variables, any optimal solution  to the SDP (8.18) satisfies the bound
/ / √
/ / 32γN r
/Â − A /  . (8.20)
F κ (ϕ)

Theorem 8.6.1 is a deterministic statement on the SDP error.


Sometimes, the unknown matrix A is nearly low rank: Its singular value
N
sequence {σi (A )}i=1 decays quickly enough. For a parameter q ∈ (0, 1) and a
positive radius Rq , we define the ball set
⎧ ⎫
⎨ min{m1 ,m2 } ⎬
A ∈ Rm1 ×m2 :
q
B (Rq ) = |σi (A)|  Rq . (8.21)
⎩ ⎭
i=1

When q = 0, the set B (R0 ) corresponds to the set of matrices with rank at most R0 .
Theorem 8.6.2 (Nearly low-rank matrix recovery [155]). Suppose / that A/ ∈ B
/N /
(Rq )∩S, the regularization parameter is lower bounded as γN  2/ /
/ εi Xi / /N ,
i=1 op
N
where {εi }i=1 are random variables, and the random operator ϕ satisfies
the restricted strong convexity with parameter κ (ϕ) ∈ (0, 1) over the set
C(Rq /γN q ; δ). Then, any solution  to the SDP (8.18) satisfies the bound
+  1−q/2 ,
/ / 
/ / γN
/Â − A /  max δ, 32 Rq . (8.22)
F κ (ϕ)

The error (8.22) reduces to the exact rank case (8.20) when q = 0 and δ = 0.
Example 8.6.3 (Matrix compressed sensing with dependent sampling [155]).
A standard matrix compressed sensing has the form
420 8 Matrix Completion and Low-Rank Matrix Recovery

Yi = Xi , A + Zi , i = 1, 2, . . . , N, (8.23)

where the observation matrix Xi ∈ Rm1 ×m2 has i.i.d. standard Gaussian N (0, 1)
entries. Equation (8.23) is an instance of (8.15). Here, we study a more general
observation model, in which the entries of Xi are allowed to have general Gaussian
dependence.
Equation (8.17) involves a random Gaussian operator mapping Rm1 ×m2 to RN .
We repeat some definitions in Sect. 3.8 for convenience. For a matrix A ∈
Rm1 ×m2 , we use vector vec(A) ∈ RM , M = m1 m2 . Given a symmetric positive
definite matrix Σ ∈ RM ×M , we say that the random matrix Xi is sampled from the
Σ-ensemble if

vec(Xi ) ∼ N (0, Σ) .

We define the quantity



ρ2 (Σ) = sup var uT Xv ,
u1 =1,v1 =1

where the random matrix X ∈ Rm1 ×m2 is sampled from the Σ-ensemble. For the
2
special case (white Gaussian random vector) Σ = I, we have
√ ρ (Σ) = 1.
The noise vector  ∈ R satisfies the bound  2  2ν N for some constant ν.
N

This assumption holds for any bounded noise, and also holds with high probability
for any random noise vector with sub-Gaussian entries with parameter ν. The
simplest case is that of Gaussian noise N (0, ν 2 ).
N
Suppose that the matrices {Xi }i=1 are drawn i.i.d. from the Σ-ensemble, and
that the unknown matrix A ∈ B (Rq ) ∩ S for some q ∈ (0, 1]. Then there are
universal constant c0 , c1 , c2 such that a sample size

N > c1 ρ2 (Σ) Rq1−q/2 (m1 + m2 ) ,

any solution  to the SDP (8.18) with regularization parameter


"
2 m1 + m 2
γN = c0 ρ (Σ) ν
N
satisfies the bound
/ /2  1−q/2 
/ / m1 + m 2  2  2
P /Â − A /  c2 Rq ν ∨ 1 ρ (Σ) /σmin (Σ)
2
F N

 c3 exp (−c4 (m1 + m2 )) . (8.24)

For the special case of q = 0 and  of rank r, we have


/ /2 
/ / ρ2 (Σ) ν 2 r (m1 + m2 )
P /Â − A /  c2 2  c3 exp (−c4 (m1 + m2 )) .
F σmin (Σ) N
(8.25)
8.6 Matrix Compressed Sensing 421

In other words, we have


/ /2 ρ2 (Σ) ν 2 r (m1 + m2 )
/ /
/Â − A /  c2 2
F σmin (Σ) N

with probability at least 1 − c3 exp (−c4 (m1 + m2 )).


The central challenge to prove (8.25) is to use Theorem 3.8.5. &
%
Example 8.6.4 (Low-rank multivariate regression [155]). Consider the observation
pairs linked by the vector pairs

yi = A zi + wi , i = 1, 2, . . . , n,

where wi ∼ N 0, ν 2 Im1 ×m2 is observation noise vector. We assume that the
covariates zi are random, i.e., zi ∼ N (0, Σ), i.i.d. for some m2 -dimensional
covariance matrix Σ > 0.
Consider A ∈ B (Rq ) ∩ S. There are universal constants c1 , c2 , c3 such that if
we solve the SDP (8.18) with regularization parameter
"
ν  m1 + m 2
γN = 10 σmax (Σ) ,
m1 n

we have
 2 2 1−q/2  1−q/2 
/ /2
/ / ν σmax (Σ) m1 + m 2
P /Â − A /  c1 2 Rq
F σmin (Σ) n

 c2 exp (−c3 (m1 + m2 )) . (8.26)

When Σ = Im2 ×m2 , there is a constant c1 such that


 1−q/2 
/ /2
/ /  m1 + m 2
P /Â − A /  c1 ν 2−q
Rq  c2 exp (−c3 (m1 + m2 )) .
F n

When A is exactly low rank—that is q = 0 and r = R0 —this simplifies further to


/ /2  
/ / m1 + m2
P /Â − A /  c1  ν 2 r  c2 exp (−c3 (m1 + m2 )) .
F n

In other words, we have

/ /2  
/ / m1 + m2
/Â − A /  c1  ν 2 r
F n

with probability at least 1 − c2 exp (−c3 (m1 + m2 )). &


%
422 8 Matrix Completion and Low-Rank Matrix Recovery

Example 8.6.5 (Vector autoregressive processes [155]). A vector autoregressive (VAR) process [452] in $m$ dimensions is a stochastic process $\{\mathbf{z}_t\}_{t=1}^{n}$ specified by an initialization $\mathbf{z}_1\in\mathbb{R}^m$, followed by the recursion

$$\mathbf{z}_{t+1} = A\,\mathbf{z}_t + \mathbf{w}_t, \quad t = 1, 2, \ldots, n. \qquad (8.27)$$

In this recursion, the sequence $\{\mathbf{w}_t\}\subset\mathbb{R}^m$ consists of i.i.d. samples of innovation noise. We assume that each vector $\mathbf{w}_t\in\mathbb{R}^m$ is zero-mean with covariance matrix $C > 0$, so that the process $\{\mathbf{z}_t\}_{t=1}^{n}$ is zero-mean with covariance matrix $\Sigma$ defined by the discrete-time Riccati equation

$$\Sigma = A\,\Sigma\,A^T + C.$$

The goal of the problem is to estimate the unknown matrix $A\in\mathbb{R}^{m\times m}$ on the basis of a sequence of vector samples $\{\mathbf{z}_t\}_{t=1}^{n}$. It is natural to expect that the system is controlled primarily by a low-dimensional subset of variables, implying that $A$ is of low rank. Besides, $A$ is a Hankel matrix.

Since $\mathbf{z}_t = [Z_{t1}\ \cdots\ Z_{tm}]^T$ is an $m$-dimensional column vector, the sample size in terms of scalar random variables is $N = nm$. Letting $k = 1, 2, \ldots, m$ index the dimension, we have

$$Z_{(t+1)k} = \left\langle\mathbf{e}_k\mathbf{z}_t^T,\, A\right\rangle + W_{tk}. \qquad (8.28)$$

We re-index the collection of $N = nm$ observations via the mapping

$$(t, k)\;\mapsto\; i = (t-1)m + k.$$

After doing this, the autoregressive problem can be written in the form of (8.15) with $Y_i = Z_{(t+1)k}$ and observation matrix $X_i = \mathbf{e}_k\mathbf{z}_t^T$.

Suppose that we are given $n$ samples $\{\mathbf{z}_t\}_{t=1}^{n}$ from an $m$-dimensional autoregressive process (8.27) that is stationary, based on a system matrix that is stable ($\|A\|_{op}\le\alpha < 1$) and approximately low rank ($A\in\mathbb{B}(R_q)\cap\mathcal{S}$). Then there are universal constants $c_1, c_2, c_3$ such that if we solve the SDP (8.18) with regularization parameter

$$\gamma_N = \frac{2c_0}{m}\sqrt{\frac{\|\Sigma\|_{op}\, m}{(1-\alpha)\, n}},$$

then any solution $\hat{A}$ satisfies

$$P\!\left[\,\big\|\hat{A}-A\big\|_F^2 \;\ge\; c_1\left(\frac{\sigma_{\max}(\Sigma)}{\sigma_{\min}^2(\Sigma)}\right)^{1-q/2} R_q\left(\frac{m}{n}\right)^{1-q/2}\right] \;\le\; c_2\exp(-c_3 m). \qquad (8.29)$$

To prove (8.29), we need the results (8.30) and (8.31) below. Introduce the notation

$$X = \begin{bmatrix}\mathbf{z}_1^T\\ \mathbf{z}_2^T\\ \vdots\\ \mathbf{z}_n^T\end{bmatrix}\in\mathbb{R}^{n\times m} \quad\text{and}\quad Y = \begin{bmatrix}\mathbf{z}_2^T\\ \mathbf{z}_3^T\\ \vdots\\ \mathbf{z}_{n+1}^T\end{bmatrix}\in\mathbb{R}^{n\times m}.$$

Let $W$ be a matrix where each row is sampled i.i.d. from the $N(0, C)$ distribution corresponding to the innovation noise driving the VAR process. With this notation, and the relation $N = nm$, the SDP objective function (8.18) is written as

$$\frac{1}{m}\left\{\frac{1}{2n}\big\|Y - XA^T\big\|_F^2\right\} + \gamma_n\|A\|_*,$$

where $\gamma_n = \gamma_N\, m$.

The eigenspectrum of the matrix $X^TX/n$ is well controlled in terms of the stationary covariance matrix: in particular, as long as $n > c_3 m$, we have

$$P\!\left[\sigma_{\max}\!\left(X^TX/n\right)\ge\frac{24\,\sigma_{\max}(\Sigma)}{1-\alpha}\right]\le 2c_1\exp(-c_2 m), \quad\text{and}\quad P\!\left[\sigma_{\min}\!\left(X^TX/n\right)\le 0.25\,\sigma_{\min}(\Sigma)\right]\le 2c_1\exp(-c_2 m). \qquad (8.30)$$

There exist constants $c_i > 0$, independent of $n$, $m$, $\Sigma$, etc., such that

$$P\!\left[\frac{1}{n}\big\|X^TW\big\|_{op}\ge c_0\sqrt{\frac{\|\Sigma\|_{op}\, m}{(1-\alpha)\, n}}\right]\le c_2\exp(-c_3 m). \qquad (8.31)$$

⊓⊔
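A short simulation helps make the notation of this example concrete. The following numpy sketch (the dimension, sample size, rank, and noise level are arbitrary toy choices) simulates a stable low-rank VAR(1) process, stacks the samples into the matrices $X$ and $Y$ defined above, and computes the ordinary least-squares estimate of $A$; the nuclear-norm-penalized variant of the text would simply add the penalty of (8.18) to this objective.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 20, 500, 3                         # dimension, samples, true rank
UV = rng.standard_normal((m, r)) @ rng.standard_normal((r, m))
A_true = 0.9 * UV / np.linalg.norm(UV, 2)    # low-rank, stable: ||A||_op <= 0.9 < 1

Z = np.zeros((n + 1, m))
for t in range(n):
    Z[t + 1] = A_true @ Z[t] + 0.1 * rng.standard_normal(m)   # innovation noise

X, Y = Z[:n], Z[1:n + 1]                     # rows z_t^T and z_{t+1}^T, as in the text
A_ls = np.linalg.lstsq(X, Y, rcond=None)[0].T   # ordinary least squares: Y ~ X A^T
print("true rank:", r)
print("leading singular values of A_ls:",
      np.round(np.linalg.svd(A_ls, compute_uv=False)[:5], 3))
```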

8.7 Linear Regression

Consider a standard linear regression model

$$\mathbf{y} = X\boldsymbol{\beta} + \mathbf{w}, \qquad (8.32)$$

where $\mathbf{y}\in\mathbb{R}^n$ is an observation vector, $X\in\mathbb{R}^{n\times d}$ is a design matrix, $\boldsymbol{\beta}$ is the unknown regression vector, and $\mathbf{w}\in\mathbb{R}^n$ is additive Gaussian noise, i.e., $\mathbf{w}\sim N\left(0,\sigma^2 I_{n\times n}\right)$, where $I_{n\times n}$ is the $n\times n$ identity matrix. As pointed out above, consistent estimation of $\boldsymbol{\beta}$ is impossible unless we impose some additional structure on the unknown vector $\boldsymbol{\beta}$. We consider a sparsity constraint here: $\boldsymbol{\beta}$ has exactly $s\ll d$ non-zero entries.

The notion of sparsity can be defined more precisely in terms of the $\ell_p$-balls¹ for $p\in(0,1]$, defined as [135]

$$\mathbb{B}_p(R_p) = \left\{\boldsymbol{\beta}\in\mathbb{R}^d : \|\boldsymbol{\beta}\|_p^p = \sum_{i=1}^{d}|\beta_i|^p \le R_p\right\}. \qquad (8.33)$$

In the limiting case of $p = 0$, we have the $\ell_0$-ball

$$\mathbb{B}_0(s) = \left\{\boldsymbol{\beta}\in\mathbb{R}^d : \sum_{i=1}^{d} I[\beta_i\ne 0]\le s\right\}, \qquad (8.34)$$

where $I$ is the indicator function, so that $\boldsymbol{\beta}$ has at most $s\ll d$ non-zero entries.

The unknown vector $\boldsymbol{\beta}$ can be estimated by solving the optimization problem

$$\begin{array}{ll}\text{minimize} & \|\mathbf{y}-X\boldsymbol{\beta}\|_2^2\\ \text{subject to} & \|\boldsymbol{\beta}\|_p^p\le R_p,\end{array} \qquad (8.35)$$

which is convex for $p = 1$.

Sometimes we are interested in vectors $\mathbf{v}$ whose sparsity is bounded by $2s$ and whose $\ell_2$-norm is bounded by $R$:

$$\mathbb{S}(s, R) = \left\{\mathbf{v}\in\mathbb{R}^d : \|\mathbf{v}\|_0\le 2s,\ \|\mathbf{v}\|_2\le R\right\}.$$

The following random variable $Z_n$ is of independent interest:

$$Z_n = \sup_{\mathbf{v}\in\mathbb{S}(s,R)}\frac{1}{n}\left|\mathbf{w}^TX\mathbf{v}\right|, \qquad (8.36)$$

where $X$, $\mathbf{w}$ are given in (8.32). Let us show how to bound the random variable $Z_n$. The approach used by Raskutti et al. [135] is emphasized and followed closely here.

For a given $\varepsilon\in(0,1)$ to be chosen, we need to bound the minimal cardinality of a set that covers $\mathbb{S}(s,R)$ up to $(R\varepsilon)$-accuracy in $\ell_2$-norm. We claim that we may find such a covering set $\{\mathbf{v}^1,\ldots,\mathbf{v}^N\}\subset\mathbb{S}(s,R)$ with cardinality $N = N(s,R,\varepsilon)$ that is upper bounded by

$$\log N(s,R,\varepsilon)\le\log\binom{d}{2s} + 2s\log(1/\varepsilon).$$

¹ Strictly speaking, these sets are not "balls" when $p < 1$, since they fail to be convex.

 
To establish the claim, we note that there are $\binom{d}{2s}$ subsets of size $2s$ within $\{1, 2, \ldots, d\}$. Also, for any $2s$-sized subset, there is an $(R\varepsilon)$-covering in $\ell_2$-norm of the ball $\mathbb{B}_2(R)$ with at most $2^{2s\log(1/\varepsilon)}$ elements (e.g., [154]).

As a result, for each vector $\mathbf{v}\in\mathbb{S}(s,R)$, we may find some $\mathbf{v}^k$ such that $\|\mathbf{v}-\mathbf{v}^k\|_2\le R\varepsilon$. By the triangle inequality, we obtain

$$\frac{1}{n}\left|\mathbf{w}^TX\mathbf{v}\right|\le\frac{1}{n}\left|\mathbf{w}^TX\mathbf{v}^k\right| + \frac{1}{n}\left|\mathbf{w}^TX(\mathbf{v}-\mathbf{v}^k)\right|\le\frac{1}{n}\left|\mathbf{w}^TX\mathbf{v}^k\right| + \frac{1}{n}\|\mathbf{w}\|_2\,\big\|X(\mathbf{v}-\mathbf{v}^k)\big\|_2.$$

Now we make the explicit assumption that

$$\frac{1}{\sqrt{n}}\,\frac{\|X\mathbf{v}\|_2}{\|\mathbf{v}\|_2}\le\kappa, \quad\text{for all } \mathbf{v}\in\mathbb{B}_0(2s).$$

With this assumption, it follows that

$$\big\|X(\mathbf{v}-\mathbf{v}^k)\big\|_2\big/\sqrt{n}\;\le\;\kappa\,\big\|\mathbf{v}-\mathbf{v}^k\big\|_2\;\le\;\kappa R\varepsilon.$$

Since the vector $\mathbf{w}\in\mathbb{R}^n$ is Gaussian, the variate $\|\mathbf{w}\|_2^2/\sigma^2$ follows the $\chi^2$ distribution with $n$ degrees of freedom, so $\frac{1}{\sqrt{n}}\|\mathbf{w}\|_2\le 2\sigma$ with probability at least $1 - c_1\exp(-c_2 n)$, where $c_1, c_2$ are numerical constants, using standard tail bounds (see Sect. 3.2). Putting all the pieces together, we obtain

$$\frac{1}{n}\left|\mathbf{w}^TX\mathbf{v}\right|\le\frac{1}{n}\left|\mathbf{w}^TX\mathbf{v}^k\right| + 2\kappa\sigma R\varepsilon$$

with high probability. Taking the supremum over $\mathbf{v}$ on both sides gives

$$Z_n\le\max_{k=1,2,\ldots,N}\frac{1}{n}\left|\mathbf{w}^TX\mathbf{v}^k\right| + 2\kappa\sigma R\varepsilon.$$

It remains to bound the finite maximum over the covering set. First, we see that each variate $\frac{1}{n}\mathbf{w}^TX\mathbf{v}^k$ is zero-mean Gaussian with variance $\sigma^2\|X\mathbf{v}^k\|_2^2/n^2$. By standard Gaussian tail bounds, we conclude that

$$Z_n\le\sigma R\kappa\sqrt{3\log N(s,R,\varepsilon)/n} + 2\kappa\sigma R\varepsilon = \sigma R\kappa\left(\sqrt{3\log N(s,R,\varepsilon)/n} + 2\varepsilon\right) \qquad (8.37)$$

with probability greater than $1 - c_1\exp\left(-c_2\log N(s,R,\varepsilon)\right)$.

Finally, suppose $\varepsilon = \sqrt{s\log(d/2s)/n}$. With this choice, and assuming that $n\le d$, we have

$$\begin{aligned}
\frac{\log N(s,R,\varepsilon)}{n} &\le \frac{\log\binom{d}{2s}}{n} + \frac{s\log\!\big(\tfrac{n}{s\log(d/2s)}\big)}{n}\\
&\le \frac{\log\binom{d}{2s}}{n} + \frac{s\log(d/s)}{n}\\
&\le \frac{2s + 2s\log(d/s)}{n} + \frac{s\log(d/s)}{n},
\end{aligned}$$

where the final line uses standard bounds on binomial coefficients. Since $d/s\ge 2$ by assumption, we conclude that our choice of $\varepsilon$ guarantees that

$$\frac{\log N(s,R,\varepsilon)}{n}\le\frac{5\,s\log(d/s)}{n}.$$

Inserting these relations into (8.37), we conclude that

$$Z_n\le 6\,\sigma R\kappa\sqrt{\frac{s\log(d/s)}{n}}.$$

Since $\log N(s,R,\varepsilon)\ge s\log(d-2s)$, this event occurs with probability at least $1 - c_1\exp\left(-c_2\min\{n,\ s\log(d-s)\}\right)$. We summarize these results in the following theorem.
Theorem 8.7.1 ([135]). Suppose the design matrix $X$ satisfies $\frac{1}{\sqrt{n}}\frac{\|X\mathbf{v}\|_2}{\|\mathbf{v}\|_2}\le\kappa$ for all $\mathbf{v}$ with at most $2s$ non-zeros, i.e., $\mathbf{v}\in\mathbb{B}_0(2s)$, and $\mathbf{w}\in\mathbb{R}^n$ is additive Gaussian noise, i.e., $\mathbf{w}\sim N\left(0,\sigma^2 I_{n\times n}\right)$. Then for any radius $R > 0$, we have

$$\sup_{\|\mathbf{v}\|_0\le 2s,\ \|\mathbf{v}\|_2\le R}\ \frac{1}{n}\left|\mathbf{w}^TX\mathbf{v}\right|\;\le\;6\,\sigma\kappa R\sqrt{\frac{s\log(d/s)}{n}}$$

with probability at least $1 - c_1\exp\left(-c_2\min\{n,\ s\log(d-s)\}\right)$.


Let us apply Theorem 8.7.1. Let β  be a feasible solution of (8.35). We have
2 2
y − Xβ 2  y − Xβ  2 .

Define the error vector e = (β − β  ). After some algebra, we obtain

1 2  T 
w Xe
2
Xv 2 
n n
the right-hand side of which is exactly the expression required by Theorem 8.7.1, if
we identify v = e.
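For $p = 1$, the constrained estimator (8.35) is a convex program that can be solved directly. The following CVXPY sketch (the problem sizes, noise level, and the choice of the radius `R1` are assumptions for this toy instance) illustrates the estimator whose error the argument above controls.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(9)
n, d, s = 100, 200, 5                        # samples, dimension, sparsity
X = rng.standard_normal((n, d)) / np.sqrt(n)
beta_true = np.zeros(d)
beta_true[:s] = rng.standard_normal(s)
y = X @ beta_true + 0.05 * rng.standard_normal(n)

R1 = np.sum(np.abs(beta_true))               # l1 radius R_p (p = 1), assumed known here
beta = cp.Variable(d)
prob = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ beta)),
                  [cp.norm(beta, 1) <= R1])  # constraint of (8.35) with p = 1
prob.solve()
print("estimation error:", np.linalg.norm(beta.value - beta_true))
```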

8.8 Multi-task Matrix Regression

We are given a collection of $d_2$ regression problems in $\mathbb{R}^{d_1}$, each of the form

$$\mathbf{y}_i = X\boldsymbol{\beta}_i + \mathbf{w}_i, \quad i = 1, 2, \ldots, d_2,$$

where $\boldsymbol{\beta}_i\in\mathbb{R}^{d_1}$ is an unknown regression vector, $\mathbf{w}_i\in\mathbb{R}^n$ is the observation noise, and $X\in\mathbb{R}^{n\times d_1}$ is the (random) design matrix. In a convenient matrix form, we have

$$Y = XB + W, \qquad (8.38)$$

where $Y = [\mathbf{y}_1,\ldots,\mathbf{y}_{d_2}]$ and $W = [\mathbf{w}_1,\ldots,\mathbf{w}_{d_2}]$ are both matrices in $\mathbb{R}^{n\times d_2}$ and $B = [\boldsymbol{\beta}_1,\ldots,\boldsymbol{\beta}_{d_2}]\in\mathbb{R}^{d_1\times d_2}$ is a matrix of regression vectors. In multi-task learning, each column of $B$ is called a task and each row of $B$ a feature.

A special structure has the form of a low-rank plus sparse decomposition

$$B = \Theta + \Gamma,$$

where $\Theta$ is of low rank and $\Gamma$ is sparse, with a small number of non-zero entries. For example, $\Gamma$ may be row-sparse, with a small number of non-zero rows. It follows from (8.38) that

$$Y = X(\Theta+\Gamma) + W. \qquad (8.39)$$

In the following examples, the entries of $W$ are assumed to be i.i.d. zero-mean Gaussian with variance $\nu^2$, i.e., $W_{ij}\sim N(0,\nu^2)$.
Example 8.8.1 (Concentration for the product of two random matrices [453]). Consider the product of the two random matrices defined above,

$$Z = X^TW\in\mathbb{R}^{d_1\times d_2}.$$

It can be shown that the matrix $Z$ has independent columns, with each column $\mathbf{z}_j\sim N\left(0,\nu^2 X^TX/n\right)$. Let $\sigma_{\max}$ be the maximum singular value of the matrix $X$. Since $\|X^TX\|_{op}\le\sigma_{\max}^2$, known results on the singular values of Gaussian random matrices [145] imply that

$$P\!\left[\big\|X^TW\big\|_{op}\ge\frac{4(d_1+d_2)\,\nu\,\sigma_{\max}}{\sqrt{n}}\right]\le 2\exp\left(-c(d_1+d_2)\right).$$

Let $\mathbf{x}_j$ be the $j$-th column of the matrix $X$, and let $\kappa_{\max} = \max_{j=1,\ldots,d_1}\|\mathbf{x}_j\|_2$ be the maximum $\ell_2$-norm over columns. Since the $\ell_2$-norms of the columns of $X$ are bounded by $\kappa_{\max}$, the entries of $X^TW$ are Gaussian with variance at most $(\nu\kappa_{\max})^2/n$. As a result, the standard Gaussian tail bound combined with the union bound gives

$$P\!\left[\big\|X^TW\big\|_\infty\ge\frac{4\nu\sigma_{\max}\sqrt{\log(d_1d_2)}}{\sqrt{n}}\right]\le\exp\left(-\log(d_1d_2)\right),$$

where $\|A\|_\infty$ for a matrix $A$ with $(i,j)$-th element $a_{ij}$ is defined as

$$\|A\|_\infty = \max_{i=1,\ldots,d_1}\ \max_{j=1,\ldots,d_2}|a_{ij}|. \qquad \text{⊓⊔}$$
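The operator-norm concentration of this example can be examined numerically. The following numpy sketch (dimensions, noise level, and number of trials are arbitrary choices) estimates the distribution of $\|X^TW\|_{op}$ by Monte Carlo and compares it with the crude deterministic bound $\|X\|_{op}\|W\|_{op}$ evaluated at the usual Gaussian estimate $\mathbb{E}\|W\|_{op}\approx\nu(\sqrt{n}+\sqrt{d_2})$; it is a sanity check, not a verification of the exact constants above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d1, d2, nu, trials = 200, 30, 40, 0.5, 300
X = rng.standard_normal((n, d1))                  # fixed design matrix
sigma_max = np.linalg.norm(X, 2)                  # largest singular value of X

ops = np.array([np.linalg.norm(X.T @ (nu * rng.standard_normal((n, d2))), 2)
                for _ in range(trials)])          # ||X^T W||_op over noise draws

print("mean ||X^T W||_op          :", ops.mean())
print("95th percentile            :", np.quantile(ops, 0.95))
print("||X||_op * nu*(sqrt n+sqrt d2):", sigma_max * nu * (np.sqrt(n) + np.sqrt(d2)))
```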

Example 8.8.2 (Concentration for the columns of a random matrix [453]). As defined above, $\mathbf{w}_k$ is the $k$-th column of the matrix $W$. The function $\mathbf{w}_k\mapsto\|\mathbf{w}_k\|_2$ is Lipschitz. By concentration of measure for Gaussian Lipschitz functions [141], we have for all $t > 0$

$$P\left(\|\mathbf{w}_k\|_2\ge\mathbb{E}\|\mathbf{w}_k\|_2 + t\right)\le\exp\left(-\frac{t^2 d_1d_2}{2\nu^2}\right).$$

Using the Gaussianity of $\mathbf{w}_k$, we have

$$\mathbb{E}\|\mathbf{w}_k\|_2\le\frac{\nu}{\sqrt{d_1d_2}}\sqrt{d_1} = \frac{\nu}{\sqrt{d_2}}.$$

Applying the union bound over all $d_2$ columns, we conclude that

$$P\left(\max_{k=1,2,\ldots,d_2}\|\mathbf{w}_k\|_2\ge\frac{\nu}{\sqrt{d_2}} + t\right)\le\exp\left(-\frac{t^2 d_1d_2}{2\nu^2} + \log d_2\right).$$

That is, with probability greater than $1 - \exp\left(-\frac{t^2 d_1d_2}{2\nu^2} + \log d_2\right)$, we have $\max_k\|\mathbf{w}_k\|_2\le\frac{\nu}{\sqrt{d_2}} + t$. Setting $t = 4\nu\sqrt{\frac{\log d_2}{d_1d_2}}$ gives

$$P\left(\max_{k=1,2,\ldots,d_2}\|\mathbf{w}_k\|_2\ge\frac{\nu}{\sqrt{d_2}} + 4\nu\sqrt{\frac{\log d_2}{d_1d_2}}\right)\le\exp(-3\log d_2). \qquad \text{⊓⊔}$$

Example 8.8.3 (Concentration for the trace inner product of two matrices [453]). We study the function defined as

$$Z(s) = \sup_{\|\Delta\|_1\le\sqrt{s},\ \|\Delta\|_F\le 1}\left|\langle W,\Delta\rangle\right|.$$

Viewed as a function of the matrix $W$, the random variable $Z(s)$ is a Lipschitz function with constant $\frac{\nu}{\sqrt{d_1d_2}}$. Using Talagrand's concentration inequality, we obtain

$$P\left(Z(s)\ge\mathbb{E}[Z(s)] + t\right)\le\exp\left(-\frac{t^2 d_1d_2}{2\nu^2}\right).$$

Setting $t^2 = \frac{4s\nu^2}{d_1d_2}\log\frac{d_1d_2}{s}$, we have

$$Z(s)\le\mathbb{E}[Z(s)] + 2\nu\sqrt{\frac{s}{d_1d_2}\log\frac{d_1d_2}{s}}$$

with probability at least

$$1 - \exp\left(-2s\log\frac{d_1d_2}{s}\right).$$

It remains to upper bound the expected value $\mathbb{E}[Z(s)]$. To do so, we use [153, Theorem 5.1(ii)] with $(q_0,q_1) = (1,2)$, $n = d_1d_2$, and $t = \sqrt{s}$, thereby obtaining

$$\mathbb{E}[Z(s)]\le c\,\frac{\nu}{\sqrt{d_1d_2}}\sqrt{s}\left(2 + \sqrt{\log\frac{2d_1d_2}{s}}\right)\le c'\,\frac{\nu}{\sqrt{d_1d_2}}\sqrt{s\log\frac{2d_1d_2}{s}}.$$

Define the notation

$$\|A\|_{2,1} = \sum_{k=1}^{d_2}\|\mathbf{a}_k\|_2,$$

where $\mathbf{a}_k$ is the $k$-th column of the matrix $A\in\mathbb{R}^{d_1\times d_2}$. We can similarly study the function

$$\tilde{Z}(s) = \sup_{\|\Delta\|_{2,1}\le\sqrt{s},\ \|\Delta\|_F\le 1}\left|\langle W,\Delta\rangle\right|,$$

which is Lipschitz with constant $\frac{\nu}{\sqrt{d_1d_2}}$. Similarly to the above, we can use the standard two-step approach: (1) derive concentration of measure for the Gaussian Lipschitz function; (2) upper bound the expectation. For details, see [453]. ⊓⊔

8.9 Matrix Completion

This section is taken from Recht [103] for low rank matrix recovery, primarily due
to his highly accessible presentation.

The nuclear norm $\|X\|_*$ of a matrix $X$ is equal to the sum of its singular values, $\sum_i\sigma_i(X)$, and is the best convex lower bound of the rank function, whose minimization is NP-hard. The intuition behind this heuristic is that while the rank function counts the number of nonvanishing singular values, the nuclear norm sums their amplitudes, much like how the $\ell_1$ norm is a useful surrogate for counting the number of nonzeros in a vector. Moreover, the nuclear norm can be minimized subject to equality constraints via semidefinite programming (SDP).²

Let us review some matrix preliminaries and fix the notation. Matrices are bold capital, vectors are bold lower case, and scalars or entries are not bold. For example, $X$ is a matrix and $X_{ij}$ its $(i,j)$-th entry. Likewise, $\mathbf{x}$ is a vector and $x_i$ its $i$-th component. If $\mathbf{u}_k\in\mathbb{R}^n$ for $1\le k\le d$ is a collection of vectors, $[\mathbf{u}_1,\ldots,\mathbf{u}_d]$ will denote the $n\times d$ matrix whose $k$-th column is $\mathbf{u}_k$. $\mathbf{e}_k$ will denote the $k$-th standard basis vector in $\mathbb{R}^d$, equal to 1 in component $k$ and 0 everywhere else. $X^*$ and $\mathbf{x}^*$ denote the adjoints (conjugate transposes) of the matrix $X$ and the vector $\mathbf{x}$.

The spectral norm of a matrix is denoted $\|X\|$. The Euclidean inner product between two matrices is $\langle X,Y\rangle = \mathrm{Tr}(X^*Y)$, and the corresponding Euclidean norm, called the Frobenius or Hilbert–Schmidt norm, is denoted $\|X\|_F$. That is, $\|X\|_F = \langle X,X\rangle^{1/2}$, or

$$\|X\|_F^2 = \langle X,X\rangle = \mathrm{Tr}\left(X^TX\right), \qquad (8.40)$$

which is linear in $X^TX$ since the trace function is linear. The nuclear norm of a matrix is $\|X\|_*$. The maximum entry of $X$ (in absolute value) is denoted by $\|X\|_\infty = \max_{ij}|X_{ij}|$, where $|\cdot|$ is the absolute value. For vectors, the only norm applied is the Euclidean $\ell_2$ norm, simply denoted $\|\mathbf{x}\|$.

Linear transformations that act on matrices will be denoted by calligraphic letters. In particular, the identity operator is $\mathcal{I}$. The spectral norm (the top singular value) of such an operator is $\|\mathcal{A}\| = \sup_{X:\|X\|_F\le 1}\|\mathcal{A}(X)\|_F$. Subspaces are also denoted by calligraphic letters.

8.9.1 Orthogonal Decomposition and Orthogonal Projection

We suggest that the reader review [454, Chap. 5] for background; here we only review the key definitions needed later. For a set of vectors $S = \{\mathbf{v}_1,\ldots,\mathbf{v}_r\}$, the subspace

$$\mathrm{span}(S) = \{\alpha_1\mathbf{v}_1 + \alpha_2\mathbf{v}_2 + \cdots + \alpha_r\mathbf{v}_r\}$$

generated by forming all linear combinations of vectors from $S$ is called the space spanned by $S$. For a subset $\mathcal{M}$ of an inner-product space $\mathcal{V}$, the orthogonal complement $\mathcal{M}^\perp$ of $\mathcal{M}$ is defined to be the set of all vectors in $\mathcal{V}$ that are orthogonal to every vector in $\mathcal{M}$. That is,

$$\mathcal{M}^\perp = \{\mathbf{x}\in\mathcal{V} : \langle\mathbf{m},\mathbf{x}\rangle = 0 \text{ for all }\mathbf{m}\in\mathcal{M}\}.$$

² The SDP is of course a convex optimization problem. It is common practice that once a problem is recast as a convex optimization problem, it can be solved using many general-purpose solvers such as CVX.

Let $\mathbf{u}_k$ (respectively $\mathbf{v}_k$) denote the $k$-th column of $U$ (respectively $V$). Set

$$\mathcal{U}\equiv\mathrm{span}(\mathbf{u}_1,\ldots,\mathbf{u}_r), \quad\text{and}\quad \mathcal{V}\equiv\mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_r).$$

Also assume, without loss of generality, that $n_1\le n_2$. It is useful to introduce the orthogonal decomposition

$$\mathbb{R}^{n_1\times n_2} = T\oplus T^\perp,$$

where $T$ is the linear space spanned by elements of the form $\mathbf{u}_k\mathbf{y}^*$ and $\mathbf{x}\mathbf{v}_k^*$, $1\le k\le r$, where $\mathbf{x}$ and $\mathbf{y}$ are arbitrary, and $T^\perp$ is its orthogonal complement. $T^\perp$ is the subspace of matrices spanned by the family $(\mathbf{x}\mathbf{y}^*)$, where $\mathbf{x}$ (respectively $\mathbf{y}$) is any vector orthogonal to $\mathcal{U}$ (respectively $\mathcal{V}$).

The orthogonal projection $\mathcal{P}_T$ of a matrix $Z$ onto the subspace $T$ is defined as

$$\mathcal{P}_T(Z) = P_{\mathcal{U}}Z + ZP_{\mathcal{V}} - P_{\mathcal{U}}ZP_{\mathcal{V}}, \qquad (8.41)$$

where $P_{\mathcal{U}}$ and $P_{\mathcal{V}}$ are the orthogonal projections onto $\mathcal{U}$ and $\mathcal{V}$, respectively. While $P_{\mathcal{U}}$ and $P_{\mathcal{V}}$ are matrices, $\mathcal{P}_T$ is a linear operator that maps a matrix to another matrix. The orthogonal projection of a matrix $Z$ onto $T^\perp$ is written as

$$\mathcal{P}_{T^\perp}(Z) = (\mathcal{I}-\mathcal{P}_T)(Z) = (I_{n_1}-P_{\mathcal{U}})Z(I_{n_2}-P_{\mathcal{V}}),$$

where $I_d$ denotes the $d\times d$ identity matrix. It follows from the definition that

$$\mathcal{P}_T(\mathbf{e}_a\mathbf{e}_b^*) = (P_{\mathcal{U}}\mathbf{e}_a)\mathbf{e}_b^* + \mathbf{e}_a(P_{\mathcal{V}}\mathbf{e}_b)^* - (P_{\mathcal{U}}\mathbf{e}_a)(P_{\mathcal{V}}\mathbf{e}_b)^*.$$

With the aid of (8.40), the squared Frobenius norm of $\mathcal{P}_T(\mathbf{e}_a\mathbf{e}_b^*)$ is given as³

$$\big\|\mathcal{P}_T(\mathbf{e}_a\mathbf{e}_b^*)\big\|_F^2 = \left\langle\mathcal{P}_T(\mathbf{e}_a\mathbf{e}_b^*),\mathcal{P}_T(\mathbf{e}_a\mathbf{e}_b^*)\right\rangle = \|P_{\mathcal{U}}\mathbf{e}_a\|^2 + \|P_{\mathcal{V}}\mathbf{e}_b\|^2 - \|P_{\mathcal{U}}\mathbf{e}_a\|^2\|P_{\mathcal{V}}\mathbf{e}_b\|^2. \qquad (8.42)$$

In order to upper bound (8.42), we are motivated to define a scalar $\mu(\mathcal{W})$, called the coherence of a subspace $\mathcal{W}$, such that

$$\|P_{\mathcal{U}}\mathbf{e}_a\|^2\le\mu(\mathcal{U})\,r/n_1, \qquad \|P_{\mathcal{V}}\mathbf{e}_b\|^2\le\mu(\mathcal{V})\,r/n_2. \qquad (8.43)$$

With the help of (8.43), the squared Frobenius norm is upper bounded by

$$\big\|\mathcal{P}_T(\mathbf{e}_a\mathbf{e}_b^*)\big\|_F^2\le\max\{\mu(\mathcal{U}),\mu(\mathcal{V})\}\,r\,\frac{n_1+n_2}{n_1n_2}\le\mu_0\,r\,\frac{n_1+n_2}{n_1n_2}, \qquad (8.44)$$

which will be used frequently.

³ This equation in the original paper [104] has a typo and is corrected here.

For a subspace $\mathcal{W}$, let us formally define its coherence, which plays a central role in the statement of the final theorem on matrix completion.

Definition 8.9.1 (Coherence of a subspace). Let $\mathcal{W}$ be a subspace of $\mathbb{R}^n$ of dimension $r$ and $P_{\mathcal{W}}$ be the orthogonal projection onto $\mathcal{W}$. Then the coherence of $\mathcal{W}$ (with respect to the standard basis $(\mathbf{e}_i)$) is defined to be

$$\mu(\mathcal{W})\equiv\frac{n}{r}\max_{1\le i\le n}\|P_{\mathcal{W}}\mathbf{e}_i\|^2.$$

For any subspace, the smallest value $\mu(\mathcal{W})$ can take is 1, achieved, for example, if $\mathcal{W}$ is spanned by vectors whose entries all have magnitude $1/\sqrt{n}$. The largest value for $\mu(\mathcal{W})$, on the other hand, is $n/r$, which corresponds to any subspace that contains a standard basis element. If a matrix has row and column spaces with low coherence, then each entry can be expected to provide about the same amount of information.
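The projection (8.41), the coherence of Definition 8.9.1, and the bound (8.44) are easy to compute numerically. The following numpy sketch (dimensions and rank are arbitrary) builds random row and column spaces, evaluates $\mu(\mathcal{U})$ and $\mu(\mathcal{V})$, and checks (8.44) for one matrix $\mathbf{e}_a\mathbf{e}_b^*$.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, r = 40, 60, 4
U = np.linalg.qr(rng.standard_normal((n1, r)))[0]     # orthonormal basis of column space
V = np.linalg.qr(rng.standard_normal((n2, r)))[0]     # orthonormal basis of row space
PU, PV = U @ U.T, V @ V.T                             # orthogonal projections

def coherence(B, n):
    # mu(W) = (n/r) * max_i ||P_W e_i||^2, with P_W = B B^T for orthonormal B
    return (n / B.shape[1]) * np.max(np.sum(B**2, axis=1))

mu0 = max(coherence(U, n1), coherence(V, n2))

a, b = 0, 0
E = np.zeros((n1, n2)); E[a, b] = 1.0                 # E = e_a e_b^T
PT_E = PU @ E + E @ PV - PU @ E @ PV                  # projection (8.41)
lhs = np.linalg.norm(PT_E, "fro")**2
rhs = mu0 * r * (n1 + n2) / (n1 * n2)                 # bound (8.44)
print(f"mu0 = {mu0:.2f}, ||P_T(e_a e_b)||_F^2 = {lhs:.4f} <= bound {rhs:.4f}")
```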

8.9.2 Matrix Completion

The main contribution of Recht [104] is an analysis of uniformly sampled sets via the study of a sampling-with-replacement model. In particular, Recht analyzes the situation where each entry index is sampled independently from the uniform distribution on $\{1,\ldots,n_1\}\times\{1,\ldots,n_2\}$.
Proposition 8.9.2 (Sampling with replacement [104]). The probability that the
nuclear norm heuristic fails when the set of observed entries is sampled uniformly
from the collection of sets of size N is less than or equal to the probability that the
heuristic fails when N entries are sampled independently with replacement.
Theorem 2.2.17 is repeated here for convenience.
Theorem 8.9.3 (Noncommutative Bernstein Inequality [104]). Let $X_1,\ldots,X_n$ be independent zero-mean random matrices of dimension $d_1\times d_2$. Suppose $\rho_k^2 = \max\left\{\|\mathbb{E}(X_kX_k^*)\|,\ \|\mathbb{E}(X_k^*X_k)\|\right\}$ and $\|X_k\|\le M$ almost surely for all $k$. Then, for any $\tau > 0$,

$$P\!\left[\left\|\sum_{k=1}^{n}X_k\right\| > \tau\right]\le(d_1+d_2)\exp\left(\frac{-\tau^2/2}{\sum_{k=1}^{n}\rho_k^2 + M\tau/3}\right). \qquad (8.45)$$

Theorem 8.9.4 (Matrix completion, Recht [104]). Let $M$ be an $n_1\times n_2$ matrix of rank $r$ with singular value decomposition $U\Sigma V^H$. Without loss of generality, impose the convention $n_1\le n_2$, $\Sigma\in\mathbb{R}^{r\times r}$, $U\in\mathbb{R}^{n_1\times r}$, $V\in\mathbb{R}^{n_2\times r}$. Assume that

A0. The row and column spaces have coherences bounded above by some positive $\mu_0$.
A1. The matrix $UV^H$ has a maximum entry bounded by $\mu_1\sqrt{r/(n_1n_2)}$ in absolute value, for some positive $\mu_1$.

Suppose $m$ entries of $M$ are observed with locations sampled uniformly at random. Then if

$$m\ge 32\,\max\left\{\mu_1^2,\mu_0\right\}\,r\,(n_1+n_2)\,\beta\log^2(2n_2)$$

for some $\beta > 1$, the minimizer of the problem

$$\begin{array}{ll}\text{minimize} & \|X\|_*\\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j)\in\Omega,\end{array} \qquad (8.46)$$

is unique and equal to $M$ with probability at least $1 - 6\log(n_2)\,(n_1+n_2)^{2-2\beta} - n_2^{2-2\sqrt{\beta}}$.

The proof is short and straightforward: it only uses basic matrix analysis, elementary large deviation bounds, and a noncommutative version of Bernstein's inequality (see Theorem 8.9.3). Recovering low-rank matrices is also studied by Gross [102].
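The nuclear-norm heuristic (8.46) itself is a small convex program. The following CVXPY sketch (matrix size, rank, and sampling rate are arbitrary toy choices) recovers a random low-rank matrix from roughly half of its entries.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n1, n2, r = 20, 20, 2
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))   # rank-r target
mask = (rng.random((n1, n2)) < 0.5).astype(float)                 # observed set Omega

X = cp.Variable((n1, n2))
constraints = [cp.multiply(mask, X) == cp.multiply(mask, M)]      # X_ij = M_ij on Omega
prob = cp.Problem(cp.Minimize(cp.norm(X, "nuc")), constraints)
prob.solve()
print("relative recovery error:",
      np.linalg.norm(X.value - M) / np.linalg.norm(M))
```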
Example 8.9.5 (A secure communications protocol that is robust to sparse errors [455]). We want to securely transmit a binary message across a communications channel. The theory shows that decoding the message via deconvolution also makes this secure scheme robust to sparse corruptions such as erasures or malicious interference.

We model the binary message as a sign vector $\mathbf{m}_0\in\{\pm 1\}^d$. Choose a random basis $Q\in O_d$. The transmitter sends the scrambled message $\mathbf{s}_0 = Q\mathbf{m}_0$ across the channel, where it is corrupted by an unknown sparse vector $\mathbf{c}_0\in\mathbb{R}^d$. The receiver must determine the original message given only the corrupted signal

$$\mathbf{z}_0 = \mathbf{s}_0 + \mathbf{c}_0 = Q\mathbf{m}_0 + \mathbf{c}_0$$

and knowledge of the scrambling matrix $Q$.

The signal model is perfectly suited to the deconvolution recipe of [455, Sect. 1.2]. The $\ell_1$ and $\ell_\infty$ norms are natural complexity measures for the structured signals $\mathbf{c}_0$ and $\mathbf{m}_0$. Since the message $\mathbf{m}_0$ is a sign vector, we also have the side information $\|\mathbf{m}_0\|_\infty = 1$. The receiver then recovers the message with the convex deconvolution method

$$\begin{array}{ll}\text{minimize} & \|\mathbf{c}\|_1\\ \text{subject to} & \|\mathbf{m}\|_\infty\le 1 \ \text{ and }\ Q\mathbf{m} + \mathbf{c} = \mathbf{z}_0,\end{array} \qquad (8.47)$$

where the decision variables are $\mathbf{c},\mathbf{m}\in\mathbb{R}^d$ (for example, $d = 100$). This method succeeds if $(\mathbf{c}_0,\mathbf{m}_0)$ is the unique optimal point of (8.47). ⊓⊔
Example 8.9.6 (Low-rank matrix recovery with generic sparse corruptions [455]). Consider the matrix observation

$$Z_0 = X_0 + \mathcal{R}(Y_0)\in\mathbb{R}^{n\times n},$$

where $X_0$ has low rank, $Y_0$ is sparse, and $\mathcal{R}$ is a random rotation on $\mathbb{R}^{n\times n}$ (for example, $n = 35$). We aim to recover the matrix $X_0$ given the corrupted observation $Z_0$ and the basis $\mathcal{R}$.

The Schatten 1-norm $\|\cdot\|_{S_1}$ serves as a natural complexity measure for the low-rank structure of $X_0$, and the matrix $\ell_1$ norm $\|\cdot\|_1$ is appropriate for the sparse structure of $Y_0$. We further assume the side information $\alpha = \|Y_0\|_1$. We then solve

$$\begin{array}{ll}\text{minimize} & \|X\|_{S_1}\\ \text{subject to} & \|Y\|_1\le\alpha \ \text{ and }\ X + \mathcal{R}(Y) = Z_0.\end{array} \qquad (8.48)$$

This convex deconvolution method succeeds if $(X_0,Y_0)$ is the unique solution to (8.48). This problem is related to latent variable selection and robust principal component analysis [456]. ⊓⊔

8.10 Von Neumann Entropy Penalization and Low-Rank Matrix Estimation

Following [457], we study the problem of estimating a Hermitian nonnegative definite matrix $R$ of unit trace, e.g., a density matrix of a quantum system or a covariance matrix of measured data. The estimation is based on $n$ i.i.d. measurements

$$(X_1,Y_1),\ldots,(X_n,Y_n), \qquad (8.49)$$

where

$$Y_i = \mathrm{Tr}(RX_i) + W_i, \quad i = 1,\ldots,n. \qquad (8.50)$$

Here $X_i$, $i = 1,\ldots,n$, are i.i.d. random Hermitian matrices (matrix-valued random variables) and $W_i$ are i.i.d. (scalar-valued) random variables with $\mathbb{E}(W_i\,|\,X_i) = 0$. We consider the estimator

$$\hat{R}^\varepsilon = \arg\min_{S\in\mathcal{S}_{m\times m}}\left[\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \mathrm{Tr}(SX_i)\right)^2 + \varepsilon\,\mathrm{Tr}(S\log S)\right], \qquad (8.51)$$

where $\mathcal{S}_{m\times m}$ is the set of all nonnegative definite Hermitian $m\times m$ matrices of trace 1. The goal is to derive oracle inequalities showing how the estimation error depends on the accuracy of approximation of the unknown state $R$ by low-rank matrices.

8.10.1 System Model and Formalism

Let $\mathbb{M}_{m\times m}$ be the set of all $m\times m$ matrices with complex entries. $\mathrm{Tr}(S)$ denotes the trace of $S\in\mathbb{M}_{m\times m}$, and $S^*$ denotes its adjoint matrix. Let $\mathbb{H}_{m\times m}$ be the set of all $m\times m$ Hermitian matrices with complex entries, and let

$$\mathcal{S}_{m\times m}\equiv\left\{S\in\mathbb{H}_{m\times m}(\mathbb{C}) : S\succeq 0,\ \mathrm{Tr}(S) = 1\right\}$$

be the set of all nonnegative definite Hermitian matrices of trace 1. The matrices of $\mathcal{S}_{m\times m}$ can be interpreted, for instance, as density matrices describing the states of a quantum system, or as covariance matrices describing the states of an observed phenomenon.

Let $X\in\mathbb{H}_{m\times m}(\mathbb{C})$ be a matrix (an observable) with spectral representation

$$X = \sum_{i=1}^{m}\lambda_iP_i, \qquad (8.52)$$

where $\lambda_i$ are the eigenvalues of $X$ and $P_i$ are its spectral projectors. Then a measurement of $X$ in a state $R\in\mathcal{S}_{m\times m}$ results in the outcome $\lambda_i$ with probability $\mathrm{Tr}(RP_i)$, and its expectation is $\mathbb{E}_RX = \mathrm{Tr}(RX)$.

Let $X_1,\ldots,X_n\in\mathbb{H}_{m\times m}(\mathbb{C})$ be given matrices (observables), and let $R\in\mathcal{S}_{m\times m}$ be the unknown state of the system. A statistical problem is to estimate the unknown $R$ based on the observations $(X_1,Y_1),\ldots,(X_n,Y_n)$, where $Y_1,\ldots,Y_n$ are outcomes of measurements of the observables $X_1,\ldots,X_n$ for the system identically prepared $n$ times in the state $R$. In other words, the unknown state $R$ of the system is to be "learned" from a set of $n$ linear measurements in a number of "directions" $X_1,\ldots,X_n$.

It is assumed that the matrix-valued design variables $X_1,\ldots,X_n$ are also random; specifically, they are i.i.d. Hermitian $m\times m$ matrices with distribution $\Pi$. In this case, the observations $(X_1,Y_1),\ldots,(X_n,Y_n)$ are i.i.d. and satisfy the following model:

$$Y_i = \mathrm{Tr}(RX_i) + W_i, \quad i = 1,\ldots,n, \qquad (8.53)$$

where $W_i$, $i = 1,\ldots,n$, are i.i.d. (scalar-valued) random variables with $\mathbb{E}(W_i\,|\,X_i) = 0$, $i = 1,\ldots,n$.

8.10.2 Sampling from an Orthogonal Basis

The linear space of matrices $\mathbb{M}_{m\times m}(\mathbb{C})$ can be equipped with the Hilbert–Schmidt inner product

$$\langle A,B\rangle = \mathrm{Tr}(AB^*).$$

Let $E_i$, $i = 1,\ldots,m^2$, be an orthonormal basis of $\mathbb{M}_{m\times m}(\mathbb{C})$ consisting of Hermitian matrices. Let $X_i$, $i = 1,\ldots,n$, be i.i.d. matrix-valued random variables sampled from a distribution $\Pi$ on the set $\{E_i, i = 1,\ldots,m^2\}$. We will refer to this model as sampling from an orthonormal basis.

Most often, we will use the uniform distribution $\Pi$ that assigns probability $\frac{1}{m^2}$ to each basis matrix $E_i$. In this case,

$$\mathbb{E}\,|\langle A,X\rangle|^2 = \frac{1}{m^2}\|A\|_2^2,$$

where $\|\cdot\|_2 = \langle\cdot,\cdot\rangle^{1/2}$ is the Hilbert–Schmidt (or Frobenius) norm.

Example 8.10.1 (Matrix completion). Let $\{\mathbf{e}_i : i = 1,\ldots,m\}$ be the canonical basis of $\mathbb{C}^m$, where the $\mathbf{e}_i$ are $m$-dimensional vectors. We first define

$$E_{ii} = \mathbf{e}_i\otimes\mathbf{e}_i, \quad i = 1,\ldots,m, \qquad E_{ij} = \frac{1}{\sqrt{2}}\left(\mathbf{e}_i\otimes\mathbf{e}_j + \mathbf{e}_j\otimes\mathbf{e}_i\right), \quad E_{ji} = \frac{\mathrm{i}}{\sqrt{2}}\left(\mathbf{e}_i\otimes\mathbf{e}_j - \mathbf{e}_j\otimes\mathbf{e}_i\right), \qquad (8.54)$$

for $i < j$, $i,j = 1,\ldots,m$, where $\mathrm{i}$ denotes the imaginary unit (so that $E_{ji}$ is Hermitian). Here $\otimes$ denotes the tensor (or Kronecker) product of vectors or matrices [16]. Then the set of Hermitian matrices $\{E_{ij} : 1\le i,j\le m\}$ forms an orthonormal basis of $\mathbb{H}_{m\times m}(\mathbb{C})$. For $i < j$, the Fourier coefficients of a Hermitian matrix $R$ in this basis are equal to the real and imaginary parts of the entries $R_{ij}$, $i < j$, of the matrix $R$ multiplied by $\sqrt{2}$; for $i = j$, they are just the diagonal entries of $R$, which are real.

If now $\Pi$ is the uniform distribution on this basis, then $\mathbb{E}\,|\langle A,X\rangle|^2 = \frac{1}{m^2}\|A\|_2^2$. Sampling from this distribution is equivalent to sampling at random the real and imaginary parts of the entries of the matrix $R$. ⊓⊔
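The basis (8.54) is straightforward to construct and check numerically. The following numpy sketch builds the $m^2$ matrices for a small $m$, including the imaginary-unit factor used here so that the antisymmetric elements are Hermitian, and verifies orthonormality under the Hilbert–Schmidt inner product.

```python
import numpy as np

m = 4
I = np.eye(m)
E = [np.outer(I[i], I[i]).astype(complex) for i in range(m)]     # E_ii = e_i e_i^T
for i in range(m):
    for j in range(i + 1, m):
        S = np.outer(I[i], I[j])
        E.append((S + S.T) / np.sqrt(2) + 0j)                    # symmetric element
        E.append(1j * (S - S.T) / np.sqrt(2))                    # i*(antisymmetric) -> Hermitian

def hs(A, B):
    # Hilbert-Schmidt inner product <A, B> = Tr(A B^*)
    return np.trace(A @ B.conj().T)

G = np.array([[hs(A, B) for B in E] for A in E])
print("all elements Hermitian :", all(np.allclose(A, A.conj().T) for A in E))
print("Gram matrix is identity:", np.allclose(G, np.eye(m * m)))
```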
Example 8.10.2 (Sub-Gaussian design). A scalar-valued random variable $X$ is called sub-Gaussian with parameter $\sigma$ if and only if, for all $\lambda\in\mathbb{R}$,

$$\mathbb{E}e^{\lambda X}\le e^{\lambda^2\sigma^2/2}.$$

The inner product $\langle A,X\rangle$ is a sub-Gaussian scalar-valued random variable for each $A\in\mathbb{H}_{m\times m}(\mathbb{C})$. This is an important model, closely related to randomized designs in compressed sensing, for which one can use the powerful tools developed in high-dimensional probability.

Let us consider two examples: Gaussian design and Rademacher design.

1. Gaussian design: $X$ is a symmetric random matrix with real entries such that $\{X_{ij} : 1\le i\le j\le m\}$ are independent, centered normal random variables with $\mathbb{E}X_{ii}^2 = 1$, $i = 1,\ldots,m$, and $\mathbb{E}X_{ij}^2 = \frac{1}{2}$, $i < j$.
2. Rademacher design: $X_{ii} = \varepsilon_{ii}$, $i = 1,\ldots,m$, and $X_{ij} = \frac{1}{\sqrt{2}}\varepsilon_{ij}$, $i < j$, where $\{\varepsilon_{ij} : 1\le i\le j\le m\}$ are i.i.d. Rademacher random variables, i.e., random variables taking values $+1$ or $-1$ with probability $1/2$ each.

In both cases, we have

$$\mathbb{E}\,|\langle A,X\rangle|^2 = \frac{1}{m^2}\|A\|_2^2, \quad A\in\mathbb{H}_{m\times m}(\mathbb{C})$$

(such matrix-valued random variables are called isotropic), and $\langle A,X\rangle$ is a sub-Gaussian random variable whose sub-Gaussian parameter is equal to $\|A\|_2$ (up to a constant). ⊓⊔

8.10.3 Low-Rank Matrix Estimation

We deal with random sampling from an orthonormal basis and with sub-Gaussian isotropic designs such as the Gaussian or Rademacher designs mentioned above. Assume, for simplicity, that the noise $W_i$ is a sequence of i.i.d. $N(0,\sigma_w^2)$ random variables independent of $X_1,\ldots,X_n\in\mathbb{H}_{m\times m}(\mathbb{C})$ (Gaussian noise).

We write

$$f(S) = \sum_{i=1}^{m}f(\lambda_i)\,(\boldsymbol{\phi}_i\otimes\boldsymbol{\phi}_i)$$

for any Hermitian matrix $S$ with spectral representation $S = \sum_{i=1}^{m}\lambda_i\,(\boldsymbol{\phi}_i\otimes\boldsymbol{\phi}_i)$ and any function $f$ defined on a set that contains the spectrum of $S$. See Sect. 1.4.13.

Let us consider the case of sampling from an orthonormal basis $\{E_1,\ldots,E_{m^2}\}$ of $\mathbb{H}_{m\times m}(\mathbb{C})$ (consisting of Hermitian matrices). We call the distribution $\Pi$ on $\{E_1,\ldots,E_{m^2}\}$ nearly uniform if and only if there exist constants $c_1, c_2$ such that

$$\max_{1\le i\le m^2}\Pi(\{E_i\})\le\frac{c_1}{m^2} \quad\text{and}\quad \|A\|_{L_2(\Pi)}^2\ge\frac{c_2}{m^2}\|A\|_2^2, \quad A\in\mathbb{H}_{m\times m}(\mathbb{C}).$$

Clearly the matrix completion design (Example 8.10.1) is a special case of sampling from such nearly uniform distributions.

We study the following estimator of the unknown state $R$, defined as a solution of a penalized empirical risk minimization problem:

$$\hat{R}^\varepsilon = \arg\min_{S\in\mathcal{S}_{m\times m}}\left[\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \mathrm{Tr}(SX_i)\right)^2 + \varepsilon\,\mathrm{Tr}(S\log S)\right], \qquad (8.55)$$

where $\varepsilon$ is a regularization parameter. The penalty term is based on the function $\mathrm{Tr}(S\log S) = -\mathcal{E}(S)$, where $\mathcal{E}(S)$ is the von Neumann entropy of the state $S$. Thus the method here is based on a trade-off between fitting the model by least squares in the class of all density matrices and maximizing the entropy of the state. The optimization problem (8.55) is convex: this is based on convexity of the penalty term, which follows from the concavity of the von Neumann entropy; see [458].

It is shown that the solution $\hat{R}^\varepsilon$ of (8.55) is always a full rank matrix; see the proof of Proposition 3 of [457]. Nevertheless, when the target matrix $R$ is nearly low rank, $\hat{R}^\varepsilon$ is also well approximated by low rank matrices, and the error $\|R - \hat{R}^\varepsilon\|_{L_2(\Pi)}^2$ can be controlled in terms of the "approximate rank" of $R$.

Let $t > 0$ be fixed, and set $t_m\equiv t + \log(2m)$ and $\tau_n\equiv t + \log\log_2(2n)$. To simplify the bounds, assume that $\log\log_2(2n)\le\log(2m)$ (so $\tau_n\le t_m$), that $n\ge m\,t_m\log^2 m$, and finally that $\sigma_w\ge\frac{1}{\sqrt{m}}$. The last condition just means that the variance of the noise is not "too small," which allows one to suppress the "exponential tail term" in the Bernstein-type inequalities used in the derivation of the bounds. Recall that $R\in\mathcal{S}_{m\times m}$.

We state two theorems without proof.
We state two theorems without proof.
Theorem 8.10.3 (Sampling from a nearly uniform distribution, Koltchinskii [457]). Suppose $X$ is sampled from a nearly uniform distribution $\Pi$. Then there exists a constant $C > 0$ such that, for all $\varepsilon\in[0,1]$, with probability at least $1-e^{-t}$,

$$\big\|\hat{R}^\varepsilon - R\big\|_{L_2(\Pi)}^2\le C\left[\varepsilon\left(\log p\wedge\log\frac{m}{\varepsilon}\right)\vee\sigma_w\sqrt{\frac{m\,t_m}{nm}}\right]. \qquad (8.56)$$

In addition, for all sufficiently large $D > 0$, there exists a constant $C > 0$ such that, for $\varepsilon\equiv D\sigma_w\sqrt{\frac{t_m}{mn}}$, with probability at least $1-e^{-t}$,

$$\big\|\hat{R}^\varepsilon - R\big\|_{L_2(\Pi)}^2\le\inf_{S\in\mathcal{S}_{m\times m}}\left\{2\,\big\|S-R\big\|_{L_2(\Pi)}^2 + C\sigma_w^2\left[\frac{\mathrm{rank}(S)\,m\,t_m\log^2(mn)}{n}\vee m^{-1}\right]\right\}, \qquad (8.57)$$

where $a\vee b = \max\{a,b\}$ and $a\wedge b = \min\{a,b\}$.

Theorem 8.10.4 (Sub-Gaussian isotropic design, Koltchinskii [457]). Suppose $X$ is a sub-Gaussian isotropic matrix. There exist constants $C > 0$, $c > 0$ such that the following hold. Under the assumptions that $\tau_n\le cn$ and $t_m\le n$, for all $\varepsilon\in[0,1]$, with probability at least $1-e^{-t}$,

$$\big\|\hat{R}^\varepsilon - R\big\|_{L_2(\Pi)}^2\le C\left[\varepsilon\left(\log p\wedge\log\frac{m}{\varepsilon}\right)\vee\sigma_w\sqrt{\frac{m\,t_m}{n}}\vee\left(\sigma_w\vee\sqrt{m}\right)\frac{\sqrt{m}\,(\tau_n\log n\vee t_m)}{n}\right]. \qquad (8.58)$$

Moreover, there exists a constant $c > 0$ and, for all sufficiently large $D > 0$, a constant $C > 0$ such that, for $\varepsilon\equiv D\sigma_w\sqrt{\frac{m\,t_m}{n}}$, with probability at least $1-e^{-t}$,

$$\big\|\hat{R}^\varepsilon - R\big\|_{L_2(\Pi)}^2\le\inf_{S\in\mathcal{S}_{m\times m}}\left\{2\,\big\|S-R\big\|_{L_2(\Pi)}^2 + C\left[\frac{\sigma_w^2\,\mathrm{rank}(S)\,m\,t_m\log^2(mn)}{n}\vee\frac{m\,(\tau_n\log n\vee t_m)}{n}\right]\right\}. \qquad (8.59)$$

8.10.4 Tools for Low-Rank Matrix Estimation

Let us present three tools that have been used for low-rank matrix estimation, since they are of general interest. We must bear in mind that random matrices are noncommutative, which is fundamentally different from the scalar-valued case.

Noncommutative Kullback–Leibler and other distances. We use noncommutative extensions of classical distances between probability distributions, such as the Kullback–Leibler and Hellinger distances. We use the symmetrized Kullback–Leibler distance between two states $S_1, S_2\in\mathcal{S}_{m\times m}$ defined as

$$K(S_1;S_2) = \mathbb{E}_{S_1}(\log S_1 - \log S_2) + \mathbb{E}_{S_2}(\log S_2 - \log S_1) = \mathrm{Tr}\left[(S_1-S_2)(\log S_1 - \log S_2)\right].$$

Empirical process bounds. Let $X_1,\ldots,X_n$ be i.i.d. matrix-valued random variables with a common distribution. If the class of measurable functions $\mathcal{F}$ is uniformly bounded by a number $\Theta$, then the famous Talagrand concentration inequality implies that, for all $t > 0$, with probability at least $1-e^{-t}$,

$$\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}f(X_i) - \mathbb{E}f(X)\right|\le 2\left[\mathbb{E}\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}f(X_i) - \mathbb{E}f(X)\right| + \sigma\sqrt{\frac{t}{n}} + \Theta\,\frac{t}{n}\right],$$

where $\sigma^2 = \sup_{f\in\mathcal{F}}\mathrm{Var}(f(X))$.

Noncommutative Bernstein-type inequalities. Let $X_1,\ldots,X_n$ be i.i.d. Hermitian matrix-valued random variables with $\mathbb{E}X = 0$ and $\sigma_X^2 = \|\mathbb{E}X^2\|$. We need to study the partial sum $X_1 + \cdots + X_n = \sum_{i=1}^{n}X_i$. Chapter 2 gives an exhaustive treatment of this subject.

8.11 Sum of a Large Number of Convex Component Functions

The sums of random matrices can be understood with the help of concentration of measure. A matrix may be viewed as a vector in an $n$-dimensional space. In this section, we make the connection between the sum of random matrices and the optimization problem. We draw material from [459] for the background on incremental methods. Consider the sum of a large number of component functions $\sum_{i=1}^{N}f_i(\mathbf{x})$, where the number of components $N$ is very large. We can further consider the optimization problem

$$\begin{array}{ll}\text{minimize} & \sum\limits_{i=1}^{N}f_i(\mathbf{x})\\ \text{subject to} & \mathbf{x}\in\mathcal{X},\end{array} \qquad (8.60)$$

where $f_i : \mathbb{R}^n\to\mathbb{R}$, $i = 1,\ldots,N$, and $\mathcal{X}\subseteq\mathbb{R}^n$. The standard Euclidean norm is defined as $\|\mathbf{x}\|_2 = (\mathbf{x}^T\mathbf{x})^{1/2}$. There is an incentive to use incremental methods that operate on a single component $f_i(\mathbf{x})$ at each iteration, rather than

on the entire cost function. If each incremental iteration tends to make reasonable progress in some "average" sense, then, depending on the value of $N$, an incremental method may significantly outperform (by orders of magnitude) its nonincremental counterparts. This framework provides flexibility in exploiting the special structure of the $f_i$, including randomization in the selection of components. It is suitable for large-scale, distributed optimization, such as in Big Data [5].

Incremental subgradient methods apply to the case where the component functions $f_i$ are convex and possibly nondifferentiable at some points; the iteration is

$$\mathbf{x}_{k+1} = P_{\mathcal{X}}\left(\mathbf{x}_k - \alpha_k\nabla f_{i_k}(\mathbf{x}_k)\right),$$

where $\alpha_k$ is a positive stepsize, $P_{\mathcal{X}}$ denotes projection onto $\mathcal{X}$, $\nabla f_{i_k}(\mathbf{x}_k)$ denotes a subgradient of $f_{i_k}$ at $\mathbf{x}_k$, and $i_k$ is the index of the cost component that is iterated on.
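A minimal randomized incremental subgradient iteration is easy to write down. The sketch below is an illustration only: the component functions $f_i(\mathbf{x}) = |\mathbf{a}_i^T\mathbf{x} - b_i|$, the Euclidean-ball feasible set, and the diminishing stepsize are all assumptions chosen for the example, and the update follows the displayed iteration above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 10                      # number of components, dimension
A = rng.standard_normal((N, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.1 * rng.standard_normal(N)
radius = 10.0                       # feasible set X = {x : ||x||_2 <= radius}

def project(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

x = np.zeros(n)
for k in range(20000):
    i = rng.integers(N)             # randomized component selection i_k
    r = A[i] @ x - b[i]
    g = np.sign(r) * A[i]           # subgradient of f_i(x) = |a_i^T x - b_i|
    alpha = 1.0 / np.sqrt(k + 1)    # diminishing stepsize alpha_k
    x = project(x - alpha * g)      # x_{k+1} = P_X(x_k - alpha_k g)

print("final average cost:", np.mean(np.abs(A @ x - b)))
```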
An extension of the incremental approach to proximal algorithms is considered by Sra et al. [459] in a unified algorithmic framework that includes incremental gradient, subgradient, and proximal methods and their combinations, and highlights their common structure and behavior. The only further restriction on (8.60) is that each $f_i(\mathbf{x})$ is a real-valued convex function. Fortunately, the convex function class includes many eigenvalue functions of $n\times n$ matrices, such as the largest eigenvalue $\lambda_{\max}$ and the sum of the $K$ largest eigenvalues $\sum_{i=1}^{K}\lambda_i$ (the smallest eigenvalue $\lambda_{\min}$ and the sum of the $K$ smallest eigenvalues $\sum_{i=1}^{K}\lambda_{n-i+1}$ are concave). When $K = n$, the sum reduces to the linear trace function. As a result, Chaps. 4 and 5 are relevant along the line of concentration of measure.

Some examples of the use of (8.60) are given below.
Some examples are given to use (8.60).
Example 8.11.1 (Sample covariance matrix estimation and geometric functional analysis). For $N$ independent samples $\mathbf{x}_i$, $i = 1,\ldots,N$, of a random vector $\mathbf{x}\in\mathbb{R}^n$, an $n\times n$ sample covariance matrix is obtained as

$$\hat{R}_x = \sum_{i=1}^{N}\mathbf{x}_i\otimes\mathbf{x}_i,$$

which corresponds to the choice

$$f_i(\mathbf{x}) = \mathbf{x}_i\otimes\mathbf{x}_i,$$

where $\otimes$ is the outer product of two vectors. In fact, $\mathbf{x}_i\otimes\mathbf{x}_i$ is a rank-one, positive matrix and a basic building block for low-rank matrix recovery. So-called geometric functional analysis (Chap. 5) is explicitly connected with convex optimization through (8.60). This deep connection may be fruitful in the future.

For example, consider a connection with the approximation of a convex body by another one having a small number of contact points [460]. Let $K$ be a convex body in $\mathbb{R}^n$ such that the ellipsoid of minimum volume containing it [93] is the standard Euclidean ball $B_2^n$. Then, by the theorem of John, there exist $N\le(n+3)n/2$ points $\mathbf{z}_1,\ldots,\mathbf{z}_N\in K$ with $\|\mathbf{z}_i\|_2 = 1$ and $N$ positive numbers $c_1,\ldots,c_N$ satisfying the following system of equations:

$$I = \sum_{i=1}^{N}c_i\,\mathbf{z}_i\otimes\mathbf{z}_i, \qquad 0 = \sum_{i=1}^{N}c_i\,\mathbf{z}_i.$$

It is established by Rudelson [93] (Theorem 5.4.2) that $N = O(n\log n)$ is sufficient for a good approximation. By choosing

$$f_i(\mathbf{x}) = c_i\,\mathbf{x}_i\otimes\mathbf{x}_i$$

in (8.60), we are able to calculate $c_1,\ldots,c_N$ by solving a convex optimization problem. Recall that a random vector $\mathbf{y}$ is in the isotropic position if

$$R_y\triangleq\mathbb{E}[\mathbf{y}\otimes\mathbf{y}] = I,$$

where $R_y$ denotes the true covariance matrix. For a high-dimensional convex body [266] $K\subset\mathbb{R}^n$, our algorithm formulated in terms of (8.60) brings the body into the isotropic position. ⊓⊔
%
Example 8.11.2 (Least squares and inference). An important class is the cost function $\sum_{i=1}^{N}f_i(\mathbf{x})$, where $f_i(\mathbf{x})$ is the error between some data and the output of a parametric model, with $\mathbf{x}$ being the vector of parameters. (The standard Euclidean norm is defined as $\|\mathbf{x}\|_2 = (\mathbf{x}^T\mathbf{x})^{1/2}$.) An example is the linear least-squares problem, where $f_i$ has a quadratic structure,

$$\sum_{i=1}^{N}\left(\mathbf{a}_i^T\mathbf{x} - b_i\right)^2 + \gamma\|\mathbf{x}-\bar{\mathbf{x}}\|_2^2, \quad\text{s.t. }\mathbf{x}\in\mathbb{R}^n,$$

where $\bar{\mathbf{x}}$ is given, or a nondifferentiable structure, as in $\ell_1$-regularization,

$$\sum_{i=1}^{N}\left(\mathbf{a}_i^T\mathbf{x} - b_i\right)^2 + \gamma\sum_{j=1}^{n}|x_j|, \quad\text{s.t. }(x_1,\ldots,x_n)\in\mathbb{R}^n.$$

More generally, nonlinear least squares may be used,

$$f_i(\mathbf{x}) = \left(h_i(\mathbf{x})\right)^2,$$

where $h_i(\mathbf{x})$ represents the difference between the $i$-th measurement (out of $N$) from a physical system and the output of a parametric model whose parameter vector is $\mathbf{x}$. Another choice is

$$f_i(\mathbf{x}) = g\left(\mathbf{a}_i^T\mathbf{x} - b_i\right),$$

where $g$ is a convex function. Still another example is maximum likelihood estimation, where $f_i$ is a negative log-likelihood function of the form

$$f_i(\mathbf{x}) = -\log P_Y(y_i;\mathbf{x}),$$

where $y_1,\ldots,y_N$ represent values of independent samples of a random vector whose distribution $P_Y(\cdot;\mathbf{x})$ depends on an unknown parameter vector $\mathbf{x}\in\mathbb{R}^n$ that one wishes to estimate. Related contexts include "incomplete" data cases, where the expectation-maximization (EM) approach is used. ⊓⊔

Example 8.11.3 (Minimization of an expected value: stochastic programming). Consider the minimization of an expected value,

$$\begin{array}{ll}\text{minimize} & \mathbb{E}[F(\mathbf{x},w)]\\ \text{subject to} & \mathbf{x}\in\mathbb{R}^n,\end{array}$$

where $w$ is a random variable taking a finite but very large number of values $w_i$, $i = 1,\ldots,N$, with corresponding probabilities $\pi_i$. Then the cost function consists of the sum of the $N$ functions $\pi_iF(\mathbf{x},w_i)$. ⊓⊔
Example 8.11.4 (Distributed incremental optimization in sensor networks). Consider a network of $N$ sensors where data are collected and used to solve some inference problem involving a parameter vector $\mathbf{x}$. If $f_i(\mathbf{x})$ represents an error penalty for the data collected by the $i$-th sensor, then the inference problem is of the form (8.60). One approach is centralized: collect all the data at a fusion center. The preferable alternative is distributed: save data communication overhead and/or take advantage of parallelism in computation. In the age of Big Data, this distributed alternative is almost mandatory due to the need for storing massive amounts of data.

In such an approach, the current iterate $\mathbf{x}_k$ is passed from one sensor to another, with each sensor $i$ performing an incremental iteration involving just its local component function $f_i$, and the entire cost function need never be known at any one location. See [461, 462] for details. ⊓⊔

8.12 Phase Retrieval via Matrix Completion

Our interest in the problem of spectral factorization and phase retrieval is motivated by the pioneering work of [463, 464], where this problem is connected with the recently developed machinery of matrix completion; see, e.g., [102, 465–472] for the most cited papers. This connection was first made in [473], followed by Candes et al. [463, 464]. The first experimental demonstration was made, via an LED source with 620 nm central wavelength, in [474], following a much simplified version of the approach described in [464]. Here we mainly want to explore the mathematical techniques, rather than specific applications. We heavily rely on [463, 464] for our exposition.

8.12.1 Methodology

Let $x_n$ be a finite-length real-valued sequence and $r_n$ its autocorrelation, that is,

$$r_n\triangleq\sum_{k}x_kx_{k-n} = x_n * x_{-n}, \qquad (8.61)$$

where $*$ represents the discrete convolution of two finite-length sequences. The goal of spectral factorization is to recover $x_n$ from $r_n$. This is more explicit in the discrete Fourier domain, that is,

$$R\left(e^{j\omega}\right) = X\left(e^{j\omega}\right)X^*\left(e^{j\omega}\right) = \left|X\left(e^{j\omega}\right)\right|^2, \qquad (8.62)$$

where

$$X\left(e^{j\omega}\right) = \frac{1}{\sqrt{n}}\sum_{0\le k\le n}x[k]\,e^{-j2\pi k/n}, \quad \omega\in\Omega, \qquad R\left(e^{j\omega}\right) = \frac{1}{\sqrt{n}}\sum_{0\le k\le n}r[k]\,e^{-j2\pi k/n}, \quad \omega\in\Omega,$$

are the discrete Fourier transforms of $x_n$ and $r_n$, respectively. The task of spectral factorization is equivalent to recovering the missing phase information of $X(e^{j\omega})$ from its squared magnitude $|X(e^{j\omega})|^2$ [475]. This problem is often called phase retrieval in the literature [476]. Spectral factorization and phase retrieval have been extensively studied; see [476, 477] for comprehensive surveys.

Let the unknown $\mathbf{x}$ and the observed vector $\mathbf{b}$ be collected as

$$\mathbf{x} = \begin{bmatrix}x_1\\ x_2\\ \vdots\\ x_N\end{bmatrix} \quad\text{and}\quad \mathbf{b} = \begin{bmatrix}b_1\\ b_2\\ \vdots\\ b_N\end{bmatrix}.$$

Suppose $\mathbf{x}\in\mathbb{C}^N$, about which we have quadratic measurements of the form

$$b_i = |\langle\mathbf{z}_i,\mathbf{x}\rangle|^2, \quad i = 1,2,\ldots,N, \qquad (8.63)$$

where $\langle\mathbf{c},\mathbf{d}\rangle$ is the (scalar-valued) inner product of finite-dimensional column vectors $\mathbf{c}$, $\mathbf{d}$. In other words, we are given information about the squared modulus of the inner product between the signal and some vectors $\mathbf{z}_i$. Our task is to find the unknown vector $\mathbf{x}$. The most important observation is that the nonlinear quadratic measurements can be linearized. It is well known that this can be done by interpreting the quadratic measurements as linear measurements of the unknown rank-one matrix $X = \mathbf{x}\mathbf{x}^*$. By using this "trick", we can solve a linear problem in the unknown matrix-valued variable $X$. It is remarkable that one additional dimension (from one to two dimensions) can make such a decisive difference. This trick has been systematically exploited in the context of matrix completion.

As a result of this trick, we have

$$\begin{aligned}
|\langle\mathbf{z}_i,\mathbf{x}\rangle|^2 &= \mathrm{Tr}\left(|\langle\mathbf{z}_i,\mathbf{x}\rangle|^2\right) && \text{(trace of a scalar equals the scalar)}\\
&= \mathrm{Tr}\left((\mathbf{z}_i^*\mathbf{x})(\mathbf{z}_i^*\mathbf{x})^*\right) && \text{(definitions of }\langle\cdot,\cdot\rangle\text{ and }|\cdot|^2)\\
&= \mathrm{Tr}\left(\mathbf{z}_i^*\mathbf{x}\mathbf{x}^*\mathbf{z}_i\right) && \text{(property of the Hermitian transpose)}\\
&= \mathrm{Tr}\left(\mathbf{z}_i^*X\mathbf{z}_i\right) && \text{(KEY: identifying }X = \mathbf{x}\mathbf{x}^*)\\
&= \mathrm{Tr}\left(\mathbf{z}_i\mathbf{z}_i^*X\right) && \text{(cyclical property of the trace)}\\
&= \mathrm{Tr}\left(A_iX\right) && \text{(identifying }A_i = \mathbf{z}_i\mathbf{z}_i^*).
\end{aligned} \qquad (8.64)$$

The first equality follows from the fact that the trace of a scalar equals the scalar itself, that is, $\mathrm{Tr}(\alpha) = \alpha$. The second equality follows from the definition of the inner product $\langle\mathbf{c},\mathbf{d}\rangle = \mathbf{c}^*\mathbf{d}$ and the definition of the squared modulus of a scalar, that is, for any complex scalar $\alpha$, $|\alpha|^2 = \alpha\alpha^* = \alpha^*\alpha$. The third equality follows from the property of the Hermitian transpose (transpose and conjugation), that is, $(AB)^* = B^*A^*$. The fourth step is critical, identifying the rank-one matrix $X = \mathbf{x}\mathbf{x}^*$. Note that $A^*A$ and $AA^*$ are always positive semidefinite, $A^*A, AA^*\succeq 0$, for any matrix $A$. The fifth equality follows from the famous cyclical property [17, p. 31]

$$\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB) = \mathrm{Tr}(BCA).$$

Note that the trace is a linear operator. The last step is reached by identifying another rank-one matrix $A_i = \mathbf{z}_i\mathbf{z}_i^*$.
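The chain of identities (8.64) can be checked numerically in a few lines. The numpy sketch below draws a random signal and measurement vector and confirms that $|\langle\mathbf{z},\mathbf{x}\rangle|^2$ equals $\mathrm{Tr}(AX)$ with $A = \mathbf{z}\mathbf{z}^*$ and $X = \mathbf{x}\mathbf{x}^*$.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 8
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # unknown signal
z = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # measurement vector

b = abs(np.vdot(z, x)) ** 2            # |<z, x>|^2 with <c, d> = c^* d
A = np.outer(z, z.conj())              # A = z z^*
X = np.outer(x, x.conj())              # X = x x^* (rank one, PSD)
print("lifted identity holds:", np.isclose(b, np.trace(A @ X).real))
```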
Since the trace is a linear operator [17, p. 30], the phase retrieval problem, by combining (8.63) and (8.64), comes down to a linear problem in the unknown rank-one matrix $X$. Let $\mathcal{A}$ be the linear operator that maps a positive semidefinite matrix $X$ into $\{\mathrm{Tr}(A_iX) : i = 1,\ldots,N\}$. In other words, we are given $N$ observed pairs $\{(y_i,b_i) : i = 1,\ldots,N\}$, where $y_i = \mathrm{Tr}(A_iX)$ is a scalar-valued random variable. Thus, the phase retrieval problem is equivalent to

$$\begin{array}{ll}\text{find} & X\\ \text{subject to} & \mathcal{A}(X) = \mathbf{b}\\ & X\succeq 0\\ & \mathrm{rank}(X) = 1\end{array} \quad\Longleftrightarrow\quad \begin{array}{ll}\text{minimize} & \mathrm{rank}(X)\\ \text{subject to} & \mathcal{A}(X) = \mathbf{b}\\ & X\succeq 0.\end{array} \qquad (8.65)$$

After solving the left-hand side of (8.65), we can factorize the rank-one solution $X$ as $\mathbf{x}\mathbf{x}^*$. The equivalence between the left-hand and right-hand sides of (8.65) is straightforward, since, by definition, there exists a rank-one solution. This problem is a standard rank minimization problem over an affine slice of the positive semidefinite cone. Thus, it can be solved using the recently developed machinery of low-rank matrix completion or matrix recovery.

8.12.2 Matrix Recovery via Convex Programming

The rank minimization problem (8.65) is NP-hard. It is well known that the trace norm can be used as a convex surrogate for the rank function [478, 479]. This technique gives the familiar semidefinite program (SDP)

$$\begin{array}{ll}\text{minimize} & \mathrm{Tr}(X)\\ \text{subject to} & \mathcal{A}(X) = \mathbf{b}\\ & X\succeq 0,\end{array} \qquad (8.66)$$

where $\mathcal{A}$ is a linear operator. This problem is convex, and there exists a wide array of general-purpose solvers. As far as we are concerned, the problem is solved once it can be formulated in terms of convex optimization. For example, in [473], (8.66) was solved using the solver SDPT3 [480] with the interface provided by the package CVX [481]. In [474], the singular value thresholding (SVT) method [466] was used. In [463], all algorithms were implemented in MATLAB using TFOCS [482].
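The SDP (8.66) can also be prototyped in Python/CVXPY without any specialized solver. The sketch below uses a real-valued toy instance (the signal length, the number of measurements, and the use of real Gaussian measurement vectors are assumptions); the estimate is extracted from the leading eigenvector of the solution, up to a global sign.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(7)
n, M = 8, 48                                   # signal length, number of measurements
x0 = rng.standard_normal(n)                    # real-valued toy signal
Z = rng.standard_normal((M, n))                # measurement vectors z_i (rows)
b = (Z @ x0) ** 2                              # b_i = |<z_i, x0>|^2

X = cp.Variable((n, n), PSD=True)              # lifted variable X = x x^T
cons = [cp.trace(np.outer(Z[i], Z[i]) @ X) == b[i] for i in range(M)]
prob = cp.Problem(cp.Minimize(cp.trace(X)), cons)   # trace as surrogate for rank, (8.66)
prob.solve()

w, V = np.linalg.eigh(X.value)
x_hat = np.sqrt(max(w[-1], 0)) * V[:, -1]      # leading eigenvector, up to global sign
err = min(np.linalg.norm(x_hat - x0), np.linalg.norm(x_hat + x0)) / np.linalg.norm(x0)
print("relative error up to sign:", err)
```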
The trace norm promotes low-rank solutions, which is why it is so often used as a convex proxy for the rank. We can also solve a sequence of weighted trace-norm problems, a technique which provides even more accurate solutions [483, 484]. Choose $\varepsilon > 0$; start with $W_0 = I$ and, for $k = 0, 1, \ldots$, inductively define $X_k$ as the optimal solution to

$$\begin{array}{ll}\text{minimize} & \mathrm{Tr}(W_kX)\\ \text{subject to} & \mathcal{A}(X) = \mathbf{b}\\ & X\succeq 0,\end{array} \qquad (8.67)$$

and update the "reweighting matrix" as

$$W_{k+1} = (X_k + \varepsilon I)^{-1}.$$

The algorithm terminates on convergence or when the iteration attains a specified maximum number of iterations $k_{\max}$. The reweighting scheme [484, 485] can be viewed as attempting to solve

$$\begin{array}{ll}\text{minimize} & f(X) = \log\left(\det(X + \varepsilon I)\right)\\ \text{subject to} & \mathcal{A}(X) = \mathbf{b}\\ & X\succeq 0,\end{array} \qquad (8.68)$$

by minimizing the tangent approximation to $f$ at each iterate.
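The reweighting scheme (8.67) is a short loop around the same SDP. The following sketch (a real-valued toy instance with arbitrary sizes, $\varepsilon$, and number of iterations) applies a few reweighted iterations with the update $W_{k+1} = (X_k + \varepsilon I)^{-1}$.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(8)
n, M, eps = 6, 30, 1e-3
x0 = rng.standard_normal(n)
Z = rng.standard_normal((M, n))
b = (Z @ x0) ** 2
A = [np.outer(Z[i], Z[i]) for i in range(M)]

W = np.eye(n)                                   # W_0 = I
for k in range(4):                              # a few reweighting iterations
    X = cp.Variable((n, n), PSD=True)
    prob = cp.Problem(cp.Minimize(cp.trace(W @ X)),          # weighted trace, (8.67)
                      [cp.trace(A[i] @ X) == b[i] for i in range(M)])
    prob.solve()
    Xk = X.value
    W = np.linalg.inv(Xk + eps * np.eye(n))     # W_{k+1} = (X_k + eps I)^{-1}

print("eigenvalues of final X:", np.round(np.linalg.eigvalsh(Xk)[::-1][:3], 4))
```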


The noisy case can be solved. For more details, see [463].
In Sect. 8.10, following [457], we have studied a problem of estimating a
Hermitian nonnegatively definite matrix R of unit trace, e.g., a density matrix of
a quantum system and a covariance matrix of a measured data. Our estimation is
based on n i.i.d. measurements

(X1 , Y1 ) , . . . , (Xn , Yn ) , (8.69)

where

Yi = Tr (RXi ) + Wi , i = 1, . . . , n. (8.70)

By identifying R = X, Xi = Ai , and Yi = bi , our phase retrieval problem is


equivalent to this problem of (8.69).

8.12.3 Phase Space Tomography

We closely follow [474] for our development. Consider quasi-monochromatic light [486, Sect. 4.3.1] represented by a statistically stationary ensemble of analytic signals $V(\mathbf{r},t)$. For any wide-sense stationary (WSS) random process, the ensemble cross-correlation function $\Gamma(\mathbf{r}_1,\mathbf{r}_2;t_1,t_2)$ is independent of the origin of time and may be replaced by the corresponding temporal cross-correlation function. This function depends on the two time arguments only through their difference $\tau = t_2 - t_1$. Thus

$$\Gamma(\mathbf{r}_1,\mathbf{r}_2;\tau) = \mathbb{E}\left[V^*(\mathbf{r}_1,t)V(\mathbf{r}_2,t+\tau)\right] = \left\langle V^*(\mathbf{r}_1,t)V(\mathbf{r}_2,t+\tau)\right\rangle = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T}V^*(\mathbf{r}_1,t)V(\mathbf{r}_2,t+\tau)\,dt,$$

where $\langle\cdot\rangle$ represents the expectation over a statistical ensemble of realizations of the random fields. The cross-correlation function $\Gamma(\mathbf{r}_1,\mathbf{r}_2;\tau)$ is known as the mutual coherence function and is the central quantity of the elementary theory of optical coherence. Setting $\tau = 0$, $\Gamma(\mathbf{r}_1,\mathbf{r}_2;0)$ is just the mutual intensity $J(\mathbf{r}_1,\mathbf{r}_2)$. The Fourier transform of $\Gamma(\mathbf{r}_1,\mathbf{r}_2;\tau)$ with respect to the delay $\tau$ is given by

$$W(\mathbf{r}_1,\mathbf{r}_2;\omega) = \int\Gamma(\mathbf{r}_1,\mathbf{r}_2;\tau)\,e^{-j\omega\tau}\,d\tau.$$

To make the formulation more transparent, we neglect the time or temporal-frequency dependence by restricting the discussion to quasi-monochromatic illumination and to a one-dimensional description, although these restrictions are not necessary for what follows. The measurable quantity of the classical field is the

intensity. The simplified quantity is called the mutual intensity and is given by

$$J(x_1,x_2) = \left\langle V(x_1)V^*(x_2)\right\rangle.$$

The measurable quantity of the classical field after propagation over a distance $z$ is [474, 486]

$$I(x_0;z) = \iint dx_1\,dx_2\; J(x_1,x_2)\,\exp\!\left[-\frac{j\pi}{\lambda z}\left(x_1^2 - x_2^2\right)\right]\exp\!\left[j2\pi\,\frac{x_1-x_2}{\lambda z}\,x_0\right]. \qquad (8.71)$$

It is more insightful to express this in operator form as

$$I = \mathrm{Tr}\left(\mathbf{P}_{x_0}\mathbf{J}\right),$$

where $\mathbf{P}_x$ is the free-space propagation operator that combines both the quadratic phase and Fourier transform operations in (8.71). Here $x_0$ is the lateral coordinate of the observation plane. Note that $\mathbf{P}$ is an infinite-dimensional operator; in practice, we can consider the discrete, finite-dimensional operator (matrix) to avoid some subtlety. By changing variables to $x = \frac{x_1+x_2}{2}$ and $\Delta x = x_1 - x_2$ and Fourier transforming the mutual intensity with respect to $x$, we obtain the Ambiguity Function

$$A(\mu,\Delta x) = \int\left\langle u\!\left(x+\Delta x/2\right)u^*\!\left(x-\Delta x/2\right)\right\rangle\exp(-j2\pi\mu x)\,dx.$$

We can rewrite (8.71) as

$$\bar{I}(\mu;z) = A(\mu,\lambda z\mu),$$

where $\bar{I}(\mu;z)$ is the Fourier transform of the vector of measured intensities with respect to $x_0$. Thus, radial slices of the Ambiguity Function may be obtained by Fourier transforming the vectors of intensities measured at the corresponding propagation distances $z$. From the Ambiguity Function, the mutual intensity $J(x,\Delta x)$ can be recovered by an additional inverse Fourier transform, subject to sufficient sampling:

$$J(x,\Delta x) = \int A(y,\Delta x)\exp(j2\pi xy)\,dy.$$

Let us formulate the problem as a linear model. The measured intensity data are first arranged in the Ambiguity Function space. The mutual intensity $\mathbf{J}$ is defined as the unknown to solve for. To relate the unknown (the mutual intensity $\mathbf{J}$) to the measurements (the Ambiguity Function $\mathbf{A}$), the center-difference coordinate transform is first applied, which can be expressed as a linear transform $\mathbf{L}$ acting on the mutual intensity $\mathbf{J}$; this is followed by a Fourier transform $\mathbf{F}$ and the addition of measurement noise. Formally, we have

$$\mathbf{A} = \mathbf{F}\cdot\mathbf{L}\cdot\mathbf{J} + \mathbf{e}.$$

The propagation operator for the mutual intensity, $\mathbf{P}_x$, is unitary and Hermitian, since it preserves energy. The goal of low-rank matrix recovery here is to minimize the rank (the effective number of coherent modes). We formulate the physically meaningful belief that the number of significant coherent modes is very small (most eigenvalues of these modes are either very small or zero).

Mathematically, if we denote the eigenvalues by $\lambda_i$ and the estimated mutual intensity by $\hat{\mathbf{J}}$, the problem can be formulated as

$$\begin{array}{ll}\text{minimize} & \mathrm{rank}\left(\hat{\mathbf{J}}\right)\\ \text{subject to} & \mathbf{A} = \mathbf{F}\cdot\mathbf{L}\cdot\hat{\mathbf{J}},\\ & \lambda_i\ge 0 \ \text{ and }\ \sum_i\lambda_i = 1.\end{array} \qquad (8.72)$$

Direct rank minimization is NP-hard. We solve instead a proxy problem, replacing the rank with the nuclear norm; the nuclear norm of a matrix is defined as the sum of the singular values of the matrix. The corresponding problem is stated as

$$\begin{array}{ll}\text{minimize} & \big\|\hat{\mathbf{J}}\big\|_*\\ \text{subject to} & \mathbf{A} = \mathbf{F}\cdot\mathbf{L}\cdot\hat{\mathbf{J}},\\ & \lambda_i\ge 0 \ \text{ and }\ \sum_i\lambda_i = 1.\end{array} \qquad (8.73)$$

This problem is a convex optimization problem, which can be solved using general-purpose solvers. In [474], the singular value thresholding (SVT) method [466] was used.
used.

8.12.4 Self-Coherent RF Tomography

For the exposition, we follow our paper [487] on a novel single-step approach to self-coherent tomography; phase retrieval is implicitly executed.

8.12.4.1 System Model

In self-coherent tomography, we only know amplitude-only total fields and the


full-data incident fields. The system model in 2-D near field configuration of self-
coherent tomography can be described as follows. There are Nt transmitter sensors
on the source domain with locations ltnt , nt = 1, 2, . . . , Nt . There are Nr receiver
sensors on the measurement domain with locations lrnr , nr = 1, 2, . . . , Nr . The
target domain Ω is discretized into a total number of Nd subareas with the center of
450 8 Matrix Completion and Low-Rank Matrix Recovery

the subarea located at ldnd , nd = 1, 2, . . . , Nd . The corresponding target scattering


strength is τnd , nd = 1, 2, . . . , Nd . If the nth
t sensor sounds the target domain and
the nth
r sensor receives the amplitude-only total field, the full-data measurement
equation is shown as,
  
Etot ltnt → lrnr = Einc ltnt → lrnr + Escatter ltnt → Ω → lrnr (8.74)
  
where Etot ltnt → lrnr  is the amplitude-only total field measured by the nth r
receiver
 t sensor due to the sounding signal from the nth t transmitter sensor;
Einc lnt → lrnr is the incident field directly from the nth
t transmitter sensor to
the nth
r receiver sensor; E l
scatter nt
t
→ Ω → l r
nr is the scattered field from the
target domain which can be expressed as,

 Nd
 
Escatter ltnt →Ω→ lrnr = G ldnd → lrnr Etot ltnt → ldnd τnd (8.75)
nd =1


In Eq. (8.75), G ldnd → lrnr is the wave propagation Green’s function from
location ldnd to location lrnr and Etot ltnt → ldnd is the total field in the target
subarea ldnd caused by the sounding signal from the ntht transmitter sensor which
can be represented as the state equation shown as

  Nd  
Etot ltnt → ldnd = Einc ltnt → ldnd + G ldnd → ldnd
nd =1,nd =nd
 
Etot ltnt → ldnd τnd (8.76)

Hence, the goal of self-coherent tomography is to recover τnd , nd =


1, 2, . . . , Nd and image the target domain Ω based on Eqs. (8.74)–(8.76).
Define $\mathbf{e}_{tot,m}\in\mathbb{R}^{N_tN_r\times 1}$ as

$$\mathbf{e}_{tot,m} = \begin{bmatrix}|E_{tot}(\mathbf{l}_1^t\to\mathbf{l}_1^r)|\\ |E_{tot}(\mathbf{l}_1^t\to\mathbf{l}_2^r)|\\ \vdots\\ |E_{tot}(\mathbf{l}_1^t\to\mathbf{l}_{N_r}^r)|\\ |E_{tot}(\mathbf{l}_2^t\to\mathbf{l}_1^r)|\\ \vdots\\ |E_{tot}(\mathbf{l}_{N_t}^t\to\mathbf{l}_{N_r}^r)|\end{bmatrix}. \qquad (8.77)$$

Define $\mathbf{e}_{inc,m}\in\mathbb{C}^{N_tN_r\times 1}$ as

$$\mathbf{e}_{inc,m} = \begin{bmatrix}E_{inc}(\mathbf{l}_1^t\to\mathbf{l}_1^r)\\ E_{inc}(\mathbf{l}_1^t\to\mathbf{l}_2^r)\\ \vdots\\ E_{inc}(\mathbf{l}_1^t\to\mathbf{l}_{N_r}^r)\\ E_{inc}(\mathbf{l}_2^t\to\mathbf{l}_1^r)\\ \vdots\\ E_{inc}(\mathbf{l}_{N_t}^t\to\mathbf{l}_{N_r}^r)\end{bmatrix}. \qquad (8.78)$$

Define $\mathbf{e}_{scatter,m}\in\mathbb{C}^{N_tN_r\times 1}$ as

$$\mathbf{e}_{scatter,m} = \begin{bmatrix}\mathbf{e}_{scatter,m,1}\\ \mathbf{e}_{scatter,m,2}\\ \vdots\\ \mathbf{e}_{scatter,m,N_t}\end{bmatrix}, \qquad (8.79)$$

where, based on Eq. (8.75), $\mathbf{e}_{scatter,m,n_t}\in\mathbb{C}^{N_r\times 1}$ is given by

$$\mathbf{e}_{scatter,m,n_t} = \mathbf{G}_m\mathbf{E}_{tot,s,n_t}\boldsymbol{\tau}, \qquad (8.80)$$

where $\mathbf{G}_m\in\mathbb{C}^{N_r\times N_d}$ is defined as

$$\mathbf{G}_m = \begin{bmatrix}G(\mathbf{l}_1^d\to\mathbf{l}_1^r) & G(\mathbf{l}_2^d\to\mathbf{l}_1^r) & \cdots & G(\mathbf{l}_{N_d}^d\to\mathbf{l}_1^r)\\ G(\mathbf{l}_1^d\to\mathbf{l}_2^r) & G(\mathbf{l}_2^d\to\mathbf{l}_2^r) & \cdots & G(\mathbf{l}_{N_d}^d\to\mathbf{l}_2^r)\\ \vdots & \vdots & \ddots & \vdots\\ G(\mathbf{l}_1^d\to\mathbf{l}_{N_r}^r) & G(\mathbf{l}_2^d\to\mathbf{l}_{N_r}^r) & \cdots & G(\mathbf{l}_{N_d}^d\to\mathbf{l}_{N_r}^r)\end{bmatrix} \qquad (8.81)$$

and $\boldsymbol{\tau}$ is

$$\boldsymbol{\tau} = \begin{bmatrix}\tau_1\\ \tau_2\\ \vdots\\ \tau_{N_d}\end{bmatrix}. \qquad (8.82)$$

Besides, $\mathbf{E}_{tot,s,n_t} = \mathrm{diag}(\mathbf{e}_{tot,s,n_t})$, and based on Eq. (8.76), $\mathbf{e}_{tot,s,n_t}\in\mathbb{C}^{N_d\times 1}$ can be expressed as

$$\mathbf{e}_{tot,s,n_t} = \left(\mathbf{I} - \mathbf{G}_s\,\mathrm{diag}(\boldsymbol{\tau})\right)^{-1}\mathbf{e}_{inc,s,n_t}, \qquad (8.83)$$

where $\mathbf{G}_s\in\mathbb{C}^{N_d\times N_d}$ is defined as

$$\mathbf{G}_s = \begin{bmatrix}0 & G(\mathbf{l}_2^d\to\mathbf{l}_1^d) & \cdots & G(\mathbf{l}_{N_d}^d\to\mathbf{l}_1^d)\\ G(\mathbf{l}_1^d\to\mathbf{l}_2^d) & 0 & \cdots & G(\mathbf{l}_{N_d}^d\to\mathbf{l}_2^d)\\ \vdots & \vdots & \ddots & \vdots\\ G(\mathbf{l}_1^d\to\mathbf{l}_{N_d}^d) & G(\mathbf{l}_2^d\to\mathbf{l}_{N_d}^d) & \cdots & 0\end{bmatrix} \qquad (8.84)$$

and $\mathbf{e}_{inc,s,n_t}\in\mathbb{C}^{N_d\times 1}$ is

$$\mathbf{e}_{inc,s,n_t} = \begin{bmatrix}E_{inc}(\mathbf{l}_{n_t}^t\to\mathbf{l}_1^d)\\ E_{inc}(\mathbf{l}_{n_t}^t\to\mathbf{l}_2^d)\\ \vdots\\ E_{inc}(\mathbf{l}_{n_t}^t\to\mathbf{l}_{N_d}^d)\end{bmatrix}. \qquad (8.85)$$

Define $\mathbf{E}_{tot,s}\in\mathbb{C}^{N_tN_d\times N_d}$ as

$$\mathbf{E}_{tot,s} = \begin{bmatrix}\mathrm{diag}\!\left(\left(\mathbf{I}-\mathbf{G}_s\,\mathrm{diag}(\boldsymbol{\tau})\right)^{-1}\mathbf{e}_{inc,s,1}\right)\\ \mathrm{diag}\!\left(\left(\mathbf{I}-\mathbf{G}_s\,\mathrm{diag}(\boldsymbol{\tau})\right)^{-1}\mathbf{e}_{inc,s,2}\right)\\ \vdots\\ \mathrm{diag}\!\left(\left(\mathbf{I}-\mathbf{G}_s\,\mathrm{diag}(\boldsymbol{\tau})\right)^{-1}\mathbf{e}_{inc,s,N_t}\right)\end{bmatrix}. \qquad (8.86)$$

Define $\mathbf{B}_m\in\mathbb{C}^{N_tN_r\times N_d}$ as

$$\mathbf{B}_m = \begin{bmatrix}\mathbf{G}_m & 0 & \cdots & 0\\ 0 & \mathbf{G}_m & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \mathbf{G}_m\end{bmatrix}\mathbf{E}_{tot,s}. \qquad (8.87)$$

From Eqs. (8.77) to (8.87), we can express $\mathbf{e}_{tot,m}$ as

$$\mathbf{e}_{tot,m} = \left|\mathbf{e}_{inc,m} + \mathbf{B}_m\boldsymbol{\tau}\right|. \qquad (8.88)$$

8.12.4.2 Mathematical Background

In a linear model for the phase retrieval problem, $\mathbf{y} = \mathbf{A}\mathbf{x}$, where $\mathbf{y}\in\mathbb{C}^{M\times 1}$, $\mathbf{A}\in\mathbb{C}^{M\times m}$, and $\mathbf{x}\in\mathbb{C}^{m\times 1}$, only the squared magnitude of the output $\mathbf{y}$ is observed:

$$o_i = |y_i|^2 = |\mathbf{a}_i\mathbf{x}|^2, \quad i = 1,2,\ldots,M, \qquad (8.89)$$

where

$$\mathbf{A} = [\mathbf{a}_1^H\ \mathbf{a}_2^H\ \ldots\ \mathbf{a}_M^H]^H, \qquad (8.90)$$

$$\mathbf{y} = [y_1^H\ y_2^H\ \ldots\ y_M^H]^H, \qquad (8.91)$$

and

$$\mathbf{o} = [o_1^T\ o_2^T\ \ldots\ o_M^T]^T, \qquad (8.92)$$

where $H$ is the Hermitian (conjugate transpose) operator and $T$ is the transpose operator.

We assume $\{o_i,\mathbf{a}_i\}_{i=1}^{M}$ are known and seek $\mathbf{x}$, which is called the generalized phase retrieval problem. A derivation from Eq. (8.89) gives

$$o_i = \mathbf{a}_i\mathbf{x}(\mathbf{a}_i\mathbf{x})^H = \mathbf{a}_i\mathbf{x}\mathbf{x}^H\mathbf{a}_i^H = \mathrm{trace}\left(\mathbf{a}_i^H\mathbf{a}_i\mathbf{x}\mathbf{x}^H\right), \qquad (8.93)$$

where $\mathrm{trace}$ returns the trace of a matrix. Define $\mathbf{A}_i = \mathbf{a}_i^H\mathbf{a}_i$ and $\mathbf{X} = \mathbf{x}\mathbf{x}^H$. Both $\mathbf{A}_i$ and $\mathbf{X}$ are rank-1 positive semidefinite matrices. Then

$$o_i = \mathrm{trace}(\mathbf{A}_i\mathbf{X}), \qquad (8.94)$$

which is called semidefinite relaxation.

In order to seek $\mathbf{x}$, we can first obtain the rank-1 positive semidefinite matrix $\mathbf{X}$, which can be the solution to the following optimization problem:

$$\begin{array}{ll}\text{minimize} & \mathrm{rank}(\mathbf{X})\\ \text{subject to} & o_i = \mathrm{trace}(\mathbf{A}_i\mathbf{X}), \quad i = 1,2,\ldots,M,\\ & \mathbf{X}\succeq 0.\end{array} \qquad (8.95)$$

However, the rank function is not convex, and the optimization problem (8.95) is not a convex optimization problem. Hence, the rank function is relaxed to the trace function or the nuclear norm function, which is convex. The optimization problem (8.95) can be relaxed to an SDP,

$$\begin{array}{ll}\text{minimize} & \mathrm{trace}(\mathbf{X})\\ \text{subject to} & o_i = \mathrm{trace}(\mathbf{A}_i\mathbf{X}), \quad i = 1,2,\ldots,M,\\ & \mathbf{X}\succeq 0,\end{array} \qquad (8.96)$$

which can be solved by CVX, a MATLAB-based modeling system for convex optimization [488]. If the solution $\mathbf{X}$ of the optimization problem (8.96) is a rank-1 matrix, then the optimal solution $\mathbf{x}$ of the original phase retrieval problem is obtained by an eigen-decomposition of $\mathbf{X}$. However, there is still a phase ambiguity problem. When the number of measurements $M$ is fewer than necessary for a unique solution, additional assumptions are needed to select one of the solutions [489]. Motivated by compressive sensing, if we would like to seek a sparse vector $\mathbf{x}$, the objective function in the SDP (8.96) can be replaced by $\mathrm{trace}(\mathbf{X}) + \delta\|\mathbf{X}\|_1$, where $\|\cdot\|_1$ returns the $\ell_1$ norm of a matrix and $\delta$ is a design parameter [489].

8.12.4.3 The Solution to Self-Coherent Tomography

Here, the solution to the linearized self-coherent tomography problem is given first. Then a novel single-step approach based on the Born iterative method is proposed to deal with self-coherent tomography when mutual multi-scattering is taken into account. The distorted-wave Born approximation (DWBA) is used here to linearize self-coherent tomography. Specifically, all the scattering within the target domain is ignored in the DWBA [490, 491]. Hence, $E_{tot}(\mathbf{l}_{n_t}^t\to\mathbf{l}_{n_d}^d)$ in Eq. (8.76) reduces to $E_{tot}(\mathbf{l}_{n_t}^t\to\mathbf{l}_{n_d}^d) = E_{inc}(\mathbf{l}_{n_t}^t\to\mathbf{l}_{n_d}^d)$, and $\mathbf{B}_m$ in Eq. (8.87) simplifies to

$$\mathbf{B}_m = \begin{bmatrix}\mathbf{G}_m & 0 & \cdots & 0\\ 0 & \mathbf{G}_m & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \mathbf{G}_m\end{bmatrix}\begin{bmatrix}\mathrm{diag}(\mathbf{e}_{inc,s,1})\\ \mathrm{diag}(\mathbf{e}_{inc,s,2})\\ \vdots\\ \mathrm{diag}(\mathbf{e}_{inc,s,N_t})\end{bmatrix}. \qquad (8.97)$$

In this way, $\mathbf{B}_m$ is independent of $\boldsymbol{\tau}$ and can be calculated through the Green's function. The goal of linearized self-coherent tomography is to recover $\boldsymbol{\tau}$ given $\mathbf{e}_{tot,m}$, $\mathbf{e}_{inc,m}$, and $\mathbf{B}_m$, based on Eq. (8.88).

Let $\mathbf{o} = \mathbf{e}_{tot,m}$, $\mathbf{c} = \mathbf{e}_{inc,m}$, $\mathbf{A} = \mathbf{B}_m$, and $\mathbf{x} = \boldsymbol{\tau}$. Equation (8.88) is equivalent to

$$\mathbf{o} = |\mathbf{c} + \mathbf{A}\mathbf{x}| \qquad (8.98)$$

and, squaring each entry,

$$o_i^2 = |c_i + \mathbf{a}_i\mathbf{x}|^2 = \mathrm{trace}(\mathbf{A}_i\mathbf{X}) + |c_i|^2 + (\mathbf{a}_i\mathbf{x})\,c_i^* + (\mathbf{a}_i\mathbf{x})^*c_i, \qquad (8.99)$$

where $*$ returns the conjugate of a complex number. There are two unknown variables, $\mathbf{X}$ and $\mathbf{x}$, in Eq. (8.99), which is different from Eq. (8.94), where there is only one unknown variable $\mathbf{X}$. In order to solve the set of nonlinear equations in Eq. (8.98) for $\mathbf{x}$, the following SDP is proposed:
8.12 Phase Retrieval via Matrix Completion 455

$$\begin{array}{ll}
\text{minimize} & \operatorname{trace}(\mathbf{X}) + \delta\|\mathbf{x}\|_2\\
\text{subject to} & o_i = \operatorname{trace}(\mathbf{A}_i\mathbf{X}) + |c_i|^2 + (\mathbf{a}_i\mathbf{x})\,c_i^* + (\mathbf{a}_i\mathbf{x})^*\,c_i, \quad i = 1,2,\ldots,N_tN_r\\[4pt]
& \begin{bmatrix}\mathbf{X} & \mathbf{x}\\ \mathbf{x}^H & 1\end{bmatrix}\succeq 0; \quad \mathbf{X}\succeq 0,
\end{array} \qquad (8.100)$$

where $\|\cdot\|_2$ returns the $l_2$ norm of a vector and $\delta$ is a design parameter. The optimal solution $\mathbf{x}$ can be obtained without phase ambiguity. Furthermore, if we know additional prior information about $\mathbf{x}$, for example, bounds on the real or imaginary part of each entry of $\mathbf{x}$, this prior information can be put into the optimization problem (8.100) as linear constraints:

$$\begin{array}{ll}
\text{minimize} & \operatorname{trace}(\mathbf{X}) + \delta\|\mathbf{x}\|_2\\
\text{subject to} & o_i = \operatorname{trace}(\mathbf{A}_i\mathbf{X}) + |c_i|^2 + (\mathbf{a}_i\mathbf{x})\,c_i^* + (\mathbf{a}_i\mathbf{x})^*\,c_i, \quad i = 1,2,\ldots,N_tN_r\\[4pt]
& \mathbf{b}_{real}^{lower} \le \operatorname{real}(\mathbf{x}) \le \mathbf{b}_{real}^{upper}\\[2pt]
& \mathbf{b}_{imag}^{lower} \le \operatorname{imag}(\mathbf{x}) \le \mathbf{b}_{imag}^{upper}\\[4pt]
& \begin{bmatrix}\mathbf{X} & \mathbf{x}\\ \mathbf{x}^H & 1\end{bmatrix}\succeq 0; \quad \mathbf{X}\succeq 0,
\end{array} \qquad (8.101)$$

where $\operatorname{real}(\cdot)$ returns the real part and $\operatorname{imag}(\cdot)$ the imaginary part of a complex number. $\mathbf{b}_{real}^{lower}$ and $\mathbf{b}_{real}^{upper}$ are the lower and upper bounds of the real part of $\mathbf{x}$, respectively. Similarly, $\mathbf{b}_{imag}^{lower}$ and $\mathbf{b}_{imag}^{upper}$ are the lower and upper bounds of the imaginary part of $\mathbf{x}$, respectively.
If mutual multi-scattering is considered, we have to solve Eq. (8.88) to obtain $\tau$, i.e., $\mathbf{x}$. The novel single-step approach based on the Born iterative method is proposed as follows:

1. Set $\tau^{(0)}$ to zero; $t = -1$;
2. $t = t + 1$; compute $\mathbf{B}_m^{(t)}$ based on Eqs. (8.87) and (8.86) using $\tau^{(t)}$;
3. Solve the inverse problem in Eq. (8.98) by the following SDP, using $\mathbf{B}_m^{(t)}$, to get $\tau^{(t+1)}$:

$$\begin{array}{ll}
\text{minimize} & \operatorname{trace}(\mathbf{X}) + \delta_1\|\mathbf{x}\|_2 + \delta_2\|\mathbf{o}-\mathbf{u}\|_2\\
\text{subject to} & u_i = \operatorname{trace}(\mathbf{A}_i\mathbf{X}) + |c_i|^2 + (\mathbf{a}_i\mathbf{x})\,c_i^* + (\mathbf{a}_i\mathbf{x})^*\,c_i, \quad i = 1,2,\ldots,N_tN_r\\[2pt]
& \mathbf{b}_{real}^{lower} \le \operatorname{real}(\mathbf{x}) \le \mathbf{b}_{real}^{upper}\\[2pt]
& \mathbf{b}_{imag}^{lower} \le \operatorname{imag}(\mathbf{x}) \le \mathbf{b}_{imag}^{upper}\\[4pt]
& \begin{bmatrix}\mathbf{X} & \mathbf{x}\\ \mathbf{x}^H & 1\end{bmatrix}\succeq 0; \quad \mathbf{X}\succeq 0,
\end{array} \qquad (8.102)$$

where the definitions of $\mathbf{o}$ and $\mathbf{u}$ follow Eq. (8.92);

4. If $\tau$ converges, stop; otherwise go to step 2.
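
A high-level sketch of this iteration (added here, not from the original text) is given below. The helper functions build_Bm and solve_sdp_8_102 are hypothetical placeholders standing in for the construction of $\mathbf{B}_m^{(t)}$ via Eqs. (8.86)-(8.87) and for a solver of the SDP (8.102).

```python
# Schematic Born-iterative loop for self-coherent tomography.
# build_Bm and solve_sdp_8_102 are hypothetical placeholders.
import numpy as np

def born_iterative(o, c, n_unknowns, tol=1e-4, max_iter=20):
    tau = np.zeros(n_unknowns, dtype=complex)   # step 1: tau^(0) = 0
    for t in range(max_iter):
        Bm_t = build_Bm(tau)                    # step 2: Eqs. (8.87) and (8.86)
        tau_next = solve_sdp_8_102(o, c, Bm_t)  # step 3: SDP (8.102)
        if np.linalg.norm(tau_next - tau) <= tol * max(np.linalg.norm(tau), 1.0):
            return tau_next                     # step 4: converged
        tau = tau_next
    return tau
```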

8.13 Further Comments

Noisy low-rank matrix completion with general sampling distribution was studied
by Klopp [492]. Concentration-based guarantees are studied by Foygel et al. [493],
Foygel and Srebro [494], and Koltchinskii and Rangel [495]. The paper [496]
introduces a penalized matrix estimation procedure aiming at solutions which are
sparse and low-rank at the same time.
Related work on phase retrieval includes [463, 464, 473, 474, 497–502]. In particular, robust phase retrieval for sparse signals is studied in [503].
Chapter 9
Covariance Matrix Estimation in High
Dimensions

Statistical structures start with covariance matrices. In practice, we must estimate the covariance matrix from the big data. One may think that this chapter is more basic than Chaps. 7 and 8 and thus should have been treated in earlier chapters. Recent work on compressed sensing and low-rank matrix recovery supports the idea that sparsity can be exploited for statistical estimation, too. Due to limited space, the treatment of this subject is necessarily brief. This chapter is mainly developed to support the detection theory in Chap. 10.

9.1 Big Picture: Sense, Communicate, Compute, and Control

The nonasymptotic point of view [108] may turn out to be relevant when the number
of observations is large. It is to fit large complex sets of data that one needs to
deal with possibly huge collections of models at different scales. This approach
allows the collections of models together with their dimensions to vary freely,
letting the dimensions be possibly of the same order of magnitude as the number
of observations. Concentration inequalities are the probabilistic tools that we need
to develop a nonasymptotic theory.
A hybrid, large-scale cognitive radio network (CRN) testbed consisting of 100
hybrid nodes: 84 USRP2 nodes and 16 WARP nodes, as shown in Fig. 9.1, is
deployed at Tennessee Technological University. In each node, non-contiguous
orthogonal frequency division multiplexing (NC-OFDM) waveforms are agile and
programmable, as shown in Fig. 9.2, due to the use of software defined radios;
such waveforms are ideal for the convergence of communications and sensing.
The network can work in two different modes: sense and communicate. They can
even work in a hybrid mode: communicating while sensing. From sensing point
of view, this network is an active wireless sensor network. Consequentially, many
analytical tools can be borrowed from wireless sensor network; on the other hand,
there is a fundamental difference between our sensing problems and the traditional


Fig. 9.1 A large-scale cognitive radio network is deployed at Tennessee Technological University,
as an experimental testbed on campus. A hybrid network consisting of 80 USRP2 nodes and 16
WARP nodes. The ultimate goal is to demonstrate the big picture: sense, communicate, compute,
and control

wireless sensor network. The main difference derives from the nature of the SDR
and dynamical spectrum access (DSA) for a cognitive radio. The large-scale CRN
testbed has received little attention in the literature.
With the vision of the big picture: sense, communicate, compute and control, we
deal with the Big Data. A fundamental problem is to determine what information
needs to be stored locally and what information to be communicated in a real-time
manner. The communications data rates ultimately determine how the computing
is distributed among the whole network. It is impossible to solve this problem
analytically, since the answer depends on applications. The expertise gained from this network will enable us to develop better ways to approach this problem. At this point, through a heuristic approach, we assume that only the covariance matrix of the data is measured at each node and communicated in real time. More specifically, at each USRP2 or WARP node, only the covariance matrix of the data is communicated across the network in real time, at a data rate of, say, 1 Mbps. These (sensing) nodes record the data much faster than the communications speed. For example, a data rate of 20 Mbps can be supported by using USRP2 nodes.
The problem is sometimes called wireless distributed computing. Our view
emphasizes the convergence of sensing and communications. Distributed (parallel)
computing is needed to support all kinds of applications in mind, with a purpose of
control across the network.


Fig. 9.2 The non-contiguous orthogonal frequency division multiplexing (NC-OFDM) wave-
forms are suitable for both communications and sensing. The agile, programmable waveforms
are made available by software defined radios (SDR)

To support the above vision, we distill the following mathematical problems:


1. High-dimensional data processing. One can infer dependence structures among variables by estimating the associated covariance matrices. Sample covariance matrices are most commonly used.
2. Data fusing. A sample covariance matrix is a random matrix. As a result, a sum
of random matrices is a fundamental mathematical problem.
3. Estimation and detection. Intrusion/activity detection can be enabled. Estimation
of network parameters is possible.
4. Machine learning. Machine learning algorithms can be distributed across the
network.
These mathematical problems are of special interest in the context of social networks. The data and/or the estimated information can be shared within the social networks. Although the concept is still remote at the time of writing this monograph, it is our belief that the rich information contained in the radio waveforms will make a difference when integrated into social networks. The applications are almost unlimited.
Cameras capture the information of optical fields (signals), while the SDR nodes
sense the environment using the radio frequency (RF). A multi-spectral approach
consists of sensors of a broad electromagnetic wave spectrum, even another physical
signal: acoustic sensors.

The vision of this section is interesting, especially in the context of the Smart
Grid that is a huge network full of sensors across the whole grid. There is an analogy
between the Smart Grid and the social network. Each node of the Grid is an agent.
This connection is a long-term research topic. The in-depth treatment of this topic is beyond the scope of this monograph.

9.1.1 Received Signal Strength (RSS) and Applications to Anomaly Detection

Sensing across a network of mobiles (such as smart phones) is emerging. Received


signal strength (RSS) is defined as the voltage measured by a receiver’s received sig-
nal strength indicator circuit (RSSI). The RSS can be shared within a social network,
such as Facebook. With the shared RSS across such a network, we can sense the
radio environment. Let us use an example to illustrate this concept. This concept can
be implemented not only in traditional wireless sensor networks, but also wireless
communications network. For example, Wi-Fi nodes and cognitive radio nodes can
be used to form such a “sensor” network. The big picture is “sense, communicate,
compute and control.”
In a real-world application of RSS, Chen, Wiesel, and Hero [504] demonstrate their proposed robust covariance estimator in activity/intrusion detection using an active wireless sensor network. They show that the measured data exhibit strong non-Gaussian behavior.
The experiment was set up on a Mica2 sensor network platform, which
consists of 14 sensor nodes randomly deployed inside and outside a laboratory
at the University of Michigan. Wireless sensors communicated with each other
asynchronously by broadcasting an RF signal every 0.5 seconds. The received
signal strength was recorded for each pair of transmitting and receiving nodes. There
were pairs of RSSI measurements over a 30-min period, and samples were acquired
every 0.5 s. During the experiment period, persons walked into and out of the lab
at random times, causing anomaly patterns in the RSSI measurements. Finally, for
ground truth, a Web camera was employed to record the actual activity.

9.1.2 NC-OFDM Waveforms and Applications to Anomaly Detection

The OFDM modulation waveforms can be measured for spectrum sensing in a


cognitive radio network. Then these waveforms data can be stored locally for
further processing. The first step is to estimate covariance from these stored data.
Our sampling rate is about 20 mega samples per second (Msps), in contrast
with 2 samples per second in Sect. 9.1.1 for received signal strength indicator
9.2 Covariance Matrix Estimation 461

circuit (RSSI). The difference is seven orders of magnitude. This fundamental


difference asks for a different approach. This difference is one basic motivation
for writing this book.
For more details on the network testbed, see Chap. 13.

9.2 Covariance Matrix Estimation

Estimating a covariance matrix (or a dispersion matrix) is a fundamental problem in


statistical signal processing. Many techniques for detection and estimation rely on
accurate estimation of the true covariance. In recent years, estimating a high dimen-
sional p × p covariance matrix under small sample size n has attracted considerable
attention. In these large p, small n problems, the classical sample covariance suffers
from a systematically distorted eigenstructure [383], and improved estimators are
required.

9.2.1 Classical Covariance Estimation

Consider a random vector

$$\mathbf{x} = (X_1, X_2, \ldots, X_p)^H,$$

where $H$ denotes the Hermitian (conjugate transpose) of a matrix. Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be independent random vectors that follow the same distribution as $\mathbf{x}$. For simplicity, we assume that the distribution has zero mean: $\mathbb{E}\mathbf{x} = 0$. The covariance matrix $\boldsymbol{\Sigma}$ is the $p\times p$ matrix that tabulates the second-order statistics of the distribution:

$$\boldsymbol{\Sigma} = \mathbb{E}\left[\mathbf{x}\mathbf{x}^H\right]. \qquad (9.1)$$

The classical estimator for the covariance matrix is the sample covariance matrix

$$\hat{\boldsymbol{\Sigma}}_n = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^H. \qquad (9.2)$$

The sample covariance matrix is an unbiased estimator of the covariance matrix: $\mathbb{E}\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}$.

Given a tolerance $\varepsilon\in(0,1)$, we can study how many samples $n$ are typically required to provide an estimate with relative error $\varepsilon$ in the spectral norm:

$$\mathbb{E}\left\|\hat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}\right\| \le \varepsilon\,\|\boldsymbol{\Sigma}\|, \qquad (9.3)$$

where $\|\mathbf{A}\|$ is the $l_2$ (spectral) norm. The symbol $\|\cdot\|_q$ refers to the Schatten $q$-norm of a matrix:

$$\|\mathbf{A}\|_q \triangleq \left[\operatorname{Tr}|\mathbf{A}|^q\right]^{1/q},$$

where $|\mathbf{A}| = \left(\mathbf{A}^H\mathbf{A}\right)^{1/2}$. This type of spectral-norm error bound defined in (9.3) is quite powerful. It limits the magnitude of the estimation error for each entry of the covariance matrix; it even controls the error in estimating the eigenvalues of the covariance using the eigenvalues of the sample covariance.

Unfortunately, the error bound (9.3) for the sample covariance estimator demands a lot of samples. Typical positive results state that the sample covariance matrix estimator is precise when the number of samples is proportional to the number of variables, provided that the distribution decays fast enough. For example, assuming that $\mathbf{x}$ follows a normal distribution:

$$n \ge C\varepsilon^{-2}p \;\Rightarrow\; \left\|\hat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}\right\| \le \varepsilon\,\|\boldsymbol{\Sigma}\| \quad\text{with high probability}, \qquad (9.4)$$

where $C$ is an absolute constant.
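
To make the sample-size requirement in (9.4) concrete, the short numerical sketch below (an added illustration, not from the original text) draws Gaussian samples and reports the relative spectral-norm error of the sample covariance for a few ratios $n/p$; the specific dimensions are assumptions.

```python
# Empirical check of the relative spectral-norm error in (9.3)-(9.4) for
# Gaussian data; dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
p = 100
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)          # a generic positive-definite covariance

for ratio in (1, 5, 20):                 # n proportional to p, as in (9.4)
    n = ratio * p
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Sigma_hat = X.T @ X / n              # sample covariance (9.2), zero-mean data
    rel_err = np.linalg.norm(Sigma_hat - Sigma, 2) / np.linalg.norm(Sigma, 2)
    print(f"n/p = {ratio:2d}: relative spectral-norm error = {rel_err:.3f}")
```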


We are often interested in the largest and smallest eigenvalues of the empirical
covariance matrix of sub-Gaussian random vectors: sums of random vector outer
products. We present a version of [113]. This result (with non-explicit constants)
was originally obtained by Litvak et al. [338] and Vershynin [72].
Theorem 9.2.1 (Sums of random vector outer products [72, 113, 338]). Let $\mathbf{x}_1,\ldots,\mathbf{x}_N$ be random vectors in $\mathbb{R}^n$ such that, for some $\gamma\ge 0$,

$$\mathbb{E}\left[\mathbf{x}_i\mathbf{x}_i^T \mid \mathbf{x}_1,\ldots,\mathbf{x}_{i-1}\right] = \mathbf{I} \quad\text{and}\quad \mathbb{E}\left[\exp\left(\boldsymbol{\alpha}^T\mathbf{x}_i\right) \mid \mathbf{x}_1,\ldots,\mathbf{x}_{i-1}\right] \le \exp\left(\|\boldsymbol{\alpha}\|^2\gamma/2\right) \text{ for all }\boldsymbol{\alpha}\in\mathbb{R}^n,$$

for all $i = 1,\ldots,N$, almost surely. For all $\varepsilon\in(0,1/2)$ and $\delta\in(0,1)$,

$$\mathbb{P}\left[\lambda_{\max}\!\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T\right) > 1 + \frac{1}{1-2\varepsilon}C_{\varepsilon,\delta,N} \;\text{ or }\; \lambda_{\min}\!\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T\right) < 1 - \frac{1}{1-2\varepsilon}C_{\varepsilon,\delta,N}\right] \le \delta,$$

where

$$C_{\varepsilon,\delta,N} = \gamma\cdot\left(\sqrt{\frac{32\left(n\log(1+2/\varepsilon)+\log(2/\delta)\right)}{N}} + \frac{2\left(n\log(1+2/\varepsilon)+\log(2/\delta)\right)}{N}\right).$$

The sub-Gaussian property most readily lends itself to bounds on linear combinations of sub-Gaussian random variables. However, the outer products involve certain quadratic combinations. We bootstrap from the bound for linear combinations to bound the moment generating function of the quadratic combinations. From there, we get the desired tail bound.

For a (scalar-valued) non-negative random variable $W$ and any $\beta\in\mathbb{R}$, we have

$$\mathbb{E}\left[\exp(\beta W)\right] - \beta\,\mathbb{E}[W] - 1 = \int_0^{\infty}\beta\left(\exp(\beta t)-1\right)\cdot\mathbb{P}[W>t]\,dt. \qquad (9.5)$$

The claim follows using integration by parts.



Theorem 9.2.2 (Sums of random vector outer products (quadratic form) [113]). Let $\mathbf{x}_1,\ldots,\mathbf{x}_N$ be random vectors in $\mathbb{R}^n$ such that, for some $\gamma\ge 0$,

$$\mathbb{E}\left[\mathbf{x}_i\mathbf{x}_i^T \mid \mathbf{x}_1,\ldots,\mathbf{x}_{i-1}\right] = \mathbf{I} \quad\text{and}\quad \mathbb{E}\left[\exp\left(\boldsymbol{\alpha}^T\mathbf{x}_i\right) \mid \mathbf{x}_1,\ldots,\mathbf{x}_{i-1}\right] \le \exp\left(\|\boldsymbol{\alpha}\|^2\gamma/2\right) \text{ for all }\boldsymbol{\alpha}\in\mathbb{R}^n,$$

for all $i = 1,\ldots,N$, almost surely. For all $\boldsymbol{\alpha}\in\mathbb{R}^n$ such that $\|\boldsymbol{\alpha}\| = 1$ and all $\delta\in(0,1)$,

$$\mathbb{P}\left[\boldsymbol{\alpha}^T\!\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T\right)\boldsymbol{\alpha} > 1 + \sqrt{\frac{32\gamma^2\log(1/\delta)}{N}} + \frac{2\gamma\log(1/\delta)}{N}\right] \le \delta \quad\text{and}$$

$$\mathbb{P}\left[\boldsymbol{\alpha}^T\!\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T\right)\boldsymbol{\alpha} < 1 - \sqrt{\frac{32\gamma^2\log(1/\delta)}{N}}\right] \le \delta.$$

See [113] for a proof based on (9.5). With this theorem, we can bound the smallest and largest eigenvalues of the empirical covariance matrix by applying the bound for the Rayleigh quotient (quadratic form) in the above theorem together with a covering argument from Pisier [152].

9.2.2 Masked Sample Covariance Matrix

One way to circumvent the problem of covariance estimation in the large $p$, small $n$ regime is to assume that the covariance matrix is nearly sparse and to focus on estimating only the significant entries [130, 505]. A formalism called masked covariance estimation is introduced here. This approach uses a mask, constructed a priori, to specify the importance we place on each entry of the covariance matrix. By re-weighting the sample covariance matrix estimate using a mask, we can reduce the error that arises from imprecise estimates of covariances that are small or zero. The mask matrix formalism was first introduced by Levina and Vershynin [505].

Modern applications often involve a small number of samples and a large number of variables. The paucity of data makes it impossible to obtain an accurate estimator of a general covariance matrix. As a consequence, we must frame additional model assumptions and develop estimators that exploit this extra structure. A number of papers have focused on the situation where the covariance matrix is sparse or nearly so. Thus we limit our attention to the significant entries of the covariance matrix and thereby perform more accurate estimation with fewer samples.

Our analysis follows closely that of [130], using matrix concentration inequalities that are suitable for studying a sum of independent random matrices (Sect. 2.2). Indeed, matrix concentration inequalities can be viewed as far-reaching extensions of the classical inequalities for a sum of scalar random variables (Sect. 1.4.10). Matrix concentration inequalities sometimes allow us to replace devilishly hard

calculations with simple arithmetic. These inequalities streamline the analysis of


random matrices. We believe that the simplicity of the arguments and the strength
of the conclusions make a compelling case for the value of these methods. We
hope matrix concentration inequalities will find a place in the toolkit of researchers
working on multivariate problems in statistics.
In the regime $n\ll p$, where we have very few samples, we cannot hope to achieve an estimate like (9.3) for a general covariance matrix. Instead, we must impose additional assumptions and incorporate this prior information to construct a regularized estimator.

One way to formalize this idea is to construct a symmetric $p\times p$ matrix $\mathbf{M}$ with real entries, which we call the mask matrix. In the simplest case, the mask matrix has 0–1 values that indicate which entries of the covariance matrix we attend to. A unit entry $m_{ij} = 1$ means that we estimate the interaction between the $i$th and $j$th variables, while a zero entry $m_{ij} = 0$ means that we ignore their interaction when making the estimate. More generally, we allow the entries of the mask matrix to vary over the interval $[0,1]$, in which case the relative value of $m_{ij}$ is proportional to the importance of estimating the $(i,j)$ entry of the covariance matrix.
Given a mask $\mathbf{M}$, we define the masked sample covariance matrix estimator

$$\mathbf{M}\odot\hat{\boldsymbol{\Sigma}},$$

where the symbol $\odot$ denotes the component-wise (i.e., Schur or Hadamard) product. The following expression bounds the root-mean-square spectral-norm error that this estimator incurs:

$$\left(\mathbb{E}\left\|\mathbf{M}\odot\hat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}\right\|^2\right)^{1/2} \le \underbrace{\left(\mathbb{E}\left\|\mathbf{M}\odot\hat{\boldsymbol{\Sigma}}-\mathbf{M}\odot\boldsymbol{\Sigma}\right\|^2\right)^{1/2}}_{\text{variance}} + \underbrace{\left\|\mathbf{M}\odot\boldsymbol{\Sigma}-\boldsymbol{\Sigma}\right\|}_{\text{bias}}. \qquad (9.6)$$
This bound is analogous to the classical bias-variance decomposition for the mean-
squared-error (MSE) of a point estimator. To obtain an effective estimator, we must
design a mask that controls both the bias and the variance in (9.6). We cannot
neglect too many components of the covariance matrix, or else the bias in the
masked estimator may compromise its accuracy. On the other hand, each additional
component we add in our estimator contributes to the size of the variance term.
In the case where the covariance matrix is sparse, it is natural to strike a balance between these two effects by refusing to estimate entries of the covariance matrix that we know a priori to be small or zero.

For a stationary random process, the covariance matrix is Toeplitz. A Toeplitz matrix, or diagonal-constant matrix, named after Otto Toeplitz, is a matrix in which each descending diagonal from left to right is constant. For instance, the following matrix is a Toeplitz matrix:

$$\boldsymbol{\Sigma}_n = (\gamma_{i-j})_{1\le i,j\le n}. \qquad (9.7)$$



Example 9.2.3 (The Banded Estimator of a Decaying Matrix). Let us consider the example where entries of the covariance matrix $\boldsymbol{\Sigma}$ decay away from the diagonal. Suppose that, for a fixed parameter $\alpha > 1$,

$$\left|(\boldsymbol{\Sigma})_{ij}\right| \le |i-j+1|^{-\alpha} \quad\text{for each pair } (i,j) \text{ of indices.}$$

This type of property may hold for a random process whose correlations are localized in time. Related structure arises from random fields that have short spatial correlation scales.

A simple (suboptimal) approach to this covariance estimation problem is to focus on a band of entries near the diagonal. Suppose that the bandwidth is $B = 2b+1$ for a nonnegative integer $b$. For example, a mask with bandwidth $B = 3$ for an ensemble of $p = 5$ variables takes the form

$$\mathbf{M}_{band} = \begin{bmatrix} 1 & 1 & & & \\ 1 & 1 & 1 & & \\ & 1 & 1 & 1 & \\ & & 1 & 1 & 1\\ & & & 1 & 1 \end{bmatrix}.$$

In this setting, it is easy to compute the bias term in (9.6). Indeed,

$$\left|(\mathbf{M}\odot\boldsymbol{\Sigma}-\boldsymbol{\Sigma})_{ij}\right| \le \begin{cases} |i-j+1|^{-\alpha}, & |i-j| > b,\\ 0, & \text{otherwise.}\end{cases}$$

Gershgorin's theorem [187, Sect. 6.1] implies that the spectral norm of a symmetric matrix is dominated by the maximum $l_1$ norm of a column, so

$$\left\|\mathbf{M}\odot\boldsymbol{\Sigma}-\boldsymbol{\Sigma}\right\| \le 2\sum_{k>b}(k+1)^{-\alpha} \le \frac{2}{\alpha-1}(b+1)^{1-\alpha}.$$

The second inequality follows when we compare the sum with an integral. A similar calculation shows

$$\|\boldsymbol{\Sigma}\| \le 1 + 2(\alpha-1)^{-1}.$$

Assuming the covariance matrix really does have constant spectral norm, it follows that

$$\left\|\mathbf{M}\odot\boldsymbol{\Sigma}-\boldsymbol{\Sigma}\right\| \lesssim B^{1-\alpha}\,\|\boldsymbol{\Sigma}\|.$$
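
As a small illustration (added here, not in the original text), the banded mask and the masked sample covariance estimator $\mathbf{M}\odot\hat{\boldsymbol{\Sigma}}$ can be formed directly; the dimensions, bandwidth, and the AR(1)-type decaying covariance below are assumptions for the example.

```python
# Banded 0-1 mask M (bandwidth B = 2b+1) and the masked estimator M ∘ Σ̂.
# Dimensions and the decaying covariance are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
p, n, b = 50, 30, 2                                   # "large p, small n", bandwidth B = 5
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])    # entries decay away from the diagonal

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Sigma_hat = X.T @ X / n                               # sample covariance (9.2)

M = (np.abs(idx[:, None] - idx[None, :]) <= b).astype(float)   # banded mask
masked = M * Sigma_hat                                # Hadamard product M ∘ Σ̂

print("spectral error, full sample covariance :", round(np.linalg.norm(Sigma_hat - Sigma, 2), 3))
print("spectral error, banded masked estimator:", round(np.linalg.norm(masked - Sigma, 2), 3))
```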



9.2.2.1 Masked Covariance Estimation for Multivariate Normal Distributions

The main result on masked covariance estimation is presented here. The norm $\|\cdot\|_\infty$ returns the maximum absolute entry of a vector, but we use a separate notation $\|\cdot\|_{\max}$ for the maximum absolute entry of a matrix. We also require the norm

$$\|\mathbf{A}\|_{1\to2} \triangleq \max_j\left(\sum_i |a_{ij}|^2\right)^{1/2}.$$

The notation reflects the fact that this is the natural norm for linear maps from $l_1$ into $l_2$.
Theorem 9.2.4 (Chen and Tropp [130]). Fix a $p\times p$ symmetric mask matrix $\mathbf{M}$, where $p\ge 3$. Suppose that $\mathbf{x}$ is a Gaussian random vector in $\mathbb{R}^p$ with mean zero. Define the covariance matrix $\boldsymbol{\Sigma}$ and $\hat{\boldsymbol{\Sigma}}$ as in (9.1) and (9.2). Then the variance of the masked sample covariance estimator satisfies

$$\mathbb{E}\left\|\mathbf{M}\odot\hat{\boldsymbol{\Sigma}}-\mathbf{M}\odot\boldsymbol{\Sigma}\right\| \le C\left[\left(\frac{\|\boldsymbol{\Sigma}\|_{\max}\,\|\mathbf{M}\|_{1\to2}^{2}\,\log p}{\|\boldsymbol{\Sigma}\|\,n}\right)^{1/2} + \frac{\|\boldsymbol{\Sigma}\|_{\max}\,\|\mathbf{M}\|\,\log p\cdot\log(np)}{\|\boldsymbol{\Sigma}\|\,n}\right]\|\boldsymbol{\Sigma}\|. \qquad (9.8)$$

9.2.2.2 Two Complexity Metrics of a Mask Design

In this subsection, we use masks that take 0–1 values to gain intuition. Our analysis uses two separate metrics that quantify the complexity of the mask. The first complexity metric is the square of the maximum column norm:

$$\|\mathbf{M}\|_{1\to2}^2 \triangleq \max_j\sum_i |m_{ij}|^2.$$

Roughly, the bracket counts the number of interactions we want to estimate that involve the variable $j$, and the maximum computes a bound over all $p$ variables. This metric is "local" in nature. The second complexity metric is the spectral norm $\|\mathbf{M}\|$ of the mask matrix, which provides a more "global" view of the complexity of the interactions that we estimate.

Let us use some examples to illustrate. First, suppose we estimate the entire covariance matrix, so the mask is the matrix of ones:

$$\mathbf{M} = \text{matrix of ones} \;\Rightarrow\; \|\mathbf{M}\|_{1\to2}^2 = p \;\text{ and }\; \|\mathbf{M}\| = p.$$

Next, consider the mask that arises from the banded estimator in Example 9.2.3:

$$\mathbf{M} = \text{0--1 matrix, bandwidth } B \;\Rightarrow\; \|\mathbf{M}\|_{1\to2}^2 \le B \;\text{ and }\; \|\mathbf{M}\| \le B,$$

since there are at most $B$ ones in each row and column. When $B\ll p$, the banded mask asks us to estimate fewer interactions than the full mask, so we expect the estimation problem to be much easier.

9.2.2.3 Covariance Matrix for Wide-Sense Stationary (WSS)

A random process is wide-sense stationary (WSS) if its mean is constant for all time indices (i.e., independent of time) and its autocorrelation depends only on the time index difference. A WSS discrete random process $x[n]$ is statistically characterized by a constant mean

$$\bar{x}[n] = \bar{x},$$

and an autocorrelation sequence

$$r_{xx}[m] = \mathbb{E}\left\{x[n+m]\,x^*[n]\right\},$$

where $*$ denotes the complex conjugate. The terms "correlation" and "covariance" are often used synonymously in the literature, but they are formally identical only for zero-mean processes. The covariance matrix

$$\mathbf{R}_M = \begin{bmatrix} r_{xx}[0] & r_{xx}^*[1] & \cdots & r_{xx}^*[M]\\ r_{xx}[1] & r_{xx}[0] & \cdots & r_{xx}^*[M-1]\\ \vdots & \vdots & \ddots & \vdots\\ r_{xx}[M] & r_{xx}[M-1] & \cdots & r_{xx}[0]\end{bmatrix}$$

is a Hermitian Toeplitz autocorrelation matrix of order $M$ and, therefore, has dimension $(M+1)\times(M+1)$. Then, the quadratic form

$$\mathbf{a}^H\mathbf{R}_{xx}\mathbf{a} = \sum_{m=0}^{M}\sum_{n=0}^{M} a[m]\,a^*[n]\,r_{xx}[m-n] \ge 0 \qquad (9.9)$$

must be non-negative for any arbitrary $(M+1)\times 1$ vector $\mathbf{a}$ if $r_{xx}[m]$ is a valid autocorrelation sequence. From (9.9), it follows that the covariance matrix $\mathbf{R}_M$ is positive semi-definite, implying that all its eigenvalues must be non-negative:

$$\lambda_i(\mathbf{R}_M) \ge 0, \quad i = 1,\ldots,M+1.$$

This property is fundamental for covariance matrix estimation.
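
As a quick sanity check (an added illustration, not in the original), one can form the Hermitian Toeplitz autocorrelation matrix from a sampled autocorrelation sequence and verify numerically that its eigenvalues are non-negative. The example sequence below (one complex sinusoid plus white noise) is an assumption, and NumPy and SciPy are assumed available.

```python
# Build the (M+1)x(M+1) Hermitian Toeplitz autocorrelation matrix R_M from
# r_xx[0..M] and verify the PSD property (non-negative eigenvalues).
import numpy as np
from scipy.linalg import toeplitz

M, A, f, Ts, sigma2 = 16, 1.0, 0.12, 1.0, 0.5
m = np.arange(M + 1)
r = A**2 * np.exp(1j * 2 * np.pi * f * m * Ts)    # r_xx[m] for one complex sinusoid
r[0] += sigma2                                    # white-noise contribution at lag 0

R = toeplitz(r)           # first column r, first row conj(r): Hermitian Toeplitz
eigvals = np.linalg.eigvalsh(R)
print("smallest eigenvalue:", eigvals.min())      # >= 0 up to round-off
```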



9.2.2.4 Signal Plus Noise Model

As mentioned above, in the case where the covariance matrix is sparse, it is natural to strike a balance between these two effects by refusing to estimate entries of the covariance matrix that we know a priori to be small or zero. Let us illustrate this point by using an example that is crucial to high-dimensional data processing.

Example 9.2.5 (A Sum of Sinusoids in White Gaussian Noise [506]). Let us sample the continuous-time signal at a sampling interval $T_s$. If there are $L$ real sinusoids

$$x[n] = \sum_{l=1}^{L} A_l\sin(2\pi f_l n T_s + \theta_l),$$

each of which has a phase that is uniformly distributed on the interval $0$ to $2\pi$, independent of the other phases, then the mean of the $L$ sinusoids is zero and the autocorrelation sequence is

$$r_{xx}[m] = \sum_{l=1}^{L}\frac{A_l^2}{2}\cos(2\pi f_l m T_s).$$

If the process consists of $L$ complex sinusoids

$$x[n] = \sum_{l=1}^{L} A_l\exp\left[j(2\pi f_l n T_s + \theta_l)\right],$$

then the autocorrelation sequence is

$$r_{xx}[m] = \sum_{l=1}^{L} A_l^2\exp(j2\pi f_l m T_s).$$

White Gaussian noise is uncorrelated with itself for all lags, except at $m = 0$, for which the variance is $\sigma^2$. The autocorrelation sequence is

$$r_{ww}[m] = \sigma^2\delta[m],$$

whose power spectrum is constant over all frequencies, justifying the name white noise. The covariance matrix is

$$\mathbf{R}_{ww} = \sigma^2\mathbf{I} = \sigma^2\begin{bmatrix}1 & & 0\\ & \ddots & \\ 0 & & 1\end{bmatrix}. \qquad (9.10)$$

If an independent white noise process $w[n]$ is added to the complex sinusoids with random phases, then the combined process

$$y[n] = x[n] + w[n]$$

will have an autocorrelation sequence

$$r_{yy}[m] = r_{xx}[m] + r_{ww}[m] = \sum_{l=1}^{L} A_l^2\exp(j2\pi f_l m T_s) + \sigma^2\delta[m]. \qquad (9.11)$$

Equation (9.11) can be rewritten as

$$\mathbf{R}_{yy} = \mathbf{R}_{xx} + \mathbf{R}_{ww} = \sum_{l=1}^{L} A_l^2\,\mathbf{v}_M(f_l)\,\mathbf{v}_M^H(f_l) + \sigma^2\mathbf{I}, \qquad (9.12)$$

where $\mathbf{I}$ is an $(M+1)\times(M+1)$ identity matrix and

$$\mathbf{v}_M(f_l) = \begin{bmatrix}1\\ \exp(j2\pi f_l T_s)\\ \vdots\\ \exp(j2\pi f_l M T_s)\end{bmatrix}$$

is a complex sinusoidal vector at frequency $f_l$.

The impact of the additive white Gaussian noise $w[n]$ on the signal is, according to (9.12), only through the diagonal, since $\mathbf{R}_{ww} = \sigma^2\mathbf{I}$. This is an ideal model that is not valid for the "large $p$, small $n$" problem with $p$ variables and $n$ data samples: in this case the sample covariance matrix $\hat{\mathbf{R}}_{ww}$, the most commonly encountered estimate of $\mathbf{R}_{ww}$, is a positive semi-definite random matrix that is dense and of full rank. In other words, $\hat{\mathbf{R}}_{ww}$ is far away from the ideal covariance matrix $\sigma^2\mathbf{I}$. This observation has far-reaching impact, since the ideal covariance matrix $\mathbf{R}_{ww} = \sigma^2\mathbf{I}$ is a sparse matrix but the sample covariance matrix $\hat{\mathbf{R}}_{ww}$ is not sparse at all. □
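
The covariance model (9.12) is easy to generate numerically. The following sketch (an added illustration with assumed amplitudes, frequencies, and noise power) builds $\mathbf{R}_{yy}$ from the sinusoidal vectors and contrasts the ideal noise covariance $\sigma^2\mathbf{I}$ with a few-snapshot sample estimate $\hat{\mathbf{R}}_{ww}$.

```python
# Construct R_yy = sum_l A_l^2 v_M(f_l) v_M(f_l)^H + sigma^2 I as in (9.12),
# and contrast the ideal noise covariance with a few-snapshot sample estimate.
# Amplitudes, frequencies, and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
M, Ts, sigma2 = 31, 1.0, 1.0
freqs, amps = [0.10, 0.23], [2.0, 1.5]              # L = 2 complex sinusoids

m = np.arange(M + 1)
R_xx = sum(A**2 * np.outer(np.exp(1j*2*np.pi*f*m*Ts),
                           np.exp(-1j*2*np.pi*f*m*Ts)) for f, A in zip(freqs, amps))
R_yy = R_xx + sigma2 * np.eye(M + 1)                 # ideal covariance (9.12)

n = 20                                               # "small n": few noise snapshots
W = (rng.standard_normal((n, M + 1)) + 1j*rng.standard_normal((n, M + 1))) * np.sqrt(sigma2/2)
R_ww_hat = W.conj().T @ W / n                        # dense, full-rank sample estimate
print("rank of R_xx:", np.linalg.matrix_rank(R_xx))  # low rank (= L)
print("||R_ww_hat - sigma^2 I|| / sigma^2 =",
      round(np.linalg.norm(R_ww_hat - sigma2*np.eye(M+1), 2) / sigma2, 3))
```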
Example 9.2.6 (Tridiagonal Toeplitz Matrix [506]). An $n\times n$ tridiagonal Toeplitz matrix $\mathbf{T}$ has the form

$$\mathbf{T} = \begin{bmatrix} b & a & 0 & \cdots & 0\\ a & b & a & \ddots & \vdots\\ 0 & a & \ddots & \ddots & 0\\ \vdots & \ddots & \ddots & \ddots & a\\ 0 & \cdots & 0 & a & b\end{bmatrix},$$

where $a$ and $b$ are constants. The eigenvalues of $\mathbf{T}$ are

$$\lambda_k = b + 2a\cos\left(\frac{k\pi}{n+1}\right), \quad k = 1,\ldots,n,$$

and the corresponding eigenvectors are

$$\mathbf{v}_k = \sqrt{\frac{2}{n+1}}\begin{bmatrix}\sin\left(\frac{k\pi}{n+1}\right)\\ \vdots\\ \sin\left(\frac{kn\pi}{n+1}\right)\end{bmatrix}.$$

If additive Gaussian white noise is added to the signal whose covariance matrix is $\mathbf{T}$, then the resultant noisy signal has the covariance matrix

$$\mathbf{R}_{yy} = \mathbf{R}_{xx} + \mathbf{R}_{ww} = \mathbf{T} + \sigma^2\mathbf{I} = \begin{bmatrix} b+\sigma^2 & a & 0 & \cdots & 0\\ a & b+\sigma^2 & a & \ddots & \vdots\\ 0 & a & \ddots & \ddots & 0\\ \vdots & \ddots & \ddots & \ddots & a\\ 0 & \cdots & 0 & a & b+\sigma^2\end{bmatrix}.$$
Only the diagonal entries are affected by the ideal covariance matrix of the noise. □

A better model for the noise is

$$\hat{\mathbf{R}}_{ww} = \begin{bmatrix} 1 & \rho & \rho & \cdots & \rho\\ \rho & 1 & \rho & \ddots & \vdots\\ \rho & \rho & \ddots & \ddots & \rho\\ \vdots & \ddots & \ddots & \ddots & \rho\\ \rho & \cdots & \rho & \rho & 1\end{bmatrix},$$

where the correlation coefficient $\rho\le 1$ is typically small, for example, $\rho = 0.01$. A general model for the Gaussian noise is

$$\hat{\mathbf{R}}_{ww} = \sigma^2\begin{bmatrix} 1+\rho_{11} & \rho_{12} & \rho_{13} & \cdots & \rho_{1M}\\ \rho_{21} & 1+\rho_{22} & \rho_{23} & \cdots & \rho_{2M}\\ \rho_{31} & \rho_{32} & 1+\rho_{33} & \cdots & \rho_{3M}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \rho_{M1} & \rho_{M2} & \rho_{M3} & \cdots & 1+\rho_{MM}\end{bmatrix},$$

where $\rho_{ij}\le 1$, $i,j = 1,\ldots,M$, are random variables of the same order, e.g., 0.01. The accumulation effect of the weak random variables $\rho_{ij}$ has a decisive influence on the performance of the covariance estimation. This covariance matrix is dense and of full rank but positive semi-definite. It is difficult to enforce the Toeplitz structure on the estimated covariance matrix.
When a random noise vector is added to a random signal vector,

$$\mathbf{y} = \mathbf{x} + \mathbf{w},$$

where it is assumed that the random signal and the random noise are independent, it follows [507] that

$$\mathbf{R}_{yy} = \mathbf{R}_{xx} + \mathbf{R}_{ww}. \qquad (9.13)$$

The difficulty arises from the fact that $\mathbf{R}_{xx}$ is unknown. Our task at hand is to separate the two matrices: a matrix separation problem. The problem is greatly simplified if some special structures of these three matrices can be exploited. Two special structures are important: (1) $\mathbf{R}_{xx}$ is of low rank; (2) $\mathbf{R}_{xx}$ is sparse.

In the real world, we are given the data to estimate the covariance matrix of the noisy signal $\mathbf{R}_{yy}$,

$$\hat{\mathbf{R}}_{yy} = \hat{\mathbf{R}}_{xx} + \hat{\mathbf{R}}_{ww}, \qquad (9.14)$$

where we have used the assumption that the random signal and the random noise are independent, which is reasonable for most covariance matrix estimators in mind. Two estimators $\hat{\mathbf{R}}_{xx}$ and $\hat{\mathbf{R}}_{ww}$ are required. It is very critical to remember that their difficulties are fundamentally different; two different estimators must be used. The basic reason is that the signal subspace and the noise subspace are different, even though we cannot always separate the two subspaces using tools such as singular value decomposition or eigenvalue decomposition.
Example 9.2.7 (Sample Covariance Matrix). The classical estimator for the covariance matrix is the sample covariance matrix defined in (9.2) and repeated here for convenience:

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^H. \qquad (9.15)$$

Using (9.15) for $\mathbf{y} = \mathbf{x} + \mathbf{w}$, it follows that

$$\begin{aligned}
\hat{\mathbf{R}}_{yy} &= \frac{1}{n}\sum_{i=1}^{n}\mathbf{y}_i\mathbf{y}_i^H = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i+\mathbf{w}_i)(\mathbf{x}_i+\mathbf{w}_i)^H\\
&= \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^H + \frac{1}{n}\sum_{i=1}^{n}\mathbf{w}_i\mathbf{w}_i^H + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{w}_i^H}_{\to 0,\;n\to\infty} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\mathbf{w}_i\mathbf{x}_i^H}_{\to 0,\;n\to\infty} \quad\text{(zero-mean random vectors)}\\
&= \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^H + \frac{1}{n}\sum_{i=1}^{n}\mathbf{w}_i\mathbf{w}_i^H, \quad n\to\infty\\
&= \hat{\mathbf{R}}_{xx} + \hat{\mathbf{R}}_{ww}. \qquad (9.16)
\end{aligned}$$

Our ideal equation is the following:

Ryy = Rxx + Rww . (9.17)

Under what conditions does (9.16) approximate (9.17) with high accuracy? The asymptotic process in the derivation of (9.16) hides the difficulty of data processing. We need to make the asymptotic process explicit via feasible algorithms. In other words, we require a non-asymptotic theory for high-dimensional processing. Indeed, $n$ approaches a very large value but remains finite (say $n = N = 10^5$). For $\varepsilon\in(0,1)$, we require that

$$\left\|\hat{\mathbf{R}}_{xx}-\mathbf{R}_{xx}\right\| \le \varepsilon\,\|\mathbf{R}_{xx}\| \quad\text{and}\quad \left\|\hat{\mathbf{R}}_{ww}-\mathbf{R}_{ww}\right\| \le \varepsilon\,\|\mathbf{R}_{ww}\|.$$

To achieve the same accuracy $\varepsilon$, the sample size $n = N_x$ required for the signal covariance estimator $\hat{\mathbf{R}}_{xx}$ is much less than the $n = N_w$ required for the noise covariance estimator $\hat{\mathbf{R}}_{ww}$. This observation is very critical in data processing. For a given $n$, how close does $\hat{\mathbf{R}}_{xx}$ become to $\mathbf{R}_{xx}$? For a given $n$, how close does $\hat{\mathbf{R}}_{ww}$ become to $\mathbf{R}_{ww}$?
Let $\mathbf{A},\mathbf{B}\in\mathbb{C}^{N\times N}$ be Hermitian matrices. Then [16]

$$\lambda_i(\mathbf{A}) + \lambda_N(\mathbf{B}) \le \lambda_i(\mathbf{A}+\mathbf{B}) \le \lambda_i(\mathbf{A}) + \lambda_1(\mathbf{B}). \qquad (9.18)$$

Using (9.18), we have

$$\lambda_i\left(\hat{\mathbf{R}}_{xx}\right) + \lambda_M\left(\hat{\mathbf{R}}_{ww}\right) \le \lambda_i\left(\hat{\mathbf{R}}_{yy}\right) = \lambda_i\left(\hat{\mathbf{R}}_{xx}+\hat{\mathbf{R}}_{ww}\right) \le \lambda_i\left(\hat{\mathbf{R}}_{xx}\right) + \lambda_1\left(\hat{\mathbf{R}}_{ww}\right) \qquad (9.19)$$

for $i = 1,\ldots,M$. All these covariance matrices are positive semi-definite matrices with non-negative eigenvalues. If $\mathbf{R}_{xx}$ is of low rank, then for $q\ll M$ only the first $q$ dominant eigenvalues are of interest. Using (9.19) $q$ times and summing both sides of these $q$ inequalities yields

$$\sum_{i=1}^{q}\lambda_i\left(\hat{\mathbf{R}}_{xx}\right) + q\lambda_M\left(\hat{\mathbf{R}}_{ww}\right) \le \sum_{i=1}^{q}\lambda_i\left(\hat{\mathbf{R}}_{xx}+\hat{\mathbf{R}}_{ww}\right) \le \sum_{i=1}^{q}\lambda_i\left(\hat{\mathbf{R}}_{xx}\right) + q\lambda_1\left(\hat{\mathbf{R}}_{ww}\right). \qquad (9.20)$$

For $q = 1$, we have

$$\lambda_1\left(\hat{\mathbf{R}}_{xx}\right) + \lambda_M\left(\hat{\mathbf{R}}_{ww}\right) \le \lambda_1\left(\hat{\mathbf{R}}_{xx}+\hat{\mathbf{R}}_{ww}\right) \le \lambda_1\left(\hat{\mathbf{R}}_{xx}\right) + \lambda_1\left(\hat{\mathbf{R}}_{ww}\right).$$

For $q = M$, it follows that

$$\operatorname{Tr}\left(\hat{\mathbf{R}}_{xx}\right) + M\lambda_M\left(\hat{\mathbf{R}}_{ww}\right) \le \operatorname{Tr}\left(\hat{\mathbf{R}}_{yy}\right) = \operatorname{Tr}\left(\hat{\mathbf{R}}_{xx}+\hat{\mathbf{R}}_{ww}\right) \le \operatorname{Tr}\left(\hat{\mathbf{R}}_{xx}\right) + M\lambda_1\left(\hat{\mathbf{R}}_{ww}\right), \qquad (9.21)$$

where we have used the standard linear algebra identity

$$\operatorname{Tr}(\mathbf{A}) = \sum_{i=1}^{n}\lambda_i(\mathbf{A}).$$

More generally,

$$\operatorname{Tr}\left(\mathbf{A}^k\right) = \sum_{i=1}^{n}\lambda_i^k(\mathbf{A}), \quad \mathbf{A}\in\mathbb{C}^{n\times n},\; k\in\mathbb{N}.$$

In particular, if $k = 2, 4, \ldots$ is an even integer, then $\left[\operatorname{Tr}\left(\mathbf{A}^k\right)\right]^{1/k}$ is just the $l_k$ norm of these eigenvalues, and we have [9, p. 115]

$$\|\mathbf{A}\|_{op}^k \le \operatorname{Tr}\left(\mathbf{A}^k\right) \le n\,\|\mathbf{A}\|_{op}^k,$$

where $\|\cdot\|_{op}$ is the operator norm.
All eigenvalues we deal with here are non-negative, since the sample covariance matrix defined in (9.15) is non-negative. The eigenvalues, their sum, and the trace of a random matrix are scalar-valued random variables, so their expectations $\mathbb{E}$ can be considered. Since expectation and trace are both linear, they commute [38, 91]:

$$\mathbb{E}\operatorname{Tr}(\mathbf{A}) = \operatorname{Tr}(\mathbb{E}\mathbf{A}). \qquad (9.22)$$

Taking the expectation of (9.21), we have

$$\mathbb{E}\operatorname{Tr}\left(\hat{\mathbf{R}}_{xx}\right) + M\,\mathbb{E}\lambda_M\left(\hat{\mathbf{R}}_{ww}\right) \le \mathbb{E}\operatorname{Tr}\left(\hat{\mathbf{R}}_{yy}\right) = \mathbb{E}\operatorname{Tr}\left(\hat{\mathbf{R}}_{xx}+\hat{\mathbf{R}}_{ww}\right) \le \mathbb{E}\operatorname{Tr}\left(\hat{\mathbf{R}}_{xx}\right) + M\,\mathbb{E}\lambda_1\left(\hat{\mathbf{R}}_{ww}\right) \qquad (9.23)$$

and, with the aid of (9.22),

$$\operatorname{Tr}\left(\mathbb{E}\hat{\mathbf{R}}_{xx}\right) + M\,\mathbb{E}\lambda_M\left(\hat{\mathbf{R}}_{ww}\right) \le \operatorname{Tr}\left(\mathbb{E}\hat{\mathbf{R}}_{yy}\right) = \operatorname{Tr}\left(\mathbb{E}\hat{\mathbf{R}}_{xx}\right) + \operatorname{Tr}\left(\mathbb{E}\hat{\mathbf{R}}_{ww}\right) \le \operatorname{Tr}\left(\mathbb{E}\hat{\mathbf{R}}_{xx}\right) + M\,\mathbb{E}\lambda_1\left(\hat{\mathbf{R}}_{ww}\right). \qquad (9.24)$$

Obviously, $\mathbb{E}\lambda_M\left(\hat{\mathbf{R}}_{ww}\right)$ and $\mathbb{E}\lambda_1\left(\hat{\mathbf{R}}_{ww}\right)$ are non-negative scalar values, since $\lambda_i\left(\hat{\mathbf{R}}_{ww}\right)\ge 0$, $i = 1,\ldots,M$.

We are really concerned with

$$\left|\sum_{k=1}^{K}\lambda_1\left(\mathbf{R}_{yy,k}\right) - \sum_{k=1}^{K}\lambda_1\left(\hat{\mathbf{R}}_{yy,k}\right)\right| \le \varepsilon\sum_{k=1}^{K}\lambda_1\left(\mathbf{R}_{yy,k}\right). \qquad (9.25)$$

Sample covariance matrices are random matrices. In analogy with a sum of independent scalar-valued random variables, we can consider a sum of matrix-valued random variables. Instead of considering the sum, we can consider the expectation. □
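
A small numerical illustration (added here, with assumed dimensions) of the decomposition (9.16) and of the Weyl-type eigenvalue bounds (9.19) is sketched below.

```python
# Numerically check R̂_yy ≈ R̂_xx + R̂_ww and the Weyl-type bounds (9.19)
# on the eigenvalues of the sum; sizes and signal parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
M, n, sigma2 = 32, 200, 1.0
freqs, amps = [0.11, 0.27], [3.0, 2.0]

t = np.arange(M)
X = sum(A * np.exp(1j*(2*np.pi*f*t[None, :] + 2*np.pi*rng.random((n, 1))))
        for f, A in zip(freqs, amps))                        # n snapshots of 2 sinusoids
W = (rng.standard_normal((n, M)) + 1j*rng.standard_normal((n, M))) * np.sqrt(sigma2/2)
Y = X + W

R_xx_hat = X.conj().T @ X / n
R_ww_hat = W.conj().T @ W / n
R_yy_hat = Y.conj().T @ Y / n

# Weyl bounds: lambda_i(A) + lambda_min(B) <= lambda_i(A+B) <= lambda_i(A) + lambda_max(B)
S = R_xx_hat + R_ww_hat
lam_sum = np.sort(np.linalg.eigvalsh(S))[::-1]
lam_x = np.sort(np.linalg.eigvalsh(R_xx_hat))[::-1]
lam_w = np.linalg.eigvalsh(R_ww_hat)
assert np.all(lam_sum <= lam_x + lam_w.max() + 1e-9)
assert np.all(lam_sum >= lam_x + lam_w.min() - 1e-9)
print("size of the cross terms:", round(np.linalg.norm(R_yy_hat - S, 2), 3))  # shrinks as n grows
```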

9.2.3 Covariance Matrix Estimation for Stationary Time Series

We follow [508]. For a stationary random process, the covariance matrix is Toeplitz. A Toeplitz matrix, or diagonal-constant matrix, named after Otto Toeplitz, is a matrix in which each descending diagonal from left to right is constant. For instance, the following matrix is a Toeplitz matrix:

$$\boldsymbol{\Sigma}_n = (\gamma_{i-j})_{1\le i,j\le n}. \qquad (9.26)$$

A thresholded covariance matrix estimator can better characterize sparsity if the true covariance matrix is sparse. Toeplitz's connection between the eigenvalues of such matrices and the Fourier transforms of their entries is used. The thresholded sample covariance matrix is defined as

$$\hat{\boldsymbol{\Sigma}}_{T,A_T} = \left(\gamma_{s-t}\,\mathbf{1}_{\{|\gamma_{s-t}|\ge A_T\}}\right)_{1\le s,t\le T}$$

for a threshold $A_T = 2c\sqrt{\log T/T}$, where $c$ is a constant. The diagonal elements are never thresholded. The thresholded estimate may not be positive definite.

In the context of time series, the observations have an intrinsic temporal order and we expect that observations are weakly dependent if they are far apart, so banding seems natural. However, if there are many zeros or very weak correlations within the band, the banding method does not automatically generate a sparse estimate.
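
A thresholding rule of this kind is essentially one line of code. The sketch below (an added illustration with an assumed AR(1)-type process and an assumed constant $c$) applies the hard threshold to the off-diagonal entries of a sample covariance matrix.

```python
# Hard-thresholding of a sample covariance matrix: keep entries with
# |entry| >= A_T, never threshold the diagonal. The process and the constant
# c are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, c = 400, 0.5
x = np.zeros(T)
for t in range(1, T):                        # AR(1) series: weak, decaying dependence
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()

X = np.array([x[i:i + 50] for i in range(T - 50)])   # overlapping segments as rows
S = np.cov(X, rowvar=False)                  # sample covariance of the segments

A_T = 2 * c * np.sqrt(np.log(T) / T)         # threshold level
S_thr = np.where(np.abs(S) >= A_T, S, 0.0)   # hard threshold
np.fill_diagonal(S_thr, np.diag(S))          # diagonal is never thresholded
print("fraction of entries kept:", round(float(np.mean(S_thr != 0)), 3))
```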

9.3 Covariance Matrix Estimation

$\|\cdot\|$ is the operator norm and $\|\cdot\|_2$ the Euclidean norm in $\mathbb{R}^n$. For $N$ copies of a random vector $\mathbf{x}$, the sample covariance matrix is defined as

$$\hat{\boldsymbol{\Sigma}}_N = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\otimes\mathbf{x}_i.$$

Theorem 9.3.1 ([310]). Consider independent, isotropic random vectors $\mathbf{x}_i$ valued in $\mathbb{R}^n$. Assume that the $\mathbf{x}_i$ satisfy the strong regularity assumption: for some $C_0,\eta > 0$, one has

$$\mathbb{P}\left\{\|\mathbf{P}\mathbf{x}_i\|_2^2 > t\right\} \le C_0\,t^{-1-\eta} \quad\text{for } t > C_0\operatorname{rank}(\mathbf{P}) \qquad (9.27)$$

for every orthogonal projection $\mathbf{P}$ in $\mathbb{R}^n$. Then, for $\varepsilon\in(0,1)$ and for

$$N \ge C\varepsilon^{-2-2/\eta}\cdot n,$$

one has

$$\mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\otimes\mathbf{x}_i - \mathbf{I}\right\| \le \varepsilon. \qquad (9.28)$$

Here, $C = 512\,(48C_0)^{2+2/\eta}(6+6/\eta)^{1+4/\eta}$.
Corollary 9.3.2 (Covariance estimation [310]). Consider a random vector $\mathbf{x}$ valued in $\mathbb{R}^n$ with covariance matrix $\boldsymbol{\Sigma}$. Assume that, for some $C_0,\eta > 0$, the isotropic random vector $\mathbf{z} = \boldsymbol{\Sigma}^{-1/2}\mathbf{x}$ satisfies

$$\mathbb{P}\left\{\|\mathbf{P}\mathbf{z}\|_2^2 > t\right\} \le C_0\,t^{-1-\eta} \quad\text{for } t > C_0\operatorname{rank}(\mathbf{P}) \qquad (9.29)$$

for every orthogonal projection $\mathbf{P}$ in $\mathbb{R}^n$. Then, for $\varepsilon\in(0,1)$ and for

$$N \ge C\varepsilon^{-2-2/\eta}\cdot n,$$

the sample covariance matrix $\hat{\boldsymbol{\Sigma}}_N$ obtained from $N$ independent copies of $\mathbf{x}$ satisfies

$$\mathbb{E}\left\|\hat{\boldsymbol{\Sigma}}_N - \boldsymbol{\Sigma}\right\| \le \varepsilon\,\|\boldsymbol{\Sigma}\|. \qquad (9.30)$$

Theorem 9.3.1 says that, for sufficiently large $N$, all eigenvalues of the sample covariance matrix $\hat{\boldsymbol{\Sigma}}_N$ are concentrated near 1. The following corollary extends this to a result that holds for all $N$.

Corollary 9.3.3 (Extreme eigenvalues [310]). Let $n, N$ be arbitrary positive integers, suppose the $\mathbf{x}_i$ are $N$ independent, isotropic random vectors satisfying (9.27), and let $y = n/N$. Then the sample covariance matrix $\hat{\boldsymbol{\Sigma}}_N = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\otimes\mathbf{x}_i$ satisfies

$$1 - C_1 y^c \le \mathbb{E}\lambda_{\min}\left(\hat{\boldsymbol{\Sigma}}_N\right) \le \mathbb{E}\lambda_{\max}\left(\hat{\boldsymbol{\Sigma}}_N\right) \le 1 + C_1(y + y^c). \qquad (9.31)$$

Here $c = \frac{\eta}{2\eta+2}$, $C_1 = 512\,(16C_0)^{2+2/\eta}(6+6/\eta)^{1+4/\eta}$, and $\lambda_{\min}\left(\hat{\boldsymbol{\Sigma}}_N\right)$, $\lambda_{\max}\left(\hat{\boldsymbol{\Sigma}}_N\right)$ denote the smallest and the largest eigenvalues of $\hat{\boldsymbol{\Sigma}}_N$, respectively.

It is sufficient to assume $2+\eta$ moments for one-dimensional marginals rather than for marginals in all dimensions. This is only slightly stronger than the isotropy assumption, which fixes the second moments of one-dimensional marginals.

Corollary 9.3.4 (Smallest Eigenvalue [310]). Consider $N$ independent isotropic random vectors $\mathbf{x}_i$ valued in $\mathbb{R}^n$. Assume that the $\mathbf{x}_i$ satisfy the weak regularity assumption: for some $C_0,\eta > 0$,

$$\sup_{\|\mathbf{y}\|_2\le 1}\mathbb{E}\,|\langle\mathbf{x}_i,\mathbf{y}\rangle|^{2+\eta} \le C_0. \qquad (9.32)$$

Then, for $\varepsilon > 0$ and for

$$N \ge C\varepsilon^{-2-2/\eta}\cdot n,$$

the minimum eigenvalue of the sample covariance matrix $\hat{\boldsymbol{\Sigma}}_N$ satisfies

$$\mathbb{E}\lambda_{\min}\left(\hat{\boldsymbol{\Sigma}}_N\right) \ge 1 - \varepsilon.$$

Here $C = 40\,(10C_0)^{2/\eta}$.

9.4 Partial Estimation of Covariance Matrix

Theorem 9.4.1 (Estimation of Hadamard products [505]). Let $\mathbf{M}$ be an arbitrary fixed symmetric $n\times n$ matrix. Then

$$\mathbb{E}\left\|\mathbf{M}\odot\hat{\boldsymbol{\Sigma}}_N - \mathbf{M}\odot\boldsymbol{\Sigma}\right\| \le C\log^3(2n)\left(\frac{\|\mathbf{M}\|_{1,2}}{\sqrt{N}} + \frac{\|\mathbf{M}\|}{N}\right)\|\boldsymbol{\Sigma}\|. \qquad (9.33)$$

Here $\mathbf{M}$ does not depend on $\hat{\boldsymbol{\Sigma}}_N$ and $\boldsymbol{\Sigma}$.



Corollary 9.4.2 (Partial estimation [505]). Let $\mathbf{M}$ be an arbitrary fixed symmetric $n\times n$ matrix such that all of the entries are equal to 0 or 1, and there are at most $k$ nonzero entries in each column. Then

$$\mathbb{E}\left\|\mathbf{M}\odot\hat{\boldsymbol{\Sigma}}_N - \mathbf{M}\odot\boldsymbol{\Sigma}\right\| \le C\log^3(2n)\left(\frac{\sqrt{k}}{\sqrt{N}} + \frac{k}{N}\right)\|\boldsymbol{\Sigma}\|. \qquad (9.34)$$

Proof. We note that $\|\mathbf{M}\|_{1,2}\le\sqrt{k}$ and $\|\mathbf{M}\|\le k$ and apply Theorem 9.4.1. □

Corollary 9.4.2 implies that for every $\varepsilon\in(0,1)$, the sample size

$$N \ge 4C^2\varepsilon^{-2}k\log^6(2n) \;\text{ suffices for }\; \mathbb{E}\left\|\mathbf{M}\odot\hat{\boldsymbol{\Sigma}}_N - \mathbf{M}\odot\boldsymbol{\Sigma}\right\| \le \varepsilon\,\|\boldsymbol{\Sigma}\|. \qquad (9.35)$$

For sparse matrices $\mathbf{M}$ with $k\ll n$, this makes partial estimation possible with $N\ll n$ observations. Therefore, (9.35) is a satisfactory "sparse" version of the classical bound such as given in Corollary 9.3.2.

The non-zero entries of $\boldsymbol{\Sigma}$ can be identified by thresholding: if we assume that all non-zero entries in $\boldsymbol{\Sigma}$ are bounded away from zero by a margin of $h > 0$, then a sample size of

$$N \gtrsim h^{-2}\log(2n)$$

would assure that all their locations are estimated correctly with probability approaching 1. With this assumption, we could derive a bound for the thresholded estimator.
Example 9.4.3 (Thresholded estimator). An $n\times n$ tridiagonal Toeplitz matrix has the form

$$\boldsymbol{\Sigma} = \begin{bmatrix} b & a & 0 & \cdots & 0\\ a & b & a & \ddots & 0\\ 0 & a & \ddots & \ddots & 0\\ \vdots & \ddots & \ddots & \ddots & a\\ 0 & 0 & \cdots & a & b\end{bmatrix}_{n\times n}.$$

$$\mathbf{R}_{ww} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0\\ 0 & 1 & 0 & \ddots & 0\\ 0 & 0 & \ddots & \ddots & 0\\ \vdots & \ddots & \ddots & \ddots & 0\\ 0 & 0 & \cdots & 0 & 1\end{bmatrix}_{n\times n} + \begin{bmatrix} h & h & h & \cdots & h\\ h & h & h & \ddots & h\\ h & h & \ddots & \ddots & h\\ \vdots & \ddots & \ddots & \ddots & h\\ h & h & \cdots & h & h\end{bmatrix}_{n\times n}.$$

All non-zero entries in $\boldsymbol{\Sigma}$ are bounded away from zero by a margin of $h > 0$. □

9.5 Covariance Matrix Estimation in Infinite-Dimensional Data

In the context of kernel principal component analysis of high (or infinite) dimensional data, covariance matrix estimation is relevant to Big Data. Let $\|\mathbf{A}\|_2$ denote the spectral norm of a matrix $\mathbf{A}$. If $\mathbf{A}$ is symmetric, then $\|\mathbf{A}\|_2 = \max\{\lambda_{\max}(\mathbf{A}), -\lambda_{\min}(\mathbf{A})\}$, where $\lambda_{\max}(\mathbf{A})$ and $\lambda_{\min}(\mathbf{A})$ are, respectively, the largest and smallest eigenvalues of $\mathbf{A}$.

Example 9.5.1 (Infinite-dimensional data [113]). Let $\mathbf{x}_1,\ldots,\mathbf{x}_N$ be i.i.d. random vectors with true covariance matrix $\boldsymbol{\Sigma} = \mathbb{E}\left[\mathbf{x}_i\mathbf{x}_i^T\right]$, $\mathbf{K} = \mathbb{E}\left[\mathbf{x}_i\mathbf{x}_i^T\mathbf{x}_i\mathbf{x}_i^T\right]$, and $\|\mathbf{x}\|_2\le\alpha$ almost surely for some $\alpha > 0$. Define the random matrices $\mathbf{X}_i = \mathbf{x}_i\mathbf{x}_i^T - \boldsymbol{\Sigma}$ and the sample covariance matrix (a random matrix) $\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T$. We have $\lambda_{\max}(\mathbf{X}_i)\le\alpha^2-\lambda_{\min}(\mathbf{X}_i)$. Also,

$$\lambda_{\max}\left(\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\mathbf{X}_i^2\right) = \lambda_{\max}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)$$

and

$$\mathbb{E}\operatorname{Tr}\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{X}_i^2\right) = \operatorname{Tr}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right).$$

By using Theorem 2.16.4, we have that

$$\mathbb{P}\left(\lambda_{\max}\left(\hat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}\right) > \sqrt{\frac{2t\,\lambda_{\max}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}{N}} + \frac{\left(\alpha^2-\lambda_{\min}(\mathbf{X}_i)\right)t}{3N}\right) \le \frac{\operatorname{Tr}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}{\lambda_{\max}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}\cdot\frac{t}{e^t-t-1}.$$

Since $\lambda_{\max}(-\mathbf{X}_i) = \lambda_{\max}\left(\boldsymbol{\Sigma}-\mathbf{x}_i\mathbf{x}_i^T\right)\le\lambda_{\max}(\boldsymbol{\Sigma})$, by using Theorem 2.16.4 we thus have

$$\mathbb{P}\left(\lambda_{\max}\left(\boldsymbol{\Sigma}-\hat{\boldsymbol{\Sigma}}\right) > \sqrt{\frac{2t\,\lambda_{\max}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}{N}} + \frac{\lambda_{\max}(\boldsymbol{\Sigma})\,t}{3N}\right) \le \frac{\operatorname{Tr}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}{\lambda_{\max}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}\cdot\frac{t}{e^t-t-1}.$$

Combining the above two inequalities, we finally have that

$$\mathbb{P}\left(\left\|\boldsymbol{\Sigma}-\hat{\boldsymbol{\Sigma}}\right\|_2 > \sqrt{\frac{2t\,\lambda_{\max}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}{N}} + \frac{\max\left\{\alpha^2-\lambda_{\min}(\mathbf{X}_i),\,\lambda_{\max}(\boldsymbol{\Sigma})\right\}t}{3N}\right) \le \frac{\operatorname{Tr}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}{\lambda_{\max}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}\cdot\frac{2t}{e^t-t-1}.$$

The relevant notion of intrinsic dimension is $\frac{\operatorname{Tr}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}{\lambda_{\max}\left(\mathbf{K}-\boldsymbol{\Sigma}^2\right)}$, which can be finite even when the random vectors $\mathbf{x}_i$ take values in an infinite-dimensional Hilbert space.


9.6 Matrix Model of Signal Plus Noise Y = S + X

We follow [509, 510] for our exposition. Consider a matrix model of signal plus noise,

$$\mathbf{Y} = \mathbf{S} + \mathbf{X}, \qquad (9.36)$$

where $\mathbf{S}\in\mathbb{R}^{n\times n}$ is a deterministic matrix ("signal") and $\mathbf{X}\in\mathbb{R}^{n\times n}$ is a centered Gaussian matrix ("noise") whose entries are independent with variance $\sigma^2$. Our goal is to study non-asymptotic upper and lower bounds on the accuracy of approximation which involve explicitly the singular values of $\mathbf{S}$. Our work is motivated by the high-dimensional setting, in particular low-rank matrix recovery.

The Schatten-$p$ norm is defined in terms of the singular values $\lambda_1\ge\cdots\ge\lambda_n$ of a matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ as

$$\|\mathbf{A}\|_{S_p} = \left(\sum_{i=1}^{n}\lambda_i^p\right)^{1/p} \;\text{ for } 1\le p<\infty, \qquad \|\mathbf{A}\|_{S_\infty} = \|\mathbf{A}\|_{op} = \lambda_1.$$

When $p=\infty$, we obtain the operator norm $\|\mathbf{A}\|_{op}$, which is the largest singular value. When $p=2$, we obtain the commonly used Hilbert–Schmidt norm or Frobenius norm $\|\mathbf{A}\|_{S_2} = \|\mathbf{A}\|_F = \|\mathbf{A}\|_2$. When $p=1$, $\|\mathbf{A}\|_{S_1}$ denotes the nuclear norm.

Let the projection matrix $\mathbf{P}_r$ be the rank-$r$ projection which maximizes the Hilbert–Schmidt norm

$$\|\mathbf{P}_r\mathbf{A}\|_F = \|\mathbf{P}_r\mathbf{A}\|_2 = \|\mathbf{P}_r\mathbf{A}\|_{S_2}.$$

Let $\mathcal{O}_{n\times n,r}$ be the set of all orthogonal rank-$r$ projections onto subspaces of $\mathbb{R}^n$, so that we can write $\mathbf{P}_r\in\mathcal{O}_{n\times n,r}$. For any $\mathbf{A}\in\mathbb{R}^{n\times n}$, its singular values $\lambda_1,\ldots,\lambda_n$ are ordered in decreasing magnitude. In terms of singular values, we have

$$\|\mathbf{A}\|_F^2 = \sum_{i=1}^{n}\lambda_i^2, \qquad \|\mathbf{P}_r\mathbf{A}\|_F^2 = \sum_{i=1}^{r}\lambda_i^2.$$

Let us review some basic properties of orthogonal projections. By definition we have

$$\mathbf{P}_r = \mathbf{P}_r^T \quad\text{and}\quad \mathbf{P}_r = \mathbf{P}_r\mathbf{P}_r, \qquad \mathbf{P}_r\in\mathcal{S}_{n,r}.$$

Every orthogonal projection $\mathbf{P}_r$ is positive semidefinite. Let $\mathbf{I}_{r\times r}$ be the $r\times r$ identity matrix. For $\mathbf{P}_r^{(1)},\mathbf{P}_r^{(2)}\in\mathcal{S}_{n,r}$ with eigendecompositions

$$\mathbf{P}_r^{(1)} = \mathbf{U}\mathbf{I}_{r\times r}\mathbf{U}^T \quad\text{and}\quad \mathbf{P}_r^{(2)} = \tilde{\mathbf{U}}\mathbf{I}_{r\times r}\tilde{\mathbf{U}}^T,$$

we have

$$\operatorname{Tr}\left(\mathbf{P}_r^{(1)}\mathbf{P}_r^{(2)}\right) = \operatorname{Tr}\left(\mathbf{U}\mathbf{I}_{r\times r}\mathbf{U}^T\tilde{\mathbf{U}}\mathbf{I}_{r\times r}\tilde{\mathbf{U}}^T\right) = \operatorname{Tr}\left(\tilde{\mathbf{U}}^T\mathbf{U}\mathbf{I}_{r\times r}\mathbf{U}^T\tilde{\mathbf{U}}\mathbf{I}_{r\times r}\right).$$

The matrix $\boldsymbol{\Pi} = \tilde{\mathbf{U}}^T\mathbf{U}\mathbf{I}_{r\times r}\mathbf{U}^T\tilde{\mathbf{U}}$ is also an orthogonal projection. Since $\boldsymbol{\Pi}$ is positive semidefinite, the diagonal entries of $\boldsymbol{\Pi}$ are nonnegative. It follows that

$$\operatorname{Tr}\left(\mathbf{P}_r^{(1)}\mathbf{P}_r^{(2)}\right) = \operatorname{Tr}\left(\boldsymbol{\Pi}\mathbf{I}_{r\times r}\right) = \sum_{i=1}^{r}\Pi_{ii} \ge 0.$$

We conclude that

$$\left\|\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right\|_F = \left\|\left(\mathbf{I}-\mathbf{P}_r^{(1)}\right)-\left(\mathbf{I}-\mathbf{P}_r^{(2)}\right)\right\|_F \le \sqrt{2\min(r,n-r)}.$$

Let $S^{n-1}$ be the Euclidean sphere in $n$-dimensional space. Finally, by the symmetry of $\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}$, we obtain

$$\left\|\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right\|_{S_\infty} = \left\|\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right\|_{op} = \lambda_1\left(\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right) = \sup_{\mathbf{x}\in S^{n-1}}\left|\underbrace{\mathbf{x}^T\mathbf{P}_r^{(2)}\mathbf{x}}_{\in[0,1]} - \underbrace{\mathbf{x}^T\mathbf{P}_r^{(1)}\mathbf{x}}_{\in[0,1]}\right| \le 1.$$

The largest singular value (the operator norm) of the difference of two projection matrices is bounded by 1.

For (9.36), it is useful to bound the trace $\operatorname{Tr}\left(\mathbf{X}^T\left(\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right)\mathbf{S}\right)$ and the difference $\left\|\tilde{\mathbf{P}}_r\mathbf{Y}\right\|_F^2 - \left\|\mathbf{P}_r\mathbf{Y}\right\|_F^2$. Motivated by this purpose, we consider the trace
$\operatorname{Tr}\left(\mathbf{A}^T\left(\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right)\mathbf{B}\right)$, for two arbitrary rank-$r$ projections $\mathbf{P}_r^{(1)},\mathbf{P}_r^{(2)}\in\mathcal{S}_{n,r}$ and two arbitrary matrices $\mathbf{A},\mathbf{B}\in\mathbb{R}^{n\times n}$.

First, we observe that

$$\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)} = \mathbf{P}_r^{(2)}-\mathbf{P}_r^{(2)}\mathbf{P}_r^{(1)}+\mathbf{P}_r^{(2)}\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(1)} = \mathbf{P}_r^{(2)}\left(\mathbf{I}-\mathbf{P}_r^{(1)}\right) + \left(\mathbf{P}_r^{(2)}-\mathbf{I}\right)\mathbf{P}_r^{(1)}.$$

According to the proof of Proposition 8.1 of Rohde [509], we have

$$\left\|\left(\mathbf{I}-\mathbf{P}_r^{(2)}\right)\mathbf{P}_r^{(1)}\right\|_F = \left\|\mathbf{P}_r^{(2)}\left(\mathbf{I}-\mathbf{P}_r^{(1)}\right)\right\|_F = \frac{1}{\sqrt{2}}\left\|\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right\|_F.$$

By the Cauchy–Schwarz inequality we obtain

$$\begin{aligned}
\operatorname{Tr}\left(\mathbf{A}^T\left(\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right)\mathbf{B}\right) &= \operatorname{Tr}\left(\mathbf{A}^T\mathbf{P}_r^{(2)}\left(\mathbf{I}-\mathbf{P}_r^{(1)}\right)\mathbf{B}\right) + \operatorname{Tr}\left(\mathbf{A}^T\left(\mathbf{P}_r^{(2)}-\mathbf{I}\right)\mathbf{P}_r^{(1)}\mathbf{B}\right)\\
&\le \left\|\mathbf{B}\mathbf{A}^T\mathbf{P}_r^{(2)}\right\|_F\left\|\mathbf{P}_r^{(2)}\left(\mathbf{I}-\mathbf{P}_r^{(1)}\right)\right\|_F + \left\|\mathbf{P}_r^{(1)}\mathbf{B}\mathbf{A}^T\right\|_F\left\|\left(\mathbf{I}-\mathbf{P}_r^{(2)}\right)\mathbf{P}_r^{(1)}\right\|_F\\
&\le \frac{1}{\sqrt{2}}\sqrt{\min(r,n-r)}\left\|\mathbf{B}\mathbf{A}^T\right\|_{S_\infty}\left\|\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right\|_F + \frac{1}{\sqrt{2}}\sqrt{\min(r,n-r)}\left\|\mathbf{B}\mathbf{A}^T\right\|_{S_\infty}\left\|\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right\|_F\\
&\le \sqrt{2\min(r,n-r)}\,\lambda_1(\mathbf{A})\,\lambda_1(\mathbf{B})\left\|\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right\|_F.
\end{aligned}$$

Note that $\|\cdot\|_{S_\infty}$ is the largest singular value $\lambda_1(\cdot)$. Thus,

$$\operatorname{Tr}\left(\mathbf{A}^T\left(\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right)\mathbf{B}\right) \le \sqrt{2\min(r,n-r)}\,\lambda_1(\mathbf{A})\,\lambda_1(\mathbf{B})\left\|\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right\|_F. \qquad (9.37)$$

The inequality (9.37) is tight in the following case. When, for $n\ge 2r$, there are $r$ orthonormal vectors $\mathbf{u}_1,\ldots,\mathbf{u}_r$ and $\tilde{\mathbf{u}}_1,\ldots,\tilde{\mathbf{u}}_r$, we can form two rank-$r$ projection matrices

$$\mathbf{P}_r^{(1)} = \sum_{i=1}^{r}\mathbf{u}_i\mathbf{u}_i^T, \qquad \mathbf{P}_r^{(2)} = \sum_{i=1}^{r}\left(\sqrt{1-\alpha^2}\,\mathbf{u}_i+\alpha\tilde{\mathbf{u}}_i\right)\left(\sqrt{1-\alpha^2}\,\mathbf{u}_i+\alpha\tilde{\mathbf{u}}_i\right)^T,$$

and take

$$\mathbf{A} = \mu\mathbf{I}, \qquad \mathbf{B} = \nu\left(\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}\right).$$

In fact, for this case, the left-hand side of the inequality (9.37) attains the upper bound for any real numbers $0\le\alpha\le 1$ and $\mu,\nu > 0$.

For the considered case, let us explicitly evaluate the left-hand side of the inequality (9.37):

$$\operatorname{Tr}\left(\mathbf{A}^T\left(\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right)\mathbf{B}\right) = \mu\nu\left(2r - 2\operatorname{Tr}\left(\mathbf{P}_r^{(1)}\mathbf{P}_r^{(2)}\right)\right) = \mu\nu\left(2r - 2r(1-\alpha^2)\right) = 2r\mu\nu\alpha^2,$$

which coincides with the upper bound $\sqrt{2r}\,\lambda_1(\mathbf{A})\,\lambda_1(\mathbf{B})\left\|\mathbf{P}_r^{(2)}-\mathbf{P}_r^{(1)}\right\|_F = \sqrt{2r}\cdot\mu\cdot\nu\alpha\cdot\alpha\sqrt{2r} = 2r\mu\nu\alpha^2$.
We have used $\left\|\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}\right\|_F = \alpha\sqrt{2r}$ and $\lambda_1\left(\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}\right) = \alpha$. Let us establish them now. The first one is simple, since

$$\left\|\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}\right\|_F = \sqrt{\operatorname{Tr}\left(\left(\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}\right)\left(\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}\right)\right)} = \alpha\sqrt{2r}.$$

To prove the second one, we can check that $\alpha$ and $-\alpha$ are the only non-zero eigenvalues of the difference matrix $\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}$, and their eigenspaces are given by

$$W_{\alpha} = \operatorname{span}\left\{\sqrt{\tfrac{1+\alpha}{2}}\,\mathbf{u}_i - \sqrt{\tfrac{1-\alpha}{2}}\,\tilde{\mathbf{u}}_i,\; i=1,\ldots,r\right\}$$

and

$$W_{-\alpha} = \operatorname{span}\left\{\sqrt{\tfrac{1-\alpha}{2}}\,\mathbf{u}_i + \sqrt{\tfrac{1+\alpha}{2}}\,\tilde{\mathbf{u}}_i,\; i=1,\ldots,r\right\}.$$

Since $\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}$ is symmetric, it follows that

$$\lambda_1\left(\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}\right) = \max\left\{\lambda_{\max}\left(\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}\right),\;\left|\lambda_{\min}\left(\mathbf{P}_r^{(1)}-\mathbf{P}_r^{(2)}\right)\right|\right\} = \alpha.$$

Now we are in a position to consider the model (9.36) using the process

$$Z = \left\|\tilde{\mathbf{P}}_r\mathbf{Y}\right\|_F^2 - \left\|\mathbf{P}_r\mathbf{Y}\right\|_F^2 = \left\|\tilde{\mathbf{P}}_r(\mathbf{S}+\mathbf{X})\right\|_F^2 - \left\|\mathbf{P}_r(\mathbf{S}+\mathbf{X})\right\|_F^2,$$

for a pair of rank-$r$ projections. Recall that the projection matrix $\mathbf{P}_r$ is the rank-$r$ projection which maximizes the Hilbert–Schmidt norm $\|\mathbf{P}_r\mathbf{A}\|_F = \|\mathbf{P}_r\mathbf{A}\|_2 = \|\mathbf{P}_r\mathbf{A}\|_{S_2}$. On the other hand, $\tilde{\mathbf{P}}_r\in\mathcal{O}_{n\times n,r}$ is an arbitrary orthogonal rank-$r$ projection onto a subspace of $\mathbb{R}^n$. Obviously, $Z$ is a functional of the projection matrix $\tilde{\mathbf{P}}_r$. The supremum is denoted by

$$Z_{\tilde{\mathbf{P}}_r} = \sup_{\tilde{\mathbf{P}}_r\in\mathcal{O}_{n\times n,r}} Z_{\tilde{\mathbf{P}}_r},$$

where $\tilde{\mathbf{P}}_r$ is a location of the supremum. In general, $\tilde{\mathbf{P}}_r$ is not unique. In addition, the differences $\left\|\tilde{\mathbf{P}}_r\mathbf{Y}\right\|_F^2 - \left\|\mathbf{P}_r\mathbf{Y}\right\|_F^2$ are usually not centered.
Theorem 9.6.1 (Upper bound for Gaussian matrices [510]). Let the distribution of $X_{ij}$ be centered Gaussian with variance $\sigma^2$ and $\operatorname{rank}(\mathbf{X})\le r$. Then, for $r\le n-r$, the following bound holds:

$$\mathbb{E}Z \le \sigma^2 rn\left(\min\left(\frac{\lambda_1^2}{\lambda_r^2},\,1\right) + \frac{\lambda_1}{\sigma\sqrt{n}} + \min\left(\left(\frac{\frac{1}{r}\sum_{i=r+1}^{2r}\lambda_i^2}{\lambda_r^2}\right)^{1/2}\frac{\lambda_1}{\sigma\sqrt{n}},\;\frac{\lambda_1^2}{\lambda_r^2-\lambda_{r+1}^2}\right)\right),$$

where $\frac{\lambda_1^2}{\lambda_r^2-\lambda_{r+1}^2}$ is set to infinity if $\lambda_r = \lambda_{r+1}$.

Theorem 9.6.1 is valid for Gaussian entries, while the following theorem is more general, applying to i.i.d. entries with a finite fourth moment.
Theorem 9.6.2 (Universal upper bound [510]). Assume that the i.i.d. entries $X_{ij}$ of the random matrix $\mathbf{X}$ have finite variance $\sigma^2$ and finite fourth moment $m_4$. Then we have

$$\mathbb{E}Z \le r(n-r)\min(\mathrm{I},\mathrm{II},\mathrm{III}), \qquad (9.38)$$

where

$$\mathrm{I} = \sigma^2 + \sqrt{m_4} + \frac{\lambda_1}{\sqrt{n}}\left(\sigma + m_4^{1/4}\right),$$

$$\mathrm{II} = \begin{cases}\dfrac{\lambda_1^2}{\lambda_r^2-\lambda_{r+1}^2}\left(\sigma+\sqrt{m_4}\right)^2, & \text{if } \lambda_r > \lambda_{r+1},\\[6pt] \infty, & \text{if } \lambda_r = \lambda_{r+1},\end{cases} \qquad (9.39)$$

$$\mathrm{III} = \begin{cases}\dfrac{\lambda_1^2}{\lambda_r^2}\left(\sigma^2+\sqrt{m_4}\right) + \left(\dfrac{\lambda_1^2\sum_{i=r+1}^{2r}\lambda_i^2}{r(n-r)\lambda_r^2}\right)^{1/2}\left(\sigma+m_4^{1/4}\right), & \text{if } \lambda_r > 0,\\[6pt] \infty, & \text{if } \lambda_r = 0.\end{cases}$$

Let us consider some examples for the model $\mathbf{Y} = \mathbf{S}+\mathbf{X}$ defined in (9.36). Let $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n$ and $\hat{\lambda}_1\ge\hat{\lambda}_2\ge\cdots\ge\hat{\lambda}_n$ denote the singular values of $\mathbf{S}$ and of $\mathbf{Y} = \mathbf{S}+\mathbf{X}$, respectively. Recall that $\left\|\hat{\mathbf{P}}_r\mathbf{Y}\right\|_F^2 = \sum_{i=1}^{r}\hat{\lambda}_i^2$ and $\left\|\mathbf{P}_r\mathbf{S}\right\|_F^2 = \sum_{i=1}^{r}\lambda_i^2$ with the rank-$r$ projections $\hat{\mathbf{P}}_r$ and $\mathbf{P}_r$. The hat version stands for the non-centered Gaussian random matrix $\mathbf{Y}$, and the version without hat stands for the deterministic matrix $\mathbf{S}$.

Example 9.6.3 (The largest singular value [509]). Consider estimating $\lambda_1^2$, the largest eigenvalue of $\mathbf{S}^T\mathbf{S}$, based on the observation $\mathbf{Y} = \mathbf{S}+\mathbf{X}$ defined by (9.36). The maximum eigenvalue of $\mathbf{Y}^T\mathbf{Y}$ is positively biased as an estimate of $\lambda_1^2$, since

$$\mathbb{E}\hat{\lambda}_1^2 = \mathbb{E}\left\|\hat{\mathbf{P}}_1\mathbf{Y}\right\|_F^2 \ge \mathbb{E}\left\|\mathbf{P}_1\mathbf{Y}\right\|_F^2 = \lambda_1^2 + \sigma^2 n.$$

It is natural to consider $\hat{s} = \hat{\lambda}_1^2 - \sigma^2 n$ as an estimator for $\lambda_1^2$. However, the analysis in [509] reveals that

$$\mathbb{E}\hat{s} - \lambda_1^2 = \mathbb{E}\hat{\lambda}_1^2 - \sigma^2 n - \lambda_1^2$$

is strictly positive and bounded away from zero, uniformly over $\mathbf{S}\in\mathbb{R}^{n\times n}$. In fact,

$$\mathbb{E}\hat{s} - \lambda_1^2 \in \left[c_1\sigma^2\sqrt{n},\; c_2\left(\sigma^2 n + \sigma\sqrt{n}\,\lambda_1(\mathbf{S})\right)\right]$$

for some universal constants $c_1, c_2 > 0$, which do not depend on $n$, $\sigma^2$, and $\mathbf{S}$. □


Example 9.6.4 (Quadratic functional of low-rank matrices [509]). One natural candidate for estimating $\|\mathbf{S}\|_F^2$, based on the observation $\mathbf{Y} = \mathbf{S}+\mathbf{X}$ defined by (9.36), is the unbiased estimator $\|\mathbf{Y}\|_F^2 - \sigma^2 n^2$. A simple calculation gives

$$\operatorname{Var}\left(\|\mathbf{Y}\|_F^2 - \sigma^2 n^2\right) = 2\sigma^4 n^2 + 4\sigma^2\|\mathbf{S}\|_F^2. \qquad (9.40)$$

The disadvantage of this estimator is its large variance for large values of $n$: it depends quadratically on the dimension. If $r = \operatorname{rank}(\mathbf{S}) < n$, the matrix $\mathbf{S}$ can be fully characterized by $(2n-r)r$ parameters, as can be seen from the singular value decomposition. In other words, if $r\ll n$, the intrinsic dimension of the problem is of the order $rn$ rather than $n^2$. For every matrix with $\operatorname{rank}(\mathbf{S}) = r$, we have

$$\|\mathbf{S}\|_F^2 = \|\mathbf{P}_r\mathbf{S}\|_F^2.$$

Elementary analysis shows that $\|\mathbf{P}_r\mathbf{Y}\|_F^2 - \sigma^2 rn$ unbiasedly estimates $\|\mathbf{S}\|_F^2$, and

$$\operatorname{Var}\left(\|\mathbf{P}_r\mathbf{Y}\|_F^2 - \sigma^2 rn\right) = 2\sigma^4 rn + 4\sigma^2\|\mathbf{S}\|_F^2. \qquad (9.41)$$

Further, it follows that

$$\mathbb{E}\left(\|\mathbf{P}_r\mathbf{Y}\|_F^2 - \sigma^2 rn - \|\mathbf{S}\|_F^2 - 2\sigma\operatorname{Tr}\left(\mathbf{X}^T\mathbf{S}\right)\right)^2 = 2\sigma^4 rn,$$

that is, $\sigma^{-1}\left(\|\mathbf{P}_r\mathbf{Y}\|_F^2 - \sigma^2 rn - \|\mathbf{S}\|_F^2\right)$ is approximately centered Gaussian with variance $4\|\mathbf{S}\|_F^2$ if $\sigma^2 rn = o(1)$ in an asymptotic framework, and $4\|\mathbf{S}\|_F^2$ is the asymptotic efficiency lower bound [134]. The statistic $\|\mathbf{P}_r\mathbf{Y}\|_F^2 - \sigma^2 rn$, however, cannot be used as an estimator, since $\mathbf{P}_r = \mathbf{P}_r(\mathbf{S})$ depends on $\mathbf{S}$ itself and is unknown a priori. The analysis of [509] argues that the empirical low-rank projections $\left\|\hat{\mathbf{P}}_r\mathbf{Y}\right\|_F^2 - \sigma^2 rn$ cannot be successfully used for efficient estimation of $\|\mathbf{S}\|_F^2$, even if $\operatorname{rank}(\mathbf{S}) < n$ is explicitly known beforehand. □


9.7 Robust Covariance Estimation

Following [453], we introduce a robust covariance matrix estimator. For $i = 1,2,\ldots,N$, let $\mathbf{x}_i\in\mathbb{R}^n$ be samples from a zero-mean distribution with unknown covariance matrix $\mathbf{R}_x$, which is positive definite. Suppose that the data associated with some subset $S$ of individuals is arbitrarily corrupted. This adversarial corruption can be modeled as

$$\mathbf{y}_i = \mathbf{x}_i + \mathbf{v}_i, \quad i = 1,\ldots,N,$$

where $\mathbf{v}_i\in\mathbb{R}^n$ is a vector supported on the set $S$. Let

$$\hat{\mathbf{R}}_y = \frac{1}{N}\sum_{i=1}^{N}\mathbf{y}_i\mathbf{y}_i^T$$

be the sample covariance matrix of the corrupted samples. We define

$$\tilde{\mathbf{R}}_x = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T - \mathbf{R}_x,$$

which is a type of re-centered Wishart noise. After some algebra, we have

$$\hat{\mathbf{R}}_y = \mathbf{R}_x + \tilde{\mathbf{R}}_x + \boldsymbol{\Delta}, \qquad (9.42)$$

where

$$\boldsymbol{\Delta} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{v}_i\mathbf{v}_i^T + \frac{1}{N}\sum_{i=1}^{N}\left(\mathbf{x}_i\mathbf{v}_i^T + \mathbf{v}_i\mathbf{x}_i^T\right).$$

Let us demonstrate how to use concentration of measure (see Sects. 3.7 and 3.9) in this context. Let us assume that $\mathbf{R}_x$ has rank at most $r$. We can write

$$\tilde{\mathbf{R}}_x = \mathbf{Q}\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_i\mathbf{z}_i^T - \mathbf{I}_{r\times r}\right)\mathbf{Q}^T,$$

where $\mathbf{R}_x = \mathbf{Q}\mathbf{Q}^T$, and $\mathbf{z}_i\sim\mathcal{N}(0,\mathbf{I}_{r\times r})$ is standard Gaussian in dimension $r$. As a result, by known results on singular values of Wishart matrices [145], we have [453]

$$\frac{\left\|\tilde{\mathbf{R}}_x\right\|_{op}}{\left\|\mathbf{R}_x\right\|_{op}} \le 4\sqrt{\frac{r}{N}}, \qquad (9.43)$$

with probability greater than $1 - 2\exp(-c_1 r)$.
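
A small numerical sketch of the corruption model and of the decomposition (9.42) is given below (an added illustration; the dimensions, rank, and corruption pattern are assumptions).

```python
# Illustrate y_i = x_i + v_i and the decomposition R̂_y = R_x + R̃_x + Δ of (9.42).
# Sizes, rank, and corruption fraction are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, N, r = 40, 500, 3
Q = rng.standard_normal((n, r))
R_x = Q @ Q.T                                   # rank-r true covariance

Z = rng.standard_normal((N, r))
X = Z @ Q.T                                     # x_i = Q z_i, so E[x x^T] = R_x
V = np.zeros((N, n))
corrupted = rng.choice(N, size=N // 20, replace=False)
V[corrupted] = 10.0 * rng.standard_normal((len(corrupted), n))   # adversarial spikes
Y = X + V

R_y_hat = Y.T @ Y / N
R_x_tilde = X.T @ X / N - R_x                   # re-centered Wishart noise
Delta = R_y_hat - R_x - R_x_tilde               # corruption term in (9.42)

ratio = np.linalg.norm(R_x_tilde, 2) / np.linalg.norm(R_x, 2)
print("||R̃_x||/||R_x|| =", round(ratio, 3), " vs 4*sqrt(r/N) =", round(4*np.sqrt(r/N), 3))
print("||Δ|| =", round(np.linalg.norm(Delta, 2), 3))
```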


Chapter 10
Detection in High Dimensions

This chapter is the core of Part II: Applications.

Detection in high dimensions is fundamentally different from the traditional detection theory. Concentration of measure plays a central role due to the high dimensions. We exploit the blessing of dimensionality.
10.1 OFDM Radar

We propose to study the weak signal detection under the framework of sums of
random matrices. This matrix setting is natural for many radar problems, such as
orthogonal frequency division multiplexing (OFDM) radar and distributed aperture.
Each subcarrier (or antenna sensor) can be modeled as a random matrix, via, e.g.,
sample covariance matrix estimated using the time sequence of the data. Often the data sequence is extremely long. One fundamental problem is to break the long
data record into shorter data segments. Each short data segment is sufficiently long
to estimate the sample covariance matrix of the underlying distribution. If we have
128 subcarriers and 100 short data segments, we will have 12,800 random matrices
at our disposal. The most natural approach for data fusion is to sum up the 12,800
random matrices.
The random matrix (here sample covariance matrix) is the basic information
block in our proposed formalism. In this novel formalism, we take the number of
observations as it is and try to evaluate the effect of all the influential parameters.
The number of observations, large but finite-dimensional, is taken as it is—and
treated as “given”. From this given number of observations, we want algorithms to
achieve the performance as good as they can. We desire to estimate the covariance
matrix using a smaller number of observations; this way, a larger number of
covariance matrices can be obtained. In our proposed formalism, low rank matrix
recovery (or matrix completion) plays a fundamental role.


10.2 Principal Component Analysis

Principal component analysis (PCA) [208] is a classical method for reducing the dimension of data, say, from a high-dimensional subset of $\mathbb{R}^n$ down to some subset of $\mathbb{R}^d$, with $d\ll n$. PCA operates by projecting the data onto the $d$ directions of maximal variance, as captured by eigenvectors of the $n\times n$ true covariance matrix $\boldsymbol{\Sigma}$. See Sect. 3.6 for background and notation on induced operator norms. See the PhD dissertation [511] for a treatment of high-dimensional principal component analysis. We freely take material from [511] in this section to give some background on PCA and its SDP formulation.
PCA as subspace of maximal variance. Consider a collection of data points $\mathbf{x}_i$, $i = 1,\ldots,N$, in $\mathbb{R}^n$, drawn i.i.d. from a distribution $\mathbb{P}$. We denote the expectation with respect to this distribution by $\mathbb{E}$. Assume that the distribution is centered, i.e., $\mathbb{E}\mathbf{x} = 0$, and that $\mathbb{E}\|\mathbf{x}\|_2^2 < \infty$. We collect $\{\mathbf{x}_i\}_{i=1}^{N}$ in a matrix $\mathbf{X}\in\mathbb{R}^{N\times n}$; thus $\mathbf{x}_i$ represents the $i$-th row of $\mathbf{X}$. Let $\boldsymbol{\Sigma}$ and $\hat{\boldsymbol{\Sigma}} = \hat{\boldsymbol{\Sigma}}_N$ denote the true covariance matrix and the sample covariance matrix, respectively. We have

$$\boldsymbol{\Sigma} := \mathbb{E}\mathbf{x}\mathbf{x}^T, \qquad \hat{\boldsymbol{\Sigma}} := \frac{1}{N}\mathbf{X}^T\mathbf{X} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T. \qquad (10.1)$$

The first principal component of the distribution $\mathbb{P}$ is a vector $\mathbf{z}^\star\in\mathbb{R}^n$ satisfying

$$\mathbf{z}^\star \in \arg\max_{\|\mathbf{z}\|_2=1}\mathbb{E}\left(\mathbf{z}^T\mathbf{x}\right)^2, \qquad (10.2)$$

that is, $\mathbf{z}^\star$ is a direction along which the projection of the distribution has maximal variance. Noting that $\mathbb{E}\left(\mathbf{z}^T\mathbf{x}\right)^2 = \mathbb{E}\left[\left(\mathbf{z}^T\mathbf{x}\right)\left(\mathbf{z}^T\mathbf{x}\right)\right] = \mathbf{z}^T\left(\mathbb{E}\mathbf{x}\mathbf{x}^T\right)\mathbf{z}$, we obtain

$$\mathbf{z}^\star \in \arg\max_{\|\mathbf{z}\|_2=1}\mathbf{z}^T\boldsymbol{\Sigma}\mathbf{z}. \qquad (10.3)$$

By a well-known result in linear analysis, the Rayleigh–Ritz or Courant–Fischer theorem [23], (10.3) is the variational characterization of maximal eigenvectors of $\boldsymbol{\Sigma}$.
The second principal component is obtained by removing the contribution from the first principal component and applying the same procedure; that is, obtaining the first principal component of $\mathbf{x} - \left((\mathbf{z}^\star)^T\mathbf{x}\right)\mathbf{z}^\star$. The subsequent principal components are obtained recursively until all the variance in $\mathbf{x}$ is explained, i.e., the remainder is zero. In case of ambiguity, one chooses a direction orthogonal to all the previous components. Thus, principal components form an orthonormal basis for the eigenspace of $\boldsymbol{\Sigma}$ corresponding to nonzero eigenvalues.

SDP formulation. Let us derive an SDP equivalent to (10.3). Using the cyclic
property of the trace, we have Tr(zᵀΣz) = Tr(Σzzᵀ). For a matrix Z ∈ R^{n×n},
Z ⪰ 0 and rank(Z) = 1 is equivalent to Z = zzᵀ for some z ∈ R^n.
Imposing the additional condition Tr(Z) = 1 is equivalent to the additional
constraint ‖z‖₂ = 1. After dropping the constraint rank(Z) = 1, we obtain a relaxation
of (10.3):

\[
Z^{*} \in \arg\max_{Z \succeq 0,\ \operatorname{Tr}(Z)=1} \operatorname{Tr}\left(\Sigma Z\right). \tag{10.4}
\]
It turns out that this relaxation is in fact exact. That is,

Lemma 10.2.1. There is always a rank-one solution Z* = z*(z*)ᵀ of (10.4), where
z* = ϑ_max(Σ).

Any member of the set of eigenvectors of A associated with an eigenvalue is
denoted as ϑ(A). Similarly, ϑ_max(A) represents any eigenvector associated with
the maximal eigenvalue (occasionally referred to as a "maximal eigenvector").

Proof. It is enough to show that, for all Z feasible for (10.4), one has Tr(ΣZ) ≤
λ_max(Σ). Using the eigenvalue decomposition $Z = \sum_{i=1}^{n}\lambda_i u_i u_i^T$, this is equivalent
to $\sum_{i=1}^{n}\lambda_i u_i^T \Sigma u_i \le \lambda_{\max}(\Sigma)$. But this is true, by (10.3) and $\sum_{i=1}^{n}\lambda_i = 1$. □

As the optimization problem in (10.4) is over the cone of semidefinite matrices
(Z ⪰ 0), with an objective and extra constraints that are linear in the matrix Z, the
optimization problem (10.4) is a textbook example of an SDP [48]. SDPs belong
to the class of conic programs for which fast methods of solution are currently
available [512]. Software tools such as CVX can be used to solve (10.4).
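As an illustration, the following is a minimal MATLAB/CVX sketch of (10.4); it assumes the CVX toolbox is installed, and the problem size and the synthetic covariance matrix Sigma are our own illustrative choices.

% Minimal sketch: solving the relaxation (10.4) with CVX (assumes CVX is installed).
n = 20;
G = randn(n); Sigma = (G*G')/n;     % synthetic covariance matrix, for illustration only
cvx_begin sdp quiet
    variable Z(n,n) symmetric
    maximize( trace(Sigma*Z) )      % linear objective Tr(Sigma*Z)
    subject to
        trace(Z) == 1;
        Z >= 0;                     % semidefinite constraint (sdp mode)
cvx_end
% By Lemma 10.2.1 the optimal value should equal the largest eigenvalue of Sigma,
% and a maximal eigenvector can be read off from the rank-one solution Z.
[cvx_optval, max(eig(Sigma))]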
Noisy Samples. In practice, one does not have access to the true covariance matrix Σ,
but instead must rely on a "noisy" version of the form

Σ̂ = Σ + Δ (10.5)

where Δ = Δ_N denotes a random noise matrix, typically arising from having only
a finite number N of samples.
A natural question is under what conditions the sample eigenvectors based on Σ̂
are consistent estimators of their true analogues based on Σ. In the classical theory of
PCA, the model dimension n is viewed as fixed, and asymptotic statements are established
as the number of observations goes to infinity, N → ∞. However, such "fixed n, large N" scaling may
be inappropriate for today's big data applications, where the model dimension n is
comparable to or even larger than the number of observations N, i.e., n ≥ N.

10.2.1 PCA Inconsistency in High-Dimensional Setting

We briefly study some inconsistency results for PCA in the high-dimensional
setting where (N, n) → ∞. We observe data points {x_i}_{i=1}^N i.i.d. from a distribution
with true covariance matrix Σ := E x_i x_iᵀ. The single spiked covariance model
assumes the following structure on Σ:

\[
\Sigma = \beta\, z^{*}(z^{*})^T + I_{n\times n} \tag{10.6}
\]

where β > 0 is some positive constant, measuring the signal-to-noise ratio (SNR). The
eigenvalues of Σ are all equal to 1 except for the largest one, which is 1 + β, and z* is the
leading principal component of Σ. One then forms the sample covariance matrix Σ̂
and obtains its maximal eigenvector ẑ, hoping that ẑ is a consistent estimate of z*.
This unfortunately does not happen unless n/N → 0, as shown by Paul and
Johnstone [513], among others. See also [203]. As (N, n) → ∞ with n/N → α,
asymptotically, the following phase transition occurs:

\[
\langle \hat{z}, z^{*}\rangle^2 \to
\begin{cases}
0, & \beta \le \sqrt{\alpha}\\[4pt]
\dfrac{1-\alpha/\beta^2}{1+\alpha/\beta^2}, & \beta > \sqrt{\alpha}.
\end{cases} \tag{10.7}
\]

Note that ⟨ẑ, z*⟩² measures the squared cosine of the angle between ẑ and z* and is related to
the projection 2-distance between the corresponding 1-dimensional subspaces.
Neither case in (10.7) shows consistency, i.e., ⟨ẑ, z*⟩² → 1. This has led to
research on additional structure/constraints that one may impose on z* to allow
for consistent estimation.
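The phase transition (10.7) can be observed numerically. The following MATLAB sketch is illustrative only: the dimensions, the value of β, and the Gaussian spiked model are our own choices.

% Sketch of the inconsistency phenomenon (10.7) for the spiked model (10.6).
n = 400; N = 200; alpha = n/N; beta = 2;             % here beta > sqrt(alpha)
zstar = randn(n,1); zstar = zstar/norm(zstar);       % true leading principal component
X = randn(N,n)*sqrtm(beta*(zstar*zstar') + eye(n));  % rows are i.i.d. N(0, Sigma)
SigmaHat = (X'*X)/N;                                 % sample covariance matrix
[zhat, ~] = eigs(SigmaHat, 1);                       % sample maximal eigenvector
empirical = (zhat'*zstar)^2                          % observed squared overlap
predicted = (1 - alpha/beta^2)/(1 + alpha/beta^2)    % limit in (10.7) for beta > sqrt(alpha)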

10.3 Space-Time Coding Combined with CS


y = Ax + z

y = Hx + z

where H is a flat-fading MIMO channel matrix, as treated in MIMO information theory.

10.4 Sparse Principal Components

We follow [514, 515]. Let x₁, ..., x_n be n i.i.d. realizations of a random variable
x in R^N. Our task is to test whether the sphericity hypothesis is true, i.e., whether the
distribution of x is invariant under rotation in R^N. For a Gaussian distribution, this is
equivalent to testing whether the covariance matrix of x is of the form σ²I_N for some
known σ² > 0, where I_N is the identity matrix.
Without loss of generality, we may assume σ² = 1, so that the covariance matrix
is the identity in R^N under the null hypothesis. For alternative hypotheses, there
exists a privileged direction, along which x has more variance. Here we consider
the case where the privileged direction is sparse. The covariance matrix is then a sparse,
rank-one perturbation of the identity matrix I_N. Formally, let v ∈ R^N be
such that ‖v‖₂ = 1, ‖v‖₀ ≤ k, and θ > 0. The hypothesis testing problem is

\[
\begin{aligned}
H_0 &: x \sim \mathcal{N}(0, I)\\
H_1 &: x \sim \mathcal{N}\left(0, I + \theta v v^T\right).
\end{aligned} \tag{10.8}
\]
This problem is considered in [157], where v is the so-called feature. Later a
perturbation by several rank-one matrices is considered in [158, 159]. This idea is
carried out in a kernel space in [385]. What is new in this section is to include the
sparsity of v. The model under H₁ is a generalization of the spiked covariance
model since it allows v to be k-sparse on the unit Euclidean sphere. The statement
of H₁ is invariant under rotations of the k relevant variables.
Denote by Σ the covariance matrix of x. We most often use the empirical covariance
matrix Σ̂ defined by

\[
\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T = \frac{1}{n} X X^T
\]

where

\[
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n}\\
x_{21} & x_{22} & \cdots & x_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
x_{N1} & x_{N2} & \cdots & x_{Nn}
\end{bmatrix}_{N\times n}
\]

is an N × n random matrix whose i-th column is the observation x_i. Here Σ̂ is the
maximum likelihood estimate in the Gaussian case, when the mean is zero. Often,
Σ̂ is the only data provided to the statistician.

10.5 Information Plus Noise Model Using Sums of Random Vectors

A ubiquitous model is information plus noise. We consider the vector setting:

\[
y_k = x_k + z_k, \qquad k = 1, 2, \ldots, n
\]

where x_k represents the information and z_k the noise. Often, we have n independent
copies of y at our disposal for high-dimensional data processing. It is natural to
consider the concentration of measure of the sum

\[
y_1 + \cdots + y_n = (x_1 + \cdots + x_n) + (z_1 + \cdots + z_n),
\]

where the vectors y_k, x_k, z_k may be independent or dependent. The power of
expectation is due to the fact that expectation is valid for both independent and
dependent (vector-valued) random variables. Expectation is also linear, which is
fundamentally useful. The linearity of expectation implies that

\[
\begin{aligned}
\mathbb{E}(y_1 + \cdots + y_n) &= \mathbb{E} y_1 + \cdots + \mathbb{E} y_n\\
&= \mathbb{E}(x_1 + \cdots + x_n) + \mathbb{E}(z_1 + \cdots + z_n)\\
&= \mathbb{E} x_1 + \cdots + \mathbb{E} x_n + \mathbb{E} z_1 + \cdots + \mathbb{E} z_n.
\end{aligned}
\]

Consider a hypothesis testing problem in the setting of sums of random vectors:

\[
\begin{aligned}
H_0 &: \rho = z_1 + \cdots + z_n\\
H_1 &: \sigma = (x_1 + \cdots + x_n) + (z_1 + \cdots + z_n).
\end{aligned}
\]

10.6 Information Plus Noise Model Using Sums of Random Matrices

A ubiquitous model is information plus noise. We consider the matrix setting:

\[
Y_k = X_k + Z_k, \qquad k = 1, 2, \ldots, n
\]

where X_k represents the information and Z_k the noise. Often, we have n independent
copies of Y at our disposal for high-dimensional data processing. It is natural
to consider the matrix concentration of measure of the sum

\[
Y_1 + \cdots + Y_n = (X_1 + \cdots + X_n) + (Z_1 + \cdots + Z_n),
\]

where the matrices Y_k, X_k, Z_k may be independent or dependent. The power
of expectation is due to the fact that expectation is valid for both independent and
dependent (matrix-valued) random variables. Expectation is also linear, which is
fundamentally useful. The linearity of expectation implies that

\[
\begin{aligned}
\mathbb{E}(Y_1 + \cdots + Y_n) &= \mathbb{E} Y_1 + \cdots + \mathbb{E} Y_n\\
&= \mathbb{E}(X_1 + \cdots + X_n) + \mathbb{E}(Z_1 + \cdots + Z_n)\\
&= \mathbb{E} X_1 + \cdots + \mathbb{E} X_n + \mathbb{E} Z_1 + \cdots + \mathbb{E} Z_n.
\end{aligned}
\]

Consider a hypothesis testing problem in the setting of sums of random matrices:

\[
\begin{aligned}
H_0 &: \rho = Z_1 + \cdots + Z_n\\
H_1 &: \sigma = (X_1 + \cdots + X_n) + (Z_1 + \cdots + Z_n).
\end{aligned}
\]

Trace is linear, so under H₁,

\[
\operatorname{Tr}\sigma = \operatorname{Tr}(X_1+\cdots+X_n) + \operatorname{Tr}(Z_1+\cdots+Z_n) = (\operatorname{Tr} X_1+\cdots+\operatorname{Tr} X_n) + (\operatorname{Tr} Z_1+\cdots+\operatorname{Tr} Z_n).
\]

It is natural to consider σ − ρ, as in quantum information processing. We are
naturally led to the tail bounds of σ − ρ. It follows that

\[
\sigma - \rho = (X_1+\cdots+X_n) + (Z_1+\cdots+Z_n) - (Z_1+\cdots+Z_n). \tag{10.9}
\]

We assume that the eigenvalues, singular values, and diagonal entries of Hermitian
matrices are arranged in decreasing order. Thus, λ₁ = λ_max and λ_n = λ_min.

Theorem 10.6.1 (Eigenvalues of Sums of Two Matrices [16]). Let A, B be n×n
Hermitian matrices. Then

\[
\lambda_i(A) + \lambda_n(B) \le \lambda_i(A+B) \le \lambda_i(A) + \lambda_1(B).
\]

In particular,

\[
\begin{aligned}
\lambda_1(A) + \lambda_n(B) &\le \lambda_1(A+B) \le \lambda_1(A) + \lambda_1(B),\\
\lambda_n(A) + \lambda_n(B) &\le \lambda_n(A+B) \le \lambda_n(A) + \lambda_1(B).
\end{aligned}
\]

It is natural to consider the maximum eigenvalue of σ − ρ and the minimum
eigenvalue of σ − ρ. The use of Theorem 10.6.1 in (10.9) leads to the upper bound

\[
\begin{aligned}
\lambda_{\max}&\big((X_1+\cdots+X_n) + (Z_1+\cdots+Z_n) + (-Z_1-\cdots-Z_n)\big)\\
&\le \lambda_{\max}\big[(X_1+\cdots+X_n) + (Z_1+\cdots+Z_n)\big] + \lambda_{\max}\big(-(Z_1+\cdots+Z_n)\big)\\
&= \lambda_{\max}\big[(X_1+\cdots+X_n) + (Z_1+\cdots+Z_n)\big] - \lambda_{\min}\big(Z_1+\cdots+Z_n\big)\\
&\le \lambda_{\max}(X_1+\cdots+X_n) + \lambda_{\max}(Z_1+\cdots+Z_n) - \lambda_{\min}\big(Z_1+\cdots+Z_n\big).
\end{aligned} \tag{10.10}
\]

The third line of (10.10) follows from the fact that [53, p. 13]

\[
\lambda_{\min}(A) = -\lambda_{\max}(-A),
\]

where A is a Hermitian matrix. In the fourth line, we have made the assumption
that the sum of information matrices (X₁ + ··· + Xₙ) and the sum of noise matrices
(Z₁ + ··· + Zₙ) are independent of each other. Similarly, we have the lower bound

\[
\begin{aligned}
\lambda_{\min}&\big((X_1+\cdots+X_n) + (Z_1+\cdots+Z_n) + (-Z_1-\cdots-Z_n)\big)\\
&\ge \lambda_{\min}\big[(X_1+\cdots+X_n) + (Z_1+\cdots+Z_n)\big] + \lambda_{\min}\big(-(Z_1+\cdots+Z_n)\big)\\
&= \lambda_{\min}\big[(X_1+\cdots+X_n) + (Z_1+\cdots+Z_n)\big] - \lambda_{\max}\big(Z_1+\cdots+Z_n\big)\\
&\ge \lambda_{\min}(X_1+\cdots+X_n) + \lambda_{\min}(Z_1+\cdots+Z_n) - \lambda_{\max}\big(Z_1+\cdots+Z_n\big).
\end{aligned} \tag{10.11}
\]

10.7 Matrix Hypothesis Testing

Let us consider the matrix hypothesis testing problem

\[
\begin{aligned}
H_0 &: Y = N\\
H_1 &: Y = \sqrt{SNR}\cdot X + N
\end{aligned} \tag{10.12}
\]

where SNR represents the signal-to-noise ratio, and X and N are two random
matrices of size m × n. We assume that X is independent of N. The problem of (10.12)
is equivalent to the following:

\[
\begin{aligned}
H_0 &: Y Y^H = N N^H\\
H_1 &: Y Y^H = SNR\cdot X X^H + N N^H + \sqrt{SNR}\cdot\left(X N^H + N X^H\right).
\end{aligned} \tag{10.13}
\]

One metric of our interest is built from the covariance matrix and its trace,

\[
f(SNR, X) = \operatorname{Tr}\left(\mathbb{E}\left[Y Y^H\right] - \mathbb{E}\left[N N^H\right]\right)^2. \tag{10.14}
\]

This function is positive, and it is based on the trace, which is linear. When
only N independent realizations are available, we can replace the expectation with
its average form

\[
\hat{f}(SNR, X) = \operatorname{Tr}\left(\frac{1}{N}\sum_{i=1}^{N} Y_i Y_i^H - \frac{1}{N}\sum_{i=1}^{N} N_i N_i^H\right)^2. \tag{10.15}
\]

Hypothesis H₀ can be viewed as the extreme case SNR = 0. It can be shown
that f̂(SNR, X) is a Lipschitz continuous function of SNR and X. f̂(SNR, X)
is a trace functional of X. It is known that the trace functional is strongly
concentrated [180]; the concentration inequality has the following form:

\[
\mathbb{P}\left(\left|\hat{f}(SNR, X) - \mathbb{E}\hat{f}(SNR, X)\right| > t\right) \le C e^{-n^2 t^2/c} \tag{10.16}
\]

where C, c are two absolute constants independent of the dimension n.

Fig. 10.1 Random matrix detection: (a) SNR = −30 dB, N = 100; (b) SNR = −36 dB, N = 1,000. Both panels plot the trace metric f̂ against the Monte Carlo index for hypotheses H₀ and H₁.


Figure 10.1 illustrates the concentration of the trace function f̂(SNR, X) around
the mean of hypothesis H₀ and that of hypothesis H₁, respectively. We use the
following settings: the entries of X, N are zero-mean Gaussian with variance 1, m = 200,
and n = 100. We plot the function f̂(SNR, X) for K = 100 Monte Carlo
simulations, since f̂(SNR, X) is a scalar-valued (always positive) random variable.
It is interesting to observe that the two hypotheses are more separated in
the second case (Fig. 10.1b), even though it has a lower SNR (6 dB lower). The reason is that
we have used N = 1,000 realizations (measurements) of random matrices, while
in the first case only N = 100 realizations are used. As claimed in (10.16),
f̂(SNR, X) is strongly concentrated around its expectation.
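A hedged sketch of this Monte Carlo experiment is given below. The normalization of the averaged sample covariance matrices and the exact constants are our own illustrative choices and need not reproduce the numbers in the figure.

% Sketch of the experiment behind Fig. 10.1 (normalization is illustrative).
m = 200; n = 100; N = 100; K = 100;         % matrix size, realizations per metric, Monte Carlo runs
SNR = 10^(-30/10);                          % -30 dB
T0 = zeros(K,1); T1 = zeros(K,1);
for k = 1:K
    S0 = zeros(m); S1 = zeros(m); Sref = zeros(m);
    for i = 1:N
        X    = randn(m,n);                  % information matrix
        Y1   = sqrt(SNR)*X + randn(m,n);    % observation under H1, cf. (10.12)
        Y0   = randn(m,n);                  % observation under H0 (noise only)
        Nref = randn(m,n);                  % independent noise-only reference record
        S1   = S1 + (Y1*Y1')/(N*n);
        S0   = S0 + (Y0*Y0')/(N*n);
        Sref = Sref + (Nref*Nref')/(N*n);
    end
    T1(k) = trace((S1 - Sref)^2);           % trace metric (10.15) under H1
    T0(k) = trace((S0 - Sref)^2);           % trace metric (10.15) under H0
end
plot(1:K, T0, 'b', 1:K, T1, 'r');
legend('Hypothesis H_0', 'Hypothesis H_1');
xlabel('Monte Carlo index'); ylabel('Trace metric');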

10.8 Random Matrix Detection

When A, B are Hermitian, it is fundamental to realize that TAT* and TBT*
are two commutative matrices. Obviously TAT* and TBT* are Hermitian, since
A* = A, B* = B. Using the fact that (CD)* = D*C* for any two complex matrices C, D,
we get

\[
\left(TBT^{*}\,TAT^{*}\right)^{*}
= \left(TAT^{*}\right)^{*}\left(TBT^{*}\right)^{*}
= TA^{*}T^{*}\,TB^{*}T^{*}
= TAT^{*}\,TBT^{*}, \tag{10.17}
\]

which says that TAT*TBT* = TBT*TAT*, verifying the claim.
For commutative matrices C, D, C ≤ D is equivalent to e^C ≤ e^D.

The matrix exponential has the property [20, p. 235] that

\[
e^{A+B} = e^{A}e^{B}
\]

if and only if the two matrices A, B are commutative: AB = BA. Thus, it follows that

\[
e^{TAT^{*}} < e^{TBT^{*}} \tag{10.18}
\]

when A < B, A, B are Hermitian, and T*T > 0. We have that

\[
e^{TAT^{*}+TBT^{*}} = e^{TAT^{*}}e^{TBT^{*}}. \tag{10.19}
\]

Let $\mathcal{A}$ be the C*-algebra and $\mathcal{A}_s$ the set of all self-adjoint (Hermitian) matrices, so
that $\mathcal{A}_s$ is the self-adjoint part of $\mathcal{A}$. Let us recall a lemma that has been proven in
Sect. 2.2.4. We repeat the lemma and its proof here for convenience.
Lemma 10.8.1 (Large deviations and Bernstein trick). For matrix-valued
random variables A, B ∈ $\mathcal{A}_s$ and T ∈ $\mathcal{A}$ such that T*T > 0,

\[
\mathbb{P}\{A \succeq B\} \le \operatorname{Tr}\mathbb{E}\,e^{TAT^{*}-TBT^{*}} = \mathbb{E}\operatorname{Tr} e^{TAT^{*}-TBT^{*}}. \tag{10.20}
\]

Proof. We directly calculate

\[
\begin{aligned}
\mathbb{P}\{A \succeq B\} &= \mathbb{P}\{A-B \succeq 0\}\\
&= \mathbb{P}\{TAT^{*}-TBT^{*} \succeq 0\}\\
&= \mathbb{P}\left\{e^{TAT^{*}-TBT^{*}} \succeq I\right\}\\
&\le \operatorname{Tr}\mathbb{E}\,e^{TAT^{*}-TBT^{*}}.
\end{aligned} \tag{10.21}
\]

Here, the second line is because the mapping X → TXT* is bijective and preserves
the order. As shown above in (10.17), when A, B are Hermitian, TAT* and
TBT* are two commutative matrices; for commutative matrices C, D, C ≤ D is
equivalent to e^C ≤ e^D, from which the third line follows. The last line follows from
Chebyshev's inequality (2.2.11). □

The closed form of E[Tr e^X] is available in Sect. 1.6.4.
The famous Golden-Thompson inequality is recalled here:

\[
\operatorname{Tr} e^{A+B} \le \operatorname{Tr}\left(e^{A}\cdot e^{B}\right), \tag{10.22}
\]

where A, B are arbitrary Hermitian matrices. This inequality is very tight (almost
sharp). We also recall that

\[
\operatorname{Tr}(AB) \le \|A\|_{op}\operatorname{Tr}(B)
\]

for Hermitian A and B ⪰ 0.

The random-matrix-based hypothesis testing problem is formulated as follows:

\[
\begin{aligned}
H_0 &: A, \quad A \succeq 0\\
H_1 &: A + B, \quad A \succeq 0,\ B \succeq 0
\end{aligned} \tag{10.23}
\]

where A and B are random matrices. One cannot help the temptation of using

\[
\mathbb{P}(A+B > A) = \mathbb{P}\left(e^{A+B} > e^{A}\right),
\]

which is false. It is true only when A and B commute, i.e., AB = BA. In fact, in
general, we have that

\[
\mathbb{P}(A+B > A) \ne \mathbb{P}\left(e^{A+B} > e^{A}\right).
\]

However, TAT* and TBT* are two commutative matrices when A, B are
Hermitian.
Let us consider another formulation:

\[
\begin{aligned}
H_0 &: X,\\
H_1 &: C + X, \quad C\ \text{is fixed},
\end{aligned}
\]
where X is a Hermitian random matrix and C is a fixed Hermitian matrix.


In particular, X can be a Hermitian Gaussian random matrix that is treated in
Sect. 1.6.4. This formulation is related to a simple but powerful corollary of Lieb’s
theorem, which is Corollary 1.4.18 that has been shown previously. This result
connects expectation with the trace exponential.
Corollary 10.8.2. Let H be a fixed Hermitian matrix, and let X be a random
Hermitian matrix. Then
 
E Tr exp (H + X)  Tr exp H + log EeX . (10.24)

We claim H₁ if the decision metric E Tr exp(C + X) is greater than some positive
threshold t that we can freely set. We can compare this expression with the left-hand side
of (10.24) by identifying H with C. The probability of detection for this algorithm is

\[
\mathbb{P}\left(\mathbb{E}\operatorname{Tr}\exp(C+X) > t\right),
\]

which is upper bounded by

\[
\mathbb{P}\left(\operatorname{Tr}\exp\left(C + \log\mathbb{E}e^{X}\right) > t\right),
\]

due to (10.24). As a result, Ee^X, the expectation of the exponential of the random
matrix X, plays a basic role in this hypothesis testing problem.

On the other hand, it follows that

\[
\mathbb{E}\operatorname{Tr} e^{C+X} \le \mathbb{E}\operatorname{Tr}\left(e^{C}\cdot e^{X}\right) \le \mathbb{E}\left[\operatorname{Tr}\left(e^{C}\right)\operatorname{Tr}\left(e^{X}\right)\right] = \operatorname{Tr}\left(e^{C}\right)\mathbb{E}\operatorname{Tr}\left(e^{X}\right). \tag{10.25}
\]

The first inequality follows from the Golden-Thompson inequality (10.22). The
second inequality follows from Tr(AB) ≤ Tr(A)Tr(B) when A ⪰ 0 and B ⪰ 0
are of the same size [16, Theorem 6.5]. Note that all the eigenvalues of a matrix
exponential are nonnegative. The final step follows from the fact that C is
fixed. It follows from (10.25) that

\[
\mathbb{P}\left(\mathbb{E}\operatorname{Tr} e^{C+X} > t\right) \le \mathbb{P}\left(\operatorname{Tr}\left(e^{C}\right)\mathbb{E}\operatorname{Tr}\left(e^{X}\right) > t\right),
\]

which is the final upper bound of interest. The quantity E Tr(e^X) plays a basic role.
Fortunately, for Hermitian Gaussian random matrices X, the closed-form expression
of E Tr(e^X) is obtained in Sect. 1.6.4.
Example 10.8.3 (Commutative property of TAT* and TBT*). EXPM(X) is the
matrix exponential of X; EXPM is computed using a scaling and squaring algorithm
with a Padé approximation, while EXP(X) is the elementwise exponential of X.
For example, EXP applied to the zero matrix gives a matrix whose entries are all ones, while EXPM(0)
is the identity matrix, whose diagonal elements are ones and whose off-diagonal elements
are zeros. It is critical to realize that EXPM(X), rather than EXP(X), should be used to calculate the
matrix exponential of X.
Without loss of generality, we set T = randn(m,n), where randn(m,n) gives a
random m × n matrix whose entries are normally distributed pseudorandom
numbers, since we only need to require T*T > 0. For two Hermitian matrices
A, B, the MATLAB expression

expm(T*A*T' - T*B*T')*expm(T*B*T') - expm(T*A*T')

gives zeros, while

expm(A - B)*expm(B) - expm(A)

gives non-zeros. In LaTeX notation, we have that

\[
e^{TAT^{*}-TBT^{*}}e^{TBT^{*}} - e^{TAT^{*}} = 0,
\]

since e^{TAT*−TBT*}e^{TBT*} = e^{TAT*}. This demonstrates the fundamental role of
the commutative property of TAT* and TBT*:

\[
TAT^{*}\,TBT^{*} = TBT^{*}\,TAT^{*}.
\]

On the other hand, since A, B are not commutative matrices, AB ≠ BA, the
expression expm(A - B)*expm(B) - expm(A) is non-zero, i.e., e^{A−B}e^{B} ≠ e^{A}. Note that the inverse of a matrix exponential is

\[
\left(e^{A}\right)^{-1} = e^{-A}.
\]

Also, the zero matrix 0 has e^0 = I. □


The product rule of matrix expectation

E (XY) = E (X) E (Y)

play a basic rule in random matrices analysis. Often, we can only observe the
product of two matrix-valued random variables X, Y . We can thus readily calculate
the expectation of the product X and Y. Assume we want to “deconvolve” the
role of X to obtain the expectation of Y. This can be done using
−1
E (XY) (E (X)) = E (Y) ,
−1
assuming (E (X)) exists.1
Example 10.8.4 (The product rule of matrix expectation: E(XY) = E(X)E(Y)).
In particular, we consider the matrix exponentials

\[
X = e^{TAT^{*}-TBT^{*}}, \qquad Y = e^{TBT^{*}},
\]

where T, A, B are assumed the same as in Example 10.8.3. Then, we have that

\[
\begin{aligned}
\mathbb{E}\left(e^{TAT^{*}-TBT^{*}}\right)\mathbb{E}\left(e^{TBT^{*}}\right) = \mathbb{E}(XY) &= \mathbb{E}\left(e^{TAT^{*}-TBT^{*}}e^{TBT^{*}}\right)\\
&= \mathbb{E}\left(e^{TAT^{*}-TBT^{*}+TBT^{*}}\right)\\
&= \mathbb{E}\left(e^{TAT^{*}}\right).
\end{aligned}
\]

The first line uses the product rule of matrix expectation. The second line follows
from (10.19). In MATLAB simulation, the expectation is implemented using
N independent Monte Carlo realizations and replaced with the average of the N
random matrices

\[
\frac{1}{N}\sum_{i=1}^{N} Z_i. 
\]
□

1 This is not guaranteed. The probability that the random matrix X is singular is studied in the
literature [241, 242].

Our hypothesis testing problem of (10.23) can be reformulated in terms of

\[
\begin{aligned}
H_0 &: e^{TAT^{*}}\\
H_1 &: e^{TAT^{*}+TBT^{*}}
\end{aligned} \tag{10.26}
\]

where TT* > 0, and A, B are two Hermitian random matrices. Usually, A is
a Gaussian random matrix representing the noise. Using Monte Carlo simulations,
one often has the knowledge of E(e^{TAT*}) and E(e^{TAT*+TBT*}). Using
arguments similar to Example 10.8.4, we have that

\[
\begin{aligned}
\mathbb{E}\left(e^{TAT^{*}+TBT^{*}}\right)\mathbb{E}\left(e^{-TAT^{*}}\right) &= \mathbb{E}\left(e^{TAT^{*}+TBT^{*}}e^{-TAT^{*}}\right)\\
&= \mathbb{E}\left(e^{TAT^{*}+TBT^{*}-TAT^{*}}\right)\\
&= \mathbb{E}\left(e^{TBT^{*}}\right).
\end{aligned}
\]

If B = 0, then e^{TBT*} = I and thus E(e^{TBT*}) = I, since e^0 = I. If B is very
weak but B ≠ 0, we often encounter B as a random matrix. Note that

\[
\log\mathbb{E}e^{TBT^{*}} = 0
\]

if B = 0, where log represents the matrix logarithm (MATLAB function LOGM(A)
for a matrix A). One metric to describe the difference away from zero (hypothesis H₀)
is the matrix norm

\[
\left\|\mathbb{E}\left(e^{TAT^{*}+TBT^{*}}\right)\mathbb{E}\left(e^{-TAT^{*}}\right)\right\|_{op} = \lambda_{\max}\left(\mathbb{E}\left(e^{TAT^{*}+TBT^{*}}\right)\mathbb{E}\left(e^{-TAT^{*}}\right)\right).
\]

Another metric is to use the trace

\[
\operatorname{Tr}\left[\mathbb{E}\left(e^{TAT^{*}+TBT^{*}}\right)\mathbb{E}\left(e^{-TAT^{*}}\right)\right].
\]

We claim H₁ if

\[
\operatorname{Tr}\left[\mathbb{E}\left(e^{TAT^{*}+TBT^{*}}\right)\mathbb{E}\left(e^{-TAT^{*}}\right)\right] > \gamma,
\]

where γ is set using the typical value under H₀. Dyson's expansion [23, p. 311],

\[
e^{A+B} - e^{A} = \int_{0}^{1} e^{(1-t)A}\,B\,e^{t(A+B)}\,dt,
\]

can be used to study the difference e^{A+B} − e^{A}. Our goal is to understand the
perturbation due to a very weak B.
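The following MATLAB sketch estimates the two matrix expectations by Monte Carlo averaging, as in Example 10.8.4, and evaluates the trace metric above. The matrix sizes, the scaling used to keep expm numerically tame, and the threshold gamma are our own illustrative assumptions.

% Hedged Monte Carlo sketch of the metric Tr( E[e^{TAT'+TBT'}] E[e^{-TAT'}] ).
n = 10; K = 500;                              % matrix size and number of Monte Carlo samples
T = randn(n)/sqrt(n);                         % fixed T (T*T' > 0 almost surely)
E1 = zeros(n); E2 = zeros(n);
for k = 1:K
    G = randn(n); A = (G + G')/sqrt(8*n);     % Hermitian "noise" matrix
    H = randn(n); B = 0.1*(H + H')/sqrt(8*n); % weak Hermitian perturbation (the signal)
    E1 = E1 + expm(T*(A+B)*T')/K;             % estimate of E[e^{TAT'+TBT'}] (hypothesis H1)
    E2 = E2 + expm(-T*A*T')/K;                % estimate of E[e^{-TAT'}]
end
metric = trace(E1*E2);                        % decision metric
gamma  = 1.05*n;                              % illustrative threshold set from typical H0 values
decideH1 = metric > gamma;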

10.9 Sphericity Test with Sparse Alternative

Let x₁, ..., x_N be N i.i.d. realizations of a random variable x in R^n. Our goal is to
test the sphericity hypothesis, i.e., that the distribution of x is invariant under rotation in
R^n. For a Gaussian distribution, this is equivalent to testing whether the covariance matrix
of x is of the form σ²I_n for some known σ² > 0.
Without loss of generality, we assume σ² = 1, so that the covariance matrix is
the identity matrix in R^n under the null hypothesis. Possible alternative hypotheses
include the idea that there exists a privileged direction, along which x has more
variance. In the spirit of sparse PCA [516, 517], we focus on the case where the
privileged direction is sparse. The alternative hypothesis has a covariance matrix
that is a sparse, rank-one perturbation of the identity matrix I_n. Formally, let v ∈ R^n
be such that ‖v‖₂ = 1, ‖v‖₀ ≤ k, and θ > 0. Here, for any p ≥ 1, we denote by
‖v‖_p the ℓ_p norm of a vector and, by extension, we denote by ‖v‖₀ its ℓ₀ norm, that
is, its number of non-zero elements.
The hypothesis testing problem is written as

\[
\begin{aligned}
H_0 &: x \sim \mathcal{N}(0, I_n)\\
H_1 &: x \sim \mathcal{N}\left(0, I_n + \theta v v^T\right).
\end{aligned}
\]

The model under H₁ is a generalization of the spiked covariance model since it
allows v to be k-sparse on the unit Euclidean sphere. In particular, the statement of
H₁ is invariant under rotations of the k relevant variables.
Denote by Σ the covariance matrix of x. A most commonly used statistic is the
empirical (or sample) covariance matrix Σ̂ defined by

\[
\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T.
\]

It is an unbiased estimator of the covariance matrix of x and the maximum likelihood
estimator in the Gaussian case, when the mean is known to be 0. Σ̂ is often the only
data provided to the statistician.
We say that a test discriminates between H₀ and H₁ with probability 1 − δ if the
type I and type II errors both have probability smaller than δ. Our objective is to
find a statistic ϕ(Σ̂) and thresholds τ₀ < τ₁, depending on (n, N, k, δ), such that

\[
\begin{aligned}
\mathbb{P}_{H_0}\left(\varphi(\hat{\Sigma}) > \tau_0\right) &\le \delta\\
\mathbb{P}_{H_1}\left(\varphi(\hat{\Sigma}) < \tau_1\right) &\le \delta.
\end{aligned}
\]

Taking τ ∈ [τ₀, τ₁] allows us to control the type I and type II errors of the test

\[
\psi(\hat{\Sigma}) = \mathbf{1}\left\{\varphi(\hat{\Sigma}) > \tau\right\},
\]

where 1{·} denotes the indicator function. As desired, this test has the property of
discriminating between the hypotheses with probability 1 − δ.

10.10 Connection with Random Matrix Theory

The sample covariance matrix Σ̂ has been studied extensively [5, 388]. Convergence
of the empirical covariance matrix to the true covariance matrix in spectral norm
has received attention [518–520] under various elementwise sparsity assumptions and
thresholding methods. Our assumption allows the relevant variables to produce
arbitrarily small entries, and thus we cannot use such results. A natural statistic would
be, for example, the largest eigenvalue of the sample covariance matrix.

10.10.1 Spectral Methods

For any unit vector, we have

\[
\lambda_{\max}(I_n) = 1 \quad\text{and}\quad \lambda_{\max}\left(I_n + \theta v v^T\right) = 1 + \theta.
\]

In the high-dimensional setting, where n may grow with N, the behavior of λ_max(Σ̂)
is different. If n/N → α > 0, Geman [521] showed that, in accordance with the
Marcenko-Pastur distribution, we have

\[
\lambda_{\max}(\hat{\Sigma}) \to \left(1+\sqrt{\alpha}\right)^2 > 1,
\]

where the convergence holds almost surely [198, 383]. Yin et al. [344] established
that E(x) = 0 and E(x⁴) < ∞ is a necessary and sufficient condition for this
almost sure convergence to hold. As Σ̂ ⪰ 0, its number of positive eigenvalues is
equal to its rank (which is smaller than N), and we have

\[
\lambda_{\max}(\hat{\Sigma}) \ge \frac{1}{\operatorname{rank}(\hat{\Sigma})}\sum_{i=1}^{n}\lambda_i(\hat{\Sigma}) \ge \frac{1}{N}\sum_{i=1}^{n}\lambda_i(\hat{\Sigma}) = \frac{\operatorname{Tr}(\hat{\Sigma})}{N}.
\]

As Tr(Σ̂) is 1/N times a sum of nN squared independent standard Gaussian
variables, i.e., a χ²_{nN} random variable scaled by 1/N, almost surely, for n/N → ∞, we have λ_max(Σ̂) → ∞
under the null hypothesis.



These two results indicate that the largest eigenvalue will not be able to
discriminate between the two hypotheses unless θ > Cn/N for some positive
constant C. In a “large n/small N ” scenario, this corresponds to a very strong signal
indeed.

10.10.2 Low Rank Perturbation of Wishart Matrices

When adding a finite rank perturbation to a Wishart matrix, a phase transition [522]
arises already in the moderate dimensional regime where n/N → α ∈ (0, 1).
A very general class of random matrices exhibits similar behavior, under finite rank
perturbation, as shown by Tao [523]. These results are extended to more general
distributions in [524]. The analysis of [514] indicates that detection using the
largest eigenvalue is impossible already for moderate dimensions, without further
assumptions. Nevertheless, resorting to the sparsity assumption allows us to bypass
this intrinsic limitation of using the largest eigenvalue as a test statistic.

10.10.3 Sparse Eigenvalues

To exploit the sparsity assumption, we use the fact that only a small submatrix of the
empirical covariance matrix will be affected by the perturbation. Let A be an n × n
matrix and fix k < n. We define the k-sparse largest eigenvalue by

\[
\lambda^{k}_{\max}(A) = \max_{|S|=k} \lambda_{\max}(A_S).
\]

For a set S, we denote by |S| the cardinality of S, and A_S denotes the k × k submatrix
of A with rows and columns indexed by S. We have the same equalities as
for regular eigenvalues:

\[
\lambda^{k}_{\max}(I_n) = 1 \quad\text{and}\quad \lambda^{k}_{\max}\left(I_n + \theta v v^T\right) = 1 + \theta.
\]

The k-sparse largest eigenvalue behaves differently under the two hypotheses as
soon as there is a k × k submatrix with a significantly higher largest eigenvalue.

10.11 Sparse Principal Component Detection


   
The test statistic ϕ(Σ̂) = λ^k_max(Σ̂) can be equivalently defined via

\[
\lambda^{k}_{\max}(A) = \max_{\|x\|_2 = 1,\ \|x\|_0 \le k} x^T A x \tag{10.27}
\]

for any A ⪰ 0.
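For small n and k, the subset form of the k-sparse largest eigenvalue from Sect. 10.10.3 can be evaluated by brute force, as in the MATLAB sketch below. The enumeration over all subsets is exactly what becomes intractable for large n, motivating the semidefinite relaxation of Sect. 10.12.

% Brute-force evaluation of the k-sparse largest eigenvalue (feasible only for small n, k).
function val = sparse_lambda_max(A, k)
    n   = size(A, 1);
    S   = nchoosek(1:n, k);                      % all index subsets of size k
    val = -inf;
    for i = 1:size(S, 1)
        idx = S(i, :);
        val = max(val, max(eig(A(idx, idx))));   % largest eigenvalue of the k x k submatrix
    end
end
% Example (illustrative): sparse_lambda_max(SigmaHat, 3) for a sample covariance SigmaHat.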

10.11.1 Concentration Inequalities for the k-Sparse Largest Eigenvalue

Finding the optimal detection thresholds comes down to concentration
inequalities for the test statistic λ^k_max(Σ̂), both under the null and under the alternative
hypothesis. The concentration of measure phenomenon plays a fundamental role in
this framework.
Consider H₁ first. There is a unit vector v with sparsity k such that x ∼
N(0, I_n + θvvᵀ). By the definition of Σ̂, it follows that

\[
\lambda^{k}_{\max}(\hat{\Sigma}) \ge v^T\hat{\Sigma} v = \frac{1}{N}\sum_{i=1}^{N}\left(x_i^T v\right)^2.
\]

This problem involves only linear functionals. Since x ∼ N(0, I_n + θvvᵀ), we
have x_iᵀv ∼ N(0, 1 + θ).
Define the new random variable

\[
Y = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{1+\theta}\left(x_i^T v\right)^2 - 1\right),
\]

which is a centered (and scaled) χ² random variable. Using Laurent and Massart [134, Lemma 1] on
the concentration of the χ² distribution (see Lemma 3.2.1 for this result and its proof), we
get, for any t > 0, that

\[
\mathbb{P}\left(Y \le -2\sqrt{t/N}\right) \le e^{-t}.
\]

Hence, taking t = log(1/δ), we have Y ≥ −2√(log(1/δ)/N) with probability
1 − δ. Therefore, under H₁, we have with probability 1 − δ

\[
\lambda^{k}_{\max}(\hat{\Sigma}) \ge 1 + \theta - 2(1+\theta)\sqrt{\frac{\log(1/\delta)}{N}}. \tag{10.28}
\]

We now establish the following: under H₀, with probability 1 − δ,

\[
\lambda^{k}_{\max}(\hat{\Sigma}) \le 1 + 4\sqrt{\frac{k\log(9en/k) + \log(1/\delta)}{N}} + 4\,\frac{k\log(9en/k) + \log(1/\delta)}{N}. \tag{10.29}
\]

We adapt a technique from [72, Lemma 3]. Let A be a symmetric n × n matrix,
and let N_ε be an ε-net of the sphere S^{n−1} for some ε ∈ [0, 1]; for ε-nets, please refer to
Sect. 1.10. Then,

\[
\lambda_{\max}(A) = \sup_{x\in S^{n-1}}|\langle Ax, x\rangle| \le (1-2\varepsilon)^{-1}\sup_{x\in\mathcal{N}_\varepsilon}|\langle Ax, x\rangle|.
\]

Using a 1/4-net over the unit sphere of R^k, there exists a subset N_k of the unit
sphere of R^k, with cardinality smaller than 9^k, such that for any A ∈ S₊^k

\[
\lambda_{\max}(A) \le 2\max_{x\in\mathcal{N}_k} x^T A x.
\]

Under H₀, we have

\[
\lambda^{k}_{\max}(\hat{\Sigma}) = 1 + \max_{|S|=k}\left\{\lambda_{\max}(\hat{\Sigma}_S) - 1\right\},
\]

where the maximum on the right-hand side is taken over all subsets of {1, ..., n}
that have cardinality k. See [514] for details.

10.11.2 Hypothesis Testing with λkmax

Using the above results, we have

\[
\begin{aligned}
\mathbb{P}_{H_0}\left(\varphi(\hat{\Sigma}) > \tau_0\right) &\le \delta\\
\mathbb{P}_{H_1}\left(\varphi(\hat{\Sigma}) < \tau_1\right) &\le \delta,
\end{aligned}
\]

where

\[
\begin{aligned}
\tau_0 &= 1 + 4\sqrt{\frac{k\log(9en/k) + \log(1/\delta)}{N}} + 4\,\frac{k\log(9en/k) + \log(1/\delta)}{N},\\
\tau_1 &= 1 + \theta - 2(1+\theta)\sqrt{\frac{\log(1/\delta)}{N}}.
\end{aligned}
\]

When τ₁ > τ₀, we take τ ∈ [τ₀, τ₁] and define the following test:

\[
\psi(\hat{\Sigma}) = \mathbf{1}\left\{\varphi(\hat{\Sigma}) > \tau\right\}.
\]

It follows from the previous subsection that it discriminates between H₀ and H₁
with probability 1 − δ.
It remains to find for which values of θ the condition τ₁ > τ₀ holds. This corresponds to
our minimum detection threshold.
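A minimal MATLAB sketch of this test is given below, computing τ₀ and τ₁ from (n, N, k, δ) and θ; the parameter values and the choice of τ as the midpoint of [τ₀, τ₁] are illustrative.

% Sketch of the test behind Theorem 10.11.1 (parameter values are illustrative).
n = 50; N = 200; k = 3; delta = 0.05; theta = 1;
a    = (k*log(9*exp(1)*n/k) + log(1/delta))/N;
tau0 = 1 + 4*sqrt(a) + 4*a;                            % threshold controlling the H0 error
tau1 = 1 + theta - 2*(1+theta)*sqrt(log(1/delta)/N);   % threshold controlling the H1 error
if tau1 > tau0
    tau = (tau0 + tau1)/2;    % any tau in [tau0, tau1] works
    % Reject H0 if sparse_lambda_max(SigmaHat, k) > tau, with sparse_lambda_max as above.
end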
Theorem 10.11.1 (Berthet and Rigollet [514]). Assume that k, n, N and δ are
such that θ̄ ≤ 1, where

\[
\bar{\theta} = 4\sqrt{\frac{k\log(9en/k) + \log(1/\delta)}{N}} + 4\,\frac{k\log(9en/k) + \log(1/\delta)}{N} + 4\sqrt{\frac{\log(1/\delta)}{N}}.
\]

Then, for any θ > θ̄ and for any τ ∈ [τ₀, τ₁], the test ψ(Σ̂) = 1{ϕ(Σ̂) > τ}
discriminates between H₀ and H₁ with probability 1 − δ.

Considering asymptotic regimes, for large n, N, k, taking δ = n^{-β} with β > 0
gives a sequence of tests ψ_N that discriminate between H₀ and H₁ with probability
converging to 1, for any fixed θ > 0, as soon as

\[
\frac{k\log(n)}{N} \to 0.
\]

Theorem 10.11.1 gives the upper bound. The lower bound for the probability of
error is also found in [514]. We observe a gap between the upper and lower bounds,
with a term in log(n/k) in the upper bound and one in log(n/k²) in the lower bound.
However, by considering certain regimes for n, N and k, the gap disappears. Indeed, as
soon as n ≥ k^{2+ε}, for some ε > 0, the upper and lower bounds match up to
constants, and the detection rate for the sparse eigenvalue is optimal in a minimax
sense. Under this assumption, detection becomes impossible if

\[
\theta < C\sqrt{\frac{k\log(n/k)}{N}},
\]

for a small enough constant C > 0.

10.12 Semidefinite Methods for Sparse Principal Component Testing

10.12.1 Semidefinite Relaxation for λ^k_max

Computing λ^k_max is an NP-hard problem. We need a relaxation to solve this problem.
Semidefinite programming (SDP) is the matrix equivalent of linear programming.
Define the Euclidean scalar product on S₊^d by ⟨A, B⟩ = Tr(AB). A semidefinite
program can be written in the canonical form:

\[
\begin{aligned}
\text{SDP} = \text{maximize} \quad & \operatorname{Tr}(C X)\\
\text{subject to} \quad & \operatorname{Tr}(A_i X) \le b_i, \quad i \in \{1, \ldots, m\},\\
& X \succeq 0.
\end{aligned} \tag{10.30}
\]

A major breakthrough for sparse PCA was achieved in [516], which introduced an
SDP relaxation for λ^k_max, but the tightness of this relaxation is, to this day, unknown.
Making the change of variables X = xxᵀ in (10.27) yields

\[
\begin{aligned}
\lambda^{k}_{\max}(A) = \text{maximize} \quad & \operatorname{Tr}(A X)\\
\text{subject to} \quad & \operatorname{Tr}(X) = 1, \quad \|X\|_0 \le k^2,\\
& X \succeq 0, \quad \operatorname{rank}(X) = 1.
\end{aligned}
\]

This problem contains two sources of non-convexity: the ℓ₀-norm constraint and
the rank constraint. We make two relaxations in order to have a convex feasible
set. First, for a semidefinite matrix X with trace 1 and sparsity k², the Cauchy-
Schwarz inequality yields ‖X‖₁ ≤ k, which is substituted for the cardinality
constraint in this relaxation. Simply dropping the rank constraint then leads to the
following relaxation of our original problem:

\[
\begin{aligned}
\text{SDP}_k(A) = \text{maximize} \quad & \operatorname{Tr}(A X)\\
\text{subject to} \quad & \operatorname{Tr}(X) = 1, \quad \|X\|_1 \le k,\\
& X \succeq 0.
\end{aligned} \tag{10.31}
\]

This optimization problem is convex since it consists in maximizing a linear
objective over a convex set. It is a standard exercise to prove that it can be expressed
in the canonical form (10.30). As such, standard convex optimization algorithms
can be used to solve this problem efficiently. As a relaxation of the original problem,
for any A ⪰ 0 it holds that

\[
\lambda^{k}_{\max}(A) \le \text{SDP}_k(A). \tag{10.32}
\]


 
Since we have proved in Sect. 10.11.1 that λ^k_max(Σ̂) takes large values under H₁,
this inequality says that using SDP_k(Σ̂) as a test statistic will be to our advantage
under H₁. Of course, we have to prove that this statistic stays small under H₀. This can be
obtained using the dual formulation of the SDP.
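Before turning to the dual, we note that (10.31) can be handed directly to a solver. The following CVX sketch is illustrative (it assumes CVX is installed); the function name sdp_k is our own.

% CVX sketch of the relaxation SDP_k(A) in (10.31), for a symmetric PSD input A
% (e.g., a sample covariance matrix).
function val = sdp_k(A, k)
    n = size(A, 1);
    cvx_begin sdp quiet
        variable Z(n,n) symmetric
        maximize( trace(A*Z) )
        subject to
            trace(Z) == 1;
            norm(Z(:), 1) <= k;     % elementwise l1-norm constraint replacing ||X||_0 <= k^2
            Z >= 0;
    cvx_end
    val = cvx_optval;
end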
Lemma 10.12.1 (Bach et al. [525]). For a given A ⪰ 0, we have by duality

\[
\text{SDP}_k(A) = \min_{U\in S_p}\left\{\lambda_{\max}(A+U) + k\|U\|_\infty\right\}. \tag{10.33}
\]

Together with (10.32), Lemma 10.12.1 implies that for any z ≥ 0 and any matrix U
such that ‖U‖∞ ≤ z, it holds that

\[
\lambda^{k}_{\max}(A) \le \text{SDP}_k(A) \le \lambda_{\max}(A+U) + kz. \tag{10.34}
\]

A direct consequence of (10.34) is that the functional λ^k_max(A) is robust to
perturbations by matrices that have small ‖·‖∞-norm. Formally, let A ⪰ 0 be
such that its largest eigenvector has ℓ₀ norm bounded by k. Then, for any matrix
W, (10.34) gives

\[
\lambda^{k}_{\max}(A+W) \le \lambda_{\max}\big((A+W) - W\big) + k\|W\|_\infty = \lambda^{k}_{\max}(A) + k\|W\|_\infty.
\]


10.12.2 High Probability Bounds for Convex Relaxation


 
The statistic SDP_k(Σ̂) and other computationally efficient variants can be used as test
statistics for our detection problem. Recall that SDP_k(Σ̂) ≥ λ^k_max(Σ̂). In view
of (10.32), the following follows directly from (10.28): under H₁, we have, with
high probability 1 − δ,

\[
\text{SDP}_k(\hat{\Sigma}) \ge 1 + \theta - 2(1+\theta)\sqrt{\frac{\log(1/\delta)}{N}}.
\]

Similarly, we can obtain (see [514]): under H₀, we have, with high probability 1 − δ,

\[
\text{SDP}_k(\hat{\Sigma}) \le 1 + 2\sqrt{\frac{k^2\log(4n^2/\delta)}{N}} + 2\,\frac{k\log(4n^2/\delta)}{N} + 2\sqrt{\frac{\log(2n/\delta)}{N}} + 2\,\frac{\log(2n/\delta)}{N}.
\]

10.12.3 Hypothesis Testing with Convex Methods

The results of the previous subsection can be written as

\[
\begin{aligned}
\mathbb{P}_{H_0}\left(\text{SDP}_k(\hat{\Sigma}) > \hat{\tau}_0\right) &\le \delta\\
\mathbb{P}_{H_1}\left(\text{SDP}_k(\hat{\Sigma}) < \hat{\tau}_1\right) &\le \delta,
\end{aligned}
\]

where τ̂₀ and τ̂₁ are given by

\[
\begin{aligned}
\hat{\tau}_0 &= 1 + 2\sqrt{\frac{k^2\log(4n^2/\delta)}{N}} + 2\,\frac{k\log(4n^2/\delta)}{N} + 2\sqrt{\frac{\log(2n/\delta)}{N}} + 2\,\frac{\log(2n/\delta)}{N},\\
\hat{\tau}_1 &= 1 + \theta - 2(1+\theta)\sqrt{\frac{\log(1/\delta)}{N}}.
\end{aligned}
\]

Whenever τ̂₁ > τ̂₀, we take a threshold τ ∈ [τ̂₀, τ̂₁] and define the following computationally
efficient test:

\[
\hat{\psi}(\hat{\Sigma}) = \mathbf{1}\left\{\text{SDP}_k(\hat{\Sigma}) > \tau\right\}.
\]

It discriminates between H₀ and H₁ with probability 1 − δ. It remains to find for
which values of θ the condition τ̂₁ > τ̂₀ holds. This corresponds to our minimum
detection level.

Theorem 10.12.2 (Berthet and Rigollet [514]). Assume that n, N, k and δ are
such that θ̄ ≤ 1, where

\[
\bar{\theta} = 2\sqrt{\frac{k^2\log(4n^2/\delta)}{N}} + 2\,\frac{k\log(4n^2/\delta)}{N} + 2\sqrt{\frac{\log(2n/\delta)}{N}} + 4\sqrt{\frac{\log(1/\delta)}{N}}.
\]

Then, for any θ > θ̄ and any τ ∈ [τ̂₀, τ̂₁], the test ψ̂(Σ̂) = 1{SDP_k(Σ̂) > τ}
discriminates between H₀ and H₁ with probability 1 − δ.

By considering asymptotic regimes, for large n, N, k, taking δ = n^{-β} with β >
0 gives a sequence of tests ψ̂_N(Σ̂) that discriminate between H₀ and H₁ with
probability converging to 1, for any fixed θ > 0, as soon as

\[
\frac{k^2\log(n)}{N} \to 0.
\]

Compared with Theorem 10.11.1, the price to pay for using this convex relaxation
is to multiply the minimum detection level by a factor of √k. In most examples, k
remains small, so that this is not a very high price.

10.13 Sparse Vector Estimation

We follow [526]. The estimation of a sparse vector from noisy observations is a
fundamental problem in signal processing and statistics, and lies at the heart of the
growing field of compressive sensing. At its most basic level, we are interested in
accurately estimating a vector x ∈ R^n that has at most k non-zeros from a set of
noisy linear measurements

\[
y = A x + z \tag{10.35}
\]

where A ∈ R^{m×n} and z ∼ N(0, σ²I). We are often interested in the under-
determined setting where m may be much smaller than n. In general, one would
not expect to be able to accurately recover x when m < n, since there are more
unknowns than observations. However, it is by now well known that by exploiting
sparsity it is possible to accurately estimate x.
If we suppose that the entries of the matrix A are i.i.d. N(0, 1/n), then one can
show that for any x ∈ B_k := {x : ‖x‖₀ ≤ k}, ℓ₁-minimization techniques such as
the Lasso or the Dantzig selector produce a recovery x̂ such that

\[
\frac{1}{n}\|x-\hat{x}\|_2^2 \le C_0\,\frac{k\sigma^2}{m}\log n \tag{10.36}
\]

holds with high probability, provided that m = Ω(k log(n/k)) [527]. We consider
the worst-case error over all x ∈ B_k, i.e.,

\[
E^{*}(A) = \inf_{\hat{x}}\sup_{x\in B_k}\mathbb{E}\left[\frac{1}{n}\|\hat{x}(y)-x\|_2^2\right]. \tag{10.37}
\]

The following theorem gives a fundamental limit on the minimax risk which holds
for any matrix A and any possible recovery algorithm.

Theorem 10.13.1 (Candès and Davenport [526]). Suppose that we observe y =
Ax + z, where x is a k-sparse vector, A is an m × n matrix with m ≥ k, and
z ∼ N(0, σ²I). Then there exists a constant C₁ > 0 such that for all A,

\[
E^{*}(A) \ge C_1\,\frac{k\sigma^2}{\|A\|_F^2}\log(n/k). \tag{10.38}
\]

We also have that for all A,

\[
E^{*}(A) \ge \frac{k\sigma^2}{\|A\|_F^2}. \tag{10.39}
\]

This theorem says that there is no A and no recovery algorithm that does
fundamentally better than the Dantzig selector (10.36) up to a constant (say, 1/128),
that is, ignoring the difference in the factors log(n/k) and log n. In this sense, the
results of compressive sensing are, indeed, at the limit.

Corollary 10.13.2 (Candès and Davenport [526]). Suppose that we observe y =
A(x + w), where x is a k-sparse vector, A is an m × n matrix with k ≤ m ≤ n,
and w ∼ N(0, σ²I). Then for all A,

\[
E^{*}(A) \ge C_1\,\frac{k\sigma^2}{m}\log(n/k) \quad\text{and}\quad E^{*}(A) \ge \frac{k\sigma^2}{m}. \tag{10.40}
\]

The intuition behind this result is that when noise is added to the measurements, we
can boost the SNR by rescaling A to have higher norm. When we instead add noise
to the signal, the noise is also scaled by A, and so no matter how A is designed
there will always be a penalty of 1/m.
The relevant works are [528] and [135]. We only sketch the proof ingredients. The
proof of the lower bound (10.38) follows a similar course as in [135]. We will
suppose that x is distributed uniformly on a finite set of points X ⊂ B_k, where
X is constructed so that the elements of X are well separated. This allows us to
show a lemma which follows from Fano's inequality combined with the convexity
of the Kullback-Leibler (KL) divergence. The problem of constructing the packing
set X exploits the matrix Bernstein inequality of Ahlswede and Winter.

10.14 Detection of High-Dimensional Vectors

We follow [529]. See also [530] for relevant work. Detection of correlations
is considered in [531, 532]. We emphasize the Kullback-Leibler divergence
approach used in the proof of Theorem 10.14.2. Consider the hypothesis testing
problem

\[
\begin{aligned}
H_0 &: y = z\\
H_1 &: y = x + z
\end{aligned} \tag{10.41}
\]

where x, y, z ∈ R^n, x is the unknown signal, and z is additive noise. Here only
one noisy observation is available per coordinate. The vector x is assumed to be
sparse. Denote the scalar inner product of two column vectors a, b by ⟨a, b⟩ = aᵀb.
Now we have that

\[
\begin{aligned}
H_0 &: y_i = \langle y, a_i\rangle = \langle z, a_i\rangle = z_i, \quad i = 1, \ldots, N\\
H_1 &: y_i = \langle x, a_i\rangle + z_i, \quad i = 1, \ldots, N
\end{aligned} \tag{10.42}
\]

where the measurement vectors a_i have Euclidean norm bounded by 1 and the
noise terms z_i are i.i.d. standard Gaussian, i.e., N(0, 1). A test procedure based on
N measurements of the form (10.42) is a binary function of the data, i.e., T =
T(a₁, y₁, ..., a_N, y_N), with T = ε ∈ {0, 1} indicating that T favors H_ε. The worst-
case risk of a test T is defined as

\[
\gamma(T) := \mathbb{P}_0(T=1) + \max_{x\in\mathcal{X}}\mathbb{P}_{x}(T=0),
\]

where P_x denotes the distribution of the data when x is the true underlying vector
and X ⊂ R^n \ {0} is the set of alternatives. With a prior π on the set of alternatives X, the
corresponding average Bayes risk is expressed as

\[
\gamma_\pi(T) := \mathbb{P}_0(T=1) + \mathbb{E}_\pi\mathbb{P}_{x}(T=0),
\]

where E_π denotes the expectation under π. For any prior π and any test procedure
T, we have

\[
\gamma(T) \ge \gamma_\pi(T). \tag{10.43}
\]

For a vector a = (a₁, ..., a_m)ᵀ, we use the notation

\[
\|a\| = \left(\sum_{i=1}^{m}a_i^2\right)^{1/2}, \qquad |a| = \sum_{i=1}^{m}|a_i|
\]

to represent the Euclidean norm and the ℓ₁-norm. For a matrix A, the operator norm is
defined as

\[
\|A\|_{op} = \sup_{x\ne 0}\frac{\|A x\|}{\|x\|}.
\]

The symbol 1 denotes the vector with all coordinates equal to 1.
Vectors with non-negative entries may be relevant to image processing.

Proposition 10.14.1 (Arias-Castro [529]). Consider a nonnegative vector x and take
a_i = 1/√n · 1 for all i. Suppose we take N measurements of the form (10.42). Consider the test that rejects H₀ when

\[
\sum_{i=1}^{N} y_i > \tau\sqrt{N},
\]

where τ is some critical value. Its risk against x is equal to

\[
1 - \Phi(\tau) + \Phi\left(\tau - \sqrt{N/n}\,|x|\right),
\]

where Φ is the standard normal distribution function. Hence, if τ = τ_n → ∞, this
test has vanishing risk against alternatives satisfying √(N/n)|x| − τ_n → ∞.
We have used the result

\[
\frac{1}{\sqrt{N}}\sum_{i=1}^{N} y_i \sim \mathcal{N}\left(\sqrt{N/n}\,|x|,\ 1\right).
\]
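A minimal MATLAB sketch of this detector is given below; the signal parameters and the 1% false-alarm level are illustrative assumptions.

% Sketch of the detector in Proposition 10.14.1 with a_i = (1/sqrt(n))*ones(n,1) for all i.
n = 10000; N = 500; k = 50; mu = 0.5;
x = zeros(n,1); x(randperm(n,k)) = mu;       % k-sparse nonnegative signal
a = ones(n,1)/sqrt(n);                       % fixed measurement vector, ||a||_2 = 1
y = a'*x + randn(N,1);                       % N noisy measurements of the form (10.42)
tau = sqrt(2)*erfinv(2*0.99 - 1);            % critical value for a 1% false-alarm probability
rejectH0 = sum(y) > tau*sqrt(N);
% Its risk against x is 1 - Phi(tau) + Phi(tau - sqrt(N/n)*sum(abs(x))), cf. the proposition.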

Let us present the main theorem of this section.

Theorem 10.14.2 (Theorem 1 of Arias-Castro [529]). Let X(μ, k) denote the set
of vectors in R^n having exactly k non-zero entries, all equal to μ > 0. Based on N
measurements of the form (10.42), possibly adaptive, any test for H₀: x = 0 versus
H₁: x ∈ X(μ, k) has risk at least 1 − √(N/(8n)) kμ.

In particular, the risk against alternatives H₁: x ∈ X(μ, k) with √(N/n)|x| =
√(N/n) kμ → 0 goes to 1 uniformly over all procedures.

Proof. Since the approach is important, we follow [529] for a proof. We use the
standard approach to deriving uniform lower bounds on the risk, by putting a prior
on the set of alternatives and using (10.43). Here we simply choose the uniform prior
on X(μ, k), which is denoted by π. The hypothesis testing problem is now H₀: x = 0
versus H₁: x ∼ π. By the Neyman-Pearson fundamental lemma, the likelihood
ratio test is optimal. The likelihood ratio is defined as

\[
L := \frac{\mathbb{P}_\pi(a_1, y_1, \ldots, a_N, y_N)}{\mathbb{P}_0(a_1, y_1, \ldots, a_N, y_N)} = \mathbb{E}_\pi\prod_{i=1}^{N}\exp\left(y_i\,a_i^T x - \left(a_i^T x\right)^2/2\right),
\]

where E_π is the conditional expectation with respect to π, and the test is T =
1{L > 1}. It has the risk

\[
\gamma_\pi(T) = 1 - \frac{1}{2}\|\mathbb{P}_\pi - \mathbb{P}_0\|_{TV}, \tag{10.44}
\]

where P_π = E_π P_x (the π-mixture of P_x) and ‖·‖_{TV} is the total variation
distance [533, Theorem 2.2]. By Pinsker's inequality [533, Lemma 2.5],

\[
\|\mathbb{P}_\pi - \mathbb{P}_0\|_{TV} \le \sqrt{K(\mathbb{P}_\pi, \mathbb{P}_0)/2}, \tag{10.45}
\]

where K(P_π, P_0) is the Kullback-Leibler divergence [533, Definition 2.5]. We have

\[
\begin{aligned}
K(\mathbb{P}_\pi, \mathbb{P}_0) &= -\mathbb{E}_0\log L\\
&\le \mathbb{E}_\pi\sum_{i=1}^{N}\mathbb{E}_0\left[\left(a_i^T x\right)^2/2 - y_i\,a_i^T x\right]\\
&= \mathbb{E}_\pi\sum_{i=1}^{N}\mathbb{E}_0\left[\left(a_i^T x\right)^2\right]/2\\
&= \frac{1}{2}\sum_{i=1}^{N}\mathbb{E}_0\left[a_i^T C a_i\right]\\
&\le \frac{N}{2}\|C\|_{op},
\end{aligned}
\]

where C = {c_{ij}} := E_π[xxᵀ]. The first line is by definition; the second line
follows from the definition of P_π/P_0, the use of Jensen's inequality (justified by the
convexity of x → −log x), and Fubini's theorem; the third line follows from the
independence of a_i, y_i and x (under P_0) and the fact that E[y_i] = 0; the fourth line
is by the independence of a_i and x (under P_0) and Fubini's theorem; the fifth line
follows since ‖a_i‖ ≤ 1 for all i.
Recall that under π the support of x is chosen uniformly at random among the
subsets of size k. Then we have

\[
c_{ii} = \mu^2\,\mathbb{P}_\pi(x_i \ne 0) = \mu^2\cdot\frac{k}{n}, \quad \forall i,
\]

and

\[
c_{ij} = \mu^2\,\mathbb{P}_\pi(x_i \ne 0,\ x_j \ne 0) = \mu^2\cdot\frac{k}{n}\cdot\frac{k-1}{n-1}, \quad i \ne j.
\]

This simple matrix has operator norm ‖C‖_{op} = μ²k²/n.
Now going back to the Kullback-Leibler divergence, we thus have

\[
K(\mathbb{P}_\pi, \mathbb{P}_0) \le N\cdot\mu^2 k^2/n,
\]

and returning to (10.44) via (10.45), we bound the risk of the likelihood ratio test by

\[
\gamma(T) \ge 1 - \sqrt{K(\mathbb{P}_\pi, \mathbb{P}_0)/8} \ge 1 - \sqrt{N/(8n)}\,k\mu. \qquad\square
\]


From Proposition 10.14.1 and Theorem 10.14.2, we conclude that the following
is true in a minimax sense: reliable detection of a nonnegative vector x ∈ R^n from
N noisy linear measurements is possible if √(N/n)|x| → ∞ and impossible if
√(N/n)|x| → 0.

Theorem 10.14.3 (Theorem 2 of Arias-Castro [529]). Let X(±μ, k) denote the
set of vectors in R^n having exactly k non-zero entries, all equal to ±μ, with μ > 0.
Based on N measurements of the form (10.42), possibly adaptive, any test for H₀:
x = 0 versus H₁: x ∈ X(±μ, k) has risk at least 1 − √(Nk/(8n)) μ.

In particular, the risk against alternatives H₁: x ∈ X(±μ, k) with (N/n)‖x‖² =
(N/n)kμ² → 0 goes to 1 uniformly over all procedures.

We choose the uniform prior on X(±μ, k). The proof is then completely parallel
to that of Theorem 10.14.2, now with C = μ²(k/n)I, since the signs of the nonzero
entries of x are i.i.d. Rademacher. Thus ‖C‖_{op} = μ²(k/n).

10.15 High-Dimensional Matched Subspace Detection

The motivation of this section is to illustrate how concentration of measure plays
a central role in the detection problem, freely taking material from [534]. See also
the PhD dissertation [535]. The classical formulation of this problem is a binary
hypothesis test of the following form

\[
\begin{aligned}
H_0 &: y = z\\
H_1 &: y = x + z
\end{aligned} \tag{10.46}
\]

where x ∈ R^n denotes a signal and z ∈ R^n is noise of known distribution.
We are given a subspace S ⊂ R^n, and our task is to decide whether x ∈ S or
x ∉ S, based on measurements y. Tests are usually based on some measure of
the energy of y in the subspace S, and these "matched subspace detectors" enjoy
optimal properties [536, 537]. See also the work on spectrum sensing in cognitive radio [156,
157, 159, 384, 385, 538].
Motivated by high-dimensional applications where it is prohibitive or impossible
to measure x completely, we assume that only a small subset Ω ⊂ {1, ..., n} of the
elements of x are observed, with and without noise. Based on these observations,
we test whether x ∈ S or x ∉ S. Given a subspace S of dimension k ≪ n, how
many elements of x must be observed so that we can reliably decide whether it
belongs to S? The answer is that, under mild incoherence conditions, the number is
O(k log k), such that reliable matched subspace detectors can be constructed from
very few measurements, making them scalable and applicable to large-scale testing
problems.
The main focus of this section is an estimator of the energy of x in S based on
only observing the elements x_i, i ∈ Ω. Let x_Ω be the vector of dimension |Ω| × 1
composed of the elements x_i, i ∈ Ω. We form the n × 1 vector x̃ with elements x_i
if i ∈ Ω and zero if i ∉ Ω, for i = 1, ..., n. Filling missing elements with zeros is
a fairly common, albeit naive, approach to dealing with missing data.
Let U be an n × k matrix whose columns span the k-dimensional subspace S.
For any such U, define P_S = U(UᵀU)^{-1}Uᵀ. The energy of x in the subspace S is
‖P_S x‖₂², where P_S is the projection operator onto S. Consider the case of partial
measurement. Let U_Ω denote the |Ω| × k matrix whose rows are the |Ω| rows of U
indexed by the set Ω. Define the projection operator

\[
P_{S_\Omega} = U_\Omega\left(U_\Omega^T U_\Omega\right)^{\dagger}U_\Omega^T,
\]

where the dagger † denotes the pseudoinverse. We have that if x ∈ S, then
‖x − P_S x‖₂² = 0 and ‖x_Ω − P_{S_Ω} x_Ω‖₂² = 0, whereas ‖x̃ − P_S x̃‖₂² can be
significantly greater than zero. This property makes ‖x_Ω − P_{S_Ω} x_Ω‖₂² a much better candidate
estimator than ‖x̃ − P_S x̃‖₂². However, if |Ω| ≈ k, it is possible that ‖x_Ω − P_{S_Ω} x_Ω‖₂² =
0 even if ‖x − P_S x‖₂² > 0. Our main result will show that if |Ω| is slightly
greater than k, then with high probability ‖x_Ω − P_{S_Ω} x_Ω‖₂² is very close to
(|Ω|/n)‖x − P_S x‖₂².
Let |Ω| denote the cardinality of Ω. The coherence of the subspace S is

\[
\mu(S) := \frac{n}{k}\max_i \|P_S e_i\|_2^2.
\]

That is, μ(S) measures the maximum magnitude attainable by projecting a standard
basis element onto S. We have 1 ≤ μ(S) ≤ n/k. For a vector v, we let μ(v) denote
the coherence of the subspace spanned by v. By plugging in the definition, we have

\[
\mu(v) = \frac{n\|v\|_\infty^2}{\|v\|_2^2}.
\]

To state the main result, write

\[
x = y + w,
\]

where y ∈ S and w ∈ S^⊥. Again let Ω refer to the set of indices of the observed
entries of x, and denote |Ω| = m. We split the quantity of interest into three terms
and bound each with high probability. Consider

\[
\|x_\Omega - P_{S_\Omega} x_\Omega\|_2^2 = \|w_\Omega - P_{S_\Omega} w_\Omega\|_2^2.
\]

Let the k columns of U be an orthonormal basis for the subspace S. We want to
show that

\[
\|w_\Omega - P_{S_\Omega} w_\Omega\|_2^2 = \|w_\Omega\|_2^2 - w_\Omega^T P_{S_\Omega} w_\Omega = \|w_\Omega\|_2^2 - w_\Omega^T U_\Omega\left(U_\Omega^T U_\Omega\right)^{-1}U_\Omega^T w_\Omega \tag{10.47}
\]

is nearly (m/n)‖w‖₂² with high probability.

Theorem 10.15.1 (Theorem


 1 of Balzano, Recht, and Nowak [534]). Let δ > 0
δ . Then with probability at least 1 − 4δ,
and m  83 kμ (S) log 2k
2
(1 − α) m − kμ(S) (1+β)
(1−γ) m
x − PS x22  xΩ − PSΩ xΩ 22  (1 + α) x − PS x22
n n
&  &
2μ(w)2 8kμ(S)2
where α = m
log (1/δ), β = 2μ (w) log (1/δ), and γ = 3m
log (2k/δ).

We need the following three equations [534] to bound three parts in (10.47). First,
m 2 2 m 2
(1 − α) w 2  wΩ 2  (1 + α) w 2 (10.48)
n n
with probability at least 1 − 2δ. Second,

/ T /
/UΩ wΩ /2  (1 + β)2 m kμ(S) w 2
(10.49)
2 2
n n
with probability at least 1 − δ. Third,
/ /
/ T −1 / n
/ UΩ UΩ /  (10.50)
2 (1 − γ) m

with probability at least 1 − δ, provided that γ < 1. The proof tools for the
three equations are McDiarmid’s Inequality [539] and Noncommutative Bernstein
Inequality (see elsewhere of this book).
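The estimator of Theorem 10.15.1 is straightforward to compute. The following MATLAB sketch compares ‖x_Ω − P_{S_Ω}x_Ω‖₂² with (m/n)‖x − P_S x‖₂² on synthetic data; all dimensions and the sampling fraction are illustrative.

% Sketch of the partially observed subspace energy estimate of Theorem 10.15.1.
n = 2000; k = 10; m = 200;                        % ambient dim., subspace dim., observed entries
[U, ~] = qr(randn(n, k), 0);                      % orthonormal basis of a random k-dim subspace S
x = U*randn(k,1) + 0.3*randn(n,1);                % test vector with some energy outside S
Omega = randperm(n, m);                           % observed indices
xO = x(Omega); UO = U(Omega, :);
residFull    = norm(x - U*(U'*x))^2;              % ||x - P_S x||^2 (uses the full vector)
residPartial = norm(xO - UO*(pinv(UO)*xO))^2;     % ||x_Omega - P_{S_Omega} x_Omega||^2
[residPartial, (m/n)*residFull]                   % close to each other with high probability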

10.16 Subspace Detection of High-Dimensional Vectors Using Compressive Sensing

We follow [540]. See also Example 5.7.6. We study the problem of detecting
whether a high-dimensional vector x ∈ R^n lies in a known low-dimensional sub-
space S, given few compressive measurements of the vector. In high-dimensional
settings, it is desirable to acquire only a small set of compressive measurements of
the vector instead of measuring every coordinate. The objective is not to reconstruct
the vector, but to detect whether the vector sensed using compressive measurements
lies in a low-dimensional subspace or not.
One class of problems [541–543] considers a simple hypothesis test of
whether the vector x is 0 (i.e., the observed vector is purely noise) or a known signal
vector s:

\[
H_0: x = 0 \quad\text{vs.}\quad H_1: x = s. \tag{10.51}
\]

Another class of problems [532, 534, 540] considers the subspace detection
setting, where the subspace is known but the exact signal vector is unknown. This
setup leads to the composite hypothesis test:

\[
H_0: x \in S \quad\text{vs.}\quad H_1: x \notin S. \tag{10.52}
\]

Equivalently, let x^⊥ denote the component of x that does not lie in S. Now we have
the composite hypothesis test

\[
H_0: \|x^{\perp}\|_2 = 0 \quad\text{vs.}\quad H_1: \|x^{\perp}\|_2 > 0. \tag{10.53}
\]

The observation vector is modeled as

\[
y = A(x + w) \tag{10.54}
\]

where x ∈ R^n is an unknown vector, A ∈ R^{m×n} is a random matrix with
i.i.d. N(0, 1) entries, and w ∼ N(0, σ²I_{n×n}) denotes noise with known variance
σ² that is independent of A. The noise model (10.54) is different from the more
commonly studied case

\[
y = A x + z \tag{10.55}
\]

where z ∼ N(0, σ²I_{m×m}). For fixed A, we have y ∼ N(Ax, σ²AAᵀ) under (10.54) and
y ∼ N(Ax, σ²I_{m×m}) under (10.55). Compressive linear measurements may be formed later
on to optimize storage or data collection.
Let U be an n × r matrix whose columns form an orthonormal basis of S, and define the
projection operator P_U = UUᵀ. Then x^⊥ = (I − P_U)x, where x^⊥
is the component of x that does not lie in S, and x ∈ S if and only if ‖x^⊥‖₂² = 0.
Similar to [534], we define the test statistic

\[
T = \left\|\left(I - P_{BU}\right)\left(A A^T\right)^{-1/2}y\right\|_2^2
\]

based on the observed vector y and study its properties, where B = (AAᵀ)^{-1/2}A.
Here P_BU is the projection operator onto the column space of BU, specifically

\[
P_{BU} = BU\left((BU)^T BU\right)^{-1}(BU)^T
\]

if (BU)ᵀBU is invertible.
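A MATLAB sketch of the statistic T is given below; the dimensions, the noise level, and the amount of energy off the subspace are our own illustrative choices, and the threshold η anticipates Theorem 10.16.3.

% Sketch of the compressive subspace detection statistic T of this section.
n = 500; r = 5; m = 50; sigma = 0.1;
[U, ~] = qr(randn(n, r), 0);                  % orthonormal basis of the known subspace S
A = randn(m, n);
B = real(sqrtm(inv(A*A')))*A;                 % B = (AA')^{-1/2} A; real() guards against rounding
x = U*randn(r,1) + 0.2*randn(n,1);            % test vector (second term puts energy off S)
y = B*(x + sigma*randn(n,1));                 % observation model (10.54), after premultiplying
BU = B*U;
P  = BU*((BU'*BU)\BU');                       % projection onto the column space of BU
T  = norm(y - P*y)^2;                         % test statistic
eta = exp(1)*sigma^2*(m - r);                 % threshold used in Theorem 10.16.3
decideH1 = T > eta;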
Now we are ready to present the main result. For the sake of notational simplicity,
we directly work with the matrix B and its marginal distribution. Writing y =
B(x + w), we have T = ‖(I − P_BU)y‖₂². Since A is i.i.d. normal, the distribution
of the row span of A (and hence of B) is uniform over m-dimensional subspaces
of R^n [544]. Furthermore, due to the (AAᵀ)^{-1/2} term, the rows of B are
orthogonal (almost surely). First, we show that, in the absence of noise, the test
statistic T = ‖(I − P_BU)Bx‖₂² is close to m‖x^⊥‖₂²/n with high probability.

Theorem 10.16.1 (Azizyan and Singh [540]). Let 0 < r < m < n, 0 < α₀ < 1 and
β₀, β₁, β₂ > 1. With probability at least 1 − exp[(1−α₀+log α₀)m/2]
− exp[(1−β₀+log β₀)m/2] − exp[(1−β₁+log β₁)m/2] − exp[(1−β₂+log β₂)r/2],

\[
\left(\alpha_0\frac{m}{n} - \beta_1\beta_2\frac{r}{n}\right)\|x^{\perp}\|_2^2 \;\le\; \left\|\left(I-P_{BU}\right)Bx\right\|_2^2 \;\le\; \beta_0\frac{m}{n}\|x^{\perp}\|_2^2. \tag{10.56}
\]

The proof of Theorem 10.16.1 follows from random projections and concentration
of measure. Theorem 10.16.1 implies the following corollary.

Corollary 10.16.2 (Azizyan and Singh [540]). If m ≥ c₁r log m, then with
probability at least 1 − c₂ exp(−c₃m),

\[
d_1\frac{m}{n}\|x^{\perp}\|_2^2 \le \left\|\left(I-P_{BU}\right)Bx\right\|_2^2 \le d_2\frac{m}{n}\|x^{\perp}\|_2^2
\]

for some universal constants c₁ > 0, c₂ > 0, c₃ ∈ (0, 1), d₁ ∈ (0, 1), d₂ > 1.
Corollary 10.16.2 states that given just over r noiseless compressive measurements,
we can estimate ‖x^⊥‖₂² accurately with high probability. In the presence of noise, it
is natural to consider the hypothesis test:

\[
T = \left\|\left(I-P_{BU}\right)y\right\|_2^2 \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta. \tag{10.57}
\]

The following result bounds the false alarm level and missed detection rate of this
test (for an appropriately chosen η), assuming a lower bound on ‖x^⊥‖₂² under H₁.

Theorem 10.16.3 (Azizyan and Singh [540]). If the assumptions of Corollary
10.16.2 are satisfied, and if for any x ∈ H₁

\[
\|x^{\perp}\|_2^2 \ge \frac{4e+2}{d_1}\,\sigma^2\left(1-\frac{r}{m}\right)n,
\]

then

\[
\mathbb{P}(T \ge \eta\,|\,H_0) \le \exp\left[-c_4(m-r)\right]
\]

and

\[
\mathbb{P}(T \le \eta\,|\,H_1) \le c_2\exp\left[-c_3 m\right] + \exp\left[-c_5(m-r)\right],
\]

where η = eσ²(m−r), c₄ = (e−2)/2, c₅ = (e + log(2e+1))/2, and all other
constants are as in Corollary 10.16.2.
It is important to determine whether the performance of the proposed test statistic
can be improved further. The following theorem provides an information-theoretic
lower bound on the probability of error of any test. A corollary of this theorem
implies that the proposed test statistic is optimal, that is, every test with probability
of missed detection and false alarm decreasing exponentially in the number of
compressive samples m requires that the energy off the subspace scale as n.

Theorem 10.16.4 (Azizyan and Singh [540]). Let P₀ be the joint distribution of
B and y under the null hypothesis. Let P₁ be the joint distribution of B and y
under the alternative hypothesis where y = B(x + w), for some fixed x such that
x = x^⊥ and ‖x‖₂ = M > 0. If the conditions of Corollary 10.16.2 are satisfied, then

\[
\inf_{\phi}\max_{i=0,1}\mathbb{P}_i(\phi \ne i) \ge \frac{1}{8}\exp\left(-\frac{M^2}{2\sigma^2}\,\frac{m}{n}\right)
\]

where the infimum is over all hypothesis tests φ.


Proof. Since the approach is interesting, we follow [540] to give a proof here. Let
K be the Kullback-Leibler divergence. Then

1 −K(P0 ,P1 )
inf max Pi (φ = i)  e
φ i=0,1 8

 Let q be the density of B and p (y; μ, Σ) that of N (μ, Σ). Under P0 ,


(see [533]).
y ∼ N 0, σ 2 Im×m since rows of B are orthonormal. So,

p(y;0,σ 2 Im×m )q(B)


K (P0 , P1 ) = EB Ey log p(y;Bx,σ 2 Im×m )q(B)
2
2σ 2 EB
1
= Bx 2

x22 m
= 2σ 2 n . 
Corollary 10.16.5 (Azizyan and Singh [540]). If there exists a hypothesis test φ
based on B and y such that for all n and σ 2 ,

max P (φ = i|Hi )  C0 exp [−C1 (m − r)]


i=0,1

for some C0 , C1 > 0, then there exists some C > 0 such that
2
x⊥ 2  Cσ 2 (1 − r/m) n

for any x ∈ H1 and all n and σ 2 .


Note that r  m
c1 log m , from Corollary 10.16.5.

10.17 Detection for Data Matrices

The problem of detection and localization of a small block of weak activation in a
large matrix is considered by Balakrishnan, Kolar, Rinaldo and Singh [545]. Using
information-theoretic tools, they establish lower bounds on the minimum number of
compressive measurements and the weakest signal-to-noise ratio (SNR) needed to
detect the presence of an activated block of positive activation, as well as to localize
the activated block, using both non-adaptive and adaptive measurements.
Let A ∈ R^{n₁×n₂} be a signal matrix with unknown entries that we would like to
recover. We consider the following observation model, under which N noisy linear
measurements of A are available:

\[
y_i = \operatorname{Tr}(A X_i) + z_i, \quad i = 1, \ldots, N \tag{10.58}
\]

where z₁, ..., z_N are i.i.d. N(0, σ²) with σ > 0 known, and the sensing matrices X_i satisfy
either ‖X_i‖_F ≤ 1 or E‖X_i‖²_F = 1. We are interested in two measurement
schemes²: (1) adaptive or sequential, that is, the measurement matrix X_i is a
(possibly randomized) function of (y_j, X_j)_{j∈[i−1]}; (2) passive, that is, the measurement
matrices are chosen all at once.

² We use [n] to denote the set {1, ..., n}.

10.18 Two-Sample Test in High Dimensions

We follow [237] here. The use of concentration of measure for a standard
quadratic form of a random matrix is the primary reason for this whole section.
There are two independent sets of samples {x₁, ..., x_{n₁}} and {y₁, ..., y_{n₂}} in R^p.
They are generated in an i.i.d. manner from the p-dimensional multivariate Gaussian
distributions N(μ₁, Σ) and N(μ₂, Σ), respectively, where the mean vectors μ₁
and μ₂ and the positive-definite covariance matrix Σ > 0 are all fixed and unknown.
The hypothesis testing problem of interest here is

\[
H_0: \mu_1 = \mu_2 \quad\text{versus}\quad H_1: \mu_1 \ne \mu_2. \tag{10.59}
\]

The most well-known test statistic for this problem is the Hotelling T² statistic,
defined by

\[
T^2 = \frac{n_1 n_2}{n_1 + n_2}\left(\bar{x} - \bar{y}\right)^T\hat{\Sigma}^{-1}\left(\bar{x} - \bar{y}\right), \tag{10.60}
\]

where x̄ = (1/n₁)Σ_{i=1}^{n₁} x_i and ȳ = (1/n₂)Σ_{i=1}^{n₂} y_i are the sample mean vectors, and Σ̂ is the pooled
sample covariance matrix, given by

\[
\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n_1}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^T + \frac{1}{n}\sum_{i=1}^{n_2}\left(y_i - \bar{y}\right)\left(y_i - \bar{y}\right)^T,
\]

where we define n = n₁ + n₂ − 1 for convenience.
When p > n, the matrix Σ̂ is singular, and the Hotelling test is not well defined.
Even when p ≤ n, the Hotelling test is known to perform poorly if p is nearly as
large as n. It is well known that Σ̂ is a degraded estimate of Σ in high dimensions,
where we allow the data dimension p to exceed the sample size n.
The Hotelling T² test measures the separation of H₀ and H₁ in terms of the
Kullback-Leibler (KL) divergence [546, p. 216], defined by

\[
D_{KL}\left(\mathcal{N}(\mu_1, \Sigma)\,\|\,\mathcal{N}(\mu_2, \Sigma)\right) = \frac{1}{2}\delta^T\Sigma^{-1}\delta,
\]

with δ = μ₁ − μ₂. The relevant statistical distance is driven by the length of δ. This
section is primarily motivated by the properties of random matrices. In particular,
we are interested in the so-called random projection method [547]. A projection
matrix P_kᵀ ∈ R^{k×p} is used to project data from R^p to R^k. After the projection, the
classical Hotelling T² test, defined by (10.60), is applied to the projected data.
When this projection matrix is random (the random projection method), the random
projection reduces the dimension and simultaneously preserves most of the
length of δ.
We use the matrix (P_kᵀΣ̂P_k)^{-1} as a surrogate for Σ̂^{-1} in the high-dimensional
setting. To eliminate the variability of a single random projection, we use the average
of the matrix P_k(P_kᵀΣ̂P_k)^{-1}P_kᵀ over the ensemble of projections P_k, to any desired degree of
precision. The resulting statistic is proportional to E_{P_k}[P_k(P_kᵀΣ̂P_k)^{-1}P_kᵀ].
Let z1−α denote the 1 − α quartile of the standard normal distribution, and let
Φ (·) be its cumulative distribution function. Consider the Haar distribution on the
set of matrices

PTk Pk = Ik×k , Pk ∈ Rp×k .

If Pk is drawn from the Haar distribution, independently of the data, then our
random projection-based test statistic is defined by
  −1
n1 n2 T
T̂k2 = (x̄ − ȳ) EPk Pk PTk Σ̂Pk PTk (x̄ − ȳ) .
n1 + n2

For a desired nominal level α ∈ (0, 1), our testing procedure rejects the null
hypothesis H0 if and only if T̂k2  tα , where
522 10 Detection in High Dimensions

?
yn 2yn √
tα ≡ n+ nz1−α ,
1 − yn (1 − yn )
3

yn = k/n and z1−α is the 1 − α quartile of the standard Gaussian distribution. T̂k2
is asymptotically Gaussian.
To state a theorem, we make the condition (A1).
 
A1 There is a constant y ∈ (0, 1) such that yn = y + o √1n .
2
yn 2yn √
We also need to define two parameters μ̂n = 1−y n
n, and σ̂ n = (1−y )3
n.
n

f (n) = o (g(n)) means f (n)/g(n) → 0 as n → 0.


Theorem 10.18.1 (Lopes, Jacob and Wainwright [237]). Assume that the null
hypothesis H0 and the condition (A1) hold. Then, as (n, p) → ∞, we have the
limit

T̂k2 − μ̂n d

→ N (0, 1) , (10.61)
μ̂n

d
(Here −→ stands for convergence in distribution) and as a result, the critical value
tα satisfies
 
P T̂k2  tα = α + o(1).

n1 +n2
Proof. Following [237], we only give a sketch of the proof. Let τ = n1 n2 and

z ∼ N (0, Ip×p ). Under the null hypothesis that δ = 0, we have x̄−ȳ = τΣ 1/2
z,
and as a result,
  −1
T̂k2 = zT Σ1/2 EPk Pk PTk Σ̂Pk PTk Σ1/2 z, (10.62)
F GH I
A

which gives us the standard quadratic form T̂k2 = zT Az. The use of concentration
of measure for the standard quadratic form is the primary reason for this whole
section. Please refer to Sect. 4.15, in particular, Theorem 4.15.8.
Here, A is a random matrix. We may take x̄ − ȳ and Σ̂ to be independent for
Gaussian data [208]. As a result, we may assume that z and A are independent. Our
overall plan is to work conditionally on A, and use the representation
   T 
zT Az − μ̂n z Az − μ̂n
P  x = EA Pz  x |A ,
σ̂n σ̂n

where x ∈ R.
10.18 Two-Sample Test in High Dimensions 523

Let oPA (1) stand for a positive constant in probability under PA . To demonstrate
the asymptotic Gaussian distribution of zT Az, in Sect. B.4 of [237], it is shown that
Aop
AF = oPA (1) where || · ||F denotes the Frobenius norm. This implies that the
Lyupanov condition [58], which in turn implies the Lindeberg condition, and it then
follows [87] that
  
 zT Az − Tr (A) 
 
sup Pz √  x |A − Φ (x) = oPA (1) . (10.63)
x∈R  2 A F 

The next step is to show that Tr (A) and


2 A F can be replaced with deterministic
yn 2yn √
counterparts μ̂n = 1−yn n, and σ̂n = (1−yn )3
n. More precisely

√ √
Tr (A) − μ̂n = oPA n and A F − σ̂n = oPA n . (10.64)

Inserting (10.64) into (10.63), it follows that


 
zT Az − μ̂n
P  x |A − Φ (x) = oPA ,
σ̂n

and the central limit theorem (10.61) follows from the dominated convergence
theorem. 
To state another theorem, we need the following two conditions:
 
• (A2) There is a constant b ∈ (0, 1) such that nn1 = b + o √1n .
• (A3) (Local alternative) The shift vector and covariance matrix satisfy
δ T Σ−1 δ = o(1).
Theorem 10.18.2 (Lopes, Jacob and Wainwright [237]). Assume that conditions
(A1), (A2), and (A3) hold. Then, as (n, p) → ∞, the power function satisfies
 " 
 1−y √
P Tk2  tα = Φ −z1−α + b (1 − b) · · Δk n + o(1). (10.65)
2y

where
  −1
Δk = δ EPk Pk PTk Σ̂Pk
T
PTk δ.

Proof. The heart of this proof is to use the conditional expectation and the
concentration of quadratic form. We work under the alternative hypothesis. Let
τ = nn11+n 2
n2 and z ∼ N (0, Ip×p ). Consider the limiting value of the power function
 
P n T̂k  tα . Since the shift vector is nonzero δ = 0, We can break the test
1 2

statistic into three parts. Recall from (10.62) that


524 10 Detection in High Dimensions

  −1
n1 n2 T
T̂k2 = (x̄ − ȳ) EPk Pk PTk Σ̂Pk PTk (x̄ − ȳ) ,
n1 + n2

where

x̄ − ȳ = τ Σ1/2 z + δ,

and z is a standard Gaussian p-vector. Expanding the definition of T̂k2 and adjusting
by a factor n, we have the decomposition

1 2
T̂ = I + II + III,
n k
where
1 T
I= z Az (10.66)
n

1
II = 2 √ zT AΣ−1/2 δ (10.67)
n τ

1 T −1/2
III = δ Σ AΣ−1/2 δ. (10.68)

Recall that
  −1
A = Σ1/2 EPk Pk PTk Σ̂Pk PTk Σ1/2 .

We will work on the conditional expectation EA with the condition A. Consider


 
 1
P Tk2  tα = EA Pz I  tα − II − III |A .
n
1 / /
Working with Tr nA and / n1 A/F , and multiplying the top and the bottom by

n, we have
1  √ 1  
  zT 1
n n tα − Tr n1 A −II−III
n A /z− Tr A
P Tk2 tα =EA Pz √ 1 / n  √ √ / / |A .
/ /
2 nA F n 2/ n1 A/F

(10.69)

Recall the definition of the critical value


?
yn 2yn √
tα ≡ n+ 3 nz1−α .
1 − yn (1 − yn )
10.19 Connection with Hypothesis Detection of Noncommuntative Random Matrices 525

 
1 n
We also define the numerical sequence ln = τn n−k−1 Δk . In Sects. C.1 and C.2
of [237], they establish the limits
√ √
Pz n |II|  ε |A = oPA (1) , and n (III − ln ) = oPA (1)

where  > 0. Inserting the limits (10.64) into (10.69), we have


⎛ ⎞
√ & 2yn 
  zT Az− Tr (A) n · z1−α −ln −II
⎜ n(1−y )3 ⎟
P Tk2 tα =EA Pz ⎝ &
n
√  +oPA (1) |A ⎠.
2AF 2yn
(1−yn )3

(10.70)

By the limit (10.63), we have



zT Az − Tr (A) √ (1 − yn ) 3
Pz √ z1−α − n (ln + II) + oPA (1) |A
2 A F 2yn
⎛ ? ⎞
√ (1 − y )
3
= Φ ⎝−z1−α + n ln ⎠ + oPA (1) ,
n
2yn

where the error term oPA (1) is bounded by 1. Integrating over A and applying the
dominated convergence theorem, we obtain
⎛ ? ⎞
 √ (1 − yn ) ⎠
3
P Tk2 tα = Φ ⎝−z1−α + n ln + o (1) .
2yn

   
k √1 n1 √1
Using the assumptions yn = n = a+o n
and n = b+o n
, we conclude

 " 
 1−y √
P Tk2  tα = Φ −z1−α + b (1 − b) · · Δk n + o(1),
2y

which is the same as (10.65). 

10.19 Connection with Hypothesis Detection


of Noncommuntative Random Matrices

Consider the hypothesis detection of the problem

H0 : A = Rn
H1 : B = Rx + Rn (10.71)
526 10 Detection in High Dimensions

where Rx is the true covariance matrix of the unknown signal and Rn is the
true covariance matrix of the noise. The optimal average probability of correct
detection [5, p. 117] is

1 1
+ A − B 1,
2 4

where the trace norm X 1 = Tr XXH is the sum of the absolute eigenvalues.
See also [548, 549] for the original derivation.
In practice, we need to use the sample covariance matrix to replace the true
covariance matrix, so we have

H0 : Â = R̂n
H1 : B̂ = R̂x + R̂n

where R̂x is the true covariance matrix of the unknown signal and R̂n is the true
covariance matrix of the noise. Using the triangle inequality of the norm, we have
/ / / / / /
/ / / / / /
/Â − B̂/  /Â − A/ + /B − B̂/ .
1 1 1

Concentration inequalities will connect the non-asymptotic convergence of the


sample covariance matrix to its true value, via. See Chaps. 9 and 5 for covariance
matrix estimation.

10.20 Further Notes

In [550], Sharpnack, Rinaldo, and Singh consider the basic but fundamental task of
deciding whether a given graph, over which a noisy signal is observed, contains a
cluster of anomalous or activated nodes comprising an induced connected subgraph.
Ramirez, Vita, Santamaria and Scharf [551] studies the existence of locally most
powerful invariant tests for the problem of testing the covariance structure of a set
of Gaussian random vectors. In practical scenarios the above test can provide better
performance than the typically used generalized likelihood ratio test (GLRT).
Onatski, Moreira and Hallin [552] consider the problem of testing the null
hypothesis of sphericity of a high-dimensional covariance matrix against an alterna-
tive of multiple symmetry-breaking directions (multispiked alternatives).
Chapter 11
Probability Constrained Optimization

In this chapter, we make the connection between concentration of measure and


probability constrained optimization. It is the use of concentration inequality
that makes the problem of probability constrained optimization mathematically
tractable. Concentration inequalities are the enabling techniques that make possible
probability constrained optimization.

11.1 The Problem

We follow Nemirovski [553] to set up the problem. We consider a probability


constraint

Prob {ξ : A (x, ξ) ∈ K}  1 − ε (11.1)

where x is the decision vector, K is a closed convex cone, and A (x, ξ) is defined as

N
A (x, ξ) = A0 (x) + σ ξi Ai (x), (11.2)
i=1

where
• Ai (·) are affine mapping from Rn to finite-dimensional real vector space E;
• ξi are scalar random perturbations satisfying the relations
1. ξi are mutually independent;
2. E {ξi } = 0;
   √
E exp ξi2 /4  2; (11.3)

R. Qiu and M. Wicks, Cognitive Networked Sensing and Big Data, 527
DOI 10.1007/978-1-4614-4544-9 11,
© Springer Science+Business Media New York 2014
528 11 Probability Constrained Optimization

• σ > 0 is the level of perturbations.


• K is a closed convex cone in E.
For ξi , we are primarily interested in the following cases:
• ξi ∼ N (0, 1) are Gaussian noise; the absolute constants in (11.3) comes exactly
from the desire to use the standard Gaussian perturbations;
• E {ξi } = 0; |ξi |  1 so ξi are bounded random noise.
For the vector space E and the closed pointed convex cone K, we are interested in
the cases: (1) if E = R, (real), and K = R+ (positive real); (11.2) is the special case
for a scalar linear inequality. (2) If a real vector space is considered E = Rm+1 , and
 2 
K= x∈Rm+1
: xm+1  x1 + · · · + xm ;
2 2

here (11.2) is a randomly perturbed Conic Quadratic Inequality (CQI), where the
data are affine in the perturbations. (3) E = Sm is the space of m × m symmetric
matrices, K = Sm + is the cone of positive semidefinite matrices from S ; here (11.2)
m

is a randomly perturbed Linear Matrix Inequality (LMI).


We are interested to describe x’s which satisfy (11.2) with a given high
probability, that is:

Prob {ξ : A (x, ξ) ∈
/ K}  ε, (11.4)

for a given ε ! 1. Our ultimate goal is to optimize over the resulting set, under the
additive constraints on x. A fundamental problem is that (11.4) is computationally
intractable. The solution to the problem is connected with concentration of measure
through large deviations of sums of random matrices, following Nemirovski [554].
Typically, the only way to estimate the probability for a probability constraint to
be violated at a given point is to use Monte-Carlo simulations (so-called scenario
approach [555–558]) with sample sizes of 1ε ; this becomes too costly when ε is
small such as 10−5 or less. A natural idea is to look for tractable approximations of
the probability constraint, i.e., for efficiently verifiable sufficient conditions for its
validity. The advantage of this approach is its generality, it imposes no restrictions
on the distribution of ξ and on how the data enter the constraints.
An alternative to the scenario approximation is an approximation based on
“closed form” upper bounding of the probability for the randomly perturbed
constraints A (x, ξ) ∈ K to be violated. The advantage of the “closed form”
approach as compared to the scenario one is that the resulting approximations
are deterministic convex problems with sizes independent of the required value
of ε, so that the approximations also remain practical in the case of very small
values of ε. A new class of “closed form” approximations, referred to as Bernstein
approximations, is proposed in [559].
Example 11.1.1 (Covariance Matrix Estimation). For independent random vectors
xk , k = 1, . . . , K, the sample covariance—a Hermitian, positive semidefinite,
random matrix—defined as
11.2 Sums of Random Symmetric Matrices 529

K
R̂ = xk xTk , xk ∈ Rn
k=1

Note that the n elements of the k-th vector xk may be dependent random variables.
Now, assume that there are N + 1 observed sample covariances R̂0 (x) , i =
0, 1, 2, . . . , N such that
N
R̂ (x, ξ) = R̂0 (x) + σ ξi R̂i (x), (11.5)
i=1

Our task is to consider a probability constraint (11.1)


0 1
Prob ξ : R̂ (x, ξ)  0  1 − ε (11.6)

where x is the decision vector. Thus, the covariance matrix estimation is recast in
terms of an optimization. Later it will be shown that the optimization problem is
convex and may be solved efficiently using the general-purpose solver that is widely
available online. Once the covariance matrix is estimated, we can enter the second
stage of detection process for extremely weak signal. This line of research seems
novel. 

11.2 Sums of Random Symmetric Matrices

Let us follow Nemirovski [560] to explore the connection of sums of random


symmetric matrices with the probability constrained optimization. Let A denote
the standard spectral norm (the largest singular value) of an m×n matrix A. We ask
this question.
(Q1) Let Xi , 1 ≤ i ≤ N, be independent n × n random symmetric matrices with

N
zero mean and “light-tail” distributions, and let SN = Xi . Under what
i=1
conditions is a “typical value” of SN “of order 1” such that the probability
for SN to be ≥ t goes to 0 exponentially fast as t > 1 grows?
Let Bi be deterministic symmetric n × n matrices, and ξi be independent random
scalars with zero mean and “of order of one” (e.g., ξi ∼ N (0, 1)). We are interested
in the conditions for the “typical norm” of the random matrix
N
S N = ξ1 B 1 + · · · + ξ N B N = ξi B i
i=1

to be of order 1. An necessary condition is


 
E S2N  O(1)I
530 11 Probability Constrained Optimization

which, translates to
N
B2i  I.
i=1

A natural conjecture is that the latter condition is sufficient as well. This answer is
affirmative, as proven by So [122]. A relaxed version of this conjecture has been
proven by Nemirovski [560]: Specifically, under the above condition, the typical
norm of SN is  O(1)m1/6 with the probability
0 1 
Prob SN > tm1/6  O(1) exp −O(1)t2

for all t > 0.


We can ask the question.
(Q2) Let ξ1 , . . . , ξN be independent mean zero random variables, each of which is
either (i) supported on [-1,1], or (ii) normally distributed with unit variance.
Further, let X1 , . . . , XN be arbitrary m × n matrices. Under what conditions
on t > 0 and X1 , . . . , XN will we have an exponential decay of the tail
probability
/ / 
/ N /
/ /
Prob / ξ i Xi /  t ?
/ /
i=1

Example 11.2.1 (Randomly perturbed linear matrix inequalities). Consider a ran-


domly perturbed Linear Matrix Inequalities

N
A0 (x) − ξi Ai (x)  0, (11.7)
i=1

where A1 (x) , . . . , AN (x) are affine functions of the decision vector x taking
values in the space Sn of symmetric n × n matrices, and ξi are independent of
each other random perturbations. Without loss of generality, ξi can be assumed to
have zero means.
A natural idea is to consider the probability constraint
+ N
,
Prob ξ = (ξ1 , . . . , ξN ) : A0 (x) − ξi Ai (x)  0  1 − , (11.8)
i=1

where  > 0 is a small tolerance. The resulting probability constraint, however,


typically is “heavily computationally intractable.” The probability in the left hand
side cannot be computed efficiently, its reliability estimation by Monte-Carlo
simulations requires samples of order of 1/, which is prohibitively time-consuming
11.2 Sums of Random Symmetric Matrices 531

when  is small, like 10−6 or 10−8 . A natural way to overcome this difficulty
is to replace “intractable” (11.7) with its “tractable approximation”—an explicit
constraint on x.
An necessary condition for x to be feasible for (11.7) is A0 (x) ≥ 0;
strengthening this necessary condition to be A0 (x) > 0, x is feasible for the
probability constraint if and only if the sum of random matrices

N
−1/2 1/2
SN = ξ i A0 (x) Ai (x) A0 (x)
i=1
F GH I
Yi

is  In with probability ≥ 1 − . Assuming, as it is typically the case, that


the distribution of ξi are symmetric,1 this condition is essentially the same as the
condition SN  1 with probability ≥ 1 − . If we know how to answer (Q), we
could use this answer to build a “tractable” sufficient condition for SN to be ≤ 1
with probability close to 1 and thus could build a tractable approximation of (11.8).
&
%
Example 11.2.2 (Nonconvex quadratic optimization under orthogonality constr-
aints). We take this example—the Procrustes problem—from [560]. In the Pro-
crustes problem, we are given K matrices A[k], k = 1, . . . , N, of the same size
m × n. Our goal is to look for N orthogonal matrices X[k] of n × n minimizing the
objective

A[k]X[k] − A[k  ]X[k  ]


2
2
1k<k N


where A 2 = Tr (AAT ) is the Frobenius norm of a matrix. This problem is
equivalent to the quadratic maximization problem
⎧ ⎫
⎨   ⎬
T  T  n×n T
P = max 2 Tr A[k]X[k]X [k ]A [k ] : X [k] ∈ R , X [k] X [k] = In , k = 1, . . . , N .
X[1],...,X[N ] ⎩ ⎭
k<k

(11.9)

When N > 2, this problem is intractable. For N = 2, there is a closed form solution.
Equation (11.29) allows for a straightforward semidefinite relaxation. Geometrically
speaking, we are given N collections of points in Rn and are seeking for rotations
which make these collections as close to each other as possible, the closeness being
measured by the sum of squared distances.

1 We say that, X and Y are identically distributed, or similar, or that Y is a copy of X, if


PX (A) = PY (A) , where PX (A) = PX (X ∈ A) is a probability measure. A random element
X in a measurable vector space is called symmetric, if X and −X are identically distributed. If X
is a symmetric random element, then its distribution is a symmetric measure.
532 11 Probability Constrained Optimization

Let Y = Y [X[1], . . . , X[N ]] be the symmetric matrix defined as follows: the


rows and the columns in Y are indexed by triples (k, i, j), where k runs from 1
to K and i, j run from 1 to n; the entry Ykij,k i j  in Y is xij [k]xi j  [k  ]. Note
Y is a symmetric, positive semidefinite matrix of rank 1. In (11.29), the relation
X [k] XT [k] = In is equivalent to a certain system Sk of linear equations on the
T
entries of Y, while the relation X [k] X [k] = In is equivalent to a certain system
Tk of linear equations on the entries of Y. Finally, the objective in (11.29) is a
linear function Tr (BY) of Y, where B be appropriate symmetric matrix of the
size Kn2 × Kn2 . It is seen that (11.29) is equivalent to

max Tr (BY) : Y  0, Ysatisfies Sk , Tk , k = 1, . . . , K, Rank (Y) = 1;


Y∈SKn2

removing the trouble-making constraint Rank (Y) = 1, we have an explicit


semidefinite problem

SDP = max 2 Tr (BY) : Y  0, Ysatisfies Sk , Tk , k = 1, . . . , K


Y∈SKn

which is a relaxation of (11.29), so that Opt (SDP)  Opt (P ) . In fact, we have


 
Opt (SDP)  O(1) n1/3 + ln K Opt (P )

and similarly for other problems of quadratic optimization under orthogonality


constraints. &
%
Theorem 11.2.3 (Nemirovski [560]). Let X1 , . . . , XN be independent symmetric
n × n matrices with zero mean such that
  
2
E exp Xi /σi2  exp (1) , i = 1, . . . , N

where σi > 0 are deterministic scalars factors. Then


⎧ : ⎫
⎨ ; ⎬
; N

Prob SN  t< σi2  O(1) exp −O(1)t2 / ln n , ∀t > 0, (11.10)
⎩ ⎭
i=1

with positive absolute constraints O(1).


Theorem 11.2.4 (Nemirovski [560]). Let ξ1 , . . . , ξN be independent random vari-
ables with zero mean and zero third moment taking values in [−1, 1], Bi , i =
1, . . . , N, be deterministic symmetric m × m matrices, and Θ > 0 be a real number
such that
N
B2i  Θ2 I.
i=1
11.2 Sums of Random Symmetric Matrices 533

Then
+/ N / ,
/ / 
/ /
t  7m1/4 ⇒ Prob / ξi Bi /  tΘ  54 exp −t2 /32 ,
/ /
i=1
+/ N / ,
/ / 
/ /
t  7m1/6 ⇒ Prob / ξi Bi /  tΘ  22 exp −t2 /32 . (11.11)
/ /
i=1

See [560] for a proof. Equation (11.11) is extended by Nemirovski [560] to hold for
the case of independent, Gaussian, symmetric n × n random matrices X1 , . . . , XN
with zero means and σ > 0 such that
N
 
E X2i  σ 2 In . (11.12)
i=1

Let us consider non-symmetric (and even non-square) random matrices Yi , i =


1, . . . , N . Let Ci be deterministic m × n matrices such that
N N
Ci CTi  Θ2 Im , CTi Ci  Θ2 In , (11.13)
i=1 i=1

and ξi be independent random scalars with zero mean and of order of 1. Then
+ ,
 N

t  O (1) ln (m + n) ⇒ Prob ξ : ξi Ci  tΘ  O(1) exp −O(1)t2 .
i=1
(11.14)
Using the deterministic symmetric (m + n) × (m + n) Bi defined as

CTi
Bi = ,
Ci
the following theorem follows from Theorem 11.2.4 and (11.12).
Theorem 11.2.5 (Nemirovski [560]). Let deterministic m × n matrices Ci satis-
fying (11.13) with Θ = 1., and let ξi be independent random scalars with zero first
and third moment and such that either |ξi | ≤ 1 for all i ≤ N, or ξ ∼ N (0, 1) for
all i ≤ N. Then
+N ,
1/4 
t  7(m + n) ⇒ Prob ξi Ci  tΘ  54 exp −t2 /32 .
i=1
+ ,
1/6
N

t  7(m + n) ⇒ Prob ξi Ci  tΘ  22 exp −t2 /32 . (11.15)
i=1

We can make a simple additional statement: Let Ci , ξi be defined as Theorem


11.2.5, then
534 11 Probability Constrained Optimization

+ ,
 N
4 
t  4 min (m, n) ⇒ Prob ξi Ci  tΘ  exp −t2 /16 . (11.16)
i=1
3

It is clearly desirable to equations such as (11.15) to hold for smaller values of t.


Moreover, it is nice to remove the assumption that the random variables ξ1 , . . . , ξN
have zero third moment.
Conjecture (Nemirovski [560]) Let ξ1 , . . . , ξN be independent mean zero ran-
dom variables, each of which is either (i) supported on [-1,1], or (ii) normally
distributed with unit variance. Further, let X1 , . . . , XN be arbitrary
 m×n
matrices satisfying (11.13) with Θ = 1. Then, whenever t ≥ O(1) ln (m + n),
one has
/N / 
/ / 
/ /
Prob / ξi Xi /  t  O(1) · exp −O(1) · t2 .
/ /
i=1

 
It is argued in [560] that the threshold t = Ω ln (m + n) is in some sense
the best one could hope for. So [122] finds that the behavior of the random

N
variable SN ≡ ξi Xi has been extensively studied in the functional analysis
i=1
and probability theory literature. One of the tools is the so-called Khintchine-type
inequalities [121].
We say ξ1 , . . . , ξN are independent Bernoulli random variables when each ξi
takes on the values ±1 with equal probability.
Theorem 11.2.6 (So [122]). Let ξ1 , . . . , ξN be independent mean zero random
variables, each of which is either (i) supported on [-1,1], or (ii) Gaussian with
variance one. Further, let X1 , . . . , XN be arbitrary m × n matrices satisfying
max(m, n)  2 and (11.13) with Θ = 1. Then, for any t ≥ 1/2, we have
/ / 
/ N / 
/ / −t
Prob / ξi Xi /  2e (1 + t) ln max {m, n}  (max {m, n})
/ /
i=1

if ξ1 , . . . , ξN are i.i.d. Bernoulli or standard normal random variables; and


/ / 
/ N / 
/ / −t
Prob / ξi Xi /  8e (1 + t) ln max {m, n}  (max {m, n})
/ /
i=1

if ξ1 , . . . , ξN are independent mean zero random variables supported on [-1,1].


Proof. We follow So [122] for a proof. X1 , . . . , XN are arbitrary m × n matrices

N N
satisfying (11.13) with Θ = 1, so all the eigenvalues of Xi XTi and XTi Xi
i=1 i=1
lie in [0, 1]. Then, we have
11.2 Sums of Random Symmetric Matrices 535

/ 1/2 / / 1/2 /
/ / / /
/ N
/ / N
/
/ Xi XTi /  m1/p , / XTi Xi /  n1/p .
/ / / /
/ i=1 / / i=1 /
Sp Sp

Let ξ1 , . . . , ξN be i.i.d. Bernoulli random variables or standard Gaussian random


variables. By Theorem 2.17.4 and discussions following it, it follows that
⎡/ /p ⎤ ⎡/ /p ⎤
/ N / / N /
/ / ⎦ / /
E ⎣/ ξ i Xi /  E ⎣/ ξi Xi / ⎦  pp/2 · max {m, n}
/ / / /
i=1 ∞
i=1 Sp

for any p ≥ 2. Note that ||A|| denotes the spectrum norm of matrix A. Now, by
Markov’s inequality, for any s > 0 and p ≥ 2, we have
⎛/ / ⎞ ⎡/ /p ⎤
/ N / / N / pp/2 · max {m, n}
/ / / /
P ⎝/ ξ i Xi /  s⎠  s−p · E ⎣/ ξ i Xi / ⎦  .
/ / / / sp
i=1 ∞
i=1 ∞

By assumption t  1/2 and max{m, n} ≥ 2, we set



s= 2e (1 + t) ln max {m, n}, p = s2 /e > 2

through which we obtain


⎛/ / ⎞
/ N / 
/ / −t
P ⎝/ ξ i Xi /  2e (1 + t) ln max {m, n}⎠  (max {m, n})
/ /
i=1 ∞

as desired.
Next, we consider the case where ξ1 , . . . , ξN are independent mean zero
random variables supported on [−1, 1]. Let ε1 , . . . , εN be i.i.d. Bernoulli random
variables; ε1 , . . . , εN are independent of the ξi ’s. A standard symmetrization
argument (e.g., see Lemma 1.11.3 which is Lemma 6.3 in [27]), together with
Fubini’s theorem and Theorem 2.17.4 implies that
 p  p
N   N 
E
 ξ i X i

  2p · Eξ Eε 
 ε i ξ i X i


i=1 Sp i=1
⎡ ⎧Sp ⎫⎤
⎨  1/2  
p   1/2 
p ⎬
 N
  N

 2p · pp/2 · Eξ ⎣max  ξ 2 Xi XT  , ξ 2 XT Xi  ⎦
⎩ i=1 i i
  i=1 i i  ⎭
Sp Sp
 2p · pp/2 · max {m, n} .

&
%
Example 11.3.2 will use Theorem 11.2.6. A
536 11 Probability Constrained Optimization

11.3 Applications of Sums of Random Matrices

We are in a position to apply the sums of random matrices to the probability


constraints.
Example 11.3.1 (Randomly perturbed linear matrix inequalities [560]—Continued).
Consider a randomly perturbed Linear Matrix Inequalities
N
A0 (x) − ξi Ai (x)  0, (11.17)
i=1

where A1 (x) , . . . , AN (x) are affine functions of the decision vector x taking
values in the space Sn of symmetric n × n matrices, and ξi are independent of
each other’s random perturbations. ξi , i = 1, . . . , N are random real perturbations
which we assume to be independent with zero means “of order of 1” and with “light
tails”—we will make the two assumptions precise below.
Here we are interested in the sufficient conditions for the decision vector x such
that the random perturbed LMI (11.17) holds true with probability ≥ 1 − , where
 << 1. Clearly, we have
A0 (x)  0.
To simplify, we consider the strengthened condition A0 (x) > 0. For such decision
vector x, letting
−1/2 −1/2
Bi (x) = A0 (x) Ai (x) A0 (x) ,
the question becomes to describe those x such that
+N ,
Prob ξi Bi (x)  0  1 − . (11.18)
i=1

Precise description seems to be completely intractable. The trick is to let the closed-
form probability inequalities for sums of random matrices “do the most of the job”!
What we are about to do are verifiable sufficient conditions for (11.18) to hold true.
Using Theorem 11.2.3, we immediately obtain the following sufficient condition:
Let n ≥ 2, perturbations ξi be independent with zero means such that
  
E exp ξi 2  exp (1) , i = 1, . . . , N.
Then the condition
N
2 1
−1/2 −1/2
A0 (x) > 0 & A0 (x) Ai (x) A0 (x)   
i=1
450 exp (1) ln 3ε (ln m)

(11.19)

is sufficient for (11.17) to be valid with probability ≥ 1 − .


Although (11.19) is verifiable, it in general, defines a nonconvex set in the space
of decision variables x. The “problematic” part of the condition is the inequality
11.3 Applications of Sums of Random Matrices 537

N / /2
/ −1/2 −1/2 /
/ A0 (x) Ai (x) A0 (x)/  τ (11.20)
i=1

on x, τ . Equation (11.20) can be represented by the system of convex inequalities

N
1
−A0 (x)  μi Ai (x)  A0 (x) , μi > 0, i = 1, . . . , N,  τ.
i=1
μ2i

Consider another “good” case when A0 (x) ≡ A is constant. Equation (11.20)


can be represented by the system of convex constraints

N
−νi A  Ai (x)  νi A, i = 1, . . . , N, νi 2  τ
i=1

in variables x, νi , τ.
Using Theorem 11.2.3, we arrive at the following sufficient conditions: Let
perturbations ξi be independent with zero mean and zero third moments such that
|ξi |  1, i = 1, . . . , N, or such that ξi ∼ N (0, 1), i = 1, . . . , N . Let, further,
 ∈ (0, 1) be such that one of the following two conditions is satisfied
 
5 49m1/2
(a) ln 
4 32
  (11.21)
22 49m1/3
(b) ln 
 32

Then the condition



N
2 ⎨ 1
, case (a) of (11.21)
−1/2 −1/2 32 ln( 45 )
A0 (x) > 0 & A0 (x) Ai (x) A0 (x)  1
⎩ , case(b) of (11.21)
i=1 32 ln( 22 )
(11.22)
is sufficient for (11.17) to be valid with probability ≥ 1 − .
In contrast, (11.19) and (11.22) defines a convex domain in the space of design
variables. Indeed, (11.22) is of the form
  N
Yi Ai (x)
A0 (x) > 0 &  0, i = 1, . . . , N, and Yi c()A0 (x)
Ai (x) A0 (x) i=1
(11.23)
in variables x, Yi . &
%
Example 11.3.2 (Safe tractable approximation—So [122]). We demonstrate how
Theorem 11.2.6 can be used. Let us consider a so-called safe tractable approxi-
mation of the following probability constrained optimization problem:
538 11 Probability Constrained Optimization

minimize cT x
0
subject to F(x)  

N (11.24)
Prob A0 (x) − ξi Ai (x)  0  1 − , (†)
i=1
x ∈ Rn .

Here, c∈ Rn is a given objective vector; F : Rn → Rl is an efficiently computable


vector-valued function with convex components; A0 , . . . , AN : Rn → ϕm are
affine functions in x with A0 (x) > 0 for all x ∈ Rn ;ξ1 , . . . , ξN are independent
(but not necessarily identically distributed) mean zero random variables;  ∈ (0, 1)
is the error tolerance parameter. We assume m ≥ 2 so that (11.24) is indeed a
probability constraint linear matrix inequality.
Observe that
N
 N

Prob A0 (x) − ξi Ai (x)  0 = Prob ξi Ãi (x)  I , (11.25)
i=1 i=1

where
−1/2 −1/2
Ãi (x) = A0 (x) Ai (x) A0 (x) .

Now suppose that we can choose γ = γ () > 0 such that whenever

N
Ã2i (x)  γ 2 I (11.26)
i=1

holds, the constraint condition (†) in (11.24) is satisfied. Then, we say that (11.26)
is a sufficient condition for (†) to hold. Using the Schur complement [560], (11.26)
can be rewritten as a linear matrix inequality
⎡ ⎤
γA0 (x) A1 (x) · · · AN (x)
⎢ A1 (x) γA0 (x) ⎥
⎢ ⎥
⎢ .. . ⎥  0. (11.27)
⎣ . . . ⎦
AN (x) γA0 (x)

Thus, by replacing the constraint condition (†) with (11.27), the original prob-
lem (11.24) is tractable. Moreover, any solution x ∈ Rn that satisfies F(x)  0
and (11.27) will be feasible for the original probability constrained problem
of (11.24).
Now we are in a position to demonstrate how Theorem 11.2.6 can be used
for problem solving. If the random variables ξ1 , . . . , ξN satisfy the conditions
11.3 Applications of Sums of Random Matrices 539

of Theorem 11.2.6, and at the same time, if (11.26) holds for γ  γ () ≡
 −1
8e ln (m/) , then for any  ∈ (0, 1/2], it follows that
 / / 
N / N /
/ /
Prob ξi Ãi (x)  I = Prob / ξi Ãi (x)/ 1 > 1 − . (11.28)
/ /
i=1 i=1 ∞

We also observe that


/ /  /  / 
/N / / N /
/ / / 1 / 1
Prob / ξi Ãi (x)/ >1 = Prob / ξi Ãi (x) / > .
/ / / γ / γ
i=1 ∞ i=1 ∞

N 
 2
Since 1
γ Ãi (x)  I, by Theorem 2.17.4 and Markov’s inequality (see the
i=1
above proof of Theorem 11.2.6), we have
/  / 
/ N /  
/ 1 / 1
Prob / ξi Ãi (x) / >  m · exp −1/ 8eγ 2  .
/ γ / γ
i=1 ∞

This gives (11.28). Finally, using (11.25), we have the following theorem.
Theorem 11.3.3. Let ξ1 , . . . , ξN be independent mean zero random variables, each
of which is either (i) supported on [-1,1], or Gaussian with variance one. Consider
the probability constrained problem (11.24). Then, for any  ∈ (0, 1/2], the positive
 −1
semi-definite constraint with γ  γ () ≡ 8e ln (m/) is a safe tractable
approximation of (11.24).
This
 theorem improves upon Nemirovski’s result in [560], which requires γ =
1/6

O m + ln (1/) before one could claim that the constraint is a safe tractable
approximation of (11.24). &
%
Example 11.3.4 (Nonconvex quadratic optimization under orthogonality con-
straints [560]—continued). Consider the following optimization problem
⎧ ⎫

⎪ X, BX  1 (a) ⎪

⎨ ⎬
X, Bl X  1, l = 1, . . . , L (b)
P = max X, AX : , (11.29)
X∈Mm×n ⎪
⎪ CX = 0, (c) ⎪

⎩ ⎭
X  1. (d)

where
• Mm×n is the space
 of m × n matrices equipped with the Frobenius inner product
X, Y = Tr XYT , and X = max { XY 2 : Y  1} is, as always, the
Y
spectral norm of X ∈ Mm×n ,
• The mappings A, B, B l are symmetric linear mappings from Mm×n into Mm×n ,
• B is positive semidefinite,
540 11 Probability Constrained Optimization

• Bl , l = 1, . . . , L are positive semidefinite,


• C is a linear mapping from Mm×n into RM .
Equation (11.29) covers a number of problems of quadratic optimization under
orthogonal constraints, including the Procrustes problem. For details, we refer
to [560].
We must exploit the rich structure of (11.29). The homogeneous linear con-
straints (c) in (11.29) imply that X is a block-diagonal matrix
⎡ ⎤
X1 0 0
⎢ ⎥
⎣ 0 ... 0 ⎦
0 0 XK
with mk × nk diagonal blocks Xk , k = 1, . . . , K.
Let us consider a semidefinite relaxation of Problem (11.29), which in general,
is NP-hard. Problem (11.29), however, admits a straightforward semidefinite relax-
ation as follows—following [560]. The linear mapping A in Problem (11.29) can be
identified with a symmetric mn × mn matrix A = [Aij,kl ] with rows and columns
indexed by pairs (i, j), 1 ≤ i ≤ m, 1 ≤ j ≤ n satisfying the relation
m n
[AX]ij = Aij,kl · xkl .
k=1 l=1

Similarly, B, B l can be identified with symmetric positive semidefinite mn × mn


matrix B, Bl , with B of rank 1. Finally, C can be identified with a M × mn matrix
C = [C]μ,ij :
m n
(CX)μ,ij = Cμ,ij · xij . (11.30)
i=1 j=1

Let Smn stand for the mn × mn symmetric matrix, and Smn


+ stand for the mn ×
mn positive semidefinite matrix respectively. For X ∈ Mm×n , let vec (X) be the
mn-dimensional vector obtained from the matrix X by arranging its columns into
a single column, and let X (X) ∈ Smn T
+ be the matrix vec (X) vec (X) , that is the
mn × mn matrix [xij xkl ]

X (X) = vec (X) vecT (X) .

Observe that

X (X)  0,

m 
n
and that cij · xij = 0 if and only if
i=1 j=1
11.3 Applications of Sums of Random Matrices 541

⎛ ⎞2
m n m n m n
0=⎝ cij · xij ⎠ ≡ cij · xij · ckl · xkl = Tr (X (C) X (X)) .
i=1 j=1 i=1 j=1 k=1 l=1

Further, we have that


m n m n
X, AX = Aij,kl · xij · xkl = Tr (AX (X)) ,
i=1 j=1 k=1 l=1

and similarly

X, BX = Tr (BX (X)) , X, Bl X = Tr (Bl X (X)) .

Finally, X  1 if and only if XXT  Im , in other words,

X  1 ⇔ XXT  Im .

Since the entries in the matrix product XXT are linear combinations of the entries
in X (X), we have

XXT  Im ⇔ S (X (X))  Im ,

where S is an appropriate linear mapping from Smn to Sm . Similarly, X  1 if


and only if XT X  In , which is a linear restriction on X (X) :

XXT  In ⇔ T (X (X))  In ,

where T is an appropriate linear mapping from Smn to Sn .


With the above observations, we can rewrite (11.29) as
⎧ ⎫

⎪ Tr (BX (X))  1 (a) ⎪

⎨ ⎬
Tr (Bl X (X))  1, l = 1, . . . , L (b)
max Tr (AX (X)) : ,
X∈Mm×n ⎪
⎪ Tr (Cμ X (X)) = 0, μ = 1, . . . , M (c) (c) ⎪

⎩ ⎭
S (X (X))  Im , T (X (X))  In . (d)

where Cμ ∈ Smn
+ is given by

μ
Cij,kl = Cμ,kl · Cμ,kl .

Since X (X)  0 for all X, the problem


⎧ ⎫

⎪ Tr (BX )  1 (a) ⎪


⎪ ⎪


⎨ Tr (Bl X )  1, l = 1, . . . , L (b) ⎪

SDP = max Tr (AX ) : Tr (Cμ X ) = 0, μ = 1, . . . , M (c) (11.31)
X∈Smn ⎪
⎪ ⎪

+

⎪ S (X )  Im , T (X (X))  In (d) ⎪


⎩ ⎪

X 0 (e)
542 11 Probability Constrained Optimization

is a relaxation of (11.29), so that Opt (P )  Opt (SDP ). Equation (11.31) is a


semidefinite problem and as such is computationally tractable.
The accuracy of the SDP (11.31) is given by the following result [560]: There
exists X̃ ∈ Mm×n such that
D E D E
(∗) X̃, AX̃ = Opt (SDP ) (a) X̃, B X̃  1
/ / (11.32)
/ /
(b) X, Bl X  Ω2 , l = 1, . . . , L (c)C X̃ = 0, (d) /X̃/  Ω

where
 
 
Ω = max max μk + 32 ln (132K), 32 ln (12 (L + 1)) ,
1kK
  
1/6
μk = min 7(mk + nk ) , 4 min (mk , nk ) .

In particular, one has both the lower bound and the upper bound

Opt (P )  Opt (SDP )  Ω2 Opt (P ) . (11.33)

The numerical simulations are given by Nemirovski [560]: The SDP solver mincx
(LMI Toolbox for MATLAB) was used, which means at most 1,000–1,200 free
entries in the decision matrix X (X). &
%
We see So [122] for data-driven distributionally robust stochastic programming.
Convex approximations of chance constrained programs are studied in Shapiro and
Nemirovski [559].

11.4 Chance-Constrained Linear Matrix Inequalities

Janson [561] extends a method by Hoeffding to obtain strong large deviation


bounds for sums of dependent random variables with suitable dependency structure.
So [122] and Cheung, So, and Wang [562]. The results of [562] generalizes the
works of [122, 560, 563], which only deal with the case of where ξ1 , . . . , ξm are
independent.
The starting point is that for x ∈ Rn

N
F (x, ξ) = A0 (x) + ξi Ai (x) (11.34)
i=1

where A0 (x) , A1 (x) , · · · , AN (x) : Rn → S d are affine functions that take


values in the space S d of d × d real symmetric matrices. The key idea is to construct
safe tractable approximations of the chance constraint
11.5 Probabilistically Constrained Optimization Problem 543

N

P A0 (x) + ξi Ai (x)  0
ξ
i=1


N
by means of the sums of random matrices ξi Ai (x). By concentration of
i=1
measure, by using sums of random matrices, Cheung, So, and Wang [562] arrive
at a safe tractable approximation of chance-constrained, quadratically perturbed,
linear matrix inequalities
⎛ ⎞
N
P ⎝A0 (x) + ξi Ai (x) + ξj ξk Bjk  0⎠  1 − , (11.35)
ξ
i=1 1jkN

where Bjk : Rn → S d are affine functions for 1  i  N and 1  j 


k  N and ξ1 , . . . , ξN are i.i.d. real-valued mean-zero random variables with light
tails. Dependable perturbations are allowed. Some dependence among the random
variables ξ1 , . . . , ξN are allowed. Also, (11.35) does not assume precise knowledge
of the covariance matrix.

11.5 Probabilistically Constrained Optimization Problem

We follow [236] closely for our exposition. The motivation is to demonstrate the
use of concentration of measure in a probabilistically constrained optimization
problem. Let || · || and || · ||F represent the vector Euclidean norm and matrix
norm, respectively. We write x ∼ CN (0, C) if x is a zero-mean, circular symmetric
complex Gassian random vector with covarance matrix C ≥ 0.
Consider so-called multiuser multiple input single output (MISO) problem,
where the base station, or the transmitter, sends parallel data streams to multiple
users over the sample fading channel. The transmission is unicast, i.e., each data
stream is exclusively sent for one user. The base station has Nt transmit antennas
and the signaling strategy is beamforming. Let x(t) ∈ CNt denote the multi-antenna
transmit signal vector of the base station at time t. We have that
K
x(t) = wk sk (t), (11.36)
k=1

where wk ∈ CNt is the transmit beamforming vector for user k, K is the number
of users, and sk (t) is the
 data stream
 of user k, which is assumed to have zero-
2
mean and unit power E |sk (t)| = 1. It is also assumed that sk (t) is statistically
independent of one another. For user i, the received signal is

yi (t) = hH
i x(t) + ni (t), (11.37)
544 11 Probability Constrained Optimization

where hi ∈ CNt is the channel gain from the base station to user i, and ni (t) is an
additive noise, which is assumed to have zero mean and variance σ 2 > 0.
A common assumption in transmit beamforming is that the base station has
perfect knowledge of h1 , . . . hK , the so-called perfect channel state information
(CSI). Here, we assume the CSI is not perfect and modeled as

hi = h̄i + ei , i = 1, . . . , K,

where hi is the actual channel, h̄i ∈ CNt is the presumed channel at the base station,
and ei ∈ CNt is the respective error that is assumed to be random. We model the
error vector using complex Gaussian CSI errors,

ei ∼ CN (0, Ci ) (11.38)

for some known error covariance Ci  0, i = 1, .., K.


The SINR of user i, i = 1, . . . , K is defined as
 H 2
 h wi 
SINRi =   i 2 .
 h H wk  + σ 2
i i
k=i

The goal here is to design beamforming vectors w1 , . . . , wK ∈ CNt such that


the qualify of service (QoS) of each user satisfies a prescribed set of requirements
under imperfect CSI of (11.38), while using the least amount of power to do so.
Probabilistic SINR constrained problem: Given minimum SINR requirements
γ1 , . . . , γK > 0 and maximum tolerable outage probabilities ρ1 , . . . , ρK ∈ (0, 1],
solve
K
2
minimize wi (11.39)
w1 ,...,wK ∈CNt
i=1

subject to P (SINRi  γi )  1 − ρi , i = 1, . . . , K. (11.40)

This is a chance-constrained optimization problem due to the presence of the


probabilistic constrain (11.40).
Following [236], we introduce a novel relaxation-restriction approach in two
steps: relaxation step and restriction step. First, we present the relaxation step.
The motivation is that for each i, the inequality SINRi  γi is nonconvex in
w1 , . . . , wK ; specifically, it is indefinite quadratic. This issue can be handled by
semidefinite relaxation [564, 565]. Equation (11.40) is equivalently represented by
11.5 Probabilistically Constrained Optimization Problem 545


K
minimize Tr(Wi )
W1 ,...,WK ∈HNt )
i=1 ) * *
 H 1  
subject to P h̄i +ei γi
Wi − Wk h̄i +ei  σi2  1−ρi , i = 1, . . . , K,
k=i
W1 , . . . , W K  0
rank (Wi ) = 1, i = 1, . . . , K,
(11.41)
where the connection between (11.40) and (11.41) lies in the feasible point
equivalence

Wi = wi wiH , i = 1, . . . , K.

The semidefinite relaxation (11.41) works by removing the nonconvex rank-one


constraints on Wi , i.e., to consider the relaxed problem


K
minimize Tr(Wi )
W1 ,...,WK ∈HNti=1
) ) * *
 H 1  
subject to P h̄i +ei γi
Wi − Wk h̄i +ei  σi2  1−ρi , i = 1, . . . , K,
k=i
W1 , . . . , WK  0.
(11.42)

where HNt denotes Hermitian matrix with size Nt by Nt . The benefit of this
relaxation is that the inequalities inside the probability functions in (11.42) are
linear in W1 , . . . , WK , which makes the probabilistic constraints in (11.42) more
tractable. An issue that comes from the semidefinite relaxation is the solution
rank: the removal of the rank constraints rank (Wi ) = 1 means that the solution
(W1 , . . . , WK ) to problem (11.42) may have rank higher than one. A stan-
dard way of tacking this is to apply some rank-one approximation procedure
to (W1 , . . . , WK ) to generate a feasible beamforming solution (w1 , . . . , wK )
to (11.39). See [565] for a review and references. An algorithm is given in [236],
which in turn follows the spirit of [566].
Let us present the second step: restriction. The relaxation step alone above does
not provide a convex approximation of the main problem (11.39). The semidefinite
relation probabilistic constraints (11.42) remain intractable. This is the moment
that the Bernstein-type concentration inequalities play a central role. Concentration
of measure lies in the central stage behind the Bernstein-type concentration
inequalities. The Bernstein-type inequality (4.56) is used here. Let z = e and y = r
√ 2 2
in Theorem 4.15.7. Denote T (t) = Tr (Q) − 2t Q F + 2 y − tλ+ (Q).
We reformulate the challenge as the following: Consider the chance constraint
 
P eH Qe + 2 Re eH r + s  0  1 − ρ, (11.43)
546 11 Probability Constrained Optimization

where e is a standard complex Gaussian random vector, i.e., zi ∼ CN (0, In ) , the


3-tuple (Q, r, s) ∈ Hn×n ×Cn ×R is a set of (deterministic) optimization variables,
and ρ ∈ (0, 1] is fixed. Find an efficiently computable convex restriction of (11.43).
Here Hn×n stands for the set of Hermitian matrices of n × n.
Indeed, for each constraint in (11.42), the following correspondence to (11.43)
can be shown for i = 1, . . . , K
⎛ ⎞ ⎛ ⎞
1 1
Q = Ci ⎝ W i − W k ⎠ C i , r = Ci ⎝ W i − Wk ⎠ h̄i ,
1/2 1/2 1/2
γi γi
k=i k=i
⎛ ⎞
⎝ 1
s = h̄H
i Wi − Wk ⎠ h̄i − σi2 , ρ = ρi .
γi
k=i
(11.44)
Using the Bernstein-type inequality, we can use closed-form upper bounds on
the violation probability to construct an efficiently computable convex function
f (Q, r, s, u) , where u is an extra optimization variable, such that
 
P eH Qe + 2 Re eH r + s  0  f (Q, r, s, u) . (11.45)
Then, the constraint
f (Q, r, s, u) ≤ ρ (11.46)
is, by construction, a convex restriction of (11.43).
Since T (t) is monotonically decreasing, its inverse mapping is well defined. In
particular, the Bernstein-type inequality (4.56) can be expressed as
  −1
P eH Qe + 2 Re eH r + s  0  1 − e−T (−s) .
−1
As discussed in (11.45) and (11.46), the constraint 1 − e−T (−s) ≤ ρ, or
equivalently,
 2
2
Tr (Q) − −2 ln (ρ) Q F + 2 y + ln (ρ) · λ+ (Q) + s  0 (11.47)
serves as a sufficient condition for achieving (11.43).
At this point, it is not obvious whether or not (11.47) is convex in (Q, r, s),
but we observe that (11.47) is equivalent to the following system of convex conic
inequalities

Tr (Q) − −2 ln (ρ) · u1 + ln (ρ) · u2 + s  0,
2
2
Q F + 2 y  u1 ,
(11.48)
u2 In + Q  0,
u2  0,
where u1 , u2 ∈ R are slack variables. Therefore, (11.48) is an efficiently com-
putable convex restriction of (11.43).
11.6 Probabilistically Secured Joint Amplify-and-Forward Relay by Cooperative Jamming 547

Let us introduce another method: decomposition into independent random


variables. The resulting formulation is solved more efficiently than the Bernstein-
type inequality method (11.48). The idea is to first decompose the expression
eH Qe + 2 Re eH r + s into several parts, each of which is a sum of independent
random variables. Then, one obtains a closed-form upper bound on the violation
probability [562]. Let us illustrate this approach briefly, following [236]. Let Q =
UΛVH be the spectral decomposition of Q, where Λ = diag (λ1 , . . . , λn ) and
λ1 , . . . , λn are the eigenvalues of Q. Since ei ∼ CN (0, In ) , and UH is unitary,
we have ẽ = UH e ∼ CN (0, In ) . As a result, we have that
 
ψ = eH Qe + 2 Re eH r = ẽH Λẽ + 2 Re eH r = ψq + ψl .

Now, let us decompose the ẽH Λẽ+2 Re eH r into the sum of independent random
variables. We have that
n
  n
ψq = ẽH Λẽ = λi |ei |2 , ψl =2 Re eH r = 2 (Re {ri } Re {ei } + Im {ri } Im {ei }).
i=1 i=1

The advantage of the above decomposition approach is that the distribution of the
random vector e may be √
non-Gaussian.
√ n In particular, e ∈ Rn is a zero-mean random
vector supported on − 3, 3 with independent components. For details, we
refer to [236, 562].

11.6 Probabilistically Secured Joint Amplify-and-Forward


Relay by Cooperative Jamming

11.6.1 Introduction

This section follows [567] and deals with probabilistically secured joint amplify-
and-forward (AF) relay by cooperative jamming. AF relay with N relay nodes is
the simple relay strategy for cooperative communication to extend communication
range and improve communication quality. Due to the broadcast nature of wireless
communication, the transmitted signal is easily intercepted. Hence, communication
security is another important performance measure which should be enhanced.
Artificial noise and cooperative jamming have been used for physical layer security
[568, 569].
All relay nodes perform forwarding and jamming at the same time cooperatively.
A numerical approach based on optimization theory is proposed to obtain forward-
ing complex weights and the general-rank cooperative jamming complex weight
matrix simultaneously. SDR is the core of the numerical approach. Cooperative
jamming with the general-rank complex weight matrix removes the non-convex
548 11 Probability Constrained Optimization

rank-1 matrix constraint in SDR. Meanwhile, the general-rank cooperative jamming


complex weight matrix can also facilitate the achievement of rank-1 matrix solution
in SDR for forwarding complex weights.
In this section, physical layer security is considered in a probabilistic fashion.
Specifically speaking, the probability that an eavesdropper’s SINR is greater than
or equal to its targeted SINR should be equal to the pre-determined violation
probability. In order to achieve this goal, probabilistic-based optimization is applied.
Probabilistic-based optimization is more flexible than the well-studied robust
optimization and stochastic optimization. However, the flexibility of probabilistic-
based optimization will bring challenges to the corresponding solver. Advanced
statistical signal processing and probability theory are needed to derive the safe
tractable approximation algorithm.
In this way, the optimization problem in a probabilistic fashion can be converted
to or approximated to the correspondingly deterministic optimization problem.
“Safe” means approximation will not violate the probabilistic constraints and
“tractable” means the deterministic optimization problem is convex or solvable.
There are several approximation strategies mentioned in [236, 570], e.g., Bernstein-
type inequality [235] and moment inequalities for sums of random matrices [122].
Bernstein-type inequality will be exploited explicitly here.

11.6.2 System Model

A two-hop half-duplex AF relay network is considered. A source Alice would like


to send information to the destination Bob through N relay nodes. Meanwhile, there
is an eavesdropper Eve to intercept wireless signal. All the nodes in the network are
only equipped with a single antenna. Alice, Bob, and relay nodes are assumed to be
synchronized. Perfect CSIs between Alice and relay nodes as well as between relay
nodes and Bob are known. However, CSIs related to Eve are partially known.
There is a two-hop transmission between Alice and Bob. In the first hop, Alice
transmits information I. In order to interfere with Eve, Bob generates artificial noise
J. I and J are assumed to be independent real Gaussian random variables with zero
mean and unit variance. The function diagram of the first hop is shown in Fig. 11.1.
For the nth relay node, the received signal plus artificial noise is,

yrn 1 = hsrn 1 gs1 I + hdrn 1 gd1 J + wrn 1 , n = 1, 2, . . . , N (11.49)

where gs1 is the transmitted complex weight for Alice and gd1 is the transmitted
complex weight for Bob; hsrn 1 is CSI between Alice and the nth relay node; hdrn 1
is CSI between Bob and the nth relay node; wrn 1 is the background Gaussian noise
with zero mean and σr2n 1 variance for the nth relay node. Similarly, the received
signal plus artificial noise for Eve is,

ye1 = hse1 gs1 I + hde1 gd1 J + we1 (11.50)


11.6 Probabilistically Secured Joint Amplify-and-Forward Relay by Cooperative Jamming 549

Fig. 11.1 The function


diagram of the first hop
Relay 1

Relay 2
Information
Signal Articial Noise

Alice
Bob

Relay N

Eve

Fig. 11.2 The function Relay 1


diagram of the second hop
Bob

Relay 2
Eve
Information Signal plus
Artificial Noise plus
Cooperative Jamming

Relay N

where hse1 is CSI between Alice and Eve; hde1 is CSI between Bob and Eve; we1
2
is the background Gaussian noise with zero mean and σe1 variance for Eve. Thus,
SINR for Eve in the first hop is,
2
|hse1 gs1 |
SINRe1 = 2 (11.51)
|hde1 gd1 | + σe1
2

In the second hop, N relay nodes perform joint AF relay and cooperative
jamming simultaneously. The function diagram of the second hop is shown in
Fig. 11.2.
The transmitted signal plus cooperative jamming for the nth relay node is

s rn 2 = g rn 2 y rn 1 + J rn 2 (11.52)
550 11 Probability Constrained Optimization

where grn 2 is the forwarding complex weight for the nth relay node and
Jrn 2 , n = 1, 2, . . . , N constitutes cooperative jamming jr2 ,
⎡ ⎤
J r1 2
⎢ Jr 2 ⎥
⎢ 2 ⎥
jr2 =⎢ . ⎥ (11.53)
⎣ .. ⎦
J rN 2

and jr2 is defined as,

jr2 = Uz (11.54)

where U ∈ C N ×r , r ≤ N is cooperative jamming complex weight matrix and z,


which follows N (0, I), is r-dimensional artificial noise vector. Hence, the trans-
2 2
mitted power needed
0 for the1 nth relay node is |grn 2 hsrn 1 gs1 | + |grn 2 hdrn 1 gd1 | +
2 2
|grn 2 | σr2n 1 + E |Jrn 2 | where E {·} denotes expectation operator and
0 1    
2
E |jr2 | = diag UE zzH UH (11.55)
 
= diag UUH (11.56)

where (·)H denotes Hermitian operator and diag {·} returns the main diagonal of
matrix or puts vector on the main diagonal of matrix.
The received signal plus artificial noise by Bob is,
N
yd2 = hrn d2 (grn 2 yrn 1 + Jrn 2 ) + wd2 (11.57)
n=1

where hrn d2 is CSI between the nth relay node and Bob; wd2 is the background
2
Gaussian noise with zero mean and σd2 variance for Bob. Similarly, the received
signal plus artificial noise for Eve in the second hop is,
N
ye2 = hrn e2 (grn 2 yrn 1 + Jrn 2 ) + we2 (11.58)
n=1

where hrn e2 is CSI between the nth relay node and Eve; we2 is the background
2
Gaussian noise with zero mean and σe2 variance for Eve.
In the second hop, SINR for Bob is,
 2
N 
 h g h g 
 rn d2 rn 2 srn 1 s1 
SINRd2 =  2
n=1
 2 
 N  N  N 
 h g h g  + |h g |2 2
σ +E  h J  2
+σd2
 r n d2 r n 2 dr n 1 d1  r n d2 r n 2 rn 1  r n d2 r n 
2
n=1 n=1 n=1

(11.59)
11.6 Probabilistically Secured Joint Amplify-and-Forward Relay by Cooperative Jamming 551

Due to the known information about J, Bob can cancel artificial noise generated
 2
 N 

by itself. Thus,  hrn d2 grn 2 hdrn 1 gd1  can be removed from the denominator in
n=1
Eq. (11.59) which means SINR for Bob is
 2
 N 
 h g h g 
 rn d2 rn 2 srn 1 s1 
SINRd2 = n=1
+ 2 , (11.60)

N N 
2 2
|hrn d2 grn 2 | σrn 1 + E   hrn d2 Jrn 2  + σd2
2
n=1 n=1

SINR for Eve is,


 N 2
 
 hrn e2 grn 2 hsrn 1 gs1 

n=1
SINRe2 =  2  2 
   N 
 +  |hr e2 gr 2 |2 σ 2 +E   hr e2 Jr 2 
N N
 h g h g 2
+σe2
 r n e2 r n 2 dr n 1 d1  n n rn 1  n n 
n=1 n=1 n=1

(11.61)

Here, we assume there is no memory for Eve to store the received data in the
first hop. Eve cannot do any type of combinations for the received data to decode
information.
Let,
 
hrd2 = hr1 d2 hr2 d2 · · · hrN d2 (11.62)

Then,
⎧ 2 ⎫
⎨ N  ⎬  
 
E  hrn d2 Jrn 2  = hrd2 UE zzH UH hH (11.63)
⎩  ⎭ rd2
n=1

= hrd2 UUH hH
rd2 (11.64)

Similarly, let,
 
hre2 = hr1 e2 hr2 e2 · · · hrN e2 (11.65)

Then,
⎧ 2 ⎫
⎨ N  ⎬
 
E  hrn e2 Jrn 2  = hre2 UUH hH (11.66)
⎩  ⎭ re2
n=1

Based on the previous assumption, hsrn 1 , hdrn 1 , and hrn d2 are perfect known.
While, hse1 , hde1 , and hrn e2 are partially known. Without loss of generality, hse1 ,
hde1 , and hrn e2 all follow independent complex Gaussian distribution with zero
552 11 Probability Constrained Optimization

mean and unit variance (zero mean and 12 variance for both real and imaginary
parts). Due to the randomness of hse1 , hde1 , and hrn e2 , SINRe1 and SINRe2 are
also random variables.
Joint AF relay and cooperative jamming with probabilistic security consideration
would like to solve the following optimization problems,

find
gs1 , gd1 , grn 2 , n = 1, 2, . . . , N, U
subject to
9 :
|grn 2 hsrn 1 gs1 |2 +|grn 2 hdrn 1 gd1 |2 +|grn 2 |2 σr2n 1 +E |Jrn 2 |2 ≤ Prn 2 , n = 1, 2, . . . , N
Pr (SINRe1 ≥ γe ) ≤ δe
Pr (SINRe2 ≥ γe ) ≤ δe
SINRd2 ≥ γd
(11.67)
where Prn 2 is the individual power constraint for the nth relay node; γe is the
targeted SINR for Eve and δe is its violation probability; γd is the targeted SINR
for Bob.

11.6.3 Proposed Approach

In order to make the optimization problem (11.67) solvable, the optimization


problem in a probabilistic fashion need be converted to or approximated to the
correspondingly deterministic optimization problem by the safe tractable approx-
imation approach.
2 2
For SINRe1 , |hse1 gs1 | and |hde1 gd1 | are independent exponentially distributed
2 2
random variables with means |gs1 | and |gd1 | . Pr (SINRe1 ≥ γe ) ≤ δe is equal
to [571],

2 2
−γe σe1
|gs1 |
e |gs1 |2
2 2 ≤ δe (11.68)
|gs1 | + γe |gd1 |

2
From inequality (11.68), given |gs1 | , we can easily get,
 2
−γe σe1

2
e |gs1 |2
− δe |gs1 |
2
|gd1 | ≥ (11.69)
γ e δe

In this section, we are more interested in equally probabilistic constraints instead


of too conservative or robust performance. In other words, Pr (SINRe1 ≥ γe ) = δe
2
and Pr (SINRe2 ≥ γe ) = δe are applied. Hence, |gd1 | is obtained by equality of
inequality (11.69) to ensure the probabilistic security in the first hop with minimum
power needed for Bob.
For SINRe2 , Bernstein-type inequality is explored [235, 236, 572].
11.6 Probabilistically Secured Joint Amplify-and-Forward Relay by Cooperative Jamming 553

Let
⎡ ⎤
g r1 2
⎢ gr 2 ⎥
⎢ 2 ⎥
gr2 = ⎢ . ⎥, (11.70)
⎣ .. ⎦
g rN 2

 
hsr1 = hsr1 1 hsr2 1 · · · hsrN 1 (11.71)

and
 
hdr1 = hdr1 1 hdr2 1 · · · hdrN 1 (11.72)

Define Hsr1 = diag {hsr1 }, Hdr1 = diag {hdr1 } and


⎡ 2 ⎤
σ r1 1 0 · · · 0
⎢ 0 σ2 · · · 0 ⎥
2 ⎢ r2 1 ⎥
σr1 =⎢ . .. .. .. ⎥ (11.73)
⎣ .. . . . ⎦
0 0 · · · σr2N 1
Based on SDR, define,
H
X = gr2 gr2 (11.74)

where X should be rank-1 semidefinite matrix and

Y = UUH (11.75)

where Y is the semidefinite matrix and the rank of Y should be equal to or smaller
than r.
Pr (SINRe2 ≥ γe ) ≤ δe can be simplified as,

Pr hre2 QhH re2 ≥ σe2 γe ≤ δe
2
(11.76)

where Q is equal to
2 2
Q = Hsr1 XHH sr1 |gs1 | − γe Hdr1 XHdr1 |gd1 | − γe diag {diag {X}} σr2 − γe Y
H 2

(11.77)
Then probabilistic constrain in (11.76) can be approximated as,
0.5
trace (Q) + (−2log (δe )) a − blog (δe ) − σe2
2
γe ≤ 0
Q F ≤a
(11.78)
bI − Q ≥ 0
b≥0

where trace(·) returns the sum of the diagonal elements of matrix; · F return
Frobenius norm of matrix.
554 11 Probability Constrained Optimization

Based on the definition of X and Y, the individual power constraint for the nth
relay node in the optimization problem (11.67) can be rewritten as,
 
2 2
(X)n,n |hsrn 1 gs1 | + |hdrn 1 gd1 | + σr2n 1 + (Y)n,n ≤ Prn 2 , n = 1, 2, . . . , N
(11.79)
where (·)i,j returns the entry of matrix with the ith row and jth column.
SINRd2 constraint SINRd2 ≥ γd in the optimization problem (11.67) can be
reformulated as,
     
trace aH aX ≥ γd trace HH 2 H
rd2 Hrd2 σr1 X + trace hrd2 hrd2 Y + σd2
2

(11.80)
where

Hrd2 = diag {hrd2 } (11.81)

and

a = (diag {Hsr1 Hrd2 gs1 })T (11.82)

where (·)T denotes transpose operator


In this section, we mainly focus on the joint optimization of $g_{r_n2}$, $n = 1, 2, \ldots, N$, and $\mathbf{U}$, i.e., $\mathbf{X}$ and $\mathbf{Y}$, in the second hop, based on the given $|g_{s1}|^2$ and the calculated $|g_{d1}|^2$.
In order to minimize the total power needed by $N$ relay nodes, the optimization problem (11.67) can be approximated as,
$$
\begin{aligned}
\text{minimize} \quad & \sum_{n=1}^{N}\left( (\mathbf{X})_{n,n}\left( |h_{sr_n1} g_{s1}|^2 + |h_{dr_n1} g_{d1}|^2 + \sigma_{r_n1}^2 \right) + (\mathbf{Y})_{n,n} \right) \\
\text{subject to} \quad & (\mathbf{X})_{n,n}\left( |h_{sr_n1} g_{s1}|^2 + |h_{dr_n1} g_{d1}|^2 + \sigma_{r_n1}^2 \right) + (\mathbf{Y})_{n,n} \le P_{r_n2},\quad n = 1, 2, \ldots, N \\
& \operatorname{trace}(\mathbf{Q}) + \left( -2\log(\delta_e) \right)^{0.5} a - b\log(\delta_e) - \sigma_{e2}^2\gamma_e \le 0 \\
& \|\mathbf{Q}\|_F \le a \\
& b\mathbf{I} - \mathbf{Q} \succeq 0 \\
& b \ge 0 \\
& \operatorname{trace}\left(\mathbf{a}^H\mathbf{a}\mathbf{X}\right) \ge \gamma_d \left( \operatorname{trace}\left(\mathbf{H}_{rd2}^H\mathbf{H}_{rd2}\boldsymbol{\sigma}_{r1}^2\mathbf{X}\right) + \operatorname{trace}\left(\mathbf{h}_{rd2}^H\mathbf{h}_{rd2}\mathbf{Y}\right) + \sigma_{d2}^2 \right) \\
& \mathbf{X} \succeq 0 \\
& \operatorname{rank}(\mathbf{X}) = 1 \\
& \mathbf{Y} \succeq 0
\end{aligned}
\tag{11.83}
$$
Due to the non-convex rank constraint, the optimization problem (11.83) is NP-hard. We have to remove the rank constraint, and the optimization problem (11.83) then becomes an SDP problem which can be solved efficiently. However,

the optimal solution to X cannot be guaranteed to be rank-one. Hence, the well-studied randomization procedure could be invoked. In this section, we instead propose to use a minimization procedure based on the sum of the smallest N − 1 eigenvalues (the truncated trace norm) to find a feasible and near-optimal rank-one solution for X [573]. The whole procedure is presented as Algorithm 1.
In Algorithm 1, δe is given, then
Algorithm 1
1. Solve the optimization problem (11.83) without the consideration of rank
constraint to get optimal solution X∗ , Y∗ , and minimum total power needed
P ∗ ; if X∗ is the rank-1 matrix, then Algorithm 1 goes to step 3; otherwise
Algorithm 1 goes to step 2;
2. Perform an eigen-decomposition of X∗ to obtain the dominant eigenvector x∗ associated with the maximum eigenvalue of X∗; solve the following optimization problem, which is also an SDP problem,

$$
\begin{aligned}
\text{minimize} \quad & \lambda\left( \operatorname{trace}(\mathbf{X}) - \operatorname{trace}\left(\mathbf{X}\mathbf{x}^*(\mathbf{x}^*)^H\right) \right) + \left( \sum_{n=1}^{N}\left( (\mathbf{X})_{n,n}\left( |h_{sr_n1} g_{s1}|^2 + |h_{dr_n1} g_{d1}|^2 + \sigma_{r_n1}^2 \right) + (\mathbf{Y})_{n,n} \right) - P^* \right) \\
\text{subject to} \quad & (\mathbf{X})_{n,n}\left( |h_{sr_n1} g_{s1}|^2 + |h_{dr_n1} g_{d1}|^2 + \sigma_{r_n1}^2 \right) + (\mathbf{Y})_{n,n} \le P_{r_n2},\quad n = 1, 2, \ldots, N \\
& \operatorname{trace}(\mathbf{Q}) + \left( -2\log(\delta_e) \right)^{0.5} a - b\log(\delta_e) - \sigma_{e2}^2\gamma_e \le 0 \\
& \|\mathbf{Q}\|_F \le a \\
& b\mathbf{I} - \mathbf{Q} \succeq 0 \\
& b \ge 0 \\
& \operatorname{trace}\left(\mathbf{a}^H\mathbf{a}\mathbf{X}\right) \ge \gamma_d \left( \operatorname{trace}\left(\mathbf{H}_{rd2}^H\mathbf{H}_{rd2}\boldsymbol{\sigma}_{r1}^2\mathbf{X}\right) + \operatorname{trace}\left(\mathbf{h}_{rd2}^H\mathbf{h}_{rd2}\mathbf{Y}\right) + \sigma_{d2}^2 \right) \\
& \mathbf{X} \succeq 0 \\
& \mathbf{Y} \succeq 0
\end{aligned}
\tag{11.84}
$$
where λ is a design parameter; the optimal solutions X∗ and Y∗ are then updated; if X∗ is a rank-one matrix, then Algorithm 1 goes to step 3; otherwise Algorithm 1 goes back to step 2;
3. Obtain the optimal solutions to $\mathbf{g}_{r2}$ and U by eigen-decompositions of X∗ and Y∗; Algorithm 1 is finished.
In the optimization problem (11.84), the minimization of $\operatorname{trace}(\mathbf{X}) - \operatorname{trace}\left(\mathbf{X}\mathbf{x}^*(\mathbf{x}^*)^H\right)$ tries to force $\mathbf{X}^*$ to be a rank-one matrix, and the minimization of $\left( \sum_{n=1}^{N}\left( (\mathbf{X})_{n,n}\left( |h_{sr_n1} g_{s1}|^2 + |h_{dr_n1} g_{d1}|^2 + \sigma_{r_n1}^2 \right) + (\mathbf{Y})_{n,n} \right) \right) - P^*$ tries to minimize the total transmitted power needed for the $N$ relay nodes.


As mentioned before, we are more interested in probabilistic constraints that are met with equality. If the targeted violation probability is set to $\delta_e^{\mathrm{targeted}}$, which is also used as the parameter $\delta_e$ in the optimization problem (11.83), the violation probability observed in reality, $\delta_e^{\mathrm{reality}}$, obtained by a statistical validation procedure, will be much smaller than $\delta_e^{\mathrm{targeted}}$. Based on Bernstein-type inequalities, $\delta_e^{\mathrm{reality}}$ is a non-decreasing function of $\delta_e$. Thus,

we propose to exploit a bisection search to find a suitable $\delta_e$ that makes $\delta_e^{\mathrm{reality}}$ equal to $\delta_e^{\mathrm{targeted}}$ [574, 575].
Overall, a novel numerical approach for joint AF relay and cooperative jamming with the consideration of probabilistic security is proposed as Algorithm 2. In Algorithm 2, $\delta_e^{l}$ is set to 0, $\delta_e^{u}$ is set to 1, and $\delta_e^{\mathrm{targeted}}$ is given; then
Algorithm 2
1. Set $\delta_e$ to $\delta_e^{\mathrm{targeted}}$;
2. Invoke Algorithm 1 to get the optimal solutions to $\mathbf{g}_{r2}$ and $\mathbf{U}$;
3. Perform a Monte Carlo simulation to get $\delta_e^{\mathrm{reality}}$; if $\delta_e^{\mathrm{reality}} \ge \delta_e^{\mathrm{targeted}}$, then $\delta_e^{u}$ is set to $\delta_e$; otherwise $\delta_e^{l}$ is set to $\delta_e$;
4. If $\delta_e^{u} - \delta_e^{l} \le \xi$, where $\xi$ is a design parameter, Algorithm 2 is finished; otherwise, $\delta_e$ is set to $(\delta_e^{l} + \delta_e^{u})/2$ and Algorithm 2 goes back to step 2. A sketch of this bisection loop is given below.
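A minimal sketch of the bisection loop in Algorithm 2 follows. The function solve_and_validate is a hypothetical stand-in for "invoke Algorithm 1 and estimate the realized violation probability by Monte Carlo simulation"; it is not part of any library.

# Sketch (ours) of the bisection search in Algorithm 2.
def bisection_on_delta(delta_targeted, solve_and_validate, xi=1e-3):
    delta_lo, delta_hi = 0.0, 1.0
    delta_e = delta_targeted                      # step 1: start from the targeted value
    while True:
        delta_reality = solve_and_validate(delta_e)   # steps 2-3 (hypothetical callback)
        if delta_reality >= delta_targeted:
            delta_hi = delta_e                    # realized violation too large: shrink delta_e
        else:
            delta_lo = delta_e                    # realized violation too small: grow delta_e
        if delta_hi - delta_lo <= xi:             # step 4: stop when the bracket is tight
            return delta_e
        delta_e = 0.5 * (delta_lo + delta_hi)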

11.6.4 Simulation Results

In the simulation, $N = 10$; $\sigma_{r_n1}^2 = 0.15$, $n = 1, 2, \ldots, N$; $\sigma_{e1}^2 = 0.15$; $\sigma_{e2}^2 = 0.15$; $\sigma_{d2}^2 = 0.15$; $P_{r_n2} = 2$, $n = 1, 2, \ldots, N$; $\gamma_e = 1$. The channels $h_{sr_n1}$, $h_{dr_n1}$, and $h_{r_nd2}$, $n = 1, 2, \ldots, N$, are randomly generated as complex zero-mean Gaussian random variables with unit variance. The CVX toolbox [488] is used to solve the presented SDPs.
In Fig. 11.3, we illustrate the relationship between total power needed by all relay
nodes and the targeted SINR γd required by Bob. Meanwhile, different violation
probabilities for Eve are considered. The smaller the pre-determined violation
probability for Eve, the more power needed for N relay nodes. In other words,
too conservative or robust performance for security requires a large amount of
total transmitted power. Hence, we should balance the communication security
requirement and total power budget through optimization theory.
In order to verify the correctness of the proposed numerical approach, 10,000 Monte Carlo simulations are run with randomly generated $h_{r_ne2}$, $n = 1, 2, \ldots, N$, with $\delta_e^{\mathrm{targeted}} = 0.08$ and $\gamma_d = 20$. The histogram of the received SINRs for Eve is shown in Fig. 11.4. Similarly, if $\delta_e^{\mathrm{targeted}} = 0.12$, the histogram of the received SINRs for Eve is shown in Fig. 11.5.

11.7 Further Comments

Probability constrained optimization is also used in [570]. Distributed robust


optimization is studied by Yang et al. [576, 577] and Chen and Chiang [578] for
communication networks. Randomized algorithms in robust control and smart grid
are studied in [579–582].

Fig. 11.3 Total power needed by all relay nodes versus the targeted SINR for Bob, for violation probabilities of 4%, 8%, and 12%

Fig. 11.4 The histogram of the received SINRs for Eve



Fig. 11.5 The histogram of the received SINRs for Eve

In [583], distributionally robust slow adaptive orthogonal frequency division


multiple access (OFDMA) with Soft QoS is studied via linear programming.
Neither prediction of channel state information nor specification of channel fading
distribution is needed for subcarrier allocation. As such, the algorithm is robust
against any mismatch between actual channel state/distributional information and
the one assumed. Moreover, although the optimization problem arising from the proposed scheme is non-convex in general, based on recent advances in chance-
constrained optimization, they show that it can be approximated by a certain
linear program with provable performance guarantees. In particular, they only
need to handle an optimization problem that has the same structure as the fast
adaptive OFDMA problem, but they are able to enjoy lower computational and
signaling costs.
Chapter 12
Database Friendly Data Processing

The goal of this chapter is to demonstrate how concentration of measure plays a central role in modern randomized algorithms. There is a convergence of sensing, computing, networking and control. Databases are often neglected in traditional treatments of estimation, detection, etc.
Modern scientific computing demands efficient algorithms for dealing with
large datasets—Big Data. Often these datasets can be fruitfully represented and
manipulated as matrices; in this case, fast low-error methods for making basic linear
algebra computations are key to efficient algorithms. Examples of such foundational
computational tools are low-rank approximations, matrix sparsification, and ran-
domized column subset selection.

12.1 Low Rank Matrix Approximation

Randomness can be turned to our advantage in the development of methods for


dealing with these massive datasets [584, 585].
It is well known that the rank-$k$ matrix $\mathbf{A}_k$ which minimizes both the Frobenius norm and the spectral norm error can be calculated via the singular value decomposition (SVD). Since the SVD takes cubic time, the computational cost of using it to form a low-rank approximation can be prohibitive if the matrix is large.
For an integer $n = 2^p$, $p = 1, 2, 3, \ldots$, the (non-normalized) $n \times n$ matrix of the Hadamard-Walsh transform is defined recursively as,
$$
\mathbf{H}_n = \begin{bmatrix} \mathbf{H}_{n/2} & \mathbf{H}_{n/2} \\ \mathbf{H}_{n/2} & -\mathbf{H}_{n/2} \end{bmatrix}, \qquad \mathbf{H}_2 = \begin{bmatrix} +1 & +1 \\ +1 & -1 \end{bmatrix}.
$$
The normalized $n \times n$ matrix of the Hadamard-Walsh transform is equal to
$$
\mathbf{H} = \frac{1}{\sqrt{n}}\,\mathbf{H}_n \in \mathbb{R}^{n\times n}.
\tag{12.1}
$$


For integers $r$ and $n = 2^p$ with $r < n$ and $p = 1, 2, 3, \ldots$, a subsampled randomized Hadamard transform (SRHT) matrix is an $r \times n$ matrix of the form
$$
\boldsymbol{\Theta} = \sqrt{\frac{n}{r}} \cdot \mathbf{R}\mathbf{H}\mathbf{D};
\tag{12.2}
$$
• $\mathbf{D} \in \mathbb{R}^{n\times n}$ is a random diagonal matrix whose entries are independent random signs, i.e., random variables uniformly distributed on $\{\pm 1\}$.
• $\mathbf{H} \in \mathbb{R}^{n\times n}$ is a normalized Walsh-Hadamard matrix.
• $\mathbf{R} \in \mathbb{R}^{r\times n}$ is a random matrix that restricts an $n$-dimensional vector to $r$ coordinates, which are chosen uniformly at random and without replacement.
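As a concrete illustration (ours, under the definitions above), the following Python/NumPy sketch forms an SRHT matrix explicitly and uses the sketch Y = AΘ^T to build a low-rank approximation of A by projecting onto the range of Y. A practical implementation would apply the Walsh-Hadamard transform with a fast O(n log n) routine instead of forming H densely.

# Sketch (ours) of an SRHT-based low-rank approximation.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
m, n, r, k = 200, 256, 64, 10                    # n must be a power of two

A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # rank-k test matrix

D = np.diag(rng.choice([-1.0, 1.0], size=n))     # random signs
H = hadamard(n) / np.sqrt(n)                     # normalized Walsh-Hadamard matrix
rows = rng.choice(n, size=r, replace=False)      # R: restrict to r random coordinates
Theta = np.sqrt(n / r) * (H @ D)[rows, :]        # r x n SRHT matrix, as in (12.2)

Y = A @ Theta.T                                  # m x r sketch of A
Q, _ = np.linalg.qr(Y)                           # orthonormal basis for range(Y)
A_approx = Q @ (Q.T @ A)                         # projection of A onto range(Y)

print("relative Frobenius error:",
      np.linalg.norm(A - A_approx) / np.linalg.norm(A))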
Theorem 12.1.1 (Subsampled randomized Hadamard transform [584]). Let $\mathbf{A} \in \mathbb{R}^{m\times n}$ have rank $\rho$. Fix an integer $k$ satisfying $0 < k \le \rho$. Let $0 < \varepsilon < 1/3$ denote an accuracy parameter, $0 < \delta < 1$ be a failure probability, and $C \ge 1$ be a constant. Let $\mathbf{Y} = \mathbf{A}\boldsymbol{\Theta}^T$, where $\boldsymbol{\Theta}$ is an $r \times n$ SRHT matrix with $r$ satisfying
$$
6 C^2 \varepsilon^{-1}\left[ \sqrt{k} + \sqrt{8\log(n/\delta)} \right]^2 \log(k/\delta) \le r \le n.
$$
Then, with probability at least $1 - \delta^{C^2/24} - 7\delta$,
$$
\left\| \mathbf{A} - \mathbf{Y}\mathbf{Y}^{\dagger}\mathbf{A} \right\|_F \le (1 + 50\varepsilon) \cdot \left\| \mathbf{A} - \mathbf{A}_k \right\|_F
$$
and
$$
\left\| \mathbf{A} - \mathbf{Y}\mathbf{Y}^{\dagger}\mathbf{A} \right\|_2 \le \left( 6 + \sqrt{\varepsilon\left( 15 + \frac{\log(n/\delta)}{C^2\log(k/\delta)} \right)} \right) \cdot \left\| \mathbf{A} - \mathbf{A}_k \right\|_2 + \sqrt{\frac{\varepsilon}{8 C^2\log(k/\delta)}} \cdot \left\| \mathbf{A} - \mathbf{A}_k \right\|_F .
$$
The matrix $\mathbf{Y}$ can be constructed in $O\left( mn\log(r) \right)$ time.

12.2 Row Sampling for Matrix Algorithms

We take material from Magdon-Ismail [586]. Let $\mathbf{e}_1, \ldots, \mathbf{e}_N$ be the standard basis vectors in $\mathbb{R}^N$. Let $\mathbf{A} \in \mathbb{R}^{N\times n}$ denote an arbitrary matrix which represents $N$ points in $\mathbb{R}^n$. In general, we represent a matrix such as $\mathbf{A}$ (bold, uppercase) by a set of vectors $\mathbf{a}_1, \ldots, \mathbf{a}_N \in \mathbb{R}^n$ (bold, lowercase), so that $\mathbf{A} = \left[ \mathbf{a}_1\ \mathbf{a}_2\ \cdots\ \mathbf{a}_N \right]^T$. Here $\mathbf{a}_i$ is the $i$-th row of $\mathbf{A}$, which we may also refer to by $\mathbf{A}_{(i)}$; similarly we refer to the $i$-th column as $\mathbf{A}^{(i)}$. Let $\|\cdot\|$ be the spectral norm and $\|\cdot\|_F$ the Frobenius norm. The numerical (stable) rank of $\mathbf{S}$ is defined as $\rho(\mathbf{S}) = \|\mathbf{S}\|_F^2 / \|\mathbf{S}\|^2$.

A row-sampling matrix samples $k$ rows of $\mathbf{A}$ to form $\tilde{\mathbf{A}} = \mathbf{Q}\mathbf{A}$:
$$
\mathbf{Q} = \begin{bmatrix} \mathbf{r}_1^T \\ \vdots \\ \mathbf{r}_k^T \end{bmatrix}, \qquad
\tilde{\mathbf{A}} = \mathbf{Q}\mathbf{A} = \begin{bmatrix} \mathbf{r}_1^T\mathbf{A} \\ \vdots \\ \mathbf{r}_k^T\mathbf{A} \end{bmatrix},
$$
where $\mathbf{r}_i^T\mathbf{A}$ samples the $t_i$-th row of $\mathbf{A}$ and rescales it. We are interested in random sampling matrices where each $\mathbf{r}_i$ is i.i.d. according to some distribution. Define a set of sampling probabilities $p_1, \ldots, p_N$, with $p_i > 0$ and $\sum_{i=1}^{N} p_i = 1$; then $\mathbf{r}_i = \mathbf{e}_t / \sqrt{k p_t}$ with probability $p_t$. The scaling is also related to the sampling probabilities in all the algorithms we consider. We can rewrite $\mathbf{Q}^T\mathbf{Q}$ as the sum of $k$ independently sampled matrices
$$
\mathbf{Q}^T\mathbf{Q} = \sum_{i=1}^{k} \mathbf{r}_i\mathbf{r}_i^T = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{p_{t_i}}\,\mathbf{e}_{t_i}\mathbf{e}_{t_i}^T,
$$
where each $\frac{1}{p_{t_i}}\mathbf{e}_{t_i}\mathbf{e}_{t_i}^T$ is a diagonal matrix with only one non-zero entry; the $t$-th diagonal entry is equal to $1/p_t$ with probability $p_t$. Thus, by construction, for any set of non-zero sampling probabilities, $\mathrm{E}\!\left[ \frac{1}{p_{t_i}}\mathbf{e}_{t_i}\mathbf{e}_{t_i}^T \right] = \mathbf{I}_{N\times N}$, and hence $\mathrm{E}\!\left[\mathbf{Q}^T\mathbf{Q}\right] = \mathbf{I}_{N\times N}$. Since we are averaging $k$ independent copies, it is reasonable to expect concentration around the mean with respect to $k$, and in some sense, $\mathbf{Q}^T\mathbf{Q}$ essentially behaves like the identity.
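The following NumPy sketch (ours) builds such a row-sampling matrix Q with probabilities proportional to the squared row norms (one admissible choice) and checks numerically that Ã^T Ã = A^T Q^T Q A stays close to A^T A.

# Sketch (ours) of row sampling with rescaling.
import numpy as np

rng = np.random.default_rng(2)
N, n, k = 2000, 20, 400

A = rng.standard_normal((N, n))
p = np.sum(A**2, axis=1) / np.sum(A**2)        # illustrative row-norm sampling probabilities
t = rng.choice(N, size=k, p=p)                 # sampled row indices t_1, ..., t_k

Q = np.zeros((k, N))
Q[np.arange(k), t] = 1.0 / np.sqrt(k * p[t])   # row i of Q is e_{t_i}^T / sqrt(k p_{t_i})
A_tilde = Q @ A                                # sampled and rescaled rows of A

# Q^T Q behaves like the identity, so A_tilde^T A_tilde approximates A^T A.
err = np.linalg.norm(A_tilde.T @ A_tilde - A.T @ A, 2) / np.linalg.norm(A.T @ A, 2)
print("relative error in A^T A:", err)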
Theorem 12.2.1 (Symmetric Orthonormal Subspace Sampling [586]). Let
$$
\mathbf{U} = \begin{bmatrix} \mathbf{u}_1 & \cdots & \mathbf{u}_N \end{bmatrix}^T \in \mathbb{R}^{N\times n}
$$
be orthonormal, and $\mathbf{D} \in \mathbb{R}^{n\times n}$ be positive diagonal. Assume the row-sampling probabilities $p_t$ satisfy
$$
p_t \ge \beta\, \frac{\mathbf{u}_t^T\mathbf{D}^2\mathbf{u}_t}{\operatorname{Tr}\left(\mathbf{D}^2\right)} .
$$
Then, if $k \ge \frac{4\rho(\mathbf{D})}{\beta\varepsilon^2}\ln\frac{2n}{\delta}$, with probability at least $1 - \delta$,
$$
\left\| \mathbf{D}^2 - \mathbf{D}\mathbf{U}^T\mathbf{Q}^T\mathbf{Q}\mathbf{U}\mathbf{D} \right\| \le \varepsilon\, \|\mathbf{D}\|^2 .
$$

A linear regression problem is represented by a real data matrix $\mathbf{A} \in \mathbb{R}^{N\times n}$, which represents $N$ points in $\mathbb{R}^n$, and a target vector $\mathbf{y} \in \mathbb{R}^N$. Traditionally, $N \gg n$. The goal is to find a regression vector $\mathbf{x} \in \mathbb{R}^n$ which minimizes the $\ell_2$ fit error (least squares regression)
$$
\mathcal{E}(\mathbf{x}) = \left\| \mathbf{A}\mathbf{x} - \mathbf{y} \right\|_2^2 = \sum_{t=1}^{N}\left( \mathbf{a}_t^T\mathbf{x} - y_t \right)^2 .
$$
This problem was formulated to use a non-commutative Bernstein bound. For details, see [587].

12.3 Approximate Matrix Multiplication

The matrix product ABT for large dimensions is a challenging problem. Here
we reformulate this standard linear algebra operation in terms of sums of random
matrices. It can be viewed as non-uniform sampling of the columns of A and B.
We take material from Hsu, Kakade, and Zhang [113], converting to our notation.
We make some comments and connections with other parts of the book at the
appropriate points.
Let $\mathbf{A} = \left[ \mathbf{a}_1 | \cdots | \mathbf{a}_n \right]$ and $\mathbf{B} = \left[ \mathbf{b}_1 | \cdots | \mathbf{b}_n \right]$ be fixed matrices, each with $n$ columns. Assume that $\mathbf{a}_j \ne \mathbf{0}$ and $\mathbf{b}_j \ne \mathbf{0}$ for all $j = 1, \ldots, n$. If $n$ is very large, which is common in the age of Big Data, then the standard, straightforward computation of the product $\mathbf{A}\mathbf{B}^T$ can be too computationally expensive. An alternative is to take a small (non-uniform) random sample of the columns of $\mathbf{A}$ and $\mathbf{B}$, say $\mathbf{a}_{j_1}, \mathbf{b}_{j_1}, \ldots, \mathbf{a}_{j_N}, \mathbf{b}_{j_N}$, or $\mathbf{a}_{j_i}, \mathbf{b}_{j_i}$ for $i = 1, 2, \ldots, N$. Then we compute a weighted sum of outer products$^1$
$$
\mathbf{A}\mathbf{B}^T \approx \frac{1}{N}\sum_{i=1}^{N}\frac{1}{p_{j_i}}\,\mathbf{a}_{j_i}\mathbf{b}_{j_i}^T
\tag{12.3}
$$

where pji > 0 is the a priori probability of choosing the column index ji ∈
{1, . . . , n} from a collection of N columns. The “average” and randomness do
most of the work, as observed by Donoho [132]: The regularity of having many
“identical” dimensions over which one can “average” is a fundamental tool. The
scheme of (12.3) was originally proposed and analyzed by Drinceas, Kannan, and
Mahoney [313].
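The following NumPy sketch (ours, purely illustrative) implements the sampling scheme (12.3): column indices are drawn with probabilities proportional to ‖a_j‖‖b_j‖, as defined below, and the rescaled outer products are averaged.

# Sketch (ours) of randomized approximate matrix multiplication.
import numpy as np

rng = np.random.default_rng(1)
m1, m2, n, N = 80, 60, 5000, 400        # n columns in total, N sampled outer products

A = rng.standard_normal((m1, n))
B = rng.standard_normal((m2, n))

p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
p /= p.sum()                             # p_j = ||a_j|| ||b_j|| / Z

idx = rng.choice(n, size=N, p=p)         # i.i.d. column indices j_1, ..., j_N
approx = (A[:, idx] / p[idx]) @ B[:, idx].T / N   # (1/N) sum (1/p_j) a_j b_j^T

exact = A @ B.T
print("relative spectral-norm error:",
      np.linalg.norm(approx - exact, 2) / (np.linalg.norm(A, 2) * np.linalg.norm(B, 2)))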
Let $\mathbf{X}_1, \ldots, \mathbf{X}_N$ be i.i.d. random matrices with the discrete distribution given by
$$
\Pr\left( \mathbf{X}_i = \frac{1}{p_j}\begin{bmatrix} \mathbf{0} & \mathbf{a}_j\mathbf{b}_j^T \\ \mathbf{b}_j\mathbf{a}_j^T & \mathbf{0} \end{bmatrix} \right) = p_j
$$
for all $j = 1, \ldots, n$, where
$$
p_j = \frac{\|\mathbf{a}_j\|_2\|\mathbf{b}_j\|_2}{Z}, \qquad Z = \sum_{j=1}^{n}\|\mathbf{a}_j\|_2\|\mathbf{b}_j\|_2 .
$$

Let
$$
\hat{\mathbf{M}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{X}_i \qquad \text{and} \qquad \mathbf{M} = \begin{bmatrix} \mathbf{0} & \mathbf{A}\mathbf{B}^T \\ \mathbf{B}\mathbf{A}^T & \mathbf{0} \end{bmatrix} .
$$

$^1$ Outer products $\mathbf{x}\mathbf{y}^T$ of two vectors $\mathbf{x}$ and $\mathbf{y}$ are rank-one matrices.



The spectral norm error $\left\| \hat{\mathbf{M}} - \mathbf{M} \right\|_2$ is used to describe the approximation of $\mathbf{A}\mathbf{B}^T$ by the average of $N$ outer products $\frac{1}{p_{j_i}}\mathbf{a}_{j_i}\mathbf{b}_{j_i}^T$, where the indices are such that
$$
j_i = j \iff \mathbf{X}_i = \mathbf{a}_j\mathbf{b}_j^T / p_j
$$
for all $i = 1, \ldots, N$. Again, the “average” plays a fundamental role. Our goal is to use Theorem 2.16.4 to bound this error. To apply this theorem, we must first check the conditions.
We have the following relations:
$$
\mathrm{E}\left[\mathbf{X}_i\right] = \sum_{j=1}^{n} p_j \cdot \frac{1}{p_j}\begin{bmatrix} \mathbf{0} & \mathbf{a}_j\mathbf{b}_j^T \\ \mathbf{b}_j\mathbf{a}_j^T & \mathbf{0} \end{bmatrix} = \begin{bmatrix} \mathbf{0} & \sum_{j=1}^{n}\mathbf{a}_j\mathbf{b}_j^T \\ \sum_{j=1}^{n}\mathbf{b}_j\mathbf{a}_j^T & \mathbf{0} \end{bmatrix} = \mathbf{M}
$$
$$
\operatorname{Tr}\left( \mathrm{E}\left[\mathbf{X}_i^2\right] \right) = \operatorname{Tr}\left( \sum_{j=1}^{n}\frac{1}{p_j}\begin{bmatrix} \mathbf{a}_j\mathbf{b}_j^T\mathbf{b}_j\mathbf{a}_j^T & \mathbf{0} \\ \mathbf{0} & \mathbf{b}_j\mathbf{a}_j^T\mathbf{a}_j\mathbf{b}_j^T \end{bmatrix} \right) = \sum_{j=1}^{n}\frac{2\|\mathbf{a}_j\|_2^2\|\mathbf{b}_j\|_2^2}{p_j} = 2Z^2
$$
$$
\operatorname{Tr}\left( \left(\mathrm{E}\left[\mathbf{X}_i\right]\right)^2 \right) = \operatorname{Tr}\left( \mathbf{M}^2 \right) = \operatorname{Tr}\begin{bmatrix} \mathbf{A}\mathbf{B}^T\mathbf{B}\mathbf{A}^T & \mathbf{0} \\ \mathbf{0} & \mathbf{B}\mathbf{A}^T\mathbf{A}\mathbf{B}^T \end{bmatrix} = 2\operatorname{Tr}\left( \mathbf{A}\mathbf{B}^T\mathbf{B}\mathbf{A}^T \right) .
$$
Let $\|\cdot\|_2$ stand for the spectral norm. The following norm inequalities can be obtained:
$$
\|\mathbf{X}_i\|_2 \le \max_{j=1,\ldots,n}\frac{1}{p_j}\left\| \begin{bmatrix} \mathbf{0} & \mathbf{a}_j\mathbf{b}_j^T \\ \mathbf{b}_j\mathbf{a}_j^T & \mathbf{0} \end{bmatrix} \right\|_2 = \max_{j=1,\ldots,n}\frac{1}{p_j}\left\| \mathbf{a}_j\mathbf{b}_j^T \right\|_2 = Z
$$
$$
\left\| \mathrm{E}\left[\mathbf{X}_i\right] \right\|_2 = \|\mathbf{M}\|_2 \le \left\| \mathbf{A}\mathbf{B}^T \right\|_2 \le \|\mathbf{A}\|_2\|\mathbf{B}\|_2
$$
$$
\left\| \mathrm{E}\left[\mathbf{X}_i^2\right] \right\|_2 \le \|\mathbf{A}\|_2\|\mathbf{B}\|_2\, Z .
$$

Using Theorem 2.16.4 and a union bound, we finally arrive at the following: for any $\varepsilon \in (0,1)$ and $\delta \in (0,1)$, if
$$
N \ge \left( \frac{8}{3} + \sqrt{\frac{5}{3}} \right)\frac{\left( 1 + \sqrt{r_A r_B} \right)\left( \log\left( 4\sqrt{r_A r_B} \right) + \log(1/\delta) \right)}{\varepsilon^2} + 2,
$$
then with probability at least $1 - \delta$ over the random choice of column indices $j_1, \ldots, j_N$,
$$
\left\| \frac{1}{N}\sum_{i=1}^{N}\frac{1}{p_{j_i}}\mathbf{a}_{j_i}\mathbf{b}_{j_i}^T - \mathbf{A}\mathbf{B}^T \right\|_2 \le \varepsilon\, \|\mathbf{A}\|_2\|\mathbf{B}\|_2 ,
$$

where
$$
r_A = \|\mathbf{A}\|_F^2 / \|\mathbf{A}\|_2^2 \in [1, \operatorname{rank}(\mathbf{A})] \qquad \text{and} \qquad r_B = \|\mathbf{B}\|_F^2 / \|\mathbf{B}\|_2^2 \in [1, \operatorname{rank}(\mathbf{B})]
$$
are the numerical (or stable) ranks. Here $\|\cdot\|_F$ stands for the Frobenius (or Hilbert-Schmidt) norm. For details, we refer to [113].

12.4 Matrix and Tensor Sparsification

In the age of Big Data, we have a data deluge. Data is expressed as matrices.
One wonders what is the best way of efficiently generating “sketches” of matrices.
Formally, we define the problem as follows: Given a matrix A ∈ Rn×n , and an
error parameter ε, construct a sketch à ∈ Rn×n of A such that
$$
\left\| \mathbf{A} - \tilde{\mathbf{A}} \right\|_2 \le \varepsilon
$$
and the number of non-zero entries in $\tilde{\mathbf{A}}$ is minimized. Here $\|\cdot\|_2$ is the spectral norm of a matrix (the largest singular value), while $\|\cdot\|_F$ is the Frobenius norm. See Sect. 1.4.5.
Algorithm 12.4.1 (Matrix sparsification [124]).
1. Input: matrix $\mathbf{A} \in \mathbb{R}^{n\times n}$, sampling parameter $s$.
2. For all $1 \le i, j \le n$ do
— If $A_{ij}^2 \le \frac{\log^2 n}{n}\cdot\frac{\|\mathbf{A}\|_F^2}{s}$, then $\tilde{A}_{ij} = 0$;
— else if $A_{ij}^2 \ge \frac{\|\mathbf{A}\|_F^2}{s}$, then $\tilde{A}_{ij} = A_{ij}$;
— else $\tilde{A}_{ij} = \begin{cases} \dfrac{A_{ij}}{p_{ij}}, & \text{with probability } p_{ij} = \dfrac{s A_{ij}^2}{\|\mathbf{A}\|_F^2} < 1 \\[4pt] 0, & \text{with probability } 1 - p_{ij} \end{cases}$
3. Output: matrix $\tilde{\mathbf{A}} \in \mathbb{R}^{n\times n}$.
An algorithm is shown in Algorithm 12.4.1. When $n \ge 300$ and
$$
s = C\,\frac{n\,(\log n)\,\log^2\!\left( n/\log^2 n \right)}{\varepsilon^2},
$$
then, with probability at least $1 - 1/n$, we have
$$
\left\| \mathbf{A} - \tilde{\mathbf{A}} \right\|_2 \le \varepsilon\,\|\mathbf{A}\|_F .
$$
Theorem 2.17.7 has been used in [124] to analyze this algorithm.
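A NumPy sketch of the element-wise rule in Algorithm 12.4.1 is given below; it is our own illustration, and the constants (in particular the choice of s) are not tuned to match the theory above.

# Sketch (ours) of element-wise matrix sparsification.
import numpy as np

def sparsify(A, s, rng=np.random.default_rng(3)):
    n = A.shape[0]
    fro2 = np.linalg.norm(A, "fro") ** 2
    small = A**2 <= (np.log(n) ** 2 / n) * fro2 / s     # zero out very small entries
    large = A**2 >= fro2 / s                            # keep large entries exactly
    p = np.clip(s * A**2 / fro2, 0.0, 1.0)              # p_ij = s A_ij^2 / ||A||_F^2
    keep = rng.random(A.shape) < p                      # randomized rounding of the rest
    A_over_p = np.divide(A, p, out=np.zeros_like(A), where=p > 0)
    return np.where(small, 0.0, np.where(large, A, np.where(keep, A_over_p, 0.0)))

rng = np.random.default_rng(3)
A = rng.standard_normal((500, 500))
A_tilde = sparsify(A, s=50000)
print("kept entries:", np.count_nonzero(A_tilde), "of", A.size)
print("spectral error / ||A||_F:",
      np.linalg.norm(A - A_tilde, 2) / np.linalg.norm(A, "fro"))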


Automatic generation of very large data sets enables the coming of the Big Data age. Such data are often modeled as matrices. A generalization of this framework permits

the modeling of the data by higher-order arrays or tensors (e.g., arrays with more
than two modes). A natural example is time-evolving data, where the third mode of
the tensor represents time [588]. Concentration of measure has been used in [10] to
design algorithms.
For any $d$-mode or order-$d$ tensor $\mathbf{A} \in \mathbb{R}^{\overbrace{n\times n\times\cdots\times n}^{d\ \text{times}}}$, its Frobenius norm $\|\mathbf{A}\|_F$ is defined as the square root of the sum of the squares of its elements. Now we define tensor-vector products: let $\mathbf{x}, \mathbf{y}$ be vectors in $\mathbb{R}^n$. Then
$$
\mathbf{A}\times_1\mathbf{x} = \sum_{i=1}^{n} A_{ijk\cdots l}\, x_i, \qquad
\mathbf{A}\times_2\mathbf{x} = \sum_{j=1}^{n} A_{ijk\cdots l}\, x_j, \qquad
\mathbf{A}\times_3\mathbf{x} = \sum_{k=1}^{n} A_{ijk\cdots l}\, x_k, \quad \text{etc.}
$$
The outcome of the above operations is an order-$(d-1)$ tensor. The above definition may be extended to handle multiple tensor-vector products,
$$
\mathbf{A}\times_1\mathbf{x}\times_2\mathbf{y} = \sum_{i=1}^{n}\sum_{j=1}^{n} A_{ijk\cdots l}\, x_i\, y_j,
$$

which is an order-$(d-2)$ tensor. Using this definition, the spectral norm is defined as
$$
\|\mathbf{A}\|_2 = \sup_{\mathbf{x}_1,\ldots,\mathbf{x}_d\in\mathbb{R}^n}\left| \mathbf{A}\times_1\mathbf{x}_1\cdots\times_d\mathbf{x}_d \right|,
$$
where all the vectors $\mathbf{x}_1, \ldots, \mathbf{x}_d \in \mathbb{R}^n$ are unit vectors, i.e., $\|\mathbf{x}_i\|_2 = 1$ for all $i \in [d]$. The notation $[d]$ stands for the set $\{1, 2, \ldots, d\}$.
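For an order-3 tensor these mode products are one-line einsum contractions in NumPy, as the following small sketch (ours) illustrates.

# Sketch (ours) of tensor-vector (mode) products for an order-3 tensor.
import numpy as np

rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n, n))          # order-3 tensor
x = rng.standard_normal(n)
y = rng.standard_normal(n)

A_x1 = np.einsum("ijk,i->jk", A, x)         # A x_1 x : order-2 tensor (matrix)
A_x2 = np.einsum("ijk,j->ik", A, x)         # A x_2 x
A_x1_y2 = np.einsum("ijk,i,j->k", A, x, y)  # A x_1 x x_2 y : order-1 tensor (vector)

# Consistency check: contracting mode 2 of (A x_1 x) with y gives the same vector.
print(np.allclose(A_x1_y2, A_x1.T @ y))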
Given an order-$d$ tensor $\mathbf{A} \in \mathbb{R}^{\overbrace{n\times n\times\cdots\times n}^{d\ \text{times}}}$ and an error parameter $\varepsilon > 0$, construct an order-$d$ tensor sketch $\tilde{\mathbf{A}} \in \mathbb{R}^{\overbrace{n\times n\times\cdots\times n}^{d\ \text{times}}}$ such that
$$
\left\| \mathbf{A} - \tilde{\mathbf{A}} \right\|_2 \le \varepsilon\left\| \tilde{\mathbf{A}} \right\|_2
$$
and the number of non-zero entries in $\tilde{\mathbf{A}}$ is minimized.


Assume $n \ge 300$ and $2 \le d \le 0.5\ln n$. If the sampling parameter $s$ satisfies
$$
s = \Omega\!\left( \frac{d^3\, 8^{2d}\, \mathrm{st}(\mathbf{A})\, n^{d/2}\ln^3 n}{\varepsilon^2} \right),
$$

then, with probability at least $1 - 1/n$,
$$
\left\| \mathbf{A} - \tilde{\mathbf{A}} \right\|_2 \le \varepsilon\left\| \tilde{\mathbf{A}} \right\|_2 .
$$
Here, we use the stable rank $\mathrm{st}(\mathbf{A})$ of the tensor,
$$
\mathrm{st}(\mathbf{A}) = \frac{\|\mathbf{A}\|_F^2}{\|\mathbf{A}\|_2^2} .
$$
Theorem 3.3.10 has been applied by Nguyen et al. [10] to obtain the above result.

12.5 Further Comments

See Mahoney [589] for randomized algorithms for matrices and data.
Chapter 13
From Network to Big Data

The main goal of this chapter is to put together all the pieces treated in previous chapters. We treat the subject from a systems engineering point of view. This chapter motivates the whole book. We only have space to see the problems from ten thousand feet.

13.1 Large Random Matrices for Big Data

Figure 13.1 illustrates the vision of big data that will be the foundation for understanding cognitive networked sensing, cognitive radio networks, cognitive radar and even the smart grid. We will further develop this vision in the book on smart grid [6]. High-dimensional statistics is the driver behind these subjects. Random matrices are natural building blocks to model big data. The concentration of measure phenomenon, which is unique to high-dimensional spaces, is of fundamental significance for modeling a large number of random matrices. The large data sets are conveniently expressed as a matrix
$$
\mathbf{X} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2n} \\ \vdots & \vdots & \cdots & \vdots \\ X_{m1} & X_{m2} & \cdots & X_{mn} \end{bmatrix} \in \mathbb{C}^{m\times n}
$$

where $X_{ij}$ are random variables, e.g., sub-Gaussian random variables. Here $m, n$ are finite and large, for example $m = 100$, $n = 100$. The spectrum of a random matrix $\mathbf{X}$ tends to stabilize as the dimensions of $\mathbf{X}$ grow to infinity. In the last few years, attention has turned to local and non-asymptotic regimes, where the dimensions of $\mathbf{X}$ are fixed rather than growing to infinity. The concentration of measure phenomenon naturally occurs. The eigenvalues $\lambda_i\left(\mathbf{X}^T\mathbf{X}\right)$, $i = 1, \ldots, n$, are natural mathematical objects to study.


Fig. 13.1 Big data vision: high-dimensional statistics links Big Data with cognitive networked sensing, the cognitive radio network, cognitive radar, and the smart grid

The eigenvalues can be viewed as Lipschitz functions that can be handled by Talagrand's concentration inequality. It expresses the insight that the sum of a large number of random variables is nearly a constant with high probability. We can often treat both standard Gaussian and Bernoulli random variables in the unified framework of the sub-Gaussian family.
Theorem 13.1.1 (Talagrand's Concentration Inequality). For every product probability $\mathbb{P}$ on $\{-1,1\}^n$, consider a convex and Lipschitz function $f: \mathbb{R}^n \to \mathbb{R}$ with Lipschitz constant $L$. Let $X_1, \ldots, X_n$ be independent random variables taking values in $\{-1,1\}$. Let $Y = f(X_1, \ldots, X_n)$ and let $M_Y$ be a median of $Y$. Then, for every $t > 0$, we have
$$
\mathbb{P}\left( |Y - M_Y| \ge t \right) \le 4 e^{-t^2/16L^2} .
\tag{13.1}
$$
The random variable $Y$ has the following property
$$
\operatorname{Var}(Y) \le 16L^2, \qquad \mathrm{E}[Y] - 16L \le M_Y \le \mathrm{E}[Y] + 16L .
\tag{13.2}
$$

For a random matrix $\mathbf{X} \in \mathbb{R}^{n\times n}$, the following functions are Lipschitz functions:
$$
(1)\ \lambda_{\max}(\mathbf{X});\quad (2)\ \lambda_{\min}(\mathbf{X});\quad (3)\ \operatorname{Tr}(\mathbf{X});\quad (4)\ \sum_{i=1}^{k}\lambda_i(\mathbf{X});\quad (5)\ \sum_{i=1}^{k}\lambda_{n-i+1}(\mathbf{X})
$$
where $\operatorname{Tr}(\mathbf{X})$ has a Lipschitz constant of $L = 1/n$, and $\lambda_i(\mathbf{X})$, $i = 1, \ldots, n$, has a Lipschitz constant of $L = 1/\sqrt{n}$. So the variance of $\operatorname{Tr}(\mathbf{X})$ is upper bounded by $16/n^2$, while the variance of $\lambda_i(\mathbf{X})$, $i = 1, \ldots, n$, is upper bounded by $16/n$. The variance bound for $\operatorname{Tr}(\mathbf{X})$ is thus a factor of $1/n$ smaller than that for $\lambda_i(\mathbf{X})$, $i = 1, \ldots, n$. For example, when $n = 100$, their ratio is 20 dB. The variance has a fundamental control over the hypothesis detection.
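The following small Monte Carlo experiment (ours, not from the book) illustrates this concentration: the largest eigenvalue of a random symmetric ±1 matrix, a Lipschitz function of its independent entries, clusters tightly around its median, and the relative spread shrinks as n grows.

# Sketch (ours): concentration of the largest eigenvalue of symmetric Bernoulli matrices.
import numpy as np

rng = np.random.default_rng(5)
for n in (50, 200):
    lmax = []
    for _ in range(300):
        B = rng.choice([-1.0, 1.0], size=(n, n))
        X = np.triu(B) + np.triu(B, 1).T            # symmetric +/-1 matrix
        lmax.append(np.linalg.eigvalsh(X)[-1])
    lmax = np.array(lmax)
    print(f"n={n}: median={np.median(lmax):.2f}, "
          f"std={lmax.std():.3f}, std/median={lmax.std()/np.median(lmax):.4f}")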

13.2 A Case Study for Hypothesis Detection in High Dimensions

Let
$$
\mathbf{x} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix};\quad
\mathbf{y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix};\quad
\mathbf{s} = \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_n \end{bmatrix};\qquad \mathbf{x}, \mathbf{y}, \mathbf{s} \in \mathbb{C}^n .
$$

The hypothesis detection is written as

H0 : y = x
H1 : y = s + x.

The covariance matrices are defined as
$$
\mathbf{R}_x = \mathrm{E}\left[\mathbf{x}\mathbf{x}^H\right], \quad \mathbf{R}_y = \mathrm{E}\left[\mathbf{y}\mathbf{y}^H\right], \quad \mathbf{R}_s = \mathrm{E}\left[\mathbf{s}\mathbf{s}^H\right], \qquad \mathbf{R}_x, \mathbf{R}_y, \mathbf{R}_s \in \mathbb{R}^{n\times n} .
$$

We have the equivalent form

H0 : R y = R x
H1 : Ry = Rs + Rx = Low rank matrix + Sparse matrix.

where $\mathbf{R}_s$ is often of low rank. When a white Gaussian random vector is considered, we have
$$
\mathbf{R}_x = \sigma^2\mathbf{I}_{n\times n} = \sigma^2\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & \vdots \\ \vdots & \vdots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{bmatrix},
$$
which is sparse in that there are non-zero entries only along the diagonal. Using matrix decomposition [590], we are able to separate the two matrices even when $\sigma^2$ is very small, say the signal-to-noise ratio $\mathrm{SNR} = 10^{-6}$. Anti-jamming is one motivating example.
Unfortunately, when the sample size $N$ of the random vector $\mathbf{x} \in \mathbb{R}^n$ is finite, the sparse matrix assumption on $\mathbf{R}_x$ is not satisfied. Let us study a simple example. Consider i.i.d. Gaussian random variables $X_i \sim \mathcal{N}(0,1)$, $i = 1, \ldots, n$. In MATLAB code, we have the random data vector x = randn(n, 1). So the true covariance matrix is $\mathbf{R}_x = \sigma^2\mathbf{I}_{n\times n}$. For $N$ independent copies of $\mathbf{x}$, we define the sample covariance matrix

Fig. 13.2 Random sample covariance matrices of $n \times n$ with $n = 100$ (first two columns shown): (a) $N = 10$; (b) $N = 100$

$$
\hat{\mathbf{R}}_x = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T = \frac{1}{N}\mathbf{X}\mathbf{X}^T \in \mathbb{R}^{n\times n},
$$
 
where the data matrix is $\mathbf{X} = \left[ \mathbf{x}_1\ \mathbf{x}_2\ \cdots\ \mathbf{x}_N \right] \in \mathbb{R}^{n\times N}$. The first two columns of $\hat{\mathbf{R}}_x$ are shown in Fig. 13.2 for (a) $N = 10$ and (b) $N = 100$. The variance of case (b) is much smaller than that of case (a). The convergence is measured by $\|\hat{\mathbf{R}}_x - \mathbf{R}_x\| = \|\hat{\mathbf{R}}_x - \sigma^2\mathbf{I}_{n\times n}\|$. Consider the regime
$$
n \to \infty,\quad N \to \infty,\quad \frac{N}{n} \to c \in (0,1).
\tag{13.3}
$$
Under this regime, the fundamental question is
$$
\hat{\mathbf{R}}_x \to \sigma^2\mathbf{I}_{n\times n}\,?
$$
Under the regime of (13.3), we have the hypothesis detection problem
$$
\begin{aligned}
H_0:\ & \hat{\mathbf{R}}_y = \hat{\mathbf{R}}_x \\
H_1:\ & \hat{\mathbf{R}}_y = \hat{\mathbf{R}}_s + \hat{\mathbf{R}}_x
\end{aligned}
$$
where $\hat{\mathbf{R}}_x$ is a random matrix which is not sparse. We are led to evaluate the following simple, intuitive test:
$$
\operatorname{Tr}\left(\hat{\mathbf{R}}_y\right) \ge \gamma_1 + \operatorname{Tr}\left(\hat{\mathbf{R}}_x\right), \ \text{claim } H_1; \qquad
\operatorname{Tr}\left(\hat{\mathbf{R}}_y\right) \le \gamma_0 + \operatorname{Tr}\left(\hat{\mathbf{R}}_x\right), \ \text{claim } H_0 .
$$
Let us compare this with the classical likelihood ratio test (LRT).
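A toy Monte Carlo sketch of the trace test is given below; the rank-one signal model, the SNR value and the way the threshold γ1 is calibrated are our own illustrative choices, not the book's.

# Sketch (ours) of the trace-based detector in a rank-one-signal toy model.
import numpy as np

rng = np.random.default_rng(6)
n, N, trials, snr = 100, 100, 200, 0.05
u = rng.standard_normal(n); u /= np.linalg.norm(u)   # fixed unit signal direction

def trace_stat(signal_present):
    X = rng.standard_normal((n, N))                  # noise samples, sigma^2 = 1
    if signal_present:
        X = X + np.sqrt(snr * n) * np.outer(u, rng.standard_normal(N))
    R_hat = X @ X.T / N                              # sample covariance
    return np.trace(R_hat)

t0 = np.array([trace_stat(False) for _ in range(trials)])
t1 = np.array([trace_stat(True) for _ in range(trials)])
gamma1 = np.quantile(t0, 0.95) - n                   # threshold relative to Tr(R_x) = n
print("false alarm:", np.mean(t0 >= n + gamma1), " detection:", np.mean(t1 >= n + gamma1))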



13.3 Cognitive Radio Network Testbed

Let us use Fig. 13.3 to represent a cognitive radio network of $N$ nodes. The network works in a TDMA manner. When one node, say $i = 1$, transmits, the rest of the nodes $i = 2, \ldots, 80$ record the data. We can randomly (or deterministically) choose the next node to transmit.

13.4 Wireless Distributed Computing

For a segment of M samples, we can form a matrix of N × M. For convenience,


we collect the M data samples into a vector, denoted by x ∈ RM or x ∈ CM .
For N nodes, we have x1 , . . . , xN . We can collect the data into a matrix X =
[x1 , . . . , xN ]T ∈ RN ×M . The entries of the matrix X are (scalar) random variables.
So X is a random matrix.
For statistics, we often start with the sample covariance matrix defined as
$$
\hat{\mathbf{R}}_x = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T = \frac{1}{N}\mathbf{X}^T\mathbf{X} .
$$

The true covariance matrix is $\mathbf{R}_x$. Of course, one fundamental question is how the sample size $N$ affects the estimation accuracy $\|\mathbf{R}_x - \hat{\mathbf{R}}_x\|$ as a function of the dimension $n$.
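A short NumPy experiment (ours) makes the dependence on N concrete for R_x = I: the operator-norm error tracks the standard non-asymptotic estimate of order 2√(M/N) + M/N.

# Sketch (ours): sample-covariance accuracy versus sample size N.
import numpy as np

rng = np.random.default_rng(7)
M = 100                                      # dimension of each node's data vector
for N in (100, 1000, 10000):
    X = rng.standard_normal((N, M))          # N independent samples of dimension M
    R_hat = X.T @ X / N                      # sample covariance, true R_x = I
    err = np.linalg.norm(R_hat - np.eye(M), 2)
    ref = 2 * np.sqrt(M / N) + M / N         # rough non-asymptotic reference
    print(f"N={N:6d}  ||R_hat - I||_2 = {err:.3f}  (reference {ref:.3f})")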

Fig. 13.3 Complex network model: $N = 80$ nodes, each sampling at 20 Msps, form data segments of $M = 100$ samples into a random data matrix $\mathbf{X}$; in-network processing, distributed computing, and data storage support real-time statistics such as $\operatorname{Tr}(\mathbf{X})$ and the eigenvalues $\lambda_i(\mathbf{X})$, $i = 1, \ldots, \min\{M, N\}$, whose concentration of measure is exploited (non-asymptotic theory of random matrices, random graphs, Big Data)



When the sample size $N$ is comparable to the dimension $n$, the so-called non-asymptotic random matrix theory must be used, rather than the asymptotic limits when $N$ and $n$ go to infinity, i.e., $N \to \infty$, $n \to \infty$. The concentration of measure phenomenon is the fundamental tool to deal with this non-asymptotic
of measure phenomenon is the fundamental tool to deal with this non-asymptotic
regime. Lipschitz functions are the basic building blocks to investigate. Fortunately,
most quantities of engineering interest belong to this class of functions. Examples
include the trace, the largest eigenvalue, and the smallest eigenvalue of X :
Tr (X) ,λmax (X) , and λmin (X).
Since the random matrix is viewed as the starting point for future statistical studies, it is important to understand the computing aspects, especially for some real-time applications. Practically, the sampling rate of a software-defined radio node is on the order of 20 Mega samples per second (Msps). For a data segment of
M = 100 samples, the required sampling time is 5 μs for each node i = 1, . . . , N.
At this stage, it is the right moment to talk about the network synchronization,
which is critical. The Apollo testbed for cognitive radio networks at Tennessee Technological University (TTU) [591–593] has achieved a synchronization error of less than 1 μs.
The raw data that are collected at each node $i = 1, \ldots, N$ are difficult to move around. It is feasible, however, to disseminate real-time statistics such as the parameters associated with the sample covariance matrix. To estimate the $100 \times 100$ sample covariance matrix at each node, for example, we need to collect a total of 10,000 sample points, which requires a sampling time of 500 μs = 0.5 ms. It is remarkable to point out that the computing time is around 2.5 ms, which is five times larger than the required sampling time of 0.5 ms. Parallel computing is needed to speed up the wireless distributed computing. Therefore, in-network processing is critical.

13.5 Data Collection

As mentioned above, the network works in a TDMA manner. When one node, say $i = 1$, transmits, the rest of the nodes $i = 2, \ldots, 80$ record the data. We can randomly (or deterministically) choose the next node to transmit. An algorithm can be designed to control the data collection for the network.
As shown in Fig. 13.4, communication and sensing modes are enabled in the Apollo testbed of TTU. The switch between the two modes is fast.

13.6 Data Storage and Management

The large data sets are collected at each node i = 1, . . . , 80. After the collection,
some real-time in-network processing can be done to extract the statistical param-
eters that will be stored locally at each node. Of course, the raw data can be stored in the local database without any in-network processing.

Fig. 13.4 Data collection and storage

It is very important to remark that the data sets are large and difficult to move around. To disseminate the information about these large data sets, we need to rely on statistical tools to reduce the dimensionality, which is often the first thing to do with the data. Principal component analysis (PCA) is one such tool. PCA requires the sample covariance matrix as the input to the algorithm.
Management of these large data sets is very challenging. For example, how should these large data sets be indexed?

13.7 Data Mining of Large Data Sets

At this stage, the data is assumed to be stored and managed properly with effective indexing. We can mine the data for the information that is needed. The goals include: (1) higher spectrum efficiency; (2) higher energy efficiency; (3) enhanced security.

13.8 Mobility of Network Enabled by UAVs

Unmanned aerial vehicles (UAVs) enable the mobility of the wireless network. It is
especially interesting to study the large data sets as a function of space and time.
The Global Positioning System (GPS) is used to locate the 3-dimensional position
together with time stamps.

13.9 Smart Grid

Smart Grid requires a two-way information flow [6]. Smart Grid can be viewed as a large network. As a result, Big Data aspects are essential. In another work, we systematically explore this viewpoint, using the mathematical tools of the current book as the new departure points. Figure 13.1 illustrates this viewpoint.

13.10 From Cognitive Radio Network to Complex Network and to Random Graph

The large size and dynamic nature of complex networks [594–598] enable a deep connection between statistical graph theory and the possibility of describing macroscopic phenomena in terms of the dynamic evolution of the basic elements
of the system [599]. In our new book [5], a cognitive radio network is modeled
as a complex network. (The use of a cognitive radio network in a smart grid is
considered [592].) The book [600] also models the communication network as a
random network.
So little is understood of (large) networks. The rather simple question “What
is a robust network?” seems beyond the realm of present understanding [601]. Any
complex network can be represented by a graph. Any graph can be represented by an
adjacency matrix, from which other matrices such as the Laplacian are derived. One
of the most beautiful aspects of linear algebra is the notion that, to each matrix, a set
of eigenvalues can be associated with corresponding eigenvectors. As shown below,
the most profound observation is that these eigenvalues for general random matrices
are strongly concentrated. As a result, eigenvalues—which are certain scalar valued
random variables—are natural metrics to describe the complex network (random
graph). A close analogy is that the spectrum domain of the Fourier transform is
the natural domain to study a random signal. A graph consists of a set of nodes
connected by a set of links. Some properties of a graph in the topology domain are connected with the eigenvalues of random matrices.

13.11 Random Matrix Theory and Concentration of Measure

From a mathematical point of view, it is convenient to define a graph—and therefore a complex network—by means of the adjacency matrix $\mathbf{X} = \{x_{ij}\}$. This is an $N \times N$ matrix defined such that
$$
x_{ij} = \begin{cases} 1, & \text{if } (i,j) \in E \\ 0, & \text{if } (i,j) \notin E \end{cases} .
\tag{13.4}
$$

For undirected graphs the adjacency matrix is symmetric,1 xij = xji , and therefore
contains redundant information. If the graph consists of N nodes and L links, then
the N × N adjacency matrix can be written as

X = UΛUT (13.5)

where the N × N matrix U contains as columns the eigenvectors u1 , . . . , uN of X


belonging to the real eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$, and where the diagonal matrix $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_i)$. The basic relation (13.5) equates the topology domain, represented by the adjacency matrix, to the spectral domain of the graph, represented by the eigensystem in terms of the orthogonal matrix $\mathbf{U}$ of eigenvectors and the diagonal matrix $\boldsymbol{\Lambda}$.
The topology of large complex networks makes the random graph model
attractive. When a random graph—and therefore a complex network—is studied,
the adjacency matrix X is a random matrix. Let G be an edge-independent random
graph on the vertex set $[n] = \{1, 2, \ldots, n\}$; two vertices $v_i$ and $v_j$ are adjacent in $G$ with probability $p_{ij}$ independently. Here the $\{p_{ij}\}$ are not assumed to be equal.
Using the matrix notation, we have that

xij = P (vi ∼ vj ) = pij . (13.6)

As a result, the study of a cognitive radio network boils down to the study of random
matrix X.
When the elements $x_{ij}$ are random variables, the adjacency matrix $\mathbf{X}$ is a random matrix, so the so-called random matrix theory can be used as a powerful mathematical tool. The connection is very deep. Our book [5] has dedicated more than 230 pages to this topic. The most profound observation is that the spectrum of the random matrix is highly concentrated. Talagrand's strikingly powerful inequality lies at the heart of this observation. As a result, the statistical behavior of the graph spectra for complex networks is of interest [601].
Our approach is to investigate the eigenvalues (spectrum) of the random adjacency matrix and the (normalized) Laplacian matrix. There is a connection between
the spectrum and the topology of a graph. The duality between topology and
spectral domain is not new and has been studied in the field of mathematics called
algebraic graph theory [602, 603]. What is new is the connection with complex
networks [601].
The promising model is based on the concentration of the adjacency matrix
and of the Laplacian in random graphs with independent edges along a line of
research [604–608]; see [604], for example, for the model. We consider a random
graph G such that each edge is determined by an independent random variable,
where the probability of each edge is not assumed to be equal, i.e., P (vi ∼ vj ) =
pij . Each edge of G is independent of each other edge.
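The following NumPy sketch (ours) generates such an edge-independent random graph with unequal probabilities p_ij and compares the spectrum of the adjacency matrix X with that of its expectation E[X] = (p_ij); the choice p_ij = w_i w_j is purely illustrative.

# Sketch (ours) of an edge-independent random graph and its spectrum.
import numpy as np

rng = np.random.default_rng(8)
n = 400
w = rng.uniform(0.2, 0.9, size=n)
P = np.outer(w, w)                       # p_ij = w_i w_j, not assumed equal
np.fill_diagonal(P, 0.0)

U = rng.random((n, n))
upper = np.triu(U < P, 1)                # independent edges on the upper triangle
X = (upper | upper.T).astype(float)      # symmetric 0/1 adjacency matrix

eig_X = np.sort(np.linalg.eigvalsh(X))[::-1]
eig_P = np.sort(np.linalg.eigvalsh(P))[::-1]
print("largest adjacency eigenvalue :", round(eig_X[0], 2))
print("largest eigenvalue of E[X]   :", round(eig_P[0], 2))
print("spectral norm of X - E[X]    :", round(np.linalg.norm(X - P, 2), 2))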

$^1$ Unless mentioned otherwise, we assume in this section that the graph is undirected and that $\mathbf{X}$ is symmetric.

For random graphs with such general distributions, several bounds for the
spectrum of the corresponding adjacency matrix and (normalized) Laplacian matrix
can be derived. Eigenvalues of the adjacency matrix have many applications in graph theory, such as describing certain topological features of a graph (e.g., connectivity) and enumerating the occurrences of subgraphs [602, 609].
The data collection is viewed as a statistical inverse problem. Our results have
broader applicability in data collection, e.g., problems in social networking, game
theory, network security, and logistics [610].
Bibliography

1. N. Alon, M. Krivelevich, and V. Vu, “On the concentration of eigenvalues of random


symmetric matrices,” Israel Journal of Mathematics, vol. 131, no. 1, pp. 259–267, 2002. [33,
244]
2. V. Vu, “Concentration of non-lipschitz functions and applications,” Random Structures &
Algorithms, vol. 20, no. 3, pp. 262–316, 2002. [341]
3. V. Vu, “Spectral norm of random matrices,” Combinatorica, vol. 27, no. 6, pp. 721–736, 2007.
[238, 244, 245]
4. V. Mayer-Schönberger and K. Cukier, Big Data: A Revolution that will transform how we live, work and think. Eamon Dolan Book and Houghton Mifflin Harcourt, 2013. [xxi, 271]
5. R. Qiu, Z. Hu, H. Li, and M. Wicks, Cognitive Communications and Networking: Theory and Practice. John Wiley and Sons, 2012. [xxi, 37, 85, 351, 352, 360, 441, 502, 526, 574, 575]
6. R. Qiu, Introduction to Smart Grid. John Wiley and Sons, 2014. [xxi, 567, 574]
7. G. Lugosi, “Concentration-of-measure inequalities,” 2009. [5, 6, 16, 17, 121, 122]
8. A. Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering. Pearson-Prentice Hall, third edition ed., 2008. [9, 11, 85]
9. T. Tao, Topics in Random Matrix Theory. American Mathematical Society, 2012. [9, 23, 24, 33, 54, 85, 97, 362, 363, 364, 473]
10. N. Nguyen, P. Drineas, and T. Tran, “Tensor sparsification via a bound on the spectral norm
of random tensors,” arXiv preprint arXiv:1005.4732, 2010. [12, 565, 566]
11. W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of
the American Statistical Association, pp. 13–30, 1963. [13, 14, 16]
12. G. Bennett, “Probability inequalities for the sum of independent random variables,” Journal
of the American Statistical Association, vol. 57, no. 297, pp. 33–45, 1962. [15, 389]
13. A. Van Der Vaart and J. Wellner, Weak Convergence and Empirical Processes. Springer-
Verlag, 1996. [15, 283]
14. F. Lin, R. Qiu, Z. Hu, S. Hou, J. Browning, and M. Wicks, “Generalized fmd detection for
spectrum sensing under low signal-to-noise ratio,” IEEE Communications Letters, to appear.
[18, 221, 226, 227]
15. E. Carlen, “Trace inequalities and quantum entropy: an introductory course,” Entropy and the
quantum: Arizona School of Analysis with Applications, March 16–20, 2009, University of
Arizona, vol. 529, 2010. [18, 35, 39]
16. F. Zhang, Matrix Theory. Springer Ver, 1999. [18, 19, 96, 215, 311, 436, 472, 493, 498]
17. K. Abadir and J. Magnus, Matrix Algebra. Cambridge Press, 2005. [18, 19, 20, 445]
18. D. S. Bernstein, Matrix Mathematics: Theory, Facts, and Formulas. Princeton University
Press, 2009. [20, 98, 204, 333]
19. J. A. Tropp, “User-friendly tail bounds for sums of random matrices.” Preprint, 2011. [20]


20. N. J. Higham, Functions of Matrices: Theory and Computation. Society for Industrial and
Applied Mathematics, 2008. [20, 33, 86, 211, 212, 213, 264, 496]
21. L. Trefethen and M. Embree, Spectra and Pseudospectra: The Behavior of Nonnormal
Matrices and Operators. Princeton University Press, 2005. [20]
22. R. Bhatia, Positive Definite Matrices. Princeton University Press, 2007. [20, 38, 44, 98]
23. R. Bhatia, Matrix analysis. Springer, 1997. [33, 39, 41, 45, 92, 98, 153, 199, 210, 214, 215,
264, 488, 500]
24. A. W. Marshall, I. Olkin, and B. C. Arnold, Inequalities: Theory of Majorization and Its
Applications. Springer Verl, 2011. [20]
25. D. Watkins, Fundamentals of Matrix Computations. Wiley, third ed., 2010. [21]
26. J. A. Tropp, “On the conditioning of random subdictionaries,” Applied and Computational
Harmonic Analysis, vol. 25, no. 1, pp. 1–24, 2008. [27]
27. M. Ledoux and M. Talagrand, Probability in Banach spaces. Springer, 1991. [27, 28, 69, 71,
75, 76, 77, 163, 165, 166, 168, 182, 184, 185, 192, 198, 199, 307, 309, 389, 391, 394, 396,
404, 535]
28. J. A. Tropp, J. N. Laska, M. F. Duarte, J. K. Romberg, and R. G. Baraniuk, “Beyond nyquist:
Efficient sampling of sparse bandlimited signals,” Information Theory, IEEE Transactions on,
vol. 56, no. 1, pp. 520–544, 2010. [27, 71, 384]
29. J. Nelson, “Johnson-lindenstrauss notes,” tech. rep., Technical report, MIT-CSAIL, Available
at web.mit.edu/minilek/www/jl notes.pdf, 2010. [27, 70, 379, 380, 381, 384]
30. H. Rauhut, “Compressive sensing and structured random matrices,” Theoretical foundations
and numerical methods for sparse recovery, vol. 9, pp. 1–92, 2010. [28, 164, 379, 394]
31. F. Krahmer, S. Mendelson, and H. Rauhut, “Suprema of chaos processes and the restricted
isometry property,” arXiv preprint arXiv:1207.0235, 2012. [29, 65, 398, 399, 400, 401, 407,
408]
32. R. Latala, “On weak tail domination of random vectors,” Bull. Pol. Acad. Sci. Math., vol. 57,
pp. 75–80, 2009. [29]
33. W. Bednorz and R. Latala, “On the suprema of bernoulli processes,” Comptes Rendus
Mathematique, 2013. [29, 75, 76]
34. B. Gnedenko and A. Kolmogorov, Limit Distributions for Sums Independent Random
Variables. Addison-Wesley, 1954. [30]
35. R. Vershynin, “A note on sums of independent random matrices after ahlswede-winter.” http://
www-personal.umich.edu/∼romanv/teaching/reading-group/ahlswede-winter.pdf. Seminar
Notes. [31, 91]
36. R. Ahlswede and A. Winter, “Strong converse for identification via quantum channels,”
Information Theory, IEEE Transactions on, vol. 48, no. 3, pp. 569–579, 2002. [31, 85, 87,
94, 95, 96, 97, 106, 107, 108, 122, 123, 132]
37. T. Fine, Probability and Probabilistic Reasoning for Electrical Engineering. Pearson-Prentice
Hall, 2006. [32]
38. R. Oliveira, “Sums of random hermitian matrices and an inequality by rudelson,” Elect.
Comm. Probab, vol. 15, pp. 203–212, 2010. [34, 94, 95, 108, 473]
39. T. Rockafellar, Conjugative duality and optimization. Philadephia: SIAM, 1974. [36]
40. D. Petz, “A suvery of trace inequalities.” Functional Analysis and Operator Theory, 287–298,
Banach Center Publications, 30 (Warszawa), 1994. www.renyi.hu/∼petz/pdf/64.pdf. [38, 39]
41. R. Vershynin, “Golden-thompson inequality.” www-personal.umich.edu/∼romanv/teaching/
reading-group/golden-thompson.pdf. Seminar Notes. [39]
42. I. Dhillon and J. Tropp, “Matrix nearness problems with bregman divergences,” SIAM Journal
on Matrix Analysis and Applications, vol. 29, no. 4, pp. 1120–1146, 2007. [40]
43. J. Tropp, “From joint convexity of quantum relative entropy to a concavity theorem of lieb,”
in Proc. Amer. Math. Soc, vol. 140, pp. 1757–1760, 2012. [41, 128]
44. G. Lindblad, “Expectations and entropy inequalities for finite quantum systems,” Communi-
cations in Mathematical Physics, vol. 39, no. 2, pp. 111–119, 1974. [41]
45. E. G. Effros, “A matrix convexity approach to some celebrated quantum inequalities,”
vol. 106, pp. 1006–1008, National Acad Sciences, 2009. [41]

46. E. Carlen and E. Lieb, “A minkowski type trace inequality and strong subadditivity of
quantum entropy ii: convexity and concavity,” Letters in Mathematical Physics, vol. 83, no. 2,
pp. 107–126, 2008. [41, 42]
47. T. Rockafellar, Conjugate duality and optimization. SIAM, 1974. Regional conference series
in applied mathematics. [41]
48. S. Boyd and L. Vandenberghe, Convex optimization. Cambridge Univ Pr, 2004. [42, 48, 210,
217, 418, 489]
49. J. Tropp, “Freedman’s inequality for matrix martingales,” Electron. Commun. Probab, vol. 16,
pp. 262–270, 2011. [42, 128, 137]
50. E. Lieb, “Convex trace functions and the wigner-yanase-dyson conjecture,” Advances in
Mathematics, vol. 11, no. 3, pp. 267–288, 1973. [42, 98, 129]
51. V. I. Paulsen, Completely Bounded Maps and Operator Algebras. Cambridge Press, 2002.
[43]
52. T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley. [44]
53. J. Tropp, “User-friendly tail bounds for sums of random matrices,” Foundations of Computa-
tional Mathematics, vol. 12, no. 4, pp. 389–434, 2011. [45, 47, 95, 107, 110, 111, 112, 115,
116, 121, 122, 127, 128, 131, 132, 140, 144, 493]
54. F. Hansen and G. Pedersen, “Jensen’s operator inequality,” Bulletin of the London Mathemat-
ical Society, vol. 35, no. 4, pp. 553–564, 2003. [45]
55. P. Halmos, Finite-Dimensional Vector Spaces. Springer, 1958. [46]
56. V. De la Peña and E. Giné, Decoupling: from dependence to independence. Springer Verlag,
1999. [47, 404]
57. R. Vershynin, “A simple decoupling inequality in probability theory,” May 2011. [47, 51]
58. P. Billingsley, Probability and Measure. Wiley, 2008. [51, 523]
59. R. Dudley, Real analysis and probability, vol. 74. Cambridge University Press, 2002. [51,
230, 232]
60. M. A. Arcones and E. Giné, “On decoupling, series expansions, and tail behavior of chaos
processes,” Journal of Theoretical Probability, vol. 6, no. 1, pp. 101–122, 1993. [52]
61. D. L. Hanson and F. T. Wright, “A bound on tail probabilities for quadratic forms in
independent random variables,” The Annals of Mathematical Statistics, pp. 1079–1083, 1971.
[53, 73, 380]
62. S. Boucheron, G. Lugosi, and P. Massart, “Concentration inequalities using the entropy
method,” The Annals of Probability, vol. 31, no. 3, pp. 1583–1614, 2003. [53, 198, 397,
404]
63. T. Tao, Topics in Random Matrix Theory. Amer Mathematical Society, 2012. [54, 55, 216,
238, 239, 241, 242, 355, 357, 358]
64. E. Wigner, “Distribution laws for the roots of a random hermitian matrix,” Statistical Theories
of Spectra: Fluctuations, pp. 446–461, 1965. [55]
65. M. Mehta, Random matrices, vol. 142. Academic press, 2004. [55]
66. D. Voiculescu, “Limit laws for random matrices and free products,” Inventiones mathemati-
cae, vol. 104, no. 1, pp. 201–220, 1991. [55, 236, 363]
67. J. Wishart, “The generalised product moment distribution in samples from a normal multi-
variate population,” Biometrika, vol. 20, no. 1/2, pp. 32–52, 1928. [56]
68. P. Hsu, “On the distribution of roots of certain determinantal equations,” Annals of Human
Genetics, vol. 9, no. 3, pp. 250–258, 1939. [56, 137]
69. U. Haagerup and S. Thorbjørnsen, “Random matrices with complex gaussian entries,”
Expositiones Mathematicae, vol. 21, no. 4, pp. 293–337, 2003. [57, 58, 59, 202, 203, 214]
70. A. Erdelyi, W. Magnus, Oberhettinger, and F. Tricomi, eds., Higher Transcendental Func-
tions, Vol. 1–3. McGraw-Hill, 1953. [57]
71. J. Harer and D. Zagier, “The euler characteristic of the moduli space of curves,” Inventiones
Mathematicae, vol. 85, no. 3, pp. 457–485, 1986. [58]
72. R. Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” Arxiv
preprint arXiv:1011.3027v5, July 2011. [60, 66, 67, 68, 313, 314, 315, 316, 317, 318, 319,
320, 321, 324, 325, 349, 462, 504]

73. D. Garling, Inequalities: a journey into linear analysis. Cambridge University Press, 2007.
[61]
74. V. Buldygin and S. Solntsev, Asymptotic behaviour of linearly transformed sums of random
variables. Kluwer, 1997. [60, 61, 63, 76, 77]
75. J. Kahane, Some random series of functions. Cambridge Univ Press, 2nd ed., 1985. [60]
76. M. Rudelson, “Lecture notes on non-asymptotic theory of random matrices,” arXiv preprint
arXiv:1301.2382, 2013. [63, 64, 68, 336]
77. V. Yurinsky, Sums and Gaussian vectors. Springer-Verlag, 1995. [69]
78. U. Haagerup, “The best constants in the khintchine inequality,” Studia Math., vol. 70, pp. 231–
283, 1981. [69]
79. R. Latala, P. Mankiewicz, K. Oleszkiewicz, and N. Tomczak-Jaegermann, “Banach-mazur
distances and projections on random subgaussian polytopes,” Discrete & Computational
Geometry, vol. 38, no. 1, pp. 29–50, 2007. [72, 73, 295, 296, 297, 298]
80. E. D. Gluskin and S. Kwapien, “Tail and moment estimates for sums of independent random
variable,” Studia Math., vol. 114, pp. 303–309, 1995. [75]
81. M. Talagrand, The generic chaining: upper and lower bounds of stochastic processes.
Springer Verlag, 2005. [75, 163, 165, 192, 394, 396, 404, 405, 406, 407]
82. M. Talagrand, Upper and Lower Bounds for Stochastic Processes, Modern Methods and
Classical Problems. Springer-Verlag, in press. Ergebnisse der Mathematik. [75, 199]
83. R. M. Dudley, “The sizes of compact subsets of hilbert space and continuity of gaussian
processes,” J. Funct. Anal, vol. 1, no. 3, pp. 290–330, 1967. [75, 406]
84. X. Fernique, “Régularité des trajectoires des fonctions aléatoires gaussiennes,” Ecole d’Eté
de Probabilités de Saint-Flour IV-1974, pp. 1–96, 1975. [75, 407]
85. M. Talagrand, “Regularity of gaussian processes,” Acta mathematica, vol. 159, no. 1, pp. 99–
149, 1987. [75, 407]
86. R. Bhattacharya and R. Rao, Normal approximation and asymptotic expansions, vol. 64.
Society for Industrial & Applied, 1986. [76, 77, 78]
87. L. Chen, L. Goldstein, and Q. Shao, Normal Approximation by Stein’s Method. Springer,
2010. [76, 221, 236, 347, 523]
88. A. Kirsch, An introduction to the mathematical theory of inverse problems, vol. 120. Springer
Science+ Business Media, 2011. [79, 80, 289]
89. D. Porter and D. S. Stirling, Integral equations: a practical treatment, from spectral theory to
applications, vol. 5. Cambridge University Press, 1990. [79, 80, 82, 288, 289, 314]
90. U. Grenander, Probabilities on Algebraic Structures. New York: Wiley, 1963. [85]
91. N. Harvey, “C&o 750: Randomized algorithms winter 2011 lecture 11 notes.” http://www.
math.uwaterloo.ca/∼harvey/W11/, Winter 2011. [87, 95, 473]
92. N. Harvey, “Lecture 12 concentration for sums of random matrices and lecture 13 the
ahlswede-winter inequality.” http://www.cs.ubc.ca/∼nickhar/W12/, Febuary 2012. Lecture
Notes for UBC CPSC 536N: Randomized Algorithms. [87, 89, 90, 95]
93. M. Rudelson, “Random vectors in the isotropic position,” Journal of Functional Analysis,
vol. 164, no. 1, pp. 60–72, 1999. [90, 95, 277, 278, 279, 280, 281, 282, 283, 284, 292, 306,
307, 308, 441, 442]
94. A. Wigderson and D. Xiao, “Derandomizing the ahlswede-winter matrix-valued chernoff
bound using pessimistic estimators, and applications,” Theory of Computing, vol. 4, no. 1,
pp. 53–76, 2008. [91, 95, 107]
95. D. Gross, Y. Liu, S. Flammia, S. Becker, and J. Eisert, “Quantum state tomography via
compressed sensing,” Physical review letters, vol. 105, no. 15, p. 150401, 2010. [93, 107]
96. D. Dubhashi and A. Panconesi, Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge Univ Press, 2009. [97]
97. H. Ngo, “Cse 694: Probabilistic analysis and randomized algorithms.” http://www.cse.buffalo.
edu/∼hungngo/classes/2011/Spring-694/lectures/l4.pdf, Spring 2011. SUNY at Buffalo. [97]
98. O. Bratteli and D. W. Robinson, Operator Algebras amd Quantum Statistical Mechanics I.
Springer-Verlag, 1979. [97]

99. D. Voiculescu, K. Dykema, and A. Nica, Free Random Variables. American Mathematical
Society, 1992. [97]
100. P. J. Schreier and L. L. Scharf, Statistical Signal Processing of Complex-Valued Data: The Theory of Improper and Noncircular Signals. Cambridge University Press, 2010. [99]
101. J. Lawson and Y. Lim, “The geometric mean, matrices, metrics, and more,” The American
Mathematical Monthly, vol. 108, no. 9, pp. 797–812, 2001. [99]
102. D. Gross, “Recovering low-rank matrices from few coefficients in any basis,” Information
Theory, IEEE Transactions on, vol. 57, no. 3, pp. 1548–1566, 2011. [106, 433, 443]
103. B. Recht, “A simpler approach to matrix completion,” Arxiv preprint arxiv:0910.0651, 2009.
[106, 107, 429]
104. B. Recht, “A simpler approach to matrix completion,” The Journal of Machine Learning
Research, vol. 7777777, pp. 3413–3430, 2011. [106, 431, 432, 433]
105. R. Ahlswede and A. Winter, “Addendum to strong converse for identification via quantum
channels,” Information Theory, IEEE Transactions on, vol. 49, no. 1, p. 346, 2003. [107]
106. R. Latala, “Some estimates of norms of random matrices,” AMERICAN MATHEMATICAL
SOCIETY, vol. 133, no. 5, pp. 1273–1282, 2005. [118]
107. Y. Seginer, “The expected norm of random matrices,” Combinatorics Probability and
Computing, vol. 9, no. 2, pp. 149–166, 2000. [118, 223, 330]
108. P. Massart, Concentration Inequalities and Model Selection. Springer, 2007. [120, 457]
109. M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes.
Springer, 1991. [120]
110. R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge Univ Press, 1995. [122]
111. A. Gittens and J. Tropp, “Tail bounds for all eigenvalues of a sum of random matrices,” Arxiv
preprint arXiv:1104.4513, 2011. [128, 129, 131, 144]
112. E. Lieb and R. Seiringer, “Stronger subadditivity of entropy,” Physical Review A, vol. 71,
no. 6, p. 062329, 2005. [128, 129]
113. D. Hsu, S. Kakade, and T. Zhang, “Tail inequalities for sums of random matrices that depend
on the intrinsic dimension,” 2011. [137, 138, 139, 267, 462, 463, 478, 562, 564]
114. B. Schoelkopf, A. Smola, and K. Mueller, Kernel principal component analysis, ch. Kernel principal component analysis, pp. 327–352. MIT Press, 1999. [137]
115. A. Magen and A. Zouzias, “Low rank matrix-valued chernoff bounds and approximate matrix
multiplication,” in Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on
Discrete Algorithms, pp. 1422–1436, SIAM, 2011. [137]
116. S. Minsker, “On some extensions of bernstein’s inequality for self-adjoint operators,” Arxiv
preprint arXiv:1112.5448, 2011. [139, 140]
117. G. Peshkir and A. Shiryaev, “The khintchine inequalities and martingale expanding sphere of
their action,” Russian Mathematical Surveys, vol. 50, no. 5, pp. 849–904, 1995. [141]
118. N. Tomczak-Jaegermann, “The moduli of smoothness and convexity and the Rademacher averages of trace classes Sp (1 ≤ p < ∞),” Studia Math, vol. 50, pp. 163–182, 1974. [141, 300]
119. F. Lust-Piquard, “Inégalités de Khintchine dans Cp (1 < p < ∞),” CR Acad. Sci. Paris, vol. 303, pp. 289–292, 1986. [142]
120. G. Pisier, “Non-commutative vector valued lp-spaces and completely p-summing maps,”
Astérisque, vol. 247, p. 131, 1998. [142]
121. A. Buchholz, “Operator khintchine inequality in non-commutative probability,” Mathematis-
che Annalen, vol. 319, no. 1, pp. 1–16, 2001. [142, 534]
122. A. So, “Moment inequalities for sums of random matrices and their applications in optimiza-
tion,” Mathematical programming, vol. 130, no. 1, pp. 125–151, 2011. [142, 530, 534, 537,
542, 548]
123. N. Nguyen, T. Do, and T. Tran, “A fast and efficient algorithm for low-rank approximation
of a matrix,” in Proceedings of the 41st annual ACM symposium on Theory of computing,
pp. 215–224, ACM, 2009. [143, 156, 159]
124. N. Nguyen, P. Drineas, and T. Tran, “Matrix sparsification via the khintchine inequality,”
2009. [143, 564]

125. M. de Carli Silva, N. Harvey, and C. Sato, “Sparse sums of positive semidefinite matrices,”
2011. [144]
126. L. Mackey, M. Jordan, R. Chen, B. Farrell, and J. Tropp, “Matrix concentration inequalities
via the method of exchangeable pairs,” Arxiv preprint arXiv:1201.6002, 2012. [144]
127. L. Rosasco, M. Belkin, and E. D. Vito, “On learning with integral operators,” The Journal of
Machine Learning Research, vol. 11, pp. 905–934, 2010. [144, 289, 290]
128. L. Rosasco, M. Belkin, and E. De Vito, “A note on learning with integral operators,” [144,
289]
129. P. Drineas and A. Zouzias, “A note on element-wise matrix sparsification via matrix-valued
chernoff bounds,” Preprint, 2010. [144]
130. R. CHEN, A. GITTENS, and J. TROPP, “The masked sample covariance estimator: An
analysis via matrix concentration inequalities,” Information and Inference: A Journal of the
IMA, pp. 1–19, 2012. [144, 463, 466]
131. M. Ledoux, The Concentration of Measure Phenomenon. American Mathematical Society, 2000. [145, 146]
132. D. Donoho et al., “High-dimensional data analysis: The curses and blessings of dimensional-
ity,” AMS Math Challenges Lecture, pp. 1–32, 2000. [146, 267, 562]
133. S. Chatterjee, “Stein’s method for concentration inequalities,” Probability theory and related
fields, vol. 138, no. 1, pp. 305–321, 2007. [146, 159]
134. B. Laurent and P. Massart, “Adaptive estimation of a quadratic functional by model selection,”
The annals of Statistics, vol. 28, no. 5, pp. 1302–1338, 2000. [147, 148, 485, 504]
135. G. Raskutti, M. Wainwright, and B. Yu, “Minimax rates of estimation for high-dimensional linear regression over ℓq-balls,” Information Theory, IEEE Transactions on, vol. 57, no. 10, pp. 6976–6994, 2011. [147, 166, 169, 170, 171, 424, 426, 510]
136. L. Birgé and P. Massart, “Minimum contrast estimators on sieves: exponential bounds and
rates of convergence,” Bernoulli, vol. 4, no. 3, pp. 329–375, 1998. [148, 254]
137. I. Johnstone, State of the Art in Probability and Statastics, vol. 31, ch. Chi-square oracle
inequalities, pp. 399–418. Institute of Mathematical Statistics, ims lecture notes ed., 2001.
[148]
138. M. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso),” Information Theory, IEEE Transactions on, vol. 55, no. 5, pp. 2183–2202, 2009. [148, 175, 176, 182]
139. A. Amini and M. Wainwright, “High-dimensional analysis of semidefinite relaxations for
sparse principal components,” in Information Theory, 2008. ISIT 2008. IEEE International
Symposium on, pp. 2454–2458, IEEE, 2008. [148, 178, 180]
140. W. Johnson and G. Schechtman, “Remarks on talagrand’s deviation inequality for rademacher
functions,” Functional Analysis, pp. 72–77, 1991. [149]
141. M. Ledoux, The concentration of measure phenomenon, vol. 89. Amer Mathematical Society,
2001. [150, 151, 152, 153, 154, 155, 158, 163, 166, 169, 176, 177, 181, 182, 185, 188, 190,
194, 198, 232, 248, 258, 262, 313, 428]
142. M. Talagrand, “A new look at independence,” The Annals of probability, vol. 24, no. 1, pp. 1–
34, 1996. [152, 313]
143. S. Chatterjee, “Matrix estimation by universal singular value thresholding,” arXiv preprint
arXiv:1212.1247, 2012. [152]
144. R. Bhatia, C. Davis, and A. McIntosh, “Perturbation of spectral subspaces and solution of
linear operator equations,” Linear Algebra and its Applications, vol. 52, pp. 45–67, 1983.
[153]
145. K. Davidson and S. Szarek, “Local operator theory, random matrices and banach spaces,”
Handbook of the geometry of Banach spaces, vol. 1, pp. 317–366, 2001. [153, 159, 160, 161,
173, 191, 282, 313, 325, 427, 486]
146. A. DasGupta, Probability for Statistics and Machine Learning. Springer, 2011. [155]
147. W. Hoeffding, “A combinatorial central limit theorem,” The Annals of Mathematical Statis-
tics, vol. 22, no. 4, pp. 558–566, 1951. [159]
148. M. Talagrand, “Concentration of measure and isoperimetric inequalities in product spaces,”
Publications Mathematiques de l’IHES, vol. 81, no. 1, pp. 73–205, 1995. [162, 216, 254,
313]
149. M. Ledoux, “Deviation inequalities on largest eigenvalues,” Geometric aspects of functional
analysis, pp. 167–219, 2007. [162, 190, 193, 325]
150. H. Rauhut, Theoretical Foundations and Numerical Method for Sparse Recovery, ch. Com-
pressed Sensing and Structured Random Matrices, pp. 1–92. Berlin/New York: De Gruyter,
2010. [162]
151. J.-M. Azaı̈s and M. Wschebor, Level sets and extrema of random processes and fields. Wiley,
2009. [162, 163]
152. G. Pisier, The volume of convex bodies and Banach space geometry, vol. 94. Cambridge Univ
Pr, 1999. [163, 202, 273, 292, 315, 386, 463]
153. Y. Gordon, A. Litvak, S. Mendelson, and A. Pajor, “Gaussian averages of interpolated bodies
and applications to approximate reconstruction,” Journal of Approximation Theory, vol. 149,
no. 1, pp. 59–73, 2007. [166, 429]
154. J. Matousek, Lectures on discrete geometry, vol. 212. Springer, 2002. [172, 184, 425]
155. S. Negahban and M. Wainwright, “Estimation of (near) low-rank matrices with noise and
high-dimensional scaling,” The Annals of Statistics, vol. 39, no. 2, pp. 1069–1097, 2011.
[174, 178, 181, 182, 185, 188, 416, 417, 418, 419, 421, 422]
156. P. Zhang and R. Qiu, “GLRT-based spectrum sensing with blindly learned feature under rank-1
assumption,” IEEE Trans. Communications. to appear. [177, 233, 347, 514]
157. P. Zhang, R. Qiu, and N. Guo, “Demonstration of Spectrum Sensing with Blindly Learned
Feature,” IEEE Communications Letters, vol. 15, pp. 548–550, May 2011. [177, 233, 347,
491, 514]
158. S. Hou, R. Qiu, J. P. Browning, and M. C. Wicks, “Spectrum sensing in cognitive radio with robust
principal component analysis,” in IEEE Waveform Diversity and Design Conference 2012,
(Kauai, Hawaii), January 2012. [177, 491]
159. S. Hou, R. Qiu, J. P. Browning, and M. C. Wicks, “Spectrum sensing in cognitive radio
with subspace matching,” in IEEE Waveform Diversity and Design Conference 2012, (Kauai,
Hawaii), January 2012. [233, 268, 491, 514]
160. S. Hou, R. Qiu, Z. Chen, and Z. Hu, “SVM and Dimensionality Reduction in Cognitive Radio
with Experimental Validation,” Arxiv preprint arXiv:1106.2325, submitted to EURASIP
Journal on Advances in Signal Processing, 2011. [177]
161. P. Massart, “Concentration inequalities and model selection,” 2007. [177, 254, 256]
162. L. Birgé, “An alternative point of view on lepski’s method,” Lecture Notes-Monograph Series,
pp. 113–133, 2001. [179]
163. N. Karoui, “Operator norm consistent estimation of large-dimensional sparse covariance
matrices,” The Annals of Statistics, pp. 2717–2756, 2008. [179]
164. P. Loh and M. Wainwright, “High-dimensional regression with noisy and missing data:
Provable guarantees with non-convexity,” arXiv preprint arXiv:1109.3714, 2011. [186, 187,
188, 189]
165. E. Meckes, “Approximation of projections of random vectors,” Journal of Theoretical
Probability, vol. 25, no. 2, pp. 333–352, 2012. [195, 196]
166. A. Efimov, “Modulus of continuity,” Encyclopaedia of Mathematics. Springer, 2001. [195]
167. V. Milman and G. Schechtman, Asymptotic theory of finite dimensional normed spaces,
vol. 1200. Springer Verlag, 1986. [195, 258, 273, 280, 292]
168. E. Meckes, “Projections of probability distributions: A measure-theoretic dvoretzky theorem,”
Geometric Aspects of Functional Analysis, pp. 317–326, 2012. [196]
169. O. Bousquet, “A bennett concentration inequality and its application to suprema of empirical
processes,” Comptes Rendus Mathematique, vol. 334, no. 6, pp. 495–500, 2002. [198]
170. O. Bousquet, Concentration inequalities and empirical processes theory applied to the
analysis of learning algorithms. PhD thesis, PhD thesis, Ecole Polytechnique, 2002. [198]
171. S. Boucheron, G. Lugosi, and O. Bousquet, “Concentration inequalities,” Advanced Lectures
on Machine Learning, pp. 208–240, 2004. [198]
172. M. Ledoux, “On talagrand’s deviation inequalities for product measures,” ESAIM: Probability
and statistics, vol. 1, pp. 63–87, 1996. [198]
173. P. Massart, “About the constants in talagrand’s concentration inequalities for empirical
processes,” Annals of Probability, pp. 863–884, 2000. [198]
174. E. Rio, “Inégalités de concentration pour les processus empiriques de classes de parties,”
Probability Theory and Related Fields, vol. 119, no. 2, pp. 163–175, 2001. [198]
175. S. Boucheron, O. Bousquet, G. Lugosi, and P. Massart, “Moment inequalities for functions of
independent random variables,” The Annals of Probability, vol. 33, no. 2, pp. 514–560, 2005.
[198]
176. S. Boucheron, O. Bousquet, G. Lugosi, et al., “Theory of classification: A survey of some
recent advances,” ESAIM Probability and statistics, vol. 9, pp. 323–375, 2005. []
177. S. Boucheron, G. Lugosi, P. Massart, et al., “On concentration of self-bounding functions,”
Electronic Journal of Probability, vol. 14, no. 64, pp. 1884–1899, 2009. []
178. S. Boucheron, P. Massart, et al., “A high-dimensional wilks phenomenon,” Probability theory
and related fields, vol. 150, no. 3, p. 405, 2011. [198]
179. A. Connes, “Classification of injective factors,” Ann. of Math, vol. 104, no. 2, pp. 73–115,
1976. [203]
180. A. Guionnet and O. Zeitouni, “Concentration of the spectral measure for large matrices,”
Electron. Comm. Probab, vol. 5, pp. 119–136, 2000. [204, 218, 219, 227, 243, 244, 247, 268,
494]
181. I. N. Bronshtein, K. A. Semendiaev, and K. A. Hirsch, Handbook of mathematics. Van
Nostrand Reinhold New York, NY, 5th ed., 2007. [204, 208]
182. A. Khajehnejad, S. Oymak, and B. Hassibi, “Subspace expanders and matrix rank minimiza-
tion,” arXiv preprint arXiv:1102.3947, 2011. [205]
183. M. Meckes, “Concentration of norms and eigenvalues of random matrices,” Journal of
Functional Analysis, vol. 211, no. 2, pp. 508–524, 2004. [206, 224]
184. C. Davis, “All convex invariant functions of hermitian matrices,” Archiv der Mathematik,
vol. 8, no. 4, pp. 276–278, 1957. [206]
185. L. Li, “Concentration of measure for random matrices.” private communication, October
2012. Tennessee Technological University. [207]
186. N. Berestycki and R. Nickl, “Concentration of measure,” Technical report,
University of Cambridge, 2009. [218, 219]
187. R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1994. [219,
465]
188. Y. Zeng and Y. Liang, “Maximum-minimum eigenvalue detection for cognitive radio,” in
IEEE 18th International Symposium on Personal, Indoor and Mobile Radio Communications
(PIMRC) 2007, pp. 1–5, 2007. [221]
189. V. Petrov and A. Brown, Sums of independent random variables, vol. 197. Springer-Verlag
Berlin, 1975. [221]
190. M. Meckes and S. Szarek, “Concentration for noncommutative polynomials in random
matrices,” in Proc. Amer. Math. Soc, vol. 140, pp. 1803–1813, 2012. [222, 223, 269]
191. G. W. Anderson, “Convergence of the largest singular value of a polynomial in independent
wigner matrices,” arXiv preprint arXiv:1103.4825, 2011. [222]
192. E. Meckes and M. Meckes, “Concentration and convergence rates for spectral measures of
random matrices,” Probability Theory and Related Fields, pp. 1–20, 2011. [222, 233, 234,
235]
193. R. Serfling, “Approximation theorems of mathematical statistics (wiley series in probability
and statistics),” 1981. [223]
194. R. Latala, “Some estimates of norms of random matrices,” Proceedings of the American
Mathematical Society, pp. 1273–1282, 2005. [223, 330]
195. S. Lang, Real and functional analysis. 1993. [224]
196. A. Guntuboyina and H. Leeb, “Concentration of the spectral measure of large wishart matrices
with dependent entries,” Electron. Commun. Probab, vol. 14, pp. 334–342, 2009. [224, 225,
226]
197. M. Ledoux, “Concentration of measure and logarithmic sobolev inequalities,” Seminaire de
probabilites XXXIII, pp. 120–216, 1999. [224, 244]
198. Z. Bai, “Methodologies in spectral analysis of large-dimensional random matrices, a review,”
Statist. Sinica, vol. 9, no. 3, pp. 611–677, 1999. [225, 262, 502]
199. B. Delyon, “Concentration inequalities for the spectral measure of random matrices,”
Electronic Communications in Probability, pp. 549–562, 2010. [226, 227]
200. F. Lin, R. Qiu, Z. Hu, S. Hou, J. P. Browning, and M. C. Wicks, “Cognitive Radio Network
as Sensors: Low Signal-to-Noise Ratio Collaborative Spectrum Sensing,” in IEEE Waveform
Diversity and Design Conference, 2012. Kauai, Hawaii. [226]
201. S. Chatterjee, “Fluctuations of eigenvalues and second order poincaré inequalities,” Probabil-
ity Theory and Related Fields, vol. 143, no. 1, pp. 1–40, 2009. [227, 228, 229, 230]
202. S. Chatterjee, “A new method of normal approximation,” The Annals of Probability, vol. 36,
no. 4, pp. 1584–1610, 2008. [229]
203. I. Johnstone, “High dimensional statistical inference and random matrices,” Arxiv preprint
math/0611589, 2006. [230, 490]
204. T. Jiang, “Approximation of haar distributed matrices and limiting distributions of eigenvalues
of jacobi ensembles,” Probability theory and related fields, vol. 144, no. 1, pp. 221–246, 2009.
[230, 231]
205. R. Bhatia, L. Elsner, and G. Krause, “Bounds for the variation of the roots of a polynomial
and the eigenvalues of a matrix,” Linear Algebra and Its Applications, vol. 142, pp. 195–209,
1990. [231]
206. N. Gozlan, “A characterization of dimension free concentration in terms of transportation
inequalities,” The Annals of Probability, vol. 37, no. 6, pp. 2480–2498, 2009. [232]
207. P. Deift and D. Gioev, Random matrix theory: invariant ensembles and universality, vol. 18.
Amer Mathematical Society, 2009. [232]
208. G. W. Anderson, A. Guionnet, and O. Zeitouni, An Introduction to Random Matrices.
Cambridge University Press, 2010. [232, 488, 522]
209. R. Speicher, Free probability theory and non-crossing partitions. 39 Seminaire Lotharingien
de Combinatoire, 1997. [234, 236]
210. V. Kargin, “A concentration inequality and a local law for the sum of two random matrices,”
Probability Theory and Related Fields, pp. 1–26, 2010. [235, 236]
211. H. Weyl, “Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differential-
gleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung),” Mathematische
Annalen, vol. 71, no. 4, pp. 441–479, 1912. [235]
212. A. Horn, “Eigenvalues of sums of hermitian matrices,” Pacific J. Math, vol. 12, no. 1, 1962.
[235]
213. S. Chatterjee, “Concentration of haar measures, with an application to random matrices,”
Journal of Functional Analysis, vol. 245, no. 2, pp. 379–389, 2007. [236]
214. S. Chatterjee and M. Ledoux, “An observation about submatrices,” Electronic Communica-
tions in Probability, vol. 14, pp. 495–500, 2009. [237]
215. E. Wigner, “On the distribution of the roots of certain symmetric matrices,” The Annals of
Mathematics, vol. 67, no. 2, pp. 325–327, 1958. [238]
216. A. Soshnikov, “Universality at the edge of the spectrum in wigner random matrices,”
Communications in mathematical physics, vol. 207, pp. 697–733, 1999. [243, 245]
217. A. Soshnikov, “Poisson statistics for the largest eigenvalues of wigner random matrices with
heavy tails,” Electron. Comm. Probab, vol. 9, pp. 82–91, 2004. [243]
218. A. Soshnikov, “A note on universality of the distribution of the largest eigenvalues in certain
sample covariance matrices,” Journal of Statistical Physics, vol. 108, no. 5, pp. 1033–1056,
2002. []
219. A. Soshnikov, “Level spacings distribution for large random matrices: Gaussian fluctuations,”
Annals of mathematics, pp. 573–617, 1998. []
220. A. Soshnikov and Y. Fyodorov, “On the largest singular values of random matrices with
independent cauchy entries,” Journal of mathematical physics, vol. 46, p. 033302, 2005. []
221. T. Tao and V. Vu, “Random covariance matrices: Universality of local statistics of eigenval-
ues,” Arxiv preprint arxiv:0912.0966, 2009. [243]
222. Z. Füredi and J. Komlós, “The eigenvalues of random symmetric matrices,” Combinatorica,
vol. 1, no. 3, pp. 233–241, 1981. [245]
223. M. Krivelevich and V. Vu, “Approximating the independence number and the chromatic
number in expected polynomial time,” Journal of combinatorial optimization, vol. 6, no. 2,
pp. 143–155, 2002. [245]
224. A. Guionnet, “Lecture notes, minneapolis,” 2012. [247, 249]
225. C. Bordenave, P. Caputo, and D. Chafaı̈, “Spectrum of non-hermitian heavy tailed random
matrices,” Communications in Mathematical Physics, vol. 307, no. 2, pp. 513–560, 2011.
[247]
226. A. Guionnet and B. Zegarlinski, “Lectures on logarithmic sobolev inequalities,” Séminaire de
Probabilités, XXXVI, vol. 1801, pp. 1–134, 2003. [248]
227. D. Ruelle, Statistical mechanics: rigorous results. Amsterdam: Benjamin, 1969. [248]
228. T. Tao and V. Vu, “Random matrices: Sharp concentration of eigenvalues,” Arxiv preprint
arXiv:1201.4789, 2012. [249]
229. N. El Karoui, “Concentration of measure and spectra of random matrices: applications to
correlation matrices, elliptical distributions and beyond,” The Annals of Applied Probability,
vol. 19, no. 6, pp. 2362–2405, 2009. [250, 252, 253, 261, 262, 268]
230. N. El Karoui, “The spectrum of kernel random matrices,” The Annals of Statistics, vol. 38,
no. 1, pp. 1–50, 2010. [254, 268]
231. M. Talagrand, “New concentration inequalities in product spaces,” Inventiones Mathematicae,
vol. 126, no. 3, pp. 505–563, 1996. [254, 313, 404]
232. L. Birgé and P. Massart, “Gaussian model selection,” Journal of the European Mathematical
Society, vol. 3, no. 3, pp. 203–268, 2001. [254]
233. L. Birgé and P. Massart, “Minimal penalties for gaussian model selection,” Probability theory
and related fields, vol. 138, no. 1, pp. 33–73, 2007. [254]
234. P. Massart, “Some applications of concentration inequalities to statistics,” in Annales-Faculte
des Sciences Toulouse Mathematiques, vol. 9, pp. 245–303, Université Paul Sabatier, 2000.
[254]
235. I. Bechar, “A bernstein-type inequality for stochastic processes of quadratic forms of gaussian
variables,” arXiv preprint arXiv:0909.3595, 2009. [254, 548, 552]
236. K. Wang, A. So, T. Chang, W. Ma, and C. Chi, “Outage constrained robust transmit
optimization for multiuser miso downlinks: Tractable approximations by conic optimization,”
arXiv preprint arXiv:1108.0982, 2011. [255, 543, 544, 545, 547, 548, 552]
237. M. Lopes, L. Jacob, and M. Wainwright, “A more powerful two-sample test in high
dimensions using random projection,” arXiv preprint arXiv:1108.2401, 2011. [255, 256, 520,
522, 523, 525]
238. W. Beckner, “A generalized poincaré inequality for gaussian measures,” Proceedings of the
American Mathematical Society, pp. 397–400, 1989. [256]
239. M. Rudelson and R. Vershynin, “Invertibility of random matrices: unitary and orthogonal
perturbations,” arXiv preprint arXiv:1206.5180, June 2012. Version 1. [257, 334]
240. T. Tao and V. Vu, “Random matrices: Universality of local eigenvalue statistics,” Acta
mathematica, pp. 1–78, 2011. [259]
241. T. Tao and V. Vu, “Random matrices: The distribution of the smallest singular values,”
Geometric And Functional Analysis, vol. 20, no. 1, pp. 260–297, 2010. [259, 334, 499]
242. T. Tao and V. Vu, “On random±1 matrices: singularity and determinant,” Random Structures
& Algorithms, vol. 28, no. 1, pp. 1–23, 2006. [259, 332, 335, 499]
243. J. Silverstein and Z. Bai, “On the empirical distribution of eigenvalues of a class of large
dimensional random matrices,” Journal of Multivariate analysis, vol. 54, no. 2, pp. 175–192,
1995. [262]
244. J. Von Neumann, Mathematische grundlagen der quantenmechanik, vol. 38. Springer, 1995.
[264]
245. J. Cadney, N. Linden, and A. Winter, “Infinitely many constrained inequalities for the von
neumann entropy,” Information Theory, IEEE Transactions on, vol. 58, no. 6, pp. 3657–3663,
2012. [265, 267]
246. H. Araki and E. Lieb, “Entropy inequalities,” Communications in Mathematical Physics,
vol. 18, no. 2, pp. 160–170, 1970. [266]
247. E. Lieb and M. Ruskai, “A fundamental property of quantum-mechanical entropy,” Physical
Review Letters, vol. 30, no. 10, pp. 434–436, 1973. [266]
248. E. Lieb and M. Ruskai, “Proof of the strong subadditivity of quantum-mechanical entropy,”
Journal of Mathematical Physics, vol. 14, pp. 1938–1941, 1973. [266]
249. N. Pippenger, “The inequalities of quantum information theory,” Information Theory, IEEE
Transactions on, vol. 49, no. 4, pp. 773–789, 2003. [267]
250. T. Chan, “Recent progresses in characterising information inequalities,” Entropy, vol. 13,
no. 2, pp. 379–401, 2011. []
251. T. Chan, D. Guo, and R. Yeung, “Entropy functions and determinant inequalities,” in
Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pp. 1251–
1255, IEEE, 2012. []
252. R. Yeung, “Facts of entropy,” IEEE Information Theory Society Newsletter, pp. 6–15,
December 2012. [267]
253. C. Williams and M. Seeger, “The effect of the input density distribution on kernel-based
classifiers,” in Proceedings of the 17th International Conference on Machine Learning,
Citeseer, 2000. [268]
254. J. Shawe-Taylor, N. Cristianini, and J. Kandola, “On the concentration of spectral properties,”
Advances in neural information processing systems, vol. 1, pp. 511–518, 2002. [268]
255. J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola, “On the eigenspectrum of
the gram matrix and the generalization error of kernel-pca,” Information Theory, IEEE
Transactions on, vol. 51, no. 7, pp. 2510–2522, 2005. [268]
256. Y. Do and V. Vu, “The spectrum of random kernel matrices,” arXiv preprint arXiv:1206.3763,
2012. [268]
257. X. Cheng and A. Singer, “The spectrum of random inner-product kernel matrices,”
arXiv:1202.3155v1 [math.PR], p. 40, 2012. [268]
258. N. Ross et al., “Fundamentals of stein’s method,” Probability Surveys, vol. 8, pp. 210–293,
2011. [269, 347]
259. Z. Chen and J. Dongarra, “Condition numbers of gaussian random matrices,” Arxiv preprint
arXiv:0810.0800, 2008. [269]
260. M. Junge and Q. Zeng, “Noncommutative bennett and rosenthal inequalities,” Arxiv preprint
arXiv:1111.1027, 2011. [269]
261. A. Giannopoulos, “Notes on isotropic convex bodies,” Warsaw University Notes, 2003. [272]
262. Y. D. Burago, V. A. Zalgaller, and A. Sossinsky, Geometric inequalities, vol. 1988. Springer
Berlin, 1988. [273]
263. R. Schneider, Convex bodies: the Brunn-Minkowski theory, vol. 44. Cambridge Univ Pr, 1993.
[273, 292]
264. V. Milman and A. Pajor, “Isotropic position and inertia ellipsoids and zonoids of the unit
ball of a normed n-dimensional space,” Geometric aspects of functional analysis, pp. 64–104,
1989. [274, 277, 291]
265. C. Borell, “The brunn-minkowski inequality in gauss space,” Inventiones Mathematicae,
vol. 30, no. 2, pp. 207–216, 1975. [274]
266. R. Kannan, L. Lovász, and M. Simonovits, “Random walks and an O*(n5) volume algorithm
for convex bodies,” Random structures and algorithms, vol. 11, no. 1, pp. 1–50, 1997. [274,
275, 277, 280, 282, 442]
267. O. Guédon and M. Rudelson, “Lp-moments of random vectors via majorizing measures,”
Advances in Mathematics, vol. 208, no. 2, pp. 798–823, 2007. [275, 276, 299, 301, 303]
268. C. Borell, “Convex measures on locally convex spaces,” Arkiv för Matematik, vol. 12, no. 1,
pp. 239–252, 1974. [276]
269. C. Borell, “Convex set functions ind-space,” Periodica Mathematica Hungarica, vol. 6, no. 2,
pp. 111–136, 1975. [276]
270. A. Prékopa, “Logarithmic concave measures with application to stochastic programming,”
Acta Sci. Math.(Szeged), vol. 32, no. 197, pp. 301–3, 1971. [276]
271. S. Vempala, “Recent progress and open problems in algorithmic convex geometry,” in
IARCS Annual Conference on Foundations of Software Technology and Theoretical Com-
puter Science (FSTTCS 2010), Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2010.
[276]
272. G. Paouris, “Concentration of mass on convex bodies,” Geometric and Functional Analysis,
vol. 16, no. 5, pp. 1021–1049, 2006. [277, 290, 291, 292, 295]
273. J. Bourgain, “Random points in isotropic convex sets,” Convex geometric analysis, Berkeley,
CA, pp. 53–58, 1996. [277, 278, 282]
274. S. Mendelson and A. Pajor, “On singular values of matrices with independent rows,”
Bernoulli, vol. 12, no. 5, pp. 761–773, 2006. [277, 281, 282, 283, 285, 287, 288, 289]
275. S. Alesker, “ψ2-estimate for the euclidean norm on a convex body in isotropic position,”
Operator theory, vol. 77, pp. 1–4, 1995. [280, 290]
276. F. Lust-Piquard and G. Pisier, “Non commutative khintchine and paley inequalities,” Arkiv
för Matematik, vol. 29, no. 1, pp. 241–260, 1991. [284]
277. F. Cucker and D. X. Zhou, Learning theory: an approximation theory viewpoint, vol. 24.
Cambridge University Press, 2007. [288]
278. V. Koltchinskii and E. Giné, “Random matrix approximation of spectra of integral operators,”
Bernoulli, vol. 6, no. 1, pp. 113–167, 2000. [288, 289]
279. S. Mendelson, “On the performance of kernel classes,” The Journal of Machine Learning
Research, vol. 4, pp. 759–771, 2003. [289]
280. R. Kress, Linear Integral Equations. Berlin: Springer-Verlag, 1989. [289]
281. G. W. Hanson and A. B. Yakovlev, Operator theory for electromagnetics: an introduction.
Springer Verlag, 2002. [289]
282. R. C. Qiu, Z. Hu, M. Wicks, L. Li, S. J. Hou, and L. Gary, “Wireless Tomography, Part
II: A System Engineering Approach,” in 5th International Waveform Diversity & Design
Conference, (Niagara Falls, Canada), August 2010. [290]
283. R. C. Qiu, M. C. Wicks, L. Li, Z. Hu, and S. J. Hou, “Wireless Tomography, Part I: A
Novel Approach to Remote Sensing,” in 5th International Waveform Diversity & Design
Conference, (Niagara Falls, Canada), August 2010. [290]
284. E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone, “Learning from
examples as an inverse problem,” Journal of Machine Learning Research, vol. 6, no. 1, p. 883,
2006. [290]
285. E. De Vito, L. Rosasco, and A. Toigo, “Learning sets with separating kernels,” arXiv preprint
arXiv:1204.3573, 2012. []
286. E. De Vito, V. Umanità, and S. Villa, “An extension of mercer theorem to matrix-valued
measurable kernels,” Applied and Computational Harmonic Analysis, 2012. []
287. E. De Vito, L. Rosasco, and A. Toigo, “A universally consistent spectral estimator for the
support of a distribution,” [290]
288. R. Latala, “Weak and strong moments of random vectors,” Marcinkiewicz Centenary Volume,
Banach Center Publ., vol. 95, pp. 115–121, 2011. [290, 303]
289. R. Latala, “Order statistics and concentration of lr norms for log-concave vectors,” Journal of
Functional Analysis, 2011. [292, 293]
290. R. Adamczak, R. Latala, A. E. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Geometry
of log-concave ensembles of random matrices and approximate reconstruction,” Comptes
Rendus Mathematique, vol. 349, no. 13, pp. 783–786, 2011. []
291. R. Adamczak, R. Latala, A. E. Litvak, K. Oleszkiewicz, A. Pajor, and N. Tomczak-
Jaegermann, “A short proof of paouris’ inequality,” arXiv preprint arXiv:1205.2515, 2012.
[290, 291]
292. J. Bourgain, “On the distribution of polynomials on high dimensional convex sets,” Geometric
aspects of functional analysis, pp. 127–137, 1991. [290]
293. B. Klartag, “An isomorphic version of the slicing problem,” Journal of Functional Analysis,
vol. 218, no. 2, pp. 372–394, 2005. [290]
294. S. Bobkov and F. Nazarov, “On convex bodies and log-concave probability measures with
unconditional basis,” Geometric aspects of functional analysis, pp. 53–69, 2003. [290]
295. S. Bobkov and F. Nazarov, “Large deviations of typical linear functionals on a convex body
with unconditional basis,” Progress in Probability, pp. 3–14, 2003. [290]
296. A. Giannopoulos, M. Hartzoulaki, and A. Tsolomitis, “Random points in isotropic un-
conditional convex bodies,” Journal of the London Mathematical Society, vol. 72, no. 3,
pp. 779–798, 2005. [292]
297. G. Aubrun, “Sampling convex bodies: a random matrix approach,” Proceedings of the
American Mathematical Society, vol. 135, no. 5, pp. 1293–1304, 2007. [292, 295]
298. R. Adamczak, “A tail inequality for suprema of unbounded empirical processes with
applications to markov chains,” Electron. J. Probab, vol. 13, pp. 1000–1034, 2008. [293]
299. R. Adamczak, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Sharp bounds on the rate
of convergence of the empirical covariance matrix,” Comptes Rendus Mathematique, 2011.
[294]
300. R. Adamczak, R. Latala, A. E. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Tail estimates
for norms of sums of log-concave random vectors,” arXiv preprint arXiv:1107.4070, 2011.
[293]
301. R. Adamczak, A. E. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Quantitative estimates
of the convergence of the empirical covariance matrix in log-concave ensembles,” Journal of
the American Mathematical Society, vol. 23, no. 2, p. 535, 2010. [294, 315]
302. Z. Bai and Y. Yin, “Limit of the smallest eigenvalue of a large dimensional sample covariance
matrix,” The annals of Probability, pp. 1275–1294, 1993. [294, 324]
303. R. Kannan, L. Lovász, and M. Simonovits, “Isoperimetric problems for convex bodies and a
localization lemma,” Discrete & Computational Geometry, vol. 13, no. 1, pp. 541–559, 1995.
[295]
304. G. Paouris, “Small ball probability estimates for log-concave measures,” Trans. Amer. Math.
Soc, vol. 364, pp. 287–308, 2012. [295, 297, 305]
305. J. A. Clarkson, “Uniformly convex spaces,” Transactions of the American Mathematical
Society, vol. 40, no. 3, pp. 396–414, 1936. [300]
306. Y. Gordon, “Gaussian processes and almost spherical sections of convex bodies,” The Annals
of Probability, pp. 180–188, 1988. [302]
307. Y. Gordon, On Milman’s inequality and random subspaces which escape through a mesh in
Rn . Springer, 1988. Geometric aspects of functional analysis (1986/87), 84–106, Lecture
Notes in Math., 1317. [302]
308. R. Adamczak, O. Guedon, R. Latala, A. E. Litvak, K. Oleszkiewicz, A. Pajor, and
N. Tomczak-Jaegermann, “Moment estimates for convex measures,” arXiv preprint
arXiv:1207.6618, 2012. [303, 304, 305]
309. A. Pajor and N. Tomczak-Jaegermann, “Chevet type inequality and norms of sub-matrices,”
[303]
310. N. Srivastava and R. Vershynin, “Covariance estimation for distributions with 2 + ε
moments,” Arxiv preprint arXiv:1106.2775, 2011. [304, 305, 322, 475, 476]
311. M. Rudelson and R. Vershynin, “Sampling from large matrices: An approach through
geometric functional analysis,” Journal of the ACM (JACM), vol. 54, no. 4, p. 21, 2007. [306,
307, 310]
312. A. Frieze, R. Kannan, and S. Vempala, “Fast monte-carlo algorithms for finding low-rank
approximations,” Journal of the ACM (JACM), vol. 51, no. 6, pp. 1025–1041, 2004. [310]
313. P. Drineas, R. Kannan, and M. Mahoney, “Fast monte carlo algorithms for matrices i:
Approximating matrix multiplication,” SIAM Journal on Computing, vol. 36, no. 1, p. 132,
2006. [562]
314. P. Drineas, R. Kannan, and M. Mahoney, “Fast monte carlo algorithms for matrices ii:
Computing a low-rank approximation to a matrix,” SIAM Journal on Computing, vol. 36,
no. 1, p. 158, 2006. [311]
315. P. Drineas, R. Kannan, M. Mahoney, et al., “Fast monte carlo algorithms for matrices iii:
Computing a compressed approximate matrix decomposition,” SIAM Journal on Computing,
vol. 36, no. 1, p. 184, 2006. [311]
316. P. Drineas and R. Kannan, “Pass efficient algorithms for approximating large matrices,”
in Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms,
pp. 223–232, Society for Industrial and Applied Mathematics, 2003. [310, 311]
317. P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, “Clustering large graphs via the
singular value decomposition,” Machine Learning, vol. 56, no. 1, pp. 9–33, 2004. [313]
318. V. Mil’man, “New proof of the theorem of a. dvoretzky on intersections of convex bodies,”
Functional Analysis and its Applications, vol. 5, no. 4, pp. 288–295, 1971. [315]
319. K. Ball, “An elementary introduction to modern convex geometry,” Flavors of geometry,
vol. 31, pp. 1–58, 1997. [315]
320. R. Vershynin, “Approximating the moments of marginals of high-dimensional distributions,”
The Annals of Probability, vol. 39, no. 4, pp. 1591–1606, 2011. [322]
321. P. Youssef, “Estimating the covariance of random matrices,” arXiv preprint arXiv:1301.6607,
2013. [322, 323]
322. M. Rudelson and R. Vershynin, “Smallest singular value of a random rectangular matrix,”
Communications on Pure and Applied Mathematics, vol. 62, no. 12, pp. 1707–1739, 2009.
[324, 327]
323. R. Vershynin, “Spectral norm of products of random and deterministic matrices,” Probability
theory and related fields, vol. 150, no. 3, p. 471, 2011. [324, 327, 328, 329, 331]
324. O. Feldheim and S. Sodin, “A universality result for the smallest eigenvalues of certain sample
covariance matrices,” Geometric And Functional Analysis, vol. 20, no. 1, pp. 88–123, 2010.
[324, 326, 328]
325. M. Lai and Y. Liu, “The probabilistic estimates on the largest and smallest q-singular values
of pre-gaussian random matrices,” submitted to Advances in Mathematics, 2010. []
326. T. Tao and V. Vu, “Random matrices: The distribution of the smallest singular values,”
Geometric And Functional Analysis, vol. 20, no. 1, pp. 260–297, 2010. [328, 338, 339, 340,
341]
327. D. Chafaï, “Singular values of random matrices,” Lecture Notes, November 2009. Univer-
sité Paris-Est Marne-la-Vallée. []
328. M. Lai and Y. Liu, “A study on the largest and smallest q-singular values of random
matrices,” []
329. A. Litvak and O. Rivasplata, “Smallest singular value of sparse random matrices,” Arxiv
preprint arXiv:1106.0938, 2011. []
330. R. Vershynin, “Invertibility of symmetric random matrices,” Arxiv preprint arXiv:1102.0300,
2011. []
331. H. Nguyen and V. Vu, “Optimal inverse littlewood-offord theorems,” Advances in Mathemat-
ics, 2011. []
332. Y. Eliseeva, F. Götze, and A. Zaitsev, “Estimates for the concentration functions in the
littlewood–offord problem,” Arxiv preprint arXiv:1203.6763, 2012. []
333. S. Mendelson and G. Paouris, “On the singular values of random matrices,” Preprint, 2012.
[324]
334. J. Silverstein, “The smallest eigenvalue of a large dimensional wishart matrix,” The Annals of
Probability, vol. 13, no. 4, pp. 1364–1368, 1985. [324]
335. M. Rudelson and R. Vershynin, “Non-asymptotic theory of random matrices: extreme singular
values,” Arxiv preprint arXiv:1003.2990, 2010. [325]
336. G. Aubrun, “A sharp small deviation inequality for the largest eigenvalue of a random matrix,”
Séminaire de Probabilités XXXVIII, pp. 320–337, 2005. [325]
337. G. Bennett, L. Dor, V. Goodman, W. Johnson, and C. Newman, “On uncomplemented
subspaces of lp, 1 < p < 2,” Israel Journal of Mathematics, vol. 26, no. 2, pp. 178–187, 1977.
[326]
338. A. Litvak, A. Pajor, M. Rudelson, and N. Tomczak-Jaegermann, “Smallest singular value of
random matrices and geometry of random polytopes,” Advances in Mathematics, vol. 195,
no. 2, pp. 491–523, 2005. [326, 462]
339. S. Artstein-Avidan, O. Friedland, V. Milman, and S. Sodin, “Polynomial bounds for large
bernoulli sections of l1 n,” Israel Journal of Mathematics, vol. 156, no. 1, pp. 141–155, 2006.
[326]
340. M. Rudelson, “Lower estimates for the singular values of random matrices,” Comptes Rendus
Mathematique, vol. 342, no. 4, pp. 247–252, 2006. [326]
341. M. Rudelson and R. Vershynin, “The littlewood–offord problem and invertibility of random
matrices,” Advances in Mathematics, vol. 218, no. 2, pp. 600–633, 2008. [327, 334, 336, 338,
342, 344, 345]
342. R. Vershynin, “Some problems in asymptotic convex geometry and random matrices moti-
vated by numerical algorithms,” Arxiv preprint cs/0703093, 2007. [327]
343. M. Rudelson and O. Zeitouni, “Singular values of gaussian matrices and permanent estima-
tors,” arXiv preprint arXiv:1301.6268, 2013. [329]
344. Y. Yin, Z. Bai, and P. Krishnaiah, “On the limit of the largest eigenvalue of the large
dimensional sample covariance matrix,” Probability Theory and Related Fields, vol. 78, no. 4,
pp. 509–521, 1988. [330, 502]
345. Z. Bai and J. Silverstein, “No eigenvalues outside the support of the limiting spectral
distribution of large-dimensional sample covariance matrices,” The Annals of Probability,
vol. 26, no. 1, pp. 316–345, 1998. [331]
346. Z. Bai and J. Silverstein, “Exact separation of eigenvalues of large dimensional sample
covariance matrices,” The Annals of Probability, vol. 27, no. 3, pp. 1536–1555, 1999. [331]
347. H. Nguyen and V. Vu, “Random matrices: Law of the determinant,” Arxiv preprint
arXiv:1112.0752, 2011. [331, 332]
348. A. Rouault, “Asymptotic behavior of random determinants in the laguerre, gram and jacobi
ensembles,” Latin American Journal of Probability and Mathematical Statistics (ALEA),
vol. 3, pp. 181–230, 2007. [331]
349. N. Goodman, “The distribution of the determinant of a complex wishart distributed matrix,”
Annals of Mathematical Statistics, pp. 178–180, 1963. [332]
350. O. Friedland and O. Giladi, “A simple observation on random matrices with continuous
diagonal entries,” arXiv preprint arXiv:1302.0388, 2013. [334, 335]
351. J. Bourgain, V. H. Vu, and P. M. Wood, “On the singularity probability of discrete random
matrices,” Journal of Functional Analysis, vol. 258, no. 2, pp. 559–603, 2010. [334]
352. T. Tao and V. Vu, “From the littlewood-offord problem to the circular law: universality of
the spectral distribution of random matrices,” Bulletin of the American Mathematical Society,
vol. 46, no. 3, p. 377, 2009. [334]
353. R. Adamczak, O. Guédon, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Smallest sin-
gular value of random matrices with independent columns,” Comptes Rendus Mathematique,
vol. 346, no. 15, pp. 853–856, 2008. [334]
354. R. Adamczak, O. Guédon, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann, “Condition
number of a square matrix with iid columns drawn from a convex body,” Proc. Amer. Math.
Soc., vol. 140, pp. 987–998, 2012. [334]
355. L. Erdős, B. Schlein, and H.-T. Yau, “Wegner estimate and level repulsion for wigner random
matrices,” International Mathematics Research Notices, vol. 2010, no. 3, pp. 436–479, 2010.
[334]
356. B. Farrell and R. Vershynin, “Smoothed analysis of symmetric random matrices with
continuous distributions,” arXiv preprint arXiv:1212.3531, 2012. [335]
357. A. Maltsev and B. Schlein, “A wegner estimate for wigner matrices,” Entropy and the
Quantum II, vol. 552, p. 145, 2011. []
358. H. H. Nguyen, “Inverse littlewood–offord problems and the singularity of random symmetric
matrices,” Duke Mathematical Journal, vol. 161, no. 4, pp. 545–586, 2012. [335]
359. H. H. Nguyen, “On the least singular value of random symmetric matrices,” Electron. J.
Probab., vol. 17, pp. 1–19, 2012. [335]
360. R. Vershynin, “Invertibility of symmetric random matrices,” Random Structures & Algo-
rithms, 2012. [334, 335]
361. A. Sankar, D. Spielman, and S. Teng, “Smoothed analysis of the condition numbers and
growth factors of matrices,” SIAM J. Matrix Anal. Appl., vol. 2, pp. 446–476, 2006. [335]
362. T. Tao and V. Vu, “Smooth analysis of the condition number and the least singular value,”
Mathematics of Computation, vol. 79, no. 272, pp. 2333–2352, 2010. [335, 338, 343, 344]
363. K. P. Costello and V. Vu, “Concentration of random determinants and permanent estimators,”
SIAM Journal on Discrete Mathematics, vol. 23, no. 3, pp. 1356–1371, 2009. [335]
364. T. Tao and V. Vu, “On the permanent of random bernoulli matrices,” Advances in Mathemat-
ics, vol. 220, no. 3, pp. 657–669, 2009. [335]
365. A. H. Taub, “John von neumann: Collected works, volume v: Design of computers, theory of
automata and numerical analysis,” 1963. [336]
366. S. Smale, “On the efficiency of algorithms of analysis,” Bull. Amer. Math. Soc.(NS), vol. 13,
1985. [336, 337, 342]
367. A. Edelman, “Eigenvalues and condition numbers of random matrices,” SIAM Journal on
Matrix Analysis and Applications, vol. 9, no. 4, pp. 543–560, 1988. [336]
368. S. J. Szarek, “Condition numbers of random matrices,” J. Complexity, vol. 7, no. 2, pp. 131–
149, 1991. [336]
369. D. Spielman and S. Teng, “Smoothed analysis of algorithms: Why the simplex algorithm
usually takes polynomial time,” Journal of the ACM (JACM), vol. 51, no. 3, pp. 385–463,
2004. [336, 338, 339]
370. L. Erdos, “Universality for random matrices and log-gases,” arXiv preprint arXiv:1212.0839,
2012. [337, 347]
371. J. Von Neumann and H. Goldstine, “Numerical inverting of matrices of high order,” Bull.
Amer. Math. Soc, vol. 53, no. 11, pp. 1021–1099, 1947. [337]
372. A. Edelman, Eigenvalues and condition numbers of random matrices. PhD thesis, Mas-
sachusetts Institute of Technology, 1989. [337, 338, 341, 342]
373. P. Forrester, “The spectrum edge of random matrix ensembles,” Nuclear Physics B, vol. 402,
no. 3, pp. 709–728, 1993. [338]
374. M. Rudelson and R. Vershynin, “The least singular value of a random square matrix is
O(n−1/2),” Comptes Rendus Mathematique, vol. 346, no. 15, pp. 893–896, 2008. [338]
375. V. Vu and T. Tao, “The condition number of a randomly perturbed matrix,” in Proceedings of
the thirty-ninth annual ACM symposium on Theory of computing, pp. 248–255, ACM, 2007.
[338]
376. J. Lindeberg, “Eine neue herleitung des exponentialgesetzes in der wahrscheinlichkeitsrech-
nung,” Mathematische Zeitschrift, vol. 15, no. 1, pp. 211–225, 1922. [339]
377. A. Edelman and B. Sutton, “Tails of condition number distributions,” simulation, vol. 1, p. 2,
2008. [341]
378. T. Sarlos, “Improved approximation algorithms for large matrices via random projections,”
in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on,
pp. 143–152, IEEE, 2006. [342]
379. T. Tao and V. Vu, “Random matrices: the circular law,” Arxiv preprint arXiv:0708.2895, 2007.
[342, 343]
380. N. Pillai and J. Yin, “Edge universality of correlation matrices,” arXiv preprint
arXiv:1112.2381, 2011. [345, 346, 347]
381. N. Pillai and J. Yin, “Universality of covariance matrices,” arXiv preprint arXiv:1110.2501,
2011. [345]
382. Z. Bao, G. Pan, and W. Zhou, “Tracy-widom law for the extreme eigenvalues of sample
correlation matrices,” 2011. [345]
383. I. Johnstone, “On the distribution of the largest eigenvalue in principal components analysis,”
The Annals of statistics, vol. 29, no. 2, pp. 295–327, 2001. [347, 461, 502]
384. S. Hou, R. C. Qiu, J. P. Browning, and M. C. Wicks, “Spectrum Sensing in Cognitive Radio
with Subspace Matching,” in IEEE Waveform Diversity and Design Conference, January
2012. [347, 514]
385. S. Hou and R. C. Qiu, “Spectrum sensing for cognitive radio using kernel-based learning,”
arXiv preprint arXiv:1105.2978, 2011. [347, 491, 514]
Bibliography 593

386. S. Dallaporta, “Eigenvalue variance bounds for wigner and covariance random matrices,”
Random Matrices: Theory and Applications, vol. 1, no. 03, 2011. [348]
387. Ø. Ryan, A. Masucci, S. Yang, and M. Debbah, “Finite dimensional statistical inference,”
Information Theory, IEEE Transactions on, vol. 57, no. 4, pp. 2457–2473, 2011. [360]
388. R. Couillet and M. Debbah, Random Matrix Methods for Wireless Communications. Cam-
bridge University Press, 2011. [360, 502]
389. A. Tulino and S. Verdu, Random matrix theory and wireless communications. now Publishers
Inc., 2004. [360]
390. R. Müller, “Applications of large random matrices in communications engineering,” in Proc.
Int. Conf. on Advances Internet, Process., Syst., Interdisciplinary Research (IPSI), Sveti
Stefan, Montenegro, 2003. [365, 367]
391. G. E. Pfander, H. Rauhut, and J. A. Tropp, “The restricted isometry property for time–
frequency structured random matrices,” Probability Theory and Related Fields, pp. 1–31,
2011. [373, 374, 400]
392. T. T. Cai, L. Wang, and G. Xu, “Shifting inequality and recovery of sparse signals,” Signal
Processing, IEEE Transactions on, vol. 58, no. 3, pp. 1300–1308, 2010. [373]
393. E. Candes, “The Restricted Isometry Property and Its Implications for Compressed Sensing,”
Comptes rendus-Mathematique, vol. 346, no. 9–10, pp. 589–592, 2008. []
394. E. Candes, J. Romberg, and T. Tao, “Stable Signal Recovery from Incomplete and Inaccurate
Measurements,” Comm. Pure Appl. Math, vol. 59, no. 8, pp. 1207–1223, 2006. []
395. S. Foucart, “A note on guaranteed sparse recovery via ℓ1-minimization,” Applied and
Computational Harmonic Analysis, vol. 29, no. 1, pp. 97–103, 2010. [373]
396. S. Foucart, “Sparse recovery algorithms: sufficient conditions in terms of restricted isometry
constants,” Approximation Theory XIII: San Antonio 2010, pp. 65–77, 2012. [373]
397. D. Needell and J. A. Tropp, “Cosamp: Iterative signal recovery from incomplete and
inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–
321, 2009. [373]
398. T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,”
Applied and Computational Harmonic Analysis, vol. 27, no. 3, pp. 265–274, 2009. [373]
399. S. Foucart, “Hard thresholding pursuit: an algorithm for compressive sensing,” SIAM Journal
on Numerical Analysis, vol. 49, no. 6, pp. 2543–2563, 2011. [373]
400. R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A Simple Proof of the Restricted
Isometry Property for Random Matrices.” Submitted for publication, January 2007. [373,
374, 375, 376, 378, 385]
401. D. L. Donoho and J. Tanner, “Counting faces of randomly-projected polytopes when the
projection radically lowers dimension,” J. Amer. Math. Soc, vol. 22, no. 1, pp. 1–53, 2009.
[]
402. M. Rudelson and R. Vershynin, “On sparse reconstruction from fourier and gaussian
measurements,” Communications on Pure and Applied Mathematics, vol. 61, no. 8, pp. 1025–
1045, 2008. [374, 394]
403. H. Rauhut, J. Romberg, and J. A. Tropp, “Restricted isometries for partial random circulant
matrices,” Applied and Computational Harmonic Analysis, vol. 32, no. 2, pp. 242–254, 2012.
[374, 393, 394, 395, 396, 397, 398, 404]
404. G. E. Pfander and H. Rauhut, “Sparsity in time-frequency representations,” Journal of Fourier
Analysis and Applications, vol. 16, no. 2, pp. 233–260, 2010. [374, 400, 401, 402, 403, 404]
405. G. Pfander, H. Rauhut, and J. Tanner, “Identification of Matrices having a Sparse Represen-
tation,” in Preprint, 2007. [374, 400]
406. E. Candes and T. Tao, “Near-Optimal Signal Recovery From Random Projections: Universal
Encoding Strategies?,” IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–
5425, 2006. [373, 384]
407. S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann, “Uniform uncertainty principle for
bernoulli and subgaussian ensembles,” Constructive Approximation, vol. 28, no. 3, pp. 277–
289, 2008. [373, 386]
408. A. Cohen, W. Dahmen, and R. DeVore, “Compressed Sensing and Best k-Term Approxima-
tion,” in Submitted for publication, July, 2006. [373]
409. S. Foucart, A. Pajor, H. Rauhut, and T. Ullrich, “The gelfand widths of lp-balls for 0 < p <
1,” Journal of Complexity, vol. 26, no. 6, pp. 629–640, 2010. []
410. A. Y. Garnaev and E. D. Gluskin, “The widths of a euclidean ball,” in Dokl. Akad. Nauk SSSR,
vol. 277, pp. 1048–1052, 1984. [373]
411. W. B. Johnson and J. Lindenstrauss, “Extensions of lipschitz mappings into a hilbert space,”
Contemporary mathematics, vol. 26, no. 189–206, p. 1, 1984. [374, 375, 379]
412. F. Krahmer and R. Ward, “New and improved johnson-lindenstrauss embeddings via the
restricted isometry property,” SIAM Journal on Mathematical Analysis, vol. 43, no. 3,
pp. 1269–1281, 2011. [374, 375, 379, 398]
413. N. Alon, “Problems and results in extremal combinatorics-i,” Discrete Mathematics, vol. 273,
no. 1, pp. 31–53, 2003. [375]
414. N. Ailon and E. Liberty, “An almost optimal unrestricted fast johnson-lindenstrauss trans-
form,” in Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete
Algorithms, pp. 185–191, SIAM, 2011. [375]
415. D. Achlioptas, “Database-friendly random projections,” in Proceedings of the twentieth ACM
SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 274–281,
ACM, 2001. [375, 376, 408, 410]
416. D. Achlioptas, “Database-friendly random projections: Johnson-lindenstrauss with binary
coins,” Journal of computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003. [375,
379, 385, 414]
417. N. Ailon and B. Chazelle, “Approximate nearest neighbors and the fast johnson-lindenstrauss
transform,” in Proceedings of the thirty-eighth annual ACM symposium on Theory of
computing, pp. 557–563, ACM, 2006. [375]
418. N. Ailon and B. Chazelle, “The fast johnson-lindenstrauss transform and approximate nearest
neighbors,” SIAM Journal on Computing, vol. 39, no. 1, pp. 302–322, 2009. [375]
419. G. Lorentz, M. Golitschek, and Y. Makovoz, “Constructive approximation, volume 304 of
grundlehren math. wiss,” 1996. [377]
420. R. I. Arriaga and S. Vempala, “An algorithmic theory of learning: Robust concepts and
random projection,” Machine Learning, vol. 63, no. 2, pp. 161–182, 2006. [379]
421. K. L. Clarkson and D. P. Woodruff, “Numerical linear algebra in the streaming model,” in
Proceedings of the 41st annual ACM symposium on Theory of computing, pp. 205–214, ACM,
2009. []
422. S. Dasgupta and A. Gupta, “An elementary proof of a theorem of johnson and lindenstrauss,”
Random Structures & Algorithms, vol. 22, no. 1, pp. 60–65, 2002. []
423. P. Frankl and H. Maehara, “The johnson-lindenstrauss lemma and the sphericity of some
graphs,” Journal of Combinatorial Theory, Series B, vol. 44, no. 3, pp. 355–362, 1988. []
424. P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse
of dimensionality,” in Proceedings of the thirtieth annual ACM symposium on Theory of
computing, pp. 604–613, ACM, 1998. []
425. D. M. Kane and J. Nelson, “A derandomized sparse johnson-lindenstrauss transform,” arXiv
preprint arXiv:1006.3585, 2010. []
426. J. Matoušek, “On variants of the johnson–lindenstrauss lemma,” Random Structures &
Algorithms, vol. 33, no. 2, pp. 142–156, 2008. [379]
427. M. Rudelson and R. Vershynin, “Geometric approach to error-correcting codes and recon-
struction of signals,” International Mathematics Research Notices, vol. 2005, no. 64, p. 4019,
2005. [384]
428. J. Vybı́ral, “A variant of the johnson–lindenstrauss lemma for circulant matrices,” Journal of
Functional Analysis, vol. 260, no. 4, pp. 1096–1105, 2011. [384, 398]
429. A. Hinrichs and J. Vybı́ral, “Johnson-lindenstrauss lemma for circulant matrices,” Random
Structures & Algorithms, vol. 39, no. 3, pp. 391–398, 2011. [384, 398]
430. H. Rauhut, K. Schnass, and P. Vandergheynst, “Compressed Sensing and Redundant Dictio-
naries,” IEEE Transactions on Information Theory, vol. 54, no. 5, pp. 2210–2219, 2008. [384,
385, 387, 388, 389, 391, 392]
431. W. U. Bajwa, J. Haupt, A. M. Sayeed, and R. Nowak, “Compressed channel sensing: A new
approach to estimating sparse multipath channels,” Proceedings of the IEEE, vol. 98, no. 6,
pp. 1058–1076, 2010. [400]
432. J. Chiu and L. Demanet, “Matrix probing and its conditioning,” SIAM Journal on Numerical
Analysis, vol. 50, no. 1, pp. 171–193, 2012. [400]
433. G. E. Pfander, “Gabor frames in finite dimensions,” Finite Frames, pp. 193–239, 2012. [400]
434. J. Haupt, W. U. Bajwa, G. Raz, and R. Nowak, “Toeplitz compressed sensing matrices
with applications to sparse channel estimation,” Information Theory, IEEE Transactions on,
vol. 56, no. 11, pp. 5862–5875, 2010. [400]
435. K. Gröchenig, Foundations of time-frequency analysis. Birkhäuser Boston, 2000. [401]
436. F. Krahmer, G. E. Pfander, and P. Rashkov, “Uncertainty in time–frequency representations
on finite abelian groups and applications,” Applied and Computational Harmonic Analysis,
vol. 25, no. 2, pp. 209–225, 2008. [401]
437. J. Lawrence, G. E. Pfander, and D. Walnut, “Linear independence of gabor systems in finite
dimensional vector spaces,” Journal of Fourier Analysis and Applications, vol. 11, no. 6,
pp. 715–726, 2005. [401]
438. B. M. Sanandaji, T. L. Vincent, and M. B. Wakin, “Concentration of measure inequalities
for compressive toeplitz matrices with applications to detection and system identification,” in
Decision and Control (CDC), 2010 49th IEEE Conference on, pp. 2922–2929, IEEE, 2010.
[408]
439. B. M. Sanandaji, T. L. Vincent, and M. B. Wakin, “Concentration of measure inequalities for
toeplitz matrices with applications,” arXiv preprint arXiv:1112.1968, 2011. [409]
440. B. M. Sanandaji, M. B. Wakin, and T. L. Vincent, “Observability with random observations,”
arXiv preprint arXiv:1211.4077, 2012. [408]
441. M. Meckes, “On the spectral norm of a random toeplitz matrix,” Electron. Comm. Probab,
vol. 12, pp. 315–325, 2007. [410]
442. R. Adamczak, “A few remarks on the operator norm of random toeplitz matrices,” Journal of
Theoretical Probability, vol. 23, no. 1, pp. 85–108, 2010. [410]
443. R. Calderbank, S. Howard, and S. Jafarpour, “Construction of a large class of deterministic
sensing matrices that satisfy a statistical isometry property,” Selected Topics in Signal
Processing, IEEE Journal of, vol. 4, no. 2, pp. 358–374, 2010. [410]
444. Y. Plan, Compressed sensing, sparse approximation, and low-rank matrix estimation. PhD
thesis, California Institute of Technology, 2011. [411, 414, 415]
445. R. Vershynin, “Math 280 lecture notes,” 2007. [413]
446. B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix
equations via nuclear norm minimization,” SIAM review, vol. 52, no. 3, pp. 471–501, 2010.
[414]
447. R. Vershynin, “On large random almost euclidean bases,” Acta Math. Univ. Comenianae,
vol. 69, no. 2, pp. 137–144, 2000. [414]
448. E. L. Lehmann and G. Casella, Theory of point estimation, vol. 31. Springer, 1998. [414]
449. S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu, “A unified framework for high-
dimensional analysis of m-estimators with decomposable regularizers,” arXiv preprint
arXiv:1010.2731, 2010. [416]
450. A. Agarwal, S. Negahban, and M. Wainwright, “Fast global convergence of gradient methods
for high-dimensional statistical recovery,” arXiv preprint arXiv:1104.4824, 2011. [416]
451. B. Recht, M. Fazel, and P. Parrilo, “Guaranteed Minimum-Rank Solutions of Linear Matrix
Equations via Nuclear Norm Minimization,” in Arxiv preprint arXiv:0706.4138, 2007. [417]
452. H. Lütkepohl, “New introduction to multiple time series analysis,” 2005. [422]
453. A. Agarwal, S. Negahban, and M. Wainwright, “Noisy matrix decomposition via convex
relaxation: Optimal rates in high dimensions,” arXiv preprint arXiv:1102.4807, 2011. [427,
428, 429, 485, 486]
454. C. Meyer, Matrix analysis and applied linear algebra. SIAM, 2000. [430]
455. M. McCoy and J. Tropp, “Sharp recovery bounds for convex deconvolution, with applica-
tions,” Arxiv preprint arXiv:1205.1580, 2012. [433, 434]
456. V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence
for matrix decomposition,” SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
[434]
457. V. Koltchinskii, “Von neumann entropy penalization and low-rank matrix estimation,” The
Annals of Statistics, vol. 39, no. 6, pp. 2936–2973, 2012. [434, 438, 439, 447]
458. M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information. Cam-
bridge Press, 10th edition ed., 2010. [438]
459. S. Sra, S. Nowozin, and S. Wright, eds., Optimization for machine learning. MIT Press, 2012.
Chapter 4 (Bertsekas) Incremental Gradient, Subgradient, and Proximal Method for Convex
Optimization: A Survey. [440, 441]
460. M. Rudelson, “Contact points of convex bodies,” Israel Journal of Mathematics, vol. 101,
no. 1, pp. 93–124, 1997. [441]
461. D. Blatt, A. Hero, and H. Gauchman, “A convergent incremental gradient method with a
constant step size,” SIAM Journal on Optimization, vol. 18, no. 1, pp. 29–51, 2007. [443]
462. M. Rabbat and R. Nowak, “Quantized incremental algorithms for distributed optimization,”
Selected Areas in Communications, IEEE Journal on, vol. 23, no. 4, pp. 798–808, 2005. [443]
463. E. Candes, Y. Eldar, T. Strohmer, and V. Voroninski, “Phase retrieval via matrix completion,”
Arxiv preprint arXiv:1109.0573, 2011. [443, 444, 446, 447, 456]
464. E. Candes, T. Strohmer, and V. Voroninski, “Phaselift: Exact and stable signal recovery from
magnitude measurements via convex programming,” Arxiv preprint arXiv:1109.4499, 2011.
[443, 444, 456]
465. E. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of
Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009. [443]
466. J. Cai, E. Candes, and Z. Shen, “A singular value thresholding algorithm for matrix
completion,” Arxiv preprint Arxiv:0810.3286, 2008. [446, 449]
467. E. Candes and T. Tao, “The power of convex relaxation: Near-optimal matrix completion,”
Information Theory, IEEE Transactions on, vol. 56, no. 5, pp. 2053–2080, 2010. []
468. A. Alfakih, A. Khandani, and H. Wolkowicz, “Solving euclidean distance matrix completion
problems via semidefinite programming,” Computational optimization and applications,
vol. 12, no. 1, pp. 13–30, 1999. []
469. M. Fukuda, M. Kojima, K. Murota, K. Nakata, et al., “Exploiting sparsity in semidefinite
programming via matrix completion i: General framework,” SIAM Journal on Optimization,
vol. 11, no. 3, pp. 647–674, 2001. []
470. R. Keshavan, A. Montanari, and S. Oh, “Matrix completion from a few entries,” Information
Theory, IEEE Transactions on, vol. 56, no. 6, pp. 2980–2998, 2010. []
471. C. Johnson, “Matrix completion problems: a survey,” in Proceedings of Symposia in Applied
Mathematics, vol. 40, pp. 171–198, 1990. []
472. E. Candes and Y. Plan, “Matrix completion with noise,” Proceedings of the IEEE, vol. 98,
no. 6, pp. 925–936, 2010. [443]
473. A. Chai, M. Moscoso, and G. Papanicolaou, “Array imaging using intensity-only measure-
ments,” Inverse Problems, vol. 27, p. 015005, 2011. [443, 446, 456]
474. L. Tian, J. Lee, S. Oh, and G. Barbastathis, “Experimental compressive phase space
tomography,” Optics Express, vol. 20, no. 8, pp. 8296–8308, 2012. [443, 446, 447, 448,
449, 456]
475. Y. Lu and M. Vetterli, “Sparse spectral factorization: Unicity and reconstruction algorithms,”
in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference
on, pp. 5976–5979, IEEE, 2011. [444]
476. J. Fienup, “Phase retrieval algorithms: a comparison,” Applied Optics, vol. 21, no. 15,
pp. 2758–2769, 1982. [444]
477. A. Sayed and T. Kailath, “A survey of spectral factorization methods,” Numerical linear
algebra with applications, vol. 8, no. 6–7, pp. 467–496, 2001. [444]
478. C. Beck and R. D’Andrea, “Computational study and comparisons of lft reducibility
methods,” in American Control Conference, 1998. Proceedings of the 1998, vol. 2, pp. 1013–
1017, IEEE, 1998. [446]
479. M. Mesbahi and G. Papavassilopoulos, “On the rank minimization problem over a positive
semidefinite linear matrix inequality,” Automatic Control, IEEE Transactions on, vol. 42,
no. 2, pp. 239–243, 1997. [446]
480. K. Toh, M. Todd, and R. Tütüncü, “SDPT3: a Matlab software package for semidefinite
programming, version 1.3,” Optimization Methods and Software, vol. 11, no. 1–4, pp. 545–
581, 1999. [446]
481. M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming,”
Available at http://stanford.edu/~boyd/cvx, 2008. [446]
482. S. Becker, E. Candès, and M. Grant, “Templates for convex cone problems with applications
to sparse signal recovery,” Mathematical Programming Computation, pp. 1–54, 2011. [446]
483. E. Candes, M. Wakin, and S. Boyd, “Enhancing sparsity by reweighted l1 minimization,”
Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 877–905, 2008. [446]
484. M. Fazel, H. Hindi, and S. Boyd, “Log-det heuristic for matrix rank minimization with
applications to Hankel and Euclidean distance matrices,” in American Control Conference,
2003. Proceedings of the 2003, vol. 3, pp. 2156–2162, IEEE, 2003. [446]
485. M. Fazel, Matrix rank minimization with applications. PhD thesis, Stanford University,
2002. [446]
486. L. Mandel and E. Wolf, Optical Coherence and Quantum Optics. Cambridge University
Press, 1995. [447, 448]
487. Z. Hu, R. Qiu, J. Browning, and M. Wicks, “A novel single-step approach for self-coherent
tomography using semidefinite relaxation,” IEEE Geoscience and Remote Sensing Letters. to
appear. [449]
488. M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version
1.21.” http://cvxr.com/cvx, 2010. [453, 556]
489. H. Ohlsson, A. Y. Yang, R. Dong, and S. S. Sastry, “Compressive phase retrieval from squared
output measurements via semidefinite programming,” arXiv preprint arXiv:1111.6323, 2012.
[454]
490. A. Devaney, E. Marengo, and F. Gruber, “Time-reversal-based imaging and inverse scattering
of multiply scattering point targets,” The Journal of the Acoustical Society of America,
vol. 118, pp. 3129–3138, 2005. [454]
491. L. Lo Monte, D. Erricolo, F. Soldovieri, and M. C. Wicks, “Radio frequency tomography
for tunnel detection,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 48, no. 3,
pp. 1128–1137, 2010. [454]
492. O. Klopp, “Noisy low-rank matrix completion with general sampling distribution,” Arxiv
preprint arXiv:1203.0108, 2012. [456]
493. R. Foygel, R. Salakhutdinov, O. Shamir, and N. Srebro, “Learning with the weighted trace-
norm under arbitrary sampling distributions,” Arxiv preprint arXiv:1106.4251, 2011. [456]
494. R. Foygel and N. Srebro, “Concentration-based guarantees for low-rank matrix reconstruc-
tion,” Arxiv preprint arXiv:1102.3923, 2011. [456]
495. V. Koltchinskii and P. Rangel, “Low rank estimation of similarities on graphs,” Arxiv preprint
arXiv:1205.1868, 2012. [456]
496. E. Richard, P. Savalle, and N. Vayatis, “Estimation of simultaneously sparse and low rank
matrices,” in Proceeding of 29th Annual International Conference on Machine Learning,
2012. [456]
497. H. Ohlsson, A. Yang, R. Dong, and S. Sastry, “Compressive phase retrieval from squared
output measurements via semidefinite programming,” Arxiv preprint arXiv:1111.6323, 2011.
[456]
498. A. Fannjiang and W. Liao, “Compressed sensing phase retrieval,” in Proceedings of IEEE
Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, 2011. []
499. A. Fannjiang, “Absolute uniqueness of phase retrieval with random illumination,” Arxiv
preprint arXiv:1110.5097, 2011. []
500. S. Oymak and B. Hassibi, “Recovering jointly sparse signals via joint basis pursuit,” Arxiv
preprint arXiv:1202.3531, 2012. []
501. Z. Wen, C. Yang, X. Liu, and S. Marchesini, “Alternating direction methods for classical and
ptychographic phase retrieval,” []
502. H. Ohlsson, A. Yang, R. Dong, and S. Sastry, “Compressive phase retrieval via lifting,” [456]
503. K. Jaganathan, S. Oymak, and B. Hassibi, “On robust phase retrieval for sparse signals,” in
Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference
on, pp. 794–799, IEEE, 2012. [456]
504. Y. Chen, A. Wiesel, and A. Hero, “Robust shrinkage estimation of high-dimensional
covariance matrices,” Signal Processing, IEEE Transactions on, vol. 59, no. 9, pp. 4097–
4107, 2011. [460]
505. E. Levina and R. Vershynin, “Partial estimation of covariance matrices,” Probability Theory
and Related Fields, pp. 1–15, 2011. [463, 476, 477]
506. S. Marple, Digital Spectral Analysis with Applications. Prentice-Hall, 1987. [468, 469]
507. C. W. Therrien, Discrete Random Signals and Statistical Signal Processing. Prentice-Hall,
1992. [471]
508. H. Xiao and W. Wu, “Covariance matrix estimation for stationary time series,” Arxiv preprint
arXiv:1105.4563, 2011. [474]
509. A. Rohde, “Accuracy of empirical projections of high-dimensional gaussian matrices,” arXiv
preprint arXiv:1107.5481, 2011. [479, 481, 484, 485]
510. K. Jurczak, “A universal expectation bound on empirical projections of deformed random
matrices,” arXiv preprint arXiv:1209.5943, 2012. [479, 483]
511. A. Amini, “High-dimensional principal component analysis,” 2011. [488]
512. L. Vandenberghe and S. Boyd, “Semidefinite programming,” SIAM review, vol. 38, no. 1,
pp. 49–95, 1996. [489]
513. D. Paul and I. M. Johnstone, “Augmented sparse principal component analysis for high
dimensional data,” arXiv preprint arXiv:1202.1242, 2012. [490]
514. Q. Berthet and P. Rigollet, “Optimal detection of sparse principal components in high
dimension,” Arxiv preprint arXiv:1202.5070, 2012. [490, 503, 505, 506, 508, 509]
515. A. d’Aspremont, F. Bach, and L. Ghaoui, “Approximation bounds for sparse principal
component analysis,” Arxiv preprint arXiv:1205.0121, 2012. [490]
516. A. d’Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet, “A direct formulation for sparse
PCA using semidefinite programming,” SIAM Review, vol. 49, 2007. [501, 506]
517. H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” Journal of
computational and graphical statistics, vol. 15, no. 2, pp. 265–286, 2006. [501]
518. P. Bickel and E. Levina, “Covariance regularization by thresholding,” The Annals of Statistics,
vol. 36, no. 6, pp. 2577–2604, 2008. [502]
519. T. Cai, C. Zhang, and H. Zhou, “Optimal rates of convergence for covariance matrix
estimation,” The Annals of Statistics, vol. 38, no. 4, pp. 2118–2144, 2010. []
520. N. El Karoui, “Spectrum estimation for large dimensional covariance matrices using random
matrix theory,” The Annals of Statistics, vol. 36, no. 6, pp. 2757–2790, 2008. [502]
521. S. Geman, “The spectral radius of large random matrices,” The Annals of Probability, vol. 14,
no. 4, pp. 1318–1328, 1986. [502]
522. J. Baik, G. Ben Arous, and S. Péché, “Phase transition of the largest eigenvalue for nonnull
complex sample covariance matrices,” The Annals of Probability, vol. 33, no. 5, pp. 1643–
1697, 2005. [503]
523. T. Tao, “Outliers in the spectrum of iid matrices with bounded rank perturbations,” Probability
Theory and Related Fields, pp. 1–33, 2011. [503]
524. F. Benaych-Georges, A. Guionnet, M. Maida, et al., “Fluctuations of the extreme eigenvalues
of finite rank deformations of random matrices,” Electronic Journal of Probability, vol. 16,
pp. 1621–1662, 2010. [503]
525. F. Bach, S. Ahipasaoglu, and A. d’Aspremont, “Convex relaxations for subset selection,”
Arxiv preprint ArXiv:1006.3601, 2010. [507]
526. E. J. Candès and M. A. Davenport, “How well can we estimate a sparse vector?,” Applied and
Computational Harmonic Analysis, 2012. [509, 510]
527. E. Candes and T. Tao, “The Dantzig Selector: Statistical Estimation When p is much larger
than n,” Annals of Statistics, vol. 35, no. 6, pp. 2313–2351, 2007. [509]
528. F. Ye and C.-H. Zhang, “Rate minimaxity of the Lasso and Dantzig selector for the lq loss in lr
balls,” The Journal of Machine Learning Research, vol. 11, pp. 3519–3540, 2010. [510]
529. E. Arias-Castro, “Detecting a vector based on linear measurements,” Electronic Journal of
Statistics, vol. 6, pp. 547–558, 2012. [511, 512, 514]
530. E. Arias-Castro, E. Candes, and M. Davenport, “On the fundamental limits of adaptive
sensing,” arXiv preprint arXiv:1111.4646, 2011. [511]
531. E. Arias-Castro, S. Bubeck, and G. Lugosi, “Detection of correlations,” The Annals of
Statistics, vol. 40, no. 1, pp. 412–435, 2012. [511]
532. E. Arias-Castro, S. Bubeck, and G. Lugosi, “Detecting positive correlations in a multivariate
sample,” 2012. [511, 517]
533. A. B. Tsybakov, Introduction to nonparametric estimation. Springer, 2008. [513, 519]
534. L. Balzano, B. Recht, and R. Nowak, “High-dimensional matched subspace detection when
data are missing,” in Information Theory Proceedings (ISIT), 2010 IEEE International
Symposium on, pp. 1638–1642, IEEE, 2010. [514, 516, 517]
535. L. K. Balzano, Handling missing data in high-dimensional subspace modeling. PhD thesis,
UNIVERSITY OF WISCONSIN, 2012. [514]
536. L. L. Scharf, Statistical Signal Processing. Addison-Wesley, 1991. [514]
537. L. L. Scharf and B. Friedlander, “Matched subspace detectors,” Signal Processing, IEEE
Transactions on, vol. 42, no. 8, pp. 2146–2157, 1994. [514]
538. S. Hou, R. C. Qiu, M. Bryant, and M. C. Wicks, “Spectrum Sensing in Cognitive Radio with
Robust Principal Component Analysis,” in IEEE Waveform Diversity and Design Conference,
January 2012. [514]
539. C. McDiarmid, “On the method of bounded differences,” Surveys in combinatorics, vol. 141,
no. 1, pp. 148–188, 1989. [516]
540. M. Azizyan and A. Singh, “Subspace detection of high-dimensional vectors using compres-
sive sampling,” in IEEE SSP, 2012. [516, 517, 518, 519]
541. M. Davenport, M. Wakin, and R. Baraniuk, “Detection and estimation with compressive
measurements,” Tech. Rep. TREE0610, Rice University ECE Department, 2006. [516]
542. J. Paredes, Z. Wang, G. Arce, and B. Sadler, “Compressive matched subspace detection,”
in Proc. 17th European Signal Processing Conference, Glasgow, Scotland, pp. 120–124,
Citeseer, 2009. []
543. J. Haupt and R. Nowak, “Compressive sampling for signal detection,” in Acoustics, Speech
and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 3,
pp. III–1509, IEEE, 2007. [516]
544. A. James, “Distributions of matrix variates and latent roots derived from normal samples,”
The Annals of Mathematical Statistics, pp. 475–501, 1964. [517]
545. S. Balakrishnan, M. Kolar, A. Rinaldo, and A. Singh, “Recovering block-structured activa-
tions using compressive measurements,” arXiv preprint arXiv:1209.3431, 2012. [520]
546. R. Muirhead, Aspects of Multivariate Statistical Theory. Wiley, 2005. [521]
547. S. Vempala, The random projection method, vol. 65. Amer Mathematical Society, 2005. [521]
548. C. Helstrom, “Quantum detection and estimation theory,” Journal of Statistical Physics,
vol. 1, no. 2, pp. 231–252, 1969. [526]
549. C. Helstrom, “Detection theory and quantum mechanics,” Information and Control, vol. 10,
no. 3, pp. 254–291, 1967. [526]
550. J. Sharpnack, A. Rinaldo, and A. Singh, “Changepoint detection over graphs with the spectral
scan statistic,” arXiv preprint arXiv:1206.0773, 2012. [526]
551. D. Ramírez, J. Vía, I. Santamaría, and L. Scharf, “Locally most powerful invariant tests for
correlation and sphericity of Gaussian vectors,” arXiv preprint arXiv:1204.5635, 2012. [526]
600 Bibliography

552. A. Onatski, M. Moreira, and M. Hallin, “Signal detection in high dimension: The multispiked
case,” arXiv preprint arXiv:1210.5663, 2012. [526]
553. A. Nemirovski, “On tractable approximations of randomly perturbed convex constraints,” in
Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, vol. 3, pp. 2419–2422,
IEEE, 2003. [527]
554. A. Nemirovski, “Regular Banach spaces and large deviations of random sums,” Paper in
progress, E-print: http://www2.isye.gatech.edu/~nemirovs, 2004. [528]
555. D. De Farias and B. Van Roy, “On constraint sampling in the linear programming approach
to approximate dynamic programming,” Mathematics of operations research, vol. 29, no. 3,
pp. 462–478, 2004. [528]
556. G. Calafiore and M. Campi, “Uncertain convex programs: randomized solutions and confi-
dence levels,” Mathematical Programming, vol. 102, no. 1, pp. 25–46, 2005. []
557. E. Erdoğan and G. Iyengar, “Ambiguous chance constrained problems and robust optimiza-
tion,” Mathematical Programming, vol. 107, no. 1, pp. 37–61, 2006. []
558. M. Campi and S. Garatti, “The exact feasibility of randomized solutions of uncertain convex
programs,” SIAM Journal on Optimization, vol. 19, no. 3, pp. 1211–1230, 2008. [528]
559. A. Nemirovski and A. Shapiro, “Convex approximations of chance constrained programs,”
SIAM Journal on Optimization, vol. 17, no. 4, pp. 969–996, 2006. [528, 542]
560. A. Nemirovski, “Sums of random symmetric matrices and quadratic optimization under
orthogonality constraints,” Mathematical programming, vol. 109, no. 2, pp. 283–317, 2007.
[529, 530, 531, 532, 533, 534, 536, 538, 539, 540, 542]
561. S. Janson, “Large deviations for sums of partly dependent random variables,” Random
Structures & Algorithms, vol. 24, no. 3, pp. 234–248, 2004. [542]
562. S. Cheung, A. So, and K. Wang, “Chance-constrained linear matrix inequalities with
dependent perturbations: A safe tractable approximation approach,” Preprint, 2011. [542,
543, 547]
563. A. Ben-Tal and A. Nemirovski, “On safe tractable approximations of chance-constrained
linear matrix inequalities,” Mathematics of Operations Research, vol. 34, no. 1, pp. 1–25,
2009. [542]
564. A. Gershman, N. Sidiropoulos, S. Shahbazpanahi, M. Bengtsson, and B. Ottersten, “Convex
optimization-based beamforming,” Signal Processing Magazine, IEEE, vol. 27, no. 3, pp. 62–
75, 2010. [544]
565. Z. Luo, W. Ma, A. So, Y. Ye, and S. Zhang, “Semidefinite relaxation of quadratic optimization
problems,” Signal Processing Magazine, IEEE, vol. 27, no. 3, pp. 20–34, 2010. [544, 545]
566. E. Karipidis, N. Sidiropoulos, and Z. Luo, “Quality of service and max-min fair transmit
beamforming to multiple cochannel multicast groups,” Signal Processing, IEEE Transactions
on, vol. 56, no. 3, pp. 1268–1279, 2008. [545]
567. Z. Hu, R. Qiu, and J. P. Browning, “Joint Amplify-and-Forward Relay and Cooperative
Jamming with Probabilistic Security Consideration,” 2013. Submitted to IEEE Communications
Letters. [547]
568. H.-M. Wang, M. Luo, X.-G. Xia, and Q. Yin, “Joint cooperative beamforming and jamming
to secure af relay systems with individual power constraint and no eavesdropper’s csi,” Signal
Processing Letters, IEEE, vol. 20, no. 1, pp. 39–42, 2013. [547]
569. Y. Yang, Q. Li, W.-K. Ma, J. Ge, and P. Ching, “Cooperative secure beamforming for af
relay networks with multiple eavesdroppers,” Signal Processing Letters, IEEE, vol. 20, no. 1,
pp. 35–38, 2013. [547]
570. D. Ponukumati, F. Gao, and C. Xing, “Robust peer-to-peer relay beamforming: A probabilistic
approach,” 2013. [548, 556]
571. S. Kandukuri and S. Boyd, “Optimal power control in interference-limited fading wireless
channels with outage-probability specifications,” Wireless Communications, IEEE Transac-
tions on, vol. 1, no. 1, pp. 46–55, 2002. [552]
572. S. Ma and D. Sun, “Chance constrained robust beamforming in cognitive radio networks,”
2013. [552]
573. D. Zhang, Y. Hu, J. Ye, X. Li, and X. He, “Matrix completion by truncated nuclear norm
regularization,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference
on, pp. 2192–2199, IEEE, 2012. [555]
574. K.-Y. Wang, T.-H. Chang, W.-K. Ma, A.-C. So, and C.-Y. Chi, “Probabilistic sinr constrained
robust transmit beamforming: A bernstein-type inequality based conservative approach,” in
Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on,
pp. 3080–3083, IEEE, 2011. [556]
575. K.-Y. Wang, T.-H. Chang, W.-K. Ma, and C.-Y. Chi, “A semidefinite relaxation based
conservative approach to robust transmit beamforming with probabilistic sinr constraints,”
in Proc. EUSIPCO, pp. 23–27, 2010. [556]
576. K. Yang, J. Huang, Y. Wu, X. Wang, and M. Chiang, “Distributed robust optimization (dro)
part i: Framework and example,” 2008. [556]
577. K. Yang, Y. Wu, J. Huang, X. Wang, and S. Verdú, “Distributed robust optimization
for communication networks,” in INFOCOM 2008. The 27th Conference on Computer
Communications. IEEE, pp. 1157–1165, IEEE, 2008. [556]
578. M. Chen and M. Chiang, “Distributed optimization in networking: Recent advances in com-
binatorial and robust formulations,” Modeling and Optimization: Theory and Applications,
pp. 25–52, 2012. [556]
579. G. Calafiore, F. Dabbene, and R. Tempo, “Randomized algorithms in robust control,” in
Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, vol. 2, pp. 1908–1913,
IEEE, 2003. [556]
580. R. Tempo, G. Calafiore, and F. Dabbene, Randomized algorithms for analysis and control of
uncertain systems. Springer, 2004. []
581. X. Chen, J. Aravena, and K. Zhou, “Risk analysis in robust control-making the case for
probabilistic robust control,” in American Control Conference, 2005. Proceedings of the 2005,
pp. 1533–1538, IEEE, 2005. []
582. Z. Zhou and R. Cogill, “An algorithm for state constrained stochastic linear-quadratic
control,” in American Control Conference (ACC), 2011, pp. 1476–1481, IEEE, 2011. [556]
583. A. M.-C. So and Y. J. A. Zhang, “Distributionally robust slow adaptive OFDMA with soft QoS
via linear programming,” IEEE J. Sel. Areas Commun. [Online]. Available: http://www1.se.
cuhk.edu.hk/~manchoso. [558]
584. C. Boutsidis and A. Gittens, “Improved matrix algorithms via the subsampled randomized
Hadamard transform,” arXiv preprint arXiv:1204.0062, 2012. [559, 560]
585. A. GITTENS, “Var (xjk),” 2012. [559]
586. M. Magdon-Ismail, “Using a non-commutative bernstein bound to approximate some matrix
algorithms in the spectral norm,” Arxiv preprint arXiv:1103.5453, 2011. [560, 561]
587. M. Magdon-Ismail, “Row sampling for matrix algorithms via a non-commutative bernstein
bound,” Arxiv preprint arXiv:1008.0587, 2010. [561]
588. C. Faloutsos, T. Kolda, and J. Sun, “Mining large time-evolving data using matrix and tensor
tools,” in ICDM Conference, 2007. [565]
589. M. Mahoney, “Randomized algorithms for matrices and data,” Arxiv preprint
arXiv:1104.5557, 2011. [566]
590. R. Ranganathan, R. Qiu, Z. Hu, S. Hou, Z. Chen, M. Pazos-Revilla, and N. Guo, Communica-
tion and Networking in Smart Grids, ch. Cognitive Radio Network for Smart Grid. Auerbach
Publications, Taylor & Francis Group, CRC, 2013. [569]
591. R. C. Qiu, Z. Chen, N. Guo, Y. Song, P. Zhang, H. Li, and L. Lai, “Towards a real-time
cognitive radio network testbed: architecture, hardware platform, and application to smart
grid,” in Networking Technologies for Software Defined Radio (SDR) Networks, 2010 Fifth
IEEE Workshop on, pp. 1–6, IEEE, 2010. [572]
592. R. Qiu, Z. Hu, Z. Chen, N. Guo, R. Ranganathan, S. Hou, and G. Zheng, “Cognitive radio
network for the smart grid: Experimental system architecture, control algorithms, security,
and microgrid testbed,” Smart Grid, IEEE Transactions on, no. 99, pp. 1–18, 2011. [574]
593. R. Qiu, C. Zhang, Z. Hu, and M. Wicks, “Towards a large-scale cognitive radio network
testbed: Spectrum sensing, system architecture, and distributed sensing,” Journal of Commu-
nications, vol. 7, pp. 552–566, July 2012. [572]
594. E. Estrada, The structure of complex networks: theory and applications. Oxford University
Press, 2012. [574]
595. M. Newman, Network: An Introduction. Oxford University Press, 2010. []
596. M. Newman, “The structure and function of complex networks,” SIAM review, vol. 45, no. 2,
pp. 167–256, 2003. []
597. M. Newman and M. Girvan, “Finding and evaluating community structure in networks,”
Physical review E, vol. 69, no. 2, p. 026113, 2004. []
598. S. Strogatz, “Exploring complex networks,” Nature, vol. 410, no. 6825, pp. 268–276, 2001.
[574]
599. A. Barrat, M. Barthelemy, and A. Vespignani, Dynamical processes on complex networks.
Cambridge University Press, 2008. [574]
600. M. Franceschetti and R. Meester, Random networks for communication: from statistical
physics to information systems. Cambridge University Press, 2007. [574]
601. P. Van Mieghem, Graph spectra for complex networks. Cambridge University Press, 2011.
[574, 575]
602. D. Cvetković, M. Doob, and H. Sachs, Spectra of graphs: theory and applications. Academic
Press, 1980. [575, 576]
603. C. Godsil and G. Royle, Algebraic graph theory. Springer, 2001. [575]
604. F. Chung and M. Radcliffe, “On the spectra of general random graphs,” The Electronic Journal
of Combinatorics, vol. 18, no. P215, p. 1, 2011. [575]
605. R. Oliveira, “Concentration of the adjacency matrix and of the laplacian in random graphs
with independent edges,” arXiv preprint arXiv:0911.0600, 2009. []
606. R. Oliveira, “The spectrum of random k-lifts of large graphs (with possibly large k),” arXiv
preprint arXiv:0911.4741, 2009. []
607. A. Gundert and U. Wagner, “On laplacians of random complexes,” in Proceedings of the 2012
symposuim on Computational Geometry, pp. 151–160, ACM, 2012. []
608. L. Lu and X. Peng, “Loose laplacian spectra of random hypergraphs,” Random Structures &
Algorithms, 2012. [575]
609. F. Chung, Spectral graph theory. American Mathematical Society, 1997. [576]
610. B. Osting, C. Brune, and S. Osher, “Optimal data collection for improved rankings expose
well-connected graphs,” arXiv preprint arXiv:1207.6430, 2012. [576]
Index

A high dimensional statistics, 567


Adjacency matrix, 574–576 hypothesis detection, 569–570
Ahlswede-Winter inequality, 87–90 large random matrices, 567–568
Ahlswede-Winter’s derivation network mobility, 573
Bernstein trick, 97, 102 random matrix theory, 574–576
binary I-divergence, 103 Smart Grid, 574
Chebyshev inequality, 101 UAVs, 573
hypothesis detection, 95–97 vision, 567, 568
Markov inequality, 100–101 wireless distributed computing, 571–572
matrix order, 98 Bonferroni’s inequality, 3–4
matrix-valued Chernoff bound, 104–105 Boole’s inequality, 3–4
partial order, 97 Bregman divergence, 40
ratio detection algorithm, 99 Brunn-Minkowski inequality, 272, 276
weak law of large numbers, 101–102
Asymptotic convex geometry, 273, 290
Autocorrelation sequence, 467–469 C
Azuma’s inequality, 262 Carleman continuity theorem, 55, 354
Cauchy-Schwartz inequality, 19, 26, 167, 204,
208, 250, 507
B Cayley-Hamilton theorem, 213
Banach-Mazur distance, 290 CDF. See Cumulative distribution function
Banach space, 69–71, 150, 158–159, 297, 305, (CDF)
307 Central limit theorem, 9–11, 16, 54, 145, 230,
Bennett inequality, 15, 125–127 247, 248, 332, 337, 347
Bernoulli random process, 278 Chaining approach, 192
Bernoulli sequence, 70, 76 Chance-constrained linear matrix inequalities,
Bernstein approximations, 528 542–543
Bernstein’s inequality, 14–16, 66–67, 125–127, Channel state information (CSI), 544, 548–550
510 Chatterjee theorem, 228–231
Bernstein trick, 94, 97, 102, 496–497 Chebyshev’s inequality, 5–6, 12, 75, 101, 152,
Big Data 260–261, 496
cognitive radio network Chernoff bounds, 6, 131–134
complex network and random graph, Chernoff moment generating function,
574 123–125
testbed, 571 Chernoff’s inequality, 30–31, 92–93, 376
data collection, 572, 576 Chevet type inequality, 303
data mining, 573 Chi-square distributions, 146–148

Classical bias-variance decomposition, CRN. See Cognitive radio network (CRN)


464 Cumulative distribution function (CDF), 224,
Classical covariance estimation, 461–463 236, 237
Closed form approximations, 528
Cognitive radio network (CRN), 457, 458
complex network and random graph, 574 D
testbed, 571 Database friendly data processing
Commutative matrices, 495–497 low rank matrix approximation, 559–560
Compact self-adjoint operators, 80–83 matrix multiplication
Complexity theory, 282 Frobenius norm, 564
Compressed sensing norm inequalities, 563–564
Bernoulli random matrix, 373 spectral norm error, 563
Gaussian random matrices, 373 sums of random matrices, 562
Gelfand widths, 373 weighted sum of outer products, 562
0 −minimization, 372 matrix sparsification, 564–565
measurement matrix, 371–372 row sampling, matrix algorithms, 560–561
recovery algorithms, 372–373 tensor sparsification, 565–566
RIP, 372 Data collection, 572, 573, 576
Shannon entropy, 374 Data fusing, 459
s-sparse, 371–372 Data matrices, 261, 520
Convergence laws, 365–366 Data mining, 573
Convex approximations, 542 Data storage and management, 572–573
Convex component functions, 440–443 Decoupling technique
Courant-Fischer characterization, eigenvalues, convex function, 48
46–47 Fubini’s inequality, 51
Courant-Fischer theorem, 488 Gaussian case, 52–53
Covariance matrix estimation, 528–529 Jensen’s inequality, 51, 52
arbitrary distributions, 321–322 multilinear forms, 47
classical covariance estimation, 461–463 multiple-input, multiple-output, 48–50
eigenvalues, 476 quadratic and bilinear forms, 48–51
Golden-Thompson inequality, 322 Detection
infinite-dimensional data, 478–479 data matrices, 520
masked covariance estimation high-dimensional matched subspace
classical bias-variance decomposition, detection (see High-dimensional
464 matched subspace detection)
complexity metrics, 466–467 high-dimensional vectors (see High-
decaying matrix, 465 dimensional vectors)
matrix concentration inequalities, hypothesis, 525–526
463–464 random matrix (see Random matrix
multivariate normal distributions, 466 detection)
root-mean-square spectral-norm error, sparse principal components (see Sparse
464 principal components)
WSS, 467 Deterministic dictionary, 385, 387–388, 392
MSR, 322–323 Deterministic sensing matrices, 410
MWR, 323 Deylon theorem, 227–228
partial estimation, 476–477 Dilations, 43–44
robust covariance estimation, 485–486 Dimension-free inequalities, 137–140, 193
sample covariance matrix, 475 Distorted wave born approximation (DWBA),
signal plus noise model (see Signal plus 454
noise matrix model) Distributed aperture, 487
stationary time series, 474 Dudley’s inequality, 162–165
sub-Gaussian distributions, 320–321 Dvoretzky’s theorem, 315
E F
Efron-Stein inequality, 17 Feature-based detection, 177
Eigenvalues Fourier transform, 6–10, 54, 77, 261, 352, 354,
chaining approach, 192 394, 395, 400, 444, 447, 448, 474
general random matrices, 193–194 Free probability
GUE, 190 convolution, 364–365
Lipschitz mapping, 202–203 definitions and properties, 360–362
matrix function approximation free independence, 362–364
matrix Taylor series (see Matrix Taylor large n limit, 358
series) measure theory, 358
polynomial/rational function, 211 non-commutative probability, 359
and norms supremum representation practical significance, 359–360
Cartesian decomposition, 199 random matrix theory, 359
Frobenius norm, 200 Frobenius norms, 21, 22, 156–158, 180, 200,
inner product, 199, 200 202, 206, 217, 246, 256–257, 305,
matrix exponential, 201–202 523
Neumann series, 200 Fubini’s inequality, 51
norm of matrix, 199 Fubini’s theorem, 286, 513, 535
unitary invariant, 200 Fubini-Tonelli theorem, 23, 26
quadratic forms (see Quadratic forms)
random matrices (see Random matrices)
random vector and subspace distance G
Chebyshev’s inequality, 260–261 Gaussian and Wishart random matrices,
concentration around median, 258 173–180
orthogonal projection matrix, 257–260 Gaussian concentration inequality, 155–156
product probability, 258 Gaussian orthogonal ensembles (GOE), 161
similarity based hypothesis detection, Gaussian random matrix (GRM), 55, 56, 178,
261 192, 342, 353, 361, 373, 498, 500
Talagrand’s inequality, 258, 259 Gaussian unitary ensemble (GUE), 190, 246
smoothness and convexity Geometric functional analysis, 271
Cauchy-Schwartz’s inequality, 204, Global positioning system (GPS), 573
208 GOE. See Gaussian orthogonal ensembles
Euclidean norm, 206 (GOE)
function of matrix, 203 Golden-Thompson inequality, 38–39, 86, 91,
Guionnet and Zeitoumi lemma, 204 322, 496
Ky Fan inequality, 206–207 Gordon’s inequality, 168–170
Lidskii theorem, 204–205 Gordon-Slepian lemma, 182
linear trace function, 205 Gordon’s theorem, 313
moments of random matrices, GPS. See Global positioning system (GPS)
210 Gram matrix, 18, 288–289
standard hypothesis testing problem, Greedy algorithms, 372, 384–385
208–209 GRM. See Gaussian random matrix (GRM)
supremum of random process, 267–268 Gross, Liu, Flammia, Becker, and Eisert
symmetric random matrix, 244–245 derivation, 106
Talagrand concentration inequality, Guionnet and Zeitouni theorem, 219
191–192, 216–218
trace functionals, 243–244 H
variational characterization, 190 Haar distribution, 366, 521
Empirical spectral distribution (ESD), 352–353 Hamming metric, 150
ε−Nets arguments, 67–68 Hanson-Wright inequality, 380, 383
Euclidean norm, 21, 200, 202, 206, 217, 246, Harvey’s derivation
257, 523 Ahlswede-Winter inequality, 87–89
Euclidean scalar product, 246, 506 Rudelson’s theorem, 90–91
Heavy-tailed rows I
expected singular values, 319 Information plus noise model
isotropic vector assumption, 317 sums of random matrices, 492–494
linear operator, 317 sums of random vectors, 491–492
non-commutative Bernstein inequality, Intrusion/activity detection, 459
317–318 Isometries, 46, 314, 385, 413, 414
non-isotropic, 318–319
Hermitian Gaussian random matrices (HGRM)
m, n, σ2 , 59–60 J
n, σ2 , 55–59 Jensen’s inequality, 30, 45, 210, 256, 513
Hermitian matrix, 17–18, 85, 86, 361–362 Johnson–Lindenstrauss (JL) lemma
Hermitian positive semi-definite matrix, Bernoulli random variables, 376
265–266 Chernoff inequalities, 376
HGRM. See Hermitian Gaussian random circulant matrices, 384
matrices (HGRM) Euclidean space, 374
High dimensional data processing, 459 Gaussian random variables, 376
High-dimensional matched subspace detection k -dimensional signals, 377
Balzano, Recht, and Nowak theorem, 516 linear map, 375
binary hypothesis test, 514 Lipschitz map, 374–375
coherence of subspace, 515 random matrices, 375
projection operator, 515 union bound, 377
High-dimensional statistics, 416–417
High-dimensional vectors K
Arias-Castro theorem, 512–514 Kernel-based learning, 289
average Bayes risk, 511 Kernel space, 491
operator norm, 512 Khinchin’s inequality, 64, 70–71
subspace detection, compressive sensing Khintchine’s inequality, 140–144, 381, 534
Azizyan and Singh theorem, 518–519 Klein’s inequality, 40–41
composite hypothesis test, 517 Kullback-Leibler (KL) divergence, 510, 511,
hypothesis test, 516 513, 521
low-dimensional subspace, 516 Ky Fan maximum principle, 201
observation vector, 517
projection operator, 517
worst-case risk, 511 L
High-dimensions Laguerre orthogonal ensemble (LOE), 348
PCA (see Principal component analysis Laguerre unitary ensemble (LUE), 348
(PCA)) Laplace transform, 6, 8, 61, 63, 64, 108,
two-sample test 130–131, 147
Haar distribution, 521 Laplacian matrix, 575–576
Hotelling T 2 statistic, 520–521 Large random matrices, 351–352
hypothesis testing problem, 520 data sets matrix, 567
Kullback-Leibler (KL) divergence, 521 eigenvalues, 568
Lopes, Jacob and Wainwright theorem, linear spectral statistics (see Linear spectral
522–525 statistics)
random-random projection method, measure phenomenon, 567
521 Talagrand’s concentration inequality, 568
Hilbert-Schmidt inner product, 436 trace functions, 249
Hilbert-Schmidt norm, 200, 202, 206, 217, Liapounov coefficient, 77
246, 479, 482, 523 Lieb’s theorem, 42–43
Hilbert-Schmidt scalar product, 159 Limit distribution laws, 352
Hoeffding’s inequality, 12–14, 149–150 Linear bounded and compact operators, 79–80
Hölder’s inequality, 26, 303 Linear filtering, 134–136
Homogeneous linear constraints, 540 Linear functionals, 134, 146, 151, 153, 253,
Hypothesis detection, 525–526 264, 276, 302, 307, 504
Linear matrix inequality (LMI), 412, 528, 536 Matrix completion


Linear regression model, 423–426 Bernsterin’s inequality, 433
Linear spectral statistics convex deconvolution method, 434
Euclidean norm, 246 Frobenius/Hilbert-Schmidt norm, 430
GUE, 246 noncommutative bernstein inequality,
Klein’s lemma, 248–249 432–433
Lipschitz function, 246–247 nuclear norm, 430
log-Sobolev inequality, 247, 248 orthogonal decomposition and orthogonal
total variation norm, 247 projection, 430–432
trace functions, 249 Recht analyze, 432
Lipschitz functions, 146, 148, 202, 204–208, replacement sampling, 432
210, 216, 217, 224, 265–266, 313, Schatten 1-norm, 434
568, 572 SDP, 430
Logarithmic Sobolev inequality, 193–194, 219, sparse corruptions, 433–434
244, 248 spectral norm, 430
Log-concave random vectors, 275–277, Matrix compressed sensing
290–295 error bounds (see Low-rank matrix
Lõwner-Hernz theorem, 35 recovery)
Low rank matrix approximation, 559–560 Frobenius/trace norm, 417
Low-rank matrix recovery linear observation model, 417–418
compressed sensing, 411 nuclear norm regularization, 418
error bounds restricted strong convexity, 418–419
Gaussian operator mapping, 419 Matrix concentration inequalities, 463–464
noise vector, 419 Matrix hypothesis testing, 494–495
SDP, 419–420 Matrix laplace transform method
standard matrix compressed sensing, Ahlswede-Winter’s derivation
419 Bernstein trick, 97, 102
Gaussian vector, 412 binary I-divergence, 103
hypothesis detection, 415–416 Chebyshev inequality, 101
linear mapping, 411 hypothesis detection, 95–97
RIP, 412 Markov inequality, 100–101
SDP, 412–413 matrix order, 98
sparsity recovery, 411 matrix-valued Chernoff bound, 104–105
tools for, 439–440 partial order, 97
ratio detection algorithm, 99
weak law of large numbers, 101–102
Gross, Liu, Flammia, Becker, and Eisert
M derivation, 106
Machine learning, 459 Harvey’s derivation
Marcenko-Pastur distribution, 502 Ahlswede-Winter inequality, 87–89
Marcenko-Pasture law, 353 Rudelson’s theorem, 90–91
Markov’s inequality, 5, 27, 100–101, 194, 380, Oliveria’s derivation, 94–95
382, 535, 539 Recht’s derivation, 106–107
Masked covariance estimation Tropp’s derivation, 107
classical bias-variance decomposition, 464 Vershynin’s derivation, 91–94
complexity metrics, 466–467 Wigderson and Xiao derivation, 107
decaying matrix, 465 Matrix multiplication
matrix concentration inequalities, 463–464 Frobenius norm, 564
multivariate normal distributions, 466 norm inequalities, 563–564
root-mean-square spectral-norm error, 464 spectral norm error, 563
WSS, 467 sums of random matrices, 562
Matrix Chernoff I-tropp, 121–122 weighted sum of outer products, 562
Matrix Chernoff II-tropp, 122–123 Matrix restricted isometry property, 413–414
Matrix sparsification, 564–565 Cauchy-Schwarz inequality, 167


Matrix Taylor series concentration inequalities, 166
convergence of, 211–212 discretization arguments, 171–173
truncation error bound Gordon’s inequality, 168–170
Cayley-Hamilton theorem, 213 ∞ -operator norm, 165
cosine function, 212 matrix inner product, 166
matrix function, 213–214 sparsity, 171
matrix-valued fluctuation, 214–215 spectral norm, 165
Newton identities, 213 zero-mean Gaussian process, 167
∞−norm, 212 operator norms, 180–185
positive semidefinite matrix, 215 random vectors
trace function fluctuation, 214 Banach space, 158–159
unitarily invariant norm, 214–215 convex function, 153
Matrix-valued random variables Frobenius norm, 156–158
Bennett and Bernstein inequalities, Gaussian concentration inequality,
125–127 155–156
Chernoff bounds, 131–134 Hoeffding type inequality, 149–150,
cumulate-based matrix-valued laplace 155
transform method, 108–109 linear functionals, 151
dimension-free inequalities, 137–140 Lipschitz function, 148
expectation controlling, 119–121 median, 151
Golden-Thompson inequality, 86 projections, 194–198
Hermitian matrix, 85, 86 standard Gaussian measure, 154
Khintchine’s inequality, 140–144 Talagrand’s concentration inequality,
linear filtering, 134–136 152–153
matrix cumulant generating function, variance, 151
110–111 Slepian-Fernique lemma, 160–161
matrix Gaussian series, 115–118 sub-Gaussian random matrices, 185–189
matrix generating function, 109–110 Measure theory, 358
matrix laplace transform method (see Median, 71, 118, 120, 135, 136, 149, 151–154,
Matrix laplace transform method) 159, 161, 162, 190, 193, 195, 198,
minimax matrix laplace method, 128 216, 222, 224, 239–242, 244, 252,
nonuniform Gaussian matrices, 118–119 258, 313, 325, 568
positive semidefinite matrices, 144 Mercer’s theorem, 288–289
random positive semidefinite matrices Minimax matrix laplace method, 128
Chernoff moment generating function, Minkowski’s inequality, 308, 382
123–125 Moment method, 353–354
Matrix Chernoff II-tropp, 122–123 improved moment bound, 242
Matrix Chernoff I-tropp, 121–122 k -th moment, 239
self-adjoint operator, 85 lower Bai-Yin theorem, 241
tail bounds, 111–114, 128–131 Markov’s inequality, 240
McDiarmid’s inequality, 355, 516 median of σ1 (A), 239, 240
Measure concentration moment computation, 241
chi-square distributions, 146–148 operator norm control, 238
dimensionality, 145 second moments, 238–239
Dudley’s inequality, 162–165 standard linear algebra identity, 238
eigenvalues (see Eigenvalues) strong Bai-Yin theorem, upper bound,
Gaussian and Wishart random matrices, 242–243
173–180 weak Bai-Yin theorem, upper bound, 242
Gaussian random variables, 161–162 weak upper bound, 240–241
GOE, 161 Monte-Carlo algorithms, 310
Hilbert-Schmidt scalar product, 159 Monte-Carlo simulations, 528, 530
induced operator norms Multi-spectral approach, 459
Multi-task matrix regression, 427–429 subspace of maximal variance, 488


Multiuser multiple input single output (MISO) Probabilistically secured joint amplify-and-
problem, 543 forward relay
Multivariate normal distributions, 466 cooperative communication, 547
proposed approach
Bernstein-type inequalities, 556
N Bernstein-type inequality, 552–553
Neumann series, 200, 264, 354, 355 eigen-decomposition, 555
Neyman-Pearson fundamental lemma, 512 Frobenius norm of matrix, 553
Non-communtative random matrices, 525–526 Monte Carlo simulation, 556
Non-commutative Bernstein inequality, NP-hard problem, 554
317–318, 516 safe tractable approximation approach,
Non-commutative polynomials, 222–223 552
Non-contiguous orthogonal frequency division SDP problem, 554, 555
multiplexing (NC-OFDM), 457, semidefinite matrix, 553
459–461 violation probability, 555
Nonconvex quadratic optimization, 531–532, safe means approximation, 548
539–542 simulation results, 556–558
Nonuniform Gaussian matrices, 118–119 system model
Nuclear norm, 22, 180, 412, 418, 430, 449, expectation operator, 550
453, 479 first and second hop function diagram,
548, 549
optimization problems, 552
O received signal plus artificial noise, 548,
Oliveria’s derivation, 94–95 550
Orthogonal frequency division multiple access SINR, 549–551
(OFDMA), 558 transmitted signal plus cooperative
Orthogonal frequency division multiplexing jamming, 549–550
(OFDM) radar, 487 two-hop half-duplex AF relay network,
Orthogonal matrix, 575 548
tractable means approximation, 548
Probability
Bregman divergence, 40
P characteristic function, 7
Paouris’ concentration inequality, 290–292 Chebyshev’s inequality, 5–6
Parallel computing, 572 Chernoff bound, 6
Partial random Fourier matrices, 384 Chernoff’s inequality, 30–31
PCA. See Principal component analysis (PCA) convergence, 30
Peierls-Bogoliubov inequality, 36–37 Courant-Fischer characterization,
Phase retrieval via matrix completion eigenvalues, 46–47
matrix recovery via convex programming, dilations, 43–44
446–447 eigenvalues and spectral norms, 32–33
methodology, 444–446 expectation, 23–26, 32, 45
phase space tomography, 447–449 f (A) definition, 20–21
self-coherent RF tomography (see Fourier transform, 6–8
Self-coherent RF tomography) Golden-Thompson inequality, 38–39
Pinsker’s inequality, 513 Hermitian matrices, 17–18
Poincare inequality, 256, 295 independence, 4
Positive semi-definite matrices, 18–20, 44, 144 isometries, 46
Principal component analysis (PCA), 347–348, Jensen’s inequality, 30
573 laplace transform, 8
inconsistency, 490 Lieb’s theorem, 42–43
noisy samples, 489 Markov inequality, 5
SDP formulation, 489 matric norms, 21–22
Probability (cont.) concentration of quadrics, 257


matrix exponential, 38 convex and 1-Lipschitz function, 250, 253
matrix logarithm, 39 convexity, 251
moments and tails, 26–29 El Karoui lemma, 254
operator convexity and monotonicity, 35 Euclidean norm, 250
partial order, 44 f (x) function, 250–251
positive semidefinite matrices, 18–20, 44 Gaussian random variables, 256
probability generating function, 8–9 Gaussian random vector, 257
quantum relative entropy, 40–42 Lipschitz coefficient, 251
random variables, 4–5 Lopes, Jacob,Wainwright theorem,
random vectors, 29–30 255–257
semidefinite order, 45 real-valued quadratic forms, 255–256
spectral mapping, 33–35 uncenterted correlation matrix, 251
trace and expectation commute, 45 Quantum information divergence, 40–42
trace functions, convexity and Quantum relative entropy, 40–42
monotonicity, 36–37
union bound, 3–4
Probability constrained optimization R
chance-constrained linear matrix Rademacher averages and symmetrization,
inequalities, 542–543 69–71
distributed robust optimization, 556 Rademacher random variables, 69, 70, 115,
OFDMA, 558 116, 283, 285, 437
probabilistically secured joint AF relay Random access memory (RAM), 310
(see Probabilistically secured joint Randomly perturbed linear matrix inequalities,
amplify-and-forward relay) 530–531, 536–537
problem Random matrices
Bernstein-type concentration Bernoulli random variables, 388, 391
inequalities, 545–546 Brunn-Minkowski inequality, 272
closed-form upper bounds, 546, 547 commonly encountered matrices, 365
convex conic inequalities, 546 concentration inequality, 385
covariance matrix estimation, 528–529 concentration of singular values
CSI, 544 non-asymptotic estimates, 324
decomposition approach, 547 random and deterministic matrices,
MISO, 543 329–331
multi-antenna transmit signal vector, random determinant, 331–334
543 random matrix theory, 324
real vector space, 528 rectangular matrices, 327–329
semidefinite relaxation, 545 sample covariance matrices, 325–326
setup, 527 sharp small deviation, 325
SINR, 544 square matrices, 326–327
tractable approximations, 528 sub-Gaussian random variables, 324
sums of random symmetric matrices tall matrices, 326
(see Sums of random symmetric convergence laws, 365–366
matrices) conversion into random vectors, 76–78
Probability generating function, 8–9 covariance matrices, independent rows
absolute constant, 286–287
Q bounded random vector, 287
Quadratic forms, 249–250 complexity theory, 282
Bechar theorem, 254–255 Fubini’s theorem, 286
Cauchy-Schwartz inequality, 250 Gaussian coordinates, 282
centering matrix, 252 integral operators, 288–289
complex deterministic matrix, 252–254 inverse problems, 289–290
complex-valued quadratic forms, 255 isotropic position, 281
Lebesgue measure, 281 isotropic log-concave random vectors,


Rademacher random variables, 285 299–303
scalar-valued random variables, 286 moment method, 54–55, 353–354
symmetric random variables, 284 noncommuntative random matrices,
symmetrization theorem, 283–284 525–526
vector-valued random variable, 285 noncommutative polynomials, 222–223
covariance matrix estimation non-empty compact, 272–273
arbitrary distributions, 321–322 OFDM modulation waveform, 273
Golden-Thompson inequality, 322 Rudelson’s theorem, 277–281
MSR, 322–323 small ball probability, 295–299
MWR, 323 spectral properties, 271
sub-Gaussian distributions, 320–321 Stieltjes transform domain, 261–264
dual norm, 272 Stirling’s formula, 389–390
ε-cover, 386–387 sums of two random matrices, 235–237
Euclidean ball, 385–386 symmetric convex body, 272
Euclidean norm, 272, 384 threshold, 391–392
Fourier method, 54 universality of singular values (see
free probability (see Free probability) Universality of singular values)
Gaussian chaos, 389 Wigner random matrices (see Wigner
geometric functional analysis, 271 random matrices)
greedy algorithms, 384–385 Wishart random matrices (see Wishart
HGRM random matrices)
m, n, σ2 , 59–60 Random matrix detection
n, σ2 , 55–59 Chebyshev’s inequality, 496
independent entries, 313–314 commutative matrices, 495–497
independent rows commutative property, 498–499
Dvoretzky’s theorem, 315 Golden-Thompson inequality, 496, 498
heavy-tailed rows (see Heavy-tailed hypothesis testing problem, 497
rows) Lieb’s theorem, 497
infinite-dimensional function, 314 matrix-valued random variable, 496
sub-Gaussian, isotropic random vectors, Monto Carlo simulations, 499, 500
315 probability of detection, 497
invertibility of, 334–336 product rule of matrix expectation, 499–500
isotropic convex bodies, 273–275 Random matrix theory
isotropic, log-concave random vectors low rank perturbation, Wishart matrices,
non-increasing rearrangement and order 503
statistics, 292–293 sparse eigenvalues, 503
Paouris’ concentration inequality, spectral methods, 502–503
290–292 Random Toeplitz matrix, 408–410
sample covariance matrix, 293–294 Random variables, 4–5
large (see Large random matrices) Random vectors, 29–30
log-concave random vectors, 275–277 Banach space, 158–159
low rank approximation convex function, 153
functional-analytic nature, 311 Frobenius norm, 156–158
linear sample complexity, 311 Gaussian concentration inequality, 155–156
matrix-valued random variables, 311 Hoeffding type inequality, 149–150, 155
Monte-Carlo algorithms, 310 linear functionals, 151
numerical rank, 311 Lipschitz function, 148
RAM, 310 median, 151
spectral and Frobenius norms, 310 projections, 194–198
SVD, 310, 313 standard Gaussian measure, 154
matrix-valued random variables, 305–309 Talagrand’s concentration inequality,
moment estimates 152–153
convex measures, 303–305 variance, 151
Ratio detection algorithm, 99 SDP. See Semidefinite programming (SDP)


Rayleigh-Ritz theorem, 488 Self-coherent RF tomography
Received signal strength (RSS), 460 Born iterative method, 454–455
Recht’s derivation, 106–107 DWBA, 454
Recovery error bounds, 414–415 Green’s function, 454
Restricted isometry property (RIP), 372, Matlab-based modeling system, 453
414–415 phase retrieval problem, 452–453
compressive sensing (CS), 377 rank function, 453
Grammian matrix, 378 system model, 449–452
JL, 378–379 Semicircle law, 353
partial random circulant matrices Semidefinite programming (SDP), 412–413,
arbitrary index, 393 506–507, 541–542
Dudley’s inequality for chaos, 396 Semidefinite relaxation, 540
Euclidean unit ball, 393 Shannon entropy, 374
Fourier domain, 400 Signal plus noise matrix model, 341
Fourier representation, 394–395 autocorrelation sequence, 468, 469
integrability of chaos processes, Hilbert-Schmidt norm, 479, 482
395–396 orthogonal projections, 480
L-sub-Gaussian random variables, 399 quadratic functional, low-rank matrices,
Rademacher vector, 393, 398 484–485
restricted isometry constant, 399 rank-r projections, 481, 482
tail bound for chaos, 397–398 sample covariance matrix, 471–474
vector-valued random process, 398 Schatten-p norm, 479
randomized column signs, 378–379 singular value, 484
time-frequency structured random matrices tridiagonal Toeplitz matrix, 469–471
Gabor synthesis matrices, 400–403 universal upper bound, 483
Hermitian matrices, 402 upper bound, Gaussian matrices, 483
Rademacher/Steinhaus chaos process, white Gaussian noise, 468, 469
401, 403–405 Singular value decomposition (SVD), 310,
Rademacher vector, 401 312, 313
time-frequency shifts, 400 Skew-Hermitian matrix, 201
wireless communications and radar, 402 Slepian-Fernique lemma, 160–161
Root-mean-square spectral-norm error, 464 Smart Grid, 574
Rounding errors, 212 Smooth analysis, 335
Row sampling, 560–561 Space-time coding combined with CS, 490
Rudelson’s theorem, 90–91, 277–281 Sparse principal components
detection
concentration inequalities, k -sparse
S largest eigenvalue, 504–505
Safe tractable approximation, 537–539 hypothesis testing with λk max,
Sample covariance matrix, 471–474 505–506
Scalar-valued random variables test statistic, 503
Bernstein’s inequality, 14–16 empirical covariance matrix, 491
central limit theorem, 9–11 Gaussian distribution, 490
Chebyshev’s inequality and independence, identity matrix, 490–491
12 Kernel space, 491
Efron-Stein inequality, 17 semidefinite methods
expectation bound, 12 high probability bounds, convex
Hoeffding’s inequality, 12–14 relaxation, 508
Scenario approximation, 528 hypothesis testing with convex methods,
Schatten 2-norm, 200, 202, 206, 217, 246, 523 508–509
Schatten-p norm, 21 semidefinite relaxation, 506–507
Schatten q -norm, 461 sphericity hypothesis, 490
Sparse vector estimation, 509–510 Tracy-Widom law, 243


Spectral mapping theorem, 33–35, 86, 94–95 Tridiagonal Toeplitz matrix, 469–471
Sphericity test, 501–502 Tropp’s derivation, 107
Spielman-Teng’s conjecture, 336 Truncated errors, 212
Standard Gaussian measure, 154
Stein’s method, 347
Stieltjes transform, 237 U
Azuma’s inequality, 262–263 1-unconditional isotropic convex body, 291
deterministic matrix, 357 Union bound, 3–4, 28, 106, 117, 176, 187, 295,
Fourier transform, 354 315, 316, 376, 377, 379, 386, 388,
hypothesis testing problem, 263–264 427, 428, 563
McDiarmid’s inequality, 355 Unitary invariance, 33
Neumann series, 264, 354–355 Universality of singular values
quadratic form, 356 covariance and correlation matrices
random matrix, 356 PCA, 347–348
R-and S-transforms, 365, 367–368 singular value decomposition, 345
sample covariance matrix, 261–262 Stein’s method, 347
Schur complements, 355–356 sub-exponential decay, 345
semicircular law, 358 Tracy-Widom distribution, 345–346
spectral theory, 354 deterministic matrix, 341–345
sum of martingale differences, 262 Gaussian models, 337–338
sum of N rank-one matrices, 261 least singular value, 339
Taylor series, 264 normalizations, 337
Stirling’s formula, 388 Poisson point process, 337
Stochastic processes, 75 rectangular matrices, 339–340
Strong subadditivity (SSA), 265, 266 Unmanned aerial vehicles (UAVs), 573
Structured random matrices, 384
Sub-exponential random variables, 66–67
Sub-Gaussian random matrices, 185–189 V
Sub-Gaussian random variables, 60–64 Vershynin’s derivation, 91–94
Sub-Gaussian random vectors, 65–66, 72–75 von Neumann algebras, 361
Submatrices, 237 von Neumann divergence, 40–42
Sums of random symmetric matrices von Neumann entropy functions, 264–267
Khintchine-type inequalities, 534 von Neumann entropy penalization
Nemirovski theorem, 532–534 Hermitian matrices, 434–435
nonconvex quadratic optimization, low rank matrix estimation
orthogonality constraints, 531–532, orthogonal basis, 436–437
539–542 system model and formalism, 435–436
randomly perturbed linear matrix
inequalities, 530–531, 536–537
safe tractable approximation, 537–539 W
So theorem, 534–535 Weak law of large numbers, 101–102
typical norm, 529–530 Weak monotonicity, 265, 266
Suprema of chaos processes, 405–408 White Gaussian noise, 468
Symmetrization theorem, 283–284 Wide-sense stationary (WSS), 467
Wigderson and Xiao derivation, 107
Wigner matrix, 353, 355, 364
T Wigner random matrices
Talagrand’s concentration inequality, 152–153, Guionnet and Zeitouni theorem, 219
191–192, 216–218, 258, 313, 568 Herbst theorem, 219
Taylor series, 212, 264 Hermitian Wigner matrix, 218
Tensor sparsification, 565–566 Hoffman-Wielandt lemma, 219, 220
Trace functionals, 243–244 hypothesis testing, 221
Tracy-Widom distribution, 345–346 normalized trace function, 221
Wigner random matrices (cont.) Deylon theorem, 227–228


scaling, 218 Euclidean operator norm, 223
Stein’s method, 221 Guntuboyina and Lebb theorem, 224–225
Wishart matrix, 221 independent mean zero entries, 223
Wigner semicircle distribution, 353 Jiang theorem, 231–233
Wigner’s trace method, 238 Lipschitz function, 224
Wireless distributed computing, 458, 571–572 low rank perturbation, 503
Wishart random matrices Meckes and Meckes theorem, 233–235
CDF, 224 sample covariance matrix, 226–227
Chatterjee theorem, 228–231 WSS. See Wide-sense stationary (WSS)
