Non-Convex Optimization
The official publication is available from now publishers via
http://dx.doi.org/10.1561/2200000058
Contents

Abstract
Preface
Mathematical Notation

1 Introduction
1.1 Non-convex Optimization
1.2 Motivation for Non-convex Optimization
1.3 Examples of Non-Convex Optimization Problems
1.4 The Convex Relaxation Approach
1.5 The Non-Convex Optimization Approach
1.6 Organization and Scope

2 Mathematical Tools
2.1 Convex Analysis
2.2 Convex Projections
2.3 Projected Gradient Descent
2.4 Convergence Guarantees for PGD
2.4.1 Convergence with Bounded Gradient Convex Functions
2.4.2 Convergence with Strongly Convex and Smooth Functions
2.5 Exercises
2.6 Bibliographic Notes
4 Alternating Minimization
4.1 Marginal Convexity and Other Properties
4.2 Generalized Alternating Minimization
4.3 A Convergence Guarantee for gAM for Convex Problems
4.4 A Convergence Guarantee for gAM under MSC/MSS
4.5 Exercises
4.6 Bibliographic Notes

5 The EM Algorithm
5.1 A Primer in Probabilistic Machine Learning
5.2 Problem Formulation
5.3 An Alternating Maximization Approach
5.4 The EM Algorithm
5.5 Implementing the E/M steps
5.6 Motivating Applications
5.6.1 Gaussian Mixture Models
5.6.2 Mixed Regression
5.7 A Monotonicity Guarantee for EM
5.8 Local Strong Concavity and Local Strong Smoothness
5.9 A Local Convergence Guarantee for EM
5.9.1 A Note on the Application of Convergence Guarantees
5.10 Exercises
5.11 Bibliographic Notes
III Applications

7 Sparse Recovery
7.1 Motivating Applications
7.2 Problem Formulation
7.3 Sparse Regression: Two Perspectives
7.4 Sparse Recovery via Projected Gradient Descent
7.5 Restricted Isometry and Other Design Properties
7.6 Ensuring RIP and other Properties
7.7 A Sparse Recovery Guarantee for IHT
7.8 Other Popular Techniques for Sparse Recovery
7.8.1 Pursuit Techniques
7.8.2 Convex Relaxation Techniques for Sparse Recovery
7.8.3 Non-convex Regularization Techniques
7.8.4 Empirical Comparison
7.9 Extensions
7.9.1 Sparse Recovery in Ill-Conditioned Settings
7.9.2 Recovery from a Union of Subspaces
7.9.3 Dictionary Learning
7.10 Exercises
7.11 Bibliographic Notes
Abstract
A vast majority of machine learning algorithms train their models and perform inference by
solving optimization problems. In order to capture the learning and prediction problems accu-
rately, structural constraints such as sparsity or low rank are frequently imposed or else the
objective itself is designed to be a non-convex function. This is especially true of algorithms
that operate in high-dimensional spaces or that train non-linear models such as tensor models
and deep networks.
The freedom to express the learning problem as a non-convex optimization problem gives im-
mense modeling power to the algorithm designer, but often such problems are NP-hard to solve.
A popular workaround to this has been to relax non-convex problems to convex ones and use
traditional methods to solve the (convex) relaxed optimization problems. However, this approach
may be lossy and, even when it is not, presents significant challenges for large-scale optimization.
On the other hand, direct approaches to non-convex optimization have met with resounding
success in several domains and remain the methods of choice for the practitioner, as they fre-
quently outperform relaxation-based techniques – popular heuristics include projected gradient
descent and alternating minimization. However, these are often poorly understood in terms of
their convergence and other properties.
This monograph presents a selection of recent advances that bridge a long-standing gap
in our understanding of these heuristics. We hope that an insight into the inner workings of
these methods will allow the reader to appreciate the unique marriage of task structure and
generative models that allows these heuristic techniques to (provably) succeed. The monograph
will lead the reader through several widely used non-convex optimization techniques, as well as
applications thereof. The goal of this monograph is to both introduce the rich literature in this
area and equip the reader with the tools and techniques needed to analyze these simple
procedures for non-convex problems.
Preface
Optimization as a field of study has permeated much of science and technology. The advent of
the digital computer and a tremendous subsequent increase in our computational prowess have
increased the impact of optimization in our lives. Today, everything from tiny details such as
airline schedules to leaps and strides in medicine, physics, and artificial intelligence relies on
modern advances in optimization techniques.
For a large portion of this period of excitement, our energies were focused largely on con-
vex optimization problems, given our deep understanding of the structural properties of convex
sets and convex functions. However, modern applications in domains such as signal process-
ing, bio-informatics and machine learning, are often dissatisfied with convex formulations alone
since there exist non-convex formulations that better capture the problem structure. For ap-
plications in these domains, models trained using non-convex formulations often offer excellent
performance and other desirable properties such as compactness and reduced prediction times.
Examples of applications that benefit from non-convex optimization techniques include gene
expression analysis, recommendation systems, clustering, and outlier and anomaly detection. In
order to obtain solutions to these problems that are scalable and accurate, we require a deeper
understanding of the non-convex optimization problems that naturally arise in these problem
settings.
Such an understanding was lacking until very recently and non-convex optimization received
little attention as an active area of study, being regarded as intractable. Fortunately, a long line
of recent works has led areas such as computer science, signal processing, and statistics to
realize that the general abhorrence of non-convex optimization problems hitherto practiced was
misguided. These works demonstrated, in a beautiful way, that although non-convex optimization
problems do suffer from intractability in general, those that arise in natural settings such as
machine learning and signal processing possess additional structure that allows the intractability
results to be circumvented.
The first of these works still religiously stuck to convex optimization as the method of choice,
and instead sought to show that certain classes of non-convex problems, which possess suitable
additional structure as offered by natural instances of those problems, could be converted to
convex problems without any loss. More precisely, it was shown that the original non-convex
problem and the modified convex problem possessed a common optimum and thus, the solution
to the convex problem would automatically solve the non-convex problem as well! However,
these approaches had a price to pay in terms of the time it took to solve these so-called relaxed
convex problems. In several instances, these relaxed problems, although not intractable, were
nevertheless challenging to solve at large scales.
It took a second wave of still more recent results to usher in provable non-convex optimization
techniques which abstained from relaxations, solved the non-convex problems in their native
forms, and yet seemed to offer the same quality of results as relaxation methods did. These newer
results were accompanied by a realization that, for a wide range of applications such as sparse
recovery, matrix completion, and robust learning, these direct techniques are faster, often by an
order of magnitude or more, than relaxation-based techniques, while offering solutions of
similar accuracy.
This monograph wishes to tell the story of this realization and the wisdom we gained from it
from the point of view of machine learning and signal processing applications. The monograph
will introduce the reader to a lively world of non-convex optimization problems with rich
structure that can be exploited to obtain extremely scalable solutions to these problems. Put a
bit more dramatically, it will seek to show how problems that were once avoided, having been
shown to be NP-hard to solve, now have solvers that operate in near-linear time, by carefully
analyzing and exploiting additional task structure! It will seek to inform the reader on how to
look for such structure in diverse application areas, as well as equip the reader with a sound
background in fundamental tools and concepts required to analyze such problem areas and
come up with newer solutions.
How to use this monograph We have made efforts to make this monograph as self-contained
as possible while not losing focus of the main topic of non-convex optimization techniques.
Consequently, we have devoted entire sections to present a tutorial-like treatment to basic
concepts in convex analysis and optimization, as well as their non-convex counterparts. As
such, this monograph can be used for a semester-length course on the basics of non-convex
optimization with applications to machine learning.
On the other hand, it is also possible to cherry-pick portions of the monograph, such as the
section on sparse recovery, or the EM algorithm, for inclusion in a broader course. Several
courses such as those in machine learning, optimization, and signal processing may benefit from
the inclusion of such topics. However, we advise that relevant background sections (see Figure 1)
be covered beforehand.
While striving for breadth, the limits of space have constrained us from looking at some
topics in much detail. Examples include the construction of design matrices that satisfy the
RIP/RSC properties and pursuit style methods, but there are several others. However, for all
such omissions, the bibliographic notes at the end of each section can always be consulted
for references to details of the omitted topics. We have also been unable to address several
application areas such as dictionary learning, advances in low-rank tensor decompositions, topic
modeling and community detection in graphs but have provided pointers to prominent works
in these application areas too.
The organization of this monograph is outlined below with Figure 1 presenting a suggested
order of reading the various sections.
Section 1 - Introduction This section will give a more relaxed introduction to the area
of non-convex optimization by discussing applications that motivate the use of non-convex
formulations. The discussion will also clarify the scope of this monograph.
Section 2 - Mathematical Tools This section will set up notation and introduce some basic
mathematical tools in convex optimization. This section is basically a handy repository of useful
concepts and results and can be skipped by a reader familiar with them. Parts of the section
may instead be referred back to, as and when needed, using the cross-referencing links in the
monograph.
[Figure 1: A suggested order of reading the sections of this monograph. The flowchart connects
Introduction, Non-convex Projected GD, Expectation Maximization, and Stochastic
Optimization to the application sections on Sparse Recovery, Robust Regression, and Phase
Retrieval.]
Section 4 - Alternating Minimization This section will introduce the principle of alternat-
ing minimization which is widely used in optimization problems over two or more (groups of)
variables. The methods introduced in this section will be used in later sections to solve
problems such as low-rank matrix recovery, robust regression, and phase retrieval.
Section 5 - The EM Algorithm This section will introduce the EM algorithm which is a
widely used optimization primitive for learning problems with latent variables. Although EM is
a form of alternating minimization, given its significance, the section gives it special attention.
This section will discuss some recent advances in the analysis and applications of this method
and look at two case studies in learning Gaussian mixture models and mixed regression to
illustrate the algorithm and its analyses.
Section 6 - Stochastic Non-convex Optimization This section will look at some recent
advances in using stochastic optimization techniques for solving optimization problems with
non-convex objectives. The section will also introduce the problem of tensor factorization as a
case study for the algorithms being studied.
Section 7 - Sparse Recovery This section will look at a very basic non-convex optimization
problem, that of performing linear regression to fit a sparse model to the data. The section
will discuss conditions under which it is possible to do so in polynomial time and show how
the non-convex projected gradient descent method studied earlier can be used to offer provably
optimal solutions. The section will also point to other techniques used to solve this problem,
as well as refer to extensions and related results.
Section 8 - Low-rank Matrix Recovery This section will address the more general problem
of low rank matrix recovery with specific emphasis on low-rank matrix completion. The section
will gently introduce low-rank matrix recovery as a generalization of sparse linear regression
that was studied in the previous section and then move on to look at matrix completion in more
detail. The section will apply both the non-convex projected gradient descent and alternating
minimization methods in the context of low-rank matrix recovery, analyzing simple cases and
pointing to relevant literature.
Section 9 - Robust Regression This section will look at a widely studied area of machine
learning, namely robust learning, from the point of view of regression. Algorithms that are
robust to (adversarial) corruption in data are sought after in several areas of signal processing
and learning. The section will explore how to use the projected gradient and alternating
minimization techniques to solve the robust regression problem and also look at applications
of robust regression to robust face recognition and robust time series analysis.
Section 10 - Phase Retrieval This section will look at some recent advances in the
application of non-convex optimization to phase retrieval. This problem lies at the heart
of several imaging techniques such as X-ray crystallography and electron microscopy. A lot
remains to be understood about this problem and existing algorithms often struggle to cope
with the retrieval problems presented in practice.
The area of non-convex optimization has considerably widened in both scope and application
in recent years and newer methods and analyses are being proposed at a rapid pace. While this
makes researchers working in this area extremely happy, it also makes summarizing the vast
body of work in a monograph such as this, more challenging. We have striven to strike a balance
between presenting results that are the best known, and presenting them in a manner accessible
to a newcomer. However, in all cases, the bibliographic notes at the end of each section do
contain pointers to the state of the art in that area and can be referenced for follow-up readings.
Mathematical Notation
• The set of real numbers is denoted by R. The set of natural numbers is denoted by N.
• Vectors are denoted by boldface, lower case alphabets, for example, x, y. The zero vector
is denoted by 0. A vector x ∈ R^p will be in column format. The transpose of a vector x is
denoted by x^⊤. The i-th coordinate of a vector x is denoted by x_i.
• Matrices are denoted by upper case alphabets, for example, A, B. A_i denotes the i-th column
of the matrix A and A^j denotes its j-th row. A_ij denotes the element at the i-th row and
j-th column.
• For a vector x ∈ R^p and a set S ⊂ [p], the notation x_S denotes the vector z ∈ R^p such
that z_i = x_i for i ∈ S, and z_i = 0 otherwise. Similarly for matrices, A_S denotes the matrix
B with B_i = A_i for i ∈ S and B_i = 0 for i ∉ S. Also, A^S denotes the matrix B with
B^i = A^i for i ∈ S and B^i = 0^⊤ for i ∉ S.
• The identity matrix of order p is denoted by Ip×p or simply Ip . The subscript may be
omitted when the order is clear from context.
• For a vector x ∈ R^p, the notation ‖x‖_q = (Σ_{i=1}^p |x_i|^q)^{1/q} denotes its L_q norm. As special
cases we define ‖x‖_∞ := max_i |x_i|, ‖x‖_{−∞} := min_i |x_i|, and ‖x‖₀ := |supp(x)|.
• Balls with respect to various norms are denoted as B_q(r) := {x ∈ R^p : ‖x‖_q ≤ r}. As a
special case, the notation B₀(s) is used to denote the set of s-sparse vectors.
• For a matrix A ∈ R^{m×n}, σ₁(A) ≥ σ₂(A) ≥ . . . ≥ σ_{min{m,n}}(A) denote its singular values.
The Frobenius norm of A is defined as ‖A‖_F := √(Σ_{i,j} A_ij²) = √(Σ_i σ_i(A)²). The nuclear
norm of A is defined as ‖A‖_* := Σ_i σ_i(A).
• The trace of a square matrix A ∈ R^{m×m} is defined as tr(A) = Σ_{i=1}^m A_ii.
• The spectral norm (also referred to as the operator norm) of a matrix A is defined as
‖A‖₂ := max_i σ_i(A). (These matrix norms are illustrated in a short code sketch following
this list.)
• The expectation of a random variable X is denoted by E [X]. In cases where the dis-
tribution of X is to be made explicit, the notation EX∼D [X], or else simply ED [X], is
used.
• The standard big-Oh notation is used to describe the asymptotic behavior of functions.
The soft-Oh notation is employed to hide poly-logarithmic factors, i.e., f = Õ(g) implies
f = O(g log^c(g)) for some absolute constant c.
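As a quick numerical illustration of the matrix norms defined above, consider the following
NumPy sketch (the variable names and the random test matrix are ours):

import numpy as np

# Check the matrix norm definitions on a small random matrix A.
A = np.random.default_rng(0).standard_normal((4, 3))
sigma = np.linalg.svd(A, compute_uv=False)          # singular values of A

frobenius = np.sqrt((A ** 2).sum())                 # ||A||_F from the entries
assert np.isclose(frobenius, np.sqrt((sigma ** 2).sum()))  # equals sqrt(sum of sigma_i^2)

nuclear = sigma.sum()                               # ||A||_* = sum of singular values
spectral = sigma.max()                              # ||A||_2 = largest singular value
assert np.isclose(spectral, np.linalg.norm(A, 2))   # matches NumPy's operator norm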
Part I
Chapter 1
Introduction
This section will set the stage for subsequent discussions by motivating some of the non-convex
optimization problems we will be studying using real life examples, as well as setting up notation
for the same.
1.1 Non-convex Optimization

Optimization problems studied in this monograph take the general form

  min_{x ∈ R^p} f(x)  s.t. x ∈ C,
where x is the variable of the problem, f : Rp → R is the objective function of the problem,
and C ⊆ Rp is the constraint set of the problem. When used in a machine learning setting, the
objective function allows the algorithm designer to encode proper and expected behavior for the
machine learning model, such as fitting well to training data with respect to some loss function,
whereas the constraint allows restrictions on the model to be encoded, for instance, restrictions
on model size.
An optimization problem is said to be convex if the objective is a convex function and the
constraint set is a convex set. We refer the reader to § 2 for formal definitions of these terms.
An optimization problem that violates either one of these conditions, i.e., one that has a
non-convex objective, or a non-convex constraint set, or both, is called a non-convex
optimization problem. In this monograph, we will discuss non-convex optimization problems
with non-convex objectives and convex constraints (§ 4, 5, 6, and 8), as well as problems with
non-convex constraints but convex objectives (§ 3, 7, 8, 9, and 10). Such problems arise in a lot
of application areas.
1.2 Motivation for Non-convex Optimization

Modern applications frequently require learning models in very high dimensional spaces, which
necessitates imposing structural constraints, such as sparsity or low rank, on the models being
estimated. Such constraints are not only helpful in regularizing the learning problem, but often
essential to prevent the problem from becoming
ill-posed. For example, suppose we know how a user rates some items and wish to infer how
this user would rate other items, possibly in order to inform future advertisement campaigns.
To do so, it is essential to impose some structure on how a user’s ratings for one set of items
influences ratings for other kinds of items. Without such structure, it becomes impossible to
infer any new user ratings. As we shall soon see, such structural constraints often turn out to
be non-convex.
In other applications, the natural objective of the learning task is a non-convex function.
Common examples include training deep neural networks and tensor decomposition problems.
Although non-convex objectives and constraints allow us to accurately model learning problems,
they often present a formidable challenge to algorithm designers. This is because unlike convex
optimization, we do not possess a handy set of tools for solving non-convex problems. Several
non-convex optimization problems are known to be NP-hard to solve. The situation is made
more bleak by a range of non-convex problems that are not only NP-hard to solve optimally,
but NP-hard to solve approximately as well [Meka et al., 2008].
1.3 Examples of Non-Convex Optimization Problems

Sparse Regression The classical problem of linear regression seeks to recover a linear model
which can effectively predict a response variable as a linear function of covariates. For example,
we may wish to predict the average expenditure of a household (the response) as a function of the
education levels of the household members, their annual salaries, and other relevant indicators
(the covariates). The ability to do so allows economic policy decisions to be more informed by
revealing, for instance, how education level affects expenditure.
More formally, we are provided a set of n covariate/response pairs (x₁, y₁), . . . , (x_n, y_n)
where x_i ∈ R^p and y_i ∈ R. The linear regression approach makes the modeling assumption
y_i = x_i^⊤ w∗ + η_i where w∗ ∈ R^p is the underlying linear model and η_i is some benign additive
noise. Using the data provided {x_i, y_i}_{i=1,...,n}, we wish to recover back the model w∗ as faithfully
as possible.
A popular way to recover w∗ is the least squares formulation

  ŵ = arg min_{w ∈ R^p} Σ_{i=1}^n (y_i − x_i^⊤ w)².
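As a quick illustration, the following NumPy sketch solves the least squares problem above on
synthetic data of our choosing (names and parameters are ours):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
w_star = rng.standard_normal(p)                    # underlying linear model w*
X = rng.standard_normal((n, p))                    # rows of X are the covariates x_i
y = X @ w_star + 0.01 * rng.standard_normal(n)     # responses with benign additive noise

# Solve arg min_w sum_i (y_i - x_i^T w)^2 with a standard least squares solver.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(w_hat - w_star))              # small recovery error since n >> p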
The linear regression problem, as well as the least squares estimator, are extremely well studied
and their behavior precisely known. However, this age-old problem acquires new dimensions in
situations where either we expect only a few of the p features/covariates to be actually relevant
to the problem but do not know their identity, or else we are working in extremely data-starved
settings, i.e., n ≪ p.
The first problem often arises when there is an excess of covariates, several of which may
be spurious or have no effect on the response. § 7 discusses several such practical examples. For
now, consider the example depicted in Figure 1.1, that of expenditure prediction in a situation
when the list of indicators includes irrelevant ones such as whether the family lives in an odd-
numbered house or not, which should arguably have no effect on expenditure. It is useful to
eliminate such variables from consideration to promote consistency of the learned model.
The second problem is common in areas such as genomics and signal processing which face
moderate to severe data starvation, where the number of data points n available to estimate the
model is far smaller than the number of covariates p.
Figure 1.1: Not all available parameters and variables may be required for a prediction or
learning task. Whereas the family size may significantly influence family expenditure, the eye
color of family members does not directly or significantly influence it. Non-convex optimization
techniques, such as sparse recovery, help discard irrelevant parameters and promote compact
and accurate models.
A common way to handle both problems is to restrict attention to sparse models by constraining
the regression problem:

  ŵ = arg min_{w ∈ R^p : ‖w‖₀ ≤ s} Σ_{i=1}^n (y_i − x_i^⊤ w)².

Although the objective function in the above formulation is convex, the constraint ‖w‖₀ ≤ s
(equivalently w ∈ B₀(s) – see the list of mathematical notation at the beginning of this mono-
graph) corresponds to a non-convex constraint set (see Exercise 2.6). Sparse recovery effortlessly
solves the twin problems of discarding irrelevant covariates and countering data starvation since,
typically, only n ≥ s log p (as opposed to n ≥ p) data points are required for sparse recovery to
work, which drastically reduces the data requirement. Unfortunately, however, sparse recovery
is an NP-hard problem [Natarajan, 1995].
Recommendation Systems Several internet search engines and e-commerce websites utilize
recommendation systems to offer items to users that they would benefit from, or like, the most.
The problem of recommendation encompasses benign recommendations for songs, etc., all the
way to critical recommendations in personalized medicine.
To be able to make accurate recommendations, we need very good estimates of how each user
likes each item (song), or would benefit from it (drug). We usually have first-hand information
for some user-item pairs, for instance if a user has specifically rated a song or if we have
administered a particular drug on a user and seen the outcome. However, users typically rate
only a handful of the hundreds of thousands of songs in any commercial catalog and it is not
feasible, or even advisable, to administer every drug to a user. Thus, for the vast majority of
user-item pairs, we have no direct information.
Figure 1.2: Only the entries of the ratings matrix with thick borders are observed. Notice
that users rate infrequently and some items are not rated even once. Non-convex optimization
techniques such as low-rank matrix completion can help recover the unobserved entries, as well
as reveal hidden features that are descriptive of user and item properties, as shown on the right
hand side.
It is useful to visualize this problem as a matrix completion problem: for a set of m users
u1 , . . . , um and n items a1 , . . . , an , we have an m × n preference matrix A = [Aij ] where Aij
encodes the preference of the ith user for the j th item. We are able to directly view only a
small number of entries of this matrix, for example, whenever a user explicitly rates an item.
However, we wish to recover the remaining entries, i.e., complete this matrix. This problem is
closely linked to the collaborative filtering technique popular in recommendation systems.
Now, it is easy to see that unless there exists some structure in the matrix, and by extension,
in the way users rate items, there would be no relation between the unobserved entries and the
observed ones. This would result in there being no unique way to complete the matrix. Thus, it
is essential to impose some structure on the matrix. A structural assumption popularly made is
that of low rank: we wish to fill in the missing entries of A assuming that A is a low rank matrix.
This can make the problem well-posed and have a unique solution since the additional low rank
structure links the entries of the matrix together. The unobserved entries can no longer take
values independently of the values observed by us. Figure 1.2 depicts this visually.
If we denote by Ω ⊂ [m] × [n] the set of observed entries of A, then the low-rank matrix
completion problem can be written as

  Â_lr = arg min_{X ∈ R^{m×n}} Σ_{(i,j)∈Ω} (X_ij − A_ij)²  s.t. rank(X) ≤ r.
This formulation also has a convex objective but a non-convex rank constraint (see Exercise 2.7).
This problem can be shown to be NP-hard as well. Interestingly, we can arrive at an alternate
formulation by imposing the low-rank constraint indirectly. It turns out that (see Exercise 3.3)
assuming the ratings matrix to have rank at most r is equivalent to assuming that the matrix A
can be written as A = UV^⊤
with the matrices U ∈ R^{m×r} and V ∈ R^{n×r} having at most r columns each. This leads us to the
following alternate formulation

  Â_lv = arg min_{U ∈ R^{m×r}, V ∈ R^{n×r}} Σ_{(i,j)∈Ω} (⟨U^i, V^j⟩ − A_ij)².
There are no constraints in this formulation. However, the formulation requires joint optimization
over a pair of variables (U, V) instead of a single variable. More importantly, it can be shown
(see Exercise 4.1) that the objective function is non-convex in (U, V).
It is curious to note that the matrices U and V can be seen as encoding r-dimensional
descriptions of users and items respectively. More precisely, for every user i ∈ [m], we can
think of the vector U^i ∈ R^r (i.e., the i-th row of the matrix U) as describing user i, and
for every item j ∈ [n], use the row vector V^j ∈ R^r to describe the item j in vector
form. The rating given by user i to item j can now be seen to be A_ij ≈ ⟨U^i, V^j⟩. Thus,
recovering the rank-r matrix A also gives us a bunch of r-dimensional latent vectors de-
scribing the users and items. These latent vectors can be extremely valuable in themselves
as they can help us in understanding user behavior and item popularity, as well as be used
in “content”-based recommendation systems which can effectively utilize item and user features.
The above examples, and several others from machine learning, such as low-rank tensor
decomposition, training deep networks, and training structured models, demonstrate the util-
ity of non-convex optimization in naturally modeling learning tasks. However, most of these
formulations are NP-hard to solve exactly, and sometimes even approximately. In the following
discussion, we will briefly introduce a few approaches, classical as well as contemporary, that
are used in solving such non-convex optimization problems.
1.4 The Convex Relaxation Approach

Faced with the challenge of non-convexity, and the associated NP-hardness, a traditional
workaround in the literature has been to modify the problem formulation itself so that existing
tools can be readily applied. This is often done by relaxing the problem so that it becomes a
convex optimization problem. Since this allows familiar algorithmic techniques to be applied, the
so-called convex relaxation approach has been widely studied. For instance, there exist relaxed,
convex problem formulations for both the recommendation system and the sparse regression
problems. For sparse linear regression, the relaxation approach gives us the popular LASSO
formulation.
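One common form of the LASSO (stated here for concreteness; the precise variant – constrained
or regularized – differs across treatments) is

  ŵ_lasso = arg min_{w ∈ R^p} Σ_{i=1}^n (y_i − x_i^⊤ w)² + λ‖w‖₁,

where the L1 norm ‖w‖₁ serves as a convex surrogate for the non-convex constraint ‖w‖₀ ≤ s,
and the parameter λ > 0 trades off data fit against sparsity.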
Now, in general, such modifications change the problem drastically, and the solutions of the
relaxed formulation can be poor solutions to the original problem. However, it is known that
if the problem possesses certain nice structure, then under careful relaxation, these distortions,
formally referred to as a “relaxation gap”, are absent, i.e., solutions to the relaxed problem would
be optimal for the original non-convex problem as well.
Although a popular and successful approach, this still has limitations, the most prominent
of them being scalability. Although the relaxed convex optimization problems are solvable in
polynomial time, it is often challenging to solve them efficiently for large-scale problems.
Figure 1.3: An empirical comparison of run-times offered by various approaches to four differ-
ent non-convex optimization problems. LASSO, extended LASSO, SVT are relaxation-based
methods whereas IHT, gPGD, FoBa, AM-RR, SVP, ADMiRA are non-convex methods. In all
cases, non-convex optimization techniques offer routines that are faster, often by an order of
magnitude or more, than relaxation-based methods. Note that Figures 1.3c and 1.3d, employ a
y-axis at logarithmic scale. The details of the methods are present in the sections linked with
the respective figures.
1.5 The Non-Convex Optimization Approach

A more recent development has been the design of methods that solve non-convex problems
directly, in their native forms. Not only do these techniques avoid the scalability issues of
relaxations but, in practice, they often handsomely outperform relaxation-based approaches in
terms of speed and scalability. Figure 1.3 illustrates this for some applications that we will
investigate more deeply in later sections.
Very interestingly, it turns out that problem structures that allow non-convex approaches
to avoid NP-hardness results, are very similar to those that allow their convex relaxation coun-
terparts to avoid distortions and a large relaxation gap! Thus, it seems that if the problems
possess nice structure, convex relaxation-based approaches, as well as non-convex techniques,
both succeed. However, non-convex techniques usually offer more scalable solutions.
Chapter 2
Mathematical Tools
This section will introduce concepts, algorithmic tools, and analysis techniques used in the
design and analysis of optimization algorithms. It will also explore simple convex optimization
problems which will serve as a warm-up exercise.
2.1 Convex Analysis

Definition 2.1 (Convex Combination). A convex combination of a set of vectors x₁, . . . , x_n ∈ R^p
is a vector of the form Σ_{i=1}^n θ_i x_i, where θ_i ≥ 0 and Σ_{i=1}^n θ_i = 1.
A set that is closed under arbitrary convex combinations is a convex set. A standard defini-
tion is given below. Geometrically speaking, convex sets are those that contain all line segments
that join two points inside the set. As a result, they cannot have any inward “bulges”.
Definition 2.2 (Convex Set). A set C ⊆ R^p is considered convex if, for every x, y ∈ C and
λ ∈ [0, 1], we have (1 − λ)·x + λ·y ∈ C as well.
Figure 2.1 gives visual representations of prototypical convex and non-convex sets. A related
notion is that of convex functions which have a unique behavior under convex combinations.
There are several definitions of convex functions, those that are more basic and general, as well
as those that are restrictive but easier to use. One of the simplest definitions of convex functions,
one that does not involve notions of derivatives, defines convex functions f : Rp → R as those for
which, for every x, y ∈ Rp and every λ ∈ [0, 1], we have f ((1−λ)·x+λ·y) ≤ (1−λ)·f (x)+λ·f (y).
For continuously differentiable functions, a more usable definition follows.
Definition 2.3 (Convex Function). A continuously differentiable function f : R^p → R is
considered convex if for every x, y ∈ R^p we have f(y) ≥ f(x) + ⟨∇f(x), y − x⟩, where ∇f(x)
is the gradient of f at x.
A more general definition that extends to non-differentiable functions uses the notion of
subgradient to replace the gradient in the above expression. A special class of convex functions
is the class of strongly convex and strongly smooth functions. These are critical to the study of
algorithms for non-convex optimization. Figure 2.2 provides a handy visual representation of
these classes of functions.
Note, however, that a strongly smooth function may very well be non-convex. A property
similar to strong smoothness is that of Lipschitzness, which we define below.

Definition (Lipschitz Function). A function f : R^p → R is B-Lipschitz if for every x, y ∈ R^p
we have |f(x) − f(y)| ≤ B·‖x − y‖₂.

Notice that Lipschitzness places an upper bound on the growth of the function that is linear
in the perturbation, i.e., ‖x − y‖₂, whereas strong smoothness (SS) places a quadratic upper
bound. Also notice that Lipschitz functions need not be differentiable. However, differentiable
functions with bounded gradients are always Lipschitz (see Exercise 2.2). Finally, an important
property that generalizes the behavior of convex functions on convex combinations is Jensen's
inequality.
Lemma 2.1 (Jensen's Inequality). If X is a random variable taking values in the domain of a
convex function f, then E[f(X)] ≥ f(E[X]).
2.2 Convex Projections

The projection of a point z ∈ R^p onto a set C ⊆ R^p is defined as Π_C(z) := arg min_{x∈C} ‖x − z‖₂.
In general, one need not use only the L2-norm in defining projections, but it is the most commonly
used one. If C is a convex set, then the above problem reduces to a convex optimization problem.
In several useful cases, one has access to a closed form solution for the projection.
For instance, if C = B₂(1), i.e., the unit L2 ball, then projection is equivalent to a normalization
step (see Exercise 2.3):

  Π_{B₂(1)}(z) = z/‖z‖₂ if ‖z‖₂ > 1, and z otherwise.
For the case C = B₁(1), the projection step reduces to the popular soft thresholding operation:
if ẑ := Π_{B₁(1)}(z), then ẑ_i = sign(z_i)·max{|z_i| − θ, 0}, where θ is a threshold that can be
computed by a sorting operation on the vector [see Duchi et al., 2008, for details].
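Both closed-form projections admit short implementations. The following NumPy sketch
(function names are ours; the L1 ball routine follows the sorting-based recipe of Duchi et al.,
2008) illustrates them:

import numpy as np

def project_l2_ball(z, r=1.0):
    # Projection onto B2(r): rescale z only if it lies outside the ball.
    norm = np.linalg.norm(z)
    return z if norm <= r else (r / norm) * z

def project_l1_ball(z, r=1.0):
    # Projection onto B1(r) via soft thresholding; the threshold theta is
    # found by a sorting step over the coordinate magnitudes.
    if np.abs(z).sum() <= r:
        return z.copy()
    u = np.sort(np.abs(z))[::-1]               # magnitudes in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(u) + 1)
    rho = np.max(np.nonzero(u - (css - r) / j > 0)[0]) + 1
    theta = (css[rho - 1] - r) / rho           # the soft threshold
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)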
Projections onto convex sets have some very useful properties which come in handy while
analyzing optimization algorithms. In the following, we will study three properties of projections.
These are depicted visually in Figure 2.3 to help the reader gain intuition.
Lemma 2.2 (Projection Property-O). For any set (convex or not) C ⊂ R^p and z ∈ R^p, let
ẑ := Π_C(z). Then for all x ∈ C, ‖ẑ − z‖₂ ≤ ‖x − z‖₂.

This property follows by simply observing that the projection step solves the optimization
problem min_{x∈C} ‖x − z‖₂. Note that this property holds for all sets, whether convex or not.
However, the following two properties necessarily hold only for convex sets.
Lemma 2.3 (Projection Property-I). For any convex set C ⊂ R^p and any z ∈ R^p, let ẑ := Π_C(z).
Then for all x ∈ C, ⟨x − ẑ, z − ẑ⟩ ≤ 0.
Proof. To prove this, suppose the claim does not hold, i.e., for some x ∈ C, we have
⟨x − ẑ, z − ẑ⟩ > 0. Now, since C is convex and ẑ, x ∈ C, for any λ ∈ [0, 1], we have
x_λ := λ·x + (1 − λ)·ẑ ∈ C. We will now show that for some value of λ ∈ [0, 1], it must
be the case that ‖z − x_λ‖₂ < ‖z − ẑ‖₂. This will contradict the fact that ẑ is the closest point
in the convex set to z and prove the lemma. All that remains to be done is to find such a value
of λ. The reader can verify that any value of

  0 < λ < min{1, 2⟨x − ẑ, z − ẑ⟩ / ‖x − ẑ‖₂²}

suffices. Since we assumed ⟨x − ẑ, z − ẑ⟩ > 0, any value of λ chosen this way is always in (0, 1].
Projection Property-I can be used to prove a very useful contraction property for convex
projections. In some sense, a convex projection brings a point closer to all points in the convex
set simultaneously.

Lemma 2.4 (Projection Property-II). For any convex set C ⊂ R^p and any z ∈ R^p, let
ẑ := Π_C(z). Then for all x ∈ C, ‖ẑ − x‖₂ ≤ ‖z − x‖₂.
Note that Projection Properties-I and II are also called first order properties and can be
violated if the underlying set is non-convex. However, Projection Property-O, often called a
zeroth order property, always holds, whether the underlying set is convex or not.
2.3 Projected Gradient Descent

Algorithm 1 Projected Gradient Descent (PGD)
Input: Convex objective f, convex constraint set C, step lengths η_t
Output: A point x̂ ∈ C with near-optimal objective value
1: x¹ ← 0
2: for t = 1, 2, . . . , T do
3:   z^{t+1} ← x^t − η_t·∇f(x^t)
4:   x^{t+1} ← Π_C(z^{t+1})
5: end for
6: (OPTION 1) return x̂_final = x^T
7: (OPTION 2) return x̂_avg = (Σ_{t=1}^T x^t)/T
8: (OPTION 3) return x̂_best = arg min_{t∈[T]} f(x^t)
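A minimal NumPy sketch of Algorithm 1 follows (the function, names, and toy problem are
ours); it returns both the final and the average iterate:

import numpy as np

def pgd(grad_f, project, x0, eta, T):
    # Projected gradient descent: alternate gradient steps and projections.
    x, iterates = x0, []
    for _ in range(T):
        z = x - eta * grad_f(x)           # gradient step
        x = project(z)                    # projection step
        iterates.append(x)
    return x, np.mean(iterates, axis=0)   # x_final and x_avg

# Toy problem: minimize f(x) = ||x - c||_2^2 over the unit L2 ball.
c = np.array([2.0, 0.0])
grad = lambda x: 2.0 * (x - c)
proj = lambda z: z / max(1.0, np.linalg.norm(z))
x_final, x_avg = pgd(grad, proj, np.zeros(2), eta=0.1, T=200)
# Both returned points approach the constrained optimum [1, 0].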
PGD is designed to solve convex optimization problems of the form

  min_{x ∈ R^p} f(x)  s.t. x ∈ C.   (CVX-OPT)
2.4 Convergence Guarantees for PGD

2.4.1 Convergence with Bounded Gradient Convex Functions

Theorem 2.5. Let f be a convex objective with bounded gradients, i.e., ‖∇f(x)‖₂ ≤ G, and let
Algorithm 1 be executed for T time steps with step lengths η_t = η = 1/√T. Then, for any ε > 0,
if T = O(1/ε²), then (1/T)·Σ_{t=1}^T f(x^t) ≤ f∗ + ε.
We see that the PGD algorithm in this setting ensures that the function value of the iterates
approaches f∗ on average. We can use this result to prove the convergence of the PGD
algorithm. If we use OPTION 3, i.e., x̂_best, then since by construction f(x̂_best) ≤ f(x^t)
for all t, applying Theorem 2.5 gives us

  f(x̂_best) ≤ (1/T)·Σ_{t=1}^T f(x^t) ≤ f∗ + ε.
If we use OPTION 2, i.e., x̂_avg, which is cheaper since we do not have to perform function
evaluations to find the best iterate, we can apply Jensen's inequality (Lemma 2.1) to get the
following:

  f(x̂_avg) = f((1/T)·Σ_{t=1}^T x^t) ≤ (1/T)·Σ_{t=1}^T f(x^t) ≤ f∗ + ε.
Note that Jensen's inequality may be applied only when the function f is convex. Now,
whereas OPTION 1, i.e., x̂_final, is the cheapest and does not require any additional operations,
x̂_final does not converge to the optimum for convex functions in general and may oscillate close
to the optimum. However, we shall shortly see that x̂_final does converge if the objective function
is strongly smooth. Recall that strongly smooth functions may not grow at a faster-than-quadratic
rate.
The reader would note that we have set the step length to a value that depends on the
total number of iterations T for which the PGD algorithm is executed. This is called a horizon-
aware setting of the step length. In case we are not sure what the value of T would be, a
horizon-oblivious setting of η_t = 1/√t can also be shown to work (see Exercise 2.4).
Proof (of Theorem 2.5). Let x∗ ∈ arg min_{x∈C} f(x) denote any point in the constraint set
where the optimum function value is achieved. Such a point always exists if the constraint
set is closed and the objective function continuous. We will use the following potential
function Φ_t = f(x^t) − f(x∗) to track the progress of the algorithm. Note that Φ_t measures
the sub-optimality of the t-th iterate. Indeed, the statement of the theorem is equivalent to
claiming that (1/T)·Σ_{t=1}^T Φ_t ≤ ε.
(Apply Convexity) We apply convexity to upper bound the potential function at every step.
Convexity is a global property and very useful in getting an upper bound on the level of sub-
optimality of the current iterate in such analyses.

  Φ_t = f(x^t) − f(x∗) ≤ ⟨∇f(x^t), x^t − x∗⟩
      = (1/2η)·(‖η∇f(x^t)‖₂² + ‖x^t − x∗‖₂² − ‖x^t − η∇f(x^t) − x∗‖₂²)
      = (1/2η)·(‖η∇f(x^t)‖₂² + ‖x^t − x∗‖₂² − ‖z^{t+1} − x∗‖₂²)
      ≤ (1/2η)·(‖x^t − x∗‖₂² − ‖z^{t+1} − x∗‖₂²) + ηG²/2,

where the first step applies the identity 2⟨a, b⟩ = ‖a‖₂² + ‖b‖₂² − ‖a − b‖₂², the second step uses
the update step of the PGD algorithm that sets z^{t+1} ← x^t − η_t·∇f(x^t), and the third step uses
the fact that the objective function f has bounded gradients.
(Apply Projection Property) We apply Lemma 2.4 to get

  ‖z^{t+1} − x∗‖₂² ≥ ‖x^{t+1} − x∗‖₂²,

which gives us

  Φ_t ≤ (1/2η)·(‖x^t − x∗‖₂² − ‖x^{t+1} − x∗‖₂²) + ηG²/2.

The above expression is interesting since it tells us that, apart from the ηG²/2 term which is
small as η = 1/√T, the current sub-optimality Φ_t is small if the consecutive iterates x^t and x^{t+1}
are close to each other (and hence similar in distance from x∗).
This observation is quite useful since it tells us that once PGD stops making a lot of progress,
it actually converges to the optimum! In hindsight, this is to be expected. Since we are using a
constant step length, only a vanishing gradient can cause PGD to stop progressing. However,
for convex functions, this only happens at global optima. Summing the expression up across
time steps, performing telescopic cancellations, using x¹ = 0, and dividing throughout by T
gives us

  (1/T)·Σ_{t=1}^T Φ_t ≤ (1/2ηT)·(‖x∗‖₂² − ‖x^{T+1} − x∗‖₂²) + ηG²/2
                      ≤ (1/2√T)·(‖x∗‖₂² + G²),

where in the second step we have used the fact that ‖x^{T+1} − x∗‖₂² ≥ 0 and η = 1/√T. This
concludes the proof.
2.4.2 Convergence with Strongly Convex and Smooth Functions

Theorem 2.6. Let f be an objective that satisfies the α-SC and β-SS properties. Let Algorithm 1
be executed with step lengths η_t = η = 1/β. Then after at most T = O((β/α)·log(1/ε)) steps, we
have f(x^T) ≤ f(x∗) + ε.
This result is particularly nice since it ensures that the final iterate x̂_final = x^T converges,
allowing us to use OPTION 1 in Algorithm 1 when the objective is SC/SS. A further advantage
is the accelerated rate of convergence. Whereas for general convex functions, PGD requires
O(1/ε²) iterations to reach an ε-optimal solution, for SC/SS functions, it requires only
O(log(1/ε)) iterations.
The reader would notice the insistence on the step length being set to η = 1/β. In fact, the
proof we show below crucially uses this setting. In practice, for many problems, β may not be
known to us or may be expensive to compute, which presents a problem. However, as it turns
out, it is not necessary to set the step length exactly to 1/β. The result can be shown to hold
even for values of η < 1/β which are nevertheless large enough, but the proof becomes more
involved. In practice, the step length is tuned globally by doing a grid search over several η
values, or per-iteration using line search mechanisms, to obtain a step length value that assures
good convergence rates.
Proof (of Theorem 2.6). This proof is a nice opportunity for the reader to see how the SC/SS
properties are utilized in a convergence analysis. As with convexity in the proof of Theorem 2.5,
the strong convexity property is a global property that will be useful in assessing the progress
made so far by relating the optimal point x∗ with the current iterate x^t. Strong smoothness,
on the other hand, will be used locally to show that the procedure makes significant progress
between iterates.
We will prove the result by showing that after at most T = O((β/α)·log(1/ε)) steps, we will have
‖x^T − x∗‖₂² ≤ 2ε/β. This already tells us that we have reached very close to the optimum.
However, we can use this to show that x^T is ε-optimal in function value as well. Since we are
very close to the optimum, it makes sense to apply strong smoothness to upper bound the
sub-optimality as follows:

  f(x^T) ≤ f(x∗) + ⟨∇f(x∗), x^T − x∗⟩ + (β/2)·‖x^T − x∗‖₂².
Now, since x∗ is an optimal point for the constrained optimization problem with a convex
constraint set C, the first order optimality condition [see Bubeck, 2015, Proposition 1.3] gives
us ⟨∇f(x∗), x − x∗⟩ ≤ 0 for any x ∈ C. Applying this condition with x = x^T gives us

  f(x^T) − f(x∗) ≤ (β/2)·‖x^T − x∗‖₂² ≤ ε,

which proves that x^T is an ε-optimal point. We now show ‖x^T − x∗‖₂² ≤ 2ε/β. Given that we
wish to show convergence in terms of the iterates, and not in terms of the function values as
we did in Theorem 2.5, a natural potential function for this analysis is Φ_t = ‖x^t − x∗‖₂².
(Apply Strong Smoothness) As discussed before, we use it to show that PGD always makes
significant progress in each iteration.

  f(x^{t+1}) − f(x^t) ≤ ⟨∇f(x^t), x^{t+1} − x^t⟩ + (β/2)·‖x^t − x^{t+1}‖₂²
      = ⟨∇f(x^t), x^{t+1} − x∗⟩ + ⟨∇f(x^t), x∗ − x^t⟩ + (β/2)·‖x^t − x^{t+1}‖₂²
      = (1/η)·⟨x^t − z^{t+1}, x^{t+1} − x∗⟩ + ⟨∇f(x^t), x∗ − x^t⟩ + (β/2)·‖x^t − x^{t+1}‖₂²
(Apply Projection Rule) The above expression contains an unwieldy term z^{t+1}. Since this
term only appears during projection steps, we eliminate it by applying Projection Property-I
(Lemma 2.3) to get

  ⟨x^t − z^{t+1}, x^{t+1} − x∗⟩ ≤ ⟨x^t − x^{t+1}, x^{t+1} − x∗⟩
      = (1/2)·(‖x^t − x∗‖₂² − ‖x^t − x^{t+1}‖₂² − ‖x^{t+1} − x∗‖₂²).
Using η = 1/β and combining the above results gives us

  f(x^{t+1}) − f(x^t) ≤ ⟨∇f(x^t), x∗ − x^t⟩ + (β/2)·(‖x^t − x∗‖₂² − ‖x^{t+1} − x∗‖₂²).
(Apply Strong Convexity) The above expression is perfect for a telescoping step but for the
inner product term. Fortunately, this can be eliminated using strong convexity:

  ⟨∇f(x^t), x∗ − x^t⟩ ≤ f(x∗) − f(x^t) − (α/2)·‖x^t − x∗‖₂².
Combining this with the above gives us

  f(x^{t+1}) − f(x∗) ≤ ((β − α)/2)·‖x^t − x∗‖₂² − (β/2)·‖x^{t+1} − x∗‖₂².

Since f(x^{t+1}) − f(x∗) ≥ 0, rearranging gives Φ_{t+1} ≤ (1 − α/β)·Φ_t, so that
Φ_T ≤ exp(−α(T − 1)/β)·Φ₁. Thus, after T = O((β/α)·log(1/ε)) steps, we have
‖x^T − x∗‖₂² ≤ 2ε/β, which concludes the proof.

The quantity κ := β/α is known as the condition number of the objective. If x is the unique
global minimum of an unconstrained problem with an α-SC and β-SS objective f (so that
∇f(x) = 0; recall from Exercise 2.5 that strongly convex functions have unique minimizers),
then for any y,

  (f(y) − f(x)) / ((α/2)·‖x − y‖₂²) ∈ [1, β/α] =: [1, κ].
Thus, upon perturbing the input from the global minimum x to a point ‖x − y‖₂ =: r distance
away, the function value changes in a controlled manner – it goes up by an amount at least
(α/2)·r² but at most κ·(α/2)·r². Such a well-behaved response to perturbations is very easy for
optimization algorithms to exploit to give fast convergence.
The condition number of the objective function can significantly affect the convergence rate
of algorithms. Indeed, if κ = β/α is small, then exp(−α/β) = exp(−1/κ) would be small, ensuring
fast convergence. However, if κ ≫ 1, then exp(−1/κ) ≈ 1 and the procedure might offer slow
convergence.
2.5 Exercises
Exercise 2.1. Show that strong smoothness does not imply convexity by constructing a non-
convex function f : Rp → R that is 1-SS.
Exercise 2.2. Show that if a differentiable function f has bounded gradients, i.e., ‖∇f(x)‖₂ ≤ G
for all x ∈ R^p, then f is Lipschitz. What is its Lipschitz constant?
Hint: use the mean value theorem.

Exercise 2.3. Show that the projection onto the unit L2 ball B₂(1) is given by the normalization
step described in § 2.2.

Exercise 2.4. Show that a horizon-oblivious setting of η_t = 1/√t while executing the PGD algo-
rithm with a convex function with bounded gradients also ensures convergence.
Hint: the convergence rates may be a bit different for this setting.
Exercise 2.5. Show that if f : Rp → R is a strongly convex function that is differentiable, then
there is a unique point x∗ ∈ Rp that minimizes the function value f i.e., f (x∗ ) = minx∈Rp f (x).
Exercise 2.6. Show that the set of sparse vectors B0 (s) ⊂ Rp is non-convex for any s < p.
What happens when s = p?
Exercise 2.7. Show that Brank (r) ⊆ Rn×n , the set of n × n matrices with rank at most r, is
non-convex for any r < n. What happens when r = n?
Exercise 2.8. Consider the Cartesian product set C = Rm×r × Rn×r . Show that it is convex.
Exercise 2.9. Consider a least squares optimization problem with a strongly convex and smooth
objective. Show that the condition number of this problem is equal to the condition number of
the Hessian matrix of the objective function.
2.6 Bibliographic Notes

We regret our inability to cover several useful and interesting results concerning convex
functions and optimization techniques, given the paucity of scope to present this discussion. We
refer the reader to literature in the field of optimization theory for a much more relaxed and
deeper introduction to the area of convex optimization. Some excellent examples include
[Bertsekas, 2016, Boyd and Vandenberghe, 2004, Bubeck, 2015, Nesterov, 2003, Sra et al., 2011].
Part II
Chapter 3

Non-convex Projected Gradient Descent
In this section we will introduce and study gradient descent-style methods for non-convex op-
timization problems. In § 2, we studied the projected gradient descent method for convex op-
timization problems. Unfortunately, the algorithmic and analytic techniques used in convex
problems fail to extend to non-convex problems. In fact, non-convex problems are NP-hard to
solve and thus, no algorithmic technique should be expected to succeed on these problems in
general.
However, the situation is not so bleak. As we discussed in § 1, several breakthroughs in non-
convex optimization have shown that non-convex problems that possess nice additional structure
can be solved not just in polynomial time, but rather efficiently too. Here, we will study the inner
workings of projected gradient methods on such structured non-convex optimization problems.
The discussion will be divided into three parts. The first part will take a look at constraint
sets that, despite being non-convex, possess additional structure so that projections onto them
can be carried out efficiently. The second part will take a look at structural properties of
objective functions that can aid optimization. The third part will present and analyze a simple
extension of the PGD algorithm for non-convex problems. We will see that for problems that
do possess nicely structured objective functions and constraint sets, the PGD-style algorithm
does converge to the global optimum in polynomial time with a linear rate of convergence.
We would like to point out to the reader that our emphasis in this section will be on
generality and exposition of basic concepts. We will seek to present easily accessible analyses
for problems that have non-convex objectives. However, the price we will pay for this generality
is in the fineness of the results we present. The results discussed in this section are not the best
possible and more refined and problem-specific results will be discussed in subsequent sections
where specific applications will be discussed in detail.
3.1 Non-convex Projections

A quick look at the projection problem Π_C(z) := arg min_{x∈C} ‖x − z‖₂ reveals that this is an
optimization problem in itself. Thus, when the set C to be projected onto is non-convex, the
projection problem can itself be NP-hard. However, for several well-structured sets, projection
can be carried out efficiently despite the sets being non-convex.
3.1.1 Projecting into Sparse Vectors
In the sparse linear regression example discussed in § 1,

  ŵ = arg min_{‖w‖₀ ≤ s} Σ_{i=1}^n (y_i − x_i^⊤ w)²,
applying projected gradient descent requires projections onto the set of s-sparse vectors, i.e.,
B₀(s) := {x ∈ R^p : ‖x‖₀ ≤ s}. The following result shows that the projection Π_{B₀(s)}(z) can be
carried out by simply sorting the coordinates of the vector z according to magnitude and setting
all except the top s coordinates to zero.
Lemma 3.1. For any vector z ∈ R^p, let σ(i) denote the rank of coordinate i when the coordinates
of z are sorted in decreasing order of magnitude. Then the vector ẑ := Π_{B₀(s)}(z) is obtained by
setting ẑ_i = z_i if σ(i) ≤ s and ẑ_i = 0 otherwise.
Proof. We first notice that since the function x ↦ x² is increasing on the positive half of the
real line, we have arg min_{x∈C} ‖x − z‖₂ = arg min_{x∈C} ‖x − z‖₂². Next, we observe that the
vector ẑ := Π_{B₀(s)}(z) must satisfy ẑ_i = z_i for all i ∈ supp(ẑ), otherwise we can decrease the
objective value ‖ẑ − z‖₂² by ensuring this. Having established this gives us
‖ẑ − z‖₂² = Σ_{i∉supp(ẑ)} z_i². This is clearly minimized when supp(ẑ) contains the coordinates of
z with largest magnitude.
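In code, this projection – often called the hard thresholding operator – takes only a few lines;
the following NumPy sketch (function name is ours) implements Lemma 3.1:

import numpy as np

def hard_threshold(z, s):
    # Projection onto B0(s): retain the s coordinates of z with largest
    # magnitude and set the rest to zero.
    z_hat = np.zeros_like(z)
    if s > 0:
        top = np.argsort(np.abs(z))[-s:]   # indices of the top-s magnitudes
        z_hat[top] = z[top]
    return z_hat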
3.1.2 Projecting into Low-rank Matrices

To solve the low-rank matrix completion problem discussed in § 1,

  Â_lr = arg min_{rank(X) ≤ r} Σ_{(i,j)∈Ω} (X_ij − A_ij)²,
we need to project onto the set of low-rank matrices. Let us first define this problem formally.
Consider matrices of a certain order, say m×n, and let C ⊂ R^{m×n} be an arbitrary set of matrices.
Then, the projection operator Π_C(·) is defined as follows: for any matrix A ∈ R^{m×n},

  Π_C(A) := arg min_{B∈C} ‖A − B‖_F,

where ‖·‖_F is the Frobenius norm over matrices. For low-rank projections we require C to be
the set of low-rank matrices Brank(r) := {A ∈ R^{m×n} : rank(A) ≤ r}. Yet again, this projection
can be done efficiently by performing a singular value decomposition on the matrix A and
retaining the top r singular values and vectors. The Eckart-Young-Mirsky theorem proves that
this indeed gives us the projection.
Theorem 3.2 (Eckart-Young-Mirsky theorem). For any matrix A ∈ R^{m×n}, let A = UΣV^⊤ be
the singular value decomposition of A, where Σ = diag(σ₁, σ₂, . . . , σ_min(m,n)) with
σ₁ ≥ σ₂ ≥ . . . ≥ σ_min(m,n). Then for any r ≤ min(m, n), the matrix Â_(r) := Π_{Brank(r)}(A) can be
obtained as Â_(r) = U_(r)Σ_(r)V_(r)^⊤, where U_(r) := [U₁ U₂ . . . U_r], V_(r) := [V₁ V₂ . . . V_r], and
Σ_(r) := diag(σ₁, σ₂, . . . , σ_r).
Although we have stated the above result for projections in the Frobenius norm, the Eckart-Young-Mirsky theorem actually applies to any unitarily invariant norm, including the Schatten norms and the operator norm. The proof of this result is beyond the scope of this monograph.
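In code, the projection again takes only a few lines. A minimal NumPy sketch (our own illustration) using the truncated SVD:

import numpy as np

def project_low_rank(A, r):
    """Project A onto B_rank(r): compute the SVD and retain the top-r
    singular triplets, as guaranteed by the Eckart-Young-Mirsky theorem."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]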
Before moving on, we caution the reader that the ability to efficiently project onto the non-
convex sets mentioned above does not imply that non-convex projections are as nicely behaved
as their convex counterparts. Indeed, none of the projections mentioned above satisfy projection
properties I or II (Lemmata 2.3 and 2.4). This will pose a significant challenge while analyzing
PGD-style algorithms for non-convex problems since, as we would recall, these properties were
crucially used in all convergence proofs discussed in § 2.
3.3 Generalized Projected Gradient Descent
Algorithm 2 Generalized Projected Gradient Descent (gPGD)
Input: Objective function f, constraint set C, step length η
Output: A point x̂ ∈ C with near-optimal objective value
1: x^1 ← 0
2: for t = 1, 2, . . . , T do
3:   z^{t+1} ← x^t − η · ∇f(x^t)
4:   x^{t+1} ← Π_C(z^{t+1})
5: end for
6: return x̂ = x^T
Theorem 3.3. Let f be a (possibly non-convex) function satisfying the α-RSC and β-RSS properties over a (possibly non-convex) constraint set C with β/α < 2. Let Algorithm 2 be executed with a step length η = 1/β. Then after at most T = O((α/(2α − β)) · log(1/ε)) steps, f(x^T) ≤ f(x^*) + ε.
This result continues to hold even when the step length is set to values that are large enough, yet smaller than 1/β. However, setting η = 1/β simplifies the proof and allows us to focus on the key concepts.
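Before the proof, here is a minimal NumPy sketch of Algorithm 2 (our own illustration, with hypothetical helper names), instantiated for sparse linear regression; with the hard-thresholding projection, this instantiation is known in the sparse recovery literature as iterative hard thresholding:

import numpy as np

def gpgd(grad_f, project, eta, T, dim):
    """Generalized projected gradient descent (Algorithm 2): a gradient
    step followed by a (possibly non-convex) projection onto C."""
    x = np.zeros(dim)               # x^1 <- 0
    for _ in range(T):
        z = x - eta * grad_f(x)     # z^{t+1} <- x^t - eta * grad f(x^t)
        x = project(z)              # x^{t+1} <- Pi_C(z^{t+1})
    return x

# Usage sketch: recover a 5-sparse vector from Gaussian measurements.
rng = np.random.default_rng(0)
n, p, s = 200, 50, 5
X = rng.standard_normal((n, p))
w_star = np.zeros(p)
w_star[:s] = 1.0
y = X @ w_star

def hard_threshold(z):
    out = np.zeros_like(z)
    top_s = np.argsort(np.abs(z))[-s:]
    out[top_s] = z[top_s]
    return out

w_hat = gpgd(lambda w: X.T @ (X @ w - y) / n, hard_threshold,
             eta=0.5, T=100, dim=p)  # eta plays the role of 1/beta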
Proof (of Theorem 3.3). Recall that the proof of Theorem 2.5 used the SC/SS properties for the analysis. We will replace these by the RSC/RSS properties: we will use RSC to track the global convergence of the algorithm and RSS to locally assess the progress made by the algorithm in each iteration. We will use Φ_t = f(x^t) − f(x^*) as the potential function.
(Apply Restricted Strong Smoothness) Since both x^t, x^{t+1} ∈ C due to the projection steps, we apply the β-RSS property to them:
$$
\begin{aligned}
f(x^{t+1}) - f(x^t) &\le \left\langle \nabla f(x^t), x^{t+1} - x^t \right\rangle + \frac{\beta}{2}\left\|x^{t+1} - x^t\right\|_2^2 \\
&= \frac{1}{\eta}\left\langle x^t - z^{t+1}, x^{t+1} - x^t \right\rangle + \frac{\beta}{2}\left\|x^{t+1} - x^t\right\|_2^2 \\
&= \frac{\beta}{2}\left(\left\|x^{t+1} - z^{t+1}\right\|_2^2 - \left\|x^t - z^{t+1}\right\|_2^2\right)
\end{aligned}
$$
Notice that this step crucially uses the fact that η = 1/β.
(Apply Projection Property) We are again stuck with the unwieldy z^{t+1} term. However, unlike before, we cannot apply projection properties I or II, as non-convex projections do not satisfy them. Instead, we resort to Projection Property-O (Lemma 2.2), which all projections (even non-convex ones) must satisfy. Applying this property gives us
$$
\begin{aligned}
f(x^{t+1}) - f(x^t) &\le \frac{\beta}{2}\left(\left\|x^* - z^{t+1}\right\|_2^2 - \left\|x^t - z^{t+1}\right\|_2^2\right) \\
&= \frac{\beta}{2}\left(\left\|x^* - x^t\right\|_2^2 + 2\left\langle x^* - x^t, x^t - z^{t+1} \right\rangle\right) \\
&= \frac{\beta}{2}\left\|x^* - x^t\right\|_2^2 + \left\langle x^* - x^t, \nabla f(x^t) \right\rangle
\end{aligned}
$$
(Apply Restricted Strong Convexity) Since both x^t, x^* ∈ C, we apply the α-RSC property to them. However, we do so in two ways:
$$f(x^*) - f(x^t) \ge \left\langle \nabla f(x^t), x^* - x^t \right\rangle + \frac{\alpha}{2}\left\|x^t - x^*\right\|_2^2$$
$$f(x^t) - f(x^*) \ge \left\langle \nabla f(x^*), x^t - x^* \right\rangle + \frac{\alpha}{2}\left\|x^t - x^*\right\|_2^2 \ge \frac{\alpha}{2}\left\|x^t - x^*\right\|_2^2,$$
where in the second line we used the assumption ∇f(x^*) = 0. We recall that this assumption can be done away with, but doing so makes the proof more complicated, which we wish to avoid. Simple manipulations with the two inequalities give us
$$\left\langle \nabla f(x^t), x^* - x^t \right\rangle + \frac{\beta}{2}\left\|x^* - x^t\right\|_2^2 \le \left(2 - \frac{\beta}{\alpha}\right)\left(f(x^*) - f(x^t)\right)$$
Putting this into the earlier expression gives us
$$f(x^{t+1}) - f(x^t) \le \left(2 - \frac{\beta}{\alpha}\right)\left(f(x^*) - f(x^t)\right)$$
The above inequality is quite interesting. It tells us that the larger the gap between f(x^*) and f(x^t), the larger will be the drop in objective value in going from x^t to x^{t+1}. The form of the result is also quite fortunate, as it assures us that we cover a constant fraction (2 − β/α) of the remaining "distance" to the optimal value at each step! Rearranging this gives
$$\Phi_{t+1} \le (\kappa - 1)\,\Phi_t,$$
where κ = β/α. Note that we always have κ ≥ 1 (see Exercise 3.2), and by assumption κ = β/α < 2, so that we always have κ − 1 ∈ [0, 1). This proves the result after simple manipulations.
We see that the condition number has yet again played a crucial role in deciding the convergence rate of the algorithm, this time for a non-convex problem. However, the condition number is defined differently here, using the RSC/RSS constants instead of the SC/SS constants as in § 2.
The reader would notice that while there was no restriction on the condition number κ in the analysis of the PGD algorithm (see Theorem 2.6), the analysis of the gPGD algorithm does require κ < 2. It turns out that this restriction can be done away with for specific problems. However, the analysis becomes significantly more complicated. Resolving this issue in general is beyond the scope of this monograph, but we will revisit this question in § 7 when we study sparse recovery in ill-conditioned settings, i.e., with large condition numbers.
In subsequent sections, we will see more refined versions of the gPGD algorithm for different
non-convex optimization problems, as well as more refined and problem-specific analyses. In all
cases we will see that the RSC/RSS assumptions made by us can be fulfilled in practice and that
gPGD-style algorithms offer very good performance on practical machine learning and signal
processing problems.
3.4 Exercises
Exercise 3.1. Verify that the basic convergence result for the PGD algorithm in Theorem 2.5,
continues to hold when the constraint set C is convex and f only satisfies restricted convexity
over C (i.e., f is not convex over the entire Rp ). Verify that the result for strongly convex and
smooth functions in Theorem 2.6, also continues to hold if f satisfies RSC and RSS over a
convex constraint set C.
Exercise 3.2. Let the function f satisfy the α-RSC and β-RSS properties over a set C. Show that the condition number κ = β/α ≥ 1. Note that the function f and the set C may both be non-convex.
Exercise 3.3. Recall the recommendation systems problem we discussed in § 1. Show that assuming the ratings matrix to be rank-r is equivalent to assuming that with every user i ∈ [m] there is associated a vector u_i ∈ R^r describing that user, and with every item j ∈ [n] there is associated a vector v_j ∈ R^r describing that item, such that the rating given by user i to item j is A_{ij} = u_i^⊤ v_j.
Hint: Use the singular value decomposition of A.
Chapter 4
Alternating Minimization
In this section we will introduce a widely used non-convex optimization primitive, namely the
alternating minimization principle. The technique is extremely general and its popular use
actually predates the recent advances in non-convex optimization by several decades. Indeed, the
popular Lloyd’s algorithm [Lloyd, 1982] for k-means clustering and the EM algorithm [Dempster
et al., 1977] for latent variable models are problem-specific variants of the general alternating
minimization principle. The technique continues to inspire new algorithms for several important
non-convex optimization problems such as matrix completion, robust learning, phase retrieval
and dictionary learning.
Given the popularity and breadth of use of this method, our task of presenting an introductory treatment is all the more challenging here. To keep the discussion focused on core principles and tools, we will refrain from presenting the alternating minimization principle in all its varieties. Instead, we will focus on showing, in a largely problem-independent manner, the challenges that alternating minimization faces when applied to real-life problems, and how they can be overcome. Subsequent sections will then show how this principle can be applied to various machine learning and signal processing tasks. In particular, § 5 will be devoted to the EM algorithm, which embodies the alternating minimization principle and is extremely popular for latent variable estimation problems in statistics and machine learning.
The discussion will be divided into four parts. In the first part, we will look at some useful
structural properties of functions that frequently arise in alternating minimization settings.
In the second part, we will present a general implementation of the alternating minimization
principle and discuss some challenges faced by this algorithm in offering convergent behavior in
real-life problems. In the third part, as a warm-up exercise, we will show how these challenges
can be overcome when the optimization problem being solved is convex. Finally in the fourth
part, we will discuss the more interesting problem of convergence of alternating minimization
for non-convex problems.
Figure 4.1: A marginally convex function is not necessarily (jointly) convex. The function
f (x, y) = x · y is marginally linear, hence marginally convex, in both its variables, but clearly
not a (jointly) convex function.
The definition of joint convexity is no different from that of convexity (Definition 2.3). Indeed, the two coincide if we treat f as a function of a single variable z = (x, y) ∈ R^{p+q} instead of two variables. However, not all multivariate functions that arise in applications are jointly convex. This motivates the notion of marginal convexity.
Definition 4.2 (Marginal Convexity). A function f : R^p × R^q → R is considered marginally convex in its first variable if for every value y ∈ R^q, the function f(·, y) is convex, i.e., for all x^1, x^2 ∈ R^p,
$$f(x^2, y) \ge f(x^1, y) + \left\langle \nabla_x f(x^1, y), x^2 - x^1 \right\rangle,$$
where ∇_x f(x^1, y) is the partial gradient of f with respect to its first variable at the point (x^1, y). A similar condition is imposed for f to be considered marginally convex in its second variable.
Although the definition above has been given for a function of two variables, it clearly extends to functions with an arbitrary number of variables. It is interesting to note that whereas the objective function in the matrix completion problem mentioned earlier is not jointly convex in its variables, it is indeed marginally convex in both its variables (see Exercise 4.1).
It is also useful to note that even though a function that is marginally convex in all its variables need not be a jointly convex function (see Figure 4.1), the converse is true (see Exercise 4.2). We will find the following notions of marginal strong convexity and smoothness especially useful in our subsequent discussions.
Algorithm 3 Generalized Alternating Minimization (gAM)
Input: Objective function f : X × Y → R
Output: A point (x̂, ŷ) ∈ X × Y with near-optimal objective value
1: (x^1, y^1) ← INITIALIZE()
2: for t = 1, 2, . . . , T do
3:   x^{t+1} ← arg min_{x ∈ X} f(x, y^t)
4:   y^{t+1} ← arg min_{y ∈ Y} f(x^{t+1}, y)
5: end for
6: return (x^T, y^T)
Definition 4.3 (Marginal Strong Convexity/Smoothness). A function f : R^p × R^q → R is considered (uniformly) α-marginally strongly convex (MSC) and (uniformly) β-marginally strongly smooth (MSS) in its first variable if for every value y ∈ R^q and all x^1, x^2 ∈ R^p,
$$\frac{\alpha}{2}\left\|x^2 - x^1\right\|_2^2 \le f(x^2, y) - f(x^1, y) - \left\langle g, x^2 - x^1 \right\rangle \le \frac{\beta}{2}\left\|x^2 - x^1\right\|_2^2,$$
where g = ∇_x f(x^1, y) is the partial gradient of f with respect to its first variable at the point (x^1, y). A similar condition is imposed for f to be considered (uniformly) MSC/MSS in its second variable.
The above notion is a "uniform" one since the parameters α, β do not depend on the y coordinate. It is instructive to relate MSC/MSS to the RSC/RSS properties from § 3. MSC/MSS extend the idea of functions that are not "globally" convex (strongly or otherwise) but do exhibit such properties under certain "qualifications"; MSC/MSS simply use a different qualification than RSC/RSS do. Note that a function that is MSC with respect to all its variables need not be a convex function (see Exercise 4.3).
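Before proceeding, to make the gAM template of Algorithm 3 concrete, here is a minimal sketch (our own illustration, not code from this monograph) applied to the marginally convex objective f(U, V) = ‖A − UV^⊤‖_F², in which each marginal minimization is an ordinary least squares solve:

import numpy as np

def gam_factorize(A, r, T=50):
    """gAM (Algorithm 3) for f(U, V) = ||A - U V^T||_F^2: jointly
    non-convex, but each alternating step is a least squares problem."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((n, r))
    for _ in range(T):
        U = np.linalg.lstsq(V, A.T, rcond=None)[0].T  # argmin_U f(U, V)
        V = np.linalg.lstsq(U, A, rcond=None)[0].T    # argmin_V f(U, V)
    return U, V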
[Figure 4.2: toy bi-variate functions with their marginally optimal coordinate curves; intersections of the solid and dashed curves are bistable points. Figure not reproduced here.]
These descent versions are often easier to execute but may also converge more slowly. If the
problem is nicely structured, then progress made on the intermediate problems offers fast con-
vergence to the optimum. However, from the point of view of convergence, gAM faces several
challenges. To discuss those, we first introduce some more concepts.
Definition 4.4 (Marginally Optimum Coordinate). Let f be a function of two variables constrained to be in the sets X, Y respectively. For any point y ∈ Y, we say that x̃ is a marginally optimal coordinate with respect to y, and use the shorthand x̃ ∈ mOPT_f(y), if f(x̃, y) ≤ f(x, y) for all x ∈ X. Similarly, for any x ∈ X, we say ỹ ∈ mOPT_f(x) if ỹ is a marginally optimal coordinate with respect to x.
Definition 4.5 (Bistable Point). Given a function f over two variables constrained within the sets X, Y respectively, a point (x, y) ∈ X × Y is considered a bistable point if y ∈ mOPT_f(x) and x ∈ mOPT_f(y), i.e., both coordinates are marginally optimal with respect to each other.
It is easy to see (Exercise 4.5) that the optimum of the optimization problem must be a bistable point. The reader can also verify that the gAM procedure must stop after it has reached a bistable point. However, two questions arise out of this. First, how fast does gAM approach a bistable point, and second, even if it reaches a bistable point, is that point guaranteed to be (globally) optimal?
The first question will be explored in detail later. It is interesting to note that the gAM
procedure has no parameters, such as step length. This can be interpreted as a benefit as
well as a drawback. While it relieves the end-user from spending time tweaking parameters,
it also means that the user has less control over the progress of the algorithm. Consequently,
the convergence of the gAM procedure is totally dependent on structural properties of the
optimization problem. In practice, it is common to switch between gAM updates as given in
Algorithm 3 and descent versions thereof discussed earlier. The descent versions do give a step
length as a tunable parameter to the user.
The second question requires a closer look at the interaction between the objective function
and the gAM process. Figure 4.2 illustrates this with toy bi-variate functions over R2 . In the
first figure, the bold solid curve plots the function g : X → X × Y with g(x) = (x, mOPTf (x))
(in this toy case, the marginally optimal coordinates are taken to be unique for simplicity, i.e.,
|mOPTf (x)| = 1 for all x ∈ X ). The bold dashed curve similarly plots h : Y → X × Y with
h(y) = (mOPTf (y), y). These plots are quite handy in demonstrating the convergence properties
of the gAM algorithm.
It is easy to see that bistable points lie precisely at the intersection of the bold solid and
the bold dashed curves. The second illustration shows how the gAM process may behave when
instantiated with this toy function – clearly gAM exhibits rapid convergence to the bistable
point. However, the third illustration shows that functions may have multiple bistable points. The figure shows that this may happen even if the marginally optimal coordinates are unique, i.e., for every x there is a unique ỹ such that {ỹ} = mOPT_f(x), and vice versa.
In case a function taking bounded values possesses multiple bistable points, the bistable point to which gAM eventually converges depends on where the procedure was initialized. This is exemplified in the third illustration, where each bistable point has its own "region of attraction".
If initialized inside a particular region, gAM converges to the bistable point corresponding to
that region. This means that in order to converge to the globally optimal point, gAM must be
initialized inside the region of attraction of the global optimum.
The above discussion shows that it may be crucial to properly initialize the gAM procedure
to ensure convergence to the global optimum. Indeed, when discussing gAM-style algorithms for
learning latent variable models, matrix completion and phase retrieval in later sections, we will
pay special attention to initialize the procedure “close” to the optimum. The only exception will
be that of robust regression in § 9 where it seems that the problem structure ensures a unique
bistable point and so, a careful initialization is not required.
4.3 A Convergence Guarantee for gAM for Convex Problems

Theorem 4.2. Let f : R^p × R^q → R be a jointly convex, continuously differentiable function satisfying the β-MSS property in both its variables, with f^* = min_{x,y} f(x, y) > −∞. Suppose the sublevel set S0 = {(x, y) : f(x, y) ≤ f(x^1, y^1)} at the initialization point (x^1, y^1) is bounded, with ‖x − x^*‖₂ ≤ R and ‖y − y^*‖₂ ≤ R for all (x, y) ∈ S0. Then after at most T = O(βR²/ε) steps, Algorithm 3 ensures f(x^T, y^T) ≤ f^* + ε.
Proof. The first property of the gAM algorithm that we need to appreciate is monotonicity. It is easy to see that, due to the marginal minimizations carried out, we have at all time steps t,
$$f(x^{t+1}, y^{t+1}) \le f(x^{t+1}, y^t) \le f(x^t, y^t).$$
The region S0 is the sublevel set of f at the initialization point. Due to the monotonicity property, we have f(x^t, y^t) ≤ f(x^1, y^1) for all t, i.e., (x^t, y^t) ∈ S0 for all t. Thus, gAM remains restricted to the bounded region S0 and does not diverge. We notice that this point underlines the importance of proper initialization: gAM benefits from being initialized at a point at which the sublevel set of f is bounded.
We will use Φ_t = 1/(f(x^t, y^t) − f^*) as the potential function. This is a slightly unusual choice of potential function, but its utility will be clear from the proof. Note that Φ_t > 0 for all t and that convergence is equivalent to showing Φ_t → ∞. We will, as before, use smoothness to analyze the per-iteration progress made by gAM, and use convexity for the global convergence analysis. For any time step t ≥ 2, consider the hypothetical update we could have made had we performed a gradient step instead of the marginal minimization step gAM performs in step 3:
$$\tilde{x}^{t+1} = x^t - \frac{1}{\beta}\,\nabla_x f(x^t, y^t)$$
(Apply Marginal Strong Smoothness) We get
$$
\begin{aligned}
f(\tilde{x}^{t+1}, y^t) &\le f(x^t, y^t) + \left\langle \nabla_x f(x^t, y^t), \tilde{x}^{t+1} - x^t \right\rangle + \frac{\beta}{2}\left\|\tilde{x}^{t+1} - x^t\right\|_2^2 \\
&= f(x^t, y^t) - \frac{1}{2\beta}\left\|\nabla_x f(x^t, y^t)\right\|_2^2
\end{aligned}
$$
(Apply Monotonicity of gAM) Since x^{t+1} ∈ mOPT_f(y^t), we must have f(x^{t+1}, y^t) ≤ f(x̃^{t+1}, y^t), which gives us
$$f(x^{t+1}, y^{t+1}) \le f(x^{t+1}, y^t) \le f(x^t, y^t) - \frac{1}{2\beta}\left\|\nabla_x f(x^t, y^t)\right\|_2^2$$
Now, since t ≥ 2, we must have had y^t ∈ arg min_y f(x^t, y). Since f is differentiable, we must have ([see Bubeck, 2015, Proposition 1.2]) ∇_y f(x^t, y^t) = 0. Applying the Pythagoras theorem now gives us, as a result, ‖∇f(x^t, y^t)‖₂² = ‖∇_x f(x^t, y^t)‖₂² + ‖∇_y f(x^t, y^t)‖₂² = ‖∇_x f(x^t, y^t)‖₂².
(Apply Convexity) Since f is jointly convex,
$$f(x^t, y^t) - f^* \le \left\langle \nabla f(x^t, y^t), (x^t, y^t) - (x^*, y^*) \right\rangle \le \sqrt{2}\,R\,\left\|\nabla f(x^t, y^t)\right\|_2,$$
where we have used the Cauchy-Schwarz inequality and the fact that (x^t, y^t), (x^*, y^*) ∈ S0. Putting these together gives us
$$f(x^{t+1}, y^{t+1}) \le f(x^t, y^t) - \frac{1}{4\beta R^2}\left(f(x^t, y^t) - f^*\right)^2,$$
or in other words,
$$\frac{1}{\Phi_{t+1}} \le \frac{1}{\Phi_t} - \frac{1}{4\beta R^2 \Phi_t^2} \le \frac{1}{\Phi_t} - \frac{1}{4\beta R^2 \Phi_t \Phi_{t+1}},$$
where the second step follows from monotonicity. Rearranging gives us
$$\Phi_{t+1} - \Phi_t \ge \frac{1}{4\beta R^2},$$
which upon telescoping, and using Φ₂ ≥ 0, gives us
$$\Phi_T \ge \frac{T}{4\beta R^2},$$
which proves the result. Note that the result holds even if f is jointly convex and satisfies the MSS property only locally, in the region S0.
4.4 A Convergence Guarantee for gAM under MSC/MSS

The above tells us that the set Z^* of bistable points is also the set of all stationary points of f. However, not all points in Z^* may be global minima. Addressing this problem requires careful initialization and problem-specific analysis, which we will carry out for problems such as matrix completion in later sections. For now, we introduce a generic robust bistability property that will be very useful in the analysis. Similar properties are frequently used in the analysis of gAM-style algorithms.
Definition 4.6 (Robust Bistability Property). A function f : R^p × R^q → R satisfies the C-robust bistability property if for some C > 0, for every (x, y) ∈ R^p × R^q, ỹ ∈ mOPT_f(x) and x̃ ∈ mOPT_f(y), we have
$$f(x, y^*) + f(x^*, y) - 2f^* \le C \cdot \left(2f(x, y) - f(x, \tilde{y}) - f(\tilde{x}, y)\right).$$
The right-hand expression captures how much one can reduce the function value locally by performing marginal optimizations. The property suggests (see Exercise 4.7) that if not much local improvement can be made (i.e., if f(x, ỹ) ≈ f(x, y) ≈ f(x̃, y)), then we are close to the optimum. This has a simple corollary that all bistable points achieve the (globally) optimal function value. We now present a convergence analysis for gAM.
Theorem 4.3. Let f : R^p × R^q → R be a continuously differentiable (but possibly non-convex) function that, within the region S0 = {(x, y) : f(x, y) ≤ f(0, 0)} ⊂ R^{p+q}, satisfies the properties of α-MSC and β-MSS in both its variables, as well as C-robust bistability. Let Algorithm 3 be executed with the initialization (x^1, y^1) = (0, 0). Then after at most T = O(log(1/ε)) steps, we have f(x^T, y^T) ≤ f^* + ε.
Note that the MSC/MSS and robust bistability properties need only hold within the sublevel
set S0 . This again underlines the importance of proper initialization. Also note that gAM offers
rapid convergence despite the non-convexity of the objective. In order to prove the result, the
following consequence of C-robust bistability will be useful.
Lemma 4.4. Let f satisfy the properties mentioned in Theorem 4.3. Then for any (x, y) ∈ R^p × R^q, ỹ ∈ mOPT_f(x) and x̃ ∈ mOPT_f(y),
$$\|x - x^*\|_2^2 + \|y - y^*\|_2^2 \le \frac{C\beta}{\alpha}\left(\|x - \tilde{x}\|_2^2 + \|y - \tilde{y}\|_2^2\right)$$
Proof. Applying MSC/MSS repeatedly gives us
$$f(x, y^*) + f(x^*, y) \ge 2f^* + \frac{\alpha}{2}\left(\|x - x^*\|_2^2 + \|y - y^*\|_2^2\right)$$
$$2f(x, y) \le f(x, \tilde{y}) + f(\tilde{x}, y) + \frac{\beta}{2}\left(\|x - \tilde{x}\|_2^2 + \|y - \tilde{y}\|_2^2\right)$$
Applying robust bistability then proves the result.
It is noteworthy that Lemma 4.4 relates local convergence to global convergence and assures
us that reaching an almost bistable point is akin to converging to the optimum. Such a result
can be crucial, especially for non-convex problems. Indeed, similar properties are used in other
proofs concerning coordinate minimization as well, for example, the local error bound used in
[Luo and Tseng, 1993].
Proof (of Theorem 4.3). We will use Φ_t = f(x^t, y^t) − f^* as the potential function. Since the intermediate steps in gAM are marginal optimizations and not gradient steps, we will find it useful to apply marginal strong convexity at a local level, and marginal strong smoothness at a global level instead.
The gAM updates ensure y^{t+1} ∈ mOPT_f(x^{t+1}); applying β-MSS in the first variable along with ∇f(x^*, y^*) = 0 then gives
$$\Phi_{t+1} = f(x^{t+1}, y^{t+1}) - f^* \le f(x^{t+1}, y^*) - f^* \le \frac{\beta}{2}\left\|x^{t+1} - x^*\right\|_2^2.$$
On the other hand, since x^{t+1} minimizes f(·, y^t), applying α-MSC along with the monotonicity f(x^{t+1}, y^{t+1}) ≤ f(x^{t+1}, y^t) gives us
$$\Phi_t - \Phi_{t+1} \ge \frac{\alpha}{2}\left\|x^{t+1} - x^t\right\|_2^2.$$
This shows that appreciable progress is made in a single step. Now, with (xt , yt ) for any t ≥ 2,
due to the nature of the gAM updates, we know that yt ∈ mOPTf (xt ) and xt+1 ∈ mOPTf (yt ).
Applying Lemma 4.4 at the point (x^t, y^t), for which we may take ỹ = y^t and x̃ = x^{t+1}, then gives us the following inequalities:
$$\left\|x^t - x^*\right\|_2^2 \le \left\|x^t - x^*\right\|_2^2 + \left\|y^t - y^*\right\|_2^2 \le \frac{C\beta}{\alpha}\left\|x^t - x^{t+1}\right\|_2^2$$
$$
\begin{aligned}
\Phi_{t+1} \le \frac{\beta}{2}\left\|x^{t+1} - x^*\right\|_2^2 &\le \beta\left(\left\|x^{t+1} - x^t\right\|_2^2 + \left\|x^t - x^*\right\|_2^2\right) \\
&\le \beta(1 + C\kappa)\left\|x^{t+1} - x^t\right\|_2^2 \le 2\kappa(1 + C\kappa)\left(\Phi_t - \Phi_{t+1}\right),
\end{aligned}
$$
where κ = β/α is the effective condition number of the problem. Rearranging gives us
$$\Phi_{t+1} \le \eta_0 \cdot \Phi_t,$$
where η₀ = 2κ(1 + Cκ)/(1 + 2κ(1 + Cκ)) < 1, which proves the result.
Notice that the condition number κ makes an appearance in the convergence rate of the algorithm, but this time with a fresh definition in terms of the MSC/MSS parameters. As before, small values of κ and C ensure fast convergence, whereas large values of κ and C push η₀ → 1, which slows the procedure down.
Before we conclude, we remind the reader that in later sections, we will see more precise
analyses of gAM-style approaches, and the structural assumptions will be more problem specific.
However, we hope the preceding discussion has provided some insight into the inner workings
of alternating minimization techniques.
4.5 Exercises
Exercise 4.1. Recall the low-rank matrix completion problem in recommendation systems from § 1:
$$\hat{A}_{lv} = \min_{U \in \mathbb{R}^{m \times r},\, V \in \mathbb{R}^{n \times r}} \sum_{(i,j) \in \Omega} \left(U_i^\top V_j - A_{ij}\right)^2.$$
Show that the objective in this optimization problem is not jointly convex in U and V. Then show that the objective is, nevertheless, marginally convex in both variables.
Exercise 4.2. Show that a function that is jointly convex is necessarily marginally convex as
well. Similarly show that a (jointly) strongly convex and smooth function is marginally so as
well.
Exercise 4.3. Marginal strong convexity does not imply convexity. Show this by giving an
example of a function f : Rp × Rq → R that is marginally strongly convex in both its variables,
but non-convex.
Hint: use the fact that the function f (x) = x2 is 2-strongly convex.
Exercise 4.4. Design a variant of the gAM procedure that can handle a general constraint set
Z ⊂ X × Y. Attempt to analyze the convergence of your algorithm.
Exercise 4.5. Show that (x∗ , y∗ ) ∈ arg minx∈X ,y∈Y f (x, y) must be a bistable point for any
function even if f is non-convex.
Exercise 4.6. Let f : Rp × Rq → R be a differentiable, jointly convex function. Show that any
bistable point of f is a global minimum for f .
Hint: first show that directional derivatives vanish at bistable points.
Exercise 4.7. For a robustly bistable function f, any almost bistable point is almost optimal as well. Show this by proving, for any (x, y), ỹ ∈ mOPT_f(x), x̃ ∈ mOPT_f(y) such that max{f(x, y) − f(x, ỹ), f(x, y) − f(x̃, y)} ≤ ε, that f(x, y) ≤ f^* + O(ε). Conclude that if f satisfies robust bistability, then any bistable point (x, y) ∈ Z^* is optimal.
Exercise 4.8. Show that marginal strong convexity is additive, i.e., if f, g : R^p × R^q → R are two functions such that f is respectively α₁- and α₂-MSC in its two variables and g is respectively γ₁- and γ₂-MSC in its variables, then the function f + g is (α₁ + γ₁)- and (α₂ + γ₂)-MSC in its variables.
Exercise 4.9. The alternating minimization procedure may oscillate if the optimization problem is not well-behaved. Suppose for an especially nasty problem, the gAM procedure enters into the following loop:
$$(x^1, y^1) \to (x^2, y^1) \to (x^2, y^2) \to (x^1, y^2) \to (x^1, y^1) \to \cdots$$
Show that all four points in the loop are bistable and share the same function value. Can you
draw a hypothetical set of marginally optimal coordinate curves which may cause this to happen
(see Figure 4.2)?
Chapter 5
The EM Algorithm
In this section we will take a look at the Expectation Maximization (EM) principle. The principle
forms the basis for widely used learning algorithms such as those used for learning Gaussian
mixture models, the Baum-Welch algorithm for learning hidden Markov models (HMM), and
mixed regression. The EM algorithm is also a close cousin of Lloyd's algorithm for clustering with the k-means objective.
Although the EM algorithm, at a surface level, follows the alternating minimization principle we studied in § 4, given its wide applicability in learning latent variable models in probabilistic learning settings, we feel it is instructive to invest in a deeper understanding of the EM method. To make the reading experience self-contained, we will first devote some time to developing intuitions and notation for probabilistic learning methods.
The above quantity is also known as the likelihood of the data parametrized by θ^0, as it captures the probability that the observed data was generated by the parameter θ^0. In fact, we can go ahead and define the likelihood function for any parameter θ ∈ Θ as follows:
$$L(\theta; x_1, x_2, \ldots, x_n) := f(x_1, x_2, \ldots, x_n \mid \theta)$$
The maximum likelihood estimate (MLE) of θ^* is simply the parameter that maximizes the above likelihood function, i.e., the parameter which seems the "most likely" to have generated the data:
$$\hat{\theta}_{\text{MLE}} := \arg\max_{\theta \in \Theta} L(\theta; x_1, x_2, \ldots, x_n)$$
It is interesting to study the convergence of θ̂_MLE → θ^* as n → ∞, but we will not do so in this monograph. We note that there do exist estimation techniques other than MLE, such as the Maximum a Posteriori (MAP) estimate that incorporates a prior distribution over θ, but we will not discuss those here either.
Least Squares Regression As a warm-up exercise, let us take the example of linear regression and reformulate it in a probabilistic setting to better understand the above framework. Let y ∈ R and w, x ∈ R^p, and consider the following parametric distribution over the set of reals, parametrized by w and x:
$$f(y \mid x, w) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(y - x^\top w)^2}{2}\right),$$
Note that this distribution exactly encodes the responses in a linear regression model with unit variance Gaussian noise. More specifically, if y ∼ f(· | x, w), then
$$y \sim \mathcal{N}(x^\top w, 1)$$
The above observation allows us to cast linear regression as a parameter estimation problem. Consider the parametric distribution family
$$\mathcal{F} = \left\{f_w = f(\cdot \mid \cdot, w) : \|w\|_2 \le 1\right\}.$$
Suppose now that we have n covariate samples x₁, x₂, …, xₙ and there is a true parameter w^* such that the distribution f_{w^*} ∈ F (i.e., ‖w^*‖₂ ≤ 1) is used to generate the responses, i.e., yᵢ ∼ f(· | xᵢ, w^*). It is easy to see that the likelihood function in this setting is¹
$$L(w; \{(x_i, y_i)\}_{i=1}^n) = \prod_{i=1}^n f(y_i \mid x_i, w) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(y_i - x_i^\top w)^2}{2}\right)$$
Since the logarithm is a strictly increasing function, maximizing the log-likelihood will also yield the MLE, i.e.,
$$\hat{w}_{\text{MLE}} = \arg\max_{\|w\|_2 \le 1} \log L(w; \{(x_i, y_i)\}_{i=1}^n) = \arg\min_{\|w\|_2 \le 1} \sum_{i=1}^n (y_i - x_i^\top w)^2$$
Thus, the MLE for linear regression under Gaussian noise is nothing but the common least
squares estimate! The theory of maximum likelihood estimates and their consistency properties
b MLE → w∗ . ML estimators are
is well studied and under suitable conditions, we indeed have w
members of a more general class of estimators known as M-estimators [Huber and Ronchetti,
2009].
¹The reader would notice that we are modeling only the process that generates the responses given the covariates. However, this is just for the sake of simplicity. It is possible to model the process that generates the covariates xᵢ as well using, for example, a mixture of Gaussians (which we will study in this very section). A model that accounts for the generation of both xᵢ and yᵢ is called a generative model.
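As a quick sanity check, the following small NumPy sketch (our own illustration, dropping the norm constraint for simplicity) confirms numerically that the least squares solution also maximizes the Gaussian log-likelihood:

import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
X = rng.standard_normal((n, p))
w_star = np.array([0.5, -0.3, 0.2])
y = X @ w_star + rng.standard_normal(n)       # unit-variance Gaussian noise

def log_likelihood(w):
    """Gaussian log-likelihood of the responses under parameter w."""
    return -0.5 * np.sum((y - X @ w) ** 2) - 0.5 * n * np.log(2 * np.pi)

w_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares estimate
for _ in range(5):                            # perturbing w_ls only hurts
    w_perturbed = w_ls + 0.1 * rng.standard_normal(p)
    assert log_likelihood(w_perturbed) <= log_likelihood(w_ls)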
Algorithm 4 AltMax for Latent Variable Models (AM-LVM)
Input: Data points y₁, . . . , yₙ
Output: An approximate MLE θ̂ ∈ Θ
1: θ^1 ← INITIALIZE()
2: for t = 1, 2, . . . do
3:   for i = 1, 2, . . . , n do
4:     ẑᵢ^t ← arg max_{z ∈ Z} f(z | yᵢ, θ^t)  (Estimate latent variables)
5:   end for
6:   θ^{t+1} ← arg max_{θ ∈ Θ} log L(θ; {(yᵢ, ẑᵢ^t)}ᵢ₌₁ⁿ)  (Update parameter)
7: end for
8: return θ^t
In most practical situations, using the marginal likelihood function L(θ; y₁, y₂, …, yₙ) to perform ML estimation becomes intractable, since the expression on the right-hand side, when expanded as a sum, contains |Z|ⁿ terms, which makes it difficult to even write down the expression fully, let alone optimize it as an objective function!
For comparison, the log-likelihood expression for the linear regression problem with n data points (the least squares expression) was a summation of just n terms. Indeed, the problem of maximizing the marginal likelihood function L(θ; y₁, y₂, …, yₙ) is often NP-hard and, as a consequence, direct optimization techniques for finding the MLE fail even for small-scale problems.
However, notice that it is also true that, had the identity of the true parameter θ^* been provided to us (again magically), it would have been simple to estimate the latent variables using a maximum posterior probability estimate for zᵢ as follows:
$$\hat{z}_i = \arg\max_{z \in \mathcal{Z}} f(z \mid y_i, \theta^*)$$
Given the above, it is tempting to apply a gAM-style algorithm to solve the MLE problem in the presence of latent variables. Algorithm 4 outlines such an adaptation of gAM to the latent variable learning problem. Note that steps 4 and 6 in the algorithm can be carried out very efficiently in several problem cases. In fact, it can be shown (see Exercise 5.1) that for the Gaussian mixture modeling problem, AM-LVM reduces to the popular Lloyd's algorithm for k-means clustering; a sketch of these hard assignments appears below.
However, the AM-LVM algorithm has certain drawbacks, especially when the space of latent variables Z is large. At every time step t, AM-LVM makes a "hard assignment", assigning the data point yᵢ to just one value of the latent variable ẑᵢ^t ∈ Z. This can amount to throwing away a lot of information, especially when there may be other values z′ ∈ Z present such that f(z′ | yᵢ, θ^t) is also large but nevertheless f(z′ | yᵢ, θ^t) < f(ẑᵢ^t | yᵢ, θ^t), so that AM-LVM neglects z′. The EM algorithm tries to remedy this.
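Before moving to EM's fix, here is the hard-assignment baseline it improves upon: a minimal sketch (our own illustration, under the balanced isotropic GMM assumptions introduced in § 5.6.1) of AM-LVM, which is exactly a Lloyd's / k-means iteration:

import numpy as np

def am_lvm_gmm(Y, T=20):
    """AM-LVM (Algorithm 4) for a balanced two-component isotropic GMM:
    hard-assign each point to its nearest mean (latent variable step),
    then re-estimate each mean from its assigned points (parameter step)."""
    rng = np.random.default_rng(0)
    mu = Y[rng.choice(len(Y), size=2, replace=False)].copy()
    for _ in range(T):
        dists = np.linalg.norm(Y[:, None, :] - mu[None, :, :], axis=2)
        z = dists.argmin(axis=1)      # z_i = argmax_z f(z | y_i, theta^t)
        for k in (0, 1):
            if np.any(z == k):
                mu[k] = Y[z == k].mean(axis=0)
    return mu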
In contrast with AM-LVM, EM assigns every latent variable value a weight, while still giving ẑᵢ^t = arg max_{z ∈ Z} f(z | yᵢ, θ^t) the highest weight; AM-LVM can then be seen as putting all the weight on ẑᵢ^t alone and zero weight on any other latent variable value.
5.4 The EM Algorithm

We now present a more formal derivation of the EM algorithm. Instead of maximizing the likelihood in a single step, the EM algorithm tries to efficiently encourage an increase in the likelihood over several steps. Define the point-wise likelihood function as
$$L(\theta; y) = \sum_{z \in \mathcal{Z}} f(y, z \mid \theta).$$
Note that we can write the marginal likelihood function as L(θ; y₁, y₂, …, yₙ) = ∏ᵢ₌₁ⁿ L(θ; yᵢ). Our goal is to maximize L(θ; y₁, y₂, …, yₙ), but doing so directly is too expensive. The next best thing is to do so indirectly. Suppose we had a proxy function that lower bounded the likelihood function but was also easy to optimize. Then maximizing the proxy function would also lead to an increase in the likelihood if the proxy were really good.
This is the key to the EM algorithm: it introduces a proxy function called the Q-function
that lower bounds the marginal likelihood function and casts the parameter estimation problem
as a bi-variate problem, the two variables being the parameter θ and the Q-function.
Given an initialization θ^0 ∈ Θ, EM constructs a Q-function out of it, uses that as a proxy to obtain a better parameter θ^1, uses the newly obtained parameter to construct a better Q-function, uses the better Q-function to obtain a still better parameter θ^2, and so on. Thus, it essentially performs alternating optimization steps, with better estimates of the parameter θ leading to better constructions of the Q-function and vice versa.
To formalize this notion, we will abuse notation to let f(z | y, θ^0) denote the conditional probability function for the random variable Z given the variable Y and the parameter θ^0. Then we have
$$\log L(\theta; y) = \log \sum_{z \in \mathcal{Z}} f(y, z \mid \theta) = \log \sum_{z \in \mathcal{Z}} f(z \mid y, \theta^0)\,\frac{f(y, z \mid \theta)}{f(z \mid y, \theta^0)}$$
The summation in the last expression is simply an expectation with respect to the random variable Z sampled from the conditional distribution f(· | y, θ^0). Using this, we get
$$
\begin{aligned}
\log L(\theta; y) &= \log \mathbb{E}_{z \sim f(\cdot \mid y, \theta^0)}\left[\frac{f(y, z \mid \theta)}{f(z \mid y, \theta^0)}\right] \\
&\ge \mathbb{E}_{z \sim f(\cdot \mid y, \theta^0)}\left[\log \frac{f(y, z \mid \theta)}{f(z \mid y, \theta^0)}\right] \\
&= \underbrace{\mathbb{E}_{z \sim f(\cdot \mid y, \theta^0)}\left[\log f(y, z \mid \theta)\right]}_{Q_y(\theta \mid \theta^0)} - \underbrace{\mathbb{E}_{z \sim f(\cdot \mid y, \theta^0)}\left[\log f(z \mid y, \theta^0)\right]}_{R_y(\theta^0)}.
\end{aligned}
$$
The inequality follows from Jensen's inequality, as the logarithm function is concave. The function Q_y(θ | θ^0) is called the point-wise Q-function. The Q-function can be interpreted as a weighted point-wise likelihood function
$$Q_y(\theta \mid \theta^0) = \sum_{z \in \mathcal{Z}} w_z \cdot \log f(y, z \mid \theta),$$
with weights w_z = f(z | y, θ^0).
Algorithm 5 Expectation Maximization (EM)
Input: Implementations of the E-step E(·) and the M-step M(·)
Output: A good parameter θ̂ ∈ Θ
1: θ^1 ← INITIALIZE()
2: for t = 1, 2, . . . do
3:   Q_t(· | θ^t) ← E(θ^t)  (E-step)
4:   θ^{t+1} ← M(θ^t, Q_t)  (M-step)
5: end for
6: return θ^t
Clearly this construction (the population E-step, which takes expectations over the true data distribution) is infeasible in practice. A much more realistic sample construction works with just the observed samples y₁, y₂, …, yₙ (note that these were indeed drawn from the distribution f(y | θ^*)). The sample E-step constructs the Q-function as
$$Q_t^{\text{sam}}(\theta \mid \theta^t) = \frac{1}{n}\sum_{i=1}^n Q_{y_i}(\theta \mid \theta^t) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{z \sim f(\cdot \mid y_i, \theta^t)}\left[\log f(y_i, z \mid \theta)\right] = \frac{1}{n}\sum_{i=1}^n \sum_{z \in \mathcal{Z}} f(z \mid y_i, \theta^t) \cdot \log f(y_i, z \mid \theta)$$
Note that this expression has n · |Z| terms instead of the |Z|ⁿ terms the marginal likelihood expression had. This drastic reduction is a key factor behind the scalability of the EM algorithm.
M-step Constructions Recall that in § 4 we considered alternating approaches that fully optimize with respect to a variable, as well as those that merely perform a descent step, improving the function value along that variable but not quite optimizing it.
Similar variants can be developed for EM as well. Given a Q-function, the simplest strategy is to optimize it completely, with the M-step simply returning the maximizer of the Q-function; this is the fully corrective version of EM:
$$M^{\text{fc}}(\theta^t, Q_t) = \arg\max_{\theta \in \Theta} Q_t(\theta \mid \theta^t)$$
It is useful to remind ourselves here that whereas in previous sections we looked at minimization problems, the problem here is that of likelihood maximization.
Since this can be expensive in large-scale optimization settings, a gradient-descent version exists that makes the M-step faster by performing just a gradient step with respect to the Q-function, i.e.,
$$M^{\text{grad}}(\theta^t, Q_t) = \theta^t + \alpha_t \cdot \nabla Q_t(\theta^t \mid \theta^t).$$
Stochastic EM Construction A highly scalable version of the algorithm is the stochastic update version that uses the point-wise Q-function of a single, randomly chosen sample Yₜ ∼ Unif[n] to execute a gradient update at each time step t. It can be shown (see Exercise 5.2) that in expectation, this executes the sample E-step and the gradient M-step:
$$M^{\text{sto}}(\theta^t) = \theta^t + \alpha_t \cdot \nabla Q_{Y_t}(\theta^t \mid \theta^t)$$
5.6 Motivating Applications

5.6.1 Gaussian Mixture Models

Consider data drawn from a two-component Gaussian mixture
$$f(y) = \phi_0^* \cdot \mathcal{N}_0(y) + \phi_1^* \cdot \mathcal{N}_1(y),$$
where Nᵢ = N(·; µ^{*,i}, Σᵢ^*), i = 0, 1, are the mixture components and φᵢ^* ∈ (0, 1), i = 0, 1, are the mixture coefficients. We insist that φ₀^* + φ₁^* = 1 to ensure that f is indeed a probability distribution. Note that we consider a mixture with just two components for simplicity. Mixture models with a larger number of components can be constructed similarly.
A sample (y, z) ∈ R^p × {0, 1} can be drawn from this distribution by first tossing a Bernoulli coin with bias φ₁^* to choose a component z ∈ {0, 1} and then drawing a sample y ∼ N_z from that component.
However, despite the samples (y₁, z₁), (y₂, z₂), …, (yₙ, zₙ) being drawn, what is presented to us is only y₁, y₂, …, yₙ, i.e., the identities zᵢ of the components that actually resulted in these draws are hidden from us. For instance, in topic modeling tasks, the underlying topics being discussed in documents are hidden from us, and we only get to see surface realizations of words in the documents, which have topic-specific distributions. The goal here is to recover the mixture components as well as the coefficients in an efficient manner from such partially observed draws.
For the sake of simplicity, we will look at a balanced, isotropic mixture, i.e., where we are given that φ₀^* = φ₁^* = 0.5 and Σ₀^* = Σ₁^* = I_p. This simplifies our updates and analysis, as the only unknown parameters in the model are µ^{*,0} and µ^{*,1}. Let M = (µ^0, µ^1) ∈ R^p × R^p denote an ensemble describing such a parametric mixture. Our job is to recover M^* = (µ^{*,0}, µ^{*,1}).
E-step Construction Consider an arbitrary ensemble of means M = (µ^0, µ^1) with unit-covariance components, i.e., Nᵢ = N(·; µ^i, I_p). In this case, the Q-function actually has a closed form expression. For any y ∈ R^p, z ∈ {0, 1}, and M, we have (up to constants that cancel in the conditional)
$$f(y, z \mid M) = \mathcal{N}_z(y) = \exp\left(-\frac{\|y - \mu^z\|_2^2}{2}\right)$$
$$f(z \mid y, M) = \frac{f(y, z \mid M)}{f(y \mid M)} = \frac{f(y, z \mid M)}{f(y, z \mid M) + f(y, 1-z \mid M)}.$$
Thus, even though the marginal likelihood function was inaccessible, the point-wise Q-function has a nice closed form. For the E-step construction, for any two ensembles M^t = (µ^{t,0}, µ^{t,1}) and M = (µ^0, µ^1),
$$Q_y(M \mid M^t) = -\frac{1}{2}\left(w_t^0(y)\left\|y - \mu^0\right\|_2^2 + w_t^1(y)\left\|y - \mu^1\right\|_2^2\right),$$
where
$$w_t^z(y) = e^{-\frac{\|y - \mu^{t,z}\|_2^2}{2}}\left(e^{-\frac{\|y - \mu^{t,0}\|_2^2}{2}} + e^{-\frac{\|y - \mu^{t,1}\|_2^2}{2}}\right)^{-1}.$$
Note that w_t^z(y) ≥ 0 for z = 0, 1 and w_t^0(y) + w_t^1(y) = 1. Also note that w_t^z(y) is larger if y is closer to µ^{t,z}, i.e., it measures the affinity of a point to that center. Given a sample of data points y₁, y₂, …, yₙ, the Q-function is
$$Q(M \mid M^t) = \frac{1}{n}\sum_{i=1}^n Q_{y_i}(M \mid M^t) = -\frac{1}{2n}\sum_{i=1}^n \left(w_t^0(y_i)\left\|y_i - \mu^0\right\|_2^2 + w_t^1(y_i)\left\|y_i - \mu^1\right\|_2^2\right)$$
M-step Construction This also has a closed form solution in this case (see Exercise 5.3). If M^{t+1} = (µ^{t+1,0}, µ^{t+1,1}) = arg max_M Q(M | M^t), then
$$\mu^{t+1,z} = \frac{\sum_{i=1}^n w_t^z(y_i)\, y_i}{\sum_{i=1}^n w_t^z(y_i)}$$
Note that the M-step can be executed in linear time and does not require an explicit construction of the Q-function at all! One just needs to apply the M-step repeatedly; the Q-function is implicit in the M-step.
The reader would notice a similarity between the EM algorithm for Gaussian mixture models and Lloyd's algorithm [Lloyd, 1982] for k-means clustering. In fact (see Exercise 5.1), Lloyd's algorithm implements exactly the AM-LVM algorithm (see Algorithm 4), which performs "hard" assignments, assigning each data point completely to one of the clusters, whereas the EM algorithm makes "soft" assignments, allowing each point to have different levels of affinity to different clusters. Figure 5.1 depicts the working of the EM algorithm on a toy GMM problem; a code sketch of the soft updates follows.
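The following is a minimal sketch (our own illustration) of EM for the balanced isotropic two-component GMM above: soft weights w_t^z(y) in the E-step, weighted means in the M-step:

import numpy as np

def em_gmm(Y, T=50):
    """EM for a balanced isotropic 2-component GMM (Algorithm 5,
    sample E-step and fully corrective M-step)."""
    rng = np.random.default_rng(0)
    mu = Y[rng.choice(len(Y), size=2, replace=False)].astype(float)
    for _ in range(T):
        # E-step: w_t^z(y_i) proportional to exp(-||y_i - mu^z||^2 / 2)
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logits = -0.5 * d2
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        w = np.exp(logits)
        w /= w.sum(axis=1, keepdims=True)             # rows sum to one
        # M-step: weighted means, the closed-form maximizer of Q
        mu = (w.T @ Y) / w.sum(axis=0)[:, None]
    return mu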
Figure 5.1: The data is generated from a mixture model: circle points from N(µ^{*,0}, I_p) and triangle points from N(µ^{*,1}, I_p), but their origin is unknown. The EM algorithm performs soft clustering assignments to realize this and keeps increasing the w_t^0 values for circle points and the w_t^1 values for triangle points. As a result, the estimated means µ^{t,z} rapidly converge to the true means µ^{*,z}, z = 0, 1.
5.6.2 Mixed Regression

The mixed regression model is especially useful when we suspect that our data is actually composed of several sub-populations which cannot be explained well using a single model.
For example, consider the previous example of predicting family expenditure. Although we
may have data from families across a nation, it may be unwise to try and explain it using a
single model due to various reasons. The prices of various commodities and services may vary
across urban and rural areas and similar consumption in two regions may very well result in
different expenditures. Moreover, there may exist parameters such as total income which are
not revealed in a survey due to privacy issues, but nevertheless influence expenditure.
Thus, there may actually be several models, each corresponding to a certain income bracket
or a certain geographical region, which together explain the data very well. This poses a challenge
since the income bracket, or geographical location of a family may not have been recorded as a
part of the survey due to privacy or other reasons!
To formalize the above scenario, consider two linear models w^{*,0}, w^{*,1} ∈ R^p. For each data point xᵢ ∈ R^p, first one of the models is selected by performing a Bernoulli trial with bias φ₁ to get zᵢ ∈ {0, 1}, and then the response is generated as
$$y_i = x_i^\top w^{*,z_i} + \eta_i,$$
where ηᵢ is i.i.d. Gaussian noise, ηᵢ ∼ N(0, σ²_{zᵢ}). This can be cast as a parametric model by considering density functions of the form
$$f(\cdot \mid \cdot, \{\phi_z, w^{*,z}, \sigma_z\}_{z=0,1}) = \phi_0 \cdot g(\cdot \mid \cdot, w^{*,0}, \sigma_0) + \phi_1 \cdot g(\cdot \mid \cdot, w^{*,1}, \sigma_1),$$
where σ_z, φ_z > 0, φ₀ + φ₁ = 1, and for any (x, y) ∈ R^p × R, we have (up to normalization)
$$g(y \mid x, w^{*,z}, \sigma_z) = \exp\left(-\frac{(y - x^\top w^{*,z})^2}{2\sigma_z^2}\right).$$
Note that although such a model generates data in the form of triplets (x₁, y₁, z₁), (x₂, y₂, z₂), …, (xₙ, yₙ, zₙ), we are only allowed to observe (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) as the data. For the sake of simplicity, we will yet again look at the special case when the Bernoulli trials are fair, i.e., φ₀ = φ₁ = 0.5 and σ₀ = σ₁ = 1. Thus, the only unknown parameters are the models w^{*,0} and w^{*,1}. Let W = (w^0, w^1) ∈ R^p × R^p denote the parametric mixed model. Our job is to recover W^* = (w^{*,0}, w^{*,1}).
A particularly interesting special case arises when we further impose the constraint w^{*,0} = −w^{*,1}, i.e., the two models in the mixture are tied together to be negatives of each other. This model is especially useful in the phase retrieval problem. Although we will study this problem in more generality in § 10, we present a special case here.
Phase retrieval is a problem that arises in several imaging situations such as X-ray crys-
tallography where, after data {(xi , yi )}i=1,...,n has been generated as yi = hw∗ , xi i, the sign of
the response (or more generally, the phase of the response if the response is complex-valued)
is omitted and we are presented with just {(xi , |yi |)}i=1,...,n . In such a situation, we can use
the latent variable zi = sign(yi ) to denote the omitted sign information. In this setting, it can
be seen that the mixture model with w∗,0 = −w∗,1 is very appropriate since each data point
(xi , |yi |) will be nicely explained by either w∗ or −w∗ depending on the value of sign(yi ).
We will revisit this problem in detail in § 10. For now, we move on to discuss the E and M-step constructions for the mixed regression problem. We leave the details of the constructions as an exercise (see Exercise 5.4).
E-step Construction The point-wise Q-function has a closed-form expression. Given two ensembles W^t = (w^{t,0}, w^{t,1}) and W = (w^0, w^1),
$$Q_{(x,y)}(W \mid W^t) = -\frac{1}{2}\left(\alpha^0_{t,(x,y)} \cdot (y - x^\top w^0)^2 + \alpha^1_{t,(x,y)} \cdot (y - x^\top w^1)^2\right),$$
where
$$\alpha^z_{t,(x,y)} = e^{-\frac{(y - x^\top w^{t,z})^2}{2}}\left(e^{-\frac{(y - x^\top w^{t,0})^2}{2}} + e^{-\frac{(y - x^\top w^{t,1})^2}{2}}\right)^{-1}.$$
Note that α^z_{t,(x,y)} ≥ 0 for z = 0, 1 and α^0_{t,(x,y)} + α^1_{t,(x,y)} = 1. Also note that α^z_{t,(x,y)} is larger if w^{t,z} gives a smaller regression error for the point (x, y) than w^{t,1−z}, i.e., if (y − x^⊤w^{t,z})² < (y − x^⊤w^{t,1−z})². Thus, the data point (x, y) feels greater affinity to the model that fits it better, which is, intuitively, an appropriate thing to do.
M-step Construction The maximizer of the Q-function has a closed-form solution in this case as well. If W^{t+1} = (w^{t+1,0}, w^{t+1,1}) = arg max_W Q(W | W^t), where Q(W | W^t) is the sample Q-function created from a data sample (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ), then it is easy to see that the M-step update is given by the solution to two weighted least squares problems with weights given by α^z_{t,(xᵢ,yᵢ)} for z ∈ {0, 1}, which have closed-form solutions (a code sketch follows below):
$$w^{t+1,z} = \left(\sum_{i=1}^n \alpha^z_{t,(x_i,y_i)} \cdot x_i x_i^\top\right)^{-1} \left(\sum_{i=1}^n \alpha^z_{t,(x_i,y_i)} \cdot y_i x_i\right)$$
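Here is a minimal sketch (our own illustration) of EM for the balanced mixed regression model with unit-variance noise: soft weights α^z in the E-step, two weighted least squares solves in the M-step:

import numpy as np

def em_mixed_regression(X, y, T=50):
    """EM for balanced mixed regression: responsibilities alpha^z
    (E-step), then the weighted normal equations for each model (M-step)."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((2, X.shape[1]))     # current (w^{t,0}, w^{t,1})
    for _ in range(T):
        # E-step: alpha^z_i proportional to exp(-(y_i - x_i^T w^{t,z})^2 / 2)
        logits = -0.5 * (y[:, None] - X @ W.T) ** 2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        alpha = np.exp(logits)
        alpha /= alpha.sum(axis=1, keepdims=True)
        # M-step: w^{t+1,z} solves sum_i alpha^z_i x_i x_i^T w = sum_i alpha^z_i y_i x_i
        for z in (0, 1):
            a = alpha[:, z]
            W[z] = np.linalg.solve(X.T @ (a[:, None] * X), X.T @ (a * y))
    return W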
Figure 5.2: The data contains two sub-populations (circle and triangle points) and cannot be properly explained by a single linear model. EM rapidly realizes that the circle points should belong to the solid model and the triangle points to the dashed model. Thus, α^0_t keeps going up for circle points and α^1_t keeps going up for triangle points. As a result, only circle points contribute significantly to learning the solid model and only triangle points contribute significantly to the dashed model.
5.7 A Monotonicity Guarantee for EM

Theorem 5.1. The EM algorithm (Algorithm 5), when executed with a population E-step and a fully corrective M-step, ensures that the population likelihood never decreases across iterations, i.e., for all t,
$$\mathbb{E}_{y \sim f_{\theta^*}}\left[f(y \mid \theta^{t+1})\right] \ge \mathbb{E}_{y \sim f_{\theta^*}}\left[f(y \mid \theta^t)\right].$$
If executed with the sample E-step on data y₁, …, yₙ, EM ensures that the sample likelihood never decreases across iterations, i.e., for all t,
$$\frac{1}{n}\sum_{i=1}^n \log f(y_i \mid \theta^{t+1}) \ge \frac{1}{n}\sum_{i=1}^n \log f(y_i \mid \theta^t).$$
A similar result holds for gradient M-steps too, but we do not consider that here. To prove this result, we will need the following simple observation (see Exercise 5.5). Recall that for any θ^0 ∈ Θ and y ∈ Y, we defined the terms
$$Q_y(\theta \mid \theta^0) = \mathbb{E}_{z \sim f(\cdot \mid y, \theta^0)}\left[\log f(y, z \mid \theta)\right] \quad\text{and}\quad R_y(\theta^0) = \mathbb{E}_{z \sim f(\cdot \mid y, \theta^0)}\left[\log f(z \mid y, \theta^0)\right].$$
Lemma 5.2. For any θ^0 ∈ Θ and any y, we have log f(y | θ^0) = Q_y(θ^0 | θ^0) − R_y(θ^0), i.e., the lower bound obtained via Jensen's inequality is tight at θ = θ^0.
Proof (of Theorem 5.1). Consider the sample E-step with Q(θ | θ^t) = (1/n)∑ᵢ Q_{yᵢ}(θ | θ^t); a similar argument works for population E-steps. The fully corrective M-step ensures Q(θ^{t+1} | θ^t) ≥ Q(θ^t | θ^t), i.e.,
$$\frac{1}{n}\sum_{i=1}^n Q_{y_i}(\theta^{t+1} \mid \theta^t) \ge \frac{1}{n}\sum_{i=1}^n Q_{y_i}(\theta^t \mid \theta^t).$$
Subtracting the same terms from both sides gives us
$$\frac{1}{n}\sum_{i=1}^n \left(Q_{y_i}(\theta^{t+1} \mid \theta^t) - R_{y_i}(\theta^t)\right) \ge \frac{1}{n}\sum_{i=1}^n \left(Q_{y_i}(\theta^t \mid \theta^t) - R_{y_i}(\theta^t)\right).$$
Using the inequality log f(y | θ^{t+1}) ≥ Q_y(θ^{t+1} | θ^t) − R_y(θ^t) on the left and applying Lemma 5.2 on the right gives us
$$\frac{1}{n}\sum_{i=1}^n \log f(y_i \mid \theta^{t+1}) \ge \frac{1}{n}\sum_{i=1}^n \log f(y_i \mid \theta^t),$$
which proves the result.
5.8 Local Strong Concavity and Local Strong Smoothness

For any θ^0 ∈ Θ, we will use q_{θ^0}(·) = Q(· | θ^0) as shorthand for the Q-function with respect to θ^0 (constructed at the population or sample level depending on the E-step), and let M(θ^0) := arg max_{θ ∈ Θ} q_{θ^0}(θ) denote the output of the M-step if θ^0 is the current parameter. Let θ^* denote a parameter that optimizes the population likelihood:
$$\theta^* \in \arg\max_{\theta \in \Theta} \mathbb{E}_{y \sim f(y \mid \theta^*)}\left[L(\theta; y)\right] = \arg\max_{\theta \in \Theta} \mathbb{E}_{y \sim f(y \mid \theta^*)}\left[f(y \mid \theta)\right]$$
Recall that our overall goal is indeed to recover a parameter such as θ^* that maximizes the population-level likelihood (or the sample likelihood if using sample E-steps). Now, the Q-function satisfies (see Exercise 5.6) the following self-consistency property:
$$\theta^* \in \arg\max_{\theta \in \Theta} q_{\theta^*}(\theta)$$
Thus, if we could somehow get hold of the Q-function q_{θ^*}(·), then a single M-step would solve the problem! However, this is a circular argument, since getting hold of q_{θ^*}(·) would require finding θ^* first.
To proceed along the previous argument, we need to refine this observation. Not only should the M-step refuse to deviate from the optimum θ^* if initialized there, it should also behave in a relatively calm manner in the neighborhood of the optimum. The following properties characterize "nice" Q-functions that ensure this happens.
For the sake of simplicity, we will assume that all Q-functions are continuously differentiable, and that the estimation problem is unconstrained, i.e., Θ = R^p. Note that this ensures ∇q_{θ^*}(θ^*) = 0 due to self-consistency. Also note that since we are looking at maximization problems, we will require the Q-function to satisfy "concavity" properties instead of "convexity" properties.
Definition 5.1 (Local Strong Concavity). A statistical estimation problem with a population likelihood maximizer θ^* satisfies the (r, α)-Local Strong Concavity (LSC) property if there exist α, r > 0 such that the function q_{θ^*}(·) is α-strongly concave in a neighborhood ball of radius r around θ^*, i.e., for all θ^1, θ^2 ∈ B₂(θ^*, r),
$$q_{\theta^*}(\theta^1) - q_{\theta^*}(\theta^2) - \left\langle \nabla q_{\theta^*}(\theta^2), \theta^1 - \theta^2 \right\rangle \le -\frac{\alpha}{2}\left\|\theta^1 - \theta^2\right\|_2^2$$
The reader would find LSC similar to restricted strong convexity (RSC) in Definition 3.2,
with the “restriction” being the neighborhood B2 (θ ∗ , r) of θ ∗ . Also note that only qθ∗ (·) is
required to satisfy the LSC property, and not Q-functions corresponding to every θ.
We will also require a counterpart to restricted strong smoothness (RSS). For that, we
introduce the notion of Lipschitz gradients.
Definition 5.2 (Lipschitz Gradients). A differentiable function f : Rp → R is said to have
β-Lipschitz gradients if for all x, y ∈ Rp , we have
k∇f (x) − ∇f (y)k2 ≤ β · kx − yk2 .
We advise the reader to relate this notion to that of Lipschitz functions (Definition 2.5). It can be shown (see Exercise 5.7) that all functions with L-Lipschitz gradients are also L-strongly smooth (Definition 2.4). Using this notion, we are now ready to introduce our next properties.
Definition 5.3 (Local Strong Smoothness). A statistical estimation problem with a population likelihood maximizer θ^* satisfies the (r, β)-Local Strong Smoothness (LSS) property if there exist β, r > 0 such that for all θ^1, θ^2 ∈ B₂(θ^*, r), the function q_{θ^*}(·) satisfies
$$\left\|\nabla q_{\theta^*}(M(\theta^1)) - \nabla q_{\theta^*}(M(\theta^2))\right\|_2 \le \beta \cdot \left\|\theta^1 - \theta^2\right\|_2$$
The above property ensures that in the restricted neighborhood around the optimum, the Q-function q_{θ^*}(·) is strongly smooth. The similarity to RSS is immediate. Note that this property also generalizes the self-consistency property we saw a moment ago.
Self-consistency forces ∇q_{θ^*}(θ^*) = 0 at the optimum. LSS forces such behavior to extend around the optimum as well. To see this, simply set θ^2 = θ^* in LSS and observe the corollary ‖∇q_{θ^*}(M(θ^1))‖₂ ≤ β · ‖θ^1 − θ^*‖₂. The curious reader may wish to relate this corollary to the Robust Bistability property (Definition 4.6) and the Local Error Bound property introduced by Luo and Tseng [1993].
The LSS property offers, as another corollary, a useful property of statistical estimation problems called the First Order Stability property (introduced by Balakrishnan et al. [2017] in a more general setting).
Definition 5.4 (First Order Stability [Balakrishnan et al., 2017]). A statistical estimation problem with a population likelihood maximizer θ^* satisfies the (r, γ)-First Order Stability (FOS) property if there exist γ, r > 0 such that the gradients of the functions q_θ(·) are stable in a neighborhood of θ^*, i.e., for all θ ∈ B₂(θ^*, r),
$$\left\|\nabla q_{\theta}(M(\theta)) - \nabla q_{\theta^*}(M(\theta))\right\|_2 \le \gamma \cdot \left\|\theta - \theta^*\right\|_2.$$
Lemma 5.3. A statistical estimation problem that satisfies the (r, β)-LSS property, also satisfies
the (r, β)-FOS property.
Proof. Since M(θ) maximizes the function q_θ(·) due to the M-step, and the problem is unconstrained with differentiable Q-functions, we get ∇q_θ(M(θ)) = 0. Thus we have, also using M(θ^*) = θ^* and ∇q_{θ^*}(θ^*) = 0,
$$\left\|\nabla q_{\theta}(M(\theta)) - \nabla q_{\theta^*}(M(\theta))\right\|_2 = \left\|\nabla q_{\theta^*}(M(\theta)) - \nabla q_{\theta^*}(M(\theta^*))\right\|_2 \le \beta \cdot \left\|\theta - \theta^*\right\|_2,$$
where the inequality follows from the LSS property.
Lemma 5.4. Suppose we have a statistical estimation problem with a population likelihood maximizer θ^* that, for some α, β, r > 0, satisfies the (r, α)-LSC and (r, β)-LSS properties. Then in the region B₂(θ^*, r), the M operator corresponding to the fully corrective M-step is contractive, i.e., for all θ ∈ B₂(θ^*, r),
$$\|M(\theta) - M(\theta^*)\|_2 \le \frac{\beta}{\alpha} \cdot \|\theta - \theta^*\|_2$$
5.9 A Local Convergence Guarantee for EM

Since, by the self-consistency property, we have M(θ^*) = θ^*, and the EM algorithm sets θ^{t+1} = M(θ^t) in the M-step, Lemma 5.4 immediately guarantees the following local convergence property (see also Exercise 5.8).

Theorem 5.5. Suppose a statistical estimation problem with population likelihood maximizer θ^* satisfies the (r, α)-LSC and (r, β)-LSS properties such that β < α. Let the EM algorithm (Algorithm 5) be initialized with θ^1 ∈ B₂(θ^*, r) and executed with population E-steps and fully corrective M-steps. Then after at most T = O(log(1/ε)) steps, we have ‖θ^T − θ^*‖₂ ≤ ε.
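Spelled out, the argument is a one-line geometric contraction (our own expansion of the proof step):
$$\left\|\theta^{t+1} - \theta^*\right\|_2 = \left\|M(\theta^t) - M(\theta^*)\right\|_2 \le \frac{\beta}{\alpha}\left\|\theta^t - \theta^*\right\|_2 \;\Longrightarrow\; \left\|\theta^T - \theta^*\right\|_2 \le \left(\frac{\beta}{\alpha}\right)^{T-1} r,$$
so that T = 1 + log_{α/β}(r/ε) = O(log(1/ε)) steps suffice when β/α < 1; note also that each iterate stays inside B₂(θ^*, r), so Lemma 5.4 keeps applying.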
Note that the above result holds only if β < α, in other words, only if the condition number κ = β/α < 1. We hasten to warn the reader that whereas in previous sections we always had κ ≥ 1, here the LSC and LSS properties are defined differently (LSS involves the M-step whereas LSC does not), and thus it is possible to have κ < 1.
Also, since all functions trivially satisfy (0, 0)-LSC, it is plausible that for well-behaved problems, even for large α > β, there should exist some small radius r(α) so that the (r(α), α)-LSC property holds. This may require the EM algorithm to be initialized closer to the optimum for the convergence properties to kick in. We now prove Lemma 5.4.
Proof of Lemma 5.4. Since we have differentiable Q-functions and an unconstrained estimation problem, we immediately get a lot of useful results. We note that this lemma holds even for constrained estimation problems, but the arguments are more involved, which we wish to avoid. Let θ ∈ B₂(θ^*, r) be any parameter in the r-neighborhood of θ^*.
(Apply Local Strong Concavity) Upon a two-sided application of LSC and using ∇q_{θ^*}(θ^*) = 0, we get
$$q_{\theta^*}(\theta^*) - q_{\theta^*}(M(\theta)) - \left\langle \nabla q_{\theta^*}(M(\theta)), \theta^* - M(\theta) \right\rangle \le -\frac{\alpha}{2}\|M(\theta) - \theta^*\|_2^2$$
$$q_{\theta^*}(M(\theta)) - q_{\theta^*}(\theta^*) \le -\frac{\alpha}{2}\|\theta^* - M(\theta)\|_2^2,$$
adding which, and applying the Cauchy-Schwarz inequality, gives us
$$\alpha\,\|M(\theta) - \theta^*\|_2^2 \le \left\langle \nabla q_{\theta^*}(M(\theta)), \theta^* - M(\theta) \right\rangle \le \left\|\nabla q_{\theta^*}(M(\theta))\right\|_2 \cdot \|M(\theta) - \theta^*\|_2.$$
(Apply Local Strong Smoothness) Since M(θ) maximizes the function q_θ(·) due to the M-step, we get ∇q_θ(M(θ)) = 0. Thus,
$$\left\|\nabla q_{\theta^*}(M(\theta))\right\|_2 = \left\|\nabla q_{\theta^*}(M(\theta)) - \nabla q_{\theta}(M(\theta))\right\|_2.$$
Using Lemma 5.3 to invoke the (r, β)-FOS property further gives us ‖∇q_{θ^*}(M(θ))‖₂ ≤ β · ‖θ − θ^*‖₂, so that α‖M(θ) − θ^*‖₂ ≤ β‖θ − θ^*‖₂. Since M(θ^*) = θ^* by self-consistency, this proves the result.
Gaussian Mixture Models To analyze the LSC and FOS properties, we need to look at the population version of the Q-function. Given the point-wise Q-function derivation in § 5.6, we get, for M = (µ^0, µ^1),
$$Q(M \mid M^*) = -\frac{1}{2}\,\mathbb{E}_{y \sim f_{M^*}}\left[w^0(y)\left\|y - \mu^0\right\|_2^2 + w^1(y)\left\|y - \mu^1\right\|_2^2\right],$$
where w^z(y) = e^{−‖y−µ^{*,z}‖₂²/2} (e^{−‖y−µ^{*,0}‖₂²/2} + e^{−‖y−µ^{*,1}‖₂²/2})^{−1} for z = 0, 1. It can be seen that the function q_{M^*}(·) satisfies ∇²q_{M^*}(·) ⪯ −w · I, where w = min{E[w^0(y)], E[w^1(y)]} > 0, and hence this problem satisfies the (∞, w)-LSC property, i.e., it is globally strongly concave.
Establishing the FOS property is more involved. However, it can be shown that the problem does satisfy the (r, γ)-FOS property with r = Ω(‖M^*‖₂) and γ = exp(−Ω(‖M^*‖₂²)) under suitable conditions.
Mixed Regression We again use the point-wise Q-function construction to construct the population Q-function. For any W = (w^0, w^1),
$$Q(W \mid W^*) = -\frac{1}{2}\,\mathbb{E}\left[\alpha^0_{(x,y)}\,(y - x^\top w^0)^2 + \alpha^1_{(x,y)}\,(y - x^\top w^1)^2\right],$$
where α^z_{(x,y)} = e^{−(y−x^⊤w^{*,z})²/2} (e^{−(y−x^⊤w^{*,0})²/2} + e^{−(y−x^⊤w^{*,1})²/2})^{−1}. Assuming E[xx^⊤] = I for the sake of simplicity, we get ∇²q_{W^*}(·) ⪯ −α · I, where α = min{λ_min(E[α^0_{(x,y)} · xx^⊤]), λ_min(E[α^1_{(x,y)} · xx^⊤])}, i.e., the problem satisfies the (∞, α)-LSC property; in other words, it is globally strongly concave. Establishing the FOS property is more involved, but the problem does satisfy the (r, γ)-FOS property with r = Ω(‖W^*‖₂) and γ = Ω(1) under suitable conditions.
5.10 Exercises
Exercise 5.1. Show that for Gaussian mixture models with a balanced isotropic mixture, the AM-LVM algorithm (Algorithm 4) exactly recovers Lloyd's algorithm for k-means clustering. Note that AM-LVM in this case prescribes setting w_t^0(y) = 1 if ‖y − µ^{t,0}‖₂ ≤ ‖y − µ^{t,1}‖₂ and 0 otherwise, and also setting w_t^1(y) = 1 − w_t^0(y).
Exercise 5.2. Show that, in expectation, the stochastic EM update rule is equivalent to the sample E-step and the gradient M-step, i.e., E[M^sto(θ^t, Q_t) | θ^t] = M^grad(θ^t, Q_t^sam(· | θ^t)).
Exercise 5.3. Derive the fully corrective and gradient M-steps in the Gaussian mixture modeling problem. Show that they have closed forms.
Exercise 5.4. Derive the E and M-step constructions for the mixed regression problem with
fair Bernoulli trials.
Exercise 5.5. Prove Lemma 5.2.
Exercise 5.6. Let θ̂ be a population likelihood maximizer, i.e.,

    θ̂ ∈ arg max_{θ∈Θ} E_{y∼f(y|θ*)}[f(y | θ)].

Then show that θ̂ ∈ arg max_{θ∈Θ} q_{θ̂}(θ). Hint: One way to show this result is to use Theorem 5.1 and Lemma 5.2.
Exercise 5.7. Show that a function f (whether convex or not) that has L-Lipschitz gradients is necessarily L-strongly smooth. Also show that for any given L > 0, there exist functions that are L-strongly smooth but do not have L-Lipschitz gradients.
Hint: Use the fundamental theorem for calculus for line integrals for the first part. For the second
part try using a quadratic function.
Exercise 5.8. For any statistical estimation problem with population likelihood maximizer θ* that satisfies the LSC and LSS properties with appropriate constants, show that parameters close to θ* are approximate population likelihood maximizers themselves. Show this by finding constants ε₀, D > 0 (that may depend on the LSC, LSS constants) such that for any θ, if ‖θ − θ*‖₂ ≤ ε < ε₀, then

    E_{y∼f(y|θ*)}[f(y | θ)] ≥ E_{y∼f(y|θ*)}[f(y | θ*)] − D·ε.
Chapter 6
Stochastic Optimization Techniques
In previous sections, we have looked at specific instances of optimization problems with non-
convex objective functions. In § 4 we looked at problems in which the objective can be expressed
as a function of two variables, whereas in § 5 we looked at objective functions with latent
variables that arose in probabilistic settings. In this section, we will look at the problem of
optimization with non-convex objectives in a more general setting.
Several machine learning and signal processing applications, such as deep learning and topic modeling, generate optimization problems that have non-convex objective functions. The global optimization of non-convex objectives, i.e., finding the global optimum of the objective function, is an NP-hard problem in general. Even the seemingly simple problem of minimizing quadratic functions of the kind x^⊤Ax over convex constraint sets becomes NP-hard the moment the matrix A is allowed to have even one negative eigenvalue.
As a result, a much sought after goal in applications with non-convex objectives is to find a
local minimum of the objective function. The main hurdle in achieving local optimality is the
presence of saddle points which can mislead optimization methods such as gradient descent by
stalling their progress. Saddle points are best avoided as they signal inflection in the objective
surface and unlike local optima, they need not optimize the objective function in any meaningful
way.
The recent years have seen much interest in this problem, particularly with the advent of
deep learning where folklore wisdom tells us that in the presence of sufficient data, even locally
optimal solutions to the problem of learning the edge weights of a network perform quite well
[Choromanska et al., 2015]. In such settings, techniques such as convex relaxations, as well as the non-convex optimization techniques we have studied so far, such as EM, gAM, and gPGD, do not apply directly; one has to attempt optimizing the non-convex objective directly.
The problem of avoiding or escaping saddle points is actually quite challenging in itself
given the wide variety of configurations saddle points can appear in, especially in high dimen-
sional problems. It should be noted that there exist saddle configurations, bypassing which is
intractable in itself. For such cases, even finding locally optimal solutions is an NP-hard problem
[Anandkumar and Ge, 2016].
In our discussion, we will look at recent results which show that if the function being optimized
possesses certain nice structural properties, then an application of very intuitive algorithmic
techniques can guarantee local optimality of the solutions. This will be yet another instance of
a result where the presence of a structural property (such as RSC/RSS, MSC/MSS, or LSC/LSS
as studied in previous sections) makes the problem well behaved and allows efficient algorithms
to offer provably good solutions. Our discussion will be largely aligned with the work of Ge et al. [2015]. The bibliographic notes will point to other works.
6.1 Motivating Applications
A wide range of problems in machine learning and signal processing generate optimization
problems with non-convex objectives. Of particular interest to us would be the problem of
Orthogonal Tensor Decomposition. This problem has been shown to be especially useful in
modeling a large variety of learning problems including training deep Recurrent Neural Networks
[Sedghi and Anandkumar, 2016], Topic Modeling, learning Gaussian Mixture Models and Hidden
Markov Models, Independent Component Analysis [Anandkumar et al., 2014], and reinforcement
learning [Azizzadenesheli et al., 2016].
The details of how these machine learning problems can be reduced to tensor decomposition problems would distract us from our main objective. To keep the discussion focused and brief, we request the reader to refer to these papers for the reductions to tensor decomposition. We ourselves will be most interested in the problem of tensor decomposition itself.
We will restrict our study to 4th-order tensors, which can be interpreted as 4-dimensional arrays. Tensors are easily constructed using outer products, also known as tensor products. An outer product of 2nd order produces a 2nd-order tensor, which is nothing but a matrix. For any u, v ∈ Rp, their outer product is defined as u ⊗ v := uv^⊤ ∈ Rp×p, a p × p matrix whose (i, j)-th entry is uᵢvⱼ.
We can similarly construct a 4th-order tensor. For any u, v, w, x ∈ Rp, let T = u ⊗ v ⊗ w ⊗ x ∈ Rp×p×p×p. The (i, j, k, l)-th entry of this tensor, for any i, j, k, l ∈ [p], will be T_{i,j,k,l} = uᵢ·vⱼ·wₖ·xₗ. The set of 4th-order tensors is closed under addition and scalar multiplication. We will study a special class of 4th-order tensors known as orthonormal tensors which have an orthonormal decomposition as follows:

    T = Σ_{i=1}^r uᵢ ⊗ uᵢ ⊗ uᵢ ⊗ uᵢ,

where the vectors uᵢ are orthonormal components of the tensor T, i.e., uᵢ^⊤uⱼ = 0 if i ≠ j and ‖uᵢ‖₂ = 1. The above tensor is said to have rank r since it has r components in its decomposition. If an orthonormal decomposition of a tensor exists, it can be shown to be unique.
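As a quick illustration (our own, using NumPy's einsum rather than anything prescribed by the text), outer products and entries of a 4th-order tensor can be computed as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
u, v, w, x = rng.standard_normal((4, 5))       # four vectors in R^5
M = np.einsum('i,j->ij', u, v)                 # 2nd-order: (u ⊗ v)_{ij} = u_i v_j
T = np.einsum('i,j,k,l->ijkl', u, v, w, x)     # 4th-order: T_{ijkl} = u_i v_j w_k x_l
print(np.allclose(M, np.outer(u, v)))          # True
print(np.isclose(T[1, 2, 3, 4], u[1] * v[2] * w[3] * x[4]))  # True
```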
Just as a matrix A ∈ Rp×p defines a bilinear form A : (x, y) ↦ x^⊤Ay, a tensor similarly defines a multilinear form. For orthonormal tensors, the multilinear form has a simple expression. In particular, if T has the orthonormal form described above, we have

    T(x, y, z, w) = Σ_{i=1}^r (uᵢ^⊤x)(uᵢ^⊤y)(uᵢ^⊤z)(uᵢ^⊤w).
Figure 6.1: The function on the left, f(x) = x⁴ − 4x² + 4, has two global minima {−√2, √2} separated by a local maximum at 0. Using this function, we construct on the right a higher-dimensional function g(x, y) = f(x) + f(y) + 8 which now has 4 global minima separated by 4 saddle points. The number of such minima and saddle points can explode exponentially in learning problems with symmetry (indeed g(x, y, z) = f(x) + f(y) + f(z) + 12 has 8 local minima and saddle points). Plot on the right courtesy academo.org.
Given a 4th-order orthonormal tensor T, the problem of recovering its components can be cast as the following optimization problem:

    max T(u, u, u, u) = Σ_{i=1}^r (uᵢ^⊤u)⁴
    s.t. ‖u‖₂ = 1. (LRTD)

This is the non-convex optimization problem (see Exercise 6.1) that we will explore in more detail. We will revisit this problem later after looking at some techniques to optimize non-convex objectives.
The simplest approach to unconstrained optimization is gradient descent, which performs updates of the form

    x^{t+1} = xᵗ − ηₜ·∇f(xᵗ).

Now, it can be shown (see Exercises 6.2 and 6.3) that this procedure is guaranteed to make progress at every time step, provided the function f is strongly smooth and the step length is small enough. However, the procedure stalls at stationary points where the gradient of the function vanishes, i.e., ∇f(x) = 0. This includes local optima, which are of interest, and saddle points, which simply stall descent algorithms.
One way to distinguish saddle points from local optima is the second derivative test. The Hessian of a doubly differentiable function has only positive eigenvalues at local minima and only negative ones at local maxima. Saddles, on the other hand, are unpredictable. The simple saddles which we shall study here reveal themselves by having both positive and negative eigenvalues in the Hessian. The bibliographic notes discuss more complex saddle structures.
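The test is easy to carry out numerically. The following sketch (ours) classifies the stationary points of the function g(x, y) = f(x) + f(y) + 8 from Figure 6.1, whose Hessian is diagonal:

```python
import numpy as np

# Second derivative test for g(x, y) = f(x) + f(y) + 8, f(t) = t**4 - 4*t**2 + 4.
# The Hessian of g is diag(f''(x), f''(y)) with f''(t) = 12*t**2 - 8.
def hessian_eigenvalues(x, y):
    return np.array([12 * x ** 2 - 8, 12 * y ** 2 - 8])

r = np.sqrt(2)
for point in [(r, r), (0.0, r), (0.0, 0.0)]:
    eigs = hessian_eigenvalues(*point)
    kind = ("local minimum" if (eigs > 0).all() else
            "local maximum" if (eigs < 0).all() else "saddle point")
    print(point, eigs, kind)  # (r,r): minimum, (0,r): saddle, (0,0): maximum
```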
The origins of saddle points are quite intriguing too. Figure 6.1 shows how saddles may emerge and how their numbers can increase exponentially with increasing dimensionality. Consider the tensor decomposition problem in (LRTD). It can be shown (see Exercise 6.4) that all the r components are optimal solutions to this problem. Thus, the problem possesses a beautiful symmetry
which allows us to recover the components in any order we like. However, it is also easy to show (see Exercise 6.5) that general convex combinations of the components are not optimal solutions to this problem.
Thus, we automatically obtain r isolated optima spread out in space, interspersed with saddle
points.
The applications we discussed, such as Gaussian mixture models, also have such an internal
symmetry – the optimum is unique only up to permutation. Indeed, it does not matter in which
order we recover the components of a mixture model, so long as we recover all of them. However, this very symmetry gives rise to saddle points [Ge et al., 2015], since a convex combination of two permutations of the optimal solution is in general not itself an optimal solution. This gives us, in general, an exponential number of optima, separated by (exponentially many) saddle points.
Before moving forward, we remind the reader that techniques we have studied so far for
non-convex optimization, namely EM, gAM, and gPGD are far too specific to be applied to
non-convex objectives in general, and to the problems we encounter with tensor decomposition
in particular. We need more generic solutions for the task of local optimization of non-convex
objectives.
The following strict saddle property formalizes the requirements we have discussed in a more
robust manner.
Definition 6.1 (Strict Saddle Property [Ge et al., 2015]). A twice differentiable function f(x) is said to satisfy the (α, γ, κ, ξ)-strict saddle (SSa) property if, for every local minimum x* of the function, the function is α-strongly convex in the region B₂(x*, 2ξ), and moreover, every point x⁰ ∈ Rp satisfies at least one of the following properties:

1. (Non-stationary Point) ‖∇f(x⁰)‖₂ ≥ κ;
2. (Saddle Point) λ_min(∇²f(x⁰)) ≤ −γ;
3. (Approximate Local Minimum) ‖x⁰ − x*‖₂ ≤ ξ for some local minimum x*.
Figure 6.2: The function f(x, y) = x² − y² exhibits a saddle at the origin (0, 0). The Hessian of this function at the origin is diag(2, −2) and since λ_min(∇²f((0, 0))) = −2, the saddle satisfies the strict-saddle property. Indeed, the saddle offers a prominent descent path along the y axis which can be used to escape the saddle point. In § 6.5 we will see how the NGD algorithm is able to provably escape this saddle point. Plot courtesy academo.org.
The above property places quite a few restrictions on the function. The function must be strongly convex in the neighborhood of every local optimum, and every point that is not an approximate local minimum must offer a direction of steep descent. The point may do this by having a steep gradient (case 1), or else (if the point is a saddle point) by having its Hessian offer an eigenvector with a large negative eigenvalue, which then yields a steep descent direction (case 2). We shall later see that there exist interesting applications that do satisfy this property.
Such an unbiased estimate gt is often called a stochastic gradient and is widely used in
machine learning and optimization. Recall that even the EM algorithm studied in § 5 had a
Algorithm 6 Noisy Gradient Descent (NGD)
Input: Objective f, max step length η_max, tolerance ε
Output: A locally optimal point x̂ ∈ Rp
1: x¹ ← INITIALIZE()
2: Set T ← 1/η², where η = min{ε²/log²(1/ε), η_max}
3: for t = 1, 2, . . . , T do
4: Sample perturbation ζᵗ ∼ S^{p−1} //Random point on the unit sphere
5: gᵗ ← ∇f(xᵗ) + ζᵗ
6: x^{t+1} ← xᵗ − η·gᵗ
7: end for
8: return x^T
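A minimal sketch of Algorithm 6 in code, under the assumption that a gradient oracle grad is available, might look as follows (the function name and default parameters are ours):

```python
import numpy as np

def ngd(grad, x0, eps=0.25, eta_max=0.01, rng=None):
    """A minimal sketch of Noisy Gradient Descent (Algorithm 6)."""
    rng = rng or np.random.default_rng(0)
    eta = min(eps ** 2 / np.log(1.0 / eps) ** 2, eta_max)
    T = int(np.ceil(1.0 / eta ** 2))
    x = np.array(x0, dtype=float)
    for _ in range(T):
        zeta = rng.standard_normal(x.shape)
        zeta /= np.linalg.norm(zeta)        # uniform direction on the unit sphere
        x -= eta * (grad(x) + zeta)         # noisy gradient step
    return x
```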
variant that used stochastic gradients. In several machine learning applications, the objective can be written as a finite sum f(x) = (1/n)·Σ_{i=1}^n f(x; θᵢ), where θᵢ may denote the i-th data point. This allows us to construct a stochastic gradient estimate in an even more inexpensive manner. At each time step, simply sample a data point Iₜ ∼ Unif([n]) and let

    gᵗ = ∇f(xᵗ; θ_{Iₜ}) + ζᵗ.

Note that we still have E[gᵗ | xᵗ] = ∇f(xᵗ) but with a much cheaper construction for gᵗ. However, in order to simplify the discussion, we will continue to work with the setting gᵗ = ∇f(xᵗ) + ζᵗ.
We note that we have set the step length to be around 1/√T, where T is the total number of iterations for which we will execute the NGD algorithm. This is similar to what we used in the projected gradient descent approach (see Algorithm 1 and Theorem 2.5). Although in practice one may set the step length to ηₜ ≈ 1/√t here as well, the analysis becomes more involved.
The convergence analysis of NGD distinguishes three kinds of iterates:

1. Non-stationary points, i.e., points where the gradient is "large" enough: in this case, standard (stochastic) gradient descent is powerful enough to ensure a large enough decrease in the objective function value in a single step (see Exercise 6.2);
2. Saddle points, i.e., points where the gradient is close to 0. Here, the SSa property ensures that at least one highly negative Hessian direction exists: in this case, traditional (stochastic) gradient descent may fail, but the added noise ensures an escape from the saddle point with high probability;

3. Local minima, i.e., points where the gradient is close to 0 but which have a positive definite Hessian due to strong convexity: in this case, standard (stochastic) gradient descent by itself would converge to the corresponding local minimum.
The above three regimes will be formally studied below in three separate lemmata. Note
that the analysis for non-stationary points as well as for points near local minima is similar to
the standard stochastic gradient descent analysis for convex functions. However, the analysis
for saddle points is quite interesting and shows that the added random noise ensures an escape
from the saddle point.
To further understand the inner workings of the NGD algorithm, let us perform a warm-up exercise by showing that the NGD algorithm will, with high probability, escape the saddle point in the function f(x, y) = x² − y² that we considered in Figure 6.2.
Proof. For an illustration, see Figure 6.2. Note that the function f(x, y) = x² − y² has trivial minima at the limiting points (0, ±∞), where the function value approaches −∞. Thus, the statement of the theorem claims that NGD approaches the "minimum" function value.
In any case, we are interested in showing that NGD escapes the saddle point (0, 0). The gradient of f is 0 at the origin (0, 0). Thus, if a gradient descent procedure is initialized at the origin for this function, it will remain stuck there forever, making no non-trivial updates.
The NGD algorithm, on the other hand, when initialized at the saddle point (0, 0), can be shown to reach, after t iterations, the point (xᵗ, yᵗ) where xᵗ = Σ_{τ=0}^{t−1}(1 − η)^{t−τ−1}·ζ₁^τ and yᵗ = Σ_{τ=0}^{t−1}(1 + η)^{t−τ−1}·ζ₂^τ, where (ζ₁^τ, ζ₂^τ) ∈ R² is the noise vector added to the gradient at each step. Since η < 1, as t → ∞, it is easy to see that with high probability we have xᵗ → 0 while |yᵗ| → ∞, which indicates a successful escape from the saddle point, as well as progress towards the global optima.
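This escape is easy to reproduce numerically. The following sketch (ours) runs the NGD update on f(x, y) = x² − y² starting exactly at the saddle:

```python
import numpy as np

# NGD on f(x, y) = x**2 - y**2 initialized exactly at the saddle (0, 0).
rng = np.random.default_rng(1)
eta = 0.01
pt = np.zeros(2)
for t in range(500):
    zeta = rng.standard_normal(2)
    zeta /= np.linalg.norm(zeta)
    grad = np.array([2 * pt[0], -2 * pt[1]])  # vanishes at the origin: plain GD stays stuck
    pt -= eta * (grad + zeta)
print(pt)  # the x coordinate stays near 0 while |y| grows steadily
```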
We now formalize the intuitions developed above. The following lemma shows that even if we are at a saddle point, NGD will still ensure a large drop in function value in not too many steps; the statement below paraphrases the corresponding result of Ge et al. [2015]. The proof of this result is a bit subtle and we will just provide a sketch.

Lemma 6.2. If NGD is executed on a function f that is β-strongly smooth, has ρ-Lipschitz Hessians, and satisfies the (α, γ, κ, ξ)-SSa property, with a small enough step length η, then for any iterate xᵗ that satisfies ‖∇f(xᵗ)‖₂ ≤ √(ηβ) and λ_min(∇²f(xᵗ)) ≤ −γ, NGD ensures, after s = (3 log p)/(ηγ) further steps, that E[f(x^{t+s}) | xᵗ] ≤ f(xᵗ) − Ω(η).
Proof. For the sake of notational simplicity, let t = 0. The overall idea of the proof is to rely on
that one large negative eigendirection in the Hessian to induce a drop in function value. The
hope is that random fluctuations will eventually nudge NGD in the steep descent direction and
upon discovery, the larger and larger gradient values will accumulate to let the NGD procedure
escape the saddle.
Since the effects of the Hessian are most apparent in the second-order Taylor expansion, we will consider the following function

    f̂(x) = f(x⁰) + ⟨∇f(x⁰), x − x⁰⟩ + (1/2)·(x − x⁰)^⊤H(x − x⁰),

where H = ∇²f(x⁰) ∈ Rp×p. The proof will proceed by first imagining that NGD was executed on the function f̂(·) instead, showing that the function value indeed drops, and then finishing off by showing that things do not change too much if NGD is executed on the function f(·) instead.
Note that to make this claim, we will need Hessians to vary smoothly which is why we assumed
the Lipschitz Hessian condition. This seems to be a requirement for several follow-up results as
well [Jin et al., 2017, Agarwal et al., 2017]. An interesting and challenging open problem is to
obtain similar results for non-convex optimization without the Lipschitz-Hessian assumption.
We will let xᵗ, t ≥ 1, denote the iterates of NGD when executed on f(·), and x̂ᵗ denote the iterates of NGD when executed on f̂(·). We will fix x̂⁰ = x⁰. Using some careful calculations, we can get

    x̂ᵗ − x̂⁰ = −η·Σ_{τ=0}^{t−1}(I − ηH)^τ ∇f(x⁰) − η·Σ_{τ=0}^{t−1}(I − ηH)^{t−τ−1} ζ^τ.
We note that both terms in the right hand expression above correspond to "small" vectors. The first term is small as we know that x̂⁰ satisfies ‖∇f(x̂⁰)‖₂ ≤ √(ηβ) by virtue of being close to a stationary point, as is assumed in the statement of this result. The second term is small as the ζ^τ are random unit vectors with expectation 0. Using these intuitions, we will first show that f̂(x̂ᵗ) is significantly smaller than f̂(x⁰) = f(x⁰) after sufficiently many steps. Then, using the property of Lipschitz Hessians, we will obtain a descent guarantee for f(xᵗ).
Now, notice that NGD chooses the noise vectors ζ^τ, for any τ ≥ 0, independently of x⁰ and H. Moreover, for any two τ ≠ τ′, the vectors ζ^τ and ζ^{τ′} are also chosen independently. We also know that the noise vectors are isotropic, i.e., E[ζ^τ(ζ^τ)^⊤] = I_p. These observations and some straightforward calculations give us the following upper bound on the suboptimality of x̂^T with respect to the Taylor approximation f̂. As we have noted, this upper bound can be converted to an upper bound on the suboptimality of x^T with respect to the actual function f with some more effort.

    E[f̂(x̂^T) − f̂(x̂⁰)]
    = −η·∇f(x⁰)^⊤ (Σ_{τ=0}^{t−1}(I_p − ηH)^τ) ∇f(x⁰)
      + (η²/2)·∇f(x⁰)^⊤ (Σ_{τ=0}^{t−1}(I_p − ηH)^τ) H (Σ_{τ=0}^{t−1}(I_p − ηH)^τ) ∇f(x⁰)
      + (η²/2)·tr(Σ_{τ=0}^{t−1}(I_p − ηH)^{2τ} H)
    = −η·∇f(x⁰)^⊤ B ∇f(x⁰) + (η²/2)·tr(Σ_{τ=0}^{t−1}(I_p − ηH)^{2τ} H),

where tr(·) is the trace operator and B = Σ_{τ=0}^{t−1}(I_p − ηH)^τ − (η/2)·(Σ_{τ=0}^{t−1}(I_p − ηH)^τ) H (Σ_{τ=0}^{t−1}(I_p − ηH)^τ). It is easy to verify that B ⪰ 0 for all step lengths η ≤ 1/‖H‖₂, i.e., all η ≤ 1/β.
For any value of t ≥ (3 log p)/(ηγ) (which is the setting for the parameter s in the statement of the lemma), the second term in the final expression above can be simplified to give us

    (η²/2)·tr(Σ_{τ=0}^{t−1}(I − ηH)^{2τ} H) = (η²/2)·Σ_{i=1}^p λᵢ·(Σ_{τ=0}^{t−1}(1 − ηλᵢ)^{2τ}) ≤ −η/2,
where λᵢ is the i-th eigenvalue of H, by using λ_p ≤ −γ. This gives us

    E[f̂(x̂^T)] − f(x⁰) ≤ −η/2.
Note that the above equation only shows descent for f̂(x̂^T). One can now show [Ge et al., 2015, Lemma 19] that the iterates obtained by NGD on f̂(·) do not deviate too far from those obtained on the actual function f(·), using the Lipschitz-Hessian property. The proof is concluded by combining these two results.
Lemma 6.3. If NGD is executed on a function that is β-strongly smooth and satisfies the (α, γ, κ, ξ)-SSa property, with step length η ≤ (1/β)·min{1, κ²}, then for any iterate xᵗ that satisfies ‖∇f(xᵗ)‖₂ ≥ √(ηβ), NGD ensures that

    E[f(x^{t+1}) | xᵗ] ≤ f(xᵗ) − (β/2)·η².
Proof. This is the most carefree case, as we are neither close to any local optimum nor to a saddle point. Unsurprisingly, the proof of this lemma follows from standard arguments, as we have assumed the function to be strongly smooth. Since x^{t+1} = xᵗ − η·(∇f(xᵗ) + ζᵗ), we have, by an application of the strong smoothness property (see Definition 2.4),

    f(x^{t+1}) ≤ f(xᵗ) − ⟨∇f(xᵗ), η·(∇f(xᵗ) + ζᵗ)⟩ + (βη²/2)·‖∇f(xᵗ) + ζᵗ‖₂².

Taking expectations with respect to ζᵗ, which has mean 0 and unit length, gives us

    E[f(x^{t+1}) | xᵗ] ≤ f(xᵗ) − η·(1 − βη/2)·‖∇f(xᵗ)‖₂² + βη²/2.

Since we have η ≤ (1/β)·min{1, κ²} by assumption and ‖∇f(xᵗ)‖₂ ≥ √(ηβ), we get

    E[f(x^{t+1}) | xᵗ] ≤ f(xᵗ) − (β/2)·η²,

which proves the result.
The final intermediate result is an entrapment lemma. It shows that once NGD gets sufficiently close to a local optimum, it gets trapped there for a really long time. Although the function f satisfies strong convexity and smoothness properties in the neighborhood B₂(x*, 2ξ), the proof of this result is still non-trivial due to the perturbations ζᵗ. Had the perturbations not been there, we could have utilized the analysis of the PGD algorithm (see Exercise 6.3) to show that we would converge to the local optimum x* at a linear rate.
The problem is that the perturbations do not diminish – we always have ‖ζᵗ‖₂ = 1. This prevents us from ever converging to the local optimum. Moreover, a sequence of unfortunate perturbations may get us kicked out of this nice neighborhood. The next result shows that we will not get kicked out of the neighborhood of x* for a really long time.
Lemma 6.4. If NGD is executed on a function that is β-strongly smooth and satisfies the (α, γ, κ, ξ)-SSa property, with step length η ≤ min{α/β², ξ²·log⁻¹(1/(δξ))} for some δ > 0, then if some iterate xᵗ satisfies ‖xᵗ − x*‖₂ ≤ ξ, NGD ensures that with probability at least 1 − δ, for all s ∈ [t, t + (1/η²)·log(1/δ)], we have

    ‖x^s − x*‖₂ ≤ √(η·log(1/(ηδ))) ≤ ξ.
Proof. Using strong convexity in the neighborhood of x*, and the fact that x*, being a local minimum, satisfies ∇f(x*) = 0, gives us

    f(xᵗ) ≥ f(x*) + (α/2)·‖xᵗ − x*‖₂²
    f(x*) ≥ f(xᵗ) + ⟨∇f(xᵗ), x* − xᵗ⟩ + (α/2)·‖xᵗ − x*‖₂²,

adding which gives ⟨∇f(xᵗ), xᵗ − x*⟩ ≥ α·‖xᵗ − x*‖₂². Since f is β-smooth, using the co-coercivity of the gradient for smooth convex functions (recall that f is strongly convex in this neighborhood of x*), we conclude that f has β-Lipschitz gradients, which gives us

    ‖∇f(xᵗ)‖₂ = ‖∇f(xᵗ) − ∇f(x*)‖₂ ≤ β·‖xᵗ − x*‖₂.

Putting these together, we get

    E[‖x^{t+1} − x*‖₂² | xᵗ] = E[‖xᵗ − η(∇f(xᵗ) + ζᵗ) − x*‖₂² | xᵗ]
    = ‖xᵗ − x*‖₂² − 2η·⟨∇f(xᵗ), xᵗ − x*⟩ + η²·‖∇f(xᵗ)‖₂² + η²
    ≤ (1 − 2ηα + η²β²)·‖xᵗ − x*‖₂² + η²
    ≤ (1 − ηα)·‖xᵗ − x*‖₂² + η²,

where the last step uses η ≤ α/β². Unrolling this recursion, and applying standard martingale concentration arguments to control the fluctuations due to the perturbations, yields the claimed bound on ‖x^s − x*‖₂.
Notice that the above result traps the iterates within a ball of radius ξ around the local minimum x*. Also notice that all points that are approximate local minima satisfy the preconditions of this lemma due to the SSa property, and consequently NGD gets trapped near such points.
We now present the final convergence guarantee for NGD.
Theorem 6.5. For any ε, δ > 0, suppose NGD is executed on a function that is β-strongly smooth, has ρ-Lipschitz Hessians, and satisfies the (α, γ, κ, ξ)-SSa property, with a step length η < η_max = min{ε²/log(1/δ), α/β², ξ²/log(1/(ξδ)), κ²/(β + ρ)², 1/β}. Then, with probability at least 1 − δ, after T ≥ (log p/η²)·log(2/δ) iterations, NGD produces an iterate x^T that is ε-close to some local optimum x*, i.e., ‖x^T − x*‖₂ ≤ ε.
Proof. We partition the space Rp into 3 regions:

1. R₁ = {x : ‖∇f(x)‖₂ ≥ √(ηβ)},
2. R₂ = {x : ‖∇f(x)‖₂ < √(ηβ), λ_min(∇²f(x)) ≤ −γ},
3. R₃ = Rp \ (R₁ ∪ R₂).

Since √(ηβ) ≤ κ due to the setting of η_max, the region R₁ contains all points considered non-stationary by the SSa property (Definition 6.1), and possibly some other points as well. For this reason, region R₂ can be shown to contain only saddle points. Since the SSa property assures us that a point that is neither non-stationary nor a saddle point is definitely an approximate local minimum, we deduce that the region R₃ contains only approximate local minima.
The proof will use the following line of argument: since Lemmata 6.3 and 6.2 assure us that
whenever the NGD procedure is in regions R1 or R2 there is a large drop in function value, we
should expect the procedure to enter region R3 sooner or later, since the function value cannot
go on decreasing indefinitely. However, Lemma 6.4 shows that once we are in region R3 , we
are trapped there. In the following analysis, we will ignore all non-essential constants and log
factors.
Recall that we let the NGD procedure last for T = (1/η²)·log(2/δ) steps. Below we will show that in any sequence of 1/η² steps, there is at least a 1/2 chance of encountering an iterate xᵗ ∈ R₃. Since the entire procedure lasts log(2/δ) such sequences, we will conclude, by a union bound, that with probability at least 1 − δ/2, we will encounter at least one iterate in the region R₃ in the T steps we execute.
However, Lemma 6.4 shows that once we enter the R₃ neighborhood, with probability at least 1 − δ/2, we are trapped there for at least T steps. Applying the union bound will establish that with probability at least 1 − δ, the NGD procedure will output x^T ∈ R₃. Since we set η ≤ ε²/log(1/δ), this will conclude the proof.
We are now left with proving that in every sequence of 1/η² steps, there is at least a 1/2 chance of NGD encountering an iterate xᵗ ∈ R₃. To do so, we set up the notion of epochs. These will basically correspond to the amount of time taken by NGD to reduce the function value by a significant amount. The first epoch starts at time τ₁ = 0. Subsequently, we define

    τ_{i+1} = τᵢ + 1 if x^{τᵢ} ∈ R₁ ∪ R₃,   and   τ_{i+1} = τᵢ + 1/η if x^{τᵢ} ∈ R₂.
Ignoring constants and other non-essential factors, we can rewrite the results of Lemmata 6.3 and 6.2 as follows: whenever x^{τᵢ} ∉ R₃, the function value drops by at least η²·(τ_{i+1} − τᵢ) in expectation over the epoch, i.e., E[f(x^{τ_{i+1}}) | x^{τᵢ}] ≤ f(x^{τᵢ}) − η²·(τ_{i+1} − τᵢ). Define the event Eₜ := {∄ j ≤ t : x^j ∈ R₃}, and let 1_E denote the indicator variable for an event E.
Using the rewritten results, one can show that for every epoch i,

    E[f(x^{τ_{i+1}})·1_{E_{τ_{i+1}}}] − E[f(x^{τᵢ})·1_{E_{τᵢ}}] ≤ −η²·E[(τ_{i+1} − τᵢ)·1_{E_{τᵢ}}],

where we have used the fact that |f(x)| ≤ B for all x ∈ Rp. Since E_{t+1} ⇒ E_t, we have P[E_{t+1}] ≤ P[E_t]. Summing the expressions above from i = 1 to j and using x¹ ∉ R₃ gives us

    E[f(x^{τ_{j+1}})·1_{E_{τ_{j+1}}}] − f(x¹) ≤ −η²·E[τ_{j+1}]·P[E_{τ_j}].

Since the left hand side is bounded below by −2B, we must have E[τ_{j+1}]·P[E_{τ_j}] ≤ 2B/η²: NGD cannot avoid region R₃ for much longer than 1/η² steps, except with constant probability. This establishes the claim and concludes the proof.
We have not presented exact constants in the results to avoid clutter. The NGD algorithm actually requires the step length to be set to η < η_max ≤ 1/p, where p is the ambient dimensionality. Now, since NGD is run for Ω(1/η²) iterations and each iteration takes O(p) time to execute, the total run-time of NGD is O(p³), which can be prohibitive. The bibliographic notes discuss more recent results that offer run-times that are linear in p. Also note that the NGD procedure requires Õ(1/ε⁴) iterations to converge to within an ε distance of a local optimum.
6.6 Constrained Optimization with Non-Convex Objectives

So far we have looked at unconstrained non-convex problems. However, applications such as the tensor decomposition problem (LRTD) require optimizing a non-convex objective over a constraint set, i.e., solving problems of the form

    min_{x∈Rp} f(x) s.t. cᵢ(x) = 0, i = 1, . . . , m. (CNOPT)

The Projected Noisy Gradient Descent (PNGD) algorithm (Algorithm 7) extends NGD to this setting by projecting each iterate back onto the constraint set W := {x ∈ Rp : cᵢ(x) = 0, i ∈ [m]}.

Algorithm 7 Projected Noisy Gradient Descent (PNGD)
Input: Objective f, max step length η_max, tolerance ε
Output: A locally optimal point x̂ ∈ Rp
1: x¹ ← INITIALIZE()
2: Set T ← 1/η², where η = min{ε²/log²(1/ε), η_max}
3: for t = 1, 2, . . . , T do
4: Sample perturbation ζᵗ ∼ S^{p−1} //Random point on the unit sphere
5: gᵗ ← ∇f(xᵗ) + ζᵗ
6: x^{t+1} ← Π_W(xᵗ − η·gᵗ) //Project onto constraint set
7: end for
8: return x^T

Analyzing PNGD requires handling the constraints analytically.
A very common way to do so is to first construct the Lagrangian [Boyd and Vandenberghe, 2004, Chapter 5] of the problem, defined as

    L(x, λ) = f(x) − Σ_{i=1}^m λᵢ·cᵢ(x),

where the λᵢ are Lagrange multipliers. It is easy to verify that the solution to the problem (CNOPT) coincides with that of the following problem

    min_{x∈Rp} max_{λ∈Rm} L(x, λ).

Note that the above problem is unconstrained. For any x ∈ Rp, define λ*(x) := arg min_λ ‖∇f(x) − Σ_{i=1}^m λᵢ·∇cᵢ(x)‖₂ and L*(x) := L(x, λ*(x)). These quantities, along with the tangent space of the constraint set, are used to state the convergence guarantee for PNGD (Theorem 6.6), whose details we omit here.
6.7 Application to Orthogonal Tensor Decomposition
Recall that we reduced the tensor decomposition problem to solving the following optimization problem (LRTD):

    max T(u, u, u, u) = Σ_{i=1}^r (uᵢ^⊤u)⁴
    s.t. ‖u‖₂ = 1.
The above problem has internal symmetry with respect to permutation of the components, as well as sign flips (both uᵢ and −uᵢ are valid components), which gives rise to saddle points as we discussed earlier. Using some simple but slightly tedious calculations (see [Ge et al., 2015, Theorem 44, Lemmata 45, 46]) it is possible to show that the above optimization problem does satisfy the (3, 7/p, 1/p^c, 1/p^c)-SCSa property for some constant c > 0. All other requirements for Theorem 6.6 can also be shown to be satisfied.
It is easy to see that any solution to the problem must lie in the span of the components. Since the components uᵢ are orthonormal, this means that it suffices to look for solutions of the form u = Σ_{i=1}^r xᵢuᵢ. This gives us T(u, u, u, u) = Σ_{i=1}^r xᵢ⁴ and ‖u‖₂ = ‖x‖₂, where x = (x₁, . . . , x_r). The problem (LRTD) is thus equivalent to

    min −‖x‖₄⁴ s.t. ‖x‖₂ = 1.
Note that the above problem is non-convex, as the objective function is concave. However, it is also possible to show (see Exercises 6.4 and 6.5) that the only local minima of the optimization problem (LRTD) are ±uᵢ. This is most fortunate, since it shows that all local minima are actually global minima! Thus, the deflation strategy alluded to earlier can be successfully applied. We discover one component, say u₁, by applying the PNGD algorithm to (LRTD). Having recovered this component, we create a new tensor T′ = T − u₁ ⊗ u₁ ⊗ u₁ ⊗ u₁, apply the procedure again to discover a second component, and so on.
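The following toy sketch (ours; a simple noisy projected gradient ascent stands in for the full PNGD procedure) illustrates the deflation strategy on a synthetic orthonormal tensor:

```python
import numpy as np

def recover_component(T, p, steps=3000, eta=0.05, noise=0.01, rng=None):
    """Sketch: noisy projected gradient ascent on T(u,u,u,u) over the unit sphere."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(p)
    u /= np.linalg.norm(u)
    for _ in range(steps):
        g = 4 * np.einsum('ijkl,j,k,l->i', T, u, u, u)  # gradient of the multilinear form
        zeta = rng.standard_normal(p)
        zeta /= np.linalg.norm(zeta)
        u += eta * (g + noise * zeta)                    # ascent step with perturbation
        u /= np.linalg.norm(u)                           # project back onto the sphere
    return u

# Build T = sum_i u_i (x) u_i (x) u_i (x) u_i from random orthonormal components.
p, r = 8, 3
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((p, r)))
T = sum(np.einsum('i,j,k,l->ijkl', U[:, i], U[:, i], U[:, i], U[:, i]) for i in range(r))
for _ in range(r):
    u = recover_component(T, p, rng=rng)
    print(np.abs(U.T @ u).max())                         # close to 1: u aligns with some +/- u_i
    T = T - np.einsum('i,j,k,l->ijkl', u, u, u, u)       # deflate the found component
```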
The work of Ge et al. [2015] also discusses techniques to recover all r components simulta-
neously, tensors with different positive weights on the components, as well as tensors of other
orders.
6.8 Exercises
Exercise 6.1. Show that the optimization problem in the formulation (LRTD) is non-convex, in that it has a non-convex objective as well as a non-convex constraint set. Show that the constraint set may be convexified without changing the optimum of the problem.
Exercise 6.2. Consider a differentiable function f that is β-strongly smooth but possibly non-convex. Show that if gradient descent is performed on f with a static step length η ≤ 2/β, i.e.,

    x^{t+1} = xᵗ − η·∇f(xᵗ),

then the function value f will never increase across iterations, i.e., f(x^{t+1}) ≤ f(xᵗ). This shows that on smooth functions, gradient descent enjoys monotonic progress whenever the step length is small enough.
Hint: apply the SS property relating consecutive iterates.
Exercise 6.3. For the same setting as the previous problem, show that if we have η ∈ [1/(2β), 1/β] instead, then within T = O(1/ε²) steps, gradient descent is guaranteed to identify an ε-stationary point, i.e., for some t ≤ T, we must have ‖∇f(xᵗ)‖₂ ≤ ε.
Hint: first apply SS to show that f(xᵗ) − f(x^{t+1}) ≥ (1/(4β))·‖∇f(xᵗ)‖₂². Then use the fact that the total improvement in function value over T time steps cannot exceed f(x⁰) − f*.
Exercise 6.4. Show that every component vector uᵢ is a globally optimal point for the optimization problem in (LRTD).
Hint: Observe the reduction in § 6.7. Show that it is equivalent to finding the minimum L2 norm vector(s) among unit L1 norm vectors. Find the Lagrangian dual of this optimization problem and simplify.
Exercise 6.5. Given a rank-r orthonormal tensor T, construct a non-trivial convex combination of its components that has unit L2 norm but achieves a suboptimal objective value in the formulation (LRTD).
6.9 Bibliographic Notes

First Order Methods Given their efficiency, the question of whether first order methods can
offer local optimality has captured the interest of the field. In particular, much attention has
been paid to the standard gradient descent and its variants. The work of Lee et al. [2016] showed
that when initialized at a random point, gradient descent avoids saddle points almost surely,
although the work provides no definite rate of convergence in general.
The subsequent works of Sun et al. [2015], Ge et al. [2015], Anandkumar and Ge [2016]
introduced structural properties such as the strict saddle property, and demonstrated that crisp
convergence rates can be ensured for problems that do satisfy these structural properties. It is
notable that the work of Sun et al. [2015] reconsidered second order methods while Ge et al.
[2015] were able to show that noisy stochastic gradient descent itself suffices.
The technique of using randomly perturbed (stochastic) gradients to escape saddle points also receives attention in the more general framework of Langevin dynamics, which studies cases where the perturbations are non-isotropic or else are applied at a scale that adapts to the problem.
The recent work of Zhang et al. [2017] shows a powerful result that offers, for empirical risk
minimization problems that are ubiquitous in machine learning, a convergence guarantee that
ensures convergence to a local optimum of the population risk functional. This is a useful result
since a majority of works in literature focus on identifying local optima of the empirical risk
functional, which depend on the training data, but which may correspond to bad solutions with respect to the population risk.

Variance Reduction The work of Reddi et al. [2016] applies variance reduction techniques to offer faster convergence for non-convex objectives. Recall that in our analysis (see Theorem 6.5), it took NGD O(1/ε⁴) steps to reach a local optimum. However, two caveats exist: 1) the method proposed in [Reddi et al., 2016] only reaches a saddle point, and 2) it assumes a finite-sum objective, i.e., one that has a decomposable form.
Accelerated Optimization The works of Carmon et al. [2017], Agarwal et al. [2017] extend the work of Reddi et al. [2016] by offering faster than O(1/ε²) convergence to a stationary point for general (non finite-sum) non-convex objectives. However, it should be noted that these techniques assume a smooth objective, and convergence is guaranteed only to a saddle point, not a local minimum. Whereas the work of Carmon et al. [2017] invokes a variant of Nesterov's accelerated gradient technique to offer ε-convergence in O(1/ε^{7/4}) iterations, the work of Agarwal et al. [2017] employs a second-order method to offer ε-convergence, also in O(1/ε^{7/4}) iterations.
Escaping Higher-order Saddles In our discussion, we were occupied with the problem of avoiding getting stuck at simple saddle points, which readily reveal themselves by having distinctly positive and negative eigenvalues in the Hessian. However, there may exist more complex degenerate saddle points where the Hessian has only non-negative eigenvalues and which thus masquerade as local minima. Such configurations yield complex cases such as monkey saddles and connected saddles, which we did not address. The work of Anandkumar and Ge [2016] proposes a method based on the Cubic Regularization algorithm of Nesterov and Polyak [2006] which is able to escape some of these more complex saddle points and achieve convergence to a point that enjoys third order optimality.
Training Deep Networks Given the popularity of deep networks in several areas of learning
and signal processing, as well as the fact that the task of training deep networks corresponds
to non-convex optimization, a lot of recent efforts have focused on the problem of efficiently
and provably training deep networks using non-convex optimization techniques. Some notable
works include provable methods for training multi-layered perceptrons [Goel and Klivans, 2017,
Zhong et al., 2017, Li and Yuan, 2017] and special cases of convolutional networks known as non-overlapping convolutional networks [Brutzkus and Globerson, 2017]. Whereas the works
[Brutzkus and Globerson, 2017, Zhong et al., 2017, Li and Yuan, 2017] utilize gradient-descent
based techniques, Goel and Klivans [2017] uses an application of isotonic regression and kernel
techniques. The work of Li and Yuan [2017] shows that the inclusion of an identity map into
the network eases optimization by making the training problem well-posed.
Part III
Applications
Chapter 7
Sparse Recovery
In this section, we will take a look at sparse recovery and sparse linear regression as applications of non-convex optimization. These are extremely well-studied problems that find applications in several practical settings. This will be the first of four "application" sections where we apply non-convex optimization techniques to real-world problems.
Gene Expression Analysis The availability of DNA micro-array gene expression data makes
it possible to identify genetic explanations for a wide range of phenotypical traits such as
physiological properties or even disease progressions. In such data, we are given say, for n
human test subjects participating in the study, the expression levels of a large number p of
genes (encoded as a real vector xi ∈ Rp ), and the corresponding phenotypical trait yi ∈ R.
Figure 7.1 depicts this for a hypothetical study on Type-I diabetes. For the sake of simplicity,
we are considering cases where the phenotypical trait can be modeled as a real number –
this real number may indicate the severity of a condition or the level of some other biological
measurement. More expressive models exist in literature where the target phenotypical trait is
itself represented as a vector [see for example, Jain and Tewari, 2015].
For the sake of simplicity, we assume that the phenotypical response is linearly linked to the gene expression levels, i.e., for some w* ∈ Rp, we have yᵢ = xᵢ^⊤w* + ηᵢ where ηᵢ is some noise. The goal then is to use gene expression data to deduce an estimate for w*. Having access
to the model w∗ can be instrumental in discovering possible genetic bases for diseases, traits
etc. Consequently, this problem has significant implications for understanding physiology and
developing novel medical interventions to treat and prevent diseases/conditions.
However, the problem fails to reduce to a simple linear regression problem for two important
reasons. Firstly, although the number of genes whose expression levels are being recorded
is usually very large (running into several tens of thousands), the number of samples (test
subjects) is usually not nearly as large, i.e., n ≪ p. Traditional regression algorithms fall silent in such data-starved settings as they usually expect n > p. Secondly, and more importantly,
we do not expect all genes being tracked to participate in realizing the phenotype. Indeed,
the whole objective of this exercise is to identify a small set of genes which most prominently
influence the given phenotype. Note that this implies that the vector w∗ is very sparse.
Traditional linear regression cannot guarantee the recovery of a sparse model.
Figure 7.1: Gene expression analysis can help identify genetic bases for physiological conditions.
The expression matrix on the right has 32 rows and 15 columns: each row represents one gene
being tracked and each column represents one test subject. A bright red (green) shade in a cell
indicates an elevated (depressed) expression level of the corresponding gene in the test subject
with respect to a reference population. A black/dark shade indicates an expression level identical
to the reference population. Notice that most genes do not participate in the progression of the
disease in a significant manner. Moreover, the number of genes being tracked is much larger
than the number of test subjects. This makes the problem of gene expression analysis, an ideal
application for sparse recovery techniques. Please note that names of genes and expression levels
in the figure are illustrative and do not correspond to actual experimental observations. Figure
adapted from [Wilson et al., 2003].
Sparse Signal Transmission and Recovery The task of transmitting and acquiring signals
is a key problem in engineering. In several application areas such as magnetic resonance imagery
and radio communication, linear measurement techniques, for example, sampling, are commonly
used to acquire a signal. The task then is to reconstruct the original signal from these measure-
ments. For sake of simplicity, suppose we wish to sense/transmit signals represented as vectors
in Rp . For various reasons (conserving energy, protection against data corruption etc), we may
want to not transmit the signal directly and instead, create a sensing mechanism wherein a
signal w ∈ Rp is encoded into a signal y ∈ Rn and it is y that is transmitted. At the receiving
end y must be decoded back into w. A popular way of creating sensing mechanisms – also
called designs – is to come up with a set of n linear functionals xi : Rp → R and for any signal
w ∈ Rp, record the values yᵢ = xᵢ^⊤w. If we denote X = [x₁, . . . , xₙ]^⊤ and y = [y₁, . . . , yₙ]^⊤, then y = Xw is transmitted. Note as a special case that if n = p and xᵢ = eᵢ, then X = I_{p×p} and y = w, i.e., we transmit the original signal itself.
If p is very large then we naturally look for designs with n ≪ p. However, elementary results in algebra dictate that the recovery of w from y cannot be guaranteed even if n = p − 1. There is
irrecoverable loss of information and there could be (infinitely) many signals w all of which map
to the same transmitted signal y making it impossible to recover the original signal uniquely. A
result similar in spirit called the Shannon-Nyquist theorem holds for analog or continuous-time
signals. Although this seems to spell doom for any efforts to perform compressed sensing and
transmission, these negative results can actually be overcome by observing that in several useful
settings, the signals we are interested in are actually very sparse, i.e., w ∈ B0(s) ⊂ Rp with s ≪ p. This realization is critical since it allows the possibility of specialized design matrices being used to transmit sparse signals in a highly compressed manner, i.e., with n ≪ p but without any loss of information. However, the recovery problem now requires a sparse vector to be recovered from the transmitted signal y and the design matrix X.
In both applications, the estimation problem thus takes the form y = Xw* + η with an s-sparse model w*, leading to the sparse regression problem

    min_{w∈Rp, ‖w‖₀≤s} ‖y − Xw‖₂², (SP-REG)

where X = [x₁, . . . , xₙ]^⊤ and y = [y₁, . . . , yₙ]^⊤. It is common to model the additive noise as white noise, i.e., ηᵢ ∼ N(0, σ²) for some σ > 0. It should be noted that the sparse regression problem in (SP-REG) is an NP-hard problem [Natarajan, 1995].
Algorithm 8 Iterative Hard-thresholding (IHT)
Input: Data X, y, step length η, projection sparsity level k
Output: A sparse model ŵ ∈ B0(k)
1: w¹ ← 0
2: for t = 1, 2, . . . do
3: z^{t+1} ← wᵗ − η·(1/n)·X^⊤(Xwᵗ − y)
4: w^{t+1} ← Π_{B0(k)}(z^{t+1}) //see § 3.1
5: end for
6: return wᵗ
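A minimal sketch of Algorithm 8 in code (ours; the noiseless Gaussian design below is an illustrative assumption under which RIP holds with high probability) might look as follows:

```python
import numpy as np

def iht(X, y, s, eta=1.0, iters=100):
    """A minimal sketch of Iterative Hard-thresholding (Algorithm 8)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        z = w - eta * X.T @ (X @ w - y) / n          # gradient step
        w = np.zeros(p)
        keep = np.argsort(np.abs(z))[-s:]            # hard thresholding onto B0(s)
        w[keep] = z[keep]
    return w

# Noiseless recovery on a random Gaussian design.
rng = np.random.default_rng(0)
n, p, s = 400, 1000, 10
X = rng.standard_normal((n, p))
w_star = np.zeros(p)
w_star[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
print(np.linalg.norm(iht(X, X @ w_star, s) - w_star))  # close to 0
```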
There is an important distinction between the two applications described above: in sparse signal recovery, the design matrix X is created by us as part of the recovery algorithm. However, in the second case, the design matrix is mostly given to us. We have no fine control over its properties.
This will make an important difference in the algorithms that operate in these settings
since algorithms for sparse signal recovery would be able to make very stringent assumptions
regarding the design matrix since we ourselves create this matrix from scratch. However, for
the same reason, algorithms working in statistical learning settings such as the gene expression
analysis problem, would have to work with relaxed assumptions that can be expected to be
satisfied by natural data. We will revisit this point later once we have introduced the reader to
algorithms for performing sparse regression.
We request the reader to read on: we will find that not only do those notions extend here, but that they have beautiful interpretations. Moreover, instead of directly applying the gPGD analysis (Theorem 3.3), we will see a simpler convergence proof tailored to the sparse recovery problem, which also gives a sharper result.
For any k, let S(w, k) ⊂ [p] denote the set of the k coordinates of w with the largest magnitudes, and let

    C(k) := {w ∈ Rp : ‖w_{S̄(w,k)}‖₁ ≤ ‖w_{S(w,k)}‖₁}

be the (non-convex; see Exercises 7.1 and 7.2) set of points that place a majority of their weight on some k coordinates. Note that C(k) ⊃ B0(k), since k-sparse vectors put all their weight on some k coordinates.
Definition 7.1 (Nullspace Property [Cohen et al., 2009]). A matrix X ∈ Rn×p is said to satisfy
the null-space property of order k if ker(X) ∩ C(k) = {0}, where ker(X) = {w ∈ Rp : Xw = 0}
is the kernel of the linear transformation induced by X (also called its null-space).
If a design matrix satisfies this property, then vectors in its null-space are disallowed from concentrating a majority of their weight on any k coordinates. Clearly, no k-sparse vector is present in the null-space either. If a design matrix has the null-space property of order 2s, then it can never identify two distinct s-sparse vectors (see Exercise 7.3) – something that we have already seen as essential to ensure global recovery. A strengthening of the Nullspace Property gives us the Restricted Eigenvalue Property.
Definition 7.2 (Restricted Eigenvalue Property [Raskutti et al., 2010]). A matrix X ∈ Rn×p is said to satisfy the restricted eigenvalue property of order k with constant α if for all w ∈ C(k), we have (1/n)·‖Xw‖₂² ≥ α·‖w‖₂².
This means that not only are k-sparse vectors absent from the null-space, they actually retain a good fraction of their length after projection as well. In particular, if k = 2s, then for any w₁, w₂ ∈ B0(s), we have (1/n)·‖X(w₁ − w₂)‖₂² ≥ α·‖w₁ − w₂‖₂². Thus, the distance between any two sparse vectors is never greatly diminished after projection. Such behavior is the hallmark
of an isometry, which preserves the geometry of vectors. The next property further explicates this and is, not surprisingly, called the Restricted Isometry Property.

Definition 7.3 (Restricted Isometry Property [Candès and Tao, 2005]). A matrix X ∈ Rn×p is said to satisfy the restricted isometry property (RIP) of order k with constant δ_k ∈ [0, 1) if for all w ∈ B0(k), we have

    (1 − δ_k)·‖w‖₂² ≤ (1/n)·‖Xw‖₂² ≤ (1 + δ_k)·‖w‖₂².
The above property is most widely used in analyzing sparse recovery and compressive sensing
algorithms. However, it is a bit restrictive since it requires the distortion parameters to be of
the kind (1 ± δ) for δ ∈ [0, 1). A generalization of this property that is especially useful in
settings where the properties of the design matrix are not strictly controlled by us, such as
the gene expression analysis problem, is the following notion of restricted strong convexity and
smoothness.
Definition 7.4 (Restricted Strong Convexity/Smoothness Property [Jain et al., 2014, Jalali et al., 2011]). A matrix X ∈ Rn×p is said to satisfy the α-restricted strong convexity (RSC) property and the β-restricted smoothness (RSS) property of order k if for all w ∈ B0(k), we have

    α·‖w‖₂² ≤ (1/n)·‖Xw‖₂² ≤ β·‖w‖₂².
The only difference between the RIP and the RSC/RSS properties is that the former forces the constants to be of the form 1 ± δ_k, whereas the latter does not impose any such constraints. The reader will notice the similarities between the definition of restricted strong convexity and smoothness as given here and Definition 3.2, where we defined restricted strong convexity and smoothness notions for general functions. The reader is invited to verify (see Exercise 7.4) that the two are indeed related. Indeed, Definition 3.2 can be seen as a generalization of Definition 7.4 to general functions [Jain et al., 2014]. For twice differentiable functions, both definitions can be seen as placing restrictions on the (restricted) eigenvalues of the Hessian of the function.
It is a useful exercise to verify (see Exercise 7.5) that these properties fall in a hierarchy: RSC-RSS ⇒ REP ⇒ NSP for an appropriate setting of constants. We will next establish the main result of this section: if the design matrix satisfies the RIP condition with appropriate constants, then the IHT algorithm does indeed guarantee universal sparse recovery. Subsequently, we will give pointers to recent results that guarantee universal recovery in gene expression analysis-like settings.
7.6 Ensuring RIP and other Properties

Random Designs: The simplest of these results are the so-called random design constructions, which guarantee that if the matrix is sampled from certain well behaved distributions, then it will satisfy the RIP property with high probability. For instance, the work of Baraniuk et al. [2008] shows the following result:
Theorem 7.1 ([Baraniuk et al., 2008, Theorem 5.2]). Let D be a distribution over matrices in Rn×p such that for any fixed v ∈ Rp and any ε > 0,

    P_{X∼D}[ |‖Xv‖₂² − ‖v‖₂²| > ε·‖v‖₂² ] ≤ 2 exp(−Ω(n)).

Then, for any k < p/2, matrices X generated from this distribution also satisfy the RIP property at order k with constant δ with probability at least 1 − 2 exp(−Ω(n)), whenever n ≥ Ω((k/δ²)·log(p/k)).
Thus, a distribution over matrices that, for every fixed vector, acts as an almost isometry with high probability, is also guaranteed to, with very high probability, generate matrices that act as a restricted isometry simultaneously over all sparse vectors. Such matrix distributions are easy to construct – one simply needs to sample each entry of the matrix independently from a well behaved distribution, for example, the standard Gaussian distribution N(0, 1), or the Rademacher distribution over {−1, +1} (suitably normalized).
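The near-isometry is easy to observe empirically. The following sketch (ours) estimates the restricted eigenvalues of a Gaussian design over randomly drawn supports; scanning random supports only lower-bounds the true δ_k, but it illustrates the concentration:

```python
import numpy as np

# Estimate the RIP constant of a Gaussian design at order k by scanning random
# supports: eigenvalues of (1/n) X_S^T X_S should lie in [1 - delta, 1 + delta].
rng = np.random.default_rng(0)
n, p, k = 400, 1000, 10
X = rng.standard_normal((n, p))
lo, hi = np.inf, -np.inf
for _ in range(500):
    S = rng.choice(p, k, replace=False)
    eigs = np.linalg.eigvalsh(X[:, S].T @ X[:, S] / n)
    lo, hi = min(lo, eigs[0]), max(hi, eigs[-1])
print(lo, hi)  # both close to 1; max(1 - lo, hi - 1) estimates delta_k
```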
The work of Agarwal et al. [2012] shows that the RSC/RSS properties are satisfied whenever
rows of the matrix X are drawn from a sub-Gaussian distribution over p-dimensional vectors
with a non-singular covariance matrix. This result is useful since it shows that real-life data,
which can often be modeled as vectors being drawn from sub-Gaussian distributions, will satisfy
these properties with high probability. This is crucial for sparse recovery and other algorithms
to be applicable to real life problems such as the gene-expression analysis problem.
If one can tolerate a slight blowup in the number of rows of the matrix X, then there exist better constructions with the added benefit of allowing fast matrix-vector products. The initial work of Candès and Tao [2005] itself showed that selecting each row of a Fourier transform matrix independently with probability O((k·log⁶p)/p) results in an RIP matrix with high probability. More recently, this was improved to O((k·log²k·log p)/p) in the work of Haviv and Regev [2017]. A matrix-vector product of a k-sparse vector with such a matrix takes only O(k·log²p) time, whereas a dense matrix filled with Gaussians would have taken up to O(k²·log p) time. There exist more involved hashing-based constructions that simultaneously offer reduced sample complexity and fast matrix-vector multiplications [Nelson et al., 2014].
Deterministic Designs: There exist far fewer and far weaker results for deterministic constructions of RIP matrices. The initial results in this direction all involved constructing incoherent matrices. A matrix X ∈ Rn×p with unit norm columns is said to be µ-incoherent if for all i ≠ j ∈ [p], |⟨Xᵢ, Xⱼ⟩| ≤ µ. A µ-incoherent matrix always satisfies RIP (see Exercise 7.6) at order k with parameter δ = (k − 1)µ.
Deterministic constructions of incoherent matrices with µ = O(log p/√(n log n)) have been well known since the work of Kashin [1975]. However, such constructions require n = Ω̃(k²·log²p/δ²) rows, which is quadratically more than what random designs require. The first result to improve upon these constructions came from the work of Bourgain et al. [2011], which gave deterministic combinatorial constructions that assure the RIP property with n = Õ(k^{2−ε}/δ²) for some constant ε > 0. However, till date, substantially better constructions are not known.
7.7 A Sparse Recovery Guarantee for IHT
We will now establish a convergence result for the IHT algorithm. Although the analysis for the
gPGD algorithm (Theorem 3.3) can be adapted here, the following proof is much more tuned to
the sparse recovery problem and offers a tighter analysis and several problem-specific insights.
Theorem 7.2. Suppose X ∈ Rn×p is a design matrix that satisfies the RIP property of order
3s with constant δ3s < 21 . Let w∗ ∈ B0 (s) ⊂ Rp be any arbitrary sparse vector and let y = Xw∗ .
Then the IHT algorithm (Algorithm 8), when executed with a step length η = 1, and a projection
kw∗ k2
∗
t
sparsity level k = s, ensures w − w 2 ≤ after at most t = O log
iterations of the
algorithm.
Proof. We start off with some notation. Let S ∗ := supp(w∗ ) and S t := supp(wt ). Let I t :=
S t ∪ S t+1 ∪ S ∗ denote the union of the supports of the two consecutive iterates and the optimal
model. The reason behind defining this quantity is that we are assured that while analyzing
this update step, the two error vectors wt − w∗ and wt+1 − w∗ , which will be the focal point of
the analysis, have support within I t . Note that |I t | ≤ 3s. Please refer to the notation section at
the beginning of this monograph for the interpretation of the notation xI and AI for a vector
x, matrix A and set I.
With η = 1, we have (refer to Algorithm 8) z^{t+1} = wᵗ − (1/n)·X^⊤(Xwᵗ − y). However, due to the (non-convex) projection step w^{t+1} = Π_{B0(k)}(z^{t+1}), applying projection property-O gives us

    ‖w^{t+1} − z^{t+1}‖₂² ≤ ‖w* − z^{t+1}‖₂².

Note that none of the other projection properties are applicable here, since the set of sparse vectors is a non-convex set. Now, by Pythagoras' theorem, for any vector v ∈ Rp we have ‖v‖₂² = ‖v_I‖₂² + ‖v_Ī‖₂², which gives us

    ‖w_I^{t+1} − z_I^{t+1}‖₂² + ‖w_Ī^{t+1} − z_Ī^{t+1}‖₂² ≤ ‖w_I* − z_I^{t+1}‖₂² + ‖w_Ī* − z_Ī^{t+1}‖₂².

Since supp(w^{t+1}) ∪ supp(w*) ⊆ I, we have w_Ī^{t+1} = w_Ī* = 0, so the second terms on both sides are equal and cancel, leaving ‖w_I^{t+1} − z_I^{t+1}‖₂ ≤ ‖w_I* − z_I^{t+1}‖₂. Using y = Xw* now gives us

    ‖w_I^{t+1} − (w_I^t − (1/n)·X_I^⊤X(wᵗ − w*))‖₂ ≤ ‖w_I* − (w_I^t − (1/n)·X_I^⊤X(wᵗ − w*))‖₂.

Adding and subtracting w* from the expression inside the norm operator on the left hand side, rearranging, and applying the triangle inequality for norms gives us

    ‖w_I^{t+1} − w_I*‖₂ ≤ 2·‖(w_I^t − w_I*) − (1/n)·X_I^⊤X(wᵗ − w*)‖₂
    = 2·‖(I − (1/n)·X_I^⊤X_I)(w_I^t − w_I*)‖₂
    ≤ 2δ_{3s}·‖wᵗ − w*‖₂,
so that ‖w^{t+1} − w*‖₂ ≤ 2δ_{3s}·‖wᵗ − w*‖₂ < ‖wᵗ − w*‖₂, which finishes the proof. The first inequality in the chain above follows from the triangle inequality (note that supp(wᵗ − w*) ⊆ I), and the last step follows from the fact that RIP implies that for any set I with |I| ≤ 3s, the eigenvalues of the matrix (1/n)·X_I^⊤X_I lie in the range [1 − δ_{3s}, 1 + δ_{3s}], so that ‖I − (1/n)·X_I^⊤X_I‖₂ ≤ δ_{3s}.
We note that this result holds even if the hard-thresholding level is set to k > s. It is easy to see that the condition δ_{3s} < 1/2 is equivalent to the restricted condition number (over 3s-sparse vectors) of the corresponding sparse recovery problem being upper bounded by κ_{3s} < 3. Similar to Theorem 3.3, here also we require an upper bound on the restricted condition number of the problem. It is interesting to note that a direct application of Theorem 3.3 would have instead required δ_{2s} < 1/3 (or equivalently κ_{2s} < 2), which can be shown to be a harsher requirement than what we have achieved. Moreover, applying Theorem 3.3 would have also required us to set the step length to the specific quantity η = 1/(1 + δ_s) while executing the gPGD algorithm, whereas while executing the IHT algorithm, we need only set η = 1.
An alternate proof of this result appears in the work of Garg and Khandekar [2009], which also requires the condition δ_{2s} < 1/3. The above result extends to a more general setting where there is additive noise in the model y = Xw* + η. In this setting, it is known (see for example, [Jain et al., 2014, Theorem 3] or [Garg and Khandekar, 2009, Theorem 2.3]) that if the objective function in question (for the sparse recovery problem the objective function is the least squares objective) satisfies the (α, β)-RSC/RSS properties at level 2s, then the following is guaranteed for the output ŵ of the IHT algorithm (assuming the algorithm is run for roughly O(log n) iterations):

    ‖ŵ − w*‖₂ ≤ (3√s/α)·‖(1/n)·X^⊤η‖_∞.
The consistency of the above solution can be verified in several interesting situations. For example, if the design matrix has normalized columns, i.e., ‖Xᵢ‖₂ ≤ √n, and the noise ηᵢ is generated i.i.d. and independently of the design X from some Gaussian distribution N(0, σ²), then the quantity ‖(1/n)·X^⊤η‖_∞ is of the order of σ·√(log p/n) with high probability. In the above setting, IHT guarantees with high probability

    ‖ŵ − w*‖₂ ≤ Õ((σ/α)·√(s·log p/n)),

which vanishes as n grows, establishing the consistency of the estimate.
7.8 Other Popular Techniques for Sparse Recovery

Apart from IHT and other gradient-style methods, a popular non-convex approach to sparse recovery is offered by the family of pursuit-style algorithms. We warn the reader that the popular Basis Pursuit algorithm is actually a convex relaxation technique and not related to the other pursuit algorithms we discuss here. The terminology is a bit confusing but seems to be a matter of legacy.
The pursuit family of algorithms includes Orthogonal Matching Pursuit (OMP) [Tropp
and Gilbert, 2007], Orthogonal Matching Pursuit with Replacement (OMPR) [Jain et al.,
2011], Compressive Sampling Matching Pursuit (CoSaMP) [Needell and Tropp, 2008], and the
Forward-backward (FoBa) algorithm [Zhang, 2011].
Pursuit methods work by gradually discovering the elements in the support of the true
model vector w∗ . At every time step, these techniques add a new support element to an active
support set (which is empty to begin with) and solve a traditional least-squares problem on
the active support set. This least-squares problem has no sparsity constraints, and is hence a
convex problem which can be solved easily.
The support set is then updated by adding a new support element. It is common to add
the coordinate where the gradient of the objective function has the highest magnitude among
coordinates not already in the support. FoBa-style techniques augment this method by having
backward steps where support elements that were erroneously picked earlier are discarded when
the error is detected.
Pursuit-style methods are, in general, applicable whenever the structure in the (non-convex)
constraint set in question can be represented as a combination of a small number of atoms.
Examples include sparse recovery, where the atoms are individual coordinates: every s-sparse
vector is a linear combination of some s of these atoms.
Other examples include low-rank matrix recovery, which we will study in detail in § 8, where
the atoms are rank-one matrices. The SVD theorem tells us that every r-rank matrix can indeed
be expressed as a sum of r rank-one matrices. There exist works [Tewari et al., 2011] that give
generic methods to perform sparse recovery in such structurally constrained settings.
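For illustration, the basic OMP template described above can be sketched in a few lines of NumPy (our own simplified rendition; real implementations add stopping criteria and numerical safeguards):

    import numpy as np

    def omp(X, y, s):
        # Orthogonal Matching Pursuit: grow the support one coordinate at a time.
        n, p = X.shape
        support = []
        residual = y.astype(float).copy()
        w_S = np.zeros(0)
        for _ in range(s):
            scores = np.abs(X.T @ residual)    # gradient magnitude per coordinate
            scores[support] = -np.inf          # skip coordinates already selected
            support.append(int(np.argmax(scores)))
            # Least squares on the active support (a convex subproblem).
            w_S, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
            residual = y - X[:, support] @ w_S
        w = np.zeros(p)
        w[support] = w_S
        return w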
Convex relaxation techniques start from the original formulation of the sparse regression problem,

min_{w∈R^p: ‖w‖₀≤s} ‖y − Xw‖₂².   (SP-REG)
Non-convexity arises in the problem due to the non-convex constraint ‖w‖₀ ≤ s, as the sparsity operator is not a valid norm. The relaxation approach fixes this problem by changing the constraint to use the L1 norm instead, i.e.,

min_{w∈R^p} ‖y − Xw‖₂²,  s.t. ‖w‖₁ ≤ R,   (LASSO-1)

or by using its regularized version instead,

min_{w∈R^p} (1/2n)·‖y − Xw‖₂² + λ_n·‖w‖₁.   (LASSO-2)
The choice of the L1 norm is motivated mainly by its convexity as well as formal results
that assure us that the relaxation gap is small or non-existent. Both the above formulations
((LASSO-1)) and ((LASSO-2)) are indeed convex but include parameters such as R and λn
that must be tuned properly to ensure proper convergence. Although the optimization problems
((LASSO-1)) and ((LASSO-2)) are vastly different from ((SP-REG)), a long line of beautiful
results, starting from the seminal work of Candès and Tao [2005], Candès et al. [2006], Donoho
[2006], showed that if the design matrix X satisfies RIP with appropriate constants, and if
the parameters of the relaxations R and λn are appropriately tuned, then the solutions to the
relaxations are indeed solutions to the original problems as well.
Below we state one such result from the recent text by Hastie et al. [2016]. We recommend
this text to any reader looking for a well-curated compendium of techniques and results on the
relaxation approach to several non-convex optimization problems arising in machine learning.
Theorem 7.3. [Hastie et al., 2016, Theorem 11.1] Consider a sparse recovery problem y = Xw^∗ + η, where the model w^∗ is s-sparse and the design matrix X satisfies the restricted eigenvalue condition (see Definition 7.2) of order s with constant α. Then any solution ŵ to ((LASSO-2)) with λ_n ≥ 2·‖X^⊤η/n‖_∞ satisfies

‖ŵ − w^∗‖₂ ≤ (3/α)·√s·λ_n.
The reader can verify that the above bounds are competitive with the bounds for the IHT
algorithm that we discussed previously. We refer the reader to [Hastie et al., 2016, Chapter 11]
for more consistency results for the LASSO formulations.
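As a point of comparison with IHT, the relaxed objective ((LASSO-2)) is routinely minimized with proximal gradient descent (ISTA); a minimal sketch follows (the step size and iteration budget below are illustrative choices of ours):

    import numpy as np

    def soft_threshold(z, tau):
        # Proximal operator of tau * ||.||_1 (the L1 analogue of hard thresholding).
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def ista_lasso(X, y, lam, iters=500):
        # Minimizes (1/2n) ||y - Xw||_2^2 + lam * ||w||_1, as in (LASSO-2).
        n, p = X.shape
        L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the smooth part
        w = np.zeros(p)
        for _ in range(iters):
            grad = X.T @ (X @ w - y) / n
            w = soft_threshold(w - grad / L, lam / L)
        return w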
A third approach relaxes the L0 constraint not to the L1 norm but to a non-convex Lq quasi-norm with q ∈ (0, 1), solving

min_{w∈R^p} ‖w‖_q,  s.t. y = Xw.

For noisy settings, one may replace the constraint with a soft constraint such as ‖y − Xw‖₂ ≤ ε, or else move to an unconstrained version like LASSO with the L1 norm replaced by the Lq norm.
Figure 7.2: An empirical comparison of run-times offered by the LASSO, FoBa and IHT methods
on sparse regression problems with varying dimensionality p. All problems enjoyed sparsity
s = 100 and were offered n = 2s · log p data points. IHT is clearly the most scalable of the
methods followed by FoBa. The relaxation technique does not scale very well to high dimensions.
Figure adapted from [Jain et al., 2014].
The choice of the regularization norm q is dictated by the application, and usually any value within a certain range inside the interval (0, 1) can be chosen. There has been interest in characterizing both the global and the local optima of these optimization problems for their recovery properties [Chen and Gu, 2015]. In general, Lq-regularized formulations, if solved exactly, can guarantee recovery under much weaker conditions than what LASSO formulations and IHT require. For instance, the RIP condition that Lq-regularized formulations need in order to guarantee universal recovery can be as weak as δ_{2k+1} < 1 [Chartrand, 2007]. This is very close to the requirement δ_{2k} < 1 that must be made by any algorithm in order to ensure that the solution even be unique. However, solving these non-convex regularized problems exactly at a large scale itself remains challenging and an active area of research.
7.9 Extensions
In the preceding discussion, we studied the problem of sparse linear regression and the IHT
technique to solve the problem. These basic results can be augmented and generalized in sev-
eral ways. The work of Negahban et al. [2012] greatly expanded the scope of sparse recovery
techniques beyond simple least-squares to the more general M-estimation problem. The work of
Bhatia et al. [2015] offered solutions to the robust sparse regression problem where the responses
may be corrupted by an adversary. We will explore the robust regression problem in more detail
in § 9. We discuss a few more such extensions below.
One such extension relaxes the projection sparsity level used by IHT, which allows the requirement on the restricted condition number to be done away with entirely.

Theorem 7.4. Suppose X ∈ R^{n×p} is a design matrix that satisfies the restricted strong convexity and smoothness properties of order 2k + s with constants α_{2k+s} and β_{2k+s} respectively. Let w^∗ ∈ B₀(s) ⊂ R^p be any arbitrary sparse vector and let y = Xw^∗. Then the IHT algorithm, when executed with a step length η < 2/β_{2k+s} and a projection sparsity level k ≥ 32·(β_{2k+s}/α_{2k+s})²·s, ensures ‖w^t − w^∗‖₂ ≤ ε after t = O((β_{2k+s}/α_{2k+s})·log(‖w^∗‖₂/ε)) iterations of the algorithm.

Note that the above result does not place any restrictions on the condition number or the RSC/RSS constants of the problem. The result also mimics Theorem 3.3 in its dependence on the (restricted) condition number of the optimization problem, i.e., κ_{2k+s} = β_{2k+s}/α_{2k+s}. The proof of this result is a bit tedious, hence omitted.
The sparse recovery problem constrains the model vector to conform to a specific sparsity pattern; note that B₀(s) is exactly a union of coordinate-aligned subspaces, one for each possible support of size s. It is natural to wonder whether the methods and analyses described above also hold when the vector to be recovered belongs to a general union of subspaces. More specifically, consider a family of linear subspaces H₁, . . . , H_L ⊂ R^p and denote their union by H = ∪_{i=1}^L H_i. The restricted strong convexity and restricted strong smoothness conditions can be appropriately modified to suit this setting by requiring a design matrix X : R^p → R^n to satisfy, for every w₁, w₂ ∈ H,

α·‖w₁ − w₂‖₂² ≤ ‖X(w₁ − w₂)‖₂² ≤ β·‖w₁ − w₂‖₂².
It turns out that IHT, with an appropriately modified projection operator ΠH (·), can ensure
recovery of vectors that are guaranteed to reside in a small union of low-dimensional subspaces.
Moreover, a linear rate of convergence, as we have seen for the IHT algorithm in the sparse
regression case, can still be achieved. We refer the reader to the work of Blumensath [2011] for
more details of this extension.
Dictionary learning offers another extension: here, even the design matrix X is unknown, and given data points y_i, the goal is to learn a design matrix X together with sparse combinations w_i ∈ R^p of the columns of the design matrix, i.e., y_i ≈ Xw_i such that ‖w_i‖₀ ≤ s ≪ p. The problem has several applications in the fields of computer vision and signal processing and has seen a lot of interest in the recent past.
The alternating minimization technique where one alternates between estimating the design
matrix and the sparse representations, is especially popular for this problem. Methods mostly
differ in the exact implementation of these alternations. Some notable works in this area include
[Agarwal et al., 2016, Arora et al., 2014, Gribonval et al., 2015, Spielman et al., 2012].
7.10 Exercises
Exercise 7.1. Suppose a design matrix X ∈ Rn×p satisfies Xw1 = Xw2 for some w1 6= w2 ∈
Rp . Then show that there exists an entire subspace H ⊂ Rp such that for all w, w0 ∈ H, we have
Xw = Xw0 .
Exercise 7.3. Show that if a design matrix X satisfies the null-space property of order 2s,
then for any two distinct s-sparse vectors v1 , v2 ∈ B0 (s), v1 6= v2 , it must be the case that
Xv1 6= Xv2 .
Exercise 7.4. Show that the RSC/RSS notion introduced in Definition 7.4 is equivalent to the
RSC/RSS notion in Definition 3.2 defined in § 3 for an appropriate choice of function and
constraint sets.
Exercise 7.5. Show that RSC-RSS ⇒ REP ⇒ NSP i.e. a matrix that satisfies the RSC/RSS
condition for some constants, must satisfy the REP condition for some constants which in turn
must force it to satisfy the null-space property.
Exercise 7.6. Show that every µ-incoherent matrix satisfies the RIP property at order k with
parameter δ = (k − 1)µ.
Exercise 7.7. Suppose the matrix X ∈ R^{n×p} satisfies RIP at order s with constant δ_s. Then show that for any set I ⊂ [p], |I| ≤ s, the smallest eigenvalue of the matrix X_I^⊤ X_I is lower bounded by (1 − δ_s).
Exercise 7.8. Show that the RIP constant is monotonic in its order i.e. if a matrix X satisfies
RIP of order k with constant δk , then it also satisfies RIP for all orders k 0 ≤ k with δk0 ≤ δk .
Chapter 8

Low-Rank Matrix Recovery
In this section, we will look at the problem of low-rank matrix recovery in detail. Although
simple to motivate as an extension of the sparse recovery problem that we studied in § 7, the
problem rapidly distinguishes itself in requiring specific tools, both algorithmic and analytic.
We will start our discussion with a milder version of the problem as a warm up and move on
to the problem of low-rank matrix completion which is an active area of research.
8.1 Motivating Applications

Collaborative Filtering Recommendation systems are popularly used to model the preference
patterns of users, say at an e-commerce website, for items being sold on that website, although
the principle of recommendation extends to several other domains that demand personalization
such as education and healthcare. Collaborative filtering is a popular technique for building
recommendation systems.
The collaborative filtering approach seeks to exploit co-occurring patterns in the observed
behavior across users in order to predict future user behavior. This approach has proven suc-
cessful in addressing users that interact very sparingly with the system. Consider a set of m
users u1 , . . . , um , and n items a1 , . . . , an . Our goal is to predict the preference score s(i,j) that
is indicative of the interest user ui has in item aj .
However, we get direct access to (noisy estimates of) actual preference scores for only a few
items per user by looking at clicks, purchases etc. That is to say, if we consider the m × n
preference matrix A = [Aij ] where Aij = s(i,j) encodes the (true) preference of the ith user for
the j-th item, we get to see only k ≪ m · n entries of A, as depicted in Figure 8.1. Our goal is to
recover the remaining entries.
The problem of paucity of available data is readily apparent in this setting. In its nascent
form, the problem is not even well posed and does not admit a unique solution. A popular way
of overcoming these problems is to assume a low-rank structure in the preference matrix.
As we saw in Exercise 3.3, this is equivalent to assuming that there is an r-dimensional vector u_i describing the i-th user, and an r-dimensional vector a_j describing the j-th item, such that s(i,j) ≈ ⟨u_i, a_j⟩. Thus, if Ω ⊂ [m] × [n] is the set of entries that have been observed by us, then the problem of recovering the unobserved entries can be cast as the following optimization problem:

min_{X∈R^{m×n}: rank(X)≤r} Σ_{(i,j)∈Ω} (X_ij − A_ij)².
Figure 8.1: In a typical recommendation system, users rate items very infrequently and certain
items may not get rated even once. The figure depicts a ratings matrix. Only the matrix entries
with a bold border are observed. Low-rank matrix completion can help recover the unobserved
entries, as well as reveal hidden features that are descriptive of user and item properties, as
shown on the right hand side.
This problem can be shown to be NP-hard [Hardt et al., 2014] but has generated an enormous
amount of interest across communities. We shall give special emphasis to this matrix completion
problem in the second part of this section.
Linear Time-invariant Systems Linear time-invariant (LTI) systems are widely used in modeling dynamical systems in fields such as engineering and finance. The response behavior of these systems is characterized by a model vector h = [h(0), h(1), . . . , h(2N − 1)]. The order of such a system is given by the rank of the following Hankel matrix

hank(h) = [ h(0)    h(1)   · · ·  h(N)
            h(1)    h(2)   · · ·  h(N+1)
              ⋮       ⋮      ⋱      ⋮
            h(N−1)  h(N)   · · ·  h(2N−1) ].
Given a sequence of inputs a = [a(1), a(2), . . . , a(N)] to the system, the output of the system is given by

y(N) = Σ_{t=0}^{N−1} a(N − t)·h(t).
In order to recover the model parameters of a system, we repeatedly apply i.i.d. Gaussian impulses a(i) to the system for N time steps and then observe the output of the system. This process is repeated, say k times, to yield observation pairs {(a^i, y^i)}_{i=1}^k. Our goal now is to take these observations and identify an LTI vector h that best fits the data. However, for the sake of accuracy and ease of analysis [Fazel et al., 2013], it is advisable to fit a low-order model to the data. Let the matrix A ∈ R^{k×N} contain the i.i.d. Gaussian impulses applied to the system. Then the problem of fitting a low-order model can be shown to reduce to the following constrained optimization problem with a rank objective and an affine constraint:

min_h rank(hank(h)),  s.t. Ah = y.
The above problem is a non-convex optimization problem due to the objective being the min-
imization of the rank of a matrix. Several other problems in metric embedding and multi-
dimensional scaling, image compression, low rank kernel learning and spatio-temporal imaging
can also be reduced to low rank matrix recovery problems [Jain et al., 2010, Recht et al., 2010].
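To make the LTI example concrete, the following small NumPy sketch builds the Hankel matrix hank(h) and simulates the system output for a given input sequence (a toy illustration of ours; function names are hypothetical):

    import numpy as np

    def hankel_matrix(h):
        # h = [h(0), ..., h(2N-1)]; returns the N x (N+1) matrix hank(h) above.
        N = len(h) // 2
        return np.array([[h[i + j] for j in range(N + 1)] for i in range(N)])

    def lti_output(a, h):
        # y(N) = sum_{t=0}^{N-1} a(N - t) h(t), with a = [a(1), ..., a(N)].
        N = len(a)
        return sum(a[N - 1 - t] * h[t] for t in range(N))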
8.2 Problem Formulation

The problems discussed above can be cast as instances of the general Affine Rank Minimization (ARM) problem

min_X rank(X),  s.t. A(X) = y,   (ARM)

where A : R^{m×n} → R^k is an affine transformation and y ∈ R^k is the vector of measurements. This problem can be shown to be NP-hard via a reduction from the sparse recovery problem (see Exercise 8.1).
The LTI modeling problem can be easily seen to be an instance of ARM with the Gaussian
impulses being delivered to the system resulting in a k-dimensional affine transformation of
the Hankel matrix corresponding to the system. However, the Collaborative Filtering problem
is also an instance of ARM. To see this, for any (i, j) ∈ [m] × [n], let O^{(i,j)} ∈ R^{m×n} be the matrix whose (i, j)-th entry equals 1 while all other entries are zero. Then, simply define the affine transformation A_{(i,j)} : X ↦ tr(X^⊤ O^{(i,j)}) = X_ij. Thus, if we observe k user-item
ratings, the ARM problem effectively operates with a k-dimensional affine transformation of
the underlying rating matrix.
Due to its similarity to the sparse recovery problem, we will first discuss the general ARM problem. However, we will find it beneficial to cast the collaborative filtering problem as a low-rank matrix completion (LRMC) problem instead. In this problem, we have an underlying low-rank matrix X^∗ of which we observe entries in a set Ω ⊂ [m] × [n]. The low-rank matrix completion problem can then be stated as

min_{X∈R^{m×n}: rank(X)≤r} Σ_{(i,j)∈Ω} (X_ij − X^∗_ij)².   (LRMC)

The above formulation succinctly captures our objective to find a completion of the ratings matrix that is both low rank, as well as in agreement with the user ratings that are actually observed. As pointed out earlier, this problem is NP-hard [Hardt et al., 2014].
Before moving on to present algorithms for the ARM and LRMC problems, we discuss some
matrix design properties that would be required in the convergence analyses of the algorithms.
8.3 Matrix Design Properties
Similar to sparse recovery, there exist design properties that ensure that the general NP-hardness
of the ARM and LRMC problems can be overcome in well-behaved problem settings. In fact
given the similarity between ARM and sparse recovery problems, it is tempting to try and
import concepts such as RIP into the matrix-recovery setting.
In fact this is exactly the first line of attack that was adopted in literature. What followed
was a beautiful body of work that generalized, both structural notions such as RIP, as well
as algorithmic techniques such as IHT, to address the ARM problem. Given the generality of
these constructs, as well as the smooth transition it offers having studied sparse recovery, we
feel compelled to present them to the reader.
Definition 8.1 (Matrix Restricted Isometry Property [Recht et al., 2010]). A linear map A : R^{m×n} → R^k is said to satisfy the matrix restricted isometry property of order r with constant δ_r ∈ [0, 1) if for all matrices X of rank at most r, we have

(1 − δ_r)·‖X‖_F² ≤ ‖A(X)‖₂² ≤ (1 + δ_r)·‖X‖_F².
Furthermore, the work of Recht et al. [2010] also showed that linear maps or affine transfor-
mations arising in random measurement models, such as those in image compression and LTI
systems, do satisfy RIP with requisite constants whenever the number of affine measurements
satisfies k = O (nr) [Oymak et al., 2015]. Note however, that these are settings in which the de-
sign of the affine map is within our control. For settings, where the restricted condition number
of the affine map is not within our control, more involved analysis is required. The bibliographic
notes point to some of these results.
Given the relatively simple extension of the RIP definitions to the matrix setting, it is
all the more tempting to attempt to apply gPGD-style techniques to solve the ARM problem,
particularly since we saw how IHT succeeded in offering scalable solutions to the sparse recovery
problem. The works of [Goldfarb and Ma, 2011, Jain et al., 2010] showed that this is indeed
possible. We will explore this shortly.
Algorithm 9 Singular Value Projection (SVP)
Input: Linear map A, measurements y, target rank q, step length η
Output: A matrix X̂ with rank at most q
1: X^1 ← 0_{m×n}
2: for t = 1, 2, . . . do
3:   Y^{t+1} ← X^t − η · A^⊤(A(X^t) − y)
4:   Compute the top q singular vectors/values of Y^{t+1}: U_q^t, Σ_q^t, V_q^t
5:   X^{t+1} ← U_q^t Σ_q^t (V_q^t)^⊤
6: end for
7: return X^t
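In NumPy, one SVP iteration is a gradient step followed by a truncated SVD. The following minimal sketch mirrors Algorithm 9 (our own rendition; passing the affine map and its adjoint as callables is our interface choice):

    import numpy as np

    def svp(A_op, At_op, y, shape, q, eta, iters=100):
        # A_op: X -> A(X) in R^k; At_op: r -> A^T(r) in R^{m x n} (adjoint map).
        X = np.zeros(shape)
        for _ in range(iters):
            Y = X - eta * At_op(A_op(X) - y)     # gradient step on (1/2)||A(X)-y||^2
            U, S, Vt = np.linalg.svd(Y, full_matrices=False)
            X = (U[:, :q] * S[:q]) @ Vt[:q, :]   # best rank-q approximation of Y
        return X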
Low-rank structure alone, however, does not make the matrix completion problem well posed. To see this, consider a matrix of the form A = Σ_{t=1}^r e_{i_t} e_{j_t}^⊤ ∈ R^{m×n}, where the e_i are the canonical orthonormal vectors in R^m and the e_j are the canonical orthonormal vectors in R^n. Clearly A has rank at most r.

However, this matrix A is non-zero at only r locations. Thus, it is impossible to recover the entire matrix uniquely unless these very r locations {(i_t, j_t)}_{t=1,...,r} are actually observed. Since in recommendation settings we only observe a few random entries of the matrix, there is a good possibility that none of these entries will ever be observed. This presents a serious challenge for the matrix completion problem – the low-rank structure is not sufficient to ensure unique recovery!
To overcome this and make the LRMC problem well posed with a unique solution, an
additional property is imposed. This so-called matrix incoherence property prohibits low rank
matrices that are also sparse. A side effect of this imposition is that for incoherent matrices,
observing a small random set of entries is enough to uniquely determine the unobserved entries
of the matrix.
Definition 8.2 (Matrix Incoherence Property [Candès and Recht, 2009]). A matrix A ∈ R^{m×n} of rank r is said to be incoherent with parameter µ if its left and right singular matrices have bounded row norms. More specifically, let A = UΣV^⊤ be the SVD of A, and let U^i and V^j denote the i-th row of U and the j-th row of V respectively. Then µ-incoherence dictates that ‖U^i‖₂ ≤ µ·√r/√m for all i ∈ [m] and ‖V^j‖₂ ≤ µ·√r/√n for all j ∈ [n]. A stricter version of this property requires all entries of U to satisfy |U_ij| ≤ µ/√m and all entries of V to satisfy |V_ij| ≤ µ/√n.
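The incoherence parameter of a given matrix is easy to estimate numerically; the helper below (an illustrative snippet of ours) computes the smallest µ for which the row-norm version of Definition 8.2 holds:

    import numpy as np

    def incoherence(A, r):
        # Smallest mu with ||U^i||_2 <= mu*sqrt(r/m) for all rows of U and
        # ||V^j||_2 <= mu*sqrt(r/n) for all rows of V (top-r singular factors).
        m, n = A.shape
        U, _, Vt = np.linalg.svd(A, full_matrices=False)
        mu_u = np.sqrt(m / r) * np.linalg.norm(U[:, :r], axis=1).max()
        mu_v = np.sqrt(n / r) * np.linalg.norm(Vt[:r, :].T, axis=1).max()
        return max(mu_u, mu_v)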
A low rank incoherent matrix is guaranteed to be far, i.e., well distinguished, from any sparse
matrix, something that is exploited by algorithms to give guarantees for the LRMC problem.
8.5 A Low-Rank Matrix Recovery Guarantee for SVP

As we shall see below, SVP enjoys a linear rate of convergence to the optimum, much like IHT. All these properties make SVP a very attractive choice for solving low-rank matrix recovery problems.
Below, we give a convergence proof for SVP in the noiseless case, i.e., when y = A(X ∗ ).
The proof is similar in spirit to the convergence proof we saw for the IHT algorithm in § 7 but
differs in crucial aspects since sparsity in this case is present not in the signal domain (the matrix is not itself sparse) but in the spectral domain (the set of singular values of the matrix is
sparse). The analysis can be extended to noisy measurements as well and can be found in [Jain
et al., 2010].
Theorem 8.1. Suppose A : R^{m×n} → R^k is an affine transformation that satisfies the matrix RIP property of order 2r with constant δ_{2r} < 1/3. Let X^∗ ∈ R^{m×n} be a matrix of rank at most r and let y = A(X^∗). Then the SVP algorithm (Algorithm 9), when executed with a step length η = 1/(1 + δ_{2r}) and a target rank q = r, ensures ‖X^t − X^∗‖_F ≤ ε after t = O(log(‖y‖₂²/ε²)) iterations of the algorithm.
Proof. Notice that the notions of sparsity and support are very different in ARM than they were for sparse regression. Consequently, the exact convergence proof for IHT (Theorem 7.2) is not applicable here. We will first establish an intermediate result showing that after t = O(log(‖y‖₂²/ε²)) iterations, SVP ensures ‖A(X^t) − y‖₂² ≤ ε². We will then use the matrix RIP property (Definition 8.1) to deduce

‖A(X^t) − y‖₂² = ‖A(X^t − X^∗)‖₂² ≥ (1 − δ_{2r})·‖X^t − X^∗‖_F²,

which will conclude the proof. To prove the intermediate result, let us denote the objective function as

f(X) = (1/2)·‖A(X) − y‖₂² = (1/2)·‖A(X − X^∗)‖₂².

An application of the matrix RIP property then gives us

f(X^{t+1}) = f(X^t) + ⟨A(X^t − X^∗), A(X^{t+1} − X^t)⟩ + (1/2)·‖A(X^{t+1} − X^t)‖₂²
           ≤ f(X^t) + ⟨A(X^t − X^∗), A(X^{t+1} − X^t)⟩ + ((1+δ_{2r})/2)·‖X^{t+1} − X^t‖_F².
The following steps now introduce the intermediate variable Y^{t+1} into the analysis in order to link the successive iterates, using the fact that X^{t+1} is the result of a non-convex projection operation applied to Y^{t+1}:

⟨A(X^t − X^∗), A(X^{t+1} − X^t)⟩ + ((1+δ_{2r})/2)·‖X^{t+1} − X^t‖_F²
  = ((1+δ_{2r})/2)·‖X^{t+1} − Y^{t+1}‖_F² − (1/(2(1+δ_{2r})))·‖A^⊤(A(X^t − X^∗))‖_F²
  ≤ ((1+δ_{2r})/2)·‖X^∗ − Y^{t+1}‖_F² − (1/(2(1+δ_{2r})))·‖A^⊤(A(X^t − X^∗))‖_F²
  = ⟨A(X^t − X^∗), A(X^∗ − X^t)⟩ + ((1+δ_{2r})/2)·‖X^∗ − X^t‖_F²
  ≤ ⟨A(X^t − X^∗), A(X^∗ − X^t)⟩ + ((1+δ_{2r})/(2(1−δ_{2r})))·‖A(X^∗ − X^t)‖₂²
  = −f(X^t) − (1/2)·‖A(X^∗ − X^t)‖₂² + ((1+δ_{2r})/(2(1−δ_{2r})))·‖A(X^∗ − X^t)‖₂².

The first step uses the identity Y^{t+1} = X^t − η·A^⊤(A(X^t) − y) from Algorithm 9, the fact that we set η = 1/(1+δ_{2r}), and elementary rearrangements. The second step follows from the fact that ‖X^{t+1} − Y^{t+1}‖_F ≤ ‖X^∗ − Y^{t+1}‖_F by virtue of the SVD step, which makes X^{t+1} the best rank-r approximation to Y^{t+1} in terms of the Frobenius norm. The third step simply rearranges things in the reverse order of the way they were arranged in the first step, the fourth step uses the matrix RIP property, and the fifth step makes elementary manipulations. This, upon rearrangement, and using ‖A(X^t − X^∗)‖₂² = 2f(X^t), gives us

f(X^{t+1}) ≤ (2δ_{2r}/(1 − δ_{2r}))·f(X^t).

Since δ_{2r} < 1/3, the factor 2δ_{2r}/(1 − δ_{2r}) is strictly less than 1, so f(X^t) decreases at a linear rate from its initial value f(X^1) = (1/2)·‖y‖₂², which establishes the intermediate claim and finishes the proof.

Algorithm 10 AltMin for Matrix Completion (AM-MC)
Input: Matrix A ∈ R^{m×n} of rank r observed at entries in the set Ω, sampling probability p, stopping time T
Output: A matrix X̂ with rank at most r
1: Partition Ω into 2T + 1 sets Ω₀, Ω₁, . . . , Ω_{2T} uniformly and randomly
2: U^1 ← SVD((1/p)·Π_{Ω₀}(A), r), the top r left singular vectors of (1/p)·Π_{Ω₀}(A)
3: for t = 1, 2, . . . , T do
4:   V^{t+1} ← arg min_{V∈R^{n×r}} ‖Π_{Ω_t}(U^t V^⊤ − A)‖_F²
5:   U^{t+1} ← arg min_{U∈R^{m×r}} ‖Π_{Ω_{T+t}}(U (V^{t+1})^⊤ − A)‖_F²
6: end for
7: return U^{T+1} (V^{T+1})^⊤
One can, in principle, apply the SVP technique to the matrix completion problem as well. However, on the LRMC problem, SVP is outperformed by gAM-style approaches, which we study next. Although the superior performance of gAM on the LRMC problem was well documented empirically, it took some time before a theoretical understanding could be obtained. This was first achieved in the works of Keshavan [2012], Jain et al. [2013]. These results set off a long line of works that progressively improved both the algorithm, as well as its analysis.
8.7 A Low-Rank Matrix Completion Guarantee for AM-MC

The LRMC problem can be recast so that the decision variable is described in terms of two low-rank components

min_{U∈R^{m×k}, V∈R^{n×k}} ‖Π_Ω(UV^⊤ − X^∗)‖_F².   (LRMC*)
In this case, fixing either U or V reduces the above problem to a simple least squares problem for
which we have very efficient and scalable solvers. As we saw in § 4, such problems are excellent
candidates for the gAM algorithm to be applied. The AM-MC algorithm (see Algorithm 10)
applies the gAM approach to the reformulated LRMC problem. The AM-MC approach is the
choice of practitioners in the context of collaborative filtering [Chen and He, 2012, Koren et al.,
2009, Zhou et al., 2008]. However, AM-MC, like other gAM-style algorithms, does require proper
initialization and tuning.
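A compact NumPy rendition of the AM-MC alternations is given below (a sketch of ours: for brevity it reuses all observed entries in every iteration, whereas Algorithm 10 and its analysis require fresh sample sets Ω_t; it also assumes every row and column has at least one observed entry):

    import numpy as np

    def least_squares_factor(F, M_obs, mask):
        # For each column j of M_obs, fit g_j by least squares over observed rows.
        G = np.zeros((M_obs.shape[1], F.shape[1]))
        for j in range(M_obs.shape[1]):
            rows = mask[:, j]
            G[j], *_ = np.linalg.lstsq(F[rows], M_obs[rows, j], rcond=None)
        return G

    def am_mc(A_obs, mask, r, T=25):
        # A_obs holds observed entries (zeros elsewhere); mask is the boolean Omega.
        p = mask.mean()                                  # sampling probability estimate
        U = np.linalg.svd(A_obs / p, full_matrices=False)[0][:, :r]   # initialization
        for _ in range(T):
            V = least_squares_factor(U, A_obs, mask)     # V-update (step 4)
            U = least_squares_factor(V, A_obs.T, mask.T) # U-update (step 5)
        return U @ V.T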
Initialization: We will now show that the initialization step (step 2 in Algorithm 10) provides a point (u¹, v¹) which is at most a constant c > 0 distance away from (u^∗, v^∗). To show this, we need a Bernstein-style argument, which we provide here for the rank-1 case.
Theorem 8.2. [Tropp, 2012, Theorem 1.6] Consider a finite sequence {Z_k} of independent random matrices of dimension m × n. Assume each matrix satisfies E[Z_k] = 0 and ‖Z_k‖₂ ≤ R almost surely, and denote

σ² := max{ ‖Σ_k E[Z_k Z_k^⊤]‖₂, ‖Σ_k E[Z_k^⊤ Z_k]‖₂ }.

Then, for all t ≥ 0,

P[‖Σ_k Z_k‖₂ ≥ t] ≤ (m + n)·exp(−(t²/2)/(σ² + Rt/3)).
Below we apply this inequality to analyze the initialization step for AM-MC in the rank-
1 case. We point the reader to [Recht, 2011] and [Keshavan et al., 2010] for a more precise
argument analyzing the initialization step in the general rank-r case.
Theorem 8.3. For a rank-one matrix A satisfying the µ-incoherence property, let the observed samples Ω be generated with sampling probability p as described above. Let u¹, v¹ be the singular vectors of (1/p)·Π_Ω(A) corresponding to its largest singular value. Then for any ε > 0, if p ≥ (45µ²·log(m+n))/(ε²·min{m,n}), then with probability at least 1 − 1/(m+n)^{10},

‖(1/p)·Π_Ω(A) − A‖₂ ≤ ε,  ‖u¹ − u^∗‖₂ ≤ ε,  ‖v¹ − v^∗‖₂ ≤ ε.

Proof. Notice that the statement of the theorem essentially says that once enough entries in the matrix have been observed (as dictated by the requirement p ≥ (45µ²·log(m+n))/(ε²·min{m,n})), an SVD step on the rescaled incomplete matrix will yield components u¹, v¹ that are very close to the components u^∗, v^∗ of the complete matrix. Moreover, since u^∗, v^∗ are incoherent by assumption, the estimated components u¹, v¹ will be so too.
To apply Theorem 8.2 to prove this result, we will first express (1/p)·Π_Ω(A) as a sum of random matrices. We rewrite (1/p)·Π_Ω(A) = (1/p)·Σ_ij δ_ij A_ij e_i e_j^⊤ = Σ_ij W_ij, where δ_ij = 1 if (i,j) ∈ Ω and 0 otherwise. Note that the Bernoulli sampling model assures us that the random variables W_ij = (1/p)·δ_ij A_ij e_i e_j^⊤ are independent and that E[δ_ij] = p. This gives us E[W_ij] = A_ij e_i e_j^⊤. Note that Σ_ij A_ij e_i e_j^⊤ = A.

The matrices Z_ij = W_ij − A_ij e_i e_j^⊤ shall serve as our random matrices in the application of Theorem 8.2. Clearly E[Z_ij] = 0. We also have max_ij ‖W_ij‖₂ ≤ (1/p)·max_ij |A_ij| ≤ µ²/(p√mn) due to the incoherence assumption. Applying the triangle inequality gives us max_ij ‖Z_ij‖₂ ≤ max_ij ‖W_ij‖₂ + max_ij ‖A_ij e_i e_j^⊤‖₂ ≤ (1 + 1/p)·(µ²/√mn) ≤ 2µ²/(p√mn).

Moreover, as A_ij = u*_i·v*_j and ‖v^∗‖₂ = 1, we have E[Σ_ij W_ij W_ij^⊤] = (1/p)·Σ_i Σ_j A_ij²·e_i e_i^⊤ = (1/p)·Σ_i (u*_i)²·e_i e_i^⊤. Due to the incoherence bound ‖u^∗‖_∞ ≤ µ/√m, we get ‖E[Σ_ij W_ij W_ij^⊤]‖₂ ≤ µ²/(p·m), which can be shown to give us

‖Σ_ij E[Z_ij Z_ij^⊤]‖₂ ≤ (1/p − 1)·(µ²/m) ≤ µ²/(p·m).

Similarly, we can also get ‖Σ_ij E[Z_ij^⊤ Z_ij]‖₂ ≤ µ²/(p·n). Now using Theorem 8.2 gives us, with probability at least 1 − δ,

‖(1/p)·Π_Ω(A) − A‖₂ ≤ (2µ²/(3p√mn))·log((m+n)/δ) + √((µ²/(p·min{m,n}))·log((m+n)/δ)).

If p ≥ (45µ²·log(m+n))/(ε²·min{m,n}), we have, with probability at least 1 − 1/(m+n)^{10},

‖(1/p)·Π_Ω(A) − A‖₂ ≤ ε.

The proof now follows by applying the Davis-Kahan inequality [Golub and Loan, 1996] with the above bound. It can be shown [Jain and Netrapalli, 2015] that the vectors recovered as a result of this initialization are incoherent as well.
Linear Convergence: We will now show that, given the initialization above, the AM-MC procedure converges to the true solution at a linear rate. This will involve a few intermediate results, such as showing that the alternation steps preserve incoherence. Since Theorem 8.3 shows that u¹ is 2µ-incoherent, this will establish the incoherence of all future iterates. Preserving incoherence will be crucial in showing the next result, namely that successive iterates get increasingly close to the optimum. Put together, these will establish the convergence result. First, recall that in the t-th iteration of the AM-MC algorithm, v^{t+1} is updated as

v^{t+1} = arg min_v Σ_ij δ_ij·(u_i^t·v_j − u*_i·v*_j)²,

which gives us

v_j^{t+1} = (Σ_i δ_ij·u*_i·u_i^t)/(Σ_i δ_ij·(u_i^t)²) · v*_j.   (8.1)

Note that this means that if u^t = u^∗, then v^{t+1} = v^∗. Also note that if δ_ij = 1 for all (i,j), which happens when the sampling probability satisfies p = 1, we have v^{t+1} = (⟨u^t, u^∗⟩/‖u^t‖₂²)·v^∗. This is reminiscent of the power method used to recover the leading singular vectors of a matrix. Indeed, if we let ũ = u^t/‖u^t‖₂, we get ‖u^t‖₂·v^{t+1} = ⟨ũ, u^∗⟩·v^∗ if p = 1. This allows us to rewrite the update (8.1) as a noisy power update:

‖u^t‖₂ · v^{t+1} = ⟨ũ, u^∗⟩·v^∗ − B^{-1}·(⟨ũ, u^∗⟩·B − C)·v^∗,   (8.2)

where B, C ∈ R^{n×n} are diagonal matrices with B_jj = (1/p)·Σ_i δ_ij·(ũ_i)² and C_jj = (1/p)·Σ_i δ_ij·ũ_i·u*_i. The following two lemmata show that if u^t is 2µ-incoherent and if p is large enough, then: a) v^{t+1} is also 2µ-incoherent, and b) the angular distance between v^{t+1} and v^∗ decreases as compared to that between u^t and u^∗. The following lemma will aid the analysis.
Lemma 8.4. Suppose a, b ∈ R^n are two fixed µ-incoherent unit vectors. Also suppose δ_i, i ∈ [n] are i.i.d. Bernoulli random variables such that δ_i = 1 with probability p and 0 otherwise. Then, for any ε > 0, if p > (27µ²·log n)/(ε²·n), then with probability at least 1 − 1/n^{10},

|(1/p)·Σ_i δ_i·a_i·b_i − ⟨a, b⟩| ≤ ε.

Proof. Define Z_i = (δ_i/p − 1)·a_i·b_i. Using the incoherence of the vectors, we get E[Z_i] = 0, Σ_{i=1}^n E[Z_i²] = (1/p − 1)·Σ_{i=1}^n (a_i·b_i)² ≤ µ²/(pn) since ‖b‖₂ = 1, and |Z_i| ≤ µ²/(pn) almost surely. Applying the Bernstein inequality gives us

P[|(1/p)·Σ_i δ_i·a_i·b_i − ⟨a, b⟩| > t] ≤ exp(−3pnt²/(6µ² + 2µ²·t)),

and setting t = ε together with the assumed lower bound on p yields the claimed high-probability bound.
Lemma 8.5. With probability at least min{1 − 1/n^{10}, 1 − 1/m^{10}}, if a pair of iterates (u^t, v^t) in the execution of the AM-MC procedure is 2µ-incoherent, then so is the next pair of iterates (u^{t+1}, v^{t+1}).

Proof. Since ‖ũ‖₂ = 1, using Lemma 8.4 tells us that with high probability, for all j, we have |B_jj − 1| ≤ ε as well as |C_jj − ⟨ũ, u^∗⟩| ≤ ε. Also, using the triangle inequality, we get ‖u^t‖₂ ≥ 1 − ε. Using these and the incoherence of v^∗ in the update equation (8.2) for v^{t+1}, we have

|v_j^{t+1}| ≤ (1/(1−ε))·|⟨ũ, u^∗⟩·v*_j| + (1/(1−ε))·(1/B_jj)·|(⟨ũ, u^∗⟩·B_jj − C_jj)·v*_j|
           ≤ (1/(1−ε)²)·(|⟨ũ, u^∗⟩| + |⟨ũ, u^∗⟩·(1+ε) − (⟨ũ, u^∗⟩ − ε)|)·(µ/√n)
           ≤ ((1 + 2ε)/(1−ε)²)·(µ/√n).

For ε < 1/6 we have (1 + 2ε)/(1−ε)² ≤ 2, and the result now holds.
We note that whereas Lemma 8.4 is proved for fixed vectors, we seem to have inappropriately applied it to ũ in the proof of Lemma 8.5, even though ũ is not a fixed vector – it depends on the randomness used in sampling the entries of the matrix revealed to the algorithm. However, notice that the AM-MC procedure in Algorithm 10 uses fresh samples Ω_t and Ω_{T+t} for each iteration. This ensures that ũ does behave like a fixed vector with respect to Lemma 8.4.
Lemma 8.6. For any ε > 0, if p > (80µ²·log(m+n))/(ε²·min{m,n}) and u^t is 2µ-incoherent, the next iterate v^{t+1} satisfies

1 − ⟨v^{t+1}/‖v^{t+1}‖₂, v^∗⟩² ≤ (ε/(1−ε)³)·(1 − ⟨u^t/‖u^t‖₂, u^∗⟩²).

Similarly, for any 2µ-incoherent iterate v^{t+1}, the next iterate satisfies

1 − ⟨u^{t+1}/‖u^{t+1}‖₂, u^∗⟩² ≤ (ε/(1−ε)³)·(1 − ⟨v^{t+1}/‖v^{t+1}‖₂, v^∗⟩²).
Proof. Using the modified form (8.2) of the update for v^{t+1}, we get, for any unit vector v_⊥ such that ⟨v_⊥, v^∗⟩ = 0,

⟨‖u^t‖₂·v^{t+1}, v_⊥⟩ = ⟨v_⊥, B^{-1}·(⟨ũ, u^∗⟩·B − C)·v^∗⟩
                     ≤ ‖B^{-1}‖₂·‖(⟨ũ, u^∗⟩·B − C)·v^∗‖₂
                     ≤ (1/(1−ε))·‖(⟨ũ, u^∗⟩·B − C)·v^∗‖₂,

where the last step follows from an application of Lemma 8.4. To bound the remaining term, let Z_ij = (1/p)·δ_ij·(⟨ũ, u^∗⟩·(ũ_i)² − ũ_i·u*_i)·v*_j·e_j ∈ R^n. Clearly Σ_{i=1}^m Σ_{j=1}^n Z_ij = (⟨ũ, u^∗⟩·B − C)·v^∗. Note that due to fresh samples being used by Algorithm 10 at every step, the vector ũ appears as a constant vector to the random variables δ_ij. Given this, note that

E[Σ_{i=1}^m Z_ij] = Σ_{i=1}^m (⟨ũ, u^∗⟩·(ũ_i)² − ũ_i·u*_i)·v*_j·e_j
                  = (⟨ũ, u^∗⟩·‖ũ‖₂² − ⟨ũ, u^∗⟩)·v*_j·e_j = 0,

since ‖ũ‖₂ = 1. Thus E[Σ_ij Z_ij] = 0 as well. Now, we have max_i (ũ_i)² = (1/‖u^t‖₂²)·max_i (u_i^t)² ≤ 4µ²/(m·‖u^t‖₂²) ≤ 4µ²/(m·(1−ε)) since ‖u^t − u^∗‖₂ ≤ ε. This allows us to bound

‖Σ_ij E[Z_ij^⊤ Z_ij]‖ = (1/p)·Σ_{i=1}^m Σ_{j=1}^n (⟨ũ, u^∗⟩·(ũ_i)² − ũ_i·u*_i)²·(v*_j)²
                      = (1/p)·Σ_{i=1}^m (ũ_i)²·(⟨ũ, u^∗⟩·ũ_i − u*_i)²
                      ≤ (4µ²/(p·m·(1−ε)))·Σ_{i=1}^m (⟨ũ, u^∗⟩²·(ũ_i)² + (u*_i)² − 2·⟨ũ, u^∗⟩·ũ_i·u*_i)
                      ≤ (8µ²/(p·m))·(1 − ⟨ũ, u^∗⟩²),

where we set ε = 0.5 in the last step. In the same way we can show ‖Σ_ij E[Z_ij Z_ij^⊤]‖ ≤ (8µ²/(p·m))·(1 − ⟨ũ, u^∗⟩²) as well. Using a similar argument we can show ‖Z_ij‖₂ ≤ (4µ²/(p√mn))·√(1 − ⟨ũ, u^∗⟩²). Applying the Bernstein inequality (Theorem 8.2) now tells us that for any ε > 0, if p > (80µ²·log(m+n))/(ε²·min{m,n}), then with probability at least 1 − 1/n^{10}, we have

‖(⟨ũ, u^∗⟩·B − C)·v^∗‖₂ ≤ ε·√(1 − ⟨ũ, u^∗⟩²).

Since ‖u^t‖₂ ≥ 1 − ε is guaranteed by the initialization step, we now get

⟨v^{t+1}, v_⊥⟩ ≤ (ε/(1−ε)²)·√(1 − ⟨ũ, u^∗⟩²).

Let v_⊥^{t+1} and v_∥^{t+1} be the components of v^{t+1} perpendicular and parallel to v^∗. Then the above guarantees that ‖v_⊥^{t+1}‖₂ ≤ (ε/(1−ε)²)·√(1 − ⟨ũ, u^∗⟩²). This gives us, upon applying the Pythagoras theorem,

‖v^{t+1}‖₂² = ‖v_⊥^{t+1}‖₂² + ‖v_∥^{t+1}‖₂² ≤ (ε²/(1−ε)⁴)·(1 − ⟨ũ, u^∗⟩²) + ‖v_∥^{t+1}‖₂².

Since ‖v_∥^{t+1}‖₂ = ⟨v^{t+1}, v^∗⟩, and ‖v^{t+1}‖₂ ≥ 1 − ε as ‖v^{t+1} − v^∗‖₂ ≤ ε due to the initialization, rearranging the terms gives us the result.
8.8 Other Popular Techniques for Matrix Recovery

As was the case with sparse recovery, the most popular alternative to the non-convex techniques discussed above is relaxation: the rank objective in ((ARM)) is replaced with the nuclear norm, giving

min_X ‖X‖_∗,  s.t. A(X) = y,

where the nuclear norm ‖X‖_∗ of a matrix X is the sum of its singular values. The nuclear norm is known to provide the tightest convex envelope of the rank function, just as the ℓ₁ norm provides a relaxation of the sparsity norm ‖·‖₀ [Recht et al., 2010]. Similar to sparse recovery, under matrix-RIP settings, these relaxations can be shown to offer exact recovery [Recht et al., 2010, Hastie et al., 2016].
Also similar to sparse recovery, there exist pursuit-style techniques for matrix recovery, most notable among them being the ADMiRA method [Lee and Bresler, 2010], which extends the orthogonal matching pursuit approach to the matrix recovery setting. However, this method can be a bit sluggish when recovering matrices of moderately large rank, since it discovers a matrix of larger and larger rank incrementally.
Figure 8.2: An empirical comparison of run-times offered by the SVT, ADMiRA and SVP methods on synthetic matrix recovery and matrix completion problems with varying matrix sizes. The SVT method due to Cai et al. [2010] is an efficient implementation of the nuclear norm relaxation technique. ADMiRA is a pursuit-style method due to Lee and Bresler [2010]. For the ARM task in Figure 8.2a, the rank of the true matrix was set to r = 5, whereas it was set to r = 2 for the LRMC task in Figure 8.2b. SVP is clearly the most scalable of the methods in both cases, whereas the relaxation-based SVT technique does not scale very well to large matrices. Note, however, that for the LRMC problem, AM-MC (not shown in the figure) outperforms even SVP. Figures adapted from [Meka et al., 2008].
Before concluding, we present the reader with an empirical comparison of these various methods. Figure 8.2 compares these methods on synthetic matrix recovery and matrix completion problems with increasing dimensionality of the (low-rank) matrix being recovered. The graphs indicate that non-convex optimization methods such as SVP are far more scalable, often by an order of magnitude, than relaxation-based methods.
8.9 Exercises
Exercise 8.1. Show that low-rank matrix recovery is NP-hard.
Hint: Take the sparse recovery problem in ((SP-REG)) and reduce it to the reformula-
tion ((ARM-2)) of the matrix recovery problem.
Exercise 8.2. Show that the matrix RIP constant is monotonic in its order i.e., if a linear
map A satisfies matrix RIP of order r with constant δr , then it also satisfies matrix RIP for all
orders r0 ≤ r with δr0 ≤ δr .
8.10 Bibliographic Notes

The SVP guarantee of Theorem 8.1 can be sharpened along the lines of the corresponding IHT result for sparse recovery (see Theorem 7.4), wherein a more “relaxed” projection step is employed by using a rank q > r while executing the SVP algorithm.
It turns out to be challenging to prove convergence results for SVP for the LRMC problem.
This is primarily due to the difficulty in establishing the matrix RIP property for the affine
transformation used in the problem. The affine map simply selects a few elements of the matrix
and reproduces them which makes establishing RIP properties harder in this setting. Specifically,
even though the initialization step can be shown to yield a matrix that satisfies matrix RIP
[Jain et al., 2010], if the underlying matrix is low-rank and incoherent, it becomes challenging
to show that RIP-ness is maintained across iterates. Jain and Netrapalli [2015] overcome this
by executing the SVP algorithm in a stage-wise fashion which resembles ADMiRA-like pursuit
approaches.
Several works have furthered the alternating minimization approach itself by reducing its
sample complexity [Hardt, 2014], giving recovery guarantees independent of the condition num-
ber of the problem [Hardt and Wootters, 2014, Jain and Netrapalli, 2015], designing universal
sampling schemes for recovery [Bhojanapalli and Jain, 2014], as well as tackling settings where
some of the revealed entries of the matrix may be corrupted [Chen et al., 2016, Cherapanamjeri
et al., 2017].
Another interesting line of work for matrix completion is that of [Ge et al., 2016, Sun and Lu, 2015], which shows that under certain regularization assumptions, the matrix completion problem does not have any non-optimal stationary points once one gets close enough to the global minimum. Thus, one can use any method for convex optimization, such as alternating minimization, gradient descent, stochastic gradient descent, and their variants we studied in § 6, once one is close enough to the global minimum.
Chapter 9

Robust Linear Regression
In this section, we will look at the problem of robust linear regression. Simply put, it is the task
of performing linear regression in the presence of adversarial outliers or corruptions. Let us take
a look at some motivating applications.
9.1 Motivating Applications

Face Recognition The task of face recognition is widely useful in areas such as biometrics and
automated image annotation. In biometrics, a fundamental problem is to identify if a new face
image belongs to that of a registered individual or not. This problem can be cast as a regression
problem by trying to fit various features of the new image to corresponding features of existing
images of the individual in the registered database. More specifically, assume that images are
represented as n-dimensional feature vectors say, using simple pixel-based features. Also assume
that there already exist p images of the person in the database.
Our task is to represent the new image x_t ∈ R^n in terms of the database images X = [x₁, . . . , x_p] ∈ R^{n×p} of that person. A nice way to do this is to perform a linear interpolation as follows:

min_{w∈R^p} ‖x_t − Xw‖₂² = Σ_{i=1}^n (x_t^i − X^i w)².

If the person is genuine, then there will exist a combination w^∗ such that for all i, we have x_t^i ≈ X^i w^∗, i.e., all features can be faithfully reconstructed. Thus, the fit will be good and we will admit the person. However, the same becomes problematic if the new image has occlusions.
For example, the person may be genuine but wearing a pair of sunglasses or sporting a newly grown beard. In such cases, some of the pixels x_t^i will appear corrupted, causing us to get a poor fit and resulting in a false alarm. More specifically, we now have

x_t^i = X^i w^∗ + b*_i,

where b*_i = 0 on uncorrupted pixels but can take abnormally large and unpredictable values for corrupted pixels, such as those corresponding to the sunglasses. Being able to still correctly identify the person involves computing the least squares fit in the presence of such corruptions.
Figure 9.1: A corrupted image y can be interpreted as a combination of a clean image y∗ and
a corruption mask b∗ i.e., y = y∗ + b∗ . The mask encodes the locations of the corrupted pixels
as well as the values of the corruptions. The clean image can be (approximately) recovered as
an affine combination of existing images in a database as y∗ ≈ Xw∗ . Face reconstruction and
recognition in such a scenario constitutes a robust regression problem. Note that the corruption
mask b∗ is sparse since only a few pixels are corrupted. Images courtesy the Yale Face Database
B.
The challenge is to do this without requiring any manual effort to identify the locations of the
corrupted pixels. Figure 9.1 depicts this problem setting visually.
Time Series Analysis This is a problem that has received much independent attention in
statistics and signal processing due to its applications in modeling sequence data such as weather
data, financial data, and DNA sequences. However, the underlying problem is similar to that
of regression. A sequence of timestamped observations {yt } for t = 0, 1, . . . are made which
constitute the time series. Note that the ordering of the samples is critical here.
The popular auto-regressive (AR) time series model uses a generative mechanism wherein the observation y_t at time t is obtained as a fixed linear combination of the p previous observations, plus some noise:

y_t = Σ_{i=1}^p w*_i·y_{t−i} + η_t.

Collecting the previous p observations into a vector x_t = [y_{t−1}, . . . , y_{t−p}]^⊤, this can be written as the familiar linear model

y_t = x_t^⊤ w^∗ + η_t.
It is not uncommon to encounter situations where the time series experiences gross corruptions. Examples include observation or sensing errors, or unmodeled factors such as dips and upheavals in stock prices due to political or socio-economic events. Thus, we have

y_t = x_t^⊤ w^∗ + η_t + b*_t,
where b∗t can take unpredictable values for corrupted time instances and 0 otherwise. In time
series literature, two corruption models are popular. In the additive model, observations at time
steps subsequent to a corruption are constructed using the uncorrupted values i.e., b∗t does not
influence the values yτ for τ > t. Of course, observers may detect the corruption at time t but
the underlying time series goes on as though nothing happened.
However, in the innovative model, observations at time instances subsequent to a corruption
use the corrupted value i.e., b∗t is involved in constructing the values yτ for τ = t + 1, . . . , t + p.
Innovative corruptions are simpler to handle: although the observation at the moment of the corruption, i.e., y_t, appears to deviate from that predicted by the base model, i.e., x_t^⊤ w^∗, subsequent observations fall in line with the predictions once more (unless there are more corruptions down the line). In the additive model, however, observations can seem to deviate from the predictions of the base model for several iterations. In particular, y_τ can disagree with x_τ^⊤ w^∗ for times τ = t, t + 1, . . . , t + p, even if there is only a single corruption at time t.
A time series analysis technique which seeks to study the “usual” behavior of the model
involved, such as stock prices, might wish to exclude such aberrations, whether additive or
innovative. However it is unreasonable, as well as error prone, to expect manual exclusion of
such corruptions which motivates the problem of robust time series analysis. Note that time
series analysis is a more challenging problem than regression since the “covariates” in this case
xt , xt+1 , . . . are heavily correlated with each other as they share a large number of coordinates
whereas in regression they are usually assumed to be independent.
Algorithm 11 AltMin for Robust Regression (AM-RR)
Input: Data X, y, number of corruptions k
Output: An accurate model ŵ ∈ R^p
1: w^1 ← 0, S₁ = [1 : n − k]
2: for t = 1, 2, . . . do
3:   w^{t+1} ← arg min_{w∈R^p} Σ_{i∈S_t} (y_i − x_i^⊤ w)²
4:   S_{t+1} ← arg min_{|S|=n−k} Σ_{i∈S} (y_i − x_i^⊤ w^{t+1})²
5: end for
6: return w^t
Formally, suppose the responses are generated as y_i = x_i^⊤ w^∗ + b*_i, where the variable b*_i encodes the additional corruption introduced into the i-th response. Our goal is to take a set of n (possibly) corrupted data points (x_i, y_i)_{i=1}^n and recover the underlying parameter vector w^∗, i.e., to solve

min_{w∈R^p, b∈R^n: ‖b‖₀≤k} ‖y − Xw − b‖₂².   (ROB-REG)
The variables b*_i can be unbounded in magnitude and of arbitrary sign. However, we assume that only a few data points are corrupted, i.e., the vector b^∗ = [b*₁, b*₂, . . . , b*_n] is sparse: ‖b^∗‖₀ ≤ k. Indeed, it is impossible (see Exercise 9.1) to recover the model w^∗ if more than half the points are corrupted, i.e., if k ≥ n/2. A worthy goal is to develop algorithms that can tolerate as large a value of k as possible. We will study how two non-convex optimization techniques, namely gAM and gPGD, can be used to solve this problem. We point to other approaches, as well as extensions such as robust sparse recovery, in the bibliographic notes.
9.3 Robust Regression via Alternating Minimization

Observe that ((ROB-REG)) reduces to an ordinary least squares problem in w once the corruption vector b is fixed, and admits a simple closed-form solution in b once w is fixed. This gives us a direct way of applying the gAM approach to this problem, as outlined in the AM-RR algorithm (Algorithm 11). The work of Bhatia et al. [2015] showed that this technique and its variants offer scalable solutions to the robust regression problem.
In order to execute the gAM protocol, AM-RR maintains a model estimate w^t and an active set S_t ⊂ [n] of points that are deemed clean at the moment. At every time step, true to the gAM philosophy, AM-RR first fixes the active set and updates the model, and then fixes the model and updates the active set. The first step turns out to be nothing but least squares over the active set. For the second step, it is easy to see that the optimal solution is achieved simply by taking the n − k data points with the smallest residuals (by magnitude) with respect to the updated model and designating them to be the active set.
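Both alternations of AM-RR are one-liners in NumPy, as the sketch below illustrates (our own minimal rendition of Algorithm 11; np.argpartition is our choice for selecting the n − k smallest residuals):

    import numpy as np

    def am_rr(X, y, k, iters=50):
        n, p = X.shape
        S = np.arange(n - k)            # initial active set S_1 = [1 : n-k]
        w = np.zeros(p)
        for _ in range(iters):
            # Model update: least squares over the points currently deemed clean.
            w, *_ = np.linalg.lstsq(X[S], y[S], rcond=None)
            # Active-set update: keep the n-k points with the smallest residuals.
            residuals = np.abs(y - X @ w)
            S = np.argpartition(residuals, n - k - 1)[: n - k]
        return w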
Definition 9.1 (Subset Strong Convexity/Smoothness Property [Bhatia et al., 2015]). A matrix X ∈ R^{n×p} is said to satisfy the α_k-subset strong convexity (SSC) property and the β_k-subset strong smoothness (SSS) property of order k if for all sets S ⊂ [n] of size |S| ≤ k, we have, for all v ∈ R^p,

α_k·‖v‖₂² ≤ ‖X_S v‖₂² ≤ β_k·‖v‖₂².
The SSC/SSS properties require that the design matrix formed by taking any subset of k
points from the data set of n points act as an approximate isometry on all p dimensional points.
These properties are related to the traditional RSC/RSS properties and it can be shown (see
for example, [Bhatia et al., 2015]) that RIP-inducing distributions over matrices (see § 7.6)
also produce matrices that satisfy the SSC/SSS properties, with high probability. However, it
is interesting to note that whereas the RIP definition is concerned with column subsets of the
design matrix, SSC/SSS concerns itself with row subsets.
The nature of the SSC/SSS properties is readily seen to be very appropriate for AM-RR to succeed. Since the algorithm uses only a subset of data points to estimate the model vector, it is essential that smaller subsets of data points of size n − k (in particular the true subset of clean points S^∗) also allow the model to be recovered. This is equivalent (see Exercise 9.2) to requiring that the design matrices formed by smaller subsets of data points not identify distinct model vectors. This is exactly what the SSC property demands.
9.4 A Robust Recovery Guarantee for AM-RR

Given this, we can prove the following convergence guarantee for the AM-RR algorithm. The reader will notice that the algorithm, despite being a gAM-style algorithm, does not require precise and careful initialization of the model and active set. This is in stark contrast to other gAM-style approaches we have seen so far, namely EM and AM-MC, both of which demanded careful initialization.
Theorem 9.1. Let X ∈ R^{n×p} satisfy the SSC property at order n − k with parameter α_{n−k} and the SSS property at order k with parameter β_k, such that β_k/α_{n−k} < 1/(√2+1). Let w^∗ ∈ R^p be an arbitrary model vector and y = Xw^∗ + b^∗, where ‖b^∗‖₀ ≤ k, i.e., b^∗ is a sparse vector of possibly unbounded corruptions. Then AM-RR yields an ε-accurate solution ‖w^t − w^∗‖₂ ≤ ε in no more than O(log(‖b^∗‖₂/ε)) steps.
Proof. Let r^t = y − Xw^t denote the vector of residuals at time t, let C_t = (X_{S_t})^⊤ X_{S_t}, and let S^∗ denote the set of clean points, i.e., the complement of supp(b^∗). The model update step of AM-RR solves a least squares problem over the active set, ensuring

w^{t+1} = C_t^{-1}·(X_{S_t})^⊤ y_{S_t} = w^∗ + C_t^{-1}·(X_{S_t})^⊤ b*_{S_t}.

The residuals with respect to this new model can be computed as

r^{t+1} = y − Xw^{t+1} = b^∗ − X·C_t^{-1}·(X_{S_t})^⊤ b*_{S_t}.

However, the active-set update step selects the set with the smallest residuals, in particular ensuring that

‖r^{t+1}_{S_{t+1}}‖₂² ≤ ‖r^{t+1}_{S^∗}‖₂².

Plugging the expression for r^{t+1} into both sides of this inequality, using b*_{S^∗} = 0, and using the fact that for any matrix X and vector v we have

‖X_S v‖₂² − ‖X_T v‖₂² = ‖X_{S∖T} v‖₂² − ‖X_{T∖S} v‖₂² ≤ ‖X_{S∖T} v‖₂²,

gives us, upon some simplification,

‖b*_{S_{t+1}}‖₂² ≤ ‖X_{S^∗∖S_{t+1}}·C_t^{-1}·(X_{S_t})^⊤ b*_{S_t}‖₂² − 2·(b*_{S_{t+1}})^⊤ X_{S_{t+1}}·C_t^{-1}·(X_{S_t})^⊤ b*_{S_t}
               ≤ (β_k/α_{n−k})²·‖b*_{S_t}‖₂² + 2·(β_k/α_{n−k})·‖b*_{S_{t+1}}‖₂·‖b*_{S_t}‖₂,

where the last step follows from an application of the SSC/SSS properties, noticing that |S^∗∖S_{t+1}| ≤ k and that b*_{S_t} and b*_{S_{t+1}} are k-sparse vectors since b^∗ itself is a k-sparse vector. Solving the above quadratic inequality gives us

‖b*_{S_{t+1}}‖₂ ≤ (√2 + 1)·(β_k/α_{n−k})·‖b*_{S_t}‖₂.

The above result proves that in t = O(log(‖b^∗‖₂/ε)) iterations, the alternating minimization procedure will identify an active set S_t such that ‖b*_{S_t}‖₂ ≤ ε. It is easy to see (Exercise 9.3) that a least squares step on this active set will yield a model satisfying

‖w^t − w^∗‖₂ = ‖C_t^{-1}·(X_{S_t})^⊤ b*_{S_t}‖₂ ≤ (β_k/α_{n−k})·ε ≤ ε,

since β_k/α_{n−k} < 1. This concludes the convergence guarantee.
The crucial assumption in the previous result is the requirement β_k/α_{n−k} < 1/(√2+1). Clearly, as k → 0, we have β_k → 0, but if the matrix X is well conditioned, we still have α_{n−k} > 0. Thus, for small enough k, it is assured that we will have β_k/α_{n−k} < 1/(√2+1). The point at which this stops holding is the so-called breakdown point of the algorithm – it is the largest number k such that the algorithm can tolerate k possibly adversarial corruptions and yet guarantee recovery.

Note that the quantity κ_k = β_k/α_{n−k} acts as the effective condition number of the problem. It plays the same role as the condition number did in the analyses of the gPGD and IHT algorithms. It can be shown (see [Bhatia et al., 2015]) that for RIP-inducing distributions, AM-RR can tolerate k = Ω(n) corruptions.

For the specific case of the design matrix being generated from a Gaussian distribution, it can be shown that we have α_{n−k} = Ω(n − k) and β_k = O(k). This in turn can be used to show that we have β_k/α_{n−k} < 1/(√2+1) whenever k ≤ n/70. This means that AM-RR can tolerate up to n/70 corruptions when the design matrix is Gaussian. This indicates a high degree of robustness in the algorithm, since these corruptions can be completely adversarial in terms of their location as well as their magnitude. Note that AM-RR is able to ensure this without requiring any specific initialization.
9.5 Alternating Minimization via Gradient Updates

Similar to the gradient-based EM heuristic we looked at in § 5.5, we can make the alternating minimization process in AM-RR much cheaper by executing a gradient step for the alternations. More specifically, we can execute step 3 of AM-RR as

w^{t+1} ← w^t − η · Σ_{i∈S_t} (x_i^⊤ w^t − y_i)·x_i,

for some step size parameter η.
for some step size parameter η. It can be shown (see [Bhatia et al., 2015]) that this process
enjoys the same linear rate of convergence as AM-RR. However, notice that both alternations
(model update as well as active set update) in this gradient-descent version can be carried out
in (near-)linear time. This is in contrast with AM-RR which takes super-quadratic time in each
alternation to discover the least squares solution. In practice, this makes the gradient version
much faster (often by an order of magnitude) as compared to the fully corrective version.
However, once we have obtained a reasonably good estimate of S∗ , it is better to execute
the least squares solution to obtain the final solution in a single stroke. Thus, the gradient and
fully corrective steps can be mixed together to great effect. In practice, such hybrid techniques
offer the fastest convergence. We refer the reader to § 9.7 for a brief discussion on this and to
[Bhatia et al., 2015] for details.
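In code, only the model-update line of the AM-RR sketch changes (an illustration of ours; the step size eta must be tuned, e.g., to the order of 1/‖X_S‖₂²):

    import numpy as np

    def am_rr_gradient_step(X, y, w, S, eta):
        # Replaces the exact least-squares solve in step 3 of AM-RR with a
        # single gradient step on the least-squares objective over the set S.
        return w - eta * (X[S].T @ (X[S] @ w - y[S]))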
9.6 Robust Regression via Projected Gradient Descent

The gPGD approach can also be applied to this problem by eliminating the model variable w altogether. For any fixed corruption estimate b, the optimal model is simply the least squares solution w(b) = (X^⊤X)^{-1}·X^⊤(y − b). Substituting this back into ((ROB-REG)) leaves a problem in b alone,

min_{b∈R^n: ‖b‖₀≤k} f(b) := ‖(I − P_X)(y − b)‖₂²,   (ROB-REG-2)

where P_X = X(X^⊤X)^{-1}X^⊤. The above calculation shows that ((ROB-REG-2)) is an equivalent formulation of the robust regression problem ((ROB-REG)), to which we can apply gPGD (Algorithm 2), since it now resembles a sparse recovery problem! This problem enjoys the restricted isometry property (see Exercise 9.4) whenever the design matrix X is sampled from an RIP-inducing distribution (see § 7.6). This shows that an application of the gPGD technique will guarantee recovery of the optimal corruption vector b^∗ at a linear rate. Once we have an ε-optimal estimate b̂ of b^∗, a good model ŵ can be found by solving the least squares problem

ŵ(b̂) = (X^⊤X)^{-1}·X^⊤·(y − b̂),

as before.
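The reduction can be implemented directly: project out the column space of X, run a hard-thresholding (gPGD-style) iteration on the corruption vector b, and finish with a least squares solve (a self-contained sketch of ours; the step size and iteration count are illustrative):

    import numpy as np

    def robust_regression_gpgd(X, y, k, eta=1.0, iters=100):
        n, p = X.shape
        P = X @ np.linalg.solve(X.T @ X, X.T)    # P_X = X (X^T X)^{-1} X^T
        Q = np.eye(n) - P                        # I - P_X
        b = np.zeros(n)
        for _ in range(iters):
            z = b + eta * (Q @ (y - b))          # gradient step on (1/2) f(b)
            idx = np.argsort(np.abs(z))[-k:]     # hard threshold: keep k largest
            b = np.zeros(n)
            b[idx] = z[idx]
        w = np.linalg.solve(X.T @ X, X.T @ (y - b))   # w(b) = (X^T X)^{-1} X^T (y-b)
        return w, b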
Figure 9.2: An empirical comparison of the performance offered by various approaches for robust
regression. Figure 9.2a (adapted from [Bhatia et al., 2017]) compares Extended LASSO, a state-
of-the-art relaxation method by Nguyen and Tran [2013b], AM-RR and the gPGD method from
§ 9.6 on a robust regression problem in p = 1000 dimensions with 30% data points corrupted.
Non-convex techniques such as AM-RR and gPGD are more than an order of magnitude faster,
and scale much better, than Extended LASSO. Figure 9.2b (adapted from [Bhatia et al., 2015])
compares various solvers on a robust regression problem in p = 300 dimensions with 1800
data points of which 40% are corrupted. The solvers include the gAM-style solver AM-RR,
a variant using gradient-based updates, a hybrid method (see § 9.5), and the DALM method
[Yang et al., 2013], a state-of-the-art solver for relaxed LASSO-style formulations. The hybrid
method is the fastest of all the techniques. In general, all AM-RR variants are much faster than
the relaxation-based method.
It is a simple exercise to show that if ‖b̂ − b^∗‖₂ ≤ ε, then

‖ŵ(b̂) − w^∗‖₂ ≤ O(ε/α),
where α is the RSC parameter of the problem. Note that this shows how the gPGD technique
can be applied to perform recovery not only when the parameter is sparse in the model domain
(for instance in the gene expression analysis problem), but also when the parameter is sparse
in the data domain, as in the robust regression example.
Figure 9.3: An experiment on face reconstruction using robust regression techniques. Two face
images were taken and different occlusions were applied to them. Using the model described
in § 9.1, reconstruction was attempted using both, ordinary least squares (OLS) and robust
regression (AM-RR). It is clear that AM-RR achieves far superior reconstruction of the images
and is able to correctly figure out the locations of the occlusions. Images courtesy the Yale Face
Database B.
9.8 Exercises
Exercise 9.1. Show that it is impossible to recover the model vector if a fully adaptive adversary
is able to corrupt more than half the responses, i.e., if k ≥ n/2. A fully adaptive adversary is
one that is allowed to perform corruptions after observing both the clean covariates and the
uncorrupted responses.
Hint: The adversary can make it impossible to distinguish between two models: the real model
and another one of its choosing.
Exercise 9.2. Show that the SSC property ensures that there exists no subset $S$ of the data,
$|S| \le k$, and no two distinct model vectors $v_1, v_2 \in \mathbb{R}^p$, such that $X_S v_1 = X_S v_2$.
Exercise 9.4. Show that if the design matrix $X$ satisfies the RIP, then the objective function
$f(b) = \|(I - P_X)(y - b)\|_2^2$ enjoys the RIP as well (of the same order as $X$, but with possibly
different constants).
Exercise 9.5. Prove that (ROB-REG) and (ROB-REG-2) are equivalent formulations, i.e.,
they yield the same model.
9.9 Bibliographic Notes
The problem of robust estimation has been well studied in the statistics community. Indeed,
there exist entire texts devoted to this area [Rousseeuw and Leroy, 1987, Maronna et al., 2006]
which look at robust estimators for regression and other problems. However, these methods often
involve estimators, such as the least median of squares estimator, that have exponential time
complexity.
It is notable that these infeasible estimators often have attractive theoretical properties, such
as a high breakdown point. For instance, the work of Rousseeuw [1984] shows that the least
median of squares method enjoys a breakdown point as high as n/2 − p. In contrast, AM-RR
is only able to handle n/70 errors. However, whereas the gradient-descent version of AM-RR can
be executed in near-linear time, the least median of squares method requires time exponential
in p.
There have also been relaxation-based approaches to solving robust regression and time series
problems. Chief among them are the works of Chen and Dalalyan [2012] and Chen et al.
[2013], which look at Dantzig selector methods and trimmed product techniques to
perform estimation, and the works of Wright et al. [2009] and Nguyen and Tran [2013a]. The work
of Chen et al. [2013] considers corruptions not only in the responses, but in the covariates as
well. However, these methods tend to scale poorly to very large-scale problems owing to the
non-smooth nature of the optimization problems that they end up solving. The non-convex
optimization techniques we have studied, on the other hand, require only linear time or else have
closed-form updates.
Recent years have seen the application of non-convex techniques to robust estimation. These
works can, however, be traced back to the classical work of Fischler and Bolles [1981], which
developed the RANSAC algorithm that is widely used in fields such as computer vision.
The RANSAC algorithm samples multiple candidate active sets and returns the least squares
estimate on the set with the least residual error.
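As a rough illustration of this idea, here is a RANSAC-style sketch in the spirit of, but not identical to, the procedure of Fischler and Bolles [1981]; the number of trials is a hypothetical knob.

```python
import numpy as np

def ransac_regression(X, y, k, n_trials=100, rng=None):
    """RANSAC-style sketch: fit least squares on random candidate active
    sets of size n - k and keep the fit with the least residual error
    on its own set."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    best_w, best_err = None, np.inf
    for _ in range(n_trials):
        S = rng.choice(n, size=n - k, replace=False)   # candidate active set
        w, *_ = np.linalg.lstsq(X[S], y[S], rcond=None)
        err = np.linalg.norm(y[S] - X[S] @ w)
        if err < best_err:
            best_w, best_err = w, err
    return best_w
```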
Although the RANSAC method does not enjoy strong theoretical guarantees in the face
of an adaptive adversary and a large number of corruptions, it is seen to work well
when there are very few outliers. Later works, such as that of She and Owen [2011], applied
soft-thresholding techniques to the problem, followed by the work of Bhatia et al. [2015], which
applied the alternating minimization algorithm we studied here. Bhatia et al. [2015] also looked
at the problem of robust sparse recovery.
The time series literature has also seen the application of various techniques for robust
estimation, including robust M-estimators in both the additive and innovative outlier models
[Martin and Zeh, 1978, Stockinger and Dutter, 1987], and least trimmed squares [Croux and
Joossens, 2008].
Chapter 10
Phase Retrieval
In this section, we will take a look at the phase retrieval problem, a non-convex optimization
problem with applications in several domains. We briefly mentioned this problem in § 5 when
we were discussing mixed regression. At a high level, phase retrieval is the problem of discovering
a complex signal from observations that reveal only the magnitudes of (complex) linear
measurements of that signal; the phases of the measurements are not revealed to us.
Over the reals, this reduces to solving a system of quadratic equations, which is known to be
computationally intractable in general. Fortunately, however, typical phase retrieval systems
usually require solving quadratic systems that have nice randomized structure in the
coefficients of the quadratic equations. These structures can be exploited to solve such systems
efficiently. In this section, we will look at various algorithms that achieve this.
X-ray Crystallography The goal in X-ray crystallography is to find the structure of a small
molecule by bombarding it with X-rays from various angles and measuring the trajectories and
intensities of the diffracted photons on a film. These quantities can be used to glean the inter-
nal three-dimensional structure of the electron cloud within the crystal, revealing the atomic
arrangements therein. This technique has been found to be immensely useful in imaging speci-
mens with a crystalline structure and has been historically significant in revealing the interior
composition of several compounds, both inorganic and organic.
A notable example is that of nucleic acids such as DNA, whose structure was revealed in
the seminal work of Franklin and Gosling [1953a], which immediately led Watson and Crick to
propose its double helix structure. The reader may be intrigued by the now famous Photo 51,
which provided critical support to the helical-structure theory of DNA [Franklin and Gosling,
1953b]. We refer the reader to the expository work of Lucas [2008] for a technical and historical
account of how these discoveries came to be.
possible using photonic imaging techniques due to the extremely small de Broglie wavelength
of electrons.
Coherent Diffraction Imaging (CDI) This is a widely used technique for studying nanostructures
such as nanotubes, nanocrystals and the like. A highly coherent beam of X-rays is
made incident on the object of study, and the diffracted rays are allowed to interfere to produce a
diffraction pattern, which is used to recover the structure of the object. A key differentiating
factor in CDI is the absence of any lenses to focus light onto the specimen, as opposed to other
methods such as TEM or X-ray crystallography, which use optical or electromagnetic lenses to
focus the incident beam and then refocus the diffracted beam. The absence of any lenses in
CDI is very advantageous since it results in aberration-free patterns. Moreover, the resolution
of the technique then depends only on the wavelength and other properties of the incident rays,
rather than on the material of the lenses.
Algorithm 12 Gerchberg-Saxton Alternating Minimization (GSAM)
Input: Measurement matrix $X \in \mathbb{C}^{n \times p}$, observed response magnitudes $|y| \in \mathbb{R}_+^n$, desired accuracy $\epsilon$
Output: A signal $\hat{w} \in \mathbb{C}^p$
1: Set $T \leftarrow \log(1/\epsilon)$
2: Partition $n$ data points into $T + 1$ sets $S_0, S_1, \ldots, S_T$
3: $w^0 \leftarrow \operatorname{eig}\left(\frac{1}{|S_0|} \sum_{k \in S_0} |y_k|^2 \cdot x_k x_k^\top,\ 1\right)$ //Leading eigenvector
4: for $t = 1, 2, \ldots, T$ do
5:   Phase Estimation: $\phi_k \leftarrow x_k^\top w^{t-1} / |x_k^\top w^{t-1}|$, for all $k \in S_t$
6:   Signal Estimation: $w^t \leftarrow \arg\min_{w \in \mathbb{C}^p} \sum_{k \in S_t} \left||y_k| \cdot \phi_k - x_k^\top w\right|^2$
7: end for
8: return $w^T$
techniques can be used to solve this problem. We will point to approaches using the relaxation
technique in the bibliographic notes.
Notation: We will abuse the notation $x^\top$ to denote the conjugate transpose of a complex
vector $x \in \mathbb{C}^p$, something that is usually denoted by $x^*$. A random vector $x = a + ib \in \mathbb{C}^n$ will
be said to be distributed according to the standard Gaussian distribution over $\mathbb{C}^n$, denoted
$\mathcal{N}_{\mathbb{C}}(0, I)$, if $a, b \in \mathbb{R}^n$ are independently distributed as standard (real-valued) Gaussian vectors,
i.e., $a, b \sim \mathcal{N}(0, I_n)$.
$\min_{w \in \mathbb{C}^p} \sum_k \left||y_k| \cdot \phi_k - x_k^\top w\right|^2$ over complex variables. Algorithm 12 presents the details of this Gerchberg-Saxton
Alternating Minimization (GSAM) method. To avoid correlations, the algorithm performs these
alternations on distinct sets of points at each time step. These disjoint sets can be created by
sub-sampling the overall available data.
In their original work, Gerchberg and Saxton [1972] proposed to use a random vector $w^0$
for initialization. However, the recent work of Netrapalli et al. [2013] demonstrated that a more
careful initialization, in particular the leading eigenvector of $M = \sum_k |y_k|^2 \cdot x_k x_k^\top$, is beneficial.
Such a spectral initialization leads to an initial solution that is already at most a (small) constant
distance away from the optimal solution $w^*$.
As we have seen to be the case with most gAM-style approaches, including the EM algo-
rithm, this approximately optimal initialization is crucial to allow the alternating minimization
procedure to take over and push the iterates toward the globally optimal solution.
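To make the procedure concrete, the following is a minimal NumPy sketch of GSAM under two simplifying assumptions of ours: it reuses all $n$ samples in every alternation rather than the disjoint sub-samples $S_t$ that the analysis requires, and it stores the rows of X so that X @ w computes the measurements $x_k^\top w$ in the conjugate-transpose sense fixed in the notation above.

```python
import numpy as np

def gsam(X, y_abs, iters=50):
    """Sketch of Gerchberg-Saxton alternating minimization.

    X: (n, p) complex matrix whose k-th row represents x_k^T (conjugate
    transpose), so X @ w yields the measurements; y_abs: (n,) magnitudes |y|.
    Returns an estimate of w* up to a global phase."""
    n, p = X.shape
    # Spectral initialization: leading eigenvector of (1/n) sum_k |y_k|^2 x_k x_k^T.
    M = (X.conj().T * y_abs**2) @ X / n
    _, vecs = np.linalg.eigh(M)                   # M is Hermitian; eigh sorts ascending
    w = vecs[:, -1]
    for _ in range(iters):
        z = X @ w
        phi = z / np.maximum(np.abs(z), 1e-12)    # phase estimation step
        # Signal estimation: complex least squares on the |y_k| * phi_k targets.
        w, *_ = np.linalg.lstsq(X, y_abs * phi, rcond=None)
    return w
```

The two steps inside the loop correspond directly to the phase estimation and signal estimation steps (Steps 5 and 6) of Algorithm 12.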
Notions of Convergence: Before we proceed to give convergence guarantees for the GSAM
algorithm, note that exact recovery of $w^*$ is impossible, since phase information is totally
lost. More specifically, two signals $w^*$ and $e^{i\theta} \cdot w^*$ for any $\theta \in \mathbb{R}$ will generate exactly the same
responses when phase information is eliminated. Thus, the best we can hope for is to recover
$w^*$ up to a phase shift. There are several ways of formalizing notions of convergence modulo a
phase shift. We also note that complete proofs of the convergence results would be tedious, and
hence we will only give proof sketches for them.
Linear Convergence: For this part, it is assumed that the GSAM procedure has been initialized
at $w^0$ such that $\left\|w^0 - w^*\right\|_2 \le \frac{1}{100}$. The following result (which we state without proof)
shows that the alternating procedure hereafter ensures a linear rate of convergence.
Theorem 10.1. Let $y_k = x_k^\top w^*$ for $k = 1, \ldots, n$, where $x_k \sim \mathcal{N}_{\mathbb{C}}(0, I)$ and $n \ge C \cdot p \log^3(p/\epsilon)$
for a suitably large constant $C$. Then, if the initialization satisfies $\left\|w^0 - w^*\right\|_2 \le \frac{1}{100}$, then with
probability at least $1 - 1/n^2$, GSAM outputs an $\epsilon$-accurate solution $\left\|w^T - w^*\right\|_2 \le \epsilon$ in no more
than $T = O\left(\log\frac{\|y\|_2}{\epsilon}\right)$ steps.
In practice, one can use fast approximate solvers, such as the conjugate gradient method,
to solve the least squares problem at each iteration. These take $O(np \log(1/\epsilon))$ time to solve
a least squares instance. Since $n = \widetilde{O}\left(p \log^3 p\right)$ samples are enough, the GSAM algorithm
operates with computation time at most $\widetilde{O}\left(p^2 \log^3(p/\epsilon)\right)$.
Initialization: We now establish the utility of the initialization step. The proof hinges on a
simple observation. Consider the random variable $Z = |y|^2 \cdot x x^\top$ corresponding to a randomly
chosen vector $x \sim \mathcal{N}_{\mathbb{C}}(0, I)$ and $y = x^\top w^*$. For the sake of simplicity, let us assume that $\|w^*\|_2 = 1$.
Since $x = [x_1, x_2, \ldots, x_p]^\top$ is a spherically symmetric vector in the complex space $\mathbb{C}^p$, the random
variable $x^\top w^*$ has an identical distribution to $e^{i\theta} \cdot x^\top e_1$, where $e_1 = [1, 0, 0, \ldots, 0]^\top$,
for any $\theta \in \mathbb{R}$. Using the above, it is easy to see that
$$\mathbb{E}\left[|y|^2 \cdot x x^\top\right] = \mathbb{E}\left[|x_1|^2 \cdot x x^\top\right] = 4 \cdot e_1 e_1^\top + 4 \cdot I.$$
Using a slightly more tedious calculation involving unitary transformations, we can extend the
above to show that, in general,
$$\mathbb{E}\left[|y|^2 \cdot x x^\top\right] = 4 \cdot w^* (w^*)^\top + 4 \cdot I =: D.$$
The above clearly indicates that the leading eigenvector of the matrix $D$ is along $w^*$. Now notice
that the matrix whose leading eigenvector we are interested in during initialization,
$$S := \frac{1}{|S_0|} \sum_{k \in S_0} |y_k|^2 \cdot x_k x_k^\top,$$
is simply an empirical estimate of the expectation $\mathbb{E}\left[|y|^2 \cdot x x^\top\right] = D$. Indeed, we have $\mathbb{E}[S] =
D$. Thus, it is reasonable to expect that the leading eigenvector of $S$ would also be aligned with
$w^*$. We can make this statement precise using results on the concentration of finite sums of
self-adjoint independent random matrices from [Tropp, 2012].
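The identity $\mathbb{E}\left[|y|^2 \cdot x x^\top\right] = 4 \cdot w^*(w^*)^\top + 4 \cdot I$ is also easy to verify numerically; the following sketch (with an arbitrary sample size of our choosing) draws standard complex Gaussian measurements as defined in the notation above and compares the empirical average against $D$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200_000
w_star = np.zeros(p, dtype=complex)
w_star[0] = 1.0                            # w* = e_1, so ||w*||_2 = 1

# x ~ N_C(0, I): independent standard Gaussian real and imaginary parts.
X = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
y = X.conj() @ w_star                      # y_k = x_k^T w* (conjugate-transpose sense)

# Empirical average of |y_k|^2 x_k x_k^T versus D = 4 w* (w*)^T + 4 I.
S = (X.T * np.abs(y)**2) @ X.conj() / n
D = 4 * np.outer(w_star, w_star.conj()) + 4 * np.eye(p)
print(np.linalg.norm(S - D, 2))            # spectral norm; small for large n
```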
Theorem 10.2. The spectral initialization method (Step 3 of Algorithm 12) ensures, with
probability at least $1 - 1/|S_0|^2 \ge 1 - 1/p^2$, an initialization $w^0$ such that $\left\|w^0 - w^*\right\|_2 \le c$ for any
constant $c > 0$, so long as it is executed with a randomly chosen set $S_0$ of data points of size
$|S_0| \ge C \cdot p \log p$ for a suitably large constant $C$ depending on $c$.
Proof. To make the analysis simple, we will continue to assume that $w^* = e^{i\theta} \cdot e_1$ for some
$\theta \in \mathbb{R}$. We can use Bernstein-style results for matrix concentration (for instance, see [Tropp,
2012, Theorem 1.5]) to show that for any chosen constant $c > 0$, if $n \ge C \cdot p \log p$ for a large
enough constant $C$ that depends on the constant $c$, then with probability at least $1 - 1/|S_0|^2$,
we have
$$\|S - D\|_2 \le c.$$
Note that the norm being used above is the spectral/operator norm on matrices. Given this, it
is possible to get a handle on the leading eigenvalue of $S$. Observe that since $w^0$ is the leading
eigenvector of $S$, and since we have assumed $w^* = e^{i\theta} \cdot e_1$, we have
$$\left\langle w^0, S w^0 \right\rangle \ge \left|\left\langle w^*, S w^* \right\rangle\right|$$
and then performing gradient descent (over complex variables) on this unconstrained optimization
problem. In the same work, this technique was named the Wirtinger's flow algorithm,
Algorithm 13 Wirtinger's Flow for Phase Retrieval (WF)
Input: Measurement matrix $X \in \mathbb{C}^{n \times p}$, observed response magnitudes $|y| \in \mathbb{R}_+^n$, step size $\eta$
Output: A signal $\hat{w} \in \mathbb{C}^p$
1: $w^0 \leftarrow \operatorname{eig}\left(\frac{1}{n}\sum_{k=1}^n |y_k|^2 \cdot x_k x_k^\top,\ 1\right)$ //Leading eigenvector
2: for $t = 1, 2, \ldots$ do
3:   $w^t \leftarrow w^{t-1} - 2\eta \cdot \sum_k \left(|x_k^\top w^{t-1}|^2 - |y_k|^2\right) x_k x_k^\top w^{t-1}$
4: end for
5: return $w^t$
presumably as a reference to the notion of Wirtinger derivatives, and shown to offer provable
convergence to the global optimum, just like the Gerchberg-Saxton method, when initialization
is performed using a spectral method. Algorithm 13 outlines the Wirtinger's Flow (WF) algorithm.
Note that WF offers accelerated update times. Also note that, unlike the GSAM approach,
the WF algorithm does not require sub-sampling, but it does need a step size parameter
to be chosen instead.
Starting from such an initial point, Candès et al. [2015] argue that each step of the gradient
descent procedure decreases the distance to the optimum by at least a constant (multiplicative) factor.
This allows a linear convergence result to be established for the WF procedure, similar to the
GSAM approach, but with each iteration being much less expensive: a gradient descent step
rather than the solution of a complex-valued least squares problem.
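In the same spirit as the GSAM sketch above, here is a minimal NumPy sketch of the WF update; the constant step size and the $1/n$ normalization of the gradient sum are illustrative simplifications of ours, not the schedule analyzed by Candès et al. [2015].

```python
import numpy as np

def wirtinger_flow(X, y_abs, eta=0.1, iters=500):
    """Sketch of Wirtinger's Flow (cf. Algorithm 13).

    X: (n, p) complex matrix whose k-th row represents x_k^T, so that
    X @ w computes all measurements; y_abs: (n,) magnitudes |y|."""
    n, p = X.shape
    M = (X.conj().T * y_abs**2) @ X / n
    _, vecs = np.linalg.eigh(M)                        # spectral initialization
    w = vecs[:, -1]
    for _ in range(iters):
        z = X @ w                                      # x_k^T w for all k
        coeff = (np.abs(z)**2 - y_abs**2) * z          # per-measurement weights
        w = w - 2 * eta * (X.conj().T @ coeff) / n     # gradient-style update
    return w
```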
Theorem 10.3. Let $y_k = x_k^\top w^*$ for $k = 1, \ldots, n$, where $x_k \sim \mathcal{N}_{\mathbb{C}}(0, I)$. Also, let $n \ge C \cdot p \log p$
for a suitably large constant $C$. Then, if the initialization satisfies $\left\|w^0 - w^*\right\|_2 \le \frac{1}{100}$, then with
probability at least $1 - 1/p^2$, WF outputs an $\epsilon$-accurate solution $\left\|w^T - w^*\right\|_2 \le \epsilon$ in no more
than $T = O\left(\log\frac{\|y\|_2}{\epsilon}\right)$ steps.
Candès et al. [2015] also studied a coded diffraction pattern (CDP) model, which uses measurements
$X$ that are more “practical” for X-ray crystallography style applications and are based
on a combination of random multiplicative perturbations of a standard Fourier measurement.
For such measurements, Candès et al. [2015] provided a result similar to Theorem 10.3, but
with a slightly inferior rate of convergence: the new procedure is able to guarantee an $\epsilon$-optimal
solution only after $T = O\left(p \cdot \log\frac{\|y\|_2}{\epsilon}\right)$ steps, i.e., a multiplicative factor of $p$ larger than that
required by the WF algorithm for Gaussian measurements.
This rank-one constraint was then replaced by a nuclear norm constraint, and the resulting
problem was solved as a semi-definite program (SDP). This technique was shown to achieve the
information-theoretically optimal sample complexity of $n = \Omega(p)$ (recall that the GSAM and
WF techniques require $n = \Omega(p \log p)$). However, the running time of the PhaseLift algorithm is
prohibitive at $O(np^2 + p^3)$.
In addition to the standard phase retrieval problem, several works [Netrapalli et al., 2013,
Jaganathan et al., 2013] have also studied the sparse phase retrieval problem, where the goal is
to recover a sparse signal $w^* \in \mathbb{C}^p$, with $\|w^*\|_0 \le s \ll p$, using only magnitude measurements
$|y_k| = |x_k^\top w^*|$. The best known results for such problems require $n \ge s^3 \log p$ measurements for
$s$-sparse signals. This is significantly worse than the information-theoretically optimal $O(s \log p)$
number of measurements. However, Jaganathan et al. [2013] showed that for a PhaseLift-style
technique, one cannot hope to solve the problem using fewer than $O\left(s^2 \log p\right)$ measurements.
Bibliography
Alekh Agarwal, Sahand N. Negahban, and Martin J. Wainwright. Fast global convergence of
gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):
2452–2482, 2012.
Alekh Agarwal, Animashree Anandkumar, Prateek Jain, and Praneeth Netrapalli. Learning
Sparsely Used Overcomplete Dictionaries via Alternating Minimization. SIAM Journal of
Optimization, 26(4):2775–2799, 2016.
Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding
Approximate Local Minima Faster than Gradient Descent. In Proceedings of the 49th Annual
ACM SIGACT Symposium on Theory of Computing (STOC), 2017.
Animashree Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle
points in non-convex optimization. In Proceedings of the 29th Conference on Learning Theory
(COLT), pages 81–102, 2016.
Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor
Decompositions for Learning Latent Variable Models. Journal of Machine Learning Research,
15:2773–2832, 2014.
Sanjeev Arora, Rong Ge, and Ankur Moitra. New Algorithms for Learning Incoherent and Over-
complete Dictionaries. In Proceedings of The 27th Conference on Learning Theory (COLT),
2014.
Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical Guarantees for the EM
Algorithm: From Population to Sample-based Analysis. Annals of Statistics, 45(1):77–120,
2017.
Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin. A Simple Proof of
the Restricted Isometry Property for Random Matrices. Constructive Approximation, 28(3):
253–263, 2008.
Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust Regression via Hard Thresholding.
In Proceedings of the 29th Annual Conference on Neural Information Processing Systems
(NIPS), 2015.
Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar. Consistent
Robust Regression. In Proceedings of the 31st Annual Conference on Neural Information
Processing Systems (NIPS), 2017.
Srinadh Bhojanapalli and Prateek Jain. Universal Matrix Completion. In Proceedings of the
31st International Conference on Machine Learning (ICML), 2014.
Thomas Blumensath. Sampling and Reconstructing Signals From a Union of Linear Subspaces.
IEEE Transactions on Information Theory, 57(7):4660–4671, 2011.
Jean Bourgain, Stephen Dilworth, Kevin Ford, Sergei Konyagin, and Denka Kutzarova. Explicit
constructions of RIP matrices and related problems. Duke Mathematical Journal, 159(1):145–
185, 2011.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press,
2004.
Alon Brutzkus and Amir Globerson. Globally Optimal Gradient Descent for a ConvNet with
Gaussian Inputs. In Proceedings of the 34th International Conference on Machine Learning
(ICML), 2017.
Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A Singular Value Thresholding Algo-
rithm for Matrix Completion. SIAM Journal of Optimization, 20(4):1956–1982, 2010.
Emmanuel Candès and Terence Tao. Decoding by Linear Programming. IEEE Transactions on
Information Theory, 51(12):4203–4215, 2005.
Emmanuel J. Candès. The Restricted Isometry Property and Its Implications for Compressed
Sensing. Comptes Rendus Mathematique, 346(9-10):589–592, 2008.
Emmanuel J. Candès and Xiaodong Li. Solving Quadratic Equations via PhaseLift When There
Are About as Many Equations as Unknowns. Foundations of Computational Mathematics,
14(5):1017–1026, 2014.
Emmanuel J. Candès and Benjamin Recht. Exact Matrix Completion via Convex Optimization.
Foundations of Computational Mathematics, 9(6):717–772, 2009.
Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix
completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2009.
Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable Signal Recovery from In-
complete and Inaccurate Measurements. Communications on Pure and Applied Mathematics,
59(8):1207–1223, 2006.
Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase Retrieval via Wirtinger
Flow: Theory and Algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007,
2015.
Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. “Convex Until Proven Guilty”:
Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions. In Proceedings
of the 34th International Conference on Machine Learning (ICML), 2017.
Rick Chartrand. Exact Reconstruction of Sparse Signals via Nonconvex Minimization. IEEE
Signal Processing Letters, 14(10):707–710, 2007.
Caihua Chen and Bingsheng He. Matrix Completion via an Alternating Direction Method. IMA
Journal of Numerical Analysis, 32(1):227–245, 2012.
Laming Chen and Yuantao Gu. Local and global optimality of LP minimization for sparse
recovery. In Proceedings of the IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), 2015.
Yin Chen and Arnak S. Dalalyan. Fused sparsity and robust estimation for linear models with
unknown variance. In Proceedings of the 26th Annual Conference on Neural Information
Processing Systems (NIPS), 2012.
Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust Sparse Regression under
Adversarial Corruption. In Proceedings of the 30th International Conference on Machine
Learning (ICML), 2013.
Yudong Chen, Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Matrix Completion
with Column Manipulation: Near-Optimal Sample-Robustness-Rank Tradeoffs. IEEE Trans-
actions on Information Theory, 62(1):503–526, 2016.
Yeshwanth Cherapanamjeri, Kartik Gupta, and Prateek Jain. Nearly-optimal Robust Ma-
trix Completion. In Proceedings of the 34th International Conference on Machine Learning
(ICML), 2017.
Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The
Loss Surfaces of Multilayer Networks. In Proceedings of the 18th International Conference on
Artificial Intelligence and Statistics (AISTATS), 2015.
Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Compressed Sensing and Best k-term
Approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.
Christophe Croux and Kristel Joossens. Robust Estimation of the Vector Autoregressive Model
by a Least Trimmed Squares Procedure. In Proceedings in Computational Statistics (COMP-
STAT), 2008.
Yann N. Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, and
Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-
convex optimization. In Proceedings of the 28th Annual Conference on Neural Information
Processing Systems (NIPS), pages 2933–2941, 2014.
Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood from Incom-
plete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):
1–38, 1977.
David L. Donoho, Arian Maleki, and Andrea Montanari. Message Passing Algorithms for
Compressed Sensing: I. Motivation and Construction. Proceedings of the National Academy
of Sciences USA, 106(45):18914–18919, 2009.
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient Projections
onto the `1 -Ball for Learning in High Dimensions. In Proceedings of the 25th International
Conference on Machine Learning (ICML), 2008.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLIN-
EAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:
1871–1874, 2008.
Maryam Fazel, Ting Kei Pong, Defeng Sun, and Paul Tseng. Hankel matrix rank minimization
with applications in system identification and realization. SIAM Journal on Matrix Analysis
and Applications, 34(3):946–977, 2013.
Martin A. Fischler and Robert C. Bolles. Random Sample Consensus: A Paradigm for Model
Fitting with Applications to Image Analysis and Automated Cartography. Communications
of the ACM, 24(6):381–395, 1981.
Simon Foucart. A Note on Guaranteed Sparse Recovery via `1 -minimization. Applied and
Computational Harmonic Analysis, 29(1):97–103, 2010.
Simon Foucart. Hard Thresholding Pursuit: an Algorithm for Compressive Sensing. SIAM
Journal on Numerical Analysis, 49(6):2543–2563, 2011.
Simon Foucart and Ming-Jun Lai. Sparsest solutions of underdetermined linear systems via `q -
minimization for 0 < q ≤ 1. Applied and Computational Harmonic Analysis, 26(3):395–407,
2009.
Rosalind Franklin and Raymond G. Gosling. Evidence for 2-Chain Helix in Crystalline Structure
of Sodium Deoxyribonucleate. Nature, 172:156–157, 1953a.
Rahul Garg and Rohit Khandekar. Gradient Descent with Sparsification: An iterative algorithm
for sparse recovery with restricted isometry property. In Proceedings of the 26th International
Conference on Machine Learning (ICML), 2009.
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping From Saddle Points - Online
Stochastic Gradient for Tensor Decomposition. In Proceedings of The 28th Conference on
Learning Theory (COLT), pages 797–842, 2015.
Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix Completion has No Spurious Local Minimum.
In Proceedings of the 30th Annual Conference on Neural Information Processing Systems
(NIPS), 2016.
R. W. Gerchberg and W. Owen Saxton. A Practical Algorithm for the Determination of Phase
from Image and Diffraction Plane Pictures. Optik, 35(2):237–246, 1972.
Surbhi Goel and Adam Klivans. Learning Depth-Three Neural Networks in Polynomial Time.
arXiv:1709.06010v1 [cs.DS], 2017.
Donald Goldfarb and Shiqian Ma. Convergence of Fixed-Point Continuation Algorithms for
Matrix Rank Minimization. Foundations of Computational Mathematics, 11(2):183–210, 2011.
Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins Studies in
Mathematical Sciences. The Johns Hopkins University Press, 3rd edition, 1996.
Rémi Gribonval, Rodolphe Jenatton, and Francis Bach. Sparse and Spurious: Dictionary Learn-
ing With Noise and Outliers. IEEE Transaction on Information Theory, 61(11):6298–6319,
2015.
Moritz Hardt and Mary Wootters. Fast Matrix Completion Without the Condition Number.
In Proceedings of The 27th Conference on Learning Theory (COLT), 2014.
Moritz Hardt, Raghu Meka, Prasad Raghavendra, and Benjamin Weitz. Computational limits
for matrix completion. In Proceedings of The 27th Conference on Learning Theory (COLT),
2014.
Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity:
The Lasso and Generalizations. Number 143 in Monographs on Statistics and Applied Prob-
ability. The CRC Press, 2016.
Ishay Haviv and Oded Regev. The Restricted Isometry Property of Subsampled Fourier Ma-
trices. In Bo’az Klartag and Emanuel Milman, editors, Geometric Aspects of Functional
Analysis, volume 2169 of Lecture Notes in Mathematics, pages 163–179. Springer, Cham,
2017.
Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. Wiley Series in Probability and
Statistics. John Wiley & Sons, 2nd edition, 2009.
Kishore Jaganathan, Samet Oymak, and Babak Hassibi. Sparse Phase Retrieval: Convex Algo-
rithms and Limitations. In Proceedings of the IEEE International Symposium on Information
Theory (ISIT), 2013.
Prateek Jain and Praneeth Netrapalli. Fast Exact Matrix Completion with Finite Samples. In
Proceedings of The 28th Conference on Learning Theory (COLT), 2015.
Prateek Jain and Ambuj Tewari. Alternating Minimization for Regression Problems with
Vector-valued Outputs. In Proceedings of the 29th Annual Conference on Neural Information
Processing Systems (NIPS), 2015.
Prateek Jain, Raghu Meka, and Inderjit Dhillon. Guaranteed Rank Minimization via Singular
Value Projections. In Proceedings of the 24th Annual Conference on Neural Information
Processing Systems (NIPS), 2010.
Prateek Jain, Ambuj Tewari, and Inderjit S. Dhillon. Orthogonal Matching Pursuit with Re-
placement. In Proceedings of the 25th Annual Conference on Neural Information Processing
Systems (NIPS), 2011.
Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank Matrix Completion using
Alternating Minimization. In Proceedings of the 45th annual ACM Symposium on Theory of
Computing (STOC), pages 665–674, 2013.
Prateek Jain, Ambuj Tewari, and Purushottam Kar. On Iterative Hard Thresholding Methods
for High-dimensional M-Estimation. In Proceedings of the 28th Annual Conference on Neural
Information Processing Systems (NIPS), 2014.
Ali Jalali, Christopher C Johnson, and Pradeep D Ravikumar. On Learning Discrete Graphical
Models using Greedy Methods. In Proceedings of the 25th Annual Conference on Neural
Information Processing Systems (NIPS), pages 1935–1943, 2011.
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to Escape
Saddle Points Efficiently. In Proceedings of the 34th International Conference on Machine
Learning (ICML), pages 1724–1732, 2017.
Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix Completion from a
Few Entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix Factorization Techniques for Recommender
Systems. IEEE Computer, 42(8):30–37, 2009.
Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient Descent Only
Converges to Minimizers. In Proceedings of the 29th Conference on Learning Theory (COLT),
pages 1246–1257, 2016.
Kiryung Lee and Yoram Bresler. ADMiRA: Atomic Decomposition for Minimum Rank Ap-
proximation. IEEE Transactions on Information Theory, 56(9):4402–4416, 2010.
Yuanzhi Li and Yang Yuan. Convergence Analysis of Two-layer Neural Networks with ReLU
Activation. In Proceedings of the 31st Annual Conference on Neural Information Processing
Systems (NIPS), 2017.
Amand A. Lucas. A-DNA and B-DNA: Comparing Their Historical X-ray Fiber Diffraction
Images. Journal of Chemical Education, 85(5):737–743, 2008.
Zhi-Quan Luo and Paul Tseng. On the Convergence of the Coordinate Descent Method for
Convex Differentiable Minimization. Journal of Optimization Theory and Applications, 72
(1):7–35, 1992.
Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent
methods: A general approach. Annals of Operations Research, 46(1):157–178, 1993.
Ricardo A. Maronna, R. Douglas Martin, and Víctor J. Yohai. Robust Statistics: Theory and
Methods. John Wiley, 2006.
R. Douglas Martin and Judy Zeh. Robust Generalized M-estimates for Autoregressive Parame-
ters: Small-sample Behavior and Applications. Technical Report 214, University of Washing-
ton, 1978.
Raghu Meka, Prateek Jain, Constantine Caramanis, and Inderjit Dhillon. Rank Minimization
via Online Learning. In Proceedings of the 25th International Conference on Machine Learning
(ICML), 2008.
Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on
Computing, 24(2):227–234, 1995.
Deanna Needell and Joel A. Tropp. CoSaMP: Iterative Signal Recovery from Incomplete and
Inaccurate Samples. Applied and Computational Harmonic Analysis, 26:301–321, 2008.
Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A Unified
Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers.
Statistical Science, 27(4):538–557, 2012.
Jelani Nelson, Eric Price, and Mary Wootters. New constructions of RIP matrices with fast
multiplication and fewer rows. In Proceedings of the 25th Annual ACM-SIAM Symposium on
Discrete Algorithms (SODA), 2014.
Yurii Nesterov and B.T. Polyak. Cubic regularization of Newton method and its global perfor-
mance. Mathematical Programming, 108(1):177–205, 2006.
Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase Retrieval using Alternating Min-
imization. In Proceedings of the 27th Annual Conference on Neural Information Processing
Systems (NIPS), 2013.
Nam H. Nguyen and Trac D. Tran. Exact recoverability from dense corrupted observations via
L1 minimization. IEEE Transactions on Information Theory, 59(4):2036–2058, 2013a.
Nam H Nguyen and Trac D Tran. Robust Lasso With Missing and Grossly Corrupted Obser-
vations. IEEE Transaction on Information Theory, 59(4):2036–2058, 2013b.
Samet Oymak, Benjamin Recht, and Mahdi Soltanolkotabi. Sharp Time-Data Tradeoffs for
Linear Inverse Problems. arXiv:1507.04793 [cs.IT], 2015.
Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Restricted Eigenvalue Properties for
Correlated Gaussian Designs. Journal of Machine Learning Research, 11:2241–2259, 2010.
Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed Minimum Rank Solutions
to Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review, 52(3):471–501,
2010.
Sashank Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alexander J. Smola. Fast
Stochastic Methods for Nonsmooth Nonconvex Optimization. In Proceedings of the 33rd
International Conference on Machine Learning (ICML), 2016.
Peter J. Rousseeuw. Least Median of Squares Regression. Journal of the American Statistical
Association, 79(388):871–880, 1984.
Peter J. Rousseeuw and Annick M. Leroy. Robust Regression and Outlier Detection. John Wiley
and Sons, 1987.
Ankan Saha and Ambuj Tewari. On the Non-asymptotic Convergence of Cyclic Coordinate
Descent Methods. SIAM Journal on Optimization, 23(1):576–601, 2013.
Hanie Sedghi and Anima Anandkumar. Training Input-Output Recurrent Neural Networks
through Spectral Methods. arXiv:1603.00954 [cs.LG], 2016.
Shai Shalev-Shwartz and Tong Zhang. Stochastic Dual Coordinate Ascent Methods for Regu-
larized Loss Minimization. Journal of Machine Learning Research, 14:567–599, 2013.
Yiyuan She and Art B. Owen. Outlier Detection Using Nonconvex Penalized Regression. Journal
of the American Statistical Association, 106(494):626–639, 2011.
Daniel A. Spielman, Huan Wang, and John Wright. Exact Recovery of Sparsely-Used Dictio-
naries. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), 2012.
Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, editors. Optimization for Machine
Learning. The MIT Press, 2011.
Norbert Stockinger and Rudolf Dutter. Robust time series analysis: A survey. Kybernetika, 23
(7):1–3, 1987.
Ju Sun, Qing Qu, and John Wright. When Are Nonconvex Problems Not Scary?
arXiv:1510.06096 [math.OC], 2015.
Ruoyu Sun and Zhi-Quan Luo. Guaranteed Matrix Completion via Non-convex Factorization.
In Proceedings of the 56th IEEE Annual Symposium on Foundations of Computer Science
(FOCS), 2015.
Ambuj Tewari, Pradeep Ravikumar, and Inderjit S. Dhillon. Greedy Algorithms for Structurally
Constrained High Dimensional Problems. In Proceedings of the 25th Annual Conference on
Neural Information Processing Systems (NIPS), 2011.
Joel A. Tropp. User-Friendly Tail Bounds for Sums of Random Matrices. Foundations of
Computational Mathematics, 12(4):389–434, 2012.
Joel A. Tropp and Anna C. Gilbert. Signal Recovery From Random Measurements Via Orthog-
onal Matching Pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, Dec.
2007. ISSN 0018-9448.
Meng Wang, Weiyu Xu, and Ao Tang. On the Performance of Sparse Recovery Via `p -
Minimization (0 ≤ p ≤ 1). IEEE Transactions on Information Theory, 57(11):7255–7278,
2011.
Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High Dimensional EM Algorithm:
Statistical Optimization and Asymptotic Normality. In Proceedings of the 29th Annual Con-
ference on Neural Information Processing Systems (NIPS), 2015.
Karen H.S. Wilson, Sarah E. Eckenrode, Quan-Zhen Li, Qing-Guo Ruan, Ping Yang, Jing-Da
Shi, Abdoreza Davoodi-Semiromi, Richard A. McIndoe, Byron P. Croker, and Jin-Xiong She.
Microarray Analysis of Gene Expression in the Kidneys of New- and Post-Onset Diabetic
NOD Mice. Diabetes, 52(8):2151–2159, 2003.
John Wright, Alan Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma. Robust Face
Recognition via Sparse Representation. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 31(2):210–227, 2009.
Stephen J. Wright and Jorge Nocedal. Numerical Optimization. Springer New York, 1999.
C.-F. Jeff Wu. On the Convergence Properties of the EM Algorithm. The Annals of Statistics,
11(1):95–103, 1983.
Allen Y. Yang, Zihan Zhou, Arvind Ganesh Balasubramanian, S Shankar Sastry, and Yi Ma.
Fast `1 -Minimization Algorithms for Robust Face Recognition. IEEE Transactions on Image
Processing, 22(8):3234–3246, 2013.
Fanny Yang, Sivaraman Balakrishnan, and Martin J. Wainwright. Statistical and computa-
tional guarantees for the Baum-Welch algorithm. In Proceedings of the 53rd Annual Allerton
Conference on Communication, Control, and Computing (Allerton), 2015.
Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating Minimization for Mixed
Linear Regression. In Proceedings of the 31st International Conference on Machine Learning
(ICML), 2014.
Ya-Xiang Yuan. Recent advances in trust region algorithms. Mathematical Programming, 151
(1):249–281, 2015.
Tong Zhang. Adaptive Forward-Backward Greedy Algorithm for Learning Sparse Representa-
tions. IEEE Transactions on Information Theory, 57:4689–4708, 2011.
Yuchen Zhang, Percy Liang, and Moses Charikar. A Hitting Time Analysis of Stochastic Gra-
dient Langevin Dynamics. In Proceedings of the 30th Conference on Learning Theory, 2017.
Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery
Guarantees for One-hidden-layer Neural Networks. In Proceedings of the 34th International
Conference on Machine Learning (ICML), 2017.
Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. Large-scale Parallel Col-
laborative Filtering for the Netflix Prize. In Proceedings of the 4th International Conference
on Algorithmic Aspects in Information and Management (AAIM), 2008.