
A Faster Algorithm for Solving General LPs

Shunhua Jiang ([email protected])
Columbia University, New York, NY, USA

Zhao Song ([email protected])
Institute for Advanced Study, Princeton, NJ, USA

Omri Weinstein ([email protected])
Columbia University, New York, NY, USA

Hengjie Zhang ([email protected])
Columbia University, New York, NY, USA
ABSTRACT

The fastest known LP solver for general (dense) linear programs is due to [Cohen, Lee and Song'19] and runs in O*(n^ω + n^{2.5−α/2} + n^{2+1/6}) time. A number of follow-up works [Lee, Song and Zhang'19, Brand'20, Song and Yu'20] obtain the same complexity through different techniques, but none of them can go below n^{2+1/6}, even if ω = 2. This leaves a polynomial gap between the cost of solving linear systems (n^ω) and the cost of solving linear programs, and as such, improving the n^{2+1/6} term is crucial toward establishing an equivalence between these two fundamental problems. In this paper, we reduce the running time to O*(n^ω + n^{2.5−α/2} + n^{2+1/18}), where ω and α are the fast matrix multiplication exponent and its dual. Hence, under the common belief that ω ≈ 2 and α ≈ 1, our LP solver runs in O*(n^2.055) time instead of O*(n^2.16).

CCS CONCEPTS

• Theory of computation → Linear programming; Convex optimization.

KEYWORDS

Linear programming, Convex optimization, Dynamic data-structure

ACM Reference Format:
Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. 2021. A Faster Algorithm for Solving General LPs. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC '21), June 21–25, 2021, Virtual, Italy. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3406325.3451058

1 INTRODUCTION

Linear programming is one of the cornerstones of algorithm design and convex optimization, dating back to as early as Fourier in 1827. LPs are the key toolbox for (literally hundreds of) approximation algorithms, and a standard subroutine in convex optimization problems. Dantzig's 1947 simplex algorithm [31] was the first proposed solution for general LPs with n variables and d constraints (min_{Ax=b, x≥0} c^T x). Despite its impressive performance in practice, however, the simplex algorithm turned out to have exponential worst-case running time (Klee and Minty [47]). The first polynomial-time algorithm for general LPs was only developed in 1980, when Khachiyan [46] showed that the Ellipsoid method (previously introduced by [64] and [86]) runs in O(n^6) time. Unfortunately, this algorithm is very slow in practice compared to the simplex algorithm, raising a quest for LP solvers which are efficient in both theory and practice. This was the primary motivation behind the development of interior point methods (IPMs), which use a primal-dual gradient descent approach to iteratively converge to an optimal solution (Karmarkar [45]). An appealing feature of IPMs for solving LPs is that they are guaranteed to run fast not only in theory but also in practice [78]. In 1989, Vaidya proposed an O(n^2.5) LP solver based on a specific implementation of IPMs, known as the central path algorithm [80, 81].^1

Recently, Cohen, Lee and Song [27] developed a stochastic central path algorithm which achieves the fastest known running time for general (dense) LPs:

    O*(n^ω + n^{2.5−α/2} + n^{2+1/6}),

where ω and α are the fast matrix multiplication (FMM) exponent and the dual FMM exponent,^2 and O* hides n^{o(1)} factors. Note that n^ω is the minimal time^3 for merely solving a general linear system (Ax = b), i.e., finding any feasible solution to the LP, thus it seems quite remarkable that solving the full optimization problem (min_{Ax=b, x≥0} c^T x) may be done at virtually no extra cost. Indeed, the work of [27, 69] suggests this intriguing possibility, which is the primary motivation of our work:

    Are the problems of solving Linear Programs and Linear Systems computationally equivalent?^4
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
STOC '21, June 21–25, 2021, Virtual, Italy
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8053-9/21/06. . . $15.00
https://doi.org/10.1145/3406325.3451058

^1 For other regimes of LP, the fastest solvers are [15, 52]. However, for the dense square case, their running time is still n^2.5, the same as [81].
^2 The dual exponent α is defined as the asymptotically maximum number a ≤ 1 such that multiplying an n × n^a matrix by an n^a × n matrix can be done in n^{2+o(1)} time. ω and α are related, as they must satisfy ω + (ω/2)α ≤ 3 [20].
^3 There is recent exciting work [61] which solves linear systems faster than n^ω when the sparsity and the condition number κ of the matrix A are small (nnz(A) = o(n^{ω−1}/log(κ))). This work is not for solving general linear systems. Currently there is no reason to believe that general linear systems can be solved faster than n^ω [82].
^4 This paper mainly focuses on the line of weakly polynomial time linear programming algorithms. There is a whole branch of research on strongly polynomial algorithms for linear programs (Smale's 9th question [68]).

STOC ’21, June 21–25, 2021, Virtual, Italy Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang

The current best upper bound on the FMM exponent is ω ≈ 2.37286 [5], improving a long line of research [48, 79, 84], while the best known bound on the dual FMM exponent is α ≈ 0.31389 [34]. Despite recent evidence indicating the limitations of the existing fast matrix multiplication techniques [4, 7, 21], it is still widely believed that ω ≈ 2 [29, 84] and as such α ≈ 1. Assuming that indeed ω < 2 + 1/6 ≈ 2.166 and α > 0.66, the runtime of the aforementioned algorithms is n^2.166. Whether the additive n^2.166 term can be improved or completely removed was explicitly posed as an open question in [27]. Closing this polynomial gap is not only a basic question, but also has potential implications for many other optimization problems where the cost per iteration is the current bottleneck (e.g., semidefinite programming solvers [38, 41], empirical risk minimization [53], the cutting plane method [42], minimizing convex functions with integral minimizers [40] and even non-convex optimization [17]).

Our main result is an affirmative answer to this open question, asserting that LPs can be solved in matrix multiplication time for nearly any value of ω (i.e., so long as ω > 2.055). We design an improved LP solver which runs in time O*(n^ω + n^{2.5−α/2} + n^{2+1/18}). In the most notable (ideal) case that ω ≈ 2 and α ≈ 1, our algorithm runs in O*(n^2.055) time, instead of the O*(n^2.166) time of previous IPM algorithms [11, 27, 53, 77].^5 More precisely:

Theorem 1.1 (Main result, informal statement of Theorem 3.1). Let min_{Ax=b, x≥0} c^T x be a linear program where A ∈ R^{d×n} and d = Θ(n). Then for any accuracy parameter δ ∈ (0, 1), there is a randomized algorithm that solves the LP in expected time

    O*(n^ω + n^{2.5−α/2} + n^{2+1/18}) · log(n/δ).

We achieve this result by designing a two-level lazy update framework for efficiently maintaining a projection matrix, which is the key subroutine in the central path algorithm. This framework requires a combination of randomized sketching techniques [22, 60] and a complex two-level amortized analysis, which is made possible by establishing a stronger robustness property of the stochastic central path algorithm against "coordinate-wise embedding" type perturbations.

Beyond two levels. It is natural to ask whether our approach can be extended to k > 2 levels to completely settle the problem (i.e., achieve O*(n^ω) runtime). We discuss this intriguing possibility in Section 4.4.

Recent developments. After the initial release of this paper [43] on arXiv,^6 a follow-up work [12] (SOSA'21) showed how to simplify a small part of our proofs in the appendix. Specifically, that paper shows a reduction from the projection maintenance problem to the inverse maintenance problem, which, combined with the other components of our proof (amortization, sketching, robust central path, feasibility) as black boxes, yields an alternative proof of our main result. For more discussion see Section 4.2.

2 BACKGROUND

The recent work of [27] improved Vaidya's O(n^2.5) LP algorithm [81] to O*(n^ω + n^{2.5−α/2} + n^{2+1/6}). Subsequent works [11, 53, 77] achieved the same running time. They used the same lazy update framework as [27], but developed different techniques to achieve fast queries in the projection maintenance data structure. More precisely, the running time of [27] is

    O*(n^ω + n^{2.5−a/2} + n^{1.5+a}),

where a ≤ α is a tunable parameter. This value is always bounded by O*(n^ω + n^{2.5−α/2} + n^{2+1/6}).

The three main ingredients of [27] are shown in the next three subsections (2.1, 2.2, and 2.3), and in Section 2.3 we also discuss the fast query techniques of the three subsequent works [11, 53, 77]. In Section 3 we describe our improvements based on previous works.

2.1 Central Path

[27] proposed a stochastic version of the central path algorithm, which can tolerate small perturbations. The central path algorithm keeps track of the primal vector x and the dual slack vector s, iteratively shrinking the duality gap Σ_{i=1}^n x_i · s_i by a factor of 1 − 1/Θ(√n) in each iteration, until converging to the optimal point. Hence, the central path algorithm has a total of O(√n) iterations. See the full version [43] for more details. The main computational bottleneck of the stochastic central path algorithm boils down to the following projection maintenance problem.

Projection maintenance. Denoting by w the vector x/s, i.e., w_i = x_i/s_i for i ∈ [n], the projection matrix is defined as follows (for simplicity we omit the W on the two sides):

    P(w) = A^T (A W A^T)^{-1} A.

In each iteration the stochastic central path algorithm performs the following two operations:
1. Update(w^new), which updates P(w) to P(w^new).
2. Query(h^new), which calculates r := P(w^new) · h^new for some vector h^new. Then r is used to compute δ_x and δ_s, the updates of x and s in each iteration.

Tolerating perturbations. The stochastic central path algorithm allows certain small perturbations in the projection maintenance data structure: Let w^appr ≈_ϵ w^new, where ≈_ϵ denotes coordinate-wise approximation. Let D be a sampling matrix where Dh^new samples √n entries from h^new. [27] shows that the algorithm is correct when w^new and h^new are approximated by w^appr and Dh^new.

2.2 Lazy Update Framework

To speed up the data structure, [27] delays updates to the projection matrix by batching multiple updates together and updating a batch using fast rectangular matrix multiplication.

The data structure maintains a proxy vector v for w, and maintains the previously-computed matrix P := P(v). The operation Update(w^new) only needs to update the set of coordinates S ⊆ [n] where w_i^new differs from v_i by more than ϵ, i.e., |w_i^new/v_i − 1| > ϵ. Let the diagonal matrix ∆ hold the differences on these coordinates (i.e., ∆_{i,i} = w_i^new − v_i for i ∈ S, and ∆_{i,i} = 0 for other coordinates). They effectively use P(w^appr), where W^appr = V + ∆, to approximate

^5 Interior point methods have a long history, and have been extensively used in many fundamental problems in theoretical computer science, e.g. linear programming [10, 15, 50–52, 67, 85], linear programs with small treewidth [33], maximum matching [14, 54], max-flow [13, 30, 35, 55, 56, 58, 59], geometric median [26], matrix scaling and balancing [2, 28].
^6 The first version is available at https://arxiv.org/pdf/2004.07470.pdf. The first version's title, "Faster Dynamic Matrix Inverse for Faster LPs", differs from this STOC 2021 camera-ready version's title.


Table 1: A conceptual summary of the literature, based on the randomness used in the optimization and the data structure parts.

Year Authors References Central path Data structure


2019 Cohen, Lee and Song [27] Randomized Deterministic
2019 Lee, Song and Zhang [53] Deterministic Randomized
2020 Song and Yu [77] Randomized Randomized
2020 Brand [11] Deterministic Deterministic

P(w^new). Note that w^appr ≈_ϵ w^new by definition. See Figure 1 (a) for an illustration.

Using Woodbury's identity for low-rank inverse updates, P(w^appr) is computed as

    P(w^appr) = A^T (A(V + ∆)A^T)^{-1} A
              = P − P_S · (∆_{S,S}^{-1} + P_{S,S})^{-1} · (P_S)^T,    (1)

where P := P(v) = A^T (A V A^T)^{-1} A, and P_S ∈ R^{n×|S|} is the submatrix of P with columns in S (so P_S is n × |S| and (P_S)^T is |S| × n).

[27] further exploits the power of fast rectangular matrix multiplication: By the definition of α, for any a ≤ α, the time of multiplying an n × n^a rectangular matrix by an n^a × n rectangular matrix is the same as multiplying an n × 1 vector with a 1 × n vector. So they only update v and P when |S| ≥ n^a. Otherwise they postpone the actual updates to v and P, and only update ∆.

Soft thresholding for amortization.^7 It takes O(n^2) time to update the matrix P once S has accumulated n^a coordinates, hence the lazy update framework is only useful when this update time can be amortized. One of the key improvements of [27] to achieve near matrix multiplication time is their soft threshold exploiting the "slow-moving" property of w:

    ∥E[w^new/w − 1]∥_2 ≤ O(1).    (2)

Intuitively, when S reaches n^a coordinates, instead of only updating these n^a coordinates (a hard threshold of n^a), [27] could potentially update more coordinates according to the error vector |w^new/v − 1| (a soft threshold). Their amortization shows that in expectation the lazy update is performed once every n^{a/2} iterations, and this gives the n^{2−a/2} term in the final update time.

2.3 Fast Query Techniques

When not too many coordinates of w^new are different from the maintained proxy v, i.e., |S| ≤ n^a, the Query(h^new) operation computes r := P(w^appr) · h^new on the fly (see Eq. (1)). Note that multiplying the second term of Eq. (1) with h^new only takes O(n^{1+a}) time, since P_S ∈ R^{n×|S|} and |S| ≤ n^a. However, computing the first term P · h^new from scratch could take O(n^2) time.

The four papers [11, 27, 53, 77] proposed different techniques to speed up this matrix-vector multiplication to O(n^{1+a} + n^{1.5}). The techniques of [27] and [11] essentially use sparsification of the vector h^new. In contrast, [53] and [77] sketch the projection matrix itself, effectively making it smaller. See the summary in Table 2.

Feasibility. We remark that when using techniques on the right, the feasibility of the LP (Ax = b) is directly satisfied during each iteration, but when using techniques on the left, more algorithmic design of the data structure is required to ensure feasibility. We provide the details of how to maintain feasibility in our algorithm in the full version [43].

In our paper we use sketching and vector maintenance, so we provide more details on these two techniques here.

Sketching. [53] and [77] use the idea of "iterate-and-sketch",^8 an adaptive version of the classic "sketch-and-solve" paradigm [22, 60].^9 Both papers can use any kind of common dense sketching matrix R ∈ R^{√n×n}, e.g., the subsampled randomized Hadamard/Fourier transform matrix ([57, 62]) or the AMS sketch ([6]). Sketching on the right ([77]) gives the advantage that it is easier to maintain feasibility. Because of this, [77] can even use the sparse embedding matrix of [25, 44, 60] with poly log n non-zero entries per column. As [77] mentioned, they cannot use the Count-Sketch [18] matrix, where each column has only one nonzero. Thus, we believe understanding the relationship between the sparsity of sketching matrices and the iterative optimization framework is an interesting direction to explore, especially because sparse sketching matrices may help in the future study of O(nnz(A)) LP algorithms.

Vector maintenance. [11] maintains P · g where g is a proxy for h (similar to v being a proxy for w), exploiting the observation that the vector h is also slowly changing; this ensures that ∆h (analogous to ∆) is n^a-sparse. It is noteworthy that this technique is deterministic, as it avoids sketching/sampling altogether. Indeed, the motivation of [11] was derandomizing [27].

3 OUR TECHNIQUES

This section provides a proof overview of our main result (Theorem 1.1), which we restate formally below. Formal proofs of all technical claims can be found in the Appendix of the full version [43].

Theorem 3.1 (Main result, formal version of Theorem 1.1). Given a linear program min_{Ax=b, x≥0} c^T x where A ∈ R^{d×n} and d = Θ(n). Assume there are no redundant constraints, and the polytope has diameter R in ℓ_1 norm, namely, for any x ≥ 0 with Ax = b, we have ∥x∥_1 ≤ R.

^7 The name soft threshold is given by us. The idea is proposed by [27], but they embed it inside their update operation. Since we will use it multiple times, we give it an explicit name.
^8 The idea of "iterate-and-sketch" can also be applied to non-convex optimization [17, 71, 83].
^9 The technique "sketch-and-solve" has been widely used in many fundamental problems in machine learning, e.g. regression [9, 24, 32, 49, 70], low-rank approximation [23, 72, 73, 75], tensor decomposition [74].

825
STOC ’21, June 21–25, 2021, Virtual, Italy Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang

Table 2: Summary of fast query techniques of previous papers to approximately compute P · h^new.

References | Technique   | Position | Formula          | Comment                           | Time    | Feasible
[27]       | Sampling    | right    | P · Dh^new       | D samples √n entries of h^new.    | n^1.5   | Easy
[53]       | Sketching   | left     | R^T · RP · h^new | Sketching matrix R ∈ R^{√n×n}.    | n^1.5   | Hard
[77]       | Sketching   | right    | PR^T · R · h^new | Sketching matrix R ∈ R^{√n×n}.    | n^1.5   | Easy
[11]       | Vec. maint. | right    | (Pg) + P · ∆h    | Proxy g ensures ∆h is n^a-sparse. | n^{1+a} | Easy

Then, for any δ ∈ (0, 1], Main(A, b, c, δ, a, ã) (Algorithm 1) outputs x ≥ 0 such that

    c^T x ≤ min_{Ax=b, x≥0} c^T x + δ∥c∥_∞ R,  and  ∥Ax − b∥_1 ≤ δ · (R∥A∥_1 + ∥b∥_1),

in expected time

    O(n^ω + n^{2.5−a/2} + n^{1.5+a−ã/2} + n^{0.5+a+(ω−1)ã}) · n^{o(1)} · log(n/δ)

for any 0 < a ≤ α and 0 < ã ≤ αa.

Remark 3.2. In particular, so long as the constants of fast matrix multiplication satisfy ω > 2.055 and α > 5 − 2ω, general LPs can be solved in O(n^{ω+o(1)}) time. In the ideal case that ω = 2 and α = 1, the running time is n^{2+1/18} = n^{2.055}, by choosing a = 8/9 and ã = 2/3.

Our improvement over [27] comes from introducing a second level of lazy updates:

    time per iteration =
    [27]:  n^{ω−1/2} + n^{2−a/2} (1st-level update) + n^{1+a} (query),
    Ours:  n^{ω−1/2} + n^{2−a/2} (1st-level update) + n^{1+a−ã/2} (2nd-level update) + n^{a+(ω−1)ã} (query).

The first term n^{ω−1/2} + n^{2−a/2} of our running time is the same as [27], stemming from the amortized cost of lazy updates, which we call the first-level update.

The second term n^{1+a−ã/2} comes from the amortized cost of our newly introduced second-level update. We maintain the "changes" of the first-level values in a second level, and only compute the smaller "changes of the changes" in each iteration. In this way we can amortize the cost of these computations from n^{1+a} to n^{1+a−ã/2}, where ã ≤ αa is our second-level threshold.

The third term n^{a+(ω−1)ã} is our query cost. Note that this is smaller than the n^{1+a} query cost of [27]. In fact, it is even smaller than n^{2a} ≤ n^{1+a}, since (ω−1)ã ≤ (ω−1)αa ≤ a, using the fact that ω + (ω/2)α ≤ 3 [20]^10 and ω ≥ 2.

Our contributions. We achieve Theorem 3.1 by improving all three key ingredients of [27]:

• A two-level lazy update framework: We observe that not only are the maintained first-level values changing slowly, but their "changes" are slowly changing as well; hence it is natural to maintain these "changes" as second-level values, where each new iteration only computes the smaller changes in these second-level values ("changes of changes"). We lazily update these second-level values until we have accumulated n^ã coordinates in the "changes of changes". More details can be found in Section 3.2.

• Synchronizing two levels of amortization: The scheduling order of the two levels of updates (and soft-thresholding) must be carefully chosen so as to synchronize v and ṽ. This synchronization is crucial to get the desired amortization: In an expected amortized sense, the second-level update is performed once every n^{ã/2} iterations, while the first-level update is performed once every n^{a/2} iterations. More details can be found in Section 3.4.

• Right+Left compression for faster queries: We combine the sketching-on-the-left technique of [53] with the vector maintenance technique of [11] (ensuring sparsity on the right) to "compress" the projection matrix on both sides. Implementing this approach requires dynamic maintenance of extra data-structure primitives (matrix products, vectors, sets), which together achieve the claimed n^{a+(ω−1)ã} query time. More details can be found in Section 3.3.

3.1 Generalizing the Robustness of Central Path

Previously, [27] explicitly considered the central path update with sampling on the right, and [53] and [77] explicitly considered the central path update with a sketching matrix on the left/right. We abstract away these specific implementations of the projection maintenance data structure and consider a generic central path algorithm, where any update that satisfies some small perturbation properties is allowed. A simplified version of our algorithm is presented in Algorithm 1.

Let x, s ∈ R^n be the primal vector and the dual slack vector at the beginning of one iteration. Let µ := x · s and w := x/s, where · and / are coordinate-wise operations. Let h be computed from µ and t. Similar to [27], we allow coordinate-wise error on w, and similar to [11], which uses the vector maintenance technique, we also allow coordinate-wise error on h. So it suffices to compute the central path update using w^appr and h^appr:

    w^appr ≈_ϵ w,   h^appr ≈_ϵ h.

We allow the actual central path updates δ̂_x and δ̂_s, which are computed from the projection maintenance data structure, to deviate further from δ_x and δ_s, the exact versions computed using w^appr and h^appr. We prove that the algorithm converges to the optimal solution of the LP so long as in each iteration δ̂_x and δ̂_s satisfy the following small perturbation properties (we omit the

^10 After consulting researchers in the field of matrix multiplication [3, 87], we learned that they consider improving the inequality between α and ω an interesting problem.
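The exponent arithmetic behind Theorem 3.1 and Remark 3.2 is easy to verify numerically. The following sanity check (not part of the algorithm) confirms that with ω = 2 and α = 1, the choices a = 8/9 and ã = 2/3 balance every term of the runtime at n^{2+1/18} ≈ n^{2.055}:

```python
# Runtime exponents from Theorem 3.1:
#   O(n^omega + n^{2.5 - a/2} + n^{1.5 + a - at/2} + n^{0.5 + a + (omega-1)*at})
omega, alpha = 2.0, 1.0     # ideal FMM exponent and dual exponent
a, at = 8 / 9, 2 / 3        # thresholds chosen in Remark 3.2 (at = a-tilde)

assert a <= alpha and at <= alpha * a   # admissible parameter range

terms = {
    "matrix multiplication": omega,
    "first-level update":    2.5 - a / 2,
    "second-level update":   1.5 + a - at / 2,
    "query":                 0.5 + a + (omega - 1) * at,
}
runtime_exponent = max(terms.values())
```

All three data-structure terms balance exactly at 37/18 = 2 + 1/18, which is what makes this parameter choice optimal in the ideal FMM regime.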

826
A Faster Algorithm for Solving General LPs STOC ’21, June 21–25, 2021, Virtual, Italy

Algorithm 1 Simplified version of our central path algorithm, using the projection maintenance data structure. See the detailed version of this algorithm in the full version [43].

 1: i ← 1, initialize feasible x, s ∈ R^n
 2: Initialize the projection maintenance data structure mp.
 3: while i < n do
 4:     t ← t · (1 − 1/√n), µ ← x · s        ▷ t is the target duality gap, µ is the actual duality gap
 5:     w ← x/s
 6:     Compute h based on µ and t.
 7:     mp.Update(w)                          ▷ Update projection matrix P = A^T (A W^appr A^T)^{-1} A, where w^appr ≈_ϵ w
 8:     r ← mp.Query(h)                       ▷ Query a matrix-vector product r ≈ P · h^appr, where h^appr ≈_ϵ h
 9:     Compute δ̂_x and δ̂_s using r.          ▷ This just multiplies r with diagonal matrices.
10:     x ← x + δ̂_x, s ← s + δ̂_s, i ← i + 1
11: end while
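Line 4 of Algorithm 1 shrinks the target duality gap t by a factor (1 − 1/√n) per iteration, which is why the central path converges in O(√n · log(1/δ)) iterations. A quick numerical check with toy parameters (an illustration, not the real solver):

```python
import math

# Simulate only the gap-shrinking schedule of line 4 of Algorithm 1.
n, delta = 10_000, 1e-6
t, iters = 1.0, 0
while t > delta:
    t *= 1 - 1 / math.sqrt(n)   # t <- t * (1 - 1/sqrt(n))
    iters += 1

# Theory predicts roughly sqrt(n) * ln(1/delta) iterations.
predicted = math.sqrt(n) * math.log(1 / delta)
```

For n = 10,000 and δ = 10^{-6} the loop runs about 1,375 times, within half a percent of the √n · ln(1/δ) ≈ 1,382 prediction.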

analogous properties for δ̂_s here):

    1. ∥E[x^{-1}(δ̂_x − δ_x)]∥_2 ≤ ϵ^2.
    2. E[(x_i^{-1}(δ̂_{x,i} − δ_{x,i}))^2] ≤ ϵ · |x_i^{-1} δ_{x,i}| + ϵ^2/√n,  ∀i ∈ [n].    (3)
    3. Pr[|x_i^{-1}(δ̂_{x,i} − δ_{x,i})| ≤ ϵ] ≥ 1 − 1/n^4,  ∀i ∈ [n].

The expectation and probability are over the randomness of δ̂_x. In this way, we hide the implementation details of the data structure from the central path algorithm (the optimization part). We believe this encapsulation will be useful in future research on LP algorithms.

3.2 Two-Level Lazy Updates

To perform the query operation, [27] needs to compute a small inverse matrix (resulting from Woodbury's identity) on the fly in each iteration (see Eq. (1)). We further maintain this small inverse matrix (corresponding to "changes") so that, at query time, it suffices to compute an even smaller inverse matrix (corresponding to "changes of changes").

Proxy vectors v and ṽ. Similar to [27], we maintain a first-level proxy vector v and the first-level object P = P(v) := A^T (A V A^T)^{-1} A. Given a new vector w^new, we use S^new ⊆ [n] to denote the set of coordinates where w^new differs from v by more than ϵ, i.e., S^new is the set of coordinates i such that |w_i^new/v_i − 1| ≥ ϵ. We define a diagonal matrix ∆^new such that ∆^new_{i,i} = w_i^new − v_i for i ∈ S^new and ∆^new_{i,i} = 0 for other i. Note that the definitions of S^new and ∆^new are the same as in [27], except that we add a superscript "new".

In analogy to the first level, we maintain a second-level proxy vector ṽ, and ∆ := Ṽ − V. We also maintain S = supp(∆), and some matrices computed using ∆ as the second-level objects (to be described later). Analogous to the definition of S^new, we define ∂S to be the set of coordinates i where w^new differs from ṽ by more than ϵ, i.e., the coordinates i such that |w_i^new/ṽ_i − 1| ≥ ϵ. Intuitively, ∂S can be viewed as "changes of changes". In this overview we can think of S^new = S ∪ ∂S. (The exact definition is slightly different; for more details see the full version [43].) We use w^appr = v + ∆^new to approximate the true vector w^new, and we have that w^appr ≈_ϵ w^new.

An illustration of the two proxy vectors v and ṽ is shown in Figure 1 (b).

Two levels of updates. We use two threshold values a ≤ α and ã ≤ α · a to decide when to update the proxy vectors v and ṽ. In the Update(w^new) operation, when more than n^ã coordinates of w^new differ from ṽ by more than ϵ, we perform a second-level update, and when more than n^a coordinates of w^new differ from v by more than ϵ, we perform a first-level update. Otherwise, the update operation only finds the approximate vector w^appr, and the query operation computes the query answer based on w^appr on the fly. Note that the first-level update is performed less frequently than the second-level update. A summary is as follows:

• First-level update: Performed when ∥v − w^appr∥_0 > n^a. Update the first-level proxy v. Recompute the first-level maintained object: P = P(v).
• Second-level update: Performed when ∥ṽ − w^appr∥_0 > n^ã. Update the second-level proxy ṽ. Recompute the second-level maintained objects: ∆ = Ṽ − V, S = supp(∆), B := (∆_{S,S}^{-1} + P_{S,S})^{-1}, and B · (P_S)^T. (For explanations see the next paragraph.)
• Query: Compute the query answer on the fly using the maintained objects of the two levels.

Second-level maintained objects. Next we describe the second-level maintained objects. Recall that using the Woodbury identity, the projection matrix P(w^appr) is computed as follows (see Eq. (1)):

    P(w^appr) = P − P_{S^new} · ((∆^new_{S^new,S^new})^{-1} + P_{S^new,S^new})^{-1} · (P_{S^new})^T.    (4)

We maintain

    B := (∆_{S,S}^{-1} + P_{S,S})^{-1} ∈ R^{n^a × n^a}

as a second-level object, so that we do not need to re-compute ((∆^new_{S^new,S^new})^{-1} + P_{S^new,S^new})^{-1} from scratch. Instead, we only need to compute the difference between the new inverse matrix and the maintained one. Observe that the new matrix ((∆^new)^{-1}_{S^new,S^new} + P_{S^new,S^new}) only differs from B^{-1} = (∆_{S,S}^{-1} + P_{S,S}) on entries in S^new × ∂S and ∂S × S^new. So there exists a low-rank decomposition that can be computed in O(n^{ã+a}) time (for more details see the full version [43]):

    U′CU^T = ((∆^new)^{-1}_{S^new,S^new} + P_{S^new,S^new}) − (∆_{S,S}^{-1} + P_{S,S}),

where U′, U ∈ R^{n^a × n^ã} and C ∈ R^{n^ã × n^ã} are all relatively small (with rank n^ã instead of n^a).
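The two-level schedule above can be simulated on toy data (hypothetical parameters; the drift counts stand in for the real ∥·∥_0 tests against v and ṽ, and no matrices are actually rebuilt). Because n^ã ≪ n^a, second-level updates fire much more often than the expensive first-level rebuilds, which is the point of the amortization:

```python
import random

random.seed(1)
n = 1024
a, at = 0.7, 0.45          # thresholds: n**a = 128, n**at ~ 22.6 (at <= a)
eps = 0.1

w = [1.0] * n
v, vt = list(w), list(w)   # first- and second-level proxies (v and v-tilde)
first = second = 0

def drift(ref):
    # number of coordinates where w deviates from the proxy by > eps
    return sum(1 for i in range(n) if abs(w[i] / ref[i] - 1) > eps)

for _ in range(600):
    for _ in range(4):                 # slow-moving w: a few coords per step
        w[random.randrange(n)] *= 1.25
    if drift(v) > n ** a:              # first-level: rebuild P(v)
        v, vt = list(w), list(w)
        first += 1
    elif drift(vt) > n ** at:          # second-level: rebuild B and B(P_S)^T
        vt = list(w)
        second += 1
```

With these toy numbers the cheap second-level updates outnumber the first-level rebuilds by a large factor, matching the n^{ã/2}-vs-n^{a/2} frequency claim above.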


[Figure 1 omitted: two schematic panels showing the proxy vectors maintained by the data structure mp; (a) CLS19, (b) Ours.]

Figure 1: (a): In each iteration, we are given w^new which is changing slowly. The algorithm will find w^appr such that w^appr is coordinate-wise ϵ-close to w^new, and it is also close to v in ℓ_0 norm (∥v − w^appr∥_0 ≤ n^a). (b): Based on (a), we add an intermediate level ṽ, such that in the query, ∥v − ṽ∥_0 ≤ n^a and ∥ṽ − w^appr∥_0 ≤ n^ã.
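Figure 1's two closeness measures — coordinate-wise multiplicative ϵ-closeness (≈_ϵ) and ℓ_0 distance — are simple to state in code (hypothetical helper names, small hand-made vectors):

```python
def approx_eq_eps(u, ref, eps):
    # u ~_eps ref : every coordinate within a (1 +- eps) multiplicative band
    return all(abs(ui / ri - 1) <= eps for ui, ri in zip(u, ref))

def l0_dist(u, ref):
    # ||u - ref||_0 : number of coordinates where the vectors differ
    return sum(1 for ui, ri in zip(u, ref) if ui != ri)

v     = [1.00, 2.00, 3.00, 4.00]   # first-level proxy
vt    = [1.00, 2.00, 3.30, 4.00]   # second-level proxy: ||v - vt||_0 = 1
wappr = [1.00, 2.02, 3.30, 4.00]   # ||vt - wappr||_0 = 1
wnew  = [1.01, 2.00, 3.35, 4.10]   # true vector: wappr is 0.05-close to wnew
```

Note that the two measures are orthogonal: wappr is ℓ_0-close to ṽ (few coordinates differ at all) while being only multiplicatively close to w^new (every coordinate may differ slightly).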


Now using the Woodbury identity we have:

((∆^new)_{S^new,S^new}^{−1} + P_{S^new,S^new})^{−1} = (B^{−1} + U′CU^⊤)^{−1} = B − BU′(C^{−1} + U^⊤BU′)^{−1}U^⊤B.   (5)

Combining with Eq. (4) we have

P(w^appr) = P − P_{S^new} · (I − BU′(C^{−1} + U^⊤BU′)^{−1}U^⊤) · B(P_{S^new})^⊤.   (6)

We also maintain B(P_S)^⊤ ∈ R^{n^a×n} as a second level object so that B(P_{S^new})^⊤ can be efficiently computed from B(P_S)^⊤. Also, BU′ ∈ R^{n^a×n^ã} can be computed efficiently by just taking some sub-matrices of B and B(P_S)^⊤. Thus at query time we already have both BU′ and B(P_{S^new})^⊤.

We postpone the discussion of the amortization using the two-level soft threshold to Section 3.4, as it is more coherent to first describe the query algorithm, which is based on Eq. (6).

3.3 Our Query Algorithm
When performing the Query(h^new) operation, we compute

P(w^appr) · h^new = P · h^new − P_{S^new} (I − BU′(C^{−1} + U^⊤BU′)^{−1}U^⊤) B(P_{S^new})^⊤ · h^new,   (7)

where P_{S^new} ∈ R^{n×n^a}, BU′ ∈ R^{n^a×n^ã}, U^⊤ ∈ R^{n^ã×n^a}, and B(P_{S^new})^⊤ ∈ R^{n^a×n}. For simplicity, we only describe how to compute the second term in O(n^{a+(ω−1)ã}) time. The first term P · h^new can be computed efficiently in a similar way.

Using the maintained objects stated in the previous section, we know that all the matrices with size annotations in Eq. (7) are already computed at query time. Now consider computing the matrix-vector multiplications from right to left. The only two computations that exceed our time budget are the multiplication of B(P_{S^new})^⊤ ∈ R^{n^a×n} with a vector, and the multiplication of P_{S^new} ∈ R^{n×n^a} with a vector. To accelerate these two computations, we need a technique on the left and a technique on the right. Thus we combine the sketching on the left of [53] with the vector maintenance on the right of [11].¹¹

• Left. We use a √n × n sketching matrix R, the same as [53]. Sketching on the left:

R^⊤ · RP_{S^new},

where R^⊤ ∈ R^{n×√n} and RP_{S^new} ∈ R^{√n×n^a}. We pre-compute RP_{S^new} when performing first level updates. Thus at query time we only need to multiply R^⊤ and RP_{S^new} with vectors, which takes O(n^{1.5}) time.

• Right. We generalize the vector maintenance technique of [11] to two levels. Vector maintenance on the right:

B(P_{S^new})^⊤g + B(P_{S^new})^⊤g̃ + B(P_{S^new})^⊤ · ∆h,

where B(P_{S^new})^⊤ ∈ R^{n^a×n} and ∆h is n^ã-sparse. In addition to maintaining a first level proxy vector g for h, we also maintain a second level proxy vector g̃. In this way we make sure that ∆h = h^appr − g̃ is n^ã-sparse, similar to the case of w and ṽ. We maintain B(P_{S^new})^⊤g as a first level object, and B(P_{S^new})^⊤g̃ as a second level object. Thus at query time we only need to compute B(P_{S^new})^⊤ · ∆h, which takes O(n^{a+ã}) time.

In this way, all the matrix-vector multiplications of Eq. (7) can be computed in O(n^{a+ã} + n^{1.5}) time. Now the dominating step is computing U^⊤ · (BU′) when forming the inverse matrix, which takes O(n^{a+(ω−1)ã}) time using fast rectangular matrix multiplication. This is our final query time.

Feasibility. Before going into our amortization scheme, we remark that since we use sketching on the left to calculate δ̂_x, the product Aδ̂_x is not 0. It takes extra work to make sure x is a feasible LP solution in each iteration, i.e., that ∥Ax − b∥₁ is bounded. Our technique is based on the robust central path of [53], and we extend it to two levels. For more details see the full version [43].

¹¹ In fact, we cannot combine sketching on the left with the other two randomized sampling/sketching-on-the-right techniques. This is because randomness on both sides incurs too large errors, and the bounded perturbation property of Eq. (3) no longer holds.
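The Woodbury identity of Eq. (5) is easy to check numerically. The following NumPy sketch uses small random stand-ins for B, U′, and C (toy sizes chosen by us, not the paper's maintained objects) and verifies that the low-rank update form agrees with direct inversion:

```python
import numpy as np

# Sanity check of the Woodbury identity behind Eq. (5):
#   (B^{-1} + U' C U'^T)^{-1} = B - B U' (C^{-1} + U'^T B U')^{-1} U'^T B.
# B, U, C below are random toy instances, not the paper's data structure.
rng = np.random.default_rng(0)
n, r = 8, 3                       # ambient dimension and update rank (toy values)
B = np.eye(n) + 0.1 * rng.standard_normal((n, n))
B = B @ B.T                       # symmetric positive definite, well-conditioned
U = rng.standard_normal((n, r))   # plays the role of U' (a rank-r update)
C = np.diag(rng.uniform(1.0, 2.0, size=r))

direct = np.linalg.inv(np.linalg.inv(B) + U @ C @ U.T)
woodbury = B - B @ U @ np.linalg.inv(np.linalg.inv(C) + U.T @ B @ U) @ U.T @ B
assert np.allclose(direct, woodbury)  # the two agree up to round-off
```

The payoff, as used above, is that the right-hand side only inverts a small matrix of the update's rank instead of re-inverting the full n × n matrix.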

A Faster Algorithm for Solving General LPs STOC ’21, June 21–25, 2021, Virtual, Italy

Algorithm 2 When to perform 1st and 2nd level updates; simplified version of the algorithm in the full version [43].
1: procedure Update(w^new)
2:     w^appr ← second level soft threshold with error |w^new/ṽ − 1|
3:     Adjust(w^appr): for any i such that w_i^appr ≠ ṽ_i and w_i^appr is close to v_i, let w_i^appr ← v_i
4:     if ∥w^appr − v∥₀ ≥ n^a then
5:         w^appr ← first level soft threshold with error |w^new/v − 1| + |w^new/ṽ − 1|
6:         Perform 1st-level update.
7:     else if ∥w^appr − ṽ∥₀ ≥ n^ã then
8:         Perform 2nd-level update.
9:     end if
10: end procedure
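The control flow of Algorithm 2 can be sketched in a few lines of Python. This is a toy illustration only: the soft-threshold subroutine below is a simplified stand-in that just selects coordinates whose relative error exceeds ϵ (not the paper's potential-based threshold), and all names and threshold parameters are hypothetical.

```python
import numpy as np

def soft_threshold(err, eps):
    # Hypothetical stand-in for the paper's soft threshold: simply selects
    # the coordinates whose relative error exceeds eps.
    return np.nonzero(err > eps)[0]

def update(w_new, v, v_tilde, thresh_1st, thresh_2nd, eps=0.1):
    """Toy version of Algorithm 2's control flow. thresh_1st and thresh_2nd
    play the roles of the sparsity thresholds n^a and n^a-tilde."""
    # Line 2: second level soft threshold with error |w^new / v~ - 1|.
    w_appr = v_tilde.copy()
    idx = soft_threshold(np.abs(w_new / v_tilde - 1), eps)
    w_appr[idx] = w_new[idx]
    # Line 3 (Adjust): coordinates that moved off v~ but are still close to v
    # are restored to v, preserving the gap needed by the potential argument.
    mask = (w_appr != v_tilde) & (np.abs(w_appr / v - 1) <= eps)
    w_appr[mask] = v[mask]
    # Lines 4-6: first level update if w^appr differs from v too much.
    if np.count_nonzero(w_appr != v) >= thresh_1st:
        err = np.abs(w_new / v - 1) + np.abs(w_new / v_tilde - 1)
        w_appr = v.copy()
        idx = soft_threshold(err, eps)
        w_appr[idx] = w_new[idx]
        return w_appr, "1st-level update"
    # Lines 7-8: otherwise a second level update if it differs from v~ too much.
    if np.count_nonzero(w_appr != v_tilde) >= thresh_2nd:
        return w_appr, "2nd-level update"
    return w_appr, "no update"
```

For instance, if w^new changes only one coordinate of an all-ones v = ṽ and the second-level threshold is 1, the sketch reports a 2nd-level update, mirroring lines 7–8.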

3.4 Synchronizing Two Levels of Amortization
In this section, we introduce our amortization scheme for the two levels of updates.

We first discuss why synchronization is important. If we just used two separate soft thresholds to separately decide whether to perform the two updates, these two decisions could interfere with each other. Indeed, the second-level update might only update ≈ n^ã coordinates, but it could also actually update many more coordinates (> n^a) because of the soft threshold. So whether to perform the first-level update depends on the result of the second-level soft threshold. Thus we need to carefully design the order in which the soft thresholds and updates are performed, and also explicitly adjust the approximate vector w^appr to be consistent with both v and ṽ.

The main structure of our update scheme is as follows (see Algorithm 2). We first compute the target vector w^appr of the second level soft threshold: If fewer than n^a coordinates have error |w_i^new/v_i − 1| > ϵ, we let w^appr = ṽ. Otherwise, let ỹ be the decreasingly sorted version of the error vector |w^new/ṽ − 1|. We find the first k̃ ≥ n^ã such that ỹ_{1.5k̃} < (1 − 1/log n) ỹ_{k̃}, and update these k̃ coordinates of w^appr to be w^new. Then we use this w^appr to decide whether to perform first or second level updates. If w^appr differs from v in more than n^a coordinates, we perform a first level update (with the analogous first level soft threshold). Otherwise, if w^appr differs from ṽ in more than n^ã coordinates, we perform a second level update.

Next, we explain two other important ingredients of this algorithm.

• Restoring threshold gaps via Adjust. We design an Adjust function that restores every updated coordinate w_i^appr whose new value is close to v_i back to the original value v_i. In this way we ensure that w_i^appr either equals v_i, or there is a gap between them. Hence when updating v to be w^appr, there is a large enough decrease in the potential function.

• Synchronizing the error function. When updating v, we define the error as |w^new/v − 1| + |w^new/ṽ − 1|, which is a function of both v and ṽ. This is because as long as one of v_i and ṽ_i is too far from w_i^new, we need to update both variables to be the same as w_i^new.

Finally, using a potential-function based amortized analysis for each of the two levels of updates, we show that the expected amortized cost of a first level update is O(n^{ω−1/2} + n^{2−a/2}), and the expected amortized cost of a second level update is O(n^{1+a−ã/2}).

4 DISCUSSION
4.1 Potential Applications
Convex optimization. There have been recent developments in other, broader classes of optimization problems, e.g., empirical risk minimization (ERM) [53], semidefinite programming (SDP) [38, 41], and the cutting plane method [42]. In particular, [42] uses the one-level lazy update with soft-thresholding technique as a black-box, while variants of this technique are used in ERM [53] and SDP [41], and we believe multi-level lazy updates (with two or more levels of soft-thresholding) might be useful in these broader classes of optimization problems as well.

Practical machine learning. Recently, the phenomenon of slowly-changing weights has also been observed in the training process of deep neural networks [19]. Based on this observation, [19] integrates soft-thresholding with lazy updates to efficiently maintain locality sensitive hashing (LSH, [8, 36, 39]) data structures for active neuron selection, resulting in a 3× speed-up of deep neural network training in practice. An interesting avenue for further speedup is trying to use more levels of lazy updates, provided that our "changes of changes" idea persists in such non-convex optimization.

4.2 Relation to Inverse Maintenance
After the initial release of this paper on arXiv in April, a recent result [12] (SOSA'21) showed how to simplify¹² a small part of our proofs in the appendix of the full version [43] (while still using the other parts of our paper as black-boxes). [12] shows that maintaining any (rational) matrix formula can be reduced to maintaining some inverse matrix, and thus projection maintenance can be reduced to inverse maintenance. [12] shows how to use the one-level lazy update inverse maintenance data structure of [63] to reproduce the worst-case version of part of the data structure of [27]. They also show how to use the two-level lazy update inverse maintenance data structure of [16] to reproduce the worst-case version of part of our data structure. Note that this reduction applies the robust central path, amortization, sketching, and feasibility parts of our paper

¹² They use the data structure of [16] to reproduce the worst-case version of our data structure. The tight lower bound of [16] makes it less likely to be useful for further improving LP to n² when ω = 2, while our data structure is LP-specific and thus more promising for future LP research.
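The search for the cutoff k̃ in the second level soft threshold described above — the first k̃ ≥ n^ã at which the sorted errors drop by a (1 − 1/log n) factor between positions k̃ and 1.5k̃ — can be sketched as follows. This is an illustrative linear scan under our own naming, not the paper's implementation:

```python
import math

def soft_threshold_index(y_sorted, k_min, n):
    """Return the first k >= k_min such that y_sorted[1.5k] <
    (1 - 1/log n) * y_sorted[k], where y_sorted is the error vector
    |w^new / v~ - 1| sorted in decreasing order (1-indexed as in the text)."""
    shrink = 1.0 - 1.0 / math.log(n)
    k = k_min
    while math.ceil(1.5 * k) <= len(y_sorted):
        if y_sorted[math.ceil(1.5 * k) - 1] < shrink * y_sorted[k - 1]:
            return k
        k += 1
    return len(y_sorted)  # no sharp drop found: update every coordinate
```

Intuitively, stopping at such a k̃ guarantees that the coordinates left un-updated have errors well below the ones just updated, which is what the potential-function argument needs.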

STOC ’21, June 21–25, 2021, Virtual, Italy Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang

(the proof in the appendix of the full version [43]) as black-boxes. We highlight two important distinctions between the LP line of research and that of inverse maintenance:

(1) "Slowly-moving" based Amortization vs. Direct Amortization. The inverse maintenance data structures of [63] and [16] amortize the update costs directly over the frequency of updates, as they only consider updating one coordinate per iteration. However, multiple coordinates can be updated in one iteration in the projection maintenance data structure for LP, and the only guarantee is a slowly-moving property (Eq. (2)). Exploiting this slowly-moving property and achieving good amortized cost requires nontrivial ideas, e.g., the soft thresholding of [27] (see Section 2.2 and the appendix in the full version [43]) and, in our paper, the synchronization of the two levels of amortization (see Section 3.4).

(2) LP-specific Projection vs. General Inverse Maintenance. [63] and [16] consider the general inverse maintenance problem. However, the projection maintenance data structures of [27] and of our paper are LP-specific. Indeed, [27] can sample the query vector, and [53] and we can sketch the projection matrix, only because the LP application allows extra perturbations in the output of the data structure, and this robustness takes extra work to ensure (see Sections 2.1 and 3.1). On the other hand, the LP-induced problem may impose additional restrictions on the projection maintenance problem – our lengthy proof in the appendix of the full version [43] ensures the feasibility constraints of the LP are always satisfied (see the discussion in Section 2.3).

4.3 Extension to the First Order Method
Usually the computational bottleneck of a second order method is the matrix inverse, while the most expensive computation in a first order method is matrix-vector multiplication. After modifying the first order method in a non-trivial way, the matrix-vector multiplication problem can be reduced to an inner product search problem [65, 66, 76]. [76] shows a provable convergence result for training a one-hidden-layer neural network with sublinear cost per iteration. Instead of using sketching and maintaining the inverse of the matrix, they use inner product search and maintain the half-space range reporting data structure of [1]. [65, 66] presented sublinear algorithms for the projected gradient descent algorithm and for reinforcement learning. They use locality sensitive hashing (LSH, [8, 36, 39]) to speed up maximum inner product search.

4.4 Extension to Multiple Levels
An obvious question is whether our two-level data structure can be extended to k > 2 levels so as to completely settle the problem, i.e., reduce the running time of our LP solver from n^{2+1/18} to n² assuming ω = 2. A tight conditional lower bound of [16] (in their Section 5) shows that exact dynamic inverse-maintenance data structures cannot go beyond k = 2 levels, under a variant of the OMv Conjecture. This lower bound suggests that in order to go beyond 2 levels, one must crucially exploit: (i) the approximate nature of our problem, i.e., the ability to heavily sketch the central path equations — incidentally, the OMv conjecture [37] is false if sketching is allowed (under our coordinate-wise notion of query error (3)), which circumvents the hardness result of [16]; and (ii) the special properties of orthogonal projection matrices, as opposed to inverse maintenance of general matrices.

While extending our technique to k ≥ 3 levels is indeed an intriguing possibility, we emphasize that a new method for sketching the central path equations (projection matrix) is needed to overcome this barrier (left/right sketching is not enough). We believe such a method must rely on the specific properties of the LP problem and projection matrices.

ACKNOWLEDGEMENTS
We would like to thank Jan van den Brand for very insightful discussions on the data-structure part of the paper and comments on this draft, Yin Tat Lee for useful discussions on the optimization part of the paper, Santosh Vempala for helpful discussions on solving linear systems, and Josh Alman and Jeroen Zuiddam for useful discussions on fast matrix multiplication. We thank the STOC 2021 anonymous reviewers for their comments. We would also like to thank Haotian Jiang, Binghui Peng, Aaron Schild, and Ruizhe Zhang for their comments on the drafts.

This research is supported in part by NSF CAREER award CCF-1844887.

REFERENCES
[1] Pankaj K Agarwal, David Eppstein, and Jirí Matousek. 1992. Dynamic half-space reporting, geometric optimization, and minimum spanning trees. In Annual Symposium on Foundations of Computer Science (FOCS), Vol. 33. 80–80.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. 2017. Much faster algorithms for matrix scaling. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 890–901.
[3] Josh Alman. 2019. Inequalities for matrix multiplication exponent. Personal communication.
[4] Josh Alman. 2019. Limits on the universal method for matrix multiplication. In CCC. https://arxiv.org/pdf/1812.08731.pdf.
[5] Josh Alman and Virginia Vassilevska Williams. 2021. A Refined Laser Method and Faster Matrix Multiplication. In SODA. https://arxiv.org/pdf/2010.05846.pdf.
[6] Noga Alon, Yossi Matias, and Mario Szegedy. 1999. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences 58, 1 (1999), 137–147.
[7] Andris Ambainis, Yuval Filmus, and François Le Gall. 2015. Fast matrix multiplication: limitations of the Coppersmith-Winograd method. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing (STOC). https://arxiv.org/pdf/1411.5414.pdf, 585–593.
[8] Alexandr Andoni and Piotr Indyk. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06). IEEE, 459–468.
[9] Alexandr Andoni, Chengyu Lin, Ying Sheng, Peilin Zhong, and Ruiqi Zhong. 2018. Subspace embedding and linear regression with Orlicz norm. In International Conference on Machine Learning (ICML). 224–233.
[10] Jan van den Brand. 2021. Dynamic Matrix Algorithms and Applications in Convex and Combinatorial Optimization. Ph.D. Dissertation. KTH Royal Institute of Technology.
[11] Jan van den Brand. 2020. A deterministic linear program solver in current matrix multiplication time. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 259–278.
[12] Jan van den Brand. 2021. Unifying Matrix Data Structures: Simplifying and Speeding up Iterative Algorithms. In 4th SIAM Symposium on Simplicity in Algorithms (SOSA). https://arxiv.org/abs/2010.13888.
[13] Jan van den Brand, Yin Tat Lee, Yang P Liu, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang. 2021. Minimum Cost Flows, MDPs, and L1-Regression in Nearly Linear Time for Dense Instances. In STOC.
[14] Jan van den Brand, Yin Tat Lee, Danupon Nanongkai, Richard Peng, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang. 2020. Bipartite Matching in Nearly-linear Time on Moderately Dense Graphs. In FOCS.
[15] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. 2020. Solving Tall Dense Linear Programs in Nearly Linear Time. In STOC. https://arxiv.org/pdf/2002.02304.pdf.


[16] Jan van den Brand, Danupon Nanongkai, and Thatchaphol Saranurak. 2019. Dynamic matrix inverse: Improved algorithms and matching conditional lower bounds. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 456–480.
[17] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. 2021. Training (overparametrized) neural networks in near-linear time. In The 12th Innovations in Theoretical Computer Science Conference (ITCS).
[18] Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming (ICALP). Springer, 693–703.
[19] Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, and Christopher Re. 2021. MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training. In ICLR (Oral presentation). https://openreview.net/forum?id=wWK7yXkULyh.
[20] Matthias Christandl, François Le Gall, Vladimir Lysikov, and Jeroen Zuiddam. 2020. Barriers for fast rectangular matrix multiplication. arXiv preprint. https://arxiv.org/pdf/2003.03019.pdf.
[21] Matthias Christandl, Péter Vrana, and Jeroen Zuiddam. 2019. Barriers for fast matrix multiplication from irreversibility. In CCC. https://arxiv.org/pdf/1812.06952.pdf.
[22] Kenneth L. Clarkson and David P. Woodruff. 2013. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing (STOC). https://arxiv.org/pdf/1207.6365, 81–90.
[23] Kenneth L Clarkson and David P Woodruff. 2015. Input sparsity and hardness for robust subspace approximation. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 310–329.
[24] Kenneth L Clarkson and David P Woodruff. 2015. Sketching for M-estimators: A unified approach to robust regression. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete Algorithms (SODA). SIAM, 921–939.
[25] Michael B Cohen, TS Jayram, and Jelani Nelson. 2018. Simple analyses of the sparse Johnson-Lindenstrauss transform. In 1st Symposium on Simplicity in Algorithms (SOSA). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[26] Michael B Cohen, Yin Tat Lee, Gary Miller, Jakub Pachocki, and Aaron Sidford. 2016. Geometric median in nearly linear time. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC). 9–21.
[27] Michael B Cohen, Yin Tat Lee, and Zhao Song. 2019. Solving Linear Programs in the Current Matrix Multiplication Time. In STOC. https://arxiv.org/pdf/1810.07896.
[28] Michael B Cohen, Aleksander Madry, Dimitris Tsipras, and Adrian Vladu. 2017. Matrix scaling and balancing via box constrained Newton's method and interior point methods. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 902–913.
[29] Henry Cohn, Robert Kleinberg, Balazs Szegedy, and Christopher Umans. 2005. Group-theoretic algorithms for matrix multiplication. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, 379–388.
[30] Samuel I Daitch and Daniel A Spielman. 2008. Faster approximate lossy generalized flow via interior point algorithms. In Proceedings of the fortieth annual ACM symposium on Theory of Computing (STOC). 451–460.
[31] George B Dantzig. 1947. Maximization of a linear function of variables subject to linear inequalities. Activity Analysis of Production and Allocation 13 (1947), 339–347.
[32] Huaian Diao, Zhao Song, David P. Woodruff, and Xin Yang. 2019. Total Least Squares Regression in Input Sparsity Time. In NeurIPS. 2478–2489.
[33] Sally Dong, Yin Tat Lee, and Guanghao Ye. 2021. A Nearly-Linear Time Algorithm for Linear Programs with Small Treewidth: A Multiscale Representation of Robust Central Path. In STOC. arXiv preprint arXiv:2011.05365.
[34] François Le Gall and Florent Urrutia. 2018. Improved rectangular matrix multiplication using powers of the Coppersmith-Winograd tensor. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). https://arxiv.org/pdf/1708.05622.pdf, 1029–1046.
[35] Yu Gao, Yang P Liu, and Richard Peng. 2021. Fully Dynamic Electrical Flows: Sparse Maxflow Faster Than Goldberg-Rao. arXiv preprint arXiv:2101.07233 (2021).
[36] Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. 1999. Similarity search in high dimensions via hashing. In VLDB, Vol. 99. 518–529.
[37] Monika Henzinger, Sebastian Krinninger, Danupon Nanongkai, and Thatchaphol Saranurak. 2015. Unifying and strengthening hardness for dynamic problems via the online matrix-vector multiplication conjecture. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing (STOC). 21–30.
[38] Baihe Huang, Shunhua Jiang, Zhao Song, and Runzhou Tao. 2021. Solving Tall Dense SDPs in the Current Matrix Multiplication Time. arXiv preprint arXiv:2101.08208 (2021).
[39] Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of Computing (STOC). 604–613.
[40] Haotian Jiang. 2021. Minimizing Convex Functions with Integral Minimizers. In ACM-SIAM Symposium on Discrete Algorithms (SODA). https://arxiv.org/pdf/2007.01445.
[41] Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. 2020. A faster interior point method for semidefinite programming. In FOCS. https://arxiv.org/abs/2009.10217.
[42] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. 2020. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC). 944–953.
[43] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. 2020. Faster dynamic matrix inverse for faster LPs. arXiv preprint arXiv:2004.07470 (2020).
[44] Daniel M Kane and Jelani Nelson. 2014. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM (JACM) 61, 1 (2014), 1–23.
[45] Narendra Karmarkar. 1984. A new polynomial-time algorithm for linear programming. In Proceedings of the sixteenth annual ACM symposium on Theory of Computing (STOC). ACM, 302–311.
[46] Leonid G Khachiyan. 1980. Polynomial algorithms in linear programming. USSR Comput. Math. and Math. Phys. 20, 1 (1980), 53–72.
[47] Victor Klee and George J Minty. 1972. How good is the simplex algorithm. Inequalities 3, 3 (1972), 159–175.
[48] François Le Gall. 2014. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation (ISSAC). ACM, https://arxiv.org/pdf/1401.7714.pdf, 296–303.
[49] Jason D Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, and Zheng Yu. 2020. Generalized Leverage Score Sampling for Neural Networks. In NeurIPS.
[50] Yin Tat Lee. 2016. Faster algorithms for convex and combinatorial optimization. Ph.D. Dissertation. Massachusetts Institute of Technology.
[51] Yin Tat Lee and Aaron Sidford. 2014. Path finding methods for linear programming: Solving linear programs in O(√rank) iterations and faster algorithms for maximum flow. In 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS). https://arxiv.org/pdf/1312.6677.pdf, https://arxiv.org/pdf/1312.6713.pdf, 424–433.
[52] Yin Tat Lee and Aaron Sidford. 2015. Efficient inverse maintenance and faster algorithms for linear programming. In 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS). https://arxiv.org/pdf/1503.01752.pdf, 230–249.
[53] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. 2019. Solving Empirical Risk Minimization in the Current Matrix Multiplication Time. In COLT. https://arxiv.org/pdf/1905.04447.
[54] S Cliff Liu, Zhao Song, and Hengjie Zhang. 2020. Breaking the n-Pass Barrier: A Streaming Algorithm for Maximum Weight Bipartite Matching. arXiv preprint arXiv:2009.06106 (2020).
[55] Yang P Liu and Aaron Sidford. 2020. Faster Divergence Maximization for Faster Maximum Flow. In FOCS.
[56] Yang P Liu and Aaron Sidford. 2020. Faster Energy Maximization for Faster Maximum Flow. In STOC.
[57] Yichao Lu, Paramveer Dhillon, Dean P Foster, and Lyle Ungar. 2013. Faster ridge regression via the subsampled randomized Hadamard transform. In Advances in Neural Information Processing Systems. 369–377.
[58] Aleksander Madry. 2013. Navigating central path with electrical flows: From flows to matchings, and back. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 253–262.
[59] Aleksander Madry. 2016. Computing maximum flow with augmenting electrical flows. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 593–602.
[60] Jelani Nelson and Huy L Nguyên. 2013. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, https://arxiv.org/pdf/1211.1002, 117–126.
[61] Richard Peng and Santosh Vempala. 2021. Solving Sparse Linear Systems Faster than Matrix Multiplication. (2021).
[62] Eric Price, Zhao Song, and David P. Woodruff. 2017. Fast regression with an ℓ∞ guarantee. In International Colloquium on Automata, Languages, and Programming (ICALP). https://arxiv.org/pdf/1705.10723.pdf.
[63] Piotr Sankowski. 2004. Dynamic transitive closure via dynamic matrix inverse. In 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, 509–517.
[64] Naum Z Shor. 1977. Cut-off method with space extension in convex programming problems. Cybernetics and Systems Analysis 13, 1 (1977), 94–96.
[65] Anshumali Shrivastava, Zhao Song, and Zhaozhuo Xu. 2021. Breaking the Linear Iteration Cost Barrier using Projected Approximate Max-IP Data Structure. Manuscript (2021).
[66] Anshumali Shrivastava, Zhao Song, and Zhaozhuo Xu. 2021. Sublinear Least-Squares Value Iteration. Manuscript (2021).
[67] Aaron Daniel Sidford. 2015. Iterative methods, combinatorial optimization, and linear programming beyond the universal barrier. Ph.D. Dissertation. Massachusetts Institute of Technology.
[68] Steve Smale. 1998. Mathematical problems for the next century. The Mathematical Intelligencer 20, 2 (1998), 7–15.
[69] Zhao Song. 2019. Matrix Theory: Optimization, Concentration and Algorithms. Ph.D. Dissertation. The University of Texas at Austin.


[70] Zhao Song, Ruosong Wang, Lin F Yang, Hongyang Zhang, and Peilin Zhong. 2019. Efficient Symmetric Norm Regression via Linear Sketching. In Advances in Neural Information Processing Systems (NeurIPS).
[71] Zhao Song, David Woodruff, and Huan Zhang. 2016. Sublinear time orthogonal tensor decomposition. Advances in Neural Information Processing Systems (NIPS) 29 (2016), 793–801.
[72] Zhao Song, David P Woodruff, and Peilin Zhong. 2017. Low Rank Approximation with Entrywise ℓ1-Norm Error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898.
[73] Zhao Song, David P Woodruff, and Peilin Zhong. 2019. Average Case Column Subset Selection for Entrywise ℓ1-Norm Loss. In Advances in Neural Information Processing Systems (NeurIPS).
[74] Zhao Song, David P Woodruff, and Peilin Zhong. 2019. Relative Error Tensor Low Rank Approximation. In ACM-SIAM Symposium on Discrete Algorithms (SODA). https://arxiv.org/pdf/1704.08246.
[75] Zhao Song, David P Woodruff, and Peilin Zhong. 2019. Towards a Zero-One Law for Column Subset Selection. In Advances in Neural Information Processing Systems (NeurIPS).
[76] Zhao Song, Shuo Yang, and Ruizhe Zhang. 2021. Does preprocessing help training over-parameterized neural networks? Manuscript (2021).
[77] Zhao Song and Zheng Yu. 2020. Oblivious Sketching-based Central Path Method for Solving Linear Programming Problems. Manuscript. https://openreview.net/forum?id=fGiKxvF-eub.
[78] Gilbert Strang. 1987. Karmarkar's algorithm and its place in applied mathematics. The Mathematical Intelligencer 9, 2 (1987), 4–10.
[79] Volker Strassen. 1969. Gaussian elimination is not optimal. Numerische Mathematik 13, 4 (1969), 354–356.
[80] Pravin M Vaidya. 1987. An algorithm for linear programming which requires O(((m+n)n² + (m+n)^{1.5}n)L) arithmetic operations. In 28th Annual IEEE Symposium on Foundations of Computer Science (FOCS).
[81] Pravin M Vaidya. 1989. Speeding-up linear programming using fast matrix multiplication. In 30th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 332–337.
[82] Santosh Vempala. 2020. General linear system. Personal communication.
[83] Yining Wang, Hsiao-Yu Fish Tung, Alexander J. Smola, and Anima Anandkumar. 2015. Fast and Guaranteed Tensor Decomposition via Sketching. In NIPS. 991–999.
[84] Virginia Vassilevska Williams. 2012. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of Computing (STOC). ACM, 887–898.
[85] Guanghao Ye. 2020. Fast Algorithm for Solving Structured Convex Programs. The University of Washington, Undergraduate Thesis (2020).
[86] David B Yudin and Arkadi S Nemirovski. 1976. Evaluation of the information complexity of mathematical programming problems. Ekonomika i Matematicheskie Metody 12 (1976), 128–142.
[87] Jeroen Zuiddam. 2019. Inequalities for matrix multiplication exponent. Personal communication.
