
Lecture 4

Polynomial Time Interior Point algorithms for LP, CQP and SDP

4.1 Complexity of Convex Programming


When we attempt to solve any problem, we would like to know whether it is possible to find
a correct solution in a "reasonable time". Had we known that the solution would not be reached
in the next 30 years, we would think (at least) twice before starting to solve it. Of course,
this is an extreme case, but it is highly desirable to distinguish between "computationally
tractable" problems – those that can be solved efficiently – and problems which
are "computationally intractable". The corresponding complexity theory was first developed in
Computer Science for combinatorial (discrete) problems, and later extended to the
case of continuous computational problems, including those of Continuous Optimization. In this
section, we outline the main concepts of CCT – Combinatorial Complexity Theory – along
with their adaptations to Continuous Optimization.

4.1.1 Combinatorial Complexity Theory


A generic combinatorial problem is a family P of problem instances of a “given structure”,
each instance (p) ∈ P being identified by a finite-dimensional data vector Data(p), specifying
the particular values of the coefficients of “generic” analytic expressions. The data vectors are
assumed to be Boolean vectors – with entries taking values 0, 1 only, so that the data vectors
are, actually, finite binary words.

The model of computations in CCT is an idealized computer capable of storing only integers
(i.e., finite binary words), whose operations are bitwise: we are allowed to multiply, add and
compare integers. To add and to compare two ℓ-bit integers takes O(ℓ) "bitwise" elementary
operations, and to multiply a pair of ℓ-bit integers costs O(ℓ²) elementary operations (the cost
of multiplication can be reduced to O(ℓ ln(ℓ)), but this does not matter here).

In CCT, a solution to an instance (p) of a generic problem P is a finite binary word y such
that the pair (Data(p), y) satisfies a certain "verifiable condition" A(·, ·). Namely, it is assumed
that there exists a code M for the above "Integer Arithmetic computer" such that, when executing the
code on any input pair x, y of finite binary words, the computer after finitely many elementary
operations terminates and outputs either “yes”, if A(x, y) is satisfied, or “no”, if A(x, y) is not
satisfied. Thus, P is the problem

Given x, find y such that


A(x, y) = true, (4.1.1)

or detect that no such y exists.

For example, the problem Stones:

Given n positive integers a₁, ..., aₙ, find a vector x = (x₁, ..., xₙ)ᵀ with coordinates
±1 such that Σᵢ xᵢaᵢ = 0, or detect that no such vector exists

is a generic combinatorial problem. Indeed, the data of an instance of the problem, the same as
candidate solutions to the instance, can be naturally encoded by finite sequences of integers. In
turn, finite sequences of integers can easily be encoded by finite binary words. And, of course,
for this problem one can easily point out a code for the "Integer Arithmetic computer" which,
given on input two binary words x = Data(p) and y, encoding the data vector of an instance (p)
of the problem and a candidate solution, respectively, verifies in finitely many "bit" operations
whether y represents a solution to (p).
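To make the "verifiable condition" A(·, ·) concrete, here is a minimal Python sketch (our own illustration; the function name is hypothetical) of the verification step for Stones; it clearly runs in time polynomial in the bit length of the input:

def stones_check(a, x):
    # A(Data(p), y): x solves the instance iff every x_i is +-1 and
    # sum_i x_i * a_i equals 0.
    if len(a) != len(x) or any(xi not in (-1, 1) for xi in x):
        return False
    return sum(xi * ai for xi, ai in zip(x, a)) == 0

# Example: the instance a = (1, 2, 3) is solvable, e.g., by x = (1, 1, -1).
assert stones_check([1, 2, 3], [1, 1, -1])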

A solution algorithm for a generic problem P is a code S for the Integer Arithmetic computer
which, given on input the data vector Data(p) of an instance (p) ∈ P, after finitely many
operations either returns a solution to the instance, or a (correct!) claim that no solution exists.
The running time T_S(p) of the solution algorithm on an instance (p) is exactly the number of
elementary (i.e., bit) operations performed in the course of executing S on Data(p).

A solvability test for a generic problem P is defined similarly to a solution algorithm, but
now all we want of the code is to say (correctly!) whether the input instance is or is not solvable,
just “yes” or “no”, without constructing a solution in the case of the “yes” answer.

The complexity of a solution algorithm/solvability test S is defined as

Compl_S(ℓ) = max{ T_S(p) : (p) ∈ P, length(Data(p)) ≤ ℓ },

where length(x) is the bit length (i.e., number of bits) of a finite binary word x. The algorithm/test
is called polynomial time, if its complexity is bounded from above by a polynomial of ℓ.

Finally, a generic problem P is called polynomially solvable, if it admits a polynomial
time solution algorithm. If P admits a polynomial time solvability test, we say that P is
polynomially verifiable.

Classes P and NP. A generic problem P is said to belong to the class NP, if the corresponding
condition A, see (4.1.1), possesses the following two properties:

I. A is polynomially computable, i.e., the running time T(x, y) (measured, of course,
in elementary "bit" operations) of the associated code M is bounded from above by
a polynomial of the bit length length(x) + length(y) of the input:

T(x, y) ≤ χ (length(x) + length(y))^χ   ∀(x, y). 1)

Thus, the first property of an NP problem states that given the data Data(p) of a
problem instance p and a candidate solution y, it is easy to check whether y is an
actual solution of (p) – to verify this fact, it suffices to compute A(Data(p), y), and
this computation requires polynomial in length(Data(p)) + length(y) time.
The second property of an NP problem makes its instances even easier:

II. A solution to an instance (p) of the problem cannot be "too long" as compared to
the data of the instance: there exists χ such that

length(y) > χ (length(x))^χ ⇒ A(x, y) = "no".

A generic problem P is said to belong to the class P, if it belongs to the class NP and is
polynomially solvable.

NP-completeness is defined as follows:


Definition 4.1.1 (i) Let P, Q be two problems from NP. Problem Q is said to be polynomially
reducible to P, if there exists a polynomial time algorithm M (i.e., a code for the Integer
Arithmetic computer with the running time bounded by a polynomial of the length of the input)
with the following property. Given on input the data vector Data(q) of an instance (q) ∈ Q, M
converts this data vector to the data vector Data(p[q]) of an instance of P such that (p[q]) is
solvable if and only if (q) is solvable.
(ii) A generic problem P from NP is called NP-complete, if every other problem Q from NP
is polynomially reducible to P.
The importance of the notion of an NP-complete problem comes from the following fact:
If a particular NP-complete problem is polynomially verifiable (i.e., admits a poly-
nomial time solvability test), then every problem from NP is polynomially solvable:
P = NP.

The question whether P = NP – whether NP-complete problems are or are not polynomially
solvable – is qualified as "the most important open problem in Theoretical Computer Science"
and has remained open for about 30 years. One of the most basic results of Theoretical Computer
Science is that NP-complete problems do exist (the Stones problem is an example). Many of
these problems are of huge practical importance, and have therefore been the subject, over decades,
of intensive study by thousands of excellent researchers. However, no polynomial time algorithm for
any of these problems has been found. Given the huge total effort invested in this research, we should
conclude that it is "highly improbable" that NP-complete problems are polynomially solvable.
Thus, at the "practical level", the fact that a certain problem is NP-complete is sufficient to qualify
the problem as "computationally intractable", at least at our present level of knowledge.
1)
Here and in what follows, we denote by χ positive “characteristic constants” associated with the predi-
cates/problems in question. The particular values of these constants are of no importance, the only thing that
matters is their existence. Note that in different places of even the same equation χ may have different values.

4.1.2 Complexity in Continuous Optimization


It is convenient to represent continuous optimization problems as Mathematical Programming
problems, i.e., programs of the form

min_x { p₀(x) : x ∈ X(p) ⊂ R^n(p) }   (p)

where
• n(p) is the design dimension of program (p);

• X(p) ⊂ R^n(p) is the feasible domain of the program;

• p₀(x) : R^n(p) → R is the objective of (p).

Families of optimization programs. We want to speak about methods for solving optimization
programs (p) "of a given structure" (for example, Linear Programming ones). All
programs (p) "of a given structure", as in the combinatorial case, form a certain family P, and
we assume that every particular program in this family – every instance (p) of P – is specified by
its particular data Data(p). However, now the data is a finite-dimensional real vector; one may
think of the entries of this data vector as particular values of the coefficients of "generic"
(specific for P) analytic expressions for p₀(x) and X(p). The dimension of the vector Data(p)
will be called the size of the instance:

Size(p) = dim Data(p).

The model of computations. This is what is known as the "Real Arithmetic Model of Computations",
as opposed to the "Integer Arithmetic Model" of CCT. We assume that the computations
are carried out by an idealized version of the usual computer which is capable of storing countably
many reals and can perform with them the standard exact real arithmetic operations – the
four basic arithmetic operations, evaluation of elementary functions, like cos and exp, and
comparisons.

Accuracy of approximate solutions. We assume that a generic optimization problem P
is equipped with an "infeasibility measure" Infeas_P(x, p) – a real-valued function of p ∈ P and
x ∈ R^n(p) which quantifies the infeasibility of a vector x as a candidate solution to (p). In our
general considerations, all we require from this measure is that
• Infeas_P(x, p) ≥ 0, and Infeas_P(x, p) = 0 when x is feasible for (p) (i.e., when x ∈ X(p)).
Given an infeasibility measure, we can proceed to define the notion of an ε-solution to an instance
(p) ∈ P, namely, as follows. Let

Opt(p) ∈ {−∞} ∪ R ∪ {+∞}

be the optimal value of the instance (i.e., the infimum of the values of the objective on the
feasible set, if the instance is feasible, and +∞ otherwise). A point x ∈ R^n(p) is called an
ε-solution to (p), if

Infeas_P(x, p) ≤ ε and p₀(x) − Opt(p) ≤ ε,

i.e., if x is both "ε-feasible" and "ε-optimal" for the instance.

It is convenient to define the number of accuracy digits in an ε-solution to (p) as the quantity

Digits(p, ε) = ln( (Size(p) + ‖Data(p)‖₁ + ε²) / ε ).

Solution methods. A solution method M for a given family P of optimization programs
is a code for the idealized Real Arithmetic computer. When solving an instance (p) ∈ P, the
computer first inputs the data vector Data(p) of the instance and a real ε > 0 – the accuracy to
which the instance should be solved – and then executes the code M on this input. We assume
that the execution, on every input (Data(p), ε > 0) with (p) ∈ P, takes finitely many elementary
operations of the computer, let this number be denoted by Compl_M(p, ε), and results in one of
the following three possible outputs:
– an n(p)-dimensional vector Res_M(p, ε) which is an ε-solution to (p),
– a correct message "(p) is infeasible",
– a correct message "(p) is unbounded below".
We measure the efficiency of a method by its running time Compl_M(p, ε) – the number of
elementary operations performed by the method when solving instance (p) within accuracy ε.
By definition, the fact that M is "efficient" (polynomial time) on P means that there exists a
polynomial π(s, τ) such that

Compl_M(p, ε) ≤ π(Size(p), Digits(p, ε))   ∀(p) ∈ P ∀ε > 0.   (4.1.2)
Informally speaking, polynomiality of M means that when we increase the size of an instance and
the required number of accuracy digits by absolute constant factors, the running time increases
by no more than another absolute constant factor.
We call a family P of optimization problems polynomially solvable (or, which is the same,
computationally tractable), if it admits a polynomial time solution method.

Computational tractability of convex optimization problems. A generic optimization
problem P is called convex, if, for every instance (p) ∈ P, both the objective p₀(x) of the
instance and the infeasibility measure Infeas_P(x, p) are convex functions of x ∈ R^n(p). One of
the major complexity results in Continuous Optimization is that a generic convex optimization
problem, under mild computability and regularity assumptions, is polynomially solvable (and
thus "computationally tractable"). To formulate the precise result, we start with specifying the
aforementioned "mild assumptions".

Polynomial computability. Let P be a generic convex program, and let InfeasP (x, p)
be the corresponding measure of infeasibility of candidate solutions. We say that our family is
polynomially computable, if there exist two codes Cobj , Ccons for the Real Arithmetic computer
such that
1. For every instance (p) ∈ P, the computer, when given on input the data vector of the
instance (p) and a point x ∈ Rn(p) and executing the code Cobj , outputs the value p0 (x) and a
subgradient e(x) ∈ ∂p0 (x) of the objective p0 of the instance at the point x, and the running
time (i.e., total number of operations) of this computation Tobj (x, p) is bounded from above by
a polynomial of the size of the instance:
 
∀ (p) ∈ P, x ∈ R^n(p) : T_obj(x, p) ≤ χ Size^χ(p)   [Size(p) = dim Data(p)].   (4.1.3)

(recall that in our notation, χ is a common name of characteristic constants associated with P).
2. For every instance (p) ∈ P, the computer, when given on input the data vector of the
instance (p), a point x ∈ R^n(p) and an ε > 0 and executing the code C_cons, reports on output
whether Infeas_P(x, p) ≤ ε, and if it is not the case, outputs a linear form a which separates the
point x from all those points y where Infeas_P(y, p) ≤ ε:

∀ (y : Infeas_P(y, p) ≤ ε) : aᵀx > aᵀy,   (4.1.4)

the running time T_cons(x, ε, p) of the computation being bounded by a polynomial of the size of
the instance and of the "number of accuracy digits":

∀ ((p) ∈ P, x ∈ R^n(p), ε > 0) : T_cons(x, ε, p) ≤ χ (Size(p) + Digits(p, ε))^χ.   (4.1.5)

Note that the vector a in (4.1.4) is not required to be nonzero; when it is 0, (4.1.4) simply says
that there are no points y with Infeas_P(y, p) ≤ ε.

Polynomial growth. We say that a generic convex program P is with polynomial growth, if
the objectives and the infeasibility measures, as functions of x, grow polynomially with ‖x‖₁,
the degree of the polynomial being a power of Size(p):

∀ ((p) ∈ P, x ∈ R^n(p)):
|p₀(x)| + Infeas_P(x, p) ≤ (χ [Size(p) + ‖x‖₁ + ‖Data(p)‖₁])^(χ Size^χ(p)).   (4.1.6)

Polynomial boundedness of feasible sets. We say that a generic convex program P has
polynomially bounded feasible sets, if the feasible set X(p) of every instance (p) ∈ P is bounded,
and is contained in the Euclidean ball, centered at the origin, of "not too large" radius:

∀(p) ∈ P:
X(p) ⊂ { x ∈ R^n(p) : ‖x‖₂ ≤ (χ [Size(p) + ‖Data(p)‖₁])^(χ Size^χ(p)) }.   (4.1.7)

Example. Consider the generic optimization problems LP_b, CQP_b, SDP_b with instances in
the conic form

min_{x∈R^n(p)} { p₀(x) ≡ c_(p)ᵀx : x ∈ X(p) ≡ {x : A_(p)x − b_(p) ∈ K_(p), ‖x‖₂ ≤ R} };   (4.1.8)

where K_(p) is a cone belonging to a family K of cones characteristic for the generic program in
question, specifically,
• the family of nonnegative orthants for LP_b,
• the family of direct products of Lorentz cones for CQP_b,
• the family of semidefinite cones for SDP_b.
The data of an instance (p) of the type (4.1.8) is the collection

Data(p) = (n(p), c_(p), A_(p), b_(p), R, size(s) of K_(p)),

with naturally defined size(s) of a cone K from the family K associated with the generic program
under consideration: the sizes of R^n_+ and of S^n_+ equal n, and the size of a direct product of Lorentz
cones is the sequence of the dimensions of the factors.

The generic conic programs in question are equipped with the infeasibility measure

Infeas(x, p) = min { t ≥ 0 : t·e[K_(p)] + A_(p)x − b_(p) ∈ K_(p) },   (4.1.9)

where e[K] is a naturally defined "central point" of K ∈ K, specifically,

• the n-dimensional vector of ones when K = R^n_+,
• the vector e_m = (0, ..., 0, 1)ᵀ ∈ R^m when K is the Lorentz cone L^m, and the direct sum
of these vectors when K is a direct product of Lorentz cones,
• the unit matrix of appropriate size when K is a semidefinite cone.
In the sequel, we refer to the three generic problems just defined as the Linear, Conic
Quadratic and Semidefinite Programming problems with ball constraints, respectively. It is
immediately seen that the generic programs LP_b, CQP_b and SDP_b are convex and possess the
properties of polynomial computability, polynomial growth and polynomially bounded feasible
sets (the latter property is ensured by making the ball constraint ‖x‖₂ ≤ R a part of the
program's formulation).
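For K = R^m_+ (the LP_b case), the measure (4.1.9) has a closed form: the smallest t ≥ 0 with t·e + A_(p)x − b_(p) ≥ 0 is the largest constraint violation. A minimal sketch (our own, with NumPy):

import numpy as np

def infeas_lp(A, b, x):
    # Infeas(x, p) = min{t >= 0 : t*ones + Ax - b >= 0}
    #             = max(0, max_i (b - Ax)_i) for K = R^m_+.
    return max(0.0, float(np.max(b - A @ x)))

A = np.eye(2)
b = np.array([1.0, 2.0])
print(infeas_lp(A, b, np.array([2.0, 3.0])))  # 0.0: x is feasible
print(infeas_lp(A, b, np.array([0.0, 0.0])))  # 2.0: the worst violation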

Computational Tractability of Convex Programming. The role of the properties we
have introduced becomes clear from the following result:

Theorem 4.1.1 Let P be a family of convex optimization programs equipped with the infeasibility
measure Infeas_P(·, ·). Assume that the family is polynomially computable, with polynomial growth
and with polynomially bounded feasible sets. Then P is polynomially solvable.
In particular, the generic Linear, Conic Quadratic and Semidefinite programs with ball constraints
LP_b, CQP_b, SDP_b are polynomially solvable.

Black-box represented convex programs and Information-Based complexity. Theorem
4.1.1 is a more or less straightforward corollary of a result related to the so-called
Information-Based complexity of black-box represented convex programs. This result is
interesting in its own right, which is why we reproduce it here:
Consider a Convex Programming program
Consider a Convex Programming program
min_x { f(x) : x ∈ X }   (4.1.10)

where
• X is a convex compact set in Rn with a nonempty interior
• f is a continuous convex function on X.
Assume that our “environment” when solving (4.1.10) is as follows:
1. We have access to a Separation Oracle Sep(X) for X – a routine which, given on input a
point x ∈ Rⁿ, reports on output whether or not x ∈ int X, and in the case of x ∉ int X,
returns a separator – a nonzero vector e such that

eᵀx ≥ max_{y∈X} eᵀy   (4.1.11)

(the existence of such a separator is guaranteed by the Separation Theorem for convex
sets);

2. We have access to a First Order oracle which, given on input a point x ∈ int X, returns
the value f(x) and a subgradient f′(x) of f at x;

3. We are given two positive reals R ≥ r such that X is contained in the Euclidean ball,
centered at the origin, of the radius R and contains a Euclidean ball of the radius r (not
necessarily centered at the origin).

The result we are interested in is as follows:

Theorem 4.1.2 In the outlined "working environment", for every given ε > 0 it is possible to
find an ε-solution to (4.1.10), i.e., a point x_ε ∈ X with

f(x_ε) ≤ min_{x∈X} f(x) + ε,

in no more than N(ε) subsequent calls to the Separation and the First Order oracles plus no
more than O(1)n²N(ε) arithmetic operations to process the answers of the oracles, with

N(ε) = O(1) n² ln( 2 + Var_X(f)·R / (ε·r) ).   (4.1.12)

Here
Var_X(f) = max_X f − min_X f.

4.1.3 Difficult continuous optimization problems


Real Arithmetic Complexity Theory can borrow from Combinatorial Complexity Theory
the techniques for detecting "computationally intractable" problems. Consider the following
situation: we are given a family P of optimization programs and want to understand whether the
family is computationally tractable. An affirmative answer can be obtained from Theorem 4.1.1;
but how could we justify that the family is intractable? A natural course of action here is to
demonstrate that a certain difficult (NP-complete) combinatorial problem Q can be reduced to P
in such a way that the possibility to solve P in polynomial time would imply the same possibility
for Q. Assume that the objectives of the instances of P are polynomially computable, and that
we can point out a generic combinatorial problem Q known to be NP-complete which can be
reduced to P in the following sense:

There exists a CCT-polynomial time algorithm M which, given on input the data
vector Data(q) of an instance (q) ∈ Q, converts it into a triple (Data(p[q]), ε(q), μ(q))
comprised of the data vector of an instance (p[q]) ∈ P, a positive rational ε(q) and
a rational μ(q) such that (p[q]) is solvable and
— if (q) is unsolvable, then the value of the objective of (p[q]) at every ε(q)-solution
to this problem is ≤ μ(q) − ε(q);
— if (q) is solvable, then the value of the objective of (p[q]) at every ε(q)-solution to
this problem is ≥ μ(q) + ε(q).

We claim that in this case we have all reasons to qualify P as a "computationally
intractable" problem. Assume, on the contrary, that P admits a polynomial time solution method
S, and let us look at what happens if we apply this algorithm to solve (p[q]) within accuracy ε(q).
Since (p[q]) is solvable, the method must produce an ε(q)-solution x to (p[q]). With additional
"polynomial time effort" we may compute the value of the objective of (p[q]) at x (recall that
the objectives of instances from P are assumed to be polynomially computable). Now we can
compare the resulting value of the objective with μ(q); by definition of reducibility, if this value
is ≤ μ(q), q is unsolvable, otherwise q is solvable. Thus, we get a correct "Real Arithmetic"
solvability test for Q. By definition of a Real Arithmetic polynomial time algorithm, the running
time of the test is bounded by a polynomial of s(q) = Size(p[q]) and of the quantity

d(q) = Digits((p[q]), ε(q)) = ln( (Size(p[q]) + ‖Data(p[q])‖₁ + ε²(q)) / ε(q) ).

Now note that if ℓ = length(Data(q)), then the total number of bits in Data(p[q]) and in ε(q) is
bounded by a polynomial of ℓ (since the transformation Data(q) ↦ (Data(p[q]), ε(q), μ(q)) takes
CCT-polynomial time). It follows that both s(q) and d(q) are bounded by polynomials in ℓ, so
that our "Real Arithmetic" solvability test for Q takes a polynomial in length(Data(q)) number
of arithmetic operations.
Recall that Q was assumed to be an NP-complete generic problem, so that it would be "highly
improbable" to find a polynomial time solvability test for it, while we have just managed
to build such a test. We conclude that polynomial solvability of P is highly improbable as
well.

4.2 Interior Point Polynomial Time Methods for LP, CQP and
SDP
4.2.1 Motivation
Theorem 4.1.1 states that generic convex programs, under mild computability and boundedness
assumptions, are polynomially solvable. This result is extremely important theoretically;
however, from the practical viewpoint it is, essentially, no more than "an existence theorem".
Indeed, the "universal" complexity bounds coming from Theorem 4.1.2, although polynomial,
are not that attractive: by Theorem 4.1.1, when solving problem (C) with n design variables,
the "price" of an accuracy digit (what it costs to reduce the current inaccuracy ε by factor 2) is
O(n²) calls to the first order and the separation oracles plus O(n⁴) arithmetic operations to
process the answers of the oracles. Thus, even for the simplest objectives to be minimized over
the simplest feasible sets, the arithmetic price of an accuracy digit is O(n⁴); think how long it will
take to solve a problem with, say, 1,000 variables (which is still a "small" size for many
applications). The good news about the methods underlying Theorem 4.1.2 is their universality:
all they need is a Separation oracle for the feasible set and the possibility to compute the
objective and its subgradient at a given point, which is not that much. The bad news about
these methods has the same source as the good news: the methods are "oracle-oriented" and
capable of using only local information on the program they are solving, in contrast to the fact
that when solving instances of well-structured programs, like LP, we have at our disposal, from
the very beginning, a complete global description of the instance. And of course it is ridiculous
to use a complete global knowledge of the instance just to mimic the local in their nature first
order and separation oracles. What we would like to have is an optimization technique capable
of "utilizing efficiently" our global knowledge of the instance and thus allowing us to get a solution
much faster than is possible for "nearly blind" oracle-oriented algorithms. The major event
in the "recent history" of Convex Optimization, sometimes called the "Interior Point revolution",
was the invention of these "smart" techniques.

4.2.2 Interior Point methods


The Interior Point revolution was started by the seminal work of N. Karmarkar (1984), where
the first interior point method for LP was proposed; in the 18 years since then, interior point (IP)
polynomial time methods have grown into an extremely deep and rich theoretically, and highly
promising computationally, area of Convex Optimization. A detailed overview of the
history and the recent state of this area is beyond the scope of this course; an interested reader
is referred to [16, 18, 12] and references therein. All we intend to do is to give an idea of what
the IP methods are, skipping nearly all (sometimes highly instructive and nontrivial) technicalities.
The simplest way to get a proper impression of (most of) the IP methods is to start with a
quite traditional interior penalty scheme for solving optimization problems.

The Newton method and the Interior penalty scheme


Unconstrained minimization and the Newton method. Seemingly the simplest convex
optimization problem is that of unconstrained minimization of a smooth strongly convex
objective:

min_x { f(x) : x ∈ Rⁿ };   (UC)

"smooth strongly convex" in this context means a three times continuously differentiable convex
function f such that f(x) → ∞ as ‖x‖₂ → ∞, and such that the Hessian matrix f″(x) = [∂²f(x)/∂xᵢ∂xⱼ]
of f is positive definite at every point x. Among the numerous techniques for solving (UC), the
most remarkable one is the Newton method. In its pure form, the Newton method is extremely
transparent and natural: given a current iterate x, we approximate our objective f by its
second-order Taylor expansion at the iterate – by the quadratic function

f_x(y) = f(x) + (y − x)ᵀf′(x) + ½(y − x)ᵀf″(x)(y − x)

– and choose as the next iterate x⁺ the minimizer of this quadratic approximation. Thus, the
Newton method merely iterates the updating

x ↦ x⁺ = x − [f″(x)]⁻¹ f′(x).   (Nwt)
In the case of a (strongly convex) quadratic objective, the approximation coincides with
the objective itself, so that the method reaches the exact solution in one step. It is natural to
guess (and indeed true) that in the case when the objective is smooth and strongly convex
(although not necessarily quadratic) and the current iterate x is close enough to the minimizer
x∗ of f, the next iterate x⁺, although not being x∗ exactly, will be "much closer" to the exact
minimizer than x. The precise (and easy) result is that the Newton method converges locally
quadratically, i.e., that

‖x⁺ − x∗‖₂ ≤ C‖x − x∗‖₂²,

provided that ‖x − x∗‖₂ ≤ r with a small enough value of r > 0 (both this value and C depend
on f). Quadratic convergence means essentially that eventually every new step of the process
increases by a constant factor the number of accuracy digits in the approximate solution.
When started not "close enough" to the minimizer, the "pure" Newton method (Nwt) can
demonstrate weird behaviour (look, e.g., at what happens when the method is applied to the
univariate function f(x) = √(1 + x²)). The simplest way to overcome this drawback is to pass
from the pure Newton method to its damped version

x ↦ x⁺ = x − γ(x)[f″(x)]⁻¹ f′(x),   (NwtD)

where the stepsize γ(x) > 0 is chosen in a way which, on one hand, ensures global convergence of
the method and, on the other hand, enforces γ(x) → 1 as x → x∗ , thus ensuring fast (essentially
the same as for the pure Newton method) asymptotic convergence of the process2) .
Practitioners regard the (properly modified) Newton method as the fastest, in terms
of the iteration count, routine for smooth (not necessarily convex) unconstrained minimization,
although sometimes "too heavy" for practical use: the practical drawbacks of the method are
both the necessity to invert the Hessian matrix at each step, which is computationally costly in
the large-scale case, and especially the necessity to compute this matrix (think how difficult it is
to write a code computing the 5,050 second order derivatives of a messy function of 100 variables).
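A tiny runnable sketch (our own) contrasting the pure and damped updates on the very function √(1 + x²) mentioned above: here f′(x) = x/√(1 + x²) and f″(x) = (1 + x²)^(−3/2), so the pure Newton update is x⁺ = −x³, which diverges whenever |x| > 1. The stepsize rule below is a crude stand-in for the linesearch defining γ(x):

import math

def fp(x):  return x / math.sqrt(1 + x * x)      # f'(x)
def fpp(x): return (1 + x * x) ** -1.5           # f''(x)

def newton(x, damped, iters=30):
    for _ in range(iters):
        step = -fp(x) / fpp(x)                   # the Newton direction
        if abs(step) < 1e-15:                    # already at the minimizer
            break
        gamma = 1.0
        if damped:
            # shrink gamma until |f'| decreases: a crude stand-in for
            # the stepsize gamma(x) of (NwtD)
            while abs(fp(x + gamma * step)) >= abs(fp(x)):
                gamma *= 0.5
        x += gamma * step
    return x

print(newton(2.0, damped=False, iters=5))  # iterates x -> -x^3: blows up
print(newton(2.0, damped=True))            # converges to the minimizer 0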

Classical interior penalty scheme: the construction. Now consider a constrained convex
optimization program. As we remember, one can w.l.o.g. make its objective linear, moving, if
necessary, the actual objective to the list of constraints. Thus, let the problem be
 
min_x { cᵀx : x ∈ X ⊂ Rⁿ },   (C)

where X is a closed convex set, which we assume to possess a nonempty interior. How could we
solve the problem?
Traditionally it was thought that the problems of smooth convex unconstrained minimization
are “easy”; thus, a quite natural desire was to reduce the constrained problem (C) to a series
of smooth unconstrained optimization programs. To this end, let us choose somehow a barrier
(another name – “an interior penalty function”) F (x) for the feasible set X – a function which
is well-defined (and is smooth and strongly convex) on the interior of X and “blows up” as a
point from int X approaches a boundary point of X :

xᵢ ∈ int X, x̄ ≡ lim_{i→∞} xᵢ ∈ ∂X ⇒ F(xᵢ) → ∞ as i → ∞,

and let us look at the one-parametric family of functions generated by our objective and the
barrier:
F_t(x) = tcᵀx + F(x) : int X → R.
Here the penalty parameter t is assumed to be nonnegative.
It is easily seen that under mild regularity assumptions (e.g., in the case of bounded X ,
which we assume from now on)

• Every function Ft (·) attains its minimum over the interior of X , the minimizer x∗ (t) being
unique;

• The central path x∗ (t) is a smooth curve, and all its limiting, t → ∞, points belong to the
set of optimal solutions of (C).
This fact is quite clear intuitively. To minimize F_t(·) for large t is the same as to minimize the
function f_ρ(x) = cᵀx + ρF(x) for small ρ = 1/t. When ρ is small, the function f_ρ is very close to
cᵀx everywhere in X, except a narrow stripe along the boundary of X, the stripe becoming thinner
2)
There are many ways to provide the required behaviour of γ(x), e.g., to choose γ(x) by a linesearch in the
direction e(x) = −[f″(x)]⁻¹f′(x) of the Newton step:

γ(x) = argmin_t f(x + t·e(x)).

and thinner as ρ → 0. Therefore we have all reasons to believe that the minimizer of Ft for large t
(i.e., the minimizer of fρ for small ρ) must be close to the set of minimizers of cT x on X .

We see that the central path x∗(t) is a kind of Ariadne's thread which leads to the solution set of
(C). On the other hand, to reach, given a value t ≥ 0 of the penalty parameter, the point x∗(t)
on this path is the same as to minimize a smooth strongly convex function F_t(·) which attains
its minimum at an interior point of X. The latter problem is a "nearly unconstrained" one, up to
the fact that its objective is not everywhere defined. However, we can easily adapt the methods
of unconstrained minimization, including the Newton one, to handle "nearly unconstrained"
problems. We see that constrained convex optimization in a sense can be reduced to the "easy"
unconstrained one. The conceptually simplest way to make use of this observation would be to
choose a "very large" value t̄ of the penalty parameter, like t̄ = 10⁶ or t̄ = 10¹⁰, and to run an
unconstrained minimization routine, say, the Newton method, on the function F_t̄, thus getting
a good approximate solution to (C) "in one shot". This policy, however, is impractical: since
we have no idea where x∗(t̄) is, we normally will start our process of minimizing F_t̄ very far
from the minimizer of this function, and thus for a long time will be unable to exploit the fast local
convergence of the method for unconstrained minimization we have chosen. A smarter way to
use our Ariadne's thread is exactly the one used by Theseus: to follow the thread. Assume, e.g.,
that we know in advance the minimizer of F₀ ≡ F, i.e., the point x∗(0)3). Thus, we know where
the central path starts. Now let us follow this path: at the i-th step, standing at a point xᵢ "close
enough" to some point x∗(tᵢ) of the path, we
• first, increase a bit the current value tᵢ of the penalty parameter, thus getting a new "target
point" x∗(tᵢ₊₁) on the path,
and
• second, approach our new target point x∗(tᵢ₊₁) by running, say, the Newton method,
started at our current iterate xᵢ, on the function F_{tᵢ₊₁}, until a new iterate xᵢ₊₁ "close enough"
to x∗(tᵢ₊₁) is generated.
As a result of such a step, we restore the initial situation – we again stand at a point which
is close to a point on the central path, but this latter point has been moved along the central
path towards the optimal set of (C). Iterating this updating and strengthening appropriately
our "close enough" requirements as the process goes on, we, like the central path itself, approach
the optimal set. A conceptual advantage of this "path-following" policy as compared to the
"brute force" attempt to reach a target point x∗(t̄) with large t̄ is that now we have a hope to
exploit all the time the strongest feature of our "working horse" (the Newton method) – its fast
local convergence. Indeed, assuming that xᵢ is close to x∗(tᵢ) and that we do not increase the
penalty parameter too rapidly, so that x∗(tᵢ₊₁) is close to x∗(tᵢ) (recall that the central path is
smooth!), we conclude that xᵢ is close to our new target point x∗(tᵢ₊₁). If all our "close enough"
and "not too rapidly" are properly controlled, we may ensure xᵢ to be in the domain of
quadratic convergence of the Newton method as applied to F_{tᵢ₊₁}, and then it will take quite
a small number of steps of the method to recover closeness to our new target point.

Classical interior penalty scheme: the drawbacks. At a qualitative “common sense”


level, the interior penalty scheme looks quite attractive and extremely flexible: for the majority
3)
There is no difficulty in ensuring this assumption: given an arbitrary barrier F and an arbitrary starting
point x̄ ∈ int X, we can pass from F to the new barrier F̄(x) = F(x) − (x − x̄)ᵀ∇F(x̄), which attains its minimum
exactly at x̄, and then use the new barrier F̄ instead of our original barrier F; for the traditional approach
we are following for the time being, F has absolutely no advantages as compared to F̄.

of optimization problems treated by classical optimization, there are plenty of ways to build
a relatively simple barrier meeting all the requirements imposed by the scheme; there is huge
room to play with the policies for increasing the penalty parameter and controlling closeness to
the central path, etc. And the theory says that under quite mild and general assumptions on the
choice of the numerous "free parameters" of our construction, it still is guaranteed to converge
to the optimal set of the problem we have to solve. All looks wonderful, until we realize that
the convergence ensured by the theory is completely "unqualified", a purely asymptotical
phenomenon: we are promised to reach eventually a solution of whatever accuracy we wish,
but how long it will take for a given accuracy – this is a question the "classical" optimization
theory, with its "convergence" and "asymptotic linear/superlinear/quadratic convergence", neither
posed nor answered. And since our life in this world is finite (moreover, usually more finite
than we would like it to be), "asymptotical promises" are perhaps better than nothing, but
definitely are not all we would like to know. What is vitally important for us in theory (and to
some extent also in practice) is the issue of complexity: given an instance of such and such
generic optimization problem and a desired accuracy ε, how large is the computational effort
(# of arithmetic operations) needed to get an ε-solution of the instance? And we would like
the answer to be a polynomial time complexity bound, and not a quantity depending
on "unobservable and uncontrollable" properties of the instance, like the "level of regularity" of
the boundary of X at the (unknown!) optimal solution of the instance.
It turns out that the intuitively nice classical theory we have outlined is unable to say a single
word on the complexity issues (and this is how it should be: reasoning in purely qualitative terms
like "smooth", "strongly convex", etc., definitely cannot yield quantitative results...). Moreover,
from the complexity viewpoint the very philosophy of classical convex optimization turns
out to be wrong:
• As far as complexity is concerned, for nearly all "black box represented" classes of
unconstrained convex optimization problems (those where all we know is that the objective is
called f(x), is (strongly) convex and 2 (3,4,5...) times continuously differentiable, and can be
computed, along with its derivatives up to order ..., at every given point), there is no such
phenomenon as "local quadratic convergence": the Newton method (which uses the second
derivatives) has no advantages as compared to methods which use only the first order
derivatives, etc.;
• The very idea of reducing "black-box-represented" constrained convex problems to
unconstrained ones is flawed – from the complexity viewpoint, unconstrained problems are not easier than
constrained ones...

4.2.3 But...
Luckily, the pessimistic analysis of the classical interior penalty scheme is not the "final truth".
It turned out that what prevents this scheme from yielding a polynomial time method is not the
structure of the scheme, but the huge amount of freedom it allows for its elements (too much
freedom is another word for anarchy...). After some order is added, the scheme becomes a
polynomial time one! Specifically, it was understood that
1. There is a (completely non-traditional) class of "good" (self-concordant4)) barriers. Every
barrier F of this type is associated with a "self-concordance parameter" θ(F), which is a
real ≥ 1;

4)
We do not intend to explain here what a "self-concordant barrier" is; for our purposes it suffices to say that
this is a three times continuously differentiable convex barrier F satisfying a pair of specific differential inequalities
linking the first, the second and the third directional derivatives of F.

2. Whenever a barrier F underlying the interior penalty scheme is self-concordant, one can
specify the notion of "closeness to the central path" and the policy for updating the penalty
parameter in such a way that a single Newton step

xᵢ ↦ xᵢ₊₁ = xᵢ − [∇²F_{tᵢ₊₁}(xᵢ)]⁻¹ ∇F_{tᵢ₊₁}(xᵢ)   (4.2.1)

suffices to update a "close to x∗(tᵢ)" iterate xᵢ into a new iterate xᵢ₊₁ which is close, in
the same sense, to x∗(tᵢ₊₁). All "close to the central path" points belong to int X, so that
the scheme keeps all the iterates strictly feasible.

3. The penalty updating policy mentioned in the previous item is quite simple:

tᵢ ↦ tᵢ₊₁ = (1 + 0.1/√θ(F)) tᵢ;

in particular, it does not "slow down" as tᵢ grows, and it ensures linear growth of the
penalty, with ratio (1 + 0.1/√θ(F)). This is vitally important due to the following fact:

4. The inaccuracy of a point x which is close to some point x∗(t) of the central path, as an
approximate solution to (C), is inversely proportional to t:

cᵀx − min_{y∈X} cᵀy ≤ 2θ(F)/t.

It follows that

(!) After we have managed once to get close to the central path – have built a
point x₀ which is close to a point x∗(t₀), t₀ > 0, on the path – every O(√θ(F))
steps of the scheme improve the quality of approximate solutions generated by
the scheme by an absolute constant factor. In particular, it takes no more than

O(1)√θ(F) ln( 2 + θ(F)/(t₀ε) )

steps to generate a strictly feasible ε-solution to (C).

Note that with our simple penalty updating policy, all that is needed to perform a step of the
interior penalty scheme is to compute the gradient and the Hessian of the underlying
barrier at a single point and to invert the resulting Hessian. A toy numeric illustration of
this recipe is sketched below.
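Here is a toy runnable sketch (our own example, not one made in the text) of items 2–4 for the simplest possible case: minimizing cx over X = [−1, 1] with the barrier F(x) = −ln(1 − x) − ln(1 + x), for which θ(F) = 2:

import math

# F_t(x) = t*c*x + F(x); F_t'(x) = t*c + 2x/(1-x^2),
# F_t''(x) = 1/(1-x)^2 + 1/(1+x)^2. True optimum: x = -1 (for c = 1).
c, theta = 1.0, 2.0
t, x = 1.0, 0.0                            # x*(0) = 0 minimizes F itself
for _ in range(120):
    t *= 1 + 0.1 / math.sqrt(theta)        # the penalty updating policy (item 3)
    for _ in range(3):                     # Newton steps (4.2.1); one suffices in
        g = t * c + 1/(1 - x) - 1/(1 + x)  #  theory, three keep us safely close
        h = 1/(1 - x)**2 + 1/(1 + x)**2
        x -= g / h
print(x)              # ~ -0.9997, strictly feasible
print(2 * theta / t)  # item 4: guaranteed gap c*x - Opt <= 2*theta/t ~ 1.1e-3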

Items 3, 4 say that essentially all we need to derive from the just listed general results a polyno-
mial time method for a generic convex optimization problem is to be able to equip every instance
of the problem with a “good” barrier in such a way that both the parameter of self-concordance
of the barrier θ(F ) and the arithmetic cost at which we can compute the gradient and the Hes-
sian of this barrier at a given point are polynomial in the size of the instance5) . And it turns
5)
Another requirement is to be able to get close, once, to a point x∗(t₀) on the central path with a not "disastrously
small" value of t₀ – we should somehow initialize our path-following method! It turns out that such an initialization
is a minor problem – it can be carried out via the same path-following technique, provided we are given in advance
a strictly feasible solution to our problem.

out that we can meet the latter requirement for all interesting "well-structured" generic convex
programs, in particular, for Linear, Conic Quadratic, and Semidefinite Programming. Moreover,
"the heroes" of our course – LP, CQP and SDP – are especially nice application fields of the
general theory of interior point polynomial time methods; in these particular applications, the
theory can be simplified, on the one hand, and strengthened, on the other.

4.3 Interior point methods for LP, CQP, and SDP: building
blocks
We are about to explain what the interior point methods for LP, CQP, SDP look like.

4.3.1 Canonical cones and canonical barriers


We will be interested in a generic conic problem

min_x { cᵀx : Ax − B ∈ K }   (CP)

associated with a cone K given as a direct product of m "basic" cones, each of them being either
a second-order, or a semidefinite cone:

K = S₊^{k₁} × ... × S₊^{k_p} × L^{k_{p+1}} × ... × L^{k_m} ⊂ E = S^{k₁} × ... × S^{k_p} × R^{k_{p+1}} × ... × R^{k_m}.   (Cone)

Of course, the generic problem in question covers LP (no Lorentz factors, all semidefinite factors
are of dimension 1), CQP (no semidefinite factors) and SDP (no Lorentz factors).
Now, we shall equip the semidefinite and the Lorentz cones with "canonical barriers":
• The canonical barrier for a semidefinite cone S₊^k is

S_k(X) = − ln Det(X) : int S₊^k → R;

the parameter of this barrier, by definition, is θ(S_k) = k 6).

• The canonical barrier for a Lorentz cone L^k = {x ∈ R^k | x_k ≥ √(x₁² + ... + x²_{k−1})} is

L_k(x) = − ln(x_k² − x₁² − ... − x²_{k−1}) = − ln(xᵀJ_k x),   J_k = Diag{−I_{k−1}, 1};

the parameter of this barrier is θ(L_k) = 2.


• The canonical barrier K for the cone K given by (Cone), by definition, is the direct sum
of the canonical barriers of the factors:

K(X) = S_{k₁}(X₁) + ... + S_{k_p}(X_p) + L_{k_{p+1}}(X_{p+1}) + ... + L_{k_m}(X_m),
  Xᵢ ∈ int S₊^{kᵢ} for i ≤ p,  Xᵢ ∈ int L^{kᵢ} for p < i ≤ m;

from now on, we use upper case Latin letters, like X, Y, Z, to denote elements of the space E;
for such an element X, Xᵢ denotes the projection of X onto the i-th factor in the direct product
representation of E as shown in (Cone).
6)
The barrier S_k, as well as the canonical barrier L_k for the Lorentz cone L^k, indeed is self-concordant
(whatever that means), and the parameters assigned to them here by definition are exactly their parameters of
self-concordance.

The parameter of the barrier K, again by definition, is the sum of the parameters of the basic
barriers involved:

θ(K) = θ(S_{k₁}) + ... + θ(S_{k_p}) + θ(L_{k_{p+1}}) + ... + θ(L_{k_m}) = Σ_{i=1}^p kᵢ + 2(m − p).

Recall that all direct factors in the direct product representation (Cone) of our "universe" E
are Euclidean spaces; the matrix factors S^{kᵢ} are endowed with the Frobenius inner product

⟨Xᵢ, Yᵢ⟩_{S^{kᵢ}} = Tr(XᵢYᵢ),

while the "arithmetic factors" R^{kᵢ} are endowed with the usual inner product

⟨Xᵢ, Yᵢ⟩_{R^{kᵢ}} = XᵢᵀYᵢ;

E itself will be regarded as a Euclidean space endowed with the direct sum of the inner products
on the factors:

⟨X, Y⟩_E = Σ_{i=1}^p Tr(XᵢYᵢ) + Σ_{i=p+1}^m XᵢᵀYᵢ.

It is clearly seen that our basic barriers, as well as their direct sum K, indeed are barriers for
the corresponding cones: they are C^∞-smooth on the interiors of their domains, blow up to
∞ along every sequence of points from these interiors converging to a boundary point of the
corresponding domain, and are strongly convex. To verify the latter property, it makes sense to
compute explicitly the first and the second directional derivatives of these barriers (we need the
corresponding formulae in any case); to simplify notation, we write down the derivatives of the
basic functions S_k, L_k at a point x from their domain along a direction h (you should remember
that in the case of S_k both the point and the direction, in spite of their lower-case notation,
are k × k symmetric matrices):

DS_k(x)[h] ≡ (d/dt)|_{t=0} S_k(x + th) = −Tr(x⁻¹h) = ⟨−x⁻¹, h⟩_{S^k},
i.e., ∇S_k(x) = −x⁻¹;

D²S_k(x)[h, h] ≡ (d²/dt²)|_{t=0} S_k(x + th) = Tr(x⁻¹hx⁻¹h) = ⟨x⁻¹hx⁻¹, h⟩_{S^k},
i.e., [∇²S_k(x)]h = x⁻¹hx⁻¹;

DL_k(x)[h] ≡ (d/dt)|_{t=0} L_k(x + th) = −2 (hᵀJ_k x)/(xᵀJ_k x),
i.e., ∇L_k(x) = −(2/(xᵀJ_k x)) J_k x;

D²L_k(x)[h, h] ≡ (d²/dt²)|_{t=0} L_k(x + th) = 4 (hᵀJ_k x)²/(xᵀJ_k x)² − 2 (hᵀJ_k h)/(xᵀJ_k x),
i.e., ∇²L_k(x) = (4/(xᵀJ_k x)²) J_k xxᵀJ_k − (2/(xᵀJ_k x)) J_k.

(4.3.1)

From the expression for D²S_k(x)[h, h] we see that

D²S_k(x)[h, h] = Tr(x⁻¹hx⁻¹h) = Tr([x^{−1/2} h x^{−1/2}]²),

so that D²S_k(x)[h, h] is positive whenever h ≠ 0. It is not difficult to prove that the same is true
for D²L_k(x)[h, h]. Thus, the canonical barriers for semidefinite and Lorentz cones are strongly
convex, and so is their direct sum K(·).
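
Since (4.3.1) is the computational backbone of everything that follows, a quick finite-difference sanity check of the gradient formulas may be reassuring (our own test, not part of the text):

import numpy as np

rng = np.random.default_rng(0)
k, eps = 4, 1e-6

A = rng.standard_normal((k, k))
x = A @ A.T + k * np.eye(k)                      # a point in int S^k_+
h = rng.standard_normal((k, k)); h = (h + h.T) / 2
Sk = lambda y: -np.log(np.linalg.det(y))
print((Sk(x + eps*h) - Sk(x)) / eps,             # numeric DS_k(x)[h]
      np.trace(-np.linalg.inv(x) @ h))           # <-x^{-1}, h>: they agree

J = np.diag([-1.0] * (k - 1) + [1.0])
x = np.append(rng.standard_normal(k - 1), 5.0)   # x_k large => x in int L^k
h = rng.standard_normal(k)
Lk = lambda y: -np.log(y @ J @ y)
print((Lk(x + eps*h) - Lk(x)) / eps,             # numeric DL_k(x)[h]
      -2 * (h @ J @ x) / (x @ J @ x))            # the formula from (4.3.1)
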
It makes sense to illustrate the relatively general concepts and results to follow by how they
look in the particular case when K is the semidefinite cone S₊^k; we shall refer to this situation
as the "SDP case". The essence of the matter in our general case is exactly the same as in
this particular one, but the "straightforward computations", which are easy in the SDP case, become
nearly impossible in the general case; and we have no possibility to explain here how it is possible
(it is!) to get the desired results with a minimum amount of computations.
Due to the role played by the SDP case in our exposition, we use for this case special
notation, along with the just introduced "general" one. Specifically, we denote the standard
– the Frobenius – inner product on E = S^k as ⟨·, ·⟩_F, although we feel free, if necessary, to use
our "general" notation ⟨·, ·⟩_E as well; the associated norm is denoted by ‖·‖₂, so that
‖X‖₂ = √(Tr(X²)), X being a symmetric matrix.

4.3.2 Elementary properties of canonical barriers


Let us establish a number of simple and useful properties of canonical barriers.

Proposition 4.3.1 A canonical barrier, let it be denoted F (F can be either S_k, or L_k, or the
direct sum K of several copies of these "elementary" barriers), possesses the following properties:
(i) F is logarithmically homogeneous, the parameter of logarithmic homogeneity being −θ(F),
i.e., the following identity holds:

t > 0, x ∈ Dom F ⇒ F(tx) = F(x) − θ(F) ln t.

• In the SDP case, i.e., when F = S_k = − ln Det(x) and x is a k × k positive definite
matrix, (i) claims that

− ln Det(tx) = − ln Det(x) − k ln t,

which of course is true.

(ii) Consequently, the following two equalities hold identically in x ∈ Dom F:

(a) ⟨∇F(x), x⟩ = −θ(F);
(b) [∇²F(x)]x = −∇F(x).

• In the SDP case, ∇F(x) = ∇S_k(x) = −x⁻¹ and [∇²F(x)]h = ∇²S_k(x)h =
x⁻¹hx⁻¹ (see (4.3.1)). Here (a) becomes the identity ⟨x⁻¹, x⟩_F ≡ Tr(x⁻¹x) = k,
and (b) kindly informs us that x⁻¹xx⁻¹ = x⁻¹.

(iii) Consequently, the k-th differential D^k F(x) of F, k ≥ 1, is homogeneous, of degree −k, in
x ∈ Dom F:

∀(x ∈ Dom F, t > 0, h₁, ..., h_k):
D^k F(tx)[h₁, ..., h_k] ≡ ∂^k F(tx + s₁h₁ + ... + s_k h_k)/∂s₁∂s₂...∂s_k |_{s₁=...=s_k=0}
  = t^{−k} D^k F(x)[h₁, ..., h_k].   (4.3.2)

Proof. (i): it is immediately seen that S_k and L_k are logarithmically homogeneous with parameters of
logarithmic homogeneity −θ(S_k), −θ(L_k), respectively; and of course the property of logarithmic homogeneity
is stable with respect to taking direct sums of functions: if Dom Φ(u) and Dom Ψ(v) are closed
w.r.t. the operation of multiplying a vector by a positive scalar, and both Φ and Ψ are logarithmically
homogeneous with parameters α, β, respectively, then the function Φ(u) + Ψ(v) is logarithmically
homogeneous with the parameter α + β.
(ii): To get (ii.a), it suffices to differentiate the identity

F(tx) = F(x) − θ(F) ln t

in t at t = 1:

⟨∇F(tx), x⟩ = (d/dt) F(tx) = −θ(F) t⁻¹,

and it remains to set t = 1 in the concluding identity.
Similarly, to get (ii.b), it suffices to differentiate the identity

⟨∇F(x + th), x + th⟩ = −θ(F)

(which is just (ii.a)) in t at t = 0, thus arriving at

⟨[∇²F(x)]h, x⟩ + ⟨∇F(x), h⟩ = 0;

since ⟨[∇²F(x)]h, x⟩ = ⟨[∇²F(x)]x, h⟩ (symmetry of partial derivatives!) and since the resulting equality

⟨[∇²F(x)]x, h⟩ + ⟨∇F(x), h⟩ = 0

holds true identically in h, we come to [∇²F(x)]x = −∇F(x).
(iii): Differentiating k times the identity

F(tx) = F(x) − θ(F) ln t

in x, we get

t^k D^k F(tx)[h₁, ..., h_k] = D^k F(x)[h₁, ..., h_k].

An especially nice specific feature of the barriers S_k, L_k and K is their self-duality:

Proposition 4.3.2 A canonical barrier, let it be denoted F (F can be either S_k, or L_k, or the
direct sum K of several copies of these "elementary" barriers), possesses the following property:
for every x ∈ Dom F, −∇F(x) belongs to Dom F as well, and the mapping x ↦ −∇F(x) :
Dom F → Dom F is self-inverse:

−∇F(−∇F(x)) = x  ∀x ∈ Dom F.   (4.3.3)

Besides this, the mapping x ↦ −∇F(x) is homogeneous of degree −1:

t > 0, x ∈ Dom F ⇒ −∇F(tx) = −t⁻¹∇F(x).   (4.3.4)

• In the SDP case, i.e., when F = S_k and x is a k × k positive definite matrix,
∇F(x) = ∇S_k(x) = −x⁻¹, see (4.3.1), so that the above statements merely say
that the mapping x ↦ x⁻¹ is a self-inverse one-to-one mapping of the interior of
the semidefinite cone onto itself, and that −(tx)⁻¹ = −t⁻¹x⁻¹, both claims being
trivially true.
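For the Lorentz barrier, self-duality is also a two-line computation (using J_k² = I); a numeric sketch of (4.3.3), and of the image staying inside the cone (our own):

import numpy as np

k = 4
J = np.diag([-1.0] * (k - 1) + [1.0])

def minus_grad(x):          # -grad L_k(x) = (2/(x^T J_k x)) J_k x
    return (2.0 / (x @ J @ x)) * (J @ x)

x = np.array([0.5, -0.3, 0.1, 3.0])        # a point in int L^4
y = minus_grad(x)
print(y[-1] > np.linalg.norm(y[:-1]))      # True: y lies in int L^4
print(np.allclose(minus_grad(y), x))       # True: the mapping is self-inverse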

4.4 Primal-dual pair of problems and primal-dual central path


4.4.1 The problem(s)
It makes sense to consider simultaneously the "problem of interest" (CP) and its conic dual;
since K is a direct product of self-dual cones, this dual is a conic problem on the same cone K.
As we remember from Lecture 1, the primal-dual pair associated with (CP) is

min_x { cᵀx : Ax − B ∈ K }   (CP)
max_S { ⟨B, S⟩_E : A*S = c, S ∈ K }   (CD)

Assuming that the linear mapping x ↦ Ax is an embedding (i.e., that Ker A = {0} – this
is Assumption A from Lecture 1), we can write down our primal-dual pair in a symmetric
geometric form (Lecture 1, Section 1.6.1):

min_X { ⟨C, X⟩_E : X ∈ (L − B) ∩ K }   (P)
max_S { ⟨B, S⟩_E : S ∈ (L⊥ + C) ∩ K }   (D)

where L is a linear subspace in E (the image space of the linear mapping x ↦ Ax), L⊥ is the
orthogonal complement to L in E, and C ∈ E satisfies A*C = c, i.e., ⟨C, Ax⟩_E ≡ cᵀx.
To simplify things, from now on we assume that both problems (CP) and (CD) are strictly
feasible. In terms of (P) and (D) this assumption means that both the primal feasible plane
L − B and the dual feasible plane L⊥ + C intersect the interior of the cone K.

Remark 4.4.1 By the Conic Duality Theorem (Lecture 1), both (CP) and (D) are solvable with
equal optimal values:

Opt(CP) = Opt(D)

(recall that we have assumed strict primal-dual feasibility). Since (P) is equivalent to (CP), (P)
is solvable as well, and the optimal value of (P) differs from the one of (CP) by ⟨C, B⟩_E 7). It
follows that the optimal values of (P) and (D) are linked by the relation

Opt(P) − Opt(D) + ⟨C, B⟩_E = 0.   (4.4.1)

4.4.2 The central path(s)


The canonical barrier K of K induces a barrier for the feasible set X = {x | Ax − B ∈ K} of
the problem (CP) written down in the form of (C), i.e., as

min_x { cᵀx : x ∈ X };

this barrier is

K̂(x) = K(Ax − B) : int X → R   (4.4.2)
7)
Indeed, the values of the respective objectives cᵀx and ⟨C, Ax − B⟩_E at the corresponding to each other
feasible solutions x of (CP) and X = Ax − B of (P) differ from each other by exactly ⟨C, B⟩_E:

cᵀx − ⟨C, X⟩_E = cᵀx − ⟨C, Ax − B⟩_E = [cᵀx − ⟨A*C, x⟩] + ⟨C, B⟩_E = ⟨C, B⟩_E,

the bracketed term being 0 due to A*C = c.

and is indeed a barrier. Now we can apply the interior penalty scheme to trace the central
path x∗(t) associated with the resulting barrier; with some effort it can be derived from the
primal-dual strict feasibility that this central path is well-defined (i.e., that the minimizer of

K̂_t(x) = tcᵀx + K̂(x)

on int X exists for every t ≥ 0 and is unique)8). What is important for us for the moment is
the central path itself, not how to trace it. Moreover, it is highly instructive to pass from the
central path x∗ (t) in the space of design variables to its image

X∗ (t) = Ax∗ (t) − B

in E. The resulting curve has a name – it is called the primal central path of the primal-dual
pair (P), (D); by its origin, it is a curve comprised of strictly feasible solutions of (P) (since it
is the same – to say that x belongs to the (interior of) the set X and to say that X = Ax − B
is a (strictly) feasible solution of (P)). A simple and very useful observation is that the primal
central path can be defined solely in terms of (P), (D) and thus is a “geometric entity” – it is
independent of a particular parameterization of the primal feasible plane L − B by the design
vector x:

(*) A point X∗(t) of the primal central path is the minimizer of the aggregate

P_t(X) = t⟨C, X⟩_E + K(X)

on the set (L − B) ∩ int K of strictly feasible solutions of (P).

This observation is just a tautology: x∗(t) is the minimizer on int X of the aggregate

K̂_t(x) ≡ tcᵀx + K̂(x) = t⟨C, Ax⟩_E + K(Ax − B) = P_t(Ax − B) + t⟨C, B⟩_E;

we see that the function P_t(Ax − B) of x ∈ int X differs from the function K̂_t(x)
by a constant (depending on t) and has therefore the same minimizer x∗(t) as the function
K̂_t(x). Now, when x runs through int X, the point X = Ax − B runs exactly through the
set of strictly feasible solutions of (P), so that the minimizer X∗ of P_t on the latter set and
the minimizer x∗(t) of the function P_t(Ax − B) on int X are linked by the relation
X∗ = Ax∗(t) − B.

The "analytic translation" of the above observation is as follows:

(*′) A point X∗(t) of the primal central path is exactly the strictly feasible solution
X to (P) such that the vector tC + ∇K(X) ∈ E is orthogonal to L (i.e., belongs to
L⊥).

Indeed, we know that X∗(t) is the unique minimizer of the smooth convex function P_t(X) =
t⟨C, X⟩_E + K(X) on the intersection of the primal feasible plane L − B and the interior of the
cone K; a necessary and sufficient condition for a point X of this intersection to minimize
P_t over the intersection is that ∇P_t(X) must be orthogonal to L.

8)
In Section 4.2.2, there was no problem with the existence of the central path, since there X was assumed to
be bounded; in our present context, X is not necessarily bounded.

• In the SDP case, a point X∗(t), t > 0, of the primal central path is uniquely defined
by the following two requirements: (1) X∗(t) ≻ 0 should be feasible for (P), and (2)
the k × k matrix

tC − X∗⁻¹(t) = tC + ∇S_k(X∗(t))

(see (4.3.1)) should belong to L⊥, i.e., should be orthogonal, w.r.t. the Frobenius
inner product, to every matrix of the form Ax.

The dual problem (D) is in no sense “worse” than the primal problem (P) and thus also
possesses the central path, now called the dual central path S∗(t), t ≥ 0, of the primal-dual pair
(P), (D). Similarly to (*), (*′), the dual central path can be characterized as follows:

(**) A point S∗(t), t ≥ 0, of the dual central path is the unique minimizer of the
aggregate

    D_t(S) = −t⟨B, S⟩_E + K(S)

on the set of strictly feasible solutions of (D)9). S∗(t) is exactly the strictly feasible
solution S to (D) such that the vector −tB + ∇K(S) is orthogonal to L⊥ (i.e., belongs
to L).

• In the SDP case, a point S∗(t), t > 0, of the dual central path is uniquely defined
  by the following two requirements: (1) S∗(t) ≻ 0 should be feasible for (D), and (2)
  the k × k matrix

      −tB − S∗⁻¹(t) = −tB + ∇S_k(S∗(t))

  (see (4.3.1)) should belong to L, i.e., should be representable in the form Ax for
  some x.
From Proposition 4.3.2 we can derive a wonderful connection between the primal and the dual
central paths:
Theorem 4.4.1 For t > 0, the primal and the dual central paths X∗(t), S∗(t) of a (strictly
feasible) primal-dual pair (P), (D) are linked by the relations

    S∗(t) = −t⁻¹∇K(X∗(t)),
    X∗(t) = −t⁻¹∇K(S∗(t)).        (4.4.3)
Proof. By (*′), the vector tC + ∇K(X∗(t)) belongs to L⊥, so that the vector S = −t⁻¹∇K(X∗(t)) belongs
to the dual feasible plane L⊥ + C. On the other hand, by Proposition 4.3.2 the vector −∇K(X∗(t)) belongs
to Dom K, i.e., to the interior of K; since K is a cone and t > 0, the vector S = −t⁻¹∇K(X∗(t)) belongs
to the interior of K as well. Thus, S is a strictly feasible solution of (D). Now let us compute the gradient
of the aggregate D_t at the point S:

    ∇D_t(S) = −tB + ∇K(−t⁻¹∇K(X∗(t)))
            = −tB + t∇K(−∇K(X∗(t)))        [we have used (4.3.4)]
            = −tB − tX∗(t)                  [we have used (4.3.3)]
            = −t(B + X∗(t)) ∈ L             [since X∗(t) is primal feasible]
9) Note the slight asymmetry between the definitions of the primal aggregate P_t and the dual aggregate D_t:
in the former, the linear term is t⟨C, X⟩_E, while in the latter it is −t⟨B, S⟩_E. This asymmetry is in complete
accordance with the fact that we write (P) as a minimization and (D) as a maximization problem; to write
(D) in exactly the same form as (P), we would have to replace B with −B, thus getting a formula for D_t
completely similar to the one for P_t.

Thus, S is strictly feasible for (D) and ∇D_t(S) ∈ L. But by (**) these properties characterize S∗(t);
thus, S∗(t) = S ≡ −t⁻¹∇K(X∗(t)). This relation, in view of Proposition 4.3.2, implies that X∗(t) =
−t⁻¹∇K(S∗(t)). Another way to get the latter relation from S∗(t) = −t⁻¹∇K(X∗(t)) is simply to
invoke the primal-dual symmetry.
In fact, the connection between the primal and the dual central paths stated by Theorem 4.4.1
can be used to characterize both paths:

Theorem 4.4.2 Let (P), (D) be a strictly feasible primal-dual pair.
For every t > 0, there exists a unique strictly feasible solution X of (P) such that −t⁻¹∇K(X)
is a feasible solution to (D), and this solution X is exactly X∗(t).
Similarly, for every t > 0, there exists a unique strictly feasible solution S of (D) such that
−t⁻¹∇K(S) is a feasible solution of (P), and this solution S is exactly S∗(t).
Proof. By the primal-dual symmetry, it suffices to prove the first claim. We already know (Theorem
4.4.1) that X = X∗(t) is a strictly feasible solution of (P) such that −t⁻¹∇K(X) is feasible
for (D); all we need to prove is that X∗(t) is the only point with these properties, which is
immediate: if X is a strictly feasible solution of (P) such that −t⁻¹∇K(X) is dual feasible, then
−t⁻¹∇K(X) ∈ L⊥ + C, or, which is the same, ∇K(X) ∈ L⊥ − tC, or, which again is the same,
∇P_t(X) = tC + ∇K(X) ∈ L⊥. And we already know from (*′) that the latter property, taken
together with the strict primal feasibility, is characteristic for X∗(t).

On the central path

As we have seen, the primal and the dual central paths are intrinsically linked one to another,
and it makes sense to think of them as of a unique entity – the primal-dual central path of the
primal-dual pair (P), (D). The primal-dual central path is just a curve (X∗(t), S∗(t)) in E × E
such that the projection of the curve on the primal space is the primal central path, and the
projection of it on the dual space is the dual central path.
To save words, from now on we refer to the primal-dual central path simply as to the central
path.
The central path possesses a number of extremely nice properties; let us list some of them.

Characterization of the central path. By Theorem 4.4.2, the points (X∗(t), S∗(t)) of the
central path possess the following properties:

(CentralPath):

1. [Primal feasibility] The point X∗(t) is strictly primal feasible.

2. [Dual feasibility] The point S∗(t) is strictly dual feasible.

3. [“Augmented complementary slackness”] The points X∗(t) and S∗(t) are linked
   by the relation

       S∗(t) = −t⁻¹∇K(X∗(t))    [⇔ X∗(t) = −t⁻¹∇K(S∗(t))].

   • In the SDP case, ∇K(U) = ∇S_k(U) = −U⁻¹ (see (4.3.1)), and the
     augmented complementary slackness relation takes the nice form

         X∗(t)S∗(t) = t⁻¹I,        (4.4.4)

     where I, as usual, is the unit matrix.


In fact, the indicated properties fully characterize the central path: whenever two points X, S
possess the properties 1) – 3) with respect to some t > 0, X is nothing but X∗(t), and S is
nothing but S∗(t) (this again is given by Theorem 4.4.2).

Duality gap along the central path. Recall that for an arbitrary primal-dual feasible pair
(X, S) of the (strictly feasible!) primal-dual pair of problems (P), (D), the duality gap

    DualityGap(X, S) ≡ [⟨C, X⟩_E − Opt(P)] + [Opt(D) − ⟨B, S⟩_E] = ⟨C, X⟩_E − ⟨B, S⟩_E + ⟨C, B⟩_E

(see (4.4.1)), which measures the “total inaccuracy” of X, S as approximate solutions of the
respective problems, can be written down equivalently as ⟨S, X⟩_E (see statement (!) in Section
1.7). Now, what is the duality gap along the central path? The answer is immediate:

    DualityGap(X∗(t), S∗(t)) = ⟨S∗(t), X∗(t)⟩_E
                             = ⟨−t⁻¹∇K(X∗(t)), X∗(t)⟩_E        [see (4.4.3)]
                             = t⁻¹θ(K)                          [see Proposition 4.3.1.(ii)]

We have arrived at a wonderful result10):

Proposition 4.4.1 Under the assumption of primal-dual strict feasibility, the duality gap along the
central path is inversely proportional to the penalty parameter, the proportionality coefficient being
the parameter of the canonical barrier K:

    DualityGap(X∗(t), S∗(t)) = θ(K)/t .

In particular, both X∗(t) and S∗(t) are strictly feasible (θ(K)/t)-approximate solutions to their
respective problems:

    ⟨C, X∗(t)⟩_E − Opt(P) ≤ θ(K)/t ,
    Opt(D) − ⟨B, S∗(t)⟩_E ≤ θ(K)/t .

• In the SDP case, K = S^k_+ and θ(K) = θ(S_k) = k.
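In the SDP case, the relation (4.4.4) makes Proposition 4.4.1 a one-line computation: if S = t⁻¹X⁻¹
for some X ≻ 0, then Tr(XS) = k/t automatically. A minimal numpy check (the matrix X below is
just an illustrative stand-in for X∗(t); feasibility w.r.t. the planes L − B, L⊥ + C plays no role in
this identity):

    import numpy as np

    rng = np.random.default_rng(0)
    k, t = 5, 10.0

    # a random positive definite X, playing the role of X*(t)
    G = rng.standard_normal((k, k))
    X = G @ G.T + k * np.eye(k)

    # on the central path, S*(t) = -t^{-1} grad S_k(X*(t)) = t^{-1} X*(t)^{-1}
    S = np.linalg.inv(X) / t

    print(np.allclose(X @ S, np.eye(k) / t))  # augmented complementary slackness (4.4.4)
    print(np.trace(X @ S), k / t)             # duality gap theta(S_k)/t = k/t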

We see that

    All we need in order to get “quickly” good primal and dual approximate solutions
    is to trace the central path fast; if we were interested in solving only one of the prob-
    lems (P), (D), it would be sufficient to trace fast the associated – primal or dual –
    component of this path. The quality guarantees we get in such a process depend – in
    a completely universal fashion! – solely on the value t of the penalty parameter we
    have managed to achieve and on the value of the parameter of the canonical barrier
    K, and are completely independent of other elements of the data.
10) Which, among other, much more important consequences, explains the name “augmented complementary
slackness” of property (CentralPath.3): at a primal-dual pair of optimal solutions X∗, S∗ the duality gap should be
zero: ⟨S∗, X∗⟩_E = 0. Property (CentralPath.3), as we have just seen, implies that the duality gap at a primal-dual
pair (X∗(t), S∗(t)) from the central path, although nonzero, is “controllable” – it equals θ(K)/t – and becomes
small as t grows.

Near the central path

The conclusion we have just made is a bit too optimistic: well, our life when moving along
the central path would be just fine (at the very least, we would know how good the solutions
we already have are), but how could we move exactly along the path? Among the relations
(CentralPath.1-3) defining the path, the first two are “simple” – just linear – but the third is in
fact a system of nonlinear equations, and we have no hope of satisfying these equations exactly.
Thus, we arrive at the crucial question which, a bit informally, sounds as follows:

    How close (and in what sense close) should we be to the path in order for our life to
    be essentially as nice as if we were exactly on the path?

There are several ways to answer this question; we will present the simplest one.

A distance to the central path. Our canonical barrier K(·) is a strongly convex smooth
function on int K; in particular, its Hessian matrix ∇²K(Y), taken at a point Y ∈ int K, is
positive definite. We can use the inverse of this matrix to measure the distances between points
of E, thus arriving at the norm

    ‖H‖_Y = √( ⟨[∇²K(Y)]⁻¹H, H⟩_E ).

It turns out that

    A good measure of proximity of a strictly feasible primal-dual pair Z = (X, S) to a
    point Z∗(t) = (X∗(t), S∗(t)) from the primal-dual central path is the quantity

        dist(Z, Z∗(t)) ≡ ‖tS + ∇K(X)‖_X ≡ √( ⟨[∇²K(X)]⁻¹(tS + ∇K(X)), tS + ∇K(X)⟩_E ).

    Although written in a form which is non-symmetric w.r.t. X, S, this quantity is in fact
    symmetric in X, S: it turns out that

        ‖tS + ∇K(X)‖_X = ‖tX + ∇K(S)‖_S        (4.4.5)

    for all t > 0 and S, X ∈ int K.

Observe that dist(Z, Z∗(t)) ≥ 0, and dist(Z, Z∗(t)) = 0 if and only if S = −t⁻¹∇K(X), which,
for a strictly primal-dual feasible pair Z = (X, S), means that Z = Z∗(t) (see the characterization
of the primal-dual central path); thus, dist(Z, Z∗(t)) indeed can be viewed as a kind of distance
from Z to Z∗(t).

In the SDP case X, S are k × k symmetric matrices, and

    dist²(Z, Z∗(t)) = ‖tS + ∇S_k(X)‖²_X = ⟨[∇²S_k(X)]⁻¹(tS + ∇S_k(X)), tS + ∇S_k(X)⟩_F
                    = Tr( X(tS − X⁻¹)X(tS − X⁻¹) )        [see (4.3.1)]
                    = Tr( [tX^{1/2}SX^{1/2} − I]² ),

so that

    dist²(Z, Z∗(t)) = Tr( X(tS − X⁻¹)X(tS − X⁻¹) ) = ‖tX^{1/2}SX^{1/2} − I‖²₂.        (4.4.6)

Besides this,

    ‖tX^{1/2}SX^{1/2} − I‖²₂ = Tr( [tX^{1/2}SX^{1/2} − I]² )
        = Tr( t²X^{1/2}SX^{1/2}X^{1/2}SX^{1/2} − 2tX^{1/2}SX^{1/2} + I )
        = Tr(t²X^{1/2}SXSX^{1/2}) − 2t Tr(X^{1/2}SX^{1/2}) + Tr(I)
        = Tr(t²XSXS − 2tXS + I)
        = Tr(t²SXSX − 2tSX + I)
        = Tr(t²S^{1/2}XS^{1/2}S^{1/2}XS^{1/2} − 2tS^{1/2}XS^{1/2} + I)
        = Tr( [tS^{1/2}XS^{1/2} − I]² ),

i.e., (4.4.5) indeed is true.
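In the SDP case the symmetry (4.4.5) can also be confirmed numerically via (4.4.6). A small sketch
(random positive definite X, S – the identity holds on all of int K × int K, no feasibility needed;
matrix square roots are computed by eigendecomposition):

    import numpy as np

    def sym_sqrt(M):
        # symmetric positive definite square root via eigendecomposition
        w, U = np.linalg.eigh(M)
        return (U * np.sqrt(w)) @ U.T

    rng = np.random.default_rng(1)
    k, t = 6, 3.0
    G = rng.standard_normal((k, k)); X = G @ G.T + np.eye(k)
    G = rng.standard_normal((k, k)); S = G @ G.T + np.eye(k)

    Xh, Sh = sym_sqrt(X), sym_sqrt(S)
    lhs = np.linalg.norm(t * Xh @ S @ Xh - np.eye(k), 'fro')  # ||t X^{1/2} S X^{1/2} - I||_2
    rhs = np.linalg.norm(t * Sh @ X @ Sh - np.eye(k), 'fro')  # ||t S^{1/2} X S^{1/2} - I||_2
    print(lhs, rhs)  # the two values coincide, in accordance with (4.4.5)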

In a moderate dist(·, Z∗(·))-neighbourhood of the central path. It turns out that in
such a neighbourhood all is essentially as fine as on the central path itself:

A. Whenever Z = (X, S) is a pair of primal-dual strictly feasible solutions to (P),
(D) such that

    dist(Z, Z∗(t)) ≤ 1,        (Close)

Z is “essentially as good as Z∗(t)”: namely, the duality gap at (X, S) is essentially
as small as at the point Z∗(t):

    DualityGap(X, S) = ⟨S, X⟩_E ≤ 2 DualityGap(Z∗(t)) = 2θ(K)/t.        (4.4.7)
    Let us check A in the SDP case. Let (t, X, S) satisfy the premise of A. The duality gap
    at the pair (X, S) of strictly primal-dual feasible solutions is

        DualityGap(X, S) = ⟨X, S⟩_F = Tr(XS),

    while by (4.4.6) the relation dist((X, S), Z∗(t)) ≤ 1 means that

        ‖tX^{1/2}SX^{1/2} − I‖₂ ≤ 1,

    whence

        ‖X^{1/2}SX^{1/2} − t⁻¹I‖₂ ≤ 1/t.

    Denoting by δ the vector of eigenvalues of the symmetric matrix X^{1/2}SX^{1/2}, we conclude
    that Σᵢ₌₁ᵏ (δᵢ − t⁻¹)² ≤ t⁻², whence

        DualityGap(X, S) = Tr(XS) = Tr(X^{1/2}SX^{1/2}) = Σᵢ₌₁ᵏ δᵢ
            ≤ kt⁻¹ + Σᵢ₌₁ᵏ |δᵢ − t⁻¹| ≤ kt⁻¹ + √k · √( Σᵢ₌₁ᵏ (δᵢ − t⁻¹)² )
            ≤ kt⁻¹ + √k · t⁻¹ ≤ 2kt⁻¹,

    and (4.4.7) follows.

It follows from A that

    For our purposes, it is essentially the same – to move along the primal-dual central
    path, or to trace this path, staying in its “time-space” neighbourhood

        N_κ = {(t, X, S) | X ∈ L − B, S ∈ L⊥ + C, t > 0, dist((X, S), (X∗(t), S∗(t))) ≤ κ}        (4.4.8)

    with certain κ ≤ 1.

Most of the interior point methods for LP, CQP, and SDP, including those most powerful in
practice, solve the primal-dual pair (P), (D) by tracing the central path11), although not all of
them keep the iterates in N_{O(1)}; some of the methods work in much wider neighbourhoods of
the central path, in order to avoid slowing down when passing “highly curved” segments of the
path. At the level of ideas, these “long step path following methods” essentially do not differ
from the “short step” ones – those keeping the iterates in N_{O(1)}; this is why in the analysis part
of our forthcoming presentation we restrict ourselves to the short-step methods. It should be
added that as far as the theoretical efficiency estimates are concerned, the short-step methods
yield the best complexity bounds known so far for LP, CQP and SDP, and in this respect are
essentially better than the long-step methods (although in practice the long-step methods usually
outperform their short-step counterparts).

4.5 Tracing the central path

4.5.1 The path-following scheme

Assume we are solving a strictly feasible primal-dual pair of problems (P), (D) and intend to
trace the associated central path. Essentially all we need is a mechanism for updating a current
iterate (t̄, X̄, S̄) such that t̄ > 0, X̄ is strictly primal feasible, S̄ is strictly dual feasible, and
(X̄, S̄) is, in a certain precise sense, a “good” approximation of the point Z∗(t̄) = (X∗(t̄), S∗(t̄))
on the central path, into a new iterate (t₊, X₊, S₊) with similar properties and a larger value
t₊ > t̄ of the penalty parameter. Given such an updating and iterating it, we shall indeed trace
the central path, with all the benefits (see above) coming from the latter fact12). How could we
construct the required updating? Recalling the description of the central path, we see that our
question is:

    Given a triple (t̄, X̄, S̄) which satisfies the relations

        X ∈ L − B,
        S ∈ L⊥ + C        (4.5.1)

    (which is in fact a system of linear equations) and approximately satisfies the system
    of nonlinear equations

        G_t(X, S) ≡ S + t⁻¹∇K(X) = 0,        (4.5.2)

    update it into a new triple (t₊, X₊, S₊) with the same properties and t₊ > t̄.

Since the left hand side G_t(·) of our system of nonlinear equations is smooth around (t̄, X̄, S̄)
(recall that X̄ was assumed to be strictly primal feasible), the most natural, from the viewpoint
of Computational Mathematics, way to achieve our target is as follows:
11) There exist also potential reduction interior point methods which do not take explicit care of tracing the
central path; an example is the very first IP method for LP – the method of Karmarkar. The potential reduction
IP methods are beyond the scope of our course, which is not a big loss for a practically oriented reader since, as
practical tools, these methods are considered obsolete.
12) Of course, besides knowing how to trace the central path, we should also know how to initialize this process
– how to come close to the path to be able to start tracing it. There are different techniques to resolve this
“initialization difficulty”, and basically all of them achieve the goal by using the same path-tracing technique,
now applied to an appropriate auxiliary problem where the “initialization difficulty” does not arise at all. Thus,
at the level of ideas the initialization techniques do not add anything essentially new, which allows us to skip
all initialization-related issues in our presentation.

1. We choose somehow a desired new value t₊ > t̄ of the penalty parameter;

2. We linearize the left hand side G_{t₊}(X, S) of the system of nonlinear equations (4.5.2) at
   the point (X̄, S̄), and replace (4.5.2) with the linearized system of equations

       G_{t₊}(X̄, S̄) + ∂G_{t₊}(X̄, S̄)/∂X (X − X̄) + ∂G_{t₊}(X̄, S̄)/∂S (S − S̄) = 0;        (4.5.3)

3. We define the corrections ∆X, ∆S from the requirement that the updated pair X₊ =
   X̄ + ∆X, S₊ = S̄ + ∆S must satisfy (4.5.1) and the linearized version (4.5.3) of (4.5.2).
   In other words, the corrections should solve the system

       ∆X ∈ L,
       ∆S ∈ L⊥,        (4.5.4)
       G_{t₊}(X̄, S̄) + ∂G_{t₊}(X̄, S̄)/∂X ∆X + ∂G_{t₊}(X̄, S̄)/∂S ∆S = 0;

4. Finally, we define X₊ and S₊ as

       X₊ = X̄ + ∆X,
       S₊ = S̄ + ∆S.        (4.5.5)

The primal-dual IP methods we are describing basically fit the outlined scheme, up to the
following two important points:

• If the current iterate (X̄, S̄) is not close enough to Z∗(t̄), and/or if the desired improvement
  t₊ − t̄ is too large, the corrections given by the outlined scheme may be too large; as a
  result, the updating (4.5.5) as it is may be inappropriate: e.g., X₊, or S₊, or both, may
  be kicked out of the cone K. (Why not: the linearized system (4.5.3) approximates well the
  “true” system (4.5.2) only locally, and we have no reason to trust corrections coming
  from the linearized system when these corrections are large.)

  There is a standard way to overcome the outlined difficulty – to use the corrections in a
  damped fashion, namely, to replace the updating (4.5.5) with

      X₊ = X̄ + α∆X,
      S₊ = S̄ + β∆S,        (4.5.6)

  and to choose the stepsizes α > 0, β > 0 from additional “safety” considerations, like
  ensuring that the updated pair (X₊, S₊) resides in the interior of K, or enforcing it to stay in
  a desired neighbourhood of the central path, or whatever else (a simple stepsize safeguard
  for the SDP case is sketched after this list). In IP methods, the solution (∆X, ∆S) of (4.5.4)
  plays the role of a search direction (and this is how it is called), and the actual corrections
  are proportional to the search directions rather than equal to them. In this sense the
  situation is completely similar to the one with the Newton method from Section 4.2.2
  (which is natural: the latter method is exactly the linearization method for solving the
  Fermat equation ∇f(x) = 0).
• The “augmented complementary slackness” system (4.5.2) can be written down in many
  different forms which are equivalent to each other in the sense that they share a common
  solution set. E.g., we have the same reasons to express the augmented complementary
  slackness requirement by the nonlinear system (4.5.2) as to express it by the system

      G_t(X, S) ≡ X + t⁻¹∇K(S) = 0,

  not speaking about other possibilities. And although all systems of nonlinear equations

      H_t(X, S) = 0

  expressing the augmented complementary slackness are “equivalent” in the sense that they
  share a common solution set, their linearizations are different and thus lead to different
  search directions and, finally, to different path-following methods. Choosing an appropriate
  (in general even varying from iteration to iteration) analytic representation of the augmented
  complementary slackness requirement, one can gain a lot in the performance of the result-
  ing path-following method, and the IP machinery facilitates this flexibility (see “SDP case
  examples” below).
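As promised, here is a minimal sketch of one standard “safety” rule for the SDP case – a
fraction-to-the-boundary stepsize. This particular rule is our illustrative choice, not a prescription
from the text: the largest α keeping X̄ + α∆X ≻ 0 is read off the smallest eigenvalue of
X̄^{-1/2}∆XX̄^{-1/2}, and only a fixed fraction of it is used.

    import numpy as np

    def max_step(X, dX, frac=0.95):
        """Damped stepsize alpha with X + alpha*dX positive definite.

        X + a*dX > 0  iff  I + a * X^{-1/2} dX X^{-1/2} > 0, i.e. iff
        a * lmin > -1 for the smallest eigenvalue lmin of X^{-1/2} dX X^{-1/2}.
        """
        w, U = np.linalg.eigh(X)
        Xih = (U / np.sqrt(w)) @ U.T               # X^{-1/2}
        lmin = np.linalg.eigvalsh(Xih @ dX @ Xih).min()
        return 1.0 if lmin >= 0 else min(1.0, frac / abs(lmin))

    # usage: alpha = max_step(X_bar, dX); X_plus = X_bar + alpha * dX
    # (and similarly beta = max_step(S_bar, dS) for the dual update in (4.5.6))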

4.5.2 Speed of path-tracing

In the LP-CQP-SDP situation, the speed at which the best, from the theoretical viewpoint, path-
following methods manage to trace the path is inversely proportional to the square root of the
parameter θ(K) of the underlying canonical barrier. It means the following. Started at a point
(t⁰, X⁰, S⁰) from the neighbourhood N₀.₁ of the central path, the method after O(1)√θ(K) steps
reaches the point (t¹ = 2t⁰, X¹, S¹) from the same neighbourhood, after the same O(1)√θ(K)
steps more reaches the point (t² = 2²t⁰, X², S²) from the neighbourhood, and so on – it takes
the method a fixed number O(1)√θ(K) of steps to increase by factor 2 the current value of the
penalty parameter, staying all the time in N₀.₁. By (4.4.7) it means that every O(1)√θ(K) steps
of the method reduce the (upper bound on the) inaccuracy of the current approximate solutions
by factor 2, or, which is the same, add a fixed number of accuracy digits to these solutions. Thus,
“the cost of an accuracy digit” for the (best) path-following methods is O(1)√θ(K) steps. To
realize what this indeed means, we should, of course, know how “heavy” a step is – what is its
arithmetic cost. Well, the arithmetic cost of a step for the “cheapest among the fastest” IP
methods as applied to (CP) is as if all operations carried out at a step were those required by

1. Assembling, given a point X ∈ int K, the symmetric n × n matrix (n = dim x)

       H = A*[∇²K(X)]A;

2. Subsequent Cholesky factorization of the matrix H (which, due to its origin, is symmetric
   positive definite and thus admits a Cholesky decomposition H = DDᵀ with lower triangular
   D).

Looking at (Cone), (CP) and (4.3.1), we immediately conclude that the arithmetic cost of
assembling and factorizing H is polynomial in the size dim Data(·) of the data defining (CP),
and that the parameter θ(K) also is polynomial in this size. Thus, the cost of an accuracy digit
for the methods in question is polynomial in the size of the data, as is required of polynomial
time methods13). Explicit complexity bounds for LP_b, CQP_b, SDP_b are given in Sections 4.6.1,
4.6.2, 4.6.3, respectively.
13) Strictly speaking, the outlined complexity considerations are applicable to the “highway” phase of the
solution process, after we have once reached the neighbourhood N₀.₁ of the central path. However, the results of
our considerations remain unchanged after the initialization expenses are taken into account, see Section 4.6.

4.5.3 The primal and the dual path-following methods

The simplest way to implement the path-following scheme from Section 4.5.1 is to linearize the
augmented complementary slackness equations (4.5.2) as they are, ignoring the option to rewrite
these equations equivalently before linearization. Let us look at the resulting method in more
detail. Linearizing (4.5.2) at a current iterate X̄, S̄, we get the vector equation

    t₊(S̄ + ∆S) + ∇K(X̄) + [∇²K(X̄)]∆X = 0,

where t₊ is the target value of the penalty parameter. The system (4.5.4) now becomes

    (a)  ∆X ∈ L    ⇔  (a′)  ∆X = A∆x   [∆x ∈ Rⁿ]
    (b)  ∆S ∈ L⊥   ⇔  (b′)  A*∆S = 0                                          (4.5.7)
    (c)  t₊[S̄ + ∆S] + ∇K(X̄) + [∇²K(X̄)]∆X = 0;

the unknowns here are ∆X, ∆S and ∆x. To process the system, we eliminate ∆X via (a′) and
multiply both sides of (c) by A*, thus getting the equation

    H∆x + [t₊A*[S̄ + ∆S] + A*∇K(X̄)] = 0,   H ≡ A*[∇²K(X̄)]A.        (4.5.8)

Note that A*[S̄ + ∆S] = c is the objective of (CP) (indeed, S̄ ∈ L⊥ + C, i.e., A*S̄ = c, while
A*∆S = 0 by (b′)). Consequently, (4.5.8) becomes the primal Newton system

    H∆x = −[t₊c + A*∇K(X̄)].        (4.5.9)

Solving this system (which is possible – it is easily seen that the n × n matrix H is positive
definite), we get ∆x and then set

    ∆X = A∆x,
    ∆S = −t₊⁻¹[∇K(X̄) + [∇²K(X̄)]∆X] − S̄,        (4.5.10)

thus getting a solution to (4.5.7). Restricting ourselves to the stepsizes α = β = 1 (see
(4.5.6)), we come to the “closed form” description of the method:

    (a)  t ↦ t₊ > t,
    (b)  x ↦ x₊ = x + ∆x,   ∆x = −[A*(∇²K(X))A]⁻¹[t₊c + A*∇K(X)],        (4.5.11)
    (c)  S ↦ S₊ = −t₊⁻¹[∇K(X) + [∇²K(X)]A∆x],

where x is the current iterate in the space Rⁿ of design variables and X = Ax − B is its image
in the space E.
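In the SDP case (K = S^k_+, so ∇K(X) = −X⁻¹ and [∇²K(X)]H = X⁻¹HX⁻¹, see (4.3.1)), all
entities in (4.5.11) are explicit: H_ij = Tr(AᵢX⁻¹AⱼX⁻¹) and (A*∇K(X))_j = −Tr(AⱼX⁻¹), where
Aⱼ are the matrix coefficients of the map Ax = Σⱼ xⱼAⱼ. Here is a minimal numpy sketch of the
updating (4.5.11.b–c); the function and variable names are illustrative, not from the text:

    import numpy as np

    def primal_step(As, B, c, x, t_plus):
        """One primal path-following update (4.5.11.b-c) for K = S^k_+.

        As : list of n symmetric k x k matrices (A x = sum_j x_j A_j), B : k x k,
        c : objective, x : current strictly feasible iterate, t_plus : new penalty.
        """
        n = len(As)
        X = sum(x[j] * As[j] for j in range(n)) - B          # X = Ax - B, must be > 0
        Xi = np.linalg.inv(X)
        # H = A*[grad^2 K(X)]A :  H_ij = Tr(A_i X^{-1} A_j X^{-1})
        H = np.array([[np.trace(Ai @ Xi @ Aj @ Xi) for Aj in As] for Ai in As])
        # primal Newton system (4.5.9):  H dx = -(t_plus c + A* grad K(X))
        rhs = -(t_plus * c - np.array([np.trace(Aj @ Xi) for Aj in As]))
        dx = np.linalg.solve(H, rhs)                         # H is positive definite
        dX = sum(dx[j] * As[j] for j in range(n))
        S_plus = (Xi - Xi @ dX @ Xi) / t_plus                # (4.5.11.c)
        return x + dx, S_plus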
The resulting scheme admits a quite natural explanation. Consider the function

    F(x) = K(Ax − B);

you can immediately verify that this function is a barrier for the feasible set of (CP). Let also

    F_t(x) = tcᵀx + F(x)

be the associated barrier-generated family of penalized objectives. Relation (4.5.11.b) says that
the iterates in the space of design variables are updated according to

    x ↦ x₊ = x − [∇²F_{t₊}(x)]⁻¹∇F_{t₊}(x),

i.e., the process in the space of design variables is exactly the process (4.2.1) from Section 4.2.3.
Note that (4.5.11) is, essentially, a purely primal process (this is where the name of the
method comes from). Indeed, the dual iterates S, S₊ just do not appear in the formulas for x₊, X₊,
and in fact the dual solutions are no more than “shadows” of the primal ones.

Remark 4.5.1 When constructing the primal path-following method, we started with the
augmented complementary slackness equations in the form (4.5.2). Needless to say, we could start
our developments with the same conditions written down in the “swapped” form

    X + t⁻¹∇K(S) = 0

as well, thus coming to what is called the dual path-following method. Of course, as applied to a
given pair (P), (D), the dual path-following method differs from the primal one. However, the
constructions and results related to the dual path-following method require no special care – they
can be obtained from their “primal counterparts” just by swapping “primal” and “dual” entities.

The complexity analysis of the primal path-following method can be summarized in the following

Theorem 4.5.1 Let 0 < χ ≤ κ ≤ 0.1. Assume that we are given a starting point (t₀, x₀, S₀)
such that t₀ > 0 and the point

    (X₀ = Ax₀ − B, S₀)

is κ-close to Z∗(t₀):

    dist((X₀, S₀), Z∗(t₀)) ≤ κ.

Starting with (t₀, x₀, X₀, S₀), let us iterate process (4.5.11) equipped with the penalty updating
policy

    t₊ = (1 + χ/√θ(K)) t,        (4.5.12)

i.e., let us build the iterates (tᵢ, xᵢ, Xᵢ, Sᵢ) according to

    tᵢ = (1 + χ/√θ(K)) tᵢ₋₁,
    ∆xᵢ = −[A*(∇²K(Xᵢ₋₁))A]⁻¹[tᵢc + A*∇K(Xᵢ₋₁)],
    xᵢ = xᵢ₋₁ + ∆xᵢ,
    Xᵢ = Axᵢ − B,
    Sᵢ = −tᵢ⁻¹[∇K(Xᵢ₋₁) + [∇²K(Xᵢ₋₁)]A∆xᵢ].

The resulting process is well-defined and generates strictly primal-dual feasible pairs (Xᵢ, Sᵢ) such
that (tᵢ, Xᵢ, Sᵢ) stay in the neighbourhood N_κ of the primal-dual central path.
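Theorem 4.5.1 is easy to watch in action. In the toy SDP instance constructed below (an
illustration of ours, not from the text), B = −I and c = A*I, so that x₀ = 0 lies exactly on the
central path at t₀ = 1 (indeed, with C = I and X₀ = I we get t₀C + ∇K(X₀) = I − I = 0 ∈ L⊥).
Running (4.5.11) with the policy (4.5.12), one can observe the duality gap Tr(XᵢSᵢ) tracking
θ(K)/tᵢ = k/tᵢ:

    import numpy as np

    rng = np.random.default_rng(2)
    k, n, chi = 4, 3, 0.1                       # K = S^4_+, x in R^3, kappa = chi = 0.1
    As = []
    for _ in range(n):
        M = rng.standard_normal((k, k))
        As.append((M + M.T) / 2)                # A x = sum_j x_j A_j
    B = -np.eye(k)                              # X = Ax - B equals I at x = 0
    c = np.array([np.trace(Aj) for Aj in As])   # c = A*C with C = I

    x, t = np.zeros(n), 1.0                     # exactly on the central path at t0 = 1
    for i in range(1, 61):
        t *= 1 + chi / np.sqrt(k)                                   # policy (4.5.12)
        X = sum(x[j] * As[j] for j in range(n)) - B
        Xi = np.linalg.inv(X)
        H = np.array([[np.trace(Ai @ Xi @ Aj @ Xi) for Aj in As] for Ai in As])
        dx = np.linalg.solve(H, -(t * c - np.array([np.trace(Aj @ Xi) for Aj in As])))
        dX = sum(dx[j] * As[j] for j in range(n))
        S = (Xi - Xi @ dX @ Xi) / t                                 # (4.5.11.c)
        x = x + dx
        if i % 20 == 0:
            gap = np.trace((sum(x[j] * As[j] for j in range(n)) - B) @ S)
            print(f"t = {t:9.3f}   Tr(XS) = {gap:.3e}   theta(K)/t = {k/t:.3e}")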

The theorem says that with properly chosen κ, χ (e.g., κ = χ = 0.1) we can, having once come close to
the primal-dual central path, trace it by the primal path-following method, keeping the iterates
in the N_κ-neighbourhood of the path and increasing the penalty parameter by an absolute constant
factor every O(√θ(K)) steps – exactly as was claimed in Sections 4.2.3, 4.5.2. This fact is
extremely important theoretically; in particular, it underlies the polynomial time complexity
bounds for LP, CQP and SDP from Section 4.6 below. As a practical tool, the primal and the
dual path-following methods, at least in their short-step form presented above, are not that
attractive. The computational power of the methods can be improved by passing to appropriate
large-step versions of the algorithms, but even these versions are considered inferior as
compared to “true” primal-dual path-following methods (those which “indeed work with both
(P) and (D)”, see below). There are, however, cases when the primal or the dual path-following
scheme seems to be unavoidable; these are, essentially, the situations where the pair (P), (D) is
“highly asymmetric”, e.g., (P) and (D) have design dimensions dim L, dim L⊥ differing by orders
of magnitude. Here it becomes too expensive computationally to treat (P), (D) in a “nearly
symmetric way”, and it is better to focus solely on the problem with the smaller design dimension.
To get an impression of how the primal path-following method works, here is a picture:

[Figure: the 2D feasible set of a toy SDP (K = S³₊); the continuous curve is the primal central
path, and the dots are the iterates xᵢ of the algorithm. The dual solutions cannot be drawn, since
they “live” in a 4-dimensional space (dim L⊥ = dim S³ − dim L = 6 − 2 = 4).]

Here are the corresponding numbers:

    Itr#   Objective    Duality Gap      Itr#   Objective    Duality Gap
     1     -0.100000      2.96             7    -1.359870      8.4e-4
     2     -0.906963      0.51             8    -1.360259      2.1e-4
     3     -1.212689      0.19             9    -1.360374      5.3e-5
     4     -1.301082      6.9e-2          10    -1.360397      1.4e-5
     5     -1.349584      2.1e-2          11    -1.360404      3.8e-6
     6     -1.356463      4.7e-3          12    -1.360406      9.5e-7

4.5.4 The SDP case

In what follows, we specialize the primal-dual path-following scheme to the SDP case and carry
out its complexity analysis.

The path-following scheme in SDP

Let us look at the outlined scheme in the SDP case. Here the system of nonlinear equations
(4.5.2) becomes (see (4.3.1))

    G_t(X, S) ≡ S − t⁻¹X⁻¹ = 0,        (4.5.13)

X, S being positive definite k × k symmetric matrices.
Recall that our generic scheme of a path-following IP method suggests, given a current triple
(t̄, X̄, S̄) with positive t̄ and strictly primal, respectively, dual feasible X̄ and S̄, to update
this triple into a new triple (t₊, X₊, S₊) of the same type as follows:
(i) First, we somehow rewrite the system (4.5.13) as an equivalent system

    Ḡ_t(X, S) = 0;        (4.5.14)

(ii) Second, we choose somehow a new value t₊ > t̄ of the penalty parameter and linearize
system (4.5.14) (with t set to t₊) at the point (X̄, S̄), thus coming to the system of linear
equations

    ∂Ḡ_{t₊}(X̄, S̄)/∂X ∆X + ∂Ḡ_{t₊}(X̄, S̄)/∂S ∆S = −Ḡ_{t₊}(X̄, S̄)        (4.5.15)

for the “corrections” (∆X, ∆S);
We add to (4.5.15) the system of linear equations on ∆X, ∆S expressing the requirement that
a shift of (X̄, S̄) in the direction (∆X, ∆S) should preserve the validity of the linear constraints
in (P), (D), i.e., the equations saying that ∆X ∈ L, ∆S ∈ L⊥. These linear equations can be
written down as

    ∆X = A∆x    [⇔ ∆X ∈ L]
    A*∆S = 0    [⇔ ∆S ∈ L⊥]        (4.5.16)

(iii) We solve the system of linear equations (4.5.15), (4.5.16), thus obtaining a primal-dual
search direction (∆X, ∆S), and update the current iterates according to

    X₊ = X̄ + α∆X,   S₊ = S̄ + β∆S,

where the primal and the dual stepsizes α, β are given by certain “side requirements”.
The major “degree of freedom” of the construction comes from (i) – from how we construct
the system (4.5.14). A very popular way to handle (i), the way which indeed leads to primal-dual
methods, starts from rewriting (4.5.13) in a form symmetric w.r.t. X and S. To this end we
first observe that (4.5.13) is equivalent to every one of the following two matrix equations:

    XS = t⁻¹I;   SX = t⁻¹I.

Adding these equations, we get a “symmetric” w.r.t. X, S matrix equation

    XS + SX = 2t⁻¹I,        (4.5.17)

which, by its origin, is a consequence of (4.5.13). On closer inspection, it turns out that
(4.5.17), regarded as a matrix equation with positive definite symmetric matrices, is equivalent
to (4.5.13). It is possible to use in the role of (4.5.14) the matrix equation (4.5.17) as it is;
this policy leads to the so-called AHO (Alizadeh-Haeberly-Overton) search direction and the
“XS + SX” primal-dual path-following method.
It is also possible to use a “scaled” version of (4.5.17). Namely, let us choose somehow a
positive definite scaling matrix Q and observe that our original matrix equation (4.5.13) says
that S = t⁻¹X⁻¹, which is exactly the same as to say that Q⁻¹SQ⁻¹ = t⁻¹(QXQ)⁻¹; the latter,
in turn, is equivalent to every one of the matrix equations

    QXSQ⁻¹ = t⁻¹I;   Q⁻¹SXQ = t⁻¹I.

Adding these equations, we get the scaled version of (4.5.17):

    QXSQ⁻¹ + Q⁻¹SXQ = 2t⁻¹I,        (4.5.18)

which, same as (4.5.17) itself, is equivalent to (4.5.13).


With (4.5.18) playing the role of (4.5.14), we get a quite flexible scheme with a huge freedom
for choosing the scaling matrix Q, which in particular can be varied from iteration to iteration.
As we shall see in a while, this freedom reflects the intrinsic (and extremely important in the
interior-point context) symmetries of the semidefinite cone.
Analysis of the path-following methods based on search directions coming from (4.5.18)
(“Zhang's family of search directions”) simplifies a lot when at every iteration we choose its own
scaling matrix and ensure that the matrices

    S̃ = Q⁻¹S̄Q⁻¹,   X̃ = QX̄Q

commute (X̄, S̄ are the iterates to be updated); we call such a policy a “commutative scaling”.
Popular commutative scalings are:

1. Q = S̄^{1/2} (S̃ = I, X̃ = S̄^{1/2}X̄S̄^{1/2}) (the “XS” method);

2. Q = X̄^{-1/2} (S̃ = X̄^{1/2}S̄X̄^{1/2}, X̃ = I) (the “SX” method);

3. Q such that S̃ = X̃ (the NT (Nesterov-Todd) method, extremely attractive and deep).

    If X̄ and S̄ were just positive reals, the formula for Q would be simple: Q = (S̄/X̄)^{1/4}.
    In the matrix case this simple formula becomes a bit more complicated (to make our
    life easier, below we write X instead of X̄ and S instead of S̄):

        Q = P^{1/2},   P = X^{-1/2}(X^{1/2}SX^{1/2})^{-1/2}X^{1/2}S.



    We should verify that (a) P is symmetric positive definite, so that Q is well-defined,
    and that (b) Q⁻¹SQ⁻¹ = QXQ.

    (a): Let us first verify that P is symmetric:

        P ?=? Pᵀ
        ⇕
        X^{-1/2}(X^{1/2}SX^{1/2})^{-1/2}X^{1/2}S ?=? SX^{1/2}(X^{1/2}SX^{1/2})^{-1/2}X^{-1/2}
        ⇕
        [X^{-1/2}(X^{1/2}SX^{1/2})^{-1/2}X^{1/2}S][X^{1/2}(X^{1/2}SX^{1/2})^{1/2}X^{-1/2}S⁻¹] ?=? I
        ⇕
        X^{-1/2}(X^{1/2}SX^{1/2})^{-1/2}(X^{1/2}SX^{1/2})(X^{1/2}SX^{1/2})^{1/2}X^{-1/2}S⁻¹ ?=? I
        ⇕
        X^{-1/2}(X^{1/2}SX^{1/2})X^{-1/2}S⁻¹ ?=? I

    and the concluding ?=? indeed is =.

    Now let us verify that P is positive definite. Recall that the spectrum of the product of
    two square matrices, symmetric or not, remains unchanged when swapping the factors.
    Therefore, denoting by σ(A) the spectrum of A, we have

        σ(P) = σ( X^{-1/2}(X^{1/2}SX^{1/2})^{-1/2}X^{1/2}S )
             = σ( (X^{1/2}SX^{1/2})^{-1/2}X^{1/2}SX^{-1/2} )
             = σ( (X^{1/2}SX^{1/2})^{-1/2}(X^{1/2}SX^{1/2})X⁻¹ )
             = σ( (X^{1/2}SX^{1/2})^{1/2}X⁻¹ )
             = σ( X^{-1/2}(X^{1/2}SX^{1/2})^{1/2}X^{-1/2} ),

    and the argument of the concluding σ(·) clearly is a positive definite symmetric matrix.
    Thus, the spectrum of the symmetric matrix P is positive, i.e., P is positive definite.

    (b): To verify that QXQ = Q⁻¹SQ⁻¹, i.e., that P^{1/2}XP^{1/2} = P^{-1/2}SP^{-1/2}, is the
    same as to verify that PXP = S. The latter equality is given by the following compu-
    tation:

        PXP = [X^{-1/2}(X^{1/2}SX^{1/2})^{-1/2}X^{1/2}S] X [X^{-1/2}(X^{1/2}SX^{1/2})^{-1/2}X^{1/2}S]
            = X^{-1/2}(X^{1/2}SX^{1/2})^{-1/2}(X^{1/2}SX^{1/2})(X^{1/2}SX^{1/2})^{-1/2}X^{1/2}S
            = X^{-1/2}X^{1/2}S
            = S.

    You should not think that Nesterov and Todd guessed the formula for this scaling ma-
    trix. They did much more: they developed an extremely deep theory (covering the
    general LP-CQP-SDP case, not just the SDP one!) which, among other things, guar-
    antees that the desired scaling matrix exists (and even is unique). After the existence
    is established, it becomes much easier (although still not that easy) to find an explicit
    formula for Q.
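Claims (a) and (b) are also easy to check numerically. A minimal sketch (random positive definite
X, S; square roots via eigendecomposition; the helper name is ours):

    import numpy as np

    def sym_sqrt(M):
        w, U = np.linalg.eigh((M + M.T) / 2)   # symmetrize to guard against round-off
        return (U * np.sqrt(w)) @ U.T

    rng = np.random.default_rng(3)
    k = 5
    G = rng.standard_normal((k, k)); X = G @ G.T + np.eye(k)
    G = rng.standard_normal((k, k)); S = G @ G.T + np.eye(k)

    Xh = sym_sqrt(X)
    M = Xh @ S @ Xh                                        # X^{1/2} S X^{1/2}
    P = np.linalg.inv(Xh) @ np.linalg.inv(sym_sqrt(M)) @ Xh @ S
    Q = sym_sqrt(P)
    Qi = np.linalg.inv(Q)

    print(np.allclose(P, P.T))                           # (a): P is symmetric ...
    print(np.linalg.eigvalsh((P + P.T) / 2).min() > 0)   # ... and positive definite
    print(np.allclose(Q @ X @ Q, Qi @ S @ Qi))           # (b): QXQ = Q^{-1} S Q^{-1}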

Complexity analysis

We are about to carry out the complexity analysis of the primal-dual path-following methods based
on “commutative” Zhang's scalings. This analysis, although not that difficult, is more technical than
anything else in our course, and a non-interested reader may skip it without any harm.

Scalings. We already have mentioned what a scaling of S^k_+ is: this is the linear one-to-one transfor-
mation of S^k given by the formula

    H ↦ QHQᵀ,        (Scl)

where Q is a nonsingular scaling matrix. It is immediately seen that (Scl) is a symmetry of the semidefinite
cone Sk+ – it maps the cone onto itself. This family of symmetries is quite rich: for every pair of points
A, B from the interior of the cone, there exists a scaling which maps A onto B, e.g., the scaling
1/2 −1/2 −1/2 1/2
H → (B
  A )H(A
 B ).
Q QT

Essentially, this is exactly the existence of that rich family of symmetries of the underlying cones which
makes SDP (same as LP and CQP, where the cones also are “perfectly symmetric”) especially well suited
for IP methods.
In what follows we will be interested in scalings associated with positive definite scaling matrices.
The scaling given by such a matrix Q (X,S,...) will be denoted by Q (resp., X ,S,...):
Q[H] = QHQ.
Given a problem of interest (CP) (where K = S^k_+) and a scaling matrix Q ≻ 0, we can scale the problem,
i.e., pass from it to the problem

    min_x { cᵀx : Q[Ax − B] ⪰ 0 }        (Q(CP))

which, of course, is equivalent to (CP) (since Q[H] is positive semidefinite iff H is so). In terms of the
“geometric reformulation” (P) of (CP), this transformation is nothing but the substitution of variables

    QXQ = Y ⇔ X = Q⁻¹YQ⁻¹;

with respect to the Y-variables, (P) is the problem

    min_Y { Tr(C[Q⁻¹YQ⁻¹]) : Y ∈ Q(L) − Q[B], Y ⪰ 0 },

i.e., the problem

    min_Y { Tr(ĈY) : Y ∈ L̂ − B̂, Y ⪰ 0 },        (P̂)
    [Ĉ = Q⁻¹CQ⁻¹, B̂ = Q[B] = QBQ, L̂ = Q(L)]

The problem dual to (P̂) is

    max_Z { Tr(B̂Z) : Z ∈ L̂⊥ + Ĉ, Z ⪰ 0 }.        (D̂)

It is immediate to realize what L̂⊥ is:

    ⟨Z, QXQ⟩_F = Tr(ZQXQ) = Tr(QZQX) = ⟨QZQ, X⟩_F;

thus, Z is orthogonal to every matrix from L̂, i.e., to every matrix of the form QXQ with X ∈ L, iff the
matrix QZQ is orthogonal to every matrix from L, i.e., iff QZQ ∈ L⊥. It follows that

    L̂⊥ = Q⁻¹(L⊥).

Thus, when acting on the primal-dual pair (P), (D) of SDP's, a scaling, given by a matrix Q ≻ 0, converts
it into another primal-dual pair of problems, and this new pair is as follows:
• The “primal” geometric data – the subspace L and the primal shift B (which has a part-time job
  to be the dual objective as well) – are replaced with their images under the mapping Q;
• The “dual” geometric data – the subspace L⊥ and the dual shift C (it is the primal objective as
  well) – are replaced with their images under the mapping Q⁻¹ inverse to Q; this inverse mapping again
  is a scaling, the scaling matrix being Q⁻¹.
We see that it makes sense to speak about a primal-dual scaling which acts on both the primal and
the dual variables, mapping a primal variable X onto QXQ, and a dual variable S onto Q⁻¹SQ⁻¹.
Formally speaking, the primal-dual scaling associated with a matrix Q ≻ 0 is the linear transformation
(X, S) ↦ (QXQ, Q⁻¹SQ⁻¹) of the direct product of two copies of S^k (the “primal” and the “dual” ones).
A primal-dual scaling acts naturally on different entities associated with a primal-dual pair (P), (D), in
particular, on:


• the pair (P), (D) itself – it is converted into another primal-dual pair of problems (P̂), (D̂);

• a primal-dual feasible pair (X, S) of solutions to (P), (D) – it is converted into the pair (X̂ =
  QXQ, Ŝ = Q⁻¹SQ⁻¹), which, as is immediately seen, is a pair of feasible solutions to (P̂), (D̂).
  Note that the primal-dual scaling preserves strict feasibility and the duality gap:

      DualityGap_{P,D}(X, S) = Tr(XS) = Tr(QXSQ⁻¹) = Tr(X̂Ŝ) = DualityGap_{P̂,D̂}(X̂, Ŝ);

• the primal-dual central path (X∗(·), S∗(·)) of (P), (D); it is converted into the curve (X̂∗(t) =
  QX∗(t)Q, Ŝ∗(t) = Q⁻¹S∗(t)Q⁻¹), which is nothing but the primal-dual central path Ẑ(t) of the
  primal-dual pair (P̂), (D̂).

  The latter fact can be easily derived from the characterization of the primal-dual central path; a
  more instructive derivation is based on the fact that our “hero” – the barrier S_k(·) – is “semi-
  invariant” w.r.t. scaling:

      S_k(Q(X)) = −ln Det(QXQ) = −ln Det(X) − 2 ln Det(Q) = S_k(X) + const(Q).

  Now, a point on the primal central path of the problem (P̂) associated with penalty parameter t,
  let this point be temporarily denoted by Y(t), is the unique minimizer of the aggregate

      S_k^t(Y) = t⟨Q⁻¹CQ⁻¹, Y⟩_F + S_k(Y) ≡ t Tr(Q⁻¹CQ⁻¹Y) + S_k(Y)

  over the set of strictly feasible solutions of (P̂). The latter set is exactly the image of the set of
  strictly feasible solutions of (P) under the transformation Q, so that Y(t) is the image, under the
  same transformation, of the point, let it be called X(t), which minimizes the aggregate

      S_k^t(QXQ) = t Tr((Q⁻¹CQ⁻¹)(QXQ)) + S_k(QXQ) = t Tr(CX) + S_k(X) + const(Q)

  over the set of strictly feasible solutions of (P). We see that X(t) is exactly the point X∗(t) on
  the primal central path associated with problem (P). Thus, the point Y(t) of the primal central
  path associated with (P̂) is nothing but X̂∗(t) = QX∗(t)Q. Similarly, the point of the central path
  associated with the problem (D̂) is exactly Ŝ∗(t) = Q⁻¹S∗(t)Q⁻¹.

• the neighbourhood N_κ of the primal-dual central path Z(·) associated with the pair of problems
  (P), (D) (see (4.4.8)). As you can guess, the image of N_κ is exactly the neighbourhood N̂_κ, given
  by (4.4.8), of the primal-dual central path Ẑ(·) of (P̂), (D̂).

  The latter fact is immediate (a numerical check of this invariance is sketched below): for a pair
  (X, S) of strictly feasible primal and dual solutions to (P), (D) and a t > 0 we have (see (4.4.6)):

      dist²((X̂, Ŝ), Ẑ∗(t)) = Tr( [QXQ](tQ⁻¹SQ⁻¹ − [QXQ]⁻¹)[QXQ](tQ⁻¹SQ⁻¹ − [QXQ]⁻¹) )
                            = Tr( QX(tS − X⁻¹)X(tS − X⁻¹)Q⁻¹ )
                            = Tr( X(tS − X⁻¹)X(tS − X⁻¹) )
                            = dist²((X, S), Z∗(t)).
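A minimal numerical confirmation of this invariance (random positive definite X, S and scaling
matrix Q; the names are illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    k, t = 5, 2.0

    def rand_pd():
        G = rng.standard_normal((k, k))
        return G @ G.T + np.eye(k)

    X, S, Q = rand_pd(), rand_pd(), rand_pd()
    Qi = np.linalg.inv(Q)

    def dist2(X, S):
        # squared proximity (4.4.6): Tr[ X (tS - X^{-1}) X (tS - X^{-1}) ]
        R = t * S - np.linalg.inv(X)
        return np.trace(X @ R @ X @ R)

    print(dist2(X, S), dist2(Q @ X @ Q, Qi @ S @ Qi))  # the two values coincide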

Primal-dual short-step path-following methods based on commutative scalings.
The path-following methods we are about to consider trace the primal-dual central path of (P), (D),
staying in the N_κ-neighbourhood of the path; here κ ≤ 0.1 is fixed. The path is traced by iterating the
following updating:

(U): Given a current pair of strictly feasible primal and dual solutions (X̄, S̄) such that the
triple

    ( t̄ = k/Tr(X̄S̄), X̄, S̄ )        (4.5.19)

belongs to N_κ, i.e. (see (4.4.6))

    ‖t̄X̄^{1/2}S̄X̄^{1/2} − I‖₂ ≤ κ,        (4.5.20)

we

1. Choose the new value t₊ of the penalty parameter according to

       t₊ = (1 − χ/√k)⁻¹ t̄,        (4.5.21)

   where χ ∈ (0, 1) is a parameter of the method;

2. Choose somehow the scaling matrix Q ≻ 0 such that the matrices X̃ = QX̄Q and
   S̃ = Q⁻¹S̄Q⁻¹ commute with each other;

3. Linearize the equation

       QXSQ⁻¹ + Q⁻¹SXQ = (2/t₊) I

   at the point (X̄, S̄), thus coming to the equation

       Q[∆XS̄ + X̄∆S]Q⁻¹ + Q⁻¹[∆SX̄ + S̄∆X]Q = (2/t₊) I − [QX̄S̄Q⁻¹ + Q⁻¹S̄X̄Q];        (4.5.22)

4. Add to (4.5.22) the linear equations

       ∆X ∈ L,
       ∆S ∈ L⊥;        (4.5.23)

5. Solve system (4.5.22), (4.5.23), thus getting the “primal-dual search direction” (∆X, ∆S);

6. Update the current primal-dual solutions (X̄, S̄) into the new pair (X₊, S₊) according to

       X₊ = X̄ + ∆X,   S₊ = S̄ + ∆S.
We have already explained the ideas underlying (U), up to the fact that in our previous explanations we
dealt with three “independent” entities: t̄ (the current value of the penalty parameter) and X̄, S̄ (the
current primal and dual solutions), while in (U) t̄ is a function of X̄, S̄:

    t̄ = k/Tr(X̄S̄).        (4.5.24)

The reason for establishing this dependence is very simple: if (t, X, S) were on the primal-dual central
path: XS = t⁻¹I, then, taking traces, we indeed would get t = k/Tr(XS). Thus, (4.5.24) is a reasonable
way to reduce the number of “independent entities” we deal with.
Note also that (U) is a “pure Newton scheme” – here the primal and the dual stepsizes are equal to
1 (cf. (4.5.6)).
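For completeness, here is a deliberately brute-force numpy sketch of steps 1–6 of (U) for the “XS”
scaling Q = S̄^{1/2}: the linear system (4.5.22)–(4.5.23) is simply vectorized and solved by least
squares. A production implementation would, of course, exploit the structure of the system instead;
everything here is illustrative only.

    import numpy as np

    def U_step(As, X, S, chi=0.1):
        """One updating (U) with the "XS" scaling Q = S^{1/2}, K = S^k_+.

        As : symmetric matrices A_1..A_n spanning L (dX = sum_j dx_j A_j).
        """
        k, n = X.shape[0], len(As)
        t_plus = (k / np.trace(X @ S)) / (1 - chi / np.sqrt(k))  # (4.5.19), (4.5.21)

        w, U = np.linalg.eigh(S)
        Q = (U * np.sqrt(w)) @ U.T                               # Q = S^{1/2}
        Qi = np.linalg.inv(Q)
        lin = lambda dX, dS: (Q @ (dX @ S + X @ dS) @ Qi         # linear part of
                              + Qi @ (dS @ X + S @ dX) @ Q)      # (4.5.22)
        rhs = (2.0 / t_plus) * np.eye(k) - (Q @ X @ S @ Qi + Qi @ S @ X @ Q)

        Es = []                                                  # basis of S^k for dS
        for i in range(k):
            for j in range(i, k):
                E = np.zeros((k, k)); E[i, j] = E[j, i] = 1.0
                Es.append(E)
        Z = np.zeros((k, k))
        # rows: k^2 equations (4.5.22) + n equations Tr(A_j dS) = 0 (dS in L^perp)
        top = np.column_stack([lin(A, Z).ravel() for A in As]
                              + [lin(Z, E).ravel() for E in Es])
        bot = np.column_stack([np.zeros(n)] * n
                              + [[np.trace(A @ E) for A in As] for E in Es])
        u = np.linalg.lstsq(np.vstack([top, bot]),
                            np.concatenate([rhs.ravel(), np.zeros(n)]), rcond=None)[0]
        dX = sum(u[j] * As[j] for j in range(n))
        dS = sum(u[n + a] * Es[a] for a in range(len(Es)))
        return t_plus, X + dX, S + dS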
The major element of the complexity analysis of path-following polynomial time methods for SDP is
as follows:

Theorem 4.5.2 Let the parameters κ, χ of (U) satisfy the relations

    0 < χ ≤ κ ≤ 0.1.        (4.5.25)

Let, further, (X̄, S̄) be a pair of strictly feasible primal and dual solutions to (P), (D) such that the triple
(4.5.19) satisfies (4.5.20). Then the updated pair (X₊, S₊) is well-defined (i.e., system (4.5.22), (4.5.23)
is solvable with a unique solution), X₊, S₊ are strictly feasible solutions to (P), (D), respectively,

    t₊ = k/Tr(X₊S₊),

and the triple (t₊, X₊, S₊) belongs to N_κ.

The theorem says that with properly chosen κ, χ (say, κ = χ = 0.1), the updating (U) converts a strictly
primal-dual feasible iterate (X̄, S̄) which is close, in the sense of (4.5.20), to the primal-dual central
path, into a new strictly primal-dual feasible iterate with the same closeness-to-the-path property and a
larger, by factor (1 + O(1)k^{-1/2}), value of the penalty parameter. Thus, after we get close to the
path – reach its 0.1-neighbourhood N₀.₁ – we are able to trace this path, staying in N₀.₁ and increasing
the penalty parameter by an absolute constant factor in O(√k) = O(√θ(K)) steps, exactly as announced
in Section 4.5.2.

Proof of Theorem 4.5.2. 1⁰. Observe, first (this observation is crucial!), that it suffices to prove the
Theorem in the particular case when X̄, S̄ commute with each other and Q = I. Indeed, it is immediately
seen that the updating (U) can be represented as follows:

1. We first scale by Q the “input data” of (U) – the primal-dual pair of problems (P), (D) and the
   strictly feasible pair X̄, S̄ of primal and dual solutions to these problems, as explained in the paragraph
   “Scalings”. Note that the resulting entities – a pair of primal-dual problems and a strictly feasible
   pair of primal-dual solutions to these problems – are linked with each other exactly in the same
   fashion as the original entities, due to the scaling invariance of the duality gap and of the neighbourhood
   N_κ. In addition, the scaled primal and dual solutions commute;

2. We apply to the “scaled input data” yielded by the previous step an updating completely similar
   to (U), but using the unit matrix in the role of Q;

3. We “scale back” the result of the previous step, i.e., subject this result to the scaling associated
   with Q⁻¹, thus obtaining the updated iterate (X₊, S₊).

Given that the second step of this procedure preserves primal-dual strict feasibility, w.r.t. the scaled
primal-dual pair of problems, of the iterate and keeps the iterate in the κ-neighbourhood N̂_κ of the
corresponding central path, we could use once again the “scaling invariance” reasoning to assert that the
result (X₊, S₊) of (U) is well-defined, is strictly feasible for (P), (D) and is close to the original central
path, as claimed in the Theorem. Thus, all we need is to justify the above “Given”, and this is exactly
the same as to prove the theorem in the particular case of Q = I and commuting X̄, S̄. In the rest of
the proof we assume that Q = I and that the matrices X̄, S̄ commute with each other. Due to the latter
property, X̄, S̄ are diagonal in a properly chosen orthonormal basis; representing all matrices from S^k in
this basis, we can reduce the situation to the case when X̄ and S̄ are diagonal. Thus, we may (and do)
assume in the sequel that X̄ and S̄ are diagonal, with diagonal entries xᵢ, sᵢ, i = 1, ..., k, respectively, and
that Q = I. Finally, to simplify notation, we write t, X, S instead of t̄, X̄, S̄, respectively.
2⁰. Our situation and goals now are as follows. We are given affine planes L − B, L⊥ + C in S^k,
orthogonal to each other, and two positive definite diagonal matrices X = Diag({xᵢ}) ∈ L − B,
S = Diag({sᵢ}) ∈ L⊥ + C. We set

    μ = 1/t = Tr(XS)/k

and know that

    ‖tX^{1/2}SX^{1/2} − I‖₂ ≤ κ.

We further set

    μ₊ = 1/t₊ = (1 − χk^{-1/2})μ        (4.5.26)

and consider the system of equations w.r.t. unknown symmetric matrices ∆X, ∆S:

    (a) ∆X ∈ L
    (b) ∆S ∈ L⊥        (4.5.27)
    (c) ∆XS + X∆S + ∆SX + S∆X = 2μ₊I − 2XS

We should prove that the system has a unique solution such that the matrices

    X₊ = X + ∆X,   S₊ = S + ∆S

are
(i) positive definite,
(ii) belong, respectively, to L − B, L⊥ + C and satisfy the relation

    Tr(X₊S₊) = μ₊k;        (4.5.28)

(iii) satisfy the relation

    Ω ≡ ‖μ₊⁻¹X₊^{1/2}S₊X₊^{1/2} − I‖₂ ≤ κ.        (4.5.29)

Observe that the situation can be reduced to the one with μ = 1. Indeed, let us pass from the matrices
X, S, ∆X, ∆S, X₊, S₊ to the matrices X, S′ = μ⁻¹S, ∆X, ∆S′ = μ⁻¹∆S, X₊, S′₊ = μ⁻¹S₊. Now the “we
are given” part of our situation becomes as follows: we are given two diagonal positive definite matrices
X, S′ such that X ∈ L − B, S′ ∈ L⊥ + C′, C′ = μ⁻¹C,

    Tr(XS′) = k

and

    ‖X^{1/2}S′X^{1/2} − I‖₂ = ‖μ⁻¹X^{1/2}SX^{1/2} − I‖₂ = ‖tX^{1/2}SX^{1/2} − I‖₂ ≤ κ.

The “we should prove” part becomes: to verify that the system of equations

    (a) ∆X ∈ L
    (b) ∆S′ ∈ L⊥
    (c) ∆XS′ + X∆S′ + ∆S′X + S′∆X = 2(1 − χk^{-1/2})I − 2XS′

has a unique solution and that the matrices X₊ = X + ∆X, S′₊ = S′ + ∆S′ are positive definite, are
contained in L − B, respectively L⊥ + C′, and satisfy the relations

    Tr(X₊S′₊) = (μ₊/μ) k = (1 − χk^{-1/2}) k

and

    ‖(1 − χk^{-1/2})⁻¹X₊^{1/2}S′₊X₊^{1/2} − I‖₂ ≤ κ.

Thus, the general situation indeed can be reduced to the one with μ = 1, μ₊ = 1 − χk^{-1/2}, and we lose
nothing assuming, in addition to what was already postulated, that

    μ ≡ t⁻¹ ≡ Tr(XS)/k = 1,   μ₊ = 1 − χk^{-1/2},

whence

    [Tr(XS) =]  Σᵢ₌₁ᵏ xᵢsᵢ = k        (4.5.30)

and

    [‖tX^{1/2}SX^{1/2} − I‖₂² ≡]  Σᵢ₌₁ᵏ (xᵢsᵢ − 1)² ≤ κ².        (4.5.31)

3⁰. We start with proving that (4.5.27) indeed has a unique solution. It is convenient to pass in
(4.5.27) from the unknowns ∆X, ∆S to the unknowns

    δX = X^{-1/2}∆XX^{-1/2} ⇔ ∆X = X^{1/2}δXX^{1/2},
    δS = X^{1/2}∆SX^{1/2}   ⇔ ∆S = X^{-1/2}δSX^{-1/2}.        (4.5.32)

With respect to the new unknowns, (4.5.27) becomes

    (a) X^{1/2}δXX^{1/2} ∈ L,
    (b) X^{-1/2}δSX^{-1/2} ∈ L⊥,
    (c) X^{1/2}δXX^{1/2}S + X^{1/2}δSX^{-1/2} + X^{-1/2}δSX^{1/2} + SX^{1/2}δXX^{1/2} = 2μ₊I − 2XS
        ⇕
    (d) L(δX, δS) ≡ [ φᵢⱼ(δX)ᵢⱼ + ψᵢⱼ(δS)ᵢⱼ ]ᵢ,ⱼ₌₁,…,k = 2[ (μ₊ − xᵢsᵢ)δᵢⱼ ]ᵢ,ⱼ₌₁,…,k,        (4.5.33)
        φᵢⱼ = √(xᵢxⱼ)(sᵢ + sⱼ),   ψᵢⱼ = √(xᵢ/xⱼ) + √(xⱼ/xᵢ),
where δᵢⱼ = { 1, i = j; 0, i ≠ j } are the Kronecker symbols.
We first claim that (4.5.33), regarded as a system with unknown symmetric matrices δX, δS, has a
unique solution. Observe that (4.5.33) is a system with 2 dim S^k ≡ 2N scalar unknowns and 2N scalar
linear equations. Indeed, (4.5.33.a) is a system of N′ ≡ N − dim L linear equations, (4.5.33.b) is a system
of N″ = N − dim L⊥ = dim L linear equations, and (4.5.33.c) has N equations, so that the total number
of linear equations in our system is N′ + N″ + N = (N − dim L) + dim L + N = 2N. Now, to verify
that the square system of linear equations (4.5.33) has exactly one solution, it suffices to prove that the
homogeneous system

    X^{1/2}δXX^{1/2} ∈ L,   X^{-1/2}δSX^{-1/2} ∈ L⊥,   L(δX, δS) = 0

has only the trivial solution. Let (δX, δS) be a solution to the homogeneous system. The relation
L(δX, δS) = 0 means that

    (δX)ᵢⱼ = −(ψᵢⱼ/φᵢⱼ)(δS)ᵢⱼ,        (4.5.34)

whence

    Tr(δXδS) = −Σᵢ,ⱼ (ψᵢⱼ/φᵢⱼ)(δS)²ᵢⱼ.        (4.5.35)

Representing δX, δS via ∆X, ∆S according to (4.5.32), we get

    Tr(δXδS) = Tr(X^{-1/2}∆XX^{-1/2}X^{1/2}∆SX^{1/2}) = Tr(X^{-1/2}∆X∆SX^{1/2}) = Tr(∆X∆S),

and the latter quantity is 0 due to ∆X = X^{1/2}δXX^{1/2} ∈ L and ∆S = X^{-1/2}δSX^{-1/2} ∈ L⊥. Thus, the
left hand side in (4.5.35) is 0; since φᵢⱼ > 0, ψᵢⱼ > 0, (4.5.35) implies that δS = 0. But then δX = 0 in
view of (4.5.34). Thus, the homogeneous version of (4.5.33) has only the trivial solution, so that (4.5.33)
is solvable with a unique solution.
4⁰. Let δX, δS be the unique solution to (4.5.33), and let ∆X, ∆S be linked to δX, δS according to
(4.5.32). Our local goal is to bound from above the Frobenius norms of δX and δS.
From (4.5.33.c) it follows (cf. the derivation of (4.5.35)) that

    (a) (δX)ᵢⱼ = −(ψᵢⱼ/φᵢⱼ)(δS)ᵢⱼ + 2[(μ₊ − xᵢsᵢ)/φᵢᵢ] δᵢⱼ,  i, j = 1, ..., k;
    (b) (δS)ᵢⱼ = −(φᵢⱼ/ψᵢⱼ)(δX)ᵢⱼ + 2[(μ₊ − xᵢsᵢ)/ψᵢᵢ] δᵢⱼ,  i, j = 1, ..., k.        (4.5.36)

Same as in the concluding part of 3⁰, relations (4.5.33.a–b) imply that

    Tr(∆X∆S) = Tr(δXδS) = Σᵢ,ⱼ (δX)ᵢⱼ(δS)ᵢⱼ = 0.        (4.5.37)

Multiplying (4.5.36.a) by (δS)ᵢⱼ and summing over i, j, we get, in view of (4.5.37), the relation

    Σᵢ,ⱼ (ψᵢⱼ/φᵢⱼ)(δS)²ᵢⱼ = 2Σᵢ [(μ₊ − xᵢsᵢ)/φᵢᵢ](δS)ᵢᵢ;        (4.5.38)

by “symmetric” reasoning, we get

    Σᵢ,ⱼ (φᵢⱼ/ψᵢⱼ)(δX)²ᵢⱼ = 2Σᵢ [(μ₊ − xᵢsᵢ)/ψᵢᵢ](δX)ᵢᵢ.        (4.5.39)

Now let

    θᵢ = xᵢsᵢ,        (4.5.40)

so that in view of (4.5.30) and (4.5.31) one has

    (a) Σᵢ θᵢ = k,
    (b) Σᵢ (θᵢ − 1)² ≤ κ².        (4.5.41)

Observe that

    φᵢⱼ = √(xᵢxⱼ)(sᵢ + sⱼ) = √(xᵢxⱼ)(θᵢ/xᵢ + θⱼ/xⱼ) = θⱼ√(xᵢ/xⱼ) + θᵢ√(xⱼ/xᵢ).

Thus,

    φᵢⱼ = θⱼ√(xᵢ/xⱼ) + θᵢ√(xⱼ/xᵢ),
    ψᵢⱼ = √(xᵢ/xⱼ) + √(xⱼ/xᵢ);        (4.5.42)

since 1 − κ ≤ θᵢ ≤ 1 + κ by (4.5.41.b), we get

    1 − κ ≤ φᵢⱼ/ψᵢⱼ ≤ 1 + κ.        (4.5.43)

By the geometric-arithmetic mean inequality we have ψᵢⱼ ≥ 2, whence, in view of (4.5.43),

    φᵢⱼ ≥ (1 − κ)ψᵢⱼ ≥ 2(1 − κ)  ∀ i, j.        (4.5.44)

We now have

    (1 − κ) Σᵢ,ⱼ (δX)²ᵢⱼ ≤ Σᵢ,ⱼ (φᵢⱼ/ψᵢⱼ)(δX)²ᵢⱼ
                                [see (4.5.43)]
        ≤ 2Σᵢ [(μ₊ − xᵢsᵢ)/ψᵢᵢ](δX)ᵢᵢ
                                [see (4.5.39)]
        ≤ 2√( Σᵢ (μ₊ − xᵢsᵢ)² ) √( Σᵢ ψᵢᵢ⁻²(δX)²ᵢᵢ )
        ≤ √( Σᵢ ((1 − θᵢ)² − 2χk^{-1/2}(1 − θᵢ) + χ²k⁻¹) ) √( Σᵢ,ⱼ (δX)²ᵢⱼ )
                                [see (4.5.44)]
        = √( χ² + Σᵢ (1 − θᵢ)² ) √( Σᵢ,ⱼ (δX)²ᵢⱼ )
                                [since Σᵢ (1 − θᵢ) = 0 by (4.5.41.a)]
        ≤ √( χ² + κ² ) √( Σᵢ,ⱼ (δX)²ᵢⱼ )
                                [see (4.5.41.b)]

and from the resulting inequality it follows that

    ‖δX‖₂ ≤ ρ ≡ √(χ² + κ²) / (1 − κ).        (4.5.45)
Similarly,

    (1 + κ)⁻¹ Σᵢ,ⱼ (δS)²ᵢⱼ ≤ Σᵢ,ⱼ (ψᵢⱼ/φᵢⱼ)(δS)²ᵢⱼ
                                [see (4.5.43)]
        ≤ 2Σᵢ [(μ₊ − xᵢsᵢ)/φᵢᵢ](δS)ᵢᵢ
                                [see (4.5.38)]
        ≤ 2√( Σᵢ (μ₊ − xᵢsᵢ)² ) √( Σᵢ φᵢᵢ⁻²(δS)²ᵢᵢ )
        ≤ (1 − κ)⁻¹ √( Σᵢ (μ₊ − θᵢ)² ) √( Σᵢ,ⱼ (δS)²ᵢⱼ )
                                [see (4.5.44)]
        ≤ (1 − κ)⁻¹ √( χ² + κ² ) √( Σᵢ,ⱼ (δS)²ᵢⱼ )
                                [same as above]

and from the resulting inequality it follows that

    ‖δS‖₂ ≤ (1 + κ)√(χ² + κ²) / (1 − κ) = (1 + κ)ρ.        (4.5.46)
1−κ
5⁰. We are ready to prove 2⁰.(i–ii). We have

    X₊ = X + ∆X = X^{1/2}(I + δX)X^{1/2},

and the matrix I + δX is positive definite due to (4.5.45) (indeed, the right hand side in (4.5.45) is ρ < 1,
whence the Frobenius norm (and therefore the maximum of the moduli of the eigenvalues) of δX is less than
1). Note that for the just indicated reasons I + δX ⪯ (1 + ρ)I, whence

    X₊ ⪯ (1 + ρ)X.        (4.5.47)

Similarly, the matrix

    S₊ = S + ∆S = X^{-1/2}(X^{1/2}SX^{1/2} + δS)X^{-1/2}

is positive definite. Indeed, the eigenvalues of the matrix X^{1/2}SX^{1/2} are ≥ minᵢ θᵢ ≥ 1 − κ, while
the moduli of the eigenvalues of δS, by (4.5.46), do not exceed (1 + κ)√(χ² + κ²)/(1 − κ) < 1 − κ. Thus,
the matrix X^{1/2}SX^{1/2} + δS is positive definite, whence S₊ also is so. We have proved 2⁰.(i).
2⁰.(ii) is easy to verify. First, by (4.5.33), we have ∆X ∈ L, ∆S ∈ L⊥, and since X ∈ L − B,
S ∈ L⊥ + C, we have X₊ ∈ L − B, S₊ ∈ L⊥ + C. Second, we have

    Tr(X₊S₊) = Tr(XS + X∆S + ∆XS + ∆X∆S)
             = Tr(XS + X∆S + ∆XS)
                    [since Tr(∆X∆S) = 0 due to ∆X ∈ L, ∆S ∈ L⊥]
             = μ₊k
                    [take the trace of both sides in (4.5.27.c)]

2⁰.(ii) is proved.
6⁰. It remains to verify 2⁰.(iii). We should bound from above the quantity

    Ω = ‖μ₊⁻¹X₊^{1/2}S₊X₊^{1/2} − I‖₂ = ‖X₊^{1/2}(μ₊⁻¹S₊ − X₊⁻¹)X₊^{1/2}‖₂,

and our plan is first to bound from above the “close” quantity

    Ω′ = ‖X^{1/2}(μ₊⁻¹S₊ − X₊⁻¹)X^{1/2}‖₂ = μ₊⁻¹‖Z‖₂,
    Z = X^{1/2}(S₊ − μ₊X₊⁻¹)X^{1/2},        (4.5.48)

and then to bound Ω in terms of Ω′.

6⁰.1. Bounding Ω′. We have

    Z = X^{1/2}(S₊ − μ₊X₊⁻¹)X^{1/2}
      = X^{1/2}(S + ∆S)X^{1/2} − μ₊X^{1/2}[X + ∆X]⁻¹X^{1/2}
      = XS + δS − μ₊X^{1/2}[X^{1/2}(I + δX)X^{1/2}]⁻¹X^{1/2}
                    [see (4.5.32)]
      = XS + δS − μ₊(I + δX)⁻¹
      = XS + δS − μ₊(I − δX) − μ₊[(I + δX)⁻¹ − I + δX]
      = Z¹ + Z² + Z³,
        Z¹ ≡ XS + δS + δX − μ₊I,   Z² ≡ (μ₊ − 1)δX,   Z³ ≡ μ₊[I − δX − (I + δX)⁻¹],

so that

    ‖Z‖₂ ≤ ‖Z¹‖₂ + ‖Z²‖₂ + ‖Z³‖₂.        (4.5.49)

We are about to bound separately all three terms in the right hand side of the latter inequality.

Bounding ‖Z²‖₂: We have

    ‖Z²‖₂ = |μ₊ − 1| ‖δX‖₂ ≤ χk^{-1/2}ρ        (4.5.50)

(see (4.5.45) and take into account that μ₊ − 1 = −χk^{-1/2}).

Bounding ‖Z³‖₂: Let λᵢ be the eigenvalues of δX. We have

    ‖Z³‖₂ = ‖μ₊[(I + δX)⁻¹ − I + δX]‖₂
          ≤ ‖(I + δX)⁻¹ − I + δX‖₂
                    [since |μ₊| ≤ 1]
          = √( Σᵢ ( 1/(1 + λᵢ) − 1 + λᵢ )² )
                    [pass to the orthonormal eigenbasis of δX]
          = √( Σᵢ λᵢ⁴/(1 + λᵢ)² )        (4.5.51)
          ≤ √( Σᵢ ρ²λᵢ²/(1 − ρ)² )
                    [see (4.5.45) and note that Σᵢ λᵢ² = ‖δX‖₂² ≤ ρ²]
          ≤ ρ²/(1 − ρ)

Bounding ‖Z¹‖₂: This is a bit more involved. We have

    Z¹ᵢⱼ = (XS)ᵢⱼ + (δS)ᵢⱼ + (δX)ᵢⱼ − μ₊δᵢⱼ
         = (δX)ᵢⱼ + (δS)ᵢⱼ + (xᵢsᵢ − μ₊)δᵢⱼ
         = (δX)ᵢⱼ [1 − φᵢⱼ/ψᵢⱼ] + 2[(μ₊ − xᵢsᵢ)/ψᵢᵢ] δᵢⱼ + (xᵢsᵢ − μ₊)δᵢⱼ
                    [we have used (4.5.36.b)]
         = (δX)ᵢⱼ [1 − φᵢⱼ/ψᵢⱼ]
                    [since ψᵢᵢ = 2, see (4.5.42)]

whence, in view of (4.5.43),

    |Z¹ᵢⱼ| ≤ [κ/(1 − κ)] |(δX)ᵢⱼ|,

so that

    ‖Z¹‖₂ ≤ [κ/(1 − κ)] ‖δX‖₂ ≤ [κ/(1 − κ)] ρ        (4.5.52)

(the concluding inequality is given by (4.5.45)).

Assembling (4.5.50), (4.5.51), (4.5.52) and (4.5.49), we come to

    ‖Z‖₂ ≤ ρ [ χ/√k + ρ/(1 − ρ) + κ/(1 − κ) ],

whence, by (4.5.48),

    Ω′ ≤ ρ/(1 − χk^{-1/2}) · [ χ/√k + ρ/(1 − ρ) + κ/(1 − κ) ].        (4.5.53)

6⁰.2. Bounding Ω. We have

    Ω² = ‖μ₊⁻¹X₊^{1/2}S₊X₊^{1/2} − I‖₂²
       = ‖X₊^{1/2} [μ₊⁻¹S₊ − X₊⁻¹] X₊^{1/2}‖₂²
                    [denote Θ ≡ μ₊⁻¹S₊ − X₊⁻¹; Θ = Θᵀ]
       = Tr( X₊^{1/2}ΘX₊ΘX₊^{1/2} )
       ≤ (1 + ρ) Tr( X₊^{1/2}ΘXΘX₊^{1/2} )
                    [see (4.5.47)]
       = (1 + ρ) Tr( X₊^{1/2}ΘX^{1/2}X^{1/2}ΘX₊^{1/2} )
       = (1 + ρ) Tr( X^{1/2}ΘX₊^{1/2}X₊^{1/2}ΘX^{1/2} )
       = (1 + ρ) Tr( X^{1/2}ΘX₊ΘX^{1/2} )
       ≤ (1 + ρ)² Tr( X^{1/2}ΘXΘX^{1/2} )
                    [the same (4.5.47)]
       = (1 + ρ)² ‖X^{1/2}ΘX^{1/2}‖₂²
       = (1 + ρ)² ‖X^{1/2}[μ₊⁻¹S₊ − X₊⁻¹]X^{1/2}‖₂²
       = (1 + ρ)² Ω′²
                    [see (4.5.48)]

so that

    Ω ≤ (1 + ρ)Ω′ = ρ(1 + ρ)/(1 − χk^{-1/2}) · [ χ/√k + ρ/(1 − ρ) + κ/(1 − κ) ],
    ρ = √(χ² + κ²)/(1 − κ)        (4.5.54)

(see (4.5.53) and (4.5.45)).
It is immediately seen that if 0 < χ ≤ κ ≤ 0.1, the right hand side in the resulting bound for Ω is
≤ κ, as required in 2⁰.(iii).

Remark 4.5.2 We have carried out the complexity analysis for a large group of primal-dual
path-following methods for SDP (i.e., for the case of K = Sk+ ). In fact, the constructions and
the analysis we have presented can be word by word extended to the case when K is a direct
product of semidefinite cones – you just should bear in mind that all symmetric matrices we
deal with, like the primal and the dual solutions X, S, the scaling matrices Q, the primal-dual
search directions ∆X, ∆S, etc., are block-diagonal with common block-diagonal structure. In
particular, our constructions and analysis work for the case of LP – this is the case when K
is a direct product of one-dimensional semidefinite cones. Note that in the case of LP Zhang’s
family of primal-dual search directions reduces to a single direction: since now X, S, Q are
diagonal matrices, the scaling (4.5.17) → (4.5.18) does not vary the equations of augmented
complementary slackness.
The recipe to translate all we have presented for the case of SDP to the case of LP is very
simple: in the above text, you should assume all matrices like X, S,... to be diagonal and look
what the operations with these matrices required by the description of the method do with their
diagonals. By the way, one of the very first approaches to the design and the analysis of IP
methods for SDP was exactly opposite: you take an IP scheme for LP, replace in its description
the words “nonnegative vectors” with “positive semidefinite diagonal matrices” and then erase
the adjective “diagonal”.
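To make the diagonal case concrete, here is a minimal sketch (our illustration, not part of the original text) of one feasible-start primal-dual Newton step for an LP in the standard form $\min\{c^Tx : Ax = b,\ x \ge 0\}$; the function name and the dense block assembly are ours. Whatever diagonal scaling one applies, the linearized complementary slackness equations $s_i\Delta x_i + x_i\Delta s_i = \mu_+ - x_is_i$ stay the same – this is the collapse of Zhang's family mentioned above.

```python
import numpy as np

def lp_newton_step(A, x, s, mu_plus):
    """One feasible-start primal-dual Newton step for
         min c^T x  s.t.  A x = b, x >= 0   (A of size m x n),
    i.e. the solution of
         A dx = 0,  A^T dy + ds = 0,  s_i dx_i + x_i ds_i = mu_plus - x_i s_i.
    A dense toy implementation; practical solvers reduce this system to a
    much smaller "normal equations" system before factorizing."""
    m, n = A.shape
    K = np.block([
        [A,                np.zeros((m, m)), np.zeros((m, n))],
        [np.zeros((n, n)), A.T,              np.eye(n)],
        [np.diag(s),       np.zeros((n, m)), np.diag(x)],
    ])
    rhs = np.concatenate([np.zeros(m), np.zeros(n), mu_plus - x * s])
    dx, dy, ds = np.split(np.linalg.solve(K, rhs), [n, n + m])
    return dx, dy, ds

# toy usage: one equality constraint, three variables
A = np.array([[1.0, 1.0, 1.0]])
dx, dy, ds = lp_newton_step(A, np.array([0.2, 0.3, 0.5]),
                            np.array([2.0, 1.0, 3.0]), mu_plus=0.5)
```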
4.6 Complexity bounds for LP, CQP, SDP


In what follows we list the best complexity bounds known so far for LP, CQP and SDP. These bounds are yielded by IP methods and, essentially, say that the Newton complexity of finding an $\epsilon$-solution to an instance – the total # of steps of a “good” IP algorithm before an $\epsilon$-solution is found – is $O(1)\sqrt{\theta(K)}\ln\frac{1}{\epsilon}$. This is what should be expected in view of the discussion in Section 4.5.2; note, however, that the complexity bounds to follow take into account the necessity to “reach the highway” – to come close to the central path before tracing it – while in Section 4.5.2 we were focusing on how fast we can reduce the duality gap after the central path (“the highway”) is reached.
Along with complexity bounds expressed in terms of the Newton complexity, we present bounds on the number of operations of Real Arithmetic required to build an $\epsilon$-solution. Note that these latter bounds typically are conservative – when deriving them, we assume the data of an instance to be “completely unstructured”, which is usually not the case (cf. Warning in Section 4.5.2); exploiting the structure of the data, one usually can significantly reduce the computational effort per step of an IP method and, consequently, the arithmetic cost of an $\epsilon$-solution.
4.6.1 Complexity of LP$_b$

Family of problems:

Problem instance: a program
$$
\min_x\left\{c^Tx:\ a_i^Tx\le b_i,\ i=1,...,m;\ \|x\|_2\le R\right\}\qquad [x\in{\bf R}^n];\eqno(p)
$$

Data:
$$
{\rm Data}(p)=[m;n;c;a_1,b_1;...;a_m,b_m;R],\qquad
{\rm Size}(p)\equiv\dim{\rm Data}(p)=(m+1)(n+1)+2.
$$

$\epsilon$-solution: an $x\in{\bf R}^n$ such that
$$
\begin{array}{rcl}
\|x\|_\infty&\le& R,\\
a_i^Tx&\le& b_i+\epsilon,\ i=1,...,m,\\
c^Tx&\le&{\rm Opt}(p)+\epsilon
\end{array}
$$
(as always, the optimal value of an infeasible problem is $+\infty$).
Newton complexity of $\epsilon$-solution:${}^{14)}$
$$
{\rm Compl}^{\rm Nwt}(p,\epsilon)=O(1)\sqrt{m+n}\,{\rm Digits}(p,\epsilon),
$$
where
$$
{\rm Digits}(p,\epsilon)=\ln\left(\frac{{\rm Size}(p)+\|{\rm Data}(p)\|_1+\epsilon^2}{\epsilon}\right)
$$
is the number of accuracy digits in $\epsilon$-solution, see Section 4.1.2.
${}^{14)}$ In what follows, the precise meaning of the statement “the Newton/arithmetic complexity of finding an $\epsilon$-solution of an instance (p) does not exceed $N$” is as follows: as applied to the input $({\rm Data}(p),\epsilon)$, the method underlying our bound terminates in no more than $N$ steps (respectively, $N$ arithmetic operations) and outputs either a vector which is an $\epsilon$-solution to the instance, or the correct conclusion “(p) is infeasible”.
Arithmetic complexity of $\epsilon$-solution:
$$
{\rm Compl}(p,\epsilon)=O(1)(m+n)^{3/2}n^2\,{\rm Digits}(p,\epsilon).
$$
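For a feeling of the numbers, here is a tiny calculator for these bounds (our sketch; the $O(1)$ factors are set to 1, so the outputs are order-of-magnitude indications only):

```python
import math

def lp_bounds(m, n, data_norm1, eps):
    size = (m + 1) * (n + 1) + 2                           # Size(p)
    digits = math.log((size + data_norm1 + eps**2) / eps)  # Digits(p, eps)
    newton = math.sqrt(m + n) * digits                     # Compl^Nwt(p, eps)
    arithmetic = (m + n)**1.5 * n**2 * digits              # Compl(p, eps)
    return digits, newton, arithmetic

# e.g. m = 10000 constraints, n = 1000 variables, 6 digits of accuracy:
d, nwt, ops = lp_bounds(10_000, 1_000, data_norm1=1e6, eps=1e-6)
print(f"Digits ~ {d:.1f}, Newton steps ~ {nwt:.0f}, operations ~ {ops:.1e}")
```

Note that the worst-case Newton count here comes out in the thousands, while, as discussed in Section 4.7 below, the iteration count observed in practice is typically a few tens.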

4.6.2 Complexity of CQP$_b$

Family of problems:

Problem instance: a program
$$
\min_x\left\{c^Tx:\ \|A_ix+b_i\|_2\le c_i^Tx+d_i,\ i=1,...,m;\ \|x\|_2\le R\right\}\qquad\left[\begin{array}{l}x\in{\bf R}^n\\ b_i\in{\bf R}^{k_i}\end{array}\right]\eqno(p)
$$

Data:
$$
{\rm Data}(p)=[m;n;k_1,...,k_m;c;A_1,b_1,c_1,d_1;...;A_m,b_m,c_m,d_m;R],\qquad
{\rm Size}(p)\equiv\dim{\rm Data}(p)=\left(m+\sum\limits_{i=1}^mk_i\right)(n+1)+m+n+3.
$$

$\epsilon$-solution: an $x\in{\bf R}^n$ such that
$$
\begin{array}{rcl}
\|x\|_2&\le& R,\\
\|A_ix+b_i\|_2&\le& c_i^Tx+d_i+\epsilon,\ i=1,...,m,\\
c^Tx&\le&{\rm Opt}(p)+\epsilon.
\end{array}
$$
Newton complexity of $\epsilon$-solution:
$$
{\rm Compl}^{\rm Nwt}(p,\epsilon)=O(1)\sqrt{m+1}\,{\rm Digits}(p,\epsilon).
$$

Arithmetic complexity of $\epsilon$-solution:
$$
{\rm Compl}(p,\epsilon)=O(1)(m+1)^{1/2}n\left(n^2+m+\sum\limits_{i=1}^mk_i^2\right){\rm Digits}(p,\epsilon).
$$
4.6.3 Complexity of SDP$_b$

Family of problems:

Problem instance: a program
$$
\min_x\left\{c^Tx:\ A_0+\sum\limits_{j=1}^nx_jA_j\succeq0,\ \|x\|_2\le R\right\}\qquad [x\in{\bf R}^n],\eqno(p)
$$
where $A_j$, $j=0,1,...,n$, are symmetric block-diagonal matrices with $m$ diagonal blocks $A_j^{(i)}$ of sizes $k_i\times k_i$, $i=1,...,m$.

Data:
$$
{\rm Data}(p)=[m;n;k_1,...,k_m;c;A_0^{(1)},...,A_0^{(m)};...;A_n^{(1)},...,A_n^{(m)};R],\qquad
{\rm Size}(p)\equiv\dim{\rm Data}(p)=\left(\sum\limits_{i=1}^m\frac{k_i(k_i+1)}{2}\right)(n+1)+m+n+3.
$$

$\epsilon$-solution: an $x$ such that
$$
\begin{array}{rcl}
\|x\|_2&\le& R,\\
A_0+\sum\limits_{j=1}^nx_jA_j&\succeq&-\epsilon I,\\
c^Tx&\le&{\rm Opt}(p)+\epsilon.
\end{array}
$$
Newton complexity of $\epsilon$-solution:
$$
{\rm Compl}^{\rm Nwt}(p,\epsilon)=O(1)\left(1+\sum\limits_{i=1}^mk_i\right)^{1/2}{\rm Digits}(p,\epsilon).
$$

Arithmetic complexity of $\epsilon$-solution:
$$
{\rm Compl}(p,\epsilon)=O(1)\left(1+\sum\limits_{i=1}^mk_i\right)^{1/2}n\left(n^2+n\sum\limits_{i=1}^mk_i^2+\sum\limits_{i=1}^mk_i^3\right){\rm Digits}(p,\epsilon).
$$
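The dependence of this bound on the block sizes $k_i$ is worth seeing in numbers. The sketch below (ours; the $O(1)$ and ${\rm Digits}(p,\epsilon)$ factors are dropped) compares a single large LMI block with many small blocks of the same total row size:

```python
def sdp_arith_factor(n, ks):
    """(1 + sum k_i)^(1/2) * n * (n^2 + n*sum k_i^2 + sum k_i^3), i.e.
    Compl(p, eps) without the O(1) and Digits(p, eps) factors."""
    s1 = sum(ks)
    s2 = sum(k**2 for k in ks)
    s3 = sum(k**3 for k in ks)
    return (1 + s1)**0.5 * n * (n**2 + n*s2 + s3)

n = 100
print(f"one 500x500 LMI block: {sdp_arith_factor(n, [500]):.1e}")
print(f"100 blocks of size 5 : {sdp_arith_factor(n, [5]*100):.1e}")  # far cheaper
```

With the same total row size $\sum_i k_i = 500$, the many-small-blocks instance is cheaper by more than two orders of magnitude – the block-diagonal structure emphasized in Remark 4.5.2 pays off directly in the arithmetic cost.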

4.7 Concluding remarks


We have discussed IP methods for LP, CQP and SDP as “mathematical animals”, with emphasis
on the ideas underlying the algorithms and on the theoretical complexity bounds ensured by the
methods. Now it is time to say a couple of words on software implementations of IP algorithms
and on practical performance of the resulting codes.
As far as the performance of recent IP software is concerned, the situation heavily depends
on whether we are speaking about codes for LP, or those for CQP and SDP.
• There exists extremely powerful commercial IP software for LP, capable of reliably handling really large-scale LP's and quite competitive with the best Simplex-type codes for Linear Programming. E.g., one of the best modern LP solvers – CPLEX – allows the user to choose between Simplex-type and IP modes of execution, and in many cases the second option reduces the running time by orders of magnitude. With a state-of-the-art computer, CPLEX is capable of routinely solving real-world LP's with tens and hundreds of thousands of variables and constraints; in the case of favourably structured constraint matrices, the numbers of variables and constraints can become as large as a few million.
• There already exists very powerful commercial software for CQP – MOSEK (Erling Andersen, http://www.mosek.com). I would say that as far as LP (and even mixed integer programming) is concerned, MOSEK compares favourably to CPLEX, and it can solve really large CQP's of favourable structure.
• For the time being, IP software for SDP's is not as well-polished, reliable and powerful as the LP one. I would say that the codes available at the moment are capable of solving SDP's with no more than 1,000 – 1,500 design variables.
There are two groups of reasons why the SDP software available at the moment is so inferior to the interior point LP and CQP solvers – “historical” reasons and “intrinsic” ones. The “historical” aspect is simple: the development of IP software for LP, on one hand, and for SDP, on the other, started, respectively, in the mid-eighties and the mid-nineties; for the time being (2002), this is definitely a difference. Well, being too young is the only shortcoming which for sure passes away... Unfortunately, there are intrinsic problems with IP algorithms for large-scale (many thousands of variables) SDP's. Recall that
the influence of the size of an SDP/CQP program on the complexity of solving it by an IP method is twofold:
– first, the size affects the Newton complexity of the process. Theoretically, the number of steps required to reduce the duality gap by a constant factor, say, factor 2, is proportional to $\sqrt{\theta(K)}$ ($\theta(K)$ is twice the total # of conic quadratic inequalities for CQP and the total row size of LMI's for SDP). Thus, we could expect an unpleasant growth of the iteration count with $\theta(K)$. Fortunately, the iteration count for good IP methods usually is much less than the one given by the worst-case complexity analysis and is typically about a few tens, independently of $\theta(K)$.
– second, the larger the instance, the larger the system of linear equations one should solve to generate a new primal (or primal-dual) search direction, and, consequently, the larger the computational effort per step (this effort is dominated by the necessity to assemble and to solve the linear system). Now, the system to be solved depends, of course, on the particular IP method we are speaking about, but it never is simpler (and for most of the methods, not more complicated either) than the system (4.5.8) arising in the primal path-following method:
$$
\underbrace{A^*[\nabla^2K(\bar X)]A}_{H}\,\Delta x = \underbrace{-\left[t_+c + A^*\nabla K(\bar X)\right]}_{h}. \eqno({\rm Nwt})
$$
The size n of this system is exactly the design dimension of problem (CP).
In order to process (Nwt), one should assemble the system (compute $H$ and $h$) and then solve it. Whatever the cost of assembling (Nwt), you should be able to store the resulting matrix $H$ in memory and to factorize it in order to get the solution. Both these problems – storing and factorizing $H$ – become prohibitively expensive when $H$ is a large dense${}^{15)}$ matrix. (Think how happy you will be with the necessity to store $\frac{5000\cdot5001}{2} = 12{,}502{,}500$ reals representing a dense $5000\times5000$ symmetric matrix $H$, and with the necessity to perform $\approx\frac{5000^3}{6}\approx 2.08\times10^{10}$ arithmetic operations to find its Choleski factor.)

${}^{15)}$ I.e., with $O(n^2)$ nonzero entries.
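The arithmetic in this parenthetical is easy to reproduce; the translation into bytes (our addition) assumes 8-byte reals:

```python
n = 5000
entries = n * (n + 1) // 2   # 12_502_500 reals in the symmetric matrix H
flops = n**3 / 6             # ~2.08e10 operations for the Choleski factor
print(entries, f"{flops:.2e}", f"{entries * 8 / 1e9:.2f} GB")
```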
The necessity to assemble and to solve large-scale systems of linear equations is intrinsic for IP methods as applied to large-scale optimization programs, and in this respect there is no difference between LP and CQP, on one hand, and SDP, on the other hand. The difference is in how difficult it is to handle these large-scale linear systems. In real life LP's–CQP's–SDP's, the structure of the data allows one to assemble (Nwt) at a cost negligibly small as compared to the cost of factorizing $H$, which is good news. More good news is that in typical real-world LP's, and to some extent in real-world CQP's, $H$ turns out to be “very well-structured”, which dramatically reduces the expense of factorizing the matrix and storing the Choleski factor. All practical IP solvers for LP and CQP utilize these favourable properties of real life problems, and this is where their ability to solve problems with tens/hundreds of thousands of variables and constraints comes from. Spoil the structure of the problem – and an IP method will be unable to solve an LP with just a few thousand variables. Now, in contrast to real life LP's and CQP's, real life SDP's typically result in dense matrices $H$, and this is where the severe limitations on the sizes of “tractable in practice” SDP's come from. In this respect, real life CQP's are somewhere in-between LP's and SDP's, so that the sizes of “tractable in practice” CQP's can be significantly larger than in the case of SDP's.
It should be mentioned that assembling the matrices of the linear systems we are interested in and solving these systems by the standard Linear Algebra techniques is not the only possible way to implement an IP method. Another option is to solve these linear systems by iterative methods. With this approach, all we need in order to solve a system like (Nwt) is the possibility to multiply a given vector by the matrix of the system, and this does not require assembling and storing in memory the matrix itself. E.g., to multiply a vector $\Delta x$ by $H$, we can use the multiplicative representation of $H$ as presented in (Nwt). Theoretically, the outlined iterative schemes, as applied to real life SDP's, allow one to reduce the arithmetic cost of building search directions by orders of magnitude and to avoid the necessity to assemble and store huge dense matrices, which is an extremely attractive opportunity. The difficulty, however, is that the iterative schemes are much more affected by rounding errors than the usual Linear Algebra techniques; as a result, for the time being, “iterative-Linear-Algebra-based” implementations of IP methods are no more than a challenging goal.
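To give the flavour of such a matrix-free implementation, here is a minimal sketch (ours; the callables and their names are hypothetical, and no production solver is organized this simply). Since $\nabla^2K(\bar X)\succ0$ and $A$ has trivial kernel, $H$ is symmetric positive definite, so conjugate gradients apply; each CG iteration needs one multiplication by $A$, one by $\nabla^2K(\bar X)$ and one by $A^*$, and $H$ itself is never formed:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_direction_matrix_free(apply_A, apply_Astar, apply_hessK, h, n):
    """Solve H dx = h with H = A* [Hess K(Xbar)] A, never forming H.
    apply_A     : x |-> A x               (design space -> cone space)
    apply_Astar : y |-> A* y              (the adjoint mapping)
    apply_hessK : y |-> [Hess K(Xbar)] y
    The three callables are supplied by the caller and may exploit any
    structure of the data; only matrix-vector products are ever used."""
    H = LinearOperator((n, n), dtype=float,
                       matvec=lambda v: apply_Astar(apply_hessK(apply_A(v))))
    dx, info = cg(H, h)
    assert info == 0  # 0 means CG reported convergence
    return dx

# toy check: K(x) = -sum(ln x_i) on the nonnegative orthant, A = I,
# so H = Hess K(xbar) = diag(1/xbar_i^2) and the solution is xbar^2 * h
xbar = np.linspace(1.0, 2.0, 50)
dx = newton_direction_matrix_free(lambda v: v, lambda v: v,
                                  lambda v: v / xbar**2, np.ones(50), 50)
assert np.allclose(dx, xbar**2, rtol=1e-3)
```

The rounding-error sensitivity just mentioned – aggravated by the fact that $H$ becomes increasingly ill-conditioned as the iterates approach the boundary of the cone – is precisely what this sketch sweeps under the rug.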
Although the sizes of SDP's which can be solved with the existing codes are not as impressive as those of LP's, the possibilities offered to a practitioner by SDP IP methods can hardly be overestimated. Just ten years ago we could not even dream of solving an SDP with more than a few tens of variables, while today we can routinely solve SDP's 20-25 times larger, and we have every reason to believe in further significant progress in this direction.
