4. Polynomial Time Interior Point Algorithms for LP, CQP and SDP
The model of computations in CCT: an idealized computer capable of storing only integers
(i.e., finite binary words), whose operations are bitwise: we are allowed to multiply, add and
compare integers. To add and to compare two ℓ-bit integers takes O(ℓ) "bitwise" elementary
operations, and to multiply a pair of ℓ-bit integers costs O(ℓ²) elementary operations (the cost
of multiplication can be reduced to O(ℓ ln(ℓ)), but it does not matter).
In CCT, a solution to an instance (p) of a generic problem P is a finite binary word y such
that the pair (Data(p),y) satisfies certain “verifiable condition” A(·, ·). Namely, it is assumed
that there exists a code M for the above “Integer Arithmetic computer” such that executing the
code on every input pair x, y of finite binary words, the computer after finitely many elementary
operations terminates and outputs either “yes”, if A(x, y) is satisfied, or “no”, if A(x, y) is not
satisfied. Thus, e.g., the problem

Given n positive integers a1, ..., an, find a vector x = (x1, ..., xn)ᵀ with coordinates
±1 such that ∑i xi ai = 0, or detect that no such vector exists

(the Stones problem) is a generic combinatorial problem. Indeed, the data of an instance of the problem, same as
candidate solutions to the instance, can be naturally encoded by finite sequences of integers. In
turn, finite sequences of integers can be easily encoded by finite binary words. And, of course,
for this problem you can easily point out a code for the “Integer Arithmetic computer” which,
given on input two binary words x = Data(p), y encoding the data vector of an instance (p)
of the problem and a candidate solution, respectively, verifies in finitely many “bit” operations
whether y represents or does not represent a solution to (p).
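Although the formal model works with bit operations, the verification predicate A(·, ·) for the Stones problem is easy to render in ordinary code. The sketch below is our own Python illustration (the function names are not from the text): it checks a candidate solution and, for contrast, solves an instance by exhaustive search over all 2ⁿ sign vectors – a correct but exponential-time procedure, which is exactly why the problem is interesting from the complexity viewpoint.

```python
from itertools import product

def is_solution(a, y):
    """The predicate A(Data(p), y) for the Stones problem: y is a vector
    with entries +-1 of the same length as a, and sum_i y_i * a_i = 0."""
    return (len(y) == len(a)
            and all(v in (-1, 1) for v in y)
            and sum(v * s for v, s in zip(y, a)) == 0)

def solve_brute_force(a):
    """Exhaustive search over all 2^n sign vectors; returns a solution
    or None if the instance is unsolvable (exponential time, of course)."""
    for y in product((-1, 1), repeat=len(a)):
        if sum(v * s for v, s in zip(y, a)) == 0:
            return list(y)
    return None
```

Note that the verifier runs in time polynomial in the bit length of its input, while the solver does not – this asymmetry is the heart of the P vs. NP discussion below.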
A solution algorithm for a generic problem P is a code S for the Integer Arithmetic computer
which, given on input the data vector Data(p) of an instance (p) ∈ P, after finitely many
operations either returns a solution to the instance, or a (correct!) claim that no solution exists.
The running time TS(p) of the solution algorithm on instance (p) is exactly the number of
elementary (i.e., bit) operations performed in the course of executing S on Data(p).
A solvability test for a generic problem P is defined similarly to a solution algorithm, but
now all we want of the code is to say (correctly!) whether the input instance is or is not solvable,
just “yes” or “no”, without constructing a solution in the case of the “yes” answer.
The complexity of a solution algorithm/solvability test S is defined as

ComplS(ℓ) = sup { TS(p) : (p) ∈ P, length(Data(p)) ≤ ℓ },

where length(x) is the bit length (i.e., number of bits) of a finite binary word x. The algorithm/test is called polynomial time, if its complexity is bounded from above by a polynomial of ℓ.
Classes P and NP. A generic problem P is said to belong to the class NP, if the corresponding
condition A, see (4.1.1), possesses the following two properties:

1. Polynomial computability of A: the value A(x, y) at every pair x, y of finite binary words can be computed in time polynomial in length(x) + length(y);

2. Polynomial boundedness of candidate solutions: if an instance (p) is solvable, then it admits a solution y with length(y) bounded by a polynomial of length(Data(p)).
4.1. COMPLEXITY OF CONVEX PROGRAMMING 139
Thus, the first property of an NP problem states that given the data Data(p) of a
problem instance p and a candidate solution y, it is easy to check whether y is an
actual solution of (p) – to verify this fact, it suffices to compute A(Data(p), y), and
this computation requires polynomial in length(Data(p)) + length(y) time.
The second property of an NP problem makes its instances even easier to handle.
A generic problem P is said to belong to the class P, if it belongs to the class NP and is
polynomially solvable.
The question whether P=NP – whether NP-complete problems are or are not polynomially
solvable – is qualified as "the most important open problem in Theoretical Computer Science"
and remains open for about 30 years. One of the most basic results of Theoretical Computer
Science is that NP-complete problems do exist (the Stones problem is an example). Many of
these problems are of huge practical importance, and have therefore been the subject, over decades, of
intensive study by thousands of excellent researchers. However, no polynomial time algorithm for
any of these problems has been found. Given the huge total effort invested in this research, we should
conclude that it is “highly improbable” that NP-complete problems are polynomially solvable.
Thus, at the "practical level" the fact that a certain problem is NP-complete is sufficient to qualify
the problem as “computationally intractable”, at least at our present level of knowledge.
1) Here and in what follows, we denote by χ positive "characteristic constants" associated with the predicates/problems in question. The particular values of these constants are of no importance; the only thing that matters is their existence. Note that in different places of even the same equation χ may have different values.
By convention, the first entry of the data vector Data(p) of an instance is n(p), the design dimension of program (p).
Families of optimization programs. We want to speak about methods for solving optimization programs (p) "of a given structure" (for example, Linear Programming ones). All programs (p) "of a given structure", as in the combinatorial case, form a certain family P, and we assume that every particular program in this family – every instance (p) of P – is specified by its particular data Data(p). However, now the data is a finite-dimensional real vector; one may think of the entries of this data vector as particular values of coefficients of "generic" (specific for P) analytic expressions for p0(x) and X(p). The dimension of the vector Data(p) will be called the size of the instance:

Size(p) = dim Data(p).
The model of computations. This is what is known as the "Real Arithmetic Model of Computations", as opposed to the "Integer Arithmetic Model" of the CCT. We assume that the computations are carried out by an idealized version of the usual computer which is capable of storing countably many reals and can perform with them the standard exact real arithmetic operations – the four basic arithmetic operations, evaluation of elementary functions, like cos and exp, and comparisons.
Let

Opt(p) = inf { p0(x) : x ∈ X(p) }

be the optimal value of the instance (i.e., the infimum of the values of the objective on the
feasible set, if the instance is feasible, and +∞ otherwise). A point x ∈ Rn(p) is called an
ε-solution to (p), if

InfeasP(x, p) ≤ ε and p0(x) − Opt(p) ≤ ε,

i.e., if x is both "ε-feasible" and "ε-optimal" for the instance.
It is convenient to define the number of accuracy digits in an ε-solution to (p) as the quantity

Digits(p, ε) = ln( (Size(p) + ‖Data(p)‖₁ + ε²) / ε ).
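In code, this quantity is a one-liner; the following sketch (ours, with illustrative argument names) also shows that, once ε is small, halving ε increases Digits(p, ε) by roughly ln 2 – accuracy digits are indeed counted on a logarithmic scale.

```python
import math

def digits(size_p, data_norm1, eps):
    """Digits(p, eps) = ln( (Size(p) + ||Data(p)||_1 + eps^2) / eps )."""
    return math.log((size_p + data_norm1 + eps ** 2) / eps)
```

For example, with Size(p) = 10 and ‖Data(p)‖₁ = 5, going from ε = 10⁻² to ε = 5·10⁻³ adds about 0.693 ≈ ln 2 to the digit count.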
Polynomial computability. Let P be a generic convex program, and let InfeasP (x, p)
be the corresponding measure of infeasibility of candidate solutions. We say that our family is
polynomially computable, if there exist two codes Cobj , Ccons for the Real Arithmetic computer
such that
1. For every instance (p) ∈ P, the computer, when given on input the data vector of the
instance (p) and a point x ∈ Rn(p) and executing the code Cobj , outputs the value p0 (x) and a
subgradient e(x) ∈ ∂p0 (x) of the objective p0 of the instance at the point x, and the running
time (i.e., total number of operations) of this computation Tobj (x, p) is bounded from above by
a polynomial of the size of the instance:
∀ (p) ∈ P, x ∈ Rn(p): Tobj(x, p) ≤ χ Size^χ(p) [Size(p) = dim Data(p)]. (4.1.3)
(recall that in our notation, χ is a common name of characteristic constants associated with P).
2. For every instance (p) ∈ P, the computer, when given on input the data vector of the
instance (p), a point x ∈ Rn(p) and an ε > 0 and executing the code Ccons, reports on output
whether InfeasP(x, p) ≤ ε, and if it is not the case, outputs a linear form a which separates the
point x from all those points y where InfeasP(y, p) ≤ ε:

∀ (y: InfeasP(y, p) ≤ ε): aᵀx > aᵀy, (4.1.4)

the running time Tcons(x, ε, p) of the computation being bounded by a polynomial of the size of
the instance and of the "number of accuracy digits":

∀ (p) ∈ P, x ∈ Rn(p), ε > 0: Tcons(x, ε, p) ≤ χ (Size(p) + Digits(p, ε))^χ. (4.1.5)

Note that the vector a in (4.1.4) is not required to be nonzero; when a = 0, (4.1.4) simply says
that there are no points y with InfeasP(y, p) ≤ ε.
Polynomial growth. We say that a generic convex program P is with polynomial growth, if
the objectives and the infeasibility measures, as functions of x, grow polynomially with ‖x‖₁,
the degree of the polynomial being a power of Size(p):

∀ (p) ∈ P, x ∈ Rn(p):
|p0(x)| + InfeasP(x, p) ≤ (χ [Size(p) + ‖x‖₁ + ‖Data(p)‖₁])^(χ Size^χ(p)). (4.1.6)
Polynomial boundedness of feasible sets. We say that a generic convex program P has
polynomially bounded feasible sets, if the feasible set X(p) of every instance (p) ∈ P is bounded,
and is contained in the Euclidean ball, centered at the origin, of "not too large" radius:

∀ (p) ∈ P:
X(p) ⊂ { x ∈ Rn(p) : ‖x‖₂ ≤ (χ [Size(p) + ‖Data(p)‖₁])^(χ Size^χ(p)) }. (4.1.7)
where K is a cone belonging to a characteristic for the generic program family K of cones,
specifically,
• the family of nonnegative orthants for LP b ,
• the family of direct products of Lorentz cones for CQP b ,
• the family of semidefinite cones for SDP b .
The data of an instance (p) of the type (4.1.8) is the collection
with naturally defined size(s) of a cone K from the family K associated with the generic program
under consideration: the sizes of Rn+ and of Sn+ equal n, and the size of a direct product of Lorentz
cones is the sequence of the dimensions of the factors.
The generic conic programs in question are equipped with the infeasibility measure

Infeas(x, p) = min { t ≥ 0 : t e[K(p)] + A(p)x − b(p) ∈ K(p) }, (4.1.9)
where e[K(p)] is a fixed interior point of the cone K(p).

Consider now an optimization program

min { f(x) : x ∈ X }, (4.1.10)

where
• X is a convex compact set in Rn with a nonempty interior
• f is a continuous convex function on X.
Assume that our “environment” when solving (4.1.10) is as follows:
1. We have access to a Separation Oracle Sep(X) for X – a routine which, given on input a
point x ∈ Rn, reports on output whether or not x ∈ int X, and in the case of x ∉ int X,
returns a separator – a nonzero vector e such that

eᵀx ≥ max_{y∈X} eᵀy (4.1.11)
(the existence of such a separator is guaranteed by the Separation Theorem for convex
sets);
2. We have access to a First Order oracle which, given on input a point x ∈ int X, returns
the value f(x) and a subgradient f′(x) of f at x;
3. We are given two positive reals R ≥ r such that X is contained in the Euclidean ball,
centered at the origin, of the radius R and contains a Euclidean ball of the radius r (not
necessarily centered at the origin).
Theorem 4.1.2 In the outlined "working environment", for every given ε > 0 it is possible to
find an ε-solution to (4.1.10), i.e., a point xε ∈ X with

f(xε) ≤ min_{x∈X} f(x) + ε,

in no more than N(ε) subsequent calls to the Separation and the First Order oracles plus no
more than O(1)n²N(ε) arithmetic operations to process the answers of the oracles, with

N(ε) = O(1) n² ln( 2 + VarX(f) R / (ε · r) ). (4.1.12)

Here
VarX(f) = max_X f − min_X f.
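The oracle-based methods behind results of this kind can be illustrated by the classical ellipsoid method, which consumes exactly the information the two oracles supply. The sketch below is our own simplified Python illustration (not the exact method of the theorem): it maintains an ellipsoid {z : (z − x)ᵀQ⁻¹(z − x) ≤ 1} containing the minimizers and shrinks it by central cuts coming either from the separation oracle (when the center is infeasible) or from a subgradient (when it is feasible).

```python
import math

def ellipsoid_min(f_grad, sep, n, R, steps):
    """Minimize a convex f over X contained in {||x||_2 <= R} using only
    a first-order oracle f_grad(x) -> (f(x), subgradient) and a separation
    oracle sep(x) -> None if x is feasible, else a separating vector."""
    x = [0.0] * n
    Q = [[R * R if i == j else 0.0 for j in range(n)] for i in range(n)]
    best_x, best_f = None, float("inf")
    for _ in range(steps):
        e = sep(x)
        if e is None:                       # center feasible: objective cut
            fx, g = f_grad(x)
            if fx < best_f:
                best_f, best_x = fx, list(x)
        else:                               # center infeasible: feasibility cut
            g = e
        Qg = [sum(Q[i][j] * g[j] for j in range(n)) for i in range(n)]
        gQg = sum(g[i] * Qg[i] for i in range(n))
        if gQg <= 0:                        # numerical degeneracy: stop
            break
        gt = [v / math.sqrt(gQg) for v in Qg]
        # standard central-cut ellipsoid update
        x = [x[i] - gt[i] / (n + 1) for i in range(n)]
        c = n * n / (n * n - 1.0)
        Q = [[c * (Q[i][j] - 2.0 / (n + 1) * gt[i] * gt[j])
              for j in range(n)] for i in range(n)]
    return best_x, best_f
```

Per step, the volume of the localizer shrinks by a fixed factor depending only on n, which is exactly why the call count in bounds like (4.1.12) is polynomial in n and logarithmic in the required accuracy.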
There exists a CCT-polynomial time algorithm M which, given on input the data
vector Data(q) of an instance (q) ∈ Q, converts it into a triple (Data(p[q]), ε(q), µ(q))
comprised of the data vector of an instance (p[q]) ∈ P, a positive rational ε(q) and
a rational µ(q) such that (p[q]) is solvable and
— if (q) is unsolvable, then the value of the objective of (p[q]) at every ε(q)-solution
to this problem is ≤ µ(q) − ε(q);
— if (q) is solvable, then the value of the objective of (p[q]) at every ε(q)-solution to
this problem is ≥ µ(q) + ε(q).
We claim that in the case in question we have all reasons to qualify P as a "computationally
intractable" problem. Assume, on the contrary, that P admits a polynomial time solution method
S, and let us look at what happens if we apply this algorithm to solve (p[q]) within accuracy ε(q).
Since (p[q]) is solvable, the method must produce an ε(q)-solution x to (p[q]). With additional
“polynomial time effort” we may compute the value of the objective of (p[q]) at x (recall that
the objectives of instances from P are assumed to be polynomially computable). Now we can
compare the resulting value of the objective with µ(q); by definition of reducibility, if this value
is ≤ µ(q), q is unsolvable, otherwise q is solvable. Thus, we get a correct “Real Arithmetic”
solvability test for Q. By definition of a Real Arithmetic polynomial time algorithm, the running
time of the test is bounded by a polynomial of s(q) = Size(p[q]) and of the quantity

d(q) = Digits((p[q]), ε(q)) = ln( (Size(p[q]) + ‖Data(p[q])‖₁ + ε²(q)) / ε(q) ).

Now note that if ℓ = length(Data(q)), then the total number of bits in Data(p[q]) and in ε(q) is
bounded by a polynomial of ℓ (since the transformation Data(q) → (Data(p[q]), ε(q), µ(q)) takes
CCT-polynomial time). It follows that both s(q) and d(q) are bounded by polynomials in ℓ, so
that our "Real Arithmetic" solvability test for Q takes a polynomial in length(Data(q)) number
of arithmetic operations.
Recall that Q was assumed to be an NP-complete generic problem, so that it would be “highly
improbable” to find a polynomial time solvability test for this problem, while we have managed
to build such a test. We conclude that the polynomial solvability of P is highly improbable as
well.
4.2 Interior Point Polynomial Time Methods for LP, CQP and
SDP
4.2.1 Motivation
Theorem 4.1.1 states that generic convex programs, under mild computability and bounded-
ness assumptions, are polynomially solvable. This result is extremely important theoretically;
however, from the practical viewpoint it is, essentially, no more than “an existence theorem”.
Indeed, the “universal” complexity bounds coming from Theorem 4.1.2, although polynomial,
are not that attractive: by Theorem 4.1.1, when solving problem (C) with n design variables,
the “price” of an accuracy digit (what it costs to reduce current inaccuracy by factor 2) is
O(n²) calls to the first order and the separation oracles plus O(n⁴) arithmetic operations to
process the answers of the oracles. Thus, even for the simplest objectives to be minimized over
the simplest feasible sets, the arithmetic price of an accuracy digit is O(n⁴); think how long it will
take to solve a problem with, say, 1,000 variables (which is still a "small" size for many
applications). The good news about the methods underlying Theorem 4.1.2 is their universality:
all they need is a Separation oracle for the feasible set and the possibility to compute the
objective and its subgradient at a given point, which is not that much. The bad news about
these methods has the same source as the good news: the methods are "oracle-oriented" and
capable of using only local information on the program they are solving, in contrast to the fact
that when solving instances of well-structured programs, like LP, we from the very beginning
have at our disposal a complete global description of the instance. And of course it is ridiculous
to use complete global knowledge of the instance just to mimic the local in their nature first
order and separation oracles. What we would like to have is an optimization technique capable
of "utilizing efficiently" our global knowledge of the instance and thus allowing us to get a solution
much faster than is possible for "nearly blind" oracle-oriented algorithms. The major event
in the "recent history" of Convex Optimization, sometimes called the "Interior Point revolution",
was the invention of these "smart" techniques.
In practice the Newton method is used in its damped form x ↦ x⁺ = x − γ(x)[f″(x)]⁻¹f′(x),
where the stepsize γ(x) > 0 is chosen in a way which, on one hand, ensures global convergence of
the method and, on the other hand, enforces γ(x) → 1 as x → x∗, thus ensuring fast (essentially
the same as for the pure Newton method) asymptotic convergence of the process2).
Practitioners thought the (properly modified) Newton method to be the fastest, in terms
of the iteration count, routine for smooth (not necessarily convex) unconstrained minimization,
although sometimes “too heavy” for practical use: the practical drawbacks of the method are
both the necessity to invert the Hessian matrix at each step, which is computationally costly in
the large-scale case, and especially the necessity to compute this matrix (think how difficult it is
to write a code computing 5,050 second order derivatives of a messy function of 100 variables).
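As an illustration of the damped Newton method just discussed, here is our own small self-contained Python sketch on a toy smooth strongly convex function of two variables; the stepsize is chosen by backtracking linesearch along the Newton direction, and all identifiers and constants are illustrative choices, not from the text.

```python
import math

def newton_linesearch(f, grad, hess_solve, x, iters=30):
    """Damped Newton: take the Newton direction e(x) = -[f''(x)]^{-1} f'(x)
    and pick the stepsize gamma by backtracking; gamma -> 1 near the minimizer."""
    for _ in range(iters):
        g = grad(x)
        e = hess_solve(x, [-v for v in g])        # Newton direction
        fx = f(x)
        slope = sum(gi * ei for gi, ei in zip(g, e))   # directional derivative (< 0)
        gamma = 1.0
        while (f([xi + gamma * ei for xi, ei in zip(x, e)])
               > fx + 0.25 * gamma * slope) and gamma > 1e-12:
            gamma *= 0.5                          # backtrack to sufficient decrease
        x = [xi + gamma * ei for xi, ei in zip(x, e)]
    return x

# illustrative test function f(x, y) = exp(x+y) + exp(x-y) + x^2
def f(x):
    return math.exp(x[0] + x[1]) + math.exp(x[0] - x[1]) + x[0] ** 2

def grad(x):
    a, b = math.exp(x[0] + x[1]), math.exp(x[0] - x[1])
    return [a + b + 2 * x[0], a - b]

def hess_solve(x, rhs):
    """Solve [f''(x)] e = rhs for the 2x2 Hessian by Cramer's rule."""
    a, b = math.exp(x[0] + x[1]), math.exp(x[0] - x[1])
    H = [[a + b + 2, a - b], [a - b, a + b]]
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * rhs[0] - H[0][1] * rhs[1]) / det,
            (-H[1][0] * rhs[0] + H[0][0] * rhs[1]) / det]
```

Note that each step indeed requires forming and solving with the Hessian, the very cost the text complains about in the large-scale case.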
Classical interior penalty scheme: the construction. Now consider a constrained convex
optimization program. As we remember, one can w.l.o.g. make its objective linear, moving, if
necessary, the actual objective to the list of constraints. Thus, let the problem be
min cT x : x ∈ X ⊂ Rn , (C)
x
where X is a closed convex set, which we assume to possess a nonempty interior. How could we
solve the problem?
Traditionally it was thought that the problems of smooth convex unconstrained minimization
are “easy”; thus, a quite natural desire was to reduce the constrained problem (C) to a series
of smooth unconstrained optimization programs. To this end, let us choose somehow a barrier
(another name – "an interior penalty function") F(x) for the feasible set X – a function which
is well-defined (and is smooth and strongly convex) on the interior of X and "blows up" as a
point from int X approaches a boundary point of X:

xi ∈ int X, xi → x ∈ ∂X as i → ∞ ⇒ F(xi) → ∞,

and let us look at the one-parametric family of functions generated by our objective and the
barrier:
Ft (x) = tcT x + F (x) : int X → R.
Here the penalty parameter t is assumed to be nonnegative.
It is easily seen that under mild regularity assumptions (e.g., in the case of bounded X ,
which we assume from now on)
• Every function Ft (·) attains its minimum over the interior of X , the minimizer x∗ (t) being
unique;
• The central path x∗ (t) is a smooth curve, and all its limiting, t → ∞, points belong to the
set of optimal solutions of (C).
This fact is quite clear intuitively. To minimize Ft (·) for large t is the same as to minimize the
function fρ(x) = cᵀx + ρF(x) for small ρ = 1/t. When ρ is small, the function fρ is very close to
cT x everywhere in X , except a narrow stripe along the boundary of X , the stripe becoming thinner
2) There are many ways to provide the required behaviour of γ(x), e.g., to choose γ(x) by a linesearch in the
direction e(x) = −[f″(x)]⁻¹f′(x) of the Newton step: γ(x) ∈ Argmin_γ f(x + γe(x)).
and thinner as ρ → 0. Therefore we have all reasons to believe that the minimizer of Ft for large t
(i.e., the minimizer of fρ for small ρ) must be close to the set of minimizers of cT x on X .
We see that the central path x∗ (t) is a kind of Ariadne’s thread which leads to the solution set of
(C). On the other hand, to reach, given a value t ≥ 0 of the penalty parameter, the point x∗ (t)
on this path is the same as to minimize a smooth strongly convex function Ft (·) which attains
its minimum at an interior point of X . The latter problem is a "nearly unconstrained" one, up to
the fact that its objective is not everywhere defined. However, we can easily adapt the methods
of unconstrained minimization, including the Newton one, to handle “nearly unconstrained”
problems. We see that constrained convex optimization in a sense can be reduced to the “easy”
unconstrained one. The conceptually simplest way to make use of this observation would be to
choose a “very large” value t̄ of the penalty parameter, like t̄ = 106 or t̄ = 1010 , and to run an
unconstrained minimization routine, say, the Newton method, on the function Ft̄ , thus getting
a good approximate solution to (C) “in one shot”. This policy, however, is impractical: since
we have no idea where x∗ (t̄) is, we normally will start our process of minimizing Ft̄ very far
from the minimizer of this function, and thus for a long time will be unable to exploit fast local
convergence of the method for unconstrained minimization we have chosen. A smarter way to
use our Ariadne’s thread is exactly the one used by Theseus: to follow the thread. Assume, e.g.,
that we know in advance the minimizer of F0 ≡ F , i.e., the point x∗ (0)3) . Thus, we know where
the central path starts. Now let us follow this path: at i-th step, standing at a point xi “close
enough” to some point x∗ (ti ) of the path, we
• first, increase a bit the current value ti of the penalty parameter, thus getting a new “target
point” x∗ (ti+1 ) on the path,
and
• second, approach our new target point x∗ (ti+1 ) by running, say, the Newton method,
started at our current iterate xi , on the function Fti+1 , until a new iterate xi+1 “close enough”
to x∗ (ti+1 ) is generated.
As a result of such a step, we restore the initial situation – we again stand at a point which
is close to a point on the central path, but this latter point has been moved along the central
path towards the optimal set of (C). Iterating this updating and strengthening appropriately
our “close enough” requirements as the process goes on, we, same as the central path, approach
the optimal set. A conceptual advantage of this “path-following” policy as compared to the
“brute force” attempt to reach a target point x∗ (t̄) with large t̄ is that now we have a hope to
exploit all the time the strongest feature of our “working horse” (the Newton method) – its fast
local convergence. Indeed, assuming that xi is close to x∗(ti) and that we do not increase the
penalty parameter too rapidly, so that x∗(ti+1) is close to x∗(ti) (recall that the central path is
smooth!), we conclude that xi is close to our new target point x∗(ti+1). If all our "close enough"
and “not too rapidly” are properly controlled, we may ensure xi to be in the domain of the
quadratic convergence of the Newton method as applied to Fti+1 , and then it will take a quite
small number of steps of the method to recover closeness to our new target point.
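The path-following scheme just described can be condensed into a few lines of code. The sketch below is our own toy Python illustration for minimizing cᵀx over the box X = [−1, 1]ⁿ with the logarithmic barrier F(x) = −Σi [ln(1−xi) + ln(1+xi)] (a self-concordant barrier with parameter θ = 2n); the penalty is multiplied by the short-step factor 1 + 0.1/√θ at every outer step, and a few damped Newton steps per update keep the iterate near the central path. All constants here are illustrative choices.

```python
import math

def solve_box_lp(c, steps=200, t0=1.0):
    """Short-step path-following for min c^T x over the box [-1, 1]^n."""
    n = len(c)
    theta = 2.0 * n                    # barrier parameter: 2 per coordinate
    x = [0.0] * n                      # x*(0): the minimizer of the barrier F
    t = t0
    for _ in range(steps):
        t *= 1.0 + 0.1 / math.sqrt(theta)    # penalty updating policy
        for _ in range(3):                   # damped Newton steps on F_t
            # gradient and (diagonal) Hessian of F_t(x) = t c^T x + F(x)
            g = [t * c[i] + 1.0 / (1.0 - x[i]) - 1.0 / (1.0 + x[i])
                 for i in range(n)]
            h = [1.0 / (1.0 - x[i]) ** 2 + 1.0 / (1.0 + x[i]) ** 2
                 for i in range(n)]
            d = [-g[i] / h[i] for i in range(n)]             # Newton direction
            lam = math.sqrt(sum(g[i] * g[i] / h[i] for i in range(n)))
            gamma = 1.0 / (1.0 + lam)  # damping keeps iterates strictly feasible
            x = [x[i] + gamma * d[i] for i in range(n)]
    return x, t
```

For x near the central path the residual cᵀx − Opt is of order θ/t, so multiplying t by a fixed factor per step yields an accuracy digit every O(√θ) steps – the behavior discussed in the text.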
For the majority of optimization problems treated by classical optimization, there are plenty of ways
to build a relatively simple barrier meeting all the requirements imposed by the scheme, and there is huge
room to play with the policies for increasing the penalty parameter and controlling closeness to
the central path, etc. And the theory says that under quite mild and general assumptions on the
choice of the numerous "free parameters" of our construction, it still is guaranteed to converge
to the optimal set of the problem we have to solve. All looks wonderful, until we realize that
the convergence ensured by the theory is completely "unqualified", a purely asymptotical
phenomenon: we are promised to reach eventually a solution of whatever accuracy we wish,
but how long it will take to reach a given accuracy – this is a question which the "classical" optimization
theory, with its "convergence" and "asymptotic linear/superlinear/quadratic convergence", neither
posed nor answered. And since our life in this world is finite (moreover, usually more finite
than we would like it to be), "asymptotical promises" are perhaps better than nothing, but
definitely are not all we would like to know. What is vitally important for us in theory (and to
some extent – also in practice) is the issue of complexity: given an instance of such and such
generic optimization problem and a desired accuracy ε, how large is the computational effort
(# of arithmetic operations) needed to get an ε-solution of the instance? And we would like
the answer to be a kind of polynomial time complexity bound, and not a quantity depending
on "unobservable and uncontrollable" properties of the instance, like the "level of regularity" of
the boundary of X at the (unknown!) optimal solution of the instance.
It turns out that the intuitively nice classical theory we have outlined is unable to say a single
word on the complexity issues (this is how it should be: a reasoning in purely qualitative terms
like "smooth", "strongly convex", etc., definitely cannot yield a quantitative result...). Moreover,
from the complexity viewpoint just the very philosophy of the classical convex optimization turns
out to be wrong:
• As far as the complexity is concerned, for nearly all “black box represented” classes of
unconstrained convex optimization problems (those where all we know is that the objective is
called f (x), is (strongly) convex and 2 (3,4,5...) times continuously differentiable, and can be
computed, along with its derivatives up to order ... at every given point), there is no such
phenomenon as “local quadratic convergence”, the Newton method (which uses the second
derivatives) has no advantages as compared to the methods which use only the first order
derivatives, etc.;
• The very idea to reduce “black-box-represented” constrained convex problems to uncon-
strained ones – from the complexity viewpoint, the unconstrained problems are not easier than
the constrained ones...
4.2.3 But...
Luckily, the pessimistic analysis of the classical interior penalty scheme is not the “final truth”.
It turned out that what prevents this scheme from yielding a polynomial time method is not the
structure of the scheme, but the huge amount of freedom it allows for its elements (too much
freedom is another word for anarchy...). After some order is added, the scheme becomes a
polynomial time one! Specifically, it was understood that
1. There is a (completely non-traditional) class of “good” (self-concordant4) ) barriers. Every
barrier F of this type is associated with a “self-concordance parameter” θ(F ), which is a
4)
We do not intend to explain here what a "self-concordant barrier" is; for our purposes it suffices to say that
this is a three times continuously differentiable convex barrier F satisfying a pair of specific differential inequalities
linking the first, the second and the third directional derivatives of F .
real ≥ 1;
2. Whenever a barrier F underlying the interior penalty scheme is self-concordant, one can
specify the notion of "closeness to the central path" and the policy for updating the penalty
parameter in such a way that a single Newton step

x_{i+1} = x_i − [∇²F_{t_{i+1}}(x_i)]⁻¹ ∇F_{t_{i+1}}(x_i)

suffices to update a "close to x∗(ti)" iterate xi into a new iterate xi+1 which is close, in
the same sense, to x∗(ti+1). All "close to the central path" points belong to int X, so that
the scheme keeps all the iterates strictly feasible.
3. The penalty updating policy mentioned in the previous item is quite simple:

ti → ti+1 = (1 + 0.1/√θ(F)) ti;

in particular, it does not "slow down" as ti grows and ensures linear, with the ratio
1 + 0.1/√θ(F), growth of the penalty. This is vitally important due to the following fact:
4. The inaccuracy of a point x, which is close to some point x∗(t) of the central path, as an
approximate solution to (C) is inversely proportional to t:

cᵀx − min_{y∈X} cᵀy ≤ 2θ(F)/t.
It follows that

(!) After we have managed once to get close to the central path – have built a
point x0 which is close to a point x∗(t0), t0 > 0, on the path – every O(√θ(F))
steps of the scheme improve the quality of approximate solutions generated by
the scheme by an absolute constant factor. In particular, it takes no more than

O(1) √θ(F) ln( 2 + θ(F)/(t0 ε) )

steps to generate a strictly feasible ε-solution to (C).
Note that with our simple penalty updating policy all that is needed to perform a step of the
interior penalty scheme is to compute the gradient and the Hessian of the underlying
barrier at a single point and to invert the resulting Hessian.
Items 3, 4 say that essentially all we need in order to derive from the just listed general results a polynomial
time method for a generic convex optimization problem is to be able to equip every instance
of the problem with a "good" barrier in such a way that both the parameter of self-concordance
of the barrier θ(F) and the arithmetic cost at which we can compute the gradient and the Hessian
of this barrier at a given point are polynomial in the size of the instance5). And it turns
5) Another requirement is to be able to get close, once, to a point x∗(t0) on the central path with a not "disastrously
small" value of t0 – we should somehow initialize our path-following method! It turns out that such an initialization
is a minor problem – it can be carried out via the same path-following technique, provided we are given in advance
a strictly feasible solution to our problem.
out that we can meet the latter requirement for all interesting “well-structured” generic convex
programs, in particular, for Linear, Conic Quadratic, and Semidefinite Programming. Moreover,
“the heroes” of our course – LP, CQP and SDP – are especially nice application fields of the
general theory of interior point polynomial time methods; in these particular applications, the
theory can be simplified, on one hand, and strengthened, on another.
4.3 Interior point methods for LP, CQP, and SDP: building
blocks
We are about to explain what the interior point methods for LP, CQP, SDP look like.
associated with a cone K given as a direct product of m “basic” cones, each of them being either
a second-order, or a semidefinite cone:
K = S^{k1}_+ × ... × S^{kp}_+ × L^{kp+1} × ... × L^{km} ⊂ E = S^{k1} × ... × S^{kp} × R^{kp+1} × ... × R^{km}. (Cone)
Of course, the generic problem in question covers LP (no Lorentz factors, all semidefinite factors
are of dimension 1), CQP (no semidefinite factors) and SDP (no Lorentz factors).
Now, we shall equip the semidefinite and the Lorentz cones with "canonical barriers":
• The canonical barrier for a semidefinite cone S^k_+ is

Sk(x) = −ln Det(x) : int S^k_+ → R, θ(Sk) = k;

• the canonical barrier for a Lorentz cone L^k is

Lk(x) = −ln(x_k² − x_1² − ... − x_{k−1}²) = −ln(xᵀJk x), Jk = Diag(−1, ..., −1, 1), θ(Lk) = 2.
The parameter of the barrier K, again by definition, is the sum of the parameters of the basic
barriers involved:

θ(K) = θ(S^{k1}) + ... + θ(S^{kp}) + θ(L^{kp+1}) + ... + θ(L^{km}) = Σ_{i=1}^{p} ki + 2(m − p).
Recall that all direct factors in the direct product representation (Cone) of our "universe" E
are Euclidean spaces; the matrix factors S^{ki} are endowed with the Frobenius inner product

⟨X, Y⟩_F = Tr(XY),

while the "arithmetic factors" R^{ki} are endowed with the usual inner product ⟨x, y⟩ = xᵀy.
E itself will be regarded as a Euclidean space endowed with the direct sum of the inner products
on the factors:

⟨X, Y⟩_E = Σ_{i=1}^{p} Tr(Xi Yi) + Σ_{i=p+1}^{m} XiᵀYi.
It is clearly seen that our basic barriers, same as their direct sum K, indeed are barriers for
the corresponding cones: they are C∞ -smooth on the interiors of their domains, blow up to
∞ along every sequence of points from these interiors converging to a boundary point of the
corresponding domain and are strongly convex. To verify the latter property, it makes sense to
compute explicitly the first and the second directional derivatives of these barriers (we need the
corresponding formulae in any case); to simplify notation, we write down the derivatives of the
basic functions Sk, Lk at a point x from their domain along a direction h (you should keep in mind
that in the case of Sk both the point and the direction, in spite of their lower-case notation,
are k × k symmetric matrices):
DSk(x)[h] ≡ d/dt|_{t=0} Sk(x + th) = −Tr(x⁻¹h) = −⟨x⁻¹, h⟩_{S^k},
i.e.
∇Sk(x) = −x⁻¹;

D²Sk(x)[h, h] ≡ d²/dt²|_{t=0} Sk(x + th) = Tr(x⁻¹hx⁻¹h) = ⟨x⁻¹hx⁻¹, h⟩_{S^k},
i.e.
[∇²Sk(x)]h = x⁻¹hx⁻¹;

DLk(x)[h] ≡ d/dt|_{t=0} Lk(x + th) = −2 (hᵀJk x)/(xᵀJk x),
i.e.
∇Lk(x) = −(2/(xᵀJk x)) Jk x;

D²Lk(x)[h, h] ≡ d²/dt²|_{t=0} Lk(x + th) = 4 (hᵀJk x)²/(xᵀJk x)² − 2 (hᵀJk h)/(xᵀJk x),
i.e.
∇²Lk(x) = (4/(xᵀJk x)²) Jk x xᵀ Jk − (2/(xᵀJk x)) Jk.        (4.3.1)
so that D²Sk(x)[h, h] is positive whenever h ≠ 0. It is not difficult to prove that the same is true
for D²Lk(x)[h, h]. Thus, the canonical barriers for semidefinite and Lorentz cones are strongly
convex, and so is their direct sum K(·).
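Formulas like (4.3.1) are easy to sanity-check numerically. The following sketch (our own Python illustration) compares the analytic gradient of the Lorentz-cone barrier Lk(x) = −ln(xᵀJk x) with central finite differences, and also verifies the logarithmic-homogeneity identity ⟨∇Lk(x), x⟩ = −θ(Lk) = −2 at a point in the interior of L³.

```python
import math

def L_barrier(x):
    """Canonical Lorentz barrier: -ln(x_k^2 - x_1^2 - ... - x_{k-1}^2)."""
    q = x[-1] ** 2 - sum(v * v for v in x[:-1])   # x^T J_k x, must be > 0
    return -math.log(q)

def L_grad(x):
    """Analytic gradient from (4.3.1): -(2 / x^T J_k x) J_k x."""
    q = x[-1] ** 2 - sum(v * v for v in x[:-1])
    Jx = [-v for v in x[:-1]] + [x[-1]]           # J_k = Diag(-1,...,-1,1)
    return [-2.0 * v / q for v in Jx]

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient, for checking the formulas."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g
```

The identity ⟨∇Lk(x), x⟩ = −2 holds exactly, since ⟨∇Lk(x), x⟩ = −2(xᵀJk x)/(xᵀJk x); the finite-difference comparison confirms the algebra behind (4.3.1).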
It makes sense to illustrate the relatively general concepts and results to follow by how they
look in the particular case when K is the semidefinite cone S^k_+; we shall refer to this situation
as the “SDP case”. The essence of the matter in our general case is exactly the same as in
this particular one, but the “straightforward computations”, which are easy in the SDP case, become
nearly impossible in the general case; and we have no possibility to explain here how it is possible
(it is!) to get the desired results with a minimum amount of computations.
Due to the role played by the SDP case in our exposition, we use for this case special
notation, along with the just introduced “general” one. Specifically, we denote the standard
– the Frobenius – inner product on E = S^k as ⟨·, ·⟩_F, although feel free, if necessary, to use
our “general” notation ⟨·, ·⟩_E as well; the associated norm is denoted by ‖·‖₂, so that
‖X‖₂ = √(Tr(X²)), X being a symmetric matrix.
• In the SDP case, logarithmic homogeneity of S_k reads − ln Det(tx) = − ln Det(x) − k ln t.
• In the SDP case, ∇F(x) = ∇S_k(x) = −x^{−1} and [∇²F(x)]h = ∇²S_k(x)h =
x^{−1}hx^{−1} (see (4.3.1)). Here (a) becomes the identity ⟨x^{−1}, x⟩_F ≡ Tr(x^{−1}x) = k,
and (b) kindly informs us that x^{−1}xx^{−1} = x^{−1}.
Proof. (i): it is immediately seen that Sk and Lk are logarithmically homogeneous with parameters of
logarithmic homogeneity −θ(Sk ), −θ(Lk ), respectively; and of course the property of logarithmic homo-
geneity is stable with respect to taking direct sums of functions: if DomΦ(u) and DomΨ(v) are closed
w.r.t. the operation of multiplying a vector by a positive scalar, and both Φ and Ψ are logarithmi-
cally homogeneous with parameters α, β, respectively, then the function Φ(u) + Ψ(v) is logarithmically
homogeneous with the parameter α + β.
(ii): To get (ii.a), it suffices to differentiate the identity
F(tx) = F(x) − θ(F) ln t
in t:
(d/dt) F(tx) = ⟨∇F(tx), x⟩ = −θ(F)t^{−1},
and it remains to set t = 1 in the concluding identity.
Similarly, to get (ii.b), it suffices to differentiate the identity
⟨∇F(tx), x⟩ = −θ(F)t^{−1}
in x and to set t = 1, using that ⟨[∇²F(x)]h, x⟩ = ⟨[∇²F(x)]x, h⟩ (symmetry of partial derivatives!).
Finally, differentiating the identity
F(tx) = F(x) − θ ln t
k times in x, we get
t^k D^k F(tx)[h_1, ..., h_k] = D^k F(x)[h_1, ..., h_k].
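In the SDP case the identities used in this proof can be verified directly; the following sketch (the randomly generated positive definite point is our assumption) checks logarithmic homogeneity of F = S_k with θ(F) = k, together with (ii.a) and (ii.b):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5
M = rng.standard_normal((k, k))
X = M @ M.T + k * np.eye(k)             # a point in int S^k_+
F = lambda Y: -np.linalg.slogdet(Y)[1]  # F = S_k, theta(F) = k
G = lambda Y: -np.linalg.inv(Y)         # ∇F(Y) = -Y^{-1}
t = 2.7

# logarithmic homogeneity: F(tx) = F(x) - k ln t
assert abs(F(t * X) - (F(X) - k * np.log(t))) < 1e-10
# differentiating in x: ∇F(tx) = t^{-1} ∇F(x)
assert np.allclose(G(t * X), G(X) / t)
# (ii.a): <∇F(x), x> = -θ(F)
assert abs(np.trace(G(X) @ X) + k) < 1e-10
# (ii.b): [∇²F(x)]x = -∇F(x), in the SDP case x^{-1} x x^{-1} = x^{-1}
Xi = np.linalg.inv(X)
assert np.allclose(Xi @ X @ Xi, Xi)
```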
Assuming that the linear mapping x → Ax is an embedding (i.e., that KerA = {0} – this
is Assumption A from Lecture 1), we can write down our primal-dual pair in a symmetric
geometric form (Lecture 1, Section 1.6.1):
where L is a linear subspace in E (the image space of the linear mapping x → Ax), L⊥ is the
orthogonal complement to L in E, and C ∈ E satisfies A∗C = c, i.e., ⟨C, Ax⟩_E ≡ c^T x.
To simplify things, from now on we assume that both problems (CP) and (CD) are strictly
feasible. In terms of (P) and (D) this assumption means that both the primal feasible plane
L − B and the dual feasible plane L⊥ + C intersect the interior of the cone K.
Remark 4.4.1 By the Conic Duality Theorem (Lecture 1), both (CP) and (D) are solvable with
equal optimal values:
Opt(CP) = Opt(D)
(recall that we have assumed strict primal-dual feasibility). Since (P) is equivalent to (CP), (P)
is solvable as well, and the optimal value of (P) differs from the one of (CP) by ⟨C, B⟩_E 7). It
follows that the optimal values of (P) and (D) are linked by the relation
this barrier is
K̂(x) = K(Ax − B) : int X → R                                    (4.4.2)
7)
Indeed, the values of the respective objectives c^T x and ⟨C, Ax − B⟩_E at the corresponding to each other
feasible solutions x of (CP) and X = Ax − B of (P) differ from each other by exactly ⟨C, B⟩_E:
and is indeed a barrier. Now we can apply the interior penalty scheme to trace the central
path x∗ (t) associated with the resulting barrier; with some effort it can be derived from the
primal-dual strict feasibility that this central path is well-defined (i.e., that the minimizer of
K̂_t(x) = tc^T x + K̂(x)
on int X exists for every t ≥ 0 and is unique)8) . What is important for us for the moment, is
the central path itself, not how to trace it. Moreover, it is highly instructive to pass from the
central path x∗ (t) in the space of design variables to its image
in E. The resulting curve has a name – it is called the primal central path of the primal-dual
pair (P), (D); by its origin, it is a curve comprised of strictly feasible solutions of (P) (since it
is the same – to say that x belongs to the (interior of) the set X and to say that X = Ax − B
is a (strictly) feasible solution of (P)). A simple and very useful observation is that the primal
central path can be defined solely in terms of (P), (D) and thus is a “geometric entity” – it is
independent of a particular parameterization of the primal feasible plane L − B by the design
vector x:
(*) A point X∗(t) of the primal central path is the minimizer of the aggregate
P_t(X) = t⟨C, X⟩_E + K(X)
on the set of strictly feasible solutions to (P).
This observation is just a tautology: x∗(t) is the minimizer on int X of the aggregate
K̂_t(x) = tc^T x + K̂(x); since ⟨C, Ax − B⟩_E ≡ c^T x − ⟨C, B⟩_E,
we see that the function P̂_t(x) = P_t(Ax − B) of x ∈ int X differs from the function K̂_t(x)
by a constant (depending on t) and has therefore the same minimizer x∗(t) as the function
K̂_t(x). Now, when x runs through int X, the point X = Ax − B runs exactly through the
set of strictly feasible solutions of (P), so that the minimizer X∗ of P_t on the latter set and
the minimizer x∗(t) of the function P̂_t(x) = P_t(Ax − B) on int X are linked by the relation
X∗ = Ax∗(t) − B.
(*′) A point X∗(t) of the primal central path is exactly the strictly feasible solution
X to (P) such that the vector tC + ∇K(X) ∈ E is orthogonal to L (i.e., belongs to
L⊥ ).
Indeed, we know that X∗(t) is the unique minimizer of the smooth convex function P_t(X) =
t⟨C, X⟩_E + K(X) on the intersection of the primal feasible plane L − B and the interior of the
cone K; a necessary and sufficient condition for a point X of this intersection to minimize
P_t over the intersection is that ∇P_t must be orthogonal to L.
8)
In Section 4.2.1, there was no problem with the existence of the central path, since there X was assumed to
be bounded; in our present context, X is not necessarily bounded.
4.4. PRIMAL-DUAL PAIR OF PROBLEMS AND PRIMAL-DUAL CENTRAL PATH 157
• In the SDP case, a point X∗ (t), t > 0, of the primal central path is uniquely defined
by the following two requirements: (1) X∗(t) ≻ 0 should be feasible for (P), and (2)
the k × k matrix
tC − X∗−1 (t) = tC + ∇Sk (X∗ (t))
(see (4.3.1)) should belong to L⊥ , i.e., should be orthogonal, w.r.t. the Frobenius
inner product, to every matrix of the form Ax.
The dual problem (D) is in no sense “worse” than the primal problem (P) and thus also
possesses the central path, now called the dual central path S∗ (t), t ≥ 0, of the primal-dual pair
(P), (D). Similarly to (*), (*′), the dual central path can be characterized as follows:
(**) A point S∗(t), t ≥ 0, of the dual central path is the unique minimizer of the
aggregate
Dt (S) = −tB, SE + K(S)
on the set of strictly feasible solutions of (D) 9) . S∗ (t) is exactly the strictly feasible
solution S to (D) such that the vector −tB + ∇K(S) is orthogonal to L⊥ (i.e., belongs
to L).
• In the SDP case, a point S∗ (t), t > 0, of the dual central path is uniquely defined
by the following two requirements: (1) S∗(t) ≻ 0 should be feasible for (D), and (2)
the k × k matrix
−tB − S∗−1 (t) = −tB + ∇Sk (S∗ (t))
(see (4.3.1)) should belong to L, i.e., should be representable in the form Ax for
some x.
From Proposition 4.3.2 we can derive a wonderful connection between the primal and the dual
central paths:
Theorem 4.4.1 For t > 0, the primal and the dual central paths X∗ (t), S∗ (t) of a (strictly
feasible) primal-dual pair (P), (D) are linked by the relations
S∗ (t) = −t−1 ∇K(X∗ (t))
(4.4.3)
X∗ (t) = −t−1 ∇K(S∗ (t))
Proof. By (*′), the vector tC + ∇K(X∗(t)) belongs to L⊥, so that the vector S = −t^{−1}∇K(X∗(t)) belongs
to the dual feasible plane L⊥ + C. On the other hand, by Proposition 4.3.2 the vector −∇K(X∗(t)) belongs
to Dom K, i.e., to the interior of K; since K is a cone and t > 0, the vector S = −t^{−1}∇K(X∗(t)) belongs
to the interior of K as well. Thus, S is a strictly feasible solution of (D). Now let us compute the gradient
of the aggregate Dt at the point S:
∇D_t(S) = −tB + ∇K(−t^{−1}∇K(X∗(t)))
        = −tB + t∇K(−∇K(X∗(t)))     [we have used (4.3.4)]
        = −tB − tX∗(t)               [we have used (4.3.3)]
        = −t(B + X∗(t)) ∈ L          [since X∗(t) is primal feasible]
9)
Note the slight asymmetry between the definitions of the primal aggregate Pt and the dual aggregate Dt :
in the former, the linear term is tC, XE , while in the latter it is −tB, SE . This asymmetry is in complete
accordance with the fact that we write (P) as a minimization, and (D) – as a maximization problem; to write
(D) in exactly the same form as (P), we were supposed to replace B with −B, thus getting the formula for Dt
completely similar to the one for Pt .
158 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
Thus, S is strictly feasible for (D) and ∇D_t(S) ∈ L. But by (**) these properties characterize S∗(t);
thus, S∗ (t) = S ≡ −t−1 ∇K(X∗ (t)). This relation, in view of Proposition 4.3.2, implies that X∗ (t) =
−t−1 ∇K(S∗ (t)). Another way to get the latter relation from the one S∗ (t) = −t−1 ∇K(X∗ (t)) is just to
refer to the primal-dual symmetry.
In fact, the connection between the primal and the dual central paths stated by Theorem 4.4.1
can be used to characterize both the paths:
Theorem 4.4.2 Let (P), (D) be a strictly feasible primal-dual pair.
For every t > 0, there exists a unique strictly feasible solution X of (P) such that −t−1 ∇K(X)
is a feasible solution to (D), and this solution X is exactly X∗ (t).
Similarly, for every t > 0, there exists a unique strictly feasible solution S of (D) such that
−t−1 ∇K(S) is a feasible solution of (P), and this solution S is exactly S∗ (t).
Proof. By primal-dual symmetry, it suffices to prove the first claim. We already know (Theorem
4.4.1) that X = X∗ (t) is a strictly feasible solution of (P) such that −t−1 ∇K(X) is feasible
for (D); all we need to prove is that X∗ (t) is the only point with these properties, which is
immediate: if X is a strictly feasible solution of (P) such that −t−1 ∇K(X) is dual feasible, then
−t−1 ∇K(X) ∈ L⊥ + C, or, which is the same, ∇K(X) ∈ L⊥ − tC, or, which again is the same,
∇P_t(X) = tC + ∇K(X) ∈ L⊥. And we already know from (*′) that the latter property, taken
together with the strict primal feasibility, is characteristic for X∗ (t).
Characterization of the central path. By Theorem 4.4.2, the points (X∗ (t), S∗ (t)) of the
central path possess the following properties:
(CentralPath):
1°.1) X = X∗(t) is a strictly feasible solution of (P);
1°.2) S = S∗(t) is a strictly feasible solution of (D);
1°.3) [augmented complementary slackness] S = −t^{−1}∇K(X), or, which is the same, X = −t^{−1}∇K(S).
In fact, the indicated properties fully characterize the central path: whenever two points X, S
possess the properties 1) - 3) with respect to some t > 0, X is nothing but X∗ (t), and S is
nothing but S∗ (t) (this again is said by Theorem 4.4.2).
Duality gap along the central path. Recall that for an arbitrary primal-dual feasible pair
(X, S) of the (strictly feasible!) primal-dual pair of problems (P), (D), the duality gap
DualityGap(X, S) ≡ [⟨C, X⟩_E − Opt(P)] + [Opt(D) − ⟨B, S⟩_E] = ⟨C, X⟩_E − ⟨B, S⟩_E + ⟨C, B⟩_E
(see (4.4.1)) which measures the “total inaccuracy” of X, S as approximate solutions of the
respective problems, can be written down equivalently as ⟨S, X⟩_E (see statement (!) in Section
1.7). Now, what is the duality gap along the central path? The answer is immediate:
Proposition 4.4.1 Under assumption of primal-dual strict feasibility, the duality gap along the
central path is inversely proportional to the penalty parameter, the proportionality coefficient being
the parameter of the canonical barrier K:

DualityGap(X∗(t), S∗(t)) = θ(K)/t.

In particular, both X∗(t) and S∗(t) are strictly feasible (θ(K)/t)-approximate solutions to their
respective problems:

⟨C, X∗(t)⟩_E − Opt(P) ≤ θ(K)/t,
Opt(D) − ⟨B, S∗(t)⟩_E ≤ θ(K)/t.
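In the SDP case the proposition is transparent: by Theorem 4.4.1, S∗(t) = −t^{−1}∇S_k(X∗(t)) = t^{−1}X∗^{−1}(t), whence Tr(X∗(t)S∗(t)) = k/t. A numerical sketch (the random positive definite matrix X below is a stand-in for X∗(t); this suffices because the computation uses only the relation between X and S along the path):

```python
import numpy as np

rng = np.random.default_rng(2)
k, t = 6, 3.5
M = rng.standard_normal((k, k))
X = M @ M.T + np.eye(k)                 # stand-in for X*(t), X ≻ 0
S = np.linalg.inv(X) / t                # S = -t^{-1} ∇S_k(X) = t^{-1} X^{-1}
gap = np.trace(X @ S)                   # duality gap <S, X>_F
assert abs(gap - k / t) < 1e-10         # θ(S_k)/t, with θ(S_k) = k
```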
We see that
All we need in order to get “quickly” good primal and dual approximate solutions,
is to trace fast the central path; if we were interested to solve only one of the prob-
lems (P), (D), it would be sufficient to trace fast the associated – primal or dual –
component of this path. The quality guarantees we get in such a process depend – in
a completely universal fashion! – solely on the value t of the penalty parameter we
have managed to achieve and on the value of the parameter of the canonical barrier
K and are completely independent of other elements of the data.
10)
Which, among other, much more important consequences, explains the name “augmented complementary
slackness” of the property 1°.3): at the primal-dual pair of optimal solutions X∗, S∗ the duality gap should be
zero: ⟨S∗, X∗⟩_E = 0. Property 1°.3), as we just have seen, implies that the duality gap at a primal-dual pair
(X∗(t), S∗(t)) from the central path, although nonzero, is “controllable” – it equals θ(K)/t – and becomes small as t grows.
160 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
How close (and in what sense close) should we be to the path in order for our life to
be essentially as nice as if we were exactly on the path?
There are several ways to answer this question; we will present the simplest one.
A distance to the central path. Our canonical barrier K(·) is a strongly convex smooth
function on int K; in particular, its Hessian matrix ∇2 K(Y ), taken at a point Y ∈ int K, is
positive definite. We can use the inverse of this matrix to measure the distances between points
of E, thus arriving at the norm
‖H‖_Y = √(⟨[∇²K(Y)]^{−1}H, H⟩_E).
Observe that dist(Z, Z∗ (t)) ≥ 0, and dist(Z, Z∗ (t)) = 0 if and only if S = −t−1 ∇K(X), which,
for a strictly primal-dual feasible pair Z = (X, S), means that Z = Z∗ (t) (see the characterization
of the primal-dual central path); thus, dist(Z, Z∗ (t)) indeed can be viewed as a kind of distance
from Z to Z∗ (t).
dist²(Z, Z∗(t)) = ‖tS + ∇S_k(X)‖²_X = ⟨[∇²S_k(X)]^{−1}(tS + ∇S_k(X)), tS + ∇S_k(X)⟩_F
= Tr(X(tS − X^{−1})X(tS − X^{−1}))      [see (4.3.1)]
= Tr([tX^{1/2}SX^{1/2} − I]²),
so that
dist²(Z, Z∗(t)) = Tr(X(tS − X^{−1})X(tS − X^{−1})) = ‖tX^{1/2}SX^{1/2} − I‖²₂.      (4.4.6)
4.4. PRIMAL-DUAL PAIR OF PROBLEMS AND PRIMAL-DUAL CENTRAL PATH 161
Besides this,
‖tX^{1/2}SX^{1/2} − I‖²₂ = Tr([tX^{1/2}SX^{1/2} − I]²)
= Tr(t²X^{1/2}SX^{1/2}X^{1/2}SX^{1/2} − 2tX^{1/2}SX^{1/2} + I)
= Tr(t²X^{1/2}SXSX^{1/2}) − 2tTr(X^{1/2}SX^{1/2}) + Tr(I)
= Tr(t²XSXS − 2tXS + I)
= Tr(t²SXSX − 2tSX + I)
= Tr(t²S^{1/2}XS^{1/2}S^{1/2}XS^{1/2} − 2tS^{1/2}XS^{1/2} + I)
= Tr([tS^{1/2}XS^{1/2} − I]²),
whence, by symmetry, dist²(Z, Z∗(t)) = ‖tS^{1/2}XS^{1/2} − I‖²₂ as well. Further, denoting by
δ_1, ..., δ_k the eigenvalues of the positive definite matrix X^{1/2}SX^{1/2}, we have

DualityGap(X, S) = Tr(XS) = Tr(X^{1/2}SX^{1/2}) = Σ_{i=1}^k δ_i
≤ kt^{−1} + Σ_{i=1}^k |δ_i − t^{−1}|
≤ kt^{−1} + √k (Σ_{i=1}^k (δ_i − t^{−1})²)^{1/2}
= kt^{−1} + √k t^{−1}‖tX^{1/2}SX^{1/2} − I‖₂ = kt^{−1} + √k t^{−1} dist(Z, Z∗(t)),

so that dist(Z, Z∗(t)) ≤ 1 implies DualityGap(X, S) ≤ kt^{−1} + √k t^{−1}, and (4.4.7) follows.
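The equalities in (4.4.6) and its S-symmetric counterpart are easy to confirm numerically; the sketch below (the random strictly positive definite X, S are our assumptions) checks that the three expressions for dist²(Z, Z∗(t)) agree:

```python
import numpy as np

def psd_sqrt(A):
    # symmetric square root of a positive definite matrix via eigendecomposition
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(w)) @ V.T

rng = np.random.default_rng(3)
k, t = 5, 1.7
mk = lambda: (lambda M: M @ M.T + np.eye(k))(rng.standard_normal((k, k)))
X, S = mk(), mk()

Xi = np.linalg.inv(X)
d_a = np.trace(X @ (t * S - Xi) @ X @ (t * S - Xi))             # Tr(X(tS - X^{-1})X(tS - X^{-1}))
Xh = psd_sqrt(X)
d_b = np.linalg.norm(t * Xh @ S @ Xh - np.eye(k), 'fro') ** 2   # ||t X^{1/2} S X^{1/2} - I||_2^2
Sh = psd_sqrt(S)
d_c = np.linalg.norm(t * Sh @ X @ Sh - np.eye(k), 'fro') ** 2   # ||t S^{1/2} X S^{1/2} - I||_2^2
assert np.isclose(d_a, d_b) and np.isclose(d_b, d_c)
```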
Most of the interior point methods for LP, CQP, and SDP, including those most powerful in
practice, solve the primal-dual pair (P), (D) by tracing the central path11) , although not all of
them keep the iterates in N_{O(1)}; some of the methods work in much wider neighbourhoods of
the central path, in order to avoid slowing down when passing “highly curved” segments of the
path. At the level of ideas, these “long step path following methods” essentially do not differ
from the “short step” ones – those keeping the iterates in N_{O(1)}; this is why in the analysis part
of our forthcoming presentation we restrict ourselves to the short-step methods. It should be
added that as far as the theoretical efficiency estimates are concerned, the short-step methods
yield the best known so far complexity bounds for LP, CQP and SDP, and are essentially better
than the long-step methods (although in practice the long-step methods usually outperform
their short-step counterparts).
X ∈ L − B,
(4.5.1)
S ∈ L⊥ + C
(which is in fact a system of linear equations) and approximately satisfies the system
of nonlinear equations
update it into a new triple (t+ , X+ , S+ ) with the same properties and t+ > t̄.
Since the left hand side G(·) in our system of nonlinear equations is smooth around (t̄, X̄, S̄)
(recall that X̄ was assumed to be strictly primal feasible), the most natural, from the viewpoint
of Computational Mathematics, way to achieve our target is as follows:
11)
There exist also potential reduction interior point methods which do not take explicit care of tracing the
central path; an example is the very first IP method for LP – the method of Karmarkar. The potential reduction
IP methods are beyond the scope of our course, which is not a big loss for a practically oriented reader, since, as
a practical tool, these methods are thought to be obsolete.
12)
Of course, besides knowing how to trace the central path, we should also know how to initialize this process
– how to come close to the path to be able to start its tracing. There are different techniques to resolve this
“initialization difficulty”, and basically all of them achieve the goal by using the same path-tracing technique,
now applied to an appropriate auxiliary problem where the “initialization difficulty” does not arise at all. Thus,
at the level of ideas the initialization techniques do not add something essentially new, which allows us to skip in
our presentation all initialization-related issues.
4.5. TRACING THE CENTRAL PATH 163
The primal-dual IP methods we are describing basically fit the outlined scheme, up to the
following two important points:
• If the current iterate (X̄, S̄) is not close enough to Z∗(t̄), and/or if the desired improvement
t+ − t̄ is too large, the corrections given by the outlined scheme may be too large; as a
result, the updating (4.5.5) as it is may be inappropriate, e.g., X+ , or S+ , or both, may
be kicked out of the cone K. (Why not: linearized system (4.5.3) approximates well the
“true” system (4.5.2) only locally, and we have no reasons to trust in corrections coming
from the linearized system, when these corrections are large.)
There is a standard way to overcome the outlined difficulty – to use the corrections in a
damped fashion, namely, to replace the updating (4.5.5) with
X+ = X̄ + α∆X,
(4.5.6)
S+ = S̄ + β∆S,
and to choose the stepsizes α > 0, β > 0 from additional “safety” considerations, like
ensuring the updated pair (X+ , S+ ) to reside in the interior of K, or enforcing it to stay in
a desired neighbourhood of the central path, or whatever else. In IP methods, the solution
(∆X, ∆S) of (4.5.4) plays the role of search direction (and this is how it is called), and the
actual corrections are proportional to the search directions rather than equal to them.
In this sense the situation is completely similar to the one with the Newton method from
Section 4.2.2 (which is natural: the latter method is exactly the linearization method for
solving the Fermat equation ∇f (x) = 0).
• The “augmented complementary slackness” system (4.5.2) can be written down in many
different forms which are equivalent to each other in the sense that they share a common
solution set. E.g., we have the same reasons to express the augmented complementary
slackness requirement by the nonlinear system (4.5.2) as to express it by the system
not speaking about other possibilities. And although all systems of nonlinear equations
Ht (X, S) = 0
expressing the augmented complementary slackness are “equivalent” in the sense that they
share a common solution set, their linearizations are different and thus – lead to different
search directions and finally to different path-following methods. Choosing appropriate (in
general even varying from iteration to iteration) analytic representation of the augmented
complementary slackness requirement, one can gain a lot in the performance of the result-
ing path-following method, and the IP machinery facilitates this flexibility (see “SDP case
examples” below).
1. Assembling the matrix
H = A∗[∇²K(X)]A;
2. Subsequent Choleski factorization of the matrix H (which, due to its origin, is symmetric
positive definite and thus admits Choleski decomposition H = DDT with lower triangular
D).
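In the SDP case with Ax = Σ_i x_i A_i, the matrix H has entries H_ij = ⟨A_i, X^{−1}A_jX^{−1}⟩_F. Here is a sketch of the two steps (the random data A_i, X are assumptions; for generic A_i the assumption Ker A = {0} holds and H is positive definite):

```python
import numpy as np

rng = np.random.default_rng(4)
k, n = 5, 3
sym = lambda M: (M + M.T) / 2
As = [sym(rng.standard_normal((k, k))) for _ in range(n)]   # A x = sum_i x_i A_i
M = rng.standard_normal((k, k))
X = M @ M.T + k * np.eye(k)                                 # current strictly feasible X
Xi = np.linalg.inv(X)

# step 1: assemble H = A*[∇²K(X)]A, H_ij = Tr(A_i X^{-1} A_j X^{-1})
H = np.array([[np.trace(Ai @ Xi @ Aj @ Xi) for Aj in As] for Ai in As])

# step 2: Choleski factorization H = D D^T with lower triangular D
D = np.linalg.cholesky(H)
assert np.allclose(D @ D.T, H)
assert np.allclose(D, np.tril(D))
```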
Looking at (Cone), (CP) and (4.3.1), we immediately conclude that the arithmetic cost of
assembling and factorizing H is polynomial in the size dim Data(·) of the data defining (CP),
and that the parameter θ(K) also is polynomial in this size. Thus, the cost of an accuracy digit
for the methods in question is polynomial in the size of the data, as is required from polynomial
time methods13). Explicit complexity bounds for LP_b, CQP_b, SDP_b are given in Sections 4.6.1,
4.6.2, 4.6.3, respectively.
13)
Strictly speaking, the outlined complexity considerations are applicable to the “highway” phase of the
solution process, after we once have reached the neighbourhood N0.1 of the central path. However, the results of
our considerations remain unchanged after the initialization expenses are taken into account, see Section 4.6.
4.5. TRACING THE CENTRAL PATH 165
where t+ is the target value of the penalty parameter. The system (4.5.4) now becomes

(a) ∆X ∈ L     ⇔ (a′) ∆X = A∆x  [∆x ∈ R^n]
(b) ∆S ∈ L⊥    ⇔ (b′) A∗∆S = 0                                      (4.5.7)
(c) t+[S̄ + ∆S] + ∇K(X̄) + [∇²K(X̄)]∆X = 0;
the unknowns here are ∆X, ∆S and ∆x. To process the system, we eliminate ∆X via (a′) and
multiply both sides of (c) by A∗, thus getting the equation
Note that A∗[S̄ + ∆S] = c is the objective of (CP) (indeed, S̄ ∈ L⊥ + C, i.e., A∗S̄ = c, while
A∗∆S = 0 by (b′)). Consequently, (4.5.8) becomes the primal Newton system
Solving this system (which is possible – it is easily seen that the n × n matrix H is positive
definite), we get ∆x and then set
∆X = A∆x,
∆S = −t+^{−1}[∇K(X̄) + [∇²K(X̄)]∆X] − S̄,                             (4.5.10)
thus getting a solution to (4.5.7). Restricting ourselves to the stepsizes α = β = 1 (see
(4.5.6)), we come to the “closed form” description of the method:
(a) t → t+ > t,
(b) x → x+ = x + ∆x,  ∆x = −[A∗(∇²K(X))A]^{−1}[t+c + A∗∇K(X)],      (4.5.11)
(c) S → S+ = −t+^{−1}[∇K(X) + [∇²K(X)]A∆x],
where x is the current iterate in the space Rn of design variables and X = Ax − B is its image
in the space E.
The resulting scheme admits a quite natural explanation. Consider the function
K̂(x) = K(Ax − B)
(see (4.4.2)); you can immediately verify that this function is a barrier for the feasible set of (CP). Let also
K̂_t(x) = tc^T x + K̂(x)
be the associated barrier-generated family of penalized objectives. Relation (4.5.11.b) says that
the iterates in the space of design variables are updated according to
x+ = x − [∇²K̂_{t+}(x)]^{−1}∇K̂_{t+}(x),
i.e., the process in the space of design variables is exactly the process (4.2.1) from Section 4.2.3.
Note that (4.5.11) is, essentially, a purely primal process (this is where the name of the
method comes from). Indeed, the dual iterates S, S+ just do not appear in formulas for x+ , X+ ,
and in fact the dual solutions are no more than “shadows” of the primal ones.
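Here is a minimal end-to-end sketch of the short-step primal path-following updates (4.5.11) with the penalty policy (4.5.12) on a toy SDP. The data A_i, B and the trick of choosing c so that the starting point lies exactly on the central path at t = 1 are our assumptions, made only to sidestep the initialization issue:

```python
import numpy as np

rng = np.random.default_rng(5)
k, n = 4, 2
sym = lambda M: (M + M.T) / 2
As = [sym(rng.standard_normal((k, k))) for _ in range(n)]
Amap = lambda x: sum(xi * Ai for xi, Ai in zip(x, As))          # A x
Aadj = lambda W: np.array([np.trace(Ai @ W) for Ai in As])      # A* W
B = -np.eye(k)                                                  # so that X = I at x = 0

x, t, theta = np.zeros(n), 1.0, float(k)                        # θ(S_k) = k
X = Amap(x) - B
c = Aadj(np.linalg.inv(X)) / t     # puts the start exactly on the path: t c + A*∇K(X) = 0

for _ in range(200):
    t = (1 + 0.1 / np.sqrt(theta)) * t                          # penalty update (4.5.12), χ = 0.1
    X = Amap(x) - B
    Xi = np.linalg.inv(X)
    H = np.array([[np.trace(Ai @ Xi @ Aj @ Xi) for Aj in As] for Ai in As])
    dx = -np.linalg.solve(H, t * c - Aadj(Xi))                  # primal Newton step (4.5.11.b)
    S = (Xi - Xi @ Amap(dx) @ Xi) / t                           # (4.5.11.c); A*S = c automatically
    x = x + dx

X = Amap(x) - B
assert np.all(np.linalg.eigvalsh(X) > 0)                        # iterates stay strictly feasible
gap = np.trace(X @ S)
assert abs(gap - theta / t) < 0.2 * theta / t                   # duality gap ≈ θ(K)/t on the path
```

Note that A∗S = c holds at every step by construction, so the dual iterates are indeed feasible “shadows” of the primal ones, and the final duality gap matches θ(K)/t up to the accuracy of path tracing.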
Remark 4.5.1 When constructing the primal path-following method, we have started with the
augmented slackness equations in form (4.5.2). Needless to say, we could start our developments
with the same conditions written down in the “swapped” form
X + t−1 ∇K(S) = 0
as well, thus coming to what is called “dual path-following method”. Of course, as applied to a
given pair (P), (D), the dual path-following method differs from the primal one. However, the
constructions and results related to the dual path-following method require no special care – they
can be obtained from their “primal counterparts” just by swapping “primal” and “dual” entities.
The complexity analysis of the primal path-following method can be summarized in the following
Theorem 4.5.1 Let 0 < χ ≤ κ ≤ 0.1. Assume that we are given a starting point (t0 , x0 , S0 )
such that t0 > 0 and the point
(X0 = Ax0 − B, S0 )
is κ-close to Z∗ (t0 ):
dist((X0 , S0 ), Z∗ (t0 )) ≤ κ.
Starting with (t0 , x0 , X0 , S0 ), let us iterate process (4.5.11) equipped with the penalty updating
policy
t+ = (1 + χ/√θ(K)) t                                             (4.5.12)
i.e., let us build the iterates (ti , xi , Xi , Si ) according to
t_i = (1 + χ/√θ(K)) t_{i−1},
x_i = x_{i−1} + ∆x_i,  ∆x_i = −[A∗(∇²K(X_{i−1}))A]^{−1}[t_i c + A∗∇K(X_{i−1})],
X_i = Ax_i − B,
S_i = −t_i^{−1}[∇K(X_{i−1}) + [∇²K(X_{i−1})]A∆x_i].
The resulting process is well-defined and generates strictly primal-dual feasible pairs (Xi , Si ) such
that (ti , Xi , Si ) stay in the neighbourhood Nκ of the primal-dual central path.
4.5. TRACING THE CENTRAL PATH 167
The theorem says that with properly chosen κ, χ (e.g., κ = χ = 0.1) we can, getting once close to
the primal-dual central path, trace it by the primal path-following method, keeping the iterates
in Nκ -neighbourhood of the path and increasing the penalty parameter by an absolute constant
factor every O(√θ(K)) steps – exactly as it was claimed in sections 4.2.3, 4.5.2. This fact is
extremely important theoretically; in particular, it underlies the polynomial time complexity
bounds for LP, CQP and SDP from Section 4.6 below. As a practical tool, the primal and the
dual path-following methods, at least in their short-step form presented above, are not that
attractive. The computational power of the methods can be improved by passing to appropriate
large-step versions of the algorithms, but even these versions are thought to be inferior as
compared to “true” primal-dual path-following methods (those which “indeed work with both
(P) and (D)”, see below). There are, however, cases when the primal or the dual path-following
scheme seems to be unavoidable; these are, essentially, the situations where the pair (P), (D) is
“highly asymmetric”, e.g., (P) and (D) have design dimensions dim L, dim L⊥ differing by
orders of magnitude. Here it becomes too expensive computationally to treat (P), (D) in a “nearly
symmetric way”, and it is better to focus solely on the problem with smaller design dimension.
To get an impression of how the primal path-following method works, here is a picture:
What you see is the 2D feasible set of a toy SDP (K = S3+ ). “Continuous curve” is the primal central
path; dots are iterates xi of the algorithm. We cannot draw the dual solutions, since they “live” in 4-
dimensional space (dim L⊥ = dim S3 − dim L = 6 − 2 = 4)
(ii) Second, we choose somehow a new value t+ > t̄ of the penalty parameter and linearize
system (4.5.14) (with t set to t+ ) at the point (X̄, S̄), thus coming to the system of linear
equations
∂ Ḡt+ (X̄, S̄) ∂ Ḡt+ (X̄, S̄)
∆X + ∆S = −Ḡt+ (X̄, S̄), (4.5.15)
∂X ∂S
for the “corrections” (∆X, ∆S);
We add to (4.5.15) the system of linear equations on ∆X, ∆S expressing the requirement that
a shift of (X̄, S̄) in the direction (∆X, ∆S) should preserve the validity of the linear constraints
in (P), (D), i.e., the equations saying that ∆X ∈ L, ∆S ∈ L⊥ . These linear equations can be
written down as
∆X = A∆x [⇔ ∆X ∈ L]
(4.5.16)
A∗ ∆S = 0 [⇔ ∆S ∈ L⊥ ]
(iii) We solve the system of linear equations (4.5.15), (4.5.16), thus obtaining a primal-dual
search direction (∆X, ∆S), and update current iterates according to
X+ = X̄ + α∆X,  S+ = S̄ + β∆S
where the primal and the dual stepsizes α, β are given by certain “side requirements”.
The major “degree of freedom” of the construction comes from (i) – from how we construct
the system (4.5.14). A very popular way to handle (i), the way which indeed leads to primal-dual
4.5. TRACING THE CENTRAL PATH 169
methods, starts from rewriting (4.5.13) in a form symmetric w.r.t. X and S. To this end we
first observe that (4.5.13) is equivalent to every one of the following two matrix equations:
XS = t^{−1}I;  SX = t^{−1}I.
Adding these equations, we get
XS + SX = 2t^{−1}I,                                              (4.5.17)
which, by its origin, is a consequence of (4.5.13). On closer inspection, it turns out that
(4.5.17), regarded as a matrix equation with positive definite symmetric matrices, is equivalent
to (4.5.13). It is possible to use in the role of (4.5.14) the matrix equation (4.5.17) as it is;
this policy leads to the so called AHO (Alizadeh-Overton-Haeberly) search direction and the
“XS + SX” primal-dual path-following method.
It is also possible to use a “scaled” version of (4.5.17). Namely, let us choose somehow a
positive definite scaling matrix Q and observe that our original matrix equation (4.5.13) says
that S = t−1 X −1 , which is exactly the same as to say that Q−1 SQ−1 = t−1 (QXQ)−1 ; the latter,
in turn, is equivalent to every one of the matrix equations
(QXQ)(Q^{−1}SQ^{−1}) = t^{−1}I;  (Q^{−1}SQ^{−1})(QXQ) = t^{−1}I.
A natural policy is to choose Q in such a way that the matrices QX̄Q and Q^{−1}S̄Q^{−1}
commute (X̄, S̄ are the iterates to be updated); we call such a policy a “commutative scaling”.
Popular commutative scalings are:
P =? P^T
⇕
X^{−1/2}(X^{1/2}SX^{1/2})^{−1/2}X^{1/2}S =? SX^{1/2}(X^{1/2}SX^{1/2})^{−1/2}X^{−1/2}
⇕
X^{−1/2}(X^{1/2}SX^{1/2})^{−1/2}X^{1/2}S · X^{1/2}(X^{1/2}SX^{1/2})^{1/2}X^{−1/2}S^{−1} =? I
⇕
X^{−1/2}(X^{1/2}SX^{1/2})^{−1/2}(X^{1/2}SX^{1/2})(X^{1/2}SX^{1/2})^{1/2}X^{−1/2}S^{−1} =? I
⇕
X^{−1/2}(X^{1/2}SX^{1/2})X^{−1/2}S^{−1} =? I
and the argument of the concluding σ(·) clearly is a positive definite symmetric matrix.
Thus, the spectrum of symmetric matrix P is positive, i.e., P is positive definite.
(b): To verify that QXQ = Q−1 SQ−1 , i.e., that P 1/2 XP 1/2 = P −1/2 SP −1/2 , is the
same as to verify that P XP = S. The latter equality is given by the following compu-
tation:
P XP = [X^{−1/2}(X^{1/2}SX^{1/2})^{−1/2}X^{1/2}S] X [X^{−1/2}(X^{1/2}SX^{1/2})^{−1/2}X^{1/2}S]
     = X^{−1/2}(X^{1/2}SX^{1/2})^{−1/2}(X^{1/2}SX^{1/2})(X^{1/2}SX^{1/2})^{−1/2}X^{1/2}S
     = X^{−1/2}X^{1/2}S
     = S.
You should not think that Nesterov and Todd guessed the formula for this scaling ma-
trix. They did much more: they have developed an extremely deep theory (covering the
general LP-CQP-SDP case, not just the SDP one!) which, among other things, guar-
antees that the desired scaling matrix exists (and even is unique). After the existence
is established, it becomes much easier (although still not that easy) to find an explicit
formula for Q.
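The properties just established are easy to verify numerically. In the sketch below (the random positive definite X, S are our assumptions) we form P, check its symmetry and positive definiteness, check P XP = S, and confirm that Q = P^{1/2} indeed makes QXQ and Q^{−1}SQ^{−1} equal (and hence commuting):

```python
import numpy as np

def psd_sqrt(A):
    # symmetric square root of a positive definite matrix
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(w)) @ V.T

rng = np.random.default_rng(6)
k = 4
mk = lambda: (lambda M: M @ M.T + np.eye(k))(rng.standard_normal((k, k)))
X, S = mk(), mk()

Xh = psd_sqrt(X)
Xhi = np.linalg.inv(Xh)
mid = np.linalg.inv(psd_sqrt(Xh @ S @ Xh))            # (X^{1/2} S X^{1/2})^{-1/2}
P = Xhi @ mid @ Xh @ S                                # P = X^{-1/2}(X^{1/2}SX^{1/2})^{-1/2}X^{1/2}S

assert np.allclose(P, P.T)                            # P is symmetric ...
assert np.all(np.linalg.eigvalsh((P + P.T) / 2) > 0)  # ... and positive definite
assert np.allclose(P @ X @ P, S)                      # P X P = S

Q = psd_sqrt((P + P.T) / 2)                           # Nesterov-Todd scaling matrix Q = P^{1/2}
Qi = np.linalg.inv(Q)
assert np.allclose(Q @ X @ Q, Qi @ S @ Qi)            # QXQ = Q^{-1}SQ^{-1}
```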
Complexity analysis
We are about to carry out the complexity analysis of the primal-dual path-following methods based
on “commutative” Zhang’s scalings. This analysis, although not that difficult, is more technical than
whatever else in our course, and a non-interested reader may skip it without any harm.
Scalings. We already have mentioned what a scaling of Sk+ is: this is the linear one-to-one transfor-
mation of Sk given by the formula
H → QHQT , (Scl)
4.5. TRACING THE CENTRAL PATH 171
where Q is a nonsingular scaling matrix. It is immediately seen that (Scl) is a symmetry of the semidefinite
cone Sk+ – it maps the cone onto itself. This family of symmetries is quite rich: for every pair of points
A, B from the interior of the cone, there exists a scaling which maps A onto B, e.g., the scaling
H → (B^{1/2}A^{−1/2}) H (A^{−1/2}B^{1/2})
(here the scaling matrix is Q = B^{1/2}A^{−1/2}, and A^{−1/2}B^{1/2} = Q^T).
It is essentially the existence of this rich family of symmetries of the underlying cones that
makes SDP (same as LP and CQP, where the cones also are “perfectly symmetric”) especially well suited
for IP methods.
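A quick numerical confirmation (the random positive definite A, B, H are assumptions): the scaling with Q = B^{1/2}A^{−1/2} maps A onto B, and every scaling H ↦ QHQ^T maps the interior of the semidefinite cone into itself:

```python
import numpy as np

def psd_sqrt(M):
    # symmetric square root of a positive definite matrix
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

rng = np.random.default_rng(7)
k = 4
mk = lambda: (lambda M: M @ M.T + np.eye(k))(rng.standard_normal((k, k)))
A, B = mk(), mk()

Q = psd_sqrt(B) @ np.linalg.inv(psd_sqrt(A))     # Q = B^{1/2} A^{-1/2}
assert np.allclose(Q @ A @ Q.T, B)               # the scaling maps A onto B

H = mk()                                         # any point of int S^k_+
img = Q @ H @ Q.T
assert np.all(np.linalg.eigvalsh((img + img.T) / 2) > 0)   # the image is again ≻ 0
```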
In what follows we will be interested in scalings associated with positive definite scaling matrices.
The scaling given by such a matrix Q (X, S, ...) will be denoted by Q[·] (resp., X[·], S[·], ...):
Q[H] = QHQ.
Given a problem of interest (CP) (where K = S^k_+) and a scaling matrix Q ≻ 0, we can scale the problem,
i.e., pass from it to the problem
min_x { c^T x : Q[Ax − B] ⪰ 0 }                                  (Q(CP))
which, of course, is equivalent to (CP) (since Q[H] is positive semidefinite iff H is so). In terms of
“geometric reformulation” (P) of (CP), this transformation is nothing but the substitution of variables
QXQ = Y ⇔ X = Q^{−1}YQ^{−1};
with respect to Y-variables, (P) is the problem
min_Y { Tr(C[Q^{−1}YQ^{−1}]) : Y ∈ Q(L) − Q[B], Y ⪰ 0 },
thus, Z is orthogonal to every matrix from Q(L), i.e., to every matrix of the form QXQ with X ∈ L, iff the
matrix QZQ is orthogonal to every matrix from L, i.e., iff QZQ ∈ L⊥. It follows that
(Q(L))⊥ = Q^{−1}(L⊥).
Thus, when acting on the primal-dual pair (P), (D) of SDP’s, a scaling, given by a matrix Q ≻ 0, converts
it into another primal-dual pair of problems, and this new pair is as follows:
• The “primal” geometric data – the subspace L and the primal shift B (which has a part-time job
to be the dual objective as well) – are replaced with their images under the mapping Q;
• The “dual” geometric data – the subspace L⊥ and the dual shift C (it is the primal objective as
well) – are replaced with their images under the mapping Q−1 inverse to Q; this inverse mapping again
is a scaling, the scaling matrix being Q−1 .
We see that it makes sense to speak about primal-dual scaling which acts on both the primal and
the dual variables and maps a primal variable X onto QXQ, and a dual variable S onto Q−1 SQ−1 .
Formally speaking, the primal-dual scaling associated with a matrix Q ≻ 0 is the linear transformation
(X, S) → (QXQ, Q−1 SQ−1 ) of the direct product of two copies of Sk (the “primal” and the “dual” ones).
A primal-dual scaling acts naturally on different entities associated with a primal-dual pair (P), (D),
particular, at:
172 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
• the pair (P), (D) itself – it is converted into another primal-dual pair of problems (P̂), (D̂);
• a primal-dual feasible pair (X, S) of solutions to (P), (D) – it is converted to the pair (X̂ = QXQ, Ŝ = Q^{−1}SQ^{−1}), which, as is immediately seen, is a pair of feasible solutions to (P̂), (D̂).
Note that the primal-dual scaling preserves strict feasibility and the duality gap:

DualityGapP,D(X, S) = Tr(XS) = Tr(QXSQ^{−1}) = Tr(X̂Ŝ) = DualityGapP̂,D̂(X̂, Ŝ);
• the primal-dual central path (X∗(·), S∗(·)) of (P), (D); it is converted into the curve (X̂∗(t) =
QX∗(t)Q, Ŝ∗(t) = Q^{−1}S∗(t)Q^{−1}), which is nothing but the primal-dual central path Ẑ(t) of the
primal-dual pair (P̂), (D̂).
The latter fact can be easily derived from the characterization of the primal-dual central path; a
more instructive derivation is based on the fact that our “hero” – the barrier Sk(·) – is “semi-invariant”
w.r.t. scaling:

Sk(Q(X)) = − ln Det(QXQ) = − ln Det(X) − 2 ln Det(Q) = Sk(X) + const(Q).
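This semi-invariance is easy to check numerically. The following sketch (numpy; the random generation of X and Q is purely illustrative) verifies that −ln Det(QXQ) equals −ln Det(X) plus a constant depending on Q only:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5

def rand_spd(k):
    # a random symmetric positive definite k x k matrix
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

X = rand_spd(k)
Q = rand_spd(k)

def barrier(X):
    # S_k(X) = -ln Det(X), computed stably via slogdet
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0
    return -logdet

lhs = barrier(Q @ X @ Q)
rhs = barrier(X) + 2 * barrier(Q)   # -ln Det X - 2 ln Det Q
assert abs(lhs - rhs) < 1e-8
```

slogdet is used instead of det so that the check remains stable for larger k.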
Now, a point on the primal central path of the problem (P̂) associated with penalty parameter t,
let this point be temporarily denoted by Y(t), is the unique minimizer of the aggregate

Skt(Y) = t⟨Q^{−1}CQ^{−1}, Y⟩_F + Sk(Y) ≡ t Tr(Q^{−1}CQ^{−1}Y) + Sk(Y)

over the set of strictly feasible solutions of (P̂). The latter set is exactly the image of the set of
strictly feasible solutions of (P) under the transformation Q, so that Y(t) is the image, under the
same transformation, of the point, let it be called X(t), which minimizes the aggregate

Skt(QXQ) = t Tr((Q^{−1}CQ^{−1})(QXQ)) + Sk(QXQ) = t Tr(CX) + Sk(X) + const(Q)

over the set of strictly feasible solutions to (P). We see that X(t) is exactly the point X∗(t) on
the primal central path associated with problem (P). Thus, the point Y(t) of the primal central
path associated with (P̂) is nothing but X̂∗(t) = QX∗(t)Q. Similarly, the point of the central path
associated with the problem (D̂) is exactly Ŝ∗(t) = Q^{−1}S∗(t)Q^{−1}.
• the neighbourhood Nκ of the primal-dual central path Z(·) associated with the pair of problems
(P), (D) (see (4.4.8)). As you can guess, the image of Nκ is exactly the neighbourhood N̂κ, given
by (4.4.8), of the primal-dual central path Ẑ(·) of (P̂), (D̂).
The latter fact is immediate: for a pair (X, S) of strictly feasible primal and dual solutions to (P),
(D) and a t > 0 we have (see (4.4.6)):

dist²((X̂, Ŝ), Ẑ∗(t)) = Tr([QXQ](tQ^{−1}SQ^{−1} − [QXQ]^{−1})[QXQ](tQ^{−1}SQ^{−1} − [QXQ]^{−1}))
 = Tr(QX(tS − X^{−1})X(tS − X^{−1})Q^{−1})
 = Tr(X(tS − X^{−1})X(tS − X^{−1}))
 = dist²((X, S), Z∗(t)).
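Both invariances just established – of the duality gap and of the proximity measure – are easy to sanity-check numerically; the sketch below (numpy; random data, all names illustrative) verifies them for a randomly drawn triple X, S, Q:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4

def rand_spd(k):
    # a random symmetric positive definite k x k matrix
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

X, S, Q = rand_spd(k), rand_spd(k), rand_spd(k)
Qinv = np.linalg.inv(Q)
t = 0.7

Xh = Q @ X @ Q          # scaled primal solution
Sh = Qinv @ S @ Qinv    # scaled dual solution

# the duality gap is preserved by the scaling
assert np.isclose(np.trace(X @ S), np.trace(Xh @ Sh))

# dist^2((X,S),Z*(t)) = Tr( X(tS - X^{-1}) X (tS - X^{-1}) )
def dist2(X, S, t):
    D = t * S - np.linalg.inv(X)
    return np.trace(X @ D @ X @ D)

# the proximity measure is preserved as well
assert np.isclose(dist2(X, S, t), dist2(Xh, Sh, t))
```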
Proof of Theorem 4.5.2. 1⁰. Observe first (this observation is crucial!) that it suffices to prove our
Theorem in the particular case when X̄, S̄ commute with each other and Q = I. Indeed, it is immediately
seen that the updating (U) can be represented as follows:
1. We first scale by Q the “input data” of (U) – the primal-dual pair of problems (P), (D) and the
strictly feasible pair X̄, S̄ of primal and dual solutions to these problems, as explained in sect.
“Scaling”. Note that the resulting entities – a pair of primal-dual problems and a strictly feasible
pair of primal-dual solutions to these problems – are linked with each other exactly in the same
fashion as the original entities, due to scaling invariance of the duality gap and the neighbourhood
Nκ . In addition, the scaled primal and dual solutions commute;
2. We apply to the “scaled input data” yielded by the previous step an updating completely
similar to (U), but using the unit matrix in the role of Q;
3. We “scale back” the result of the previous step, i.e., subject this result to the scaling associated
with Q^{−1}, thus obtaining the updated iterate (X+, S+).
Given that the second step of this procedure preserves primal-dual strict feasibility, w.r.t. the scaled
primal-dual pair of problems, of the iterate and keeps the iterate in the κ-neighbourhood Nκ of the
corresponding central path, we could use once again the “scaling invariance” reasoning to assert that the
result (X+, S+) of (U) is well-defined, is strictly feasible for (P), (D) and is close to the original central
path, as claimed in the Theorem. Thus, all we need is to justify the above “Given”, and this is exactly
the same as to prove the theorem in the particular case of Q = I and commuting X̄, S̄. In the rest of
the proof we assume that Q = I and that the matrices X̄, S̄ commute with each other. Due to the latter
property, X̄, S̄ are diagonal in a properly chosen orthonormal basis; representing all matrices from Sk in
this basis, we can reduce the situation to the case when X̄ and S̄ are diagonal. Thus, we may (and do)
assume in the sequel that X̄ and S̄ are diagonal, with diagonal entries xi ,si , i = 1, ..., k, respectively, and
that Q = I. Finally, to simplify notation, we write t, X, S instead of t̄, X̄, S̄, respectively.
2⁰. Our situation and goals now are as follows. We are given two affine planes L − B, L⊥ + C in Sk,
orthogonal to each other, and two positive definite diagonal matrices X = Diag({xi}) ∈ L − B, S =
Diag({si}) ∈ L⊥ + C. We set

µ = 1/t = Tr(XS)/k
and know that

‖tX^{1/2}SX^{1/2} − I‖₂ ≤ κ.
We further set

µ+ = 1/t+ = (1 − χk^{−1/2})µ    (4.5.26)
and consider the system of equations w.r.t. unknown symmetric matrices ∆X, ∆S:
(a) ∆X ∈ L
(b) ∆S ∈ L⊥ (4.5.27)
(c) ∆XS + X∆S + ∆SX + S∆X = 2µ+ I − 2XS
We should prove that the system has a unique solution such that the matrices
X+ = X + ∆X, S+ = S + ∆S
are
(i) positive definite,
(ii) belong, respectively, to L − B, L⊥ + C and satisfy the relation

Tr(X+S+) = µ+k;    (4.5.28)

(iii) satisfy ‖µ+^{−1} X+^{1/2} S+ X+^{1/2} − I‖₂ ≤ κ.
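Since X and S are diagonal, equation (4.5.27.c) decouples entrywise: its (i, j)-th entry reads (si + sj)(∆X)ij + (xi + xj)(∆S)ij = 2(µ+ − xi si)δij, which is what makes the subsequent entrywise analysis possible. A quick numerical check of this decoupling (numpy; random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 5
x = rng.uniform(0.5, 2.0, k)      # diagonal entries of X
s = rng.uniform(0.5, 2.0, k)      # diagonal entries of S
X, S = np.diag(x), np.diag(s)

# arbitrary symmetric trial directions dX, dS
dX = rng.standard_normal((k, k)); dX = (dX + dX.T) / 2
dS = rng.standard_normal((k, k)); dS = (dS + dS.T) / 2

lhs = dX @ S + X @ dS + dS @ X + S @ dX
# entrywise form of the left hand side of (4.5.27.c)
entrywise = (s[:, None] + s[None, :]) * dX + (x[:, None] + x[None, :]) * dS
assert np.allclose(lhs, entrywise)
```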
Observe that the situation can be reduced to the one with µ = 1. Indeed, let us pass from the matrices
X, S, ∆X, ∆S, X+, S+ to X, S̃ = µ^{−1}S, ∆X, ∆S̃ = µ^{−1}∆S, X+, S̃+ = µ^{−1}S+. Now the “we are given”
part of our situation becomes as follows: we are given two diagonal positive definite matrices X, S̃ such
that X ∈ L − B, S̃ ∈ L⊥ + C̃, C̃ = µ^{−1}C,

Tr(XS̃) = k

and

‖X^{1/2}S̃X^{1/2} − I‖₂ = ‖µ^{−1}X^{1/2}SX^{1/2} − I‖₂ ≤ κ.
The “we should prove” part becomes: to verify that the system of equations

(a) ∆X ∈ L
(b) ∆S̃ ∈ L⊥
(c) ∆X S̃ + X ∆S̃ + ∆S̃ X + S̃ ∆X = 2(1 − χk^{−1/2})I − 2XS̃

has a unique solution and that the matrices X+ = X + ∆X, S̃+ = S̃ + ∆S̃ are positive definite, are
contained in L − B, respectively L⊥ + C̃, and satisfy the relations

k^{−1}Tr(X+S̃+) = µ+/µ = 1 − χk^{−1/2}

and

‖(1 − χk^{−1/2})^{−1} X+^{1/2} S̃+ X+^{1/2} − I‖₂ ≤ κ.
Thus, the general situation indeed can be reduced to the one with µ = 1, µ+ = 1 − χk^{−1/2}, and we lose
nothing assuming, in addition to what was already postulated, that

µ ≡ t^{−1} ≡ Tr(XS)/k = 1, µ+ = 1 − χk^{−1/2},

whence

[Tr(XS) =] ∑_{i=1}^{k} xi si = k    (4.5.30)

and

[‖tX^{1/2}SX^{1/2} − I‖₂² ≡] ∑_{i=1}^{k} (xi si − 1)² ≤ κ².    (4.5.31)
3⁰. We start with proving that (4.5.27) indeed has a unique solution. It is convenient to pass in
(4.5.27) from the unknowns ∆X, ∆S to the unknowns
Multiplying (4.5.36.a) by (δS)ij and taking sum over i, j, we get, in view of (4.5.37), the relation

∑_{i,j} (ψij/φij)(δS)²ij = 2 ∑_{i} ((µ+ − xi si)/φii)(δS)ii;    (4.5.38)
Now let

θi = xi si,    (4.5.40)

so that in view of (4.5.30) and (4.5.31) one has

(a) ∑_{i} θi = k,
(b) ∑_{i} (θi − 1)² ≤ κ².    (4.5.41)
4.5. TRACING THE CENTRAL PATH 177
Observe that

φij = √(xi xj)(si + sj) = √(xi xj)(θi/xi + θj/xj) = θj √(xi/xj) + θi √(xj/xi).

Thus,

φij = θj √(xi/xj) + θi √(xj/xi),
ψij = √(xi/xj) + √(xj/xi);    (4.5.42)
We now have

(1 − κ) ∑_{i,j} (δX)²ij ≤ ∑_{i,j} (φij/ψij)(δX)²ij
  [see (4.5.43)]
 ≤ 2 ∑_{i} ((µ+ − xi si)/ψii)(δX)ii
  [see (4.5.39)]
 ≤ 2 [∑_{i} (µ+ − xi si)² ψii^{−2}]^{1/2} [∑_{i} (δX)²ii]^{1/2}
 ≤ [∑_{i} ((1 − θi)² − 2χk^{−1/2}(1 − θi) + χ²k^{−1})]^{1/2} [∑_{i,j} (δX)²ij]^{1/2}
  [see (4.5.44)]
 ≤ [χ² + ∑_{i} (1 − θi)²]^{1/2} [∑_{i,j} (δX)²ij]^{1/2}
  [since ∑_{i} (1 − θi) = 0 by (4.5.41.a)]
 ≤ [χ² + κ²]^{1/2} [∑_{i,j} (δX)²ij]^{1/2}
  [see (4.5.41.b)]

and from the resulting inequality it follows that

‖δX‖₂ ≤ ρ ≡ √(χ² + κ²)/(1 − κ).    (4.5.45)
Similarly,

(1 + κ)^{−1} ∑_{i,j} (δS)²ij ≤ ∑_{i,j} (ψij/φij)(δS)²ij
  [see (4.5.43)]
 ≤ 2 ∑_{i} ((µ+ − xi si)/φii)(δS)ii
  [see (4.5.38)]
 ≤ 2 [∑_{i} (µ+ − xi si)² φii^{−2}]^{1/2} [∑_{i} (δS)²ii]^{1/2}
 ≤ (1 − κ)^{−1} [∑_{i} (µ+ − θi)²]^{1/2} [∑_{i,j} (δS)²ij]^{1/2}
  [see (4.5.44)]
 ≤ (1 − κ)^{−1} [χ² + κ²]^{1/2} [∑_{i,j} (δS)²ij]^{1/2}
  [same as above]

whence

‖δS‖₂ ≤ (1 + κ)√(χ² + κ²)/(1 − κ) = (1 + κ)ρ.    (4.5.46)
5⁰. We are ready to prove 2⁰.(i-ii). We have

X+ = X + ∆X = X^{1/2}(I + δX)X^{1/2},

and the matrix I + δX is positive definite due to (4.5.45) (indeed, the right hand side ρ in (4.5.45)
satisfies ρ < 1, whence the Frobenius norm (and therefore the maximum of moduli of eigenvalues) of δX
is less than 1). Note that by the just indicated reasons I + δX ⪯ (1 + ρ)I, whence

X+ ⪯ (1 + ρ)X.    (4.5.47)

2⁰.(ii) is proved.
6⁰. It remains to verify 2⁰.(iii). We should bound from above the quantity

Ω = ‖µ+^{−1} X+^{1/2} S+ X+^{1/2} − I‖₂ = ‖X+^{1/2}(µ+^{−1}S+ − X+^{−1})X+^{1/2}‖₂,

and our plan is first to bound from above the “close” quantity

Ω̃ = ‖X^{1/2}(µ+^{−1}S+ − X+^{−1})X^{1/2}‖₂ = µ+^{−1}‖Z‖₂,
Z = X^{1/2}(S+ − µ+X+^{−1})X^{1/2}.    (4.5.48)
We have

Z = X^{1/2}S+X^{1/2} − µ+X^{1/2}X+^{−1}X^{1/2}
 = XS + δS − µ+(I + δX)^{−1}
  [see (4.5.32)]
 = XS + δS − µ+(I − δX) − µ+[(I + δX)^{−1} − I + δX]
 = [XS + δS + δX − µ+I] + [(µ+ − 1)δX] + [µ+(I − δX − (I + δX)^{−1})]
   (call the three groups Z1, Z2, Z3, respectively),

so that

‖Z‖₂ ≤ ‖Z1‖₂ + ‖Z2‖₂ + ‖Z3‖₂.    (4.5.49)
We are about to bound separately all 3 terms in the right hand side of the latter inequality.
Bounding ‖Z1‖₂: We have

(Z1)ij = (XS)ij + (δS)ij + (δX)ij − µ+δij
 = (δX)ij + (δS)ij + (xi si − µ+)δij
 = (δX)ij (1 − φij/ψij) + [2(µ+ − xi si)/ψii + xi si − µ+] δij
  [we have used (4.5.36.b)]
 = (δX)ij (1 − φij/ψij)
  [since ψii = 2, see (4.5.42)]

so that

‖Z1‖₂ ≤ (κ/(1 − κ)) ‖δX‖₂ ≤ (κ/(1 − κ)) ρ    (4.5.52)
whence, by (4.5.48),

Ω̃ ≤ (ρ/(1 − χk^{−1/2})) (χ/√k + ρ/(1 − ρ) + κ/(1 − κ)).    (4.5.53)
On the other hand,

Ω² = ‖µ+^{−1} X+^{1/2} S+ X+^{1/2} − I‖₂²
 = ‖X+^{1/2} Θ X+^{1/2}‖₂²   [Θ := µ+^{−1}S+ − X+^{−1}; Θ = Θᵀ]
 = Tr(X+^{1/2} Θ X+ Θ X+^{1/2})
 ≤ (1 + ρ) Tr(X+^{1/2} Θ X Θ X+^{1/2})
  [see (4.5.47)]
 = (1 + ρ) Tr(X+^{1/2} Θ X^{1/2} · X^{1/2} Θ X+^{1/2})
 = (1 + ρ) Tr(X^{1/2} Θ X+^{1/2} · X+^{1/2} Θ X^{1/2})
 = (1 + ρ) Tr(X^{1/2} Θ X+ Θ X^{1/2})
 ≤ (1 + ρ)² Tr(X^{1/2} Θ X Θ X^{1/2})
  [the same (4.5.47)]
 = (1 + ρ)² ‖X^{1/2} Θ X^{1/2}‖₂²
 = (1 + ρ)² ‖X^{1/2} [µ+^{−1}S+ − X+^{−1}] X^{1/2}‖₂²
 = (1 + ρ)² Ω̃²
  [see (4.5.48)]

so that

Ω ≤ (1 + ρ)Ω̃ ≤ (ρ(1 + ρ)/(1 − χk^{−1/2})) (χ/√k + ρ/(1 − ρ) + κ/(1 − κ)),
ρ = √(χ² + κ²)/(1 − κ).    (4.5.54)
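As a numerical sanity check of (4.5.54): with κ = χ = 0.1 (the standard short-step setup), the right hand side stays below κ for every k in a wide range, which is exactly what is needed to conclude that the updated iterate again belongs to the κ-neighbourhood. A sketch of the spot check:

```python
import math

kappa = chi = 0.1
rho = math.sqrt(chi**2 + kappa**2) / (1 - kappa)
for k in (1, 10, 100, 10**4, 10**6):
    bound = (rho * (1 + rho) / (1 - chi / math.sqrt(k))) * (
        chi / math.sqrt(k) + rho / (1 - rho) + kappa / (1 - kappa)
    )
    # the bound on Omega never exceeds kappa, so (X+, S+) stays in N_kappa
    assert bound <= kappa
```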
Remark 4.5.2 We have carried out the complexity analysis for a large group of primal-dual
path-following methods for SDP (i.e., for the case of K = Sk+). In fact, the constructions and
the analysis we have presented can be extended word by word to the case when K is a direct
product of semidefinite cones – you should just bear in mind that all symmetric matrices we
deal with, like the primal and the dual solutions X, S, the scaling matrices Q, the primal-dual
search directions ∆X, ∆S, etc., are block-diagonal with common block-diagonal structure. In
particular, our constructions and analysis work for the case of LP – this is the case when K
is a direct product of one-dimensional semidefinite cones. Note that in the case of LP Zhang’s
family of primal-dual search directions reduces to a single direction: since now X, S, Q are
diagonal matrices, the scaling (4.5.17) → (4.5.18) does not vary the equations of augmented
complementary slackness.
The recipe to translate all we have presented for the case of SDP to the case of LP is very
simple: in the above text, you should assume all matrices like X, S, ... to be diagonal and look
at what the operations with these matrices, required by the description of the method, do to their
diagonals. By the way, one of the very first approaches to the design and the analysis of IP
methods for SDP was exactly opposite: you take an IP scheme for LP, replace in its description
the words “nonnegative vectors” with “positive semidefinite diagonal matrices” and then erase
the adjective “diagonal”.
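To make the recipe concrete, here is a minimal numpy sketch of a short-step primal-dual path-following method for a standard-form LP min{cᵀx : Ax = b, x ≥ 0}: all the “matrices” X, S, ∆X, ∆S are diagonal and are stored as vectors, and the Newton system is solved via the normal equations. The instance is generated so that the starting point lies exactly on the central path at µ = 1; the random data, the rate χ = 0.1 and the iteration count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 6
A = rng.standard_normal((m, n))

# start exactly on the central path at mu = 1:
x = np.ones(n); s = np.ones(n); y = np.zeros(m)
b = A @ x
c = A.T @ y + s
mu = 1.0
chi = 0.1

for _ in range(80):
    mu *= 1 - chi / np.sqrt(n)           # short-step penalty update
    r = mu - x * s                       # target: x_i s_i = mu_+
    M = A @ ((x / s)[:, None] * A.T)     # normal equations matrix A X S^{-1} A^T
    dy = np.linalg.solve(M, -A @ (r / s))
    ds = -A.T @ dy                       # keeps A^T y + s = c
    dx = (r - x * ds) / s                # S dx + X ds = r, and A dx = 0
    x += dx; y += dy; s += ds
    assert (x > 0).all() and (s > 0).all()

# feasibility is preserved and the duality gap x's is driven to n*mu
assert np.allclose(A @ x, b)
assert np.allclose(A.T @ y + s, c)
assert x @ s < 0.05 * n
```

Note how the augmented complementary slackness equation collapses to the single diagonal relation s∘∆x + x∘∆s = µ+1 − x∘s, in line with the remark that Zhang’s family reduces to one direction for LP.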
4.6. COMPLEXITY BOUNDS FOR LP, CQP, SDP 181
4.6.1 Complexity of LPb
Family of problems:
Data:
Data(p) = [m; n; c; a1 , b1 ; ...; am , bm ; R],
Size(p) ≡ dim Data(p) = (m + 1)(n + 1) + 2.
Data:
Data(P ) = [m; n; k1 , ..., km ; c; A1 , b1 , c1 , d1 ; ...; Am , bm , cm , dm ; R],
Size(p) ≡ dim Data(p) = (m + ∑_{i=1}^{m} ki)(n + 1) + m + n + 3.
where Aj, j = 0, 1, ..., n, are symmetric block-diagonal matrices with m diagonal blocks Aj^{(i)} of
sizes ki × ki, i = 1, ..., m.
Data:
Data(p) = [m; n; k1, ..., km; c; A0^{(1)}, ..., A0^{(m)}; ...; An^{(1)}, ..., An^{(m)}; R],
Size(p) ≡ dim Data(p) = ∑_{i=1}^{m} (ki(ki + 1)/2)(n + 1) + m + n + 3.
i=1
‖x‖₂ ≤ R,
A0 + ∑_{j=1}^{n} xj Aj ⪰ −εI,
cᵀx ≤ Opt(p) + ε.
ComplNwt(p, ε) = O(1)(1 + ∑_{i=1}^{m} ki)^{1/2} Digits(p, ε),

Compl(p, ε) = O(1)(1 + ∑_{i=1}^{m} ki)^{1/2} n(n² + n ∑_{i=1}^{m} ki² + ∑_{i=1}^{m} ki³) Digits(p, ε).
The influence of the size of an SDP/CQP program on the complexity of its solving by an IP
method is twofold:
– first, the size affects the Newton complexity of the process. Theoretically, the number of
steps required to reduce the duality gap by a constant factor, say, factor 2, is proportional to
√θ(K) (θ(K) is twice the total # of conic quadratic inequalities for CQP and the total row
size of the LMI’s for SDP). Thus, we could expect an unpleasant growth of the iteration count with
θ(K). Fortunately, the iteration count for good IP methods usually is much less than the one
given by the worst-case complexity analysis and is typically a few tens, independently of
θ(K).
– second, the larger the instance, the larger is the system of linear equations one should
solve to generate a new primal (or primal-dual) search direction, and, consequently, the larger is
the computational effort per step (this effort is dominated by the necessity to assemble and to
solve the linear system). Now, the system to be solved depends, of course, on the IP
method we are speaking about, but it never is simpler (and for most of the methods, is not
much more complicated either) than the system (4.5.8) arising in the primal path-following method.
The size n of this system is exactly the design dimension of problem (CP).
In order to process (Nwt), one should assemble the system (compute H and h) and then solve
it. Whatever is the cost of assembling (Nwt), you should be able to store the resulting matrix H
in memory and to factorize the matrix in order to get the solution. Both these problems – storing
and factorizing H – become prohibitively expensive when H is a large dense15) matrix. (Think
how happy you will be with the necessity to store 5000·5001/2 = 12,502,500 reals representing a
dense 5000 × 5000 symmetric matrix H and with the necessity to perform ≈ 5000³/6 ≈ 2.08 × 10¹⁰
arithmetic operations to find its Choleski factor.)
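The arithmetic behind this example is easy to reproduce (n(n + 1)/2 reals for a symmetric n × n matrix, ≈ n³/6 operations for its Choleski factor):

```python
n = 5000
reals_to_store = n * (n + 1) // 2   # entries of a dense symmetric n x n matrix
cholesky_flops = n**3 / 6           # leading term of the Choleski factorization cost
assert reals_to_store == 12_502_500
assert round(cholesky_flops / 1e10, 2) == 2.08
```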
The necessity to assemble and to solve large-scale systems of linear equations is intrinsic
for IP methods as applied to large-scale optimization programs, and in this respect there is no
difference between LP and CQP, on one hand, and SDP, on the other hand. The difference is
in how difficult it is to handle these large-scale linear systems. In real life LP’s-CQP’s-SDP’s, the
structure of the data allows one to assemble (Nwt) at a cost negligibly small as compared to the
cost of factorizing H, which is good news. More good news: in typical real world
LP’s, and to some extent in real-world CQP’s, H turns out to be “very well-structured”, which
dramatically reduces the expense of factorizing the matrix and storing the Choleski
factor. All practical IP solvers for LP and CQP utilize these favourable properties of real life
problems, and this is where their ability to solve problems with tens or hundreds of thousands of
variables and constraints comes from. Spoil the structure of the problem – and an IP method
will be unable to solve an LP with just a few thousand variables. Now, in contrast to real life
LP’s and CQP’s, real life SDP’s typically result in dense matrices H, and this is where severe
limitations on the sizes of “tractable in practice” SDP’s come from. In this respect, real life
CQP’s are somewhere in-between LP’s and SDP’s, so that the sizes of “tractable in practice”
CQP’s can be significantly larger than in the case of SDP’s.
It should be mentioned that assembling matrices of the linear systems we are interested in
and solving these systems by the standard Linear Algebra techniques is not the only possible
way to implement an IP method. Another option is to solve these linear systems by iterative
methods. With this approach, all we need to solve a system like (Nwt) is a possibility to multiply
a given vector by the matrix of the system, and this does not require assembling and storing
the matrix itself in memory. E.g., to multiply a vector ∆x by H, we can use the multiplicative
representation of H as presented in (Nwt). Theoretically, the outlined iterative schemes, as
applied to real life SDP’s, allow one to reduce by orders of magnitude the arithmetic cost of building
search directions and to avoid the necessity to assemble and store huge dense matrices, which is
an extremely attractive opportunity. The difficulty, however, is that the iterative schemes are
much more affected by rounding errors than the usual Linear Algebra techniques; as a result,
for the time being the “iterative-Linear-Algebra-based” implementation of IP methods is no more
than a challenging goal.

15) I.e., with O(n²) nonzero entries.
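The matrix-free idea can be illustrated with a plain conjugate gradient solver that accesses the system matrix only through matrix-vector products. The matrix H = BᵀB + I below is merely a stand-in for the actual Newton matrix of (Nwt); the point is that H is never formed explicitly:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
B = rng.standard_normal((n, n))

def mult_H(v):
    # product with H = B^T B + I, without ever assembling H
    return B.T @ (B @ v) + v

def cg(mult, rhs, tol=1e-10, maxit=1000):
    # plain conjugate gradients: only matrix-vector products are needed
    x = np.zeros_like(rhs)
    r = rhs - mult(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        Hp = mult(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

h = rng.standard_normal(n)
sol = cg(mult_H, h)
assert np.allclose(mult_H(sol), h, atol=1e-6)
```

In an actual IP implementation the multiplicative structure of (Nwt) would play the role of mult_H; as the text notes, controlling rounding errors in such schemes is the hard part.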
Although the sizes of SDP’s which can be solved with the existing codes are not as impressive
as those of LP’s, the possibilities offered to a practitioner by SDP IP methods can
hardly be overestimated. Just ten years ago we could not even dream of solving an SDP with
more than a few tens of variables, while today we can routinely solve SDP’s 20-25 times larger,
and we have every reason to believe in further significant progress in this direction.