Lecture 6
Instructor: Wu Yuqia
Convex sets and functions can be characterized in many ways by their behavior along lines. For example,
a set is convex if and only if its intersection with any line is convex, a convex set is bounded if and only if
its intersection with every line is bounded, a function is convex if and only if it is convex along any line,
and a convex function is coercive if and only if it is coercive along any line. Similarly, it turns out that the
differentiability properties of a convex function are determined by the corresponding properties along lines.
With this in mind, we first consider convex functions of a single variable.
Lemma 1.1 Let f : I → R be a convex function, where I is an interval, i.e., a convex set of scalars, which may be open, closed, or neither open nor closed. The convexity of f implies the important inequality
$$\frac{f(y) - f(x)}{y - x} \le \frac{f(z) - f(x)}{z - x} \le \frac{f(z) - f(y)}{z - y}, \tag{1}$$
which holds for all x, y, z ∈ I such that x < y < z.
Proof: By noting that
$$y = \frac{y - x}{z - x}\, z + \frac{z - y}{z - x}\, x \quad\text{and}\quad \frac{y - x}{z - x} + \frac{z - y}{z - x} = 1,$$
we have from the convexity of f that
$$f(y) \le \frac{y - x}{z - x}\, f(z) + \frac{z - y}{z - x}\, f(x).$$
Therefore,
$$f(y) - f(x) \le \frac{y - x}{z - x}\, f(z) + \frac{x - y}{z - x}\, f(x) = \frac{y - x}{z - x}\bigl(f(z) - f(x)\bigr),$$
and dividing both sides by y − x yields the first inequality in (1). The other inequality can be obtained by a similar argument.
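As a quick numerical sanity check of (1), the following sketch evaluates the three chord slopes at sample points x < y < z; the convex function f(t) = t² is an arbitrary illustrative choice.

    def f(t):
        return t ** 2  # an arbitrary convex function

    x, y, z = -1.0, 0.5, 2.0              # sample points with x < y < z
    s_xy = (f(y) - f(x)) / (y - x)        # slope of the chord over [x, y]
    s_xz = (f(z) - f(x)) / (z - x)        # slope of the chord over [x, z]
    s_yz = (f(z) - f(y)) / (z - y)        # slope of the chord over [y, z]
    assert s_xy <= s_xz <= s_yz           # inequality (1)
    print(s_xy, s_xz, s_yz)               # -0.5 1.0 2.5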
We define
$$s^+(x, \alpha) = \frac{f(x + \alpha) - f(x)}{\alpha}, \qquad s^-(x, \alpha) = \frac{f(x) - f(x - \alpha)}{\alpha}.$$
If x is not equal to the right end point, we define the right derivative of f at x to be
$$f^+(x) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha) - f(x)}{\alpha} = \inf_{\alpha > 0} \frac{f(x + \alpha) - f(x)}{\alpha} = \inf_{\alpha > 0} s^+(x, \alpha).$$
Similarly, if x is not equal to the left end point, we define the left derivative of f at x to be
$$f^-(x) = \lim_{\alpha \downarrow 0} \frac{f(x) - f(x - \alpha)}{\alpha} = \sup_{\alpha > 0} \frac{f(x) - f(x - \alpha)}{\alpha} = \sup_{\alpha > 0} s^-(x, \alpha).$$
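For instance, for f(t) = |t| at x = 0 we have s^+(0, α) = 1 and s^-(0, α) = −1 for every α > 0, so f^+(0) = 1 and f^-(0) = −1, and the two one-sided derivatives differ. A minimal sketch of this computation:

    def f(t):
        return abs(t)  # convex, nondifferentiable at 0

    x = 0.0
    for a in (1.0, 0.1, 0.01):
        s_plus = (f(x + a) - f(x)) / a    # s+(x, a); inf over a > 0 gives f+(0) = 1
        s_minus = (f(x) - f(x - a)) / a   # s-(x, a); sup over a > 0 gives f-(0) = -1
        print(a, s_plus, s_minus)         # 1.0 and -1.0 for every a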
Proposition 1.2 Let I be an interval of real numbers, whose left and right end points are denoted by a and b, respectively, and let f : I → R be a convex function. Then:
(a) f^-(x) ≤ f^+(x) for all x ∈ I, with the conventions f^-(a) = −∞ if a ∈ I and f^+(b) = ∞ if b ∈ I.
(b) If x belongs to the interior of I, then f^+(x) and f^-(x) are finite.
(c) If x, z ∈ I and x < z, then f^+(x) ≤ f^-(z).
(d) The functions f^+ and f^- are nondecreasing on I.
Proof:
(a) If x is an end point of I, the result follows from the conventions, since if x = a then f^-(x) = −∞, and if x = b then f^+(x) = ∞. Assume that x is an interior point of I. Let α > 0 be small enough that x − α, x + α ∈ I, and use Eq. (1), with x, y, z replaced by x − α, x, x + α, respectively, to obtain s^-(x, α) ≤ s^+(x, α). Taking the limit as α decreases to zero, we obtain f^-(x) ≤ f^+(x).
(b) Let x belong to the interior of I and let α > 0 be such that x − α ∈ I. Then f^-(x) ≥ s^-(x, α) > −∞. Similarly, we obtain f^+(x) < ∞. Part (a) then implies that f^+(x) and f^-(x) are finite.
(c) We use Eq. (1), with y = (z + x)/2, to obtain s^+(x, (z − x)/2) ≤ s^-(z, (z − x)/2). The result then follows because f^+(x) ≤ s^+(x, (z − x)/2) and s^-(z, (z − x)/2) ≤ f^-(z).
(d) This follows by combining parts (a) and (c).
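Part (d) is easy to observe numerically. The sketch below approximates f^+ by s^+(x, a) with a small step a (the step size and the convex example f(t) = |t| + t² are arbitrary illustrative choices) and checks monotonicity:

    def f(t):
        return abs(t) + t ** 2  # convex; f+(x) = 2x - 1 for x < 0 and 2x + 1 for x >= 0

    a = 1e-6                                       # small step approximating the limit
    xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
    fplus = [(f(x + a) - f(x)) / a for x in xs]    # approximate right derivatives
    assert all(u <= v for u, v in zip(fplus, fplus[1:]))  # nondecreasing, part (d)
    print([round(v, 3) for v in fplus])            # approximately [-5.0, -3.0, 1.0, 3.0, 5.0]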
We will now discuss notions of directional differentiability of multidimensional real-valued functions. The
directional derivative of a function f : Rn → R at a point x ∈ Rn in the direction y ∈ Rn is given by
$$f'(x; y) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha y) - f(x)}{\alpha},$$
provided that the limit exists, in which case we say that f is directionally differentiable at x in the direction y, and we call f'(x; y) the directional derivative of f at x in the direction y. We say that f is directionally differentiable at x if it is directionally differentiable at x in all directions.
Let f : Rn → R be some function, fix some x ∈ Rn , and consider the expression
$$\lim_{\alpha \downarrow 0} \frac{f(x + \alpha e_i) - f(x)}{\alpha},$$
where ei is the i-th unit vector (all components are 0 except for the i-th component which is 1). If the above
limit exists, it is called the i-th partial derivative of f at the vector x and it is denoted by (∂f /∂xi )(x) or
∂f (x)/∂xi (xi in this section will denote the i-th component of the vector x).
Assuming all of these partial derivatives exist, the gradient of f at x is defined as the column vector
$$\nabla f(x) = \begin{pmatrix} \dfrac{\partial f(x)}{\partial x_1} \\ \vdots \\ \dfrac{\partial f(x)}{\partial x_n} \end{pmatrix}.$$
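A forward-difference sketch of this definition (the step size a and the test function are arbitrary illustrative choices; for f(x) = x1² + 3x2 the gradient is (2x1, 3)):

    def f(x):
        return x[0] ** 2 + 3 * x[1]   # example function with gradient (2*x1, 3)

    def grad_fd(f, x, a=1e-6):
        """Approximate each partial derivative (f(x + a*e_i) - f(x)) / a."""
        g = []
        for i in range(len(x)):
            x_shift = list(x)
            x_shift[i] += a           # move along the i-th unit vector e_i
            g.append((f(x_shift) - f(x)) / a)
        return g

    print(grad_fd(f, [1.0, 2.0]))     # approximately [2.0, 3.0]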
If the directional derivative of f at a vector x exists in all directions y and f'(x; y) is a linear function of y, we say that f is differentiable at x. This type of differentiability is also called Gateaux differentiability.
Lemma 1.3 f is differentiable at x if and only if the gradient ∇f(x) exists and satisfies
$$\nabla f(x)^\top y = f'(x; y), \qquad \forall y \in \mathbb{R}^n.$$
Proof: ⇒: If f is differentiable at x, then f'(x; ·) is linear, and each partial derivative (∂f/∂x_i)(x) = f'(x; e_i) exists, so ∇f(x) exists and, by linearity, f'(x; y) = Σ_{i=1}^n y_i f'(x; e_i) = ∇f(x)^⊤ y.
⇐: The existence of f'(x; y) for all y ∈ R^n is clear. The linearity of f'(x; y) in y is also given by the identity ∇f(x)^⊤ y = f'(x; y).
From the proof of Prop. 1.4, for a convex function, an equivalent definition of the directional derivative is
$$f'(x; y) = \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x)}{\alpha}. \tag{1.1}$$
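Eq. (1.1) holds because, by Lemma 1.1, the quotient (f(x + αy) − f(x))/α is nondecreasing in α > 0, so its limit as α ↓ 0 equals its infimum. A short numerical illustration with f(t) = t², x = 0, y = 1, where f'(0; 1) = 0:

    def f(t):
        return t ** 2

    x, y = 0.0, 1.0
    qs = [(f(x + a * y) - f(x)) / a for a in (2.0, 1.0, 0.5, 0.25, 0.125)]
    print(qs)                               # [2.0, 1.0, 0.5, 0.25, 0.125], decreasing to f'(0; 1) = 0
    assert qs == sorted(qs, reverse=True)   # the quotient shrinks as a decreases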
Proposition 1.5 Let f : Rn → R be a convex function, and let {fk } be a sequence of convex functions
fk : Rn → R with the property that limk→∞ fk (xk ) = f (x) for every x ∈ Rn and every sequence {xk } that
converges to x. Then, for any x ∈ Rn and y ∈ Rn , and any sequences {xk } and {yk } converging to x and y,
respectively, we have
$$\limsup_{k \to \infty} f_k'(x_k; y_k) \le f'(x; y).$$
Proof: Since f is convex, it follows by Prop. 1.4 that f is directionally differentiable. From the definition of the directional derivative, for any ε > 0, there exists an α > 0 such that
$$\frac{f(x + \alpha y) - f(x)}{\alpha} < f'(x; y) + \varepsilon.$$
Hence, for all sufficiently large k, using Eq. (1.1) and the assumed convergence f_k(x_k + αy_k) → f(x + αy) and f_k(x_k) → f(x), we have
$$f_k'(x_k; y_k) \le \frac{f_k(x_k + \alpha y_k) - f_k(x_k)}{\alpha} < f'(x; y) + \varepsilon.$$
Since this is true for all ε > 0, we obtain lim sup_{k→∞} f_k'(x_k; y_k) ≤ f'(x; y).
If f is differentiable at all x ∈ R^n, then using the continuity of f and the part of the proposition just proved (with f_k = f), we have for every sequence {x_k} converging to x and every y ∈ R^n,
$$\limsup_{k \to \infty} \nabla f(x_k)^\top y \le \nabla f(x)^\top y,$$
and, replacing y by −y,
$$-\liminf_{k \to \infty} \nabla f(x_k)^\top y = \limsup_{k \to \infty} \bigl(-\nabla f(x_k)^\top y\bigr) \le -\nabla f(x)^\top y.$$
Therefore, we have ∇f(x_k)^⊤ y → ∇f(x)^⊤ y for every y, which implies that ∇f(x_k) → ∇f(x) (take y = e_1, ..., e_n). Hence, ∇f(·) is continuous.
A vector d ∈ R^n is called a subgradient of a convex function f : R^n → R at a point x ∈ R^n if
$$f(z) \ge f(x) + (z - x)^\top d, \qquad \forall z \in \mathbb{R}^n. \tag{2}$$
The set of all subgradients of f at x is called the subdifferential of f at x, and is denoted by ∂f(x).
A subgradient admits an intuitive geometrical interpretation: it can be identified with a nonvertical sup-
porting hyperplane to the epigraph of f at (x, f (x)). Such a hyperplane provides a linear approximation to
the function f , which is an underestimate of f if f is convex.
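For example, for f(t) = |t| every slope d ∈ [−1, 1] gives a line t ↦ f(0) + d·t lying below the graph of f. A grid-based sanity check of the subgradient inequality (2), with the sample slopes and grid chosen arbitrarily:

    def f(t):
        return abs(t)

    x = 0.0
    for d in (-1.0, -0.5, 0.0, 0.5, 1.0):            # candidate subgradients at 0
        for z in [k / 10.0 for k in range(-30, 31)]:
            assert f(z) >= f(x) + d * (z - x)        # the subgradient inequality (2)
    print("subgradient inequality holds for all sampled d and z")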
Now we discuss the existence, convexity, and compactness of the subdifferential.
Proposition 1.6 Let f : Rn → R be a convex function. The subdifferential ∂f (x) is nonempty, convex, and
compact for all x ∈ Rn .
Proof:
Nonemptiness: Since f is proper and convex, epi f is also convex, and does not contain any vertical line. Since (x, f(x)) ∉ int(epi f), there exists a supporting hyperplane to epi f at (x, f(x)); because epi f contains no vertical line, this hyperplane is nonvertical, i.e., there exists v ∈ R^n such that
$$f(z) \ge f(x) + v^\top (z - x), \qquad \forall z \in \mathbb{R}^n,$$
so v ∈ ∂f(x) and ∂f(x) is nonempty.
Convexity: Fix any v_1, v_2 ∈ ∂f(x), let λ ∈ [0, 1], and let v_λ = λv_1 + (1 − λ)v_2. For every y ∈ R^n we have
$$f(y) \ge \lambda\bigl(f(x) + v_1^\top (y - x)\bigr) + (1 - \lambda)\bigl(f(x) + v_2^\top (y - x)\bigr) = f(x) + v_\lambda^\top (y - x),$$
which implies that v_λ ∈ ∂f(x) and this proves the convexity of ∂f(x).
Compactness: We first prove the closedness. Let H_y = {v ∈ R^n | v^⊤(y − x) ≤ f(y) − f(x)}, which is clearly a closed set. Since
$$\partial f(x) = \bigcap_{y \in \mathbb{R}^n} H_y,$$
we know that ∂f(x) is also closed. Next we prove the boundedness by contradiction. If ∂f(x) is not bounded, there exists {v_k} ⊂ ∂f(x) with ‖v_k‖ → ∞. Define ṽ_k = v_k/‖v_k‖. Passing to a subsequence if necessary, we assume that ṽ_k → ṽ. For each v_k, by the definition of the subdifferential, we have for any t > 0,
$$t\, v_k^\top \tilde{v} \le f(x + t\tilde{v}) - f(x).$$
By noting that t v_k^⊤ ṽ = t‖v_k‖ ṽ_k^⊤ ṽ → ∞, since ṽ_k^⊤ ṽ → ‖ṽ‖² = 1 and ‖v_k‖ → ∞, while the right-hand side is a constant, we arrive at a contradiction. This proves the boundedness, and hence ∂f(x) is compact.
The directional derivative and the subdifferential of a convex function are closely linked. To see this, let d ∈ ∂f(x) and note that the subgradient inequality (2), with z = x + αy, is equivalent to
$$\frac{f(x + \alpha y) - f(x)}{\alpha} \ge y^\top d, \qquad \forall y \in \mathbb{R}^n, \ \forall \alpha > 0.$$
Since the quotient on the left above decreases monotonically to f'(x; y) as α ↓ 0, we conclude that the subgradient inequality (2) is equivalent to
$$f'(x; y) \ge y^\top d, \qquad \forall y \in \mathbb{R}^n.$$
Therefore, we obtain
$$d \in \partial f(x) \iff f'(x; y) \ge y^\top d, \quad \forall y \in \mathbb{R}^n, \tag{3}$$
and it follows that
$$f'(x; y) \ge \max_{d \in \partial f(x)} y^\top d.$$
In particular, f is differentiable at x with gradient ∇f (x) if and only if it has ∇f (x) as its unique subgradient
at x.
Now we prove the reverse direction, i.e., f'(x; y) ≤ max_{d ∈ ∂f(x)} y^⊤ d. Take x, y ∈ R^n and consider the subset of R^{n+1}
$$C = \{(w, z) \mid z \in \mathbb{R}^n, \ w > f(z)\},$$
and the half-line
$$L = \{(f(x) + \alpha f'(x; y), \ x + \alpha y) \mid \alpha \ge 0\}.$$
By Eq. (1.1), f(x + αy) ≥ f(x) + αf'(x; y) for all α ≥ 0, so C and L are disjoint convex sets. By the Separating Hyperplane Theorem, there exists a nonzero vector (γ, µ) ∈ R × R^n such that
$$\gamma w + \mu^\top z \ge \gamma\bigl(f(x) + \alpha f'(x; y)\bigr) + \mu^\top (x + \alpha y), \qquad \forall \alpha \ge 0, \ z \in \mathbb{R}^n, \ w > f(z). \tag{5}$$
We cannot have γ < 0, since then the left-hand side above could be made arbitrarily small by choosing w sufficiently large. Also, if γ = 0, then Eq. (5) implies that µ = 0, which contradicts (γ, µ) ≠ 0. Therefore, γ > 0 and, by dividing with γ in Eq. (5) if necessary, we may assume that γ = 1, i.e.,
$$w + \mu^\top z \ge f(x) + \alpha f'(x; y) + \mu^\top (x + \alpha y), \qquad \forall \alpha \ge 0, \ z \in \mathbb{R}^n, \ w > f(z). \tag{6}$$
By setting α = 0 in the above relation and by taking the limit as w ↓ f(z), we obtain
$$f(z) \ge f(x) + (-\mu)^\top (z - x), \qquad \forall z \in \mathbb{R}^n,$$
implying that −µ ∈ ∂f(x). By setting z = x and α = 1 in Eq. (6), and by taking the limit as w ↓ f(x), we obtain −y^⊤ µ ≥ f'(x; y), which implies that
$$\max_{d \in \partial f(x)} y^\top d \ge y^\top (-\mu) \ge f'(x; y).$$
Hence f'(x; y) = max_{d ∈ ∂f(x)} y^⊤ d for all y ∈ R^n (the max formula of Prop. 1.7, used below).
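A numerical sanity check of the equality just proved: for a piecewise linear f(x) = max_i (a_i^⊤ x + b_i), the subdifferential at x is the convex hull of the active vectors a_i (a standard fact, not proved in these notes), so max_{d ∈ ∂f(x)} y^⊤ d equals the maximum of a_i^⊤ y over active indices. The data A, b below are arbitrary illustrative choices:

    # f(x) = max_i (a_i^T x + b_i) with three affine pieces
    A = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    b = [0.0, 0.0, -1.0]

    def f(x):
        return max(a[0] * x[0] + a[1] * x[1] + bi for a, bi in zip(A, b))

    x, y = (0.0, 0.0), (1.0, 2.0)
    fx = f(x)
    # active pieces at x: those attaining the max
    active = [a for a, bi in zip(A, b) if abs(a[0] * x[0] + a[1] * x[1] + bi - fx) < 1e-12]
    max_formula = max(a[0] * y[0] + a[1] * y[1] for a in active)  # max over ∂f(x) of y^T d
    alpha = 1e-8
    quotient = (f((x[0] + alpha * y[0], x[1] + alpha * y[1])) - fx) / alpha
    print(max_formula, quotient)       # both approximately 2.0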
Proposition 1.8 Let f : R^n → R be a convex function.
(a) If X ⊂ R^n is a bounded set, then the set of subgradients ∪_{x∈X} ∂f(x) is bounded.
(b) If a sequence {x_k} converges to a vector x ∈ R^n and d_k ∈ ∂f(x_k) for all k, then the sequence {d_k} is bounded and each of its limit points is a subgradient of f at x.
Proof: (a) Assume the contrary, i.e., that there exists a sequence {x_k} ⊂ X, and an unbounded sequence {d_k} with d_k ∈ ∂f(x_k) for all k. Without loss of generality, we assume that d_k ≠ 0 for all k, and we denote y_k = d_k/‖d_k‖. Since both {x_k} and {y_k} are bounded, they must contain convergent subsequences. We assume without loss of generality that {x_k} converges to some x and {y_k} converges to some y. Since d_k ∈ ∂f(x_k), we have
$$f(x_k + y_k) - f(x_k) \ge d_k^\top y_k = \|d_k\|.$$
Since {xk } and {yk } converge, by the continuity of f , the left-hand side above is bounded. This implies that
the right-hand side is bounded, thereby contradicting the unboundedness of {dk }.
(b) By Prop. 1.7, we have
$$y^\top d_k \le f'(x_k; y), \qquad \forall y \in \mathbb{R}^n.$$
By part (a), the sequence {d_k} is bounded, so let d be a limit point of {d_k}. By taking the limit along the relevant subsequence in the above relation and by using Prop. 1.5, it follows that
$$y^\top d \le \limsup_{k \to \infty} f'(x_k; y) \le f'(x; y), \qquad \forall y \in \mathbb{R}^n.$$
Hence, by Eq. (3), d ∈ ∂f(x).
Proposition 1.9 (Sum Rule) Let f_i : R^n → R, i = 1, ..., m, be convex functions, and let f = f_1 + ··· + f_m. Then
$$\partial f(x) = \partial f_1(x) + \cdots + \partial f_m(x), \qquad \forall x \in \mathbb{R}^n.$$
Proof: It will suffice to prove the result for the case where f = f_1 + f_2. If v_1 ∈ ∂f_1(x) and v_2 ∈ ∂f_2(x), then from the subgradient inequality (2), we have
$$f_1(z) \ge f_1(x) + (z - x)^\top v_1, \qquad \forall z \in \mathbb{R}^n,$$
$$f_2(z) \ge f_2(x) + (z - x)^\top v_2, \qquad \forall z \in \mathbb{R}^n,$$
so by adding, we obtain
$$f(z) \ge f(x) + (z - x)^\top (v_1 + v_2), \qquad \forall z \in \mathbb{R}^n.$$
Hence, v_1 + v_2 ∈ ∂f(x), implying that ∂f_1(x) + ∂f_2(x) ⊆ ∂f(x).
To prove the reverse inclusion, assume, to arrive at a contradiction, that there exists a v ∈ ∂f(x) such that v ∉ ∂f_1(x) + ∂f_2(x). Since by Prop. 1.6 the sets ∂f_1(x) and ∂f_2(x) are compact, the set ∂f_1(x) + ∂f_2(x) is compact, and by the Strict Separation Theorem, there exists a hyperplane strictly separating v from ∂f_1(x) + ∂f_2(x), i.e., a vector y and a scalar b such that
$$y^\top (v_1 + v_2) < b < y^\top v, \qquad \forall v_1 \in \partial f_1(x), \ \forall v_2 \in \partial f_2(x).$$
Therefore,
$$\sup_{v_1 \in \partial f_1(x)} y^\top v_1 + \sup_{v_2 \in \partial f_2(x)} y^\top v_2 < y^\top v,$$
and by the max formula (Prop. 1.7), f_1'(x; y) + f_2'(x; y) < y^⊤ v. Since the directional derivative of a sum is the sum of the directional derivatives, this gives f'(x; y) < y^⊤ v, contradicting Eq. (3) because v ∈ ∂f(x). Hence ∂f(x) ⊆ ∂f_1(x) + ∂f_2(x).
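A scalar sanity check of the sum rule: for f_1(t) = |t| and f_2(t) = 2|t| we have ∂f_1(0) = [−1, 1] and ∂f_2(0) = [−2, 2], so the proposition predicts ∂(f_1 + f_2)(0) = [−3, 3]. For a scalar convex function, Eq. (3) gives ∂f(x) = [f^-(x), f^+(x)], so it suffices to approximate the one-sided derivatives (the step 1e-6 is an arbitrary small choice):

    def f(t):
        return abs(t) + 2 * abs(t)    # f = f1 + f2 with f1 = |t|, f2 = 2|t|

    a = 1e-6
    f_plus = (f(a) - f(0.0)) / a      # right derivative, the max of ∂f(0)
    f_minus = (f(0.0) - f(-a)) / a    # left derivative, the min of ∂f(0)
    print(f_minus, f_plus)            # approximately -3.0 and 3.0, i.e. ∂f(0) = [-3, 3]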
Proposition 1.10 (Chain Rule) (a) Let f : Rm → R be a convex function, and let A be an m × n
matrix. Then, the subdifferential of the function F , defined by
F (x) = f (Ax),
is given by
$$\partial F(x) = A^\top \partial f(Ax) = \{A^\top g \mid g \in \partial f(Ax)\}.$$
(b) Let f : Rn → R be a convex function and let g : R → R be a smooth scalar function. Then the function
F , defined by
F (x) = g(f (x)),
is directionally differentiable at all x, and its directional derivative is given by
$$F'(x; y) = \nabla g(f(x))\, f'(x; y), \qquad \forall x, y \in \mathbb{R}^n. \tag{7}$$
Furthermore, if g is convex and monotonically nondecreasing, then F is convex and its subdifferential
is given by
$$\partial F(x) = \nabla g(f(x))\, \partial f(x), \qquad \forall x \in \mathbb{R}^n. \tag{8}$$
Proof: (a) From the definition of the directional derivative, it can be seen that
$$F'(x; y) = f'(Ax; Ay), \qquad \forall x, y \in \mathbb{R}^n,$$
and in particular, by Eq. (3), for any g ∈ ∂f(Ax),
$$g^\top A y \le f'(Ax; Ay), \qquad \forall y \in \mathbb{R}^n,$$
or
$$(A^\top g)^\top y \le F'(x; y), \qquad \forall y \in \mathbb{R}^n.$$
Hence, by Eq. (3), we have A^⊤ g ∈ ∂F(x), so that A^⊤ ∂f(Ax) ⊆ ∂F(x).
To prove the reverse inclusion, assume, to arrive at a contradiction, that there exists a d ∈ ∂F(x) such that d ∉ A^⊤ ∂f(Ax). Since by Prop. 1.6 the set ∂f(Ax) is compact, the set A^⊤ ∂f(Ax) is also compact, and by the Strict Separation Theorem, there exists a hyperplane strictly separating d from A^⊤ ∂f(Ax), i.e., a vector y and a scalar c such that
$$y^\top (A^\top g) < c < y^\top d, \qquad \forall g \in \partial f(Ax).$$
From this we obtain
$$\max_{g \in \partial f(Ax)} (Ay)^\top g < y^\top d,$$
and, by the max formula (Prop. 1.7), f'(Ax; Ay) < y^⊤ d, i.e., F'(x; y) < y^⊤ d, contradicting Eq. (3) because d ∈ ∂F(x). Hence ∂F(x) ⊆ A^⊤ ∂f(Ax).
(b) Fix x, y ∈ R^n and, for α > 0, define s(x, y, α) = (f(x + αy) − f(x))/α, so that
$$\frac{F(x + \alpha y) - F(x)}{\alpha} = \frac{g\bigl(f(x) + \alpha s(x, y, \alpha)\bigr) - g(f(x))}{\alpha}. \tag{9}$$
Since s(x, y, α) is monotonically nonincreasing as α ↓ 0, exactly one of the following three cases holds:
(1) For some ᾱ > 0, f (x + αy) = f (x) for all α ∈ (0, ᾱ],
(2) For some ᾱ > 0, f (x + αy) > f (x) for all α ∈ (0, ᾱ],
(3) For some ᾱ > 0, f (x + αy) < f (x) for all α ∈ (0, ᾱ].
In case (1), we have s(x, y, α) = 0 for all α ∈ (0, ᾱ], so lim_{α↓0} s(x, y, α) = f'(x; y) = 0 and, from Eq. (9), F'(x; y) = 0, so Eq. (7) holds.
In case (2), Eq. (9) is written as
$$\frac{F(x + \alpha y) - F(x)}{\alpha} = \frac{g\bigl(f(x) + \alpha s(x, y, \alpha)\bigr) - g(f(x))}{\alpha s(x, y, \alpha)}\; s(x, y, \alpha), \qquad \alpha \in (0, \bar{\alpha}].$$
As α ↓ 0, we have α s(x, y, α) → 0 and s(x, y, α) → f'(x; y), so the first factor on the right tends to ∇g(f(x)) and F'(x; y) = ∇g(f(x)) f'(x; y). Case (3) is handled in the same way. Hence Eq. (7) holds in all cases.
Assume now that g is convex and monotonically nondecreasing, so that F is convex. By Eq. (3) and Eq. (7), d ∈ ∂F(x) if and only if
$$y^\top d \le F'(x; y) = \nabla g(f(x))\, f'(x; y), \qquad \forall y \in \mathbb{R}^n.$$
If ∇g(f(x)) = 0, this relation yields d = 0, so ∂F(x) = {0} and the desired Eq. (8) holds. If ∇g(f(x)) ≠ 0,
we have ∇g(f (x)) > 0 by the monotonicity of g, so we obtain
$$y^\top \frac{d}{\nabla g(f(x))} \le f'(x; y), \qquad \forall y \in \mathbb{R}^n,$$
which, by Eq. (3), is equivalent to d/∇g(f (x)) ∈ ∂f (x). Thus, we have shown that d ∈ ∂F (x) if and only if
d/∇g(f (x)) ∈ ∂f (x), which proves the desired Eq. (8).
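As a closing sanity check of Eq. (8), take g(s) = e^s (smooth, convex, nondecreasing) and f(t) = |t|. At t = 0, Eq. (8) predicts ∂F(0) = ∇g(f(0)) ∂f(0) = e^0 · [−1, 1] = [−1, 1]; as in the sum-rule check, the endpoints of the interval are the one-sided derivatives of F:

    import math

    def F(t):
        return math.exp(abs(t))       # F = g(f) with g(s) = exp(s), f(t) = |t|

    a = 1e-6
    F_plus = (F(a) - F(0.0)) / a      # right derivative; Eq. (8) predicts +1
    F_minus = (F(0.0) - F(-a)) / a    # left derivative; Eq. (8) predicts -1
    print(F_minus, F_plus)            # approximately -1.0 and 1.0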