
Introduction to Nonlinear Systems and Numerical Optimization

Optimization problems arise in almost every field where numerical information is processed (science, engineering, mathematics, economics, commerce, etc.). In science, optimization problems arise in data fitting, in variational principles, and in the solution of differential and integral equations by expansion methods. Engineering applications are in design problems, which usually have constraints in the sense that the variables cannot take arbitrary values. For example, while designing a bridge an engineer will be interested in minimizing the cost while maintaining a certain minimum strength for the structure. Even the strength of the materials used will have a finite range depending on what is available in the market. Such problems with constraints are more difficult to handle than the simple unconstrained optimization problems, which very often arise in scientific work. In most problems we assume the variables to be continuously varying, but some problems require the variables to take discrete values (H M Antia 1995).

Mathematically speaking, optimization is the minimization or maximization of a function, subject or not to constraints on its variables (parameters). To solve an optimization problem numerically, a wide variety of algorithms is available nowadays. As we saw in the previous chapters, these algorithms start from some initial guess of the parameters and generate a sequence of iterates which terminates either when no more progress can be made or when a solution point has been approximated with sufficient accuracy. The main difference between optimization algorithms is the way in which they pass from one iterate to the next.

There are mainly two different strategies for computing the next iterate from the previous one, and they are the ones used most frequently in the optimization algorithms available today. The first is the line search strategy, in which the algorithm chooses a direction d_k and then searches along this direction for a lower function value. The second is the trust region strategy, in which the information gathered about the objective function is used to construct a model function whose behavior near the current iterate is trusted to be similar enough to the actual function; the algorithm then searches for a minimizer of the model function inside the trust region.

Although most optimization problems require the global minimum to be found, most of the methods that we are going to describe here will only find a local minimum. The function has a local minimum at a point where it assumes its lowest value in a small neighborhood of the point which is not at the boundary of that neighborhood. To find a global minimum we normally try several different starting points and keep the best local minimizer found. In this chapter, we consider methods for minimizing or maximizing a function of several variables, that is, finding those values of the coordinates for which the function takes on its minimum or maximum value.

Definition A continuous function f: ℝ^n ⟶ ℝ is said to be continuously differentiable at x ∈ ℝ^n if (∂f/∂x_i)(x) exists and is continuous for i = 1, …, n; the gradient of f at x is then defined as

∇f(x) = [∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n]^T

The function f is said to be continuously differentiable in an open region D ⊂ ℝ^n, denoted f ∈ C¹(D), if it is continuously differentiable at every point in D.

Lemma Let f: ℝ^n ⟶ ℝ be continuously differentiable in an open convex set D ⊂ ℝ^n. Then, for x ∈ D and any nonzero perturbation p ∈ ℝ^n, the directional derivative of f at x in the direction of p, defined by

D_p f(x) = lim_{ε→0} [f(x + εp) − f(x)]/ε = (∇f(x))^T p

exists. For any x, x + p ∈ D,

f(x + p) − f(x) = ∫_0^1 (∇f(x + tp))^T p dt = ∫_x^{x+p} (∇f(z))^T dz

and there exists z ∈ (x, x + p) such that f(x + p) − f(x) = (∇f(z))^T p.

Example: Let f: ℝ² ⟶ ℝ, f(x) = x1² − 2x1 + 3x1·x2² + 4x2³, x_c = (1, 1)^T, p = (−2, 1)^T. Then

∇f(x) = [2x1 − 2 + 3x2²; 6x1·x2 + 12x2²]

f(x_c) = 6, f(x_c + p) = 23, ∇f(x_c) = (3, 18)^T

If we let g(t) = f(x_c + tp) = f(1 − 2t, 1 + t) = 6 + 12t + 7t² − 2t³, the reader can verify that f(x_c + p) − f(x_c) = (∇f(z))^T p is true for z = x_c + tp with t = (7 − √19)/6 ≈ 0.44.
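A quick numerical check of this mean-value relation in MATLAB (a small sketch; the handles below simply transcribe the formulas of the example):

% Numerical check of f(xc+p)-f(xc) = (grad f(z))'*p for the example above
f  = @(x) x(1)^2 - 2*x(1) + 3*x(1)*x(2)^2 + 4*x(2)^3;       % objective
g  = @(x) [2*x(1) - 2 + 3*x(2)^2; 6*x(1)*x(2) + 12*x(2)^2]; % gradient
xc = [1; 1]; p = [-2; 1];
t  = (7 - sqrt(19))/6;             % parameter of the intermediate point
z  = xc + t*p;
lhs = f(xc + p) - f(xc)            % = 17
rhs = g(z)'*p                      % = 17 as well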
Example: Computing directional derivatives.
Let f(x, y) = 14 − x² − y² and let P = (1, 2). Find the directional derivative of f at P in the following directions:

1. toward the point Q = (3, 4),
2. in the direction of ⟨2, −1⟩, and
3. toward the origin.

■ The surface is plotted in the figure above, where the point P = (1, 2) is indicated in the xy-plane as well as the point (1, 2, 9), which lies on the surface of f. We find that

∂f/∂x |_(1,2) = −2x |_(x=1) = −2,   ∂f/∂y |_(1,2) = −2y |_(y=2) = −4

Let u1 be the unit vector that points from the point (1, 2) to the point Q = (3, 4), as shown in the figure. The vector PQ = ⟨2, 2⟩; the unit vector in this direction is u1 = ⟨1/√2, 1/√2⟩. Thus the directional derivative of f at (1, 2) in the direction of u1 is

D_{u1} f(x) = (∇f(x))^T u1 = (−2, −4)·(1/√2, 1/√2) = −3√2 ≅ −4.24

Thus the instantaneous rate of change in moving from the point (1, 2, 9) on the surface in the direction of u1 (which points toward the point Q) is about −4.24. Moving in this direction moves one steeply downward.

■ We seek the directional derivative in the direction of ⟨2, −1⟩. The unit vector in this direction is u2 = ⟨2/√5, −1/√5⟩. Thus the directional derivative of f at (1, 2) in the direction of u2 is D_{u2} f(x) = (∇f(x))^T u2 = 0. Starting on the surface of f at (1, 2) and moving in the direction of ⟨2, −1⟩ (or u2) results in no instantaneous change in z-value.

■ At P = (1, 2), the direction towards the origin is given by the vector ⟨−1, −2⟩; the unit vector in this direction is u3 = ⟨−1/√5, −2/√5⟩. The directional derivative of f at P in the direction of the origin is D_{u3} f(x) = (∇f(x))^T u3 = 10/√5 ≅ 4.47. Moving towards the origin means "walking uphill" quite steeply, with an initial slope of about 4.47.

Note: The symbol "∇" is named "nabla,'' derived from the Greek name of a Jewish harp.
Oddly enough, in mathematics the expression ∇f is pronounced "del f.''

a. ∇(fg) = f∇g + g∇f
b. ∇(f/g) = (g∇f − f∇g)/g²
c. ∇((f(x, y))^n) = n (f(x, y))^{n−1} ∇f
Gradients and Level Curves: In this section, we use the gradient and the chain rule to investigate horizontal and vertical slices of a surface of the form z = g(x, y). To begin with, if k is constant, then g(x, y) = k is called the level curve of g(x, y) of level k and is the intersection of the horizontal plane z = k with the surface z = g(x, y). In particular, g(x, y) = k is a curve in the xy-plane.

The gradient vectors are perpendicular to the level sets, so at each point they point in the direction in which the level of g changes fastest, from one level set toward the next. Following the gradient from point to point therefore traces a path across the level sets, much like a liquid or a rolling object flowing along the surface; this is the idea of gradient flow. The following theorem makes the perpendicularity statement precise.
Theorem Consider a function f: ℝ^n ⟶ ℝ, and suppose f is of class C¹. For some constant c, consider the level set S = {x ∈ ℝ^n : f(x) = c}. Then, for any point x0 in S, the gradient ∇f(x0) is perpendicular to S.

Proof: We need to show that any vector a which is tangent to S at x0 is perpendicular to ∇f(x0). If a is tangent to S, we can find a parametrized curve x(t) lying in S such that x0 = x(t0) and x′(t0) = a. We will show that ∇f(x0) is perpendicular to a = x′(t0).

By the definition of S, and since x(t) lies in S, f(x(t)) = c for all t. Differentiating both sides of this identity, and using the chain rule on the left side, we obtain ∇f(x(t)) · x′(t) = 0.

Plugging in t = t0, this gives us ∇f(x(t0)) · x′(t0) = 0, which we can rewrite as

∇f(x0) · x′(t0) = 0 ⟺ ∇f(x0) ⊥ x′(t0) ⟺ ∇f(x0) ⊥ a

Thus, we have shown that ∇f(x0) is perpendicular to the level set S. ■

Definition A continuously differentiable function f: ℝ^n ⟶ ℝ is said to be twice continuously differentiable at x ∈ ℝ^n if (∂²f/∂x_i∂x_j)(x) exists and is continuous for 1 ≤ i, j ≤ n; the Hessian of f at x is then defined as the n × n matrix whose (i, j) element is

∇²f(x)_ij = ∂²f(x)/(∂x_i ∂x_j),   1 ≤ i, j ≤ n

The function f is said to be twice continuously differentiable in an open region D ⊂ ℝ^n, denoted f ∈ C²(D), if it is twice continuously differentiable at every point in D.

Lemma Let the function f: ℝ^n ⟶ ℝ be twice continuously differentiable in an open convex set D ⊂ ℝ^n. Then, for x ∈ D and any nonzero perturbation p ∈ ℝ^n, the second directional derivative of f at x in the direction of p, defined by

D²_pp f(x) = lim_{ε→0} [D_p f(x + εp) − D_p f(x)]/ε = p^T (∇²f(x)) p = p^T H(x) p

exists. For any x, x + p ∈ D, there exists z ∈ (x, x + p), i.e. z = x + tp with t ∈ (0, 1), such that

f(x + p) − f(x) = (∇f(x))^T p + ½ p^T (∇²f(z)) p = (∇f(x))^T p + ½ p^T H(x + tp) p

and, expanding about x itself, f(x + p) − f(x) = (∇f(x))^T p + ½ p^T H(x) p + O(‖p‖³).
Remark: The Hessian H(z) is always symmetric as long as f is twice continuously differentiable. This is the reason we were interested in symmetric matrices in previous chapters. Indeed,

(H(x + th)h) · h = ⟨H(x + th)h, h⟩ = (H(x + th)h)^T h = h^T (H(x + th))^T h = h^T H(x + th) h

⟹ f(x + h) = f(x) + ∇f(x) · h + ½ (H(x + th)h) · h

Example Let f, x_c, and p be given by

f(x) = x1² − 2x1 + 3x1·x2² + 4x2³,   x_c = (1, 1)^T,   p = (−2, 1)^T

Then

∇²f(x) = [2, 6x2; 6x2, 6x1 + 24x2]  ⟹  ∇²f(x_c) = [2, 6; 6, 30]

The reader can verify that

f(x_c + p) − f(x_c) = (∇f(x_c))^T p + ½ p^T H(x_c + tp) p

is true for z = x_c + tp, t = 1/3.

The lemma suggests that we might model the function f around a point x_c by the quadratic model

m(x_c + p) = f(x_c) + (∇f(x_c))^T p + ½ p^T H(x_c) p

and this is precisely what we will do. In fact, it shows that the error in this model is given by

ε = f(x_c + p) − m(x_c + p) = ½ p^T (H(z) − H(x_c)) p

for some z ∈ (x_c, x_c + p).
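As an illustration, the quadratic model and its error can be evaluated numerically for the example above (a sketch; the handles transcribe the formulas already given):

% Quadratic model m(xc+p) of f around xc and its error
f  = @(x) x(1)^2 - 2*x(1) + 3*x(1)*x(2)^2 + 4*x(2)^3;
g  = @(x) [2*x(1) - 2 + 3*x(2)^2; 6*x(1)*x(2) + 12*x(2)^2];
H  = @(x) [2, 6*x(2); 6*x(2), 6*x(1) + 24*x(2)];
xc = [1; 1]; p = [-2; 1];
m   = f(xc) + g(xc)'*p + 0.5*p'*H(xc)*p    % model value at xc + p
err = f(xc + p) - m                        % model error (here -2)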

Vector-valued functions: Now let us proceed to the less simple case of F: ℝ^n ⟶ ℝ^m, where m = n in the nonlinear simultaneous equations problem and m > n in the nonlinear least-squares problem. It will be convenient to have the special notation e_i^T for the i-th row of the identity matrix; there should be no confusion with the natural log base. Since the value of the i-th component function of F can be written f_i(x) = e_i^T F(x), consistency with the product rule makes it necessary that f_i′(x) = e_i^T F′(x), the i-th row of F′(x). Thus F′(x) must be an m × n matrix whose i-th row is ∇f_i(x)^T. The following definition makes this official.

Definition A continuous function F: ℝ^n ⟶ ℝ^m is continuously differentiable at x ∈ ℝ^n if each component function f_i, i = 1, …, m is continuously differentiable at x. The derivative of F at x is sometimes called the Jacobian (matrix) of F at x, and its transpose is sometimes called the gradient of F at x. The common notations are:

F′(x) ∈ ℝ^{m×n},   F′(x) = J(x) = ∇F(x)^T   with   F′(x)_ij = ∂f_i(x)/∂x_j

F is said to be continuously differentiable in an open set D ⊂ ℝ^n, denoted F ∈ C¹(D), if F is continuously differentiable at every point in D.

Example Let F: ℝ² ⟶ ℝ², f1 = e^{x1} − x2, f2 = x1² − 2x2. Then

J(x) = ∇F(x)^T = [e^{x1}, −1; 2x1, −2]

For the remainder of this chapter we will denote the Jacobian matrix of 𝐹 at 𝐱 by 𝑱(𝐱).
Also, we will often speak of the Jacobian of 𝐹 rather than the Jacobian matrix of 𝐹 at 𝐱.

Remark: The Jacobian of a vector-valued function in several variables generalizes the


gradient of a scalar-valued function in several variables, which in turn generalizes the
derivative of a scalar-valued function of a single variable. In other words, the Jacobian
matrix of a scalar-valued function in several variables is (the transpose of) its gradient
and the gradient of a scalar-valued function of a single variable is its derivative.

An important fact: Now comes the big difference from real-valued functions: there is no mean value theorem for continuously differentiable vector-valued functions. That is, in general there may not exist z ∈ ℝ^n such that F(x + p) = F(x) + J(z)p. Intuitively the reason is that, although each component function f_i satisfies f_i(x + p) = f_i(x) + ∇f_i(z_i)^T p, the points z_i may differ. For example, consider the function of the example before. There is no z ∈ ℝ^n for which F(1, 1) = F(0, 0) + J(z)(1, 1)^T, as this would require

(e^{x1} − x2; x1² − 2x2)|_{x=(1,1)} = (e^{x1} − x2; x1² − 2x2)|_{x=(0,0)} + [e^{z1}, −1; 2z1, −2](1; 1) ⟺ (e − 1; −1) = (1; 0) + [e^{z1}, −1; 2z1, −2](1; 1)

or e^{z1} = e − 1 and 2z1 = 1, which is clearly impossible. Although the standard mean value theorem does not hold, we will be able to replace it in our analysis by Newton's theorem and the triangle inequality for line integrals.

Those results are given below. The integral of a vector-valued function of a real variable can be interpreted as the vector of Riemann integrals of each component function.

Lemma Let F: ℝ^n ⟶ ℝ^m be continuously differentiable in an open convex set D ⊂ ℝ^n. For any x, x + p ∈ D,

F(x + p) − F(x) = (∫_0^1 J(x + tp) dt) p = ∫_x^{x+p} F′(z) dz

Proof: The proof comes right from the definition of F′(z) and a component-by-component application of Newton's formula f_i(x + p) = f_i(x) + ∫_0^1 ∇f_i(x + tp)^T p dt. ■

Now we might think about the best linear approximation of the function F around a point x_c, which means that we model F(x_c + p) by the affine model

M(x_c + p) = F(x_c) + J(x_c)p

and this is what we will do. To produce a bound on the difference between F(x_c + p) and M(x_c + p), we need to make an assumption about the continuity of J(x), just as we did for scalar-valued functions in the section before.
Definition Let the two integers 𝑚, 𝑛 > 0, 𝑮: ℝ𝑛 ⟶ ℝ𝑚×𝑛 , 𝐱 ∈ ℝ𝑛 , let ‖•‖ be a norm on
ℝ𝑛 , and ‖•‖𝑮 a norm on ℝ𝑚×𝑛 . 𝑮 is said to be Lipschitz continuous at 𝐱 if there exists an
open set 𝑫 ⊂ ℝ𝑛 , 𝐱 ∈ 𝑫, and a constant 𝛾 such that for all 𝐲 ∈ 𝑫,

‖𝑮(𝐱) − 𝑮(𝐲)‖𝑮 ≤ 𝛾‖𝐱 − 𝐲‖

The constant 𝛾 is called a Lipschitz constant for 𝑮 at 𝐱. For any specific 𝑫 containing 𝐱
for which the given inequality holds, 𝑮 is said to be Lipschitz continuous at 𝐱 in the
neighborhood 𝑫. If this inequality holds for every 𝐱 ∈ 𝑫, then 𝑮 ∈ 𝐿𝑖𝑝𝛾 (𝑫).

Note that the value of 𝛾 depends on the norms ‖•‖ & ‖•‖𝑮 , but the existence of 𝛾 does
not.

Lemma Let F: ℝ^n ⟶ ℝ^m be continuously differentiable in the open convex set D ⊂ ℝ^n, x ∈ D, and let J be Lipschitz continuous at x in the neighborhood D, using a vector norm and the induced matrix operator norm and the constant γ. Then, for any x + p ∈ D,

‖F(x + p) − F(x) − J(x)p‖ ≤ (γ/2)‖p‖²

Proof:

F(x + p) − F(x) − J(x)p = (∫_0^1 J(x + tp) p dt) − J(x)p

Using the triangle inequality for line integrals, the definition of a matrix operator norm, and the Lipschitz continuity of J at x in the neighborhood D, we obtain

‖F(x + p) − F(x) − J(x)p‖ ≤ ∫_0^1 ‖J(x + tp) − J(x)‖ ‖p‖ dt ≤ ∫_0^1 γ‖tp‖ ‖p‖ dt = γ‖p‖² ∫_0^1 t dt = (γ/2)‖p‖² ■

Using Lipschitz continuity, we can obtain a useful bound on the error in the approximate affine model.

Lemma Let F, J satisfy the conditions of the previous lemma. Then, for any v, u ∈ D,

‖F(v) − F(u) − J(x)(v − u)‖ ≤ γ [(‖v − x‖ + ‖u − x‖)/2] ‖v − u‖

If we assume that J(x)⁻¹ exists, then there exist ε > 0 and 0 < α < β such that

α‖v − u‖ ≤ ‖F(v) − F(u)‖ ≤ β‖v − u‖

for all v, u ∈ D for which max{‖v − x‖, ‖u − x‖} ≤ ε.


Proof: Using the previous lemma we can write

‖F(v) − F(u) − J(x)(v − u)‖ ≤ γ [(‖v − x‖ + ‖u − x‖)/2] ‖v − u‖

Using the triangle inequality,

‖F(v) − F(u)‖ ≤ ‖J(x)(v − u)‖ + ‖F(v) − F(u) − J(x)(v − u)‖
 ≤ (‖J(x)‖ + (γ/2)(‖v − x‖ + ‖u − x‖)) ‖v − u‖
 ≤ (‖J(x)‖ + γε) ‖v − u‖

If we define β = ‖J(x)‖ + γε, then ‖F(v) − F(u)‖ ≤ β‖v − u‖. Similarly,

‖F(v) − F(u)‖ ≥ ‖J(x)(v − u)‖ − ‖F(v) − F(u) − J(x)(v − u)‖
 ≥ (1/‖J(x)⁻¹‖ − (γ/2)(‖v − x‖ + ‖u − x‖)) ‖v − u‖
 ≥ (1/‖J(x)⁻¹‖ − γε) ‖v − u‖

If we define α = 1/‖J(x)⁻¹‖ − γε > 0 (taking ε small enough), then β‖v − u‖ ≥ ‖F(v) − F(u)‖ ≥ α‖v − u‖. ■

Summary on Jacobian approximation: If F is differentiable at a point a in ℝ^n, then its differential is represented by J(a). In this case, the linear transformation represented by J(a) is the best linear approximation of F near the point a, in the sense that

F(x) − F(a) = J(a)(x − a) + o(‖x − a‖)   as ‖x − a‖ ⟶ 0

where o(‖x − a‖) is a quantity that approaches zero much faster than the distance between x and a does as x approaches a.

In the preceding section we saw that the Jacobian, gradient, and Hessian will be useful
quantities in forming models of multivariable nonlinear functions. In many
applications, however, these derivatives are not analytically available. In this section we
introduce the formulas used to approximate these derivatives by finite differences, and
the error bounds associated with these formulas. The choice of finite-difference stepsize
in the presence of finite-precision arithmetic and the use of finite-difference derivatives
in our algorithms are discussed in (J. E. Dennis, Jr. & Robert B. Schnabel 1993).

Frequently we deal with problems where the nonlinear function is itself the result of a
computer simulation, or is given by a long and messy algebraic formula, and so it is
often the case that analytic derivatives are not readily available although the function is
several times continuously differentiable. Therefore it is important to have algorithms
that work effectively in the absence of analytic derivatives.
 In the case when F: ℝ^n ⟶ ℝ^m, it is reasonable to use the same idea as in one variable and approximate the (i, j)-th component of J(x) by the forward difference approximation

a_ij(x) = [f_i(x + h·e_j) − f_i(x)]/h

where e_j denotes the j-th unit vector. This is equivalent to approximating the j-th column of J(x) = [A_1(x) A_2(x) … A_n(x)] by

A_j(x) = [F(x + h·e_j) − F(x)]/h

Again, one would expect ‖A_j(x) − (J(x))_j‖ = O(h) for h sufficiently small. In terms of the ℓ1 norm we can write

‖A(x) − J(x)‖₁ ≤ (γ/2)|h|

because we have seen that

‖F(x + h·e_j) − F(x) − J(x)·h·e_j‖ ≤ (γ/2) h² ‖e_j‖² = (γ/2)|h|²

Dividing both sides by h gives

‖A_j(x) − (J(x))_j‖ ≤ (γ/2)|h|

Since the ℓ1 norm of a matrix is the maximum of the ℓ1 norms of its columns, ‖A(x) − J(x)‖₁ ≤ (γ/2)|h| follows immediately. ■
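A minimal forward-difference Jacobian routine might look as follows (a sketch; the function F, the point x and the stepsize h are illustrative choices, with F taken from the earlier example):

% Forward-difference approximation of the Jacobian of F: R^n -> R^m
F = @(x) [exp(x(1)) - x(2); x(1)^2 - 2*x(2)];   % example used before
x = [1; 1]; h = 1e-6;
m = numel(F(x)); n = numel(x);
A = zeros(m, n);
for j = 1:n
    ej = zeros(n,1); ej(j) = 1;                 % j-th unit vector
    A(:,j) = (F(x + h*ej) - F(x))/h;            % j-th column of J(x)
end
A            % compare with J(x) = [exp(1) -1; 2 -2]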

 When the nonlinear problem is minimization of a function 𝑓: ℝ𝑛 ⟶ ℝ , we may need


to approximate the gradient ∇𝑓(𝐱) and/or the Hessian ∇2 𝑓(𝐱). Approximation of the
gradient is just a special case of the approximation of 𝑱(𝐱) discussed above, with 𝑚 = 1.

In some cases, finite-precision arithmetic causes us to seek a more accurate finite-


difference approximation using the central difference approximation. Notice that this
approximation requires twice as many evaluations of 𝑓 as forward differences.

One can prove that if a ∈ ℝ^n is the central-difference approximation of ∇f(x), then its components a_i satisfy

a_i(x) = [f(x + h·e_i) − f(x − h·e_i)]/(2h),   |a_i(x) − (∇f(x))_i| ≤ (γ/6) h²

so that ‖a(x) − ∇f(x)‖_∞ ≤ (γ/6) h².
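A central-difference gradient can be coded in the same spirit (a sketch; f, x and h are illustrative choices):

% Central-difference approximation of the gradient of f: R^n -> R
f = @(x) x(1)^2 - 2*x(1) + 3*x(1)*x(2)^2 + 4*x(2)^3;
x = [1; 1]; h = 1e-5;
n = numel(x); a = zeros(n,1);
for i = 1:n
    ei = zeros(n,1); ei(i) = 1;
    a(i) = (f(x + h*ei) - f(x - h*ei))/(2*h);   % i-th gradient component
end
a            % compare with the analytic gradient [3; 18]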

On some occasions ∇f(x) is analytically available but ∇²f(x) is not. In this case, ∇²f(x) can be approximated by applying forward differences column by column, A_j = (∇f(x + h·e_j) − ∇f(x))/h, followed by Â = (A + A^T)/2, since the approximation to ∇²f(x) should be symmetric.

If ∇f(x) is not available, it is possible to approximate ∇²f(x) using only values of f(x):

A_ij = [f(x + h·e_i + h·e_j) − f(x + h·e_i) − f(x + h·e_j) + f(x)]/h²   ⟹   |A_ij − (∇²f(x))_ij| ≤ (5γ/3) h
In this section we derive the first and second-order necessary and sufficient conditions
for a point 𝐱 ⋆ to be a local minimizer of a continuously differentiable function 𝑓: ℝ𝑛 ⟶ ℝ,
𝑛 > 1. Naturally, these conditions will be a key to our algorithms for the unconstrained
minimization problem.

Lemma Let f: ℝ^n ⟶ ℝ be continuously differentiable in an open convex set D ⊂ ℝ^n. Then x⋆ ∈ D can be a local minimizer of f only if ∇f(x⋆) = 0.

Proof: As in the one-variable case, a proof by contradiction is easier than a direct proof: if ∇f(x⋆) ≠ 0, then f decreases along the direction p = −∇f(x⋆), since p^T ∇f(x⋆) = −‖∇f(x⋆)‖² < 0, so x⋆ cannot be a local minimizer. ■

A class of algorithms called descent methods is characterized by a direction vector p such that p^T ∇f(x) < 0, for example p = −∇f(x).

Theorem Let f: ℝ^n ⟶ ℝ be twice continuously differentiable in the open convex set D ⊂ ℝ^n, and assume there exists x⋆ ∈ D such that ∇f(x⋆) = 0. If ∇²f(x⋆) is positive definite, then x⋆ is a local minimizer of f. If ∇²f is Lipschitz continuous at x⋆, then x⋆ can be a local minimizer of f only if ∇²f(x⋆) is positive semidefinite.

The second-order conditions for x⋆ to be a local minimizer of f are therefore simply
■ ∇f(x⋆) = 0
■ ∇²f(x⋆) is positive semidefinite (necessary) or positive definite (sufficient).

It is important to understand the shapes of multivariable quadratic functions: they are


strictly convex or convex, respectively bowl- or trough-shaped in two dimensions, if 𝑯 is
positive definite or positive semidefinite; they are strictly concave or concave (turn the
bowl or trough upside down) if 𝑯 is negative definite or negative semidefinite; and they
are saddle-shaped (in n dimensions) if 𝑯 is indefinite.

f(x + p) − f(x) = (∇f(x))^T p + ½ p^T H(x + tp) p

At a stationary point ∇f(x) = 0. When the Hessian matrix is positive definite, by definition p^T H(x + tp) p > 0 for any p ≠ 0, therefore f(x + p) − f(x) = ½ p^T H(x + tp) p > 0, which means that x must be a local minimum. Similarly, when the Hessian matrix is negative definite, x is a local maximum. Finally, when H has both positive and negative eigenvalues, the point is a saddle point.
Gradient-based methods use the gradient to search for the minimum point of an objective function. Such methods are supposed to reach a point at which the gradient is (close to) zero. In this context, the optimization of an objective function f(x) is equivalent to finding a zero of its gradient g(x), which in general is a vector-valued function of a vector-valued independent variable x. Therefore, if we have the gradient function g(x) of the objective function f(x), we can solve the system of nonlinear equations g(x) = 0 to get the minimum of f(x) by using the Newton method explained in Chapter 4.

Let f: ℝ^n ⟶ ℝ be twice continuously differentiable in the open convex set D ⊂ ℝ^n. The Newton method tries to go straight to the zero of the gradient of the approximate objective function; that is, we try to solve the nonlinear system of equations g(x) = ∇f(x) = 0:

g(x) = 0 ⟺ g(x_k) + (∇g(x)|_{x_k})(x − x_k) = 0
       ⟺ g(x_k) + H(x_k)(x − x_k) = 0
       ⟺ x = x_k − (H(x_k))⁻¹ g(x_k)

which gives the updating rule x_{k+1} = x_k − (H(x_k))⁻¹ g(x_k).

Example: Given the Rosenbrock function f(x) = 100(x2 − x1²)² + (1 − x1)², find the extremum of f(x) starting from the point x = [0.01 0.02]. This function is severely ill-conditioned near the minimizer (1, 1) (which is the unique stationary point).

syms x1 x2; f=100*(x2-x1^2)^2+(1-x1)^2;


J=jacobian(f,[x1,x2]) % Gradient computation
H=jacobian(J,[x1,x2]) % Hessian computation

%-----------------------------------------------------------%
% f .......... objective function
% J .......... gradient of the objective function
% H .......... Hessian of the objective function
%-----------------------------------------------------------%
clear all, clc, i=1; x(i,:)=[0.01 0.02]; tol=0.001;
f=@(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
J=@(x)[-400*(x(2)-x(1)^2)*x(1)+2*x(1)-2; 200*x(2)-200*x(1)^2];
H=@(x)[1200*x(1)^2-400*x(2)+2 -400*x(1);-400*x(1) 200];

while norm(J(x(i,:)))>tol
d=(inv(H(x(i,:)) + 0.5*eye(2,2))*J(x(i,:)))';
x(i+1,:)=x(i,:)-d;
i=i+1;
end
Iterations=i
x=(x(i,:))' % x=[0.989793 0.9797];
fmax=f(x)
Example: Let 𝑓(𝐱) = √(1 + 𝑥12 ) + √(1 + 𝑥22 ), find the extremum value of 𝑓(𝐱) with the
following starting point 𝐱 = [1 1].

syms x1 x2; f=sqrt(1+x1^2)+sqrt(1+x2^2);


J=jacobian(f,[x1,x2]) % Gradient computation
H=jacobian(J,[x1,x2]) % Hessian computation

clear all, clc, i=1; x(i,:)=[1 1];

f=@(x)sqrt(1+x(1)^2)+sqrt(1+x(2)^2);
J=@(x)[x(1)/sqrt(x(1)^2+1);x(2)/sqrt(x(2)^2+1)];
H=@(x)diag([1/(x(1)^2+1)^1.5,1/(x(2)^2+1)^1.5]);

while abs(x(i,:)*J(x(i,:)))>0.001
d=(inv(H(x(i,:))+0.5*eye(2,2))*J(x(i,:)))';
x(i+1,:)=x(i,:)-d;
i=i+1;
end
Iterations=i
x=(x(i,:))'
fmax=f(x)

Example: Let f(x1, x2) = (x1 − 2)⁴ + (x1 − 2)²·x2² + (x2 + 1)², which has its minimum at x⋆ = (2, −1)^T. The algorithm is started from x0 = (1, 1)^T, and we use the following finite-difference approximations:

g(x) = ∇f(x) ≈ (1/h)·[f(x + h·e1) − f(x); f(x + h·e2) − f(x)],   H(x) = ∇²f(x) ≈ (1/h²)·[H11 H12; H21 H22]

H11 = f(x + 2h·e1) − 2f(x + h·e1) + f(x),   H22 = f(x + 2h·e2) − 2f(x + h·e2) + f(x)

H12 = H21 = f(x + h·e1 + h·e2) − f(x + h·e1) − f(x + h·e2) + f(x).

Before starting the algorithm let us visualize the plot of this surface in space.

clear all, clc, [x1,x2] = meshgrid(-4:0.4:6,-4:0.4:6);


f=(x1- 2).^4 + ((x1- 2).^2).*(x2).^2 + (x2 + 1).^2;
s=surf(x1,x2,f)
direction = [0 0 1];
rotate(s,direction,-25)
After the execution we obtain

Iterations = 9
Jacobian =

-4.0933e-11
2.2080e-09

x =

1.9950
-1.0050

fmax = 5.0001e-05

clear all, clc, i=1; x(i,:)=[1 1]; h=0.01; J=1; tol=0.00001;


f=@(x)(x(1)- 2)^4 + ((x(1)- 2)^2)*(x(2))^2 + (x(2) + 1)^2;

while norm(J)>tol

x1(i,:)= x(i,:) + h*[1 0]; x2(i,:)= x(i,:) + h*[0 1];


J1=(f(x1(i,:)) - f(x(i,:)))/h; J2=(f(x2(i,:)) - f(x(i,:)))/h;
J=[J1;J2]; % gradient computation

x11(i,:)= x(i,:) + 2*h*[1 0];


x12(i,:)= x(i,:) + h*[1 0] + h*[0 1];
x22(i,:)= x(i,:) + 2*h*[0 1];

H(1,1,:)=(f(x11(i,:))-2*f(x1(i,:))+ f(x(i,:)))/h^2;
H(1,2,:)=(f(x12(i,:))-f(x1(i,:))-f(x2(i,:))+f(x(i,:)))/h^2;
H(2,1,:)=(f(x12(i,:))-f(x1(i,:))-f(x2(i,:))+f(x(i,:)))/h^2;
H(2,2,:)=(f(x22(i,:))-2*f(x2(i,:))+f(x(i,:)))/h^2;

x(i+1,:)=x(i,:)-(inv(H)*J)';
i=i+1;

end

Iterations=i
Gradient =J
x=(x(i,:))'
fmax=f(x)
Example: Consider the Freudenstein
and Roth test function

𝑓(𝐱) = (𝑓1 (𝐱))2 + (𝑓2 (𝐱))2 , 𝐱 ∈ ℝ2 ,

Where

𝑓1 (𝐱) = −13 + 𝑥1 + ((5 − 𝑥2 )𝑥2 − 2)𝑥2 ,

𝑓2 (𝐱) = −29 + 𝑥1 + ((𝑥2 + 1)𝑥2 − 14)𝑥2 .

Show that the function f has three stationary points. Find them and prove that one is a
global minimizer, one is a strict local minimum and the third is a saddle point. You
should use the stopping criteria ‖∇𝑓(𝑥)‖ ≤ 10−5. The algorithm should be employed four
times on the following four starting points:

x(i,:)=[-50 7]; x(i,:)=[20 7]; x(i,:)=[20 -18]; x(i,:)=[5 -10];

Solution: Let us first look at the plot of this surface.

clear all, clc, [x1,x2] = meshgrid(-4:0.4:6,-4:0.4:6);


f1=-13+x1+((5-x2).*x2-2).*x2; f2=-29+ x1+((x2+1).*x2-14).*x2;
z=f1.^2+f2.^2; s=surf(x1,x2,z)
direction = [0 0 1]; rotate(s,direction,35)

Also we can use MATLAB code to see the Gradient and the Hessian of this function.

syms x1 x2; f1=-13+x1+((5-x2)*x2-2)*x2; f2=-29+ x1+((x2+1)*x2-14)*x2;


f=f1^2+f2^2;
J=jacobian(f,[x1,x2]) % gradient computation
H=jacobian(J,[x1,x2]) % Hessian computation

When we run the program from each of these starting points, we obtain the stationary points that were asked for.

Example: Given the Rosenbrock function f(x) = 100(x2 − x1²)² + (1 − x1)², find the solution of f(x) = 0 using only the gradient (i.e. without use of the Hessian).

It is very well known that

f(x + δ) − f(x) = (∇f(x))^T δ + ½ δ^T H(x + tδ) δ + O(‖δ‖³) = (∇f(x))^T δ + O(‖δ‖²)

Let us take the first-order approximation f(x + δ) − f(x) = (∇f(x))^T δ. Iterating this equation we obtain f(x_{k+1}) − f(x_k) = (∇f(x_k))^T Δx_k, and assuming that when k goes to infinity the solution is reached, i.e. f(x_{k+1}) = 0, we get

−f(x_k) = (∇f(x_k))^T Δx_k ⟺ Δx_k = −(∇f(x_k)(∇f(x_k))^T)⁻¹ ∇f(x_k) f(x_k)

In order to avoid singularity in the matrix inversion, we add a regularization term:

Δx_k = x_{k+1} − x_k = −(∇f(x_k)(∇f(x_k))^T + λI)⁻¹ ∇f(x_k) f(x_k),   0 < λ < 1

clear all, clc, i=1; x(i,:)=[0 0.02]; delta=[1;1]; tol=0.001;


f=@(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
J=@(x)[-400*(x(2)-x(1)^2)*x(1)+2*x(1)-2; 200*x(2)-200*x(1)^2];
while norm(delta)>tol
delta=-inv(J(x(i,:))*(J(x(i,:)))'+ 0.7*eye(2,2))*J(x(i,:));
x(i+1,:)=x(i,:) + (delta)'*f(x(i,:));
i=i+1;
end
Iterations=i
x=(x(i,:))'
fmax=f(x)

Comparing the previous result with the Newton update x_{k+1} = x_k − (H(x))⁻¹ g(x), it can be observed that

x_{k+1} = x_k − (∇f(x_k)(∇f(x_k))^T + λI)⁻¹ ∇f(x_k) f(x_k) ≅ x_k − (H(x))⁻¹ ∇f(x_k)
⟹ H(x) ≅ (∇f(x_k)(∇f(x_k))^T + λI)/f(x_k)

It practically means that, once the first derivatives are computed, we can also compute part of the Hessian matrix for the same computational cost. The possibility of computing the Hessian matrix "for free" once the Jacobian (i.e. the gradient) is available represents a distinctive feature of least-squares problems. This approximation is adopted in many applications as it provides an evaluation of the Hessian matrix without computing any second derivatives of the objective function.

Example: Given f(x) = (x1² − 2x2)·e^(−x1² − x2² − x1·x2), find the solution of f(x) = 0 using only the gradient (i.e. without use of the Hessian). Here in this example we will use the approximate value of ∇f(x) rather than the analytic one.

clear all, clc, i=1; x(i,:)=[0.1,0.2]; delta=[1;1]; h=0.0001;


f=@(x)(x(1)^2-2*x(2))*exp(-x(1)^2-x(2)^2-x(1)*x(2)); tol=0.001;
% f=@(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
while norm(delta)>tol
x1(i,:)= x(i,:) + h*[1 0]; x2(i,:)= x(i,:) + h*[0 1];
J1=(f(x1(i,:)) - f(x(i,:)))/h; J2=(f(x2(i,:)) - f(x(i,:)))/h;
J=[J1;J2];
delta=-inv(J*J'+ 0.7*eye(2,2))*J;
x(i+1,:)=x(i,:) + (delta)'*f(x(i,:)); i=i+1;
end
Iterations=i
x=(x(i,:))'
fmax=f(x)
Example: Given f(x) = x1·e^(−x1² − x2²), find the minimizer of f(x) using the approximate values of the gradient and the Hessian.

Before starting the algorithm let us visualize the plot of this surface in space.

[x,y] = meshgrid([-2:.25:2]);
z = x.*exp(-x.^2-y.^2);
% Plotting the Z-values of the function on which the level
% sets have to be projected
z1 = x.^2+y.^2;
% Plot your contour
[cm,c]=contour(x,y,z,30);
% Plot the surface on which the level sets have to be projected
s=surface(x,y,z1,'EdgeColor',[.8 .8 .8],'FaceColor','none')
% Get the handle to the children i.e the contour lines of the contour
cv=get(c,'children');
% Extract the (X,Y) for each of the contours and recalculate the
% Z-coordinates on the surface on which to be projected.
for i=1:length(cv)
cc = cv(i);
xd=get(cc,'XData');
yd=get(cc,'Ydata');
zd=xd.^2+yd.^2;
set(cc,'Zdata',zd);
end
grid off
view(-15,25)
colormap cool

After the execution of the program


found in the next page we obtain
Iterations = 11

Jacobian =

-4.1814e-06
-8.6410e-06

x =

1.8844
3.4178

fmax = 4.5702e-07
clear all, clc, i=1; x(i,:)=[1 1]; h=0.01; J=1; tol=0.00001;
f=@(x)x(1)*exp(-x(1)^2-x(2)^2);
while norm(J)>tol

x1(i,:)= x(i,:) + h*[1 0]; x2(i,:)= x(i,:) + h*[0 1];


J1=(f(x1(i,:)) - f(x(i,:)))/h; J2=(f(x2(i,:)) - f(x(i,:)))/h;
J=[J1;J2];

x11(i,:)= x(i,:) + 2*h*[1 0];


x12(i,:)= x(i,:) + h*[1 0] + h*[0 1];
x22(i,:)= x(i,:) + 2*h*[0 1];

H(1,1,:)=(f(x11(i,:))-2*f(x1(i,:))+ f(x(i,:)))/h^2;
H(1,2,:)=(f(x12(i,:))-f(x1(i,:))-f(x2(i,:))+f(x(i,:)))/h^2;
H(2,1,:)=(f(x12(i,:))-f(x1(i,:))-f(x2(i,:))+f(x(i,:)))/h^2;
H(2,2,:)=(f(x22(i,:))-2*f(x2(i,:))+f(x(i,:)))/h^2;

x(i+1,:)=x(i,:)-(inv(H)*J)';
i=i+1;

end
Iterations=i
Jacobian=J
x=(x(i,:))'
fmax=f(x)

Example: Develop the Taylor series of a two-variable objective function f(x1, x2) with an error of O(‖δ‖³):

f(x1 + δ1, x2 + δ2) = f(x1, x2) + (∂f/∂x1·δ1 + ∂f/∂x2·δ2) + ½·(∂²f/∂x1²·δ1² + 2·∂²f/∂x1∂x2·δ1δ2 + ∂²f/∂x2²·δ2²) + O(‖δ‖³)

= f(x1, x2) + [∂f/∂x1, ∂f/∂x2]·[δ1; δ2] + ½·[δ1, δ2]·[∂²f/∂x1², ∂²f/∂x1∂x2; ∂²f/∂x1∂x2, ∂²f/∂x2²]·[δ1; δ2] + O(‖δ‖³)

In compact form we can write f(x + δ) − f(x) = (∇f(x))^T δ + ½ δ^T H(x) δ + O(‖δ‖³).

Let F: ℝ^n ⟶ ℝ^m be continuously differentiable in the open convex set D ⊂ ℝ^n. The practical problem in the vector case is to solve simultaneously the set of nonlinear equations F(x) = 0. We have seen before that

F(x + δ) ≈ F(x) + J(x)δ ⟺ F(x_k + δ_k) ≈ F(x_k) + J(x_k)δ_k

When k → ∞, assume that x_{k+1} = x_k + δ_k is a solution of F(x) = 0, that is F(x_{k+1}) = 0. Then

δ_k = x_{k+1} − x_k = −(J(x_k))⁻¹ F(x_k) ⟹ x_{k+1} = x_k − (J(x_k))⁻¹ F(x_k)

In most practical problems the analytic expression of the Jacobian matrix J(x_k) is not available, so we need to approximate it by finite-difference methods. In the next program we will see how to do this. To make things clearer, consider the following second-order system F: ℝ² ⟶ ℝ²:

J(x_k) = [∂f1/∂x1, ∂f1/∂x2; ∂f2/∂x1, ∂f2/∂x2] ≈ (1/h)·[f1(x + h·e1) − f1(x), f1(x + h·e2) − f1(x); f2(x + h·e1) − f2(x), f2(x + h·e2) − f2(x)]

Example: write a MATLAB code to solve the following nonlinear system of equations

f1=@(x)(x(1)^2+x(2)^2 -1); f2=@(x)(x(1)^2-x(2));

using the approximate method and take x(i,:)= [0.1 0.2] as starting point.

% Solve the nonlinear system F(x) = 0 using Newton's method


% Vectors x and x0 are row vectors (for display purposes)
% function F returns a column vector, [f1(x), ..fn(x)]'
% stop if norm of change in solution vector is less than tol
% solve J(x)y = - F(x) using Matlab solver
clear all, clc, i=1; x(i,:)= [0.1 0.2]; tol=0.0001; maxit=100; h=0.01;
f1=@(x)(x(1)^2+x(2)^2 -1); f2=@(x)(x(1)^2-x(2)); dif=1;

while (dif >= tol) && (i<=maxit)


x1(i,:)= x(i,:) + h*[1 0]; x2(i,:)= x(i,:) + h*[0 1];
J11=(f1(x1(i,:)) - f1(x(i,:)))/h; J12=(f1(x2(i,:)) - f1(x(i,:)))/h;
J21=(f2(x1(i,:)) - f2(x(i,:)))/h; J22=(f2(x2(i,:)) - f2(x(i,:)))/h;
J=[J11 J12;J21 J22]; F=[f1(x(i,:));f2(x(i,:))];

x(i+1,:) = x(i,:) -(inv(J)*F)';


dif = norm(x(i+1,:) - x(i,:));
i = i + 1;
end
x, Iterations=i, F

x =

    0.1000    0.2000
    3.5715    0.7390
    1.8755    0.6244
    1.1046    0.6181
    0.8333    0.6180
    0.7878    0.6180
    0.7862    0.6180
    0.7862    0.6180

Iterations = 8

F =

   1.8483e-05
   1.8483e-05

In many optimization problems we may come across the problem of singularity in the Jacobian matrix; to cope with this, we use the Gauss–Newton method:

x_{k+1} = x_k − (J^T(x_k) J(x_k))⁻¹ J^T(x_k) F(x_k)
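As a small sketch, the previous two-equation example can be solved with this Gauss–Newton step (the starting point, stepsize and tolerance are the same illustrative values used before):

% Gauss-Newton iteration for F(x) = 0 with a finite-difference Jacobian
clear all, clc, i=1; x(i,:)=[0.1 0.2]; tol=1e-6; maxit=100; h=0.01;
f1=@(x)(x(1)^2+x(2)^2-1); f2=@(x)(x(1)^2-x(2)); dif=1;
while (dif >= tol) && (i <= maxit)
    xa=x(i,:)+h*[1 0]; xb=x(i,:)+h*[0 1];
    J=[(f1(xa)-f1(x(i,:)))/h (f1(xb)-f1(x(i,:)))/h
       (f2(xa)-f2(x(i,:)))/h (f2(xb)-f2(x(i,:)))/h];
    F=[f1(x(i,:)); f2(x(i,:))];
    x(i+1,:) = x(i,:) - (inv(J'*J)*J'*F)';      % Gauss-Newton step
    dif = norm(x(i+1,:) - x(i,:)); i = i + 1;
end
x(end,:), Iterations=i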
 If equations have one or two unknowns, graphical methods can be used to solve
them. If there are too many unknowns, it is not suitable to use graphical methods.
Other methods should be tried. In this section, graphical solution methods for
equations with one or two unknowns are proposed, and advantages and disadvantages
of graphical methods are summarized.

Example: Use MATLAB to visualize the intersection of the following surfaces centered at the origin (assuming that the parameters are specified):

(x/a)² + (y/b)² + (z/c)² = d
z = αx² + βy² + γ
x² + y² + z² = λ

clear all, clc, a=3; b=2; c=1; imax=50; jmax=50;

for i=1:imax+1
theta=2*pi*(i-1)/ imax;
for j=1:jmax+1
phi =2*pi*(j-1)/ jmax;
x(i,j) = a*cos(theta); y(i,j) = b*sin(theta)*cos(phi);
z(i,j) = c*sin(theta)*sin(phi);
end
end
s=surf(x,y,z), hold on % Plot of ellipsoid

a=0.5; b=-1; c=1; imax=50; jmax=50;


for i=1:imax+1
xi=-2+4*(i-1)/imax;
for j=1:jmax+1
eta =-2+4*(j-1)/jmax;
x(i,j) = xi; y(i,j) = eta;
z(i,j) = a*x(i,j)^2+b*y(i,j)^2+c;
end
end
s=surf(x,y,z), hold on % Plot of hyperbolic surface

a=2; b=2; c=2; imax=50; jmax=50;


for i=1:imax+1
theta=2*pi*(i-1)/ imax;
for j=1:jmax+1
phi =2*pi*(j-1)/ jmax;
x(i,j) = a*cos(theta); y(i,j) = b*sin(theta)*cos(phi);
z(i,j) = c*sin(theta)*sin(phi);
end
end
s=surf(x,y,z), axis equal % Plot of sphere
% Intersection of Three Surfaces
clear all, clc, i=1; x(i,:)=[1 1 1]; tol=1.e-6; maxit=100; dif=1;
f1=@(x)(x(1)^2+x(2)^2+x(3)^2-14);
f2=@(x)(x(1)^2+2*x(2)^2-9);
f3=@(x)(x(1)-3*x(2)^2+ x(3)^2);

while (dif >= tol) && (i<=maxit)


J=[2*x(i,1) 2*x(i,2) 2*x(i,3); 2*x(i,1) 4*x(i,2) 0; 1 -6*x(i,2) 2*x(i,3)];
F=[f1(x(i,:));f2(x(i,:));f3(x(i,:))];

x(i+1,:) = x(i,:)-(inv(J)*F)';
dif = norm(x(i+1,:)-x(i,:));
i = i + 1;
end
x, Iterations=i, F

The main disadvantage of Newton's method, even


when regularized to ensure global convergence, is the need to calculate 𝑛(𝑛 + 1)/2
second derivatives. Hence, it may be better to approximate the Hessian matrix using
the value of the function and the gradient vector 𝐠(𝐱 𝑘 ). The simplest technique is to use
a finite difference approximation. Once again this matrix may not be positive definite,
thus requiring modifications as discussed earlier. Further, evaluating the finite
difference approximation requires 𝑛 + 1 evaluations of the gradient vector, which could
be very expensive.
The above disadvantages can be avoided if some updating procedure, similar to that in Broyden's method, can be given for the Hessian matrix or its inverse, such that the matrix is assured to be positive definite.

The idea behind quasi-Newton methods is to construct an approximation of the inverse Hessian using information gathered during the process. In this section we will show how the inverse Hessian can be built up from gradient information obtained at various points:

x_{k+1} = x_k − B_k g(x_k),   where B_k is an approximation of (H(x_k))⁻¹

To avoid the computation of (H(x_k))⁻¹, the quasi-Newton methods use such an approximation B_k in place of the true inverse. Let B0, B1, B2, … be successive approximations of the inverse of the Hessian.

Suppose first that the Hessian matrix of the objective function is constant and independent of x_i for 0 ≤ i ≤ k; in other words, the objective function is quadratic, with Hessian H(x) = Q for all x, where Q = Q^T. Then the gradients at successive iterates satisfy

g(x_{k+1}) − g(x_k) = Q(x_{k+1} − x_k) ⟺ Δg(x_k) = Q·Δx_k ⟺ Δg(x_k) = Q·p_k ⟺ p_k^T Δg(x_k) = p_k^T Q p_k

We start with a real symmetric positive definite matrix 𝐁0 . Note that given k, the
matrix 𝑸−1 satisfies
𝑸−1 ∆𝐠(𝐱𝑖 ) = ∆𝐱 𝑖 0 ≤ 𝑖 ≤ 𝑘

Therefore, we also impose the requirement that the approximation of the inverse Hessian satisfy

B_{k+1} Δg(x_i) = Δx_i = p_i,   0 ≤ i ≤ k

If n steps are involved, then moving in n directions p0, p1, p2, …, p_{n−1} yields

B_n Δg(x_0) = p_0
B_n Δg(x_1) = p_1
⋮
B_n Δg(x_{n−1}) = p_{n−1}

If we define 𝒒𝑘 = ∆𝐠(𝐱𝑘 ) then this set of equations can be represented as

𝐁𝑛 [𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑛−1 ] = [𝒑0 , 𝒑1 , … 𝒑𝑛−1 ]

Note that 𝑸 satisfies: 𝑸−1 [𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑛−1 ] = [𝒑0 , 𝒑1 , … 𝒑𝑛−1 ]. Therefore, if the matrix
[𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑛−1 ] is nonsingular, then 𝑸−1 is determined uniquely after n steps, via

𝑸−1 = 𝐁𝑛 = [𝒑0 , 𝒑1 , … 𝒑𝑛−1 ][𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑛−1 ]−1

This means that if n linearly independent directions 𝒑𝑖 and corresponding 𝒒𝑖 are known,
then 𝑸−1 is uniquely determined.
We will construct successive approximations 𝐁𝑘 to 𝑸−1 based on data obtained from the
first k steps such that:
𝐁𝑘+1 = [𝒑0 , 𝒑1 , … 𝒑𝑘 ][𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑘 ]−1

After n linearly independent steps we would then have 𝐁𝑛 = 𝑸−1 . We want an update on
𝐁𝑘 such that:
𝐁𝑘+1 = [𝒑0 , 𝒑1 , … 𝒑𝑘 ][𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑘 ]−1

Let us find the update in this form [Rank one correction] 𝐁𝑘+1 = 𝐁𝑘 + 𝛼𝑘 𝐮𝑘 𝐮𝑇𝒌 . We need a
good 𝛼𝑘 ∈ ℝ and good 𝐮𝑘 ∈ ℝ𝑛 .

Theorem Given k + 1 linearly independent directions p0, p1, p2, …, p_k with corresponding gradient differences q0, q1, q2, …, q_k such that B_{k+1}[q0, q1, …, q_k] = [p0, p1, …, p_k], and if an update of rank one B_{k+1} = B_k + α_k u_k u_k^T is carried out, then

B_{k+1} = B_k + (p_k − B_k q_k)(p_k − B_k q_k)^T / (q_k^T (p_k − B_k q_k))

Proof: We already know that B_{k+1}[q0, q1, …, q_k] = [p0, p1, …, p_k] and B_{k+1} = B_k + α_k u_k u_k^T. Therefore,

p_k = B_{k+1} q_k = (B_k + α_k u_k u_k^T) q_k = B_k q_k + α_k u_k u_k^T q_k ⟹ p_k − B_k q_k = α_k u_k u_k^T q_k

(p_k − B_k q_k)(p_k − B_k q_k)^T / α_k = (α_k u_k u_k^T q_k)(q_k^T u_k u_k^T) = α_k u_k (u_k^T q_k)² u_k^T

which can be written as

(p_k − B_k q_k)(p_k − B_k q_k)^T / (α_k (u_k^T q_k)²) = α_k u_k u_k^T = B_{k+1} − B_k

If we multiply p_k − B_k q_k = α_k u_k u_k^T q_k on the left by q_k^T we get

q_k^T p_k = q_k^T B_k q_k + α_k q_k^T u_k u_k^T q_k = q_k^T B_k q_k + α_k (q_k^T u_k)² ⟹ α_k (u_k^T q_k)² = q_k^T (p_k − B_k q_k)

Finally, replacing this result in the update formula, we obtain

B_{k+1} − B_k = (p_k − B_k q_k)(p_k − B_k q_k)^T / (q_k^T (p_k − B_k q_k)) ■

Algorithm: [Modified Newton method with rank-1 correction]

begin: k = 1 : n (until convergence)
    x_{k+1} = x_k − α_k B_k g(x_k), with g(x_k) = ∇f(x_k)
    α_k = arg min_γ f(x_k + γ p_k), along the search direction p_k = −B_k g(x_k)
    p_k = x_{k+1} − x_k and q_k = g(x_{k+1}) − g(x_k)
    B_{k+1} = B_k + (p_k − B_k q_k)(p_k − B_k q_k)^T / (q_k^T (p_k − B_k q_k))
end

Remark: The scalar α_k is the smallest nonnegative value of α that locally minimizes f along the search direction p_k starting from x_k. There are many alternative line-search rules to choose α_k along the ray S_k = {x_{k+1} = x_k + α·p_k | α > 0}, namely the Armijo rule, the Goldstein rule, the Wolfe rule, the strong Wolfe rule, etc. We do not study these rules in detail here; a sketch of a simple backtracking rule is given below.
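As an illustration of such a rule, a backtracking (Armijo) line search can be sketched as follows; the constants rho and c and the test function are illustrative choices, not prescribed by the text (the programs below embed a similar loop):

% Backtracking (Armijo) line search sketch: shrink alpha until
% f(x + alpha*p) <= f(x) + c*alpha*g(x)'*p   (sufficient decrease)
f = @(x) x(1)^2 + x(1)*x(2) + 3*x(2)^2;        % illustrative objective
g = @(x) [2*x(1) + x(2); x(1) + 6*x(2)];       % its gradient
x = [3; 3]; p = -g(x);                          % a descent direction
alpha = 1; rho = 0.5; c = 1e-4;                 % typical constants
while f(x + alpha*p) > f(x) + c*alpha*g(x)'*p
    alpha = rho*alpha;                          % shrink the step
end
alpha                                           % accepted step length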

clc;clear; x(1,:)= [0.1 0.2]; B = eye(2,2); p= [0.1;0.2]; q= [1;2];


% objective function, its gradient and Hessian
f = @(x1,x2) -4*x1 - 2*x2 - x1.^2 + 2*x1.^4 - 2*x1.*x2 + 3*x2.^2;
g = @(x)[-4-2*x(1)+8*x(1)^3-2*x(2); -2-2*x(1)+6*x(2)];
%Hessian = @(x) [-2+24*x(1)^2, -2; -2, 6];
i=1; tol=0.001; alpha=0.19; % fixed step size (i.e. not optimal one)
while norm(g(x(i,:)))>=tol
x(i+1,:) = x(i,:) - alpha*(B*g(x(i,:)))';
B = B + ((p-B*q)*(p-B*q)')/(q'*(p-B*q));
p = (x(i+1,:)- x(i,:))';
q = g(x(i+1,:)) - g(x(i,:));
i=i+1;
end
x(end,:)
q
i
% plot contour lines
f = @(x1,x2) -4*x1 - 2*x2 - x1.^2 + 2*x1.^4 - 2*x1.*x2 + 3*x2.^2;
[x, y] = meshgrid(-0.25:0.01:1.75, -0.25:0.0025:1.75);
contour(x,y,f(x,y),[-4.34 -4.3 -4.2 -4.1 -4.0 -3 -2 -1 0],'ShowText','On'), hold on; grid on;
In numerical optimization, the
Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving
unconstrained nonlinear optimization problems. The BFGS method belongs to quasi-
Newton methods, a class of hill-climbing optimization techniques that seek a stationary
point of a (preferably twice continuously differentiable) function. For such problems, a
necessary condition for optimality is that the gradient be zero. Newton's method and
the BFGS methods are not guaranteed to converge unless the function has a quadratic
Taylor expansion near an optimum. However, BFGS can have acceptable performance
even for non-smooth optimization instances.

In Quasi-Newton methods, the Hessian matrix of second derivatives is not computed.


Instead, the Hessian matrix is approximated using updates specified by gradient
evaluations (or approximate gradient evaluations). The BFGS method is one of the most
popular members of this class.

The optimization problem is to minimize 𝑓(𝐱), where 𝐱 is a vector in ℝ𝑛 , and 𝑓 is a


differentiable scalar function. There are no constraints on the values that 𝐱 can take.
The algorithm begins at an initial estimate for the optimal value 𝐱 0 and proceeds
iteratively to get a better estimate at each stage.

The search direction p_k at stage k is given by the solution of the analogue of the Newton equation H_k p_k = −∇f(x_k), where H_k is an approximation to the Hessian matrix, which is updated iteratively at each stage, and ∇f(x_k) is the gradient of the function evaluated at x_k. A line search in the direction p_k is then used to find the next point x_{k+1} by minimizing f(x_k + γ p_k) over the scalar γ > 0. The quasi-Newton condition imposed on the update of H_k is

∇f(x_{k+1}) − ∇f(x_k) = H_{k+1}(x_{k+1} − x_k)

Let 𝒚𝑘 = ∇𝑓(𝐱 𝑘+1 ) − ∇𝑓(𝐱 𝑘 ) and 𝒔𝑘 = 𝐱 𝑘+1 − 𝐱 𝑘 then 𝑯𝑘+1 satisfies 𝑯𝑘+1 𝒔𝑘 = 𝒚𝑘 which is the
secant equation. The curvature condition 𝒔𝑇𝑘 𝑯𝑘+1 𝒔𝑘 = 𝒔𝑇𝑘 𝒚𝑘 > 0 should be satisfied for
𝑯𝑘+1 to be positive definite. If the function is not strongly convex, then the condition has
to be enforced explicitly.

Instead of requiring the full Hessian matrix at the point x_{k+1} to be computed as H_{k+1}, the approximate Hessian at stage k is updated by the addition of two matrices:

H_{k+1} = H_k + U_k + V_k = H_k + α(u_k u_k^T) + β(v_k v_k^T)

Both U_k and V_k are symmetric rank-one matrices, but their sum is a rank-two update matrix. Imposing the secant condition H_{k+1} s_k = y_k and choosing u_k = y_k and v_k = H_k s_k, we obtain:

α = 1/(y_k^T s_k),   β = −1/(s_k^T H_k s_k)

Finally, we substitute α and β into H_{k+1} = H_k + α u_k u_k^T + β v_k v_k^T and get the update equation for H_{k+1}:

H_{k+1} = H_k + (y_k y_k^T)/(y_k^T s_k) − (H_k s_k s_k^T H_k^T)/(s_k^T H_k s_k)

Algorithm: [BFGS algorithm with rank-2 correction]

Initialization: x0 and H0
begin: k = 1 : n (until convergence)
    get p_k by solving H_k p_k = −∇f(x_k)
    get α_k such that α_k = arg min_γ f(x_k + γ p_k)
    s_k = α_k p_k and update x_{k+1} = x_k + s_k
    y_k = ∇f(x_{k+1}) − ∇f(x_k)
    H_{k+1} = H_k + (y_k y_k^T)/(y_k^T s_k) − (H_k s_k s_k^T H_k^T)/(s_k^T H_k s_k)
end

The functional f(x_k) denotes the objective function to be minimized. Convergence can be checked by observing the norm of the gradient, ‖∇f(x_k)‖₂ ≤ ε. In order to avoid the inversion of H_k at each step we apply the Sherman–Morrison formula

(A + uv^T)⁻¹ = A⁻¹ − (A⁻¹ u v^T A⁻¹)/(1 + v^T A⁻¹ u)

(once for each rank-one term) and, with B_k ≈ H_k⁻¹, we get

B_{k+1} = (I − (s_k y_k^T)/(s_k^T y_k)) B_k (I − (y_k s_k^T)/(s_k^T y_k)) + (s_k s_k^T)/(s_k^T y_k)
        = B_k + ((s_k^T y_k + y_k^T B_k y_k)(s_k s_k^T))/(s_k^T y_k)² − (B_k y_k s_k^T + s_k y_k^T B_k)/(s_k^T y_k)

Algorithm: [BFGS algorithm with rank-2 correction, without inversion]

Initialization: x0 and B0
begin: k = 1 : n (until convergence)
    get p_k from p_k = −B_k ∇f(x_k)
    get α_k such that α_k = arg min_γ f(x_k + γ p_k)
    s_k = α_k p_k and update x_{k+1} = x_k + s_k
    y_k = ∇f(x_{k+1}) − ∇f(x_k)
    B_{k+1} = B_k + ((s_k^T y_k + y_k^T B_k y_k)(s_k s_k^T))/(s_k^T y_k)² − (B_k y_k s_k^T + s_k y_k^T B_k)/(s_k^T y_k)
end

Remark: In general, the finite difference approximations of the Hessian are more
expensive than the secant condition updates. (Walter Gander and Martin J Gander)
clear all, clc, tol=10^-4; x(:,1)= [0.8624 0.1456]; z=[]; B=eye(2,2);
f=@(x)x(1)^2-x(1)*x(2)-3*x(2)^2+5; J=@(x)[2*x(1)-x(2);-x(1)-6*x(2)];
% f=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2-10; J=@(x)[4*x(1);6*x(2);8*x(3)];
% f=@(x)3*sin(x(1))+exp(x(2)); J=@(x)[3*cos(x(1));exp(x(2))];
i=1; %matlab starts counting at 1
while and(norm(J(x(:,i)))>0.001,i<500)
p(:,i)=-B*J(x(:,i));
%------------------------------------------------------------%
% armijo method for alpha determination
alp=0.01; % initial step
rho=0.01; c=0.01; % rho and c are in (0,1);
x1=x(:,i); x2=x(:,i)+alp*p(:,i);
f2=f(x2); f1=f(x(:,i));
while and(f2>f1+c*alp*(J(x(:,i)))'*p(:,i), alp>10^-6);
alp=rho*alp;
f2=f(x(:,i)+alp*p(:,i));
end
%------------------------------------------------------------%
s=alp*p(:,i); x(:,i+1)=x(:,i) + s; y=J(x(:,i+1))-J(x(:,i));
B = B + ((s'*y + y'*B*y)/(s'*y)^2)*(s*s') -(B*y*s'+ (s*y')*B)/(s'*y);
i=i+1;
end
x(:,end), fmax=f(x(:,end)), Gradient=J(x(:,end))

Gradient descent is a first-order iterative


optimization algorithm for finding a local minimum of a differentiable function. To find
a local minimum of a function using gradient descent, we take steps proportional to the
negative of the gradient (or approximate gradient) of the function at the current point.

Gradient descent is based on the observation that if the scalar multi-variable function
𝑓(𝐱) is defined and differentiable in a neighborhood of a point 𝒂, then 𝑓(𝐱) decreases
fastest if one goes from 𝒂 in the direction of the negative gradient of 𝑓(𝐱) at 𝒂, −∇𝑓 (𝒂).
It follows that, if a_{k+1} = a_k − α∇f(a_k) for α ∈ ℝ small enough, then f(a_k) ≥ f(a_{k+1}). In other words, the term α∇f(a_k) is subtracted from a_k because we want to move against the gradient, toward the local minimum.

With this observation in mind, one starts with a guess 𝐱 0 for a local minimum of 𝑓(𝐱),
and considers the sequence 𝐱1 , 𝐱 2 , …, such that 𝐱 𝑘+1 = 𝐱 𝑘 − 𝛼𝑘 ∇𝑓(𝐱 𝑘 ).

Note that the value of the step size α_k is allowed to change at every iteration. With certain assumptions on the function f(x) (for example, f(x) convex and ∇f(x) Lipschitz) and particular choices of α_k, convergence to a local minimum can be guaranteed. The step can be chosen, for instance, by a line search satisfying the Wolfe conditions, by the Barzilai–Borwein rule, or (for a quadratic model) by the exact formula:

α_k = (x_k − x_{k−1})^T(∇f(x_k) − ∇f(x_{k−1})) / ‖∇f(x_k) − ∇f(x_{k−1})‖²   or   α_k = (∇f(x_k))^T ∇f(x_k) / ((∇f(x_k))^T H(x_k) ∇f(x_k))
Algorithm: [Gradient descent algorithm]
Initialization: x0 and α0
begin: k = 1 : n (until convergence)
    α_k = (∇f(x_k))^T ∇f(x_k) / ((∇f(x_k))^T H(x_k) ∇f(x_k))
    x_{k+1} = x_k − α_k ∇f(x_k)
end
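A sketch of gradient descent with the Barzilai–Borwein step mentioned above (the quadratic test function, starting point and initial step are illustrative):

% Gradient descent with a Barzilai-Borwein step size (sketch)
f = @(x) x(1)^2 + x(1)*x(2) + 3*x(2)^2;
g = @(x) [2*x(1) + x(2); x(1) + 6*x(2)];
x = [3; 3]; a = 0.01; xold = x; gold = g(x);
for k = 1:200
    x = x - a*g(x);                              % descent step
    s = x - xold; y = g(x) - gold;               % differences of iterates/gradients
    a = (s'*y)/(y'*y);                           % Barzilai-Borwein step length
    xold = x; gold = g(x);
    if norm(g(x)) < 1e-8, break, end
end
x, f(x)                                          % converges to the minimizer (0,0)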

Understanding gradient descent


Suppose you are on the peak of a
mountain and you want to reach a
lake which is at the lowest point of
the mountain. Which way will you
move???

The simplest way would be to


check at the point you are
standing, find the way ground is
descending the most and start
moving that way. There are high
chances that this path will lead
you to the lake. This is what is
depicted in the picture above.

Graphically it could be visualized as follows: the peaks (red areas) represent regions with high cost, whereas the lowest points (blue areas) are regions with minimum cost or loss. In any optimization or deep learning problem, we try to find a model function whose predictions have the least loss in comparison to the actual values.

Suppose our model function has two parameters; then, mathematically, we wish to find the optimal values of the parameters θ1 and θ2 that would minimize our loss. The loss surface J(θ) shown in the figure above tells us how our algorithm would perform if we chose a particular value for a parameter. Here θ1 and θ2 are our x and y axes, while the loss is plotted along the z axis. The gradient descent rule states that the direction in which we should move should be at 180 degrees to the gradient, in other words moving opposite to the gradient.

clear all, clc, i=1; x(i,:)=[3 3]; a=0.01;


f=@(x)x(1)^2+ x(1)*x(2)+3*x(2)^2; z=@(x1,x2)x1.^2 + x1.*x2 + 3*x2.^2;
J=@(x)[2*x(1)+ x(2);x(1)+6*x(2)]; H=@(x)[2 1;1 6];
while norm(J(x(i,:)))>0.000001
x(i+1,:)=x(i,:)-a*(J(x(i,:)))';
a= (J(x(i,:)))'*J(x(i,:))/((J(x(i,:)))'*H(x(i,:))*J(x(i,:)));
%ezcontour(z,[-5 5 -5 5]); axis equal; hold on
%plot([x(i+1,1) x(i,1)], [x(i+1,2) x(i,2)],'ko-'); hold on; refresh
i=i+1;
end
Iterations=i
Gradient=J(x(end,:))
x=(x(i,:))'
fmax=f(x)

 A case of remarkable interest, where the parameter α_k can be exactly computed, is the problem of minimizing the quadratic function

f(x_k + δ) = f(x_k) + (∇f(x_k))^T δ + ½ δ^T H(x_k) δ

At a fixed instant of time x_k = constant, this quadratic function can be considered as

φ(δ) = ½ δ^T A δ + b^T δ + c,   with A = H(x_k), b = ∇f(x_k) and c = f(x_k)
2

In order to minimize the new objective function we consider the gradient ∇φ(δ) = Aδ + b. In terms of x_{k+1} we obtain ∇φ(x_{k+1}) = ∇f(x_{k+1}) = A x_{k+1} + b. As a consequence, all gradient-like iterative methods developed in the previous chapter for linear systems can be extended to solve nonlinear minimization problems.

In particular, having fixed a descent direction p_k = (x_{k+1} − x_k)/α_k, we can determine the optimal value of the acceleration parameter α_k in such a way as to find the point where the function f, restricted to the direction p_k, is minimized: α_k = arg min_α f(x_k + α p_k). Setting the directional derivative to zero, we get:

0 = d/dα_k f(x_k + α_k p_k) = Σ_{i=1}^{n} (∂f/∂x_i)(x_k + α_k p_k) · ∂/∂α_k (x_k(i) + α_k p_k(i)) = (∇f(x_k + α_k p_k))^T p_k

But we have seen that ∇f(x_k + α_k p_k) = A(x_k + α_k p_k) + b = (A x_k + b) + α_k A p_k. Therefore

d/dα_k f(x_k + α_k p_k) = (α_k p_k^T A + (A x_k + b)^T) p_k = 0

If we now define r_k = −(A x_k + b) we obtain

α_k = (r_k^T p_k)/(p_k^T A p_k)
The plain gradient method is not very popular due to its slow convergence. In order to increase the speed of convergence, a correction of the search direction has been proposed, called the conjugate gradient algorithm (see the details in the previous chapter).

The conjugate direction or conjugate gradient method only requires a simple


modification of the gradient method, with a remarkable increase in the convergence
rate. It is as simple to program as the gradient method. Fletcher Reeves (FR) extends
the linear conjugate gradient method to nonlinear functions by incorporating two
changes:

■ For the step length 𝛼𝑘 , (which minimizes f along the search direction 𝒑𝑘 ), we perform
a line search that identifies the approximate minimum of the nonlinear function f along
the search direction 𝒑𝑘 .

■ The residual r_k (r_k = −(b + A·x_k)), which in the quadratic case is the negative gradient of the function f, has to be replaced by the negative gradient of the nonlinear objective function.

Remark: An appropriate step length effecting sufficient decrease could be chosen from
one of the various known methods such as the Armijo, the Goldstein or the Wolfe’s
conditions. Moreover, if f is a strongly convex quadratic function and 𝛼𝑘 is the exact
minimizer of the function f, then the FR algorithm becomes specifically the linear
conjugate gradient algorithm.

The conjugate direction is a slight modification of the steepest-descent direction, (p_k − β_k p_{k−1}) = −∇f(x_k), i.e.

p_k = −∇f(x_k) + β_k p_{k−1},   with the iterate updated as x_{k+1} = x_k + α_k p_k

and the scalar β_k is given by the equation β_k = |∇f(x_k)|²/|∇f(x_{k−1})|², where k is the iteration index. A MATLAB implementation of the conjugate gradient method is given below.

Algorithm: [Nonlinear conjugate gradient (Fletcher–Reeves) algorithm]

Initialization: x0, p0 = −∇f(x0) and α0
begin: k = 1 : n (until convergence)
    get α_k such that α_k = arg min_α f(x_k + α p_k)
    x_{k+1} = x_k + α_k p_k
    β_{k+1} = |∇f(x_{k+1})|²/|∇f(x_k)|²
    make the update p_{k+1} = −∇f(x_{k+1}) + β_{k+1} p_k
end
clear all, clc, tol=10^-5; x(:,1)=10*rand(2,1);
f=@(x)x(1)^2+x(1)*x(2)+3*x(2)^2+100;
J=@(x)[2*x(1)+x(2);x(1)+6*x(2)]; p(:,1)=-J(x(:,1));
i=1; % matlab starts counting at 1
finalX = x ; % initialize the vector
finalf =f(x(:,1)); z=[];

while and(norm(J(x(:,i)))>0.001,i<500)
%-------------------------------------------------%
% Armijo method for alpha determination
alp=0.01; % initial step
rho=0.01; c=0.02; % rho and c are in (0,1);
x1=x(:,i); x2=x(:,i)+alp*p(:,i);
f2=f(x2); f1=f(x(:,i));
while and(f2>f1+c*alp*(J(x(:,i)))'*p(:,i), alp>10^-6);
alp=rho*alp;
f2=f(x(:,i)+alp*p(:,i));
end
%-------------------------------------------------%
x(:,i+1)=x(:,i) + alp*p(:,i);
beta=((J(x(:,i+1)))'*J(x(:,i+1)))/((J(x(:,i)))'*J(x(:,i)));
p(:,i+1)=-J(x(:,i+1)) + beta*p(:,i);
i=i+1;
z=[z,f(x(:,i))];
end
Iter=i
xmax=x(:,end)
fmax=f(x(:,end))
Gradient=J(x(:,end))

%-------------------------------------------------%

figure(1)
X= x(1,1:end-1); Y= x(2,1:end-1); Z= z;
plot3(X,Y,Z ,'bo-','linewidth',0.1);
hold on

figure(2)
[X,Y]= meshgrid([-3:0.5:3]) ;
Z=X.^2+X.*Y+3*Y.^2+5;
S=mesh(X,Y,Z); %plotting the surface
title('Subrats Pics'), xlabel('x'), ylabel('y')
To address the shortcomings of the original Newton method, several variations of the technique were suggested to guarantee convergence to a local minimum. One of the most important variations is the Levenberg–Marquardt method. This method effectively uses a step that is a combination of the Newton step and the steepest descent step:

Newton:               x_{k+1} = x_k − (H(x_k))⁻¹ ∇f(x_k)
Steepest descent:     x_{k+1} = x_k − μ∇f(x_k)
Levenberg–Marquardt:  x_{k+1} = x_k − (H(x_k) + μI)⁻¹ ∇f(x_k)

where 𝜇 is a positive scalar and 𝑰 ∈ ℝ𝑛×𝑛 is the identity matrix. Notice that in last
equation if 𝜇 is small enough, the Hessian matrix 𝑯(𝐱 𝑘 ) dominates and the method
becomes effectively a Newton’s step. If the parameter 𝜇 is large enough, the matrix 𝜇𝑰
dominates and the method is approximately in the steepest descent direction. By
increasing 𝜇, the inverse matrix becomes small in norm and subsequently the norm of
the step taken ‖𝐱 𝑘+1 − 𝐱 𝑘 ‖ becomes smaller. It follows that the parameter 𝜇 controls
also the step size.

One interesting mathematical property of this approach is that adding the matrix 𝜇𝑰 to
the Hessian matrix increases each eigenvalue of this matrix by 𝜇. If the matrix 𝑯(𝐱 𝑘 ) is
not positive semi-definite then adding 𝜇 to each eigenvalue makes them more positive.
The value of 𝜇 can be increased until all the eigenvalues are positive thus guaranteeing
that the step (𝐱 𝑘+1 − 𝐱 𝑘 ) is a descent step. The Levenberg–Marquardt approach starts
each iteration with a very small value of 𝜇, thus giving effectively the Newton’s step. If
an improvement in the objective function is achieved, the new point is accepted.
Otherwise, the value of 𝜇 is increased until a reduction in the objective function is
obtained.

Example: Find the minimum of the function

𝑓(𝐱 𝑘 ) = 1.5𝑥12 + 2𝑥22 + 1.5𝑥32 + 𝑥1 𝑥3 + 2𝑥2 𝑥3 − 3𝑥1 − 𝑥3

starting from the point 𝐱 0 = [3.0 − 7.0 0]𝑇 . Utilize the MATLAB software

% The Levenberg Marquardt Method


clear all, clc,
f=@(x)1.5*x(1)^2+2*x(2)^2+1.5*x(3)^2+x(1)*x(3)+2*x(2)*x(3)-3*x(1)-x(3);
G=@(x)[3*x(1)+x(3)-3; 4*x(2)+2*x(3); 3*x(3)+ x(1) + 2*x(2)-1];
H=@(x)[3 0 1;0 4 2;1 2 3];

n=3; %Number of Parameters


x0=[3 -7 0]'; %This is the starting point
f0=f(x0); %initial function value
G0=G(x0); %initial gradient
G0Norm=norm(G0); %get the old gradient norm
H0=H(x0); %initial Hessian
I=eye(n); %This is the identity matrix
while (G0Norm>1.0e-5) %repeat until gradient is small enough
u=0.001; %initialize trust region parameter
DescentFlag=0; %flag signaling if a descent step
while(DescentFlag==0) %repeat until descent direction found
M=H0+u*I; %Marquardt Matrix
dx=-1.0*inv(M)*G0;
x=x0+dx; %get the new trial point
fNew=f(x); %calculate new value
if(fNew<f0) %a descent step?
DescentFlag=1.0; %set success flag
else
u=u*4; %Increase Mu
end
end
dx =x-x0; %get the new step
StepNorm=norm(dx); %get the step norm
GNew=G(x); %get new gradient
HNew =H(x); %get new Hessian at the new point
%now we swap parameters
x0=x; G0=GNew; H0=HNew;
G0Norm=norm(GNew);
end
x=x0, f=f(x0), G=G(x0), Ndx=StepNorm,

The algorithm terminated in only one iteration. The exact solution for this problem is
𝐱 ⋆ = [1.0 0.0 0.0]𝑇 with a minimum value of 𝑓(𝐱 ⋆ ) = −1.50.

Convexity: The concept of convexity is fundamental in optimization. Many practical
problems possess this property, which generally makes them easier to solve both in
theory and practice. The term “convex” can be applied both to sets and to functions. A
set 𝑆 ⊆ ℝⁿ is a convex set if the straight line segment connecting any two points in 𝑆 lies
entirely inside 𝑆. Formally, for any two points 𝐱 ∈ 𝑆 and 𝐲 ∈ 𝑆, we have 𝛼𝐲 + (1 − 𝛼)𝐱 ∈ 𝑆
for all 𝛼 ∈ [0, 1]. The function 𝑓 is a convex function if its domain 𝑆 is a convex set and if
for any two points 𝐱 and 𝐲 in 𝑆, the following property is satisfied:

𝑓(𝛼𝐲 + (1 − 𝛼)𝐱) ≤ 𝛼𝑓(𝐲) + (1 − 𝛼)𝑓(𝐱)


Or
𝑓(𝛼(𝐲 − 𝐱) + 𝐱) ≤ 𝛼(𝑓(𝐲 ) − 𝑓(𝐱)) + 𝑓(𝐱)

𝑓(𝛼(𝐲 − 𝐱) + 𝐱) − 𝑓(𝐱) ≤ 𝛼(𝑓(𝐲) − 𝑓(𝐱))

As 𝛼 → 0, the Taylor series of 𝑓(𝛼(𝐲 − 𝐱) + 𝐱) yields


𝑓(𝐱) + 𝛼(∇𝑓(𝐱))ᵀ(𝐲 − 𝐱) − 𝑓(𝐱) ≤ 𝛼(𝑓(𝐲) − 𝑓(𝐱)) ⟺ (∇𝑓(𝐱))ᵀ(𝐲 − 𝐱) ≤ 𝑓(𝐲) − 𝑓(𝐱)

We conclude that: (∇𝑓(𝐱))ᵀ(𝐲 − 𝐱) ≥ 0 ⟹ 𝑓(𝐲) ≥ 𝑓(𝐱)
Theorem: Suppose that 𝑓 is a differentiable function in a convex optimization problem.
Let 𝛀 denote the feasible set. Then 𝐱 is optimal if and only if 𝐱 ∈ 𝛀 and

(∇𝑓(𝐱))ᵀ(𝐲 − 𝐱) ≥ 0 for all 𝐲 ∈ 𝛀

Proof: the Taylor series yields


𝑓(𝐲) = 𝑓(𝐱) + (∇𝑓(𝐱))ᵀ(𝐲 − 𝐱) + ½ 𝐝ᵀ𝑯(𝐱 + 𝛼𝐝)𝐝,   with 𝛼 ∈ (0,1] and 𝐝 = 𝐲 − 𝐱

Now if 𝑯(𝐱) is positive semidefinite everywhere in 𝐱 ∈ 𝛀, then 𝐝𝑇 (𝑯(𝐱 + 𝛼𝐝))𝐝 ≥ 0 and so


𝑓(𝐲) ≥ 𝑓(𝐱) + (∇𝑓(𝐱))ᵀ(𝐲 − 𝐱)

Now if 𝐱 ⋆ is an optimizer of 𝑓(𝐱) then 𝑓(𝐱 ⋆ ) ≤ 𝑓(𝐲) ∀𝐲 ∈ 𝛀 which leads to


(∇𝑓(𝐱⋆))ᵀ(𝐲 − 𝐱⋆) ≥ 0

When the feasible set is the whole space (𝛀 = ℝⁿ), this last inequality can be written as
(∇𝑓(𝐱⋆))ᵀ𝐳 ≥ 0 for all 𝐳 ∈ ℝⁿ, which in turn is equivalent to ∇𝑓(𝐱⋆) = 0. ■

Theorem: Any locally optimal point of a convex optimization problem is also (globally)
optimal.
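To make the definition concrete, the following minimal MATLAB sketch numerically spot-checks the convexity inequality 𝑓(𝛼𝐲 + (1 − 𝛼)𝐱) ≤ 𝛼𝑓(𝐲) + (1 − 𝛼)𝑓(𝐱) for a convex quadratic. The matrix H, the function handle and the number of trials are illustrative assumptions, not values taken from the text.

% Numerical spot-check of the convexity inequality for a quadratic function
% H, f and the number of trials below are illustrative assumptions
clear all, clc
H = [3 0 1; 0 4 2; 1 2 3];          % a symmetric positive definite matrix
f = @(x) 0.5*x'*H*x;                 % convex quadratic function
violations = 0;
for trial = 1:1000
    x = randn(3,1); y = randn(3,1); a = rand;
    lhs = f(a*y + (1-a)*x);          % f(alpha*y + (1-alpha)*x)
    rhs = a*f(y) + (1-a)*f(x);       % alpha*f(y) + (1-alpha)*f(x)
    if lhs > rhs + 1e-12             % small tolerance for rounding errors
        violations = violations + 1;
    end
end
violations                           % expected to be 0 for a convex f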

Random search (RS) is a family of


numerical optimization methods that do not require the gradient of the problem to be
optimized, and RS can hence be used on functions that are not continuous or
differentiable. Such optimization methods are also known as direct-search, derivative-
free, or black-box methods. The name "random search" is attributed to Rastrigin who
made an early presentation of RS along with basic mathematical analysis.

Random search (RS) belongs to the fields of Global


Stochastic Optimization. Random search is a direct search method as it does not
require derivatives to search a continuous domain. To implement the method we need a
pseudorandom number generator. Fortunately, such pseudo-random number
generators with uniform distribution are implemented on most compilers. In order to
limit the search procedure to a confined space, we can impose on the decision variables
certain limits of the form:

𝐱 𝐨𝐩𝐭 = arg min𝐱 𝑓(𝑥1 , 𝑥2 , … , 𝑥𝑛 ) with 𝑎𝑘 ≤ 𝑥𝑘 (𝑖) ≤ 𝑏𝑘 , 𝑘 = 1,2, … , 𝑛

where 𝑘 indexes the variables and 𝑖 the iterations. The MATLAB
function rand returns, at each call, a random number uniformly distributed in (0, 1); call
this number 𝑟 = rand. Since the search must be confined to the interval (𝑎𝑘 , 𝑏𝑘 ) for each
variable 𝑥𝑘 , the random number has to be mapped into this range. For this
reason, the following transformation is made: 𝑥𝑘 (𝑖) = 𝑎𝑘 + 𝑟𝑘 (𝑖)(𝑏𝑘 − 𝑎𝑘 ), 𝑘 = 1,2, … , 𝑛
clc;clear; nVar = 2; % the number of decision variables
N= 10000; % Number of random generated points
epsilon = 1e-3; % the convergence factor
a=zeros(1,nVar); b=zeros(1,nVar); % pre-allocation of vectors a and b
for i=1:nVar, a(i)=-1.50; b(i)=1.50; end % set-up of the search limits
fMin = 1e6; % initialize fMin
fPrecedent = fMin;
for i=1:N % global search procedure
x1 = a(1)+ rand*(b(1)-a(1)); % random generation: variable x1
x2 = a(2)+ rand*(b(2)-a(2)); % random generation: variable x2
f=@(x,y)2*x+y+(x.^2-y.^2)+(x-y.^2).^2; % The objective function
func =f(x1,x2);
if (func<fMin)
fMin = func; x1Min = x1; x2Min = x2;
if abs(fMin - fPrecedent)<=epsilon
break;
else
fPrecedent= fMin;
end, end, end
x1=x1Min, x2=x2Min, fMin =fMin(end)
J=@(x,y)[4*x-2*y^2+2;-4*x*y+4*y^3-2*y+1];
Jmin=J(x1, x2), fmin=f(x1, x2),

>> x1 = -0.2267, x2 = -0.7629
>> fmin = -1.0929
>> Jmin = [-0.070971; 0.057800]
The search efficiency depends on the number N of randomly generated points within the
search domain (𝑎𝑘 , 𝑏𝑘 ).

Random Walk is an algorithm that provides


random paths in a graph. A random walk means that we start at one node, choose a
neighbor to navigate to at random or based on a provided probability distribution, and
then do the same from that node, keeping the resulting path in a list.

To find the solution to the minimization problem, the random path method uses an
iterative relationship of the form 𝐱(𝑖 + 1) = 𝐱(𝑖) + 𝛼𝑖 𝐬(𝑖) & 𝑖 = 1,2, … , 𝑛 where i is an
iterative index, 𝐱(𝑖) is the vector of the decision variables, 𝛼𝑖 is a step size, at iteration i
called acceleration factor in the 𝐬(𝑖) direction, and 𝐬(𝑖) is the vector of the minimization
direction. The search procedure starts from a randomly chosen point. Whatever this
start point is, we have to reach the same solution. The coordinates of the minimization
direction vector 𝐬𝑘 , are randomly chosen using the rand function.
Algorithm: [Random Walk algorithm]
Step 1: choose 𝐱(0) and 𝑁max
set 𝑖 = 1
Step 2: for each iteration 𝑖 do
𝐬(𝑖) = random vector
Step 3: 𝐱(𝑖 + 1) = 𝐱(𝑖) + 𝛼𝑖 𝐬(𝑖)

𝛼𝑖 is determined to minimize 𝑓(𝐱(𝑖 + 1)) & 𝑖 ← 𝑖 + 1


If 𝑖 < 𝑁max go to step 2
else stop (iteration exceeded)
end
end

Remark: The convergence of this algorithm is slow and not guaranteed in general; it
depends strongly on the convexity of the objective function.

%f=@(x)2*x(1)+x(2)+(x(1).^2+x(2).^2)+(x(1)+x(2).^2).^2;
%J=@(x)[4*x(1)+2*x(2)^2+2;4*x(1)*x(2)+4*x(2)^3+2*x(2)+1];
%---------------------------------------------------------------%
clear all, clc, n=0; nMax=5; xzero=rand(2,1); epsilon=1e-4; alfa0=0.01;
f=@(x)2*x(1)+x(2)+(x(1).^2-x(2).^2)+(x(1)-x(2).^2).^2;
a = -1.0 ; b = 1.0; % the range for s
F0=f(xzero); Fprecedent=F0; % the function value at the start point
f0=F0; s=rand(2,1); alfa = alfa0; increment = alfa0;
xone = xzero + alfa*s; % generate a next iteration Xl
F1 = f(xone); % the objective function value in Xl
Factual = F1;
i=1; % initialize the counter i
go = true; % variable 'go' remains 'true' as long as
% the convergence criteria are not fulfilled
while go
while (Factual>=Fprecedent)
s = rand(2,1); s = a*[1;1] + s*(b-a); % generate a random direction s
xone = xzero + alfa*s;
F1 = f(xone); Factual = F1;
end
i=i+1; f1=F1;
while (Factual<Fprecedent)
Fprecedent = Factual;
alfa = alfa + increment;
xone = xzero + alfa*s; F1 = f(xone);
end
deltaF = abs(F1-Fprecedent); F0 = Factual; xzero = xone; alfa = alfa0;
if(abs(f0-f1)<=epsilon) n = n + 1; end
f0 =f1;
if(n==nMax) go = false; break; end
end
J=@(x)[4*x(1)-2*x(2)^2+2;-4*x(1)*x(2)+4*x(2)^3-2*x(2)+1];
xone, Factual, Jmin=J(xone), fmin=f(xone),

Methods for which the search principle


is based on random numbers are generally called the Monte Carlo methods. At present,
random numbers are often replaced by computer-generated pseudo-random numbers
by randomization procedures. The Monte Carlo method has been successfully applied
to solving linear equation systems, to calculating the inverse matrix, to evaluating
multiple integrals, to solving the Dirichlet problem, to solving functional equations of a
variety of types and so on. It has also been used in the field of nuclear physics.

The Monte Carlo method is based on the following principle: if the best option is
needed, it should be tried "at random" many times and then the best option found
between those attempts chosen. If there are enough different attempts, the best option
found will almost certainly be an optimal global value. This method is valid both
mathematically and intuitively. The advantages of the method are both its simplicity
and its universality. But it has the disadvantage of being too slow.

The Monte Carlo idea: It is preferred to explain


the Monte Carlo principle (method) by an
illustration or exemplification of a simple integral
calculation problem. For this, the following
integral will be considered:
𝑦 = ∫₀¹ √(1 − 𝑥²) 𝑑𝑥 = ∫₀^(𝜋/2) sin²(𝜃) 𝑑𝜃 = 𝜋/4

The function under the integral represents a quarter-circle arc (90°), which can be
inscribed in a square whose edge has unit length.

Obviously, the area of the square is 𝑆 = 1. The


area of the quarter-circle is 𝐴 = 𝜋𝑆/4.

Let’s pretend that we don’t know the value of 𝜋. To calculate it, we will generate a large
number 𝑁 of random points in the unit square. By 𝑛 we will denote the number of
points lying inside the quarter-circle. As you will certainly agree, with large 𝑁 the ratio
𝑛/𝑁 must be very similar to the ratio of 𝐴/𝑆. And that’s all! From the equation 𝑛/𝑁 = 𝐴/𝑆
we can easily express 𝜋 = 4𝑛/𝑁. This approximation becomes more accurate as the
number 𝑁 of points uniformly distributed over the square increases.
The corresponding MATLAB program is presented below.
clear all, clc, nmax = 5000;
x = rand(nmax,1); y = rand(nmax,1); x1=x-0.5; y1=y-0.5;
r = sqrt(x1.^2+y1.^2) ;
% get logicals
inside = r<=0.5; outside = r>0.5;
% plot
plot(x1(inside),y1(inside),'b.');
hold on
plot(x1(outside),y1(outside),'r.');
axis equal
% get pi value
thepi = 4*sum(inside)/nmax;
fprintf('%8.4f\n',thepi)

In the following, a global optimization algorithm


applicable to solving nonconvex problems is proposed. It is effective even if a problem
has multiple local optima. Let us imagine that our objective function has the equation:

𝑓(𝑥, 𝑦) = −0.02 sin(𝑥 + 4𝑦) − 0.2 cos(2𝑥 + 3𝑦) − 0.2 sin(2𝑥 − 𝑦) + 0.4 cos(𝑥 − 2𝑦)

Since the objective function depends on two variables 𝑥 and 𝑦, its graphical
representation is a surface. From the figure it is evident that this surface has many peaks
and valleys, which can be interpreted as many local minima (or maxima), depending on
the problem scope. Usually, a numerical optimization procedure risks ending at a local
optimum point instead of the global minimum point.
% This program draws the mesh of a multimodal
% function that depends on two variables
clear all, clc
f=@(x1,x2)-0.02*sin(x1+4*x2)-0.2*cos(2*x1+3*x2)-0.2*sin(2*x1-x2)...
    +0.4*cos(x1-2*x2);
a1=-2.5;a2=-2.5; b1=2.5;b2=2.5; increment1=0.1; increment2=0.1;
n1=(b1-a1)/increment1; n2=(b2-a2)/increment2; fGraph = zeros(n1,n2);
x1 = a1;
for i1 = 1:n1
x2 = a2;
for i2 = 1:n2
fGraph (i1,i2)=f(x1,x2);
x2 = x2 + increment2;
end
x1 = x1 + increment1;
end
mesh(fGraph) ; % drawing of fGraph
clear all, clc

f=@(x1,x2)-0.02*sin(x1+4*x2)-0.2*cos(2*x1+3*x2)-0.2*sin(2*x1-x2)...
    +0.4*cos(x1-2*x2);

n=10000; % set the number of random numbers generated


gridNumber = 250; % set the grid numbers for local search
isd=20; % set the interval size divisor for local search
a1=-3; a2=-3; b1=3;b2=3; % the initial search limits on each axis
delta = 1.0e3; % set the absolute value of difference df(x)
epsilon = 1.0e-4; % set the convergence criterion value
minF = 1.e20; % set the minimum initial value of f

%------% open a text file to save results %------%


fp = fopen('results.txt','w');
fprintf(fp, ' GLOBAL OPTIMIZATION METHOD\n\n');
fprintf(fp, ' the size of random numbers n = %d\n',n);
fprintf(fp, ' grid numbers for local search : %d\n ', gridNumber);
fprintf(fp, ' interval size divisor for local');
fprintf(fp, ' search : %d\n ' , isd );
fprintf(fp, ' initial search limits on each axis\n');
fprintf(fp, ' a1 = %5.2f b1 = %5.2f\n', a1 , b1);
fprintf(fp, ' a2 = %5.2f b2 = %5.2f\n', a2 , b2);
fprintf(fp, ' the absolute difference between');
fprintf(fp, ' function values :\n');
fprintf(fp, ' delta = %f\n' , delta);
fprintf(fp, ' the convergence criterion value :\n');
fprintf(fp, ' epsilon = %f\n', epsilon );
%------% Global search (by a Monte Carlo method) %------%
for i = 1:n
x1 = a1 + (b1 - a1)*rand;
x2 = a2 + (b2 - a2)*rand;
func = f(x1,x2);
if func<minF, minF = func; x1_min=x1; x2_min=x2; end
end
precF = minF;

%------% Local search (by a Multi-Grid method) %------%


ls1 = x1_min - abs(b1-a1)/isd; % left limit, x1 axis
if ls1<a1, ls1 = a1; end
ld1 = x1_min + abs(b1-a1)/isd; % right limit, x1 axis
if ld1>b1, ld1 = b1; end
ls2 = x2_min - abs(b2-a2)/isd; % left limit, x2 axis
if ls2<a2, ls2 = a2; end
ld2 = x2_min + abs(b2-a2)/isd; % right limit, x2 axis
if ld2>b2, ld2 = b2; end

while delta>epsilon
%------% The block for xl variable (keep x2=constant) %------%
x2 = x2_min; x1 = ls1;
increment = abs(ld1-ls1)/gridNumber;
while x1<=ld1
func = f(x1,x2);
if func<minF, minF = func; x1_min=x1; end
x1 = x1 + increment;
end

ls1 = x1_min - increment;


if ls1<a1, ls1 = a1; end
ld1 = x1_min + increment;
if ld1>b1, ld1 = b1; end
%------% The block for x2 variable (keep x1=constant) %------%
x1 = x1_min; x2 = ls2; increment = abs(ld2-ls2)/gridNumber; % scan x2 from the left limit
while x2<=ld2
func = f(x1,x2);
if func<minF, minF = func; x2_min = x2; end
x2 = x2 + increment;
end

ls2 = x2_min - increment;


if ls2<a2, ls2 = a2; end
ld2 = x2_min + increment;
if ld2>b2, ld2 = b2; end

actF = minF;
% check the convergence criterion
delta=abs(actF-precF);
precF = actF;
end

% draw the surface graph


x1 =-3:0.1:3 ; x2 =-3 :0.1:3;
[x1,x2] = meshgrid (x1,x2); % create the meshgrid
F = f(x1,x2); contour (x1,x2,F,15);

% mark the optimum solution


hold on;
scatter(x1_min, x2_min,'markerfacecolor', 'r');
hold off;

fprintf (fp, ' \n\n the minimum value : Fmin =');


fprintf (fp, ' %f\n',minF);
fprintf (fp, ' the coordinates of the optimal');
fprintf( fp, ' point : \n ' );
fprintf (fp, ' x1 = %f x2 = %f\n' , x1_min, x2_min);
fclose (fp); % close the text file

x1_min, x2_min, minF

Explanation: Initially the global search


procedure is designed based on Monte Carlo
principles, which means that a number n of
pairs of points (𝑥𝑖 , 𝑦𝑖 ) are randomly generated
inside the search area, defined by the search
limits −3 ≤ 𝑥, 𝑦 ≤ 3. From these points, the
point (𝑥, 𝑦) that has the minimum value is
thought to have more chances to be placed
near the true optimum point. This statement is
more accurate as the number 𝑛 of points that
are randomly generated inside the search area
increases. The search for this point with value
min𝐹 and coordinates (𝑥1𝑚𝑖𝑛 , 𝑥2𝑚𝑖𝑛 ) is done by
Monte Carlo method.

Starting from this point, a local search procedure is designed. This procedure is based
on the Grid method. First thing to do is to define a neighborhood of the starting point
on each axis. This neighborhood should be set for each axis separately. Then along
each axis a search of the minimum point in that direction is made successively.
In this particular case, the local search along the 𝑥1 axis starts from the left bound of
the neighborhood, that is 𝑙𝑠1, while 𝑥2 is kept constant. Once the minimum point along
𝑥1 axis is found, it is kept constant, while the local search is performed along the 𝑥2
axis. When the search along all axes is finished, one iteration is over. The
value of the objective function at this point is compared with the value obtained at the
previous iteration. If the difference between these two values, denoted delta, is less than
a preset precision factor called epsilon, the search stops; otherwise it
continues. ■ (Ancau Mircea 2019)

Example Find a solution for the two-dimensional optimization problem:

f=@(x1,x2)log((1+(x1-4/3).^2)+3*(x1+x2-(x1).^3).^2);
x1 =-2:0.1:2 ; x2 =-2:0.1:2;

Applying the global optimization algorithm above, we get


x1_min = 1.3376
x2_min = 1.0555
minF = 1.8193e-05
In many optimization problems, the
variables are interrelated by physical laws like the conservation of mass or energy,
Kirchhoff’s voltage and current laws, and other system equalities that must be satisfied.
In effect, in these problems certain
equality constraints of the form ℎ𝑖 (𝐱) =
0 for 𝐱 ∈ 𝛀 where 𝑖 = 1, 2, … , 𝑝 must be
satisfied before the problem can be
considered solved. In other optimization
problems a collection of inequality
constraints might be imposed on the
variables or parameters to ensure physical
realizability, reliability, compatibility, or
even to simplify the modeling of the
problem. For example, the power
dissipation might become excessive if a
particular current in a circuit exceeds a
given upper limit or the circuit might
become unreliable if another current is
reduced below a lower limit, the mass of an element in a specific chemical reaction
must be positive, and so on. In these problems, a collection of inequality constraints of
the form 𝑔𝑗 (𝐱) ≥ 0 for 𝐱 ∈ 𝛀 where 𝑗 = 1, 2, … , 𝑞 must be satisfied before the
optimization problem can be considered solved.

An optimization problem may entail a set of equality constraints and possibly a set of
inequality constraints. If this is the case, the problem is said to be a constrained
optimization problem. The most general constrained optimization problem can be
expressed mathematically as
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
subject to: ℎ𝑖 (𝐱) = 0
𝑔𝑗 (𝐱) ≥ 0

A problem that does not entail any equality or inequality constraints is said to be an
unconstrained optimization problem. Constrained optimization is usually much more
difficult than unconstrained optimization, as might be expected. Consequently, the
general strategy that has evolved in recent years towards the solution of constrained
optimization problems is to reformulate constrained problems as unconstrained
optimization problems. When the objective function and all the constraints are linear
functions of 𝐱, the problem is a linear programming problem. Problems of this type are
probably the most widely formulated and solved of all optimization problems,
particularly in control system, management, financial, and economic applications.
Nonlinear programming problems, in which at least some of the constraints or the
objective are nonlinear functions, tend to arise naturally in the physical sciences and
engineering, and are becoming more widely used in control system, management and
economic sciences as well.
Several branches of mathematical programming are of much interest for the
optimization problems, namely, linear, integer, quadratic, nonlinear, and dynamic
programming. Each one of these branches of mathematical programming consists of the
theory and application of a collection of optimization techniques that are suited to a
specific class of optimization problems.

In mathematical optimization, the method of Lagrange


multipliers is a strategy for finding the local maxima and minima of a function subject
to equality constraints (i.e., subject to the condition that one or more equations have to
be satisfied exactly by the chosen values of the variables). It is named after the
mathematician Joseph-Louis Lagrange. The basic idea is to convert a constrained
problem into a form such that the derivative test of an unconstrained problem can still
be applied. The relationship between the gradient of the function and gradients of the
constraints known as the Lagrangian function.

The method can be summarized as follows: in order to find the maximum or minimum
of a function 𝑓(𝐱) subjected to the equality constraint g(𝐱) = 0, form the Lagrangian
function 𝐿(𝐱 , 𝜆) = 𝑓(𝐱) − 𝜆g(𝐱) and find the stationary points 𝐱 = 𝐱 ⋆ of 𝐿(𝐱 , 𝜆) such that
∇𝐿(𝐱 ⋆ , 𝜆) = 0. Further, the method of Lagrange multipliers is generalized by the Karush–
Kuhn–Tucker conditions, which can also take into account inequality constraints of the
form ℎ(𝐱) ≤ 𝑐.

Often the Lagrange multipliers have an interpretation as some quantity of interest. For
example, consider
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
subject to: g𝑖(𝐱) = 𝑐𝑖,  𝑖 = 1, 2, … , 𝑝

The Lagrangian function is 𝐿(𝐱, 𝝀) = 𝑓(𝐱) + ∑_(𝑖=1)^(𝑝) 𝜆𝑖(𝑐𝑖 − g𝑖(𝐱)). Then 𝜆𝑘 = 𝜕𝐿/𝜕𝑐𝑘. So, 𝜆𝑘 is
the rate of change of the quantity being optimized as a function of the constraint
parameter. The relationship between the gradient of the function and gradients of the
constraints is:
𝐿(𝐱 , 𝜆) = 𝑓(𝐱) − 𝜆g(𝐱) ⟹ ∇𝑓(𝐱) = 𝜆∇g(𝐱).

Example: Suppose we wish to maximize the


objective function 𝑓(𝑥, 𝑦) = 𝑥 + 𝑦 subject to the
constraint 𝑥 2 + 𝑦 2 = 1. The feasible set is the unit
circle, and the level sets of 𝑓 are diagonal lines
(with slope −1), so we can see graphically that the
maximum and the minimum occur at
(√2/2, √2/2) and (−√2/2, −√2/2).
For the method of Lagrange multipliers, the
constraint is g(𝐱) = 𝑥 2 + 𝑦 2 − 1 = 0 hence 𝐿(𝐱 , 𝜆) = (𝑥 + 𝑦) + 𝜆(𝑥 2 + 𝑦 2 − 1). Now we can
calculate the gradient:
∇𝐿(𝐱, 𝜆) = (𝜕𝐿/𝜕𝑥, 𝜕𝐿/𝜕𝑦, 𝜕𝐿/𝜕𝜆)ᵀ = (1 + 2𝜆𝑥, 1 + 2𝜆𝑦, 𝑥² + 𝑦² − 1)ᵀ
and therefore: ∇𝐿(𝐱, 𝜆) = 0 ⟹ {1 + 2𝜆𝑥 = 0,  1 + 2𝜆𝑦 = 0,  𝑥² + 𝑦² − 1 = 0}. Notice that the
last equation is the original constraint. The first two equations yield 𝑥 = 𝑦 = −1/(2𝜆), 𝜆 ≠ 0.
By substituting into the last equation we have 2𝜆² − 1 = 0 ⟹ 𝜆 = ±√2/2, which implies that the
stationary points of 𝐿 are (√2/2, √2/2) and (−√2/2, −√2/2). Evaluating the objective
function f at these points yields 𝑓 = ±√2.
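The stationary points of the Lagrangian can also be located numerically. The minimal MATLAB sketch below (an illustration, not part of the original example) solves ∇𝐿(𝑥, 𝑦, 𝜆) = 0 with fsolve, which assumes the Optimization Toolbox is available; the initial guess z0 is an arbitrary choice near the positive stationary point.

% Numerical solution of grad L = 0 for L = (x+y) + lambda*(x^2+y^2-1)
clear all, clc
gradL = @(z)[1 + 2*z(3)*z(1);        % dL/dx
             1 + 2*z(3)*z(2);        % dL/dy
             z(1)^2 + z(2)^2 - 1];   % dL/dlambda (the constraint)
z0 = [0.5; 0.5; -0.5];               % starting guess [x; y; lambda]
z = fsolve(gradL, z0);               % requires the Optimization Toolbox
x = z(1), y = z(2), lambda = z(3), fval = z(1) + z(2)

Starting from z0, the iteration converges to (√2/2, √2/2) with 𝜆 = −√2/2 and 𝑓 = √2, in agreement with the analytic solution above.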

Example: Now we modify the objective function of


the previous Example so that we minimize 𝑓(𝑥, 𝑦) =
(𝑥 + 𝑦)2 again along the circle g(𝐱) = 𝑥 2 + 𝑦 2 − 1 = 0.
Now the level sets of 𝑓 are still lines of slope −1, and
the points on the circle tangent to these level sets
are again (√2/2, √2/2) and (−√2/2, −√2/2). These
tangency points are maxima of 𝑓. On the other
hand, the minima occur on the level set for 𝑓 = 0
(since by its construction 𝑓 cannot take negative
values), at (√2/2, −√2/2) and (−√2/2, √2/2), where
the level curves of 𝑓 are not tangent to the constraint. The condition that ∇𝑓 = 𝜆∇g
correctly identifies all four points as extrema; the minima are characterized in
particular by 𝜆 = 0.

Remark: In optimal control theory, the Lagrange multipliers are interpreted as costate
variables, and the optimality conditions are reformulated as the minimization of the
Hamiltonian in Pontryagin's minimum principle.

Example: Determine the Lagrange multipliers for the optimization problem

minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
subject to: 𝑨𝐱 = 𝒃

Where 𝑨 ∈ ℝ𝑝×𝑛 is assumed to have full row rank. Also discuss the case where the
constraints are nonlinear.

Solution: In this case we have 𝐿(𝐱, 𝝀) = 𝑓(𝐱) − 𝝀ᵀg(𝐱) = 𝑓(𝐱) − [𝜆₁ ⋯ 𝜆ₚ](𝑨𝐱 − 𝒃) with
g(𝐱) = 𝑨𝐱 − 𝒃 = 0. Let us define g_new(𝐱, 𝝀) = 𝝀ᵀg(𝐱) = [𝜆₁ ⋯ 𝜆ₚ](𝑨𝐱 − 𝒃). Taking the
gradient of this new function with respect to 𝐱 gives ∇g_new(𝐱) = 𝑨ᵀ𝝀.

On the other hand,

g_new(𝐱) = 𝝀ᵀg(𝐱) = 𝑓(𝐱) − 𝐿(𝐱, 𝝀) ⟹ ∇g_new(𝐱) = ∇𝑓(𝐱) − ∇𝐿(𝐱, 𝝀) = ∇(𝝀ᵀg(𝐱)) = 𝑨ᵀ𝝀

Taking the evaluation of this expression at 𝐱 = 𝐱 ⋆ we obtain


∇g new (𝐱)|𝐱=𝐱⋆ = ∇𝑓(𝐱 ⋆ ) − ∇𝐿(𝐱 ⋆ , 𝝀) = ∇𝑓(𝐱 ⋆ ) = 𝑨𝑇 𝝀
From basic linear algebra it is well known that the Lagrange multipliers are
uniquely determined as ∇g_new(𝐱) = 𝑨ᵀ𝝀 ⟺ 𝝀 = (𝑨𝑨ᵀ)⁻¹𝑨∇g_new(𝐱) = (𝑨ᵀ)⁺∇g_new(𝐱), that is,

𝝀 = [𝜆₁ ⋯ 𝜆ₚ]ᵀ = (𝑨𝑨ᵀ)⁻¹𝑨∇𝑓(𝐱⋆) = (𝑨ᵀ)⁺∇𝑓(𝐱⋆)


For the case of nonlinear equality constraints, a similar conclusion can be reached in
terms of the Jacobian of the constraints. If we let 𝑱e = [∇g₁(𝐱) … ∇gₚ(𝐱)]ᵀ then the
Lagrange multipliers are uniquely determined as 𝝀 = (𝑱eᵀ)⁺∇𝑓(𝐱⋆).

Example Solve the problem


minimize 𝑓(𝐱) = ½ 𝐱ᵀ𝑯𝐱 + 𝐱ᵀ𝒑,  for 𝐱 ∈ 𝛀
subject to: 𝑨𝐱 = 𝒃

Where 𝑯 ≻ 0 and 𝑨 ∈ ℝ𝑝×𝑛 is assumed to have full row rank.

Solution: We know that ∇𝑓(𝐱) = 𝑯𝐱 + 𝒑, so that 𝝀 = (𝑨ᵀ)⁺∇𝑓(𝐱⋆) = (𝑨𝑨ᵀ)⁻¹𝑨(𝑯𝐱⋆ + 𝒑). In
order to eliminate 𝐱⋆, use the stationarity condition 𝑨ᵀ𝝀 = 𝑯𝐱⋆ + 𝒑 and multiply both
sides by 𝑨𝑯⁻¹ to get 𝑨𝑯⁻¹𝑨ᵀ𝝀 = 𝑨𝑯⁻¹𝑯𝐱⋆ + 𝑨𝑯⁻¹𝒑 = 𝑨𝐱⋆ + 𝑨𝑯⁻¹𝒑 = 𝒃 + 𝑨𝑯⁻¹𝒑, hence

𝝀 = (𝑨𝑯−1 𝑨𝑇 )−1 (𝑨𝑯−1𝒑 + 𝒃)


The Lagrange multiplier conditions are difficult to solve analytically in general; therefore
such problems are usually solved numerically on a computer.

Remark: assume that we are dealing with the problem of optimization such that
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
{
subject to: 𝐠(𝐱) = 𝟎

The Karush–Kuhn–Tucker conditions state that ∇𝐿(𝐱, 𝝀) = 0, which can be written in the
form
𝐿(𝐱, 𝝀) = 𝑓(𝐱) + 𝝀ᵀ𝐠(𝐱) ⟺ 𝒉(𝐱, 𝝀) = [𝜕𝐿(𝐱, 𝝀)/𝜕𝐱 ; 𝜕𝐿(𝐱, 𝝀)/𝜕𝝀] = [∇𝑓(𝐱) + 𝑱ᵀ𝝀 ; 𝐠(𝐱)] = [0 ; 0]
where 𝑱 is the Jacobian of the vector 𝐠(𝐱). In the case when 𝑓(𝐱) = ½ 𝐱ᵀ𝑯𝐱 + 𝐱ᵀ𝒑 and
𝐠(𝐱) = 𝑨𝐱 − 𝒃,

𝒉(𝐱, 𝝀) = [∇𝑓(𝐱) + 𝑱ᵀ𝝀 ; 𝐠(𝐱)] = [𝑯𝐱 + 𝒑 + 𝑨ᵀ𝝀 ; 𝑨𝐱 − 𝒃] = 𝟎  ⇔  [𝑯 𝑨ᵀ; 𝑨 𝟎][𝐱; 𝝀] = [−𝒑; 𝒃]
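A minimal MATLAB sketch of solving this linear KKT system is given below; the matrices 𝑯, 𝑨 and the vectors 𝒑, 𝒃 are illustrative values chosen only to demonstrate the solve, not data taken from the text. Solving the system gives both the minimizer 𝐱 and the multiplier 𝝀 in one step.

% Solving the KKT system [H A'; A 0]*[x; lambda] = [-p; b]
% H, p, A, b below are illustrative assumptions
clear all, clc
H = [3 0 1; 0 4 2; 1 2 3];           % H positive definite
p = [-3; 0; -1];
A = [1 1 1];                          % one equality constraint A*x = b
b = 1;
KKT = [H A'; A zeros(size(A,1))];     % the KKT matrix
sol = KKT\[-p; b];                    % solve for [x; lambda]
x = sol(1:3), lambda = sol(4)
gradL = H*x + p + A'*lambda           % check stationarity (should be ~0)
residual = A*x - b                    % check feasibility (should be ~0)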

Random search for constrained problems: assume that we are dealing with the
optimization problem

minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
subject to: ℎ𝑖(𝐱) = 0,  1 ≤ 𝑖 ≤ 𝑝
g𝑗(𝐱) ≤ 0,  1 ≤ 𝑗 ≤ 𝑚

The basic concept in random search approaches is to randomly generate points in the
parameter space. Only feasible points satisfying g 𝑗 (𝐱) ≤ 0, 1 ≤ 𝑗 ≤ 𝑚 are considered,
while non-feasible points with at least one g 𝑗 (𝐱) > 0 for some j are rejected. The
algorithm keeps track of the feasible random point with the least value of the objective
function. This requires checking, at every iteration, if the newly generated feasible point
has a better objective function than the best value achieved so far.
The main disadvantage of this algorithm is that a large number of objective function
calculations may be required especially for problems with large n. The following
example illustrates this technique.

Example Find a solution for the two-dimensional constrained optimization problem:


minimize 𝑥₁² + 𝑥₂²
subject to: 𝑥₁ − 𝑥₂² − 4 ≥ 0
𝑥₁ − 10 ≤ 0

%The Random Search Approach for constraint problems % MATLAB M8.1


clear all, clc,
f=@(x)x(1)^2+x(2)^2; g=@(x)[4-x(1)+x(2)^2;x(1)-10];
n=2; %This is the Number of Parameters
m=2; %This is the Number of Constraints
Ub=[10 10]'; %upper values
Lb=[-10 -10]'; %lower values
f0=1.0e9; %select a large initial value for the minimum
N=100000; %maximum number of allowed iterations
k=0; %iteration counter
while(k<N) %repeat until maximum number of iteration
r=1.2*rand(n,1); %get a vector of random variables
x=Lb + r.*(Ub-Lb); %Get new random point
f1=f(x); %get new objective function value
gg=g(x); %get the value of ALL constraints at the new point
if and((f1<f0),(max(gg)<0)) %is there an improvement and is the
%new point feasible?
x0=x; %adjust best value
f0=f1;
end
k=k+1; %increment the iteration counter
end

iterations=k,
BestPosition=x0,
fmax=f0,

The point returned by the random optimization algorithm is 𝐱 = [4.175048 0.048896].

Example Find a solution for the two-dimensional constrained optimization problem:


minimize 𝑥₁ + 𝑥₂
subject to: 𝑥₁² + 𝑥₂² − 1 ≤ 0

The above program gives 𝑥₁ = 𝑥₂ = −0.7056, 𝑓 = −1.4134


Example Find a solution for the two-
dimensional constrained optimization
problem:

minimize log(1 + (𝑥₁ − 4/3)² + 3(𝑥₁ + 𝑥₂ − 𝑥₁³)²)
subject to: 𝑥₁² + 𝑥₂² − 4 ≤ 0
−1 ≤ 𝑥₁, 𝑥₂ ≤ 1

The above program gives

𝑥₁ = 0.5995, 𝑥₂ = −0.3714, 𝑓 = 0.4311

clear all, clc,


f=@(x)log((1+(x(1)-4/3).^2)+3*(x(1)+x(2)-(x(1)).^3).^2);
g=@(x)x(1)^2+x(2)^2-4; % Contraints
n=2; %This is the Number of Parameters
m=2; %This is the Number of Constraints
Ub=[1 1]'; %upper values
Lb=[-1 -1]'; %lower values
f0=1.0e9; %select a large initial value for the minimum
N=100000; %maximum number of allowed iterations
k=0; %iteration counter
while(k<N) %repeat until maximum number of iteration
r=0.8*rand(n,1); %random vector in [0, 0.8]; note this samples only part of [Lb, Ub]
x=Lb + r.*(Ub-Lb); %Get new random point
f1=f(x); %get new objective function value
gg=g(x); %get the value of ALL constraints at the new point
if and((f1<f0),(max(gg)<=0)) %is there an improvement and is the
%new point feasible?
x0=x; %adjust best value
f0=f1;
end
k=k+1; %increment the iteration counter
end
iterations=k, BestPosition=x0, fmax=f0,

Regularized least squares: There are several situations in which the least
squares solution of 𝑨𝐱 = 𝒃 does not give rise to a good estimate of the “true” vector 𝐱.
For example, when 𝑨 is underdetermined, that is, when there are fewer equations than
variables, there are several optimal solutions to the least squares problem, and it is
unclear which of these optimal solutions is the one that should be considered. In these
cases, some type of prior information on 𝐱 should be incorporated into the optimization
model. One way to do this is to consider a penalized problem in which a regularization
function 𝑅(·) is added to the objective function. The regularized least squares (RLS)
problem has the form
RLS:  min_𝐱 ‖𝑨𝐱 − 𝒃‖² + 𝜆𝑅(𝐱)

The positive constant 𝜆 is the regularization parameter. As 𝜆 gets larger, more weight is
given to the regularization function. In many cases, the regularization is taken to be
quadratic. In particular, 𝑅(𝐱) = ‖𝑫𝐱‖2 where 𝑫 ∈ ℝ𝑝×𝑛 is a given matrix. The quadratic
regularization function aims to control the norm of 𝑫𝐱 and is formulated as follows:

min_𝐱 ‖𝑨𝐱 − 𝒃‖² + 𝜆‖𝑫𝐱‖²
To find the optimal solution of this problem, note that it can be equivalently written as

min_𝐱 {𝑓RLS(𝐱) ≡ 𝐱ᵀ(𝑨ᵀ𝑨 + 𝜆𝑫ᵀ𝑫)𝐱 − 2𝒃ᵀ𝑨𝐱 + ‖𝒃‖²}

Since the Hessian of the objective function is ∇2 𝑓RLS = 2(𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫) ≽ 0, it follows by
previous theorems that any stationary point is a global minimum point. The stationary
points are those satisfying ∇𝑓RLS (𝐱) = 0, that is, (𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫)𝐱 = 𝑨𝑻 𝒃.

Therefore, if 𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫 ≻ 𝟎, then the RLS solution is given by

𝐱 = (𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫)−1 𝑨𝑻 𝒃

Example: Let 𝑨 ∈ ℝ3×3 , 𝒃 ∈ ℝ3 and 𝑩 ∈ ℝ2×3 be given by

𝑨 = 𝑩ᵀ𝑩 + 10⁻³𝑰 = [2+10⁻³  3  4;  3  5+10⁻³  7;  4  7  10+10⁻³],
𝑩 = [1 1 1; 1 2 3],   𝒃 = [20.0019; 34.0004; 48.0202]

The purpose is to find the best approximate solution of 𝑨𝐱 = 𝒃. Knowing that the exact
solution is 𝐱 𝑡𝑟𝑢𝑒 = [1 2 3]𝑇 .

The matrix 𝑨 is in fact of a full column rank since its eigenvalues are all positive (which
can be checked, for example, by the MATLAB command eig(𝑨)), and the simple least
squares solution is given by 𝐱 𝐿𝑆 , whose value can be computed by

clear all, clc


B=[1,1,1;1,2,3];
A=B'*B+0.001*eye(3);
b=[20.0019,34.0004,48.0202]'
xLS=inv(A'*A)*A'*b

𝐱 𝐿𝑆 = [4.5316 − 5.1036 6.5612]𝑇

𝐱 𝐿𝑆 is rather far from the true vector 𝐱 𝑡𝑟𝑢𝑒 . One difference between the solutions is that
the squared norm ‖𝐱𝐿𝑆‖² = 90.1855 is much larger than the correct squared norm
‖𝐱 𝑡𝑟𝑢𝑒 ‖2 = 14. In order to control the norm of the solution we will add the quadratic
regularization function ‖𝐱‖2. The regularized solution will thus have the form

𝐱 = (𝑨𝑻 𝑨 + 𝜆𝑰𝑻 𝑰)−1 𝑨𝑻 𝒃


Picking the regularization parameter as 𝜆 = 1, the RLS solution becomes

clear all, clc


B=[1,1,1;1,2,3];
A=B'*B+0.001*eye(3);
b=[20.0019,34.0004,48.0202]'
xRLS=inv(A'*A+eye(3))*A'*b

𝐱 𝑅𝐿𝑆 = [1.1763 − 2.0318 2.8872]𝑇

which is a much better estimate for 𝐱 𝑡𝑟𝑢𝑒 than 𝐱 𝐿𝑆 .

Denoising: One application area in which regularization is commonly used is


denoising. Suppose that a noisy measurement of a signal 𝐱 ∈ ℝ𝑛 is given:
𝒃=𝐱+𝐰
Here 𝐱 is an unknown signal, 𝐰 is an unknown noise vector, and 𝒃 is the known
measurements vector. The denoising problem is the following: Given 𝒃, find a “good”
estimate of 𝐱. The least squares problem associated with the approximate equations
𝐱 ≈ 𝒃 is
min‖𝐱 − 𝒃‖2
𝐱

However, the optimal solution of this problem is obviously x = b, which is meaningless.


This is a case in which the least squares solution is not informative even though the
associated matrix—the identity matrix—is of a full column rank. To find a more
relevant problem, we will add a regularization term. For that, we need to exploit some a
priori information on the signal. For example, we might know in advance that the signal
is smooth in some sense. In that case, it is very natural to add a quadratic penalty,
which is the sum of the squares of the differences of consecutive components of the
vector; that is, the regularization function is
𝑅(𝐱) = ∑_(𝑖=1)^(𝑛−1) (𝑥𝑖 − 𝑥𝑖+1)²

This quadratic function can also be written as 𝑅(𝐱) = ‖𝑳𝐱‖2 , where 𝑳 ∈ ℝ(𝑛−1)×𝑛 is given
by
𝑳 = [1 −1 0 ⋯ 0 0;  0 1 −1 ⋯ 0 0;  ⋮  ⋱  ⋱  ⋮;  0 0 0 ⋯ 1 −1]
The resulting regularized least squares problem is (with 𝜆 a given regularization
parameter)
min‖𝐱 − 𝒃‖2 + 𝜆‖𝑳𝐱‖2
𝐱

and its optimal solution is given by 𝐱 𝑅𝐿𝑆 = (𝑰 + 𝜆𝑳𝑻 𝑳)−1 𝒃


Example: Consider the signal 𝐱 ∈ ℝ300 constructed by the following MATLAB
commands:

clear all, clc


t=linspace(0,4,300)';
x=sin(t)+t.*(cos(t).^2);

Essentially, this is the signal given by 𝑥ᵢ = sin(4(𝑖−1)/299) + (4(𝑖−1)/299) cos²(4(𝑖−1)/299), where
𝑖 = 1,2, . . . , 300. A normally distributed noise with zero mean and standard deviation of
0.05 was added to each of the components:

randn('seed',314);
b=x+0.05*randn(300,1);

The true and noisy signals are given in Figure, which was constructed by the MATLAB
commands

subplot(1,2,1);
plot(1:300,x,'LineWidth',2);
subplot(1,2,2);
plot(1:300,b,'LineWidth',2);

In order to denoise the signal 𝒃, we look at the optimal solution of the RLS problem, for
four different values of the regularization parameter: 𝜆 = 1, 10,100, 1000.

The original true signal is denoted by a dotted line. As can be seen in the next Figure,
as 𝜆 gets larger, the RLS solution becomes smoother.
For 𝜆 = 10 the RLS solution is a rather good estimate of the original vector 𝐱. For
𝜆 = 100 we get a smoother RLS signal, but evidently it is less accurate than 𝐱 𝑅𝐿𝑆 (10),
especially near the boundaries. The RLS solution for 𝜆 = 1000 is very smooth, but it is a
rather poor estimate of the original signal. In any case, it is evident that the parameter
𝜆 is chosen via a trade-off between data fidelity (closeness of 𝐱 to 𝒃) and smoothness
(size of 𝑳𝐱). The four plots were produced by the MATLAB commands

L=zeros(299,300);
for i=1:299
L(i,i)=1;
L(i,i+1)=-1;
end
x_rls=(eye(300)+1*L'*L)\b;
x_rls=[x_rls,(eye(300)+10*L'*L)\b];
x_rls=[x_rls,(eye(300)+100*L'*L)\b];
x_rls=[x_rls,(eye(300)+1000*L'*L)\b];

figure(2)
for j=1:4
subplot(2,2,j);
plot(1:300,x_rls(:,j),'LineWidth',2);
hold on
plot(1:300,x,':r','LineWidth',2);
hold off
title(['\lambda=',num2str(10^(j-1))]);
end
Most real-world optimizations are highly
nonlinear and multimodal, under various complex constraints. Different objectives are
often conflicting. Even for a single objective, sometimes, optimal solutions may not exist
at all. In general, finding an optimal solution or even sub-optimal solutions is not an
easy task. This work aims to introduce the fundamentals of metaheuristic optimization,
as well as some popular metaheuristic algorithms. Metaheuristic algorithms are
becoming an important part of modern optimization. A wide range of metaheuristic
algorithms have emerged over the last two decades, and many metaheuristics such as
particle swarm optimization are becoming increasingly popular. Despite their popularity,
mathematical analysis of these algorithms lags behind. Convergence analysis still
remains unsolved for the majority of metaheuristic algorithms, while efficiency analysis
is equally challenging.

Problem formulation: In general, an optimization problem can be written as

Minimize 𝑓₁(𝐱), 𝑓₂(𝐱), … , 𝑓𝑚(𝐱),   𝐱 = [𝑥₁, 𝑥₂, … , 𝑥𝑛]


Subjected to
ℎ𝑗 (𝐱) = 0, (𝑗 = 1,2, . . . , 𝐽)
𝑔𝑘 (𝐱) ≤ 0, (𝑘 = 1,2, . . . , 𝐾)

where 𝑓1 , . . . , 𝑓𝑚 (𝐱) are the objectives, while ℎ𝑗 and 𝑔𝑘 are the equality and inequality
constraints, respectively. In the
case when 𝑚 = 1 , it is called
single-objective optimization.
When 𝑚 ≥ 2 , it becomes a multi-
objective problem whose solution
strategy is different from those for
a single objective. In general, all
the functions 𝑓𝑖 , ℎ𝑗 and 𝑔𝑘 are
nonlinear. In the special case when
all these functions are linear, the
optimization problem becomes a
linear programming problem which
can be solved using the standard
simplex method (Dantzig 1963).
Metaheuristic optimization concerns more generalized, nonlinear optimization
problems. It is worth pointing out that the above minimization problem can also be
formulated as a maximization problem if 𝑓𝑖 is replaced with −𝑓𝑖 .
Derivative-free algorithms do not use any derivative information but the values of the
function itself. Some functions may have discontinuities or it may be expensive to
calculate derivatives accurately, and thus derivative-free algorithms become very
useful.

From a different perspective, optimization algorithms can be classified into trajectory-


based and population-based. A trajectory-based algorithm typically uses a single agent
or one solution at a time, which will trace out a path as the iterations continue.
Optimization algorithms can also be classified as deterministic or stochastic. If an
algorithm works in a mechanical deterministic manner without any random nature, it
is called deterministic. For such an algorithm, it will reach the same final solution if we
start with the same initial point. Evolutionary algorithms such as particle swarm
optimization (PSO), ant colony optimization (ACO) and their variants are good examples
of stochastic algorithms.

Search capability can also be a basis for algorithm classification. In this case,
algorithms can be divided into local and global search algorithms. Local search
algorithms typically converge towards a local optimum, not necessarily (often not) the
global optimum, and such an algorithm is often deterministic and has no ability to
escape from local optima. On the other hand, for global optimization, local search
algorithms are not suitable, and global search algorithms should be used. Modern
metaheuristic algorithms in most cases tend to be suitable for global optimization,
though not always successful or efficient.

Algorithms with stochastic components were often referred to as heuristic in the past,
though the recent literature tends to refer to them as metaheuristics. We will follow
Glover's convention and call all modern nature-inspired algorithms metaheuristics
(Glover 1986, Glover and Kochenberger 2003). Loosely speaking, heuristic means to find
or to discover by trial and error. Here meta- means beyond or higher level, and
metaheuristics generally perform better than simple heuristics. In addition, all
metaheuristic algorithms use a certain tradeoff of randomization and local search.
Quality solutions to difficult optimization problems can be found in a reasonable
amount of time, but there is no guarantee that optimal solutions can be reached. It is
hoped that these algorithms work most of the time, but not all the time. Almost all
metaheuristic algorithms tend to be suitable for global optimization.

Particle swarm optimization (PSO) was developed


by Kennedy and Eberhart in 1995, based on swarm behavior observed in nature such
as fish and bird schooling. Since then, PSO has generated a lot of attention, and now
forms an exciting, ever-expanding research subject in the field of swarm intelligence.
PSO has been applied to almost every area in optimization, computational intelligence,
and design/scheduling applications.
PSO searches the space of an objective function by adjusting the trajectories of
individual agents, called particles. Each particle traces a piecewise path which can be
modelled as a time-dependent positional vector. The movement of a swarming particle
consists of two major components: a stochastic component and a deterministic
component. Each particle is attracted toward the position of the current global best
𝐠 𝑏𝑒𝑠𝑡 and its own best known location 𝐱 𝑏𝑒𝑠𝑡 , while exhibiting at the same time a tendency
to move randomly.

When a particle finds a location that is better than any previously found locations, then
it updates this location as the new current best for particle 𝑖 . There is a current best for
all particles at any time 𝑡 at each iteration. The aim is to find the global best among all
the current best solutions until the objective no longer improves or after a certain
number of iterations.

Let 𝐱 𝑖 and 𝐯𝑖 be the position and velocity vectors, respectively, of particle 𝑖. The new
velocity vector is determined by the following formula

𝐯𝑖𝑘+1 = 𝜔𝐯𝑖𝑘 + 𝛼𝜺1 × (𝐱 𝑖𝑏𝑒𝑠𝑡 − 𝐱 𝑖𝑘 ) + 𝛽𝜺2 × (𝐠 𝑏𝑒𝑠𝑡 − 𝐱 𝑖𝑘 ) 𝑘: number of iterations

where 𝜺1 and 𝜺2 are two random vectors, and each entry takes a value between 0 and 1.
The parameters 𝛼 and 𝛽 are the learning parameters or acceleration constants, which
are typically equal to, say, 𝛼 ≈ 𝛽 ≈ 2. 𝜔(𝑘) is the inertia function, which takes a value between
0 and 1. In the simplest case, the inertia function can be taken as a constant, typically
𝜔 ∈ [0.5 0.9]. This is equivalent to introducing a virtual mass to stabilize the motion of
the particles, and thus the algorithm is expected to converge more quickly.

The initial locations of all particles should be distributed relatively uniformly so that
they can sample over most regions, which is especially important for multimodal
problems. The initial velocity of a particle can be set to zero, that is, 𝐯𝑖𝑘=0 = 0 . The new
position can then be updated by the formula 𝐱 𝑖𝑘+1 = 𝐱 𝑖𝑘 + 𝐯𝑖𝑘+1

As the iterations proceed, the particle system swarms and may converge towards a
global optimum.

Algorithm: [Particle Swarm Optimization]


Initialize particles
Do until maximum iterations or minimum error criteria
For each particle
Calculate Data fitness value
If the fitness value is better than pBest
Set pBest = current fitness value
If pBest is better than gBest
Set gBest = pBest
For each particle
Calculate particle Velocity
Use gBest and Velocity to update particle Data
Exercise: Write a MATLAB code to search the maximum value of the following objective
functions

▪ 𝑓(𝑥, 𝑦) = 3 sin(𝑥) + 𝑒 𝑦 − 4 ≤ 𝑥, 𝑦 ≤ 4
▪ 𝑓(𝑥, 𝑦) = 100(𝑦 − 𝑥 2 )2 + (1 − 𝑥 2 )2 − 10 ≤ 𝑥, 𝑦 ≤ 10

Using the following inertia function

𝜔(𝑖𝑡𝑒𝑟) = 𝜔𝑚𝑎𝑥 − ((𝜔𝑚𝑎𝑥 − 𝜔𝑚𝑖𝑛)/Max𝑖𝑡𝑒𝑟) × 𝑖𝑡𝑒𝑟
clear all, clc,
[X,Y] = meshgrid(-4:0.5:4,-4:0.5:4);
Z = 3*sin(X)+exp(Y); surf(X,Y,Z); figure

wmax=0.9; wmin=0.4; c1=1.49; c2=1.49;


itermax=50; xmin=[-2 -2]; xmax=[2 2];
n=20; m=2; % n=Number of Particles and m=Number of variables
v=zeros(m,n); rand('state',0);
for i=1:n
for j=1:m
x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest));

for iter=1:itermax
w=wmax-(wmax-wmin)*iter/itermax;
for i=1:n
v(:,i)=w*v(:,i)+c1*rand*(xbest(:,i)-x(:,i))+c2*rand*(gbest-x(:,i));
x(:,i)=x(:,i)+v(:,i);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
the optimal value is -1.5708
the optimal value is -2.0000
the minimum value of func is -2.8647

Exercise: Write a MATLAB code to search the maximum value of the following objective
functions

▪ 𝑓(𝑥, 𝑦) = 𝑥² − 𝑦²,   −10 ≤ 𝑥, 𝑦 ≤ 10
▪ 𝑓(𝑥) = 𝑥⁴ − 14𝑥³ + 60𝑥² − 70𝑥,   −10 ≤ 𝑥 ≤ 10
▪ 𝑓(𝑥, 𝑦) = 𝑥 sin(4𝑥) + 1.1𝑦 sin(𝑦),   −10 ≤ 𝑥, 𝑦 ≤ 10
▪ 𝑓(𝐱) = (𝑥 + 10𝑦)² + 5(𝑧 − 𝑤)² + (𝑦 − 2𝑧)⁴ + 10(𝑥 − 2𝑤)⁴

Alternatives of PSO: There are many variations which extend the standard algorithm.
The standard particle swarm optimization uses both the current global best 𝐠 𝑏𝑒𝑠𝑡 and
the individual best 𝐱 𝑖𝑏𝑒𝑠𝑡 . The reason of using the individual best is primarily to increase
the diversity in the quality solution, however, this diversity can be simulated using the
randomness. Subsequently, there is no compelling reason for using the individual best.
A simplified version which could accelerate the convergence of the algorithm is to use
the global best only. Thus, in the accelerated particle swarm optimization, the velocity
vector is generated by

𝐯𝑖𝑘+1 = 𝐯𝑖𝑘 + 𝛼 × (𝜺1 − 0.5𝐞) + 𝛽(𝐠 𝑏𝑒𝑠𝑡 − 𝐱 𝑖𝑘 ) with 𝐞𝑇 = [1 1 1 … 1]

In order to increase the convergence even further, we can also write the update of the
location in a single step
𝐱 𝑖𝑘+1 = (1 − 𝛽)𝐱 𝑖𝑘 + 𝛽𝐠 𝑏𝑒𝑠𝑡 + 𝛼 × (𝜺1 − 0.5𝐞)

A further accelerated PSO is to reduce the randomness as iterations proceed. This


mean that we can use a monotonically decreasing function such as

𝛼 = 𝛼0 𝑒 −𝛾𝑘 ; or 𝛼 = 𝛼0 𝛾 𝑘 , ( 𝛾 < 1)
clear all, clc,
[X,Y] = meshgrid(-4:0.5:4,-4:0.5:4);
Z = 3*sin(X)+exp(Y); surf(X,Y,Z); figure

wmax=0.9; wmin=0.4; c1=0.3; c2=0.49;


itermax=50; xmin=[-2 -2]; xmax=[2 2];
n=20; m=2; % n=Number of Particles and m=Number of variables
v=zeros(m,n); rand('state',0);
for i=1:n
for j=1:m
x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest));

for iter=1:itermax
w=wmax-(wmax-wmin)*iter/itermax;
for i=1:n
v(:,i)=v(:,i) + c1*(rand-0.5) + c2*(gbest-x(:,i));
x(:,i)=(1-c2)*x(:,i) + c2*gbest + c1*(rand-0.5);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
Example Write a MATLAB code to search by PSO the maximum value of

𝑓 = 2𝑥² − 3𝑦² + 4𝑧² + 2,   −10 ≤ 𝑥, 𝑦, 𝑧 ≤ 10

clear all, clc, c1=0.3; c2=0.49;


itermax=50; xmin=10*[-2 -2 -2]; xmax=10*[2 2 2];
n=20; m=3; % n=Number of Particles and m=Number of variables
v=zeros(m,n); rand('state',0);
for i=1:n
for j=1:m
x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
end
fun_marge(i)=2*(x(1,i))^2-3*(x(2,i))^2+4*(x(3,i))^2+2;
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest));

for iter=1:itermax
for i=1:n
v(:,i)=v(:,i) + c1*(rand-0.5) + c2*(gbest-x(:,i));
x(:,i)=(1-c2)*x(:,i) + c2*gbest + c1*(rand-0.5);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
fun_marge(i)=2*(x(1,i))^2-3*(x(2,i))^2+4*(x(3,i))^2+2;
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)

plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
As already mentioned, swarm intelligence is a
relatively new approach to problem
solving that takes inspiration from social
behaviors of insects and of other
animals. In particular, ants have
inspired a number of methods and
techniques among which the most
studied and the most successful is the
general-purpose optimization technique
known as ant colony optimization. Ant
colony optimization (ACO) takes
inspiration from the foraging behavior of
some ant species. These ants deposit
pheromone on the ground in order to mark some favorable path that should be followed
by other members of the colony. Ant colony optimization exploits a similar mechanism
for solving optimization problems.

Ant colony optimization (ACO), introduced by Marco Dorigo 1991 in his doctoral
dissertation, is a class of optimization algorithms modeled on the actions of an ant
colony. ACO is a probabilistic technique useful in problems that deal with finding better
paths through graphs. Artificial 'ants'—simulation agents—locate optimal solutions by
moving through a parameter space representing all possible solutions. Natural ants lay
down pheromones directing each other to resources while exploring their environment.
The simulated 'ants' similarly record their positions and the quality of their solutions,
so that in later simulation iterations more ants locate for better solutions.

Procedure: The ants construct the solutions as follows. Each ant starts from a
randomly selected city (node or vertex). Then, at each construction step it moves along
the edges of the graph. Each ant keeps a memory of its path, and in subsequent steps
it chooses among the edges that do not lead to vertices that it has already visited. An
ant has constructed a solution once it has visited all the vertices of the graph. At each
construction step, an ant probabilistically chooses the edge to follow among those that
lead to yet unvisited vertices. The probabilistic rule is biased by pheromone values and
heuristic information: the higher the pheromone and the heuristic value associated to
an edge, the higher the probability an ant will choose that particular edge. Once all the
ants have completed their tour, the pheromone on the edges is updated. Each of the
pheromone values is initially decreased by a certain percentage. Each edge then
receives an amount of additional pheromone proportional to the quality of the solutions
to which it belongs (there is one solution per ant). The solution construction process is
stochastic and is biased by a pheromone model, that is, a set of parameters associated
with graph components (either nodes or edges) whose values are modified at runtime by
the ants.
Set parameters, initialize pheromone trails
SCHEDULE_ACTIVITIES

Construct Ant Solutions (Generate a random population of 𝑚 ants (solution)).


For every individual ant ascertain the best position according to the objective function.
Get the best ant in search space.
Restore (Update) the pheromone-trail.
Verify if the termination is true.
END_SCHEDULE_ACTIVITIES

This procedure is repeatedly applied until a termination criterion is satisfied.

Parametrization: Let’s say the number of cities is 𝑛, the number of ants is 𝑚, the
distance between 𝑖 𝑡ℎ and 𝑗 𝑡ℎ cities is 𝑑𝑖𝑗 𝑖, 𝑗 = 1, 2 … , 𝑛 and the concentration of
pheromone in city (𝑖, 𝑗) at time 𝑡 is 𝜏𝑖𝑗 (𝑡). At the initial time, the pheromone
concentration 𝜏𝑖𝑗 (𝑡) between cities is equal to 𝜏𝑖𝑗 (0) = 𝐶 (𝐶 is a constant), and the
𝑘
probability of its choice is expressed by 𝑝𝑖𝑗 , and the formula is as follows:

𝑝ᵢⱼᵏ = (𝜏ᵢⱼ(𝑡))^𝛼 (𝜂ᵢⱼ(𝑡))^𝛽 / ∑_(𝑠∈𝑁(𝑥ₖ)) (𝜏ᵢₛ(𝑡))^𝛼 (𝜂ᵢₛ(𝑡))^𝛽

The parameter 𝜂𝑖𝑗 (𝑡) = 1/𝑑𝑖𝑗 is heuristic information, which indicates the degree of
expectation of ants from 𝑖 𝑡ℎ to the 𝑗 𝑡ℎ city. 𝑁(𝑥𝑘 ) (𝑘 = 1, 2 … , 𝑚) indicates that ant 𝑘 is to
visit the urban set. Furthermore, 𝛼 and 𝛽 are positive real parameters whose values
determine the relative importance of pheromone versus heuristic information. When all
ants complete a cycle, they update the pheromone according to formula
𝜏ᵢⱼ(𝑡) ⟵ (1 − 𝜌)𝜏ᵢⱼ(𝑡) + ∆𝜏ᵢⱼ,   with ∆𝜏ᵢⱼ = ∑_(𝑘=1)^(𝑚) ∆𝜏ᵢⱼᵏ

Where: 𝜌 ∈ (0,1] is a parameter called evaporation rate (i.e.


pheromone decay coefficient), (1 − 𝜌) is called the
pheromone residual factor, and ∆𝜏𝑖𝑗 is the quantity or
pheromone concentration released by the 𝑘 𝑡ℎ ant on the
path of (𝑖, 𝑗). In the basic (ACO), only the positive feedback
pheromone concentration is usually updated. In order to
update the pheromone concentration in the search process
we use
∆𝜏ᵢⱼᵏ = 𝑄/𝐿ₖ if ant 𝑘 uses edge (𝑖, 𝑗) in its tour, and ∆𝜏ᵢⱼᵏ = 0 otherwise

where 𝑄 is a constant that represents the total amount of pheromone released once by
an ant, and 𝐿ₖ is the tour length of the 𝑘th ant.
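As an illustration of the transition rule, the following minimal MATLAB sketch computes the probabilities 𝑝ᵢⱼᵏ for a single ant on a small, assumed distance matrix and then draws the next city by roulette-wheel selection; the distances, 𝛼, 𝛽 and the starting city are arbitrary choices, not data from the text. The longer program that follows applies a similar pheromone-bookkeeping idea to a continuous problem by discretizing each decision variable into a grid of "cities".

% One step of the ACO transition rule: the next city is chosen with probability
% proportional to tau^alpha * eta^beta over the unvisited cities (assumed data)
clear all, clc
D = [0 2 9 10; 2 0 6 4; 9 6 0 8; 10 4 8 0];   % assumed symmetric distance matrix
tau = ones(4);                                % initial pheromone tau_ij(0) = C = 1
eta = 1./(D + eye(4));                        % heuristic info eta_ij = 1/d_ij (eye avoids 1/0)
alpha = 1; beta = 2;                          % pheromone vs. heuristic weights
current = 1; visited = 1;                     % the ant starts in city 1
unvisited = setdiff(1:4, visited);
w = (tau(current,unvisited).^alpha).*(eta(current,unvisited).^beta);
p = w/sum(w)                                  % transition probabilities to cities 2, 3, 4
next = unvisited(find(cumsum(p) >= rand, 1))  % roulette-wheel choice of the next city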
clear all, clc,
%LB=20*[-1 -1 -1]; UB=20*[1 1 1]; nvars=size(LB,2);
%f=@(x)2*x(1)^2-3*x(2)^2+4*x(3)^2+2; % Ant-cost
LB=20*[-1 -1]; UB=20*[1 1]; nvars=size(LB,2);
f=@(x)3*sin(x(1))+exp(x(2));
MaxTour=100; % Number of Tours
piece=500; % Number of pieces (cities)
max_assign=50; % MaxValue of assign
ants=50; % Number of Ants
poz_ph=0.5; % PositivePheremone
neg_ph=0.2; % NegativePheremone
lambda=0.95; % EvaporationParameter
ph=0.05; % Pheromone

pher=ones(piece,nvars);
indis=zeros(ants,nvars);
costs=zeros(ants,1);
cost_general=zeros(max_assign,(nvars+1));

deger=zeros(piece,nvars); deger(1,:)=LB;

for i=2:piece
for j=1:nvars
deger(i,j)=deger(i-1,j) + (UB(j)-LB(j))/(piece-1);
end
end
assign=0;
while (assign<max_assign)
for i=1:ants % FINDING THE PARAMETERS OF VALUE
prob = pher.*rand(piece,nvars);
for j=1:nvars
indis(i,j) = find(prob(:,j) == max(prob(:,j)));
end
temp=zeros(1,nvars);
for j=1:nvars
temp(j)=deger(indis(i,j),j);
end
costs(i) = f(temp); % LOCAL UPDATING
deltalocal = zeros(piece,nvars);
% Creates Matrix Contain the Pheremones Deposited for Local Updating
for j=1:nvars
deltalocal(indis(i,j),j)=(poz_ph*ph/(costs(i)));
end
pher = pher + deltalocal;
end
best_ant= min(find(costs==min(costs)));
worst_ant = min(find(costs==max(costs)));
deltapos = zeros(piece,nvars);
deltaneg = zeros(piece,nvars);
for j=1:nvars
deltapos(indis(best_ant,j),j)=(ph/(costs(best_ant)));
% UPDATING PHER OF nvars
deltaneg(indis(worst_ant,j),j)=-(neg_ph*ph/(costs(worst_ant)));
% NEGATIVE UPDATING PHER OF worst path
end
delta = deltapos + deltaneg;
pher = pher.^lambda + delta;
assign=assign + 1; % Update general cost matrix
for j=1:nvars
cost_general (assign,j)=deger(indis(best_ant,j),j);
end
cost_general (assign,nvars+1)=costs(best_ant);
xlabel Tour
title('Change in Cost Value. Red: Means, Blue: Best')
hold on
plot(assign, mean(costs), '.r');
plot(assign, costs(best_ant), '.b');
end
list_cost=sortrows(cost_general,nvars+1);
for j=1:nvars
x(j)=list_cost(1,j);
end
x1=x', fmax=f(x1)
The Firefly Algorithm (FA) was developed by
Xin-She Yang (Yang 2008) and is based on the flashing patterns and behavior of
fireflies. In essence, FA uses the following three idealized rules:

⦁ Fireflies are unisex (one firefly will be attracted to other fireflies regardless of their sex)
⦁ The attractiveness is proportional to the brightness and both decrease as the distance
between two fireflies increases. Thus for any two flashing fireflies, the brighter firefly
will attract the other one. If neither one is brighter, then a random move is performed.
⦁ The brightness of a firefly is determined by the landscape of the objective function.

As a firefly's attractiveness is proportional to the light intensity seen by adjacent


fireflies, we can now define the variation of attractiveness 𝛽 with the distance 𝑟 by
𝛽 = 𝛽₀ 𝑒^(−𝛾𝑟²), where 𝛽₀ is the attractiveness at 𝑟 = 0. The movement of a firefly 𝑖, attracted
to another more attractive (brighter) firefly 𝑗, is determined by

𝐱ᵢᵏ⁺¹ = 𝐱ᵢᵏ + 𝛽₀ 𝑒^(−𝛾𝑟ᵢⱼ²)(𝐱ⱼᵏ − 𝐱ᵢᵏ) + 𝛼 × 𝐞ᵢᵏ

where 𝛾 is the light absorption coefficient, which can be in the range [0.01, 100], and 𝑟ᵢⱼ is the
line-of-sight distance between the fireflies. The second term 𝛽₀ 𝑒^(−𝛾𝑟ᵢⱼ²)(𝐱ⱼᵏ − 𝐱ᵢᵏ) is due to
the attraction. The third term 𝛼 × 𝐞𝑘𝑖 is a randomization with 𝛼 being the randomization
parameter, and 𝐞𝑘𝑖 is a vector of random numbers drawn from a Gaussian distribution
or uniform distribution at time k. If 𝛽0 = 0 , it becomes a simple random walk.
Furthermore, the randomization 𝐞𝑘𝑖 can easily be extended to other distributions such
as Lévy flights.
clear all, clc, c1=0.8; c2=0.7; gama=20;
itermax=50; xmin=10*[-2 -2]; xmax=10*[2 2];
n=50; m=2; % n=Number of Particles and m=Number of variables
rand('state',0); % v=zeros(m,n);
for i=1:n
for j=1:m
x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
end
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest));

for iter=1:itermax
for i=1:n
for j=1:i
r= norm(x(:,j)-x(:,i));
x(:,i)=x(:,i)+c2*(exp(-gama*r^2))*(x(:,j)-x(:,i))+c1*(randn-0.5);
end
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on

It can be shown that the limiting case 𝛾 → 0 corresponds to the standard Particle
Swarm Optimization (PSO). In fact, if the inner loop (for j) is removed and x(:,j) is
replaced by the current global best, then FA essentially becomes the standard PSO.
In computer science and operations research,
the artificial bee colony algorithm (ABC) is an optimization algorithm based on the
intelligent foraging behavior of honey bee swarm,
proposed by Derviş Karaboğa (Erciyes University) in
2005. In the ABC model, the colony consists of three
groups of bees: employed bees, onlookers and scouts.
It is assumed that there is only one artificial
employed bee for each food source. In other words,
the number of employed bees in the colony is equal
to the number of food sources around the hive.
Employed bees go to their food source and come
back to hive and dance on this area. The employed
bee whose food source has been abandoned becomes a scout and starts to search for
finding a new food source. Onlookers watch the dances of employed bees and choose
food sources depending on dances.

Notes: employed bees are associated with specific food sources; onlooker bees watch the dances of employed bees within the hive to choose a food source; and scout bees search for food sources randomly. Both onlookers and scouts are also called unemployed bees.

The main steps of the algorithm are given below:


▪ Initial food sources are produced for all employed bees
▪ REPEAT
▪ Each employed bee goes to the food source in her memory, determines a neighbouring source, then evaluates its nectar amount and dances in the hive.
▪ Each onlooker watches the dances of the employed bees, chooses one of their sources depending on the dances, and then goes to that source. After choosing a neighbour of that source, she evaluates its nectar amount.
▪ Abandoned food sources are determined and are replaced with the new food
sources discovered by scouts.
▪ The best food source found so far is registered.
▪ UNTIL (requirements are met)

Initialization Phase: All the vectors of the population of food sources, 𝐱 𝑘 , are initialized
by scout bees and control parameters are set. Since each food source, 𝐱 𝑘 , is a solution
vector to the optimization problem, each 𝐱 𝑘 vector holds 𝑛 variables, (𝐱 𝑘 (𝑖), 𝑖 = 1. . . 𝑛),
which are to be optimized so as to minimize the objective function.

𝐱 𝑘 (𝑖) = 𝒍𝑖 + rand(0,1) × (𝒖𝑖 − 𝒍𝑖 )

where 𝒍𝑖 and 𝒖𝑖 are the lower and upper bound of the parameter 𝐱 𝑘 (𝑖) , respectively.

Employed Bees Phase: Employed bees search for new food sources (𝐯𝑘 ) having more
nectar within the neighbourhood of the food source (𝐱 𝑘 ) in their memory.

𝐯𝑘 (𝑖) = 𝐱 𝑘 (𝑖) + 𝛼𝑘 (𝑖)(𝐱𝑘 (𝑖) − 𝐱 𝑚 (𝑖))


where 𝐱 𝑚 is a randomly selected food source, and 𝛼𝑘 (𝑖) is a random number within the
range [−𝛽, 𝛽]. After producing the new food source 𝐯𝑘 , its fitness is calculated and a
greedy selection is applied between 𝐯𝑘 and 𝐱 𝑘 .
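As a small illustration of this phase (a sketch only: xk is the current food source, xm a randomly selected different source, f the objective handle, and the bound 𝛽 is taken here as 1):

% Employed-bee move and greedy selection for one food source (sketch)
beta  = 1;                              % assumed bound on the random coefficients
alpha = 2*beta*rand(size(xk)) - beta;   % random numbers in [-beta, beta]
vk    = xk + alpha.*(xk - xm);          % candidate neighbour source
if f(vk) < f(xk)                        % greedy selection between vk and xk
    xk = vk;                            % keep the better of the two sources
end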

The fitness value of the solution, fit(𝐱 𝑘 ), might be calculated for minimization problems
using the following formula
fit(𝐱 𝑘 ) = 1 / (1 + f(𝐱 𝑘 ))      if f(𝐱 𝑘 ) ≥ 0
fit(𝐱 𝑘 ) = 1 + |f(𝐱 𝑘 )|          if f(𝐱 𝑘 ) < 0
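In MATLAB this cost-to-fitness conversion can be sketched as follows, where fx denotes the objective value f(𝐱 𝑘 ):

% Cost-to-fitness conversion used by ABC (sketch)
if fx >= 0
    fitk = 1/(1 + fx);
else
    fitk = 1 + abs(fx);
end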

Onlooker Bees Phase: Unemployed bees consist of two groups of bees: onlooker bees
and scouts. Employed bees share their food source information with onlooker bees
waiting in the hive and then onlooker bees probabilistically choose their food sources
depending on this information. In ABC, an onlooker bee chooses a food source
depending on the probability values calculated using the fitness values provided by
employed bees. For this purpose, a fitness based selection technique can be used, such
as the roulette wheel selection method (Goldberg, 1989).

The probability value 𝑝𝑘 with which 𝐱 𝑘 is chosen by an onlooker bee can be calculated using the following expression

𝑝𝑘 = fit(𝐱 𝑘 ) / ∑ fit(𝐱 𝑚 ),   where the sum runs over all 𝑁 food sources 𝑚 = 1, … , 𝑁.

After a food source 𝐱 𝑘 for an onlooker bee is probabilistically chosen, a neighbourhood source 𝐯𝑘 is determined using the equation of the employed bees phase, and its fitness value is computed. As in the employed bees phase, a greedy selection is applied between 𝐯𝑘 and 𝐱 𝑘 . Hence, more onlookers are recruited to richer sources and positive feedback behavior appears.
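A compact MATLAB sketch of this roulette-wheel choice, assuming P is the vector of probabilities 𝑝𝑘, is:

% Roulette-wheel selection (sketch): P holds the probabilities p_k, sum(P) = 1
r = rand;                      % uniform random number in (0,1)
k = find(r <= cumsum(P), 1);   % first source whose cumulative probability reaches r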

Scout Bees Phase: The unemployed bees who choose their food sources randomly are
called scouts. Employed bees whose solutions cannot be improved through a
predetermined number of trials, specified by the user of the ABC algorithm and called
“limit” or “abandonment criteria” herein, become scouts and their solutions are
abandoned. Then, the converted scouts start to search for new solutions, randomly. For
instance, if solution 𝐱 𝑘 has been abandoned, the new solution discovered by the scout
who was the employed bee of 𝐱 𝑘 can be defined by 𝐱 𝑘 (𝑖) = 𝒍𝑖 + rand(0,1) × (𝒖𝑖 − 𝒍𝑖 ). Hence
those sources which are initially poor or have been made poor by exploitation are
abandoned and negative feedback behavior arises to balance the positive feedback.

Exercise: Write a MATLAB code to search for the minimum value of the following objective functions

𝑓(𝐱) = 3 sin(𝑥) + 𝑒^𝑦 ,        −5 ≤ 𝑥, 𝑦 ≤ 5
𝑓(𝐱) = 2𝑥² + 3𝑦² + 4𝑧² + 5𝑤² + 10,        −5 ≤ 𝑥, 𝑦, 𝑧, 𝑤 ≤ 5
clc;
clear;
close all;
%% Problem Definition
% CostFunction=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2+5*x(4)^2+10;
f=@(x)3*sin(x(1))+exp(x(2)); % CostFunction
nVar=2; % Number of Decision Variables
VarSize=[1 nVar]; % Decision Variables Matrix Size
VarMin=-5; % Decision Variables Lower Bound
VarMax= 5; % Decision Variables Upper Bound

%% ABC Settings
MaxIt=500; % Maximum Number of Iterations
nPop=500; % Population Size (Colony Size)
nOnlooker=nPop; % Number of Onlooker Bees
L=round(0.6*nVar*nPop); % Abandonment Limit Parameter (Trial Limit)
a=1; % Acceleration Coefficient Upper Bound

%% Initialization
% Empty Bee Structure
empty_bee.Position=[]; empty_bee.Cost=[];
pop=repmat(empty_bee,nPop,1); % Initialize Population Array
BestSol.Cost=inf; % Initialize Best Solution Ever Found

% Create Initial Population


for i=1:nPop
pop(i).Position=VarMin+rand(1,nVar).*(VarMax-VarMin); % uniform in [VarMin, VarMax]
pop(i).Cost=f(pop(i).Position);
if pop(i).Cost<=BestSol.Cost
BestSol=pop(i);
end
end

C=zeros(nPop,1); % Abandonment Counter


BestCost=zeros(MaxIt,1); % Array to Hold Best Cost Values

%% ABC Main Loop


for it=1:MaxIt
% Recruited Bees (Employed Bees Phase)
for i=1:nPop
K=[1:i-1 i+1:nPop]; % Choose k randomly, not equal to i
k=K(randi([1 numel(K)]));
phi=a*rand(1,nVar); % Define Acceleration Coeff.

% New Bee Position


newbee.Position=pop(i).Position+phi.*(pop(i).Position-pop(k).Position);
newbee.Cost=f(newbee.Position); % Evaluation
% Comparison
if newbee.Cost<=pop(i).Cost
pop(i)=newbee;
else
C(i)=C(i)+1;
end

end

% Calculate Fitness Values and Selection Probabilities


F=zeros(nPop,1);
MeanCost = mean([pop.Cost]);

for i=1:nPop
F(i) = exp(-pop(i).Cost/MeanCost); % Convert Cost to Fitness
end
P=F/sum(F);

% Onlooker Bees (Onlooker Bees Phase)


for m=1:nOnlooker

%-----------------------------------------------
% Select Source Site by Roulette Wheel Selection
%-----------------------------------------------
r=rand;
CS=cumsum(P); % cumulative selection probabilities (use CS, not the abandonment counter C)
i=find(r<=CS,1,'first');
%-----------------------------------------------

K=[1:i-1 i+1:nPop]; % Choose k randomly, not equal to i


k=K(randi([1 numel(K)]));
phi=a*rand(1,nVar); % Define Acceleration Coeff.

% New Bee Position


newbee.Position=pop(i).Position+phi.*(pop(i).Position-pop(k).Position);
newbee.Cost=f(newbee.Position); % Evaluation

% Comparison

if newbee.Cost<=pop(i).Cost
pop(i)=newbee;
else
C(i)=C(i)+1;
end
end
% Scout Bees (Scout Bees Phase)
for i=1:nPop
if C(i)>=L
pop(i).Position=VarMin+rand(1,nVar).*(VarMax-VarMin); % scout re-initializes the abandoned source
pop(i).Cost=f(pop(i).Position);
C(i)=0;
end
end
% Update Best Solution Ever Found
for i=1:nPop
if pop(i).Cost<=BestSol.Cost
BestSol=pop(i);
end
end
BestCost(it)=BestSol.Cost; % Store Best Cost Ever Found

% Display Iteration Information


disp(['Iteration ' num2str(it) ': Best Cost = ' num2str(BestCost(it))]);
end

%% Results
BestSol

figure;
%plot(BestCost,'LineWidth',2);
semilogy(BestCost,'LineWidth',2);
xlabel('Iteration'); ylabel('Best Cost');
grid on;
Bacteria Foraging Optimization Algorithm (BFOA), proposed by Passino, is a newcomer to the family of nature-inspired optimization algorithms. Over the last five decades, optimization algorithms like Genetic Algorithms (GAs), Evolutionary Programming (EP) and Evolutionary Strategies (ES), which draw their inspiration from evolution and natural genetics, have been dominating the realm of optimization algorithms. More recently, swarm-inspired algorithms like Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) have found their way into this domain and proved their effectiveness. Following the same trend of swarm-based algorithms, Passino proposed the BFOA. The key idea of the new algorithm is the application of the group foraging strategy of a swarm of E. coli bacteria to multi-optimal function optimization. Bacteria search for nutrients in a manner that maximizes the energy obtained per unit time, and an individual bacterium also communicates with the others by sending signals. A bacterium takes its foraging decisions after considering these two factors. The process in which a bacterium moves by taking small steps while searching for nutrients is called chemotaxis, and the key idea of BFOA is mimicking the chemotactic movement of virtual bacteria in the problem search space.

Now suppose that we want to find the minimum of the cost function 𝑱(𝜽) where 𝜽 ∈ ℜ𝑝
(i.e. 𝜽 is a 𝑝-dimensional vector of real numbers), and we do not have measurements or
an analytical description of the gradient ∇𝑱(𝜽). BFOA mimics the four principal
mechanisms observed in a real bacterial system: chemotaxis, swarming,
reproduction, and elimination-dispersal to solve this non-gradient optimization
problem. A virtual bacterium is actually one trial solution (may be called a search-
agent) that moves on the functional surface (see Figure above) to locate the global
optimum.
Flow diagram illustrating the bacterial foraging optimization algorithm
Generic algorithm of BFO

Randomly distribute initial values for 𝜃𝑖 , 𝑖 = 1, 2, . . . , 𝑆 across the optimization domain.


Compute the initial cost function value for each bacterium 𝑖 as 𝐽𝑖 , and the initial total cost with swarming effect as 𝐽𝑠𝑤^𝑖 .

for Elimination-dispersal loop do


for Reproduction loop do
for Chemotaxis loop do
for Bacterium i do

Tumble: Generate a random vector 𝜑 of unit length as a random search direction


Move: Let 𝜃𝑛𝑒𝑤 = 𝜃𝑖 + 𝑐𝜑 and compute the corresponding 𝐽𝑛𝑒𝑤 . Let 𝐽𝑠𝑤^𝑛𝑒𝑤 = 𝐽𝑛𝑒𝑤 + 𝐽𝑐𝑐 (𝜃𝑛𝑒𝑤 , 𝜃)
Swim: Let 𝑚 = 0
while 𝑚 < 𝑁𝑠 do
let 𝑚 = 𝑚 + 1
if 𝐽𝑠𝑤^𝑛𝑒𝑤 < 𝐽𝑠𝑤^𝑖 then
Let 𝜃𝑖 = 𝜃𝑛𝑒𝑤 , and compute the corresponding 𝐽𝑖 and 𝐽𝑠𝑤^𝑖
Let 𝜃𝑛𝑒𝑤 = 𝜃𝑖 + 𝑐𝜑 and compute the corresponding 𝐽(𝜃𝑛𝑒𝑤 )
Let 𝐽𝑠𝑤^𝑛𝑒𝑤 = 𝐽𝑛𝑒𝑤 + 𝐽𝑐𝑐 (𝜃𝑛𝑒𝑤 , 𝜃)
else
let 𝑚 = 𝑁𝑠
end
end
end
end
Sort the bacteria in order of ascending cost 𝐽𝑠𝑤 . The 𝑆𝑟 = 𝑆/2 bacteria with the highest 𝐽 values die and the other 𝑆𝑟 bacteria with the best values split into two. Update the values of 𝐽 and 𝐽𝑠𝑤 accordingly.
end
Eliminate and disperse the bacteria to random locations on the optimization domain
with probability 𝑝𝑒𝑑 . Update corresponding 𝐽 and 𝐽𝑠𝑤 .
end
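The swarming term 𝐽𝑐𝑐 used above is not spelled out here; one common form, following Passino, sums an attractant and a repellent effect over all bacteria. A minimal MATLAB sketch is given below (the four weights are illustrative values only; note that the listing that follows uses the raw cost and omits this term):

% Swarming term J_cc(theta, P) (sketch of one common form, following Passino)
% theta : position of the current bacterium (p-by-1 vector)
% P     : p-by-S matrix whose columns are the positions of all S bacteria
function Jcc = swarm_cost(theta, P)
    d_attract = 0.1;  w_attract = 0.2;   % depth and width of the attractant (assumed values)
    h_repel   = 0.1;  w_repel   = 10;    % height and width of the repellent (assumed values)
    d2  = sum((P - theta).^2, 1);        % squared distance to every bacterium
    Jcc = sum(-d_attract*exp(-w_attract*d2)) + sum(h_repel*exp(-w_repel*d2));
end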
clc; clear; close all;
f=@(x)x(1)^2+x(1)*x(2)+3*x(2)^2+10; % The objective function
%Initialization
p=2; % dimension of search space
s=100; % The number of bacteria
Nc=100; % Number of chemotactic steps
Ns=8; % Limits the length of a swim
Nre=4; % The number of reproduction steps
Ned=2; % The number of elimination-dispersal events
Sr=s/2; % The number of bacteria reproductions (split) per generation
Ped=0.25; % Probability that each bacteria will be eliminated/dispersed
c(:,1)=0.05*ones(s,1); % the run length
% initial positions (each column of P(:,:,1,1,1) is one bacterium)
P(1,:,1,1,1)= 50*rand(1,s);
P(2,:,1,1,1)= .2*rand(1,s);
%P(3,:,1,1,1)= .2*rand(1,s);
% Main loop
for ell=1:Ned %Elimination and dispersal loop
for K=1:Nre % Reproduction loop
for j=1:Nc % swim/tumble (chemotaxis) loop
for i=1:s
J(i,j,K,ell)=f(P(:,i,j,K,ell));
% Tumble
Jlast=J(i,j,K,ell);
Delta(:,i)=(2*round(rand(p,1))-1).*rand(p,1);
P(:,i,j+1,K,ell)=P(:,i,j,K,ell)+c(i,K)*Delta(:,i)/sqrt(Delta(:,i)'*Delta(:,i)); % step of length c(i,K) along the unit random direction

% Swim (for bacteria that seem to be headed in the right direction)


J(i,j+1,K,ell)=f(P(:,i,j+1,K,ell));
m=0; % Initialize counter for swim length
while m<Ns
m=m+1;
if J(i,j+1,K,ell)<Jlast
Jlast=J(i,j+1,K,ell);
P(:,i,j+1,K,ell)=P(:,i,j+1,K,ell)+c(i,K)*Delta(:,i)/sqrt(Delta(:,i)'*Delta(:,i));
J(i,j+1,K,ell)=f(P(:,i,j+1,K,ell));
else
m=Ns ;
end
end
J(i,j,K,ell)=Jlast;
sprintf('The value of iteration i=%3.0f, j=%3.0f, K=%3.0f, ell=%3.0f',i,j,K,ell);
end % Go to next bacterium
x = P(1,:,j,K,ell);
y = P(2,:,j,K,ell);
clf
plot(x, y , 'h')
grid on
axis([-5 5 -5 5]);
pause(.1)
end % Go to the next chemotactic

%Reproduction
Jhealth=sum(J(:,:,K,ell),2); % Set the health of each of the S bacteria
[Jhealth,sortind]=sort(Jhealth); % Sorts the nutrient concentration
P(:,:,1,K+1,ell)=P(:,sortind,Nc+1,K,ell);
c(:,K+1)=c(sortind,K); % keep the chemotaxis step size with each bacterium at the next generation

%Split the bacteria (reproduction)


for i=1:Sr
P(:,i+Sr,1,K+1,ell)=P(:,i,1,K+1,ell);
% The least fit do not reproduce; the most fit ones split into two identical copies
c(i+Sr,K+1)=c(i,K+1);
end
end % Go to next reproduction

%Elimination and dispersal


for m=1:s
if Ped>rand % Generate random number; disperse this bacterium to a new random location
P(1,m,1,1,ell+1)= 50*rand;
P(2,m,1,1,ell+1)= .2*rand;
%P(3,m,1,1,ell+1)= .2*rand;
else
P(:,m,1,1,ell+1)=P(:,m,1,Nre+1,ell); % Bacteria that are not dispersed
end
end
end % Go to next elimination and dispersal

%Report
reproduction = J(:,[1:Ns,Nre,Ned]);
[jlastreproduction,O]=min(reproduction,[],2); %minf for each bacterial
[Y,I] = min(jlastreproduction)
pbest=P(:,I,O(I,:),K,ell)
plot([1:s],jlastreproduction)
xlabel('Iteration'), ylabel('Function')
The GWO algorithm, proposed by Mirjalili et al. in 2014, mimics the leadership hierarchy and hunting mechanism of grey wolves in nature. Four types of grey wolves, namely alpha, beta, delta, and omega, are employed for simulating the leadership hierarchy. In addition, three main steps of hunting, searching for prey, encircling prey, and attacking prey, are implemented to perform optimization.

Mathematical model: The hunting technique and the social hierarchy of grey wolves
are mathematically modeled in order to design GWO and perform optimization. The
proposed mathematical models of the social hierarchy, tracking, encircling, and
attacking prey are as follows:

■ Social hierarchy: In order to mathematically model the social hierarchy of wolves when designing GWO, we consider the fittest solution as the alpha (α). Consequently, the second and third best solutions are named beta (β) and delta (δ), respectively. The rest of the candidate solutions are assumed to be omega (ω). In the GWO algorithm the hunting (optimization) is guided by α, β, and δ. The ω wolves follow these three wolves.

■ Encircling prey: As mentioned above, grey wolves encircle prey during the hunt. In order to mathematically model encircling behavior the following equations are proposed:

𝐗(𝑡 + 1) = 𝐗𝑝 (𝑡) − 𝐀 · 𝐃,    with    𝐃 = |𝐂 · 𝐗𝑝 (𝑡) − 𝐗(𝑡)|

where 𝑡 indicates the current iteration, 𝐀 and 𝐂 are coefficient vectors, 𝐗𝑝 (𝑡) is the position vector of the prey, and 𝐗(𝑡) indicates the position vector of a grey wolf. The vectors 𝐀 and 𝐂 are calculated as follows:

𝐀 = 2𝐚 · 𝐫1 − 𝐚,    𝐂 = 2𝐫2

where the components of 𝐚 are linearly decreased from 2 to 0 over the course of iterations and 𝐫1 , 𝐫2 are random vectors in [0, 1].

■ Hunting: Grey wolves have the ability to recognize the location of prey and encircle
them. The hunt is usually guided by the alpha. The beta and delta might also
participate in hunting occasionally. However, in an abstract search space we have no
idea about the location of the optimum (prey). In order to mathematically simulate the
hunting behavior of grey wolves, we suppose that the alpha (the best candidate solution), beta, and delta have better knowledge about the potential location of prey. Therefore, we save the first three best solutions obtained so far and oblige the other search agents (including the omegas) to update their positions according to the positions of these best search agents. The following formulas are proposed in this regard.

𝐃𝛼 = |𝐂1 · 𝐗𝛼 − 𝐗|,    𝐗1 = 𝐗𝛼 − 𝐀1 · 𝐃𝛼
𝐃𝛽 = |𝐂2 · 𝐗𝛽 − 𝐗|,    𝐗2 = 𝐗𝛽 − 𝐀2 · 𝐃𝛽
𝐃𝛿 = |𝐂3 · 𝐗𝛿 − 𝐗|,    𝐗3 = 𝐗𝛿 − 𝐀3 · 𝐃𝛿

𝐗(𝑡 + 1) = (𝐗1 + 𝐗2 + 𝐗3 ) / 3
With these equations, a search agent updates its position according to alpha, beta, and delta in an n-dimensional search space. In addition, the final position would be in a random place within a circle which is defined by the positions of alpha, beta, and delta in the search space. In other words, alpha, beta, and delta estimate the position of the prey, and the other wolves update their positions randomly around the prey.

% Grey Wolf Optimizer


clear all, clc

SearchAgents_no=20;
Max_iter=200;
dim=4;
lb=-0.25*ones(1,dim); ub=0.25*ones(1,dim);
%fobj=@(x)(x(1)-1)^2+(x(2)-2)^2+(x(3)-3)^2+(x(4)-4)^2+(x(5)-5)^2;
%fobj=@(x)3*sin(x(1))+exp(x(2)); dim=2;
fobj=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2+5*x(4)^2+10;

% initialize alpha, beta, and delta_pos


Alpha_pos=zeros(1,dim);
Alpha_score=inf; %change this to -inf for maximization problems
Beta_pos=zeros(1,dim);
Beta_score=inf; %change this to -inf for maximization problems
Delta_pos=zeros(1,dim);
Delta_score=inf; %change this to -inf for maximization problems

%---------------------------------------------------------------------%
%Initialize the positions of search agents
%---------------------------------------------------------------------%

Boundary_no= size(ub,2); % number of boundaries

% If the boundaries of all variables are equal and the user enters a single
% number for both ub and lb
if Boundary_no==1
Positions=rand(SearchAgents_no,dim).*(ub-lb)+lb;
end

% If each variable has a different lb and ub


if Boundary_no>1
for i=1:dim
ub_i=ub(i);
lb_i=lb(i);
Positions(:,i)=rand(SearchAgents_no,1).*(ub_i-lb_i)+lb_i;
end
end
%---------------------------------------------------------------------%
Convergence_curve=zeros(1,Max_iter);
l=0; % Loop counter
% Main loop
while l<Max_iter
for i=1:size(Positions,1)
% Return back the search agents that go beyond the boundaries
Flag4ub=Positions(i,:)>ub; Flag4lb=Positions(i,:)<lb;
Positions(i,:)=(Positions(i,:).*(~(Flag4ub+Flag4lb)))+ub.*Flag4ub+lb.*Flag4lb;
% Calculate objective function for each search agent
fitness=fobj(Positions(i,:));
% Update Alpha, Beta, and Delta
if fitness<Alpha_score
Alpha_score=fitness; % Update alpha
Alpha_pos=Positions(i,:);
end
if fitness>Alpha_score && fitness<Beta_score
Beta_score=fitness; % Update beta
Beta_pos=Positions(i,:);
end
if fitness>Alpha_score && fitness>Beta_score && fitness<Delta_score
Delta_score=fitness; % Update delta
Delta_pos=Positions(i,:);
end
end
a=2-l*((2)/Max_iter); % a decreases linearly from 2 to 0
% Update the Position of search agents including omegas
for i=1:size(Positions,1)
for j=1:size(Positions,2)
r1=rand();r2=rand(); % random numbers in [0,1]
A1=2*a*r1-a; C1=2*r2;
D_alpha=abs(C1*Alpha_pos(j)-Positions(i,j));
X1=Alpha_pos(j)-A1*D_alpha;
r1=rand(); r2=rand();
A2=2*a*r1-a; C2=2*r2;
D_beta=abs(C2*Beta_pos(j)-Positions(i,j));
X2=Beta_pos(j)-A2*D_beta;
r1=rand(); r2=rand();
A3=2*a*r1-a; C3=2*r2;
D_delta=abs(C3*Delta_pos(j)-Positions(i,j));
X3=Delta_pos(j)-A3*D_delta;
Positions(i,j)=(X1+X2+X3)/3;
end
end
l=l+1;
Convergence_curve(l)=Alpha_score;
end
%-----------------%
X=Alpha_pos     % best position found (the alpha wolf)
F0=Alpha_score  % corresponding objective value
%-----------------%
% Plot
%-----%

% Surface plot of the objective (only meaningful when dim==2 and fobj takes two variables)
if dim==2
    x=-3:0.5:3; y=-3:0.5:3;
    L=length(x);
    f=zeros(L,L);
    for i=1:L
        for j=1:L
            f(i,j)=fobj([x(i),y(j)]);
        end
    end
    surfc(x,y,f,'LineStyle','none');
end

Nature Inspired algorithms include:


▪ Artificial Bee Colony Algorithm, ▪ Ant Colony Optimisation
▪ Firefly Algorithm, ▪ Swarm Optimisation
▪ Social Spider Algorithm, ▪ Fractal Stochastic Optimization
▪ Bat Algorithm, ▪ Rat Swarm Optimizer
▪ Strawberry Algorithm, ▪ Fish School Search Optimizer
▪ Plant Propagation Algorithm, ▪ The Grey Wolf Optimizer
▪ Seed Based Plant Propagation Algorithm ▪ Bacterial Foraging Optimization
▪ Genetic Algorithm, ▪ Harmony search Algorithm
▪ Simulated Annealing, ▪ Coronavirus herd immunity optimizer.

%-------------------------------------------------------------------------------------------------------%
Applications of Swarm Intelligence: Swarm Intelligence-based techniques can be
used in a number of applications. The U.S. military is investigating swarm techniques
for controlling unmanned vehicles. The European Space Agency is thinking about an
orbital swarm for self-assembly and interferometry. NASA is investigating the use of
swarm technology for planetary mapping. A 1992 paper by M. Anthony Lewis and
George A. Bekey discusses the possibility of using swarm intelligence to control
nanobots within the body for the purpose of killing cancer tumors. Conversely al-Rifaie
and Aber have used stochastic diffusion search to help locate tumours. Swarm
intelligence has also been applied to data mining. Ant-based models are a further subject of modern management theory.

%-------------------------------------------------------------------------------------------------------%
CVX: is a MATLAB-based modeling system for convex optimization. It was created by
Michael Grant and Stephen Boyd. This MATLAB package is in fact an interface to other
convex optimization solvers such as SeDuMi and SDPT3. We will explore here some of
the basic features of the software, but a more comprehensive and complete guide can
be found at the CVX website (CVXr.com). The basic structure of a CVX program is as
follows:

cvx_begin
{variables declaration}
minimize({objective function}) or maximize({objective function})
subject to
{constraints}
cvx_end

CVX accepts only convex functions as objective and constraint functions. There are
several basic convex functions, called “atoms,” which are embedded in CVX.
Example: Suppose that we wish to solve the least squares problem

minimize ‖𝑨𝐱 − 𝒃‖2


subject to 𝑪𝐱 = 𝒅
‖𝐱 ‖∞ ≤ 𝑒
m = 20; n = 10; p = 4;
A = randn(m,n); b = randn(m,1);
C = randn(p,n); d = randn(p,1); e = rand;
cvx_begin
variable x(n)
minimize(norm(A*x-b,2))
subject to
C*x ==d
norm(x,Inf)<=e
cvx_end

Example: Suppose that we wish to write a CVX code that solves the convex
optimization problem

minimize √(𝑥1² + 𝑥2² + 1) + 2 max{𝑥1 , 𝑥2 , 0}

subject to |𝑥1 | + |𝑥2 | + 𝑥1²/𝑥2 ≤ 5
1/𝑥2 + 𝑥1⁴ ≤ 10
𝑥2 ≥ 1
𝑥1 ≥ 0
cvx_begin
variable x(2)
minimize(norm([x;1])+2*max(max(x(1),x(2)),0))
subject to
norm(x,1)+quad_over_lin(x(1),x(2))<=5
inv_pos(x(2))+x(1)^4<=10
x(2)>=1
x(1)>=0
cvx_end

Example: Let us use an example to illustrate how a metaheuristic works. The design of
a compressional and tensional spring involves three design variables: wire diameter 𝑥1 ,
coil diameter 𝑥2 , and the length of the coil 𝑥3 . This optimization problem can be
written as
minimize 𝑓(𝐱) = 𝑥1² 𝑥2 (2 + 𝑥3 ),

subject to the following constraints


g1 (𝐱) = 1 − 𝑥2³ 𝑥3 / (71785 𝑥1⁴) ≤ 0,
g2 (𝐱) = (4𝑥2² − 𝑥1 𝑥2 ) / (12566(𝑥1³ 𝑥2 − 𝑥1⁴)) + 1 / (5108 𝑥1²) − 1 ≤ 0,
g3 (𝐱) = 1 − 140.45 𝑥1 / (𝑥2² 𝑥3 ) ≤ 0,
g4 (𝐱) = (𝑥1 + 𝑥2 ) / 1.5 − 1 ≤ 0.

The bounds on the variables are 0.05 ≤ 𝑥1 ≤ 2.0, 0.25 ≤ 𝑥2 ≤ 1.3, 2.0 ≤ 𝑥3 ≤ 15.0.

For a trajectory-based metaheuristic algorithm such as simulated annealing, an initial


guess, say 𝐱 0 = (1.0, 1.0, 14.0), is used. Then, the next move is generated and accepted
depending on whether it improves or not, possibly with a probability. For a population-
based metaheuristic algorithm such as PSO, a set of n vectors are generated initially.
Then, the values of the objective functions are compared and the current best solution
is found. Iterations proceed until a certain stopping criterion is met.
The following best solution can be found easily

𝐱 ⋆ = (0.051690 0.356750 11.28126), 𝑓(𝐱 ⋆ ) = 0.012665.
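As an illustration only (and not the method used to obtain the value above), such a constrained problem can be handed to a general optimizer by penalizing constraint violations; the sketch below uses MATLAB's simulannealbnd from the Global Optimization Toolbox with the constraints as written above, and the penalty weight 1e5 is an arbitrary choice:

% Penalized spring-design objective (sketch)
f   = @(x) x(1)^2*x(2)*(2 + x(3));                         % weight of the spring
g   = @(x) [1 - x(2)^3*x(3)/(71785*x(1)^4);                % g1
            (4*x(2)^2 - x(1)*x(2))/(12566*(x(1)^3*x(2) - x(1)^4)) + 1/(5108*x(1)^2) - 1;  % g2
            1 - 140.45*x(1)/(x(2)^2*x(3));                 % g3
            (x(1) + x(2))/1.5 - 1];                        % g4
pen = @(x) f(x) + 1e5*sum(max(g(x),0).^2);                 % quadratic penalty on violated constraints
lb  = [0.05 0.25 2.0]; ub = [2.0 1.3 15.0];                % variable bounds
x0  = [1.0 1.0 14.0];                                      % initial guess from the text
[xbest, fbest] = simulannealbnd(pen, x0, lb, ub)           % requires the Global Optimization Toolbox

With a sufficiently large penalty weight, the unconstrained minimizer of pen approaches the constrained optimum quoted above.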

Metaheuristics have been used in many applications such as engineering design


optimization (Glover and Kochenberger 2003, Talbi 2008, Yang 2010). This is an area of active research, and there is no doubt that more metaheuristic algorithms and new applications will emerge in the future.
Armijo Backtracking Line Search: Among the many line search techniques, one of the
most successful ones is the Armijo backtracking line search, which starts with a
reasonably large line search parameter 𝛼𝑖𝑛𝑖𝑡 , and then reduces it until the function
value at the new position is sufficiently reduced relative to the value at the old position:
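A minimal MATLAB sketch of this strategy is given below; it is an illustration only, in which f and gradf are assumed handles to the objective and its gradient, x is the current point, d a descent direction, and the constants follow common textbook choices:

% Armijo backtracking line search (sketch)
alpha = 1;                              % initial (reasonably large) step length alpha_init
c     = 1e-4;                           % sufficient-decrease constant
rho   = 0.5;                            % backtracking reduction factor
fx    = f(x);                           % function value at the old position
slope = gradf(x)'*d;                    % directional derivative (negative for a descent direction)
while f(x + alpha*d) > fx + c*alpha*slope
    alpha = rho*alpha;                  % shrink the step until sufficient decrease holds
end
x_new = x + alpha*d;                    % new position with sufficiently reduced function value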

clear all, clc, figure
% Demo: draw a sphere colored with a Hadamard-matrix checkerboard pattern


k = 5;
n = 2^k-1;
theta = pi*(-n:2:n)/n;
phi = (pi/2)*(-n:2:n)'/n;
X = cos(phi)*cos(theta);
Y = cos(phi)*sin(theta);
Z = sin(phi)*ones(size(theta));
colormap([1 0 0;1 1 1])
C = hadamard(2^k);
surf(X,Y,Z,C)
axis square
