Introduction To Nonlinear Systems and Numerical Optimization
Numerical Optimization
Optimization problems arise in almost every field, where numerical
information is processed (Science, Engineering, Mathematics, Economics, Commerce,
etc.). In Science, optimization problems arise in data fitting, in variational principles,
and in the solution of differential and integral equations by expansion methods.
Engineering applications are in design problems, which usually have constraints in the
sense that variables cannot take arbitrary values. For example, while designing a bridge
an engineer will be interested in minimizing the cost, while maintaining certain
minimum strength for the structure. Even the strength of materials used will have a
finite range depending on what is available in the market. Such problems with
constraints are more difficult to handle than the simple unconstrained optimization
problems, which very often arise in scientific work. In most problems, we assume the
variables to be continuously varying, but some problems require the variables to take
discrete values (H M Antia 1995).
There are two main strategies, used by most optimization algorithms available today, for computing the next iterate from the previous one. The
first one is the line search strategy in which the algorithm chooses a direction 𝒅𝑘 and
then searches along this direction for the lower function value. The second one is called
the trust region strategy in which the information gathered about the objective function
is used to construct a model function whose behavior near the current iterate is trusted
to be similar enough to the actual function. Then the algorithm searches for the
minimizer of the model function inside the trust region.
Although most optimization problems require the global minimum to be found, most of the methods that we are going to describe here will only find a local minimum. The function has a local minimum at a point where it assumes the lowest value in a small neighborhood of the point, which is not at the boundary of that neighborhood. To find a global minimum we normally try several different starting points and keep the best of the local minima so obtained.
In this chapter, we consider methods for minimizing or maximizing a function of several
variables, that is, finding those values of the coordinates for which the function takes
on the minimum or the maximum value.
Definition: A continuous function $f:\mathbb{R}^n \to \mathbb{R}$ is said to be continuously differentiable at $\mathbf{x} \in \mathbb{R}^n$ if $(\partial f/\partial x_i)(\mathbf{x})$ exists and is continuous for $i = 1,\dots,n$; the gradient of $f$ at $\mathbf{x}$ is then defined as
$$\nabla f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}\ \ \frac{\partial f}{\partial x_2}\ \cdots\ \frac{\partial f}{\partial x_n}\right]^T$$
and there exists $\mathbf{z} \in (\mathbf{x}, \mathbf{x}+\mathbf{p})$ such that $f(\mathbf{x}+\mathbf{p}) - f(\mathbf{x}) = (\nabla f(\mathbf{z}))^T\mathbf{p}$.
Example: Let $f:\mathbb{R}^2 \to \mathbb{R}$, $f(\mathbf{x}) = x_1^2 - 2x_2 + 3x_1x_2^2 + 4x_2^3$, $\mathbf{x}_c = (1,1)^T$, $\mathbf{p} = (-2,1)^T$. Then
$$\nabla f(\mathbf{x}) = \begin{pmatrix} 2x_1 + 3x_2^2 \\ -2 + 6x_1x_2 + 12x_2^2 \end{pmatrix}$$
If we let $g(t) = f(\mathbf{x}_c + t\mathbf{p}) = f(1-2t,\, 1+t) = 6 + 6t + 7t^2 - 2t^3$, the reader can verify that $f(\mathbf{x}_c+\mathbf{p}) - f(\mathbf{x}_c) = (\nabla f(\mathbf{z}))^T\mathbf{p}$ holds for $\mathbf{z} = \mathbf{x}_c + t\mathbf{p}$ with $t = (7-\sqrt{19})/6 \approx 0.44$.
Example: Computing directional derivatives. Let $f(x,y) = z = 14 - x^2 - y^2$ and let $P = (1,2)$. Find the directional derivative of $f$ at $P$ in the directions listed below.
■ The surface is plotted in the figure above, where the point $P = (1,2)$ is indicated in the $x,y$-plane as well as the point $(1,2,9)$ which lies on the surface of $f$. We find that
$$\left.\frac{\partial f}{\partial x}\right|_{(1,2)} = -2x\Big|_{x=1} = -2, \qquad \left.\frac{\partial f}{\partial y}\right|_{(1,2)} = -2y\Big|_{y=2} = -4$$
Let $\vec{u}_1$ be the unit vector that points from the point $(1,2)$ to the point $Q = (3,4)$, as shown in the figure. The vector $\vec{PQ} = \langle 2,2\rangle$; the unit vector in this direction is $\vec{u}_1 = \langle 1/\sqrt{2}, 1/\sqrt{2}\rangle$. Thus the directional derivative of $f$ at $(1,2)$ in the direction of $\vec{u}_1$ is
$$D_{\vec{u}_1}f(\mathbf{x}) = (\nabla f(\mathbf{x}))^T\vec{u}_1 = (-2\ \ -4)\begin{pmatrix}1/\sqrt{2}\\ 1/\sqrt{2}\end{pmatrix} = -3\sqrt{2} \approx -4.24$$
Thus the instantaneous rate of change in moving from the point $(1,2,9)$ on the surface in the direction of $\vec{u}_1$ (which points toward the point $Q$) is about $-4.24$. Moving in this direction moves one steeply downward.
■ We seek the directional derivative in the direction of $\langle 2,-1\rangle$. The unit vector in this direction is $\vec{u}_2 = \langle 2/\sqrt{5}, -1/\sqrt{5}\rangle$. Thus the directional derivative of $f$ at $(1,2)$ in the direction of $\vec{u}_2$ is $D_{\vec{u}_2}f(\mathbf{x}) = (\nabla f(\mathbf{x}))^T\vec{u}_2 = 0$. Starting on the surface of $f$ at $(1,2)$ and moving in the direction of $\langle 2,-1\rangle$ (or $\vec{u}_2$) results in no instantaneous change in $z$-value.
■ At $P=(1,2)$, the direction towards the origin is given by the vector $\langle -1,-2\rangle$; the unit vector in this direction is $\vec{u}_3 = \langle -1/\sqrt{5}, -2/\sqrt{5}\rangle$. The directional derivative of $f$ at $P$ in the direction of the origin is $D_{\vec{u}_3}f(\mathbf{x}) = (\nabla f(\mathbf{x}))^T\vec{u}_3 = 10/\sqrt{5} = 2\sqrt{5} \approx 4.47$. Moving towards the origin means "walking uphill" quite steeply, with an initial slope of about 4.47.
Note: The symbol "∇" is named "nabla,'' derived from the Greek name of a Jewish harp.
Oddly enough, in mathematics the expression ∇f is pronounced "del f.''
The gradient vectors are perpendicular to the level sets, so the gradient at a point always points in the direction of steepest ascent, from one level set toward the next. How can this picture be made precise? One answer is the concept of gradient flow, in which the gradient field is followed continuously along the surface, much as a liquid or a rolling object follows the slope.
Theorem: Consider a function $f:\mathbb{R}^n \to \mathbb{R}$ of class $C^1$. For some constant $c$, consider the level set $S = \{\vec{x} \in \mathbb{R}^n : f(\vec{x}) = c\}$. Then, for any point $\vec{x}_0$ in $S$, the gradient $\nabla f(\vec{x}_0)$ is perpendicular to $S$.
Proof sketch: Let $\vec{x}(t)$ be any differentiable curve lying in $S$ with $\vec{x}(t_0) = \vec{x}_0$. By the definition of $S$, and since $\vec{x}(t)$ lies in $S$, $f(\vec{x}(t)) = c$ for all $t$. Differentiating both sides of this identity, and using the chain rule on the left side, we obtain $\nabla f(\vec{x}(t)) \cdot \vec{x}\,'(t) = 0$. Plugging in $t = t_0$, this gives us $\nabla f(\vec{x}_0) \cdot \vec{x}\,'(t_0) = 0$; since $\vec{x}\,'(t_0)$ is an arbitrary tangent vector of $S$ at $\vec{x}_0$, the gradient is perpendicular to $S$. ∎
The Hessian of $f$ at $\mathbf{x}$ is the matrix of second partial derivatives,
$$\nabla^2 f(\mathbf{x})_{ij} = \frac{\partial^2 f(\mathbf{x})}{\partial x_i\,\partial x_j}, \qquad 1 \le i,j \le n,$$
and, for some $t \in (0,1)$,
$$f(\mathbf{x}+\mathbf{h}) = f(\mathbf{x}) + \nabla f(\mathbf{x})\cdot\mathbf{h} + \tfrac{1}{2}\,\big(\mathbf{H}(\mathbf{x}+t\mathbf{h})\mathbf{h}\big)\cdot\mathbf{h}$$
For $f(\mathbf{x}) = x_1^2 - 2x_2 + 3x_1x_2^2 + 4x_2^3$, $\mathbf{x}_c = (1,1)^T$, $\mathbf{p} = (-2,1)^T$, we then have
$$\nabla^2 f(\mathbf{x}) = \begin{pmatrix} 2 & 6x_2 \\ 6x_2 & 6x_1 + 24x_2 \end{pmatrix} \;\Longrightarrow\; \nabla^2 f(\mathbf{x}_c) = \begin{pmatrix} 2 & 6 \\ 6 & 30 \end{pmatrix}$$
Lemma suggests that we might model the function f around a point 𝐱 𝑐 by the quadratic
model $m(\mathbf{x}_c + \mathbf{p}) = f(\mathbf{x}_c) + (\nabla f(\mathbf{x}_c))^T\mathbf{p} + \tfrac{1}{2}\mathbf{p}^T\mathbf{H}(\mathbf{x}_c)\mathbf{p}$, and this is precisely what we will do.
In fact, it shows that the error in this model is given by
$$\varepsilon = f(\mathbf{x}_c + \mathbf{p}) - m(\mathbf{x}_c + \mathbf{p}) = \tfrac{1}{2}\,\mathbf{p}^T\big(\mathbf{H}(\mathbf{z}) - \mathbf{H}(\mathbf{x}_c)\big)\mathbf{p}$$
for some 𝒛 ∈ (𝐱 𝑐 , 𝐱 𝑐 + 𝒑).
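As a quick numerical check of this quadratic model, here is a small MATLAB sketch (added for illustration) that evaluates the model at $\mathbf{x}_c + \mathbf{p}$ for the example function above, using the gradient and Hessian already computed:
% Quadratic-model check for f(x) = x1^2 - 2*x2 + 3*x1*x2^2 + 4*x2^3 (sketch)
f = @(x) x(1)^2 - 2*x(2) + 3*x(1)*x(2)^2 + 4*x(2)^3;
g = @(x) [2*x(1) + 3*x(2)^2; -2 + 6*x(1)*x(2) + 12*x(2)^2];   % gradient
H = @(x) [2, 6*x(2); 6*x(2), 6*x(1) + 24*x(2)];               % Hessian
xc = [1; 1]; p = [-2; 1];
m  = f(xc) + g(xc)'*p + 0.5*p'*H(xc)*p;    % quadratic model value at xc + p
err = f(xc + p) - m                         % model error epsilon from the lemma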
For the remainder of this chapter we will denote the Jacobian matrix of 𝐹 at 𝐱 by 𝑱(𝐱).
Also, we will often speak of the Jacobian of 𝐹 rather than the Jacobian matrix of 𝐹 at 𝐱.
An important fact: Now comes the big difference from real-valued functions: there is
no mean value theorem for continuously differentiable vector-valued functions. That is,
in general there may not exist 𝐳 ∈ ℝ𝑛 such that 𝐹(𝐱 + 𝒑) = 𝐹(𝐱) + 𝑱(𝐳)𝒑. Intuitively the
reason is that, although each function 𝑓𝑖 satisfies 𝑓𝑖 (𝐱 + 𝒑) = 𝑓𝑖 (𝐱) + ∇𝑓𝑖 (𝐳𝑖 )𝑇 𝒑 , the points
𝐳𝑖 , may differ. For example, consider the function of the example before. There is no
$\mathbf{z} \in \mathbb{R}^n$ for which $F(1,1) = F(0,0) + \mathbf{J}(\mathbf{z})(1,1)^T$, as this would require
$$\begin{pmatrix} e^{x_1} - x_2 \\ x_1^2 - 2x_2 \end{pmatrix}_{\mathbf{x}=(1,1)} = \begin{pmatrix} e^{x_1} - x_2 \\ x_1^2 - 2x_2 \end{pmatrix}_{\mathbf{x}=(0,0)} + \begin{pmatrix} e^{x_1} & -1 \\ 2x_1 & -2 \end{pmatrix}_{\mathbf{x}=\mathbf{z}}\begin{pmatrix}1\\1\end{pmatrix} \;\Longleftrightarrow\; \begin{pmatrix} e-1 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \begin{pmatrix} e^{z_1} & -1 \\ 2z_1 & -2 \end{pmatrix}\begin{pmatrix}1\\1\end{pmatrix}$$
which would force $e^{z_1} = e - 1$ and $2z_1 = 1$ simultaneously, and these two conditions are incompatible.
Those results are given below. The integral of a vector valued function of a real variable
can be interpreted as the vector of Riemann integrals of each component function.
Proof: The proof comes right from the definition of 𝐹 ′ (𝐳) and a component by-
component application of Newton's formula 𝑓𝑖 (𝐱 + 𝒑) = 𝑓𝑖 (𝐱) + ∇𝑓𝑖 (𝐳𝑖 )𝑇 𝒑 .■
Now we might think about the best linear approximation of the function $F$ around a point $\mathbf{x}_c$; that is, we model $F(\mathbf{x}_c + \mathbf{p})$ by the affine model $M(\mathbf{x}_c + \mathbf{p}) = F(\mathbf{x}_c) + \mathbf{J}(\mathbf{x}_c)\mathbf{p}$, and this is what we will do. To produce a bound on the difference between $F(\mathbf{x}_c + \mathbf{p})$ and
𝑀(𝐱 𝑐 + 𝒑), we need to make an assumption about the continuity of 𝑱(𝐱 𝑐 ) just as we did
in scalar-valued-functions in the section before.
Definition Let the two integers 𝑚, 𝑛 > 0, 𝑮: ℝ𝑛 ⟶ ℝ𝑚×𝑛 , 𝐱 ∈ ℝ𝑛 , let ‖•‖ be a norm on
ℝ𝑛 , and ‖•‖𝑮 a norm on ℝ𝑚×𝑛 . 𝑮 is said to be Lipschitz continuous at 𝐱 if there exists an
open set $\mathbf{D} \subset \mathbb{R}^n$, $\mathbf{x} \in \mathbf{D}$, and a constant $\gamma$ such that for all $\mathbf{y} \in \mathbf{D}$, $\|\mathbf{G}(\mathbf{y}) - \mathbf{G}(\mathbf{x})\|_{\mathbf{G}} \le \gamma\,\|\mathbf{y} - \mathbf{x}\|$.
The constant 𝛾 is called a Lipschitz constant for 𝑮 at 𝐱. For any specific 𝑫 containing 𝐱
for which the given inequality holds, 𝑮 is said to be Lipschitz continuous at 𝐱 in the
neighborhood 𝑫. If this inequality holds for every 𝐱 ∈ 𝑫, then 𝑮 ∈ 𝐿𝑖𝑝𝛾 (𝑫).
Note that the value of 𝛾 depends on the norms ‖•‖ & ‖•‖𝑮 , but the existence of 𝛾 does
not.
Proof: Writing $F(\mathbf{x}+\mathbf{p}) - F(\mathbf{x}) - \mathbf{J}(\mathbf{x})\mathbf{p} = \int_0^1 \big(\mathbf{J}(\mathbf{x}+t\mathbf{p}) - \mathbf{J}(\mathbf{x})\big)\,\mathbf{p}\,dt$ and using the Lipschitz continuity of $\mathbf{J}$ at $\mathbf{x}$,
$$\|F(\mathbf{x}+\mathbf{p}) - F(\mathbf{x}) - \mathbf{J}(\mathbf{x})\mathbf{p}\| \le \int_0^1 \|\mathbf{J}(\mathbf{x}+t\mathbf{p}) - \mathbf{J}(\mathbf{x})\|\,\|\mathbf{p}\|\,dt \le \int_0^1 \gamma\,\|t\mathbf{p}\|\,\|\mathbf{p}\|\,dt = \frac{\gamma}{2}\|\mathbf{p}\|^2. \;\blacksquare$$
Using Lipschitz continuity, we can obtain a useful bound on the error in the
approximate affine model.
Lemma Let 𝐹, 𝑱 satisfy the conditions of the previous lemma. Then, for any 𝐯, 𝐮 ∈ 𝑫,
$$\|F(\mathbf{v}) - F(\mathbf{u}) - \mathbf{J}(\mathbf{x})(\mathbf{v}-\mathbf{u})\| \le \gamma\,\frac{\|\mathbf{v}-\mathbf{x}\| + \|\mathbf{u}-\mathbf{x}\|}{2}\,\|\mathbf{v}-\mathbf{u}\|$$
If we assume in addition that $\mathbf{J}(\mathbf{x})^{-1}$ exists, then there exist $\varepsilon > 0$ and $0 < \alpha < \beta$ such that $\alpha\|\mathbf{v}-\mathbf{u}\| \le \|F(\mathbf{v}) - F(\mathbf{u})\| \le \beta\|\mathbf{v}-\mathbf{u}\|$ for all $\mathbf{v}, \mathbf{u}$ within distance $\varepsilon$ of $\mathbf{x}$. Here and below, $\mathcal{O}(\|\mathbf{x}-\mathbf{a}\|)$ denotes a quantity that approaches zero much faster than the distance between $\mathbf{x}$ and $\mathbf{a}$ does as $\mathbf{x}$ approaches $\mathbf{a}$.
In the preceding section we saw that the Jacobian, gradient, and Hessian will be useful
quantities in forming models of multivariable nonlinear functions. In many
applications, however, these derivatives are not analytically available. In this section we
introduce the formulas used to approximate these derivatives by finite differences, and
the error bounds associated with these formulas. The choice of finite-difference stepsize
in the presence of finite-precision arithmetic and the use of finite-difference derivatives
in our algorithms are discussed in (J. E. Dennis, Jr. & Robert B. Schnabel 1993).
Frequently we deal with problems where the nonlinear function is itself the result of a
computer simulation, or is given by a long and messy algebraic formula, and so it is
often the case that analytic derivatives are not readily available although the function is
several times continuously differentiable. Therefore it is important to have algorithms
that work effectively in the absence of analytic derivatives.
In the case when 𝐹: ℝ𝑛 ⟶ ℝ𝑚 , it is reasonable to use the same idea as in one variable
to approximate the (𝑖, 𝑗)𝑡ℎ component of 𝑱(𝐱) by the forward difference approximation
$$a_{ij}(\mathbf{x}) = \frac{f_i(\mathbf{x} + h\mathbf{e}_j) - f_i(\mathbf{x})}{h}$$
where $\mathbf{e}_j$ denotes the $j$th unit vector. This is equivalent to approximating the $j$th column of $\mathbf{J}(\mathbf{x})$ by the forward difference $\big(F(\mathbf{x}+h\mathbf{e}_j) - F(\mathbf{x})\big)/h$.
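As an illustration, a minimal MATLAB sketch of this column-by-column forward-difference approximation is given below; the step size and test point are chosen here for illustration, and the system is the one used in the earlier example:
% Forward-difference approximation of the Jacobian of F at x (sketch)
F = @(x)[exp(x(1)) - x(2); x(1)^2 - 2*x(2)];   % example system used earlier in the text
x = [0; 0]; h = 1e-6; n = numel(x); Fx = F(x);
A = zeros(numel(Fx), n);
for j = 1:n
    ej = zeros(n,1); ej(j) = 1;                % j-th unit vector
    A(:,j) = (F(x + h*ej) - Fx)/h;             % j-th column of the Jacobian
end
A                                              % compare with the exact J(0,0) = [1 -1; 0 -2]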
One can prove that if $\mathbf{a} \in \mathbb{R}^n$ is the central-difference approximation of $\nabla f(\mathbf{x})$, with components
$$a_i(\mathbf{x}) = \frac{f(\mathbf{x} + h\mathbf{e}_i) - f(\mathbf{x} - h\mathbf{e}_i)}{2h},$$
then $\big|a_i(\mathbf{x}) - (\nabla f(\mathbf{x}))_i\big| \le \frac{\gamma}{6}h^2$ and hence $\|\mathbf{a}(\mathbf{x}) - \nabla f(\mathbf{x})\|_\infty \le \frac{\gamma}{6}h^2$.
On some occasions $\nabla f(\mathbf{x})$ is analytically available but $\nabla^2 f(\mathbf{x})$ is not. In this case, $\nabla^2 f(\mathbf{x})$ can be approximated column by column by the forward differences $\mathbf{A}_{\cdot j} = \big(\nabla f(\mathbf{x} + h\mathbf{e}_j) - \nabla f(\mathbf{x})\big)/h$, followed by the symmetrization $\hat{\mathbf{A}} = (\mathbf{A} + \mathbf{A}^T)/2$, since the approximation to $\nabla^2 f(\mathbf{x})$ should be symmetric.
If ∇𝑓(𝐱) is not available, it is possible to approximate ∇2 𝑓(𝐱) using only values of 𝑓(𝐱).
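A small MATLAB sketch of the gradient-based approach described above (with an illustrative function, point and step size of our own choosing):
% Hessian approximation from gradient values only (sketch)
g = @(x)[2*x(1) + 3*x(2)^2; -2 + 6*x(1)*x(2) + 12*x(2)^2];  % gradient of the earlier example
x = [1; 1]; h = 1e-5; n = numel(x); A = zeros(n);
for j = 1:n
    ej = zeros(n,1); ej(j) = 1;
    A(:,j) = (g(x + h*ej) - g(x))/h;     % j-th column by forward difference
end
Ahat = (A + A')/2                         % symmetrized approximation of the Hessian
% the exact Hessian at (1,1) is [2 6; 6 30]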
Proof: As in the one-variable case, a proof by contradiction is better than a direct proof
A class of algorithms called descent methods are characterized by the direction vector 𝒑
such that 𝒑𝑇 ∇𝑓(𝐱) < 0 or 𝒑 = −∇𝑓(𝐱).
The second-order necessary conditions for $\mathbf{x}^\star$ to be a local minimizer of $f$ are simply
■ $\nabla f(\mathbf{x}^\star) = \mathbf{0}$
■ $\nabla^2 f(\mathbf{x}^\star)$ is positive semidefinite.
$$f(\mathbf{x}+\mathbf{p}) - f(\mathbf{x}) = (\nabla f(\mathbf{x}))^T\mathbf{p} + \tfrac{1}{2}\,\mathbf{p}^T\mathbf{H}(\mathbf{x}+t\mathbf{p})\,\mathbf{p} + \mathcal{O}(\|\mathbf{p}\|^3)$$
When $\nabla f(\mathbf{x}) = \mathbf{0}$ and the Hessian matrix is positive definite, by definition $\mathbf{p}^T\mathbf{H}(\mathbf{x}+t\mathbf{p})\mathbf{p} > 0$ for any $\mathbf{p} \ne 0$. Therefore we have $f(\mathbf{x}+\mathbf{p}) - f(\mathbf{x}) = \tfrac{1}{2}\mathbf{p}^T\mathbf{H}(\mathbf{x}+t\mathbf{p})\mathbf{p} > 0$, which means that $\mathbf{x}$ must be a local minimum. Similarly, when the Hessian matrix is negative definite, $\mathbf{x}$ is
a local maximum. Finally, when 𝑯 has both positive and negative eigenvalues, the point
is a saddle point.
Those methods use
the gradient to search for the minimum point of an objective function. Such gradient-
based optimization methods are supposed to reach a point at which the gradient is
(close to) zero. In this context, the optimization of an objective function f(x) is equivalent
to finding a zero of its gradient g(x), which in general is a vector-valued function of a
vector-valued independent variable x. Therefore, if we have the gradient function g(x) of
the objective function f(x), we can solve the system of nonlinear equations g(x) = 0 to get
the minimum of f(x) by using the Newton method explained in chapter 4.
Example: Given the Rosenbrock function $f(\mathbf{x}) = 100(x_2 - x_1^2)^2 + (1-x_1)^2$, find the extremum of $f(\mathbf{x})$ starting from the point $\mathbf{x} = [0.01\ \ 0.02]$. This function is severely ill-conditioned near the minimizer $(1,1)$ (which is the unique stationary point).
%-----------------------------------------------------------%
% f .......... objective function
% J .......... gradient of the objective function
% H .......... Hessian of the objective function
%-----------------------------------------------------------%
clear all, clc, i=1; x(i,:)=[0.01 0.02]; tol=0.001;
f=@(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
J=@(x)[-400*(x(2)-x(1)^2)*x(1)+2*x(1)-2; 200*x(2)-200*x(1)^2];
H=@(x)[1200*x(1)^2-400*x(2)+2 -400*x(1);-400*x(1) 200];
while norm(J(x(i,:)))>tol
d=(inv(H(x(i,:)) + 0.5*eye(2,2))*J(x(i,:)))'; % damped Newton step: adding 0.5*I keeps the Hessian safely invertible
x(i+1,:)=x(i,:)-d;
i=i+1;
end
Iterations=i
x=(x(i,:))' % x=[0.989793 0.9797];
fmax=f(x)
Example: Let $f(\mathbf{x}) = \sqrt{1+x_1^2} + \sqrt{1+x_2^2}$; find the extremum value of $f(\mathbf{x})$ with the following starting point $\mathbf{x} = [1\ \ 1]$.
clear all, clc, i=1; x(i,:)=[1 1]; % starting point given in the statement
f=@(x)sqrt(1+x(1)^2)+sqrt(1+x(2)^2);
J=@(x)[x(1)/sqrt(x(1)^2+1);x(2)/sqrt(x(2)^2+1)];
H=@(x)diag([1/(x(1)^2+1)^1.5,1/(x(2)^2+1)^1.5]);
while abs(x(i,:)*J(x(i,:)))>0.001
d=(inv(H(x(i,:))+0.5*eye(2,2))*J(x(i,:)))';
x(i+1,:)=x(i,:)-d;
i=i+1;
end
Iterations=i
x=(x(i,:))'
fmax=f(x)
Example: Let 𝑓(𝑥1 , 𝑥2 ) = (𝑥1 − 2)4 + ((𝑥1 − 2)2 )𝑥22 + (𝑥2 + 1)2 , which has its minimum
at 𝐱 ⋆ = (2, −1)𝑇 . Algorithm, started from 𝐱 0 = (1, 1)𝑇 , and we use the following
approximations
$$H_{11} = \frac{f(\mathbf{x}+2h\mathbf{e}_1) - 2f(\mathbf{x}+h\mathbf{e}_1) + f(\mathbf{x})}{h^2}, \qquad H_{22} = \frac{f(\mathbf{x}+2h\mathbf{e}_2) - 2f(\mathbf{x}+h\mathbf{e}_2) + f(\mathbf{x})}{h^2},$$
$$H_{12} = H_{21} = \frac{f(\mathbf{x}+h\mathbf{e}_1+h\mathbf{e}_2) - f(\mathbf{x}+h\mathbf{e}_1) - f(\mathbf{x}+h\mathbf{e}_2) + f(\mathbf{x})}{h^2}.$$
Before starting the algorithm, let us visualize the plot of this surface in space.
Iterations = 9
Jacobian =
-4.0933e-11
2.2080e-09
x =
1.9950
-1.0050
fmax = 5.0001e-05
clear all, clc, i=1; x(i,:)=[1 1]; J=[1;1];
h=0.01; tol=1e-8;          % step size and tolerance assumed (not stated in the text)
f=@(x)(x(1)-2)^4+((x(1)-2)^2)*x(2)^2+(x(2)+1)^2;
while norm(J)>tol
    % shifted points used by the finite-difference formulas
    x1(i,:)=x(i,:)+[h 0];    x2(i,:)=x(i,:)+[0 h];
    x11(i,:)=x(i,:)+[2*h 0]; x22(i,:)=x(i,:)+[0 2*h]; x12(i,:)=x(i,:)+[h h];
    J=[(f(x1(i,:))-f(x(i,:)))/h; (f(x2(i,:))-f(x(i,:)))/h];  % forward-difference gradient
    H(1,1)=(f(x11(i,:))-2*f(x1(i,:))+ f(x(i,:)))/h^2;
    H(1,2)=(f(x12(i,:))-f(x1(i,:))-f(x2(i,:))+f(x(i,:)))/h^2;
    H(2,1)=H(1,2);
    H(2,2)=(f(x22(i,:))-2*f(x2(i,:))+f(x(i,:)))/h^2;
    x(i+1,:)=x(i,:)-(inv(H)*J)';
    i=i+1;
end
Iterations=i
Gradient=J
x=(x(i,:))'
fmax=f(x)
Example: Consider the Freudenstein and Roth test function $f(\mathbf{x}) = f_1(\mathbf{x})^2 + f_2(\mathbf{x})^2$, where
$$f_1(\mathbf{x}) = -13 + x_1 + \big((5-x_2)x_2 - 2\big)x_2, \qquad f_2(\mathbf{x}) = -29 + x_1 + \big((x_2+1)x_2 - 14\big)x_2.$$
Show that the function $f$ has three stationary points. Find them and prove that one is a global minimizer, one is a strict local minimum and the third is a saddle point. You should use the stopping criterion $\|\nabla f(\mathbf{x})\| \le 10^{-5}$. The algorithm should be employed four times on the following four starting points:
We can also use MATLAB to compute the gradient and the Hessian of this function. When we run the program we obtain the points that were asked for.
Example: Given a Rosenbrock function 𝑓(𝐱) = 100(𝑥2 − 𝑥12 )2 + (1 − 𝑥1 )2, find the solution
of 𝑓(𝐱) = 0 using only the gradient (i.e. without use of Hessian).
From the first-order expansion $f(\mathbf{x}_{k+1}) \approx f(\mathbf{x}_k) + (\nabla f(\mathbf{x}_k))^T\Delta\mathbf{x}_k = 0$ we require
$$-f(\mathbf{x}_k) = (\nabla f(\mathbf{x}_k))^T\Delta\mathbf{x}_k \;\Longleftrightarrow\; \Delta\mathbf{x}_k = -\big(\nabla f(\mathbf{x}_k)(\nabla f(\mathbf{x}_k))^T\big)^{-1}\nabla f(\mathbf{x}_k)\,f(\mathbf{x}_k)$$
and, in order to avoid singularity in the matrix inversion, we add a regularization term:
$$\Delta\mathbf{x}_k = \mathbf{x}_{k+1} - \mathbf{x}_k = -\big(\nabla f(\mathbf{x}_k)(\nabla f(\mathbf{x}_k))^T + \lambda\mathbf{I}\big)^{-1}\nabla f(\mathbf{x}_k)\,f(\mathbf{x}_k), \qquad 0 < \lambda < 1.$$
It can be observed, by comparing with the Newton iteration $\mathbf{x}_{k+1} = \mathbf{x}_k - (\mathbf{H}(\mathbf{x}))^{-1}\mathbf{g}(\mathbf{x})$, that
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \big(\nabla f(\mathbf{x}_k)(\nabla f(\mathbf{x}_k))^T + \lambda\mathbf{I}\big)^{-1}\nabla f(\mathbf{x}_k)\,f(\mathbf{x}_k) \cong \mathbf{x}_k - (\mathbf{H}(\mathbf{x}))^{-1}\nabla f(\mathbf{x}_k)$$
$$\Longrightarrow\; \mathbf{H}(\mathbf{x}) \cong \frac{\nabla f(\mathbf{x}_k)(\nabla f(\mathbf{x}_k))^T + \lambda\mathbf{I}}{f(\mathbf{x}_k)}.$$
It practically means that, once the first derivatives are computed, we can also compute
part of the Hessian matrix for the same computational cost. The possibility to compute
“for free” the Hessian matrix once the Jacobian (i.e. Gradient) is available represents a
distinctive feature of least squares problems. This approximation is adopted in many
applications as it provides an evaluation of the Hessian matrix without computing any
second derivatives of the objective function.
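A minimal MATLAB sketch of this gradient-only iteration applied to the Rosenbrock function is given below; the value of λ, the stopping tolerance and the iteration cap are illustrative choices of ours, not values specified in the text.
% Solve f(x)=0 for the Rosenbrock function using only its gradient (sketch)
clear all, clc, i=1; x(i,:)=[0.01 0.02]; lambda=0.1; tol=1e-6;
f=@(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
J=@(x)[-400*(x(2)-x(1)^2)*x(1)+2*x(1)-2; 200*x(2)-200*x(1)^2];
while abs(f(x(i,:)))>tol && i<50000          % iteration cap added as a safeguard
    g=J(x(i,:));
    % regularized step: dx = -(g*g' + lambda*I)^(-1) * g * f(x)
    dx=-((g*g'+lambda*eye(2))\g)*f(x(i,:));
    x(i+1,:)=x(i,:)+dx';
    i=i+1;
end
Iterations=i, x=(x(i,:))', fval=f(x)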
Example: Given $f(\mathbf{x}) = (x_1^2 - 2x_2)\,e^{-x_1^2 - x_2^2 - x_1x_2}$, find the solution of $f(\mathbf{x}) = 0$ using only the gradient (i.e. without use of the Hessian). In this example we will use the approximate (finite-difference) value of $\nabla f(\mathbf{x})$ rather than the analytic one.
Before starting the algorithm, let us visualize the plot of this surface in space.
[x,y] = meshgrid([-2:.25:2]);
z = x.*exp(-x.^2-y.^2);
% Plotting the Z-values of the function on which the level
% sets have to be projected
z1 = x.^2+y.^2;
% Plot your contour
[cm,c]=contour(x,y,z,30);
% Plot the surface on which the level sets have to be projected
s=surface(x,y,z1,'EdgeColor',[.8 .8 .8],'FaceColor','none')
% Get the handle to the children i.e the contour lines of the contour
cv=get(c,'children');
% Extract the (X,Y) for each of the contours and recalculate the
% Z-coordinates on the surface on which to be projected.
for i=1:length(cv)
cc = cv(i);
xd=get(cc,'XData');
yd=get(cc,'Ydata');
zd=xd.^2+yd.^2;
set(cc,'Zdata',zd);
end
grid off
view(-15,25)
colormap cool
Jacobian =
-4.1814e-06
-8.6410e-06
x =
1.8844
3.4178
fmax = 4.5702e-07
clear all, clc, i=1; x(i,:)=[1 1]; h=0.01; J=1; tol=0.00001;
f=@(x)x(1)*exp(-x(1)^2-x(2)^2);
while norm(J)>tol
    % shifted points used by the finite-difference formulas
    x1(i,:)=x(i,:)+[h 0];    x2(i,:)=x(i,:)+[0 h];
    x11(i,:)=x(i,:)+[2*h 0]; x22(i,:)=x(i,:)+[0 2*h]; x12(i,:)=x(i,:)+[h h];
    J=[(f(x1(i,:))-f(x(i,:)))/h; (f(x2(i,:))-f(x(i,:)))/h];  % approximate gradient
    H(1,1)=(f(x11(i,:))-2*f(x1(i,:))+ f(x(i,:)))/h^2;
    H(1,2)=(f(x12(i,:))-f(x1(i,:))-f(x2(i,:))+f(x(i,:)))/h^2;
    H(2,1)=H(1,2);
    H(2,2)=(f(x22(i,:))-2*f(x2(i,:))+f(x(i,:)))/h^2;
    x(i+1,:)=x(i,:)-(inv(H)*J)';
    i=i+1;
end
Iterations=i
Jacobian=J
x=(x(i,:))'
fmax=f(x)
Example: Develop the Taylor series of two-variables objective function 𝑓(𝑥1 , 𝑥2 ) with an
error of 𝒪(‖𝜹‖3 )
𝜕𝑓 𝜕𝑓 𝜕 2𝑓 𝜕 2𝑓 𝜕 2𝑓
𝑓(𝑥1 + 𝛿1 , 𝑥2 + 𝛿2 ) = 𝑓(𝑥1 , 𝑥2 ) + ( 𝛿1 + 𝛿2 ) + ( 2 𝛿12 + 2 𝛿1 𝛿2 + 2 𝛿22 ) + 𝒪(‖𝜹‖3 )
𝜕𝑥1 𝜕𝑥2 𝜕𝑥1 𝜕𝑥1 𝜕𝑥2 𝜕𝑥2
𝜕 2𝑓 𝜕 2𝑓
𝛿1 𝛿1
𝜕𝑓 𝜕𝑓 1 𝜕𝑥12 𝜕𝑥1 𝜕𝑥2
= 𝑓(𝑥1 , 𝑥2 ) + [ ] ( ) + [𝛿1 𝛿2 ] ( ) + 𝒪(‖𝜹‖3 )
𝜕𝑥1 𝜕𝑥2 𝛿 2 𝜕 2𝑓 𝜕 2𝑓 𝛿2
2
2
(𝜕𝑥1 𝜕𝑥2 𝜕𝑥2 )
𝑇 1
In compact form we can write 𝑓(𝐱 + 𝜹) − 𝑓(𝐱) = (∇𝑓(𝐱)) 𝜹 + 2 𝜹𝑇 𝑯(𝐱 + 𝑡𝜹)𝜹 + 𝒪(‖𝜹‖3 )
Let 𝑭: ℝ𝑛 ⟶ ℝ𝑚 be a continuously
differentiable in the open convex set 𝑫 ⊂ ℝ𝑛 . The practical problem in the vector case is
to solve the simultaneously the set of nonlinear equations 𝑭(𝐱) = 𝟎. In before we have
seen that
𝑭(𝐱 + 𝜹) = 𝑭(𝐱) + 𝑱(𝐱)𝜹 ⟺ 𝑭(𝐱 𝑘 + 𝜹) = 𝑭(𝐱 𝑘 ) + 𝑱(𝐱 𝑘 )𝜹𝑘
𝜕𝑓1 𝜕𝑓1
𝜕𝑥1 𝜕𝑥2 1 𝑓1 (𝐱 + ℎ𝐞1 ) − 𝑓1 (𝐱) 𝑓1 (𝐱 + ℎ𝐞2 ) − 𝑓1 (𝐱)
𝑱(𝐱 𝑘 ) = ≅ ( )
𝜕𝑓2 𝜕𝑓2 ℎ 𝑓 (𝐱 + ℎ𝐞 ) − 𝑓 (𝐱)
2 1 2 𝑓2 (𝐱 + ℎ𝐞2 ) − 𝑓2 (𝐱)
( 𝜕𝑥1 𝜕𝑥2 )
Example: write a MATLAB code to solve the following nonlinear system of equations
using the approximate method and take x(i,:)= [0.1 0.2] as starting point.
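Since the system itself is not reproduced here, the sketch below uses an illustrative two-equation system of our own choosing; only the structure (finite-difference Jacobian plus Newton update, started from x(i,:) = [0.1 0.2]) follows the text.
% Newton's method with a forward-difference Jacobian for F(x)=0 (sketch)
clear all, clc, i=1; x(i,:)=[0.1 0.2]; h=1e-6; dif=1;
F=@(x)[x(1)^2+x(2)^2-1; x(1)-x(2)^2];       % illustrative system (assumed, not from the text)
while dif>1e-8
    Fx=F(x(i,:));
    J=[F(x(i,:)+[h 0])-Fx, F(x(i,:)+[0 h])-Fx]/h;   % approximate Jacobian, column by column
    x(i+1,:)=x(i,:)-(J\Fx)';
    dif=norm(x(i+1,:)-x(i,:));
    i=i+1;
end
x=(x(i,:))', Iterations=i, F(x')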
Example: using MATLAB to visualize the intersection of the following surfaces centered
at origin (assuming that the parameters are specified)
𝑥 2 𝑥 2 𝑥 2
( ) +( ) +( ) =𝑑
𝑎 𝑏 𝑐
2 2
𝑧 = 𝛼𝑥 + 𝛽𝑦 + 𝛾
{ 𝑥2 + 𝑦2 + 𝑧2 = 𝜆
for i=1:imax+1
theta=2*pi*(i-1)/ imax;
for j=1:jmax+1
phi =2*pi*(j-1)/ jmax;
x(i,j) = a*cos(theta); y(i,j) = b*sin(theta)*cos(phi);
z(i,j) = c*sin(theta)*sin(phi);
end
end
s=surf(x,y,z), hold on % Plot of ellipsoid
x(i+1,:) = x(i,:)-(inv(J)*F)';
dif = norm(x(i+1,:)-x(i,:));
i = i + 1;
end
x, Iterations=i, F
Suppose first that the Hessian matrix of the objective function is constant and
independent of 𝐱 𝑘 for 0 ≤ 𝑖 ≤ 𝑘. In other words, the objective function is quadratic, with
Hessian 𝐇(𝐱) = 𝑸 for all 𝐱, where 𝑸 = 𝑸𝑇 . Then, if 𝐱 𝑘+1 is an optimizer of 𝑓 we get
𝐠(𝐱 𝑘+1 ) = 0 so
𝐠(𝐱 𝑘+1 ) − 𝐠(𝐱 𝑘 ) = 𝑸(𝐱 𝑘+1 − 𝐱 𝑘 ) ⟺ ∆𝐠(𝐱 𝑘 ) = 𝑸∆𝐱 𝑘 ⟺ ∆𝐠(𝐱𝑘 ) = 𝑸𝒑𝑘 ⟺ 𝒑𝑇𝑘 ∆𝐠(𝐱 𝑘 ) = 𝒑𝑇𝑘 𝑸𝒑𝑘
We start with a real symmetric positive definite matrix 𝐁0 . Note that given k, the
matrix 𝑸−1 satisfies
𝑸−1 ∆𝐠(𝐱𝑖 ) = ∆𝐱 𝑖 0 ≤ 𝑖 ≤ 𝑘
Therefore, we also impose the requirement that the approximation $\mathbf{B}_{k+1}$ of the inverse Hessian satisfy
𝐁𝑘+1 ∆𝐠(𝐱 𝑖 ) = ∆𝐱 𝑖 = 𝒑𝑖 0 ≤ 𝑖 ≤ 𝑘
𝐁𝑛 ∆𝐠(𝐱 0 ) = 𝒑0
𝐁𝑛 ∆𝐠(𝐱1 ) = 𝒑1
⋮
𝐁𝑛 ∆𝐠(𝐱𝑛−1 ) = 𝒑𝑛−1
Note that $\mathbf{Q}$ satisfies $\mathbf{Q}^{-1}[\mathbf{q}_0, \mathbf{q}_1, \dots, \mathbf{q}_{n-1}] = [\mathbf{p}_0, \mathbf{p}_1, \dots, \mathbf{p}_{n-1}]$, where $\mathbf{q}_i = \Delta\mathbf{g}(\mathbf{x}_i)$. Therefore, if the matrix $[\mathbf{q}_0, \mathbf{q}_1, \dots, \mathbf{q}_{n-1}]$ is nonsingular, then $\mathbf{Q}^{-1}$ is determined uniquely after $n$ steps, via $\mathbf{Q}^{-1} = [\mathbf{p}_0, \mathbf{p}_1, \dots, \mathbf{p}_{n-1}][\mathbf{q}_0, \mathbf{q}_1, \dots, \mathbf{q}_{n-1}]^{-1}$. This means that if $n$ linearly independent directions $\mathbf{p}_i$ and corresponding $\mathbf{q}_i$ are known, then $\mathbf{Q}^{-1}$ is uniquely determined.
We will construct successive approximations 𝐁𝑘 to 𝑸−1 based on data obtained from the
first k steps such that:
𝐁𝑘+1 = [𝒑0 , 𝒑1 , … 𝒑𝑘 ][𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑘 ]−1
After n linearly independent steps we would then have 𝐁𝑛 = 𝑸−1 . We want an update on
𝐁𝑘 such that:
𝐁𝑘+1 = [𝒑0 , 𝒑1 , … 𝒑𝑘 ][𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑘 ]−1
Let us find the update in this form [Rank one correction] 𝐁𝑘+1 = 𝐁𝑘 + 𝛼𝑘 𝐮𝑘 𝐮𝑇𝒌 . We need a
good 𝛼𝑘 ∈ ℝ and good 𝐮𝑘 ∈ ℝ𝑛 .
$$\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{(\mathbf{p}_k - \mathbf{B}_k\mathbf{q}_k)(\mathbf{p}_k - \mathbf{B}_k\mathbf{q}_k)^T}{\mathbf{q}_k^T(\mathbf{p}_k - \mathbf{B}_k\mathbf{q}_k)}$$
Proof: We already know that 𝐁𝑘+1 [𝒒0 , 𝒒1 , … 𝒒𝑘 ] = [𝒑0 , 𝒑1 , … 𝒑𝑘 ] and 𝐁𝑘+1 = 𝐁𝑘 + 𝛼𝑘 𝐮𝑘 𝐮𝑇𝑘 .
Therefore,
$$\mathbf{p}_k = \mathbf{B}_{k+1}\mathbf{q}_k = (\mathbf{B}_k + \alpha_k\mathbf{u}_k\mathbf{u}_k^T)\mathbf{q}_k = \mathbf{B}_k\mathbf{q}_k + \alpha_k\mathbf{u}_k\mathbf{u}_k^T\mathbf{q}_k \;\Longrightarrow\; \mathbf{p}_k - \mathbf{B}_k\mathbf{q}_k = \alpha_k\mathbf{u}_k\mathbf{u}_k^T\mathbf{q}_k,$$
so $\mathbf{u}_k$ is proportional to $\mathbf{p}_k - \mathbf{B}_k\mathbf{q}_k$; substituting $\mathbf{u}_k = \mathbf{p}_k - \mathbf{B}_k\mathbf{q}_k$ and solving for $\alpha_k$ gives
$$\mathbf{B}_{k+1} - \mathbf{B}_k = \frac{(\mathbf{p}_k - \mathbf{B}_k\mathbf{q}_k)(\mathbf{p}_k - \mathbf{B}_k\mathbf{q}_k)^T}{\mathbf{q}_k^T(\mathbf{p}_k - \mathbf{B}_k\mathbf{q}_k)}. \;\blacksquare$$
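The rank-one update is easy to experiment with; below is a small MATLAB sketch (our own illustration, with a quadratic test function and starting data chosen here) that builds B_k from observed (p_k, q_k) pairs and compares it with Q^{-1}.
% Rank-one (SR1-type) approximation of the inverse Hessian Q^{-1} (sketch)
Q = [4 1; 1 3];                     % quadratic objective f(x) = 0.5*x'*Q*x
B = eye(2);                         % initial approximation of inv(Q)
xk = [1; 1];
for k = 1:2
    pk = -B*(Q*xk);                 % step (exact line search omitted for brevity)
    qk = Q*pk;                      % change in the gradient: q_k = Q*p_k
    r  = pk - B*qk;
    B  = B + (r*r')/(qk'*r);        % rank-one correction
    xk = xk + pk;
end
disp(B), disp(inv(Q))               % after n=2 independent steps, B matches inv(Q)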
Remark: The scalar $\alpha_k$ is the smallest nonnegative value of $a$ that locally minimizes $f$ along the search direction (e.g. $-\nabla f(\mathbf{x}_k)$ for steepest descent) starting from $\mathbf{x}_k$. There are many alternative line-search rules for choosing $\alpha_k$ along the ray $S_k = \{\mathbf{x}_{k+1} = \mathbf{x}_k + a\mathbf{p}_k \mid a > 0\}$, namely the Armijo rule, the Goldstein rule, the Wolfe rule, the strong Wolfe rule, etc. In this work we are not concerned with such details.
The search direction 𝒑𝑘 at stage k is given by the solution of the analogue of the Newton
equation: 𝑯𝑘 𝒑𝑘 = −∇𝑓(𝐱 𝑘 ) where 𝑯𝑘 is an approximation to the Hessian matrix, which
is updated iteratively at each stage, and ∇𝑓(𝐱 𝑘 ) is the gradient of the function evaluated
at 𝐱 𝑘 . A line search in the direction 𝒑𝑘 is then used to find the next point 𝐱 𝑘+1 by
minimizing 𝑓(𝐱 𝑘 + 𝛾𝒑𝑘 ) over the scalar 𝛾 > 0. The quasi-Newton condition imposed on
the update of 𝑯𝑘 is
∇𝑓(𝐱 𝑘+1 ) − ∇𝑓(𝐱 𝑘 ) = 𝑯𝑘+1 (𝐱 𝑘+1 − 𝐱 𝑘 )
Let 𝒚𝑘 = ∇𝑓(𝐱 𝑘+1 ) − ∇𝑓(𝐱 𝑘 ) and 𝒔𝑘 = 𝐱 𝑘+1 − 𝐱 𝑘 then 𝑯𝑘+1 satisfies 𝑯𝑘+1 𝒔𝑘 = 𝒚𝑘 which is the
secant equation. The curvature condition 𝒔𝑇𝑘 𝑯𝑘+1 𝒔𝑘 = 𝒔𝑇𝑘 𝒚𝑘 > 0 should be satisfied for
𝑯𝑘+1 to be positive definite. If the function is not strongly convex, then the condition has
to be enforced explicitly.
Instead of requiring the full Hessian matrix at the point $\mathbf{x}_{k+1}$ to be computed as $\mathbf{H}_{k+1}$, the approximate Hessian at stage $k$ is updated by the addition of two matrices: $\mathbf{H}_{k+1} = \mathbf{H}_k + \mathbf{U}_k + \mathbf{V}_k$.
Both 𝐔𝑘 and 𝐕𝑘 are symmetric rank-one matrices, but their sum is a rank-two update
matrix. Imposing the secant condition $\mathbf{H}_{k+1}\mathbf{s}_k = \mathbf{y}_k$ on the update $\mathbf{H}_{k+1} = \mathbf{H}_k + \alpha\mathbf{u}_k\mathbf{u}_k^T + \beta\mathbf{v}_k\mathbf{v}_k^T$, and choosing $\mathbf{u}_k = \mathbf{y}_k$ and $\mathbf{v}_k = \mathbf{H}_k\mathbf{s}_k$, we obtain:
$$\alpha = \frac{1}{\mathbf{y}_k^T\mathbf{s}_k}, \qquad \beta = -\frac{1}{\mathbf{s}_k^T\mathbf{H}_k\mathbf{s}_k}$$
Finally, we substitute 𝛼𝑘 and 𝛽𝑘 into 𝐇𝑘+1 = 𝐇𝑘 + 𝛼𝐮𝑘 𝐮𝑇𝑘 + 𝛽𝐯𝑘 𝐯𝑘𝑇 and get the update
equation of 𝐇𝑘+1.
$$\mathbf{H}_{k+1} = \mathbf{H}_k + \frac{\mathbf{y}_k\mathbf{y}_k^T}{\mathbf{y}_k^T\mathbf{s}_k} - \frac{\mathbf{H}_k\mathbf{s}_k\mathbf{s}_k^T\mathbf{H}_k^T}{\mathbf{s}_k^T\mathbf{H}_k\mathbf{s}_k}$$
The functional 𝑓(𝐱 𝑘 ) denotes the objective function to be minimized. Convergence can
be checked by observing the norm of the gradient, ‖∇𝑓(𝐱 𝑘 )‖2 ≤ 𝜀. In order to avoid the
inversion of 𝑯𝑘 at each step we apply the Sherman–Morrison formula
$$(\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{u}\mathbf{v}^T\mathbf{A}^{-1}}{1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u}}$$
We get
$$\mathbf{B}_{k+1} = \left(\mathbf{I} - \frac{\mathbf{s}_k\mathbf{y}_k^T}{\mathbf{s}_k^T\mathbf{y}_k}\right)\mathbf{B}_k\left(\mathbf{I} - \frac{\mathbf{y}_k\mathbf{s}_k^T}{\mathbf{s}_k^T\mathbf{y}_k}\right) + \frac{\mathbf{s}_k\mathbf{s}_k^T}{\mathbf{s}_k^T\mathbf{y}_k} = \mathbf{B}_k + \frac{(\mathbf{s}_k^T\mathbf{y}_k + \mathbf{y}_k^T\mathbf{B}_k\mathbf{y}_k)(\mathbf{s}_k\mathbf{s}_k^T)}{(\mathbf{s}_k^T\mathbf{y}_k)^2} - \frac{\mathbf{B}_k\mathbf{y}_k\mathbf{s}_k^T + \mathbf{s}_k\mathbf{y}_k^T\mathbf{B}_k}{\mathbf{s}_k^T\mathbf{y}_k}$$
Remark: In general, the finite difference approximations of the Hessian are more
expensive than the secant condition updates. (Walter Gander and Martin J Gander)
clear all, clc, tol=10^-4; x(:,1)= [0.8624 0.1456]; z=[]; B=eye(2,2);
f=@(x)x(1)^2-x(1)*x(2)-3*x(2)^2+5; J=@(x)[2*x(1)-x(2);-x(1)-6*x(2)];
% f=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2-10; J=@(x)[4*x(1);6*x(2);8*x(3)];
% f=@(x)3*sin(x(1))+exp(x(2)); J=@(x)[3*cos(x(1));exp(x(2))];
i=1; %matlab starts counting at 1
while and(norm(J(x(:,i)))>0.001,i<500)
p(:,i)=-B*J(x(:,i));
%------------------------------------------------------------%
% armijo method for alpha determination
alp=0.01; % initial step
rho=0.01; c=0.01; % rho and c are in (0,1);
x1=x(:,i); x2=x(:,i)+alp*p(:,i);
f2=f(x2); f1=f(x(:,i));
while and(f2>f1+c*alp*(J(x(:,i)))'*p(:,i), alp>10^-6)
alp=rho*alp;
f2=f(x(:,i)+alp*p(:,i));
end
%------------------------------------------------------------%
s=alp*p(:,i); x(:,i+1)=x(:,i) + s; y=J(x(:,i+1))-J(x(:,i));
B = B + ((s'*y + y'*B*y)/(s'*y)^2)*(s*s') -(B*y*s'+ (s*y')*B)/(s'*y);
i=i+1;
end
x(:,end), fmax=f(x(:,end)), Gradient=J(x(:,end))
Gradient descent is based on the observation that if the scalar multi-variable function
𝑓(𝐱) is defined and differentiable in a neighborhood of a point 𝒂, then 𝑓(𝐱) decreases
fastest if one goes from 𝒂 in the direction of the negative gradient of 𝑓(𝐱) at 𝒂, −∇𝑓 (𝒂).
It follows that, if 𝐚𝑘+1 = 𝐚𝑘 − 𝛼∇𝑓(𝐚𝑘 ) for 𝛼 ∈ ℝ small enough, then 𝑓(𝐚𝑘 ) ≥ 𝑓(𝐚𝑘+1 ) In
other words, the term 𝛼∇𝑓(𝐚𝑘 ) is subtracted from 𝐚𝑘 because we want to move against
the gradient, toward the local minimum.
With this observation in mind, one starts with a guess 𝐱 0 for a local minimum of 𝑓(𝐱),
and considers the sequence 𝐱1 , 𝐱 2 , …, such that 𝐱 𝑘+1 = 𝐱 𝑘 − 𝛼𝑘 ∇𝑓(𝐱 𝑘 ).
Note that the value of the step size 𝛼𝑘 is allowed to change at every iteration. With
certain assumptions on the function 𝑓(𝐱) (for example, 𝑓(𝐱) convex and ∇𝑓(𝐱) Lipschitz)
and particular choices of 𝛼𝑘 convergence to a local minimum can be guaranteed.
The step size can be chosen according to the Wolfe conditions, or, following the Barzilai–Borwein method or an exact line search for a locally quadratic model, as
$$\alpha_k = \frac{(\mathbf{x}_k - \mathbf{x}_{k-1})^T\big(\nabla f(\mathbf{x}_k) - \nabla f(\mathbf{x}_{k-1})\big)}{\|\nabla f(\mathbf{x}_k) - \nabla f(\mathbf{x}_{k-1})\|^2} \qquad \text{or} \qquad \alpha_k = \frac{(\nabla f(\mathbf{x}_k))^T\nabla f(\mathbf{x}_k)}{(\nabla f(\mathbf{x}_k))^T\mathbf{H}(\mathbf{x}_k)\nabla f(\mathbf{x}_k)}$$
Algorithm: [Gradient descent algorithm]
Initialization: 𝐱 0 and 𝛼0
begin: k=1:n (until convergence)
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k\nabla f(\mathbf{x}_k), \qquad \alpha_k = \frac{(\nabla f(\mathbf{x}_k))^T\nabla f(\mathbf{x}_k)}{(\nabla f(\mathbf{x}_k))^T\mathbf{H}(\mathbf{x}_k)\nabla f(\mathbf{x}_k)}$$
end
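A compact MATLAB sketch of this gradient descent scheme (the quadratic test function is the one used later in this chapter; the starting point, tolerance and iteration cap are chosen here for illustration):
% Gradient descent with the exact step size for a quadratic objective (sketch)
clear all, clc
f=@(x)x(1)^2+x(1)*x(2)+3*x(2)^2+100;
J=@(x)[2*x(1)+x(2); x(1)+6*x(2)];
H=@(x)[2 1; 1 6];                     % Hessian (constant for this quadratic)
x=[5;-3]; k=0;
while norm(J(x))>1e-6 && k<1000
    g=J(x);
    alpha=(g'*g)/(g'*H(x)*g);         % step size from the algorithm above
    x=x-alpha*g;
    k=k+1;
end
Iterations=k, xmin=x, fmin=f(x)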
In such plots, the peaks (red areas) represent regions of high cost, whereas the lowest points (blue areas) are regions of minimum cost or loss. In any optimization or deep learning problem, we try to find a model whose predictions have the least loss when compared with the actual values.
Suppose our model function has two parameters then, mathematically we wish to find
the optimal values of parameters 𝜃1 and 𝜃2 that would minimize our loss. The loss (𝐽(𝜃))
space shown in the above figure tells us how our algorithm would perform if we would
choose a particular value for a parameter. Here the 𝜃1 and 𝜃2 are our x and y axis while
the loss is plotted corresponding to the z axis. The Gradient Descent rule states that the
direction in which we should move should be 180 degrees with the gradient, in other
words moving opposite to the gradient.
In order to minimize the new objective function we consider the gradient ∇𝜙(𝜹) = 𝑨𝜹 + 𝒃.
In term of 𝐱 𝑘+1we obtain ∇𝜙(𝐱 𝑘+1 ) = ∇𝑓(𝐱 𝑘+1 ) = 𝑨𝐱 𝑘+1 + 𝒃. As a consequence, all
gradient-like iterative methods developed in the previous chapter for linear systems,
can be extended to solve nonlinear minimization problems.
In particular, having fixed a descent direction 𝒑𝑘 = (𝐱 𝑘+1 − 𝐱 𝑘 )/𝛼𝑘 , we can determine the
optimal value of the acceleration parameter 𝛼𝑘 , in such a way as to find the point where
the function f, restricted to the direction 𝒑𝑘 , is minimized 𝛼𝑘 = arg min𝛼 𝑓(𝐱 𝑘 + 𝛼𝒑𝑘 ).
Setting the directional derivative to zero, we get
$$0 = \frac{d}{d\alpha_k}f(\mathbf{x}_k + \alpha_k\mathbf{p}_k) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(\mathbf{x}_k + \alpha_k\mathbf{p}_k)\,\frac{\partial}{\partial\alpha_k}\big(\mathbf{x}_k(i) + \alpha_k\mathbf{p}_k(i)\big) = \big(\nabla f(\mathbf{x}_k + \alpha_k\mathbf{p}_k)\big)^T\mathbf{p}_k.$$
But we have seen that $\nabla f(\mathbf{x}_k + \alpha_k\mathbf{p}_k) = \mathbf{A}(\mathbf{x}_k + \alpha_k\mathbf{p}_k) + \mathbf{b} = (\mathbf{A}\mathbf{x}_k + \mathbf{b}) + \alpha_k\mathbf{A}\mathbf{p}_k$. Therefore
$$\frac{d}{d\alpha_k}f(\mathbf{x}_k + \alpha_k\mathbf{p}_k) = \big(\alpha_k\mathbf{p}_k^T\mathbf{A} + (\mathbf{A}\mathbf{x}_k + \mathbf{b})^T\big)\mathbf{p}_k = 0 \;\Longrightarrow\; \alpha_k = -\frac{(\mathbf{A}\mathbf{x}_k + \mathbf{b})^T\mathbf{p}_k}{\mathbf{p}_k^T\mathbf{A}\mathbf{p}_k} = \frac{\mathbf{r}_k^T\mathbf{p}_k}{\mathbf{p}_k^T\mathbf{A}\mathbf{p}_k},$$
where $\mathbf{r}_k = -(\mathbf{A}\mathbf{x}_k + \mathbf{b})$ is the residual.
■ For the step length 𝛼𝑘 , (which minimizes f along the search direction 𝒑𝑘 ), we perform
a line search that identifies the approximate minimum of the nonlinear function f along
the search direction 𝒑𝑘 .
■ The residual 𝒓𝑘 (𝒓𝑘 = −(𝒃 + 𝑨𝐱 𝑘 )), which is the gradient of function f has to be
replaced by the gradient of the nonlinear objective function.
Remark: An appropriate step length effecting sufficient decrease could be chosen from
one of the various known methods such as the Armijo, the Goldstein or the Wolfe’s
conditions. Moreover, if f is a strongly convex quadratic function and 𝛼𝑘 is the exact
minimizer of the function f, then the FR algorithm becomes specifically the linear
conjugate gradient algorithm.
and the scalar $\beta_k$ is given by $\beta_k = \|\nabla f(\mathbf{x}_k)\|^2/\|\nabla f(\mathbf{x}_{k-1})\|^2$, where $k$ is the iteration index. The following MATLAB implementation of the (nonlinear) conjugate gradient method illustrates the procedure.
clear all, clc, tol=10^-5; x(:,1)=10*rand(2,1);
f=@(x)x(1)^2+x(1)*x(2)+3*x(2)^2+100;
J=@(x)[2*x(1)+x(2);x(1)+6*x(2)]; p(:,1)=-J(x(:,1));
i=1; % matlab starts counting at 1
finalX = x ; % initialize the vector
finalf =f(x(:,1)); z=[];
while and(norm(J(x(:,i)))>0.001,i<500)
%-------------------------------------------------%
% Armijo method for alpha determination
alp=0.01; % initial step
rho=0.01; c=0.02; % rho and c are in (0,1);
x1=x(:,i); x2=x(:,i)+alp*p(:,i);
f2=f(x2); f1=f(x(:,i));
while and(f2>f1+c*alp*(J(x(:,i)))'*p(:,i), alp>10^-6)
alp=rho*alp;
f2=f(x(:,i)+alp*p(:,i));
end
%-------------------------------------------------%
x(:,i+1)=x(:,i) + alp*p(:,i);
beta=((J(x(:,i+1)))'*J(x(:,i+1)))/((J(x(:,i)))'*J(x(:,i)));
p(:,i+1)=-J(x(:,i+1)) + beta*p(:,i);
i=i+1;
z=[z,f(x(:,i))];
end
Iter=i
xmax=x(:,end)
fmax=f(x(:,end))
Gradient=J(x(:,end))
%-------------------------------------------------%
figure(1)
X= x(1,1:end-1); Y= x(2,1:end-1); Z= z;
plot3(X,Y,Z ,'bo-','linewidth',0.1);
hold on
figure(2)
[X,Y]= meshgrid([-3:0.5:3]) ;
Z=X.^2+X.*Y+3*Y.^2+5;
S=mesh(X,Y,Z); %plotting the surface
title('Subrats Pics'), xlabel('x'), ylabel('y')
To address the shortcomings of the original
Newton method, several variations of the technique were suggested to guarantee
convergence to a local minimum. One of the most important variations is the
Levenberg–Marquardt method. This method effectively uses a step that is a combination
between the Newton method and the steepest descent method. The step taken by this
method is given by:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \big(\mathbf{H}(\mathbf{x}_k) + \mu\mathbf{I}\big)^{-1}\nabla f(\mathbf{x}_k)$$
where 𝜇 is a positive scalar and 𝑰 ∈ ℝ𝑛×𝑛 is the identity matrix. Notice that in last
equation if 𝜇 is small enough, the Hessian matrix 𝑯(𝐱 𝑘 ) dominates and the method
becomes effectively a Newton’s step. If the parameter 𝜇 is large enough, the matrix 𝜇𝑰
dominates and the method is approximately in the steepest descent direction. By
increasing 𝜇, the inverse matrix becomes small in norm and subsequently the norm of
the step taken ‖𝐱 𝑘+1 − 𝐱 𝑘 ‖ becomes smaller. It follows that the parameter 𝜇 controls
also the step size.
One interesting mathematical property of this approach is that adding the matrix 𝜇𝑰 to
the Hessian matrix increases each eigenvalue of this matrix by 𝜇. If the matrix 𝑯(𝐱 𝑘 ) is
not positive semi-definite then adding 𝜇 to each eigenvalue makes them more positive.
The value of 𝜇 can be increased until all the eigenvalues are positive thus guaranteeing
that the step (𝐱 𝑘+1 − 𝐱 𝑘 ) is a descent step. The Levenberg–Marquardt approach starts
each iteration with a very small value of 𝜇, thus giving effectively the Newton’s step. If
an improvement in the objective function is achieved, the new point is accepted.
Otherwise, the value of 𝜇 is increased until a reduction in the objective function is
obtained.
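A minimal MATLAB sketch of this Levenberg–Marquardt strategy (adaptive μ) is given below; the test function is the Rosenbrock function used earlier, while the initial μ, its update factors, the starting point and the stopping rule are illustrative choices of ours.
% Levenberg-Marquardt-type modification of Newton's method (sketch)
clear all, clc
f=@(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
J=@(x)[-400*(x(2)-x(1)^2)*x(1)+2*x(1)-2; 200*x(2)-200*x(1)^2];
H=@(x)[1200*x(1)^2-400*x(2)+2 -400*x(1); -400*x(1) 200];
x=[-1.2;1]; mu=1e-3; k=0;
while norm(J(x))>1e-6 && k<500
    p=-(H(x)+mu*eye(2))\J(x);          % damped Newton step
    if f(x+p)<f(x)
        x=x+p; mu=mu/10;               % success: accept the step and move toward Newton
    else
        mu=mu*10;                      % failure: increase mu (steepest-descent-like step)
    end
    k=k+1;
end
Iterations=k, xmin=x, fmin=f(x)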
starting from the point 𝐱 0 = [3.0 − 7.0 0]𝑇 . Utilize the MATLAB software
The algorithm terminated in only one iteration. The exact solution for this problem is
𝐱 ⋆ = [1.0 0.0 0.0]𝑇 with a minimum value of 𝑓(𝐱 ⋆ ) = −1.50.
Theorem: Any locally optimal point of a convex optimization problem is also (globally)
optimal.
Here $k$ indexes the decision variables and $i$ the iterations. The MATLAB function rand generates a uniform random number in $(0,1)$ at each call; let us call this number $r = \mathrm{rand}$. Since the search must be done in the interval $(a_k, b_k)$ for each variable $x_k$, we want the random number to be mapped into this range. For this reason, the following translation is made: $x_k(i) = a_k + r_k(i)\,(b_k - a_k)$, $k = 1,2,\dots,n$.
clc;clear; nVar = 2; % the number of decision variables
N= 10000; % Number of random generated points
epsilon = 1e-3; % the convergence factor
a=zeros(1,nVar); b=zeros(1,nVar); % pre-allocation of vectors a and b
for i=1:nVar, a(i)=-1.50; b(i)=1.50; end % set-up of the search limits
fMin = 1e6; % initialize fMin
fPrecedent = fMin;
for i=1:N % global search procedure
x1 = a(1)+ rand*(b(1)-a(1)); % random generation: variable x1
x2 = a(2)+ rand*(b(2)-a(2)); % random generation: variable x2
f=@(x,y)2*x+y+(x.^2-y.^2)+(x-y.^2).^2; % The objective function
func =f(x1,x2);
if (func<fMin)
fMin = func; x1Min = x1; x2Min = x2;
if abs(fMin - fPrecedent)<=epsilon
break;
else
fPrecedent= fMin;
end, end, end
x1=x1Min, x2=x2Min, fMin =fMin(end)
J=@(x,y)[4*x-2*y^2+2;-4*x*y+4*y^3-2*y+1];
Jmin=J(x1, x2), fmin=f(x1, x2),
To find the solution to the minimization problem, the random path method uses an
iterative relationship of the form 𝐱(𝑖 + 1) = 𝐱(𝑖) + 𝛼𝑖 𝐬(𝑖) & 𝑖 = 1,2, … , 𝑛 where i is an
iterative index, 𝐱(𝑖) is the vector of the decision variables, 𝛼𝑖 is a step size, at iteration i
called acceleration factor in the 𝐬(𝑖) direction, and 𝐬(𝑖) is the vector of the minimization
direction. The search procedure starts from a randomly chosen point. Whatever this
start point is, we have to reach the same solution. The coordinates of the minimization
direction vector 𝐬𝑘 , are randomly chosen using the rand function.
Algorithm: [Random Walk algorithm]
Step 1: chose 𝐱(0) and 𝑁max
set 𝑖 = 1
Step 2: for each iteration 𝑖 do
𝐬(𝑖) = random vector
Step 3: 𝐱(𝑖 + 1) = 𝐱(𝑖) + 𝛼𝑖 𝐬(𝑖)
Remark: The convergence of this algorithm is slow and not guaranteed in general, it is
dependent strongly on the convexity of the objective function.
%f=@(x)2*x(1)+x(2)+(x(1).^2+x(2).^2)+(x(1)+x(2).^2).^2;
%J=@(x)[4*x(1)+2*x(2)^2+2;4*x(1)*x(2)+4*x(2)^3+2*x(2)+1];
%---------------------------------------------------------------%
clear all, clc, n=0; nMax=5; xzero=rand(2,1); epsilon=1e-4; alfa0=0.01;
f=@(x)2*x(1)+x(2)+(x(1).^2-x(2).^2)+(x(1)-x(2).^2).^2;
a = -1.0 ; b = 1.0; % the range for s
F0=f(xzero); Fprecedent=F0; % the function value at the start point
f0=F0; s=rand(2,1); alfa = alfa0; increment = alfa0;
xone = xzero + alfa*s; % generate a next iteration Xl
F1 = f(xone); % the objective function value in Xl
Factual = F1;
i=1; % initialize the counter i
go = true; % variable 'go' remains 'true' as long as
% the convergence criteria are not fulfilled
while go
while (Factual>=Fprecedent)
s = rand(2,1); s = a*[1;1] + s*(b-a); % generate a random direction s
xone = xzero + alfa*s;
F1 = f(xone); Factual = F1;
end
i=i+1; f1=F1;
while (Factual<Fprecedent)
Fprecedent = Factual;
alfa = alfa + increment;
xone = xzero + alfa*s; F1 = f(xone);
end
deltaF = abs(F1-Fprecedent); F0 = Factual; xzero = xone; alfa = alfa0;
if(abs(f0-f1)<=epsilon) n = n + 1; end
f0 =f1;
if(n==nMax) go = false; break; end
end
J=@(x)[4*x(1)-2*x(2)^2+2;-4*x(1)*x(2)+4*x(2)^3-2*x(2)+1];
xone, Factual, Jmin=J(xone), fmin=f(xone),
The Monte Carlo method is based on the following principle: if the best option is
needed, it should be tried "at random" many times and then the best option found
between those attempts chosen. If there are enough different attempts, the best option
found will almost certainly be an optimal global value. This method is valid both
mathematically and intuitively. The advantages of the method are both its simplicity
and its universality. But it has the disadvantage of being too slow.
Let’s pretend that we don’t know the value of 𝜋. To calculate it, we will generate a large
number 𝑁 of random points in the unit square. By 𝑛 we will denote the number of
points lying inside the quarter-circle. As you will certainly agree, with large 𝑁 the ratio
𝑛/𝑁 must be very similar to the ratio of 𝐴/𝑆. And that’s all! From the equation 𝑛/𝑁 = 𝐴/𝑆
we can easily express that $\pi = 4n/N$. This approximation becomes more and more accurate as the number $N$ of uniformly distributed points generated over the square increases.
The corresponding MATLAB program is presented below.
clear all, clc, nmax = 5000;
x = rand(nmax,1); y = rand(nmax,1); x1=x-0.5; y1=y-0.5;
r = sqrt(x1.^2+y1.^2) ;
% get logicals
inside = r<=0.5; outside = r>0.5;
% plot
plot(x1(inside),y1(inside),'b.');
hold on
plot(x1(outside),y1(outside),'r.');
axis equal
% get pi value
thepi = 4*sum(inside)/nmax;
fprintf('%8.4f\n',thepi)
𝑓(𝑥, 𝑦) = −0.02 sin(𝑥 + 4𝑦) − 0.2 cos(2𝑥 + 3𝑦) − 0.2 sin(2𝑥 − 𝑦) + 0.4 cos(𝑥 − 2𝑦)
While the objective function depends on two variables 𝑥 and 𝑦, its graphical
representation is a surface. From Fig. it is evident that this surface has many peaks
and valleys, interpretable as many local minimum (or maximum) points, depending on
the problem scope. Usually, a numerical optimization procedure risks ending in a local
optimum point instead of an absolute minimum point.
% This program draws the mesh of a multimodal
% function that depends on two variables
clear all, clc
f=@(x1,x2)-0.02*sin(x1+4*x2)-0.2*cos(2*x1+3*x2)-0.2*sin(2*x1-x2)+0.4*cos(x1-2*x2);
a1=-2.5;a2=-2.5; b1=2.5;b2=2.5; increment1=0.1; increment2=0.1;
n1=(b1-a1)/increment1; n2=(b2-a2)/increment2; fGraph = zeros(n1,n2);
x1 = a1;
for i1 = 1:n1
x2 = a2;
for i2 = 1:n2
fGraph (i1,i2)=f(x1,x2);
x2 = x2 + increment2;
end
x1 = x1 + increment1;
end
mesh(fGraph) ; % drawing of fGraph
clear all, clc
f=@(x1,x2)-0.02*sin(x1+4*x2)-0.2*cos(2*x1+3*x2)-0.2*sin(2*x1-x2)+0.4*cos(x1-2*x2);
% initial values below are assumed for illustration (the text does not list them)
x1_min=0; x2_min=0;                 % starting point of the local search
ls1=-0.5; ld1=0.5;                  % neighborhood bounds along the x1 axis
gridNumber=100; epsilon=1e-4;
minF=f(x1_min,x2_min); precF=minF; delta=1;
while delta>epsilon
%------% The block for xl variable (keep x2=constant) %------%
x2 = x2_min; x1 = ls1;
increment = abs(ld1-ls1)/gridNumber;
while x1<=ld1
func = f(x1,x2);
if func<minF, minF = func; x1_min=x1; end
x1 = x1 + increment;
end
actF = minF;
% check the convergence criterion
delta=abs(actF-precF);
precF = actF;
end
Starting from this point, a local search procedure is designed. This procedure is based
on the Grid method. First thing to do is to define a neighborhood of the starting point
on each axis. This neighborhood should be set for each axis separately. Then along
each axis a search of the minimum point in that direction is made successively.
In this particular case, the local search along the 𝑥1 axis starts from the left bound of
the neighborhood, that is 𝑙𝑠1, while 𝑥2 is kept constant. Once the minimum point along
𝑥1 axis is found, it is kept constant, while the local search is performed along the 𝑥2
axis. When the search along all axes is finished, one iteration is over. The value of the objective function at this point is compared with the value at the corresponding point of the previous iteration. If the difference between these two values, denoted delta, is less than a preset precision factor called epsilon, then the search stops; otherwise it continues.■ (Ancau Mircea 2019)
f=@(x1,x2)log((1+(x1-4/3).^2)+3*(x1+x2-(x1).^3).^2);
x1 =-2:0.1:2 ; x2 =-2:0.1:2;
An optimization problem may entail a set of equality constraints and possibly a set of
inequality constraints. If this is the case, the problem is said to be a constrained
optimization problem. The most general constrained optimization problem can be
expressed mathematically as
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
subject to: ℎ𝑖 (𝐱) = 0
𝑔𝑗 (𝐱) ≥ 0
A problem that does not entail any equality or inequality constraints is said to be an
unconstrained optimization problem. Constrained optimization is usually much more
difficult than unconstrained optimization, as might be expected. Consequently, the
general strategy that has evolved in recent years towards the solution of constrained
optimization problems is to reformulate constrained problems as unconstrained
optimization problems. When the objective function and all the constraints are linear
functions of 𝐱, the problem is a linear programming problem. Problems of this type are
probably the most widely formulated and solved of all optimization problems,
particularly in control system, management, financial, and economic applications.
Nonlinear programming problems, in which at least some of the constraints or the
objective are nonlinear functions, tend to arise naturally in the physical sciences and
engineering, and are becoming more widely used in control system, management and
economic sciences as well.
Several branches of mathematical programming are of much interest for the
optimization problems, namely, linear, integer, quadratic, nonlinear, and dynamic
programming. Each one of these branches of mathematical programming consists of the
theory and application of a collection of optimization techniques that are suited to a
specific class of optimization problems.
The method can be summarized as follows: in order to find the maximum or minimum
of a function 𝑓(𝐱) subjected to the equality constraint g(𝐱) = 0, form the Lagrangian
function 𝐿(𝐱 , 𝜆) = 𝑓(𝐱) − 𝜆g(𝐱) and find the stationary points 𝐱 = 𝐱 ⋆ of 𝐿(𝐱 , 𝜆) such that
∇𝐿(𝐱 ⋆ , 𝜆) = 0. Further, the method of Lagrange multipliers is generalized by the Karush–
Kuhn–Tucker conditions, which can also take into account inequality constraints of the
form ℎ(𝐱) ≤ 𝑐.
Often the Lagrange multipliers have an interpretation as some quantity of interest. For
example, consider
$$\text{minimize } f(\mathbf{x}), \ \mathbf{x} \in \mathbf{\Omega}, \qquad \text{subject to } g_i(\mathbf{x}) = c_i,\ i=1,\dots,p.$$
The Lagrangian function is $L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \sum_{i=1}^{p}\lambda_i\big(c_i - g_i(\mathbf{x})\big)$. Then $\lambda_k = \partial L/\partial c_k$. So, $\lambda_k$ is
the rate of change of the quantity being optimized as a function of the constraint
parameter. The relationship between the gradient of the function and gradients of the
constraints is:
𝐿(𝐱 , 𝜆) = 𝑓(𝐱) − 𝜆g(𝐱) ⟹ ∇𝑓(𝐱) = 𝜆∇g(𝐱).
Remark: In optimal control theory, the Lagrange multipliers are interpreted as costate
variables, and Lagrange multipliers are reformulated as the minimization of the
Hamiltonian, in Pontryagin's minimum principle.
Where 𝑨 ∈ ℝ𝑝×𝑛 is assumed to have full row rank. Also discuss the case where the
constraints are nonlinear.
Solution: In this case we have $L(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \boldsymbol{\lambda}^T\mathbf{g}(\mathbf{x}) = f(\mathbf{x}) - \boldsymbol{\lambda}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$ with $\mathbf{g}(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b} = \mathbf{0}$ and $\boldsymbol{\lambda} = [\lambda_1 \cdots \lambda_p]^T$. Taking the gradient of the constraint term with respect to $\mathbf{x}$ gives $\nabla_{\mathbf{x}}\big(\boldsymbol{\lambda}^T\mathbf{g}(\mathbf{x})\big) = \mathbf{A}^T\boldsymbol{\lambda}$, so stationarity of $L$ requires $\nabla f(\mathbf{x}) = \mathbf{A}^T\boldsymbol{\lambda}$.
Solution: We know that $\nabla f(\mathbf{x}) = \mathbf{H}\mathbf{x} + \mathbf{p}$, so that $\boldsymbol{\lambda} = \mathbf{A}^{+}\nabla f(\mathbf{x}^\star) = (\mathbf{A}\mathbf{A}^T)^{-1}\mathbf{A}(\mathbf{H}\mathbf{x}^\star + \mathbf{p})$. In order to eliminate the unknown $\mathbf{x}^\star$, write $\mathbf{A}^T\boldsymbol{\lambda} = \mathbf{H}\mathbf{x}^\star + \mathbf{p}$ and multiply both sides by $\mathbf{A}\mathbf{H}^{-1}$, which gives $\mathbf{A}\mathbf{H}^{-1}\mathbf{A}^T\boldsymbol{\lambda} = \mathbf{A}\mathbf{x}^\star + \mathbf{A}\mathbf{H}^{-1}\mathbf{p} = \mathbf{b} + \mathbf{A}\mathbf{H}^{-1}\mathbf{p}$.
Remark: assume that we are dealing with the problem of optimization such that
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
{
subject to: 𝐠(𝐱) = 𝟎
The Karush–Kuhn–Tucker conditions state that $\nabla L(\mathbf{x}, \boldsymbol{\lambda}) = 0$, which can be written in the form
$$L(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \boldsymbol{\lambda}^T\mathbf{g}(\mathbf{x}) \;\Longleftrightarrow\; \mathbf{h}(\mathbf{x}, \boldsymbol{\lambda}) = \begin{pmatrix} \partial L(\mathbf{x}, \boldsymbol{\lambda})/\partial\mathbf{x} \\ \partial L(\mathbf{x}, \boldsymbol{\lambda})/\partial\boldsymbol{\lambda} \end{pmatrix} = \begin{pmatrix} \nabla f(\mathbf{x}) + \mathbf{J}^T\boldsymbol{\lambda} \\ \mathbf{g}(\mathbf{x}) \end{pmatrix} = \begin{pmatrix} \mathbf{0} \\ \mathbf{0} \end{pmatrix}$$
where $\mathbf{J}$ is the Jacobian of the vector $\mathbf{g}(\mathbf{x})$. In the case when $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{H}\mathbf{x} + \mathbf{x}^T\mathbf{p}$ and $\mathbf{g}(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}$,
$$\mathbf{h}(\mathbf{x}, \boldsymbol{\lambda}) = \begin{pmatrix} \nabla f(\mathbf{x}) + \mathbf{J}^T\boldsymbol{\lambda} \\ \mathbf{g}(\mathbf{x}) \end{pmatrix} = \begin{pmatrix} \mathbf{H}\mathbf{x} + \mathbf{p} + \mathbf{A}^T\boldsymbol{\lambda} \\ \mathbf{A}\mathbf{x} - \mathbf{b} \end{pmatrix} = \mathbf{0} \;\Longleftrightarrow\; \begin{pmatrix} \mathbf{H} & \mathbf{A}^T \\ \mathbf{A} & \mathbf{0} \end{pmatrix}\begin{pmatrix} \mathbf{x} \\ \boldsymbol{\lambda} \end{pmatrix} = \begin{pmatrix} -\mathbf{p} \\ \mathbf{b} \end{pmatrix}$$
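This linear KKT system can be solved directly in MATLAB; the short sketch below uses a toy H, p, A and b of our own choosing to illustrate the block system above.
% Equality-constrained quadratic program via the KKT linear system (sketch)
H=[2 0; 0 2]; p=[-2; -5];            % f(x) = 0.5*x'*H*x + x'*p  (illustrative data)
A=[1 1]; b=3;                        % single constraint x1 + x2 = 3
KKT=[H A'; A 0];
rhs=[-p; b];
sol=KKT\rhs;
x=sol(1:2), lambda=sol(3)            % minimizer and Lagrange multiplier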
The basic concept in random search approaches is to randomly generate points in the
parameter space. Only feasible points satisfying g 𝑗 (𝐱) ≤ 0, 1 ≤ 𝑗 ≤ 𝑚 are considered,
while non-feasible points with at least one g 𝑗 (𝐱) > 0 for some j are rejected. The
algorithm keeps track of the feasible random point with the least value of the objective
function. This requires checking, at every iteration, if the newly generated feasible point
has a better objective function than the best value achieved so far.
The main disadvantage of this algorithm is that a large number of objective function
calculations may be required especially for problems with large n. The following
example illustrates this technique.
iterations=k,
BestPosition=x0,
fmax=f0,
$$\begin{aligned} &\text{minimize } \ \log\!\Big(1 + \big(x_1 - \tfrac{4}{3}\big)^2 + 3\big(x_1 + x_2 - x_1^3\big)^2\Big) \\ &\text{subject to: } \ x_1^2 + x_2^2 - 4 \le 0, \qquad -1 \le x_1, x_2 \le 1 \end{aligned}$$
The positive constant 𝜆 is the regularization parameter. As 𝜆 gets larger, more weight is
given to the regularization function. In many cases, the regularization is taken to be
quadratic. In particular, $R(\mathbf{x}) = \|\mathbf{D}\mathbf{x}\|^2$ where $\mathbf{D} \in \mathbb{R}^{p\times n}$ is a given matrix. The quadratic regularization function aims to control the norm of $\mathbf{D}\mathbf{x}$, and the resulting regularized least squares (RLS) problem is formulated as
$$\min_{\mathbf{x}}\; f_{\mathrm{RLS}}(\mathbf{x}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \lambda\|\mathbf{D}\mathbf{x}\|^2.$$
Since the Hessian of the objective function is ∇2 𝑓RLS = 2(𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫) ≽ 0, it follows by
previous theorems that any stationary point is a global minimum point. The stationary
points are those satisfying ∇𝑓RLS (𝐱) = 0, that is, (𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫)𝐱 = 𝑨𝑻 𝒃.
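In MATLAB, once A, b, D and λ are available, this stationarity condition can be solved directly (a one-line sketch, with variable names of our own choosing):
% Regularized least squares solution from the normal equations (sketch)
x_rls = (A'*A + lambda*(D'*D)) \ (A'*b);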
$$\mathbf{A} = \mathbf{B}^T\mathbf{B} + 10^{-3}\mathbf{I} = \begin{pmatrix} 2+10^{-3} & 3 & 4 \\ 3 & 5+10^{-3} & 7 \\ 4 & 7 & 10+10^{-3} \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} 20.0019 \\ 34.0004 \\ 48.0202 \end{pmatrix}$$
The purpose is to find the best approximate solution of 𝑨𝐱 = 𝒃. Knowing that the exact
solution is 𝐱 𝑡𝑟𝑢𝑒 = [1 2 3]𝑇 .
The matrix $\mathbf{A}$ is in fact of full column rank, since its eigenvalues are all positive (which can be checked, for example, by the MATLAB command eig(A)), and the simple least squares solution $\mathbf{x}_{LS}$ can be computed by the MATLAB command x_ls = A\b.
𝐱 𝐿𝑆 is rather far from the true vector 𝐱 𝑡𝑟𝑢𝑒 . One difference between the solutions is that
the squared norm ‖𝐱 𝐿𝑆 ‖2 = 90.1855 is much larger then the correct squared norm
$\|\mathbf{x}_{true}\|^2 = 14$. In order to control the norm of the solution we will add the quadratic regularization function $\|\mathbf{x}\|^2$. The regularized solution will thus have the form $\mathbf{x}_{RLS}(\lambda) = (\mathbf{A}^T\mathbf{A} + \lambda\mathbf{I})^{-1}\mathbf{A}^T\mathbf{b}$.
This quadratic function can also be written as 𝑅(𝐱) = ‖𝑳𝐱‖2 , where 𝑳 ∈ ℝ(𝑛−1)×𝑛 is given
by
$$\mathbf{L} = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & -1 & \cdots & 0 & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & -1 \end{pmatrix}$$
The resulting regularized least squares problem is (with 𝜆 a given regularization
parameter)
$$\min_{\mathbf{x}}\; \|\mathbf{x} - \mathbf{b}\|^2 + \lambda\|\mathbf{L}\mathbf{x}\|^2$$
% the true signal x (a vector of length 300) is assumed to have been constructed beforehand;
% its exact definition is not reproduced here
randn('seed',314);
b=x+0.05*randn(300,1);
The true and noisy signals are given in Figure, which was constructed by the MATLAB
commands
subplot(1,2,1);
plot(1:300,x,'LineWidth',2);
subplot(1,2,2);
plot(1:300,b,'LineWidth',2);
In order to denoise the signal 𝒃, we look at the optimal solution of the RLS problem, for
four different values of the regularization parameter: 𝜆 = 1, 10,100, 1000.
The original true signal is denoted by a dotted line. As can be seen in the next Figure,
as 𝜆 gets larger, the RLS solution becomes smoother.
For 𝜆 = 10 the RLS solution is a rather good estimate of the original vector 𝐱. For
𝜆 = 100 we get a smoother RLS signal, but evidently it is less accurate than 𝐱 𝑅𝐿𝑆 (10),
especially near the boundaries. The RLS solution for 𝜆 = 1000 is very smooth, but it is a
rather poor estimate of the original signal. In any case, it is evident that the parameter
𝜆 is chosen via a trade-off between data fidelity (closeness of 𝐱 to 𝒃) and smoothness
(size of $\mathbf{L}\mathbf{x}$). The four plots were produced by the MATLAB commands
L=zeros(299,300);
for i=1:299
L(i,i)=1;
L(i,i+1)=-1;
end
x_rls=(eye(300)+1*L'*L)\b;
x_rls=[x_rls,(eye(300)+10*L'*L)\b];
x_rls=[x_rls,(eye(300)+100*L'*L)\b];
x_rls=[x_rls,(eye(300)+1000*L'*L)\b];
figure(2)
for j=1:4
subplot(2,2,j);
plot(1:300,x_rls(:,j),'LineWidth',2);
hold on
plot(1:300,x,':r','LineWidth',2);
hold off
title(['\lambda=',num2str(10^(j-1))]);
end
Most real-world optimizations are highly
nonlinear and multimodal, under various complex constraints. Different objectives are
often conflicting. Even for a single objective, sometimes, optimal solutions may not exist
at all. In general, finding an optimal solution or even sub-optimal solutions is not an
easy task. This work aims to introduce the fundamentals of metaheuristic optimization,
as well as some popular metaheuristic algorithms. Metaheuristic algorithms are
becoming an important part of modern optimization. A wide range of metaheuristic
algorithms have emerged over the last two decades, and many metaheuristics such as
particle swarm optimization are becoming increasingly popular. Despite their popularity, the mathematical analysis of these algorithms lags behind. Convergence analysis still remains unsolved for the majority of metaheuristic algorithms, while efficiency analysis is equally challenging.
The most general form of such an optimization problem can be written as
$$\min_{\mathbf{x}\in\mathbb{R}^n}\ f_1(\mathbf{x}),\dots,f_m(\mathbf{x}), \qquad \text{subject to } h_j(\mathbf{x}) = 0\ (j=1,\dots,J), \quad g_k(\mathbf{x}) \le 0\ (k=1,\dots,K),$$
where $f_1, \dots, f_m(\mathbf{x})$ are the objectives, while $h_j$ and $g_k$ are the equality and inequality constraints, respectively. In the case when $m = 1$, it is called single-objective optimization. When $m \ge 2$, it becomes a multi-objective problem whose solution strategy is different from those for a single objective. In general, all the functions $f_i$, $h_j$ and $g_k$ are nonlinear. In the special case when all these functions are linear, the optimization problem becomes a linear programming problem which can be solved using the standard simplex method (Dantzig 1963).
Metaheuristic optimization concerns more generalized, nonlinear optimization
problems. It is worth pointing out that the above minimization problem can also be
formulated as a maximization problem if 𝑓𝑖 is replaced with −𝑓𝑖 .
Derivative-free algorithms do not use any derivative information but the values of the
function itself. Some functions may have discontinuities or it may be expensive to
calculate derivatives accurately, and thus derivative-free algorithms become very
useful.
Search capability can also be a basis for algorithm classification. In this case,
algorithms can be divided into local and global search algorithms. Local search
algorithms typically converge towards a local optimum, not necessarily (often not) the
global optimum, and such an algorithm is often deterministic and has no ability to
escape from local optima. On the other hand, for global optimization, local search
algorithms are not suitable, and global search algorithms should be used. Modern
metaheuristic algorithms in most cases tend to be suitable for global optimization,
though not always successful or efficient.
Algorithms with stochastic components were often referred to as heuristic in the past,
though the recent literature tends to refer to them as metaheuristics. We will follow
Glover's convention and call all modern nature-inspired algorithms metaheuristics
(Glover 1986, Glover and Kochenberger 2003). Loosely speaking, heuristic means to find
or to discover by trial and error. Here meta- means beyond or higher level, and
metaheuristics generally perform better than simple heuristics. In addition, all
metaheuristic algorithms use a certain tradeoff of randomization and local search.
Quality solutions to difficult optimization problems can be found in a reasonable
amount of time, but there is no guarantee that optimal solutions can be reached. It is
hoped that these algorithms work most of the time, but not all the time. Almost all
metaheuristic algorithms tend to be suitable for global optimization.
When a particle finds a location that is better than any previously found locations, then
it updates this location as the new current best for particle 𝑖 . There is a current best for
all particles at any time 𝑡 at each iteration. The aim is to find the global best among all
the current best solutions until the objective no longer improves or after a certain
number of iterations.
Let $\mathbf{x}_i$ and $\mathbf{v}_i$ be the position and velocity vectors, respectively, of particle $i$. The new velocity vector is determined by the following formula
$$\mathbf{v}_i^{k+1} = \omega(k)\,\mathbf{v}_i^{k} + \alpha\,\boldsymbol{\varepsilon}_1 \circ \big(\mathbf{x}_i^{best} - \mathbf{x}_i^{k}\big) + \beta\,\boldsymbol{\varepsilon}_2 \circ \big(\mathbf{g}^{best} - \mathbf{x}_i^{k}\big)$$
where 𝜺1 and 𝜺2 are two random vectors, and each entry takes a value between 0 and 1.
The parameters 𝛼 and 𝛽 are the learning parameters or acceleration constants, which
are typically equal to, say, $\alpha \approx \beta \approx 2$. $\omega(k)$ is the inertia function, which takes a value between 0 and 1. In the simplest case, the inertia function can be taken as a constant, typically $\omega \in [0.5,\ 0.9]$. This is equivalent to introducing a virtual mass to stabilize the motion of
the particles, and thus the algorithm is expected to converge more quickly.
The initial locations of all particles should be distributed relatively uniformly so that
they can sample over most regions, which is especially important for multimodal
problems. The initial velocity of a particle can be set to zero, that is, 𝐯𝑖𝑘=0 = 0 . The new
position can then be updated by the formula 𝐱 𝑖𝑘+1 = 𝐱 𝑖𝑘 + 𝐯𝑖𝑘+1
As the iterations proceed, the particle system swarms and may converge towards a
global optimum.
▪ $f(x,y) = 3\sin(x) + e^y$, $-4 \le x, y \le 4$
▪ $f(x,y) = 100(y - x^2)^2 + (1-x)^2$, $-10 \le x, y \le 10$
The linearly decreasing inertia weight is taken as
$$\omega(iter) = \omega_{max} - \frac{\omega_{max} - \omega_{min}}{\mathrm{Max}_{iter}} \times iter$$
clear all, clc,
[X,Y] = meshgrid(-4:0.5:4,-4:0.5:4);
Z = 3*sin(X)+exp(Y); surf(X,Y,Z); figure
% initialization (values assumed for illustration; the listing in the text starts at the main loop)
n=30; m=2; itermax=100; c1=2; c2=2; wmax=0.9; wmin=0.5;
xmin=-4*ones(m,1); xmax=4*ones(m,1);
x=-4+8*rand(m,n); v=zeros(m,n); xbest=x;      % random initial positions in [-4,4]^2
for i=1:n, fbest(i)=3*sin(x(1,i))+exp(x(2,i)); end
[fgbest,ig]=min(fbest); gbest=x(:,ig);
for iter=1:itermax
w=wmax-(wmax-wmin)*iter/itermax;
for i=1:n
v(:,i)=w*v(:,i)+c1*rand*(xbest(:,i)-x(:,i))+c2*rand*(gbest-x(:,i));
x(:,i)=x(:,i)+v(:,i);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
the optimal value is 1.5708
the optimal value is 2.0000
the minimum value of func is -2.8647
Exercise: Write a MATLAB code to search the maximum value of the following objective
functions
▪ 𝑓(𝑥, 𝑦) = 𝑥 2 − 𝑦 2 − 10 ≤ 𝑥, 𝑦 ≤ 10
4 3 2
▪ 𝑓(𝑥) = 𝑥 − 14𝑥 + 60 𝑥 − 70 𝑥 − 10 ≤ 𝑥, 𝑦 ≤ 10
▪ 𝑓(𝑥, 𝑦) = 𝑥sin(4𝑥) + 1.1𝑦sin(𝑦) − 10 ≤ 𝑥, 𝑦 ≤ 10
▪ 𝑓(𝐱) = (𝑥 + 10𝑦) + 5(𝑧 − 𝑤) + (𝑦 − 2𝑧) + 10(𝑥 − 2𝑤)4
2 2 4
Alternatives of PSO: There are many variations which extend the standard algorithm.
The standard particle swarm optimization uses both the current global best 𝐠 𝑏𝑒𝑠𝑡 and
the individual best 𝐱 𝑖𝑏𝑒𝑠𝑡 . The reason of using the individual best is primarily to increase
the diversity in the quality solution, however, this diversity can be simulated using the
randomness. Subsequently, there is no compelling reason for using the individual best.
A simplified version which could accelerate the convergence of the algorithm is to use
the global best only. Thus, in the accelerated particle swarm optimization, the velocity vector is generated by
$$\mathbf{v}_i^{k+1} = \mathbf{v}_i^{k} + \alpha\,\big(\boldsymbol{\varepsilon} - \tfrac{1}{2}\mathbf{e}\big) + \beta\,\big(\mathbf{g}^{best} - \mathbf{x}_i^{k}\big)$$
In order to increase the convergence even further, we can also write the update of the
location in a single step
𝐱 𝑖𝑘+1 = (1 − 𝛽)𝐱 𝑖𝑘 + 𝛽𝐠 𝑏𝑒𝑠𝑡 + 𝛼 × (𝜺1 − 0.5𝐞)
𝛼 = 𝛼0 𝑒 −𝛾𝑘 ; or 𝛼 = 𝛼0 𝛾 𝑘 , ( 𝛾 < 1)
clear all, clc,
[X,Y] = meshgrid(-4:0.5:4,-4:0.5:4);
Z = 3*sin(X)+exp(Y); surf(X,Y,Z); figure
% (initialize n, m, itermax, c1, c2, bounds, x, v, xbest, fbest, gbest and fgbest
%  as in the previous PSO listing before entering the loop)
for iter=1:itermax
w=wmax-(wmax-wmin)*iter/itermax;
for i=1:n
v(:,i)=v(:,i) + c1*(rand-0.5) + c2*(gbest-x(:,i));
x(:,i)=(1-c2)*x(:,i) + c2*gbest + c1*(rand-0.5);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
Example: Write a MATLAB code to search by PSO the extremum of $f = 2x^2 - 3y^2 + 4z^2 + 2$, $-10 \le x, y, z \le 10$.
% (initialize n, m=3, itermax, c1, c2, bounds xmin=-10*ones(3,1), xmax=10*ones(3,1),
%  x, v, xbest, fbest, gbest and fgbest as in the previous PSO listings)
for iter=1:itermax
for i=1:n
v(:,i)=v(:,i) + c1*(rand-0.5) + c2*(gbest-x(:,i));
x(:,i)=(1-c2)*x(:,i) + c2*gbest + c1*(rand-0.5);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
fun_marge(i)=2*(x(1,i))^2-3*(x(2,i))^2+4*(x(3,i))^2+2;
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
As already mentioned, swarm intelligence is a
relatively new approach to problem
solving that takes inspiration from social
behaviors of insects and of other
animals. In particular, ants have
inspired a number of methods and
techniques among which the most
studied and the most successful is the
general-purpose optimization technique
known as ant colony optimization. Ant
colony optimization (ACO) takes
inspiration from the foraging behavior of
some ant species. These ants deposit
pheromone on the ground in order to mark some favorable path that should be followed
by other members of the colony. Ant colony optimization exploits a similar mechanism
for solving optimization problems.
Ant colony optimization (ACO), introduced by Marco Dorigo 1991 in his doctoral
dissertation, is a class of optimization algorithms modeled on the actions of an ant
colony. ACO is a probabilistic technique useful in problems that deal with finding better
paths through graphs. Artificial 'ants'—simulation agents—locate optimal solutions by
moving through a parameter space representing all possible solutions. Natural ants lay
down pheromones directing each other to resources while exploring their environment.
The simulated 'ants' similarly record their positions and the quality of their solutions, so that in later simulation iterations more ants locate better solutions.
Procedure: The ants construct the solutions as follows. Each ant starts from a
randomly selected city (node or vertex). Then, at each construction step it moves along
the edges of the graph. Each ant keeps a memory of its path, and in subsequent steps
it chooses among the edges that do not lead to vertices that it has already visited. An
ant has constructed a solution once it has visited all the vertices of the graph. At each
construction step, an ant probabilistically chooses the edge to follow among those that
lead to yet unvisited vertices. The probabilistic rule is biased by pheromone values and
heuristic information: the higher the pheromone and the heuristic value associated to
an edge, the higher the probability an ant will choose that particular edge. Once all the
ants have completed their tour, the pheromone on the edges is updated. Each of the
pheromone values is initially decreased by a certain percentage. Each edge then
receives an amount of additional pheromone proportional to the quality of the solutions
to which it belongs (there is one solution per ant). The solution construction process is
stochastic and is biased by a pheromone model, that is, a set of parameters associated
with graph components (either nodes or edges) whose values are modified at runtime by
the ants.
Set parameters, initialize pheromone trails
SCHEDULE_ACTIVITIES
    ConstructAntSolutions
    UpdatePheromones
    DaemonActions          (optional)
END_SCHEDULE_ACTIVITIES
Parametrization: Suppose the number of cities is 𝑛, the number of ants is 𝑚, the distance between the 𝑖th and 𝑗th cities is 𝑑𝑖𝑗, 𝑖, 𝑗 = 1, 2, …, 𝑛, and the concentration of pheromone on edge (𝑖, 𝑗) at time 𝑡 is 𝜏𝑖𝑗(𝑡). At the initial time the pheromone concentrations between cities are all equal, 𝜏𝑖𝑗(0) = 𝐶 (𝐶 a constant). The probability that ant 𝑘, currently at city 𝑖, chooses city 𝑗 next is denoted $p_{ij}^{k}$ and is given by
$$p_{ij}^{k} = \frac{\left(\tau_{ij}(t)\right)^{\alpha}\left(\eta_{ij}(t)\right)^{\beta}}{\sum_{s \in N(x_k)}\left(\tau_{is}(t)\right)^{\alpha}\left(\eta_{is}(t)\right)^{\beta}}$$
The parameter 𝜂𝑖𝑗(𝑡) = 1/𝑑𝑖𝑗 is the heuristic information, which indicates the desirability of moving from the 𝑖th to the 𝑗th city, and 𝑁(𝑥𝑘) (𝑘 = 1, 2, …, 𝑚) denotes the set of cities that ant 𝑘 has not yet visited. Furthermore, 𝛼 and 𝛽 are positive real parameters whose values determine the relative importance of pheromone versus heuristic information. When all ants have completed a cycle, they update the pheromone according to the formula
$$\tau_{ij}(t) \longleftarrow (1-\rho)\,\tau_{ij}(t) + \Delta\tau_{ij}, \qquad \Delta\tau_{ij} = \sum_{k=1}^{m}\Delta\tau_{ij}^{k}, \qquad \Delta\tau_{ij}^{k} = \begin{cases} Q/L_k & \text{if ant } k \text{ used edge } (i,j) \text{ in its tour},\\ 0 & \text{otherwise,}\end{cases}$$
where 𝜌 ∈ (0, 1] is the pheromone evaporation rate, 𝑄 is a constant that represents the total amount of pheromone released once by an ant, and 𝐿𝑘 is the tour length of the 𝑘th ant.
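To make the transition rule concrete, the following minimal MATLAB sketch shows how an ant currently at city 𝑖 could choose the next city among the unvisited ones with probability proportional to (𝜏𝑖𝑗)^𝛼 (𝜂𝑖𝑗)^𝛽. The names d, tau, eta and visited, as well as the random coordinates, are illustrative assumptions and are separate from the listing that follows.
% Minimal sketch of one ACO construction step (illustrative names and data)
n = 5; alpha = 1; beta = 2;                  % number of cities and rule exponents
xy = rand(n,2);                              % random city coordinates (example data)
d  = sqrt((xy(:,1)-xy(:,1)').^2 + (xy(:,2)-xy(:,2)').^2);  % distance matrix d_ij
d(1:n+1:end) = inf;                          % exclude self-loops
tau = ones(n,n);                             % initial pheromone tau_ij(0) = C, with C = 1
eta = 1./d;                                  % heuristic information eta_ij = 1/d_ij
i = 1; visited = i;                          % current city and the ant's memory
unvisited = setdiff(1:n, visited);           % N(x_k): cities not yet visited
p = (tau(i,unvisited).^alpha).*(eta(i,unvisited).^beta);
p = p/sum(p);                                % transition probabilities p_ij^k
j = unvisited(find(rand <= cumsum(p), 1, 'first'));  % roulette-wheel choice of the next city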
clear all, clc,
%LB=20*[-1 -1 -1]; UB=20*[1 1 1]; nvars=size(LB,2);
%f=@(x)2*x(1)^2-3*x(2)^2+4*x(3)^2+2; % Ant-cost
LB=20*[-1 -1]; UB=20*[1 1]; nvars=size(LB,2);
f=@(x)3*sin(x(1))+exp(x(2));
MaxTour=100; % Number of Tours
piece=500; % Number of pieces (cities)
max_assign=50; % MaxValue of assign
ants=50; % Number of Ants
poz_ph=0.5; % Positive pheromone coefficient
neg_ph=0.2; % Negative pheromone coefficient
lambda=0.95; % EvaporationParameter
ph=0.05; % Pheromone
pher=ones(piece,nvars);
indis=zeros(ants,nvars);
costs=zeros(ants,1);
cost_general=zeros(max_assign,(nvars+1));
deger=zeros(piece,nvars); deger(1,:)=LB;
for i=2:piece
for j=1:nvars
deger(i,j)=deger(i-1,j) + (UB(j)-LB(j))/(piece-1);
end
end
assign=0;
while (assign<max_assign)
for i=1:ants % select a value (piece) for each variable
prob = pher.*rand(piece,nvars);
for j=1:nvars
indis(i,j) = find(prob(:,j) == max(prob(:,j)),1,'first');
end
temp=zeros(1,nvars);
for j=1:nvars
temp(j)=deger(indis(i,j),j);
end
costs(i) = f(temp); % LOCAL UPDATING
deltalocal = zeros(piece,nvars);
% Create the matrix containing the pheromone deposited for local updating
for j=1:nvars
deltalocal(indis(i,j),j)=(poz_ph*ph/(costs(i)));
end
pher = pher + deltalocal;
end
best_ant= min(find(costs==min(costs)));
worst_ant = min(find(costs==max(costs)));
deltapos = zeros(piece,nvars);
deltaneg = zeros(piece,nvars);
for j=1:nvars
deltapos(indis(best_ant,j),j)=(ph/(costs(best_ant)));
% UPDATING PHER OF nvars
deltaneg(indis(worst_ant,j),j)=-(neg_ph*ph/(costs(worst_ant)));
% NEGATIVE UPDATING PHER OF worst path
end
delta = deltapos + deltaneg;
pher = pher.^lambda + delta;
assign=assign + 1; % Update general cost matrix
for j=1:nvars
cost_general (assign,j)=deger(indis(best_ant,j),j);
end
cost_general (assign,nvars+1)=costs(best_ant);
xlabel Tour
title('Change in Cost Value. Red: Means, Blue: Best')
hold on
plot(assign, mean(costs), '.r');
plot(assign, costs(best_ant), '.b');
end
list_cost=sortrows(cost_general,nvars+1);  % sort recorded tours by cost
for j=1:nvars
x(j)=list_cost(1,j);
end
x1=x', fmin=f(x1)   % best point found and its (minimum) cost
The Firefly Algorithm (FA) was developed by
Xin-She Yang (Yang 2008) and is based on the flashing patterns and behavior of
fireflies. In essence, FA uses the following three idealized rules:
⦁ Fireflies are unisex (one firefly will be attracted to other fireflies regardless of their sex)
⦁ The attractiveness is proportional to the brightness and both decrease as the distance
between two fireflies increases. Thus for any two flashing fireflies, the brighter firefly
will attract the other one. If neither one is brighter, then a random move is performed.
⦁ The brightness of a firefly is determined by the landscape of the objective function.
The position of firefly 𝑖 is updated by moving it toward a brighter firefly 𝑗 according to
$$\mathbf{x}_i^{k+1} = \mathbf{x}_i^{k} + \beta_0\, e^{-\gamma r_{ij}^2}\left(\mathbf{x}_j^{k} - \mathbf{x}_i^{k}\right) + \alpha\,\mathbf{e}_i^{k},$$
where 𝛾 is the light absorption coefficient, which can be in the range [0.01, 100], and 𝑟𝑖𝑗 is the distance between the two fireflies. The second term, $\beta_0 e^{-\gamma r_{ij}^2}(\mathbf{x}_j^k - \mathbf{x}_i^k)$, is due to the attraction. The third term, $\alpha\,\mathbf{e}_i^k$, is a randomization, with 𝛼 being the randomization parameter and $\mathbf{e}_i^k$ a vector of random numbers drawn from a Gaussian or uniform distribution at step 𝑘. If 𝛽0 = 0, the update becomes a simple random walk. Furthermore, the randomization $\mathbf{e}_i^k$ can easily be extended to other distributions such as Lévy flights.
clear all, clc, c1=0.8; c2=0.7; gama=20;
itermax=50; xmin=10*[-2 -2]; xmax=10*[2 2];
n=50; m=2; % n = number of fireflies and m = number of variables
rand('state',0); % v=zeros(m,n);
for i=1:n
for j=1:m
x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
end
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest));
for iter=1:itermax
for i=1:n
for j=1:i
r= norm(x(:,j)-x(:,i));
x(:,i)=x(:,i)+c2*(exp(-gama*r^2))*(x(:,j)-x(:,i))+c1*(randn-0.5);
end
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal point is (%3.4f, %3.4f)\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
It can be shown that the limiting case 𝛾 → 0 corresponds to the standard Particle Swarm Optimization (PSO). In fact, if the inner loop (for j) is removed and x(:,j) is replaced by the current global best gbest, then FA essentially becomes the standard PSO.
In computer science and operations research, the artificial bee colony algorithm (ABC) is an optimization algorithm based on the intelligent foraging behavior of honey bee swarms, proposed by Derviş Karaboğa (Erciyes University) in 2005. In the ABC model, the colony consists of three groups of bees: employed bees, onlookers, and scouts. It is assumed that there is only one artificial employed bee for each food source; in other words, the number of employed bees in the colony is equal to the number of food sources around the hive. Employed bees go to their food source, come back to the hive, and dance in this area. An employed bee whose food source has been abandoned becomes a scout and starts to search for a new food source. Onlookers watch the dances of employed bees and choose food sources depending on the dances.
Notes: employed bees are associated with specific food sources, onlooker bees watch the dance of employed bees within the hive to choose a food source, and scout bees search for food sources randomly. Both onlookers and scouts are also called unemployed bees.
Initialization Phase: All the vectors of the population of food sources, 𝐱𝑘, are initialized by scout bees and the control parameters are set. Since each food source 𝐱𝑘 is a solution vector to the optimization problem, each vector 𝐱𝑘 holds 𝑛 variables, (𝐱𝑘(𝑖), 𝑖 = 1, …, 𝑛), which are to be optimized so as to minimize the objective function. The initialization can be written as
$$\mathbf{x}_k(i) = l_i + \mathrm{rand}(0,1)\times(u_i - l_i),$$
where 𝒍𝑖 and 𝒖𝑖 are the lower and upper bounds of the parameter 𝐱𝑘(𝑖), respectively.
Employed Bees Phase: Employed bees search for new food sources (𝐯𝑘) having more nectar within the neighbourhood of the food source (𝐱𝑘) in their memory. A commonly used neighbourhood search (following Karaboğa's formulation) is
$$\mathbf{v}_k(i) = \mathbf{x}_k(i) + \varphi_k(i)\left(\mathbf{x}_k(i) - \mathbf{x}_j(i)\right),$$
where 𝐱𝑗 is a randomly selected food source different from 𝐱𝑘, 𝑖 is a randomly chosen variable index, and 𝜑𝑘(𝑖) is a random number in [−𝑎, 𝑎]. The new source 𝐯𝑘 replaces 𝐱𝑘 if it has better fitness (greedy selection).
The fitness value of the solution, fit(𝐱 𝑘 ), might be calculated for minimization problems
using the following formula
$$\mathrm{fit}(\mathbf{x}_k) = \begin{cases} \dfrac{1}{1 + f(\mathbf{x}_k)} & \text{if } f(\mathbf{x}_k) \ge 0,\\[4pt] 1 + \left|f(\mathbf{x}_k)\right| & \text{if } f(\mathbf{x}_k) < 0. \end{cases}$$
Onlooker Bees Phase: Unemployed bees consist of two groups of bees: onlooker bees
and scouts. Employed bees share their food source information with onlooker bees
waiting in the hive and then onlooker bees probabilistically choose their food sources
depending on this information. In ABC, an onlooker bee chooses a food source
depending on the probability values calculated using the fitness values provided by
employed bees. For this purpose, a fitness based selection technique can be used, such
as the roulette wheel selection method (Goldberg, 1989).
The probability value 𝑝𝑘 with which 𝐱 𝑘 is chosen by an onlooker bee can be calculated
by using the expression given in equation
$$p_k = \frac{\mathrm{fit}(\mathbf{x}_k)}{\sum_{m=1}^{N}\mathrm{fit}(\mathbf{x}_m)},$$
where 𝑁 is the number of food sources.
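As a small illustration, the fitness transformation and the onlooker selection probabilities can be computed in MATLAB as follows; the vector fvals and its values are only assumed example data.
% Minimal sketch: fitness values and onlooker selection probabilities
fvals = [2.5 -1.0 0.3 4.2];                     % example objective values f(x_k)
fitv  = zeros(size(fvals));
fitv(fvals >= 0) = 1./(1 + fvals(fvals >= 0));  % 1/(1+f)  for f >= 0
fitv(fvals < 0)  = 1 + abs(fvals(fvals < 0));   % 1 + |f|  for f < 0
p = fitv/sum(fitv);                             % probability p_k for each food source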
Scout Bees Phase: The unemployed bees who choose their food sources randomly are
called scouts. Employed bees whose solutions cannot be improved through a
predetermined number of trials, specified by the user of the ABC algorithm and called
“limit” or “abandonment criteria” herein, become scouts and their solutions are
abandoned. Then, the converted scouts start to search for new solutions, randomly. For
instance, if solution 𝐱 𝑘 has been abandoned, the new solution discovered by the scout
who was the employed bee of 𝐱 𝑘 can be defined by 𝐱 𝑘 (𝑖) = 𝒍𝑖 + rand(0,1) × (𝒖𝑖 − 𝒍𝑖 ). Hence
those sources which are initially poor or have been made poor by exploitation are
abandoned and negative feedback behavior arises to balance the positive feedback.
Exercise: Write a MATLAB code to search for the maximum value of the following objective functions:
𝑓(𝑥, 𝑦) = 3 sin(𝑥) + 𝑒^𝑦,   −5 ≤ 𝑥, 𝑦 ≤ 5
𝑓(𝑥, 𝑦, 𝑧, 𝑤) = 2𝑥² + 3𝑦² + 4𝑧² + 5𝑤² + 10,   −5 ≤ 𝑥, 𝑦, 𝑧, 𝑤 ≤ 5
clc;
clear;
close all;
%% Problem Definition
% CostFunction=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2+5*x(4)^2+10;
f=@(x)3*sin(x(1))+exp(x(2)); % CostFunction (NOTE: this code minimizes the cost;
% to search for the maximum asked in the exercise, minimize -f instead)
nVar=2;            % Number of Decision Variables
VarSize=[1 nVar];  % Decision Variables Matrix Size
VarMin=-5;         % Decision Variables Lower Bound
VarMax= 5;         % Decision Variables Upper Bound
%% ABC Settings
MaxIt=500;                % Maximum Number of Iterations
nPop=500;                 % Population Size (Colony Size)
nOnlooker=nPop;           % Number of Onlooker Bees
L=round(0.6*nVar*nPop);   % Abandonment Limit Parameter (Trial Limit)
a=1;                      % Acceleration Coefficient Upper Bound
%% Initialization
empty_bee.Position=[]; empty_bee.Cost=[];   % Empty Bee Structure
pop=repmat(empty_bee,nPop,1);               % Initialize Population Array
BestSol.Cost=inf;                           % Initialize Best Solution Ever Found
for i=1:nPop   % initial food sources (loop restored; missing from the original listing)
    pop(i).Position=VarMin+rand(VarSize).*(VarMax-VarMin);
    pop(i).Cost=f(pop(i).Position);
    if pop(i).Cost<=BestSol.Cost
        BestSol=pop(i);
    end
end
C=zeros(nPop,1);          % Abandonment (trial) counters
BestCost=zeros(MaxIt,1);  % Best cost history
%% ABC Main Loop (restored; missing from the original listing)
for it=1:MaxIt
    % Employed Bees Phase: v_k(i) = x_k(i) + phi*(x_k(i) - x_j(i))
    for i=1:nPop
        K=[1:i-1 i+1:nPop]; k=K(randi(numel(K)));   % random partner, different from i
        phi=a*(2*rand(VarSize)-1);                  % phi in [-a, a]
        newbee.Position=pop(i).Position+phi.*(pop(i).Position-pop(k).Position);
        newbee.Position=max(min(newbee.Position,VarMax),VarMin);  % keep within bounds
        newbee.Cost=f(newbee.Position);
        if newbee.Cost<=pop(i).Cost                 % greedy selection
            pop(i)=newbee;
        else
            C(i)=C(i)+1;
        end
    end
    % Calculate Fitness Values and Selection Probabilities
    MeanCost=mean([pop.Cost]);
    F=zeros(nPop,1);
    for i=1:nPop
        F(i) = exp(-pop(i).Cost/MeanCost); % Convert Cost to Fitness
    end
    P=F/sum(F);
    % Onlooker Bees Phase
    for m=1:nOnlooker
        %-----------------------------------------------
        % Select Source Site by Roulette Wheel Selection
        %-----------------------------------------------
        r=rand;
        CS=cumsum(P);              % (renamed: C holds the abandonment counters)
        i=find(r<=CS,1,'first');
        % Neighbourhood search around the selected source (restored)
        K=[1:i-1 i+1:nPop]; k=K(randi(numel(K)));
        phi=a*(2*rand(VarSize)-1);
        newbee.Position=pop(i).Position+phi.*(pop(i).Position-pop(k).Position);
        newbee.Position=max(min(newbee.Position,VarMax),VarMin);
        newbee.Cost=f(newbee.Position);
        %-----------------------------------------------
        % Comparison
        if newbee.Cost<=pop(i).Cost
            pop(i)=newbee;
        else
            C(i)=C(i)+1;
        end
    end
    % Scout Bees (Scout Bees Phase)
    for i=1:nPop
        if C(i)>=L
            pop(i).Position=VarMin+rand(VarSize).*(VarMax-VarMin); % new random source within the bounds
            pop(i).Cost=f(pop(i).Position);
            C(i)=0;
        end
    end
    % Update Best Solution Ever Found
    for i=1:nPop
        if pop(i).Cost<=BestSol.Cost
            BestSol=pop(i);
        end
    end
    BestCost(it)=BestSol.Cost; % Store Best Cost Ever Found
end
%% Results
BestSol
figure;
plot(BestCost,'LineWidth',2);
% semilogy(BestCost,'LineWidth',2);   % use instead of plot when the costs are positive
xlabel('Iteration'); ylabel('Best Cost');
grid on;
Bacteria Foraging Optimization Algorithm (BFOA), proposed by Passino, is a newcomer to the family of nature-inspired optimization algorithms. For over the last five decades, optimization algorithms like Genetic Algorithms (GAs), Evolutionary Programming (EP), and Evolutionary Strategies (ES), which draw their inspiration from evolution and natural genetics, have been dominating the realm of optimization algorithms. Recently, nature-inspired swarm algorithms like Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) have found their way into this domain and proved their effectiveness. Following the same trend of swarm-based algorithms, Passino proposed the BFOA. The key idea of the new algorithm is the application of the group foraging strategy of a swarm of E. coli bacteria to multi-optimal function optimization. Bacteria search for nutrients in a manner that maximizes the energy obtained per unit time, and an individual bacterium also communicates with others by sending signals. A bacterium takes foraging decisions after considering these two factors. The process in which a bacterium moves by taking small steps while searching for nutrients is called chemotaxis, and the key idea of BFOA is mimicking the chemotactic movement of virtual bacteria in the problem search space.
Now suppose that we want to find the minimum of the cost function 𝑱(𝜽) where 𝜽 ∈ ℜ𝑝
(i.e. 𝜽 is a 𝑝-dimensional vector of real numbers), and we do not have measurements or
an analytical description of the gradient ∇𝑱(𝜽). BFOA mimics the four principal
mechanisms observed in a real bacterial system: chemotaxis, swarming,
reproduction, and elimination-dispersal to solve this non-gradient optimization
problem. A virtual bacterium is actually one trial solution (may be called a search-
agent) that moves on the functional surface (see Figure above) to locate the global
optimum.
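To illustrate the chemotaxis mechanism on its own, the following minimal MATLAB sketch performs one tumble-and-swim step for a single bacterium. The cost function Jfun, the step size Cstep and the swim length Ns are illustrative assumptions; this is not Passino's full algorithm, which adds swarming, reproduction, and elimination-dispersal.
% Minimal sketch of one chemotactic step (tumble, then swim) for one bacterium
Jfun  = @(theta) sum(theta.^2);   % hypothetical cost function J(theta)
p     = 2;                        % problem dimension
theta = randn(p,1);               % current position of the bacterium
Cstep = 0.1;                      % chemotactic step size C(i)
Ns    = 4;                        % maximum number of swim steps
Jlast = Jfun(theta);
delta = 2*rand(p,1)-1;            % random tumble direction
phi   = delta/norm(delta);        % unit direction vector
theta = theta + Cstep*phi;        % tumble
m = 0;
while m < Ns && Jfun(theta) < Jlast   % swim while the cost keeps improving
    Jlast = Jfun(theta);
    theta = theta + Cstep*phi;
    m = m + 1;
end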
Flow diagram illustrating the bacterial foraging optimization algorithm
Generic algorithm of BFO
% Reproduction (excerpt from a full BFOA implementation; the arrays J, P, c and
% the counters K, ell, Nc, Ns, Nre, Ned, S are defined earlier in that code)
Jhealth=sum(J(:,:,K,ell),2);        % health of each of the S bacteria
[Jhealth,sortind]=sort(Jhealth);    % sort by nutrient concentration
P(:,:,1,K+1,ell)=P(:,sortind,Nc+1,K,ell);
c(:,K+1)=c(sortind,K);  % keep the chemotaxis parameters with each bacterium at the next generation
% Report
reproduction = J(:,[1:Ns,Nre,Ned]);
[jlastreproduction,O]=min(reproduction,[],2);  % minimum cost for each bacterium
[Y,I] = min(jlastreproduction)
pbest=P(:,I,O(I,:),K,ell)
plot([1:S],jlastreproduction)       % S = number of bacteria
xlabel('Iteration'), ylabel('Function')
The GWO algorithm, proposed by Mirjalili et al. in 2014, mimics the leadership hierarchy and hunting mechanism of grey wolves in nature. Four types of grey wolves, namely alpha, beta, delta, and omega, are employed for simulating the leadership hierarchy. In addition, three main steps of hunting, searching for prey, encircling prey, and attacking prey, are implemented to perform optimization.
Mathematical model: The hunting technique and the social hierarchy of grey wolves are mathematically modeled in order to design GWO and perform optimization. The proposed mathematical models of the social hierarchy, tracking, encircling, and attacking prey are as follows:
■ Encircling prey: As mentioned above, grey wolves encircle the prey during the hunt. In order to mathematically model the encircling behavior, the following equations are proposed:
$$\vec{X}(t+1) = \vec{X}_p(t) - \vec{A}\cdot\vec{D}, \qquad \vec{D} = \left|\vec{C}\cdot\vec{X}_p(t) - \vec{X}(t)\right|,$$
where 𝑡 indicates the current iteration, $\vec{A}$ and $\vec{C}$ are coefficient vectors, $\vec{X}_p(t)$ is the position vector of the prey, and $\vec{X}(t)$ indicates the position vector of a grey wolf. The vectors $\vec{A}$ and $\vec{C}$ are calculated as follows:
$$\vec{A} = 2\vec{a}\cdot\vec{r}_1 - \vec{a}, \qquad \vec{C} = 2\vec{r}_2,$$
where the components of $\vec{a}$ are linearly decreased from 2 to 0 over the course of the iterations and $\vec{r}_1$, $\vec{r}_2$ are random vectors in [0, 1].
■ Hunting: Grey wolves have the ability to recognize the location of prey and encircle them. The hunt is usually guided by the alpha; the beta and delta might also participate in hunting occasionally. However, in an abstract search space we have no idea about the location of the optimum (prey). In order to simulate the hunting behavior of grey wolves mathematically, we suppose that the alpha (the best candidate solution), beta, and delta have better knowledge about the potential location of the prey. Therefore, we save the first three best solutions obtained so far and oblige the other search agents (including the omegas) to update their positions according to the positions of these best search agents. The following formulas are proposed in this regard:
$$\vec{D}_\alpha = \left|\vec{C}_1\cdot\vec{X}_\alpha - \vec{X}\right|, \qquad \vec{D}_\beta = \left|\vec{C}_2\cdot\vec{X}_\beta - \vec{X}\right|, \qquad \vec{D}_\delta = \left|\vec{C}_3\cdot\vec{X}_\delta - \vec{X}\right|,$$
$$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1\cdot\vec{D}_\alpha, \qquad \vec{X}_2 = \vec{X}_\beta - \vec{A}_2\cdot\vec{D}_\beta, \qquad \vec{X}_3 = \vec{X}_\delta - \vec{A}_3\cdot\vec{D}_\delta,$$
$$\vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}.$$
With these equations, a search agent updates its position according to alpha, beta, and delta in an 𝑛-dimensional search space. In addition, the final position will lie at a random place within a circle defined by the positions of alpha, beta, and delta in the search space. In other words, alpha, beta, and delta estimate the position of the prey, and the other wolves update their positions randomly around the prey.
SearchAgents_no=20;   % number of wolves (search agents)
Max_iter=200;         % maximum number of iterations
dim=4;                % number of variables
lb=-0.25*ones(1,dim); ub=0.25*ones(1,dim);   % lower and upper bounds
%fobj=@(x)(x(1)-1)^2+(x(2)-2)^2+(x(3)-3)^2+(x(4)-4)^2+(x(5)-5)^2;
%fobj=@(x)3*sin(x(1))+exp(x(2)); dim=2;
fobj=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2+5*x(4)^2+10;
%---------------------------------------------------------------------%
% Initialize the positions of search agents
%---------------------------------------------------------------------%
Boundary_no=size(ub,2);   % number of boundaries
if Boundary_no==1
    % the user entered a single number for both ub and lb
    Positions=rand(SearchAgents_no,dim).*(ub-lb)+lb;
else
    % each variable has its own lower and upper bound
    for j=1:dim
        Positions(:,j)=rand(SearchAgents_no,1).*(ub(j)-lb(j))+lb(j);
    end
end
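%---------------------------------------------------------------------%
% Main GWO loop -- a minimal sketch reconstructed from the encircling
% and hunting equations above; it is not part of the original listing.
%---------------------------------------------------------------------%
Alpha_pos=zeros(1,dim); Alpha_score=inf;   % best solution found so far
Beta_pos=zeros(1,dim);  Beta_score=inf;    % second best
Delta_pos=zeros(1,dim); Delta_score=inf;   % third best
Convergence_curve=zeros(1,Max_iter);
for t=1:Max_iter
    for i=1:SearchAgents_no
        % return agents that go beyond the boundaries of the search space
        Positions(i,:)=min(max(Positions(i,:),lb),ub);
        fitness=fobj(Positions(i,:));
        % update alpha, beta and delta
        if fitness<Alpha_score
            Alpha_score=fitness; Alpha_pos=Positions(i,:);
        elseif fitness<Beta_score
            Beta_score=fitness; Beta_pos=Positions(i,:);
        elseif fitness<Delta_score
            Delta_score=fitness; Delta_pos=Positions(i,:);
        end
    end
    aa=2-t*(2/Max_iter);        % a decreases linearly from 2 to 0
    for i=1:SearchAgents_no
        for j=1:dim
            A1=2*aa*rand-aa; C1=2*rand;
            X1=Alpha_pos(j)-A1*abs(C1*Alpha_pos(j)-Positions(i,j));
            A2=2*aa*rand-aa; C2=2*rand;
            X2=Beta_pos(j)-A2*abs(C2*Beta_pos(j)-Positions(i,j));
            A3=2*aa*rand-aa; C3=2*rand;
            X3=Delta_pos(j)-A3*abs(C3*Delta_pos(j)-Positions(i,j));
            Positions(i,j)=(X1+X2+X3)/3;   % move toward alpha, beta and delta
        end
    end
    Convergence_curve(t)=Alpha_score;
end
Alpha_pos, Alpha_score     % best point and best objective value found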
% Plot a 2-D slice of the objective (remaining variables fixed at 0)
x=-3:0.5:3; y=-3:0.5:3;
L=length(x);
f=zeros(L,L);
for i=1:L
    for j=1:L
        f(i,j)=fobj([x(i),y(j),zeros(1,dim-2)]);
    end
end
surfc(x,y,f,'LineStyle','none');
%-------------------------------------------------------------------------------------------------------%
Applications of Swarm Intelligence: Swarm Intelligence-based techniques can be
used in a number of applications. The U.S. military is investigating swarm techniques
for controlling unmanned vehicles. The European Space Agency is thinking about an
orbital swarm for self-assembly and interferometry. NASA is investigating the use of
swarm technology for planetary mapping. A 1992 paper by M. Anthony Lewis and
George A. Bekey discusses the possibility of using swarm intelligence to control
nanobots within the body for the purpose of killing cancer tumors. Conversely, al-Rifaie and Aber have used stochastic diffusion search to help locate tumours. Swarm intelligence has also been applied to data mining, and ant-based models are a further subject of modern management theory.
%-------------------------------------------------------------------------------------------------------%
CVX: is a MATLAB-based modeling system for convex optimization. It was created by
Michael Grant and Stephen Boyd. This MATLAB package is in fact an interface to other
convex optimization solvers such as SeDuMi and SDPT3. We will explore here some of
the basic features of the software, but a more comprehensive and complete guide can
be found at the CVX website (CVXr.com). The basic structure of a CVX program is as
follows:
cvx_begin
{variables declaration}
minimize({objective function}) or maximize({objective function})
subject to
{constraints}
cvx_end
CVX accepts only convex functions as objective and constraint functions. There are
several basic convex functions, called “atoms,” which are embedded in CVX.
Example: Suppose that we wish to solve the least squares problem of minimizing ‖𝐀𝐱 − 𝐛‖₂ for given 𝐀 and 𝐛.
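A minimal CVX script for this problem might look as follows; the dimensions and the data A and b are generated randomly here purely for illustration.
m = 20; n = 8;                    % problem dimensions (illustrative)
A = randn(m,n); b = randn(m,1);   % random data for the example
cvx_begin
    variable x(n)
    minimize( norm(A*x - b) )     % least squares objective ||Ax - b||_2
cvx_end
x                                 % display the computed solution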
Example: Suppose that we wish to write a CVX code that solves the convex
optimization problem
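The problem referred to here is not reproduced; as an illustration only, the following hypothetical convex program (a quadratic objective with a linear equality and nonnegativity constraints) shows how an objective and constraints are written between cvx_begin and cvx_end.
% Hypothetical convex program used only to illustrate the constraint syntax
n = 4;
P = eye(n); q = ones(n,1);              % illustrative problem data
cvx_begin
    variable x(n)
    minimize( quad_form(x,P) + q'*x )   % convex quadratic objective
    subject to
        sum(x) == 1;                    % linear equality constraint
        x >= 0;                         % elementwise nonnegativity
cvx_end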
Example: Let us use an example to illustrate how a metaheuristic works. The design of a compression/tension spring involves three design variables: the wire diameter 𝑥1, the coil diameter 𝑥2, and the number of active coils 𝑥3. This optimization problem can be written as
minimize 𝑓(𝐱) = 𝑥1² 𝑥2 (2 + 𝑥3),