Stochastic Search Optimization
Second edition
Kurt Marti
Stochastic Optimization
Methods
Second edition
Univ. Prof. Dr. Kurt Marti
Federal Armed Forces
University Munich
Department of Aerospace Engineering and Technology
85577 Neubiberg/München
Germany
[email protected]
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations are
liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Preface
The major change in the second edition of this book is the inclusion of a new,
extended version of Chap. 7 on the approximate computation of probabilities
of survival and failure of technical, economic structures and systems. Based on
a representation of the state function of the structure/system by the minimum
value of a certain optimization problem, approximations of the probability of
survival/failure are obtained by means of polyhedral approximation of the so-
called survival domain, certain probability inequalities and discretizations of
the underlying probability distribution of the random model parameters. All
the other chapters have been updated by including some more recent material
and by correcting some typing errors.
The author again thanks Ms. Elisabeth Lößl for the excellent LaTeX-
typesetting of all revisions and completions of the text. Moreover, I thank Dr.
Müller, Vice President Business/Economics and Statistics, Springer-Verlag,
for making it possible to publish a second edition of “Stochastic Optimization
Methods”.
1
Decision/Control Under Stochastic Uncertainty
1.1 Introduction
Here, the objective (goal) function f0 = f0(a, x) and the constraint functions
fi = fi(a, x), i = 1, . . . , mf, gi = gi(a, x), i = 1, . . . , mg, defined on
a joint subset of IRν × IRr , depend on a decision, control or input vector
x = (x1 , x2 , . . . , xr )T and a vector a = (a1 , a2 , . . . , aν )T of model parame-
ters. Typical model parameters in technical applications, operations research
and economics are material parameters, external load parameters, cost factors,
technological parameters in input–output operators, demand factors. Further-
more, manufacturing and modeling errors, disturbances or noise factors, etc.,
may occur. Frequent decision, control or input variables are material, topolog-
ical, geometrical and cross-sectional design variables in structural optimiza-
tion [62], forces and moments in optimal control of dynamic systems and
factors of production in operations research and economic design.
The objective function (1.1a) to be optimized describes the aim, the goal
of the modelled optimal decision/design problem or the performance of a
technical, economic system or process to be controlled optimally. Further-
more, the constraints (1.1b)–(1.1d) represent the operating conditions guar-
anteeing a safe structure, a correct functioning of the underlying system,
process, etc. Note that the constraint (1.1d) with a given, fixed convex subset
Due to the deviation of the actual parameter vector a from the nominal
vector a0 of model parameters, deviations of the actual state, trajectory
or performance of the system from the prescribed state, trajectory, goal
values occur.
(2) Compensation: The deviation of the actual state, trajectory or perfor-
mance of the system from the prescribed values/functions is compensated
then by online measurement and correction actions (decisions or controls).
Consequently, in general, increasing measurement and correction expenses
result in course of time.
Considerable improvements of this standard procedure can be obtained
by taking into account already at the planning stage, i.e., offline, the mostly
available a priori (e.g., the type of random variability) and sample information
about the parameter vector a. Indeed, based, e.g., on some structural in-
sight, or by parameter identification methods, regression techniques, cal-
ibration methods, etc., in most cases information about the vector a of
model/structural parameters can be extracted. Repeating this information
gathering procedure at some later time points tj > t0 (= initial time point),
j = 1, 2, . . . , adaptive decision/control procedures occur [101].
Based on the inherent random nature of the parameter vector a, the ob-
servation or measurement mechanism, resp., or adopting a Bayesian approach
concerning unknown parameter values [10], here we make the following basic
assumption:
Stochastic Uncertainty: The unknown parameter vector a is a realization
a = a(ω), ω ∈ Ω, (1.3)
where the (conditional) expectation “E” is taken with respect to the time
history A = At , (Aj ) ⊂ A0 up to a certain time point t or stage j. A
short definition of expectations is given in Sect. 2.1, for more details, see, e.g.,
[7, 43, 124].
Having different expected cost or performance functions F0 , F1 , . . . , Fm
to be minimized or bounded, as a basic deterministic substitute problem for
(1.1a)–(1.1d) with a random parameter vector a = a(ω) we may consider the
multi-objective expected cost minimization problem
“min”F(x) (1.6a)
s.t.
x ∈ D0 . (1.6b)
and
Fi0 (x) < Fi0 (x0 ) for at least one i0 , 0 ≤ i0 ≤ m. (1.7b)
s.t.
where
f(a, x) := ∑_{i=0}^{m} c_i γ_i(e(a, x)). (1.10b)
s.t.
x ∈ D0 . (1.11b)
s.t.
x ∈ D0 . (1.13b)
Remark 1.3 Let c_i, i = 0, 1, . . . , m, be any positive weight factors. An
optimal solution x^* of (1.13a), (1.13b) is a weak Pareto optimal solution
of (1.6a), (1.6b).
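For a concrete impression of the scalarization described in Remark 1.3, the following minimal Python sketch approximates expected cost functions F_i(x) = Eγ_i(e(a(ω), x)) by Monte Carlo sampling and minimizes their weighted sum over a box-shaped feasible set; the cost functions, the distribution of a(ω) and the weights c_i are purely hypothetical choices made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
a_samples = rng.normal(loc=1.0, scale=0.2, size=(5000, 2))   # samples of a(omega)

def gamma(i, e):
    # hypothetical convex cost functions gamma_i
    return e**2 if i == 0 else np.maximum(e, 0.0)

def e_state(a, x):
    # hypothetical deviation function e(a, x)
    return a[:, 0] * x[0] + a[:, 1] * x[1] - 1.0

def F(i, x):
    # Monte Carlo estimate of F_i(x) = E gamma_i(e(a(omega), x))
    return gamma(i, e_state(a_samples, x)).mean()

c = [1.0, 0.5]                                   # positive weights c_i
scalarized = lambda x: sum(ci * F(i, x) for i, ci in enumerate(c))
res = minimize(scalarized, x0=[0.5, 0.5], bounds=[(0.0, 2.0)] * 2)   # D_0 = box
print(res.x, res.fun)   # weakly Pareto optimal point of (1.6a), (1.6b), cf. Remark 1.3
```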
Instead of taking expectations, we may consider the worst case with respect to
the cost variations caused by the random parameter vector a = a(ω). Hence,
the random cost function
ω ↦ γ_i(e(a(ω), x)) (1.14a)
is evaluated by means of
F_i^{sup}(x) := ess sup γ_i(e(a(ω), x)), i = 0, 1, . . . , m. (1.14b)
Here, ess sup (. . .) denotes the (conditional) essential supremum with respect
to the random vector a = a(ω), given information A, i.e., the infimum of the
supremum of (1.14a) on sets A ∈ A0 of (conditional) probability one, see,
e.g., [124].
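Assuming, for illustration only, that the conditional essential supremum in (1.14b) may be approximated by the maximum over a large sample of a(ω) (which for unbounded cost distributions yields only a lower estimate), a minimal sketch reads:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.lognormal(mean=0.0, sigma=0.3, size=100000)   # samples of a scalar a(omega)

def gamma0(e):
    return e**2                                        # hypothetical cost function

def F0_sup(x):
    # sample-based surrogate for ess sup gamma_0(e(a(omega), x));
    # for unbounded distributions this gives only a lower estimate
    return gamma0(a * x - 1.0).max()

print(F0_sup(0.8))
```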
Consequently, the vector function F = F^{sup}(x) is then defined by
F^{sup}(x) = (F_0(x), F_1(x), . . . , F_m(x))^T := (ess sup γ_0(e(a(ω), x)), ess sup γ_1(e(a(ω), x)), . . . , ess sup γ_m(e(a(ω), x)))^T. (1.15)
Working with the vector function F = Fsup (x), we have then the vector
minimization problem
“min”Fsup (x) (1.16a)
s.t.
x ∈ D0 . (1.16b)
By scalarization of (1.16a), (1.16b) we obtain then again deterministic
substitute problems for (1.1a)–(1.1d) related to the substitute problem (1.6a),
(1.6b) introduced in Sect. 1.2.1.
More details for the selection and solution of appropriate deterministic
substitute problems for (1.1a)–(1.1d) are given in the next sections.
2
Deterministic Substitute Problems in Optimal Decision Under Stochastic Uncertainty
x ∈ D, (2.1c)
a = (p, P) (2.1d)
a = (A, b, c) (2.1e)
yi (a, x) ≤ 0 (2.2c)
design the limit state functions are determined by the extreme points of the
admissible domain of the dual pair of static/kinematic LPs related to the
equilibrium and linearized convex yield condition, see Sect. 7.
s.t.
G0 (a, x) ≤ Gmax (2.4b)
x ∈ D. (2.4c)
In (2.4a) γ = γ(y) is a scalar or vector valued cost/loss function evaluating
violations of the operating conditions (2.3b). Depending on the application,
these costs are called “failure” or “recourse” costs [58, 59, 98, 120, 137, 138].
As already discussed in Sect. 1, solving problems of the above type, a basic
difficulty is the uncertainty about the true value of the vector a of model
parameters or the (random) variability of a:
In practice, due to several types of uncertainties such as, see [154]:
– Physical uncertainty (variability of physical quantities, like material, loads,
dimensions, etc.)
– Economic uncertainty (trade, demand, costs, etc.)
– Statistical uncertainty (e.g., estimation errors of parameters due to limited
sample data)
– Model uncertainty (model errors)
the ν-vector a of model parameters must be modeled by a random vector
a = a(ω), ω ∈ Ω, (2.5a)
on a certain probability space (Ω, A0 , P) with sample space Ω having elements
ω, see (1.3). For the mathematical representation of the corresponding (con-
A
ditional) probability distribution Pa(·) = Pa(·) of the random vector a = a(ω)
of (measurable) functions
(h ◦ a)(ω) := h a(ω) (2.6b)
Due to the stochastic variability of the random vector a(·) of model pa-
rameters, and since the realization a(ω) = a is not available at the decision
making stage, the optimal design problem (2.3a)–(2.3c) or (2.4a)–(2.4c) under
stochastic uncertainty cannot be solved directly.
Hence, appropriate deterministic substitute problems must be chosen tak-
ing into account the randomness of a = a(ω), cf. Sect. 1.2.
s.t.
x ∈ D. (2.9b)
Here,
p_f = p_f(x) := P(y(a(ω), x) ∉ B) (2.9c)
is the probability of failure or the probability that a safe function of the
structure, the system is not guaranteed. Furthermore, cG is a certain weight
factor, and cf > 0 describes the failure or recourse costs. In the present
definition of expected failure costs, constant costs for each realization a = a(ω)
of a(·) are assumed. Obviously, it is
s.t.
s.t.
EG0 a(ω), x ≤ Gmax (2.11b)
x∈D (2.11c)
where the right hand side of (2.13b) is obviously an expected cost function of
type (2.12a)–(2.12c). Hence, the condition (2.10b) can be guaranteed by the
cost constraint
Eγ y a(ω), x ≤ γ0 αmax . (2.13c)
Example 2.3. If the loss function γ(y) is defined by a vector of individual loss
functions γi for each state function yi = yi (a, x), i = 1, . . . , my , hence,
γ(y) = (γ_1(y_1), . . . , γ_{m_y}(y_{m_y}))^T, (2.14a)
then
Γ(x) = (Γ_1(x), . . . , Γ_{m_y}(x))^T, Γ_i(x) := Eγ_i(y_i(a(ω), x)), 1 ≤ i ≤ m_y, (2.14b)
Working with the more general expected failure or recourse cost functions
Γ = Γ (x), instead of (2.9a)–(2.9c), (2.10a)–(2.10c) and (2.11a)–(2.11c) we
have the related substitute problems:
(1) Expected total cost minimization
min cG EG0 a(ω), x + cTf Γ (x), (2.15a)
s.t.
x∈D (2.15b)
(2) Expected primary cost minimization under expected failure or recourse cost
constraints
min EG0 a(ω), x (2.16a)
s.t.
x ∈ D, (2.16c)
(3) Expected failure or recourse cost minimization under expected primary cost
constraints
“min”Γ (x) (2.17a)
s.t.
EG0 a(ω), x ≤ Gmax (2.17b)
x ∈ D. (2.17c)
Here, cG , cf are (vectorial) weight coefficients, Γ max is the vector of upper loss
bounds, and “min” indicates again that Γ (x) may be a vector valued function.
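A minimal numerical sketch of the expected total cost minimization (2.15a), (2.15b) by sample average approximation is given below; the primary cost G_0, the state function y, the recourse cost γ and the weights c_G, c_f are hypothetical choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
a = rng.normal(1.0, 0.15, size=(4000, 2))        # random model parameters a(omega)

def G0(a, x):            # hypothetical primary (e.g., volume/weight) cost
    return x[0] + x[1] + 0.0 * a[:, 0]

def y(a, x):             # hypothetical state function, failure for y <= 0
    return a[:, 0] * x[0] + a[:, 1] * x[1] - 1.0

def gamma(yv):           # recourse/failure cost of constraint violations
    return np.maximum(-yv, 0.0) ** 2

cG, cf = 1.0, 10.0
def total_cost(x):       # sample-average version of (2.15a)
    return cG * G0(a, x).mean() + cf * gamma(y(a, x)).mean()

res = minimize(total_cost, x0=[1.0, 1.0], bounds=[(0.1, 5.0)] * 2)   # x in D
print(res.x, res.fun)
```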
γ(y) := (y − y ∗ )2 . (2.19a)
(2) Smallest-is-best
If the target or nominal value y ∗ is zero, i.e., if the absolute value |y| of
the product quality function y = y(a, x) should be as small as possible,
e.g., a certain product contamination, then the corresponding quality loss
is defined by
γ(y) := y 2 . (2.19b)
(3) Biggest-is-best
If the absolute value |y| of the product quality function y = y(a, x) should
be as large as possible, e.g., the strength of a bar or the yield of a process,
then a possible quality loss is given by
γ(y) := (1/y)². (2.19c)
Obviously, (2.19a) and (2.19b) are convex loss functions. Moreover, since
the quality or response function y = y(a, x) takes only positive or only nega-
tive values y in many practical applications, also (2.19c) yields a convex loss
function in many cases.
Further decision situations may be modeled by choosing more general con-
vex loss functions γ = γ(y).
A high mean product/process level or performance at minimum random
quality variations and low or bounded production/manufacturing costs is
achieved then [109, 113, 114] by minimization of the expected quality loss
function
Γ (x) := Eγ y a(ω), x , (2.19d)
subject to certain constraints for the input x, as, e.g., constraints for the
production/manufacturing costs. Obviously, Γ (x) can also be minimized by
maximizing the negative logarithmic mean loss (sometimes called “signal-to-
noise-ratio” [109]):
SN := − log Γ (x). (2.19e)
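The following short sketch evaluates the three quality loss functions (2.19a)–(2.19c), the expected loss (2.19d) and the signal-to-noise ratio (2.19e) for a hypothetical quality function y = y(a, x) with multiplicative noise:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(0.0, 0.1, size=20000)            # random parameter a(omega)

def y(a, x):                                     # hypothetical quality function
    return x * (1.0 + a)                         # mean level x, noise scaled by x

y_target = 1.0
loss_target   = lambda yv: (yv - y_target) ** 2  # (2.19a), target value y*
loss_smallest = lambda yv: yv ** 2               # (2.19b), smallest-is-best
loss_biggest  = lambda yv: (1.0 / yv) ** 2       # (2.19c), biggest-is-best

def Gamma(x, loss):                              # expected quality loss (2.19d)
    return loss(y(a, x)).mean()

x = 0.95
for name, loss in [("target", loss_target), ("smallest", loss_smallest), ("biggest", loss_biggest)]:
    G = Gamma(x, loss)
    print(name, G, -np.log(G))                   # expected loss and SN ratio (2.19e)
```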
s.t.
EG0 a(ω), x ≤ Gmax (2.20b)
x ∈ D. (2.20c)
or more general
Γ (x) = Eg a(ω), x , x ∈ D0 . (2.21b)
Proof. This property follows [58, 59, 78] directly from the linearity of the ex-
pectation operator.
If g = g(a, x) is defined by g(a, x) := γ y(a, x) , see (2.21a), then the
above theorem yields the following result:
Corollary 2.1. Suppose that γ is convex and Eγ y a(ω), x exists and is
finite for each x ∈ D0 . (a) If x → y a(ω), x is linear a.s., then Γ = Γ (x) is
convex. (b) If x → y a(ω), x is convex a.s., and γ is a convex, monotonous
nondecreasing function, then Γ = Γ (x) is convex.
Proof. Considering the difference quotients ΔΓ/Δx_k, k = 1, . . . , r, of Γ at a
fixed point x0 ∈ D, the assertion follows by means of the mean value theorem, inequality (2.23a) and Lebesgue’s dominated convergence theorem, cf.
[58, 59, 78].
which holds for any convex function γ. Using the mean value theorem, we have
γ(y) = γ(ȳ) + ∇γ(ŷ)^T (y − ȳ), (2.25c)
where ŷ is a point on the line segment between ȳ and y. By means of
(2.25b), (2.25c) we get
0 ≤ Γ(x) − γ(ȳ(x)) ≤ E(‖∇γ(ŷ(a(ω), x))‖ · ‖y(a(ω), x) − ȳ(x)‖). (2.25d)
where
q(x) := E‖y(a(ω), x) − ȳ(x)‖² = tr Q(x) (2.26d)
is the generalized variance, and
Q(x) := cov(y(a(·), x)) (2.26e)
denotes the covariance matrix of the random vector y = y(a(ω), x). Consequently, the expected loss function Γ(x) can be approximated from above by
Γ(x) ≤ γ(ȳ(x)) + ϑ_max q(x) for x ∈ D. (2.26f)
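A minimal sketch of the mean–variance approximation (2.26d)–(2.26f) follows; for the quadratic loss γ(y) = ‖y‖² used here one may take ϑ_max = 1, in which case the bound is in fact exact, and the state function and parameter distribution are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(1.0, 0.2, size=(10000, 2))        # random parameters a(omega)

def y(a, x):                                      # hypothetical vector state function
    return np.column_stack([a[:, 0] * x[0] - 1.0, a[:, 1] * x[1] - 1.0])

def gamma(Y):                                     # convex loss: squared norm
    return (Y ** 2).sum(axis=1)

def bound_terms(x):
    Y = y(a, x)
    y_bar = Y.mean(axis=0)                        # mean state y_bar(x)
    q = np.trace(np.cov(Y, rowvar=False))         # generalized variance q(x) = tr Q(x), cf. (2.26d)
    return (y_bar ** 2).sum(), q                  # gamma(y_bar(x)) and q(x)

x = np.array([1.2, 0.9])
g_mean, q = bound_terms(x)
Gamma_mc = gamma(y(a, x)).mean()                  # direct Monte Carlo value of Gamma(x)
print(Gamma_mc, g_mean + 1.0 * q)                 # bound (2.26f) with theta_max = 1
```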
s.t.
γ(ȳ(x)) + c_0 q(x) ≤ Γ^max (2.28b)
x ∈ D, (2.28c)
where c0 is a scale factor, cf. (2.26f) and (2.27c);
s.t.
EG0 a(ω), x ≤ Gmax (2.29b)
x ∈ D. (2.29c)
Indeed, according to (2.2b), (2.2c) the failure of the structure, the system
can be represented by the condition
According to the definition (2.30a), an upper bound for ȳ^min(x) is given by
ȳ^min(x) ≤ min_{1≤i≤m_y} ȳ_i(x) = min_{1≤i≤m_y} E y_i(a(ω), x).
Further approximations of y^min(a, x) and its moments can be found by using the representation
min(a, b) = ½ (a + b − |a − b|)
of the minimum of two numbers a, b ∈ IR. E.g., for an even index m_y we have
y^min(a, x) = min_{i=1,3,...,m_y−1} min(y_i(a, x), y_{i+1}(a, x))
            = min_{i=1,3,...,m_y−1} ½ (y_i(a, x) + y_{i+1}(a, x) − |y_i(a, x) − y_{i+1}(a, x)|).
In many cases we may suppose that the state functions y_i = y_i(a, x),
i = 1, . . . , m_y, are bounded from below, hence,
y_i(a, x) > −A, i = 1, . . . , m_y,
for all (a, x) under consideration with a positive constant A > 0. Thus,
defining
ỹ_i(a, x) := y_i(a, x) + A, i = 1, . . . , m_y,
and therefore ỹ^min(a, x) = y^min(a, x) + A, we have
y^min(a, x) ≤ 0 if and only if ỹ^min(a, x) ≤ A.
Hence, the survival/failure of the system or structure can be studied also by means of the positive function ỹ^min = ỹ^min(a, x). Using now the theory of power or Hölder means [24], the minimum ỹ^min(a, x) of positive functions can be represented also by the limit
ỹ^min(a, x) = lim_{λ→−∞} ((1/m_y) ∑_{i=1}^{m_y} ỹ_i(a, x)^λ)^{1/λ}
of the decreasing family of power means M^{[λ]}(ỹ) := ((1/m_y) ∑_{i=1}^{m_y} ỹ_i^λ)^{1/λ}, λ < 0. Consequently, for each fixed p > 0 we also have
ỹ^min(a, x)^p = lim_{λ→−∞} ((1/m_y) ∑_{i=1}^{m_y} ỹ_i(a, x)^λ)^{p/λ}.
Assuming that the expectation E[M^{[λ]}(ỹ(a(ω), x))^p] exists for an exponent λ = λ_0 < 0, by means of Lebesgue’s bounded convergence theorem we get the moment representation
E[ỹ^min(a(ω), x)^p] = lim_{λ→−∞} E[((1/m_y) ∑_{i=1}^{m_y} ỹ_i(a(ω), x)^λ)^{p/λ}].
Since t → t^{p/λ}, t > 0, is convex for each fixed p > 0 and λ < 0, by Jensen’s inequality we have the lower moment bound
E[ỹ^min(a(ω), x)^p] ≥ lim_{λ→−∞} ((1/m_y) ∑_{i=1}^{m_y} E[ỹ_i(a(ω), x)^λ])^{p/λ}.
Hence, for the pth order moment of ỹ^min(a(·), x) we get the approximations
E[((1/m_y) ∑_{i=1}^{m_y} ỹ_i(a(ω), x)^λ)^{p/λ}] ≥ ((1/m_y) ∑_{i=1}^{m_y} E[ỹ_i(a(ω), x)^λ])^{p/λ}, λ < 0.
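The power-mean approximation of the pth order moment of ỹ^min can be checked numerically as in the following sketch, where the (positive) state functions and the distribution of a(ω) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.normal(1.0, 0.1, size=(50000, 3))

def y_tilde(a):
    # hypothetical positive (shifted) state functions y_tilde_i(a, x) at a fixed x
    return np.column_stack([0.5 + a[:, 0], 0.8 + a[:, 1] * a[:, 2], 1.2 + a[:, 2]])

Y = y_tilde(a)
p = 2.0
exact = (Y.min(axis=1) ** p).mean()                  # E (y_tilde_min)^p by Monte Carlo

for lam in (-2.0, -10.0, -50.0):
    M = np.mean(Y ** lam, axis=1) ** (1.0 / lam)     # power mean M^[lambda]
    upper = (M ** p).mean()                          # >= exact, since M^[lambda] >= min
    # ((1/m)Σ E y_i^lambda)^{p/lambda}; its lambda -> -inf limit bounds `exact` from below
    lower = np.mean((Y ** lam).mean(axis=0)) ** (p / lam)
    print(lam, lower, exact, upper)
```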
If the functions (a, x) −→ y_i(a, x) are bilinear, then y^min = y^min(a, x) is a piecewise
linear function with respect to a. Fitting a linear or quadratic Response
Surface Model [17, 18, 105, 106]
with “design” points d_a^{(j)} ∈ IR^ν, j = 1, . . . , p, the unknown coefficients
c = c(x), q = q(x) and Q = Q(x) are obtained by minimizing the mean
square error
ρ(c, q, Q) := ∑_{j=1}^{p} (ŷ(a^{(j)}, x) − y^min(a^{(j)}, x))² (2.30h)
with respect to (c, q, Q). Since the model (2.30f) depends linearly on the
function parameters (c, q, Q), explicit formulas for the optimal coefficients
c∗ = c∗ (x), q ∗ = q ∗ (x), Q∗ = Q∗ (x), (2.30i)
are obtained from this least squares estimation method, see Chap. 5.
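A minimal least squares fit of a linear response surface (the quadratic term Q is omitted here for brevity) to values of y^min at hypothetical design points may look as follows:

```python
import numpy as np

rng = np.random.default_rng(6)

def y_min(a):
    # hypothetical piecewise linear minimum state function a -> y_min(a, x) at fixed x
    return np.minimum(1.5 - a[:, 0], 2.0 - a[:, 0] - 0.5 * a[:, 1])

# "design" points d_a^(j), j = 1, ..., p, around the mean parameter vector
D = rng.normal(1.0, 0.3, size=(50, 2))
z = y_min(D)

# linear response surface  y_hat(a) = c + q^T a ; least squares as in (2.30h)
X = np.column_stack([np.ones(len(D)), D])
coef, *_ = np.linalg.lstsq(X, z, rcond=None)
c_star, q_star = coef[0], coef[1:]
print(c_star, q_star)
print(np.mean((X @ coef - z) ** 2))     # mean square error at the optimal coefficients
```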
As can be seen above, cf. (2.21a), (2.21b), in the objective and/or in the
constraints of substitute problems for optimization problems with random
data mean value functions of the type
Γ (x) := Eg a(ω), x
occur. Here, g = g(a, x) is a real valued function on a subset of IRν × IRr , and
a = a(ω) is a ν-random vector.
(a) Expansion with respect to a: Suppose that on its domain the function
g = g(a, x) has partial derivatives ∇_a^l g(a, x), l = 0, 1, . . . , l_g + 1, up to order
l_g + 1. Note that the gradient ∇_a g(a, x) contains the so-called sensitivities
∂g/∂a_j (a, x), j = 1, . . . , ν, of g with respect to the parameter vector a at
(a, x). In the same way, the higher order partial derivatives ∇_a^l g(a, x), l >
1, represent the higher order sensitivities of g with respect to a at (a, x).
Taylor expansion of g = g(a, x) with respect to a at ā := Ea(ω) yields
g(a, x) = ∑_{l=0}^{l_g} (1/l!) ∇_a^l g(ā, x) · (a − ā)^l + (1/(l_g + 1)!) ∇_a^{l_g+1} g(â, x) · (a − ā)^{l_g+1}, (2.31a)
where â := ā + ϑ(a − ā), 0 < ϑ < 1, and (a − ā)^l denotes the system of lth
order products
∏_{j=1}^{ν} (a_j − ā_j)^{l_j}
If g = g(a, x) is defined by g(a, x) := γ(y(a, x)), see (2.21a), then the partial derivatives ∇_a^l g of g up to the second order
read:
∇_a g(a, x) = ∇_a y(a, x)^T ∇γ(y(a, x)) (2.31b)
∇_a² g(a, x) = ∇_a y(a, x)^T ∇²γ(y(a, x)) ∇_a y(a, x) + ∇γ(y(a, x)) · ∇_a² y(a, x), (2.31c)
where
(∇γ) · ∇_a² y := ((∇γ)^T ∂²y/(∂a_k ∂a_l))_{k,l=1,...,ν} (2.31d)
is finite for all x under consideration, and (2.32b), (2.32c) yield the error bound
|Γ(x) − Γ̃(x)| ≤ r(x) E‖a(ω) − ā‖^{l_g+1}. (2.32d)
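The following sketch compares a second order Taylor approximation of Γ(x) = Eg(a(ω), x) at ā = Ea(ω), using only the mean and covariance of a(ω), with a direct Monte Carlo estimate; for the quadratic g chosen here the approximation is exact, and all concrete data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
mean_a = np.array([1.0, 2.0])
cov_a = np.array([[0.04, 0.01], [0.01, 0.09]])
a = rng.multivariate_normal(mean_a, cov_a, size=200000)

def g(a, x):
    # hypothetical smooth cost function g(a, x) for a fixed decision x
    return (a[..., 0] * x[0] + a[..., 1] * x[1] - 3.0) ** 2

x = np.array([0.8, 1.1])
# second order Taylor approximation of Gamma(x) = E g(a(omega), x) at a_bar:
# Gamma ~ g(a_bar, x) + 0.5 * tr(Hess_a g(a_bar, x) @ cov(a)),
# since the first order term vanishes (E(a - a_bar) = 0)
H = 2.0 * np.outer(x, x)                     # Hessian of g w.r.t. a (constant here)
taylor = g(mean_a, x) + 0.5 * np.trace(H @ cov_a)
print(g(a, x).mean(), taylor)                # Monte Carlo value vs. Taylor value
```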
Applying the mean value theorem [29], under appropriate second order
differentiability assumptions, for the right hand side of (2.33e) we find the
following stochastic version of the mean value theorem:
E‖y(a(ω), x) − y^{(1)}(a(ω), x)‖ ≤ E(‖a(ω) − ā‖² sup_{0≤ϑ≤1} ‖∇_a² y(ā + ϑ(a(ω) − ā), x)‖). (2.33f)
then
Γ(x) ≤ γ(ȳ(x)) + ϑ_max q(x), x ∈ D, (2.34b)
where
ȳ(x) := Ey(a(ω), x), (2.34c)
q(x) := V(y(a(·), x)) = E‖y(a(ω), x) − ȳ(x)‖² (2.34d)
denote the mean and variance of the quality function y = y(a(ω), x).
then
γ(ȳ(x)) + (λ_min/2) q(x) ≤ Γ(x) ≤ γ(ȳ(x)) + (λ_max/2) q(x), x ∈ D. (2.35b)
Corresponding to (2.28a)–(2.28c), (2.29a)–(2.29c), we then have the fol-
lowing “dual” robust optimal design problems:
(1) Expected primary cost minimization under approximate expected failure
or recourse cost constraints
min EG0 a(ω), x (2.36a)
s.t.
γ(ȳ(x)) + c_0 q(x) ≤ Γ^max (2.36b)
x ∈ D, (2.36c)
s.t.
EG0 a(ω), x ≤ Gmax (2.37b)
x ∈ D, (2.37c)
or
P(⋂_{j=1}^{N} S_j) := P(a(ω) ∈ ⋂_{j=1}^{N} S_j), (2.38b)
where
s_{k,N} := ∑_{1≤i_1<i_2<...<i_k≤N} P(⋂_{l=1}^{k} V_{i_l}), (2.39b)
Moreover, defining
q := (q_1, . . . , q_N), Q := (q_{ij})_{1≤i,j≤N}, (2.40f)
we have
P(⋃_{j=1}^{N} V_j) ≥ q^T Q^− q, (2.40g)
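The quadratic lower bound (2.40g) can be evaluated directly from the individual and pairwise probabilities q_i = P(V_i), q_ij = P(V_i ∩ V_j), with the generalized inverse Q^− computed, e.g., as a pseudoinverse; the events V_j below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
a = rng.normal(0.0, 1.0, size=(200000, 3))

# hypothetical violation events V_j = {a_j(omega) > 1}, j = 1, ..., N
V = a > 1.0
q = V.mean(axis=0)                                 # q_j = P(V_j)
Q = (V[:, :, None] & V[:, None, :]).mean(axis=0)   # q_ij = P(V_i and V_j)

bound = q @ np.linalg.pinv(Q) @ q                  # lower bound (2.40g), Q^- as pseudoinverse
exact = V.any(axis=1).mean()                       # Monte Carlo value of P(V_1 u ... u V_N)
print(bound, exact)
```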
Two-Sided Constraints
Remark 2.2. Note that P(S) ≥ αs with a given minimum reliability αs ∈ (0, 1]
can be guaranteed by the expected cost constraint
Eϕ ỹ a(ω), x ≤ (1 − αs )ϕ0 .
Eϕ(ỹ) = E(ỹ^T C ỹ) = E tr(C ỹ ỹ^T)
= tr C (diag ρ)^{−1} (cov(y(a(·), x)) + (ȳ(x) − y_c)(ȳ(x) − y_c)^T) (diag ρ)^{−1},
(2.44b)
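Combining Remark 2.2 with the trace representation (2.44b), the following sketch computes Eϕ(ỹ) for a diagonal quadratic cost ϕ and the resulting Markov-type bound on the violation probability 1 − P(S(x)); state functions, bounds and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)
a = rng.normal(0.0, 1.0, size=(100000, 2))

def y(a, x):
    # hypothetical state functions with two-sided bounds y_l <= y <= y_u
    return np.column_stack([x[0] + 0.3 * a[:, 0], x[1] + 0.4 * a[:, 1]])

x = np.array([0.0, 0.1])
y_l, y_u = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
y_c, rho = 0.5 * (y_l + y_u), 0.5 * (y_u - y_l)     # centre and half-width
C = np.diag([1.0, 1.0])                              # quadratic cost phi(y~) = y~^T C y~

Y = y(a, x)
y_bar = Y.mean(axis=0)
Sigma = np.cov(Y, rowvar=False)
Dinv = np.diag(1.0 / rho)
# expected cost E phi(y~) by the trace formula (2.44b)
E_phi = np.trace(C @ Dinv @ (Sigma + np.outer(y_bar - y_c, y_bar - y_c)) @ Dinv)

phi0 = np.min(np.diag(C))                            # phi >= phi0 outside the admissible box
p_violation = 1.0 - ((Y >= y_l) & (Y <= y_u)).all(axis=1).mean()
print(E_phi, E_phi / phi0, p_violation)              # Markov-type bound vs. Monte Carlo
```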
Example 2.7. Assuming in the above Example 2.6 that C = diag (cii ) is a
diagonal matrix with positive elements cii > 0, i = 1, . . . , m, then
One-Sided Inequalities
Suppose that exactly one of the two bounds yli < yui is infinite for each
i = 1, . . . , m. Multiplying the corresponding constraints in (2.41a) by −1, the
admissible domain S = S(x), cf. (2.41b), can be represented always by
where ỹ_i := y_i − y_{ui}, if y_{li} = −∞, and ỹ_i := y_{li} − y_i, if y_{ui} = +∞. If we set
ỹ(a, x) := (ỹ_i(a, x))_{1≤i≤m} and
B̃ := {y ∈ IR^m : y_i < (≤) 0, i = 1, . . . , m}, (2.45b)
then
P(S(x)) = P(ỹ(a(ω), x) ∈ B̃). (2.45c)
Consider also in this case, cf. (2.42d), (2.42e), a function ϕ : IRm → IR such
that
and
Eϕ(ỹ(a(ω), x)) = ∑_{i=1}^{m} w_i E e^{α_i ỹ_i(a(ω),x)}, (2.48b)
and
Eϕ(ỹ) = E((ỹ(a(ω), x) − b)^T C (ỹ(a(ω), x) − b))
       = tr C E((ỹ(a(ω), x) − b)(ỹ(a(ω), x) − b)^T).
Note that
ỹ_i(a, x) − b_i = y_i(a, x) − (y_{ui} + b_i), if y_{li} = −∞, and
ỹ_i(a, x) − b_i = y_{li} − b_i − y_i(a, x), if y_{ui} = +∞,
Remark 2.4. The one-sided case can also be reduced approximatively to the
two-sided case by selecting a sufficiently large, but finite upper bound ỹui ∈
IR, lower bound ỹli ∈ IR, resp., if yui = +∞, yli = −∞. For multivariate
Tschebyscheff inequalities see also [75].
Applying certain probability inequalities, see Sects. 2.5.1 and 2.5.2, the case
with m(>1) limit state functions is reduced first to the case with one limit
state function only. Approximations of pf are obtained then by linearization
or by quadratic approximation of a transformed state function ỹi = ỹi (ã, x)
at a certain “design point” ã∗ = ã∗i (x) for each i = 1, . . . , m.
Based on the above procedure, according to (2.9f) we have
p_f(x) = P(y_i(a(ω), x) ≤ 0 for at least one index i, 1 ≤ i ≤ m_y)
       ≤ ∑_{i=1}^{m_y} P(y_i(a(ω), x) ≤ 0) = ∑_{i=1}^{m_y} p_{fi}(x), (2.49a)
where
p_{fi}(x) := P(y_i(a(ω), x) ≤ 0). (2.49b)
Assume now that the random ν-vector a = a(ω) can be represented [104] by
a(ω) = U(ã(ω)). (2.50a)
If H_i = H_i(·|a_1, . . . , a_{i−1}) denotes the conditional distribution function of a_i = a_i(ω), given
a_j(ω) = a_j, j = 1, . . . , i − 1, then the random ν-vector
z(ω) := T(a(ω)) defined by the transformation
z_1 := H_1(a_1)
z_2 := H_2(a_2|a_1)
. . .
z_ν := H_ν(a_ν|a_1, . . . , a_{ν−1}),
cf. [128], has stochastically independent components z1 = z1 (ω), . . . , zν =
zν (ω), and z = z(ω) is uniformly distributed on the ν-dimensional hypercube
[0, 1]ν . Thus, if Φ = Φ(t) denotes the distribution function of the univariate
N (0, 1)-normal distribution, then the inverse ã = U −1 (a) of the transforma-
tion U can be represented by
⎛ −1 ⎞
Φ (z1 )
⎜ .. ⎟
ã := ⎝ . ⎠ , z = T (a).
Φ−1 (zν )
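For a simple bivariate normal parameter vector with known conditional distributions (a hypothetical example), the transformation z = T(a) and the mapping to independent standard normal variables can be sketched as follows:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)
rho = 0.6
# hypothetical bivariate normal parameter vector a(omega) with correlation rho
a1 = rng.normal(size=100000)
a2 = rho * a1 + np.sqrt(1.0 - rho**2) * rng.normal(size=100000)

# Rosenblatt-type transformation: z_1 = H_1(a_1), z_2 = H_2(a_2 | a_1)
z1 = norm.cdf(a1)
z2 = norm.cdf((a2 - rho * a1) / np.sqrt(1.0 - rho**2))   # conditional cdf of a_2 given a_1

# transformed variables a_tilde = (Phi^{-1}(z_1), Phi^{-1}(z_2)) are independent N(0, 1)
at = np.column_stack([norm.ppf(z1), norm.ppf(z2)])
print(np.corrcoef(at, rowvar=False))     # approximately the identity matrix
print(at.mean(axis=0), at.std(axis=0))   # approximately (0, 0) and (1, 1)
```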
In the following, let x be a given, fixed r-vector. Defining then
ỹ_i(ã, x) := y_i(U(ã), x), (2.50b)
we get
p_{fi}(x) = P(ỹ_i(ã(ω), x) ≤ 0), i = 1, . . . , m_y. (2.50c)
In the First Order Reliability Method (FORM), cf. [54], the state function
ỹ_i = ỹ_i(ã, x) is linearized at a so-called design point ã^* = ã^{*(i)}. A design point
ã^* = ã^{*(i)} is obtained by projection of the origin ã = 0 onto the limit state
surface ỹ_i(ã, x) = 0, see Fig. 2.5. Since ỹ_i(ã^*, x) = 0, we get the linearization
ỹ_i(ã, x) ≈ ∇_ã ỹ_i(ã^*, x)^T (ã − ã^*)
and therefore
p_{fi}(x) ≈ P(∇_ã ỹ_i(ã^*, x)^T (ã(ω) − ã^*) ≤ 0) = Φ(−β_i(x)). (2.50e)
Note that β_i(x) is the minimum distance from ã = 0 to the limit state surface
ỹ_i(ã, x) = 0 in the ã-domain.
For more details and Second Order Reliability Methods (SORM) based on
second order Taylor expansions, see Chap. 7 and [54, 104].
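A minimal FORM sketch in the standard normal ã-domain is given below: the design point is obtained by projecting the origin onto the limit state surface, β_i(x) is the resulting distance, and p_{fi}(x) ≈ Φ(−β_i(x)); the state function is hypothetical and linear in ã, so the approximation coincides here with the exact value.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# hypothetical state function in the standard normal a_tilde-domain (fixed design x)
def y_tilde(at):
    return 3.0 - at[0] - 0.5 * at[1]          # failure for y_tilde <= 0

# design point: projection of the origin onto the limit state surface y_tilde = 0
res = minimize(lambda at: at @ at,            # squared distance to the origin
               x0=np.array([1.0, 1.0]),
               constraints={"type": "eq", "fun": y_tilde})
beta = np.sqrt(res.fun)                       # reliability index beta_i(x)
pf_form = norm.cdf(-beta)                     # FORM approximation (2.50e)

# Monte Carlo check (exact here, since y_tilde is linear in a_tilde)
rng = np.random.default_rng(11)
at = rng.normal(size=(1000000, 2))
pf_mc = (3.0 - at[:, 0] - 0.5 * at[:, 1] <= 0).mean()
print(beta, pf_form, pf_mc)
```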
4
Deterministic Descent Directions and Efficient Points
where
Γ (x) := Eγ y a(ω), x . (4.1c)
with
∇Γ(x) = E(∇_x y(a(ω), x)^T ∇γ(y(a(ω), x))). (4.2b)
The main goal of this chapter is to present several methods for the con-
struction of (1) deterministic descent directions h = h(x) for F (x), Γ (x) at
certain points x ∈ IRr , and (2) so-called efficient points x0 of the stochas-
tic optimization problem (4.1a)–(4.1c). Note that a descent direction for the
function F at a point x is a vector h = h(x) such that ∇F(x)^T h < 0.
with
Γx0 (x) := Eγ y a(ω), x0 + ∇x y a(ω), x0 (x − x0 ) . (4.3b)
Obviously, F_{x0} follows from F by linearization of the primary cost function
G_0 = G_0(a(ω), x) with respect to x at x0, and by “inner linearization” of the
expected recourse cost function Γ = Γ(x) at x0 (Fig. 4.1).
Corresponding to (4.2b), the gradient ∇F_{x0}(x) is given by
Corresponding to (4.2b), the gradient ∇Fx0 (x) is given by
with
∇Γ_{x0}(x) = E(∇_x y(a(ω), x0)^T ∇γ(y(a(ω), x0) + ∇_x y(a(ω), x0)(x − x0))). (4.3d)
Under the general assumptions concerning F and Γ , the following basic prop-
erties of Fx0 hold:
Theorem 4.1. (1) If γ is a convex cost function, then Γx0 and Fx0 are convex
functions for each x0 .
(2) At point x0 the following equations hold:
with a constant L > 0 and a convex set B ⊂ IRmy . If y = y(a, x) and its
linearization at x0 fulfill
y a(ω), x ∈ B, y a(ω), x0 + ∇x y a(ω), x0 (x − x0 ) ∈ B a.s. (4.4d)
Proof. (a) The first term of Fx0 is linear in x, and also the argument of γ in
Γx0 is linear in x. Hence, Γx0 , Fx0 are convex for each convex loss function γ
such that the expectations exist and are finite. (b) The equations in (4.4a),
(4.4b) follow from (4.1b), (4.1c), (4.3a), (4.3b) and from (4.2a), (4.2b), (4.3c),
(4.3d) by putting x = x0 . The last part follows by applying the mean value
theorem for vector valued functions, cf. [29].
Remark 4.1. (a) If γ = γ(y) is a convex function on IRmy , then (4.4c) holds
on each bounded closed subset B ⊂ IRmy with a local Lipschitz constant
L = L(B). (b) An important class of convex loss functions γ having a global
Lipschitz constant L = Lγ are the sublinear functions on IRmy . A function γ
is called sublinear if it is positive homogeneous and subadditive, hence,
we have
γ(y) ≤ ‖γ‖ · ‖y‖, y ∈ IR^{m_y}, (4.5d)
|γ(y) − γ(z)| ≤ ‖γ‖ · ‖y − z‖, y, z ∈ IR^{m_y}. (4.5e)
Proof. (a) Using (4.4b), we have ∇F (x0 )T h = ∇Fx0 (x0 )T h < 0. Hence, h is
also a descent direction of F at x0 . (b) Since Fx0 is a differentiable, convex
function on IRmy , we obtain
where in case “=” the function Γx0 is not constant on x0 z, or (4.6a) holds
with “<.” Then h = z − x0 is a descent direction for F at x0 .
ỹ(a, x) := (G_0(a, x), y(a, x)), γ̃(ỹ) = γ̃((t, y)) := t + γ(y), (4.7a)
we have
F(x) = Eγ̃(ỹ(a(ω), x)). (4.7b)
where
c_0 := EG_0(a(ω), x0) − (E∇_x G_0(a(ω), x0))^T x0 (4.8b)
c̄ := E∇_x G_0(a(ω), x0). (4.8c)
Furthermore, (A(ω), b(ω)) is the random m_y × (r + 1) matrix given by
A(ω) := ∇_x y(a(ω), x0) (4.8d)
b(ω) := −(y(a(ω), x0) − ∇_x y(a(ω), x0) x0). (4.8e)
exists and is finite for all x ∈ IR^r. Finally, let C^J_{sep}, C^{JJ}_{sep}, Ĉ^J_{sep}, Ĉ^{JJ}_{sep} and
C^J_{sep}(P), C^{JJ}_{sep}(P), Ĉ^J_{sep}(P), Ĉ^{JJ}_{sep}(P) denote the subsets of all separable elements
(i.e., u(z) := ∑_{i=1}^{m} u_i(z_i)) of C^J, . . . , Ĉ^{JJ} and C^J(P), . . . , Ĉ^{JJ}(P), respectively.
The (n + 1) × (n + 1) matrix
Qij := cov Ai (·), bi (·) , Aj (·), bj (·) = Qji , (4.12c)
denotes the covariance matrix of the ith and the jth row Ai (ω), bi (ω) ,
Aj (ω), bj (ω) of A(ω), b(ω) . Consequently, δ(ω, x) := A(ω)x − b(ω) is
normally distributed with mean Āx − b̄ and covariance matrix
Q_x := cov(A(·)x − b(·)) = (g_{ij}(x))_{i,j=1,...,m}. (4.13a)
g_{ij}(x) = x̂^T Q_{ij} x̂ = x̂^T Q_{ji} x̂ = x̂^T ((Q_{ij} + Q_{ji})/2) x̂ with x̂ := (x^T, −1)^T; (4.13b)
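The covariance matrix Q_x = (g_{ij}(x)) of δ(ω, x) = A(ω)x − b(ω) can be assembled from the row covariances Q_{ij} via (4.13b) and checked against a direct sample covariance, as in the following sketch with hypothetical jointly normal rows (A_i(ω), b_i(ω)):

```python
import numpy as np

rng = np.random.default_rng(12)
m, r, n_samp = 2, 3, 200000
x = np.array([0.5, -1.0, 2.0])
x_hat = np.append(x, -1.0)                        # x_hat = (x, -1)

# hypothetical jointly normal rows (A_i(omega), b_i(omega)) of the m x (r+1) matrix
rows = rng.normal(size=(n_samp, m, r + 1)) @ np.diag([1.0, 0.5, 0.7, 0.3]) \
       + np.array([1.0, 0.0, -1.0, 0.5])

delta = rows @ x_hat                              # delta(omega, x) = A(omega)x - b(omega)

# g_ij(x) = x_hat^T Q_ij x_hat with Q_ij = cov((A_i, b_i), (A_j, b_j)), cf. (4.12c), (4.13b)
Qx = np.empty((m, m))
for i in range(m):
    for j in range(m):
        Cij = np.cov(rows[:, i, :], rows[:, j, :], rowvar=False)[: r + 1, r + 1:]
        Qx[i, j] = x_hat @ Cij @ x_hat

print(Qx)
print(np.cov(delta, rowvar=False))                # direct sample covariance of delta
```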
y ∈ D, (4.14e)
Corresponding to Theorem 4.3, here we know the following result [83, 84]:
Theorem 4.4. (a) If r-vectors x, y fulfill relations (4.15a)–(4.15d), then
F (x) ≥ F (y) for each u ∈ C J (P ). (b) Let x, y ∈ IRr be related according
to (4.15a)–(4.15d). Then F (x) > F (y) for each u ∈ C J (P ), C JJ (P ), CˆJ (P ),
resp., if still the following additional condition holds:
c̄^T x > c̄^T y (4.15a′)
Ā_i x > Ā_i y for some i ∈ J (4.15b′)
Q_x ⪰ Q_y, Q_x ≠ Q_y. (4.15d′)
Proof. According to (4.9b), (4.11) we have F(x) = c̄^T x + F_0(x) with F_0(x) =
Eu(A(ω)x − b(ω)), where u ∈ C^J(P), C^{JJ}(P), Ĉ^J(P), respectively. Thus, if
(4.15a), (4.15a′), resp., holds, then F(x) ≥ (>) c̄^T y + F_0(x), and in the rest of
the proof we have to consider F_0 only. Defining
Y_x = Y_x(ω) := A(ω)x − b(ω) − (Āx − b̄) = (A(ω) − Ā)x − (b(ω) − b̄),
we find that
F_0(x) = Eu(Āx − b̄ + Y_x(ω)) ≥ (>) Eu(Āy − b̄ + Y_x(ω)) =: F̃_0,
Supposing now that (4.15d), (4.15d′), resp., holds, we consider the characteristic function
K̂(z) := exp(−½ z^T (Q_x − Q_y) z), z ∈ IR^m,
where J(v) := ∫ ũ(v + w) K(dw), v ∈ IR^m. For any u ∈ C^J(P) we have that
and therefore
F̃_0 = ∫ J(v) P_y^{(0)}(dv) ≥ ∫ ũ(v) P_y^{(0)}(dv) = F_0(y).
If u ∈ Ĉ^J(P), then
and therefore
F̃_0 = ∫ J(v) P_y^{(0)}(dv) > ∫ ũ(v) P_y^{(0)}(dv) = F_0(y),
(4.16a) holds with gii (x) > gii (y) at least once, (4.16a )
Obviously, we have
F(x) = F_u(x) := Ev((c(ω)^T x, A(ω)x − b(ω))) with v((t, z)) := t + u(z). (4.17a)
Because of Theorems 4.3, 4.4, Corollary 4.4 and the above remark, we are
looking now for solutions of (4.14a)–(4.14d) and (4.15a)–(4.15d).
Representing first the symmetric (r + 1) × (r + 1) matrix ½(Q_{ij} + Q_{ji}) by
½(Q_{ij} + Q_{ji}) = ( R_{ij}, d_{ij} ; d_{ij}^T, q_{ij} ), (4.18a)
where R_{ij} is a symmetric r × r matrix, d_{ij} ∈ IR^r and q_{ij} ∈ IR, we get
where in (4.18c) the strict inequality “>” holds for y ≠ x, if Rii is posi-
tive definite. Thus, since (4.15a)–(4.15d) implies (4.14a)–(4.14d), we have the
following lemma:
Lemma 4.2 (Necessary conditions for (4.14a)–(4.14d), (4.15a)–
(4.15d)). If y ≠ x fulfills (4.14a)–(4.14d) or (4.15a)–(4.15d) with given x,
then h = y − x satisfies the linear constraints
c̄T h ≤ 0 (4.19a)
ĀI h ≤ 0, ĀII h = 0 (4.19b), (4.19c)
(Rii x − dii )T h ≤ 0(< 0, if Rii is positive definite), i = 1, . . . , m. (4.19d)
Lemma 4.3. (a) Suppose that R_ii is positive definite for i ∈ K, where ∅ ≠
K ⊂ {1, . . . , m}, or ν_x^{(2)} > 0. Then, either (4.19a)–(4.19f) has a solution
h (≠ 0), or there exist multipliers
where
f(y) := max{g_ii(y) − g_ii(x) : 1 ≤ i ≤ m}. (4.21b)
Note. Clearly, also any other selection of functions from {c̄^T y − c̄^T x, Ā_i y −
Ā_i x, i ∈ J, g_ii(y) − g_ii(x), i = 1, . . . , m} can be used to define the objective
function f in (4.21a).
Obviously, (4.21a), (4.21b) is equivalent to
min t (4.21a′)
s.t.
c̄^T y ≤ c̄^T x (4.21b′)
Ā_I y ≤ Ā_I x, Ā_II y = Ā_II x (4.21c′), (4.21d′)
g_ii(y) − g_ii(x) ≤ t, i = 1, . . . , m (4.21e′)
y ∈ D. (4.21f′)
It is easy to see that (4.21a), (4.21b), (4.21a )–(4.21f ) are convex programs
having for x ∈ D always the feasible solution y = x, (y, t) = (x, 0). Optimal
solutions y ∗ , f (y ∗ ) ≤ 0, (y ∗ , t∗ ), t∗ ≤ 0, resp., exist under weak assumptions.
Further basic properties of the above program are shown next:
Lemma 4.4. Let x ∈ D be any given feasible point.
(a) If y ∗ , (y ∗ , t∗ ) is optimal in (4.21a), (4.21b), (4.21a )–(4.21f ), resp., then
y ∗ solves (4.14a)–(4.14d).
(b) If y, (y, t) is feasible in (4.21a), (4.21b), (4.21a )–(4.21f ), then y solves
(4.14a)–(4.14d), provided that f (y) ≤ 0, t ≤ 0, respectively. If y solves
(4.14a)–(4.14d), then y, (y, t), t0 ≤ t ≤ 0, is feasible in (4.21a), (4.21b),
(4.21a )–(4.21f ), resp., where t0 := f (y) ≤ 0.
(c) (4.14a)–(4.14d) has the unique solution y = x if and only if (4.21a )–
(4.21f ) has the unique optimal solution (y ∗ , t∗ ) = (x, 0).
Proof. Assertions (a) and (b) are clear. Suppose now that (4.14a)–(4.14d)
yields y = x. Obviously, (y ∗ , t∗ ) = (x, 0) is then optimal in (4.21a )–(4.21f ).
Assuming that there is another optimal solution (y^{**}, t^{**}) ≠ (x, 0), t^{**} ≤ 0, of
(4.21a )–(4.21f ), we know that y ∗∗ solves (4.14a)–(4.14d). Hence, we have
y ∗∗ = x and therefore t∗∗ < 0 which is a contradiction to y ∗∗ = x.
Conversely, assume that (4.21a′)–(4.21f′) has the unique optimal solution
(y^*, t^*) = (x, 0), and suppose that (4.14a)–(4.14d) has a solution y ≠ x.
Since (y, t), f(y) ≤ t ≤ 0, is feasible in (4.21a′)–(4.21f′), we find t^* ≤ t ≤ 0 and
therefore f(y) = 0. Thus, (y, 0) ≠ (y^*, t^*) is also optimal in (4.21a′)–(4.21f′),
in contradiction to the uniqueness of (y^*, t^*) = (x, 0).
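As a numerical illustration of the descent direction construction, the following sketch sets up a small instance of the convex program (4.21a′)–(4.21f′) (with J = ∅, so that all rows of Ā enter the equality constraints) and solves it with a standard NLP solver; the data c̄, Ā, Q_ii and the feasible set D are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(13)
r, m = 3, 2
c_bar = np.array([1.0, -0.5, 0.2])
A_bar = rng.normal(size=(m, r))
Q = [np.eye(r + 1) * (i + 1) for i in range(m)]    # hypothetical Q_ii (psd, (r+1)x(r+1))

def g(i, y):
    y_hat = np.append(y, -1.0)                     # g_ii(y) = y_hat^T Q_ii y_hat, cf. (4.13b)
    return y_hat @ Q[i] @ y_hat

x = np.array([1.0, 1.0, 1.0])                      # current feasible point, x in D

def objective(v):                                  # v = (y, t), objective (4.21a')
    return v[r]

cons = [{"type": "ineq", "fun": lambda v: c_bar @ x - c_bar @ v[:r]},         # (4.21b')
        {"type": "eq",   "fun": lambda v: A_bar @ v[:r] - A_bar @ x}]         # (4.21d'), J empty
cons += [{"type": "ineq", "fun": lambda v, i=i: v[r] - (g(i, v[:r]) - g(i, x))}  # (4.21e')
         for i in range(m)]
bounds = [(-5.0, 5.0)] * r + [(None, None)]        # D = box, t free

res = minimize(objective, x0=np.append(x, 0.0), constraints=cons, bounds=bounds)
y_star, t_star = res.x[:r], res.x[r]
h = y_star - x                         # candidate descent direction if t* <= 0 and y* != x
print(t_star, h)
```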
where g(x) = (g_1(x), . . . , g_ν(x))^T is a ν-vector of convex functions on IR^r, g_0 ∈
IR^ν. Furthermore, suppose that (4.21a′)–(4.21f′) fulfills the Slater condition for
each x ∈ D. This regularity condition holds, e.g., (consider (y, t) := (x, 1)) if
The necessary and sufficient optimality conditions for (4.21a′)–(4.21f′) read:
∑_{i=1}^{m} γ_i = 1 (4.24a)
∑_{i=1}^{m} γ_i (R_ii y − d_ii) = −½ (γ_0 c̄ + Ā_I^T λ_I + Ā_II^T λ_II + (∂g/∂x)(y)^T µ) (4.24b)
Ā_i x, i ∈ J, with y^* ≠ x.
Then h = y^* − x is a feasible descent direction for F at x for all u ∈ C^J_{sep}(P)
such that F = F_u is not constant on the line segment joining x and y^*.
Case ii.3: t∗ = 0 and (y ∗ , t∗ ) = (x, 0) is the unique optimal solution of
(4.21a )–(4.21f ).
According to Lemma 4.4c, in Case ii.3 we know here that (4.14a)–(4.14d)
has only the trivial solution y = x. Hence, the construction of a descent
direction h = y − x by means of (4.14a)–(4.14d) and Corollary 4.4 fails at
x. Note that points x ∈ D having this property are candidates for optimal
solutions of (4.9a), (4.9b).
Solution of (4.15a)–(4.15e). Let x ∈ D be a given feasible point. Replacing
for simplicity (4.15d) by (4.16a), for handling system (4.15a)–(4.15e), we get
the following optimization problem
where
f̂(y) := max{ g_ii(y) − g_ii(x) + ∑_{j≠i} |g_ij(x) − g_ij(y)| : 1 ≤ i ≤ m } (4.25b)
      = max{ g_{iσ_i}(y) − g_{iσ_i}(x) : σ_i ∈ Σ_i, 1 ≤ i ≤ m },
see the note to program (4.21a), (4.21b). In the above, giσi is defined by
(4.16c). An equivalent version of (4.25a), (4.25b) reads
min t (4.25a′)
s.t.
c̄^T y ≤ c̄^T x (4.25b′)
Ā_I y ≤ Ā_I x, Ā_II y = Ā_II x (4.25c′), (4.25d′)
g_ii(y) − g_ii(x) + ∑_{j≠i} |g_ij(x) − g_ij(y)| ≤ t, i = 1, . . . , m (4.25e′)
y ∈ D, (4.25f′)
Proof. Assertions (a), (b), (d) are clear, and (c) can be shown as the corre-
sponding assertion (c) in Lemma 4.4.
" #
i
'
m ' ' '
' γiσi Rii − σij Rij y − dii − σij dij
i=1 σi ∈ j=i j=i
i
Lemmas 4.5(d) and (4.18c) yield still the auxiliary convex program
min t (4.27a)
s.t.
and
and
(c̄T x0 , ĀI x0 , gii (x0 ), 1 ≤ i ≤ m) = (c̄T y, ĀI y, gii (y), 1 ≤ i ≤ m). (4.29e)
Lemma 4.6. (a) If g_ii is strictly convex for at least one index i, 1 ≤ i ≤ m,
then E^{st}_{sep} = E_{sep}.
(b) If g_{iσ_i} is convex for all σ_i ∈ Σ_i, i = 1, 2, . . . , m, and g_ii is strictly convex
for at least one index i, 1 ≤ i ≤ m, then E^{st} = E.
(c) If for arbitrary x, y ∈ D
Proof. (a) We may suppose, e.g., that g_11 is strictly convex. Let then x0 ∈ E_{sep}
and assume that x0 ∉ E^{st}_{sep}. Hence, there exists y ∈ D, y ≠ x0, such that
(4.28a)–(4.28d) holds. Since all g_ii are convex and g_11 is strictly convex, it is
easy to see that η := λy^0 + (1 − λ)y fulfills (4.28a)–(4.28e) for each 0 < λ < 1.
This contradiction yields x0 ∈ E^{st}_{sep} and therefore E^{st}_{sep} = E_{sep}, cf. (4.30a). Assertion (b) follows in the same way. In order to show (c), let x0 ∈ E_{sep}, x0 ∈ E,
resp., and assume x0 ∉ E^{st}_{sep}, x0 ∉ E^{st}, respectively. Consider y ∈ D, y ≠ x0, such that
(4.28a)–(4.28d), (4.29a)–(4.29d), resp., holds. Because of x0 ∈ E_{sep}, x0 ∈ E,
resp., we have
(c̄^T x0, Ā_I x0, g_ii(x0), 1 ≤ i ≤ m) = (c̄^T y, Ā_I y, g_ii(y), 1 ≤ i ≤ m).
This equation, (4.28c), (4.29c), resp., and (4.30b) yield the contradiction y =
x0. Thus, x0 ∈ E^{st}_{sep}, x0 ∈ E^{st}, resp., and therefore E^{st}_{sep} = E_{sep} and E^{st} = E.
x0, has the unique optimal solution (y^*, t^*) = (x0, 0). Since (4.29d), being
equivalent to (4.16a), implies (4.28d), we find
E_{sep} ⊂ E, E^{st}_{sep} ⊂ E^{st}. (4.30c)
Indeed, Lemma 4.2 and Remark 4.4 yield the following inclusion:
Corollary 4.5. E_0 ⊂ E^{st}_{sep}.
By Lemma 4.3 we have this parametric representation for E_0:
Corollary 4.6. (a) If R_ii is positive definite for at least one index 1 ≤ i ≤
m, then x ∈ E_0 if and only if x ∈ D and there are parameters γ_i, i =
0, 1, . . . , m, λ_I, λ_II and π^{(1)}, π^{(2)} such that (4.20a), (4.20b) holds.
(b) Let R_ii be singular for each i = 1, . . . , m. If ν_x^{(2)} > 0 for each x ∈ D, then
x ∈ E_0 if and only if x ∈ D and there exist γ_i, 0 ≤ i ≤ m, λ_I, λ_II, π^{(1)}, π^{(2)}
such that (4.20a), (4.20b) holds. If ν_x^{(2)} = 0, x ∈ D, then x ∈ E_0 if and
only if x ∈ D and either the matrix defined in Lemma 4.3(c) has rank n,
or there are parameters γ_i, 0 ≤ i ≤ m, λ_I, λ_II and π^{(1)} such that (4.20a′),
(4.20b) holds.
As was already discussed in Sect. 4.2.2, Case ii.3, efficient solutions are candidates for optimal solutions x^* of (4.9a), (4.9b). This will be made more precise
in the following:
Theorem 4.5. If x^* ∈ argmin_{x∈D} F(x) with any u ∈ Ĉ^{JJ}_{sep}(P), u ∈ Ĉ^{JJ}(P),
then x^* ∈ E_{sep}, x^* ∈ E, respectively.
u_ρ(z) := u(z) + ρ ∑_{i=1}^{m} v_i(z_i), z ∈ IR^m,
If D is compact, then for each ρ > 0 we find x∗ρ ∈ argminx∈D Fρ (x), where
Fρ (x) := c̄T x + Euρ A(ω)x − b(ω) , see (4.9b). Furthermore, since D is
compact and x → E ∑_{i=1}^{m} v_i(A_i(ω)x − b_i(ω)) is convex and therefore continuous
on D, there is a constant C > 0 such that |F(x) − F_ρ(x)| ≤ ρC for all x ∈ D.
This yields |min_{x∈D} F(x) − min_{x∈D} F_ρ(x)| = |min_{x∈D} F(x) − F_ρ(x^*_ρ)| ≤ ρC.
Hence, F_ρ(x^*_ρ) → min_{x∈D} F(x) as ρ ↓ 0. Furthermore, we have |F(x^*_ρ) −
F_ρ(x^*_ρ)| ≤ ρC, ρ > 0, and for any accumulation point x^* of (x^*_ρ) we get
F(x^*) − min_{x∈D} F(x) ≤ |F(x^*) − F(x^*_ρ)| + |F(x^*_ρ) − F_ρ(x^*_ρ)| + |F_ρ(x^*_ρ)
− min_{x∈D} F(x)| ≤ |F(x^*) − F(x^*_ρ)| + 2ρC, ρ > 0.
Consequently, since x^*_ρ ∈ E_{sep}, x^*_ρ ∈ E, resp., for all ρ > 0, cf. Theorem 4.5,
for every accumulation point x^* of (x^*_ρ) we find that x^* ∈ Ē_{sep}, x^* ∈ Ē,
resp., and F(x^*) = min_{x∈D} F(x). Suppose now that D is closed and F(x) →
+∞ as ‖x‖ → +∞. Since F_ρ(x) ≥ F(x), x ∈ IR^n, ρ > 0, we also have that
F_ρ(x) → +∞ for each fixed ρ > 0. Hence, there exist optimal solutions x^*_0 ∈
argmin_{x∈D} F(x) and x^*_ρ ∈ argmin_{x∈D} F_ρ(x) for every ρ > 0. For all 0 < ρ ≤ 1
and with an arbitrary, but fixed x̃ ∈ D we find then x^*_ρ ∈ E_{sep}, x^*_ρ ∈ E, resp.,
and
F(x^*_ρ) ≤ F_ρ(x^*_ρ) ≤ F_ρ(x̃) ≤ F_1(x̃), 0 < ρ ≤ 1.
Thus, (x^*_ρ) must lie in a bounded set. Let R := max(‖x^*_0‖, R_0), where
R_0 denotes a norm bound of (x^*_ρ). If D̃ := {x ∈ D : ‖x‖ ≤ R}, then
min_{x∈D} F(x) = min_{x∈D̃} F(x) and min_{x∈D} F_ρ(x) = min_{x∈D̃} F_ρ(x). Since D̃
is compact, the rest of the proof follows now as in the first part.
For elements of C^J_{sep}(P, D), C^J(P, D), see Definition 4.3, we have the following condition:
Theorem 4.7. If x^* ∈ argmin_{x∈D} F(x) with any u ∈ C^J_{sep}(P, D), u ∈
C^J(P, D), then x^* ∈ E^{st}_{sep}, x^* ∈ E^{st}, respectively.
The preceding Theorems 4.3, 4.4 and Corollary 4.4 yield this final result:
The preceding Theorems 4.3, 4.4 and Corollary 4.4 yield this final result:
x0 ∉ E^{st}, x0 ∉ E^{st}_{sep}, respectively. Then there is y ∈ D, y ≠ x0, such that (4.29a)–
(4.29d), (4.28a)–(4.28d), resp., holds. Hence, F(y) ≤ F(x0), and h = y − x0
is a feasible descent direction for F at x0 for all u ∈ C^J(P), u ∈ C^J_{sep}(P), such
that F is not constant on the line segment joining x0 and y.
= exp(i tr(M Ξ^T)) exp(−∑_{σ=1}^{s} g(vec M · Q(σ) · (vec M)^T)),
where
vecM := (M1 , M2 , . . . , Mm ) (4.33b)
denotes the m(r + 1)-vector of the rows M1 , . . . , Mm of M . Furthermore, g =
g(t) is a certain nonnegative 1–1-function on IR+ . E.g., in the case of multivari-
ate symmetric stable distributions of order s and characteristic exponent α,
0 < α ≤ 2, we have g(t) := ½ t^{α/2}, t ≥ 0, see [119].
Consequently, the probability distribution P_{A(·)x−b(·)} of the random m-vector
A(ω)x − b(ω) = (A(ω), b(ω)) x̂ with x̂ := (x^T, −1)^T (4.34a)
may be represented by the characteristic function
P̂_{A(·)x−b(·)}(z) = E exp(i z^T (A(ω)x − b(ω)))
= E exp(i tr((z x̂^T)(A(ω), b(ω))^T)) = P̂_{(A(·),b(·))}(z x̂^T)
= exp(i z^T (Θx − θ)) exp(−∑_{σ=1}^{s} g(z^T Q_x(σ) z)). (4.34b)
Following the proof of Theorem 4.4, for a given vector x and a vector y ≠ x
to be determined, we consider the quotient of the characteristic functions of
A(ω)x − b(ω) and A(ω)y − b(ω), see (4.34a), (4.34b):
P̂_{A(·)x−b(·)}(z) / P̂_{A(·)y−b(·)}(z) = exp(i z^T (Θx − Θy)) ϕ^{(0)}_{x,y}(z), (4.35a)
where
ϕ^{(0)}_{x,y}(z) := exp(−∑_{σ=1}^{s} (g(z^T Q_x(σ) z) − g(z^T Q_y(σ) z))). (4.35b)
Thus, a necessary condition such that ϕ^{(0)}_{x,y} is the characteristic function

    ϕ^{(0)}_{x,y}(z) = K̂^{(0)}_{x,y}(z),   z ∈ IR^m ,                                   (4.35c)

of a symmetric probability distribution K^{(0)}_{x,y} reads

    Qx (σ) ⪰ Qy (σ), i.e., Qx (σ) − Qy (σ) is positive semidefinite, σ = 1, . . . , s,   (4.36)

for arbitrary functions g = g(t) having the above mentioned properties. Note
that (4.35b), (4.35c) means that

    PA0 (·)x−b0 (·) = K^{(0)}_{x,y} ∗ PA0 (·)y−b0 (·) ,                                   (4.37)

where (A0 (ω), b0 (ω)) := (A(ω), b(ω)) − Ξ.
Based on (4.35a)–(4.35c) and (4.36), corresponding to Theorem 4.4 for
normal distributions P , here we have the following results:
A(·),b(·)
Theorem 4.8. Distribution invariance. For given x ∈ IR^r , let y ≠ x fulfill the
relations
c̄T x ≥ c̄T y (4.38a)
Θx = Θy (4.38b)
Qx (σ) = Qy (σ), σ = 1, . . . , s. (4.38c)
Then PA(·)x−b(·) = PA(·)y−b(·) , F0 (x) = F0 (y) for arbitrary u ∈ C(P ) := C J (P )
with J = ∅, and F (x) ≥ F (y).
Proof. According to (4.34a), (4.34b), conditions (4.38b), (4.38c) imply
P -A(·)y−b(·) , hence, PA(·)x−b(·) = PA(·)y−b(·) , and the rest follows
-A(·)x−b(·) = P
immediately.
Corollary 4.8. Under the above assumptions h = y − x is a descent direction
for F at x, provided that F is not constant on xy.
Theorem 4.9. Suppose that (4.36) is also sufficient for (4.35c). Assume that
the distribution K^{(0)}_{x,y} has zero mean for all x, y under consideration. For given
x ∈ IR^r , let y ≠ x be a vector satisfying the relations

where ΘJ := (Θi )_{i∈J} , ΘJ̄ := (Θi )_{i∉J} . Then F (x) ≥ (>)F (y) for all u ∈
C^J (P ), C^{JJ}(P ), Ĉ^J (P ), respectively.
Yx (ω) := A(ω)x − b(ω) − (Θx − θ). Now the proof of Theorem 4.4 can be
transferred to the present case.
Suppose that there exists a closed convex subset Z0 ⊂ IRm such that the loss
function u fulfills the relations
hence,

    u(z) = u(z̄x ) + ∇u(z̄x )^T (z − z̄x ) + ½ (z − z̄x )^T ∇2u(w)(z − z̄x ),

where w = z̄x + ϑ(z − z̄x ) for some 0 < ϑ < 1. Because of (4.40b) and the
convexity of Z0 it is z̄x ∈ Z0 for all x ∈ D and therefore w ∈ Z0 for all z ∈ Z0 .
Hence, (4.40a) yields

    u(z̄x ) + ∇u(z̄x )^T (z − z̄x ) + ½ (z − z̄x )^T V (z − z̄x ) ≤ u(z)
        ≤ u(z̄x ) + ∇u(z̄x )^T (z − z̄x ) + ½ (z − z̄x )^T W (z − z̄x )              (4.44)
for all x ∈ D and z ∈ Z0 .
Putting z := A(ω)x − b(ω) into (4.44), because of (4.40b) we find
    u(z̄x ) + ∇u(z̄x )^T (A(ω)x − b(ω) − z̄x )
        + ½ (A(ω)x − b(ω) − z̄x )^T V (A(ω)x − b(ω) − z̄x ) ≤ u(A(ω)x − b(ω))
        ≤ u(z̄x ) + ∇u(z̄x )^T (A(ω)x − b(ω) − z̄x )                                (4.45)
        + ½ (A(ω)x − b(ω) − z̄x )^T W (A(ω)x − b(ω) − z̄x )
for all x ∈ D and all ω ∈ Ω.
Taking now expectations on both sides of (4.45), using (4.42) and x̂ = (x^T , −1)^T ,
we get

    u(z̄x ) + ½ x̂^T V̂ x̂ ≤ Eu(A(ω)x − b(ω)) ≤ u(z̄x ) + ½ x̂^T Ŵ x̂              (4.46a)
for all x ∈ D.
Furthermore, putting z := Āy − b̄, where y ∈ D, into the second inequality of
(4.44), we find
    u(Āy − b̄) ≤ u(z̄x ) + ∇u(z̄x )^T Ā(y − x) + ½ (y − x)^T Ā^T W Ā(y − x)
              = u(z̄x ) + (Ā^T ∇u(z̄x ))^T (y − x) + ½ (y − x)^T (Ā^T W Ā)(y − x)   (4.46b)
for all x ∈ D and y ∈ D.
Considering now F (y) − F (x) for vectors x, y ∈ D, from (4.46a) we get

    F (y) − F (x) = c̄^T y + Eu(A(ω)y − b(ω)) − ( c̄^T x + Eu(A(ω)x − b(ω)) )
                  ≤ c̄^T (y − x) + ½ ŷ^T Ŵ ŷ + u(Āy − b̄) − ( ½ x̂^T V̂ x̂ + u(Āx − b̄) ).
Hence, we have the following result:
Theorem 4.10. If D, A(ω), b(ω) , Z0 and V, W are selected such that
(4.40a), (4.40b) hold, then for all x, y ∈ D it is
1
F (y)−F (x) ≤ c̄T (y −x)+ (ŷ T Ŵ ŷ − x̂T V̂ x̂)+u(Āy − b̄)−u(Āx− b̄). (4.47a)
2
and therefore

    F (y) − F (x) ≤ c̄^T (y − x) + ½ x̂^T (Ŵ − V̂ )x̂ + ½ (y − x)^T (W̃ + Ā^T W Ā)(y − x)
                      + ( Ā^T ∇u(Āx − b̄) + W̃ x − w )^T (y − x)
                  = ½ x̂^T (Ŵ − V̂ )x̂ + ( c̄ + Ā^T ∇u(Āx − b̄) + W̃ x − w )^T (y − x)   (4.47d)
                      + ½ (y − x)^T (W̃ + Ā^T W Ā)(y − x).
2
Thus, F x + (y − x) − F (x) is estimated from above by a quadratic form in
h = y − x.
Concerning the construction of descent directions of F at x we have this
first immediate consequence from Theorem 4.10:
Theorem 4.11. Suppose that the assumptions of Theorem 4.10 hold. If vec-
tors x, y ∈ D, y ≠ x, are related such that
then F (x) ≥ (>)F (y). Moreover, if F (x) > F (y) or F (x) = F (y) and F is
not constant on xy, then h = y − x is a feasible descent direction of F at x.
Note. If the loss function u is monotone with respect to the partial order
on IRm induced by IRm + , then (4.48b) can be replaced by one of the weaker
relations
If u has the partial monotonicity property (4.10a), (4.10b), then (4.48b) can
be replaced by (4.14b), (4.14b ), (4.14c).
The next result follows from (4.47d).
Theorem 4.12. Suppose that the assumptions of Theorem 4.10 hold. If vec-
tors x, y ∈ D, y ≠ x, are related such that

    (W̃ + Ā^T W Ā)(y − x) = −( c̄ + Ā^T ∇u(Āx − b̄) + W̃ x − w )                        (4.49a)

    ( c̄ + Ā^T ∇u(Āx − b̄) + W̃ x − w )^T (y − x) ≤ (<) −x̂^T (Ŵ − V̂ )x̂,               (4.49b)

then F (y) ≤ (<)F (x). Furthermore, if F (y) < F (x) or F (y) = F (x) and F is
not constant on xy, then h = y − x is a feasible descent direction for F at x.
    F (y) − F (x) ≤ ½ x̂^T (Ŵ − V̂ )x̂ + ( c̄ + Ā^T ∇u(Āx − b̄) + W̃ x − w )^T (y − x)
                      + ½ (y − x)^T (W̃ + Ā^T W Ā)(y − x)
                  ≤ ½ x̂^T (Ŵ − V̂ )x̂ + ( c̄ + Ā^T ∇u(Āx − b̄) + W̃ x − w )^T (y − x)
                      − ½ (y − x)^T ( c̄ + Ā^T ∇u(Āx − b̄) + W̃ x − w )
                  = ½ x̂^T (Ŵ − V̂ )x̂ + ½ ( c̄ + Ā^T ∇u(Āx − b̄) + W̃ x − w )^T (y − x)
                  ≤ (<) 0.
According to (4.49a) it is

    y − x = −C^{−1} ( c̄ + Ā^T ∇u(Āx − b̄) + W̃ x − w ),

or

    a^T C^{−1} a ≥ x̂^T (Ŵ − V̂ )x̂,

where

    a = c̄ + Ā^T ∇u(Āx − b̄) + W̃ x − w.

Because of (4.50b) it is

    (1/β) ||a||^2 ≤ a^T C^{−1} a ≤ (1/α) ||a||^2 .
5.1 Introduction
According to the methods for converting an optimization problem with a
random parameter vector a = a(ω) into a deterministic substitute problem,
see Chap. 1 and Sects. 2.1.1, 2.1.2, 2.2, a basic problem is the minimization of
the total expected cost function F = F (x) on a closed, convex feasible domain
D for the input or design vector x.
For simplification of notation, here the random total cost function f = f (a(ω), x)
is denoted also by f = f (ω, x), hence,

    f (ω, x) := f (a(ω), x).
with a certain vector-valued function G = G(w, x), may be derived then, cf.
also [149, 159].
One of the main methods for solving problem (5.1) is the stochastic ap-
proximation procedure [36, 44, 46, 52, 69, 71, 112, 116]

    Yn^s = Σ_{j=1}^r [ f (ω^{(1)}_{n,j} , Xn + cn ej ) − f (ω^{(2)}_{n,j} , Xn − cn ej ) ] / (2cn ) · ej    (5.4c)

with the unit vectors e1 , . . . , er of IR^r . Here, ωn , n = 1, 2, . . . , and ω^{(k)}_{n,j} , k =
1, 2, j = 1, 2, . . . , r, n = 1, 2, . . . , resp., are sequences of independent re-
alizations of the random element ω. Using formula (5.4b), (5.4c), resp.,
algorithm (5.4a) represents a Robbins–Monro (RM)-, a Kiefer–Wolfowitz
(KW)-type stochastic approximation procedure. Moreover, pD designates the
projection of IRr onto D, and ρn > 0, n = 1, 2, . . . , denotes the sequence of
step sizes. A standard condition for the selection of (ρn ) is given [34–36] by
    Σ_{n=1}^∞ ρn = +∞,   Σ_{n=1}^∞ ρn^2 < +∞,                                   (5.4d)

and corresponding conditions for (cn ) can be found in [164]. Due to their sto-
chastic nature, algorithms of the type (5.4a) have only a very small asymptotic
convergence rate in general, cf. [164].
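To make the recursion (5.4a) with the Kiefer–Wolfowitz estimator (5.4c) concrete, the following Python sketch runs the projected procedure on a simple test problem; the noisy cost function, the box-shaped feasible domain D and the constants c, q and cn are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(omega, x):
    # illustrative noisy total cost: quadratic objective plus zero-mean noise
    return 0.5 * np.sum((x - 1.0) ** 2) + 0.1 * omega

def project_box(x, lo=-5.0, hi=5.0):
    # projection p_D onto a box-shaped feasible domain D (assumed for the sketch)
    return np.clip(x, lo, hi)

r = 3                      # dimension of the decision vector x
X = np.zeros(r)            # starting point X_1
c, q = 1.0, 10             # constants of the standard step sizes rho_n = c/(n+q)

for n in range(1, 2001):
    rho_n = c / (n + q)            # satisfies (5.4d): sum rho_n = inf, sum rho_n^2 < inf
    c_n = 1.0 / n ** 0.25          # finite-difference widths c_n -> 0
    Y = np.zeros(r)
    for j in range(r):             # Kiefer-Wolfowitz estimator (5.4c)
        e_j = np.eye(r)[j]
        w1, w2 = rng.standard_normal(2)   # independent realizations of omega
        Y += (f(w1, X + c_n * e_j) - f(w2, X - c_n * e_j)) / (2.0 * c_n) * e_j
    X = project_box(X - rho_n * Y)       # recursion (5.4a)

print("approximate minimizer:", X)       # should be close to (1, 1, 1)
```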
Considerable improvements of (5.4a) can be obtained [87] by
(1) Replacing −Yns at certain iteration points Xn , n ∈ IN1 , by an improved
step direction hn of F at Xn , see [79, 80, 84, 86, 90]
(2) Step size control, cf. [85, 89, 117]
chosen by the decision maker [17, 63, 106]. Simple estimators are
    y^{(i)} := f (ω^{(i)} , x^{(i)} ),   i = 1, 2, . . . , p,                      (5.6c)
where ω (1) , ω (2) , . . . , ω (p) are independent realizations of the random element ω.
The objective function F is then approximated – on S – mostly by a
polynomial response surface model F̂ (x) = F̂ (x|βj , j ∈ J). In practice, usually
first and/or second order models are used. The coefficients βj , j ∈ J, are
determined by means of regression analysis (mainly least squares estimation),
see [23, 55, 63]. Having F̂ (x), the RSM-gradient estimator ∇̂F (x0 ) at x0 is
defined by the gradient (with respect to x)

    ∇̂F (x0 ) := ∇F̂ (x0 ).                                                        (5.6d)
In Phase 1 of RSM, i.e., if the process (Xn ) is still far away from an optimal
point x∗ of (5.1), then F is estimated on S by the linear empirical model
where

    β = (β0 , βI^T )^T = (β0 , β1 , . . . , βr )^T                                 (5.7b)

is the (r + 1)-vector of unknown coefficients of the linear model (5.7a). Having
estimates y^{(i)} of the function values F (x^{(i)} ) at the so-called design points
β̂ = (W T W )−1 W T y. (5.8a)
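As a minimal sketch of the Phase 1 computation, the following Python code fits the linear model (5.7a) by least squares, cf. (5.8a), to noisy function values at p design points around x0 and returns βI as RSM gradient estimate; the test function and the choice of design points are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def F_noisy(x):
    # illustrative noisy estimate of F(x); the true gradient at x is x itself
    return 0.5 * x @ x + 0.05 * rng.standard_normal()

r, p = 3, 12
x0 = np.array([1.0, -2.0, 0.5])
D_incr = 0.1 * rng.standard_normal((p, r))        # increments d^(i) = x^(i) - x0
y = np.array([F_noisy(x0 + d) for d in D_incr])   # estimates y^(i), cf. (5.6c)

# design matrix with rows (1, d^(i)T), cf. (5.8b)
W = np.hstack([np.ones((p, 1)), D_incr])
beta_hat = np.linalg.solve(W.T @ W, W.T @ y)      # least squares estimate (5.8a)

grad_est = beta_hat[1:]                            # beta_I = RSM gradient estimator
print("estimated gradient:", grad_est)             # true gradient is x0 = (1, -2, 0.5)
```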
where

    R1^{(i)} = ½ d^{(i)T} ∇2F (x0 + ϑi d^{(i)} ) d^{(i)} ,   0 < ϑi < 1,           (5.10c)

denotes the remainder of the first order Taylor expansion. Obviously, if S is a
convex subset of the feasible domain D of (5.1), and the Hessian ∇2F (x) of
the objective function F of (5.1) is bounded on S, then

    |R1^{(i)}| ≤ ½ ||d^{(i)}||^2 sup_{0≤ϑ≤1} ||∇2F (x0 + ϑd^{(i)} )||
              ≤ ½ ||d^{(i)}||^2 sup_{x∈S} ||∇2F (x)|| .                             (5.10d)
    y = W β + R1 + ε,                                                               (5.11a)

where W, y are given by (5.8b), and the vectors β, R1 and ε are defined by

    β = (F (x0 ), ∇F (x0 )^T )^T ,   R1 = (R1^{(1)} , R1^{(2)} , . . . , R1^{(p)} )^T ,   ε = (ε^{(1)} , ε^{(2)} , . . . , ε^{(p)} )^T .   (5.11b)

    β̂ = (F (x0 ), ∇F (x0 )^T )^T + (W^T W )^{−1} W^T (R1 + ε).                      (5.12a)
where d is defined by

    d = (1/p) Σ_{i=1}^p d^{(i)} ,   d^{(i)} = x^{(i)} − x0 .                         (5.12c)
    β̂ = (W^T W )^{−1} W^T y.                                                        (5.14a)

Here the p × ½(r^2 + 3r + 2) matrix W and the p-vector y are defined by

    W := the matrix with rows (1, d^{(i)T} , z^{(i)T} ), i = 1, . . . , p,
    y := (y^{(1)} , y^{(2)} , . . . , y^{(p)} )^T ,                                   (5.14b)

and therefore

    ∇̂F (x0 ) = β̂I .                                                                 (5.15b)
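For Phase 2, each row of the design matrix W in (5.14b) consists of 1, the increment d^{(i)} and the vector z^{(i)} of second order monomials in d^{(i)}; the following sketch (again with an assumed test function) builds these rows and recovers the gradient estimate β̂I of (5.15b).

```python
import numpy as np

rng = np.random.default_rng(2)

def F_noisy(x):
    # illustrative noisy objective with nonzero curvature
    return x[0] ** 2 + 2.0 * x[1] ** 2 + x[0] * x[1] + 0.05 * rng.standard_normal()

def second_order_terms(d):
    # z^(i): squares and mixed products of the components of d^(i), cf. (5.14c)
    r = len(d)
    return np.array([d[i] * d[j] for i in range(r) for j in range(i, r)])

r = 2
x0 = np.array([0.5, -1.0])
p = 20
D_incr = 0.2 * rng.standard_normal((p, r))
y = np.array([F_noisy(x0 + d) for d in D_incr])

# rows (1, d^(i)T, z^(i)T) of the p x (r^2 + 3r + 2)/2 matrix W, cf. (5.14b)
W = np.array([np.concatenate(([1.0], d, second_order_terms(d))) for d in D_incr])
beta_hat = np.linalg.solve(W.T @ W, W.T @ y)   # (5.14a)

grad_est = beta_hat[1:1 + r]                    # beta_I, cf. (5.15b)
print("estimated gradient:", grad_est)          # true gradient at x0 is (0.0, -3.5)
```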
Note. In practice, in the beginning of Phase 2 only the pure quadratic terms
βii (xi − x0i )2 , i = 1, . . . , r, may be added (successively) to the linear model
(5.7a).
For the consideration of the accuracy of the estimator (5.15b) we use now
second order Taylor expansion. Corresponding to (5.10b), we obtain
    y^{(i)} = F (x^{(i)} ) + ε^{(i)}
            = F (x0 ) + ∇F (x0 )^T d^{(i)} + ½ d^{(i)T} ∇2F (x0 ) d^{(i)} + R2^{(i)} + ε^{(i)} ,   (5.16a)

where

    R2^{(i)} = (1/3!) ∇3F (x0 + ϑi d^{(i)} ) · (d^{(i)} )^3 ,   0 < ϑi < 1,          (5.16b)

denotes the remainder of the second order Taylor expansion. Clearly, if S
is a convex subset of D and the multilinear form ∇3F (x) of all third order
derivatives of F is bounded on S, then

    |R2^{(i)}| ≤ (1/6) ||d^{(i)}||^3 sup_{0≤ϑ≤1} ||∇3F (x0 + ϑd^{(i)} )||
              ≤ (1/6) ||d^{(i)}||^3 sup_{x∈S} ||∇3F (x)|| .                           (5.16c)
    y = W β + R2 + ε.                                                                 (5.17a)

Here, W, y are given by (5.14b), and the vectors β, R2 , ε are defined, cf. (5.13b),
(5.13b′) for the notation, by

    β = ( F (x0 ), ∇F (x0 ), ½ ∇2F (x0 ) ),   R2 = (R2^{(1)} , . . . , R2^{(p)} )^T ,   ε = (ε^{(1)} , . . . , ε^{(p)} )^T .   (5.17b)
Corresponding to (5.11c)–(5.11e) we also suppose in Phase 2 that

Putting the data model (5.17a) into (5.14a), we get – see (5.13b), (5.13b′)
concerning the notation –

    β̂ = ( F (x0 ), ∇F (x0 ), ½ ∇2F (x0 ) ) + (W^T W )^{−1} W^T R2 + (W^T W )^{−1} W^T ε.   (5.18a)

Especially, for the estimator β̂I of ∇F (x0 ), see (5.15a), (5.15b), we find

    β̂I = ∇F (x0 ) + U^{−1} [ (1/p)(d^{(1)} − d, . . . , d^{(p)} − d)
         − ( (1/p) Σ_{i=1}^p (d^{(i)} − d)(z^{(i)} − z)^T ) Z^{−1} (z^{(1)} − z, . . . , z^{(p)} − z) ] (R2 + ε).   (5.18b)
The vectors d^{(i)} , d, z^{(i)} are given by (5.8c), (5.12c), (5.14c), resp., and z, Z, U
are defined by

    z = (1/p) Σ_{i=1}^p z^{(i)} ,   Z = (1/p) Σ_{i=1}^p (z^{(i)} − z)(z^{(i)} − z)^T ,          (5.18c), (5.18d)

    U = (1/p) Σ_{i=1}^p (d^{(i)} − d)(d^{(i)} − d)^T
        − ( (1/p) Σ_{i=1}^p (d^{(i)} − d)(z^{(i)} − z)^T ) Z^{−1} ( (1/p) Σ_{i=1}^p (z^{(i)} − z)(d^{(i)} − d)^T ).   (5.18e)
According to (5.9) and (5.12b), (5.15b) and (5.18b), resp., in both phases for
the gradient estimator ∇̂F (x0 ) we find

    ∇̂F (x0 ) = ∇F (x0 ) + H(R + ε).                                                    (5.19)

The mean square error V = V (x0 ) of the estimator ∇̂F (x0 ),

    V := E( ||∇̂F (x0 ) − ∇F (x0 )||^2 | x0 ),                                           (5.20a)

is the sum of the squared bias ||e_det||^2 := ||HR||^2 and the variance of ∇̂F (x0 ),

    E( ||e_stoch||^2 | x0 ) := E( ||Hε||^2 | x0 ).                                        (5.20d)
Putting
1
p0 (0)
(0) T (0) 1
p0
(0)
H0 := d(i) − d d(i) − d , d := d(i) , (5.23a)
p0 i=1
p0 i=1
1 1 (0)−1
HH T = H . (5.23b)
ν p0 0
where d(i) , d, z (i) , z, Z are given by (5.8c), (5.12c), (5.14c), (5.18c), (5.18d),
resp., and H0 is defined, cf. (5.18e), here by
H0 := U. (5.24b)
where the last inequality holds if S, cf. (5.6a), (5.6b) is convex, and ∇2 F (x)
is bounded on S. In Phase 2, with (5.16b), (5.17b), we find
p 2
1
R2 2 ≤ d(i) 6 ∇3 F x0 + ϑi d(i)
36 i=1
p 2
1
≤ d(i) 6 sup ∇3 F x0 + ϑd(i)
36 i=1 0≤ϑ≤1
p
1
≤ d
(i) 6
sup ∇3 F (x)2 , (5.27b)
36 i=1 x∈S
H0 = µ2 H̃0 , (5.29b)
where the r × r matrix H̃0 follows from (5.21b), (5.24b), resp., by replacing
d(i) by d˜(i) , i = 1, . . . , p. Using (5.29a), from (5.28) we get for l = 1, 2
⎛
µ2l
p 2
⎜ 1 ˜ l+1 ˜ (i)
V ≤ ⎝ 2 d
(i) 2(l+1)
sup ∇ F (x0 + ϑ d )
p i=1 0≤ϑ≤1
(l + 1)!
⎞
rµ−2 ⎟
+ max σi2 ⎠ H̃0−1 . (5.30)
p l≤i≤p
i.e., the increments d^{(1)} , . . . , d^{(p)} and p may depend on the stage index n and
the iteration point Xn . In the most simple case we have

cf. (5.14c). Note that in case (5.31b) the matrices Wn are fixed.
Finally, the RSM-step directions hn are defined by

    hn = −Yn^e = −∇̂F (Xn ) := −β̂(n)I   with   β̂(n) = (Wn^T Wn )^{−1} Wn^T yn ,         (5.32)

see (5.8a), (5.9) for Phase 1 and (5.14a), (5.15b) for Phase 2. Of course, we
suppose that all matrices Wn^T Wn , n = 1, 2, . . . , are regular, see (5.8d), (5.14d).
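A minimal sketch of the resulting hybrid procedure: at stages n ∈ IN1 the RSM step direction hn of (5.32) is used, at all other stages a simple stochastic gradient as in (5.4b); the alternation pattern, the test problem and the projection are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def grad_noisy(x):
    # simple stochastic gradient Y_n^s: true gradient of the test objective plus noise
    return (x - 1.0) + 0.5 * rng.standard_normal(x.shape)

def rsm_direction(x, p=12, spread=0.1):
    # Phase-1 RSM step direction (5.32): fit a linear model around x and return -beta_I
    D_incr = spread * rng.standard_normal((p, len(x)))
    y = np.array([0.5 * np.sum((x + d - 1.0) ** 2)
                  + 0.05 * rng.standard_normal() for d in D_incr])
    W = np.hstack([np.ones((p, 1)), D_incr])
    beta = np.linalg.solve(W.T @ W, W.T @ y)
    return -beta[1:]

r = 4
X = np.zeros(r)
for n in range(1, 501):
    rho_n = 1.0 / (n + 10)
    if n % 5 == 0:                 # n in IN1: every fifth step uses the RSM direction
        h = rsm_direction(X)
    else:                          # n in IN2: simple stochastic gradient step
        h = -grad_noisy(X)
    X = np.clip(X + rho_n * h, -5.0, 5.0)   # projection p_D onto a box (assumed)

print(X)   # close to the minimizer (1, 1, 1, 1) of the test objective
```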
    k1 ≥ k0 > 0.                                                                       (5.34c)

For the variance Vn^s (Xn ) of the simple stochastic gradient Yn^s according to
(5.4b), if (5.3a) holds true, we suppose that

    Vn^s (Xn ) ≤ σ1^{s2} + C1^s ||Xn − x∗||^2   with constants σ1^{s2} , C1^s ≥ 0.      (5.35b)

For n ∈ IN2 with Yn^s given by (5.4b) we obtain [89]

    b_{n+1} ≤ (1 − 2k0 ρn + (k1^2 + C1^s )ρn^2 ) bn + σ1^{s2} ρn^2 .                    (5.36)
For n ∈ IN1 we have, see (5.32), Xn+1 = pD (Xn − ρn Yn^e ), and therefore [89]

    E( ||Xn+1 − x∗||^2 | Xn ) ≤ ||Xn − x∗||^2
        − 2ρn E( (Xn − x∗ )^T (Yn^e − ∇F (x∗ )) | Xn )
        + ρn^2 [ E( ||Yn^e − ∇F (Xn )||^2 | Xn )
                 + 2 E( (Yn^e − ∇F (Xn ))^T (∇F (Xn ) − ∇F (x∗ )) | Xn )
                 + ||∇F (Xn ) − ∇F (x∗ )||^2 ] .                                         (5.37)
In this equation, cf. (5.10c) and (5.11b), (5.16b) and (5.17b), resp., (5.31a)–
(5.31g),
⎛ 1 ⎞ ⎛ ⎞
Rl (Xn , d(1,n) ) ε(1,n)
⎜ R2 (Xn , d(2,n) ) ⎟ ⎜ ε(2,n) ⎟
⎜ l ⎟ ⎜ ⎟
Rn = ⎜ .. ⎟ , l = 1, 2, εn = ⎜ .. ⎟ (5.38b)
⎝ . ⎠ ⎝ . ⎠
Rlp (Xn , d(p(n),n) ) ε(p(n),n)
is the vector of remainders resulting from using first, second order empirical
models for F , the vector of stochastic errors in estimating F (x) at Xn +
d(i,n) , i = 1, 2, . . . , p(n), respectively. Furthermore, Hn is the r × p matrix
given by (5.21a), (5.24a), resp., if there we put d(i) := d(i,n) , z (i) := z (i,n) , i =
1, 2, . . . , p = p(n). Hence,
E(Yne |Xn ) = ∇F (Xn ) + edet
n ,
1
≤ Rn 2 + r max σi (Xn ) H −1 , (5.39d)
1≤i≤p(n) p(n) n0
where Hn0 is given by (5.21b), (5.24b), respectively. From (5.37) and (5.39a)–
(5.39d) we now have
E Xn − x∗ 2 |Xn ≤ (1 − 2k0 ρn + k12 ρ2n )Xn − x∗ 2
T
∗ 2 det T ∗
−2ρn edet
n (X n − x ) + 2ρ e
n n ∇F (X n ) − ∇F (x )
1
+ρ2n Rn 2 + r max σi (Xn )2 H −1 . (5.40)
1≤i≤p(n) p(n) n0
T T
Estimating next edet
n (Xn − x∗ ) and edet
n ∇F (Xn ) − ∇F (x∗ ) , we get,
see (5.20c), (5.21c), (5.24c), (5.34b)
/
det T ∗ −1 1
en (Xn − x ) ≤ Hn0 Rn · Xn − x∗ , (5.41a)
p(n)
/
det T −1 1
en ∇F (Xn ) − ∇F (x∗ ) ≤ k1 Hn0 Rn · Xn − x∗ . (5.41b)
p(n)
According to (5.27c) we find
⎛
p(n)
1 1 ⎝ 1
Rn ≤ d(i,n) 2(ln +1)
p(n) (ln + 1)! p(n) i=1
⎞1/2
2
× sup ∇ln +1 F (Xn + ϑd(i,n) ) ⎠ . (5.41c)
0≤ϑ≤1
Here, (ln )n∈IN1 is a sequence with ln ∈ {1, 2} for all n ∈ IN1 such that
Note. For the transition from Phase 1 to Phase 2 statistical F tests may be
applied. However, since a sequence of tests must be applied, the significance
level is not easy to calculate.
Furthermore, suppose that representation (5.29a) holds, i.e.,
where d˜(i,n) , i = 1, . . . , p(n), are given r-vectors. Then Hn0 = µ2n H n0 , cf.
(5.29b), and (5.41c) yields
⎛
/ / ln p(n)
−1 1 −1 µ ⎝ 1
Hn0 Rn ≤ H n0
n
d˜(i,n) 2(ln +1)
p(n) (ln + 1)! p(n) i=1
⎞1/2
2
ln +1
× sup ∇ F (Xn + ϑd˜(i,n) ) ⎠ . (5.41e)
0≤ϑ≤1
Putting (5.41a), (5.41b), (5.41e) and (5.43) into (5.40), with (5.42) we
get
E Xn+1 − x∗ 2 |Xn ≤ 1 − 2[k0 − Γn µlnn Tn ]ρn
+ (k1 + Γn µlnn Tn )2 ρ2n Xn − x∗ 2
µ−2 −1 max σi (Xn )2 + γn2 (µlnn Tn )2
+ ρ2n r n
H
p(n) n0 1≤i≤p(n)
+ 2γn (ρn + k1 ρ2n )µlnn Tn Xn − x∗ , n ∈ IN1 . (5.44a)
Here, Tn is given by
⎛ ⎞1/2
/ p(n)
1 −1 ⎝ 1
Tn = H n0 d˜(i,n) 2(ln +1) ⎠ . (5.44b)
(ln + 1)! p(n) i=1
1 ≤ i ≤ p(n), (5.45)
with constants σ1F , C1F ≥ 0. Taking expectations on both sides of (5.44a), we
find the following result:
Theorem 5.1. Assume that (5.34a)–(5.34c), (5.42), (5.45) hold, and use rep-
resentation (5.29a). Then
    b_{n+1} ≤ { 1 − 2[ k0 − Γn µ_n^{l_n} Tn ]ρn + [ (k1 + Γn µ_n^{l_n} Tn )^2
                 + r C1F (µ_n^{−2}/p(n)) ||H̃_{n0}^{−1}|| ] ρn^2 } bn
              + 2γn (ρn + k1 ρn^2 ) µ_n^{l_n} Tn E||Xn − x∗||
              + [ r σ1F^2 (µ_n^{−2}/p(n)) ||H̃_{n0}^{−1}|| + γn^2 (µ_n^{l_n} Tn )^2 ] ρn^2   for all n ∈ IN1 ,   (5.46)
where Tn is given by (5.44b).
Remark 5.5. Since the KW-gradient estimator (5.4c) can be represented as a
special (Phase 1-)RSM-gradient estimator, for the estimator Yns according to
(5.4c) we obtain again an error recursion of type (5.46). Indeed, we have only
to set ln = 1 and to use appropriate (KW-)increments d(i,n) := d(i,0) , i =
1, . . . , p, n ≥ 1.
Suppose that γn = 0 for all n ∈ IN1 . Hence, according to (5.42), the sequence of
response surface models F̂n is asymptotically correct. In this case inequalities
(5.36) and (5.46) yield

    b_{n+1} ≤ (1 − 2K^e_{0,n} ρn + K^{e2}_{1,n} ρn^2 ) bn + σ^{e2}_{1,n} ρn^2      for n ∈ IN1 ,
    b_{n+1} ≤ (1 − 2k0 ρn + (k1^2 + C1^s )ρn^2 ) bn + σ1^{s2} ρn^2                  for n ∈ IN2 ,   (5.47a)

where k0 , k1^2 + C1^s , σ1^{s2} are the same constants as in (5.36) and

    K^e_{0,n} = k0 − Γn µ_n^{l_n} Tn                                                  (5.47b)
    K^{e2}_{1,n} = (k1 + Γn µ_n^{l_n} Tn )^2 + (r C1F /(µ_n^2 p(n))) ||H̃_{n0}^{−1}||   (5.47c)
    σ^{e2}_{1,n} = (r σ1F^2 /(µ_n^2 p(n))) ||H̃_{n0}^{−1}|| .                           (5.47d)

A weak assumption is that

    K^e_{0,n} ≥ K0^e > 0         for all n ∈ IN1                                       (5.48a)
    K^{e2}_{1,n} ≤ K1^{e2} < +∞   for all n ∈ IN1                                      (5.48b)
    σ^{e2}_{1,n} ≤ σ1^{e2} < +∞   for all n ∈ IN1                                      (5.48c)

with constants K0^e > 0, K1^{e2} < +∞ and σ1^{e2} < +∞; note that k0 ≥ K0^e , see
(5.47b).
Remark 5.6. Suppose that (Γn )n∈IN1 , cf. (5.42), is bounded, i.e., 0 ≤ Γn ≤
Γ0 < +∞ for all n ∈ IN1 with a constant Γ0 . The assumptions (5.48a)–(5.48c)
can be guaranteed, e.g., as follows:
(a) If, for n ∈ IN1 , F (x) is estimated ν(n) times at each one of p0 different
n0 = H
points Xn + d(i) , i = 1, 2, . . . , p0 , see (5.22a), (5.22b), then H (0)
0
with a fixed matrix H (0)
for all n ∈ IN1 . Since ln ∈ {1, 2} for all n ∈ IN1 ,
0
we find, cf. (5.44b), Tn ≤ T0 for all n ∈ IN1 with a constant T0 < +∞.
Because of 0 < µn ≤ 1, p(n) = ν(n)p0 , (5.47b)–(5.47d) yield
e
K0,n ≥ k0 − Γ0 T0 µn
rC1F (0)−1 rσ1F 2 (0)−1
e2
K1,n ≤ (k1 + Γ0 T0 )2 + H0 , e2
σ1,n ≤ H0 .
µ2n µ2n
Thus, conditions (5.48a)–(5.48c) hold for arbitrary ν(n), n ∈ IN1 , if
k0 − K0e
≥ µn ≥ µ0 > 0 for all n ∈ IN1 with
Γ0 T 0
k0 − K0e
0 < K0e < k0 , 0 < µ0 < . (5.48d)
Γ0 T 0
    b_n^s := E||X_n^s − x∗||^2 ,   n = 1, 2, . . . ,

with coefficients 0 < σ0^s ≤ σ1^s , 0 ≤ C0^s ≤ C1^s . Using (5.34a), (5.34b), we find,
see [85, 89], for (b_n^s ) also the “lower” recursion

    b^s_{n+1} ≥ (1 − 2k1 ρn + (k0^2 + C0^s )ρn^2 ) b_n^s + σ0^{s2} ρn^2   for n ≥ n0 .   (5.51a)

Having the lower recursion (5.51a) for (b_n^s ), we can compare algorithms
(5.4a), (5.4b) and (5.5) by means of the quotients

    bn /b_n^s ≤ bn /B_n^s   for n ≥ n0 ,                                                  (5.52a)

where the sequence (B_n^s ) of lower bounds B_n^s ≤ b_n^s , n ≥ n0 , is defined by

    B^s_{n+1} = (1 − 2k1 ρn + (k0^2 + C0^s )ρn^2 ) B_n^s + σ0^{s2} ρn^2 ,  n ≥ n0 ,  with B^s_{n0} ≤ b^s_{n0} .   (5.52b)

In all other cases we consider

    bn /B̄_n^s ,   n = 1, 2, . . . ,                                                       (5.52c)

where the sequence of upper bounds B̄_n^s ≥ b_n^s , n = 1, 2, . . . , is defined by

    B̄^s_{n+1} = (1 − 2k0 ρn + (k1^2 + C1^s )ρn^2 ) B̄_n^s + σ1^{s2} ρn^2 ,  n = 1, 2, . . . ,  B̄_1^s ≥ b_1^s ,   (5.52d)
represents the “worst case” of (5.4a), (5.4b). For the standard step sizes

    ρn = c/(n + q)   with constants c > 0, q ∈ IN ∪ {0}                                    (5.53)

and with k1 ≥ k0 > 0 we have, see [164],

    lim_{n→∞} n · B_n^s = σ0^{s2} c^2/(2k1 c − 1)   for all c > 1/(2k1 ),                  (5.54a)

    lim_{n→∞} n · B̄_n^s = σ1^{s2} c^2/(2k0 c − 1) ≥ σ0^{s2} c^2/(2k1 c − 1)   for all c > 1/(2k0 ).   (5.54b)
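As a small numerical check of (5.54a), (5.54b), the following snippet iterates the bounding recursions (5.52b), (5.52d) with assumed constants and compares n·B with the corresponding limit; all parameter values are illustrative.

```python
# numerical check of the limits (5.54a), (5.54b) for step sizes rho_n = c/(n+q)
k0, k1 = 0.5, 1.0          # assumed curvature bounds, k1 >= k0 > 0
C0, C1 = 0.2, 0.4          # assumed constants C_0^s, C_1^s
s0sq, s1sq = 1.0, 2.0      # sigma_0^{s2}, sigma_1^{s2}
c, q = 3.0, 5              # c > 1/(2*k0), so both limits exist

B_low, B_up = 1.0, 1.0     # lower bound B_n^s and upper bound bar{B}_n^s
for n in range(1, 200001):
    rho = c / (n + q)
    B_low = (1 - 2 * k1 * rho + (k0 ** 2 + C0) * rho ** 2) * B_low + s0sq * rho ** 2
    B_up = (1 - 2 * k0 * rho + (k1 ** 2 + C1) * rho ** 2) * B_up + s1sq * rho ** 2

n = 200001
print(n * B_low, s0sq * c ** 2 / (2 * k1 * c - 1))   # both close to (5.54a)
print(n * B_up, s1sq * c ** 2 / (2 * k0 * c - 1))    # both close to (5.54b)
```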
For the standard step sizes (5.53) the asymptotic behavior of (bn /bsn ),
s
(bn /B n ), resp., can be determined by means of the methods developed in the
following Sect. 5.5:
Note. It is easy to see that the right hand side of (5.55a), (5.55b), resp., takes
any value below 1, provided that σ1e2 and the rate N/(N + M ) of steps with
a “simple” gradient estimator Yns are sufficiently small.
Instead of applying the standard step size rule (5.53) and having then
Theorem 5.2, we may use the optimal step sizes developed in [89]. Here esti-
mates of the coefficients appearing in the optimal step size rules may be
obtained from second order response surface models, see also the next Chap. 6.
Assume now that γn > 0 for all n ∈ IN1 . For n ∈ IN1 , according to Theorem 5.1
we have first to estimate Exn − x∗ : For any given sequence (δn )n∈IN1 of
positive numbers, we know [164] that
1
EXn − x∗ ≤ δn + bn for all n ∈ IN1 . (5.56)
δn
4γn (1 + k1 ρn )Tn ln
δn := · µn , n ∈ IN1 , (5.57)
k0 − Γn µlnn Tn
bn+1 ≤ un bn + vn , n = 1, 2, . . . (5.59)
3 e 3
0<A< K ≤ k0 < 2k0 , (5.63a)
2 0 2
we find that (5.61a), (5.61b) and therefore also (5.60) hold, provided that
ρn → 0, n → ∞ (5.63b)
ρn
→ 0, n → ∞ with n ∈ IN1 . (5.63c)
µ2n p(n)
Note that (5.63c) holds for any sequence p(n) if
n∈IN1
ρn
→ 0, n → ∞ with n ∈ IN1 . (5.63c )
µ2n
Having (5.62a)–(5.62c) and (5.63a)–(5.63c), inequalities (5.59) and (5.60) yield
with certain constants c0 > 0, T0 > 0, h0 > 0. Under these assumptions con-
dition (5.63e) holds if – besides (5.63f) – we still demand that
ρ2n
(µlnn )2 ρn < +∞, < +∞ (5.65a), (5.65b)
µ2n p(n)
n∈IN1 n∈IN1
'
∞
cf. (5.63b). Since 0 < µn ≤ 1, n ∈ IN1 , (5.63f) and (5.65c) hold if ρ2n < +∞.
n=1
In summary, in the present case we have the following result:
Theorem 5.3. Suppose that, besides the general assumptions guaranteeing in-
equalities (5.36) and (5.46), (γn )_{n∈IN1} , (Γn )_{n∈IN1} , (Tn )_{n∈IN1} and (||H̃_{n0}^{−1}||)_{n∈IN1}
are bounded. If (ρn ), (µn )_{n∈IN1} , (p(n))_{n∈IN1} are selected such that
(1) 0 < µn < (k0 − K0^e )/(Γ0 T0 ) for all n ∈ IN1 , where 0 < K0^e < k0 and
    Γ0 > 0, T0 > 0 are upper bounds of (Γn )_{n∈IN1} , (Tn )_{n∈IN1} ;
(2) if (p(n))_{n∈IN1} is bounded and µn = O(n^{−1/4} ), then bn = O(n^{−1/2} ) for all
    c > 3/(2K0^e ).
bn = EXn − x∗ 2 ,
In many cases [77, 82, 83] h(x) can be represented by h(x) = y − x with a
certain vector y ∈ D. Hence, in this case we get
pD (Xn + n hn ) = Xn + n hn , if 0 < n ≤ 1,
for every optimal solution x∗ of (5.1). This assumption is justified, since for a
convex, differentiable objective function F , the optimal solutions x∗ of (5.1)
can be characterized [16] by pD x∗ − ∇F (x∗ ) = x∗ for all > 0, and in x∗
there is no feasible descent direction h(x∗ ) on the other hand.
The hybrid algorithm (5.5) can be represented then by
Xn+1 = pD Xn − n A(Xn ) + ξn , n = 1, 2, . . . , (5.69a)
where βn is defined by
βn = γ1 + Eξn 2 . (5.74b)
From (5.69b) and (5.72) we obtain E(ξn |Xn ) = 0 a.s. for every n. Hence,
from the above inequality we a.s. have
E Xn+1 − x∗ 2 |Xn ≤ (1 − 2αn + γ2 2n )Xn − x∗ 2
+γ1 2n + 2n E ξn 2 |Xn .
where a is a fixed positive number such that 0 < a < 2α. Furthermore, assume
that (5.80) holds.
Then (5.74a), (5.74b) and (5.81) yield the recursion
Next we want to consider the “worst case” in (5.82), i.e., the recursion
Corresponding to Corollary 5.1, for the upper bound Bn of the mean square
error bn we have this
Corollary 5.2. Suppose that all assumptions in Lemma 5.1 are fulfilled and
c
(5.80) holds. If n = with c > 0 and q ∈ IN ∪ {0}, then for (Bn ) we
n+q
have the asymptotic formulas
Bn = 0(n−1 ), if ac > 1,
log n
Bn = 0 , if ac = 1,
n
Bn = 0(n−ac ), if ac < 1
Since by means of Corollaries 5.1 and 5.2 we can not distinguish between
the convergence behavior of the pure stochastic and the hybrid stochastic
approximation algorithm, we have to consider recursion (5.84a), (5.84b) in
more detail.
It turns out that the improvement of the speed of convergence resulting
from the use of both stochastic and deterministic directions depends on the
rate of stochastic and deterministic steps taken in (5.69a)–(5.69c).
Note. For simplicity of notation, an iteration step Xn → Xn+1 using an im-
proved step direction, a standard stochastic gradient is called a “deterministic
step,” a “stochastic step,” respectively.
Let us assume that the rate of stochastic and deterministic steps taken in
M
(5.69a)–(5.69c) is fixed and given by , i.e., M deterministic and N stochas-
N
tic steps are taken by turns beginning with M deterministic steps.
Then Bn satisfies recursion (5.84a), (5.84b) with β̄n given by (5.83), i.e.,
γ1 for n ∈ IN1
β̄n =
γ1 + σ 2 for n ∈ IN2 ,
where
A for n ∈ IN1
β̃n =
B for n ∈ IN2 ,
A = c2 γ1 , B = c2 (γ1 + σ 2 ) and ϑ = ac.
We need two lemmata (see [164] for the first and [127] for the second).
Lemma 5.3. Let ϑ ≥ 1 be a real number and
n
ϑ
Ck,n = 1− for k, n ∈ IN, k < n.
m
m=k+1
yn+1 = Un yn + Vn for n ≥ ν,
where (Un ), (Vn ) are given sequences of real numbers such that Un = 0 for
n ≥ ν. Then
⎛ ⎞
n
V n
yn+1 = ⎝yν + ⎠
j
'j Um for n ≥ ν.
j=ν m=ν Um m=ν
j
ϑ
1− = Cν+q−1,j+q for j ≥ ν ≥ 1,
m=ν
m+q
for every fixed integer ν > n0 . By Lemma 5.3 for every 0 < ε < 1 there is an
integer p = ν − 1, ν > n0 , such that
ϑ ϑ
p+q p+q
(1 − ε) ≤ Cp+q,j+q ≤ (1 + ε) for j > p.
j+q j+q
1
'
n
Bν + (1+ε)(p+q)ϑ
β̃j (j + q)ϑ−2
j=ν Bn+1
'
n ≤ s (5.88)
B Bn+1
Bνs + (1−ε)(p+q)ϑ
(j + q)ϑ−2
j=ν
1
'n
Bν + (1−ε)(p+q)ϑ
β̃j (j + q)ϑ−2
j=ν
≤ '
n .
B
Bνs + (1+ε)(p+q)ϑ
(j + q)ϑ−2
j=ν
'
Since ϑ ≥ 1, the series (j + q)ϑ−2 diverges as n → ∞. So the numbers
j=ν
Bν and Bνs have no influence on the limit of both sides of the above inequalities
(5.88) as n → ∞. Suppose now that we are able to show the equation
S(ν, n) AM + BN
lim = = L, (5.89)
n→∞ T (ν, n) M +N
where
n
S(ν, n) = β̃j (j + q)ϑ−2 ,
j=ν
n
T (ν, n) = (j + q)ϑ−2 .
j=ν
Bn+1
But this can be true only if lim s exists and
n→∞ Bn+1
Bn+1 L AM + BN
lim s = =
n→∞ B B B(M + N )
n+1
which proves (5.87). Hence, it remains to show that the limit (5.89) exists and
is equal to L. For this purpose we may assume without loss of generality that
the existence of the limit (5.89) follows. To see this, let n ≥ ν be given and
let m = mn be the uniquely determined integer such that
m(M + N ) + 1 ≤ n < (m + 1) (M + N ).
Since
S ν, (m + 1)(M + N ) = S(ν, n) + S n + 1, (m + 1)(M + N ) ,
Clearly, from the definition of S(ν, n), T (ν, n) and β̃j ≤ B for all j ∈ IN follows
that
S n + 1, (m + 1)(M + N ) T m(M + N ) + 1, (m + 1)(M + N )
≤B
T ν, (m + 1)(M + N ) T ν, (m + 1)(M + N )
(5.92a)
and
T n + 1, (m + 1)(M + N ) ·S(ν, n) T m(M + N )+1, (m + 1)(M + N )
≤B .
T ν, (m + 1)(M + N ) · T (ν, n) T ν, (m + 1)(M + N )
(5.92b)
Assume now that
T m(M + N ) + 1, (m + 1)(M + N )
lim = 0. (5.93)
n→∞
T ν, (m + 1)(M + N )
Therefore, if the limit (5.90) exists, then also the limit (5.89) exists and
both limits are equal.
For 1 ≤ ϑ ≤ 2 the limit (5.93) follows from
T m(M + N ) + 1, (m + 1)(M + N ) ≤ M + N
and the divergence of T ν, (m + 1)(M + N ) as m → ∞.
For ϑ > 2 we have
ϑ−2
T m(M + N ) + 1, (m + 1)(M + N ) ≤ (M + N ) (m + 1)(M + N ) + q
and with ν = m0 (M + N ) + 1
m (k+1)(M +N )
T ν, (m + 1)(M + N ) = (j + q)ϑ−2
k=m0 j=k(M +N )+1
m ϑ−2
≥ (M + N ) k(M + N ) + q .
k=m0
Because of
k+1
k+1
we have that
m
T ν, (m + 1)(M + N ) ≥ (M + N ) u(k)
k=m0
m
k+1
m+1
1 ϑ−1 ϑ−1
= m(M + N ) + q − (m0 − 1)(M + N ) + q .
ϑ−1
ϑ−1 ⎫−1
(m0 − 1)(M + N ) + q ⎪
⎬
− ϑ−2 ⎪ .
(m + 1)(M + N ) + q ⎭
m−1 (k+1)(M +N )
S ν, m(M + N ) = β̃j (j + q)ϑ−2 (5.94)
k=m0 j=k(M +N )+1
⎛ ⎞
m−1 (k+1)M +kN (k+1)(M +N )
= ⎝A (j + q)ϑ−2 + B (j + q)ϑ−2 ⎠
k=m0 j=k(M +N )+1 j=(k+1)M +kN +1
m−1
= σk ,
k=m0
where σk is defined by
M ϑ−2 M +N ϑ−2
σk = A k(M + N ) + i + q +B k(M + N ) + i + q .
i=1 i=M +1
where τk is given by
M +N ϑ−2
τk = k(M + N ) + i + q .
i=1
Define now the functions
M ϑ−2 M +N ϑ−2
f (x) = A x(M + N ) + i + q +B x(M + N ) + i + q
i=1 i=M +1
and
M +N ϑ−2
g(x) = x(M + N ) + i + q .
i=1
k+1
k+1
and therefore
m m−1 m
f (x)dx ≤ σk ≤ f (x − 1)dx.
m0 k=m0 m0
Analogously we find
m m−1 m
g(x)dx ≤ τk ≤ g(x − 1)dx.
m0 k=m0 m0
k+1
k+1
hence,
m m−1 m
f (x)dx ≥ σk ≥ f (x − 1)dx,
m0 k=m0 m0
B
M +N
+ log (m − 1)(M + N ) + i + q
M +N
i=M +1
A·M
− log (m0 − 1)(M + N ) + i + q = log(m − 1)
M +N
A
M
i+q
+ log M +N + − log (m0 − 1)(M +N ) + i + q
M +N i=1
m−1
M +N
B·N B i+q
+ log(m − 1) + log M + N +
M +N M +N m−1
i=M +1
− log (m0 − 1)(M + N ) + i + q
AM
Since in the above integral representations all terms besides
M +N
BN
log(m − 1), log(m − 1) and log m are bounded as m → ∞, we
M +N
have now
m
f (x − 1)dx
m0 AM log(m − 1) BN
lim m = lim · +
m→∞ m→∞ M +N log m M +N
g(x)dx
m0
log(m − 1) AM + BN
× = .
log m M +N
Analogously
m
f (x)dx
m0 AM + BN
lim = ,
m→∞ m M +N
g(x − 1)dx
m0
A
M ϑ−1
= (m − 1)(M + N ) + i + q
(M + N )(ϑ − 1) i=1
ϑ−1
− (m0 − 1)(M + N ) + i + q
B
M +N ϑ−1
+ (m − 1)(M + N ) + i + q
(M + N )(ϑ − 1)
i=M +1
ϑ−1
− (m0 − 1)(M + N ) + i + q
M
ϑ−1
Amϑ−1 1 i+q
= 1− (M + N ) +
(M + N )(ϑ − 1) i=1 m m
ϑ−1
m0 − 1 i+q
− (M + N ) +
m m
M +N
ϑ−1
Bmϑ−1 1 i+q
+ 1− (M + N ) +
(M + N )(ϑ − 1) m m
i=M +1
ϑ−1
m0 − 1 i+q
− (M + N ) +
m m
as well as
)m mϑ−1
M +N
i+q
ϑ−1
g(x)dx = M +N +
(M + N )(ϑ − 1) m
m0 i=1
ϑ−1
m0 i+q
− (M + N ) +
m m
and therefore
m
f (x − 1)dx
m0 AM (M + N )ϑ−1 + BN (M + N )ϑ−1 AM + BN
lim m = ϑ−1
= .
m→∞ (M + N )(M + N ) M +N
g(x)dx
m0
Analogously
m
f (x)dx
m0 AM + BN
lim = ,
m→∞ m M +N
g(x − 1)dx
m0
    Bn ≈ ( 1 − σ^2/(γ1 + σ^2 ) · w ) B_n^s   as n → ∞,

where w = M/(M + N ) is the percentage of deterministic steps in one complete
turn of M + N deterministic and stochastic steps. Hence, here we find that
asymptotically the semi-stochastic procedure is ( 1 − σ^2/(γ1 + σ^2 ) · w )^{−1} times
faster than the pure stochastic approximation procedure. This conclusion is studied
in more detail in Sect. 5.5.2.
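For instance, with the assumed values γ1 = 0.2, σ^2 = 1 and a fraction w = 1/3 of deterministic steps, the asymptotic speed-up factor evaluates as follows:

```python
gamma1, sigma2 = 0.2, 1.0      # assumed noise parameters gamma_1 and sigma^2
w = 1 / 3                      # fraction M/(M+N) of deterministic steps
speedup = 1.0 / (1.0 - sigma2 / (gamma1 + sigma2) * w)
print(speedup)                 # about 1.38: the hybrid method is roughly 38% faster asymptotically
```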
By the above results we are also able to compare two hybrid stochastic ap-
proximation procedures characterized by the parameters A, B, M, N, ϑ and
and
B̃n ÃM̃ + B̃ Ñ
→ as n → ∞.
s
B̃n B̃(M̃ + Ñ )
Moreover, by Corollary 5.2 we know that for ϑ > 1, ϑ̃ > 1
B
Bns n · Bns ϑ−1 B ϑ̃ − 1
= → = · as n → ∞.
s
B̃n n · B̃ns B̃ B̃ ϑ − 1
ϑ̃−1
Bn
Proof. For we may write
B̃n
Bn Bn B s 1
lim = s · n· .
n→∞ B̃
n Bn B̃ns B̃n
s B̃n
Bn Bn Bs 1 AM + BN B ϑ̃ − 1 B̃(M̃ + Ñ )
lim = lim s · lim n · = · · ·
n→∞ B̃
n
n→∞ Bn n→∞ B̃ s
n lim
B̃n B(M + N ) B̃ ϑ − 1 ÃM̃ + B̃ Ñ
s
n→∞ B̃n
The results obtained above may be used now to find also lower bounds for
the mean square error of our hybrid procedure, provided that some further
hypotheses are fulfilled.
To this end assume that for an optimal solution x∗ of (5.1)
pD Xn − n A(Xn ) + ξn − pD x∗ − n A(x∗ ) (5.97)
= Xn − n A(Xn ) + ξn − x∗ − n A(x∗ ) for all n ∈ IN.
δ1 ≤ γ1 , δ2 ≤ γ2 , α ≤ β and η 2 ≤ σ 2 .
If again (5.72) is valid, i.e., if E(ξn |Xn ) = 0 a.s. for every n ∈ IN2 , then,
using equality (5.98), for bn = EXn − x∗ 2 we find in the same way as in
the proof of Lemma 5.1 the following recursion being opposite to (5.74a),
(5.74b) and (5.82)
bn+1 ≥ (1 − 2βn + δ2 2n )bn + 2n δ1 + Eξn 2 (5.102a)
where
δ1 , if n ∈ IN1
βn = (5.102b)
δ1 + η 2 , if n ∈ IN
In contrast to the “worst case” in (5.82), which is represented by the recursion
(5.84a), (5.84b), here we consider the “best case” in (5.102a), i.e., the recursion
Bn bn Bn
s
≤ s ≤ s for all n ≥ n0 . (5.105)
Bn bn Bn
The (hybrid) stochastic algorithms having the lower and upper mean
square error estimates (B n ) and (Bns ) are characterized – see Sect. 5.5.1 –
by the parameters
A = c2 δ1 , B = c2 (δ1 + η 2 ), M, N, ϑ = 2βc,
and
B̃ = c2 (γ1 + σ 2 ), M̃ = 0, ϑ̃ = ac, respectively.
Hence, if c > 0 is selected such that ac > 1 and 2βc > 1, then Corollary 5.3
yields
c2 δ1 M +c2 (δ1 +η 2 )N
B M +N ac − 1
lim ns = · . (5.106a)
n→∞ Bn c2 (γ1 + σ 2 ) 2βc − 1
Furthermore, the sequences (Bn ), (Bns ), resp., are estimates of the mean square
error of the (hybrid) stochastic algorithms represented by the parameters
A = c2 γ1 , B = c2 (γ1 + σ 2 ), M, N, ϑ = ac
and
B̃ = c2 (δ1 + η 2 ), M̃ = 0, ϑ̃ = 2βc, respectively.
Thus, if ac > 1 and 2βc > 1, then Corollary 5.3 yields
c2 γ1 M +c2 (γ1 +σ 2 )N
Bn M +N 2βc − 1
lim s = · . (5.106b)
n→∞ B n c2 (δ1 + η 2 ) ac − 1
where

    Q1 = [ δ1 M + (δ1 + η^2 )N ] / [ (M + N )(γ1 + σ^2 ) ] · (ac − 1)/(2βc − 1) ,     (5.108a)

    Q2 = [ γ1 M + (γ1 + σ^2 )N ] / [ (M + N )(δ1 + η^2 ) ] · (2βc − 1)/(ac − 1) .     (5.108b)
Proof. Since 0 < a < 2α and α ≤ β, we have ac < 2βc for every c > 0 and
therefore 2βc − 1 > ac − 1 > 0. The assertion follows now from (5.106a) and
(5.106b).
γ1 M + (γ1 + σ 2 )N ac − 1 ac − 1 Bn Bn
Q1 ≤ · = · lim s < lim s
(M + N )(γ1 + σ ) 2βc − 1
2 2βc − 1 n→∞ Bn n→∞ Bn
(5.109)
as also
δ1 M + (δ1 + η 2 )N 2βc − 1 2βc − 1 B B
Q2 ≥ · = · lim ns > lim ns .
(M + N )(δ1 + η ) ac − 1
2 ac − 1 n→∞ Bn n→∞ Bn
(5.110)
Q1 bsn ≤ bn ≤ Q2 bsn as n → ∞.
γ1 M + (γ1 + σ 2 )N ac − 1
< f := (< 1). (5.111)
(M + N )(δ1 + η 2 ) 2βc − 1
Since always
γ1 + σ 2 − f (δ1 + η 2 ) > γ1 + σ 2 − δ1 − η 2 ≥ 0,
N f (δ1 + η 2 ) − γ1
< (5.112a)
M γ1 + σ 2 − f (δ1 + η 2 )
1
(c) If c is selected according to c > , where a0 is a fixed number such that
a0
0 < a0 < 2α, then ac ≥ a0 c > 1 for all a with a0 ≤ a < 2α. Hence, if
1
c > , then Theorem 5.6 holds for all numbers a with a0 ≤ a < 2α.
a0
1
Consequently, if c > , where 0 < a0 < 2α, then the relations (5.107)
a0
with (5.108a) and (5.108b), (5.109) and (5.110) as also (5.111), (5.112a)
and (5.112b) hold also if a is replaced by 2α.
cn = nλ EXn − x∗ 2 , n = 1, 2, . . .
for some λ with 0 ≤ λ < λ̄, where λ̄ is a given fixed positive number. Then,
under the same conditions and using the same methods as in Lemma 5.1,
corresponding to (5.74a), (5.74b) we find
λ
1
cn+1 ≤ 1+ (1 − 2αn + γ2 2n )cn + (n + 1)λ βn 2n , n = 1, 2, . . . ,
n
(5.113a)
where again
βn = γ1 + Eξn 2 (5.113b)
and ξn is defined by (5.69b).
We claim that for n ≥ n1 , n1 ∈ IN, sufficiently large,
λ
1
1+ (1 − 2αn + γ2 2n ) ≤ 1, (5.114)
n
provided that the step size n is suitably chosen.
Indeed, if n → 0 as n → ∞, then there is an integer n0 such that (see
(5.81))
0 < 1 − 2αn + γ2 2n < 1 − an , n ≥ n0
for any number a such that 0 < a < 2α.
According to (5.114) and (5.81) we have to choose now n such that
λ
1
1+ (1 − an ) ≤ 1, n ≥ n1 (5.115)
n
with an integer n1 ≥ n0 . Write
λ
1 pn (λ)
1+ =1+
n n
λ
1
with pn (λ) = n 1+ − 1 . Hence, condition (5.115) has the form
n
n
1 − an ≤ , n ≥ n1 .
n + pn (λ)
Since 0 ≤ λ < λ̄ and therefore pn (λ) ≤ pn (λ̄) for all n ∈ IN, the latter is true if
n
1 − an ≤ , n ≥ n1 .
n + pn (λ̄)
1 pn (λ̄) 1 1 pn (λ̄)
n ≥ = (5.116)
a n + pn (λ̄) n a 1 + pn (λ̄)
n
we see that
d
lim pn (λ̄) = (1 + x)λ̄ |x=0 = λ̄.
n→∞ dx
Therefore
pn (λ̄)
lim pn (λ̄)
= λ̄,
n→∞
1+ n
which implies that
pn (λ̄)
ϑ1 ≤ pn (λ̄)
≤ ϑ2 , n = 1, 2, . . . , (5.117)
1+ n
with some constants ϑ1 , ϑ2 such that 0 < ϑ1 < λ̄ ≤ ϑ2 . If λ̄ ∈ IN, then an easy
consideration shows that
pn (λ̄)
pn (λ̄)
, n = 1, 2, . . . ,
1+ n
c 1 c ϑ2
n = = q with q ∈ IN and c > , (5.118b)
n+q n1+ n a
Suppose now that also (5.80) holds, hence Eξn 2 ≤ σ 2 for all n ∈ IN2 .
Therefore βn ≤ β̄n , where (see (5.83)) β̄n = γ1 for n ∈ IN1 and β̄n = γ1 + σ 2
for n ∈ IN2 . Inequality (5.119) yields now
⎛ ⎞
2 N N
ϑ ⎜ γ1 γ1 + σ 2 ⎟
cN +1 − cn1 ≤ 2λ̄ ⎝ + ⎠.
a n=n1 n2−λ n=n1 n2−λ
n∈IN1 n∈IN2
EXn − x∗ 2 = 0(n−λ ) as n → ∞
1
for all λ such 0 ≤ λ < 2 − .
p
Proof. Under the given hypotheses we have
2
ϑ 1
cN +1 − cn1 ≤ 2λ̄ σ2
a (k p )2−λ
n1 ≤kp ≤N
2 ∞
ϑ 1
≤ 2λ̄ σ2 p(2−λ)
, N ≥ n1 ,
a k
k=1
1
where cn = nλ EXn −x∗ 2 with 0 ≤ λ < λ̄ and a fixed λ̄ ≥ 2. Since 2− < λ̄,
p
1
and 0 ≤ λ < 2 − is equivalent to p(2 − λ) > 1, the above series converges.
p
Hence, the sequence (cn ) is bounded which yields the assertion of our theorem.
    sn ≈ n^{1/p}

and therefore sn /n^2 ≈ n^{1/p−2} . Hence, the estimate of the convergence rate given
by Theorem 5.7 can also be represented in the form

    E||Xn − x∗||^2 ≈ O( sn /n^2 ) = O( rn /n )

if rn = n^{1/p−1} is the rate of stochastic steps taken in (5.69a), (5.69b), cf. the
example in [80].
6
Stochastic Approximation Methods
with Changing Error Variances
6.1 Introduction
As already discussed in the previous Chap. 5, hybrid stochastic gradient pro-
cedures are very important tools for the iterative solution of stochastic opti-
mization problems as discussed in Chaps. 1–4:
G(x) = 0
Yn = G(Xn ) + Zn , n = 0, 1, 2, . . .
Xn+1 := pD (Xn − Rn Yn ), n = 0, 1, 2, . . . ,
cf. (5.5), are well studied in the literature. Here, pD denotes the projection of
IRν onto the feasible domain D known to contain an (optimal) solution, and
Rn , n = 0, 1, 2, . . . is a sequence of scalar or matrix valued step sizes.
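A minimal sketch of the recursion Xn+1 = pD (Xn − Rn Yn ) with matrix valued step sizes Rn = ρn Mn; the linear regression function G, the noise model and the box-shaped feasible domain D are assumptions made only for this illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

H = np.array([[2.0, 0.3], [0.3, 1.0]])      # assumed Jacobian of G; G(x) = H (x - x*)
x_star = np.array([1.0, -1.0])

def Y(x):
    # noisy observation Y_n = G(X_n) + Z_n
    return H @ (x - x_star) + 0.3 * rng.standard_normal(2)

X = np.zeros(2)
M = np.linalg.inv(H)                        # matrix step-size factor M_n (here constant)
for n in range(1, 5001):
    rho = 1.0 / (n + 5)                     # scalar step sizes rho_n
    X = np.clip(X - rho * (M @ Y(X)), -10.0, 10.0)   # projection p_D onto a box

print(X)    # close to x* = (1, -1)
```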
In the extensive literature on stochastic approximation procedures, cf.
[39–41, 60, 61, 125, 164], various sufficient conditions can be found for the
underlying optimization problem, for the regression function, resp., the se-
quence of estimation errors (Zn ) and the sequence of step sizes Rn , n =
0, 1, 2, . . ., such that
Xn → x∗ , n → ∞,
in some probabilistic sense (in the (quadratic) mean, almost sure (a.s.), in
distribution). Furthermore, numerous results concerning the asymptotic dis-
tribution of the random algorithm (Xn ) and the adaptive selection of the step
sizes for increasing the convergence behavior of (Xn ) are available.
As shown in Chap. 5, the convergence behavior of stochastic gradient pro-
cedures may be improved considerably by using
– More exact gradient estimators Yn at certain iteration stages n ∈ N1
and/or
– Deterministic (feasible) descent directions hn = h(Xn ) available at certain
iteration points Xn
While the convergence of hybrid stochastic gradient procedures was examined
in Chap. 5 in terms of the sequence of mean square errors
bn = EXn − x∗ 2 , n = 0, 1, 2, . . . ,
in the present Chap. 6, for stochastic approximation algorithms with chang-
ing error variances further probabilistic convergence properties are studied in
detail: Convergence almost sure (a.s.), in the mean, in distribution. Moreover,
the asymptotic distribution is determined, and the adaptive selection of scalar
and matrix valued step sizes is considered for the case of changing error vari-
ances. The results presented in this chapter are taken from the doctoral thesis
of E. Plöchinger [115].
Remark 6.1. Based on the RSM-gradient estimators considered in Chap. 5,
implementations of the present stochastic approximation procedure, espe-
cially the Hessian approximation given in Sect. 6.6, are developed in the the-
sis of J. Reinhart [121], see also [122], for stochastic structural optimization
problems.
Problems of this type have to be solved very often. E.g., in the special case
that G = ∇F , where F : D −→ IR is a convex and continuously differentiable
function, according to [16] condition (6.1) is equivalent to
G(x∗ ) = 0.
In many practical situations, the function G(x) can not be determined directly,
but we may assume that for every x ∈ D a stochastic approximate (estimate)
Y of G(x) is available. Hence, for the numerical solution of problem (P ) we
consider here the following stochastic approximation method:
Select a vector X1 ∈ D, and for n = 1, 2, . . . determine recursively
(f) Concerning the selection of the matrices Mn in the step sizes Rn = ρn ·Mn ,
suppose that
sup Mn < ∞ a.s. (almost sure), (U4a)
n
where

    qk := lim_{n→∞} |{1, . . . , n} ∩ IN(k)| / n > 0                               (U6)

exist and are positive, see Sect. 5.5.1.
for all n and x ∈ D, where ⟨·, ·⟩ denotes the scalar product in IR^ν . Especially,
for each n ∈ IN we have

    ⟨Xn − x∗ , M̃n (G(Xn ) − G(x∗ ))⟩ ≥ ãn ||Xn − x∗||^2   a.s.                      (6.7)

    M̃n = m̃n · I   with   m̃n ≥ d,
    H(x) + H(x)^T ≥ 2αI

for all n ∈ IN and x ∈ D, where d and α are positive numbers. These conditions
hold in Sects. 6.7.2 and 6.8. The second of the above conditions implies also
(U2), and means in the gradient case G(x) = ∇F (x) that the function F (x)
is strongly convex on D.
The following notations and abbreviations are used in the next sections:
    ã := q1 ã(1) + . . . + qκ ã(κ) ≥ a
    ∆n := Xn − x∗ = actual argument error
    ∆Zn := Zn − En Zn
    σk^2 := lim sup_{n∈IN(k)} E σ_{0,n}^2
    Ω̄0 := {ω ∈ Ω : ω ∉ Ω0 } = complement of the set Ω0 ⊂ Ω
    IΩ0 := characteristic function of Ω0 ⊂ Ω
    EΩ0 T := E(T IΩ0 ) = expectation of the random variable T restricted to the set Ω0
    limextr := either lim sup or lim inf
For the proof of the convergence of the quadratic argument error Xn − x∗ 2
the following auxiliary statements are needed.
Lemma 6.1. For all n ∈ IN we have

    En ||∆n+1||^2 ≤ ||∆n − Rn (Gn − G∗ )||^2
        + ||Rn||^2 ( ||En Zn||^2 + En ||Zn − En Zn||^2 )
        + 2 ||Rn|| ||En Zn|| ||∆n − Rn (Gn − G∗ )|| .                                (6.10)
En ∆-n+1 2
7 τn Rn En Zn
≤ (1 + γn )2 1 + ∆n − Rn (Gn − G∗ )2 τn2
ε 8
2
+ τn Rn En Zn 2 + En Zn − En Zn 2 + ετn Rn En Zn .
Suppose that
ρn
−→ 1 a.s., (6.13)
sn
Mn − Dn −→ 0 a.s. (6.14)
∆n+1 = ∆n − Rn Yn + ∆Pn
- n )∆n + Rn (En Zn − Zn ) − Rn (En Zn + G∗ ) + ∆Pn .
= (I − Rn H
Multiplying then with τn+1 , (6.15) follows, provided that the following rela-
tions are taken into account:
- n ) = I − (1 + γn )Rn H
(1 + γn )(I − Rn H - n − γn I = I − sn (Bn − ∆Bn ),
(1 + γn )Rn = sn (Dn + ∆Dn ).
(1)
∆n+1 ≤ I − sn Bn ∆(1) -
n + sn ∆Bn ∆n
+ τn+1 ρn Mn En Zn + G∗ + ∆Pn , (6.23)
(2)
En ∆n+1 2 ≤ I − sn Bn 2 ∆(2)
n
2
+(τn sn )2 ∆Dn 2 σ0,n
2
+ σ12 ∆n 2 . (6.24)
(i)
Proof. For i = 0, 1, 2, select arbitrary A1 -measurable random variables ∆1
such that
-1 = ∆(0) + ∆(1) + ∆(2) .
∆ 1 1 1
(0)
For n ≥ 1, define ∆n+1 then by (6.22) and put
(1)
∆n+1 := (I − sn Bn )∆(1) - ∗
n + sn ∆Bn ∆n + τn+1 −Rn (En Zn + G )
+ ∆Pn , (i)
(2)
∆n+1 := (I − sn Bn )∆(2)
n + τn sn ∆Dn (En Zn − Zn ). (ii)
Relation (6.21) holds then because of Lemma 6.4, and (6.23) follows directly
from (i) by taking the norm. Moreover, (ii) yields
(2)
∆n+1 2 ≤ I − sn Bn 2 ∆(2)
n + (τn sn ) ∆Dn En Zn − Zn
2 2 2 2
5 6
+ 2τn sn (I − sn Bn )∆(2)
n , ∆Dn (En Zn − Zn ) .
(2)
The random variables ∆n and ∆Dn are An -measurable. Thus, taking ex-
pectations in the above inequality, by means of (U5b) we get then inequality
(6.24).
Suppose now that the matrices Dn from (6.14) fulfill the following rela-
tions:
with certain numbers an , bn ∈ IR. The matrices Bn from (6.16) fulfill then the
following inequalities:
Proof. The assertions follow by simple calculations from the relations (6.16),
(6.25) and (6.26), where:
√ 1
1 − 2t ≤ 1 − t for t≤ .
2
Some useful properties of the random variables ∆Bn , ∆Dn and ∆Pn arising
in (6.17), (6.18) and (6.19) are contained in the next lemma:
Lemma 6.7. Assuming that
γn −→ 0 and Xn −→ x∗ ∈ D̊ a.s.,
we have
(a) ∆Bn , ∆Dn −→ 0 a.s.
(b) G∗ = 0
(c) For each ω ∈ Ω, with exception of a null-set, there is an index n0 (ω)
such that
∆Pn (ω) = 0 for every n ≥ n0 (ω)
Proof. Because of Rn = ρn Mn , (6.13), (6.14), (6.18) and (U4a) we have
ρn
∆Dn = (1 + γn ) Mn − Dn −→ 0 a.s.
sn
Furthermore, Xn → x∗ a.s. and (U1a)–(U1c) yield
1
- n :=
H H x∗ + t(Xn − X ∗ ) dt −→ H ∗ a.s.
0
Due to assumption (a) in Sect. 6.3, inequality (6.4) and (U5a)–(U5c), for each
n ∈ IN we find
Then (-
an ) can be selected such that
1 T
a(k) := lim inf -
- an ≥ λmin D(k) H(x∗ ) + D(k) H(x∗ ) a.s., k = 1, . . . , κ.
n∈IN(k) 2
lim inf -
an ≥ b a.s.
n∈IN(k)
A theorem about the convergence with probability 1 and the convergence rate
of the sequence (Xn ) is given in the following. Related results can be found
in the literature, see, e.g., [107] and [108]. The rate of convergence depends
mainly on the positive random variable - a, defined in (6.9). Because of (6.7),
this random variable can be interpreted as a mean lower bound of the quotients
(1) (2)
For the random variables tn , tn and tn occurring in Lemma 6.3 we have
therefore
Since a > 0, in Corollary 6.1 we may select also λ = 0. Hence, we always have
Xn −→ x∗ a.s. (6.30)
We consider now the convergence rate of the mean square error EXn − x∗ 2 .
For the estimation of this error the following lemma is needed:
En (n + 1)∆n+1 2
1 1
≤ 1 − s(1) n n∆n 2 + (nρn Mn )2 σ0,n
2
+ s(2)
n (6.31)
n n
(1) (2)
with An -measurable random variables sn and sn , where
n − (2-
s(1) an − 1) −→ 0 a.s., (6.32)
n −→ 0 a.s.
s(2) (6.33)
√
Proof. Putting τn := n, we find
1
τn+1 = (1 + γn )τn with nγn −→ .
2
According to Lemma 6.3 we select
2 2
n := n(ρn tn − tn ),
s(1) n := ntn − nρn Mn σ0,n .
(1)
s(2) (2)
Consequently,
tn − (2-
an − 1) −→ 0 a.s.,
n −→ 0 a.s.,
nt(1)
2 2
ntn − nρn Mn σ0,n −→ 0 a.s.
(2)
The following theorem shows that the mean square error EXn − x∗ 2 is
1
essentially of∗order
n , and an upper bound is given for the limit of the sequence
n EXn −x 2 . This bound can be found in the literature in the special case
of scalar step sizes and having the trivial partition IN = IN(1) (cf. condition
(i) from Sect. 6.3), see [26], [164, page 28] and [152]. Also in this result, an
important role is played by the random variable - a defined in (6.9).
Proof. Because of Lemma 6.10 and the above assumptions, the assertions in
this theorem follow directly from Lemma B.4 in the Appendix, provided there
we set:
2
Tn := n∆n 2 , An := s(1)
n , Bn := nρn Mn ,
2
vn := σ0,n , Cn := s(2)
n .
Corollary 6.2. If -
a > 1
2 a.s., then for each ε0 > 0 there is a set Ω0 ∈ A
such that
Proof. Because of Corollary 6.1 and (6.35) we may apply Lemma 6.9. Hence,
we may assume that the sequence (- an ) in (6.7) fulfills
For scalar step sizes this yields now some guidelines for an asymptotic
optimal selection of the factors Mn in the step sizes Rn = ρn Mn .
In the special case
Mn −→ d · I a.s.,
the most favorable selection reads d = 1/α∗ which yields then the estimate

    lim sup_n EΩ0 ( n||∆n||^2 ) ≤ ( Σ_{k=1}^κ qk σk^2 ) / α∗^2 .                     (6.41)

The right hand side of (6.40), (6.41), resp., contains, divided by α∗^2 , the
(q1 , . . . , qκ )-weighted harmonic resp. arithmetic mean of the variance limits
σ1^2 , . . . , σκ^2 of the estimation error sequence (Zn ). Hence, the bound in (6.40)
is smaller than that in (6.41).
λn 1
γn = , λn −→ . (6.43)
n 2
Furthermore, according to Corollary 6.1 it holds without further assumptions
Xn −→ x∗ a.s. (6.44)
Now we demand that matrices Mn fulfill certain limit conditions and, due
to condition (U3), that x∗ is contained in the interior of the feasible set: Hence,
(a) For all k ∈ {1, . . . , κ} there exists a deterministic matrix D(k) having
x∗ ∈ D̊. (V4)
Now some preliminary considerations are made which are useful for show-
ing the already mentioned central limit theorem.
According to (6.44) and assumptions (V2) and (V3a), (V3b) as well as
Lemma 6.9, for the bound ã in (6.9) we may assume that

    ã ≥ a > ½   a.s.                                                                  (6.45)

Moreover, due to (V1), in Sect. 6.4.2

    sn := 1/n   for n = 1, 2, . . .                                                    (6.46)

can be chosen, and (6.43) yields

    γn /sn = λn → ½ .                                                                  (6.47)

Now, according to these relations and Lemma 6.6, for the matrices

    Bn := Dn H ∗ − (γn /sn ) I = Dn H ∗ − λn I                                         (6.48)
sn
Stochastic Equivalence
By further examination of sequence ∆ -n = √n(Xn − x∗ ) we find that it can
be replaced by a simpler sequence. This is guaranteed by the following result:
Theorem 6.3. (√n(Xn − x∗ )) is stochastically equivalent to the sequence

    ∆^{(0)}_{n+1} = (I − (1/n) Bn ) ∆^{(0)}_n + (1/√n) Dn (En Zn − Zn ),   n = 1, 2, . . . ,

where ∆^{(0)}_1 denotes an arbitrary A1 -measurable random variable.
Tn := ∆(1)
n ,
√
Cn := n n + 1 ρn Mn En Zn + G∗ + ∆Pn .
Due to (6.45) and Corollary 6.2 there exists a constant K < ∞ and a set
Ω2 ∈ A having
δ
P(Ω̄2 ) < , (v)
4
-n ≤ K.
sup EΩ2 ∆ (vi)
n
For ε0 := δεū[4(K + 1)]−1 > 0, with ū from (6.53), according to (iv) there
exists an index n0 such that for any n ≥ n0 on Ω0 := Ω1 ∩ Ω2 we have
∆Bn , Cn ≤ ε0 .
un ε0 (K + 1)
EΩ0 Tn+1 ≤ 1 − E Ω0 T n + ,
n n
ε0 (K + 1) δε
lim sup EΩ0 Tn ≤ = .
n ū 4
Thus, there exists an index n1 ≥ n0 having
δε
EΩ0 Tn ≤ for n ≥ n1 . (vii)
2
Finally, from (iii), (v), the Markov inequality and (vii) for indices n ≥ n1
holds
P(Tn > ε) ≤ P(Ω̄0 ) + P {Tn > ε} ∩ Ω0
EΩ0 Tn δ δ
≤ P(Ω̄0 ) + < + = δ.
ε 2 2
(2)
Now the corresponding conclusion for the sequence (∆n ) is shown.
(2)
Lemma 6.12. (∆n ) converges stochastically to 0.
2un 1
En Tn+1 ≤ 1− Tn + (∆Dn 2 σ0,n
2
+ Cn ),
n n
(2)
where Tn := ∆n 2 and Cn := ∆Dn 2 σ12 ∆n 2 . Due to Lemma 6.7 it holds
∆Dn , Cn −→ 0 f.s.
Let ε, δ > 0 be arbitrary fixed. Here, choosing B (k) ≡ C (k) ≡ 0, Lemma B.4,
Appendix, can be applied. Hence, there exists a set Ω0 ∈ A where
δ
P(Ω̄0 ) < and lim sup EΩ0 Tn = 0. (i)
2 n
Asymptotic Distribution
or

    W^{(k)} = lim_{m∈IN(k)} Em (∆Zm ∆Zm^T )   a.s.                                     (V6)

or

    lim_{m→∞} Em ( ||∆Zm||^2 I{||∆Zm||^2 > tm} ) = 0   a.s. for all t > 0.             (V8)
The convergence in distribution of (√n(Xn − x∗ )) is guaranteed then by the
following result:

Theorem 6.4. The sequence (√n(Xn − x∗ )) is asymptotically N (0, V )-
distributed. The covariance matrix V is the unique solution of the Lyapunov
matrix equation

    (DH ∗ − ½ I) V + V (DH ∗ − ½ I)^T = C,                                             (6.58)

where

    D := q1 D^{(1)} + . . . + qκ D^{(κ)} ,                                              (6.59)
    C := q1 D^{(1)} W^{(1)} D^{(1)T} + . . . + qκ D^{(κ)} W^{(κ)} D^{(κ)T} .             (6.60)
C := q1 D(1) W (1) D(1)T + . . . + qκ D(κ) W (κ) D(κ)T (6.60)
Central limit theorems of the above type can be found in many papers on
stochastic approximation (cf., e.g., [1,12,40,70,107,108,131,132,161,163]). But
in these convergence theorems it is always assumed that matrices Mn or the
covariance matrices of the estimation error Zn converge to a single limit D, W ,
respectively. This special case is enclosed in Theorem 6.4, if assumptions (V2),
(V3a), (V3b) and (V5a), (V5b)/(V6) for κ = 1 are fulfilled (cf. Corollary 6.4).
Before proving Theorem 6.4, some lemmas are shown first. For the deter-
ministic matrices
1
Am,n := √ Bm,n Dm for m ≤ n, (6.61)
m
arising in the important equation (6.54) the following result holds:
L2
Am,n 2 ≤ b̄m,n . (ii)
m
Hence, according to (6.52) and (6.53) Lemma A.6, Appendix, can be applied.
Therefore
and because of Lemma A.5(a), Appendix, for any 0 < δ < ū we have
m ū−δ
bm,n φm,n (ū − δ) ∼ 1. (v)
n
From (i) to (v) propositions (a), (b) and (c) are obtained easily. Symmetric
matrices V1 := 0 and
n
Vn+1 := Am,n Wm ATm,n for n = 1, 2, . . . (6.62)
m=1
where Cn := Dn Wn DnT . Hence, because of (6.47) and (6.48) for all k the
following limits exist:
1
B (k) := lim Bn = D(k) H ∗ − I.
n∈IN(k) 2
1
B (k) + B (k)T ≥ 2 a(k) − I = 2u(k) I.
2
BV + V B T = C.
holds.
holds. Furthermore,
n
V-n+1 − Vn+1 ≤ Am,n 2 Em (∆Zm ∆Zm
T
) − Wm , (iii)
m=1
n
EV-n+1 − E V-n+1 ≤ Am,n 2 E Em (∆Zm ∆Zm
T T
) − E(∆Zm ∆Zm ) (iv)
m=1
and
n
E V-n+1 − Vn+1 ≤ Am,n 2 E(∆Zm ∆Zm
T
) − Wm . (v)
m=1
For any real stochastic sequence (bn ) with bn −→ 0 a.s., Lemma 6.13(a),(b)
and Lemma A.2, Appendix, provide
n
Am,n 2 bm −→ 0 a.s. (vi)
m=1
Vn+1 − V −→ 0. (vii)
Hence, under condition (V5a), (V5b) we get because of (ii), (iv), (v), (vi)
and (vii)
EV-n+1 − V −→ 0.
In addition, with (V6), because of (i), (iii), (vi) and (vii) we have
V-n+1 − V −→ 0 a.s.
ε 2
holds for all indices n0 ≤ m ≤ n. Hence, for n > n0 , where t := >0
2L
n
Un := Am,n 2 Em (∆Zm 2 I{ Am,n Zm >ε} )
m=1
n
≤ Un0 + Am,n 2 · Em ∆Zm 2 I{ ∆Zm 2 >tm} . (viii)
m=n0 +1
Therefore, from (vi) and (viii) follows EUn −→ 0 at assumption (V7) and
Un −→ 0 a.s. under assumption (V8). Hence, in any case (KL) is fulfilled.
Now we are able to show Theorem 6.4.
(0)
Proof (of Theorem 6.4). Let (∆n ) be defined according to (6.54) with
(0)
∆1 := 0. Thus,
n
(0)
∆n+1 = − Am,n ∆Zm for n = 1, 2, . . . .
m=1
Due to Lemma 6.14 all conditions for the limit theorem Lemma B.7 in
(0)
the Appendix are fulfilled for this sequence. Hence, (∆n ) is asymptotically
N (0, V )-distributed.
√ According
to Theorem 6.3 this holds also true for se-
quence n(Xn − x∗ ) .
In the special case that the matrix sequence (Mn ) converges a.s. to a
unique limit matrix D, we get the following corollary:
V = H ∗ −1 W (H ∗ −1 )T .
This special case was already discussed for κ = 1 in [12, 107, 108, 161]
(Corollary 9.1).
(b) Let Θ = d · I, where d is chosen to fulfill d · (H ∗ + H ∗ ) > I. Hence,
T
d · (H ∗ V + V H ∗ ) − V = d2 W.
T
    hn := (1/(2cn ))(Yn^{(1)} − Yn^{(2)} ),
a known approximation for the derivative of G(Xn ) is obtained. This difference
quotient plays a fundamental role in the adaptive step size of Venter.
In [12, 107] and [108] this approach for the determination of search direc-
tions was generalized to the multi-dimensional case. There,
    Yn := (1/(2ν)) Σ_{j=1}^{2ν} Yn^{(j)}
(i) (ν+i)
is assumed as search direction, where vectors Yn and Yn are given
stochastic approximations of vectors G(Xn + cn ei ) and G(Xn − cn ei ), and
e1 , . . . , eν denote the canonic unit vectors of IRν . Then,
    hn := (1/(2cn ))(Yn^{(1)} − Yn^{(2)} )
of the matrix H(Xn ) is estimated. Cycling through all columns, after ν iter-
ations a new estimator for the whole matrix H(Xn ) is obtained.
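The following sketch illustrates this column-wise estimation: at each stage one column of H(Xn ) is approximated by the central difference of two noisy evaluations of G at Xn ± cn el , and the column index cycles through 1, . . . , ν; the regression function and the noise model are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

nu = 3
H_true = np.array([[3.0, 0.5, 0.0],
                   [0.5, 2.0, 0.2],
                   [0.0, 0.2, 1.0]])

def G_noisy(x):
    # noisy observation of G(x) = H_true x, cf. Y_n^(j) = G(.) + Z_n^(j)
    return H_true @ x + 0.1 * rng.standard_normal(nu)

X = np.array([0.5, -0.5, 1.0])   # current iterate X_n (held fixed in this sketch)
H_est = np.zeros((nu, nu))       # running estimate of the Jacobian H(X_n)
counts = np.zeros(nu)            # how often each column has been updated
col = 0                          # column index l(n, i), cycled through 0..nu-1

for n in range(1, 3001):
    c_n = 1.0 / n ** 0.25        # finite-difference widths c_n -> 0
    e = np.eye(nu)[col]
    # central difference (Y^(1) - Y^(2)) / (2 c_n) estimates one column of H(X_n)
    h_col = (G_noisy(X + c_n * e) - G_noisy(X - c_n * e)) / (2.0 * c_n)
    counts[col] += 1
    H_est[:, col] += (h_col - H_est[:, col]) / counts[col]   # running average per column
    col = (col + 1) % nu         # cycle through all nu columns

print(np.round(H_est, 2))        # close to H_true
```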
This possibility is contained in the following approach by means of choosing
ln ≡ 1 and εn ≡ 0:
For indices n = 1, 2, . . . , let ln ∈ IN0 and εn ∈ {0, 1} be integers with
εn + ln > 0. (W1a)
(j)
as stochastic approximation of Gn := G(Xn ). Here, Yn denotes a known
(j)
stochastic estimator of G at Xn , that is
d(i)
n = el(n,i) for i = 1, . . . , ln , (W1d)
l(1, 0) := 0,
l(n, i) := l(n, i − 1) mod ν + 1 for i = 1, . . . , ln , (W1e)
l(n + 1, 0) := l(n, ln ).
Yn = Gn + Zn (6.68)
if we denote
(0) (2ln )
and therefore due to the conditional independence of Zn , . . . , Zn we get
2ln
1
En ∆Zn ∆ZnT = En ∆Zn(j) ∆Zn(j)T .
(2ln + εn )2 j=1−εn
and using (6.70) assertion (c) follows. Due to condition (W2a) we have
1
2ln
L1
(j)
En Zn ≤ µ1 ,
2ln + εn
j=1−εn n
2
ln
1 (i) L1
≤ (Gn + G(ln +i)
) − Gn
2ln + εn 2 n + nµ1 .
i=1
1
≤ L0 cn d(i)n
1+µ0
+ cn d(i)
n
1+µ0
= L0 c1+µ
n
0
2
holds. Using the previously obtained inequality, assertion (d) follows.
Due to Lemmas 6.15(c), (6.70) and (6.71), conditions (U5b) and (U5c) are
fulfilled by means of
2 1
σ0,n := t2 , σ0 := t0 , σ1 := t1 . (6.75)
2ln + εn 0,n
For given positive numbers c0 and µ, choose for any n = 1, 2, . . .
c0 1
cn = , where µ> (W3a)
nµ 2(1 + µ0 )
for sequences (ln ) and (εn ) for each k = 1, . . . , κ. In particular this leads to
-
l := sup ln < ∞, (6.77)
n∈IN
6.6.1 Estimation of G∗
Vectors

    G∗_{n+1} := (1/n)(Y1 + . . . + Yn )   for n = 1, 2, . . .                          (6.79)

can be interpreted as An+1 -measurable estimators of G∗ := G(x∗ ). This is
shown in the next result:

    G∗_{n+1} = (1 − 1/n) G∗_n + (1/n) Yn .                                              (i)
Moreover, due to (U5b), (U5c) and condition (a) in Sect. 6.3 for each n
Using (i), (ii) and (iii), the proposition follows from Lemma B.6 in the
Appendix.
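A minimal sketch of the estimator (6.79) in its recursive form (i): the running average of the observed search directions Yn approaches G∗ = G(x∗ ) as the iterates settle near x∗; the test setting below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(6)

def Y_obs(x):
    # noisy observation of G(x) = x - x*, hence G* = G(x*) = 0
    return (x - np.array([2.0, -1.0])) + 0.2 * rng.standard_normal(2)

X = np.zeros(2)
G_star_est = np.zeros(2)
for n in range(1, 10001):
    Yn = Y_obs(X)
    G_star_est = (1 - 1 / n) * G_star_est + Yn / n    # recursion (i) for G*_{n+1}
    X = X - (1.0 / (n + 5)) * Yn                      # underlying SA iteration

print(G_star_est)   # tends to G* = 0 as X_n -> x*
```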
(j)
Here, the independence of the variables Zn was taken into account, and
(6.70) was used. Now, because of (U1a)–(U1c)
0G(Xn ± cn d(i)
n ) = G(Xn ) ± cn Hn dn + δ(Xn ; ±cn dn ),
(i) (i)
Proof. To arbitrary fixed l ∈ {1, . . . , ν} let (n1 , i1 ), (n2 , i2 ), . . . denote all or-
(i)
dered index pairs (n, i), for which dn = el is chosen. Hence, 1 ≤ n1 ≤ n2 ≤
. . ., and for each k it holds 1 ≤ ik ≤ lnk as well as ik < ik+1 if nk = nk+1 .
Moreover, for n ≥ 1 and i ≤ ln we have
d(i)
n = el ⇐⇒ (n, i) ∈ (n1 , i1 ), (n2 , i2 ), . . . .
if
1
αk := , Uk := h(ik)
nk , Vk := Hn(ik−1
k−1 )
el ,
nk
Ūk := Hnk el , V̄k := H̄n(ik−1
k−1 )
el .
Because of (W1e), at latest after each νth iteration stage n, direction vector
el is chosen. Therefore it holds nk+1 ≤ nk + ν and nk ≤ kν for each index k.
Consequently, we get
αk = ∞. (iii)
k
αk2 1 1 l02−2µ 1
= 2 ≤ < ∞. (iv)
k
2
cnk c0
k
n2−2µ
k
c20
k
k 2−2µ
B1 ⊂ B2 ⊂ . . . ⊂ Bk ⊂ A , (v)
αk2 ∗ 2
≤ t 0,n + t 2
1 Xn − x < ∞ a.s. (ix)
2c2nk k k
k
Due to (iii), (v), (vi), (ix) and (x), Lemma B.6 in the Appendix can be applied.
Hence, it holds
Vk+1 − V̄k+1 = (Hn(ikk ) − H̄n(ikk ) )el −→ 0 a.s.
Since l ∈ {1, . . . , ν} was chosen arbitrarily, we have shown proposition (a).
In case of “Xn −→ x∗ a.s.” follows according to (U1c) also Hn −→ H ∗
a.s. Therefore, because of (ii), (iii) and Lemma B.6 (Appendix)
V̄k+1 = H̄n(ikk ) el −→ H ∗ el a.s.
H ∗ a.s.
(0)
Matrices Hn+1 from (6.81a), (6.81b) can also be approximated by means
of special convex combinations of Jacobians H1 , . . . , Hn This is shown next.
H - n(i−1) + 1 (Hn − H
- n(i) := H - n(i−1) ), (6.83a)
nν
- (0) := H
H - (ln ) . (6.83b)
n+1 n
if we define
⎧
⎨ *
ν
αkν+j
1− kν+j , for m = 0, . . . , ν − 1
pm,k := j=m+1
⎩
1 , for m = ν,
ν
1 αkν+m kν k
Ūk := pm,k · Ukν+m + (p0,k − 1) + 1 Vkν .
ν m=1
α kν + m α
Now we show
Ūk −→ 0 a.s. (∗∗)
Because of (vi) and Lemma A.4, Appendix, from this follows Vkν −→ 0 a.s.
Hence, if (∗∗) holds, then due to (v), also (∗) holds true.
First of all, from the definition of Ūk , for each k = 0, 1, 2, . . . the following
approximation is obtained:
+ ν ,
1
ν
αkν+m kν
Ūk ≤ Ukν+m + Ukν+m · pm,k · − 1
ν m=1 m=1
α kν + m
k
+ (p0,k − 1) + 1 · Vkν . (vii)
α
For matrices
TL(n,i) := Hn − H̄n(i−1) ,
due to “Xn+1 − Xn −→ 0 a.s.,” inequality (6.3) and Theorem 6.6(a), we have
Now, let ε ∈ (0, α) denote a given arbitrary number. According to (iv) and
(viii) there exists an index k0 ∈ IN such that
α − ε ≤ αkν+m ≤ α + ε, (x)
ν
Tkν+m − Tkν ≤ ε, (xi)
m=1
pm,k αkν+m · kν − 1 ≤ ε (xii)
α kν + m
Using now in (vii) the approximations (ix), (xi), (xii) and (xiii), we finally get
for all k ≥ k1
17 8 2ε
ν
Ūk ≤ νε + ε Ukν+m + Vkν
ν m=1
α
9 :
2
≤ ε · 1 + sup UL + sup VL a.s.
L α L
Since ε > 0 was chosen arbitrarily, relation (∗∗) follows now by means of (iii)
and (v).
216 6 Stochastic Approximation Methods with Changing Error Variances
(1) (2ln )
Hence, due to the conditional independence of Zn , . . . , Zn we get
ln
1
En ∆Z-n ∆Z
-T =
n En (∆Zn(i) − ∆Zn(ln +i) )(∆Zn(i) − ∆Zn(ln +i) )T
(2ln )2 i=1
ln
1
= (En ∆Zn(i) ∆Zn(i)T + En ∆Zn(ln +i) ∆Zn(ln +i)T ).
(2ln )2 i=1
6.6 Realization of Search Directions Yn 217
This is proposition (b), and by taking the trace we also get (c). According to
(U1a)–(U1c) and (6.3), for i = 1, . . . , ln follows
(i)
n − Gn
G(i) (ln +i)
= 2cn Hn d(i)
n + δ(Xn ; cn dn ) − δ(Xn ; −cn dn )
(i)
Now, for the existence of limit covariance matrices W (1) , . . . , W (κ) of (Zn )
required by (V6) we assume:
Under assumption “Xn −→ x∗ a.s.” suppose
1
(b) lim En Z-n Z-nT = V (k) a.s.
n∈IN(k) 2l(k)
Proof. Due to conditions (W4a), (W4b) and (W6), with Lemmas 6.15(b) and
6.17(b) we have
1
lim En ∆Zn ∆ZnT = V (k) a.s.,
n∈IN(k) 2l(k) + ε(k)
1
lim En ∆Z-n ∆Z
-T =
n (k)
V (k) a.s..
n∈IN(k) 2ln
218 6 Stochastic Approximation Methods with Changing Error Variances
By means of
-n ∆Z
E n ∆Z -T = En Z-n Z-T − (En Z-n )(En Z-n )T ,
n n
(En Z-n )(En Z-n )T = En Z-n 2
from the second equation and Lemma 6.17(d) we obtain proposition (b).
Because of Lemma 6.18 the following averaging method for the calculation of
matrices W (1) , . . . , W (κ) is proposed:
(1) (κ)
At step n ∈ IN(k) with unique k ∈ {1, . . . , κ}, let matrices Wn , . . . , Wn
already be calculated. Furthermore, let
(k) 1 1 2ln - -T
Wn+1 := (1 − )W (k) + · Zn Zn (6.86a)
n n n 2ln + εn
(i)
Wn+1 := Wn(i) for i = k, i ∈ {1, . . . , κ}. (6.86b)
(1) (κ)
Here, at the beginning let matrices W1 , . . . , W1 be chosen arbitrarily.
Hence, if at stage n index n lies in IN(k) with a k ∈ {1, . . . , κ}, only matrix
(k) (1) (κ)
Wn of Wn , . . . , Wn is updated.
(i)
For the convergence of matrix sequence (Wn )n the following assumption
(j)
on the fourth moments of sequence (Zn ) is needed:
There exists µ2 ∈ [0, 1) and a random variable L2 < ∞ a.s. such that
(1) (κ)
Finally, we show that matrices Wn , . . . , Wn are approximations of limit
covariance matrices W (1) , . . . , W (κ) :
Proof. For given fixed k ∈ {1, . . . , κ} let IN(k) = {n1 , n2 , . . .} with indices
n1 < n2 < . . . . With abbreviations
2lni
Ci := Z-n Z-T , Ti := Wn(k) , Bi := Ani ,
2lni + εni i ni i
by definition we have
1 1
Ti+1 = 1− Ti + Ci for i = 1, 2, . . . .
ni ni
i
Because of (U6) we have −→ qk > 0 for i −→ ∞ and therefore
ni
1
= ∞. (ii)
i
ni
Due to (i), (ii) and (iii) Lemma B.6, Appendix, can be applied to sequence
(Tn ). Hence, we have shown the proposition.
220 6 Stochastic Approximation Methods with Changing Error Variances
(1) (κ)
In a similar way approximations wn , . . . , wn of the mean quadratic limit
errors of sequence (Zn ) can be given:
At step n ∈ IN(k) with unique k ∈ {1, . . . , κ}, let
(k) 1 1 2ln -n 2
wn+1 := 1− wn(k) + · Z (6.88a)
n n 2ln + εn
(i)
wn+1 := wn(i) for i = k, i ∈ {1, . . . , κ}. (6.88b)
(1) (κ)
Here, initially let arbitrary w1 , . . . , w1 be given. Then, the following con-
vergence corollary can be shown:
Corollary 6.5. If “ Xn −→ x∗ a.s.,” then for any k = 1, . . . , κ
1
lim wn(k) = tr(V (k) ) = lim En ∆Zn 2 a.s. (6.89)
n∈IN(k) 2l(k) + ε(k) n∈IN(k)
(k) (k)
Proof. If w1 = tr(W1 ) for k = 1, . . . , κ , then for each n ∈ IN and k =
1, . . . , κ
wn(k) = tr(Wn(k) )
(k)
with matrix Wn from (6.86a), (6.86b). Hence, the proposition follows from
Theorem 6.8 by means of taking the trace.
Moreover, let the step directions (Yn ) be defined by relations (W1a), (W1b)
in Sect. 6.6.
(0)
Starting with a matrix H0 , and putting H1 := H0 , An -measurable ma-
(0)
trices Hn can be determined recursively by (6.81a), (6.81b).
(0)
Depending on Hn , the factor Mn in the step size Rn is selected here such
that conditions (U4a)–(U4d) from Sect. 6.3 are fulfilled.
Independent of the convergence behavior of (Xn ), for large index n the
(0)
matrices Hn do not deviate considerably from H - n(0) given by (6.83a), (6.83b).
This is shown next:
- n −→ 0 a.s.
Lemma 6.20. ∆Hn := Hn − H
(0) (0)
6.7 Realization of Adaptive Step Sizes 221
H ≤ β. (6.90)
If in (6.2a), (6.2b) a true matrix step size is used, then due to condition (U3)
from Sect. 6.3 we suppose
x∗ ∈ D̊. (X2)
In case that the covariance matrix of the estimation error Zn converges to-
wards a unique limit matrix W in the sense of (V5a), (V5b) or (V6), see
Sect. 6.5, according to [117] for
we
√ obtain the minimum asymptotic covariance matrix of the process
( n(Xn − x∗ )). Since H ∗ is unknown in general, this approach is not possible
directly.
(0)
However, according to Theorem 6.6, the matrix Hn is a known An -
measurable approximate to H ∗ at stage n, provided that (Xn ) converges to
x∗ . Hence, similar to [12, 107, 108], for n = 1, 2, . . . we select
(0)
(Hn )−1 , if Hn ∈ H0
(0)
Mn := −1 (6.91)
H0 , else.
for all x, y ∈ D.
222 6 Stochastic Approximation Methods with Changing Error Variances
Lemma 6.21.
(a) For a matrix H ∈ IM(ν × ν, IR) the following properties are equivalent
(i) H ∈ H⊕
1
(ii) H is regular and H −1 ≤
α0
(b) For H ∈ H and x ∈ D we have
(HH T )−1 = (H −1 )T H −1 ,
H −1 2 = λmax (H −1 )T (H −1 ) ,
1
HH T ≥ α02 I ⇐⇒ (HH T )−1 ≤ 2 I.
α0
and
HH T = λi λk H (i) H (k)T ≥ α2 I.
i,k∈I
−1 T −1 T −1
(H ) (H ) = (HH )
Hence, H is regular, and ≤ α−2 I. This
yields H −1 2 = λmax (H −1 )T (H −1 ) ≤ α−2 . Due to (6.90) we find
H T H ≤ λmax (H T H)I = H2 I ≤ β 2 I and therefore β −2 I ≤ (H T H)−1 =
(H −1 )(H −1 )T . Because of
T
H −1 H(x) + H −1 H(x) = (H −1 ) H(x)H T + HH(x)T (H −1 )T ,
Lemma 6.22.
(a) sup Mn ≤ max{α0−1 , α−1 }.
n∈IN
0n and ∆Mn such
(b) For each n ∈ IN there are An -measurable matrices M
that
0n + ∆Mn
(i) Mn = M
(ii) lim ∆Mn = 0 a.s.
n→∞
2
0n H(x) u ≥ α u2
(iii) uT M for x ∈ D and u ∈ IRν
β2
Proof. Since H0 ⊂ H⊕ and H0 ∈ H, Lemmas 6.21(a) (ii) and (b) (iii) for
n = 1, 2, . . . yields
α0 −1 , if Hn ∈ H0
(0)
Mn ≤ −1 ,
α , else
Define now
- n )−1 , if Hn ∈ H0
(H
(0) (0)
0n :=
M (6.92)
H0−1 , else,
and put ∆Mn := Mn − M 0n . Because of (i) and (ii), the assumptions of Lemma
- n(0) and Bn := Hn(0) . Hence,
C.2, Appendix, hold for An := H
∆Mn −→ 0 a.s.
- n(0) , H0 ∈ H
Furthermore, according to Lemma 6.21(b) (vi), and because of H
we get
2
uT M0n H(x) u ≥ α u2
β 2
(d) If matrices of (Zn ) fulfill (V6) and (V8) in Sect. 6.5.3, then
√the covariance
n(Xn − x∗ ) has an asymptotic normal distribution with the asymptotic
covariance matrix given by
κ T
V = (H ∗ )−1 qk W (k) (H ∗ )−1 .
k=1
6.7 Realization of Adaptive Step Sizes 225
Mn = (Hn(0) )−1 0n = (H
and M - (0) )−1 a.s.
n
0n −→ Θ := (H ∗ )−1
Mn , M a.s. (i)
a := lim inf -
- an ≥ a = 1 (ii)
n∈IN
Of course, at stage n matrix (Hn+1 )−1 follows from (Hn )−1 at minimum
(0) (0)
(0) 1 (1)
Hn+1 = Hn(0) + (h − Hn(0) d(1) (1)T
n )dn .
n n
226 6 Stochastic Approximation Methods with Changing Error Variances
Choice of Set H0
We still have to clarify how to select the set H0 occurring in (6.91) such
that conditions (X4) and (X5) are fulfilled. Theoretically, H0 := H⊕ could be
taken. However, this is not possible since the minimum eigenvalue of a matrix
HH T cannot be computed exactly in general.
Hence, for the choice of H0 , two practicable cases are presented:
Example 6.1. Let denote · 0 any simple matrix norm with I0 = 1. Fur-
thermore, define
% &
H0 := H ∈ IM(ν × ν, IR) : det(H) = 0 and H −1 0 ≤ b0 ,
a0 1
0< ≤ 1 − a0 0 and 0 < .
b0 a0
If H − H ∗ 0 ≤ 0 , then according to Lemma C.1, Appendix, we know that
H is regular and
a0
H −1 0 ≤ ≤ b0 .
1 − a0 0
% & ◦
Hence, H: H − H ∗ 0 ≤ 0 ⊂ H0 , or H ∗ ∈ H0 .
In the above, the following norms · 0 can be selected:
Proof. According to the definition of Mn , the assertions (i), (ii) follow imme-
diately from the above assumptions (a), (b).
Put now M0n :≡ Mn and ∆Mn :≡ 0. According to Lemma 6.26, also in the
present case condition (U4a)–(U4d) from Sect. 5.3 is fulfilled. Thus, corre-
sponding to the last section, cf. Theorem 6.9, we obtain the following result:
Theorem 6.11.
(a) Xn −→ x∗ a.s.,
(b) Hn(0) −→ H ∗ a.s.
lim αn∗ = α
- a.s., (X8)
n→∞
- < 2α∗ := λmin (H ∗ + H ∗T ).
α (X9)
πn π (k)
lim ∗
= =: d(k) a.s., (6.96a)
n∈IN(k) αn α-
1
d := q1 d(1) + . . . + qκ d(κ) = , (6.96b)
-
α
1
dα∗ > , (6.96c)
2
6.7 Realization of Adaptive Step Sizes 229
especially
lim Mn = d(k) I a.s. (6.97)
n∈IN(k)
Theorem 6.12.
(a) If λ ∈ [0, 12 ), then nλ (Xn − x∗ ) −→ 0 a.s.
(b) For 0 > 0 there is a set Ω0 ∈ A such that
a(k) ≥ d(k) α∗ ,
- k = 1, . . . , κ.
Hence, due to (6.96b), (6.96c) we have -a ≥ dα∗ > 12 . The first assertion
follows then from Theorem 6.1. Because of (6.96a) and (6.97), assertion (b) is
obtained from Corollary 6.3.
As already mentioned in Sect. 6.5.2, the right hand side of (6.98) takes a
minimum for
- = α∗ ,
α (6.99a)
q1 qκ
π (k) = (σk2 ( 2 + . . . + 2 ))−1 , k = 1, . . . , κ. (6.99b)
σ1 σκ
Obviously, at stage n the factor αn∗ in the step size (6.95) may be chosen as a
(0)
function of Hn . Hence, for n = 1, 2, . . . we put
(0) (0) (0)
ϑn (Hn ), if Hn ∈ H1 and ϑn (Hn ) ∈ [ϑ, ϑ̄]
αn∗ := (6.106)
ϑ0 , else
- < ϑ̄,
ϑ<α (X13)
ϑn (Hn(0) ) −→ α
- a.s. (X14)
Thus, because of (X12), (X13) and (X14) there is an index n0 such that
the assumptions in Lemma C.5(b), cf. Appendix, then due to (6.110) and
(4) (0)
Lemma C.5 sequence ϑn (Hn ) converges a.s. towards
1
= α∗ .
2λmax (H ∗ + H ∗T )−1
Symmetric H ∗
If (e.g., in case of a gradient G(x) = ∇F (x)):
H ∗ := H(x∗ ) is symmetric, (X15)
then
1
α∗ :=
λmin (H ∗ + H ∗T ) = λmin (H ∗ ). (6.112)
2
Hence, in the present case the function ϑn (H) in (6.106) can be chosen to
depend only on H. In addition, conditions (X14) and (X9) can be fulfilled.
Due to (X6) we have
◦
H ∗ ∈ H++ , (6.113)
provided that
% &
H++ := H ∈ IM(ν × ν, IR) : det(H) > 0 and tr(H) > 0 .
In the following examples we take H1 := H++ , hence, condition (X12) holds.
Example 6.7. (see Example 6.4). The function
2
ϑ(5) (H) :=
tr(H −1 )
is continuous on H++ and ϑ(5) (H ∗ ) < 2
λmax ((H ∗ )−1 ) = 2α∗ . Thus, ϑn (H) :≡
ϑ(5) (H) is a suitable function.
Example 6.8. (see Example 6.5)
ν − 1 ν−1
ϑ(6) (H) := 2 det(H) · ( )
tr(H)
is also continuous on H++ , and Lemma C.4, Appendix, yields ϑ(6) (H ∗ ) < 2α∗ .
Hence, also ϑn (H) :≡ ϑ(6) (H) is an appropriate choice.
6.7 Realization of Adaptive Step Sizes 235
(7)
Thus, also ϑn (H) :≡ ϑn (H) is a suitable function.
Remark 6.2. Computation of ϑn (H):
In the above Examples 6.7–6.9 the expressions
tr (Hn(0) )−1 , det(Hn(0) ), (Hn(0) )−1
must be determined. This can be done very efficiently by means of
Lemma 6.23.
Example 6.10. The function ϑ(8) (H) := min hii is continuous on IM(ν × ν, IR),
i≤ν
where hii denotes the ith diagonal element of H. Hence, define
lim αn∗ = α
- a.s.
n→∞
are studied for algorithm (6.2a), (6.2b), where the An -measurable random
variables rn are defined recursively as follows:
For n = 1 we select positive A1 -measurable random variables r1 , w1 , and
for n > 1 we put
wn−1
rn = r-n−1 Qn (-
rn−1 ) Kn . (6.114b)
wn
Here,
w ≤ wn ≤ w̄ a.s., (Y3)
The following simple lemma is a key tool for the analysis of the step size rule
(6.114a)–(6.114d).
Proof. The first part is clear, and the formula in (b) follows by induction with
respect to n.
We show now that the step size rn according to (6.114a)–(6.114d) is of
order n1 .
238 6 Stochastic Approximation Methods with Changing Error Variances
7 8
w
Lemma 6.30. limextr n · rn ∈ w̄ v̄ , ww̄v a.s.
n
Proof. Putting
r-n wn Pn−1
Pn := K2 · . . . · Kn , tn := , un := vn (-
rn−1 ),
Pn wn−1
after multiplication with wn Pn−1 , definition (6.114a)–(6.114d) yields
tn−1
tn ≤ , n = 2, 3, . . . .
1 + tn−1 un
0n + ∆Mn
(i) Mn = M
(ii) lim ∆Mn = 0 a.s.
n→∞
0n H(x) u ≥ α w u2 for each x ∈ D and u ∈ IRν
(iii) uT M
w̄ v̄
Proof. For n ∈ IN define
0n := max nrn , w
M I and 0n .
∆Mn := Mn − M (6.117)
w̄ v̄
Because of condition (Y5), the assertions follows then immediately from
Lemma 6.30.
(a) Xn −→ x∗ a.s.
(b) Hn(0) −→ H ∗ a.s.
1
d := q1 d(1) + . . . + qκ d(κ) = , (6.119)
-
α
where
q qκ −1
1
d(k) := α-w(k) + . . . + , k = 1, . . . , κ (6.120)
w(1) w(κ)
Hence, the constant α- has the same meaning as in Sect. 6.7.2, see (6.97) and
(6.96b). Thus, corresponding to condition (X9), here we demand:
- < 2α∗ := λmin (H ∗ + H ∗T ).
α (Y9)
Corresponding to Theorem 6.12, here we have the following result:
6.8 A Special Class of Adaptive Scalar Step Sizes 241
Proof. Because of (6.119), (6.120) and (Y9), as in the proof of Theorem 6.12,
-, d according to (6.9), (6.120), resp., we have
for the number α
1
- ≥ dα∗ >
α .
2
Based on (6.118)–(6.120), the assertions follow now from Theorem 6.1 and
Corollary 6.3.
According to Sect. 6.5.2, the right hand side of (6.121) takes a minimum
at the values
- = α∗ ,
α (6.122)
w = σk2 := lim sup E σ0,n
(k) 2
, k = 1, . . . , κ. (6.123)
n∈IN(k)
Also the limit properties in Theorem 6.13 can be transferred easily to the
present situation:
1 ∗ q qκ −2 qk
κ
(H V + V H ∗T ) − V = α
1
- + . . . + W (k) .
-
α w(1) w(κ) (w (k) )2
k=1
(6.124)
Similar to Theorem 6.13, the above result can be shown by using Theorem 6.4,
where the numbers d(k) are not given by (6.96a), but by (6.118).
Several formulas are presented now for the function Qn (r) in (6.114a)–(6.114d)
such that conditions (Y1), (Y2) and (Y6) for the sequences (r̄n ) and (vn (r))
242 6 Stochastic Approximation Methods with Changing Error Variances
hold true. In each case the function Qn (r) depends on two An -measurable
random parameters αn∗ and Γn . Suppose that αn∗ , Γn fulfill the following
condition:
(A5) There are positive constants ϑ, ϑ̄ and γ̄ such that for n > 1
Xn − x∗ , Gn − G∗ ≥ αXn − x∗ 2 ,
Gn − G∗ ≤ βXn − x∗ (see (6.4)).
6.8 A Special Class of Adaptive Scalar Step Sizes 243
Based on the above estimate for the quadratic error ∆n 2 , in [89] an “opti-
mal” deterministic step size rn is given which is defined recursively as follows:
1 − rn−1 α
rn = rn−1 for n = 2, 3, . . . .
1 − rn−1
2 (β 2 + σ12 )
This suggests the following function Qn (r):
Let q̄ ∈ (0, 1), and for n > 1 define
1 − rαn∗
Qn (r) = , r ∈ [0, r̄n ] (6.126a)
1 − r 2 Γn
∗ $
αn 1
r̄n := q̄ min , . (6.126b)
Γn + αn∗
The feasibility of this approach is shown next:
Lemma 6.34. If Qn (r) and r̄n are defined by (6.126a), (6.126b), then Qn (r)
has the form (6.114d), and for each r ∈ [0, r̄n ] we have
αn∗ − rΓn ∗ Γn − αn∗2
(a) vn (r) = = α − r,
1 − rαn∗ n
1 − rαn∗
q̄ ϑ̄2 + γ̄
(b) (1 − q̄) ϑ ≤ vn (r) ≤ ϑ̄ + a.s.,
1 − q̄ ϑ
Γn − αn∗2 ϑ̄2 + γ̄
(c) ≤ a.s.,
1 − rαn∗ 1 − q̄
$
ϑ 1
(d) r̄n ≥ r̄0 := q̄ min , > 0 a.s.
γ̄ ϑ̄
Proof. Assertion (a) follows by a simple computation.
If αn∗2 < Γn , then (6.126b) yields
Γn − αn∗2 Γn − αn∗2
0≤ ≤ ≤ Γn .
1 − rαn∗
1 − q̄ αn∗2 Γn−1
Hence, because of (a) we get
αn∗ ≥ vn (r) ≥ αn∗ − rΓn ≥ αn∗ − r̄n Γn + ≥ αn∗ (1 − q̄).
Let Γn ≤ αn∗2 . Then (6.126b) yields
αn∗2 − Γn αn∗2 + Γn −
0≤ ≤ ,
1 − rαn∗ 1 − q̄
and therefore with (a)
αn∗2 + Γn − q̄
αn∗ ≤ vn (r) ≤ αn∗ + .
1 − q̄ αn∗
From the above relations and assumptions (Y10), (Y11) we obtain now the
assertions in (b), (c) and (d).
244 6 Stochastic Approximation Methods with Changing Error Variances
where K̄ > 0 and q̄ ∈ (0, 1) are arbitrary constants. In this case we have the
following result:
Lemma 6.35. Let Qn (r) and r̄n be given by (6.127a), (6.127b). Then func-
tion Qn (r) can be represented by (6.114d), and for each r ∈ [0, r̄n ] we have
un α∗ (1 + K̄)
αn∗ (1 − q̄) ≤ un ≤ = vn (r) ≤ n ,
1 − run 1 − q̄
Γn − αn∗ un | Γn | +αn∗ un | Γn | +αn∗2 (1 + K̄)
1 − run ≤ 1 − q̄
≤
1 − q̄
.
Applying now assumptions (Y10) and (Y11), the assertions in (b), (c) and (d)
follow.
In (6.128b) r̄ > 0 and q̄ ∈ (0, 1) are arbitrary constants. The feasibility of this
approach is guaranteed by the next lemma:
6.8 A Special Class of Adaptive Scalar Step Sizes 245
Lemma 6.36. Let Qn (r) and r̄n be given by (6.128a), (6.128b). Then Qn (r)
can be represented by (6.114d), where for each r ∈ [0, r̄n ] we have
1
(a) vn (r) = exp(rαn∗ − r2 Γn ) − 1 = αn∗ + ∆αn∗ (r) r with
r
1 exp(rαn∗ − r2 Γn ) − 1
∆αn (r) := (αn∗ − rΓn )
∗
− 1 − Γn ,
r rαn∗ − r2 Γn
r̄ ū
(b) ϑ (1 − q̄) ≤ vn (r) ≤ ū(1 + exp(r̄ ū)) a.s., where ū := ϑ̄ + r̄ γ̄,
2
ū2
(c) | ∆αn∗ (r) |≤ exp(r̄ ū) + γ̄ a.s.,
2 $
ϑ
(d) r̄n ≥ r̄0 := min r̄, q̄ > 0 a.s.
γ̄
Choice of (α∗n )
Note that (Y12) and (Y9) are exactly the conditions which are required for
(αn∗ ) in Sect. 6.7.2 (see (X8) and (X9)).
Theorem 6.15 guarantees also in the present case that Xn −→ x∗ and
Hn −→ H ∗ a.s. Thus, sequence (αn∗ ) can be chosen also according to (6.106).
(0)
Then (Y10) holds because of (X13), and due to Lemma 6.27 also conditions
(Y12) and (Y9) are fulfilled.
Since
Hn(0) −→ H ∗ := H(x∗ ) a.s., (6.132)
see Theorem 6.15(b), the following approximates βn of β ∗ can be used:
(a) βn = ψ (1) (Hn(0) ) with ψ (1) (H) := tr(H), (6.133)
(b) βn = ψ (2) (Hn(0) ) with ψ (2) (H) := max hii , (6.134)
i≤ν
wn ≡ 1, n ≥ 1, (6.138)
Since the limits w(1) , . . . , w(κ) of (wn ) from (Y8) should fulfill in the best case
(6.123), because of (6.140) we prefer the following choice of (wn ):
For an integer n ≥ 1 belonging to the set IN(k) with k ∈ {1, . . . , κ}, put
-n(k) .
wn := w (6.141)
Then, because of (6.139), (6.140) also conditions (Y3) and (Y8) hold.
Hence, the factors Kn are responsible only for the initial behavior of the step
size rn and the iterates Xn .
By the factor Kn (as well as by wn ) the step size can be enlarged at
rn−1 → rn . This is necessary, e.g., in case of a too small starting step size r1 .
An obvious definition of Kn reads
Kn ≡ 1, n > 1. (6.142)
Suppose
Kn = exp(s2n kn ), n > 1, (6.143)
where (sn ) is a deterministic sequence and kn denotes an An -measurable
random number such that
Consequently,
E s2n | kn | < ∞ and especially s2n | kn |< ∞ a.s.
n n
Example 6.16. According to assumption (a) in Sect. 6.3, the norm of the An -
measurable difference ∆Xn := Xn − Xn−1 is bounded by δ0 . Hence,
fulfill condition (Y14) and are therefore appropriate definitions for (kn ).
Example 6.17. The factors kn can be chosen also as functions of the approx-
(0) (2l )
imates Yn−1 , . . . , Yn−1n−1 for the step direction Yn−1 , cf. (W1a)–(W1e) in
Sect. 6.6. Thereby we need the following result:
Having this lemma, we find that condition (Y14) holds in the following
cases:
ln−1
1 (i) (l +i)
(a) kn := Yn−1 , Yn−1
n−1
, (6.149)
ln−1 i=1
ln−1
1 (i) (l +i) +
(b) kn := Yn−1 , Yn−1
n−1
, (6.150)
ln−1 i=1
(c) kn := Yn−1 2 I (i) n−1 (l +i) , (6.151)
{ min Yn−1 , Yn−1 >0}
i=1,...,ln−1
2ln−1
1 (0) (j)
(d) kn := Yn−1 , Yn−1 + . (6.152)
2ln−1 j=1
7.1 Introduction
In reliability analysis [31,98,138] of technical or economic systems/structures,
the safety or survival of the underlying technical system or structure S is
described first by means of a criterion of the following type:
⎧
⎪
⎪ < 0, a safe state σ of S exists,
⎪
⎪
⎪
⎪ S is in a safe state, survival of
⎪
⎪
⎪
⎪ S is guaranteed.
⎪
⎪
⎨ > 0, a safe state σ of S does not exist,
s∗ (R, L) S is in an unsafe state, (7.1a)
⎪
⎪
⎪
⎪ failure occurs (may occur)
⎪
⎪
⎪ = 0, S is in a safe or unsafe state,
⎪
⎪
⎪
⎪
⎪ resp. (depending on the special
⎩
situation)
Here,
s∗ = s∗ (R, L) (7.1b)
denotes the so-called (limit) state or performance function of the sys-
tem/structure S depending basically on an
– mR -Vector R of material or structural resistance parameters
– mL -Vector L of external load factors
In case of economics systems, the “resistance” vector R represents, e.g.,
the transfer capacity of an input–output system, and the “load” vector L may
be interpreted as the vector of demands.
The resistance and load vectors R, L depend on certain noncontrollable
and controllable model parameters, hence,
R = R a(ω), x , L = L a(ω), x (7.1c)
254 7 Computation of Probabilities
of nG factor domains
'
i a(ω), λ := σ i : gi σ i , a(ω), x ≤ 0 (7.5c)
In
' the above
mentioned convex case, due to (7.5b), the distance functional of
a(ω), x can be represented by
π σ, a(ω), x = max πi σi , a(ω), x , (7.5e)
1≤i≤nG
where σi '
πi σi , a(ω), x := inf λ > 0 : ∈ i a(ω), x (7.5f)
λ
'
denotes the Minkowski functional of the factor domain i a(ω), x .
Based
on the relations described in the
first section,
for a given pair of vectors
a(ω), x , the state function s∗ = s∗ a(ω), x can be defined by the minimum
value function of one of the following optimization problems:
Problem A
min s (7.6a)
s.t.
T σ, a(ω), x = 0 (7.6b)
g σ, a(ω), x ≤ s1 (7.6c)
'
σ ∈ 0, (7.6d)
where 1 := (1, . . . , 1)T . Note that due to Remark 7.2, problem (7.6a)–(7.6d)
always has feasible solutions (s, σ) = (s̃, σ̃) with any solution σ̃ of (7.2a) and
some sufficiently large s̃ ∈ IR.
Obviously, in case of a linear state equation (7.2b), problem (7.6a)–(7.6d)
is convex, provided that g = g σ, a(ω), x is a convex vector function of σ,
'
and σ has a convex range 0 .
Problem B '
If T and g are linear with respect to σ, and the range 0 of σ is a closed
convex polyhedron '
0 := {σ : Aσ ≤ b} (7.7)
with a fixed matrix (A, b), then (7.6a)–(7.6d) can be represented, see (7.2b)
and (7.3d), by the linear program (LP):
min s (7.8a)
s.t.
C a(ω, x σ = L a(ω), x (7.8b)
H a(ω), x σ − h a(ω), x ≤ s1 (7.8c)
Aσ ≤ b. (7.8d)
7.2 The State Function s∗ 257
where
"
0, i = 0
H̃i a(ω), x := 1 Hi a(ω), x , 1 ≤ i ≤ mH . (7.12b)
hi a(ω),x
258 7 Computation of Probabilities
Based on the state equation (I) and the admissibility condition (II), the
safety
of a system
or structure S = Sa(ω),x represented by the pair of vec-
tors a(ω), x can be defined as follows:
Definition 7.1. Let a = a(ω), x, resp., be given vectors of model parameters,
design variables. A system Sa(ω),x having configuration a(ω), x ' is in a safe
state (admits a safe state) if there exists a “safe” state vector σ̃ ∈ 0 fulfilling
(i) the state equation (7.2a) or (7.2b), hence,
T σ̃, a(ω), x = 0, C a(ω), x σ̃ = L a(ω), x , resp., (7.13a)
and (ii) the admissibility condition (7.3a), (7.3b) or (7.4a), (7.4b) in the
standard form
g σ̃, a(ω), x ≤ 0, π σ̃, a(ω), x ≤ 1, resp., (7.13b)
Remark 7.4. If Sa(ω),x is in an unsafe state, then, due to violations of the basic
safety conditions (I) and (II), failures and corresponding damages of Sa(ω),x
may (will) occur. E.g., in mechanical structures (trusses, frames) high external
loads may cause very high internal stresses in certain elements which damage
then the structure, at least partly. Thus, in this case failures and therefore
compensation and/or repair costs occur.
By means of the state function s∗ = s∗ a(ω), x , the safety
or failure of
a system or structure Sa(ω),x having configuration a(ω), x can be described
now as follows:
Theorem 7.1. Let s∗ = s∗ a(ω), x be the state function of Sa(ω),x defined
by the minimum function of one of the above optimization problems (A)–(C).
Then, the following basic relations hold:
(a) If s∗ a(ω), x < 0, then a safe state vector σ̃ exists fulfilling the strict
admissibility
condition (7.13c). Hence, Sa(ω),x is in a safe state.
∗
(b) If s a(ω), x > 0, then no safe state vector exists. Hence, Sa(ω),x is in an
unsafe state and failures and damages may occur.
(c) If a safe state vector σ̃ exists, then s∗ a(ω), x ≤ 0, s∗ a(ω), x < 0, resp.,
corresponding to the standard, the strict admissibility condition (7.13b),
(7.13c).
7.3 Probability of Safety/Survival 259
∗
(d) If s a(ω), x ≥ 0, then no safe state vector exists in case of the strict
admissibility condition (7.13c).
(e) Suppose that the minimum (s∗ , σ ∗ ) is attained in the selected optimization
problem defining the state function s∗ . If s∗ a(ω), x = 0, then a safe state
vector exists in case of the standard admissibility condition (7.13b), and
no safe state exists in case of the strict condition (7.13c).
Proof. (a) If s∗ a(ω), x < 0, then according to the definition of the minimum
value function s∗ , there exists a feasible point (s̃, σ̃) in Problem (A), (B),
(C), resp., such that, cf. (7.6c), (7.8c), (7.10c),
g σ̃, a(ω), x ≤ s̃1, s̃ < 0,
π σ̃, a(ω), x − 1 ≤ s̃, s̃ < 0.
'
Consequently, σ̃ ∈ 0 fulfills the state equation (7.13a) and the strict
admissibility condition (7.13c). Hence, according to Definition 7.1, the
vector σ̃ is a safe state vector fulfilling the strict admissibility condition
(7.13c).
(b) Suppose that s∗ a(ω), x > 0, and assume that there exists a safe state
vector σ̃ for Sa(ω),x . Thus, according to Definition 7.1, σ̃ fulfills the state
equation (7.13a) and the admissibility condition (7.13b). Consequently,
(s̃,σ̃) = (0,
σ̃) is feasible in Problem (A), (B) or (C), and we obtain
s∗ a(ω), x ≤ 0 in contradiction to s∗ a(ω), x > 0. Thus, no safe state
vector exists in this case.
(c) Suppose that σ̃ is a safe state for Sa(ω),x . According to Definition 7.1,
vector σ̃ fulfills (7.13a) and condition (7.13b), (7.13c), respectively. Thus,
(s̃, σ̃) with s̃ = 0, with a certain s̃ < 0,
resp., is feasible in Problem (A),
∗
(B) or (C).
Hence, we get s a(ω), x ≤ 0 in case of the standard and
∗
s a(ω), x < 0 in case of the strict admissibility condition.
(d) Thisassertion follows directly from assertion (c).
∗
(e) If
' s a(ω), x = 0, then under the present assumption there exists σ ∗ ∈
∗ ∗ ∗
0 such that (s , σ ) = (0, σ ) is an optimal solution of Problem (A), (B)
∗
or (C). Hence, σ fulfills the state equation (7.13a) and the admissibility
condition (7.13b). Thus, σ ∗ is a safe state vector with respect to the
standard admissibility condition. On the other hand, because of (7.13c),
no safe state vector σ̃ exists in case of the strict admissibility condition.
pf (x) = P Sa(ω),x is in an unsafe state = P a(ω) ∈ Bf,x
= P s∗ a(ω), x > 0 = P s∗ a(ω), x ≥ 0 . (7.16b)
we also have
ps (x) = P z(ω) ∈ Bs,x
Γ
, (7.20a)
pf (x) = P z(ω) ∈ Bf,x .
Γ
(7.20b)
Let x ∈ D be again an arbitrary, but fixed design vector. The well known
approximation method “FORM” is based [54] on the linearization of the state
function s∗Γ = s∗Γ (z, x) at a so-called β-point zx∗ :
Definition 7.2. A β-point is a vector zx∗ lying on the limit state surface
s∗Γ (z, x) = 0 and having minimum distance to the origin 0 of the z-space IRν .
For the computation of zx∗ three different cases have to be taken into account.
Here, we suppose
0 ∈ Bs,x
Γ
or s∗Γ (0, x) < 0, (7.22a)
thus
Γ −1 (0) ∈ Bs,x . (7.22b)
Since in many cases
ā = Ea(ω) = Γ −1 (0),
(7.22b) can be represented often by
Remark 7.5. Condition (7.22c) means that for the design vectors x under con-
sideration the expected parameter vector ā = Ea(ω) lies in the safe parameter
domain Bs,x which is, of course, a minimum requirement for an efficient op-
erating of a system/structure under stochastic uncertainty.
7.4 Approximation I of ps , pf : FORM 263
then there exists a Lagrange multiplier λ∗ ∈ IR such that (zx∗ , λ∗ ) fulfills the
K.–T. conditions (7.25a)–(7.25d).
The discussion of the two cases (7.26a), (7.26b) yields the following auxil-
iary result:
Corollary 7.1. Suppose that zx∗ is a regular β-point. Then, in the present
case (7.22a) we have
s∗Γ (zx∗ , x) = 0 (7.27a)
λ∗
zx∗ = ∇z s∗Γ (zx∗ , x) = 0 (7.27b)
2
with a Lagrange multiplier λ∗ > 0.
Proof. Let zx∗ be a regular β-point. If (7.26a) holds, then (7.25c) yields λ∗ = 0.
Thus, from (7.25a) we obtain zx∗ = 0 and therefore s∗Γ (0, x) > 0 which con-
tradicts assumption (7.22a) in this Sect. 7.4.1. Hence, in case (7.22a) we must
264 7 Computation of Probabilities
have s∗Γ (zx∗ , x) = 0. Since case (7.26b) is left, we have ∇z s∗Γ (zx∗ , x) = 0 for
a regular zx∗ . A similar argument as above yields λ∗ > 0, since otherwise we
have again zx∗ = 0 and therefore s∗Γ (0, x) = 0 which again contradicts (7.22a).
Thus, relation (7.27b) follows from (7.25a).
Suppose now that zx∗ is a regular β-point for all x ∈ D. Linearizing the
state function s∗Γ = s∗Γ (z, x) at zx∗ , because of (7.27a) we obtain
Using (7.16a) and (7.28), the probability of survival ps = ps (x) can be ap-
proximated now by
ps (x) = P s∗Γ z(ω), x < 0 ≈ p̃s (x), (7.29a)
where
p̃s (x) := P ∇z s∗Γ (zx∗ )T z(ω) − zx∗ < 0
= P ∇z s∗Γ (zx∗ , x)T z(ω) < ∇z s∗Γ (zx∗ , x)T zx∗ . (7.29b)
where Φ = Φ(t) denotes the distribution function of the standard N (0, 1)-
normal distribution.
Inserting (7.27b) into (7.29c), we get the following representation:
∗
∇z s∗Γ (zx∗ , x)T λ2 ∇z s∗Γ (zx∗ , x)
p̃s (x) = Φ . (7.29d)
∇z s∗ (zx∗ , x)
Γ
Using again (7.27b), for the approximative probability p̃s = p̃s (x) we find the
following final representation:
Theorem 7.3. Suppose that the origin 0 of IRν lies in the safe domain Bs,x
Γ
,
and assume that the projection problem (7.23) has regular optimal solutions
zx∗ only. Then, ps (x) ≈ p̃s (x), where
p̃s (x) = Φ zx∗ . (7.30)
7.4 Approximation I of ps , pf : FORM 265
As can be seen from the Definition (7.29b) of p̃s , the safe domain, cf.
(7.19a), % &
Γ
Bs,x = z ∈ IRν : s∗Γ (z, x) < 0
at zx∗ . If Bs,x
Γ
lies in Hx , hence, if (Fig. 7.4)
Γ
Bs,x ⊂ Hx , (7.32a)
then
ps (x) ≤ p̃s (x). (7.32b)
On the contrary, if
Hx ⊂ Bs,x
Γ
, (7.33a)
as indicated by Fig. 7.3, then
Γ
Fig. 7.3. Origin 0 lies in Bs,x
266 7 Computation of Probabilities
Fig. 7.4. The safe domain lies in the tangent half space
Again let x be an arbitrary, but fixed design vector. Opposite to (7.22a), here
we assume now that
0 ∈ Bf,x
Γ
or s∗Γ (0, x) > 0, (7.34a)
which means that
Γ −1 (0) ∈ Bf,x . (7.34b)
In this case, the β-point zx∗ ν
is the projection of the origin 0 of IR onto the
Γ
closure B̄s,x Γ
of the safe domain Bs,x , cf. Fig. 7.5. Thus, zx∗ is a solution of the
projection problem
min z2 s.t. s∗Γ (z, x) ≤ 0. (7.35)
Here, we have the Lagrangian
Γ
Fig. 7.5. Origin 0 lies in Bf,x
Proceeding now as in Sect. 7.4.1, we linearize the state function s∗Γ at zx∗ .
Consequently, ps (x) is approximated again by means of formulas (7.29a),
(7.29b), hence,
ps (x) = P s∗Γ z(ω), x < 0 ≈ p̃s (x),
where
p̃s (x) := P ∇z s∗Γ (zx∗ , x)T z(ω) − zx∗ < 0 .
In the same way as in Sect. 7.4.1, in the present case we have the following
result:
Theorem 7.5. Suppose that the origin 0 of IRν is contained in the unsafe do-
Γ
main Bf,x , and assume that the projection problem (7.35) has regular optimal
solutions zx∗ only. Then, ps (x) ≈ p̃s (x), where
p̃s (x) := Φ − zx∗ . (7.40)
1
Since Φ − zx∗ ≤ , the case 0 ∈ Bf,x
Γ
or s∗Γ (0, x) > 0 is of minor interest
2
in practice. Moreover, in most practical cases we may assume, cf. (7.22a)–
(7.22c), that 0 ∈ Bs,x
Γ
.
For the sake of completeness we still consider the last possible case that, for
a given, fixed design vector x (Fig. 7.6), we have
Remark 7.6. Comparing all cases (7.22a), (7.34a) and (7.41), we find that for
approximation purposes only the first case (7.22a) is of practical interest.
7.4 Approximation I of ps , pf : FORM 269
(a)
(b)
ps (x) ≥ αs (7.43)
270 7 Computation of Probabilities
with a prescribed minimum reliability level αs ∈ (0, 1]. Replacing the proba-
bilistic constraint (7.43) by the approximative condition
f (z) := cT z, (7.45)
Definition 7.3. For a given vector c ∈ IRν and a fixed design vector x, let
bc,x denote an optimal solution of (7.47).
7.5 Polyhedral Approximation of the Safe/Unsafe Domain 271
As indicated in Fig. 7.7, the boundary point bc,x can also be interpreted as a
β-point with respect to a shifted origin 0.
The Lagrangian of (7.47) reads – for both cases –
L(z, λ) = L(z, λ; x) := cT z + λ ± s∗Γ (z, x) , (7.48)
Supposing again that the solutions bc,x of (7.47) are regular, related to
Sect. 7.4, here we have this result:
Theorem 7.6 (Necessary optimality conditions). Let c = 0 be a given ν-
vector. If bc,x is a regular optimal solution of (7.47), then there exists λ∗ ∈ IR
such that
Proceeding now in similar way as in Sect. 7.4, the state function s∗Γ = s∗Γ (z, x)
is linearized at bc,x . Hence, with (7.50a) we get
is defined again by
p̃s (x) := P ∇z s∗Γ (bc,x , x)T z(ω) − bc,x < 0 . (7.51b)
Inserting now (7.50b) into (7.51c), for p̃s (x) we find the following repre-
sentation:
Theorem 7.7. Let c = 0 be a given ν-vector and assume that (7.47) has
regular optimal solutions bc,x only. Then the approximation p̃s (x) of ps (x),
defined by (7.51b), can be represented by
(∓c)T bc,x
p̃s (x) = Φ . (7.52)
c
In the following we assume that for any design vector x and any cost vector
c = 0, the vector bc,x indicates an ascent direction for s∗Γ = s∗Γ (z, x) at bc,x ,
i.e.,
∇z s∗Γ (bc,x , x)T bc,x > 0. (7.53a)
7.5 Polyhedral Approximation of the Safe/Unsafe Domain 273
Depending on the case taken in (7.47), this is equivalent, cf. (7.50b), to the
condition
(∓c)T bc,x > 0, (7.53b)
involving the following two cases:
Case 1 (“–”): Problem (7.47) with “≤,”
Case 2 (“+”): Problem (7.47) with “≥.”
Illustrations are given in the Fig. 7.8a,b below.
c1 , c2 , . . . , c j , . . . , cJ , (7.54a)
Γ
Case 1: Minimization of Linear Forms on Bs,x
Γ
In this case, see Fig. 7.9, the transformed safe domain Bs,x is approximated
from outside by a convex polyhedron defined by the linear constraints (7.55a)
or (7.55b) with “–.” Hence, for ps we get the upper bound
(a)
(b)
Fig. 7.8. (a) Condition (7.53a), (7.53b) in Case 1. (b) Condition (7.53a), (7.53b)
in Case 2
where
p̃s (x) = P ∇z s∗Γ (bcj ,x , x)T z(ω) − bcj ,x < 0, j = 1, . . . , J
= P (−cj )T z(ω) − bcj ,x < 0, j = 1, . . . , J . (7.56b)
7.5 Polyhedral Approximation of the Safe/Unsafe Domain 275
Γ
Case 2: Minimization of Linear Forms on Bf,x
Γ
Here, see Figs. 7.5 and 7.8b, the transformed failure domain Bf,x is approx-
imated from outside by a convex polyhedron based on the contrary of the
constraints (7.55a) or (7.55b). Thus, for the probability of failure pf we get,
cf. (7.33a), (7.33b), the upper bound
with
p̃f (x) := P ∇z s∗Γ (bcj ,x , x)T z(ω) − bcj ,x ≥ 0, f = 1, . . . , J
= P cjT z(ω) − bcj ,x ≥ 0, j = 1, . . . , J . (7.59b)
with
p̃f (x) := min P cjT z(ω) − bcj ,x ≥ 0
1≤j≤J
= min 1 − p̃s (x; cj ) = 1 − max p̃s (x; cj ). (7.62b)
1≤j≤J 1≤j≤J
Here,
cjT bcj ,x
p̃s (x; cj ) := Φ . (7.62c)
cj
7.5 Polyhedral Approximation of the Safe/Unsafe Domain 277
Remark 7.9. Since z = z(ω) has an N (0, I)-normal distribution, the random
vector
y = Az(ω) (7.63a)
with the matrix A = A(c1 , . . . , cJ ) given by (7.57a), (7.60a), resp., has an
N (0, Q)-normal distribution with the J × J covariance matrix
where ·E denotes the Euclidean norm, and the radius R = R(x) is a function
of the design vector x (Fig. 7.10).
In the present case for each c = 0 we obtain, see (7.47) with “≤,”
c
bc,x = −R(x) . (7.65a)
c
Γ
Fig. 7.11. Spherical safe domain Bs,x with c-vectors given by the unit vectors
7.6 Computation of the Boundary Points 279
p̃s (x; ±e1 , . . . , ±eν ) = P zj (ω) < R(x), j = 1, . . . , ν
ν
= P − R(x) < zj (ω) < R(x)
j=1
ν
= Φ R(x) − Φ − R(x)
ν
= 2Φ R(x) − 1 . (7.66b)
In the present case of the spherical safe domain (7.64), the exact proba-
bility ps (x) can be determined analytically:
ps (x) = P z(ω) ≤ R(x)
1 1
= ... ν/2
exp − z2 dz. (7.67a)
(2π) 2
z ≤R(x)
≤ p̃s (x; c) = Φ R(x) . (7.67d)
where
C(z, x) = L Γ −1 (z), x
x) := C Γ −1 (z), x , L(z, (7.73a)
x) := H Γ −1 (z), x , h̃(z, x) = h Γ −1 (z), x .
H(z, (7.73b)
s.t.
x)T u − h̃(z, x)T v − bT w ≥ 0
L(z, (7.74b)
x) u − H(z,
C(z, T x) v − A w = 0
T T
(7.74c)
1T v = 1 (7.74d)
v, w ≥ 0. (7.74e)
The above conditions mean that we have a fixed equilibrium matrix C, a
random external load L = L a(ω) independent of the design vector x and
fixed material or strength parameters.
Under the above conditions (7.75a)–(7.75d), problem (7.72a)–(7.72d) takes
the following simpler form:
min cT z (7.76a)
σ,z
282 7 Computation of Probabilities
s.t.
Cσ − L̃(z) = 0 (7.76b)
Hσ ≤ h(x) (7.76c)
Aσ ≤ b. (7.76d)
Thus, we get
p̃s (x; c1 , . . . , cJ ) = P Az(ω) ≤ b
= P b−1
d Az(ω) ≤ 1 = P u(ω) ≤ 1 , (7.78a)
7.7 Computation of the Approximate Probability Functions 283
where
u(ω) := b−1
d Az(ω), (7.78d)
where δij denotes the Kronecker symbol. Obviously, the random J-vector
u = u(ω) has a joint normal distribution with
Eu(ω) = 0, cov u(·) = b−1 T −1
d AA bd . (7.78e)
Inequalities for p̃s of the Markov-type can be obtained, see, e.g., [100], by
selecting a measurable function ϕ = ϕ(u) on IRJ such that
(1) ϕ(u) ≥ 0, u ∈ IRJ (7.79a)
p̃s (x; c1 , . . . , cJ ) ≥ αs
By means of this technique, lower probability bounds for p̃s and cor-
responding approximative chance constraints
may be obtained very easily,
provided that the expectation Eϕ b−1
d Az(ω) can be evaluated efficiently. Ob-
viously, the accuracy of the approximation (7.80) increases with the reliability
level αs used in (7.44a):
1
0 ≤ p̃s − 1 − Eϕ b−1
d Az(ω) ≤ 1 − αs . (7.81b)
ϕ0
Moreover, best lower bounds in (7.80) follow by maximizing the right hand
side of (7.80). Some examples for the selection of a function ϕ = ϕ(u), the
related positive constant ϕ0 and the consequences for the expectation term in
inequality (7.80) are given next.
284 7 Computation of Probabilities
Then, with (7.57a), (7.57b), (7.80) and (7.83) we have the following result:
Theorem 7.9. Suppose that ϕ = ϕ(u) is given by (7.82a), (7.82b) with a
positive definite matrix Q and a positive constant ϕ0 . In this case
1
p̃s (x; c1 , . . . , cJ ) ≥ 1 − trQb−1 T −1
d AA bd , (7.84a)
ϕ0
where
ciT cj
b−1 T −1
d AA bd := . (7.84b)
(ciT bci ,x )(cjT bcj ;x ) i,j=1,...,J
For a given positive definite matrix Q, the best, i.e., largest lower bound
ϕ∗0 > 0 in (7.82b) and therefore also in (7.84a), is given by
l = 1, . . . , J. If (Q−1 )ll denotes the lth diagonal element of the inverse Q−1 of
Q, then
7.7 Computation of the Approximate Probability Functions 285
1
ϕ∗0l = (7.85e)
(Q−1 )ll
and therefore
1
ϕ∗0 = ϕ∗0 (Q) = min . (7.85f)
1≤l≤J (Q−1 )ll
In the special case of a diagonal matrix Q = (qjj δij ) with positive diagonal
elements qjj > 0, j = 1, . . . , J, from (7.85f ) we get
The largest lower bound in the probability inequality (7.84a) can be ob-
tained now by solving the optimization problem
trQb−1
d AA bd
T −1
min , (7.86)
Q0 ϕ∗0 (Q)
J
trQW = qjj wjj
j=1
trQb−1 T −1
d AA bd
min ∗ = trb−1 T −1
d AA bd . (7.87)
Q0
Qdiagonal
ϕ0 (Q)
where ϕ∗0l = ϕ∗0l (Q, u0 ) denotes the minimum value of the convex program
l = 1, . . . , J. As in (7.85d)–(7.85f) we get
(1 − u0l )2
ϕ∗0 (Q, u0 ) = min for Q 0. (7.90d)
1≤l≤J (Q−1 )ll
7.7 Computation of the Approximate Probability Functions 287
Fixing a set
B ⊂ (Q, u0 ) : Q 0, u0 ≤ 1 , (7.91a)
(7.91b)
cf. (7.86). Hence, in the present case the following minimization problem is
obtained:
min trQb−1 T −1
d AA bd + u0 u0
T
(7.92a)
s.t.
(Q, u0 ) ∈ B. (7.92c)
and d < u0 < 1. As in the above examples we define now, cf. Fig. 7.13,
ϕ∗0 (Q, u0 ) = min (u − u0 )T Q(u − u0 ). (7.95b)
u∈[d,1]
We get
ϕ∗0 (Q, u0 ) = min min ϕ∗0la (Q, uo ), ϕ∗0lb (Q, u0 ) , (7.95c)
1≤l≤J
where ϕ∗0la = ϕ∗0la (Q, u0 ), ϕ∗0lb = ϕ∗0lb (Q, u0 ), resp., l = 1, . . . , J, denote the
minimum values of the convex optimization problems
min(u − u0 )T Q(u − u0 ) s.t. ul − dl ≤ 0, (7.95d)
min(u − u0 )T Q(u − u0 ) s.t. 1 − ul ≤ 0. (7.95e)
If the matrix Q is positive definite, then (7.90d) yields
(1 − u0l )2
ϕ∗0lb (Q, u0 ) = , l = 1, . . . , J, (7.95f)
(Q−1 )ll
and in similar way we get
(dl − u0l )2
ϕ∗0la (Q, u0 ) = , l = 1, . . . , J. (7.95g)
(Q−1 )ll
Thus, in case of a positive definite J × J matrix Q we finally get
1
ϕ∗0 (Q, u0 ) = min min (1 − u 0l )2
, (dl − u 0l )2
. (7.95h)
1≤l≤J (Q−1 )ll
with the transformed limit state function s∗Γ = s∗Γ (z, x) or the transformed
Γ
safe domain Bs,x .
Based on the representation of ps = ps (x) by a multiple integral [150] in
IRν , the N (0, I)-distributed random ν-vector z = z(ω) is approximated now
by a random ν-vector z̃ = z̃(ω) having a discrete distribution
N
Pz̃(·) = µ := αj εz(j) , (7.97a)
j=1
α := (α1 , α2 , . . . , αN )T (7.99a)
Z := (z (1) , z (2) , . . . , z (N ) ) (7.99b)
⎛√ ⎞
α1 0
⎜ √ ⎟
√ ⎜ α2 ⎟
( α)d := ⎜ . ⎟ (7.99c)
⎝ .. ⎠
√
0 αN
1 := (1, 1, . . . , 1)T , (7.99d)
√ √ Zα = 0 (7.98a )
(Z αd )(Z αd )T = I (7.98b )
αT 1 = 1 (7.98c )
α > 0. (7.98d )
7.7 Computation of the Approximate Probability Functions 291
z̃ (1) , z̃ (2) , . . . , z̃ (ν) , −z̃ (1) , −z̃ (2) , . . . , −z̃ (ν) (7.101a)
α̃1 α̃2 α̃ν α̃1 α̃2 α̃ν
, , ..., , , , ..., . (7.101b)
2 2 2 2 2 2
Then,
N ν ν
α̃j (j) α̃j
αj z (j) = z̃ + (−z̃ (j) )
j=1 j=1
2 j=1
2
ν
α̃j α̃j
= − z̃ (j) = 0, (7.101c)
j=1
2 2
N ν ν
α̃j (j) (j)T α̃j
αj z (j) z (j)T = z̃ z̃ + (−z̃ (j) )(−z̃ (j) )T
j=1 j=1
2 j=1
2
ν
= α̃j z̃ (j) z̃ (j)T , (7.101d)
j=1
N ν ν ν
α̃j α̃j
αj = + = α̃j . (7.101e)
j=1 j=1
2 j=1
2 j=1
292 7 Computation of Probabilities
ν
α̃j = 1 (7.102b)
j=1
Hence, we have
1
z̃ (j) = d(j) , j = 1, 2, . . . , ν, (7.103b)
α̃j
where
1 1 1 −1
d(1) = √ , d(2) = √
2 1 2 1
1
are the columns of an orthogonal 2 × 2 matrix. Taking α̃1 = α˜2 = , from
2
(7.103b) and (7.101a), (7.101b) we get
1 1 1 1
z (1) = z̃ (1) = / √ =
1 2 1 1
2
1 1 −1 −1
z (2) = z̃ (2) = / √ =
1 2 1 1
2
−1
z (3) = −z̃ (1) =
−1
1
z (4) = −z̃ (2) = ,
−1
1 1 1
where αj = · = , j = 1, . . . , 4.
2 2 4
7.7 Computation of the Approximate Probability Functions 293
Let z̃(·) denote again a random vector in IRν having a discrete probability dis-
tribution Pz̃(·) = µ according to (7.97a), (7.97b). We consider then sequences
of i.i.d. (independent and identically distributed) random ν-vectors
Z1 , Z2 , . . . , Zl , . . . (7.104a)
such that
PZl = Pz̃(·) = µ, l = 1, 2, . . . . (7.104b)
Hence, due to (7.97c), (7.97d) we get
EZl = E z̃ = zµ(dz) = 0 (7.104c)
cov(Zl ) = cov(z̃)) = zz T ν(dz) = I. (7.104d)
L
1
= EZl ZlT = I. (7.106b)
L
l=1
294 7 Computation of Probabilities
(L)
where Zj , j = 1, . . . , L, are i.i.d. random ν-vectors having, see (7.97a),
(7.97b), (7.97e), the discrete distribution
(1) (2)
z√ z√ z√(j) (N )
z√
L
, L
, . . . , L
, . . . , L . (7.106d)
α1 , α2 , . . . , αj , . . . , αN
jl ∈ {1, . . . , N }, l = 1, . . . , L, (7.107a)
then the distribution µ(L) of S (L) is the discrete distribution taking the real-
izations (atoms)
L
1
s = s(j1 , j2 , . . . , jL ) := √ z (jl ) (7.107b)
L l=1
where “ weak
−→ ” means “weak convergence” of probability measures [13, 110].
An important special case arises if one starts with a random ν-vector z̃
having independent components
z̃ = (z̃1 , . . . , z̃ν )T .
ν
µ(L) = PS (L) = ⊗ PS (L) (7.108)
i=1 i
(L)
is the product of the distributions PS (L) of the components Si , i = 1, . . . , ν,
i
of S (L) .
If we assume now that the i.i.d. components z̃i , i = 1, . . . , ν, of z̃ have the
common distribution
1 1
Pz̃i = ε−1 + ε+1 , (7.109a)
2 2
where εz denotes the one-point measure at a point z, then the i.i.d. compo-
(L)
nents Si , i = 1, . . . , ν, of S (L) have, see (7.106a)–(7.106d), (7.107a)–(7.107c),
the Binomial-type distribution
L
L 1
PS (L) = ε 2l−L , i = 1, . . . , ν. (7.109b)
i l 2L √L
l=0
Γ
If, besides the random ν-vector z = z(ω), also the safe domain Bs,x is approx-
s,x
imated, e.g., by a polyhedral set B Γ
such that
Γ ⊂ B Γ or B Γ ⊂ B
B Γ ,
s,x s,x s,x s,x
s,x
p̃s (x, B Γ
, µ) = αj . (7.111b)
z (j) ∈Bs,x
Γ
Using the central limit theorem [13, 49, 110], we get the following convergence
result:
where
Γ ) := P z(ω) ∈ B
ps (x, B Γ .
s,x s,x
A
Sequences, Series and Products
Here, often used statements about generalized mean values of scalar sequences
are given.
In this context the following double sequences are of importance. We use
the following notations:
Lemma A.1. Let (am,n ), (bm,n ), (αm,n ) and (βm,n ) denote sequences from
D+ .
(a) If am,n αm,n and bm,n βm,n , then
Defining β := lim inf βn and β̄ := lim sup βn , we get for each index k ∈ IN:
n n
'
n
(a) If αm,n am,n and β̄ ≥ 0, then lim sup αm,n βm ≤ aβ̄.
n m=k
'
n
(b) If am,n αm,n and β > 0, then aβ ≤ lim inf αm,n βm .
n m=k
'
n
(c) If αm,n ∼ am,n , then limextr αm,n βm ∈ [aβ, aβ̄].
n m=k
For the proof of (a) assume β̄ < ∞. Hence, for any ε > 0 there exists an
index n0 such that
n n0 −1
= (1 + ε)(β̄ + ε) am,n − am,n .
m=1 m=1
'
n
Now, by means of (i) we get lim sup αm,n βm ≤ (1 + ε)(β̄ + ε)a. Since
n m=1
ε > 0 was chosen arbitrarily, this shows assertion (a).
Using the assumptions of (b), there exists to any ε ∈ (0, β] an index n0
with
0 ≤ β − ε ≤ βm and 0 ≤ am,n ≤ (1 + ε)αm,n
for n0 ≤ m ≤ n. Hence, for n ≥ n0
n n0 −1
β−ε n
αm,n βm ≥ ( am,n − am,n )
m=n0
1 + ε m=1 m=1
'
n
holds. Therefore again by means of (i) we have lim inf αm,n βm ≥ (β − ε)
n m=1
(1 + ε)−1 a. Thus, we obtain now statement (b).
In the proof of (c) we consider the following three possible cases:
Case 1: β > 0. The relation in (c) is fulfilled due to (a) and (b). Case 2:
β̄ < 0. This case is reduced to the previous case by means of the transition
from βn to −βn . Case 3: β ≤ 0 ≤ β̄. Here, from (a) we get
n n
lim sup αm,n βm ≤ aβ̄ and lim sup αm,n (−βm ) ≤ a(−β).
n n
m=k m=k
Proof. In all cases (a), (b) and (c) there exists an index n0 with am < 1 for
m ≥ n0 . Hence, it holds
exp ⎝2 a2j ⎠ ≤ 1 + ε.
j≥n1
Tn+1 := (1 − an )Tn + an bn , n = 1, 2, . . . .
Proof. By means of complete induction, formula
n
Tn+1 = b0,n T1 + (bm,n am )bm for n = 1, 2, . . .
m=1
is obtained. Hence, the assertion follows from Lemmas A.3(a) and A.2(c).
Now, consider to given number a ∈ IR the following double sequence in
D+ : ⎧ n
⎨ * (1 − a ), if m < n
j
Φm,n (a) := j=m+1
⎩
1 , if m = n.
Moreover, let IM denote a subset of IN such that the limit
| {1, . . . , n} ∩ IM |
q := lim
n→∞ n
exists and is positive. Then we get the following result:
Lemma A.5. For any a ∈ IR holds:
m a
(a) Φm,n (a) ∼ ( ) ,
n
m | IM0,m |
(b) ∼ ,
n | IM0,n |
(c) Φm,n (a) ∼ Φ|IM0,m |,|IM0,n | (a),
1 q
(d) If a > 0, then lim Φm,n (a) = for each k ∈ IN0 .
n→∞ m a
m∈ IMk,n
306 A Sequences, Series and Products
Since the limit q is positive, to any ε > 0 there exists an index n0 such that
q | IM0,k |
0< ≤ ≤ (1 + ε)q for k ≥ n0 .
1+ε k
Hence, for n0 ≤ m ≤ n we get
m | IM0,m | m
≤ ≤ (1 + ε)2 ,
(1 + ε)2 n | IM0,n | n
and therefore (b) holds. Due to (a), (b), and Lemma A.1(b) we have
m a |IM0,m |
a
Φm,n (a) ∼ ∼ ∼ Φ|IM0,m |,|IM0,n | (a). (i)
n |IM0,n |
Here again q1 , . . . , qκ denote the limit parts of sets IN(1) , . . . , IN(κ) in IN ac-
cording to (U6). Then double sequence
⎧ n
⎨ * (1 − Aj ), for m < n
j
bm,n := j=m+1
⎩
1 , for m = n
κ
κ
= Φ|IN(k) (k) (bk ) ∼ Φm,n (bk )
0,m |,|IN0,n |
k=1 k=1
κ
m bk
∼ ( ) ∼ Φm,n (b1 + . . . + bk ). (ii)
n
k=1
This holds true for any δ ∈ (0, A), and therefore (c) holds.
Applying now Lemma A.6, the limit of sequence
An fn
Tn+1 := (1 − )Tn + + gn for n = 1, 2, . . .
n n
can be calculated. Here, T1 ∈ IR and (fn ), (gn ) ∈ F with
holds.
we get
1 1 fn
Tn+1 := (f1 + . . . + fn ) = 1− Tn + ,
n n n
In Sect. 6.5 (or more precisely in the proof of Lemma 6.13) the following
recursive matrix equation is of interest: For n = 1, 2, . . . let
T
1 1 1
Vn+1 = I− Bn Vn I − Bn + Cn
n n n
BV + V B T = C, (∗)
where we put
B := q1 B (1) + . . . + qκ B (κ) and C := q1 C (1) + . . . + qκ C (κ) .
For the proof of this theorem we need further auxiliary results. First of we
state the following remark:
Remark A.1. If for n ∈ IN holds
Bn + BnT ≥ 2
an I for Bn ≤ bn
an , bn , then (cf. Lemma 6.6)
with numbers
2
I − 1 Bn ≤ 1 − 2un with un :=
an −
1 2
b .
n n 2n n
-n+1 := 1 (E1 + . . . + En ) −→ 0
E and Fn < ∞.
n n
1 T
(2) -n+1 = (I −
Tn+1 − E Bn ) (Tn(2) − E -n I − 1 Bn
-n ) + E +
1 -n
−1 E
n n n
T
1 -n ) I − 1 Bn
= I−Bn (Tn(2) − E
n n
3
1 - 1 4
+ E - - T
n − (Bn En + En Bn ) +
-n B T .
Bn E n
n n
Hence, the above remark gives us
(2) -n+1 ≤ 2 -n
Tn+1 − E 1− un Tn(2) − E
n
-n 9
E 1
:
+ 1 + 2Bn + Bn .
2
n n
Again, using Lemma A.7, Tn − E -n −→ 0 follows. Finally, because of the
(2)
(2)
assumptions about En also Tn converges to 0.
Proof (of Lemma A.9). Having a solution V of (∗) it holds
1 1
T
1
Vn+1 − V = I − Bn (Vn − V ) + V I − Bn + Cn − V
n n n
T
1 1
= Bn (Vn − V ) I − Bn
I−
n n
17 8 1
+ Cn − (Bn V + V BnT ) + 2 Bn V BnT .
n n
Moreover, due to Lemma A.8 and the above assumptions we obtain
κ
1
(B1 + . . . + Bn ) −→ qk B (k) =: B,
n
k=1
κ
1
(C1 + . . . + Cn ) −→ qk C (k) =: C.
n
k=1
B.1.1 Consequences
In special case
αn = an Tn for n = 1, 2, . . .
'
with nonnegative An -measurable random variables an , fulfilling an = ∞
n
a.s., we get this result:
314 B Convergence Theorems for Stochastic Sequences
Proof. According to Lemma B.1 we know that T := lim Tn exists a.s. and
' ' n→∞
an Tn < ∞ a.s. Hence, due to an = ∞ a.s. we obtain the proposition.
n n
for indices n ∈ IN(k) , where k ∈ {1, . . . , κ}. Then, for all k = 1, . . . , κ and
n ∈ IN we get
lim Bn = A(k) − ε, An ≥ Bn − ∆Bn ,
n∈IN(k)
Bn
En Sn+1 ≤ 1− + βn(1) Sn + αn(2) ,
n
(1) (1)
where βn := αn + ∆B n
. For sufficiently large indices n it is ∆Bn = 0 a.s.,
' ∆Bn n
and therfore n < ∞ a.s. Hence, it holds
n
and therefore (S̄n ) converges to zero a.s. due to Lemma B.2. Hence, due to
the second inequality and the definition also (Sn ) converges a.s. to 0.
For n ≥ n0 we have
1− A−ε
1− A−ε
n
(1 − 2βn(1) ) ≤ pn ≤ n
.
1 − Bnn 1 − Bnn
As in the proof of Lemma A.6 it can be shown that for n0 ≤ m < n it holds
⎛ ⎛ ⎞
(k) ⎞
|IN0,l |
n κ
⎜
⎝1 − l l ⎠⎟
Bj B
1− = ⎝ (k) ⎠
j=m+1
j | IN |
k=1
(k)
l∈INm,n 0,l
⎛ ⎞
κ
⎜ qk (A − ε) ⎟
(k)
= ⎝ 1− (k) ⎠ ∼ Φm,n (A − ε),
k=1 (k) IN0,l
l∈INm,n
n
1− A−ε
j
B
∼ 1.
j=m+1 1 − jj
316 B Convergence Theorems for Stochastic Sequences
The last two relations and the above inequality for pn finally yield
Tn , Bn , vn ≥ 0,
lim inf An ≥ A(k) , lim sup Bn ≤ B (k) , lim sup Cn ≤ C (k) a.s.,
n∈IN(k) n∈IN(k) n∈IN(k)
with real numbers A(1) , . . . , A(κ) , B (1) , . . . , B (κ) , C (1) , . . . , C (κ) and w(1) , . . . ,
w(κ) .
Now, for sequence (Tn ) the following convergence theorem holds:
For the proof of this theorem the following theorem of Egorov (cf., e.g., [67]
page 285) is needed:
(b) lim sup Un ≤ U is fulfilled uniformly on Ω0 , that is: To any number ε > 0
n
there exists an index n0 ∈ IN such that
we have
Ω0 ⊂ Ω1 ∩ . . . ∩ Ωn =: Ω (n) ∈ An .
Let now S1 := T1 , and for n ≥ 1 define
an 7 an 8
Sn+1 := 1 − Sn + Tn+1 − 1 − Tn IΩ (n) .
n n
Then, due to Ω (1) ⊃ Ω (2) ⊃ . . . ⊃ Ω (n) we get
T , on Ω (n)
Sn+1 = n+1 an
1 − n Sn , otherwise.
'
n
Then, sequence ( Am,n Um )n holds an asymptotic normal distribution:
m=1
'
n
Lemma B.7. ( Am,n Um )n converges in distribution to a Gaussian dis-
m=1
tributed random vector U with EU = 0 and cov(U ) = V .
Proof. Let U denote an arbitrary N (0, V )-distributed random vector, and let
0 = h ∈ IRµ be chosen arbitrarily. Now, for m ≤ n define
n
Ym,n := hT Am,n Um , Yn := Ym,n , Y := hT U.
m=1
Due to the “Cramér-Wold-Devise” (cf., e.g., [43], Theorem 8.7.6), only the
convergence of (Yn ) in distribution to Y has to be shown.
Due to (K0), (Ym,n , Am+1 )m≤n is a martingale difference scheme. There-
fore, according to the theorem of Brown (cf. [43], Theorem 9.2.3), the following
conditions have to be verified:
n
stoch
2
Em Ym,n −→ hT V h, (KV1)
m=1
n
stoch
2
Em Ym,n I{|Ym,n |>δ} −→ 0 for all δ > 0. (KL1)
m=1
2
(KV1) follows from (KV), since Em Ym,n = hT Am,n Em (Um Um T
)ATm,n h.
Moreover, it is | Ym,n |≤ hAm,n Um ≤ hAm,n Um , and therefore for
δ > 0 we get
320 B Convergence Theorems for Stochastic Sequences
2
Em (Ym,n I{|Ym,n |>δ} )
≤ h2 Am,n 2 Em Um 2 I{ δ
Am,n Um > h } .
C.1 Miscellaneous
First we give inequalities about the error occurring in the inversion of a per-
turbed matrix, cf. [68].
Lemma C.1. Perturbation lemma. Let A and ∆A denote matrices such that
Lemma C.2. Let (An ) and (Bn ) denote two matrix sequences with the fol-
lowing properties:
L2 Bn − An
Bn−1 − A−1
n ≤ .
1 − LBn − An
tr(A) = λ1 + . . . + λn , det(A) = λ1 · . . . · λn .
Hence
ν−1 ν−1
det(A) λ 2 + . . . + λν tr(A)
= λ 2 · . . . · λν ≤ < .
λ1 ν−1 ν−1
Then, un = √ 1
(1)
, and the following lemma holds:
1+t2n
Lemma C.6. If
dn 1 + t2n < 1 ⇐⇒ An − A < λ̄u(1)
n ,
then
qtn + dn 1 + t2n
tn+1 ≤ .
1 − dn 1 + t2n
n + Q(An − A)un .
vn(2) = Q(An un ) = Au(2)
vn(1) ≥ λ̄u(1)
n − An − A, (i)
vn ≤ µu(2)
(2)
n + An − A, (ii)
n ≤ µun ,
Au(2) (2)
P , Q ≤ 1, un = 1.
(1)
From (i), (ii) and the assumption λ̄un − An − A > 0 we finally get
(1)
vn = 0 and
(2)
un+1 (2)
vn
(2)
µun + An − A
tn+1 := (1)
= (1)
≤ (1)
un+1 vn λ̄un − An − A
qtn + dn 1 + t2n
= .
1 − dn 1 + t2n
In the next step conditions are given such that (tn ) converges to zero:
Then tn → 0, n → ∞.
C.2 The v. Mises-Procedure in Case of Errors 325
tn ≤ T, n ≥ n0 . (∗∗)
Because of 2q
1+q ∈ [0, 1) and dn −→ 0, we finally have tn −→ 0.
(c) tn −→ 0.
Proof. Because of un = (1 + t2n )− 2 the equivalence of (b) and (c) is clear.
(1) 1
For n ∈ IN we have
un , vn = uTn A + (An − A) un
n + un
= λ̄u(1) n + un (An − A)un ,
2 (2)T
Au(2) T
and
T
un (An −A)un ≤ An −A , 0 ≤ u(2)T
n n ≤ µun .
Au(2) (2) 2
326 C Tools from Matrix Calculus
un , vn
n − dn ≤
u(1) 2
n + qun + dn
≤ u(1) 2 (2) 2
λ̄
= q + (1 − q)u(1)
n + dn .
2
Due to dn −→ 0 and q ∈ [0, 1) we obtain now also the equivalence of (a) and
(b).
λ̄ − µ (1) 1 − q (1)
sup An − A < un0 ⇐⇒ en0 < un0 . (ii)
n≥n0 3 3
Conversely, suppose that this condition holds. Then there are two possibilities:
(1)
Case 1: en0 = 0. Then un0 = 0, tn0 < ∞, and also en0 = 0 < (tn0 ) with
function (t) from Lemma C.7. According to this Lemma we get then tn−→ 0.
Case 2: en0 > 0. Due to (ii) it is
1−q 1
0 < en0 ≤ < √ (1 − q) = (1).
3 2 2
Since (t) is continuous on [1, ∞) and decreases monotonically to zero, there
is a number T ∈ [1, ∞) such that
17. Box G.E., Draper N.R. (1987) Empirical Model-Building and Response Sur-
faces. Wiley, New York, Chichester, Brisbane, Toronto, Singapore
18. Box G.E.P., Wilson K.G. (1951) On the Experimental Attainment of Optimum
Conditions. Journal of the Royal Statistical Society 13: 1–45
19. Breitung K., Hohenbichler M. (1989) Asymptotic Approximations for Multi-
variate Integrals with an Application to Multinormal Probabilities. Journal of
Multivariate Analysis 30: 80–97
20. Breitung K. (1984) Asymptotic Approximations for Multinormal Integrals.
ASCE Journal of the Engineering Mechanics Division 110(1): 357–366
21. Breitung K. (1990) Parameter Sensitivity of Failure Probabilities. In: Der
Kiureghian A., Thoft-Christensen P. (eds.), Reliability and Optimization of
Structural Systems ’90. Lecture Notes in Engineering, Vol. 61, 43–51. Springer,
Berlin Heidelberg New York
22. Bronstein I.N., Semendjajew K.A. (1980) Taschenbuch der Mathematik. Verlag
Harri Deutsch, Thun, Frankfurt/Main
23. Bucher C.G., Bourgund U. (1990) A Fast and Efficient Response Surface Ap-
proach For Structural Reliability Problems. Structural Safety 7: 57–66
24. Bullen P.S. (2003) Handbook of Means and Their Inequalities. Kluwer,
Dordrecht
25. Chankong V., Haimes Y.Y. (1983) Multiobjective Decision Making. Oxford,
North Holland, New York, Amsterdam
26. Chung K.L. (1954) On a Stochastic Approximation Method. Annals of Math-
ematical Statistics 25: 463–483
27. Craig J.J. (1988) Adaptive Control of Mechanical Manipulators. Addison-
Wesley, Reading, MA
28. Der Kiureghian A., Thoft-Christensen P. (eds.) (1990) Reliability and Opti-
mization of Structural Systems ’90. Lecture Notes in Engineering, Vol. 61.
Springer, Berlin Heidelberg New York
29. Dieudonné J. (1960) Foundations of Modern Analysis. Academic, New York
30. Ditlevsen O. (1981) Principle of Normal Tail Approximation. Journal of the
Engineering Mechanics Division, ASCE, 107: 1191–1208
31. Ditlevsen O., Madsen H.O. (1996) Structural Reliability Methods. Wiley,
Chichester
32. Dolinski K. (1983) First-Order Second Moment Approximation in Reliability
of Systems: Critical Review and Alternative Approach. Structural Safety 1:
211–213
33. Draper N.R., Smith H. (1980) Applied Regression Analysis. Wiley, New York
34. Ermoliev Y. (1983) Stochastic Quasigradient Methods and Their Application
to System Optimization. Stochastics 9: 1–36
35. Ermoliev Y., Gaivoronski A. (1983) Stochastic Quasigradient Methods and
Their Implementation. IIASA Workpaper, Laxenburg
36. Ermoliev Yu. (1988) Stochastic Quasigradient Methods. In: Ermoliev Yu., Wets
R.J.-B. (eds.), Numerical Techniques for Stochastic Optimization, Springer,
Berlin Heidelberg New York, 141–185
37. Ermoliev Yu., Wets R. (eds.) (1988) Numerical Techniques for Stochastic Op-
timization. Springer Series in Computational Mathematics Vol. 10. Springer,
Berlin Heidelberg New York
38. Eschenauer H.A. et al. (1991) Engineering Optimization of Design Processes.
Lecture Notes in Engineering, Vol. 63. Springer, Berlin Heidelberg New York
References 329
105. Montgomery C.A. (1984) Design and Analysis of Experiments. Wiley, New
York
106. Myers R.H. (1971) Response Surface Methodology. Allyn and Bacon, Boston
107. Nevel’son M.B., Khas’minskii R.Z. (1973) An Adaptive Robbins–Monro Pro-
cedure. Automation and Remote Control 34: 1594–1607
108. Pantel M. (1979) Adaptive Verfahren der Stochastischen Approximation. Dis-
sertation, Universität Essen-Gesamthochschule, FB 6-Mathematik
109. Park S.H. (1996) Robust Design and Analysis for Quality Engineering.
Chapman and Hall, London
110. Paulauskas V., Rackauskas A. (1989) Approximation Theory in the Central
Limit Theorem. Kluwer, Dordrecht
111. Pfeiffer F., Johanni R. (1987) A Concept for Manipulator Trajectory Planning.
IEEE Journal of Robotics and Automation RA-3: 115–123
112. Pflug G.Ch. (1988) Step Size Rules, Stopping Times and Their Implementation
in Stochastic Quasigradient Algorithms. In: Ermoliev Yu., Wets R.J.-B. (eds.),
Numerical Techniques for Stochastic Optimization, 353–372, Springer, Berlin
Heidelberg New York
113. Phadke M.S. (1989) Quality Engineering Using Robust Design. P.T.R. Prentice
Hall, Englewood Cliffs, NJ
114. Pahl G., Beitz W. (2003) Engineering Design: A Systematic Approach.
Springer, Berlin Heidelberg New York London
115. Plöchinger E. (1992) Realisierung von adaptiven Schrittweiten für stochastis-
che Approximationsverfahren bei unterschiedlichem Varianzverhalten des
Schätzfehlers. Dissertation, Universität der Bundeswehr München
116. Polyak B.T., Tsypkin Y.Z. (1980) Robust Pseudogradient Adaption Algo-
rithms. Automation and Remote Control 41: 1404–1410
117. Polyak B.G., Tsypkin Y.Z. (1980) Optimal Pseudogradient Adaption Algo-
rithms. Automatika i Telemekhanika 8: 74–84
118. Prekopa A., Szantai T. (1978) Flood Control Reservoir System Design Using
Stochastic Programming. Mathematical Programming Study 9: 138–151
119. Press S.J. (1972) Applied Multivariate Analysis. Holt, Rinehart and Winston,
New York, London
120. Rackwitz R., Cuntze R. (1987) Formulations of Reliability-Oriented Optimiza-
tion. Engineering Optimization 11: 69–76
121. Reinhart J. (1997) Stochastische Optimierung von Faserkunststoffverbundplat-
ten. Fortschritt–Berichte VDI, Reihe 5, Nr. 463. VDI Verlag GmbH, Düsseldorf
122. Reinhart J. (1998) Implementation of the Response Surface Method (RSM)
for Stochastic Structural Optimization Problems. In: Marti K., Kall P. (eds.),
Stochastic Programming Methods and Technical Applications. Lecture Notes
in Economics and Mathematical Systems, Vol. 458, 394–409
123. Review of Economic Design, ISSN: 1434–4742/50. Springer, Berlin Heidelberg
New York
124. Richter H. (1966) Wahrscheinlichkeitstheorie. Springer, Berlin Heidelberg New
York
125. Robbins H., Monro S. (1951) A Stochastic Approximation Method. Annals of
Mathematical Statistics 22: 400–407
126. Robbins H., Siegmund D. (1971) A Convergence Theorem for Non Negative
Almost Supermaringales and Some Applications. Optimizing Methods in Sta-
tistics. Academic, New York, 233–257
References 333