Hilbert Back Prop
Daniel Crespin
Facultad de Ciencias
Universidad Central de Venezuela
Abstract
The specific purpose of backpropagation is to calculate the gradients of the quadratic
error of differentiable neural networks. In this paper formulas for such gradients are
obtained in the case of neural networks having inputs, weights and outputs in Hilbert
spaces. The structure of neural networks makes the backpropagated errors appear
naturally as the result of successively applying transposed derivatives to objective errors
(see 33 and 35 below). We use: 1.- the mathematical formalism of neural networks
described in [?]; 2.- Differential Calculus on normed spaces; and 3.- knowledge of
transposes of linear maps between Hilbert spaces. For general differentiable neural
networks the Hilbert space setting extracts the essence of backpropagation while avoiding
coordinates. Generalizations to Banach spaces can be expected to be technically
complicated.
§1 Linearity
1. Let X, Y be sets. The collection of all functions from X to Y is denoted Y^X. For
f ∈ Y^X the statement “f is linear” means that the domain and codomain are vector
spaces and that f is both additive, f(x + y) = f(x) + f(y), and homogeneous,
f(λx) = λf(x).
3. If E and F are vector spaces then F^E is also a vector space. On the other hand the
non-linear functions from E to F are the elements of the complement F^E − Lin(E, F),
where Lin(E, F) ⊆ F^E is the subset of all the linear maps (possibly discontinuous)
from E to F. Non-linearity of functions is a set-theoretically frequent property while
linearity is rare.
4. In what follows and unless otherwise stated, all linear maps will be continuous, either
by hypothesis (possibly non-explicitly stated) or because continuity can be proved.
§2 Hilbert spaces
5. Let E, F be real Hilbert spaces and denote L(E, F) the space of continuous linear
maps T : E → F equipped with the sup norm ‖T‖ = sup{‖T(x)‖ : ‖x‖ ≤ 1}.
7. In case F = R the elements f ∈ L(E, R) are called linear forms or linear functionals
and L(E, R) is the dual of E.
The dual L(E, R) is also a Hilbert space, a fact that can be established using the
Riesz representation theorem stated below.
§3 Derivatives
10. References for Calculus in Hilbert spaces are [?], [?] and [?]. These authors discuss the
more general case of normed spaces which applies directly to Hilbert spaces.
The continuously differentiable maps from an open set U ⊆ E to F form the subset
C^1(U, F) ⊆ F^U.
14. For the derivatives of compositions of smooth maps the usual chain rule applies. Let
E, F, G be Hilbert spaces, U ⊆ E an open set, f ∈ C^1(U, F); V ⊆ F open, g ∈
C^1(V, G). Suppose that x ∈ U is such that y = f(x) ∈ V. Then U′ = f^{-1}(V) ∩ U is
open in E, x ∈ U′, and the derivative of the composition g ∘ f : U′ → G at x is equal
to the composition of the derivatives of f at x and g at f(x)

D(g ∘ f)(x) = Dg(f(x)) ∘ Df(x)
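As a purely numerical illustration (a minimal sketch using NumPy and finite-difference Jacobians; the maps f and g below are arbitrary choices, not taken from the text), the chain rule can be checked by comparing the Jacobian of the composition with the composition of the Jacobians:

    import numpy as np

    def f(x):  # an arbitrary smooth map R^2 -> R^3
        return np.array([np.sin(x[0]), x[0] * x[1], np.exp(x[1])])

    def g(y):  # an arbitrary smooth map R^3 -> R^2
        return np.array([y[0] + y[1] ** 2, np.cos(y[2])])

    def jacobian(h, x, eps=1e-6):
        """Finite-difference Jacobian of h at x (rows = outputs, columns = inputs)."""
        y0 = h(x)
        J = np.zeros((y0.size, x.size))
        for j in range(x.size):
            dx = np.zeros(x.size)
            dx[j] = eps
            J[:, j] = (h(x + dx) - y0) / eps
        return J

    x = np.array([0.3, -0.7])
    J_composition = jacobian(lambda t: g(f(t)), x)    # D(g∘f)(x)
    J_chain = jacobian(g, f(x)) @ jacobian(f, x)      # Dg(f(x)) composed with Df(x)
    print(np.allclose(J_composition, J_chain, atol=1e-4))   # True up to discretization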
Chain rule for multicompositions
15. Consider Hilbert spaces E^(1), …, E^(p+1) with open sets U^(k) ⊆ E^(k) and a sequence of
smooth maps f^(k) : U^(k) → E^(k+1) with k = 1, …, p. This sequence is composable if the
image of each map is contained in the domain of the next: f^(k)(U^(k)) ⊆ U^(k+1). The
multicomposition of these maps is f^(p) ∘ ⋯ ∘ f^(1) : U^(1) → U^(p+1).
17. Multichain rule for derivatives: Given an element in the first domain, x ∈ U^(1),
define the chain of values as x^(1) = x and iteratively x^(k+1) = f^(k)(x^(k)), k = 1, …, p.
If the f^(k) are smooth then the derivative of the multicomposition at x is equal to
the composition of the derivatives of the componends at the chain of values

D(f^(p) ∘ ⋯ ∘ f^(1))(x) = Df^(p)(x^(p)) ∘ ⋯ ∘ Df^(1)(x^(1))
§4 Gradients
19. The squared norm function is the map Sq : F → R given by Sq(y) = ‖y‖^2.
20. The squared norm function is smooth with derivative at b ∈ F equal to the linear map
DSq(b) : F → R specified by

DSq(b) · ∆y = 2⟨b, ∆y⟩                                                        (1)
§5 Transposes
A proof for complex Hilbert spaces can be found in [?], Theorem 1, p. 131; with few
changes it is valid for real Hilbert spaces as well.
§6 Transposed derivatives
26. By definition the transposed derivative of f at x is the transposed linear map of the
derivative, D*f(x) = (Df(x))*. Therefore for all ∆x ∈ E, ∆y ∈ F we have

⟨Df(x) · ∆x, ∆y⟩ = ⟨∆x, D*f(x) · ∆y⟩
27. Let E, F be Hilbert spaces, U ⊆ E open, f ∈ C^1(U, F), x ∈ U, Df(x) ∈ L(E, F) and
D*f(x) ∈ L(F, E). The following diagram shows f and Df(x) pointing left to right
while D*f(x) ∈ L(F, E) points in the opposite direction, right to left

U ——f——→ F
E ——Df(x)——→ F
E ←——D*f(x)—— F
28. For a given f there is a collective derivative Df equal to the parametrized collection of
linear maps
Df = {Df (x) : E → F | x ∈ U }
29. The collective transposed derivative is then a parametrized collection obtained by
transposing each of the derivatives
D∗ f = {D∗ f (x) : F → E | x ∈ U }
30. In general an element y ∈ f(U) does not uniquely specify a transposed derivative with
domain F and codomain E because elements x with y = f(x) could be non-unique or
may not even exist. The exception is when f is invertible.
31. In case the domain and range are the usual numerical spaces, E = R^m, F = R^n, then
x = (x^1, …, x^m), f(x) = (f^1(x), …, f^n(x)), the derivative Df(x) has n × m matrix
equal to the Jacobian Jf(x) and the transposed derivative D*f(x) has m × n matrix
equal to the transposed Jacobian J^T f(x)

Jf(x) = [ ∂f^1(x)/∂x^1  ⋯  ∂f^1(x)/∂x^m ]
        [       ⋮       ⋱        ⋮      ]        (n × m)
        [ ∂f^n(x)/∂x^1  ⋯  ∂f^n(x)/∂x^m ]

J^T f(x) = [ ∂f^1(x)/∂x^1  ⋯  ∂f^n(x)/∂x^1 ]
           [       ⋮       ⋱        ⋮      ]     (m × n)
           [ ∂f^1(x)/∂x^m  ⋯  ∂f^n(x)/∂x^m ]
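The following sketch (again assuming NumPy and finite differences; the map f is an arbitrary illustration) builds the matrices of Df(x) and D*f(x) and checks the defining relation ⟨Jf(x)∆x, ∆y⟩ = ⟨∆x, J^T f(x)∆y⟩:

    import numpy as np

    def f(x):  # an arbitrary smooth map R^3 -> R^2
        return np.array([x[0] * x[1] + x[2], np.sin(x[0]) - x[2] ** 2])

    def jacobian(h, x, eps=1e-6):
        y0 = h(x)
        J = np.zeros((y0.size, x.size))
        for j in range(x.size):
            dx = np.zeros(x.size)
            dx[j] = eps
            J[:, j] = (h(x + dx) - y0) / eps
        return J

    x = np.array([0.5, -1.0, 2.0])
    J = jacobian(f, x)        # matrix of Df(x), size 2 x 3
    Jt = J.T                  # matrix of D*f(x), size 3 x 2
    dx, dy = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5])
    print(np.isclose(np.dot(J @ dx, dy), np.dot(dx, Jt @ dy)))   # True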
§7 Errors
33. The objective error is by definition the difference between the actual output and the
desired output
f (a) − b ∈ F
34. The quadratic error is the square of the norm of the objective error, ‖f(a) − b‖^2.
§8 Error functions
The objective error function of f ∈ F^U with desired output b ∈ F is the map
e_b[f] : U → F defined as

e_b[f](x) = f(x) − b
37. The quadratic error function of the F-valued function f ∈ F^U is the real valued function
q_b[f] : U → R defined as the squared norm of the objective error

q_b[f](x) = ‖e_b[f](x)‖^2

which can also be written as q_b[f](x) = ‖f(x) − b‖^2 = ⟨f(x) − b, f(x) − b⟩.
39. Suppose for the remainder of this section that f ∈ C^1(U, F). Then e_b[f] ∈ C^1(U, F)
and, since the given map f and the objective error function e_b[f] are equal up to the
constant b, their derivatives are equal

De_b[f](x) = Df(x)
40. Comparison with (1) and the chain rule imply that the derivative of the quadratic error
function calculated at x ∈ U and evaluated on ∆x ∈ E is equal to the derivative of the
squared norm calculated at eb [f ](x) = f (x) − b and evaluated on ∆y = Deb [f ](x) · ∆x
Dq_b[f](x) · ∆x = DSq(f(x) − b) · (De_b[f](x) · ∆x) = ⟨2(f(x) − b), Df(x) · ∆x⟩        (3)
41. The following observation is crucial: The increment ∆x in (3) above can be isolated
on the right of the inner product by means of a transposition
Dq_b[f](x) · ∆x = ⟨2D*f(x) · (f(x) − b), ∆x⟩
42. From 22 it follows that for f ∈ C 1 (U, F ) the gradient at x ∈ U of the quadratic error
function is equal to twice the transposed derivative calculated at the input and evaluated
on the objective error
∇qb [f ](x) = 2D∗ f (x) · (f (x) − b)
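A numerical sketch of this formula (assuming NumPy; the map f and the desired output b are arbitrary choices): the gradient computed as 2 J^T f(x)(f(x) − b) agrees with a direct finite-difference gradient of q_b[f]:

    import numpy as np

    def f(x):  # an arbitrary smooth map R^3 -> R^3
        return np.array([x[0] ** 2 + x[1], np.sin(x[1]) * x[2], x[0] - x[2]])

    b = np.array([1.0, 0.0, -0.5])                # desired output
    q = lambda x: np.sum((f(x) - b) ** 2)         # quadratic error function

    def jacobian(h, x, eps=1e-6):
        y0 = np.atleast_1d(h(x))
        J = np.zeros((y0.size, x.size))
        for j in range(x.size):
            dx = np.zeros(x.size)
            dx[j] = eps
            J[:, j] = (np.atleast_1d(h(x + dx)) - y0) / eps
        return J

    x = np.array([0.2, 1.3, -0.4])
    grad_formula = 2.0 * jacobian(f, x).T @ (f(x) - b)   # 2 D*f(x)·(f(x) - b)
    grad_numeric = jacobian(q, x).ravel()                # direct gradient of q_b[f]
    print(np.allclose(grad_formula, grad_numeric, atol=1e-3))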
§9 Backpropagators
46. Backpropagation Rule: For f ∈ C^1(U, F) the gradient at x ∈ U of the quadratic
error function is equal to its backpropagator calculated at x and evaluated on the
objective error

∇q_b[f](x) = Bf(x) · (f(x) − b)
Given maps f_1, …, f_n with common domain U and codomains F_1, …, F_n, their product
is the map with codomain F_1 × ⋯ × F_n

f = f_1 × ⋯ × f_n = (f_1, …, f_n)
49. And the derivative of the product f is equal to the product of the derivatives of the
factors
Df (x) = Df1 (x) × · · · × Dfn (x) = (Df1 (x), . . . , Dfn (x))
resulting in the following rule:
53. Briefly, the backpropagator of a product is the sum of the backpropagators of the factors.
55. Given an increment ∆x = (∆x_1, …, ∆x_m) the derivative at x evaluated at the increment
is equal to the sum of its partials at x evaluated at the components of the increment

Df(x) · ∆x = D_{E_1}f(x) · ∆x_1 + ⋯ + D_{E_m}f(x) · ∆x_m
56. Define the partial backpropagators of f at x as twice the transposed partial derivatives
calculated at x

B_{E_i}f(x) = 2D*_{E_i}f(x),   i = 1, …, m
57. Backpropagators for multiproduct domains: The backpropagator of f calculated
at x ∈ U and evaluated on ∆y ∈ F is equal to the product of the partial backpropagators
calculated at x and evaluated on ∆y

Bf(x) · ∆y = (B_{E_1}f(x) · ∆y, …, B_{E_m}f(x) · ∆y)
62. Said carelessly, the backpropagator of a multicomposition of k componends is 1/2^{k−1}
times the multicomposition, in reverse order, of the backpropagators of the componends.
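The rule can be illustrated numerically (a sketch assuming NumPy; the three componends below are arbitrary): the gradient of the quadratic error of the multicomposition is obtained by applying the transposed derivatives in reverse order to the objective error, with a single factor of 2:

    import numpy as np

    f1 = lambda x: np.array([np.tanh(x[0]), x[0] + x[1]])               # R^2 -> R^2
    f2 = lambda x: np.array([x[0] * x[1], x[0] - x[1], np.sin(x[1])])   # R^2 -> R^3
    f3 = lambda x: np.array([x[0] + x[1] ** 2 - x[2]])                  # R^3 -> R

    def jacobian(h, x, eps=1e-6):
        y0 = np.atleast_1d(h(x))
        J = np.zeros((y0.size, x.size))
        for j in range(x.size):
            dx = np.zeros(x.size)
            dx[j] = eps
            J[:, j] = (np.atleast_1d(h(x + dx)) - y0) / eps
        return J

    x1, b = np.array([0.4, -0.2]), np.array([0.7])     # input and desired output
    x2, x3, x4 = f1(x1), f2(f1(x1)), f3(f2(f1(x1)))    # chain of values
    delta = x4 - b                                     # objective error
    # reverse-order transposed derivatives applied to the error, one factor of 2
    grad = 2.0 * jacobian(f1, x1).T @ (jacobian(f2, x2).T @ (jacobian(f3, x3).T @ delta))
    q = lambda x: np.sum((f3(f2(f1(x))) - b) ** 2)
    print(np.allclose(grad, jacobian(q, x1).ravel(), atol=1e-3))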
63. The gradient descent method of Numerical Calculus/Analysis requires knowledge of the
gradient of the function to be minimized. When attempting to minimize by stepwise
change of parameters the quadratic error of a smooth multilayer neural network the
above product, coproduct and multichain rules play a central role. To actually reach
a global minimum for smooth multilayer networks using these gradients is in practice
a hopeless task.
§10 Discontinuous perceptrons in Hilbert spaces
Discontinuous units, layers and networks
64. Let E, F, G be Hilbert spaces, U ⊆ E, W ⊆ G open sets. A neural unit (artificial neu-
ron, processing unit, parametric map) with inputs from U ⊆ E, parameters (connection
weights, controls, indexes) in W ⊆ G and outputs in F is any function f : U × W → F .
The pair E, F (domain and codomain, in this order) is the architecture of f .
65. As discussed in ??, the neural graph of the parametric map f : U × W → F is the
following neuroscience inspired diagram.
Vector spaces, their subsets and elements can be hinted at using planes, regions and
points. Neural units can then be geometrically represented as in the following
illustration.
Figure 2. Geometric representation of a neural unit f : U × W → F .
The richer the structures assumed for the neural unit, the more elaborate the repre-
sentations can be. See ??, ??, ?? and ?? below.
The collection of all neural units with inputs from U × W and outputs in F,

F^{U×W} = {f | f : U × W → F},

is a vector space.
68. Our results on backpropagation require the mathematical formulation of neural net-
works given in [?] which the reader may consult in case of need.
70. Denote P(X) the collection of all the subsets of a set X. A partition of X is a
subcollection 𝒜 ⊆ P(X) of non-empty subsets that have union X

∀A ∈ 𝒜 : A ≠ ∅   and   ⋃{A | A ∈ 𝒜} = X
73. If there is a single restriction, |J| = 1, then ΠJ (a) is a hyperplane at a; for each a
there are m hyperplanes. If there are m − 1 restrictions, |J| = m − 1, then the plane
is an axis at a; for each a there are m axes.
74. The plane Π_J(a) is a product of singletons, A_i = {a_i} for i ∈ J, and factors, A_i = X_i
for i ∉ J, so that

Π_J(a) = A_1 × ⋯ × A_m
75. The i-th axis of X at a, denoted X_i a, is equal to the product of singletons and the
i-th factor set

X_i a = {a_1} × ⋯ × {a_{i−1}} × X_i × {a_{i+1}} × ⋯ × {a_m}
76. If a and a0 are {i}-equal then the i-th axes of X at a and at a0 are equal
Xi a = Xi a0
77. The crossing of X at a, denoted Cr(X, a), is the union over i of the axes at a

Cr(X, a) = ⋃_{i=1}^{m} X_i a
78. Consider a set Y and let f ∈ Y^X. Define the i-th partial map of f at a as the function
f_{X_i a} : X_i → Y given by

f_{X_i a}(x_i) = f(a_1, …, a_{i−1}, x_i, a_{i+1}, …, a_m)
79. Let f, g ∈ Y^X, a ∈ X and i ∈ {1, …, m}. The i-th partial maps of f and g at a are equal
if and only if the restrictions of the maps to the i-th axis at a are equal. Moreover, if a
and a′ are {i}-equal then for all f ∈ Y^X the i-th partial map at a and the i-th partial
map at a′ are equal

f_{X_i a} = f_{X_i a′}
80. Let f, g ∈ Y^X and consider a fixed factor set X_{i_0}. If the i_0-th partial maps of f and g
are equal at every a ∈ X then the maps are equal.
81. Given x ∈ X let a = x. For all i the value of f at x is equal to the i-th-partial map
of f at a evaluated at xi
f (x) = fXi a (xi )
82. For f ∈ F U ×W and w ∈ W define the first partial map of f at w as fw : U → F
given by fw (x) = f (x, w). And for x ∈ U let the second partial map of f at x be
fx : W → F specified as fx (w) = f (x, w).
84. Conversely, a collection {fw : U → F }w∈W (resp. {fx : W → F }x∈U ) of maps indexed
by W (resp. by U ) having domain U (resp. W ) and codomain F determines a unique
map f : U × W → F specified by the relation f (x, w) = fw (x) (resp. relation
f (x, w) = fx (w)).
85. Consider Hilbert spaces E_1, …, E_m and their product E = E_1 × ⋯ × E_m having elements
that are m-tuples x = (x_1, …, x_m) ∈ E with inner product ⟨x, x′⟩ = ⟨x_1, x′_1⟩ + ⋯ +
⟨x_m, x′_m⟩. The i-th inclusion ι_i : E_i → E and the i-th projection pr_i : E → E_i are the
linear maps defined as

ι_i(x_i) = (0, …, 0, x_i, 0, …, 0)   (x_i in the i-th place),    pr_i(x_1, …, x_m) = x_i
86. As for linear maps in general, the derivatives at respectively a_i and a of the inclusions
and projections are themselves

Dι_i(a_i) = ι_i,    Dpr_i(a) = pr_i
89. Let f ∈ C 1 (U, F ). By definition the partial derivative of f at a with respect to the
factor space Ei , or i-th partial derivative of f at a, is the linear map with domain Ei
and codomain F , DEi f (a) ∈ L(Ei , F ), defined as the derivative at ai of the i-th partial
map of f at a
DEi f (a) = DfUi a (ai ) ∈ L(Ei , F )
90. Any map, in particular a linear map, is equal to its composition with the identity,
therefore ?? and ?? imply
92. Therefore ?? and ?? imply that the i-th partial map of the derivative is equal to the
i-th partial derivative of the map

Df(a) ∘ ι_i = D_{E_i}f(a)
93. There is a special functional form of parametric maps that often appears in practice.
Suppose that the parameter set and the input set are contained in products W ⊆
W1 × · · · × Wm , X ⊆ X1 × · · · × Xm ; in this situation wi is the i-th weight. The single
output parametric map f : W × X → Y has paired weights and inputs or is a paired
map if there are functions ξi : Wi × Xi → Xi , i = 1, . . . , m, and φ : X1 × · · · × Xm → Y
such that
f (w1 , . . . , wm , x1 , . . . , xm ) = φ(ξ1 (w1 , x1 ), . . . , ξm (wm , xm ))
Neither φ nor the ξ_i are uniquely determined by f. For example if W_i = X_i = Y =
R and f(w_1, …, w_m, x_1, …, x_m) = |w_1|^{α_1}|x_1|^{β_1} + ⋯ + |w_m|^{α_m}|x_m|^{β_m} (α_i, β_i positive
integers) then one can take ξ_i(w_i, x_i) = |w_i|^{α_i}|x_i|^{β_i} and φ(ξ_1, …, ξ_m) = ξ_1 + ⋯ + ξ_m; but
it is also possible to choose the functions ξ_i(w_i, x_i) = |w_i|^{qα_i}|x_i|^{qβ_i} and φ(ξ_1, …, ξ_m) =
ξ_1^{1/q} + ⋯ + ξ_m^{1/q}, with any q > 0, obtaining the same f. The graph of a paired map is
the one shown below.
Figure 2. Neural graph of a paired map.
There are m inputs and a single output. For each input labeled xi there is an ar-
row labeled wi with tip attached to the circle labeled f . Equivalently, a single 1 × m
row-matrix label [wi ]1×m can be used instead of several individual labels. Heuristi-
cally speaking, this matrix ‘acts on the left’ of the input ‘column matrix’ with entries
x1 , . . . , xm . Neither φ nor the ξi ’s appear explicitly in the diagram.
Single input parametric maps are trivially paired. Since any parametric map can be
considered to be single input (remark at end of section 2), any map can be considered
a paired map.
94. Let E, F be Hilbert spaces. Denote L(E, F) the space of continuous linear maps
having domain E and codomain F, that is, T : E → F. With the sup norm, ‖T‖ =
sup_{‖x‖≤1} ‖T(x)‖, L(E, F) is a Banach space.
95. The subset of continuous linear isomorphisms, if any, is open in L(E, F ) and will be
denoted Iso(E, F ).
96. In case E = F the isomorphisms form a group under composition, the general linear
group of E

GL(E) = Iso(E, E)

For T_1, T_2 ∈ L(E, F) and all x ∈ E, y ∈ F the relations ⟨(T_1 + T_2)(x), y⟩ =
⟨T_1(x), y⟩ + ⟨T_2(x), y⟩ = ⟨x, T_1*(y)⟩ + ⟨x, T_2*(y)⟩ = ⟨x, (T_1* + T_2*)(y)⟩ show that the
transpose of an addition is the addition of the transposes of the addends,
(T_1 + T_2)* = T_1* + T_2*.
99. Let E, F, G be Hilbert spaces, T ∈ L(E, F), S ∈ L(F, G) with transposes T* ∈ L(F, E),
S* ∈ L(G, F). The relations

⟨(S ∘ T)(x), z⟩ = ⟨T(x), S*(z)⟩ = ⟨x, (T* ∘ S*)(z)⟩

prove that the transpose of a composition is the composition, in reverse order, of the
transposes of the componends, (S ∘ T)* = T* ∘ S*.
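In finite dimensions this is the familiar matrix identity (ST)^T = T^T S^T; a quick numerical sketch (assuming NumPy, with arbitrary matrix sizes):

    import numpy as np

    rng = np.random.default_rng(0)
    T = rng.standard_normal((4, 3))   # T : R^3 -> R^4
    S = rng.standard_normal((2, 4))   # S : R^4 -> R^2
    print(np.allclose((S @ T).T, T.T @ S.T))   # (S∘T)* = T*∘S*

    x, y = rng.standard_normal(3), rng.standard_normal(4)
    print(np.isclose(np.dot(T @ x, y), np.dot(x, T.T @ y)))   # <T x, y> = <x, T* y>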
100. The transpose of an isomorphism is an isomorphism having inverse equal to the inverse
transposed

(T*)^{−1} = (T^{−1})*
101. Define the transposition as the map τE,F : L(E, F ) → L(F, E) given by τE,F (T ) = T ∗ .
103. Transposition is involutive, that is, (T*)* = T. Equivalently, τ_{F,E} ∘ τ_{E,F} = I_{L(E,F)},
so that the composition

L(E, F) ——τ_{E,F}——→ L(F, E) ——τ_{F,E}——→ L(E, F)

equals the identity I_{L(E,F)}.
104. The transposition map τE,F is an isometric linear isomorphism of Banach spaces.
106. The graph is a closed subspace of the product of the domain and codomain, Γ(T ) ⊆
E × F , while the cograph is a closed subspace of the product of the codomain and
domain, Γ∗ (T ) ⊆ F × E. Note the order of the factors.
107. The projections πE ∈ L(E × F, E) and πF ∈ L(E × F, F ) are the maps given by
πE (x, y) = x πF (x, y) = y
108. For any T ∈ L(E, F ) the projection πE restricts to an isomorphism πT from the graph
to the domain
πT = (πE |Γ(T )) ∈ Iso(Γ(T ), E)
having inverse JT ∈ Iso(E, Γ(T )) given by JT (x) = (x, T (x)).
109. Conversely, let Γ ⊆ E × F be a closed subspace such that the restriction (π_E|Γ) is a
linear isomorphism (π_E|Γ) : Γ → E. Then with

T = π_F ∘ (π_E|Γ)^{−1}

one has T ∈ L(E, F) and Γ = Γ(T).
111. The graph of a linear transformation and the cograph of its transpose are subspaces of
E × F each equal to the orthogonal of the other
Γ(T )⊥ = Γ∗ (T ∗ )
112. Given n Hilbert spaces E_1, …, E_n their Hilbert product is the Hilbert space equal to the
vector space product E = E_1 × ⋯ × E_n equipped with the inner product given by the
sum of the inner products of the components

⟨x, x′⟩ = ⟨x_1, x′_1⟩ + ⋯ + ⟨x_n, x′_n⟩
114. The i-th inclusion into the product is the linear injective map ι_i : E_i → E defined as
ι_i(x_i) = x̂_i. Here x̂_i ∈ E has i-th component equal to x_i and all other components
equal to zero

x̂_i = (0, …, 0, x_i, 0, …, 0)   (x_i in the i-th place)

The projections and inclusions are mutually transposed

π_i* = ι_i : E_i → E,    ι_i* = π_i : E → E_i
116. By definition the i-th axis of E, denoted Ê_i, is the image of the i-th inclusion

Ê_i = ι_i(E_i)

The product is the direct sum of its axes

E_1 × ⋯ × E_m = Ê_1 ⊕ ⋯ ⊕ Ê_m
118. The identity of the product equals the sum of the compositions of the inclusions and
projections
IE = ι1 ◦ π1 + · · · + ιm ◦ πm
119. The identity of the product equals the product of the compositions of the projections
and inclusions
IE = π1 ◦ ι1 × · · · × πm ◦ ιm = (π1 ◦ ι1 , . . . , πm ◦ ιm )
120. Let T : E → F = F_1 × ⋯ × F_m be a linear map with codomain a product and write
T(x) = y = (y_1, …, y_m). The i-th component of T is

T_i = π_i ∘ T    or equivalently    T_i(x) = y_i
121. A linear map with codomain a product is the product of its components

T = T_1 × ⋯ × T_m = (T_1, …, T_m)

123. A linear map with domain a product, T : E = E_1 × ⋯ × E_m → F, is equal to the sum
of its partials T_i = T ∘ ι_i evaluated on the projections

T = T_1 ∘ π_1 + ⋯ + T_m ∘ π_m

For a smooth unit f ∈ C^1(U × W, F) with U ⊆ E, W ⊆ G open, the derivative at
(a, c) ∈ U × W is a continuous linear map

Df(a, c) : E × G → F
125. In the present Hilbert space setting the derivative has a transpose

D*f(a, c) : F → E × G
126. The transpose is the unique linear map in L(F, E × G) that for all ∆y ∈ F and all
(∆x, ∆w) ∈ E × G satisfies

⟨Df(a, c) · (∆x, ∆w), ∆y⟩ = ⟨(∆x, ∆w), D*f(a, c) · ∆y⟩
128. Assume that the outputs are real numbers, F = R, and let f ∈ C 1 (U × W, R) be a
smooth unit. Then the derivative at (a, w) ∈ U × W is a linear functional
Df (a, w) : E × G → R
129. By the Riesz representation theorem for Hilbert spaces [?] there exists a unique
vector in E × G, called gradient of f at (a, w) and denoted ∇f(a, w), such that for
all (∆x, ∆w) ∈ E × G

Df(a, w) · (∆x, ∆w) = ⟨∇f(a, w), (∆x, ∆w)⟩
130. The increments in the product space are pairs (∆x, ∆w) of increments of the factors.
The derivative Df(a, w_0) and the partials D_E f(a, w_0), D_G f(a, w_0) by definition
satisfy the relation

Df(a, w_0) · (∆x, ∆w) = D_E f(a, w_0) · ∆x + D_G f(a, w_0) · ∆w
132. The quadratic error of the sequence for the pair A, B is the sum of the quadratic errors
of each pair

q_{A,B}(w) = Σ_{i=1}^{n} q_{a_i, b_i}(w)
133. Evaluation of the function f (on elements of its domain=data vectors), that is, calcu-
lation of the value f (x0 , w0 ) for a given (x0 , w0 ), is called a forward pass or forward
propagation, conventionally understood as going left to right, domain to codomain,
U × W to G, as in
U × W ——f——→ G
134. Evaluation of the derivative Df(x_0, w_0) and partial derivatives D_E f(x_0, w_0), D_F f(x_0, w_0)
on respective vectors (∆x, ∆w) ∈ E × F, ∆x ∈ E, ∆w ∈ F is similarly a forward pass
or forward propagation

E × F ——Df(x_0, w_0)——→ G
E ——D_E f(x_0, w_0)——→ G
F ——D_F f(x_0, w_0)——→ G
135. The transposes of the derivatives are linear transformations with domain G and respective
codomains E × F, E and F; they have the opposite direction and are hence considered
a backward pass or backpropagation, depicted as

E × F ←——D*f(x_0, w_0)—— G
E ←——D*_E f(x_0, w_0)—— G
F ←——D*_F f(x_0, w_0)—— G
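A numerical sketch of the two passes (assuming NumPy and finite differences; the unit f, with inputs in R^2 and weights in R^3, is an arbitrary illustration): increments are pushed forward through the partial derivatives, and an output vector is pulled back through their transposes; the two passes are adjoint to each other:

    import numpy as np

    def f(x, w):  # an arbitrary unit with inputs x in R^2 and weights w in R^3
        return np.array([w[0] * x[0] + w[1] * x[1], np.tanh(w[2] * x[0])])

    def jacobian(h, z, eps=1e-6):
        y0 = h(z)
        J = np.zeros((y0.size, z.size))
        for j in range(z.size):
            dz = np.zeros(z.size)
            dz[j] = eps
            J[:, j] = (h(z + dz) - y0) / eps
        return J

    x0, w0 = np.array([0.5, -1.0]), np.array([0.2, 0.3, 1.5])
    Jx = jacobian(lambda x: f(x, w0), x0)   # partial derivative in the input
    Jw = jacobian(lambda w: f(x0, w), w0)   # partial derivative in the weights

    dx, dw = np.array([0.1, 0.2]), np.array([0.0, -0.1, 0.05])
    dy = np.array([1.0, -2.0])              # a vector in the output space
    forward = Jx @ dx + Jw @ dw             # forward pass of the increment pair
    backward = (Jx.T @ dy, Jw.T @ dy)       # backward pass of dy
    # adjointness: <Df·(dx, dw), dy> = <(dx, dw), D*f·dy>
    print(np.isclose(np.dot(forward, dy),
                     np.dot(dx, backward[0]) + np.dot(dw, backward[1])))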
§21 Errors
138. The error function for the neural network f with data vector x0 ∈ U , weight domain
W and desired output d0 ∈ G is the map e = ef,x0 ,d0 : W → G equal to the difference
between the output and the desired value
e(w) = f (x0 , w) − d0
139. The quadratic error function q = qf,x0 ,d0 : W → R is the square of the norm of the
error
q(w) = ⟨f(x_0, w) − d_0, f(x_0, w) − d_0⟩
     = ‖f(x_0, w) − d_0‖^2
     = ‖f(x_0, w)‖^2 − 2⟨f(x_0, w), d_0⟩ + ‖d_0‖^2
141. The quadratic error function qC = qf,C,d : W → R for the neural network f over the
data vector set C ⊆ U , with weight domain W and desired output function d is the sum
over the index j of the quadratic errors of f for data vectors xj with desired outputs
d(xj )
q_C(w) = Σ_{j=0}^{k} q_{f,x_j,d(x_j)}(w)
       = Σ_{j=0}^{k} ‖f(x_j, w) − d(x_j)‖^2
142. Therefore the derivative and the gradient of the quadratic error over C are respectively
the sum of the derivatives and the sum of the gradients of the quadratic errors for the
pairs xj , d(xj )
Dq_{f,C,d} = Σ_{j=0}^{k} Dq_{f,x_j,d(x_j)}
∇q_{f,C,d} = Σ_{j=0}^{k} ∇q_{f,x_j,d(x_j)}
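A numerical sketch of these sums (assuming NumPy; the unit, the data vectors and the desired outputs are arbitrary choices): the gradient over the data set, obtained by adding the per-example gradients 2 D*_w f(x_j, w)(f(x_j, w) − d(x_j)), agrees with a finite-difference gradient of q_C:

    import numpy as np

    def f(x, w):                       # a simple unit: weighted sum followed by tanh
        return np.tanh(w @ x)

    def grad_single(x, w, d):          # 2 D*_w f(x, w)·(f(x, w) - d) for one example
        s = w @ x
        return 2.0 * (np.tanh(s) - d) * (1.0 - np.tanh(s) ** 2) * x

    C = [np.array([1.0, 2.0]), np.array([-0.5, 0.3]), np.array([2.0, -1.0])]
    d = [0.5, -0.2, 0.9]               # desired outputs d(x_j)
    w = np.array([0.1, -0.4])

    grad_C = sum(grad_single(x, w, t) for x, t in zip(C, d))

    qC = lambda w: sum((f(x, w) - t) ** 2 for x, t in zip(C, d))
    eps = 1e-6
    fd = np.array([(qC(w + eps * e) - qC(w)) / eps for e in np.eye(2)])
    print(np.allclose(grad_C, fd, atol=1e-4))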
143. A single layer architecture is a triple a = (E, F, G) of Hilbert spaces. The following
terminology applies:
1.- The space of input vectors is E
2.- The space of weights is F
3.- The space of output vectors is G
§23 Multilayer architectures
A multilayer architecture is a finite sequence of single layer architectures

a = {a^(1), …, a^(p)} = {(E^(1), F^(1), G^(1)), …, (E^(p), F^(p), G^(p))}
146. Consider multilayer architectures a and b. Their concatenation is the multilayer
architecture a ∗ b obtained by listing the layers of a followed by the layers of b.

147. A multilayer architecture is composable if G^(k) = E^(k+1), that is, if each layer has target
equal to the source of the next.
148. If a and b are composable multilayer architectures and if the final layer of the first has
target equal to the initial layer of the second, t(a) = s(b), then their concatenation
a ∗ b is composable.
149. Consider a single layer architecture a = (E, F, G). A multiple input for a is a
factorization of E as a finite product of Hilbert spaces

E = E_1 × ⋯ × E_m
150. Consider Hilbert spaces E^(1), …, E^(p+1), F^(1), …, F^(p) with open sets U^(k) ⊆ E^(k),
W^(k) ⊆ F^(k) and a sequence consisting of p smooth neural networks

f^(k) : U^(k) × W^(k) → E^(k+1),   k = 1, …, p

with corresponding sequence of architectures a(f^(1)), …, a(f^(p)). These networks can be
displayed in a sequence of neural graphs
[Neural graphs of the sequence: x^(1) and w^(1) feed f^(1) producing x^(2); x^(2) and w^(2)
feed f^(2) producing x^(3); …; x^(p) and w^(p) feed f^(p) producing x^(p+1).]
151. The above sequence is composable if the image of each network is contained in the
domain of the next, f (k) (U (k) × W (k) ) ⊆ U (k+1) . Note that, with architecture specified
as E (k) , F (k) , E (k+1) , the f (k) are networks of simple type. But different architecture
choices may apply.
[Neural graph of the chained networks: the input x together with the weights
w^(1), …, w^(p) produces the output y; equivalently, a single unit f^(p) ∘̂ ⋯ ∘̂ f^(1) with
weight w = (w^(1), …, w^(p)).]
154. A multilayer neural network is a composition f^(p) ∘̂ ⋯ ∘̂ f^(1) with architecture equal
to the sequence E^(1), F^(1), …, E^(p), F^(p), E^(p+1). Note that with the alternative
architecture given by the triple E = E^(1), F = F^(1) × ⋯ × F^(p), G = E^(p+1) the
composition is of simple type.
156. When the neural network is multilayer, that is, when it is a composition f = f^(p) ∘̂ ⋯ ∘̂ f^(1),
the gradient of its quadratic error is computed layer by layer, as described below.
In the context of neural networks the function whose minimum is sought is a quadratic
error function q. To approach minimums of q its gradient must be computed, a task
realized applying formula (??) and properties of derivatives to the structure (architec-
ture) of the neural network.
For the required formulas the terminology and notations of [?] will be assumed.
Gradient descent is an iteration method much used in Numerical Calculus/Analysis
to approach minimums of smooth real valued functions of vector variables. Equation
(??) for ∇q(w_0), the essence of backpropagation calculations, says: To obtain the
gradient, first calculate the error and then send it from G back to F via the transposed
derivative.
For neural networks the gradient method is an iteration of the type

w_{n+1} = w_n − ε_n ∇q(w_n)

where in favorable cases and after a reasonable number of iterations w_n is close enough
to a zero, or to some local minimum of q. But very often adequate convergence fails
and the procedure must be repeated. The attainment of either a local or a global
minimum depends on conditions beyond the actual iteration formula.
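A minimal sketch of the iteration (assuming NumPy; the one-unit network, the input and the desired output are arbitrary choices, and the learning parameter ε_n is kept constant):

    import numpy as np

    x0 = np.array([1.0, -2.0, 0.5])               # fixed input
    d0 = np.array([0.3])                          # desired output
    f = lambda x, w: np.array([np.tanh(w @ x)])   # a one-unit network

    def grad_q(w):                                # 2 D*_w f(x0, w)·(f(x0, w) - d0)
        s = w @ x0
        delta = np.tanh(s) - d0[0]                # objective error
        return 2.0 * delta * (1.0 - np.tanh(s) ** 2) * x0

    w = np.zeros(3)
    eps_n = 0.1                                   # learning parameter
    for n in range(200):
        w = w - eps_n * grad_q(w)                 # w_{n+1} = w_n - eps_n grad q(w_n)

    print(float(np.sum((f(x0, w) - d0) ** 2)))    # quadratic error after the iterations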
Nevertheless, and since neural networks can be seen as small brain models, backprop-
agation is sometimes interpreted as “learning from mistakes by lowering the error”, a
behavior flavored metaphor. The number ε_n is called the “learning parameter” (or learning
constant when it is fixed) and in numerical practice is the subject of various recipes.
Given an input x0 ∈ U if the first component is fixed at x = x0 the neural network
provides a function of the weight w ∈ W , that is, provides the map fx0 : W → G
defined as fx0 (w) = f (x0 , w). If additionally a desired output d0 ∈ G is given then the
error vector and the quadratic error are also functions of the weight w ∈ W ; these are
the maps
e = ex0 = ef,x0 ,d0 : W → G
q = qx0 = qf,x0 ,d0 : W → R
defined as
e(w) = f (x0 , w) − d0
q(w) = kf (x0 , w) − d0 k2
The collection of all the continuous linear maps from E to F is a normed space denoted
L(E, F ); the norm is
‖S‖_{L(E,F)} = sup{‖S(x)‖_F | ‖x‖_E ≤ 1}
§29 Transposes
For each continuous linear map between Hilbert spaces, T ∈ L(E, F ), there exists a
unique continuous linear map T ∗ ∈ L(F, E) such that for all x ∈ E and y ∈ F we have
⟨T(x), y⟩_F = ⟨x, T*(y)⟩_E
The map T ∗ is the transpose of T . The transposition operation, T → T ∗ , is:
1.- Linear
2.- Involutive
3.- Isometric
4.- Linear forms transpose to line tracing
5.- Orthogonal projections transpose into complementary projections
6.- Transposes of independent sums are products of the transposes
§30 Gradients
§31 Transposed derivatives
§32 Neural networks
Many neural networks are simply functions that depend on parameters. Neural networks
can be defined over any category with products. There are neural networks over the
categories of sets, of finite sets, of vector spaces over a field, of R^n's and matrices, of
Euclidean spaces, etc.; see [?].
The present paper calculates the gradient ∇Q of the quadratic error function Q for
differentiable neural networks having input space and parameter space equal to open
sets in Hilbert spaces.
Quadratic error functions are used but the procedure seems adaptable to Lq errors on
Lp spaces, (1/p) + (1/q) = 1, and more generally to reflexive Banach spaces. Hence
the results are of interest for general non-linear regression.
The formulas obtained are global. The inputs and processing units of each layer are
subject not to individual coordinate treatment but to typical vector expressions that
simultaneously treat all components. This gives results in principle suited for parallel
vector processing.
Gradients of (the quadratic error function of) processing units are expressed in terms
of the output error and the transposed derivative of the unit with respect to w;
gradients of layers are products of gradients of processing units; and the gradient of the
network equals the product of the gradients of the layers. Backpropagation provides
the desired outputs or targets for the layers preceding the last one. In particular, the
explicit global formulas for semilinear networks are obtained. According to [5] percep-
tron neural networks that perform a given pattern recognition task can be constructed
directly from the data provided; training is not needed. But in practical applica-
tions training, and backpropagation in particular, could still be useful to fine tune
the weights. Hence, despite the direct methods, this paper may also be of practical
interest.
The inner product ⟨(x, y), (x′, y′)⟩ = ⟨x, x′⟩_E + ⟨y, y′⟩_F makes the product E × F into
a Hilbert space such that the maps x ↦ (x, 0) and y ↦ (0, y) are isometric linear
transformations with respective images E × {0} and {0} × F.
The space of vector inputs is E and the domain of inputs is a given non-empty open
set U .
For a ∈ E the function fa : E → R given by fa (x) = ha, xi is a (continuous) linear
form. Conversely, Riesz representation theorem says that for any continuous linear
form f : E → R there is a unique a ∈ E such that f (x) = fa (x).
Let E, F be Hilbert spaces, T : E → F a continuous linear map. The transpose of T,
denoted T*, is the linear map T* : F → E defined by the expression

⟨T(x), y⟩_F = ⟨x, T*(y)⟩_E,    x ∈ E, y ∈ F
The relation

⟨(x, −T(x)), (T*(y), y)⟩_{E×F} = ⟨x, T*(y)⟩_E − ⟨T(x), y⟩_F = 0

proves that the graph of T* is equal, except for twisting, to the orthogonal in E × F
of the graph of −T

Γ(T*) = π(Γ(−T)^⊥)

where π : E × F → F × E, π(x, y) = (y, x), is the twisting map.
§34 Derivatives
Let U ⊆ E be open and let f : U → F be differentiable at x ∈ U. Recall that the
derivative of f at x is a continuous linear map Df(x) : E → F such that

lim_{h→0} ‖f(x + h) − f(x) − Df(x) · h‖ / ‖h‖ = 0
Here the value of Df(x) at h ∈ E is denoted Df(x) · h. When the codomain is R the
gradient of f at x is defined as the unique vector ∇f(x) such that Df(x) · h = ⟨∇f(x), h⟩
for all h ∈ E. With the alternative notation h = ∆x the value is

Df(x) · ∆x = ⟨∇f(x), ∆x⟩
Let M be another Hilbert space. The open subsets W ⊆ M will play the role of
parameters or weights. The parameter domain is W. A parametric map from U to F
with parameters in W is a C^1 map

f : W × E → F
§35 Errors
Consider a finite set S in the input space S ⊆ U and a function d : S → F to be called
desired value function. The desired value or target of the input vector x ∈ S is the
output vector d(x) ∈ F . The desired value function is fixed along the discussion and
may be omitted from notation but is always present.
The error of f at input x ∈ S for the parameter value w ∈ W and with respect to the
desired value function d is the output vector e(x) = e_[f,w,d](x) ∈ F defined as

e(x) = f(w, x) − d(x)
The quadratic error of f at input x ∈ S, q(x) = q_[f,w,d](x), for the parameter value
w ∈ W and with respect to the desired value function d is the non-negative number
in [0, ∞) ⊆ R equal to the square of the length of the error

q(x) = ⟨e(x), e(x)⟩ = ‖e(x)‖^2
The quadratic error for the collection of inputs S, to be denoted Q[f,w,d] (S), Q(S) or
simply Q, is the non-negative real number equal to the sum over x ∈ S of the quadratic
errors at the elements of S
Q = Σ_{x∈S} q(x) = Σ_{x∈S} ⟨e(x), e(x)⟩
§36 Derivative of the error
If f is differentiable at (w̄, x̄) ∈ M × E with partial derivative D_w f(w̄, x̄) ∈ L(M, F)
then the quadratic error at the single input x̄ is differentiable at w̄ and its derivative
is the linear form given by

DQ(w̄) · dw = ⟨2δ̄, D_w f(w̄, x̄) · dw⟩ = ⟨2 D*_w f(w̄, x̄) · δ̄, dw⟩,   equivalently   ∇Q(w̄) = 2 D*_w f(w̄, x̄) · δ̄

where δ̄ = δ(w̄). This formula is basic and the rest of the paper concerns its application
to calculate the gradients of quadratic errors.
Assume W is open in G and that f : W × E → F is a differentiable map; see [1].
Recall that fw (x) = f (w, x). Figure 1 (a) illustrates f and Figure 1 (b) its partials.
D*_w f(w̄, x̄) · δ̄ = δ̄ ( (∂φ/∂ξ_1)(ξ̄_1, …, ξ̄_m) (∂ξ_1/∂w_1)(w̄_1, x̄_1), …,
                        (∂φ/∂ξ_m)(ξ̄_1, …, ξ̄_m) (∂ξ_m/∂w_m)(w̄_m, x̄_m) )                (7)

D_x f(w̄, x̄) · dx = Σ_{i=1}^{m} (∂φ/∂ξ_i)(ξ̄_1, …, ξ̄_m) (∂ξ_i/∂x_i)(w̄_i, x̄_i) dx_i

D*_x f(w̄, x̄) · δ̄ = δ̄ ( (∂φ/∂ξ_1)(ξ̄_1, …, ξ̄_m) (∂ξ_1/∂x_1)(w̄_1, x̄_1), …,
                        (∂φ/∂ξ_m)(ξ̄_1, …, ξ̄_m) (∂ξ_m/∂x_m)(w̄_m, x̄_m) )                (8)
§40 Vector valued parametric maps
Consider now f : W × Rm → Rn so that f = (f1 , . . . , fn ) with fj : W × Rm → R, and
let ȳ = (ȳ1 , . . . , ȳn ), δ̄ = (δ̄1 , . . . , δ̄n ) = f (w̄, x̄) − ȳ = (f1 (w̄, x̄) − ȳ1 , . . . , fn (w̄, x̄) − ȳn ).
In this case D_w f(w̄, x̄) · dw = (D_w f_1(w̄, x̄) · dw, …, D_w f_n(w̄, x̄) · dw) and therefore
D*_w f(w̄, x̄) · δ̄ = D*_w f_1(w̄, x̄) · δ̄_1 + ⋯ + D*_w f_n(w̄, x̄) · δ̄_n. On the other hand the
gradients of the quadratic errors of the component functions f_j with targets ȳ_j are
∇Q(f_j)(w̄) = 2D*_w f_j(w̄, x̄) · δ̄_j. Formula 5 applies and gives

∇Q(f)(w̄) = 2D*_w f(w̄, x̄) · δ̄ = ∇Q(f_1)(w̄) + ⋯ + ∇Q(f_n)(w̄)

In words, the gradient of the quadratic error of a vector valued parametric map equals
the sum of the gradients of corresponding quadratic errors of the components. For the
partial with respect to x and its transpose the formulas are

D_x f(w̄, x̄) · dx = (D_x f_1(w̄, x̄) · dx, …, D_x f_n(w̄, x̄) · dx)

D*_x f(w̄, x̄) · δ̄ = D*_x f_1(w̄, x̄) · δ̄_1 + ⋯ + D*_x f_n(w̄, x̄) · δ̄_n                    (12)
§41 Products
The integers m, M_1, …, M_q, n_1, …, n_q are positive, M = M_1 + ⋯ + M_q, n = n_1 + ⋯ +
n_q. Let W_j be open in R^{M_j}, W = W_1 × ⋯ × W_q ⊆ R^{M_1} × ⋯ × R^{M_q} = R^M. Consider
differentiable maps f_j : W_j × R^m → R^{n_j} and their parametric product f = f_1 ×̂ ⋯ ×̂ f_q :
W × R^m → R^n defined by the formula f(w_1, …, w_q, x) = (f_1(w_1, x), …, f_q(w_q, x)). For
given input x̄ ∈ R^m and target ȳ = (ȳ_1, …, ȳ_q) ∈ R^n let δ(w) = (δ_1(w), …, δ_q(w)) =
f(w, x̄) − ȳ = (f_1(w_1, x̄) − ȳ_1, …, f_q(w_q, x̄) − ȳ_q). The quadratic error of f satisfies
Q(f)(w) = Σ_{j=1}^{q} ⟨f_j(w_j, x̄) − ȳ_j, f_j(w_j, x̄) − ȳ_j⟩ = Σ_{j=1}^{q} Q(f_j)(w_j). Taking a fixed w̄ =
(w̄_1, …, w̄_q) ∈ W and letting δ̄ = (δ̄_1, …, δ̄_q) = δ(w̄) = (f_1(w̄_1, x̄) − ȳ_1, …, f_q(w̄_q, x̄) −
ȳ_q), derivatives can be taken componentwise and it is obvious that
Dw f (w̄, x̄) · dw = (Dw1 f1 (w̄1 , x̄) · dw1 , . . . , Dwq fq (w̄q , x̄) · dwq )
Therefore D*_w f(w̄, x̄) · δ̄ = (D*_{w_1} f_1(w̄_1, x̄) · δ̄_1, …, D*_{w_q} f_q(w̄_q, x̄) · δ̄_q), so
∇Q(w̄) = (2D*_{w_1} f_1(w̄_1, x̄) · δ̄_1, …, 2D*_{w_q} f_q(w̄_q, x̄) · δ̄_q) and this is the same as

∇Q(f)(w̄) = (∇Q(f_1)(w̄_1), …, ∇Q(f_q)(w̄_q))
Thus, for a parametric product the quadratic error has gradient equal to the product
of the gradients of the quadratic errors of the factors. Similarly, Dx f (w̄, x̄) · dx =
(Dx f1 (w̄1 , x̄) · dx, . . . , Dx fq (w̄q , x̄) · dx) from where
Dx∗ f (w̄, x̄) · δ̄ = Dx∗ f1 (w̄1 , x̄) · δ̄1 + · · · + Dx∗ fq (w̄q , x̄) · δ̄q (14)
§42 Semilinear products
An additional subindex has to be added to the notation of section 6. In the case of
products of semilinear real valued parametric factors f_j(w_{j0}, w_{j1}, …, w_{jm}, x_1, …, x_m) =
ψ_j(w_{j0} + w_{j1} x_1 + ⋯ + w_{jm} x_m) assume for simplicity that a unique threshold function
ψ = ψ_j, j = 1, …, q, is involved. Let w_j = (w_{j0}, w_{j1}, …, w_{jm}), w = (w_1, …, w_q),
dw_j = (dw_{j0}, dw_{j1}, …, dw_{jm}) and dw = (dw_1, …, dw_q). The parametric product of
the q semilinear factors is f = f_1 ×̂ ⋯ ×̂ f_q : W × R^m → R^q.
All matrices in this term by term product are q × (m + 1) matrices. For the partial
with respect to x equations 10 and 14 give

D_x f(w̄, x̄) · dx = (ψ′(s̄_1) ⟨w̄_1, (0, dx)⟩, …, ψ′(s̄_q) ⟨w̄_q, (0, dx)⟩)

and for the transpose

D*_x f(w̄, x̄) · δ̄ = ( ⟨(ψ′(s̄_1) w̄_{11}, …, ψ′(s̄_q) w̄_{q1}), δ̄⟩, …,
                      ⟨(ψ′(s̄_1) w̄_{1m}, …, ψ′(s̄_q) w̄_{qm}), δ̄⟩ )                        (17)
An equivalent matricial expression for 17 is

                     [ w̄_{11}  ⋯  w̄_{q1} ] [ ψ′(s̄_1)  ⋯  0        ] [ δ̄_1 ]
D*_x f(w̄, x̄) · δ̄ = [    ⋮    ⋱     ⋮   ] [    ⋮     ⋱     ⋮     ] [  ⋮  ]              (18)
                     [ w̄_{1m}  ⋯  w̄_{qm} ] [ 0        ⋯  ψ′(s̄_q) ] [ δ̄_q ]
here the second matrix is q × q diagonal.
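A numerical sketch of these formulas (assuming NumPy, with ψ = tanh as the threshold function and arbitrary sizes q and m): the transposed partial with respect to x is the matrix product of 18, and it satisfies the defining relation ⟨D_x f · dx, δ̄⟩ = ⟨dx, D*_x f · δ̄⟩:

    import numpy as np

    psi = np.tanh
    dpsi = lambda s: 1.0 - np.tanh(s) ** 2         # psi'

    q_units, m = 3, 4
    rng = np.random.default_rng(1)
    W = rng.standard_normal((q_units, m + 1))      # rows w_j = (w_j0, w_j1, ..., w_jm)
    x = rng.standard_normal(m)
    delta = rng.standard_normal(q_units)           # an error vector in R^q

    s = W @ np.concatenate(([1.0], x))             # s_j = <w_j, (1, x)>
    y = psi(s)                                     # output of the semilinear product

    # D*_w f(w, x)·delta: row j equals psi'(s_j) * delta_j * (1, x)
    Dw_star = (dpsi(s) * delta)[:, None] * np.concatenate(([1.0], x))[None, :]
    # D*_x f(w, x)·delta: weights without thresholds, transposed, times diag(psi'(s)), times delta
    Dx_star = W[:, 1:].T @ (dpsi(s) * delta)

    eps = 1e-6
    dx = rng.standard_normal(m)
    Dxf_dx = (psi(W @ np.concatenate(([1.0], x + eps * dx))) - y) / eps   # D_x f·dx
    print(np.isclose(np.dot(Dxf_dx, delta), np.dot(dx, Dx_star), atol=1e-4))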
§43 Compositions
Consider open sets W k ⊆ RMk , k = 1, . . . , p, W = W 1 × · · · × W p , and differ-
entiable parametric maps f k : W k × Rnk → Rnk+1 with parametric composition
f = f p b◦ · · · b◦ f 1 : W × Rn1 → Rnp+1 . Recall that for a given first input x = x1 =
(x11 , . . . , x1n1 ) ∈ Rn1 this is recursively defined by the expression f (w1 , . . . wp , x) =
f p (wp , f p−1 b◦ · · · b◦ f 1 (w1 , . . . , wp−1 , x)); see [1] and Figure 3 below.
In more compact notation let L_{w^j} = D_{w^j} f^j(w̄^j, x̄^j) and L_{x^j} = D_{x^j} f^j(w̄^j, x̄^j); the
transposes L*_{w^j} = D*_{w^j} f^j(w̄^j, x̄^j) and L*_{x^j} = D*_{x^j} f^j(w̄^j, x̄^j) will be used later. Then

D_{w^j} f(w̄, x̄) = L_{x^p} ∘ ⋯ ∘ L_{x^{j+1}} ∘ L_{w^j},   j = 1, …, p
The gradient of the quadratic error Q = Q(f) has p components, ∇Q(w) = (∇^{(1)}Q(w), …, ∇^{(p)}Q(w)) ∈
R^M; these can be calculated taking transposes in the previous formulas. For this, let
ȳ = ȳ^{p+1} ∈ R^{n_{p+1}} be a desired final output or final target and define the final error as
δ(w) = f(w, x̄) − ȳ = x̄^{p+1} − ȳ^{p+1}. The quadratic error Q = Q(f)(w) = ⟨δ(w), δ(w)⟩ is
then a function of w and if δ̄ = δ̄^{p+1} = δ(w̄) = f(w̄, x̄) − ȳ,
∇^{(p)}Q(w̄) = 2 D*_{w^p} f^p(w̄^p, x̄^p) · δ̄^{p+1}
∇^{(p−1)}Q(w̄) = 2 D*_{w^{p−1}} f^{p−1}(w̄^{p−1}, x̄^{p−1}) ∘ D*_{x^p} f^p(w̄^p, x̄^p) · δ̄^{p+1}
∇^{(p−2)}Q(w̄) = 2 D*_{w^{p−2}} f^{p−2}(w̄^{p−2}, x̄^{p−2}) ∘ D*_{x^{p−1}} f^{p−1}(w̄^{p−1}, x̄^{p−1}) ∘ D*_{x^p} f^p(w̄^p, x̄^p) · δ̄^{p+1}
⋮
∇^{(2)}Q(w̄) = 2 D*_{w^2} f^2(w̄^2, x̄^2) ∘ D*_{x^3} f^3(w̄^3, x̄^3) ∘ ⋯ ∘ D*_{x^p} f^p(w̄^p, x̄^p) · δ̄^{p+1}
∇^{(1)}Q(w̄) = 2 D*_{w^1} f^1(w̄^1, x̄^1) ∘ D*_{x^2} f^2(w̄^2, x̄^2) ∘ ⋯ ∘ D*_{x^p} f^p(w̄^p, x̄^p) · δ̄^{p+1}
Or, with compact notation,

∇^{(p)}Q(w̄) = 2 L*_{w^p} · δ̄^{p+1}
∇^{(p−1)}Q(w̄) = 2 L*_{w^{p−1}} ∘ L*_{x^p} · δ̄^{p+1}
⋮
∇^{(2)}Q(w̄) = 2 L*_{w^2} ∘ L*_{x^3} ∘ ⋯ ∘ L*_{x^p} · δ̄^{p+1}
∇^{(1)}Q(w̄) = 2 L*_{w^1} ∘ L*_{x^2} ∘ ⋯ ∘ L*_{x^{p−1}} ∘ L*_{x^p} · δ̄^{p+1}
Call the already defined final error the (p+1)-th error: δ̄^{p+1} = δ̄ = f(w̄, x̄) − ȳ ∈ R^{n_{p+1}},
and define the k-th error as δ̄^k = D*_{x^k} f^k(w̄^k, x̄^k) · δ̄^{k+1} = L*_{x^k} · δ̄^{k+1} ∈ R^{n_k}, k =
p, p−1, …, 2, 1. One says that δ̄^k is obtained backpropagating the error δ̄^{k+1} by means of
D*_{x^k} f^k(w̄^k, x̄^k). Define also the desired (k+1)-th output ȳ^{k+1} ∈ R^{n_{k+1}} as ȳ^{k+1} = x̄^{k+1} − δ̄^{k+1}.
The previous expressions for the components of the gradient can now be reformulated
saying that the components of ∇Q(w̄) are the liftings to R^{M_k} via D*_{w^k} f^k(w̄^k, x̄^k) = L*_{w^k}
of the backpropagated errors
∇^{(p)}Q(w̄) = 2 L*_{w^p} · δ̄^{p+1}
∇^{(p−1)}Q(w̄) = 2 L*_{w^{p−1}} · δ̄^{p}
⋮                                                                              (19)
∇^{(2)}Q(w̄) = 2 L*_{w^2} · δ̄^{3}
∇^{(1)}Q(w̄) = 2 L*_{w^1} · δ̄^{2}
Let x̄^1 ∈ R^{n_1} be an input, ȳ^{p+1} ∈ R^{n_{p+1}} a desired output and let Q(w) be the quadratic
error. To minimize Q the gradient algorithm can be used; see section 1. The compo-
nents ∇k (f )(w̄) of ∇(f )(w̄) can be calculated in terms of the gradients ∇(f k )(w̄k ) of
the quadratic error functions Q(f k )(wk ) using backpropagation to specify the corre-
sponding targets; see previous section. If the layers f k are parametric products then
the gradients ∇(f k )(w̄k ) can be expressed in terms of the gradients of the quadratic
errors of the factors (processing units); see section 8 and below.
In general the network has to be trained for several data: a finite set X̄ ⊆ R^{n_1} of
inputs and targets Ȳ ⊆ R^{n_{p+1}}. For each input x̄ let its corresponding target be
ȳ = g(x̄), and define Q^{x̄}(f)(w) = ⟨f(w, x̄) − g(x̄), f(w, x̄) − g(x̄)⟩. The task is
now to minimize the total quadratic error Q^{X̄}(f)(w) = Σ_{x̄∈X̄} Q^{x̄}(f)(w). But since
∇Q^{X̄}(f)(w) = Σ_{x̄∈X̄} ∇Q^{x̄}(f)(w) the calculation of ∇Q^{X̄}(f)(w) reduces to the single
input case.
If the network f^p ∘̂ ⋯ ∘̂ f^1 is semilinear then w = (w^1, …, w^p), w^k = (w^k_{j_{k+1} j_k}) ∈
R^{n_{k+1} × (n_k + 1)}; here the indices take the values 1 ≤ j_{k+1} ≤ n_{k+1}, 0 ≤ j_k ≤ n_k. The layer
f^k : R^{n_{k+1} × (n_k + 1)} × R^{n_k} → R^{n_{k+1}} is a parametric product of n_{k+1} semilinear real valued
units, that is, f^k = f^k_1 ×̂ ⋯ ×̂ f^k_{n_{k+1}} with f^k_{j_{k+1}}(w^k_{j_{k+1} 0}, w^k_{j_{k+1} 1}, …, w^k_{j_{k+1} n_k}, x^k_1, …, x^k_{n_k}) =
ψ(w^k_{j_{k+1} 0} + w^k_{j_{k+1} 1} x^k_1 + ⋯ + w^k_{j_{k+1} n_k} x^k_{n_k}). For a given single input x̄^1 = (x̄^1_1, …, x̄^1_{n_1}) ∈ R^{n_1},
target ȳ^{p+1} ∈ R^{n_{p+1}} and parameter w̄ = (w̄^1, …, w̄^p) let δ̄^{p+1} = (δ̄^{p+1}_1, …, δ̄^{p+1}_{n_{p+1}}) be the
final error. Formula 18 gives the partial with respect to x of a semilinear product and
it follows that the backpropagated errors are
        [ δ̄^k_1     ]   [ w̄^k_{11}     ⋯  w̄^k_{n_{k+1} 1}   ] [ ψ′(s̄^k_1)  ⋯  0                 ] [ δ̄^{k+1}_1          ]
δ̄^k =  [    ⋮      ] = [     ⋮         ⋱        ⋮          ] [     ⋮      ⋱        ⋮           ] [        ⋮            ]
        [ δ̄^k_{n_k} ]   [ w̄^k_{1 n_k}  ⋯  w̄^k_{n_{k+1} n_k} ] [ 0          ⋯  ψ′(s̄^k_{n_{k+1}}) ] [ δ̄^{k+1}_{n_{k+1}}  ]
where k = p, p−1, …, 2, and, for j_{k+1} = 1, …, n_{k+1}, s̄^k_{j_{k+1}} = ⟨w̄^k_{j_{k+1}}, (1, x̄^k)⟩ = w̄^k_{j_{k+1} 0} +
w̄^k_{j_{k+1} 1} x̄^k_1 + ⋯ + w̄^k_{j_{k+1} n_k} x̄^k_{n_k}. Formula 19 implies that the gradient has components
∇Q(f^k)(w̄^k), and Formula 16 gives the partial with respect to w of a semilinear product,
implying that for semilinear networks the components of the gradient of the quadratic
error are
    [ ψ′(s̄^k_1)          ⋯  ψ′(s̄^k_1)          ]   [ 1  x̄^k_1  ⋯  x̄^k_{n_k} ]   [ δ̄^{k+1}_1          ⋯  δ̄^{k+1}_1          ]
2   [       ⋮            ⋱         ⋮           ] # [ ⋮    ⋮    ⋱      ⋮      ] # [        ⋮           ⋱          ⋮           ]
    [ ψ′(s̄^k_{n_{k+1}})  ⋯  ψ′(s̄^k_{n_{k+1}})  ]   [ 1  x̄^k_1  ⋯  x̄^k_{n_k} ]   [ δ̄^{k+1}_{n_{k+1}}  ⋯  δ̄^{k+1}_{n_{k+1}}  ]

where # denotes the term by term product and each of the three matrices has n_{k+1} rows
and n_k + 1 columns. In the case of several input-output pairs the gradients are added up
as previously explained.
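A numerical sketch of the whole recursion (assuming NumPy, ψ = tanh and illustrative layer sizes): forward pass through the semilinear layers, backpropagation of the final error, gradient blocks with entries 2 ψ′(s̄^k_j) δ̄^{k+1}_j (1, x̄^k)_i, and a finite-difference check of one entry:

    import numpy as np

    psi = np.tanh
    dpsi = lambda s: 1.0 - np.tanh(s) ** 2
    sizes = [3, 4, 2]                               # n_1, n_2, n_3 (p = 2 layers)
    rng = np.random.default_rng(2)
    Ws = [rng.standard_normal((sizes[k + 1], sizes[k] + 1)) for k in range(len(sizes) - 1)]

    x1 = rng.standard_normal(sizes[0])              # single input
    y = rng.standard_normal(sizes[-1])              # desired final output

    def forward(Ws, x):
        xs, ss = [x], []
        for W in Ws:
            s = W @ np.concatenate(([1.0], xs[-1])) # s^k_j = <w^k_j, (1, x^k)>
            ss.append(s)
            xs.append(psi(s))
        return xs, ss

    xs, ss = forward(Ws, x1)
    delta = xs[-1] - y                              # final error
    grads = [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        # gradient block for layer k: entries 2 * psi'(s^k_j) * delta_j * (1, x^k)_i
        grads[k] = 2.0 * np.outer(dpsi(ss[k]) * delta, np.concatenate(([1.0], xs[k])))
        # backpropagate: weights without thresholds, transposed, times diag(psi'(s^k)) times delta
        delta = Ws[k][:, 1:].T @ (dpsi(ss[k]) * delta)

    Q = lambda Ws: float(np.sum((forward(Ws, x1)[0][-1] - y) ** 2))
    eps = 1e-6
    W_pert = [W.copy() for W in Ws]
    W_pert[0][1, 2] += eps
    print(np.isclose(grads[0][1, 2], (Q(W_pert) - Q(Ws)) / eps, atol=1e-4))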
[Figure 1. (a) The parametric map f(w, x) with weight w ∈ W ⊆ G, inputs in E and
outputs in F. (b) Its partial derivatives D_w f(w̄, x̄) and D_x f(w̄, x̄).]
[Figure: the weight w̄ ∈ W ⊆ G and the gradient ∇Q(w̄).]
[Figure: the composed network with weight spaces W^1, …, W^p and layers over
R^{n_1}, R^{n_2}, …, R^{n_{p+1}}; the forward arrows carry the partial derivatives L_{w^k}, L_{x^k}
and the backward arrows their transposes L*_{w^k}, L*_{x^k}.]
[Figure: backpropagation of the final error δ̄^{p+1} through L*_{x^p}, …, L*_{x^2} produces the
errors δ̄^p, …, δ̄^2, and the gradient components ∇^{(1)}Q(w̄), …, ∇^{(p)}Q(w̄) are obtained
applying 2L*_{w^1}, …, 2L*_{w^p} to them at the weights w̄^1, …, w̄^p.]